Unicode Normalization in URLs

In some contexts, normalizing a string means upper- or lower-casing it. In Unicode, "normalization" means something quite different. The Unicode standard defines four normalization forms, which irreversibly transform a given character or sequence of characters according to either a simple mapping rule or a more complex algorithmic rule. Since browser interoperability depends on each browser processing a URL the same way as the next, I thought testing some of the more popular browsers might be a good idea.

Why should you care?

If you're a Web developer using Unicode anywhere in your URLs, then you're probably concerned when those URLs get handled differently in various Web browsers. If you're a penetration tester, you probably like to find quirky ways that URLs get transformed.

Test Setup

To test Unicode normalization I used some of the character sequences from Unicode Standard Annex #15 (UAX #15), "Unicode Normalization Forms", and others from RFC 3987. From UAX #15 I looked at a singleton from Figure 3, U+212B, which normalizes to U+00C5 Å under NFC and to U+0041 U+030A Å under NFD. I also looked at multiple combining marks from Figure 5, U+10EB U+0323 ძ̣, and the sequence U+1E9B U+0323 ẛ̣ from Figure 6, Compatibility Composites. These few sequences are enough to distinguish all four normalization forms: they show NFC being applied in Safari and Chrome (in different ways), and they rule out NFD, NFKC, and NFKD.
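To make the four forms concrete, here is a minimal sketch using Python's standard unicodedata module. It is not part of the browser tests; the helper names code_points and normalization_table are made up for this illustration. It prints the code points each form produces for the sequences above, which is how a single input can tell the forms apart.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_points(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

def normalization_table(s):
    """Map each normalization form to the code points it produces for s."""
    return {form: code_points(unicodedata.normalize(form, s)) for form in FORMS}

# Sequences taken from the test setup described above.
samples = {
    "singleton": "\u212b",
    "multiple combining marks": "\u10eb\u0323",
    "compatibility composite": "\u1e9b\u0323",
}

for label, sample in samples.items():
    print(f"{label}: {code_points(sample)}")
    for form, result in normalization_table(sample).items():
        print(f"  {form:4} -> {result}")
```

The compatibility composite U+1E9B U+0323 is the most useful of the three, because all four forms give a different result for it.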

Test Results

I was hoping to find some security bugs, but only found interoperability bugs. That doesn't mean security bugs don't exist here. As if URLs weren't tricky enough with plain old ASCII, handling Unicode characters makes them even more open to interpretation. For example, an Internationalized Resource Identifier (IRI) with a path, query, and fragment containing U+212B means code point U+212B to IE, Firefox, and Opera, but it means U+00C5 to Chrome (in the fragment only), and U+00C5 percent-encoded #%C3%85 to Safari (in the path, query, and fragment).

These types of character transformations make for ripe targets in security testing, but only when the resulting character has some practical use, such as bypassing an XSS or SQL injection filter. When a certain input X transforms to become Y, an attacker has more opportunity to slip a malicious link or XSS payload past an unsuspecting defensive filter. In testing how Web browsers normalize Unicode across a URL/IRI's components, I made the following observations.

  1. Safari applies NFC normalization to the path, query, and fragment.
  2. Chrome applies NFC normalization to the fragment only.
  3. MSIE, Firefox, and Opera do not apply normalization anywhere.
  4. MSIE violates RFC 3986 by sending raw, unescaped UTF-8 bytes in the query during an HTTP request.
  5. Chrome, Safari, Firefox, and Opera all send percent-encoded UTF-8 in the path and query during an HTTP request.
  6. Safari percent-encodes the fragment.
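
The differences in observations 1 through 3 can be sketched roughly in Python. This is only a model of the observed behavior, not how any browser is implemented: the function encode_iri, its nfc_components parameter, and the example.com URL are all assumptions made for illustration. For simplicity it also percent-encodes the fragment in every case, which per observation 6 matches only Safari.

```python
import unicodedata
from urllib.parse import quote, urlsplit, urlunsplit

def encode_iri(iri, nfc_components=()):
    """Percent-encode an IRI as UTF-8, applying NFC only to the named components.

    nfc_components is a hypothetical knob: any of "path", "query", "fragment".
    """
    parts = urlsplit(iri)

    def enc(name, value, safe):
        if name in nfc_components:
            value = unicodedata.normalize("NFC", value)
        # quote() encodes non-ASCII characters as percent-encoded UTF-8 bytes.
        return quote(value, safe=safe)

    path = enc("path", parts.path, "/")
    query = enc("query", parts.query, "=&/?")
    fragment = enc("fragment", parts.fragment, "/?")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

# Hypothetical test IRI with U+212B in the path, query, and fragment.
IRI = "http://example.com/\u212b?q=\u212b#\u212b"

print("no normalization:", encode_iri(IRI))
print("fragment only:   ", encode_iri(IRI, ("fragment",)))
print("all components:  ", encode_iri(IRI, ("path", "query", "fragment")))
```

With no normalization, U+212B goes out as %E2%84%AB. Wherever NFC is applied it becomes U+00C5 first and goes out as %C3%85, so a server or filter sees two different byte sequences for what the user entered as the same link.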

Firefox and Opera seem to be the only two that agree in all tests. Chrome is a little odd with the fragment, and Safari is the odd one out across the entire URL. IE is the only browser that sends raw UTF-8 encoded bytes out on the wire (in the query component only), but I think that RFC 3986 allows for that anyway.

My conclusions were based on reviewing the following:

  1. The DOM property values for the anchor element in each individual test case.
  2. The raw HTTP GET request as sniffed off the wire using WinPcap, triggered by a test case using the img element.

The spreadsheet below includes a table of the results observed from the test cases, and can also be opened in a separate window.

Good stuff.

But what makes you think that having non-ASCII characters in the query part (IE) is allowed per RFC 3986?

I'm glad you asked - because after re-reading I think my interpretation of this clause was incorrect:

"However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters."

So then, does "those characters" refer only to slash ("/") and question mark ("?")?

> So then, does "those characters" refer only to slash ("/") and question mark ("?")?

Yes, that seems to be the case.

Yeah, here's the BNF (collated from RFCs 3986 and 2234):

query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
DIGIT = %x30-39 ; 0-9
pct-encoded = "%" HEXDIG HEXDIG
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
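
As a quick way to check a query string against the grammar above, here is a small Python sketch. The names QUERY_RE and is_valid_query are made up for this example, and the regex is a hand translation of the BNF rather than anything taken from the RFC. Since ABNF string literals are case-insensitive, lowercase hex digits are accepted too.

```python
import re

# query = *( pchar / "/" / "?" ), flattened into one character class
# plus the pct-encoded alternative.
QUERY_RE = re.compile(
    r"(?:"
    r"[A-Za-z0-9\-._~"      # unreserved
    r"!$&'()*+,;="          # sub-delims
    r":@/?]"                # remaining pchar characters, plus "/" and "?"
    r"|%[0-9A-Fa-f]{2}"     # pct-encoded
    r")*"
)

def is_valid_query(query):
    """Return True if query matches the RFC 3986 query production."""
    return QUERY_RE.fullmatch(query) is not None

print(is_valid_query("q=%C3%85"))   # percent-encoded UTF-8
print(is_valid_query("q=\u00c5"))   # raw non-ASCII character
```

The grammar has no production that admits a raw non-ASCII character, so the second check fails while the percent-encoded form passes.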

Looks like MSIE's querystring handling is a spec violation after all. I was worried there for a second.

Maybe update the post/spreadsheet to reflect this?

Also, what does the red/green color-coding indicate in Table 1? (It doesn't correlate with No/Yes, for example.)

Thanks for the analysis, whit537. I've modified the test results section to call out that IE violates RFC 3986. I imagine this has been well known for many years, but it may become more important to understand over time as we move to a more internationalized Web.

The red/green is just a loose way of calling out areas that I felt might be problematic or in the minority.
