Testing ASCII-unsafe encodings in Web browsers

Note: To jump straight to the test page, click here: http://lookout.net/test/charsets/ascii-unsafe/

[UPDATE: Feedback from Anne van Kesteren pointed out that all browsers do in fact support HZ-GB-2312, even though the test results showed IE and Firefox did not. The direct URL for that particular encoding test is http://lookout.net/test/charsets/ascii-unsafe/charset.php?alias=HZ-GB-2312. Looking closer, it seems the ICU transcoding added a two-byte preamble to the string: 0x7E 0x7D, or '~}'. I'm not very familiar with HZ-GB-2312, but a quick look at RFC 1843 tells me that this two-byte sequence switches the context from GB mode to ASCII mode. So it seems that Firefox and IE do not recognize this mode-switching byte sequence, or at least not in this context.]
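
To get a feel for those mode switches, here's a tiny illustration using Python's built-in 'hz' codec (I'm assuming its output matches what ICU produces; the escape sequences themselves come straight from RFC 1843):

    # RFC 1843 escapes: '~{' switches to GB mode, '~}' switches back to ASCII.
    print('abc'.encode('hz'))            # b'abc' - pure ASCII needs no escapes
    print('\u4f60\u597d'.encode('hz'))   # two Han characters -> b'~{Dc:C~}',
                                         # GB bytes bracketed by mode switches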

Web browsers support a variety of character set encodings which could be broadly categorized as either ASCII-safe [1] or ASCII-unsafe [2].  The goal of this test was to identify which ASCII-unsafe character encodings were supported by each Web browser.

String encodings play an important role in testing Web applications for security vulnerabilities.  If I can control some input's encoding, then I can manipulate it in ways that might confuse a parsing process or bypass a defensive filter.  To use a common example - imagine you input a string somewhere that includes the U+003E GREATER-THAN SIGN '>' in a meager attempt at cross-site scripting.  An XSS filter consumes the input as UTF-8 (which is ASCII-safe) and immediately recognizes the 0x3E byte as something naughty, at which point it throws back an error message.  Then you realize that a query string parameter (e.g. &charset=utf-8) controls the page's output encoding, so you change the charset parameter's value to 'cp037' and encode the input string accordingly.  In the cp037 encoding, the '>' character is represented by the byte 0x6E, which in ASCII maps to the letter 'n' - two completely different characters.  The input slips by the filter, which assumes it was encoded as UTF-8, and makes its way on to its destination.  The source of the confusion is that the two encodings, cp037 and UTF-8, are not compatible: cp037 is ASCII-unsafe while UTF-8 is ASCII-safe.
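
To make the mismatch concrete, here's a minimal sketch using Python's built-in cp037 codec (the test strings in this article were built with ICU, not Python, so this is just an illustration of the same mapping):

    gt = '>'.encode('cp037')    # EBCDIC cp037 encodes '>' as byte 0x6E
    print(gt)                   # b'n'
    print(gt.decode('ascii'))   # 'n' - what an ASCII-assuming filter sees
    print(gt.decode('cp037'))   # '>' - what a cp037-aware consumer sees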

How the testing was set up

The test page attempts to identify which ASCII-unsafe charsets a Web browser supports by loading a string encoded in each charset and testing whether the browser decoded it as expected.  The page uses XMLHttpRequest to fetch each string from the server, which returns the string in an HTTP response whose Content-Type header carries the charset label for the test case.  The test page then decodes the string according to that charset label, and tests it for equivalence with the following static control string.

 $%'()*+,-./<>:;=
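
In spirit, the check the page performs boils down to the following (sketched here in Python rather than the page's JavaScript, with Python's decode() standing in for the browser's decoder; the charset.php URL and alias parameter are the test page's own, the rest is illustrative):

    import urllib.request

    CONTROL = " $%'()*+,-./<>:;="
    TEST_URL = "http://lookout.net/test/charsets/ascii-unsafe/charset.php?alias="

    def charset_supported(alias):
        # Fetch the server-transcoded string; the response's Content-Type
        # header carries the charset label under test.
        with urllib.request.urlopen(TEST_URL + alias) as resp:
            raw = resp.read()
            label = resp.headers.get_content_charset(failobj=alias)
        # Decode per the advertised label and compare against the control.
        try:
            return raw.decode(label) == CONTROL
        except (LookupError, UnicodeDecodeError):
            return False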

There are some potential pitfalls to this approach.  The most obvious is that the browser may not officially support the given charset encoding under test; instead it may be applying some intelligence (e.g. sniffing) to the string to try and figure out what its encoding could be.  For example, many of the ASCII-unsafe encodings share similar ranges of characters, where the '>' may actually be represented by byte 0x6E in all of them.  So if you were to test using only a single character, you might end up with false positives if the browser was sniffing and decided that the encoding was 'cp237' instead of 'cp037' - although both are variants of EBCDIC, there are some differences between them.  So the test ended up using a string of many characters, which still doesn't totally solve the challenge, but it works okay and produces decent results.
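
To see why a single character can't discriminate between related encodings, compare two EBCDIC variants that Python happens to ship (cp037 and cp500 stand in here, since Python has no cp237 table, but the point is the same):

    # 0x6E decodes to '>' in both variants...
    print(b'\x6e'.decode('cp037'), b'\x6e'.decode('cp500'))   # > >
    # ...while other bytes differ, so a longer control string can tell them apart.
    print(b'\x5a'.decode('cp037'), b'\x5a'.decode('cp500'))   # ! ]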

Because the testing uses the ICU project to build the test strings, it's limited to the character set tables that ICU includes.  That's quite a lot, mind you, but some other interesting variants and oddities might not be included.


Transcoding the test string

The test string shown above uses 17 characters with familiar names - this string gets transcoded into 417 different character set encodings (the recurring 17 is just coincidence, I think).  Because most of the 417 labels are just aliases for a superset, they can be further grouped into a much smaller set of around 17 (just kidding) encodings.

The ICU project's Converter API was used to perform the transcoding.  ICU also provided all of the charset aliases/labels used for testing.  The code for transcoding is available on GitHub for the curious.
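
The actual converter code is written against ICU, but its shape is roughly the following (a Python analogue that uses Python's codec registry in place of ICU's alias tables; all names here are mine):

    import encodings.aliases

    CONTROL = " $%'()*+,-./<>:;="
    ASCII_BYTES = CONTROL.encode('ascii')

    # Group every alias by the bytes it produces for the control string,
    # mirroring how the 417 labels collapse into a much smaller set of
    # distinct encodings.
    groups = {}
    for alias in sorted(encodings.aliases.aliases):
        try:
            encoded = CONTROL.encode(alias)
        except (LookupError, UnicodeEncodeError):
            continue  # no table for this alias, or string not representable
        groups.setdefault(encoded, []).append(alias)

    for encoded, names in groups.items():
        tag = 'ascii-safe' if encoded == ASCII_BYTES else 'ASCII-UNSAFE'
        print(tag, names[:4], encoded[:16])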

Test Results

The following table, which can also be opened in a new window, lists all of the ASCII-unsafe charsets supported in each Web browser tested.

Notes

[1] ascii-safe: An ASCII-compatible character encoding is a single-byte or variable-length encoding in which the bytes 0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A, ignoring bytes that are the second and later bytes of multibyte sequences, all correspond to single-byte sequences that map to the same Unicode characters as those bytes in ANSI_X3.4-1968 (US-ASCII). [RFC1345]
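
That byte list translates into a straightforward check, at least for the single-byte case (the "ignoring bytes that are the second and later bytes of multibyte sequences" caveat would need extra handling that this sketch skips):

    # Bytes that must map to their US-ASCII meaning for an encoding
    # to count as ASCII-compatible under the definition above.
    REQUIRED = (bytes([0x09, 0x0A, 0x0C, 0x0D])
                + bytes(range(0x20, 0x23)) + b"&'"
                + bytes(range(0x2C, 0x40))
                + bytes(range(0x41, 0x5B))
                + bytes(range(0x61, 0x7B)))

    def is_ascii_safe(codec):
        try:
            return all(bytes([b]).decode(codec) == chr(b) for b in REQUIRED)
        except (LookupError, UnicodeDecodeError):
            return False

    print(is_ascii_safe('utf-8'))   # True
    print(is_ascii_safe('cp037'))   # False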

[2] ascii-unsafe: An encoding in which the bytes listed above do not all map to the same characters as in US-ASCII, i.e. one that is not ASCII-compatible.