Testing charset encoding support in Web Browsers

13 Feb 2012

Note: To jump straight to the test page, go to http://www.lookout.net/test/charsets/ascii-unsafe/

Web browsers support a variety of character encodings, mostly for legacy reasons and backwards compatibility. After all, UTF-8 and a handful of other encodings can now represent all of the characters that once required a wide assortment of character encodings. Google's February 2012 report shows that UTF-8 is dominating the Web, with 60% of Web documents using it, and that number is rising as legacy character encodings decline in use.

Those of us who test Web application security are often concerned with character encodings in our attempts to manipulate string input in ways that would eventually lead to mayhem. For that reason it's good to know a bit not just about which encodings the server-side components support, but also which ones the Web browser supports. I've documented the results of testing character set support in Web browsers in the table below, along with a brief summary.

Test Results

The following table lists the charset encodings supported by each Web browser tested, on Windows 7 and Ubuntu 11.10 where possible. Testing covered only the names in IANA's official list of character sets that may be used on the Internet.

Default Fallback Encoding

Most browsers use UTF-8 as the default fallback encoding. However, Safari, as well as Chrome on Ubuntu, fell back to ISO-8859-1 when tested with an unrecognized charset label such as "freshies".
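To illustrate why the fallback choice matters, here is a minimal Python sketch. It is not part of the original test suite, and the helper name decode_with_fallback is made up for illustration. It decodes the same two bytes the way a browser would under each of the two fallback encodings seen in testing.

```python
# Sketch only: decode_with_fallback is a hypothetical helper, not browser code.
# It shows how identical bytes render differently depending on which
# fallback encoding is applied when the charset label is unrecognized.

def decode_with_fallback(data: bytes, fallback: str) -> str:
    """Decode the raw bytes using the given fallback encoding."""
    return data.decode(fallback)

# "é" (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
raw = "\u00e9".encode("utf-8")

as_utf8 = decode_with_fallback(raw, "utf-8")
as_latin1 = decode_with_fallback(raw, "iso-8859-1")

print(repr(as_utf8))    # 'é'  (one character)
print(repr(as_latin1))  # 'Ã©' (two characters, one per byte)
```

The same input yields one character under UTF-8 and two under ISO-8859-1, which is the kind of discrepancy that matters when string input is filtered under one encoding and rendered under another.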

Supported charset labels

The results also show all supported character set labels per browser, in a comma-separated form of named_charset,interpreted_charset, where the named_charset was the test case and the interpreted_charset was what the Web browser's contentDocument.charset property returned. Take iso-ir-144,ISO-8859-5 as an example: the test returned a document with the charset parameter of the HTTP Content-Type header set to iso-ir-144. Then the contentDocument.charset property was checked and found to be ISO-8859-5. Since the two are aliases for one another, the test was considered a pass, meaning the charset label was supported by the browser.
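The server side of such a test could look roughly like the following Python sketch. This is not the code behind the test page; the query parameter name charset, the page body, the port, and all function names are assumptions made for illustration. It returns a small HTML document whose Content-Type header carries whatever charset label the request asks for.

```python
# Sketch only: serves a page whose Content-Type charset is taken from the URL.
# The "charset" query parameter, BODY, and the port are illustrative assumptions.
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

BODY = b"<!DOCTYPE html><title>charset test</title><p>test</p>"

# Accept only characters that appear in charset names, so a label can never
# smuggle CR/LF into the response headers.
SAFE_LABEL = re.compile(r"[A-Za-z0-9._:+()\-]+")

def label_from_path(path: str, default: str = "utf-8") -> str:
    """Extract the requested charset label from the request path."""
    query = parse_qs(urlparse(path).query)
    label = query.get("charset", [default])[0]
    return label if SAFE_LABEL.fullmatch(label) else default

def content_type_for(label: str) -> str:
    """Build the Content-Type header value for a given charset label."""
    return f"text/html; charset={label}"

class CharsetTestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        label = label_from_path(self.path)
        self.send_response(200)
        self.send_header("Content-Type", content_type_for(label))
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

def serve(port: int = 8000) -> None:
    """Run the test server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), CharsetTestHandler).serve_forever()
```

Calling serve() and loading a URL such as /?charset=iso-ir-144 in an iframe would then let a script on the parent page read the iframe's contentDocument.charset and record the named_charset,interpreted_charset pair.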

Charset labels that fall back to a non-equivalent IANA alias

If contentDocument.charset returned a value that was not an equivalent charset alias for the test case (according to IANA's list), then the test case was deemed a failure. Often, however, the interpreted_charset was in fact an equivalent, or superset, encoding, even though IANA did not list it as such. In some barely interesting cases a vendor-specific charset label could be found this way, such as unicodeFEFF, which seems to be used only by Internet Explorer.
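The pass/fail rule can be sketched in Python as follows. The alias table here is a two-entry excerpt typed in by hand from the IANA registry, not the full list, and the function names are hypothetical. A real run would load the complete registry, so labels missing from this excerpt simply classify as failures.

```python
# Sketch only: classify a "named_charset,interpreted_charset" result line.
# IANA_ALIASES is a small hand-copied excerpt of the IANA registry; any label
# not in this excerpt is treated as unknown and therefore fails.
from typing import Optional

IANA_ALIASES = {
    "ISO-8859-1": {"ISO_8859-1:1987", "iso-ir-100", "ISO_8859-1", "latin1",
                   "l1", "IBM819", "CP819", "csISOLatin1"},
    "ISO-8859-5": {"ISO_8859-5:1988", "iso-ir-144", "ISO_8859-5", "cyrillic",
                   "csISOLatinCyrillic"},
}

def canonical_name(label: str) -> Optional[str]:
    """Map a label to its preferred name, comparing case-insensitively."""
    wanted = label.strip().lower()
    for preferred, aliases in IANA_ALIASES.items():
        if wanted == preferred.lower() or wanted in {a.lower() for a in aliases}:
            return preferred
    return None

def classify(result_line: str) -> str:
    """Return "pass" if both labels name the same charset, else "fail"."""
    named, interpreted = (part.strip() for part in result_line.split(",", 1))
    named_canon = canonical_name(named)
    if named_canon is not None and named_canon == canonical_name(interpreted):
        return "pass"
    return "fail"

print(classify("iso-ir-144,ISO-8859-5"))  # pass
print(classify("iso-ir-144,ISO-8859-1"))  # fail
```

A strict alias comparison like this reports a failure whenever the browser returns an equivalent or superset encoding that the registry does not list as an alias, so failed rows are worth reviewing by hand.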