Advisory: Browser BOM'ing for XSS

Damage: Filter evasion, cross-site scripting
Exploit: Insert Unicode byte order mark (BOM) U+FEFF into HTML elements, attributes, or javascript statements to bypass filters and execute XSS.
Root Cause: character absorption/swallowing
Product Version: Safari, iPhone 2.0

No, this doesn't work in Chrome; I assume they took the latest WebKit drop before releasing. This issue was found years ago in similar products such as Firefox. To quote that CVE:
Mozilla Firefox and Thunderbird before 1.5.0.4 strips the Unicode Byte-order-Mark (BOM) from a UTF-8 page before the page is passed to the parser, which allows remote attackers to conduct cross-site scripting (XSS) attacks via a BOM sequence in the middle of a dangerous tag such as SCRIPT.

In July 2008 Apple released an advisory (http://support.apple.com/kb/HT2351) for a similar issue, one with the way Unicode BOMs were swallowed. The Unicode standard would call this "Character Deletion". This behavior could lead to all sorts of nastiness, such as enabling cross-site scripting and bypassing or evading HTML filters and WAFs. To get to the point, here's what's possible by injecting the Unicode BOM U+FEFF into HTML markup and javascript:

<a h[U+FEFF]ref="javas[U+FEFF]cript[U+FEFF](ale[U+FEFF]rt('onclick')">

Apple released the details of this issue with their advisory:
Safari ignores Unicode byte order mark sequences when parsing web pages. Certain websites and web content filters attempt to sanitize input by blocking specific HTML tags. This approach to filtering may be bypassed and lead to cross-site scripting when encountering maliciously-crafted HTML tags containing byte order mark sequences.
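
To make the bypass concrete, here is a minimal Python sketch of the mismatch the advisory describes. It is only an illustration: the blacklist, the function names, and the one-line "browser" model are assumptions of mine, not Safari's code or any real filter's.

```python
BOM = "\ufeff"  # U+FEFF, written as [U+FEFF] in the examples in this post

# Hypothetical blacklist; real filters and WAFs are more elaborate.
BLOCKED = ("<script", "javascript:")


def naive_filter_allows(markup: str) -> bool:
    """Return True when no blocked substring appears (case-insensitive)."""
    lowered = markup.lower()
    return not any(token in lowered for token in BLOCKED)


def simulate_bom_swallowing(markup: str) -> str:
    """Model a parser front end that silently drops every U+FEFF."""
    return markup.replace(BOM, "")


payload = f"<sc{BOM}ript>alert(1)</sc{BOM}ript>"

# The filter inspects the raw string and finds nothing it recognizes.
print("filter allows payload:", naive_filter_allows(payload))

# The affected browser sees the string with the BOMs removed.
seen_by_browser = simulate_bom_swallowing(payload)
print("browser sees a script tag:", "<script" in seen_by_browser)
```

The filter and the browser disagree about what the string is, and that disagreement is the whole bug.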

The Unicode byte-order-mark (BOM) consists of the character code U+FEFF and is normally used at the start of a file to indicate to the parser the encoding form and byte order.

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
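
As a rough illustration of how a parser might use these signatures at the start of a file, here is a small Python sketch. The function name `detect_bom` and the check order are my own choices; the byte values come from the standard library's `codecs` constants and match the list above.

```python
import codecs

# Check the four-byte signatures first: FF FE 00 00 (UTF-32, little-endian)
# begins with FF FE, which is also the UTF-16, little-endian signature.
BOM_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data: bytes):
    """Return the encoding form named by a leading BOM, or None if absent."""
    for signature, name in BOM_SIGNATURES:
        if data.startswith(signature):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbf<html>"))  # leading UTF-8 signature
print(detect_bom(b"<html>"))              # no signature at all
```

Note that this only looks at the first bytes of the input. It says nothing about a U+FEFF that appears later in the file.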

When the BOM sequence occurs in the middle of a file, we might expect it to change the meaning of the string.  In other words, we wouldn't expect the following to be valid HTML:


<sc[U+FEFF]ript>
va[U+FEFF]r x = "x";
document.wr[U+FEFF]ite('ouch');
</script>
<a h[U+FEFF]ref="htt[U+FEFF]p://lookout.net">


So yes, it seems the above does become valid HTML. It looks like the BOM character is stripped before the HTML is parsed. The expected behavior would be an error condition or a replacement character. But when you look at the Unicode standard, it does give the choice of ignoring or erroring. Regarding U+FEFF handling when found in the middle of markup files, it says:
When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of the file can be ignored, or treated as an error.

The part that says 'can be ignored' is likely what's happening here.  It seems like some Unicode processing is removing the U+FEFF before passing the content to the HTML and javascript parsers.
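
On the filtering side, the two options from the standard (ignore, or treat as an error) could be applied before any blacklist or sanitizer runs, so the filter inspects the same string the parser will eventually see. The following Python sketch is one way to do that. `normalize_bom`, `MisplacedBOMError`, and the `strict` flag are hypothetical names of mine, not part of any library.

```python
BOM = "\ufeff"  # U+FEFF


class MisplacedBOMError(ValueError):
    """Raised when U+FEFF appears anywhere other than the start of the input."""


def normalize_bom(text: str, strict: bool = True) -> str:
    """Drop a leading BOM, then reject (strict) or delete any other U+FEFF."""
    if text.startswith(BOM):
        text = text[1:]  # a leading BOM is legitimate; just discard it
    if BOM in text:
        if strict:
            # "treated as an error"
            raise MisplacedBOMError("U+FEFF found outside the start of the input")
        # "can be ignored": delete it here, BEFORE filtering, not after
        text = text.replace(BOM, "")
    return text


print(normalize_bom("\ufeff<p>hello</p>"))             # leading BOM removed
print(normalize_bom("<sc\ufeffript>", strict=False))   # filter now sees <script>
```

Either mode closes the gap, because the filter and the browser end up looking at the same characters. Strict mode seems the safer default to me, since a BOM in the middle of a tag has no legitimate use.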

Here's a link to the test case.
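
A comparable test page can also be produced locally. The Python sketch below takes the markup shown earlier, swaps each [U+FEFF] placeholder for the real character, and encodes the result as UTF-8. `build_test_page` and the output filename in the note are placeholders of mine, and the result is not necessarily identical to the linked test case.

```python
BOM = "\ufeff"  # U+FEFF

# The markup from the example above, with [U+FEFF] as a visible placeholder.
TEMPLATE = """<sc[U+FEFF]ript>
va[U+FEFF]r x = "x";
document.wr[U+FEFF]ite('ouch');
</script>
<a h[U+FEFF]ref="htt[U+FEFF]p://lookout.net">
"""


def build_test_page(template: str = TEMPLATE) -> bytes:
    """Replace each placeholder with a real U+FEFF and encode as UTF-8."""
    return template.replace("[U+FEFF]", BOM).encode("utf-8")


page = build_test_page()
# Each embedded BOM becomes the three bytes EF BB BF in the UTF-8 output.
print(page.count(b"\xef\xbb\xbf"), "BOM sequences embedded")
```

Writing `page` to a file in binary mode (for example `open("bom_test.html", "wb")`) and loading it in the browser under test shows whether the script runs.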