Detecting ill-formed UTF-8 byte sequences in HTML content

One issue I’ve come across, pretty infrequently, is ill-formed UTF-8 byte sequences in HTML content. As far as I can tell, nobody’s ever really tried to find this type of bug. Huh, so what’s up?

UTF-8 is a variable-width encoding in which the lead byte tells the decoder how many bytes follow. Those following, or trailing, bytes must fall within specific ranges (generally 80 to BF, with tighter limits after certain lead bytes) for the sequence to be well-formed.
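To make the lead-byte and trailing-byte rules concrete, here is a minimal sketch of a checker in Python. The function name find_ill_formed_utf8 is made up for illustration, and the ranges are the well-formed byte sequence ranges from the Unicode Standard as I understand them; treat it as a sketch, not a reference implementation.

```python
from typing import List


def find_ill_formed_utf8(data: bytes) -> List[int]:
    """Return the offsets where an ill-formed UTF-8 sequence starts.

    Hypothetical helper for illustration only.
    """
    bad = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:  # ASCII, a complete one-byte sequence
            i += 1
            continue
        # For each lead byte: number of trailing bytes, plus the range
        # allowed for the FIRST trailing byte (later ones are 80..BF).
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # excludes overlong forms
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # nothing above U+10FFFF
        else:
            bad.append(i)  # stray trailing byte or invalid lead byte
            i += 1
            continue
        j = i + 1
        ok = True
        for _ in range(need):
            if j >= n or not (lo <= data[j] <= hi):
                ok = False
                break
            j += 1
            lo, hi = 0x80, 0xBF
        if not ok:
            bad.append(i)
        # On failure j points at the offending byte, so it gets
        # re-examined as a possible start of a new sequence.
        i = j
    return bad
```

Calling find_ill_formed_utf8 on the raw bytes of a response body returns an empty list when everything is well-formed, and a list of byte offsets otherwise.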

So why would HTML content get returned with ill-formed byte sequences? Good question. I doubt most platforms/frameworks would do this, so my guess when I see this type of bug is that the application is doing some special handling of user input. USER INPUT? That’s right: in most cases where I’ve seen this, I was able to control the input in some way to cause the condition.

So is it exploitable? Depends. In the cases I’ve seen, I couldn’t find an immediately apparent way to exploit it, but it felt like more time spent would turn up something. Take the example below.

Say I send a URL parameter that includes the Unicode character U+2739 TWELVE POINTED BLACK STAR ✹ as input. The valid UTF-8 byte sequence for this would be:

[E2, 9C, B9]

But the byte sequence I see in the HTML content becomes:

[E2, 3F]

This is one weird example: the two trailing bytes were somehow converted to a single question mark (3F), leaving the lead byte E2 with no valid trailing bytes after it.
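The example can be reproduced with nothing but Python's built-in UTF-8 codec. This is a small sketch: the helper name first_decode_error is made up, and the observed bytes are simply the two bytes quoted above, not a capture from a real application.

```python
star = "\u2739"                      # TWELVE POINTED BLACK STAR
expected = star.encode("utf-8")      # the well-formed encoding
observed = bytes([0xE2, 0x3F])       # what came back in the HTML


def first_decode_error(data: bytes):
    """Return (offset, reason) of the first strict decode error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None


print(expected.hex(" "))             # e2 9c b9
print(first_decode_error(expected))  # None, the sequence is well-formed
print(first_decode_error(observed))  # error reported at offset 0
# A lenient decoder hides the problem by substituting U+FFFD:
print(ascii(observed.decode("utf-8", errors="replace")))
```

The last line is worth noticing: anything that decodes leniently will quietly paper over the broken sequence, which is one reason to check the raw bytes rather than the decoded text.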

Determining exploitability depends on how the application transforms the bytes. To be safe, I say find these bugs and eliminate them.

So how can you find these bugs? We recently released a passive security testing and auditing tool called Watcher, which is available from Codeplex. Watcher includes a check (added by Samuel) that looks for ill-formed UTF-8 byte sequences.
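If you just want to eyeball a captured response yourself, a passive check along the same lines can be sketched in a few lines of Python. To be clear, this is not Watcher's implementation; scan_body and its context parameter are names I made up, and it assumes the response is supposed to be UTF-8 in the first place. It leans on the built-in strict decoder instead of hand-rolled range checks.

```python
def scan_body(body: bytes, context: int = 8):
    """List (offset, bad bytes as hex, surrounding bytes) per decode error.

    Hypothetical helper; assumes the body is declared as UTF-8.
    """
    findings = []
    pos = 0
    while pos < len(body):
        try:
            body[pos:].decode("utf-8")  # strict decode of the remainder
            break                       # no further errors
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            snippet = body[max(0, start - context):end + context]
            findings.append((start, body[start:end].hex(), snippet))
            pos = end                   # resume after the bad bytes
    return findings


for offset, bad_hex, snippet in scan_body(b"<p>q=\xe2?</p>"):
    print(offset, bad_hex, snippet)
```

The surrounding bytes are included because the interesting question is usually which piece of user input ended up at that spot in the page.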

Check it out.