Tag Archives: encodings

Ultrafast UTF-8 decoder by Bjoern Hoehrmann

I believe this is still being tested by several parties, but it’s clearly a highly optimized implementation of a UTF-8 decoder. Bjoern Hoehrmann recently released his Flexible and Economical UTF-8 Decoder. Check it out: // Copyright (c) 2008-2009 Bjoern Hoehrmann … Continue reading

Posted in Unicode

Detecting ill-formed UTF-8 byte sequences in HTML content

One issue I’ve come across, pretty infrequently, is the existence of ill-formed UTF-8 byte sequences in HTML content. As far as I can tell, nobody’s ever really tried to find this type of bug. Huh, so what’s up? UTF-8 is … Continue reading
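To give a feel for what "detecting ill-formed UTF-8" can look like in practice, here is a minimal sketch in Python. It is not the method from the full post; it simply leans on Python's built-in strict UTF-8 decoder and collects the offsets it rejects. The function name `find_ill_formed_utf8` is made up for illustration.

```python
def find_ill_formed_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for each span the strict decoder rejects.

    Hypothetical helper for illustration only. It re-slices the input after
    every error, so it is fine for a page-sized document but not tuned for
    very large inputs.
    """
    errors = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first ill-formed sequence.
            data[pos:].decode("utf-8")
            break  # the rest of the input is well-formed
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice we decoded.
            start = pos + exc.start
            end = pos + exc.end
            errors.append((start, data[start:end]))
            pos = end  # resume scanning right after the rejected bytes
    return errors


if __name__ == "__main__":
    html = b"<p>caf\xc3\xa9 is fine, but \xff is not</p>"
    for offset, bad in find_ill_formed_utf8(html):
        print(f"ill-formed bytes {bad!r} at offset {offset}")
```

Note that how many bytes are reported per error is up to Python's decoder, so a single overlong or surrogate sequence may show up as several consecutive entries.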

Posted in Unicode

Surrogates, supplementary characters, double-byte, multi-byte, and variable-width encoding ranges in Unicode and ANSI code pages

When I started digging into Unicode I was lost. It started to clear up for me when I eventually found a lot of terms that are synonymous and used interchangeably all over the place. For starters, “code page” might be … Continue reading

Posted in Unicode

CSS 2.1 escape sequences and encodings

I know there’s plenty of good work being done over at places like http://ha.ckers.com and http://www.thespanner.co.uk/. I have been researching CSS 2.1 and testing some very thorough and complex HTML and CSS filters myself, and trying to find the stuff … Continue reading

Posted in cascading style sheets