Surrogates, supplementary characters, double-byte, multi-byte, and variable-width encoding ranges in Unicode and ANSI code pages

When I started digging into Unicode I was lost. It started to clear up once I realized that a lot of the terms are treated as synonyms and used interchangeably all over the place. For starters, a "code page" might be called a codepage, character set, charset, encoding, coded character set, character map, or something else. Then it gets a bit more specific but still confusing with Single Byte Character Set (SBCS), Double Byte Character Set (DBCS), and Multi Byte Character Set (MBCS), especially since DBCS and MBCS are often used interchangeably. At this point we're still talking about mapping characters in a hexadecimal grid.

Then there's talk of encodings where transformations actually occur, such as UTF-8, UTF-16, and UTF-32. These get further qualified with endianness and other details. UTF-8 and UTF-16 are both sometimes referred to as MBCS, which is arguably less accurate than calling them variable-width.
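To make "variable-width" and endianness a bit more concrete, here's a minimal Python sketch. The sample characters and the helper name `widths` are my own picks for illustration. It just counts how many bytes a single character takes in each of the three UTFs.

```python
# Sample characters chosen for illustration:
# ASCII, a Latin-1 letter, a CJK ideograph, and a supplementary character.
SAMPLES = ["A", "\u00e9", "\u9999", "\U00010400"]
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def widths(ch: str) -> dict:
    """Return the encoded byte length of one character for each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", widths(ch))
    # Endianness changes the byte order of each code unit, not its value.
    print("A".encode("utf-16-le").hex(), "A".encode("utf-16-be").hex())
```

UTF-8 and UTF-16 change width depending on the character, while UTF-32 stays at four bytes, which is why "variable-width" fits the first two better than a blanket "MBCS" label.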

Anyway, surrogate code points and supplementary characters still seem an interesting area of vulnerability research. It's also been obvious for a while now that we need to pay attention to the variable-width code pages. Microsoft has a good list of common code pages in use.

MSDN also has a useful page on Internet Explorer's Character Set Recognition, which categorizes popular charset aliases into families as far as IE is concerned.

I18N Guy also keeps up a page to kick start the surrogates and supplementary character research.
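For anyone starting on the surrogate side, the arithmetic that turns a supplementary character (U+10000 through U+10FFFF) into a UTF-16 surrogate pair is short enough to sketch. The function names below are hypothetical helpers I made up for this post; the constants are the standard UTF-16 ones.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low (trail) surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a valid surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    hi, lo = to_surrogate_pair(0x10400)
    print(f"U+10400 -> {hi:04X} {lo:04X}")
```

The interesting test cases for a parser are the ones `from_surrogate_pair` rejects: a high surrogate with no low surrogate after it, a lone low surrogate, or the two in the wrong order.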

Variable-width characters in ANSI codepages

Now this isn't Unicode so much, but some ANSI code pages have variable-width character representations. That means more than one byte can be used to represent a single character. Typically it's two bytes, a lead byte and a trailing byte, like:

[lead byte] + [trailing byte] = character
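A quick way to see the lead and trailing bytes is to encode a mixed string and look at the bytes per character. This sketch assumes Python's `cp932` codec as a stand-in for Windows code page 932; `byte_layout` is a hypothetical helper name.

```python
def byte_layout(text: str, codec: str = "cp932") -> list:
    """Return (character, hex bytes) pairs showing each character's encoded width."""
    return [(ch, ch.encode(codec).hex()) for ch in text]


if __name__ == "__main__":
    # LATIN CAPITAL LETTER A followed by HIRAGANA LETTER A
    for ch, raw in byte_layout("A\u3042"):
        print(f"U+{ord(ch):04X} -> {raw}")
```

The ASCII letter stays a single byte, while the hiragana letter comes out as a lead byte followed by a trailing byte, so the same string is two characters but three bytes.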

Looking at these tables, the range 81 - FE seems like a good place to start looking for issues we've previously seen with character 'absorption'. This happens in some applications when a lead byte is encountered, followed by an invalid trailing byte. It's what Yosuke Hasegawa and Cheng Peng Su were talking about a couple of years ago. The net effect is sometimes, depending on the application, that the second byte gets consumed, or absorbed, giving a whole new meaning to the string. The correct behavior would be to reject the string containing the illegal byte sequence, although we'll usually see people replace it with a safe fallback character. That can work but can also be problematic depending on the implementation.
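Here's a rough sketch of the absorption problem. `naive_units` is a deliberately careless, hypothetical parser I wrote for illustration (it is not taken from any real application): it treats every byte in 81 - FE as a lead byte and grabs whatever follows. `strict_decode` leans on Python's `cp932` codec, which I'm assuming is a reasonable stand-in for code page 932, to show the "fail the string" behavior. The payload uses 0x82 followed by a double quote (0x22), which is below the 0x40 floor for Shift-JIS trailing bytes.

```python
def naive_units(data: bytes) -> list:
    """Hypothetical careless parser: a byte in 0x81-0xFE swallows the next byte."""
    units, i = [], 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            units.append(data[i:i + 2])  # trailing byte is never validated
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units


def strict_decode(data: bytes, codec: str = "cp932"):
    """Return the decoded text, or None when the byte sequence is illegal."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    payload = b'a\x82"b'  # lead byte 0x82 followed by an illegal trailing byte
    print(naive_units(payload))    # the quote is glued onto the lead byte
    print(strict_decode(payload))  # None: the whole string is rejected
    print(repr(payload.decode("cp932", errors="replace")))
```

In the naive parser the quote disappears into a two-byte unit, so anything downstream that relies on that quote as a delimiter sees a different string. With a replacement fallback, whether the byte after the bad lead byte survives is up to the decoder, which is the implementation-dependent part worth testing.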

Windows Codepage 932 ANSI/OEM Japanese; Japanese (Shift-JIS)

Windows Codepage 936 ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)

Windows Codepage 949 ANSI/OEM Korean (Unified Hangul Code)

Windows Codepage 950 ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)

The Hong Kong Supplementary Character Set

香港增補字符集 Hong Kong Supplementary Character Set (HKSCS)