Unicode security attacks and test cases – Normalization expansion for buffer overflows

Normalization, like casing operations, can change the number of characters and bytes in a string. When testing software, I want to get the most bang for my buck – in other words, what’s the minimal input I can provide to cause the maximum character and byte expansion?
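As a rough way to measure both kinds of expansion, here’s a minimal Python sketch. The expansion helper is my own, built on the standard library’s unicodedata module, and it assumes a non-empty input:

```python
import unicodedata

def expansion(s: str, form: str) -> tuple[float, float]:
    """Return (code point expansion, UTF-8 byte expansion) of s under form.

    form is one of "NFC", "NFD", "NFKC", "NFKD". Assumes s is non-empty.
    """
    out = unicodedata.normalize(form, s)
    return len(out) / len(s), len(out.encode("utf-8")) / len(s.encode("utf-8"))
```

For instance, expansion("\u2177", "NFKD") returns (4.0, 1.33…): four times as many code points, a third more bytes.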

First step: Figure out which normalization operation your input is going through – NFC, NFD, NFKC, or NFKD.
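If you can’t inspect the code directly, you can often probe it from the outside: feed the system a string whose normalized output differs across all four forms, then compare what comes back against the table below. This sketch prints the reference output for each form; the probe characters are my own choice:

```python
import unicodedata

# Probe string: U+00E9 (é, precomposed) separates NFC/NFKC from NFD/NFKD;
# U+FB03 (ffi ligature) separates the canonical forms from the compatibility ones.
probe = "\u00E9\uFB03"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, probe)
    print(form, " ".join(f"U+{ord(c):04X}" for c in out))
```

Each of the four forms produces a distinct code point sequence for this probe, so one round trip through the target is enough to identify it.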

Next step: Find the right input.

For example, if I pass in a character like U+2177 SMALL ROMAN NUMERAL EIGHT (ⅷ), I’ve passed in a single ‘character’ that takes three bytes [E2, 85, B7] to encode in UTF-8. If that character passes through a compatibility normalization form (NFKC or NFKD), its compatibility mapping takes it from one code point to four: U+0076 U+0069 U+0069 U+0069 (“viii”). Those are all ASCII characters, so bytewise I didn’t really expand all that much – just one extra byte – but I gained three extra characters.
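That expansion is easy to verify with Python’s unicodedata module; a quick check:

```python
import unicodedata

s = "\u2177"  # SMALL ROMAN NUMERAL EIGHT, three bytes in UTF-8
for form in ("NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)  # -> "viii"
    print(form,
          f"{len(s)} -> {len(out)} code points,",
          f"{len(s.encode('utf-8'))} -> {len(out.encode('utf-8'))} bytes")
```

Both forms print 1 -> 4 code points and 3 -> 4 bytes, matching the counts above.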

There may well be better cases than this one – just take a look at the maximum expansion factor table, courtesy of the Unicode Normalization FAQ: