Unicode attacks and test cases - Visual Spoofing, IDN homograph attacks, and the Whole Script Confusables

More on lookalikes, confusables, IDN homograph attacks, and other fun stuff, continued from the previous post. To recap, the three classes of confusables are:

Single-script
Mixed-script
Whole-script

Whole-script confusables

It's starting to make sense now. Let's look at the Unicode TR39 definition of a whole-script confusable:
X and Y are whole-script confusables if they are mixed-script confusables, and each of them is a single script string. Example: "scope" in Latin and "ѕсоре" in Cyrillic.

If we look at the code points, we'll see the clear difference between the two scripts being used:

  • scope == \u0073\u0063\u006F\u0070\u0065

  • ѕсоре == \u0455\u0441\u043E\u0440\u0435


The first version of 'scope' uses all Latin letters, but the second uses all Cyrillic letters. We call it a whole-script confusable because each word is made up entirely of a single script; we're not mixing scripts within the same string.
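To see this for yourself, here is a minimal Python sketch that dumps the code points of both strings and guesses which scripts they use. The helper names `describe` and `scripts` are made up for this post. Python's standard library doesn't expose the Unicode Script property, so `scripts` leans on the first word of each character's Unicode name, which is only a rough heuristic and not the TR39 algorithm.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

def scripts(text):
    """Approximate the scripts in text from the first word of each
    character name (a heuristic, not the real Script property)."""
    return {unicodedata.name(ch).split()[0] for ch in text}

latin = "\u0073\u0063\u006F\u0070\u0065"     # "scope" in Latin
cyrillic = "\u0455\u0441\u043E\u0440\u0435"  # the Cyrillic lookalike

for word in (latin, cyrillic):
    print(scripts(word))
    for code_point, name in describe(word):
        print("  ", code_point, name)
```

Each string should report exactly one script, which is what makes the pair whole-script confusables rather than mixed-script ones.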

These confusables can be used to bypass profanity filters, ad filters, or just about any system that blacklists words by exact match but still accepts Unicode input.
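A rough sketch of that bypass, assuming a filter that compares words against a blacklist by exact match. Everything here is hypothetical: the `BLOCKED` set, the function names, and especially the five-entry mapping table, which is hand-picked for this one word. A real filter would work from the full confusables data that ships with TR39.

```python
BLOCKED = {"scope"}  # hypothetical blacklist

# Hand-picked mapping for this example only, not the TR39 data set.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0455": "s",
    "\u0441": "c",
    "\u043E": "o",
    "\u0440": "p",
    "\u0435": "e",
})

def naive_is_blocked(word):
    """Exact-match check, the kind a confusable slips past."""
    return word.lower() in BLOCKED

def folded_is_blocked(word):
    """Fold known lookalikes to Latin before checking the blacklist."""
    return word.lower().translate(CYRILLIC_TO_LATIN) in BLOCKED

spoof = "\u0455\u0441\u043E\u0440\u0435"  # Cyrillic lookalike of "scope"

print(naive_is_blocked("scope"))  # the Latin word is caught
print(naive_is_blocked(spoof))    # the Cyrillic lookalike is not
print(folded_is_blocked(spoof))   # caught once lookalikes are folded
```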

As a test case, browsers and other software shouldn't let the end user be fooled by the following IDN homograph attacks. These domain names contain whole-script confusables and should be displayed in their lovely Punycode encoding, so the user can realize they may not be what they appear to be.

www.аЬс.com is http://www.xn--80a8a6a.com/
www.ігѕ.com is http://www.xn--c1a2eb.com/
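
If you want to reproduce the Punycode forms above, here is a small sketch using Python's built-in `idna` codec. The `to_ascii` helper name is made up for this post. Keep in mind the codec implements the older IDNA 2003 rules (it lowercases each label before encoding), and browsers apply their own display policies on top, so treat this as a way to generate test strings rather than a model of what any particular browser does.

```python
def to_ascii(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode) form
    using Python's built-in IDNA 2003 codec."""
    return hostname.encode("idna").decode("ascii")

# The two spoofed hosts from the test cases, written as escapes
# so the Cyrillic letters can't be mistaken for Latin ones.
spoofed_hosts = [
    "www.\u0430\u042C\u0441.com",
    "www.\u0456\u0433\u0455.com",
]

for host in spoofed_hosts:
    print(to_ascii(host))
```

A plain ASCII host such as www.abc.com passes through unchanged, so any label coming back with an xn-- prefix is a hint that something other than basic Latin is in play.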