Unicode attacks and test cases - Visual Spoofing, IDN homograph attacks, and the Mixed Script Confusables

More on lookalikes, confusables, IDN homograph attacks, and other fun stuff, continued from the previous post.

Mixed-script confusables

These occur when letters from one alphabet or script are used to mimic the appearance of letters from a completely different script.  For example, the following phrase mixes Latin and Cyrillic letters, and the Cyrillic ones are visually indistinguishable from their Latin counterparts:

  • Spооfing with hоmogrаphs


If you inspect the letters, you'll find that the 'oo' in 'Spoofing' is made up of two Cyrillic small letters 'o', and the 'a' in 'homographs' is Cyrillic as well.  Let's take one of the words apart and look at the Unicode code point value of each letter.

  • Spoofing == \u0053\u0070\u006F\u006F\u0066\u0069\u006E\u0067

  • Spoofing == \u0053\u0070\u043E\u043E\u0066\u0069\u006E\u0067


The first version of 'Spoofing' uses only ASCII Latin letters, but the second replaces the 'oo' with two Cyrillic small letters 'о' (U+043E). If a filter were looking for the word 'Spoofing', you could probably bypass it with this kind of mixed-script confusable.
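To make the comparison concrete, here is a minimal Python sketch that dumps the code point and Unicode name of every character in both versions. The helper name `describe` is made up for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The two versions of the word, written with the escapes listed above.
latin = "\u0053\u0070\u006F\u006F\u0066\u0069\u006E\u0067"
mixed = "\u0053\u0070\u043E\u043E\u0066\u0069\u006E\u0067"

for label, word in (("all Latin", latin), ("mixed script", mixed)):
    print(label)
    for code_point, name in describe(word):
        print(f"  {code_point}  {name}")

# The strings render alike but do not compare equal.
print("equal?", latin == mixed)
```

The third and fourth characters are where the two dumps differ: LATIN SMALL LETTER O in the first version, CYRILLIC SMALL LETTER O in the second.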

In fact, confusables can be used to bypass profanity filters, ad filters, or just about any system that blacklists words but still accepts Unicode input.
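As a rough illustration of that bypass, the sketch below runs a naive exact-match word filter against both spellings, then flags words whose letters come from more than one script. Everything here is hypothetical: the `BLOCKED` list and the function names are invented for the example. The script check is only a heuristic that reads the first word of each character's Unicode name, since Python's standard library does not expose the Unicode Script property. A real implementation would use proper script data.

```python
import unicodedata

BLOCKED = {"spoofing"}  # hypothetical blocklist, for illustration only


def naive_filter(text):
    """Return True if any whitespace-separated word is on the blocklist."""
    return any(word.lower() in BLOCKED for word in text.split())


def scripts_used(word):
    """Guess the scripts in a word from the first token of each letter's name.

    Heuristic only: 'LATIN SMALL LETTER O' -> 'LATIN',
    'CYRILLIC SMALL LETTER O' -> 'CYRILLIC'.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }


def is_mixed_script(word):
    """Return True if the word's letters appear to span more than one script."""
    return len(scripts_used(word)) > 1


latin = "Spoofing"
mixed = "Sp\u043E\u043Efing"  # two Cyrillic small letters o

print(naive_filter(latin))     # the plain word is caught
print(naive_filter(mixed))     # the confusable version slips through
print(is_mixed_script(latin))
print(is_mixed_script(mixed), sorted(scripts_used(mixed)))
```

Flagging mixed-script words is one possible countermeasure, not a complete one: a word written entirely in Cyrillic lookalikes would pass this check.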

As a test case, browsers and other software shouldn't let the end user be fooled by the following IDN homograph attacks. These domain names contain mixed-script confusables, and should be displayed in their lovely Punycode encoding so the user can realize they may not be what they appear to be.

www.microsоft.com is http://www.xn--microsft-sbh.com/
www.Αpple.com is http://www.xn--pple-zld.com/
www.faϲebook.com is http://www.xn--faebook-6pf.com/
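If you want to reproduce those Punycode forms yourself, here is a small sketch using Python's built-in `idna` codec. That codec implements the older IDNA 2003 rules, so treat it as an approximation of what any particular browser does. The helper names `to_ascii` and `has_ace_label` are made up for the example, and the non-ASCII letters are written as escapes so there's no doubt about which character is which.

```python
def to_ascii(hostname):
    """Encode a hostname label by label with the built-in 'idna' codec."""
    return hostname.encode("idna").decode("ascii")


def has_ace_label(hostname):
    """Return True if any label needs the 'xn--' (Punycode) form."""
    return any(label.startswith("xn--") for label in to_ascii(hostname).split("."))


samples = [
    "www.micros\u043Eft.com",  # U+043E CYRILLIC SMALL LETTER O
    "www.\u0391pple.com",      # U+0391 GREEK CAPITAL LETTER ALPHA
    "www.fa\u03F2ebook.com",   # U+03F2 GREEK LUNATE SIGMA SYMBOL
]

for host in samples:
    # !a prints an ASCII-safe repr so the lookalike letters stay visible.
    print(f"{host!a} -> {to_ascii(host)}")

print(has_ace_label("www.microsoft.com"))  # all-ASCII name, no xn-- label
```

Note that the codec case-folds the name before encoding, which is why the capital Greek alpha and the lunate sigma come out as the Punycode for their lowercase mapped forms.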

I'll take them apart another time; I'm planning to look closer at IDNs, IRIs, and the rules around them.