Unicode security testing library

I often want to break software: mostly Web applications, but occasionally platform code such as protocols or OS components.  When it comes to testing string input to find bugs or vulnerabilities, Unicode can be a tester's best friend.  Strings are not simple things for software engineers. They require a lot of planning: buffers, encodings, transmission, and storage are just a few of the concerns.

I've had some success over the years finding nasty bugs, the kind that get critical ratings and require the world to reboot, and Unicode has often been a useful creative tool for finding them.  I've leaned on the Unicode specifications quite a bit, and learned from past research by other bug hunters.  I've also managed to work on a few tools, one being x5s, a cross-site scripting tester which was implemented by John Hernandez.  Its novelty was in sending various Unicode characters and detecting when they transform into an ASCII equivalent.  Character transformations can lead to dangerous scenarios, for example when input is validated before the transformation but used after it.
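To make the transformation risk concrete, here is a minimal Python sketch.  It is not x5s and not code from any real application: `naive_filter` and `store` are hypothetical names standing in for a validation step and a later step that applies NFKC normalization.  The fullwidth angle brackets pass the filter and only become ASCII afterwards.

```python
import unicodedata


def naive_filter(s: str) -> str:
    """Hypothetical validation step: reject ASCII angle brackets."""
    if "<" in s or ">" in s:
        raise ValueError("angle brackets not allowed")
    return s


def store(s: str) -> str:
    """Hypothetical later step that normalizes AFTER validation."""
    return unicodedata.normalize("NFKC", s)


# U+FF1C FULLWIDTH LESS-THAN SIGN and U+FF1E FULLWIDTH GREATER-THAN SIGN
payload = "\uFF1Cscript\uFF1E"

result = store(naive_filter(payload))
print(ascii(payload), "->", result)
```

The filter never sees an ASCII "<", yet the stored value contains one.  Normalizing before validating (or validating again after any transformation) closes this particular gap.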

I also wanted to document the more interesting Unicode characters so they would be easily accessible and pre-defined.  People often ask me: what characters should I use for testing?  Which ones flip text around?  Which ones cause problems?  Which one maps to an apostrophe for SQL injection, or a less-than sign for XSS?  To answer these questions, I put everything I knew of (well, most of it) into a small utility library, unicode-hax, available on GitHub for your security testing pleasure.

Major features:

  • Contains methods to get best fit mappings.  For example, you want to know all the characters in various legacy encodings that transform to "<" or some other ASCII character.
  • Contains methods to get Unicode normalization mappings.  For example, you want to know if any special Unicode characters will transform to ">" or some other ASCII character.
  • Contains a small set of hard-coded Unicode characters useful in fuzzing, as well as some functions for returning invalid byte sequences or characters that .NET would not allow by itself (because they're not well-formed).  These include:
    • ill-formed byte sequences
    • Unicode non-characters (an oxymoron?)
    • private use area (PUA)
    • unassigned code points
    • code points with special meaning such as the BOM and RLO
    • half-surrogate values like U+DEAD, a very nasty little guy all by itself
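
The normalization lookup described in the list above can be approximated in a few lines of Python.  This is only a sketch of the idea using the standard `unicodedata` module, not the API of unicode-hax (which targets .NET); the function name `normalizes_to` is made up for illustration.  Best-fit mappings are not covered here because they depend on platform-specific code page tables.

```python
import sys
import unicodedata


def normalizes_to(target: str, form: str = "NFKC") -> list[str]:
    """Return every single code point (other than target itself)
    whose normalized form equals target."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters; skip them
        ch = chr(cp)
        if ch != target and unicodedata.normalize(form, ch) == target:
            hits.append(ch)
    return hits


hits = normalizes_to(">")
for ch in hits:
    # Print the code point and its Unicode name where one exists.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

The scan walks the whole code space, so it takes a moment to run.  The exact results depend on the Unicode version bundled with your Python build, and swapping `form` for "NFKD", "NFC", or "NFD" shows how the candidate set changes between normalization forms.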

The goal was to cut the number of fuzzing iterations by focusing on a very small group of characters with special meaning that have historically caused problems in software.
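
As an illustration of how small such a set can be, here is a hedged Python sketch.  The names (`FUZZ_CHARS`, `ILL_FORMED_UTF8`, `fuzz_cases`, `rejected_by_strict_decoder`) and the specific values are my own picks for this example, one per category, and are not the actual list shipped in unicode-hax.

```python
# Hypothetical sample set: one representative per category.
FUZZ_CHARS = {
    "byte order mark (BOM)": "\uFEFF",
    "right-to-left override (RLO)": "\u202E",
    "noncharacter": "\uFFFF",
    "private use area (PUA)": "\uE000",
    "lone surrogate": "\uDEAD",
}

# Byte sequences that a strict UTF-8 decoder must reject.
ILL_FORMED_UTF8 = [
    b"\xC0\xBC",      # overlong two-byte encoding of "<"
    b"\xED\xBA\xAD",  # UTF-8-style encoding of the surrogate U+DEAD
    b"\xE2\x82",      # truncated three-byte sequence
    b"\xFF",          # byte that never appears in well-formed UTF-8
]


def rejected_by_strict_decoder(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def fuzz_cases(prefix: str, suffix: str):
    """Yield (label, test string) pairs with each character embedded."""
    for label, ch in FUZZ_CHARS.items():
        yield label, prefix + ch + suffix


for label, case in fuzz_cases("name=", "&x=1"):
    # ascii() keeps the output printable even for the lone surrogate.
    print(label, ascii(case))

for seq in ILL_FORMED_UTF8:
    print(seq, "rejected:", rejected_by_strict_decoder(seq))
```

A handful of cases like these per input field is far cheaper than iterating over the full code space, and the interesting question for each one is what the target does with it: reject, strip, replace with U+FFFD, or pass it through unchanged.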

If you have any suggestions for improvements or additions, please let me know.  Find the code here:


Happy bug hunting.