Unicode security attacks and test cases – fuzzing with Unicode

When it comes to fuzzing parsers, protocols, and other software, I want the fuzzer to be capable of producing tests specific to Unicode. Here’s what it should do at a minimum:

  • Generate a lone surrogate (half of a surrogate pair) in UTF-8 or UTF-16

  • Generate ill-formed byte sequences for UTF-8 and UTF-16

  • Generate overlong UTF-8 encodings (for example, 0xC0 0xAF for “/”)

  • Generate unassigned and reserved code points

  • Generate code points outside of the valid range (above U+10FFFF)

  • Generate interesting control characters and characters with special meaning, like the BOM and the bidirectional embeddings and overrides
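To make the list above concrete, here is a minimal Python sketch of what such a generator might look like. It is not the code I mentioned, just an illustration. The names `utf8_raw`, `unassigned_code_points`, `SPECIAL`, and `fuzz_cases` are made up for this example, the handful of cases is far from exhaustive, and the "unassigned" picks depend on whichever Unicode version your Python's `unicodedata` module ships with.

```python
import struct
import unicodedata


def utf8_raw(cp, length=None):
    """Apply the UTF-8 bit layout to cp without any validity checks.

    Passing a length larger than needed yields an overlong form.
    """
    if cp < 0x80:
        needed = 1
    elif cp < 0x800:
        needed = 2
    elif cp < 0x10000:
        needed = 3
    elif cp < 0x200000:
        needed = 4
    else:
        raise ValueError("does not fit the 4-byte layout")
    n = needed if length is None else length
    if not needed <= n <= 4:
        raise ValueError("length must be between the minimum and 4")
    if n == 1:
        return bytes([cp])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    lead = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n] | cp
    return bytes([lead] + tail[::-1])


def unassigned_code_points(count, start=0x80):
    """First `count` code points with category Cn (unassigned or
    noncharacter) according to this Python's Unicode database."""
    found = []
    for cp in range(start, 0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            found.append(cp)
            if len(found) == count:
                break
    return found


# NUL, BOM, and the bidi embedding/override controls (LRE, RLE, PDF, LRO, RLO).
SPECIAL = [0x0000, 0xFEFF, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E]


def fuzz_cases():
    """Return a dict mapping a case name to the raw bytes to feed the target."""
    cases = {
        # Lone surrogates
        "utf8_lone_high_surrogate": utf8_raw(0xD800),
        "utf8_lone_low_surrogate": utf8_raw(0xDFFF),
        "utf16le_lone_high_surrogate": struct.pack("<H", 0xD800),
        "utf16le_reversed_pair": struct.pack("<HH", 0xDC00, 0xD800),
        # Ill-formed byte sequences
        "utf8_lone_continuation": b"\x80",
        "utf8_truncated_sequence": "\u20ac".encode("utf-8")[:-1],
        "utf8_invalid_byte": b"\xff",
        "utf16le_odd_length": b"A\x00B",
        # Overlong UTF-8
        "utf8_overlong_slash_2": utf8_raw(0x2F, 2),
        "utf8_overlong_slash_3": utf8_raw(0x2F, 3),
        "utf8_overlong_nul": utf8_raw(0x00, 2),
        # Outside the valid range
        "utf8_beyond_10ffff": utf8_raw(0x110000),
    }
    # Unassigned/reserved and special characters are well-formed UTF-8;
    # the interesting part is how the target interprets them.
    for cp in unassigned_code_points(3):
        cases[f"utf8_unassigned_{cp:04X}"] = chr(cp).encode("utf-8")
    for cp in SPECIAL:
        cases[f"utf8_special_{cp:04X}"] = chr(cp).encode("utf-8")
    return cases
```

Each value is a byte string you can splice into whatever input the target parses. A strict decoder, such as Python's own `bytes.decode("utf-8")`, should reject the surrogate, ill-formed, overlong, and out-of-range cases, so any target that accepts them is worth a closer look.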

I’ve got some code that does most of these things. Maybe I should elaborate on them some more… Does Peach or another fuzzing framework provide this already?