Special Unicode characters for testing, fuzzing, and controlling the visual display of text

WARNING: Some of these characters may cause strange things to happen in your software.

Of course, that's the point, right?  Here's a minimal set of special Unicode characters I like to use in application testing.  This bit is from a small Unicode generation library I use for fetching things like:

  • best fit mappings

  • Unicode normalization mappings

  • ill-formed byte sequences

  • overlong UTF-8 encodings

  • non-characters

  • private use area (PUA)

  • unassigned code points

  • code points with special meaning such as the BOM and RLO

  • half-surrogate values

  • invisible characters


Some of these (the RLO and MVS) are useful for visual spoofing or for controlling the visual appearance of text in modal dialog boxes or other user-controlled content.  For example, placing the RLO character in the middle of a string switches the reading order so that the characters after it run right-to-left.  Like so:

The site www.example.com‮‮ is known to host malware, continue?

A lame example, I know, but the point is that as a software developer you should never let the override characters into your code.  Other characters have caused weird (often exploitable) errors in Web applications, Web browsers, Web servers and other software I've come across.  For example, if an ASP.NET application passes user-controlled input to a StreamWriter, an illegal surrogate (a single low surrogate without a matching high, or vice versa) puts the writer into an irrecoverable error condition, leading to a denial of service that lasts until the application is restarted.
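One way to guard against both problems is to reject such input before it reaches a writer or the screen. The following is only a minimal sketch, not part of the library shown here: the class `InputChecks` and its method names are hypothetical, and the set of blocked characters (the bidirectional embeddings, overrides, and isolates) is an assumption you may want to widen.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class InputChecks
{
    // True for the bidirectional embedding/override controls (U+202A..U+202E)
    // and the isolate controls (U+2066..U+2069). RLO is U+202E.
    public static bool IsBidiControl(char c) =>
        (c >= '\u202A' && c <= '\u202E') || (c >= '\u2066' && c <= '\u2069');

    // True if the string contains a high surrogate that is not followed by a
    // low surrogate, or a low surrogate that is not preceded by a high one.
    public static bool HasLoneSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                if (i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    i++; // skip the low half of a well-formed pair
                    continue;
                }
                return true; // unpaired high surrogate
            }
            if (char.IsLowSurrogate(s[i]))
            {
                return true; // unpaired low surrogate
            }
        }
        return false;
    }

    // Assumption: input is "safe" when it is well-formed UTF-16 and
    // carries no bidirectional control characters.
    public static bool IsSafeForDisplay(string s)
    {
        if (HasLoneSurrogate(s)) return false;
        foreach (char c in s)
        {
            if (IsBidiControl(c)) return false;
        }
        return true;
    }
}
```

With this sketch, `InputChecks.IsSafeForDisplay("www.example.com\u202E")` returns false, as does any string containing a lone `\uDEAD` or `\uDAAD`, while a well-formed pair such as the one produced by `char.ConvertFromUtf32(0x1D7D6)` passes.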

/// The Byte Order Mark U+FEFF is a special character that signals the byte order (endianness)
/// of text data.
///
public static readonly string uBOM = "\uFEFF";
///
/// The Right to Left Override U+202E defines special meaning to re-order the
/// display of text for right-to-left reading.
///
public static readonly string uRLO = "\u202E";
///
/// Mongolian Vowel Separator U+180E is invisible. It had the whitespace property (Zs)
/// until Unicode 6.3 reclassified it as a format character (Cf).
///
public static readonly string uMVS = "\u180E";
///
/// Word Joiner U+2060 is an invisible zero-width character.
///
public static readonly string uWordJoiner = "\u2060";
///
/// A reserved code point U+FEFE
///
public static readonly string uReservedCodePoint = "\uFEFE";
///
/// The code point U+FFFF is guaranteed to not be a Unicode character at all
///
public static readonly string uNotACharacter = "\uFFFF";
///
/// An unassigned code point U+0FED
///
public static readonly string uUnassigned = "\u0FED";
///
/// An illegal low half-surrogate U+DEAD
///
public static readonly string uDEAD = "\uDEAD";
///
/// An illegal high half-surrogate U+DAAD
///
public static readonly string uDAAD = "\uDAAD";
///
/// A Private Use Area code point U+F8FF which Apple happens to use for its logo.
///
public static readonly string uPrivate = "\uF8FF";
///
/// U+FF0F FULLWIDTH SOLIDUS normalizes to / (U+002F) under NFKC, for example in a hostname
///
public static readonly string uFullwidthSolidus = "\uFF0F";
///
/// Code point with a numerical mapping and value U+1D7D6 MATHEMATICAL BOLD DIGIT EIGHT
///
public static readonly string uBoldEight = char.ConvertFromUtf32(0x1D7D6);
///
/// IDNA2003/2008 deviation character - U+00DF is mapped to "ss" during IDNA2003's mapping phase,
/// whereas IDNA2008 treats it as a valid character and leaves it unchanged.
/// See http://www.unicode.org/reports/tr46/
///
public static readonly string uIdnaSs = "\u00DF";
///
/// U+FDFA expands by 11x (UTF-8) and 18x (UTF-16) under NFKC/NFKD
///
public static readonly string uFDFA = "\uFDFA";
///
/// U+0390 expands by 3x (UTF-8) under NFD
///
public static readonly string u0390 = "\u0390";
///
/// U+1F82 expands by 4x (UTF-16) under NFD
///
public static readonly string u1F82 = "\u1F82";
///
/// U+FB2C expands by 3x (UTF-16) under NFC
///
public static readonly string uFB2C = "\uFB2C";
///
/// U+1D160 expands by 3x (UTF-8) under NFC
///
public static readonly string u1D160 = char.ConvertFromUtf32(0x1D160);
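
The expansion factors quoted in the comments above can be checked with `string.Normalize`. This is a small sketch under stated assumptions: the class `ExpansionDemo` and its method names are made up for illustration, and the results depend on the normalization data shipped with your runtime and platform.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class ExpansionDemo
{
    // Growth in UTF-16 code units: normalized length divided by original length.
    public static double Utf16Factor(string s, NormalizationForm form) =>
        (double)s.Normalize(form).Length / s.Length;

    // Growth in UTF-8 bytes: normalized byte count divided by original byte count.
    public static double Utf8Factor(string s, NormalizationForm form) =>
        (double)Encoding.UTF8.GetByteCount(s.Normalize(form)) /
        Encoding.UTF8.GetByteCount(s);
}
```

For example, `ExpansionDemo.Utf16Factor("\uFDFA", NormalizationForm.FormKC)` and `ExpansionDemo.Utf8Factor("\u0390", NormalizationForm.FormD)` measure the first two expansion cases listed above. Note that `Normalize` is documented to throw an ArgumentException when the string contains invalid Unicode, so the half-surrogate constants cannot be passed through these helpers.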