I18N input validation whitelist filtering with System.Globalization and GetUnicodeCategory


Maybe you’re building internationalized code and wondering how to build a whitelist filter that will support all the different character sets your planning to support. If you support more than ten, especially some of the larger east Asian sets, this might seem like an unwieldy or tricky process.
Well luckily it’s easier than most people would think. Building a good input validation filter can be simplified with .Net’s GetUnicodeCategory. But use the method from the System.Globalization namespace as the other one in System.Char looks like it may become the subordinate.

With GetUnicodeCategory you can simply build a whitelist supporting the character categories you want to allow. So get away from thinking you have to write a regEx filter and list out all the character ranges you want to allow in each character set, it’s much simpler than that!

The Unicode standard assigns ever character to one of about 31 categories. They make sense too, for example Other Control charactes (Cc) , Lowercase Letter (Ll), Uppercase Letter (Lu), Math Symbol (Sm). So for example you might want to only allow letters, numbers, and punctuation in your whitelist. This could be achieved with the following snippet:


char cUntrustedInput; // the untrusted user-input
UnicodeCategory cInputTest = CharUnicodeInfo.GetUnicodeCategory(cUntrustedInput);
if (cTestCategory == UnicodeCategory.LowercaseLetter ||
cTestCategory == UnicodeCategory.UppercaseLetter ||
cTestCategory == UnicodeCategory.DecimalDigitNumber ||
cTestCategory == UnicodeCategory.TitlecaseLetter ||
cTestCategory == UnicodeCategory.OtherLetter ||
cTestCategory == UnicodeCategory.NonSpacingMark ||
cTestCategory == UnicodeCategory.DashPunctuation ||
cTestCategory == UnicodeCategory.ConnectorPunctuation)
{
// character looks safe, continue
}
else
{
// character is not allowed, fail
}


Not too bad eh.

Hi, do you know of any similar utility/framework for Java?

I'm not familiar with Java, but a quick search led me to:

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#getType(int)

public static int getType(int codePoint)

Looks like getType will return a similar result as GetUnicodeCategory, allowing you to filter down to specific categories you want to allow. Nice that you can pass it an int and avoid having to deal with surrogates and other multi-byte character sequences.

More info:
http://java.sun.com/docs/books/tutorial/i18n/text/charintro.html