Injecting new line characters (e.g. CR LF) into security logs with Unicode

Today I was asked whether ESAPI's approach to sanitizing log messages against CRLF (carriage return, line feed) injection was sound. "CRLF injection" in this case describes an attack in which textual content, such as records in a security log, can be forged. Imagine a plain text security log file that separates log entries with two CRLF sequences; in hex, that separator is 0x0D 0x0A 0x0D 0x0A. (I'm using plain text here to keep it simple, but hopefully real logs would use some form of markup.) If the input validation routines did not sanitize CR and LF characters, an attacker could craft input that creates what appear to be new records in the log. Here's a snippet from ESAPI that attempts to protect against this:

// ensure no CRLF injection into logs for forging records
String clean = message.replace('\n', '_').replace('\r', '_');
if (ESAPI.securityConfiguration().getLogEncodingRequired()) {
    clean = ESAPI.encoder().encodeForHTML(message);
    if (!message.equals(clean)) {
        clean += " (Encoded)";
    }
}

Note: I have never worked with or tested ESAPI. I don't know what the methods getLogEncodingRequired() and encodeForHTML(message) do, so I can't say whether ESAPI would be vulnerable to the attacks I'm about to describe. Maybe someone from ESAPI can jump in. I'm only using ESAPI to make the example more realistic.
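To make the forging attack concrete, here is a minimal sketch in Java. It assumes the hypothetical plain text format described above (entries separated by two CRLF sequences); the class and method names are mine for illustration and have nothing to do with ESAPI.

```java
final class LogForgeryDemo {
    // Hypothetical log format: entries are separated by two CRLF sequences.
    static final String ENTRY_SEPARATOR = "\r\n\r\n";

    // Naive logger: appends the message verbatim, followed by the separator.
    static String appendEntry(String log, String message) {
        return log + message + ENTRY_SEPARATOR;
    }

    // Counts the entries that someone reading or parsing the log would see.
    static int countEntries(String log) {
        return log.isEmpty() ? 0 : log.split(ENTRY_SEPARATOR, -1).length - 1;
    }

    public static void main(String[] args) {
        // Attacker-controlled input that smuggles in its own separator.
        String attackerInput = "Login failed for user bob"
                + ENTRY_SEPARATOR
                + "Login succeeded for user admin";
        String log = appendEntry("", attackerInput);
        // One call to appendEntry, but the log appears to hold two entries.
        System.out.println(countEntries(log));
    }
}
```

A single logged message ends up looking like two separate records, which is exactly what the CR/LF replacement is meant to prevent.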

ESAPI is concerned here with the visual (human-readable) appearance of log entries, not with how software processes the characters in those entries. There seem to be three vectors that would screw up ESAPI's logic for protecting against CRLF injection:

  1. Unicode normalization that decomposes and maps a character (or sequence of characters) to either a CR or an LF
    * Not a problem.

  2. Charset best-fit mappings that map input characters to either CR or LF during transcoding
    * Unpredictable problem.

  3. Unicode characters that provide the same visual effect as CR and LF
    * Definitely a problem.


#1 You don’t have to worry about this one. The four Unicode normalization forms (NFC, NFD, NFKC, NFKD) do not map any characters to CR or LF.
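If you would rather check this than take my word for it, a brute-force sketch along these lines should do. It uses the JDK's java.text.Normalizer, so it only covers the Unicode version bundled with your JDK; the class and method names are made up for this example.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class NormalizationCheck {
    // Returns every code point, other than LF and CR themselves, whose
    // normalized form (in any of the four forms) contains an LF or a CR.
    static List<Integer> codePointsNormalizingToCrLf() {
        List<Integer> hits = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp == '\n' || cp == '\r') {
                continue;
            }
            // Lone surrogates are not characters, so skip them.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            for (Normalizer.Form form : Normalizer.Form.values()) {
                String normalized = Normalizer.normalize(s, form);
                if (normalized.indexOf('\n') >= 0 || normalized.indexOf('\r') >= 0) {
                    hits.add(cp);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // An empty list means no character normalizes to CR or LF.
        System.out.println(codePointsNormalizingToCrLf());
    }
}
```

If the claim above holds for your JDK's Unicode data, the printed list is empty.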

#2 Best-fits are tough to predict because they can differ per platform. Below is the set of characters I know of that will best-fit map to either U+000A (LF) or U+000D (CR) in the given charset (e.g. CP424).

000A 008E #REPEAT CP424
000A 25D9 -- IBMGRAPH
000A 008E #CONTROL CP037
000A 008E #CONTROL CP1026
000A 008E #CONTROL CP500
000A 008E #CONTROL CP875
000A 2326 # ERASE TO THE RIGHT # Delete right (right-to-left text) KEYBOARD
000D 266A 02 IBMGRAPH
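Since these mappings differ per platform, the safest thing is to test the charsets you actually write logs in. Below is a rough audit sketch for the JVM, with hypothetical class and method names. One caveat: as far as I know, Java's built-in encoders substitute a replacement byte for unmappable characters rather than best-fitting them, so this only tells you what the JVM's own tables do, not what another platform's transcoding layer would do with the same input.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class CharsetNewlineAudit {
    // Lists code points, other than LF and CR themselves, that encode to the
    // single byte 0x0A or 0x0D in the given charset.
    static List<Integer> encodesToCrOrLf(Charset charset) {
        List<Integer> hits = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp == 0x0A || cp == 0x0D) {
                continue;
            }
            // Lone surrogates are not characters, so skip them.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            byte[] bytes = new String(Character.toChars(cp)).getBytes(charset);
            if (bytes.length == 1 && (bytes[0] == 0x0A || bytes[0] == 0x0D)) {
                hits.add(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Pass a charset name as the first argument; defaults to UTF-8.
        Charset charset = Charset.forName(args.length > 0 ? args[0] : "UTF-8");
        for (int cp : encodesToCrOrLf(charset)) {
            System.out.printf("U+%04X encodes to CR or LF in %s%n", cp, charset);
        }
    }
}
```

Any code point this prints is one your CR/LF filter would miss if it runs before transcoding.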


#3 Here is the most practical and most obvious attack. Each of the following Unicode characters (code points) will create a visual “new line” effect.

U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)
U+0085 NEXT LINE (NEL)
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR


This means ESAPI should be filtering out all of these as well if it plans to handle Unicode input.
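As a rough sketch of what that broader filter could look like (this is my own illustration with made-up names, not ESAPI code), the same underscore replacement can be applied to all seven code points listed above with one regular expression:

```java
import java.util.regex.Pattern;

final class LineBreakSanitizer {
    // Character class covering U+000A through U+000D, U+0085, U+2028 and U+2029.
    private static final Pattern LINE_BREAKS =
            Pattern.compile("[\\x0A-\\x0D\\x85\\x{2028}\\x{2029}]");

    // Replaces every line-breaking code point with an underscore.
    static String neutralize(String message) {
        return LINE_BREAKS.matcher(message).replaceAll("_");
    }

    public static void main(String[] args) {
        // U+2028 LINE SEPARATOR slips past a filter that only knows CR and LF.
        String forged = "user=bob" + (char) 0x2028 + "user=admin login ok";
        System.out.println(neutralize(forged)); // prints "user=bob_user=admin login ok"
    }
}
```

Java's regex engine (8 and later) also has a built-in \R linebreak matcher that covers these same code points, but it treats a CR LF pair as a single match, so a pair would become one underscore instead of two.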

Of course there’s a #4 I didn’t mention – concerning the target locale and character encoding of the logs.

I assume this ESAPI function is concerned with logs written using Latin characters in a Western locale. I tend to agree that blacklisting is not the best answer, but sometimes it makes sense and works. If the logs are written out in plain text encoded with UTF-8 or another Unicode encoding, then #3 above would be a problem.

Isn’t the wacky world of Unicode and internationalization fun?