Wednesday, February 25, 2009

Encoding is "the process of transforming information from one format into another" [Wikipedia]

In the web development world, when we talk about encoding text, we are normally talking about taking some input text and making it appropriate to use in a given context. For example, taking the user's first name and last name, and making them safe to put in a <b> tag within an HTML page.

We care about encoding most when we take input that we don't trust from our users - if we ever display that input we have to be careful to remove any characters that may interfere with the display of our web pages, cause javascript to run, or allow other malicious actions.

This article will help you understand what encoding is, why you need to do it and how that helps prevent cross-site scripting, and give a little introduction to the AntiXSS library.

A bold example

As a running example, let's say we are letting the user enter anything they want for their name - in an input box like this on our website:

[Screenshot: text box to collect name from the user.]

We then take the text they enter and store it in our database. Later on when we display it on the web page, we wrap the text in bold tags so that it stands out:

Welcome to the website, Kirk!

In ASP.NET one way of doing this would be to put an ASP.NET label between <b> tags:

Welcome to the website, <b><asp:Label ID="NameLabel" runat="server"></asp:Label></b>!

...and then in the code behind, take the name from our database and assign it to the Text property:

User user = GetFromDatabase();

NameLabel.Text = user.Name;

Trust no-one

The problem is that we've received this name directly from our user (whom, of course, we shouldn't trust) and stored it in a column in our database (which we now can't trust either). We can't safely display it on our website without first sanitising it or otherwise making it trustworthy.

The number one lesson I try to give in my presentations on web security is "Don't trust...". You can't trust your users, your employees, your students, or even your mother. There is no such thing as "safe input" received over the Internet; everything you receive is suspect.

(Even people who are otherwise trustworthy might not be in control of their faculties if they have spyware or are virus-infected)

Everything is fine if the user enters only plain letters:

User enters

But what happens if the user enters some HTML into the input box?

[Screenshot: the user enters HTML, and the page layout changes.]

The user is now able to change how our page looks! Indeed, they can inject HTML, script or other content directly into pages on our website!

This is known as Cross-site scripting, or XSS, and is the bane of our existence as web developers.

What went wrong?

The ASP.NET label outputs the Text directly into the HTML output of the page:

    Welcome to the website, <b><span id="NameLabel">Kirk </b><i>Jackson</i></span></b>!

The problem here is that the ASP.NET label is not encoding the text before outputting it. The text is not appropriate to use in an HTML context, as it contains characters that have meaning in HTML (namely the characters making the </b> and <i> tags).

To make the user's name safe to use in an HTML context, we need to encode the characters that have special meaning in HTML:

Kirk &lt;/b&gt;&lt;i&gt;Jackson&lt;/i&gt;
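To see the difference outside ASP.NET, here is a minimal sketch in Python using only the standard library. It illustrates the idea rather than the Label control's actual code: render_welcome is a made-up helper that mimics the markup above, and html.escape stands in for an HTML encoder (it encodes > as &gt; as well as < as &lt;).

```python
import html


def render_welcome(name: str, encode: bool) -> str:
    """Build the welcome markup, optionally HTML-encoding the untrusted name."""
    # quote=False: encode only &, < and >, which is enough for text between tags.
    shown_name = html.escape(name, quote=False) if encode else name
    return f'Welcome to the website, <b><span id="NameLabel">{shown_name}</span></b>!'


untrusted = "Kirk </b><i>Jackson</i>"

# Unencoded: the user's tags become part of our page structure.
print(render_welcome(untrusted, encode=False))

# Encoded: the same characters are sent as text to be displayed.
print(render_welcome(untrusted, encode=True))
```

The first line printed contains a live <i> tag supplied by the user; the second contains only entity references, so the browser shows the name exactly as typed.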

HTML Encoding

HTML encoding is turning a string into a safe block of text for insertion in an HTML web page.

This means it should not use any of the special characters that are used to mark the beginning or end of tags (< and >), to delimit attribute values ("), or the ampersand character on its own (&). If those characters are left in the string, they could be used to start or stop HTML tags and change the behaviour of our page.

To neutralise these characters, HTML encoding turns them into character entity references or numeric entity references. The browser then no longer treats them as special characters for formatting an HTML page, and simply displays them as characters.

Original character       Character entity reference   Numeric entity reference
< (less-than sign)       &lt;                         &#60;
> (greater-than sign)    &gt;                         &#62;
" (double quote)         &quot;                       &#34;
& (ampersand)            &amp;                        &#38;

The above table shows a few examples of how to encode special characters. For a more complete reference, see Wikipedia or W3C.

Note that since the ampersand character is used to start an encoded character sequence, it can't be used on its own as a regular character. This is why ampersands should be encoded as &amp; in HTML.
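The special role of the ampersand also affects how an encoder has to be written. As a rough illustration (a hand-rolled sketch in Python, not how AntiXSS or HttpUtility are implemented; html_encode and ENTITIES are names made up for this example), the ampersand has to be replaced before the other characters:

```python
import html

# Replacement order matters: the ampersand is handled first. If it were
# handled last, the "&" in an already-produced "&lt;" would be re-encoded
# and the output would contain "&amp;lt;".
ENTITIES = [
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
]


def html_encode(text: str) -> str:
    """Replace the four special characters with character entity references."""
    for char, entity in ENTITIES:
        text = text.replace(char, entity)
    return text


sample = 'Tom & "Jerry" <b>'
print(html_encode(sample))

# For this input the sketch agrees with the standard library encoder.
assert html_encode(sample) == html.escape(sample)
```

In real code you would call a library encoder rather than maintain a table like this by hand; the sketch is only there to show why the ordering matters.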

Once the user's name is encoded, it will appear in the HTML as &lt;i&gt; instead of <i>, which means that in the above example, italic mode won't turn on:

[Screenshot: the user's text is now encoded correctly.]

The screenshot above looks a little weird, but the page now displays the text exactly as the user typed it in, without treating the user's input as special HTML markup.

Attribute Encoding

Attribute encoding is turning a string into a safe block of text for use within an attribute of an HTML tag.

Attributes are the name/value pairs on a tag node in HTML (or SGML and XML, for that matter). For example, in the following HTML, the a tag has a title attribute:

<a href="foo.html" title="test">thing</a>

[Screenshot: the title attribute is displayed as a tooltip.]

The text inside the title attribute is used to create a tool tip when the mouse pointer hovers over the hyperlink.

This HTML contains an a tag (an anchor tag), which has two attributes set: href and title. The a tag also contains some HTML within it: the text 'thing'. The contained text must be HTML encoded if you only want text within the a tag, and the two attributes must be attribute encoded.

At a simplistic level, text is valid inside an attribute as long as it doesn't contain double quotes ("), ampersands (&) or less-than symbols (<), as the double quote would prematurely end the attribute, and the other two characters must be encoded anywhere they are used within an HTML document (except when creating tags).

To extend our earlier example, imagine the user's name is used as the tooltip of a link, to pop up before they follow the link. If we naively output the user's name as a title attribute without encoding it, the user could inject some additional behaviour into our page. For example:

<a href="foo.html" title="<%= User.Name %>">thing</a>

If the user enters something malicious, for example a double quote followed by some JavaScript, then they have managed to inject extra HTML or JavaScript behaviour into our site:

[Screenshot: user enters script into the Name field.]

The hover for the hyperlink looks okay, but when the user clicks the link, malicious JavaScript can run:

[Screenshot: malicious JavaScript running.]

This is because the HTML that we have sent to the client's browser actually contains an onclick attribute that we didn't intend:

<a href="foo.html" title="Kirk" onclick="alert('Hi')">thing</a>

Encoding the user's data before sending it to the browser would have protected us from this, and then the HTML sent would look like this:

<a href="foo.html" title="Kirk&quot; onclick=&quot;alert('Hi')">thing</a>

Which correctly displays exactly what the user entered:

[Screenshot: tooltip now shows the complete text entered.]
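The same attack and fix can be sketched in a few lines of Python. This is an illustration only, using the standard library's html.escape as a stand-in for an attribute encoder; render_link is a hypothetical helper that builds the markup from the example above.

```python
import html


def render_link(title: str) -> str:
    """Build the link markup with the untrusted title attribute-encoded."""
    # quote=True (the default) also encodes double and single quotes,
    # so the value cannot close the title attribute early.
    return f'<a href="foo.html" title="{html.escape(title, quote=True)}">thing</a>'


# The name from the example: a double quote followed by an onclick handler.
untrusted = "Kirk\" onclick=\"alert('Hi')"
print(render_link(untrusted))
```

Note that html.escape goes slightly further than the output shown above: it also turns the single quotes into &#x27;, which matters when an attribute is delimited with single quotes instead of double quotes.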

URL Encoding

URL encoding is turning a string into a safe block of text for appending on the query string of a URL.

The original specification for HTTP URLs (RFC 1738) specifies that URLs should only include certain characters, and all others must be encoded. This is similar to the case of HTML encoding, but there is a much smaller set of characters allowed, and the way you encode them is different.

To encode a character for use in a URL, you use a percent sign followed by the two-digit hexadecimal value of that character. For example:

Original character       URL-encoded form
space                    %20
/ (forward slash)        %2F
" (double quote)         %22
? (question mark)        %3F

The above table shows a few examples of how to URL encode special characters. For a more complete reference, see Brian Wilson's URL Encoding page.
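The rows of the table can be reproduced with a short Python sketch. It uses urllib.parse.quote from the standard library as an example percent-encoder (not the AntiXSS UrlEncode method), and percent_encode is a name made up for this example.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Percent-encode text for use as a single query string value."""
    # safe="" asks quote() to encode "/" too; by default it is left alone.
    return quote(text, safe="")


for char in [" ", "/", '"', "?"]:
    print(repr(char), "->", percent_encode(char))

# Decoding reverses the transformation.
assert unquote(percent_encode('a b/c"d?')) == 'a b/c"d?'
```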

We need to encode strings before appending them to a URL, to make sure that untrusted input is not able to change the URL.

For example, if our page above constructed a URL to search Google for the name the user entered into the website, it could look like this:

[Screenshot: construct a search URL by joining two strings together.]

When the user clicks the link, they will search Google for their name.

Here the naive code is just constructing a URL by joining the two strings together:

User user = GetFromDatabase();

string url = "" + user.Name;

But if a name with spaces is entered, then we're generating an invalid URL:

[Screenshot: create a URL with spaces in it.]

The URL is invalid because it contains an illegal character - a space that should be encoded as %20.

We could also be opening our users up to cross-site scripting bugs, because we are effectively letting them create any URL they want. For example:

[Screenshot: create a URL with ampersands in it.]

Here we are appending the ampersand (&) that the user entered directly to the end of the URL, so rather than their text being passed to the server as the "q" parameter, we're letting them add other query string parameters (in this case, the "I'm feeling lucky!" button). The solution in this case is to encode the ampersand as %26.
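To make the parameter-injection problem concrete, here is a small Python sketch. The search address (example.com), the extra parameter name and the helper names are all placeholders invented for this illustration; they are not the URL or input from the screenshots.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Placeholder search endpoint, used here only for illustration.
SEARCH_BASE = "https://example.com/search?q="


def naive_search_url(name: str) -> str:
    """Join the strings together with no encoding (the vulnerable version)."""
    return SEARCH_BASE + name


def encoded_search_url(name: str) -> str:
    """Percent-encode the untrusted name before appending it."""
    return SEARCH_BASE + quote(name, safe="")


def query_parameters(url: str) -> dict:
    """Return the query string parameters a server would see for this URL."""
    return parse_qs(urlsplit(url).query)


# A name containing a space and an ampersand that smuggles in a parameter.
untrusted = "Kirk Jackson&extra=1"

print(query_parameters(naive_search_url(untrusted)))
print(query_parameters(encoded_search_url(untrusted)))
```

With the naive version the server sees two parameters, q and extra. With the encoded version it sees a single q parameter whose value is the complete text the user typed.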

The AntiXSS library

The AntiXSS library (currently at version 3.0 beta) has been built by the Microsoft ACE Security and Performance Team [ooops! By the Connected Information Security Group, sorry!]

The library provides two related functions:

  • Encoding methods to make text safe for a variety of contexts
  • An HttpModule (the Security Runtime Engine) to automatically encode your ASP.NET controls

I'll cover the Security Runtime Engine HttpModule in another post.

The encoding methods have been built using more robust and secure coding practices than the existing methods in the HttpUtility class of the .NET framework, so you should use them in preference when encoding your data.

public class AntiXss
{
    // Public encoding methods (signatures only, bodies omitted)
    public static string HtmlAttributeEncode(string input);
    public static string HtmlEncode(string input);
    public static string JavaScriptEncode(string input);
    public static string UrlEncode(string input);
    public static string VisualBasicScriptEncode(string input);
    public static string XmlAttributeEncode(string input);
    public static string XmlEncode(string input);
}

You need to decide which context you're outputting text into, and then choose the appropriate method to encode the text.

  • HtmlEncode - use for all HTML output, except for when you're adding text inside an attribute of a tag (e.g. use for <b>...</b>)
  • HtmlAttributeEncode - use for text that will appear inside attributes of tags (e.g. <a title="...">)
  • UrlEncode - use for text that you are appending as a value in a url query string (e.g.
  • JavaScriptEncode - use when you want to put the string into a JavaScript variable (e.g. var foo = '...'). This method will also create the surrounding quotes.
  • VisualBasicScriptEncode - use if you're unlucky enough to be creating pages with VBScript on them
  • XmlEncode, XmlAttributeEncode - the XML equivalents of the above HTML methods
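The idea of picking an encoder per output context can be sketched independently of AntiXSS. The following Python example is only an analogy built on the standard library: the context names, ENCODERS and encode_for are invented for this illustration, and it covers just three of the contexts listed above.

```python
import html
from urllib.parse import quote

# One encoder per output context. The same input needs different treatment
# depending on where in the page it will end up.
ENCODERS = {
    "html": lambda text: html.escape(text, quote=False),
    "html_attribute": lambda text: html.escape(text, quote=True),
    "url": lambda text: quote(text, safe=""),
}


def encode_for(context: str, text: str) -> str:
    """Encode untrusted text for the named output context."""
    if context not in ENCODERS:
        raise ValueError(f"unknown output context: {context!r}")
    return ENCODERS[context](text)


name = 'Kirk "<b>" & co'
for context in ENCODERS:
    print(f"{context:15} {encode_for(context, name)}")
```

Notice that the "html" result still contains raw double quotes. That is fine between tags, but it would be unsafe inside an attribute, which is why the context has to be decided before encoding.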

To use inline in your ASPX page, you can call the library methods directly:

<a href="foo.html" title="<%= AntiXss.HtmlAttributeEncode(User.Name) %>">thing</a>

To use from your code-behind, decide whether your control outputs its content as an attribute or in an HTML context, and then call the appropriate method:

Label1.Text = AntiXss.HtmlEncode(User.Name);

Deciding which context you're in and which encoding method to use is a major annoyance, so be sure to look at the Security Runtime Engine which does it for you. I'll write more about that in a future blog post, so please subscribe to my RSS.

Hopefully this article has helped you understand what encoding is; why you need to encode untrusted input and how that helps prevent cross-site scripting; and has given a little intro to the AntiXSS library.


Thursday, April 30, 2009 1:32:36 PM (New Zealand Standard Time, UTC+12:00)
Excellent article. Thanks for the concise explanation of the different types of encoding and the first line of defense against XSS.