View Full Version : Preventing Malicious HTML~
FuRom
09-28-2007, 06:15 AM
Well, in the php security thread, Tigress said to not use BBcode and just use the HTML Purifier library (http://htmlpurifier.org/). I downloaded it, tested it, and well... I'm lazy and didn't feel like sorting through all of the broken includes, which led me to writing this. Theres a much simpler way to make sure only safe HTML is used on your site. You just use the strip_tags() function built into php. I took it a little bit further to make the coding cleaner. Here:
<?php
function tag_stripper($text, $allowed_tags){
foreach($allowed_tags as $k){
$allowed_tags = $allowed_tags.$k;
}
return strip_tags($text, $allowed_tags);
}
// Example of usage
$board_tags = array(
'<p>',
'<a>',
'<img>',
'<br>');
$text = '<b>Blah....</b> <br><br><br> BRed down here! <a href="http://www.php.net">Gah!</a>';
echo strip_tags($text, $board_tags);
?>
Enjoy fellow PHP junkies.
................... Credits [//:]
I learned this trick from actually taking the time to read the kronikpets source code instead of making a wannabe neopet clone that lacked features.
// Edit -----------------------------------
Well, it turns out the errors in HTML Purifier I had issues with were just do to misunderstanding or something I guess. I figured it had example php codes with it. I'm still not advocating HTML Purifier though, because it seems too pointless. Yes, it does fix the HTML coding up to what I assume would be w3 standards, but the fact is w3 standards aren't much to really think about when you only want to allow html in place of BBcode in a message board system or a guild system.
Tigress
09-28-2007, 02:17 PM
That is extremely insecure. What if you want to allow image tags, but don't want people placing XSS in the image href? This is one of the most simple XSS attacks, and your function does nothing then.
What happens if the user leaves broken tags? Again, you're shit out of luck - strip_tags is useless here. What's more, broken tags in combination with strip_tags will eat up user content, and they'll be wondering where the rest of their page went.
Seriously, people: Don't bother giving yourself headaches over trying to write your own XSS-prevention and HTML cleaning functions. Use HTMLPurifier. It's the best HTML-cleaning library I've seen. "Well w3 standards aren't necessary" is a poor excuse not to use it - it's a huge library, but it excels at what it does, and if you want extremely secure HTML then taking ten minutes to learn to use HTMLPurifier properly is well worth your time.
OwlManAtt
09-28-2007, 07:09 PM
The problem with strip_tags is that the whitelist is *only for tags*, and not for attributes. Whitelisting elements means you can put any attribute on it willy-nilly.
Let's consider:
$whitelist = array(
'p',
'a',
'br',
'b',
'i',
'u',
's',
'img',
'span',
'center',
);
$clean_content = strip_tags($content,$whitelist);
This will strip out any HTML tags not in the whitelist, and it may end up taking out more than you bargained for if the HTML is malformed (unbalanced tags, unclosed quotes, etc). So, you're protected against <script> - that means no javascript can get through and you're golden, right?
Wrong! Several of the whitelisted tags support the javascript callback attributes (onMouseOver, onClick, etc.). These take javascript as their value - so a client can end up executing malicious javascript just by mousing over a link (and potentially just by loading a page, depending on how poorly your browser implements the spec).
Then, several of the tags will take style as an attribute. Strip_tags will not strip out CSS that will break your layout. A user can position things over your ads, add in fake navigation links (your bank takes a PIN? oops, someone went to the bank from my profile and didn't notice the URL was hahayouarehacked.com/bank.php).
Also in the not-breaking your layout category, it will not remove unopened/unclosed tags like HTMLPurifier will - so someone can slap a few </div></div></div>s in, and if <div> is in your whitelist...you get the idea - it applies to any whitelisted tag.
HTMLPurifier solves these problems by being aware of attributes and what values they take. It knows what attributes should have URIs supplied, and it will remove attributes with malicious URIs.
Consider:
javascript://malicious_code();
vbscript://malicious.vb
Both of these are valid URIs that can be put in an <a href=''></a> or an <img src='' />. Most browsers support javascript://, and IE supports vbscript:// (as brain-dead as that sounds...I am not aware if it supports similar URIs for other scripting languages).
So, again, simply by clicking a link or loading a page with an image, your clients may execute malicious javascript.
HTMLPurifier aims to solve these problems by being aware of the HTML/CSS specs in their entirety, and not in the half-asses way strip_tags works.
All of that said, I agree that HTMLPurifier is bloody *huge*. But understand that parsing HTML is *not* a simple task. HTML is a lot more complicated than people make it out to be, especially when you cannot assume that the HTML is anything even approaching valid. (If you know it's valid, you can just load it via the DOM and half of your work is done - no unbalanced tags to throw your parser off!)
People sidestep the whole issue of being aware of HTML with BBCode - it implements a very, very small subset of what HTML does in a simple manner. Tags don't have attributes. There is no scripting language to worry about. There is no styling to worry about. url=foo is always just what it says.
I believe Kronikpets took the BBCode route. It did an htmlentities() on the content to turn any existing HTML into harmless text, then went through and applied huge regular expressions to transform BBCode into HTML. That's not a bad approach, but you can't use something cool like TinyMCE with it, and it forces people to learn Yet Another Dialect of BBCode when we already have a standard that solves all of these formatting problems and more.
There are good arguments for and against using HTMLPurifier (or a similar HTML cleaner - WordPress uses Kses, and there are two or three other popular ones) over BBCode. But, trying to compare HTMLPurifier to strip_tags is apples to oranges - their scopes are very, very different.
FuRom
09-28-2007, 08:39 PM
That is extremely insecure. What if you want to allow image tags, but don't want people placing XSS in the image href? This is one of the most simple XSS attacks, and your function does nothing then.
What happens if the user leaves broken tags? Again, you're shit out of luck - strip_tags is useless here. What's more, broken tags in combination with strip_tags will eat up user content, and they'll be wondering where the rest of their page went.
Seriously, people: Don't bother giving yourself headaches over trying to write your own XSS-prevention and HTML cleaning functions. Use HTMLPurifier. It's the best HTML-cleaning library I've seen. "Well w3 standards aren't necessary" is a poor excuse not to use it - it's a huge library, but it excels at what it does, and if you want extremely secure HTML then taking ten minutes to learn to use HTMLPurifier properly is well worth your time.
Don't allow IMG tags if you're worried about XSS attacks =O! Anyways, also, if you're only doing a bbcode, you're not going to be using anything that's too serious HTML-wise. You'd generally use <b>, <p>, <i>, <font>, and such simple things generally. You wouldn't be allowing the following in most cases:
<script>, <object>, <applet>, <embed> and <form>
I see the HTML purifier scripts as just a headache to use and implement. Using the strip_tags is just an alternative to those that are technical enough to implement HTML purifier. Personally, I just don't feel that 700kb of data is needed to make your site secure in any way.
OwlManAtt
09-28-2007, 09:00 PM
As per the HTML4 spec, many of those simple formatting tags support javascript events and styles. (http://www.w3.org/TR/html4/struct/text.html#h-9.2.1)
FuRom
09-28-2007, 09:10 PM
As per the HTML4 spec, many of those simple formatting tags support javascript events and styles. (http://www.w3.org/TR/html4/struct/text.html#h-9.2.1)
u_u Simple fixing.
$text = preg_replace("/ style=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/ onclick=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/ onmouseover=([\\\"\\']).*([\\\"\\'])/", "", $text);
and such added into the function.
//Edit 12:08 AM Sept 29
I would like to further show exactly what I'm doing with the strip tags, name one thing that would be disastrous that could happen when I filter like this and I'll take back my statement about HTML purifier being too much code for a such a job. Also by disastrous, I mean something where someone can completely ruin my site security-wise such as stealing or screwing up database information. I kinda assume preventing the use of header information in a image would be a daunting task to take on as it would require you to make your server connect to another server and that uses up processes I do believe.
function tag_stripper($text, $allowed_tags, $allow_style = 'no'){
foreach($allowed_tags as $k){
$allowed_tags = $allowed_tags.$k;
}
if($allow_style == 'no'){
$text = preg_replace("/style=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/id=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/class=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/title=([\\\"\\']).*([\\\"\\'])/", "", $text);
}
$text = preg_replace("/lang
$text = preg_replace("/dir=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onclick=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/ondblclick=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onmousedown=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onmouseup=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onmouseover=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onmousemove=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onmouseout=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onkeypress=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onkeydown=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/onkeyup=([\\\"\\']).*([\\\"\\'])/", "", $text);
$text = preg_replace("/javascript:/", "", $text);
$text = preg_replace("/vbscript:/", "", $text);
return strip_tags($text, $allowed_tags);
}
Tigress
09-29-2007, 12:04 AM
Don't allow IMG tags if you're worried about XSS attacks =O!
The entire point of all this effort is to give users the freedom to use more HTML, not take it away from them. There are obvious limits, but if you are remotely competent (i.e. know how to use an HTML cleaning/XSS preventing library) there is no excuse for not allowing people to use things like images in the appropriate locations. For example: Let's say you want to allow images on people's profiles. What then? Or how about you want to give them personal pages. It'd be kind of stupid to keep them from posting images on their pages, wouldn't it?
In regards to your new function: what about character encoding? There are a million ways to get around simple filters like yours. Rather than reinventing the wheel, use HTMLPurifier (or some other competent XSS-removing library. HTMLPurifier is the only one that cleans up code to w3 standards, and there's a few other advantages to it as well, but anyways). Complexity? You really haven't the slightest clue what you're doing. Here's how many lines of code it takes to use HTMLPurifier:
require_once('../includes/libraries/htmlpurifier/HTMLPurifier.auto.php');
$purifier = new HTMLPurifier();
$content = $purifier->purify($content);
Yeah. That's so complicated, right?
FuRom
09-29-2007, 12:25 AM
Ugh, I hate arguing a point.
There are a million ways to get around simple filters like yours.
Prove it, don't just state it. Unless you show precedent, your words are unfounded.
require_once('../includes/libraries/htmlpurifier/HTMLPurifier.auto.php');
$purifier = new HTMLPurifier();
$content = $purifier->purify($content);
Yeah, thats pretty much all it takes to use it, but what about when you want to do less than what it does? I know I hate caching, I'm sure there are others out there that wouldn't want caching to take place either, even though it does make things happen faster. Having perfect w3 HTML standards in any user submitted system (especially message boards) is not the #1 thing in the world. There are other matters that are more pressing for the everyone in the php programming community. Also, I'mma tell you specifically why I don't like HTML purifier. The pet site I'm creating already has a an object orientation in it. Using multiple objects globally becomes confusing, not to mention implementing HTML purifier in something that is already cleanly coded with object orientation makes it harder to keep the code legible. Also, I know I personally would want people just slapping in any type of HTML they want in my message boards, I wouldn't want people making full bodied tables in my message board for the simple fact the layout would probably end up stretched out more than I'd want. I wouldn't want people posting style tags and other crap. I'm not saying that HTML purifier isn't good, I'm just saying it's a bit excessive.
Yeah. That's so complicated, right?
Don't be childish and immature, you demean yourself that way.
Tigress
09-29-2007, 12:47 AM
Ugh, I hate arguing a point.
Well, this is what you got yourself into when you tried to act as if a highly insecure function like the one in the OP would produce trusted HTML.
Prove it, don't just state it. Unless you show precedent, your words are unfounded.
...Did you even read my post? I said: Character encoding. There are other ways as well. See here for an entire list of XSS attacks, most of which your filter wouldn't handle. (http://htmlpurifier.org/live/smoketests/xssAttacks.php)
Yeah, thats pretty much all it takes to use it, but what about when you want to do less than what it does? I know I hate caching, I'm sure there are others out there that wouldn't want caching to take place either, even though it does make things happen faster.
Then turn off caching. It takes another few simple lines of code to do that:
$config = HTMLPurifier_Config::createDefault();
$config->set('Core', 'DefinitionCache', null);
$purifier = new HTMLPurifier($config)
It might seem slightly more complex than before - except for the fact that I copied and pasted it straight from HTMLPurifier's documentation.
Having perfect w3 HTML standards in any user submitted system (especially message boards) is not the #1 thing in the world. There are other matters that are more pressing for the everyone in the php programming community.
No, it isn't, but it's not as if this library takes you hours and hours to implement. Ten minutes tops, and you get XSS removal, broken tag removal, AND cleaning up to w3 standards. It doesn't require jumping through hoops. It doesn't require you to even think about what you're doing. Merely the time it would take to write a complete XSS-removing function yourself, when you could simply use a library, proves that "matters that are more pressing" aren't really that much of a concern to you.
Also, I'mma tell you specifically why I don't like HTML purifier. The pet site I'm creating already has a an object orientation in it. Using multiple objects globally becomes confusing, not to mention implementing HTML purifier in something that is already cleanly coded with object orientation makes it harder to keep the code legible.
Adding anything at all to code makes it less legible. Deal with it. Kitto uses HTMLPurifier, is object-oriented, and is perfectly legible.
Also, I know I personally would want people just slapping in any type of HTML they want in my message boards, I wouldn't want people making full bodied tables in my message board for the simple fact the layout would probably end up stretched out more than I'd want. I wouldn't want people posting style tags and other crap.
That's just laziness. If you really want to, you CAN use strip_tags in conjunction with HTMLPurifier (that's what I do). HTMLPurifier will produce clean, XSS-free content, and then you can use strip_tags to remove the tags that you don't want. No user content getting eaten up by broken tags, and everything that's left behind is safe.
I'm not saying that HTML purifier isn't good, I'm just saying it's a bit excessive.
Oh really?
HTML Purifier, I'm not gunna say this again, is a bullshit system!
it ain't that great
Yeah. Sure.
Don't be childish and immature, you demean yourself that way.
It's sarcasm, not immaturity. I'm simply driving my point home with a little more force =]
OwlManAtt
09-29-2007, 01:18 AM
$text = preg_replace("/javascript:/", "", $text);Internet Explorer, in its infinite wisdom, thinks of these as the same thing:
javascript://alert('owned');
jav
ascript://alert('owned');
You need to perform more validation on the protocol than just that. It would probably be better to whitelist it instead of blacklisting it (so when aim:// is discovered to be exploitable you aren't scrambling to block it).
FuRom
09-29-2007, 06:26 AM
Oh really?
http://www.orlyowl.com/upload/files/yes_rly.jpg
Yeah. Sure.
It's bullshit that it's so friekin big just to do that little bit of work! <_<
Anyways, I'm further testing it. I might change my mind on if I like it.......... who knows, who cares though?
It's sarcasm, not immaturity. I'm simply driving my point home with a little more force =]
Sarcasm ain't nothing but immaturity in most cases. I'd know, I'm the most immature and mean jackass in the world. =O!
Points of hate during testing so far:
Having to actually stick to w3 standards in my own layout with such useless blowfish codes as this.... at least I learned WTF it was for I guess...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
My gawd! That's too much code... a whopping 700+ kb.
The fact you actually have to read docs to get it implemented..... <_< what ever happened to copy and paste library examples slapped into the main folders?! I coulda sworn it was a tradition =O!
Even though generally adding code is messy when adding it to a script that is already fairly coded, this takes messy to a whole new level. My scripts were so nice and neat until I pull this out of it's zip archive. Finding a way to implement it without having it hurt my eyes is a daunting task..... <_<' My code was so nice and neat! A couple hundred thousand lines worth of clean coding castrated in only 12 lines!
Meh, that's my personal opinions, get over it.
Tigress
09-29-2007, 01:56 PM
You are honestly clueless and you're just grabbing at arguments at this point because you don't want to admit that you have no reason to have such a vendetta against HTMLPurifier. I've already argued against all of the points you made in this post; either you can't read or you're willfully ignorant. In either case, "personal opinions" don't justify willful stupidity.
Do you have the slightest understanding of what goes into parsing and purifying HTML? You're looking at the issue like a complete non-coder might look at a website: It doesn't look that advanced to him. It's not supposed to - it's just meant to give him a nice, uncomplicated browsing experience. There is no way for him to even begin to perceive the thousands of lines of code that exist under the hood, or the true complexity of the site he's browsing. You're looking at HTMLPurifier and other HTML-parsing libraries just like he is. Trust me, if HTMLPurifier could be any simpler while retaining the same functionality, it would be. But HTMLPurifier's priority is doing a thorough and quality job, not keeping its size down.
And you continue to rant about the ~700kb filesize even though I have explicity stated to you that HTMLPurifier only includes the files it needs. No, I don't know how many kb that is, but it's a hell of a lot smaller than the 700kb atrocity you're whining about. If you use every single function and capability available in HTMLPurifier, then yes, you will probably include several hundred kb of files. But if you're using its most basic functionality (and in most cases that's all you need), you're including far less than that.
I'm not going to start on you labelling a filetype/charset declaration as "useless blowfish" codes. Then again, you're the same guy who thinks people should upgrade their "crapware" and then ranted about how your Wii didn't work with your TV. Congratulations: You've learned first-hand why standards compliance is important!
And finally, what the fuck are you doing when unzipping the archive? Smattering the files liberally throughout your already-existing directory structure? Here's a hint: Put the contents of the archive into their own directory and never look at the directory again. It's not that hard. But apparently, you're just incredibly incompetent. Way to go.
Let's calm down and take a breath every once and a while. Ok?
jlp09550
09-29-2007, 07:38 PM
I agree. This is just coding; everyone is different in that case. It's okay to get angry sometimes (a lot for me) but you still gotta breathe everyone once in a while. That's what you have to do to code anyways. xD
vBulletin® v3.7.2, Copyright ©2000-2009, Jelsoft Enterprises Ltd.