Like most people who've had to deal with nasty HTML markup, I've used Dave Raggett's HTML-Tidy
utility in one form or another, most recently as a built-in part of the HTML-Kit
editor. It's always done a remarkable job of making even the nastiest HTML usable.
When I started work on a recent project
I found myself having to deal with all kinds of horrible HTML, and all
of it horrible in different and unpredictable ways. My naive
attempts to write a sanitizer were going nowhere fast when octopod in #ruby
asked if I'd ever come across TidyLib.
It turns out that when Dave turned HTML-Tidy loose it got
picked up and maintained by a group of people who created a neat
open-source library, to which others have added bindings for their own
languages. There's even a TidyLib RubyGem
for those enlightened folks who use that particular language. So, at a stroke, I had all that evil markup validating as XHTML-Strict
and was saved from a world of hurt. Deep, deep joy!
There was just one fly in the ointment. One particular item ended
up with some stray non-SGML characters in it and I traced the problem
back to the output from TidyLib. My heart started sinking.
Reading through the archives of the TidyLib mailing list, I couldn't see anything relevant, but I did find out that there is a #tidy
channel over on FreeNode. There I spoke to Björn Höhrmann
who, it turns out, has been maintaining TidyLib for the last three
years. Even though he's not a Ruby coder, Björn downloaded the gem
code, started comparing it to the library source, and quickly narrowed it
down to a likely buffer overwrite in the gem code.
Then I had to go away for a couple of days. I came back today
ready to start pursuing a fix, only to find Björn several steps ahead of
me. In the interim he had spoken to Kevin Howe,
who maintains the gem. They worked together to isolate the bug, then Kevin patched it and updated the gem.
All I had to do was type "gem update tidy", sit back, and smile :-)
My special thanks to Björn for taking the time to look at the Ruby
code, spot the problem, and follow through. Star quality, man!