Brent Dax (brentdax) wrote,

The wrong answer is the right answer

So, I'm working on a programming project that I will only publicly call "Knightley". (I've spoken much more about it to a few select friends, but it's a relatively clever idea, so I'm not willing to blab about it to just anybody.) It's a web-based project that involves, among other things, annotating user-submitted HTML with custom XML tags and later preprocessing it before sending it to users.

In particular, the XML data has a very high "tag density": the number of tags is large relative to the amount of text between them. The XML files involved aren't small, either; we're talking fifty kilobytes or so for a small file.
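
To give a concrete picture of what that means (with invented tag names, since the real markup is part of what I'm not blabbing about), imagine nearly every phrase wrapped in an annotation element:

    <p><note id="n1">It was a</note> <note id="n2">dark and
    stormy</note> <note id="n3">night.</note></p>

At that density the markup easily outweighs the text, so a fifty-kilobyte file means a lot of tags.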

I've been trying to figure out if I can do the preprocessing efficiently enough for the project to be feasible, so I built a number of prototypes with one of Perl's faster XML parsers (XML::SAX::ExpatXS). The vanilla version wasn't fast enough, so I tried a few things—I combined some of the steps and I came up with ways of storing partially-processed data in a SQL database or as Perl code. The combining helped a little; the alternate storage techniques didn't.
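
For the curious, the "right" versions are all variations on a SAX filter. Here's a bare-bones sketch of the idea; the "note"-to-"span" rewrite is a made-up stand-in for the real processing:

    use strict;
    use warnings;

    package NoteFilter;
    use base 'XML::SAX::Base';

    # Rewrite our custom annotation element; everything else streams
    # through to the downstream handler untouched.
    sub _rewrite {
        my ($self, $el) = @_;
        if ($el->{Name} eq 'note') {    # "note" is hypothetical
            $el->{Name} = $el->{LocalName} = 'span';
        }
    }

    sub start_element {
        my ($self, $el) = @_;
        $self->_rewrite($el);
        $self->SUPER::start_element($el);
    }

    sub end_element {
        my ($self, $el) = @_;
        $self->_rewrite($el);
        $self->SUPER::end_element($el);
    }

    package main;
    use XML::SAX::ExpatXS;
    use XML::SAX::Writer;

    my $writer = XML::SAX::Writer->new(Output => \*STDOUT);
    my $parser = XML::SAX::ExpatXS->new(
        Handler => NoteFilter->new(Handler => $writer));
    $parser->parse_uri($ARGV[0]);

Every start tag, end tag, and text node becomes a method call, which is exactly why tag-dense input hurts so much.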

So finally I wrote a version that cheated. Horribly. And naturally, it blew the competition out of the water.

                   s/iter last3-pl combined-pl last3-sql combined-sql    full last3-xml combined-xml nasty
    last3-pl         1.69       --         -1%      -24%         -26%    -35%      -36%         -38% -100%
    combined-pl      1.67       1%          --      -23%         -25%    -34%      -35%         -37% -100%
    last3-sql        1.29      31%         30%        --          -3%    -14%      -15%         -18% -100%
    combined-sql     1.25      35%         33%        3%           --    -11%      -13%         -16% -100%
    full             1.11      53%         51%       16%          13%      --       -2%          -5% -100%
    last3-xml        1.09      55%         53%       18%          15%      2%        --          -4% -100%
    combined-xml     1.05      61%         59%       23%          19%      6%        4%           -- -100%
    nasty        7.76e-04  218042%     215036%   165922%      161091% 142766%   140318%      135294%    --

("full", by the way, includes several extra XML processing steps that all of the others assume have already been performed; if the slowness is in my code, then "full" should be a lot slower than the others.)

If you can't read that, here's the gist: the fastest alternative takes 1,353 times as long as the "nasty" version.
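
For anyone who wants to reproduce that kind of table, it's what Perl's Benchmark module prints; something like this generates it (the preprocess_* subs here are placeholders for my real variants):

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Slurp one representative tag-dense file.
    my $xml = do {
        open my $fh, '<', 'sample.xml' or die "open: $!";
        local $/;
        <$fh>;
    };

    # Run each variant for at least 10 CPU-seconds, then print the
    # matrix of relative speeds.
    cmpthese(-10, {
        'nasty'        => sub { preprocess_nasty($xml) },
        'combined-xml' => sub { preprocess_combined_xml($xml) },
        'full'         => sub { preprocess_full($xml) },
        # ...and so on for the other variants...
    });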

So this leaves me with a bit of a conundrum.

The "nasty" version literally reads in the XML line-by-line and transforms it with a regular expression. It's blazing fast compared to the versions that do it "right", but it's also much less safe—if something unexpected sneaks into the data it could screw up horribly. I can block most of the constructs such a thing could be put in (stylesheets, scripts, comments, CDATA sections), but if I make a mistake, the consequences could be worse with "nasty".

On the other hand, none of the non-"nasty" versions are anywhere near fast enough to be run every time a user needs to access the document, and I don't think any optimization tricks are gonna help much here—parsing tag-dense XML properly is just a very slow process. So it seems like the non-"nasty" versions aren't really viable anyway.

Thoughts?
Tags: hacking, knightley