
The wrong answer is the right answer

So, I'm working on a programming project that I will only publicly call "Knightley". (I've spoken much more about it to a few select friends, but it's a relatively clever idea, so I'm not willing to blab about it to just anybody.) It's a web-based project that involves, among other things, annotating user-submitted HTML with custom XML tags and later preprocessing it before sending it to users.

In particular, the XML data has a very high "tag density": that is, the number of tags relative to the amount of text between them is very high. The XML files involved aren't small, either; we're talking fifty kilobytes or so for a small file.
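To make "tag density" concrete, here's a rough way to measure it: compare the bytes spent on markup to the bytes of actual text. The snippet below is invented for illustration (the real Knightley tags aren't public), and the sketch is in Python rather than the project's Perl.

```python
import re

# Hypothetical tag-dense snippet: lots of markup, little text between tags.
snippet = '<p><w id="1">So</w> <w id="2">finally</w> <w id="3">I</w></p>'

text_len = len(re.sub(r'<[^>]+>', '', snippet))  # bytes of actual text
tag_len = len(snippet) - text_len                # bytes spent on markup
print(tag_len, text_len)                         # markup far outweighs text
```

At this kind of ratio, a proper parser spends nearly all of its time on callbacks and bookkeeping rather than on the text itself.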

I've been trying to figure out if I can do the preprocessing efficiently enough for the project to be feasible, so I built a number of prototypes with one of Perl's faster XML parsers (XML::SAX::ExpatXS). The vanilla version wasn't fast enough, so I tried a few things—I combined some of the steps and I came up with ways of storing partially-processed data in a SQL database or as Perl code. The combining helped a little; the alternate storage techniques didn't.
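The prototypes used Perl's XML::SAX::ExpatXS, an event-driven (SAX-style) parser. For readers unfamiliar with the model, here's a minimal sketch using Python's expat-backed `xml.sax` module, which works the same way: every start tag, end tag, and text run fires a callback, which is exactly where tag-dense input gets expensive.

```python
import io
import xml.sax

# Minimal SAX handler: count tags and character data as events arrive.
class TagCounter(xml.sax.ContentHandler):
    def __init__(self):
        self.tags = 0
        self.chars = 0

    def startElement(self, name, attrs):
        self.tags += 1          # one callback per opening tag

    def characters(self, content):
        self.chars += len(content)  # may fire multiple times per text run

handler = TagCounter()
xml.sax.parse(io.StringIO('<doc><w>hi</w><w>yo</w></doc>'), handler)
print(handler.tags, handler.chars)
```

With dense markup, the per-event overhead dominates no matter how fast the underlying expat library is, which is consistent with the benchmark results below.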

So finally I wrote a version that cheated. Horribly. And naturally, it blew the competition out of the water.


("full", by the way, includes several extra XML processing steps that all of the others assume have already been performed; if the slowness is in my code, then "full" should be a lot slower than the others.)

If you can't read that, here's the gist: the fastest alternative takes 1,353 times as long as the "nasty" version.

So this leaves me with a bit of a conundrum.

The "nasty" version literally reads the XML line by line and transforms it with a regular expression. It's blazing fast compared to the versions that do it "right", but it's also much less safe: if something unexpected sneaks into the data, it could screw up horribly. I can block most of the constructs such a thing could be hidden in (stylesheets, scripts, comments, CDATA sections), but if I make a mistake, the consequences could be worse with "nasty".
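The actual regex isn't shown in the post, but the technique looks something like this sketch (in Python rather than Perl, with an invented `<note>` tag and output markup standing in for the real annotations):

```python
import re

# Hypothetical custom tag rewritten to HTML with a single regex,
# line by line, without ever building a parse tree.
NOTE_RE = re.compile(r'<note>(.*?)</note>')

def nasty_transform(lines):
    for line in lines:
        yield NOTE_RE.sub(r'<span class="note">\1</span>', line)

src = ['<p>Hello <note>world</note></p>\n']
out = list(nasty_transform(src))
print(out)
# Fast, but blind: a <note> inside a comment or CDATA section would be
# rewritten too, which is why those constructs have to be blocked upstream.
```

The speed comes precisely from what it skips: no tokenizing, no well-formedness checks, no tree, just a text substitution per line.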

On the other hand, none of the non-"nasty" versions are anywhere near fast enough to be run every time a user needs to access the document, and I don't think any optimization tricks are gonna help much here—parsing tag-dense XML properly is just a very slow process. So it seems like the non-"nasty" versions aren't really viable anyway.



Sep. 15th, 2006 08:38 am (UTC)
Well, if you're willing to invest the time in meticulously making sure the "nasty" system works, I'd say by all means go for it. Just make sure to give yourself plenty of time to play around with it, test it, and let other people abuse it for you, and probably have one of the other systems ready as a backup in case you need it.
Sep. 15th, 2006 12:47 pm (UTC)
Use the nasty system.

Just because it's nasty doesn't mean it's not the answer. Hacking, you know?

Just try and hack the hack. Make sure that you've got a decent back-up or recover-from-fuck-up in place.
Sep. 15th, 2006 02:35 pm (UTC)
Well, you can still use the slow alternative for batch conformance testing the fast nasty version, if you go that way.

Assuming the results of this regex transformation can be stored decently, such as back to a database for logging, the associated source could later be reparsed by the slow library. That would make for a decent sanity spot-check, either in development or in live maintenance.

You're going to be giving up XML validation.
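The suggestion above amounts to using the slow, correct pipeline as a batch checker for the fast one: run both on the same input and flag any disagreement. A sketch of that idea, with both transforms as invented stand-ins for the real pipelines:

```python
import re
import xml.etree.ElementTree as ET

def fast_transform(doc):
    # Stand-in for the regex-based "nasty" path.
    return re.sub(r'<note>(.*?)</note>', r'<em>\1</em>', doc)

def slow_transform(doc):
    # Stand-in for the parser-based path: build a tree, rewrite, serialize.
    root = ET.fromstring(doc)
    for el in root.iter('note'):
        el.tag = 'em'
    return ET.tostring(root, encoding='unicode')

def spot_check(doc):
    # Disagreement means the fast path mishandled this document.
    return fast_transform(doc) == slow_transform(doc)

ok = spot_check('<p>Hello <note>world</note></p>')
print(ok)
```

Run offline over logged inputs, a check like this catches regex mistakes without ever putting the slow parser on the hot path.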
Sep. 15th, 2006 02:48 pm (UTC)
Damn, dude... you're like an epic hacker to me. The cleverest thing I've done with code this summer was a PHP/MySQL dynamic event calendar for a website. (And it *was* rather clever, especially the logon feature, but I digress...)
Sep. 16th, 2006 02:55 am (UTC)
Epic she says! ^o^
Sep. 16th, 2006 03:46 am (UTC)
The number of cutegirls proclaiming Brent's epic-ness is increasing.