In particular, the XML data has a very high "tag density": there are a lot of tags relative to the amount of text between them. The XML files involved aren't small, either; we're talking fifty kilobytes or so for a small file.
I've been trying to figure out if I can do the preprocessing efficiently enough for the project to be feasible, so I built a number of prototypes with one of Perl's faster XML parsers (XML::SAX::ExpatXS). The vanilla version wasn't fast enough, so I tried a few things: I combined some of the steps, and I came up with ways of storing partially processed data in a SQL database or as Perl code. The combining helped a little; the alternate storage techniques didn't.
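For flavor, here's a rough Python analogue of what those prototypes looked like. The real ones were Perl, driving XML::SAX::ExpatXS; the handler and the little document below are made up, but the event-driven shape is the same, and it shows why tag-dense input hurts: every tag fires a callback, so the parser spends almost all its time dispatching events rather than moving text.

```python
import io
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Streams through the document, firing a callback per element --
    the same event-driven shape as a SAX preprocessing pass."""
    def __init__(self):
        super().__init__()
        self.tags = 0        # one callback per start tag
        self.text_chars = 0  # actual payload text seen

    def startElement(self, name, attrs):
        self.tags += 1

    def characters(self, content):
        self.text_chars += len(content.strip())

# A toy "tag-dense" document: four elements wrapping ten characters of text.
doc = "<doc><p>Hi <b>there</b></p><p>Bye</p></doc>"
handler = TagCounter()
xml.sax.parse(io.StringIO(doc), handler)
print(handler.tags, handler.text_chars)  # many events, little text
```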
So finally I wrote a version that cheated. Horribly. And naturally, it blew the competition out of the water.
("full", by the way, includes several extra XML processing steps that all of the others assume have already been performed; if the slowness is in my code, then "full" should be a lot slower than the others.)
If you can't read that, here's the gist: the fastest alternative takes 1,353 times as long as the "nasty" version.
So this leaves me with a bit of a conundrum.
The "nasty" version literally reads the XML line by line and transforms it with a regular expression. It's blazing fast compared to the versions that do it "right", but it's also much less safe: if something unexpected sneaks into the data, it could screw up horribly. I can block most of the constructs a surprise could hide in (stylesheets, scripts, comments, CDATA sections), but if I make a mistake, the consequences could be worse with "nasty".
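To make the trade-off concrete, here's a hedged sketch of the idea in Python (the real version is Perl, and the tag being rewritten here is invented for illustration): rewrite one known-simple pattern per line, and refuse to fast-path any line containing a construct that could hide trouble.

```python
import re

# Constructs the fast path refuses to touch (comments, scripts,
# stylesheet PIs, CDATA sections) -- the list is illustrative.
SUSPICIOUS = re.compile(r'<!--|<script|<\?xml-stylesheet|<!\[CDATA\[', re.I)

# Hypothetical transform: turn <em>...</em> into [...] brackets.
EMPH = re.compile(r'<(/?)em>')

def nasty_transform(lines):
    out = []
    for line in lines:
        if SUSPICIOUS.search(line):
            # Bail out rather than guess; the caller can fall back
            # to the slow, correct parser for this document.
            raise ValueError("unexpected construct; refusing to fast-path")
        out.append(EMPH.sub(lambda m: ']' if m.group(1) else '[', line))
    return out

print(nasty_transform(["<p>an <em>important</em> point</p>"]))
```

The speed comes from never building a parse tree or firing per-tag events; the danger is exactly that a pattern the regexes don't anticipate slips past the blocklist.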
On the other hand, none of the non-"nasty" versions are anywhere near fast enough to be run every time a user needs to access the document, and I don't think any optimization tricks are gonna help much here—parsing tag-dense XML properly is just a very slow process. So it seems like the non-"nasty" versions aren't really viable anyway.