One-pass parser

See also alternative parsers

Our current wikitext parser goes through many passes, with many regexps and a few explode()/implode()s. Not only is it kinda slow, it's prone to horrible frightening bugs when different levels interfere with each other (such as the URL-in-URL bugs).

One, or at least fewer, passes would be good.

Magnus has, according to rumour, written such a thing for WINOR. Could it be adapted? (It doesn't appear to handle nested italics & bold properly; may have other problems. Also doesn't touch HTML yet, but that's a separate step really.)

Yes, I did :-)

Actually, I was surprised myself how well it worked, considering the fast hack. Well, I guess writing in C++ is different to PHP after all. I do have "multiple passes", however; the nowiki tags are parsed in and out in additional steps. Also, the whole text is broken into lines and patched together again. Another step would have to be added for HTML proofreading, but is that really our job?

Would it make sense to call a C++-compiled parser from PHP? Or should I try and rewrite it in PHP? I'd prefer to write a Phase IV in C++, though. Magnus Manske 19:14 17 Mar 2003 (UTC)

Sure. I don't know offhand how to get php and c++ to talk to each other nicely, but I'm sure there's a way... --Brion VIBBER 19:38 17 Mar 2003 (UTC)

Having written many parsers before, I am poking around and exploring the idea a creating a one or two pass parser for parsing the wiki syntax.

Unlikely, but I wonder if anybody has ported LEX and YACC (Yet another compiler compiler) to PHP? Lex is a tokenizer, YACC takes a context free grammer and parses the source using callbacks to process each language element. See The Lex & Yacc Page. The wiki language is not entirely context free, but there may be workarounds for this.
metaphp has something very similar in spirit to lex&yacc. I've used it successfully to develop a javascript engine written entirely in php ( j4p5 ) -- Henri T. 7 July 2005
My son found this link about someones experience creating a parser in PHP: [1] - this has now been used by Dokuwiki (Archived 2006-12-21 at the Wayback Machine) - description can be found at http://wiki.splitbrain.org/wiki:parser - as it goes, it's pretty fast and more importantly scales fairly evenly to larger pages. Memory use can get significant but this could be avoided by streaming "tokens" in and out of an intermediate file.
If some ambitious developer, who knows the full wiki syntax really well, could define the grammer, it would serve as a first step to implementing this project. From a Unix shell do a man rcsfile an example of such a grammer. Even if a one pass parser does not come to fruition, such a description would be an important piece of documentation, and could be used as an aid in ensuring that the current scheme is debugged.
(Asside: The last thing these grammers are good for is helping a human understand a language. Likewise, if not written very carefully the code to do a one pass parse, a state machine in software, can be quite difficult to understand and maintain. It should be very fast.)
Wikipedia lexer is a LEX & YACC type description of the wiki syntax. NickP 13:55, 25 Feb 2004 (UTC)
Whether the introduction of a one-pass parser would have any significant impact on performance is yet to be seen. There is only one way to find out: Measure performance before, implement your changes, and measure the performance afterwards. For starters you could try to answer the following questions: How long is the average (or median or typical) processing time for a page today? And how much of that is spent in the parser? My uninformed guess is that a few passes of regexp parsing takes a few microseconds, while database and file accesses take milliseconds. Still, thinking about how to design a one-pass parser can have useful side effects, such as suggested improvements in wiki syntax. Best wishes. --LA2, 25 Feb 2004
This page has some commentary about why this might be tough for Wikis. Is this relevant? I don't know enough to judge for myself. Matthewsim 21:18, 20 Aug 2004 (UTC)