
Parser 2011/Stage 1: Formal grammar

From mediawiki.org

A provisional PEG grammar, being worked on as part of Extension:ParserPlayground, may end up developing into the main spec.

(back to Parser 2011/Parser plan)

Low-level tokens and structures


From tightest to loosest binding, except where wildly wrong. ;)

Tightly-bound tags

<!--.*?-->
<(hook name)(attrs)/>
<(hook name)(attrs)>.*?</(hook name)(sp)>

Nothing is expandable in those guys, except maybe in attrs (memory is hazy; check this!). If relevant, the hook needs to re-run the content through the parser (e.g. for ref or poem, which parse their contents).

Each of these guys can basically be thought of as a single unit; once we've parsed it, all that can change is that it might be replaced with an expanded structure after hook execution.
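As a rough illustration of the "single unit" idea, here is a minimal Python sketch that splits text into plain runs, comments, and hook tags using regexes modelled on the three patterns above. The hook names in `HOOK_NAMES` and the function name `split_tight` are made up for this example; the real set of hooks depends on which extensions are registered, and this is not how the actual parser is implemented.

```python
import re

# Hypothetical hook names for illustration only; the real list comes from
# whatever extensions are registered.
HOOK_NAMES = ("ref", "poem", "nowiki")

_hooks = "|".join(map(re.escape, HOOK_NAMES))
TIGHT_TAG = re.compile(
    r"<!--.*?-->"                                      # comment
    rf"|<(?P<sname>{_hooks})\b(?P<sattrs>[^<>]*)/>"    # self-closing hook tag
    rf"|<(?P<name>{_hooks})\b(?P<attrs>[^<>]*)>"       # opening hook tag ...
    r"(?P<body>.*?)</(?P=name)\s*>",                   # ... body, closing tag
    re.DOTALL | re.IGNORECASE,
)


def split_tight(text):
    """Split text into ("text" | "comment" | "hook", source) units."""
    units, pos = [], 0
    for m in TIGHT_TAG.finditer(text):
        if m.start() > pos:
            units.append(("text", text[pos:m.start()]))
        kind = "comment" if m.group().startswith("<!--") else "hook"
        # The whole match is kept as one opaque unit; nothing inside it
        # is inspected for further syntax.
        units.append((kind, m.group()))
        pos = m.end()
    if pos < len(text):
        units.append(("text", text[pos:]))
    return units
```

Note that the `{{x}}` inside a comment, or the `''c''` inside a hook body, never reaches the later stages as syntax: it stays inside the opaque unit until the hook decides what to do with it.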

Brace structures


In an ideal future world, we'll pretty much just have brace structures and plain text (or the equivalent in structure). Nesting and boundary behavior are relatively straightforward following the Preprocessor rewrite a couple of years back. Any inside text can be expanded via further nesting of templates/parser functions; parser functions can be given either structured or flattened text for processing.

{{ .. | .. }}
{{{ .. | .. }}}
[[ .. | .. ]]
[(url) ..]

Structure is enforced at the low level, and these bad boys can be nested. Parser functions / templates, template parameters, and links/images have a basic nestable structure. (Nesting for links is traditionally only used with captions, but should get interpreted sanely.)

Possibly links or URL links should be pushed out to another section.
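To make the nesting concrete, here is a small recursive sketch in Python that turns `{{ }}`, `{{{ }}}` and `[[ ]]` into a tree of pipe-separated parts. Everything in it (the node names `template`, `tplarg`, `link`, the function names, the tuple layout) is invented for illustration. It deliberately ignores the real Preprocessor's rules for long runs of braces, skips `[(url) ..]` links, and backtracks naively on unclosed openers, so treat it as a toy rather than a spec.

```python
# Longest opener first, so "{{{" is tried before "{{".
# Node kind names are hypothetical.
OPENERS = {
    "{{{": ("}}}", "tplarg"),
    "{{": ("}}", "template"),
    "[[": ("]]", "link"),
}


def _parse(text, pos, closer):
    """Parse from pos up to `closer` (or end of text if closer is None).

    Returns (parts, new_pos); parts is a list of pipe-separated parts,
    each a list of strings and (kind, parts) nodes. Returns None when a
    required closer is never found.
    """
    parts, cur, buf = [], [], []

    def flush():
        if buf:
            cur.append("".join(buf))
            buf.clear()

    while pos < len(text):
        if closer and text.startswith(closer, pos):
            flush()
            parts.append(cur)
            return parts, pos + len(closer)
        if closer and text[pos] == "|":
            flush()
            parts.append(cur)
            cur = []
            pos += 1
            continue
        for opener, (close, kind) in OPENERS.items():
            if text.startswith(opener, pos):
                sub = _parse(text, pos + len(opener), close)
                if sub is not None:
                    flush()
                    cur.append((kind, sub[0]))
                    pos = sub[1]
                    break
        else:
            # No structure could be completed here: keep the character
            # as literal text.
            buf.append(text[pos])
            pos += 1
    if closer:
        return None
    flush()
    parts.append(cur)
    return parts, pos


def parse_wikitext(text):
    """Return the top-level sequence of strings and nodes."""
    return _parse(text, 0, None)[0][0]
```

For example, `parse_wikitext("x{{a|{{{b}}}}}y")` gives a `template` node with two parts, the second of which contains a `tplarg` node, while an unclosed `{{` simply stays plain text.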


Loose structures


(HTML, table start/end tags)

<(tag)(attrs)>
</(tag)(sp)>
<(tag)(attrs)/>
{|(attrs)
|+ (caption)
|-
|-(attrs)
|..
|(attrs)|..
||...||...
|}
''
''' (???)

Attrs, and possibly tags, may be expandable inside.

Start/end tags may exist on different levels of template expansion, or be missing, or be wrong! Matching these back up is important at the next level of fixups.
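One possible shape for that fixup pass, sketched in Python: walk a flat token list with a stack, auto-close anything left open, and drop end tags that have no matching start. The token format (`("open", tag)`, `("close", tag)`, `("text", s)`) and the drop/auto-close policy are assumptions made for this example; nothing here settles what the real fixup rules should be.

```python
def balance(tokens):
    """Return tokens with every start tag closed and stray end tags dropped.

    Tokens are (kind, value) pairs with kind in {"open", "close", "text"}.
    This layout is hypothetical, chosen only for the sketch.
    """
    out, stack = [], []
    for kind, value in tokens:
        if kind == "open":
            stack.append(value)
            out.append((kind, value))
        elif kind == "close":
            if value in stack:
                # Close everything opened after the matching start tag,
                # then the matching tag itself.
                while stack:
                    top = stack.pop()
                    out.append(("close", top))
                    if top == value:
                        break
            # Otherwise: an end tag with no start tag; drop it.
        else:
            out.append((kind, value))
    # Anything still open at the end gets closed.
    while stack:
        out.append(("close", stack.pop()))
    return out
```

With this policy, misnested input such as open `b`, open `i`, text, close `b`, close `i` comes out as open `b`, open `i`, text, close `i`, close `b`.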

Line type tokens

^(={1,6})..\1
^[*#:;]+
^" "
----
(empty)

Note that things like headings and list items may come in via a template; check into the inline/block 'first char' mode stuff.
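The line-type patterns above translate fairly directly into a per-line classifier. The Python sketch below is one way to write it; the type names and the `classify` function are invented here, the check order is a guess, and it only looks at a single already-expanded line, so it says nothing about the template-boundary question.

```python
import re

# (type name, pattern) pairs, checked in order. Names are hypothetical.
LINE_TYPES = [
    ("heading", re.compile(r"^(={1,6})(.+?)\1\s*$")),
    ("list", re.compile(r"^[*#:;]+")),
    ("hr", re.compile(r"^-{4,}")),
    ("empty", re.compile(r"^\s*$")),
    ("pre", re.compile(r"^ ")),  # leading space, checked after "empty"
]


def classify(line):
    """Return the line type name, or "text" if no pattern matches."""
    for name, pattern in LINE_TYPES:
        if pattern.match(line):
            return name
    return "text"
```

For instance, `classify("== Title ==")` returns `"heading"` and `classify("*# item")` returns `"list"`.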

Free/magic markup

freelinks
ISBN foo
__TOC__ etc

Things that may be found sitting about. How many of these should be at the low-low level? What about template expansion boundaries?

Character references

&(char-name);
&#(digit)+;
&#x(hex-digit)+;

A char ref should never accidentally get treated as something else's syntax!
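A minimal Python sketch of recognizing the three forms as one atomic token and decoding them. The regex and the function name `decode_char_ref` are written for this example; named references are looked up in Python's `html.entities.name2codepoint` table, which is an assumption about which names count as valid and may not match MediaWiki's own list.

```python
import re
from html.entities import name2codepoint

CHAR_REF = re.compile(
    r"&(?:#[xX](?P<hex>[0-9A-Fa-f]+)"      # &#x(hex-digit)+;
    r"|#(?P<dec>[0-9]+)"                   # &#(digit)+;
    r"|(?P<name>[A-Za-z][A-Za-z0-9]*));"   # &(char-name);
)


def decode_char_ref(ref):
    """Return the character for a char ref, or None if it is not valid."""
    m = CHAR_REF.fullmatch(ref)
    if not m:
        return None
    if m.group("hex"):
        codepoint = int(m.group("hex"), 16)
    elif m.group("dec"):
        codepoint = int(m.group("dec"))
    else:
        codepoint = name2codepoint.get(m.group("name"))
    if codepoint is None or codepoint > 0x10FFFF:
        return None
    return chr(codepoint)
```

Because `&#123;&#123;` is consumed as two char-ref tokens, the decoded `{{` never gets a chance to be read as the start of a template.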

Question: is it possible to fill in the contents of a char ref via a template? Should it be? :)

Raw characters


(anything else with no special meaning :D)