User:Kephir/XML parse tree
The following is unofficial documentation of the XML parse tree format returned by Special:ExpandTemplates and by API modules such as API:Expandtemplates and API:Revisions when a generatexml argument is passed to the API call.
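As a rough illustration of the API route, the request URL could be assembled as in the sketch below. The endpoint URL and the function name expandtemplates_url are hypothetical, and the parameter set (action=expandtemplates with text and generatexml) is an assumption based on the modules named above; newer MediaWiki releases may expose the parse tree through a different parameter.

```python
from urllib.parse import urlencode


def expandtemplates_url(api_endpoint, wikitext):
    """Build a GET URL asking API:Expandtemplates for the XML parse tree.

    api_endpoint is a placeholder for the wiki's api.php address.
    """
    params = {
        "action": "expandtemplates",
        "format": "json",
        "text": wikitext,
        # Assumed flag name, taken from the description above.
        "generatexml": "1",
    }
    return api_endpoint + "?" + urlencode(params)


url = expandtemplates_url("https://wiki.example.org/w/api.php", "{{foo}}")
```

The sketch only builds the URL; sending the request and extracting the parse tree from the response are left out on purpose, since the response layout is not described here.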
DTD
<!DOCTYPE root [
<!ENTITY % mixed-markup "(#PCDATA|template|comment|h|possible-h|tplarg|ext|ignore)*">
<!ELEMENT root %mixed-markup; >
<!ELEMENT template (title, part*) >
<!ELEMENT tplarg (title, part*) >
<!ELEMENT part (#PCDATA|name|value)* >
<!ELEMENT title %mixed-markup; >
<!ELEMENT name %mixed-markup; >
<!ELEMENT value %mixed-markup; >
<!ELEMENT h %mixed-markup; >
<!ELEMENT possible-h %mixed-markup; >
<!ELEMENT comment (#PCDATA) >
<!ELEMENT ext (name, attr, inner?, close?) >
<!ELEMENT attr (#PCDATA) >
<!ELEMENT inner (#PCDATA) >
<!ELEMENT ignore (#PCDATA) >
<!ELEMENT close (#PCDATA) >
<!ATTLIST root
xml:space (preserve) #FIXED "preserve" >
<!ATTLIST template
lineStart CDATA #IMPLIED >
<!ATTLIST tplarg
lineStart CDATA #IMPLIED >
<!ATTLIST name
index CDATA #IMPLIED >
<!ATTLIST h
i CDATA #REQUIRED
level CDATA #REQUIRED >
<!ATTLIST possible-h
i CDATA #REQUIRED
level CDATA #REQUIRED >
]>
Elements
- root
- The root element. Has no interesting attributes by itself.
- Since whitespace is significant in reconstructing wiki markup, it is a good idea to parse the XML document as if root had an xml:space="preserve" attribute. MediaWiki does not specify it explicitly, however.
- template
- Indicates a template, variable, or parser function invocation ({{ ... }}). Must contain at least a title element, followed by optional part elements.
- The lineStart attribute is present and set to 1 if the template immediately follows a newline.
- It is impossible in general to determine whether the node represents a transclusion or a parser function/variable until the contents of <title> are expanded: {{ {{{foo|x2}}}|aye|nay}} expands to "nay" if foo is assigned "#if:", for one.
- API:Siteinfo provides several methods to gather the list of variables and parser functions (siprop=magicwords, siprop=variables and siprop=functionhooks), but none of them can be reliably used to recognise their precise syntax as of MediaWiki 1.24.
- tplarg
- Indicates a template argument reference ({{{ ... }}}). Its contents are just like those of template: a title element followed by optional part elements. The lineStart attribute has the same meaning as above.
- part
- Indicates a template argument (or default value for a template argument reference). Always contains a name and a value element, in that order, with an equal sign between them if the name is given explicitly. If the template argument is an implicitly numbered one, the name element will be empty and carry an index attribute specifying the index.
- For tplarg elements, only the first part child should be looked at to provide the default value; the rest are ignored. The split into name and value is disregarded.
- h and possible-h
- Indicates a header (=== ... ===). The level attribute contains the header level, while i contains the section number, regardless of level (the same number that the &section= query string parameter uses). <possible-h> tags appear only in the output of the hashtable-based parser (Preprocessor_Hash.php). They are created in place of <h> tags everywhere except at the highest level of the tree (below <root>). Otherwise they are mostly equivalent to <h>; note that template logic might make them not end up as actual headers in the fully-parsed page.
- ext
- Indicates a parser extension tag, such as <ref>...</ref>, <source>...</source> or <nowiki>...</nowiki>. Not all tags are parser extension tags; <b>...</b> or <table>...</table>, for example, are not. Which tags are considered parser tags depends on the MediaWiki installation. To obtain a list of extension tags, use API:Siteinfo with the siprop=extensiontags query parameter.
- This element always contains (possibly empty) name (tag name) and attr (attributes) child elements, optionally an inner element, and optionally a close element following it. The contents of attr need not conform to HTML or XML attribute syntax.
- If the parser tag is specified in a self-closing form (e.g. <nowiki/>), the ext element will lack inner and close child elements.
- ignore
- Indicates text to be ignored, usually a <noinclude>...</noinclude>, <onlyinclude>...</onlyinclude> or <includeonly>...</includeonly> tag and/or its contents.
- There is no option in the publicly available API to preprocess wikitext in transclusion mode, i.e. ignoring the contents of <noinclude>...</noinclude> while parsing <includeonly>...</includeonly> or restricting parsing to <onlyinclude>...</onlyinclude> (T51353, gerrit:168669).
- comment
- Indicates an HTML-style comment, i.e. <!-- ... -->. The contents of this element include the comment start mark (<!--) and end mark (-->).
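To make the element descriptions above more concrete, here is a minimal sketch that reads a small hand-written parse tree with Python's xml.etree.ElementTree. The sample document and the helper names (flat_text, template_arguments, headings, extension_tags) are made up for illustration; the sample only assumes the structure described in this section and was not produced by MediaWiki.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the element descriptions above.
SAMPLE = (
    '<root><h level="2" i="1">== Intro ==</h>\n'
    '<template lineStart="1"><title>foo</title>'
    '<part><name index="1"/><value>bar</value></part>'
    '<part><name>k</name>=<value>v</value></part></template>'
    '<ext><name>ref</name><attr> name="x"</attr>'
    '<inner>text</inner><close>&lt;/ref&gt;</close></ext>'
    '<ext><name>nowiki</name><attr></attr></ext></root>'
)


def flat_text(node):
    """Concatenate all text below node, dropping the element structure."""
    return "".join(node.itertext())


def template_arguments(template):
    """Map argument names (or implicit indices, as strings) to flat values."""
    args = {}
    for part in template.findall("part"):
        name, value = part.find("name"), part.find("value")
        # Implicitly numbered arguments have an empty <name index="N"/>.
        key = name.get("index") if "index" in name.attrib else flat_text(name)
        args[key] = flat_text(value)
    return args


def headings(root):
    """Return (section number, level, title) for every h / possible-h."""
    found = []
    for node in root.iter():
        if node.tag in ("h", "possible-h"):
            level = int(node.get("level"))
            # The element text still carries the surrounding equal signs.
            title = flat_text(node).strip()[level:-level].strip()
            found.append((int(node.get("i")), level, title))
    return found


def extension_tags(root):
    """Return (tag name, raw attribute text, inner text or None) per ext."""
    return [
        (ext.findtext("name", ""), ext.findtext("attr", ""), ext.findtext("inner"))
        for ext in root.iter("ext")
    ]


root = ET.fromstring(SAMPLE)
template = root.find("template")
```

Flattening with itertext() discards nested templates and comments inside names and values, so this is only suitable for inspection, not for editing the tree. Names and values are returned untrimmed.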
Serialisation
- Note: the following method guarantees only that valid parser output will serialise back into original markup. Modifying parse trees without regard for escaping may produce unexpected results. See below for information on escaping template arguments.
Turning the XML parse tree back into wiki markup is rather simple. It amounts to four substitutions, three of them being:
<template>...</template> → {{...}}
<tplarg>...</tplarg> → {{{...}}}
<part>...</part> → |...
Care has to be taken when handling ext elements. For elements that contain an inner element, the following substitution is appropriate:
<ext><name>...</name><attr>...</attr>...</ext> → <......>...
Otherwise, use:
<ext><name>...</name><attr>...</attr></ext> → <....../>
Other elements can have their contents passed through as is.
The whole process is equivalent to applying the following XSLT stylesheet:
<?xml version="1.0" standalone="yes" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" media-type="text/x-wiki" />
<xsl:preserve-space elements="*" />
<xsl:template match="template">
<xsl:text>{{</xsl:text>
<xsl:apply-templates />
<xsl:text>}}</xsl:text>
</xsl:template>
<xsl:template match="tplarg">
<xsl:text>{{{</xsl:text>
<xsl:apply-templates />
<xsl:text>}}}</xsl:text>
</xsl:template>
<xsl:template match="part">
<xsl:text>|</xsl:text>
<xsl:apply-templates />
</xsl:template>
<xsl:template match="ext[inner]">
<xsl:text>&lt;</xsl:text>
<xsl:apply-templates />
</xsl:template>
<xsl:template match="ext[not(inner)]">
<xsl:text>&lt;</xsl:text>
<xsl:apply-templates />
<xsl:text>/&gt;</xsl:text>
</xsl:template>
<xsl:template match="inner">
<xsl:text>&gt;</xsl:text>
<xsl:apply-templates />
</xsl:template>
<xsl:template match="*">
<xsl:apply-templates />
</xsl:template>
</xsl:stylesheet>
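For readers without an XSLT processor at hand, the same substitutions can be sketched in Python with xml.etree.ElementTree. The function name to_wikitext and the sample tree are illustrative assumptions; the logic is meant to mirror the stylesheet above, not to replace it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample tree; not actual MediaWiki output.
SAMPLE = (
    '<root>See <template><title>foo</title>'
    '<part><name index="1"/><value>bar</value></part>'
    '<part><name>k</name>=<value>v</value></part></template> and '
    '<tplarg><title>p</title>'
    '<part><name index="1"/><value>d</value></part></tplarg>'
    '<ext><name>ref</name><attr> name="x"</attr>'
    '<inner>text</inner><close>&lt;/ref&gt;</close></ext>'
    '<ext><name>nowiki</name><attr></attr></ext></root>'
)


def to_wikitext(node):
    """Serialise one parse tree element and its subtree to wiki markup."""
    # Text directly inside the element, then each child followed by its tail.
    body = (node.text or "") + "".join(
        to_wikitext(child) + (child.tail or "") for child in node
    )
    if node.tag == "template":
        return "{{" + body + "}}"
    if node.tag == "tplarg":
        return "{{{" + body + "}}}"
    if node.tag == "part":
        return "|" + body
    if node.tag == "ext":
        # With <inner>, the ">" is emitted by the inner branch below;
        # without it, the tag is written in self-closing form.
        return "<" + body + ("" if node.find("inner") is not None else "/>")
    if node.tag == "inner":
        return ">" + body
    # All other elements pass their contents through unchanged.
    return body


wikitext = to_wikitext(ET.fromstring(SAMPLE))
```

ElementTree keeps whitespace in text nodes as is, which matters here because whitespace is significant when reconstructing wiki markup.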
Escaping and transformations
The pipe character, the equal sign and consecutive curly braces are interpreted specially in template invocations. If you wish to employ any of them as literal characters, you have to escape them. Unfortunately, MediaWiki markup does not lend itself to escaping very well. There are many methods of escaping markup, and they come with many caveats. Proper escaping is significant when modifying parse trees, hence we discuss it here.
The simplest method is to wrap special characters, or the whole string, inside a <nowiki> tag, or escape them with numerical HTML escapes: &#124;, &#61;, &#123; and &#125; (and possibly escape other characters as well). This has two disadvantages: first, wikilinks and transclusions stop working (obviously). Second, the escaped text might not be recognised by template or module logic that processes it. In this section, more universal alternatives will be discussed.
If you want to allow wikilinks in an argument, but not templates (or template arguments), the simplest universal method is to perform the following substitutions:
{{{ → {<noinclude/>{<noinclude/>{
}}} → }<noinclude/>}<noinclude/>}
{{ → {<noinclude/>{
}} → }<noinclude/>}
= → {{lc:=}}
| → {{!}} (built-in magic word since MediaWiki 1.24; for older versions, you have to ensure that Template:! expands to the pipe character.)
It has the disadvantage that piped wikilinks come out of it as [[link target{{!}}label]], which may be aesthetically unpleasing, although it still renders as expected. It also prevents the pipe trick from working. If you wish to avoid that, you will have to count the pairs of square brackets preceding each | to see whether they match; if they do, the pipe is not part of a wikilink and needs escaping.
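The substitutions above, together with the bracket counting just described, might be sketched as follows. The function name escape_argument and its keep_link_pipes flag are hypothetical, and the bracket tracking is a simplification that assumes wikilinks are properly closed.

```python
import re

# One token per special construct: link brackets, a brace that is followed
# by another brace of the same kind, an equal sign, or a pipe.
_TOKEN = re.compile(r"\[\[|\]\]|\{(?=\{)|\}(?=\})|=|\|")


def escape_argument(text, keep_link_pipes=False):
    """Escape text for use as a template argument, keeping wikilinks usable.

    With keep_link_pipes=True, pipes inside [[ ... ]] are left alone.
    """
    out = []
    depth = 0  # number of currently open [[ ... ]] pairs
    pos = 0
    for match in _TOKEN.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        token = match.group()
        if token == "[[":
            depth += 1
            out.append(token)
        elif token == "]]":
            depth = max(depth - 1, 0)
            out.append(token)
        elif token == "{":
            # Break up runs of braces so they no longer open a template.
            out.append("{<noinclude/>")
        elif token == "}":
            out.append("}<noinclude/>")
        elif token == "=":
            out.append("{{lc:=}}")
        elif keep_link_pipes and depth > 0:
            out.append("|")
        else:
            out.append("{{!}}")
    out.append(text[pos:])
    return "".join(out)


escaped = escape_argument("a=b|c")
```

Scanning in a single pass avoids re-escaping the braces and equal signs that the replacements themselves introduce.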
If you want to allow both links and templates, but prevent misinterpretations of | and premature template closures, you need to take the following steps:
- Parse the markup you wish to escape. (The following will assume that you get an XML tree as described above.)
- For each direct child text node of the <root> element, escape =, |, }}} and }}, as discussed.
- Serialise the parse tree back into wiki markup.
The resultant text will be interpreted as if it were a stand-alone piece of markup, even inside a template argument. Following these steps is the only universal method of escaping wikitext.
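The steps above can be sketched in Python as follows, assuming the XML parse tree has already been obtained (step 1). The names escape_text, to_wikitext and escape_markup are illustrative, and the sample input is hand-written rather than real parser output.

```python
import re
import xml.etree.ElementTree as ET

# Step 2 targets: "=", "|" and runs of closing braces ("}}", "}}}").
_SPECIAL = re.compile(r"\}(?=\})|=|\|")
_REPLACEMENTS = {"}": "}<noinclude/>", "=": "{{lc:=}}", "|": "{{!}}"}


def escape_text(text):
    """Escape the special characters in one top-level text node."""
    return _SPECIAL.sub(lambda m: _REPLACEMENTS[m.group()], text)


def to_wikitext(node):
    """Serialise a parse tree element back into wiki markup (step 3)."""
    body = (node.text or "") + "".join(
        to_wikitext(child) + (child.tail or "") for child in node
    )
    if node.tag == "template":
        return "{{" + body + "}}"
    if node.tag == "tplarg":
        return "{{{" + body + "}}}"
    if node.tag == "part":
        return "|" + body
    if node.tag == "ext":
        return "<" + body + ("" if node.find("inner") is not None else "/>")
    if node.tag == "inner":
        return ">" + body
    return body


def escape_markup(parse_tree_xml):
    """Escape only the text that is a direct child of <root>."""
    root = ET.fromstring(parse_tree_xml)
    # In ElementTree, the direct child text nodes of <root> are its own
    # .text plus the .tail of each of its children.
    if root.text:
        root.text = escape_text(root.text)
    for child in root:
        if child.tail:
            child.tail = escape_text(child.tail)
    return to_wikitext(root)


result = escape_markup(
    "<root>a=b <template><title>t</title>"
    "<part><name>k</name>=<value>v</value></part></template> c|d}}</root>"
)
```

Text inside the template element is left untouched, so the nested invocation keeps working while the stray characters around it are neutralised.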