
User:HappyDog/WikiText parsing

From mediawiki.org

This is a technical page describing how the MediaWiki engine parses a page of wiki markup to create the page you see. I wrote it on my own wiki a long time ago to help me understand how the engine works, not to explain wiki syntax. I have reposted it here in case it is of use to anyone, in particular with regard to the Markup spec project. It may not be 100% accurate or complete, so no guarantees are made!


This page describes the anatomy of the addWikiText() function of the OutputPage class (instantiated as $wgOut in the code). The function translates the wiki markup it receives as an argument into HTML text, which it adds to the final page using the class's addHTML() function. This description does not cover parsing that happens outside this function (for example, redirects), although such details may be added at a later date.

The separating and recombining of <nowiki>, <math> and <pre> tags (Steps 1 to 3 and the final step) are carried out by the addWikiText() function, whilst the rest of the parsing is carried out by doWikiPass2(), a separate function of the Output object. doWikiPass2() calls several other class functions to convert the text, and these are indicated where relevant.
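The separate-then-recombine approach described above can be illustrated with a short Python sketch. This is not the MediaWiki code (which is PHP): the function names strip_tag() and restore(), the placeholder format, and the use of a dictionary as the stash are all assumptions made for illustration. The tag-matching rules follow the description of step 1 below, and only angle brackets are escaped here.

```python
import re


def strip_tag(text, tag, stash):
    """Replace each <tag>...</tag> span with a placeholder, stashing its content."""
    # Tags are case-insensitive and may contain whitespace around the tag name,
    # but not between '<' and '/' in the closing tag.
    # A missing closing tag means the span runs to the end of the text.
    pattern = re.compile(
        r"<\s*%s\s*>(.*?)(?:</\s*%s\s*>|\Z)" % (tag, tag),
        re.IGNORECASE | re.DOTALL,
    )

    def stash_content(match):
        # Hypothetical placeholder format; it only needs to be a string that
        # later parsing passes will not touch.
        key = "\x00%s-%d\x00" % (tag, len(stash))
        # Escape angle brackets so the stored text is displayed literally.
        stash[key] = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
        return key

    return pattern.sub(stash_content, text)


def restore(text, stash):
    """Put the stashed content back in place of its placeholders."""
    for key, content in stash.items():
        text = text.replace(key, content)
    return text
```

In use, strip_tag() would be called before the main parsing passes and restore() after them, so that nothing in between can alter the protected content.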

The information on this page is based on an unmodified MediaWiki version 1.3.10 (I think!).


  1. <nowiki> content is separated out.
    • Neither the opening nor the closing <nowiki> tag is case sensitive, and each may contain whitespace between the word nowiki and the brackets. No whitespace is allowed between the opening bracket and the slash on the </nowiki> tag, however.
    • A closing </nowiki> tag is not required. If it is missing then the rest of the supplied text is treated as nowiki.
    • Text within the nowiki tags has all backslashes (\) and triangular brackets (< and >) replaced by the appropriate HTML entity code. This means that HTML markup won't work within the nowiki tags.
    • No further parsing is done on the text within the tags.
  2. If TeX support is enabled, maths content is separated out and rendered.
    • Maths content is specified using the <math> and </math> tags.
    • The tags are processed in the same way as the <nowiki> tags, specifically:
      • Neither the opening nor the closing tag is case sensitive, and each may contain whitespace between the word math and the brackets. No whitespace is allowed between the opening bracket and the slash on the </math> tag, however.
      • A closing </math> tag is not required. If it is missing then the rest of the supplied text is treated as part of the maths mark-up.
    • The contents of the math tag are rendered by calling renderMath(). The details of this function are not yet included on this page.
    • If TeX support is disabled (global variable $wgUseTeX == false) then any math markup (including the tags themselves) is treated as normal wiki code.
  3. Any text enclosed by <pre> tags is separated out.
    • Text within <pre> tags is treated exactly the same as text within <nowiki> tags (including how the tags are parsed and how the text is treated) with the following minor differences:
      • Any maths content within a <pre> tag will already have been separated out and rendered.
      • The <pre> tags are retained, and continue to enclose the text, whereas the <nowiki> tags are removed from the final output.
  4. HTML tags are validated (using function removeHTMLTags())
    • This is quite a complex procedure, which I may describe in more detail on a separate page, but for the moment it can be summarised as follows:
      • HTML comments are removed.
      • Any tags that are not allowed by the software (e.g. <script> tags) are replaced by HTML entities, so they display as literals and are not treated as HTML by the browser.
      • Any badly formed tags (e.g. nested tags that shouldn't be nested, <tr> tags outside a <table> tag, etc.) are also replaced by HTML entities so they are not treated as HTML.
      • Any attributes that are not allowed by the software (e.g. onMouseOver) are removed from otherwise valid tags.
      • A small amount of minor source formatting is applied (basically, the removal of unnecessary whitespace).
      • A closing tag is added at the end for all tags that are not closed properly. Note that some tags (e.g. <br>) don't need to be closed.
  5. Built-in wiki variables are replaced (using function replaceVariables())
  6. Horizontal lines are generated
    • Any occurrence of four or more hyphens at the start of a line is replaced by an HTML <hr> tag.
    • Any capitalised <HR> tags are made lower case.
  7. Bold and italic formatting is applied (using function doAllQuotes())
  8. Headings are formatted (using function doHeadings())
  9. List and indentation formatting is applied (using function doBlockLevels())
  10. If dynamic dates are enabled, dates are reformatted appropriately (using function $wgLang->replaceDates())
    • Dynamic dates allow users to select a custom date format in the preferences section, and are enabled using the global variable $wgUseDynamicDates.
    • If dynamic dates are disabled then no replacement is made.
  11. External wiki links are created (using function replaceExternalLinks())
  12. Internal wiki links are created (using function replaceInternalLinks())
  13. ISBN numbers are made into links (using function magicISBN())
  14. RFC numbers are made into links (using function magicRFC())
    • This function currently does nothing. I am assuming it will eventually turn RFC numbers into links in a similar manner to the way ISBN numbers are handled.
  15. Headings are formatted (using function formatHeadings())
  16. The text is passed to the Skin object for formatting (using the function transformContent() in the user's Skin class)
    • This allows the skin to add any of its own formatting that may be required, e.g. table background colours.
    • The default skin does not make any alterations at this stage.
  17. Finally, the pre, math and nowiki content is recombined with the fully rendered wiki code and the whole HTML text is output.
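
As an illustration of one of the simpler passes, step 6 (horizontal lines) can be sketched in Python as below. The function name do_horizontal_rules() is hypothetical, and the sketch assumes that only the hyphens themselves are replaced, leaving any other text on the line in place. The real engine is written in PHP and may differ in detail.

```python
import re


def do_horizontal_rules(text):
    """Sketch of step 6: turn runs of hyphens at line starts into <hr> tags."""
    # Four or more hyphens at the start of a line become an <hr> tag.
    # Assumption: anything after the hyphens on that line is kept as-is.
    text = re.sub(r"^-{4,}", "<hr>", text, flags=re.MULTILINE)
    # Capitalised <HR> tags are made lower case.
    return text.replace("<HR>", "<hr>")
```

For example, do_horizontal_rules("intro\n----\ntext") returns "intro\n<hr>\ntext", whereas a run of only three hyphens, or hyphens that do not start a line, is left untouched.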