Talk:Markup spec/Archive 1
A long time ago, before MediaWiki had such fantastical things as extensions, I did a bit of hacking around in the rendering code in order to implement a few syntax enhancements that I needed at that time. As part of the process I documented some of what I found. I don't know if it is at all relevant any more - it was for MW 1.3.10, I think, and I'm sure a lot has changed - but I've posted it at User:HappyDog/WikiText parsing in case it's of any use to anyone. --HappyDog 14:54, 17 May 2006 (UTC)
Parsing Expression Grammar
I rather like the idea of using a Parsing expression grammar. I'll give it a try here as soon as I work out where to start. HTH HAND —Phil | Talk 22:02, 24 May 2006 (UTC)
BNF
I saw that HappyDog started by giving the links in BNF. I thought a bit about it and concluded that I'd prefer something easier to get started with, so I tried to describe articles containing only text and horizontal rules at Markup spec/BNF/Article. It is a long time ago that I did this, but I seem to remember that the alternatives in BNF must be disjoint. I don't think it is possible to achieve this without producing an absurdly long specification, so I decided to add some comments telling how the rules should be applied.
I don't care in what form we specify the wiki markup, but after this short attempt with BNF I can see why Phil wants to use a PEG. However, there is inherently nothing wrong with BNF + English. In short, I wish I had a better memory and actually remembered what I was taught in the Formal Languages course. -- Jitse Niesen 14:47, 27 May 2006 (UTC)
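For illustration, such a text-plus-horizontal-rules specification might look roughly like the following; this is a paraphrase, not the actual content of the subpage:

<article> ::= <block> | <block> <article>
<block> ::= <horizontal-rule> | <paragraph>
<horizontal-rule> ::= "----" <newline>
<paragraph> ::= <text> <newline>

together with English comments stating in which order the alternatives are tried, since they are not disjoint.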
- I started with links, because I figured it was easier to start at the bottom and work up than the other way round. Actually, it probably makes little difference though. I also started work on a top-down approach, which is different to yours. I don't want to go in and just change what you've done, so I'm going to post what I have so far in Talk:Markup spec/BNF/Article for discussion. --HappyDog 02:18, 29 May 2006 (UTC)
- Actually - on closer inspection they are not that different. --HappyDog 02:29, 29 May 2006 (UTC)
- My opinion:
- A BNF is a very good thing, because it allows an analysis of the real "WikiML",
- A BNF will never be enough, since it only abstracts correct code while much of the real "WikiML" usage is erroneous, but still functional.
- Rursus 12:02, 17 October 2009 (UTC)
Exceptions, Context-sensitivity and hacks
I do not know whether this is the best central page to discuss the formal description of MediaWiki syntax (is there a better one?).
In any case, there are many hacks in the original parser that make the language elements highly context-sensitive. The question is whether this behaviour should be described and implemented in future parsers, making such parsers very difficult to create and maintain, or whether it would be better to remove these things from the language in order to make its meaning easier to grasp for both humans and computers.
One of the most difficult things to parse correctly is quotes: contrary to what the article text seems to imply, two quotes are not always italics, three quotes are not always bold, and so on. If two quotes are followed by three quotes, then the three quotes are interpreted as the end of italics plus a literal quote, but not if they are in turn followed by three more quotes, in which case they are interpreted as the start of bold (and the second triple of quotes as the end of bold). This gets even more complex with more involved combinations, and sometimes the placement of the literal quote depends on whether it is followed by a single lower-case character. All quote-induced formatting is ended by a single newline in the input, but the equivalent HTML tags (e.g. <i>), which are usually allowed in the input, are not closed even by paragraph breaks.
So I think a detailed English-language description of how the markup is processed would be a necessary first step. This would need to include the exceptions, information about which constructs (if any) terminate other constructs (e.g. end of line terminates any open italic/bold if it was started by a quote construct), and the circumstances under which a construct is NOT interpreted in the way one would expect (e.g. three quotes not interpreted as the start/end of bold). Johann p 16:37, 20 February 2007 (UTC)
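For illustration, the uncontested cases are:

''text'' gives italic
'''text''' gives bold
'''''text''''' gives bold italic
''ab'''cd'''ef'' gives italic, with "cd" additionally bold

while an unbalanced sequence such as ''abc''' triggers the end-of-italics-plus-literal-quote handling described above.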
- You are confusing parser error handling and the ambiguity of tokens with context-sensitive grammars. Wikicode is not context-sensitive. Please read the discussion in Markup spec#Feasibility study. --Kanor 12:41, 27 November 2009 (UTC)
Lists
Lists can't be handled by BNF when using a naive lexical analyser, but it may be possible to handle them using a more complicated lexer. The idea is to consider the blocks of *:;# at the start of lines to be separate from the rest of the line; instead of generating tokens for *, :, etc., it makes much more sense to generate tokens for 'one more * than the previous line', 'one less * than the previous line', etc. Representing these tokens as {*, *}, etc., and a newline as \n, the following complicated nested list:
# a
#* b
#* c
# d
# e
# f
#:: g
#:** h
**** i
# j
would be lexically analysed as
{# a \n {* b \n c \n *} d \n e \n f \n {: {: g \n :} {* {* h \n *} *} :} #} {* {* {* {* i \n *} *} *} *} {# j \n #}
This maps to the equivalent HTML, list structure, etc., in a very BNF-able way, and is easily obtainable from the lexical analyser (which tokenises the wikitext before passing it to the BNF). It makes a lot more sense than trying to treat * as a token, anyway... Ais523 10:37, 14 November 2007 (UTC)
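A minimal sketch of such a lexer, in Haskell since that is the implementation language demonstrated elsewhere on this page; the token and function names are invented for illustration:

data Token = Open Char | Close Char | Text String deriving Show

-- Split a line into its *#:; prefix and the remaining text.
splitPrefix :: String -> (String, String)
splitPrefix = span (`elem` "*#:;")

-- Compare each line's prefix with the previous line's and emit
-- Close/Open tokens for the part that changed.
lexLists :: [String] -> [Token]
lexLists = go ""
  where
    go prev []     = map Close (reverse prev)
    go prev (l:ls) =
      let (cur, txt) = splitPrefix l
          common     = length (takeWhile id (zipWith (==) prev cur))
          closes     = map Close (reverse (drop common prev))
          opens      = map Open (drop common cur)
      in  closes ++ opens ++ [Text txt] ++ go cur ls

For example, lexLists ["# a", "#* b", "# c"] gives [Open '#', Text " a", Open '*', Text " b", Close '*', Text " c", Close '#'], i.e. the same kind of {# ... {* ... *} ... #} stream as above.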
- It makes no sense to create language rules such as 'one more * than the previous line' and 'one less * than the previous line', because you might want to jump two levels in a list. Additionally, when you write
#
and then
###
you expect two levels to be added. If you restrict that during the analysis you are limiting by hand the power of the language, which is not a Good Thing (tm), in my opinion. --Kanor 12:47, 27 November 2009 (UTC)
Automatic substitution
I would find it really useful to have something which automatically substitutes a template like ~~~~ does. This would make it easier for editors who do not understand substitution to add a template tag. Maybe something like #SUBST as the first included part of a template could automatically do this. --Barfbagger 13:45, 22 January 2008 (UTC)
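(To illustrate the proposed difference, with 'welcome' as a placeholder template name: today an editor who wants substitution must type {{subst:welcome}} explicitly; if the template itself began with the proposed #SUBST marker, typing plain {{welcome}} would have the same effect.)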
<pre>
Is there any way to stop the <pre> from extending one line across the entire page? -PatPeter, MediaWiki Support Team 22:07, 26 January 2008 (UTC)
- No, there isn't. If you want inline bits of code use <code> --Skizzerz talk - contribs MediaWiki Support Team 22:26, 26 January 2008 (UTC)
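(For example, writing <code>strlen()</code> inside a sentence keeps it flowing with the surrounding text, whereas <pre>strlen()</pre> is always rendered as a separate box spanning the page width.)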
Haskell/Parsec
Would there be any interest in a Haskell/Parsec parser? I've found it relatively easy to write, and it has the added advantage that it can be run and tested in an interpreter:
*WikiAst Main> run pLink "[[en::world#123|hello]]"
InternalLink {linkName = "world", linkDescription = Just "hello", linkNamespace = Nothing, linkInterwiki = Just "en", linkSection = Just "123"}
*WikiAst Main> run pLink "[http://google.com Google website]"
ExternalLink {linkName = "http://google.com", linkDescription = Just "Google website"}
Newhoggy 02:45, 28 June 2008 (UTC)
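The parser itself is not posted here; the following is a minimal Parsec sketch of what the external-link half might look like. The names pLink/run and the record fields are taken from the session above; everything else is an assumption, not Newhoggy's actual code:

import Text.Parsec
import Text.Parsec.String (Parser)

-- AST guessed from the session output; the real definitions may differ.
data Link
  = InternalLink { linkName :: String, linkDescription :: Maybe String
                 , linkNamespace :: Maybe String, linkInterwiki :: Maybe String
                 , linkSection :: Maybe String }
  | ExternalLink { linkName :: String, linkDescription :: Maybe String }
  deriving Show

-- External links only: "[url optional description]".
pExternalLink :: Parser Link
pExternalLink = do
  _    <- char '['
  url  <- many1 (noneOf " ]")
  desc <- optionMaybe (char ' ' >> many1 (noneOf "]"))
  _    <- char ']'
  return (ExternalLink url desc)

-- A run helper like the one used in the session.
run :: Show a => Parser a -> String -> IO ()
run p s = either print print (parse p "" s)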
- Then make an extension for it. I don't see why it would deserve any inclusion in the main code, as having multiple parsers/parsing styles in the same wiki generally just serves to confuse people. --Skizzerz 03:04, 28 June 2008 (UTC)
- I'd very much like to see the code, and try out how it compares to the original "parser" code. -- ∂ 18:48, 9 July 2008 (UTC)
Wikicode is NOT context-sensitive
There has been a lot of confusion about the Wikicode grammar. In many places I have read that Wikicode is a context-sensitive language. That is not true, as far as I know. I tried to explain it on the main page, in the discussion under 'Feasibility study'.
In the documents I have read concerning the grammar description, concepts such as grammar type, syntax analysis (or parsing), ambiguity, and error recovery have been mixed up and confused.
Lists can be generated with a context-free grammar. Tag mismatching has nothing to do with the grammar type. Context-sensitive languages are a different thing =)
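To make the point about lists concrete: no line's validity depends on the previous line's prefix, so the set of valid list lines can be described with context-free (in fact regular) productions, for instance (my own sketch, not taken from the spec pages):

<lines> ::= <line> | <line> <lines>
<line> ::= <list-prefix> <text> <newline>
<list-prefix> ::= "" | <list-char> <list-prefix>
<list-char> ::= "*" | "#" | ":" | ";"

The nesting structure is then recovered during syntax analysis, e.g. via a lexer like the one sketched under 'Lists' above.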