User:OrenBochman/ParserNG/WikiTable

From mediawiki.org

An ANTLR spec for the WikiTable markup

ANTLR spec

grammar wikiTable;

@header {
package p;

}

//@header {import org.antlr.test;} // not auto-copied to lexer
@lexer::header{
package p;
//import org.antlr.test;
//
}

@lexer::members {
//state: how deeply nested in a table are we?
int inTable=0;
List tokens = new ArrayList();
public void emit(Token token) {
        state.token = token;
    	tokens.add(token);
}
public Token nextToken() {
    	super.nextToken();
        if ( tokens.size()==0 ) {
            return Token.EOF_TOKEN;
        }
        return (Token)tokens.remove(0);
}
}

@members{
//int inTable=0;
//public void foo(){};
//int rows=0;
}

//Parser Rules
wikiTable 	: TBL_START  xml_attributes?  caption? head?  rows TBL_END;
caption		: CAPTION_START HS xml_attributes? captionText=TEXT+;
fragment
head		: (hCell hCellInLine*)+;	
rows 		: (firstRow|row) row*;
firstRow 	: cells;
row		: rowStart xml_attributes? cells;
rowStart	: ROW_START;

cells		:((cell|hCell) (cellInline|hCellInLine)*)+;
cell		: CELL_START xml_attributes? text=TEXT*;
cellInline	: CELL_INLINE_STRT xml_attributes? text=TEXT*;
hCell		: HEAD_START xml_attributes? text=TEXT*;
hCellInLine	: HEAD_INLINE_STRT xml_attributes? text=TEXT*;


//this is the recursive definition allowing table nesting
//cells		:( {input.LT(0)==CELL_START||input.LT(0)==HEAD_START}?=>(HEAD_START | CELL_START) XHTML_ATTRIBUTES? (TEXT|wikiTable)+ (CELL_INLINE_STRT XHTML_ATTRIBUTES? (TEXT|wikiTable)+)* )+  ;

//this needs to be in the parser for LT(2) to mean the second parser token
xml_attributes: {input.LT(2).getText().equals("=")}? xml_attribute+ PIPE? ;
xml_attribute: name=TEXT EQ DQUOTE value=TEXT* DQUOTE ;
//Lexer Rules
TBL_START	: {getCharPositionInLine()==0}?=> '{|'{inTable++; }	;
TBL_END		: {getCharPositionInLine()==0&&inTable>0}?=> '|}'{inTable--;}	;
HEAD_START      : {getCharPositionInLine()==0&&inTable>0}?=> '!';
HEAD_INLINE_STRT: {inTable>0}?=> '!!';

CELL_START  	: {getCharPositionInLine()==0&&inTable>0}?=> '|';	//this should only be recognised within a table
PIPE		: {getCharPositionInLine()>0||inTable==0}?=> '|';	//outside a table or not at the start of a line

CELL_INLINE_STRT: {inTable>0}?=> '||'; 					//this should only be recognised within a table
ROW_START 	: {getCharPositionInLine()==0&&inTable>0}?=> '|-' ;
CAPTION_START	: {getCharPositionInLine()==0&&inTable>0}?=> '|+' 	;


TEXT		: ('a'..'z'|'A'..'Z'|'0'..'9'|'.'|'-'|';'|':'|',')+;					//simplified

DQUOTE		: '"';
//WS 		:  (HS | VS)  ; //{ $channel = HIDDEN; } ;
HS		: ( ' ' | '\t'  )+ { $channel = HIDDEN; } ;
VS		: ( '\r' | '\n' )+ { $channel = HIDDEN; } ;
EQ		: '=';
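
The gated predicates on the lexer rules above can be hard to follow in grammar form. As a rough illustration only, here is a hand-written Java stand-in (not ANTLR-generated code) that applies the same two conditions, column 0 and inTable>0, to pick a token type. The class name TableLexerSketch and its methods are hypothetical. TEXT, whitespace and attribute tokens are simply skipped, and the two-character markers are tried before the single '|' and '!', which is an assumption about how the generated lexer resolves the overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the predicated lexer rules; not generated by ANTLR.
final class TableLexerSketch {
    // Nesting depth, mirroring the inTable lexer member.
    private int inTable = 0;

    // Tokenizes several lines, carrying the nesting depth across them.
    List<String> tokenize(String text) {
        List<String> out = new ArrayList<String>();
        for (String line : text.split("\n")) {
            tokenizeLine(line, out);
        }
        return out;
    }

    // Appends the markup token names found in one line; other characters are skipped.
    private void tokenizeLine(String line, List<String> out) {
        int i = 0;
        while (i < line.length()) {
            boolean lineStart = (i == 0);  // getCharPositionInLine()==0
            boolean inside = inTable > 0;  // inTable>0
            if (lineStart && line.startsWith("{|", i)) {
                inTable++; out.add("TBL_START"); i += 2;
            } else if (lineStart && inside && line.startsWith("|}", i)) {
                inTable--; out.add("TBL_END"); i += 2;
            } else if (lineStart && inside && line.startsWith("|-", i)) {
                out.add("ROW_START"); i += 2;
            } else if (lineStart && inside && line.startsWith("|+", i)) {
                out.add("CAPTION_START"); i += 2;
            } else if (inside && line.startsWith("||", i)) {
                out.add("CELL_INLINE_STRT"); i += 2;
            } else if (inside && line.startsWith("!!", i)) {
                out.add("HEAD_INLINE_STRT"); i += 2;
            } else if (lineStart && inside && line.startsWith("|", i)) {
                out.add("CELL_START"); i += 1;
            } else if (line.startsWith("|", i)) {
                out.add("PIPE"); i += 1;   // position>0 or outside any table
            } else if (lineStart && inside && line.startsWith("!", i)) {
                out.add("HEAD_START"); i += 1;
            } else {
                i += 1;                    // text, whitespace, quotes, '=' ...
            }
        }
    }

    public static void main(String[] args) {
        String sample = "{| border=\"1\"\n| Orange || Apple\n|-\n| Bread || Pie\n|}\n| not a cell";
        System.out.println(new TableLexerSketch().tokenize(sample));
    }
}
```

In the sample, the '|' on the last line comes out as PIPE rather than CELL_START, because inTable has dropped back to 0 after the closing '|}'.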

Status

  • This is a lexer + a parser.
  • Tested against the examples in table.
  • A tree grammar or a string template could be used to transform the result into XHTML etc.
  • Does not support full Unicode, to simplify development, but the character set in the TEXT rule could be changed with minimal impact.
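
To give an idea of the XHTML target mentioned above, here is a minimal sketch in plain Java. It uses neither a tree grammar nor a string template; it only shows the shape of output such a transformation might produce. The class XhtmlTableSketch, its methods, and the choice of rows as lists of cell strings are all assumptions made for illustration, and attributes, captions and header cells are left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical renderer: rows of cell text in, a flat XHTML table string out.
final class XhtmlTableSketch {
    // Escapes the characters that are significant in XHTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders each row as <tr> and each cell as <td>.
    static String render(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<table>");
        for (List<String> row : rows) {
            sb.append("<tr>");
            for (String cell : row) {
                sb.append("<td>").append(escape(cell)).append("</td>");
            }
            sb.append("</tr>");
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Orange", "Apple", "12,333.00"),
                Arrays.asList("Bread", "Pie", "500.00"));
        System.out.println(render(rows));
    }
}
```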

Problems

The spec has a recognizer nondeterminism [1].

  1. ANTLR is unable to decide which path to take when it meets a HEAD_START symbol, since the symbol could belong to either of:
     • the optional header;
     • the body, when there is no optional header but the body starts with a header cell (this is a mistake).
  2. This is only a warning, and option #2 above is discarded. How could this nondeterminism be removed?
     1. Add a variable with table-wide scope:
        boolean hasHead=true;
     2. Use it in a predicate on the optional header:
        {hasHead}?
     3. Add an action after the optional header to flip it:
        {hasHead=false;}
  3. ANTLR also complains that the first non-header cell might belong to either of:
     • the (optional) first row, i.e. the one without a |- indicator;
     • the optional other rows after it.
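
The three steps proposed for the header ambiguity can be tried out by hand before touching the grammar. The sketch below is a hypothetical Java analogue, not ANTLR code: in the grammar the flag would be tested in a predicate, whereas here it is tested in an ordinary loop over token type names. The class HeadFlagSketch and its method are invented for illustration. It handles a single, non-nested table, and it assumes the header ends at the first ROW_START, CELL_START or CELL_INLINE_STRT.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical analogue of the hasHead idea; works on token type names only.
final class HeadFlagSketch {
    // For each header-cell token, reports whether it would go to "head" or "body".
    static List<String> assignHeaderCells(List<String> tokenTypes) {
        boolean hasHead = true;                        // step 1: one flag per table
        List<String> owners = new ArrayList<String>();
        for (String type : tokenTypes) {
            if (type.equals("HEAD_START") || type.equals("HEAD_INLINE_STRT")) {
                owners.add(hasHead ? "head" : "body"); // step 2: the flag plays the predicate
            } else if (type.equals("ROW_START") || type.equals("CELL_START")
                    || type.equals("CELL_INLINE_STRT")) {
                hasHead = false;                       // step 3: header is over, flip the flag
            }
            // All other token types (TEXT, EQ, ...) leave the flag alone.
        }
        return owners;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<String>();
        tokens.add("TBL_START");
        tokens.add("HEAD_START");
        tokens.add("ROW_START");
        tokens.add("HEAD_START");
        tokens.add("TBL_END");
        System.out.println(assignHeaderCells(tokens));
    }
}
```

Whether the equivalent predicate silences the ANTLR warning has not been checked here. The sketch only shows the intended behaviour: header cells before the first row or data cell go to the header, later ones to the body.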

Table in Table Test

You type:

<!-- outer -->
{| border="1"
| Orange || Apple     || align="right" | 12,333.00
|-
| Bread  || Pie       || align="right" | 500.00
|-
| Butter || Ice cream || align="right" | 1.00
<!-- inner -->
{| border="1"
| Orange || Apple     || align="right" | 12,333.00
|-
| Bread  || Pie       || align="right" | 500.00
|-
| Butter || Ice cream || align="right" | 1.00
|}
|}

You get:

Orange Apple 12,333.00
Bread Pie 500.00
Butter Ice cream 1.00
Orange Apple 12,333.00
Bread Pie 500.00
Butter Ice cream 1.00

References

  1. Terence Parr, The Definitive ANTLR Reference: Building Domain-Specific Languages, 2007, ISBN 0-9787392-5-6, p. 127.