How the Parser Works - NBBC: The New BBCode Parser

VI. Appendices

[ Previous: F. CSS Examples | Next: H. Tag Parameters ]

G. How the Parser Works

Tech Tip

This section is intended primarily for experienced software engineers and computer scientists. It's heavy on theory, light on practice, and if you're uncomfortable around concepts like computational orders, finite-state machines, and push-down automata, you're going to be lost. If you need an introduction to these concepts, we recommend the classic Dragon Book and Hopcroft and Ullman's Little Book of Evil, but be forewarned that compilers are a dense subject, and each one of those books could readily be the subject of a year or two of study at a decent university.

For what it's worth, you don't need to know any of the stuff in this section to be able to use NBBC; this section is primarily of academic interest.

NBBC is a compiler, and is broken down into two main layers: There's the lexical analyzer, which is responsible for breaking characters sequences into tokens, and is implemented in nbbc_lex.php; and there's the parser, which is responsible for analyzing the token sequences and generating proper output. This appendix is not intended to be a complete analysis of the structure of NBBC: Rather, it's intended to be a broad overview that will help you to understand the rationale embodied by the source code, which is well-commented, so should be relatively easy to follow once you've read this short discussion.

The Lexer

Most of the discussion and most of the complexity is focussed in the parser, so let's quickly get the lexical analyzer (lexer) out of the way. Like most lexers, NBBC's is built on regular expressions. The general principle used by the lexer is to use preg_split to initially divide the input into an array of strings that alternate between tags/whitespace/newlines in the odd array indices and non-whitespace text strings in the even array indices. A simple flag is used to track whether the next item in the array is a tag/whitespace/newline or a text string.

For this discussion, we'll use the following input as our primary example:

Code:

Before. [center]Hello, [i]World[/i]![/center] After.

The lexer breaks this input down into the following stream of tokens (wrapped here for clarity's sake; note that the empty array elements are omitted in this discussion for clarity):

Code:

"Before." WS NL [center] "Hello," WS [i] "World" [/i] "!" [/center] NL "After."

Each token is of type BBCODE_TEXT (like "Before."), BBCODE_NL (which is a newline matching either Un*x, Windows, or Mac formats), BBCODE_WS (any sequence of non-newline whitespace characters), BBCODE_TAG (a start tag), or BBCODE_ENDTAG (an end tag). These are retrieved by the parser one at a time by calling BBCodeLexer::NextToken().

Overview of the Parser

The parser is a push-down automaton that recognizes an LL(1) grammar. Its design is inspired by that of simple push-down mathematical expression parsers, where the state of the stack and a single token of lookahead is used to determine what operation to perform next. This means that the parser is implicitly constructing a document tree based on the tokens and the tags' class contraints, but because it's an implict instead of explicit construction, it's much faster than a genuine tree-builder: We don't need to allocate, store, and free tree nodes for this to work.

When provided with a string, the parser gives it to the lexer, which breaks it down into tokens; then the parser reads each successive token, one at a time, and processes the token according to its current state and the contents of the stack. This is an approximate description of what happens for each incoming token, ignoring the many edge cases, error-detection and error-correction, and whitespace cleanups:

BBCODE_TEXT: Plain text is filtered, and then pushed onto the top of the stack to await further processing.
BBCODE_WS: Non-newline whitespace is pushed just like text.
BBCODE_NL: Newlines are pushed onto the stack mostly like text.
BBCODE_TAG: One of several things can happen, depending on what kind of tag this is:
- If the tag is a verbatim tag, run a BBCODE_CHECK on it, then collect successive tokens until we reach its matching end tag. Pack the intervening tokens together, and pass the result to the tag's processing function. The output of the processing function is then pushed back onto the stack as text.
- If the tag is an isolated tag (it has no end tag), run a BBCODE_CHECK on it, and then immediately pass the tag to the tag's processing function. The output of the processing function is then pushed back onto the stack as text.
- For all other start tags, including those with both required and optional end tags, we simply push the tag onto the stack, since we won't know how to fully process it until its body has been processed.
BBCODE_ENDTAG: Walk backwards up the stack, searching for a matching start tag. If an ending-optional tag is found in between, a suitable end tag is generated for it on the spot and processed, until there is nothing between this end tag and its start tag except for processed HTML text. The processed HTML text is collected as a string, and passed to the tag's processing function; the output of the processing function is then pushed back onto the stack as text.

At the end of processing the input, the stack will contain the output as multiple stack entries; these are collected to a single string, which is returned as the output.

Performance analysis

This algorithm runs in O(n) time except when we encounter an end tag and have to go back up the stack; then it's O(n) to walk back to the start tag, and O(n) to collect tokens between the start tag and end tag and collapse them into a string. So this is effectively O(n+n*m), where "n" is the number of input tokens and "m" is the number of end tags; however, that's a worst-case analysis, assuming that each end tag contains an entire document's worth of tokens between it and its start tag. In practice, the number of tokens between a start and an end tag is fairly small, approaching some constant k (which is around 20 or 30 tokens, in practice), so in typical, average cases, this algorithm runs in O(n+m) time.

(As a future optimization, a constant-time speedup could be achieved by using hash tables to track start tag locations, instead of searching for them, but the intervening tokens must be collected anyway when the start tag is found, so the resulting search performance would still be O(n).)

Error-correction

While BBCODE_TEXT, BBCODE_WS, and BBCODE_NL tokens are always legal, BBCODE_TAG and BBCODE_ENDTAG tokens may be illegal in some cases, and the parser needs to cope with this. There are several possibilities:

BBCODE_TAG with no matching BBCODE_ENDTAG: When text is being collected at the end, the initial tag will have an invisible BBCODE_ENDTAG added to the end of the input.
BBCODE_ENDTAG with no matching BBCODE_TAG: This is caught during the processing of BBCODE_ENDTAG: If the search up through the stack finds no matching BBCODE_TAG, this end tag is simply converted to plain text and pushed as a new BBCODE_TEXT entry.
Tag inside a class that doesn't allow it: If, for example, a user places a [center] inside an [i] tag, that's illegal, and NBBC has a special routine to handle this case. When a BBCODE_TAG is encountered, it is checked against the current containing class, which can be determined from the top of the stack; if the containing class is disallowed, this BBCODE_TAG will be converted to plain text and pushed as a BBCODE_TEXT token.
Mis-ordered [i][b]tag nesting[/i][/b]: This is detected during BBCODE_ENDTAG processing: If the walk back up the stack crosses over any non-matching start tags, NBBC generates a matching end tag for each of those start tags just before the current end tag, and sets a flag to indicate that any single following unmatched end tag with the same name should be silently removed from the input. We can be sure that any start tags that are crossed are not start tags of a higher class, because they would have been turned into plain text (by the previous error-handling rule) if they were. So in the example of [i][b]tag nesting[/i][/b], upon reaching the [/i], the parser would search for the [i], find the [b] first, and generate an additional [/b] before the [/i]; then, to help "fix" the output, NBBC will silently remove any one unmatched [/b] it finds later in the input stream. So the output would be <i><b>tag nesting</b></i>, which is about what the user might expect to get. Note that for some really ugly mis-nestings, this rule will produce weird results, but they'll still at least be valid XHTML 1.0 Strict.

Example processing

Let's look at how this would then process our example input. First, this is again our incoming token stream:

Code:

"Before." WS NL [center] "Hello," WS [i] "World" [/i] "!" [/center] NL "After."

This is what happens during the parse of this stream:

Initial class is "block".
"Before." read, and pushed onto the stack.
WS read, and pushed onto the stack.
NL read. WS on the stack is popped (NL consumes nearby WS), and the NL is pushed onto the stack.
[center] read. Class is checked, and [center] is allowed within "block". Pop the NL from the stack, since [center] consumes newlines. [center] is then pushed onto the stack, and class is changed to "block".
"Hello," read, and pushed onto the stack.
WS read, and pushed onto the stack.
[i] read. Class is checked, and [i] is allowed within "block". [i] is pushed onto the stack, and class is changed to "inline".
"World" read, and pushed onto the stack.
[/i] read. Search down the stack for a [i]. Remove all tokens in between and convert them to a string. Pass string to [i]'s processing function, and pop [i] from stack. Push the resulting HTML, "<i>World</i>", onto the stack.
"!" read, and pushed onto the stack.
[/center] read. Search down the stack for a [center]. Remove all tokens in between and convert them to a string. Pass string to [center]'s processing function, and pop [center] from stack. Push the resulting HTML, "<div style="text-align:center;">Hello, <i>World</i>!</div>", onto the stack. Remove any trailing NL or WS tokens, since [/center] consumes newlines.
(The next NL was skipped by [/center], so it's not seen by the main parsing loop.)
"After." read, and pushed onto the stack.
End-of-input. Main parsing loop ends. Everything still on the stack is collected as a string, "Before.<div style="text-align:center;">Hello, <i>World</i>!</div>After.", which is then returned to the caller.

For what it's worth, you can test this yourself by enabling debug mode: In debug mode, the parser prints to the browser each action it performs, separating each token with horizontal rules. In addition to being useful for debugging new tags, this mode can also be useful for divining the basic workings of the parser itself.

[ Previous: F. CSS Examples | Next: H. Tag Parameters ]