<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Web 2.1 &#187; parsing</title>
	<atom:link href="http://web.2point1.com/tag/parsing/feed/" rel="self" type="application/rss+xml" />
	<link>http://web.2point1.com</link>
	<description>Tim Whitlock&#039;s home in the Blogohedron</description>
	<lastBuildDate>Thu, 13 May 2010 21:26:34 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>jParser and jTokenizer released</title>
		<link>http://web.2point1.com/2009/11/14/jparser-and-jtokenizer-released/</link>
		<comments>http://web.2point1.com/2009/11/14/jparser-and-jtokenizer-released/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 17:24:52 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[jParser]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[rainy day]]></category>

		<guid isPermaLink="false">http://web.2point1.com/?p=243</guid>
		<description><![CDATA[After nearly two years I&#8217;ve finally gotten around to releasing my PHP JavaScript parser, although documentation is still thin on the ground.

Download jParser 1.0.0 (recommended)

Download jParser devel package (Full source and build scripts)
See the library examples running at timwhitlock.info/jparser

The library has been split in two:

jTokenizer &#8211; A JavaScript tokenizer designed to mimic the PHP tokenizer.
jParser [...]]]></description>
			<content:encoded><![CDATA[<p>After nearly two years I&#8217;ve <em>finally</em> gotten around to releasing my PHP JavaScript parser, although documentation is still thin on the ground.</p>
<ul>
<li><strong><a href="http://web.2point1.com/wp-content/uploads/2009/11/jparser-1-0-0.tgz">Download jParser 1.0.0</a> </strong><strong>(recommended)<br />
</strong></li>
<li><a href="http://web.2point1.com/wp-content/uploads/2009/11/jparser-devel-1-0-0.tgz">Download jParser devel package</a> (Full source and build scripts)</li>
<li>See the library examples running at <a href="http://timwhitlock.info/jparser/" target="_blank">timwhitlock.info/jparser</a></li>
</ul>
<p>The library has been split in two:</p>
<ol>
<li><strong>jTokenizer</strong> &#8211; A JavaScript tokenizer designed to mimic the <a href="http://www.php.net/manual/en/book.tokenizer.php" target="_blank">PHP tokenizer</a>.</li>
<li><strong>jParser </strong>- The fully blown JavaScript syntactical parser which generates a parse tree.</li>
</ol>
<p><span id="more-243"></span>The reason for the split is that for most purposes where you think you need a parser, you in fact just need a tokenizer. The tokenizer library is about 15KB, whereas the parser is over 700KB (minified), so you can see why you might not want to include it unnecessarily.</p>
<p>The library files <code>jparser.php</code> and <code>jtokenizer.php</code> are self-contained, minified files for production use. If you wish to inspect or modify the code you will need to download the devel package. This package provides a build script which collates the libraries into their distributable files.</p>
<h3>jTokenizer</h3>
<p>Possible uses for the tokenizer include code highlighting and simple manipulation of JavaScript source code.</p>
<p>The main function you will want to use is <code>j_token_get_all</code> which behaves the same as the PHP <a href="http://www.php.net/manual/en/function.token-get-all.php" target="_blank">token_get_all</a> function with the addition of a column number as well as a line number. Additionally there is the <code>j_token_name</code> as per the PHP <a href="http://www.php.net/manual/en/function.token-name.php" target="_blank">token_name</a> function.</p>
<h3>jParser</h3>
<p>This is a full, syntactical parser. On its own it simply generates a parse tree which can be traversed and manipulated. There is no proper documentation on this yet, but take a look at the node classes in the devel package if you are serious about doing something useful with this parser.</p>
<h3>Some other notes in no particular order</h3>
<p>The full parser uses a lot of juice. I recommend giving PHP loads of memory, and be careful what you throw at it if you&#8217;re going to run it on a production server.</p>
<p>A parser is not an interpreter or a JavaScript engine. If you want to develop such a thing in PHP you might be insane, but it could be done with this parser as a base.</p>
<p>The JParser parse tree is purposefully not a full tree; it collapses redundant nodes to save memory. If you want to see a full tree then take a look at the <code>JParserRaw</code> class (devel package required).</p>
<p>Splitting the parsing process into two parts (tokenize/parse) is probably not the most efficient and probably uses more memory than it would another way. However, I figured it would be neat to mimic the PHP tokenizer functionality so that parsers could be built that take a stream of PHP tokens.</p>
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2009/11/14/jparser-and-jtokenizer-released/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>jParser grammar</title>
		<link>http://web.2point1.com/2009/02/26/jparser-grammar/</link>
		<comments>http://web.2point1.com/2009/02/26/jparser-grammar/#comments</comments>
		<pubDate>Thu, 26 Feb 2009 10:54:39 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[jParser]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2009/02/26/jparser-grammar/</guid>
		<description><![CDATA[I&#8217;ve been asked how I generate the JavaScript parse table for jParser, so I&#8217;m posting the grammar file here for anyone else who&#8217;s interested.
↓ JavaScript grammar file for jParser
 This file is in a (probably non-standard) variant of BNF notation. I&#8217;m not generating the tables with a tool like ANTLR primarily because I don&#8217;t write [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been asked how I generate the JavaScript parse table for jParser, so I&#8217;m posting the grammar file here for anyone else who&#8217;s interested.</p>
<p><a href="/wp-content/uploads/2009/03/jas.bnf" target="_blank"><strong>↓ JavaScript grammar file for jParser</strong></a></p>
<p><span id="more-100"></span> This file is in a (probably non-standard) variant of <a href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form" target="_blank">BNF notation</a>. I&#8217;m not generating the tables with a tool like ANTLR primarily because I don&#8217;t write C. I should also point out that I don&#8217;t come from a formalized computer science background, so don&#8217;t expect this to be 100% conventional.</p>
<p>I&#8217;ve developed a native PHP parse table generator that in turn uses a parser (based on itself) to parse this BNF grammar into a table. If you understand grammar files like this you&#8217;ll notice something a bit odd &#8211; the terminal symbols don&#8217;t go right down to individual characters; the grammar expects the source code to already have been tokenized into significant chunks, such as <code>J_NUMBER</code> representing an already identified numeric token. This was done deliberately to be compatible with PHP&#8217;s own <a href="http://uk.php.net/manual/en/ref.tokenizer.php" target="_blank">Tokenizer functions</a>. The underlying parser framework was designed such that PHP token based grammars could also be developed.</p>
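<p>To illustrate what &#8220;already tokenized&#8221; means, here is a very rough sketch of a tokenizer that turns source text into the kind of chunks a grammar like this consumes. It is written in JavaScript purely for illustration and is not how jTokenizer works internally; every token name except <code>J_NUMBER</code> is made up for the example.</p>
<pre class="code">// Hypothetical sketch: split source text into [ type, text ] pairs.
// Only J_NUMBER comes from the grammar above; J_IDENTIFIER is an assumed name.
function tokenize( source ){
    var pattern = /(\d+(?:\.\d+)?)|([A-Za-z_$][A-Za-z0-9_$]*)|(\s+)|(.)/g;
    var tokens = [];
    var match;
    while( ( match = pattern.exec( source ) ) !== null ){
        if( match[1] !== undefined ){
            tokens.push( [ 'J_NUMBER', match[1] ] );
        }
        else if( match[2] !== undefined ){
            tokens.push( [ 'J_IDENTIFIER', match[2] ] );
        }
        else if( match[4] !== undefined ){
            // any other single character stands for itself
            tokens.push( [ match[4], match[4] ] );
        }
        // whitespace ( match[3] ) is simply skipped
    }
    return tokens;
}
var tokens = tokenize( 'x = 42' );</pre>
<p>A grammar written against these tokens can then treat <code>J_NUMBER</code> as a single terminal symbol and never needs rules for individual digits.</p>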
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2009/02/26/jparser-grammar/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JavaScript Obfuscator and Minifier</title>
		<link>http://web.2point1.com/2008/06/14/javascript-obfuscator-and-minifier/</link>
		<comments>http://web.2point1.com/2008/06/14/javascript-obfuscator-and-minifier/#comments</comments>
		<pubDate>Sat, 14 Jun 2008 15:23:04 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[ECMAScript]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/06/14/javascript-obfuscator-and-minifier/</guid>
		<description><![CDATA[This tool is based on a full JavaScript parser that is part of a much bigger plan. I won&#8217;t go into that just yet, but along the way I&#8217;m going to be releasing useful tools like this as they come about. It&#8217;s useful to have some short term goals to keep up morale and ensure [...]]]></description>
			<content:encoded><![CDATA[<p>This tool is based on a <a href="http://web.2point1.com/2008/05/09/full-javascript-parser-for-php/">full JavaScript parser</a> that is part of a much bigger plan. I won&#8217;t go into that just yet, but along the way I&#8217;m going to be releasing useful tools like this as they come about. It&#8217;s useful to have some short term goals to keep up morale and ensure that the framework is working well.</p>
<p><strong>&gt; <a href="http://timwhitlock.info/plug/examples/JavaScript/j_obfuscate.php" target="_blank">Try it here:</a></strong><a href="http://timwhitlock.info/plug/examples/JavaScript/j_obfuscate.php" target="_blank"> Obfuscate and minify your JavaScript code</a></p>
<h3><span id="more-49"></span> Minifying</h3>
<p>There are other minifiers out there;</p>
<p><em>[ Link to project removed due to incredibly rude email from its author ]</em></p>
<p>The parser framework behind my attempt allows a great deal more power at the expense of being a much heftier package. The source code is about 700k, so this is anything but light-weight. The extra power means that, unlike many minifiers, line breaks are not required at all. Any JavaScript program can be stripped down to one, very long line of code. The reason for this is that it performs <a href="http://web.2point1.com/2008/06/01/jparser-now-with-automatic-semicolon-insertion/">automatic semicolon insertion</a>. It can do this because rather than just performing a lexical scan of the source, it compiles the full syntax of the program into a parse tree. Of course, a program with a syntax error will not be minified, but then why would you want it to?</p>
<h3>Obfuscation</h3>
<p>Another advantage of the syntactical parsing is that we stand a better chance of safely altering identifier names. This still has serious limitations though: because the code is not actually being executed, there are many situations that could result in disaster. The current version obfuscates all explicitly named, top-level entities, such as function names, function arguments, labels, and variable declarations. Member expressions are particularly problematic, and so I have not attempted to obfuscate these. I may at some point in future work on improving this, but as this is only a side project, I am not going to hold my breath.</p>
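<p>You can always take additional steps in your original code to better prepare for effective obfuscation. For example, assigning a built-in object such as <code>document</code> to a declared variable gives the obfuscator a name that it is able to shorten. Consider this:</p>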
<pre class="code">var myDocument = document;
var myElement = myDocument.getElementById('myId');</pre>
<p><strong><a href="http://timwhitlock.info/plug/examples/JavaScript/j_obfuscate.php" target="_blank">Try it and see</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/06/14/javascript-obfuscator-and-minifier/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>JParser now with Unicode support</title>
		<link>http://web.2point1.com/2008/06/08/jparser-now-with-unicode-support/</link>
		<comments>http://web.2point1.com/2008/06/08/jparser-now-with-unicode-support/#comments</comments>
		<pubDate>Sun, 08 Jun 2008 21:37:54 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[ECMAScript]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[tokenizing]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/06/08/jparser-now-with-unicode-support/</guid>
		<description><![CDATA[I&#8217;ve updated my JavaScript parser to include full Unicode support.
Check out the test interfaces for:
» Full parser;
» Code highlighting.
Code highlighting does not require the full syntactical parser, it just uses the tokenizer and does not break when a bad character is found.
What&#8217;s in?
When I say full Unicode support, what I mean is that Unicode characters [...]]]></description>
			<content:encoded><![CDATA[<p><strong>I&#8217;ve updated my JavaScript parser to include full Unicode support.</strong><br />
Check out the test interfaces for:<br />
<strong><span class="separator">»</span> <a href="http://timwhitlock.info/plug/examples/JavaScript/JParser.php" target="_blank">Full parser</a></strong>;<br />
<strong><span class="separator">»</span> <a href="http://timwhitlock.info/plug/examples/JavaScript/j_token_html.php" target="_blank">Code highlighting</a></strong>.</p>
<p><span id="more-47"></span>Code highlighting does not require the full syntactical parser, it just uses the tokenizer and does not break when a bad character is found.</p>
<h3>What&#8217;s in?</h3>
<p>When I say <em>full</em> Unicode support, what I mean is that Unicode characters inside string literals and comments were always implicitly supported, but now it can cope with Unicode characters in identifiers too (as per the ECMAScript standard). Support also includes the use of Unicode escape sequences, although for the sake of speed, the validity of these is not checked. Checking could be done at a later stage when the parse tree is employed to do something useful. The full range of whitespace and line-terminating characters has also been added, although not tested.</p>
<p>Unicode support makes the tokenizing process slower. Because of this it may be switched off if it is not needed. Most programmers with English as their first language are unlikely to use Unicode characters in &#8216;hand-written&#8217; identifiers, and I have to wonder if anyone has ever done so with an escape sequence.</p>
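<p>For anyone wondering what that looks like in practice, here is a small sketch of the kind of source the tokenizer now has to cope with. The variable names are arbitrary and chosen only for the example.</p>
<pre class="code">// A non-ASCII letter used directly as an identifier
var π = 3.14159;
// The same identifier written with a Unicode escape sequence ( \u03c0 is π )
var sameVariable = \u03c0;
// An escape may also appear in a declaration; this declares the variable abc
var \u0061bc = 'escaped';
var viaPlainName = abc;</pre>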
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/06/08/jparser-now-with-unicode-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JavaScript Syntax Nuances</title>
		<link>http://web.2point1.com/2008/06/07/javascript-syntax-nuances/</link>
		<comments>http://web.2point1.com/2008/06/07/javascript-syntax-nuances/#comments</comments>
		<pubDate>Sat, 07 Jun 2008 12:49:32 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[ECMAScript]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/06/07/javascript-syntax-nuances/</guid>
		<description><![CDATA[If you learn a programming language it is unlikely that you will read the formal language specification that defines all the laws of the syntax. You may never read it at all. It is more useful to learn by example, or at least topic-by-topic. However, a mere ten years after writing my first few lines [...]]]></description>
			<content:encoded><![CDATA[<p>If you learn a programming language it is unlikely that you will read the formal language specification that defines all the laws of the syntax. You may never read it at all. It is more useful to learn by example, or at least topic-by-topic. However, a mere ten years after writing my first few lines of JavaScript, I read the ECMAScript standard and it threw up some things I did not know.</p>
<p>There are many things that you can write in JavaScript that are perfectly valid syntax, but that you probably never will write. Here are a few that raised an eyebrow or two.</p>
<p><span id="more-34"></span></p>
<h3>Comma Operator</h3>
<p>What would you expect to see if you ran this code in your browser?</p>
<pre class="code">var test = [ 'a', 'b', 'c' ];
alert( test[ 0, 1, 2 ] );</pre>
<p>You might expect it to be a syntax error, but in fact <code>test[0,1,2]</code> evaluates in this example to <code>"c"</code>. The expressions ( <code>0</code>, <code>1</code>, and <code>2</code> ) are all evaluated from left to right, but only the value of the final one becomes the value of the whole expression, so this is the same as <code>test[2]</code>.</p>
<p>Similarly pointless constructs:</p>
<pre class="code">if( alert("Hello"), alert("World"), false ){
    alert("You will never see this");
}
var a = ( 1, 2, 3 ); // a will be set to 3</pre>
<h3>The &#8220;/&#8221; character</h3>
<p>The syntax does not discriminate between operand types. For example, you can <em>attempt</em> to divide any expression by any other expression, even if it makes no sense to do so; such as:</p>
<pre class="code">['a'] / 2</pre>
<p>This will evaluate to <code>NaN</code>, because this array converts to the string <code>"a"</code>, which is not a number, but it is perfectly valid syntax. The real point I&#8217;m getting to is that the following is an exception and will raise a syntax error:</p>
<pre class="code">{ a:1 } / 2;</pre>
<p>Why? Because the lexical analyser will expect <code>"/"</code> to be the start of a Regular Expression Literal, which it isn&#8217;t. It gets it &#8216;wrong&#8217; in this case, because the <code>"}"</code> is a tricky so-and-so; it could either be the end of an expression, or the termination of a block statement. The lexical analyser is not a full syntax parser; it knows nothing of the full grammar of the language, and so it makes a choice based on some fairly weak rules.</p>
<p>The following IS valid, but it is not a division operation, it is just a pointless mess:</p>
<pre class="code">{ a:1 } / 2 /i;</pre>
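<p>One common way for a tokenizer to make that choice is to look at the previous significant token. The sketch below only illustrates that kind of heuristic; it is not the rule that any particular engine or jTokenizer uses, and the function name and keyword list are assumptions.</p>
<pre class="code">// Hypothetical helper: guess whether "/" starts a regular expression literal.
// previousToken is the text of the last non-whitespace token, or null at the start.
function slashStartsRegex( previousToken ){
    if( previousToken === null ){
        return true; // nothing to divide yet
    }
    // Tokens that look like the end of a value: names, numbers, strings, ")" and "]"
    var valueLike = /^(?:[A-Za-z0-9_$"']|\.[0-9]|\)$|\]$)/;
    if( valueLike.test( previousToken ) ){
        // ...unless the "name" is a keyword that expects an expression to follow
        var keywords = [ 'return', 'typeof', 'instanceof', 'in', 'new', 'delete', 'void', 'throw', 'case', 'do', 'else' ];
        return keywords.indexOf( previousToken ) !== -1;
    }
    // After "}" and any other punctuator, guess "regex". This is the guess
    // that is wrong whenever the "}" closed an object literal.
    return true;
}</pre>
<p>With a rule like this, <code>slashStartsRegex( '}' )</code> is true while <code>slashStartsRegex( '2' )</code> is false, which is why the closing brace is the awkward case.</p>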
<h3>undefined vs null</h3>
<p><code>undefined</code> is not a literal, or even a reserved word, whereas <code>null</code> is. <code>undefined</code> is just a property of the global object, so if you were stupid enough, you could do this:</p>
<pre class="code">undefined = true;</pre>
<p>This is allowed by the same token that allows you to foolishly do this:</p>
<pre class="code">Array = null;
var a = new Array();
// TypeError: Array is not a constructor</pre>
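<p>The difference shows up at parse time. Here is a small sketch that uses the <code>Function</code> constructor only so that the parsing happens somewhere an error can be caught; the helper name <code>parses</code> is made up for the example.</p>
<pre class="code">// Returns true if the source text parses as a function body ( it is never run )
function parses( source ){
    try {
        new Function( source );
        return true;
    }
    catch( error ){
        return false;
    }
}
var nullAssignable = parses( 'null = true;' );           // null is a literal
var undefinedAssignable = parses( 'undefined = true;' ); // undefined is an identifier</pre>
<p>Note that ECMAScript 5 and later make the global <code>undefined</code> read-only, so in current engines the second assignment still parses but is ignored, or throws a <code>TypeError</code> in strict mode.</p>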
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/06/07/javascript-syntax-nuances/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JParser now with Automatic Semicolon Insertion</title>
		<link>http://web.2point1.com/2008/06/01/jparser-now-with-automatic-semicolon-insertion/</link>
		<comments>http://web.2point1.com/2008/06/01/jparser-now-with-automatic-semicolon-insertion/#comments</comments>
		<pubDate>Sun, 01 Jun 2008 15:34:19 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[ECMAScript]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[parsing]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/06/01/jparser-now-with-automatic-semicolon-insertion/</guid>
		<description><![CDATA[I finally found a spare few hours to implement Automatic Semicolon Insertion into my JavaScript  Parser.
Check out the test interface here.
Section 7.9 of the ECMAScript language specification details the circumstances under which you are permitted to omit a semicolon  in a statement which otherwise requires it. In short, there are situations where this [...]]]></description>
			<content:encoded><![CDATA[<p>I finally found a spare few hours to implement Automatic Semicolon Insertion into my JavaScript  Parser.<br />
<strong><a href="http://timwhitlock.info/plug/examples/JavaScript/JParser.php" target="_blank">Check out the test interface here</a></strong>.</p>
<p><span id="more-41"></span>Section 7.9 of the <a href="http://www.ecma-international.org/publications/standards/Ecma-262.htm" target="_blank">ECMAScript language specification</a> details the circumstances under which you are permitted to omit a semicolon  in a statement which otherwise requires it. In short, there are situations where this lazy practice is not the end of the world; it is fairly clear what the programmer is trying to do, so it lets you off.</p>
<p>I&#8217;ve managed to implement <em>most </em>of these conditions, but not all. I still have some work to do on the  more peculiar conditions, but the ones that are relevant to most common practice should be covered. Give it a bash, and let me know if you can break it!</p>
<p>If you don&#8217;t know what I&#8217;m talking about, consider this:</p>
<pre class="code">a = b
c = d</pre>
<p>Most JavaScript programmers would be disciplined enough to write:</p>
<pre class="code">a = b;
c = d;</pre>
<p>However, there are some conditions that probably catch us all out.<br />
What is the difference between this:</p>
<pre class="code">function xfunc(){
}</pre>
<p>and this?:</p>
<pre class="code">var xfunc = function(){
}</pre>
<p>The former is a <em>Function Declaration</em>, and does not require termination, but the latter is <em>a Function Expression</em> assigned within a <em>Variable Statement</em>, and so it requires a semicolon to terminate it. A very common &#8216;mistake&#8217;, but solved quite comfortably by the rules of Automatic Semicolon Insertion, which transform it into:</p>
<pre class="code">var xfunc = function(){
};</pre>
<p>There are plenty more cases I could discuss, but if you&#8217;re really interested, I suggest reading the <a href="http://www.ecma-international.org/publications/standards/Ecma-262.htm" target="_blank">ECMA-262</a> specification yourself.</p>
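<p>One of those other cases deserves a quick sketch, though, because here the inserted semicolon changes the meaning rather than rescuing it. A line break is not permitted between <code>return</code> and its expression, so a semicolon is inserted straight after <code>return</code>. The function names are just for illustration.</p>
<pre class="code">function broken(){
    return      // a semicolon is inserted here
        42;     // never reached
}
function fixed(){
    return 42;
}
var brokenResult = broken(); // undefined
var fixedResult = fixed();   // 42</pre>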
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/06/01/jparser-now-with-automatic-semicolon-insertion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Full JavaScript parser for PHP</title>
		<link>http://web.2point1.com/2008/05/09/full-javascript-parser-for-php/</link>
		<comments>http://web.2point1.com/2008/05/09/full-javascript-parser-for-php/#comments</comments>
		<pubDate>Fri, 09 May 2008 21:29:15 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[ECMAScript]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/05/09/full-javascript-parser-for-php/</guid>
		<description><![CDATA[[ Update 18 Nov 2009 ]
This article is rather old now &#8211; the jParser code has been released
&#8211;
Despite the glorious sunshine this week, my week off, I managed to put some time into my pet project of developing a full JavaScript parser written in 100% native PHP. Actually, I&#8217;ve been developing a generic parser suite [...]]]></description>
			<content:encoded><![CDATA[<p><strong>[ Update 18 Nov 2009 ]</strong></p>
<p>This article is rather old now &#8211; the <a href="http://web.2point1.com/2009/11/14/jparser-and-jtokenizer-released/">jParser code has been released</a><br />
&#8211;</p>
<p>Despite the glorious sunshine this week, my week off, I managed to put some time into my pet project of developing a full JavaScript parser written in 100% native PHP. Actually, I&#8217;ve been developing a generic parser suite for some time, and using it to build a full JavaScript parser was my ultimate goal to be satisfied that it all works and is powerful enough to be useful. I&#8217;ve written a bunch of blogs about developing a parser generator in PHP, (<a title="Parsing blogs" href="/tag/parsing/">click &#8220;parsing&#8221;</a> to do a tag search).</p>
<p>Before I start wittering on,<br />
<strong><a href="http://timwhitlock.info/plug/examples/JavaScript/JParser.php" target="_blank">Click here to play with the online example of JParser<br />
</a></strong></p>
<p><span id="more-35"></span></p>
<p>Here are the main difficulties I encountered while building the JavaScript parser:</p>
<p><strong>1. Performance<br />
</strong> Generating the parse table was taking about 30 minutes and using several hundred megabytes of memory. Going back to the drawing board with certain parts of the parse table generator, I&#8217;ve managed to get this down to about 7 minutes on my humble Mac Mini.</p>
<p><strong>2. Special rules<br />
</strong> The <a title="ECMA 262 Edition 3" href="http://www.ecma-international.org/publications/standards/Ecma-262.htm" target="_blank">ECMAScript standard</a> states certain special cases in the grammar rules. One of particular note (clause 12.4) says that an <em>ExpressionStatement</em> may not begin with a <code>"{"</code> or a <code>"function"</code>. This special rule avoids ambiguity and therefore avoids parse table conflicts, but the rule is effectively outside of the grammar. I&#8217;ve finally found the right part of the parser architecture to implement such rules.</p>
<p><strong> 3. Automatic semicolon insertion</strong><br />
As you probably know just from writing JavaScript, the <a title="ECMA 262 Edition 3" href="http://www.ecma-international.org/publications/standards/Ecma-262.htm" target="_blank">ECMAScript standard</a> permits the lazy omission of semicolons at the end of some statements, as long as you terminate with a line break instead. This is actually more complex than it sounds, but more to the point, it is another special rule that is not directly a part of the grammar and is handled at parse time.<br />
<span style="color: #ff0000;">[<strong>UPDATE</strong>: Automatic semicolon insertion now implemented, <strong><a href="http://web.2point1.com/2008/06/01/jparser-now-with-automatic-semicolon-insertion/">See</a></strong>! ]</span></p>
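<p>The effect of the special rule in point 2 can be seen from JavaScript itself. This is only a sketch, using <code>eval</code> to show how the same characters are read differently at the start of a statement and inside an expression.</p>
<pre class="code">// At the start of a statement "{" opens a block, so a: 1 is a labelled statement
var asBlock = eval( '{ a: 1 }' );      // the number 1
// Wrapped in parentheses the same text must be an expression: an object literal
var asObject = eval( '({ a: 1 })' );   // an object whose property a is 1

// Likewise "function" at the start of a statement begins a declaration,
// which needs a name, so calling an anonymous function this way fails to parse
var parseFailed = false;
try {
    eval( 'function(){}()' );
}
catch( error ){
    parseFailed = error instanceof SyntaxError;
}</pre>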
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/05/09/full-javascript-parser-for-php/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
		<item>
		<title>Parsing for PHP developers &#8211; Part III</title>
		<link>http://web.2point1.com/2008/04/06/parsing-for-php-developers-part-iii/</link>
		<comments>http://web.2point1.com/2008/04/06/parsing-for-php-developers-part-iii/#comments</comments>
		<pubDate>Sun, 06 Apr 2008 21:47:17 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/04/06/parsing-for-php-developers-part-iii/</guid>
		<description><![CDATA[JSON Parser
If you haven&#8217;t read Part 1, or Part 2 they are there for the reading.
I&#8217;m going to demo a JSON parser in this post. It&#8217;s 100% native PHP code, and is based on the work I&#8217;ve done toward my ultimate goal of a full JavaScript parser.
Click here to play with the interactive JSONParser demo
I [...]]]></description>
			<content:encoded><![CDATA[<h3>JSON Parser</h3>
<p>If you haven&#8217;t read <a href="http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/">Part 1</a>, or <a href="http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/">Part 2</a> they are there for the reading.</p>
<p>I&#8217;m going to demo a <a href="http://en.wikipedia.org/wiki/JSON" title="JSON reference" target="_blank">JSON</a> parser in this post. It&#8217;s 100% native PHP code, and is based on the work I&#8217;ve done toward my ultimate goal of a full JavaScript parser.</p>
<p><strong><a href="http://timwhitlock.info/plug/examples/JavaScript/JSON/JSONParser.php">Click here to play with the interactive JSONParser demo</a></strong></p>
<p>I thought I&#8217;d get this example online now as my ultimate goal is taking longer than I had hoped. I shan&#8217;t go into the details, suffice to say that the JSON grammar below is a very tiny subset of the full JavaScript grammar and doesn&#8217;t really have any complex rules. </p>
<p><span id="more-19"></span><br />
Here&#8217;s the JSON grammar I put together.</p>
<pre class="code" style="overflow: auto; max-height: 300px">&lt;JSON_OBJECT_LITERAL&gt;
	: "{" "}"
	| "{" &lt;JSON_PROP_LIST&gt; "}"
	;

&lt;JSON_PROP_LIST&gt;
	: JSON_STRING_LITERAL ":" &lt;JSON_LITERAL&gt;
	| &lt;JSON_PROP_LIST&gt; "," JSON_STRING_LITERAL ":" &lt;JSON_LITERAL&gt;
	;

&lt;JSON_LITERAL&gt;
	: JSON_STRING_LITERAL
	| JSON_NUMERIC_LITERAL
	| &lt;JSON_ARRAY_LITERAL&gt;
	| &lt;JSON_OBJECT_LITERAL&gt;
	| "true"
	| "false"
	| "null"
	;

&lt;JSON_ARRAY_LITERAL&gt;
	: "[" "]"
	| "[" &lt;JSON_ELEMENT_LIST&gt; "]"
	;

&lt;JSON_ELEMENT_LIST&gt;
	: &lt;JSON_LITERAL&gt;
	| &lt;JSON_ELEMENT_LIST&gt; "," &lt;JSON_LITERAL&gt;
	;</pre>
<p>The grammar notation of the full JavaScript language may only be about 12 times the size of the JSON grammar above, but the parse table it generates is hundreds of times bigger. The <a href="http://timwhitlock.info/plug/examples/JavaScript/JSON/JSONParseTable.php" target="_blank">JSON parse table</a> was generated in just a few milliseconds and the <a href="http://timwhitlock.info/plug/examples/JavaScript/JSON/JSONParseTable.phps" target="_blank">PHP source code for the table</a> alone is about 3K. In comparison; my current JavaScript parse table generator takes about 7 minutes and the table source code is about 800k.</p>
<p>Anyhow, I digress. The purpose of showing off the JSON parser is to underline the usefulness of the parse tree, as I touched on in <a href="http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/">part 2</a>, and of course to make it relevant to PHP :)</p>
<h3>Parse node classes</h3>
<p>Each node in the parse tree is assigned a different PHP class which extends a vanilla flavoured node class. You can manipulate these nodes much as you would with an XML or DOM tree. Most importantly you can write custom routines to <em>evaluate</em> them. When you evaluate the root node you begin a recursive procedure which ultimately gives you a value, or object that represents the whole structure. In this case an associative array which is the deserialized JSON object.</p>
<p>These nodes don&#8217;t need much code either. For example, the terminal symbol <code>JSON_NUMERIC_LITERAL</code> has a class assigned to it whose evaluate method simply returns its string value as a native PHP number. The nodes for <code>JSON_OBJECT_LITERAL</code> and <code>JSON_ARRAY_LITERAL</code> are obviously a bit more complex, but I&#8217;m sure you get the idea. It doesn&#8217;t take much imagination to see that once the parser has given you a tree it&#8217;s very easy to hook in whatever logic you want.</p>
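<p>To make the idea concrete, here is a rough sketch of such node classes. It is written in JavaScript rather than PHP, the class names are invented, and the tree is built by hand where the real thing would come out of the parser.</p>
<pre class="code">// Terminal node: holds the matched text and evaluates to a native number
function NumericNode( text ){ this.text = text; }
NumericNode.prototype.evaluate = function(){
    return Number( this.text );
};
// Terminal node: strips the quotes ( escape sequences are ignored in this sketch )
function StringNode( text ){ this.text = text; }
StringNode.prototype.evaluate = function(){
    return this.text.slice( 1, -1 );
};
// Non-terminal: evaluates each child element in turn
function ArrayNode( elements ){ this.elements = elements; }
ArrayNode.prototype.evaluate = function(){
    return this.elements.map( function( node ){ return node.evaluate(); } );
};
// Non-terminal: props is a list of [ StringNode, value node ] pairs
function ObjectNode( props ){ this.props = props; }
ObjectNode.prototype.evaluate = function(){
    var result = {};
    this.props.forEach( function( pair ){
        result[ pair[0].evaluate() ] = pair[1].evaluate();
    } );
    return result;
};
// Hand-built tree for {"id":7,"tags":["a","b"]}
var root = new ObjectNode( [
    [ new StringNode( '"id"' ), new NumericNode( '7' ) ],
    [ new StringNode( '"tags"' ), new ArrayNode( [ new StringNode( '"a"' ), new StringNode( '"b"' ) ] ) ]
] );
var value = root.evaluate();</pre>
<p>Evaluating the root is the only call the outside world makes; each node is responsible for evaluating its own children.</p>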
<p>This was only an academic exercise, particularly as PHP5 has a <a href="http://uk2.php.net/manual/en/ref.json.php" target="_blank">JSON extension</a> enabled by default. The PHP <a href="http://uk2.php.net/manual/en/function.json-decode.php" target="_blank">json_decode</a> function is much faster than my parser, and of course my parser is only one-directional, but it shows that if an extension doesn&#8217;t exist for what you want to parse, it is possible to write one in native PHP.</p>
<p>As is usual for me, this blog has very little direction and I am not sure what the topic of the next post will be, or if there was even a topic for this one. However, my goals for this body of work are clear and I look forward to demonstrating a fully working JavaScript parser some day soon. I will also release some code eventually &#8211; honest.</p>
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/04/06/parsing-for-php-developers-part-iii/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing for PHP developers &#8211; Part II</title>
		<link>http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/</link>
		<comments>http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/#comments</comments>
		<pubDate>Sun, 30 Mar 2008 17:55:32 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/</guid>
		<description><![CDATA[In part 1 I introduced and demonstrated the parsing concept using a very simple date parser. In this part I am going to talk about the important role of tokenizing. If you haven&#8217;t read part 1 this may not make much sense, so read it now if you haven&#8217;t already.
Syntactical vs Lexical
Looking again at the [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/">part 1</a> I introduced and demonstrated the parsing concept using a very simple date parser. In this part I am going to talk about the important role of <em>tokenizing</em>. If you haven&#8217;t read part 1 this may not make much sense, so <a href="http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/">read it now</a> if you haven&#8217;t already.</p>
<h3><a href="http://dictionary.reference.com/browse/syntactical" title="definition of syntactical" target="_blank"><em>Syntactical</em></a> vs <em><a href="http://dictionary.reference.com/browse/lexical" title="definition of lexical" target="_blank">Lexical</a></em></h3>
<p>Looking again at the simple grammar of part 1, you may notice that the rule <code>&lt;D_DIGIT&gt; ::= "0" | "1" ... "9"</code> is a bit different to all the others. It does not really contribute to the <em>syntax</em> of our language; it merely describes the legal characters that make up a single digit. It is convenient to view this aspect of the language as a subset of the grammar; one that is concerned only with what input &#8216;looks like&#8217; rather than where it appears. This can be called the <em><a href="http://dictionary.reference.com/browse/lexical" title="definition of lexical" target="_blank">lexical</a> </em>grammar. The rest of the language, which is concerned with syntax, can be called the <a href="http://dictionary.reference.com/browse/syntactical" title="definition of syntactical" target="_blank"><em>syntactical </em></a>grammar.<span id="more-16"></span></p>
<p>If we were to run our date string through a <em>lexical </em>parser first, it could output a more organized set of symbols that we could then feed into our <em>syntactical</em> parser as input. The syntactical grammar could then safely take for granted that a symbol <code>D_DIGIT</code> is a perfectly valid <em>terminal</em> symbol and wouldn&#8217;t need to worry about what characters it might have contained.  So the input for our simple date parser could be transformed from:</p>
<pre class="code">"1976-11-03 18:15:00"</pre>
<p>- to something like:</p>
<pre class="code" style="height: 3em">[ D_DIGIT D_DIGIT D_DIGIT D_DIGIT "-" D_DIGIT D_DIGIT "-" D_DIGIT D_DIGIT " " D_DIGIT D_DIGIT ":" D_DIGIT D_DIGIT ":" D_DIGIT D_DIGIT ]</pre>
<p>And because the non-terminal symbol <code>&lt;D_DIGIT&gt;</code> has been scrapped in favour of the terminal symbol <code>D_DIGIT</code>, a grammar rule like:</p>
<pre class="code">&lt;D_MONTH&gt; ::= &lt;D_DIGIT&gt; &lt;D_DIGIT&gt;</pre>
<p>- could become</p>
<pre class="code">&lt;D_MONTH&gt; ::= D_DIGIT D_DIGIT</pre>
<p>This might seem a bit pointless at this stage. Having two grammars (and two parsers) for our simple date language is probably not necessary. One of the factors that makes our original test grammar so simple is that all of its terminal symbols are single characters, and only a restricted set of characters at that. This made it fairly easy to embed the lexical information within a single grammar.  However, as we consider more complex languages this practice soon becomes a problem and slows things down immensely. Consequently, this separation of the two grammars becomes increasingly useful. Categorizing symbols together like this doesn&#8217;t just make our grammar neater, it also makes the parser hugely more efficient.</p>
<h3>Tokenization</h3>
<p>The lexical parsing process (called <a href="http://en.wikipedia.org/wiki/Lexing#Tokenizer" target="_blank">&#8216;lexing&#8217;, or &#8216;tokenizing&#8217;</a>) can often be much simpler and easier to achieve than the full, syntactical parse. If we want to parse the PHP language itself we don&#8217;t have to worry about this step at all, because we have a built-in tokenizer at our disposal &#8211; the <a href="http://uk.php.net/manual/en/ref.tokenizer.php" target="_blank">PHP Tokenizer</a>. There are other benefits to having a separate tokenizer too. Perhaps you want to process a string, but you don&#8217;t need a full parse tree or even need to check the syntax. A good example of this is <a href="http://timwhitlock.info/plug/examples/functions/PHP/php_highlight_string.php" title="Improved PHP source highlighting" target="_blank">code highlighting</a>.</p>
<p>It looks like our vocabulary is getting bigger, so let&#8217;s have a terminology catch-up as I start to talk about tokenizing in the context of PHP.</p>
<p>A <em>&#8216;<strong>symbol</strong>&#8217;</em>, whether terminal or not, is always scalar. As with our date grammar, a terminal symbol could represent a piece of input literally, like &#8220;2&#8221;, or as with PHP, it could stand for a <em>type </em>of input such as <a href="http://uk.php.net/manual/en/tokens.php" target="_blank">T_LNUMBER</a> whose value may be any integer literal. The latter type of symbol is represented internally as an integer constant. Non-terminal symbols never represent actual input, and so will always be integer constants.</p>
<p>A <em>&#8216;<strong>token</strong>&#8217;</em> represents a single piece of [pre-processed] input. It is identified by a <em>terminal symbol</em>, but it may contain more information. For a symbol like T_LNUMBER, an input token needs to contain both the symbol and a literal value. We use an array as the simplest way to express this:<br />
<code>array ( T_LNUMBER, "123" );</code> The literal value at offset 1 is not actually consulted during syntactical parsing, but if you&#8217;re going to do anything useful with your parser other than just check syntax, you&#8217;ll almost certainly need this data later on. It is of course also useful for debugging and error messages during parsing.</p>
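<p>PHP&#8217;s own tokenizer shows this token shape directly. The following is a small sketch, assuming the tokenizer extension is loaded (it normally is). Note that the arrays returned by <code>token_get_all</code> also carry a line number at offset 2, and that single-character tokens such as "+" come back as plain strings.</p>
<pre class="code">&lt;?php
// Tokenize a tiny PHP source string with the built-in tokenizer
$tokens = token_get_all('&lt;?php 123 + 4.5');

$names = array();
foreach( $tokens as $token ){
    if( is_array($token) ){
        // array( symbol, literal value, line number )
        $names[] = token_name( $token[0] );
    }
    else {
        // a literal, single-character terminal symbol
        $names[] = $token;
    }
}
// $names now lists the terminal symbols in input order,
// including the T_WHITESPACE tokens between them</pre>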
<p>It made sense for me to base my parser&#8217;s notion of a <em>token</em> on the PHP tokenizer&#8217;s model, but really there are many ways to represent such concepts and it is largely a programming design issue. Hopefully by now you get the basic premise that tokenizing raw input into more manageable data makes life easier in performing a full, syntactical parse.</p>
<p>My parser package does not currently contain the kind of automated tokenizer that you&#8217;ll find in the world of [ahem] <em>proper </em>programming languages (e.g. <a href="http://en.wikipedia.org/wiki/Lex_programming_tool" target="_blank">Lex</a> for C), but this is in the pipeline. If you have to write your own tokenizer, you can go about it any way you like as long as what you end up with is an array of tokens supported by the parser. In theory you could build a lexical grammar with rules like <code>&lt;T_FUNCTION&gt; ::= "f" "u" "n" "c" "t" "i" "o" "n"</code>, and use the same parser that your syntactical grammar will use, but this is a pretty bad idea. I have found it a hell of a lot easier and far more efficient to write a bespoke function. It doesn&#8217;t have to be complex &#8211; here is an example of a possible tokenizer for our humble date grammar: <a href="http://timwhitlock.info/plug/examples/parsing/__Test/tokenize_date_string.phps" target="_blank">the source</a> and <a href="http://timwhitlock.info/plug/examples/parsing/__Test/tokenize_date_string.php" target="_blank">the output</a>. PHP also provides the <a href="http://uk.php.net/manual/en/function.strtok.php" target="_blank">strtok</a> function, but I have not found it useful.</p>
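<p>For readers who just want the gist, a bespoke tokenizer for the date grammar might look something like the sketch below. It is not the linked source: the function name <code>sketch_tokenize_date</code> and the integer chosen for <code>D_DIGIT</code> are assumptions made purely for illustration.</p>
<pre class="code">&lt;?php
// Arbitrary integer for the terminal symbol; any unique value would do
define( 'D_DIGIT', 1000 );

// Hypothetical helper: turns a date string into an array of tokens
function sketch_tokenize_date( $src ){
    $tokens = array();
    foreach( str_split($src) as $chr ){
        if( ctype_digit($chr) ){
            // a type of input: symbol plus literal value
            $tokens[] = array( D_DIGIT, $chr );
        }
        else if( $chr === '-' || $chr === ':' || $chr === ' ' ){
            // punctuation stands for itself
            $tokens[] = $chr;
        }
        else {
            throw new Exception( 'Unexpected character "'.$chr.'"' );
        }
    }
    return $tokens;
}

$tokens = sketch_tokenize_date('1976-11-03 18:15:00');</pre>
<p>Each digit becomes an <code>array( D_DIGIT, "..." )</code> token and each punctuation character is passed through as a literal terminal, so the syntactical grammar never needs to know which characters count as digits.</p>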
<h3>Parsing PHP</h3>
<p>In the case of PHP of course, we don&#8217;t even need to bother with all this; we just call <a href="http://uk.php.net/manual/en/function.token-get-all.php" target="_blank">token_get_all</a> and we&#8217;re ready to parse. So before this article gets too long, let&#8217;s create a mini grammar that can use the PHP Tokenizer, and then run it through my parser as before. The following language merely adds up numbers, that&#8217;s it &#8211; that&#8217;s all it does.</p>
<pre class="code">1. &lt;NT_PROGRAM&gt; ::=
2.    T_OPEN_TAG &lt;NT_SUM&gt; |
3.    T_OPEN_TAG &lt;NT_SUM&gt; T_CLOSE_TAG;
4. &lt;NT_SUM&gt; ::=
5.    &lt;NT_NUMBER&gt; |
6.    &lt;NT_NUMBER&gt; "+" &lt;NT_SUM&gt;;
7. &lt;NT_NUMBER&gt; ::=
8.    T_LNUMBER |
9.    T_DNUMBER;</pre>
<p>This tiny subset of PHP syntax must start with <code>&lt;?php</code> simply for the tokenizer to work, but the closing tag is optional. This is defined in the first rule on lines 1-3. The second rule on lines 4-6 is recursive. It defines a sum as either a single number, or a number followed by "+" and a further sum, so numbers can be chained together indefinitely. The third rule on lines 7-9 defines that both <em>integers</em> and <em>doubles</em> are valid numbers in this language.</p>
<p>If we parse this source: <code>"&lt;?php 100 + 2.2 ?&gt;"</code>, we get the following parse tree:</p>
<pre class="code">[009] &lt;NT_PROGRAM&gt; :
[000] .  "T_OPEN_TAG" = '&lt;?php'
[007] .  &lt;NT_SUM&gt; :
[002] .  .  &lt;NT_NUMBER&gt; :
[001] .  .  .  "T_LNUMBER" = '100'
[002] .  .  &lt;/NT_NUMBER&gt;
[003] .  .  "+"
[006] .  .  &lt;NT_SUM&gt; :
[005] .  .  .  &lt;NT_NUMBER&gt; :
[004] .  .  .  .  "T_DNUMBER" = '2.2'
[005] .  .  .  &lt;/NT_NUMBER&gt;
[006] .  .  &lt;/NT_SUM&gt;
[007] .  &lt;/NT_SUM&gt;
[008] .  "T_CLOSE_TAG" = '?&gt;'
[009] &lt;/NT_PROGRAM&gt;</pre>
<p><a href="http://timwhitlock.info/plug/examples/parsing/__Test/PHPSumParser.php" title="Interactive php sum parser" target="_blank"><strong>Click here to try the interactive version</strong></a>.</p>
<p>The structure of this parse tree shows that the <code>&lt;NT_SUM&gt;</code> nodes are nested recursively. This is a natural consequence of the grammar rule on lines 4-6. It is simple enough to &#8216;flatten&#8217; the tree if you needed to, but ultimately this structure reflects the inherent nature of the language which might be essential for whatever you&#8217;re going to do with it next.</p>
<p>You&#8217;ll see that the <a href="http://timwhitlock.info/plug/examples/parsing/__Test/PHPSumParser.php" target="_blank">interactive demo</a> actually gives the result of the sum. This is achieved by <em>evaluating </em>the parse tree, which is when things start to get a bit more useful, and a lot more interesting. I&#8217;ll get to that topic in another post to follow soon.</p>
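<p>As a rough illustration of both the nesting and the flattening, here is a sketch that builds a nested <code>NT_SUM</code> structure by hand and then flattens it. It uses a simple recursive-descent function rather than the LR(1) parser behind the demo, it ignores the open and close tags instead of checking them, and the array layout and the <code>sketch_</code> function names are assumptions made for this example.</p>
<pre class="code">&lt;?php
// Keep only the number tokens and "+" from the PHP tokenizer output
function sketch_sum_tokens( $src ){
    $tokens = array();
    foreach( token_get_all($src) as $token ){
        if( $token === '+' ){
            $tokens[] = $token;
        }
        else if( is_array($token) &amp;&amp; ( $token[0] === T_LNUMBER || $token[0] === T_DNUMBER ) ){
            $tokens[] = $token;
        }
    }
    return $tokens;
}

// &lt;NT_SUM&gt; ::= &lt;NT_NUMBER&gt; | &lt;NT_NUMBER&gt; "+" &lt;NT_SUM&gt;
function sketch_parse_sum( array $tokens, $pos = 0 ){
    if( ! isset($tokens[$pos]) || ! is_array($tokens[$pos]) ){
        throw new Exception( 'Number expected at token '.$pos );
    }
    $node = array( 'NT_SUM', array( 'NT_NUMBER', $tokens[$pos] ) );
    if( isset($tokens[$pos+1]) ){
        if( $tokens[$pos+1] !== '+' ){
            throw new Exception( '"+" expected at token '.($pos+1) );
        }
        // the right-hand operand is itself a sum, hence the nesting
        $node[] = '+';
        $node[] = sketch_parse_sum( $tokens, $pos + 2 );
    }
    return $node;
}

// Flatten the nested sums into a plain list of native PHP numbers
function sketch_flatten_sum( array $node ){
    $numbers = array( $node[1][1][1] + 0 );
    if( isset($node[3]) ){
        $numbers = array_merge( $numbers, sketch_flatten_sum($node[3]) );
    }
    return $numbers;
}

$tree = sketch_parse_sum( sketch_sum_tokens('&lt;?php 100 + 2.2 ?&gt;') );
$numbers = sketch_flatten_sum( $tree );</pre>
<p>Passing <code>$numbers</code> to <code>array_sum</code> would then give the total, although walking the nested tree directly works just as well.</p>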
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing for PHP developers &#8211; Part I</title>
		<link>http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/</link>
		<comments>http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/#comments</comments>
		<pubDate>Mon, 24 Mar 2008 10:35:33 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[php]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/</guid>
		<description><![CDATA[Parsing is a fairly common word in the web developer&#8217;s vocabulary. We do it all the time. One immediately thinks of XML as something we parse regularly without batting an eyelid. As a PHP developer you might also parse an ini file with parse_ini_file, or parse a date string with strtotime. Whatever language you write, [...]]]></description>
			<content:encoded><![CDATA[<p>Parsing is a fairly common word in the web developer&#8217;s vocabulary. We do it all the time. One immediately thinks of XML as something we parse regularly without batting an eyelid. As a PHP developer you might also parse an <em>ini</em> file with <a href="http://php.net/manual/en/function.parse-ini-file.php" target="_blank">parse_ini_file</a>, or parse a date string with <a href="http://php.net/manual/en/function.strtotime.php" target="_blank">strtotime</a>. Whatever language you write in, these tasks are easily achieved using either built-in functions or by installing other code libraries or extensions. Sometimes you may find yourself needing to parse something more bespoke, like say a postcode &#8211; you&#8217;ll either write a routine yourself, or do some googling for a neat algorithm someone out there has decided to share &#8211; no problem.</p>
<h3>A rod for my back</h3>
<p>But what if you want to parse something really complex, like, say, an entire JavaScript program? What if you can&#8217;t find a third party library that works for you? Well I tried to find one. I found some very promising projects. But they ranged from abandoned projects, to dodgy alpha releases, to ones that just plain didn&#8217;t work and with no documentation to help. The most serious looking projects were so sophisticated that I didn&#8217;t even have the knowledge to start using them. I decided, as I often do, that I needed empowering with the knowledge to write my own parser should I need one for &#8211; well, whatever.<span id="more-13"></span></p>
<p>What transpired as the Googling began was that I had bitten off rather more than I have ever attempted to chew. The fact is that we take the concept of <a href="http://en.wikipedia.org/wiki/Parsing" target="_blank">parsing</a> a little for granted and we use the term quite casually. I have recently learned quite what a huge and fundamental topic of computer science this really is &#8211; it has been around longer than computers, which perhaps says it all.</p>
<p>Before I go any further with my ramblings I&#8217;d like to point out that everything I now know about this topic I owe to one book: <em><a href="http://www.cs.vu.nl/~dick/PT2Ed.html" target="_blank">Parsing Techniques &#8211; A Practical Guide</a></em>. Entirely as a result of studying this book I have managed to implement a powerful LR(1) parser and parse table generator in PHP. If you don&#8217;t know what that means, I suggest you read the book, because I will not attempt to provide a tutorial on these deep topics. I will however whet your appetite with a little introduction, some demos, (aka showing off), and eventually some code.</p>
<h3>Our notion of parsing</h3>
<p>Let&#8217;s go back for a moment and look at our understanding of what parsing is. Suppose you want to parse a UK date (by <em>hand</em> so to speak) &#8211; in PHP you could do something like this:</p>
<ol class="php">
<li class="odd"><span class="T_VARIABLE">$sDate</span><span class="T_WHITESPACE"> </span>=<span class="T_WHITESPACE"> </span><span class="T_CONSTANT_ENCAPSED_STRING PHP_QUOTED">'03 / 11 / 1976'</span>;</li>
<li class="even"><span class="T_VARIABLE">$aDate</span><span class="T_WHITESPACE"> </span>=<span class="T_WHITESPACE"> </span><span class="T_STRING">preg_split</span>(<span class="T_WHITESPACE"> </span><span class="T_CONSTANT_ENCAPSED_STRING PHP_QUOTED">'/\D/'</span>,<span class="T_WHITESPACE"> </span><span class="T_VARIABLE">$sDate</span>,<span class="T_WHITESPACE"> </span>-<span class="T_LNUMBER">1</span>,<span class="T_WHITESPACE"> </span><span class="T_STRING">PREG_SPLIT_NO_EMPTY</span><span class="T_WHITESPACE"> </span>);</li>
<li class="odd"><span class="T_VARIABLE">$timestamp</span><span class="T_WHITESPACE"> </span>=<span class="T_WHITESPACE"> </span><span class="T_STRING">mktime</span>(<span class="T_WHITESPACE"> </span><span class="T_LNUMBER">0</span>,<span class="T_WHITESPACE"> </span><span class="T_LNUMBER">0</span>,<span class="T_WHITESPACE"> </span><span class="T_LNUMBER">0</span>,<span class="T_WHITESPACE"> </span><span class="T_VARIABLE">$aDate</span>[<span class="T_LNUMBER">1</span>],<span class="T_WHITESPACE"> </span><span class="T_VARIABLE">$aDate</span>[<span class="T_LNUMBER">0</span>],<span class="T_WHITESPACE"> </span><span class="T_VARIABLE">$aDate</span>[<span class="T_LNUMBER">2</span>]<span class="T_WHITESPACE"> </span>);</li>
</ol>
<p>Easy peasy &#8211; you have parsed the date, in so far as it has been transformed from a string of bytes to something tangible within your code. You can get meaningful handles on the data and do stuff with it. Great! &#8211; but not powerful or versatile. This is an entirely bespoke process that works for this purpose alone. Proper parsing techniques allow a problem like this to be standardized greatly, and will allow you to use the same code base to implement parsers for as many different <em>things</em> as your imagination and time allow. Furthermore, it is possible to fully harness the expressive nature of a language rather than just chopping a sentence up and grabbing bits of it, as I have shown above.</p>
<p>As a <em>proof of concept</em> I&#8217;m going to use a humble date format to demonstrate the parser I have developed for PHP. In reality using such a powerful technique would be total overkill, but it makes for a pretty good first example.</p>
<h3>Proof of concept</h3>
<p>Let&#8217;s take a date format as the <em>language</em> for which we wish to create a parser. A legitimate expression (or <em>sentence</em>) in this language could be as follows: <code>1976-11-03 18:15:00</code></p>
<p>Starting with the top-most, and largest component of our language, let&#8217;s examine what it consists of and define ourselves a <em>grammar </em>which expresses its rules and syntax.</p>
<p>The complete sentence is quite simply a <em>date</em> &#8211; this entity is what we call the &#8220;<em>goal symbol</em>&#8221;, because it is our ultimate result. Let&#8217;s create a sensibly named symbol with an appropriate namespace.</p>
<pre class="code">&lt;D_DATE&gt;</pre>
<p>We can split the <code>D_DATE</code> symbol into two separate components for date (1976-11-03) and time (18:15:00). Let&#8217;s call these symbols <code>D_DATE_COMPONENT</code> &amp; <code>D_TIME_COMPONENT</code>. They are separated by a space, which we write literally.</p>
<pre class="code">&lt;D_DATE&gt; ::= &lt;D_DATE_COMPONENT&gt; " " &lt;D_TIME_COMPONENT&gt;</pre>
<p>The literal space symbol is called a <em>terminal</em> symbol because it is an actual character in the date we are parsing and consequently cannot be divided into any further symbols. The other symbols we have defined so far are called <em>non-terminal</em> because they are made up of further symbols, and are not physically present in the original date string. They are, to all intents and purposes, imaginary symbols created by us to provide an understanding of what the sentence means.</p>
<p>You get the gist &#8211; hopefully &#8211; So here&#8217;s the whole grammar.</p>
<pre class="code" style="overflow: auto; max-height: 200px">1.  &lt;D_DATE&gt; ::=
2.     &lt;D_DATE_COMPONENT&gt; " " &lt;D_TIME_COMPONENT&gt;;
3.
4.  &lt;D_DATE_COMPONENT&gt; ::=
5.     &lt;D_YEAR&gt; "-" &lt;D_MONTH&gt; "-" &lt;D_DAY&gt;;
6.
7.  &lt;D_TIME_COMPONENT&gt; ::=
8.     &lt;D_HOUR&gt; ":" &lt;D_MIN&gt; ":" &lt;D_SEC&gt; ;
9.
10. &lt;D_YEAR&gt; ::=
11.    &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; ;
12.
13. &lt;D_MONTH&gt; ::=
14.   &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; ;
15.
16. &lt;D_DAY&gt; ::=
17.    &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; ;
18.
19. &lt;D_HOUR&gt; ::=
20.    &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; ;
21.
22. &lt;D_MIN&gt; ::=
23.    &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; ;
24.
25. &lt;D_SEC&gt; ::=
26.    &lt;D_DIGIT&gt; &lt;D_DIGIT&gt; ;
27.
28. &lt;D_DIGIT&gt; ::=
29.    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;</pre>
<p>This is written in a notational format called <a href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form" target="_blank">Backus Naur form (BNF)</a>. Non-terminal symbols are written inside angle brackets and the terminal symbols are all written inside quotes as they are [in this case] literal. As you can see, the only terminal symbols of the language are the punctuation and numeric characters that a legal date string in this language will consist of. The last non-terminal <code>D_DIGIT</code> illustrates a choice of terminal symbol &#8211; the 10 decimal characters (lines 28-29).</p>
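<p>To make the terminal / non-terminal distinction concrete, here is one way the grammar could be written down as plain PHP data. The array layout is an assumption made for this sketch and says nothing about the format the parser itself uses. Any symbol that appears on a right-hand side but has no rule of its own must be a terminal.</p>
<pre class="code">&lt;?php
// Each non-terminal maps to a list of alternative right-hand sides
$twoDigits = array( 'D_DIGIT', 'D_DIGIT' );
$grammar = array(
    'D_DATE'           =&gt; array( array( 'D_DATE_COMPONENT', ' ', 'D_TIME_COMPONENT' ) ),
    'D_DATE_COMPONENT' =&gt; array( array( 'D_YEAR', '-', 'D_MONTH', '-', 'D_DAY' ) ),
    'D_TIME_COMPONENT' =&gt; array( array( 'D_HOUR', ':', 'D_MIN', ':', 'D_SEC' ) ),
    'D_YEAR'           =&gt; array( array( 'D_DIGIT', 'D_DIGIT', 'D_DIGIT', 'D_DIGIT' ) ),
    'D_MONTH'          =&gt; array( $twoDigits ),
    'D_DAY'            =&gt; array( $twoDigits ),
    'D_HOUR'           =&gt; array( $twoDigits ),
    'D_MIN'            =&gt; array( $twoDigits ),
    'D_SEC'            =&gt; array( $twoDigits ),
    // ten alternatives, one per decimal character
    'D_DIGIT'          =&gt; array_map( function( $chr ){ return array( $chr ); }, str_split('0123456789') ),
);

// Collect every symbol that is used but never defined: the terminals
$terminals = array();
foreach( $grammar as $alternatives ){
    foreach( $alternatives as $rhs ){
        foreach( $rhs as $symbol ){
            if( ! isset($grammar[$symbol]) &amp;&amp; ! in_array($symbol, $terminals, true) ){
                $terminals[] = $symbol;
            }
        }
    }
}</pre>
<p>For this grammar <code>$terminals</code> contains the space, the hyphen, the colon and the ten digit characters, which matches the observation above.</p>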
<h3>The parse tree</h3>
<p>As we drill down from our goal symbol into its child symbols you may realise that we are seeing a tree-like structure. This tree has the goal symbol at its root, and along each branch we eventually find a leaf &#8211; a terminal symbol. This is important, because it is precisely the parser&#8217;s purpose in life to take a stream of terminal symbols and build them into a <a href="http://en.wikipedia.org/wiki/Parse_tree" title="Definition of parse tree" target="_blank">parse tree</a>. Without further ado, here is the parse tree of our example date string.</p>
<pre class="code" style="overflow: auto; max-height: 200px">[041] &lt;D_DATE&gt; :
[021] .  &lt;D_DATE_COMPONENT&gt; :
[008] .  .  &lt;D_YEAR&gt; :
[001] .  .  .  &lt;D_DIGIT&gt; :
[000] .  .  .  .  "1"
[001] .  .  .  &lt;/D_DIGIT&gt;
[003] .  .  .  &lt;D_DIGIT&gt; :
[002] .  .  .  .  "9"
[003] .  .  .  &lt;/D_DIGIT&gt;
[005] .  .  .  &lt;D_DIGIT&gt; :
[004] .  .  .  .  "7"
[005] .  .  .  &lt;/D_DIGIT&gt;
[007] .  .  .  &lt;D_DIGIT&gt; :
[006] .  .  .  .  "6"
[007] .  .  .  &lt;/D_DIGIT&gt;
[008] .  .  &lt;/D_YEAR&gt;
[009] .  .  "-"
[014] .  .  &lt;D_MONTH&gt; :
[011] .  .  .  &lt;D_DIGIT&gt; :
[010] .  .  .  .  "1"
[011] .  .  .  &lt;/D_DIGIT&gt;
[013] .  .  .  &lt;D_DIGIT&gt; :
[012] .  .  .  .  "1"
[013] .  .  .  &lt;/D_DIGIT&gt;
[014] .  .  &lt;/D_MONTH&gt;
[015] .  .  "-"
[020] .  .  &lt;D_DAY&gt; :
[017] .  .  .  &lt;D_DIGIT&gt; :
[016] .  .  .  .  "0"
[017] .  .  .  &lt;/D_DIGIT&gt;
[019] .  .  .  &lt;D_DIGIT&gt; :
[018] .  .  .  .  "3"
[019] .  .  .  &lt;/D_DIGIT&gt;
[020] .  .  &lt;/D_DAY&gt;
[021] .  &lt;/D_DATE_COMPONENT&gt;
[022] .  " "
[040] .  &lt;D_TIME_COMPONENT&gt; :
[027] .  .  &lt;D_HOUR&gt; :
[024] .  .  .  &lt;D_DIGIT&gt; :
[023] .  .  .  .  "1"
[024] .  .  .  &lt;/D_DIGIT&gt;
[026] .  .  .  &lt;D_DIGIT&gt; :
[025] .  .  .  .  "8"
[026] .  .  .  &lt;/D_DIGIT&gt;
[027] .  .  &lt;/D_HOUR&gt;
[028] .  .  ":"
[033] .  .  &lt;D_MIN&gt; :
[030] .  .  .  &lt;D_DIGIT&gt; :
[029] .  .  .  .  "1"
[030] .  .  .  &lt;/D_DIGIT&gt;
[032] .  .  .  &lt;D_DIGIT&gt; :
[031] .  .  .  .  "5"
[032] .  .  .  &lt;/D_DIGIT&gt;
[033] .  .  &lt;/D_MIN&gt;
[034] .  .  ":"
[039] .  .  &lt;D_SEC&gt; :
[036] .  .  .  &lt;D_DIGIT&gt; :
[035] .  .  .  .  "0"
[036] .  .  .  &lt;/D_DIGIT&gt;
[038] .  .  .  &lt;D_DIGIT&gt; :
[037] .  .  .  .  "0"
[038] .  .  .  &lt;/D_DIGIT&gt;
[039] .  .  &lt;/D_SEC&gt;
[040] .  &lt;/D_TIME_COMPONENT&gt;
[041] &lt;/D_DATE&gt;</pre>
<p><strong><a href="http://timwhitlock.info/plug/examples/parsing/__Test/SimpleDateParser.php" target="_blank">Click here to play with the interactive version</a></strong><br />
You will see that the parser throws a descriptive error if you try to parse an invalid date.</p>
<p>That&#8217;s the concept in a nutshell. Hopefully you can see that this tree structure gives us a powerful and flexible way to describe the anatomy of the input we have parsed.  Compare that with simply grabbing the linear sequence of numbers that make up the date string and imagine the potential.</p>
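<p>As a taste of that potential, here is a small sketch of pulling a value back out of one branch. The nested-array layout of <code>array( symbol, children... )</code> and the function name <code>sketch_leaf_text</code> are assumptions for this example only; the node objects my parser produces are not shown here.</p>
<pre class="code">&lt;?php
// A hand-written copy of the D_YEAR branch from the tree above
$year = array( 'D_YEAR',
    array( 'D_DIGIT', '1' ),
    array( 'D_DIGIT', '9' ),
    array( 'D_DIGIT', '7' ),
    array( 'D_DIGIT', '6' ),
);

// Concatenate every leaf (terminal symbol) found beneath a node
function sketch_leaf_text( $node ){
    if( ! is_array($node) ){
        // a leaf: the literal character itself
        return $node;
    }
    $text = '';
    // offset 0 is the symbol name, the rest are child nodes
    foreach( array_slice($node, 1) as $child ){
        $text .= sketch_leaf_text( $child );
    }
    return $text;
}

$text = sketch_leaf_text( $year );</pre>
<p>Because the same function works on any node, you could ask for the text of the month, the whole time component, or the entire date without writing any new code.</p>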
<p>I shall leave this introduction for now. In <a href="http://web.2point1.com/2008/03/30/parsing-for-php-developers-part-ii/" title="Part II">part 2</a>, I go on to discuss tokenizing and demonstrate the parsing of PHP source.</p>
]]></content:encoded>
			<wfw:commentRss>http://web.2point1.com/2008/03/24/parsing-for-php-developers-part-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
