XML_PullParser
A token-based interface to the PHP expat XML library
version 1.3.2
Myron Turner
Myron_Turner@shaw.ca

Introduction

XML_PullParser is a class modeled on the PullParser module found in the Perl HTML::Parser distribution. It moves the API from an event-based model to a token-based model. Instead of processing data as the parser passes it to callbacks, a script using XML_PullParser requests "tokens" from various "tokenizing" functions, most notably XML_PullParser_getToken and XML_PullParser_getElement. Tokens are arrays representing XML structures, which become available in the order in which they appear in the document being parsed.
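
To make the contrast between the two models concrete, the following is a minimal sketch that uses only PHP's bundled expat functions. It is not XML_PullParser's implementation: the helper name collect_tokens and the token layout are hypothetical, chosen for illustration. The callbacks (the event-based side) queue each structure as an array, and the script then pulls the arrays off the queue in document order (the token-based side).

```php
<?php
// Hypothetical helper, not part of XML_PullParser: runs the event-based
// expat parser and records each event as a small array ("token").
function collect_tokens(string $xml): array
{
    $tokens = [];
    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use (&$tokens) {
            $tokens[] = ['start', $name, $attrs];
        },
        function ($p, string $name) use (&$tokens) {
            $tokens[] = ['end', $name];
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, string $text) use (&$tokens) {
            // Skip whitespace-only runs; expat may deliver text in pieces.
            if (trim($text) !== '') {
                $tokens[] = ['text', trim($text)];
            }
        }
    );

    if (xml_parse($parser, $xml, true) !== 1) {
        $message = xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException("XML error: $message");
    }
    return $tokens;
}

// Token-based side: the script asks for the next token when it wants one.
$tokens = collect_tokens('<server ip="192.168.10.1"> example_1.com </server>');
while (($token = array_shift($tokens)) !== null) {
    echo $token[0], ' ', $token[1], "\n";
}
```

The point of the sketch is only the inversion of control: with callbacks the parser decides when the script sees data, whereas with tokens the script does.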

In addition to the tokenizers, a rich set of accessors is provided to extract data from the elements and attributes bundled in the tokens. There are also techniques and class methods for selecting elements and attributes, and for testing their position and relevance. Finally, there are package-level functions to set the contexts that affect the operations of the module.

XML_PullParser is not as purely a "token" parser as HTML::PullParser. The Perl module focuses on the individual tag as it comes on stream, which makes it suited to large blocks of text with a great many embedded tags, whereas XML_PullParser is oriented towards nested structures, which makes it suited to the kinds of database structures that much XML is used for. The DocBook source of the current paragraph is a good example of where the Perl module has the advantage:

Example 1
<classname>XML_PullParser </classname> is not as purely a "token" parser as
<classname>HTML::PullParser </classname>. The Perl module focuses on the
individual tag as it comes on stream, which makes it suited to large blocks
of text with a great many embedded tags, whereas  <classname>XML_PullParser </classname>
is oriented towards nested structures. . .

If Perl's HTML::PullParser were to format Example 1, the <classname> tags would be announced at the points at which they occur in the stream, and so re-casting <classname> to bold italics, as here, would be a simple matter of substituting <b><i> for <classname> whenever the <classname> tag came on stream. XML_PullParser, on the other hand, would output an entire structure enclosed by either <blockquote> or <programlisting>. To convert the classnames to bold italics, it is then necessary to review this structure and apply a replacement function like preg_replace to each element that calls for re-casting. XML_PullParser has a function which does just this: XML_PullParser_getTextMarkedUp.
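
The replacement step on its own can be sketched with preg_replace, as below. This is an illustration only and does not show how XML_PullParser_getTextMarkedUp is implemented or called; the helper name recast_classnames is hypothetical, and the input is the opening of Example 1.

```php
<?php
// Hypothetical helper, not part of XML_PullParser: swaps each
// <classname>...</classname> pair for <b><i>...</i></b>, dropping any
// whitespace just inside the tags.
function recast_classnames(string $text): string
{
    return preg_replace(
        '#<classname>\s*(.*?)\s*</classname>#s',
        '<b><i>$1</i></b>',
        $text
    );
}

$paragraph = '<classname>XML_PullParser </classname> is not as purely a "token" parser as '
           . '<classname>HTML::PullParser </classname>.';

echo recast_classnames($paragraph), "\n";
```

The whole structure has to be in hand before the replacement runs, which is the extra step described above compared with swapping tags as they arrive on the stream.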

This page is very likely being generated on the fly from the original XML, using XML_PullParser_getTextMarkedUp, and certainly over the web there is no noticeable performance deficit. Nevertheless, the strength of XML_PullParser lies in structures like Example 2.

Example 2
<ENTRY>
<ipaddress> 172.20.19.6 </ipaddress>
<domain> example.com </domain>
<server ip="192.168.10.1"> example_1.com </server>
<server ip="192.168.10.2"> example_2.com </server>
<server ip="192.168.10.3"> example_3.com </server>
<alias> www.example.com </alias>
</ENTRY>

<ENTRY>
<ipaddress> 172.20.19.7 </ipaddress>
        •
        •
<alias> www.example.org </alias>
</ENTRY>

In a database-like file with a set of entries like this, XML_PullParser would loop through the file, grabbing an entire <ENTRY> structure with each iteration and providing immediate, direct access to each of its elements. The next two sections introduce the coding for such tasks and try to give a taste of how XML_PullParser works.
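
As a point of comparison before the library's own functions are introduced, the entry-at-a-time pattern can be sketched with PHP's bundled XMLReader and SimpleXML extensions. None of this is XML_PullParser's API. The helper name read_entries and the <ENTRIES> wrapper element are assumptions made so that the sample is a well-formed, self-contained document, and the second entry carries only the fields that Example 2 actually shows.

```php
<?php
// Assumption: a wrapper element (<ENTRIES>) encloses the entries so that
// the sample is well-formed; Example 2 does not show one.
$xml = <<<XML
<ENTRIES>
<ENTRY>
<ipaddress> 172.20.19.6 </ipaddress>
<domain> example.com </domain>
<server ip="192.168.10.1"> example_1.com </server>
<server ip="192.168.10.2"> example_2.com </server>
<server ip="192.168.10.3"> example_3.com </server>
<alias> www.example.com </alias>
</ENTRY>
<ENTRY>
<ipaddress> 172.20.19.7 </ipaddress>
<alias> www.example.org </alias>
</ENTRY>
</ENTRIES>
XML;

// Hypothetical helper: walks the document and hands back one array per
// <ENTRY>, with its child elements directly accessible.
function read_entries(string $xml): array
{
    $entries = [];
    $reader = new XMLReader();
    $reader->XML($xml);
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'ENTRY') {
            // Grab the whole <ENTRY> structure in one step.
            $entry = simplexml_load_string($reader->readOuterXml());
            $servers = [];
            foreach ($entry->server as $server) {
                $servers[(string) $server['ip']] = trim((string) $server);
            }
            $entries[] = [
                'ipaddress' => trim((string) $entry->ipaddress),
                'alias'     => trim((string) $entry->alias),
                'servers'   => $servers,
            ];
        }
    }
    $reader->close();
    return $entries;
}

foreach (read_entries($xml) as $entry) {
    echo $entry['ipaddress'], ' has ', count($entry['servers']), " server(s)\n";
}
```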