XML_PullParser
A token-based interface to the PHP expat XML library
version 1.3.2
Myron Turner
Text Accessors


There are four text accessors in XML_PullParser:

  1. string XML_PullParser_getText ([mixed $el = ""], [integer $which = 0], [boolean $xcl = FALSE])
  2. array XML_PullParser_getTextArray (mixed $el)
  3. string XML_PullParser_getTextMarkedUp (array $mark_up, [mixed $el = ""])
  4. string XML_PullParser_getTextStripped ([mixed $el = ""])

All the text accessors return either Null or an empty string or array if no data is available. All of these methods are well documented in the Class Documentation, which should be consulted in addition to this manual.


1. XML_PullParser_getTextStripped
XML_PullParser_getText and XML_PullParser_getTextArray are front-ends for XML_PullParser_getTextStripped. Therefore, an understanding of this method will aid in the understanding of the other two.

XML_PullParser_getTextStripped takes one parameter, which can be either a tokenized array or the name of an element. If a name is passed in, or if no parameter is passed in, it assumes that the subject of the request is either the $current_element or the current token (see note 1). If a token is passed in, it uses that token.

Its defining characteristic is that it does not observe element boundaries. It returns a single concatenated string made up of all the text found in the token, including the text of all children and descendant elements. It also includes all white space separating element from element, and that white space includes new-lines. The default delimiter between the concatenated members of this string is a single space character. This can be changed using

string XML_PullParser_setDelimiter (string $delimiter)

The return value is the previous delimiter, which can be restored, if necessary, with a second call to XML_PullParser_setDelimiter.
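
The save-and-restore pattern can be sketched in plain PHP. The DelimiterSketch class below is a hypothetical stand-in written for this manual page, not the library's own code; it only assumes what is stated above, namely a default delimiter of one space and a setter that returns the delimiter it replaces.

```php
<?php
// Hypothetical stand-in for the parser's delimiter handling; not the library's code.
class DelimiterSketch
{
    private $delimiter = ' ';   // the default: a single space

    // Install a new delimiter and hand back the one it replaces.
    public function setDelimiter(string $delimiter): string
    {
        $old = $this->delimiter;
        $this->delimiter = $delimiter;
        return $old;
    }

    // Concatenate the texts with the delimiter currently in force.
    public function join(array $texts): string
    {
        return implode($this->delimiter, $texts);
    }
}

$sketch = new DelimiterSketch();
$texts  = array('There was a', 'big', 'rainstorm last night.');

$old = $sketch->setDelimiter('|');   // $old now holds the single space
echo $sketch->join($texts), "\n";    // There was a|big|rainstorm last night.

$sketch->setDelimiter($old);         // restore the previous delimiter
echo $sketch->join($texts), "\n";    // There was a big rainstorm last night.
```

With the real parser the same two calls to XML_PullParser_setDelimiter would bracket whatever text accessor needs the temporary delimiter.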

The text returned by XML_PullParser_getTextStripped is subject to the CDATA modifiers:
  1. void XML_PullParser_excludeBlanks (boolean $bool)
    Setting this to true will exclude all text lines which consist solely of white space.
  2. void XML_PullParser_excludeBlanksStrict (boolean $bool)
    Setting this to true will exclude all text lines which contain no word characters
    (letters, digits, or underscore), i.e. which do not satisfy the regular expression '/\w+/'
  3. void XML_PullParser_trimCdata (boolean $bool)
    Setting this to true will cause all text extracted from each element to be passed through
    the PHP trim function.
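
The effect of the three modifiers can be imitated in plain PHP. The function filter_text_lines and its flag names are hypothetical and exist only for this illustration; the sketch assumes the modifiers behave as described in the list above and makes no claim about how the library implements them.

```php
<?php
// Hypothetical stand-in for the three CDATA modifiers; not the library's code.
function filter_text_lines(array $lines, bool $excludeBlanks, bool $strict, bool $trim): array
{
    $out = array();
    foreach ($lines as $line) {
        if ($excludeBlanks && trim($line) === '') {
            continue;   // the line consists solely of white space
        }
        if ($strict && !preg_match('/\w+/', $line)) {
            continue;   // the line has no letter, digit, or underscore
        }
        $out[] = $trim ? trim($line) : $line;
    }
    return $out;
}

$lines = array("\n    ", "  Gone With The wind ", " -- ", "1939");

// Blanks excluded: the white-space line goes, " -- " stays, nothing is trimmed.
print_r(filter_text_lines($lines, true, false, false));

// Strict exclusion plus trimming: only the two lines with word characters survive.
print_r(filter_text_lines($lines, false, true, true));
```

The second call shows why the strict form is the more aggressive of the two: a line of punctuation alone is dropped as well.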

In the following example the <emphasis> element is concatenated with the first <News_item> element and both with the second <News_item>; they are separated by the default delimiter, a single space:

 XML:
    <News_item>
           There was a  <emphasis>big</emphasis>  rainstorm last night.
    </News_item>
    <News_item> It rained cats and dogs </News_item>

 Result: There was a  big  rainstorm last night. It rained cats and dogs

For more examples and further detail see the class documentation.


2. XML_PullParser_getTextArray
This method is a front-end to XML_PullParser_getTextStripped. It returns an array of the strings in the element specified in the parameter, which is required and which is either a tokenized array or a string and is treated exactly the same as the parameter to XML_PullParser_getTextStripped.

This method takes advantage of the fact that XML_PullParser_getTextStripped ignores element boundaries and returns a concatenated string of texts separated by a pre-set delimiter. It changes the delimiter to ';;' by calling XML_PullParser_setDelimiter(';;'), creates the array by calling explode on the returned string, and then resets the delimiter to its old value. Obviously, this means that if the text of a document itself contains a double semicolon, this method will not work correctly, but its behaviour is easy enough to duplicate with a different delimiter.
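
The join-then-explode idea, and the way it fails when the text already contains the delimiter, can be tried out in plain PHP. The function text_array_sketch is hypothetical and stands in for the mechanism just described; the "\x1F" delimiter in the last call is simply one character that is unlikely to occur in ordinary text, not something the library uses.

```php
<?php
// Plain-PHP imitation of the join-then-explode idea; the function name is hypothetical.
function text_array_sketch(array $texts, string $delimiter = ';;'): array
{
    $joined = implode($delimiter, $texts);   // what the concatenated string would look like
    return explode($delimiter, $joined);     // split it back into an array
}

$titles = array('Gone With The wind', 'How Green Was My Valley', 'Jurassic Park');
print_r(text_array_sketch($titles));         // three elements, as expected

$risky  = array('a;;b', 'c');
$broken = text_array_sketch($risky);         // 'a;;b' is split in two
$safe   = text_array_sketch($risky, "\x1F"); // a different delimiter avoids the clash

print_r($broken);
print_r($safe);
```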

Let's assume the following database to demonstrate XML_PullParser_getTextArray.

Example 1: Movies.xml
<Movies>
    <Movie>
        <Title>Gone With The wind</Title>
        <date>1939</date>
        <leading_lady>Vivien Leigh</leading_lady>
        <leading_man>Clark Gable</leading_man>
    </Movie>
    <Movie>
        <Title>How Green Was My Valley</Title>
        <date>1941</date>
        <leading_lady>Maureen O'Hara</leading_lady>
        <leading_man>Walter Pidgeon</leading_man>
    </Movie>
    <Movie>
        <Title>Jurassic Park</Title>
        <date>1993</date>
        <leading_lady>Laura Dern</leading_lady>
        <leading_man>Sam Neil</leading_man>
    </Movie>
</Movies>

To get all the titles from Movies.xml, all that's necessary is the following call:

$parser->XML_PullParser_getTextArray("Title")

The technique is demonstrated in Listing 17:

Listing 17

    // Tokenize on the <Movies> element.
    $tags = array("Movies");
    $child_tags = array();

    $parser = new XML_PullParser("Movies.xml", $tags, $child_tags);

    // Fetch the token that the text accessor will search.
    $token = $parser->XML_PullParser_getToken();

    // Collect the text of the Title elements as an array.
    $text_array = $parser->XML_PullParser_getTextArray("Title");
    print_r($text_array);

/*
 Result
    Array
    (
        [0] => Gone With The wind
        [1] => How Green Was My Valley
        [2] => Jurassic Park
    )
*/

One precautionary note. Given the current coding, the following call will not return the expected result:

$parser->XML_PullParser_getTextArray("Movies")

The expected result is:


Array
    (
        [0] => Gone With The wind
        [1] => 1939
        [2] => Vivien Leigh
        [3] => Clark Gable
        [4] => How Green Was My Valley
        [5] => 1941
        [6] => Maureen O'Hara
        [7] => Walter Pidgeon
        [8] => Jurassic Park
        [9] => 1993
        [10] => Laura Dern
        [11] => Sam Neil
    )
    

But instead we get:


Array
(
    [0] =>

    [1] =>

    [2] => Gone With The wind
    [3] =>

    [4] => 1939
    [5] =>

    [6] => Vivien Leigh
    [7] =>
     •
     •
     •
)

The empty array elements represent new-lines, and we can see that's the case since there is no new-line between elements [2] and [3] or elements [4] and [5]. What's required here is a call to XML_PullParser_excludeBlanksStrict with a value of true. That gets rid of all the blank elements and gives the expected result.


3. XML_PullParser_getText
All calls to this method are eventually passed on to XML_PullParser_getTextStripped. XML_PullParser_getText identifies and prepares the element which will be passed in to XML_PullParser_getTextStripped, and that method then returns all the text found in the element in accordance with the rules that govern its return values.

XML_PullParser_getText takes three optional parameters: $el, which is a tokenized element (an array) or its name (a string); a $which value; and the boolean $xcl. By default, none of these parameters is passed in, and the method uses either the $current_element or the current token, whichever is currently operative, together with a $which value of zero and an $xcl value of FALSE.

The following listing demonstrates the use of the defaults; it uses the DNS example we've worked with throughout.

Listing 18
 1.   $tags = array("entry");
 2.   $child_tags = array("server","domain");
 3.
 4.    $parser = new XML_PullParser("DNS.xml",$tags,$child_tags);
 5.
 6.    $parser->XML_PullParser_getToken();
 7.    echo $parser->XML_PullParser_getText() . "\n";
 8.
 9.    $el = $parser->XML_PullParser_getElement("server");
10.    echo $parser->XML_PullParser_getText() . "\n";
11.
12.
13.    $parser->XML_PullParser_getElement("domain");
14.    echo $parser->XML_PullParser_getText() . "\n";

/*
Result

     172.20.19.6
      example.com
      example_1.com
      example_2.com
      example_3.com
      www.example.com

     example_1.com example_2.com example_3.com
     example.com
*/

Line 6 retrieves the entire entry element and all of its children, and these are output on line 7, giving us the first block of the Result section. This consists of everything included in the element and all of the white space, which is why the result appears on separate lines. Had we called XML_PullParser_excludeBlanks(true), the result would have appeared as a single line of text:

172.20.19.6 example.com example_1.com example_2.com example_3.com www.example.com

The result from the call to XML_PullParser_getElement('server') in line 9 appears on a single line, because XML_PullParser_getElement incorporates into the token only the server elements. In this case, any whitespace found within the elements themselves would appear in the result but not the whitespace separating element from element. It's the latter, with its new-lines, which causes the texts derived from the $converted_token created by XML_PullParser_getToken to be printed on separate lines.

The call to XML_PullParser_getElement('domain') in line 13 yields

example.com

because there is only one domain element in the XML document. Had there been more than one domain element we would have to use the $which parameter to single out the desired domain element. The same mechanism applies, of course, to the server elements.


A Closer Look at the Parameters to XML_PullParser_getText
The element parameter ( $el ) passed in to XML_PullParser_getText can be either a string, which is the name of an element, or a tokenized array.
  1. If the element parameter is the name of an element, then either the $current_element or the current token will be searched for the named element, depending on which is currently operative. The method returns the text of the $which-th instance of that element. If $which = 0, it will return the texts from all instances of the named element found in the token.
  2. If the element parameter is a tokenized array, the method will return the character data from the $which-th element found in the array. If $which = 0, it will return all the character data found in the array. This is the rule which governs the output of line 7 in Listing 18 above. That is, no parameters are passed into the method, so that the default token becomes the entire <ENTRY> array and $which defaults to zero. Therefore, all the character data found in the default token is returned: all parents, all descendants.

The difference between the two sets of returned values arises out of what the method knows. In the first case, it knows the name of the element and can therefore search the default token for one or more instances of the named element. In the second case, it doesn't have the name of an element. Therefore, if it's passed a $which value of 1, it returns the character data of the first element, regardless of its name.
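
The $which rule itself is simple enough to sketch in plain PHP. The helper select_text is hypothetical; it works on a flat list of strings rather than on a real token, and it assumes only what is stated above: zero selects everything, and any other value selects the instance at that position, counting from 1. Returning Null when there is no such instance is an assumption based on the general remark that the accessors return Null or an empty value when no data is available.

```php
<?php
// Hypothetical helper mirroring the $which rule on a flat list of texts.
function select_text(array $texts, int $which = 0, string $delimiter = ' ')
{
    if ($which === 0) {
        return implode($delimiter, $texts);   // zero means "all instances"
    }
    // Otherwise pick the $which-th instance, counting from 1.
    return isset($texts[$which - 1]) ? $texts[$which - 1] : null;
}

$servers = array('example_1.com', 'example_2.com', 'example_3.com');

echo select_text($servers), "\n";      // example_1.com example_2.com example_3.com
echo select_text($servers, 2), "\n";   // example_2.com
var_dump(select_text($servers, 5));    // NULL: there is no fifth instance
```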

The third parameter to XML_PullParser_getText is the boolean $xcl. This parameter plays a part only in the handling of arrays, that is, where $el is a tokenized array or one of the two default tokens. It defaults to FALSE. When it is set to TRUE, the subject array is filtered through XML_PullParser_childXCL. This means that all descendant elements are removed, leaving an array which consists solely of the parent, or of elements which have the same name as the first top-level element and which are not themselves descendants of any other element.

This is a complex function and it might be worthwhile to look at the class documentation. In addition, Listing_23.php and Listing_24.php in the manual/listings directory demonstrate the variety of parameter combinations and their results.

Note
Prior to release 1.2.1, if the $el parameter was the name of the default token, Null was returned. In current releases, if $el is the name of the default token, the behavior is the same as the behavior when an array is passed in as $el.


4. XML_PullParser_getTextMarkedUp
This function is designed for converting streams of XML to HTML. It converts XML elements to HTML tags. Otherwise, its functionality is essentially the same as that of XML_PullParser_getTextStripped, with one exception: it is not subject to the CDATA modifiers.

It takes two parameters. The first is the $markup array, which maps XML elements to HTML tags; the second is an optional element parameter consisting of either a tokenized array or the name of an element. The element parameter behaves exactly as it does in XML_PullParser_getTextStripped. The advantage of placing the optional element parameter last is that it can be omitted when one of the two default tokens is being used (see note 2). All that is needed then is to pass in the $markup array.

The $markup array is built with the help of four methods:
  1. array XML_PullParser_getCSSSpans (array $markup)
  2. array XML_PullParser_getHTMLTags (array $markup)
  3. array XML_PullParser_getStyledSpans (array $markup, array $attributes)
  4. array XML_PullParser_getStyledTags (array $markup, array $attributes)

All the parameters are associative arrays. In the two "Spans" methods, the $markup arrays map XML element names to HTML class names:

array("code"=>"code", "emphasis"=>"bold_italic", "classname"=>"cname")

These will create <SPAN> tags with the class attribute set to the mapped value:

<span class="cname">XML_PullParser</span>
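
To make the mapping concrete, here is a rough plain-PHP imitation of what such a class map does to a fragment of XML. The function to_css_spans is hypothetical and uses simple string replacement on attribute-free tags; it is not how the library works internally, only a picture of the result.

```php
<?php
// Hypothetical illustration of an element-to-class map; not the library's code.
function to_css_spans(string $xml, array $classMap): string
{
    foreach ($classMap as $element => $class) {
        // Replace the opening and closing tags of each mapped element.
        $xml = str_replace('<' . $element . '>', '<span class="' . $class . '">', $xml);
        $xml = str_replace('</' . $element . '>', '</span>', $xml);
    }
    return $xml;
}

$map = array("code" => "code", "emphasis" => "bold_italic", "classname" => "cname");

echo to_css_spans('<classname>XML_PullParser</classname>', $map), "\n";
// <span class="cname">XML_PullParser</span>
```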

In the two "Tags" methods, the $markup arrays map XML element names to standard HTML tag names:

array("code"=>"code", "emphasis"=>"b", "classname"=>"i")

The $attributes parameter of the two "Styled" methods allows for additional attributes to be inserted in the HTML tags. For the most part these will be style attributes, but technically they can be anything. The $attributes parameters are also associative arrays:

    // Illustrative only: a literal PHP array keeps just the last value of a repeated key.
    array("style"=>"font-size: 10pt; text-decoration:underline",
        "style"=>"background-color:blue; color: yellow;", "style"=>"color: #999999")

The $attributes array has to be sequentially parallel to the $markup array, so that if the above styles were applied to the tags example, the first tag would get the first style, the second tag the second style, etc:

   <code style="font-size: 10pt; text-decoration:underline">$markup</code>
   <b style="background-color:blue; color: yellow;">This is BOLD yellow on Blue</b>
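
The positional pairing of tags and styles can be sketched as follows. Both the function styled_open_tags and the choice to pass the styles as a plain indexed list are assumptions made for this illustration; the form that the two "Styled" methods actually expect is given in the class documentation.

```php
<?php
// Hypothetical pairing of a markup map with a positional list of style strings.
function styled_open_tags(array $markup, array $styles): array
{
    $tags = array();
    $i = 0;
    foreach ($markup as $element => $htmlTag) {
        // The first tag gets the first style, the second tag the second style, etc.
        $style = isset($styles[$i])
            ? ' style="' . htmlspecialchars($styles[$i]) . '"'
            : '';
        $tags[$element] = '<' . $htmlTag . $style . '>';
        $i++;
    }
    return $tags;
}

$markup = array("code" => "code", "emphasis" => "b", "classname" => "i");
$styles = array(
    "font-size: 10pt; text-decoration:underline",
    "background-color:blue; color: yellow;",
    "color: #999999",
);

print_r(styled_open_tags($markup, $styles));
```

Because the pairing is purely by position, reordering either array changes which tag receives which style.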

The $markup arrays can be concatenated:

  $markup =  $parser->XML_PullParser_getCSSSpans(array(. . . .));
  $markup += $parser->XML_PullParser_getHTMLTags(array(. . . .));
  $markup += $parser->XML_PullParser_getStyledTags (array(. . . .), array(. . . .));

  $text = $parser->XML_PullParser_getTextMarkedUp($markup);
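
One detail of the += operator is worth keeping in mind when combining arrays this way: PHP's array union keeps the left-hand value whenever both arrays contain the same key. The short sketch below shows this with placeholder values; the strings are invented for the illustration and are not the format returned by the helper methods.

```php
<?php
// Placeholder values only; the real helper methods return their own format.
$spans = array('code' => 'from-spans', 'emphasis' => 'from-spans');
$tags  = array('emphasis' => 'from-tags', 'classname' => 'from-tags');

$markup  = $spans;
$markup += $tags;   // array union: keys already present on the left are kept

print_r($markup);
// 'code' and 'emphasis' keep their 'from-spans' values; 'classname' is added.
```

In other words, if the same XML element is mapped twice, the mapping added first wins.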

A final point. This manual was written in conformance with the Docbook specification. XML_PullParser_getTextMarkedUp has built-in support for the Docbook ulink element and will automatically convert a ulink to an HTML A tag:

   <ulink url="http://XML_PullParse.org/manual.html">Manual</ulink>
    <A href="http://XML_PullParse.org/manual.html">Manual</A>
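
The conversion shown above can be imitated with a single regular expression. The function ulink_to_anchor is hypothetical and is meant only to show the shape of the transformation; the library performs the conversion while it processes the token, not by pattern matching on raw text.

```php
<?php
// Hypothetical regex-based imitation of the ulink conversion; not the library's code.
function ulink_to_anchor(string $xml): string
{
    return preg_replace(
        '/<ulink\s+url="([^"]*)"\s*>(.*?)<\/ulink>/s',   // capture the url and the link text
        '<A href="$1">$2</A>',
        $xml
    );
}

echo ulink_to_anchor('<ulink url="http://XML_PullParse.org/manual.html">Manual</ulink>'), "\n";
// <A href="http://XML_PullParse.org/manual.html">Manual</A>
```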

Notes
1. See Instantiating the XML_PullParser Object
2. $current_element or $converted_token