The tokenizing functions create arrays which are mapped to the structure and data of
the XML document. It is from these arrays that the data accessors extract text and attribute
data.
Appendix 1 has examples of
tokens returned by some of these functions as well as notes on their structure.
There are several basic rules which apply to all tokenizers:
- They return tokenized arrays or NULL or FALSE.
  They return NULL when conditions are normal but no tokens are
  available. This makes it possible to use several of these methods
  in loops that come to an end when no more tokens are available.
  They return FALSE when an error occurs.
- The read buffer must be large enough to hold the entire token.
  This means that a program must be able to accommodate the largest token it will request. The
  default read buffer is 8KB. This can be reset using the package level utility
  XML_PullParser_setReadLength.
  For more on this function, see the package level
  utilities in the Utilities section
  and its description in the class
  documentation.
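The first rule can be handled in a loop along the following lines. This is a minimal sketch, not code from the package: count_tokens is a hypothetical helper, and $parser is assumed to be an XML_PullParser object that has already been constructed. The loop ends on any empty return value, and the last value is then checked to tell a normal end (NULL) from an error (FALSE).

```php
<?php
// Hypothetical helper (not part of XML_PullParser): counts the tokens
// that remain to be read from $parser.
function count_tokens($parser)
{
    $count = 0;
    // The loop ends on NULL (no more tokens) as well as on FALSE (error).
    while ($token = $parser->XML_PullParser_getToken()) {
        $count++;
    }
    // Tell the two end conditions apart with a strict comparison.
    if ($token === false) {
        throw new RuntimeException('The tokenizer reported an error.');
    }
    return $count;
}
```

A strict comparison (===) is needed here because NULL, FALSE, and an empty array all end the loop in the same way.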
There are eight tokenizers:
- array XML_PullParser_childXCL (array $parent, [mixed $args = ""])
- array XML_PullParser_getChild (string $child, [integer $which = 1], [array $el = ""])
- array XML_PullParser_getElement (string $el)
- array XML_PullParser_getEscapedToken ()
- array XML_PullParser_getToken ()
- array XML_PullParser_nextElement ($xcl = True)
- array XML_PullParser_getChildren (string $child, [array $el = ""])
- array XML_PullParser_getChildrenFromName (string $name, string $el)
All the tokenizing functions return Null or an empty array if no token is available and therefore
are suited for use in loops which test for these to signal the end of the loop.
XML_PullParser_getToken and
XML_PullParser_getElement have been used throughout
the manual and were explained in some detail
earlier in the manual. This section will look at the other tokenizers.
1. XML_PullParser_childXCL
XML_PullParser_childXCL is an important
selector function.
We've already seen its usefulness in the
previous section,
where it was used to strip out all child elements from the parent. It does this when passed
a single parameter, a token representing the parent element. It is used internally in several
class methods for just this purpose. But it can also take a second parameter, either a variable
parameter list of strings or an array of strings. These are the names of the elements to exclude.
That is, they will be left out of the returned array, which will consist of the parent and
any child elements which are not named.
Example 1 and
Listing 15 demonstrate how this might be used.
Example 1
<Confidential_report>
<item>
The company has a ground-breaking new product called <emphasis>Ground-breaker. </emphasis>
</item>
<topsecret>Its formula is H20 </topsecret>
<item>We expect to begin selling it by the end of the year. </item>
</Confidential_report>
The point here will be to exclude topsecret from the final output.
Listing 15 does this:
Listing 15
1. $tags = array("Confidential_report");
2. $child_tags = array();
3. XML_PullParser_trimCdata(true);
4. XML_PullParser_excludeBlanks(true);
5.
6. $parser = new XML_PullParser_doc($topsecret, $tags, $child_tags);
7. $token = $parser->XML_PullParser_getToken();
8. $classified = $parser->XML_PullParser_childXCL($token, "topsecret");
9.
11. $old_delim = $parser->XML_PullParser_setDelimiter("\n");
12. echo $parser->XML_PullParser_getTextStripped($classified) . "\n";
13. $parser->XML_PullParser_setDelimiter($old_delim);
/* Result
The company has a ground-breaking new product called
Ground-breaker.
We expect to begin selling it by the end of the year.
*/
There are a number of things in this listing that we haven't seen before. First, it uses
XML_PullParser_getTextStripped (line 12), which ignores element boundaries and
returns a string consisting of all the character data found within the parent
element, i.e. including all text found in child elements. Secondly, the default text
delimiter is a single space. This is replaced in line 11 by the newline character so that
the output is printed in several lines. Lines 3 and 4 make sure that the output is cleaned up,
since the parser will return as part of the character data any newlines it finds in the
text, which includes the newlines between the element declarations.
We exclude topsecret in line 8. Consequently, the final output
consists of all the lines of Confidential_report, excluding the
topsecret formula.
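The array form of the second parameter can be sketched as follows. text_without is a hypothetical helper, not part of the package; $parser is assumed to be an XML_PullParser object and $parent a token such as the one returned in line 7 of Listing 15.

```php
<?php
// Hypothetical helper: returns the text of $parent with the named child
// elements left out. $excluded is an array of element names.
function text_without($parser, $parent, array $excluded)
{
    // Array form of the second parameter. With a single name, the
    // variable-list form is the one used in Listing 15:
    //     $parser->XML_PullParser_childXCL($parent, "topsecret");
    $filtered = $parser->XML_PullParser_childXCL($parent, $excluded);
    if (!$filtered) {
        return null;   // nothing left, or the tokenizer reported a problem
    }
    return $parser->XML_PullParser_getTextStripped($filtered);
}
```

Called as text_without($parser, $token, array("topsecret")), this combines lines 8 and 12 of Listing 15 into one step.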
2. XML_PullParser_getChild
This method extracts an individual child element and all its descendants from a parent.
The first parameter is a string, the name of the child element to extract from the parent.
The second is an optional $which value. It specifies which instance of the child element
to extract; the instances are numbered in the order in which they appear in the XML document.
The parent is (optionally) specified in the third parameter.
If the parent is not passed in, then the method uses the $current_element or, failing that,
the current token.
This method is set up to work easily with XML_PullParser_getToken or
XML_PullParser_getElement. If either of these has been
called, and there is only one instance of the child in the XML, then all that's needed is to
call XML_PullParser_getChild with the name of the child, since $which
defaults to 1. Otherwise, $which has to be passed in.
To illustrate this method, let's look at the numbered list of function definitions at the top
of this page. Its XML basis is a Docbook structure called
simplelist. 1
Here is the XML:
Example 2
<para>
<simplelist type='vert' columns='1'>
<member>array XML_PullParser_childXCL (array $parent, [mixed $args = ""]) </member>
<member>
array XML_PullParser_getChild (string $child, [integer $which = 1],[array $el = ""])
</member>
<member>array XML_PullParser_getElement (string $el) </member>
<member>array XML_PullParser_getEscapedToken () </member>
<member>array XML_PullParser_getToken () </member>
<member>array XML_PullParser_nextElement () </member>
</simplelist>
</para>
Following is the code which created the numbered list, using the HTML <OL> tag.
The result, of course, is the list printed at the top of the page. So, instead, the
Result section shows one instance of a child array extracted by
XML_PullParser_getChild.
Listing 16
1. $tags = array('para');
2. $child_tags = array();
3. $parser = new XML_PullParser("List.xml", $tags, $child_tags);
4.
5. $parser->XML_PullParser_getToken();
6. $list = $parser->XML_PullParser_getChild('simpleList');
7.
8. $which = 1;
9. $items = "";
10. echo " <OL>\n";
11. while($member = $parser->XML_PullParser_getChild('member',$which,$list)) {
12. $member_text = $parser->XML_PullParser_getText($member);
13. $items .= " <LI>". trim($member_text) . "\n";
14. $which++;
15. }
16. echo $items;
17. echo " </OL>\n";
/* Result: Child token $member returned by XML_PullParser_getChild
[8] => S__MEMBER
[9] => Array
(
)
[10] => Array
(
[cdata] =>
array XML_PullParser_getChild (string $child, [integer $which = 1], [array $el = ""])
)
[11] => E__MEMBER
*/
The call to XML_PullParser_getToken (line 5) gets the entire para
structure and all of its children. We don't need to save its return value, because we will
be relying on the token saved internally in $converted_token. 2
In line 6, XML_PullParser_getChild extracts the entire
simplelist and all of its children from para.
These are the six member elements. We initialize $which to 1
(line 8), create an empty string to hold our list (line 9), and set up a while loop which
repeatedly calls XML_PullParser_getChild with the name of the child element we
want ("member"), the instance ("$which"), and the parent array ("$list").
Line 13 trims $member_text in order to remove extra line feeds. We can
see that there are two extra line feeds in the Result array's [cdata], one
before the function definition and one after. 3
And line 14 updates $which so that we get the next child element.
Let's look at one more real world example.
Both XML_PullParser_getChild and XML_PullParser_childXCL are used
internally by a number of the class methods. In some cases they are used together, as in
this snippet from XML_PullParser_getText:
if ($el && $which > 0) {
    if (!$tmp_array = $this->XML_PullParser_getChild($el, $which)) {
        return Null;
    }
    $tmp_array = $this->XML_PullParser_childXCL($tmp_array);
    return $this->XML_PullParser_getTextStripped($tmp_array);
}
$el is the name of the child and $which is the instance
of the child in the parent, which here will default to either the current token
i.e. $converted_token, or to $current_element. If
XML_PullParser_getChild finds a child instance, it is saved in a temporary
variable which is then fed into XML_PullParser_childXCL in order
to strip away all of this child's children. The reason for this is that almost all text
in XML_PullParser is ultimately retrieved from
XML_PullParser_getTextStripped. 4
So, it's necessary to exclude any of the child's descendants; otherwise
XML_PullParser_getTextStripped, which does not observe element
boundaries, will return all the text it finds.
3. XML_PullParser_nextElement
This method is tied in to XML_PullParser_getElement. Whenever
XML_PullParser_getElement is called, a copy is made of the $current_element
which serves as a stack for XML_PullParser_nextElement. Each time it is called,
XML_PullParser_nextElement shifts the next element off its stack and returns it
to the caller. When the stack is exhausted, it returns Null, a feature which
makes it suitable for use in a loop.
By default, before the element is returned, it is filtered through XML_PullParser_childXCL,
which strips out all child elements.
This guarantees that the result returned when requesting text and attributes is for the element
named in the parameter to XML_PullParser_getElement:
$parser->XML_PullParser_getElement('element_name')
But this also means that it is not suitable for use in applications which need to slurp together
text from a parent and all its children, as in a marked-up paragraph, since all the mark-up
would be deleted in favor of the parent element.
The default behavior can be turned off by passing in a False value as a parameter:
$parser->XML_PullParser_nextElement(False)
In this case the returned element will not be filtered through XML_PullParser_childXCL.
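The loop described above might look like the following. This is a sketch only: collect_element_text is a hypothetical helper, and $parser is assumed to be an XML_PullParser object on which XML_PullParser_getElement has already been called. It uses XML_PullParser_getTextStripped, so with filtering turned off the text of child elements is included.

```php
<?php
// Hypothetical helper: gathers the text of each element waiting on the
// XML_PullParser_nextElement stack. Call XML_PullParser_getElement first.
function collect_element_text($parser, $filter_children = true)
{
    $texts = array();
    // Passing false skips the XML_PullParser_childXCL filtering.
    while ($element = $parser->XML_PullParser_nextElement($filter_children)) {
        $texts[] = $parser->XML_PullParser_getTextStripped($element);
    }
    return $texts;
}
```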
4. XML_PullParser_getEscapedToken
This method returns a single escaped token each time it is called.
An escaped token represents an element which is declared in both the $tags array and the
$child_tags array. A separate stack is created for these tokens.
Each time XML_PullParser_getEscapedToken
returns a token the token is popped off the stack: the method returns tokens until
the stack is exhausted, at which point it returns Null,
making this method suitable for use in a loop.
The stack is persistent. If it is not exhausted and if the file being processed is larger than
$read_length, 5
tokens will be added to the stack when the next chunk of the file is parsed.
A function is provided to clear the stack, should that be necessary:
void XML_PullParser_clearEscapedTokens ()
| Some Points about Escaped Tokens |
|---|
| An escaped token can be accessed by XML_PullParser_getEscapedToken at any time, as long as it is still on the stack. |
| Escaped tokens are treated as valid members of the $child_tags array and therefore can be accessed in normal document order by XML_PullParser_getElement. |
| Escaped tokens are not treated as members of the $tags array and therefore are not returned by XML_PullParser_getToken. But an escaped element can still be the child of the current token, if its parent has been declared in the $tags array. If so, it can be accessed in the same ways as any child of the current token. |
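As an illustration of how the escaped-token stack and XML_PullParser_clearEscapedTokens might be used together, here is a short sketch. take_escaped_tokens and its $limit parameter are hypothetical, introduced only for this example; $parser is assumed to be an XML_PullParser object whose $tags and $child_tags arrays share at least one element name.

```php
<?php
// Hypothetical helper: returns at most $limit escaped tokens and then
// discards whatever is still on the escaped-token stack.
function take_escaped_tokens($parser, $limit)
{
    $tokens = array();
    while (count($tokens) < $limit
           && ($token = $parser->XML_PullParser_getEscapedToken())) {
        $tokens[] = $token;
    }
    // Clear the persistent stack so that leftovers do not show up later.
    $parser->XML_PullParser_clearEscapedTokens();
    return $tokens;
}
```

Clearing is a design choice for this sketch; a program that wants every escaped token would simply loop until the method returns Null.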
5. XML_PullParser_getChildren
6. XML_PullParser_getChildrenFromName
These methods are tokenizers, but instead of returning single tokens they return
numerically indexed arrays of tokens, as illustrated in the
appendix.
The only difference between the two methods is in their second parameter.
The first parameter to both is a string, the name of an element;
the child elements being sought are the elements of this name. In
XML_PullParser_getChildren
the second parameter is either a tokenized array, which is parent to the children,
or null, in which case either
$current_element will be
used or, failing that, the current token. In
XML_PullParser_getChildrenFromName
the second parameter is required and is the name of the parent element.
Accessing the tokens returned by these methods is simply a matter of running the array through a loop. This
can be a foreach loop, a for loop, or a while
loop that uses an indexing variable.
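A foreach loop over the returned array might be written as in the sketch below. child_texts is a hypothetical helper, not part of the package; $parser is assumed to be an XML_PullParser object and $parent a tokenized array, such as $list in Listing 16.

```php
<?php
// Hypothetical helper: returns the text of every $name child of $parent.
function child_texts($parser, $name, $parent)
{
    $children = $parser->XML_PullParser_getChildren($name, $parent);
    if (!$children) {
        return array();   // NULL, FALSE, or an empty array: no children
    }
    $texts = array();
    foreach ($children as $child) {
        // Each $child is itself a token, like one returned by getChild.
        $texts[] = trim($parser->XML_PullParser_getText($child));
    }
    return $texts;
}
```

With the data of Example 2, child_texts($parser, 'member', $list) should be able to stand in for the while loop of Listing 16, without the need for a $which counter.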
| Notes |
|---|
| 1. SimpleList should technically be an "undecorated" list, according to the Docbook spec, but has been recast here as a numbered list. |
| 2. See Instantiating the XML_PullParser Object |
| 3. We could also have called XML_PullParser_trimCdata(true) at the top of the listing to trim all text internally. |
| 4. The one exception to this is XML_PullParser_getTextMarkedUp. |
| 5. The length value passed to PHP's fread function. See the class documentation for XML_PullParser_setReadLength and $read_length |