eXcavator
An XML Query Facility for XML_PullParser
version 1.0.6
Myron Turner
Query Syntax


Unless otherwise noted, the examples in this manual are based on the XML document in the appendix.


Basic Query Unit
The basic syntactical expression in eXcavator is an element name followed by a condition in square brackets:
element_name[condition]
If the condition evaluates to TRUE, the query moves on to the next syntactical unit. The following is a valid query:
color[green]
If eXcavator finds a color element which holds "green", this expression will evaluate to TRUE, but it will not do anything in itself. That is, it will return TRUE, but it will not return any data. To have it return data, we have to enclose the condition in double square brackets:
color[[green]]
Listing 1 below is a complete query session using this query:

Listing 1
$eXc = new eXcavator($doc, eXcavator_STRING);
$eXc->eXcavator_Query("color[[green]]");
echo  $eXc->eXcavator_getResultAsString()  . "\n";

/* Result

  <COLOR>green</COLOR>

*/


There are a number of things to notice about the query in Listing 1.
  1. The expression [[green]] resolves to a test for equality. Moreover, the test is case-insensitive, in keeping with the default behavior of XML_PullParser and the expat XML parser. To change the default behavior, call XML_PullParser_caseSensitive(TRUE). The result tells us that there is indeed a green car in our inventory.
  2. The query returns everything between the color element's START and END tags. If there were other XML data between these START and END tags, that would also be returned.


The CDATA keyword
We could have written the test for green using the CDATA keyword:
color[[CDATA = green]]
CDATA stands for any character data that is found in the specified element. This expression makes clear that the test for "green" is a test for equality, for which [[green]] is a short-cut. The CDATA keyword has another and very important use. When used in a condition by itself, it matches ANY character data found in the subject element:
color[[CDATA]]
Had we used this expression in Listing 1, then the query would have returned the color for each of the vehicles in our inventory. The call to eXcavator_getResultAsString would have yielded the following result:


<COLOR>green</COLOR>

<COLOR>white</COLOR>
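
The difference between the two CDATA forms can be sketched in plain PHP. The block below does not use eXcavator; it uses PHP's SimpleXML extension to model what the two conditions select. The three-vehicle inventory is hypothetical (it is not the appendix document), and trimming whitespace before testing is an assumption of this sketch, not documented eXcavator behavior.

```php
<?php
// Hypothetical inventory, modeled on the results shown above.
$xml = <<<XML
<inventory>
  <vehicle><color>green</color></vehicle>
  <vehicle><color>white</color></vehicle>
  <vehicle><color></color></vehicle>
</inventory>
XML;

$doc = simplexml_load_string($xml);

// Rough model of color[[CDATA = green]]: a case-insensitive equality test.
function colorsEqualTo(SimpleXMLElement $doc, string $value): array
{
    $hits = [];
    foreach ($doc->xpath('//color') as $color) {
        if (strcasecmp(trim((string) $color), $value) === 0) {
            $hits[] = $color->asXML();
        }
    }
    return $hits;
}

// Rough model of color[[CDATA]]: any non-empty character data matches.
function colorsWithCdata(SimpleXMLElement $doc): array
{
    $hits = [];
    foreach ($doc->xpath('//color') as $color) {
        if (trim((string) $color) !== '') {
            $hits[] = $color->asXML();
        }
    }
    return $hits;
}

$equalGreen = colorsEqualTo($doc, 'GREEN'); // upper case on purpose
$anyCdata   = colorsWithCdata($doc);

print_r($equalGreen);
print_r($anyCdata);
```

In this sketch the equality test finds only the green element, while the bare CDATA test finds both the green and the white element and skips the empty one.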


Accessing Attributes
Attributes are designated with the @ prefix. For instance, the vehicle element in the example document has three attributes, make, model, and year. Assuming we had a large database of vehicles and want to locate all vehicles of model years earlier than 2005, we could use the less-than operator and write the following expression:
vehicle[[@year <2005]]
This would return all the vehicle elements which satisfy the condition. The returned values would include the vehicle element and all its descendents; in other words, they include the START and END tags of the vehicle element and everything in between. For an illustration of this, see the example in the section on Expression Chaining.

It is also possible to write an expression which tests for the presence of an attribute, regardless of its value:
first_name[[@middle_init]]
A query using this expression would find all first_name elements which have a middle_init attribute. The actual output from our document, using eXcavator_getResultAsString, would be:


<FIRST_NAME MIDDLE_INIT = "M">Michael</FIRST_NAME>

<FIRST_NAME MIDDLE_INIT = "J">Douglas</FIRST_NAME>
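
For readers who know XPath, the two attribute tests above have close counterparts there. The following sketch uses PHP's SimpleXML extension, not eXcavator, and a hypothetical inventory: the attribute names follow this manual, but the models, the years, the Ford entry and the name "Pat" are invented for illustration.

```php
<?php
// Hypothetical inventory; values are invented for this sketch.
$xml = <<<XML
<inventory>
  <vehicle make="Honda" model="Civic" year="2003">
    <first_name middle_init="M">Michael</first_name>
  </vehicle>
  <vehicle make="Ford" model="Focus" year="2006">
    <first_name>Pat</first_name>
  </vehicle>
</inventory>
XML;

$doc = simplexml_load_string($xml);

// XPath counterpart of vehicle[[@year <2005]]: numeric comparison on an attribute.
$olderVehicles = $doc->xpath('//vehicle[@year < 2005]');

// XPath counterpart of first_name[[@middle_init]]: the attribute only has to exist.
$withInitial = $doc->xpath('//first_name[@middle_init]');

foreach ($olderVehicles as $vehicle) {
    echo $vehicle['make'], ' ', $vehicle['year'], "\n";
}
foreach ($withInitial as $firstName) {
    echo $firstName, ' ', $firstName['middle_init'], "\n";
}
```

Unlike eXcavator, XPath is case-sensitive and needs an explicit // to search at any depth, so the correspondence is only approximate.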


Strings
If a string consists of only a single word, then quotes are not required:
color[green]
But if a string has more than one word, it must be enclosed in double quotes:
dealer:zip [["R3N 2B2"]]
This is true for attribute values as well. It also means that a query string which includes a quoted condition string must itself be enclosed in single quotes:
$eXc->eXcavator_Query('dealer:zip [["R3N 2B2"]]');
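
When query strings are assembled from variables, the quoting rule can be applied mechanically. The helper below is hypothetical and not part of eXcavator; it is a minimal sketch that wraps a value in double quotes only when the value contains whitespace, and it does not attempt to handle values that themselves contain double quotes.

```php
<?php
// Hypothetical helper (not an eXcavator function): quote a condition
// value only if it consists of more than one word.
function quoteConditionValue(string $value): string
{
    return preg_match('/\s/', $value) === 1 ? '"' . $value . '"' : $value;
}

$single = 'color[[' . quoteConditionValue('green') . ']]';
$multi  = 'dealer:zip [[' . quoteConditionValue('R3N 2B2') . ']]';

echo $single, "\n"; // color[[green]]
echo $multi, "\n";  // dealer:zip [["R3N 2B2"]]
```

Either string could then be passed to eXcavator_Query in the same way as in Listing 1.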


Expression Chaining
Unlike most XML technology, XML_PullParser does not view the document as a tree. Rather it reconfigures the XML document to a "flat" array, that is, a numerically indexed array in which each array element holds either the name of a tag or data. For this reason, eXcavator can access elements outside of a strict hierarchical sequence, without having to work its way from root to branch. This, for instance, would be a valid query:
last_name[[CDATA]]
With this query, eXcavator would return all the last names it finds. The parser does not have to know the complete genealogy of last_name in order to find it.

Nevertheless, XML is by its nature hierarchical, and it is only by respecting its hierarchical character that one can locate data with precision. It is usually necessary to set the context for a search. For instance, a query for name[[CDATA]] would return all the names found in the document, both those of the owners and those of the dealers. The solution to this is expression chaining, which uses the colon as a separator. In expression chaining the element to the right of the colon must always be a descendent of the element to the left of the colon; it does not have to be a first-generation child of the left-hand element. Chaining allows sequences of elements, as long as the sequence respects the principle that each element to the right of a colon must be a descendent of the element to its left.
element_1:descendent_of_1...:descendent_of_n-1
Let's look at an example. The expression
owner:name[[CDATA]]
would return all the name elements that are descendents of owner. In the example document there would be two results, one for the owner named Taylor, the other for Jones. The data returned would include everything between the name START and END tags, as in the following instance:


<NAME>
<LAST_NAME>Jones</LAST_NAME>
<FIRST_NAME MIDDLE_INIT = "J">Douglas</FIRST_NAME>
<ADDRESS>
<STREET>200 Winnipegosis Ave</STREET>
<CITY>St Adolphe</CITY>
<CITY>Winnipeg</CITY>
<ZIP>R3L 1Z5</ZIP>
</ADDRESS>
</NAME>

Chaining allows for narrowing of focus. An important and powerful mechanism here is that it allows each element in the chain to have its own condition. Suppose we want to locate the owner of the Honda. The vehicle's make is an attribute of the vehicle element. So, we could find the owner using this query:
vehicle[@make=Honda]:owner[[CDATA]]
eXcavator locates the vehicle with a make attribute that equals "Honda", but since the condition is not in double square brackets, it doesn't save the vehicle element. Instead, it passes the query on to the next condition in the chain. If that evaluates to TRUE, i.e. if there is character data in the owner element, it saves the element and all its descendents. Had the first condition been in double brackets, it would have saved both a copy of the vehicle element and a copy of owner, making for unnecessary duplication, since the owner data is a subset of the vehicle element.
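
The narrowing effect of chaining can be illustrated outside eXcavator. The sketch below uses PHP's SimpleXML extension and XPath's descendant axis as a rough stand-in for the colon; it is not how eXcavator works internally. The document is hypothetical: the dealer names, the Ford entry, and the pairing of owners with vehicles are invented for illustration.

```php
<?php
// Hypothetical inventory; dealer names and owner/vehicle pairings are invented.
$xml = <<<XML
<inventory>
  <vehicle make="Honda">
    <owner><name><last_name>Taylor</last_name></name></owner>
    <dealer><name>Oak Bay Motors</name></dealer>
  </vehicle>
  <vehicle make="Ford">
    <owner><name><last_name>Jones</last_name></name></owner>
    <dealer><name>Prairie Auto</name></dealer>
  </vehicle>
</inventory>
XML;

$doc = simplexml_load_string($xml);

// name[[CDATA]] with no context: every name element, owners and dealers alike.
$allNames = $doc->xpath('//name');

// owner:name[[CDATA]]: only name elements that descend from an owner.
$ownerNames = $doc->xpath('//owner//name');

// vehicle[@make=Honda]:owner[[CDATA]]: the vehicle test only filters;
// the owner element is what gets returned.
$hondaOwners = $doc->xpath("//vehicle[@make='Honda']//owner");

echo count($allNames), ' ', count($ownerNames), "\n"; // 4 2
echo $hondaOwners[0]->asXML(), "\n";
```

The last query returns the owner element but nothing else from the Honda's vehicle element, which mirrors the effect of leaving the first condition in single brackets.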


Context and Context Element
Because eXcavator does not use strict hierarchical structures, a chained element does not have to be a first-generation child of the element to its left. As we have noted above, it needs only to be a descendent. Therefore, this would be a valid query expression:
owner:street[[CDATA]]
Although street is a third-generation descendent of owner, this query would return all the street data enclosed by owner elements. Here owner is the context-element. eXcavator looks at each owner element to see whether it has a descendent named street, and if it finds one with character data, it returns the data. As far as the eXcavator evaluation engine is concerned, all of these are equal in status:
owner:name,owner:last_name,owner:address,owner:street
It does not matter that these elements represent three generations of parents, children and siblings. The only factor of significance is that they are all governed by the same context-element, owner. In a chained sequence of elements, each element to the left of the colon is the context and hence the context-element for the element to the right of the colon.

If an element is the first element in the chain, i.e. the left-most element, then its context is implicitly the root element. But any possible parent/grand-parent elements which it might have are out of scope.

Scope plays an important role when formatting text with eXcavator_getFormattedText. eXcavator saves only the data which has been requested with the double brackets. Therefore, the data from any contexts which precede the double brackets has not been saved and is consequently out of scope:
vehicle[color=>green]:owner[[CDATA]]
The formatting method will be able to access the street element of owner but not the color or carfax elements, which appear only under vehicle.


The Arrow Operator
The arrow operator is used to meet a specific need in query processing. Assume that there is more than one owner with a Honda, and that we'd like to get the owner information for the owner named Jones. The following expression would return only the last name:
owner:last_name[[Jones]]
And we already have that. If instead, we wrote the expression as follows:
owner[[CDATA]]:last_name[[Jones]]
the evaluation engine would answer TRUE every time it encountered character data in an owner element, so in effect we'd get all the owners in our list. And then when it encountered Jones, it would turn out a single last_name element with the name Jones. The only way to control the output is to place owner in a context where it evaluates to FALSE in every case except one: the case in which Jones occurs in last_name.

The solution to this problem is the ARROW operator, which enables us to place an element name inside the condition:
owner[[last_name=>Jones]]
This expression tells eXcavator to look in each owner element until it finds one with last_name equal to "Jones". Only then will it return TRUE. The expression to the right of the ARROW operator can be any valid eXcavator expression. For instance, this could be rewritten as
[[last_name=>CDATA = Jones]]
or, if we wanted everyone but Jones:
[[last_name=>CDATA != Jones]]
(See note 1.)

There is only one exception. The ARROW operator does not support the unqualified CDATA construct:
owner[[last_name=>CDATA]]
To get this result use the following:
owner[[last_name=>CDATA != ""]]
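
The distinction between testing an element and returning it can be sketched with XPath predicates. The block below uses PHP's SimpleXML extension rather than eXcavator, and the two-owner document is a hypothetical, flattened stand-in for the appendix data. Note that XPath comparisons are case-sensitive, which eXcavator's are not by default.

```php
<?php
// Hypothetical, simplified owner list.
$xml = <<<XML
<inventory>
  <owner><last_name>Taylor</last_name><first_name>Michael</first_name></owner>
  <owner><last_name>Jones</last_name><first_name>Douglas</first_name></owner>
</inventory>
XML;

$doc = simplexml_load_string($xml);

// Like owner:last_name[[Jones]]: only the last_name element comes back.
$lastNameOnly = $doc->xpath("//owner//last_name[. = 'Jones']");

// Like owner[[last_name=>Jones]]: the whole owner whose last_name is Jones.
$wholeOwner = $doc->xpath("//owner[.//last_name = 'Jones']");

// Like owner[[last_name=>CDATA != Jones]]: every owner except Jones.
$everyoneElse = $doc->xpath("//owner[.//last_name != 'Jones']");

echo $lastNameOnly[0]->asXML(), "\n";
echo $wholeOwner[0]->first_name, "\n";   // Douglas
echo $everyoneElse[0]->first_name, "\n"; // Michael
```

In the second and third queries the last_name test sits inside the predicate, so it decides which owner is returned without itself being the result.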


The Arrow Operator and Focussing in on Descendent Elements
Let us assume that we want to return an entire vehicle where the dealer is in a particular city. We could use the following:
vehicle[[city=>CDATA _INCL "St Adolphe"]]
But this condition is satisfied by any city element under vehicle that includes "St Adolphe", so if there were owners who lived in St. Adolphe, their vehicle elements would also be returned. The way to deal with this is as follows:
vehicle[[dealer/address/city=>CDATA _INCL "St Adolphe"]]
This would also work:
vehicle[[dealer/city=>CDATA _INCL "St Adolphe"]]
Note: Each element in the descendent list has to be a descendent of the previous element in the list.
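
The effect of adding dealer to the descendent list can be sketched with XPath. The block below uses PHP's SimpleXML extension, not eXcavator, and treats _INCL as a substring test via XPath's contains(), which is an assumption based on the description above. The document is hypothetical: which owners and dealers are in which city is invented for illustration.

```php
<?php
// Hypothetical inventory; the city assignments are invented.
$xml = <<<XML
<inventory>
  <vehicle make="Honda">
    <owner><address><city>St Adolphe</city></address></owner>
    <dealer><address><city>Winnipeg</city></address></dealer>
  </vehicle>
  <vehicle make="Ford">
    <owner><address><city>Winnipeg</city></address></owner>
    <dealer><address><city>St Adolphe</city></address></dealer>
  </vehicle>
</inventory>
XML;

$doc = simplexml_load_string($xml);

// Like vehicle[[city=>CDATA _INCL "St Adolphe"]]: any city below vehicle counts.
$anyCity = $doc->xpath("//vehicle[.//city[contains(., 'St Adolphe')]]");

// Like vehicle[[dealer/city=>CDATA _INCL "St Adolphe"]]: only a city below dealer counts.
$dealerCity = $doc->xpath("//vehicle[.//dealer//city[contains(., 'St Adolphe')]]");

echo count($anyCity), "\n";       // 2
echo $dealerCity[0]['make'], "\n"; // Ford
```

The unqualified test matches both vehicles because one owner lives in St Adolphe; qualifying the city with dealer leaves only the vehicle whose dealer is there.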


The Comma Operator
The comma operator separates a list of parallel elements, each of which is a descendent of the same context-element. A context-element must always be explicitly supplied. Here is an example query using the comma operator:
owner:last_name[[CDATA]],first_name[[CDATA]],street[[CDATA]]
This would also work, since all three of these elements are descendents of name:
owner:name:last_name[[CDATA]],first_name[[CDATA]],street[[CDATA]]
Using eXcavator_getResultAsXMLDoc to display the output from this query, we get:


<?xml version = "1.0"?>
<__root__>

<LAST_NAME>Taylor</LAST_NAME>

<FIRST_NAME MIDDLE_INIT = "M">Michael</FIRST_NAME>

<STREET>323 Oak Bay</STREET>

<LAST_NAME>Jones</LAST_NAME>

<FIRST_NAME MIDDLE_INIT = "J">Douglas</FIRST_NAME>

<STREET>200 Winnipegosis Ave</STREET>

</__root__>


There are a number of points to keep in mind when using the comma operator.
  1. The results appear in the order in which they occur in the list. If street had been placed first in the list, then it would appear first in the results.

  2. The evaluation engine will stop processing when it comes on an expression which evaluates to FALSE. So, if we only wanted the first name and street address of owners with the last name of Jones, we could start the list with
    owner:last_name[[Jones]].
    But the following sequence could lead to unexpected results:
    owner:first_name[[CDATA]],last_name[[Jones]],street[[CDATA]]
    As eXcavator works its way through a database with this query, it will pump out all the first names that it finds, because CDATA will always evaluate to TRUE and because there is nothing in front of it that evaluates to FALSE. But the last_name element will evaluate to TRUE only when the evaluation engine encounters "Jones", at which point eXcavator will save that element and its data. Having encountered "Jones", it then goes on to the next expression, which is street[[CDATA]], and it saves that element. In every other case, the last_name expression evaluates to FALSE and eXcavator goes on to the next owner, saving neither last_name nor street. So, we end up with a long list of first names and one entry each for Jones's last name and street address.

  3. The elements in the list must be precisely parallel, so that this would not give us the street, even though address:street is a descendent of owner:
    owner:name:last_name[[CDATA]],first_name[[CDATA]],address:street[[CDATA]]
    This, however, would give us the names and the address elements:
    owner:name:last_name[[CDATA]],first_name[[CDATA]],address[[CDATA]]

  4. If a context-element is not explicitly supplied, the query will fail.
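
The stop-at-FALSE behavior described in point 2 can be modeled with a short loop. This is a simplified simulation in plain PHP, not eXcavator's actual evaluation engine: owners are represented as hypothetical flat arrays, and each list entry is reduced to an element name plus an optional value that it must equal.

```php
<?php
// Hypothetical owner records, flattened for the simulation.
$owners = [
    ['first_name' => 'Michael', 'last_name' => 'Taylor', 'street' => '323 Oak Bay'],
    ['first_name' => 'Douglas', 'last_name' => 'Jones',  'street' => '200 Winnipegosis Ave'],
];

// Each list entry is [element, requiredValue]; null means "any CDATA".
function runCommaList(array $owners, array $list): array
{
    $saved = [];
    foreach ($owners as $owner) {
        foreach ($list as [$element, $mustEqual]) {
            $value = $owner[$element] ?? '';
            $ok = $mustEqual === null
                ? $value !== ''
                : strcasecmp($value, $mustEqual) === 0;
            if (!$ok) {
                break; // first FALSE expression ends processing for this owner
            }
            $saved[] = "$element=$value";
        }
    }
    return $saved;
}

// Models owner:first_name[[CDATA]],last_name[[Jones]],street[[CDATA]]
$leaky = runCommaList($owners, [['first_name', null], ['last_name', 'Jones'], ['street', null]]);

// Models owner:last_name[[Jones]],first_name[[CDATA]],street[[CDATA]]
$tight = runCommaList($owners, [['last_name', 'Jones'], ['first_name', null], ['street', null]]);

print_r($leaky);
print_r($tight);
```

With the test for Jones in second place, Michael's first name is saved before the list is abandoned for that owner; with the test in first place, nothing is saved for any owner other than Jones.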


The _OR_ and _AND_ Operators
Beginning with version 1.0.2 of eXcavator, these operators have enhanced functionality, which is detailed in the next section of the manual.

Notes
1. The full range of operators and expressions is detailed on the Operators page.