eXcavator
An XML Query Facility for XML_PullParser
version 1.0.6
Myron Turner
Operators and Expressions


The examples in this manual are based on the XML document in the appendix. With only a few exceptions all operators have both a C-style form and a Text-style form. For instance, GREATER THAN OR EQUAL can be represented either as >= or _GTE_ . The Text-style operators must always be upper case.


Operators and Terms used in Expressions

: (colon) context separator: expression to the right of colon is evaluated in the context of the left-hand expression
, (comma) (1) separates a list of parallel elements, each of which is a descendent of the same context-element.
(2) separates list of sub-expressions in a condition:
        [@model=Accord,@make=Honda]
List processing stops at first expression that evaluates to FALSE
@ (at-sign) attribute prefix: used within conditions to distinguish attribute names from element names. @middle_init evaluates to middle_init , which is an attribute of the first_name element.
"" (double quotes) must be used to enclose strings of more than a single word; this means that the query expression must be enclosed in single quotes:
        'dealer:name[["Winnipeg Motors"]]'
[ ] (brackets) brackets enclose a condition; if the condition evaluates to TRUE the evaluation process moves on to the next expression; if FALSE the evaluation process stops and moves on to the next context expression, if any.
[[ ]] (double brackets) double brackets enclose a condition but if the condition evaluates to TRUE the data stipulated in the condition is extracted and returned; otherwise, it behaves exactly as single brackets.
The following operators all have both a C-style form and a Text form. In simple expressions they can be used interchangeably. But in complex expressions, which bind together more than one expression, the C-style operator must always be used in all sub-expressions. A simple expression consists of one relation: A > B. A complex expression consists of two or more relations: A > B _OR_ C != D _OR_ F = G
>= _GTE_ GREATER THAN OR EQUAL
<= _LTE_ LESS THAN OR EQUAL
< _LT_ LESS THAN
> _GT_ GREATER THAN
= _EQ_ EQUAL
Note: only a single, not a double, equal sign is used for this relation:
            vehicle[@year=2005]
!= _NE_ NOT EQUAL
&& _AND_ The text format must be used to bind together the major terms of the AND-ed expression, and the C-style forms must be used in the minor terms:
      @year>=2004 _AND_ @model=Accord _AND_ @make=Honda
Enhanced functionality for both _AND_ and _OR_ was implemented with release 1.0.2 and is detailed below.
_OR_ The OR operator does not have a C-style equivalent. Like the AND operator, OR-ed sub-expressions must use the C-style operators. Enhanced functionality for both _AND_ and _OR_ was implemented with release 1.0.2 and is detailed below.
<< _INCL_ INCLUSION. The INCLUSION operator indicates that the left term includes the right term:
      owner:name[CDATA _INCL_ winnipeg]
      vehicle[@model << Accord]
The INCLUSION operator uses Perl regular expressions and makes case-insensitive comparisons, even when case-sensitivity is in effect. It also ignores duplicate runs of internal white-space.
=> _ARROW_ The ARROW operator binds together an element and one or more expressions. The expressions are evaluated only as they apply to this element. This element, moreover, must be a descendent of the element outside the condition brackets, i.e. the context-element. If the condition evaluates to TRUE, and double brackets are used, the context-element and all its descendents are returned:
      dealer[[zip => "R3N 2B2"]]
If eXcavator locates a dealer whose Zip Code is R3N 2B2, it will return dealer and its descendents.

Any valid query expression, including comma-separated lists, may go to the right of the ARROW, with one exception: the ARROW operator does not support the unqualified CDATA keyword:
      vehicle[mileage=>CDATA]:dealer[[CDATA]]
Instead, the following expression will get the desired result (see Note 1):
      vehicle[mileage=>CDATA != ""]:dealer[[CDATA]]
    / The FORWARD SLASH is used in conjunction with the ARROW operator to narrow down the search to a particular descendent element:
      vehicle[dealer/address/city=>CDATA = Winnipeg]
Non_Operators
CDATA CDATA is a keyword that represents any character data that occurs in an element. It works by simple substitution of an element's character data for the CDATA keyword. CDATA works with strings and numbers alike:
      vehicle[[mileage => CDATA < 15000]]
      dealer:address[[CDATA _INCL_ "R3N 2B2"]] 2
element_name: This construct tests for the existence of an element. An element name followed by a colon returns a true value if it exists in the current context. Otherwise it returns FALSE. If it is the left-most element, i.e. the first element in the chain, then its context is implicitly the root element.
element[@attribute] A condition which consists solely of the attribute name tests for the existence of the attribute in the current context and returns TRUE if it exists, FALSE if it doesn't.
element[string] This construct is a shortcut for equality: vehicle:color[green] is the same as vehicle:color[CDATA = green].
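
The relational operators listed above can be illustrated with a small Python model. This is only a sketch of the behaviour the manual describes, not eXcavator's own code: the function names are hypothetical, the numeric-versus-string comparison rule is an assumption based on the statement that CDATA works with strings and numbers alike, and trimming plus collapsing white-space is one reading of how INCLUSION ignores duplicate runs of white-space.

```python
import re

# Text-style operators and their C-style equivalents, as listed above.
# "_NE_" is assumed to follow the same _XX_ pattern as the other operators.
TEXT_TO_C = {
    "_GTE_": ">=",
    "_LTE_": "<=",
    "_LT_": "<",
    "_GT_": ">",
    "_EQ_": "=",
    "_NE_": "!=",
    "_INCL_": "<<",
}


def normalize(text):
    """Collapse runs of white-space and trim the ends (assumed behaviour)."""
    return re.sub(r"\s+", " ", text).strip()


def as_number(text):
    """Return text as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def evaluate_relation(cdata, operator, value):
    """Evaluate 'CDATA <operator> value' for one element's character data."""
    op = TEXT_TO_C.get(operator, operator)
    if op == "<<":
        # INCLUSION: case-insensitive regular-expression search.
        pattern = re.escape(normalize(value))
        return re.search(pattern, normalize(cdata), re.IGNORECASE) is not None
    left, right = as_number(cdata), as_number(value)
    if left is None or right is None:
        # Fall back to plain string comparison (an assumption).
        left, right = cdata, value
    comparisons = {
        ">=": left >= right,
        "<=": left <= right,
        "<": left < right,
        ">": left > right,
        "=": left == right,
        "!=": left != right,
    }
    if op not in comparisons:
        raise ValueError("unknown operator: " + operator)
    return comparisons[op]
```

For instance, evaluate_relation("9000", "_LT_", "15000") is TRUE under the numeric rule, whereas a pure string comparison of the same values would be FALSE; evaluate_relation("200  Winnipegosis Ave", "<<", "winnipeg") is TRUE because INCLUSION ignores case.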


Enhanced Functionality for _AND_ and _OR_

Versions < 1.0.2
Prior to version 1.0.2 conditions using _AND_ and _OR_ had two restrictions. First, each expression in the condition had to be a dependent of the same element, and secondly a single condition could not test for both the attribute data and the character data contained by the element. This would be a valid query in versions prior to 1.0.2:
vehicle[[@year> 2003 _AND_ @make << Acura ]]
Both terms of the expression test for attributes and both attributes are dependents of vehicle . But this query, which tests for both an attribute and the element's character data, would have failed:
first_name[[CDATA << Douglas _AND_ @middle_init=J]]

Version 1.0.2
Beginning with version 1.0.2, the above query would be handled as follows:
first_name[[CDATA << Douglas _AND_ first_name @middle_init=J]]
The only difference from the original attempt is that the expression to the right of the _AND_ is now prefaced by the name of the element: 'first_name @middle_init=J' . The basic syntax is this:
[[initial-expression _AND_ target-element relational-expression _AND_ ...]]
The initial-expression remains syntactically unchanged from previous versions of eXcavator, except that the shortcut for equality cannot be used (see Note 2). All expressions to the right of _AND_ use the new syntax, where target-element can be either the context-element or one of its descendents. The context-element is the element to the left of the brackets:
context-element[[ initial-expression _AND_ . . .]]
The target-element does not have to be a first-generation child of the context-element, only a descendent. In our example, the initial-expression is 'CDATA << Douglas'. This is followed by the target element, i.e. the element which contains the data, which is first_name, and then by the relational expression, namely '@middle_init=J'. This new syntax holds whether we are testing for attributes or for the character data contained by an element:
vehicle[[@year>= 2003 _AND_ color = green ]]
The first test is for the year attribute of vehicle; the second test is for the character data of the color element, which is a descendent of the context-element, namely vehicle. Note that the relational expression is '= green'. While '= green' is obviously the right-hand side of a relational expression, it can be understood as a shortcut for a complete relational expression: 'CDATA = green'.
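
To make the shape of the 1.0.2 syntax concrete, here is a minimal Python sketch that splits a condition into its initial-expression and its (target-element, relational-expression) pairs. The function name is hypothetical and this is not how eXcavator itself parses conditions; it only mirrors the rules stated above, including the expansion of a bare '= green' to 'CDATA = green'.

```python
# C-style relational operators that may begin a shortened relational expression.
RELATIONAL = ("<<", ">=", "<=", "!=", "=", "<", ">")


def split_and_condition(condition):
    """Split '[[initial _AND_ target relation _AND_ ...]]' into its parts.

    Returns (initial_expression, [(target_element, relational_expression), ...]).
    """
    body = condition.strip().strip("[]").strip()
    parts = [part.strip() for part in body.split("_AND_")]
    initial, rest = parts[0], parts[1:]
    terms = []
    for part in rest:
        # The first word names the target-element; the remainder is the relation.
        target, _, relation = part.partition(" ")
        relation = relation.strip()
        if relation.startswith(RELATIONAL):
            # '= green' is shorthand for 'CDATA = green'.
            relation = "CDATA " + relation
        terms.append((target, relation))
    return initial, terms
```

Applied to the two conditions discussed above, split_and_condition("[[CDATA << Douglas _AND_ first_name @middle_init=J]]") yields the initial-expression 'CDATA << Douglas' with the single pair ('first_name', '@middle_init=J'), and split_and_condition("[[@year>= 2003 _AND_ color = green ]]") yields '@year>= 2003' with the pair ('color', 'CDATA = green').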

As a final example, let's process the following query:
owner:name[[CDATA << Douglas _AND_ last_name CDATA << Jones]]
It would return the entire name element:

 <NAME>
   <LAST_NAME>Jones</LAST_NAME>
   <FIRST_NAME MIDDLE_INIT="J">Douglas</FIRST_NAME>
   <ADDRESS>
     <STREET>200 Winnipegosis Ave</STREET>
     <CITY>St Adolphe</CITY>
     <CITY>Winnipeg</CITY>
     <ZIP>R3L 1Z5</ZIP>
   </ADDRESS>
 </NAME>

The above query could also be formed with the _ARROW_ operator:
owner:name[[first_name => CDATA << Douglas _AND_ last_name CDATA << Jones]]
This would be clearer and more precise.

Forward Slash Syntax
The target-element does not accept the forward slash syntax:
vehicle[[name/address/city =>CDATA <<Winnipeg _AND_ street <<Oak]]
vehicle[[name/address/city =>CDATA <<Winnipeg _AND_ name/address/street <<Oak]]
Both of these queries are viewed as being the same. In the second example, when eXcavator sees 'name/address/street', it ignores 'name/address/' and looks only at 'street'. In both cases the context for street is name/address. This is a change from versions prior to 1.0.6, where in both cases the context for the street would have been vehicle.
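
The rule can be pictured with a short Python sketch. The function name is hypothetical, and deriving the context from the path on the left of the ARROW is an assumption drawn from the two example queries; the code is not taken from eXcavator.

```python
def resolve_target(arrow_path, target):
    """Return (context, element) for a target-element in an _AND_/_OR_ term.

    arrow_path is the slash path to the left of the ARROW in the
    initial-expression, e.g. 'name/address/city'.
    """
    # The context is the ARROW path minus its last step (assumed rule).
    context = arrow_path.rpartition("/")[0]
    # Any path prefix on the target-element is ignored.
    element = target.rpartition("/")[2]
    return context, element
```

Both resolve_target("name/address/city", "street") and resolve_target("name/address/city", "name/address/street") give ('name/address', 'street'), which is why the two queries are treated as the same.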


The _OR_ Operator
Everything that's been said here about the new syntax for the _AND_ operator applies to _OR_ . Other than the differences in semantics between the two operators, the class's coding for both is identical.

Note on 1.0.2 _AND_ and _OR_ Syntax
The _OR_ and _AND_ operators cannot be used in the same condition.

Example Queries
Following is a list of sample queries.
  1. owner:street[[CDATA]]
    Extract owner if there is any character data in street.
  2. owner[[CDATA _INCL_ Jones]]
    Extract owner if it or any of its descendents contains the name Jones.
  3. first_name[[@middle_init]]
    Extract any first_name elements that have an attribute named middle_init.
  4. first_name[[@middle_init=J]]
    Extract any first_name elements that have a middle_init attribute with the value of "J".
  5. owner[[first_name=>@middle_init=J]]
    Extract the owner whose middle initial is "J". (Of course, there could be more than one, and if so there would be more than one result.)
  6. dealer[[name => "Winnipeg Motors"]]
    Find the dealer named "Winnipeg Motors" and get its data. This is a test for equality, so the match has to be exact. Any superfluous white-space in either the original XML document or the query will affect the result. Use _INCL_, as in the following example, if there is any question of superfluous white space.
  7. dealer[[name => CDATA _INCL_ " Winnipeg Motors"]]
    Find the dealer named "Winnipeg Motors" and get its data. Ignore superfluous white-space.
  8. dealer[[zip => "R3N 2B2"]]
    Find the dealer whose ZIP code is R3N 2B2 and get its data.
  9. dealer:address[[CDATA << WINNIPEG]]
    Find dealers whose address element includes reference to Winnipeg and extract the address for each one found.
  10. vehicle[carfax=>@buyback=no]:dealer[[CDATA]]
    Search for vehicles that do not offer a buyback option and extract the dealer for those vehicles.
  11. owner:last_name[[CDATA]], first_name[[Mary]], street[[CDATA]]
    This probably isn't going to do what the query-writer intended, if it is designed to search for an owner with the first name of "Mary" and extract the name and street address. It will extract the last name of every owner in the database. If it finds "Mary", then it will extract first_name and street as well. Since our example XML doesn't have a "Mary", all we get is a list of the last names.

    Explanation: Each request for last_name[[CDATA]] will be true, unless the element is empty, so the last names are extracted. But if the expression evaluator reaches first_name[[Mary]] and does not find "Mary" in the first_name element, it will return FALSE and go on to the next owner. If it does find "Mary", it will return TRUE and eXcavator will save her first name and street address, in addition to the last name, which has already been saved.

    Solution: put first_name[[Mary]] in the first position in the series. Then all owners except ones named Mary will be rejected. In our case, this means that no result would be returned. The next example query formulates this type of query correctly.
  12. owner:last_name[[Jones]], first_name[[CDATA]], street[[CDATA]]
    This query does what the previous query intended. It rejects all owners without last_name equal to "Jones". When it comes to Jones, it extracts last_name, first_name, and street.
  13. vehicle[mileage=>CDATA < 30000]:dealer[[CDATA]]
    This searches for vehicles with mileage under 30000 and then retrieves the dealer.
  14. owner:last_name[[Jones]]
    The form element_name[string] is an equality test. It could also be written: owner:last_name[[CDATA = Jones]] as in the next example. If TRUE, it returns only the last_name, which we already know. Its usefulness would most probably be in compound expressions.
  15. owner:last_name[[CDATA = Jones]]
    See previous query. Returns only last name. We probably want the format in the next query, which uses the owner's last name to fetch the owner element and all its descendents.
  16. owner[[last_name => Jones]]
    Get owner if last_name equals "Jones".
  17. owner:last_name[[CDATA != Jones]]
    Get the last_name of every owner whose last_name does not equal "Jones".
  18. owner[[last_name => CDATA]]
    This is meant to extract owner if there is character data in last_name, but it will fail because this construction is not allowed. See the description of the ARROW operator in the section on Operators above and in the section on the ARROW operator on the previous page. See the next example query for the required syntax to achieve this purpose.
  19. owner[[last_name => CDATA != ""]]
    This is the way to do what the previous example fails to do. It compares CDATA to the empty string. If CDATA is not empty it returns TRUE, otherwise FALSE. In our example, it returns the owner for both vehicle contexts.
  20. owner[[last_name => CDATA = Jones]]
    This extracts the owner if you know the owner's name. It can also be written owner[[last_name => Jones]] , as in example 16 above.
  21. vehicle[@make=Honda]:name[[CDATA]]
    This query extracts all the name elements in vehicle: the two name elements from dealer and the name element with all its descendents from owner. The reason for this is that the context for name is vehicle. The next example narrows the focus of the query by placing name in the context of owner.
  22. vehicle[@make=Honda]:owner:name[[CDATA]]
    Unlike the previous example which gets all the name elements in vehicle , this query extracts only the name element from owner .
  23. vehicle:first_name[[@middle_init]]
    This query extracts all first_name elements with a middle_init attribute. Since vehicle is the top-most governing context-element, this could also have been written: first_name[[@middle_init]]. That form is, in fact, more efficient. By prefixing the expression with vehicle, you are asking XML_PullParser to parse every vehicle element and place it and all descendents on the parser's internal stack. But if you ask for the stripped-down form, it will only parse and place on its stack the first_name element and any possible descendents.
  24. vehicle:color[green]:Last_Name[[CDATA]]
    This will not work because there is no link in the chain between color and Last_Name. Last_Name is not a descendent of color. Either one of the next two will work.
  25. vehicle:color[green],Last_Name[[CDATA]]
    This works because both color and Last_Name are descendents of vehicle.
  26. vehicle[color=>green]:Last_Name[[CDATA]]
    This works because Last_Name is a descendent of vehicle.
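
The difference between example queries 11 and 12 comes down to the order of a comma-separated list whose processing stops at the first FALSE. The Python sketch below models that behaviour only; the function name is hypothetical, and the owner records are invented sample data (the second owner in particular), not the appendix document.

```python
def run_list(owners, terms):
    """Process a comma-separated list of [[...]] terms against each owner.

    Each term is (element, wanted): wanted=None stands for [[CDATA]]
    (any character data), a string stands for the equality shortcut.
    Extraction is cumulative: items saved before a FALSE term are kept.
    """
    results = []
    for owner in owners:
        for element, wanted in terms:
            text = owner.get(element, "")
            matched = bool(text) if wanted is None else text == wanted
            if not matched:
                break  # stop this owner's list at the first FALSE
            results.append((element, text))
    return results


# Invented sample data for illustration only.
OWNERS = [
    {"last_name": "Jones", "first_name": "Douglas", "street": "200 Winnipegosis Ave"},
    {"last_name": "Smith", "first_name": "Ann", "street": "12 Oak St"},
]

# Example 11: last_name[[CDATA]], first_name[[Mary]], street[[CDATA]]
query_11 = [("last_name", None), ("first_name", "Mary"), ("street", None)]

# Example 12: last_name[[Jones]], first_name[[CDATA]], street[[CDATA]]
query_12 = [("last_name", "Jones"), ("first_name", None), ("street", None)]
```

With this data, run_list(OWNERS, query_11) saves only the two last names, because each owner is abandoned at first_name[[Mary]] after last_name has already been extracted. run_list(OWNERS, query_12) rejects the second owner at the first term and saves last_name, first_name, and street for Jones.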

Notes
1. If some elements hold only white space, and you don't get the desired result, try making the following call before executing the Query:
XML_PullParser_excludeBlanks(TRUE).
This shouldn't normally be necessary, but may prove useful.
2. That is, you must use CDATA = green and not just green , which is allowed in expressions like color[green] . See Operators and Terms above.