Wednesday, July 20, 2011

Traversing the XPath



What is XPath: XPath is an expression language for addressing data in an XML document. An XPath expression can select one or more nodes of the document and can also perform basic arithmetic. It is a rich language: nodes can be selected on almost any condition, including arithmetic comparisons.
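To give a first taste of the arithmetic side, here is a minimal sketch in Java. The class name ArithmeticDemo, the helper number() and the cut-down bookshelf document are assumptions made only for this illustration; the XPath functions sum() and count() and the JAXP calls are standard.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ArithmeticDemo {

    // A made-up, cut-down bookshelf document used only for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title><pages>360</pages></book>"
            + "<book><title>On the Way</title><pages>2135</pages></book>"
            + "<book><title>Sing Me a Song</title><pages>230</pages></book>"
            + "</bookshelf>";

    // Evaluates the expression against XML and returns the result as a number
    static double number(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Total number of pages on the shelf
        System.out.println(number("sum(/bookshelf/book/pages)"));
        // Arithmetic on a selected value
        System.out.println(number("/bookshelf/book[1]/pages + 40"));
        // Selection based on an arithmetic comparison, then counted
        System.out.println(number("count(/bookshelf/book[pages > 500])"));
    }
}
```

Note that this sketch asks for XPathConstants.NUMBER as the result type, because these expressions produce a number rather than a set of nodes.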

Why is XPath important: XPath is a building block of several other important specifications, for example XSLT (Extensible Stylesheet Language Transformations) and XPointer (XML Pointer Language). XSLT can be used to generate one XML document from another, and also to generate human-readable XHTML documents from an XML document. XPointer itself is used in specifications like XLink (which I discussed in earlier posts).

XPath works on the XML dataset: This means that XPath cannot be used to point to the '<' character that makes up a tag. Instead, it addresses the tree of nodes that the document represents, much like the Document Object Model (DOM). In short, it works on the data represented by the XML document, not on its lexical representation. Hence, any document format that can be turned into a similar tree of nodes can also be processed with XPath. For example, JSON (JavaScript Object Notation) can be mapped to a tree similar to that of XML (but much less rich), so a subset of XPath can be used to process JSON as well, if we choose to do so.
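A small sketch may make this concrete. The class name DataModelDemo and the two sample strings are assumptions made up for this illustration. The two documents are written differently (quotes, empty-element syntax, an entity reference versus a CDATA section) but represent the same data, so the same XPath expressions should give the same answers on both.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DataModelDemo {

    // Two made-up documents that differ lexically but describe the same data
    static final String FIRST = "<note lang=\"en\"><to>A&amp;B</to><body/></note>";
    static final String SECOND = "<note lang='en' ><to><![CDATA[A&B]]></to><body></body></note>";

    // Parses the given XML text and returns the string value of the XPath result
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // The two-argument evaluate() converts the result to a string
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = { "/note/to", "/note/@lang", "count(/note/body/node())" };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + evaluate(FIRST, expression)
                    + " | " + evaluate(SECOND, expression));
        }
    }
}
```

There is no expression that could tell the two documents apart by their quoting style or by the way the '&' was escaped, because that information is not part of the data XPath sees.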




How to Test: Before we go any further, we need a program to test our XPath expressions. We will use the following Java code for that purpose. Why Java? Firstly, because it is the language I am most comfortable with, and secondly, because it runs on all platforms. Just copy this program into a file (please check the package structure), compile it and run it. If you run it without arguments, it prints the arguments it expects.

** If you do not know how to compile and run a Java program, there is a very good tutorial here


package com.blogspot.debasishwebguide;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import javax.xml.namespace.NamespaceContext;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/**
 * A small program to parse an XPath expression and run it on a given XML document
 * @author Debasish Ray Chawdhuri
 *
 */

public class XPathTest {

        
        public static void main(String[] args) {
                if(args.length!=2 && args.length!=3){
                        System.out.println("Usage: com.blogspot.debasishwebguide.XPathTest <filename> <xpath_expression> [namespaceBindings]");
                        return;
                }
                try{
                        String filename=args[0];
                        String xpathExpr=args[1];
                        String namespaceBindings=args.length==3?args[2]:"";
                        
                        System.out.println("Evaluating expression: "+xpathExpr);
                        
                        //Process the namespace bindings passed as argument
                        
                        NamespaceContext namespaceContext=getNamespaceContext(namespaceBindings);
                        
                        //Parse the XML document into a DOM
                        DocumentBuilderFactory documentBuilderFactory=DocumentBuilderFactory.newInstance();
                        documentBuilderFactory.setNamespaceAware(true);//Very important if we are using namespaces
                        DocumentBuilder documentBuilder=documentBuilderFactory.newDocumentBuilder();
                        Document doc=documentBuilder.parse(new File(filename));
                        
                        //Now we are ready for XPath processing
                        XPathFactory xPathFactory=XPathFactory.newInstance();
                        XPath xPath=xPathFactory.newXPath();
                        xPath.setNamespaceContext(namespaceContext);
                        
                        
                        //We need to tell what we are expecting as a result, in this case, a nodeset

                        Object result=xPath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
                        
                        
                        if(result instanceof NodeList){
                                NodeList nodeList=(NodeList) result;
                                int length=nodeList.getLength();
                                for(int i=0; i<length; i++){
                                        Node node=nodeList.item(i);
                                        QName nodeName=node.getLocalName()!=null?new QName(node.getNamespaceURI(), node.getLocalName()):new QName(node.getNodeName());
                                        short nodeType=node.getNodeType();
                                        
                                        switch(nodeType){
                                        case Node.ELEMENT_NODE:
                                                printElement((Element) node);
                                                break;
                                        case Node.ATTRIBUTE_NODE:
                                                System.out.println("Attribute: name="+nodeName+", value="+node.getNodeValue());
                                                break;
                                        case Node.CDATA_SECTION_NODE:
                                                System.out.println("CDATA Section: "+node.getTextContent());
                                                break;
                                        case Node.COMMENT_NODE:
                                                System.out.println("Comment: "+node.getTextContent());
                                                break;
                                        case Node.TEXT_NODE:
                                                System.out.println("Text: "+node.getTextContent());
                                                break;
                                        default:
                                                System.out.println("Node: "+node.getTextContent());
                                        }
                                }
                        }else{
                                System.out.println(result);
                        }
                        
                }catch(Exception ex){
                        System.out.println("Usage: com.blogspot.debasishwebguide.XPathTest <filename> <xpath_expression> [namespaceBindings]");
                        ex.printStackTrace();
                }

        }
        
        public static void printElement(Element elementNode) throws TransformerException{
                TransformerFactory transformerFactory=TransformerFactory.newInstance();
                Transformer transformer=transformerFactory.newTransformer();
                ByteArrayOutputStream output=new ByteArrayOutputStream();                
                transformer.transform(new DOMSource(elementNode), new StreamResult(output));
                System.out.println(new String(output.toByteArray()));
        }
        
        public static NamespaceContext getNamespaceContext(String namespaceMappings){
                final Map<String,String> prefixURIBinding=new HashMap<String, String>();
                final Map<String,List<String>> URIPrefixBinding=new HashMap<String, List<String>>();
                
                
                String [] namespaceMappingPairs=namespaceMappings.split("\\s*\\,\\s*");
                
                for(String nsPair:namespaceMappingPairs){
                        if(nsPair.contains("=")){
                                String [] pairParts=nsPair.split("\\s*\\=\\s*");
                                String prefix=pairParts[0];
                                String URI=pairParts[1];
                                prefixURIBinding.put(prefix, URI);
                                List<String> prefixList=URIPrefixBinding.get(URI);
                                if(prefixList==null){
                                        prefixList=new ArrayList<String>();
                                        URIPrefixBinding.put(URI, prefixList);
                                }
                                prefixList.add(prefix);
                        }
                }
                
                NamespaceContext namespaceContext=new NamespaceContext() {
                        
                        
                        @Override
                        public Iterator getPrefixes(String namespaceURI) {
                                //Look the URI up in the URI-to-prefix map, not in the prefix-to-URI map
                                if(URIPrefixBinding.containsKey(namespaceURI)){
                                        return URIPrefixBinding.get(namespaceURI).iterator();
                                }
                                return null;
                        }
                        
                        @Override
                        public String getPrefix(String namespaceURI) {
                                //Return the first prefix bound to this URI, if any
                                if(URIPrefixBinding.containsKey(namespaceURI)){
                                        return URIPrefixBinding.get(namespaceURI).iterator().next();
                                }
                                return null;
                        }
                        
                        @Override
                        public String getNamespaceURI(String prefix) {
                                return prefixURIBinding.get(prefix);
                        }
                        
                        
                };
                
                
                return namespaceContext;
        }

}



The following shows the arguments required.
java com.blogspot.debasishwebguide.XPathTest <filename> <xpath_expression> [namespaceBindings]

The namespaceBindings argument is optional. It is a comma-separated list of prefix-URI bindings, in which each prefix is separated from its URI by an equals sign. White space is allowed around the commas and the equals signs.

Location Path: The most important syntax in XPath is the location path, although it is not the most general one. A location path is made of one or more steps. Each step selects a set of nodes, and the next step is then applied to each node of that set. A single '/' at the beginning of a location path represents the document node, whose only element child is the root element. If a location path starts with a '/' (and thus from the document node), it is an absolute location path; otherwise it is a relative location path, evaluated from the context node.

The following shows a few examples of location paths.

Location Path         Description
/bookshelf/book       Selects all book elements under the root element bookshelf
/bookshelf/book[1]    Selects the first book element under the root element bookshelf
/bookshelf/*          Selects all child elements of the root element bookshelf
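These paths can be tried out with a few lines of Java. This is only a sketch: the class name LocationPathDemo, the helper names() and the tiny inline document (including its magazine element, added just to make the '*' case visible) are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocationPathDemo {

    // Hypothetical sample; the magazine element only makes the '*' case visible
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title></book>"
            + "<book><title>On the Way</title></book>"
            + "<magazine><title>Weekly</title></magazine>"
            + "</bookshelf>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the names of the nodes selected by the expression,
    // evaluated with the given node as the context node
    static List<String> names(String expression, Node context) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse();
        Node firstBook = doc.getDocumentElement().getFirstChild();
        System.out.println(names("/bookshelf/book", doc));
        System.out.println(names("/bookshelf/book[1]", doc));
        System.out.println(names("/bookshelf/*", doc));
        // Relative path: evaluated from the context node (the first book)
        System.out.println(names("title", firstBook));
        // Absolute path: starts at the document node whatever the context node is
        System.out.println(names("/bookshelf/magazine/title", firstBook));
    }
}
```

The last two calls show the difference between a relative and an absolute location path: the relative path "title" depends on the context node, while the absolute path finds the magazine title even though the context node is the first book.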

So, a location path is a series of steps separated by the '/' character. A step can be written as axis::nodetest[predicate], where the predicate is optional and may be repeated. For example, let us consider the following XML document, saved as WithoutNamespace.xml.

<?xml version="1.0" encoding="UTF-8"?>
<bookshelf>
        <book>
                <title>The Great Adventure</title>
                <author>J.K.Miller</author>
                <pages countingCover="true">360</pages>
                <publisher>B and B</publisher>
        </book>
        <book>
                <title>On the Way</title>
                <author>J.Johnson</author>
                <pages countingCover="true">2135</pages>
        </book>
        <book>
                <title>The Rich, the Poor and Cobweb</title>
                <author>L.A.Narayanan</author>
                <pages countingCover="false">1252</pages>
        </book>
        <book>
                <title>Sing Me a Song</title>
                <author>N.A.Basak</author>
                <pages countingCover="false">230</pages>
                <publisher>A and A</publisher>
        </book>
        <book>
                <title>The Silent Rage</title>
                <author>Thomas B.</author>
                <pages countingCover="false">530</pages>
        </book>
</bookshelf>


If we execute the following XPath expression, it will select all the books that have a publisher.

/child::bookshelf/child::book[child::publisher]

Using our program:
java com.blogspot.debasishwebguide.XPathTest WithoutNamespace.xml "/child::bookshelf/child::book[child::publisher]"
Output:
Evaluating expression: /child::bookshelf/child::book[child::publisher]
<?xml version="1.0" encoding="UTF-8"?><book>
                <title>The Great Adventure</title>
                <author>J.K.Miller</author>
                <pages countingCover="true">360</pages>
                <publisher>B and B</publisher>
        </book>
<?xml version="1.0" encoding="UTF-8"?><book>
                <title>Sing Me a Song</title>
                <author>N.A.Basak</author>
                <pages countingCover="false">230</pages>
                <publisher>A and A</publisher>
        </book>


Axis: In the above example, child is an axis. An axis is a set of nodes defined relative to the context node. For example, the self axis contains only one node, the context node itself. The preceding axis contains all nodes that come before the context node in document order, excluding its ancestors. The following table lists all axes with their descriptions.

Axis                  Description
ancestor              Contains the parent, the parent's parent and so on, up to the root node
ancestor-or-self      Contains the context node, its parent, the parent's parent and so on, up to the root node
attribute             Contains all attributes of the context node. If the context node is not an element, this axis is empty.
child                 Contains all children of the context node
descendant            Contains the children, the children's children and so on
descendant-or-self    Contains the context node, its children, the children's children and so on
following             Contains all nodes that come after the context node in document order (the order in which they appear in the document), excluding its descendants, attribute nodes and namespace nodes
following-sibling     Contains all siblings that come after the context node in document order
namespace             Contains all namespace nodes of the context node. If the context node is not an element, this axis is empty.
parent                Contains the parent of the context node, if it has one
preceding             Contains all nodes that come before the context node in document order, excluding its ancestors, attribute nodes and namespace nodes. It is a reverse axis, so its nodes are counted in reverse document order, as we will see later.
preceding-sibling     Contains all siblings that come before the context node in document order
self                  Contains the context node itself

The child axis is the default axis, so if no axis is specified, the child axis is used. That means the expression /child::bookshelf/child::book[child::publisher] used earlier can also be written as

/bookshelf/book[publisher]
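The following sketch tries a few of the axes from Java. The class name AxisDemo, the helper values() and the inline three-book document (a cut-down version of the sample above, without white space between the tags) are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AxisDemo {

    // Cut-down version of the bookshelf sample, assumed for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title><pages countingCover=\"true\">360</pages>"
            + "<publisher>B and B</publisher></book>"
            + "<book><title>On the Way</title><pages countingCover=\"true\">2135</pages></book>"
            + "<book><title>Sing Me a Song</title><pages countingCover=\"false\">230</pages>"
            + "<publisher>A and A</publisher></book>"
            + "</bookshelf>";

    // Returns the text content of every node selected by the expression
    static List<String> values(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            // Explicit child axis and the default axis select the same nodes
            "/child::bookshelf/child::book[child::publisher]/child::title",
            "/bookshelf/book[publisher]/title",
            // Sibling axes, starting from the second book
            "/bookshelf/book[2]/following-sibling::book/title",
            "/bookshelf/book[2]/preceding-sibling::book/title",
            // Going up with the parent axis and down again
            "/bookshelf/book[2]/title/parent::book/pages",
            // Attribute and descendant axes
            "/bookshelf/book[1]/pages/attribute::countingCover",
            "/bookshelf/book[3]/descendant::*"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + values(expression));
        }
    }
}
```

The first two expressions should print the same two titles, which is the point made above about the default axis.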


Node Test: The name that follows the '::' symbol is actually a node test. Node tests are of two kinds, namely the name test (which we are using here) and the node type test. The name test is simply the name of the node we want to select. Thus, self::book selects the context node if it is an element named book, and attribute::countingCover selects the countingCover attribute of the context node.

Principal Node Type: Every axis has a principal node type. The principal node type of the attribute axis is attribute, and that of the namespace axis is namespace. For all other axes, the principal node type is element. A name test selects the nodes on the axis that are of the principal node type and have the specified name. Note that the name in a name test is a qualified name.

A node type test checks the type of the node rather than its name. You can recognize a node type test by the pair of parentheses '()' at its end. The following node type tests are available.

comment()
text()
processing-instruction()
node()

They are pretty self-explanatory: the first three match nodes of the named type, and node() matches a node of any type.
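Here is a small sketch that runs each node type test against one element. The class name NodeTestDemo, the helper names() and the inline document (with a comment, a processing instruction and some loose text, all made up) are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeTestDemo {

    // Made-up element with four children of four different node types
    static final String XML = "<book><!-- bestseller --><?render bold?>"
            + "<title>On the Way</title>Second edition</book>";

    // Returns the DOM node name of every selected node: "#comment" for comments,
    // "#text" for text, the target for processing instructions, the tag name for elements
    static List<String> names(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            "/book/comment()",
            "/book/text()",
            "/book/processing-instruction()",
            "/book/node()",
            // A name test with '*' only matches the principal node type (element)
            "/book/*"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + names(expression));
        }
    }
}
```

Comparing the last two lines of output shows the difference between node() and '*': node() matches all four children of book, while '*' on the child axis matches only the title element.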

Predicate: The predicate is the part inside square brackets, and it is used to further filter the result of a step. Any expression can go in the predicate. If the result is a node set, it is considered true if the node set is not empty. If the result is a number, it is true if the number equals the proximity position of the node. Any other result is converted to a boolean.
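The different kinds of predicate can be compared side by side with a sketch like the following. The class name PredicateDemo, the helper titles() and the inline three-book document (cut down from the sample above) are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PredicateDemo {

    // Cut-down version of the bookshelf sample, assumed for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title><pages countingCover=\"true\">360</pages>"
            + "<publisher>B and B</publisher></book>"
            + "<book><title>On the Way</title><pages countingCover=\"true\">2135</pages></book>"
            + "<book><title>Sing Me a Song</title><pages countingCover=\"false\">230</pages>"
            + "<publisher>A and A</publisher></book>"
            + "</bookshelf>";

    // Returns the text content of every node selected by the expression
    static List<String> titles(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            // Node-set predicate: true when the book has a publisher child
            "/bookshelf/book[publisher]/title",
            // Boolean predicates: comparisons on a child and on an attribute
            "/bookshelf/book[pages > 500]/title",
            "/bookshelf/book[pages/@countingCover='false']/title",
            // Numeric predicate: compared with the proximity position
            "/bookshelf/book[2]/title",
            // Two predicates in a row: the second book among those with a publisher
            "/bookshelf/book[publisher][2]/title"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + titles(expression));
        }
    }
}
```

The last expression is worth a second look: the position 2 is counted among the books that survived the first predicate, so it should pick "Sing Me a Song", which is the third book in the document.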

Proximity Position: A forward axis is an axis in which all the nodes come after (or are the same as) the context node in document order (for example child, following etc.). On the other hand, a reverse axis is an axis in which all the nodes come before (or are the same as) the context node in document order (for example ancestor, preceding etc.). The proximity position of a node is its position counted in the direction of the axis: document order for a forward axis, reverse document order for a reverse axis. In other words, position 1 is the selected node nearest to the context node. For example, /bookshelf/book[3]/preceding::book[1] selects the second book element.
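A quick way to convince ourselves is to count positions in both directions from the third book. In this sketch the class name ProximityDemo, the helper title() and the inline document (the five titles of the sample above, with everything else left out) are assumptions for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ProximityDemo {

    // The five books of the sample, reduced to their titles for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title></book>"
            + "<book><title>On the Way</title></book>"
            + "<book><title>The Rich, the Poor and Cobweb</title></book>"
            + "<book><title>Sing Me a Song</title></book>"
            + "<book><title>The Silent Rage</title></book>"
            + "</bookshelf>";

    // Returns the string value of the single title selected by the expression
    static String title(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            // Reverse axis: position 1 is the nearest book BEFORE the third book
            "/bookshelf/book[3]/preceding::book[1]/title",
            "/bookshelf/book[3]/preceding::book[2]/title",
            // Forward axis: position 1 is the nearest book AFTER the third book
            "/bookshelf/book[3]/following::book[1]/title",
            "/bookshelf/book[3]/following::book[2]/title"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + title(expression));
        }
    }
}
```

On the preceding axis the positions 1 and 2 should give "On the Way" and "The Great Adventure", that is, the books are counted backwards from the context node.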


Abbreviated Syntax: Some abbreviations are available to shorten expressions. The following table summarizes them.


Abbreviation    Expansion
.               self::node()
..              parent::node()
@               attribute::
*               Selects all nodes of the principal node type
//              /descendant-or-self::node()/

Hence, the expression /.//pages[@countingCover='true'] is the same as /self::node()/descendant-or-self::node()/pages[attribute::countingCover='true'].
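The equivalence can be checked mechanically by evaluating both spellings and comparing the results. This is a sketch only; the class name AbbreviationDemo, the helper values() and the inline three-book document are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AbbreviationDemo {

    // Cut-down version of the bookshelf sample, assumed for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title><pages countingCover=\"true\">360</pages></book>"
            + "<book><title>On the Way</title><pages countingCover=\"true\">2135</pages></book>"
            + "<book><title>Sing Me a Song</title><pages countingCover=\"false\">230</pages></book>"
            + "</bookshelf>";

    // Each pair holds an abbreviated expression and its expanded form
    static final String[][] PAIRS = {
        { "/.//pages[@countingCover='true']",
          "/self::node()/descendant-or-self::node()/child::pages[attribute::countingCover='true']" },
        { "//title/..",
          "/descendant-or-self::node()/child::title/parent::node()" },
        { "/bookshelf/book/*",
          "/child::bookshelf/child::book/child::*" }
    };

    // Returns the text content of every node selected by the expression
    static List<String> values(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String[] pair : PAIRS) {
            boolean same = values(pair[0]).equals(values(pair[1]));
            System.out.println(pair[0] + " -> " + values(pair[0]) + " (same as expanded form: " + same + ")");
        }
    }
}
```

Comparing text content is of course a weaker check than comparing the nodes themselves, but it is enough to see that both spellings select the same things in this small document.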

Functions: The function invocation syntax is very similar to that of C. For example, last() is a function that returns the size of the current node set, which is also the position of its last node, and position() is a function that returns the position of the current node in that node set. For example, /bookshelf/book[3]/preceding::book[position()=last()] will return the first book element (remember the reverse axis?).
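Here is a short sketch of position() and last() at work. The class name FunctionDemo, the helper titles() and the inline document (the five titles of the sample above) are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FunctionDemo {

    // The five books of the sample, reduced to their titles for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title></book>"
            + "<book><title>On the Way</title></book>"
            + "<book><title>The Rich, the Poor and Cobweb</title></book>"
            + "<book><title>Sing Me a Song</title></book>"
            + "<book><title>The Silent Rage</title></book>"
            + "</bookshelf>";

    // Returns the text content of every node selected by the expression
    static List<String> titles(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            // The last book on the shelf
            "/bookshelf/book[position()=last()]/title",
            // The last but one
            "/bookshelf/book[last() - 1]/title",
            // The first two books
            "/bookshelf/book[position() < 3]/title",
            // On a reverse axis, last() is the node farthest from the context node
            "/bookshelf/book[3]/preceding::book[position()=last()]/title"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + titles(expression));
        }
    }
}
```

The numeric predicate [2] seen earlier is just a shorthand for [position()=2], which is why the last expression behaves exactly like the proximity position rule described above.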


Interesting to Note: The expression /bookshelf/book/preceding::book[position()=1] will select all book elements except the last one. Each step is evaluated once for each node in the result of the previous step. Here, /bookshelf/book selects all book nodes. The next step then evaluates preceding::book[position()=1] with respect to each one of them and combines the results. Similarly, the expression //*/*[1] selects all the elements that are the first child element of a parent element. Here //* expands to /descendant-or-self::node()/* and thus selects every element in the document. The next step, /*[1], selects the first child element of each one of them.
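The per-node evaluation is easy to observe from Java. In the sketch below, the class name StepDemo, the helper describe() and the inline five-book document are assumptions for illustration. The second expression, which wraps the path in parentheses before applying [1], is an extra contrast that I have added: there the predicate is applied once to the combined node set, counted in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StepDemo {

    // The five books of the sample, reduced to their titles for this sketch
    static final String XML = "<bookshelf>"
            + "<book><title>The Great Adventure</title></book>"
            + "<book><title>On the Way</title></book>"
            + "<book><title>The Rich, the Poor and Cobweb</title></book>"
            + "<book><title>Sing Me a Song</title></book>"
            + "<book><title>The Silent Rage</title></book>"
            + "</bookshelf>";

    // Describes every selected node as "name=text content"
    static List<String> describe(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeName() + "=" + nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] expressions = {
            // The predicate is applied separately for each book: four results
            "/bookshelf/book/preceding::book[position()=1]",
            // The predicate is applied once to the whole node set: one result
            "(/bookshelf/book/preceding::book)[1]",
            // The first child element of every element
            "//*/*[1]"
        };
        for (String expression : expressions) {
            System.out.println(expression + " -> " + describe(expression));
        }
    }
}
```

With this document, //*/*[1] should give six nodes: the first book (first child of bookshelf) and the title of each of the five books.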

Namespace: As I pointed out earlier, XPath has full support for XML namespaces. Qualified names are written in the same way as in XML (prefix:localname); however, the prefix-to-URI mapping is supplied by the application or specification in which XPath is used. In other words, the prefix is bound to a namespace URI externally, not by the XML document. An unprefixed name, however, always refers to a name in no namespace.
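The external binding can be seen in a small sketch. The class name NamespaceDemo, the namespace URI http://example.com/books, the prefix bk and the inline document are all assumptions made up for this illustration. The document puts its elements in a default namespace and never uses the prefix bk; the prefix exists only in the NamespaceContext given to the XPath object.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceDemo {

    // Hypothetical namespace URI, declared as the default namespace in the document
    static final String BOOKS_NS = "http://example.com/books";
    static final String XML = "<bookshelf xmlns=\"" + BOOKS_NS + "\">"
            + "<book><title>On the Way</title></book></bookshelf>";

    // Binds the prefix "bk" to BOOKS_NS; the document itself never uses this prefix
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "bk".equals(prefix) ? BOOKS_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return BOOKS_NS.equals(namespaceURI) ? "bk" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null ? Collections.<String>emptyList().iterator()
                    : Collections.singletonList(prefix).iterator();
        }
    };

    // Returns the string value of the expression, evaluated with CONTEXT in effect
    static String evaluate(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespace-qualified name tests
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            xPath.setNamespaceContext(CONTEXT);
            return xPath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prefixed names match the elements through the externally bound URI
        System.out.println("[" + evaluate("/bk:bookshelf/bk:book/bk:title") + "]");
        // Unprefixed names mean "no namespace", so nothing is selected here
        System.out.println("[" + evaluate("/bookshelf/book/title") + "]");
    }
}
```

The second expression returns an empty string because, for XPath, the unprefixed name bookshelf is a different name from the bookshelf element in the http://example.com/books namespace.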

Reference: http://www.w3.org/TR/xpath/


If you ask me a question or drop me a comment, I will be happy to help or to improve my blog. (You can put your question in the comment box.)
