Saturday, November 6, 2010

PDOM - A new XML Parser

This article introduces a new open source XML parser named the “Partial DOM Parser” (PDOM for short). Before delving into the specifics of its implementation, let’s take a brief look at the parser choices currently available:
  1. SAX – Event-driven stream parser that calls back a handler API as it reads each element.
  2. DOM – Loads the entire XML document as a tree structure and lets you traverse that tree.
  3. StAX – Cursor-oriented pull parser.
  4. JDOM – Uses a SAX parser internally and loads the entire XML document as a tree structure, with a user-friendly API to traverse the tree and to add or remove elements.
Shortcomings of SAX/DOM:
SAX doesn’t load the entire XML document into memory and hence saves memory. In practice, however, it is hard to program against the SAX API, and maintaining the client program becomes even harder as the structure of the input XML changes.

DOM, on the other hand, consumes more memory because it loads the entire XML document. JDOM is no different.
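
To give a feel for why SAX code gets hard to maintain, here is a minimal sketch using only the JDK's standard SAX classes. It pulls the titles of fiction books out of a small, made-up book catalogue. The class name SaxFictionTitles and the catalogue layout are my own assumptions for illustration. Note how the handler has to track "where am I in the document" with hand-written flags, and every change to the XML structure means revisiting those flags.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of PDOM.
class SaxFictionTitles {

    /** Collects the text of every <title> found inside <books type="fiction">. */
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private boolean inFictionBooks;  // position state tracked by hand
        private boolean inTitle;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("books".equals(qName)) {
                inFictionBooks = "fiction".equals(atts.getValue("type"));
            } else if ("title".equals(qName) && inFictionBooks) {
                inTitle = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTitle) {
                // SAX may deliver the text of one element in several chunks.
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName) && inTitle) {
                titles.add(text.toString().trim());
                inTitle = false;
            } else if ("books".equals(qName)) {
                inFictionBooks = false;
            }
        }
    }

    /** Parses the given XML text and returns the fiction titles in document order. */
    static List<String> fictionTitles(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<details>"
                + "<books type=\"fiction\"><book><title>Deception Point</title></book></books>"
                + "<books type=\"classic\"><book><title>Sons of Fortune</title></book></books>"
                + "</details>";
        System.out.println(fictionTitles(xml));
    }
}
```

Three callbacks and two flags for one simple query; asking for the author or price as well would mean more flags and more branches.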

What is PDOM?
What if we had a parser with the advantage of SAX (it doesn’t load the entire XML document) and also the advantage of DOM (ease of traversing back and forth)?
Enter the PDOM Parser, which positions itself somewhere between SAX and DOM. It uses SAX internally but loads, as a tree structure, only that cross section of the XML which the client program needs; hence the name “Partial DOM”.
So, if you have a huge XML file (say 50 MB) and you need to traverse only a small section of it, the PDOM Parser is the way to go.
              
How to use PDOM?
Before using the PDOM API, you should know which cross section of the XML you need and represent it as an XPath expression. The PDOM API accepts one or more XPath expressions as input and loads the tree structure that each expression represents. This tree structure is exposed through the JDOM API (as JDOM is user friendly and widely accepted).

So, technically, all you need is an XPath representation of the tree structure you want; you then invoke the PDOM API. You get back a list of JDOM elements for every XPath expression that you supply.
Illustration of PDOM usage:
Suppose we need to parse an XML file containing details about books, as below:

<details>
<books type="fiction">
<book>
<title>Deception Point</title>
<author>Dan Brown</author>
<price>50</price>
</book>
</books>
<books type="classic">
<book>
<title>Sons of Fortune</title>
<author>Jeffrey Archer</author>
<price>30</price>
</book>
</books>
</details>

Suppose we need only the books of type fiction.
The XPath representation for this is: //details/books[@type="fiction"]
Pseudo code to invoke the PDOM Parser:
// create a File object/InputStream representing the input XML file
// create a String representing the XPath expression and add it to a list
// invoke the PDOM parser API
For example: Map resultMap = parser.parse(file, xpathlist)
Inputs for the PDOM Parser API:
  1. File/InputStream representing the XML (other input variants exist too; refer to the javadoc)
  2. List of XPath expressions – One or more XPath expressions can be provided. If even one XPath expression is invalid, the parser exits by throwing an exception.
Output returned by the PDOM Parser:
A Map with each input XPath expression as a key and a list of JDOM elements (matching that XPath) as the value.
For example, if the XPath expression list contained the following expressions:
//details/books[@type="fiction"]
//details/books[@type="classic"]
The map would contain:
Key                                 Value
---------------------------------------------------------------------------
//details/books[@type="fiction"]    List of JDOM elements (fiction books)
//details/books[@type="classic"]    List of JDOM elements (classic books)
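
To make the shape of that result concrete, here is a rough sketch that builds the same kind of map with the JDK's standard DOM and XPath classes instead of PDOM. It is not the PDOM API: the class XPathResultMapDemo and the method evaluateAll are names I made up, the values are org.w3c.dom elements rather than JDOM elements, and, unlike PDOM, this version loads the whole document into memory first. It only illustrates the "XPath expression in, list of elements out" contract.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; uses the standard JDK XPath engine, not PDOM.
class XPathResultMapDemo {

    /** Returns a map from each XPath expression to the elements it selects. */
    static Map<String, List<Element>> evaluateAll(String xml, List<String> xpaths) {
        try {
            // Full DOM load: this is exactly the cost PDOM is designed to avoid.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, List<Element>> result = new LinkedHashMap<>();
            for (String expression : xpaths) {
                NodeList nodes =
                        (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
                List<Element> elements = new ArrayList<>();
                for (int i = 0; i < nodes.getLength(); i++) {
                    // The expressions used here select elements only.
                    elements.add((Element) nodes.item(i));
                }
                result.put(expression, elements);
            }
            return result;
        } catch (Exception e) {
            // Mirrors the "fail on any invalid expression" behaviour described above.
            throw new IllegalArgumentException("Invalid XML or XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<details>"
                + "<books type=\"fiction\"><book><title>Deception Point</title>"
                + "<author>Dan Brown</author><price>50</price></book></books>"
                + "<books type=\"classic\"><book><title>Sons of Fortune</title>"
                + "<author>Jeffrey Archer</author><price>30</price></book></books>"
                + "</details>";
        List<String> xpaths = Arrays.asList(
                "//details/books[@type=\"fiction\"]",
                "//details/books[@type=\"classic\"]");

        for (Map.Entry<String, List<Element>> entry : evaluateAll(xml, xpaths).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " element(s)");
        }
    }
}
```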
How PDOM works:
PDOM comes with a built-in XPath compiler, which compiles and interprets the input XPath expressions. Once all the XPath expressions have been compiled and interpreted, the SAX parser is invoked.
There are 3 salient blocks in the parsing process.
Refer to the diagram below for a pictorial illustration.



1. Locator:
Locates the elements in the XML that the XPath expression represents. Once an element is successfully located, it is handed over to the Collector.

2. Collector:
Collects the located XML element into a JDOM element object.

3. Filter:
The Filter is invoked at the end of parsing. It verifies whether the collected element completely satisfies the XPath expression. (The XPath expression might combine conditions with AND/OR etc. The entire expression is evaluated, and the element is returned only if the evaluation succeeds.)
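
The three blocks can be sketched roughly as follows. This is not PDOM's actual source, only a simplified illustration under several assumptions of mine: the "compiled XPath" is reduced to a plain list of element names plus a single attribute test, the collected subtree is built from org.w3c.dom elements rather than JDOM (to keep the sketch runnable on a bare JDK), and the class name PartialTreeSketch and its method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a locate/collect/filter pipeline on top of SAX.
class PartialTreeSketch extends DefaultHandler {
    private final List<String> targetPath;                  // e.g. [details, books]
    private final Deque<String> path = new ArrayDeque<>();  // current position, root first
    private final Deque<Element> open = new ArrayDeque<>(); // subtree under construction
    private final List<Element> collected = new ArrayList<>();
    private final Document factory;                         // only used to create nodes

    PartialTreeSketch(List<String> targetPath, Document factory) {
        this.targetPath = targetPath;
        this.factory = factory;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        // Locator: start collecting when the current path equals the target path.
        boolean located = open.isEmpty() && new ArrayList<>(path).equals(targetPath);
        if (located || !open.isEmpty()) {
            // Collector: copy the element and its attributes into the partial tree.
            Element element = factory.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            if (!open.isEmpty()) {
                open.peek().appendChild(element);
            }
            open.push(element);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().appendChild(factory.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.removeLast();
        if (!open.isEmpty()) {
            Element finished = open.pop();
            if (open.isEmpty()) {
                collected.add(finished); // a complete located subtree
            }
        }
    }

    /** Parses xml, collects subtrees at targetPath, then keeps those whose attribute matches. */
    static List<Element> parse(String xml, List<String> targetPath, String attr, String value) {
        try {
            Document factory =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            PartialTreeSketch handler = new PartialTreeSketch(targetPath, factory);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            // Filter: runs after parsing and evaluates the predicate on each candidate.
            List<Element> kept = new ArrayList<>();
            for (Element candidate : handler.collected) {
                if (value.equals(candidate.getAttribute(attr))) {
                    kept.add(candidate);
                }
            }
            return kept;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With the books XML from the illustration above, PartialTreeSketch.parse(xml, Arrays.asList("details", "books"), "type", "fiction") would return only the fiction books subtree. Anything outside the located path is never turned into tree nodes, which is where the memory saving comes from.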

Implementation Specifics:
  1. Supports Java 1.4 and above
  2. Built-in XPath compiler
  3. Built-in expression evaluator
  4. Custom Stack/Tree collection implementations

For source code and other documentation, refer to http://code.google.com/p/pdomparser/

I hope this small open source contribution helps in developing better applications.