VTD-XML: The Future of XML Processing

1. Introduction

As an open and platform-independent data/document encoding format, XML plays an increasingly important role in information exchange and storage. Application developers often find XML easy to learn because it is human readable. At the same time, enterprises are storing large amounts of business data permanently in XML format. Web services, the next generation of middleware technology, also use XML as the wire format for data in order to overcome the interoperability issues associated with prior-generation middleware technologies, e.g. CORBA and DCOM. Although XML is used in many different ways, one thing remains the same: an XML document must be parsed before applications can do anything with it. Inheriting heavily from traditional text-processing techniques, existing XML processors extract tokens from the source document into many discrete string objects. Subsequently, one can build data structures or class hierarchies on top of those tokens.

Nevertheless, we would like to observe that there is another way to achieve the purpose of tokenization: record only the starting offset and length of each token, leaving the token content "as is" in the source document. In other words, we can treat the source document as a large "token bucket" while creating a map detailing the positions of the tokens in that bucket.
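In C, a minimal sketch of such a token record might look like the following (the struct and field names are ours, chosen purely for illustration, and not part of any actual VTD-XML API):

    /* A non-extractive token: no character data is copied.
       The token is fully described by its position in the
       source document. */
    typedef struct {
        int offset;  /* starting offset into the source document */
        int length;  /* number of characters in the token */
    } token_t;

The source document itself (a char* buffer) plus an array of such records is the entire result of tokenization; no per-token string objects are allocated.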

To help illustrate how this "non-extractive" style of tokenization works, we compare it with traditional "extractive" tokenization in some common usage scenarios (a combined C sketch follows the list):

  • String comparison- Under the traditional text-processing framework, one uses some flavor of C's "strcmp" function (declared in <string.h>) to compare an "extractive" token against a known string. In our approach, one simply uses C's "strncmp" function instead, pointing it at the token's offset in the source document.

  • String to numerical data conversion- Other frequently used library functions, such as "atoi" and "atof," can be revised to work with non-extractive tokens. One necessary change is the function signature. For example, "atoi" takes a character string as its input; to make a non-extractive equivalent, one can create a "new-atoi" that accepts three arguments: the source document (of type char*), an offset (of type int), and a length (of type int). The remaining implementation differences mostly deal with the new token representation (e.g. the end of a token is no longer marked by '\0').

  • Trim- Removing the leading and trailing white space of a "non-extractive" token only requires changing the values of offset and length. This is usually simpler than with the extractive style of tokenization, which often involves the creation of new objects.
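The following C sketch puts the three scenarios above together. All of the names (token_t, token_equals, token_atoi, token_trim) are hypothetical illustrations of the idea, not functions from any VTD-XML distribution, and a single-byte character encoding is assumed for simplicity:

    #include <ctype.h>
    #include <string.h>

    typedef struct { int offset; int length; } token_t;

    /* String comparison: strncmp against the token's region of the
       source document, plus a length check. */
    int token_equals(const char *doc, token_t t, const char *s)
    {
        return strlen(s) == (size_t) t.length
            && strncmp(doc + t.offset, s, (size_t) t.length) == 0;
    }

    /* "new-atoi": like atoi, but takes (document, offset, length)
       instead of a NUL-terminated string. */
    int token_atoi(const char *doc, int offset, int length)
    {
        int i = 0, sign = 1, val = 0;
        if (i < length && (doc[offset] == '-' || doc[offset] == '+')) {
            if (doc[offset] == '-') sign = -1;
            i++;
        }
        for (; i < length && isdigit((unsigned char) doc[offset + i]); i++)
            val = val * 10 + (doc[offset + i] - '0');
        return sign * val;
    }

    /* Trim: only offset and length change; no new object is created. */
    token_t token_trim(const char *doc, token_t t)
    {
        while (t.length > 0 && isspace((unsigned char) doc[t.offset])) {
            t.offset++;
            t.length--;
        }
        while (t.length > 0
               && isspace((unsigned char) doc[t.offset + t.length - 1]))
            t.length--;
        return t;
    }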

Overall we feel that, apart from some implementation changes, the difference between traditional tokenization and the proposed "non-extractive" tokenization is largely a change in perspective: they are simply two different ways of describing the same thing, a token.

Going one step further, one can implement various "non-extractive" functions that deal directly with the document's native character encoding. For XML processing, application developers are generally more accustomed to UCS-2 strings in their code, so those functions need to be "smart" enough to understand different character encodings while exporting UCS-2-compatible signatures to calling applications.
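As a sketch of that idea, the hypothetical function below decodes a UTF-8 token (restricted to Basic Multilingual Plane characters) into a caller-supplied UCS-2 buffer on demand; a full implementation would also validate continuation bytes and handle other source encodings:

    /* Decode a UTF-8 token into UCS-2 code units. Returns the number
       of code units written, or -1 on malformed/unsupported input or
       buffer overflow. Hypothetical sketch, BMP characters only. */
    int token_to_ucs2(const unsigned char *doc, int offset, int length,
                      unsigned short *out, int outcap)
    {
        int i = offset, end = offset + length, n = 0;
        while (i < end) {
            unsigned int c = doc[i];
            if (c < 0x80) {                                 /* 1 byte  */
                i += 1;
            } else if ((c & 0xE0) == 0xC0 && i + 1 < end) { /* 2 bytes */
                c = ((c & 0x1F) << 6) | (doc[i + 1] & 0x3F);
                i += 2;
            } else if ((c & 0xF0) == 0xE0 && i + 2 < end) { /* 3 bytes */
                c = ((c & 0x0F) << 12) | ((doc[i + 1] & 0x3F) << 6)
                                       | (doc[i + 2] & 0x3F);
                i += 3;
            } else {
                return -1;  /* malformed, truncated, or beyond the BMP */
            }
            if (n >= outcap)
                return -1;  /* output buffer too small */
            out[n++] = (unsigned short) c;
        }
        return n;
    }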
