VTD-XML technical

VTD-XML: The Future of XML Processing

0. Abstract

Most XML processing algorithms begin by extracting token content from the source document into many discrete string objects. We propose a "non-extractive" tokenization approach that keeps the source document intact in memory. Using a binary encoding specification called Virtual Token Descriptor (VTD), the processing model represents each token solely by its starting offset and length. To create a hierarchical view of the data encapsulated in XML, the parser further indexes elements of the same depth using directory-like structures that we call location caches. By demonstrating navigation of the document hierarchy with VTD and location caches, we show that it is possible to create a cursor-based API that retains most of DOM's random-access capabilities at a fraction of its memory usage. Furthermore, by analyzing key design constraints of custom hardware, we reason that the memory-conserving characteristics of the processing model make both "XML on a chip" and "binary-enhanced XML" possible. Benchmark results show that the reference implementation of our processing model significantly outperforms Xerces DOM in both memory usage and processing performance.
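The core idea of non-extractive tokenization can be illustrated with a minimal sketch. It is not the VTD specification or the reference implementation: the 64-bit layout (32 bits of offset, 32 bits of length), the toy regex scanner, and all function names below are assumptions made only for this example. The point it shows is that tokens are stored as integers describing where they sit in the untouched source buffer, and token text is materialized only on demand.

```python
import re
from typing import List, Tuple

# Hypothetical record layout for this sketch only (NOT the real VTD format):
# high 32 bits = starting offset, low 32 bits = token length.
LENGTH_BITS = 32
LENGTH_MASK = (1 << LENGTH_BITS) - 1


def pack(offset: int, length: int) -> int:
    """Encode a token as a single integer instead of a string object."""
    return (offset << LENGTH_BITS) | length


def unpack(record: int) -> Tuple[int, int]:
    """Recover (offset, length) from a packed record."""
    return record >> LENGTH_BITS, record & LENGTH_MASK


def index_start_tag_names(doc: bytes) -> List[int]:
    """Record every start-tag name as a packed (offset, length) integer.

    This is a toy scanner, not a conforming XML parser.
    """
    return [
        pack(m.start(1), m.end(1) - m.start(1))
        for m in re.finditer(rb"<([A-Za-z_][\w.-]*)", doc)
    ]


def token_bytes(doc: bytes, record: int) -> bytes:
    """Materialize token text on demand by slicing the intact source."""
    offset, length = unpack(record)
    return doc[offset:offset + length]


doc = b"<order><item>pen</item><item>ink</item></order>"
records = index_start_tag_names(doc)
names = [token_bytes(doc, r) for r in records]
print(names)
```

Because each record is a fixed-size integer, the whole index can live in a flat array next to the source bytes, which is the property the abstract credits for the memory savings over per-token string objects.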

0. Abstract