VTD-XML: The Future of XML Processing

SourceForge.net Logo

Sourceforge Home

Mailing Lists





0. Abstract

As the first step of most XML processing algorithms, one usually extracts token content out of the source document into many discrete string objects. We propose a "non-extractive" tokenization approach that maintains the source document intact in memory. Using a binary encoding specification called Virtual Token Descriptor (VTD), the processing model represents tokens exclusively using starting offset and length. To create a hierarchical view of the data encapsulated in XML, the parser further indexes elements of same depths using directory-like structures we call location cache. Through a demonstration of navigating the document hierarchy using VTD and location caches, we show that it is indeed possible to create a cursor-based API that retains most of DOM's random-access capabilities at a fraction of its memory usage. Furthermore, by analyzing key design constraints of custom hardware, we reason that the memory conserving characteristics of the processing model simultaneously make possible "XML on a chip" and "binary-enhanced XML." The benchmark results show that the reference implementation of our processing model significantly outperforms Xerces DOM in terms of both memory and processing performance.

VTD in 30 seconds

VTD+XML Format

User's Guide

Developer's Guide

VTD: A Technical Perspective

  0. Abstract

  1. Introduction

  2. A Processing Model Based on VTD

  3. Navigate XML

  4. A Closer Look

  5. Conclusion


Code Samples


Getting Involved

Articles and Presentations