VTD-XML: The Future of XML Processing

Users' Guide


 A simplified version: VTD-XML introduction and API overview (PowerPoint, PDF)

(Separate code-only VTD-XML tutorials are available in C, Java, and C#.)

0. Introduction

Basic concept of VTD

For XML documents that do not declare entities in a Document Type Declaration (e.g., SOAP messages), tokenization can be done by recording only the starting offset and length of each token. For this to work, the XML document must also be kept in memory intact and un-decoded. This led to the design of a binary encoding specification that we call the Virtual Token Descriptor (VTD). VTD records are 64-bit integers that encode the starting offsets, lengths, token types, and nesting depths of the tokens in the XML document.
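The offset-and-length idea can be illustrated with a minimal Java sketch. The class and method names here (OffsetLengthToken, tokenText) are hypothetical and are not part of the VTD-XML API; the sketch only assumes a UTF-8 document held un-decoded in a byte array, and decodes a token's text on demand.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not part of the VTD-XML API. */
final class OffsetLengthToken {

    /**
     * Decodes one token on demand from the original, un-decoded document bytes.
     * The token itself is fully described by (offset, length); no copy is kept.
     */
    static String tokenText(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<root><item>hello</item></root>".getBytes(StandardCharsets.UTF_8);

        // The element name "item" starts at byte 7 and is 4 bytes long.
        System.out.println(tokenText(doc, 7, 4));

        // The text "hello" starts at byte 12 and is 5 bytes long.
        System.out.println(tokenText(doc, 12, 5));
    }
}
```

Because a token is just a pair of integers pointing into the document, the document bytes must stay unchanged for as long as the tokens are in use.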

Specific to our current implementation, a VTD record is a 64-bit integer in network byte order (big endian) with the following bit-level layout:

  • Starting offset--30 bits (b29 ~ b0)

  • Length--20 bits (b51 ~ b32)

    • For some token types
      * Prefix length: 9 bits (b51 ~ b43)
      * Qname length: 11 bits (b42 ~ b32)

  • Nesting Depth--8 bits (b59~b52) -- Maximum value is 2^8-2 = 254

  • Token type--4 bits (b63 ~ b60): for more details, see Section 6, Table for Token Types.

  • Reserved bits--2 bits (b31 ~ b30) are reserved for a tri-state variable marking namespaces.

  • Unit--Because the processing model internally stores XML un-decoded, offsets and lengths are measured in raw characters of the document's encoding: in bytes for UTF-8 and ISO-8859, and in 16-bit words for UTF-16.
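The bit-level layout above can be sketched in Java as a set of shift-and-mask helpers. This is a minimal illustration under stated assumptions, not the library's actual code: the class name VtdRecordLayout and all of its methods are hypothetical, the numeric token type used in main is arbitrary, and the two reserved bits (b31 ~ b30) are simply left at zero.

```java
import java.nio.ByteBuffer;

/** Hypothetical helpers mirroring the bit layout described in the text. */
final class VtdRecordLayout {
    static final long OFFSET_MASK = (1L << 30) - 1; // b29 ~ b0
    static final long LENGTH_MASK = (1L << 20) - 1; // b51 ~ b32
    static final long DEPTH_MASK  = (1L << 8) - 1;  // b59 ~ b52
    static final long TYPE_MASK   = (1L << 4) - 1;  // b63 ~ b60

    /** Packs the four fields into one 64-bit record; reserved bits stay 0. */
    static long pack(int type, int depth, int length, int offset) {
        return ((type & TYPE_MASK) << 60)
             | ((depth & DEPTH_MASK) << 52)
             | ((length & LENGTH_MASK) << 32)
             | (offset & OFFSET_MASK);
    }

    static int offset(long r)       { return (int) (r & OFFSET_MASK); }
    static int length(long r)       { return (int) ((r >>> 32) & LENGTH_MASK); }
    static int depth(long r)        { return (int) ((r >>> 52) & DEPTH_MASK); }
    static int type(long r)         { return (int) (r >>> 60); }

    // For token types that split the length field into prefix and Qname parts.
    static int prefixLength(long r) { return (int) ((r >>> 43) & 0x1FF); } // b51 ~ b43
    static int qnameLength(long r)  { return (int) ((r >>> 32) & 0x7FF); } // b42 ~ b32

    /** Serializes a record in network byte order (ByteBuffer defaults to big endian). */
    static byte[] toNetworkBytes(long r) {
        return ByteBuffer.allocate(8).putLong(r).array();
    }

    public static void main(String[] args) {
        // Token type 5 is an arbitrary value chosen for this example.
        long r = pack(5, 2, 4, 7);
        System.out.println("type=" + type(r) + " depth=" + depth(r)
                + " length=" + length(r) + " offset=" + offset(r));
    }
}
```

Keeping every field in a single long is what allows records to be stored in plain long arrays rather than as individual objects.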

Pros and Cons of VTD

Because VTD records are constant in length, XML processing based on VTD can potentially have the following benefits:

  • Conserving memory: (1) Because VTD records are not objects, they are not subject to per-object memory overhead. (2) VTD storage can be bulk-allocated (i.e., in large memory blocks): allocating one large block to store 1024 VTD records incurs the per-array memory overhead only once, reducing the per-record overhead to almost nothing.

  • High performance: (1) VTD aims for high parsing performance, which is a by-product of its memory-conserving features: using less memory means allocating less memory. (2) Large memory blocks are faster to allocate and garbage-collect than many discrete objects. Please keep in mind that this technology is at an early stage, so further improvements in performance and usability should be expected.

  • Inherent Persistence: VTD records can be persisted on disk or transmitted along XML to improve XML processing throughput.

  • Incremental Update: This feature is explained in detail in a recent article.

  • Cut, Paste, Split and Assemble
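The memory and persistence points above can be sketched together in Java. The following is an illustration under assumptions, not the library's own buffer class: LongBlockBuffer is a hypothetical name, the 1024-record block size is taken from the example in the text, and the on-disk form (a record count followed by the records) is invented for this sketch and is not the VTD+XML format. DataOutputStream writes each long high byte first, which matches the network byte order mentioned earlier.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical block-allocated store for 64-bit records. */
final class LongBlockBuffer {
    private static final int BLOCK_SIZE = 1024; // records per block, as in the text's example
    private final List<long[]> blocks = new ArrayList<>();
    private int size;

    /** Appends a record, allocating a whole new block only when the last one is full. */
    void append(long record) {
        if (size == blocks.size() * BLOCK_SIZE) {
            blocks.add(new long[BLOCK_SIZE]); // one array header per 1024 records
        }
        blocks.get(size / BLOCK_SIZE)[size % BLOCK_SIZE] = record;
        size++;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return blocks.get(index / BLOCK_SIZE)[index % BLOCK_SIZE];
    }

    int size() {
        return size;
    }

    /** Writes the record count, then each record in big-endian order (assumed format). */
    void writeTo(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(size);
        for (int i = 0; i < size; i++) {
            data.writeLong(get(i));
        }
        data.flush();
    }

    /** Reads back a buffer written by writeTo. */
    static LongBlockBuffer readFrom(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int count = data.readInt();
        LongBlockBuffer buffer = new LongBlockBuffer();
        for (int i = 0; i < count; i++) {
            buffer.append(data.readLong());
        }
        return buffer;
    }
}
```

Since the records contain only offsets, lengths, and small integer fields, reloading them requires no re-parsing, provided the XML bytes they point into are unchanged.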

At the same time, one needs to be aware of some of the limitations of VTD:

  • Upper limits of various fields: (1) For starting tags, the Qname length must be below 2048 (11 bits) and the prefix length below 512 (9 bits); overflow conditions result in parse exceptions. For other tokens, the length must be below 1M (20 bits), and a longer token can potentially be broken into multiple shorter ones. (2) A depth field overflow results in a parse exception. (3) Starting offset: currently the largest supported document is 1G characters (1G bytes or 2G bytes, depending on the document encoding).

  • Limit of bit-level layout: One may need to rearrange the bit-level layout to meet actual processing requirements.

  • VTD token length limit: Currently a VTD record is 64 bits long. One can add another 32 bits if 64 bits are not enough.

  • Entity support: The current implementation supports only the built-in entities: &amp; &gt; &lt; &apos; &quot;
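The field limits above follow directly from the bit widths, as the following Java sketch shows. All names (VtdFieldLimits, maxValue, maxDocumentBytes, split) are hypothetical, and the splitting routine is only one possible reading of "break a long token into multiple shorter ones"; it is not taken from the VTD-XML source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers that derive the limits from the field widths. */
final class VtdFieldLimits {

    /** Largest value that fits in an unsigned field of the given width. */
    static long maxValue(int bits) {
        return (1L << bits) - 1;
    }

    /** Largest document size in bytes: 2^30 units of the given size (1 for UTF-8, 2 for UTF-16). */
    static long maxDocumentBytes(int bytesPerUnit) {
        return (1L << 30) * bytesPerUnit;
    }

    /**
     * Splits a long token into {offset, length} pieces that each fit the 20-bit
     * length field. Assumption: a real implementation would also have to avoid
     * cutting in the middle of a multi-byte character.
     */
    static List<int[]> split(int offset, int length) {
        int max = (int) maxValue(20);
        List<int[]> parts = new ArrayList<>();
        while (length > 0) {
            int n = Math.min(length, max);
            parts.add(new int[] {offset, n});
            offset += n;
            length -= n;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println("max Qname length:  " + maxValue(11));
        System.out.println("max prefix length: " + maxValue(9));
        System.out.println("max token length:  " + maxValue(20));
        System.out.println("pieces for a 2,500,000-unit token: " + split(0, 2_500_000).size());
    }
}
```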
