Project Homepage of VTD-XML Logo

Sourceforge Home

Mailing Lists





4. A Closer Look

In this section, we analyze some important aspects of the processing model.

4.1 Memory Usage

The fact that the processing model makes extensive use of 64-bit integers has strong implications on its memory usage, which we summarize as follows:

  • Avoid per-object memory overhead- Per-object allocation typically incurs a small amount of memory overhead in many modern object-oriented VM-based languages. For JDK 1.42, our measurement shows an 8-byte overhead associated with every object allocation. For an array, that overhead goes up to 16 bytes. A VTD record or a LC entry is immune to Java's per-object overhead because it is an integer, not an object.

  • Use array whenever possible- We feel that the biggest memory-saving factor is that both VTD records and LC entries are constant in length and can be stored in array-like memory chunks. For example, by allocating a large array for 4096 VTD records, one incurs the per-array overhead of 16 bytes only once across 4096 records, and the per-record overhead is dramatically reduced to almost nothing. Furthermore, by taking advantages of spatial locality of associated records, one avoids the cost of explicitly stitching together objects using pointers.

Our benchmark results indicate that the total size of the internal representation is usually 1/3 ~1/5 of a DOM tree. For document-oriented XML, the multiplying factor is usually around 1.3, i.e. the internal representation is 30% larger than the document (assuming single-byte character encoding). For data oriented XML, the multiplying factor is between 1.5~2. Comparing with DOM, the memory reduction is most significant when XML is "taggy" and DOM's memory consumption is near the high end of its 5x~10x estimate.

4.2 Inherent Persistence

For applications that demand frequent read-only access to XML documents, the processing model provides the option of saving the internal representations on disk after the XML documents are processed for the first time. Afterwards, those applications only needs to load "pre-parsed" form of XML back in memory. In other words, the in-memory representation of the processing model is inherent persistent. This again is due to the fact that both VTD and LC use 64-bit integers, not objects, to represent tokens and hierarchy. When the processing performance is critical, one can potentially create XML and VTD at the same time, and then package them into a single file before sending it across the network. The downside of this approach is that it will add to the network transmission cost, which may not be suitable for bandwidth constrained environment. The upside is that XML-aware network appliances sitting on the data path can directly use the binary portion of the data to process XML (without parsing again). In addition, the "pre-parsed" form contains the XML text, and therefore loses no inherent benefits of XML, e.g. extensibility and human readability.

4.3 XML on a chip

Recently, "XML on a chip" has gathered much attention as many companies are racing to deliver ASIC-based XML processing solutions. A big reason is that the performance of XML parsing is often cited as a major bottleneck for many performance-sensitive applications. Matthias et al. [5] identify XML parsing performance to be a key obstacle to a successful XML deployment, and recommend hardware-assisted parsing engine to be a potential research topic. This is hardly surprising, considering the fact that hardware implementations of processing algorithms free CPU from performing processor-intensive computations and sometimes provide the only ways to guarantee QoS. However, one must take into account design constraints that are specific to XML and custom hardware when thinking of hardware-based XML parsing. In that context, our processing model possesses a few attributes critical to the success of implementing "XML on a chip," which we summarize below:

  • Inherent persistence- For the custom hardware to co-process XML and still interface with the application logic, the system most certainly has to transport "something" from the hardware (e.g. a PCI card) into the main address space in which the programs reside. To make the hardware work seamlessly with the application logic, one of the fundamental assumptions of distributed computing applies: The "something" must be persistent because the PCI bus, though carrying a different name, is physically not different than a set of wires like Ethernet, and therefore is subject to the same set of design constraints. Notice that VTD and LCs satisfy this constraint, as both are inherently persistent. Implementing DOM on a chip, on the other hand, is prone to fail because a DOM tree is not persistent.

  • Constant record/entry length- Optionally, the hardware can produce some serialized form of a DOM tree or SAX events. The problem of this approach is that, to rebuild the DOM tree or regenerate SAX events, the software application at the receiving end will have no choice but to go through the most expensive part of DOM/SAX parsing, i.e., to allocate and garbage-collect objects, all over again, essentially defeating the purpose of hardware-based parsing. The culprit is the variable length of "extractive" tokens. In contrast, because VTD records and LC entries are constant in length and addressable using integers, the persistence of the internal representation is post-binding. In other words, the software application at the receiving end does not incur the cost of allocating a large number of objects, thus reaping maximum benefit of hardware-based parsing.

  • Minimum buffering- In order to obtain maximum performance, a good design should strive to be as stateless as possible. Our processing model, when implemented on chip, features a flow-through, process-then-discard architecture that is quite similar to that of DES (Data Encryption Standard)[11]—an algorithm well known for its high performance hardware implementation. Similar to a DES chip, a VTD chip scans through the entire document one page at a time to record all the relevant information, and no character is accessed more than once. In addition, to avoid the propagation delays associated with accessing off-chip memory elements, the design should keep as many variables/states on chip in fixed-size register files. In that regard, an offset/length based tokenization plays right into the strength of custom hardware.

To sum up, the constant-length token representation is the most important attribute of the processing model. We can also look at this problem from the opposite direction. If tokens are not constant in length, then they are not addressable using integers, and the processing model can no longer bulk-allocate memory buffers to store tokens. Consequently, one probably has no choice but to allocate lots of objects to represent tokens and hierarchy, making per-object memory/processing overhead all but inevitable. And in so doing, applications won't be able to reap significant performance benefit from custom hardware.

VTD in 30 seconds

VTD+XML Format

User's Guide

Developer's Guide

VTD: A Technical Perspective

  0. Abstract

  1. Introduction

  2. A Processing Model Based on VTD

  3. Navigate XML

  4. A Closer Look

  5. Conclusion


Code Samples


Articles and Presentations