A simplified version:
VTD-XML introduction and API overview (PowerPoint,
VTD-XML tutorials are available in
Basic concept of VTD
For XML files that don't declare entities
in a Document Type Declaration (e.g. SOAP),
tokenization can be done by recording only the starting offset and length of
each token. To make this work, one also needs to keep the XML in memory, intact and
un-decoded. This has led to the design of a binary encoding specification we
call Virtual Token Descriptor (VTD). VTD records are 64-bit integers that
encode the starting offsets, lengths, token types and nesting depths of tokens in the XML document.
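The idea of describing a token by offset and length alone can be sketched in a few lines of Python. This is only an illustration of the concept, not VTD-XML code: the helper names token and to_string are hypothetical, and the sample document is made up.

```python
# The raw document stays in memory exactly as received (intact, un-decoded).
xml = b"<soap:Envelope><soap:Body>hello</soap:Body></soap:Envelope>"

def token(buf, text, start=0):
    """Return (offset, length) of the first occurrence of `text` at or after `start`.

    Hypothetical helper: a real parser would find token boundaries by scanning
    the markup, but the result has the same shape, a pair of integers.
    """
    offset = buf.index(text, start)
    return offset, len(text)

def to_string(buf, tok, encoding="utf-8"):
    """Decode a token only when its text is actually needed."""
    offset, length = tok
    return buf[offset:offset + length].decode(encoding)

name = token(xml, b"soap:Envelope")  # element name of the root start tag
body = token(xml, b"hello")          # character data inside soap:Body
print(name, body, to_string(xml, body))
```

No string object is created per token during tokenization; a string is built only when to_string is called.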
Specific to our current implementation,
a VTD record is a 64-bit integer in network byte order (big endian) with the
following bit-level layout:
Starting offset--30 bits (b29 ~ b0)
Length--20 bits (b51 ~ b32)
Nesting depth--8 bits (b59 ~ b52) -- Maximum value is 2^8-2 = 254
Token type--4 bits (b63 ~ b60): more details on page 6.
Reserved--2 bits (b31 ~ b30), reserved for a tri-state variable marking namespaces.
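As a rough sketch of how such a record could be packed and unpacked, the following Python uses the bit positions given above: offset in b29 ~ b0, the reserved namespace bits in b31 ~ b30, length in b51 ~ b32, depth in b59 ~ b52 and token type in b63 ~ b60. The function names and the numeric token type used in the usage lines are assumptions made for illustration; they are not taken from the VTD-XML API.

```python
# Bit positions taken from the layout described in the text.
OFFSET_SHIFT, NS_SHIFT, LENGTH_SHIFT, DEPTH_SHIFT, TYPE_SHIFT = 0, 30, 32, 52, 60

OFFSET_MAX = (1 << 30) - 1  # 30-bit starting offset
NS_MAX = (1 << 2) - 1       # 2 reserved bits (namespace marker)
LENGTH_MAX = (1 << 20) - 1  # 20-bit length
DEPTH_MAX = (1 << 8) - 2    # 254, the maximum depth stated in the text
TYPE_MAX = (1 << 4) - 1     # 4-bit token type

def pack_vtd(token_type, depth, length, offset, ns=0):
    """Pack the fields into one 64-bit integer, rejecting overflowing values."""
    fields = (
        ("token_type", token_type, TYPE_MAX),
        ("depth", depth, DEPTH_MAX),
        ("length", length, LENGTH_MAX),
        ("ns", ns, NS_MAX),
        ("offset", offset, OFFSET_MAX),
    )
    for name, value, limit in fields:
        if not 0 <= value <= limit:
            raise ValueError("%s out of range: %r" % (name, value))
    return (token_type << TYPE_SHIFT | depth << DEPTH_SHIFT
            | length << LENGTH_SHIFT | ns << NS_SHIFT | offset << OFFSET_SHIFT)

def unpack_vtd(record):
    """Split a 64-bit record back into its fields."""
    return {
        "token_type": (record >> TYPE_SHIFT) & TYPE_MAX,
        "depth": (record >> DEPTH_SHIFT) & 0xFF,
        "length": (record >> LENGTH_SHIFT) & LENGTH_MAX,
        "ns": (record >> NS_SHIFT) & NS_MAX,
        "offset": (record >> OFFSET_SHIFT) & OFFSET_MAX,
    }

# Usage: token type 5 is an arbitrary, hypothetical code.
record = pack_vtd(token_type=5, depth=2, length=13, offset=1)
wire = record.to_bytes(8, "big")  # network byte order, 8 bytes
print(unpack_vtd(int.from_bytes(wire, "big")))
```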
Because the processing model internally stores XML un-decoded, offset and length are
measured in raw characters of the encoding format. For UTF-8 and ISO-8859, length
and offset are in bytes. For UTF-16, they are in 16-bit words.
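To make the units concrete, here is a small Python sketch that slices the same token out of a UTF-8 buffer and a UTF-16 buffer. The UNIT_SIZE table and the token_bytes helper are hypothetical names introduced for this example.

```python
# Size in bytes of one "raw character" unit for each encoding.
UNIT_SIZE = {"utf-8": 1, "iso-8859-1": 1, "utf-16-be": 2, "utf-16-le": 2}

def token_bytes(buf, offset, length, encoding):
    """Return the raw bytes of a token whose offset/length are in encoding units."""
    unit = UNIT_SIZE[encoding]
    return buf[offset * unit:(offset + length) * unit]

text = "<a>h\u00e9llo</a>"  # the text node contains one non-ASCII character

# UTF-8: units are bytes, so the accented character counts as 2 (length 6).
utf8 = text.encode("utf-8")
print(token_bytes(utf8, 3, 6, "utf-8").decode("utf-8"))

# UTF-16: units are 16-bit words, so the same token has length 5.
utf16 = text.encode("utf-16-be")
print(token_bytes(utf16, 3, 5, "utf-16-be").decode("utf-16-be"))
```

The same token therefore has a different length value depending on the document encoding, even though the decoded text is identical.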
Pros and Cons of VTD
Because VTD records are constant in length, XML processing based on VTD can potentially
have the following benefits:
(1) Because VTD records are not objects, they are not subject to per-object
memory overhead. (2) VTD storage can be bulk-allocated (i.e., using large
memory blocks): when allocating a large memory block to store 1024 VTD
records, one incurs the per-array memory overhead only once, essentially
reducing the per-record overhead to almost nothing.
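The bulk-allocation idea can be sketched with Python's array module, which stores fixed-width integers contiguously instead of as separate objects. The VtdBuffer class below is a hypothetical illustration of block-wise growth in units of 1024 records; it is not how VTD-XML's own buffers are implemented.

```python
from array import array

class VtdBuffer:
    """Append-only store of 64-bit records, grown one fixed-size block at a time."""

    def __init__(self, block_size=1024):
        self.block_size = block_size
        self.blocks = []  # each block is one contiguous array of unsigned 64-bit ints
        self.size = 0

    def append(self, record):
        # Allocate a whole new block only when every existing block is full.
        if self.size == len(self.blocks) * self.block_size:
            self.blocks.append(array("Q", [0]) * self.block_size)
        self.blocks[self.size // self.block_size][self.size % self.block_size] = record
        self.size += 1

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self.blocks[index // self.block_size][index % self.block_size]

    def __len__(self):
        return self.size

buf = VtdBuffer()
for i in range(2500):
    buf.append(i)
print(len(buf), len(buf.blocks), buf[2499])
```

Storing 2500 records here costs three block allocations rather than 2500 object allocations.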
(1) Using VTD we attempt to achieve high performance in parsing, which is a
by-product of VTD's memory-conserving features: using less memory means
less memory has to be allocated. (2) Large memory blocks are faster to
allocate and garbage-collect than many discrete objects. Please keep in mind that we
are in the early stage of this technology, so further improvements in
performance and usability should be expected.
VTD records can be persisted on disk or transmitted along XML to
improve XML processing throughput.
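Because the records are plain fixed-width integers, writing them out and reading them back needs no per-record parsing. The Python sketch below assumes a made-up container layout (a 4-byte record count followed by the records in big-endian order); it is not the index file format used by VTD-XML, and the function names are hypothetical.

```python
import io
import struct

def write_records(stream, records):
    """Write a 4-byte count, then each record as a big-endian 64-bit integer."""
    stream.write(struct.pack(">I", len(records)))
    stream.write(struct.pack(">%dQ" % len(records), *records))

def read_records(stream):
    """Read back the records written by write_records."""
    (count,) = struct.unpack(">I", stream.read(4))
    return list(struct.unpack(">%dQ" % count, stream.read(8 * count)))

records = [0x5020000D00000001, 0x0000000500000000, (1 << 64) - 1]
stream = io.BytesIO()          # stands in for a file or a network connection
write_records(stream, records)
stream.seek(0)
restored = read_records(stream)
print(restored == records)
```

A receiver that gets both the XML bytes and these records can skip tokenization altogether, which is where the throughput gain would come from.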
This feature is well explained in a recent article
Cut, Paste, Split
At the same time, one needs to be aware
of some of the limitations of VTD:
Upper limits of various
fields: (1) Length: for starting tags (the max qualified name length is 2048; the
max prefix length is 512), overflow conditions result in parse exceptions. For other
tokens (upper limit is 1M), one can potentially break a long token into
multiple shorter ones. (2) Depth: a depth field overflow results in a parse
exception. (3) Starting offset: currently the biggest document supported
is 1G characters (1G bytes or 2G bytes, depending on the actual document encoding).
Limit of bit-level
layout: One may need to rearrange the bit-level layout to
meet actual processing requirements.
VTD token length limit:
Currently a VTD record is 64 bits in length. One can add another 32 bits if 64 bits are not enough.
The current implementation only supports the built-in entities (e.g. &amp; &gt; &lt;).
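The suggestion above of breaking a long token into multiple shorter ones can be sketched as follows. The sketch assumes the largest storable length is 2^20 - 1, i.e. the biggest value a 20-bit field can hold, which is slightly below the rounded "1M" quoted in the text; split_token is a hypothetical helper, not part of VTD-XML.

```python
MAX_TOKEN_LENGTH = (1 << 20) - 1  # largest value that fits in a 20-bit length field

def split_token(offset, length, max_len=MAX_TOKEN_LENGTH):
    """Split one (offset, length) token into consecutive pieces of at most max_len."""
    pieces = []
    while length > max_len:
        pieces.append((offset, max_len))
        offset += max_len
        length -= max_len
    pieces.append((offset, length))
    return pieces

short = split_token(10, 5)                        # already fits: one piece
long_ = split_token(0, 2 * MAX_TOKEN_LENGTH + 3)  # needs three pieces
print(short, len(long_))
```

The pieces are contiguous, so the original token can be recovered by concatenating them in order. This only applies to tokens such as text content; for starting tags the text states that an overflow results in a parse exception instead.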