1. Introduction
As an open and
platform-independent data/document encoding format, XML is playing an
increasingly important role in information exchange and storage.
Application developers often find XML easy to learn because it is human-readable. At the same time, enterprises are permanently storing large amounts of business data in XML format. Web services, the next generation of middleware technology, also use XML as the wire format for data in order to overcome interoperability issues associated with prior-generation middleware technologies, e.g. CORBA and DCOM. Although XML is used in many different ways, one thing remains the same: an XML document needs to be parsed before applications can do anything with it. Inheriting heavily from traditional text-processing techniques, existing XML processors extract tokens out of the source document into many discrete string objects. Subsequently, one can build data structures or class hierarchies on top of those tokens.
Nevertheless, we observe that there is another way to achieve the purpose of tokenization: record only the starting offset and the length of each token, leaving the token content "as is" in the source document. In other words, one can treat the source document as one large token "bucket" while creating a map detailing the positions of the tokens in that bucket.
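As a minimal sketch of this idea (the record and field names here are illustrative, not taken from any particular parser), such a token descriptor needs nothing beyond the two integers just described:

    /* A "non-extractive" token: the document itself is the token bucket;
     * each token is described only by where it lives in that bucket. */
    typedef struct {
        int offset;   /* starting offset of the token in the document */
        int length;   /* number of characters in the token */
    } ne_token;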
To help illustrate how this "non-extractive" style of tokenization works, we compare it with traditional "extractive" tokens in some common usage scenarios; a code sketch of all three operations follows the list:
- String comparison: Under the traditional text-processing framework, one uses some flavor of C's "strcmp" function (in <string.h>) to compare an "extractive" token against a known string. In our approach, one simply uses C's "strncmp" function, also in <string.h>, applied at the token's offset.
- String to numerical data conversion: Other frequently used functions, such as "atoi" and "atof", can be revised to work with non-extractive tokens. One necessary change is to the functions' signatures. For example, "atoi" takes a character string as its input; to make a non-extractive equivalent, one can create a "new-atoi" that accepts three arguments: the source document (of type char*), the offset (of type int), and the length (of type int). The remaining implementation differences mostly deal with the new string/token representation (e.g. the end of a string is no longer marked by '\0').
- Trim: Removing the leading and trailing white space of a "non-extractive" token only requires changing the values of its offset and length. This is usually simpler than the extractive style of tokenization, which often involves the creation of new string objects.
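The sketch below shows hedged, non-extractive versions of all three operations, repeating the illustrative ne_token record so it stands alone; the names ne_matches, ne_atoi, and ne_trim are our own illustrations, not an established API:

    #include <ctype.h>
    #include <string.h>

    typedef struct { int offset; int length; } ne_token;

    /* String comparison: compare a token against a known string in place,
     * using strncmp, without copying the token out of the document. */
    int ne_matches(const char *doc, ne_token t, const char *known)
    {
        return (int)strlen(known) == t.length
            && strncmp(doc + t.offset, known, (size_t)t.length) == 0;
    }

    /* "new-atoi": like atoi, but bounded by the recorded length,
     * since the token is no longer terminated by '\0'. */
    int ne_atoi(const char *doc, int offset, int length)
    {
        const char *s = doc + offset;
        int i = 0, sign = 1, value = 0;
        if (i < length && (s[i] == '+' || s[i] == '-'))
            sign = (s[i++] == '-') ? -1 : 1;
        for (; i < length && isdigit((unsigned char)s[i]); i++)
            value = value * 10 + (s[i] - '0');
        return sign * value;
    }

    /* Trim: only the offset and length change; no new object is created. */
    ne_token ne_trim(const char *doc, ne_token t)
    {
        while (t.length > 0 && isspace((unsigned char)doc[t.offset])) {
            t.offset++;
            t.length--;
        }
        while (t.length > 0 &&
               isspace((unsigned char)doc[t.offset + t.length - 1]))
            t.length--;
        return t;
    }

Note that ne_trim returns a new descriptor by value; the underlying document bytes are never modified or copied.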
Overall we feel that,
apart from some implementation changes, the difference between traditional
tokenization and the proposed "non-extractive" tokenization is largely a
change in perspective. They are just two different ways to describe the
same thing, i.e. a token.
Going one step further, one can implement various "non-extractive" functions that deal directly with the document's native character encoding. For XML processing, application developers are more accustomed to working with UCS-2 strings in their code. Such functions therefore need to be "smart" enough to understand different character encodings while, at the same time, exporting UCS-2-compatible signatures to calling applications.
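As a hedged sketch of what such a function might look like (the name ne_matches_ucs2 and the single-byte-only decoding are our own simplifying assumptions), the comparison below exports a UCS-2 signature while reading the document in its native encoding:

    typedef unsigned short ucs2_t;   /* one UCS-2 code unit */

    /* Compare a non-extractive token against a known UCS-2 string.
     * Only the single-byte (ASCII-compatible) document case is shown;
     * a real implementation would branch on the document's declared
     * encoding and decode multi-byte sequences on the fly. */
    int ne_matches_ucs2(const char *doc, int offset, int length,
                        const ucs2_t *known, int known_len)
    {
        int i;
        if (length != known_len)   /* holds only for single-byte encodings */
            return 0;
        for (i = 0; i < length; i++)
            if ((ucs2_t)(unsigned char)doc[offset + i] != known[i])
                return 0;
        return 1;
    }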