Introduction
Website: http://xmlsoft.org/
- Tiny XML parser started in 1998 in C
- Integrated into the GNOME project in 1999
- Grew to support additional specifications
- 150,000 lines, 4 MB of source code, 1 MB of binary
- API includes 1200+ entry points
- 5 committers, 500+ subscribers to the mailing list
- Bindings for various languages (Python, Perl, ...)
- Ported and "supported" on many platforms
- Deployed on all Linux systems, integrated in Solaris
The XML libraries and tools
- Libxml2: the core XML Library
- Libxslt: the XSLT-1.0 Library
- LibExslt: the XSLT Extensions Library
- XMLSec: the XML signature and encryption Library
- xmllint: CLI tool for libxml2
- xsltproc: CLI tool for XSLT transformations
Architecture of libxml2
XML example
XML is a W3C Recommendation
Inserting metadata in text to associate structure with content
<p> Classic example based on
<a href="http://www.w3.org/">HTML</a> markup</p>
The mathematical representation is a tree:
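As a rough illustration of that tree, the sketch below parses the fragment above and lists one node per line, indented by depth. It uses Python's standard xml.etree.ElementTree module instead of the libxml2 bindings so that it runs without extra modules; the helper name dump is made up for this example.

```python
import xml.etree.ElementTree as ET

XML = ('<p> Classic example based on\n'
       '<a href="http://www.w3.org/">HTML</a> markup</p>')

def dump(elem, depth=0):
    """Return indented lines for elem, its attributes, text and children."""
    pad = "  " * depth
    lines = [pad + "element " + elem.tag]
    for name, value in elem.attrib.items():
        lines.append(pad + "  attribute %s=%s" % (name, value))
    # ElementTree keeps text before the first child in .text
    if elem.text and elem.text.strip():
        lines.append(pad + "  text " + repr(elem.text.strip()))
    for child in elem:
        lines.extend(dump(child, depth + 1))
        # ... and text following a child in that child's .tail
        if child.tail and child.tail.strip():
            lines.append(pad + "  text " + repr(child.tail.strip()))
    return lines

root = ET.fromstring(XML)
tree_lines = dump(root)
print("\n".join(tree_lines))
```

The element p is the root; its children are a text node, the element a (carrying the href attribute and its own text), and a trailing text node.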
The tree parser Interface
- Parsing is done in one step
- The result is a tree instance
- Validation can be added to parsing
- Errors are reported via callbacks
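import libxml2
doc = libxml2.parseFile("ex1.xml")  # parse the whole file into a tree
p = doc.children                    # first child of the document node
print(p.name)
doc.freeDoc()                       # documents must be freed explicitly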
See example 1, XML
The SAX interface
- SAX: Simple API for XML
- Used to process large documents
- Callback based interface
- Fast but complex for the programmer
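To give a feel for the callback style, here is a minimal sketch using Python's standard xml.sax module rather than libxml2's own SAX binding. The handler class AnchorCollector and the inline document are assumptions made for the example. Note how the handler has to carry its own state (here a depth counter), which is where the complexity for the programmer comes from.

```python
import xml.sax

class AnchorCollector(xml.sax.ContentHandler):
    """Record the href of every <a> element as the parser streams by."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.hrefs = []
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        # Called once for each opening tag, in document order
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        if name == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def endElement(self, name):
        # Called once for each closing tag
        self.depth -= 1

DOC = (b'<p> Classic example based on '
       b'<a href="http://www.w3.org/">HTML</a> markup</p>')

handler = AnchorCollector()
xml.sax.parseString(DOC, handler)
print(handler.hrefs, handler.max_depth)
```

No tree is built at any point, so memory use stays flat however large the input is.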
The reader interface
- Used to process large documents
- Based on the C# XmlReader interface
- Simpler to program: iterator over the document nodes
- Allows validation
import libxml2
input = libxml2.inputBuffer(open("ex2.xml"))
reader = input.newTextReader("ex2.xml")
ret = reader.Read()       # 1: node read, 0: end of document, -1: error
while ret == 1:
    print(reader.Name())
    ret = reader.Read()
See example 2, XML
Validation: XML Schemas Datatypes
Validation: Relax-NG
- Relax-NG: counter-proposal to XML Schema structures
- Very flexible, clear, formal definition
- Allows reuse of external datatypes
- Fully implemented in libxml2
- Developed by OASIS, on the standards track at ISO
<element name="p"
xmlns="http://relaxng.org/ns/structure/1.0">
<zeroOrMore>
<choice>
<text/>
<element name="a">
<attribute name="href"/>
<text/>
</element>
</choice>
</zeroOrMore>
</element>
See example 4, XML, RNG
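To make the content model above concrete, the sketch below hand-codes the same constraints in Python: a p element without attributes, containing any mix of text and a elements, where each a has exactly one attribute, href, and text-only content. This is not a Relax-NG implementation and does not use libxml2; matches_p_schema is a hypothetical helper built on the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

def matches_p_schema(root):
    """Hand-coded check of the pattern: p holds text and a[@href] elements."""
    if root.tag != "p" or root.attrib:
        return False              # root must be <p> with no attributes
    for child in root:
        if child.tag != "a":
            return False          # only <a> children are allowed
        if set(child.attrib) != {"href"}:
            return False          # href is required, nothing else is allowed
        if len(child) != 0:
            return False          # <a> may contain text only
    return True

valid = ET.fromstring(
    '<p>Classic example based on '
    '<a href="http://www.w3.org/">HTML</a> markup</p>')
invalid = ET.fromstring('<p>Missing attribute: <a>HTML</a></p>')

print(matches_p_schema(valid), matches_p_schema(invalid))
```

A real validator derives this logic from the schema instead of hard-coding it, which is what makes the schema reusable across tools.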
Validation: streaming
- Validating instances too big for memory
- DTD and Relax-NG validation on top of the xmlReader
- Available through the xmllint --stream command-line option
See example 5, XML, RNG
XPath: addressing language
- XPath: a language for addressing parts of an XML document
- An expression language
- Handles strings, numbers, booleans, and sets of nodes
- Reused by XSLT, XPointer, XInclude, Schemas ...
- Provides a basic function library
Examples:
/p/a
//a
//a[@href = "index.html"]
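import libxml2
doc = libxml2.htmlParseFile(url, None)   # url: address of the HTML page, set by the caller
ctxt = doc.xpathNewContext()
anchors = ctxt.xpathEval("//a[@href]")
for anchor in anchors:
    href = anchor.prop("href")
ctxt.xpathFreeContext()
doc.freeDoc()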
Try "xmllint --shell" to test XPath expressions
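For readers without the libxml2 bindings at hand, the expressions above can be approximated with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree implements a small subset of XPath and evaluates paths relative to an element, so /p/a becomes "a" evaluated on the root p, and //a becomes ".//a". The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<p>See <a href="index.html">home</a> and '
    '<b><a href="http://www.w3.org/">W3C</a></b> or <a>nothing</a></p>')

children = root.findall("a")                      # like /p/a
all_anchors = root.findall(".//a")                # like //a
with_href = root.findall(".//a[@href]")           # like //a[@href]
home = root.findall(".//a[@href='index.html']")   # like //a[@href = "index.html"]

print(len(children), len(all_anchors), len(with_href), len(home))
for anchor in with_href:
    print(anchor.get("href"))
```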
XPointer: fragment and selection
- XPointer: syntax for XML fragment identifiers
- How to address subresources
- Mostly based on XPath
XInclude: inclusion mechanism
- XInclude: an include mechanism
- Includes XML documents, fragments or text
- Allows the use of XPointer to include only fragments
- Mostly useful for document processing
See example 6, XML, Included
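The inclusion step can be sketched with Python's standard xml.etree.ElementInclude module, which implements a limited form of XInclude (whole documents or text, no XPointer). The file name included.xml and the in-memory FRAGMENTS table are assumptions made so the example runs without touching the disk; libxml2 itself is not used here.

```python
from xml.etree import ElementTree as ET, ElementInclude

# Hypothetical in-memory stand-in for the files being included
FRAGMENTS = {"included.xml": '<a href="http://www.w3.org/">HTML</a>'}

def loader(href, parse, encoding=None):
    """Return the included resource as an element (xml) or a string (text)."""
    data = FRAGMENTS[href]
    if parse == "xml":
        return ET.fromstring(data)
    return data

root = ET.fromstring(
    '<p xmlns:xi="http://www.w3.org/2001/XInclude">'
    'Classic example based on <xi:include href="included.xml"/> markup</p>')

# Replace every xi:include element in place with the loaded content
ElementInclude.include(root, loader=loader)

anchor = root.find("a")
print(anchor.get("href"), repr(anchor.tail))
```

After processing, the xi:include element is gone and the a element sits in its place, with the surrounding text preserved.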
XSLT: the transformation language
- XSLT: a transformation language for XML
- an XSLT stylesheet describes a transformation
- XSLT uses XPath to target nodes in the input
- the output can be XML, HTML or text
- Used for format conversion or documentation processing
- libxslt: a library providing XSLT on top of libxml2
- xsltproc: a command-line transformation tool based on libxslt
See example 7, XML, Stylesheet
Extending XSLT with python
- Possible but hard, especially tree management
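import libxslt
def f(ctx, str):
    # placeholder body, not in the original: return the argument upper-cased
    return str.upper()
libxslt.registerExtModuleFunction("foo",
                                  "http://example.com/foo", f)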
See example 8
Catalogs and I/O handling
- libxml handles FTP and HTTP by default
- I/O handlers can be redefined
- Use XML Catalogs to map resources to local files
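Conceptually, a catalog is a lookup table from remote identifiers to local copies. The sketch below reads system entries from an OASIS-style XML catalog and resolves an identifier against them. It is a simplified illustration built on the standard xml.etree.ElementTree module, not libxml2's catalog code: the URLs, the local path and the helpers load_system_map and resolve are all made up, and other entry types (public, rewriteSystem, ...) are ignored.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical catalog mapping one remote DTD to a local copy
CATALOG = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="http://www.example.com/dtd/doc.dtd"
          uri="file:///usr/share/xml/example/doc.dtd"/>
</catalog>"""

def load_system_map(catalog_xml):
    """Build a {systemId: local uri} table from the <system> entries."""
    root = ET.fromstring(catalog_xml)
    entries = root.findall("{%s}system" % CATALOG_NS)
    return {e.get("systemId"): e.get("uri") for e in entries}

def resolve(system_id, mapping):
    """Return the local replacement if one is listed, else the original id."""
    return mapping.get(system_id, system_id)

mapping = load_system_map(CATALOG)
print(resolve("http://www.example.com/dtd/doc.dtd", mapping))
print(resolve("http://www.example.com/dtd/other.dtd", mapping))
```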
Python bindings: the build
Fully automated with a bit of glue!
- Python script parsing C headers and code
- Generates an XML description
- Python generator generates bindings + python module
- Class remapping done by the generator
- There are a few extra C interfaces for easier bindings
- The python module uses some hand-coded core classes
- Packaging is either RPM or setup.py for non-Linux systems
Python bindings: problems and TODOs
- Reference counting for documents is too hard
- Explicit free of documents is needed
- Basic work on iterators based on the tree
- Full coverage of libxml2: I/O, XPointer, ...
- Better bindings and other improvements?
- Standard Python XML tree interface? DOM?
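Until reference counting is solved, one possible way to make the explicit free harder to forget is a small context manager. This is only a sketch, not part of the bindings: parsed_document is a hypothetical helper, and FakeDoc is a stand-in class used here so the example runs without libxml2 installed.

```python
from contextlib import contextmanager

@contextmanager
def parsed_document(parse, *args):
    """Call parse(*args), yield the document, always call freeDoc() on exit."""
    doc = parse(*args)
    try:
        yield doc
    finally:
        doc.freeDoc()

class FakeDoc(object):
    """Stand-in for a libxml2 document, only to keep this sketch runnable."""
    def __init__(self, name):
        self.name = name
        self.freed = False

    def freeDoc(self):
        self.freed = True

with parsed_document(FakeDoc, "ex1.xml") as doc:
    name_inside = doc.name
    freed_inside = doc.freed

print(name_inside, freed_inside, doc.freed)
```

With the real bindings the call would presumably read parsed_document(libxml2.parseFile, "ex1.xml"), so the document is freed even if the body raises an exception.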
Conclusions
- This has been a lot of work: 5 years
- There is still a lot to do: XML Schemas, XML-1.1
- There is even more work: XPath2 and XSLT2
- Large industrial deployments
- Ultimately the community will handle most of the maintenance
Questions and Answers