Introduction
Website: http://xmlsoft.org/
- Tiny XML parser started in 1998 in C
- Integrated in the GNOME project in 1999
- Grew up to support additional specifications
- 150,000 lines, 4 MBytes of code, 1 MByte of binary
- API includes 1200+ entry points
- 5 committers, 500+ subscribers to the mailing list
- Bindings for various languages (Python, Perl, ...)
- Ported and "supported" on many platforms
- Deployed on all Linux systems, integrated in Solaris
The XML libraries and tools
- Libxml2: the core XML Library
- Libxslt: the XSLT-1.0 Library
- LibExslt: the XSLT Extensions Library
- XMLSig: the XML signature and encryption Library
- xmllint: CLI tool for libxml2
- xsltproc: CLI tool for XSLT transformations
Architecture of libxml2
XML itself
XML is a W3C Recommendation
Inserting metadata in text to associate structure to content
<p> Classic example based on
<a href="http://www.w3.org/">HTML</a> markup</p>
The mathematical representation is a tree:
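As a minimal sketch of that tree, the markup above can be parsed and printed one node per line. This uses Python's standard xml.etree.ElementTree module rather than libxml2, and dump_tree is a hypothetical helper name chosen for the illustration.

```python
import xml.etree.ElementTree as ET

# The example markup from this slide.
XML = ('<p> Classic example based on\n'
       '<a href="http://www.w3.org/">HTML</a> markup</p>')

def dump_tree(elem, depth=0):
    """Return one line per node: each element, then its text children."""
    pad = "  " * depth
    lines = [pad + "element " + elem.tag]
    if elem.text and elem.text.strip():
        lines.append(pad + "  text " + repr(elem.text.strip()))
    for child in elem:
        lines.extend(dump_tree(child, depth + 1))
        # Text following a child element belongs to the parent node.
        if child.tail and child.tail.strip():
            lines.append(pad + "  text " + repr(child.tail.strip()))
    return lines

tree_lines = dump_tree(ET.fromstring(XML))
print("\n".join(tree_lines))
```

The p element ends up with three children: a text node, the a element (which holds its own text node), and a trailing text node.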
The tree parser Interface
- Parsing is done in one step
- The result is a tree instance
- Validation can be added to parsing
- Errors are reported via callbacks
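import libxml2

# Parse the whole file into an in-memory tree.
doc = libxml2.parseFile("ex1.xml")
# The first child of the document node is the root element here.
p = doc.children
print(p.name)
# The tree is allocated by the C library and must be freed explicitly.
doc.freeDoc()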
See example 1, XML
The SAX interface
- SAX: Simple API for XML
- Used to process large documents
- Callback based interface
- Fast but complex for the programmer
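The callback style can be sketched with Python's standard xml.sax module, used here as a stand-in for libxml2's own SAX interface. LinkCollector is a hypothetical handler written for this illustration; the parser calls startElement for each opening tag and the handler has to keep any state it needs itself.

```python
import xml.sax

XML = (b'<p> Classic example based on '
       b'<a href="http://www.w3.org/">HTML</a> markup</p>')

class LinkCollector(xml.sax.ContentHandler):
    """Hypothetical handler: records element names and href attributes."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.links = []

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is seen.
        self.names.append(name)
        if "href" in attrs:
            self.links.append(attrs["href"])

handler = LinkCollector()
xml.sax.parseString(XML, handler)
print(handler.names, handler.links)
```

No tree is built: once a callback returns, the event is gone, which is why the approach scales to large documents but puts the bookkeeping on the programmer.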
The reader interface
- Used to process large documents
- Based on the C# XmlReader interface
- Simpler to program: iterator over the document nodes
- Allows validation
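import libxml2

# Wrap the file in a libxml2 input buffer and attach a reader to it.
input = libxml2.inputBuffer(open("ex2.xml"))
reader = input.newTextReader("ex2.xml")
# Read() returns 1 while nodes remain, 0 at the end, -1 on error.
ret = reader.Read()
while ret == 1:
    print(reader.Name())
    ret = reader.Read()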
See example 2, XML
XML Namespaces
- Goal: be able to mix vocabularies unambiguously
- Uses URIs as qualifiers for the names
- Syntax based on attributes "xmlns" to declare namespaces
- Uses prefixes to bind names in instances to namespaces
<p xmlns="http://www.w3.org/1999/xhtml"> Classic example based on
<a href="http://www.w3.org/">HTML</a> markup</p>
<x:p xmlns:x="http://www.w3.org/1999/xhtml"> Classic example based on
<x:a href="http://www.w3.org/">HTML</x:a> markup</x:p>
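A short sketch can confirm that the two serializations above are equivalent once namespaces are resolved. It uses Python's standard xml.etree.ElementTree, which exposes each element name as {namespace-URI}local-name; qualified_names is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Same document, once with a default namespace, once with a prefix.
DEFAULT_NS = ('<p xmlns="http://www.w3.org/1999/xhtml"> Classic example based on\n'
              '<a href="http://www.w3.org/">HTML</a> markup</p>')
PREFIXED = ('<x:p xmlns:x="http://www.w3.org/1999/xhtml"> Classic example based on\n'
            '<x:a href="http://www.w3.org/">HTML</x:a> markup</x:p>')

def qualified_names(text):
    """Return the {namespace-URI}local-name of every element, in document order."""
    return [elem.tag for elem in ET.fromstring(text).iter()]

names_default = qualified_names(DEFAULT_NS)
names_prefixed = qualified_names(PREFIXED)
print(names_default)
print(names_default == names_prefixed)
```

The prefix is only a local shorthand: what identifies an element is the pair of namespace URI and local name.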
Validation: DTDs
- Simple regexp-based rules
- Constrains only structure, not content
- Part of the base XML specification
<!DOCTYPE p [
<!ELEMENT p (#PCDATA | a | em)*>
<!ELEMENT a (#PCDATA)>
<!ATTLIST a href CDATA #REQUIRED >
<!ELEMENT em (#PCDATA)>
]>
See example 3, XML
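The "regexp-based rules" idea can be illustrated by hand: a content model is a regular expression over the sequence of child element names. The sketch below is not a DTD validator and does not use libxml2. The CONTENT_MODELS table and the check_children helper are hypothetical, translated manually from the DTD above, and attribute declarations are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child element is written as
# "name " and character data (#PCDATA) is ignored.
CONTENT_MODELS = {
    "p": re.compile(r"(a |em )*"),   # (#PCDATA | a | em)*
    "a": re.compile(r""),            # (#PCDATA): no child elements
    "em": re.compile(r""),           # (#PCDATA): no child elements
}

def check_children(elem):
    """Return True if elem and all its descendants match their models."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # undeclared element
    children = "".join(child.tag + " " for child in elem)
    if model.fullmatch(children) is None:
        return False
    return all(check_children(child) for child in elem)

valid = ET.fromstring('<p>Some <a href="x">link</a> and <em>text</em></p>')
invalid = ET.fromstring('<p>Some <a href="x">nested <em>text</em></a></p>')
print(check_children(valid), check_children(invalid))
```

The second document is rejected because the DTD declares a as containing character data only, so an em inside a does not match its content model.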
Validation: XML Schemas
- W3C Recommendations: Structures and Datatypes
- Complex, not very flexible
- Tries to model OO data: extensions and restrictions
- Augments the tree with type information (PSVI)
- Libxml2 implementation for Structure is in progress
- Libxml2 Datatypes support is mostly complete
Validation: XML Schemas Datatypes
Validation: Relax-NG
- Relax-NG: counter-proposal to Schemas Structures
- Very flexible, clear, formal definition
- Allows reuse of external Datatypes
- Fully implemented in libxml2
- Developed by OASIS, on the standards track at ISO
<element name="p"
xmlns="http://relaxng.org/ns/structure/1.0">
<zeroOrMore>
<choice>
<text/>
<element name="a">
<attribute name="href"/>
<text/>
</element>
</choice>
</zeroOrMore>
</element>
See example 4, XML, RNG
Validation: streaming
- Validating instances too big for memory
- DTD and Relax-NG validation on top of the xmlReader
- Available through the xmllint command-line option --stream
See example 5, XML, RNG
XPath: addressing language
- XPath: language for addressing parts of an XML document
- An expression language
- Handles strings, numbers, booleans, and sets of nodes
- Reused by XSLT, XPointer, XInclude, Schemas ...
- Provides a basic function library
Examples:
/p/a
//a
//a[@href = "index.html"]
Try "xmllint --shell" to test XPath expressions
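The three expressions can also be tried from Python. This sketch uses the standard xml.etree.ElementTree module, which supports only a subset of XPath and evaluates paths relative to an element, so each expression is rewritten relative to the root p element. The second link in the sample document is an assumption added so that the attribute test matches something.

```python
import xml.etree.ElementTree as ET

# Sample document; the "index.html" link is added for this illustration.
XML = ('<p> Classic example based on\n'
       '<a href="http://www.w3.org/">HTML</a> markup, see the\n'
       '<a href="index.html">index</a></p>')

root = ET.fromstring(XML)   # root is the <p> element

# /p/a : <a> children of the root <p>, written relative to the root here
direct = root.findall("./a")
# //a : every <a> anywhere in the document
anywhere = root.findall(".//a")
# //a[@href = "index.html"] : <a> elements with exactly that href value
index_links = root.findall(".//a[@href='index.html']")

print(len(direct), len(anywhere), [a.text for a in index_links])
```

A full XPath implementation such as the one in libxml2 also returns strings, numbers and booleans; findall only ever returns lists of elements.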
XPointer: fragment and selection
- XPointer: syntax for XML fragment identifiers
- How to address subresources
- Mostly based on XPath
XInclude: inclusion mechanism
- XInclude: an include mechanism
- Includes XML documents, fragments or text
- Allows the use of XPointer to include only fragments
- Mostly useful for document processing
See example 6, XML, Included
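The inclusion step can be sketched with Python's standard xml.etree.ElementInclude module, which handles whole-document and text inclusion but not XPointer fragments. The FILES table and the loader function are hypothetical stand-ins for real files on disk; libxml2 performs the equivalent processing natively.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files"; real code would read them from disk.
FILES = {
    "included.xml": "<a href='http://www.w3.org/'>HTML</a>",
    "note.txt": "plain text",
}

MAIN = """<p xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="included.xml"/>
<xi:include href="note.txt" parse="text"/>
</p>"""

def loader(href, parse, encoding=None):
    """Resolve an href to an element (parse='xml') or a string (parse='text')."""
    data = FILES[href]
    return ET.fromstring(data) if parse == "xml" else data

root = ET.fromstring(MAIN)
# Replaces each xi:include child of root in place.
ElementInclude.include(root, loader=loader)
print(ET.tostring(root, encoding="unicode"))
```

After processing, the xi:include elements are gone: the first is replaced by the included a element and the second by the text of note.txt.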
XSLT: the transformation language
- XSLT: a transformation language for XML
- an XSLT stylesheet describes a transformation
- XSLT uses XPath to target nodes in the input
- the output can be XML, HTML or text
- Used for format conversion or documentation processing
- libxslt: a library providing XSLT on top of libxml2
- xsltproc: a command-line transformation tool based on libxslt
See example 7, XML, Stylesheet
EXSLT: XSLT extensions
- EXSLT: an extension library for XSLT
- provided as a separate library
- available directly from xsltproc
XMLSig: sign and encrypt
- XMLSig: a library implementing XML crypto specs
- XML Signature: W3C REC to sign an XML document
- XML Encryption: W3C REC to encrypt part of an XML document
- based on Canonicalization C14N integrated in libxml2
- developed independently by Aleksey Sanin
Catalogs and I/O handling
- libxml handles FTP and HTTP by default
- I/O handlers can be redefined
- use XML Catalogs to map resources to local files
Conclusions
- This has been a lot of work: 5 years
- There is still a lot to do: XML Schemas, XML-1.1
- There is even more work: XPath2 and XSLT2
- Large industrial deployments
- Ultimately the community will handle most of the maintenance
Questions and Answers