What is libxml
- A library for XML (eXtensible Markup Language)
- Design oriented toward document processing
- Written in C
- Portable: Linux, other Unix systems, Windows, MacOS, QNX, embedded systems
- Focused on standards compliance
- Implements a number of associated XML standards
XML
XML is a metalanguage
- tag-based language
- a subset of SGML
- validated against Document Type Definitions (DTDs)
XML defines the structural rules but not the semantics of the markup
XML 1.0 is a W3C Recommendation
Namespaces in XML is another W3C Recommendation
XML Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<exemple>
<titre>Un exemple</titre>
<chapitre numéro="1">
<titre>Introduction</titre>
<p>Ceci est un exemple très succinct</p>
<img source="logo.gif"/>
</chapitre>
<chapitre numéro="2"/>
</exemple>
Architecture
Parsing interfaces
libxml embeds an XML parser and an HTML parser (an SGML DocBook parser is available too).
SAX
- Simple API for XML
- callback-based API:
- startDocument(), endDocument()
- startElement(), endElement()
- characters(), etc.
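The callback model listed above can be sketched as follows. This is a minimal illustration that uses Python's standard `xml.sax` module rather than libxml's own C SAX interface, which is an assumption made only to keep the example self-contained; the sample document and the `EventLogger` class are hypothetical.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each SAX callback in the order the parser fires it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append(f"startElement {name}")

    def endElement(self, name):
        self.events.append(f"endElement {name}")

    def characters(self, content):
        # Text may arrive in several pieces; whitespace-only pieces are skipped.
        if content.strip():
            self.events.append(f"characters {content.strip()!r}")


SAMPLE = b"<doc><title>Hello</title><p>Short text</p></doc>"

handler = EventLogger()
xml.sax.parseString(SAMPLE, handler)
```

No tree is built here: the application only sees the stream of events collected in `handler.events`, which is why this style suits large inputs.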
DOM
- acronym for Document Object Model
- tree-based API
- libxml exposes the tree structure
- libxml provides tree manipulation routines
- gdome2 provides a real implementation of DOM2
- DOM1 and DOM2 are W3C recommendations
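As a rough illustration of a tree-based API, the sketch below parses a document into a tree, navigates it, and modifies it. It relies on Python's standard `xml.dom.minidom` (a small W3C DOM implementation), not on libxml's tree routines, whose C API differs; the sample document is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString("<doc><title>Intro</title><p>First</p></doc>")
root = doc.documentElement

# Navigation: collect the text of every <title> element under the root.
titles = [node.firstChild.data for node in root.getElementsByTagName("title")]

# Manipulation: build a new <p> element and append it to the root.
new_p = doc.createElement("p")
new_p.appendChild(doc.createTextNode("Second"))
root.appendChild(new_p)

serialized = root.toxml()
```

Unlike the callback style, the whole document stays in memory, so any node can be revisited or edited after parsing.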
DTD validation support
libxml does not validate by default
the API allows:
- validating while parsing
- validating an already parsed tree
- validating against an arbitrary DTD
problem: validation depends on the DOM tree
Memory management
all allocations and deallocations are centralized
an API allows the allocation functions to be redefined
a debugging module keeps a list of allocations and reports leaks
Encoding support
Character encoding handling was a serious problem in libxml1
Fixed in libxml2:
- the internal representation is entirely UTF-8
- built-in support for UTF-16 and ISO Latin-1
- uses iconv for other encodings, or user-supplied routines otherwise
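The idea of converting every input encoding to a single internal representation can be sketched like this. The example uses Python's standard `xml.etree.ElementTree`, whose parser decodes the declared ISO-8859-1 input into Unicode strings; it is not libxml2's implementation, and the sample text is hypothetical.

```python
import xml.etree.ElementTree as ET

# Input bytes are ISO-8859-1 (Latin-1), as stated in the XML declaration.
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<p>Un exemple très court</p>"
).encode("iso-8859-1")

# The parser reads the declaration and decodes the content accordingly.
root = ET.fromstring(latin1_bytes)
text = root.text

# Serializing again as UTF-8 changes the bytes but not the characters.
utf8_bytes = ET.tostring(root, encoding="utf-8")
```

The character "è" is one byte in the Latin-1 input and two bytes in the UTF-8 output, while `text` is the same string either way.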
Input/Output
The parser is progressive, allowing either pull or push data flow
multiple input mechanisms:
- files, possibly compressed
- HTTP
- FTP
- a mechanism to override existing protocols and define new ones
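A progressive parser accepts data in arbitrary chunks (push) and lets the application collect events when it is ready (pull). The sketch below shows that flow with `XMLPullParser` from Python's standard library, assumed here as a stand-in for libxml's progressive parsing interface; the chunk boundaries are deliberately placed in the middle of tags.

```python
import xml.etree.ElementTree as ET

# Chunks as they might arrive from a network socket.
chunks = ["<doc><ti", "tle>Intro</title><p>Fir", "st</p></doc>"]

parser = ET.XMLPullParser(events=("start", "end"))
seen = []


def drain():
    """Pull whatever events the parser has produced so far."""
    for event, elem in parser.read_events():
        seen.append((event, elem.tag))


for chunk in chunks:
    parser.feed(chunk)  # push: hand over the next piece of input
    drain()             # pull: consume the events that are ready

# Some parser versions buffer small chunks, so close() may release late events.
parser.close()
drain()
```

The order of events in `seen` does not depend on where the input was split.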
XPath
- XPath is an expression language for XML
- looks up sets of nodes in a document
- simple language
- standard types: strings, booleans, numbers, plus functions and variables
- defines a standard library
- uses axis-based searches
- XPointer and XSLT reuse XPath
Examples
- /chapter[@type="warning"]
- //p[position()=5]
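Expressions of this kind can be tried out with a small sketch. Python's standard `xml.etree.ElementTree` supports only a subset of XPath (for instance `p[2]` instead of `p[position()=2]`, and paths relative to a node), so the queries below are adapted forms of the examples above rather than the exact expressions; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    '<chapter type="intro"><p>a</p><p>b</p></chapter>'
    '<chapter type="warning"><p>c</p><p>d</p><p>e</p></chapter>'
    "</book>"
)

# Attribute predicate: child chapters whose type attribute is "warning".
warning_chapters = book.findall("chapter[@type='warning']")

# Positional predicate: every <p> that is the second <p> of its parent.
second_paragraphs = [p.text for p in book.findall(".//p[2]")]
```

The positional predicate is evaluated per parent, so one paragraph is selected from each chapter rather than the second paragraph of the whole document.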
XPointer
- XPointer defines the fragment identifier syntax for XML resources
- goal is to be able to address parts of an XML document:
- compatible with HTML fragment IDs
- can represent any user selection
- reuses XPath but extends it to allow finer addressing
- string searches
- specific hypertext extensions: here(), origin(), ...
A few examples:
- #Introduction
- #xpointer(id("Introduction"))
- #xpointer(/chapitre[2]/p[3])
- #xpointer(//chapitre[titre="Introduction"]/descendant::p[position()=last()])
- #xpointer(id("sec2.1")//p[2]/range-to(id("sec2.2")//p[last()]))
- #xpointer(string-range(//title,"Thomas Pynchon"))
XML Base, XInclude
- Extra functionality for link support:
- XML Base defines a base URI mechanism for XML
- XInclude defines a standard inclusion mechanism for XML
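The inclusion mechanism can be illustrated with a short sketch. It uses `xml.etree.ElementInclude` from Python's standard library, not libxml's XInclude processor, and a hypothetical in-memory loader (`memory_loader`, `RESOURCES`) so that no file or network access is needed.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files", keyed by the href used in xi:include.
RESOURCES = {
    "chapter1.xml": "<chapter><title>Introduction</title></chapter>",
}


def memory_loader(href, parse, encoding=None):
    """Resolve an xi:include target from RESOURCES instead of the disk."""
    data = RESOURCES[href]
    return ET.fromstring(data) if parse == "xml" else data


book = ET.fromstring(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1.xml"/>'
    "</book>"
)

# Replace each xi:include element in place with the content it points to.
ElementInclude.include(book, loader=memory_loader)
included_title = book.find("chapter/title").text
```

After processing, the `xi:include` element is gone and the tree looks as if the chapter had been written inline.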
XSLT
XSLT is a transformation language.
an XSLT stylesheet is an XML document
processing is based on templates
templates are applied recursively (transitive closure on templates)
allows output as XML, HTML or text
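The template-driven, recursive processing model can be approximated with a toy sketch. This is not an XSLT engine and does not use libxslt: the `TEMPLATES` table, the helper names, and the sample document are all hypothetical, and output escaping is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def render_children(node):
    """Copy the node's text and recurse into its children, in document order."""
    parts = [node.text or ""]
    for child in node:
        parts.append(apply_templates(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical templates keyed by element name; each one wraps its children.
TEMPLATES = {
    "doc": lambda n: "<html><body>" + render_children(n) + "</body></html>",
    "title": lambda n: "<h1>" + render_children(n) + "</h1>",
    "p": lambda n: "<p>" + render_children(n) + "</p>",
}


def apply_templates(node):
    """Use the matching template, or fall back to processing the children."""
    template = TEMPLATES.get(node.tag, render_children)
    return template(node)


source = ET.fromstring("<doc><title>Intro</title><p>Some <em>text</em>.</p></doc>")
html = apply_templates(source)
```

Each template decides what to emit for its node and then hands its children back to `apply_templates`, which is what makes the processing recursive; elements without a template, such as `em` here, simply pass their text through.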
libxslt
libxslt is the library implementing XSLT on top of libxml2
the xsltproc program runs it from the command line
The API is relatively simple:
- Loading/compiling an XSLT stylesheet
- Applying a stylesheet to a parsed tree
- Saving the result according to a stylesheet
Should be compliant with XSLT 1.0, and implements some 1.1 extensions
Relatively fast as long as the process does not start swapping
A few examples
Future work in libxml and libxslt
finish XSLT, bug fixing
basic event support
support for non-deterministic DTD content models
support for different tree models, large files, databases
XML Schemas, validation with decent type support
Future work on top of libxml
XML-RPC, SOAP or other XML based protocols (Jabber)