What is XML?
XML is a W3C Recommendation for
building tag-based structured documents and data.
Tags look like those of HTML/SGML: <myTag>
Structured: XML requires proper nesting and closing. Neither of the
following is well-formed:
- <myTag> bla bla <unclosed> bla
bla </myTag> (<unclosed> is never closed)
- <myTag> bla bla <nesting> bla
bla </myTag> </nesting> (<nesting> is closed after its parent)
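The nesting and closing rules can be checked mechanically. The sketch below is only an illustration: it uses Python's standard xml.etree.ElementTree parser rather than libxml, and the helper name is_well_formed and the sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One correct document and the two kinds of error described above.
samples = {
    "proper": "<myTag> bla bla <nesting> bla bla </nesting> </myTag>",
    "unclosed": "<myTag> bla bla <unclosed> bla bla </myTag>",
    "bad nesting": "<myTag> bla bla <nesting> bla bla </myTag> </nesting>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "well-formed" if ok else "rejected")
```

Any conforming XML parser is expected to reject the last two samples, since well-formedness errors are fatal in XML.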
Why is XML useful for free software?
- Well defined, strong I18N support, open specification
- Vendor independent, and gaining wide industry acceptance
- Eases reuse of data between projects
- Allows code to be shared
A few examples:
- config file format (GConf, gphoto, ...)
- data format (Gnumeric, abiword, advogato, packages, ...)
- client-server protocols (WebDav, rpmfind, ...)
What is libxml?
Libxml is an XML toolkit:
- developed within the Gnome
(gnome-xml) project, and also as part of
the W3C software
- available under the LGPL and the W3C IPR
- present in the Gnome
and W3C CVS repositories
- Web page is at http://xmlsoft.org/
- Packaged as libxml on most Linux distributions
- Extremely portable (Windows, Unices, embedded targets, Psion, ...)
- Version 1.0 shipped at the beginning of 1999. Version 2.0 was released at
the beginning of April
- Conformance tested against the OASIS test suite
Libxml main interfaces
- The parser: checks well-formedness, performs DTD validation, handles
namespaces, and can operate in pull or push mode
- SAX: callback-based interface (opening, closing, characters, ...)
- DOM tree: builds a full in-memory tree following the DOM interfaces
- HTML: an HTML parser, generating either SAX callbacks or a DOM tree
- Tree: routines to create, modify and save a DOM tree. Being able to save
parsed documents back influenced much of the design of libxml
- URI module
- I/O interface: modular, with basic FTP and HTTP built in
- I18N handlers: UTF-8, UTF-16 and ISO-Latin by default; uses iconv if
found
- XPath: expression language to query XML documents
- Debug module (including a small shell) and memory usage checking
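To make the SAX and push-mode ideas concrete, here is a minimal sketch. It uses Python's standard xml.sax module, not libxml's own C interface, so the class name EventLogger, the chunk boundaries and the sample data are assumptions chosen for illustration. The application feeds the parser pieces of the document as they arrive (push mode), and the parser reports what it finds through callbacks.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record opening, closing and character callbacks in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("open", name))

    def endElement(self, name):
        self.events.append(("close", name))

    def characters(self, content):
        # A parser may deliver one run of text in several pieces,
        # so merge consecutive character callbacks into one event.
        if self.events and self.events[-1][0] == "chars":
            self.events[-1] = ("chars", self.events[-1][1] + content)
        else:
            self.events.append(("chars", content))

handler = EventLogger()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Push mode: the chunks deliberately split the document in mid-text.
chunks = [
    "<chapter><title>The Li",
    "nux adventure</title><p>bla bla",
    " bla ...</p></chapter>",
]
for chunk in chunks:
    parser.feed(chunk)
parser.close()

for event in handler.events:
    print(event)
```

No tree is built here: the handler sees each event once and keeps only what it chooses to, which is why a callback interface suits large documents or data arriving over a network.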
Libxml architecture overview
A basic example
<?xml version="1.0"?>
<example prop1="gnome is great" prop2="&amp; linux too">
<head>
<title>Welcome to Gnome</title>
</head>
<chapter>
<title>The Linux adventure</title>
<p>bla bla bla ...</p>
<image href="linus.gif"/>
<p>...</p>
</chapter>
</example>
The DOM tree
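The shape of the tree built for the basic example can be inspected with a short sketch. It relies on Python's standard xml.dom.minidom, not on libxml's tree module, and the helper name outline is an assumption made for this illustration. Whitespace-only text nodes are skipped to keep the outline readable.

```python
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<example prop1="gnome is great" prop2="&amp; linux too">
  <head>
    <title>Welcome to Gnome</title>
  </head>
  <chapter>
    <title>The Linux adventure</title>
    <p>bla bla bla ...</p>
    <image href="linus.gif"/>
    <p>...</p>
  </chapter>
</example>"""

def outline(node, depth=0, lines=None):
    """Collect an indented outline of the element and text nodes under node."""
    if lines is None:
        lines = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        attrs = " ".join(
            '%s="%s"' % (name, value)
            for name, value in sorted(node.attributes.items())
        )
        suffix = " [" + attrs + "]" if attrs else ""
        lines.append(indent + "ELEMENT " + node.tagName + suffix)
        for child in node.childNodes:
            outline(child, depth + 1, lines)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append(indent + "TEXT " + node.data.strip())
    return lines

doc = minidom.parseString(DOC)
tree_lines = outline(doc.documentElement)
print("\n".join(tree_lines))
```

The root element carries its two attributes, each title and p element owns a text child, and the empty image element has an attribute but no children.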
Internationalization
XML requires support for the following encodings:
- UTF-8: includes the ASCII range and uses a variable-length encoding for
the rest of the full Unicode set
- UTF-16: uses 16-bit code units (and sometimes surrogate pairs)
Libxml v1 was very weak in that area. Libxml2 has native support for both
encodings and for ISO-Latin-x.
If the iconv library is found at compile time, it is used to add support for
a large (and expandable) set of encodings, including the most common
Japanese encodings.
Libxml can save to a specific encoding; if a character cannot be encoded, it
is converted on the fly to a character reference such as &#x1234;
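The character reference fallback can be observed with any serializer that behaves this way. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for libxml; the element name and sample text are assumptions. The text contains one character that ISO-8859-1 can represent (U+00E9) and one that it cannot (U+1234).

```python
import xml.etree.ElementTree as ET

# U+00E9 exists in ISO-8859-1; U+1234 does not.
para = ET.Element("p")
para.text = "caf\u00e9 \u1234"

# Save the same element in two encodings.
latin1 = ET.tostring(para, encoding="iso-8859-1")
utf8 = ET.tostring(para, encoding="utf-8")

print(latin1)
print(utf8)

# Parsing the ISO-8859-1 output restores the original text.
round_trip = ET.fromstring(latin1).text
print(round_trip == para.text)
```

In the ISO-8859-1 output, U+1234 appears as a numeric character reference, while the UTF-8 output stores it directly. Python writes the reference in decimal (&#4660;) rather than hexadecimal; both forms denote the same character.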
On the workbench
Started but not completed:
- XPath: a sample and currently incomplete implementation
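For readers unfamiliar with XPath, the sketch below shows the kind of query it allows. It uses the limited XPath subset supported by Python's standard xml.etree.ElementTree, not libxml's implementation, and the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<example>
  <head><title>Welcome to Gnome</title></head>
  <chapter>
    <title>The Linux adventure</title>
    <p>bla bla bla ...</p>
    <image href="linus.gif"/>
    <p>...</p>
  </chapter>
</example>"""

root = ET.fromstring(DOC)

# Child steps: titles that are direct children of a chapter.
chapter_titles = [t.text for t in root.findall("./chapter/title")]

# Descendant step: every title anywhere below the root.
all_titles = [t.text for t in root.findall(".//title")]

# Attribute predicate: images that carry an href attribute.
images = [i.get("href") for i in root.findall(".//image[@href]")]

# Positional predicate: the second p child of the chapter.
second_p = root.find("./chapter/p[2]").text

print(chapter_titles, all_titles, images, second_p)
```

Each expression selects nodes by their position in the tree, so the caller does not have to walk it by hand.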
Todo:
- XPointer: built on top of XPath, allows addressing any portion of an XML
document
- XLink detection: detect XML hypertext links and implement a simple link
database
- Schemas: a new validation mechanism, obsoleting the DTDs inherited from SGML
Add-ons worked on:
- Gdome: the DOM interface
- Gtkhtml2: new version of the Gnome HTML widget
Existing use
Libxml is already in use in a number of environments:
- rpm2html/rpmfind: generation, maintenance and client for 300 MB of
RDF-encoded Linux software package metadata
- configuration files for programs, metadata (gconf, glade, nautilus)
- data format for Gnome programs (gill, gnumeric, gphoto, ...)
- embedded system (TV desktop set)
- handling of satellite images
Most users don't give feedback, but patches do come in (and are
usually accepted!)
Future
Focus on implementing future XML specifications
Would like a basic rendering widget (gtkhtml2)
Would like volunteers for an XSLT implementation