Re: [xml] Saving without tree


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Tue May 30 2000 - 05:08:11 EDT


On Mon, May 29, 2000 at 11:43:29PM +0000, mdf@angoss.com wrote:
>
> > Of course libxml is both a generator and a parser ... But to be able
> > to generate it *needs* to understand the in-memory storage format.
> > If you decide to not use the libxml tree format, for whatever reason,
> > I don't see how libxml could "magically" understand your encoding of
> > the data. Try to think 2 minutes about this there is no obvious
> > solution for this !
>
> I've actually thought about this so much that I actually *have a working
> generator* that needs no "in-memory storage format", beyond whatever
> tree/list structure that is implicit in the data itself.
>
> Rather than defining rigid, stifling, data structures, you just play
> around with some function and object pointers instead. {ie, rigid,
> stifling, interfaces.} Naturally, this is a lot cleaner in so-called
> "civilized" languages like C++, but it can be done easily [though
> tediously] enough in C as well.

  I'm definitely rigid and uncivilized, i.e. I stick to C - which
is probably why I have a user base at all -, on the other hand I'm
not formally opposed to cleaning things up when necessary, or adding new
interfaces when they ease the reuse of my code !

> Add in some helpers for conversion to and from core data types like
> strings, deals with the standard entity stuff, numbers and you have
> something which can deal with about 90% of the "work" one is likely to
> do with XML. [Well, the work *I've* done at least..]

  I would not call this obvious ... You need at least a structural
model, and a set of per-node calls. I assume it's like DOM: you can
define interfaces without defining the content model strictly.
However people want running code, and at some point you need to select
a data representation. I did, with a set of constraints in mind,
including being able to save back with minimal changes, and
easing the implementation of DOM on top of it (it seems there are 2
available: gdome(2) and the DOM module in PHP4, so at least I didn't
completely miss my target).

> The result is effectively an inverse of a SAX parser.
 
 Can you share it ? Does it support the full content model of an XML
information set or just a restricted version ? In other words can it
output the full document space defined by the productions of the
XML specification, including PIs, internal subset, etc ... ?
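
For readers following the thread, here is a minimal sketch of what an
"inverse SAX" generator could look like in C. It is not the code being
discussed (which is not shown in this message) and it is not part of
libxml; the names xml_emitter, emit_start_element, emit_characters and
emit_end_element are hypothetical. It handles only elements, attributes
and character data, so with respect to the question above it is a
restricted version: no PIs, no comments, no internal subset.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical emitter state: an output stream plus a depth counter. */
typedef struct {
    FILE *out;
    int depth;   /* number of currently open elements */
} xml_emitter;

/* Write a string, escaping the characters that are significant in markup. */
void emit_escaped(xml_emitter *e, const char *s)
{
    for (; *s != '\0'; s++) {
        switch (*s) {
        case '&': fputs("&amp;", e->out);  break;
        case '<': fputs("&lt;", e->out);   break;
        case '>': fputs("&gt;", e->out);   break;
        case '"': fputs("&quot;", e->out); break;
        default:  fputc(*s, e->out);       break;
        }
    }
}

/* Mirror of a SAX startElement event: atts is a NULL-terminated array of
 * name/value pairs, or NULL when there are no attributes. */
void emit_start_element(xml_emitter *e, const char *name, const char **atts)
{
    fprintf(e->out, "<%s", name);
    if (atts != NULL) {
        for (int i = 0; atts[i] != NULL && atts[i + 1] != NULL; i += 2) {
            fprintf(e->out, " %s=\"", atts[i]);
            emit_escaped(e, atts[i + 1]);
            fputc('"', e->out);
        }
    }
    fputc('>', e->out);
    e->depth++;
}

/* Mirror of a SAX characters event. */
void emit_characters(xml_emitter *e, const char *text)
{
    emit_escaped(e, text);
}

/* Mirror of a SAX endElement event; the caller supplies the matching name. */
void emit_end_element(xml_emitter *e, const char *name)
{
    assert(e->depth > 0);   /* an end tag needs an open element */
    fprintf(e->out, "</%s>", name);
    e->depth--;
}
```

The application walks its own objects and calls these functions in
document order, so no intermediate tree is built. The sketch does not
check that element names are valid or that end tags match start tags;
a real generator would have to.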

> > > Failing that, allowing one to use the streams libXML reads from in an
> > > output mode would be nice [thus one gets compression 'for free'].
> >
> > Parse error. I cannot understand this sentence...
>
> Translation: expose a "stream" thingee which has nice, easy to use functions
> a la fread, fwrite(), but can do the compression dance if necessary.
>
> I am aware this is probably well beyond the purview of an XML parser though,
> but I suspect a parser could make excellent use of such a thing. Example:
>
> > May I suggest people actually *look* at what is available before starting
> > suggesting modifying/extending the library ? All the output routines are
> > available in tree.c.
>
> Inside tree.c is a bunch of "buffer" stuff. Now I haven't used it myself,
> but reading around in the code, it looks like you hand someone a complete
> document [in the "in-memory storage format"] and it proceeds to
> generate the XML [optionally compressed?] into another in-memory buffer,
> and this buffer is finally slopped to a file.

  That was version 2.0; the current code allows a different scheme.

> If this understanding is correct, the primary issue is the usual one:
> memory consumption. Namely, one will have the source data in-memory
> [unavoidable [*]], an in-memory document tree, and, the in-memory XML
> equivalent of this tree.

  agreed :-)

> For piddly small documents like web-pages and the like, this is probably
> bearable. But for multi-megabyte monsters, it would be better if the
> memory footprint for a simple "save" operation be nominal to non-existent.
>
> This is probably easiest done when one has a nifty stream gizmo into
> which one just dribbles the XML straight out of the application's
> objects, and this eventually makes it to the disk (or socket connection
> or wherever). Net memory hit is independent of document size, and might
> be a mighty 4k if one is feeling decadent the day the code is written... ;-)

  I'm decadent: I use up to 8K in the case where there is an encoding
conversion done on the fly when saving.
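
A minimal sketch of the fixed-footprint idea discussed above, assuming
a 4K buffer that is handed to a write callback whenever it fills up.
This is an illustration only, not libxml code: sink, sink_init,
sink_put, sink_flush and count_bytes are hypothetical names, and the
callback shape (context pointer, data, length) is an assumption made
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SINK_CAPACITY 4096

/* Hypothetical callback: consume len bytes, return the number of bytes
 * accepted, or a value <= 0 on error. */
typedef int (*sink_write_fn)(void *ctx, const char *data, int len);

typedef struct {
    sink_write_fn write;
    void *ctx;                 /* passed back to the callback untouched */
    size_t used;               /* bytes currently buffered */
    char buf[SINK_CAPACITY];   /* the only storage, whatever the document size */
} sink;

void sink_init(sink *s, sink_write_fn write, void *ctx)
{
    assert(write != NULL);
    s->write = write;
    s->ctx = ctx;
    s->used = 0;
}

/* Hand all buffered bytes to the callback. Returns 0 on success, -1 on error. */
int sink_flush(sink *s)
{
    size_t off = 0;
    while (off < s->used) {
        int n = s->write(s->ctx, s->buf + off, (int)(s->used - off));
        if (n <= 0)
            return -1;
        off += (size_t)n;
    }
    s->used = 0;
    return 0;
}

/* Append len bytes, flushing whenever the buffer is full. */
int sink_put(sink *s, const char *data, size_t len)
{
    while (len > 0) {
        size_t room = SINK_CAPACITY - s->used;
        if (room == 0) {
            if (sink_flush(s) != 0)
                return -1;
            room = SINK_CAPACITY;
        }
        size_t chunk = len < room ? len : room;
        memcpy(s->buf + s->used, data, chunk);
        s->used += chunk;
        data += chunk;
        len -= chunk;
    }
    return 0;
}

/* Example callback: counts the bytes it receives instead of writing them. */
int count_bytes(void *ctx, const char *data, int len)
{
    (void)data;
    *(size_t *)ctx += (size_t)len;
    return len;
}
```

A real callback would write to a file, a socket, or a compressor
instead of counting. The caller must call sink_flush once at the end
to push out the last partial buffer.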

> [*] In the application area I am interested in, there is really a continuous
> stream of source data. So even though the total amount of data is, in
> principle at least, unbounded, the actual data being dealt with at any
> time is O(100 bytes).

  Note that XML in itself does not allow streaming. The specification
asks for a single root.

> > I also strongly suggest that if you're interested in going
> > that deep into the technical details of the library, then you should
> > use the CVS tree to see what is really there and not an ancient version.
>
> I am looking at, and using, libxml2.0.0.

  Ok, the CVS version has more flexibility w.r.t. the saving functions,
especially :
  int xmlSaveFileTo(xmlOutputBuffer *buf, xmlDocPtr cur, const char *encoding);
and
  static void
  xmlNodeDumpOutput(xmlOutputBufferPtr buf, xmlDocPtr doc, xmlNodePtr cur,
              int level, int format, const char *encoding);

  Note this last one is currently static.
Those functions were added to be able to save to a given encoding.
And xmlOutputBufferPtr can be created with:

  xmlOutputBufferPtr
  xmlOutputBufferCreateIO(xmlOutputWriteCallback iowrite,
               xmlOutputCloseCallback ioclose, void *ioctx,
               xmlCharEncodingHandlerPtr encoder);

  I.e. using a callback-based output mechanism ... So a lot of flexibility
has been added in this area recently, hence my suggestion to use the CVS
version. I still need more work before being able to release libxml-2.1,
so if you want to make use of this you should not wait for the "release"
(reminder: there are daily tar.gz snapshots made from the current CVS
version at ftp://rpmfind.net/pub/libxml/ ).

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:13 EDT