Re: [xml] Still validating while using SAX Interface?

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Mon Oct 16 2000 - 09:58:09 EDT


On Mon, Oct 16, 2000 at 12:39:09AM +0200, rolf@pointsman.de wrote:
> Don't get me wrong. DOM is an easy to use way to represent XML data in
> memory, but have his limitations. I have to handle XML files up to
> 100 MByte and more (XML Productcatalogs). It isn't an option for me to
> donate 1 GByte of memory just to be able to read the data (libxml DOM
> trees are big...).

  Don't get me wrong, I understand your problem ...

> It's true, a validating SAX parser may need some variable memory, not
> only to store entities but of course to store the hole structure
> information out of the DTD. But that's typically much smaller memory
> requirements than that for an hole DOM tree.

  Right,

> Please could you be a bit more elaborated about what informations
> stored in the DOM tree are needed for validation?

  Okay, off the top of my head:
    - for each node in context (i.e. whose endElement() has not finished)
      one needs to keep the list of children types (text/blanks/element names)
    - keep the list of ID declarations and references
    - attribute checks can be done directly from the startElement() callback
This model doesn't take entities into account; adding entity support means
keeping their data model in memory too.
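
To make the list above a bit more concrete, here is a rough sketch in
plain C of the state such a layer would carry between callbacks. It is
not libxml code: ValidState, OpenElement and the vs_*() functions are
names made up for illustration, the fixed-size arrays are an arbitrary
simplification, strings are borrowed rather than copied, and the
"content model" check is reduced to comparing against a fixed sequence
instead of consulting a real DTD.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH    64   /* arbitrary limits, for the sketch only */
#define MAX_CHILDREN 32
#define MAX_IDS      128

/* One frame per element whose endElement() has not been seen yet. */
typedef struct {
    const char *name;
    const char *children[MAX_CHILDREN]; /* element names or "#PCDATA" */
    int nchildren;
} OpenElement;

/* Hypothetical validation state; strings are borrowed, not copied. */
typedef struct {
    OpenElement stack[MAX_DEPTH];
    int depth;
    const char *ids[MAX_IDS];    int nids;    /* ID declarations */
    const char *idrefs[MAX_IDS]; int nidrefs; /* ID references   */
    int errors;
} ValidState;

void vs_init(ValidState *vs) { memset(vs, 0, sizeof *vs); }

/* Append a child type to the innermost open element, if any. */
static void vs_record_child(ValidState *vs, const char *type) {
    if (vs->depth == 0) return;
    OpenElement *top = &vs->stack[vs->depth - 1];
    if (top->nchildren < MAX_CHILDREN) top->children[top->nchildren++] = type;
    else vs->errors++;
}

/* startElement(): note the child in the parent, push a frame, and
 * record ID / IDREF values (either may be NULL). */
void vs_start_element(ValidState *vs, const char *name,
                      const char *id, const char *idref) {
    vs_record_child(vs, name);
    assert(vs->depth < MAX_DEPTH);
    OpenElement *e = &vs->stack[vs->depth++];
    e->name = name;
    e->nchildren = 0;
    if (id != NULL) {
        for (int i = 0; i < vs->nids; i++)
            if (strcmp(vs->ids[i], id) == 0) vs->errors++; /* duplicate */
        assert(vs->nids < MAX_IDS);
        vs->ids[vs->nids++] = id;
    }
    if (idref != NULL) {
        assert(vs->nidrefs < MAX_IDS);
        vs->idrefs[vs->nidrefs++] = idref;
    }
}

/* characters(): text counts as a child of the innermost open element. */
void vs_characters(ValidState *vs) { vs_record_child(vs, "#PCDATA"); }

/* endElement(): compare the collected children with an expected
 * sequence (a stand-in for a real content model), then pop the frame.
 * Returns 1 when the children match. */
int vs_end_element(ValidState *vs, const char **expected, int nexpected) {
    assert(vs->depth > 0);
    OpenElement *e = &vs->stack[--vs->depth];
    int ok = (e->nchildren == nexpected);
    for (int i = 0; ok && i < nexpected; i++)
        ok = (strcmp(e->children[i], expected[i]) == 0);
    if (!ok) vs->errors++;
    return ok;
}

/* endDocument(): every IDREF must match a declared ID.
 * Returns the total number of errors seen. */
int vs_end_document(ValidState *vs) {
    for (int i = 0; i < vs->nidrefs; i++) {
        int found = 0;
        for (int j = 0; j < vs->nids; j++)
            if (strcmp(vs->idrefs[i], vs->ids[j]) == 0) { found = 1; break; }
        if (!found) vs->errors++;
    }
    return vs->errors;
}
```

The point is only that the per-element state lives on a stack as deep as
the document nesting, while the ID tables grow with the document.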

Of course there is also the DTD-related information, but I assume
using a DOM to keep that fixed-size portion should not be a problem.
And given the complexity of DTD parsing, I assume you don't want to
rewrite it!

> valid.c). I had to realize, that it isn't a task of only one or two
> hours, to understand all the internal bells and whistles. Therefor I
> decided to ask the "gurus", if it's worth a more serious attempt.

Not two hours. I think implementing a SAX layer doing DTD checking
based on a DTD DOM and the flow of events would take me about 2 full
days. The problem is that:
   1/ It would duplicate existing code
   2/ Making it fully XML-1.0 compliant may take more time (especially
      entity support).
   3/ Though the existing valid.c code could be reused, changing its
      code to accept both structures does not look easy.

A cheaper approach which might be worth looking at is to just change
the existing endElement() callback in SAX.c and remove the node content
after the last validation step. Okay, structures would be constructed
and destroyed on the fly, leading to extra processing, but it would
basically do the same as a pure SAX callback approach. It might look
gross at first glance, but actually it's not that different from
rebuilding minimal information.
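
As a toy model of that idea (again plain C, not the actual SAX.c code;
Node, Builder, on_start_element() and on_end_element() are hypothetical
names, and the validation step is only a counter standing in for the
real check), the end callback validates the element while its children
are still linked and then frees them, leaving the element itself as a
bare stub for its parent:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal tree node; real libxml nodes carry much more. */
typedef struct Node {
    char *name;
    struct Node *parent, *first_child, *last_child, *next;
} Node;

/* Hypothetical builder state shared by the two callbacks. */
typedef struct {
    Node *current;    /* innermost open element */
    long live_nodes;  /* nodes currently allocated */
    long peak_nodes;  /* highest value live_nodes ever reached */
    long validated;   /* number of elements "validated" */
} Builder;

static char *dup_str(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p == NULL) abort();
    memcpy(p, s, n);
    return p;
}

/* Start callback: build the node and link it under the open element. */
void on_start_element(Builder *b, const char *name) {
    Node *n = calloc(1, sizeof *n);
    if (n == NULL) abort();
    n->name = dup_str(name);
    n->parent = b->current;
    if (n->parent != NULL) {
        if (n->parent->last_child != NULL) n->parent->last_child->next = n;
        else n->parent->first_child = n;
        n->parent->last_child = n;
    }
    b->current = n;
    if (++b->live_nodes > b->peak_nodes) b->peak_nodes = b->live_nodes;
}

/* End callback: validate while the children are still there, then
 * drop the node content. The node stays as a stub so the parent can
 * still see its name when its own end callback runs. */
void on_end_element(Builder *b) {
    Node *n = b->current;
    assert(n != NULL);
    b->validated++;                 /* placeholder for the real check */
    for (Node *c = n->first_child; c != NULL; ) {
        Node *next = c->next;
        free(c->name);
        free(c);
        b->live_nodes--;
        c = next;
    }
    n->first_child = n->last_child = NULL;
    b->current = n->parent;
    if (n->parent == NULL) {        /* root: nobody left to inspect it */
        free(n->name);
        free(n);
        b->live_nodes--;
    }
}
```

In this model the memory in use at any time is roughly the open
elements plus the already closed children of each open element, which
is about the same "list of children types" a pure SAX validator would
have to keep anyway.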

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe

