Re: [xml] Loss of whitespace


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Wed Mar 01 2000 - 22:15:30 EST


On Wed, Mar 01, 2000 at 09:24:08PM +0100, Daniel Veillard wrote:
>
> On Wed, Mar 01, 2000 at 01:40:39PM -0600, Paul DuBois wrote:
> >
> > I first reported this on the PHP4 mailing list, but after some more
> > poking around, what I'm observing seems to be happening in the libxml
> > library that PHP4 uses.
> >
> > Here's a sample document:
> >
> > <?xml version='1.0'?>
> > <root>
> > a<x> </x>b<x> </x>c
> > </root>
> >
> >
> > If I run this through "tester", I get:
> >
> > <root>
> > a<x/>b<x/>c
> > </root>
> >
> > Note that the whitespace which forms the contents of the <x> elements
> > has been discarded.
> >
> >
> > How do I defeat this?
>
> With an upgrade, I need to fix this ASAP. I guess it's worth releasing
> an 1.8.7 lib just for this <grin/>

  This proves to be incredibly hard to fix!
If I fix it, the tree produced for a document like
<root>
<x> </x>
</root>
would be:

doc
 |
 -> root
     |
     ->text(\n) -> x -> text(\n)
                   |
                   -> text( )

instead of

doc
 |
 -> root
     |
     -> x
        |
        -> text( )

which is what you expect, and currently libxml generates

doc
 |
 -> root
     |
     -> x
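
For comparison, here is a minimal sketch of the first tree above, built
with Python's standard xml.dom.minidom rather than libxml. It only
illustrates what a parser reports when it passes every white space text
node through; the describe() helper is made up for this example.

```python
from xml.dom import minidom


def describe(node):
    """Return text nodes as plain strings and elements as (name, children)."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return (node.nodeName, [describe(child) for child in node.childNodes])


# Same document as above: a newline before and after <x>, one space inside it.
doc = minidom.parseString("<root>\n<x> </x>\n</root>")
tree = describe(doc.documentElement)
print(tree)
```

The root element ends up with three children (text, x, text), and x
keeps its single-space text node.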

There is no clean way to know a priori whether such white space is
significant, and I'm afraid that, in its current form, the fix would break
most of the apps out there using libxml.

Without a DTD telling me the content type of the element root,
I cannot assume it is just (x)* rather than (#PCDATA | x)*, and hence
I cannot tell whether this white space can safely be ignored.
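
The reasoning can be written down as a small decision function. This is
only a sketch in Python: the content_models table and the
whitespace_is_ignorable() name are assumptions for illustration, standing
in for whatever a DTD would declare.

```python
# Hypothetical content-model labels, as a DTD might declare them.
ELEMENT_ONLY = "element-only"  # e.g. <!ELEMENT root (x)*>
MIXED = "mixed"                # e.g. <!ELEMENT root (#PCDATA | x)*>


def whitespace_is_ignorable(element, content_models):
    """True only when a declaration proves the element has element-only content."""
    return content_models.get(element) == ELEMENT_ONLY


with_dtd = {"root": ELEMENT_ONLY, "x": MIXED}
print(whitespace_is_ignorable("root", with_dtd))  # declared as (x)*
print(whitespace_is_ignorable("x", with_dtd))     # declared as mixed content
print(whitespace_is_ignorable("root", {}))        # no DTD: nothing can be assumed
```

Without a declaration the function has to answer "not ignorable", which
is exactly why a parser without a DTD can only guess.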

Check the related section at:
 http://xml.com/axml/target.html#sec-white-space

and especially the comments Tim Bray added under the (T) links.
No single heuristic will work. libxml currently relies on such a
heuristic, and changing it will probably kill a lot of applications using
libxml. I tried with gnumeric and it didn't like it :-(. PHP might break
seriously too.

The only ways I can think of to handle this are the following:
 1/ provide a flag in the parser context to change the
    behaviour and pass all white space (if we are not validating)
 2/ switch the parser to pass all white space to SAX,
    but in the DOM generation callback, remove all text
    nodes containing only white space

 1/ preserves compatibility with the whole existing set of
    libxml applications and allows a "purist" mode if needed
 2/ is purer but will break most SAX-based libxml apps,
    like libglade I'm afraid, and creating nodes only to remove them
    later sounds unclean ...
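
To make the two options concrete, here is a rough sketch using Python's
standard xml.sax module, not libxml itself. The SAX layer reports every
character, and a tree-building handler decides what to keep. The
keep_blanks flag, the TreeBuilder class and build_tree() are hypothetical
names for this example, and dropping every white-space-only text node is
simpler than the heuristic libxml really applies.

```python
import xml.sax


class TreeBuilder(xml.sax.ContentHandler):
    """Build a nested (name, children) tree from SAX events.

    keep_blanks is a hypothetical stand-in for the proposed context flag:
    True keeps every text node, False drops white-space-only ones.
    """

    def __init__(self, keep_blanks):
        super().__init__()
        self.keep_blanks = keep_blanks
        self.root = None
        self._stack = []
        self._text = []

    def _flush_text(self):
        # characters() may arrive in several chunks, so join them first.
        text = "".join(self._text)
        self._text = []
        if not text or not self._stack:
            return
        if self.keep_blanks or text.strip():
            self._stack[-1][1].append(text)

    def startElement(self, name, attrs):
        self._flush_text()
        node = (name, [])
        if self._stack:
            self._stack[-1][1].append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._flush_text()
        self._stack.pop()

    def characters(self, content):
        self._text.append(content)


def build_tree(document, keep_blanks):
    handler = TreeBuilder(keep_blanks)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.root


DOC = "<root>\n<x> </x>\n</root>"
print(build_tree(DOC, keep_blanks=True))
print(build_tree(DOC, keep_blanks=False))
```

With keep_blanks=True the result carries both newlines and the space
inside x; with keep_blanks=False only the bare x element is left. A
SAX-only application sees every white space chunk in both cases, which is
the compatibility risk noted for option 2.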

  Conclusion: I'm puzzled. It's a really hard issue, and I'm afraid
of breaking a number of apps. On the other hand I *really* want to
be as conformant as possible. I am delaying libxml-1.8.7 until I find
a decent solution.

  Feedback on this issue from libxml users would really be appreciated.

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:06 EDT