Re: [xml] Loss of whitespace


From: Paul DuBois (paul@snake.net)
Date: Wed Mar 01 2000 - 23:25:24 EST


At 4:15 AM +0100 2000-03-02, Daniel Veillard wrote:
>On Wed, Mar 01, 2000 at 09:24:08PM +0100, Daniel Veillard wrote:
>>
>> On Wed, Mar 01, 2000 at 01:40:39PM -0600, Paul DuBois wrote:
>> >
>> > I first reported this on the PHP4 mailing list, but after some more
>> > poking around, what I'm observing seems to be happening in the libxml
>> > library that PHP4 uses.
>> >
>> > Here's a sample document:
>> >
>> > <?xml version='1.0'?>
>> > <root>
>> > a<x> </x>b<x> </x>c
>> > </root>
>> >
>> >
>> > If I run this through "tester", I get:
>> >
>> > <root>
>> > a<x/>b<x/>c
>> > </root>
>> >
>> > Note that the whitespace which forms the contents of the <x> elements
>> > has been discarded.
>> >
>> >
>> > How do I defeat this?
>>
>> With an upgrade, I need to fix this ASAP. I guess it's worth releasing
>> a 1.8.7 lib just for this <grin/>
>
> This proves to be incredibly hard to fix !!!
>If I fix it, basically the tree produced by a document like
><root>
><x> </x>
></root>
>would be:
>
>doc
> |
> -> root
>     |
>     -> text(\n) -> x -> text(\n)
>                    |
>                    -> text( )
>
>instead of
>
>doc
> |
> -> root
>     |
>     -> x
>         |
>         -> text( )
>
>which is what you expect.

Actually, the *first* form is exactly what I would expect, not the second.
I would expect a text node to be created for each occurrence of
text, whitespace included.
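
To make the expected tree concrete, here is a minimal sketch in Python using the standard library's xml.dom.minidom (my choice for illustration; it is not libxml and not the PHP4 interface). The helper name `describe` is hypothetical. It prints one text node per run of character data, whitespace included:

```python
from xml.dom import minidom

# The small document from Daniel's message.
DOC = "<root>\n<x> </x>\n</root>"

def describe(node):
    """Return a nested description of a DOM subtree (hypothetical helper)."""
    if node.nodeType == node.TEXT_NODE:
        return "text(%r)" % node.data
    return (node.nodeName, [describe(child) for child in node.childNodes])

tree = describe(minidom.parseString(DOC).documentElement)
print(tree)
```

With this parser, root has three children (text, x, text) and x has a single text child holding the space, which is the first of the two forms shown above.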

>and currently libxml generates
>
>doc
> |
> -> root
>     |
>     -> x
>
>There is no clean way to know whether such a white space is significant
>a priori, and I'm afraid that in the current form it would break most of the
>apps around using libxml.
>
>Without a DTD telling me what is the content type of the element root
>I cannot assume it's just (x)* and not (CDATA | x), hence whether
>I can safely assume that this can be ignored.
>
>Check the related section at:
> http://xml.com/axml/target.html#sec-white-space
>
>and especially the couple of comments put by Tim Bray under the (T)
>links.

What I see there is this comment:

An XML processor must always pass all characters in a document that
are not markup through to the application.

And the annotation (http://xml.com/axml/notes/AllWSAlways.html) notes this:

------
XML's White Space Policy

XML has an incredibly simple rule about how to handle white space,
that is contained in this one sentence: "If it ain't markup, it's
data."
Under no circumstances will an XML processor discard some white space
because, in the processor's opinion, it is not "significant".
------

and this, which seems especially pertinent to the discussion:

------
This behavior is going to cause some surprises and problems for XML
users and programmers, because we've come to expect (as a result of
working with SGML and HTML) "insignificant" white space to
auto-magically vanish.
------

>No single heuristic will work. And I'm using such a heuristic in libxml
>and changing it will probably kill a lot of applications using libxml,
>I tried with gnumeric and it didn't like it :-(. PHP might break seriously too

Actually, I noticed it because in an application I'm doing, the PHP4
version was acting very strangely compared to the Perl version. It turned
out that the difference was libxml's tossing of whitespace.

The Perl parsers I'm used to dealing with return text no matter where
it occurs. The DOM parser constructs text nodes for all text.
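
As an illustration of "return text no matter where it occurs", here is a small sketch using Python's standard xml.sax module on the sample document. This is a stand-in for the Perl parsers, not their API; the class name `TextCollector` is hypothetical. A SAX parser may deliver one run of text in several chunks, so the handler merges adjacent chunks before recording them:

```python
import xml.sax

SAMPLE = b"<?xml version='1.0'?>\n<root>\na<x> </x>b<x> </x>c\n</root>"

class TextCollector(xml.sax.ContentHandler):
    """Record start/end/text events in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Merge with the previous event if it was also text, since the
        # parser is free to split character data across several calls.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + content)
        else:
            self.events.append(("text", content))

handler = TextCollector()
xml.sax.parseString(SAMPLE, handler)
for event in handler.events:
    print(event)
```

The single space inside each <x> element shows up as its own text event between the start and end events for x.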

>
>The only ways I can think of to handle this are the following:
>  1/ provide a flag in the parser context to change the
>     behaviour to pass all white spaces (if we are not validating)
>  2/ switch the parser to pass all white spaces to SAX,
>     but in the DOM generation callback, remove all text
>     nodes containing only white space
>
>  1/ preserves compatibility with the existing set of
>     libxml applications and allows a "purist" mode if needed
>  2/ is more pure but will break most SAX based libxml apps
>     like libglade I'm afraid, and creating nodes to remove them later
>     sounds unclean ...
>
> Conclusion, I'm puzzled, it's a really hard issue, and I'm afraid
>of breaking a number of apps. On the other hand I *really* want to
>be as conformant as possible. I delayed libxml-1.8.7 until I find
>a decent solution.
>
> Feedback on this issue from libxml users would really be appreciated,
>
>Daniel

I certainly wouldn't advise breaking existing applications, although I
think one could argue that they never should have been written to expect
the current behavior. In any event, I have to admit I was surprised
at having parts of my documents thrown away. :-) This was important
for me because I'm using the documents in question as formatting
templates that specify how to generate HTML output:

- Read data XML record and construct DOM tree
- Read formatting XML specification and construct DOM tree
- Traverse the formatting tree, generating output and substituting
   in data values from the data tree as we go.

For instance, the format spec might have something like this:

<require taglist="fax"><strong>Fax:</strong>
<field name="fax"/></require>

This tells me that if the data document doesn't have a <fax> element,
to skip everything in the <require> tag. If there is a <fax> element,
write "<strong>Fax:</strong>", a newline (because that is text), and
then the <fax> value from the data record.
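
A rough sketch of that traversal, in Python with xml.dom.minidom rather than the actual PHP4 code. Several things here are assumptions made for illustration: the data record is reduced to a plain dict instead of a second DOM tree, taglist is treated as a whitespace-separated list of names, the function name `render` is hypothetical, and the fax number is a made-up sample value.

```python
from xml.dom import minidom

SPEC = '<require taglist="fax"><strong>Fax:</strong>\n<field name="fax"/></require>'

def render(node, record):
    """Generate output for one node of the formatting tree (hypothetical)."""
    if node.nodeType == node.TEXT_NODE:
        # Text is copied through as-is, whitespace-only text included.
        return node.data
    if node.nodeType != node.ELEMENT_NODE:
        return ""
    if node.nodeName == "field":
        # Substitute the value from the data record.
        return record.get(node.getAttribute("name"), "")
    inner = "".join(render(child, record) for child in node.childNodes)
    if node.nodeName == "require":
        # Assumption: taglist holds whitespace-separated element names.
        wanted = node.getAttribute("taglist").split()
        return inner if all(tag in record for tag in wanted) else ""
    # Any other element is written back out as HTML markup.
    return "<%s>%s</%s>" % (node.nodeName, inner, node.nodeName)

spec_root = minidom.parseString(SPEC).documentElement
with_fax = render(spec_root, {"fax": "555-0100"})
without_fax = render(spec_root, {})
print(repr(with_fax))
print(repr(without_fax))
```

The newline between </strong> and <field> survives only because the parser hands it over as a text node.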

When all-whitespace text sequences are tossed, the result is that
"Fax:" and the following fax number have no space between.

It would be sufficient for my purposes to have a flag telling the
parser whether or not to pass all whitespace back. (I'm hoping
that if you do this, the PHP folks will provide an interface to
that flag, of course.)
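
For what it's worth, the kind of switch I have in mind could look something like the following sketch, again in Python with xml.dom.minidom. The `keep_blanks` parameter and both function names are hypothetical; this is not an existing libxml or PHP interface. When the flag is off, whitespace-only text nodes are removed after parsing, which mimics the current behavior; when it is on, the tree is left alone.

```python
from xml.dom import minidom

SAMPLE = "<root>\na<x> </x>b<x> </x>c\n</root>"

def drop_blank_text(node):
    """Remove text nodes that contain only whitespace, recursively."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and child.data.strip() == "":
            node.removeChild(child)
        else:
            drop_blank_text(child)

def parse(text, keep_blanks=True):
    """Parse text; keep_blanks is a hypothetical flag, not a real libxml option."""
    doc = minidom.parseString(text)
    if not keep_blanks:
        drop_blank_text(doc.documentElement)
    return doc

kept = parse(SAMPLE, keep_blanks=True).documentElement.toxml()
dropped = parse(SAMPLE, keep_blanks=False).documentElement.toxml()
print(kept)
print(dropped)
```

With the flag off, the <x> elements come out empty, as in the "tester" output quoted at the top; with it on, the document round-trips unchanged.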

-- 
Paul DuBois, paul@snake.net
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:06 EDT