Re: [xml] Loss of whitespace


From: acassin@cs.mu.OZ.AU
Date: Thu Mar 02 2000 - 00:28:14 EST


On 2 Mar, Daniel Veillard wrote:
>
> On Wed, Mar 01, 2000 at 09:24:08PM +0100, Daniel Veillard wrote:
>>
>> On Wed, Mar 01, 2000 at 01:40:39PM -0600, Paul DuBois wrote:
>> >
>> > I first reported this on the PHP4 mailing list, but after some more
>> > poking around, what I'm observing seems to be happening in the libxml
>> > library that PHP4 uses.
>> >
>> > Here's a sample document:
>> >
>> > <?xml version='1.0'?>
>> > <root>
>> > a<x> </x>b<x> </x>c
>> > </root>
>> >
>> >
>> > If I run this through "tester", I get:
>> >
>> > <root>
>> > a<x/>b<x/>c
>> > </root>
>> >
>> > Note that the whitespace which forms the contents of the <x> elements
>> > has been discarded.
>> >
>> >
>> > How do I defeat this?
>>
>> With an upgrade; I need to fix this ASAP. I guess it's worth releasing
>> a 1.8.7 lib just for this <grin/>
>
> This proves to be incredibly hard to fix !!!
> If I fix it, basically the tree produced by a document like
> <root>
> <x> </x>
> </root>
> would be:
>
> doc
>  |
>  -> root
>      |
>      -> text(\n) -> x -> text(\n)
>                     |
>                     -> text( )
>
> instead of
>
> doc
>  |
>  -> root
>      |
>      -> x
>         |
>         -> text( )
>
> which is what you expect.
> and currently libxml generates
>
> doc
>  |
>  -> root
>      |
>      -> x
>
> There is no clean way to know a priori whether such white space is
> significant, and I'm afraid that in its current form the fix would break
> most of the apps around using libxml.
>
> Without a DTD telling me the content type of the element root, I cannot
> assume it's just (x)* and not (#PCDATA | x)*, hence I cannot tell whether
> the white space can safely be ignored.
>
> Check the related section at:
> http://xml.com/axml/target.html#sec-white-space
>
> and especially the couple of comments put by Tim Bray under the (T)
> links.
> No single heuristic will work. I'm using such a heuristic in libxml,
> and changing it will probably kill a lot of applications using libxml.
> I tried with gnumeric and it didn't like it :-(. PHP might break seriously too

Surely a fix for these applications wouldn't be that difficult? Perhaps
schedule the whitespace corrections for the next major release of libxml,
to give app developers time to switch? I think conformance is the more
important objective here. Cryptographic applications need libxml's
whitespace handling to be precisely defined; otherwise digests or
signatures computed over document content won't match those computed
with another XML parser...
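
To make the digest concern concrete, here is a minimal sketch in Python
(standard library only, not libxml itself). It parses Paul's sample
document twice, drops whitespace-only text nodes from one copy to mimic
the lossy behaviour, and hashes both serializations. The helper names
drop_blank_text and digest, and the choice of SHA-256, are my own
assumptions for illustration.

```python
import hashlib
from xml.dom import minidom

# Paul's sample document (XML declaration omitted for brevity).
SAMPLE = "<root>\na<x> </x>b<x> </x>c\n</root>"

XML_WS = " \t\r\n"  # the four whitespace characters XML 1.0 recognises


def drop_blank_text(node):
    """Remove whitespace-only text nodes (hypothetical helper that
    imitates a parser discarding 'blank' text)."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip(XML_WS):
                node.removeChild(child)
        else:
            drop_blank_text(child)


def digest(doc):
    """SHA-256 over the serialized root element (hypothetical helper)."""
    data = doc.documentElement.toxml().encode("utf-8")
    return hashlib.sha256(data).hexdigest()


faithful = minidom.parseString(SAMPLE)  # keeps every text node
lossy = minidom.parseString(SAMPLE)
drop_blank_text(lossy)                  # <x> </x> collapses to <x/>

print(faithful.documentElement.toxml())
print(lossy.documentElement.toxml())
print("digests match:", digest(faithful) == digest(lossy))
```

The two serializations differ only in the content of the <x> elements,
which is enough for the digests to disagree.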

There is also one other thing to be aware of: the attribute xml:space,
which declares the whitespace processing behaviour intended by the
application. See Section 2.10 of the XML Recommendation.
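
As a rough illustration of how an application-level pass could honour
xml:space, here is a sketch in Python's xml.dom.minidom (again, not
libxml's API). The function strip_ignorable is a hypothetical helper of
mine: it discards whitespace-only text nodes except where
xml:space="preserve" is in scope, and lets xml:space="default" on a
descendant switch the heuristic back on.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four whitespace characters XML 1.0 recognises


def strip_ignorable(node, preserve=False):
    """Drop whitespace-only text nodes unless xml:space="preserve"
    applies to the enclosing element (hypothetical helper)."""
    if node.nodeType == node.ELEMENT_NODE:
        mode = node.getAttribute("xml:space")  # "" when absent
        if mode == "preserve":
            preserve = True
        elif mode == "default":
            preserve = False
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not preserve and not child.data.strip(XML_WS):
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_ignorable(child, preserve)


# Made-up test document: <x> asks for preservation, <y> does not.
DOC = '<root>\n<x xml:space="preserve"> </x>\n<y> </y>\n</root>'

doc = minidom.parseString(DOC)
strip_ignorable(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

With this pass the blank inside <x> survives while the one inside <y>
is dropped, so a document author can opt in to preservation without the
default heuristic changing for everyone else.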

Andrew Cassin
acassin@cs.mu.oz.au

----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:06 EDT