Re: [xml] Bug in parser (HTML)


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Fri Oct 27 2000 - 14:45:10 EDT


On Fri, Oct 27, 2000 at 01:39:27PM -0400, Marc Sanfacon wrote:
> Hi there the following document causes a bug in the resulting parsing:
>
> <html><body>
> <b>bbbbbbbbbb</b> <b>ccccccccccccccc</b>
> </body></html>
>
> The parsing loses the ' ' (space) between bbbbbbbb & cccccccc. Is this the
> normal behavior of libxml? One of our developers found this bug and I
> haven't looked at it yet. So if you tell me this is normal, I won't look.

  Well, we wouldn't like to lose the space between the b's and c's in the
following:
    bbbbbbbbbb ccccccccccccccc

  Well, it's kinda tricky, here is what's happening:

Start of element body: pushed body
SAX.startElement(body)
Start of element body, was html
SAX.ignorableWhitespace(
, 1)
Start of element b: pushed b
SAX.startElement(b)
Start of element b, was body
SAX.characters(bbbbbbbbbb, 10)
Close of b stack: 3 elements
0 : html
1 : body
2 : b
SAX.endElement(b)
End of tag b: popping out b
SAX.ignorableWhitespace( , 1)

  The heuristic concludes it's ignorable white space, but it really
should not. The CR after the opening body should be considered
ignorable, as should the one between the 2 p elements in the
following:

<p>bla</p>
<p>bla</p>

 But b indicates text plus style: it's text, not structure, while p is
just structure. We should not treat as ignorable the white space
occurring between elements that represent stylistic info. <em> and
<bold> are two other examples that come to mind.

  We should add detection of those elements and avoid treating white
space in those contexts as ignorable. I will look at it,
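
  To make the idea concrete, here is a rough sketch of such a check. It
is not the libxml code: the function names, the list of inline elements
and the "name of the element just closed" parameter are all assumptions
made for illustration, and element names are assumed to be lowercased
already.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of elements carrying text plus style (not structure).
 * This is an illustrative guess, not the table used by libxml. */
static const char *const inline_elements[] = {
    "b", "i", "em", "strong", "tt", "u", "font", NULL
};

/* Return 1 if name is in the hypothetical inline list, 0 otherwise. */
static int is_inline_element(const char *name) {
    size_t i;

    if (name == NULL)
        return 0;
    for (i = 0; inline_elements[i] != NULL; i++) {
        if (strcmp(name, inline_elements[i]) == 0)
            return 1;
    }
    return 0;
}

/* Return 1 if the first len bytes of text are all white space. */
static int is_blank(const char *text, size_t len) {
    size_t i;

    for (i = 0; i < len; i++) {
        if (!isspace((unsigned char) text[i]))
            return 0;
    }
    return 1;
}

/* Decide whether a text run should be reported as ignorable white space.
 * last_closed is the name of the element that was just closed, or NULL
 * if the run directly follows a start tag (assumed bookkeeping, kept by
 * the caller). */
int whitespace_is_ignorable(const char *text, size_t len,
                            const char *last_closed) {
    if (!is_blank(text, len))
        return 0;   /* real text is never ignorable */
    if (is_inline_element(last_closed))
        return 0;   /* space after </b>, </em>, ... is significant */
    return 1;       /* blank run between structural elements */
}
```

  With this, the space after the first </b> in Marc's document would be
reported through characters() instead of ignorableWhitespace(), while
the CR after <body> or between two p elements would still be ignorable.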

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


