Re: [xml] Bug in parser (HTML)


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Fri Oct 27 2000 - 17:09:54 EDT


On Fri, Oct 27, 2000 at 12:02:31PM -0700, Wayne Davison wrote:
>
> On Fri, 27 Oct 2000, Daniel Veillard wrote:
> > the heuristic concludes it's an ignorable white space.
>
> I think that the root of the problem is that <B> didn't trigger an implied
> <P> tag. If it had added the missing <P> tag, the space would not have
> been considered to be ignorable.

  Of course that's the first thing I tried :-)

~/XML -> cat tst.html
<html><body>
<p><b>bbbbbbbbbb</b> <b>ccccccccccccccc</b>
</body></html>
~/XML -> ./testHTML tst.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>
<b>bbbbbbbbbb</b>
<b>ccccccccccccccc</b>
</p></body></html>
~/XML ->

The interesting point is that Wayne is right in the sense that
this generates a characters() SAX callback instead of
ignorableWhitespace() ...

SAX.startElement(b)
Start of element b, was p
SAX.characters(bbbbbbbbbb, 10)
Close of b stack: 4 elements
0 : html
1 : body
2 : p
3 : b
SAX.endElement(b)
End of tag b: popping out b
SAX.characters( , 1)

One can consider libxml broken there but I really do think

<html><body>
<p><a href="xxx">bbbbbbbbbb</a> <a href="yyy">ccccccccccccccc</a>
</p></body></html>

is equivalent to

<html><body><p>
<a href="xxx">bbbbbbbbbb</a>
<a href="yyy">ccccccccccccccc</a>
</p></body></html>

But that isn't true for <b>. My understanding is that this is because
<b> is really part of the text at a semantic level; the element is just
an artifact of layering style on top of structure.
So I'm inclined to fix only <b>, <em>, <strong> and the like (did I
forget one?). But if someone wants to fix this more thoroughly I will
take the patch :-)
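
To make the proposed rule concrete, here is a minimal standalone sketch of such a check. It is not libxml's actual code: the element list and the names inline_style_elems, is_inline_style and whitespace_is_ignorable are hypothetical, and treating whitespace before a style element the same way as whitespace after one is an assumption on top of what is described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical list of text-level "style" elements. This is an
 * illustrative guess, not a table taken from libxml. */
static const char *const inline_style_elems[] = {
    "b", "i", "em", "strong", "tt", "u", "big", "small", "font", NULL
};

/* Case-insensitive comparison, since HTML tag names are case-insensitive.
 * Returns 1 when both names match, 0 otherwise. */
static int name_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

/* Returns 1 if name is one of the style elements listed above. */
static int is_inline_style(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; inline_style_elems[i] != NULL; i++) {
        if (name_equal(name, inline_style_elems[i]))
            return 1;
    }
    return 0;
}

/* Decide whether a whitespace-only text run may be reported as ignorable.
 * prev is the name of the element that just closed, next the name of the
 * element about to open; either may be NULL when there is none.
 * Whitespace touching a style element is kept as real character data. */
static int whitespace_is_ignorable(const char *prev, const char *next)
{
    if (prev != NULL && is_inline_style(prev))
        return 0;
    if (next != NULL && is_inline_style(next))
        return 0;
    return 1;
}
```

With this sketch, the space between </b> and <b> in the first example is kept, while the space between </a> and <a> in the second example is still treated as ignorable.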

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Fri Oct 27 2000 - 17:43:29 EDT