Re: [xml] HTMLParser bug...

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Fri Nov 17 2000 - 16:54:57 EST


On Fri, Nov 17, 2000 at 03:54:19PM -0500, Marc Sanfacon wrote:
>
> Oops, sorry...
>
> Just in case it doesn't work, here is the code:
>
> <center>
> <html><head><TITLE>Classifieds</TITLE>
> </head><body>
> <center>
> </center><a name=rsearch"></form></BODY></HTML><!-- END PAGE FOOTER
> --></center>
>
> One of the files contains 5 lines, with no CR at the end. This one is
> causing the bug. The other ones contain 6 lines, with a CR at the end. No bug.

  Okay, I see.

 The enclosed patch tries to clean up the mess introduced by auto-opening
body and head (related to one of the cases you sent earlier this week)
and also the following.
 If you could try it for a little while and report whether it cleans things
up, I would feel better. The problem is that at this point we are starting
to play heuristics at the parser level, and I don't really like this. It has
the potential to clean up a number of problems but may also raise new
ones :-\, so feedback on heavy-duty HTML parsing tasks would be welcome.
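
 For anyone wanting to give that kind of feedback, here is a minimal sketch
of a regression check for the case quoted above: parse the same markup with
and without a trailing newline and compare the tag events. It uses Python's
standard html.parser purely as a stand-in; it is not libxml's HTMLParser and
says nothing about the patch itself. The names EventRecorder and tag_events
are made up for illustration.

```python
from html.parser import HTMLParser

# Test case from the report, copied verbatim. The malformed
# name=rsearch" attribute is part of the case and is left as-is.
SAMPLE = (
    '<center>\n'
    '<html><head><TITLE>Classifieds</TITLE>\n'
    '</head><body>\n'
    '<center>\n'
    '</center><a name=rsearch"></form></BODY></HTML><!-- END PAGE FOOTER\n'
    '--></center>'
)


class EventRecorder(HTMLParser):
    """Hypothetical helper: records tag and comment events, ignores text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))


def tag_events(markup):
    """Parse markup with the stand-in parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()  # flush anything still buffered at end of input
    return parser.events


if __name__ == "__main__":
    without_newline = tag_events(SAMPLE)        # the variant reported as buggy
    with_newline = tag_events(SAMPLE + "\n")    # the variant reported as fine
    print("identical tag events:", without_newline == with_newline)
```

 The same comparison could be run against any parser under test by swapping
out tag_events; a difference between the two event lists would point at
end-of-input handling rather than at the markup itself.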

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/


----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net
