Re: [xml] HTMLParser bug...


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Sat Nov 18 2000 - 02:21:43 EST


On Sat, Nov 18, 2000 at 01:05:07AM +0100, TOM wrote:
>
>
> On 17/11/00 22:54:57, Daniel Veillard wrote:
> > Problem is that at that point we are starting to play heuristics at the
> > parser level and I don't really like this. This has the potential to clean
> > up a number of problems but may also raise new ones :-\, so feedback on
> > heavy-duty HTML parsing tasks would be welcome.
>
> Why not use some HTML Tidy heuristics to clean up the tree (after or
> during parsing)?

  Yes, that's exactly what I suggested to Marc in a separate mail
yesterday :-). That's something I would feel more comfortable with.

> In such cases, I remember that HTML Tidy reorders the tree and does some
> node replacements.
> For example, with the example given by Marc:
> <center> : create the missing or implicit elements : html, head, title,
> body
> <html> : merge attributes of the existing element (created when meeting
> <center>) and the new one (just parsed)
> <head> : same as <html>
> <title> : replace the existing element with the just parsed one
> <body> : same as <html> and <head>
> etc.

  yes
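
  To make the rules quoted above concrete, here is a minimal sketch of
them applied while elements are being parsed. It is only an illustration:
the Node and TidyBuilder names are hypothetical and belong neither to
libxml nor to HTML Tidy, and the choice that just-parsed attributes
override existing ones when merging is an assumption, since the rules
above do not say which side wins.

```python
class Node:
    """Tiny stand-in for a DOM node (hypothetical, not the libxml tree)."""

    def __init__(self, name, attrs=None, implicit=False):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []
        self.implicit = implicit  # True if created by the tidier, not parsed

    def find(self, name):
        """Return the first direct child with the given name, or None."""
        for child in self.children:
            if child.name == name:
                return child
        return None


class TidyBuilder:
    """Applies the quoted rules as start tags arrive (a sketch only)."""

    def __init__(self):
        self.html = None

    def _ensure_skeleton(self):
        # Create the missing or implicit elements: html, head, title, body.
        if self.html is not None:
            return
        self.html = Node("html", implicit=True)
        head = Node("head", implicit=True)
        head.children.append(Node("title", implicit=True))
        self.html.children.append(head)
        self.html.children.append(Node("body", implicit=True))

    def start_element(self, name, attrs=None):
        self._ensure_skeleton()
        head = self.html.find("head")
        body = self.html.find("body")
        if name in ("html", "head", "body"):
            # Merge the just-parsed attributes into the existing element.
            # Assumption: parsed values override on conflict.
            target = {"html": self.html, "head": head, "body": body}[name]
            target.attrs.update(attrs or {})
            target.implicit = False
            return target
        if name == "title":
            # Replace the existing title with the just-parsed one.
            node = Node("title", attrs)
            index = head.children.index(head.find("title"))
            head.children[index] = node
            return node
        # Assumption: any other element is placed in the body.
        node = Node(name, attrs)
        body.children.append(node)
        return node
```

  Feeding it "center" first and then "html", "title" and "body" leaves a
single html element with one head and one body, with the center element
inside the body. Nesting and end tags are deliberately left out.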

> The problem here is that we have to play with invalid documents. The
> tidying phase should probably be in a different "module" of libxml, with
> its own default callbacks and some functions to clean up an existing
> tree.

  Agreed, that's something we discussed on the list.
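
  As a rough idea of what "functions to clean up an existing tree" could
look like in such a separate module, the sketch below normalizes an
already built tree in one pass after parsing. It uses Python's standard
xml.etree.ElementTree purely as a stand-in for the libxml tree; tidy_tree,
_place_in_head and the HEAD_ONLY set are hypothetical names, and the list
of head-only elements is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: elements that belong in <head> when found elsewhere.
HEAD_ONLY = {"title", "meta", "link", "base", "style", "script"}


def _place_in_head(head, node):
    """Append node to head; a new <title> replaces any existing one."""
    if node.tag == "title":
        old = head.find("title")
        if old is not None:
            head.remove(old)
    head.append(node)


def tidy_tree(root):
    """Return a new <html> element with exactly one head and one body.

    Simplification: text placed directly inside <html>, <head> or <body>
    (outside any child element) is not carried over.
    """
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    body = ET.SubElement(html, "body")

    if root.tag == "html":
        html.attrib.update(root.attrib)
        pending = list(root)
    else:
        # A bare fragment such as <center> becomes body content.
        pending = [root]

    for node in pending:
        if node.tag == "head":
            # Merge attributes and children of every <head> encountered.
            head.attrib.update(node.attrib)
            for child in node:
                _place_in_head(head, child)
        elif node.tag == "body":
            body.attrib.update(node.attrib)
            body.extend(node)
        elif node.tag in HEAD_ONLY:
            _place_in_head(head, node)
        else:
            body.append(node)
    return html
```

  Keeping this as a pass over a finished tree means the parser itself
does not need any heuristics; the cleanup can be called, or not, by the
application.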

> In fact, something like a rewrite of HTML Tidy using the libxml SAX
> interface and DOM tree rather than a "proprietary" one.
> I planned to do something like this (turning HTML Tidy into a library,
> which it currently is not, and thus making it easier to integrate Tidy
> into other software) but don't have enough time to do it :o( Though if
> someone wants to lead the project, I'll follow him/her and try to help
> when I find time.

  Seems to me that Marc and his team are the people who would have the
most interest in such a module so far (and maybe the most expertise).

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Sat Nov 18 2000 - 02:45:52 EST