From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Sat Nov 18 2000 - 02:21:43 EST
On Sat, Nov 18, 2000 at 01:05:07AM +0100, TOM wrote:
>
>
> On 17/11/00 22:54:57, Daniel Veillard wrote:
> > The problem is that at that point we are starting to play heuristics at the
> > parser level, and I don't really like this. This has the potential to clean
> > up a number of problems but may also raise new ones :-\, so feedback on
> > heavy-duty HTML parsing tasks would be welcome.
>
> Why not use some HTML Tidy heuristics to clean up the tree (after or
> during parsing)?
Yes, that's exactly what I suggested to Marc in a separate mail
yesterday :-). That's something I would feel more comfortable with.
> In such cases, as I remember, HTML Tidy reorders the tree and does some
> node replacements.
> For example, with the example given by Marc:
> <center> : create the missing or implicit elements: html, head, title,
>   body
> <html> : merge the attributes of the existing element (created when
>   <center> was encountered) with those of the new one (just parsed)
> <head> : same as <html>
> <title> : replace the existing element with the one just parsed
> <body> : same as <html> and <head>
> etc.
yes
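
To make the rules quoted above concrete, here is a minimal sketch of the
create/merge/replace behaviour. It is only an illustration under stated
assumptions: Element and TidyTree are hypothetical names, not part of libxml
or HTML Tidy, the tree is a toy structure rather than the libxml DOM, and the
choice that freshly parsed attributes win on conflict is a guess, since the
thread does not say which side should win.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Element:
    """Toy tree node (hypothetical; stands in for a real DOM node)."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Element"] = field(default_factory=list)
    implied: bool = False  # True when created implicitly, not seen in the input

    def child(self, name: str) -> Optional["Element"]:
        """Return the first direct child with the given name, if any."""
        for c in self.children:
            if c.name == name:
                return c
        return None


class TidyTree:
    """Applies the per-tag rules from the quoted list to start tags."""

    def __init__(self) -> None:
        self.root: Optional[Element] = None

    def _ensure_skeleton(self) -> None:
        # Rule for <center> (or any first tag): create html, head, title, body.
        if self.root is None:
            title = Element("title", implied=True)
            head = Element("head", children=[title], implied=True)
            body = Element("body", implied=True)
            self.root = Element("html", children=[head, body], implied=True)

    def start(self, name: str, attrs: Optional[Dict[str, str]] = None) -> Element:
        attrs = dict(attrs or {})
        self._ensure_skeleton()
        html = self.root
        head, body = html.child("head"), html.child("body")

        if name in ("html", "head", "body"):
            # Merge rule: fold the parsed attributes into the existing element.
            # Assumption: parsed attributes override existing ones on conflict.
            target = html if name == "html" else html.child(name)
            target.attrs.update(attrs)
            target.implied = False
            return target

        if name == "title":
            # Replace rule: swap the existing title for the one just parsed.
            new = Element("title", attrs)
            idx = next(i for i, c in enumerate(head.children) if c.name == "title")
            head.children[idx] = new
            return new

        # Anything else is simply appended to body in this sketch.
        new = Element(name, attrs)
        body.children.append(new)
        return new
```

Feeding it the tag sequence center, html, head, title, body leaves a single
html element whose attributes come from the late <html> tag, with <center>
still sitting under body. A real implementation would also have to track the
open-element stack and decide where misplaced content goes, which this sketch
ignores.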
> The problem here is that we have to play with invalid documents. The
> tidying phase should probably be in a different "module" of libxml, with
> its own default callbacks and some functions to clean up an existing
> tree.
Agreed, that's something we discussed on the list.
> In fact, something like a rewrite of HTML Tidy using the libxml SAX
> interface and DOM tree rather than a "proprietary" one.
> I planned to do something like this (turning HTML Tidy into a library,
> which it currently is not, and thus making it easier to integrate Tidy
> into other software) but don't have enough time to do it :o( Though if
> someone wants to lead the project, I'll follow him/her and try to help
> when I find time.
Seems to me that Marc and his team are the people who would have the
most interest in such a module so far (and maybe the most expertise).
Daniel
--
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/