Re: [xml] HTMLParser bug...


From: TOM (ptittom@free.fr)
Date: Fri Nov 17 2000 - 19:05:07 EST


On 17/11/00 22:54:57, Daniel Veillard wrote:
> Problem is that at that point we are starting to play heuristics at the
> parser level and I don't really like this. this has the potential to clean
> up a number of problem but may also raise new ones :-\, so feedback on heavy
> duty HTML parsing tasks would be welcome.

Why not use some of HTML Tidy's heuristics to clean up the tree (after
or during parsing)?
In such cases, as far as I remember, HTML Tidy reorders the tree and
does some node replacements.
For example, with the example given by Marc:
<center> : create the missing or implicit elements: html, head, title,
body
<html> : merge the attributes of the existing element (created when
<center> was met) with those of the new one (just parsed)
<head> : same as <html>
<title> : replace the existing element with the one just parsed
<body> : same as <html> and <head>
etc.
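
The steps listed above can be sketched roughly as follows. This is only
an illustration in Python on a toy tree, not libxml or HTML Tidy code:
the Node class, new_skeleton(), merge_attrs() and tidy() are all
hypothetical names, only start tags are handled, and the rule that an
attribute already present wins during a merge is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy element node (hypothetical, not the libxml tree structure)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    implicit: bool = False  # True while no real tag has been seen for it

    def child(self, name):
        return next((c for c in self.children if c.name == name), None)


def new_skeleton():
    """Create the missing or implicit elements: html, head, title, body."""
    html = Node("html", implicit=True)
    head = Node("head", implicit=True)
    head.children.append(Node("title", implicit=True))
    html.children += [head, Node("body", implicit=True)]
    return html


def merge_attrs(existing, parsed_attrs):
    """Merge a just-parsed tag into the element that already exists.

    Assumption: an attribute already set on the element is kept.
    """
    for key, value in parsed_attrs.items():
        existing.attrs.setdefault(key, value)
    existing.implicit = False


def tidy(start_tags):
    """Build a cleaned-up tree from a sequence of (name, attrs) start tags."""
    root = None
    for name, attrs in start_tags:
        if root is None:
            root = new_skeleton()  # first tag seen, whatever it is
        head, body = root.child("head"), root.child("body")
        if name == "html":
            merge_attrs(root, attrs)
        elif name == "head":
            merge_attrs(head, attrs)
        elif name == "body":
            merge_attrs(body, attrs)
        elif name == "title":
            # Replace the existing title with the one just parsed.
            idx = next(i for i, c in enumerate(head.children)
                       if c.name == "title")
            head.children[idx] = Node("title", dict(attrs))
        else:
            # Simplification: every other element goes under body.
            body.children.append(Node(name, dict(attrs)))
    return root
```

Feeding it a misplaced <center> first, e.g.
tidy([("center", {}), ("html", {"lang": "en"}), ("body", {})]), gives a
single html element carrying lang="en", with center moved under body.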

The problem here is that we have to deal with invalid documents. The
tidying phase should probably live in a separate "module" of libxml,
with its own default callbacks and some functions to clean up an
existing tree.
In fact, it would be something like a rewrite of HTML Tidy using the
libxml SAX interface and DOM tree rather than a "proprietary" one.
I planned to do something like this (turning HTML Tidy into a library,
which it currently is not, and thus making it easier to integrate Tidy
into other software) but I don't have enough time to do it :o( Though if
someone wants to lead the project, I'll follow him/her and try to help
when I find time.

Tom.

----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


This archive was generated by hypermail 2b29 : Fri Nov 17 2000 - 19:43:32 EST