Re: [xml] HTMLParser from libxml concerns.

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Fri Oct 20 2000 - 12:15:15 EDT


On Fri, Oct 20, 2000 at 08:57:57AM -0400, Marc Sanfacon wrote:
> Hi there,
> as I said earlier, what we mainly use from libxml, for now, is the
> HTMLParser. And as everybody knows, the HTML pages available on the Web and
> everywhere else are far from well formatted. By that I mean that the HTML
> found in those pages is rarely valid.

  Unfortunately true !

> The HTMLParser from libxml does a certain job of fixing them, but a
> lot of pages don't get fixed in the process. I would like to know if anyone
> has the same problem as me.

  Well, I do minimalist fixes, only for the errors which would really break
things. My point of view is to import the pages "as is" while retaining their
original structure. Then some serious cleaning up can be achieved.

> I am willing to modify the HTMLParser so that it uses the same rules
> as 'HTMLTidy' (http://www.w3.org/People/Raggett/tidy/) to fix an HTML page.
> I do not want to create a second HTMLTidy, so I won't put all features in
> it, but the rules used to fix the HTML.
>
> So now I want to know:
>
> * Daniel, is it a good idea to put these rules in the HTMLParser ?

   Actually I would not put them in the parser per se. I would rather
make a tool which, based on the generated document tree, would apply
a set of rules to tidy it. From that point I agree with you.
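
   To make the "rules applied to the generated tree" idea concrete, here is
a minimal sketch in Python. Everything in it is an assumption made for
illustration: the Node class is a hypothetical stand-in for a parsed element
(it is not libxml's node type), and the two rules are invented examples,
not rules taken from Tidy.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element (not libxml's node type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


# A rule is any function that repairs one node in place.
Rule = Callable[[Node], None]


def lowercase_tag(node: Node) -> None:
    # Illustrative rule: normalise element names to lower case.
    node.tag = node.tag.lower()


def drop_empty_inline(node: Node) -> None:
    # Illustrative rule: remove child <b>/<i>/<font> elements that have
    # neither children nor non-blank text.
    inline = {"b", "i", "font"}
    node.children = [
        c for c in node.children
        if not (c.tag.lower() in inline and not c.children and not c.text.strip())
    ]


def tidy(node: Node, rules: List[Rule]) -> None:
    # Apply every rule to this node, then walk down into what is left.
    for rule in rules:
        rule(node)
    for child in node.children:
        tidy(child, rules)


def serialize(node: Node) -> str:
    # Write the tree back out; text is emitted before the children.
    inner = node.text + "".join(serialize(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


tree = Node("BODY", [Node("P", [Node("B"), Node("I", text="hi")])])
tidy(tree, [lowercase_tag, drop_empty_inline])
print(serialize(tree))
```

   Keeping the rules as plain functions outside the parser means the parser
can keep importing the page "as is", and rules can be added or dropped
without touching it.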

> * Is there anyone else already doing this job ?

   Not me :-)

> * Is there another pre-parser that fixes a HTML file ?

   Tidy is the only one I know, at least as a standalone tool.

   Another point of consideration is that, using a tree generated by
the HTML parser, it should not be too hard for the tidying module to
parse the XHTML DTD and then check, after the tidying rules are applied,
whether any problems are left. A tighter integration with the validation
module could help build automatic recovery, because a set of rules
will never really replace a real validation. That would be the significant
improvement over Tidy which would IMHO make the work very valuable.
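
   As a rough sketch of the "check what is left once the rules have run"
step, the Python below walks a tree and reports every child that its parent
does not allow. The ALLOWED_CHILDREN table is a hypothetical, heavily
simplified content model written by hand for this example; it is not the
XHTML DTD, and a real module would derive the model from the parsed DTD.
Trees are plain (tag, children) tuples to keep the sketch self-contained.

```python
# Hypothetical, heavily simplified content model (NOT the real XHTML DTD).
ALLOWED_CHILDREN = {
    "html": {"head", "body"},
    "head": set(),
    "body": {"p", "ul"},
    "ul": {"li"},
    "li": {"p", "ul"},
    "p": set(),
}


def violations(tree, path=""):
    """Return a list of content-model problems found in a (tag, children) tree."""
    tag, children = tree
    here = f"{path}/{tag}"
    problems = []
    allowed = ALLOWED_CHILDREN.get(tag)
    if allowed is None:
        # Element is not in the model at all; report it once and keep walking.
        problems.append(f"{here}: unknown element")
    for child in children:
        if allowed is not None and child[0] not in allowed:
            problems.append(f"{here}: <{child[0]}> not allowed here")
        problems.extend(violations(child, here))
    return problems


# A stray <li> directly under <body> is the kind of problem a rule set may miss.
doc = ("html", [("body", [("li", []), ("ul", [("li", [])])])])
for problem in violations(doc):
    print(problem)
```

   Each reported location is a candidate for automatic recovery, for
instance by wrapping the stray element in a container the model accepts.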

   I suggest trying to develop it as a separate rule- and DTD-based
cleaning module. Gathering enough experience to make it generic would
open the prospect of making a recovery tool, which I'm pretty sure a lot
of people would enjoy, even in the XML community.

   So yes, good idea, and interesting future work, more than just embedding
Tidy (a great tool too)!

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


This archive was generated by hypermail 2b29 : Fri Oct 20 2000 - 12:43:38 EDT