[xml] HTMLParser from libxml concerns.

From: Marc Sanfacon (sanm@copernic.com)
Date: Fri Oct 20 2000 - 08:57:57 EDT

Hi there,
as I said earlier, what we mainly use from libxml, for now, is the
HTMLParser. And as everybody knows, the HTML pages available on the Web and
everywhere else are far from well formatted. By that I mean that the HTML
found in those pages is rarely valid.

The HTMLParser from libxml does some of the work of fixing them, but a
lot of pages do not get fixed in the process. I would like to know if anyone
has the same problem as I do.

I am willing to modify the HTMLParser so that it uses the same rules
as 'HTMLTidy' (http://www.w3.org/People/Raggett/tidy/) to fix an HTML page.
I do not want to create a second HTMLTidy, so I won't put all of its
features in, only the rules used to fix the HTML.
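
To make "rules used to fix the HTML" concrete, here is a minimal sketch of one
such repair rule: adding end tags that the page left out and dropping stray
ones. It is only an illustration under stated assumptions. It is written in
Python with the standard html.parser module rather than libxml, it is not
taken from HTMLTidy or from the libxml HTMLParser, and the names TagBalancer,
balance and VOID are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emits markup with missing end tags added."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the repaired document
        self.stack = []  # names of elements that are still open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it through, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching start tag: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def balance(html):
    """Return html with unclosed elements closed and stray end tags removed."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    # Anything still open at end of input gets closed, innermost first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    print(balance("<ul><li>one</ul>"))  # prints <ul><li>one</li></ul>
```

A real repair pass needs many more rules than this one (implied end tags for
p and li, misplaced table content, and so on), which is the part I would like
to borrow from HTMLTidy.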

So now I want to know:

* Daniel, is it a good idea to put these rules in the HTMLParser?
* Is anyone else already doing this job?
* Is there another pre-parser that fixes an HTML file?

Regards,
Marc.

---------------------------------------------------------------------
"Better the pride that resides, in a citizen of the world.
Than the pride that divides, when a colorful rag is
unfurled." Neil Peart
---------------------------------------------------------------------
Marc Sanfacon, Software developer Copernic.com
e-mail: sanm@copernic.com R&D Group
Tel : (418) 527-0528 ext 1212

----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


This archive was generated by hypermail 2b29 : Fri Oct 20 2000 - 09:43:40 EDT