RE: [xml] HTMLParser bug...

From: Marc Sanfacon (sanm@copernic.com)
Date: Mon Nov 20 2000 - 13:34:46 EST


Hi there...
        we would be interested in doing such a thing. One problem is that we
have rewritten a SAX interface that generates a C++ DOM of the parsed tree.
One of the problems is that the SAX interface is sometimes called with
invalid nodes. For example, it is called with:

        endElement(html) and the current node is a body.

        Fixing the 'HTMLParser' would solve our problem at the same time.
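
To make the mismatch concrete, here is a minimal sketch in Python of a
SAX-style handler that keeps a stack of open elements and records any end
event whose name differs from the current element. The class and method
names are hypothetical and are not part of libxml or of our C++ code; the
recovery step (closing everything down to the named element) is only one
possible policy, assumed here for illustration.

```python
class StackCheckingHandler:
    """Tracks open elements and records end events that do not match."""

    def __init__(self):
        self.open_elements = []  # stack of currently open element names
        self.mismatches = []     # (name being closed, name that was current)

    def start_element(self, name):
        self.open_elements.append(name)

    def end_element(self, name):
        current = self.open_elements[-1] if self.open_elements else None
        if current == name:
            self.open_elements.pop()
            return
        # The case from the mail: endElement(html) while current is body.
        self.mismatches.append((name, current))
        # Assumed recovery policy: if the named element is open somewhere
        # below, close everything above it and then the element itself.
        # Otherwise the stray end event is ignored.
        if name in self.open_elements:
            while self.open_elements[-1] != name:
                self.open_elements.pop()
            self.open_elements.pop()


# Replay the event sequence described above.
handler = StackCheckingHandler()
for kind, name in [("start", "html"), ("start", "body"), ("end", "html")]:
    getattr(handler, kind + "_element")(name)

print(handler.mismatches)  # [('html', 'body')]
```

A check like this only detects the problem on the receiving side; it does
not change what the parser emits.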

        What Daniel suggested to me is to go through the tree generated by
libxml and correct it. The problem with that approach, as said earlier,
is that people using the SAX interface will not benefit from the
'tidying'.

        But since this kind of functionality would really help us, we are
looking for ways to solve our problem.

        If you have other ideas that would fix the SAX interface at the
same time, let me know.

I will let you know what we decide.

Regards,
        Marc.

> On 17/11/00 22:54:57, Daniel Veillard wrote:
> > Problem is that at that point we are starting to play heuristics at the
> > parser level and I don't really like this. This has the potential to
> > clean up a number of problems but may also raise new ones :-\, so
> > feedback on heavy duty HTML parsing tasks would be welcome.
>
> Why not use some HTML Tidy heuristics to clean up the tree (after or
> during parsing)?

  Yes, that's exactly what I suggested to Marc in a separate mail
yesterday :-). That's something I would feel more comfortable with.

> In such cases, I remember HTML Tidy reordering the tree and doing some
> node replacements.
> For example, with the example given by Marc:
> <center> : create the missing or implicit elements: html, head, title,
> body
> <html> : merge attributes of the existing element (created when meeting
> <center>) and the new one (just parsed)
> <head> : same as <html>
> <title> : replace the existing element with the just parsed one
> <body> : same as <html> and <head>
> etc.

  yes
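
As a rough illustration of the steps listed above, here is a minimal
sketch in Python. It is not HTML Tidy's code and does not use the libxml
API: `Node` and `TidyingBuilder` are hypothetical names, the tree is kept
flat (every other element is simply appended to body), and letting newly
parsed attributes win on conflict is an assumption, since the description
above only says "merge".

```python
class Node:
    def __init__(self, name, attrs=None, implicit=False):
        self.name = name
        self.attrs = dict(attrs or {})
        self.implicit = implicit  # True if created by the builder, not parsed
        self.children = []

    def child(self, name):
        """Return the first direct child with this name, or None."""
        for c in self.children:
            if c.name == name:
                return c
        return None


class TidyingBuilder:
    def __init__(self):
        self.html = None

    def _ensure_skeleton(self):
        # Create the missing or implicit elements: html, head, title, body.
        if self.html is None:
            self.html = Node("html", implicit=True)
            head = Node("head", implicit=True)
            head.children.append(Node("title", implicit=True))
            self.html.children += [head, Node("body", implicit=True)]

    def start_element(self, name, attrs=None):
        self._ensure_skeleton()
        if name in ("html", "head", "body"):
            # Merge attributes into the element that already exists.
            target = self.html if name == "html" else self.html.child(name)
            target.attrs.update(attrs or {})  # assumption: parsed values win
            target.implicit = False
            return target
        if name == "title":
            # Replace the existing title with the one just parsed.
            head = self.html.child("head")
            new = Node("title", attrs)
            head.children = [new if c.name == "title" else c
                             for c in head.children]
            return new
        # Simplification: any other element goes directly under body.
        node = Node(name, attrs)
        self.html.child("body").children.append(node)
        return node


# Marc's case: <center> is seen before <html>, <title> and <body>.
builder = TidyingBuilder()
builder.start_element("center")
builder.start_element("html", {"lang": "en"})
builder.start_element("title")
builder.start_element("body", {"bgcolor": "white"})
```

After these four events the tree has a single html element carrying the
parsed attributes, with the center element under body, instead of a
second html or body being nested inside the first.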

> The problem here is that we have to play with invalid documents. The
> tidying phase should probably be in a different "module" of libxml, with
> its own default callbacks and some functions to clean up an existing
> tree.

  agreed, that's something we discussed on the list.

> In fact something like a rewrite of HTML Tidy using the libxml SAX
> interface and DOM tree rather than a "proprietary" one.
> I planned to do something like this (making HTML Tidy a library, which
> it currently is not, and thus making integration of Tidy into other
> software easier) but don't have enough time to do it :o( Though if
> someone wants to lead the project, I'll follow him/her and try to help
> when I find time.

  Seems to me that Marc and his team are the people who would have the
most interest in such a module so far (and maybe the most expertise).

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Mon Nov 20 2000 - 13:44:52 EST