Re: [xml] Question about libxml...

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Wed Nov 15 2000 - 17:39:12 EST


On Wed, Nov 15, 2000 at 02:51:18PM -0500, Marc Sanfacon wrote:
> Hi there,
> we have found a problem in the HTML parser. Here is my HTML code:
>
> <SCRIPT LANGUAGE="JavaScript">
> <!--
> var cobrand_directory = "";
> //-->
> </SCRIPT>
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> <HTML>
> <HEAD>
> <TITLE>Title</TITLE>
> </HEAD>
>
> <BODY>
> This is a test
> </BODY>
> </HTML>

Yeah, a nice piece of horrible HTML-like crap, not even single rooted!!!
Look closely: there is not a single root element for this document!
Whoever produced the code generating this kind of horrible stuff should
get some really nasty publicity ...

> libxml (2.2.7) outputs the following:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> <html><head>
> <script language="JavaScript">
> <!--
> var cobrand_directory = "";
> //-->
> </script>
> <html>
> <head><title>Title</title></head>
> <body><p>
> This is a test
> </p></body>
> </html>
> </html>
>
> As you can see, the results contain 2 html tags, 2 head tags, 2 ending html
> tags and only 1 ending head tag.
> I have pinpointed where this comes from (htmlCheckImplied), but haven't
> found where to fix it yet.

  What's happening:
    - libxml sees the script without context, so it auto-adds
      html and head parents. I think this is normal and should
      not be changed.
    - when the script element is closed, html and head are
      still open.

 At that point libxml, seeing the explicit opening html and head tags,
just opens them again.

 The best thing to do is to not open html when we are already in html,
nor head when in head, nor body when in body.
 However, this diverges slightly from the initial goal of parsing the input
and giving it back to the upper layer without too many changes, and starts
the kind of cleanup that Tidy does.
 Finding the limit is a bit difficult; in this case I would still add
the heuristics suggested before.
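
 To make the idea concrete, here is a minimal sketch of the "do not open
html when in html, nor head when in head, nor body when in body" rule.
It assumes a plain stack of open element names. OpenStack, start_tag and
end_tag are hypothetical names used only for illustration; this is not
libxml's actual code or API.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64

/* Hypothetical stand-in for the parser's stack of currently open elements. */
typedef struct {
    const char *names[MAX_DEPTH];
    int depth;
} OpenStack;

/* Return 1 if an element called `name` is currently open, 0 otherwise. */
int is_open(const OpenStack *st, const char *name) {
    for (int i = 0; i < st->depth; i++) {
        if (strcmp(st->names[i], name) == 0)
            return 1;
    }
    return 0;
}

/* Return 1 for the structural elements the heuristic applies to. */
int is_structural(const char *name) {
    return strcmp(name, "html") == 0 ||
           strcmp(name, "head") == 0 ||
           strcmp(name, "body") == 0;
}

/*
 * Handle a start tag. A second html, head or body is ignored while one
 * is already open. Returns 1 if the element was opened, 0 if ignored.
 */
int start_tag(OpenStack *st, const char *name) {
    if (is_structural(name) && is_open(st, name))
        return 0;
    assert(st->depth < MAX_DEPTH);
    st->names[st->depth++] = name;
    return 1;
}

/*
 * Handle an end tag by closing the innermost open element with that name,
 * together with anything still open inside it. Returns 1 if something was
 * closed, 0 if no such element was open (the end tag is then ignored).
 */
int end_tag(OpenStack *st, const char *name) {
    for (int i = st->depth - 1; i >= 0; i--) {
        if (strcmp(st->names[i], name) == 0) {
            st->depth = i;
            return 1;
        }
    }
    return 0;
}
```

 With the document above, the implied html and head are pushed before
script, so the explicit html and head start tags that follow are dropped
instead of being opened a second time, and the second ending html tag
finds nothing left to close.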

  But damnit, how broken the HTML found on the web is !

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | libxml Gnome XML toolkit
Tel : +33 476 615 257  | 655, avenue de l'Europe | http://xmlsoft.org/
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Rpmfind search site
 http://www.w3.org/People/all#veillard%40w3.org  | http://rpmfind.net/

