Re: [xml] Question about libxml...


From: Wayne Davison (wayned@users.sourceforge.net)
Date: Wed Nov 15 2000 - 16:17:30 EST


On Wed, 15 Nov 2000, Marc Sanfacon wrote:
> I think there should be only 1 html and 1 head tag with the proper ending
> tag.

It might be possible in the internal DOM representation to merge multiple
HTML/HEAD/BODY sections together. The SAX parser wouldn't be able to
reorder its events, though.

Interestingly, I had just seen a problem similar to your example, where
someone had put tags that belong in the body before the official <HTML>
start tag (before the HEAD section and everything). I think the way to
fix this for the SAX handler is to close the current (perhaps implied)
tags and start new ones. My suggested fix is attached to this email.

After my patch, your HTML example would end up looking like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><script language="JavaScript">
<!--
var cobrand_directory = "";
//-->
</script></head></html>
<html>
<head><title>Title</title></head>
<body><p>
This is a test
</p></body>
</html>

While it would be fairly easy in this example to merge the two head
sections together, in the bogus web page I saw, the parser was already
inside an implied BODY section and needed to revert to a HEAD section.
Something like this:

<p>this is a test!
<html>
<head>
<title>hi</title>
</head>
<body>
Wow!

After my patch, the SAX events open and close the HTML and BODY elements
around the misplaced paragraph:

SAX.setDocumentLocator()
SAX.startDocument()
SAX.startElement(html)
SAX.startElement(body)
SAX.startElement(p)
SAX.characters(this is a test!
, 16)
SAX.endElement(p)
SAX.endElement(body)
SAX.endElement(html)
SAX.startElement(html)
SAX.ignorableWhitespace(
, 1)
SAX.startElement(head)
SAX.ignorableWhitespace(
, 1)
SAX.startElement(title)
SAX.characters(hi, 2)
SAX.endElement(title)
SAX.ignorableWhitespace(
, 1)
SAX.endElement(head)
SAX.ignorableWhitespace(
, 1)
SAX.startElement(body)
SAX.startElement(p)
SAX.characters(
Wow!
, 6)
SAX.endElement(p)
SAX.endElement(body)
SAX.endElement(html)
SAX.endDocument()
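
To make the close-and-restart idea behind that trace concrete, here is a
rough sketch in Python. It is not the patch and not libxml code: the
sax_events() function, the token format and the tiny HEAD_ONLY set are
all made up for illustration, and whitespace events are left out.

```python
HEAD_ONLY = {"title", "script"}  # assumption: tiny stand-in for the real HTML rules


def sax_events(tokens):
    """Map (kind, value) tokens to SAX-style event names.

    kind is "start", "end" or "text". Hypothetical model, not libxml code.
    """
    events = []
    stack = []  # open elements, outermost first

    def start(name):
        stack.append(name)
        events.append(f"startElement({name})")

    def close_down_to(name):
        # Close open elements until `name` is on top (or nothing is left).
        while stack and stack[-1] != name:
            events.append(f"endElement({stack.pop()})")

    def ensure(section):
        # Open an implied <html> and the given section if they are missing.
        if not stack:
            start("html")
        if section not in stack:
            close_down_to("html")
            start(section)

    for kind, value in tokens:
        if kind == "start" and value == "html":
            # A late <html>: close everything open (perhaps implied), restart.
            close_down_to(None)
            start("html")
        elif kind == "start" and value in ("head", "body"):
            if not stack:
                start("html")
            close_down_to("html")
            start(value)
        elif kind == "start":
            ensure("head" if value in HEAD_ONLY else "body")
            start(value)
        elif kind == "end":
            if value in stack:
                close_down_to(value)
                events.append(f"endElement({stack.pop()})")
        elif kind == "text" and value.strip():
            if not (stack and stack[-1] in HEAD_ONLY):
                ensure("body")
                if stack[-1] == "body":
                    start("p")  # bare text in <body> gets an implied <p>
            events.append(f"characters({value.strip()})")

    close_down_to(None)  # end of input: close whatever is still open
    return events


# The misplaced-paragraph page from above, as a hand-written token list.
tokens = [
    ("start", "p"), ("text", "this is a test!\n"),
    ("start", "html"), ("start", "head"),
    ("start", "title"), ("text", "hi"), ("end", "title"),
    ("end", "head"), ("start", "body"), ("text", "\nWow!\n"),
]
events = sax_events(tokens)
print("\n".join(events))
```

The point is only the first branch of the loop: a second <html> start
empties the stack of open elements before the new one is opened, which
is why the first html/body/p group gets closed ahead of the real HEAD.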

Strangely, though, the DOM version still seems to put the second HTML
section inside the first:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body><p>this is a test!
</p></body>
<html>
<head><title>hi</title></head>
<body><p>
Wow!
</p></body>
</html>
</html>

I haven't tried to figure out why that is yet.

Additionally, my patch removes "hr" from the list of elements that
"form" closes. Since HR doesn't have a closing tag, I don't see why
that was there.

..wayne..


----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


