[xml] Dumping bogus tags in the HTML parser


From: Wayne Davison (wayned@users.sourceforge.net)
Date: Mon Oct 23 2000 - 17:44:57 EDT


The attached patch implements how I believe the HTML parser should
work (which is also how current web browsers work): If a bogus tag is
encountered, the bogus tag is ignored. The current code outputs bogus
tags (in a partial form) as literal text.

For instance, take this file:

----------------------------------------------------------------------
<HTML><HEAD>
<TITLE>test</TITLE>
</HEAD><BODY>
<! ---------- this is a common illegal comment -------->
Believe it or not, tags like this are included in the
http://my.excite.com/togo/ content:
<!A HREF=/news/r/001022/21/business-manufacturing-honeywell-dc>
I saw something like this in the current HTML test data:
<IMG SRC="foo_bar.gif" Bogus?.?>
----------------------------------------------------------------------

The current code reports the last error (the bogus attribute) three times
and leaks tag characters into the output as literal text:

----------------------------------------------------------------------
test.html:4: error: htmlParseStartTag: invalid element name
<! ---------- this is a common illegal comment -------->
 ^
test.html:7: error: htmlParseStartTag: invalid element name
<!A HREF=/news/r/001022/21/business-manufacturing-honeywell-dc>
 ^
test.html:9: error: error parsing attribute name
<IMG SRC="foo_bar.gif" Bogus?.?>
                            ^
test.html:9: error: htmlParseStartTag: problem parsing attributes
<IMG SRC="foo_bar.gif" Bogus?.?>
                            ^
test.html:9: error: Couldn't find end of Start Tag img
<IMG SRC="foo_bar.gif" Bogus?.?>
                            ^
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>test</title></head>
<body><p>! ---------- this is a common illegal comment --------&gt;
Believe it or not, tags like this are included in the
http://my.excite.com/togo/ content:
!A HREF=/news/r/001022/21/business-manufacturing-honeywell-dc&gt;
I saw something like this in the current HTML test data:
<img SRC="foo_bar.gif" Bogus>?.?&gt;
</p></body>
</html>
----------------------------------------------------------------------

After applying my patch, the code reports each problem only once and
drops the bogus tags from the output (the IMG tag is kept, minus the
junk that followed its bogus attribute):

----------------------------------------------------------------------
test.html:4: error: htmlParseStartTag: invalid element name
<! ---------- this is a common illegal comment -------->
 ^
test.html:7: error: htmlParseStartTag: invalid element name
<!A HREF=/news/r/001022/21/business-manufacturing-honeywell-dc>
 ^
test.html:9: error: error parsing attribute name
<IMG SRC="foo_bar.gif" Bogus?.?>
                            ^
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>test</title></head>
<body><p>
Believe it or not, tags like this are included in the
http://my.excite.com/togo/ content:

I saw something like this in the current HTML test data:
<img src="foo_bar.gif" bogus>
</p></body>
</html>
----------------------------------------------------------------------
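
To illustrate the idea outside the parser, here is a minimal standalone
sketch in C. It is not the attached patch and it is not libxml2 code: the
names skip_bogus_tags and starts_with_nocase are hypothetical, and the
sketch only covers the "<!" case (a "<!...>" construct that is neither a
"<!--" comment nor a DOCTYPE declaration). It does not attempt the bogus
attribute case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive prefix test: returns 1 if s starts with prefix. */
static int starts_with_nocase(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical helper (not a libxml2 function): copy `in` to `out`,
 * dropping every "<!...>" construct that is neither a comment ("<!--")
 * nor a DOCTYPE declaration.  `out` must have room for at least
 * strlen(in) + 1 bytes.  Returns the number of bogus tags dropped.
 */
size_t skip_bogus_tags(const char *in, char *out)
{
    size_t dropped = 0;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (in[0] == '<' && in[1] == '!' &&
            strncmp(in, "<!--", 4) != 0 &&
            !starts_with_nocase(in + 2, "DOCTYPE")) {
            const char *end = strchr(in, '>');
            dropped++;
            if (end == NULL)
                break;          /* unterminated: drop the rest of the input */
            in = end + 1;       /* resume just after the closing '>' */
            continue;
        }
        *out++ = *in++;         /* ordinary character: copy it through */
    }
    *out = '\0';
    return dropped;
}
```

Given line 4 of the test file above, this sketch produces an empty
string, because "<! -" does not begin with "<!--" and so the whole
construct up to the closing '>' is discarded. A real comment or a
DOCTYPE line is copied through untouched.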

My patch is attached to this email.

..wayne..


----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net


