From: Marc Sanfacon (sanm@copernic.com)
Date: Fri Nov 17 2000 - 09:57:13 EST
Hi again,
I found another problem with the HTML parser. I know that this is
invalid HTML, but here is the case:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Title</TITLE>
<META http-equiv=Content-Type content="text/html; charset=windows-1252">
</HEAD>
<BODY>
So <A href="http://www.ebay.com/">eBay® Company</A>
</BODY></HTML>
As you can see, the line with the href contains the character reference
'®', which is not terminated by a ';'. The result from libxml is:
BUGSun.txt:6: error: htmlParseCharRef: invalid decimal value
So <A href="http://www.ebay.com/">eBay® Company</A>
^
BUGSun.txt:6: error: htmlParseCharRef: invalid xmlChar value 0
So <A href="http://www.ebay.com/">eBay® Company</A>
^
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
</head>
<body><p>
So <a href="http://www.ebay.com/">eBay</a>
</p></body>
</html>
I lost the character, but also the subsequent characters ' Company'. The
reason for that is shown by the SAX events:
SAX.startElement(a, href='http://www.ebay.com/')
SAX.characters(eBay, 4)
SAX.error: htmlParseCharRef: invalid decimal value
SAX.error: htmlParseCharRef: invalid xmlChar value 0
SAX.characters(�, 1)
SAX.characters( Company, 8)
SAX.endElement(a)
In the SAX parser, the library sends me a NUL character (value 0), which
terminates the string, so any subsequent characters are not used.
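To illustrate why the trailing text disappears, here is a minimal sketch that replays the three characters() events from the trace into a buffer. TextBuf, on_characters and replay_trace are hypothetical names made up for this example; they are not libxml API, and a real application's handler may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical accumulator standing in for an application's text buffer. */
typedef struct {
    char   data[64];
    size_t len;      /* number of bytes stored, NUL bytes included */
} TextBuf;

/* Stand-in for a SAX characters() callback: appends len bytes of ch. */
void on_characters(TextBuf *buf, const char *ch, size_t len) {
    assert(buf != NULL && ch != NULL);
    if (buf->len + len >= sizeof buf->data)      /* keep room for the final NUL */
        len = sizeof buf->data - 1 - buf->len;
    memcpy(buf->data + buf->len, ch, len);
    buf->len += len;
    buf->data[buf->len] = '\0';
}

/* Replays the three characters() events from the trace. */
void replay_trace(TextBuf *buf) {
    static const char nul[1] = { '\0' };
    buf->len = 0;
    buf->data[0] = '\0';
    on_characters(buf, "eBay", 4);
    on_characters(buf, nul, 1);          /* the rejected character reference */
    on_characters(buf, " Company", 8);
}
```

After replay_trace() the buffer holds 13 bytes, but any code that treats buf.data as a C string (strlen, printf with %s, strcpy) stops at the embedded NUL and only sees "eBay". The bytes of " Company" are still in the buffer; they are just unreachable through string functions.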
I fixed it by making the following changes (marked with ->) in htmlParseReference:
void
htmlParseReference(htmlParserCtxtPtr ctxt) {
    htmlEntityDescPtr ent;
    xmlChar out[6];
    xmlChar *name;
    if (CUR != '&') return;
    if (NXT(1) == '#') {
        unsigned int c;
        /* bits starts negative so the continuation loop is skipped when c == 0 */
        int bits = -1, i = 0;
        c = htmlParseCharRef(ctxt);
->      if (c != 0) {
            /* emit the UTF-8 lead byte; bits is the shift of the first continuation byte */
            if      (c < 0x80)    { out[i++]= c;                        bits= -6; }
            else if (c < 0x800)   { out[i++]=((c >> 6) & 0x1F) | 0xC0;  bits= 0; }
            else if (c < 0x10000) { out[i++]=((c >> 12) & 0x0F) | 0xE0; bits= 6; }
            else                  { out[i++]=((c >> 18) & 0x07) | 0xF0; bits= 12; }
->      }
        /* emit the continuation bytes, six bits at a time */
        for ( ; bits >= 0; bits-= 6) {
            out[i++]= ((c >> bits) & 0x3F) | 0x80;
        }
        out[i] = 0;
        htmlCheckParagraph(ctxt);
        /* only report characters when something was actually encoded */
->      if (i > 0 && (ctxt->sax != NULL) && (ctxt->sax->characters != NULL))
            ctxt->sax->characters(ctxt->userData, out, i);
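For anyone who wants to try the idea outside the parser, the same encoding step can be sketched as a standalone function. encode_char_ref is a hypothetical helper written for this message, not part of libxml, and it assumes (as the trace suggests) that a value of 0 means the character reference was rejected.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encodes code point c as UTF-8 into out, which must hold at least
 * 5 bytes, and returns the number of bytes written (terminating NUL
 * excluded). Returns 0 when c == 0 so that the caller can skip the
 * characters() callback instead of passing on a NUL byte.
 */
int encode_char_ref(unsigned int c, unsigned char *out) {
    int bits, i = 0;

    assert(out != NULL);
    if (c == 0) {                 /* rejected reference: emit nothing */
        out[0] = 0;
        return 0;
    }
    /* Lead byte; bits is the shift of the first continuation byte. */
    if (c < 0x80) {
        out[i++] = (unsigned char)c;
        bits = -6;                /* ASCII: no continuation bytes */
    } else if (c < 0x800) {
        out[i++] = (unsigned char)(((c >> 6) & 0x1F) | 0xC0);
        bits = 0;
    } else if (c < 0x10000) {
        out[i++] = (unsigned char)(((c >> 12) & 0x0F) | 0xE0);
        bits = 6;
    } else {
        out[i++] = (unsigned char)(((c >> 18) & 0x07) | 0xF0);
        bits = 12;
    }
    /* Continuation bytes, six bits at a time. */
    for (; bits >= 0; bits -= 6)
        out[i++] = (unsigned char)(((c >> bits) & 0x3F) | 0x80);
    out[i] = 0;
    return i;
}
```

A caller would then invoke its characters handler only when the return value is greater than 0, which mirrors the i > 0 check in the patch. Returning early for 0 also means the loop variable is never read before it has been set.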
Regards,
Marc.
---------------------------------------------------------------------
"Better the pride that resides, in a citizen of the world.
Than the pride that divides, when a colorful rag is
unfurled." Neil Peart
---------------------------------------------------------------------
Marc Sanfacon, Software developer Copernic.com
e-mail: msanfacon@copernic.com R&D Group
Tel : (418) 527-0528 ext 1212
This archive was generated by hypermail 2b29 : Fri Nov 17 2000 - 10:43:40 EST