[xml] Another HTML parser issue...

From: Marc Sanfacon (sanm@copernic.com)
Date: Fri Nov 17 2000 - 09:57:13 EST


Hi again,
        I found another problem with the HTML parser. I know that this is
not valid HTML, but here is the case:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Title</TITLE>
<META http-equiv=Content-Type content="text/html; charset=windows-1252">
</HEAD>
<BODY>
      So <A href="http://www.ebay.com/">eBay&#174 Company</A>
</BODY></HTML>

As you can see, the line with the href contains the character reference
'&#174', which is not terminated by a ';'. The result from libxml is:

BUGSun.txt:6: error: htmlParseCharRef: invalid decimal value
      So <A href="http://www.ebay.com/">eBay&#174 Company</A>
                                                 ^
BUGSun.txt:6: error: htmlParseCharRef: invalid xmlChar value 0
      So <A href="http://www.ebay.com/">eBay&#174 Company</A>
                                                 ^
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
</head>
<body><p>
      So <a href="http://www.ebay.com/">eBay</a>
</p></body>
</html>

I lost the character, and also the subsequent characters ' Company'. The
reason for that is the following:

SAX.startElement(a, href='http://www.ebay.com/')
SAX.characters(eBay, 4)
SAX.error: htmlParseCharRef: invalid decimal value
SAX.error: htmlParseCharRef: invalid xmlChar value 0
SAX.characters(&#0;, 1)
SAX.characters( Company, 8)
SAX.endElement(a)

With the SAX parser, the library sends me a character '&#0', which
terminates the string, so the subsequent characters are not used.
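
To illustrate why the text after the bad reference disappears, here is a
minimal sketch that replays the three characters() chunks from the trace
above into a buffer. TextBuf, append_chunk and replay_trace are hypothetical
names made up for this example, not libxml code, and the sketch assumes the
application later reads the accumulated text as an ordinary C string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical accumulator standing in for an application's
 * SAX characters() handler. */
typedef struct {
    char   data[64];
    size_t len;      /* bytes stored so far, embedded NUL bytes included */
} TextBuf;

/* Append one chunk exactly as delivered; ch need not be NUL-terminated. */
static void append_chunk(TextBuf *buf, const char *ch, size_t n) {
    assert(buf->len + n < sizeof buf->data);
    memcpy(buf->data + buf->len, ch, n);
    buf->len += n;
    buf->data[buf->len] = '\0';
}

/* Replay the chunks reported in the SAX trace. */
static void replay_trace(TextBuf *buf) {
    buf->len = 0;
    buf->data[0] = '\0';
    append_chunk(buf, "eBay", 4);
    append_chunk(buf, "\0", 1);         /* the bogus U+0000 chunk */
    append_chunk(buf, " Company", 8);
}
```

After replay_trace(), buf.len is 13, so every byte was stored, but
strlen(buf.data) is 4: any code that treats the buffer as a C string stops
at the embedded NUL and only sees "eBay".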

I fixed it by making the following changes in htmlParseReference (the
changed lines are marked with '->'):

void
htmlParseReference(htmlParserCtxtPtr ctxt) {
    htmlEntityDescPtr ent;
    xmlChar out[6];
    xmlChar *name;
    if (CUR != '&') return;

    if (NXT(1) == '#') {
        unsigned int c;
        /* bits starts negative so the loop below is skipped when c == 0 */
        int bits = -1, i = 0;

        c = htmlParseCharRef(ctxt);
->      if (c != 0) {
            if (c < 0x80) { out[i++]= c; bits= -6; }
            else if (c < 0x800) { out[i++]=((c >> 6) & 0x1F) | 0xC0; bits= 0; }
            else if (c < 0x10000) { out[i++]=((c >> 12) & 0x0F) | 0xE0; bits= 6; }
            else { out[i++]=((c >> 18) & 0x07) | 0xF0; bits= 12; }
->      }

        for ( ; bits >= 0; bits-= 6) {
            out[i++]= ((c >> bits) & 0x3F) | 0x80;
        }
        out[i] = 0;

        htmlCheckParagraph(ctxt);
->      if (i > 0 && (ctxt->sax != NULL) && (ctxt->sax->characters != NULL))
            ctxt->sax->characters(ctxt->userData, out, i);
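
For anyone who wants to try the encoding step outside the parser, here is a
standalone sketch of the same idea. encode_utf8 is a hypothetical helper
written for this example, not a libxml function. It assumes the caller
passes a buffer of at least 5 bytes and does not check that c is a valid
Unicode code point; it only mirrors the branches of the patch and returns 0
bytes when c is 0, so the caller can skip the characters() call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode code point c as UTF-8 into out, which must
 * hold at least 5 bytes (4 for the sequence plus a terminating 0).
 * Returns the number of bytes written, or 0 when c == 0. */
static int encode_utf8(unsigned int c, unsigned char *out) {
    int bits, i = 0;

    assert(out != NULL);
    if (c == 0) {               /* failed character reference: emit nothing */
        out[0] = 0;
        return 0;
    }

    /* Lead byte; bits is the shift of the first continuation byte. */
    if (c < 0x80) {
        out[i++] = (unsigned char)c;
        bits = -6;              /* ASCII: no continuation bytes */
    } else if (c < 0x800) {
        out[i++] = (unsigned char)(((c >> 6) & 0x1F) | 0xC0);
        bits = 0;
    } else if (c < 0x10000) {
        out[i++] = (unsigned char)(((c >> 12) & 0x0F) | 0xE0);
        bits = 6;
    } else {
        out[i++] = (unsigned char)(((c >> 18) & 0x07) | 0xF0);
        bits = 12;
    }

    /* Continuation bytes, 6 bits each, most significant first. */
    for ( ; bits >= 0; bits -= 6)
        out[i++] = (unsigned char)(((c >> bits) & 0x3F) | 0x80);

    out[i] = 0;
    return i;
}
```

With c = 174 (the registered sign from the test document) this writes the
two bytes 0xC2 0xAE and returns 2; with c = 0 it returns 0 and writes no
character data.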

Regards,
        Marc.

---------------------------------------------------------------------
 "Better the pride that resides, in a citizen of the world.
  Than the pride that divides, when a colorful rag is
  unfurled." Neil Peart
---------------------------------------------------------------------
Marc Sanfacon, Software developer Copernic.com
e-mail: msanfacon@copernic.com R&D Group
Tel : (418) 527-0528 ext 1212


----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net

