[xml] Tag-attribute character-conversion wrong


From: Wayne Davison (wayned@blorf.net)
Date: Thu Sep 14 2000 - 03:45:33 EDT


I was parsing an HTML file with a high-bit character in a tag attribute
and the code was getting the character wrong (even though it got the
character right outside the tag).

I think this patch is the right fix:

Index: HTMLparser.c
@@ -1972,7 +1972,7 @@
             }
         } else {
             unsigned int c;
-            int bits;
+            int bits, l;
 
             if (out - buffer > buffer_size - 100) {
                 int index = out - buffer;
@@ -1980,7 +1980,7 @@
                 growBuffer(buffer);
                 out = &buffer[index];
             }
-            c = CUR;
+            c = CUR_CHAR(l);
             if (c < 0x80)
                     { *out++ = c; bits= -6; }
             else if (c < 0x800)

With this change, the htmlParseHTMLAttribute() function reads a whole
UTF-8 character, rather than just a single 8-bit value.
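
To illustrate the difference between reading one byte and reading one UTF-8 character, here is a minimal sketch. The helper decode_utf8() is hypothetical and only stands in for what CUR_CHAR(l) is expected to do; it is not the libxml implementation, and its handling of malformed input (return the raw byte) is an assumption made for the example.

```c
/* Hypothetical helper, NOT libxml's CUR_CHAR: decode one UTF-8 sequence
 * starting at s, store its length in bytes in *len, and return the code
 * point. Handles 1- to 4-byte forms; on malformed input it falls back to
 * returning the first byte unchanged with *len = 1. */
static unsigned int decode_utf8(const unsigned char *s, int *len)
{
    unsigned int c = s[0];
    int extra, i;

    if (c < 0x80)                { *len = 1; return c; }   /* plain ASCII */
    else if ((c & 0xE0) == 0xC0) { extra = 1; c &= 0x1F; } /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) { extra = 2; c &= 0x0F; } /* 1110xxxx */
    else if ((c & 0xF8) == 0xF0) { extra = 3; c &= 0x07; } /* 11110xxx */
    else                         { *len = 1; return s[0]; } /* stray byte */

    for (i = 1; i <= extra; i++) {
        /* Every continuation byte must look like 10xxxxxx. */
        if ((s[i] & 0xC0) != 0x80) { *len = 1; return s[0]; }
        c = (c << 6) | (s[i] & 0x3F);
    }
    *len = extra + 1;
    return c;
}
```

For the two-byte input 0xC2 0x91 (the UTF-8 form of character 145), taking only the first byte gives 0xC2, which is A-circumflex in Latin-1, while decode_utf8() returns 0x91 with a length of 2. That matches the &#145; versus &Acirc; symptom described below.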

Attached is a test file that demonstrates how the value is read correctly
until the file gets converted into UTF-8, and then goes wrong. Just run it
through "testHTML --sax" to see that the value gets changed from &#145;
into &Acirc; in the second instance. After my patch, all is well.

..wayne..


----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Thu Sep 14 2000 - 04:43:50 EDT