[xml] Tag-attribute character-conversion wrong


From: Wayne Davison (wayned@blorf.net)
Date: Thu Sep 14 2000 - 03:45:33 EDT

I was parsing an HTML file with a high-bit character in a tag attribute
and the code was getting the character wrong (even though it got the
character right outside the tag).

I think this patch is the right fix:

Index: HTMLparser.c
@@ -1972,7 +1972,7 @@
             }
         } else {
             unsigned int c;
-            int bits;
+            int bits, l;

             if (out - buffer > buffer_size - 100) {
                 int index = out - buffer;
@@ -1980,7 +1980,7 @@
                 growBuffer(buffer);
                 out = &buffer[index];
             }
-            c = CUR;
+            c = CUR_CHAR(l);
             if (c < 0x80)
                     { *out++ = c; bits= -6; }
             else if (c < 0x800)

This makes the htmlParseHTMLAttribute() function read a whole UTF-8
character, rather than just a single 8-bit value.
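
To illustrate why the byte-wise read goes wrong, here is a minimal,
self-contained sketch. It is not the libxml code: encode_utf8(),
decode_utf8(), copy_bytewise() and copy_charwise() are hypothetical
helpers written for this example, with decode_utf8() standing in for
what CUR_CHAR(l) does, and the input is assumed to be well-formed UTF-8.
U+00A0 is used as the sample character purely as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t encode_utf8(unsigned int c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}

/* Decode one UTF-8 sequence (assumed well-formed); *len gets its length.
 * Hypothetical stand-in for what CUR_CHAR(l) provides. */
static unsigned int decode_utf8(const unsigned char *in, int *len)
{
    if (in[0] < 0x80) {
        *len = 1;
        return in[0];
    }
    if ((in[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((in[0] & 0x1Fu) << 6) | (in[1] & 0x3Fu);
    }
    if ((in[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((in[0] & 0x0Fu) << 12) | ((in[1] & 0x3Fu) << 6)
               | (in[2] & 0x3Fu);
    }
    *len = 4;
    return ((in[0] & 0x07u) << 18) | ((in[1] & 0x3Fu) << 12)
           | ((in[2] & 0x3Fu) << 6) | (in[3] & 0x3Fu);
}

/* Old behaviour: treat every input byte as a code point and re-encode it. */
static size_t copy_bytewise(const unsigned char *in, size_t n,
                            unsigned char *out)
{
    size_t w = 0;
    for (size_t i = 0; i < n; i++)
        w += encode_utf8(in[i], out + w);
    return w;
}

/* Patched behaviour: decode a whole character, then re-encode it. */
static size_t copy_charwise(const unsigned char *in, size_t n,
                            unsigned char *out)
{
    size_t w = 0, i = 0;
    while (i < n) {
        int l;
        unsigned int c = decode_utf8(in + i, &l);
        w += encode_utf8(c, out + w);
        i += (size_t)l;
    }
    return w;
}
```

Feeding the two bytes C2 A0 (U+00A0 in UTF-8) through copy_bytewise()
yields C3 82 C2 A0, i.e. a spurious "Â" in front of the character, while
copy_charwise() returns C2 A0 unchanged.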

Attached is a test file that demonstrates how the value is read OK until
the file gets converted into UTF-8, and then goes wrong. Just run it
through "testHTML --sax" to see that the value gets changed from the
original high-bit character into Â in the second instance. After my
patch, all is well.

..wayne..

TEXT/html attachment: high-bit attribute test

----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net

This archive was generated by hypermail 2b29 : Thu Sep 14 2000 - 04:43:50 EDT