From: Wayne Davison (wayned@blorf.net)
Date: Thu Sep 14 2000 - 03:45:33 EDT
I was parsing an HTML file with a high-bit character in a tag attribute
and the code was getting the character wrong (even though it got the
character right outside the tag).
I think this patch is the right fix:
Index: HTMLparser.c
@@ -1972,7 +1972,7 @@
         }
     } else {
         unsigned int c;
-        int bits;
+        int bits, l;
         if (out - buffer > buffer_size - 100) {
             int index = out - buffer;
@@ -1980,7 +1980,7 @@
             growBuffer(buffer);
             out = &buffer[index];
         }
-        c = CUR;
+        c = CUR_CHAR(l);
         if (c < 0x80)
             { *out++ = c; bits= -6; }
         else if (c < 0x800)
This causes the htmlParseHTMLAttribute() function to read a UTF8
character, rather than just an 8-bit value.
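To illustrate the difference, here is a minimal standalone sketch (not libxml code). The helper names utf8_decode and utf8_encode are hypothetical, and U+2018 is used only as an example of a high-bit character. Reading a single byte of a UTF-8 sequence and re-encoding it as if it were a full character yields a different character, while decoding the whole sequence first round-trips cleanly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s.
 * Returns the code point and stores the number of bytes consumed in *len.
 * Assumes well-formed input; continuation bytes are not validated. */
static unsigned int utf8_decode(const unsigned char *s, int *len)
{
    assert(s != NULL && len != NULL);
    if (s[0] < 0x80) {
        *len = 1;
        return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
    }
    *len = 4;
    return ((s[0] & 0x07u) << 18) | ((s[1] & 0x3Fu) << 12) |
           ((s[2] & 0x3Fu) << 6) | (s[3] & 0x3Fu);
}

/* Hypothetical helper: encode code point c as UTF-8 into out
 * (which must hold at least 4 bytes). Returns the number of bytes written. */
static int utf8_encode(unsigned int c, unsigned char *out)
{
    assert(out != NULL);
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}
```

With the input bytes E2 80 98 (U+2018), passing only the first byte 0xE2 to utf8_encode produces C3 A2, a different character. Calling utf8_decode first gives 0x2018, which encodes back to the original three bytes.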
Attached is a test file that demonstrates how the value is read OK until
the file gets converted into UTF8, and then goes wrong. Just run it
through "testHTML --sax" to see that the value gets changed from ‘
into  in the second instance. After my patch, all is well.
..wayne..