[xml] A truncation bug and some testHTML.c enhancements


From: Wayne Davison (wayned@blorf.net)
Date: Sun Aug 13 2000 - 03:41:40 EDT


There's a bug in the HTML parser when we're using the push interface
and we encounter a meta tag that changes the charset. After the code
shrinks the input buffer to remove all the already-parsed characters,
it then calls xmlCharEncFirstLine(), which converts at most 45
characters from the new raw buffer. Any characters in the raw buffer
beyond that point are never parsed. Depending on the buffer size, this
can cut off the rest of the HTML file in the middle of the HEAD section.

I made a quick fix by changing the xmlCharEncFirstLine() call in
parser.c into xmlCharEncInFunc(). This ensures that the entire buffer
gets converted and used. This may even be the right thing to do,
since I don't see why we should be using the first-line version of
this function when we know that we've already parsed data from the
file (it seems to me that it should only get called if we're at the
very start of the file).
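
To make the difference between the two calls concrete, here is a toy
model in plain C. It does not use libxml2 at all: convert_first_line()
and convert_all() are hypothetical stand-ins for xmlCharEncFirstLine()
and xmlCharEncInFunc(), and the only behavior modeled is the 45-character
cap described above. Treat it as a sketch of the failure mode, not as the
library's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Cap taken from the 45-character figure quoted in this message. */
#define FIRST_LINE_LIMIT 45

/* Hypothetical stand-in for a "first line" converter: copies at most
 * FIRST_LINE_LIMIT bytes of raw input into out and returns the number
 * of bytes copied. Anything past the cap is simply left behind. */
size_t convert_first_line(char *out, size_t out_size,
                          const char *raw, size_t raw_len)
{
    size_t n = raw_len < FIRST_LINE_LIMIT ? raw_len : FIRST_LINE_LIMIT;

    assert(out_size > n);   /* room for the bytes plus a terminator */
    memcpy(out, raw, n);
    out[n] = '\0';
    return n;
}

/* Hypothetical stand-in for a full converter: copies the whole raw
 * buffer, so nothing that was already pushed gets dropped. */
size_t convert_all(char *out, size_t out_size,
                   const char *raw, size_t raw_len)
{
    assert(out_size > raw_len);
    memcpy(out, raw, raw_len);
    out[raw_len] = '\0';
    return raw_len;
}
```

If the raw buffer holds 3 bytes when the charset switch happens, both
functions return 3 and nothing is lost. If it holds several hundred
bytes, convert_first_line() still returns 45 and the remainder never
reaches the parser, which matches the truncation I am seeing.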

If you'd like to see the bug in action, you need to use the push
interface with a buffer that is larger than 45 bytes. The testHTML
program (as it exists now) will only trigger the bug if you use both
--push and --repeat (since it only uses a 3-byte buffer by default).

This brings me to my changes for testHTML.c.

I noticed that testHTML did not allow me to both push data and test
the SAX parser, so I added the appropriate code to parseSAXFile() to
honor the push flag. I also added an option named "--bigpush" that
behaves just like --push except that it uses the whole 1024-byte
buffer. (Given this option, I also removed the magic effect of
--repeat on the buffer size -- use --bigpush with --repeat to get the
old behavior.) Lastly, I tweaked some of the option-parsing code to
make it a little less repetitive.
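
For readers who do not want to open the patch, the option handling
described above could look roughly like the sketch below. This is not
the code from the attached patch: struct test_opts, parse_opt() and the
two macros are names made up for illustration, and accepting both the
one-dash and two-dash spellings is an assumption. Only the option names
and the 3-byte and 1024-byte sizes come from this message.

```c
#include <assert.h>
#include <string.h>

#define SMALL_CHUNK 3     /* chunk size used by plain --push */
#define BIG_CHUNK   1024  /* whole buffer, selected by --bigpush */

/* Hypothetical option record; zero-initialize before use. */
struct test_opts {
    int push;    /* nonzero when --push or --bigpush was given */
    int sax;     /* nonzero when --sax was given */
    int repeat;  /* nonzero when --repeat was given */
    int chunk;   /* bytes handed to the parser per push call */
};

/* Records one command-line argument in *o.
 * Returns 1 if arg was a recognized option, 0 otherwise
 * (for example when arg is a file name). */
int parse_opt(struct test_opts *o, const char *arg)
{
    if (arg[0] != '-')
        return 0;
    arg++;
    if (arg[0] == '-')      /* assumed: -push and --push both work */
        arg++;

    if (strcmp(arg, "push") == 0) {
        o->push = 1;
        if (o->chunk == 0)  /* do not shrink an earlier --bigpush */
            o->chunk = SMALL_CHUNK;
    } else if (strcmp(arg, "bigpush") == 0) {
        o->push = 1;
        o->chunk = BIG_CHUNK;
    } else if (strcmp(arg, "sax") == 0) {
        o->sax = 1;
    } else if (strcmp(arg, "repeat") == 0) {
        o->repeat = 1;      /* no longer touches the chunk size */
    } else {
        return 0;
    }
    return 1;
}
```

Stripping the leading dashes once is what removes the repetition: each
option name is then compared a single time instead of once per spelling.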

After you apply this patch, you can run testHTML with both --bigpush
and --sax and see the truncated data quite easily.

I've attached my test file (meta.html), a patch with the one-line
change for parser.c, and a patch with my changes to testHTML.c.

All changes are based on the CVS source I just grabbed from gnome.org.

..wayne..





