Re: [xml] Another encoder truncation bug

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Tue Aug 22 2000 - 17:36:04 EDT


On Tue, Aug 22, 2000 at 01:32:08PM -0700, Wayne Davison wrote:
> I found another case where the push code can truncate the HTML input.
> If the input file has a high-bit character in it (e.g. 0xA0 = nbsp)
> but there is currently no encoding, the input is assumed to be
> ISO-8859-1 and the first line is decoded (about 40 chars or so).
> However, after these characters get parsed, the htmlParseChunk() call
> returns without processing the rest of the raw buffer. If the very
> next call is a flush, all the remaining (raw) data is lost. I've
> attached a simple html file that will cause "testHTML --sax --push" to
> fail.

  Thanks for the report and the test, they help a lot!
Problem understood, and agreed.

> I whipped up a solution that works for me -- when the user flushes the
> buffer, make sure that we've encoded all of "raw" into "buffer" before
> the call to htmlParseTryOrFinish(). A better solution might be to
> ensure that the characters get processed before returning from the
> htmlParseChunk() call so that there isn't such a potential for delayed
> handling.
>
> My quick fix is as follows:

  I think a first fix is to get xmlSwitchEncoding() to convert the
full parser content in the case of HTML documents. Your patch certainly
applies too. With both, plus xmlParserInputBufferPush() in the case
where there is stuff pushed, I think all cases are covered, and as soon
as possible (i.e. as soon as the new encoding is detected or as soon
as the data are pushed).
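
  To make the flush case concrete, here is a minimal sketch of the
idea of draining the "raw" stage before the final parse. It is not
libxml code: ToyInput, toy_push(), toy_decode() and toy_flush() are
hypothetical names made up for illustration, the buffers are fixed
size, and the only conversion modelled is ISO-8859-1 to UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical two-stage input, loosely modelled on the "raw" and
 * "buffer" stages discussed in this thread. Not libxml code. */
#define TOY_CAP 256

typedef struct {
    unsigned char raw[TOY_CAP];      /* bytes as pushed, assumed ISO-8859-1 */
    size_t raw_len;
    unsigned char out[2 * TOY_CAP];  /* decoded UTF-8, what a parser would read */
    size_t out_len;
} ToyInput;

static void toy_init(ToyInput *in) {
    in->raw_len = 0;
    in->out_len = 0;
}

/* Append pushed bytes to the raw stage; returns how many were accepted. */
static size_t toy_push(ToyInput *in, const unsigned char *data, size_t len) {
    size_t room = TOY_CAP - in->raw_len;
    if (len > room) len = room;
    memcpy(in->raw + in->raw_len, data, len);
    in->raw_len += len;
    return len;
}

/* Decode at most max raw bytes to UTF-8 and drop them from the raw
 * stage. Returns the number of raw bytes consumed. */
static size_t toy_decode(ToyInput *in, size_t max) {
    size_t n = in->raw_len < max ? in->raw_len : max;
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char c = in->raw[i];
        if (in->out_len + 2 > sizeof(in->out)) break;  /* output full */
        if (c < 0x80) {
            in->out[in->out_len++] = c;
        } else {
            /* Latin-1 0x80..0xFF maps to a two-byte UTF-8 sequence */
            in->out[in->out_len++] = (unsigned char)(0xC0 | (c >> 6));
            in->out[in->out_len++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    memmove(in->raw, in->raw + i, in->raw_len - i);
    in->raw_len -= i;
    return i;
}

/* Flush: make sure everything still sitting in raw is decoded before
 * the final parse would run. Returns the decoded length. */
static size_t toy_flush(ToyInput *in) {
    while (in->raw_len > 0) {
        if (toy_decode(in, in->raw_len) == 0) break;  /* no progress possible */
    }
    return in->out_len;
}
```

  If a partial toy_decode() is followed directly by the end of input,
skipping the loop in toy_flush() is what would lose the remaining raw
bytes in this toy model.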

> Also, I'm curious why the htmlParserInputRead() function goes to the
> trouble of shifting a buffer of pushed data since it can't read any
> new data into the buffer. Adding the following check makes the
> function return without doing anything if there is no readcallback
> defined:
>
> Index: parser.c
> @@ -443,6 +443,7 @@
> if (in->base == NULL) return(-1);
> if (in->cur == NULL) return(-1);
> if (in->buf->buffer == NULL) return(-1);
> + if (in->buf->readcallback == NULL) return(-1);
>
> CHECK_BUFFER(in);

  Hum, this really changes the semantics of the function: the goal
is not only to read data but also to shrink the buffer by
discarding already-scanned characters. Well, I applied the patch and
tested it; it doesn't seem to choke on the testsuite nor on HTTP, even
in push mode, so I assume it doesn't break anything :-)
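
  For illustration only, the two jobs (shrinking and reading) can be
separated as in the sketch below. This is a toy model, not the libxml
implementation: ToyBuf, ToyReadFn, toy_shrink() and toy_refill() are
hypothetical names, and shrinking before checking for the callback is
the sketch's own choice, not what the applied patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical read callback: fills dst with up to room bytes and
 * returns how many were written. NULL stands for push mode. */
typedef size_t (*ToyReadFn)(unsigned char *dst, size_t room);

typedef struct {
    unsigned char data[64];
    size_t len;      /* number of valid bytes in data */
    size_t cur;      /* index of the next unscanned byte */
    ToyReadFn read;  /* NULL when there is no read callback */
} ToyBuf;

/* Job 1: discard the bytes already scanned, so the buffer does not
 * grow without bound. Returns the number of bytes discarded. */
static size_t toy_shrink(ToyBuf *b) {
    size_t used = b->cur;
    assert(b->cur <= b->len);
    memmove(b->data, b->data + used, b->len - used);
    b->len -= used;
    b->cur = 0;
    return used;
}

/* Job 2: read more data if a callback exists. Returns the number of
 * bytes read, or -1 when there is no callback to read from. */
static int toy_refill(ToyBuf *b) {
    size_t got;
    toy_shrink(b);
    if (b->read == NULL) return -1;
    got = b->read(b->data + b->len, sizeof(b->data) - b->len);
    b->len += got;
    return (int) got;
}
```

  In this model a push-mode buffer (read == NULL) still gets its scanned
bytes discarded, while the -1 return tells the caller nothing new was
read.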

  thanks for the reports, patch enclosed,

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe


----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org


This archive was generated by hypermail 2b29 : Tue Aug 22 2000 - 14:43:12 EDT