Re: [xml] UTF8ToHtml() changes


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Sun Aug 27 2000 - 18:31:37 EDT


On Sun, Aug 27, 2000 at 02:18:35PM -0700, Wayne Davison wrote:
> I was trying to use UTF8ToHtml() to transform some internal-format
> characters back into HTML, and it was failing to translate some curly
> quotes and such. Come to find out that the array, html40EntitiesTable[],
> is not sorted like UTF8ToHtml() expects it to be.

  Okay, good catch.

> Also, the function does
> not escape actual ampersands in the input, which leads to ambiguities in
> the output.

  And for a good reason: the escaping is done at another level.
UTF8ToHtml() is just registered as the encoding handler in
encoding.c, and that conversion is done before calling it. With your
change, the result was that it completely broke libxml HTML output
(by double escaping of &).
  UTF8ToHtml() is only the default encoding filter for HTML output.
You can perfectly well save in UTF-16 or something even more exotic.

> The attached patch does the following:
>
> + Sorts all the entries in html40EntitiesTable[] by unicode value.

  Thanks, applied.

> + Renamed htmlEntityLookup() to htmlEntityNameLookup() and then added

  Sorry, not possible: the rename would break binary compatibility.

> htmlEntityValueLookup() (since I wanted to lookup entities by value
> in my own code). The value lookup code has a debug check that
> complains if it finds a value in the list that is out of order.

  Yes, added.

> the new htmlEntityValueLookup() function (which uses a slightly more
> efficient linear scan -- it has a maximum of N+1 value comparisons
> rather than 2*N).

  Okay, applied
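
A minimal sketch of the kind of by-value lookup discussed above, assuming a table sorted by ascending code point. The names entity_desc, entity_table and entity_value_lookup are hypothetical and are not libxml's actual declarations; the table is a small excerpt for illustration. The loop does one "less than" comparison per entry and one final equality test, which is where the N+1 bound comes from, and the assert stands in for the debug check on out-of-order entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry type; the real libxml table carries more fields. */
typedef struct {
    unsigned int value;  /* Unicode code point */
    const char *name;    /* entity name without '&' and ';' */
} entity_desc;

/* Illustrative excerpt only, sorted by ascending code point. */
static const entity_desc entity_table[] = {
    { 34,   "quot"  },
    { 38,   "amp"   },
    { 60,   "lt"    },
    { 62,   "gt"    },
    { 160,  "nbsp"  },
    { 8220, "ldquo" },
    { 8221, "rdquo" },
};

#define ENTITY_COUNT (sizeof(entity_table) / sizeof(entity_table[0]))

/* Linear scan over a sorted table: at most N "less than" tests
 * plus one equality test, instead of two tests per entry. */
const entity_desc *entity_value_lookup(unsigned int value)
{
    size_t i = 0;

    while (i < ENTITY_COUNT && entity_table[i].value < value) {
        /* Debug check: complain if the next entry is out of order. */
        assert(i + 1 == ENTITY_COUNT ||
               entity_table[i].value < entity_table[i + 1].value);
        i++;
    }
    if (i < ENTITY_COUNT && entity_table[i].value == value)
        return &entity_table[i];
    return NULL;  /* no entity defined for this code point */
}
```

With this excerpt, looking up 8220 yields the "ldquo" entry, while a value that falls between two entries (39, say) stops the scan early and returns NULL.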

> + Fixed an off-by-one bug in UTF8ToHtml() when it was checking for enough
> room in the output buffer to fit an entity.

  I prefer keeping one extra byte free in case the user needs to
add a terminating zero. There is no loss of output in that case, and
UTF8ToHtml() returning -2 is not handled as an error condition; it is
the normal way to use the conversion filters. That one byte is there
for safety, at no real cost.
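
The difference between the two room checks can be sketched as follows. This is an illustration only: emit_entity is a hypothetical helper, not the code in UTF8ToHtml(), and its return convention (0 meaning "not enough room, flush and call again") is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write "&name;" at out, keeping one spare byte
 * before outend so the caller can still append a terminating zero.
 * Returns the number of bytes written, or 0 if there is not enough
 * room (the caller is expected to flush the buffer and retry). */
size_t emit_entity(char *out, const char *outend, const char *name)
{
    size_t need, room;

    assert(out != NULL && outend != NULL && name != NULL);
    assert(out <= outend);

    need = strlen(name) + 2;        /* '&' + name + ';' */
    room = (size_t)(outend - out);

    /* An exact-fit check would be "need > room"; the "+ 1" here
     * reserves the safety byte. */
    if (need + 1 > room)
        return 0;

    *out++ = '&';
    memcpy(out, name, need - 2);
    out += need - 2;
    *out = ';';
    return need;
}
```

With an 8-byte buffer, "&ldquo;" (7 bytes) fits and leaves the last byte free; with a 7-byte buffer the helper declines, and nothing is lost because the caller simply retries with more room.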

> + Tweaked the entity-copying code in UTF8ToHtml() a tad.

  Okay, I kept most of it.

> + Removed a superfluous "i = 0" initialization that I happened to notice.

  Okay.

> /* assertion: c is a single UTF-4 value */
> if (c < 0x80) {
> - if (out >= outend)
> + switch (c) {
> + case '&':
> + if (out + 5 > outend) {
> + *outlen = out - outstart;
> + *inlen = processed - instart;
> + return(0);
> + }
> + memcpy(out, "&amp;", 5);
> + out += 5;
> + break;
> + case '<':
> + if (out + 4 > outend) {
> + *outlen = out - outstart;
> + *inlen = processed - instart;
> + return(0);
> + }
> + memcpy(out, "&lt;", 4);
> + out += 4;
> + break;
> + case '>':
> + if (out + 4 > outend) {
> + *outlen = out - outstart;
> + *inlen = processed - instart;
> + return(0);
> + }
> + memcpy(out, "&gt;", 4);
> + out += 4;
> break;
> - *out++ = c;
> + default:
> + if (out >= outend) {
> + *outlen = out - outstart;
> + *inlen = processed - instart;
> + return(0);
> + }
> + *out++ = c;
> + break;
> + }

  This part was not applied. It was breaking HTML generation (run
"make HTMLtests" to get an idea :-).
  This may be worth a separate function if you just want to output
a string extracted from the internal representation. I would accept
it without problem.
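
One possible shape for such a separate function, as a sketch only: escape_html_text is a hypothetical name, not an existing libxml call, and it handles just the three characters from the rejected hunk. Bytes at or above 0x80 (UTF-8 sequences) are copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the NUL-terminated string in to out,
 * escaping '&', '<' and '>'. Returns the length written (excluding
 * the terminating zero), or -1 if out is too small. */
int escape_html_text(const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        char single[2];
        const char *rep;
        size_t len;

        switch (*in) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:
            single[0] = *in;
            single[1] = '\0';
            rep = single;
            break;
        }
        len = strlen(rep);
        if (used + len + 1 > outsize)   /* + 1 keeps room for the zero */
            return -1;
        memcpy(out + used, rep, len);
        used += len;
    }
    if (used + 1 > outsize)
        return -1;
    out[used] = '\0';
    return (int)used;
}
```

Applied once, "a < b & c" becomes "a &lt; b &amp; c". Applied to text that is escaped again at another level, the result is "a &amp;lt; b &amp;amp; c", which is the double escaping described above and the reason this belongs outside the encoding filter.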

  Thanks for the report and the patch,

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org


