Re: [xml] Char encoding again


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Fri Jul 14 2000 - 12:40:16 EDT


On Tue, Jun 27, 2000 at 04:34:23PM +0200, Daniel Veillard wrote:
>
> On Tue, Jun 27, 2000 at 03:07:41PM +0200, Tobias Peters wrote:
> > 1) Numeric character references (&#...;) are translated to utf8.
> >
> > 2) Non-ASCII characters (such as an umlaut) are rejected by the parser when no encoding is
> > specified in the document.
> > 3) Non-ASCII characters are accepted when the document uses the "ISO-8859-1" (or
> > probably some other) encoding. They are left untouched by libxml.
> >
> > While 1) and 2) are good things, 3) is at least questionable, I think. But it gets
> > *really* bad when 1) and 3) come together: Suppose you have a document that declares
> > the ISO-8859-1 encoding, that contains some special characters of this charset,
> > but at the same time contains numeric character references that may or may not be
> > representable in the chosen encoding. Either way, libxml will leave the special characters
> > untouched and translate the numeric character references to utf8. So you could end up
> > having characters that use different encodings in a single string. That's what I consider
> > bad.
>
> yes, you are right ...
> Currently ISO Latin is handled differently from all other encodings in the
> sense that it's not converted internally to UTF-8. This is bad, and I mostly
> kept it that way for historical reasons. This opens a lot of problems, like
> not being able to represent characters inserted as charrefs that fall outside
> the ISO-Latin range; this is simply impossible right now.

  Okay, this is fixed in libxml2-2.2.0.
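
  For readers who want to see the old problem concretely, here is a small
sketch in Python (not libxml code; the variable and function names are made up
for illustration). It assumes a document declared as ISO-8859-1 containing a
literal "ü" and the charref &#x20AC;, and mimics the pre-2.2.0 behaviour
described above: the literal byte is kept as-is while the charref is expanded
to UTF-8.

```python
# Illustration only: mimics the behaviour described in the thread, not libxml itself.
latin1_bytes = "\u00fc".encode("iso-8859-1")   # literal u-umlaut, left untouched: b'\xfc'
charref_bytes = chr(0x20AC).encode("utf-8")    # &#x20AC; expanded to UTF-8: b'\xe2\x82\xac'

# Two encodings end up in one byte string.
mixed = latin1_bytes + charref_bytes


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(mixed, "utf-8")          # fails: 0xFC cannot start a UTF-8 sequence
as_latin1 = try_decode(mixed, "iso-8859-1")   # decodes, but the euro sign becomes garbage

# Converting everything to one encoding (UTF-8) gives a consistent string.
fixed = (latin1_bytes.decode("iso-8859-1") + chr(0x20AC)).encode("utf-8")
```

  Neither interpretation of the mixed string recovers both characters, which
is why a single internal encoding is needed.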

> > a) translate everything to utf8. If libxml does not know how to translate a specific
> > charset, return an error. Let the library user supply conversion functions.

  Done, everything is now translated to UTF-8 internally. I have written a
doc on the way encoding support is done (and the choices I made):

  http://xmlsoft.org/encoding.html
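
  The general shape of suggestion a) can be sketched as follows. This is a
Python illustration under stated assumptions, not libxml's actual API:
`register_converter` and `to_utf8` are hypothetical names, and Python's own
codec table stands in for the built-in conversions. The idea is only to show
"convert everything to UTF-8, fail on an unknown charset, let the user plug in
a converter".

```python
import codecs

# Hypothetical registry of user-supplied converters: encoding name -> function
# taking raw bytes and returning UTF-8 bytes.
_user_converters = {}


def register_converter(name, func):
    """Let the library user supply a conversion function for an encoding."""
    _user_converters[name.lower()] = func


def to_utf8(raw, declared_encoding):
    """Convert raw input bytes to UTF-8, or raise if the encoding is unknown."""
    name = declared_encoding.lower()
    if name in _user_converters:
        return _user_converters[name](raw)
    try:
        codecs.lookup(name)  # stands in for the built-in conversion table
    except LookupError:
        raise ValueError("no converter for encoding %r" % declared_encoding)
    return raw.decode(name).encode("utf-8")
```

  With this, `to_utf8(b"\xfc", "ISO-8859-1")` yields the two-byte UTF-8 form
of "ü", while an undeclared charset raises an error until a converter is
registered for it.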

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org

