From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Fri Jul 14 2000 - 12:40:16 EDT
On Tue, Jun 27, 2000 at 04:34:23PM +0200, Daniel Veillard wrote:
>
> On Tue, Jun 27, 2000 at 03:07:41PM +0200, Tobias Peters wrote:
> > 1) Numeric character references (&#...;) are translated to utf8.
> >
> > 2) Non-Ascii characters (as a umlaut) are rejected by the parser when no encoding is
> > specified in the document.
> > 3) Non-Ascii characters are accepted when the document uses the "ISO-8859-1" (or
> > probably some other) encoding. They are left untouched by libxml.
> >
> > While 1) and 2) are good things, 3) is at least questionable, I think. But it gets
> > *really* bad, when 1) and 3) come together: Suppose you have a document that declares
> > to use the ISO-8859-1 encoding, that contains some special characters of this charset,
> > but at the same time contains numeric character references that can or can not be
> > represented in the chosen encoding. Either way, libxml will leave the special characters
> > untouched and translate the numeric character reference to utf8. So you could end up
> > having characters that use different encodings in a single string. That's what I consider
> > bad.
>
> yes, you are right ...
> Currently ISO Latin is handled differently than all other encodings in the
> sense that it's not converted internally to UTF8. This is bad and mostly I
> kept it that way for historical reasons. This opens a lot of problems, like
> not being able to represent characters inserted as charrefs that fall outside
> the ISO-Latin range; this is simply impossible right now.
Okay, this is fixed in libxml2-2.2.0.
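
For readers who want to see the old problem concretely, here is a minimal Python sketch. It is not libxml code; the function name `expand_charrefs_only` and the sample string are made up for illustration. It mimics the behaviour Tobias described: bytes from an ISO-8859-1 document are left untouched, while numeric character references are replaced by UTF-8 bytes.

```python
import re

# Matches decimal (&#8364;) and hexadecimal (&#x20AC;) character references.
CHARREF_BYTES = re.compile(rb"&#([0-9]+|[xX][0-9a-fA-F]+);")


def expand_charrefs_only(raw: bytes) -> bytes:
    """Hypothetical stand-in for the old behaviour: keep the raw input bytes
    as they are, but turn numeric character references into UTF-8 bytes."""

    def replace(match):
        digits = match.group(1).decode("ascii")
        if digits[0] in "xX":
            code_point = int(digits[1:], 16)
        else:
            code_point = int(digits)
        return chr(code_point).encode("utf-8")

    return CHARREF_BYTES.sub(replace, raw)


# An a-umlaut as one ISO-8859-1 byte, plus a euro sign written as a charref.
latin1_input = "ä".encode("iso-8859-1") + b" costs &#8364;5"
mixed = expand_charrefs_only(latin1_input)

# The result holds one Latin-1 byte and one three-byte UTF-8 sequence,
# so it is not valid UTF-8 as a whole.
try:
    mixed.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(mixed, is_valid_utf8)
```

No single encoding decodes `mixed` correctly: read as UTF-8 it fails on the first byte, and read as ISO-8859-1 the euro sign turns into three unrelated characters.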
> > a) translate everything to utf8. If libxml does not know how to translate a specific
> > charset, return an error. Let the library user supply conversion functions.
Done: everything is now translated to UTF-8 internally. I have written a
document describing how encoding support is done (and the choices I made):
http://xmlsoft.org/encoding.html
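
As a rough illustration of option a) above, the following Python sketch converts input to a single internal UTF-8 representation, rejects unknown encodings, and lets the caller register conversion functions. It is only a sketch under stated assumptions, not how libxml implements it: `to_internal_utf8` and `user_decoders` are hypothetical names, and a real parser expands character references while parsing rather than with a regular expression.

```python
import codecs
import re

CHARREF = re.compile(r"&#([0-9]+|[xX][0-9a-fA-F]+);")

# Hypothetical registry of user-supplied converters:
# lower-cased encoding name -> function taking bytes and returning str.
user_decoders = {}


def to_internal_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode raw input using its declared encoding, expand numeric character
    references, and return everything as UTF-8. Raise ValueError if no
    converter is known for the declared encoding."""
    name = declared_encoding.lower()
    if name in user_decoders:
        text = user_decoders[name](raw)
    else:
        try:
            codecs.lookup(name)
        except LookupError:
            raise ValueError(f"unsupported encoding: {declared_encoding}") from None
        text = raw.decode(name)

    def replace(match):
        digits = match.group(1)
        if digits[0] in "xX":
            return chr(int(digits[1:], 16))
        return chr(int(digits))

    # After decoding, literal characters and charrefs are both plain code
    # points, so the final buffer uses one encoding throughout.
    return CHARREF.sub(replace, text).encode("utf-8")


sample = "ä".encode("iso-8859-1") + b" costs &#8364;5"
internal = to_internal_utf8(sample, "ISO-8859-1")
print(internal.decode("utf-8"))
```

A caller with an unusual charset could add an entry such as `user_decoders["x-my-charset"] = my_decode_function` (both names hypothetical) before parsing. Without such an entry, the function reports an error instead of passing the bytes through unchanged.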
Daniel
-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail majordomo@xmlsoft.org
This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:22 EDT