From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Tue Jun 27 2000 - 10:34:23 EDT
On Tue, Jun 27, 2000 at 03:07:41PM +0200, Tobias Peters wrote:
>
> I am using the tree interface in libxml2. I have observed the following behaviour in
> the parser:
Hi Tobias,
A preliminary question: did you try the CVS version? I have made
serious improvements in that area.
> 1) Numeric character references (&#...;) are translated to utf8.
>
> 2) Non-ASCII characters (such as an a-umlaut) are rejected by the parser when no
> encoding is specified in the document.
> 3) Non-ASCII characters are accepted when the document uses the "ISO-8859-1" (or
> probably some other) encoding. They are left untouched by libxml.
>
> While 1) and 2) are good things, 3) is at least questionable, I think. But it gets
> *really* bad when 1) and 3) come together: Suppose you have a document that declares
> the ISO-8859-1 encoding and contains some special characters of this charset, but at
> the same time contains numeric character references that may or may not be
> representable in the chosen encoding. Either way, libxml will leave the special
> characters untouched and translate the numeric character references to UTF-8. So you
> could end up with characters in different encodings in a single string. That's what I
> consider bad.
Yes, you are right.
Currently ISO Latin is handled differently from all other encodings, in the
sense that it is not converted internally to UTF-8. This is bad, and I mostly
kept it that way for historical reasons. It causes a lot of problems, such as
representing characters that are inserted as charrefs but fall outside the
ISO-Latin range: this is simply impossible right now.
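
To make the mixing problem concrete, here is a minimal sketch of the
ISO-8859-1 to UTF-8 step that is currently skipped. The helper name
latin1_to_utf8() is hypothetical and is not part of libxml; it only
illustrates that every Latin-1 byte above 0x7F becomes a two-byte UTF-8
sequence, so an untouched 0xE4 byte and a charref expanded to 0xC3 0xA4 can
denote the same character with different bytes.

```c
#include <stddef.h>

/* Hypothetical helper, not a libxml function.
 * Converts ISO-8859-1 bytes to a NUL-terminated UTF-8 string.
 * Returns the number of bytes written (excluding the terminator),
 * or -1 if the output buffer is too small. */
int latin1_to_utf8(const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t outcap)
{
    size_t o = 0;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            /* ASCII range: identical in both encodings. */
            if (o + 1 >= outcap)
                return -1;
            out[o++] = c;
        } else {
            /* 0x80..0xFF maps to U+0080..U+00FF: two UTF-8 bytes. */
            if (o + 2 >= outcap)
                return -1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    if (o >= outcap)
        return -1;
    out[o] = '\0';
    return (int)o;
}
```

Once all input goes through a conversion like this, a charref and a literal
character end up with the same internal bytes.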
> a) translate everything to utf8. If libxml does not know how to translate a specific
> charset, return an error. Let the library user supply conversion functions.
> b) try to translate everything to the given encoding. If this is impossible, return
> an error.
>
> I would prefer a).
> Maybe this is already implemented, but I have not found it yet.
> Comments welcome.
I expect to do a).
People who want to keep the internal string in ISO-Latin instead of UTF-8
will still be able to do that by registering a specific encoding handler
for that encoding using xmlRegisterCharEncodingHandler():
http://xmlsoft.org/gnome-xml-encoding.html#XMLREGISTERCHARENCODINGHANDLER
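
As a rough sketch of what such a handler could do: to keep ISO-Latin bytes
as they are, the conversion callback only has to copy its input through
unchanged. The function name passthrough_convert() is hypothetical, and the
callback shape (out, outlen, in, inlen, with the lengths updated in place) is
an assumption modelled on the input/output callbacks of later libxml2
releases; the signature in the current release may differ. The wiring to
xmlRegisterCharEncodingHandler() is left out because it depends on the
library version.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical pass-through conversion callback (assumed signature).
 * Copies as many bytes as fit from in to out without changing them.
 * On return, *inlen is the number of bytes consumed and *outlen the
 * number of bytes produced. Returns the byte count, or -1 on bad
 * arguments. */
int passthrough_convert(unsigned char *out, int *outlen,
                        const unsigned char *in, int *inlen)
{
    if (out == NULL || outlen == NULL || in == NULL || inlen == NULL)
        return -1;
    if (*inlen < 0 || *outlen < 0)
        return -1;

    /* Copy no more than the output buffer can hold. */
    int n = (*inlen < *outlen) ? *inlen : *outlen;

    memcpy(out, in, (size_t)n);
    *inlen = n;
    *outlen = n;
    return n;
}
```

Note that with a pass-through handler the internal strings are no longer
UTF-8, so charrefs outside the ISO-Latin range remain unrepresentable.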
Daniel
--
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail majordomo@xmlsoft.org
This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:16 EDT