Re: [xml] Char encoding again


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Tue Jun 27 2000 - 10:34:23 EDT

Next message: Petr Kozelka: "[xml] entity ref. in attribute value"
Previous message: Tobias Peters: "[xml] Char encoding again"
In reply to: Tobias Peters: "[xml] Char encoding again"
Next in thread: Daniel Veillard: "Re: [xml] Char encoding again"
Reply: Daniel Veillard: "Re: [xml] Char encoding again"

On Tue, Jun 27, 2000 at 03:07:41PM +0200, Tobias Peters wrote:
>
> I am using the tree interface in libxml2. I have made the following observations about
> the parser:

Hi Tobias,

A preliminary question: did you try the CVS version? I have made
serious improvements in that area.

> 1) Numeric character references (&#...;) are translated to utf8.
>
> 2) Non-Ascii characters (as a umlaut) are rejected by the parser when no encoding is
> specified in the document.
> 3) Non-Ascii characters are accepted when the document uses the "ISO-8859-1" (or
> probably some other) encoding. They are left untouched by libxml.
>
> While 1) and 2) are good things, 3) is at least questionable, I think. But it gets
> *really* bad when 1) and 3) come together: Suppose you have a document that declares
> to use the ISO-8859-1 encoding, that contains some special characters of this charset,
> but at the same time contains numeric character references that can or can not be
> represented in the chosen encoding. Either way, libxml will leave the special characters
> untouched and translate the numeric character reference to utf8. So you could end up
> having characters that use different encodings in a single string. That's what I consider
> bad.

Yes, you are right ...
Currently ISO Latin is handled differently from all other encodings in the
sense that it's not converted internally to UTF8. This is bad, and I mostly
kept it that way for historical reasons. It causes a lot of problems, such as
being unable to represent characters that are inserted as charrefs and fall
outside the ISO-Latin range; this is simply impossible right now.
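
To make the mixed-encoding problem concrete, here is a minimal sketch in plain C. It does not use libxml at all; `utf8_encode` and `utf8_is_valid` are hypothetical helper names made up for this illustration. The first function shows the bytes a charref such as &#228; turns into once translated to UTF8, and the second is a structural check that fails as soon as a raw ISO-Latin byte sits next to such a sequence in the same string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8, which is
 * what a numeric character reference ends up as in memory.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. out must have room for 4 bytes. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: return 1 if buf[0..len) has the lead-byte /
 * continuation-byte structure of UTF-8, else 0. This is a structural
 * check only; it does not reject every overlong or surrogate form. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t need, k;
        if (c < 0x80)                    need = 0;
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF4) need = 3;
        else return 0;                   /* stray continuation or bad lead */
        if (len - i - 1 < need)
            return 0;                    /* sequence cut off at end of buffer */
        for (k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;                /* expected a continuation byte */
        i += need + 1;
    }
    return 1;
}
```

With these helpers, a string holding the raw ISO-Latin byte 0xE4 (a umlaut) followed by the UTF8 form of &#8364; fails the check, because 0xE4 looks like the start of a three-byte sequence that never arrives. The same two characters both stored as UTF8 pass.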

> a) translate everything to utf8. If libxml does not know how to translate a specific
> charset, return an error. Let the library user supply conversion functions.
> b) try to translate everything to the given encoding. If this is impossible, return
> an error.
>
> I would prefer a).
> Maybe this is already implemented, but I have not found it yet.
> Comments welcome.

I expect to do a).
People who want to keep the internal string in ISO-Latin instead of UTF8
will still be able to do that by registering a specific encoding handler
for that encoding using xmlRegisterCharEncodingHandler():
http://xmlsoft.org/gnome-xml-encoding.html#XMLREGISTERCHARENCODINGHANDLER
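
As a rough illustration of what option a) amounts to for ISO-Latin input, here is a self-contained converter sketch. The function name `latin1_to_utf8` is hypothetical, and the argument order (output buffer and length first, then input buffer and length, both lengths updated in place) is only modelled on my understanding of the conversion callbacks in later libxml2 releases; the signature expected by the version discussed here may differ, and nothing below is registered with the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical converter: ISO-8859-1 bytes in, UTF-8 bytes out.
 * On entry  *inlen  = bytes available in `in`,
 *           *outlen = capacity of `out`.
 * On return *inlen  = bytes consumed,
 *           *outlen = bytes produced.
 * Returns the number of bytes produced, or -1 if `out` filled up before
 * all input was consumed (the counts still describe the converted part). */
int latin1_to_utf8(unsigned char *out, int *outlen,
                   const unsigned char *in, int *inlen)
{
    int cap, avail, o = 0, i = 0, rc = 0;

    assert(out != NULL && outlen != NULL && in != NULL && inlen != NULL);
    cap = *outlen;
    avail = *inlen;

    while (i < avail) {
        unsigned char c = in[i];
        int need = (c < 0x80) ? 1 : 2;   /* ASCII stays 1 byte, rest need 2 */

        if (cap - o < need) {            /* never write a partial sequence */
            rc = -1;
            break;
        }
        if (c < 0x80) {
            out[o++] = c;
        } else {
            /* Latin-1 code points equal their Unicode values (U+0080..U+00FF) */
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }

    *inlen = i;
    *outlen = o;
    return (rc < 0) ? rc : o;
}
```

Every ISO-Latin byte has a UTF8 representation, so this direction can never fail on the data itself, only on buffer space. The reverse direction (option b) has to return an error for any character above U+00FF, which is one reason a) is the easier choice.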

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org

This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:16 EDT