[xml] Char encoding again

From: Tobias Peters (t-peters@hrz2.uni-oldenburg.de)
Date: Tue Jun 27 2000 - 09:07:41 EDT


I am using the tree interface in libxml2 and have observed the following parser
behavior:

1) Numeric character references (&#...;) are translated to UTF-8.
2) Non-ASCII characters (such as a-umlaut) are rejected by the parser when no encoding is
   specified in the document.
3) Non-ASCII characters are accepted when the document declares the "ISO-8859-1" (or
   probably some other) encoding. They are left untouched by libxml.

While 1) and 2) are good things, 3) is at least questionable, I think. But it gets
*really* bad when 1) and 3) come together: Suppose you have a document that declares
the ISO-8859-1 encoding and contains some special characters of this charset, but also
contains numeric character references that may or may not be representable in the
declared encoding. Either way, libxml will leave the special characters untouched and
translate the numeric character references to UTF-8. So you can end up with characters
in different encodings within a single string. That's what I consider bad.
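
To make the problem concrete, here is a small Python sketch. It does not call libxml
at all; it only mimics the behavior described above by leaving the document bytes
untouched and replacing each numeric character reference with the UTF-8 bytes of the
referenced code point. The names CHAR_REF and simulate_mixed_content and the sample
text are made up for illustration.

```python
import re

# Matches decimal (&#228;) and hexadecimal (&#xE4;) character references.
CHAR_REF = re.compile(rb"&#(x[0-9A-Fa-f]+|[0-9]+);")


def simulate_mixed_content(raw: bytes) -> bytes:
    """Mimic the described behavior (an assumption, not libxml code):
    existing bytes stay as they are, character references become UTF-8."""
    def expand(match):
        body = match.group(1).decode("ascii")
        code_point = int(body[1:], 16) if body.startswith("x") else int(body)
        return chr(code_point).encode("utf-8")
    return CHAR_REF.sub(expand, raw)


# An a-umlaut stored as ISO-8859-1 (single byte 0xE4), followed by a
# reference to the euro sign, which ISO-8859-1 cannot represent.
document_text = "\u00e4 &#8364;".encode("iso-8859-1")
mixed = simulate_mixed_content(document_text)
```

Here `mixed` holds the ISO-8859-1 byte 0xE4 followed by a three-byte UTF-8 sequence,
so the string as a whole is not valid UTF-8, and reading it as ISO-8859-1 garbles the
referenced character.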

I see two possible ways to fix this:
a) translate everything to UTF-8. If libxml does not know how to translate a specific
   charset, return an error. Let the library user supply conversion functions.
b) try to translate everything to the given encoding. If this is impossible, return
   an error.
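
The two options can be sketched in Python as follows. This is only an illustration of
the intended behavior under simplifying assumptions, not libxml code and not a real
XML parser: it ignores markup and only handles numeric character references. The
function names to_utf8 and to_declared_encoding and the `converters` parameter are
hypothetical.

```python
import re

_CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")


def _expand_refs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def expand(match):
        body = match.group(1)
        return chr(int(body[1:], 16) if body.startswith("x") else int(body))
    return _CHAR_REF.sub(expand, text)


def to_utf8(raw: bytes, declared_encoding: str, converters=None) -> bytes:
    """Option a): everything ends up as UTF-8.

    `converters` maps an encoding name to a user-supplied function that
    turns raw bytes into text. An unknown encoding is an error."""
    converters = converters or {}
    if declared_encoding in converters:
        text = converters[declared_encoding](raw)
    else:
        try:
            text = raw.decode(declared_encoding)
        except LookupError as exc:
            raise ValueError(
                f"no converter for encoding {declared_encoding!r}") from exc
    return _expand_refs(text).encode("utf-8")


def to_declared_encoding(raw: bytes, declared_encoding: str) -> bytes:
    """Option b): everything ends up in the declared encoding.

    Raises UnicodeEncodeError if a referenced character cannot be
    represented in that encoding."""
    text = _expand_refs(raw.decode(declared_encoding))
    return text.encode(declared_encoding)
```

With a), an a-umlaut plus a euro-sign reference in an ISO-8859-1 document comes out as
one consistent UTF-8 string. With b), the same input is an error, because ISO-8859-1
has no euro sign.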

I would prefer a).
Maybe this is already implemented, but I have not found it yet.
Comments welcome.

Tobias

----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:16 EDT