From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Mon Dec 13 1999 - 08:54:52 EST
On Mon, Dec 13, 1999 at 01:58:19PM +0100, Michael Fischer v. Mollard wrote:
>
> Just a short question: I parse an iso latin1 encoded xml file, and all
> special characters like 'ü' are replaced by a CharRefEntity &uxxx; . How
> do I avoid this conversion?
Hum, hard problem. I18N is clearly the weakest point of libxml; I'm
not a specialist, and help in this area would be more than welcome.
Currently, since I don't handle encodings cleanly, the output method
reverts to a safe but annoying way of saving non-ASCII data, i.e. it
uses CharRefEntities.
Going a bit deeper into the analysis:
    xmlNodeDump calls xmlEncodeEntitiesReentrant to prepare the
node content for the output stream.
    xmlEncodeEntitiesReentrant currently outputs CharRefs for anything
outside the [0x20 - 0x80] range.
My take is that:
You should declare an encoding in the XML declaration, i.e.
start your documents with:
<?xml version='1.0' encoding='ISO-8859-1'?>
Then, depending on the encoding of your document,
xmlEncodeEntitiesReentrant should check how to encode the out-of-range data.
There is a function in encoding.c called xmlParseCharEncoding()
which should be used to detect the encoding and then take the appropriate
action, but it is not called currently.
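As an illustration of what that detection amounts to, here is a minimal, hypothetical name-to-encoding mapping (the enum values and function name are made up for this sketch; the real xmlParseCharEncoding() in encoding.c knows more names and aliases):

```c
#include <string.h>
#include <ctype.h>

/* Hypothetical simplified encoding detection: map the name found in the
 * <?xml ... encoding="..."?> declaration to an internal tag, so the
 * caller can pick the right converter.  Comparison is case-insensitive,
 * as encoding names in XML declarations are. */
typedef enum { ENC_ERROR, ENC_UTF8, ENC_UTF16, ENC_LATIN1 } char_enc;

static char_enc parse_char_encoding(const char *name)
{
    char upper[32];
    size_t i;
    for (i = 0; name[i] != '\0' && i < sizeof(upper) - 1; i++)
        upper[i] = (char)toupper((unsigned char)name[i]);
    upper[i] = '\0';
    if (strcmp(upper, "UTF-8") == 0)
        return ENC_UTF8;
    if (strcmp(upper, "UTF-16") == 0)
        return ENC_UTF16;
    if (strcmp(upper, "ISO-8859-1") == 0 ||
        strcmp(upper, "ISO-LATIN-1") == 0)
        return ENC_LATIN1;
    return ENC_ERROR;   /* unknown encoding name */
}
```

With something like this hooked into the parser, the save routines could know whether raw bytes above 0x7F are safe to emit as-is.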
There are also basic encoding/decoding functions to/from UTF-8 for
ISO-Latin-1 and UTF-16 in encoding.c.
There are then two possibilities:
- keep the internal encoding ISO-Latin-1, or whatever encoding the
initial entity was using. This is difficult if we need to handle
characters encoded on more than one byte (xmlChar is currently defined
as unsigned char), since in that case the parser might have trouble
dealing with the input (but it would work fine for the ISO-Latin class
of encodings).
- convert the entity on-the-fly to the UTF-8 used internally. This is
what the encoder field in xmlParserInputBuffer is supposed to do:
when data is pushed into the input buffer, it is first converted
to UTF-8 before parsing.
Both methods have advantages and drawbacks; the second one must clearly
be supported to get any "XML conformant" label, since we must be able to
parse UTF-16 at some point.
So there is some framework for I18N scattered around the code, but
it's not yet plugged in and tested. I'm afraid that until this area is
cleaned up, avoiding CharRef encoding in the output would be an ugly hack.
Clearly this is an area where I need help and feedback,
Daniel
--
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe

----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
To unsubscribe: echo "unsubscribe xml" | mail majordomo@xmlsoft.org
This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:29:54 EDT