Re: [xml] How to avoid character entities

Date view Thread view Subject view Author view

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Mon Dec 13 1999 - 08:54:52 EST


On Mon, Dec 13, 1999 at 01:58:19PM +0100, Michael Fischer v. Mollard wrote:
>
> Just a short question: I parse an iso latin1 encoded xml file, and all
> special characters like 'ü' are replaced by a CharRefEntity &uxxx; . How
> do I avoid this conversion?

  Hum, hard problem. I18N is clearly the weakest point of libxml; I'm
not a specialist, and help in this area would be more than welcome.
Currently, since I don't handle encodings cleanly, the output method
reverts to a safe but annoying way of saving non-ASCII data, i.e. using
character references.

  Going a bit deeper in the analysis:
    xmlNodeDump calls xmlEncodeEntitiesReentrant to prepare the
                node content output stream
    xmlEncodeEntitiesReentrant currently outputs character references
                for anything outside the [0x20 - 0x80] range.

My take is that:

  You should declare an encoding in the XML declaration, i.e.
start your documents with:
<?xml version='1.0' encoding='ISO-8859-1'?>

And, depending on the encoding of your document, xmlEncodeEntitiesReentrant
should check how to encode that out-of-range data.

There is a function in encoding.c called xmlParseCharEncoding()
which should be used to detect the encoding and take the appropriate
action, but it is not currently called.
There are also basic encoding/decoding functions to/from UTF-8 for
ISO-Latin-1 and UTF-16 in encoding.c.

There are then two possibilities:
  - keep the internal encoding ISO-Latin-1, or whatever encoding the
    initial entity was using. This is difficult if we need to handle
    characters encoded on more than one byte (xmlChar is currently
    defined as unsigned char), since in that case the parser might have
    trouble dealing with the input (but it would work fine for the
    ISO-Latin class of encodings).
  - convert the entity on the fly to the UTF-8 used internally. This is
    what the encoder field in xmlParserInputBuffer is supposed to do:
    when data is pushed into the input buffer it is first converted to
    UTF-8 before parsing.

 Both methods have advantages and drawbacks; the second one must
clearly be supported to get any "XML conformant" label, since we must
be able to parse UTF-16 at some point.
 So there is some framework for I18N scattered around the code, but
it's not yet plugged in and tested. I'm afraid that until this area is
cleaned up, avoiding character-reference encoding in the output would
be an ugly hack.

  Clearly this is an area where I need help and feedback,

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:29:54 EDT