[xml] Re: I18N Issues.


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Tue Feb 08 2000 - 03:39:08 EST


[ Cc'ing the list since this may certainly impact other people, Daniel]

On Tue, Feb 08, 2000 at 01:41:11PM +0800, Y. Cheng wrote:
> Hi,
>
> I am pretty new to libxml ;)
> I hope to make libxml I18N-enabled. (Well, for me, Chinese-enabled ;)
>
> As stated in TODO:
>
> I plan to keep everything internally as UTF-8
>
> I read some parts of libxml, mainly xmlIO.c; the following
> are places that I think need modification.
>
> ----- tree.h:52 -----
> #ifdef UNICODE
> typedef unsigned short xmlChar;
> #else
> typedef unsigned char xmlChar;
> #endif
> ---------------------
>
> Because UTF-8 is a multi-byte encoding (it has 1-byte, 2-byte and 3-byte
> characters), one byte per xmlChar will be enough. (Well, I don't know
> the history of libxml.)

  I have been working on this over the weekend, and I have already propagated
some changes to the W3C CVS base (http://dev.w3.org , module XML).
The UNICODE stuff is already being dropped. But I need to change
the parser to use the extended macros even in the case of UTF-8 encoding.
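
  To illustrate why a one-byte xmlChar is sufficient for UTF-8, here is a
minimal sketch of reading one character from a byte buffer. The helper name
utf8_decode is hypothetical and is not part of libxml; it handles lead bytes
for 1- to 4-byte sequences and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char xmlChar;  /* one byte per xmlChar, as in the non-UNICODE branch */

/*
 * Hypothetical helper (not a libxml function): decode the UTF-8 character
 * that starts at cur[0], with len bytes available.
 * On success, stores the number of bytes consumed in *used and returns the
 * code point. Returns -1 on a malformed or truncated sequence.
 */
static long utf8_decode(const xmlChar *cur, size_t len, size_t *used)
{
    long c;
    size_t n, i;

    assert(cur != NULL && used != NULL);
    if (len == 0)
        return -1;

    /* The lead byte tells us how many bytes the character occupies. */
    if (cur[0] < 0x80)                { c = cur[0];        n = 1; }
    else if ((cur[0] & 0xE0) == 0xC0) { c = cur[0] & 0x1F; n = 2; }
    else if ((cur[0] & 0xF0) == 0xE0) { c = cur[0] & 0x0F; n = 3; }
    else if ((cur[0] & 0xF8) == 0xF0) { c = cur[0] & 0x07; n = 4; }
    else return -1;  /* stray continuation byte or invalid lead byte */

    if (n > len)
        return -1;   /* not enough bytes for the whole character */

    /* Each continuation byte is 10xxxxxx and contributes 6 bits. */
    for (i = 1; i < n; i++) {
        if ((cur[i] & 0xC0) != 0x80)
            return -1;
        c = (c << 6) | (cur[i] & 0x3F);
    }
    *used = n;
    return c;
}
```

  The point is that the parser has to advance by a variable number of
xmlChar units per character, which is why the character-class macros need
to work on the decoded value rather than on a single byte.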

> Then,
>
> ------ encoding.c:268 (xmlDetectCharEncoding) ------
> if ((in[0] == 0x3C) && (in[1] == 0x3F) &&
> (in[2] == 0x78) && (in[3] == 0x6D))
> return(XML_CHAR_ENCODING_UTF8);
> ^^^^^^^^^^^^^^^^^^^^^^
> return(XML_CHAR_ENCODING_NONE);
> ----------------------------------------------------
> I think it should return XML_CHAR_ENCODING_NONE, because there are many
> encodings that are compatible with UTF-8 in the ASCII range. Take zh_TW.Big5
> as an example: if the first bit of a char is 0, then it's a one-byte character,
> else (the first bit of a char is 1) it's the first byte of a two-byte
> Chinese character.

  I disagree: by the specification, if there is no encoding defined, then
we are using either UTF-8 or UTF-16 (the latter being detected by reading the
first 4 bytes). After that, if there is an encoding defined, it will be time
to switch on to this one. Look at the new version of parser.c in the database,
especially the new function xmlNextChar(). If no encoding is defined it will
raise an error if

> Later the encoding="xxx" will do the real job (of changing the encoding
> into the real encoding).

  If needed, which for ISO-LATIN-X is not the case.
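
  The default rule discussed above (without an encoding declaration the
document is UTF-8 or UTF-16, decided from the first bytes) can be sketched
as follows. The enum and function names are hypothetical, not the ones in
encoding.c, and the sketch deliberately ignores UCS-4 and EBCDIC, which a
complete detector would also have to consider.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not libxml's own enum. */
typedef enum {
    SKETCH_ENC_UTF8,
    SKETCH_ENC_UTF16LE,
    SKETCH_ENC_UTF16BE
} sketch_encoding;

/*
 * Guess the initial encoding from the first bytes of the document.
 * Anything that is not recognisably UTF-16 is treated as UTF-8 until
 * an encoding declaration says otherwise.
 */
static sketch_encoding sketch_detect(const unsigned char *in, size_t len)
{
    assert(in != NULL);

    /* Byte order mark: FF FE is little-endian, FE FF is big-endian. */
    if (len >= 2) {
        if (in[0] == 0xFF && in[1] == 0xFE) return SKETCH_ENC_UTF16LE;
        if (in[0] == 0xFE && in[1] == 0xFF) return SKETCH_ENC_UTF16BE;
    }

    /* No BOM: look for "<?" encoded as two 16-bit units. */
    if (len >= 4) {
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return SKETCH_ENC_UTF16LE;
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return SKETCH_ENC_UTF16BE;
    }

    /* "<?xm" (3C 3F 78 6D) and everything else: assume UTF-8 for now. */
    return SKETCH_ENC_UTF8;
}
```

  With this approach a Big5 or EUC-JP document is still read as UTF-8 up to
the end of its XML declaration, which is safe because that part is plain
ASCII; the switch happens once encoding="xxx" has been parsed.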

> With this, as the encoding changes from XML_CHAR_ENCODING_NONE to
> some specific encoding (well, say EUC-JP), we need a function to
> transform the existing buffer from the original data to real UTF-8
> (maybe xmlSwitchEncoding, or one called by xmlSwitchEncoding). But this
> transformation only accepts a change from XML_CHAR_ENCODING_NONE
> to some other encoding. Once the transformation is done, no further
> transformation is necessary (there is no way to switch encoding
> in the middle of an XML document, right?) so all other mechanisms will
> work.

   yep something like that is needed.

> And the last thing: the encoding transformation functions in
> encoding.c (say UTF8ToUTF16) can't cope when not enough bytes are
> given (say only the first byte of a three-byte character has been read in).
> For a thread-safe library, I suggest we add a return value to let
> UTF8ToUTF16 say that some bytes were not processed, and store these
> bytes in a new field of "struct _xmlParserInput".

  I'm not entirely convinced that's the place where this should be
stored. I would rather add it to the input buffer. After all, multiple
entities may be open at one time, and we may have residual bytes
from each of them.
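
  As a rough sketch of what keeping the residual bytes with each input
buffer could look like: the structure and function names below are
hypothetical and exist neither in xmlIO.c nor in encoding.c. The function
only inspects lead bytes to find where the last complete character ends; it
does not validate continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-input-buffer state: the bytes of an unfinished character. */
typedef struct {
    unsigned char residual[4];
    size_t nresidual;
} sketch_inbuf;

/* Length of a UTF-8 sequence given its lead byte, or 0 if it is not a lead byte. */
static size_t sketch_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Return how many leading bytes of in[0..len) form complete characters.
 * Trailing bytes that belong to an unfinished character are copied into
 * buf->residual so they can be prepended to the next chunk read for the
 * same entity. Returns (size_t)-1 if an invalid lead byte is found.
 */
static size_t sketch_split_complete(sketch_inbuf *buf,
                                    const unsigned char *in, size_t len)
{
    size_t pos = 0;

    assert(buf != NULL && in != NULL);
    buf->nresidual = 0;

    while (pos < len) {
        size_t n = sketch_seq_len(in[pos]);
        if (n == 0)
            return (size_t)-1;
        if (pos + n > len) {
            /* Unfinished character: at most 3 bytes are left over. */
            buf->nresidual = len - pos;
            memcpy(buf->residual, in + pos, buf->nresidual);
            break;
        }
        pos += n;
    }
    return pos;
}
```

  Because the state lives in the buffer and not in a global, each open
entity carries its own leftover bytes, and a converter can be handed only
the first `pos` bytes of every chunk.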

> I am willing to write code. I just want to know if you
> think my direction is correct.

  Mostly, yes. But make sure you use the codebase from dev.w3.org

export CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
cvs login
(enter anonymous)
then
cvs -z9 get XML

> ps. I am not sure whether to post this to mail-list.
> It's no problem if you want to put this on mail-list.

 Ok, done

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:01 EDT