Re: [xml] Re: I18N Issues.


From: Y. Cheng (ycheng@phi.sinica.edu.tw)
Date: Tue Feb 08 2000 - 08:11:13 EST


On Tue, Feb 08, 2000 at 09:39:08AM +0100, Daniel Veillard wrote:
[deleted]
> On Tue, Feb 08, 2000 at 01:41:11PM +0800, Y. Cheng wrote:
[deleted]
> > because UTF-8 is a multi-byte encoding (it has 1-byte, 2-byte and 3-byte
> > characters), so one byte per xmlChar will be enough. (Well, I don't know
> > the history of libxml.)
> I have been working on this this weekend; I already propagated
> some changes to the W3C CVS base (http://dev.w3.org , module XML).
> The UNICODE stuff is already being dropped. But I need to change
> the parser to use the extended macro even in case of UTF-8 encoding.

I read some of your modifications to parser.c, especially NEXT.
        You may want to roll some of them back (see my idea below).

> > Then,
> > ------ encoding.c:268 (xmlDetectCharEncoding) ------
> > if ((in[0] == 0x3C) && (in[1] == 0x3F) &&
> > (in[2] == 0x78) && (in[3] == 0x6D))
> > return(XML_CHAR_ENCODING_UTF8);
> > ^^^^^^^^^^^^^^^^^^^^^^
> > return(XML_CHAR_ENCODING_NONE);
> > ----------------------------------------------------
> > I think it should return XML_CHAR_ENCODING_NONE, because there are many
> > encodings that are compatible with UTF-8 in the ASCII range. Take zh_TW.Big5
> > as an example: if the first bit of a byte is 0, it is a one-byte character;
> > else (the first bit is 1) it is the first byte of a two-byte
> > Chinese character.
> I disagree, by the specification, if there is no encoding defined, then
> we are using either UTF-8 or UTF-16 (the latter being detected by reading the
> first 4 bytes). After that, if there is an encoding defined, it will be time
> to switch over to that one. Look at the new version of parser.c in the database,
> especially the new function xmlNextChar(). If no encoding is defined it will
> raise an error if

Agree.

> > Later the encoding="xxx" declaration will do the real job (of switching
> > to the real encoding).
> If needed, which for ISO-LATIN-X is not the case.

Yes, encoding="xxx" is optional in XML.
        (Is this what you meant?)

> > With this, when the encoding changes from XML_CHAR_ENCODING_NONE to
> > some specific encoding (say, EUC-JP), we need a function to
> > transform the existing buffer from the original data to real UTF-8
> > (maybe xmlSwitchEncoding, or one called by xmlSwitchEncoding). But this
> > transformation only accepts a change from XML_CHAR_ENCODING_NONE
> > to some other encoding. Once the transformation is done, no more
> > transformation is necessary (there is no way to switch encoding
> > in the middle of an XML document, right?), so all the other mechanisms
> > will work.
> yep something like that is needed.

Do you think we should change from

isolat1ToUTF8(xxx *out, int outlen, xxx *in, int inlen)
        to
isolat1ToUTF8(xxx *out, int outlen, xxx *in, int *inlen)

where, on return, *inlen gives how many bytes are left unconverted in "in"?
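
To make the second signature concrete, here is a rough sketch. The name
sketch_isolat1_to_utf8, the concrete parameter types and the "return the
number of bytes written" convention are my assumptions for illustration
only; this is not the code in encoding.c.

```c
/*
 * Hypothetical sketch, not the libxml function: convert ISO-8859-1 to UTF-8.
 * On entry *inlen is the number of input bytes available; on return it is
 * the number of input bytes left unconverted (0 if everything fit in out).
 * Returns the number of bytes written to out.
 */
int sketch_isolat1_to_utf8(unsigned char *out, int outlen,
                           const unsigned char *in, int *inlen)
{
    int written = 0;
    int consumed = 0;

    while (consumed < *inlen) {
        unsigned char c = in[consumed];
        /* Latin-1 0x00-0x7F maps to one UTF-8 byte, 0x80-0xFF to two. */
        int need = (c < 0x80) ? 1 : 2;

        if (written + need > outlen)
            break;                      /* no room: leave this byte for later */
        if (need == 1) {
            out[written++] = c;
        } else {
            out[written++] = (unsigned char)(0xC0 | (c >> 6));
            out[written++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        consumed++;
    }
    *inlen -= consumed;                 /* report how many bytes are left */
    return written;
}
```

The caller can then keep the last *inlen bytes of "in" and pass them again
once more output space is available.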

> > And the last thing: the encoding transformation functions in
> > encoding.c (say UTF8ToUTF16) can't cope when not enough bytes are
> > given (say only the first byte of a three-byte character has been read).
> > For a thread-safe library, I suggest we add a return value to let
> > UTF8ToUTF16 say that some bytes were not processed, and store these
> > bytes in a new field of "struct _xmlParserInput".
> I'm not entirely convinced it's the place where this should be
> stored. I would rather add it to the input buffer. After all at one
> time multiple entities may be opened and we may have residual bytes
> from each of them.

You mean "struct _xmlParserInputBuffer"?
If so, I couldn't agree more, and sorry for not noticing this.
 
I think it will be more modular this way.
But with this, we need a function to report that
there are some bytes left (which can't be transformed
into a single UTF-8 character) even though we have already hit EOF.
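
A minimal sketch of that idea follows. The struct and function names are
hypothetical stand-ins (not the real struct _xmlParserInputBuffer), and
continuation bytes are not validated here; the point is only where the
residual bytes live and how the EOF case gets reported.

```c
#include <string.h>

/* Hypothetical stand-in for the per-input buffer discussed above. */
struct sketch_input_buffer {
    unsigned char pending[4];   /* bytes of an unfinished UTF-8 sequence */
    int npending;               /* how many bytes of pending[] are in use */
};

/* Length announced by a UTF-8 lead byte, or 0 if it cannot start a sequence. */
static int sketch_utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Scan one chunk. Returns how many leading bytes form complete sequences
 * (or -1 on an invalid lead byte) and stashes a trailing incomplete
 * sequence in buf->pending, to be prepended to the next chunk by the caller.
 */
int sketch_split_complete(struct sketch_input_buffer *buf,
                          const unsigned char *in, int inlen)
{
    int i = 0;

    buf->npending = 0;
    while (i < inlen) {
        int len = sketch_utf8_seq_len(in[i]);

        if (len == 0)
            return -1;
        if (i + len > inlen) {          /* sequence runs past the chunk */
            buf->npending = inlen - i;
            memcpy(buf->pending, in + i, (size_t)buf->npending);
            break;
        }
        i += len;
    }
    return i;
}

/* Nonzero if EOF was reached while bytes of a sequence are still pending. */
int sketch_truncated_at_eof(const struct sketch_input_buffer *buf, int at_eof)
{
    return at_eof && buf->npending > 0;
}
```

Since each open entity would own its own buffer, residual bytes from
several entities cannot get mixed up.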

I plan to use iconv to convert many encodings to UTF-8.
As far as I know, there are two stand-alone implementations of libiconv
(glibc also has one, but the stand-alone ones help if you don't want to
depend on glibc).
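
For reference, a small sketch of how the standard iconv_open/iconv/iconv_close
calls could be wrapped. The wrapper name sketch_to_utf8 and its return
convention are assumptions of mine, not an existing libxml API. Per POSIX,
iconv() fails with EINVAL on an incomplete sequence at the end of the input
and with E2BIG when the output is full, and in both cases it updates the
"bytes left" count, which fits the residual-bytes idea above.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert inlen bytes from the encoding named by
 * "from" into UTF-8. Returns the number of bytes written to out, or -1 on
 * a hard error. *inleft receives the number of input bytes not consumed
 * (an incomplete trailing sequence, or input that did not fit in out).
 */
long sketch_to_utf8(const char *from, const char *in, size_t inlen,
                    char *out, size_t outlen, size_t *inleft)
{
    iconv_t cd = iconv_open("UTF-8", from);
    char *inp = (char *)in;     /* iconv() takes char **, input is only read */
    char *outp = out;
    size_t outleft = outlen;
    long result;

    if (cd == (iconv_t)-1)
        return -1;              /* conversion not supported */
    *inleft = inlen;
    if (iconv(cd, &inp, inleft, &outp, &outleft) == (size_t)-1 &&
        errno != EINVAL && errno != E2BIG)
        result = -1;            /* e.g. EILSEQ: invalid input sequence */
    else
        result = (long)(outlen - outleft);
    iconv_close(cd);
    return result;
}
```

Opening a descriptor per call keeps the sketch short; a real parser would
presumably keep one descriptor per input for the lifetime of the entity.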

[deleted]
> export CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
[deleted]

I got it.

If you think the direction is right, I will start coding.
        (I will modify encoding.c first.)

Yuan-Chen Cheng

----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:01 EDT