Re: [xml] Re: I18N Issues.

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Tue Feb 08 2000 - 11:02:23 EST


On Tue, Feb 08, 2000 at 09:11:13PM +0800, Y. Cheng wrote:
> On Tue, Feb 08, 2000 at 09:39:08AM +0100, Daniel Veillard wrote:
> > > Later the encoding="xxx" will do the real job (of changing the encoding
> > > into the real encoding).
> > If needed, which for ISO-LATIN-X is not the case.
>
> Yes, encoding="xxx" is optional in XML.
> (Is this what you wanted to express?)

  And if it is not expressed we have to assume UTF-8 (or UTF-16); that's
what the spec says.
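
  A minimal sketch of that default rule, assuming only the byte order
mark is inspected. The enum and function names are hypothetical and are
not part of libxml; UCS-4 and other autodetection cases are deliberately
ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only, not libxml identifiers. */
enum sketch_encoding { SKETCH_UTF8, SKETCH_UTF16LE, SKETCH_UTF16BE };

/*
 * Pick the encoding to assume when the document carries no
 * encoding="..." declaration: UTF-16 if a byte order mark is present,
 * UTF-8 otherwise.
 */
enum sketch_encoding sketch_default_encoding(const unsigned char *buf,
                                             size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return SKETCH_UTF16LE;   /* BOM stored little-endian */
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return SKETCH_UTF16BE;   /* BOM stored big-endian */
    }
    return SKETCH_UTF8;              /* no BOM: assume UTF-8 */
}
```

  A caller would run this on the first bytes of the input before any
declaration has been parsed, and only switch away from the result if an
explicit encoding is found later.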

> > > With this, as the encoding changes from XML_CHAR_ENCODING_NONE to
> > > some specific encoding (say EUC-JP), we need a function to
> > > transform the existing buffer from the original data to real UTF-8
> > > (maybe xmlSwitchEncoding, or something called by xmlSwitchEncoding).
> > > But this transformation only accepts a change from
> > > XML_CHAR_ENCODING_NONE to some other encoding. Once the
> > > transformation is done, no more transformation is necessary (there
> > > is no way to switch encoding in the middle of an XML document,
> > > right?), so all other mechanisms will work.
> > Yep, something like that is needed.
>
> Do you think we should change from
>
> isolat1ToUTF8(xxx *out, int outlen, xxx *in, int inlen)
> to
> isolat1ToUTF8(xxx *out, int outlen, xxx *in, int *inlen)
>
> where the value stored in *inlen on output is how many bytes are left
> unread in the input.

 isolat1ToUTF8 (and this class of functions) returns the number of bytes
written, or -1 for lack of space. So there is already an error condition,
but I agree that the upper layer pushing a buffer which doesn't end
on a character boundary should not be considered an error.
 For compatibility with other similar APIs, I would rather have *inlen
return the actual number of bytes read. In that case we could also stop
generating an error if outlen is too small. It's better to unify the
handling of the two "error" cases.
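
 A minimal sketch of that calling convention, assuming *inlen carries
the available input length on entry and the number of bytes actually
read on return. The function name is hypothetical; this is not the
existing isolat1ToUTF8, only an illustration of the proposed behaviour.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch, not the libxml function: convert ISO-8859-1
 * input to UTF-8.
 *   out, outlen : output buffer and its capacity in bytes
 *   in, *inlen  : input buffer; *inlen is the number of bytes available
 *                 on entry and the number of bytes consumed on return
 * Returns the number of bytes written to out. A full output buffer is
 * not an error: the conversion stops on a character boundary and the
 * caller resumes at in + *inlen.
 */
int latin1_to_utf8_sketch(unsigned char *out, int outlen,
                          const unsigned char *in, int *inlen)
{
    int written = 0;
    int consumed = 0;

    assert(out != NULL && in != NULL && inlen != NULL);

    while (consumed < *inlen) {
        unsigned char c = in[consumed];
        int need = (c < 0x80) ? 1 : 2;   /* UTF-8 bytes for this char */

        if (written + need > outlen)
            break;                       /* no room: stop, do not fail */
        if (need == 1) {
            out[written++] = c;
        } else {
            out[written++] = (unsigned char)(0xC0 | (c >> 6));
            out[written++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        consumed++;
    }
    *inlen = consumed;
    return written;
}
```

 With this shape the caller compares the returned *inlen with what it
passed in: if fewer bytes were read, it flushes the output and calls
again with the remaining input, so a short output buffer and a short
input are handled by the same loop.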

> > > And the last thing: the encoding transformation functions in
> > > encoding.c (say UTF8ToUTF16) can't cope if not enough bytes are
> > > given (say only the first byte of a three-byte character has been
> > > read in). For a thread-safe library, I suggest we add a return value
> > > to let UTF8ToUTF16 say that some bytes were not processed, and store
> > > these bytes in a new field of "struct _xmlParserInput".
> > I'm not entirely convinced that's the place where this should be
> > stored. I would rather add it to the input buffer. After all, at any
> > one time multiple entities may be open and we may have residual bytes
> > from each of them.
>
> You mean "struct _xmlParserInputBuffer" ?
> If yes, I can't agree anymore and sorry not to notice this.

  I assume I should read "I can't agree more", right ?
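
  A minimal sketch of how the residual bytes could be found, assuming the
input is UTF-8 and the converter is handed arbitrary chunks. The helper
name is hypothetical and nothing like it exists in libxml today; it only
looks at the tail of the buffer and does not validate the rest.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return how many bytes at the end of buf belong
 * to a UTF-8 sequence that has started but is not yet complete
 * (0 if the buffer ends on a character boundary).
 */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back;

    assert(buf != NULL || len == 0);

    /* The lead byte of an unfinished sequence is at most 3 bytes back. */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                /* continuation byte: look further */
        if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            return 0;                /* ASCII or invalid lead: no carry */
        return (back < need) ? back : 0;
    }
    return 0;
}
```

  The caller would convert only the first len minus that count bytes,
keep the tail in a small per-input-buffer array, and prepend it to the
next chunk read. If the count is still nonzero once the input has
reached EOF, the data was truncated in the middle of a character.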

> I think it will be more modular this way.
> But with this, we need a function to report that
> there are some bytes left (which can't be transformed
> into a single UTF-8 character) but we have already reached EOF.
>
> I plan to use iconv to convert many encodings to UTF-8.
> As far as I know, there are two implementations of stand-alone libiconv
> (glibc also has one, but in case you don't want glibc).

  Yes, but it has to be optional. Libxml is fairly standalone right
now, and I would rather keep it that way, at least for the major encodings
required by the spec (ISO-Latin-x, UTF-8 and UTF-16 at least).

  I am changing the parser code a lot these days; if you could focus on
encoding.c for a couple of days, that would be easier.

   thanks for working on this,

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:30:01 EDT