Re: [xml] A need for a function


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Mon Aug 28 2000 - 11:39:27 EDT


On Mon, Aug 28, 2000 at 06:13:13PM +0300, Tuomas Luttinen wrote:
> No, I didn't mean this. So I'll try to be clearer.

  Just to be sure:
    - make sure you use libxml-2.2.2
    - do read http://xmlsoft.org/encoding.html until you understand
      it fully *before* reading the rest of this e-mail ... I'm not
      joking, I'm willing to do free support but I don't want to
      repeat the same thing again and again!

> The WML compiler gets the WML document, uses libxml to build
> a DOM tree out of it and then runs through the tree node by node,
> changing tags into their one-byte binary equivalents and putting them
> together with the text contents of those nodes. A little example:
>
> Input:
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
> "http://www.wapforum.org/DTD/wml_1.1.xml">
> <wml>
> <card id="main" title="Tervehdys" newcontext="true">
> <p>Hyvää päivää maailma!</p>
> </card>
> </wml>

  Okay, this is a well-formed XML document in ISO-Latin-1.

What libxml-2.2 builds in memory is the following:
~/XML -> ./xmllint --noblanks --debug tst.xml
DOCUMENT
version=1.0
encoding=ISO-8859-1
standalone=true
  DTD(wml), PUBLIC -//WAPFORUM//DTD WML 1.1//EN, SYSTEM http://www.wapforum.org/DTD/wml_1.1.xml
  ELEMENT wml
    ELEMENT card
      ATTRIBUTE id
        TEXT
        content=main
      ATTRIBUTE title
        TEXT
        content=Tervehdys
      ATTRIBUTE newcontext
        TEXT
        content=true
      ELEMENT p
        TEXT
        content=Hyv#C3#A4#C3#A4 p#C3#A4iv#C3#A4#C3#A4 maailma!
   
  the last text node content is the UTF-8 encoded string for
   "Hyvää päivää maailma!"

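  To make the byte sequence above concrete: each ISO-Latin-1 byte at or
above 0x80 becomes two UTF-8 bytes, so 'ä' (0xE4) turns into 0xC3 0xA4.
Below is a minimal C sketch of that widening. latin1_to_utf8() is a
hypothetical name used for illustration only; it is not a libxml2
function, and libxml's own converter may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (NOT part of libxml2): convert inlen bytes of
 * ISO-8859-1 into UTF-8. Returns the number of bytes written to out,
 * or -1 if out is too small. No NUL terminator is appended.
 */
long latin1_to_utf8(const unsigned char *in, size_t inlen,
                    unsigned char *out, size_t outsize)
{
    size_t n = 0;
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < inlen; i++) {
        if (in[i] < 0x80) {
            /* ASCII range: copied unchanged, one byte. */
            if (n + 1 > outsize)
                return -1;
            out[n++] = in[i];
        } else {
            /* 0x80..0xFF: two bytes, 110000xx 10xxxxxx. */
            if (n + 2 > outsize)
                return -1;
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return (long)n;
}
```
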
> Output:
> \x01\x04j\x00\x7f\xe7U\x03main\x006\x03Tervehdys\x00#\x01`
> \x03Hyv\xc3\xa4\xc3\xa4 p\xc3\xa4iv\xc3\xa4\xc3\xa4 maailma!\x00
> \x01\x01\x01

  I have no idea how you managed to get this ... Well, I read
the WAP compression paper when it was just a draft and didn't really
enjoy it (crappy IMHO), but that's another problem.
  I'm sorry but this is definitely not libxml related :-)

> Well, now there's a little problem with the output, since the third byte
> ('j' or 0x6a) of the binary is equivalent to the XML clause
> encoding="ISO-8859-1", but as you can see, those Scandinavian characters
> are still in UTF-8 because they are just copied from the tree node. So
> what I need is to change UTF-8 characters to something else. (Note that
> I use ISO-8859-1 only as an example here, since my cellular doesn't
> happen to understand Russian, Greek or Arabic...) If that kind of
> function seems to fall outside libxml's territory, well, that's OK with
> me, but I'd rather use some encoding facilities in libxml, since
> depending on yet another library doesn't sound like a good idea.

  Seems to me that:
   - you have a compression algorithm which handles only ISO-Latin-1
   - libxml2 now generates UTF-8 strings in memory
   - you need to change your encoder to work on UTF-8 input instead
     of ISO-Latin-1

This can be done in 2 ways:
   - natively: assuming you have a table-based compressor, update it
     to handle UTF-8 input instead of ISO-Latin-1
   - or just convert back to ISO-Latin-1 (if possible) using
     UTF8Toisolat1(), which is part of libxml2
     
 The added benefit is that even if the input provides the
 same message in, say, UTF-16, you won't have to change a
 single line of code: libxml will handle it!
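
 A rough sketch of what converting back to ISO-Latin-1 involves, written
 by hand rather than taken from libxml2: utf8_to_latin1() is a
 hypothetical helper, and its name, signature and return convention are
 assumptions made for this example. In real code you would presumably
 call libxml2's own UTF8Toisolat1() instead; check its prototype in the
 libxml headers before relying on it. The "(if possible)" matters: any
 character outside ISO-Latin-1 has no representation there, so the
 sketch reports failure instead of guessing.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (NOT part of libxml2): convert a NUL-terminated
 * UTF-8 string to a NUL-terminated ISO-8859-1 string.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the input is malformed, contains a character above U+00FF, or out
 * is too small.
 */
long utf8_to_latin1(const unsigned char *in,
                    unsigned char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;

    while (*in != 0) {
        unsigned int c;

        if (*in < 0x80) {
            /* ASCII: one byte, copied unchanged. */
            c = *in++;
        } else if ((*in == 0xC2 || *in == 0xC3) &&
                   (in[1] & 0xC0) == 0x80) {
            /* Only lead bytes 0xC2 and 0xC3 map to U+0080..U+00FF. */
            c = ((unsigned int)(*in & 0x1F) << 6) | (in[1] & 0x3F);
            in += 2;
        } else {
            /* Malformed input or a character outside ISO-Latin-1. */
            return -1;
        }

        /* Keep one byte free for the terminating NUL. */
        if (n + 1 >= outsize)
            return -1;
        out[n++] = (unsigned char)c;
    }
    out[n] = 0;
    return (long)n;
}
```

 For the example text, the 26 UTF-8 bytes of "Hyvää päivää maailma!"
 shrink to 21 ISO-Latin-1 bytes, while a string containing, for
 instance, a Greek letter makes the helper return -1.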

> Pardon me if this sounds like a stupid question, but I'm not very
> familiar with the libxml implementation; so far I have mostly just
> installed it and upgraded it from time to time.

  So please read the documentation I wrote, this is a minimum. I just
spent nearly half an hour dealing with your specific issue; if it's just
because you didn't read the documentation, that's very lame, everybody
loses time :-(

  RTFM !!!
        http://xmlsoft.org/encoding.html

> > So you already loaded the document, and everything is in memory.
> > Thus the initial document was XML (unless there is a bug in libxml).
> > What you want to do is translate this document into something else,
> > right ?
>
> Or retain the original character set encoding that has been internally
> changed to UTF-8, see above.

  Then reread the last paragraph of my mail JUST ABOUT THIS:

> > - try to force libxml to store in the native encoding, there
> > is a small paragraph at http://xmlsoft.org/encoding.html#extend
> > about this, but I suggest not to do this, or make sure that
> > you don't violate any of the XML well formedness rules due
> > to this trick (*)

  but again I suggest you don't do this and handle UTF-8 natively
in your WAP compressor. Do it once and for all.

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Mon Aug 28 2000 - 09:43:23 EDT