Re: [xml] A need for a function


From: Tuomas Luttinen (tuo@wapit.com)
Date: Mon Aug 28 2000 - 11:13:13 EDT


Daniel Veillard wrote:
 
> On Mon, Aug 28, 2000 at 04:37:35PM +0300, Tuomas Luttinen wrote:

> > I'm using libxml in a compiler that converts WML to a binary format.
> > The problem I face is that, from time to time, I need to get the
> > content of the next nodes in a character set other than UTF-8.
 
> I don't understand. The XML specification is very clear about
> this, every entity has to be in a single encoding. You cannot
> have one part say in UTF8 and another part in ISO-Latin-1 for
> example, this would not be an XML document.

No, I didn't mean that. Let me try to be clearer.

The WML compiler gets the WML document, uses libxml to build
a DOM tree from it and then walks the tree node by node,
changing tags into their one-byte binary equivalents and putting them
together with the text contents of those nodes. A little example:

Input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
        <card id="main" title="Tervehdys" newcontext="true">
                <p>Hyvää päivää maailma!</p>
        </card>
</wml>

Output:
\x01\x04j\x00\x7f\xe7U\x03main\x006\x03Tervehdys\x00#\x01`
\x03Hyv\xc3\xa4\xc3\xa4 p\xc3\xa4iv\xc3\xa4\xc3\xa4 maailma!\x00
\x01\x01\x01

Now there is a small problem with this output: the third byte ('j', or
0x6a) of the binary is the equivalent of encoding="ISO-8859-1" in the
XML declaration, but as you can see, the Scandinavian characters are
still in UTF-8 because they are simply copied from the tree node. So
what I need is a way to convert the UTF-8 characters to something else.
(Note that I use ISO-8859-1 only as an example here, since my cell
phone doesn't happen to understand Russian, Greek or Arabic...) If that
kind of function seems to fall outside libxml's territory, that's OK
with me, but I'd rather use some encoding facilities in libxml, since
depending on yet another library doesn't sound like a good idea.
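
To make the request concrete, here is a minimal sketch of the kind of
conversion I mean, for the ISO-8859-1 case only. The name
utf8_to_latin1 is hypothetical and is not a libxml function; the sketch
assumes the node content is available as a plain byte buffer and simply
rejects anything that does not fit into a single Latin-1 byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of libxml): convert a UTF-8 byte buffer
 * to ISO-8859-1. Returns the number of bytes written to `out`, or -1 if
 * the input is malformed, contains a character above U+00FF, or does
 * not fit into `outcap` bytes.
 */
long utf8_to_latin1(const unsigned char *in, size_t inlen,
                    unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < inlen) {
        unsigned char c = in[i];
        unsigned int cp;

        if (c < 0x80) {
            /* Plain ASCII is identical in both encodings. */
            cp = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence covering U+0080..U+00FF. */
            cp = ((c & 0x1Fu) << 6) | (in[i + 1] & 0x3Fu);
            i += 2;
        } else {
            /* Anything else is malformed or outside Latin-1. */
            return -1;
        }

        if (o >= outcap)
            return -1;
        out[o++] = (unsigned char)cp;
    }
    return (long)o;
}
```

With this, the bytes 0xc3 0xa4 from the output above would become the
single byte 0xe4, which is what the 'j' charset byte promises. Other
target character sets would need a table-driven approach rather than
this bit arithmetic.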

Pardon me if this sounds like a stupid question, but I'm not very
familiar with the libxml implementation; so far I have mostly just
installed it and upgraded it from time to time.

> So you already loaded the document, and everything is in memory.
> Thus the initial document was XML (unless there is a bug in libxml).
> What you want to do is translate this document into something else,
> right ?

Or retain the original character set encoding that has been internally
changed to UTF-8, see above.
 
> Which would need to force the DOM tree to add one extra charset item
> for each of those elements in the tree. People are already complaining
> that the DOM generation is a bit heavy memory wise, I'm pretty sure
> a lot of people would not be pleased with this, at all...

No, of course the same character set is used throughout the document.

> Practically what you can do:
> - reuse the _private field defined for most objects, that's
> 4 bytes where you could try to store information, but
> you won't be able to save the document with libxml native
> functions.

As I said, no document is saved; the DOM tree is freed after the
binary form has been produced.

> - try to force libxml to store in the native encoding, there
> is a small paragraph at http://xmlsoft.org/encoding.html#extend
> about this, but I suggest not to do this, or make sure that
> you don't violate any of the XML well formedness rules due
> to this trick (*)
 
> (*) WAP implementors have already been accused by the XML community
> of violating the Well Formedness rules of XML by defaulting to
> ISO-Latin-1 when no encoding was declared in the XML declaration, nor
> in the HTTP headers. I would not be pleased if libxml was pointed out
> as one of those lousy implementations. This was true in libxml1 but I
> have tried to make sure that this got corrected in libxml2.

Yes, that was really a good change to make. Now if I could just get this
problem solved, I could drop libxml1 compatibility from my project.

-- 
Tuomas Luttinen <tuo@wapit.com>
   Software Developer, Kannel project <http://www.kannel.org>
      Wapit Ltd
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Mon Aug 28 2000 - 09:43:23 EDT