Re: [xml] A need for a function

From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Mon Aug 28 2000 - 10:10:43 EDT


On Mon, Aug 28, 2000 at 04:37:35PM +0300, Tuomas Luttinen wrote:
>
> I'm using libxml in a compiler that changes WML to binary format.
> So the problem I face is that I'd need to get the content of the
> next nodes in a character set other than UTF-8 from time to time.

  I don't understand. The XML specification is very clear about
this: every entity has to be in a single encoding. You cannot
have one part in, say, UTF-8 and another part in ISO-Latin-1;
that would not be an XML document.

> With a brief look I got the impression that changing the character
> set of the document would need saving the whole document, which is
> rather clumsy when fiddling with parts of the document in memory.

  So you already loaded the document, and everything is in memory.
Thus the initial document was XML (unless there is a bug in libxml).
What you want to do is translate this document into something else,
right ?

> So how about a general use function for changing the character set
> for a pure text string that could be called for content of a text node
> or name of an element etc?

  Which would need to force the DOM tree to add one extra charset item
for each of those elements in the tree. People are already complaining
that the DOM generation is a bit heavy memory wise, I'm pretty sure
a lot of people would not be pleased with this, at all...

> The point being here that a couple hundred ä:s and ö:s etc can make a
> huge difference when marked with one versus several bytes when the
> whole document must be 1400 bytes at maximum.

  I doubt you will gain much, and I'm not sure how much you will lose.
In the specific case of those values extracted from the ISO-Latin
set, the difference in encoding is 2 bytes instead of 1 byte.
  Practically what you can do:
    - reuse the _private field defined for most objects, that's
      4 bytes where you could try to store information, but
      you won't be able to save the document with libxml native
      functions.
    - try to force libxml to store in the native encoding, there
      is a small paragraph at http://xmlsoft.org/encoding.html#extend
      about this, but I suggest not to do this, or make sure that
      you don't violate any of the XML well formedness rules due
      to this trick (*)
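
For illustration, the kind of per-string conversion asked about can
also be done outside the tree, at the moment the binary output is
written, so the document itself stays in UTF-8. The sketch below is
an assumption, not libxml code: utf8_to_latin1 is a hypothetical
helper name (libxml2's encoding module has UTF8Toisolat1 for the same
job). It also shows the size difference discussed above: each ä or ö
takes 2 bytes in UTF-8 and 1 byte in ISO-Latin-1.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of libxml): convert a UTF-8 buffer
 * to ISO-Latin-1.
 *
 * Returns the number of bytes written to out, or -1 if the input is
 * malformed, contains a character above U+00FF (not representable in
 * ISO-Latin-1), or does not fit in outcap bytes.
 */
long utf8_to_latin1(const unsigned char *in, size_t inlen,
                    unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;

    while (i < inlen) {
        unsigned char c = in[i];
        unsigned int cp;

        if (c < 0x80) {
            /* ASCII: one byte in both encodings. */
            cp = c;
            i += 1;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < inlen &&
                   (in[i + 1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: two bytes in UTF-8, one in Latin-1. */
            cp = ((unsigned int)(c & 0x1F) << 6) | (in[i + 1] & 0x3F);
            i += 2;
        } else {
            /* Malformed sequence or character outside ISO-Latin-1. */
            return -1;
        }

        if (o >= outcap)
            return -1;
        out[o++] = (unsigned char)cp;
    }
    return (long)o;
}
```

Calling this on the content of a text node just before emitting it
keeps the in-memory document well formed; the saving is one byte per
non-ASCII ISO-Latin-1 character and nothing for plain ASCII text.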

   I hope this helps,

Daniel

(*) WAP implementors have already been accused by the XML community
    of violating the Well Formedness rules of XML by defaulting to
    ISO-Latin-1 when no encoding was declared in the XML declaration,
    nor in the HTTP headers. I would not be pleased if libxml was
    pointed out as being one of those lousy implementations. This was
    true in libxml1 but I have tried to make sure that this got
    corrected in libxml2.
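
    As a rough sketch of the rule behind this footnote: when there is
    no encoding declaration and no external information such as HTTP
    headers, the XML specification only allows UTF-8 or UTF-16 (the
    latter signalled by a byte order mark), never ISO-Latin-1. The
    function name below is hypothetical and the check is deliberately
    simplified; it ignores UCS-4 and is not how libxml detects
    encodings.

```c
#include <stddef.h>

/*
 * Hypothetical helper: pick the encoding a parser may assume when an
 * entity carries no encoding declaration and no external encoding
 * information is available.
 */
const char *xml_fallback_encoding(const unsigned char *buf, size_t len)
{
    /* A UTF-16 byte order mark (big or little endian) means UTF-16. */
    if (len >= 2 && ((buf[0] == 0xFE && buf[1] == 0xFF) ||
                     (buf[0] == 0xFF && buf[1] == 0xFE)))
        return "UTF-16";

    /* Otherwise the only conforming default is UTF-8. */
    return "UTF-8";
}
```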

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org

This archive was generated by hypermail 2b29 : Mon Aug 28 2000 - 09:43:23 EDT