From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Mon Aug 28 2000 - 10:10:43 EDT
Subject: Re: [xml] A need for a function
On Mon, Aug 28, 2000 at 04:37:35PM +0300, Tuomas Luttinen wrote:
>
> I'm using libxml in a compiler that changes WML to binary format.
> So the problem I face is that I'd need to get the content of the
> next nodes in another character set than UTF-8 from time to time.
I don't understand. The XML specification is very clear about
this: every entity has to be in a single encoding. You cannot
have one part in, say, UTF-8 and another part in ISO-Latin-1;
this would not be an XML document.
> With a brief look I got an impression that changing the character
> set of the document would need saving the whole document, which is
> rather clumsy when fiddling with parts of the document in memory.
So you already loaded the document, and everything is in memory.
Thus the initial document was XML (unless there is a bug in libxml).
What you want to do is translate this document into something else,
right ?
> So how about a general use function for changing the character set
> for a pure text string that could be called for content of a text node
> or name of an element etc?
That would force the DOM tree to carry one extra charset item
for each of those elements in the tree. People are already complaining
that the DOM generation is a bit heavy memory-wise; I'm pretty sure
a lot of people would not be pleased with this at all...
> The point being here that a couple hundred ä:s and ö:s etc can make a
> huge difference when marked with one versus several bytes when the
> whole document must be 1400 bytes at maximum.
I doubt you will gain much. I'm sure how much you will lose.
In the specific case of those values extracted from the ISO-Latin
set, the difference in encoding is 2 bytes instead of 1 byte.
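To make the 2-bytes-versus-1-byte point concrete, here is a minimal sketch in plain C (not libxml code; the function names `utf8_codepoints` and `latin1_savings` are made up for illustration). It assumes the input is valid UTF-8 and that every character lies in the ISO-Latin-1 range, so each character would take exactly one byte after conversion. Under those assumptions the saving is one byte per accented character, so a couple hundred of them would save roughly a couple hundred bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the code points in a NUL-terminated UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Bytes saved by storing the text as ISO-8859-1 instead of UTF-8.
 * Assumption: every character is in U+0000..U+00FF, so the Latin-1
 * form needs exactly one byte per code point. */
size_t latin1_savings(const char *s)
{
    assert(s != NULL);
    return strlen(s) - utf8_codepoints(s);
}
```

For example, the Finnish word "päivää" is 9 bytes in UTF-8 and 6 characters long, so `latin1_savings` reports 3.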
Practically what you can do:
- reuse the _private field defined for most objects; that's
4 bytes where you could try to store information, but
you won't be able to save the document with libxml native
functions.
- try to force libxml to store in the native encoding, there
is a small paragraph at http://xmlsoft.org/encoding.html#extend
about this, but I suggest not to do this, or make sure that
you don't violate any of the XML well-formedness rules due
to this trick (*)
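A third possibility, which leaves the tree in UTF-8, would be to convert each string only at the moment it is written to the binary output. The following is a rough sketch of such a helper in plain C, under stated assumptions: the name `utf8_to_latin1` and its signature are hypothetical and are not part of libxml, and the input is expected to be a NUL-terminated UTF-8 string. As far as I know, libxml2's encoding module ships its own routine for this (`UTF8Toisolat1`); the sketch does not use it so that it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a NUL-terminated UTF-8 string to ISO-8859-1.
 * Writes at most outsize bytes to out, including the terminating NUL.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the input is malformed, contains a character outside the Latin-1
 * range, or does not fit in the output buffer. */
long utf8_to_latin1(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (outsize == 0)
        return -1;                  /* no room even for the NUL */

    while (*p != '\0') {
        unsigned char c;

        if (*p < 0x80) {
            c = *p++;               /* ASCII: one byte in both encodings */
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: 110000xx 10yyyyyy -> xxyyyyyy */
            c = (unsigned char)(((p[0] & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            return -1;              /* malformed or not in Latin-1 */
        }
        if (n + 1 >= outsize)
            return -1;              /* no room for this byte plus NUL */
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return (long)n;
}
```

Returning -1 for anything outside Latin-1 is a deliberate choice in this sketch: silently dropping or replacing characters would hide data loss, whereas an error lets the caller fall back to UTF-8 for that string.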
I hope this helps,
Daniel
(*) WAP implementors have already been accused by the XML community
of violating the Well Formedness rules of XML by defaulting to ISO-Latin-1
when no encoding was declared in the XML declaration, nor in the
HTTP headers. I would not be pleased if libxml was pointed to as being
one of those lousy implementations. This was true in libxml1 but I
have tried to make sure that this got corrected in libxml2.
--
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes | Today's Bookmarks :
Tel : +33 476 615 257 | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207 | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
http://www.w3.org/People/all#veillard%40w3.org | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail majordomo@xmlsoft.org
This archive was generated by hypermail 2b29 : Mon Aug 28 2000 - 09:43:23 EDT