Re: [xml] Weirdness in XML parser


From: Stefan Bambach (bambach@triplex.de)
Date: Fri Nov 12 1999 - 04:30:40 EST


Hi Kristian,

Thursday, November 11, 1999, 9:37:40 PM, you wrote:

KHK> [ I sent this earlier, but apparently it didn't show up on the
KHK> list... It's not on the list archive page either, which by the way
KHK> seems to have stopped updating around October 11. Anyway, here
KHK> goes... ]

KHK> Hi,

KHK> When parsing this document:

KHK> <HTML>
KHK> <BODY>
    
KHK> <P>
KHK> <HR>
KHK> foo
    
KHK> </BODY>
KHK> </HTML>

KHK> The HTML parser makes "foo" a child of <HR>... I tracked the problem
KHK> to this piece of code in htmlParseElement (lines 2310-2317):

KHK>     if (((depth == ctxt->nameNr) && (oldname == ctxt->name)) ||
KHK>         (name == NULL)) {
KHK>         if (CUR == '>')
KHK>             NEXT;
KHK>         return;
KHK>     }

KHK> which looks a bit weird to me... I don't see what it's supposed to do.
KHK> What happens is that <HR> autocloses <P>, and when control reaches
KHK> these lines, oldname points to freed memory. Accidentally, this memory
KHK> is reused for the name of the new element, so oldname == ctxt->name and
KHK> thus htmlParseElement returns prematurely (it doesn't reach the test
KHK> for info->empty).
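
To illustrate the pointer comparison problem described above: once <P> has been popped and its name freed, the allocator may hand the same address to the next name, so comparing addresses cannot tell whether a new element was opened. Below is a minimal sketch in C of one way to avoid depending on pointer identity, by counting pushes on the name stack. NameStack, name_push, name_pop and autoclose_opens_element are hypothetical names made up for illustration, not the parser's actual API, and this is not necessarily how htmlParseElement itself should be changed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DEPTH 16

/* Hypothetical stand-in for the parser's stack of open element names. */
typedef struct {
    char *names[MAX_DEPTH];
    int nameNr;            /* current depth, as in ctxt->nameNr */
    unsigned long pushes;  /* total pushes so far; never decreases */
} NameStack;

/* Push a private copy of name; returns 0 on success, -1 on failure. */
static int name_push(NameStack *s, const char *name)
{
    char *copy;

    assert(s != NULL && name != NULL);
    if (s->nameNr >= MAX_DEPTH)
        return -1;
    copy = malloc(strlen(name) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, name);
    s->names[s->nameNr++] = copy;
    s->pushes++;
    return 0;
}

/* Pop and free the top name, if any. */
static void name_pop(NameStack *s)
{
    assert(s != NULL);
    if (s->nameNr > 0) {
        s->nameNr--;
        free(s->names[s->nameNr]);
        s->names[s->nameNr] = NULL;
    }
}

/*
 * Simulate "<open> ... <start>" where <start> autocloses <open>.
 * Returns 1 if the push counter shows that a new element was opened,
 * 0 if not, -1 on allocation failure.
 */
static int autoclose_opens_element(const char *open, const char *start)
{
    NameStack s = { {NULL}, 0, 0 };
    unsigned long before;
    int depth, opened;

    if (name_push(&s, open) != 0)
        return -1;
    depth = s.nameNr;
    before = s.pushes;

    name_pop(&s);                 /* autoclose frees the old name */
    if (name_push(&s, start) != 0)
        return -1;

    /* The depth matches again, so depth alone cannot decide. */
    assert(depth == s.nameNr);
    opened = (s.pushes != before);

    while (s.nameNr > 0)
        name_pop(&s);
    return opened;
}
```

In this sketch, autoclose_opens_element("P", "HR") reports that an element was opened even though the depth is the same before and after, so a caller would go on to the empty-element test instead of returning early. The same holds when the new name equals the old one, a case that comparing names by content would not distinguish.
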

KHK> I see you've made <DD> autoclose <DT> and <DT> autoclose <DD>, but
KHK> what about also making <DD> autoclose <DD> and likewise for <DT>?
KHK> This would make the parser a bit more robust; suppose someone were to
KHK> do something like:

KHK> <DL>
KHK> <DD>foo
KHK> <DD>bar
KHK> </DL>

KHK> it would get parsed as

KHK> <DL>
KHK> <DD>foo</DD>
KHK> <DD>bar</DD>
KHK> </DL>

KHK> which I believe is a bit more useful than

KHK> <DL>
KHK> <DD>foo
KHK> <DD>bar</DD>
KHK> </DD>
KHK> </DL>
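
The suggestion above amounts to adding two entries to a table of autoclose rules. As a rough sketch only, assuming a simple pair table and a lookup function (AutoCloseRule, auto_close_rules and tag_autocloses are hypothetical names, not the parser's real data structures), it could look like this in C:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule: opening <start> implicitly closes an open <closes>. */
typedef struct {
    const char *start;
    const char *closes;
} AutoCloseRule;

static const AutoCloseRule auto_close_rules[] = {
    { "DD", "DT" },  /* behaviour already described in the message */
    { "DT", "DD" },
    { "DD", "DD" },  /* the proposed additions */
    { "DT", "DT" },
};

/* Returns 1 if opening `start` should close the open element `open`.
   Assumes tag names are already normalised to upper case. */
static int tag_autocloses(const char *start, const char *open)
{
    size_t i;
    size_t n = sizeof(auto_close_rules) / sizeof(auto_close_rules[0]);

    assert(start != NULL && open != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(auto_close_rules[i].start, start) == 0 &&
            strcmp(auto_close_rules[i].closes, open) == 0)
            return 1;
    }
    return 0;
}
```

With the two extra rows, tag_autocloses("DD", "DD") is true, so the second <DD> in the example would close the first one instead of nesting inside it.
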

KHK> regards,
KHK> Kristian
KHK> ----
KHK> Message from the list xml@rufus.w3.org
KHK> Archived at : http://rufus.w3.org/veillard/XML/messages
KHK> to unsubscribe: echo "unsubscribe xml" | mail majordomo@rufus.w3.org

Yes, you are right, but it is the user's responsibility to close the
tags. When such an error occurs, the parser tries to go on parsing,
and this is the result. It tries to analyse and handle such erroneous
cases, where normally you would have to stop parsing and print some
kind of error message. In my opinion there are too many possibilities
to handle every type of error correctly.

ciao. Stefan

-----------------------------------------------------------------------
Stefan Bambach

triplex - agentur für neue medien GmbH
Erhardtstr. 8
80469 München

Tel: +49 89 209138-21
Fax: +49 89 209138-10
mailto:bambach@triplex.de
http://www.triplex.de
-----------------------------------------------------------------------


This archive was generated by hypermail 2b29 : Wed Aug 02 2000 - 12:29:51 EDT