Re: [xml] HTML & XML Parser


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Sun Sep 24 2000 - 15:53:16 EDT


On Thu, Sep 21, 2000 at 03:40:24AM -0400, Manuel Guesdon wrote:
>
>
> Hello,
>
> I'd like to parse HTML file with xml parser so people can change
> DTDs (adding tags,...) without having to re-compile libxml.
> The main problem is "Auto Closed" tag.

   Have you looked at the HTML parser in libxml? Did you notice that it
produces a tree similar to the one an equivalent XHTML document would
produce when parsed with the XML parser? The HTML parser of course handles
auto-closed (and, to some extent, auto-opened) tags.

   The only difference should be that doc->type is XML_HTML_DOCUMENT_NODE
instead of XML_DOCUMENT_NODE.

   If your plan is to make extensions to the SGML DTDs and have the libxml
HTML parser support those, you're on the wrong track. The latest and future
versions of HTML are XML based, hence new extensions will be made using the
XHTML version and in an XML framework. Forget about auto-closed tags
and other SGML minimization nastiness; they do not fit into this
framework anymore.
   If the HTML parser finds HTML tags it doesn't know, it will complain,
but it will still generate a DOM tree which will be XML ready.

> So I'd like to know some things:
> Can I safely mix html and xml parser functions (i.e. construct
> a context with xmlCreateMemoryParserCtxt() and parse the doc with
> htmlParseDocument.

  Can you tell me what you are trying to achieve this way? I don't
understand your approach. If it's an HTML document, use the HTML parser;
otherwise use the XML parser.

> My SAX functions use parser context _private member) ?

  I don't understand. The _private fields are located in the tree
structures generated for the DOM, and you say you use SAX... So I'm lost:
what are you doing there, and which API exactly do you use?

> It works with version 2.2.3 but will it work with future versions ?

  _private will stay there forever. Unless someone specifically compiles
without it, this space is available by default.

> Another solution would be use xml parser only but how can I manage
> auto close tags ?

  Do not try to use the XML parser on documents that are not well-formed
XML; it simply *won't work*!

> BTW, I've noticed few things in last version:
> - there's no htmlCreateMemoryParserCtxt() public function

  Right, this is missing; it should not be hard to add.

> - Definitions of SAX handlers like
> xmlSAXHandler sgmlDefaultSAXHandler = {
> internalSubset,
> NULL,
> ...
> doesn't define externalSubset.

  This code is not released publicly and may never be. I don't intend
to add a generic SGML parser; it was mostly there to support DocBook SGML
import.

> - html parser seems to not call xmlLoadExternalEntity

  Yes, because there are no external entities! I don't validate SGML,
but you can use the XHTML XML DTD to post-validate an HTML DOM tree
generated by the HTML parser.

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@rpmfind.net
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@rpmfind.net



This archive was generated by hypermail 2b29 : Sun Sep 24 2000 - 16:43:18 EDT