Re: [xml] Adding implied P tags?


From: Wayne Davison (wayned@blorf.net)
Date: Thu Aug 17 2000 - 14:19:22 EDT


On Thu, 17 Aug 2000, Daniel Veillard wrote:
> However, this is unclear how deep one should go in this direction,
> there is two conflicting goals:
> - try to get "valid" (in the XML sense) input, and this imply adding
> elements and modifying the structure of what is the input document
> (when it has any structure at all :-\)
> - try to provide a set of SAX callbacks (or a DOM tree representation)
> as close as possible from the input

This might be a good area to add one or more options so that the user can
decide how raw the HTML data should be. For the program I've written
(which takes in HTML files and outputs a Rocket eBook file), I want to see
the implied tags. If the SAX callbacks don't give me implied P tags, for
instance, I'll have to write code to add them myself: I offer an option
to re-render paragraphs in "book form", so I need to know where all the
paragraphs are in order to change them.
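
To make the idea of such an option concrete, here is a rough sketch. It uses
Python's standard html.parser rather than libxml, and the EventCollector
class, the imply_p flag, and the BLOCK_TAGS set are all hypothetical names
made up for illustration; none of them reflect the actual libxml API.

```python
from html.parser import HTMLParser

# Hypothetical list of block-level elements; not taken from libxml.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "div",
              "table", "ul", "ol", "pre", "blockquote"}
HTML_WS = " \t\n\r\f"  # whitespace characters as HTML defines them


class EventCollector(HTMLParser):
    """Records SAX-like events; optionally synthesizes implied <p> events."""

    def __init__(self, imply_p=False):
        super().__init__(convert_charrefs=True)
        self.imply_p = imply_p
        self.events = []
        self.open_blocks = 0       # depth of explicitly opened block elements
        self.implied_open = False  # True while a synthesized <p> is open

    def _close_implied(self):
        if self.implied_open:
            self.events.append(("end", "p", True))
            self.implied_open = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._close_implied()  # a block element ends any implied <p>
            self.open_blocks += 1
        self.events.append(("start", tag, False))

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.open_blocks = max(0, self.open_blocks - 1)
        self.events.append(("end", tag, False))

    def handle_data(self, data):
        if not data.strip(HTML_WS):
            return  # whitespace-only runs are dropped in this sketch
        if self.imply_p and self.open_blocks == 0 and not self.implied_open:
            self.events.append(("start", "p", True))  # third field: implied
            self.implied_open = True
        self.events.append(("text", data))

    def close(self):
        super().close()
        self._close_implied()


def parse(html, imply_p):
    collector = EventCollector(imply_p=imply_p)
    collector.feed(html)
    collector.close()
    return collector.events
```

Calling parse("<h1>T</h1>loose<p>x</p>", True) wraps the loose text in
start/end events flagged as implied, while passing False leaves the event
stream as close to the input as possible.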

This enhanced-HTML output is why I switched over from libwww's HTML
callbacks to libxml (since libwww doesn't even give me implied close tags,
making it harder to re-render things like tables into the simple elements
that the Rocket eBook understands).

Here's another problem case for P-tag inference:

<h1>CHAPTER 1</h1>
<a id="p1"></a><p>Paragraph #1</p>
<a id="p2"></a><p>Paragraph #2</p>

The A tags outside the paragraphs don't have any rendered content, so it
would be nice if no implied P tags were added for them (Baen Books loves
to do things like this). [Aside: I ran this test case through your latest
XML code, and it stopped parsing after the first paragraph.] I'm not sure
what the official HTML spec says about such code, though. It may well be
that the A tags should get an implied P wrapper, and that the renderer
should treat the effectively-empty paragraph as invisible.
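
One way to get the behavior I'd prefer is to hold back the decision until
the parser knows whether a loose inline run contains any real text. The
sketch below works on a simplified token list rather than on libxml's
callbacks; the add_implied_p function, the token format, and the INLINE_TAGS
set are assumptions made up for illustration, and this is not a claim about
what the HTML spec requires.

```python
# Hypothetical list of inline elements; every other tag is treated as block.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "font"}
HTML_WS = " \t\n\r\f"  # whitespace characters as HTML defines them


def add_implied_p(tokens):
    """Wrap loose inline runs in an implied P only if they contain text.

    tokens is a list of ("start", tag), ("end", tag) or ("text", string)
    pairs as seen at body level.
    """
    out, run, depth = [], [], 0

    def flush():
        # Emit the pending inline run, wrapped only if it has rendered text.
        if any(kind == "text" and value.strip(HTML_WS) for kind, value in run):
            out.append(("start", "p"))
            out.extend(run)
            out.append(("end", "p"))
        else:
            out.extend(run)
        run.clear()

    for kind, value in tokens:
        is_block_tag = kind in ("start", "end") and value not in INLINE_TAGS
        if depth == 0 and not is_block_tag:
            run.append((kind, value))  # loose inline content: decide later
            continue
        if depth == 0:
            flush()                    # a block element ends the loose run
        if is_block_tag:
            depth = max(depth + (1 if kind == "start" else -1), 0)
        out.append((kind, value))
    flush()
    return out


chapter = [
    ("start", "h1"), ("text", "CHAPTER 1"), ("end", "h1"),
    ("start", "a"), ("end", "a"),
    ("start", "p"), ("text", "Paragraph #1"), ("end", "p"),
    ("start", "a"), ("end", "a"),
    ("start", "p"), ("text", "Paragraph #2"), ("end", "p"),
]
result = add_implied_p(chapter)
```

With this approach the empty A tags pass through without a wrapper, while a
loose run such as an A tag that does contain text still gets its implied P.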

Also (and you've probably already noticed this) the new code is adding
implied P tags for whitespace between tags:

<p>one</p>
<p>two</p>

becomes:

<p>one</p><p>
</p><p>two</p>
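
The fix I'd expect here is a check that skips whitespace-only character data
before opening an implied P. A minimal sketch follows, again using Python's
standard html.parser rather than libxml; the TextRuns class and the
needs_implied_p helper are hypothetical names.

```python
from html.parser import HTMLParser

HTML_WS = " \t\n\r\f"  # whitespace as HTML defines it (excludes &nbsp;)


class TextRuns(HTMLParser):
    """Collects every run of character data the parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.runs = []

    def handle_data(self, data):
        self.runs.append(data)


def needs_implied_p(text):
    # Only text with at least one non-whitespace character should be
    # allowed to open an implied paragraph.
    return bool(text.strip(HTML_WS))


parser = TextRuns()
parser.feed("<p>one</p>\n<p>two</p>")
parser.close()
runs = parser.runs
kept = [run for run in runs if needs_implied_p(run)]
```

The newline between the two paragraphs is reported as its own run of
character data, and the check filters it out so that only "one" and "two"
remain as candidates.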

> I dunno how far should one go in the direction of cleaning
> up at parsing time the HTML input.

A difficult question. While it would be great to have extra cleanup
available (e.g. properly nesting mis-ordered tags), the best place to
push this functionality may well be into a separate library.

..wayne..

----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Thu Aug 17 2000 - 11:43:13 EDT