Re: [xml] Adding implied P tags?


From: Daniel Veillard (Daniel.Veillard@w3.org)
Date: Thu Aug 17 2000 - 05:47:36 EDT


On Tue, Aug 15, 2000 at 08:56:33PM -0700, Wayne Davison wrote:
>
> I ran the following HTML page through "testHTML --sax" to see how well
> the inferred tags were doing:
>
> <title>This is a test!</title>
> Start of the body (hopefully).
> <hr size=1 noshade width="50%">
> This is a new paragraph.
> <br>
> A new line.
>
> I noticed that the "Start ..." text was placed inside the inferred
> HEAD section (with the inferred BODY section starting with the HR
> tag). If I manually add a P tag before the word "Start", the HEAD &
> BODY tags gets put into the right place, an inferred close-P tag gets
> put in front of the HR tag, but no new P tag gets opened up after the
> <HR> (prior to the word "This").
>
> So, is the plan to put implied P tags into the document? Or leave them
> out? If they're going in, that should neatly solve the HEAD/BODY
> boundary problem (since the implied tag would naturally cause the start
> of the BODY section to occur). If they're not going in, we'll want to
> have the character-string code check to ensure that it is at least in
> the body already (when it is not inside a HEAD-compatible container
> tag).

  It makes sense to do those kinds of checks. I added some preliminary
code to handle auto-opening of <p> tags.
The current result on the same input (with an <h1> and one more line of
text appended) is:
  
~/XML -> cat auto.html
<title>This is a test!</title>
Start of the body (hopefully).
<hr size=1 noshade width="50%">
This is a new paragraph.
<br>
A new line.
<h1>header</h1>
and a new line
~/XML -> ./testHTML auto.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>This is a test!</title></head>
<body>
<p>
Start of the body (hopefully).
</p>
<hr size="1" noshade width="50%">
<p>
This is a new paragraph.
<br>
A new line.
</p>
<h1>header</h1>
<p>
and a new line
</p>
</body>
</html>
~/XML ->

  i.e. the p gets opened, and the body gets closed. If one adds an
H1, the autogenerated p gets closed and a new one is opened after
the H1 is closed.
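
A minimal sketch of the behaviour shown in the transcript above, written
in Python rather than in libxml's C and not taken from HTMLparser.c. The
names auto_paragraphs, BLOCK_TAGS and VOID_TAGS are hypothetical, the tag
sets are an assumption, and attributes are dropped for brevity. Loose
text at the top level of the body opens a <p>, and a block-level start
tag closes it:

```python
import re

# Assumed tag sets for this sketch only; the real parser's tables differ.
BLOCK_TAGS = {"hr", "h1", "h2", "h3", "ul", "ol", "table", "div"}
VOID_TAGS = {"hr", "br"}

# Either a start/end tag (attributes are matched but ignored) or a text run.
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")


def auto_paragraphs(body_html):
    """Return a flat list of events with implied <p> and </p> added."""
    events = []
    in_p = False  # True while an auto-opened <p> is still open
    depth = 0     # nesting depth inside explicit block elements
    for match in TOKEN.finditer(body_html):
        closing, name, text = match.group(1), match.group(2), match.group(3)
        if text is not None:
            if not text.strip():
                continue  # ignore whitespace-only runs
            if depth == 0 and not in_p:
                events.append("<p>")  # loose text implies a paragraph
                in_p = True
            events.append(text.strip())
            continue
        name = name.lower()
        if name in BLOCK_TAGS and not closing:
            if in_p:
                events.append("</p>")  # a block element ends the paragraph
                in_p = False
            events.append("<%s>" % name)
            if name not in VOID_TAGS:
                depth += 1
        elif name in BLOCK_TAGS:
            events.append("</%s>" % name)
            depth = max(0, depth - 1)
        else:
            # Inline tag such as <br>: it also needs an enclosing paragraph.
            if depth == 0 and not in_p and not closing:
                events.append("<p>")
                in_p = True
            events.append("<%s%s>" % ("/" if closing else "", name))
    if in_p:
        events.append("</p>")  # close the paragraph left open at the end
    return events
```

Calling auto_paragraphs() on the body part of auto.html yields the same
<p>, </p>, <hr>, <p>, ... sequence as the testHTML output.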

However, it is unclear how deep one should go in this direction;
there are two conflicting goals:
   - try to get "valid" (in the XML sense) input, which implies adding
     elements and modifying the structure of the input document
     (when it has any structure at all :-\)
   - try to provide a set of SAX callbacks (or a DOM tree representation)
     as close as possible to the input
 
 The former direction generally makes the work of upper layers simpler
but lowers the accuracy of the representation of the original document,
making some potential "tricks" nearly impossible (never forget the existence
of the tidy tool from Dave Raggett; see http://www.w3.org/Status for pointers).

 Let's consider the case of
   <li>some text</li>
Assuming (I can't check, I'm on a plane without the HTML DTD) that one
should theoretically imply a <p> to encapsulate the text, the problem is
that the rendering of <li><p>some text</li> is well known to be different.
The same goes for an <li> without an enclosing list. Suppose I
automatically add a <ul> for "orphan" li elements: that makes it
impossible to build an HTML rendering conformant not to the HTML DTD but
to the "usual" rendering that people seem to have considered the standard
since some HTML3 implementations.
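
For contrast, here is a sketch of the second goal (callbacks as close as
possible to the input), using Python's standard html.parser module
instead of libxml. EventLogger and faithful_events are hypothetical names
for this illustration. The parser reports what it sees and infers
nothing, so an orphan <li> stays an orphan:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record SAX-like events exactly as they occur, without inferring tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("text", data.strip()))


def faithful_events(html):
    """Feed the markup to the logger and return the recorded events."""
    logger = EventLogger()
    logger.feed(html)
    logger.close()
    return logger.events
```

faithful_events("<li>some text</li>") reports only the li start tag, the
text, and the li end tag: no <ul> and no <p> are added, so a renderer
built on top is free to apply the "usual" behaviour.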
  Anyway, there are obvious contexts where <p> should be implied:
you pinpointed the /html/body case; /html or / are also good candidates.
I have added basic support for it (and will commit once landed).
One may want to add a few extra tags for auto-opening of <p> tags
(check htmlNoContentElements in HTMLparser.c).
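
The context rule can be sketched as a lookup on the stack of open
elements. This is an illustration in Python under my own naming
(P_IMPLYING_CONTEXTS, should_imply_p); it does not reproduce the contents
of htmlNoContentElements:

```python
# Paths of open elements under which loose text implies a <p>:
# "/" (nothing open yet), "/html" and "/html/body", as listed above.
P_IMPLYING_CONTEXTS = {(), ("html",), ("html", "body")}


def should_imply_p(open_elements):
    """Return True if text seen with these open elements should open a <p>."""
    return tuple(open_elements) in P_IMPLYING_CONTEXTS
```

Adding a context then means adding one more path to the set, so
should_imply_p(["html", "body", "li"]) stays False and an <li> keeps its
bare text.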

  I dunno how far one should go in the direction of cleaning
up the HTML input at parsing time. Amaya and Tidy do quite a
lot of work in this direction; I dunno what the libxml HTML users
think. Total cleanup is also hard to implement and I'm unsure
that it is actually what the libxml community is asking for.
  Feedback welcome on this topic,

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Thu Aug 17 2000 - 09:43:14 EDT