RE: [xml] Valid URI ?


From: Marc Sanfacon (sanm@copernic.com)
Date: Thu Aug 24 2000 - 09:44:56 EDT


Hi Daniel,
        thank you for the explanation. I made a mistake, MS URLCracker
gives: ?cp=nsidircat as extra info.

        Marc.

-----Original Message-----
From: xml-request@rufus.w3.org [mailto:xml-request@rufus.w3.org]On
Behalf Of Daniel Veillard
Sent: August 23, 2000 16:34
To: xml@rpmfind.net
Subject: Re: [xml] Valid URI ?

On Wed, Aug 23, 2000 at 03:18:37PM -0400, Marc Sanfacon wrote:
>
> Hi there,
> I am trying to figure out if this URI is valid:
>
> http://info.netscape.com/world/espa%0!ol?cp=nsidircat
>
> Got that when doing my tests. libxml tells me it is invalid, but
> the URLCracker of MS gives me a valid URI with '%0!ol?cp=nsidircat' as
> extra info.
>
> If it is a valid URI, I will try to find the problem in libxml.

  Hum, a foreword ... Warning Warning Warning ...
  URI related work is hard, really.
  And also, just looking at it ... I don't know the answer !

The first thing even before trying to analyze the string w.r.t. the
specification (I'm using RFC 2396 as defined on top of the uri.c file)
is to know in which context it was found !

For example, when contained within an XML document, this string would not
be a URI but an escaped (from an XML point of view) string, moreover
converted to the encoding of the document. If contained in another kind
of resource, it may need to be escaped according to the type of document
in use (I told you it is hard ...).

Assuming it's an unescaped URI, then we can process through 2396:

        http://www.faqs.org/rfcs/rfc2396.html
                 3. URI Syntactic Components
         
  http://info.netscape.com/world/
this part is no problem, scheme, server, and path with 1 segment

  let's look at the annoying part:
   espa%0!ol?cp=nsidircat
  considering the context, "espa" are 4 chars of the next segment
  
  segment = *pchar *( ";" param )

  pchar = unreserved | escaped |
                  ":" | "@" | "&" | "=" | "+" | "$" | ","

  unreserved = alphanum | mark

     alphanum = alpha | digit (letters or number in the ASCII range)

        mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

  if the next element was a pchar, then it would have to be an
  escaped, but

     escaped = "%" hex hex

   hence one needs two hex digits in a row after the "%", and there is
   only one ("0", followed by "!").

 So the next element is not a pchar and the last segment is "espa"
 with "%0!ol?cp=nsidircat" remaining.
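
The walk through the grammar above can be sketched in a few lines. This
is only an illustration in Python, not what uri.c does: the names
split_segment and is_escaped are made up for the example, and the
";" param part of the segment rule is deliberately left out.

```python
import string

# Character classes from RFC 2396 (sections 2.3, 2.4.1 and 3.3).
ALPHANUM = set(string.ascii_letters + string.digits)
MARK = set("-_.!~*'()")
UNRESERVED = ALPHANUM | MARK
PCHAR_EXTRA = set(":@&=+$,")
HEX = set(string.hexdigits)


def is_escaped(s, i):
    """True if s[i:i+3] matches: escaped = "%" hex hex."""
    return (
        i + 2 < len(s)
        and s[i] == "%"
        and s[i + 1] in HEX
        and s[i + 2] in HEX
    )


def split_segment(s):
    """Split s into (longest leading run of pchar, remainder).

    Hypothetical helper: it ignores the ';' param part of a segment.
    """
    i = 0
    while i < len(s):
        if s[i] in UNRESERVED or s[i] in PCHAR_EXTRA:
            i += 1
        elif is_escaped(s, i):
            i += 3
        else:
            break  # not a pchar: the segment ends here
    return s[:i], s[i:]


segment, rest = split_segment("espa%0!ol?cp=nsidircat")
print(segment)  # espa
print(rest)     # %0!ol?cp=nsidircat
```

With a well-formed escape such as "espa%F1ol?cp=nsidircat" the same
function stops at the "?" instead, leaving "?cp=nsidircat" as a
candidate query.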
 The spec allows a query after the path:

   hier_part = ( net_path | abs_path ) [ "?" query ]

  but a query has to start with a '?', and the remainder starts with '%',

  URLCracker of MS tells you "http://info.netscape.com/world/espa"
is the valid longest URI which can be extracted from the given string
and the rest "%0!ol?cp=nsidircat" is garbage.

 the URI module of libxml rightly says:
   "http://info.netscape.com/world/espa%0!ol?cp=nsidircat"
 is not an URI.

 There is no contradiction, but the URLCracker result is misleading. In
  "http://info.netscape.com/world/espa%0!ol?cp=nsidircat"
 the last segment was probably espa<ntilde>ol and, due to either
buggy software or bad escaping of the string, it got mangled. The actual
resource was
  http://info.netscape.com/world/espa<ntilde>ol?cp=nsidircat

  and certainly not http://info.netscape.com/world/espa

  So I don't think one needs to fix uri.c, but if you got a problem
this is probably due to an escaping problem. For example, if you take
a string directly out of libxml document memory, this string is
encoded in UTF-8, and to get a real URI you would have to
escape it properly (in that case, convert the <ntilde> in UTF-8 to
the appropriate %XX sequences). But it's impossible to tell without
more context where the error really lies.
  One thing is sure, a function to convert from a UTF-8 encoded
inline reference to an escaped valid URI is needed, but it's not
simple to do !
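
For illustration only, here is roughly what such a conversion looks like
for a single path segment, using Python's standard urllib.parse.quote
rather than anything in libxml. The function name escape_path_segment is
hypothetical, and the sketch assumes the input is one segment (no "/" or
"?" to preserve) and that the caller knows which byte encoding the server
expects.

```python
from urllib.parse import quote

# Characters RFC 2396 allows unescaped in a pchar, beyond alphanumerics:
# the "mark" set plus ":" "@" "&" "=" "+" "$" ",".
PCHAR_SAFE = "-_.!~*'():@&=+$,"


def escape_path_segment(segment, encoding="utf-8"):
    """Percent-escape every character that is not a valid pchar.

    Hypothetical helper: the text is first encoded to bytes using
    `encoding`, then each disallowed byte becomes a %XX escape.
    """
    return quote(segment, safe=PCHAR_SAFE, encoding=encoding)


name = "espa\u00f1ol"  # "espanol" with an n-tilde
print(escape_path_segment(name))             # espa%C3%B1ol
print(escape_path_segment(name, "latin-1"))  # espa%F1ol
```

The two outputs differ only by the byte encoding chosen before escaping,
which is exactly the piece of context that cannot be recovered from the
URI string alone.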

Daniel

-- 
Daniel.Veillard@w3.org | W3C, INRIA Rhone-Alpes  | Today's Bookmarks :
Tel : +33 476 615 257  | 655, avenue de l'Europe | Linux XML libxml WWW
Fax : +33 476 615 207  | 38330 Montbonnot FRANCE | Gnome rpm2html rpmfind
 http://www.w3.org/People/all#veillard%40w3.org  | RPM badminton Kaffe
----
Message from the list xml@xmlsoft.org
Archived at : http://xmlsoft.org/messages/
to unsubscribe: echo "unsubscribe xml" | mail  majordomo@xmlsoft.org



This archive was generated by hypermail 2b29 : Thu Aug 24 2000 - 09:43:14 EDT