\chapter{Parsing and Processing URIs}\label{cha:uri}

The \ex{uri} structure contains a library for dealing with URIs.

\section{Notes on URI Syntax}

A URI (Uniform Resource Identifier) has the following syntax:
%
\begin{inset}
[scheme] : \var{path} [{\normalfont?\/} search] [{\normalfont\#} fragid]
\end{inset}
%
Parts in brackets may be omitted.

The URI contains characters like \verb|:| to indicate its different
parts. Some special characters are \emph{escaped} if they are a
regular part of a name and not indicators for the structure of a URI.
Escape sequences have the form \ex{\%hh}, where each \ex{h}
is a hexadecimal digit. The hexadecimal number is the
ASCII code of the escaped character, e.g.\ \ex{\%20} is space (ASCII
32) and \ex{\%61} is `a' (ASCII 97). This module
provides procedures to escape and unescape strings that are meant to
be used in a URI.
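
The correspondence between an escape sequence and its character can be
checked with a few lines of plain Scheme. The following is only a
sketch: \ex{hex-digits->char} is a hypothetical helper, not part of the
\ex{uri} structure, and it assumes that \ex{integer->char} follows
ASCII, which not every Scheme implementation guarantees.
%
\begin{verbatim}
;; Hypothetical helper: turn the two hex digits of an escape
;; sequence into the character they denote.
;; Assumption: integer->char maps ASCII codes to characters.
(define (hex-digits->char h1 h2)
  (integer->char (string->number (string h1 h2) 16)))

(hex-digits->char #\2 #\0)   ; the space character, code 32
(hex-digits->char #\6 #\1)   ; the letter a, code 97
\end{verbatim}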

\section{Procedures}

\defun{parse-uri} {uri-string} {scheme path search fragid}
\label{proc:parse-uri}
\begin{desc}
Parses \var{uri\=string} into its four fields.
The fields are \emph{not} unescaped, because the rules for
parsing the \var{path} component in particular depend on
\var{scheme} and must be applied to text that is still escaped.
The URL parser is responsible for unescaping. If the \var{scheme},
\var{search} or \var{fragid} portions are not specified, they are
\sharpf. Otherwise, \var{scheme}, \var{search}, and \var{fragid} are
strings. \var{path} is a non-empty list of strings: the path split
at slashes.
\end{desc}

The parser works inwards from both ends of the string:
\begin{itemize}
\item First, the code searches forwards for the first reserved
character (\verb|=|, \verb|;|, \verb|/|, \verb|#|, \verb|?|,
\verb|:| or space). If it is a colon, the text before it is the
\var{scheme} part, which is removed; otherwise there is no
\var{scheme} part.
\item Then the code searches backwards from the end for the last reserved
character. If it is a sharp, the text after it is the \var{fragid}
part, which is removed.
\item Then the code again searches backwards from the end for the last
reserved character. If it is a question mark, the text after it is
the \var{search} part, which is removed.
\item What is left is the path. The code splits it at slashes. The
empty string becomes a list containing the empty string.
\end{itemize}
%
This scheme is tolerant of the various ways people build broken
URIs out there on the Net\footnote{So it does not absolutely conform
to RFC~1630.}, e.g.\ \verb|=| is a reserved character, but is used
unescaped in the search part. It was given to me\footnote{That's
Olin Shivers.} by Dan Connolly of the W3C and slightly modified.
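%
As an illustration, the steps above can be traced on a simple URI.
The sketch assumes that the \ex{uri} structure is open and that
\ex{parse-uri} delivers its four results as multiple values, as the
signature suggests; the URI itself is only an example.
%
\begin{verbatim}
;; Collect the four fields of a URI in a list.
;; Assumption: parse-uri returns multiple values.
(call-with-values
  (lambda () (parse-uri "http://www.w3.org/pub/WWW?query#frag"))
  (lambda (scheme path search fragid)
    (list scheme path search fragid)))
\end{verbatim}
%
Going by the description, the forward search stops at the colon, so
\var{scheme} should be \verb|"http"|; the two backward searches should
find the sharp and then the question mark, giving \verb|"frag"| and
\verb|"query"|; and the remaining text \verb|//www.w3.org/pub/WWW| is
split at slashes into the \var{path}.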

\defun{unescape-uri}{string [start] [end]}{string}
\begin{desc}
\ex{Unescape-uri} unescapes a string. If \var{start} and/or \var{end} are
specified, they specify the start and end positions of the part of
\var{string} that should be unescaped.
\end{desc}
%
This procedure should only be used \emph{after} the URI has been parsed,
since unescaping may introduce characters that blow up the
parse---that's why escape sequences are used in URIs.
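%
The following sketch shows what can go wrong. It assumes that the
\ex{uri} structure is open and uses a made-up path in which
\ex{\%2F} stands for a slash that belongs to a file name.
%
\begin{verbatim}
;; Right order: split first, then unescape each element.
;; The escaped slash stays inside the second element.
(map unescape-uri (split-uri "dir/a%2Fb" 0 9))

;; Wrong order: unescaping first turns %2F into a real slash,
;; so the split sees an extra path separator.
(split-uri (unescape-uri "dir/a%2Fb") 0 7)
\end{verbatim}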

\defvar{uri-escaped-chars}{char-set}
\begin{desc}
This is a set of characters (in the sense of SRFI~14) which are
escaped in URIs. These are the
following characters: \verb|$|, \verb|-|, \verb|_|, \verb|@|, %$
\verb|.|, \verb|&|, \verb|!|, \verb|*|, \verb|\|, \verb|"|,
\verb|'|, \verb|(|, \verb|)|, \verb|,|, \verb|+|, and all other
characters that are neither letters nor digits (such as space and
control characters).
\end{desc}

\defun{escape-uri} {string [escaped-chars]} {string}
\begin{desc}
This procedure escapes characters of \var{string} that are in
\var{escaped\=chars}. \var{Escaped\=chars} defaults to
\ex{uri\=escaped\=chars}.
\end{desc}
%
Be careful when applying this procedure to chunks of text with
syntactically meaningful reserved characters (e.g., paths with URI
slashes or colons): they will be escaped and lose their special
meaning. E.g.\ it would be a mistake to apply \ex{escape-uri} to
\begin{verbatim}
//lcs.mit.edu:8001/foo/bar.html
\end{verbatim}
%
because the sla\-shes and co\-lons would be escaped.

\defun{split-uri}{uri start end} {list}
\begin{desc}
This procedure splits \var{uri} at slashes. Only the substring given
with \var{start} (inclusive) and \var{end} (exclusive) as indices is
considered. \var{start} and $\var{end} - 1$ have to be within the
range of \var{uri}. Otherwise an \ex{index-out-of-range} exception
will be raised.

Example: \codex{(split-uri "foo/bar/colon" 4 11)} returns
\codex{("bar" "col")}
\end{desc}
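%
The splitting itself is easy to reproduce in portable Scheme. The
procedure below, \ex{split-at-slashes}, is a hypothetical
re-implementation for illustration only; the library's own
\ex{split-uri} may differ in details such as error handling.
%
\begin{verbatim}
;; Hypothetical sketch: split the substring [start, end) of s
;; at slashes. Scans from the right so that the result list
;; can be built with cons in the correct order.
(define (split-at-slashes s start end)
  (let loop ((i (- end 1)) (elt-end end) (result '()))
    (cond ((< i start)
           ;; Reached the left edge: emit the first element.
           (cons (substring s start elt-end) result))
          ((char=? (string-ref s i) #\/)
           ;; Found a slash: emit the element to its right.
           (loop (- i 1) i
                 (cons (substring s (+ i 1) elt-end) result)))
          (else (loop (- i 1) elt-end result)))))

(split-at-slashes "foo/bar/colon" 4 11)   ; => ("bar" "col")
\end{verbatim}
%
With an empty range the sketch returns a list containing the empty
string, which matches the behaviour described for \ex{parse-uri}.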

\defun{uri-path->uri}{plist}{string}
\begin{desc}
This procedure generates a path out of a URI path list by inserting
slashes between the elements of \var{plist}.
\end{desc}
%
If you want to use the resulting string for further operations, you
should escape the elements of \var{plist} in case they contain
slashes, like so:
%
\begin{verbatim}
(uri-path->uri (map escape-uri plist))
\end{verbatim}

\defun{simplify-uri-path}{path}{list}
\begin{desc}
This procedure simplifies a URI path. It removes \verb|"."| and
\verb|".."| entries from \var{path} (each \verb|".."| together with
the entry before it), and removes parts before a root.
The result is a list, or \sharpf{} if the path tries to back up past
root.
\end{desc}
%
According to RFC~2396, relative paths are considered not to start with
\verb|/|. They are appended to a base URL path and then simplified.
So before you start to simplify a URL, try to find out if it is a
relative path (i.e.\ it does not start with a \verb|/|).

Examples:
%
\begin{alltt}
(simplify-uri-path (split-uri "/foo/bar/baz/.." 0 15))
\(\Rightarrow\) ("" "foo" "bar")

(simplify-uri-path (split-uri "foo/bar/baz/../../.." 0 20))
\(\Rightarrow\) ()

(simplify-uri-path (split-uri "/foo/../.." 0 10))
\(\Rightarrow\) #f

(simplify-uri-path (split-uri "foo/bar//" 0 9))
\(\Rightarrow\) ("")

(simplify-uri-path (split-uri "foo/bar/" 0 8))
\(\Rightarrow\) ("")

(simplify-uri-path (split-uri "/foo/bar//baz/../.." 0 19))
\(\Rightarrow\) #f
\end{alltt}
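%
The check for a relative path mentioned above can be sketched as
follows. \ex{relative-path?} is a hypothetical helper, not part of the
\ex{uri} structure, and treating the empty string as relative is an
assumption of this sketch.
%
\begin{verbatim}
;; Hypothetical helper: a path string is relative if it does
;; not start with a slash. Assumption: the empty string counts
;; as relative.
(define (relative-path? path)
  (or (= (string-length path) 0)
      (not (char=? (string-ref path 0) #\/))))

(relative-path? "foo/bar")    ; => #t
(relative-path? "/foo/bar")   ; => #f
\end{verbatim}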


%%% Local Variables:
%%% mode: latex
%%% TeX-master: "man"
%%% End: