From ff8061c4eae35b0483795d6cb96641edbb20f398 Mon Sep 17 00:00:00 2001 From: sperber Date: Tue, 14 Jan 2003 15:02:44 +0000 Subject: [PATCH] Reasonably complete and up-to-date docs. --- doc/latex/uri.tex | 328 +++++++++++++++++++--------------------------- 1 file changed, 138 insertions(+), 190 deletions(-) diff --git a/doc/latex/uri.tex b/doc/latex/uri.tex index 9eb2389..1a313b9 100644 --- a/doc/latex/uri.tex +++ b/doc/latex/uri.tex @@ -1,218 +1,166 @@ -\chapter{Handle URIs}\label{cha:uri} -% -\begin{description} -\item[Used files:] uri.scm -\item[Name of the package:] uri -\end{description} -% +\chapter{Parsing and Processing URIs}\label{cha:uri} + +The \ex{uri} structure contains a library for dealing with URIs. + +\section{Notes on URI Syntax} -\section{Overview} A URI (Uniform Resource Identifier) is of following syntax: +% \begin{inset} -[scheme] : \semvar{path} [{\normalfont?\/} search] [{\normalfont\#} fragmentid] +[scheme] : \var{path} [{\normalfont?\/} search] [{\normalfont\#} fragid] \end{inset} -Parts in brackets may be ommitted. The last part is usually referred -to as fragid in this document. +% +Parts in brackets may be omitted. -As you see, the URI contains characters like \verb|:| to indicate its -different parts. But what, if the \semvar{scheme} contains \verb|:| as -part of its name? For this purpose, some special characters are -\emph{escaped} if they are a regular part of a name and not indicators -for the structure of a URI. Escape-sequences are of following scheme: -\verb|\%hh| where \verb|h| is a hexadecimal digit. The hexadecimal -number refers to the (US) ASCII code of the escaped character, e.g.\ -\ex{\%20} is space (ASCII character 32) and \ex{\%61} is `a' (ASCII -character 97). This module provides procedures to escape and unescape -strings that are meant to be used in a URI. +The URI contains characters like \verb|:| to indicate its different +parts. 
Some special characters are \emph{escaped} if they are a +regular part of a name and not indicators for the structure of a URI. +Escape sequences have the following form: \verb|\%hh| where \verb|h| +is a hexadecimal digit. The hexadecimal number refers to the +ASCII code of the escaped character, e.g.\ \ex{\%20} is space (ASCII +32) and \ex{\%61} is `a' (ASCII 97). This module +provides procedures to escape and unescape strings that are meant to +be used in a URI. \section{Procedures} -\begin{defundesc}{parse-uri} {uri-string } {scheme path search +\defun{parse-uri} {uri-string } {scheme path search frag-id} \label{proc:parse-uri} - - Parses an \semvar{uri\=string} in the possible four fields, as - mentioned above in \emph{Overview}. These four fields are returned - as a multiple value. They are \emph{not} unescaped, as the rules for - parsing the \semvar{path} component in particular need unescaped - text, and are dependent on \semvar{scheme}. The URL parser is - responsible for doing this. If the \semvar{scheme}, \semvar{search}\ - or \semvar{fragid} portions are not specified, they are \sharpf. - Otherwise, \semvar{scheme}, \semvar{search}, and \semvar{fragid} are - strings. \semvar{path} is a non-empty string list -- the path split +\begin{desc} + Parses a \var{uri\=string} into its four fields. + The fields are \emph{not} unescaped, as the rules for + parsing the \var{path} component in particular need unescaped + text, and are dependent on \var{scheme}. The URL parser is + responsible for doing this. If the \var{scheme}, \var{search} + or \var{fragid} portions are not specified, they are \sharpf. + Otherwise, \var{scheme}, \var{search}, and \var{fragid} are + strings. \var{path} is a non-empty string list---the path split at slashes. - - For those of you who are interested, here is a description of the - parsing technique. It is inwards from both ends. 
- \begin{itemize} - \item First we search forwards for the first reserved character - (\verb|=|, \verb|;|, \verb|/|, \verb|#|, \verb|?|, \verb|:| or - \verb|space|). If it's a colon, then that's the \semvar{scheme} - part, otherwise we have no \semvar{scheme} part. At all events we - remove it. - \item Then we search backwards from the end for the last reserved - char. If it's a sharp, then that's the \semvar{fragid} part -- - remove it. - \item Then we search backwards from the end for the last reserved - char. If it's a question-mark, then that's the \semvar{search} - part -- remove it. - \item What's left is the path. We split at slashes. The empty string - becomes a list containing the empty string. - \end{itemize} - - This scheme is tolerant of the various ways people build broken - URI's out there on the Net\footnote{So it is not absolutely conform - with RFC~1630}, e.g. \verb|=| is a reserved character, but used - unescaped in the search-part. It was given to me\footnote{That's - Olin Shivers.} by Dan Connolly of the W3C and slightly modified. -\end{defundesc} +\end{desc} -\begin{defundesc}{unescape-uri} {string [start] [end]} {string} - Unescapes a string. This procedure should only be used \emph{after} - the URL (!) was parsed, since unescaping may introduce characters - that blow up the parse (that's why escape sequences are used in URIs - ;-). Escape sequences are of the scheme as described in ``Overview''. -\end{defundesc} +Here is a description of the parsing technique. It is inwards from +both ends: +\begin{itemize} +\item First, the code searches forwards for the first reserved + character (\verb|=|, \verb|;|, \verb|/|, \verb|#|, \verb|?|, + \verb|:| or \verb|space|). If it's a colon, then that's the + \var{scheme} part, otherwise there is no \var{scheme} part. At + all events, it is removed. +\item Then the code searches backwards from the end for the last reserved + char. If it's a sharp, then that's the \var{fragid} part---remove it. 
+\item Then the code searches backwards from the end for the last reserved + char. If it's a question-mark, then that's the \var{search} + part---remove it. +\item What's left is the path. The code splits it at slashes. The + empty string becomes a list containing the empty string. +\end{itemize} +% +This scheme is tolerant of the various ways people build broken +URIs out there on the Net\footnote{So it does not absolutely conform + to RFC~1630.}, e.g.\ \verb|=| is a reserved character, but used +unescaped in the search-part. It was given to me\footnote{That's + Olin Shivers.} by Dan Connolly of the W3C and slightly modified. +\defun{unescape-uri}{string [start] [end]}{string} +\begin{desc} + \ex{Unescape-uri} unescapes a string. If \var{start} and/or \var{end} are + specified, they delimit the part of \var{string} + that should be unescaped. +\end{desc} +% +This procedure should only be used \emph{after} the URI was parsed, +since unescaping may introduce characters that blow up the +parse---that's why escape sequences are used in URIs. -%\texttt{uri-escaped-chars} \hfill -%\texttt{char-set}\index{\texttt{uri-escaped-chars}} \defvar{uri-escaped-chars}{char-set} \begin{desc} - A set of characters that are escaped in URIs. These are the - following characters: dollar (\verb|$|), minus (\verb|-|),%fool Xemacs$ - underscore (\verb|_|), at (\verb|@|), dot (\verb|.|), and-sign - (\verb|&|), exclamation mark (\verb|!|), asterisk (\verb|*|), - backslash (\verb|\|), double quote (\verb|"|), single quote - (\verb|'|), open brace (\verb|(|), close brace (\verb|)|), comma - (\verb|,|) plus (\verb|+|) and all other characters that are neither - letters nor digits (such as space and control characters). + This is a set of characters (in the sense of SRFI~14) which are + escaped in URIs. 
These are the + following characters: \verb|$|, \verb|-|, \verb|_|, \verb|@|, %$ + \verb|.|, \verb|&|, \verb|!|, \verb|*|, \verb|\|, \verb|"|, + \verb|'|, \verb|(|, \verb|)|, \verb|,|, \verb|+|, and all other + characters that are neither letters nor digits (such as space and + control characters). \end{desc} -\begin{defundesc}{escape-uri} {string [escaped-chars]} {string} - Escapes characters of \semvar{string} that are given with - \semvar{escaped\=chars}. \semvar{escaped\=chars} default to - \ex{uri\=escaped\=chars}. Be careful with using this procedure to - chunks of text with syntactically meaningful reserved characters - (e.g., paths with URI slashes or colons) -- they'll be escaped, and - lose their special meaning. E.g.\ it would be a mistake to apply - \ex{escape-uri} to - ``\ex{//lcs.\ob{}mit.\ob{}edu:8001\ob/foo\ob/bar.html}'' because the - sla\-shes and co\-lons would be escaped. Note that \ex{esacpe-uri} - doesn't check this as it would lose his meaning. -\end{defundesc} - -\begin{defundesc}{resolve-uri} {cscheme cp scheme p} {scheme path} -%FIXME{Sorry, I can't figure out what resolve-uri is inteded to do. -%Perhaps I find it out later.} -%There is a paragraph in the spec, that describes someting like -%resolve-uri does. We have to check this. - To be done. -\end{defundesc} - -\begin{defundesc}{split-uri-path} {uri start end} {list} - Splits uri at slashes. Only the substring given with \semvar{start} - (inclusive) and \semvar{end} (exclusive) as indices is considered. - \semvar{start} and $\semvar{end} - 1$ have to be within the range of - \semvar{uri}. Otherwise an index-out-of-range exception will be - raised. Example: \codex{(split-uri-path "foo/bar/colon" 4 11)} - results to \codex{'("bar" "col")} -\end{defundesc} - -\begin{defundesc}{uri-path-list->path} {plist} {string} - Generates a path out of an uri-path-list by inserting slashes - between the elements of \semvar{plist}. 
If you want to use the - resulting string for further operation, you should escape the - elements of \semvar{plist} in case the contain slashes. This doesn't - escape them for you, you must do that yourself like - \ex{(uri-path-list->path (map escape-uri pathlist))}. -\end{defundesc} - -\begin{defundesc}{simplify-uri-path} {path} {list} - Removes `\ex{.}' and `\ex{..}' entries from path. The result is - a (maybe empty) list representing a path that does not contain any - `\ex{.}' or `\ex{..}'\,. The list can only be empty if the path - did not start with a slash (for the rare occasion someone wants to - simplify a relative path). The result is \sharpf{} if the path tries - to back up past root, for example by `\ex{/..}' or - `\ex{/foo\ob/..\ob/..}' or just `\ex{..}'\,. `\ex{//}' may occur - somewhere in the path referring to root but not being backed up. - Examples: -%FIXME: Can't we have a better environment for examples like these? -\begin{alltt} -(simplify-uri-path - (split-uri-path "/foo/bar/baz/.." 0 15)) -\end{alltt} - results to - \codex{'("" "foo" "bar")} - -\begin{alltt} -(simplify-uri-path - (split-uri-path "foo/bar/baz/../../.." 0 20)) -\end{alltt} - results to - \codex{'()} - -\begin{alltt} -(simplify-uri-path - (split-uri-path "/foo/../.." 0 10)) -\end{alltt} - results to - \codex{\sharpf ; tried to back up root} - -\begin{alltt} -(simplify-uri-path - (split-uri-path "foo/bar//" 0 9)) -\end{alltt} - results to - \codex{'("") ; "//" refers to root} - -\begin{alltt} -(simplify-uri-path - (split-uri-path "foo/bar/" 0 8)) -\end{alltt} - results to - \codex{'("") ; last "/" also refers to root} - -\begin{alltt} -(simplify-uri-path - (split-uri-path "/foo/bar//baz/../.." 
0 19)) -\end{alltt} - results to - \codex{\sharpf ; tries to back up root} -\end{defundesc} - -\section{Unexported names} - -\defvar{uri-reserved}{char-set} +\defun{escape-uri} {string [escaped-chars]} {string} \begin{desc} - A list of reserved characters (semicolon, slash, hash, question - mark, double colon and space). + This procedure escapes characters of \var{string} that are in + \var{escaped\=chars}. \var{Escaped\=chars} defaults to + \ex{uri\=escaped\=chars}. +\end{desc} +% +Be careful when applying this procedure to chunks of text with +syntactically meaningful reserved characters (e.g., paths with URI +slashes or colons)---they'll be escaped, and lose their special +meaning. E.g.\ it would be a mistake to apply \ex{escape-uri} to +\begin{verbatim} +//lcs.mit.edu:8001/foo/bar.html +\end{verbatim} +% +because the sla\-shes and co\-lons would be escaped. + +\defun{split-uri}{uri start end} {list} +\begin{desc} + This procedure splits \var{uri} at slashes. Only the substring given + with \var{start} (inclusive) and \var{end} (exclusive) as indices is + considered. \var{start} and $\var{end} - 1$ have to be within the + range of \var{uri}. Otherwise an \ex{index-out-of-range} exception + will be raised. + + Example: \codex{(split-uri "foo/bar/colon" 4 11)} returns + \codex{("bar" "col")} \end{desc} -\begin{defundesc}{hex-digit?} {character} {boolean} - Returns \sharpt{} if character is a hexadecimal digit (i.e., one of 1--9, - a--f, A--F), \sharpf{} otherwise. -\end{defundesc} +\defun{uri-path->uri}{plist}{string} +\begin{desc} + This procedure generates a path out of a URI path list by inserting + slashes between the elements of \var{plist}. 
+\end{desc} +% +If you want to use the resulting string for further operations, you +should escape the elements of \var{plist} in case they contain +slashes, like so: +% +\begin{verbatim} +(uri-path->uri (map escape-uri pathlist)) +\end{verbatim} +\defun{simplify-uri-path}{path}{list} +\begin{desc} + This procedure simplifies a URI path. It removes \verb|"."| and + \verb|"/.."| entries from \var{path}, and removes parts before a root. + The result is a list, or \sharpf{} if the path tries to back up past + root. +\end{desc} +% +According to RFC~2396, relative paths are considered not to start with +\verb|/|. They are appended to a base URL path and then simplified. +So before you start to simplify a URL, try to find out if it is a +relative path (i.e.\ it does not start with a \verb|/|). -\begin{defundesc}{hexchar->int} {character} {number} - Translates the given character to an integer, e.g. \ex{(hexchar->int - \#a)} results to 10. -\end{defundesc} +Examples: +% +\begin{alltt} +(simplify-uri-path (split-uri "/foo/bar/baz/.." 0 15)) +\(\Rightarrow\) ("" "foo" "bar") -\begin{defundesc}{int->hexchar} {integer} {character} - Translates the given integer from range 1--15 into an hexadecimal - character (uses uppercase letters), e.g. \ex{(int->hexchar 14)} - results to `E'. -\end{defundesc} +(simplify-uri-path (split-uri "foo/bar/baz/../../.." 0 20)) +\(\Rightarrow\) () -\begin{defundesc}{rev-append} {list-a list-b} {list} - Performs a \ex{(append (reverse list-a) list-b)}. The comment says it - should be defined in a list package but I am wondering how often - this will be used. -\end{defundesc} +(simplify-uri-path (split-uri "/foo/../.." 0 10)) +\(\Rightarrow\) #f + +(simplify-uri-path (split-uri "foo/bar//" 0 9)) +\(\Rightarrow\) ("") + +(simplify-uri-path (split-uri "foo/bar/" 0 8)) +\(\Rightarrow\) ("") + +(simplify-uri-path (split-uri "/foo/bar//baz/../.." 0 19)) +\(\Rightarrow\) #f +\end{alltt} -%EOF %%% Local Variables: %%% mode: latex