
\section{HTTP server}\label{sec:httpd}
%
\begin{description}
\item[Used files:] httpd-core.scm, httpd-handlers.scm, httpd-options.scm
\item[Names of the packages:] httpd-core, httpd-basic-handler, httpd-make-options
\end{description}
%

\subsection{Introduction}

The \Scheme underground Web system is a package of \Scheme code
that provides utilities for interacting with the World-Wide Web.
This includes:
\begin{itemize}
\item A Web server.
\item URI and URL parsers and un-parsers (see sections \ref{sec:uri}
  and \ref{sec:url}).
\item RFC822-style header parsers (see section \ref{sec:rfc822}).
\item Code for performing structured HTML output.
\item Code to assist in writing CGI \Scheme programs that can be used by
  any CGI-compliant HTTP server (such as NCSA's httpd, or the S.U.
  Web server).
\end{itemize}
       
The code can be obtained via anonymous
ftp\footnote{\ttt{}ftp://ftp-swiss.ai.mit.edu/pub/scsh/contrib/net/net.tar.gz}
and is implemented in \scm, using the system calls and support
procedures of scsh, the \Scheme Shell. The code was written to be
clear and modifiable -- it is voluminously commented and all non-\RnRS
dependencies are described at the beginning of each source file.
   
\FIXME{We should remove the note to read the source files and insert
the essentials here instead.}
I do not have the time to write detailed documentation for these
packages. However, they are very thoroughly commented, and I strongly
recommend reading the source files; they were written to be read, and
the source code comments should provide a clear description of the
system. The remainder of this note gives an overview of the server's
basic architecture and interfaces.
   
\subsection{The Scheme Underground Web Server}

The server was designed with three principal goals in mind:

\begin{description}   
\item{Extensibility} \\
  The server is designed to make it easy to extend the basic
  functionality. In fact, the server is nothing but extensions.  There
  is no distinction between the set of basic services provided by the
  server implementation and user extensions -- they are both
  implemented in Scheme, and have equal status. The design is ``turtles
  all the way down''.
          
\item{Mobile code} \\
  Because the server is written in \scm, it is simple to use the \scm
  module system to upload programs to the server for safe execution
  within a protected, server-chosen environment. The server comes with
  a simple example upload service to demonstrate this capability.
          
\item{Clarity of implementation} \\
  Because the server is written in a high-level language, it should
  make for a clearer exposition of the HTTP protocol and the
  associated URL and URI notations than one written in a low-level
  language such as C. This also should help to make the server easy to
  modify and adapt to different uses.
\end{description}

\subsubsection*{Basic server structure}
  
The Web server is started by calling the \ex{httpd} procedure, which
takes one argument, an \ex{httpd\=options} record:

\defun{httpd}{options}{\noreturn}
\begin{desc}
  This procedure starts the server. The various \semvar{options} can
  be set via the options transformers that are explained below.
  
  The server's basic loop is to wait on the port for a connection from
  an HTTP client. When it receives a connection, it reads in and
  parses the request into a special request data structure. Then the
  server forks a thread, which binds the current I/O ports to the
  connection socket and then hands off to the top-level
  \semvar{path-handler} (set in the options with
  \ex{with\=path\=handler}). The
  \semvar{path-handler} procedure is responsible for actually serving
  the request -- it can be any arbitrary computation.  Its output goes
  directly back to the HTTP client that sent the request.
   
  Before calling the path handler to service the request, the HTTP
  server installs an error handler that fields any uncaught error,
  sends an error reply to the client, and aborts the request
  transaction.  Hence any error caused by a path-handler will be
  handled in a reasonable and robust fashion.
   
  The basic server loop and the associated request data structure are
  the fixed architecture of the S.U. Web server; its flexibility lies
  in the notion of path handlers.
\end{desc}

\defun{with-port}{port \ovar{options}}{options}
\defunx{with-root-directory}{root-directory
  \ovar{options}}{options} 
\defunx{with-fqdn}{fqdn \ovar{options}}{options}
\defunx{with-reported-port}{reported-port
  \ovar{options}}{options}
\defunx{with-path-handler}{path-handler
  \ovar{options}}{options}
\defunx{with-server-admin}{mail-address
  \ovar{options}}{options}
\defunx{with-simultaneous-requests}{requests
  \ovar{options}}{options}
\defunx{with-logfile}{logfile \ovar{options}}{options}
\defunx{with-syslog?}{syslog? \ovar{options}}{options}
\begin{desc}
  As noted above, these transformers set the options for the web
  server. Each transformer takes a value and an optional
  \semvar{options} record and returns an options record in which
  that one aspect is changed. If the \semvar{options} argument is
  missing, the transformer starts from the default values, which are
  as follows:

  \begin{tabular}{ll}
    \textbf{transformer} & \textbf{default value} \\
    \hline
    \ex{with\=port} & 80 \\
    \ex{with\=root\=directory} & ``\ex{/}'' \\
    \ex{with\=fqdn} & \sharpf \\
    \ex{with\=reported\=port} & \sharpf \\
    \ex{with\=path\=handler} & \sharpf \\
    \ex{with\=server\=admin} & \sharpf \\
    \ex{with\=simultaneous\=requests} & \sharpf \\
    \ex{with\=logfile} & ``\ex{/logfile.log}''\\
    \ex{with\=syslog?} & \sharpt \\
  \end{tabular}

%  that can be found in the \ex{httpd\=make\=options}-structure:
%  \ex{with\=port}, \ex{with\=root\=directory}, \ex{with\=fqdn},
%  \ex{with\=reported-port}, \ex{with\=path\=handler},
%  \ex{with\=server\=admin}, \ex{with\=simultaneous-requests},
%  \ex{with\=logfile}, \ex{with\=syslog?} that set the port the server
%  is listening to, the root-directory of the server, the FQDN of the
%  server, the port the server assumes it is listening to, the
%  path-handler of the server (see below), the mail-address of the
%  server-admin, the maximum number of simultaneous handled requests,
%  the name of the file or the port logging in the Common Log Format
%  (CLF) is output to and if the server shall create syslog messages,
%  respectively. The port defaults to 80, the root directory defaults
%  to ``\ex{/}'', the mail address of the server-admin defaults to
%  ``\ex{sperber@\ob{}informatik.\ob{}uni\=tuebingen.\ob{}de}'',
%  \FIXME{Why does the server admin mail address have
%   sperber@informatik... as default value?}logging is done to
%  ``\ex{httpd.log}'' and syslog is enabled. All other options default
%  to \sharpf.

  For example
  \begin{code}
(httpd (with-path-handler 
        (rooted-file-handler "/usr/local/etc/httpd")
        (with-root-directory "/usr/local/etc/httpd")))
  \end{code}
  
  starts the server on port 80 with
  ``\ex{/usr/\ob{}local/\ob{}etc/\ob{}httpd}'' as root directory and
  lets it serve any file from this directory.
  \ex{rooted\=file\=handler} creates a path handler and is explained
  below. Note that the transformers are nested: each transformer
  changes one aspect of the options returned by the transformer
  nested inside it, and the innermost transformer (here
  \ex{with\=root\=directory}) changes one aspect of the default
  values.

  
  \semvar{port} is the port the server listens on,
  \semvar{root-directory} is the directory in the file system the
  server uses as root, \semvar{fqdn} is the fully qualified domain
  name the server reports, \semvar{reported-port} is the port the
  server reports it is listening on, \semvar{path-handler} is the
  top-level path handler of the server (see below), and
  \semvar{mail-address} is the mail address of the server admin.
  \semvar{requests} denotes the maximum number of simultaneous
  requests the server allows; \sharpf\ means no limit.
  \semvar{logfile} is either a string naming the log file, a port to
  which the log entries are written, or \sharpf, which disables
  logging. The log is written in Common Log Format (CLF).
  \semvar{syslog?} tells the server whether to write syslog messages
  (\sharpt) or not (\sharpf).
\end{desc}   
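
As a further illustration of nesting, the following sketch sets several
options at once. The port number, log file name and request limit are
arbitrary values chosen for illustration, and \ex{null\=path\=handler}
(described below) merely stands in for a real path handler:
\begin{code}
;; Listen on port 8080, log to /tmp/httpd.log, and allow at most
;; ten simultaneous requests. All remaining options keep their
;; default values.
(httpd (with-port 8080
        (with-logfile "/tmp/httpd.log"
         (with-simultaneous-requests 10
          (with-path-handler null-path-handler)))))
\end{code}
Because every transformer returns an options record, the order of
nesting should not matter as long as each option is set only once.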

\subsubsection*{Path handlers}
  
   A path handler is a procedure taking two arguments:
\defun{path-handler}{path req}{value}
\begin{desc}
  The \semvar{req} argument is a request record giving all the details
  of the client's request; it has the following structure: \FIXME{Make
    the record's structure a table}
  \begin{code}
(define-record request
 method            ; A string such as "GET", "PUT", etc.
 uri               ; The escaped URI string as read from request line.
 url               ; An http URL record (see url.scm).
 version           ; A (major . minor) integer pair.
 headers           ; An rfc822 header alist (see rfc822.scm).
 socket)           ; The socket connected to the client.\end{code}

The \semvar{path} argument is the URL's path, parsed and split at
slashes into a string list. For example, if the Web client
dereferences the URL
\codex{http://\ob{}clark.\ob{}lcs.\ob{}mit.\ob{}edu:\ob{}8001/\ob{}h/\ob{}shi\ob{}vers/\ob{}co\ob{}de/\ob{}web.\ob{}tar.\ob{}gz}
then the server would pass the following path to the top-level
handler: \ex{("h"\ob{} "shivers"\ob{} "code"\ob{}
  "web.\ob{}tar.\ob{}gz")}.

The \semvar{path} argument's pre-parsed representation as a string
list makes it easy for the path handler to implement recursive
dispatch on URL paths.
\end{desc}
   
Path handlers can do anything they like to respond to HTTP requests;
they have the full range of Scheme to implement the desired
functionality. When handling HTTP requests that have an associated
entity body (such as POST), the body should be read from the current
input port. Path handlers should in all cases write their reply to the
current output port. Path handlers should not perform I/O on the
request record's socket. Path handlers are frequently called
recursively, and doing I/O directly to the socket might bypass a
filtering or other processing step interposed on the current I/O ports
by some superior path handler.
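
The recursive style described above can be sketched with plain \Scheme.
The procedure \ex{prefix\=dispatcher} below is hypothetical and not part
of the server; it only illustrates how a handler consumes the head of
the path and delegates the tail to another handler:
\begin{code}
;; Hypothetical helper, for illustration only: build a path handler
;; that hands paths starting with PREFIX to HANDLER (minus the
;; prefix) and everything else, unchanged, to DEFAULT.
(define (prefix-dispatcher prefix handler default)
  (lambda (path req)
    (if (and (pair? path)
             (string=? (car path) prefix))
        (handler (cdr path) req)
        (default path req))))
\end{code}
Since the helper neither reads nor writes anything itself, any
filtering that a superior handler has installed on the current I/O
ports stays in effect for the handler it delegates to.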

\subsubsection*{Basic path handlers}
  
Although the user can write any path handler they like, the S.U. server
comes with a useful toolbox of basic path handlers that can be used
and built upon (exported by the \ex{httpd\=basic\=handlers} structure):
   
\begin{defundesc}{alist-path-dispatcher}{ph-alist default-ph}{path-handler}
  This procedure takes a \ex{string->\ob{}path\=handler} alist, and a
  default path handler, and returns a handler that dispatches on its
  path argument. When the new path handler is applied to a path
  \ex{("foo"\ob{} "bar"\ob{} "baz")}, it uses the first element of
  the path -- ``\ex{foo}'' -- to index into the alist. If it finds an
  associated path handler in the alist, it hands the request off to
  that handler, passing it the tail of the path, \ex{("bar"\ob{}
    "baz")}. On the other hand, if the path is empty, or the alist
  search does not yield a hit, we hand off to the default path
  handler, passing it the entire original path, \ex{("foo"\ob{}
    "bar"\ob{} "baz")}.
          
  This procedure is how you say: ``If the first element of the URL's
  path is `foo', do X; if it's `bar', do Y; otherwise, do Z.'' If one
  takes an object-oriented view of the process, an alist path-handler
  does method lookup on the requested operation, dispatching off to
  the appropriate method defined for the URL.
          
  The slash-delimited URI path structure implies an associated tree of
  names. The path-handler system and the alist dispatcher allow you to
  procedurally define the server's response to any arbitrary subtree
  of the path space.
          
  Example: A typical top-level path handler is
\begin{code}          
(define ph
  (alist-path-dispatcher
      `(("h"       . ,(home-dir-handler "public_html"))
        ("cgi-bin" . ,(cgi-handler "/usr/local/etc/httpd/cgi-bin"))
        ("seval"   . ,seval-handler))
      (rooted-file-handler "/usr/local/etc/httpd/htdocs")))\end{code}
    
    This means:
\begin{itemize}          
\item If the path looks like \ex{("h"\ob{} "shivers"\ob{}
    "code"\ob{} "web.\ob{}tar.\ob{}gz")}, pass the path
  \ex{("shivers"\ob{} "code"\ob{} "web.\ob{}tar.\ob{}gz")} to a
  home-directory path handler.
\item If the path looks like \ex{("cgi-\ob{}bin"\ob{} "calendar")},
  pass \ex{("calendar")} off to the CGI path handler.
\item If the path looks like \ex{("seval"\ob{} \ldots)}, the tail
  of the path is passed off to the code-uploading seval path
  handler.
\item Otherwise, the whole path is passed to a rooted file handler,
  which converts it into a filename, rooted at
  \ex{/usr/\ob{}lo\ob{}cal/\ob{}etc/\ob{}httpd/\ob{}htdocs},
  and serves that file.
\end{itemize}
\end{defundesc}
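
Dispatchers can be nested to describe deeper subtrees of the path
space. A minimal sketch, in which the directory names are made up for
illustration:
\begin{code}
;; Paths of the form ("docs" "manual" ...) are served from the
;; manual directory; any other path below "docs" yields a 404 reply.
(define docs-ph
  (alist-path-dispatcher
      `(("manual" . ,(rooted-file-handler "/usr/local/doc/manual")))
      null-path-handler))

;; Everything outside "docs" is served from the htdocs directory.
(define top-ph
  (alist-path-dispatcher
      `(("docs" . ,docs-ph))
      (rooted-file-handler "/usr/local/etc/httpd/htdocs")))
\end{code}
With these definitions, the path \ex{("docs"\ob{} "manual"\ob{}
  "intro.html")} reaches the inner rooted file handler as
\ex{("intro.html")}.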
            
\begin{defundesc}{home-dir-handler}{subdir}{path-handler}
  This procedure builds a path handler that does basic file serving
  out of home directories. If the resulting \semvar{path-handler} is
  passed a path of \ex{(user . file\=path)}, then it serves the file
  \ex{user's\=ho\ob{}me\=di\ob{}rec\ob{}to\ob{}ry/\ob{}sub\ob{}dir/\ob{}file\=path}.
    
  The path handler only handles GET requests; the filename is not
  allowed to contain \ex{..} elements.
\end{defundesc}
          
\begin{defundesc}{tilde-home-dir-handler}{subdir default-path-handler}{path-handler}
  This path handler examines the car of the path. If it is a string
  beginning with a tilde, e.g., \ex{"\textasciitilde{}ziggy"}, then the string is
  taken to mean a home directory, and the request is served similarly
  to a home-dir-handler path handler. Otherwise, the request is passed
  off in its entirety to the \semvar{default-path-handler}.
          
  This procedure is useful for implementing servers that provide the
  semantics of the NCSA httpd server.
\end{defundesc}
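
For instance, the following sketch serves \ex{public\_html}
directories for tilde paths and falls back to a rooted file handler
for everything else; both directory names are illustrative assumptions:
\begin{code}
;; A path whose first element starts with a tilde is served from
;; that user's public_html directory; all other paths are served
;; from htdocs.
(define ph
  (tilde-home-dir-handler
      "public_html"
      (rooted-file-handler "/usr/local/etc/httpd/htdocs")))
\end{code}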
          
\begin{defundesc}{cgi-handler}{cgi-directory}{path-handler}
  This procedure returns a path-handler that passes the request off to
  some program using the CGI interface. The script name is taken from
  the car of the path; it is checked for occurrences of \ex{..}. If
  the path is \ex{("my\=prog"\ob{} "foo"\ob{} "bar")} then the
  program executed is
  \ex{cgi\=di\ob{}rec\ob{}to\ob{}ry/\ob{}my\=prog}.

  When the CGI path handler builds the process environment for the CGI
  script, several elements (e.g., \ex{\$PATH} and
  \ex{\$SERVER\_SOFTWARE}) are request-invariant, and can be
  computed at server start-up time. This can be done by calling
  \codex{(initialise-request-invariant-cgi-env)} 
  when the server starts up. This is not necessary, but will make CGI
  requests a little faster.
\end{defundesc}
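
A start-up sequence along these lines precomputes the invariant part of
the CGI environment before the first request arrives. The CGI directory
is the one used in the dispatcher example above; treat the whole
fragment as a sketch rather than a prescribed configuration:
\begin{code}
;; Compute the request-invariant CGI environment once ...
(initialise-request-invariant-cgi-env)

;; ... then start a server that passes every request to a CGI program.
(httpd (with-path-handler
        (cgi-handler "/usr/local/etc/httpd/cgi-bin")))
\end{code}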
          
\begin{defundesc}{rooted-file-handler}{root-dir}{path-handler} 
  Returns a path handler that serves files from a particular root in
  the file system. Only the GET operation is provided. The path
  argument passed to the handler is converted into a filename, and
  appended to \semvar{root-dir}. The file name is checked for \ex{..}
  components, and the transaction is aborted if it contains any.
  Otherwise, the file is served to the client.
\end{defundesc}
          
\begin{defundesc}{null-path-handler}{path req}{\noreturn}
  This path handler is useful as a default handler. It handles no
  requests, always returning a ``404 Not found'' reply to the client.
\end{defundesc}
          
\subsection{HTTP errors}
  
Authors of path-handlers need to be able to handle errors in a
reasonably simple fashion. The S.U. Web server provides a set of error
conditions that correspond to the error replies in the HTTP protocol.
These errors can be raised with the \ex{http\=error} procedure. When
the server runs a path handler, it runs it in the context of an error
handler that catches these errors, sends an error reply to the client,
and closes the transaction.
   
\begin{defundesc}{http-error}{reply-code req \ovar{extra \ldots}}{\noreturn}
  This raises an HTTP error condition. The reply code is one of the
  numeric HTTP error reply codes, which are bound to the variables
  \ex{http\=re\ob{}ply/\ob{}ok},
  \ex{http\=re\ob{}ply/\ob{}not\=found},
  \ex{http\=re\ob{}ply/\ob{}bad\=request}, and so forth. The
  \semvar{req} argument is the request record that caused the error.
  Any following extra args are passed along for informational
  purposes. Different HTTP errors take different types of extra
  arguments. For example, the ``301 moved permanently'' and ``302
  moved temporarily'' replies use the first two extra values as the
  \ex{URI:} and \ex{Lo\-ca\-tion:} fields in the reply header,
  respectively. See the clauses of the
  \ex{send\=http\=er\ob{}ror\=re\ob{}ply} procedure for details.
\end{defundesc}
          
\begin{defundesc}{send-http-error-reply}{reply-code request \ovar{extra \ldots}}{\noreturn}
  This procedure writes an error reply out to the current output port.
  If an error occurs during this process, it is caught, and the
  procedure silently returns. The http server's standard error handler
  passes all http errors raised during path-handler execution to this
  procedure to generate the error reply before aborting the request
  transaction.
\end{defundesc}
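
A path handler might use \ex{http\=error} as in the following sketch.
The wrapper \ex{get\=only} is hypothetical, and the accessor
\ex{request:method} assumes the accessor naming that scsh's
\ex{define\=record} usually generates; check the sources for the
actual names:
\begin{code}
;; Hypothetical wrapper, for illustration only: pass GET requests on
;; to HANDLER and reject every other method with a "bad request"
;; error reply.
(define (get-only handler)
  (lambda (path req)
    (if (string=? (request:method req) "GET")
        (handler path req)
        (http-error http-reply/bad-request req))))
\end{code}
Because \ex{http\=error} does not return, the wrapper needs no further
clean-up; the server's error handler sends the reply and closes the
transaction.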
          
\subsection{Simple directory generation}
  
Most path-handlers that serve files to clients eventually call an
internal procedure named \ex{file\=serve}, which implements a simple
directory-generation service using the following rules:
\begin{itemize}
\item If the filename has the form of a directory (i.e., it ends with
  a slash), then \ex{file\=serve} actually looks for a file named
  ``index.html'' in that directory.
\item If the filename names a directory, but is not in directory form
  (i.e., it doesn't end in a slash, as in
  ``\ex{/usr/\ob{}in\ob{}clu\ob{}de}'' or ``\ex{/usr/\ob{}raj}''),
  then \ex{file\=serve} sends back a ``301 moved permanently''
  message, redirecting the client to a slash-terminated version of the
  original URL. For example, the URL
  \ex{http://\ob{}clark.\ob{}lcs.\ob{}mit.\ob{}edu/\ob{}\textasciitilde{}shi\ob{}vers}
  would be redirected to
  \ex{http://\ob{}clark.\ob{}lcs.\ob{}mit.\ob{}edu/\ob{}\textasciitilde{}shi\ob{}vers/}.
\item If the filename names a regular file, it is served to the
  client.
\end{itemize}
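
The three rules can be summarised by the following sketch. It is not
the actual implementation of \ex{file\=serve}: the procedure names are
hypothetical, the result is just a symbol naming the rule that
applies, and \ex{file\=directory?} is assumed to be the scsh file-type
predicate:
\begin{code}
;; True if S is non-empty and ends with a slash.
(define (directory-form? s)
  (let ((n (string-length s)))
    (and (> n 0)
         (string=? (substring s (- n 1) n) "/"))))

;; Name the rule that applies to FILENAME.
(define (classify-filename filename)
  (cond ((directory-form? filename) 'serve-index-html)
        ((file-directory? filename) 'redirect-with-slash)
        (else                       'serve-regular-file)))
\end{code}
A file that does not exist would fall into the last case here; the
sketch deliberately ignores error handling.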
       
\subsection{Support procs}
  
The source files contain a host of support procedures which will be of
utility to anyone writing a custom path-handler. Read the files first.
\FIXME{Let us read the files and paste the contents here.}
   
\subsection{Losing}
  
   Be aware of two Unix problems, which may require workarounds:
\begin{enumerate}
\item NeXTSTEP's POSIX implementation of the \ex{get\ob{}pwnam()}
  routine will silently tell you that every user has uid 0. This means
  that if your server, running as root, does a
  \codex{(set-uid (user->uid "nobody"))}
  it will essentially do a 
  \codex{(set-uid 0)}
  and you will thus still be running as root.  The fix is to manually
  find out who user nobody is (he's -2 on my system), and to hard-wire
  this into the server: 
  \codex{(set-uid -2)} 
  This problem is NeXTSTEP specific. If you are not using NeXTSTEP,
  it does not affect you.
\item On NeXTSTEP, the \ex{ip\=ad\ob{}dress->\ob{}host\=name}
  translation routine (in C, \ex{get\ob{}host\ob{}by\ob{}addr()}; in
  scsh, \ex{(host\=in\ob{}fo addr)}) does not use the DNS system; it
  goes through NeXT's proprietary Netinfo system, and may not return a
  fully-qualified domain name. For example, on my system, I get
  ``\ex{ame\ob{}lia\=ear\ob{}hart}'', when I want
  ``\ex{ame\ob{}lia\=ear\ob{}hart.\ob{}lcs.\ob{}mit.\ob{}edu}''. Since
  the server uses this name to construct redirection URLs to be sent
  back to the Web client, it needs to be an FQDN.
  
  This problem may occur on other operating systems; I cannot
  determine if \ex{get\ob{}host\ob{}by\ob{}addr()} is required to
  return an FQDN or not. (I would appreciate hearing the answer if you
  know; my local Internet gurus couldn't tell me.)
  
  If your system doesn't give you a complete Internet address when you
  say
  \codex{(host-info:name (host-info (system-name)))}
  then you have this problem. 

  The server has a workaround. There is a procedure exported from the
  \ex{httpd\=core} package:
  \codex{(set-my-fqdn name)}
  Call this to crow-bar the server's idea of its own Internet host
  name before running the server, and all will be well.
\end{enumerate}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: