The Scheme Underground Web system
Olin Shivers
7/95
Additions by Mike Sperber, 10/96
The Scheme Underground Web system is a package of Scheme code that provides
utilities for interacting with the World-Wide Web. This includes:
- A Web server.
- URI and URL parsers and un-parsers.
- RFC822-style header parsers.
- Code for performing structured HTML output.
- Code to assist in writing CGI Scheme programs
that can be used by any CGI-compliant HTTP server
(such as NCSA's httpd, or the S.U. Web server).
The code can be obtained via anonymous ftp and is implemented in Scheme 48,
using the system calls and support procedures of scsh, the Scheme Shell. The
code was written to be clear and modifiable -- it is voluminously commented
and all non-R4RS dependencies are described at the beginning of each source
file.
I do not have the time to write detailed documentation for these packages.
However, they are very thoroughly commented, and I strongly recommend reading
the source files; they were written to be read, and the source code comments
should provide a clear description of the system. The remainder of this note
gives an overview of the server's basic architecture and interfaces.
* The Scheme Underground Web Server
The server was designed with three principal goals in mind:
- Extensibility
The server is designed to make it easy to extend the basic
functionality. In fact, the server is nothing but extensions. There is
no distinction between the set of basic services provided by the server
implementation and user extensions -- they are both implemented in
Scheme, and have equal status. The design is "turtles all the way down."
- Mobile code
Because the server is written in Scheme 48, it is simple to use the
Scheme 48 module system to upload programs to the server for safe
execution within a protected, server-chosen environment. The server
comes with a simple example upload service to demonstrate this
capability.
- Clarity of implementation
Because the server is written in a high-level language, it should make
for a clearer exposition of the HTTP protocol and the associated URL
and URI notations than one written in a low-level language such as C.
This also should help to make the server easy to modify and adapt to
different uses.
** Basic server structure
The Web server is started by calling the HTTPD procedure, which takes
one required and two optional arguments:
(httpd path-handler [port working-directory])
The server accepts connections on the given port, which defaults to 80.
The server runs with the working directory set to the given value,
which defaults to
/usr/local/etc/httpd
The server's basic loop is to wait on the port for a connection from an HTTP
client. When it receives a connection, it reads in and parses the request into
a special request data structure. Then the server forks a child process, which
binds the current I/O ports to the connection socket, and then hands off to
the top-level path handler (the first argument to httpd). The path-handler
procedure is responsible for actually serving the request -- it can be any
arbitrary computation. Its output goes directly back to the HTTP client that
sent the request.
Before calling the path handler to service the request, the HTTP server
installs an error handler that fields any uncaught error, sends an
error reply to the client, and aborts the request transaction. Hence
any error caused by a path-handler will be handled in a reasonable and
robust fashion.
The basic server loop and the associated request data structure are the fixed
architecture of the S.U. Web server; its flexibility lies in the notion of
path handlers.
** Path handlers
A path handler is a procedure taking two arguments:
(path-handler path req)
The REQ argument is a request record giving all the details of the
client's request; it has the following structure:
(define-record request
method ; A string such as "GET", "PUT", etc.
uri ; The escaped URI string as read from request line.
url ; An http URL record (see url.scm).
version ; A (major . minor) integer pair.
headers ; An rfc822 header alist (see rfc822.scm).
socket) ; The socket connected to the client.
The PATH argument is the URL's path, parsed and split at slashes into a string
list. For example, if the Web client dereferences URL
http://clark.lcs.mit.edu:8001/h/shivers/code/web.tar.gz
then the server would pass the following path to the top-level handler:
("h" "shivers" "code" "web.tar.gz")
The path argument's pre-parsed representation as a string list makes it easy
for the path handler to implement recursive dispatch on URL paths.
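To make the string-list representation concrete, here is a minimal sketch of
the splitting step in portable Scheme. SPLIT-PATH is a hypothetical name, not
a procedure from the S.U. sources, and the sketch makes two simplifying
assumptions: it does no URI unescaping, and it drops empty components, so a
trailing slash is not preserved. The real parser in url.scm may well behave
differently on both counts.

```scheme
;; Hypothetical helper, not part of the S.U. sources: split a URL path
;; string at slashes into a list of strings, dropping empty components.
(define (split-path str)
  (let loop ((chars (string->list str))
             (current '())   ; characters of the component in progress, reversed
             (parts '()))    ; finished components, reversed
    ;; Push the component in progress (if any) onto the finished list.
    (define (flush)
      (if (null? current)
          parts
          (cons (list->string (reverse current)) parts)))
    (cond ((null? chars) (reverse (flush)))
          ((char=? (car chars) #\/) (loop (cdr chars) '() (flush)))
          (else (loop (cdr chars) (cons (car chars) current) parts)))))

(split-path "/h/shivers/code/web.tar.gz")
;; => ("h" "shivers" "code" "web.tar.gz")
```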
Path handlers can do anything they like to respond to HTTP requests; they have
the full range of Scheme to implement the desired functionality. When
handling HTTP requests that have an associated entity body (such as POST), the
body should be read from the current input port. Path handlers should in all
cases write their reply to the current output port. Path handlers should *not*
perform I/O on the request record's socket. Path handlers are frequently
called recursively, and doing I/O directly to the socket might bypass a
filtering or other processing step interposed on the current I/O ports by some
superior path handler.
*** Basic path handlers
Although the user can write any path-handler he likes, the S.U. server comes
with a useful toolbox of basic path handlers that can be used and built upon:
(alist-path-dispatcher ph-alist default-ph) -> path-handler
This procedure takes a string->path-handler alist, and a default
path handler, and returns a handler that dispatches on its path argument.
When the new path handler is applied to a path ("foo" "bar" "baz"),
it uses the first element of the path -- "foo" -- to index into
the alist. If it finds an associated path handler in the alist, it
hands the request off to that handler, passing it the tail of the path,
("bar" "baz"). On the other hand, if the path is empty, or the alist
search does not yield a hit, we hand off to the default path handler,
passing it the entire original path, ("foo" "bar" "baz").
This procedure is how you say: "If the first element of the URL's path
is `foo', do X; if it's `bar', do Y; otherwise, do Z." If one takes
an object-oriented view of the process, an alist path-handler does
method lookup on the requested operation, dispatching off to the
appropriate method defined for the URL.
The slash-delimited URI path structure implies an associated
tree of names. The path-handler system and the alist dispatcher
allow you to procedurally define the server's response to any
arbitrary subtree of the path space.
Example:
A typical top-level path handler is
(define ph
(alist-path-dispatcher
`(("h" . ,(home-dir-handler "public_html"))
("cgi-bin" . ,(cgi-handler "/usr/local/etc/httpd/cgi-bin"))
("seval" . ,seval-handler))
(rooted-file-handler "/usr/local/etc/httpd/htdocs")))
This means:
- If the path looks like ("h" "shivers" "code" "web.tar.gz"),
pass the path ("shivers" "code" "web.tar.gz") to a
home-directory path handler.
- If the path looks like ("cgi-bin" "calendar"),
pass ("calendar") off to the CGI path handler.
- If the path looks like ("seval" ...), the tail of the path
is passed off to the code-uploading seval path handler.
- Otherwise, the whole path is passed to a rooted file handler, which
will convert it into a filename, rooted at /usr/local/etc/httpd/htdocs,
and serve that file.
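The dispatch rule itself is small enough to sketch in a few lines. The
following is a stand-in written for illustration only; the name
SKETCH-ALIST-PATH-DISPATCHER and the toy TAGGED-HANDLER are hypothetical, and
the real ALIST-PATH-DISPATCHER in the sources may differ in detail. The toy
handlers simply return what they were given, so the routing can be seen
without a running server.

```scheme
;; Sketch only: look up the first path element in the alist; on a hit,
;; hand the tail of the path to that handler, otherwise hand the whole
;; path to the default handler.
(define (sketch-alist-path-dispatcher ph-alist default-ph)
  (lambda (path req)
    (let ((hit (and (pair? path) (assoc (car path) ph-alist))))
      (if hit
          ((cdr hit) (cdr path) req)     ; matched: pass the tail
          (default-ph path req)))))      ; empty path or no match: pass it all

;; Toy handler (hypothetical): reports its tag and the path it received.
(define (tagged-handler tag)
  (lambda (path req) (list tag path)))

(define demo-ph
  (sketch-alist-path-dispatcher
   (list (cons "h" (tagged-handler 'home))
         (cons "cgi-bin" (tagged-handler 'cgi)))
   (tagged-handler 'default)))

(demo-ph '("h" "shivers" "code") 'no-request)  ; => (home ("shivers" "code"))
(demo-ph '("foo" "bar") 'no-request)           ; => (default ("foo" "bar"))
(demo-ph '() 'no-request)                      ; => (default ())
```

Because a dispatcher is itself a path handler, dispatchers can be nested as
alist entries to carve up deeper levels of the path tree.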
(home-dir-handler subdir) -> path-handler
This procedure builds a path handler that does basic file serving
out of home directories. If the resulting path handler is passed
a path of (<user> . <file-path>), then it serves the file
<user's-home-directory>/<subdir>/<file-path>
The path handler only handles GET requests; the filename is not
allowed to contain .. elements.
(tilde-home-dir-handler subdir default-path-handler) -> path-handler
This path handler examines the car of the path. If it is a string
beginning with a tilde, e.g., "~ziggy", then the string is taken to
mean a home directory, and the request is served similarly to a
HOME-DIR-HANDLER path handler. Otherwise, the request is passed off in
its entirety to the default path handler.
This procedure is useful for implementing servers that provide the
semantics of the NCSA httpd server.
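The tilde test can be sketched as follows. Both procedure names are
hypothetical, and the sketch assumes that a bare "~" with no user name is
treated as a non-match; the sources should be consulted for what the real
handler does in that case. Instead of serving anything, the sketch returns a
list describing which branch would be taken.

```scheme
;; Hypothetical helper: if STR starts with a tilde followed by at least
;; one character, return the user name after the tilde; otherwise #f.
(define (tilde-user str)
  (and (> (string-length str) 1)
       (char=? (string-ref str 0) #\~)
       (substring str 1 (string-length str))))

;; Describe the routing decision for a path (no serving is done here).
(define (sketch-tilde-dispatch path)
  (let ((user (and (pair? path) (tilde-user (car path)))))
    (if user
        (list 'home-dir user (cdr path))  ; serve from the user's home directory
        (list 'default path))))           ; whole path goes to the default handler

(sketch-tilde-dispatch '("~ziggy" "pub" "x.html"))
;; => (home-dir "ziggy" ("pub" "x.html"))
(sketch-tilde-dispatch '("docs" "x.html"))
;; => (default ("docs" "x.html"))
```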
(cgi-handler cgi-directory) -> path-handler
This procedure returns a path-handler that passes the request off to some
program using the CGI interface. The script name is taken from the
car of the path; it is checked for occurrences of ..'s. If the path is
("my-prog" "foo" "bar")
then the program executed is
<cgi-directory>/my-prog
When the CGI path handler builds the process environment for the
CGI script, several elements (e.g., $PATH and $SERVER_SOFTWARE)
are request-invariant, and can be computed at server start-up time.
This can be done by calling
(initialise-request-invariant-cgi-env)
when the server starts up. This is *not* necessary, but will make CGI
requests a little faster.
(rooted-file-handler root-dir) -> path-handler
Returns a path handler that serves files from a particular root in the
file system. Only the GET operation is provided. The path argument
passed to the handler is converted into a filename, and appended to
ROOT-DIR. The file name is checked for .. components, and the
transaction is aborted if it contains any. Otherwise, the file is served to the
client.
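A minimal sketch of the path-to-filename conversion with its .. check might
look like this. PATH->FILENAME is a hypothetical name, and returning #f on a
.. component is a simplification chosen so the sketch runs on its own; the
real handler aborts the transaction instead.

```scheme
;; Hypothetical helper: append the path components to ROOT-DIR, or
;; return #f as soon as a ".." component is seen.
(define (path->filename root-dir path)
  (let loop ((rest path) (name root-dir))
    (cond ((null? rest) name)
          ((string=? (car rest) "..") #f)
          (else (loop (cdr rest)
                      (string-append name "/" (car rest)))))))

(path->filename "/usr/local/etc/httpd/htdocs" '("code" "web.tar.gz"))
;; => "/usr/local/etc/httpd/htdocs/code/web.tar.gz"
(path->filename "/usr/local/etc/httpd/htdocs" '(".." "passwd"))
;; => #f
```

Checking each component of the pre-split path, rather than searching the
joined string, avoids rejecting legitimate names that merely contain two
dots, such as "notes..txt".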
(rooted-file-or-directory-handler root-dir icon-name) -> path-handler
The same as rooted-file-handler, except it can also serve
directory index listings for directories that do not contain a
file index.html. ICON-NAME is an object describing how to get at
the various icons required for generating directory listings. The handler
uses the icons provided by CERN httpd 3.0. If ICON-NAME is a string,
it is used as a prefix for generating the icon URLs. If it is a
procedure, it should accept an icon tag (read httpd-handlers.scm for
reference) and return an icon name. If it is neither, the plain icon
name is used, which is almost guaranteed not to work.
(null-path-handler path req)
This path handler is useful as a default handler. It handles no requests,
always returning a "404 Not found" reply to the client.
** HTTP errors
Authors of path-handlers need to be able to handle errors in a reasonably
simple fashion. The S.U. Web server provides a set of error conditions that
correspond to the error replies in the HTTP protocol. These errors can be
raised with the HTTP-ERROR procedure. When the server runs a path handler,
it runs it in the context of an error handler that catches these errors,
sends an error reply to the client, and closes the transaction.
(http-error reply-code req [extra ...])
This raises an http error condition. The reply code is one of the
numeric HTTP error reply codes, which are bound to the variables
HTTP-REPLY/OK, HTTP-REPLY/NOT-FOUND, HTTP-REPLY/BAD-REQUEST, and so
forth. The REQ argument is the request record that caused the error.
Any following EXTRA args are passed along for informational purposes.
Different HTTP errors take different types of extra arguments. For
example, the "301 moved permanently" and "302 moved temporarily"
replies use the first two extra values as the URI: and Location: fields
in the reply header, respectively. See the clauses of the
SEND-HTTP-ERROR-REPLY procedure for details.
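As a rough illustration of how a path handler might use HTTP-ERROR, the sketch
below serves text from a small table and raises a not-found error for anything
else. The definitions of HTTP-REPLY/NOT-FOUND and HTTP-ERROR given here are
stand-ins so that the sketch runs outside the server; inside the server you
would use the real bindings, and the real HTTP-ERROR does not return to its
caller. PAGES and PAGE-HANDLER are hypothetical names, and the sketch writes
only a body; a real handler would also have to emit whatever reply header the
server expects, for which the sources are the reference.

```scheme
;; Stand-ins for illustration only. Do not load these into the real
;; server, where both names are already provided.
(define http-reply/not-found 404)
(define (http-error reply-code req . extras)
  (list 'http-error reply-code extras))

;; Hypothetical table of one-element paths and their bodies.
(define pages
  '(("about" . "About this server")
    ("news"  . "No news today")))

;; Hypothetical handler: serve a known page, otherwise raise 404.
(define (page-handler path req)
  (let ((hit (and (pair? path)
                  (null? (cdr path))
                  (assoc (car path) pages))))
    (if hit
        (display (cdr hit))                      ; reply goes to current output
        (http-error http-reply/not-found req))))

(page-handler '("missing") 'fake-request)
;; with the stand-in above => (http-error 404 ())
```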
(send-http-error-reply reply-code request [extra ...])
This procedure writes an error reply out to the current output
port. If an error occurs during this process, it is caught, and
the procedure silently returns. The http server's standard error
handler passes all http errors raised during path-handler execution
to this procedure to generate the error reply before aborting the
request transaction.
** Simple directory generation
Most path-handlers that serve files to clients eventually call an internal
procedure named FILE-SERVE, which implements a simple directory-generation
service using the following rules:
- If the filename has the *form* of a directory (i.e., it ends with a
slash), then FILE-SERVE actually looks for a file named "index.html"
in that directory.
- If the filename names a directory, but is not in directory form
(i.e., it doesn't end in a slash, as in "/usr/include" or "/usr/raj"),
then FILE-SERVE sends back a "301 moved permanently" message,
redirecting the client to a slash-terminated version of the original
URL. For example, the URL
http://clark.lcs.mit.edu/~shivers
would be redirected to
http://clark.lcs.mit.edu/~shivers/
- If the filename names a regular file, it is served to the client.
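The three rules can be summarized as a small decision procedure. This is a
sketch under stated assumptions, not the code of FILE-SERVE: both procedure
names are hypothetical, the DIRECTORY? argument is a caller-supplied predicate
standing in for a real file-system test, and the redirect target is shown as a
filename where the real server redirects to a URL.

```scheme
;; Does NAME have the form of a directory, i.e. end with a slash?
(define (ends-with-slash? name)
  (let ((n (string-length name)))
    (and (> n 0) (char=? (string-ref name (- n 1)) #\/))))

;; Return a list describing what the rules above call for.
(define (file-serve-action name directory?)
  (cond ((ends-with-slash? name)                      ; directory form
         (list 'serve-file (string-append name "index.html")))
        ((directory? name)                            ; directory, no slash
         (list 'redirect-301 (string-append name "/")))
        (else (list 'serve-file name))))              ; regular file

;; Toy predicate for the example: only /usr/include is a directory.
(define (toy-directory? name) (string=? name "/usr/include"))

(file-serve-action "/usr/include/" toy-directory?)
;; => (serve-file "/usr/include/index.html")
(file-serve-action "/usr/include" toy-directory?)
;; => (redirect-301 "/usr/include/")
(file-serve-action "/etc/motd" toy-directory?)
;; => (serve-file "/etc/motd")
```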
** Support procs
The source files contain a host of support procedures which will be of utility
to anyone writing a custom path-handler. Read the files first.
** Local customization
The httpd-core package exports a procedure:
(set-server/admin! admin-name)
which allows you to set the name of the site administrator. If you
don't set this, Olin may get unwanted mail and visit
disproportionate violence on you in return.
There is a procedure exported from the httpd-core package:
(set-my-fqdn! name)
Call this to crow-bar the server's idea of its own Internet host
name before running the server, and all will be well.
You may want this for one of several reasons. On NeXTSTEP and on
systems that do DNS via NIS/Yellow Pages, you only get an
unqualified hostname. Also, in case of aliased names, you just
might get the wrong one. Furthermore, you may get screwed in the
presence of a server accelerator such as Squid.
There is a similar procedure in httpd-core:
(set-my-port! portnum)
Call this to set the local port of your server. This may be
important to get redirection right in the presence of a web server
accelerator.
** Losing
Be aware of certain Unix problems which may require workarounds:
1. NeXTSTEP's POSIX implementation of the getpwnam() routine
will silently tell you that every user has uid 0. This means
that if your server, running as root, does a
(set-uid (user->uid "nobody"))
it will essentially do a
(set-uid 0)
and you will thus still be running as root.
The fix is to manually find out who user nobody is (he's -2 on my
system), and to hard-wire this into the server:
(set-uid -2)
This problem is NeXTSTEP specific. If you are not using NeXTSTEP,
no problem.