diff --git a/doc/html/index.html b/doc/html/index.html
new file mode 100644
index 0000000..0b8e359
--- /dev/null
+++ b/doc/html/index.html
@@ -0,0 +1,85 @@
+
+
+The Scheme Underground Network Package
+
+
+
+The Scheme Underground Network Package
+I have written a set of libraries for doing Net hacking from Scheme/scsh.
+It includes:
+
+- An smtp client library.
+    Forge mail from the comfort of your own Scheme process.
+
+- rfc822 header library
+    Read email-style headers. Useful in several contexts (smtp, http, etc.)
+
+- Simple structured HTML output library
+    Balanced delimiters, etc.
+
+- The SU Web server
+    This is a complete implementation of an HTTP 1.0 server in Scheme.
+    The server contains other standalone packages that may separately be of
+    use:
+
+    - URI and URL parsers and unparsers.
+    - A library to help writing CGI scripts in Scheme.
+    - Server extensions for interfacing to CGI scripts.
+    - Server extensions for uploading Scheme code.
+
+ The server has three main design goals:
+
+ - Extensibility
+     The server is in fact nothing but extensions, using a mechanism
+     called "path handlers" to define URL-specific services. It has a toolkit
+     of services that can be used as-is, extended or built upon.
+     User extensions have exactly the same status as the base services.
+
+     The extension mechanism allows for easy implementation of new services
+     without the overhead of the CGI interface. Since the server is written
+     on top of the Scheme shell, the full set of Unix system calls and
+     program tools is available to the implementor.
+
+ - Mobile code
+     The server allows Scheme code to be uploaded for direct execution
+     inside the server. The server has complete control over the code,
+     and can safely execute it in restricted environments that do not
+     provide access to potentially dangerous primitives (such as the
+     "delete file" procedure.)
+
+ - Clarity
+     I wrote this server to help myself understand the Web. It is voluminously
+     commented, and I hope it will prove to be an aid in understanding the
+     low-level details of the Web protocols.
+
+
+
+ The S.U. server has the ability to upload code from Web clients and
+ execute that code on behalf of the client in a protected environment.
+
+
+ Some simple documentation on the server
+ is available.
+
+
+
+Obtaining the system
+The network code is available by
+ftp.
+To run the server, you need our 0.4 release of
+scsh
+which has just been released.
+
+Beyond actually running the server,
+the separate parser libraries and other utilities may be of use as separate
+modules.
+
+Olin Shivers
+ / shivers@ai.mit.edu
+
+
+
+
+
diff --git a/doc/html/su-httpd.html b/doc/html/su-httpd.html
new file mode 100644
index 0000000..356aa37
--- /dev/null
+++ b/doc/html/su-httpd.html
@@ -0,0 +1,482 @@
+
+
+
+The Scheme Underground Web system
+
+
+
+The Scheme Underground Web System
+
+Olin Shivers
+ / shivers@ai.mit.edu
+
+July 1995
+
+
+Note: Netscape typesets description lists in a manner that makes the
+procedure descriptions below blur together, even in the absence of the
+HTML COMPACT attribute. You may just wish to print out a simple
+ASCII version of this note, instead.
+
+
+
+
+
+Introduction
+
+The
+Scheme underground
+Web system is a package of
+Scheme
+code that provides
+utilities for interacting with the
+World-Wide Web.
+This includes:
+
+- A Web server.
+- URI and URL parsers and un-parsers.
+- RFC822-style header parsers.
+- Code for performing structured html output
+- Code to assist in writing CGI Scheme programs
+  that can be used by any CGI-compliant HTTP server
+  (such as NCSA's httpd, or the S.U. Web server).
+
+
+
+The code can be obtained via
+
+anonymous ftp
+and is implemented in
+Scheme 48,
+using the system calls and support procedures of
+scsh,
+the Scheme Shell.
+The code was written to be clear and modifiable --
+it is voluminously commented and all non-R4RS dependencies are
+described at the beginning of each source file.
+
+
+I do not have the time to write detailed documentation for these packages.
+However, they are very thoroughly commented, and I strongly recommend
+reading the source files; they were written to be read, and the source
+code comments should provide a clear description of the system.
+The remainder of this note gives an overview of the server's basic
+architecture and interfaces.
+
+
+The Scheme Underground Web Server
+
+The server was designed with three principal goals in mind:
+
+- Extensibility
+    The server is designed to make it easy to extend the basic
+    functionality. In fact, the server is nothing but extensions. There is
+    no distinction between the set of basic services provided by the server
+    implementation and user extensions -- they are both implemented in
+    Scheme, and have equal status. The design is "turtles all the way down."
+
+- Mobile code
+    Because the server is written in Scheme 48, it is simple to use the
+    Scheme 48 module system to upload programs to the server for safe
+    execution within a protected, server-chosen environment. The server
+    comes with a simple example upload service to demonstrate this
+    capability.
+
+- Clarity of implementation
+    Because the server is written in a high-level language, it should make
+    for a clearer exposition of the HTTP protocol and the associated URL
+    and URI notations than one written in a low-level language such as C.
+    This also should help to make the server easy to modify and adapt to
+    different uses.
+
+
+
+Basic server structure
+
+The Web server is started by calling the httpd procedure,
+which takes one required and two optional arguments:
+
+ (httpd path-handler [port working-directory])
+
+
+The server accepts connections from the given port, which defaults to 80.
+The server runs with the working directory set to the given value,
+which defaults to
+
+ /usr/local/etc/httpd
+
+
+
+
+The server's basic loop is to wait on the port for a connection from an HTTP
+client. When it receives a connection, it reads in and parses the request into
+a special request data structure. Then the server forks a child process, which
+binds the current I/O ports to the connection socket, and then hands off to
+the top-level path handler (the first argument to httpd).
+The path-handler procedure is responsible for actually serving the request --
+it can be any arbitrary computation.
+Its output goes directly back to the HTTP client that sent the request.
+
+
+Before calling the path handler to service the request, the HTTP server
+installs an error handler that fields any uncaught error, sends an
+error reply to the client, and aborts the request transaction. Hence
+any error caused by a path-handler will be handled in a reasonable and
+robust fashion.
+
+
+The basic server loop and the associated request data structure are the fixed
+architecture of the S.U. Web server; its flexibility lies in the notion of
+path handlers.
+
+
+
+
+Path handlers
+
+A path handler is a procedure taking two arguments:
+
+ (path-handler path req)
+
+
+
+The req argument is a request record giving all the details of the
+client's request; it has the following structure:
+
+    (define-record request
+      method     ; A string such as "GET", "PUT", etc.
+      uri        ; The escaped URI string as read from request line.
+      url        ; An http URL record (see url.scm).
+      version    ; A (major . minor) integer pair.
+      headers    ; An rfc822 header alist (see rfc822.scm).
+      socket)    ; The socket connected to the client.
+
+
+The path argument is the URL's path,
+parsed and split at slashes into a string list.
+For example, if the Web client dereferences URL
+
+ http://clark.lcs.mit.edu:8001/h/shivers/code/web.tar.gz
+
+then the server would pass the following path to the top-level handler:
+
+ ("h" "shivers" "code" "web.tar.gz")
+
+
+
+The path argument's pre-parsed representation as a string list makes it easy
+for the path handler to dispatch recursively on URL paths.
+
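+As a rough illustration of this representation, the portable Scheme sketch
+below splits a path string at slashes. It is not the package's own parser
+(see url.scm for that); the name split-path is hypothetical, and the sketch
+ignores %-escapes and simply drops empty segments.
+
+    ;; Hypothetical helper: split "/a/b/c" into ("a" "b" "c").
+    ;; Empty segments (leading or doubled slashes) are dropped.
+    (define (split-path str)
+      (let loop ((chars (string->list str)) (seg '()) (segs '()))
+        ;; Push the segment collected so far (if any) onto segs.
+        (define (flush)
+          (if (null? seg)
+              segs
+              (cons (list->string (reverse seg)) segs)))
+        (cond ((null? chars) (reverse (flush)))
+              ((char=? (car chars) #\/) (loop (cdr chars) '() (flush)))
+              (else (loop (cdr chars) (cons (car chars) seg) segs)))))
+
+With this definition, (split-path "/h/shivers/code/web.tar.gz") evaluates
+to ("h" "shivers" "code" "web.tar.gz").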
+
+Path handlers can do anything they like to respond to HTTP requests; they have
+the full range of Scheme to implement the desired functionality. When
+handling HTTP requests that have an associated entity body (such as POST), the
+body should be read from the current input port. Path handlers should in all
+cases write their reply to the current output port. Path handlers should
+not perform I/O on the request record's socket.
+Path handlers are frequently called recursively, and doing I/O directly to the
+socket might bypass a filtering or other processing step interposed on the
+current I/O ports by some superior path handler.
+
+
+
+Basic path handlers
+
+Although the user can write any path-handler he likes, the S.U. server comes
+with a useful toolbox of basic path handlers that can be used and built upon:
+
+
+
+(alist-path-dispatcher ph-alist default-ph) -> path-handler
+
+ This procedure takes a string->path-handler alist, and a default
+ path handler, and returns a handler that dispatches on its path argument.
+ When the new path handler is applied to a path
+ ("foo" "bar" "baz"),
+ it uses the first element of the path -- "foo" -- to
+ index into the alist.
+ If it finds an associated path handler in the alist, it
+ hands the request off to that handler, passing it the tail of the
+ path, ("bar" "baz").
+ On the other hand, if the path is empty, or the alist search does
+ not yield a hit, we hand off to the default path handler,
+ passing it the entire original path, ("foo" "bar" "baz").
+
+
+ This procedure is how you say: "If the first element of the URL's path
+ is `foo', do X; if it's `bar', do Y; otherwise, do Z." If one takes
+ an object-oriented view of the process, an alist path-handler does
+ method lookup on the requested operation, dispatching off to the
+ appropriate method defined for the URL.
+
+
+ The slash-delimited URI path structure implies an associated
+ tree of names. The path-handler system and the alist dispatcher
+ allow you to procedurally define the server's response to any arbitrary
+ subtree of the path space.
+
+
+ Example:
+ A typical top-level path handler is
+
+
+    (define ph
+      (alist-path-dispatcher
+       `(("h"       . ,(home-dir-handler "public_html"))
+         ("cgi-bin" . ,(cgi-handler "/usr/local/etc/httpd/cgi-bin"))
+         ("seval"   . ,seval-handler))
+       (rooted-file-handler "/usr/local/etc/httpd/htdocs")))
+
+
+ This means:
+
+- If the path looks like ("h" "shivers" "code" "web.tar.gz"),
+  pass the path ("shivers" "code" "web.tar.gz") to a
+  home-directory path handler.
+
+- If the path looks like ("cgi-bin" "calendar"),
+  pass ("calendar") off to the CGI path handler.
+
+- If the path looks like ("seval" ...),
+  the tail of the path is passed off to the code-uploading seval
+  path handler.
+
+- Otherwise, the whole path is passed to a rooted file handler, which
+  will convert it into a filename, rooted at
+  /usr/local/etc/httpd/htdocs, and serve that file.
+
+
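+To make the dispatch rule concrete, here is a minimal sketch in portable
+Scheme of a dispatcher that behaves as described above. It is not the
+server's actual source; the names sketch-alist-dispatcher, tagging-handler
+and demo-ph are hypothetical, and the toy handlers return a value instead
+of writing a reply to the current output port.
+
+    ;; Sketch only: mirrors the described behaviour of alist-path-dispatcher.
+    (define (sketch-alist-dispatcher ph-alist default-ph)
+      (lambda (path req)
+        (let ((hit (and (pair? path) (assoc (car path) ph-alist))))
+          (if hit
+              ((cdr hit) (cdr path) req)     ; hit: pass the tail of the path
+              (default-ph path req)))))      ; empty path or miss: pass it all
+
+    ;; Toy handler that just reports which handler ran and on what path.
+    (define (tagging-handler tag)
+      (lambda (path req) (list tag path)))
+
+    (define demo-ph
+      (sketch-alist-dispatcher
+       (list (cons "h" (tagging-handler 'home))
+             (cons "cgi-bin" (tagging-handler 'cgi)))
+       (tagging-handler 'default)))
+
+    ;; (demo-ph '("h" "shivers" "code") #f)  =>  (home ("shivers" "code"))
+    ;; (demo-ph '("other" "x") #f)           =>  (default ("other" "x"))
+    ;; (demo-ph '() #f)                      =>  (default ())
+
+Because a dispatcher is itself a path handler, one dispatcher can appear
+as an entry in another's alist, which is how nested subtrees of the path
+space can be given their own handlers.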
+
+(home-dir-handler subdir) -> path-handler
+
+ This procedure builds a path handler that does basic file serving
+ out of home directories. If the resulting path handler is passed
+ a path of (user . file-path),
+ then it serves the file
+
+ user's-home-directory/subdir/file-path
+
+ The path handler only handles GET requests; the filename is not
+ allowed to contain .. elements.
+
+
+(tilde-home-dir-handler subdir default-path-handler) -> path-handler
+
+ This path handler examines the car of the path. If it is a string
+ beginning with a tilde, e.g., "~ziggy",
+ then the string is taken
+ to mean a home directory, and the request is served similarly to a
+ home-dir-handler path handler.
+ Otherwise, the request is passed off
+ in its entirety to the default path handler.
+
+
+ This procedure is useful for implementing servers that provide the
+ semantics of the NCSA httpd server.
+
+
+
+(cgi-handler cgi-directory) -> path-handler
+
+ This procedure returns a path-handler that passes the request off to some
+ program using the CGI interface. The script name is taken from the
+ car of the path; it is checked for occurrences of ..'s.
+ If the path is
+
+ ("my-prog" "foo" "bar")
+
+ then the program executed is
+
+ cgi-directory/my-prog
+
+
+ When the CGI path handler builds the process environment for the
+ CGI script, several elements
+ (e.g., $PATH and $SERVER_SOFTWARE)
+ are request-invariant, and can be computed at server start-up time.
+ This can be done by calling
+
+ (initialise-request-invariant-cgi-env)
+
+ when the server starts up. This is not necessary,
+ but will make CGI requests a little faster.
+
+
+(rooted-file-handler root-dir) -> path-handler
+
+ Returns a path handler that serves files from a particular root
+ in the file system. Only the GET operation is provided. The path
+ argument passed to the handler is converted into a filename,
+ and appended to root-dir.
+ The file name is checked for .. components,
+ and the transaction is aborted if it contains any. Otherwise, the file is
+ served to the client.
+
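+The path-to-filename step can be sketched in portable Scheme as follows.
+This is an illustration under stated assumptions, not the server's code:
+the name sketch-path->filename is hypothetical, and where the real handler
+aborts the transaction, the sketch merely returns #f.
+
+    ;; Hypothetical helper: #f if any component is "..",
+    ;; otherwise the string root-dir/a/b/c for the path ("a" "b" "c").
+    (define (sketch-path->filename root-dir path)
+      (if (member ".." path)
+          #f
+          (let loop ((rest path) (name root-dir))
+            (if (null? rest)
+                name
+                (loop (cdr rest)
+                      (string-append name "/" (car rest)))))))
+
+    ;; (sketch-path->filename "/usr/local/etc/httpd/htdocs" '("a" "b.html"))
+    ;;   =>  "/usr/local/etc/httpd/htdocs/a/b.html"
+    ;; (sketch-path->filename "/usr/local/etc/httpd/htdocs" '(".." "passwd"))
+    ;;   =>  #f
+
+Checking the already-split path list, rather than the joined string, means
+a component such as "..foo" is not mistaken for a parent-directory step.
+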
+(null-path-handler path req)
+
+ This path handler is useful as a default handler. It handles no requests,
+ always returning a "404 Not found" reply to the client.
+
+
+
+
+HTTP errors
+
+Authors of path-handlers need to be able to handle errors in a reasonably
+simple fashion. The S.U. Web server provides a set of error conditions that
+correspond to the error replies in the HTTP protocol. These errors can be
+raised with the http-error procedure.
+When the server runs a path handler,
+it runs it in the context of an error handler that catches these errors,
+sends an error reply to the client, and closes the transaction.
+
+
+
+(http-error reply-code req [extra ...])
+
+ This raises an http error condition. The reply code is one of the
+ numeric HTTP error reply codes, which are bound to the variables
+ http-reply/ok, http-reply/not-found,
+ http-reply/bad-request, and so
+ forth. The req argument is the request record that caused
+ the error.
+ Any following extra args are passed along for
+ informational purposes.
+ Different HTTP errors take different types of extra arguments.
+ For example, the "301 moved permanently" and "302 moved temporarily"
+ replies use the first two extra values as the
+ URI: and Location:
+ fields in the reply header, respectively. See the clauses of the
+ send-http-error-reply procedure for details.
+
+
+(send-http-error-reply reply-code request [extra ...])
+
+ This procedure writes an error reply out to the current output
+ port. If an error occurs during this process, it is caught, and
+ the procedure silently returns. The http server's standard error
+ handler passes all http errors raised during path-handler execution
+ to this procedure to generate the error reply before aborting the
+ request transaction.
+
+
+
+Simple directory generation
+
+Most path-handlers that serve files to clients eventually call an internal
+procedure named file-serve,
+which implements a simple directory-generation service using the
+following rules:
+
+
+
+
+Support procs
+
+The source files contain a host of support procedures which will be of utility
+to anyone writing a custom path-handler. Read the files first.
+
+
+
+Losing
+
+Be aware of two Unix problems, which may require workarounds:
+
+
+- NeXTSTEP's Posix implementation of the getpwnam() routine
+  will silently tell you that every user has uid 0. This means
+  that if your server, running as root, does a
+
+ (set-uid (user->uid "nobody"))
+
+ it will essentially do a
+
+ (set-uid 0)
+
+ and you will thus still be running as root.
+
+
+ The fix is to manually find out who user nobody is (he's -2 on my
+ system), and to hard-wire this into the server:
+
+ (set-uid -2)
+
+ This problem is NeXTSTEP specific. If you are not using NeXTSTEP,
+ no problem.
+
+
+- On NeXTSTEP, the ip-address->host-name translation routine
+  (in C, gethostbyaddr(); in scsh, (host-info addr)) does not
+  use the DNS system; it goes through NeXT's proprietary Netinfo
+  system, and may not return a fully-qualified domain name. For
+  example, on my system, I get "amelia-earhart", when I want
+  "amelia-earhart.lcs.mit.edu". Since the server uses this name
+  to construct redirection URLs to be sent back to the Web client,
+  the name needs to be an FQDN.
+
+
+ This problem may occur on other OS's;
+ I cannot determine if gethostbyaddr()
+ is required to return a FQDN or not. (I would appreciate hearing the
+ answer if you know; my local Internet gurus couldn't tell me.)
+
+
+ If your system doesn't give you a complete Internet address when
+ you say
+
+ (host-info:name (host-info (system-name)))
+
+ then you have this problem.
+
+
+ The server has a workaround. There is a procedure exported from
+ the httpd-core package:
+
+ (set-my-fqdn name)
+
+ Call this to crow-bar the server's idea of its own Internet host name
+ before running the server, and all will be well.
+
+
+
+
diff --git a/doc/rfc2396.txt b/doc/rfc2396.txt
new file mode 100644
index 0000000..5bd5211
--- /dev/null
+++ b/doc/rfc2396.txt
@@ -0,0 +1,2243 @@
+
+
+
+
+
+
+Network Working Group T. Berners-Lee
+Request for Comments: 2396 MIT/LCS
+Updates: 1808, 1738 R. Fielding
+Category: Standards Track U.C. Irvine
+ L. Masinter
+ Xerox Corporation
+ August 1998
+
+
+ Uniform Resource Identifiers (URI): Generic Syntax
+
+Status of this Memo
+
+ This document specifies an Internet standards track protocol for the
+ Internet community, and requests discussion and suggestions for
+ improvements. Please refer to the current edition of the "Internet
+ Official Protocol Standards" (STD 1) for the standardization state
+ and status of this protocol. Distribution of this memo is unlimited.
+
+Copyright Notice
+
+ Copyright (C) The Internet Society (1998). All Rights Reserved.
+
+IESG Note
+
+ This paper describes a "superset" of operations that can be applied
+ to URI. It consists of both a grammar and a description of basic
+ functionality for URI. To understand what is a valid URI, both the
+ grammar and the associated description have to be studied. Some of
+ the functionality described is not applicable to all URI schemes, and
+ some operations are only possible when certain media types are
+ retrieved using the URI, regardless of the scheme used.
+
+Abstract
+
+ A Uniform Resource Identifier (URI) is a compact string of characters
+ for identifying an abstract or physical resource. This document
+ defines the generic syntax of URI, including both absolute and
+ relative forms, and guidelines for their use; it revises and replaces
+ the generic definitions in RFC 1738 and RFC 1808.
+
+ This document defines a grammar that is a superset of all valid URI,
+ such that an implementation can parse the common components of a URI
+ reference without knowing the scheme-specific requirements of every
+ possible identifier type. This document does not define a generative
+ grammar for URI; that task will be performed by the individual
+ specifications of each URI scheme.
+
+
+
+
+Berners-Lee, et. al. Standards Track [Page 1]
+
+RFC 2396 URI Generic Syntax August 1998
+
+
+1. Introduction
+
+ Uniform Resource Identifiers (URI) provide a simple and extensible
+ means for identifying a resource. This specification of URI syntax
+ and semantics is derived from concepts introduced by the World Wide
+ Web global information initiative, whose use of such objects dates
+ from 1990 and is described in "Universal Resource Identifiers in WWW"
+ [RFC1630]. The specification of URI is designed to meet the
+ recommendations laid out in "Functional Recommendations for Internet
+ Resource Locators" [RFC1736] and "Functional Requirements for Uniform
+ Resource Names" [RFC1737].
+
+ This document updates and merges "Uniform Resource Locators"
+ [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order
+ to define a single, generic syntax for all URI. It excludes those
+ portions of RFC 1738 that defined the specific syntax of individual
+ URL schemes; those portions will be updated as separate documents, as
+ will the process for registration of new URI schemes. This document
+ does not discuss the issues and recommendation for dealing with
+ characters outside of the US-ASCII character set [ASCII]; those
+ recommendations are discussed in a separate document.
+
+ All significant changes from the prior RFCs are noted in Appendix G.
+
+1.1 Overview of URI
+
+ URI are characterized by the following definitions:
+
+ Uniform
+ Uniformity provides several benefits: it allows different types
+ of resource identifiers to be used in the same context, even
+ when the mechanisms used to access those resources may differ;
+ it allows uniform semantic interpretation of common syntactic
+ conventions across different types of resource identifiers; it
+ allows introduction of new types of resource identifiers
+ without interfering with the way that existing identifiers are
+ used; and, it allows the identifiers to be reused in many
+ different contexts, thus permitting new applications or
+ protocols to leverage a pre-existing, large, and widely-used
+ set of resource identifiers.
+
+ Resource
+ A resource can be anything that has identity. Familiar
+ examples include an electronic document, an image, a service
+ (e.g., "today's weather report for Los Angeles"), and a
+ collection of other resources. Not all resources are network
+ "retrievable"; e.g., human beings, corporations, and bound
+ books in a library can also be considered resources.
+
+ The resource is the conceptual mapping to an entity or set of
+ entities, not necessarily the entity which corresponds to that
+ mapping at any particular instance in time. Thus, a resource
+ can remain constant even when its content---the entities to
+ which it currently corresponds---changes over time, provided
+ that the conceptual mapping is not changed in the process.
+
+ Identifier
+ An identifier is an object that can act as a reference to
+ something that has identity. In the case of URI, the object is
+ a sequence of characters with a restricted syntax.
+
+ Having identified a resource, a system may perform a variety of
+ operations on the resource, as might be characterized by such words
+ as `access', `update', `replace', or `find attributes'.
+
+1.2. URI, URL, and URN
+
+ A URI can be further classified as a locator, a name, or both. The
+ term "Uniform Resource Locator" (URL) refers to the subset of URI
+ that identify resources via a representation of their primary access
+ mechanism (e.g., their network "location"), rather than identifying
+ the resource by name or by some other attribute(s) of that resource.
+ The term "Uniform Resource Name" (URN) refers to the subset of URI
+ that are required to remain globally unique and persistent even when
+ the resource ceases to exist or becomes unavailable.
+
+ The URI scheme (Section 3.1) defines the namespace of the URI, and
+ thus may further restrict the syntax and semantics of identifiers
+ using that scheme. This specification defines those elements of the
+ URI syntax that are either required of all URI schemes or are common
+ to many URI schemes. It thus defines the syntax and semantics that
+ are needed to implement a scheme-independent parsing mechanism for
+ URI references, such that the scheme-dependent handling of a URI can
+ be postponed until the scheme-dependent semantics are needed. We use
+ the term URL below when describing syntax or semantics that only
+ apply to locators.
+
+ Although many URL schemes are named after protocols, this does not
+ imply that the only way to access the URL's resource is via the named
+ protocol. Gateways, proxies, caches, and name resolution services
+ might be used to access some resources, independent of the protocol
+ of their origin, and the resolution of some URL may require the use
+ of more than one protocol (e.g., both DNS and HTTP are typically used
+ to access an "http" URL's resource when it can't be found in a local
+ cache).
+
+ A URN differs from a URL in that its primary purpose is persistent
+ labeling of a resource with an identifier. That identifier is drawn
+ from one of a set of defined namespaces, each of which has its own
+ set name structure and assignment procedures. The "urn" scheme has
+ been reserved to establish the requirements for a standardized URN
+ namespace, as defined in "URN Syntax" [RFC2141] and its related
+ specifications.
+
+ Most of the examples in this specification demonstrate URL, since
+ they allow the most varied use of the syntax and often have a
+ hierarchical namespace. A parser of the URI syntax is capable of
+ parsing both URL and URN references as a generic URI; once the scheme
+ is determined, the scheme-specific parsing can be performed on the
+ generic URI components. In other words, the URI syntax is a superset
+ of the syntax of all URI schemes.
+
+1.3. Example URI
+
+ The following examples illustrate URI that are in common use.
+
+ ftp://ftp.is.co.za/rfc/rfc1808.txt
+ -- ftp scheme for File Transfer Protocol services
+
+ gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles
+ -- gopher scheme for Gopher and Gopher+ Protocol services
+
+ http://www.math.uio.no/faq/compression-faq/part1.html
+ -- http scheme for Hypertext Transfer Protocol services
+
+ mailto:mduerst@ifi.unizh.ch
+ -- mailto scheme for electronic mail addresses
+
+ news:comp.infosystems.www.servers.unix
+ -- news scheme for USENET news groups and articles
+
+ telnet://melvyl.ucop.edu/
+ -- telnet scheme for interactive services via the TELNET Protocol
+
+1.4. Hierarchical URI and Relative Forms
+
+ An absolute identifier refers to a resource independent of the
+ context in which the identifier is used. In contrast, a relative
+ identifier refers to a resource by describing the difference within a
+ hierarchical namespace between the current context and an absolute
+ identifier of the resource.
+
+ Some URI schemes support a hierarchical naming system, where the
+ hierarchy of the name is denoted by a "/" delimiter separating the
+ components in the scheme. This document defines a scheme-independent
+ `relative' form of URI reference that can be used in conjunction with
+ a `base' URI (of a hierarchical scheme) to produce another URI. The
+ syntax of hierarchical URI is described in Section 3; the relative
+ URI calculation is described in Section 5.
+
+1.5. URI Transcribability
+
+ The URI syntax was designed with global transcribability as one of
+ its main concerns. A URI is a sequence of characters from a very
+ limited set, i.e. the letters of the basic Latin alphabet, digits,
+ and a few special characters. A URI may be represented in a variety
+ of ways: e.g., ink on paper, pixels on a screen, or a sequence of
+ octets in a coded character set. The interpretation of a URI depends
+ only on the characters used and not how those characters are
+ represented in a network protocol.
+
+ The goal of transcribability can be described by a simple scenario.
+ Imagine two colleagues, Sam and Kim, sitting in a pub at an
+ international conference and exchanging research ideas. Sam asks Kim
+ for a location to get more information, so Kim writes the URI for the
+ research site on a napkin. Upon returning home, Sam takes out the
+ napkin and types the URI into a computer, which then retrieves the
+ information to which Kim referred.
+
+ There are several design concerns revealed by the scenario:
+
+ o A URI is a sequence of characters, which is not always
+ represented as a sequence of octets.
+
+ o A URI may be transcribed from a non-network source, and thus
+ should consist of characters that are most likely to be able to
+ be typed into a computer, within the constraints imposed by
+ keyboards (and related input devices) across languages and
+ locales.
+
+ o A URI often needs to be remembered by people, and it is easier
+ for people to remember a URI when it consists of meaningful
+ components.
+
+ These design concerns are not always in alignment. For example, it
+ is often the case that the most meaningful name for a URI component
+ would require characters that cannot be typed into some systems. The
+ ability to transcribe the resource identifier from one medium to
+ another was considered more important than having its URI consist of
+ the most meaningful of components. In local and regional contexts
+ and with improving technology, users might benefit from being able to
+ use a wider range of characters; such use is not defined in this
+ document.
+
+1.6. Syntax Notation and Common Elements
+
+ This document uses two conventions to describe and define the syntax
+ for URI. The first, called the layout form, is a general description
+ of the order of components and component separators, as in
+
+ <first>/<second>;<third>?<fourth>
+
+ The component names are enclosed in angle-brackets and any characters
+ outside angle-brackets are literal separators. Whitespace should be
+ ignored. These descriptions are used informally and do not define
+ the syntax requirements.
+
+ The second convention is a BNF-like grammar, used to define the
+ formal URI syntax. The grammar is that of [RFC822], except that "|"
+ is used to designate alternatives. Briefly, rules are separated from
+ definitions by an equal "=", indentation is used to continue a rule
+ definition over more than one line, literals are quoted with "",
+ parentheses "(" and ")" are used to group elements, optional elements
+ are enclosed in "[" and "]" brackets, and elements may be preceded
+ with <n>* to designate n or more repetitions of the following
+ element; n defaults to 0.
+
+ Unlike many specifications that use a BNF-like grammar to define the
+ bytes (octets) allowed by a protocol, the URI grammar is defined in
+ terms of characters. Each literal in the grammar corresponds to the
+ character it represents, rather than to the octet encoding of that
+ character in any particular coded character set. How a URI is
+ represented in terms of bits and bytes on the wire is dependent upon
+ the character encoding of the protocol used to transport it, or the
+ charset of the document which contains it.
+
+ The following definitions are common to many elements:
+
+ alpha = lowalpha | upalpha
+
+ lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
+ "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
+ "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
+
+ upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
+ "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
+ "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
+
+ digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
+ "8" | "9"
+
+ alphanum = alpha | digit
+
+ The complete URI syntax is collected in Appendix A.
+
+2. URI Characters and Escape Sequences
+
+ URI consist of a restricted set of characters, primarily chosen to
+ aid transcribability and usability both in computer systems and in
+ non-computer communications. Characters used conventionally as
+ delimiters around URI were excluded. The restricted set of
+ characters consists of digits, letters, and a few graphic symbols
+ chosen from those common to most of the character encodings and
+ input facilities available to Internet users.
+
+ uric = reserved | unreserved | escaped
+
+ Within a URI, characters are either used as delimiters, or to
+ represent strings of data (octets) within the delimited portions.
+ Octets are either represented directly by a character (using the US-
+ ASCII character for that octet [ASCII]) or by an escape encoding.
+ This representation is elaborated below.
+
+2.1. URI and Non-ASCII Characters
+
+ The relationship between URI and characters has been a source of
+ confusion for characters that are not part of US-ASCII. To describe
+ the relationship, it is useful to distinguish between a "character"
+ (as a distinguishable semantic entity) and an "octet" (an 8-bit
+ byte). There are two mappings, one from URI characters to octets, and
+ a second from octets to original characters:
+
+ URI character sequence->octet sequence->original character sequence
+
+ A URI is represented as a sequence of characters, not as a sequence
+ of octets. That is because URI might be "transported" by means that
+ are not through a computer network, e.g., printed on paper, read over
+ the radio, etc.
+
+ A URI scheme may define a mapping from URI characters to octets;
+ whether this is done depends on the scheme. Commonly, within a
+ delimited component of a URI, a sequence of characters may be used to
+ represent a sequence of octets. For example, the character "a"
+ represents the octet 97 (decimal), while the character sequence "%",
+ "0", "a" represents the octet 10 (decimal).
+
+ There is a second translation for some resources: the sequence of
+ octets defined by a component of the URI is subsequently used to
+ represent a sequence of characters. A 'charset' defines this mapping.
+ There are many charsets in use in Internet protocols. For example,
+ UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences
+ of characters in the repertoire of ISO 10646.
+
+ In the simplest case, the original character sequence contains only
+ characters that are defined in US-ASCII, and the two levels of
+ mapping are simple and easily invertible: each 'original character'
+ is represented as the octet for the US-ASCII code for it, which is,
+ in turn, represented as either the US-ASCII character, or else the
+ "%" escape sequence for that octet.
+
+ For original character sequences that contain non-ASCII characters,
+ however, the situation is more difficult. Internet protocols that
+ transmit octet sequences intended to represent character sequences
+ are expected to provide some way of identifying the charset used, if
+ there might be more than one [RFC2277]. However, there is currently
+ no provision within the generic URI syntax to accomplish this
+ identification. An individual URI scheme may require a single
+ charset, define a default charset, or provide a way to indicate the
+ charset used.
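+
+ To illustrate the two mappings, the following Python sketch unescapes
+ one URI character sequence to octets and then decodes those octets
+ under two different charsets. The choice of UTF-8 and ISO-8859-1 is
+ an assumption made for the example; the generic syntax does not
+ select a charset.
+

```python
from urllib.parse import unquote_to_bytes

# First mapping: URI characters -> octets.
octets = unquote_to_bytes("caf%C3%A9")

# Second mapping: octets -> original characters. The result depends
# entirely on which charset the reader assumes.
as_utf8 = octets.decode("utf-8")
as_latin1 = octets.decode("iso-8859-1")

print(octets)
print(as_utf8)
print(as_latin1)
```

+
+ The same five octets yield a four-character string under UTF-8 and a
+ five-character string under ISO-8859-1, which is why the charset has
+ to be known from somewhere outside the generic syntax.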
+
+ It is expected that a systematic treatment of character encoding
+ within URI will be developed as a future modification of this
+ specification.
+
+2.2. Reserved Characters
+
+ Many URI include components consisting of, or delimited by, certain
+ special characters. These characters are called "reserved", since
+ their usage within the URI component is limited to their reserved
+ purpose. If the data for a URI component would conflict with the
+ reserved purpose, then the conflicting data must be escaped before
+ forming the URI.
+
+ reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
+ "$" | ","
+
+ The "reserved" syntax class above refers to those characters that are
+ allowed within a URI, but which may not be allowed within a
+ particular component of the generic URI syntax; they are used as
+ delimiters of the components described in Section 3.
+
+ Characters in the "reserved" set are not reserved in all contexts.
+ The set of characters actually reserved within any given URI
+ component is defined by that component. In general, a character is
+ reserved if the semantics of the URI changes if the character is
+ replaced with its escaped US-ASCII encoding.
+
+2.3. Unreserved Characters
+
+ Data characters that are allowed in a URI but do not have a reserved
+ purpose are called unreserved. These include upper and lower case
+ letters, decimal digits, and a limited set of punctuation marks and
+ symbols.
+
+ unreserved = alphanum | mark
+
+ mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
+
+ Unreserved characters can be escaped without changing the semantics
+ of the URI, but this should not be done unless the URI is being used
+ in a context that does not allow the unescaped character to appear.
+
+2.4. Escape Sequences
+
+ Data must be escaped if it does not have a representation using an
+ unreserved character; this includes data that does not correspond to
+ a printable character of the US-ASCII coded character set, or that
+ corresponds to any US-ASCII character that is disallowed, as
+ explained below.
+
+2.4.1. Escaped Encoding
+
+ An escaped octet is encoded as a character triplet, consisting of the
+ percent character "%" followed by the two hexadecimal digits
+ representing the octet code. For example, "%20" is the escaped
+ encoding for the US-ASCII space character.
+
+ escaped = "%" hex hex
+ hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
+ "a" | "b" | "c" | "d" | "e" | "f"
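+
+ As a minimal sketch of the "escaped" rule, the Python functions below
+ encode octets as "%" hex hex triplets and decode them again. The
+ names escape_octets and unescape are hypothetical, and the encoder
+ makes the conservative assumption that every octet outside the
+ "unreserved" set of Section 2.3 is to be escaped, which suits data
+ placed inside a single component.
+

```python
import string

# alphanum | mark, as defined in Section 2.3
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")


def escape_octets(data: bytes) -> str:
    """Escape every octet that is not an unreserved US-ASCII character."""
    out = []
    for octet in data:
        ch = chr(octet)
        out.append(ch if ch in UNRESERVED else "%{:02X}".format(octet))
    return "".join(out)


def unescape(text: str) -> bytes:
    """Decode "%" hex hex triplets; other characters map to their octet."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "%":
            pair = text[i + 1:i + 3]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("malformed escape at offset %d" % i)
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(ord(text[i]))  # assumes US-ASCII input
            i += 1
    return bytes(out)


print(escape_octets(b"a b"))  # a%20b
print(unescape("%0a"))        # the single octet 10 (decimal)
```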
+
+2.4.2. When to Escape and Unescape
+
+ A URI is always in an "escaped" form, since escaping or unescaping a
+ completed URI might change its semantics. Normally, the only time
+ escape encodings can safely be made is when the URI is being created
+ from its component parts; each component may have its own set of
+ characters that are reserved, so only the mechanism responsible for
+ generating or interpreting that component can determine whether or
+ not escaping a character will change its semantics. Likewise, a URI
+ must be separated into its components before the escaped characters
+ within those components can be safely decoded.
+
+ In some cases, data that could be represented by an unreserved
+ character may appear escaped; for example, some of the unreserved
+ "mark" characters are automatically escaped by some systems. If the
+ given URI scheme defines a canonicalization algorithm, then
+ unreserved characters may be unescaped according to that algorithm.
+ For example, "%7e" is sometimes used instead of "~" in an http URL
+ path, but the two are equivalent for an http URL.
+
+ Because the percent "%" character always has the reserved purpose of
+ being the escape indicator, it must be escaped as "%25" in order to
+ be used as data within a URI. Implementers should be careful not to
+ escape or unescape the same string more than once, since unescaping
+ an already unescaped string might lead to misinterpreting a percent
+ data character as another escaped character, or vice versa in the
+ case of escaping an already escaped string.
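+
+ The hazard of escaping or unescaping twice can be seen with Python's
+ standard urllib.parse functions. The sample data string is an
+ assumption chosen so that the literal data itself contains the
+ characters "%41".
+

```python
from urllib.parse import quote, unquote

data = "100%41"  # literal data; the "%" here is data, not an escape

escaped_once = quote(data, safe="")           # "%" becomes "%25"
escaped_twice = quote(escaped_once, safe="")  # "%25" becomes "%2525"

unescaped_once = unquote(escaped_once)        # recovers the data
unescaped_twice = unquote(unescaped_once)     # misreads "%41" as "A"

print(escaped_once)
print(escaped_twice)
print(unescaped_once)
print(unescaped_twice)
```

+
+ Only the pairing of one escape with one unescape round-trips the
+ data; either operation applied a second time changes its meaning.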
+
+2.4.3. Excluded US-ASCII Characters
+
+ Although they are disallowed within the URI syntax, we include here a
+ description of those US-ASCII characters that have been excluded and
+ the reasons for their exclusion.
+
+ The control characters in the US-ASCII coded character set are not
+ used within a URI, both because they are non-printable and because
+ they are likely to be misinterpreted by some control mechanisms.
+
+ control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
+
+ The space character is excluded because significant spaces may
+ disappear and insignificant spaces may be introduced when URI are
+ transcribed or typeset or subjected to the treatment of word-
+ processing programs. Whitespace is also used to delimit URI in many
+ contexts.
+
+ space = <US-ASCII coded character 20 hexadecimal>
+
+ The angle-bracket "<" and ">" and double-quote (") characters are
+ excluded because they are often used as the delimiters around URI in
+ text documents and protocol fields. The character "#" is excluded
+ because it is used to delimit a URI from a fragment identifier in URI
+ references (Section 4). The percent character "%" is excluded because
+ it is used for the encoding of escaped characters.
+
+ delims = "<" | ">" | "#" | "%" | <">
+
+ Other characters are excluded because gateways and other transport
+ agents are known to sometimes modify such characters, or they are
+ used as delimiters.
+
+ unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
+
+ Data corresponding to excluded characters must be escaped in order to
+ be properly represented within a URI.
+
+3. URI Syntactic Components
+
+ The URI syntax is dependent upon the scheme. In general, absolute
+ URI are written as follows:
+
+ <scheme>:<scheme-specific-part>
+
+ An absolute URI contains the name of the scheme being used (<scheme>)
+ followed by a colon (":") and then a string (the
+ <scheme-specific-part>) whose interpretation depends on the scheme.
+
+ The URI syntax does not require that the scheme-specific-part have
+ any general structure or set of semantics which is common among all
+ URI. However, a subset of URI do share a common syntax for
+ representing hierarchical relationships within the namespace. This
+ "generic URI" syntax consists of a sequence of four main components:
+
+ <scheme>://<authority><path>?<query>
+
+ each of which, except <scheme>, may be absent from a particular URI.
+ For example, some URI schemes do not allow an <authority> component,
+ and others do not use a <query> component.
+
+ absoluteURI = scheme ":" ( hier_part | opaque_part )
+
+ URI that are hierarchical in nature use the slash "/" character for
+ separating hierarchical components. For some file systems, a "/"
+ character (used to denote the hierarchical structure of a URI) is the
+ delimiter used to construct a file name hierarchy, and thus the URI
+ path will look similar to a file pathname. This does NOT imply that
+ the resource is a file or that the URI maps to an actual filesystem
+ pathname.
+
+ hier_part = ( net_path | abs_path ) [ "?" query ]
+
+ net_path = "//" authority [ abs_path ]
+
+ abs_path = "/" path_segments
+
+ URI that do not make use of the slash "/" character for separating
+ hierarchical components are considered opaque by the generic URI
+ parser.
+
+ opaque_part = uric_no_slash *uric
+
+ uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
+ "&" | "=" | "+" | "$" | ","
+
+ We use the term <path> to refer to both the <abs_path> and
+ <opaque_part> constructs, since they are mutually exclusive for any
+ given URI and can be parsed as a single component.
+
+3.1. Scheme Component
+
+ Just as there are many different methods of access to resources,
+ there are a variety of schemes for identifying such resources. The
+ URI syntax consists of a sequence of components separated by reserved
+ characters, with the first component defining the semantics for the
+ remainder of the URI string.
+
+ Scheme names consist of a sequence of characters beginning with a
+ lower case letter and followed by any combination of lower case
+ letters, digits, plus ("+"), period ("."), or hyphen ("-"). For
+ resiliency, programs interpreting URI should treat upper case letters
+ as equivalent to lower case in scheme names (e.g., allow "HTTP" as
+ well as "http").
+
+ scheme = alpha *( alpha | digit | "+" | "-" | "." )
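+
+ The "scheme" rule translates directly into a regular expression. The
+ following Python sketch checks a candidate name against it and folds
+ upper case to lower case, as recommended above for resiliency; the
+ function name normalize_scheme is hypothetical.
+

```python
import re

# scheme = alpha *( alpha | digit | "+" | "-" | "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def normalize_scheme(name):
    """Return the lower-cased scheme name, or None if it is not valid."""
    if SCHEME.fullmatch(name) is None:
        return None
    return name.lower()


print(normalize_scheme("HTTP"))   # http
print(normalize_scheme("1http"))  # None
```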
+
+ Relative URI references are distinguished from absolute URI in that
+ they do not begin with a scheme name. Instead, the scheme is
+ inherited from the base URI, as described in Section 5.2.
+
+3.2. Authority Component
+
+ Many URI schemes include a top hierarchical element for a naming
+ authority, such that the namespace defined by the remainder of the
+ URI is governed by that authority. This authority component is
+ typically defined by an Internet-based server or a scheme-specific
+ registry of naming authorities.
+
+ authority = server | reg_name
+
+ The authority component is preceded by a double slash "//" and is
+ terminated by the next slash "/", question-mark "?", or by the end of
+ the URI. Within the authority component, the characters ";", ":",
+ "@", "?", and "/" are reserved.
+
+ An authority component is not required for a URI scheme to make use
+ of relative references. A base URI without an authority component
+ implies that any relative reference will also be without an authority
+ component.
+
+3.2.1. Registry-based Naming Authority
+
+ The structure of a registry-based naming authority is specific to the
+ URI scheme, but constrained to the allowed characters for an
+ authority component.
+
+ reg_name = 1*( unreserved | escaped | "$" | "," |
+ ";" | ":" | "@" | "&" | "=" | "+" )
+
+3.2.2. Server-based Naming Authority
+
+ URL schemes that involve the direct use of an IP-based protocol to a
+ specified server on the Internet use a common syntax for the server
+ component of the URI's scheme-specific data:
+
+ <userinfo>@<host>:<port>
+
+ where <userinfo> may consist of a user name and, optionally,
+ scheme-specific information about how to gain authorization to access
+ the server. The parts "<userinfo>@" and ":<port>" may be omitted.
+
+ server = [ [ userinfo "@" ] hostport ]
+
+ The user information, if present, is followed by a commercial at-sign
+ "@".
+
+ userinfo = *( unreserved | escaped |
+ ";" | ":" | "&" | "=" | "+" | "$" | "," )
+
+ Some URL schemes use the format "user:password" in the userinfo
+ field. This practice is NOT RECOMMENDED, because the passing of
+ authentication information in clear text (such as URI) has proven to
+ be a security risk in almost every case where it has been used.
+
+ The host is a domain name of a network host, or its IPv4 address as a
+ set of four decimal digit groups separated by ".". Literal IPv6
+ addresses are not supported.
+
+ hostport = host [ ":" port ]
+ host = hostname | IPv4address
+ hostname = *( domainlabel "." ) toplabel [ "." ]
+ domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
+ toplabel = alpha | alpha *( alphanum | "-" ) alphanum
+ IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
+ port = *digit
+
+ Hostnames take the form described in Section 3 of [RFC1034] and
+ Section 2.1 of [RFC1123]: a sequence of domain labels separated by
+ ".", each domain label starting and ending with an alphanumeric
+ character and possibly also containing "-" characters. The rightmost
+ domain label of a fully qualified domain name will never start with a
+ digit, thus syntactically distinguishing domain names from IPv4
+ addresses, and may be followed by a single "." if it is necessary to
+ distinguish between the complete domain name and any local domain.
+ To actually be "Uniform" as a resource locator, a URL hostname should
+ be a fully qualified domain name. In practice, however, the host
+ component may be a local domain literal.
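+
+ As a sketch of how the "hostname" and "IPv4address" rules separate
+ the two forms, the Python code below transcribes both into regular
+ expressions. The function name classify_host and its return values
+ are hypothetical; only the grammar itself comes from this section.
+

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# toplabel = alpha | alpha *( alphanum | "-" ) alphanum
TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# hostname = *( domainlabel "." ) toplabel [ "." ]
HOSTNAME = re.compile(rf"(?:{DOMAINLABEL}\.)*{TOPLABEL}\.?")
# IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
IPV4ADDRESS = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def classify_host(host):
    """Return "hostname", "IPv4address", or None if neither rule matches."""
    if HOSTNAME.fullmatch(host):
        return "hostname"
    if IPV4ADDRESS.fullmatch(host):
        return "IPv4address"
    return None


print(classify_host("www.example.com"))  # hostname
print(classify_host("10.0.0.1"))         # IPv4address
```

+
+ Because a toplabel must begin with a letter, no string can match both
+ rules, so the order of the two tests does not matter.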
+
+ Note: A suitable representation for including a literal IPv6
+ address as the host part of a URL is desired, but has not yet been
+ determined or implemented in practice.
+
+ The port is the network port number for the server. Most schemes
+ designate protocols that have a default port number. Another port
+ number may optionally be supplied, in decimal, separated from the
+ host by a colon. If the port is omitted, the default port number is
+ assumed.
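+
+ The server component can be taken apart with two splits, as in the
+ following Python sketch. The function name split_server is
+ hypothetical, and None is used here, by assumption, to mark a part
+ whose delimiter is absent. The "@" is located first because userinfo
+ may itself contain a colon.
+

```python
def split_server(server):
    """Split a server component into (userinfo, host, port)."""
    # server = [ [ userinfo "@" ] hostport ]
    userinfo, at, hostport = server.rpartition("@")
    if not at:
        userinfo = None
    # hostport = host [ ":" port ]
    host, colon, port = hostport.partition(":")
    if not colon:
        port = None
    elif any(c not in "0123456789" for c in port):
        # port = *digit (an empty port is allowed by the grammar)
        raise ValueError("port must consist of decimal digits")
    return userinfo, host, port


print(split_server("user@example.com:8080"))
print(split_server("example.com"))
```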
+
+3.3. Path Component
+
+ The path component contains data, specific to the authority (or the
+ scheme if there is no authority component), identifying the resource
+ within the scope of that scheme and authority.
+
+ path = [ abs_path | opaque_part ]
+
+ path_segments = segment *( "/" segment )
+ segment = *pchar *( ";" param )
+ param = *pchar
+
+ pchar = unreserved | escaped |
+ ":" | "@" | "&" | "=" | "+" | "$" | ","
+
+ The path may consist of a sequence of path segments separated by a
+ single slash "/" character. Within a path segment, the characters
+ "/", ";", "=", and "?" are reserved. Each path segment may include a
+ sequence of parameters, indicated by the semicolon ";" character.
+ The parameters are not significant to the parsing of relative
+ references.
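+
+ A short Python sketch of the "path_segments" rule follows. It splits
+ an abs_path on "/" and then splits each segment on ";" to separate
+ its parameters. The function name and the (name, params) tuple layout
+ are assumptions made for illustration.
+

```python
def split_path_segments(abs_path):
    """Split an abs_path into a list of (name, [param, ...]) tuples."""
    if not abs_path.startswith("/"):
        raise ValueError("abs_path must begin with a slash")
    result = []
    # path_segments = segment *( "/" segment )
    for segment in abs_path[1:].split("/"):
        # segment = *pchar *( ";" param )
        name, *params = segment.split(";")
        result.append((name, params))
    return result


print(split_path_segments("/b/c/d;p"))
```

+
+ A trailing slash produces a final empty segment, which the grammar
+ permits since a segment may contain zero characters.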
+
+3.4. Query Component
+
+ The query component is a string of information to be interpreted by
+ the resource.
+
+ query = *uric
+
+ Within a query component, the characters ";", "/", "?", ":", "@",
+ "&", "=", "+", ",", and "$" are reserved.
+
+4. URI References
+
+ The term "URI-reference" is used here to denote the common usage of a
+ resource identifier. A URI reference may be absolute or relative,
+ and may have additional information attached in the form of a
+ fragment identifier. However, "the URI" that results from such a
+ reference includes only the absolute URI after the fragment
+ identifier (if any) is removed and after any relative URI is resolved
+ to its absolute form. Although it is possible to limit the
+ discussion of URI syntax and semantics to that of the absolute
+ result, most usage of URI is within general URI references, and it is
+ impossible to obtain the URI from such a reference without also
+ parsing the fragment and resolving the relative form.
+
+ URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
+
+ The syntax for relative URI is a shortened form of that for absolute
+ URI, where some prefix of the URI is missing and certain path
+ components ("." and "..") have a special meaning when, and only when,
+ interpreting a relative path. The relative URI syntax is defined in
+ Section 5.
+
+4.1. Fragment Identifier
+
+ When a URI reference is used to perform a retrieval action on the
+ identified resource, the optional fragment identifier, separated from
+ the URI by a crosshatch ("#") character, consists of additional
+ reference information to be interpreted by the user agent after the
+ retrieval action has been successfully completed. As such, it is not
+ part of a URI, but is often used in conjunction with a URI.
+
+ fragment = *uric
+
+ The semantics of a fragment identifier is a property of the data
+ resulting from a retrieval action, regardless of the type of URI used
+ in the reference. Therefore, the format and interpretation of
+ fragment identifiers is dependent on the media type [RFC2046] of the
+ retrieval result. The character restrictions described in Section 2
+ for URI also apply to the fragment in a URI-reference. Individual
+ media types may define additional restrictions or structure within
+ the fragment for specifying different types of "partial views" that
+ can be identified within that media type.
+
+ A fragment identifier is only meaningful when a URI reference is
+ intended for retrieval and the result of that retrieval is a document
+ for which the identified fragment is consistently defined.
+
+4.2. Same-document References
+
+ A URI reference that does not contain a URI is a reference to the
+ current document. In other words, an empty URI reference within a
+ document is interpreted as a reference to the start of that document,
+ and a reference containing only a fragment identifier is a reference
+ to the identified fragment of that document. Traversal of such a
+ reference should not result in an additional retrieval action.
+ However, if the URI reference occurs in a context that is always
+ intended to result in a new request, as in the case of HTML's FORM
+ element, then an empty URI reference represents the base URI of the
+ current document and should be replaced by that URI when transformed
+ into a request.
+
+4.3. Parsing a URI Reference
+
+ A URI reference is typically parsed according to the four main
+ components and fragment identifier in order to determine what
+ components are present and whether the reference is relative or
+ absolute. The individual components are then parsed for their
+ subparts and, if not opaque, to verify their validity.
+
+ Although the BNF defines what is allowed in each component, it is
+ ambiguous in terms of differentiating between an authority component
+ and a path component that begins with two slash characters. The
+ greedy algorithm is used for disambiguation: the left-most matching
+ rule soaks up as much of the URI reference string as it is capable of
+ matching. In other words, the authority component wins.
+
+ Readers familiar with regular expressions should see Appendix B for a
+ concrete parsing example and test oracle.
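+
+ For orientation, the regular expression given in Appendix B can be
+ applied as in the Python sketch below. The function name
+ split_uri_reference and the dictionary layout are hypothetical. A
+ value of None means the component is undefined (its separator was
+ absent), while an empty string means it is present but empty.
+

```python
import re

# Regular expression from Appendix B; groups 2, 4, 5, 7 and 9 hold the
# scheme, authority, path, query and fragment respectively.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(ref):
    """Split a URI reference into its four components plus fragment."""
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

+
+ Because the authority group is tried before the path group, a
+ reference such as "//example.org/x" yields an authority of
+ "example.org" rather than a path beginning with two slashes.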
+
+5. Relative URI References
+
+ It is often the case that a group or "tree" of documents has been
+ constructed to serve a common purpose; the vast majority of URI in
+ these documents point to resources within the tree rather than
+ outside of it. Similarly, documents located at a particular site are
+ much more likely to refer to other resources at that site than to
+ resources at remote sites.
+
+ Relative addressing of URI allows document trees to be partially
+ independent of their location and access scheme. For instance, it is
+ possible for a single set of hypertext documents to be simultaneously
+ accessible and traversable via each of the "file", "http", and "ftp"
+ schemes if the documents refer to each other using relative URI.
+ Furthermore, such document trees can be moved, as a whole, without
+ changing any of the relative references. Experience within the WWW
+ has demonstrated that the ability to perform relative referencing is
+ necessary for the long-term usability of embedded URI.
+
+ The syntax for relative URI takes advantage of the <hier_part> syntax
+ of <absoluteURI> (Section 3) in order to express a reference that is
+ relative to the namespace of another hierarchical URI.
+
+ relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
+
+ A relative reference beginning with two slash characters is termed a
+ network-path reference, as defined by <net_path> in Section 3. Such
+ references are rarely used.
+
+ A relative reference beginning with a single slash character is
+ termed an absolute-path reference, as defined by <abs_path> in
+ Section 3.
+
+ A relative reference that does not begin with a scheme name or a
+ slash character is termed a relative-path reference.
+
+ rel_path = rel_segment [ abs_path ]
+
+ rel_segment = 1*( unreserved | escaped |
+ ";" | "@" | "&" | "=" | "+" | "$" | "," )
+
+ Within a relative-path reference, the complete path segments "." and
+ ".." have special meanings: "the current hierarchy level" and "the
+ level above this hierarchy level", respectively. Although this is
+ very similar to their use within Unix-based filesystems to indicate
+ directory levels, these path components are only considered special
+ when resolving a relative-path reference to its absolute form
+ (Section 5.2).
+
+ Authors should be aware that a path segment which contains a colon
+ character cannot be used as the first segment of a relative URI path
+ (e.g., "this:that"), because it would be mistaken for a scheme name.
+ It is therefore necessary to precede such segments with other
+ segments (e.g., "./this:that") in order for them to be referenced as
+ a relative path.
+
+ It is not necessary for all URI within a given scheme to be
+ restricted to the <hier_part> syntax, since the hierarchical
+ properties of that syntax are only necessary when relative URI are
+ used within a particular document. Documents can only make use of
+ relative URI when their base URI fits within the <hier_part> syntax.
+ It is assumed that any document which contains a relative reference
+ will also have a base URI that obeys the syntax. In other words,
+ relative URI cannot be used within a document that has an unsuitable
+ base URI.
+
+ Some URI schemes do not allow a hierarchical syntax matching the
+ <hier_part> syntax, and thus cannot use relative references.
+
+5.1. Establishing a Base URI
+
+ The term "relative URI" implies that there exists some absolute "base
+ URI" against which the relative reference is applied. Indeed, the
+ base URI is necessary to define the semantics of any relative URI
+ reference; without it, a relative reference is meaningless. In order
+ for relative URI to be usable within a document, the base URI of that
+ document must be known to the parser.
+
+ The base URI of a document can be established in one of four ways,
+ listed below in order of precedence. The order of precedence can be
+ thought of in terms of layers, where the innermost defined base URI
+ has the highest precedence. This can be visualized graphically as:
+
+ .----------------------------------------------------------.
+ | .----------------------------------------------------. |
+ | | .----------------------------------------------. | |
+ | | | .----------------------------------------. | | |
+ | | | | .----------------------------------. | | | |
+ | | | | | | | | | |
+ | | | | `----------------------------------' | | | |
+ | | | | (5.1.1) Base URI embedded in the | | | |
+ | | | | document's content | | | |
+ | | | `----------------------------------------' | | |
+ | | | (5.1.2) Base URI of the encapsulating entity | | |
+ | | | (message, document, or none). | | |
+ | | `----------------------------------------------' | |
+ | | (5.1.3) URI used to retrieve the entity | |
+ | `----------------------------------------------------' |
+ | (5.1.4) Default Base URI is application-dependent |
+ `----------------------------------------------------------'
+
+5.1.1. Base URI within Document Content
+
+ Within certain document media types, the base URI of the document can
+ be embedded within the content itself such that it can be readily
+ obtained by a parser. This can be useful for descriptive documents,
+ such as tables of contents, which may be transmitted to others through
+ protocols other than their usual retrieval context (e.g., E-Mail or
+ USENET news).
+
+ It is beyond the scope of this document to specify how, for each
+ media type, the base URI can be embedded. It is assumed that user
+ agents manipulating such media types will be able to obtain the
+ appropriate syntax from that media type's specification. An example
+ of how the base URI can be embedded in the Hypertext Markup Language
+ (HTML) [RFC1866] is provided in Appendix D.
+
+ A mechanism for embedding the base URI within MIME container types
+ (e.g., the message and multipart types) is defined by MHTML
+ [RFC2110]. Protocols that do not use the MIME message header syntax,
+ but which do allow some form of tagged metainformation to be included
+ within messages, may define their own syntax for defining the base
+ URI as part of a message.
+
+5.1.2. Base URI from the Encapsulating Entity
+
+ If no base URI is embedded, the base URI of a document is defined by
+ the document's retrieval context. For a document that is enclosed
+ within another entity (such as a message or another document), the
+ retrieval context is that entity; thus, the default base URI of the
+ document is the base URI of the entity in which the document is
+ encapsulated.
+
+5.1.3. Base URI from the Retrieval URI
+
+ If no base URI is embedded and the document is not encapsulated
+ within some other entity (e.g., the top level of a composite entity),
+ then, if a URI was used to retrieve the base document, that URI shall
+ be considered the base URI. Note that if the retrieval was the
+ result of a redirected request, the last URI used (i.e., that which
+ resulted in the actual retrieval of the document) is the base URI.
+
+5.1.4. Default Base URI
+
+ If none of the conditions described in Sections 5.1.1--5.1.3 apply,
+ then the base URI is defined by the context of the application.
+ Since this definition is necessarily application-dependent, failing
+ to define the base URI using one of the other methods may result in
+ the same content being interpreted differently by different types of
+ application.
+
+ It is the responsibility of the distributor(s) of a document
+ containing relative URI to ensure that the base URI for that document
+ can be established. It must be emphasized that relative URI cannot
+ be used reliably in situations where the document's base URI is not
+ well-defined.
+
+5.2. Resolving Relative References to Absolute Form
+
+ This section describes an example algorithm for resolving URI
+ references that might be relative to a given base URI.
+
+ The base URI is established according to the rules of Section 5.1 and
+ parsed into the four main components as described in Section 3. Note
+ that only the scheme component is required to be present in the base
+ URI; the other components may be empty or undefined. A component is
+ undefined if its preceding separator does not appear in the URI
+ reference; the path component is never undefined, though it may be
+ empty. The base URI's query component is not used by the resolution
+ algorithm and may be discarded.
+
+ For each URI reference, the following steps are performed in order:
+
+ 1) The URI reference is parsed into the potential four components and
+ fragment identifier, as described in Section 4.3.
+
+ 2) If the path component is empty and the scheme, authority, and
+ query components are undefined, then it is a reference to the
+ current document and we are done. Otherwise, the reference URI's
+ query and fragment components are defined as found (or not found)
+ within the URI reference and not inherited from the base URI.
+
+ 3) If the scheme component is defined, indicating that the reference
+ starts with a scheme name, then the reference is interpreted as an
+ absolute URI and we are done. Otherwise, the reference URI's
+ scheme is inherited from the base URI's scheme component.
+
+ Due to a loophole in prior specifications [RFC1630], some parsers
+ allow the scheme name to be present in a relative URI if it is the
+ same as the base URI scheme. Unfortunately, this can conflict
+ with the correct parsing of non-hierarchical URI. For backwards
+ compatibility, an implementation may work around such references
+ by removing the scheme if it matches that of the base URI and the
+ scheme is known to always use the <hier_part> syntax. The parser
+ can then continue with the steps below for the remainder of the
+ reference components. Validating parsers should mark such a
+ malformed relative reference as an error.
+
+ 4) If the authority component is defined, then the reference is a
+ network-path and we skip to step 7. Otherwise, the reference
+ URI's authority is inherited from the base URI's authority
+ component, which will also be undefined if the URI scheme does not
+ use an authority component.
+
+ 5) If the path component begins with a slash character ("/"), then
+ the reference is an absolute-path and we skip to step 7.
+
+ 6) If this step is reached, then we are resolving a relative-path
+ reference. The relative path needs to be merged with the base
+ URI's path. Although there are many ways to do this, we will
+ describe a simple method using a separate string buffer.
+
+ a) All but the last segment of the base URI's path component is
+ copied to the buffer. In other words, any characters after the
+ last (right-most) slash character, if any, are excluded.
+
+ b) The reference's path component is appended to the buffer
+ string.
+
+ c) All occurrences of "./", where "." is a complete path segment,
+ are removed from the buffer string.
+
+ d) If the buffer string ends with "." as a complete path segment,
+ that "." is removed.
+
+ e) All occurrences of "<segment>/../", where <segment> is a
+ complete path segment not equal to "..", are removed from the
+ buffer string. Removal of these path segments is performed
+ iteratively, removing the leftmost matching pattern on each
+ iteration, until no matching pattern remains.
+
+ f) If the buffer string ends with "<segment>/..", where <segment>
+ is a complete path segment not equal to "..", that
+ "<segment>/.." is removed.
+
+ g) If the resulting buffer string still begins with one or more
+ complete path segments of "..", then the reference is
+ considered to be in error. Implementations may handle this
+ error by retaining these components in the resolved path (i.e.,
+ treating them as part of the final URI), by removing them from
+ the resolved path (i.e., discarding relative levels above the
+ root), or by avoiding traversal of the reference.
+
+ h) The remaining buffer string is the reference URI's new path
+ component.
+
+ 7) The resulting URI components, including any inherited from the
+ base URI, are recombined to give the absolute form of the URI
+ reference. Using pseudocode, this would be
+
+ result = ""
+
+ if scheme is defined then
+     append scheme to result
+     append ":" to result
+
+ if authority is defined then
+     append "//" to result
+     append authority to result
+
+ append path to result
+
+ if query is defined then
+     append "?" to result
+     append query to result
+
+ if fragment is defined then
+     append "#" to result
+     append fragment to result
+
+ return result
+
+ Note that we must be careful to preserve the distinction between a
+ component that is undefined, meaning that its separator was not
+ present in the reference, and a component that is empty, meaning
+ that the separator was present and was immediately followed by the
+ next component separator or the end of the reference.
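
The recombination pseudocode in step 7 could be sketched in Python as
follows. This is only an illustration: the function name `recompose` is
an assumption, and `None` is used here to stand for an undefined
component so that it stays distinct from an empty one.

```python
from typing import Optional


def recompose(scheme: Optional[str], authority: Optional[str], path: str,
              query: Optional[str], fragment: Optional[str]) -> str:
    """Recombine URI components as in step 7; None means 'undefined'."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:       # an empty query "" still emits its "?"
        result += "?" + query
    if fragment is not None:    # an empty fragment "" still emits its "#"
        result += "#" + fragment
    return result
```

With this representation an empty query keeps its bare "?" separator,
whereas an undefined query (None) contributes nothing to the result.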
+
+ The above algorithm is intended to provide an example by which the
+ output of implementations can be tested -- implementation of the
+ algorithm itself is not required. For example, some systems may find
+ it more efficient to implement step 6 as a pair of segment stacks
+ being merged, rather than as a series of string pattern replacements.
+
+ Note: Some WWW client applications will fail to separate the
+ reference's query component from its path component before merging
+ the base and reference paths in step 6 above. This may result in
+ a loss of information if the query component contains the strings
+ "/../" or "/./".
+
+ Resolution examples are provided in Appendix C.
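
Steps 6a through 6h can be sketched with regular expressions over the
string buffer. The Python below is a minimal sketch, not a reference
implementation: the name `merge_paths` is an assumption, and for step 6g
it takes the first permitted option of retaining leading ".." segments.

```python
import re

# A segment is "complete" only at the start of the buffer or right after
# a slash; the lookbehind (?<![^/]) expresses exactly that condition.
_DOT_SLASH = re.compile(r"(?<![^/])\./")
_FINAL_DOT = re.compile(r"(?<![^/])\.\Z")
# (?!\.\./) rejects a candidate segment that is itself "..".
_SEG_DOTDOT_SLASH = re.compile(r"(?<![^/])(?!\.\./)[^/]+/\.\./")
_FINAL_SEG_DOTDOT = re.compile(r"(?<![^/])(?!\.\./)[^/]+/\.\.\Z")


def merge_paths(base_path: str, ref_path: str) -> str:
    """Merge a relative path with the base URI's path (steps 6a-6h)."""
    # a) copy all but the last segment of the base path
    buf = base_path[: base_path.rfind("/") + 1]
    # b) append the reference's path
    buf += ref_path
    # c) remove every "./" where "." is a complete segment
    buf = _DOT_SLASH.sub("", buf)
    # d) remove a final "." that is a complete segment
    buf = _FINAL_DOT.sub("", buf)
    # e) remove the leftmost "<segment>/../" until none remains
    while True:
        buf, count = _SEG_DOTDOT_SLASH.subn("", buf, count=1)
        if count == 0:
            break
    # f) remove a final "<segment>/.."
    buf = _FINAL_SEG_DOTDOT.sub("", buf)
    # g) leading ".." segments, if any, are retained
    # h) the buffer is the reference's new path component
    return buf
```

The query and fragment must already have been separated from the
reference's path before it is passed in, as the note above warns.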
+
+6. URI Normalization and Equivalence
+
+ In many cases, different URI strings may actually identify the
+ identical resource. For example, the host names used in URL are
+ actually case insensitive, and the URL <http://www.XEROX.com> is
+ equivalent to <http://www.xerox.com>. In general, the rules for
+ equivalence and definition of a normal form, if any, are scheme
+ dependent. When a scheme uses elements of the common syntax, it will
+ also use the common syntax equivalence rules, namely that the scheme
+ and hostname are case insensitive and a URL with an explicit ":port",
+ where the port is the default for the scheme, is equivalent to one
+ where the port is elided.
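
The common-syntax equivalence rules could be applied along the following
lines. This Python is a sketch under stated assumptions: the function
name `normalize` and the small table of default ports are illustrative
only, since default ports are defined by each scheme's own specification.

```python
import re

# Default ports for a few schemes. This table is an illustrative
# assumption; it is not taken from the text above.
DEFAULT_PORTS = {"http": "80", "ftp": "21", "gopher": "70"}

_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def normalize(uri: str) -> str:
    """Lowercase the scheme and host and drop an explicit default port."""
    scheme, authority, path, query, fragment = _URI.match(uri).group(2, 4, 5, 7, 9)
    result = ""
    if scheme is not None:
        scheme = scheme.lower()
        result += scheme + ":"
    if authority is not None:
        userinfo, at, hostport = authority.rpartition("@")
        host, colon, port = hostport.partition(":")
        if colon and port == DEFAULT_PORTS.get(scheme):
            colon = port = ""          # ":port" equal to the default is elided
        result += "//" + userinfo + at + host.lower() + colon + port
    result += path                     # path, query and fragment stay as-is
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

Two URI strings can then be compared after normalization; anything
beyond these rules remains scheme dependent.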
+
+7. Security Considerations
+
+ A URI does not in itself pose a security threat. Users should beware
+ that there is no general guarantee that a URL, which at one time
+ located a given resource, will continue to do so. Nor is there any
+ guarantee that a URL will not locate a different resource at some
+ later point in time, due to the lack of any constraint on how a given
+ authority apportions its namespace. Such a guarantee can only be
+ obtained from the person(s) controlling that namespace and the
+ resource in question. A specific URI scheme may include additional
+ semantics, such as name persistence, if those semantics are required
+ of all naming authorities for that scheme.
+
+ It is sometimes possible to construct a URL such that an attempt to
+ perform a seemingly harmless, idempotent operation, such as the
+ retrieval of an entity associated with the resource, will in fact
+ cause a possibly damaging remote operation to occur. The unsafe URL
+ is typically constructed by specifying a port number other than that
+ reserved for the network protocol in question. The client
+ unwittingly contacts a site that is in fact running a different
+ protocol. The content of the URL contains instructions that, when
+ interpreted according to this other protocol, cause an unexpected
+ operation. An example has been the use of a gopher URL to cause an
+ unintended or impersonating message to be sent via a SMTP server.
+
+ Caution should be used when using any URL that specifies a port
+ number other than the default for the protocol, especially when it is
+ a number within the reserved space.
+
+ Care should be taken when a URL contains escaped delimiters for a
+ given protocol (for example, CR and LF characters for telnet
+ protocols) that these are not unescaped before transmission. This
+ might violate the protocol, but avoids the potential for such
+ characters to be used to simulate an extra operation or parameter in
+ that protocol, which might lead to an unexpected and possibly harmful
+ remote operation to be performed.
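
A client might flag the two risks just described before dereferencing a
URL. The following Python is a rough sketch only: the name
`warnings_for`, the table of default ports, and the reading of "reserved
space" as ports below 1024 are all assumptions made for illustration.

```python
import re

# Illustrative defaults; each scheme's specification is authoritative.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "gopher": 70}

_PARTS = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?")
_ESCAPED_CR_LF = re.compile(r"%0[AaDd]")


def warnings_for(uri: str) -> list:
    """Return a list of human-readable warnings for a URL string."""
    found = []
    scheme, authority = _PARTS.match(uri).group(1, 2)
    if authority:
        hostport = authority.rpartition("@")[2]
        port = hostport.partition(":")[2]
        if port.isdigit():
            default = DEFAULT_PORTS.get((scheme or "").lower())
            # Assumption: "reserved space" means ports below 1024.
            if int(port) != default and int(port) < 1024:
                found.append("non-default port in reserved range")
    if _ESCAPED_CR_LF.search(uri):
        found.append("escaped CR or LF")
    return found
```

A gopher URL aimed at port 25 and carrying escaped line breaks, the kind
of case mentioned above, would trigger both warnings.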
+
+ It is clearly unwise to use a URL that contains a password which is
+ intended to be secret. In particular, the use of a password within
+ the 'userinfo' component of a URL is strongly disrecommended except
+ in those rare cases where the 'password' parameter is intended to be
+ public.
+
+8. Acknowledgements
+
+ This document was derived from RFC 1738 [RFC1738] and RFC 1808
+ [RFC1808]; the acknowledgements in those specifications still apply.
+ In addition, contributions by Gisle Aas, Martin Beet, Martin Duerst,
+ Jim Gettys, Martijn Koster, Dave Kristol, Daniel LaLiberte, Foteos
+ Macrides, James Marshall, Ryan Moats, Keith Moore, and Lauren Wood
+ are gratefully acknowledged.
+
+9. References
+
+ [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
+ Languages", BCP 18, RFC 2277, January 1998.
+
+ [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
+ Unifying Syntax for the Expression of Names and Addresses
+ of Objects on the Network as used in the World-Wide Web",
+ RFC 1630, June 1994.
+
+ [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors,
+ "Uniform Resource Locators (URL)", RFC 1738, December 1994.
+
+ [RFC1866] Berners-Lee T., and D. Connolly, "HyperText Markup Language
+ Specification -- 2.0", RFC 1866, November 1995.
+
+ [RFC1123] Braden, R., Editor, "Requirements for Internet Hosts --
+ Application and Support", STD 3, RFC 1123, October 1989.
+
+ [RFC822] Crocker, D., "Standard for the Format of ARPA Internet Text
+ Messages", STD 11, RFC 822, August 1982.
+
+ [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC
+ 1808, June 1995.
+
+ [RFC2046] Freed, N., and N. Borenstein, "Multipurpose Internet Mail
+ Extensions (MIME) Part Two: Media Types", RFC 2046,
+ November 1996.
+
+ [RFC1736] Kunze, J., "Functional Recommendations for Internet
+ Resource Locators", RFC 1736, February 1995.
+
+ [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
+
+ [RFC1034] Mockapetris, P., "Domain Names - Concepts and Facilities",
+ STD 13, RFC 1034, November 1987.
+
+ [RFC2110] Palme, J., and A. Hopmann, "MIME E-mail Encapsulation of
+ Aggregate Documents, such as HTML (MHTML)", RFC 2110, March
+ 1997.
+
+ [RFC1737] Sollins, K., and L. Masinter, "Functional Requirements for
+ Uniform Resource Names", RFC 1737, December 1994.
+
+ [ASCII] US-ASCII. "Coded Character Set -- 7-bit American Standard
+ Code for Information Interchange", ANSI X3.4-1986.
+
+ [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646",
+ RFC 2279, January 1998.
+
+10. Authors' Addresses
+
+ Tim Berners-Lee
+ World Wide Web Consortium
+ MIT Laboratory for Computer Science, NE43-356
+ 545 Technology Square
+ Cambridge, MA 02139
+
+ Fax: +1(617)258-8682
+ EMail: timbl@w3.org
+
+
+ Roy T. Fielding
+ Department of Information and Computer Science
+ University of California, Irvine
+ Irvine, CA 92697-3425
+
+ Fax: +1(949)824-1715
+ EMail: fielding@ics.uci.edu
+
+
+ Larry Masinter
+ Xerox PARC
+ 3333 Coyote Hill Road
+ Palo Alto, CA 94034
+
+ Fax: +1(415)812-4333
+ EMail: masinter@parc.xerox.com
+
+A. Collected BNF for URI
+
+ URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
+ absoluteURI = scheme ":" ( hier_part | opaque_part )
+ relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
+
+ hier_part = ( net_path | abs_path ) [ "?" query ]
+ opaque_part = uric_no_slash *uric
+
+ uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
+ "&" | "=" | "+" | "$" | ","
+
+ net_path = "//" authority [ abs_path ]
+ abs_path = "/" path_segments
+ rel_path = rel_segment [ abs_path ]
+
+ rel_segment = 1*( unreserved | escaped |
+ ";" | "@" | "&" | "=" | "+" | "$" | "," )
+
+ scheme = alpha *( alpha | digit | "+" | "-" | "." )
+
+ authority = server | reg_name
+
+ reg_name = 1*( unreserved | escaped | "$" | "," |
+ ";" | ":" | "@" | "&" | "=" | "+" )
+
+ server = [ [ userinfo "@" ] hostport ]
+ userinfo = *( unreserved | escaped |
+ ";" | ":" | "&" | "=" | "+" | "$" | "," )
+
+ hostport = host [ ":" port ]
+ host = hostname | IPv4address
+ hostname = *( domainlabel "." ) toplabel [ "." ]
+ domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
+ toplabel = alpha | alpha *( alphanum | "-" ) alphanum
+ IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
+ port = *digit
+
+ path = [ abs_path | opaque_part ]
+ path_segments = segment *( "/" segment )
+ segment = *pchar *( ";" param )
+ param = *pchar
+ pchar = unreserved | escaped |
+ ":" | "@" | "&" | "=" | "+" | "$" | ","
+
+ query = *uric
+
+ fragment = *uric
+
+ uric = reserved | unreserved | escaped
+ reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
+ "$" | ","
+ unreserved = alphanum | mark
+ mark = "-" | "_" | "." | "!" | "~" | "*" | "'" |
+ "(" | ")"
+
+ escaped = "%" hex hex
+ hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
+ "a" | "b" | "c" | "d" | "e" | "f"
+
+ alphanum = alpha | digit
+ alpha = lowalpha | upalpha
+
+ lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
+ "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
+ "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
+ upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
+ "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
+ "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
+ digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
+ "8" | "9"
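
Two of the rules above translate directly into code. The Python below is
a small sketch, with `escape_octets` and `is_scheme` as assumed names:
it escapes every octet outside the "unreserved" set as "%" hex hex, and
checks a string against the "scheme" production.

```python
import re
import string

MARK = "-_.!~*'()"
UNRESERVED = frozenset(string.ascii_letters + string.digits + MARK)

# scheme = alpha *( alpha | digit | "+" | "-" | "." )
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


def escape_octets(data: bytes) -> str:
    """Escape each octet that is not an unreserved character."""
    return "".join(
        chr(octet) if chr(octet) in UNRESERVED else "%{:02X}".format(octet)
        for octet in data
    )


def is_scheme(text: str) -> bool:
    """Return True if text matches the 'scheme' production."""
    return _SCHEME.match(text) is not None
```

Note that this escapes reserved characters such as "/" as well, so it
suits data placed inside a single component rather than a whole URI.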
+
+B. Parsing a URI Reference with a Regular Expression
+
+ As described in Section 4.3, the generic URI syntax is not sufficient
+ to disambiguate the components of some forms of URI. Since the
+ "greedy algorithm" described in that section is identical to the
+ disambiguation method used by POSIX regular expressions, it is
+ natural and commonplace to use a regular expression for parsing the
+ potential four components and fragment identifier of a URI reference.
+
+ The following line is the regular expression for breaking-down a URI
+ reference into its components.
+
+ ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
+  12            3  4          5       6  7        8 9
+
+ The numbers in the second line above are only to assist readability;
+ they indicate the reference points for each subexpression (i.e., each
+ paired parenthesis). We refer to the value matched for subexpression
+ <n> as $<n>. For example, matching the above expression to
+
+ http://www.ics.uci.edu/pub/ietf/uri/#Related
+
+ results in the following subexpression matches:
+
+ $1 = http:
+ $2 = http
+ $3 = //www.ics.uci.edu
+ $4 = www.ics.uci.edu
+ $5 = /pub/ietf/uri/
+ $6 = <undefined>
+ $7 = <undefined>
+ $8 = #Related
+ $9 = Related
+
+ where <undefined> indicates that the component is not present, as is
+ the case for the query component in the above example. Therefore, we
+ can determine the value of the four components and fragment as
+
+ scheme = $2
+ authority = $4
+ path = $5
+ query = $7
+ fragment = $9
+
+ and, going in the opposite direction, we can recreate a URI reference
+ from its components using the algorithm in step 7 of Section 5.2.
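
The same expression can be used unchanged with Python's `re` module. The
sketch below is illustrative only; the name `split_uri_reference` is an
assumption, and an undefined component is reported as `None`.

```python
import re

_URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference: str) -> dict:
    """Split a URI reference into its four components and fragment."""
    # Every part of the expression is optional, so match() always succeeds.
    match = _URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }
```

A group that did not take part in the match comes back as `None`, while
a separator followed by nothing yields an empty string, which preserves
the undefined-versus-empty distinction that step 7 of Section 5.2 needs.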
+
+C. Examples of Resolving Relative URI References
+
+ Within an object with a well-defined base URI of
+
+ http://a/b/c/d;p?q
+
+ the relative URI would be resolved as follows:
+
+C.1. Normal Examples
+
+ g:h = g:h
+ g = http://a/b/c/g
+ ./g = http://a/b/c/g
+ g/ = http://a/b/c/g/
+ /g = http://a/g
+ //g = http://g
+ ?y = http://a/b/c/?y
+ g?y = http://a/b/c/g?y
+ #s = (current document)#s
+ g#s = http://a/b/c/g#s
+ g?y#s = http://a/b/c/g?y#s
+ ;x = http://a/b/c/;x
+ g;x = http://a/b/c/g;x
+ g;x?y#s = http://a/b/c/g;x?y#s
+ . = http://a/b/c/
+ ./ = http://a/b/c/
+ .. = http://a/b/
+ ../ = http://a/b/
+ ../g = http://a/b/g
+ ../.. = http://a/
+ ../../ = http://a/
+ ../../g = http://a/g
+
+C.2. Abnormal Examples
+
+ Although the following abnormal examples are unlikely to occur in
+ normal practice, all URI parsers should be capable of resolving them
+ consistently. Each example uses the same base as above.
+
+ An empty reference refers to the start of the current document.
+
+ <> = (current document)
+
+ Parsers must be careful in handling the case where there are more
+ relative path ".." segments than there are hierarchical levels in the
+ base URI's path. Note that the ".." syntax cannot be used to change
+ the authority component of a URI.
+
+ ../../../g = http://a/../g
+ ../../../../g = http://a/../../g
+
+ In practice, some implementations strip leading relative symbolic
+ elements (".", "..") after applying a relative URI calculation, based
+ on the theory that compensating for obvious author errors is better
+ than allowing the request to fail. Thus, the above two references
+ will be interpreted as "http://a/g" by some implementations.
+
+ Similarly, parsers must avoid treating "." and ".." as special when
+ they are not complete components of a relative path.
+
+ /./g = http://a/./g
+ /../g = http://a/../g
+ g. = http://a/b/c/g.
+ .g = http://a/b/c/.g
+ g.. = http://a/b/c/g..
+ ..g = http://a/b/c/..g
+
+ Less likely are cases where the relative URI uses unnecessary or
+ nonsensical forms of the "." and ".." complete path segments.
+
+ ./../g = http://a/b/g
+ ./g/. = http://a/b/c/g/
+ g/./h = http://a/b/c/g/h
+ g/../h = http://a/b/c/h
+ g;x=1/./y = http://a/b/c/g;x=1/y
+ g;x=1/../y = http://a/b/c/y
+
+ All client applications remove the query component from the base URI
+ before resolving relative URI. However, some applications fail to
+ separate the reference's query and/or fragment components from a
+ relative path before merging it with the base path. This error is
+ rarely noticed, since typical usage of a fragment never includes the
+ hierarchy ("/") character, and the query component is not normally
+ used within relative references.
+
+ g?y/./x = http://a/b/c/g?y/./x
+ g?y/../x = http://a/b/c/g?y/../x
+ g#s/./x = http://a/b/c/g#s/./x
+ g#s/../x = http://a/b/c/g#s/../x
+
+ Some parsers allow the scheme name to be present in a relative URI if
+ it is the same as the base URI scheme. This is considered to be a
+ loophole in prior specifications of partial URI [RFC1630]. Its use
+ should be avoided.
+
+ http:g = http:g ; for validating parsers
+ | http://a/b/c/g ; for backwards compatibility
+
+D. Embedding the Base URI in HTML documents
+
+ It is useful to consider an example of how the base URI of a document
+ can be embedded within the document's content. In this appendix, we
+ describe how documents written in the Hypertext Markup Language
+ (HTML) [RFC1866] can include an embedded base URI. This appendix
+ does not form a part of the URI specification and should not be
+ considered as anything more than a descriptive example.
+
+ HTML defines a special element "BASE" which, when present in the
+ "HEAD" portion of a document, signals that the parser should use the
+ BASE element's "HREF" attribute as the base URI for resolving any
+ relative URI. The "HREF" attribute must be an absolute URI. Note
+ that, in HTML, element and attribute names are case-insensitive. For
+ example:
+
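+ <!doctype html public "-//IETF//DTD HTML//EN">
+ <HTML><HEAD>
+ <TITLE>An example HTML document</TITLE>
+ <BASE href="http://www.ics.uci.edu/Test/a/b/c">
+ </HEAD><BODY>
+ ... <A href="../x">a hypertext anchor</A> ...
+ </BODY></HTML>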
+
+ A parser reading the example document should interpret the given
+ relative URI "../x" as representing the absolute URI
+
+ <URL:http://www.ics.uci.edu/Test/a/x>
+
+ regardless of the context in which the example document was obtained.
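
Locating an embedded base URI could look roughly like this in Python,
using the standard `html.parser` module. This is a sketch under stated
assumptions: the names `BaseFinder`, `embedded_base` and
`resolve_with_embedded_base` are illustrative, the sketch does not check
that BASE appears inside HEAD, and `urllib.parse.urljoin` follows the
later RFC 3986 resolution rules rather than the algorithm in this
document.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Record the HREF of the first BASE element encountered."""

    def __init__(self):
        super().__init__()
        self.base = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases element and attribute names, which matches
        # HTML's case-insensitive treatment of them.
        if tag == "base" and self.base is None:
            self.base = dict(attrs).get("href")


def embedded_base(document: str):
    """Return the embedded base URI, or None if the document has none."""
    finder = BaseFinder()
    finder.feed(document)
    return finder.base


def resolve_with_embedded_base(document: str, reference: str, retrieval_uri: str) -> str:
    """Resolve reference, preferring an embedded base over the retrieval URI."""
    return urljoin(embedded_base(document) or retrieval_uri, reference)
```

The retrieval URI is consulted only when the document carries no BASE
element, so the embedded base wins regardless of how the document was
obtained.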
+
+E. Recommendations for Delimiting URI in Context
+
+ URI are often transmitted through formats that do not provide a clear
+ context for their interpretation. For example, there are many
+ occasions when URI are included in plain text; examples include text
+ sent in electronic mail, USENET news messages, and, most importantly,
+ printed on paper. In such cases, it is important to be able to
+ delimit the URI from the rest of the text, and in particular from
+ punctuation marks that might be mistaken for part of the URI.
+
+ In practice, URI are delimited in a variety of ways, but usually
+ within double-quotes "http://test.com/", angle brackets
+ <http://test.com/>, or just using whitespace
+
+ http://test.com/
+
+ These wrappers do not form part of the URI.
+
+ In the case where a fragment identifier is associated with a URI
+ reference, the fragment would be placed within the brackets as well
+ (separated from the URI with a "#" character).
+
+ In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
+ need to be added to break long URI across lines. The whitespace
+ should be ignored when extracting the URI.
+
+ No whitespace should be introduced after a hyphen ("-") character.
+ Because some typesetters and printers may (erroneously) introduce a
+ hyphen at the end of line when breaking a line, the interpreter of a
+ URI containing a line break immediately after a hyphen should ignore
+ all unescaped whitespace around the line break, and should be aware
+ that the hyphen may or may not actually be part of the URI.
+
+ Using <> angle brackets around each URI is especially recommended as
+ a delimiting style for URI that contain whitespace.
+
+ The prefix "URL:" (with or without a trailing space) was recommended
+ as a way to help distinguish a URL from other bracketed
+ designators, although this is not common in practice.
+
+ For robustness, software that accepts user-typed URI should attempt
+ to recognize and strip both delimiters and embedded whitespace.
+
+ For example, the text:
+
+ Yes, Jim, I found it under "http://www.w3.org/Addressing/",
+ but you can probably pick it up from <ftp://ds.internic.
+ net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
+ ietf/uri/historical.html#WARNING>.
+
+ contains the URI references
+
+ http://www.w3.org/Addressing/
+ ftp://ds.internic.net/rfc/
+ http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING
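
Software that accepts such text might recover the references roughly as
follows. This Python is a heuristic sketch, not a definitive method: the
name `extract_uris` is an assumption, only double-quoted and
angle-bracketed spans are considered, and a span is kept only if it
starts with something shaped like a scheme name.

```python
import re

_DELIMITED = re.compile(r'"([^"]*)"|<([^<>]*)>')
_SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def extract_uris(text: str) -> list:
    """Return URI references found inside double quotes or angle brackets."""
    uris = []
    for quoted, bracketed in _DELIMITED.findall(text):
        # Embedded whitespace, including line breaks, is ignored.
        candidate = re.sub(r"\s+", "", quoted or bracketed)
        # An optional "URL:" prefix is not part of the URI.
        if candidate[:4].upper() == "URL:":
            candidate = candidate[4:]
        if _SCHEME_PREFIX.match(candidate):
            uris.append(candidate)
    return uris
```

URI that are delimited by whitespace alone are not handled by this
sketch, and quoted prose that happens to contain a colon could be
mistaken for a reference.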
+
+F. Abbreviated URLs
+
+ The URL syntax was designed for unambiguous reference to network
+ resources and extensibility via the URL scheme. However, as URL
+ identification and usage have become commonplace, traditional media
+ (television, radio, newspapers, billboards, etc.) have increasingly
+ used abbreviated URL references. That is, a reference consisting of
+ only the authority and path portions of the identified resource, such
+ as
+
+ www.w3.org/Addressing/
+
+ or simply the DNS hostname on its own. Such references are primarily
+ intended for human interpretation rather than machine, with the
+ assumption that context-based heuristics are sufficient to complete
+ the URL (e.g., most hostnames beginning with "www" are likely to have
+ a URL prefix of "http://"). Although there is no standard set of
+ heuristics for disambiguating abbreviated URL references, many client
+ implementations allow them to be entered by the user and
+ heuristically resolved. It should be noted that such heuristics may
+ change over time, particularly when new URL schemes are introduced.
+
+ Since an abbreviated URL has the same syntax as a relative URL path,
+ abbreviated URL references cannot be used in contexts where relative
+ URLs are expected. This limits the use of abbreviated URLs to places
+ where there is no defined base URL, such as dialog boxes and off-line
+ advertisements.
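
One possible heuristic for completing an abbreviated URL is sketched
below in Python. Since there is no standard set of heuristics, every
detail here is an assumption made for illustration: the name
`complete_abbreviated`, the host-prefix table, and the "http" fallback.

```python
import re

_HAS_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")

# Hypothetical guesses keyed on the first label of the hostname.
_PREFIX_GUESSES = (("www.", "http"), ("ftp.", "ftp"), ("gopher.", "gopher"))


def complete_abbreviated(text: str, fallback_scheme: str = "http") -> str:
    """Guess a full URL for user-typed input that lacks a scheme."""
    text = text.strip()
    if _HAS_SCHEME.match(text):
        return text                      # already has scheme and "//"
    lowered = text.lower()
    for prefix, scheme in _PREFIX_GUESSES:
        if lowered.startswith(prefix):
            return scheme + "://" + text
    return fallback_scheme + "://" + text
```

This is only appropriate where no base URL is defined, such as a dialog
box. It would also mishandle a URI without "//" after the scheme (for
example a mailto reference), which a real client would need to
recognize first.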
+
+G. Summary of Non-editorial Changes
+
+G.1. Additions
+
+ Section 4 (URI References) was added to stem the confusion regarding
+ "what is a URI" and how to describe fragment identifiers given that
+ they are not part of the URI, but are part of the URI syntax and
+ parsing concerns. In addition, it provides a reference definition
+ for use by other IETF specifications (HTML, HTTP, etc.) that have
+ previously attempted to redefine the URI syntax in order to account
+ for the presence of fragment identifiers in URI references.
+
+ Section 2.4 was rewritten to clarify a number of misinterpretations
+ and to leave room for fully internationalized URI.
+
+ Appendix F on abbreviated URLs was added to describe the shortened
+ references often seen on television and magazine advertisements and
+ explain why they are not used in other contexts.
+
+G.2. Modifications from both RFC 1738 and RFC 1808
+
+ Changed to URI syntax instead of just URL.
+
+ Confusion regarding the terms "character encoding", the URI
+ "character set", and the escaping of characters with %<hex><hex>
+ equivalents has (hopefully) been reduced. Many of the BNF rule names
+ regarding the character sets have been changed to more accurately
+ describe their purpose and to encompass all "characters" rather than
+ just US-ASCII octets. Unless otherwise noted here, these
+ modifications do not affect the URI syntax.
+
+ Both RFC 1738 and RFC 1808 refer to the "reserved" set of characters
+ as if URI-interpreting software were limited to a single set of
+ characters with a reserved purpose (i.e., as meaning something other
+ than the data to which the characters correspond), and that this set
+ was fixed by the URI scheme. However, this has not been true in
+ practice; any character that is interpreted differently when it is
+ escaped is, in effect, reserved. Furthermore, the interpreting
+ engine on a HTTP server is often dependent on the resource, not just
+ the URI scheme. The description of reserved characters has been
+ changed accordingly.
+
+ The plus "+", dollar "$", and comma "," characters have been added to
+ those in the "reserved" set, since they are treated as reserved
+ within the query component.
+
+ The tilde "~" character was added to those in the "unreserved" set,
+ since it is extensively used on the Internet in spite of the
+ difficulty to transcribe it with some keyboards.
+
+ The syntax for URI scheme has been changed to require that all
+ schemes begin with an alpha character.
+
+ The "user:password" form in the previous BNF was changed to a
+ "userinfo" token, and the possibility that it might be
+ "user:password" made scheme specific. In particular, the use of
+ passwords in the clear is not even suggested by the syntax.
+
+ The question-mark "?" character was removed from the set of allowed
+ characters for the userinfo in the authority component, since testing
+ showed that many applications treat it as reserved for separating the
+ query component from the rest of the URI.
+
+ The semicolon ";" character was added to those stated as being
+ reserved within the authority component, since several new schemes
+ are using it as a separator within userinfo to indicate the type of
+ user authentication.
+
+ RFC 1738 specified that the path was separated from the authority
+ portion of a URI by a slash. RFC 1808 followed suit, but with a
+ fudge of carrying around the separator as a "prefix" in order to
+ describe the parsing algorithm. RFC 1630 never had this problem,
+ since it considered the slash to be part of the path. In writing
+ this specification, it was found to be impossible to accurately
+ describe and retain the difference between the two URI
+ <foo:/bar> and <foo:bar>
+ without either considering the slash to be part of the path (as
+ corresponds to actual practice) or creating a separate component just
+ to hold that slash. We chose the former.
+
+G.3. Modifications from RFC 1738
+
+ The definition of specific URL schemes and their scheme-specific
+ syntax and semantics has been moved to separate documents.
+
+ The URL host was defined as a fully-qualified domain name. However,
+ many URLs are used without fully-qualified domain names (in contexts
+ for which the full qualification is not necessary), without any host
+ (as in some file URLs), or with a host of "localhost".
+
+ The URL port is now *digit instead of 1*digit, since systems are
+ expected to handle the case where the ":" separator between host and
+ port is supplied without a port.
+
+ The recommendations for delimiting URI in context (Appendix E) have
+ been adjusted to reflect current practice.
+
+G.4. Modifications from RFC 1808
+
+ RFC 1808 (Section 4) defined an empty URL reference (a reference
+ containing nothing aside from the fragment identifier) as being a
+ reference to the base URL. Unfortunately, that definition could be
+ interpreted, upon selection of such a reference, as a new retrieval
+ action on that resource. Since the normal intent of such references
+ is for the user agent to change its view of the current document to
+ the beginning of the specified fragment within that document, not to
+ make an additional request of the resource, a description of how to
+ correctly interpret an empty reference has been added in Section 4.
+
+ The description of the mythical Base header field has been replaced
+ with a reference to the Content-Location header field defined by
+ MHTML [RFC2110].
+
+ RFC 1808 described various schemes as either having or not having the
+ properties of the generic URI syntax. However, the only requirement
+ is that the particular document containing the relative references
+ have a base URI that abides by the generic URI syntax, regardless of
+ the URI scheme, so the associated description has been updated to
+ reflect that.
+
+ The BNF term <net_loc> has been replaced with <authority>, since the
+ latter more accurately describes its use and purpose. Likewise, the
+ authority is no longer restricted to the IP server syntax.
+
+ Extensive testing of current client applications demonstrated that
+ the majority of deployed systems do not use the ";" character to
+ indicate trailing parameter information, and that the presence of a
+ semicolon in a path segment does not affect the relative parsing of
+ that segment. Therefore, parameters have been removed as a separate
+ component and may now appear in any path segment. Their influence
+ has been removed from the algorithm for resolving a relative URI
+ reference. The resolution examples in Appendix C have been modified
+ to reflect this change.
+
+ Implementations are now allowed to work around misformed relative
+ references that are prefixed by the same scheme as the base URI, but
+ only for schemes known to use the <hier_part> syntax.
+
+H. Full Copyright Statement
+
+ Copyright (C) The Internet Society (1998). All Rights Reserved.
+
+ This document and translations of it may be copied and furnished to
+ others, and derivative works that comment on or otherwise explain it
+ or assist in its implementation may be prepared, copied, published
+ and distributed, in whole or in part, without restriction of any
+ kind, provided that the above copyright notice and this paragraph are
+ included on all such copies and derivative works. However, this
+ document itself may not be modified in any way, such as by removing
+ the copyright notice or references to the Internet Society or other
+ Internet organizations, except as needed for the purpose of
+ developing Internet standards in which case the procedures for
+ copyrights defined in the Internet Standards process must be
+ followed, or as required to translate it into languages other than
+ English.
+
+ The limited permissions granted above are perpetual and will not be
+ revoked by the Internet Society or its successors or assigns.
+
+ This document and the information contained herein is provided on an
+ "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
+ TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
+ BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
+ HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
+ MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
+
diff --git a/doc/rfc822.scm.doc b/doc/rfc822.scm.doc
new file mode 100644
index 0000000..a2e38c7
--- /dev/null
+++ b/doc/rfc822.scm.doc
@@ -0,0 +1,161 @@
+This file documents names defined in rfc822.scm:
+
+
+
+
+NOTES
+
+
+
+A note on line-terminators:
+
+Line-terminating sequences are always a drag, because there's no
+agreement on them -- the Net protocols and DOS use cr/lf; Unix uses
+lf; the Mac uses cr. On one hand, you'd like to use the code for all
+of the above; on the other, you'd also like to use it for strict
+applications that must not recognise bare cr's or lf's as
+terminators.
+
+RFC 822 requires a cr/lf (carriage-return/line-feed) pair to terminate
+lines of text. On the other hand, careful perusal of the text shows up
+some ambiguities (there are maybe three or four of these, and I'm too
+lazy to write them all down). Furthermore, it is an unfortunate fact
+that many Unix apps separate lines of RFC 822 text with simple
+linefeeds (e.g., messages kept in /usr/spool/mail). As a result, this
+code takes a broad-minded view of line-terminators: lines can be
+terminated by either cr/lf or just lf, and either terminating sequence
+is trimmed.
+
+If you need stricter parsing, you can call the lower-level procedures
+%READ-RFC822-FIELD and %READ-RFC822-HEADERS. They take the
+read-line procedure as an extra parameter. This means that you can
+pass in a procedure that recognises only cr/lf's, or only cr's (for a
+Mac app, perhaps), and you can determine whether or not the
+terminators get trimmed. However, your read-line procedure must
+indicate the header-terminating empty line by returning *either* the
+empty string or the two-char string cr/lf (or the EOF object).
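+
+As a minimal sketch of such a stricter reader (READ-CRLF-LINE, ASCII-CR
+and ASCII-LF are hypothetical names, not part of rfc822.scm), the
+following procedure treats only cr/lf as a terminator, trims it, and
+keeps bare cr's and lf's in the line as data. It uses only standard
+Scheme port operations:
+
+```scheme
+(define ascii-cr (integer->char 13))
+(define ascii-lf (integer->char 10))
+
+;; Read one line terminated strictly by cr/lf.
+;; Returns the line without its terminator, or the EOF object
+;; if the port is exhausted before any character was read.
+(define (read-crlf-line port)
+  (let loop ((chars '()))
+    (let ((c (read-char port)))
+      (cond ((eof-object? c)
+             (if (null? chars) c (list->string (reverse chars))))
+            ((and (char=? c ascii-cr)
+                  (let ((next (peek-char port)))
+                    (and (char? next) (char=? next ascii-lf))))
+             (read-char port)            ; consume the lf
+             (list->string (reverse chars)))
+            (else (loop (cons c chars)))))))
+```
+
+It could then be passed in as
+(%read-rfc822-headers read-crlf-line (current-input-port)).
+An empty line comes back as the empty string, which is one of the
+accepted ways to signal the end of the header section.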
+
+
+
+
+DEFINITIONS AND DESCRIPTIONS
+
+
+
+(read-rfc822-field [port])
+(%read-rfc822-field read-line port)
+
+Read one field from the port, and return two values [NAME BODY]:
+
+ - NAME Symbol such as 'subject or 'to. The field name is converted
+ to a symbol using the Scheme implementation's preferred
+ case. If the implementation reads symbols in a case-sensitive
+ fashion (e.g., scsh), lowercase is used. This means you can
+ compare these symbols to quoted constants using EQ?. When
+ printing these field names out, it looks best if you capitalise
+ them with (CAPITALIZE-STRING (SYMBOL->STRING FIELD-NAME)).
+
+ - BODY List of strings which are the field's body, e.g.
+ ("shivers@lcs.mit.edu"). Each list element is one line from
+ the field's body, so if the field spreads out over three lines,
+ then the body is a list of three strings. The terminating
+ cr/lf's are trimmed from each string. A leading space or a
+ leading horizontal tab is also trimmed, but one and only one.
+
+When there are no more fields -- EOF or a blank line has terminated
+the header section -- then the procedure returns [#f #f].
+
+The %READ-RFC822-FIELD variant allows you to specify your own
+read-line procedure. The one used by READ-RFC822-FIELD terminates
+lines with either cr/lf or just lf, and it trims the terminator from
+the line. Your read-line procedure should trim the terminator of the
+line, so an empty line is returned as an empty string.
+
+Both procedures raise an error if the field that was read (the line
+returned by the read-line procedure) is not legal RFC 822 syntax.
+
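+A minimal usage sketch, assuming the two return values behave as
+described above: read fields one at a time until [#f #f] comes back,
+and print each field in folded form. PRINT-FIELDS is an illustrative
+name, not part of rfc822.scm.
+
+```scheme
+;; Print every header field readable from PORT, one body line per
+;; output line, with continuation lines indented by one space.
+(define (print-fields port)
+  (let loop ()
+    (call-with-values
+      (lambda () (read-rfc822-field port))
+      (lambda (name body)
+        (if name                         ; #f means no more fields
+            (begin
+              (display (symbol->string name))
+              (display ":")
+              (for-each (lambda (line)
+                          (display " ")
+                          (display line)
+                          (newline))
+                        body)
+              (loop)))))))
+```
+
+Because one leading space or tab is trimmed from each body line, the
+sketch writes a single space back in front of every line.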
+
+
+read-rfc822-headers [port]
+%read-rfc822-headers read-line port
+
+Read in and parse up a section of text that looks like the header
+portion of an RFC 822 message. Return an alist mapping a field name (a
+symbol such as 'date or 'subject) to a list of field bodies -- one for
+each occurrence of the field in the header. So if there are five
+"Received-by:" fields in the header, the alist maps 'received-by to a
+five element list. Each body is in turn represented by a list of
+strings -- one for each line of the field. So a field spread across
+three lines would produce a three element body.
+
+The %READ-RFC822-HEADERS variant allows you to specify your own
+read-line procedure. See notes (A note on line-terminators) above for
+reasons why.
+
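+As a small sketch of working with the result: FIELD-COUNT below is a
+hypothetical helper, and it assumes that each alist entry is a pair
+whose cdr is the list of bodies. This file does not spell out the
+entry layout, so check it against rfc822.scm before relying on it.
+
+```scheme
+;; Number of times the field NAME occurred in HEADERS, where HEADERS
+;; is an alist as returned by READ-RFC822-HEADERS.
+;; Assumption: entries look like (name . list-of-bodies).
+(define (field-count headers name)
+  (let ((entry (assq name headers)))
+    (if entry
+        (length (cdr entry))
+        0)))
+```
+
+For the example in the text, (field-count hdrs 'received-by) would be
+expected to give 5 under that assumption.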
+
+
+rejoin-header-lines alist [separator]
+
+Takes a field alist such as is returned by READ-RFC822-HEADERS and
+returns an equivalent alist. Each body (string list) in the input
+alist is joined into a single string in the output alist. SEPARATOR is
+the string used to join these elements together; it defaults to a
+single space " ", but can usefully be "\n" or "\r\n".
+
+To rejoin a single body list, use scsh's JOIN-STRINGS procedure.
+
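+To make the joining step concrete, here is a portable stand-in for it.
+JOIN-LINES is a hypothetical name used only for illustration; it is
+neither the library's implementation nor scsh's JOIN-STRINGS.
+
+```scheme
+;; Join a list of strings into one string, putting SEPARATOR between
+;; neighbouring elements. An empty list gives the empty string.
+(define (join-lines lines separator)
+  (if (null? lines)
+      ""
+      (let loop ((rest (cdr lines)) (acc (car lines)))
+        (if (null? rest)
+            acc
+            (loop (cdr rest)
+                  (string-append acc separator (car rest)))))))
+
+;; (join-lines '("ziggy," "newts") " ")  =>  "ziggy, newts"
+```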
+
+
+For the following definitions' examples, let's use this set of
+RFC822 headers:
+ From: shivers
+ To: ziggy,
+ newts
+ To: gjs, tk
+
+
+
+get-header-all headers name
+
+Returns all entries or #f, e.g.
+(get-header-all hdrs 'to) -> ((" ziggy," " newts") (" gjs, tk"))
+
+
+
+get-header-lines headers name
+
+Returns all lines of the first entry or #f, e.g.
+(get-header-lines hdrs 'to) -> (" ziggy," " newts")
+
+
+
+get-header headers name [separator]
+
+Returns the first entry with the lines joined together by separator
+(newline by default (\n)), e.g.
+(get-header hdrs 'to) -> "ziggy,\n newts"
+
+
+
+htab
+
+is the horizontal tab (ascii-code 9)
+
+
+
+string->symbol-pref
+
+is a procedure that takes a string and converts it to a symbol
+using the Scheme implementation's preferred case. The preferred case
+is recognized by doing a symbol->string conversion of 'a.
+
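+The detection trick can be sketched in standard Scheme as follows.
+STRING->SYMBOL-PREF-SKETCH is a hypothetical name; the real procedure
+in rfc822.scm may be written differently.
+
+```scheme
+;; Convert STR to a symbol in the case this implementation prefers.
+;; The preferred case is found by checking how the symbol 'a prints.
+(define (string->symbol-pref-sketch str)
+  (let* ((upper?  (char-upper-case?
+                    (string-ref (symbol->string 'a) 0)))
+         (convert (if upper? char-upcase char-downcase)))
+    (string->symbol
+      (list->string (map convert (string->list str))))))
+```
+
+On an implementation that keeps symbols in lowercase, "Subject" would
+become the symbol subject, which can then be compared with EQ?.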
+
+
+
+DESIRABLE FUNCTIONALITIES
+
+ - Unfolding long lines.
+ - Lexing structured fields.
+ - Unlexing structured fields into canonical form.
+ - Parsing and unparsing dates.
+ - Parsing and unparsing addresses.
diff --git a/doc/rfc822.txt b/doc/rfc822.txt
new file mode 100644
index 0000000..35b09a3
--- /dev/null
+++ b/doc/rfc822.txt
@@ -0,0 +1,2901 @@
+
+
+
+
+
+
+ RFC # 822
+
+ Obsoletes: RFC #733 (NIC #41952)
+
+
+
+
+
+
+
+
+
+
+
+
+ STANDARD FOR THE FORMAT OF
+
+ ARPA INTERNET TEXT MESSAGES
+
+
+
+
+
+
+ August 13, 1982
+
+
+
+
+
+
+ Revised by
+
+ David H. Crocker
+
+
+ Dept. of Electrical Engineering
+ University of Delaware, Newark, DE 19711
+ Network: DCrocker @ UDel-Relay
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ TABLE OF CONTENTS
+
+
+ PREFACE .................................................... ii
+
+ 1. INTRODUCTION ........................................... 1
+
+ 1.1. Scope ............................................ 1
+ 1.2. Communication Framework .......................... 2
+
+ 2. NOTATIONAL CONVENTIONS ................................. 3
+
+ 3. LEXICAL ANALYSIS OF MESSAGES ........................... 5
+
+ 3.1. General Description .............................. 5
+ 3.2. Header Field Definitions ......................... 9
+ 3.3. Lexical Tokens ................................... 10
+ 3.4. Clarifications ................................... 11
+
+ 4. MESSAGE SPECIFICATION .................................. 17
+
+ 4.1. Syntax ........................................... 17
+ 4.2. Forwarding ....................................... 19
+ 4.3. Trace Fields ..................................... 20
+ 4.4. Originator Fields ................................ 21
+ 4.5. Receiver Fields .................................. 23
+ 4.6. Reference Fields ................................. 23
+ 4.7. Other Fields ..................................... 24
+
+ 5. DATE AND TIME SPECIFICATION ............................ 26
+
+ 5.1. Syntax ........................................... 26
+ 5.2. Semantics ........................................ 26
+
+ 6. ADDRESS SPECIFICATION .................................. 27
+
+ 6.1. Syntax ........................................... 27
+ 6.2. Semantics ........................................ 27
+ 6.3. Reserved Address ................................. 33
+
+ 7. BIBLIOGRAPHY ........................................... 34
+
+
+ APPENDIX
+
+ A. EXAMPLES ............................................... 36
+ B. SIMPLE FIELD PARSING ................................... 40
+ C. DIFFERENCES FROM RFC #733 .............................. 41
+ D. ALPHABETICAL LISTING OF SYNTAX RULES ................... 44
+
+
+ August 13, 1982 - i - RFC #822
+
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ PREFACE
+
+
+ By 1977, the Arpanet employed several informal standards for
+ the text messages (mail) sent among its host computers. It was
+ felt necessary to codify these practices and provide for those
+ features that seemed imminent. The result of that effort was
+ Request for Comments (RFC) #733, "Standard for the Format of ARPA
+ Network Text Message", by Crocker, Vittal, Pogran, and Henderson.
+ The specification attempted to avoid major changes in existing
+ software, while permitting several new features.
+
+ This document revises the specifications in RFC #733, in
+ order to serve the needs of the larger and more complex ARPA
+ Internet. Some of RFC #733's features failed to gain adequate
+ acceptance. In order to simplify the standard and the software
+ that follows it, these features have been removed. A different
+ addressing scheme is used, to handle the case of inter-network
+ mail; and the concept of re-transmission has been introduced.
+
+ This specification is intended for use in the ARPA Internet.
+ However, an attempt has been made to free it of any dependence on
+ that environment, so that it can be applied to other network text
+ message systems.
+
+ The specification of RFC #733 took place over the course of
+ one year, using the ARPANET mail environment, itself, to provide
+ an on-going forum for discussing the capabilities to be included.
+ More than twenty individuals, from across the country, partici-
+ pated in the original discussion. The development of this
+ revised specification has, similarly, utilized network mail-based
+ group discussion. Both specification efforts greatly benefited
+ from the comments and ideas of the participants.
+
+ The syntax of the standard, in RFC #733, was originally
+ specified in the Backus-Naur Form (BNF) meta-language. Ken L.
+ Harrenstien, of SRI International, was responsible for re-coding
+ the BNF into an augmented BNF that makes the representation
+ smaller and easier to understand.
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - ii - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 1. INTRODUCTION
+
+ 1.1. SCOPE
+
+ This standard specifies a syntax for text messages that are
+ sent among computer users, within the framework of "electronic
+ mail". The standard supersedes the one specified in ARPANET
+ Request for Comments #733, "Standard for the Format of ARPA Net-
+ work Text Messages".
+
+ In this context, messages are viewed as having an envelope
+ and contents. The envelope contains whatever information is
+ needed to accomplish transmission and delivery. The contents
+ compose the object to be delivered to the recipient. This stan-
+ dard applies only to the format and some of the semantics of mes-
+ sage contents. It contains no specification of the information
+ in the envelope.
+
+ However, some message systems may use information from the
+ contents to create the envelope. It is intended that this stan-
+ dard facilitate the acquisition of such information by programs.
+
+ Some message systems may store messages in formats that
+ differ from the one specified in this standard. This specifica-
+ tion is intended strictly as a definition of what message content
+ format is to be passed BETWEEN hosts.
+
+ Note: This standard is NOT intended to dictate the internal for-
+ mats used by sites, the specific message system features
+ that they are expected to support, or any of the charac-
+ teristics of user interface programs that create or read
+ messages.
+
+ A distinction should be made between what the specification
+ REQUIRES and what it ALLOWS. Messages can be made complex and
+ rich with formally-structured components of information or can be
+ kept small and simple, with a minimum of such information. Also,
+ the standard simplifies the interpretation of differing visual
+ formats in messages; only the visual aspect of a message is
+ affected and not the interpretation of information within it.
+ Implementors may choose to retain such visual distinctions.
+
+ The formal definition is divided into four levels. The bot-
+ tom level describes the meta-notation used in this document. The
+ second level describes basic lexical analyzers that feed tokens
+ to higher-level parsers. Next is an overall specification for
+ messages; it permits distinguishing individual fields. Finally,
+ there is definition of the contents of several structured fields.
+
+
+
+ August 13, 1982 - 1 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 1.2. COMMUNICATION FRAMEWORK
+
+ Messages consist of lines of text. No special provisions
+ are made for encoding drawings, facsimile, speech, or structured
+ text. No significant consideration has been given to questions
+ of data compression or to transmission and storage efficiency,
+ and the standard tends to be free with the number of bits con-
+ sumed. For example, field names are specified as free text,
+ rather than special terse codes.
+
+ A general "memo" framework is used. That is, a message con-
+ sists of some information in a rigid format, followed by the main
+ part of the message, with a format that is not specified in this
+ document. The syntax of several fields of the rigidly-formated
+ ("headers") section is defined in this specification; some of
+ these fields must be included in all messages.
+
+ The syntax that distinguishes between header fields is
+ specified separately from the internal syntax for particular
+ fields. This separation is intended to allow simple parsers to
+ operate on the general structure of messages, without concern for
+ the detailed structure of individual header fields. Appendix B
+ is provided to facilitate construction of these parsers.
+
+ In addition to the fields specified in this document, it is
+ expected that other fields will gain common use. As necessary,
+ the specifications for these "extension-fields" will be published
+ through the same mechanism used to publish this document. Users
+ may also wish to extend the set of fields that they use
+ privately. Such "user-defined fields" are permitted.
+
+ The framework severely constrains document tone and appear-
+ ance and is primarily useful for most intra-organization communi-
+ cations and well-structured inter-organization communication.
+ It also can be used for some types of inter-process communica-
+ tion, such as simple file transfer and remote job entry. A more
+ robust framework might allow for multi-font, multi-color, multi-
+ dimension encoding of information. A less robust one, as is
+ present in most single-machine message systems, would more
+ severely constrain the ability to add fields and the decision to
+ include specific fields. In contrast with paper-based communica-
+ tion, it is interesting to note that the RECEIVER of a message
+ can exercise an extraordinary amount of control over the
+ message's appearance. The amount of actual control available to
+ message receivers is contingent upon the capabilities of their
+ individual message systems.
+
+
+
+
+
+ August 13, 1982 - 2 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 2. NOTATIONAL CONVENTIONS
+
+ This specification uses an augmented Backus-Naur Form (BNF)
+ notation. The differences from standard BNF involve naming rules
+ and indicating repetition and "local" alternatives.
+
+ 2.1. RULE NAMING
+
+ Angle brackets ("<", ">") are not used, in general. The
+ name of a rule is simply the name itself, rather than "<name>".
+ Quotation-marks enclose literal text (which may be upper and/or
+ lower case). Certain basic rules are in uppercase, such as
+ SPACE, TAB, CRLF, DIGIT, ALPHA, etc. Angle brackets are used in
+ rule definitions, and in the rest of this document, whenever
+ their presence will facilitate discerning the use of rule names.
+
+ 2.2. RULE1 / RULE2: ALTERNATIVES
+
+ Elements separated by slash ("/") are alternatives. There-
+ fore "foo / bar" will accept foo or bar.
+
+ 2.3. (RULE1 RULE2): LOCAL ALTERNATIVES
+
+ Elements enclosed in parentheses are treated as a single
+ element. Thus, "(elem (foo / bar) elem)" allows the token
+ sequences "elem foo elem" and "elem bar elem".
+
+ 2.4. *RULE: REPETITION
+
+ The character "*" preceding an element indicates repetition.
+ The full form is:
+
+ <l>*<m>element
+
+ indicating at least <l> and at most <m> occurrences of element.
+ Default values are 0 and infinity so that "*(element)" allows any
+ number, including zero; "1*element" requires at least one; and
+ "1*2element" allows one or two.
+
+ 2.5. [RULE]: OPTIONAL
+
+ Square brackets enclose optional elements; "[foo bar]" is
+ equivalent to "*1(foo bar)".
+
+ 2.6. NRULE: SPECIFIC REPETITION
+
+ "<n>(element)" is equivalent to "<n>*<n>(element)"; that is,
+ exactly <n> occurrences of (element). Thus 2DIGIT is a 2-digit
+ number, and 3ALPHA is a string of three alphabetic characters.
+
+
+ August 13, 1982 - 3 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 2.7. #RULE: LISTS
+
+ A construct "#" is defined, similar to "*", as follows:
+
+ <l>#<m>element
+
+ indicating at least <l> and at most <m> elements, each separated
+ by one or more commas (","). This makes the usual form of lists
+ very easy; a rule such as '(element *("," element))' can be shown
+ as "1#element". Wherever this construct is used, null elements
+ are allowed, but do not contribute to the count of elements
+ present. That is, "(element),,(element)" is permitted, but
+ counts as only two elements. Therefore, where at least one ele-
+ ment is required, at least one non-null element must be present.
+ Default values are 0 and infinity so that "#(element)" allows any
+ number, including zero; "1#element" requires at least one; and
+ "1#2element" allows one or two.
+
+ 2.8. ; COMMENTS
+
+ A semi-colon, set off some distance to the right of rule
+ text, starts a comment that continues to the end of line. This
+ is a simple way of including useful notes in parallel with the
+ specifications.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 4 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 3. LEXICAL ANALYSIS OF MESSAGES
+
+ 3.1. GENERAL DESCRIPTION
+
+ A message consists of header fields and, optionally, a body.
+ The body is simply a sequence of lines containing ASCII charac-
+ ters. It is separated from the headers by a null line (i.e., a
+ line with nothing preceding the CRLF).
+
+ 3.1.1. LONG HEADER FIELDS
+
+ Each header field can be viewed as a single, logical line of
+ ASCII characters, comprising a field-name and a field-body.
+ For convenience, the field-body portion of this conceptual
+ entity can be split into a multiple-line representation; this
+ is called "folding". The general rule is that wherever there
+ may be linear-white-space (NOT simply LWSP-chars), a CRLF
+ immediately followed by AT LEAST one LWSP-char may instead be
+ inserted. Thus, the single line
+
+ To: "Joe & J. Harvey" <ddd @Org>, JJV @ BBN
+
+ can be represented as:
+
+ To: "Joe & J. Harvey" <ddd @ Org>,
+         JJV@BBN
+
+ and
+
+ To: "Joe & J. Harvey"
+                 <ddd@ Org>, JJV
+  @BBN
+
+ and
+
+ To: "Joe &
+  J. Harvey" <ddd @ Org>, JJV @ BBN
+
+ The process of moving from this folded multiple-line
+ representation of a header field to its single line represen-
+ tation is called "unfolding". Unfolding is accomplished by
+ regarding CRLF immediately followed by a LWSP-char as
+ equivalent to the LWSP-char.
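+
+ As a non-normative illustration of unfolding, here is a minimal
+ Scheme sketch. UNFOLD-LINES and its helpers are hypothetical names,
+ and the sketch assumes the header has already been split into
+ physical lines with the CRLFs removed:
+
+ ```scheme
+ ;; SPACE or HTAB
+ (define (lwsp-char? c)
+   (or (char=? c #\space)
+       (char=? c (integer->char 9))))
+
+ ;; A non-empty line starting with a LWSP-char continues the
+ ;; previous field.
+ (define (continuation-line? line)
+   (and (> (string-length line) 0)
+        (lwsp-char? (string-ref line 0))))
+
+ ;; Turn a list of physical lines into a list of logical lines.
+ (define (unfold-lines lines)
+   (let loop ((rest lines) (acc '()))
+     (cond ((null? rest) (reverse acc))
+           ((and (pair? acc) (continuation-line? (car rest)))
+            ;; drop the line break, keep the LWSP-char
+            (loop (cdr rest)
+                  (cons (string-append (car acc) (car rest))
+                        (cdr acc))))
+           (else (loop (cdr rest) (cons (car rest) acc))))))
+
+ ;; (unfold-lines '("To: a," " b" "From: c"))
+ ;;   =>  ("To: a, b" "From: c")
+ ```
+
+ The leading LWSP-char of each continuation line is kept, which is
+ what makes the result equivalent to the original single line.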
+
+ Note: While the standard permits folding wherever linear-
+ white-space is permitted, it is recommended that struc-
+ tured fields, such as those containing addresses, limit
+ folding to higher-level syntactic breaks. For address
+ fields, it is recommended that such folding occur
+
+
+ August 13, 1982 - 5 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ between addresses, after the separating comma.
+
+ 3.1.2. STRUCTURE OF HEADER FIELDS
+
+ Once a field has been unfolded, it may be viewed as being com-
+ posed of a field-name followed by a colon (":"), followed by a
+ field-body, and terminated by a carriage-return/line-feed.
+ The field-name must be composed of printable ASCII characters
+ (i.e., characters that have values between 33. and 126.,
+ decimal, except colon). The field-body may be composed of any
+ ASCII characters, except CR or LF. (While CR and/or LF may be
+ present in the actual text, they are removed by the action of
+ unfolding the field.)
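+
+ As a non-normative illustration, an unfolded field can be split at
+ its first colon, since a field-name may not contain one. SPLIT-FIELD
+ is a hypothetical name, and the sketch assumes the terminating CRLF
+ has already been removed:
+
+ ```scheme
+ ;; Split an unfolded field line into (field-name . field-body).
+ ;; Returns #f if the line contains no colon.
+ (define (split-field line)
+   (let ((len (string-length line)))
+     (let loop ((i 0))
+       (cond ((= i len) #f)
+             ((char=? (string-ref line i) #\:)
+              (cons (substring line 0 i)
+                    (substring line (+ i 1) len)))
+             (else (loop (+ i 1)))))))
+
+ ;; (split-field "Subject: hi")  =>  ("Subject" . " hi")
+ ```
+
+ The sketch does not check that the field-name consists only of
+ printable characters; a stricter parser would reject such lines.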
+
+ Certain field-bodies of headers may be interpreted according
+ to an internal syntax that some systems may wish to parse.
+ These fields are called "structured fields". Examples
+ include fields containing dates and addresses. Other fields,
+ such as "Subject" and "Comments", are regarded simply as
+ strings of text.
+
+ Note: Any field which has a field-body that is defined as
+ other than simply <text> is to be treated as a struc-
+ tured field.
+
+ Field-names, unstructured field bodies and structured
+ field bodies each are scanned by their own, independent
+ "lexical" analyzers.
+
+ 3.1.3. UNSTRUCTURED FIELD BODIES
+
+ For some fields, such as "Subject" and "Comments", no struc-
+ turing is assumed, and they are treated simply as <text>s, as
+ in the message body. Rules of folding apply to these fields,
+ so that such field bodies which occupy several lines must
+ therefore have the second and successive lines indented by at
+ least one LWSP-char.
+
+ 3.1.4. STRUCTURED FIELD BODIES
+
+ To aid in the creation and reading of structured fields, the
+ free insertion of linear-white-space (which permits folding
+ by inclusion of CRLFs) is allowed between lexical tokens.
+ Rather than obscuring the syntax specifications for these
+ structured fields with explicit syntax for this linear-white-
+ space, the existence of another "lexical" analyzer is assumed.
+ This analyzer does not apply for unstructured field bodies
+ that are simply strings of text, as described above. The
+ analyzer provides an interpretation of the unfolded text
+
+
+ August 13, 1982 - 6 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ composing the body of the field as a sequence of lexical sym-
+ bols.
+
+ These symbols are:
+
+ - individual special characters
+ - quoted-strings
+ - domain-literals
+ - comments
+ - atoms
+
+ The first four of these symbols are self-delimiting. Atoms
+ are not; they are delimited by the self-delimiting symbols and
+ by linear-white-space. For the purposes of regenerating
+ sequences of atoms and quoted-strings, exactly one SPACE is
+ assumed to exist, and should be used, between them. (Also, in
+ the "Clarifications" section on "White Space", below, note the
+ rules about treatment of multiple contiguous LWSP-chars.)
+
+ So, for example, the folded body of an address field
+
+ ":sysmail"@ Some-Group. Some-Org,
+ Muhammed.(I am the greatest) Ali @(the)Vegas.WBA
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 7 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ is analyzed into the following lexical symbols and types:
+
+ :sysmail quoted string
+ @ special
+ Some-Group atom
+ . special
+ Some-Org atom
+ , special
+ Muhammed atom
+ . special
+ (I am the greatest) comment
+ Ali atom
+ @ atom
+ (the) comment
+ Vegas atom
+ . special
+ WBA atom
+
+ The canonical representations for the data in these addresses
+ are the following strings:
+
+ ":sysmail"@Some-Group.Some-Org
+
+ and
+
+ Muhammed.Ali@Vegas.WBA
+
+ Note: For purposes of display, and when passing such struc-
+ tured information to other systems, such as mail proto-
+ col services, there must be NO linear-white-space
+ between <word>s that are separated by period (".") or
+ at-sign ("@") and exactly one SPACE between all other
+ <word>s. Also, headers should be in a folded form.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 8 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 3.2. HEADER FIELD DEFINITIONS
+
+ These rules show a field meta-syntax, without regard for the
+ particular type or internal syntax. Their purpose is to permit
+ detection of fields; also, they present to higher-level parsers
+ an image of each field as fitting on one line.
+
+ field = field-name ":" [ field-body ] CRLF
+
+ field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">
+
+ field-body = field-body-contents
+ [CRLF LWSP-char field-body]
+
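+ field-body-contents =
+     <the ASCII characters making up the field-body, as
+      defined in the following sections, and consisting
+      of combinations of atom, quoted-string, and
+      specials tokens, or else consisting of texts>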
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 9 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 3.3. LEXICAL TOKENS
+
+ The following rules are used to define an underlying lexical
+ analyzer, which feeds tokens to higher level parsers. See the
+ ANSI references, in the Bibliography.
+
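+                                             ; (  Octal, Decimal.)
+ CHAR        =  <any ASCII character>        ; (  0-177,  0.-127.)
+ ALPHA       =  <any ASCII alphabetic character>
+                                             ; (101-132, 65.- 90.)
+                                             ; (141-172, 97.-122.)
+ DIGIT       =  <any ASCII decimal digit>    ; ( 60- 71, 48.- 57.)
+ CTL         =  <any ASCII control           ; (  0- 37,  0.- 31.)
+                 character and DEL>          ; (    177,     127.)
+ CR          =  <ASCII CR, carriage return>  ; (     15,      13.)
+ LF          =  <ASCII LF, linefeed>         ; (     12,      10.)
+ SPACE       =  <ASCII SP, space>            ; (     40,      32.)
+ HTAB        =  <ASCII HT, horizontal-tab>   ; (     11,       9.)
+ <">         =  <ASCII quote mark>           ; (     42,      34.)
+ CRLF        =  CR LF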
+
+ LWSP-char = SPACE / HTAB ; semantics = SPACE
+
+ linear-white-space = 1*([CRLF] LWSP-char) ; semantics = SPACE
+ ; CRLF => folding
+
+ specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted-
+ / "," / ";" / ":" / "\" / <"> ; string, to use
+ / "." / "[" / "]" ; within a word.
+
+ delimiters = specials / linear-white-space / comment
+
+ text        =  <any CHAR, including bare    ; => atoms, specials,
+                 CR & bare LF, but NOT       ;  comments and
+                 including CRLF>             ;  quoted-strings are
+                                             ;  NOT recognized.
+
+ atom        =  1*<any CHAR except specials, SPACE and CTLs>
+
+ quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or
+ ; quoted chars.
+
+ qtext       =  <any CHAR excepting <">,     ; => may be folded
+                 "\" & CR, and including
+                 linear-white-space>
+
+ domain-literal = "[" *(dtext / quoted-pair) "]"
+
+
+
+
+ August 13, 1982 - 10 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ dtext       =  <any CHAR excluding "[",     ; => may be folded
+                 "]", "\" & CR, & including
+                 linear-white-space>
+
+ comment = "(" *(ctext / quoted-pair / comment) ")"
+
+ ctext       =  <any CHAR excluding "(",     ; => may be folded
+                 ")", "\" & CR, & including
+                 linear-white-space>
+
+ quoted-pair = "\" CHAR ; may quote any char
+
+ phrase = 1*word ; Sequence of words
+
+ word = atom / quoted-string
+
+
+ 3.4. CLARIFICATIONS
+
+ 3.4.1. QUOTING
+
+ Some characters are reserved for special interpretation, such
+ as delimiting lexical tokens. To permit use of these charac-
+ ters as uninterpreted data, a quoting mechanism is provided.
+ To quote a character, precede it with a backslash ("\").
+
+ This mechanism is not fully general. Characters may be quoted
+ only within a subset of the lexical constructs. In particu-
+ lar, quoting is limited to use within:
+
+ - quoted-string
+ - domain-literal
+ - comment
+
+ Within these constructs, quoting is REQUIRED for CR and "\"
+ and for the character(s) that delimit the token (e.g., "(" and
+ ")" for a comment). However, quoting is PERMITTED for any
+ character.
+
+ Note: In particular, quoting is NOT permitted within atoms.
+ For example when the local-part of an addr-spec must
+ contain a special character, a quoted string must be
+ used. Therefore, a specification such as:
+
+ Full\ Name@Domain
+
+ is not legal and must be specified as:
+
+ "Full Name"@Domain
+
+
+ August 13, 1982 - 11 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 3.4.2. WHITE SPACE
+
+ Note: In structured field bodies, multiple linear space ASCII
+ characters (namely HTABs and SPACEs) are treated as
+ single spaces and may freely surround any symbol. In
+ all header fields, the only place in which at least one
+ LWSP-char is REQUIRED is at the beginning of continua-
+ tion lines in a folded field.
+
+ When passing text to processes that do not interpret text
+ according to this standard (e.g., mail protocol servers), then
+ NO linear-white-space characters should occur between a period
+ (".") or at-sign ("@") and a <word>. Exactly ONE SPACE should
+ be used in place of arbitrary linear-white-space and comment
+ sequences.
+
+ Note: Within systems conforming to this standard, wherever a
+ member of the list of delimiters is allowed, LWSP-chars
+ may also occur before and/or after it.
+
+ Writers of mail-sending (i.e., header-generating) programs
+ should realize that there is no network-wide definition of the
+ effect of ASCII HT (horizontal-tab) characters on the appear-
+ ance of text at another network host; therefore, the use of
+ tabs in message headers, though permitted, is discouraged.
+
+ 3.4.3. COMMENTS
+
+ A comment is a set of ASCII characters, which is enclosed in
+ matching parentheses and which is not within a quoted-string.
+ The comment construct permits message originators to add text
+ which will be useful for human readers, but which will be
+ ignored by the formal semantics. Comments should be retained
+ while the message is subject to interpretation according to
+ this standard. However, comments must NOT be included in
+ other cases, such as during protocol exchanges with mail
+ servers.
+
+ Comments nest, so that if an unquoted left parenthesis occurs
+ in a comment string, there must also be a matching right
+ parenthesis. When a comment acts as the delimiter between a
+ sequence of two lexical symbols, such as two atoms, it is lex-
+ ically equivalent with a single SPACE, for the purposes of
+ regenerating the sequence, such as when passing the sequence
+ onto a mail protocol server. Comments are detected as such
+ only within field-bodies of structured fields.
+
+ If a comment is to be "folded" onto multiple lines, then the
+ syntax for folding must be adhered to. (See the "Lexical
+
+
+ August 13, 1982 - 12 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ Analysis of Messages" section on "Folding Long Header Fields"
+ above, and the section on "Case Independence" below.) Note
+ that the official semantics therefore do not "see" any
+ unquoted CRLFs that are in comments, although particular pars-
+ ing programs may wish to note their presence. For these pro-
+ grams, it would be reasonable to interpret a "CRLF LWSP-char"
+ as being a CRLF that is part of the comment; i.e., the CRLF is
+ kept and the LWSP-char is discarded. Quoted CRLFs (i.e., a
+ backslash followed by a CR followed by a LF) still must be
+ followed by at least one LWSP-char.
+
+ 3.4.4. DELIMITING AND QUOTING CHARACTERS
+
+ The quote character (backslash) and characters that delimit
+ syntactic units are not, generally, to be taken as data that
+ are part of the delimited or quoted unit(s). In particular,
+ the quotation-marks that define a quoted-string, the
+ parentheses that define a comment and the backslash that
+ quotes a following character are NOT part of the quoted-
+ string, comment or quoted character. A quotation-mark that is
+ to be part of a quoted-string, a parenthesis that is to be
+ part of a comment and a backslash that is to be part of either
+ must each be preceded by the quote-character backslash ("\").
+ Note that the syntax allows any character to be quoted within
+ a quoted-string or comment; however only certain characters
+ MUST be quoted to be included as data. These characters are
+ the ones that are not part of the alternate text group (i.e.,
+ ctext or qtext).
+
+ The one exception to this rule is that a single SPACE is
+ assumed to exist between contiguous words in a phrase, and
+ this interpretation is independent of the actual number of
+ LWSP-chars that the creator places between the words. To
+ include more than one SPACE, the creator must make the LWSP-
+ chars be part of a quoted-string.
+
+ Quotation marks that delimit a quoted string and backslashes
+ that quote the following character should NOT accompany the
+ quoted-string when the string is passed to processes that do
+ not interpret data according to this specification (e.g., mail
+ protocol servers).
+
+ 3.4.5. QUOTED-STRINGS
+
+ Where permitted (i.e., in words in structured fields) quoted-
+ strings are treated as a single symbol. That is, a quoted-
+ string is equivalent to an atom, syntactically. If a quoted-
+ string is to be "folded" onto multiple lines, then the syntax
+ for folding must be adhered to. (See the "Lexical Analysis of
+
+
+ August 13, 1982 - 13 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ Messages" section on "Folding Long Header Fields" above, and
+ the section on "Case Independence" below.) Therefore, the
+ official semantics do not "see" any bare CRLFs that are in
+ quoted-strings; however particular parsing programs may wish
+ to note their presence. For such programs, it would be rea-
+ sonable to interpret a "CRLF LWSP-char" as being a CRLF which
+ is part of the quoted-string; i.e., the CRLF is kept and the
+ LWSP-char is discarded. Quoted CRLFs (i.e., a backslash fol-
+ lowed by a CR followed by a LF) are also subject to rules of
+ folding, but the presence of the quoting character (backslash)
+ explicitly indicates that the CRLF is data to the quoted
+ string. Stripping off the first following LWSP-char is also
+ appropriate when parsing quoted CRLFs.
+
+ 3.4.6. BRACKETING CHARACTERS
+
+ There is one type of bracket which must occur in matched pairs
+ and may have pairs nested within each other:
+
+ o Parentheses ("(" and ")") are used to indicate com-
+ ments.
+
+ There are three types of brackets which must occur in matched
+ pairs, and which may NOT be nested:
+
+ o Colon/semi-colon (":" and ";") are used in address
+ specifications to indicate that the included list of
+ addresses are to be treated as a group.
+
+ o Angle brackets ("<" and ">") are generally used to
+ indicate the presence of a one machine-usable refer-
+ ence (e.g., delimiting mailboxes), possibly including
+ source-routing to the machine.
+
+ o Square brackets ("[" and "]") are used to indicate the
+ presence of a domain-literal, which the appropriate
+ name-domain is to use directly, bypassing normal
+ name-resolution mechanisms.
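+
+ The nesting rule for comments can be illustrated by a routine
+ that removes them from a field body. This is a minimal Python
+ sketch under stated assumptions: the name "strip_comments" is
+ hypothetical, and parentheses inside a quoted-string are taken
+ to be data rather than comment delimiters.
+
```python
def strip_comments(value: str) -> str:
    """Remove (possibly nested) parenthesized comments from a field body."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a quoted-string, parentheses are data
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # quoted-pair: keep it outside comments, skip it inside them
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes and ch == "(":
            depth += 1
            i += 1
            continue
        elif not in_quotes and ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
            i += 1
            continue
        if depth == 0:
            out.append(ch)
        i += 1
    if depth:
        raise ValueError("unbalanced '('")
    return "".join(out)
```
+
+ The sketch rejects unmatched parentheses, reflecting the
+ requirement that these brackets occur in matched pairs.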
+
+ 3.4.7. CASE INDEPENDENCE
+
+ Except as noted, alphabetic strings may be represented in any
+ combination of upper and lower case. The only syntactic units
+ which require preservation of case information are:
+
+ - text
+ - qtext
+ - dtext
+ - ctext
+ - quoted-pair
+ - local-part, except "Postmaster"
+
+ When matching any other syntactic unit, case is to be ignored.
+ For example, the field-names "From", "FROM", "from", and even
+ "FroM" are semantically equal and should all be treated ident-
+ ically.
+
+ When generating these units, any mix of upper and lower case
+ alphabetic characters may be used. The case shown in this
+ specification is suggested for message-creating processes.
+
+ Note: The reserved local-part address unit, "Postmaster", is
+ an exception. When the value "Postmaster" is being
+ interpreted, it must be accepted in any mixture of
+ case, including "POSTMASTER", and "postmaster".
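+
+ A brief Python sketch of these matching rules follows. It is
+ illustrative only; the function names are hypothetical.
+
```python
def same_field_name(a: str, b: str) -> bool:
    """Field-names are matched without regard to case."""
    return a.lower() == b.lower()


def same_local_part(a: str, b: str) -> bool:
    """Local-parts preserve case, except the reserved "Postmaster"."""
    if a.lower() == "postmaster" and b.lower() == "postmaster":
        return True
    return a == b
```
+
+ Under this sketch "From" and "FroM" name the same field, while
+ the local-parts "Smith" and "smith" remain distinct.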
+
+ 3.4.8. FOLDING LONG HEADER FIELDS
+
+ Each header field may be represented on exactly one line con-
+ sisting of the name of the field and its body, and terminated
+ by a CRLF; this is what the parser sees. For readability, the
+ field-body portion of long header fields may be "folded" onto
+ multiple lines of the actual field. "Long" is commonly inter-
+ preted to mean greater than 65 or 72 characters. The former
+ length serves as a limit, when the message is to be viewed on
+ most simple terminals which use simple display software; how-
+ ever, the limit is not imposed by this standard.
+
+ Note: Some display software can selectively fold lines,
+ to suit the display terminal. In such cases, sender-
+ provided folding can interfere with the display
+ software.
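+
+ The single-line form that the parser sees can be recovered by
+ "unfolding". The Python sketch below assumes the unfolding rule
+ given earlier in this document, under which a CRLF immediately
+ followed by an LWSP-char is treated as the LWSP-char alone. The
+ name "unfold" is hypothetical.
+
```python
import re

# A CRLF that is immediately followed by SPACE or HTAB marks a fold.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(raw_headers: str) -> list[str]:
    """Return one string per header field, with folds removed."""
    unfolded = _FOLD.sub("", raw_headers)
    return [line for line in unfolded.split("\r\n") if line]
```
+
+ For example, a "To" field folded onto two lines is returned as
+ one string in which the LWSP-char that began the continuation
+ line is retained.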
+
+ 3.4.9. BACKSPACE CHARACTERS
+
+ ASCII BS characters (Backspace, decimal 8) may be included in
+ texts and quoted-strings to effect overstriking. However, any
+ use of backspaces which effects an overstrike to the left of
+ the beginning of the text or quoted-string is prohibited.
+
+ 3.4.10. NETWORK-SPECIFIC TRANSFORMATIONS
+
+ During transmission through heterogeneous networks, it may be
+ necessary to force data to conform to a network's local
+ conventions. For example, it may be required that a CR be
+ followed either by LF, making a CRLF, or by <null> (if the CR
+ is to stand alone). Such transformations are reversed when the
+ message exits that network.
+
+ When crossing network boundaries, the message should be
+ treated as passing through two modules. It will enter the
+ first module containing whatever network-specific transforma-
+ tions that were necessary to permit migration through the
+ "current" network. It then passes through the modules:
+
+ o Transformation Reversal
+
+ The "current" network's idiosyncrasies are removed and
+ the message is returned to the canonical form speci-
+ fied in this standard.
+
+ o Transformation
+
+ The "next" network's local idiosyncrasies are imposed
+ on the message.
+
+                        ------------------
+            From   ==>  | Remove Net-A   |
+            Net-A       | idiosyncrasies |
+                        ------------------
+                               ||
+                               \/
+                          Conformance
+                          with standard
+                               ||
+                               \/
+                        ------------------
+                        | Impose Net-B   |  ==>  To
+                        | idiosyncrasies |       Net-B
+                        ------------------
+
+ 4. MESSAGE SPECIFICATION
+
+ 4.1. SYNTAX
+
+ Note: Due to an artifact of the notational conventions, the
+ syntax indicates that, when present, some fields must be in
+ a particular order. Header fields are NOT required to
+ occur in any particular order, except that the message
+ body must occur AFTER the headers. It is recommended
+ that, if present, headers be sent in the order "Return-
+ Path", "Received", "Date", "From", "Subject", "Sender",
+ "To", "cc", etc.
+
+ This specification permits multiple occurrences of most
+ fields. Except as noted, their interpretation is not
+ specified here, and their use is discouraged.
+
+ The following syntax for the bodies of various fields should
+ be thought of as describing each field body as a single long
+ string (or line). The "Lexical Analysis of Messages" section on
+ "Folding Long Header Fields", above, indicates how such long
+ strings can be represented on more than one line in the actual
+ transmitted message.
+
+ message = fields *( CRLF *text ) ; Everything after
+ ; first null line
+ ; is message body
+
+ fields = dates ; Creation time,
+ source ; author id & one
+ 1*destination ; address required
+ *optional-field ; others optional
+
+ source = [ trace ] ; net traversals
+ originator ; original mail
+ [ resent ] ; forwarded
+
+ trace = return ; path to sender
+ 1*received ; receipt tags
+
+ return = "Return-path" ":" route-addr ; return address
+
+ received = "Received" ":" ; one per relay
+ ["from" domain] ; sending host
+ ["by" domain] ; receiving host
+ ["via" atom] ; physical path
+ *("with" atom) ; link/mail protocol
+ ["id" msg-id] ; receiver msg id
+ ["for" addr-spec] ; initial form
+ ";" date-time ; time received
+
+ originator = authentic ; authenticated addr
+ [ "Reply-To" ":" 1#address]
+
+ authentic = "From" ":" mailbox ; Single author
+ / ( "Sender" ":" mailbox ; Actual submittor
+ "From" ":" 1#mailbox) ; Multiple authors
+ ; or not sender
+
+ resent = resent-authentic
+ [ "Resent-Reply-To" ":" 1#address]
+
+ resent-authentic =
+ "Resent-From" ":" mailbox
+ / ( "Resent-Sender" ":" mailbox
+ "Resent-From" ":" 1#mailbox )
+
+ dates = orig-date ; Original
+ [ resent-date ] ; Forwarded
+
+ orig-date = "Date" ":" date-time
+
+ resent-date = "Resent-Date" ":" date-time
+
+ destination = "To" ":" 1#address ; Primary
+ / "Resent-To" ":" 1#address
+ / "cc" ":" 1#address ; Secondary
+ / "Resent-cc" ":" 1#address
+ / "bcc" ":" #address ; Blind carbon
+ / "Resent-bcc" ":" #address
+
+ optional-field =
+ "Message-ID" ":" msg-id
+ / "Resent-Message-ID" ":" msg-id
+ / "In-Reply-To" ":" *(phrase / msg-id)
+ / "References" ":" *(phrase / msg-id)
+ / "Keywords" ":" #phrase
+ / "Subject" ":" *text
+ / "Comments" ":" *text
+ / "Encrypted" ":" 1#2word
+ / extension-field ; To be defined
+ / user-defined-field ; May be pre-empted
+
+ msg-id = "<" addr-spec ">" ; Unique message id
+
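+ extension-field =
+ <Any field which is defined in a document
+ published as a formal extension to this
+ specification; none will have names beginning
+ with the string "X-">
+
+ user-defined-field =
+ <Any field which has not been defined
+ in this specification or published as an
+ extension to this specification; names for
+ such fields must be unique and may be
+ pre-empted by published extensions>
+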
+ 4.2. FORWARDING
+
+ Some systems permit mail recipients to forward a message,
+ retaining the original headers, by adding some new fields. This
+ standard supports such a service, through the "Resent-" prefix to
+ field names.
+
+ Whenever the string "Resent-" begins a field name, the field
+ has the same semantics as a field whose name does not have the
+ prefix. However, the message is assumed to have been forwarded
+ by an original recipient who attached the "Resent-" field. This
+ new field is treated as being more recent than the equivalent,
+ original field. For example, the "Resent-From" field indicates the
+ person that forwarded the message, whereas the "From" field indi-
+ cates the original author.
+
+ Use of such precedence information depends upon partici-
+ pants' communication needs. For example, this standard does not
+ dictate when a "Resent-From:" address should receive replies, in
+ lieu of sending them to the "From:" address.
+
+ Note: In general, the "Resent-" fields should be treated as con-
+ taining a set of information that is independent of the
+ set of original fields. Information for one set should
+ not automatically be taken from the other. The interpre-
+ tation of multiple "Resent-" fields, of the same type, is
+ undefined.
+
+ In the remainder of this specification, occurrences of legal
+ "Resent-" fields are treated identically with the occurrence of
+ fields whose names do not contain this prefix.
+
+ 4.3. TRACE FIELDS
+
+ Trace information is used to provide an audit trail of mes-
+ sage handling. In addition, it indicates a route back to the
+ sender of the message.
+
+ The list of known "via" and "with" values is registered
+ with the Network Information Center, SRI International, Menlo
+ Park, California.
+
+ 4.3.1. RETURN-PATH
+
+ This field is added by the final transport system that
+ delivers the message to its recipient. The field is intended
+ to contain definitive information about the address and route
+ back to the message's originator.
+
+ Note: The "Reply-To" field is added by the originator and
+ serves to direct replies, whereas the "Return-Path"
+ field is used to identify a path back to the origina-
+ tor.
+
+ While the syntax indicates that a route specification is
+ optional, every attempt should be made to provide that infor-
+ mation in this field.
+
+ 4.3.2. RECEIVED
+
+ A copy of this field is added by each transport service that
+ relays the message. The information in the field can be quite
+ useful for tracing transport problems.
+
+ The names of the sending and receiving hosts and time-of-
+ receipt may be specified. The "via" parameter may be used, to
+ indicate what physical mechanism the message was sent over,
+ such as Arpanet or Phonenet, and the "with" parameter may be
+ used to indicate the mail-, or connection-, level protocol
+ that was used, such as the SMTP mail protocol, or X.25 tran-
+ sport protocol.
+
+ Note: Several "with" parameters may be included, to fully
+ specify the set of protocols that were used.
+
+ Some transport services queue mail; the internal message iden-
+ tifier that is assigned to the message may be noted, using the
+ "id" parameter. When the sending host uses a destination
+ address specification that the receiving host reinterprets, by
+ expansion or transformation, the receiving host may wish to
+ record the original specification, using the "for" parameter.
+ For example, when a copy of mail is sent to the member of a
+ distribution list, this parameter may be used to record the
+ original address that was used to specify the list.
+
+ 4.4. ORIGINATOR FIELDS
+
+ The standard allows only a subset of the combinations possi-
+ ble with the From, Sender, Reply-To, Resent-From, Resent-Sender,
+ and Resent-Reply-To fields. The limitation is intentional.
+
+ 4.4.1. FROM / RESENT-FROM
+
+ This field contains the identity of the person(s) who wished
+ this message to be sent. The message-creation process should
+ default this field to be a single, authenticated machine
+ address, indicating the AGENT (person, system or process)
+ entering the message. If this is not done, the "Sender" field
+ MUST be present. If the "From" field IS defaulted this way,
+ the "Sender" field is optional and is redundant with the
+ "From" field. In all cases, addresses in the "From" field
+ must be machine-usable (addr-specs) and may not contain named
+ lists (groups).
+
+ 4.4.2. SENDER / RESENT-SENDER
+
+ This field contains the authenticated identity of the AGENT
+ (person, system or process) that sends the message. It is
+ intended for use when the sender is not the author of the mes-
+ sage, or to indicate who among a group of authors actually
+ sent the message. If the contents of the "Sender" field would
+ be completely redundant with the "From" field, then the
+ "Sender" field need not be present and its use is discouraged
+ (though still legal). In particular, the "Sender" field MUST
+ be present if it is NOT the same as the "From" Field.
+
+ The Sender mailbox specification includes a word sequence
+ which must correspond to a specific agent (i.e., a human user
+ or a computer program) rather than a standard address. This
+ indicates the expectation that the field will identify the
+ single AGENT (person, system, or process) responsible for
+ sending the mail and not simply include the name of a mailbox
+ from which the mail was sent. For example in the case of a
+ shared login name, the name, by itself, would not be adequate.
+ The local-part address unit, which refers to this agent, is
+ expected to be a computer system term, and not (for example) a
+ generalized person reference which can be used outside the
+ network text message context.
+
+ Since the critical function served by the "Sender" field is
+ identification of the agent responsible for sending mail and
+ since computer programs cannot be held accountable for their
+ behavior, it is strongly recommended that when a computer pro-
+ gram generates a message, the HUMAN who is responsible for
+ that program be referenced as part of the "Sender" field mail-
+ box specification.
+
+ 4.4.3. REPLY-TO / RESENT-REPLY-TO
+
+ This field provides a general mechanism for indicating any
+ mailbox(es) to which responses are to be sent. Three typical
+ uses for this feature can be distinguished. In the first
+ case, the author(s) may not have regular machine-based mail-
+ boxes and therefore wish(es) to indicate an alternate machine
+ address. In the second case, an author may wish additional
+ persons to be made aware of, or responsible for, replies. A
+ somewhat different use may be of some help to "text message
+ teleconferencing" groups equipped with automatic distribution
+ services: include the address of that service in the "Reply-
+ To" field of all messages submitted to the teleconference;
+ then participants can "reply" to conference submissions to
+ guarantee the correct distribution of any submission of their
+ own.
+
+ Note: The "Return-Path" field is added by the mail transport
+ service, at the time of final delivery. It is intended
+ to identify a path back to the originator of the message.
+ The "Reply-To" field is added by the message
+ originator and is intended to direct replies.
+
+ 4.4.4. AUTOMATIC USE OF FROM / SENDER / REPLY-TO
+
+ For systems which automatically generate address lists for
+ replies to messages, the following recommendations are made:
+
+ o The "Sender" field mailbox should be sent notices of
+ any problems in transport or delivery of the original
+ messages. If there is no "Sender" field, then the
+ "From" field mailbox should be used.
+
+ o The "Sender" field mailbox should NEVER be used
+ automatically, in a recipient's reply message.
+
+ o If the "Reply-To" field exists, then the reply should
+ go to the addresses indicated in that field and not to
+ the address(es) indicated in the "From" field.
+
+ o If there is a "From" field, but no "Reply-To" field,
+ the reply should be sent to the address(es) indicated
+ in the "From" field.
+
+ Sometimes, a recipient may actually wish to communicate with
+ the person that initiated the message transfer. In such
+ cases, it is reasonable to use the "Sender" address.
+
+ This recommendation is intended only for automated use of
+ originator-fields and is not intended to suggest that replies
+ may not also be sent to other recipients of messages. It is
+ up to the respective mail-handling programs to decide what
+ additional facilities will be provided.
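+
+ The recommendations above can be sketched in Python as follows.
+ This is an illustration under stated assumptions, not a
+ definitive implementation: the header fields are assumed to be
+ already parsed into a dictionary keyed by lower-case field name,
+ with a list of address strings as each value, and both function
+ names are hypothetical.
+
```python
def reply_targets(headers: dict) -> list:
    """Addresses to which an automatically generated reply is sent."""
    if headers.get("reply-to"):
        # "Reply-To" takes the place of "From" when it exists.
        return list(headers["reply-to"])
    return list(headers.get("from", []))


def problem_notice_target(headers: dict):
    """Mailbox to notify of transport or delivery problems."""
    sender = headers.get("sender")
    if sender:
        return sender[0]
    author = headers.get("from")
    # Fall back to the "From" mailbox when there is no "Sender".
    return author[0] if author else None
```
+
+ Note that the "Sender" mailbox is consulted only for problem
+ notices and is never returned by reply_targets.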
+
+ Examples are provided in Appendix A.
+
+ 4.5. RECEIVER FIELDS
+
+ 4.5.1. TO / RESENT-TO
+
+ This field contains the identity of the primary recipients of
+ the message.
+
+ 4.5.2. CC / RESENT-CC
+
+ This field contains the identity of the secondary (informa-
+ tional) recipients of the message.
+
+ 4.5.3. BCC / RESENT-BCC
+
+ This field contains the identity of additional recipients of
+ the message. The contents of this field are not included in
+ copies of the message sent to the primary and secondary reci-
+ pients. Some systems may choose to include the text of the
+ "Bcc" field only in the author(s)'s copy, while others may
+ also include it in the text sent to all those indicated in the
+ "Bcc" list.
+
+ 4.6. REFERENCE FIELDS
+
+ 4.6.1. MESSAGE-ID / RESENT-MESSAGE-ID
+
+ This field contains a unique identifier (the local-part
+ address unit) which refers to THIS version of THIS message.
+ The uniqueness of the message identifier is guaranteed by the
+ host which generates it. This identifier is intended to be
+ machine readable and not necessarily meaningful to humans. A
+ message identifier pertains to exactly one instantiation of a
+ particular message; subsequent revisions to the message should
+ each receive new message identifiers.
+
+ 4.6.2. IN-REPLY-TO
+
+ The contents of this field identify previous correspon-
+ dence which this message answers. Note that if message iden-
+ tifiers are used in this field, they must use the msg-id
+ specification format.
+
+ 4.6.3. REFERENCES
+
+ The contents of this field identify other correspondence
+ which this message references. Note that if message identif-
+ iers are used, they must use the msg-id specification format.
+
+ 4.6.4. KEYWORDS
+
+ This field contains keywords or phrases, separated by
+ commas.
+
+ 4.7. OTHER FIELDS
+
+ 4.7.1. SUBJECT
+
+ This is intended to provide a summary, or indicate the
+ nature, of the message.
+
+ 4.7.2. COMMENTS
+
+ Permits adding text comments onto the message without
+ disturbing the contents of the message's body.
+
+ 4.7.3. ENCRYPTED
+
+ Sometimes, data encryption is used to increase the
+ privacy of message contents. If the body of a message has
+ been encrypted, to keep its contents private, the "Encrypted"
+ field can be used to note the fact and to indicate the nature
+ of the encryption. The first parameter indicates the
+ software used to encrypt the body, and the second, optional
+ <word> is intended to aid the recipient in selecting the
+ proper decryption key. This code word may be viewed as an
+ index to a table of keys held by the recipient.
+
+ Note: Unfortunately, headers must contain envelope, as well
+ as contents, information. Consequently, it is neces-
+ sary that they remain unencrypted, so that mail tran-
+ sport services may access them. Since names,
+ addresses, and "Subject" field contents may contain
+ sensitive information, this requirement limits total
+ message privacy.
+
+ Names of encryption software are registered with the Net-
+ work Information Center, SRI International, Menlo Park, Cali-
+ fornia.
+
+ 4.7.4. EXTENSION-FIELD
+
+ A limited number of common fields have been defined in
+ this document. As network mail requirements dictate, addi-
+ tional fields may be standardized. To provide user-defined
+ fields with a measure of safety, in name selection, such
+ extension-fields will never have names that begin with the
+ string "X-".
+
+ Names of Extension-fields are registered with the Network
+ Information Center, SRI International, Menlo Park, California.
+
+ 4.7.5. USER-DEFINED-FIELD
+
+ Individual users of network mail are free to define and
+ use additional header fields. Such fields must have names
+ which are not already used in the current specification or in
+ any definitions of extension-fields, and the overall syntax of
+ these user-defined-fields must conform to this specification's
+ rules for delimiting and folding fields. Due to the
+ extension-field publishing process, the name of a user-
+ defined-field may be pre-empted.
+
+ Note: The prefatory string "X-" will never be used in the
+ names of Extension-fields. This provides user-defined
+ fields with a protected set of names.
+
+ 5. DATE AND TIME SPECIFICATION
+
+ 5.1. SYNTAX
+
+ date-time = [ day "," ] date time ; dd mm yy
+ ; hh:mm:ss zzz
+
+ day = "Mon" / "Tue" / "Wed" / "Thu"
+ / "Fri" / "Sat" / "Sun"
+
+ date = 1*2DIGIT month 2DIGIT ; day month year
+ ; e.g. 20 Jun 82
+
+ month = "Jan" / "Feb" / "Mar" / "Apr"
+ / "May" / "Jun" / "Jul" / "Aug"
+ / "Sep" / "Oct" / "Nov" / "Dec"
+
+ time = hour zone ; ANSI and Military
+
+ hour = 2DIGIT ":" 2DIGIT [":" 2DIGIT]
+ ; 00:00:00 - 23:59:59
+
+ zone = "UT" / "GMT" ; Universal Time
+ ; North American : UT
+ / "EST" / "EDT" ; Eastern: - 5/ - 4
+ / "CST" / "CDT" ; Central: - 6/ - 5
+ / "MST" / "MDT" ; Mountain: - 7/ - 6
+ / "PST" / "PDT" ; Pacific: - 8/ - 7
+ / 1ALPHA ; Military: Z = UT;
+ ; A:-1; (J not used)
+ ; M:-12; N:+1; Y:+12
+ / ( ("+" / "-") 4DIGIT ) ; Local differential
+ ; hours+min. (HHMM)
+
+ 5.2. SEMANTICS
+
+ If included, day-of-week must be the day implied by the date
+ specification.
+
+ Time zone may be indicated in several ways. "UT" is Univer-
+ sal Time (formerly called "Greenwich Mean Time"); "GMT" is per-
+ mitted as a reference to Universal Time. The military standard
+ uses a single character for each zone. "Z" is Universal Time.
+ "A" indicates one hour earlier, and "M" indicates 12 hours ear-
+ lier; "N" is one hour later, and "Y" is 12 hours later. The
+ letter "J" is not used. The other remaining two forms are taken
+ from ANSI standard X3.51-1975. One allows explicit indication of
+ the amount of offset from UT; the other uses common 3-character
+ strings for indicating time zones in North America.
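+
+ A date-time reader following the syntax and semantics above
+ might be sketched in Python as shown below. Several points are
+ assumptions of the sketch and not statements of this standard:
+ the two-digit year is taken to lie in the 1900s, the single-
+ letter military zones are left out, and the name
+ "parse_date_time" is hypothetical.
+
```python
import re
from datetime import datetime, timedelta, timezone

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
# Hours relative to UT for the named zones in the syntax.
ZONES = {"UT": 0, "GMT": 0, "EST": -5, "EDT": -4, "CST": -6,
         "CDT": -5, "MST": -7, "MDT": -6, "PST": -8, "PDT": -7}

_DATE_TIME = re.compile(
    r"^(?:(?P<day>[A-Za-z]{3}),\s*)?"
    r"(?P<dd>\d{1,2})\s+(?P<mon>[A-Za-z]{3})\s+(?P<yy>\d{2})\s+"
    r"(?P<hh>\d{2}):(?P<mi>\d{2})(?::(?P<ss>\d{2}))?\s+"
    r"(?P<zone>[A-Za-z]{2,3}|[+-]\d{4})$"
)


def parse_date_time(text: str) -> datetime:
    m = _DATE_TIME.match(text.strip())
    if m is None:
        raise ValueError("not a date-time")
    zone = m["zone"].upper()
    if zone[0] in "+-":
        # Local differential, written as HHMM.
        sign = -1 if zone[0] == "-" else 1
        offset = sign * timedelta(hours=int(zone[1:3]),
                                  minutes=int(zone[3:5]))
    elif zone in ZONES:
        offset = timedelta(hours=ZONES[zone])
    else:
        raise ValueError("unsupported zone")
    month = MONTHS.index(m["mon"].title()) + 1  # ValueError if unknown
    # Assumption of this sketch: two-digit years are 19yy.
    result = datetime(1900 + int(m["yy"]), month, int(m["dd"]),
                      int(m["hh"]), int(m["mi"]), int(m["ss"] or 0),
                      tzinfo=timezone(offset))
    # The day-of-week, if given, must agree with the date.
    if m["day"] and DAYS[result.weekday()] != m["day"].title():
        raise ValueError("day-of-week does not match date")
    return result
```
+
+ The sketch rejects a date-time whose day-of-week is not the day
+ implied by the date, as required above.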
+
+ 6. ADDRESS SPECIFICATION
+
+ 6.1. SYNTAX
+
+ address = mailbox ; one addressee
+ / group ; named list
+
+ group = phrase ":" [#mailbox] ";"
+
+ mailbox = addr-spec ; simple address
+ / phrase route-addr ; name & addr-spec
+
+ route-addr = "<" [route] addr-spec ">"
+
+ route = 1#("@" domain) ":" ; path-relative
+
+ addr-spec = local-part "@" domain ; global address
+
+ local-part = word *("." word) ; uninterpreted
+ ; case-preserved
+
+ domain = sub-domain *("." sub-domain)
+
+ sub-domain = domain-ref / domain-literal
+
+ domain-ref = atom ; symbolic reference
+
+ 6.2. SEMANTICS
+
+ A mailbox receives mail. It is a conceptual entity which
+ does not necessarily pertain to file storage. For example, some
+ sites may choose to print mail on their line printer and deliver
+ the output to the addressee's desk.
+
+ A mailbox specification comprises a person, system or pro-
+ cess name reference, a domain-dependent string, and a name-domain
+ reference. The name reference is optional and is usually used to
+ indicate the human name of a recipient. The name-domain refer-
+ ence specifies a sequence of sub-domains. The domain-dependent
+ string is uninterpreted, except by the final sub-domain; the rest
+ of the mail service merely transmits it as a literal string.
+
+ 6.2.1. DOMAINS
+
+ A name-domain is a set of registered (mail) names. A name-
+ domain specification resolves to a subordinate name-domain
+ specification or to a terminal domain-dependent string.
+ Hence, domain specification is extensible, permitting any
+ number of registration levels.
+
+ Name-domains model a global, logical, hierarchical addressing
+ scheme. The model is logical, in that an address specifica-
+ tion is related to name registration and is not necessarily
+ tied to transmission path. The model's hierarchy is a
+ directed graph, called an in-tree, such that there is a single
+ path from the root of the tree to any node in the hierarchy.
+ If more than one path actually exists, they are considered to
+ be different addresses.
+
+ The root node is common to all addresses; consequently, it is
+ not referenced. Its children constitute "top-level" name-
+ domains. Usually, a service has access to its own full domain
+ specification and to the names of all top-level name-domains.
+
+ The "top" of the domain addressing hierarchy -- a child of the
+ root -- is indicated by the right-most field, in a domain
+ specification. Its child is specified to the left, its child
+ to the left, and so on.
+
+ Some groups provide formal registration services; these con-
+ stitute name-domains that are independent logically of
+ specific machines. In addition, networks and machines impli-
+ citly compose name-domains, since their membership usually is
+ registered in name tables.
+
+ In the case of formal registration, an organization implements
+ a (distributed) data base which provides an address-to-route
+ mapping service for addresses of the form:
+
+ person@registry.organization
+
+ Note that "organization" is a logical entity, separate from
+ any particular communication network.
+
+ A mechanism for accessing "organization" is universally avail-
+ able. That mechanism, in turn, seeks an instantiation of the
+ registry; its location is not indicated in the address specif-
+ ication. It is assumed that the system which operates under
+ the name "organization" knows how to find a subordinate regis-
+ try. The registry will then use the "person" string to deter-
+ mine where to send the mail specification.
+
+ The latter, network-oriented case permits simple, direct,
+ attachment-related address specification, such as:
+
+ user@host.network
+
+ Once the network is accessed, it is expected that a message
+ will go directly to the host and that the host will resolve
+ the user name, placing the message in the user's mailbox.
+
+ 6.2.2. ABBREVIATED DOMAIN SPECIFICATION
+
+ Since any number of levels is possible within the domain
+ hierarchy, specification of a fully qualified address can
+ become inconvenient. This standard permits abbreviated domain
+ specification, in a special case:
+
+ For the address of the sender, call the left-most
+ sub-domain Level N. In a header address, if all of
+ the sub-domains above (i.e., to the right of) Level N
+ are the same as those of the sender, then they do not
+ have to appear in the specification. Otherwise, the
+ address must be fully qualified.
+
+ This feature is subject to approval by local sub-
+ domains. Individual sub-domains may require their
+ member systems, which originate mail, to provide full
+ domain specification only. When permitted, abbrevia-
+ tions may be present only while the message stays
+ within the sub-domain of the sender.
+
+ Use of this mechanism requires the sender's sub-domain
+ to reserve the names of all top-level domains, so that
+ full specifications can be distinguished from abbrevi-
+ ated specifications.
+
+ For example, if a sender's address is:
+
+ sender@registry-A.registry-1.organization-X
+
+ and one recipient's address is:
+
+ recipient@registry-B.registry-1.organization-X
+
+ and another's is:
+
+ recipient@registry-C.registry-2.organization-X
+
+ then ".registry-1.organization-X" need not be specified in the
+ message, but "registry-C.registry-2" DOES have to be
+ specified. That is, the first two addresses may be abbrevi-
+ ated, but the third address must be fully specified.
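+
+ The expansion that a relaying service would perform on an
+ abbreviated domain can be sketched in Python as follows. This is
+ an illustration only. The function name and the "top_levels"
+ parameter (a set of lower-case top-level domain names reserved
+ by the sender's sub-domain) are hypothetical, and domain-
+ literals are not handled.
+
```python
def expand_domain(domain: str, sender_domain: str, top_levels: set) -> str:
    """Return the fully qualified form of a possibly abbreviated domain."""
    labels = domain.split(".")
    if labels[-1].lower() in top_levels:
        # The right-most field is a top-level name: already complete.
        return domain
    # Sub-domains above the sender's left-most sub-domain (Level N).
    above_level_n = sender_domain.split(".")[1:]
    return ".".join(labels + above_level_n)
```
+
+ With the sender of the example above, the abbreviated domain
+ "registry-B" expands to "registry-B.registry-1.organization-X",
+ while the fully specified third address is returned unchanged.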
+
+ When a message crosses a domain boundary, all addresses must
+ be specified in the full format, ending with the top-level
+ name-domain in the right-most field. It is the responsibility
+ of mail forwarding services to ensure that addresses conform
+ with this requirement. In the case of abbreviated addresses,
+ the relaying service must make the necessary expansions. It
+ should be noted that it often is difficult for such a service
+ to locate all occurrences of address abbreviations. For exam-
+ ple, it will not be possible to find such abbreviations within
+ the body of the message. The "Return-Path" field can aid
+ recipients in recovering from these errors.
+
+ Note: When passing any portion of an addr-spec onto a process
+ which does not interpret data according to this standard
+ (e.g., mail protocol servers), there must be NO
+ LWSP-chars preceding or following the at-sign or any
+ delimiting period ("."), such as shown in the above
+ examples, and only ONE SPACE between contiguous
+ <word>s.
+
+ 6.2.3. DOMAIN TERMS
+
+ A domain-ref must be THE official name of a registry, network,
+ or host. It is a symbolic reference, within a name sub-
+ domain. At times, it is necessary to bypass standard mechan-
+ isms for resolving such references, using more primitive
+ information, such as a network host address rather than its
+ associated host name.
+
+ To permit such references, this standard provides the domain-
+ literal construct. Its contents must conform with the needs
+ of the sub-domain in which it is interpreted.
+
+ Domain-literals which refer to domains within the ARPA Inter-
+ net specify 32-bit Internet addresses, in four 8-bit fields
+ noted in decimal, as described in Request for Comments #820,
+ "Assigned Numbers." For example:
+
+ [10.0.3.19]
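+
+ For illustration, a domain-literal of this form can be converted
+ to its 32-bit value with the Python sketch below. The function
+ name is hypothetical, and only the four-field decimal form
+ described above is accepted.
+
```python
def parse_domain_literal(literal: str) -> int:
    """Convert "[a.b.c.d]" to a 32-bit Internet address."""
    if not (literal.startswith("[") and literal.endswith("]")):
        raise ValueError("not a domain-literal")
    fields = literal[1:-1].split(".")
    if len(fields) != 4:
        raise ValueError("expected four 8-bit fields")
    address = 0
    for field in fields:
        if not (field.isascii() and field.isdigit()) or int(field) > 255:
            raise ValueError("field is not an 8-bit decimal number")
        # Shift in each 8-bit field, most significant first.
        address = (address << 8) | int(field)
    return address
```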
+
+ Note: THE USE OF DOMAIN-LITERALS IS STRONGLY DISCOURAGED. It
+ is permitted only as a means of bypassing temporary
+ system limitations, such as name tables which are not
+ complete.
+
+ The names of "top-level" domains, and the names of domains
+ in the ARPA Internet, are registered with the Network
+ Information Center, SRI International, Menlo Park, California.
+
+ 6.2.4. DOMAIN-DEPENDENT LOCAL STRING
+
+ The local-part of an addr-spec in a mailbox specification
+ (i.e., the host's name for the mailbox) is understood to be
+ whatever the receiving mail protocol server allows. For exam-
+ ple, some systems do not understand mailbox references of the
+ form "P. D. Q. Bach", but others do.
+
+ This specification treats periods (".") as lexical separators.
+ Hence, their presence in local-parts which are not quoted-
+ strings, is detected. However, such occurrences carry NO
+ semantics. That is, if a local-part has periods within it, an
+ address parser will divide the local-part into several tokens,
+ but the sequence of tokens will be treated as one uninter-
+ preted unit. The sequence will be re-assembled, when the
+ address is passed outside of the system such as to a mail pro-
+ tocol service.
+
+ For example, the address:
+
+ First.Last@Registry.Org
+
+ is legal and does not require the local-part to be surrounded
+ with quotation-marks. (However, "First Last" DOES require
+ quoting.) The local-part of the address, when passed outside
+ of the mail system, within the Registry.Org domain, is
+ "First.Last", again without quotation marks.
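+
+ Whether a local-part needs quotation marks can be decided from
+ the lexical definitions of atoms and specials given earlier in
+ this document. The Python sketch below is illustrative; the
+ function names are hypothetical.
+
```python
# The special characters of this specification's lexical rules.
SPECIALS = set('()<>@,;:\\".[]')


def is_atom(token: str) -> bool:
    """True for one or more ASCII characters other than specials,
    SPACE and control characters."""
    return bool(token) and all(
        32 < ord(ch) < 127 and ch not in SPECIALS for ch in token
    )


def needs_quoting(local_part: str) -> bool:
    """True if the local-part is not a sequence of atoms joined by periods."""
    return not all(is_atom(word) for word in local_part.split("."))
```
+
+ The tokens produced by splitting on periods carry no semantics
+ of their own; joining them again with periods restores the
+ uninterpreted unit that is passed outside of the mail system.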
+
+ 6.2.5. BALANCING LOCAL-PART AND DOMAIN
+
+ In some cases, the boundary between local-part and domain can
+ be flexible. The local-part may be a simple string, which is
+ used for the final determination of the recipient's mailbox.
+ All other levels of reference are, therefore, part of the
+ domain.
+
+ For some systems, in the case of abbreviated reference to the
+ local and subordinate sub-domains, it may be possible to
+ specify only one reference within the domain part and place
+ the other, subordinate name-domain references within the
+ local-part. This would appear as:
+
+ mailbox.sub1.sub2@this-domain
+
+ Such a specification would be acceptable to address parsers
+ which conform to RFC #733, but do not support this newer
+ Internet standard. While contrary to the intent of this stan-
+ dard, the form is legal.
+
+ Also, some sub-domains have a specification syntax which does
+ not conform to this standard. For example:
+
+ sub-net.mailbox@sub-domain.domain
+
+
+ August 13, 1982 - 31 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ uses a different parsing sequence for local-part than for
+ domain.
+
+ Note: As a rule, the domain specification should contain
+ fields which are encoded according to the syntax of
+ this standard and which contain generally-standardized
+ information. The local-part specification should con-
+ tain only that portion of the address which deviates
+ from the form or intention of the domain field.
+
+ 6.2.6. MULTIPLE MAILBOXES
+
+ An individual may have several mailboxes and wish to receive
+ mail at whatever mailbox is convenient for the sender to
+ access. This standard does not provide a means of specifying
+ "any member of" a list of mailboxes.
+
+ A set of individuals may wish to receive mail as a single unit
+ (i.e., a distribution list). The <group> construct permits
+ specification of such a list. Recipient mailboxes are speci-
+ fied within the bracketed part (":" - ";"). A copy of the
+ transmitted message is to be sent to each mailbox listed.
+ This standard does not permit recursive specification of
+ groups within groups.
+
+ While a list must be named, it is not required that the
+ contents of the list be included. In this case, the <address>
+ serves only as an indication of group distribution and would
+ appear in the form:
+
+ name:;
+
+ Some mail services may provide a group-list distribution
+ facility, accepting a single mailbox reference, expanding it
+ to the full distribution list, and relaying the mail to the
+ list's members. This standard provides no additional syntax
+ for indicating such a service. Using the <group> address
+ alternative, while listing one mailbox in it, can mean either
+ that the mailbox reference will be expanded to a list or that
+ there is a group with one member.
+
+ 6.2.7. EXPLICIT PATH SPECIFICATION
+
+ At times, a message originator may wish to indicate the
+ transmission path that a message should follow. This is
+ called source routing. The normal addressing scheme, used in
+ an addr-spec, is carefully separated from such information;
+ the <route> portion of a route-addr is provided for such
+ occasions. It specifies the sequence of hosts and/or transmission
+
+
+ August 13, 1982 - 32 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ services that are to be traversed. Both domain-refs and
+ domain-literals may be used.
+
+ Note: The use of source routing is discouraged. Unless the
+ sender has special need of path restriction, the choice
+ of transmission route should be left to the mail tran-
+ sport service.
+
+ 6.3. RESERVED ADDRESS
+
+ It often is necessary to send mail to a site, without know-
+ ing any of its valid addresses. For example, there may be mail
+ system dysfunctions, or a user may wish to find out a person's
+ correct address, at that site.
+
+ This standard specifies a single, reserved mailbox address
+ (local-part) which is to be valid at each site. Mail sent to
+ that address is to be routed to a person responsible for the
+ site's mail system or to a person with responsibility for general
+ site operation. The name of the reserved local-part address is:
+
+ Postmaster
+
+ so that "Postmaster@domain" is required to be valid.
+
+ Note: This reserved local-part must be matched without sensi-
+ tivity to alphabetic case, so that "POSTMASTER", "postmas-
+ ter", and even "poStmASteR" is to be accepted.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 33 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ 7. BIBLIOGRAPHY
+
+
+ ANSI. "USA Standard Code for Information Interchange," X3.4.
+ American National Standards Institute: New York (1968). Also
+ in: Feinler, E. and J. Postel, eds., "ARPANET Protocol Hand-
+ book", NIC 7104.
+
+ ANSI. "Representations of Universal Time, Local Time Differen-
+ tials, and United States Time Zone References for Information
+ Interchange," X3.51-1975. American National Standards Insti-
+ tute: New York (1975).
+
+ Bemer, R.W., "Time and the Computer." In: Interface Age (Feb.
+ 1979).
+
+ Bennett, C.J. "JNT Mail Protocol". Joint Network Team, Ruther-
+ ford and Appleton Laboratory: Didcot, England.
+
+ Bhushan, A.K., Pogran, K.T., Tomlinson, R.S., and White, J.E.
+ "Standardizing Network Mail Headers," ARPANET Request for
+ Comments No. 561, Network Information Center No. 18516; SRI
+ International: Menlo Park (September 1973).
+
+ Birrell, A.D., Levin, R., Needham, R.M., and Schroeder, M.D.
+ "Grapevine: An Exercise in Distributed Computing," Communica-
+ tions of the ACM 25, 4 (April 1982), 260-274.
+
+ Crocker, D.H., Vittal, J.J., Pogran, K.T., Henderson, D.A.
+ "Standard for the Format of ARPA Network Text Message,"
+ ARPANET Request for Comments No. 733, Network Information
+ Center No. 41952. SRI International: Menlo Park (November
+ 1977).
+
+ Feinler, E.J. and Postel, J.B. ARPANET Protocol Handbook, Net-
+ work Information Center No. 7104 (NTIS AD A003890). SRI
+ International: Menlo Park (April 1976).
+
+ Harary, F. "Graph Theory". Addison-Wesley: Reading, Mass.
+ (1969).
+
+ Levin, R. and Schroeder, M. "Transport of Electronic Messages
+ through a Network," TeleInformatics 79, pp. 29-33. North
+ Holland (1979). Also as Xerox Palo Alto Research Center
+ Technical Report CSL-79-4.
+
+ Myer, T.H. and Henderson, D.A. "Message Transmission Protocol,"
+ ARPANET Request for Comments, No. 680, Network Information
+ Center No. 32116. SRI International: Menlo Park (1975).
+
+
+ August 13, 1982 - 34 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ NBS. "Specification of Message Format for Computer Based Message
+ Systems, Recommended Federal Information Processing Standard."
+ National Bureau of Standards: Gaithersburg, Maryland
+ (October 1981).
+
+ NIC. Internet Protocol Transition Workbook. Network Information
+ Center, SRI-International, Menlo Park, California (March
+ 1982).
+
+ Oppen, D.C. and Dalal, Y.K. "The Clearinghouse: A Decentralized
+ Agent for Locating Named Objects in a Distributed Environ-
+ ment," OPD-T8103. Xerox Office Products Division: Palo Alto,
+ CA. (October 1981).
+
+ Postel, J.B. "Assigned Numbers," ARPANET Request for Comments,
+ No. 820. SRI International: Menlo Park (August 1982).
+
+ Postel, J.B. "Simple Mail Transfer Protocol," ARPANET Request
+ for Comments, No. 821. SRI International: Menlo Park (August
+ 1982).
+
+ Shoch, J.F. "Internetwork naming, addressing and routing," in
+ Proc. 17th IEEE Computer Society International Conference, pp.
+ 72-79, Sept. 1978, IEEE Cat. No. 78 CH 1388-8C.
+
+ Su, Z. and Postel, J. "The Domain Naming Convention for Internet
+ User Applications," ARPANET Request for Comments, No. 819.
+ SRI International: Menlo Park (August 1982).
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 35 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ APPENDIX
+
+
+ A. EXAMPLES
+
+ A.1. ADDRESSES
+
+ A.1.1. Alfred Neuman <Neuman@BBN-TENEXA>
+
+ A.1.2. Neuman@BBN-TENEXA
+
+ These two "Alfred Neuman" examples have identical seman-
+ tics, as far as the operation of the local host's mail sending
+ (distribution) program (also sometimes called its "mailer")
+ and the remote host's mail protocol server are concerned. In
+ the first example, the "Alfred Neuman" is ignored by the
+ mailer, as "Neuman@BBN-TENEXA" completely specifies the reci-
+ pient. The second example contains no superfluous informa-
+ tion, and, again, "Neuman@BBN-TENEXA" is the intended reci-
+ pient.
+
+ Note: When the message crosses name-domain boundaries, then
+ these specifications must be changed, so as to indicate
+ the remainder of the hierarchy, starting with the top
+ level.
+
+ A.1.3. "George, Ted" <Shared@Group.Arpanet>
+
+ This form might be used to indicate that a single mailbox
+ is shared by several users. The quoted string is ignored by
+ the originating host's mailer, because "Shared@Group.Arpanet"
+ completely specifies the destination mailbox.
+
+ A.1.4. Wilt . (the Stilt) Chamberlain@NBA.US
+
+ The "(the Stilt)" is a comment, which is NOT included in
+ the destination mailbox address handed to the originating
+ system's mailer. The local-part of the address is the string
+ "Wilt.Chamberlain", with NO space between the first and second
+ words.
+
+ A.1.5. Address Lists
+
+ Gourmets: Pompous Person ,
+ Childs@WGBH.Boston, Galloping Gourmet@
+ ANT.Down-Under (Australian National Television),
+ Cheapie@Discount-Liquors;,
+ Cruisers: Port@Portugal, Jones@SEA;,
+ Another@Somewhere.SomeOrg
+
+
+ August 13, 1982 - 36 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ This group list example points out the use of comments and the
+ mixing of addresses and groups.
+
+ A.2. ORIGINATOR ITEMS
+
+ A.2.1. Author-sent
+
+ George Jones logs into his host as "Jones". He sends
+ mail himself.
+
+ From: Jones@Group.Org
+
+ or
+
+ From: George Jones <Jones@Group.Org>
+
+ A.2.2. Secretary-sent
+
+ George Jones logs in as Jones on his host. His secre-
+ tary, who logs in as Secy sends mail for him. Replies to the
+ mail should go to George.
+
+ From: George Jones
+ Sender: Secy@Other-Group
+
+ A.2.3. Secretary-sent, for user of shared directory
+
+ George Jones' secretary sends mail for George. Replies
+ should go to George.
+
+ From: George Jones
+ Sender: Secy@Other-Group
+
+ Note that there need not be a space between "Jones" and the
+ "<", but adding a space enhances readability (as is the case
+ in other examples).
+
+ A.2.4. Committee activity, with one author
+
+ George is a member of a committee. He wishes to have any
+ replies to his message go to all committee members.
+
+ From: George Jones
+ Sender: Jones@Host
+ Reply-To: The Committee: Jones@Host.Net,
+ Smith@Other.Org,
+ Doe@Somewhere-Else;
+
+ Note that if George had not included himself in the
+
+
+ August 13, 1982 - 37 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ enumeration of The Committee, he would not have gotten an
+ implicit reply; the presence of the "Reply-to" field SUPER-
+ SEDES the sending of a reply to the person named in the "From"
+ field.
+
+ A.2.5. Secretary acting as full agent of author
+
+ George Jones asks his secretary (Secy@Host) to send a
+ message for him in his capacity as Group. He wants his secre-
+ tary to handle all replies.
+
+ From: George Jones
+ Sender: Secy@Host
+ Reply-To: Secy@Host
+
+ A.2.6. Agent for user without online mailbox
+
+ A friend of George's, Sarah, is visiting. George's
+ secretary sends some mail to a friend of Sarah in computer-
+ land. Replies should go to George, whose mailbox is Jones at
+ Registry.
+
+ From: Sarah Friendly
+ Sender: Secy-Name
+ Reply-To: Jones@Registry.
+
+ A.2.7. Agent for member of a committee
+
+ George's secretary sends out a message which was authored
+ jointly by all the members of a committee. Note that the name
+ of the committee cannot be specified, since <group> names are
+ not permitted in the From field.
+
+ From: Jones@Host,
+ Smith@Other-Host,
+ Doe@Somewhere-Else
+ Sender: Secy@SHost
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 38 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ A.3. COMPLETE HEADERS
+
+ A.3.1. Minimum required
+
+ Date: 26 Aug 76 1429 EDT Date: 26 Aug 76 1429 EDT
+ From: Jones@Registry.Org or From: Jones@Registry.Org
+ Bcc: To: Smith@Registry.Org
+
+ Note that the "Bcc" field may be empty, while the "To" field
+ is required to have at least one address.
+
+ A.3.2. Using some of the additional fields
+
+ Date: 26 Aug 76 1430 EDT
+ From: George Jones
+ Sender: Secy@SHOST
+ To: "Al Neuman"@Mad-Host,
+ Sam.Irving@Other-Host
+ Message-ID:
+
+ A.3.3. About as complex as you're going to get
+
+ Date : 27 Aug 76 0932 PDT
+ From : Ken Davis
+ Subject : Re: The Syntax in the RFC
+ Sender : KSecy@Other-Host
+ Reply-To : Sam.Irving@Reg.Organization
+ To : George Jones ,
+ Al.Neuman@MAD.Publisher
+ cc : Important folk:
+ Tom Softwood ,
+ "Sam Irving"@Other-Host;,
+ Standard Distribution:
+ /main/davis/people/standard@Other-Host,
+ "standard.dist.3"@Tops-20-Host>;
+ Comment : Sam is away on business. He asked me to handle
+ his mail for him. He'll be able to provide a
+ more accurate explanation when he returns
+ next week.
+ In-Reply-To: , George's message
+ X-Special-action: This is a sample of user-defined field-
+ names. There could also be a field-name
+ "Special-action", but its name might later be
+ preempted
+ Message-ID: <4231.629.XYzi-What@Other-Host>
+
+
+
+
+
+
+ August 13, 1982 - 39 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ B. SIMPLE FIELD PARSING
+
+ Some mail-reading software systems may wish to perform only
+ minimal processing, ignoring the internal syntax of structured
+ field-bodies and treating them the same as unstructured-field-
+ bodies. Such software will need only to distinguish:
+
+ o Header fields from the message body,
+
+ o Beginnings of fields from lines which continue fields,
+
+ o Field-names from field-contents.
+
+ The abbreviated set of syntactic rules which follows will
+ suffice for this purpose. It describes a limited view of mes-
+ sages and is a subset of the syntactic rules provided in the main
+ part of this specification. One small exception is that the con-
+ tents of field-bodies consist only of text:
+
+ B.1. SYNTAX
+
+
+ message = *field *(CRLF *text)
+
+ field = field-name ":" [field-body] CRLF
+
+ field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">
+
+ field-body = *text [CRLF LWSP-char field-body]
+
+
+ B.2. SEMANTICS
+
+ Headers occur before the message body and are terminated by
+ a null line (i.e., two contiguous CRLFs).
+
+ A line which continues a header field begins with a SPACE or
+ HTAB character, while a line beginning a field starts with a
+ printable character which is not a colon.
+
+ A field-name consists of one or more printable characters
+ (excluding colon, space, and control-characters). A field-name
+ MUST be contained on one line. Upper and lower case are not dis-
+ tinguished when comparing field-names.
+
+
+
+
+
+
+
+ August 13, 1982 - 40 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ C. DIFFERENCES FROM RFC #733
+
+ The following summarizes the differences between this stan-
+ dard and the one specified in Arpanet Request for Comments #733,
+ "Standard for the Format of ARPA Network Text Messages". The
+ differences are listed in the order of their occurrence in the
+ current specification.
+
+ C.1. FIELD DEFINITIONS
+
+ C.1.1. FIELD NAMES
+
+ These now must be a sequence of printable characters. They
+ may not contain any LWSP-chars.
+
+ C.2. LEXICAL TOKENS
+
+ C.2.1. SPECIALS
+
+ The characters period ("."), left-square bracket ("["), and
+ right-square bracket ("]") have been added. For presentation
+ purposes, and when passing a specification to a system that
+ does not conform to this standard, periods are to be contigu-
+ ous with their surrounding lexical tokens. No linear-white-
+ space is permitted between them. The presence of one LWSP-
+ char between other tokens is still directed.
+
+ C.2.2. ATOM
+
+ Atoms may not contain SPACE.
+
+ C.2.3. SPECIAL TEXT
+
+ ctext and qtext have had backslash ("\") added to the list of
+ prohibited characters.
+
+ C.2.4. DOMAINS
+
+ The lexical tokens <domain-literal> and <dtext> have been
+ added.
+
+ C.3. MESSAGE SPECIFICATION
+
+ C.3.1. TRACE
+
+ The "Return-path:" and "Received:" fields have been specified.
+
+
+
+
+
+ August 13, 1982 - 41 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ C.3.2. FROM
+
+ The "From" field must contain machine-usable addresses (addr-
+ spec). Multiple addresses may be specified, but named-lists
+ (groups) may not.
+
+ C.3.3. RESENT
+
+ The meta-construct of prefacing field names with the string
+ "Resent-" has been added, to indicate that a message has been
+ forwarded by an intermediate recipient.
+
+ C.3.4. DESTINATION
+
+ A message must contain at least one destination address field.
+ "To" and "CC" are required to contain at least one address.
+
+ C.3.5. IN-REPLY-TO
+
+ The field-body is no longer a comma-separated list, although a
+ sequence is still permitted.
+
+ C.3.6. REFERENCE
+
+ The field-body is no longer a comma-separated list, although a
+ sequence is still permitted.
+
+ C.3.7. ENCRYPTED
+
+ A field has been specified that permits senders to indicate
+ that the body of a message has been encrypted.
+
+ C.3.8. EXTENSION-FIELD
+
+ Extension fields are prohibited from beginning with the char-
+ acters "X-".
+
+ C.4. DATE AND TIME SPECIFICATION
+
+ C.4.1. SIMPLIFICATION
+
+ Fewer optional forms are permitted and the list of three-
+ letter time zones has been shortened.
+
+ C.5. ADDRESS SPECIFICATION
+
+
+
+
+
+
+ August 13, 1982 - 42 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ C.5.1. ADDRESS
+
+ The use of quoted-string, and the ":"-atom-":" construct, have
+ been removed. An address now is either a single mailbox
+ reference or is a named list of addresses. The latter indi-
+ cates a group distribution.
+
+ C.5.2. GROUPS
+
+ Group lists are now required to have a name. Group lists
+ may not be nested.
+
+ C.5.3. MAILBOX
+
+ A mailbox specification may indicate a person's name, as
+ before. Such a named list no longer may specify multiple
+ mailboxes and may not be nested.
+
+ C.5.4. ROUTE ADDRESSING
+
+ Addresses now are taken to be absolute, global specifications,
+ independent of transmission paths. The <route> construct has
+ been provided, to permit explicit specification of transmis-
+ sion path. RFC #733's use of multiple at-signs ("@") was
+ intended as a general syntax for indicating routing and/or
+ hierarchical addressing. The current standard separates these
+ specifications and only one at-sign is permitted.
+
+ C.5.5. AT-SIGN
+
+ The string " at " no longer is used as an address delimiter.
+ Only at-sign ("@") serves the function.
+
+ C.5.6. DOMAINS
+
+ Hierarchical, logical name-domains have been added.
+
+ C.6. RESERVED ADDRESS
+
+ The local-part "Postmaster" has been reserved, so that users can
+ be guaranteed at least one valid address at a site.
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 43 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ D. ALPHABETICAL LISTING OF SYNTAX RULES
+
+ address = mailbox ; one addressee
+ / group ; named list
+ addr-spec = local-part "@" domain ; global address
+ ALPHA = <any ASCII alphabetic character>
+ ; (101-132, 65.- 90.)
+ ; (141-172, 97.-122.)
+ atom = 1*<any CHAR except specials, SPACE and CTLs>
+ authentic = "From" ":" mailbox ; Single author
+ / ( "Sender" ":" mailbox ; Actual submittor
+ "From" ":" 1#mailbox) ; Multiple authors
+ ; or not sender
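+ CHAR = <any ASCII character> ; ( 0-177, 0.-127.)
+ comment = "(" *(ctext / quoted-pair / comment) ")"
+ CR = <ASCII CR, carriage return> ; ( 15, 13.)
+ CRLF = CR LF
+ ctext = <any CHAR excluding "(", ; => may be folded
+ ")", "\" & CR, & including
+ linear-white-space>
+ CTL = <any ASCII control ; ( 0- 37, 0.- 31.)
+ character and DEL> ; ( 177, 127.)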
+ date = 1*2DIGIT month 2DIGIT ; day month year
+ ; e.g. 20 Jun 82
+ dates = orig-date ; Original
+ [ resent-date ] ; Forwarded
+ date-time = [ day "," ] date time ; dd mm yy
+ ; hh:mm:ss zzz
+ day = "Mon" / "Tue" / "Wed" / "Thu"
+ / "Fri" / "Sat" / "Sun"
+ delimiters = specials / linear-white-space / comment
+ destination = "To" ":" 1#address ; Primary
+ / "Resent-To" ":" 1#address
+ / "cc" ":" 1#address ; Secondary
+ / "Resent-cc" ":" 1#address
+ / "bcc" ":" #address ; Blind carbon
+ / "Resent-bcc" ":" #address
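+ DIGIT = <any ASCII decimal digit> ; ( 60- 71, 48.- 57.)
+ domain = sub-domain *("." sub-domain)
+ domain-literal = "[" *(dtext / quoted-pair) "]"
+ domain-ref = atom ; symbolic reference
+ dtext = <any CHAR excluding "[", ; => may be folded
+ "]", "\" & CR, & including
+ linear-white-space>
+ extension-field =
+ <Any field which is defined in a document
+ published as a formal extension to this
+ specification; none will have names beginning
+ with the string "X-">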
+
+
+
+ August 13, 1982 - 44 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ field = field-name ":" [ field-body ] CRLF
+ fields = dates ; Creation time,
+ source ; author id & one
+ 1*destination ; address required
+ *optional-field ; others optional
+ field-body = field-body-contents
+ [CRLF LWSP-char field-body]
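+ field-body-contents =
+ <the ASCII characters making up the field-body, as
+ defined in the following sections, and consisting
+ of combinations of atom, quoted-string, and
+ specials tokens, or else consisting of texts>
+ field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">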
+ group = phrase ":" [#mailbox] ";"
+ hour = 2DIGIT ":" 2DIGIT [":" 2DIGIT]
+ ; 00:00:00 - 23:59:59
+ HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.)
+ LF = <ASCII LF, linefeed> ; ( 12, 10.)
+ linear-white-space = 1*([CRLF] LWSP-char) ; semantics = SPACE
+ ; CRLF => folding
+ local-part = word *("." word) ; uninterpreted
+ ; case-preserved
+ LWSP-char = SPACE / HTAB ; semantics = SPACE
+ mailbox = addr-spec ; simple address
+ / phrase route-addr ; name & addr-spec
+ message = fields *( CRLF *text ) ; Everything after
+ ; first null line
+ ; is message body
+ month = "Jan" / "Feb" / "Mar" / "Apr"
+ / "May" / "Jun" / "Jul" / "Aug"
+ / "Sep" / "Oct" / "Nov" / "Dec"
+ msg-id = "<" addr-spec ">" ; Unique message id
+ optional-field =
+ / "Message-ID" ":" msg-id
+ / "Resent-Message-ID" ":" msg-id
+ / "In-Reply-To" ":" *(phrase / msg-id)
+ / "References" ":" *(phrase / msg-id)
+ / "Keywords" ":" #phrase
+ / "Subject" ":" *text
+ / "Comments" ":" *text
+ / "Encrypted" ":" 1#2word
+ / extension-field ; To be defined
+ / user-defined-field ; May be pre-empted
+ orig-date = "Date" ":" date-time
+ originator = authentic ; authenticated addr
+ [ "Reply-To" ":" 1#address] )
+ phrase = 1*word ; Sequence of words
+
+
+
+
+ August 13, 1982 - 45 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ qtext = <any CHAR excepting <">, ; => may be folded
+ "\" & CR, and including
+ linear-white-space>
+ quoted-pair = "\" CHAR ; may quote any char
+ quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or
+ ; quoted chars.
+ received = "Received" ":" ; one per relay
+ ["from" domain] ; sending host
+ ["by" domain] ; receiving host
+ ["via" atom] ; physical path
+ *("with" atom) ; link/mail protocol
+ ["id" msg-id] ; receiver msg id
+ ["for" addr-spec] ; initial form
+ ";" date-time ; time received
+
+ resent = resent-authentic
+ [ "Resent-Reply-To" ":" 1#address] )
+ resent-authentic =
+ = "Resent-From" ":" mailbox
+ / ( "Resent-Sender" ":" mailbox
+ "Resent-From" ":" 1#mailbox )
+ resent-date = "Resent-Date" ":" date-time
+ return = "Return-path" ":" route-addr ; return address
+ route = 1#("@" domain) ":" ; path-relative
+ route-addr = "<" [route] addr-spec ">"
+ source = [ trace ] ; net traversals
+ originator ; original mail
+ [ resent ] ; forwarded
+ SPACE = <ASCII SP, space> ; ( 40, 32.)
+ specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted-
+ / "," / ";" / ":" / "\" / <"> ; string, to use
+ / "." / "[" / "]" ; within a word.
+ sub-domain = domain-ref / domain-literal
+ text = <any CHAR, including bare ; => atoms, specials,
+ CR & bare LF, but NOT ; comments and
+ including CRLF> ; quoted-strings are
+ ; NOT recognized.
+ time = hour zone ; ANSI and Military
+ trace = return ; path to sender
+ 1*received ; receipt tags
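+ user-defined-field =
+ <Any field which has not been defined
+ in this specification or published as an
+ extension to this specification; names for
+ such fields must be unique and may be
+ pre-empted by published extensions>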
+ word = atom / quoted-string
+
+
+
+
+ August 13, 1982 - 46 - RFC #822
+
+
+
+ Standard for ARPA Internet Text Messages
+
+
+ zone = "UT" / "GMT" ; Universal Time
+ ; North American : UT
+ / "EST" / "EDT" ; Eastern: - 5/ - 4
+ / "CST" / "CDT" ; Central: - 6/ - 5
+ / "MST" / "MDT" ; Mountain: - 7/ - 6
+ / "PST" / "PDT" ; Pacific: - 8/ - 7
+ / 1ALPHA ; Military: Z = UT;
+ <"> = <ASCII quote mark> ; ( 42, 34.)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ August 13, 1982 - 47 - RFC #822
+
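A minimal sketch, outside the patch itself, of the simple field parsing rules in Appendix B (B.1 and B.2) of the RFC text above. It is written in Python purely for illustration and is not code from this package; the name `parse_simple_message` is hypothetical. It assumes CRLF line endings, splits headers from the body at the first null line, joins continuation lines (those beginning with SPACE or HTAB) onto the preceding field, and lowercases field-names because case is not distinguished when comparing them.

```python
def parse_simple_message(raw):
    """Split an RFC 822 message into ([(field-name, field-body)], body).

    Only the Appendix B view is implemented: field-bodies are treated as
    unstructured text, and no address or date syntax is examined.
    """
    lines = raw.split("\r\n")
    fields = []
    i = 0
    # Headers run up to (but not including) the first null line.
    while i < len(lines) and lines[i] != "":
        line = lines[i]
        if line[0] in " \t":
            # A continuation line extends the previous field-body.
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            fields[-1] = (name, value + line)  # unfold: drop the CRLF only
        else:
            name, colon, value = line.partition(":")
            # A field-name is one or more printable characters,
            # excluding colon, SPACE and control characters.
            bad = any(c <= " " or c == "\x7f" for c in name)
            if not colon or not name or bad:
                raise ValueError("malformed field line: %r" % line)
            fields.append((name.lower(), value))
        i += 1
    # Everything after the first null line is the message body.
    body = "\r\n".join(lines[i + 1:])
    return [(n, v.strip()) for n, v in fields], body
```

Surrounding whitespace is stripped from each field-body for convenience; a caller that needs the exact folded text would keep the unstripped value instead.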
diff --git a/doc/uri.scm.doc b/doc/uri.scm.doc
new file mode 100644
index 0000000..ff44a8d
--- /dev/null
+++ b/doc/uri.scm.doc
@@ -0,0 +1,150 @@
+This file documents names specified in uri.scm.
+
+
+
+
+NOTES
+
+URIs have the following syntax:
+
+[scheme] : path [? search ] [# fragmentid]
+
+Parts in [] may be omitted. The last part is usually referred to as
+frag-id in this document.
+
+
+
+DEFINITIONS AND DESCRIPTIONS
+
+
+char-set
+uri-reserved
+
+A set of reserved characters (semicolon, slash, hash, question mark,
+colon and space).
+
+procedure
+parse-uri uri-string --> (scheme, path, search, frag-id)
+
+Multiple-value return: scheme, path, search and frag-id, in this
+order. scheme, search and frag-id are each either #f or a string. path
+is a nonempty list of strings; an empty path is represented as a list
+containing the empty string. parse-uri tries to be tolerant of the
+various broken URIs people build out there on the Net, so it does not
+conform strictly to RFC 1630.
+
+
+procedure
+unescape-uri string [start [end]] --> string
+
+Unescapes a string. This procedure should only be used *after* the URI
+has been parsed, since unescaping may introduce characters that blow
+up the parse (that's why escape sequences are used in URIs ;).
+Escape sequences have the form %hh, where each h is a hexadecimal
+digit. E.g. %20 is space (ASCII character 32).
+
+
+procedure
+hex-digit? character --> boolean
+
+Returns #t if character is a hexadecimal digit (i.e., one of 0-9, a-f,
+A-F), #f otherwise.
+
+
+procedure
+hexchar->int character --> number
+
+Translates the given hexadecimal character to an integer, e.g.
+(hexchar->int #\a) => 10.
+
+
+procedure
+int->hexchar integer --> character
+
+Translates the given integer from the range 0-15 into a hexadecimal
+character (uses uppercase letters), e.g. (int->hexchar 14) => #\E.
+
+
+char-set
+uri-escaped-chars
+
+A set of characters that are escaped in URIs. These are the following
+characters: dollar ($), minus (-), underscore (_), at (@), dot (.),
+ampersand (&), exclamation mark (!), asterisk (*), backslash (\),
+double quote ("), single quote ('), open parenthesis ((), close
+parenthesis ()), comma (,), plus (+) and all other characters that are
+neither letters nor digits (such as space and control characters).
+
+
+procedure
+escape-uri string [escaped-chars] --> string
+
+Escapes the characters of string that are members of escaped-chars.
+escaped-chars defaults to uri-escaped-chars. Be careful when applying
+this procedure to chunks of text containing syntactically meaningful
+reserved characters (e.g., paths with URI slashes or colons) -- they
+will be escaped and lose their special meaning. E.g. it would be a
+mistake to apply escape-uri to "//lcs.mit.edu:8001/foo/bar.html"
+because the slashes and colons would be escaped. Note that escape-uri
+does not check for this, since such a check would defeat its purpose.
+
+
+procedure
+resolve-uri cscheme cp scheme p --> (scheme, path)
+
+Sorry, I can't figure out what resolve-uri is intended to do. Perhaps
+I will find out later.
+
+The code seems to have a bug: in the body of receive, there is a
+loop. According to the comment, j should count sequential slashes. But
+j counts nothing in the body. Either zero is added, as in (lp (cdr
+cp-tail) (cons (car cp-tail) rhead) (+ j 0)), or j is set to 1, as in
+(lp (cdr cp-tail) (cons (car cp-tail) rhead) 1). Nevertheless, j is
+expected to reach the value numsl, which can be larger than one. So
+what? I am confused.
+
+
+procedure
+rev-append list-a list-b --> list
+
+Performs (append (reverse list-a) list-b). The comment says it
+should be defined in a list package, but I wonder how often it
+will be used.
+
+
+procedure
+split-uri-path uri start end --> list
+
+Splits uri at slashes. Only the substring from start (inclusive) to
+end (exclusive) is considered. start and end - 1 must lie within the
+range of the uri string; otherwise an index-out-of-range exception
+is raised. Example: (split-uri-path "foo/bar/colon" 4 11) ==>
+'("bar" "col")
+
+
+procedure
+simplify-uri-path path --> list
+
+Removes "." and ".." entries from path. The result is a (possibly
+empty) list representing a path that does not contain any "." or
+"..". The list can only be empty if the path did not start with "/"
+(for the rare occasion that someone wants to simplify a relative
+path). The result is #f if the path tries to back up past root, for
+example "/.." or "/foo/../.." or just "..". A "//" occurring anywhere
+in the path refers to root; it resets the path to root and does not
+count as backing up past root.
+Examples:
+(simplify-uri-path (split-uri-path "/foo/bar/baz/.." 0 15))
+==> '("" "foo" "bar")
+
+(simplify-uri-path (split-uri-path "foo/bar/baz/../../.." 0 20))
+==> '()
+
+(simplify-uri-path (split-uri-path "/foo/../.." 0 10))
+==> #f ; tried to back up past root
+
+(simplify-uri-path (split-uri-path "foo/bar//" 0 9))
+==> '("") ; "//" refers to root
+
+(simplify-uri-path (split-uri-path "foo/bar/" 0 8))
+==> '("") ; last "/" also refers to root
+
+(simplify-uri-path (split-uri-path "/foo/bar//baz/../.." 0 19))
+==> #f ; tried to back up past root
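A minimal sketch, outside the patch itself, of the behaviour documented above for split-uri-path and simplify-uri-path. It is written in Python purely for illustration and is reconstructed from the documented examples only, so it should not be read as the algorithm in uri.scm. The names `split_uri_path` and `simplify_uri_path` are hypothetical, Scheme's #f is modelled as `None`, and the explicit range check stands in for the index-out-of-range exception.

```python
def split_uri_path(uri, start, end):
    """Split uri[start:end] at slashes, like the documented split-uri-path."""
    if not (0 <= start <= end <= len(uri)):
        raise IndexError("start/end outside the uri string")
    return uri[start:end].split("/")


def simplify_uri_path(path):
    """Remove "." and ".." segments; return None when backing up past root.

    A leading "" stands for root (the path began with "/"). A later ""
    (from "//" or a trailing "/") resets the path to root.
    """
    stack = []
    for segment in path:
        if segment == "..":
            # Nothing left to pop, or only root left: backing up too far.
            if not stack or stack[-1] == "":
                return None
            stack.pop()
        elif segment == ".":
            continue  # "." names the current level; drop it
        elif segment == "":
            stack = [""]  # root marker; discards everything before it
        else:
            stack.append(segment)
    return stack
```

Keeping root as an explicit "" element is what lets a relative path simplify to the empty list while "/.." is rejected: the two cases differ only in whether the root marker is on the stack.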
diff --git a/doc/url.scm.doc b/doc/url.scm.doc
new file mode 100644
index 0000000..4819ca4
--- /dev/null
+++ b/doc/url.scm.doc
@@ -0,0 +1,69 @@
+This file documents names defined in url.scm
+
+
+
+
+NOTES
+
+
+
+
+DEFINITIONS AND DESCRIPTIONS
+
+
+userhost record
+
+A record containing the fields user, password, host and port. Created
+by parsing a string like //<user>:<password>@<host>:<port>/. The
+record describes path-prefixes of the form
+//<user>:<password>@<host>:<port>/. These are frequently used as the
+initial prefix of URLs describing Internet resources.
+
+
+parse-userhost path default
+
+Parses a URI path (a list representing a path, not a string!) into a
+userhost record. Default values are taken from the userhost record
+DEFAULT, except for the host. Returns a userhost record on success,
+and #f if it cannot parse the path. It is an error if the specified
+path does not begin with '//..', as noted under userhost.
+
+
+userhost-escaped-chars list
+
+The union of uri-escaped-chars and the characters '@' and ':'. Used
+for the unparser.
+
+
+userhost->string userhost procedure
+
+Unparses a userhost record to a string.
+
+
+http-url record
+
+Record containing the fields userhost (a userhost record), path (a
+path list), search and frag-id. The PATH slot of this record is the
+URL's path split at slashes, e.g., "foo/bar//baz/" => ("foo" "bar" ""
+"baz" ""). These elements are in raw, unescaped format. To convert
+back to a string, use (uri-path-list->path (map escape-uri pathlist)).
+
+
+parse-http-url path search frag-id procedure
+
+Returns an http-url record. path, search and frag-id are the results
+of a parse-uri call on the initial URI; see uri.scm for further
+details. search and frag-id are stored as they are. This parser
+decodes the path elements. It is an error if the path specifies a
+user or a password, as these are not allowed in HTTP URLs.
+
+
+default-http-userhost record
+
+A userhost record that specifies the port as 80 and everything else as
+#f.
+
+
+http-url->string http-url
+
+Unparses the given http-url to a string.
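
A minimal sketch, outside the patch itself, of the shape of userhost parsing described above. It is written in Python purely for illustration and is not the code in url.scm. The names `UserHost`, `parse_userhost` and `DEFAULT_HTTP_USERHOST` are hypothetical, #f is modelled as `None`, the port is kept as a string (an assumption), and no unescaping is attempted. The path argument is a list as produced by splitting at slashes, so "//host/foo" arrives as `["", "", "host", "foo"]`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UserHost:
    user: Optional[str] = None
    password: Optional[str] = None
    host: Optional[str] = None
    port: Optional[str] = None


# Mirrors the documented default: port 80, every other field unset.
DEFAULT_HTTP_USERHOST = UserHost(port="80")


def parse_userhost(path: List[str], default: UserHost) -> Optional[UserHost]:
    """Parse the //<user>:<password>@<host>:<port>/ prefix of a path list."""
    # A path beginning with "//" splits into two leading empty strings.
    if len(path) < 3 or path[0] != "" or path[1] != "":
        raise ValueError("path does not begin with //")
    userinfo, at, hostport = path[2].rpartition("@")
    user, pw_colon, password = userinfo.partition(":")
    host, port_colon, port = hostport.partition(":")
    if not host:
        return None  # cannot parse: the host has no default
    return UserHost(
        user=user if at else default.user,
        password=password if pw_colon else default.password,
        host=host,
        port=port if port_colon and port else default.port,
    )
```

For example, `parse_userhost(["", "", "lcs.mit.edu:8001", "foo", "bar.html"], DEFAULT_HTTP_USERHOST)` takes the host and port from the path and leaves user and password at their defaults.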