diff --git a/doc/html/index.html b/doc/html/index.html new file mode 100644 index 0000000..0b8e359 --- /dev/null +++ b/doc/html/index.html @@ -0,0 +1,85 @@ + + +The Scheme Underground Network Package + + + +

The Scheme Underground Network Package

+I have written a set of libraries for doing Net hacking from Scheme/scsh. +It includes: +
+
An smtp client library. +
Forge mail from the comfort of your own Scheme process. + +
rfc822 header library +
Read email-style headers. Useful in several contexts (smtp, http, etc.) + +
Simple structured HTML output library +
Balanced delimiters, etc. + +
The SU Web server +
This is a complete implementation of an HTTP 1.0 server in Scheme. + The server contains other standalone packages that may separately be of + use: + + The server has three main design goals: +
+
Extensibility +
The server is in fact nothing but extensions, using a mechanism + called "path handlers" to define URL-specific services. It has a toolkit + of services that can be used as-is, extended or built upon. + User extensions have exactly the same status as the base services. + +

+ The extension mechanism allows for easy implementation of new services + without the overhead of the CGI interface. Since the server is written + on top of the Scheme shell, the full set of Unix system calls and + program tools is available to the implementor. + +

Mobile code +
The server allows Scheme code to be uploaded for direct execution + inside the server. The server has complete control over the code, + and can safely execute it in restricted environments that do not + provide access to potentially dangerous primitives (such as the + "delete file" procedure.) + + +
Clarity +
I wrote this server to help myself understand the Web. It is voluminously + commented, and I hope it will prove to be an aid in understanding the + low-level details of the Web protocols. +
+ +

+ The S.U. server has the ability to upload code from Web clients and + execute that code on behalf of the client in a protected environment. + +

+ Some simple documentation on the server + is available. + +

+ +

Obtaining the system

+The network code is available by +ftp. +To run the server, you need our 0.4 release of +scsh +which has just been released. + +Beyond actually running the server, +the separate parser libraries and other utilities may be of use as separate +modules. + +
Olin Shivers + / shivers@ai.mit.edu
+ + + + + diff --git a/doc/html/su-httpd.html b/doc/html/su-httpd.html new file mode 100644 index 0000000..356aa37 --- /dev/null +++ b/doc/html/su-httpd.html @@ -0,0 +1,482 @@ + + + +The Scheme Underground Web system + + + +

The Scheme Underground Web System

+ +
Olin Shivers + / shivers@ai.mit.edu +
+July 1995 + +
+Note: Netscape typesets description lists in a manner that makes the +procedure descriptions below blur together, even in the absence of the +HTML COMPACT attribute. You may just wish to print out a simple +ASCII version of this note, instead. +
+ + + + +

Introduction

+ +The +Scheme underground +Web system is a package of +Scheme +code that provides +utilities for interacting with the +World-Wide Web. +This includes: + + +

+The code can be obtained via + +anonymous ftp +and is implemented in +Scheme 48, +using the system calls and support procedures of +scsh, +the Scheme Shell. +The code was written to be clear and modifiable -- +it is voluminously commented and all non-R4RS dependencies are +described at the beginning of each source file. + +

+I do not have the time to write detailed documentation for these packages. +However, they are very thoroughly commented, and I strongly recommend +reading the source files; they were written to be read, and the source +code comments should provide a clear description of the system. +The remainder of this note gives an overview of the server's basic +architecture and interfaces. + +

The Scheme Underground Web Server

+ +The server was designed with three principal goals in mind: +
+
Extensibility +
The server is designed to make it easy to extend the basic + functionality. In fact, the server is nothing but extensions. There is + no distinction between the set of basic services provided by the server + implementation and user extensions -- they are both implemented in + Scheme, and have equal status. The design is "turtles all the way down." + + +
Mobile code +
Because the server is written in Scheme 48, it is simple to use the + Scheme 48 module system to upload programs to the server for safe + execution within a protected, server-chosen environment. The server + comes with a simple example upload service to demonstrate this + capability. + + +
Clarity of implementation +
Because the server is written in a high-level language, it should make + for a clearer exposition of the HTTP protocol and the associated URL + and URI notations than one written in a low-level language such as C. + This also should help to make the server easy to modify and adapt to + different uses. +
+ + +

Basic server structure

+ +The Web server is started by calling the httpd procedure, +which takes one required and two optional arguments: +
+    (httpd path-handler [port working-directory])
+
+ +The server accepts connections from the given port, which defaults to 80. +The server runs with the working directory set to the given value, +which defaults to +
+    /usr/local/etc/httpd
+
+ + +

+The server's basic loop is to wait on the port for a connection from an HTTP +client. When it receives a connection, it reads in and parses the request into +a special request data structure. Then the server forks a child process, which +binds the current I/O ports to the connection socket, and then hands off to +the top-level path handler (the first argument to httpd). +The path-handler procedure is responsible for actually serving the request -- +it can be any arbitrary computation. +Its output goes directly back to the HTTP client that sent the request. + +

+Before calling the path handler to service the request, the HTTP server +installs an error handler that fields any uncaught error, sends an +error reply to the client, and aborts the request transaction. Hence +any error caused by a path-handler will be handled in a reasonable and +robust fashion. + +

+The basic server loop and the associated request data structure are the fixed +architecture of the S.U. Web server; its flexibility lies in the notion of +path handlers. + + + +

Path handlers

+ +A path handler is a procedure taking two arguments: +
+    (path-handler path req)
+
+ + +The req argument is a request record giving all the details of the +client's request; it has the following structure: +
+    (define-record request
+      method		; A string such as "GET", "PUT", etc.
+      uri		; The escaped URI string as read from request line.
+      url		; An http URL record (see url.scm).
+      version		; A (major . minor) integer pair.
+      headers		; An rfc822 header alist (see rfc822.scm).
+      socket)		; The socket connected to the client.
+
+ +The path argument is the URL's path, +parsed and split at slashes into a string list. +For example, if the Web client dereferences URL +
+    http://clark.lcs.mit.edu:8001/h/shivers/code/web.tar.gz
+
+then the server would pass the following path to the top-level handler: +
+    ("h" "shivers" "code" "web.tar.gz")
+
+ +

+The path argument's pre-parsed representation as a string list makes it easy +for the path handler to dispatch recursively on URL paths. + +
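The splitting step described above can be sketched in portable Scheme. This is only an illustration: split-at-slashes is a hypothetical name, not the server's own parser (see url.scm for that), and the sketch assumes the leading slash has already been removed and does no URI unescaping.

```scheme
;; Sketch only: split a slash-delimited path string into a string list.
;; split-at-slashes is a hypothetical helper, not part of the S.U. code.
(define (split-at-slashes str)
  (let loop ((chars (string->list str))   ; characters still to scan
             (cur '())                    ; current component, reversed
             (acc '()))                   ; finished components, reversed
    (cond ((null? chars)
           (reverse (cons (list->string (reverse cur)) acc)))
          ((char=? (car chars) #\/)
           (loop (cdr chars)
                 '()
                 (cons (list->string (reverse cur)) acc)))
          (else
           (loop (cdr chars) (cons (car chars) cur) acc)))))

(write (split-at-slashes "h/shivers/code/web.tar.gz"))
;; prints ("h" "shivers" "code" "web.tar.gz")
```

A path handler can then take the car of such a list to choose a sub-handler and pass the cdr down, which is what makes recursive dispatch convenient.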

+Path handlers can do anything they like to respond to HTTP requests; they have +the full range of Scheme to implement the desired functionality. When +handling HTTP requests that have an associated entity body (such as POST), the +body should be read from the current input port. Path handlers should in all +cases write their reply to the current output port. Path handlers should +not perform I/O on the request record's socket. +Path handlers are frequently called recursively, and doing I/O directly to the +socket might bypass a filtering or other processing step interposed on the +current I/O ports by some superior path handler. + + +

Basic path handlers

+ +Although the user can write any path-handler he likes, the S.U. server comes +with a useful toolbox of basic path handlers that can be used and built upon: + +
+ +
+(alist-path-dispatcher ph-alist default-ph) -> path-handler + +
+ This procedure takes a string->path-handler alist, and a default + path handler, and returns a handler that dispatches on its path argument. + When the new path handler is applied to a path + ("foo" "bar" "baz"), + it uses the first element of the path -- "foo" -- to + index into the alist. + If it finds an associated path handler in the alist, it + hands the request off to that handler, passing it the tail of the + path, ("bar" "baz"). + On the other hand, if the path is empty, or the alist search does + not yield a hit, we hand off to the default path handler, + passing it the entire original path, ("foo" "bar" "baz"). + +

+ This procedure is how you say: "If the first element of the URL's path + is `foo', do X; if it's `bar', do Y; otherwise, do Z." If one takes + an object-oriented view of the process, an alist path-handler does + method lookup on the requested operation, dispatching off to the + appropriate method defined for the URL. + +

+ The slash-delimited URI path structure implies an associated + tree of names. The path-handler system and the alist dispatcher + allow you to procedurally define the server's response to any arbitrary + subtree of the path space. + +

+ Example:
+ A typical top-level path handler is + +

+  (define ph
+    (alist-path-dispatcher
+	`(("h"       . ,(home-dir-handler "public_html"))
+	  ("cgi-bin" . ,(cgi-handler "/usr/local/etc/httpd/cgi-bin"))
+	  ("seval"   . ,seval-handler))
+	(rooted-file-handler "/usr/local/etc/httpd/htdocs")))
+
+ + This means: + + + +
(home-dir-handler subdir) -> + path-handler +
+ This procedure builds a path handler that does basic file serving + out of home directories. If the resulting path handler is passed + a path of (user . file-path), + then it serves the file +
+    user's-home-directory/subdir/file-path
+
+ The path handler only handles GET requests; the filename is not + allowed to contain .. elements. + + +
+(tilde-home-dir-handler subdir default-path-handler) + -> path-handler + +
+ This path handler examines the car of the path. If it is a string + beginning with a tilde, e.g., "~ziggy", + then the string is taken + to mean a home directory, and the request is served similarly to a + home-dir-handler path handler. + Otherwise, the request is passed off + in its entirety to the default path handler. + +

+ This procedure is useful for implementing servers that provide the + semantics of the NCSA httpd server. + + +

+(cgi-handler cgi-directory) -> path-handler + +
+ This procedure returns a path-handler that passes the request off to some + program using the CGI interface. The script name is taken from the + car of the path; it is checked for occurrences of ..'s. + If the path is +
+    ("my-prog" "foo" "bar")
+
+ then the program executed is +
+    cgi-directory/my-prog
+
+

+ When the CGI path handler builds the process environment for the + CGI script, several elements + (e.g., $PATH and $SERVER_SOFTWARE) + are request-invariant, and can be computed at server start-up time. + This can be done by calling +

+    (initialise-request-invariant-cgi-env)
+
+ when the server starts up. This is not necessary, + but will make CGI requests a little faster. + + +
+(rooted-file-handler root-dir) -> path-handler + +
+ Returns a path handler that serves files from a particular root + in the file system. Only the GET operation is provided. The path + argument passed to the handler is converted into a filename, + and appended to root-dir. + The file name is checked for .. components, + and the transaction is aborted if it contains any. Otherwise, the file is + served to the client. + +
+(null-path-handler path req) +
+ This path handler is useful as a default handler. It handles no requests, + always returning a "404 Not found" reply to the client. + +
+ + +
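The dispatch rule described for alist-path-dispatcher above can be sketched in a few lines of portable Scheme. This is a reading of the prose, not the server's source: sketch-alist-dispatcher, tagged, and demo-ph are hypothetical names, and the toy handlers return a tagged list instead of writing a reply to the current output port as real path handlers do.

```scheme
;; Sketch only: a stand-in for alist-path-dispatcher, written from the
;; prose description. All names here are hypothetical.
(define (sketch-alist-dispatcher ph-alist default-ph)
  (lambda (path req)
    ;; assoc compares with equal?, so string keys work.
    (let ((entry (and (pair? path) (assoc (car path) ph-alist))))
      (if entry
          ((cdr entry) (cdr path) req)   ; hit: pass the tail of the path
          (default-ph path req)))))      ; miss or empty path: pass it whole

;; Toy handler constructor: the handler just reports who was called.
(define (tagged tag)
  (lambda (path req) (list tag path)))

(define demo-ph
  (sketch-alist-dispatcher
   (list (cons "foo" (tagged 'foo-handler))
         (cons "bar" (tagged 'bar-handler)))
   (tagged 'default-handler)))

(demo-ph '("foo" "bar" "baz") 'no-request) ; => (foo-handler ("bar" "baz"))
(demo-ph '("quux" "x") 'no-request)        ; => (default-handler ("quux" "x"))
(demo-ph '() 'no-request)                  ; => (default-handler ())
```

Because the result of the dispatcher is itself a path handler, dispatchers can be nested as alist entries to carve up deeper subtrees of the path space.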

HTTP errors

+ +Authors of path-handlers need to be able to handle errors in a reasonably +simple fashion. The S.U. Web server provides a set of error conditions that +correspond to the error replies in the HTTP protocol. These errors can be +raised with the http-error procedure. +When the server runs a path handler, +it runs it in the context of an error handler that catches these errors, +sends an error reply to the client, and closes the transaction. + +
+ +
+(http-error reply-code req [extra ...]) +
+ This raises an http error condition. The reply code is one of the + numeric HTTP error reply codes, which are bound to the variables + http-reply/ok, http-reply/not-found, + http-reply/bad-request, and so + forth. The req argument is the request record that caused + the error. + Any following extra args are passed along for + informational purposes. + Different HTTP errors take different types of extra arguments. + For example, the "301 moved permanently" and "302 moved temporarily" + replies use the first two extra values as the + URI: and Location: + fields in the reply header, respectively. See the clauses of the + send-http-error-reply procedure for details. + + +
+(send-http-error-reply reply-code request + [extra ...]) + +
+ This procedure writes an error reply out to the current output + port. If an error occurs during this process, it is caught, and + the procedure silently returns. The http server's standard error + handler passes all http errors raised during path-handler execution + to this procedure to generate the error reply before aborting the + request transaction. +
+ + +

Simple directory generation

+ +Most path-handlers that serve files to clients eventually call an internal +procedure named file-serve, +which implements a simple directory-generation service using the +following rules: + + + + +

Support procs

+ +The source files contain a host of support procedures which will be of utility +to anyone writing a custom path-handler. Read the files first. + + + +

Losing

+ +Be aware of two Unix problems, which may require workarounds: +
    + +
  1. + NeXTSTEP's Posix implementation of the getpwnam() routine + will silently tell you that every user has uid 0. This means + that if your server, running as root, does a +
    +    (set-uid (user->uid "nobody"))
    +
    + it will essentially do a +
    +    (set-uid 0)
    +
    + and you will thus still be running as root. + +

    + The fix is to manually find out who user nobody is (he's -2 on my + system), and to hard-wire this into the server: +

    +    (set-uid -2)
    +
    + This problem is NeXTSTEP specific. If you are not using NeXTSTEP, + no problem. + + 
  2. + On NeXTSTEP, the ip-address->host-name translation routine + (in C, gethostbyaddr(); in scsh, + (host-info addr)) does not + use the DNS system; it goes through NeXT's proprietary Netinfo + system, and may not return a fully-qualified domain name. For + example, on my system, I get "amelia-earhart", when I want + "amelia-earhart.lcs.mit.edu". Since the server uses this name + to construct redirection URLs to be sent back to the Web client, + it needs to be an FQDN. + +

    + This problem may occur on other operating systems; + I cannot determine if gethostbyaddr() + is required to return an FQDN or not. (I would appreciate hearing the + answer if you know; my local Internet gurus couldn't tell me.) + +

    + If your system doesn't give you a complete Internet address when + you say +

    +    (host-info:name (host-info (system-name)))
    +
    + then you have this problem. + +

    + The server has a workaround. There is a procedure exported from + the httpd-core package: +

    +    (set-my-fqdn name)
    +
    + Call this to crow-bar the server's idea of its own Internet host name + before running the server, and all will be well. +
+ + + diff --git a/doc/rfc2396.txt b/doc/rfc2396.txt new file mode 100644 index 0000000..5bd5211 --- /dev/null +++ b/doc/rfc2396.txt @@ -0,0 +1,2243 @@ + + + + + + +Network Working Group T. Berners-Lee +Request for Comments: 2396 MIT/LCS +Updates: 1808, 1738 R. Fielding +Category: Standards Track U.C. Irvine + L. Masinter + Xerox Corporation + August 1998 + + + Uniform Resource Identifiers (URI): Generic Syntax + +Status of this Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Copyright Notice + + Copyright (C) The Internet Society (1998). All Rights Reserved. + +IESG Note + + This paper describes a "superset" of operations that can be applied + to URI. It consists of both a grammar and a description of basic + functionality for URI. To understand what is a valid URI, both the + grammar and the associated description have to be studied. Some of + the functionality described is not applicable to all URI schemes, and + some operations are only possible when certain media types are + retrieved using the URI, regardless of the scheme used. + +Abstract + + A Uniform Resource Identifier (URI) is a compact string of characters + for identifying an abstract or physical resource. This document + defines the generic syntax of URI, including both absolute and + relative forms, and guidelines for their use; it revises and replaces + the generic definitions in RFC 1738 and RFC 1808. + + This document defines a grammar that is a superset of all valid URI, + such that an implementation can parse the common components of a URI + reference without knowing the scheme-specific requirements of every + possible identifier type. 
This document does not define a generative + grammar for URI; that task will be performed by the individual + specifications of each URI scheme. + + + + +Berners-Lee, et. al. Standards Track [Page 1] + +RFC 2396 URI Generic Syntax August 1998 + + +1. Introduction + + Uniform Resource Identifiers (URI) provide a simple and extensible + means for identifying a resource. This specification of URI syntax + and semantics is derived from concepts introduced by the World Wide + Web global information initiative, whose use of such objects dates + from 1990 and is described in "Universal Resource Identifiers in WWW" + [RFC1630]. The specification of URI is designed to meet the + recommendations laid out in "Functional Recommendations for Internet + Resource Locators" [RFC1736] and "Functional Requirements for Uniform + Resource Names" [RFC1737]. + + This document updates and merges "Uniform Resource Locators" + [RFC1738] and "Relative Uniform Resource Locators" [RFC1808] in order + to define a single, generic syntax for all URI. It excludes those + portions of RFC 1738 that defined the specific syntax of individual + URL schemes; those portions will be updated as separate documents, as + will the process for registration of new URI schemes. This document + does not discuss the issues and recommendation for dealing with + characters outside of the US-ASCII character set [ASCII]; those + recommendations are discussed in a separate document. + + All significant changes from the prior RFCs are noted in Appendix G. 
+ +1.1 Overview of URI + + URI are characterized by the following definitions: + + Uniform + Uniformity provides several benefits: it allows different types + of resource identifiers to be used in the same context, even + when the mechanisms used to access those resources may differ; + it allows uniform semantic interpretation of common syntactic + conventions across different types of resource identifiers; it + allows introduction of new types of resource identifiers + without interfering with the way that existing identifiers are + used; and, it allows the identifiers to be reused in many + different contexts, thus permitting new applications or + protocols to leverage a pre-existing, large, and widely-used + set of resource identifiers. + + Resource + A resource can be anything that has identity. Familiar + examples include an electronic document, an image, a service + (e.g., "today's weather report for Los Angeles"), and a + collection of other resources. Not all resources are network + "retrievable"; e.g., human beings, corporations, and bound + books in a library can also be considered resources. + + + +Berners-Lee, et. al. Standards Track [Page 2] + +RFC 2396 URI Generic Syntax August 1998 + + + The resource is the conceptual mapping to an entity or set of + entities, not necessarily the entity which corresponds to that + mapping at any particular instance in time. Thus, a resource + can remain constant even when its content---the entities to + which it currently corresponds---changes over time, provided + that the conceptual mapping is not changed in the process. + + Identifier + An identifier is an object that can act as a reference to + something that has identity. In the case of URI, the object is + a sequence of characters with a restricted syntax. + + Having identified a resource, a system may perform a variety of + operations on the resource, as might be characterized by such words + as `access', `update', `replace', or `find attributes'. + +1.2. 
URI, URL, and URN + + A URI can be further classified as a locator, a name, or both. The + term "Uniform Resource Locator" (URL) refers to the subset of URI + that identify resources via a representation of their primary access + mechanism (e.g., their network "location"), rather than identifying + the resource by name or by some other attribute(s) of that resource. + The term "Uniform Resource Name" (URN) refers to the subset of URI + that are required to remain globally unique and persistent even when + the resource ceases to exist or becomes unavailable. + + The URI scheme (Section 3.1) defines the namespace of the URI, and + thus may further restrict the syntax and semantics of identifiers + using that scheme. This specification defines those elements of the + URI syntax that are either required of all URI schemes or are common + to many URI schemes. It thus defines the syntax and semantics that + are needed to implement a scheme-independent parsing mechanism for + URI references, such that the scheme-dependent handling of a URI can + be postponed until the scheme-dependent semantics are needed. We use + the term URL below when describing syntax or semantics that only + apply to locators. + + Although many URL schemes are named after protocols, this does not + imply that the only way to access the URL's resource is via the named + protocol. Gateways, proxies, caches, and name resolution services + might be used to access some resources, independent of the protocol + of their origin, and the resolution of some URL may require the use + of more than one protocol (e.g., both DNS and HTTP are typically used + to access an "http" URL's resource when it can't be found in a local + cache). + + + + + +Berners-Lee, et. al. Standards Track [Page 3] + +RFC 2396 URI Generic Syntax August 1998 + + + A URN differs from a URL in that it's primary purpose is persistent + labeling of a resource with an identifier. 
That identifier is drawn + from one of a set of defined namespaces, each of which has its own + set name structure and assignment procedures. The "urn" scheme has + been reserved to establish the requirements for a standardized URN + namespace, as defined in "URN Syntax" [RFC2141] and its related + specifications. + + Most of the examples in this specification demonstrate URL, since + they allow the most varied use of the syntax and often have a + hierarchical namespace. A parser of the URI syntax is capable of + parsing both URL and URN references as a generic URI; once the scheme + is determined, the scheme-specific parsing can be performed on the + generic URI components. In other words, the URI syntax is a superset + of the syntax of all URI schemes. + +1.3. Example URI + + The following examples illustrate URI that are in common use. + + ftp://ftp.is.co.za/rfc/rfc1808.txt + -- ftp scheme for File Transfer Protocol services + + gopher://spinaltap.micro.umn.edu/00/Weather/California/Los%20Angeles + -- gopher scheme for Gopher and Gopher+ Protocol services + + http://www.math.uio.no/faq/compression-faq/part1.html + -- http scheme for Hypertext Transfer Protocol services + + mailto:mduerst@ifi.unizh.ch + -- mailto scheme for electronic mail addresses + + news:comp.infosystems.www.servers.unix + -- news scheme for USENET news groups and articles + + telnet://melvyl.ucop.edu/ + -- telnet scheme for interactive services via the TELNET Protocol + +1.4. Hierarchical URI and Relative Forms + + An absolute identifier refers to a resource independent of the + context in which the identifier is used. In contrast, a relative + identifier refers to a resource by describing the difference within a + hierarchical namespace between the current context and an absolute + identifier of the resource. + + + + + + +Berners-Lee, et. al. 
Standards Track [Page 4] + +RFC 2396 URI Generic Syntax August 1998 + + + Some URI schemes support a hierarchical naming system, where the + hierarchy of the name is denoted by a "/" delimiter separating the + components in the scheme. This document defines a scheme-independent + `relative' form of URI reference that can be used in conjunction with + a `base' URI (of a hierarchical scheme) to produce another URI. The + syntax of hierarchical URI is described in Section 3; the relative + URI calculation is described in Section 5. + +1.5. URI Transcribability + + The URI syntax was designed with global transcribability as one of + its main concerns. A URI is a sequence of characters from a very + limited set, i.e. the letters of the basic Latin alphabet, digits, + and a few special characters. A URI may be represented in a variety + of ways: e.g., ink on paper, pixels on a screen, or a sequence of + octets in a coded character set. The interpretation of a URI depends + only on the characters used and not how those characters are + represented in a network protocol. + + The goal of transcribability can be described by a simple scenario. + Imagine two colleagues, Sam and Kim, sitting in a pub at an + international conference and exchanging research ideas. Sam asks Kim + for a location to get more information, so Kim writes the URI for the + research site on a napkin. Upon returning home, Sam takes out the + napkin and types the URI into a computer, which then retrieves the + information to which Kim referred. + + There are several design concerns revealed by the scenario: + + o A URI is a sequence of characters, which is not always + represented as a sequence of octets. + + o A URI may be transcribed from a non-network source, and thus + should consist of characters that are most likely to be able to + be typed into a computer, within the constraints imposed by + keyboards (and related input devices) across languages and + locales. 
+ + o A URI often needs to be remembered by people, and it is easier + for people to remember a URI when it consists of meaningful + components. + + These design concerns are not always in alignment. For example, it + is often the case that the most meaningful name for a URI component + would require characters that cannot be typed into some systems. The + ability to transcribe the resource identifier from one medium to + another was considered more important than having its URI consist of + the most meaningful of components. In local and regional contexts + + + +Berners-Lee, et. al. Standards Track [Page 5] + +RFC 2396 URI Generic Syntax August 1998 + + + and with improving technology, users might benefit from being able to + use a wider range of characters; such use is not defined in this + document. + +1.6. Syntax Notation and Common Elements + + This document uses two conventions to describe and define the syntax + for URI. The first, called the layout form, is a general description + of the order of components and component separators, as in + + <first>/<second>;<third>?<fourth> + + The component names are enclosed in angle-brackets and any characters + outside angle-brackets are literal separators. Whitespace should be + ignored. These descriptions are used informally and do not define + the syntax requirements. + + The second convention is a BNF-like grammar, used to define the + formal URI syntax. The grammar is that of [RFC822], except that "|" + is used to designate alternatives. Briefly, rules are separated from + definitions by an equal "=", indentation is used to continue a rule + definition over more than one line, literals are quoted with "", + parentheses "(" and ")" are used to group elements, optional elements + are enclosed in "[" and "]" brackets, and elements may be preceded + with <n>* to designate n or more repetitions of the following + element; n defaults to 0. 
+ + Unlike many specifications that use a BNF-like grammar to define the + bytes (octets) allowed by a protocol, the URI grammar is defined in + terms of characters. Each literal in the grammar corresponds to the + character it represents, rather than to the octet encoding of that + character in any particular coded character set. How a URI is + represented in terms of bits and bytes on the wire is dependent upon + the character encoding of the protocol used to transport it, or the + charset of the document which contains it. + + The following definitions are common to many elements: + + alpha = lowalpha | upalpha + + lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | + "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | + "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" + + upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | + "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | + "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" + + + + +Berners-Lee, et. al. Standards Track [Page 6] + +RFC 2396 URI Generic Syntax August 1998 + + + digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | + "8" | "9" + + alphanum = alpha | digit + + The complete URI syntax is collected in Appendix A. + +2. URI Characters and Escape Sequences + + URI consist of a restricted set of characters, primarily chosen to + aid transcribability and usability both in computer systems and in + non-computer communications. Characters used conventionally as + delimiters around URI were excluded. The restricted set of + characters consists of digits, letters, and a few graphic symbols + were chosen from those common to most of the character encodings and + input facilities available to Internet users. + + uric = reserved | unreserved | escaped + + Within a URI, characters are either used as delimiters, or to + represent strings of data (octets) within the delimited portions. 
+ Octets are either represented directly by a character (using the US- + ASCII character for that octet [ASCII]) or by an escape encoding. + This representation is elaborated below. + +2.1 URI and non-ASCII characters + + The relationship between URI and characters has been a source of + confusion for characters that are not part of US-ASCII. To describe + the relationship, it is useful to distinguish between a "character" + (as a distinguishable semantic entity) and an "octet" (an 8-bit + byte). There are two mappings, one from URI characters to octets, and + a second from octets to original characters: + + URI character sequence->octet sequence->original character sequence + + A URI is represented as a sequence of characters, not as a sequence + of octets. That is because URI might be "transported" by means that + are not through a computer network, e.g., printed on paper, read over + the radio, etc. + + A URI scheme may define a mapping from URI characters to octets; + whether this is done depends on the scheme. Commonly, within a + delimited component of a URI, a sequence of characters may be used to + represent a sequence of octets. For example, the character "a" + represents the octet 97 (decimal), while the character sequence "%", + "0", "a" represents the octet 10 (decimal). + + + + +Berners-Lee, et. al. Standards Track [Page 7] + +RFC 2396 URI Generic Syntax August 1998 + + + There is a second translation for some resources: the sequence of + octets defined by a component of the URI is subsequently used to + represent a sequence of characters. A 'charset' defines this mapping. + There are many charsets in use in Internet protocols. For example, + UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences + of characters in the repertoire of ISO 10646. 
+ + In the simplest case, the original character sequence contains only + characters that are defined in US-ASCII, and the two levels of + mapping are simple and easily invertible: each 'original character' + is represented as the octet for the US-ASCII code for it, which is, + in turn, represented as either the US-ASCII character, or else the + "%" escape sequence for that octet. + + For original character sequences that contain non-ASCII characters, + however, the situation is more difficult. Internet protocols that + transmit octet sequences intended to represent character sequences + are expected to provide some way of identifying the charset used, if + there might be more than one [RFC2277]. However, there is currently + no provision within the generic URI syntax to accomplish this + identification. An individual URI scheme may require a single + charset, define a default charset, or provide a way to indicate the + charset used. + + It is expected that a systematic treatment of character encoding + within URI will be developed as a future modification of this + specification. + +2.2. Reserved Characters + + Many URI include components consisting of or delimited by, certain + special characters. These characters are called "reserved", since + their usage within the URI component is limited to their reserved + purpose. If the data for a URI component would conflict with the + reserved purpose, then the conflicting data must be escaped before + forming the URI. + + reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | + "$" | "," + + The "reserved" syntax class above refers to those characters that are + allowed within a URI, but which may not be allowed within a + particular component of the generic URI syntax; they are used as + delimiters of the components described in Section 3. + + + + + + + +Berners-Lee, et. al. Standards Track [Page 8] + +RFC 2396 URI Generic Syntax August 1998 + + + Characters in the "reserved" set are not reserved in all contexts. 
+ The set of characters actually reserved within any given URI + component is defined by that component. In general, a character is + reserved if the semantics of the URI changes if the character is + replaced with its escaped US-ASCII encoding. + +2.3. Unreserved Characters + + Data characters that are allowed in a URI but do not have a reserved + purpose are called unreserved. These include upper and lower case + letters, decimal digits, and a limited set of punctuation marks and + symbols. + + unreserved = alphanum | mark + + mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")" + + Unreserved characters can be escaped without changing the semantics + of the URI, but this should not be done unless the URI is being used + in a context that does not allow the unescaped character to appear. + +2.4. Escape Sequences + + Data must be escaped if it does not have a representation using an + unreserved character; this includes data that does not correspond to + a printable character of the US-ASCII coded character set, or that + corresponds to any US-ASCII character that is disallowed, as + explained below. + +2.4.1. Escaped Encoding + + An escaped octet is encoded as a character triplet, consisting of the + percent character "%" followed by the two hexadecimal digits + representing the octet code. For example, "%20" is the escaped + encoding for the US-ASCII space character. + + escaped = "%" hex hex + hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | + "a" | "b" | "c" | "d" | "e" | "f" + +2.4.2. When to Escape and Unescape + + A URI is always in an "escaped" form, since escaping or unescaping a + completed URI might change its semantics. Normally, the only time + escape encodings can safely be made is when the URI is being created + from its component parts; each component may have its own set of + characters that are reserved, so only the mechanism responsible for + generating or interpreting that component can determine whether or + + + +Berners-Lee, et. al. 
Standards Track [Page 9] + +RFC 2396 URI Generic Syntax August 1998 + + + not escaping a character will change its semantics. Likewise, a URI + must be separated into its components before the escaped characters + within those components can be safely decoded. + + In some cases, data that could be represented by an unreserved + character may appear escaped; for example, some of the unreserved + "mark" characters are automatically escaped by some systems. If the + given URI scheme defines a canonicalization algorithm, then + unreserved characters may be unescaped according to that algorithm. + For example, "%7e" is sometimes used instead of "~" in an http URL + path, but the two are equivalent for an http URL. + + Because the percent "%" character always has the reserved purpose of + being the escape indicator, it must be escaped as "%25" in order to + be used as data within a URI. Implementers should be careful not to + escape or unescape the same string more than once, since unescaping + an already unescaped string might lead to misinterpreting a percent + data character as another escaped character, or vice versa in the + case of escaping an already escaped string. + +2.4.3. Excluded US-ASCII Characters + + Although they are disallowed within the URI syntax, we include here a + description of those US-ASCII characters that have been excluded and + the reasons for their exclusion. + + The control characters in the US-ASCII coded character set are not + used within a URI, both because they are non-printable and because + they are likely to be misinterpreted by some control mechanisms. + + control = <US-ASCII coded characters 00-1F and 7F hexadecimal> + + The space character is excluded because significant spaces may + disappear and insignificant spaces may be introduced when URI are + transcribed or typeset or subjected to the treatment of word- + processing programs. Whitespace is also used to delimit URI in many + contexts. 
+ + space = <US-ASCII coded character 20 hexadecimal> + + The angle-bracket "<" and ">" and double-quote (") characters are + excluded because they are often used as the delimiters around URI in + text documents and protocol fields. The character "#" is excluded + because it is used to delimit a URI from a fragment identifier in URI + references (Section 4). The percent character "%" is excluded because + it is used for the encoding of escaped characters. + + delims = "<" | ">" | "#" | "%" | <"> + + + +Berners-Lee, et. al. Standards Track [Page 10] + +RFC 2396 URI Generic Syntax August 1998 + + + Other characters are excluded because gateways and other transport + agents are known to sometimes modify such characters, or they are + used as delimiters. + + unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`" + + Data corresponding to excluded characters must be escaped in order to + be properly represented within a URI. + +3. URI Syntactic Components + + The URI syntax is dependent upon the scheme. In general, absolute + URI are written as follows: + + <scheme>:<scheme-specific-part> + + An absolute URI contains the name of the scheme being used (<scheme>) + followed by a colon (":") and then a string (the <scheme-specific-part>) whose interpretation depends on the scheme. + + The URI syntax does not require that the scheme-specific-part have + any general structure or set of semantics which is common among all + URI. However, a subset of URI do share a common syntax for + representing hierarchical relationships within the namespace. This + "generic URI" syntax consists of a sequence of four main components: + + <scheme>://<authority><path>?<query> + + each of which, except <scheme>, may be absent from a particular URI. + For example, some URI schemes do not allow an <authority> component, + and others do not use a <query> component. + + absoluteURI = scheme ":" ( hier_part | opaque_part ) + + URI that are hierarchical in nature use the slash "/" character for + separating hierarchical components. 
For some file systems, a "/" + character (used to denote the hierarchical structure of a URI) is the + delimiter used to construct a file name hierarchy, and thus the URI + path will look similar to a file pathname. This does NOT imply that + the resource is a file or that the URI maps to an actual filesystem + pathname. + + hier_part = ( net_path | abs_path ) [ "?" query ] + + net_path = "//" authority [ abs_path ] + + abs_path = "/" path_segments + + + + +Berners-Lee, et. al. Standards Track [Page 11] + +RFC 2396 URI Generic Syntax August 1998 + + + URI that do not make use of the slash "/" character for separating + hierarchical components are considered opaque by the generic URI + parser. + + opaque_part = uric_no_slash *uric + + uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | + "&" | "=" | "+" | "$" | "," + + We use the term <path> to refer to both the <abs_path> and + <opaque_part> constructs, since they are mutually exclusive for any + given URI and can be parsed as a single component. + +3.1. Scheme Component + + Just as there are many different methods of access to resources, + there are a variety of schemes for identifying such resources. The + URI syntax consists of a sequence of components separated by reserved + characters, with the first component defining the semantics for the + remainder of the URI string. + + Scheme names consist of a sequence of characters beginning with a + lower case letter and followed by any combination of lower case + letters, digits, plus ("+"), period ("."), or hyphen ("-"). For + resiliency, programs interpreting URI should treat upper case letters + as equivalent to lower case in scheme names (e.g., allow "HTTP" as + well as "http"). + + scheme = alpha *( alpha | digit | "+" | "-" | "." ) + + Relative URI references are distinguished from absolute URI in that + they do not begin with a scheme name. Instead, the scheme is + inherited from the base URI, as described in Section 5.2. + +3.2. 
Authority Component + + Many URI schemes include a top hierarchical element for a naming + authority, such that the namespace defined by the remainder of the + URI is governed by that authority. This authority component is + typically defined by an Internet-based server or a scheme-specific + registry of naming authorities. + + authority = server | reg_name + + The authority component is preceded by a double slash "//" and is + terminated by the next slash "/", question-mark "?", or by the end of + the URI. Within the authority component, the characters ";", ":", + "@", "?", and "/" are reserved. + + + +Berners-Lee, et. al. Standards Track [Page 12] + +RFC 2396 URI Generic Syntax August 1998 + + + An authority component is not required for a URI scheme to make use + of relative references. A base URI without an authority component + implies that any relative reference will also be without an authority + component. + +3.2.1. Registry-based Naming Authority + + The structure of a registry-based naming authority is specific to the + URI scheme, but constrained to the allowed characters for an + authority component. + + reg_name = 1*( unreserved | escaped | "$" | "," | + ";" | ":" | "@" | "&" | "=" | "+" ) + +3.2.2. Server-based Naming Authority + + URL schemes that involve the direct use of an IP-based protocol to a + specified server on the Internet use a common syntax for the server + component of the URI's scheme-specific data: + + <userinfo>@<host>:<port> + + where <userinfo> may consist of a user name and, optionally, scheme- + specific information about how to gain authorization to access the + server. The parts "<userinfo>@" and ":<port>" may be omitted. + + server = [ [ userinfo "@" ] hostport ] + + The user information, if present, is followed by a commercial at-sign + "@". + + userinfo = *( unreserved | escaped | + ";" | ":" | "&" | "=" | "+" | "$" | "," ) + + Some URL schemes use the format "user:password" in the userinfo + field. 
This practice is NOT RECOMMENDED, because the passing of + authentication information in clear text (such as URI) has proven to + be a security risk in almost every case where it has been used. + + The host is a domain name of a network host, or its IPv4 address as a + set of four decimal digit groups separated by ".". Literal IPv6 + addresses are not supported. + + hostport = host [ ":" port ] + host = hostname | IPv4address + hostname = *( domainlabel "." ) toplabel [ "." ] + domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum + toplabel = alpha | alpha *( alphanum | "-" ) alphanum + + + +Berners-Lee, et. al. Standards Track [Page 13] + +RFC 2396 URI Generic Syntax August 1998 + + + IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit + port = *digit + + Hostnames take the form described in Section 3 of [RFC1034] and + Section 2.1 of [RFC1123]: a sequence of domain labels separated by + ".", each domain label starting and ending with an alphanumeric + character and possibly also containing "-" characters. The rightmost + domain label of a fully qualified domain name will never start with a + digit, thus syntactically distinguishing domain names from IPv4 + addresses, and may be followed by a single "." if it is necessary to + distinguish between the complete domain name and any local domain. + To actually be "Uniform" as a resource locator, a URL hostname should + be a fully qualified domain name. In practice, however, the host + component may be a local domain literal. + + Note: A suitable representation for including a literal IPv6 + address as the host part of a URL is desired, but has not yet been + determined or implemented in practice. + + The port is the network port number for the server. Most schemes + designate protocols that have a default port number. Another port + number may optionally be supplied, in decimal, separated from the + host by a colon. If the port is omitted, the default port number is + assumed. + +3.3. 
Path Component + + The path component contains data, specific to the authority (or the + scheme if there is no authority component), identifying the resource + within the scope of that scheme and authority. + + path = [ abs_path | opaque_part ] + + path_segments = segment *( "/" segment ) + segment = *pchar *( ";" param ) + param = *pchar + + pchar = unreserved | escaped | + ":" | "@" | "&" | "=" | "+" | "$" | "," + + The path may consist of a sequence of path segments separated by a + single slash "/" character. Within a path segment, the characters + "/", ";", "=", and "?" are reserved. Each path segment may include a + sequence of parameters, indicated by the semicolon ";" character. + The parameters are not significant to the parsing of relative + references. + + + + + +Berners-Lee, et. al. Standards Track [Page 14] + +RFC 2396 URI Generic Syntax August 1998 + + +3.4. Query Component + + The query component is a string of information to be interpreted by + the resource. + + query = *uric + + Within a query component, the characters ";", "/", "?", ":", "@", + "&", "=", "+", ",", and "$" are reserved. + +4. URI References + + The term "URI-reference" is used here to denote the common usage of a + resource identifier. A URI reference may be absolute or relative, + and may have additional information attached in the form of a + fragment identifier. However, "the URI" that results from such a + reference includes only the absolute URI after the fragment + identifier (if any) is removed and after any relative URI is resolved + to its absolute form. Although it is possible to limit the + discussion of URI syntax and semantics to that of the absolute + result, most usage of URI is within general URI references, and it is + impossible to obtain the URI from such a reference without also + parsing the fragment and resolving the relative form. 
+ + URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] + + The syntax for relative URI is a shortened form of that for absolute + URI, where some prefix of the URI is missing and certain path + components ("." and "..") have a special meaning when, and only when, + interpreting a relative path. The relative URI syntax is defined in + Section 5. + +4.1. Fragment Identifier + + When a URI reference is used to perform a retrieval action on the + identified resource, the optional fragment identifier, separated from + the URI by a crosshatch ("#") character, consists of additional + reference information to be interpreted by the user agent after the + retrieval action has been successfully completed. As such, it is not + part of a URI, but is often used in conjunction with a URI. + + fragment = *uric + + The semantics of a fragment identifier is a property of the data + resulting from a retrieval action, regardless of the type of URI used + in the reference. Therefore, the format and interpretation of + fragment identifiers is dependent on the media type [RFC2046] of the + retrieval result. The character restrictions described in Section 2 + + + +Berners-Lee, et. al. Standards Track [Page 15] + +RFC 2396 URI Generic Syntax August 1998 + + + for URI also apply to the fragment in a URI-reference. Individual + media types may define additional restrictions or structure within + the fragment for specifying different types of "partial views" that + can be identified within that media type. + + A fragment identifier is only meaningful when a URI reference is + intended for retrieval and the result of that retrieval is a document + for which the identified fragment is consistently defined. + +4.2. Same-document References + + A URI reference that does not contain a URI is a reference to the + current document. 
In other words, an empty URI reference within a + document is interpreted as a reference to the start of that document, + and a reference containing only a fragment identifier is a reference + to the identified fragment of that document. Traversal of such a + reference should not result in an additional retrieval action. + However, if the URI reference occurs in a context that is always + intended to result in a new request, as in the case of HTML's FORM + element, then an empty URI reference represents the base URI of the + current document and should be replaced by that URI when transformed + into a request. + +4.3. Parsing a URI Reference + + A URI reference is typically parsed according to the four main + components and fragment identifier in order to determine what + components are present and whether the reference is relative or + absolute. The individual components are then parsed for their + subparts and, if not opaque, to verify their validity. + + Although the BNF defines what is allowed in each component, it is + ambiguous in terms of differentiating between an authority component + and a path component that begins with two slash characters. The + greedy algorithm is used for disambiguation: the left-most matching + rule soaks up as much of the URI reference string as it is capable of + matching. In other words, the authority component wins. + + Readers familiar with regular expressions should see Appendix B for a + concrete parsing example and test oracle. + +5. Relative URI References + + It is often the case that a group or "tree" of documents has been + constructed to serve a common purpose; the vast majority of URI in + these documents point to resources within the tree rather than + + + + + +Berners-Lee, et. al. Standards Track [Page 16] + +RFC 2396 URI Generic Syntax August 1998 + + + outside of it. Similarly, documents located at a particular site are + much more likely to refer to other resources at that site than to + resources at remote sites. 
+ + Relative addressing of URI allows document trees to be partially + independent of their location and access scheme. For instance, it is + possible for a single set of hypertext documents to be simultaneously + accessible and traversable via each of the "file", "http", and "ftp" + schemes if the documents refer to each other using relative URI. + Furthermore, such document trees can be moved, as a whole, without + changing any of the relative references. Experience within the WWW + has demonstrated that the ability to perform relative referencing is + necessary for the long-term usability of embedded URI. + + The syntax for relative URI takes advantage of the <hier_part> syntax + of <absoluteURI> (Section 3) in order to express a reference that is + relative to the namespace of another hierarchical URI. + + relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] + + A relative reference beginning with two slash characters is termed a + network-path reference, as defined by <net_path> in Section 3. Such + references are rarely used. + + A relative reference beginning with a single slash character is + termed an absolute-path reference, as defined by <abs_path> in + Section 3. + + A relative reference that does not begin with a scheme name or a + slash character is termed a relative-path reference. + + rel_path = rel_segment [ abs_path ] + + rel_segment = 1*( unreserved | escaped | + ";" | "@" | "&" | "=" | "+" | "$" | "," ) + + Within a relative-path reference, the complete path segments "." and + ".." have special meanings: "the current hierarchy level" and "the + level above this hierarchy level", respectively. Although this is + very similar to their use within Unix-based filesystems to indicate + directory levels, these path components are only considered special + when resolving a relative-path reference to its absolute form + (Section 5.2). 
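The three kinds of relative reference named above (network-path, absolute-path, and relative-path) can be told apart mechanically once the reference is split into components. The Python sketch below is only an illustration under stated assumptions: `classify_reference` is a hypothetical helper, the returned labels are chosen here for readability, and the splitting expression is written in the style of the one this document gives in Appendix B.

```python
import re

# Splitting expression in the style of Appendix B.
# Groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
# A group that is None corresponds to an undefined component.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def classify_reference(ref: str) -> str:
    """Hypothetical helper: name the kind of URI reference."""
    scheme, authority, path, query = URI_REFERENCE.match(ref).group(2, 4, 5, 7)
    if scheme is not None:
        return "absolute"          # begins with a scheme name
    if authority is not None:
        return "network-path"      # begins with two slash characters
    if path.startswith("/"):
        return "absolute-path"     # begins with a single slash character
    if path == "" and query is None:
        return "same-document"     # empty or fragment-only (Section 4.2)
    return "relative-path"

for ref in ("//host/x", "/a/b", "../g;x?y#s", "#frag"):
    print(ref, "->", classify_reference(ref))
```

Because every part of the expression is optional, the match always succeeds; the classification depends only on which groups are defined.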
+ + Authors should be aware that a path segment which contains a colon + character cannot be used as the first segment of a relative URI path + (e.g., "this:that"), because it would be mistaken for a scheme name. + + + + +Berners-Lee, et. al. Standards Track [Page 17] + +RFC 2396 URI Generic Syntax August 1998 + + + It is therefore necessary to precede such segments with other + segments (e.g., "./this:that") in order for them to be referenced as + a relative path. + + It is not necessary for all URI within a given scheme to be + restricted to the <hier_part> syntax, since the hierarchical + properties of that syntax are only necessary when relative URI are + used within a particular document. Documents can only make use of + relative URI when their base URI fits within the <hier_part> syntax. + It is assumed that any document which contains a relative reference + will also have a base URI that obeys the syntax. In other words, + relative URI cannot be used within a document that has an unsuitable + base URI. + + Some URI schemes do not allow a hierarchical syntax matching the + <hier_part> syntax, and thus cannot use relative references. + +5.1. Establishing a Base URI + + The term "relative URI" implies that there exists some absolute "base + URI" against which the relative reference is applied. Indeed, the + base URI is necessary to define the semantics of any relative URI + reference; without it, a relative reference is meaningless. In order + for relative URI to be usable within a document, the base URI of that + document must be known to the parser. + + The base URI of a document can be established in one of four ways, + listed below in order of precedence. The order of precedence can be + thought of in terms of layers, where the innermost defined base URI + has the highest precedence. This can be visualized graphically as: + + .----------------------------------------------------------. + | .----------------------------------------------------. 
| + | | .----------------------------------------------. | | + | | | .----------------------------------------. | | | + | | | | .----------------------------------. | | | | + | | | | | | | | | | + | | | | `----------------------------------' | | | | + | | | | (5.1.1) Base URI embedded in the | | | | + | | | | document's content | | | | + | | | `----------------------------------------' | | | + | | | (5.1.2) Base URI of the encapsulating entity | | | + | | | (message, document, or none). | | | + | | `----------------------------------------------' | | + | | (5.1.3) URI used to retrieve the entity | | + | `----------------------------------------------------' | + | (5.1.4) Default Base URI is application-dependent | + `----------------------------------------------------------' + + + +Berners-Lee, et. al. Standards Track [Page 18] + +RFC 2396 URI Generic Syntax August 1998 + + +5.1.1. Base URI within Document Content + + Within certain document media types, the base URI of the document can + be embedded within the content itself such that it can be readily + obtained by a parser. This can be useful for descriptive documents, + such as tables of content, which may be transmitted to others through + protocols other than their usual retrieval context (e.g., E-Mail or + USENET news). + + It is beyond the scope of this document to specify how, for each + media type, the base URI can be embedded. It is assumed that user + agents manipulating such media types will be able to obtain the + appropriate syntax from that media type's specification. An example + of how the base URI can be embedded in the Hypertext Markup Language + (HTML) [RFC1866] is provided in Appendix D. + + A mechanism for embedding the base URI within MIME container types + (e.g., the message and multipart types) is defined by MHTML + [RFC2110]. 
Protocols that do not use the MIME message header syntax, + but which do allow some form of tagged metainformation to be included + within messages, may define their own syntax for defining the base + URI as part of a message. + +5.1.2. Base URI from the Encapsulating Entity + + If no base URI is embedded, the base URI of a document is defined by + the document's retrieval context. For a document that is enclosed + within another entity (such as a message or another document), the + retrieval context is that entity; thus, the default base URI of the + document is the base URI of the entity in which the document is + encapsulated. + +5.1.3. Base URI from the Retrieval URI + + If no base URI is embedded and the document is not encapsulated + within some other entity (e.g., the top level of a composite entity), + then, if a URI was used to retrieve the base document, that URI shall + be considered the base URI. Note that if the retrieval was the + result of a redirected request, the last URI used (i.e., that which + resulted in the actual retrieval of the document) is the base URI. + +5.1.4. Default Base URI + + If none of the conditions described in Sections 5.1.1--5.1.3 apply, + then the base URI is defined by the context of the application. + Since this definition is necessarily application-dependent, failing + + + + + +Berners-Lee, et. al. Standards Track [Page 19] + +RFC 2396 URI Generic Syntax August 1998 + + + to define the base URI using one of the other methods may result in + the same content being interpreted differently by different types of + application. + + It is the responsibility of the distributor(s) of a document + containing relative URI to ensure that the base URI for that document + can be established. It must be emphasized that relative URI cannot + be used reliably in situations where the document's base URI is not + well-defined. + +5.2. 
Resolving Relative References to Absolute Form + + This section describes an example algorithm for resolving URI + references that might be relative to a given base URI. + + The base URI is established according to the rules of Section 5.1 and + parsed into the four main components as described in Section 3. Note + that only the scheme component is required to be present in the base + URI; the other components may be empty or undefined. A component is + undefined if its preceding separator does not appear in the URI + reference; the path component is never undefined, though it may be + empty. The base URI's query component is not used by the resolution + algorithm and may be discarded. + + For each URI reference, the following steps are performed in order: + + 1) The URI reference is parsed into the potential four components and + fragment identifier, as described in Section 4.3. + + 2) If the path component is empty and the scheme, authority, and + query components are undefined, then it is a reference to the + current document and we are done. Otherwise, the reference URI's + query and fragment components are defined as found (or not found) + within the URI reference and not inherited from the base URI. + + 3) If the scheme component is defined, indicating that the reference + starts with a scheme name, then the reference is interpreted as an + absolute URI and we are done. Otherwise, the reference URI's + scheme is inherited from the base URI's scheme component. + + Due to a loophole in prior specifications [RFC1630], some parsers + allow the scheme name to be present in a relative URI if it is the + same as the base URI scheme. Unfortunately, this can conflict + with the correct parsing of non-hierarchical URI. For backwards + compatibility, an implementation may work around such references + by removing the scheme if it matches that of the base URI and the + scheme is known to always use the <hier_part> syntax. The parser + + + + +Berners-Lee, et. al. 
Standards Track [Page 20] + +RFC 2396 URI Generic Syntax August 1998 + + + can then continue with the steps below for the remainder of the + reference components. Validating parsers should mark such a + misformed relative reference as an error. + + 4) If the authority component is defined, then the reference is a + network-path and we skip to step 7. Otherwise, the reference + URI's authority is inherited from the base URI's authority + component, which will also be undefined if the URI scheme does not + use an authority component. + + 5) If the path component begins with a slash character ("/"), then + the reference is an absolute-path and we skip to step 7. + + 6) If this step is reached, then we are resolving a relative-path + reference. The relative path needs to be merged with the base + URI's path. Although there are many ways to do this, we will + describe a simple method using a separate string buffer. + + a) All but the last segment of the base URI's path component is + copied to the buffer. In other words, any characters after the + last (right-most) slash character, if any, are excluded. + + b) The reference's path component is appended to the buffer + string. + + c) All occurrences of "./", where "." is a complete path segment, + are removed from the buffer string. + + d) If the buffer string ends with "." as a complete path segment, + that "." is removed. + + e) All occurrences of "<segment>/../", where <segment> is a + complete path segment not equal to "..", are removed from the + buffer string. Removal of these path segments is performed + iteratively, removing the leftmost matching pattern on each + iteration, until no matching pattern remains. + + f) If the buffer string ends with "<segment>/..", where <segment> + is a complete path segment not equal to "..", that + "<segment>/.." is removed. + + g) If the resulting buffer string still begins with one or more + complete path segments of "..", then the reference is + considered to be in error. 
Implementations may handle this + error by retaining these components in the resolved path (i.e., + treating them as part of the final URI), by removing them from + the resolved path (i.e., discarding relative levels above the + root), or by avoiding traversal of the reference. + + + +Berners-Lee, et. al. Standards Track [Page 21] + +RFC 2396 URI Generic Syntax August 1998 + + + h) The remaining buffer string is the reference URI's new path + component. + + 7) The resulting URI components, including any inherited from the + base URI, are recombined to give the absolute form of the URI + reference. Using pseudocode, this would be + + result = "" + + if scheme is defined then + append scheme to result + append ":" to result + + if authority is defined then + append "//" to result + append authority to result + + append path to result + + if query is defined then + append "?" to result + append query to result + + if fragment is defined then + append "#" to result + append fragment to result + + return result + + Note that we must be careful to preserve the distinction between a + component that is undefined, meaning that its separator was not + present in the reference, and a component that is empty, meaning + that the separator was present and was immediately followed by the + next component separator or the end of the reference. + + The above algorithm is intended to provide an example by which the + output of implementations can be tested -- implementation of the + algorithm itself is not required. For example, some systems may find + it more efficient to implement step 6 as a pair of segment stacks + being merged, rather than as a series of string pattern replacements. + + Note: Some WWW client applications will fail to separate the + reference's query component from its path component before merging + the base and reference paths in step 6 above. This may result in + a loss of information if the query component contains the strings + "/../" or "/./". 
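The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a required implementation: the names `merge_paths` and `resolve` are hypothetical, the splitting expression is written in the style of the one in Appendix B, leftover ".." segments are simply retained (one of the three options allowed in step 6g), a same-document reference is returned unchanged, and the base path is assumed to contain at least one "/".

```python
import re

# Splitting expression in the style of Appendix B; None means "undefined".
_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# "<segment>/.." where <segment> is a complete, non-empty segment other than ".."
_SEG_DOTDOT = r"(?<![^/])(?!\.\./)[^/]+/\.\."

def merge_paths(base_path, ref_path):
    """Step 6: merge a relative path with the base path using a string buffer."""
    buf = base_path[: base_path.rfind("/") + 1] + ref_path   # 6a and 6b
    buf = re.sub(r"(?<![^/])\./", "", buf)                   # 6c: drop "./"
    buf = re.sub(r"(?<![^/])\.\Z", "", buf)                  # 6d: drop trailing "."
    while True:                                              # 6e: leftmost first
        buf, count = re.subn(_SEG_DOTDOT + "/", "", buf, count=1)
        if count == 0:
            break
    buf = re.sub(_SEG_DOTDOT + r"\Z", "", buf)               # 6f
    return buf                                               # 6g: ".." retained; 6h

def resolve(base, ref):
    """Resolve ref against base following steps 1 to 7."""
    b_scheme, b_auth, b_path = _SPLIT.match(base).group(2, 4, 5)
    scheme, auth, path, query, frag = _SPLIT.match(ref).group(2, 4, 5, 7, 9)  # step 1
    if path == "" and scheme is None and auth is None and query is None:
        return ref                       # step 2: reference to the current document
    if scheme is not None:
        return ref                       # step 3: already an absolute URI
    scheme = b_scheme
    if auth is None:                     # step 4: inherit the authority
        auth = b_auth
        if not path.startswith("/"):     # step 5 not taken, so step 6 applies
            path = merge_paths(b_path, path)
    result = ""                          # step 7: recombine the components
    if scheme is not None:
        result += scheme + ":"
    if auth is not None:
        result += "//" + auth
    result += path
    if query is not None:
        result += "?" + query
    if frag is not None:
        result += "#" + frag
    return result

BASE = "http://a/b/c/d;p?q"              # illustrative base URI
print(resolve(BASE, "../g"))             # http://a/b/g
print(resolve(BASE, "g;x?y#s"))          # http://a/b/c/g;x?y#s
```

Because the query is split off in step 1 before any merging takes place, a reference such as "g?y/./x" keeps its query intact, which avoids the loss of information described in the note above.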
+ + Resolution examples are provided in Appendix C. + + + +Berners-Lee, et. al. Standards Track [Page 22] + +RFC 2396 URI Generic Syntax August 1998 + + +6. URI Normalization and Equivalence + + In many cases, different URI strings may actually identify the + identical resource. For example, the host names used in URL are + actually case insensitive, and the URL is + equivalent to . In general, the rules for + equivalence and definition of a normal form, if any, are scheme + dependent. When a scheme uses elements of the common syntax, it will + also use the common syntax equivalence rules, namely that the scheme + and hostname are case insensitive and a URL with an explicit ":port", + where the port is the default for the scheme, is equivalent to one + where the port is elided. + +7. Security Considerations + + A URI does not in itself pose a security threat. Users should beware + that there is no general guarantee that a URL, which at one time + located a given resource, will continue to do so. Nor is there any + guarantee that a URL will not locate a different resource at some + later point in time, due to the lack of any constraint on how a given + authority apportions its namespace. Such a guarantee can only be + obtained from the person(s) controlling that namespace and the + resource in question. A specific URI scheme may include additional + semantics, such as name persistence, if those semantics are required + of all naming authorities for that scheme. + + It is sometimes possible to construct a URL such that an attempt to + perform a seemingly harmless, idempotent operation, such as the + retrieval of an entity associated with the resource, will in fact + cause a possibly damaging remote operation to occur. The unsafe URL + is typically constructed by specifying a port number other than that + reserved for the network protocol in question. The client + unwittingly contacts a site that is in fact running a different + protocol. 
The content of the URL contains instructions that, when + interpreted according to this other protocol, cause an unexpected + operation. An example has been the use of a gopher URL to cause an + unintended or impersonating message to be sent via a SMTP server. + + Caution should be used when using any URL that specifies a port + number other than the default for the protocol, especially when it is + a number within the reserved space. + + Care should be taken when a URL contains escaped delimiters for a + given protocol (for example, CR and LF characters for telnet + protocols) that these are not unescaped before transmission. This + might violate the protocol, but avoids the potential for such + + + + + +Berners-Lee, et. al. Standards Track [Page 23] + +RFC 2396 URI Generic Syntax August 1998 + + + characters to be used to simulate an extra operation or parameter in + that protocol, which might lead to an unexpected and possibly harmful + remote operation to be performed. + + It is clearly unwise to use a URL that contains a password which is + intended to be secret. In particular, the use of a password within + the 'userinfo' component of a URL is strongly disrecommended except + in those rare cases where the 'password' parameter is intended to be + public. + +8. Acknowledgements + + This document was derived from RFC 1738 [RFC1738] and RFC 1808 + [RFC1808]; the acknowledgements in those specifications still apply. + In addition, contributions by Gisle Aas, Martin Beet, Martin Duerst, + Jim Gettys, Martijn Koster, Dave Kristol, Daniel LaLiberte, Foteos + Macrides, James Marshall, Ryan Moats, Keith Moore, and Lauren Wood + are gratefully acknowledged. + +9. References + + [RFC2277] Alvestrand, H., "IETF Policy on Character Sets and + Languages", BCP 18, RFC 2277, January 1998. 
+ + [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A + Unifying Syntax for the Expression of Names and Addresses + of Objects on the Network as used in the World-Wide Web", + RFC 1630, June 1994. + + [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, Editors, + "Uniform Resource Locators (URL)", RFC 1738, December 1994. + + [RFC1866] Berners-Lee T., and D. Connolly, "HyperText Markup Language + Specification -- 2.0", RFC 1866, November 1995. + + [RFC1123] Braden, R., Editor, "Requirements for Internet Hosts -- + Application and Support", STD 3, RFC 1123, October 1989. + + [RFC822] Crocker, D., "Standard for the Format of ARPA Internet Text + Messages", STD 11, RFC 822, August 1982. + + [RFC1808] Fielding, R., "Relative Uniform Resource Locators", RFC + 1808, June 1995. + + [RFC2046] Freed, N., and N. Borenstein, "Multipurpose Internet Mail + Extensions (MIME) Part Two: Media Types", RFC 2046, + November 1996. + + + + +Berners-Lee, et. al. Standards Track [Page 24] + +RFC 2396 URI Generic Syntax August 1998 + + + [RFC1736] Kunze, J., "Functional Recommendations for Internet + Resource Locators", RFC 1736, February 1995. + + [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. + + [RFC1034] Mockapetris, P., "Domain Names - Concepts and Facilities", + STD 13, RFC 1034, November 1987. + + [RFC2110] Palme, J., and A. Hopmann, "MIME E-mail Encapsulation of + Aggregate Documents, such as HTML (MHTML)", RFC 2110, March + 1997. + + [RFC1737] Sollins, K., and L. Masinter, "Functional Requirements for + Uniform Resource Names", RFC 1737, December 1994. + + [ASCII] US-ASCII. "Coded Character Set -- 7-bit American Standard + Code for Information Interchange", ANSI X3.4-1986. + + [UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", + RFC 2279, January 1998. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 25] + +RFC 2396 URI Generic Syntax August 1998 + + +10. 
Authors' Addresses + + Tim Berners-Lee + World Wide Web Consortium + MIT Laboratory for Computer Science, NE43-356 + 545 Technology Square + Cambridge, MA 02139 + + Fax: +1(617)258-8682 + EMail: timbl@w3.org + + + Roy T. Fielding + Department of Information and Computer Science + University of California, Irvine + Irvine, CA 92697-3425 + + Fax: +1(949)824-1715 + EMail: fielding@ics.uci.edu + + + Larry Masinter + Xerox PARC + 3333 Coyote Hill Road + Palo Alto, CA 94034 + + Fax: +1(415)812-4333 + EMail: masinter@parc.xerox.com + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 26] + +RFC 2396 URI Generic Syntax August 1998 + + +A. Collected BNF for URI + + URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ] + absoluteURI = scheme ":" ( hier_part | opaque_part ) + relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ] + + hier_part = ( net_path | abs_path ) [ "?" query ] + opaque_part = uric_no_slash *uric + + uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" | + "&" | "=" | "+" | "$" | "," + + net_path = "//" authority [ abs_path ] + abs_path = "/" path_segments + rel_path = rel_segment [ abs_path ] + + rel_segment = 1*( unreserved | escaped | + ";" | "@" | "&" | "=" | "+" | "$" | "," ) + + scheme = alpha *( alpha | digit | "+" | "-" | "." ) + + authority = server | reg_name + + reg_name = 1*( unreserved | escaped | "$" | "," | + ";" | ":" | "@" | "&" | "=" | "+" ) + + server = [ [ userinfo "@" ] hostport ] + userinfo = *( unreserved | escaped | + ";" | ":" | "&" | "=" | "+" | "$" | "," ) + + hostport = host [ ":" port ] + host = hostname | IPv4address + hostname = *( domainlabel "." ) toplabel [ "." ] + domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum + toplabel = alpha | alpha *( alphanum | "-" ) alphanum + IPv4address = 1*digit "." 1*digit "." 1*digit "." 
1*digit + port = *digit + + path = [ abs_path | opaque_part ] + path_segments = segment *( "/" segment ) + segment = *pchar *( ";" param ) + param = *pchar + pchar = unreserved | escaped | + ":" | "@" | "&" | "=" | "+" | "$" | "," + + query = *uric + + fragment = *uric + + + +Berners-Lee, et. al. Standards Track [Page 27] + +RFC 2396 URI Generic Syntax August 1998 + + + uric = reserved | unreserved | escaped + reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | + "$" | "," + unreserved = alphanum | mark + mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | + "(" | ")" + + escaped = "%" hex hex + hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | + "a" | "b" | "c" | "d" | "e" | "f" + + alphanum = alpha | digit + alpha = lowalpha | upalpha + + lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | + "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | + "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" + upalpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | + "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | + "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" + digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | + "8" | "9" + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 28] + +RFC 2396 URI Generic Syntax August 1998 + + +B. Parsing a URI Reference with a Regular Expression + + As described in Section 4.3, the generic URI syntax is not sufficient + to disambiguate the components of some forms of URI. Since the + "greedy algorithm" described in that section is identical to the + disambiguation method used by POSIX regular expressions, it is + natural and commonplace to use a regular expression for parsing the + potential four components and fragment identifier of a URI reference. + + The following line is the regular expression for breaking-down a URI + reference into its components. + + ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? 
+ 12 3 4 5 6 7 8 9 + + The numbers in the second line above are only to assist readability; + they indicate the reference points for each subexpression (i.e., each + paired parenthesis). We refer to the value matched for subexpression + <n> as $<n>. For example, matching the above expression to + + http://www.ics.uci.edu/pub/ietf/uri/#Related + + results in the following subexpression matches: + + $1 = http: + $2 = http + $3 = //www.ics.uci.edu + $4 = www.ics.uci.edu + $5 = /pub/ietf/uri/ + $6 = <undefined> + $7 = <undefined> + $8 = #Related + $9 = Related + + where <undefined> indicates that the component is not present, as is + the case for the query component in the above example. Therefore, we + can determine the value of the four components and fragment as + + scheme = $2 + authority = $4 + path = $5 + query = $7 + fragment = $9 + + and, going in the opposite direction, we can recreate a URI reference + from its components using the algorithm in step 7 of Section 5.2. + + + + + +Berners-Lee, et. al. Standards Track [Page 29] + +RFC 2396 URI Generic Syntax August 1998 + + +C. Examples of Resolving Relative URI References + + Within an object with a well-defined base URI of + + http://a/b/c/d;p?q + + the relative URI would be resolved as follows: + +C.1. Normal Examples + + g:h = g:h + g = http://a/b/c/g + ./g = http://a/b/c/g + g/ = http://a/b/c/g/ + /g = http://a/g + //g = http://g + ?y = http://a/b/c/?y + g?y = http://a/b/c/g?y + #s = (current document)#s + g#s = http://a/b/c/g#s + g?y#s = http://a/b/c/g?y#s + ;x = http://a/b/c/;x + g;x = http://a/b/c/g;x + g;x?y#s = http://a/b/c/g;x?y#s + . = http://a/b/c/ + ./ = http://a/b/c/ + .. = http://a/b/ + ../ = http://a/b/ + ../g = http://a/b/g + ../.. = http://a/ + ../../ = http://a/ + ../../g = http://a/g + +C.2. Abnormal Examples + + Although the following abnormal examples are unlikely to occur in + normal practice, all URI parsers should be capable of resolving them + consistently. Each example uses the same base as above. 
+ + An empty reference refers to the start of the current document. + + <> = (current document) + + Parsers must be careful in handling the case where there are more + relative path ".." segments than there are hierarchical levels in the + base URI's path. Note that the ".." syntax cannot be used to change + the authority component of a URI. + + + + +Berners-Lee, et. al. Standards Track [Page 30] + +RFC 2396 URI Generic Syntax August 1998 + + + ../../../g = http://a/../g + ../../../../g = http://a/../../g + + In practice, some implementations strip leading relative symbolic + elements (".", "..") after applying a relative URI calculation, based + on the theory that compensating for obvious author errors is better + than allowing the request to fail. Thus, the above two references + will be interpreted as "http://a/g" by some implementations. + + Similarly, parsers must avoid treating "." and ".." as special when + they are not complete components of a relative path. + + /./g = http://a/./g + /../g = http://a/../g + g. = http://a/b/c/g. + .g = http://a/b/c/.g + g.. = http://a/b/c/g.. + ..g = http://a/b/c/..g + + Less likely are cases where the relative URI uses unnecessary or + nonsensical forms of the "." and ".." complete path segments. + + ./../g = http://a/b/g + ./g/. = http://a/b/c/g/ + g/./h = http://a/b/c/g/h + g/../h = http://a/b/c/h + g;x=1/./y = http://a/b/c/g;x=1/y + g;x=1/../y = http://a/b/c/y + + All client applications remove the query component from the base URI + before resolving relative URI. However, some applications fail to + separate the reference's query and/or fragment components from a + relative path before merging it with the base path. This error is + rarely noticed, since typical usage of a fragment never includes the + hierarchy ("/") character, and the query component is not normally + used within relative references. 
+ + g?y/./x = http://a/b/c/g?y/./x + g?y/../x = http://a/b/c/g?y/../x + g#s/./x = http://a/b/c/g#s/./x + g#s/../x = http://a/b/c/g#s/../x + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 31] + +RFC 2396 URI Generic Syntax August 1998 + + + Some parsers allow the scheme name to be present in a relative URI if + it is the same as the base URI scheme. This is considered to be a + loophole in prior specifications of partial URI [RFC1630]. Its use + should be avoided. + + http:g = http:g ; for validating parsers + | http://a/b/c/g ; for backwards compatibility + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 32] + +RFC 2396 URI Generic Syntax August 1998 + + +D. Embedding the Base URI in HTML documents + + It is useful to consider an example of how the base URI of a document + can be embedded within the document's content. In this appendix, we + describe how documents written in the Hypertext Markup Language + (HTML) [RFC1866] can include an embedded base URI. This appendix + does not form a part of the URI specification and should not be + considered as anything more than a descriptive example. + + HTML defines a special element "BASE" which, when present in the + "HEAD" portion of a document, signals that the parser should use the + BASE element's "HREF" attribute as the base URI for resolving any + relative URI. The "HREF" attribute must be an absolute URI. Note + that, in HTML, element and attribute names are case-insensitive. For + example: + + + + An example HTML document + + + ... a hypertext anchor ... + + + A parser reading the example document should interpret the given + relative URI "../x" as representing the absolute URI + + + + regardless of the context in which the example document was obtained. + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 33] + +RFC 2396 URI Generic Syntax August 1998 + + +E. 
Recommendations for Delimiting URI in Context + + URI are often transmitted through formats that do not provide a clear + context for their interpretation. For example, there are many + occasions when URI are included in plain text; examples include text + sent in electronic mail, USENET news messages, and, most importantly, + printed on paper. In such cases, it is important to be able to + delimit the URI from the rest of the text, and in particular from + punctuation marks that might be mistaken for part of the URI. + + In practice, URI are delimited in a variety of ways, but usually + within double-quotes "http://test.com/", angle brackets + <http://test.com/>, or just using whitespace + + http://test.com/ + + These wrappers do not form part of the URI. + + In the case where a fragment identifier is associated with a URI + reference, the fragment would be placed within the brackets as well + (separated from the URI with a "#" character). + + In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may + need to be added to break long URI across lines. The whitespace + should be ignored when extracting the URI. + + No whitespace should be introduced after a hyphen ("-") character. + Because some typesetters and printers may (erroneously) introduce a + hyphen at the end of line when breaking a line, the interpreter of a + URI containing a line break immediately after a hyphen should ignore + all unescaped whitespace around the line break, and should be aware + that the hyphen may or may not actually be part of the URI. + + Using <> angle brackets around each URI is especially recommended as + a delimiting style for URI that contain whitespace. + + The prefix "URL:" (with or without a trailing space) was recommended + as a way to used to help distinguish a URL from other bracketed + designators, although this is not common in practice. + + For robustness, software that accepts user-typed URI should attempt + to recognize and strip both delimiters and embedded whitespace. 
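The stripping of delimiters and embedded whitespace recommended above might look like the following Python sketch. The function name `extract_uri` is hypothetical, the input is assumed to be a single already-isolated candidate rather than running text, and the hyphen-at-line-break ambiguity described above is deliberately not handled.

```python
import re


def extract_uri(text):
    """Strip Appendix E style wrappers and embedded whitespace from text."""
    s = text.strip()
    # Remove one matching pair of angle brackets or double quotes.
    if len(s) >= 2 and (s[0], s[-1]) in (("<", ">"), ('"', '"')):
        s = s[1:-1].strip()
    # Drop the optional "URL:" prefix (with or without a trailing space).
    if s.startswith("URL:"):
        s = s[4:]
    # Ignore whitespace that was added to break a long URI across lines.
    return re.sub(r"\s+", "", s)
```

For instance, `extract_uri('"http://test.com/"')` gives `http://test.com/`, and a bracketed URI broken across two lines is rejoined into one string.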
+ + For example, the text: + + + + + + + +Berners-Lee, et. al. Standards Track [Page 34] + +RFC 2396 URI Generic Syntax August 1998 + + + Yes, Jim, I found it under "http://www.w3.org/Addressing/", + but you can probably pick it up from <ftp://ds.internic. + net/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ + ietf/uri/historical.html#WARNING>. + + contains the URI references + + http://www.w3.org/Addressing/ + ftp://ds.internic.net/rfc/ + http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 35] + +RFC 2396 URI Generic Syntax August 1998 + + +F. Abbreviated URLs + + The URL syntax was designed for unambiguous reference to network + resources and extensibility via the URL scheme. However, as URL + identification and usage have become commonplace, traditional media + (television, radio, newspapers, billboards, etc.) have increasingly + used abbreviated URL references. That is, a reference consisting of + only the authority and path portions of the identified resource, such + as + + www.w3.org/Addressing/ + + or simply the DNS hostname on its own. Such references are primarily + intended for human interpretation rather than machine, with the + assumption that context-based heuristics are sufficient to complete + the URL (e.g., most hostnames beginning with "www" are likely to have + a URL prefix of "http://"). Although there is no standard set of + heuristics for disambiguating abbreviated URL references, many client + implementations allow them to be entered by the user and + heuristically resolved. It should be noted that such heuristics may + change over time, particularly when new URL schemes are introduced. + + Since an abbreviated URL has the same syntax as a relative URL path, + abbreviated URL references cannot be used in contexts where relative + URLs are expected. This limits the use of abbreviated URLs to places + where there is no defined base URL, such as dialog boxes and off-line + advertisements. 
+ + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 36] + +RFC 2396 URI Generic Syntax August 1998 + + +G. Summary of Non-editorial Changes + +G.1. Additions + + Section 4 (URI References) was added to stem the confusion regarding + "what is a URI" and how to describe fragment identifiers given that + they are not part of the URI, but are part of the URI syntax and + parsing concerns. In addition, it provides a reference definition + for use by other IETF specifications (HTML, HTTP, etc.) that have + previously attempted to redefine the URI syntax in order to account + for the presence of fragment identifiers in URI references. + + Section 2.4 was rewritten to clarify a number of misinterpretations + and to leave room for fully internationalized URI. + + Appendix F on abbreviated URLs was added to describe the shortened + references often seen on television and magazine advertisements and + explain why they are not used in other contexts. + +G.2. Modifications from both RFC 1738 and RFC 1808 + + Changed to URI syntax instead of just URL. + + Confusion regarding the terms "character encoding", the URI + "character set", and the escaping of characters with % + equivalents has (hopefully) been reduced. Many of the BNF rule names + regarding the character sets have been changed to more accurately + describe their purpose and to encompass all "characters" rather than + just US-ASCII octets. Unless otherwise noted here, these + modifications do not affect the URI syntax. + + Both RFC 1738 and RFC 1808 refer to the "reserved" set of characters + as if URI-interpreting software were limited to a single set of + characters with a reserved purpose (i.e., as meaning something other + than the data to which the characters correspond), and that this set + was fixed by the URI scheme. However, this has not been true in + practice; any character that is interpreted differently when it is + escaped is, in effect, reserved. 
Furthermore, the interpreting + engine on a HTTP server is often dependent on the resource, not just + the URI scheme. The description of reserved characters has been + changed accordingly. + + The plus "+", dollar "$", and comma "," characters have been added to + those in the "reserved" set, since they are treated as reserved + within the query component. + + + + + + +Berners-Lee, et. al. Standards Track [Page 37] + +RFC 2396 URI Generic Syntax August 1998 + + + The tilde "~" character was added to those in the "unreserved" set, + since it is extensively used on the Internet in spite of the + difficulty to transcribe it with some keyboards. + + The syntax for URI scheme has been changed to require that all + schemes begin with an alpha character. + + The "user:password" form in the previous BNF was changed to a + "userinfo" token, and the possibility that it might be + "user:password" made scheme specific. In particular, the use of + passwords in the clear is not even suggested by the syntax. + + The question-mark "?" character was removed from the set of allowed + characters for the userinfo in the authority component, since testing + showed that many applications treat it as reserved for separating the + query component from the rest of the URI. + + The semicolon ";" character was added to those stated as being + reserved within the authority component, since several new schemes + are using it as a separator within userinfo to indicate the type of + user authentication. + + RFC 1738 specified that the path was separated from the authority + portion of a URI by a slash. RFC 1808 followed suit, but with a + fudge of carrying around the separator as a "prefix" in order to + describe the parsing algorithm. RFC 1630 never had this problem, + since it considered the slash to be part of the path. 
In writing + this specification, it was found to be impossible to accurately + describe and retain the difference between the two URI + and + without either considering the slash to be part of the path (as + corresponds to actual practice) or creating a separate component just + to hold that slash. We chose the former. + +G.3. Modifications from RFC 1738 + + The definition of specific URL schemes and their scheme-specific + syntax and semantics has been moved to separate documents. + + The URL host was defined as a fully-qualified domain name. However, + many URLs are used without fully-qualified domain names (in contexts + for which the full qualification is not necessary), without any host + (as in some file URLs), or with a host of "localhost". + + The URL port is now *digit instead of 1*digit, since systems are + expected to handle the case where the ":" separator between host and + port is supplied without a port. + + + + +Berners-Lee, et. al. Standards Track [Page 38] + +RFC 2396 URI Generic Syntax August 1998 + + + The recommendations for delimiting URI in context (Appendix E) have + been adjusted to reflect current practice. + +G.4. Modifications from RFC 1808 + + RFC 1808 (Section 4) defined an empty URL reference (a reference + containing nothing aside from the fragment identifier) as being a + reference to the base URL. Unfortunately, that definition could be + interpreted, upon selection of such a reference, as a new retrieval + action on that resource. Since the normal intent of such references + is for the user agent to change its view of the current document to + the beginning of the specified fragment within that document, not to + make an additional request of the resource, a description of how to + correctly interpret an empty reference has been added in Section 4. + + The description of the mythical Base header field has been replaced + with a reference to the Content-Location header field defined by + MHTML [RFC2110]. 
+ + RFC 1808 described various schemes as either having or not having the + properties of the generic URI syntax. However, the only requirement + is that the particular document containing the relative references + have a base URI that abides by the generic URI syntax, regardless of + the URI scheme, so the associated description has been updated to + reflect that. + + The BNF term has been replaced with , since the + latter more accurately describes its use and purpose. Likewise, the + authority is no longer restricted to the IP server syntax. + + Extensive testing of current client applications demonstrated that + the majority of deployed systems do not use the ";" character to + indicate trailing parameter information, and that the presence of a + semicolon in a path segment does not affect the relative parsing of + that segment. Therefore, parameters have been removed as a separate + component and may now appear in any path segment. Their influence + has been removed from the algorithm for resolving a relative URI + reference. The resolution examples in Appendix C have been modified + to reflect this change. + + Implementations are now allowed to work around misformed relative + references that are prefixed by the same scheme as the base URI, but + only for schemes known to use the syntax. + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 39] + +RFC 2396 URI Generic Syntax August 1998 + + +H. Full Copyright Statement + + Copyright (C) The Internet Society (1998). All Rights Reserved. + + This document and translations of it may be copied and furnished to + others, and derivative works that comment on or otherwise explain it + or assist in its implementation may be prepared, copied, published + and distributed, in whole or in part, without restriction of any + kind, provided that the above copyright notice and this paragraph are + included on all such copies and derivative works. 
However, this + document itself may not be modified in any way, such as by removing + the copyright notice or references to the Internet Society or other + Internet organizations, except as needed for the purpose of + developing Internet standards in which case the procedures for + copyrights defined in the Internet Standards process must be + followed, or as required to translate it into languages other than + English. + + The limited permissions granted above are perpetual and will not be + revoked by the Internet Society or its successors or assigns. + + This document and the information contained herein is provided on an + "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING + TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING + BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION + HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF + MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + + + + + + + + + + + + + + + + + + + + + + + + +Berners-Lee, et. al. Standards Track [Page 40] + diff --git a/doc/rfc822.scm.doc b/doc/rfc822.scm.doc new file mode 100644 index 0000000..a2e38c7 --- /dev/null +++ b/doc/rfc822.scm.doc @@ -0,0 +1,161 @@ +This file documents names defined in rfc822.scm: + + + + +NOTES + + + +A note on line-terminators: + +Line-terminating sequences are always a drag, because there's no +agreement on them -- the Net protocols and DOS use cr/lf; Unix uses +lf; the Mac uses cr. On the one hand, you'd like to use the code for all +of the above; on the other, you'd also like to use the code for strict +applications that definitely must not recognise bare cr's or lf's +as terminators. + +RFC 822 requires a cr/lf (carriage-return/line-feed) pair to terminate +lines of text. On the other hand, careful perusal of the text shows up +some ambiguities (there are maybe three or four of these, and I'm too +lazy to write them all down). 
Furthermore, it is an unfortunate fact +that many Unix apps separate lines of RFC 822 text with simple +linefeeds (e.g., messages kept in /usr/spool/mail). As a result, this +code takes a broad-minded view of line-terminators: lines can be +terminated by either cr/lf or just lf, and either terminating sequence +is trimmed. + +If you need stricter parsing, you can call the lower-level procedures +%READ-RFC822-FIELD and %READ-RFC822-HEADERS. They take the +read-line procedure as an extra parameter. This means that you can +pass in a procedure that recognises only cr/lf's, or only cr's (for a +Mac app, perhaps), and you can determine whether or not the +terminators get trimmed. However, your read-line procedure must +indicate the header-terminating empty line by returning *either* the +empty string or the two-char string cr/lf (or the EOF object). + + + + +DEFINITIONS AND DESCRIPTIONS + + + +(read-rfc822-field [port]) +(%read-rfc822-field read-line port) + +Read one field from the port, and return two values [NAME BODY]: + + - NAME Symbol such as 'subject or 'to. The field name is converted + to a symbol using the Scheme implementation's preferred + case. If the implementation reads symbols in a case-sensitive + fashion (e.g., scsh), lowercase is used. This means you can + compare these symbols to quoted constants using EQ?. When + printing these field names out, it looks best if you capitalise + them with (CAPITALIZE-STRING (SYMBOL->STRING FIELD-NAME)). + + - BODY List of strings which are the field's body, e.g. + ("shivers@lcs.mit.edu"). Each list element is one line from + the field's body, so if the field spreads out over three lines, + then the body is a list of three strings. The terminating + cr/lf's are trimmed from each string. A leading space or a + leading horizontal tab is also trimmed, but one and only one. + +When there are no more fields -- EOF or a blank line has terminated +the header section -- then the procedure returns [#f #f]. 
+ +The %READ-RFC822-FIELD variant allows you to specify your own +read-line procedure. The one used by READ-RFC822-FIELD terminates +lines with either cr/lf or just lf, and it trims the terminator from +the line. Your read-line procedure should trim the terminator of the +line, so an empty line is returned as an empty string. + +The procedures raise an error if the syntax of the field read (the +line returned by the read-line procedure) is illegal under RFC 822. + + + +read-rfc822-headers [port] +%read-rfc822-headers read-line port + +Read in and parse up a section of text that looks like the header +portion of an RFC 822 message. Return an alist mapping a field name (a +symbol such as 'date or 'subject) to a list of field bodies -- one for +each occurrence of the field in the header. So if there are five +"Received-by:" fields in the header, the alist maps 'received-by to a +five element list. Each body is in turn represented by a list of +strings -- one for each line of the field. So a field spread across +three lines would produce a three element body. + +The %READ-RFC822-HEADERS variant allows you to specify your own +read-line procedure. See notes (A note on line-terminators) above for +reasons why. + + + +rejoin-header-lines alist [separator] + +Takes a field alist such as is returned by READ-RFC822-HEADERS and +returns an equivalent alist. Each body (string list) in the input +alist is joined into a single string in the output alist. SEPARATOR is +the string used to join these elements together; it defaults to a +single space " ", but can usefully be "\n" or "\r\n". + +To rejoin a single body list, use scsh's JOIN-STRINGS procedure. + + + +For the following definitions' examples, let's use this set of +RFC822 headers: + From: shivers + To: ziggy, + newts + To: gjs, tk + + + +get-header-all headers name + +returns all entries or #f, e.g. 
+(get-header-all hdrs 'to) -> ((" ziggy," " newts") (" gjs, tk")) + + + +get-header-lines headers name + +returns all lines of the first entry or #f, e.g. +(get-header-lines hdrs 'to) -> (" ziggy," " newts") + + + +get-header headers name [separator] + +returns the first entry with the lines joined together by separator +(newline by default (\n)), e.g. +(get-header hdrs 'to) -> "ziggy,\n newts" + + + +htab + +is the horizontal tab (ascii-code 9) + + + +string->symbol-pref + +is a procedure that takes a string and converts it to a symbol +using the Scheme implementation's preferred case. The preferred case +is determined by doing a symbol->string conversion of 'a. + + + + +DESIRABLE FUNCTIONALITIES + + - Unfolding long lines. + - Lexing structured fields. + - Unlexing structured fields into canonical form. + - Parsing and unparsing dates. + - Parsing and unparsing addresses. diff --git a/doc/rfc822.txt b/doc/rfc822.txt new file mode 100644 index 0000000..35b09a3 --- /dev/null +++ b/doc/rfc822.txt @@ -0,0 +1,2901 @@ + + + + + + + RFC # 822 + + Obsoletes: RFC #733 (NIC #41952) + + + + + + + + + + + + + STANDARD FOR THE FORMAT OF + + ARPA INTERNET TEXT MESSAGES + + + + + + + August 13, 1982 + + + + + + + Revised by + + David H. Crocker + + + Dept. of Electrical Engineering + University of Delaware, Newark, DE 19711 + Network: DCrocker @ UDel-Relay + + + + + + + + + + + + + + + + Standard for ARPA Internet Text Messages + + + TABLE OF CONTENTS + + + PREFACE .................................................... ii + + 1. INTRODUCTION ........................................... 1 + + 1.1. Scope ............................................ 1 + 1.2. Communication Framework .......................... 2 + + 2. NOTATIONAL CONVENTIONS ................................. 3 + + 3. LEXICAL ANALYSIS OF MESSAGES ........................... 5 + + 3.1. General Description .............................. 5 + 3.2. Header Field Definitions ......................... 9 + 3.3. 
Lexical Tokens ................................... 10 + 3.4. Clarifications ................................... 11 + + 4. MESSAGE SPECIFICATION .................................. 17 + + 4.1. Syntax ........................................... 17 + 4.2. Forwarding ....................................... 19 + 4.3. Trace Fields ..................................... 20 + 4.4. Originator Fields ................................ 21 + 4.5. Receiver Fields .................................. 23 + 4.6. Reference Fields ................................. 23 + 4.7. Other Fields ..................................... 24 + + 5. DATE AND TIME SPECIFICATION ............................ 26 + + 5.1. Syntax ........................................... 26 + 5.2. Semantics ........................................ 26 + + 6. ADDRESS SPECIFICATION .................................. 27 + + 6.1. Syntax ........................................... 27 + 6.2. Semantics ........................................ 27 + 6.3. Reserved Address ................................. 33 + + 7. BIBLIOGRAPHY ........................................... 34 + + + APPENDIX + + A. EXAMPLES ............................................... 36 + B. SIMPLE FIELD PARSING ................................... 40 + C. DIFFERENCES FROM RFC #733 .............................. 41 + D. ALPHABETICAL LISTING OF SYNTAX RULES ................... 44 + + + August 13, 1982 - i - RFC #822 + + + + + Standard for ARPA Internet Text Messages + + + PREFACE + + + By 1977, the Arpanet employed several informal standards for + the text messages (mail) sent among its host computers. It was + felt necessary to codify these practices and provide for those + features that seemed imminent. The result of that effort was + Request for Comments (RFC) #733, "Standard for the Format of ARPA + Network Text Message", by Crocker, Vittal, Pogran, and Henderson. 
+ The specification attempted to avoid major changes in existing + software, while permitting several new features. + + This document revises the specifications in RFC #733, in + order to serve the needs of the larger and more complex ARPA + Internet. Some of RFC #733's features failed to gain adequate + acceptance. In order to simplify the standard and the software + that follows it, these features have been removed. A different + addressing scheme is used, to handle the case of inter-network + mail; and the concept of re-transmission has been introduced. + + This specification is intended for use in the ARPA Internet. + However, an attempt has been made to free it of any dependence on + that environment, so that it can be applied to other network text + message systems. + + The specification of RFC #733 took place over the course of + one year, using the ARPANET mail environment, itself, to provide + an on-going forum for discussing the capabilities to be included. + More than twenty individuals, from across the country, partici- + pated in the original discussion. The development of this + revised specification has, similarly, utilized network mail-based + group discussion. Both specification efforts greatly benefited + from the comments and ideas of the participants. + + The syntax of the standard, in RFC #733, was originally + specified in the Backus-Naur Form (BNF) meta-language. Ken L. + Harrenstien, of SRI International, was responsible for re-coding + the BNF into an augmented BNF that makes the representation + smaller and easier to understand. + + + + + + + + + + + + + August 13, 1982 - ii - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 1. INTRODUCTION + + 1.1. SCOPE + + This standard specifies a syntax for text messages that are + sent among computer users, within the framework of "electronic + mail". 
The standard supersedes the one specified in ARPANET + Request for Comments #733, "Standard for the Format of ARPA Net- + work Text Messages". + + In this context, messages are viewed as having an envelope + and contents. The envelope contains whatever information is + needed to accomplish transmission and delivery. The contents + compose the object to be delivered to the recipient. This stan- + dard applies only to the format and some of the semantics of mes- + sage contents. It contains no specification of the information + in the envelope. + + However, some message systems may use information from the + contents to create the envelope. It is intended that this stan- + dard facilitate the acquisition of such information by programs. + + Some message systems may store messages in formats that + differ from the one specified in this standard. This specifica- + tion is intended strictly as a definition of what message content + format is to be passed BETWEEN hosts. + + Note: This standard is NOT intended to dictate the internal for- + mats used by sites, the specific message system features + that they are expected to support, or any of the charac- + teristics of user interface programs that create or read + messages. + + A distinction should be made between what the specification + REQUIRES and what it ALLOWS. Messages can be made complex and + rich with formally-structured components of information or can be + kept small and simple, with a minimum of such information. Also, + the standard simplifies the interpretation of differing visual + formats in messages; only the visual aspect of a message is + affected and not the interpretation of information within it. + Implementors may choose to retain such visual distinctions. + + The formal definition is divided into four levels. The bot- + tom level describes the meta-notation used in this document. The + second level describes basic lexical analyzers that feed tokens + to higher-level parsers. 
Next is an overall specification for + messages; it permits distinguishing individual fields. Finally, + there is definition of the contents of several structured fields. + + + + August 13, 1982 - 1 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 1.2. COMMUNICATION FRAMEWORK + + Messages consist of lines of text. No special provisions + are made for encoding drawings, facsimile, speech, or structured + text. No significant consideration has been given to questions + of data compression or to transmission and storage efficiency, + and the standard tends to be free with the number of bits con- + sumed. For example, field names are specified as free text, + rather than special terse codes. + + A general "memo" framework is used. That is, a message con- + sists of some information in a rigid format, followed by the main + part of the message, with a format that is not specified in this + document. The syntax of several fields of the rigidly-formated + ("headers") section is defined in this specification; some of + these fields must be included in all messages. + + The syntax that distinguishes between header fields is + specified separately from the internal syntax for particular + fields. This separation is intended to allow simple parsers to + operate on the general structure of messages, without concern for + the detailed structure of individual header fields. Appendix B + is provided to facilitate construction of these parsers. + + In addition to the fields specified in this document, it is + expected that other fields will gain common use. As necessary, + the specifications for these "extension-fields" will be published + through the same mechanism used to publish this document. Users + may also wish to extend the set of fields that they use + privately. Such "user-defined fields" are permitted. 
+ + The framework severely constrains document tone and appear- + ance and is primarily useful for most intra-organization communi- + cations and well-structured inter-organization communication. + It also can be used for some types of inter-process communica- + tion, such as simple file transfer and remote job entry. A more + robust framework might allow for multi-font, multi-color, multi- + dimension encoding of information. A less robust one, as is + present in most single-machine message systems, would more + severely constrain the ability to add fields and the decision to + include specific fields. In contrast with paper-based communica- + tion, it is interesting to note that the RECEIVER of a message + can exercise an extraordinary amount of control over the + message's appearance. The amount of actual control available to + message receivers is contingent upon the capabilities of their + individual message systems. + + + + + + August 13, 1982 - 2 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 2. NOTATIONAL CONVENTIONS + + This specification uses an augmented Backus-Naur Form (BNF) + notation. The differences from standard BNF involve naming rules + and indicating repetition and "local" alternatives. + + 2.1. RULE NAMING + + Angle brackets ("<", ">") are not used, in general. The + name of a rule is simply the name itself, rather than "<name>". + Quotation-marks enclose literal text (which may be upper and/or + lower case). Certain basic rules are in uppercase, such as + SPACE, TAB, CRLF, DIGIT, ALPHA, etc. Angle brackets are used in + rule definitions, and in the rest of this document, whenever + their presence will facilitate discerning the use of rule names. + + 2.2. RULE1 / RULE2: ALTERNATIVES + + Elements separated by slash ("/") are alternatives. There- + fore "foo / bar" will accept foo or bar. + + 2.3. (RULE1 RULE2): LOCAL ALTERNATIVES + + Elements enclosed in parentheses are treated as a single + element. 
Thus, "(elem (foo / bar) elem)" allows the token + sequences "elem foo elem" and "elem bar elem". + + 2.4. *RULE: REPETITION + + The character "*" preceding an element indicates repetition. + The full form is: + + <l>*<m>element + + indicating at least <l> and at most <m> occurrences of element. + Default values are 0 and infinity so that "*(element)" allows any + number, including zero; "1*element" requires at least one; and + "1*2element" allows one or two. + + 2.5. [RULE]: OPTIONAL + + Square brackets enclose optional elements; "[foo bar]" is + equivalent to "*1(foo bar)". + + 2.6. NRULE: SPECIFIC REPETITION + + "<n>(element)" is equivalent to "<n>*<n>(element)"; that is, + exactly <n> occurrences of (element). Thus 2DIGIT is a 2-digit + number, and 3ALPHA is a string of three alphabetic characters. + + + August 13, 1982 - 3 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 2.7. #RULE: LISTS + + A construct "#" is defined, similar to "*", as follows: + + <l>#<m>element + + indicating at least <l> and at most <m> elements, each separated + by one or more commas (","). This makes the usual form of lists + very easy; a rule such as '(element *("," element))' can be shown + as "1#element". Wherever this construct is used, null elements + are allowed, but do not contribute to the count of elements + present. That is, "(element),,(element)" is permitted, but + counts as only two elements. Therefore, where at least one ele- + ment is required, at least one non-null element must be present. + Default values are 0 and infinity so that "#(element)" allows any + number, including zero; "1#element" requires at least one; and + "1#2element" allows one or two. + + 2.8. ; COMMENTS + + A semi-colon, set off some distance to the right of rule + text, starts a comment that continues to the end of line. This + is a simple way of including useful notes in parallel with the + specifications. 
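The counting rule for the "#" list construct in section 2.7 (null elements are allowed but do not count) can be illustrated with a short Python sketch. The function names `list_elements` and `satisfies` are hypothetical, and the split on commas is deliberately naive: it assumes the elements contain no quoted-strings or comments that themselves hold commas.

```python
def list_elements(text):
    """Split a comma-separated list and drop null (empty) elements."""
    return [part.strip() for part in text.split(",") if part.strip()]


def satisfies(text, low=0, high=None):
    """Check the number of non-null elements against a lower bound `low`
    and an optional upper bound `high` (None means no upper limit)."""
    count = len(list_elements(text))
    return count >= low and (high is None or count <= high)


if __name__ == "__main__":
    print(list_elements("alpha,,beta"))
    print(satisfies(",,", low=1))
    print(satisfies("alpha, beta", low=1, high=2))
```

Under this reading, "alpha,,beta" counts as two elements, and a list made only of commas fails a "1#element" rule because no non-null element is present.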
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + August 13, 1982 - 4 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 3. LEXICAL ANALYSIS OF MESSAGES + + 3.1. GENERAL DESCRIPTION + + A message consists of header fields and, optionally, a body. + The body is simply a sequence of lines containing ASCII charac- + ters. It is separated from the headers by a null line (i.e., a + line with nothing preceding the CRLF). + + 3.1.1. LONG HEADER FIELDS + + Each header field can be viewed as a single, logical line of + ASCII characters, comprising a field-name and a field-body. + For convenience, the field-body portion of this conceptual + entity can be split into a multiple-line representation; this + is called "folding". The general rule is that wherever there + may be linear-white-space (NOT simply LWSP-chars), a CRLF + immediately followed by AT LEAST one LWSP-char may instead be + inserted. Thus, the single line + + To: "Joe & J. Harvey" <ddd @Org>, JJV @ BBN + + can be represented as: + + To: "Joe & J. Harvey" <ddd @ Org>, + JJV@BBN + + and + + To: "Joe & J. Harvey" + <ddd@ Org>, JJV + @BBN + + and + + To: "Joe & + J. Harvey" <ddd @ Org>, JJV @ BBN + + The process of moving from this folded multiple-line + representation of a header field to its single line represen- + tation is called "unfolding". Unfolding is accomplished by + regarding CRLF immediately followed by a LWSP-char as + equivalent to the LWSP-char. + + Note: While the standard permits folding wherever linear- + white-space is permitted, it is recommended that struc- + tured fields, such as those containing addresses, limit + folding to higher-level syntactic breaks. For address + fields, it is recommended that such folding occur + + + August 13, 1982 - 5 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + between addresses, after the separating comma. + + 3.1.2. 
STRUCTURE OF HEADER FIELDS + + Once a field has been unfolded, it may be viewed as being com- + posed of a field-name followed by a colon (":"), followed by a + field-body, and terminated by a carriage-return/line-feed. + The field-name must be composed of printable ASCII characters + (i.e., characters that have values between 33. and 126., + decimal, except colon). The field-body may be composed of any + ASCII characters, except CR or LF. (While CR and/or LF may be + present in the actual text, they are removed by the action of + unfolding the field.) + + Certain field-bodies of headers may be interpreted according + to an internal syntax that some systems may wish to parse. + These fields are called "structured fields". Examples + include fields containing dates and addresses. Other fields, + such as "Subject" and "Comments", are regarded simply as + strings of text. + + Note: Any field which has a field-body that is defined as + other than simply <text> is to be treated as a struc- + tured field. + + Field-names, unstructured field bodies and structured + field bodies each are scanned by their own, independent + "lexical" analyzers. + + 3.1.3. UNSTRUCTURED FIELD BODIES + + For some fields, such as "Subject" and "Comments", no struc- + turing is assumed, and they are treated simply as <text>s, as + in the message body. Rules of folding apply to these fields, + so that such field bodies which occupy several lines must + therefore have the second and successive lines indented by at + least one LWSP-char. + + 3.1.4. STRUCTURED FIELD BODIES + + To aid in the creation and reading of structured fields, the + free insertion of linear-white-space (which permits folding + by inclusion of CRLFs) is allowed between lexical tokens. + Rather than obscuring the syntax specifications for these + structured fields with explicit syntax for this linear-white- + space, the existence of another "lexical" analyzer is assumed. 
+ This analyzer does not apply for unstructured field bodies + that are simply strings of text, as described above. The + analyzer provides an interpretation of the unfolded text + + + August 13, 1982 - 6 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + composing the body of the field as a sequence of lexical sym- + bols. + + These symbols are: + + - individual special characters + - quoted-strings + - domain-literals + - comments + - atoms + + The first four of these symbols are self-delimiting. Atoms + are not; they are delimited by the self-delimiting symbols and + by linear-white-space. For the purposes of regenerating + sequences of atoms and quoted-strings, exactly one SPACE is + assumed to exist, and should be used, between them. (Also, in + the "Clarifications" section on "White Space", below, note the + rules about treatment of multiple contiguous LWSP-chars.) + + So, for example, the folded body of an address field + + ":sysmail"@ Some-Group. Some-Org, + Muhammed.(I am the greatest) Ali @(the)Vegas.WBA + + + + + + + + + + + + + + + + + + + + + + + + + + + + + August 13, 1982 - 7 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + is analyzed into the following lexical symbols and types: + + :sysmail quoted string + @ special + Some-Group atom + . special + Some-Org atom + , special + Muhammed atom + . special + (I am the greatest) comment + Ali atom + @ special + (the) comment + Vegas atom + . special + WBA atom + + The canonical representations for the data in these addresses + are the following strings: + + ":sysmail"@Some-Group.Some-Org + + and + + Muhammed.Ali@Vegas.WBA + + Note: For purposes of display, and when passing such struc- + tured information to other systems, such as mail proto- + col services, there must be NO linear-white-space + between <word>s that are separated by period (".") or + at-sign ("@") and exactly one SPACE between all other + <word>s. Also, headers should be in a folded form. 
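The lexical analysis worked through in the example above can be sketched as a small Python tokenizer. This is a simplified illustration under stated assumptions, not a complete RFC 822 lexer: the names `lex` and `canonical` are hypothetical, the input is assumed to be already unfolded, control characters other than tab are not rejected inside atoms, and every "@" is classified as a special, following the specials rule in section 3.3.

```python
SPECIALS = set('()<>@,;:\\".[]')


def lex(body):
    """Split an unfolded structured field body into (text, type) pairs."""
    tokens, i, n = [], 0, len(body)
    while i < n:
        ch = body[i]
        if ch in " \t":                          # white space only separates tokens
            i += 1
        elif ch == '"':                          # quoted-string
            j, chars = i + 1, []
            while j < n and body[j] != '"':
                if body[j] == "\\" and j + 1 < n:
                    j += 1                       # quoted-pair: keep the quoted char
                chars.append(body[j])
                j += 1
            if j >= n:
                raise ValueError("unterminated quoted-string")
            tokens.append(("".join(chars), "quoted-string"))
            i = j + 1
        elif ch == "(":                          # comment; comments may nest
            depth, j = 1, i + 1
            while j < n and depth:
                if body[j] == "\\" and j + 1 < n:
                    j += 1                       # skip the quoted character
                elif body[j] == "(":
                    depth += 1
                elif body[j] == ")":
                    depth -= 1
                j += 1
            if depth:
                raise ValueError("unterminated comment")
            tokens.append((body[i:j], "comment"))
            i = j
        elif ch == "[":                          # domain-literal
            j = i + 1
            while j < n and body[j] != "]":
                j += 2 if body[j] == "\\" else 1
            if j >= n:
                raise ValueError("unterminated domain-literal")
            tokens.append((body[i:j + 1], "domain-literal"))
            i = j + 1
        elif ch in SPECIALS:                     # individual special character
            tokens.append((ch, "special"))
            i += 1
        else:                                    # atom: run up to the next delimiter
            j = i
            while j < n and body[j] not in SPECIALS and body[j] not in " \t":
                j += 1
            tokens.append((body[i:j], "atom"))
            i = j
    return tokens


def canonical(tokens):
    """Regenerate comma-separated items: comments are dropped, adjacent
    words get exactly one space, and nothing is put around specials."""
    items, parts, prev_word = [], [], False
    for text, kind in tokens:
        if kind == "comment":
            continue
        if kind == "special" and text == ",":
            if parts:
                items.append("".join(parts))
            parts, prev_word = [], False
            continue
        is_word = kind in ("atom", "quoted-string")
        if kind == "quoted-string":              # restore quotes and quoting
            text = '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'
        if is_word and prev_word:
            parts.append(" ")
        parts.append(text)
        prev_word = is_word
    if parts:
        items.append("".join(parts))
    return items


if __name__ == "__main__":
    field = ('":sysmail"@ Some-Group. Some-Org, '
             'Muhammed.(I am the greatest) Ali @(the)Vegas.WBA')
    for token in lex(field):
        print(token)
    print(canonical(lex(field)))
```

Keeping the tokenizer separate from the regeneration step follows the layering the text describes: the lexer only classifies symbols, and the choice of what to emit between them is made afterwards.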
+ + + + + + + + + + + + + + + + + + + August 13, 1982 - 8 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 3.2. HEADER FIELD DEFINITIONS + + These rules show a field meta-syntax, without regard for the + particular type or internal syntax. Their purpose is to permit + detection of fields; also, they present to higher-level parsers + an image of each field as fitting on one line. + + field = field-name ":" [ field-body ] CRLF + + field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":"> + + field-body = field-body-contents + [CRLF LWSP-char field-body] + + field-body-contents = + <the ASCII characters making up the field-body, as + defined in the following sections, and consisting + of combinations of atom, quoted-string, and + specials tokens, or else consisting of texts> + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + August 13, 1982 - 9 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 3.3. LEXICAL TOKENS + + The following rules are used to define an underlying lexical + analyzer, which feeds tokens to higher level parsers. See the + ANSI references, in the Bibliography. + + ; ( Octal, Decimal.) + CHAR = <any ASCII character> ; ( 0-177, 0.-127.) + ALPHA = <any ASCII alphabetic character> + ; (101-132, 65.- 90.) + ; (141-172, 97.-122.) + DIGIT = <any ASCII decimal digit> ; ( 60- 71, 48.- 57.) + CTL = <any ASCII control ; ( 0- 37, 0.- 31.) + character and DEL> ; ( 177, 127.) + CR = <ASCII CR, carriage return> ; ( 15, 13.) + LF = <ASCII LF, linefeed> ; ( 12, 10.) + SPACE = <ASCII SP, space> ; ( 40, 32.) + HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.) + <"> = <ASCII quote mark> ; ( 42, 34.) + CRLF = CR LF + + LWSP-char = SPACE / HTAB ; semantics = SPACE + + linear-white-space = 1*([CRLF] LWSP-char) ; semantics = SPACE + ; CRLF => folding + + specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted- + / "," / ";" / ":" / "\" / <"> ; string, to use + / "." / "[" / "]" ; within a word. + + delimiters = specials / linear-white-space / comment + + text = <any CHAR, including bare ; => atoms, specials, + CR & bare LF, but NOT ; comments and + including CRLF> ; quoted-strings are + ; NOT recognized. + + atom = 1*<any CHAR except specials, SPACE and CTLs> + + quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or + ; quoted chars. 
+ + qtext = <any CHAR excepting <">, ; => may be folded + "\" & CR, and including + linear-white-space> + + domain-literal = "[" *(dtext / quoted-pair) "]" + + + + + August 13, 1982 - 10 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + dtext = <any CHAR excluding "[", ; => may be folded + "]", "\" & CR, & including + linear-white-space> + + comment = "(" *(ctext / quoted-pair / comment) ")" + + ctext = <any CHAR excluding "(", ; => may be folded + ")", "\" & CR, & including + linear-white-space> + + quoted-pair = "\" CHAR ; may quote any char + + phrase = 1*word ; Sequence of words + + word = atom / quoted-string + + + 3.4. CLARIFICATIONS + + 3.4.1. QUOTING + + Some characters are reserved for special interpretation, such + as delimiting lexical tokens. To permit use of these charac- + ters as uninterpreted data, a quoting mechanism is provided. + To quote a character, precede it with a backslash ("\"). + + This mechanism is not fully general. Characters may be quoted + only within a subset of the lexical constructs. In particu- + lar, quoting is limited to use within: + + - quoted-string + - domain-literal + - comment + + Within these constructs, quoting is REQUIRED for CR and "\" + and for the character(s) that delimit the token (e.g., "(" and + ")" for a comment). However, quoting is PERMITTED for any + character. + + Note: In particular, quoting is NOT permitted within atoms. + For example when the local-part of an addr-spec must + contain a special character, a quoted string must be + used. Therefore, a specification such as: + + Full\ Name@Domain + + is not legal and must be specified as: + + "Full Name"@Domain + + + August 13, 1982 - 11 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 3.4.2. WHITE SPACE + + Note: In structured field bodies, multiple linear space ASCII + characters (namely HTABs and SPACEs) are treated as + single spaces and may freely surround any symbol. 
In + all header fields, the only place in which at least one + LWSP-char is REQUIRED is at the beginning of continua- + tion lines in a folded field. + + When passing text to processes that do not interpret text + according to this standard (e.g., mail protocol servers), then + NO linear-white-space characters should occur between a period + (".") or at-sign ("@") and a <word>. Exactly ONE SPACE should + be used in place of arbitrary linear-white-space and comment + sequences. + + Note: Within systems conforming to this standard, wherever a + member of the list of delimiters is allowed, LWSP-chars + may also occur before and/or after it. + + Writers of mail-sending (i.e., header-generating) programs + should realize that there is no network-wide definition of the + effect of ASCII HT (horizontal-tab) characters on the appear- + ance of text at another network host; therefore, the use of + tabs in message headers, though permitted, is discouraged. + + 3.4.3. COMMENTS + + A comment is a set of ASCII characters, which is enclosed in + matching parentheses and which is not within a quoted-string. + The comment construct permits message originators to add text + which will be useful for human readers, but which will be + ignored by the formal semantics. Comments should be retained + while the message is subject to interpretation according to + this standard. However, comments must NOT be included in + other cases, such as during protocol exchanges with mail + servers. + + Comments nest, so that if an unquoted left parenthesis occurs + in a comment string, there must also be a matching right + parenthesis. When a comment acts as the delimiter between a + sequence of two lexical symbols, such as two atoms, it is lex- + ically equivalent with a single SPACE, for the purposes of + regenerating the sequence, such as when passing the sequence + onto a mail protocol server. Comments are detected as such + only within field-bodies of structured fields. 
+ + If a comment is to be "folded" onto multiple lines, then the + syntax for folding must be adhered to. (See the "Lexical + + + August 13, 1982 - 12 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + Analysis of Messages" section on "Folding Long Header Fields" + above, and the section on "Case Independence" below.) Note + that the official semantics therefore do not "see" any + unquoted CRLFs that are in comments, although particular pars- + ing programs may wish to note their presence. For these pro- + grams, it would be reasonable to interpret a "CRLF LWSP-char" + as being a CRLF that is part of the comment; i.e., the CRLF is + kept and the LWSP-char is discarded. Quoted CRLFs (i.e., a + backslash followed by a CR followed by a LF) still must be + followed by at least one LWSP-char. + + 3.4.4. DELIMITING AND QUOTING CHARACTERS + + The quote character (backslash) and characters that delimit + syntactic units are not, generally, to be taken as data that + are part of the delimited or quoted unit(s). In particular, + the quotation-marks that define a quoted-string, the + parentheses that define a comment and the backslash that + quotes a following character are NOT part of the quoted- + string, comment or quoted character. A quotation-mark that is + to be part of a quoted-string, a parenthesis that is to be + part of a comment and a backslash that is to be part of either + must each be preceded by the quote-character backslash ("\"). + Note that the syntax allows any character to be quoted within + a quoted-string or comment; however only certain characters + MUST be quoted to be included as data. These characters are + the ones that are not part of the alternate text group (i.e., + ctext or qtext). + + The one exception to this rule is that a single SPACE is + assumed to exist between contiguous words in a phrase, and + this interpretation is independent of the actual number of + LWSP-chars that the creator places between the words. 
To + include more than one SPACE, the creator must make the LWSP- + chars be part of a quoted-string. + + Quotation marks that delimit a quoted string and backslashes + that quote the following character should NOT accompany the + quoted-string when the string is passed to processes that do + not interpret data according to this specification (e.g., mail + protocol servers). + + 3.4.5. QUOTED-STRINGS + + Where permitted (i.e., in words in structured fields) quoted- + strings are treated as a single symbol. That is, a quoted- + string is equivalent to an atom, syntactically. If a quoted- + string is to be "folded" onto multiple lines, then the syntax + for folding must be adhered to. (See the "Lexical Analysis of + + + August 13, 1982 - 13 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + Messages" section on "Folding Long Header Fields" above, and + the section on "Case Independence" below.) Therefore, the + official semantics do not "see" any bare CRLFs that are in + quoted-strings; however particular parsing programs may wish + to note their presence. For such programs, it would be rea- + sonable to interpret a "CRLF LWSP-char" as being a CRLF which + is part of the quoted-string; i.e., the CRLF is kept and the + LWSP-char is discarded. Quoted CRLFs (i.e., a backslash fol- + lowed by a CR followed by a LF) are also subject to rules of + folding, but the presence of the quoting character (backslash) + explicitly indicates that the CRLF is data to the quoted + string. Stripping off the first following LWSP-char is also + appropriate when parsing quoted CRLFs. + + 3.4.6. BRACKETING CHARACTERS + + There is one type of bracket which must occur in matched pairs + and may have pairs nested within each other: + + o Parentheses ("(" and ")") are used to indicate com- + ments. 
+ + There are three types of brackets which must occur in matched + pairs, and which may NOT be nested: + + o Colon/semi-colon (":" and ";") are used in address + specifications to indicate that the included list of + addresses are to be treated as a group. + + o Angle brackets ("<" and ">") are generally used to + indicate the presence of a one machine-usable refer- + ence (e.g., delimiting mailboxes), possibly including + source-routing to the machine. + + o Square brackets ("[" and "]") are used to indicate the + presence of a domain-literal, which the appropriate + name-domain is to use directly, bypassing normal + name-resolution mechanisms. + + 3.4.7. CASE INDEPENDENCE + + Except as noted, alphabetic strings may be represented in any + combination of upper and lower case. The only syntactic units + + + + + + + + + August 13, 1982 - 14 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + which requires preservation of case information are: + + - text + - qtext + - dtext + - ctext + - quoted-pair + - local-part, except "Postmaster" + + When matching any other syntactic unit, case is to be ignored. + For example, the field-names "From", "FROM", "from", and even + "FroM" are semantically equal and should all be treated ident- + ically. + + When generating these units, any mix of upper and lower case + alphabetic characters may be used. The case shown in this + specification is suggested for message-creating processes. + + Note: The reserved local-part address unit, "Postmaster", is + an exception. When the value "Postmaster" is being + interpreted, it must be accepted in any mixture of + case, including "POSTMASTER", and "postmaster". + + 3.4.8. FOLDING LONG HEADER FIELDS + + Each header field may be represented on exactly one line con- + sisting of the name of the field and its body, and terminated + by a CRLF; this is what the parser sees. 
For readability, the + field-body portion of long header fields may be "folded" onto + multiple lines of the actual field. "Long" is commonly inter- + preted to mean greater than 65 or 72 characters. The former + length serves as a limit, when the message is to be viewed on + most simple terminals which use simple display software; how- + ever, the limit is not imposed by this standard. + + Note: Some display software often can selectively fold lines, + to suit the display terminal. In such cases, sender- + provided folding can interfere with the display + software. + + 3.4.9. BACKSPACE CHARACTERS + + ASCII BS characters (Backspace, decimal 8) may be included in + texts and quoted-strings to effect overstriking. However, any + use of backspaces which effects an overstrike to the left of + the beginning of the text or quoted-string is prohibited. + + + + + + August 13, 1982 - 15 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 3.4.10. NETWORK-SPECIFIC TRANSFORMATIONS + + During transmission through heterogeneous networks, it may be + necessary to force data to conform to a network's local con- + ventions. For example, it may be required that a CR be fol- + lowed either by LF, making a CRLF, or by <null>, if the CR is + to stand alone. Such transformations are reversed, when the + message exits that network. + + When crossing network boundaries, the message should be + treated as passing through two modules. It will enter the + first module containing whatever network-specific transforma- + tions that were necessary to permit migration through the + "current" network. It then passes through the modules: + + o Transformation Reversal + + The "current" network's idiosyncracies are removed and + the message is returned to the canonical form speci- + fied in this standard. + + o Transformation + + The "next" network's local idiosyncracies are imposed + on the message. 
+ + ------------------ + From ==> | Remove Net-A | + Net-A | idiosyncracies | + ------------------ + || + \/ + Conformance + with standard + || + \/ + ------------------ + | Impose Net-B | ==> To + | idiosyncracies | Net-B + ------------------ + + + + + + + + + + + + August 13, 1982 - 16 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 4. MESSAGE SPECIFICATION + + 4.1. SYNTAX + + Note: Due to an artifact of the notational conventions, the syn- + tax indicates that, when present, some fields, must be in + a particular order. Header fields are NOT required to + occur in any particular order, except that the message + body must occur AFTER the headers. It is recommended + that, if present, headers be sent in the order "Return- + Path", "Received", "Date", "From", "Subject", "Sender", + "To", "cc", etc. + + This specification permits multiple occurrences of most + fields. Except as noted, their interpretation is not + specified here, and their use is discouraged. + + The following syntax for the bodies of various fields should + be thought of as describing each field body as a single long + string (or line). The "Lexical Analysis of Message" section on + "Long Header Fields", above, indicates how such long strings can + be represented on more than one line in the actual transmitted + message. 
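As an illustration only (not part of the standard), the "single long string" view of a field can be sketched in Python. The sketch assumes the unfolding rule from the lexical analysis section, under which a CRLF immediately followed by a space or tab is treated as equivalent to that whitespace character alone. The function name `unfold` is hypothetical.

```python
import re


def unfold(raw_field: str) -> str:
    """Join a folded header field into one logical line.

    Assumption: a fold is a CRLF immediately followed by a space or
    tab; removing the CRLF leaves the whitespace character in place.
    """
    return re.sub(r"\r\n(?=[ \t])", "", raw_field)


folded = "Subject: a rather long\r\n subject line\r\n"
print(repr(unfold(folded)))
```

The terminating CRLF of the field is left alone because it is not followed by a space or tab.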
+ + message = fields *( CRLF *text ) ; Everything after + ; first null line + ; is message body + + fields = dates ; Creation time, + source ; author id & one + 1*destination ; address required + *optional-field ; others optional + + source = [ trace ] ; net traversals + originator ; original mail + [ resent ] ; forwarded + + trace = return ; path to sender + 1*received ; receipt tags + + return = "Return-path" ":" route-addr ; return address + + received = "Received" ":" ; one per relay + ["from" domain] ; sending host + ["by" domain] ; receiving host + ["via" atom] ; physical path + *("with" atom) ; link/mail protocol + ["id" msg-id] ; receiver msg id + ["for" addr-spec] ; initial form + + + August 13, 1982 - 17 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + ";" date-time ; time received + + originator = authentic ; authenticated addr + [ "Reply-To" ":" 1#address] ) + + authentic = "From" ":" mailbox ; Single author + / ( "Sender" ":" mailbox ; Actual submittor + "From" ":" 1#mailbox) ; Multiple authors + ; or not sender + + resent = resent-authentic + [ "Resent-Reply-To" ":" 1#address] ) + + resent-authentic = + = "Resent-From" ":" mailbox + / ( "Resent-Sender" ":" mailbox + "Resent-From" ":" 1#mailbox ) + + dates = orig-date ; Original + [ resent-date ] ; Forwarded + + orig-date = "Date" ":" date-time + + resent-date = "Resent-Date" ":" date-time + + destination = "To" ":" 1#address ; Primary + / "Resent-To" ":" 1#address + / "cc" ":" 1#address ; Secondary + / "Resent-cc" ":" 1#address + / "bcc" ":" #address ; Blind carbon + / "Resent-bcc" ":" #address + + optional-field = + / "Message-ID" ":" msg-id + / "Resent-Message-ID" ":" msg-id + / "In-Reply-To" ":" *(phrase / msg-id) + / "References" ":" *(phrase / msg-id) + / "Keywords" ":" #phrase + / "Subject" ":" *text + / "Comments" ":" *text + / "Encrypted" ":" 1#2word + / extension-field ; To be defined + / user-defined-field ; May be pre-empted + + msg-id = "<" addr-spec ">" ; Unique message 
id + + + + + + + August 13, 1982 - 18 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + extension-field = + <Any field which is defined in a document + published as a formal extension to this + specification; none will have names beginning + with the string "X-"> + + user-defined-field = + <Any field which has not been defined + in this specification or published as an + extension to this specification; names for + such fields must be unique and may be + pre-empted by published extensions> + + 4.2. FORWARDING + + Some systems permit mail recipients to forward a message, + retaining the original headers, by adding some new fields. This + standard supports such a service, through the "Resent-" prefix to + field names. + + Whenever the string "Resent-" begins a field name, the field + has the same semantics as a field whose name does not have the + prefix. However, the message is assumed to have been forwarded + by an original recipient who attached the "Resent-" field. This + new field is treated as being more recent than the equivalent, + original field. For example, the "Resent-From", indicates the + person that forwarded the message, whereas the "From" field indi- + cates the original author. + + Use of such precedence information depends upon partici- + pants' communication needs. For example, this standard does not + dictate when a "Resent-From:" address should receive replies, in + lieu of sending them to the "From:" address. + + Note: In general, the "Resent-" fields should be treated as con- + taining a set of information that is independent of the + set of original fields. Information for one set should + not automatically be taken from the other. The interpre- + tation of multiple "Resent-" fields, of the same type, is + undefined. + + In the remainder of this specification, occurrence of legal + "Resent-" fields are treated identically with the occurrence of + + + + + + + + + August 13, 1982 - 19 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + fields whose names do not contain this prefix. + + 4.3. TRACE FIELDS + + Trace information is used to provide an audit trail of mes- + sage handling. In addition, it indicates a route back to the + sender of the message. 
+ + The list of known "via" and "with" values are registered + with the Network Information Center, SRI International, Menlo + Park, California. + + 4.3.1. RETURN-PATH + + This field is added by the final transport system that + delivers the message to its recipient. The field is intended + to contain definitive information about the address and route + back to the message's originator. + + Note: The "Reply-To" field is added by the originator and + serves to direct replies, whereas the "Return-Path" + field is used to identify a path back to the origina- + tor. + + While the syntax indicates that a route specification is + optional, every attempt should be made to provide that infor- + mation in this field. + + 4.3.2. RECEIVED + + A copy of this field is added by each transport service that + relays the message. The information in the field can be quite + useful for tracing transport problems. + + The names of the sending and receiving hosts and time-of- + receipt may be specified. The "via" parameter may be used, to + indicate what physical mechanism the message was sent over, + such as Arpanet or Phonenet, and the "with" parameter may be + used to indicate the mail-, or connection-, level protocol + that was used, such as the SMTP mail protocol, or X.25 tran- + sport protocol. + + Note: Several "with" parameters may be included, to fully + specify the set of protocols that were used. + + Some transport services queue mail; the internal message iden- + tifier that is assigned to the message may be noted, using the + "id" parameter. When the sending host uses a destination + address specification that the receiving host reinterprets, by + + + August 13, 1982 - 20 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + expansion or transformation, the receiving host may wish to + record the original specification, using the "for" parameter. 
+ For example, when a copy of mail is sent to the member of a + distribution list, this parameter may be used to record the + original address that was used to specify the list. + + 4.4. ORIGINATOR FIELDS + + The standard allows only a subset of the combinations possi- + ble with the From, Sender, Reply-To, Resent-From, Resent-Sender, + and Resent-Reply-To fields. The limitation is intentional. + + 4.4.1. FROM / RESENT-FROM + + This field contains the identity of the person(s) who wished + this message to be sent. The message-creation process should + default this field to be a single, authenticated machine + address, indicating the AGENT (person, system or process) + entering the message. If this is not done, the "Sender" field + MUST be present. If the "From" field IS defaulted this way, + the "Sender" field is optional and is redundant with the + "From" field. In all cases, addresses in the "From" field + must be machine-usable (addr-specs) and may not contain named + lists (groups). + + 4.4.2. SENDER / RESENT-SENDER + + This field contains the authenticated identity of the AGENT + (person, system or process) that sends the message. It is + intended for use when the sender is not the author of the mes- + sage, or to indicate who among a group of authors actually + sent the message. If the contents of the "Sender" field would + be completely redundant with the "From" field, then the + "Sender" field need not be present and its use is discouraged + (though still legal). In particular, the "Sender" field MUST + be present if it is NOT the same as the "From" Field. + + The Sender mailbox specification includes a word sequence + which must correspond to a specific agent (i.e., a human user + or a computer program) rather than a standard address. This + indicates the expectation that the field will identify the + single AGENT (person, system, or process) responsible for + sending the mail and not simply include the name of a mailbox + from which the mail was sent. 
For example in the case of a + shared login name, the name, by itself, would not be adequate. + The local-part address unit, which refers to this agent, is + expected to be a computer system term, and not (for example) a + generalized person reference which can be used outside the + network text message context. + + + August 13, 1982 - 21 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + Since the critical function served by the "Sender" field is + identification of the agent responsible for sending mail and + since computer programs cannot be held accountable for their + behavior, it is strongly recommended that when a computer pro- + gram generates a message, the HUMAN who is responsible for + that program be referenced as part of the "Sender" field mail- + box specification. + + 4.4.3. REPLY-TO / RESENT-REPLY-TO + + This field provides a general mechanism for indicating any + mailbox(es) to which responses are to be sent. Three typical + uses for this feature can be distinguished. In the first + case, the author(s) may not have regular machine-based mail- + boxes and therefore wish(es) to indicate an alternate machine + address. In the second case, an author may wish additional + persons to be made aware of, or responsible for, replies. A + somewhat different use may be of some help to "text message + teleconferencing" groups equipped with automatic distribution + services: include the address of that service in the "Reply- + To" field of all messages submitted to the teleconference; + then participants can "reply" to conference submissions to + guarantee the correct distribution of any submission of their + own. + + Note: The "Return-Path" field is added by the mail transport + service, at the time of final delivery. It is intended + to identify a path back to the originator of the mes- + sage. The "Reply-To" field is added by the message + originator and is intended to direct replies. + + 4.4.4. 
AUTOMATIC USE OF FROM / SENDER / REPLY-TO + + For systems which automatically generate address lists for + replies to messages, the following recommendations are made: + + o The "Sender" field mailbox should be sent notices of + any problems in transport or delivery of the original + messages. If there is no "Sender" field, then the + "From" field mailbox should be used. + + o The "Sender" field mailbox should NEVER be used + automatically, in a recipient's reply message. + + o If the "Reply-To" field exists, then the reply should + go to the addresses indicated in that field and not to + the address(es) indicated in the "From" field. + + + + + August 13, 1982 - 22 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + o If there is a "From" field, but no "Reply-To" field, + the reply should be sent to the address(es) indicated + in the "From" field. + + Sometimes, a recipient may actually wish to communicate with + the person that initiated the message transfer. In such + cases, it is reasonable to use the "Sender" address. + + This recommendation is intended only for automated use of + originator-fields and is not intended to suggest that replies + may not also be sent to other recipients of messages. It is + up to the respective mail-handling programs to decide what + additional facilities will be provided. + + Examples are provided in Appendix A. + + 4.5. RECEIVER FIELDS + + 4.5.1. TO / RESENT-TO + + This field contains the identity of the primary recipients of + the message. + + 4.5.2. CC / RESENT-CC + + This field contains the identity of the secondary (informa- + tional) recipients of the message. + + 4.5.3. BCC / RESENT-BCC + + This field contains the identity of additional recipients of + the message. The contents of this field are not included in + copies of the message sent to the primary and secondary reci- + pients. 
Some systems may choose to include the text of the + "Bcc" field only in the author(s)'s copy, while others may + also include it in the text sent to all those indicated in the + "Bcc" list. + + 4.6. REFERENCE FIELDS + + 4.6.1. MESSAGE-ID / RESENT-MESSAGE-ID + + This field contains a unique identifier (the local-part + address unit) which refers to THIS version of THIS message. + The uniqueness of the message identifier is guaranteed by the + host which generates it. This identifier is intended to be + machine readable and not necessarily meaningful to humans. A + message identifier pertains to exactly one instantiation of a + particular message; subsequent revisions to the message should + + + August 13, 1982 - 23 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + each receive new message identifiers. + + 4.6.2. IN-REPLY-TO + + The contents of this field identify previous correspon- + dence which this message answers. Note that if message iden- + tifiers are used in this field, they must use the msg-id + specification format. + + 4.6.3. REFERENCES + + The contents of this field identify other correspondence + which this message references. Note that if message identif- + iers are used, they must use the msg-id specification format. + + 4.6.4. KEYWORDS + + This field contains keywords or phrases, separated by + commas. + + 4.7. OTHER FIELDS + + 4.7.1. SUBJECT + + This is intended to provide a summary, or indicate the + nature, of the message. + + 4.7.2. COMMENTS + + Permits adding text comments onto the message without + disturbing the contents of the message's body. + + 4.7.3. ENCRYPTED + + Sometimes, data encryption is used to increase the + privacy of message contents. If the body of a message has + been encrypted, to keep its contents private, the "Encrypted" + field can be used to note the fact and to indicate the nature + of the encryption. 
The first <word> parameter indicates the + software used to encrypt the body, and the second, optional + <word> is intended to aid the recipient in selecting the + proper decryption key. This code word may be viewed as an + index to a table of keys held by the recipient. + + Note: Unfortunately, headers must contain envelope, as well + as contents, information. Consequently, it is neces- + sary that they remain unencrypted, so that mail tran- + sport services may access them. Since names, + addresses, and "Subject" field contents may contain + + + August 13, 1982 - 24 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + sensitive information, this requirement limits total + message privacy. + + Names of encryption software are registered with the Net- + work Information Center, SRI International, Menlo Park, Cali- + fornia. + + 4.7.4. EXTENSION-FIELD + + A limited number of common fields have been defined in + this document. As network mail requirements dictate, addi- + tional fields may be standardized. To provide user-defined + fields with a measure of safety, in name selection, such + extension-fields will never have names that begin with the + string "X-". + + Names of Extension-fields are registered with the Network + Information Center, SRI International, Menlo Park, California. + + 4.7.5. USER-DEFINED-FIELD + + Individual users of network mail are free to define and + use additional header fields. Such fields must have names + which are not already used in the current specification or in + any definitions of extension-fields, and the overall syntax of + these user-defined-fields must conform to this specification's + rules for delimiting and folding fields. Due to the + extension-field publishing process, the name of a user- + defined-field may be pre-empted. + + Note: The prefatory string "X-" will never be used in the + names of Extension-fields. This provides user-defined + fields with a protected set of names. 
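As a rough sketch (an illustration, not text from the standard), a program could test whether a field name falls in the protected "X-" name space as follows. The function name `is_protected_user_field` is hypothetical, and matching the prefix without regard to case is an assumption based on the rule in section 3.4.7 that field-names are case-independent.

```python
def is_protected_user_field(field_name: str) -> bool:
    """Return True if the field name begins with the reserved "X-" prefix.

    Assumption: the prefix is compared case-insensitively, since
    field-names are matched without regard to case (section 3.4.7).
    """
    return field_name.lower().startswith("x-")


for name in ("X-Mailer", "x-loop", "Subject"):
    print(name, is_protected_user_field(name))
```

A name that fails this test is still a legal user-defined field as long as it is otherwise unused, but it may later be pre-empted by a published extension-field.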
+ + + + + + + + + + + + + + + + + + + August 13, 1982 - 25 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 5. DATE AND TIME SPECIFICATION + + 5.1. SYNTAX + + date-time = [ day "," ] date time ; dd mm yy + ; hh:mm:ss zzz + + day = "Mon" / "Tue" / "Wed" / "Thu" + / "Fri" / "Sat" / "Sun" + + date = 1*2DIGIT month 2DIGIT ; day month year + ; e.g. 20 Jun 82 + + month = "Jan" / "Feb" / "Mar" / "Apr" + / "May" / "Jun" / "Jul" / "Aug" + / "Sep" / "Oct" / "Nov" / "Dec" + + time = hour zone ; ANSI and Military + + hour = 2DIGIT ":" 2DIGIT [":" 2DIGIT] + ; 00:00:00 - 23:59:59 + + zone = "UT" / "GMT" ; Universal Time + ; North American : UT + / "EST" / "EDT" ; Eastern: - 5/ - 4 + / "CST" / "CDT" ; Central: - 6/ - 5 + / "MST" / "MDT" ; Mountain: - 7/ - 6 + / "PST" / "PDT" ; Pacific: - 8/ - 7 + / 1ALPHA ; Military: Z = UT; + ; A:-1; (J not used) + ; M:-12; N:+1; Y:+12 + / ( ("+" / "-") 4DIGIT ) ; Local differential + ; hours+min. (HHMM) + + 5.2. SEMANTICS + + If included, day-of-week must be the day implied by the date + specification. + + Time zone may be indicated in several ways. "UT" is Univer- + sal Time (formerly called "Greenwich Mean Time"); "GMT" is per- + mitted as a reference to Universal Time. The military standard + uses a single character for each zone. "Z" is Universal Time. + "A" indicates one hour earlier, and "M" indicates 12 hours ear- + lier; "N" is one hour later, and "Y" is 12 hours later. The + letter "J" is not used. The other remaining two forms are taken + from ANSI standard X3.51-1975. One allows explicit indication of + the amount of offset from UT; the other uses common 3-character + strings for indicating time zones in North America. + + + August 13, 1982 - 26 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 6. ADDRESS SPECIFICATION + + 6.1. 
SYNTAX + + address = mailbox ; one addressee + / group ; named list + + group = phrase ":" [#mailbox] ";" + + mailbox = addr-spec ; simple address + / phrase route-addr ; name & addr-spec + + route-addr = "<" [route] addr-spec ">" + + route = 1#("@" domain) ":" ; path-relative + + addr-spec = local-part "@" domain ; global address + + local-part = word *("." word) ; uninterpreted + ; case-preserved + + domain = sub-domain *("." sub-domain) + + sub-domain = domain-ref / domain-literal + + domain-ref = atom ; symbolic reference + + 6.2. SEMANTICS + + A mailbox receives mail. It is a conceptual entity which + does not necessarily pertain to file storage. For example, some + sites may choose to print mail on their line printer and deliver + the output to the addressee's desk. + + A mailbox specification comprises a person, system or pro- + cess name reference, a domain-dependent string, and a name-domain + reference. The name reference is optional and is usually used to + indicate the human name of a recipient. The name-domain refer- + ence specifies a sequence of sub-domains. The domain-dependent + string is uninterpreted, except by the final sub-domain; the rest + of the mail service merely transmits it as a literal string. + + 6.2.1. DOMAINS + + A name-domain is a set of registered (mail) names. A name- + domain specification resolves to a subordinate name-domain + specification or to a terminal domain-dependent string. + Hence, domain specification is extensible, permitting any + number of registration levels. + + + August 13, 1982 - 27 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + Name-domains model a global, logical, hierarchical addressing + scheme. The model is logical, in that an address specifica- + tion is related to name registration and is not necessarily + tied to transmission path. The model's hierarchy is a + directed graph, called an in-tree, such that there is a single + path from the root of the tree to any node in the hierarchy. 
+ If more than one path actually exists, they are considered to + be different addresses. + + The root node is common to all addresses; consequently, it is + not referenced. Its children constitute "top-level" name- + domains. Usually, a service has access to its own full domain + specification and to the names of all top-level name-domains. + + The "top" of the domain addressing hierarchy -- a child of the + root -- is indicated by the right-most field, in a domain + specification. Its child is specified to the left, its child + to the left, and so on. + + Some groups provide formal registration services; these con- + stitute name-domains that are independent logically of + specific machines. In addition, networks and machines impli- + citly compose name-domains, since their membership usually is + registered in name tables. + + In the case of formal registration, an organization implements + a (distributed) data base which provides an address-to-route + mapping service for addresses of the form: + + person@registry.organization + + Note that "organization" is a logical entity, separate from + any particular communication network. + + A mechanism for accessing "organization" is universally avail- + able. That mechanism, in turn, seeks an instantiation of the + registry; its location is not indicated in the address specif- + ication. It is assumed that the system which operates under + the name "organization" knows how to find a subordinate regis- + try. The registry will then use the "person" string to deter- + mine where to send the mail specification. + + The latter, network-oriented case permits simple, direct, + attachment-related address specification, such as: + + user@host.network + + Once the network is accessed, it is expected that a message + will go directly to the host and that the host will resolve + + + August 13, 1982 - 28 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + the user name, placing the message in the user's mailbox. 
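The right-to-left ordering of the hierarchy can be illustrated with a small Python sketch. It is a simplification, not part of the standard: the helper name `domain_levels` is hypothetical, and the sketch assumes the domain consists only of domain-refs separated by periods (a domain-literal such as a bracketed host address contains periods of its own and would need separate handling).

```python
from typing import List


def domain_levels(domain: str) -> List[str]:
    """List sub-domains from the top-level name-domain downward.

    Assumption: the domain contains only domain-refs, so splitting
    on "." is sufficient. The right-most field is the top level.
    """
    return list(reversed(domain.split(".")))


levels = domain_levels("registry.organization")
print(levels)  # top-level name-domain first, left-most sub-domain last
```

For "user@host.network", the same helper applied to "host.network" places "network" first, matching the expectation that the network is accessed before the host resolves the user name.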
+ + 6.2.2. ABBREVIATED DOMAIN SPECIFICATION + + Since any number of levels is possible within the domain + hierarchy, specification of a fully qualified address can + become inconvenient. This standard permits abbreviated domain + specification, in a special case: + + For the address of the sender, call the left-most + sub-domain Level N. In a header address, if all of + the sub-domains above (i.e., to the right of) Level N + are the same as those of the sender, then they do not + have to appear in the specification. Otherwise, the + address must be fully qualified. + + This feature is subject to approval by local sub- + domains. Individual sub-domains may require their + member systems, which originate mail, to provide full + domain specification only. When permitted, abbrevia- + tions may be present only while the message stays + within the sub-domain of the sender. + + Use of this mechanism requires the sender's sub-domain + to reserve the names of all top-level domains, so that + full specifications can be distinguished from abbrevi- + ated specifications. + + For example, if a sender's address is: + + sender@registry-A.registry-1.organization-X + + and one recipient's address is: + + recipient@registry-B.registry-1.organization-X + + and another's is: + + recipient@registry-C.registry-2.organization-X + + then ".registry-1.organization-X" need not be specified in the + message, but "registry-C.registry-2" DOES have to be + specified. That is, the first two addresses may be abbrevi- + ated, but the third address must be fully specified. + + When a message crosses a domain boundary, all addresses must + be specified in the full format, ending with the top-level + name-domain in the right-most field. It is the responsibility + of mail forwarding services to ensure that addresses conform + + + August 13, 1982 - 29 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + with this requirement. 
In the case of abbreviated addresses, + the relaying service must make the necessary expansions. It + should be noted that it often is difficult for such a service + to locate all occurrences of address abbreviations. For exam- + ple, it will not be possible to find such abbreviations within + the body of the message. The "Return-Path" field can aid + recipients in recovering from these errors. + + Note: When passing any portion of an addr-spec onto a process + which does not interpret data according to this stan- + dard (e.g., mail protocol servers), there must be NO + LWSP-chars preceding or following the at-sign or any + delimiting period ("."), such as shown in the above + examples, and only ONE SPACE between contiguous + <word>s. + + 6.2.3. DOMAIN TERMS + + A domain-ref must be THE official name of a registry, network, + or host. It is a symbolic reference, within a name sub- + domain. At times, it is necessary to bypass standard mechan- + isms for resolving such references, using more primitive + information, such as a network host address rather than its + associated host name. + + To permit such references, this standard provides the domain- + literal construct. Its contents must conform with the needs + of the sub-domain in which it is interpreted. + + Domain-literals which refer to domains within the ARPA Inter- + net specify 32-bit Internet addresses, in four 8-bit fields + noted in decimal, as described in Request for Comments #820, + "Assigned Numbers." For example: + + [10.0.3.19] + + Note: THE USE OF DOMAIN-LITERALS IS STRONGLY DISCOURAGED. It + is permitted only as a means of bypassing temporary + system limitations, such as name tables which are not + complete. + + The names of "top-level" domains, and the names of domains + under in the ARPA Internet, are registered with the Network + Information Center, SRI International, Menlo Park, California. + + 6.2.4. 
DOMAIN-DEPENDENT LOCAL STRING + + The local-part of an addr-spec in a mailbox specification + (i.e., the host's name for the mailbox) is understood to be + + + August 13, 1982 - 30 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + whatever the receiving mail protocol server allows. For exam- + ple, some systems do not understand mailbox references of the + form "P. D. Q. Bach", but others do. + + This specification treats periods (".") as lexical separators. + Hence, their presence in local-parts which are not quoted- + strings, is detected. However, such occurrences carry NO + semantics. That is, if a local-part has periods within it, an + address parser will divide the local-part into several tokens, + but the sequence of tokens will be treated as one uninter- + preted unit. The sequence will be re-assembled, when the + address is passed outside of the system such as to a mail pro- + tocol service. + + For example, the address: + + First.Last@Registry.Org + + is legal and does not require the local-part to be surrounded + with quotation-marks. (However, "First Last" DOES require + quoting.) The local-part of the address, when passed outside + of the mail system, within the Registry.Org domain, is + "First.Last", again without quotation marks. + + 6.2.5. BALANCING LOCAL-PART AND DOMAIN + + In some cases, the boundary between local-part and domain can + be flexible. The local-part may be a simple string, which is + used for the final determination of the recipient's mailbox. + All other levels of reference are, therefore, part of the + domain. + + For some systems, in the case of abbreviated reference to the + local and subordinate sub-domains, it may be possible to + specify only one reference within the domain part and place + the other, subordinate name-domain references within the + local-part. 
This would appear as: + + mailbox.sub1.sub2@this-domain + + Such a specification would be acceptable to address parsers + which conform to RFC #733, but do not support this newer + Internet standard. While contrary to the intent of this stan- + dard, the form is legal. + + Also, some sub-domains have a specification syntax which does + not conform to this standard. For example: + + sub-net.mailbox@sub-domain.domain + + + August 13, 1982 - 31 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + uses a different parsing sequence for local-part than for + domain. + + Note: As a rule, the domain specification should contain + fields which are encoded according to the syntax of + this standard and which contain generally-standardized + information. The local-part specification should con- + tain only that portion of the address which deviates + from the form or intention of the domain field. + + 6.2.6. MULTIPLE MAILBOXES + + An individual may have several mailboxes and wish to receive + mail at whatever mailbox is convenient for the sender to + access. This standard does not provide a means of specifying + "any member of" a list of mailboxes. + + A set of individuals may wish to receive mail as a single unit + (i.e., a distribution list). The <group> construct permits + specification of such a list. Recipient mailboxes are speci- + fied within the bracketed part (":" - ";"). A copy of the + transmitted message is to be sent to each mailbox listed. + This standard does not permit recursive specification of + groups within groups. + + While a list must be named, it is not required that the con- + tents of the list be included. In this case, the <address>
+ serves only as an indication of group distribution and would + appear in the form: + + name:; + + Some mail services may provide a group-list distribution + facility, accepting a single mailbox reference, expanding it + to the full distribution list, and relaying the mail to the + list's members. This standard provides no additional syntax + for indicating such a service. Using the <group> address + alternative, while listing one mailbox in it, can mean either + that the mailbox reference will be expanded to a list or that + there is a group with one member. + + 6.2.7. EXPLICIT PATH SPECIFICATION + + At times, a message originator may wish to indicate the + transmission path that a message should follow. This is + called source routing. The normal addressing scheme, used in + an addr-spec, is carefully separated from such information; + the <route> portion of a route-addr is provided for such occa- + sions. It specifies the sequence of hosts and/or transmission + + + August 13, 1982 - 32 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + services that are to be traversed. Both domain-refs and + domain-literals may be used. + + Note: The use of source routing is discouraged. Unless the + sender has special need of path restriction, the choice + of transmission route should be left to the mail tran- + sport service. + + 6.3. RESERVED ADDRESS + + It often is necessary to send mail to a site, without know- + ing any of its valid addresses. For example, there may be mail + system dysfunctions, or a user may wish to find out a person's + correct address, at that site. + + This standard specifies a single, reserved mailbox address + (local-part) which is to be valid at each site. Mail sent to + that address is to be routed to a person responsible for the + site's mail system or to a person with responsibility for general + site operation. The name of the reserved local-part address is: + + Postmaster + + so that "Postmaster@domain" is required to be valid. 
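A minimal Python sketch of recognizing the reserved address follows. It is an illustration under stated assumptions, not a full address parser: the function name `is_postmaster` is hypothetical, the input is assumed to be a bare addr-spec (no phrase, route, or comments), and the local-part is compared without regard to case as section 3.4.7 requires for "Postmaster".

```python
def is_postmaster(addr_spec: str) -> bool:
    """Return True if addr_spec names the reserved Postmaster mailbox.

    Assumption: addr_spec is a bare "local-part@domain" string.
    Splitting at the last "@" keeps a quoted local-part that itself
    contains "@" in one piece.
    """
    local_part, at_sign, domain = addr_spec.rpartition("@")
    return bool(at_sign) and bool(domain) and local_part.lower() == "postmaster"


for candidate in ("Postmaster@domain", "POSTMASTER@domain", "Jones@Group.Org"):
    print(candidate, is_postmaster(candidate))
```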
+ + Note: This reserved local-part must be matched without sensi- + tivity to alphabetic case, so that "POSTMASTER", "postmas- + ter", and even "poStmASteR" is to be accepted. + + + + + + + + + + + + + + + + + + + + + + + + August 13, 1982 - 33 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + 7. BIBLIOGRAPHY + + + ANSI. "USA Standard Code for Information Interchange," X3.4. + American National Standards Institute: New York (1968). Also + in: Feinler, E. and J. Postel, eds., "ARPANET Protocol Hand- + book", NIC 7104. + + ANSI. "Representations of Universal Time, Local Time Differen- + tials, and United States Time Zone References for Information + Interchange," X3.51-1975. American National Standards Insti- + tute: New York (1975). + + Bemer, R.W., "Time and the Computer." In: Interface Age (Feb. + 1979). + + Bennett, C.J. "JNT Mail Protocol". Joint Network Team, Ruther- + ford and Appleton Laboratory: Didcot, England. + + Bhushan, A.K., Pogran, K.T., Tomlinson, R.S., and White, J.E. + "Standardizing Network Mail Headers," ARPANET Request for + Comments No. 561, Network Information Center No. 18516; SRI + International: Menlo Park (September 1973). + + Birrell, A.D., Levin, R., Needham, R.M., and Schroeder, M.D. + "Grapevine: An Exercise in Distributed Computing," Communica- + tions of the ACM 25, 4 (April 1982), 260-274. + + Crocker, D.H., Vittal, J.J., Pogran, K.T., Henderson, D.A. + "Standard for the Format of ARPA Network Text Message," + ARPANET Request for Comments No. 733, Network Information + Center No. 41952. SRI International: Menlo Park (November + 1977). + + Feinler, E.J. and Postel, J.B. ARPANET Protocol Handbook, Net- + work Information Center No. 7104 (NTIS AD A003890). SRI + International: Menlo Park (April 1976). + + Harary, F. "Graph Theory". Addison-Wesley: Reading, Mass. + (1969). + + Levin, R. and Schroeder, M. "Transport of Electronic Messages + through a Network," TeleInformatics 79, pp. 29-33. North + Holland (1979). 
Also as Xerox Palo Alto Research Center + Technical Report CSL-79-4. + + Myer, T.H. and Henderson, D.A. "Message Transmission Protocol," + ARPANET Request for Comments, No. 680, Network Information + Center No. 32116. SRI International: Menlo Park (1975). + + + August 13, 1982 - 34 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + NBS. "Specification of Message Format for Computer Based Message + Systems, Recommended Federal Information Processing Standard." + National Bureau of Standards: Gaithersburg, Maryland + (October 1981). + + NIC. Internet Protocol Transition Workbook. Network Information + Center, SRI-International, Menlo Park, California (March + 1982). + + Oppen, D.C. and Dalal, Y.K. "The Clearinghouse: A Decentralized + Agent for Locating Named Objects in a Distributed Environ- + ment," OPD-T8103. Xerox Office Products Division: Palo Alto, + CA. (October 1981). + + Postel, J.B. "Assigned Numbers," ARPANET Request for Comments, + No. 820. SRI International: Menlo Park (August 1982). + + Postel, J.B. "Simple Mail Transfer Protocol," ARPANET Request + for Comments, No. 821. SRI International: Menlo Park (August + 1982). + + Shoch, J.F. "Internetwork naming, addressing and routing," in + Proc. 17th IEEE Computer Society International Conference, pp. + 72-79, Sept. 1978, IEEE Cat. No. 78 CH 1388-8C. + + Su, Z. and Postel, J. "The Domain Naming Convention for Internet + User Applications," ARPANET Request for Comments, No. 819. + SRI International: Menlo Park (August 1982). + + + + + + + + + + + + + + + + + + + + + + + + August 13, 1982 - 35 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + APPENDIX + + + A. EXAMPLES + + A.1. ADDRESSES + + A.1.1. Alfred Neuman + + A.1.2. 
Neuman@BBN-TENEXA + + These two "Alfred Neuman" examples have identical seman- + tics, as far as the operation of the local host's mail sending + (distribution) program (also sometimes called its "mailer") + and the remote host's mail protocol server are concerned. In + the first example, the "Alfred Neuman" is ignored by the + mailer, as "Neuman@BBN-TENEXA" completely specifies the reci- + pient. The second example contains no superfluous informa- + tion, and, again, "Neuman@BBN-TENEXA" is the intended reci- + pient. + + Note: When the message crosses name-domain boundaries, then + these specifications must be changed, so as to indicate + the remainder of the hierarchy, starting with the top + level. + + A.1.3. "George, Ted" + + This form might be used to indicate that a single mailbox + is shared by several users. The quoted string is ignored by + the originating host's mailer, because "Shared@Group.Arpanet" + completely specifies the destination mailbox. + + A.1.4. Wilt . (the Stilt) Chamberlain@NBA.US + + The "(the Stilt)" is a comment, which is NOT included in + the destination mailbox address handed to the originating + system's mailer. The local-part of the address is the string + "Wilt.Chamberlain", with NO space between the first and second + words. + + A.1.5. Address Lists + + Gourmets: Pompous Person , + Childs@WGBH.Boston, Galloping Gourmet@ + ANT.Down-Under (Australian National Television), + Cheapie@Discount-Liquors;, + Cruisers: Port@Portugal, Jones@SEA;, + Another@Somewhere.SomeOrg + + + August 13, 1982 - 36 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + This group list example points out the use of comments and the + mixing of addresses and groups. + + A.2. ORIGINATOR ITEMS + + A.2.1. Author-sent + + George Jones logs into his host as "Jones". He sends + mail himself. + + From: Jones@Group.Org + + or + + From: George Jones + + A.2.2. Secretary-sent + + George Jones logs in as Jones on his host. 
His secre- + tary, who logs in as Secy sends mail for him. Replies to the + mail should go to George. + + From: George Jones + Sender: Secy@Other-Group + + A.2.3. Secretary-sent, for user of shared directory + + George Jones' secretary sends mail for George. Replies + should go to George. + + From: George Jones + Sender: Secy@Other-Group + + Note that there need not be a space between "Jones" and the + "<", but adding a space enhances readability (as is the case + in other examples. + + A.2.4. Committee activity, with one author + + George is a member of a committee. He wishes to have any + replies to his message go to all committee members. + + From: George Jones + Sender: Jones@Host + Reply-To: The Committee: Jones@Host.Net, + Smith@Other.Org, + Doe@Somewhere-Else; + + Note that if George had not included himself in the + + + August 13, 1982 - 37 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + enumeration of The Committee, he would not have gotten an + implicit reply; the presence of the "Reply-to" field SUPER- + SEDES the sending of a reply to the person named in the "From" + field. + + A.2.5. Secretary acting as full agent of author + + George Jones asks his secretary (Secy@Host) to send a + message for him in his capacity as Group. He wants his secre- + tary to handle all replies. + + From: George Jones + Sender: Secy@Host + Reply-To: Secy@Host + + A.2.6. Agent for user without online mailbox + + A friend of George's, Sarah, is visiting. George's + secretary sends some mail to a friend of Sarah in computer- + land. Replies should go to George, whose mailbox is Jones at + Registry. + + From: Sarah Friendly + Sender: Secy-Name + Reply-To: Jones@Registry. + + A.2.7. Agent for member of a committee + + George's secretary sends out a message which was authored + jointly by all the members of a committee. Note that the name + of the committee cannot be specified, since names are + not permitted in the From field. 
+ + From: Jones@Host, + Smith@Other-Host, + Doe@Somewhere-Else + Sender: Secy@SHost + + + + + + + + + + + + + + + August 13, 1982 - 38 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + A.3. COMPLETE HEADERS + + A.3.1. Minimum required + + Date: 26 Aug 76 1429 EDT Date: 26 Aug 76 1429 EDT + From: Jones@Registry.Org or From: Jones@Registry.Org + Bcc: To: Smith@Registry.Org + + Note that the "Bcc" field may be empty, while the "To" field + is required to have at least one address. + + A.3.2. Using some of the additional fields + + Date: 26 Aug 76 1430 EDT + From: George Jones + Sender: Secy@SHOST + To: "Al Neuman"@Mad-Host, + Sam.Irving@Other-Host + Message-ID: + + A.3.3. About as complex as you're going to get + + Date : 27 Aug 76 0932 PDT + From : Ken Davis + Subject : Re: The Syntax in the RFC + Sender : KSecy@Other-Host + Reply-To : Sam.Irving@Reg.Organization + To : George Jones , + Al.Neuman@MAD.Publisher + cc : Important folk: + Tom Softwood , + "Sam Irving"@Other-Host;, + Standard Distribution: + /main/davis/people/standard@Other-Host, + "standard.dist.3"@Tops-20-Host>; + Comment : Sam is away on business. He asked me to handle + his mail for him. He'll be able to provide a + more accurate explanation when he returns + next week. + In-Reply-To: , George's message + X-Special-action: This is a sample of user-defined field- + names. There could also be a field-name + "Special-action", but its name might later be + preempted + Message-ID: <4231.629.XYzi-What@Other-Host> + + + + + + + August 13, 1982 - 39 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + B. SIMPLE FIELD PARSING + + Some mail-reading software systems may wish to perform only + minimal processing, ignoring the internal syntax of structured + field-bodies and treating them the same as unstructured-field- + bodies. 
Such software will need only to distinguish: + + o Header fields from the message body, + + o Beginnings of fields from lines which continue fields, + + o Field-names from field-contents. + + The abbreviated set of syntactic rules which follows will + suffice for this purpose. It describes a limited view of mes- + sages and is a subset of the syntactic rules provided in the main + part of this specification. One small exception is that the con- + tents of field-bodies consist only of text: + + B.1. SYNTAX + + + message = *field *(CRLF *text) + + field = field-name ":" [field-body] CRLF + + field-name = 1* + + field-body = *text [CRLF LWSP-char field-body] + + + B.2. SEMANTICS + + Headers occur before the message body and are terminated by + a null line (i.e., two contiguous CRLFs). + + A line which continues a header field begins with a SPACE or + HTAB character, while a line beginning a field starts with a + printable character which is not a colon. + + A field-name consists of one or more printable characters + (excluding colon, space, and control-characters). A field-name + MUST be contained on one line. Upper and lower case are not dis- + tinguished when comparing field-names. + + + + + + + + August 13, 1982 - 40 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + C. DIFFERENCES FROM RFC #733 + + The following summarizes the differences between this stan- + dard and the one specified in Arpanet Request for Comments #733, + "Standard for the Format of ARPA Network Text Messages". The + differences are listed in the order of their occurrence in the + current specification. + + C.1. FIELD DEFINITIONS + + C.1.1. FIELD NAMES + + These now must be a sequence of printable characters. They + may not contain any LWSP-chars. + + C.2. LEXICAL TOKENS + + C.2.1. SPECIALS + + The characters period ("."), left-square bracket ("["), and + right-square bracket ("]") have been added. 
For presentation + purposes, and when passing a specification to a system that + does not conform to this standard, periods are to be contigu- + ous with their surrounding lexical tokens. No linear-white- + space is permitted between them. The presence of one LWSP- + char between other tokens is still directed. + + C.2.2. ATOM + + Atoms may not contain SPACE. + + C.2.3. SPECIAL TEXT + + ctext and qtext have had backslash ("\") added to the list of + prohibited characters. + + C.2.4. DOMAINS + + The lexical tokens and have been + added. + + C.3. MESSAGE SPECIFICATION + + C.3.1. TRACE + + The "Return-path:" and "Received:" fields have been specified. + + + + + + August 13, 1982 - 41 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + C.3.2. FROM + + The "From" field must contain machine-usable addresses (addr- + spec). Multiple addresses may be specified, but named-lists + (groups) may not. + + C.3.3. RESENT + + The meta-construct of prefacing field names with the string + "Resent-" has been added, to indicate that a message has been + forwarded by an intermediate recipient. + + C.3.4. DESTINATION + + A message must contain at least one destination address field. + "To" and "CC" are required to contain at least one address. + + C.3.5. IN-REPLY-TO + + The field-body is no longer a comma-separated list, although a + sequence is still permitted. + + C.3.6. REFERENCE + + The field-body is no longer a comma-separated list, although a + sequence is still permitted. + + C.3.7. ENCRYPTED + + A field has been specified that permits senders to indicate + that the body of a message has been encrypted. + + C.3.8. EXTENSION-FIELD + + Extension fields are prohibited from beginning with the char- + acters "X-". + + C.4. DATE AND TIME SPECIFICATION + + C.4.1. SIMPLIFICATION + + Fewer optional forms are permitted and the list of three- + letter time zones has been shortened. + + C.5. 
ADDRESS SPECIFICATION + + + + + + + August 13, 1982 - 42 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + C.5.1. ADDRESS + + The use of quoted-string, and the ":"-atom-":" construct, have + been removed. An address now is either a single mailbox + reference or is a named list of addresses. The latter indi- + cates a group distribution. + + C.5.2. GROUPS + + Group lists are now required to to have a name. Group lists + may not be nested. + + C.5.3. MAILBOX + + A mailbox specification may indicate a person's name, as + before. Such a named list no longer may specify multiple + mailboxes and may not be nested. + + C.5.4. ROUTE ADDRESSING + + Addresses now are taken to be absolute, global specifications, + independent of transmission paths. The construct has + been provided, to permit explicit specification of transmis- + sion path. RFC #733's use of multiple at-signs ("@") was + intended as a general syntax for indicating routing and/or + hierarchical addressing. The current standard separates these + specifications and only one at-sign is permitted. + + C.5.5. AT-SIGN + + The string " at " no longer is used as an address delimiter. + Only at-sign ("@") serves the function. + + C.5.6. DOMAINS + + Hierarchical, logical name-domains have been added. + + C.6. RESERVED ADDRESS + + The local-part "Postmaster" has been reserved, so that users can + be guaranteed at least one valid address at a site. + + + + + + + + + + + August 13, 1982 - 43 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + D. ALPHABETICAL LISTING OF SYNTAX RULES + + address = mailbox ; one addressee + / group ; named list + addr-spec = local-part "@" domain ; global address + ALPHA = + ; (101-132, 65.- 90.) + ; (141-172, 97.-122.) + atom = 1* + authentic = "From" ":" mailbox ; Single author + / ( "Sender" ":" mailbox ; Actual submittor + "From" ":" 1#mailbox) ; Multiple authors + ; or not sender + CHAR = ; ( 0-177, 0.-127.) 
+ comment = "(" *(ctext / quoted-pair / comment) ")" + CR = ; ( 15, 13.) + CRLF = CR LF + ctext = may be folded + ")", "\" & CR, & including + linear-white-space> + CTL = ; ( 177, 127.) + date = 1*2DIGIT month 2DIGIT ; day month year + ; e.g. 20 Jun 82 + dates = orig-date ; Original + [ resent-date ] ; Forwarded + date-time = [ day "," ] date time ; dd mm yy + ; hh:mm:ss zzz + day = "Mon" / "Tue" / "Wed" / "Thu" + / "Fri" / "Sat" / "Sun" + delimiters = specials / linear-white-space / comment + destination = "To" ":" 1#address ; Primary + / "Resent-To" ":" 1#address + / "cc" ":" 1#address ; Secondary + / "Resent-cc" ":" 1#address + / "bcc" ":" #address ; Blind carbon + / "Resent-bcc" ":" #address + DIGIT = ; ( 60- 71, 48.- 57.) + domain = sub-domain *("." sub-domain) + domain-literal = "[" *(dtext / quoted-pair) "]" + domain-ref = atom ; symbolic reference + dtext = may be folded + "]", "\" & CR, & including + linear-white-space> + extension-field = + + + + August 13, 1982 - 44 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + field = field-name ":" [ field-body ] CRLF + fields = dates ; Creation time, + source ; author id & one + 1*destination ; address required + *optional-field ; others optional + field-body = field-body-contents + [CRLF LWSP-char field-body] + field-body-contents = + + field-name = 1* + group = phrase ":" [#mailbox] ";" + hour = 2DIGIT ":" 2DIGIT [":" 2DIGIT] + ; 00:00:00 - 23:59:59 + HTAB = ; ( 11, 9.) + LF = ; ( 12, 10.) + linear-white-space = 1*([CRLF] LWSP-char) ; semantics = SPACE + ; CRLF => folding + local-part = word *("." 
word) ; uninterpreted + ; case-preserved + LWSP-char = SPACE / HTAB ; semantics = SPACE + mailbox = addr-spec ; simple address + / phrase route-addr ; name & addr-spec + message = fields *( CRLF *text ) ; Everything after + ; first null line + ; is message body + month = "Jan" / "Feb" / "Mar" / "Apr" + / "May" / "Jun" / "Jul" / "Aug" + / "Sep" / "Oct" / "Nov" / "Dec" + msg-id = "<" addr-spec ">" ; Unique message id + optional-field = + / "Message-ID" ":" msg-id + / "Resent-Message-ID" ":" msg-id + / "In-Reply-To" ":" *(phrase / msg-id) + / "References" ":" *(phrase / msg-id) + / "Keywords" ":" #phrase + / "Subject" ":" *text + / "Comments" ":" *text + / "Encrypted" ":" 1#2word + / extension-field ; To be defined + / user-defined-field ; May be pre-empted + orig-date = "Date" ":" date-time + originator = authentic ; authenticated addr + [ "Reply-To" ":" 1#address] ) + phrase = 1*word ; Sequence of words + + + + + August 13, 1982 - 45 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + qtext = , ; => may be folded + "\" & CR, and including + linear-white-space> + quoted-pair = "\" CHAR ; may quote any char + quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or + ; quoted chars. + received = "Received" ":" ; one per relay + ["from" domain] ; sending host + ["by" domain] ; receiving host + ["via" atom] ; physical path + *("with" atom) ; link/mail protocol + ["id" msg-id] ; receiver msg id + ["for" addr-spec] ; initial form + ";" date-time ; time received + + resent = resent-authentic + [ "Resent-Reply-To" ":" 1#address] ) + resent-authentic = + = "Resent-From" ":" mailbox + / ( "Resent-Sender" ":" mailbox + "Resent-From" ":" 1#mailbox ) + resent-date = "Resent-Date" ":" date-time + return = "Return-path" ":" route-addr ; return address + route = 1#("@" domain) ":" ; path-relative + route-addr = "<" [route] addr-spec ">" + source = [ trace ] ; net traversals + originator ; original mail + [ resent ] ; forwarded + SPACE = ; ( 40, 32.) 
+ specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted- + / "," / ";" / ":" / "\" / <"> ; string, to use + / "." / "[" / "]" ; within a word. + sub-domain = domain-ref / domain-literal + text = atoms, specials, + CR & bare LF, but NOT ; comments and + including CRLF> ; quoted-strings are + ; NOT recognized. + time = hour zone ; ANSI and Military + trace = return ; path to sender + 1*received ; receipt tags + user-defined-field = + + word = atom / quoted-string + + + + + August 13, 1982 - 46 - RFC #822 + + + + Standard for ARPA Internet Text Messages + + + zone = "UT" / "GMT" ; Universal Time + ; North American : UT + / "EST" / "EDT" ; Eastern: - 5/ - 4 + / "CST" / "CDT" ; Central: - 6/ - 5 + / "MST" / "MDT" ; Mountain: - 7/ - 6 + / "PST" / "PDT" ; Pacific: - 8/ - 7 + / 1ALPHA ; Military: Z = UT; + <"> = ; ( 42, 34.) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + August 13, 1982 - 47 - RFC #822 + diff --git a/doc/uri.scm.doc b/doc/uri.scm.doc new file mode 100644 index 0000000..ff44a8d --- /dev/null +++ b/doc/uri.scm.doc @@ -0,0 +1,150 @@ +This file documents names specified in uri.scm. + + + + +NOTES + +URIs have the following syntax: + +[scheme] : path [? search ] [# fragmentid] + +Parts in [] may be omitted. The last part is usually referred to as +frag-id in this document. + + + +DEFINITIONS AND DESCRIPTIONS + + +char-set +uri-reserved + +A set of reserved characters (semicolon, slash, hash, question mark, +colon and space). + +procedure +parse-uri uri-string --> (scheme, path, search, frag-id) + +Multiple-value return: scheme, path, search, frag-id, in this +order. scheme, search and frag-id are either #f or a string. path is a +nonempty list of strings. An empty path is a list containing the empty +string. parse-uri tries to be tolerant of the various ways people build +broken URIs out there on the Net (so it does not strictly conform to +RFC 1630). 
+ + +procedure +unescape-uri string [start [end]] --> string + +Unescapes a string. This procedure should only be used *after* the URI +has been parsed, since unescaping may introduce characters that blow +up the parse (that's why escape sequences are used in URIs ;). +Escape sequences have the following form: %hh, where each h is a +hexadecimal digit. E.g., %20 is space (ASCII character 32). + + +procedure +hex-digit? character --> boolean + +Returns #t if character is a hexadecimal digit (i.e., one of 0-9, a-f, +A-F), #f otherwise. + + +procedure +hexchar->int character --> number + +Translates the given character to an integer, e.g. (hexchar->int #\a) +=> 10. + + +procedure +int->hexchar integer --> character + +Translates the given integer from the range 0-15 into a hexadecimal +character (uses uppercase letters), e.g. (int->hexchar 14) => #\E. + + +char-set +uri-escaped-chars + +A set of characters that are escaped in URIs. These are the following +characters: dollar ($), minus (-), underscore (_), at (@), dot (.), +ampersand (&), exclamation mark (!), asterisk (*), backslash (\), +double quote ("), single quote ('), open parenthesis ((), close +parenthesis ()), comma (,), plus (+) and all other characters that are +neither letters nor digits (such as space and control characters). + + +procedure +escape-uri string [escaped-chars] --> string + +Escapes the characters of string that are members of escaped-chars. +escaped-chars defaults to uri-escaped-chars. Be careful when applying +this procedure to chunks of text with syntactically meaningful reserved +characters (e.g., paths with URI slashes or colons) -- they'll be +escaped, and lose their special meaning. E.g., it would be a mistake to +apply escape-uri to "//lcs.mit.edu:8001/foo/bar.html" because the +slashes and colons would be escaped. Note that escape-uri does not +check for this, since such a check would defeat its purpose. + + +procedure +resolve-uri cscheme cp scheme p --> (scheme, path) + +Sorry, I can't figure out what resolve-uri is intended to do. 
Perhaps +I will find out later. + +The code seems to have a bug: In the body of receive, there's a +loop. According to the comment, j should count sequential slashes. But +j counts nothing in the body. Either zero is added ((lp (cdr cp-tail) +(cons (car cp-tail) rhead) (+ j 0))) or j is set to 1 ((lp (cdr +cp-tail) (cons (car cp-tail) rhead) 1)). Nevertheless, j is expected +to reach the value numsl, which can be larger than one. So what? I am +confused. + + +procedure +rev-append list-a list-b --> list + +Performs (append (reverse list-a) list-b). The comment says it +should be defined in a list package, but I wonder how often this +will be used. + + +procedure +split-uri-path uri start end --> list + +Splits uri at slashes. Only the substring from start (inclusive) to +end (exclusive) is considered. start and end - 1 have to be within the +range of the uri string; otherwise an index-out-of-range exception +will be raised. Example: (split-uri-path "foo/bar/colon" 4 11) ==> +'("bar" "col") + + +procedure +simplify-uri-path path --> list + +Removes "." and ".." entries from path. The result is a (possibly +empty) list representing a path that does not contain any "." or "..". +The list can only be empty if the path did not start with "/" (for the +rare occasion someone wants to simplify a relative path). The result +is #f if the path tries to back up past root, for example by "/.." or +"/foo/../.." or just "..". "//" may occur anywhere in the path; it +refers to root, and ".." cannot back up past it. +Examples: +(simplify-uri-path (split-uri-path "/foo/bar/baz/.." 0 15)) +==> '("" "foo" "bar") + +(simplify-uri-path (split-uri-path "foo/bar/baz/../../.." 0 20)) +==> '() + +(simplify-uri-path (split-uri-path "/foo/../.." 
0 10)) +==> #f ; tries to back up past root + +(simplify-uri-path (split-uri-path "foo/bar//" 0 9)) +==> '("") ; "//" refers to root + +(simplify-uri-path (split-uri-path "foo/bar/" 0 8)) +==> '("") ; last "/" also refers to root + +(simplify-uri-path (split-uri-path "/foo/bar//baz/../.." 0 19)) +==> #f ; tries to back up past root diff --git a/doc/url.scm.doc b/doc/url.scm.doc new file mode 100644 index 0000000..4819ca4 --- /dev/null +++ b/doc/url.scm.doc @@ -0,0 +1,69 @@ +This file documents names defined in url.scm + + + + +NOTES + + + + +DEFINITIONS AND DESCRIPTIONS + + +userhost record + +A record containing the fields user, password, host and port. Created +by parsing a string like //<user>:<password>@<host>:<port>/. The +record describes path-prefixes of the form +//<user>:<password>@<host>:<port>/. These are frequently used as the +initial prefix of URLs describing Internet resources. + + +parse-userhost path default + +Parses a URI path (a list representing a path, not a string!) into a +userhost record. Default values are taken from the userhost record +DEFAULT, except for the host. Returns a userhost record on success, and +#f if it cannot parse the path. It is an error if the specified path +does not begin with '//..', as noted under userhost. + + +userhost-escaped-chars list + +The union of uri-escaped-chars and the characters '@' and ':'. Used +by the unparser. + + +userhost->string userhost procedure + +Unparses a userhost record to a string. + + +http-url record + +Record containing the fields userhost (a userhost record), path (a +path list), search and frag-id. The PATH slot of this record is the +URL's path split at slashes, e.g., "foo/bar//baz/" => ("foo" "bar" "" +"baz" ""). These elements are in raw, unescaped format. To convert +back to a string, use (uri-path-list->path (map escape-uri pathlist)). + + +parse-http-url path search frag-id procedure + +Returns an http-url record. path, search and frag-id are the results of +a parse-uri call on the initial URI. See parse-uri (uri.scm) for further +details. 
search and frag-id are stored as they are. This parser +decodes the path elements. It is an error if the path specifies a +user or a password, as this is not allowed in http-urls. + + +default-http-userhost record + +A userhost record that specifies the port as 80 and everything else as +#f. + + +http-url->string http-url + +Unparses the given http-url to a string.
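
The round trip between a path string and the path list held in an
http-url record can be illustrated with a minimal sketch. It is written
in Python rather than Scheme and is not the library's code: split_path
and join_path are hypothetical names, and urllib.parse.unquote and
urllib.parse.quote merely stand in for unescape-uri and escape-uri,
whose set of escaped characters differs. The point it shows is the
ordering rule stated above: split at slashes first and unescape
afterwards, then escape each element before joining again.

```python
from urllib.parse import quote, unquote


def split_path(path):
    """Split a path string at slashes, then unescape each element.

    Splitting comes first, so an escaped slash (%2F) inside an
    element cannot be mistaken for a path separator.
    """
    return [unquote(element) for element in path.split("/")]


def join_path(elements):
    """Escape each raw element, then join the results with slashes.

    safe="" makes quote() escape "/" too, so a slash that is part of
    an element's data does not turn into a separator.
    """
    return "/".join(quote(element, safe="") for element in elements)


if __name__ == "__main__":
    print(split_path("foo/bar//baz/"))   # ['foo', 'bar', '', 'baz', '']
    print(join_path(["a b", "c/d"]))     # a%20b/c%2Fd
```

Escaping the whole joined string in one go would instead escape the
separating slashes as well, which is the mistake the escape-uri
description warns about.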