<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <!-- - Do a paragraph check <p> - The Unicode char tables are messed up, but it can't be fixed w/o CSS2 support, which I do not currently find in web browsers. - Can I have bangs, plusses, or slashes in #tags? Spaces? Yes: plus, bang, star No: space Yes: slash, question, ampersand You can't put sharp in a path, so anything goes, really. Nonetheless, some of these confuse Netscape, so I'll avoid them. --> <!--========================================================================--> <html lang=en-US> <head> <meta name="keywords" content="Scheme, programming language, list processing, SRFI"> <link rev=made href="mailto:shivers@ai.mit.edu"> <title>SRFI 14: Character-set Library</title> <!-- Should have a media=all to get, for example, printing to work. == But my Netscape will completely ignore the tag if I do that. --> <style type="text/css"> /* A little general layout hackery for headers & the title. */ body { margin-left: +7%; font-family: "Helvetica", sans-serif; } /* Netscape workaround: */ td, th { font-family: "Helvetica", sans-serif; } code, pre { font-family: "courier new", "courier"; } div.inset { margin-left: +5%; } h1 { margin-left: -5%; } h1, h2 { clear: both; } h1, h2, h3, h4, h5, h6 { color: blue } div.title-text { font-size: large; font-weight: bold; } h3 { margin-top: 2em; margin-bottom: 0em } div.indent { margin-left: 2em; } /* General indentation */ pre.code-example { margin-left: 2em; } /* Indent code examples. */ /* "Continue" class marks text that isn't really the start ** of a new paragraph -- e.g., continuing a para after a ** code sample. */ p.continue { text-indent: 0em; margin-top: 0em} /* This stuff is for definition lists of defined procedures. ** A proc-def1 is used when you want a stack of procs to go ** with one dd body.
In this case, make the first ** proc a proc-def1, following ones proc-defi's, and the last one ** a proc-defn. ** ** Unfortunately, Netscape has huge bugs with respect to style ** sheets and dl list rendering. We have to set truly random ** values here to get the rendering to come out. The proper values ** are in the following style sheet, for Internet Explorer. ** In the following settings, the *comments* say what the ** setting *really* causes Netscape to do. ** ** Ugh. Professional coders sacrifice their self-respect, ** that others may live. */ /* m-t ignored; m-b sets top margin space. */ dt.proc-def1 { margin-top: 0ex; margin-bottom: 3ex; } dt.proc-defi { margin-top: 0ex; margin-bottom: 0ex; } dt.proc-defn { margin-top: 0ex; margin-bottom: 0ex; } /* m-t works weird depending on whether or not the last line ** of the previous entry was a pre. Set to zero. */ dt.proc-def { margin-top: 0ex; margin-bottom: 3ex; } /* m-b sets space between dd & dt; m-t ignored. */ dd.proc-def { margin-bottom: 0.5ex; margin-top: 0ex; } /* Boldface the name of a procedure when it's being defined. */ code.proc-def { font-weight: bold; font-size: 110%} /* For the index of procedures. ** Same hackery as for dt.proc-def, above. */ /* m-b sets space between dd & dt; m-t ignored. */ dd.proc-index { margin-bottom: 0ex; margin-top: 0ex; } /* What the fuck? */ pre.proc-index { margin-top: -2ex; } /* Pull the table of contents back flush with the margin. ** Both NS & IE screw this up in different ways. */ #toc-table { margin-top: -2ex; margin-left: -5%; } /* R5RS proc names are in italic; extended R5RS names ** in italic boldface. */ span.r5rs-proc { font-weight: bold; } span.r5rs-procx { font-style: italic; font-weight: bold; } /* Spread out bibliographic lists. */ /* More Netscape-specific lossage; see the following stylesheet ** for the proper values (used by IE). 
*/ dt.biblio { margin-bottom: 3ex; } /* Links to draft copies (e.g., not at the official SRFI site) ** are colored in red, so people will use them during the ** development process and kill them when the document's done. */ a.draft { color: red; } </style> <style type="text/css" media=all> /* Nastiness: Here, I'm using a bug to work around a bug. ** Netscape rendering bugs mean you need bogus <dt> and <dd> ** margin settings -- settings which screw up IE's proper rendering. ** Fortunately, Netscape has *another* bug: it will ignore this ** media=all style sheet. So I am placing the (proper) IE values ** here. Perhaps, one day, when these rendering bugs are fixed, ** this gross hackery can be removed. */ dt.proc-def1 { margin-top: 3ex; margin-bottom: 0ex; } dt.proc-defi { margin-top: 0ex; margin-bottom: 0ex; } dt.proc-defn { margin-top: 0ex; margin-bottom: 0.5ex; } dt.proc-def { margin-top: 3ex; margin-bottom: 0.5ex; } pre { margin-top: 1ex; } dd.proc-def { margin-bottom: 2ex; margin-top: 0.5ex; } /* For the index of procedures. ** Same hackery as for dt.proc-def, above. */ dd.proc-index { margin-top: 0ex; } pre.proc-index { margin-top: 0ex; } /* Spread out bibliographic lists. */ dt.biblio { margin-top: 3ex; margin-bottom: 0ex; } dd.biblio { margin-bottom: 1ex; } </style> </head> <body> <!--========================================================================--> <h1>Title</h1> <div class=title-text> Character-set Library </div> <!--========================================================================--> <h1>Author</H1> <address> <a href="http://www.ai.mit.edu/~shivers/">Olin Shivers</A> / <a href="mailto:shivers@ai.mit.edu">shivers@ai.mit.edu</A> </address> <!--========================================================================--> <h1>Table of contents</H1> <!-- A bug in netscape (?) keeps the first link in this UL from being active. ==== So the Abstract link be dead. 
99/8/22 -Olin --> <ul id=toc-table> <li><a href="#Abstract">Abstract</a> <li><a href="#VariableIndex">Variable index</a> <li><a href="#Rationale">Rationale</a> <ul> <li><a href="#LinearUpdateOperations">"Linear-update" operations</a> <li><a href="#ExtraSRFI">Extra-SRFI recommendations</a> </ul> <li><a href="#Specification">Specification</a> <ul> <li><a href="#GeneralProcs">General procedures</a> <li><a href="#Iterating">Iterating over character sets</a> <li><a href="#Creating">Creating character sets</a> <li><a href="#Querying">Querying character sets</a> <li><a href="#Algebra">Character set algebra</a> <li><a href="#StandardCharsets">Standard character sets</a> </ul> <li><a href="#StandardCharsetDefs">Unicode, Latin-1 and ASCII definitions of the standard character sets</a> <li><a href="#ReferenceImp">Reference implementation</a> <li><a href="#Acknowledgements">Acknowledgements</a> <li><a href="#Links">References & Links</a> <li><a href="#Copyright">Copyright</a> </ul> <!--========================================================================--> <h1><a name="Abstract">Abstract</a></H1> <p> The ability to efficiently represent and manipulate sets of characters is an unglamorous but very useful capability for text-processing code -- one that tends to pop up in the definitions of other libraries. Hence it is useful to specify a general substrate for this functionality early. This SRFI defines a general library that provides this functionality. It is accompanied by a reference implementation for the spec. The reference implementation is fairly efficient, straightforwardly portable, and has a "free software" copyright. The implementation is tuned for "small" 7 or 8 bit character types, such as ASCII or Latin-1; the data structures and algorithms would have to be altered for larger 16 or 32 bit character types such as Unicode -- however, the specs have been carefully designed with these larger character types in mind. 
Several forthcoming SRFIs can be defined in terms of this one: <ul> <li> string library <li> delimited input procedures (<em>e.g.</em>, <code>read-line</code>) <li> regular expressions </ul> <!--========================================================================--> <h1><a name="VariableIndex">Variable Index</a></h1> <p> Here is the complete set of bindings -- procedural and otherwise -- exported by this library. In a Scheme system that has a module or package system, these procedures should be contained in a module named "char-set-lib". <div class=indent> <dl> <dt class=proc-index> Predicates &amp; comparison <dd class=proc-index> <pre class=proc-index> <a href="#char-set-p">char-set?</a> <a href="#char-set=">char-set=</a> <a href="#char-set<=">char-set&lt;=</a> <a href="#char-set-hash">char-set-hash</a> </pre> <dt class=proc-index> Iterating over character sets <dd class=proc-index> <pre class=proc-index> <a href="#char-set-cursor">char-set-cursor</a> <a href="#char-set-ref">char-set-ref</a> <a href="#char-set-cursor-next">char-set-cursor-next</a> <a href="#end-of-char-set-p">end-of-char-set?</a> <a href="#char-set-fold">char-set-fold</a> <a href="#char-set-unfold">char-set-unfold</a> <a href="#char-set-unfold!">char-set-unfold!</a> <a href="#char-set-for-each">char-set-for-each</a> <a href="#char-set-map">char-set-map</a> </pre> <dt class=proc-index> Creating character sets <dd class=proc-index> <pre class=proc-index> <a href="#char-set-copy">char-set-copy</a> <a href="#char-set">char-set</a> <a href="#list->char-set">list->char-set</a> <a href="#string->char-set">string->char-set</a> <a href="#list->char-set!">list->char-set!</a> <a href="#string->char-set!">string->char-set!</a> <a href="#char-set-filter">char-set-filter</a> <a href="#ucs-range->char-set">ucs-range->char-set</a> <a href="#char-set-filter!">char-set-filter!</a> <a href="#ucs-range->char-set!">ucs-range->char-set!</a> <a href="#->char-set">->char-set</a> </pre> <dt class=proc-index> Querying
character sets <dd class=proc-index> <pre class=proc-index> <a href="#char-set->list">char-set->list</a> <a href="#char-set->string">char-set->string</a> <a href="#char-set-size">char-set-size</a> <a href="#char-set-count">char-set-count</a> <a href="#char-set-contains-p">char-set-contains?</a> <a href="#char-set-every">char-set-every</a> <a href="#char-set-any">char-set-any</a> </pre> <dt class=proc-index> Character-set algebra <dd class=proc-index> <pre class=proc-index> <a href="#char-set-adjoin">char-set-adjoin</a> <a href="#char-set-delete">char-set-delete</a> <a href="#char-set-adjoin!">char-set-adjoin!</a> <a href="#char-set-delete!">char-set-delete!</a> <a href="#char-set-complement">char-set-complement</a> <a href="#char-set-union">char-set-union</a> <a href="#char-set-intersection">char-set-intersection</a> <a href="#char-set-complement!">char-set-complement!</a> <a href="#char-set-union!">char-set-union!</a> <a href="#char-set-intersection!">char-set-intersection!</a> <a href="#char-set-difference">char-set-difference</a> <a href="#char-set-xor">char-set-xor</a> <a href="#char-set-diff+intersection">char-set-diff+intersection</a> <a href="#char-set-difference!">char-set-difference!</a> <a href="#char-set-xor!">char-set-xor!</a> <a href="#char-set-diff+intersection!">char-set-diff+intersection!</a> </pre> <dt class=proc-index> Standard character sets <dd class=proc-index> <pre class=proc-index> <a href="#char-set:lower-case">char-set:lower-case</a> <a href="#char-set:upper-case">char-set:upper-case</a> <a href="#char-set:title-case">char-set:title-case</a> <a href="#char-set:letter">char-set:letter</a> <a href="#char-set:digit">char-set:digit</a> <a href="#char-set:letter+digit">char-set:letter+digit</a> <a href="#char-set:graphic">char-set:graphic</a> <a href="#char-set:printing">char-set:printing</a> <a href="#char-set:whitespace">char-set:whitespace</a> <a href="#char-set:iso-control">char-set:iso-control</a> <a 
href="#char-set:punctuation">char-set:punctuation</a> <a href="#char-set:symbol">char-set:symbol</a> <a href="#char-set:hex-digit">char-set:hex-digit</a> <a href="#char-set:blank">char-set:blank</a> <a href="#char-set:ascii">char-set:ascii</a> <a href="#char-set:empty">char-set:empty</a> <a href="#char-set:full">char-set:full</a> </pre> </dl> </div> <!--========================================================================--> <h1><a name="Rationale">Rationale</a></h1> <p> The ability to efficiently manipulate sets of characters is quite useful for text-processing code. Encapsulating this functionality in a general, efficiently implemented library can assist all such code. This library defines a new data structure to represent these sets, called a "char-set." The char-set type is distinct from all other types. <p> This library is designed to be portable across implementations that use different character types and representations, especially ASCII, Latin-1 and Unicode. Some effort has been made to preserve compatibility with Java in the Unicode case (see the definition of <code>char-set:whitespace</code> for the single real deviation). <!--========================================================================--> <h2><a name="LinearUpdateOperations">Linear-update operations</a></h2> <p> The procedures of this SRFI, by default, are "pure functional" -- they do not alter their parameters. However, this SRFI defines a set of "linear-update" procedures which have a hybrid pure-functional/side-effecting semantics: they are allowed, but not required, to side-effect one of their parameters in order to construct their result. An implementation may legally implement these procedures as pure, side-effect-free functions, or it may implement them using side effects, depending upon the details of what is the most efficient or simple to implement in terms of the underlying representation. <p> The linear-update routines all have names ending with "!". 
<p> Clients <em>may not</em> rely upon these procedures working by side effect. For example, this is not guaranteed to work: <pre class=code-example>
(let* ((cs1 (char-set #\a #\b #\c))      ; cs1 = {a,b,c}.
       (cs2 (char-set-adjoin! cs1 #\d))) ; Add d to {a,b,c}.
  cs1) ; Could be either {a,b,c} or {a,b,c,d}.
</pre> <p class=continue> However, this is well-defined: <pre class=code-example>
(let ((cs (char-set #\a #\b #\c)))
  (char-set-adjoin! cs #\d)) ; Add d to {a,b,c}.
</pre> <p> So clients of these procedures write in a functional style, but must additionally be sure that, when the procedure is called, there are no other live pointers to the potentially-modified character set (hence the term "linear update"). <p> There are two benefits to this convention: <ul> <li> Implementations are free to provide the most efficient possible implementation, either functional or side-effecting. <li> Programmers may nonetheless continue to assume that character sets are purely functional data structures: they may be reliably shared without needing to be copied, uniquified, and so forth. </ul> <p> Note that pure functional representations are the right thing for ASCII- or Latin-1-based Scheme implementations, since a char-set can be represented in an ASCII Scheme with four 32-bit words. Pure set-algebra operations on such a representation are very fast and efficient. Programmers who code using linear-update operations are guaranteed the system will provide the best implementation across multiple platforms. <p> In practice, these procedures are most useful for efficiently constructing character sets in a side-effecting manner, in some limited local context, before passing the character set outside the local construction scope to be used in a functional manner. <p> Scheme provides no assistance in checking the linearity of the potentially side-effected parameters passed to these functions -- there's no linear type checker or run-time mechanism for detecting violations.
(But sophisticated programming environments, such as DrScheme, might help.) <!--========================================================================--> <h2><a name="ExtraSRFI">Extra-SRFI recommendations</a></h2> <p> Users are cautioned that the R5RS predicates <div class=inset><code> char-alphabetic? <br> char-numeric? <br> char-whitespace? <br> char-upper-case? <br> char-lower-case? <br> </code> </div> <p class=continue> may or may not be in agreement with the SRFI 14 base character sets <div class=inset> <code> char-set:letter<br> char-set:digit<br> char-set:whitespace<br> char-set:upper-case<br> char-set:lower-case<br> </code> </div> <p class=continue> Implementors are strongly encouraged to bring these predicates into agreement with the base character sets of this SRFI; not to do so risks major confusion. <!--========================================================================--> <h1><a name="Specification">Specification</a></h1> <p> In the following procedure specifications: <ul> <li> A <var>cs</var> parameter is a character set. <li> An <var>s</var> parameter is a string. <li> A <var>char</var> parameter is a character. <li> A <var>char-list</var> parameter is a list of characters. <li> A <var>pred</var> parameter is a unary character predicate procedure, returning a true/false value when applied to a character. <li> An <var>obj</var> parameter may be any value at all. </ul> <p> Passing values to procedures with these parameters that do not satisfy these types is an error. <p> Unless otherwise noted in the specification of a procedure, procedures always return character sets that are distinct (from the point of view of the linear-update operations) from the parameter character sets. For example, <code>char-set-adjoin</code> is guaranteed to provide a fresh character set, even if it is not given any character parameters. <p> Parameters given in square brackets are optional. 
Unless otherwise noted in the text describing the procedure, any prefix of these optional parameters may be supplied, from zero arguments to the full list. When a procedure returns multiple values, this is shown by listing the return values in square brackets, as well. So, for example, the procedure with signature <pre class=code-example> halts? <var>f [x init-store]</var> -> <var>[boolean integer]</var> </pre> would take one (<var>f</var>), two (<var>f</var>, <var>x</var>) or three (<var>f</var>, <var>x</var>, <var>init-store</var>) input parameters, and return two values, a boolean and an integer. <p> A parameter followed by "<code>...</code>" means zero-or-more elements. So the procedure with the signature <pre class=code-example> sum-squares <var>x ... </var> -> <var>number</var> </pre> takes zero or more arguments (<var>x ...</var>), while the procedure with signature <pre class=code-example> spell-check <var>doc dict<sub>1</sub> dict<sub>2</sub> ...</var> -> <var>string-list</var> </pre> takes two required parameters (<var>doc</var> and <var>dict<sub>1</sub></var>) and zero or more optional parameters (<var>dict<sub>2</sub> ...</var>). <!--========================================================================--> <h2><a name="GeneralProcs">General procedures</a></h2> <dl> <!-- ==== char-set? ============================================================================--> <dt class=proc-def> <a name="char-set-p"></a> <code class=proc-def>char-set?</code><var> obj -> boolean</var> <dd class=proc-def> Is the object <var>obj</var> a character set? <!-- ==== char-set= ============================================================================--> <dt class=proc-def> <a name="char-set="></a> <code class=proc-def>char-set=</code><var> cs<sub>1</sub> ... -> boolean</var> <dd class=proc-def> Are the character sets equal? 
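<p>
As an informal illustration (the names <code>vowels-1</code> and <code>vowels-2</code> are hypothetical, and the bindings of this library are assumed to be in scope), equality depends only on which characters the sets contain, not on how the sets were built:
<pre class=code-example>
(define vowels-1 (char-set #\a #\e #\i #\o #\u))
(define vowels-2 (string->char-set "uoiea"))  ; Same members, other order.

(char-set= vowels-1 vowels-2)                 => <var>true</var>
(char-set= vowels-1 vowels-2 (char-set #\a))  => <var>false</var>
</pre>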
<p> Boundary cases: <pre class=code-example> (char-set=) => <var>true</var> (char-set= cs) => <var>true</var> </pre> <p> Rationale: transitive binary relations are generally extended to n-ary relations in Scheme, which enables clearer, more concise code to be written. While the zero-argument and one-argument cases will almost certainly not arise in first-order uses of such relations, they may well arise in higher-order cases or macro-generated code. <em>E.g.,</em> consider <pre class=code-example> (apply char-set= cset-list) </pre> <p class=continue> This is well-defined if the list is empty or a singleton list. Hence we extend these relations to any number of arguments. Implementors have reported actual uses of n-ary relations in higher-order cases allowing for fewer than two arguments. The way of Scheme is to handle the general case; we provide the fully general extension. <p> A counter-argument to this extension is that <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr>'s transitive binary arithmetic relations (<code>=</code>, <code><</code>, <em>etc.</em>) require at least two arguments, hence this decision is a break with the prior convention -- although it is at least one that is backwards-compatible. <!-- ==== char-set<= ============================================================================--> <dt class=proc-def> <a name="char-set<="></a> <code class=proc-def>char-set<=</code><var> cs<sub>1</sub> ... -> boolean</var> <dd class=proc-def> Returns true if every character set <var>cs<sub>i</sub></var> is a subset of character set <var>cs<sub>i+1</sub></var>. <p> Boundary cases: <pre class=code-example> (char-set<=) => <var>true</var> (char-set<= cs) => <var>true</var> </pre> <p> Rationale: See <code>char-set=</code> for discussion of zero- and one-argument applications. 
Consider testing a list of char-sets for monotonicity with <pre class=code-example>
(apply char-set<= cset-list)
</pre> <!-- ==== char-set-hash ============================================================================--> <dt class=proc-def> <a name="char-set-hash"></a> <code class=proc-def>char-set-hash</code><var> cs [bound] -> integer</var> <dd class=proc-def> Compute a hash value for the character set <var>cs</var>. <var>Bound</var> is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0,<var>bound</var>). <p> If <var>bound</var> is either zero or not given, the implementation may use an implementation-specific default value, chosen to be as large as is efficiently practical. For instance, the default range might be chosen for a given implementation to map all character sets into the range of integers that can be represented with a single machine word. <p> Invariant: <pre class=code-example>
(char-set= cs<sub>1</sub> cs<sub>2</sub>) => (= (char-set-hash cs<sub>1</sub> b) (char-set-hash cs<sub>2</sub> b))
</pre> <p> A legal but nonetheless discouraged implementation (the constant must be 0 so that it lies in [0,<var>bound</var>) even when <var>bound</var> is 1): <pre class=code-example>
(define (char-set-hash cs . maybe-bound) 0)
</pre> <p> Rationale: allowing the user to specify an explicit bound simplifies user code by removing the mod operation that typically accompanies every hash computation, and also may allow the implementation of the hash function to exploit a reduced range to efficiently compute the hash value. <em>E.g.</em>, for small bounds, the hash function may be computed in a fashion such that intermediate values never overflow into bignum integers, allowing the implementor to provide a fixnum-specific "fast path" for computing the common cases very rapidly.
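<p>
A minimal sketch of how client code might exploit the <var>bound</var> parameter, assuming the bindings of this library are in scope. The names <code>table-size</code>, <code>table</code>, <code>table-insert!</code> and <code>table-lookup</code> are hypothetical and are not part of this SRFI; the sketch only shows that the hash value can be used directly as a bucket index, with no separate mod step.
<pre class=code-example>
;; A hypothetical 64-bucket table keyed by character sets.
(define table-size 64)
(define table (make-vector table-size '()))

;; Passing TABLE-SIZE as the bound yields an index in [0,64) directly.
(define (table-insert! cs value)
  (let ((i (char-set-hash cs table-size)))
    (vector-set! table i (cons (cons cs value) (vector-ref table i)))))

;; Equal sets hash to the same bucket, so only that bucket is searched.
(define (table-lookup cs)
  (let lp ((bucket (vector-ref table (char-set-hash cs table-size))))
    (cond ((null? bucket) #f)
          ((char-set= cs (caar bucket)) (cdar bucket))
          (else (lp (cdr bucket))))))

(table-insert! (char-set #\a #\b) 'ab)
(table-lookup (string->char-set "ba"))  => ab
(table-lookup (char-set #\z))           => #f
</pre>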
</dl> <!--========================================================================--> <h2><a name="Iterating">Iterating over character sets</a></h2> <dl> <!-- ==== char-set-cursor char-set-ref char-set-cursor-next end-of-char-set? ============================================================================--> <dt class=proc-def1> <a name="char-set-cursor"></a> <a name="char-set-ref"></a> <a name="char-set-cursor-next"></a> <a name="end-of-char-set-p"></a> <code class=proc-def>char-set-cursor</code><var> cset -> cursor</var> <dt class=proc-defi> <code class=proc-def>char-set-ref</code><var> cset cursor -> char</var> <dt class=proc-defi> <code class=proc-def>char-set-cursor-next</code><var> cset cursor -> cursor</var> <dt class=proc-defn> <code class=proc-def>end-of-char-set?</code><var> cursor -> boolean</var> <dd class=proc-def> Cursors are a low-level facility for iterating over the characters in a set. A cursor is a value that indexes a character in a char set. <code>char-set-cursor</code> produces a new cursor for a given char set. The set element indexed by the cursor is fetched with <code>char-set-ref</code>. A cursor index is incremented with <code>char-set-cursor-next</code>; in this way, code can step through every character in a char set. Stepping a cursor "past the end" of a char set produces a cursor that answers true to <code>end-of-char-set?</code>. It is an error to pass such a cursor to <code>char-set-ref</code> or to <code>char-set-cursor-next</code>. <p> A cursor value may not be used in conjunction with a different character set; if it is passed to <code>char-set-ref</code> or <code>char-set-cursor-next</code> with a character set other than the one used to create it, the results and effects are undefined. <p> Cursor values are <em>not</em> necessarily distinct from other types. They may be integers, linked lists, records, procedures or other values. 
This license is granted to allow cursors to be very "lightweight" values suitable for tight iteration, even in fairly simple implementations. <p> Note that these primitives are necessary to export an iteration facility for char sets to loop macros. <p> Example: <pre class=code-example>
(define cs (char-set #\G #\a #\T #\e #\c #\h))

;; Collect elts of CS into a list.
(let lp ((cur (char-set-cursor cs)) (ans '()))
  (if (end-of-char-set? cur) ans
      (lp (char-set-cursor-next cs cur)
          (cons (char-set-ref cs cur) ans))))
  => (#\G #\T #\a #\c #\e #\h)

;; Equivalently, using a list unfold (from SRFI 1).
;; CURRY (partial application) is assumed here; SRFI 1 does not define it.
(unfold-right end-of-char-set?
              (curry char-set-ref cs)
              (curry char-set-cursor-next cs)
              (char-set-cursor cs))
  => (#\G #\T #\a #\c #\e #\h)
</pre> <p> Rationale: Note that the cursor API's four functions "fit" the functional protocol used by the unfolders provided by the list, string and char-set SRFIs (see the example above). By way of contrast, here is a simpler, two-function API that was rejected for failing this criterion. Besides <code>char-set-cursor</code>, it provided a single function that mapped a cursor and a character set to two values, the indexed character and the next cursor. If the cursor had exhausted the character set, then this function returned false instead of the character value, and another end-of-char-set cursor. In this way, the other three functions of the current API were combined together. <!-- ==== char-set-fold ============================================================================--> <dt class=proc-def> <a name="char-set-fold"></a> <code class=proc-def>char-set-fold</code><var> kons knil cs -> object</var> <dd class=proc-def> This is the fundamental iterator for character sets. Applies the function <var>kons</var> across the character set <var>cs</var> using initial state value <var>knil</var>. That is, if <var>cs</var> is the empty set, the procedure returns <var>knil</var>.
Otherwise, some element <var>c</var> of <var>cs</var> is chosen; let <var>cs'</var> be the remaining, unchosen characters. The procedure returns <pre class=code-example>
(char-set-fold <var>kons</var> (<var>kons</var> <var>c</var> <var>knil</var>) <var>cs'</var>)
</pre> <p> Examples: <pre class=code-example>
;; CHAR-SET-MEMBERS
(lambda (cs) (char-set-fold cons '() cs))

;; CHAR-SET-SIZE
(lambda (cs) (char-set-fold (lambda (c i) (+ i 1)) 0 cs))

;; How many vowels in the char set?
(lambda (cs)
  (char-set-fold (lambda (c i) (if (vowel? c) (+ i 1) i))
                 0 cs))
</pre> <!-- ==== char-set-unfold char-set-unfold! ============================================================================--> <dt class=proc-def1> <a name="char-set-unfold"></a> <a name="char-set-unfold!"></a> <code class=proc-def>char-set-unfold </code><var> p f g seed [base-cs] -> char-set</var> <dt class=proc-defn><code class=proc-def>char-set-unfold!</code><var> p f g seed base-cs -> char-set</var> <dd class=proc-def> This is a fundamental constructor for char-sets. <ul> <li> <var>G</var> is used to generate a series of "seed" values from the initial seed: <var>seed</var>, (<var>g</var> <var>seed</var>), (<var>g<sup>2</sup></var> <var>seed</var>), (<var>g<sup>3</sup></var> <var>seed</var>), ... <li> <var>P</var> tells us when to stop: generation ends when it returns true for one of these seed values. <li> <var>F</var> maps each seed value to a character. These characters are added to the base character set <var>base-cs</var> to form the result; <var>base-cs</var> defaults to the empty set. <code>char-set-unfold!</code> adds the characters to <var>base-cs</var> in a linear-update fashion -- it is allowed, but not required, to side-effect and use <var>base-cs</var>'s storage to construct the result. </ul> <p> More precisely, the following definitions hold, ignoring the optional-argument issues: <pre class=code-example>
(define (char-set-unfold p f g seed base-cs)
  (char-set-unfold!
    p f g seed (char-set-copy base-cs)))

(define (char-set-unfold! p f g seed base-cs)
  (let lp ((seed seed) (cs base-cs))
    (if (p seed) cs                            ; P says we are done.
        (lp (g seed)                           ; Loop on (G SEED).
            (char-set-adjoin! cs (f seed)))))) ; Add (F SEED) to set.
</pre> (Note that the actual implementation may be more efficient.) <p> Examples: <pre class=code-example>
(port->char-set p) = (char-set-unfold eof-object? values
                                      (lambda (x) (read-char p))
                                      (read-char p))

(list->char-set lis) = (char-set-unfold null? car cdr lis)
</pre> <!-- ==== char-set-for-each ============================================================================--> <dt class=proc-def> <a name="char-set-for-each"></a> <code class=proc-def>char-set-for-each</code><var> proc cs -> unspecified</var> <dd class=proc-def> Apply procedure <var>proc</var> to each character in the character set <var>cs</var>. Note that the order in which <var>proc</var> is applied to the characters in the set is not specified, and may even change from one procedure application to another. <p> Nothing at all is specified about the value returned by this procedure; it is not even required to be consistent from call to call. It is simply required to be a value (or values) that may be passed to a command continuation, <em>e.g.</em> as the value of an expression appearing as a non-terminal subform of a <code>begin</code> expression. Note that in <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr>, this restricts the procedure to returning a single value; non-R5RS systems may not even impose this restriction. <!-- ==== char-set-map ============================================================================--> <dt class=proc-def> <a name="char-set-map"></a> <code class=proc-def>char-set-map</code><var> proc cs -> char-set</var> <dd class=proc-def> <var>proc</var> is a char->char procedure. Apply it to all the characters in the char-set <var>cs</var>, and collect the results into a new character set.
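<p>
One way to picture this behaviour is the following sketch, written in terms of <code>char-set-fold</code> and <code>char-set-adjoin!</code>. The name <code>my-char-set-map</code> is hypothetical, and this is not claimed to be how any particular implementation defines <code>char-set-map</code>.
<pre class=code-example>
;; Sketch only; MY-CHAR-SET-MAP is a hypothetical name.
(define (my-char-set-map proc cs)
  (char-set-fold (lambda (c ans) (char-set-adjoin! ans (proc c)))
                 (char-set)   ; Fresh empty set, referenced only by the fold.
                 cs))

;; The result can be smaller than the input when PROC maps
;; several characters to the same character.
(char-set-size (my-char-set-map char-downcase (char-set #\A #\a)))  => 1
</pre>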
<p> Essentially lifts <var>proc</var> from a char->char procedure to a char-set -> char-set procedure. <p> Example: <pre class=code-example> (char-set-map char-downcase cset) </pre> </dl> <!--========================================================================--> <h2><a name="Creating">Creating character sets</a></h2> <dl> <!-- ==== char-set-copy ============================================================================--> <dt class=proc-def> <a name="char-set-copy"></a> <code class=proc-def>char-set-copy</code><var> cs -> char-set</var> <dd class=proc-def> Returns a copy of the character set <var>cs</var>. "Copy" means that if either the input parameter or the result value of this procedure is passed to one of the linear-update procedures described below, the other character set is guaranteed not to be altered. <p> A system that provides pure-functional implementations of the linear-operator suite could implement this procedure as the identity function -- so copies are <em>not</em> guaranteed to be distinct by <code>eq?</code>. <!-- ==== char-set ============================================================================--> <dt class=proc-def> <a name="char-set"></a> <code class=proc-def>char-set</code><var> char<sub>1</sub> ... -> char-set</var> <dd class=proc-def> Return a character set containing the given characters. <!-- ==== list->char-set list->char-set ============================================================================--> <dt class=proc-def1> <a name="list->char-set"></a> <a name="list->char-set!"></a> <code class=proc-def>list->char-set </code><var> char-list [base-cs] -> char-set</var> <dt class=proc-defn><code class=proc-def>list->char-set!</code><var> char-list base-cs -> char-set</var> <dd class=proc-def> Return a character set containing the characters in the list of characters <var>char-list</var>. <p> If character set <var>base-cs</var> is provided, the characters from <var>char-list</var> are added to it. 
<code>list->char-set!</code> is allowed, but not required, to side-effect and reuse the storage in <var>base-cs</var>; <code>list->char-set</code> produces a fresh character set. <!-- ==== string->char-set string->char-set! ============================================================================--> <dt class=proc-def1> <a name="string->char-set"></a> <a name="string->char-set!"></a> <code class=proc-def>string->char-set </code><var> s [base-cs] -> char-set</var> <dt class=proc-defn><code class=proc-def>string->char-set!</code><var> s base-cs -> char-set</var> <dd class=proc-def> Return a character set containing the characters in the string <var>s</var>. <p> If character set <var>base-cs</var> is provided, the characters from <var>s</var> are added to it. <code>string->char-set!</code> is allowed, but not required, to side-effect and reuse the storage in <var>base-cs</var>; <code>string->char-set</code> produces a fresh character set. <!-- ==== char-set-filter char-set-filter! ============================================================================--> <dt class=proc-def1> <a name="char-set-filter"></a> <a name="char-set-filter!"></a> <code class=proc-def>char-set-filter </code><var> pred cs [base-cs] -> char-set</var> <dt class=proc-defn><code class=proc-def>char-set-filter!</code><var> pred cs base-cs -> char-set</var> <dd class=proc-def> Returns a character set containing every character <var>c</var> in <var>cs</var> such that <code>(<var>pred</var> <var>c</var>)</code> returns true. <p> If character set <var>base-cs</var> is provided, the characters specified by <var>pred</var> are added to it. <code>char-set-filter!</code> is allowed, but not required, to side-effect and reuse the storage in <var>base-cs</var>; <code>char-set-filter</code> produces a fresh character set. 
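<p>
As a minimal sketch of the behaviour described above (illustrative only, not part of the specification; the names <code>vowel?</code>, <code>vowels</code> and <code>vowels+digits</code> are hypothetical), filtering with and without a <var>base-cs</var> argument might look like this:
<pre class=code-example>
;; Hypothetical predicate, introduced only for this illustration.
(define (vowel? c)
  (memv (char-downcase c) '(#\a #\e #\i #\o #\u)))

;; Bound the predicate's domain with CS instead of scanning char-set:full.
(define vowels (char-set-filter vowel? char-set:letter))

;; With BASE-CS supplied, the selected characters are added to it.
(define vowels+digits
  (char-set-filter vowel? char-set:letter char-set:digit))

(char-set-contains? vowels #\e)          ; true
(char-set-contains? vowels+digits #\7)   ; true
(char-set-contains? vowels+digits #\z)   ; false
</pre>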
<p> An implementation may not save away a reference to <var>pred</var> and invoke it after <code>char-set-filter</code> or <code>char-set-filter!</code> returns -- that is, "lazy," on-demand implementations are not allowed, as <var>pred</var> may have external dependencies on mutable data or have other side-effects. <p> Rationale: This procedure provides a means of converting a character predicate into its equivalent character set; the <var>cs</var> parameter allows the programmer to bound the predicate's domain. Programmers should be aware that filtering a character set such as <code>char-set:full</code> could be a very expensive operation in an implementation that provided an extremely large character type, such as 32-bit Unicode. An earlier draft of this library provided a simple <code>predicate->char-set</code> procedure, which was rejected in favor of <code>char-set-filter</code> for this reason. <!-- ==== ucs-range->char-set ucs-range->char-set! ============================================================================--> <dt class=proc-def1> <a name="ucs-range->char-set"></a> <a name="ucs-range->char-set!"></a> <code class=proc-def>ucs-range->char-set </code><var> lower upper [error? base-cs] -> char-set</var> <dt class=proc-defn><code class=proc-def>ucs-range->char-set!</code><var> lower upper error? base-cs -> char-set</var> <dd class=proc-def> <var>Lower</var> and <var>upper</var> are exact non-negative integers; <var>lower</var> <= <var>upper</var>. <p> Returns a character set containing every character whose ISO/IEC 10646 UCS-4 code lies in the half-open range [<var>lower</var>,<var>upper</var>). <ul> <li> If the requested range includes unassigned UCS values, these are silently ignored (the current UCS specification has "holes" in the space of assigned codes). 
<li> If the requested range includes "private" or "user space" codes, these are handled in an implementation-specific manner; however, a UCS- or Unicode-based Scheme implementation should pass them through transparently. <li> If any code from the requested range specifies a valid, assigned UCS character that has no corresponding representative in the implementation's character type, then (1) an error is raised if <var>error?</var> is true, and (2) the code is ignored if <var>error?</var> is false (the default). This might happen, for example, if the implementation uses ASCII characters, and the requested range includes non-ASCII characters. </ul> <p> If character set <var>base-cs</var> is provided, the characters specified by the range are added to it. <code>ucs-range->char-set!</code> is allowed, but not required, to side-effect and reuse the storage in <var>base-cs</var>; <code>ucs-range->char-set</code> produces a fresh character set. <p> Note that ASCII codes are a subset of the Latin-1 codes, which are in turn a subset of the 16-bit Unicode codes, which are themselves a subset of the 32-bit UCS-4 codes. We commit to a specific encoding in this routine, regardless of the underlying representation of characters, so that client code using this library will be portable. <em>I.e.</em>, a conformant Scheme implementation may use EBCDIC or SHIFT-JIS to encode characters; it must simply map the UCS characters from the given range into the native representation when possible, and report errors when not possible. <!-- ==== ->char-set ============================================================================--> <dt class=proc-def> <a name="->char-set"></a> <code class=proc-def>->char-set</code><var> x -> char-set</var> <dd class=proc-def> Coerces <var>x</var> into a char-set. <var>X</var> may be a string, character or char-set. A string is converted to the set of its constituent characters; a character is converted to a singleton set; a char-set is returned as-is. 
This procedure is intended for use by other procedures that want to provide "user-friendly," wide-spectrum interfaces to their clients. </dl> <!--========================================================================--> <h2><a name="Querying">Querying character sets</a></h2> <dl> <!-- ==== char-set-size ============================================================================--> <dt class=proc-def> <a name="char-set-size"></a> <code class=proc-def>char-set-size</code><var> cs -> integer</var> <dd class=proc-def> Returns the number of elements in character set <var>cs</var>. <!-- ==== char-set-count ============================================================================--> <dt class=proc-def> <a name="char-set-count"></a> <code class=proc-def>char-set-count</code><var> pred cs -> integer</var> <dd class=proc-def> Apply <var>pred</var> to the chars of character set <var>cs</var>, and return the number of chars that caused the predicate to return true. <!-- ==== char-set->list ============================================================================--> <dt class=proc-def> <a name="char-set->list"></a> <code class=proc-def>char-set->list</code><var> cs -> character-list</var> <dd class=proc-def> This procedure returns a list of the members of character set <var>cs</var>. The order in which <var>cs</var>'s characters appear in the list is not defined, and may be different from one call to another. <!-- ==== char-set->string ============================================================================--> <dt class=proc-def> <a name="char-set->string"></a> <code class=proc-def>char-set->string</code><var> cs -> string</var> <dd class=proc-def> This procedure returns a string containing the members of character set <var>cs</var>. The order in which <var>cs</var>'s characters appear in the string is not defined, and may be different from one call to another. <!-- ==== char-set-contains? 
============================================================================--> <dt class=proc-def> <a name="char-set-contains-p"></a> <code class=proc-def>char-set-contains?</code><var> cs char -> boolean</var> <dd class=proc-def> This procedure tests <var>char</var> for membership in character set <var>cs</var>. <p> The MIT Scheme character-set package called this procedure <code>char-set-member?</code>, but that name suggests the opposite argument order (character first, then set). <!-- ==== char-set-every char-set-any ============================================================================--> <dt class=proc-def1> <a name="char-set-every"></a> <a name="char-set-any"></a> <code class=proc-def>char-set-every</code><var> pred cs -> boolean</var> <dt class=proc-defn><code class=proc-def>char-set-any </code><var> pred cs -> boolean</var> <dd class=proc-def> The <code>char-set-every</code> procedure returns true if predicate <var>pred</var> returns true for every character in the character set <var>cs</var>. Likewise, <code>char-set-any</code> applies <var>pred</var> to the characters in character set <var>cs</var>, and returns the first true value it finds. If no character produces a true value, it returns false. The order in which these procedures sequence through the elements of <var>cs</var> is not specified. <p> Note that if you need to determine the actual character on which a predicate returns true, use <code>char-set-any</code> and arrange for the predicate to return the character parameter as its true value, <em>e.g.</em> <pre class=code-example> (char-set-any (lambda (c) (and (char-upper-case? 
c) c)) cs) </pre> </dl> <!--========================================================================--> <h2><a name="Algebra">Character-set algebra</a></h2> <dl> <!-- ==== char-set-adjoin char-set-delete ============================================================================--> <dt class=proc-def1> <a name="char-set-adjoin"></a> <a name="char-set-delete"></a> <code class=proc-def>char-set-adjoin</code><var> cs char<sub>1</sub> ... -> char-set</var> <dt class=proc-defn><code class=proc-def>char-set-delete</code><var> cs char<sub>1</sub> ... -> char-set</var> <dd class=proc-def> Add/delete the <var>char<sub>i</sub></var> characters to/from character set <var>cs</var>. <!-- ==== char-set-adjoin! char-set-delete! ============================================================================--> <dt class=proc-def1> <a name="char-set-adjoin!"></a> <a name="char-set-delete!"></a> <code class=proc-def>char-set-adjoin!</code><var> cs char<sub>1</sub> ... -> char-set</var> <dt class=proc-defn><code class=proc-def>char-set-delete!</code><var> cs char<sub>1</sub> ... -> char-set</var> <dd class=proc-def> Linear-update variants. These procedures are allowed, but not required, to side-effect their first parameter. <!-- ==== char-set-complement char-set-union char-set-intersection ==== char-set-difference char-set-xor char-set-diff+intersection ============================================================================--> <dt class=proc-def1> <a name="char-set-complement"></a> <a name="char-set-union"></a> <a name="char-set-intersection"></a> <a name="char-set-difference"></a> <a name="char-set-xor"></a> <a name="char-set-diff+intersection"></a> <code class=proc-def>char-set-complement</code><var> cs -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-union</code><var> cs<sub>1</sub> ... -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-intersection</code><var> cs<sub>1</sub> ... 
-> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-difference</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-xor</code><var> cs<sub>1</sub> ... -> char-set</var> <dt class=proc-defn><code class=proc-def>char-set-diff+intersection</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> [char-set char-set]</var> <dd class=proc-def> These procedures implement set complement, union, intersection, difference, and exclusive-or for character sets. The union, intersection and xor operations are n-ary. The difference function is also n-ary, associates to the left (that is, it computes the difference between its first argument and the union of all the other arguments), and requires at least one argument. <p> Boundary cases:
<pre class=code-example>
(char-set-union) => char-set:empty
(char-set-intersection) => char-set:full
(char-set-xor) => char-set:empty
(char-set-difference <var>cs</var>) => <var>cs</var>
</pre>
<p> <code>char-set-diff+intersection</code> returns both the difference and the intersection of the arguments -- it partitions its first parameter. It is equivalent to
<pre class=code-example>
(values (char-set-difference <var>cs<sub>1</sub></var> <var>cs<sub>2</sub></var> ...)
        (char-set-intersection <var>cs<sub>1</sub></var>
                               (char-set-union <var>cs<sub>2</sub></var> ...)))
</pre>
but can be implemented more efficiently. <p> Programmers should be aware that <code>char-set-complement</code> could potentially be a very expensive operation in Scheme implementations that provide a very large character type, such as 32-bit Unicode. If this is a possibility, sets can be complemented with respect to a smaller universe using <code>char-set-difference</code>. <!-- ==== char-set-complement! char-set-union! char-set-intersection! ==== char-set-difference! char-set-xor! char-set-diff+intersection! 
============================================================================--> <dt class=proc-def1> <a name="char-set-complement!"></a> <a name="char-set-union!"></a> <a name="char-set-intersection!"></a> <a name="char-set-difference!"></a> <a name="char-set-xor!"></a> <a name="char-set-diff+intersection!"></a> <code class=proc-def>char-set-complement!</code><var> cs -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-union!</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-intersection!</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-difference!</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> char-set</var> <dt class=proc-defi><code class=proc-def>char-set-xor!</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> char-set</var> <dt class=proc-defn><code class=proc-def>char-set-diff+intersection!</code><var> cs<sub>1</sub> cs<sub>2</sub> cs<sub>3</sub> ... -> [char-set char-set]</var> <dd class=proc-def> These are linear-update variants of the set-algebra functions. They are allowed, but not required, to side-effect their first (required) parameter. <p> <code>char-set-diff+intersection!</code> is allowed to side-effect both of its two required parameters, <var>cs<sub>1</sub></var> and <var>cs<sub>2</sub></var>. 
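<p>
Because side effects are permitted but not required, it seems safest for portable code to use the value returned by a linear-update procedure rather than assume its argument was mutated in place. A minimal sketch, assuming the hypothetical variable name <code>cs</code> and working on fresh sets so that no standard set is altered:
<pre class=code-example>
;; Start from a fresh copy so char-set:digit itself is never altered.
(define cs (char-set-copy char-set:digit))

;; Rebind to the result; do not assume CS was updated in place.
(set! cs (char-set-union! cs (char-set #\a #\b)))

;; The difference and the intersection come back as two values.
(call-with-values
  (lambda () (char-set-diff+intersection! cs (char-set #\a #\1)))
  (lambda (diff int)
    (list (char-set-contains? diff #\a)     ; false
          (char-set-contains? int #\a))))   ; true
</pre>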
</dl> <!--========================================================================--> <h2><a name="StandardCharsets">Standard character sets</a></h2> <p> Several character sets are predefined for convenience: <a name="char-set:lower-case"></a> <a name="char-set:upper-case"></a> <a name="char-set:title-case"></a> <a name="char-set:letter"></a> <a name="char-set:digit"></a> <a name="char-set:letter+digit"></a> <a name="char-set:graphic"></a> <a name="char-set:printing"></a> <a name="char-set:whitespace"></a> <a name="char-set:iso-control"></a> <a name="char-set:punctuation"></a> <a name="char-set:symbol"></a> <a name="char-set:hex-digit"></a> <a name="char-set:blank"></a> <a name="char-set:ascii"></a> <a name="char-set:empty"></a> <a name="char-set:full"></a> <div class=inset> <table cellpadding=0 cellspacing=0> <tr><td><code>char-set:lower-case</code> </td><td>Lower-case letters</td></tr> <tr><td><code>char-set:upper-case</code> </td><td>Upper-case letters</td></tr> <tr><td><code>char-set:title-case</code> </td><td>Title-case letters</td></tr> <tr><td><code>char-set:letter</code> </td><td>Letters</td></tr> <tr><td><code>char-set:digit</code> </td><td>Digits</td></tr> <tr><td><code>char-set:letter+digit</code> </td><td>Letters and digits</td></tr> <tr><td><code>char-set:graphic</code> </td><td>Printing characters except spaces</td></tr> <tr><td><code>char-set:printing</code> </td><td>Printing characters including spaces</td></tr> <tr><td><code>char-set:whitespace</code> </td><td>Whitespace characters </td></tr> <tr><td><code>char-set:iso-control</code> </td><td>The ISO control characters </td></tr> <tr><td><code>char-set:punctuation</code> </td><td>Punctuation characters</td></tr> <tr><td><code>char-set:symbol</code> </td><td>Symbol characters</td></tr> <tr><td><code>char-set:hex-digit</code> </td><td>A hexadecimal digit: 0-9, A-F, a-f </td></tr> <tr><td><code>char-set:blank</code> </td><td>Blank characters -- horizontal 
whitespace</td></tr> <tr><td><code>char-set:ascii</code> </td><td>All characters in the ASCII set </td></tr> <tr><td><code>char-set:empty</code> </td><td>Empty set </td></tr> <tr><td><code>char-set:full</code> </td><td>All characters </td></tr> </table> </div> <p> Note that there may be characters in <code>char-set:letter</code> that are neither upper nor lower case -- this might occur in implementations that use a character type richer than ASCII, such as Unicode. A "graphic character" is one that would put ink on your page. While the exact composition of these sets may vary depending upon the character type provided by the underlying Scheme system, here are the definitions for some of the sets in an ASCII implementation: <div class=inset> <table cellpadding=0 cellspacing=0> <tr><td><code>char-set:lower-case</code> </td><td>a-z </td></tr> <tr><td><code>char-set:upper-case</code> </td><td>A-Z </td></tr> <tr><td><code>char-set:letter</code> </td><td>A-Z and a-z </td></tr> <tr><td><code>char-set:digit</code> </td><td>0123456789</td></tr> <tr><td><code>char-set:punctuation</code> </td><td><code>!"#%&amp;'()*,-./:;?@[\]_{}</code></td></tr> <tr><td><code>char-set:symbol</code> </td><td><code>$+&lt;=&gt;^`|~</code></td></tr> <tr><td><code>char-set:whitespace</code> </td><td>Space, newline, tab, form feed, </td></tr> <tr><td></td><td> vertical tab, carriage return </td></tr> <tr><td><code>char-set:blank</code> </td><td>Space and tab </td></tr> <tr><td><code>char-set:graphic</code> </td><td>letter + digit + punctuation + symbol</td></tr> <tr><td><code>char-set:printing</code> </td><td>graphic + whitespace</td></tr> <tr><td><code>char-set:iso-control</code> </td><td>ASCII 0-31 and 127 </td></tr> </table> </div> <p> Note that the existence of the <code>char-set:ascii</code> set implies that the underlying character set is required to be at least as rich as ASCII (including ASCII's control characters). 
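<p>
The ASCII relationships tabulated above can be spot-checked with the querying and algebra procedures of this library. The following is a sketch only; the name <code>ascii-graphic</code> is hypothetical, and the checks are restricted to the ASCII subset so that they should also hold in richer implementations:
<pre class=code-example>
;; letter + digit + punctuation + symbol, restricted to ASCII.
(define ascii-graphic
  (char-set-intersection
    char-set:ascii
    (char-set-union char-set:letter char-set:digit
                    char-set:punctuation char-set:symbol)))

;; Each such character is expected to be a member of char-set:graphic.
(char-set-every (lambda (c) (char-set-contains? char-set:graphic c))
                ascii-graphic)

;; ASCII has 128 characters, control characters included.
(char-set-size char-set:ascii)                 ; 128
</pre>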
<p> Rationale: The name choices reflect a shift from the older "alphabetic/numeric" terms found in <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr> and POSIX to newer, Unicode-influenced "letter/digit" lexemes. <!--========================================================================--> <h1><a name="StandardCharsetDefs"> Unicode, Latin-1 and ASCII definitions of the standard character sets</a> </h1> <p> In Unicode Scheme implementations, the base character sets are compatible with Java's Unicode specifications. For ASCII or Latin-1, we simply restrict the Unicode set specifications to their first 128 or 256 codes, respectively. Scheme implementations that are not based on ASCII, Latin-1 or Unicode should attempt to preserve the sense or spirit of these definitions. <p> The following descriptions frequently make reference to the "Unicode character database." This is a file, available at the URL <div class=inset> <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt"> ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt</a> </div> <p class=continue> Each line contains a description of a Unicode character. The first semicolon-delimited field of the line gives the hex value of the character's code; the second field gives the name of the character, and the third field gives a two-letter category. Other fields give simple 1-1 case-mappings for the character and other information; see <div class=inset> <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html"> ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html</a> </div> <p class=continue> for further description of the file's format. Note in particular the two-letter category specified in the third field, which is referenced frequently in the descriptions below. 
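<p>
To make the file format concrete, here is a rough sketch of pulling the first three fields out of one database line, using only R5RS string procedures. The helper <code>split-fields</code> and the variables <code>sample</code> and <code>fields</code> are hypothetical names introduced for illustration; the sample line is assumed to follow the format described above.
<pre class=code-example>
;; Split STR at each semicolon, returning a list of field strings.
;; Scans right to left so the list is built in field order.
(define (split-fields str)
  (let lp ((i (- (string-length str) 1))
           (end (string-length str))
           (fields '()))
    (cond ((negative? i) (cons (substring str 0 end) fields))
          ((char=? (string-ref str i) #\;)
           (lp (- i 1) i (cons (substring str (+ i 1) end) fields)))
          (else (lp (- i 1) end fields)))))

;; A sample line, assumed to be in the format described above.
(define sample "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")

(define fields (split-fields sample))
(string->number (car fields) 16)   ; 65, the character's code
(cadr fields)                      ; "LATIN CAPITAL LETTER A"
(caddr fields)                     ; "Lu", the two-letter category
</pre>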
<!--========================================================================--> <h2><a name="lower-case-def">char-set:lower-case</a></h2> <p> For Unicode, we follow Java's specification: a character is lowercase if <ul> <li> it is not in the range [U+2000,U+2FFF], and <li> the Unicode attribute table does not give a lowercase mapping for it, and <li> at least one of the following is true: <ul> <li> the Unicode attribute table gives a mapping to uppercase for the character, or <li> the name for the character in the Unicode attribute table contains the words "SMALL LETTER" or "SMALL LIGATURE". </ul> </ul> <p> The lower-case ASCII characters are <div class=inset> abcdefghijklmnopqrstuvwxyz </div> <p class=continue> Latin-1 adds another 33 lower-case characters to the ASCII set: <div class=inset> <table cellpadding=0 cellspacing=0> <tr><td>00B5</td> <td>MICRO SIGN</td></tr> <tr><td>00DF</td> <td>LATIN SMALL LETTER SHARP S</td></tr> <tr><td>00E0</td> <td>LATIN SMALL LETTER A WITH GRAVE</td></tr> <tr><td>00E1</td> <td>LATIN SMALL LETTER A WITH ACUTE</td></tr> <tr><td>00E2</td> <td>LATIN SMALL LETTER A WITH CIRCUMFLEX</td></tr> <tr><td>00E3</td> <td>LATIN SMALL LETTER A WITH TILDE</td></tr> <tr><td>00E4</td> <td>LATIN SMALL LETTER A WITH DIAERESIS</td></tr> <tr><td>00E5</td> <td>LATIN SMALL LETTER A WITH RING ABOVE</td></tr> <tr><td>00E6</td> <td>LATIN SMALL LETTER AE</td></tr> <tr><td>00E7</td> <td>LATIN SMALL LETTER C WITH CEDILLA</td></tr> <tr><td>00E8</td> <td>LATIN SMALL LETTER E WITH GRAVE</td></tr> <tr><td>00E9</td> <td>LATIN SMALL LETTER E WITH ACUTE</td></tr> <tr><td>00EA</td> <td>LATIN SMALL LETTER E WITH CIRCUMFLEX</td></tr> <tr><td>00EB</td> <td>LATIN SMALL LETTER E WITH DIAERESIS</td></tr> <tr><td>00EC</td> <td>LATIN SMALL LETTER I WITH GRAVE</td></tr> <tr><td>00ED</td> <td>LATIN SMALL LETTER I WITH ACUTE</td></tr> <tr><td>00EE</td> <td>LATIN SMALL LETTER I WITH CIRCUMFLEX</td></tr> <tr><td>00EF</td> <td>LATIN SMALL LETTER I WITH DIAERESIS</td></tr> 
<tr><td>00F0</td> <td>LATIN SMALL LETTER ETH</td></tr> <tr><td>00F1</td> <td>LATIN SMALL LETTER N WITH TILDE</td></tr> <tr><td>00F2</td> <td>LATIN SMALL LETTER O WITH GRAVE</td></tr> <tr><td>00F3</td> <td>LATIN SMALL LETTER O WITH ACUTE</td></tr> <tr><td>00F4</td> <td>LATIN SMALL LETTER O WITH CIRCUMFLEX</td></tr> <tr><td>00F5</td> <td>LATIN SMALL LETTER O WITH TILDE</td></tr> <tr><td>00F6</td> <td>LATIN SMALL LETTER O WITH DIAERESIS</td></tr> <tr><td>00F8</td> <td>LATIN SMALL LETTER O WITH STROKE</td></tr> <tr><td>00F9</td> <td>LATIN SMALL LETTER U WITH GRAVE</td></tr> <tr><td>00FA</td> <td>LATIN SMALL LETTER U WITH ACUTE</td></tr> <tr><td>00FB</td> <td>LATIN SMALL LETTER U WITH CIRCUMFLEX</td></tr> <tr><td>00FC</td> <td>LATIN SMALL LETTER U WITH DIAERESIS</td></tr> <tr><td>00FD</td> <td>LATIN SMALL LETTER Y WITH ACUTE</td></tr> <tr><td>00FE</td> <td>LATIN SMALL LETTER THORN</td></tr> <tr><td>00FF</td> <td>LATIN SMALL LETTER Y WITH DIAERESIS</td></tr> </table> </div> <p class=continue> Note that three of these have no corresponding Latin-1 upper-case character: <div class=inset> <table cellpadding=0 cellspacing=0> <tr><td>00B5</td> <td>MICRO SIGN</td></tr> <tr><td>00DF</td> <td>LATIN SMALL LETTER SHARP S</td></tr> <tr><td>00FF</td> <td>LATIN SMALL LETTER Y WITH DIAERESIS</td></tr> </table> </div> <p class=continue> (The compatibility micro character uppercases to the non-Latin-1 Greek capital mu; the German sharp s character uppercases to the pair of characters "SS," and the capital y-with-diaeresis is non-Latin-1.) <p> (Note that the Java spec for lowercase characters given at <div class=inset> <a href="http://java.sun.com/docs/books/jls/html/javalang.doc4.html#14345"> http://java.sun.com/docs/books/jls/html/javalang.doc4.html#14345</a> </div> <p class=continue> is inconsistent. U+00B5 MICRO SIGN fulfills the requirements for a lower-case character (as of Unicode 3.0), but is not given in the numeric list of lower-case character codes.) 
<p> (Note that the Java spec for <code>isLowerCase()</code> given at <div class=inset> <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html#isLowerCase(char)"> http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html#isLowerCase(char)</a> </div> <p class=continue> gives three mutually inconsistent definitions of "lower case." The first is the definition used in this SRFI. The text that follows it says "A character is considered to be lowercase if and only if it is specified to be lowercase by the Unicode 2.0 standard (category Ll in the Unicode specification data file)." The former spec excludes U+00AA FEMININE ORDINAL INDICATOR and U+00BA MASCULINE ORDINAL INDICATOR; the latter spec includes them. Finally, the spec enumerates a list of characters in the Latin-1 subset; this list excludes U+00B5 MICRO SIGN, which is included in both of the previous specs.) <!--========================================================================--> <h2><a name="upper-case-def">char-set:upper-case</a></h2> <p> For Unicode, we follow Java's specification: a character is uppercase if <ul> <li> it is not in the range [U+2000,U+2FFF], and <li> the Unicode attribute table does not give an uppercase mapping for it (this excludes titlecase characters), and <li> at least one of the following is true: <ul> <li> the Unicode attribute table gives a mapping to lowercase for the character, or <li> the name for the character in the Unicode attribute table contains the words "CAPITAL LETTER" or "CAPITAL LIGATURE". 
</ul> </ul> <p> The upper-case ASCII characters are <div class=inset> ABCDEFGHIJKLMNOPQRSTUVWXYZ </div> <p class=continue> Latin-1 adds another 30 upper-case characters to the ASCII set: <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>00C0</td> <td>LATIN CAPITAL LETTER A WITH GRAVE</td></tr> <tr><td>00C1</td> <td>LATIN CAPITAL LETTER A WITH ACUTE</td></tr> <tr><td>00C2</td> <td>LATIN CAPITAL LETTER A WITH CIRCUMFLEX</td></tr> <tr><td>00C3</td> <td>LATIN CAPITAL LETTER A WITH TILDE</td></tr> <tr><td>00C4</td> <td>LATIN CAPITAL LETTER A WITH DIAERESIS</td></tr> <tr><td>00C5</td> <td>LATIN CAPITAL LETTER A WITH RING ABOVE</td></tr> <tr><td>00C6</td> <td>LATIN CAPITAL LETTER AE</td></tr> <tr><td>00C7</td> <td>LATIN CAPITAL LETTER C WITH CEDILLA</td></tr> <tr><td>00C8</td> <td>LATIN CAPITAL LETTER E WITH GRAVE</td></tr> <tr><td>00C9</td> <td>LATIN CAPITAL LETTER E WITH ACUTE</td></tr> <tr><td>00CA</td> <td>LATIN CAPITAL LETTER E WITH CIRCUMFLEX</td></tr> <tr><td>00CB</td> <td>LATIN CAPITAL LETTER E WITH DIAERESIS</td></tr> <tr><td>00CC</td> <td>LATIN CAPITAL LETTER I WITH GRAVE</td></tr> <tr><td>00CD</td> <td>LATIN CAPITAL LETTER I WITH ACUTE</td></tr> <tr><td>00CE</td> <td>LATIN CAPITAL LETTER I WITH CIRCUMFLEX</td></tr> <tr><td>00CF</td> <td>LATIN CAPITAL LETTER I WITH DIAERESIS</td></tr> <tr><td>00D0</td> <td>LATIN CAPITAL LETTER ETH</td></tr> <tr><td>00D1</td> <td>LATIN CAPITAL LETTER N WITH TILDE</td></tr> <tr><td>00D2</td> <td>LATIN CAPITAL LETTER O WITH GRAVE</td></tr> <tr><td>00D3</td> <td>LATIN CAPITAL LETTER O WITH ACUTE</td></tr> <tr><td>00D4</td> <td>LATIN CAPITAL LETTER O WITH CIRCUMFLEX</td></tr> <tr><td>00D5</td> <td>LATIN CAPITAL LETTER O WITH TILDE</td></tr> <tr><td>00D6</td> <td>LATIN CAPITAL LETTER O WITH DIAERESIS</td></tr> <tr><td>00D8</td> <td>LATIN CAPITAL LETTER O WITH STROKE</td></tr> <tr><td>00D9</td> <td>LATIN CAPITAL LETTER U WITH GRAVE</td></tr> <tr><td>00DA</td> <td>LATIN CAPITAL LETTER U WITH ACUTE</td></tr> <tr><td>00DB</td> 
<td>LATIN CAPITAL LETTER U WITH CIRCUMFLEX</td></tr> <tr><td>00DC</td> <td>LATIN CAPITAL LETTER U WITH DIAERESIS</td></tr> <tr><td>00DD</td> <td>LATIN CAPITAL LETTER Y WITH ACUTE</td></tr> <tr><td>00DE</td> <td>LATIN CAPITAL LETTER THORN</td></tr> </table> </div> <!--========================================================================--> <h2><a name="title-case-def">char-set:title-case</a></h2> <p> In Unicode, a character is titlecase if it has the category Lt in the character attribute database. There are very few of these characters; here is the entire 31-character list as of Unicode 3.0: <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>01C5 </td><td nowrap> LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON </td></tr> <tr><td>01C8 </td><td nowrap> LATIN CAPITAL LETTER L WITH SMALL LETTER J </td></tr> <tr><td>01CB </td><td nowrap> LATIN CAPITAL LETTER N WITH SMALL LETTER J </td></tr> <tr><td>01F2 </td><td nowrap> LATIN CAPITAL LETTER D WITH SMALL LETTER Z </td></tr> <tr><td>1F88 </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI </td></tr> <tr><td>1F89 </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F8A </td><td nowrap>GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F8B </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F8C </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F8D </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F8E </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI </td></tr> <tr><td>1F8F </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI </td></tr> <tr><td>1F98 </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI </td></tr> <tr><td>1F99 </td><td nowrap> GREEK CAPITAL 
LETTER ETA WITH DASIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F9A </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F9B </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F9C </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F9D </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI </td></tr> <tr><td>1F9E </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI </td></tr> <tr><td>1F9F </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI </td></tr> <tr><td>1FA8 </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI </td></tr> <tr><td>1FA9 </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI </td></tr> <tr><td>1FAA </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI </td></tr> <tr><td>1FAB </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI </td></tr> <tr><td>1FAC </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI </td></tr> <tr><td>1FAD </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI </td></tr> <tr><td>1FAE </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI </td></tr> <tr><td>1FAF </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI </td></tr> <tr><td>1FBC </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI </td></tr> <tr><td>1FCC </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI </td></tr> <tr><td>1FFC </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI </td></tr> </table> </div> <p> There are no ASCII or Latin-1 titlecase characters. 
<!--========================================================================--> <h2><a name="letter-def">char-set:letter</a></h2> <p> In Unicode, a letter is any character with one of the letter categories (Lu, Ll, Lt, Lm, Lo) in the Unicode character database. <p> There are 52 ASCII letters <div class=indent> abcdefghijklmnopqrstuvwxyz <br> ABCDEFGHIJKLMNOPQRSTUVWXYZ <br> </div> <p> There are 117 Latin-1 letters. These are the 115 characters that are members of the Latin-1 <code>char-set:lower-case</code> and <code>char-set:upper-case</code> sets, plus <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>00AA</td> <td>FEMININE ORDINAL INDICATOR</td></tr> <tr><td>00BA</td> <td>MASCULINE ORDINAL INDICATOR</td></tr> </table> </div> <p class=continue> (These two letters are considered lower-case by Unicode, but not by Java or SRFI 14.) <!--========================================================================--> <h2><a name="digit-def">char-set:digit</a></h2> <p> In Unicode, a character is a digit if it has the category Nd in the character attribute database. In Latin-1 and ASCII, the only such characters are 0123456789. In Unicode, there are other digit characters in other code blocks, such as Gujarati digits and Tibetan digits. <!--========================================================================--> <h2><a name="hex-digit-def">char-set:hex-digit</a></h2> <p> The only hex digits are 0123456789abcdefABCDEF. <!--========================================================================--> <h2><a name="letter+digit-def">char-set:letter+digit</a></h2> <p> The union of <code>char-set:letter</code> and <code>char-set:digit.</code> <!--========================================================================--> <h2><a name="graphic-def">char-set:graphic</a></h2> <p> A graphic character is one that would put ink on paper. 
The ASCII and Latin-1 graphic characters are the members of <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td><code>char-set:letter</code></td></tr> <tr><td><code>char-set:digit</code></td></tr> <tr><td><code>char-set:punctuation</code></td></tr> <tr><td><code>char-set:symbol</code></td></tr> </table> </div> <!--========================================================================--> <h2><a name="printing-def">char-set:printing</a></h2> <p> A printing character is one that would occupy space when printed, <em>i.e.</em>, a graphic character or a space character. <code>char-set:printing</code> is the union of <code>char-set:whitespace</code> and <code>char-set:graphic.</code> <!--========================================================================--> <h2><a name="whitespace-def">char-set:whitespace</a></h2> <p> In Unicode, a whitespace character is either <ul> <li> a character with one of the space, line, or paragraph separator categories (Zs, Zl or Zp) of the Unicode character database. 
<li> U+0009 Horizontal tabulation (\t control-I) <li> U+000A Line feed (\n control-J) <li> U+000B Vertical tabulation (\v control-K) <li> U+000C Form feed (\f control-L) <li> U+000D Carriage return (\r control-M) </ul> <p> There are 24 whitespace characters in Unicode 3.0: <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>0009</td> <td>HORIZONTAL TABULATION </td> <td> \t control-I</td></tr> <tr><td>000A</td> <td>LINE FEED </td> <td> \n control-J</td></tr> <tr><td>000B</td> <td>VERTICAL TABULATION </td> <td> \v control-K</td></tr> <tr><td>000C</td> <td>FORM FEED </td> <td> \f control-L</td></tr> <tr><td>000D</td> <td>CARRIAGE RETURN </td> <td> \r control-M</td></tr> <tr><td>0020</td> <td>SPACE </td> <td> Zs</td></tr> <tr><td>00A0</td> <td>NO-BREAK SPACE </td> <td> Zs</td></tr> <tr><td>1680</td> <td>OGHAM SPACE MARK </td> <td> Zs</td></tr> <tr><td>2000</td> <td>EN QUAD </td> <td> Zs</td></tr> <tr><td>2001</td> <td>EM QUAD </td> <td> Zs</td></tr> <tr><td>2002</td> <td>EN SPACE </td> <td> Zs</td></tr> <tr><td>2003</td> <td>EM SPACE </td> <td> Zs</td></tr> <tr><td>2004</td> <td>THREE-PER-EM SPACE </td> <td> Zs</td></tr> <tr><td>2005</td> <td>FOUR-PER-EM SPACE </td> <td> Zs</td></tr> <tr><td>2006</td> <td>SIX-PER-EM SPACE </td> <td> Zs</td></tr> <tr><td>2007</td> <td>FIGURE SPACE </td> <td> Zs</td></tr> <tr><td>2008</td> <td>PUNCTUATION SPACE </td> <td> Zs</td></tr> <tr><td>2009</td> <td>THIN SPACE </td> <td> Zs</td></tr> <tr><td>200A</td> <td>HAIR SPACE </td> <td> Zs</td></tr> <tr><td>200B</td> <td>ZERO WIDTH SPACE </td> <td> Zs</td></tr> <tr><td>2028</td> <td>LINE SEPARATOR </td> <td> Zl</td></tr> <tr><td>2029</td> <td>PARAGRAPH SEPARATOR </td> <td> Zp</td></tr> <tr><td>202F</td> <td>NARROW NO-BREAK SPACE </td> <td> Zs</td></tr> <tr><td>3000</td> <td>IDEOGRAPHIC SPACE </td> <td> Zs</td></tr> </table> </div> <p> The ASCII whitespace characters are the first six characters in the above list -- line feed, horizontal tabulation, vertical tabulation, form feed, 
carriage return, and space. These are also exactly the characters recognised by the POSIX <code>isspace()</code> procedure. Latin-1 adds the no-break space. <p> Note: Java's <code>isWhitespace()</code> method is incompatible, including <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>0009</td> <td>HORIZONTAL TABULATION </td> <td> (\t control-I)</td></tr> <tr><td>001C</td> <td>FILE SEPARATOR </td> <td> (control-\)</td></tr> <tr><td>001D</td> <td>GROUP SEPARATOR </td> <td>(control-])</td></tr> <tr><td>001E</td> <td>RECORD SEPARATOR </td> <td>(control-^)</td></tr> <tr><td>001F</td> <td>UNIT SEPARATOR </td> <td>(control-_)</td></tr> </table> </div> <p class=continue> and excluding <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>00A0</td> <td>NO-BREAK SPACE</td></tr> </table> </div> <p> Java's excluding the no-break space means that tokenizers can simply break character streams at "whitespace" boundaries. However, the exclusion introduces exceptions in other places, <em>e.g.</em> <code>char-set:printing</code> is no longer simply the union of <code>char-set:graphic</code> and <code>char-set:whitespace.</code> <!--========================================================================--> <h2><a name="iso-control-def">char-set:iso-control</a></h2> <p> The ISO control characters are the Unicode/Latin-1 characters in the ranges [U+0000,U+001F] and [U+007F,U+009F]. <p> ASCII restricts this set to the characters in the range [U+0000,U+001F] plus the character U+007F. <p> Note that Unicode defines other control characters which do not belong to this set (hence the qualifying prefix "iso-" in the name). This restriction is compatible with the Java <code>isISOControl()</code> method.
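<p>
As a rough illustration of the two ranges above, the set could be rebuilt with this library's own <code>ucs-range->char-set</code> and <code>char-set-union</code> procedures. This is only a sketch, not the reference implementation's definition: the name <code>my-iso-control</code> is hypothetical, and the sketch assumes an implementation whose character type covers at least Latin-1.
<pre class=code-example>
;; ucs-range->char-set takes a half-open range [lower,upper),
;; so the upper bounds are one past the last code in each range.
(define my-iso-control                      ; hypothetical name
  (char-set-union
   (ucs-range->char-set #x00 #x20)          ; U+0000 .. U+001F
   (ucs-range->char-set #x7F #xA0)))        ; U+007F .. U+009F

(char-set-contains? my-iso-control (integer->char #x7F))  ; => #t (delete)
(char-set-contains? my-iso-control #\space)               ; => #f (U+0020)

;; Should be true if the implementation follows the definition above.
(char-set= my-iso-control char-set:iso-control)
</pre>
<p class=continue>
On an ASCII-only implementation, codes in the second range that have no character representation are ignored by default, which matches the ASCII restriction described above.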
<!--========================================================================--> <h2><a name="punctuation-def">char-set:punctuation</a></h2> <p> In Unicode, a punctuation character is any character that has one of the punctuation categories in the Unicode character database (Pc, Pd, Ps, Pe, Pi, Pf, or Po). <p> ASCII has 23 punctuation characters: <pre class=code-example> !"#%&'()*,-./:;?@[\]_{} </pre> <p> Latin-1 adds six more: <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>00A1 </td> <td> INVERTED EXCLAMATION MARK </td></tr> <tr><td>00AB </td> <td> LEFT-POINTING DOUBLE ANGLE QUOTATION MARK </td></tr> <tr><td>00AD </td> <td> SOFT HYPHEN </td></tr> <tr><td>00B7 </td> <td> MIDDLE DOT </td></tr> <tr><td>00BB </td> <td> RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK </td></tr> <tr><td>00BF </td> <td> INVERTED QUESTION MARK </td></tr> </table> </div> <p> Note that the nine ASCII characters <code>$+<=>^`|~</code> are <em>not</em> punctuation. They are "symbols." <!--========================================================================--> <h2><a name="symbol-def">char-set:symbol</a></h2> <p> In Unicode, a symbol is any character that has one of the symbol categories in the Unicode character database (Sm, Sc, Sk, or So).
There are nine ASCII symbol characters: <pre class=code-example> $+<=>^`|~ </pre> <p> Latin-1 adds 18 more: <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>00A2 </td> <td> CENT SIGN </td></tr> <tr><td>00A3 </td> <td> POUND SIGN </td></tr> <tr><td>00A4 </td> <td> CURRENCY SIGN </td></tr> <tr><td>00A5 </td> <td> YEN SIGN </td></tr> <tr><td>00A6 </td> <td> BROKEN BAR </td></tr> <tr><td>00A7 </td> <td> SECTION SIGN </td></tr> <tr><td>00A8 </td> <td> DIAERESIS </td></tr> <tr><td>00A9 </td> <td> COPYRIGHT SIGN </td></tr> <tr><td>00AC </td> <td> NOT SIGN </td></tr> <tr><td>00AE </td> <td> REGISTERED SIGN </td></tr> <tr><td>00AF </td> <td> MACRON </td></tr> <tr><td>00B0 </td> <td> DEGREE SIGN </td></tr> <tr><td>00B1 </td> <td> PLUS-MINUS SIGN </td></tr> <tr><td>00B4 </td> <td> ACUTE ACCENT </td></tr> <tr><td>00B6 </td> <td> PILCROW SIGN </td></tr> <tr><td>00B8 </td> <td> CEDILLA </td></tr> <tr><td>00D7 </td> <td> MULTIPLICATION SIGN </td></tr> <tr><td>00F7 </td> <td> DIVISION SIGN </td></tr> </table> </div> <!--========================================================================--> <h2><a name="blank-def">char-set:blank</a></h2> <p> Blank chars are horizontal whitespace. In Unicode, a blank character is either <ul> <li> a character with the space separator category (Zs) in the Unicode character database. 
<li> U+0009 Horizontal tabulation (\t control-I) </ul> <p> There are eighteen blank characters in Unicode 3.0: <div class=inset> <table cellspacing=0 cellpadding=0> <tr><td>0009 </td> <td> HORIZONTAL TABULATION </td> <td> \t control-I </td></tr> <tr><td>0020 </td> <td> SPACE </td> <td> Zs </td></tr> <tr><td>00A0 </td> <td> NO-BREAK SPACE </td> <td> Zs </td></tr> <tr><td>1680 </td> <td> OGHAM SPACE MARK </td> <td> Zs </td></tr> <tr><td>2000 </td> <td> EN QUAD </td> <td> Zs </td></tr> <tr><td>2001 </td> <td> EM QUAD </td> <td> Zs </td></tr> <tr><td>2002 </td> <td> EN SPACE </td> <td> Zs </td></tr> <tr><td>2003 </td> <td> EM SPACE </td> <td> Zs </td></tr> <tr><td>2004 </td> <td> THREE-PER-EM SPACE </td> <td> Zs </td></tr> <tr><td>2005 </td> <td> FOUR-PER-EM SPACE </td> <td> Zs </td></tr> <tr><td>2006 </td> <td> SIX-PER-EM SPACE </td> <td> Zs </td></tr> <tr><td>2007 </td> <td> FIGURE SPACE </td> <td> Zs </td></tr> <tr><td>2008 </td> <td> PUNCTUATION SPACE </td> <td> Zs </td></tr> <tr><td>2009 </td> <td> THIN SPACE </td> <td> Zs </td></tr> <tr><td>200A </td> <td> HAIR SPACE </td> <td> Zs </td></tr> <tr><td>200B </td> <td> ZERO WIDTH SPACE </td> <td> Zs </td></tr> <tr><td>202F </td> <td> NARROW NO-BREAK SPACE </td> <td> Zs </td></tr> <tr><td>3000 </td> <td> IDEOGRAPHIC SPACE </td> <td> Zs </td></tr> </table> </div> <p> The ASCII blank characters are the first two characters above -- horizontal tab and space. Latin-1 adds the no-break space. <p> Java doesn't have the concept of "blank" characters, so there are no compatibility issues. <!--========================================================================--> <h1><a name="ReferenceImp">Reference implementation</a></h1> <p> This SRFI comes with a reference implementation. It resides at: <div class=inset> <a href="http://srfi.schemers.org/srfi-14/srfi-14.scm"> http://srfi.schemers.org/srfi-14/srfi-14.scm</a> </div> <p class=continue> I have placed this source on the Net with an unencumbered, "open" copyright. 
Some of the code in the reference implementation bears a distant family relation to the MIT Scheme implementation and, being derived from that code, is covered by the MIT Scheme copyright (which is a generic BSD-style open-source copyright -- see the source file for details). The remainder of the code was written by myself for scsh or for this SRFI; I have placed this code under the scsh copyright, which is also a generic BSD-style open-source copyright. <p> The code is written for portability and should be simple to port to any Scheme. It has only the following deviations from R4RS, clearly discussed in the comments: <ul> <li> an <code>error</code> procedure; <li> the R5RS <code>values</code> procedure for producing multiple return values; <li> a simple <code>check-arg</code> procedure for argument checking; <li> <code>let-optionals*</code> and <code>:optional</code> macros for parsing, checking and defaulting optional arguments from rest lists; <li> the SRFI-9 <code>define-record-type</code> form; <li> <code>bitwise-and</code> for the hash function; <li> <code>%latin1->char</code> and <code>%char->latin1</code>. </ul> <p> The library is written for clarity and is well commented; the current source is about 375 lines of source code and 375 lines of comments and white space. It is also written for efficiency. Fast paths are provided for common cases. <p> This is not to say that the implementation can't be tuned up for a specific Scheme implementation. There are notes in comments addressing ways implementors can tune the reference implementation for performance. <p> In short, I've written the reference implementation to make it as painless as possible for an implementor -- or a regular programmer -- to adopt this library and get good results with it. <p> The code uses a rather simple-minded, inefficient representation for ASCII/Latin-1 char-sets -- a 256-character string.
The character whose code is <var>i</var> is in the set if <var>s[i]</var> = ASCII 1 (soh, or ^a); not in the set if <var>s[i]</var> = ASCII 0 (nul). A much faster and denser representation would be 16 or 32 bytes' worth of bit string. A portable implementation using bit sets awaits standards for bitwise logical-ops and byte vectors. <p> "Large" character types, such as Unicode, should use a sparse representation, taking care that the Latin-1 subset continues to be represented with a dense 32-byte bit set. <!--========================================================================--> <h1><a name="Acknowledgements">Acknowledgements</a></h1> <p> The design of this library benefited greatly from the feedback provided during the SRFI discussion phase. Among those contributing thoughtful commentary and suggestions, both on the mailing list and by private discussion, were Paolo Amoroso, Lars Arvestad, Alan Bawden, Jim Bender, Dan Bornstein, Per Bothner, Will Clinger, Brian Denheyer, Kent Dybvig, Sergei Egorov, Marc Feeley, Matthias Felleisen, Will Fitzgerald, Matthew Flatt, Arthur A. Gleckler, Ben Goetter, Sven Hartrumpf, Erik Hilsdale, Shiro Kawai, Richard Kelsey, Oleg Kiselyov, Bengt Kleberg, Donovan Kolbly, Bruce Korb, Shriram Krishnamurthi, Bruce Lewis, Tom Lord, Brad Lucier, Dave Mason, David Rush, Klaus Schilling, Jonathan Sobel, Mike Sperber, Mikael Staldal, Vladimir Tsyshevsky, Donald Welsh, and Mike Wilson. I am grateful to them for their assistance. <p> I am also grateful to the authors, implementors and documentors of all the systems mentioned in the introduction. Aubrey Jaffer should be noted for his work in producing Web-accessible versions of the R5RS spec, which was a tremendous aid. <p> This is not to imply that these individuals necessarily endorse the final results, of course. <p> During this document's long development period, great patience was exhibited by Mike Sperber, who is the editor for the SRFI, and by Hillary Sullivan, who is not.
<!--========================================================================--> <h1><a name="Links">References &amp; links</a></h1> <dl> <dt class=biblio><strong><a name="Java">[Java]</a></strong> <dd> The following URLs provide documentation on relevant Java classes. <br> <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html">http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html</a> <br> <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html">http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html</a> <br> <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/StringBuffer.html">http://java.sun.com/products/jdk/1.2/docs/api/java/lang/StringBuffer.html</a> <br> <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/text/Collator.html">http://java.sun.com/products/jdk/1.2/docs/api/java/text/Collator.html</a> <br> <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/text/package-summary.html">http://java.sun.com/products/jdk/1.2/docs/api/java/text/package-summary.html</a> <dt class=biblio><strong><a name="MIT-Scheme">[MIT-Scheme]</a></strong> <dd> <a href="http://www.swiss.ai.mit.edu/projects/scheme/">http://www.swiss.ai.mit.edu/projects/scheme/</a> <dt class=biblio><strong><a name="R5RS">[R5RS]</a></strong></dt> <dd>Revised<sup>5</sup> report on the algorithmic language Scheme.<br> R. Kelsey, W. Clinger, J. Rees (editors). <br> Higher-Order and Symbolic Computation, Vol. 11, No. 1, September, 1998. <br> and ACM SIGPLAN Notices, Vol. 33, No. 9, October, 1998. <br> Available at <a href="http://www.schemers.org/Documents/Standards/"> http://www.schemers.org/Documents/Standards/</a>. <dt class=biblio><strong>[SRFI]</strong></dt> <dd> The SRFI web site. <br> <a href="http://srfi.schemers.org/">http://srfi.schemers.org/</a> <dt class=biblio><strong>[SRFI-14]</strong></dt> <dd> SRFI-14: Character-set library.
<br> <a href="http://srfi.schemers.org/srfi-14/">http://srfi.schemers.org/srfi-14/</a> <dl> <dt> This document, in HTML: <dd><a href="http://srfi.schemers.org/srfi-14/srfi-14.html"> http://srfi.schemers.org/srfi-14/srfi-14.html</a> <dt> This document, in plain text format: <dd><a href="http://srfi.schemers.org/srfi-14/srfi-14.txt"> http://srfi.schemers.org/srfi-14/srfi-14.txt</a> <dt> Source code for the reference implementation: <dd> <a href="http://srfi.schemers.org/srfi-14/srfi-14.scm"> http://srfi.schemers.org/srfi-14/srfi-14.scm</a> <dt> Scheme 48 module specification, with typings: <dd> <a href="http://srfi.schemers.org/srfi-14/srfi-14-s48-module.scm"> http://srfi.schemers.org/srfi-14/srfi-14-s48-module.scm</a> <dt> Regression-test suite: <dd> <a href="http://srfi.schemers.org/srfi-14/srfi-14-tests.scm"> http://srfi.schemers.org/srfi-14/srfi-14-tests.scm</a> </dl> </dd> <dt class=biblio><strong><a name="Unicode">[Unicode]</a></strong> <dd> <a href="http://www.unicode.org/">http://www.unicode.org/</a> <dt class=biblio><strong><a name="UnicodeData">[UnicodeData]</a></strong> <dd> The Unicode character database. <br> <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt">ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt</a> <br> <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html">ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html</a> </dl> <!--========================================================================--> <h1><a name="Copyright">Copyright</a></h1> <p> Certain portions of this document -- the specific, marked segments of text describing the <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr> procedures -- were adapted with permission from the R5RS report. <p> All other text is copyright (C) Olin Shivers (1998, 1999, 2000). All Rights Reserved. 
<p> This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Scheme Request For Implementation process or editors, except as needed for the purpose of developing SRFIs in which case the procedures for copyrights defined in the SRFI process must be followed, or as required to translate it into languages other than English. <p> The limited permissions granted above are perpetual and will not be revoked by the authors or their successors or assigns. <p> This document and the information contained herein is provided on an "<strong>as is</strong>" basis and <strong>the authors and the SRFI editors disclaim all warranties, express or implied, including but not limited to any warranty that the use of the information herein will not infringe any rights or any implied warranties of merchantability or fitness for a particular purpose.</strong> </body> </html>