2017 lines
		
	
	
		
			88 KiB
		
	
	
	
		
			HTML
		
	
	
	
			
		
		
	
	
			2017 lines
		
	
	
		
			88 KiB
		
	
	
	
		
			HTML
		
	
	
	
| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 | |
|             "http://www.w3.org/TR/html4/loose.dtd">
 | |
| <!-- 
 | |
|  - Do a paragraph check <p>
 | |
|  - The Unicode char tables are messed up, but it can't be fixed w/o CSS2
 | |
|    support, which I do not currently find in web browsers.
 | |
|  - Can I have bangs, plusses, or slashes in #tags? Spaces?
 | |
|         Yes: plus, bang, star   No: space  Yes: slash, question, ampersand
 | |
|         You can't put sharp in a path, so anything goes, really.
 | |
|         Nonetheless, some of these confuse Netscape, so I'll avoid them.
 | |
|  -->
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <html lang=en-US>
 | |
|   <head>
 | |
|     <meta name="keywords" content="Scheme, programming language, list processing, SRFI, underage lesbian sluts">
 | |
|     <link rev=made href="mailto:shivers@ai.mit.edu">
 | |
|     <title>SRFI 14: Character-set Library</title>
 | |
| 
 | |
|     <!-- Should have a media=all to get, for example, printing to work.
 | |
|       == But my Netscape will completely ignore the tag if I do that.
 | |
|       -->
 | |
|     <style type="text/css">
 | |
|            /* A little general layout hackery for headers & the title. */
 | |
|            body { margin-left: +7%;
 | |
|                   font-family: "Helvetica", sans-serif;
 | |
|                   }
 | |
|            /* Netscape workaround: */
 | |
|            td, th { font-family: "Helvetica", sans-serif; }
 | |
| 
 | |
|            code, pre { font-family: "courier new", "courier"; }
 | |
| 
 | |
|            div.inset { margin-left: +5%; }
 | |
| 
 | |
|            h1 { margin-left: -5%; }
 | |
|            h1, h2 { clear: both; }
 | |
|            h1, h2, h3, h4, h5, h6 { color: blue }
 | |
|            div.title-text { font-size: large; font-weight: bold; }
 | |
| 	   h3 { margin-top: 2em; margin-bottom: 0em }
 | |
| 
 | |
|            div.indent { margin-left: 2em; }       /* General indentation */
 | |
|            pre.code-example { margin-left: 2em; } /* Indent code examples. */
 | |
| 
 | |
| 	   /* "Continue" class marks text that isn't really the start
 | |
| 	   ** of a new paragraph -- e.g., continuing a para after a 
 | |
| 	   ** code sample.
 | |
| 	   */
 | |
| 	   p.continue { text-indent: 0em; margin-top: 0em}
 | |
| 
 | |
|            /* This stuff is for definition lists of defined procedures.
 | |
|            ** A proc-def1 is used when you want a stack of procs to go
 | |
|            ** with one dd body. In this case, make the first
 | |
|            ** proc a proc-def1, following ones proc-defi's, and the last one
 | |
|            ** a proc-defn.
 | |
|            **
 | |
|            ** Unfortunately, Netscape has huge bugs with respect to style
 | |
|            ** sheets and dl list rendering. We have to set truly random
 | |
|            ** values here to get the rendering to come out. The proper values
 | |
|            ** are in the following style sheet, for Internet Explorer.
 | |
|            ** In the following settings, the *comments* say what the 
 | |
|            ** setting *really* causes Netscape to do.
 | |
|            **
 | |
|            ** Ugh. Professional coders sacrifice their self-respect,
 | |
|            ** that others may live.
 | |
|            */
 | |
|            /* m-t ignored; m-b sets top margin space. */
 | |
|            dt.proc-def1 { margin-top: 0ex; margin-bottom: 3ex; }
 | |
|            dt.proc-defi { margin-top: 0ex; margin-bottom: 0ex; }
 | |
|            dt.proc-defn { margin-top: 0ex; margin-bottom: 0ex; }
 | |
| 
 | |
|            /* m-t works weird depending on whether or not the last line
 | |
|            ** of the previous entry was a pre. Set to zero.
 | |
|            */
 | |
|            dt.proc-def  { margin-top: 0ex; margin-bottom: 3ex; }
 | |
| 
 | |
|            /* m-b sets space between dd & dt; m-t ignored. */
 | |
|            dd.proc-def { margin-bottom: 0.5ex; margin-top: 0ex; } 
 | |
| 
 | |
| 
 | |
|            /* Boldface the name of a procedure when it's being defined. */
 | |
|            code.proc-def { font-weight: bold; font-size: 110%}
 | |
| 
 | |
|            /* For the index of procedures. 
 | |
|            ** Same hackery as for dt.proc-def, above.
 | |
|            */
 | |
|            /* m-b sets space between dd & dt; m-t ignored. */
 | |
|            dd.proc-index  { margin-bottom: 0ex; margin-top: 0ex; } 
 | |
|            /* What the fuck? */
 | |
|            pre.proc-index { margin-top: -2ex; }
 | |
| 
 | |
|            /* Pull the table of contents back flush with the margin.
 | |
|            ** Both NS & IE screw this up in different ways.
 | |
|            */
 | |
|            #toc-table { margin-top: -2ex; margin-left: -5%; }
 | |
| 
 | |
|            /* R5RS proc names are in italic; extended R5RS names 
 | |
|            ** in italic boldface.
 | |
|            */
 | |
|            span.r5rs-proc { font-weight: bold; }
 | |
|            span.r5rs-procx { font-style: italic; font-weight: bold; }
 | |
| 
 | |
|            /* Spread out bibliographic lists. */
 | |
|            /* More Netscape-specific lossage; see the following stylesheet
 | |
|            ** for the proper values (used by IE).
 | |
|            */
 | |
|            dt.biblio { margin-bottom: 3ex; }
 | |
| 
 | |
|            /* Links to draft copies (e.g., not at the official SRFI site)
 | |
|            ** are colored in red, so people will use them during the 
 | |
|            ** development process and kill them when the document's done.
 | |
|            */
 | |
|            a.draft { color: red; }
 | |
| 
 | |
|     </style>
 | |
| 
 | |
|     <style type="text/css" media=all>
 | |
|            /* Nastiness: Here, I'm using a bug to work around a bug.
 | |
|            ** Netscape rendering bugs mean you need bogus <dt> and <dd>
 | |
|            ** margin settings -- settings which screw up IE's proper rendering.
 | |
|            ** Fortunately, Netscape has *another* bug: it will ignore this
 | |
|            ** media=all style sheet. So I am placing the (proper) IE values
 | |
|            ** here. Perhaps, one day, when these rendering bugs are fixed,
 | |
|            ** this gross hackery can be removed.
 | |
|            */
 | |
|            dt.proc-def1 { margin-top: 3ex; margin-bottom: 0ex; }
 | |
|            dt.proc-defi { margin-top: 0ex; margin-bottom: 0ex; }
 | |
|            dt.proc-defn { margin-top: 0ex; margin-bottom: 0.5ex; }
 | |
|            dt.proc-def  { margin-top: 3ex; margin-bottom: 0.5ex; }
 | |
| 
 | |
|            pre { margin-top: 1ex; }
 | |
| 
 | |
|            dd.proc-def { margin-bottom: 2ex; margin-top: 0.5ex; } 
 | |
| 
 | |
|            /* For the index of procedures. 
 | |
|            ** Same hackery as for dt.proc-def, above.
 | |
|            */
 | |
|            dd.proc-index { margin-top: 0ex; } 
 | |
|            pre.proc-index { margin-top: 0ex; }
 | |
| 
 | |
|            /* Spread out bibliographic lists. */
 | |
|            dt.biblio { margin-top: 3ex; margin-bottom: 0ex; }
 | |
|            dd.biblio { margin-bottom: 1ex; }
 | |
|     </style>
 | |
|   </head>
 | |
| 
 | |
| <body>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1>Title</h1>
 | |
| <div class=title-text>
 | |
| Character-set Library
 | |
| </div>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1>Author</H1>
 | |
|     <address>
 | |
|        <a href="http://www.ai.mit.edu/~shivers/">Olin Shivers</A> /
 | |
|        <a href="mailto:shivers@ai.mit.edu">shivers@ai.mit.edu</A>
 | |
|     </address>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1>Table of contents</H1>
 | |
| 
 | |
| <!-- A bug in netscape (?) keeps the first link in this UL from being active.
 | |
| ==== So the Abstract link be dead. 99/8/22 -Olin
 | |
| -->
 | |
| <ul id=toc-table>
 | |
| <li><a href="#Abstract">Abstract</a>
 | |
| <li><a href="#VariableIndex">Variable index</a>
 | |
| <li><a href="#Rationale">Rationale</a>
 | |
|   <ul>
 | |
|   <li><a href="#LinearUpdateOperations">"Linear-update" operations</a>
 | |
|   <li><a href="#ExtraSRFI">Extra-SRFI recommendations</a>
 | |
|   </ul>
 | |
| 
 | |
| <li><a href="#Specification">Specification</a>
 | |
|   <ul>
 | |
|   <li><a href="#GeneralProcs">General procedures</a>
 | |
|   <li><a href="#Iterating">Iterating over character sets</a>
 | |
|   <li><a href="#Creating">Creating character sets</a>
 | |
|   <li><a href="#Querying">Querying character sets</a>
 | |
|   <li><a href="#Algebra">Character set algebra</a>
 | |
|   <li><a href="#StandardCharsets">Standard character sets</a>
 | |
|   </ul>
 | |
| 
 | |
| <li><a href="#StandardCharsetDefs">Unicode, Latin-1 and ASCII definitions of the standard character sets</a>
 | |
| <li><a href="#ReferenceImp">Reference implementation</a>
 | |
| <li><a href="#Acknowledgements">Acknowledgements</a>
 | |
| <li><a href="#Links">References & Links</a>
 | |
| <li><a href="#Copyright">Copyright</a>
 | |
| </ul>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="Abstract">Abstract</a></H1>
 | |
| <p>
 | |
| 
 | |
| The ability to efficiently represent and manipulate sets of characters is an
 | |
| unglamorous but very useful capability for text-processing code -- one that
 | |
| tends to pop up in the definitions of other libraries.  Hence it is useful to
 | |
| specify a general substrate for this functionality early.  This SRFI defines a
 | |
| general library that provides this functionality. 
 | |
| 
 | |
| It is accompanied by a reference implementation for the spec. The reference
 | |
| implementation is fairly efficient, straightforwardly portable, and has a
 | |
| "free software" copyright. The implementation is tuned for "small" 7 or 8
 | |
| bit character types, such as ASCII or Latin-1; the data structures and
 | |
| algorithms would have to be altered for larger 16 or 32 bit character types
 | |
| such as Unicode -- however, the specs have been carefully designed with these
 | |
| larger character types in mind.
 | |
| 
 | |
| Several forthcoming SRFIs can be defined in terms of this one:
 | |
| <ul>
 | |
|     <li> string library
 | |
|     <li> delimited input procedures (<em>e.g.</em>, <code>read-line</code>)
 | |
|     <li> regular expressions
 | |
| </ul>
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="VariableIndex">Variable Index</a></h1>
 | |
| <p>
 | |
| Here is the complete set of bindings -- procedural and otherwise --
 | |
| exported by this library. In a Scheme system that has a module or package 
 | |
| system, these procedures should be contained in a module named "char-set-lib".
 | |
| 
 | |
| <div class=indent>
 | |
| <dl>
 | |
| <dt class=proc-index> Predicates & comparison
 | |
| <dd class=proc-index>
 | |
| <pre class=proc-index>
 | |
| <a href="#char-set-p">char-set?</a> <a href="#char-set=">char-set=</a> <a href="#char-set<=">char-set<=</a> <a href="#char-set-hash">char-set-hash</a>
 | |
| </pre>
 | |
| 
 | |
| <dt class=proc-index> Iterating over character sets
 | |
| <dd class=proc-index>
 | |
| <pre class=proc-index>
 | |
| <a href="#char-set-cursor">char-set-cursor</a> <a href="#char-set-ref">char-set-ref</a> <a href="#char-set-cursor-next">char-set-cursor-next</a> <a href="#end-of-char-set-p">end-of-char-set?</a> 
 | |
| <a href="#char-set-fold">char-set-fold</a> <a href="#char-set-unfold">char-set-unfold</a> <a href="#char-set-unfold!">char-set-unfold!</a>
 | |
| <a href="#char-set-for-each">char-set-for-each</a> <a href="#char-set-map">char-set-map</a>
 | |
| </pre>
 | |
| 
 | |
| <dt class=proc-index> Creating character sets
 | |
| <dd class=proc-index>
 | |
| <pre class=proc-index>
 | |
| <a href="#char-set-copy">char-set-copy</a> <a href="#char-set">char-set</a>
 | |
| 
 | |
| <a href="#list->char-set">list->char-set</a>  <a href="#string->char-set">string->char-set</a>
 | |
| <a href="#list->char-set!">list->char-set!</a> <a href="#string->char-set!">string->char-set!</a>
 | |
|     
 | |
| <a href="#char-set-filter">char-set-filter</a>  <a href="#ucs-range->char-set">ucs-range->char-set</a> <a href="#
 | |
| char-set-filter!">
 | |
| char-set-filter!</a> <a href="#ucs-range->char-set!">ucs-range->char-set!</a>
 | |
| 
 | |
| <a href="#->char-set">->char-set</a>
 | |
| </pre>
 | |
| 
 | |
| <dt class=proc-index> Querying character sets
 | |
| <dd class=proc-index>
 | |
| <pre class=proc-index>
 | |
| <a href="#char-set->list">char-set->list</a> <a href="#char-set->string">char-set->string</a>
 | |
| <a href="#char-set-size">char-set-size</a> <a href="#char-set-count">char-set-count</a> <a href="#char-set-contains-p">char-set-contains?</a>
 | |
| <a href="#char-set-every">char-set-every</a> <a href="#char-set-any">char-set-any</a>
 | |
| </pre>
 | |
| 
 | |
| <dt class=proc-index> Character-set algebra
 | |
| <dd class=proc-index>
 | |
| <pre class=proc-index>
 | |
| <a href="#char-set-adjoin">char-set-adjoin</a>  <a href="#char-set-delete">char-set-delete</a>
 | |
| <a href="#char-set-adjoin!">char-set-adjoin!</a> <a href="#char-set-delete!">char-set-delete!</a>
 | |
| 
 | |
| <a href="#char-set-complement">char-set-complement</a>  <a href="#char-set-union">char-set-union</a>  <a href="#char-set-intersection">char-set-intersection</a>
 | |
| <a href="#char-set-complement!">char-set-complement!</a> <a href="#char-set-union!">char-set-union!</a> <a href="#char-set-intersection!">char-set-intersection!</a>
 | |
| 
 | |
| <a href="#char-set-difference">char-set-difference</a>  <a href="#char-set-xor">char-set-xor</a>  <a href="#char-set-diff+intersection">char-set-diff+intersection</a>
 | |
| <a href="#char-set-difference!">char-set-difference!</a> <a href="#char-set-xor!">char-set-xor!</a> <a href="#char-set-diff+intersection!">char-set-diff+intersection!</a>
 | |
| </pre>
 | |
| 
 | |
| <dt class=proc-index> Standard character sets
 | |
| <dd class=proc-index>
 | |
| <pre class=proc-index>
 | |
| <a href="#char-set:lower-case">char-set:lower-case</a>  <a href="#char-set:upper-case">char-set:upper-case</a>  <a href="#char-set:title-case">char-set:title-case</a>
 | |
| <a href="#char-set:letter">char-set:letter</a>      <a href="#char-set:digit">char-set:digit</a>       <a href="#char-set:letter+digit">char-set:letter+digit</a>
 | |
| <a href="#char-set:graphic">char-set:graphic</a>     <a href="#char-set:printing">char-set:printing</a>    <a href="#char-set:whitespace">char-set:whitespace</a>
 | |
| <a href="#char-set:iso-control">char-set:iso-control</a> <a href="#char-set:punctuation">char-set:punctuation</a> <a href="#char-set:symbol">char-set:symbol</a>
 | |
| <a href="#char-set:hex-digit">char-set:hex-digit</a>   <a href="#char-set:blank">char-set:blank</a>       <a href="#char-set:ascii">char-set:ascii</a>
 | |
| <a href="#char-set:empty">char-set:empty</a>       <a href="#char-set:full">char-set:full</a>
 | |
| </pre>
 | |
| 
 | |
| </dl>
 | |
| </div>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="Rationale">Rationale</a></h1>
 | |
| 
 | |
| <p>
 | |
| The ability to efficiently manipulate sets of characters is quite
 | |
| useful for text-processing code. Encapsulating this functionality in
 | |
| a general, efficiently implemented library can assist all such code.
 | |
| This library defines a new data structure to represent these sets, called
 | |
| a "char-set." The char-set type is distinct from all other types.
 | |
| 
 | |
| <p>
 | |
| This library is designed to be portable across implementations that use
 | |
| different character types and representations, especially ASCII, Latin-1
 | |
| and Unicode. Some effort has been made to preserve compatibility with Java
 | |
| in the Unicode case (see the definition of <code>char-set:whitespace</code> for the
 | |
| single real deviation).
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="LinearUpdateOperations">Linear-update operations</a></h2>
 | |
| 
 | |
| <p>
 | |
| The procedures of this SRFI, by default, are "pure functional" -- they do not
 | |
| alter their parameters. However, this SRFI defines a set of "linear-update"
 | |
| procedures which have a hybrid pure-functional/side-effecting semantics: they
 | |
| are allowed, but not required, to side-effect one of their parameters in order
 | |
| to construct their result. An implementation may legally implement these
 | |
| procedures as pure, side-effect-free functions, or it may implement them using
 | |
| side effects, depending upon the details of what is the most efficient or
 | |
| simple to implement in terms of the underlying representation.
 | |
| 
 | |
| <p>
 | |
| The linear-update routines all have names ending with "!".
 | |
| 
 | |
| <p>
 | |
| Clients of these procedures <em>may not</em> rely upon these procedures working by
 | |
| side effect. For example, this is not guaranteed to work:
 | |
| <pre class=code-example>
 | |
| (let* ((cs1 (char-set #\a #\b #\c))      ; cs1 = {a,b,c}.
 | |
|        (cs2 (char-set-adjoin! cs1 #\d))) ; Add d to {a,b,c}.
 | |
|   cs1) ; Could be either {a,b,c} or {a,b,c,d}.
 | |
| </pre>
 | |
| <p class=continue>
 | |
| However, this is well-defined:
 | |
| <pre class=code-example>
 | |
| (let ((cs (char-set #\a #\b #\c)))
 | |
|   (char-set-adjoin! cs #\d)) ; Add d to {a,b,c}.
 | |
| </pre>
 | |
| 
 | |
| <p>
 | |
| So clients of these procedures write in a functional style, but must
 | |
| additionally be sure that, when the procedure is called, there are no other
 | |
| live pointers to the potentially-modified character set (hence the term
 | |
| "linear update").
 | |
| 
 | |
| <p>
 | |
| There are two benefits to this convention:
 | |
| <ul>
 | |
|   <li> Implementations are free to provide the most efficient possible
 | |
|     implementation, either functional or side-effecting.
 | |
|   <li> Programmers may nonetheless continue to assume that character sets
 | |
|     are purely functional data structures: they may be reliably shared
 | |
|     without needing to be copied, uniquified, and so forth.
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| Note that pure functional representations are the right thing for
 | |
| ASCII- or Latin-1-based Scheme implementations, since a char-set can
 | |
| be represented in an ASCII Scheme with 4 32-bit words. Pure set-algebra
 | |
| operations on such a representation are very fast and efficient. Programmers
 | |
| who code using linear-update operations are guaranteed the system will
 | |
| provide the best implementation across multiple platforms.
 | |
| 
 | |
| <p>
 | |
| In practice, these procedures are most useful for efficiently constructing
 | |
| character sets in a side-effecting manner, in some limited local context, 
 | |
| before passing the character set outside the local construction scope to be
 | |
| used in a functional manner.
 | |
| 
 | |
| <p>
 | |
| Scheme provides no assistance in checking the linearity of the potentially
 | |
| side-effected parameters passed to these functions --- there's no linear
 | |
| type checker or run-time mechanism for detecting violations. (But
 | |
| sophisticated programming environments, such as DrScheme, might help.)
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="ExtraSRFI">Extra-SRFI recommendations</a></h2>
 | |
| <p>
 | |
| Users are cautioned that the R5RS predicates 
 | |
| <div class=inset><code>
 | |
| char-alphabetic? <br>
 | |
| char-numeric? <br>
 | |
| char-whitespace? <br>
 | |
| char-upper-case? <br>
 | |
| char-lower-case? <br>
 | |
| </code>
 | |
| </div>
 | |
| <p class=continue>
 | |
| may or may not be in agreement with the SRFI 14 base character sets
 | |
| <div class=inset>
 | |
| <code>
 | |
| char-set:letter<br>
 | |
| char-set:digit<br>
 | |
| char-set:whitespace<br>
 | |
| char-set:upper-case<br>
 | |
| char-set:lower-case<br>
 | |
| </code>
 | |
| </div>
 | |
| <p class=continue>
 | |
| Implementors are strongly encouraged to bring these predicates into
 | |
| agreement with the base character sets of this SRFI; not to do so risks
 | |
| major confusion.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="Specification">Specification</a></h1>
 | |
| <p>
 | |
| In the following procedure specifications:
 | |
| <ul>
 | |
|     <li> A <var>cs</var> parameter is a character set.
 | |
| 
 | |
|     <li> An <var>s</var> parameter is a string.
 | |
| 
 | |
|     <li> A <var>char</var> parameter is a character.
 | |
| 
 | |
|     <li> A <var>char-list</var> parameter is a list of characters.
 | |
| 
 | |
|     <li> A <var>pred</var> parameter is a unary character predicate procedure, returning 
 | |
|       a true/false value when applied to a character.
 | |
| 
 | |
|     <li> An <var>obj</var> parameter may be any value at all.
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| Passing values to procedures with these parameters that do not satisfy these
 | |
| types is an error.
 | |
| 
 | |
| <p>
 | |
| Unless otherwise noted in the specification of a procedure, procedures
 | |
| always return character sets that are distinct (from the point of view
 | |
| of the linear-update operations) from the parameter character sets. For
 | |
| example, <code>char-set-adjoin</code> is guaranteed to provide a fresh character set,
 | |
| even if it is not given any character parameters.
 | |
| 
 | |
| <p>
 | |
| Parameters given in square brackets are optional. Unless otherwise noted in the
 | |
| text describing the procedure, any prefix of these optional parameters may
 | |
| be supplied, from zero arguments to the full list. When a procedure returns
 | |
| multiple values, this is shown by listing the return values in square
 | |
| brackets, as well. So, for example, the procedure with signature
 | |
| <pre class=code-example>
 | |
| halts? <var>f [x init-store]</var> -> <var>[boolean integer]</var>
 | |
| </pre>
 | |
| would take one (<var>f</var>), two (<var>f</var>, <var>x</var>) 
 | |
| or three (<var>f</var>, <var>x</var>, <var>init-store</var>) input parameters, 
 | |
| and return two values, a boolean and an integer.
 | |
| 
 | |
| <p>
 | |
| A parameter followed by "<code>...</code>" means zero-or-more elements. 
 | |
| So the procedure with the signature
 | |
| <pre class=code-example>
 | |
| sum-squares <var>x ... </var> -> <var>number</var>
 | |
| </pre>
 | |
| takes zero or more arguments (<var>x ...</var>), 
 | |
| while the procedure with signature
 | |
| <pre class=code-example>
 | |
| spell-check <var>doc dict<sub>1</sub> dict<sub>2</sub> ...</var> -> <var>string-list</var>
 | |
| </pre>
 | |
| takes two required parameters 
 | |
| (<var>doc</var> and <var>dict<sub>1</sub></var>) 
 | |
| and zero or more optional parameters (<var>dict<sub>2</sub> ...</var>).
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="GeneralProcs">General procedures</a></h2>
 | |
| <dl>
 | |
| 
 | |
| <!--
 | |
| ==== char-set?
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-p"></a>
 | |
| <code class=proc-def>char-set?</code><var> obj -> boolean</var>
 | |
| <dd class=proc-def>
 | |
| 
 | |
|     Is the object <var>obj</var> a character set?
 | |
| 
 | |
| <!--
 | |
| ==== char-set=
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set="></a>
 | |
| <code class=proc-def>char-set=</code><var> cs<sub>1</sub> ... -> boolean</var>
 | |
| <dd class=proc-def>
 | |
|     Are the character sets equal?
 | |
| <p>
 | |
|     Boundary cases:
 | |
| <pre class=code-example>
 | |
| (char-set=) => <var>true</var>
 | |
| (char-set= cs) => <var>true</var>
 | |
| </pre>
 | |
| 
 | |
| <p>
 | |
|     Rationale: transitive binary relations are generally extended to n-ary
 | |
|     relations in Scheme, which enables clearer, more concise code to be
 | |
|     written. While the zero-argument and one-argument cases will almost
 | |
|     certainly not arise in first-order uses of such relations, they may well
 | |
|     arise in higher-order cases or macro-generated code. 
 | |
|     <em>E.g.,</em> consider
 | |
| <pre class=code-example>
 | |
| (apply char-set= cset-list)
 | |
| </pre>
 | |
| <p class=continue>
 | |
|     This is well-defined if the list is empty or a singleton list. Hence
 | |
|     we extend these relations to any number of arguments. Implementors
 | |
|     have reported actual uses of n-ary relations in higher-order cases
 | |
|     allowing for fewer than two arguments. The way of Scheme is to handle the
 | |
|     general case; we provide the fully general extension.
 | |
| <p>
 | |
|     A counter-argument to this extension is that 
 | |
|     <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr>'s
 | |
|     transitive binary arithmetic relations 
 | |
|     (<code>=</code>, <code><</code>, <em>etc.</em>) 
 | |
|     require at least two arguments, hence
 | |
|     this decision is a break with the prior convention -- although it is
 | |
|     at least one that is backwards-compatible.
 | |
| 
 | |
| <!--
 | |
| ==== char-set<=
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set<="></a>
 | |
| <code class=proc-def>char-set<=</code><var> cs<sub>1</sub> ... -> boolean</var>
 | |
| <dd class=proc-def>
 | |
|     Returns true if every character set <var>cs<sub>i</sub></var> is 
 | |
|     a subset of character set <var>cs<sub>i+1</sub></var>.
 | |
| 
 | |
| <p>
 | |
| Boundary cases:
 | |
| <pre class=code-example>
 | |
| (char-set<=) => <var>true</var>
 | |
| (char-set<= cs) => <var>true</var>
 | |
| </pre>
 | |
| <p>
 | |
| Rationale: See <code>char-set=</code> for discussion of zero- and one-argument
 | |
| applications. Consider testing a list of char-sets for monotonicity
 | |
| with 
 | |
| <pre class=code-example>
 | |
| (apply char-set<= cset-list)
 | |
| </pre>
 | |
| 
 | |
| <!--
 | |
| ==== char-set-hash
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-hash"></a>
 | |
| <code class=proc-def>char-set-hash</code><var> cs [bound] -> integer</var>
 | |
| <dd class=proc-def>
 | |
|     Compute a hash value for the character set <var>cs</var>. 
 | |
|     <var>Bound</var> is a non-negative
 | |
|     exact integer specifying the range of the hash function. A positive
 | |
|     value restricts the return value to the range [0,<var>bound</var>).
 | |
| 
 | |
|     <p>
 | |
|     If <var>bound</var> is either zero or not given, the implementation may use
 | |
|     an implementation-specific default value, chosen to be as large as
 | |
|     is efficiently practical. For instance, the default range might be chosen
 | |
|     for a given implementation to map all strings into the range of
 | |
|     integers that can be represented with a single machine word.
 | |
| 
 | |
| 
 | |
|     <p>
 | |
|     Invariant:
 | |
| <pre class=code-example>
 | |
| (char-set= cs<sub>1</sub> cs<sub>2</sub>) => (= (char-set-hash cs<sub>1</sub> b) (char-set-hash cs<sub>2</sub> b))
 | |
| </pre>
 | |
| 
 | |
|     <p>
 | |
|     A legal but nonetheless discouraged implementation:
 | |
| <pre class=code-example>
 | |
| (define (char-set-hash cs . maybe-bound) 1)
 | |
| </pre>
 | |
| 
 | |
| <p>
 | |
|     Rationale: allowing the user to specify an explicit bound simplifies user
 | |
|     code by removing the mod operation that typically accompanies every hash
 | |
|     computation, and also may allow the implementation of the hash function to
 | |
|     exploit a reduced range to efficiently compute the hash value. 
 | |
|     <em>E.g.</em>, for
 | |
|     small bounds, the hash function may be computed in a fashion such that
 | |
|     intermediate values never overflow into bignum integers, allowing the
 | |
|     implementor to provide a fixnum-specific "fast path" for computing the
 | |
|     common cases very rapidly.
 | |
| 
 | |
| </dl>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="Iterating">Iterating over character sets</a></h2>
 | |
| 
 | |
| <dl>
 | |
| <!--
 | |
| ==== char-set-cursor char-set-ref char-set-cursor-next end-of-char-set?
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-cursor"></a>
 | |
| <a name="char-set-ref"></a>
 | |
| <a name="char-set-cursor-next"></a>
 | |
| <a name="end-of-char-set-p"></a>
 | |
| <code class=proc-def>char-set-cursor</code><var> cset -> cursor</var>
 | |
| <dt class=proc-defi>
 | |
| <code class=proc-def>char-set-ref</code><var> cset cursor -> char</var>
 | |
| <dt class=proc-defi>
 | |
| <code class=proc-def>char-set-cursor-next</code><var> cset cursor -> cursor</var>
 | |
| <dt class=proc-defn>
 | |
| <code class=proc-def>end-of-char-set?</code><var> cursor -> boolean</var>
 | |
| <dd class=proc-def>
 | |
|     Cursors are a low-level facility for iterating over the characters in a
 | |
|     set. A cursor is a value that indexes a character in a char set.
 | |
|     <code>char-set-cursor</code> produces a new cursor for a given char set. 
 | |
|     The set element indexed by the cursor is fetched with 
 | |
|     <code>char-set-ref</code>. 
 | |
|     A cursor index is incremented with <code>char-set-cursor-next</code>; 
 | |
|     in this way, code can step through every character in a char set. 
 | |
|     Stepping a cursor "past the end" of a char set produces a cursor that 
 | |
|     answers true to <code>end-of-char-set?</code>. 
 | |
|     It is an error to pass such a cursor to <code>char-set-ref</code> or to
 | |
|     <code>char-set-cursor-next</code>.
 | |
| 
 | |
| <p>
 | |
|     A cursor value may not be used in conjunction with a different character
 | |
|     set; if it is passed to <code>char-set-ref</code> or 
 | |
|     <code>char-set-cursor-next</code> with
 | |
|     a character set other than the one used to create it, the results and
 | |
|     effects are undefined.
 | |
| 
 | |
| <p>
 | |
|     Cursor values are <em>not</em> necessarily distinct from other types. 
 | |
|     They may be
 | |
|     integers, linked lists, records, procedures or other values. This license
 | |
|     is granted to allow cursors to be very "lightweight" values suitable for
 | |
|     tight iteration, even in fairly simple implementations.
 | |
| 
 | |
| <p>
 | |
|     Note that these primitives are necessary to export an iteration facility
 | |
|     for char sets to loop macros.
 | |
| 
 | |
| <p>
 | |
|     Example:
 | |
| <pre class=code-example>
 | |
| (define cs (char-set #\G #\a #\T #\e #\c #\h))
 | |
| 
 | |
| ;; Collect elts of CS into a list.
 | |
| (let lp ((cur (char-set-cursor cs)) (ans '()))
 | |
|   (if (end-of-char-set? cur) ans
 | |
|       (lp (char-set-cursor-next cs cur)
 | |
|           (cons (char-set-ref cs cur) ans))))
 | |
|   => (#\G #\T #\a #\c #\e #\h)
 | |
| 
 | |
| ;; Equivalently, using a list unfold (from SRFI 1):
 | |
| (unfold-right end-of-char-set? 
 | |
|               (curry char-set-ref cs)
 | |
| 	      (curry char-set-cursor-next cs)
 | |
| 	      (char-set-cursor cs))
 | |
|   => (#\G #\T #\a #\c #\e #\h)
 | |
| </pre>
 | |
| 
 | |
| <p>
 | |
|     Rationale: Note that the cursor API's four functions "fit" the functional
 | |
|     protocol used by the unfolders provided by the list, string and char-set
 | |
|     SRFIs (see the example above). By way of contrast, here is a simpler, 
 | |
|     two-function API that was rejected for failing this criterion. Besides 
 | |
|     <code>char-set-cursor</code>, it provided a single
 | |
|     function that mapped a cursor and a character set to two values, the
 | |
|     indexed character and the next cursor. If the cursor had exhausted the
 | |
|     character set, then this function returned false instead of the character
 | |
|     value, and another end-of-char-set cursor. In this way, the other three
 | |
|     functions of the current API were combined together.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-fold
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-fold"></a>
 | |
| <code class=proc-def>char-set-fold</code><var> kons knil cs -> object</var>
 | |
| <dd class=proc-def>
 | |
|     This is the fundamental iterator for character sets.  Applies the function
 | |
|     <var>kons</var> across the character set <var>cs</var> using initial state value <var>knil</var>.  That is,
 | |
|     if <var>cs</var> is the empty set, the procedure returns <var>knil</var>.  Otherwise, some
 | |
|     element <var>c</var> of <var>cs</var> is chosen; 
 | |
|     let <var>cs'</var> be the remaining, unchosen characters.
 | |
|     The procedure returns
 | |
| <pre class=code-example>
 | |
| (char-set-fold <var>kons</var> (<var>kons</var> <var>c</var> <var>knil</var>) <var>cs'</var>)
 | |
| </pre>
 | |
|     <p>
 | |
|     Examples:
 | |
| <pre class=code-example>
 | |
| ;; CHAR-SET-MEMBERS
 | |
| (lambda (cs) (char-set-fold cons '() cs))
 | |
| 
 | |
| ;; CHAR-SET-SIZE
 | |
| (lambda (cs) (char-set-fold (lambda (c i) (+ i 1)) 0 cs))
 | |
| 
 | |
| ;; How many vowels in the char set?
 | |
| (lambda (cs) 
 | |
|   (char-set-fold (lambda (c i) (if (vowel? c) (+ i 1) i))
 | |
|                  0 cs))
 | |
| </pre>
 | |
| 
 | |
| <!--
 | |
| ==== char-set-unfold char-set-unfold!
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-unfold"></a>
 | |
| <a name="char-set-unfold!"></a>
 | |
| <code class=proc-def>char-set-unfold </code><var> f p g seed [base-cs] -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-unfold!</code><var> f p g seed base-cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     This is a fundamental constructor for char-sets. 
 | |
| <ul>
 | |
|     <li> <var>G</var> is used to generate a series of "seed" values from the initial seed:
 | |
|         <var>seed</var>, (<var>g</var> <var>seed</var>), (<var>g<sup>2</sup></var> <var>seed</var>), (<var>g<sup>3</sup></var> <var>seed</var>), ...
 | |
|     <li> <var>P</var> tells us when to stop -- when it returns true when applied to one 
 | |
|       of these seed values.
 | |
|     <li> <var>F</var> maps each seed value to a character. These characters are added
 | |
|       to the base character set <var>base-cs</var> to form the result; <var>base-cs</var> defaults to
 | |
|       the empty set. <code>char-set-unfold!</code> adds the characters to <var>base-cs</var> in a 
 | |
|       linear-update -- it is allowed, but not required, to side-effect
 | |
|       and use <var>base-cs</var>'s storage to construct the result.
 | |
| </ul>
 | |
| 
 | |
|     <p>
 | |
|     More precisely, the following definitions hold, ignoring the
 | |
|     optional-argument issues:
 | |
| 
 | |
| <pre class=code-example>
 | |
| (define (char-set-unfold p f g seed base-cs) 
 | |
|   (char-set-unfold! p f g seed (char-set-copy base-cs)))
 | |
| 
 | |
| (define (char-set-unfold! p f g seed base-cs)
 | |
|   (let lp ((seed seed) (cs base-cs))
 | |
|         (if (p seed) cs                                 ; P says we are done.
 | |
|             (lp (g seed)                                ; Loop on (G SEED).
 | |
|                 (char-set-adjoin! cs (f seed))))))      ; Add (F SEED) to set.
 | |
| </pre>
 | |
| 
 | |
|     (Note that the actual implementation may be more efficient.)
 | |
| 
 | |
|     <p>
 | |
|     Examples:
 | |
| <pre class=code-example>                         
 | |
| (port->char-set p) = (char-set-unfold eof-object? values
 | |
|                                       (lambda (x) (read-char p))
 | |
|                                       (read-char p))
 | |
| 
 | |
| (list->char-set lis) = (char-set-unfold null? car cdr lis)
 | |
| </pre>
 | |
| <!--
 | |
| ==== char-set-for-each
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-for-each"></a>
 | |
| <code class=proc-def>char-set-for-each</code><var> proc cs -> unspecified</var>
 | |
| <dd class=proc-def>
 | |
|     Apply procedure <var>proc</var> to each character in the character set <var>cs</var>.
 | |
|     Note that the order in which <var>proc</var> is applied to the characters in the
 | |
|     set is not specified, and may even change from one procedure application
 | |
|     to another.
 | |
| 
 | |
|     <p>
 | |
|     Nothing at all is specified about the value returned by this procedure; it
 | |
|     is not even required to be consistent from call to call. It is simply
 | |
|     required to be a value (or values) that may be passed to a command
 | |
|     continuation, <em>e.g.</em> as the value of an expression appearing as a
 | |
|     non-terminal subform of a <code>begin</code> expression. 
 | |
|     Note that in 
 | |
|     <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr>,
 | |
|     this restricts the procedure to returning a single value; 
 | |
|     non-R5RS systems may not even provide this restriction.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-map
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-map"></a>
 | |
| <code class=proc-def>char-set-map</code><var> proc cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     <var>proc</var> is a char->char procedure. Apply it to all the characters in
 | |
|     the char-set <var>cs</var>, and collect the results into a new character set.
 | |
| 
 | |
|     <p>
 | |
|     Essentially lifts <var>proc</var> from a char->char procedure to a char-set ->
 | |
|     char-set procedure.
 | |
| 
 | |
|     <p>
 | |
|     Example:
 | |
| <pre class=code-example>
 | |
| (char-set-map char-downcase cset)
 | |
| </pre>
 | |
| </dl>
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="Creating">Creating character sets</a></h2>
 | |
| <dl>
 | |
| 
 | |
| <!--
 | |
| ==== char-set-copy
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-copy"></a>
 | |
| <code class=proc-def>char-set-copy</code><var> cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     Returns a copy of the character set <var>cs</var>.  "Copy" means that if either the
 | |
|     input parameter or the result value of this procedure is passed to one of
 | |
|     the linear-update procedures described below, the other character set is
 | |
|     guaranteed not to be altered.  
 | |
| 
 | |
|     <p>
 | |
|     A system that provides pure-functional implementations of the
 | |
|     linear-operator suite could implement this procedure as the identity
 | |
|     function -- so copies are <em>not</em> guaranteed to be distinct by <code>eq?</code>.
 | |
| 
 | |
| <!--
 | |
| ==== char-set
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set"></a>
 | |
| <code class=proc-def>char-set</code><var> char<sub>1</sub> ... -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     Return a character set containing the given characters.
 | |
| 
 | |
| <!--
 | |
| ==== list->char-set list->char-set
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="list->char-set"></a>
 | |
| <a name="list->char-set!"></a>
 | |
| <code class=proc-def>list->char-set </code><var> char-list [base-cs] -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>list->char-set!</code><var> char-list base-cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     Return a character set containing the characters in the list of
 | |
|     characters <var>char-list</var>.
 | |
| 
 | |
|     <p>
 | |
|     If character set <var>base-cs</var> is provided, the characters from <var>char-list</var>
 | |
|     are added to it. <code>list->char-set!</code> is allowed, but not required,
 | |
|     to side-effect and reuse the storage in <var>base-cs</var>; 
 | |
|     <code>list->char-set</code> produces a fresh character set.
 | |
| 
 | |
| <!--
 | |
| ==== string->char-set string->char-set!
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="string->char-set"></a>
 | |
| <a name="string->char-set!"></a>
 | |
| <code class=proc-def>string->char-set </code><var> s [base-cs] -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>string->char-set!</code><var> s base-cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
| 
 | |
|     Return a character set containing the characters in the string <var>s</var>.
 | |
| 
 | |
|     <p>
 | |
|     If character set <var>base-cs</var> is provided, the characters from <var>s</var> are added to
 | |
|     it. <code>string->char-set!</code> is allowed, but not required, to side-effect and
 | |
|     reuse the storage in <var>base-cs</var>; <code>string->char-set</code> produces a fresh character
 | |
|     set.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-filter char-set-filter!
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-filter"></a>
 | |
| <a name="char-set-filter!"></a>
 | |
| <code class=proc-def>char-set-filter </code><var> pred cs [base-cs] -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-filter!</code><var> pred cs base-cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
| 
 | |
|     Returns a character set containing every character <var>c</var> 
 | |
|     in <var>cs</var> such that <code>(<var>pred</var> <var>c</var>)</code> 
 | |
|     returns true.
 | |
| 
 | |
| <p>
 | |
|     If character set <var>base-cs</var> is provided, the characters specified 
 | |
|     by <var>pred</var> are added to it. 
 | |
|     <code>char-set-filter!</code> is allowed, but not required,
 | |
|     to side-effect and reuse the storage in <var>base-cs</var>; 
 | |
|     <code>char-set-filter</code> produces a fresh character set.
 | |
| 
 | |
| <p>
 | |
|     An implementation may not save away a reference to <var>pred</var> and
 | |
|     invoke it after <code>char-set-filter</code> or 
 | |
|     <code>char-set-filter!</code> returns -- that is, "lazy,"
 | |
|     on-demand implementations are not allowed, as <var>pred</var> may have
 | |
|     external dependencies on mutable data or have other side-effects.
 | |
| 
 | |
| <p>
 | |
|     Rationale: This procedure provides a means of converting a character
 | |
|     predicate into its equivalent character set; the <var>cs</var> parameter
 | |
|     allows the programmer to bound the predicate's domain. Programmers should
 | |
|     be aware that filtering a character set such as <code>char-set:full</code>
 | |
|     could be a very expensive operation in an implementation that provided an
 | |
|     extremely large character type, such as 32-bit Unicode. An earlier draft
 | |
|     of this library provided a simple <code>predicate->char-set</code>
 | |
|     procedure, which was rejected in favor of <code>char-set-filter</code> for
 | |
|     this reason.
 | |
| 
 | |
| 
 | |
| <!--
 | |
| ==== ucs-range->char-set ucs-range->char-set!
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="ucs-range->char-set"></a>
 | |
| <a name="ucs-range->char-set!"></a>
 | |
| <code class=proc-def>ucs-range->char-set </code><var> lower upper [error? base-cs] -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>ucs-range->char-set!</code><var> lower upper error? base-cs -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     <var>Lower</var> and <var>upper</var> are exact non-negative integers; 
 | |
|     <var>lower</var> <= <var>upper</var>.
 | |
| 
 | |
|     <p>
 | |
|     Returns a character set containing every character whose ISO/IEC 10646
 | |
|     UCS-4 code lies in the half-open range [<var>lower</var>,<var>upper</var>).
 | |
| 
 | |
| <ul>
 | |
|     <li> If the requested range includes unassigned UCS values, these are
 | |
|       silently ignored (the current UCS specification has "holes" in the
 | |
|       space of assigned codes).
 | |
|     
 | |
|     <li> If the requested range includes "private" or "user space" codes, these
 | |
|       are handled in an implementation-specific manner; however, a UCS- or
 | |
|       Unicode-based Scheme implementation should pass them through
 | |
|       transparently.
 | |
|     
 | |
|     <li> If any code from the requested range specifies a valid, assigned
 | |
|       UCS character that has no corresponding representative in the
 | |
|       implementation's character type, then (1) an error is raised if <var>error?</var>
 | |
|       is true, and (2) the code is ignored if <var>error?</var> is false (the default).
 | |
|       This might happen, for example, if the implementation uses ASCII
 | |
|       characters, and the requested range includes non-ASCII characters.
 | |
| </ul>
 | |
| 
 | |
|     <p>
 | |
|     If character set <var>base-cs</var> is provided, the characters specified by the
 | |
|     range are added to it. <code>ucs-range->char-set!</code> is allowed, but not required,
 | |
|     to side-effect and reuse the storage in <var>base-cs</var>; 
 | |
|     <code>ucs-range->char-set</code> produces a fresh character set.
 | |
| 
 | |
|     <p>
 | |
|     Note that ASCII codes are a subset of the Latin-1 codes, which are in turn
 | |
|     a subset of the 16-bit Unicode codes, which are themselves a subset of the
 | |
|     32-bit UCS-4 codes. We commit to a specific encoding in this routine,
 | |
|     regardless of the underlying representation of characters, so that client
 | |
|     code using this library will be portable. <em>I.e.</em>, a conformant Scheme
 | |
|     implementation may use EBCDIC or SHIFT-JIS to encode characters; it must
 | |
|     simply map the UCS characters from the given range into the native
 | |
|     representation when possible, and report errors when not possible.
 | |
| 
 | |
| <!--
 | |
| ==== ->char-set
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="->char-set"></a>
 | |
| <code class=proc-def>->char-set</code><var> x -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     Coerces <var>x</var> into a char-set. 
 | |
|     <var>X</var> may be a string, character or
 | |
|     char-set. A string is converted to the set of its constituent characters;
 | |
|     a character is converted to a singleton set; a char-set is returned
 | |
|     as-is.
 | |
|     This procedure is intended for use by other procedures that want to 
 | |
|     provide "user-friendly," wide-spectrum interfaces to their clients.
 | |
| 
 | |
| </dl>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="Querying">Querying character sets</a></h2>
 | |
| <dl>
 | |
| 
 | |
| <!--
 | |
| ==== char-set-size
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-size"></a>
 | |
| <code class=proc-def>char-set-size</code><var> cs -> integer</var>
 | |
| <dd class=proc-def>
 | |
|     Returns the number of elements in character set <var>cs</var>.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-count
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-count"></a>
 | |
| <code class=proc-def>char-set-count</code><var> pred cs -> integer</var>
 | |
| <dd class=proc-def>
 | |
|     Apply <var>pred</var> to the chars of character set <var>cs</var>, and return the number
 | |
|     of chars that caused the predicate to return true.
 | |
| 
 | |
| <!--
 | |
| ==== char-set->list
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set->list"></a>
 | |
| <code class=proc-def>char-set->list</code><var> cs -> character-list</var>
 | |
| <dd class=proc-def>
 | |
|     This procedure returns a list of the members of character set <var>cs</var>.
 | |
|     The order in which <var>cs</var>'s characters appear in the list is not defined,
 | |
|     and may be different from one call to another.
 | |
| 
 | |
| <!--
 | |
| ==== char-set->string
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set->string"></a>
 | |
| <code class=proc-def>char-set->string</code><var> cs -> string</var>
 | |
| <dd class=proc-def>
 | |
|     This procedure returns a string containing the members of character set <var>cs</var>.
 | |
|     The order in which <var>cs</var>'s characters appear in the string is not defined,
 | |
|     and may be different from one call to another.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-contains?
 | |
| ============================================================================-->
 | |
| <dt class=proc-def>
 | |
| <a name="char-set-contains-p"></a>
 | |
| <code class=proc-def>char-set-contains?</code><var> cs char -> boolean</var>
 | |
| <dd class=proc-def>
 | |
|     This procedure tests <var>char</var> for membership in character set <var>cs</var>.
 | |
| 
 | |
|     <p>
 | |
|     The MIT Scheme character-set package called this procedure
 | |
|     <var>char-set-member?</var>, but the argument order isn't consistent with the name.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-every char-set-any
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-every"></a>
 | |
| <a name="char-set-any"></a>
 | |
| <code class=proc-def>char-set-every</code><var> pred cs -> boolean</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-any  </code><var> pred cs -> boolean</var>
 | |
| <dd class=proc-def>
 | |
|     The <code>char-set-every</code> procedure returns true if predicate <var>pred</var>
 | |
|     returns true of every character in the character set <var>cs</var>.
 | |
|     Likewise, <code>char-set-any</code> applies <var>pred</var> to every character in
 | |
|     character set <var>cs</var>, and returns the first true value it finds.
 | |
|     If no character produces a true value, it returns false.
 | |
|     The order in which these procedures sequence through the elements of
 | |
|     <var>cs</var> is not specified.
 | |
| 
 | |
|     <p>
 | |
|     Note that if you need to determine the actual character on which a 
 | |
|     predicate returns true, use <code>char-set-any</code> and arrange for the predicate 
 | |
|     to return the character parameter as its true value, <em>e.g.</em>
 | |
| <pre class=code-example>
 | |
| (char-set-any (lambda (c) (and (char-upper-case? c) c)) 
 | |
|               cs)
 | |
| </pre>
 | |
| </dl>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="Algebra">Character-set algebra</a></h2>
 | |
| <dl>
 | |
| 
 | |
| <!--
 | |
| ==== char-set-adjoin char-set-delete
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-adjoin"></a>
 | |
| <a name="char-set-delete"></a>
 | |
| <code class=proc-def>char-set-adjoin</code><var> cs char<sub>1</sub> ... -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-delete</code><var> cs char<sub>1</sub> ... -> char-set</var>
 | |
| <dd class=proc-def>
 | |
|     Add/delete the <var>char<sub>i</sub></var> characters to/from character set <var>cs</var>.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-adjoin! char-set-delete!
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-adjoin!"></a>
 | |
| <a name="char-set-delete!"></a>
 | |
| <code class=proc-def>char-set-adjoin!</code><var> cs char<sub>1</sub> ... -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-delete!</code><var> cs char<sub>1</sub> ... -> char-set</var>
 | |
| <dd class=proc-def>
 | |
| 
 | |
|     Linear-update variants. These procedures are allowed, but not
 | |
|     required, to side-effect their first parameter.
 | |
| 
 | |
| <!--
 | |
| ==== char-set-complement char-set-union char-set-intersection 
 | |
| ==== char-set-difference char-set-xor char-set-diff+intersection
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-complement"></a>
 | |
| <a name="char-set-union"></a>
 | |
| <a name="char-set-intersection"></a>
 | |
| <a name="char-set-difference"></a>
 | |
| <a name="char-set-xor"></a>
 | |
| <a name="char-set-diff+intersection"></a>
 | |
| <code class=proc-def>char-set-complement</code><var> cs                     -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-union</code><var> cs<sub>1</sub> ...                 -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-intersection</code><var> cs<sub>1</sub> ...          -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-difference</code><var> cs<sub>1</sub> cs<sub>2</sub> ...        -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-xor</code><var> cs<sub>1</sub> ...                   -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-diff+intersection</code><var> cs<sub>1</sub> cs<sub>2</sub> ... -> [char-set char-set]</var>
 | |
| <dd class=proc-def>
 | |
|     These procedures implement set complement, union, intersection,
 | |
|     difference, and exclusive-or for character sets. The union, intersection
 | |
|     and xor operations are n-ary. The difference function is also n-ary,
 | |
|     associates to the left (that is, it computes the difference between
 | |
|     its first argument and the union of all the other arguments),
 | |
|     and requires at least one argument.
 | |
| 
 | |
|     <p>
 | |
|     Boundary cases:
 | |
| <pre class=code-example>
 | |
| (char-set-union) => char-set:empty
 | |
| (char-set-intersection) => char-set:full
 | |
| (char-set-xor) => char-set:empty
 | |
| (char-set-difference <var>cs</var>) => <var>cs</var>
 | |
| </pre>
 | |
| 
 | |
|     <p>
 | |
|     <code>char-set-diff+intersection</code> returns both the difference and the
 | |
|     intersection of the arguments -- it partitions its first parameter.
 | |
|     It is equivalent to 
 | |
| <pre class=code-example>
 | |
| (values (char-set-difference <var>cs<sub>1</sub></var> <var>cs<sub>2</sub></var> ...)
 | |
|         (char-set-intersection <var>cs<sub>1</sub></var> (char-set-union <var>cs<sub>2</sub></var> ...)))
 | |
| </pre>
 | |
|     but can be implemented more efficiently.
 | |
| 
 | |
| <p>
 | |
|     Programmers should be aware that <code>char-set-complement</code> could potentially
 | |
|     be a very expensive operation in Scheme implementations that provide
 | |
|     a very large character type, such as 32-bit Unicode. If this is a
 | |
|     possibility, sets can be complimented with respect to a smaller
 | |
|     universe using <code>char-set-difference</code>.
 | |
| 
 | |
| 
 | |
| <!--
 | |
| ==== char-set-complement! char-set-union! char-set-intersection! 
 | |
| ==== char-set-difference! char-set-xor! char-set-diff+intersection!
 | |
| ============================================================================-->
 | |
| <dt class=proc-def1>
 | |
| <a name="char-set-complement!"></a>
 | |
| <a name="char-set-union!"></a>
 | |
| <a name="char-set-intersection!"></a>
 | |
| <a name="char-set-difference!"></a>
 | |
| <a name="char-set-xor!"></a>
 | |
| <a name="char-set-diff+intersection!"></a>
 | |
| <code class=proc-def>char-set-complement!</code><var> cs                     -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-union!</code><var>  cs<sub>1</sub> cs<sub>2</sub> ...                   -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-intersection!</code><var>  cs<sub>1</sub> cs<sub>2</sub> ...          -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-difference!</code><var>  cs<sub>1</sub> cs<sub>2</sub> ...            -> char-set</var>
 | |
| <dt class=proc-defi><code class=proc-def>char-set-xor!</code><var>  cs<sub>1</sub> cs<sub>2</sub> ...                   -> char-set</var>
 | |
| <dt class=proc-defn><code class=proc-def>char-set-diff+intersection!</code><var>  cs<sub>1</sub> cs<sub>2</sub> cs<sub>3</sub> ... -> [char-set char-set]</var>
 | |
| <dd class=proc-def>
 | |
|     These are linear-update variants of the set-algebra functions.
 | |
|     They are allowed, but not required, to side-effect their first (required)
 | |
|     parameter.
 | |
| 
 | |
|     <p>
 | |
|     <code>char-set-diff+intersection!</code> is allowed to side-effect both
 | |
|     of its two required parameters, <var>cs<sub>1</sub></var>
 | |
|     and <var>cs<sub>2</sub></var>.
 | |
| </dl>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="StandardCharsets">Standard character sets</a></h2>
 | |
| <p>
 | |
| Several character sets are predefined for convenience:
 | |
| <a name="char-set:lower-case"></a>
 | |
| <a name="char-set:lower-case"></a>
 | |
| <a name="char-set:upper-case"></a>
 | |
| <a name="char-set:title-case"></a>
 | |
| <a name="char-set:letter"></a>
 | |
| <a name="char-set:digit"></a>
 | |
| <a name="char-set:letter+digit"></a>
 | |
| <a name="char-set:graphic"></a>
 | |
| <a name="char-set:printing"></a>
 | |
| <a name="char-set:whitespace"></a>
 | |
| <a name="char-set:iso-control"></a>
 | |
| <a name="char-set:punctuation"></a>
 | |
| <a name="char-set:symbol"></a>
 | |
| <a name="char-set:hex-digit"></a>
 | |
| <a name="char-set:blank"></a>
 | |
| <a name="char-set:ascii"></a>
 | |
| <a name="char-set:empty"></a>
 | |
| <a name="char-set:full"></a>
 | |
| <div class=inset>
 | |
| <table cellpadding=0 cellspacing=0>
 | |
| <tr><td><code>char-set:lower-case</code> </td><td>Lower-case letters</td></tr>
 | |
| <tr><td><code>char-set:upper-case</code> </td><td>Upper-case letters</td></tr>
 | |
| <tr><td><code>char-set:title-case</code> </td><td>Title-case letters</td></tr>
 | |
| <tr><td><code>char-set:letter</code> </td><td>Letters</td></tr>
 | |
| <tr><td><code>char-set:digit</code> </td><td>Digits</td></tr>
 | |
| <tr><td><code>char-set:letter+digit</code> </td><td>Letters and digits</td></tr>
 | |
| <tr><td><code>char-set:graphic</code> </td><td>Printing characters except spaces</td></tr>
 | |
| <tr><td><code>char-set:printing</code> </td><td>Printing characters including spaces</td></tr>
 | |
| <tr><td><code>char-set:whitespace</code> </td><td>Whitespace characters </td></tr>
 | |
| <tr><td><code>char-set:iso-control</code> </td><td>The ISO control characters </td></tr>
 | |
| <tr><td><code>char-set:punctuation</code> </td><td>Punctuation characters</td></tr>
 | |
| <tr><td><code>char-set:symbol</code> </td><td>Symbol characters</td></tr>
 | |
| <tr><td><code>char-set:hex-digit</code> </td><td>A hexadecimal digit: 0-9, A-F, a-f </td></tr>
 | |
| <tr><td><code>char-set:blank</code> </td><td>Blank characters -- horizontal whitespace</td></tr>
 | |
| <tr><td><code>char-set:ascii</code> </td><td>All characters in the ASCII set. </td></tr>
 | |
| <tr><td><code>char-set:empty</code> </td><td>Empty set </td></tr>
 | |
| <tr><td><code>char-set:full</code> </td><td>All characters </td></tr>
 | |
| </table>
 | |
| </div>
 | |
| 
 | |
| <p>
 | |
| Note that there may be characters in <code>char-set:letter</code> that are neither upper or
 | |
| lower case---this might occur in implementations that use a character type
 | |
| richer than ASCII, such as Unicode. A "graphic character" is one that would
 | |
| put ink on your page. While the exact composition of these sets may vary
 | |
| depending upon the character type provided by the underlying Scheme system,
 | |
| here are the definitions for some of the sets in an ASCII implementation:
 | |
| <div class=inset>
 | |
| <table cellpadding=0 cellspacing=0>
 | |
| <tr><td><code>char-set:lower-case</code> </td><td>a-z </td></tr>
 | |
| <tr><td><code>char-set:upper-case</code> </td><td>A-Z </td></tr>
 | |
| <tr><td><code>char-set:letter</code> </td><td>A-Z and a-z </td></tr>
 | |
| <tr><td><code>char-set:digit</code> </td><td>0123456789</td></tr>
 | |
| <tr><td><code>char-set:punctuation</code> </td><td><code>!"#%&'()*,-./:;?@[\]_{}</code></td></tr>
 | |
| <tr><td><code>char-set:symbol</code> </td><td><code>$+<=>^`|~</code></td></tr>
 | |
| <tr><td><code>char-set:whitespace</code> </td><td>Space, newline, tab, form feed, </td></tr>
 | |
| <tr><td></td><td>                               vertical tab, carriage return </td></tr>
 | |
| <tr><td><code>char-set:blank</code> </td><td>Space and tab </td></tr>
 | |
| <tr><td><code>char-set:graphic</code> </td><td>letter + digit + punctuation + symbol</td></tr>
 | |
| <tr><td><code>char-set:printing</code> </td><td>graphic + whitespace</td></tr>
 | |
| <tr><td><code>char-set:iso-control</code> </td><td>ASCII 0-31 and 127 </td></tr>
 | |
| </table>
 | |
| </div>
 | |
| 
 | |
| <p>
 | |
| Note that the existence of the <code>char-set:ascii</code> set implies that the underlying
 | |
| character set is required to be at least as rich as ASCII (including
 | |
| ASCII's control characters).
 | |
| 
 | |
| <p>
 | |
| Rationale: The name choices reflect a shift from the older "alphabetic/numeric"
 | |
| terms found in 
 | |
| <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr>
 | |
| and Posix to newer, Unicode-influenced "letter/digit" lexemes.
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="StandardCharsetDefs">
 | |
|     Unicode, Latin-1 and ASCII definitions of the standard character sets</a>
 | |
| </h1>
 | |
| <p>
 | |
| In Unicode Scheme implementations, the base character sets are compatible with
 | |
| Java's Unicode specifications. For ASCII or Latin-1, we simply restrict the
 | |
| Unicode set specifications to their first 128 or 256 codes, respectively.
 | |
| Scheme implementations that are not based on ASCII, Latin-1 or Unicode should
 | |
| attempt to preserve the sense or spirit of these definitions.
 | |
| 
 | |
| <p>
 | |
| The following descriptions frequently make reference to the "Unicode character
 | |
| database." This is a file, available at URL
 | |
| <div class=inset>
 | |
| <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt">
 | |
| ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt</a>
 | |
| </div>
 | |
| <p class=continue>
 | |
| Each line contains a description of a Unicode character. The first
 | |
| semicolon-delimited field of the line gives the hex value of the character's
 | |
| code; the second field gives the name of the character, and the third field
 | |
| gives a two-letter category. Other fields give simple 1-1 case-mappings for
 | |
| the character and other information; see
 | |
| <div class=inset>
 | |
| <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html">
 | |
| ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html</a>
 | |
| </div>
 | |
| <p class=continue>
 | |
| for further description of the file's format. Note in particular the
 | |
| two-letter category specified in the the third field, which is referenced
 | |
| frequently in the descriptions below.
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="lower-case-def">char-set:lower-case</a></h2>
 | |
| <p>
 | |
| For Unicode, we follow Java's specification: a character is lowercase if
 | |
| <ul>
 | |
| <li> it is not in the range [U+2000,U+2FFF], and
 | |
| <li> the Unicode attribute table does not give a lowercase mapping for it, and
 | |
| <li> at least one of the following is true:
 | |
|   <ul>
 | |
|   <li> the Unicode attribute table gives a mapping to uppercase 
 | |
|     for the character, or
 | |
|   <li> the name for the character in the Unicode attribute table contains
 | |
|     the words "SMALL LETTER" or "SMALL LIGATURE".
 | |
|   </ul>
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| The lower-case ASCII characters are 
 | |
| <div class=inset>
 | |
|     abcdefghijklmnopqrstuvwxyz
 | |
| </div>
 | |
| <p class=continue>
 | |
| Latin-1 adds another 33 lower-case characters to the ASCII set:
 | |
| <div class=inset>
 | |
| <table cellpadding=0 cellspacing=0>
 | |
| <tr><td>00B5</td> <td>MICRO SIGN</td></tr>
 | |
| <tr><td>00DF</td> <td>LATIN SMALL LETTER SHARP S</td></tr>
 | |
| <tr><td>00E0</td> <td>LATIN SMALL LETTER A WITH GRAVE</td></tr>
 | |
| <tr><td>00E1</td> <td>LATIN SMALL LETTER A WITH ACUTE</td></tr>
 | |
| <tr><td>00E2</td> <td>LATIN SMALL LETTER A WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00E3</td> <td>LATIN SMALL LETTER A WITH TILDE</td></tr>
 | |
| <tr><td>00E4</td> <td>LATIN SMALL LETTER A WITH DIAERESIS</td></tr>
 | |
| <tr><td>00E5</td> <td>LATIN SMALL LETTER A WITH RING ABOVE</td></tr>
 | |
| <tr><td>00E6</td> <td>LATIN SMALL LETTER AE</td></tr>
 | |
| <tr><td>00E7</td> <td>LATIN SMALL LETTER C WITH CEDILLA</td></tr>
 | |
| <tr><td>00E8</td> <td>LATIN SMALL LETTER E WITH GRAVE</td></tr>
 | |
| <tr><td>00E9</td> <td>LATIN SMALL LETTER E WITH ACUTE</td></tr>
 | |
| <tr><td>00EA</td> <td>LATIN SMALL LETTER E WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00EB</td> <td>LATIN SMALL LETTER E WITH DIAERESIS</td></tr>
 | |
| <tr><td>00EC</td> <td>LATIN SMALL LETTER I WITH GRAVE</td></tr>
 | |
| <tr><td>00ED</td> <td>LATIN SMALL LETTER I WITH ACUTE</td></tr>
 | |
| <tr><td>00EE</td> <td>LATIN SMALL LETTER I WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00EF</td> <td>LATIN SMALL LETTER I WITH DIAERESIS</td></tr>
 | |
| <tr><td>00F0</td> <td>LATIN SMALL LETTER ETH</td></tr>
 | |
| <tr><td>00F1</td> <td>LATIN SMALL LETTER N WITH TILDE</td></tr>
 | |
| <tr><td>00F2</td> <td>LATIN SMALL LETTER O WITH GRAVE</td></tr>
 | |
| <tr><td>00F3</td> <td>LATIN SMALL LETTER O WITH ACUTE</td></tr>
 | |
| <tr><td>00F4</td> <td>LATIN SMALL LETTER O WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00F5</td> <td>LATIN SMALL LETTER O WITH TILDE</td></tr>
 | |
| <tr><td>00F6</td> <td>LATIN SMALL LETTER O WITH DIAERESIS</td></tr>
 | |
| <tr><td>00F8</td> <td>LATIN SMALL LETTER O WITH STROKE</td></tr>
 | |
| <tr><td>00F9</td> <td>LATIN SMALL LETTER U WITH GRAVE</td></tr>
 | |
| <tr><td>00FA</td> <td>LATIN SMALL LETTER U WITH ACUTE</td></tr>
 | |
| <tr><td>00FB</td> <td>LATIN SMALL LETTER U WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00FC</td> <td>LATIN SMALL LETTER U WITH DIAERESIS</td></tr>
 | |
| <tr><td>00FD</td> <td>LATIN SMALL LETTER Y WITH ACUTE</td></tr>
 | |
| <tr><td>00FE</td> <td>LATIN SMALL LETTER THORN</td></tr>
 | |
| <tr><td>00FF</td> <td>LATIN SMALL LETTER Y WITH DIAERESIS</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p class=continue>
 | |
| Note that three of these have no corresponding Latin-1 upper-case character:
 | |
| <div class=inset>
 | |
| <table cellpadding=0 cellspacing=0>
 | |
| <tr><td>00B5</td> <td>MICRO SIGN</td></tr>
 | |
| <tr><td>00DF</td> <td>LATIN SMALL LETTER SHARP S</td></tr>
 | |
| <tr><td>00FF</td> <td>LATIN SMALL LETTER Y WITH DIAERESIS</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p class=continue>
 | |
| (The compatibility micro character uppercases to the non-Latin-1 Greek capital
 | |
| mu; the German sharp s character uppercases to the pair of characters "SS,"
 | |
| and the capital y-with-diaeresis is non-Latin-1.)
 | |
| 
 | |
| <p>
 | |
| (Note that the Java spec for lowercase characters given at
 | |
| <div class=inset>
 | |
| <a href="http://java.sun.com/docs/books/jls/html/javalang.doc4.html#14345">
 | |
| http://java.sun.com/docs/books/jls/html/javalang.doc4.html#14345</a>
 | |
| </div>
 | |
| <p class=continue>
 | |
| is inconsistent. U+00B5 MICRO SIGN fulfills the requirements for a lower-case
 | |
| character (as of Unicode 3.0), but is not given in the numeric list of
 | |
| lower-case character codes.)
 | |
| 
 | |
| <p>
 | |
| (Note that the Java spec for <code>isLowerCase()</code> given at
 | |
| <div class=inset>
 | |
| <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html#isLowerCase(char)">
 | |
| http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html#isLowerCase(char)</a>
 | |
| </div>
 | |
| <p class=continue>
 | |
| gives three mutually inconsistent definitions of "lower case." The first is
 | |
| the definition used in this SRFI. Following text says "A character is
 | |
| considered to be lowercase if and only if it is specified to be lowercase by
 | |
| the Unicode 2.0 standard (category Ll in the Unicode specification data
 | |
| file)." The former spec excludes U+00AA FEMININE ORDINAL INDICATOR and
 | |
| U+00BA MASCULINE ORDINAL INDICATOR; the later spec includes them. Finally,
 | |
| the spec enumerates a list of characters in the Latin-1 subset; this list
 | |
| excludes U+00B5 MICRO SIGN, which is included in both of the previous specs.) 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="upper-case-def">char-set:upper-case</a></h2>
 | |
| <p>
 | |
| For Unicode, we follow Java's specification: a character is uppercase if
 | |
| <ul>
 | |
| <li> it is not in the range [U+2000,U+2FFF], and
 | |
| <li> the Unicode attribute table does not give an uppercase mapping for it
 | |
| (this excludes titlecase characters), and
 | |
| <li> at least one of the following is true:
 | |
|   <ul>
 | |
|   <li> the Unicode attribute table gives a mapping to lowercase 
 | |
|     for the character, or
 | |
|   <li> the name for the character in the Unicode attribute table contains
 | |
|     the words "CAPITAL LETTER" or "CAPITAL LIGATURE".
 | |
|   </ul>
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| The upper-case ASCII characters are 
 | |
| <div class=inset>
 | |
| ABCDEFGHIJKLMNOPQRSTUVWXYZ
 | |
| </div>
 | |
| <p class=continue>
 | |
| Latin-1 adds another 30 upper-case characters to the ASCII set:
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>00C0</td> <td>LATIN CAPITAL LETTER A WITH GRAVE</td></tr>
 | |
| <tr><td>00C1</td> <td>LATIN CAPITAL LETTER A WITH ACUTE</td></tr>
 | |
| <tr><td>00C2</td> <td>LATIN CAPITAL LETTER A WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00C3</td> <td>LATIN CAPITAL LETTER A WITH TILDE</td></tr>
 | |
| <tr><td>00C4</td> <td>LATIN CAPITAL LETTER A WITH DIAERESIS</td></tr>
 | |
| <tr><td>00C5</td> <td>LATIN CAPITAL LETTER A WITH RING ABOVE</td></tr>
 | |
| <tr><td>00C6</td> <td>LATIN CAPITAL LETTER AE</td></tr>
 | |
| <tr><td>00C7</td> <td>LATIN CAPITAL LETTER C WITH CEDILLA</td></tr>
 | |
| <tr><td>00C8</td> <td>LATIN CAPITAL LETTER E WITH GRAVE</td></tr>
 | |
| <tr><td>00C9</td> <td>LATIN CAPITAL LETTER E WITH ACUTE</td></tr>
 | |
| <tr><td>00CA</td> <td>LATIN CAPITAL LETTER E WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00CB</td> <td>LATIN CAPITAL LETTER E WITH DIAERESIS</td></tr>
 | |
| <tr><td>00CC</td> <td>LATIN CAPITAL LETTER I WITH GRAVE</td></tr>
 | |
| <tr><td>00CD</td> <td>LATIN CAPITAL LETTER I WITH ACUTE</td></tr>
 | |
| <tr><td>00CE</td> <td>LATIN CAPITAL LETTER I WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00CF</td> <td>LATIN CAPITAL LETTER I WITH DIAERESIS</td></tr>
 | |
| <tr><td>00D0</td> <td>LATIN CAPITAL LETTER ETH</td></tr>
 | |
| <tr><td>00D1</td> <td>LATIN CAPITAL LETTER N WITH TILDE</td></tr>
 | |
| <tr><td>00D2</td> <td>LATIN CAPITAL LETTER O WITH GRAVE</td></tr>
 | |
| <tr><td>00D3</td> <td>LATIN CAPITAL LETTER O WITH ACUTE</td></tr>
 | |
| <tr><td>00D4</td> <td>LATIN CAPITAL LETTER O WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00D5</td> <td>LATIN CAPITAL LETTER O WITH TILDE</td></tr>
 | |
| <tr><td>00D6</td> <td>LATIN CAPITAL LETTER O WITH DIAERESIS</td></tr>
 | |
| <tr><td>00D8</td> <td>LATIN CAPITAL LETTER O WITH STROKE</td></tr>
 | |
| <tr><td>00D9</td> <td>LATIN CAPITAL LETTER U WITH GRAVE</td></tr>
 | |
| <tr><td>00DA</td> <td>LATIN CAPITAL LETTER U WITH ACUTE</td></tr>
 | |
| <tr><td>00DB</td> <td>LATIN CAPITAL LETTER U WITH CIRCUMFLEX</td></tr>
 | |
| <tr><td>00DC</td> <td>LATIN CAPITAL LETTER U WITH DIAERESIS</td></tr>
 | |
| <tr><td>00DD</td> <td>LATIN CAPITAL LETTER Y WITH ACUTE</td></tr>
 | |
| <tr><td>00DE</td> <td>LATIN CAPITAL LETTER THORN</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <!--========================================================================-->
 | |
| <h2><a name="title-case-def">char-set:title-case</a></h2>
 | |
| <p>
 | |
| In Unicode, a character is titlecase if it has the category Lt in
 | |
| the character attribute database. There are very few of these characters;
 | |
| here is the entire 31-character list as of Unicode 3.0:
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>01C5 </td><td nowrap> LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
 | |
| </td></tr>
 | |
| <tr><td>01C8 </td><td nowrap> LATIN CAPITAL LETTER L WITH SMALL LETTER J
 | |
| </td></tr>
 | |
| <tr><td>01CB </td><td nowrap> LATIN CAPITAL LETTER N WITH SMALL LETTER J
 | |
| </td></tr>
 | |
| <tr><td>01F2 </td><td nowrap> LATIN CAPITAL LETTER D WITH SMALL LETTER Z
 | |
| </td></tr>
 | |
| <tr><td>1F88 </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F89 </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F8A </td><td nowrap>GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F8B </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F8C </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F8D </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F8E </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F8F </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F98 </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F99 </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F9A </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F9B </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F9C </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F9D </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F9E </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1F9F </td><td nowrap> GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FA8 </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FA9 </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FAA </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FAB </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FAC </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FAD </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FAE </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FAF </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FBC </td><td nowrap> GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FCC </td><td nowrap> GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| <tr><td>1FFC </td><td nowrap> GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
 | |
| </td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p>
 | |
| There are no ASCII or Latin-1 titlecase characters.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="letter-def">char-set:letter</a></h2>
 | |
| <p>
 | |
| In Unicode, a letter is any character with one of the letter categories
 | |
| (Lu, Ll, Lt, Lm, Lo) in the Unicode character database. 
 | |
| 
 | |
| <p>
 | |
| There are 52 ASCII letters
 | |
| <div class=indent>
 | |
|     abcdefghijklmnopqrstuvwxyz <br>
 | |
|     ABCDEFGHIJKLMNOPQRSTUVWXYZ <br>
 | |
| </div>
 | |
| <p>
 | |
| There are 117 Latin-1 letters. These are the 115 characters that are
 | |
| members of the Latin-1 <code>char-set:lower-case</code> and <code>char-set:upper-case</code> sets, 
 | |
| plus
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>00AA</td> <td>FEMININE ORDINAL INDICATOR</td></tr>
 | |
| <tr><td>00BA</td> <td>MASCULINE ORDINAL INDICATOR</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p class=continue>
 | |
| (These two letters are considered lower-case by Unicode, but not by
 | |
| Java or SRFI 14.)
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="digit-def">char-set:digit</a></h2>
 | |
| 
 | |
| <p>
 | |
| In Unicode, a character is a digit if it has the category Nd in
 | |
| the character attribute database. In Latin-1 and ASCII, the only
 | |
| such characters are 0123456789. In Unicode, there are other digit
 | |
| characters in other code blocks, such as Gujarati digits and Tibetan
 | |
| digits.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="hex-digit-def">char-set:hex-digit</a></h2>
 | |
| <p>
 | |
| The only hex digits are 0123456789abcdefABCDEF.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="letter+digit-def">char-set:letter+digit</a></h2>
 | |
| <p>
 | |
| The union of <code>char-set:letter</code> and <code>char-set:digit.</code>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="graphic-def">char-set:graphic</a></h2>
 | |
| <p>
 | |
| A graphic character is one that would put ink on paper. The ASCII and Latin-1
 | |
| graphic characters are the members of
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td><code>char-set:letter</code></td></tr>
 | |
| <tr><td><code>char-set:digit</code></td></tr>
 | |
| <tr><td><code>char-set:punctuation</code></td></tr>
 | |
| <tr><td><code>char-set:symbol</code></td></tr>
 | |
| </table>
 | |
| </div>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="printing-def">char-set:printing</a></h2>
 | |
| <p>
 | |
| A printing character is one that would occupy space when printed, <em>i.e.</em>,
 | |
| a graphic character or a space character. <code>char-set:printing</code> is the union
 | |
| of <code>char-set:whitespace</code> and <code>char-set:graphic.</code>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="whitespace-def">char-set:whitespace</a></h2>
 | |
| <p>
 | |
| In Unicode, a whitespace character is either
 | |
| <ul>
 | |
|   <li> a character with one of the space, line, or paragraph separator categories
 | |
|     (Zs, Zl or Zp) of the Unicode character database.
 | |
|   <li> U+0009 Horizontal tabulation (\t control-I)
 | |
|   <li> U+000A Line feed (\n control-J)
 | |
|   <li> U+000B Vertical tabulation (\v control-K)
 | |
|   <li> U+000C Form feed (\f control-L)
 | |
|   <li> U+000D Carriage return (\r control-M)
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| There are 24 whitespace characters in Unicode 3.0:
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>0009</td> <td>HORIZONTAL TABULATION </td> <td>  \t control-I</td></tr>
 | |
| <tr><td>000A</td> <td>LINE FEED         </td> <td> \n control-J</td></tr>
 | |
| <tr><td>000B</td> <td>VERTICAL TABULATION       </td> <td> \v control-K</td></tr>
 | |
| <tr><td>000C</td> <td>FORM FEED         </td> <td> \f control-L</td></tr>
 | |
| <tr><td>000D</td> <td>CARRIAGE RETURN   </td> <td> \r control-M</td></tr>
 | |
| <tr><td>0020</td> <td>SPACE                     </td> <td> Zs</td></tr>
 | |
| <tr><td>00A0</td> <td>NO-BREAK SPACE    </td> <td> Zs</td></tr>
 | |
| <tr><td>1680</td> <td>OGHAM SPACE MARK  </td> <td> Zs</td></tr>
 | |
| <tr><td>2000</td> <td>EN QUAD           </td> <td> Zs</td></tr>
 | |
| <tr><td>2001</td> <td>EM QUAD           </td> <td> Zs</td></tr>
 | |
| <tr><td>2002</td> <td>EN SPACE          </td> <td> Zs</td></tr>
 | |
| <tr><td>2003</td> <td>EM SPACE          </td> <td> Zs</td></tr>
 | |
| <tr><td>2004</td> <td>THREE-PER-EM SPACE        </td> <td> Zs</td></tr>
 | |
| <tr><td>2005</td> <td>FOUR-PER-EM SPACE </td> <td> Zs</td></tr>
 | |
| <tr><td>2006</td> <td>SIX-PER-EM SPACE  </td> <td> Zs</td></tr>
 | |
| <tr><td>2007</td> <td>FIGURE SPACE              </td> <td> Zs</td></tr>
 | |
| <tr><td>2008</td> <td>PUNCTUATION SPACE </td> <td> Zs</td></tr>
 | |
| <tr><td>2009</td> <td>THIN SPACE                </td> <td> Zs</td></tr>
 | |
| <tr><td>200A</td> <td>HAIR SPACE                </td> <td> Zs</td></tr>
 | |
| <tr><td>200B</td> <td>ZERO WIDTH SPACE  </td> <td> Zs</td></tr>
 | |
| <tr><td>2028</td> <td>LINE SEPARATOR    </td> <td> Zl</td></tr>
 | |
| <tr><td>2029</td> <td>PARAGRAPH SEPARATOR       </td> <td> Zp</td></tr>
 | |
| <tr><td>202F</td> <td>NARROW NO-BREAK SPACE     </td> <td> Zs</td></tr>
 | |
| <tr><td>3000</td> <td>IDEOGRAPHIC SPACE </td> <td> Zs</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p>
 | |
| The ASCII whitespace characters are the first six characters in the above list
 | |
| -- line feed, horizontal tabulation, vertical tabulation, form feed, carriage
 | |
| return, and space. These are also exactly the characters recognised by the
 | |
| Posix <code>isspace()</code> procedure. Latin-1 adds the no-break space.
 | |
| 
 | |
| <p>
 | |
| Note: Java's <code>isWhitespace()</code> method is incompatible, including
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>0009</td> <td>HORIZONTAL TABULATION </td> <td>  (\t control-I)</td></tr>
 | |
| <tr><td>001C</td> <td>FILE SEPARATOR   </td> <td> (control-\)</td></tr>
 | |
| <tr><td>001D</td> <td>GROUP SEPARATOR  </td> <td>(control-])</td></tr>
 | |
| <tr><td>001E</td> <td>RECORD SEPARATOR </td> <td>(control-^)</td></tr>
 | |
| <tr><td>001F</td> <td>UNIT SEPARATOR   </td> <td>(control-_)</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p class=continue>
 | |
| and excluding
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>00A0</td> <td>NO-BREAK SPACE</td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p>
 | |
| Java's excluding the no-break space means that tokenizers can simply break
 | |
| character streams at "whitespace" boundaries. However, the exclusion introduces
 | |
| exceptions in other places, <em>e.g.</em> <code>char-set:printing</code> is no longer simply the
 | |
| union of <code>char-set:graphic</code> and <code>char-set:whitespace.</code>
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="iso-control-def">char-set:iso-control</a></h2>
 | |
| <p>
 | |
| The ISO control characters are the Unicode/Latin-1 characters in the ranges
 | |
| [U+0000,U+001F] and [U+007F,U+009F].
 | |
| 
 | |
| <p>
 | |
| ASCII restricts this set to the characters in the range [U+0000,U+001F] 
 | |
| plus the character U+007F.
 | |
| 
 | |
| <p>
 | |
| Note that Unicode defines other control characters which do not belong to this
 | |
| set (hence the qualifying prefix "iso-" in the name). This restriction is
 | |
| compatible with the Java <code>IsISOControl()</code> method.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="punctuation-def">char-set:punctuation</a></h2>
 | |
| <p>
 | |
| In Unicode, a punctuation character is any character that has one of the
 | |
| punctuation categories in the Unicode character database (Pc, Pd, Ps,
 | |
| Pe, Pi, Pf, or Po.)
 | |
| 
 | |
| <p>
 | |
| ASCII has 23 punctuation characters:
 | |
| <pre class=code-example>
 | |
| !"#%&'()*,-./:;?@[\]_{}
 | |
| </pre>
 | |
| <p>
 | |
| Latin-1 adds six more:
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>00A1 </td> <td> INVERTED EXCLAMATION MARK
 | |
| <tr><td>00AB </td> <td> LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
 | |
| <tr><td>00AD </td> <td> SOFT HYPHEN
 | |
| <tr><td>00B7 </td> <td> MIDDLE DOT
 | |
| <tr><td>00BB </td> <td> RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 | |
| <tr><td>00BF </td> <td> INVERTED QUESTION MARK
 | |
| </table>
 | |
| </div>
 | |
| 
 | |
| <p>
 | |
| Note that the nine ASCII characters <code>$+<=>^`|~</code> are <em>not</em>
 | |
| punctuation. They are "symbols."
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="symbol-def">char-set:symbol</a></h2>
 | |
| <p>
 | |
| In Unicode, a symbol is any character that has one of the symbol categories
 | |
| in the Unicode character database (Sm, Sc, Sk, or So). There
 | |
| are nine ASCII symbol characters:
 | |
| <pre class=code-example>
 | |
| $+<=>^`|~
 | |
| </pre>
 | |
| <p>
 | |
| Latin-1 adds 18 more:
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>00A2 </td> <td> CENT SIGN </td></tr>
 | |
| <tr><td>00A3 </td> <td> POUND SIGN </td></tr>
 | |
| <tr><td>00A4 </td> <td> CURRENCY SIGN </td></tr>
 | |
| <tr><td>00A5 </td> <td> YEN SIGN </td></tr>
 | |
| <tr><td>00A6 </td> <td> BROKEN BAR </td></tr>
 | |
| <tr><td>00A7 </td> <td> SECTION SIGN </td></tr>
 | |
| <tr><td>00A8 </td> <td> DIAERESIS </td></tr>
 | |
| <tr><td>00A9 </td> <td> COPYRIGHT SIGN </td></tr>
 | |
| <tr><td>00AC </td> <td> NOT SIGN </td></tr>
 | |
| <tr><td>00AE </td> <td> REGISTERED SIGN </td></tr>
 | |
| <tr><td>00AF </td> <td> MACRON </td></tr>
 | |
| <tr><td>00B0 </td> <td> DEGREE SIGN </td></tr>
 | |
| <tr><td>00B1 </td> <td> PLUS-MINUS SIGN </td></tr>
 | |
| <tr><td>00B4 </td> <td> ACUTE ACCENT </td></tr>
 | |
| <tr><td>00B6 </td> <td> PILCROW SIGN </td></tr>
 | |
| <tr><td>00B8 </td> <td> CEDILLA </td></tr>
 | |
| <tr><td>00D7 </td> <td> MULTIPLICATION SIGN </td></tr>
 | |
| <tr><td>00F7 </td> <td> DIVISION SIGN </td></tr>
 | |
| </table>
 | |
| </div>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h2><a name="blank-def">char-set:blank</a></h2>
 | |
| 
 | |
| <p>
 | |
| Blank chars are horizontal whitespace. In Unicode, a blank character is either
 | |
| <ul>
 | |
|   <li> a character with the space separator category (Zs) in the Unicode 
 | |
|     character database.
 | |
|   <li> U+0009 Horizontal tabulation (\t control-I)
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| There are eighteen blank characters in Unicode 3.0:
 | |
| <div class=inset>
 | |
| <table cellspacing=0 cellpadding=0>
 | |
| <tr><td>0009 </td> <td> HORIZONTAL TABULATION   </td> <td> \t control-I </td></tr>
 | |
| <tr><td>0020 </td> <td> SPACE                   </td> <td> Zs </td></tr>
 | |
| <tr><td>00A0 </td> <td> NO-BREAK SPACE  </td> <td> Zs </td></tr>
 | |
| <tr><td>1680 </td> <td> OGHAM SPACE MARK        </td> <td> Zs </td></tr>
 | |
| <tr><td>2000 </td> <td> EN QUAD         </td> <td> Zs </td></tr>
 | |
| <tr><td>2001 </td> <td> EM QUAD         </td> <td> Zs </td></tr>
 | |
| <tr><td>2002 </td> <td> EN SPACE                </td> <td> Zs </td></tr>
 | |
| <tr><td>2003 </td> <td> EM SPACE                </td> <td> Zs </td></tr>
 | |
| <tr><td>2004 </td> <td> THREE-PER-EM SPACE      </td> <td> Zs </td></tr>
 | |
| <tr><td>2005 </td> <td> FOUR-PER-EM SPACE       </td> <td> Zs </td></tr>
 | |
| <tr><td>2006 </td> <td> SIX-PER-EM SPACE        </td> <td> Zs </td></tr>
 | |
| <tr><td>2007 </td> <td> FIGURE SPACE            </td> <td> Zs </td></tr>
 | |
| <tr><td>2008 </td> <td> PUNCTUATION SPACE       </td> <td> Zs </td></tr>
 | |
| <tr><td>2009 </td> <td> THIN SPACE              </td> <td> Zs </td></tr>
 | |
| <tr><td>200A </td> <td> HAIR SPACE              </td> <td> Zs </td></tr>
 | |
| <tr><td>200B </td> <td> ZERO WIDTH SPACE        </td> <td> Zs </td></tr>
 | |
| <tr><td>202F </td> <td> NARROW NO-BREAK SPACE   </td> <td> Zs </td></tr>
 | |
| <tr><td>3000 </td> <td> IDEOGRAPHIC SPACE       </td> <td> Zs </td></tr>
 | |
| </table>
 | |
| </div>
 | |
| <p>
 | |
| The ASCII blank characters are the first two characters above --
 | |
| horizontal tab and space. Latin-1 adds the no-break space.
 | |
| 
 | |
| <p>
 | |
| Java doesn't have the concept of "blank" characters, so there are no
 | |
| compatibility issues.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="ReferenceImp">Reference implementation</a></h1>
 | |
| <p>
 | |
| This SRFI comes with a reference implementation. It resides at:
 | |
| <div class=inset>
 | |
|     <a href="http://srfi.schemers.org/srfi-14/srfi-14.scm">
 | |
| http://srfi.schemers.org/srfi-14/srfi-14.scm</a>
 | |
| </div>
 | |
| <p class=continue>
 | |
| I have placed this source on the Net with an unencumbered, "open" copyright.
 | |
| Some of the code in the reference implementation bears a distant family
 | |
| relation to the MIT Scheme implementation, and being derived from that code,
 | |
| is covered by the MIT Scheme copyright (which is a generic BSD-style
 | |
| open-source copyright -- see the source file for details). The remainder of
 | |
| the code was written by myself for scsh or for this SRFI; I have placed this
 | |
| code under the scsh copyright, which is also a generic BSD-style open-source
 | |
| copyright.
 | |
| 
 | |
| <p>
 | |
| The code is written for portability and should be simple to port to
 | |
| any Scheme. It has only the following deviations from R4RS, clearly
 | |
| discussed in the comments:
 | |
| <ul>
 | |
|   <li> an <code>error</code> procedure;
 | |
|   <li> the R5RS <code>values</code> procedure for producing multiple return values;
 | |
|   <li> a simple <code>check-arg</code> procedure for argument checking;
 | |
|   <li> <code>let-optionals*</code> and <code>:optional</code> macros for for parsing, checking and defaulting
 | |
|     optional arguments from rest lists;
 | |
|   <li> The SRFI-19 <code>define-record-type</code> form;
 | |
|   <li> <code>bitwise-and</code> for the hash function;
 | |
|   <li> <code>%latin1->char</code> and <code>%char->latin1</code>.
 | |
| </ul>
 | |
| 
 | |
| <p>
 | |
| The library is written for clarity and well-commented; the current source is
 | |
| about 375 lines of source code and 375 lines of comments and white space.
 | |
| It is also written for efficiency. Fast paths are provided for common cases.
 | |
| 
 | |
| <p>
 | |
| This is not to say that the implementation can't be tuned up for
 | |
| a specific Scheme implementation. There are notes in comments addressing
 | |
| ways implementors can tune the reference implementation for performance.
 | |
| 
 | |
| <p>
 | |
| In short, I've written the reference implementation to make it as painless
 | |
| as possible for an implementor -- or a regular programmer -- to adopt this
 | |
| library and get good results with it.
 | |
| 
 | |
| <p>
 | |
| The code uses a rather simple-minded, inefficient representation for
 | |
| ASCII/Latin-1 char-sets -- a 256-character string. The character whose code is
 | |
| <var>i</var> is in the set if <var>s[i]</var> = ASCII 1 (soh, or ^a); 
 | |
| not in the set if <var>s[i]</var> = ASCII 0 (nul). 
 | |
| A much faster and denser representation would be 16 or 32 bytes worth
 | |
| of bit string. A portable implementation using bit sets awaits standards for
 | |
| bitwise logical-ops and byte vectors.
 | |
| 
 | |
| <p>
 | |
| "Large" character types, such as Unicode, should use a sparse representation,
 | |
| taking care that the Latin-1 subset continues to be represented with a
 | |
| dense 32-byte bit set.
 | |
| 
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="Acknowledgements">Acknowledgements</a></h1>
 | |
| <p>
 | |
| The design of this library benefited greatly from the feedback provided during
 | |
| the SRFI discussion phase. Among those contributing thoughtful commentary and
 | |
| suggestions, both on the mailing list and by private discussion, were Paolo
 | |
| Amoroso, Lars Arvestad, Alan Bawden, Jim Bender, Dan Bornstein, Per Bothner,
 | |
| Will Clinger, Brian Denheyer, Kent Dybvig, Sergei Egorov, Marc Feeley,
 | |
| Matthias Felleisen, Will Fitzgerald, Matthew Flatt, Arthur A. Gleckler, Ben
 | |
| Goetter, Sven Hartrumpf, Erik Hilsdale, Shiro Kawai, Richard Kelsey, Oleg
 | |
| Kiselyov, Bengt Kleberg, Donovan Kolbly, Bruce Korb, Shriram Krishnamurthi,
 | |
| Bruce Lewis, Tom Lord, Brad Lucier, Dave Mason, David Rush, Klaus Schilling,
 | |
| Jonathan Sobel, Mike Sperber, Mikael Staldal, Vladimir Tsyshevsky, Donald
 | |
| Welsh, and Mike Wilson. I am grateful to them for their assistance.
 | |
| 
 | |
| <p>
 | |
| I am also grateful the authors, implementors and documentors of all the
 | |
| systems mentioned in the introduction. Aubrey Jaffer should be noted for his
 | |
| work in producing Web-accessible versions of the R5RS spec, which was a
 | |
| tremendous aid.
 | |
| 
 | |
| <p>
 | |
| This is not to imply that these individuals necessarily endorse the final
 | |
| results, of course. 
 | |
| 
 | |
| <p>
 | |
| During this document's long development period, great patience was exhibited
 | |
| by Mike Sperber, who is the editor for the SRFI, and by Hillary Sullivan,
 | |
| who is not.
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="Links">References & links</a></h1>
 | |
| 
 | |
| <dl>
 | |
| <dt class=biblio><strong><a name="Java">[Java]</a></strong>
 | |
| <dd>
 | |
|     The following URLs provide documentation on relevant Java classes. <br>
 | |
| 
 | |
|     <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html">http://java.sun.com/products/jdk/1.2/docs/api/java/lang/Character.html</a>
 | |
|     <br>
 | |
|     <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html">http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html</a>
 | |
|     <br>
 | |
|     <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/lang/StringBuffer.html">http://java.sun.com/products/jdk/1.2/docs/api/java/lang/StringBuffer.html</a>
 | |
|     <br>
 | |
|     <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/text/Collator.html">http://java.sun.com/products/jdk/1.2/docs/api/java/text/Collator.html</a>
 | |
|     <br>
 | |
|     <a href="http://java.sun.com/products/jdk/1.2/docs/api/java/text/package-summary.html">http://java.sun.com/products/jdk/1.2/docs/api/java/text/package-summary.html</a>
 | |
| 
 | |
| <dt class=biblio><strong><a name="MIT-Scheme">[MIT-Scheme]</a></strong>
 | |
| <dd>
 | |
|     <a href="http://www.swiss.ai.mit.edu/projects/scheme/">http://www.swiss.ai.mit.edu/projects/scheme/</a>
 | |
| 
 | |
| <dt class=biblio><strong><a name="R5RS">[R5RS]</a></strong></dt>
 | |
| <dd>Revised<sup>5</sup> report on the algorithmic language Scheme.<br>
 | |
|     R. Kelsey, W. Clinger, J. Rees (editors). <br>
 | |
|     Higher-Order and Symbolic Computation, Vol. 11, No. 1, September, 1998. <br>
 | |
|     and ACM SIGPLAN Notices, Vol. 33, No. 9, October, 1998. <br>
 | |
|     Available at <a href="http://www.schemers.org/Documents/Standards/">
 | |
|     http://www.schemers.org/Documents/Standards/</a>.
 | |
| 
 | |
| <dt class=biblio><strong>[SRFI]</strong></dt>
 | |
| <dd>
 | |
|     The SRFI web site. <br>
 | |
|     <a href="http://srfi.schemers.org/">http://srfi.schemers.org/</a>
 | |
| 
 | |
| <dt class=biblio><strong>[SRFI-14]</strong></dt>
 | |
| <dd>
 | |
|     SRFI-14: String libraries. <br>
 | |
|     <a href="http://srfi.schemers.org/srfi-14/">http://srfi.schemers.org/srfi-14/</a>
 | |
| 
 | |
|     <dl>    
 | |
|     <dt>
 | |
|       This document, in HTML:
 | |
|     <dd><a href="http://srfi.schemers.org/srfi-14/srfi-14.html">
 | |
|         http://srfi.schemers.org/srfi-14/srfi-14.html</a>
 | |
| 
 | |
|     <dt>
 | |
|       This document, in plain text format:
 | |
|     <dd><a href="http://srfi.schemers.org/srfi-14/srfi-14.txt">
 | |
|         http://srfi.schemers.org/srfi-14/srfi-14.txt</a>
 | |
| 
 | |
|     <dt> Source code for the reference implementation:
 | |
|     <dd>
 | |
|       <a href="http://srfi.schemers.org/srfi-14/srfi-14.scm">
 | |
|          http://srfi.schemers.org/srfi-14/srfi-14.scm</a>
 | |
| 
 | |
|     <dt> Scheme 48 module specification, with typings:
 | |
|     <dd>
 | |
|       <a href="http://srfi.schemers.org/srfi-14/srfi-14-s48-module.scm">
 | |
|         http://srfi.schemers.org/srfi-14/srfi-14-s48-module.scm</a>
 | |
| 
 | |
|     <dt> Regression-test suite:
 | |
|     <dd> <a href="http://srfi.schemers.org/srfi-14/srfi-14-tests.scm">
 | |
|          http://srfi.schemers.org/srfi-14/srfi-14-tests.scm</a>
 | |
| 
 | |
|     </dl>
 | |
| </dd>
 | |
| 
 | |
| <dt class=biblio><strong><a name="Unicode">[Unicode]</a></strong>
 | |
| <dd>
 | |
|     <a href="http://www.unicode.org/">http://www.unicode.org/</a>
 | |
| 
 | |
| <dt class=biblio><strong><a name="UnicodeData">[UnicodeData]</a></strong>
 | |
| <dd>
 | |
|     The Unicode character database. <br>
 | |
|     <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt">ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt</a>
 | |
|     <br>
 | |
|     <a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html">ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.html</a>
 | |
| </dl>
 | |
| 
 | |
| <!--========================================================================-->
 | |
| <h1><a name="Copyright">Copyright</a></h1>
 | |
| 
 | |
| <p>
 | |
| Certain portions of this document -- the specific, marked segments of text
 | |
| describing the <abbr title="Revised^5 Report on Scheme"><a href="#R5RS">R5RS</a></abbr> procedures -- were adapted with permission from the R5RS
 | |
| report.
 | |
|     
 | |
| <p>
 | |
| All other text is copyright (C) Olin Shivers (1998, 1999, 2000). 
 | |
| All Rights Reserved. 
 | |
| 
 | |
| <p>
 | |
| This document and translations of it may be copied and furnished to others,
 | |
| and derivative works that comment on or otherwise explain it or assist in its
 | |
| implementation may be prepared, copied, published and distributed, in whole or
 | |
| in part, without restriction of any kind, provided that the above copyright
 | |
| notice and this paragraph are included on all such copies and derivative
 | |
| works. However, this document itself may not be modified in any way, such as
 | |
| by removing the copyright notice or references to the Scheme Request For
 | |
| Implementation process or editors, except as needed for the purpose of
 | |
| developing SRFIs in which case the procedures for copyrights defined in the
 | |
| SRFI process must be followed, or as required to translate it into languages
 | |
| other than English.
 | |
| 
 | |
| <p>
 | |
| The limited permissions granted above are perpetual and will not be revoked by
 | |
| the authors or their successors or assigns.
 | |
| 
 | |
| <p>
 | |
| This document and the information contained herein is provided on an
 | |
| "<strong>as is</strong>" basis and <strong>the authors and the SRFI editors
 | |
| disclaim all warranties, express or implied, including but not limited to any
 | |
| warranty that the use of the information herein will not infringe any rights
 | |
| or any implied warranties of merchantability or fitness for a particular
 | |
| purpose.</strong>
 | |
| 
 | |
| </body>
 | |
| </html>
 | |
| <!--
 | |
|   LocalWords:  SRFI refs HTML css hackery sans Netscape td pre div para
 | |
|   LocalWords:  proc def procs defi's defn dl dt defi dd NS RS rs procx
 | |
|   LocalWords:  stylesheet IE biblio IE's Internationalisation ascii doc
 | |
|   LocalWords:  normalisation lib ref ci ok titlecase upcase downcase
 | |
|   LocalWords:  xsubstring xcopy tokenize kmp slib RScheme MzScheme init
 | |
|   LocalWords:  Bigloo Chez APL SML Unicode API eszet SS dz downcases
 | |
|   LocalWords:  titlecasing normalised normalise underbar ss eq vs dict
 | |
|   LocalWords:  backquote parameterised denmark taiwan UnicodeData txt
 | |
|   LocalWords:  pred nchars obj len cBa epilog foo baz wrt subst tstart
 | |
|   LocalWords:  Szilagyi zilagyi cs abcdefgh ca cd cond eek ee tHIS com
 | |
|   LocalWords:  elba elbA ary consed XXXX ac bc kons knil ans plusses 
 | |
|   LocalWords:  catamorphism lp eof lis cdr knull kar kdr anamorphism
 | |
|   LocalWords:  abcdefg sfrom sto TCL perl slen rv exp initialisation
 | |
|   LocalWords:  plen SJ PJ si sj pj IPORT iport patlen DF buf Bevan
 | |
|   LocalWords:  Denheyer scsh Paolo Amoroso Arvestad Bawden Dybvig
 | |
|   LocalWords:  Bornstein Bothner Egorov Feeley Matthias Felleisen
 | |
|   LocalWords:  Flatt ucs Gleckler Goetter Sven Hartrumpf Hilsdale
 | |
|   LocalWords:  Kiselyov Bengt Korb Kleberg Kolbly Shriram  bignum
 | |
|   LocalWords:  Krishnamurthi Lucier Schilling Sobel Mikael Staldal
 | |
|   LocalWords:  Tsyshevsky documentors Jaffer Sperber cltl AE fixnum
 | |
|   LocalWords:  CommonLisp HyperSpec Clinger Rees SIGPLAN uniquified
 | |
|   LocalWords:  cset EA DrScheme IEC conformant JIS xor diff Posix URL
 | |
|   LocalWords:  FFF DIAERESIS abcdefghijklmnopqrstuvwxyz EB EC EF ETH
 | |
|   LocalWords:  FA FB FC FD FF Ll AA diaeresis isLowerCase BA CB CC CE
 | |
|   LocalWords:  CF DA DC Lt CARON PSILI Lu PROSGEGRAMMENI DASIA VARIA
 | |
|   LocalWords:  OXIA PERISPOMENI FAA FAB FAC FAE FAF FBC FFC Lm Lo
 | |
|   LocalWords:  abcdefABCDEF Zs Zl Zp OGHAM IDEOGRAPHIC Pc recognised
 | |
|   LocalWords:  tokenizers iso Pd Ps Pe Pf AB BB BF Sm Sc Sk AF MACRON
 | |
|   LocalWords:  PILCROW soh nul ops Shiro Kawai subform
 | |
| -->
 |