scsh-0.6/scsh/lib/strings.txt

Todo:
    parse-start+end parse-final-start+end need "string" in the name
        Also, export macro binder.
    What's up w/quotient? (quotient -1 3) = 0.
    regexp-foldl
    type regexp interface
    land*
    Let-optional:
      A let-optional that parses a prefix of the args.
      Arg checking forms that get used if it parses, but are not
        applied to the default.

The Scheme Underground string library includes a rich set of operations
for manipulating strings. These are frequently useful for scripting and
other text-manipulation applications.

The library's design was influenced by the string libraries found in MIT
Scheme, Gambit, RScheme, MzScheme, slib, Common Lisp, Bigloo, guile, APL and
the SML standard basis.  Some of the code bears a distant family relation to
the MIT Scheme implementation, and being derived from that code, is covered by
the MIT Scheme copyright (which is a fairly generic "free" copyright -- see
the source file for details). The fast KMP string-search code used in
SUBSTRING? was loosely adapted from old slib code by Stephen Bevan.

The library has the following design principles:
- *All* procedures involving character comparison are available in
  both case-sensitive and case-insensitive forms.

- *All* functionality is available in substring and full-string forms.

- The procedures are spec'd so as to permit efficient implementation in a
  Scheme that provides shared-text substrings (e.g., guile). This means that
  you should not rely on many of the substring-selecting procedures to return
  freshly-allocated strings. Careful attention is paid to the issue of which
  procedures allocate fresh storage, and which are permitted to return results
  that share storage with the arguments.

- Common Lisp theft:
    + inequality functions return mismatch index.
      I generalised this so that this "protocol" is extended even to
      the equality functions. This means that clients can be handed any generic
      string-comparison function and rely on the meaning of the true value.

    + Common Lisp capitalisation definition

The library addresses some problems with the R5RS string procedures:
  - Question marks after string-comparison functions (string=?, etc.)
    This is inconsistent with numeric comparison functions, and ugly, too.
  - String-comparison functions do not provide a useful true value.
  - STRING-COPY should have optional start/end args;
    SUBSTRING shouldn't specify if it copies or returns shared bits.
  - STRING-FILL! and STRING->LIST should take optional start/end args.
  - No <> function provided.

In the following procedure specifications:
	- Any S parameter is a string;

	- START and END parameters are half-open string indices specifying
	  a substring within a string parameter; when optional, they default
	  to 0 and the length of the string, respectively. When specified, it
	  must be the case that 0 <= START <= END <= (string-length S), for
	  the corresponding parameter S. They typically restrict a procedure's
          action to the indicated substring.

	- A CHAR/CHAR-SET/PRED parameter is a value used to select/search
          for a character in a string. If it is a character, it is used in
	  an equality test; if it is a character set, it is used as a
          membership test; if it is a procedure, it is applied to the
	  characters as a test predicate.

This library contains a large number of procedures, but they follow
a consistent naming scheme. The names are composed of smaller lexemes
in a regular way that exposes the structure and relationships between the
procedures. This should help the programmer to recall or reconstitute the name
of the particular procedure that he needs when writing his own code. In
particular
	- Procedures whose names end in "-ci" are case-insensitive variants.
	- Procedures whose names end in "!" are side-effecting variants.
	  These procedures generally return an unspecified value.
	- The order of common parameters is fairly consistent across the
	  different procedures.

For more text-manipulation functionality, see also the regular expression,
file-name, character set, and character->character partial map packages.

-------------------------------------------------------------------------------
* R4RS/R5RS procedures

The R4RS and R5RS reports define 22 string procedures. The string-lib
package includes 8 of these exactly as defined, 4 in an extended,
backwards-compatible way, and drops the remaining 10 (whose functionality
is available via other bindings).

The 8 procedures provided exactly as documented in the reports are
	string?
	make-string
	string
        string-length
	string-ref
	string-set!
	string-append
	list->string

The ten functions not included are the R4RS string-comparison functions:
	string=?  string-ci=?
	string<?  string-ci<?
	string>?  string-ci>?
	string<=? string-ci<=?
	string>=? string-ci>=?
The string-lib package provides alternate bindings.

Additionally, the four extended procedures are

    string-fill! s char [start end] -> unspecified
    string->list s [start end] -> char-list
    substring s start [end] -> string
    string-copy s [start end] -> string

These procedures are documented in the following section.  In brief, they are
extended to take optional start/end parameters specifying substring ranges.
Additionally, SUBSTRING is allowed to return a value that shares storage with
its argument.


* Procedures

These procedures are contained in the Scheme 48 package "string-lib",
which is open in the default user package. They are not found in the
"scsh" package; script writers and other programmers that use the Scheme
48 module system must open string-lib explicitly.

string-map  proc s [start end] -> string
string-map! proc s [start end] -> unspecified
    PROC is a char->char procedure; it is mapped over S.
    Note: no sequence order is specified.

string-fold       kons knil s [start end] -> value
string-fold-right kons knil s [start end] -> value
    These are the fundamental iterators for strings.
    The left-fold operator maps the KONS procedure across the
    string from left to right
	(... (kons s[2] (kons s[1] (kons s[0] knil))))
    In other words, string-fold obeys the recursion
	(string-fold kons knil s start end) =
	    (string-fold kons (kons s[start] knil) s start+1 end)

    The right-fold operator maps the KONS procedure across the
    string from right to left
	(kons s[0] (... (kons s[end-3] (kons s[end-2] (kons s[end-1] knil)))))
    obeying the recursion
	(string-fold-right kons knil s start end) =
	    (string-fold-right kons (kons s[end-1] knil) s start end-1)

    Examples:
	To convert a string to a list of chars:
	    (string-fold-right cons '() s)

	To count the number of lower-case characters in a string:
	    (string-fold (lambda (c count)
                           (if (char-set-contains? char-set:lower c)
                               (+ count 1)
                               count))
                         0
                         s)
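
    The two recursions above can be transcribed almost directly. What
    follows is a minimal sketch in portable R5RS Scheme, not the library's
    own code: the names MY-STRING-FOLD and MY-STRING-FOLD-RIGHT are
    hypothetical, and START/END are required here rather than optional.

        ;; Left fold: visit s[start] first, s[end-1] last.
        (define (my-string-fold kons knil s start end)
          (let lp ((i start) (acc knil))
            (if (>= i end)
                acc
                (lp (+ i 1) (kons (string-ref s i) acc)))))

        ;; Right fold: visit s[end-1] first, s[start] last.
        (define (my-string-fold-right kons knil s start end)
          (let lp ((i (- end 1)) (acc knil))
            (if (< i start)
                acc
                (lp (- i 1) (kons (string-ref s i) acc)))))

        (my-string-fold cons '() "abc" 0 3)        ; => (#\c #\b #\a)
        (my-string-fold-right cons '() "abc" 0 3)  ; => (#\a #\b #\c)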

string-unfold p f g seed -> string
    This is the fundamental constructor for strings.
    - G is used to generate a series of "seed" values from the initial seed:
	SEED, (G SEED), (G^2 SEED), (G^3 SEED), ...
    - P tells us when to stop -- when it returns true when applied to one
      of these seed values.
    - F maps each seed value to the corresponding character
      in the result string.

    More precisely, the following (simple, inefficient) definition holds:
    (define (string-unfold p f g seed)
      (if (p seed) ""
	  (string-append (string (f seed))
			 (string-unfold p f g (g seed)))))

    STRING-UNFOLD is a fairly powerful constructor -- you can use it to
    reverse a string, copy a string, convert a list to a string, read
    a port into a string, and so forth. Examples:
    (port->string p) = (string-unfold eof-object? values
                                      (lambda (x) (read-char p))
                                      (read-char p))

    (list->string lis) = (string-unfold null? car cdr lis)

    (string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f
                                              (lambda (i) (+ i 1)) 0)

    To map F over a list LIS, producing a string:
	(string-unfold null? (compose f car) cdr lis)
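
    As a further illustration, here is string reversal with an index as
    the seed. This is only a sketch: it repeats the simple definition
    above under the hypothetical name MY-STRING-UNFOLD so that it stands
    alone, and MY-STRING-REVERSE is likewise a made-up name.

        (define (my-string-unfold p f g seed)
          (if (p seed)
              ""
              (string-append (string (f seed))
                             (my-string-unfold p f g (g seed)))))

        ;; Seeds run len-1, len-2, ..., 0; stop once the index is negative.
        (define (my-string-reverse s)
          (my-string-unfold (lambda (i) (< i 0))
                            (lambda (i) (string-ref s i))
                            (lambda (i) (- i 1))
                            (- (string-length s) 1)))

        (my-string-reverse "hello")  ; => "olleh"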

string-tabulate proc len -> string
    PROC is an integer->char procedure. Construct a string of size LEN
    by applying PROC to each index to produce the corresponding string
    element. The order in which PROC is applied to the indices is not
    specified.

string-for-each  proc s [start end] -> unspecified
string-iter      proc s [start end] -> unspecified
    Apply PROC to each character in S.
    STRING-FOR-EACH has no specified iteration order.
    STRING-ITER is required to iterate from START to END
    in increasing order.

string-every? pred s [start end] -> boolean
string-any?   pred s [start end] -> value
    Note: no sequence order specified.
    Checks to see if predicate PRED is true of every / any character in S.
    STRING-ANY? is witness-generating -- it applies PRED to the elements
    of S, returning the first true value it finds, otherwise false.

string-compare    s1 s2 lt-proc eq-proc gt-proc -> values
string-compare-ci s1 s2 lt-proc eq-proc gt-proc -> values
   Apply LT-PROC, EQ-PROC, GT-PROC to the mismatch index, depending
   upon whether S1 is less than, equal to, or greater than S2.
   The "mismatch index" is the largest index i such that for
   every 0 <= j < i, s1[j] = s2[j] -- that is, I is the first
   position that doesn't match. If S1 = S2, the mismatch index
   is simply the length of the strings; we observe the protocol
   in this redundant case for uniformity.
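
   The protocol can be sketched in portable R5RS Scheme as follows. This
   is an illustration of the specification for whole strings, not the
   library's implementation; MY-MISMATCH-INDEX, MY-STRING-COMPARE and
   MY-STRING< are hypothetical names.

        ;; First index at which S1 and S2 differ, or the length of the
        ;; shorter string if one is a prefix of the other.
        (define (my-mismatch-index s1 s2)
          (let ((n (min (string-length s1) (string-length s2))))
            (let lp ((i 0))
              (if (and (< i n)
                       (char=? (string-ref s1 i) (string-ref s2 i)))
                  (lp (+ i 1))
                  i))))

        (define (my-string-compare s1 s2 lt-proc eq-proc gt-proc)
          (let ((i (my-mismatch-index s1 s2))
                (len1 (string-length s1))
                (len2 (string-length s2)))
            (cond ((and (= i len1) (= i len2)) (eq-proc i)) ; identical
                  ((= i len1) (lt-proc i))    ; S1 is a proper prefix of S2
                  ((= i len2) (gt-proc i))    ; S2 is a proper prefix of S1
                  ((char<? (string-ref s1 i) (string-ref s2 i)) (lt-proc i))
                  (else (gt-proc i)))))

        ;; A comparison in the style of the predicates documented below:
        ;; the mismatch index if the relation holds, otherwise #f.
        (define (my-string< s1 s2)
          (my-string-compare s1 s2
                             (lambda (i) i)
                             (lambda (i) #f)
                             (lambda (i) #f)))

        (my-string< "abc" "abd")                        ; => 2
        (my-string< "abd" "abc")                        ; => #f
        (my-string-compare "abc" "abc" list list list)  ; => (3)

   Because only #f counts as false in Scheme, a mismatch index of 0 is
   still a true value, so the result can be used directly in a
   conditional.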

substring-compare    s1 start1 end1 s2 start2 end2 lt-proc eq-proc gt-proc -> values
substring-compare-ci s1 start1 end1 s2 start2 end2 lt-proc eq-proc gt-proc -> values
    The continuation procedures are applied to S1's mismatch index (as defined
    above). In the case of EQ-PROC, this is always END1.

string=  s1 s2 -> #f or integer
string<> s1 s2 -> #f or integer
string<  s1 s2 -> #f or integer
string>  s1 s2 -> #f or integer
string<= s1 s2 -> #f or integer
string>= s1 s2 -> #f or integer
    If the comparison operation is true, the function returns the
    mismatch index (as defined for the previous comparator functions);
    otherwise it returns false.

string-ci=  s1 s2 -> #f or integer
string-ci<> s1 s2 -> #f or integer
string-ci<  s1 s2 -> #f or integer
string-ci>  s1 s2 -> #f or integer
string-ci<= s1 s2 -> #f or integer
string-ci>= s1 s2 -> #f or integer
    Case-insensitive variants.

substring=     s1 start1 end1 s2 start2 end2 -> #f or integer
substring<>    s1 start1 end1 s2 start2 end2 -> #f or integer
substring<     s1 start1 end1 s2 start2 end2 -> #f or integer
substring>     s1 start1 end1 s2 start2 end2 -> #f or integer
substring<=    s1 start1 end1 s2 start2 end2 -> #f or integer
substring>=    s1 start1 end1 s2 start2 end2 -> #f or integer

substring-ci=  s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<> s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<  s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci>  s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<= s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci>= s1 start1 end1 s2 start2 end2 -> #f or integer
    These variants restrict the comparison to the indicated
    substrings of S1 and S2.

string-upper-case? s [start end] -> boolean
string-lower-case? s [start end] -> boolean
    STRING-UPPER-CASE? returns true iff the string contains
    no lower-case characters. STRING-LOWER-CASE? returns true
    iff the string contains no upper-case characters.
    (string-upper-case? "") => #t
    (string-lower-case? "") => #t
    (string-upper-case? "FOOb") => #f
    (string-upper-case? "U.S.A.") => #t

capitalize-string  s [start end] -> string
capitalize-string! s [start end] -> unspecified
    Capitalize the string: upcase the first alphanumeric character,
    and downcase the rest of the string. CAPITALIZE-STRING returns
    a freshly allocated string.

    (capitalize-string "--capitalize tHIS sentence.") =>
      "--Capitalize this sentence."

    (capitalize-string "see Spot run. see Nix run.") =>
      "See spot run. see nix run."

    (capitalize-string "3com makes routers.") =>
      "3com makes routers."

capitalize-words  s [start end] -> string
capitalize-words! s [start end] -> unspecified
    A "word" is a maximal contiguous sequence of alphanumeric characters.
    Upcase the first character of every word; downcase the rest of the word.
    CAPITALIZE-WORDS returns a freshly allocated string.

    (capitalize-words "HELLO, 3THErE, my nAME IS olin") =>
	"Hello, 3there, My Name Is Olin"

    More sophisticated capitalisation procedures can be synthesized
    using CAPITALIZE-STRING and pattern matchers. In this context,
    the REGEXP-SUBSTITUTE/GLOBAL procedure may be useful for picking
    out the units to be capitalised and applying CAPITALIZE-STRING to
    their components.
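
    The word-capitalisation rule can be sketched in portable R5RS Scheme.
    This is an illustration of the rule as stated above, not the library's
    code; ALPHANUMERIC? and MY-CAPITALIZE-WORDS are hypothetical names,
    and the optional START/END arguments are left out.

        (define (alphanumeric? c)
          (or (char-alphabetic? c) (char-numeric? c)))

        (define (my-capitalize-words s)
          (let ((result (string-copy s)))     ; fresh string, S is untouched
            (let lp ((i 0) (in-word? #f))
              (if (< i (string-length result))
                  (let ((c (string-ref result i)))
                    (cond ((not (alphanumeric? c))   ; between words
                           (lp (+ i 1) #f))
                          (in-word?                  ; rest of a word
                           (string-set! result i (char-downcase c))
                           (lp (+ i 1) #t))
                          (else                      ; first char of a word
                           (string-set! result i (char-upcase c))
                           (lp (+ i 1) #t))))
                  result))))

        (my-capitalize-words "HELLO, 3THErE, my nAME IS olin")
        ;; => "Hello, 3there, My Name Is Olin"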

string-upcase    s [start end] -> string
string-upcase!   s [start end] -> unspecified
string-downcase  s [start end] -> string
string-downcase! s [start end] -> unspecified
    Raise or lower the case of the alphabetic characters in the string.
    STRING-UPCASE and STRING-DOWNCASE return freshly allocated strings.

string-take s nchars -> string
string-drop s nchars -> string
string-take-right s nchars -> string
string-drop-right s nchars -> string
    STRING-TAKE returns the first NCHARS of S;
    STRING-DROP returns all but the first NCHARS of S.
    STRING-TAKE-RIGHT returns the last NCHARS of S;
    STRING-DROP-RIGHT returns all but the last NCHARS of S.
    These generalise MIT Scheme's HEAD & TAIL functions.
    If these procedures produce the entire string, they may return either
    S or a copy of S; in some implementations, proper substrings may share
    memory with S.
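
    In terms of the R5RS SUBSTRING procedure, the four selectors behave
    as in the sketch below. The MY- names are hypothetical, and since
    R5RS SUBSTRING always copies, the sketch never shares storage with S.

        (define (my-string-take s n)
          (substring s 0 n))

        (define (my-string-drop s n)
          (substring s n (string-length s)))

        (define (my-string-take-right s n)
          (let ((len (string-length s)))
            (substring s (- len n) len)))

        (define (my-string-drop-right s n)
          (substring s 0 (- (string-length s) n)))

        (my-string-take "hello world" 5)        ; => "hello"
        (my-string-drop "hello world" 6)        ; => "world"
        (my-string-take-right "hello world" 5)  ; => "world"
        (my-string-drop-right "hello world" 6)  ; => "hello"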

string-pad       s k [char start end] -> string
string-pad-right s k [char start end] -> string
    Build a string of length K composed of S padded on the left (right)
    by as many occurrences of the character CHAR as needed. If S has more
    than K chars, it is truncated on the left (right) to length K. CHAR
    defaults to #\space.

    If K is exactly the length of S, these functions may return
    either S or a copy of S.
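
    A minimal sketch of the padding and truncation rule in portable R5RS
    Scheme. The MY- names are hypothetical, CHAR is a required argument
    here, and START/END are omitted. Note that truncating "on the left"
    means the rightmost K characters survive.

        (define (my-string-pad s k char)
          (let ((len (string-length s)))
            (if (>= len k)
                (substring s (- len k) len)        ; keep rightmost K chars
                (string-append (make-string (- k len) char) s))))

        (define (my-string-pad-right s k char)
          (let ((len (string-length s)))
            (if (>= len k)
                (substring s 0 k)                  ; keep leftmost K chars
                (string-append s (make-string (- k len) char)))))

        (my-string-pad "325" 5 #\space)        ; => "  325"
        (my-string-pad "8675309" 5 #\space)    ; => "75309"
        (my-string-pad-right "325" 5 #\space)  ; => "325  "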

string-trim       s [char/char-set/pred start end] -> string
string-trim-right s [char/char-set/pred start end] -> string
string-trim-both  s [char/char-set/pred start end] -> string
    Trim S by skipping over all characters on the left / on the right /
    on both sides that satisfy the second parameter CHAR/CHAR-SET/PRED:
	- If it is a character CHAR, characters equal to CHAR are trimmed.
        - If it is a char set CHAR-SET, characters contained in CHAR-SET
          are trimmed.
	- If it is a predicate PRED, it is a test predicate that is applied
	  to the characters in S; a character causing it to return true
	  is skipped.
    CHAR/CHAR-SET/PRED defaults to CHAR-SET:WHITESPACE.

    If no trimming occurs, these functions may return either S or a copy of S;
    in some implementations, proper substrings may share memory with S.

    (string-trim-both "  The outlook wasn't brilliant,  \n\r")
	=> "The outlook wasn't brilliant,"

string-filter s char/char-set/pred [start end] -> string
string-delete s char/char-set/pred [start end] -> string
    Filter the string S, retaining only those characters that
    satisfy / do not satisfy the CHAR/CHAR-SET/PRED argument. If
    this argument is a procedure, it is applied to the character
    as a predicate; if it is a char-set, the character is tested
    for membership; if it is a character, it is used in an equality test.

    If the string is unaltered by the filtering operation, these
    functions may return either S or a copy of S.

string-index       s char/char-set/pred [start end] -> integer or #f
string-index-right s char/char-set/pred [end start] -> integer or #f
string-skip        s char/char-set/pred [start end] -> integer or #f
string-skip-right  s char/char-set/pred [end start] -> integer or #f
    Note the inverted start/end ordering of index-right and skip-right's
    parameters.

    Index (index-right) searches through the string from the left (right),
    returning the index of the first occurrence of a character which
	- equals CHAR/CHAR-SET/PRED (if it is a character);
	- is in CHAR/CHAR-SET/PRED (if it is a char-set);
	- satisfies the predicate CHAR/CHAR-SET/PRED (if it is a procedure).
    If no match is found, the functions return false.

    The skip functions are similar, but use the complement of the criteria:
    they search for the first char that *doesn't* satisfy the test. E.g.,
    to skip over initial whitespace, say
        (cond ((string-skip s char-set:whitespace) =>
               (lambda (i)
                 ;; (string-ref s i) is not whitespace.
		 ...)))
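
    The left-to-right search and its complement can be sketched in
    portable R5RS Scheme as below. This is not the library's code: the
    names are hypothetical, START/END are required, and char-set criteria
    are left out because character sets are not part of R5RS.

        ;; Turn a character or predicate criterion into a predicate.
        (define (criterion->pred crit)
          (if (char? crit)
              (lambda (c) (char=? c crit))
              crit))

        ;; Index of the first char in [start,end) satisfying CRIT, or #f.
        (define (my-string-index s crit start end)
          (let ((pred (criterion->pred crit)))
            (let lp ((i start))
              (cond ((>= i end) #f)
                    ((pred (string-ref s i)) i)
                    (else (lp (+ i 1)))))))

        ;; Same search with the criterion complemented.
        (define (my-string-skip s crit start end)
          (let ((pred (criterion->pred crit)))
            (my-string-index s (lambda (c) (not (pred c))) start end)))

        (my-string-index "hello world" #\o 0 11)             ; => 4
        (my-string-skip "   indented" char-whitespace? 0 11) ; => 3
        (my-string-index "hello" #\z 0 5)                    ; => #f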

string-prefix-count    s1 s2 -> integer
string-suffix-count    s1 s2 -> integer
string-prefix-count-ci s1 s2 -> integer
string-suffix-count-ci s1 s2 -> integer
    Return the length of the longest common prefix/suffix of the two strings.
    For the prefix procedures, this is the same as the "mismatch index"
    of the two strings.

substring-prefix-count    s1 start1 end1 s2 start2 end2 -> integer
substring-suffix-count    s1 start1 end1 s2 start2 end2 -> integer
substring-prefix-count-ci s1 start1 end1 s2 start2 end2 -> integer
substring-suffix-count-ci s1 start1 end1 s2 start2 end2 -> integer
    Substring variants.

string-prefix?    s1 s2 -> boolean
string-suffix?    s1 s2 -> boolean
string-prefix-ci? s1 s2 -> boolean
string-suffix-ci? s1 s2 -> boolean
    Is S1 a prefix/suffix of S2?

substring-prefix?    s1 start1 end1 s2 start2 end2 -> boolean
substring-suffix?    s1 start1 end1 s2 start2 end2 -> boolean
substring-prefix-ci? s1 start1 end1 s2 start2 end2 -> boolean
substring-suffix-ci? s1 start1 end1 s2 start2 end2 -> boolean
    Substring variants.

substring?    s1 s2 [start end] -> integer or false
substring-ci? s1 s2 [start end] -> integer or false
    Return the index in S2 where S1 occurs as a substring, or false.
    The returned index is in the range [start,end).
    The current implementation uses the Knuth-Morris-Pratt algorithm.

string-fill! s char [start end] -> unspecified
    Store CHAR into the elements of S.
    This is the R4RS procedure extended to have optional START/END parameters.

string-copy! target tstart s [start end] -> unspecified
    Copy the sequence of characters from index range [START,END) in
    string S to string TARGET, beginning at index TSTART. The characters
    are copied left-to-right or right-to-left as needed -- the copy is
    guaranteed to work, even if TARGET and S are the same string.

substring   s start [end] -> string
string-copy s [start end] -> string
    These R4RS procedures are extended to have optional START/END parameters.
    Use STRING-COPY when you want to indicate explicitly in your code that you
    wish to allocate new storage; use SUBSTRING when you don't care if you
    get a fresh copy or share storage with the original string.
    E.g.:
	(string-copy "Beta substitution") => "Beta substitution"
	(string-copy "Beta substitution" 1 10)
	    => "eta subst"
	(string-copy "Beta substitution" 5) => "substitution"

    SUBSTRING may return a value that shares memory with S.

string-reverse  s [start end] -> string
string-reverse! s [start end] -> unspecific
    Reverse the string.

reverse-list->string char-list -> string
    An efficient implementation of (compose list->string reverse):
	(reverse-list->string '(#\a #\B #\c)) => "cBa"
    This is a common idiom in the epilog of string-processing loops
    that accumulate an answer in a reverse-order list.
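
    The idiom looks like the following sketch, written in portable R5RS
    Scheme. MY-REVERSE-LIST->STRING is a hypothetical stand-in built from
    LIST->STRING and REVERSE, and LETTERS-ONLY is a made-up example
    procedure.

        (define (my-reverse-list->string chars)
          (list->string (reverse chars)))

        ;; Keep only the alphabetic characters of S. The loop conses the
        ;; kept characters up in reverse order and converts once at the end.
        (define (letters-only s)
          (let lp ((i 0) (acc '()))
            (cond ((= i (string-length s))
                   (my-reverse-list->string acc))
                  ((char-alphabetic? (string-ref s i))
                   (lp (+ i 1) (cons (string-ref s i) acc)))
                  (else (lp (+ i 1) acc)))))

        (letters-only "a1b2c3")  ; => "abc"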

string-concat string-list -> string
    Append the elements of STRING-LIST together into a single string.
    Guaranteed to return a freshly allocated string. Appears sufficiently
    often as to warrant being named.

string-concat/shared string-list -> string
string-append/shared s ... -> string
    These two procedures are variants of STRING-CONCAT and STRING-APPEND
    that are permitted to return results that share storage with their
    parameters. In particular, if STRING-APPEND/SHARED is applied to just
    one argument, it may return exactly that argument, whereas STRING-APPEND
    is required to allocate a fresh string.

string->list s [start end] -> char-list
    The R5RS STRING->LIST procedure is extended to take optional START/END
    arguments.

string-null? s -> bool
    Is S the empty string?

xsubstring s from [to start end] -> string
    This is the "extended substring" procedure that implements replicated
    copying of a substring of some string.

    S is a string; START and END are optional arguments that demarcate
    a substring of S, defaulting to 0 and the length of S (i.e., the whole
    string). Replicate this substring up and down index space, in both the
    positive and negative directions. For example, if S = "abcdefg", START=3,
    and END=6, then we have the conceptual bidirectionally-infinite string
	...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f ...
	... -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 ...
    XSUBSTRING returns the substring of this string beginning at index FROM,
    and ending at TO (which defaults to FROM+(END-START)).

    You can use XSUBSTRING to perform a variety of tasks:
    - To rotate a string left:  (xsubstring "abcdef" 2)  => "cdefab"
    - To rotate a string right: (xsubstring "abcdef" -2) => "efabcd"
    - To replicate a string:    (xsubstring "abc" 0 7) => "abcabca"

    Note that
      - The FROM/TO indices give a half-open range -- the characters from
	index FROM up to, but not including, index TO.
      - The FROM/TO indices are not in terms of the index space for string S.
	They are in terms of the replicated index space of the substring
	defined by S, START, and END.

    It is an error if START=END -- although this is allowed by special
    dispensation when FROM=TO.
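
    The replicated index space amounts to reducing each index modulo the
    substring length. The sketch below is in portable R5RS Scheme, with
    all five arguments required and a hypothetical name; it is not the
    library's implementation.

        (define (my-xsubstring s from to start end)
          (let ((slen (- end start))
                (result (make-string (- to from))))
            (let lp ((i from) (j 0))
              (if (< i to)
                  (begin
                    ;; MODULO takes the sign of the divisor, so negative
                    ;; indices wrap around to the end of the substring.
                    (string-set! result j
                                 (string-ref s (+ start (modulo i slen))))
                    (lp (+ i 1) (+ j 1)))
                  result))))

        (my-xsubstring "abcdef" 2 8 0 6)    ; => "cdefab"
        (my-xsubstring "abcdef" -2 4 0 6)   ; => "efabcd"
        (my-xsubstring "abc" 0 7 0 3)       ; => "abcabca"
        (my-xsubstring "abcdefg" -1 4 3 6)  ; => "fdefd"

    When FROM=TO the loop body never runs, so an empty substring
    (START=END) causes no division by zero in that one case.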

string-xcopy! target tstart s sfrom [sto start end] -> unspecific
    Exactly the same as XSUBSTRING, but the extracted text is written
    into the string TARGET starting at index TSTART.
    This operation is not defined if (EQ? TARGET S) -- you cannot copy
    a string on top of itself.


* Lower-level procedures

The following procedures are useful for writing other string-processing
functions, and are contained in the string-lib-internals package.

parse-start+end proc s args -> [start end rest]
parse-final-start+end proc s args -> [start end]
    PARSE-START+END may be used to parse a pair of optional START/END arguments
    from an argument list, defaulting them to 0 and the length of some string
    S, respectively. Let the length of string S be SLEN.
    - If ARGS = (), the function returns (values 0 slen '())
    - If ARGS = (i), I is checked to ensure it is an integer, and
      that 0 <= i <= slen. Returns (values i slen (cdr args)).
    - If ARGS = (i j ...), I and J are checked to ensure they are
      integers, and that 0 <= i <= j <= slen. Returns (values i j (cddr args)).
    If any of the checks fail, an error condition is raised, and PROC is used
    as part of the error condition -- it should be the name of the client
    procedure whose argument list PARSE-START+END is parsing.

    PARSE-FINAL-START+END is exactly the same, except that the args list
    passed to it is required to be of length two or less; if it is longer,
    an error condition is raised. It may be used when the optional START/END
    parameters are final arguments to the procedure.
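
    A sketch of the parsing rules, under a hypothetical name. It assumes
    an ERROR procedure such as the one provided by Scheme 48 and scsh
    (it is not part of R5RS), and the exactness check is an assumption
    borrowed from CHECK-SUBSTRING-SPEC rather than something stated above.

        (define (my-parse-start+end proc s args)
          (let ((slen (string-length s)))
            ;; I must be an exact integer with lo <= i <= slen.
            (define (check-index i lo)
              (if (not (and (integer? i) (exact? i) (<= lo i slen)))
                  (error "Illegal substring index" proc i)))
            (cond ((null? args)
                   (values 0 slen '()))
                  ((null? (cdr args))
                   (check-index (car args) 0)
                   (values (car args) slen '()))
                  (else
                   (check-index (car args) 0)
                   (check-index (cadr args) (car args))
                   (values (car args) (cadr args) (cddr args))))))

        (call-with-values
            (lambda () (my-parse-start+end 'demo "hello" '(1 3 extra)))
          list)
        ;; => (1 3 (extra))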

check-substring-spec proc s start end -> unspecific
    Check values START and END to ensure they specify a valid substring
    in S. This means that START and END are exact integers, and
	0 <= START <= END <= (STRING-LENGTH S)
    If this is not the case, an error condition is raised. PROC is used
    as part of the error condition, and should be the procedure whose
    START/END parameters we are checking.

make-kmp-restart-vector s c= -> vector
    Build the Knuth-Morris-Pratt "restart vector," which is useful
    for quickly searching character sequences for the occurrence of
    string S. C= is a character-equality function used to construct
    the restart vector; it is usefully CHAR=? or CHAR-CI=?.

    The definition of the restart vector RV for string S is:
    If we have matched chars 0..i-1 of S against some search string SS, and
    S[i] doesn't match SS[k], then reset i := RV[i], and try again to
    match SS[k].  If RV[i] = -1, then punt SS[k] completely, and move on to
    SS[k+1] and S[0].

    In other words, if you have matched the first i chars of S, but
    the i+1'th char doesn't match, RV[i] tells you what the next-longest
    prefix of S is that you have matched.

    The following string-search function shows how a restart vector
    is used to search. It can be easily adapted to search other character
    sequences (such as ports).

    (define (find-substring pattern source start end)
      (let ((plen (string-length pattern))
	    (rv (make-kmp-restart-vector pattern char=?)))

	;; The search loop. SJ & PJ are redundant state.
	(let lp ((si start) (pi 0)
		 (sj (- end start))	; (- end si)  -- how many chars left.
		 (pj plen))		; (- plen pi) -- how many chars left.

	  (if (= pi plen) (- si plen)			; Win.

	      (and (<= pj sj)				; Lose.

		   (if (char=? (string-ref source si)		; Search.
			       (string-ref pattern pi))
		       (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1))	; Advance.

		       (let ((pi (vector-ref rv pi)))		; Retreat.
			 (if (= pi -1)
			     (lp (+ si 1)  0   (- sj 1)  plen)	; Punt.
			     (lp si        pi  sj        (- plen pi))))))))))
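
    For reference, one way to build a vector with the properties described
    above is the textbook KMP failure-function construction, sketched here
    in portable R5RS Scheme. MY-KMP-RESTART-VECTOR is a hypothetical name,
    and the library's MAKE-KMP-RESTART-VECTOR may compute different (for
    instance, further optimised) entries.

        (define (my-kmp-restart-vector s c=)
          (let* ((len (string-length s))
                 (rv (make-vector len -1)))      ; RV[0] stays -1
            (let lp ((i 0))
              (if (< (+ i 1) len)
                  ;; K is a candidate prefix length that could be extended
                  ;; by S[i]; fall back through RV until one fits.
                  (let inner ((k (vector-ref rv i)))
                    (cond ((and (>= k 0)
                                (not (c= (string-ref s i) (string-ref s k))))
                           (inner (vector-ref rv k)))
                          (else
                           (vector-set! rv (+ i 1) (+ k 1))
                           (lp (+ i 1)))))
                  rv))))

        (my-kmp-restart-vector "aab" char=?)   ; => #(-1 0 1)
        (my-kmp-restart-vector "abab" char=?)  ; => #(-1 0 0 1)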