scsh-0.5/scsh/lib/strings.txt

Todo:
parse-start+end parse-final-start+end need "string" in the name
Also, export macro binder.
What's up w/quotient? (quotient -1 3) = 0.
regexp-foldl
type regexp interface
land*
Let-optional:
A let-optional that parses a prefix of the args.
Arg checking forms that get used if it parses, but are not
applied to the default.
The Scheme Underground string library includes a rich set of operations
for manipulating strings. These are frequently useful for scripting and
other text-manipulation applications.
The library's design was influenced by the string libraries found in MIT
Scheme, Gambit, RScheme, MzScheme, slib, Common Lisp, Bigloo, guile, APL and
the SML standard basis. Some of the code bears a distant family relation to
the MIT Scheme implementation, and being derived from that code, is covered by
the MIT Scheme copyright (which is a fairly generic "free" copyright -- see
the source file for details). The fast KMP string-search code used in
SUBSTRING? was loosely adapted from old slib code by Stephen Bevan.
The library has the following design principles:
- *All* procedures involving character comparison are available in
both case-sensitive and case-insensitive forms.
- *All* functionality is available in substring and full-string forms.
- The procedures are spec'd so as to permit efficient implementation in a
Scheme that provides shared-text substrings (e.g., guile). This means that
you should not rely on many of the substring-selecting procedures to return
freshly-allocated strings. Careful attention is paid to the issue of which
procedures allocate fresh storage, and which are permitted to return results
that share storage with the arguments.
- Common Lisp theft:
+ inequality functions return mismatch index.
I generalised this so that this "protocol" is extended even to
the equality functions. This means that clients can be handed any generic
string-comparison function and rely on the meaning of the true value.
+ Common Lisp capitalisation definition
The library addresses some problems with the R5RS string procedures:
- Question marks after string-comparison functions (string=?, etc.)
This is inconsistent with numeric comparison functions, and ugly, too.
- String-comparison functions do not provide useful true value.
- STRING-COPY should have optional start/end args;
SUBSTRING shouldn't specify if it copies or returns shared bits.
- STRING-FILL! and STRING->LIST should take optional start/end args.
- No <> function provided.
In the following procedure specifications:
- Any S parameter is a string;
- START and END parameters are half-open string indices specifying
a substring within a string parameter; when optional, they default
to 0 and the length of the string, respectively. When specified, it
must be the case that 0 <= START <= END <= (string-length S), for
the corresponding parameter S. They typically restrict a procedure's
action to the indicated substring.
- A CHAR/CHAR-SET/PRED parameter is a value used to select/search
for a character in a string. If it is a character, it is used in
an equality test; if it is a character set, it is used as a
membership test; if it is a procedure, it is applied to the
characters as a test predicate.
This library contains a large number of procedures, but they follow
a consistent naming scheme. The names are composed of smaller lexemes
in a regular way that exposes the structure and relationships between the
procedures. This should help the programmer to recall or reconstitute the name
of the particular procedure that he needs when writing his own code. In
particular
- Procedures whose names end in "-ci" are case-insensitive variants.
- Procedures whose names end in "!" are side-effecting variants.
These procedures generally return an unspecified value.
- The order of common parameters is fairly consistent across the
different procedures.
For more text-manipulation functionality, see also the regular expression,
file-name, character set, and character->character partial map packages.
-------------------------------------------------------------------------------
* R4RS/R5RS procedures
The R4RS and R5RS reports define 22 string procedures. The string-lib
package includes 8 of these exactly as defined, 4 in an extended,
backwards-compatible way, and drops the remaining 10 (whose functionality
is available via other bindings).
The 8 procedures provided exactly as documented in the reports are
string?
make-string
string
string-length
string-ref
string-set!
string-append
list->string
The ten functions not included are the R4RS string-comparison functions:
string=? string-ci=?
string<? string-ci<?
string>? string-ci>?
string<=? string-ci<=?
string>=? string-ci>=?
The string-lib package provides alternate bindings.
Additionally, the four extended procedures are
string-fill! s char [start end] -> unspecific
string->list s [start end] -> char-list
substring s start [end] -> string
string-copy s [start end] -> string
These procedures are documented in the following section. In brief, they are
extended to take optional start/end parameters specifying substring ranges.
Additionally, SUBSTRING is allowed to return a value that shares storage with
its argument.
* Procedures
These procedures are contained in the Scheme 48 package "string-lib",
which is open in the default user package. They are not found in the
"scsh" package; script writers and other programmers that use the Scheme
48 module system must open string-lib explicitly.
string-map proc s [start end] -> string
string-map! proc s [start end] -> unspecified
PROC is a char->char procedure; it is mapped over S.
Note: no sequence order is specified.
string-fold kons knil s [start end] -> value
string-fold-right kons knil s [start end] -> value
These are the fundamental iterators for strings.
The left-fold operator maps the KONS procedure across the
string from left to right
(... (kons s[2] (kons s[1] (kons s[0] knil))))
In other words, string-fold obeys the recursion
(string-fold kons knil s start end) =
(string-fold kons (kons s[start] knil) s start+1 end)
The right-fold operator maps the KONS procedure across the
string from right to left
(kons s[0] (... (kons s[end-3] (kons s[end-2] (kons s[end-1] knil)))))
obeying the recursion
(string-fold-right kons knil s start end) =
(string-fold-right kons (kons s[end-1] knil) s start end-1)
Examples:
To convert a string to a list of chars:
(string-fold-right cons '() s)
To count the number of lower-case characters in a string:
(string-fold (lambda (c count)
               (if (char-set-contains? char-set:lower c)
                   (+ count 1)
                   count))
             0
             s)
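The two recursions above can be sketched in portable Scheme. This is only
an illustration, not the library's code: MY-STRING-FOLD and
MY-STRING-FOLD-RIGHT are hypothetical names, and START/END are required
here rather than optional.

```scheme
;; Hypothetical sketch of the two folds; not the string-lib definitions.
;; START and END are mandatory in this simplified version.
(define (my-string-fold kons knil s start end)
  (let lp ((i start) (acc knil))
    (if (< i end)
        (lp (+ i 1) (kons (string-ref s i) acc))   ; consume s[i], move right
        acc)))

(define (my-string-fold-right kons knil s start end)
  (let lp ((i (- end 1)) (acc knil))
    (if (>= i start)
        (lp (- i 1) (kons (string-ref s i) acc))   ; consume s[i], move left
        acc)))

;; The left fold conses the characters up in reverse order;
;; the right fold preserves their order.
(my-string-fold cons '() "abc" 0 3)         ; => (#\c #\b #\a)
(my-string-fold-right cons '() "abc" 0 3)   ; => (#\a #\b #\c)
```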
string-unfold p f g seed -> string
This is the fundamental constructor for strings.
- G is used to generate a series of "seed" values from the initial seed:
SEED, (G SEED), (G^2 SEED), (G^3 SEED), ...
- P tells us when to stop -- when it returns true when applied to one
of these seed values.
- F maps each seed value to the corresponding character
in the result string.
More precisely, the following (simple, inefficient) definition holds:
(define (string-unfold p f g seed)
  (if (p seed) ""
      (string-append (string (f seed))
                     (string-unfold p f g (g seed)))))
STRING-UNFOLD is a fairly powerful constructor -- you can use it to
reverse a string, copy a string, convert a list to a string, read
a port into a string, and so forth. Examples:
(port->string p) = (string-unfold eof-object? values
                                  (lambda (x) (read-char p))
                                  (read-char p))
(list->string lis) = (string-unfold null? car cdr lis)
(string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f add1 0)
To map F over a list LIS, producing a string:
(string-unfold null? (compose f car) cdr lis)
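As a further illustration, here is a sketch of string reversal expressed
as an unfold. It repeats the simple, inefficient definition given above
under the hypothetical name MY-STRING-UNFOLD so that it runs in any R5RS
Scheme; REVERSE-VIA-UNFOLD is likewise an invented name.

```scheme
;; The simple definition from the text, under a hypothetical name.
(define (my-string-unfold p f g seed)
  (if (p seed)
      ""
      (string-append (string (f seed))
                     (my-string-unfold p f g (g seed)))))

;; The seed is an index that walks from the last character down to 0;
;; the unfold stops once the index goes negative.
(define (reverse-via-unfold s)
  (my-string-unfold negative?
                    (lambda (i) (string-ref s i))
                    (lambda (i) (- i 1))
                    (- (string-length s) 1)))

(reverse-via-unfold "abc")   ; => "cba"
(reverse-via-unfold "")      ; => ""
```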
string-tabulate proc len -> string
PROC is an integer->char procedure. Construct a string of size LEN
by applying PROC to each index to produce the corresponding string
element. The order in which PROC is applied to the indices is not
specified.
string-for-each proc s [start end] -> unspecified
string-iter proc s [start end] -> unspecified
Apply PROC to each character in S.
STRING-FOR-EACH has no specified iteration order.
STRING-ITER is required to iterate from START to END
in increasing order.
string-every? pred s [start end] -> boolean
string-any? pred s [start end] -> value
Note: no sequence order specified.
Checks to see if predicate PRED is true of every / any character in S.
STRING-ANY? is witness-generating -- it applies PRED to the elements
of S, returning the first true value it finds, otherwise false.
string-compare s1 s2 lt-proc eq-proc gt-proc -> values
string-compare-ci s1 s2 lt-proc eq-proc gt-proc -> values
Apply LT-PROC, EQ-PROC, GT-PROC to the mismatch index, depending
upon whether S1 is less than, equal to, or greater than S2.
The "mismatch index" is the largest index i such that for
every 0 <= j < i, s1[j] = s2[j] -- that is, I is the first
position that doesn't match. If S1 = S2, the mismatch index
is simply the length of the strings; we observe the protocol
in this redundant case for uniformity.
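A minimal sketch of the protocol, assuming plain CHAR=? / CHAR<? ordering.
MISMATCH-INDEX and MY-STRING-COMPARE are hypothetical names used only for
illustration; the library's own implementation may differ.

```scheme
;; Hypothetical sketch; not the string-lib implementation.
;; Index of the first position at which S1 and S2 differ.
(define (mismatch-index s1 s2)
  (let ((n (min (string-length s1) (string-length s2))))
    (let lp ((i 0))
      (if (and (< i n)
               (char=? (string-ref s1 i) (string-ref s2 i)))
          (lp (+ i 1))
          i))))

(define (my-string-compare s1 s2 lt-proc eq-proc gt-proc)
  (let ((i (mismatch-index s1 s2))
        (len1 (string-length s1))
        (len2 (string-length s2)))
    (cond ((and (= i len1) (= i len2)) (eq-proc i))  ; strings are equal
          ((= i len1) (lt-proc i))        ; S1 is a proper prefix of S2
          ((= i len2) (gt-proc i))        ; S2 is a proper prefix of S1
          ((char<? (string-ref s1 i) (string-ref s2 i)) (lt-proc i))
          (else (gt-proc i)))))

(my-string-compare "apple" "apply"
                   (lambda (i) (list 'less i))
                   (lambda (i) (list 'equal i))
                   (lambda (i) (list 'greater i)))   ; => (less 4)
```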
substring-compare s1 start1 end1 s2 start2 end2 lt-proc eq-proc gt-proc -> values
substring-compare-ci s1 start1 end1 s2 start2 end2 lt-proc eq-proc gt-proc -> values
The continuation procedures are applied to S1's mismatch index (as defined
above). In the case of EQ-PROC, this is always END1.
string= s1 s2 -> #f or integer
string<> s1 s2 -> #f or integer
string< s1 s2 -> #f or integer
string> s1 s2 -> #f or integer
string<= s1 s2 -> #f or integer
string>= s1 s2 -> #f or integer
If the comparison operation is true, the function returns the
mismatch index (as defined for the previous comparator functions);
otherwise it returns #f.
string-ci= s1 s2 -> #f or integer
string-ci<> s1 s2 -> #f or integer
string-ci< s1 s2 -> #f or integer
string-ci> s1 s2 -> #f or integer
string-ci<= s1 s2 -> #f or integer
string-ci>= s1 s2 -> #f or integer
Case-insensitive variants.
substring= s1 start1 end1 s2 start2 end2 -> #f or integer
substring<> s1 start1 end1 s2 start2 end2 -> #f or integer
substring< s1 start1 end1 s2 start2 end2 -> #f or integer
substring> s1 start1 end1 s2 start2 end2 -> #f or integer
substring<= s1 start1 end1 s2 start2 end2 -> #f or integer
substring>= s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci= s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<> s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci< s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci> s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<= s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci>= s1 start1 end1 s2 start2 end2 -> #f or integer
These variants restrict the comparison to the indicated
substrings of S1 and S2.
string-upper-case? s [start end] -> boolean
string-lower-case? s [start end] -> boolean
STRING-UPPER-CASE? returns true iff the string contains
no lower-case characters. STRING-LOWER-CASE? returns true
iff the string contains no upper-case characters.
(string-upper-case? "") => #t
(string-lower-case? "") => #t
(string-upper-case? "FOOb") => #f
(string-upper-case? "U.S.A.") => #t
capitalize-string s [start end] -> string
capitalize-string! s [start end] -> unspecified
Capitalize the string: upcase the first alphanumeric character,
and downcase the rest of the string. CAPITALIZE-STRING returns
a freshly allocated string.
(capitalize-string "--capitalize tHIS sentence.") =>
"--Capitalize this sentence."
(capitalize-string "see Spot run. see Nix run.") =>
"See spot run. see nix run."
(capitalize-string "3com makes routers.") =>
"3com makes routers."
capitalize-words s [start end] -> string
capitalize-words! s [start end] -> unspecified
A "word" is a maximal contiguous sequence of alphanumeric characters.
Upcase the first character of every word; downcase the rest of the word.
CAPITALIZE-WORDS returns a freshly allocated string.
(capitalize-words "HELLO, 3THErE, my nAME IS olin") =>
"Hello, 3there, My Name Is Olin"
More sophisticated capitalisation procedures can be synthesized
using CAPITALIZE-STRING and pattern matchers. In this context,
the REGEXP-SUBSTITUTE/GLOBAL procedure may be useful for picking
out the units to be capitalised and applying CAPITALIZE-STRING to
their components.
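The word-capitalisation rule can be sketched as a single left-to-right
pass that remembers whether the previous character was alphanumeric.
MY-CAPITALIZE-WORDS is a hypothetical name; this sketch handles only whole
strings and is not the library's code.

```scheme
;; Hypothetical sketch of the CAPITALIZE-WORDS rule on a whole string.
(define (my-capitalize-words s)
  (let* ((len (string-length s))
         (result (make-string len)))          ; freshly allocated result
    (let lp ((i 0) (in-word? #f))
      (if (= i len)
          result
          (let* ((c (string-ref s i))
                 (alnum? (or (char-alphabetic? c) (char-numeric? c))))
            (string-set! result i
                         (cond ((not alnum?) c)               ; separator
                               (in-word? (char-downcase c))   ; rest of word
                               (else (char-upcase c))))       ; first char
            (lp (+ i 1) alnum?))))))

(my-capitalize-words "HELLO, 3THErE, my nAME IS olin")
;; => "Hello, 3there, My Name Is Olin"
```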
string-upcase s [start end] -> string
string-upcase! s [start end] -> unspecified
string-downcase s [start end] -> string
string-downcase! s [start end] -> unspecified
Raise or lower the case of the alphabetic characters in the string.
STRING-UPCASE and STRING-DOWNCASE return freshly allocated strings.
string-take s nchars -> string
string-drop s nchars -> string
string-take-right s nchars -> string
string-drop-right s nchars -> string
STRING-TAKE returns the first NCHARS of S;
STRING-DROP returns all but the first NCHARS of S.
STRING-TAKE-RIGHT returns the last NCHARS of S;
STRING-DROP-RIGHT returns all but the last NCHARS of S.
These generalise MIT Scheme's HEAD & TAIL functions.
If these procedures produce the entire string, they may return either
S or a copy of S; in some implementations, proper substrings may share
memory with S.
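In terms of the R5RS SUBSTRING procedure, the four selectors behave like
the following sketch. The MY- names are hypothetical, and unlike the
library procedures these always return fresh strings.

```scheme
;; Hypothetical sketch in terms of R5RS SUBSTRING.
(define (my-string-take s n)
  (substring s 0 n))
(define (my-string-drop s n)
  (substring s n (string-length s)))
(define (my-string-take-right s n)
  (let ((len (string-length s)))
    (substring s (- len n) len)))
(define (my-string-drop-right s n)
  (substring s 0 (- (string-length s) n)))

(my-string-take "hello world" 5)         ; => "hello"
(my-string-drop "hello world" 6)         ; => "world"
(my-string-take-right "hello world" 5)   ; => "world"
(my-string-drop-right "hello world" 6)   ; => "hello"
```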
string-pad s k [char start end] -> string
string-pad-right s k [char start end] -> string
Build a string of length K comprised of S padded on the left (right)
by as many occurrences of the character CHAR as needed. If S has more
than K chars, it is truncated on the left (right) to length K. CHAR
defaults to #\space.
If K is exactly the length of S, these functions may return
either S or a copy of S.
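A sketch of the left-padding case, useful for right-justifying text in a
fixed-width column. MY-STRING-PAD is a hypothetical name; CHAR is required
here and the optional START/END arguments are omitted.

```scheme
;; Hypothetical sketch of STRING-PAD (pad or truncate on the left).
(define (my-string-pad s k char)
  (let ((len (string-length s)))
    (if (>= len k)
        (substring s (- len k) len)           ; keep the rightmost K chars
        (string-append (make-string (- k len) char) s))))

(my-string-pad "42" 5 #\0)           ; => "00042"
(my-string-pad "hello" 3 #\space)    ; => "llo"
```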
string-trim s [char/char-set/pred start end] -> string
string-trim-right s [char/char-set/pred start end] -> string
string-trim-both s [char/char-set/pred start end] -> string
Trim S by skipping over all characters on the left / on the right /
on both sides that satisfy the second parameter CHAR/CHAR-SET/PRED:
- If it is a character CHAR, characters equal to CHAR are trimmed.
- If it is a char set CHAR-SET, characters contained in CHAR-SET
are trimmed.
- If it is a predicate PRED, it is a test predicate that is applied
to the characters in S; a character causing it to return true
is skipped.
CHAR/CHAR-SET/PRED defaults to CHAR-SET:WHITESPACE.
If no trimming occurs, these functions may return either S or a copy of S;
in some implementations, proper substrings may share memory with S.
(string-trim-both " The outlook wasn't brilliant, \n\r")
=> "The outlook wasn't brilliant,"
string-filter s char/char-set/pred [start end] -> string
string-delete s char/char-set/pred [start end] -> string
Filter the string S, retaining only those characters that
satisfy / do not satisfy the CHAR/CHAR-SET/PRED argument. If
this argument is a procedure, it is applied to the character
as a predicate; if it is a char-set, the character is tested
for membership; if it is a character, it is used in an equality test.
If the string is unaltered by the filtering operation, these
functions may return either S or a copy of S.
string-index s char/char-set/pred [start end] -> integer or #f
string-index-right s char/char-set/pred [end start] -> integer or #f
string-skip s char/char-set/pred [start end] -> integer or #f
string-skip-right s char/char-set/pred [end start] -> integer or #f
Note the inverted start/end ordering of index-right and skip-right's
parameters.
Index (index-right) searches through the string from the left (right),
returning the index of the first occurrence of a character which
- equals CHAR/CHAR-SET/PRED (if it is a character);
- is in CHAR/CHAR-SET/PRED (if it is a char-set);
- satisfies the predicate CHAR/CHAR-SET/PRED (if it is a procedure).
If no match is found, the functions return false.
The skip functions are similar, but use the complement of the criteria:
they search for the first char that *doesn't* satisfy the test. E.g.,
to skip over initial whitespace, say
(cond ((string-skip s char-set:whitespace) =>
       (lambda (i)
         ;; (string-ref s i) is not whitespace.
         ...)))
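The relationship between the index and skip searches can be sketched as
follows. The names are hypothetical; only character and predicate criteria
are handled, since char-set criteria need the character-set package, and
the optional START/END arguments are omitted.

```scheme
;; Hypothetical sketch; char-set criteria are not handled here.
;; Turn a CHAR or PRED criterion into a predicate.
(define (criterion->pred criterion)
  (if (char? criterion)
      (lambda (ch) (char=? ch criterion))
      criterion))

(define (my-string-index s criterion)
  (let ((pred (criterion->pred criterion))
        (len (string-length s)))
    (let lp ((i 0))
      (cond ((= i len) #f)                    ; no match
            ((pred (string-ref s i)) i)       ; first match
            (else (lp (+ i 1)))))))

;; Skipping is indexing with the complemented test.
(define (my-string-skip s criterion)
  (let ((pred (criterion->pred criterion)))
    (my-string-index s (lambda (ch) (not (pred ch))))))

(my-string-index "foo bar" #\space)             ; => 3
(my-string-skip "   indented" char-whitespace?) ; => 3
(my-string-index "foobar" #\z)                  ; => #f
```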
string-prefix-count s1 s2 -> integer
string-suffix-count s1 s2 -> integer
string-prefix-count-ci s1 s2 -> integer
string-suffix-count-ci s1 s2 -> integer
Return the length of the longest common prefix/suffix of the two strings.
This is equivalent to the "mismatch index" for the strings.
substring-prefix-count s1 start1 end1 s2 start2 end2 -> integer
substring-suffix-count s1 start1 end1 s2 start2 end2 -> integer
substring-prefix-count-ci s1 start1 end1 s2 start2 end2 -> integer
substring-suffix-count-ci s1 start1 end1 s2 start2 end2 -> integer
Substring variants.
string-prefix? s1 s2 -> boolean
string-suffix? s1 s2 -> boolean
string-prefix-ci? s1 s2 -> boolean
string-suffix-ci? s1 s2 -> boolean
Is S1 a prefix/suffix of S2?
substring-prefix? s1 start1 end1 s2 start2 end2 -> boolean
substring-suffix? s1 start1 end1 s2 start2 end2 -> boolean
substring-prefix-ci? s1 start1 end1 s2 start2 end2 -> boolean
substring-suffix-ci? s1 start1 end1 s2 start2 end2 -> boolean
Substring variants.
substring? s1 s2 [start end] -> integer or false
substring-ci? s1 s2 [start end] -> integer or false
Return the index in S2 where S1 occurs as a substring, or false.
The returned index is in the range [start,end).
The current implementation uses the Knuth-Morris-Pratt algorithm.
string-fill! s char [start end] -> unspecified
Store CHAR into the elements of S.
This is the R4RS procedure extended to have optional START/END parameters.
string-copy! target tstart s [start end] -> unspecified
Copy the sequence of characters from index range [START,END) in
string S to string TARGET, beginning at index TSTART. The characters
are copied left-to-right or right-to-left as needed -- the copy is
guaranteed to work, even if TARGET and S are the same string.
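The direction choice can be sketched like this: when the source range
starts to the right of the destination, copying left-to-right never
overwrites a character before it has been read; otherwise copying
right-to-left is safe. MY-STRING-COPY! is a hypothetical name and
START/END are required in this sketch.

```scheme
;; Hypothetical sketch of an overlap-safe copy.
(define (my-string-copy! target tstart s start end)
  (if (> start tstart)
      ;; Source lies to the right of the destination: copy left-to-right.
      (let lp ((i start) (j tstart))
        (if (< i end)
            (begin (string-set! target j (string-ref s i))
                   (lp (+ i 1) (+ j 1)))))
      ;; Otherwise copy right-to-left.
      (let lp ((i (- end 1)) (j (+ tstart (- end start) -1)))
        (if (>= i start)
            (begin (string-set! target j (string-ref s i))
                   (lp (- i 1) (- j 1)))))))

;; STRING-COPY gives a mutable buffer; literals may be immutable.
(define buf (string-copy "abcdef"))
(my-string-copy! buf 2 buf 0 4)   ; shift "abcd" right by two
buf                               ; => "ababcd"
```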
substring s start [end] -> string
string-copy s [start end] -> string
These R4RS procedures are extended to have optional START/END parameters.
Use STRING-COPY when you want to indicate explicitly in your code that you
wish to allocate new storage; use SUBSTRING when you don't care if you
get a fresh copy or share storage with the original string.
E.g.:
(string-copy "Beta substitution") => "Beta substitution"
(string-copy "Beta substitution" 1 10)
=> "eta subst"
(string-copy "Beta substitution" 5) => "substitution"
SUBSTRING may return a value that shares memory with S.
string-reverse s [start end] -> string
string-reverse! s [start end] -> unspecific
Reverse the string.
reverse-list->string char-list -> string
An efficient implementation of (compose list->string reverse):
(reverse-list->string '(#\a #\B #\c)) => "cBa"
This is a common idiom in the epilog of string-processing loops
that accumulate an answer in a reverse-order list.
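A sketch of the idiom: a loop conses selected characters onto an
accumulator, which therefore ends up in reverse order, and the epilog
converts it back. MY-REVERSE-LIST->STRING is a hypothetical stand-in using
the unoptimised composition, and UPPER-CASE-CHARS is an invented example.

```scheme
;; Hypothetical stand-in: the unoptimised composition.
(define (my-reverse-list->string chars)
  (list->string (reverse chars)))

;; Collect the upper-case characters of S, in order.
(define (upper-case-chars s)
  (let lp ((i 0) (acc '()))
    (if (= i (string-length s))
        (my-reverse-list->string acc)         ; epilog: undo the reversal
        (lp (+ i 1)
            (let ((c (string-ref s i)))
              (if (char-upper-case? c) (cons c acc) acc))))))

(upper-case-chars "Scheme UnderGround")   ; => "SUG"
```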
string-concat string-list -> string
Append the elements of STRING-LIST together into a single string.
Guaranteed to return a freshly allocated string. Appears sufficiently
often as to warrant being named.
string-concat/shared string-list -> string
string-append/shared s ... -> string
These two procedures are variants of STRING-CONCAT and STRING-APPEND
that are permitted to return results that share storage with their
parameters. In particular, if STRING-APPEND/SHARED is applied to just
one argument, it may return exactly that argument, whereas STRING-APPEND
is required to allocate a fresh string.
string->list s [start end] -> char-list
The R5RS STRING->LIST procedure is extended to take optional START/END
arguments.
string-null? s -> bool
Is S the empty string?
xsubstring s from [to start end] -> string
This is the "extended substring" procedure that implements replicated
copying of a substring of some string.
S is a string; START and END are optional arguments that demarcate
a substring of S, defaulting to 0 and the length of S (i.e., the whole
string). Replicate this substring up and down index space, in both the
positive and negative directions. For example, if S = "abcdefg", START=3,
and END=6, then we have the conceptual bidirectionally-infinite string
...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d ...
... -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 ...
XSUBSTRING returns the substring of this string beginning at index FROM,
and ending at TO (which defaults to FROM+(END-START)).
You can use XSUBSTRING to perform a variety of tasks:
- To rotate a string left: (xsubstring "abcdef" 2) => "cdefab"
- To rotate a string right: (xsubstring "abcdef" -2) => "efabcd"
- To replicate a string: (xsubstring "abc" 0 7) => "abcabca"
Note that
- The FROM/TO indices give a half-open range -- the characters from
index FROM up to, but not including, index TO.
- The FROM/TO indices are not in terms of the index space for string S.
They are in terms of the replicated index space of the substring
defined by S, START, and END.
It is an error if START=END -- although this is allowed by special
dispensation when FROM=TO.
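The replicated index space amounts to reducing each index modulo the
substring length. A minimal sketch, with all arguments required and the
hypothetical name MY-XSUBSTRING; note that it relies on MODULO (whose
result is non-negative for a positive divisor), not REMAINDER.

```scheme
;; Hypothetical sketch of XSUBSTRING with all arguments required.
(define (my-xsubstring s from to start end)
  (let* ((slen (- end start))                 ; length of the replicated unit
         (result (make-string (- to from))))
    (let lp ((i from) (j 0))
      (if (< i to)
          (begin
            ;; MODULO wraps negative indices around to the right end.
            (string-set! result j
                         (string-ref s (+ start (modulo i slen))))
            (lp (+ i 1) (+ j 1)))
          result))))

(my-xsubstring "abcdef" 2 8 0 6)     ; => "cdefab"
(my-xsubstring "abcdef" -2 4 0 6)    ; => "efabcd"
(my-xsubstring "abc" 0 7 0 3)        ; => "abcabca"
(my-xsubstring "abcdefg" -1 4 3 6)   ; => "fdefd"
```

When FROM=TO the loop body never runs, so an empty substring (START=END)
causes no division by zero in that case.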
string-xcopy! target tstart s sfrom [sto start end] -> unspecific
Exactly the same as XSUBSTRING, but the extracted text is written
into the string TARGET starting at index TSTART.
This operation is not defined if (EQ? TARGET S) -- you cannot copy
a string on top of itself.
* Lower-level procedures
The following procedures are useful for writing other string-processing
functions, and are contained in the string-lib-internals package.
parse-start+end proc s args -> [start end rest]
parse-final-start+end proc s args -> [start end]
PARSE-START+END may be used to parse a pair of optional START/END arguments
from an argument list, defaulting them to 0 and the length of some string
S, respectively. Let the length of string S be SLEN.
- If ARGS = (), the function returns (values 0 slen '())
- If ARGS = (i), I is checked to ensure it is an integer, and
that 0 <= i <= slen. Returns (values i slen (cdr args)).
- If ARGS = (i j ...), I and J are checked to ensure they are
integers, and that 0 <= i <= j <= slen. Returns (values i j (cddr args)).
If any of the checks fail, an error condition is raised, and PROC is used
as part of the error condition -- it should be the name of the client
procedure whose argument list PARSE-START+END is parsing.
PARSE-FINAL-START+END is exactly the same, except that the args list
passed to it is required to be of length two or less; if it is longer,
an error condition is raised. It may be used when the optional START/END
parameters are final arguments to the procedure.
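The three cases can be sketched as below, together with a client that
receives the results through CALL-WITH-VALUES. MY-PARSE-START+END and
COUNT-CHARS are hypothetical names, and the sketch assumes an ERROR
procedure (not part of R5RS, but provided by Scheme 48 and most other
implementations).

```scheme
;; Hypothetical sketch; assumes an ERROR procedure is available.
(define (my-parse-start+end proc s args)
  (let ((slen (string-length s)))
    (define (check-index i lo hi)
      (if (not (and (integer? i) (exact? i) (<= lo i hi)))
          (error "Illegal substring index" proc i)))
    (cond ((null? args) (values 0 slen '()))
          ((null? (cdr args))                 ; ARGS = (i)
           (check-index (car args) 0 slen)
           (values (car args) slen '()))
          (else                               ; ARGS = (i j ...)
           (check-index (car args) 0 slen)
           (check-index (cadr args) (car args) slen)
           (values (car args) (cadr args) (cddr args))))))

;; A hypothetical client procedure with optional START/END arguments.
(define (count-chars s . maybe-start+end)
  (call-with-values
      (lambda () (my-parse-start+end 'count-chars s maybe-start+end))
    (lambda (start end rest) (- end start))))

(count-chars "hello")       ; => 5
(count-chars "hello" 1)     ; => 4
(count-chars "hello" 1 3)   ; => 2
```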
check-substring-spec proc s start end -> unspecific
Check values START and END to ensure they specify a valid substring
in S. This means that START and END are exact integers, and
0 <= START <= END <= (STRING-LENGTH S)
If this is not the case, an error condition is raised. PROC is used
as part of error condition, and should be the procedure whose START/END
parameters we are checking.
make-kmp-restart-vector s c= -> vector
Build the Knuth-Morris-Pratt "restart vector," which is useful
for quickly searching character sequences for the occurrence of
string S. C= is a character-equality function used to construct
the restart vector; it is usefully CHAR=? or CHAR-CI=?.
The definition of the restart vector RV for string S is:
If we have matched chars 0..i-1 of S against some search string SS, and
S[i] doesn't match SS[k], then reset i := RV[i], and try again to
match SS[k]. If RV[i] = -1, then punt SS[k] completely, and move on to
SS[k+1] and S[0].
In other words, if you have matched the first i chars of S, but
the i+1'th char doesn't match, RV[i] tells you what the next-longest
prefix of S is that you have matched.
The following string-search function shows how a restart vector
is used to search. It can be easily adapted to search other character
sequences (such as ports).
(define (find-substring pattern source start end)
  (let ((plen (string-length pattern))
        (rv (make-kmp-restart-vector pattern char=?)))
    ;; The search loop. SJ & PJ are redundant state.
    (let lp ((si start) (pi 0)
             (sj (- end start))     ; (- end si) -- how many chars left.
             (pj plen))             ; (- plen pi) -- how many chars left.
      (if (= pi plen) (- si plen)                           ; Win.
          (and (<= pj sj)                                   ; Lose.
               (if (char=? (string-ref source si)           ; Search.
                           (string-ref pattern pi))
                   (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1)) ; Advance.
                   (let ((pi (vector-ref rv pi)))           ; Retreat.
                     (if (= pi -1)
                         (lp (+ si 1) 0 (- sj 1) plen)      ; Punt.
                         (lp si pi sj (- plen pi))))))))))