Todo:
    parse-start+end parse-final-start+end need "string" in the name
    Also, export macro binder.
    What's up w/quotient? (quotient -1 3) = 0.
    regexp-foldl
    type regexp interface
    land*
    Let-optional:
      A let-optional that parses a prefix of the args.
      Arg checking forms that get used if it parses, but are not applied
      to the default.

The Scheme Underground string library includes a rich set of operations
for manipulating strings. These are frequently useful for scripting and
other text-manipulation applications. The library's design was influenced
by the string libraries found in MIT Scheme, Gambit, RScheme, MzScheme,
slib, Common Lisp, Bigloo, guile, APL and the SML standard basis.

Some of the code bears a distant family relation to the MIT Scheme
implementation, and being derived from that code, is covered by the MIT
Scheme copyright (which is a fairly generic "free" copyright -- see the
source file for details). The fast KMP string-search code used in
SUBSTRING? was loosely adapted from old slib code by Stephen Bevan.

The library has the following design principles:

- *All* procedures involving character comparison are available in
  both case-sensitive and case-insensitive forms.

- *All* functionality is available in substring and full-string forms.

- The procedures are spec'd so as to permit efficient implementation in a
  Scheme that provides shared-text substrings (e.g., guile). This means that
  you should not rely on many of the substring-selecting procedures to return
  freshly-allocated strings. Careful attention is paid to the issue of which
  procedures allocate fresh storage, and which are permitted to return results
  that share storage with the arguments.

- Common Lisp theft:
  + inequality functions return mismatch index.
    I generalised this so that this "protocol" is extended even to the
    equality functions. This means that clients can be handed any generic
    string-comparison function and rely on the meaning of the true value.
  + Common Lisp capitalisation definition

The library addresses some problems with the R5RS string procedures:
- Question marks after string-comparison functions (string=?, etc.)
  This is inconsistent with numeric comparison functions, and ugly, too.
- String-comparison functions do not provide a useful true value.
- STRING-COPY should have optional start/end args;
  SUBSTRING shouldn't specify if it copies or returns shared bits.
- STRING-FILL! and STRING->LIST should take optional start/end args.
- No <> function provided.

In the following procedure specifications:
- Any S parameter is a string;

- START and END parameters are half-open string indices specifying
  a substring within a string parameter; when optional, they default
  to 0 and the length of the string, respectively. When specified, it
  must be the case that 0 <= START <= END <= (string-length S), for
  the corresponding parameter S. They typically restrict a procedure's
  action to the indicated substring.

- A CHAR/CHAR-SET/PRED parameter is a value used to select/search
  for a character in a string. If it is a character, it is used in
  an equality test; if it is a character set, it is used as a
  membership test; if it is a procedure, it is applied to the
  characters as a test predicate.

This library contains a large number of procedures, but they follow
a consistent naming scheme. The names are composed of smaller lexemes
in a regular way that exposes the structure and relationships between the
procedures. This should help the programmer to recall or reconstitute the name
of the particular procedure that he needs when writing his own code. In
particular

- Procedures whose names end in "-ci" are case-insensitive variants.

- Procedures whose names end in "!" are side-effecting variants.
  These procedures generally return an unspecified value.

- The order of common parameters is fairly consistent across the
  different procedures.
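
The START/END and CHAR/CHAR-SET/PRED conventions above can be illustrated
with a small portable sketch. Nothing here is part of string-lib: the names
DEMO-CRITERION->PRED and DEMO-COUNT are hypothetical, and a plain list of
characters stands in for a character set so that the code does not depend
on a char-set package.

```scheme
;; Hypothetical helper (not a string-lib binding): turn a
;; CHAR/CHAR-SET/PRED criterion into a one-argument predicate.
;; ASSUMPTION: a list of characters stands in for a character set.
(define (demo-criterion->pred criterion)
  (cond ((char? criterion)      (lambda (c) (char=? c criterion)))
        ((procedure? criterion) criterion)
        ((list? criterion)      (lambda (c) (if (memv c criterion) #t #f)))
        (else                   (lambda (c) #f))))

;; Hypothetical helper: count the characters of S in the half-open
;; range [START,END) that satisfy CRITERION.  START and END are
;; optional and default to 0 and (string-length s).
(define (demo-count s criterion . maybe-start+end)
  (let* ((slen  (string-length s))
         (start (if (pair? maybe-start+end) (car maybe-start+end) 0))
         (end   (if (and (pair? maybe-start+end)
                         (pair? (cdr maybe-start+end)))
                    (cadr maybe-start+end)
                    slen))
         (pred  (demo-criterion->pred criterion)))
    (let lp ((i start) (n 0))
      (if (>= i end)
          n
          (lp (+ i 1) (if (pred (string-ref s i)) (+ n 1) n))))))

(demo-count "banana" #\a)                 ; => 3  (equality test)
(demo-count "banana" '(#\a #\n))          ; => 5  (membership test)
(demo-count "banana" char-alphabetic? 2)  ; => 4  (predicate, START only)
(demo-count "banana" #\a 1 4)             ; => 2  (examines indices 1, 2, 3)
```

Because the range is half-open, the last call never looks at index 4.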
For more text-manipulation functionality, see also the regular expression,
file-name, character set, and character->character partial map packages.

-------------------------------------------------------------------------------
* R4RS/R5RS procedures

The R4RS and R5RS reports define 22 string procedures. The string-lib
package includes 8 of these exactly as defined, 4 in an extended,
backwards-compatible way, and drops the remaining 10 (whose functionality
is available via other bindings).

The 8 procedures provided exactly as documented in the reports are
    string?
    make-string
    string
    string-length
    string-ref
    string-set!
    string-append
    list->string

The ten functions not included are the R4RS string-comparison functions:
    string=?    string-ci=?
    string<?    string-ci<?
    string>?    string-ci>?
    string<=?   string-ci<=?
    string>=?   string-ci>=?
The string-lib package provides alternate bindings.

Additionally, the four extended procedures are
    string-fill! s char [start end] -> unspecific
    string->list s [start end] -> char-list
    substring    s start [end] -> string
    string-copy  s [start end] -> string

These procedures are documented in the following section. In brief, they are
extended to take optional start/end parameters specifying substring ranges;
additionally, SUBSTRING is allowed to return a value that shares storage with
its argument.

* Procedures

These procedures are contained in the Scheme 48 package "string-lib", which is
open in the default user package. They are not found in the "scsh" package;
script writers and other programmers that use the Scheme 48 module system
must open string-lib explicitly.

string-map  proc s [start end] -> string
string-map! proc s [start end] -> unspecified
    PROC is a char->char procedure; it is mapped over S.
    Note: no sequence order is specified.

string-fold       kons knil s [start end] -> value
string-fold-right kons knil s [start end] -> value
    These are the fundamental iterators for strings.
    The left-fold operator maps the KONS procedure across the
    string from left to right
        (...
          (kons s[2] (kons s[1] (kons s[0] knil))))
    In other words, string-fold obeys the recursion
        (string-fold kons knil s start end) =
            (string-fold kons (kons s[start] knil) s start+1 end)

    The right-fold operator maps the KONS procedure across the
    string from right to left
        (kons s[0] (... (kons s[end-3] (kons s[end-2] (kons s[end-1] knil)))))
    obeying the recursion
        (string-fold-right kons knil s start end) =
            (string-fold-right kons (kons s[end-1] knil) s start end-1)

    Examples:
      To convert a string to a list of chars:
        (string-fold-right cons '() s)

      To count the number of lower-case characters in a string:
        (string-fold (lambda (c count)
                       (if (char-set-contains? char-set:lower c)
                           (+ count 1)
                           count))
                     0
                     s)

string-unfold p f g seed -> string
    This is the fundamental constructor for strings.
    - G is used to generate a series of "seed" values from the initial seed:
        SEED, (G SEED), (G^2 SEED), (G^3 SEED), ...
    - P tells us when to stop -- when it returns true when applied to one
      of these seed values.
    - F maps each seed value to the corresponding character
      in the result string.

    More precisely, the following (simple, inefficient) definition holds:
        (define (string-unfold p f g seed)
          (if (p seed) ""
              (string-append (string (f seed))
                             (string-unfold p f g (g seed)))))

    STRING-UNFOLD is a fairly powerful constructor -- you can use it to
    reverse a string, copy a string, convert a list to a string, read
    a port into a string, and so forth. Examples:
        (port->string p) = (string-unfold eof-object? values
                                          (lambda (x) (read-char p))
                                          (read-char p))

        (list->string lis) = (string-unfold null? car cdr lis)

        (string-tabulate f size) =
            (string-unfold (lambda (i) (= i size)) f add1 0)

    To map F over a list LIS, producing a string:
        (string-unfold null? (compose f car) cdr lis)

string-tabulate proc len -> string
    PROC is an integer->char procedure. Construct a string of size LEN
    by applying PROC to each index to produce the corresponding string
    element. The order in which PROC is applied to the indices is not
    specified.
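
The fold and unfold recursions above can be tried out with the following
portable sketch. The REF- names are hypothetical stand-ins written for this
example (full-string forms only, no START/END); they are not the library's
own definitions, and REF-STRING-UNFOLD simply repeats the simple,
inefficient definition given above.

```scheme
;; Left fold: kons is applied to s[0] first, s[len-1] last.
(define (ref-string-fold kons knil s)
  (let lp ((i 0) (acc knil))
    (if (= i (string-length s))
        acc
        (lp (+ i 1) (kons (string-ref s i) acc)))))

;; Right fold: kons is applied to s[len-1] first, s[0] last.
(define (ref-string-fold-right kons knil s)
  (let lp ((i (- (string-length s) 1)) (acc knil))
    (if (< i 0)
        acc
        (lp (- i 1) (kons (string-ref s i) acc)))))

;; The simple, inefficient unfold from the text.
(define (ref-string-unfold p f g seed)
  (if (p seed)
      ""
      (string-append (string (f seed))
                     (ref-string-unfold p f g (g seed)))))

(ref-string-fold cons '() "abc")         ; => (#\c #\b #\a)
(ref-string-fold-right cons '() "abc")   ; => (#\a #\b #\c)

;; Reversing a string with unfold: the seed is the index of the
;; next character to emit, counting down from the last index.
(define (ref-string-reverse s)
  (ref-string-unfold (lambda (i) (< i 0))
                     (lambda (i) (string-ref s i))
                     (lambda (i) (- i 1))
                     (- (string-length s) 1)))

(ref-string-reverse "hello")             ; => "olleh"

;; Converting a list to a string, as in the LIST->STRING example.
(ref-string-unfold null? car cdr '(#\x #\y #\z))   ; => "xyz"
```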
string-for-each proc s [start end] -> unspecified
string-iter     proc s [start end] -> unspecified
    Apply PROC to each character in S.
    STRING-FOR-EACH has no specified iteration order.
    STRING-ITER is required to iterate from START to END
    in increasing order.

string-every? pred s [start end] -> boolean
string-any?   pred s [start end] -> value
    Note: no sequence order specified.
    Checks to see if predicate PRED is true of every / any character in S.
    STRING-ANY? is witness-generating -- it applies PRED to the elements
    of S, returning the first true value it finds, otherwise false.

string-compare    s1 s2 lt-proc eq-proc gt-proc -> values
string-compare-ci s1 s2 lt-proc eq-proc gt-proc -> values
    Apply LT-PROC, EQ-PROC, GT-PROC to the mismatch index, depending
    upon whether S1 is less than, equal to, or greater than S2.
    The "mismatch index" is the largest index i such that for
    every 0 <= j < i, s1[j] = s2[j] -- that is, I is the first
    position that doesn't match. If S1 = S2, the mismatch index
    is simply the length of the strings; we observe the protocol
    in this redundant case for uniformity.

substring-compare    s1 start1 end1 s2 start2 end2 lt-proc eq-proc gt-proc -> values
substring-compare-ci s1 start1 end1 s2 start2 end2 lt-proc eq-proc gt-proc -> values
    The continuation procedures are applied to S1's mismatch index (as defined
    above). In the case of EQ-PROC, this is always END1.

string=  s1 s2 -> #f or integer
string<> s1 s2 -> #f or integer
string<  s1 s2 -> #f or integer
string>  s1 s2 -> #f or integer
string<= s1 s2 -> #f or integer
string>= s1 s2 -> #f or integer
    If the comparison operation is true, the function returns the
    mismatch index (as defined for the previous comparator functions);
    otherwise, it returns #f.

string-ci=  s1 s2 -> #f or integer
string-ci<> s1 s2 -> #f or integer
string-ci<  s1 s2 -> #f or integer
string-ci>  s1 s2 -> #f or integer
string-ci<= s1 s2 -> #f or integer
string-ci>= s1 s2 -> #f or integer
    Case-insensitive variants.
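
The mismatch-index protocol described for the comparison procedures above
can be sketched portably as follows. The DEMO- names are hypothetical and
written only for this illustration, and the sketch assumes that strings are
ordered lexicographically by CHAR<?, which the text above does not spell
out.

```scheme
;; First index at which S1 and S2 differ, or the length of the
;; shorter string if one is a prefix of the other.
(define (demo-mismatch-index s1 s2)
  (let ((n (min (string-length s1) (string-length s2))))
    (let lp ((i 0))
      (if (and (< i n) (char=? (string-ref s1 i) (string-ref s2 i)))
          (lp (+ i 1))
          i))))

;; Apply exactly one of the three continuations to the mismatch index.
;; ASSUMPTION: lexicographic ordering by CHAR<?.
(define (demo-string-compare s1 s2 lt-proc eq-proc gt-proc)
  (let ((i    (demo-mismatch-index s1 s2))
        (len1 (string-length s1))
        (len2 (string-length s2)))
    (cond ((= i len1) (if (= i len2) (eq-proc i) (lt-proc i)))
          ((= i len2) (gt-proc i))
          ((char<? (string-ref s1 i) (string-ref s2 i)) (lt-proc i))
          (else (gt-proc i)))))

;; The "#f or integer" comparators fall out of the three-way form.
(define (demo-string< s1 s2)
  (demo-string-compare s1 s2
                       (lambda (i) i) (lambda (i) #f) (lambda (i) #f)))
(define (demo-string= s1 s2)
  (demo-string-compare s1 s2
                       (lambda (i) #f) (lambda (i) i) (lambda (i) #f)))

(demo-string< "apple" "apply")   ; => 4
(demo-string< "apply" "apple")   ; => #f
(demo-string= "abc" "abc")       ; => 3
(demo-string< "a" "b")           ; => 0
```

Note that a result of 0 still counts as true, since #f is the only false
value in Scheme; this is what lets the mismatch index double as the
boolean answer.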
substring=  s1 start1 end1 s2 start2 end2 -> #f or integer
substring<> s1 start1 end1 s2 start2 end2 -> #f or integer
substring<  s1 start1 end1 s2 start2 end2 -> #f or integer
substring>  s1 start1 end1 s2 start2 end2 -> #f or integer
substring<= s1 start1 end1 s2 start2 end2 -> #f or integer
substring>= s1 start1 end1 s2 start2 end2 -> #f or integer

substring-ci=  s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<> s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<  s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci>  s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci<= s1 start1 end1 s2 start2 end2 -> #f or integer
substring-ci>= s1 start1 end1 s2 start2 end2 -> #f or integer
    These variants restrict the comparison to the indicated
    substrings of S1 and S2.

string-upper-case? s [start end] -> boolean
string-lower-case? s [start end] -> boolean
    STRING-UPPER-CASE? returns true iff the string contains
    no lower-case characters. STRING-LOWER-CASE? returns true
    iff the string contains no upper-case characters.
        (string-upper-case? "") => #t
        (string-lower-case? "") => #t
        (string-upper-case? "FOOb") => #f
        (string-upper-case? "U.S.A.") => #t

capitalize-string  s [start end] -> string
capitalize-string! s [start end] -> unspecified
    Capitalize the string: upcase the first alphanumeric character,
    and downcase the rest of the string. CAPITALIZE-STRING returns
    a freshly allocated string.

        (capitalize-string "--capitalize tHIS sentence.") =>
            "--Capitalize this sentence."

        (capitalize-string "see Spot run. see Nix run.") =>
            "See spot run. see nix run."

        (capitalize-string "3com makes routers.") =>
            "3com makes routers."

capitalize-words  s [start end] -> string
capitalize-words! s [start end] -> unspecified
    A "word" is a maximal contiguous sequence of alphanumeric characters.
    Upcase the first character of every word; downcase the rest of the word.
    CAPITALIZE-WORDS returns a freshly allocated string.
        (capitalize-words "HELLO, 3THErE, my nAME IS olin")
            => "Hello, 3there, My Name Is Olin"

    More sophisticated capitalisation procedures can be synthesized
    using CAPITALIZE-STRING and pattern matchers. In this context,
    the REGEXP-SUBSTITUTE/GLOBAL procedure may be useful for picking
    out the units to be capitalised and applying CAPITALIZE-STRING to
    their components.

string-upcase    s [start end] -> string
string-upcase!   s [start end] -> unspecified
string-downcase  s [start end] -> string
string-downcase! s [start end] -> unspecified
    Raise or lower the case of the alphabetic characters in the string.
    STRING-UPCASE and STRING-DOWNCASE return freshly allocated strings.

string-take       s nchars -> string
string-drop       s nchars -> string
string-take-right s nchars -> string
string-drop-right s nchars -> string
    STRING-TAKE returns the first NCHARS of S;
    STRING-DROP returns all but the first NCHARS of S.
    STRING-TAKE-RIGHT returns the last NCHARS of S;
    STRING-DROP-RIGHT returns all but the last NCHARS of S.
    These generalise MIT Scheme's HEAD & TAIL functions.
    If these procedures produce the entire string, they may return either
    S or a copy of S; in some implementations, proper substrings may share
    memory with S.

string-pad       s k [char start end] -> string
string-pad-right s k [char start end] -> string
    Build a string of length K comprised of S padded on the left (right)
    by as many occurrences of the character CHAR as needed. If S has more
    than K chars, it is truncated on the left (right) to length K. CHAR
    defaults to #\space.

    If K is exactly the length of S, these functions may return
    either S or a copy of S.
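
The padding and truncation rule for STRING-PAD and STRING-PAD-RIGHT can be
sketched portably as below. DEMO-STRING-PAD and DEMO-STRING-PAD-RIGHT are
hypothetical names for this illustration; the sketch omits the optional
START/END arguments and always goes through SUBSTRING or STRING-APPEND,
whereas the real procedures are allowed to return S itself.

```scheme
;; Pad on the left; when S is too long, drop characters from the
;; left so that the rightmost K characters survive.
(define (demo-string-pad s k . maybe-char)
  (let ((char (if (pair? maybe-char) (car maybe-char) #\space))
        (len  (string-length s)))
    (if (<= k len)
        (substring s (- len k) len)
        (string-append (make-string (- k len) char) s))))

;; Pad on the right; when S is too long, drop characters from the
;; right so that the leftmost K characters survive.
(define (demo-string-pad-right s k . maybe-char)
  (let ((char (if (pair? maybe-char) (car maybe-char) #\space))
        (len  (string-length s)))
    (if (<= k len)
        (substring s 0 k)
        (string-append s (make-string (- k len) char)))))

(demo-string-pad "42" 5)                ; => "   42"
(demo-string-pad "42" 5 #\0)            ; => "00042"
(demo-string-pad "12345678" 5)          ; => "45678"
(demo-string-pad-right "abc" 5 #\.)     ; => "abc.."
(demo-string-pad-right "abcdefgh" 5)    ; => "abcde"
```

Left padding with truncation on the left is the behaviour wanted for
right-justified numeric columns: the low-order digits are the ones kept.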
string-trim       s [char/char-set/pred start end] -> string
string-trim-right s [char/char-set/pred start end] -> string
string-trim-both  s [char/char-set/pred start end] -> string
    Trim S by skipping over all characters on the left / on the right /
    on both sides that satisfy the second parameter CHAR/CHAR-SET/PRED:
        - If it is a character CHAR, characters equal to CHAR are trimmed.
        - If it is a char set CHAR-SET, characters contained in CHAR-SET
          are trimmed.
        - If it is a predicate PRED, it is a test predicate that is applied
          to the characters in S; a character causing it to return true
          is skipped.
    CHAR/CHAR-SET/PRED defaults to CHAR-SET:WHITESPACE.

    If no trimming occurs, these functions may return either S or a copy of S;
    in some implementations, proper substrings may share memory with S.

        (string-trim-both " The outlook wasn't brilliant, \n\r")
            => "The outlook wasn't brilliant,"

string-filter s char/char-set/pred [start end] -> string
string-delete s char/char-set/pred [start end] -> string
    Filter the string S, retaining only those characters that
    satisfy / do not satisfy the CHAR/CHAR-SET/PRED argument. If
    this argument is a procedure, it is applied to the character
    as a predicate; if it is a char-set, the character is tested
    for membership; if it is a character, it is used in an equality test.

    If the string is unaltered by the filtering operation, these
    functions may return either S or a copy of S.

string-index       s char/char-set/pred [start end] -> integer or #f
string-index-right s char/char-set/pred [end start] -> integer or #f
string-skip        s char/char-set/pred [start end] -> integer or #f
string-skip-right  s char/char-set/pred [end start] -> integer or #f
    Note the inverted start/end ordering of index-right and skip-right's
    parameters.
    Index (index-right) searches through the string from the left (right),
    returning the index of the first occurrence of a character which
        - equals CHAR/CHAR-SET/PRED (if it is a character);
        - is in CHAR/CHAR-SET/PRED (if it is a char-set);
        - satisfies the predicate CHAR/CHAR-SET/PRED (if it is a procedure).
    If no match is found, the functions return false.

    The skip functions are similar, but use the complement of the criteria:
    they search for the first char that *doesn't* satisfy the test. E.g.,
    to skip over initial whitespace, say
        (cond ((string-skip s char-set:whitespace) =>
               (lambda (i)
                 ;; (string-ref s i) is not whitespace.
                 ...)))

string-prefix-count    s1 s2 -> integer
string-suffix-count    s1 s2 -> integer
string-prefix-count-ci s1 s2 -> integer
string-suffix-count-ci s1 s2 -> integer
    Return the length of the longest common prefix/suffix of the two strings.
    This is equivalent to the "mismatch index" for the strings.

substring-prefix-count    s1 start1 end1 s2 start2 end2 -> integer
substring-suffix-count    s1 start1 end1 s2 start2 end2 -> integer
substring-prefix-count-ci s1 start1 end1 s2 start2 end2 -> integer
substring-suffix-count-ci s1 start1 end1 s2 start2 end2 -> integer
    Substring variants.

string-prefix?    s1 s2 -> boolean
string-suffix?    s1 s2 -> boolean
string-prefix-ci? s1 s2 -> boolean
string-suffix-ci? s1 s2 -> boolean
    Is S1 a prefix/suffix of S2?

substring-prefix?    s1 start1 end1 s2 start2 end2 -> boolean
substring-suffix?    s1 start1 end1 s2 start2 end2 -> boolean
substring-prefix-ci? s1 start1 end1 s2 start2 end2 -> boolean
substring-suffix-ci? s1 start1 end1 s2 start2 end2 -> boolean
    Substring variants.

substring?    s1 s2 [start end] -> integer or false
substring-ci? s1 s2 [start end] -> integer or false
    Return the index in S2 where S1 occurs as a substring, or false.
    The returned index is in the range [start,end).
    The current implementation uses the Knuth-Morris-Pratt algorithm.

string-fill! s char [start end] -> unspecified
    Store CHAR into the elements of S.
    This is the R4RS procedure extended to have optional START/END parameters.

string-copy! target tstart s [start end] -> unspecified
    Copy the sequence of characters from index range [START,END) in
    string S to string TARGET, beginning at index TSTART. The characters
    are copied left-to-right or right-to-left as needed -- the copy is
    guaranteed to work, even if TARGET and S are the same string.

substring   s start [end] -> string
string-copy s [start end] -> string
    These R4RS procedures are extended to have optional START/END parameters.
    Use STRING-COPY when you want to indicate explicitly in your code that you
    wish to allocate new storage; use SUBSTRING when you don't care if you
    get a fresh copy or share storage with the original string.
    E.g.:
        (string-copy "Beta substitution") => "Beta substitution"
        (string-copy "Beta substitution" 1 10)
            => "eta subst"
        (string-copy "Beta substitution" 5) => "substitution"

    SUBSTRING may return a value that shares memory with S.

string-reverse  s [start end] -> string
string-reverse! s [start end] -> unspecific
    Reverse the string.

reverse-list->string char-list -> string
    An efficient implementation of (compose list->string reverse):
        (reverse-list->string '(#\a #\B #\c)) => "cBa"
    This is a common idiom in the epilog of string-processing loops
    that accumulate an answer in a reverse-order list.

string-concat string-list -> string
    Append the elements of STRING-LIST together into a single string.
    Guaranteed to return a freshly allocated string. Appears sufficiently
    often as to warrant being named.

string-concat/shared string-list -> string
string-append/shared s ... -> string
    These two procedures are variants of STRING-CONCAT and STRING-APPEND
    that are permitted to return results that share storage with their
    parameters. In particular, if STRING-APPEND/SHARED is applied to just
    one argument, it may return exactly that argument, whereas STRING-APPEND
    is required to allocate a fresh string.
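
The REVERSE-LIST->STRING entry above mentions loops that accumulate their
answer in a reverse-order list. A minimal portable sketch of that idiom
follows. DEMO-REVERSE-LIST->STRING is a hypothetical stand-in defined with
standard procedures (so it lacks the efficiency the real procedure is
meant to provide), and DEMO-SQUEEZE-SPACES is an invented example loop.

```scheme
;; Portable stand-in: reverse the list, then build the string.
(define (demo-reverse-list->string char-list)
  (list->string (reverse char-list)))

;; Collapse each run of spaces in S into a single space.  Characters
;; are consed onto ACC as they are accepted, so ACC holds the answer
;; backwards; the epilog turns it into a string in one step.
(define (demo-squeeze-spaces s)
  (let lp ((i 0) (acc '()) (prev-space? #f))
    (if (= i (string-length s))
        (demo-reverse-list->string acc)          ; epilog
        (let* ((c      (string-ref s i))
               (space? (char=? c #\space)))
          (lp (+ i 1)
              (if (and space? prev-space?) acc (cons c acc))
              space?)))))

(demo-reverse-list->string '(#\a #\B #\c))   ; => "cBa"
(demo-squeeze-spaces "a  b   c")             ; => "a b c"
```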
string->list s [start end] -> char-list
    The R5RS STRING->LIST procedure is extended to take optional START/END
    arguments.

string-null? s -> bool
    Is S the empty string?

xsubstring s from [to start end] -> string
    This is the "extended substring" procedure that implements replicated
    copying of a substring of some string.

    S is a string; START and END are optional arguments that demarcate
    a substring of S, defaulting to 0 and the length of S (i.e., the whole
    string). Replicate this substring up and down index space, in both the
    positive and negative directions. For example, if S = "abcdefg", START=3,
    and END=6, then we have the conceptual bidirectionally-infinite string
        ...  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d  e  f  d ...
        ... -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 ...
    XSUBSTRING returns the substring of this string beginning at index FROM,
    and ending at TO (which defaults to FROM+(END-START)).

    You can use XSUBSTRING to perform a variety of tasks:
    - To rotate a string left:  (xsubstring "abcdef" 2)  => "cdefab"
    - To rotate a string right: (xsubstring "abcdef" -2) => "efabcd"
    - To replicate a string:    (xsubstring "abc" 0 7)   => "abcabca"

    Note that
      - The FROM/TO indices give a half-open range -- the characters from
        index FROM up to, but not including, index TO.
      - The FROM/TO indices are not in terms of the index space for string S.
        They are in terms of the replicated index space of the substring
        defined by S, START, and END.

    It is an error if START=END -- although this is allowed by special
    dispensation when FROM=TO.

string-xcopy! target tstart s sfrom [sto start end] -> unspecific
    Exactly the same as XSUBSTRING, but the extracted text is written
    into the string TARGET starting at index TSTART.
    This operation is not defined if (EQ? TARGET S) -- you cannot copy
    a string on top of itself.

* Lower-level procedures

The following procedures are useful for writing other string-processing
functions, and are contained in the string-lib-internals package.
parse-start+end       proc s args -> [start end rest]
parse-final-start+end proc s args -> [start end]
    PARSE-START+END may be used to parse a pair of optional START/END
    arguments from an argument list, defaulting them to 0 and the length
    of some string S, respectively. Let the length of string S be SLEN.
    - If ARGS = (), the function returns (values 0 slen '())
    - If ARGS = (i), I is checked to ensure it is an integer, and
      that 0 <= i <= slen. Returns (values i slen (cdr args)).
    - If ARGS = (i j ...), I and J are checked to ensure they are
      integers, and that 0 <= i <= j <= slen. Returns
      (values i j (cddr args)).
    If any of the checks fails, an error condition is raised, and PROC is
    used as part of the error condition -- it should be the name of the
    client procedure whose argument list PARSE-START+END is parsing.

    PARSE-FINAL-START+END is exactly the same, except that the args list
    passed to it is required to be of length two or less; if it is longer,
    an error condition is raised. It may be used when the optional START/END
    parameters are final arguments to the procedure.

check-substring-spec proc s start end -> unspecific
    Check values START and END to ensure they specify a valid substring
    in S. This means that START and END are exact integers, and
        0 <= START <= END <= (STRING-LENGTH S)
    If this is not the case, an error condition is raised. PROC is used
    as part of the error condition, and should be the procedure whose
    START/END parameters we are checking.

make-kmp-restart-vector s c= -> vector
    Build the Knuth-Morris-Pratt "restart vector," which is useful
    for quickly searching character sequences for the occurrence of
    string S. C= is a character-equality function used to construct the
    restart vector; it is usefully CHAR=? or CHAR-CI=?.

    The definition of the restart vector RV for string S is:
        If we have matched chars 0..i-1 of S against some search string SS,
        and S[i] doesn't match SS[k], then reset i := RV[i], and try again
        to match SS[k].
        If RV[i] = -1, then punt SS[k] completely, and move on to
        SS[k+1] and S[0].

    In other words, if you have matched the first i chars of S, but
    the i+1'th char doesn't match, RV[i] tells you what the next-longest
    prefix of S is that you have matched.

    The following string-search function shows how a restart vector
    is used to search. It can be easily adapted to search other character
    sequences (such as ports).

    (define (find-substring pattern source start end)
      (let ((plen (string-length pattern))
            (rv (make-kmp-restart-vector pattern char=?)))

        ;; The search loop. SJ & PJ are redundant state.
        (let lp ((si start) (pi 0)
                 (sj (- end start))     ; (- end si)  -- how many chars left.
                 (pj plen))             ; (- plen pi) -- how many chars left.

          (if (= pi plen) (- si plen)                       ; Win.

              (and (<= pj sj)                               ; Lose.

                   (if (char=? (string-ref source si)       ; Search.
                               (string-ref pattern pi))
                       (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1)) ; Advance.

                       (let ((pi (vector-ref rv pi)))       ; Retreat.
                         (if (= pi -1)
                             (lp (+ si 1) 0 (- sj 1) plen)  ; Punt.
                             (lp si pi sj (- plen pi))))))))))
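
For experimenting with the search loop outside the library, the restart
vector can be built with the sketch below. DEMO-KMP-RESTART-VECTOR is a
hypothetical, portable stand-in that computes the plain textbook failure
function (RV[0] = -1, and RV[i] = the length of the longest proper prefix
of S that is also a suffix of S[0..i-1]). This is an assumption about one
vector that satisfies the definition above; the library's own
MAKE-KMP-RESTART-VECTOR may produce different entries. DEMO-FIND-SUBSTRING
repeats the search loop above with the stand-in swapped in.

```scheme
(define (demo-kmp-restart-vector s c=)
  (let* ((n  (string-length s))
         (rv (make-vector n -1)))
    ;; K is RV[i-1]: the border length we try to extend with s[i-1].
    (let lp ((i 1) (k -1))
      (if (< i n)
          (let find ((k k))
            (cond ((< k 0)                       ; no border left to extend
                   (vector-set! rv i 0)
                   (lp (+ i 1) 0))
                  ((c= (string-ref s k) (string-ref s (- i 1)))
                   (vector-set! rv i (+ k 1))    ; border grows by one
                   (lp (+ i 1) (+ k 1)))
                  (else (find (vector-ref rv k)))))  ; fall back
          rv))))

;; The search loop from the text, using the stand-in vector.
(define (demo-find-substring pattern source start end)
  (let ((plen (string-length pattern))
        (rv   (demo-kmp-restart-vector pattern char=?)))
    (let lp ((si start) (pi 0) (sj (- end start)) (pj plen))
      (if (= pi plen)
          (- si plen)
          (and (<= pj sj)
               (if (char=? (string-ref source si) (string-ref pattern pi))
                   (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1))
                   (let ((pi (vector-ref rv pi)))
                     (if (= pi -1)
                         (lp (+ si 1) 0 (- sj 1) plen)
                         (lp si pi sj (- plen pi))))))))))

(demo-kmp-restart-vector "abab" char=?)       ; => #(-1 0 0 1)
(demo-find-substring "abab" "abacabab" 0 8)   ; => 4
(demo-find-substring "abc" "ababab" 0 6)      ; => #f
```

In the first search, the mismatch at source index 3 retreats through
RV[3] = 1 and RV[1] = 0 before punting, so no source character is
compared against the start of the pattern more than once per position.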