%latex -*- latex -*-

% Many of the \object's should be \values or something.
% look for "...", *...*, hand-inset code blocks

%\documentclass[twoside]{report}
%\usepackage{code,boxedminipage,makeidx,palatino,ct,
%            headings,mantitle,array,matter,mysize10}

\newcommand{\anglequote}[1]{{$<\!\!<$}#1$>\!\!>$}

% Style issues
%\parskip = 3pt plus 3pt
%\sloppy

%\input{decls}

%\begin{document}
%\mainmatter

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Pattern-matching strings with regular expressions}
\label{chapt:sre}

Scsh provides a rich facility for matching regular-expression patterns
in strings.
The system is composed of several pieces:
\begin{itemize}
\item An s-expression notation for writing down general regular expressions.
  In most systems, regexp patterns are encoded as string literals, such as
  \verb+"g(oo|ee)se"+.
  In scsh, they are written using s-expressions, such as
  \verb+(: "g" (| "oo" "ee") "se")+, and are called \emph{sre's}.
  The sre notation has several advantages over the traditional
  string-based notation. It's more expressive, can be commented,
  and can be indented to expose the structure of the form.

\item An abstract data type (ADT) representation for regexp values.
  Traditional regular-expression systems compute regular expressions
  from run-time values using strings. This can be awkward. Scsh, instead,
  provides a separate data type for regexps, with a set of basic constructor
  and accessor functions; regular expressions can be dynamically computed
  and manipulated using these functions.

\item Some tools that work on the regexp ADT: case-sensitive to
  case-insensitive regexp transform, a regexp simplifier, and so forth.

\item Parsers and unparsers that can convert between external representations
  and the regexp ADT.
The supported external representations are \begin{itemize} \item Posix strings \item S-expression notation (that is, sre's) \end{itemize} Being able to convert regexps to Posix strings allows implementations to implement regexp matching using standard Posix C-based engines. \item Macro support for the s-expression notation. The \ex{rx} macro provides a new special form that allows you to embed regexps in the s-expression notation within a Scheme program. Evaluating the macro form produces a regexp ADT value which can be used by Scheme pattern-matching procedures and other regexp consumers. \item Pattern-matching and searching procedures. Spencer's Posix regexp engine is linked in to the runtime; the regexp code uses this engine to provide text matching. \end{itemize} The regexp language supported is a complete superset of Posix functionality, providing: \begin{itemize} \item sequencing and choice (\ex{|}) \item repetition (\ex{*}, \ex{+}, \ex{?}, \ex{\{$m$,$n$\}}) \item character classes (\eg, \ex{[aeiou]}) and wildcard (\ex{.}) \item beginning/end of string anchors (\verb|^|, \verb|$|) \item beginning/end of line anchors \item beginning/end of word anchors \item case-sensitivity control \item submatch-marking \end{itemize} \section{Summary SRE syntax} The following figures give a summary of the SRE syntax; the next section is a friendlier tutorial introduction. \newlength{\foolength} \def\srecomment#1{\multicolumn{2}{l}% {\qquad\setlength{\foolength}{\textwidth}% \addtolength{\textwidth}{-4em}\begin{tabular}{p{\textwidth}}#1\end{tabular}}} \begin{boxedfigure}{tbhp} \begin{tabular}{lp{3in}} \var{string} & Literal match---interpreted relative to the current case-sensitivity lexical context (default is case-sensitive) \\ \\ \ex{(\var{string1} \var{string2} {\ldots})} & Set of chars, \eg, \ex{("abc" "XYZ")}. Interpreted relative to the current case-sensitivity lexical context. 
\\ \\ \ex{(* \var{sre} {\ldots})} & 0 or more matches \\ \ex{(+ \var{sre} {\ldots})} & 1 or more matches \\ \ex{(? \var{sre} {\ldots})} & 0 or 1 matches \\ \ex{(= \var{n} \var{sre} {\ldots})} & \var{n} matches \\ \ex{(>= \var{n} \var{sre} {\ldots})} & \var{n} or more matches \\ \ex{(** \var{n} \var{m} \var{sre} {\ldots})} & \var{n} to \var{m} matches \\ \srecomment{ \var{N} and \var{m} are Scheme expressions producing non-negative integers. \\ \var{M} may also be \ex{\#f}, meaning ``infinity.''} \\ \\ \ex{(| \var{sre} {\ldots})} & Choice (\ex{or} is R5RS symbol; \\ \ex{(or \var{sre} {\ldots})} & \ex{|} is not specified by R5RS.) \\ \\ \ex{(: \var{sre} {\ldots})} & Sequence (\ex{seq} is legal \\ \ex{(seq \var{sre} {\ldots})} & Common Lisp symbol) \\ \\ \ex{(submatch \var{sre} {\ldots})} & Numbered submatch \\ \\ \ex{(dsm \var{pre} \var{post} \var{sre} {\ldots})} & Deleted submatches \\ \srecomment{\var{Pre} and \var{post} are numerals.} \\ \\ \ex{(uncase \var{sre} {\ldots})} & Case-folded match \\ \ex{(w/case \var{sre} {\ldots})} & Introduce a lexical case-sensitivity \\ \ex{(w/nocase \var{sre} {\ldots})} & context. 
\\ \\
\ex{,@\var{exp}} & Dynamically computed regexp \\
\ex{,\var{exp}} & Same as ,@\var{exp}, but no submatch info \\
\srecomment{\var{Exp} must produce a character, string, char-set, or regexp.}
\\ \\
\ex{bos eos} & Beginning/end of string \\
\ex{bol eol} & Beginning/end of line \\
\ex{bow eow} & Beginning/end of word \\
\end{tabular}
\caption{SRE syntax summary (part 1)}
\end{boxedfigure}

\begin{boxedfigure}{tbhp}
\begin{tabular}{lp{3in}}
\ex{(word \var{sre} {\ldots})} & (: bow \var{sre} {\ldots} eow) \\
\ex{(word+ \var{cset-sre} {\ldots})} & \cd{(word (+ (& (| alphanumeric "_")} \\
 & \cd{ (| \var{cset-sre} {\ldots}))))} \\
\ex{word} & \ex{(word+ any)} \\
\\
\ex{(posix-string \var{string})} & Escape for Posix string notation \\
\\
\ex{\var{char}} & Singleton char set \\
\ex{\var{class-name}} & alphanumeric, whitespace, \etc \\
\srecomment{These two forms are interpreted subject to the lexical
  case-sensitivity context.}
\\ \\
\cd{(~ \var{cset-sre} {\ldots})} & Complement-of-union (\cd{[^{\ldots}]}) \\
\ex{(- \var{cset-sre} {\ldots})} & Difference \\
\cd{(& \var{cset-sre} {\ldots})} & Intersection \\
\\
\ex{(/ \var{range-spec} {\ldots})} & Character range---interpreted subject to
  the lexical case-sensitivity context \\
\end{tabular}
\caption{SRE syntax summary (part 2)}
\end{boxedfigure}

\begin{boxedfigure}{tbhp}
{\tt
\begin{tabular}{l@{\quad\texttt{|}\quad}ll}
\multicolumn{1}{l}{\var{class-name}\quad ::=\quad} & any \\
 & nonl \\
 & lower-case & | lower \\
 & upper-case & | upper \\
 & alphabetic & | alpha \\
 & numeric & | digit | num \\
 & alphanumeric & | alnum \\
 & punctuation & | punct \\
 & graphic & | graph \\
 & whitespace & | space | white \\
 & printing & | print \\
 & control & | cntrl \\
 & hex-digit & | xdigit | hex \\
 & ascii
\end{tabular}
\\[2ex]
\ex{\var{range-spec} ::= \var{string} | \var{char}} \\
}
The chars are taken in pairs to form inclusive ranges.
\caption{SRE character-class names and range specs.}
\end{boxedfigure}

\begin{boxedfigure}{tbhp}
\begin{verbatim}
<cset-sre> ::= (~ <cset-sre> ...)       Set complement-of-union
             | (- <cset-sre> ...)       Set difference
             | (& <cset-sre> ...)       Intersection
             | (| <cset-sre> ...)       Set union
             | (/ <range-spec> ...)     Range

             | (<string>)               Constant set
             | <char>                   Singleton constant set
             | <string>                 For 1-char string "c"

             | <class-name>             Constant set

             | ,<exp>                   <exp> evals to a char-set,
             | ,@<exp>                  char, single-char string,
                                        or re-char-set regexp.

             | (uncase <cset-sre>)      Case-folding
             | (w/case <cset-sre>)
             | (w/nocase <cset-sre>)
\end{verbatim}
\caption{%The \cd{~}, \cd{-}, \cd{&}, and \cd{word+} operators may only be
         %applied to SRE's that specify character sets.
         These are the ``type-checking'' rules for character-set SRE's.}
\end{boxedfigure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples}

\begin{widecode}
(- alpha ("aeiouAEIOU"))                ; Various forms of
(- alpha ("aeiou") ("AEIOU"))           ; non-vowel letter
(w/nocase (- alpha ("aeiou")))
(- (/"azAZ") ("aeiouAEIOU"))
(w/nocase (- (/"az") ("aeiou")))

;;; Upper-case letter, lower-case vowel, or digit
(| upper ("aeiou") digit)
(| (/"AZ09") ("aeiou"))

;;; Not an SRE, but Scheme code containing some embedded SREs.
(let* ((ws (rx (+ whitespace)))                 ; Seq of whitespace
       (date (rx (: (| "Jan" "Feb" "Mar" ...)   ; A month/day date.
                    ,ws
                    (| ("123456789")            ; 1-9
                       (: ("12") digit)         ; 10-29
                       "30" "31")))))           ; 30-31

  ;; Now we can use DATE several times:
  (rx ... ,date ... (* ... ,date ...)
      ... .... ,date))

;;; More Scheme code
(define (csl re)                ; A comma-separated list of RE's is
  (rx (| ""                     ; either zero of them (empty string), or
         (: ,re                 ; one RE, followed by
            (* ", " ,re)))))    ; Zero or more comma-space-RE matches.

(csl (rx (| "John" "Paul" "George" "Ringo")))\end{widecode}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{A short tutorial}

S-expression regexps are called ``SRE''s.
Keep in mind that they are \emph{not} Scheme expressions; they are another, separate notation that is expressed using the underlying framework of s-expression list structure: lists, symbols, {\etc} SRE's can be \emph{embedded} inside of Scheme expressions using special forms that extend Scheme's syntax (such as the \ex{rx} macro); there are places in the SRE grammar where one may place a Scheme expression. In these ways, SRE's and Scheme expressions can be intertwined. But this isn't fundamental; SRE's may be used in a completely Scheme-independent context. By simply restricting the notation to eliminate two special Scheme-embedding forms, they can be a completely independent notation. \paragraph{Constant strings} The simplest SRE is a string, denoting a constant regexp. For example, the SRE \begin{code} "Spot"\end{code} % matches only the string \anglequote{capital-S, little-p, little-o, little-t}. There is no interpretation of the characters in the string at all---the SRE \begin{code} ".*["\end{code} % matches the string \anglequote{period, asterisk, open-bracket}. \paragraph{Simple character sets} To specify a set of characters, write a list whose single element is a string containing the set's elements. So the SRE \begin{code} ("aeiou")\end{code} % only matches a vowel. One way to think of this, notationally, is that the set brackets are \ex{("} and \ex{")}. \paragraph{Wild card} Another simple SRE is the symbol \ex{any}, which matches any single character---including newline and \textsc{Ascii} nul. \paragraph{Sequences} We can form sequences of SRE's with the SRE \ex{(: \var{sre} \ldots)}. So the SRE \begin{code} (: "x" any "z")\end{code} % matches any three-character string starting with ``x'' and ending with ``z''. As we'll see shortly, many SRE forms have bodies that are implicit sequences of other SRE's, analogous to the manner in which the body of a Scheme \ex{lambda} or \ex{let} expression is an implicit \ex{begin} sequence. 
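
As a minimal sketch of the sequence form above, assuming the \ex{rx} macro and
the \ex{regexp-search?} procedure documented later in this chapter (the
variable name \ex{x-z} is hypothetical, chosen only for illustration), one
might try the pattern interactively:
\begin{code}
(define x-z (rx (: "x" any "z")))   ; hypothetical name

(regexp-search? x-z "xyz")          ; true
(regexp-search? x-z "wax-zoo")      ; true: the search finds "x-z" inside
(regexp-search? x-z "xz")           ; false: ANY must consume one character\end{code}
Note that \ex{regexp-search?} looks for a match anywhere in the string, which
is why the second call succeeds.
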
The regexp \ex{(seq \var{sre} \ldots)} is
completely equivalent to \ex{(: \var{sre} \ldots)};
it's included in order to have a syntax that doesn't require \ex{:} to be a
legal symbol.\footnote{That is, for use within s-expression syntax frameworks
that, unlike R5RS, don't allow for \ex{:} as a legal symbol.
A Common Lisp embedding of SREs, for example, would need to use
\ex{seq} instead of \ex{:}.}

\paragraph{Choices}
The SRE \ex{(| \var{sre} \ldots)} is a regexp that matches anything that any
of the \var{sre} regexps matches. So the regular expression
\begin{code}
(| "sasha" "Pete")\end{code}
%
matches either the string ``sasha'' or the string ``Pete''.
The regexp
\begin{code}
(| ("aeiou") ("0123456789"))\end{code}
%
is the same as
\begin{code}
("aeiou0123456789")
\end{code}
%
The regexp \ex{(or \var{sre} \ldots)} is
completely equivalent to \ex{(| \var{sre} \ldots)};
it's included in order to have a syntax that doesn't require \ex{|} to be a
legal symbol.

\paragraph{Repetition}
There are several SRE forms that match multiple occurrences of a regular
expression.
For example, the SRE \ex{(* \var{sre} \ldots)} matches zero or more
occurrences of the sequence \ex{(: \var{sre} \ldots)}.
Here is the complete list of SRE repetition forms:
\begin{inset}
\begin{tabular}{llrr}
SRE & means & at least & no more than \\ \hline
\ex{(* \var{sre} \ldots)} &zero-or-more &0 &infinity \\
\ex{(+ \var{sre} \ldots)} &one-or-more &1 &infinity \\
\ex{(? \var{sre} \ldots)} &zero-or-one &0 &1 \\
\ex{(= \var{from} \var{sre} \ldots)} &exactly-n &\var{from} &\var{from} \\
\ex{(>= \var{from} \var{sre} \ldots)} &n-or-more &\var{from} &infinity \\
\ex{(** \var{from} \var{to} \var{sre} \ldots)} &n-to-m &\var{from} &\var{to}
\end{tabular}
\end{inset}
A \var{from} field is a Scheme expression that produces an integer.
A \var{to} field is a Scheme expression that produces either an integer,
or false, meaning infinity.
While it is illegal for the \var{from} or \var{to} fields to be negative,
it \emph{is} allowed for \var{from} to be greater than \var{to} in a
\ex{**} form---this simply produces a regexp that will never match anything.

As an example, we can describe the names of car/cdr access functions
(``car'', ``cdr'', ``cadr'', ``cdar'', ``caar'', ``cddr'', ``caaadr'', \etc)
with either of the SREs
\begin{code}
(: "c" (+ (| "a" "d")) "r")
(: "c" (+ ("ad")) "r")\end{code}
We can limit the a/d chains to 4 characters or fewer with the SRE
\begin{code}
(: "c" (** 1 4 ("ad")) "r")\end{code}
Some boundary cases:
\begin{code}
(** 5 2 "foo")          ; Will never match
(** 0 0 "foo")          ; Matches the empty string\end{code}

\paragraph{Character classes}
There is a special set of SRE's that form ``character classes''---basically,
a regexp that matches one character from some specified set of characters.
There are operators to take the intersection, union, complement,
and difference of character classes to produce a new character class.
(Except for union, these capabilities are not provided for general regexps
as they are computationally intractable in the general case.)

A single character is the simplest character class:
\verb|#\x| is a character class that matches only the character ``x''.
A string that has only one letter is also a character class:
\ex{"x"} is the same SRE as \verb|#\x|.

The character-set notation \ex{(\var{string})} we've seen is a primitive
character class, as is the wildcard \ex{any}.
When arguments to the choice operator, \ex{|}, are all character classes,
then the choice form is itself a character-class.
So these SREs are all character-classes: \begin{code} ("aeiou") (| #\\a #\\e #\\i #\\o #\\u) (| ("aeiou") ("1234567890"))\end{code} However, these SRE's are \emph{not} character-classes: \begin{code} "aeiou" (| "foo" #\\x)\end{code} The \cd{(~ \var{cset-sre} \ldots)} char class matches one character not in the specified classes: \begin{code} (~ ("0248") ("1359"))\end{code} % matches any character that is not a digit. More compactly, we can use the \ex{/} operator to specify character sets by giving the endpoints of contiguous ranges, where the endpoints are specified by a sequence of strings and characters. For example, any of these char classes \begin{inset} \begin{verbatim} (/ #\A #\Z #\a #\z #\0 #\9) (/ "AZ" #\a #\z "09") (/ "AZ" #\a "z09") (/"AZaz09") \end{verbatim}\end{inset}% % matches a letter or a digit. The range endpoints are taken in pairs to form inclusive ranges of characters. Note that the exact set of characters included in a range is dependent on the underlying implementation's character type, so ranges may not be portable across different implementations. There is a wide selection of predefined, named character classes that may be used. One such SRE is the wildcard \ex{any}. \ex{nonl} is a character class matching anything but newline; it is equivalent to \begin{inset} \begin{verbatim} (~ #\newline) \end{verbatim}\end{inset}% % and is useful as a wildcard in line-oriented matching. 
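
A small sketch of the range notation described above in use, assuming the
\ex{regexp-search} and \ex{match:substring} procedures documented later in
this chapter; the sample string is made up for illustration:
\begin{code}
;; Find the first run of letters and digits in a string.
(let ((m (regexp-search (rx (+ (/ "AZaz09"))) "-- scsh 0.6 --")))
  (and m (match:substring m 0)))    ; the substring "scsh"\end{code}
The \ex{and} guards against \ex{regexp-search} returning false when the
string contains no letter or digit at all.
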
There are also predefined named char classes for the standard Posix and Gnu character classes: \begin{inset} \begin{tabular}{llll} scsh name & Posix/ctype & Alternate name & Comment \\ \hline \ex{lower-case} & \ex{lower} \\ \ex{upper-case} & \ex{upper} \\ \ex{alphabetic} & \ex{alpha} \\ \ex{numeric} & \ex{digit} & \ex{num} \\ \ex{alphanumeric} & \ex{alnum} & \ex{alphanum} \\ \ex{punctuation} & \ex{punct} \\ \ex{graphic} & \ex{graph} \\ \ex{blank} & (Gnu extension) \\ \ex{whitespace} & \ex{space} & \ex{white} & {``\ex{space}'' is deprecated.}\\ \ex{printing} & \ex{print} \\ \ex{control} & \ex{cntrl} \\ \ex{hex-digit} & \ex{xdigit} & \ex{hex} \\ \ex{ascii} & (Gnu extension) \\ \end{tabular} \end{inset} See the scsh character-set documentation or the Posix isalpha(3) man page for the exact definitions of these sets. You can use either the long scsh name or the shorter Posix and alternate names to refer to these char classes. The standard Posix name ``\ex{space}'' is provided, but deprecated, since it is ambiguous. It means ``whitespace,'' the set of whitespace characters, not the singleton set of the \verb|#\space| character. If you want a short name for the set of whitespace characters, use the char-class name ``white'' instead. Char classes may be intersected with the operator \cd{(& \var{cset-sre} \ldots)}, and set-difference can be performed with \ex{(- \var{cset-sre} \ldots)}. These operators are particularly useful when you want to specify a set by negation \emph{with respect to a limited universe.} For example, the set of all non-vowel letters is \begin{code} (- alpha ("aeiou") ("AEIOU"))\end{code}% % whereas writing a simple complement \begin{code} (~ ("aeiouAEIOU"))\end{code}% % gives a char class that will match any non-vowel---including punctuation, digits, white space, control characters, and \textsc{Ascii} nul. 
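
The difference between the two forms can be checked directly. This is a sketch
only, assuming \ex{regexp-search?} as documented later in this chapter; the
names \ex{consonant} and \ex{non-vowel} are hypothetical:
\begin{code}
(define consonant (rx (- alpha ("aeiou") ("AEIOU"))))   ; hypothetical name
(define non-vowel (rx (~ ("aeiouAEIOU"))))              ; hypothetical name

(regexp-search? consonant "42!")    ; false: the string contains no letters
(regexp-search? non-vowel "42!")    ; true: digits and "!" are non-vowels
(regexp-search? consonant "a1b")    ; true: "b" is a non-vowel letter\end{code}
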
We can \emph{compute} a char class by writing the SRE
\begin{code}
,\var{cset-exp}\end{code}%
%
where \var{cset-exp} is a Scheme expression producing a value that can be
coerced to a character set: a character set, character, one-character
string, or char-class regexp value.
This regexp matches one character from the set.

The char-class SRE \cd{,@\var{cset-exp}} is entirely equivalent to
\ex{,\var{cset-exp}} when \var{cset-exp} produces a character set
(but see below for the more general non-char-class context, where there
\emph{is} a distinction between \cd{,\var{exp}} and \cd{,@\var{exp}}).

As an example of character-class SREs,
an SRE that matches a lower-case vowel, upper-case letter, or digit is
\begin{code}
(| ("aeiou") (/"AZ09"))\end{code}%
%
or, equivalently
\begin{code}
(| ("aeiou") upper-case numeric)\end{code}%
%
Boundary cases: the empty-complement char class
\begin{code}
(~)\end{code}%
%
matches any character; it is equivalent to \ex{any}.
The empty-union char class
\begin{code}
(|)\end{code}%
%
never matches at all. This is rarely useful for human-written regexps,
but may be of occasional utility in machine-generated regexps, perhaps
produced by macros.

The rules for determining if an SRE is a simple, char-class SRE or a
more complex SRE form a little ``type system'' for SRE's.
See the summary section preceding this one for a complete listing of
these rules.

\paragraph{Case sensitivity}
There are three forms that control case sensitivity:
\begin{code}
(uncase \var{sre} \ldots)
(w/case \var{sre} \ldots)
(w/nocase \var{sre} \ldots)\end{code}%
%
\ex{uncase} is a regexp operator producing a regexp that matches any
case permutation of any string that matches \ex{(: \var{sre} \ldots)}.
For example, the regexp
\begin{code}
(uncase "foo")\end{code}%
%
matches the strings ``foo'', ``foO'', ``fOo'', ``fOO'', ``Foo'', \ldots

Expressions in SRE notation are interpreted in a lexical case-sensitivity
context.
The forms \ex{w/case} and \ex{w/nocase} are the scoping operators for this
context, which controls how constant strings and char-class forms are
interpreted in their bodies. So, for example, the regexp
\begin{code}
(w/nocase "abc"
          (* "FOO" (w/case "Bar"))
          ("aeiou"))\end{code}%
%
defines a case-insensitive match for all of its elements except for the
sub-element ``Bar'', which must match exactly capital-B, little-a, little-r.
The default, outermost, top-level context is case sensitive.

The lexical case-sensitivity context affects the interpretation of
\begin{itemize}
\item constant strings, such as \ex{"foo"},
\item chars, such as \verb|#\x|,
\item char sets, such as \ex{("abc")}, and
\item ranges, such as \ex{(/"az")}
\end{itemize}
that appear within that context. It does not affect dynamically computed
regexps---ones that are introduced by \ex{,\var{exp}} and \ex{,@\var{exp}}
forms. It does not affect named char-classes---presumably, if you wrote
\ex{lower}, you didn't mean \ex{alpha}.

\ex{uncase} is \emph{not} the same as \ex{w/nocase}.
To point up one distinction, consider the two regexps
\begin{code}
(uncase (~ "a"))
(w/nocase (~ "a"))\end{code}%
%
The regexp \cd{(~ "a")} matches any character except ``a,''
which means it \emph{does} match ``A.''
Now, \ex{(uncase \var{re})} matches any case-permutation of a string that
\var{re} matches.
\cd{(~ "a")} matches ``A,'' so \cd{(uncase (~ "a"))} matches ``A'' and
``a''---and, for that matter, every other character.
So \cd{(uncase (~ "a"))} is equivalent to \ex{any}.

In contrast, \cd{(w/nocase (~ "a"))} establishes a case-insensitive lexical
context in which the \cd{"a"} is interpreted, making the SRE equivalent to
\cd{(~ ("aA"))}.

\paragraph{Dynamic regexps}
SRE notation allows you to compute parts of a regular expression
at run time. The SRE
\begin{code}
,\var{exp}\end{code}%
%
is a regexp whose body \var{exp} is a Scheme expression producing a
string, character, char-set, or regexp as its value.
Strings and characters are converted into constant regexps;
char-sets are converted into char-class regexps;
and regexp values are substituted in place.
So we can write regexps like this
\begin{code}
(: "feeding the "
   ,(if (> n 1) "geese" "goose"))\end{code}%
%
This is how you can drop computed strings, such as someone's name,
or the decimal numeral for a computed number, into a complex regexp.

If we have a large, complex regular expression that is used multiple times
in some other, containing regular expression, we can name it, using the
binding forms of the embedding language (\eg, Scheme), and refer to it by
name in the containing expression.
For example, consider the Scheme expression
\begin{code}
(let* ((ws (rx (+ whitespace)))                ; Seq of whitespace
       ;; Something like "Mar 14"
       (date (rx (: (| "Jan" "Feb" "Mar" {\ldots})
                    ,ws
                    (| ("123456789")           ; 1-9
                       (: ("12") digit)        ; 10-29
                       "30"                    ; 30
                       "31")))))               ; 31

  ;; Now we can use DATE several times:
  (rx {\ldots} ,date {\ldots} (* {\ldots} ,date {\ldots})
      {\ldots} ,date {\ldots}))\end{code}%
%
where the \ex{(rx \var{sre} \ldots)} macro is the Scheme special form that
produces a Scheme regexp value given a body in SRE notation.

As we saw in the char-class section, if a dynamic regexp is used
in a char-class context (\eg, as an argument to a \verb|~| operation),
the expression must be coercible not merely to a general regexp,
but to a char-class SRE---so it must be either a singleton string,
a character, a scsh char set, or a char-class regexp.

We can also define and use functions on regexps in the host language.
For example, consider the following Scheme expressions, containing embedded
SRE's (inside the \ex{rx} macro expressions) which in turn contain
embedded Scheme expressions computing dynamic regexps:
\begin{code}
(define (csl re)                ;; A comma-separated list of RE's is either
  (rx (| ""                     ; zero of them (empty string),
         (: ,re                 ; or RE followed by
            (* ", " ,re)))))    ; zero or more comma-space-RE matches.

(rx ... ,date ...
    ,(csl (rx (| "John" "Paul" "George" "Ringo")))
    ...
    ,(csl date)
    ...)\end{code}%
%
We leave the extension of \ex{csl} to allow for an optional ``and''
between the last two matches as an exercise for the interested reader
(\eg, to match ``John, Paul, George and Ringo'').

Note, in passing, one of the nice features of SRE notation:
SREs can be commented, and indented in a fashion that shows the lexical
extent of the subexpressions.

When we embed a computed regexp inside another regular expression with
the \ex{,\var{exp}} form, we must specify how to account for the submatches
that may be in the computed part.
For example, suppose we have the regexp
\begin{code}
(rx (submatch (* "foo"))
    (submatch (? "bar"))
    ,(f x)
    (submatch "baz"))\end{code}%
%
It's clear that the submatch for the \ex{(* "foo")} part of the regexp is
submatch \#1, and the \ex{(? "bar")} part is submatch \#2.
But what number submatch is the \ex{"baz"} submatch? It's not clear.
Suppose the Scheme expression \ex{(f x)} produces a regular expression that
itself has 3 submatches.
Are these counted (making the \ex{"baz"} submatch \#6),
or not counted (making the \ex{"baz"} submatch \#3)?

SRE notation provides for both possibilities.
The SRE
\begin{code}
,\var{exp}\end{code}%
%
does \emph{not} contribute its submatches to its containing regexp;
it has zero submatches.
So one can reliably assign submatch indices to forms appearing
after a \ex{,\var{exp}} form in a regexp.

On the other hand, the SRE
\begin{code}
,@\var{exp}\end{code}%
%
``splices'' its resulting regexp into place,
\emph{exposing} its submatches to the containing regexp.
This is useful if the computed regexp is defined to produce a certain number
of submatches---if that is part of \var{exp}'s ``contract.''

\paragraph{String, line, and word units}
The regexps \ex{bos} and \ex{eos} match the empty string at the beginning and
end of the string, respectively.
The regexps \ex{bol} and \ex{eol} match the empty string at the beginning and
end of a line, respectively. A line begins at the beginning of the string, and
just after every newline character. A line ends at the end of the string, and
just before every newline character. The char class \ex{nonl} matches any
character except newline, and is useful in conjunction with line-based pattern
matching.

The regexps \ex{bow} and \ex{eow} match the empty string at the beginning and
end of a word, respectively. A word is a contiguous sequence of characters
that are either alphanumeric or the underscore character.

The regexp \ex{(word \var{sre} \ldots)} surrounds the sequence
\ex{(: \var{sre} \ldots)} with bow/eow delimiters. It is equivalent to
\begin{code}
(: bow \var{sre} \ldots eow)\end{code}%
%
The regexp \ex{(word+ \var{cset-sre} \ldots)} matches a word whose body is
one or more word characters matched by the char-set SRE \var{cset-sre}.
It is equivalent to
\begin{code}
(word (+ (& (| alphanumeric "_")
            (| \var{cset-sre} \ldots))))\end{code}%
%
For example, a word not containing x, y, or z is
\begin{code}
(word+ (~ ("xyz")))\end{code}%
%
The regexp \ex{word} matches one word; it is equivalent to
\begin{code}
(word+ any)
\end{code}%

\note{\ex{bol} and \ex{eol} are not supported by scsh's current regexp
      search engine, which is Spencer's Posix matcher. This is the only
      element of the notation that is not supported by the current scsh
      reference implementation.}

%\paragraph{Miscellaneous elements}

\paragraph{Posix string notation}
The SRE \ex{(posix-string \var{string})},
where \var{string} is a string literal (\emph{not} a general Scheme
expression), allows one to use Posix string notation for a regexp.
It's intended for backwards compatibility and is deprecated.
For example, \verb!(posix-string "[aeiou]+|x*|y{3,5}")! matches
a string of vowels, a possibly empty string of x's, or three to five y's.
Note that parentheses are used ambiguously in Posix notation---both for
grouping and submatch marking.
The \ex{(posix-string \var{string})} form makes the conservative assumption:
all parentheses introduce submatches.

\paragraph{Deleted submatches}
Deleted submatches, or ``DSM's,'' are a subtle feature that is never
required in expressions written by humans.
They can be introduced by the simplifier when reducing
regular expressions to simpler equivalents, and are included in the syntax
to give it expressibility spanning the full regexp ADT.
They may appear when unparsing regular expressions that have been run
through the simplifier; otherwise you are not likely to see them.
Feel free to skip this section.

The regexp simplifier can sometimes eliminate entire sub-expressions from a
regexp. For example, the regexp
\begin{code}
(: "foo" (** 0 0 "apple") "bar")\end{code}%
%
can be simplified to
\begin{code}
"foobar"\end{code}%
%
since \ex{(** 0 0 "apple")} will always match the empty string.
The regexp
\begin{code}
(| "foo"
   (: "Richard" (|) "Nixon")
   "bar")\end{code}%
%
can be simplified to
\begin{code}
(| "foo" "bar")\end{code}%
%
The empty choice \ex{(|)} can't match anything, so the whole
\begin{code}
(: "Richard" (|) "Nixon")\end{code}%
%
sequence can't match, and we can remove it from the choice.

However, if deleting part of a regular expression removes a submatch
form, any following submatch forms will have their numbering changed,
which would be an error. For example, if we simplify
\begin{code}
(: (** 0 0 (submatch "apple"))
   (submatch "bar"))\end{code}%
%
to
\begin{code}
(submatch "bar")\end{code}%
%
then the \ex{"bar"} submatch changes from submatch \#2 to submatch \#1---so
this is not a legal simplification.

When the simplifier deletes a sub-regexp that contains submatches,
it introduces a special regexp form to account for the missing,
deleted submatches, thus keeping the submatch accounting correct.
\begin{code}
(dsm \var{pre} \var{post} \var{sre} \ldots)\end{code}%
%
is a regexp that matches the sequence \ex{(: \var{sre} \ldots)}.
\var{Pre} and \var{post} are integer constants.
The DSM form introduces \var{pre} deleted submatches before the body,
and \var{post} deleted submatches after the body.
If the body \ex{(: \var{sre} \ldots)} itself has \var{body-sm} submatches,
then the total number of submatches for the DSM form is
$$\var{pre} + \var{body-sm} + \var{post}.$$
These extra, deleted submatches are never assigned string indices in any
match values produced when matching the regexp against a string.

As examples,
\begin{code}
(| (: (submatch "Richard") (|) "Nixon")
   (submatch "bar"))\end{code}%
%
can be simplified to
\begin{code}
(dsm 1 0 (submatch "bar"))\end{code}%
%
The regexp
\begin{code}
(: (** 0 0 (submatch "apple"))
   (submatch "bar"))\end{code}%
%
can be simplified to
\begin{code}
(dsm 1 0 (submatch "bar"))\end{code}%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Embedding regexps within Scheme programs}

SRE's can be placed in a Scheme program using the \ex{(rx \var{sre} \ldots)}
Scheme form, which evaluates to a Scheme regexp value.

\subsubsection{Static and dynamic regexps}

We separate SRE expressions into two classes: static and dynamic
expressions. A \emph{static} expression is one that has no run-time
dependencies; it is a complete, self-contained description of a regular set.
A \emph{dynamic} expression is one that requires run-time computation to
determine the particular regular set being described.
There are two places where one can embed run-time computations in an SRE:
\begin{itemize}
\item The \var{from} or \var{to} repetition counts of
      \ex{**}, \ex{=}, and \ex{>=} forms;
\item \ex{,\var{exp}} and \ex{,@\var{exp}} forms.
\end{itemize} A static SRE is one that does not contain any \ex{,\var{exp}} or \ex{,@\var{exp}} forms, and whose \ex{**}, \ex{=}, and \ex{>=} forms all contain constant repetition counts. Scsh's \ex{rx} macro is able, at macro-expansion time, to completely parse, simplify and translate any static SRE into the equivalent Posix string which is used to drive the underlying C-based matching engine; there is no run-time overhead. Dynamic SRE's are partially simplified and then expanded into Scheme code that constructs the regexp at run-time. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Regexp functions} \subsection{Obsolete, deprecated procedures} These two procedures are survivors from the previous, now-obsolete scsh regexp interface. Old code must open the \ex{re-old-funs} package to access them. They should not be used in new code. \defun{string-match}{posix-re-string string [start]}{match or false} \defunx{make-regexp}{posix-re-string}{regexp} \begin{desc} These are old functions included for backwards compatibility with previous releases. They are deprecated and will go away at some point in the future. Note that the new release has no ``regexp compiling'' procedure at all---regexp values are compiled for the matching engine on-demand, and the necessary data structures are cached inside the ADT values. \end{desc} \subsection{Standard procedures and syntax} \dfn{rx}{sre \ldots}{regexp}{Syntax} \begin{desc} This allows you to describe a regexp value with SRE notation. \end{desc} \defun{regexp?}{x}{\boolean} \begin{desc} Returns true if the value is a regular expression. \end{desc} \defun{regexp-search}{re string [start flags]}{match-data or false} \defunx{regexp-search?}{re string [start flags]}{\boolean} \begin{desc} Search \var{string} starting at position \var{start}, looking for a match for regexp \var{re}. If a match is found, return a match structure describing the match, otherwise {\sharpf}. 
    \var{Start} defaults to 0.

    \var{Flags} is the bitwise-or of \ex{regexp/bos-not-bol}
    and \ex{regexp/eos-not-eol}.
    \ex{regexp/bos-not-bol} means the beginning of the string isn't a
    line-begin.
    \ex{regexp/eos-not-eol} is analogous.
    \note{They're currently ignored because beginning/end-of-line anchors
    aren't supported by the current implementation.}

    Use \ex{regexp-search?} when you don't need submatch information,
    as it has the potential to be \emph{significantly} faster on
    submatch-containing regexps.

    There is no longer a separate regexp ``compilation'' function;
    regexp values are compiled for the C engine on demand,
    and the resulting C structures are cached in the regexp structure
    after the first use.
\end{desc}

\defun {match:start}{m [i]}{{\integer} or false}
\defunx{match:end}{m [i]}{{\integer} or false}
\defunx{match:substring}{m [i]}{{\str} or false}
\begin{desc}
    \ex{match:start} returns the start position of the submatch of
    match \var{m} denoted by the index \var{i}.
    Index 0 denotes the whole regexp;
    positive integers index submatches in the regexp, counting left-to-right.
    \var{I} defaults to 0.

    If the regular expression matches as a whole,
    but a particular sub-expression does not match,
    then \ex{match:start} returns {\sharpf}.

    \ex{match:end} is analogous to \ex{match:start},
    returning the end position of the indexed submatch.

    \ex{match:substring} returns the substring matched by the
    indexed submatch.
    If there was no match for the indexed submatch, it returns false.
\end{desc}

\defun{regexp-substitute}{port-or-false match . items}{\object}
\begin{desc}
    This procedure can be used to perform string substitutions based on
    regular-expression matches.
    The results of the substitution can be either output to a port or
    returned as a string.

    The \var{match} argument is a regular-expression match structure
    that controls the substitution.
    If \var{port} is an output port, the \var{items} are written out to
    the port:
    \begin{itemize}
    \item If an item is a string, it is copied directly to the port.
    \item If an item is an integer, the corresponding submatch from
          \var{match} is written to the port.
    \item If an item is \ex{'pre},
          the prefix of the matched string (the text preceding the match)
          is written to the port.
    \item If an item is \ex{'post},
          the suffix of the matched string (the text following the match)
          is written.
    \end{itemize}

    If \var{port} is {\sharpf}, nothing is written,
    and a string is constructed and returned instead.
\end{desc}
% An item is a string (copied verbatim), integer (match index),
% \ex{'pre} (chars before the match), or \ex{'post} (chars after the match).
% Passing false for the port means return a string.

\defun{regexp-substitute/global}{port-or-false re str . items}{\object}
\begin{desc}
% Same as above, except \ex{'post} item means recurse
% on post-match substring.
% If \var{re} doesn't match \var{str}, returns \var{str.}
    This procedure is similar to \ex{regexp-substitute},
    but can be used to perform repeated match/substitute operations over
    a string.
    It differs from \ex{regexp-substitute} in the following ways:
    \begin{itemize}
    \item It takes a regular expression and string to be matched as
          parameters, instead of a completed match structure.
    \item If the regular expression doesn't match the string, this
          procedure is the identity transform---it returns or outputs the
          string.
    \item If an item is \ex{'post}, the procedure recurses on the suffix
          string (the text from \var{str} following the match).
          Including a \ex{'post} in the list of items is how one gets
          multiple match/substitution operations.
    \item If an item is a procedure, it is applied to the match structure
          for a given match.
          The procedure returns a string to be used in the result.
    \end{itemize}
    The \var{re} parameter can be either a compiled regular expression or
    a string specifying a regular expression.

    Some examples:
{\small
\begin{widecode}
;;; Replace occurrences of "Cotton" with "Jin".
(regexp-substitute/global #f (rx "Cotton") s
                          'pre "Jin" 'post)

;;; mm/dd/yy -> dd/mm/yy date conversion.
(regexp-substitute/global #f (rx (submatch (+ digit)) "/"   ; 1 = M
                                 (submatch (+ digit)) "/"   ; 2 = D
                                 (submatch (+ digit)))      ; 3 = Y
                          s                                 ; Source string
                          'pre 2 "/" 1 "/" 3 'post)

;;; "9/29/61" -> "Sep 29, 1961" date conversion.
(regexp-substitute/global #f (rx (submatch (+ digit)) "/"   ; 1 = M
                                 (submatch (+ digit)) "/"   ; 2 = D
                                 (submatch (+ digit)))      ; 3 = Y
      s                                                     ; Source string

      'pre
      ;; Sleazy converter -- ignores "year 2000" issue,
      ;; and blows up if month is out of range.
      (lambda (m)
        (let ((mon (vector-ref '#("Jan" "Feb" "Mar" "Apr" "May" "Jun"
                                  "Jul" "Aug" "Sep" "Oct" "Nov" "Dec")
                               (- (string->number (match:substring m 1)) 1)))
              (day (match:substring m 2))
              (year (match:substring m 3)))
          (string-append mon " " day ", 19" year)))
      'post)

;;; Remove potentially offensive substrings from string S.
(define (kill-matches re s)
  (regexp-substitute/global #f re s 'pre 'post))

(kill-matches (rx (| "Windows" "tcl" "Intel")) s) ; Protect the children.\end{widecode}}
\end{desc}

\defun{regexp-fold}{re kons knil s [finish start]}{\object}
\begin{desc}
The following definition is a bit unwieldy, but the intuition is
simple: this procedure uses the regexp \var{re} to divide up string \var{s}
into non-matching/matching chunks, and then ``folds'' the procedure \var{kons}
across this sequence of chunks.
It is useful when you wish to operate on a string in sub-units
defined by some regular expression, as are the related
\ex{regexp-fold-right} and \ex{regexp-for-each} procedures.

Search from \var{start} (defaulting to 0) for a match to \var{re}; call
this match \var{m}.
Let \var{i} be the index of the end of the match
(that is, \ex{(match:end \var{m} 0)}).
Loop as follows:
\begin{tightcode}
(regexp-fold \var{re} \var{kons} (\var{kons} \var{start} \var{m} \var{knil}) \var{s} \var{finish} \var{i})\end{tightcode}
%
If there is no match, return instead
\begin{tightcode}
(\var{finish} \var{start} \var{knil})\end{tightcode}
%
\var{Finish} defaults to \ex{(lambda (i knil) knil)}.
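
As a small illustration of the loop just described, the following sketch
counts the non-overlapping runs of digits in a string.
The sample string and the use of a counter as the \var{knil} seed are
assumptions made for the example; \var{finish} is left at its default,
so the final count is simply returned.
\begin{code}
;; kons receives the chunk-start index, the match, and the running count.
(regexp-fold (rx (+ digit))
             (lambda (i m count) (+ count 1))
             0
             "a1b22c333")\end{code}%
%
With three digit runs in the sample string, this should evaluate to 3.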

In other words, we divide up \var{s} into a sequence of non-matching/matching
chunks:
$$
\vari{NM}1 \; \vari{M}1 \; \vari{NM}2 \; \vari{M}2 \; {\ldots} \;
\vari{NM}{k-1} \; \vari{M}{k-1} \; \vari{NM}k
$$
%
where \vari{NM}1 is the initial part of \var{s} that isn't matched by
the regexp \var{re}, \vari{M}1 is the first match,
\vari{NM}2 is the following part of \var{s} that isn't matched,
\vari{M}2 is the second match,
and so forth---\vari{NM}k is the final non-matching chunk of \var{s}.
We apply \var{kons} from left to right to build up a result, passing it one
non-matching/matching chunk each time:
on an application \ex{(\var{kons} \var{i} \var{m} \var{knil})},
the non-matching chunk goes from \var{i} to \ex{(match:start \var{m} 0)},
and the following matching chunk goes from
\ex{(match:start \var{m} 0)} to \ex{(match:end \var{m} 0)}.
The last non-matching chunk \vari{NM}k is processed by \var{finish}.
So the computation we perform is
\begin{centercode}
(\var{finish} \var{Q} (\var{kons} \vari{J}{k-1} \vari{M}{k-1} {\ldots} (\var{kons} \vari{J}{1} \vari{M}{1} \var{knil}) \ldots))\end{centercode}%
%
where \vari{J}{i} is the index of the start of \vari{NM}{i},
\vari{M}{i} is a match value describing \vari{M}{i},
and \var{Q} is the index of the beginning of \vari{NM}k.

Hint: The \ex{let-match} macro is frequently useful for operating on the
match value \var{M} passed to the \var{kons} function.
\end{desc}

\defun{regexp-fold-right}{re kons knil s [finish start]}{\object}
\begin{desc}
The right-to-left variant of \ex{regexp-fold}.

This procedure repeatedly matches regexp \var{re} across string \var{s}.
This divides \var{s} up into a sequence of matching/non-matching chunks:
$$
\vari{NM}0 \; \vari{M}1 \; \vari{NM}1 \; \vari{M}2 \; \vari{NM}2 \; {\ldots} \;
\vari{M}k \; \vari{NM}k
$$
%
where \vari{NM}0 is the initial part of \var{s} that isn't matched by
the regexp \var{re}, \vari{M}1 is the first match,
\vari{NM}1 is the following part of \var{s} that isn't matched,
\vari{M}2 is the second match,
and so forth---\vari{NM}k is the final non-matching chunk of \var{s}.
We apply \var{kons} from right to left to build up a result, passing it one
matching/non-matching chunk each time:
\begin{centercode}
(\var{finish} \var{Q} (\var{kons} \vari{M}{1} \vari{J}{1} {\ldots} (\var{kons} \vari{M}{k} \vari{J}{k} \var{knil}) \ldots))\end{centercode}%
%
where \vari{M}{i} is a match value describing \vari{M}{i},
\vari{J}{i} is the index of the end of \vari{NM}{i}
(or, equivalently, the beginning of \vari{M}{i+1}),
and \var{Q} is the index of the beginning of \vari{M}1.
In other words, \var{kons} is passed a match,
an index describing the following non-matching text,
and the value produced by folding the following text.
The \var{finish} function ``polishes off'' the fold operation by handling
the initial chunk of non-matching text (\vari{NM}0, above).
\var{Finish} defaults to \ex{(lambda (i knil) knil)}.

Example: To pick out all the matches to \var{re} in \var{s}, say
\begin{code}
(regexp-fold-right re
                   (\l{m i lis} (cons (match:substring m 0) lis))
                   '() s)\end{code}%
%
Hint: The \ex{let-match} macro is frequently useful for operating on the
match value \var{m} passed to the \var{kons} function.
\end{desc}

\defun{regexp-for-each}{re proc s [start]}{\undefined}
\begin{desc}
Repeatedly match regexp \var{re} against string \var{s}.
Apply \var{proc} to each match that is produced.
Matches do not overlap.

Hint: The \ex{let-match} macro is frequently useful for operating on the
match value \var{m} passed to \var{proc}.
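
A minimal sketch of \ex{regexp-for-each} in use, assuming only the
procedures documented in this chapter; the sample string is made up for
illustration.
It prints each run of digits on its own line.
\begin{code}
;; proc is called once per non-overlapping match, left to right.
(regexp-for-each (rx (+ digit))
                 (lambda (m)
                   (display (match:substring m 0))
                   (newline))
                 "tel 555 1234 ext 42")\end{code}%
%
The return value is unspecified, so the procedure is only useful for the
side effects of \var{proc}.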
\end{desc}

\dfn{let-match}{match-exp mvars body \ldots}{\object}{Syntax}
\dfnx{if-match}{match-exp mvars on-match no-match}{\object}{Syntax}
\begin{desc}
\var{Mvars} is a list of variables that are bound to the match and
submatches of the string;
\verb|#F| is allowed as a don't-care element.
For example,
\begin{code}
(let-match (regexp-search date s) (whole-date month day year)
  {\ldots} \var{body} {\ldots})\end{code}%
%
matches the regexp against string \ex{s}, then evaluates the body of the
\ex{let-match} in a scope where \ex{whole-date} is bound to the matched
string, and \ex{month}, \ex{day} and \ex{year} are bound to the first,
second and third submatches.

\ex{if-match} is similar, but if the match expression is false,
then the \var{no-match} expression is evaluated;
this would be an error in \ex{let-match}.
\end{desc}

\dfn{match-cond}{clause \ldots}{\object}{Syntax}
\begin{desc}
This macro allows one to conditionally attempt a sequence of pattern matches,
interspersed with other, general conditional tests.
There are four kinds of \ex{match-cond} clause, one introducing a pattern
match, and the other three simply being regular \ex{cond}-style clauses,
marked by the \ex{test} and \ex{else} keywords:
\begin{code}
(match-cond (\var{match-exp} \var{match-vars} \var{body} \ldots) ; As in if-match
            (test \var{exp} \var{body} \ldots)                  ; As in cond
            (test \var{exp} => \var{proc})                      ; As in cond
            (else \var{body} \ldots))                           ; As in cond\end{code}%
\end{desc}

\defun {flush-submatches}{re}{re}
\defunx{uncase}{re}{re}
\defunx{simplify-regexp}{re}{re}
\defunx{uncase-char-set}{cset}{re}
\defunx{uncase-string}{str}{re}
\begin{desc}
These functions map regexps and char sets to other regexps.

\ex{flush-submatches} returns a regexp which matches exactly what its
argument matches, but contains no submatches.

\ex{uncase} returns a regexp that matches any case-permutation of its
argument regexp.

\ex{simplify-regexp} applies the simplifier to its argument.
This is done automatically when compiling regular expressions, so this
is only useful for programmers who are directly examining the ADT
value with lower-level accessors.

\ex{uncase-char-set} maps a char set to a regular expression that matches
any character from that set, regardless of case.
Similarly, \ex{uncase-string} returns a regexp that matches any
case-permutation of the string.
For example, \ex{(uncase-string "Knight")} returns the same value as
\ex{(rx ("kK") ("nN") ("iI") ("gG") ("hH") ("tT"))}
or \ex{(rx (w/nocase "Knight"))}.
\end{desc}

\defun {sre->regexp}{sre}{re}
\defunx{regexp->sre}{re}{sre}
\begin{desc}
These are the SRE parser and unparser.
That is, \ex{sre->regexp} maps an SRE to a regexp value,
and \ex{regexp->sre} does the inverse.
The latter function can be useful for printing out regexps in a readable
format.
\begin{widecode}
(sre->regexp '(: "Olin " (? "G. ") "Shivers")) {\evalto} \var{regexp}

(define re (re-seq (re-string "Pete ")
                   (re-repeat 1 #f (re-string "Sz"))
                   (re-string "ilagyi")))
(regexp->sre (re-repeat 0 1 re))
    {\evalto} '(? "Pete " (+ "Sz") "ilagyi")\end{widecode}
\end{desc}

\defun {posix-string->regexp}{string}{re}
\defunx{regexp->posix-string}{re}{string}
\begin{desc}
These two functions are the Posix notation parser and unparser.
That is, \ex{posix-string->regexp} maps a Posix-notation regular expression,
such as \ex{"g(ee|oo)se"}, to a regexp value,
and \ex{regexp->posix-string} does the inverse.

You can use these tools to map between scsh regexps and Posix
regexp strings, which can be useful if you want to do conversion
between SRE's and Posix form.
For example, you can write a particularly complex regexp in SRE form,
or compute it using the ADT constructors, then convert to Posix form,
print it out, cut and paste it into a C or Emacs Lisp program.
Or you can import an old regexp from some other program, parse it into
an ADT value, render it to an SRE, print it out, then cut and paste it
into a scsh program.
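
The round trip described above might be sketched as follows.
The variable name \ex{goose} and the sample strings are assumptions made for
the example, and no particular printed form of the results is claimed.
\begin{code}
;; Import a Posix-notation regexp and use it like any other regexp value.
(define goose (posix-string->regexp "g(ee|oo)se"))
(regexp-search? goose "wild geese")

;; Render the imported regexp as an SRE for pasting into a scsh program.
(regexp->sre goose)

;; Go the other way: write the pattern as an SRE, export it in Posix form.
(regexp->posix-string (rx (: "g" (| "oo" "ee") "se")))\end{code}%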

Note:\begin{itemize}
\item The string parser doesn't handle the exotica of character class
      names such as \verb|[[:alnum:]]|;
      the current implementation was written in three hours.
\item The unparser produces Spencer-specific strings for bow/eow
      elements; otherwise, it's Posix all the way.
\end{itemize}
\end{desc}

\section{The regexp ADT}
The following functions may be used to construct and examine scsh's
regexp abstract data type.
They are in the following Scheme 48 packages:
\ex{re-adt-lib}, \ex{re-lib}, and \ex{scsh}.

Each basic class of regexp has a predicate, a basic constructor,
a ``smart'' constructor that performs limited ``peephole'' optimisation
on its arguments, and a set of accessors.
The \ex{\ldots:tsm} accessor returns the total number of submatches
contained in the regular expression.

\dfn {re-seq?}{x}{boolean}{Type predicate}
\dfnx{make-re-seq}{re \ldots}{re}{Basic constructor}
\dfnx{re-seq}{re \ldots}{re}{Smart constructor}
\dfnx{re-seq:elts}{re}{re-list}{Accessor}
\dfnx{re-seq:tsm}{re}{integer}{Accessor}

\dfn {re-choice?}{x}{boolean}{Type predicate}
\dfnx{make-re-choice}{re-list}{re}{Basic constructor}
\dfnx{re-choice}{re \ldots}{re}{Smart constructor}
\dfnx{re-choice:elts}{re}{re-list}{Accessor}
\dfnx{re-choice:tsm}{re}{integer}{Accessor}

\dfn {re-repeat?}{x}{boolean}{Type predicate}
\dfnx{make-re-repeat}{from to body}{re}{Basic constructor}
\dfnx{re-repeat:from}{re}{integer}{Accessor}
\dfnx{re-repeat:to}{re}{integer}{Accessor}
\dfnx{re-repeat:tsm}{re}{integer}{Accessor}

\dfn {re-submatch?}{x}{boolean}{Type predicate}
\dfnx{make-re-submatch}{body [pre-dsm post-dsm]}{re}{Basic constructor}
\dfnx{re-submatch:pre-dsm}{re}{integer}{Accessor}
\dfnx{re-submatch:post-dsm}{re}{integer}{Accessor}
\dfnx{re-submatch:tsm}{re}{integer}{Accessor}

\dfn {re-string?}{x}{boolean}{Type predicate}
\dfnx{make-re-string}{chars}{re}{Basic constructor}
\dfnx{re-string}{chars}{re}{Basic constructor}
\dfnx{re-string:chars}{re}{string}{Accessor}

\dfn {re-char-set?}{x}{boolean}{Type predicate}
\dfnx{make-re-char-set}{cset}{re}{Basic constructor} \dfnx{re-char-set}{cset}{re}{Basic constructor} \dfnx{re-char-set:cset}{re}{char-set}{Accessor} \dfn {re-dsm?}{x}{boolean}{Type predicate} \dfnx{make-re-dsm}{body pre-dsm post-dsm}{re}{Basic constructor} \dfnx{re-dsm}{body pre-dsm post-dsm}{re}{Smart constructor} \dfnx{re-dsm:body}{re}{re}{Accessor} \dfnx{re-dsm:pre-dsm}{re}{integer}{Accessor} \dfnx{re-dsm:post-dsm}{re}{integer}{Accessor} \dfnx{re-dsm:tsm}{re}{integer}{Accessor} \defvar {re-bos}{regexp} \defvarx{re-eos}{regexp} \defvarx{re-bol}{regexp} \defvarx{re-eol}{regexp} \defvarx{re-bow}{regexp} \defvarx{re-eow}{regexp} \begin{desc} These variables are bound to the primitive anchor regexps. \end{desc} \defun {re-bos?}{\object}{\boolean} \defunx{re-eos?}{\object}{\boolean} \defunx{re-bol?}{\object}{\boolean} \defunx{re-eol?}{\object}{\boolean} \defunx{re-bow?}{\object}{\boolean} \defunx{re-eow?}{\object}{\boolean} \begin{desc} These predicates recognise the associated primitive anchor regexp. \end{desc} \defvar{re-trivial}{regexp} \defunx{re-trivial?}{re}{\boolean} \begin{desc} The variable \ex{re-trivial} is bound to a regular expression that matches the empty string (corresponding to the SRE \ex{""} or \ex{(:)}); it is recognised by the associated predicate. Note that the predicate is only guaranteed to recognise this particular trivial regexp; other trivial regexps built using other constructors may or may not produce a true value. \end{desc} \defvar{re-empty}{regexp} \defunx{re-empty?}{re}{\boolean} \begin{desc} The variable \ex{re-empty} is bound to a regular expression that never matches (corresponding to the SRE \ex{(|)}); it is recognised by the associated predicate. Note that the predicate is only guaranteed to recognise this particular empty regexp; other empty regexps built using other constructors may or may not produce a true value. 
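
The caveat about these predicates can be illustrated with a short sketch
that uses only the variables and predicates defined above.
The comments restate the guarantees given in the text and do not claim
anything beyond them.
\begin{code}
(re-trivial? re-trivial)   ; true: the distinguished trivial regexp
(re-empty? re-empty)       ; true: the distinguished empty regexp

;; An empty regexp built some other way matches nothing either,
;; but the predicate is not guaranteed to recognise it.
(re-empty? (rx (|)))       ; may or may not be true\end{code}%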
\end{desc}

\defvar{re-any}{regexp}
\defunx{re-any?}{re}{\boolean}
\begin{desc}
The variable \ex{re-any} is bound to a regular expression that matches
any character (corresponding to the SRE \ex{any});
it is recognised by the associated predicate.
Note that the predicate is only guaranteed to recognise
this particular any-character regexp value;
other any-character regexps built using other constructors
may or may not produce a true value.
\end{desc}

% These are non-primitive predefined regexps of general utility.
\defvar {re-nonl}{regexp}
\defvarx{re-word}{regexp}
\begin{desc}
The variable \ex{re-nonl} is bound to a regular expression that matches
any non-newline character
(corresponding to the SRE \verb|(~ #\newline)|).
Similarly, \ex{re-word} is bound to a regular expression
that matches any word (corresponding to the SRE \ex{word}).
\end{desc}

\defun{regexp?}{\object}{\boolean}
\begin{desc}
Is the object a regexp?
\end{desc}

\defun{re-tsm}{re}{\integer}
\begin{desc}
Return the total number of submatches contained in the regexp.
\end{desc}

\defun{clean-up-cres}{}{\undefined}
\begin{desc}
The current scsh implementation should call this function periodically
to release C-heap storage associated with compiled regexps.
Hopefully, this procedure will be removed at a later date.
\end{desc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Syntax-hacking tools}

The Scheme 48 package \ex{sre-syntax-tools} exports several tools for macro
writers that want to use SREs in their macros.
In the functions defined below, \var{compare} and \var{rename} parameters
are as passed to Clinger-Rees explicit-renaming low-level macros.

\dfn{if-sre-form}{form conseq-form alt-form}{form}{Syntax}
\begin{desc}
If \var{form} is a legal SRE, this is equivalent to the expression
\var{conseq-form}, otherwise it expands to \var{alt-form}.

This is useful for high-level macro authors who want to write a macro where
one field in the macro can be an SRE or possibly something else.
\Eg, we might have a conditional form wherein if the test part of one arm is
an SRE, it expands to a regexp match on some implied value, otherwise the form
is evaluated as a boolean Scheme expression.

For example, a conditional macro might expand into code containing the
following form, which in turn would have one of two possible expansions:
\begin{centercode}
(if-sre-form test-exp                       ; If TEST-EXP is SRE,
  (regexp-search? (rx test-exp) line)       ; match it w/the line,
  test-exp)                                 ; otw it's a test exp.\end{centercode}%
\end{desc}

\defun{sre-form?}{form rename compare}{\boolean}
\begin{desc}
This procedure is for low-level macros doing things equivalent to
\ex{if-sre-form}.
It returns true if the form is a legal SRE.

Note that neither \ex{sre-form?} nor \ex{if-sre-form} does a deep recursion
over the form in the case where the form is a list.
They simply check the car of the form for one of the legal SRE keywords.
\end{desc}

\defun {parse-sre}{sre-form compare rename}{re}
\defunx{parse-sres}{sre-forms compare rename}{re}
\begin{desc}
Parse \var{sre-form} into an ADT.
Note that if the SRE is dynamic---contains \ex{,\var{exp}} or
\ex{,@\var{exp}} forms,
or has repeat operators whose from/to counts are not constants---then
the returned ADT will have \emph{Scheme expressions} in the corresponding
slots of the regexp records instead of the corresponding integer,
char-set, or regexp.
In other words, we use the ADT as its own AST.
It's called a ``hack.''

\ex{parse-sres} parses a list of SRE forms that comprise an implicit sequence.
\end{desc}

\defun{regexp->scheme}{re rename}{Scheme-expression}
\begin{desc}
Returns a Scheme expression that will construct the regexp \var{re}
using ADT constructors such as \ex{make-re-seq},
\ex{make-re-repeat}, and so forth.
If the regexp is static, it will be simplified and pre-translated to a Posix string as well, which will be part of the constructed regexp value. \end{desc} \defun{static-regexp?}{re}{\boolean} \begin{desc} Is the regexp a static one? \end{desc}
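
As a reminder of what the static/dynamic distinction looks like at the
source level, here is a hedged sketch using only the \ex{rx} form.
The names \ex{goose-re} and \ex{bracketed} are hypothetical, chosen for
illustration.
\begin{code}
;; Static: no unquoted expressions and no computed repetition counts,
;; so the whole pattern is known at macro-expansion time.
(define goose-re (rx (: "g" (| "oo" "ee") "se")))

;; Dynamic: the ,re form splices in a regexp computed at run time,
;; so the final regexp has to be constructed when the call is made.
(define (bracketed re)
  (rx (: "<" ,re ">")))

(regexp-search? (bracketed goose-re) "a <goose> here")\end{code}%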