% -*- latex -*-
\chapter{Strings and characters}

Scsh provides a set of procedures for processing strings and characters.
The procedures provided match regular expressions, search strings,
parse file-names, and manipulate sets of characters.

Also see chapters \ref{chapt:rdelim} and \ref{chapt:fr-awk}
on record I/O, field parsing, and the awk loop.
The procedures documented there allow you to read character-delimited
records from ports, use regular expressions to split the records into fields
(for example, splitting a string at every occurrence of colon or white-space),
and loop over streams of these records in a convenient way.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{String manipulation}
\label{sec:stringmanip}

Strings are the basic communication medium for {\Unix} processes, so a
shell language must have reasonable facilities for manipulating them.

\subsection{Regular expressions}
\label{sec:regexps}

The following functions perform regular expression matching.
The code uses Henry Spencer's regular expression package.

\begin{defundesc}{string-match} {regexp string [start]} {match or false}
Search \var{string} starting at position \var{start}, looking for a match
for \var{regexp}. If a match is found, return a match structure describing
the match, otherwise {\sharpf}. \var{Start} defaults to 0.

\var{regexp} may be a compiled regexp structure or a string defining
a regular expression, which will be compiled to a regexp structure.
\end{defundesc}

\begin{defundesc} {regexp-match?} {obj} \boolean
Is the object a regular expression match?
\end{defundesc}

\begin{defundesc} {match:start} {match [match-number]} {{\fixnum} or false}
Returns the start position of the match denoted by \var{match-number}.
Match number 0 denotes the whole regexp; each higher number denotes the
sub-match for the corresponding parenthesized \ex{(\ldots)} section.
\var{Match-number} defaults to 0.

If the regular expression matches as a whole,
but a particular parenthesized sub-expression does not match, then
\ex{match:start} returns {\sharpf}.
\end{defundesc}

\begin{defundesc} {match:end} {match [match-number]} {{\fixnum} or false}
Returns the end position of the match denoted by \var{match-number}.
\var{Match-number} defaults to 0 (the whole match).

If the regular expression matches as a whole,
but a particular parenthesized sub-expression does not match, then
\ex{match:end} returns {\sharpf}.
\end{defundesc}

\begin{defundesc} {match:substring} {match [match-number]} {{\str} or false}
Returns the substring matched by match \var{match-number}.
\var{Match-number} defaults to 0 (the whole match).
If there was no match, returns false.
\end{defundesc}

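As a small illustration of how these accessors fit together, the sketch
below matches a two-field pattern and pulls out the pieces. The input
string is made up for the example, and the results shown assume the usual
leftmost, greedy matching of \ex{+}.
\begin{code}
(let ((m (string-match "([0-9]+)/([0-9]+)" "on 9/29")))
  (and m                               ; m is false if nothing matched
       (list (match:start m)           ; start of the whole match
             (match:substring m 1)     ; first parenthesized sub-match
             (match:substring m 2))))  ; second parenthesized sub-match
{\evalto} (3 "9" "29")\end{code}
Testing the result of \ex{string-match} before using the accessors guards
against the no-match case, where {\sharpf} is returned instead of a match
structure.
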
Regular expression matching compiles patterns into special data
structures which can be efficiently used to match against strings.
The overhead of compiling patterns that will be used for multiple
searches can be avoided by these lower-level routines:
%
\begin{defundesc} {make-regexp} {str} {re}
Generate a compiled regular expression from the given string.
\end{defundesc}

\begin{defundesc} {regexp?} {obj} \boolean
Is the object a regular expression?
\end{defundesc}

\begin{defundesc} {regexp-exec} {regexp str [start]} {match or false}
Apply the regular expression \var{regexp} to the string \var{str} starting
at position \var{start}. If the match succeeds it returns a regexp-match,
otherwise {\sharpf}. \var{Start} defaults to 0.
\end{defundesc}

\begin{defundesc} {->regexp} {regexp-or-string} {regexp}
Coerce the input value into a compiled regular expression:
strings are compiled; regexp structures are passed through unchanged.
\end{defundesc}

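A sketch of how the lower-level routines might be used: the pattern is
compiled once with \ex{make-regexp}, and the compiled regexp is then reused
for every string in a list. The names \ex{date-re} and \ex{matching-strings}
are ours, chosen for illustration; they are not part of scsh.
\begin{code}
;;; Compile the pattern once, outside the loop.
(define date-re (make-regexp "([0-9]+)/([0-9]+)/([0-9]+)"))

;;; Return the elements of STRS that contain a match for regexp RE.
(define (matching-strings re strs)
  (let loop ((strs strs) (hits '()))
    (cond ((null? strs) (reverse hits))
          ((regexp-exec re (car strs))
           (loop (cdr strs) (cons (car strs) hits)))
          (else (loop (cdr strs) hits)))))

(matching-strings date-re '("due 9/29/61" "no date" "1/2/03"))
{\evalto} ("due 9/29/61" "1/2/03")\end{code}
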
\defun{regexp-quote}{str}{\str}
\begin{desc}
Returns a regular expression that matches the string \var{str} exactly.
In other words, it quotes the regular expression, prepending backslashes
to all the special regexp characters in \var{str}.
\begin{code}
(regexp-quote "*Hello* world.")
{\evalto}"\\\\*Hello\\\\* world\\\\."\end{code}
\end{desc}

\defun{regexp-substitute}{port match . items}{{\str} or \undefined}
\begin{desc}
This procedure can be used to perform string substitutions based on
regular expression matches.
The results of the substitution can be either output to a port or
returned as a string.

The \var{match} argument is a regular expression match structure
that controls the substitution.
If \var{port} is an output port, the \var{items} are written out to
the port:
\begin{itemize}
\item If an item is a string, it is copied directly to the port.
\item If an item is an integer, the corresponding submatch from \var{match}
is written to the port.
\item If an item is \ex{'pre},
the prefix of the matched string (the text preceding the match)
is written to the port.
\item If an item is \ex{'post},
the suffix of the matched string is written.
\end{itemize}

If \var{port} is {\sharpf}, nothing is written, and a string is constructed
and returned instead.
\end{desc}

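For instance, assuming a successful match, the items below rebuild the
string with the two numeric fields swapped. This is only a sketch; the
input string is invented for the example.
\begin{code}
(let ((m (string-match "([0-9]+)/([0-9]+)" "due 9/29 sharp")))
  (and m (regexp-substitute #f m 'pre 2 "/" 1 'post)))
{\evalto} "due 29/9 sharp"\end{code}
Here \ex{'pre} contributes the text before the match, the integers select
the two sub-matches, and \ex{'post} contributes the text after the match.
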
\defun{regexp-substitute/global}{port regexp string . items}
{{\str} or \undefined}
\begin{desc}
This procedure is similar to \ex{regexp-substitute},
but can be used to perform repeated match/substitute operations over
a string.
It differs from \ex{regexp-substitute} in the following ways:
\begin{itemize}
\item It takes a regular expression and string to be matched as
parameters, instead of a completed match structure.
\item If the regular expression doesn't match the string, this
procedure is the identity transform---it returns or outputs the
string.
\item If an item is \ex{'post}, the procedure recurses on the suffix string
(the text from \var{string} following the match).
Including a \ex{'post} in the list of items is how one gets multiple
match/substitution operations.
\item If an item is a procedure, it is applied to the match structure for
a given match.
The procedure returns a string to be used in the result.
\end{itemize}
The \var{regexp} parameter can be either a compiled regular expression or
a string specifying a regular expression.

Some examples:
{\small
\begin{widecode}
;;; Replace occurrences of "Cotton" with "Jin".
(regexp-substitute/global #f "Cotton" s
                          'pre "Jin" 'post)

;;; mm/dd/yy -> dd/mm/yy date conversion.
(regexp-substitute/global #f "([0-9]+)/([0-9]+)/([0-9]+)" ; mm/dd/yy
                          s ; Source string
                          'pre 2 "/" 1 "/" 3 'post)

;;; "9/29/61" -> "Sep 29, 1961" date conversion.
(regexp-substitute/global #f "([0-9]+)/([0-9]+)/([0-9]+)" ; mm/dd/yy
      s ; Source string

      'pre
      ;; Sleazy converter -- ignores "year 2000" issue, and blows up if
      ;; month is out of range.
      (lambda (m)
        (let ((mon (vector-ref '#("Jan" "Feb" "Mar" "Apr" "May" "Jun"
                                  "Jul" "Aug" "Sep" "Oct" "Nov" "Dec")
                               (- (string->number (match:substring m 1)) 1)))
              (day (match:substring m 2))
              (year (match:substring m 3)))
          (string-append mon " " day ", 19" year)))
      'post)

;;; Remove potentially offensive substrings from string S.
(regexp-substitute/global #f "Windows|tcl|Intel" s
                          'pre 'post)\end{widecode}}

\end{desc}

\subsection{Other string manipulation facilities}

\defun {index} {string char [start]} {{\fixnum} or false}
\defunx {rindex} {string char [start]} {{\fixnum} or false}
\begin{desc}
These procedures search through \var{string} looking for an occurrence
of character \var{char}. \ex{index} searches left-to-right; \ex{rindex}
searches right-to-left.

\ex{index} returns the smallest index $i$ of \var{string} greater
than or equal to \var{start} such that $\var{string}[i] = \var{char}$.
The default for \var{start} is zero. If there is no such match,
\ex{index} returns false.

\ex{rindex} returns the largest index $i$ of \var{string} less than
\var{start} such that $\var{string}[i] = \var{char}$.
The default for \var{start} is \ex{(string-length \var{string})}.
If there is no such match, \ex{rindex} returns false.
\end{desc}

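The following sketch traces the two search directions on one string; the
results follow from the definitions of \ex{index} and \ex{rindex} given
above. To avoid depending on character-literal syntax, the slash character
is obtained with \ex{string-ref}; the name \ex{slash} is ours.
\begin{code}
(define slash (string-ref "/" 0))

(index  "src/des/main.c" slash)      ; leftmost slash
{\evalto} 3
(index  "src/des/main.c" slash 4)    ; leftmost slash at index 4 or later
{\evalto} 7
(rindex "src/des/main.c" slash)      ; rightmost slash
{\evalto} 7
(rindex "src/des/main.c" slash 7)    ; rightmost slash before index 7
{\evalto} 3\end{code}
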
I should probably snarf all the MIT Scheme string functions, and stick them
in a package. {\Unix} programs need to mung character strings a lot.

MIT string match commands:
\begin{tightcode}
[sub]string-match-{forward,backward}[-ci]
[sub]string-{prefix,suffix}[-ci]?
[sub]string-find-{next,previous}-char[-ci]
[sub]string-find-{next,previous}-char-in-set
[sub]string-replace[!]
\ldots\etc\end{tightcode}
These are not currently provided.

\begin{defundesc} {substitute-env-vars} {fname} \str
Replace occurrences of environment variables with their values.
An environment variable reference is a dollar sign followed either by a
run of alphanumeric characters and underscores, or by a name enclosed
in braces.

\begin{exampletable}
\splitline{\ex{(substitute-env-vars "\$USER/.login")}}
{\ex{"shivers/.login"}} \\
\cd{(substitute-env-vars "$\{USER\}_log")} & \cd{"shivers_log"}
\end{exampletable}
\end{defundesc}

\subsection{Manipulating file-names}
\label{sec:filenames}

These procedures do not access the file-system at all; they merely operate
on file-name strings. Much of this structure is patterned after the GNU Emacs
design. Perhaps a more sophisticated system would be better, something
like the pathname abstractions of {\CommonLisp} or MIT Scheme. However,
being {\Unix}-specific, we can be a little less general.

\subsubsection{Terminology}
These procedures carefully adhere to the {\Posix} standard for file-name
resolution, which occasionally entails some slightly odd things.
This section will describe these rules, and give some basic terminology.

A \emph{file-name} is either the file-system root (``/''),
or a series of slash-terminated directory components, followed by
a file component.
Root is the only file-name that may end in slash.
Some examples:
\begin{center}
\begin{tabular}{lll}
File name & Dir components & File component \\\hline
\ex{src/des/main.c} & \ex{("src" "des")} & \ex{"main.c"} \\
\ex{/src/des/main.c} & \ex{("" "src" "des")} & \ex{"main.c"} \\
\ex{main.c} & \ex{()} & \ex{"main.c"} \\
\end{tabular}
\end{center}

Note that the relative filename \ex{src/des/main.c} and the absolute filename
\ex{/src/des/main.c} are distinguished by the presence of the root component
\ex{""} in the absolute path.

Multiple embedded slashes within a path have the same meaning as
a single slash.
More than two leading slashes have the same
meaning as a single leading slash---they indicate that the file-name
is an absolute one, with the path leading from root.
However, {\Posix} permits the OS to give special meaning to
\emph{two} leading slashes.
For this reason, the routines in this section do not simplify two leading
slashes to a single slash.

A file-name in \emph{directory form} is either a file-name terminated by
a slash, \eg, ``\ex{/src/des/}'', or the empty string, ``''.
The empty string corresponds to the current working directory,
whose file-name is dot (``\ex{.}'').
Working backwards from the append-a-slash rule,
we extend the syntax of {\Posix} file-names to define the empty string
to be a file-name form of the root directory ``\ex{/}''.
(However, ``\ex{/}'' is also acceptable as a file-name form for root.)
So the empty string has two interpretations:
as a file-name form, it is the file-system root;
as a directory form, it is the current working directory.
Slash is also an ambiguous form: \ex{/} is both a directory-form and
a file-name form.

The directory form of a file-name is very rarely used.
Almost all of the procedures in scsh name directories by giving
their file-name form (without the trailing slash), not their directory form.
So, you say ``\ex{/usr/include}'', and ``\ex{.}'', not
``\ex{/usr/include/}'' and ``''.
The sole exceptions are
\ex{file-name-as-directory} and \ex{directory-as-file-name},
whose jobs are to convert back-and-forth between these forms,
and \ex{file-name-directory}, whose job it is to split out the
directory portion of a file-name.
However, most procedures that expect a directory argument will coerce
a file-name in directory form to file-name form if it does not have
a trailing slash.
Bear in mind that the ambiguous case, empty string, will be
interpreted in file-name form, \ie, as root.


\subsubsection{Procedures}

\defun {file-name-directory?} {fname} \boolean
\defunx {file-name-non-directory?} {fname} \boolean
\begin{desc}
These predicates return true if the string is in directory form or
file-name form, respectively (see the discussion of these two forms above).
Note that they both return true on the ambiguous case of empty string,
which is both a directory (current working directory), and a file name
(the file-system root).
\begin{center}
\begin{tabular}{lll}
File name & \ex{\ldots-directory?} & \ex{\ldots-non-directory?} \\
\hline
\ex{"src/des"} & \ex{\sharpf} & \ex{\sharpt} \\
\ex{"src/des/"} & \ex{\sharpt} & \ex{\sharpf} \\
\ex{"/"} & \ex{\sharpt} & \ex{\sharpf} \\
\ex{"."} & \ex{\sharpf} & \ex{\sharpt} \\
\ex{""} & \ex{\sharpt} & \ex{\sharpt}
\end{tabular}
\end{center}
\end{desc}

\begin{defundesc} {file-name-as-directory} {fname} \str
Convert a file-name to directory form.
Basically, add a trailing slash if needed:
\begin{exampletable}
\ex{(file-name-as-directory "src/des")} & \ex{"src/des/"} \\
\ex{(file-name-as-directory "src/des/")} & \ex{"src/des/"} \\[2ex]
%
\header{\ex{.}, \ex{/}, and \ex{""} are special:}
\ex{(file-name-as-directory ".")} & \ex{""} \\
\ex{(file-name-as-directory "/")} & \ex{"/"} \\
\ex{(file-name-as-directory "")} & \ex{"/"}
\end{exampletable}
\end{defundesc}

\begin{defundesc} {directory-as-file-name} {fname} \str
Convert a directory to a simple file-name.
Basically, kill a trailing slash if one is present:
\begin{exampletable}
\ex{(directory-as-file-name "foo/bar/")} & \ex{"foo/bar"} \\[2ex]
%
\header{\ex{/} and \ex{""} are special:}
\ex{(directory-as-file-name "/")} & \ex{"/"} \\
\ex{(directory-as-file-name "")} & \ex{"."} (\ie, the cwd) \\
\end{exampletable}
\end{defundesc}

\begin{defundesc} {file-name-absolute?} {fname} \boolean
Does \var{fname} begin with a root or \ex{\~} component?
(Recognising \ex{\~} as a home-directory specification
is an extension of {\Posix} rules.)
%
\begin{exampletable}
\ex{(file-name-absolute? "/usr/shivers")} & {\sharpt} \\
\ex{(file-name-absolute? "src/des")} & {\sharpf} \\
\ex{(file-name-absolute? "\~/src/des")} & {\sharpt} \\[2ex]
%
\header{Non-obvious case:}
\ex{(file-name-absolute? "")} & {\sharpt} (\ie, root)
\end{exampletable}
\end{defundesc}


\begin{defundesc} {file-name-directory} {fname} {{\str} or false}
Return the directory component of \var{fname} in directory form.
If the file-name is already in directory form, return it as-is.
%
\begin{exampletable}
\ex{(file-name-directory "/usr/bdc")} & \ex{"/usr/"} \\
{\ex{(file-name-directory "/usr/bdc/")}} &
{\ex{"/usr/bdc/"}} \\
\ex{(file-name-directory "bdc/.login")} & \ex{"bdc/"} \\
\ex{(file-name-directory "main.c")} & \ex{""} \\[2ex]
%
\header{Root has no directory component:}
\ex{(file-name-directory "/")} & \ex{""} \\
\ex{(file-name-directory "")} & \ex{""}
\end{exampletable}
\end{defundesc}


\begin{defundesc} {file-name-nondirectory} {fname} \str
Return the non-directory component of \var{fname}.
%
\begin{exampletable}
{\ex{(file-name-nondirectory "/usr/ian")}} &
{\ex{"ian"}} \\
\ex{(file-name-nondirectory "/usr/ian/")} & \ex{""} \\
{\ex{(file-name-nondirectory "ian/.login")}} &
{\ex{".login"}} \\
\ex{(file-name-nondirectory "main.c")} & \ex{"main.c"} \\
\ex{(file-name-nondirectory "")} & \ex{""} \\
\ex{(file-name-nondirectory "/")} & \ex{"/"}
\end{exampletable}
\end{defundesc}


\begin{defundesc} {split-file-name} {fname} {{\str} list}
Split a file-name into its components.
%
\begin{exampletable}
\splitline{\ex{(split-file-name "src/des/main.c")}}
{\ex{("src" "des" "main.c")}} \\[1.5ex]
%
\splitline{\ex{(split-file-name "/src/des/main.c")}}
{\ex{("" "src" "des" "main.c")}} \\[1.5ex]
%
\splitline{\ex{(split-file-name "main.c")}} {\ex{("main.c")}} \\[1.5ex]
%
\splitline{\ex{(split-file-name "/")}} {\ex{("")}}
\end{exampletable}
\end{defundesc}


\begin{defundesc} {path-list->file-name} {path-list [dir]} \str
Inverse of \ex{split-file-name}.
\begin{code}
(path-list->file-name '("src" "des" "main.c"))
{\evalto} "src/des/main.c"
(path-list->file-name '("" "src" "des" "main.c"))
{\evalto} "/src/des/main.c"
\cb
{\rm{}Optional \var{dir} arg anchors relative path-lists:}
(path-list->file-name '("src" "des" "main.c")
                      "/usr/shivers")
{\evalto} "/usr/shivers/src/des/main.c"\end{code}
%
A useful value for the optional \var{dir} argument is \ex{(cwd)}.
\end{defundesc}


\begin{defundesc} {file-name-extension} {fname} \str
Return the file-name's extension.
%
\begin{exampletable}
\ex{(file-name-extension "main.c")} & \ex{".c"} \\
\ex{(file-name-extension "main.c.old")} & \ex{".old"} \\
\ex{(file-name-extension "/usr/shivers")} & \ex{""}
\end{exampletable}
%
\begin{exampletable}
\header{Weird cases:}
\ex{(file-name-extension "foo.")} & \ex{"."} \\
\ex{(file-name-extension "foo..")} & \ex{"."}
\end{exampletable}
%
\begin{exampletable}
\header{Dot files are not extensions:}
\ex{(file-name-extension "/usr/shivers/.login")} & \ex{""}
\end{exampletable}
\end{defundesc}


\begin{defundesc} {file-name-sans-extension} {fname} \str
Return everything but the extension.
%
\begin{exampletable}
\ex{(file-name-sans-extension "main.c")} & \ex{"main"} \\
\ex{(file-name-sans-extension "main.c.old")} & \ex{"main.c"} \\
\splitline{\ex{(file-name-sans-extension "/usr/shivers")}}
{\ex{"/usr/shivers"}}
\end{exampletable}
%
\begin{exampletable}
\header{Weird cases:}
\ex{(file-name-sans-extension "foo.")} & \ex{"foo"} \\
\ex{(file-name-sans-extension "foo..")} & \ex{"foo."} \\[2ex]
%
\header{Dot files are not extensions:}
\splitline{\ex{(file-name-sans-extension "/usr/shivers/.login")}}
{\ex{"/usr/shivers/.login"}}
\end{exampletable}

Note that appending the results of \ex{file-name-extension} and
{\ttt file\=name\=sans\=extension} in all cases produces the original file-name.
\end{defundesc}


\begin{defundesc} {parse-file-name} {fname} {[dir name extension]}
Let $f$ be \ex{(file-name-nondirectory \var{fname})}.
This function returns the three values:
\begin{itemize}
\item \ex{(file-name-directory \var{fname})}
\item \ex{(file-name-sans-extension \var{f})}
\item \ex{(file-name-extension \var{f}\/)}
\end{itemize}
The inverse of \ex{parse-file-name}, in all cases, is \ex{string-append}.
The boundary case of \ex{/} was chosen to preserve this inverse.
\end{defundesc}

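As an illustration, the three values can be received with
\ex{call-with-values} (assuming the underlying Scheme provides it) and
glued back together with \ex{string-append}. The file-name is arbitrary.
\begin{code}
(call-with-values
  (lambda () (parse-file-name "/usr/shivers/main.c"))
  (lambda (dir name ext)
    (list dir name ext (string-append dir name ext))))
{\evalto} ("/usr/shivers/" "main" ".c" "/usr/shivers/main.c")\end{code}
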
\begin{defundesc} {replace-extension} {fname ext} \str
This procedure replaces \var{fname}'s extension with \var{ext}.
It is exactly equivalent to
\codex{(string-append (file-name-sans-extension \var{fname}) \var{ext})}
\end{defundesc}

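Two examples that follow from this definition (the file-names are
arbitrary):
\begin{code}
(replace-extension "src/main.c" ".o")
{\evalto} "src/main.o"
(replace-extension "README" ".txt")    ; no extension to replace
{\evalto} "README.txt"\end{code}
Note that \var{ext} carries its own leading dot, since
\ex{file-name-extension} treats the dot as part of the extension.
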
\defun{simplify-file-name}{fname}\str
\begin{desc}
Removes leading and internal occurrences of dot.
A trailing dot is left alone, as the parent could be a symlink.
Removes internal and trailing double-slashes.
A leading double-slash is left alone, in accordance with {\Posix}.
However, triple and more leading slashes are reduced to a single slash,
in accordance with {\Posix}.
Double-dots (parent directory) are left alone, in case they come after
symlinks or appear in a \ex{/../\var{machine}/\ldots} ``super-root'' form
(which {\Posix} permits).
\end{desc}

\defun{resolve-file-name}{fname [dir]}\str
\begin{desc}
\begin{itemize}
\item Do \ex{\~} expansion.
\item If \var{dir} is given,
convert a relative file-name to an absolute file-name,
relative to directory \var{dir}.
\end{itemize}
\end{desc}

\begin{defundesc} {expand-file-name} {fname [dir]} \str
Resolve and simplify the file-name.
\end{defundesc}

\begin{defundesc} {absolute-file-name} {fname [dir]} \str
Convert file-name \var{fname} into an absolute file name,
relative to directory \var{dir}, which defaults to the current
working directory. The file name is simplified before being
returned.

This procedure does not treat a leading tilde character specially.
\end{defundesc}

\begin{defundesc} {home-dir} {[user]} \str
\ex{home-dir} returns \var{user}'s home directory.
\var{User} defaults to the current user.

\begin{exampletable}
\ex{(home-dir)} & \ex{"/user1/lecturer/shivers"} \\
\ex{(home-dir "ctkwan")} & \ex{"/user0/research/ctkwan"}
\end{exampletable}
\end{defundesc}

\begin{defundesc} {home-file} {[user] fname} \str
Returns file-name \var{fname} relative to \var{user}'s home directory;
\var{user} defaults to the current user.
%
\begin{exampletable}
\ex{(home-file "man")} & \ex{"/usr/shivers/man"} \\
\ex{(home-file "fcmlau" "man")} & \ex{"/usr/fcmlau/man"}
\end{exampletable}
\end{defundesc}

The general \ex{substitute-env-vars} string procedure,
defined in the previous section,
is also frequently useful for expanding file-names.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{ASCII encoding}

\defun {char->ascii}{\character} \integer
\defunx {ascii->char}{\integer} \character
\begin{desc}
These are identical to \ex{char->integer} and \ex{integer->char} except that
they use the {\Ascii} encoding.
\end{desc}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Character sets}
\label{sec:char-sets}

Scsh provides a \ex{char-set} type for expressing sets of characters.
These sets are used by some of the delimited-input procedures
(section~\ref{sec:field-reader}).
Scsh's character set package was adapted and extended from
Project Mac's MIT Scheme package.
Note that the character type used in the current implementation corresponds
to the ASCII character set---but you would be wise not to build this
assumption into your code if you can help it.\footnote{
Actually, it's slightly uglier than that, albeit somewhat more
useful. The current character type corresponds to an eight-bit
superset of ASCII. The \ex{ascii->char} and \ex{char->ascii}
functions will preserve this eighth bit. However, none of
the high 128 characters appear in any of the standard character
sets defined in section~\ref{sec:std-csets}, except for
\ex{char-set:full}. If someone would email the authors a listing
of the full Latin-1 definition, we'll be happy to upgrade these
sets' definitions to make them Latin-1 compliant.}

\defun{char-set?}{x}\boolean
\begin{desc}
Is the object \var{x} a character set?
\end{desc}

\defun{char-set=}{cs1 cs2}\boolean
\begin{desc}
Are the character sets \var{cs1} and \var{cs2} equal?
\end{desc}

\defun{char-set<=}{cs1 cs2}\boolean
\begin{desc}
Returns true if character set \var{cs1} is a subset of character set \var{cs2}.
\end{desc}

\defun{reduce-char-set}{kons knil cs}\object
\begin{desc}
This is the fundamental iterator for character sets.
Reduces the function \var{kons} across the character set \var{cs} using
initial state value \var{knil}.
That is, if \var{cs} is the empty set, the procedure returns \var{knil}.
Otherwise, some element \var{c} of \var{cs} is chosen; let \var{cs'} be
the remaining, unchosen characters.
The procedure returns
\begin{tightcode}
(reduce-char-set \var{kons} (\var{kons} \var{c} \var{knil}) \var{cs'})\end{tightcode}
For example, we could define \ex{char-set-members} (see below)
as
\begin{tightcode}
(lambda (cs) (reduce-char-set cons '() cs))\end{tightcode}
\end{desc}

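As a further sketch, counting the elements of a set is another simple
reduction. The name \ex{count-chars} is ours; scsh already provides
\ex{char-set-size} (see below) for this purpose.
\begin{code}
;;; KONS receives a character and the running count.
(define (count-chars cs)
  (reduce-char-set (lambda (c count) (+ count 1)) 0 cs))

(count-chars (string->char-set "hello"))    ; the set of h, e, l, o
{\evalto} 4\end{code}
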
\subsection{Side effects}
\defun{set-char-set!}{cs char in?}{\undefined}
\begin{desc}
This side-effects character set \var{cs}.
If \var{in?} is true, character \var{char} is added to the set.
Otherwise, it is deleted from the set.

Use of this procedure is deprecated, since it could damage other procedures
that retain pointers to existing character sets.
You should use \ex{set-char-set!} only in contexts where it is guaranteed that
there are no other pointers to the character set being modified.
(For example, functions that create character sets can use this function
to efficiently construct the character set, after which time the set is
used in a pure-functional, shared manner.)
\end{desc}

\defun{char-set-for-each}{p cs}{\undefined}
\begin{desc}
Apply procedure \var{p} to each character in the character set \var{cs}.
Note that the order in which \var{p} is applied to the characters in the
set is not specified, and may even change from application to application.
\end{desc}

\defun{copy-char-set}{cs}{char-set}
\begin{desc}
Returns a copy of the character set \var{cs}.
\end{desc}

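One way to respect the restriction on \ex{set-char-set!} is to mutate only
a fresh copy, as in the sketch below. The helper \ex{char-set-with} is
hypothetical, not part of scsh.
\begin{code}
;;; Return a new set containing the members of CS plus the character C.
;;; CS itself is left untouched.
(define (char-set-with cs c)
  (let ((new (copy-char-set cs)))    ; nobody else holds a pointer to NEW
    (set-char-set! new c #t)
    new))\end{code}
Because the mutation happens before the new set is returned, callers
only ever see it in its final state and can share it freely.
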
\subsection{Creating character sets}

\defun{char-set}{\vari{char}1\ldots}{char-set}
\begin{desc}
Return a character set containing the given characters.
\end{desc}

\defun{chars->char-set}{chars}{char-set}
\begin{desc}
Return a character set containing the characters in the list \var{chars}.
\end{desc}

\defun{string->char-set}{s}{char-set}
\begin{desc}
Return a character set containing the characters in the string \var{s}.
\end{desc}

\defun{predicate->char-set}{pred}{char-set}
\begin{desc}
Returns a character set containing every character \var{c} such that
\ex{(\var{pred} \var{c})} returns true.
\end{desc}

\defun{ascii-range->char-set}{lower upper}{char-set}
\begin{desc}
Returns a character set containing every character whose {\Ascii}
code lies in the half-open range $[\var{lower},\var{upper})$.
\end{desc}

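Because the range is half-open, the upper bound is one past the last
code wanted. For example, the decimal digits occupy {\Ascii} codes 48
through 57, so a set of them can be sketched as follows (the name
\ex{digits} is ours; \ex{char-set-size} is documented below):
\begin{code}
(define digits (ascii-range->char-set 48 58))
(char-set-size digits)
{\evalto} 10\end{code}
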
\subsection{Querying character sets}
\defun {char-set-members}{char-set}{character-list}
\begin{desc}
This procedure returns a list of the members of \var{char-set}.
\end{desc}

\defun{char-set-contains?}{char-set char}\boolean
\begin{desc}
This procedure tests \var{char} for membership in set \var{char-set}.
\remark{Previous releases of scsh called this procedure \ex{char-set-member?},
reversing the order of the arguments.
This made sense, but was unfortunately the reverse of the order in which the
arguments appear in MIT Scheme.
A reasonable argument order was not backwards-compatible with MIT Scheme;
on the other hand, the MIT Scheme argument order was counter-intuitive
and at odds with common mathematical notation and the \ex{member} family
of R4RS procedures.

We sought to escape the dilemma by shifting to a new name.}
\end{desc}

\defun{char-set-size}{cs}\integer
\begin{desc}
Returns the number of elements in character set \var{cs}.
\end{desc}

\subsection{Character set algebra}
\defun {char-set-invert}{char-set}{char-set}
\defunx{char-set-union}{\vari{char-set}1\ldots}{char-set}
\defunx{char-set-intersection}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
\defunx{char-set-difference}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
\begin{desc}
These procedures implement set complement, union, intersection, and difference
for character sets.
The union, intersection, and difference operations are n-ary, associating
to the left; the difference function requires at least one argument, while
union and intersection may be applied to zero arguments.
\end{desc}

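A brief sketch of these operations, using the standard sets described in
the next section and assuming their {\Ascii} definitions. The names
\ex{vowels} and \ex{consonants} are ours.
\begin{code}
(define vowels (string->char-set "aeiou"))
(define consonants (char-set-difference char-set:lower-case vowels))

(char-set-size consonants)    ; 26 lower-case letters less 5 vowels
{\evalto} 21
(char-set= char-set:lower-case (char-set-union vowels consonants))
{\evalto} #t\end{code}
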
\subsection{Standard character sets}
\label{sec:std-csets}
Several character sets are predefined for convenience:

\begin{center}
\newcommand{\entry}[1]{\ex{#1}\index{#1}}
\begin{tabular}{|ll|}
\hline
\entry{char-set:alphabetic} & Alphabetic chars \\
\entry{char-set:lower-case} & Lower-case alphabetic chars \\
\entry{char-set:upper-case} & Upper-case alphabetic chars \\
\entry{char-set:numeric} & Decimal digits: 0--9 \\
\entry{char-set:alphanumeric} & Alphabetic or numeric \\
\entry{char-set:graphic} & Printing characters except space \\
\entry{char-set:printing} & Printing characters including space \\
\entry{char-set:whitespace} & Whitespace characters \\
\entry{char-set:blank} & Blank characters \\
\entry{char-set:control} & Control characters \\
\entry{char-set:punctuation} & Punctuation characters \\
\entry{char-set:hex-digit} & A hexadecimal digit: 0--9, A--F, a--f \\
\entry{char-set:ascii} & A character in the ASCII set. \\
\entry{char-set:empty} & Empty set \\
\entry{char-set:full} & All characters \\
\hline
\end{tabular}
\end{center}
The first twelve of these correspond to the character classes defined in
{\Posix}.
Note that there may be characters in \ex{char-set:alphabetic} that are
neither upper nor lower case---this might occur in implementations that
use a character type richer than ASCII, such as Unicode.
A ``graphic character'' is one that would put ink on your page.
While the exact composition of these sets may vary depending upon the
character type provided by the Scheme system upon which scsh is running,
here are the definitions for some of the sets in an ASCII character set:
\begin{center}
\newcommand{\entry}[1]{\ex{#1}\index{#1}}
\begin{tabular}{|ll|}
\hline
char-set:alphabetic & A--Z and a--z \\
char-set:lower-case & a--z \\
char-set:upper-case & A--Z \\
char-set:graphic & Alphanumeric + punctuation \\
char-set:whitespace & Space, newline, tab, page,
vertical tab, carriage return \\
char-set:blank & Space and tab \\
char-set:control & ASCII 0--31 and 127 \\
char-set:punctuation & \verb|!"#$%&'()*+,-./:;<=>|\verb#?@[\]^_`{|}~# \\
\hline
\end{tabular}
\end{center}


\defun {char-alphabetic?}\character\boolean
\defunx{char-lower-case?}\character\boolean
\defunx{char-upper-case?}\character\boolean
\defunx{char-numeric?}\character\boolean
\defunx{char-alphanumeric?}\character\boolean
\defunx{char-graphic?}\character\boolean
\defunx{char-printing?}\character\boolean
\defunx{char-whitespace?}\character\boolean
\defunx{char-blank?}\character\boolean
\defunx{char-control?}\character\boolean
\defunx{char-punctuation?}\character\boolean
\defunx{char-hex-digit?}\character\boolean
\defunx{char-ascii?}\character\boolean
\begin{desc}
These predicates are defined in terms of the above character sets.
\end{desc}