|
|
|
% -*- latex -*-
|
|
|
|
\chapter{Strings and characters}
|
|
|
|
|
|
|
|
Scsh provides a set of procedures for processing strings and characters.
|
|
|
|
The procedures provided match regular expressions, search strings,
|
|
|
|
parse file-names, and manipulate sets of characters.
|
|
|
|
|
|
|
|
Also see chapters \ref{chapt:sre}, \ref{chapt:rdelim} and \ref{chapt:fr-awk}
|
|
|
|
on regular-expressions, record I/O, field parsing, and the awk loop.
|
|
|
|
The procedures documented there allow you to search and pattern-match strings,
|
|
|
|
read character-delimited records from ports,
|
|
|
|
use regular expressions to split the records into fields
|
|
|
|
(for example, splitting a string at every occurrence of colon or white-space),
|
|
|
|
and loop over streams of these records in a convenient way.
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\section{String manipulation}
|
|
|
|
\label{sec:stringmanip}
|
|
|
|
|
|
|
|
Strings are the basic communication medium for {\Unix} processes, so a
|
|
|
|
shell language must have reasonable facilities for manipulating them.
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Manipulating file-names}
|
|
|
|
\label{sec:filenames}
|
|
|
|
|
|
|
|
These procedures do not access the file-system at all; they merely operate
|
|
|
|
on file-name strings. Much of this structure is patterned after the GNU Emacs
|
|
|
|
design. Perhaps a more sophisticated system would be better, something
|
|
|
|
like the pathname abstractions of {\CommonLisp} or MIT Scheme. However,
|
|
|
|
being {\Unix}-specific, we can be a little less general.
|
|
|
|
|
|
|
|
\subsubsection{Terminology}
|
|
|
|
These procedures carefully adhere to the {\Posix} standard for file-name
|
|
|
|
resolution, which occasionally entails some slightly odd things.
|
|
|
|
This section will describe these rules, and give some basic terminology.
|
|
|
|
|
|
|
|
A \emph{file-name} is either the file-system root (``/''),
|
|
|
|
or a series of slash-terminated directory components, followed by
|
|
|
|
a file component.
|
|
|
|
Root is the only file-name that may end in slash.
|
|
|
|
Some examples:
|
|
|
|
\begin{center}
|
|
|
|
\begin{tabular}{lll}
|
|
|
|
File name & Dir components & File component \\\hline
|
|
|
|
\ex{src/des/main.c} & \ex{("src" "des")} & \ex{"main.c"} \\
|
|
|
|
\ex{/src/des/main.c} & \ex{("" "src" "des")} & \ex{"main.c"} \\
|
|
|
|
\ex{main.c} & \ex{()} & \ex{"main.c"} \\
|
|
|
|
\end{tabular}
|
|
|
|
\end{center}
|
|
|
|
|
|
|
|
Note that the relative filename \ex{src/des/main.c} and the absolute filename
|
|
|
|
\ex{/src/des/main.c} are distinguished by the presence of the root component
|
|
|
|
\ex{""} in the absolute path.
|
|
|
|
|
|
|
|
Multiple embedded slashes within a path have the same meaning as
|
|
|
|
a single slash.
|
|
|
|
More than two leading slashes at the beginning of a path have the same
|
|
|
|
meaning as a single leading slash---they indicate that the file-name
|
|
|
|
is an absolute one, with the path leading from root.
|
|
|
|
However, {\Posix} permits the OS to give special meaning to
|
|
|
|
\emph{two} leading slashes.
|
|
|
|
For this reason, the routines in this section do not simplify two leading
|
|
|
|
slashes to a single slash.
|
|
|
|
|
|
|
|
A file-name in \emph{directory form} is either a file-name terminated by
|
|
|
|
a slash, \eg, ``\ex{/src/des/}'', or the empty string, ``''.
|
|
|
|
The empty string corresponds to the current working directory,
|
|
|
|
whose file-name is dot (``\ex{.}'').
|
|
|
|
Working backwards from the append-a-slash rule,
|
|
|
|
we extend the syntax of {\Posix} file-names to define the empty string
|
|
|
|
to be a file-name form of the root directory ``\ex{/}''.
|
|
|
|
(However, ``\ex{/}'' is also acceptable as a file-name form for root.)
|
|
|
|
So the empty string has two interpretations:
|
|
|
|
as a file-name form, it is the file-system root;
|
|
|
|
as a directory form, it is the current working directory.
|
|
|
|
Slash is also an ambiguous form: \ex{/} is both a directory-form and
|
|
|
|
a file-name form.
|
|
|
|
|
|
|
|
The directory form of a file-name is very rarely used.
|
|
|
|
Almost all of the procedures in scsh name directories by giving
|
|
|
|
their file-name form (without the trailing slash), not their directory form.
|
|
|
|
So, you say ``\ex{/usr/include}'', and ``\ex{.}'', not
|
|
|
|
``\ex{/usr/include/}'' and ``''.
|
|
|
|
The sole exceptions are
|
|
|
|
\ex{file-name-as-directory} and \ex{directory-as-file-name},
|
|
|
|
whose jobs are to convert back-and-forth between these forms,
|
|
|
|
and \ex{file-name-directory}, whose job it is to split out the
|
|
|
|
directory portion of a file-name.
|
|
|
|
However, most procedures that expect a directory argument will coerce
|
|
|
|
a file-name in directory form to file-name form if it does not have
|
|
|
|
a trailing slash.
|
|
|
|
Bear in mind that the ambiguous case, empty string, will be
|
|
|
|
interpreted in file-name form, \ie, as root.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
\subsubsection{Procedures}
|
|
|
|
|
|
|
|
\defun {file-name-directory?} {fname} \boolean
|
|
|
|
\defunx {file-name-non-directory?} {fname} \boolean
|
|
|
|
\begin{desc}
|
|
|
|
These predicates return true if the string is in directory form, or
|
|
|
|
file-name form, respectively (see the above discussion of these two forms).
|
|
|
|
Note that they both return true on the ambiguous case of empty string,
|
|
|
|
which is both a directory (current working directory), and a file name
|
|
|
|
(the file-system root).
|
|
|
|
\begin{center}
|
|
|
|
\begin{tabular}{lll}
|
|
|
|
File name & \ex{\ldots-directory?} & \ex{\ldots-non-directory?} \\
|
|
|
|
\hline
|
|
|
|
\ex{"src/des"} & \ex{\sharpf} & \ex{\sharpt} \\
|
|
|
|
\ex{"src/des/"} & \ex{\sharpt} & \ex{\sharpf} \\
|
|
|
|
\ex{"/"} & \ex{\sharpt} & \ex{\sharpf} \\
|
|
|
|
\ex{"."} & \ex{\sharpf} & \ex{\sharpt} \\
|
|
|
|
\ex{""} & \ex{\sharpt} & \ex{\sharpt}
|
|
|
|
\end{tabular}
|
|
|
|
\end{center}
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\begin{defundesc} {file-name-as-directory} {fname} \str
|
|
|
|
Convert a file-name to directory form.
|
|
|
|
Basically, add a trailing slash if needed:
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(file-name-as-directory "src/des")} & \ex{"src/des/"} \\
|
|
|
|
\ex{(file-name-as-directory "src/des/")} & \ex{"src/des/"} \\[2ex]
|
|
|
|
%
|
|
|
|
\header{\ex{.}, \ex{/}, and \ex{""} are special:}
|
|
|
|
\ex{(file-name-as-directory ".")} & \ex{""} \\
|
|
|
|
\ex{(file-name-as-directory "/")} & \ex{"/"} \\
|
|
|
|
\ex{(file-name-as-directory "")} & \ex{"/"}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
\begin{defundesc} {directory-as-file-name} {fname} \str
|
|
|
|
Convert a directory to a simple file-name.
|
|
|
|
Basically, kill a trailing slash if one is present:
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(directory-as-file-name "foo/bar/")} & \ex{"foo/bar"} \\[2ex]
|
|
|
|
%
|
|
|
|
\header{\ex{/} and \ex{""} are special:}
|
|
|
|
\ex{(directory-as-file-name "/")} & \ex{"/"} \\
|
|
|
|
\ex{(directory-as-file-name "")} & \ex{"."} (\ie, the cwd) \\
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
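
As a small illustration of the two conversions above, the following sketch
round-trips a name between its two forms.
The expected values in the comments are read off the rules and example tables
above, not taken from a run; the sample name \ex{src/des} is arbitrary.
\begin{verbatim}
(file-name-as-directory "src/des")     ; => "src/des/"
(directory-as-file-name "src/des/")    ; => "src/des"

;; The current working directory is the special case:
(file-name-as-directory ".")           ; => ""
(directory-as-file-name "")            ; => "."
\end{verbatim}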
|
|
|
|
|
|
|
|
\begin{defundesc} {file-name-absolute?} {fname} \boolean
|
|
|
|
Does \var{fname} begin with a root or \ex{\~} component?
|
|
|
|
(Recognising \ex{\~} as a home-directory specification
|
|
|
|
is an extension of {\Posix} rules.)
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(file-name-absolute? "/usr/shivers")} & {\sharpt} \\
|
|
|
|
\ex{(file-name-absolute? "src/des")} & {\sharpf} \\
|
|
|
|
\ex{(file-name-absolute? "\~/src/des")} & {\sharpt} \\[2ex]
|
|
|
|
%
|
|
|
|
\header{Non-obvious case:}
|
|
|
|
\ex{(file-name-absolute? "")} & {\sharpt} (\ie, root)
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {file-name-directory} {fname} {{\str} or false}
|
|
|
|
Return the directory component of \var{fname} in directory form.
|
|
|
|
If the file-name is already in directory form, return it as-is.
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(file-name-directory "/usr/bdc")} & \ex{"/usr/"} \\
|
|
|
|
{\ex{(file-name-directory "/usr/bdc/")}} &
|
|
|
|
{\ex{"/usr/bdc/"}} \\
|
|
|
|
\ex{(file-name-directory "bdc/.login")} & \ex{"bdc/"} \\
|
|
|
|
\ex{(file-name-directory "main.c")} & \ex{""} \\[2ex]
|
|
|
|
%
|
|
|
|
\header{Root has no directory component:}
|
|
|
|
\ex{(file-name-directory "/")} & \ex{""} \\
|
|
|
|
\ex{(file-name-directory "")} & \ex{""}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {file-name-nondirectory} {fname} \str
|
|
|
|
Return the non-directory component of \var{fname}.
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
{\ex{(file-name-nondirectory "/usr/ian")}} &
|
|
|
|
{\ex{"ian"}} \\
|
|
|
|
\ex{(file-name-nondirectory "/usr/ian/")} & \ex{""} \\
|
|
|
|
{\ex{(file-name-nondirectory "ian/.login")}} &
|
|
|
|
{\ex{".login"}} \\
|
|
|
|
\ex{(file-name-nondirectory "main.c")} & \ex{"main.c"} \\
|
|
|
|
\ex{(file-name-nondirectory "")} & \ex{""} \\
|
|
|
|
\ex{(file-name-nondirectory "/")} & \ex{"/"}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {split-file-name} {fname} {{\str} list}
|
|
|
|
Split a file-name into its components.
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\splitline{\ex{(split-file-name "src/des/main.c")}}
|
|
|
|
{\ex{("src" "des" "main.c")}} \\[1.5ex]
|
|
|
|
%
|
|
|
|
\splitline{\ex{(split-file-name "/src/des/main.c")}}
|
|
|
|
{\ex{("" "src" "des" "main.c")}} \\[1.5ex]
|
|
|
|
%
|
|
|
|
\splitline{\ex{(split-file-name "main.c")}} {\ex{("main.c")}} \\[1.5ex]
|
|
|
|
%
|
|
|
|
\splitline{\ex{(split-file-name "/")}} {\ex{("")}}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {path-list->file-name} {path-list [dir]} \str
|
|
|
|
Inverse of \ex{split-file-name}.
|
|
|
|
\begin{code}
(path-list->file-name '("src" "des" "main.c"))
    {\evalto} "src/des/main.c"
(path-list->file-name '("" "src" "des" "main.c"))
    {\evalto} "/src/des/main.c"
\cb
{\rm{}Optional \var{dir} arg anchors relative path-lists:}
(path-list->file-name '("src" "des" "main.c")
                      "/usr/shivers")
    {\evalto} "/usr/shivers/src/des/main.c"\end{code}
|
|
|
|
%
|
|
|
|
A useful value for the optional \var{dir} argument is \ex{(cwd)}.
|
|
|
|
\end{defundesc}
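
Because \ex{path-list->file-name} is the inverse of \ex{split-file-name},
composing the two should give back the original name.
The sketch below checks this on the examples used above;
\ex{round-trip} is a hypothetical helper introduced only for illustration.
\begin{verbatim}
;; Split a file-name into components, then reassemble it.
(define (round-trip fname)
  (path-list->file-name (split-file-name fname)))

(round-trip "src/des/main.c")     ; => "src/des/main.c"
(round-trip "/src/des/main.c")    ; => "/src/des/main.c"
\end{verbatim}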
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {file-name-extension} {fname} \str
|
|
|
|
Return the file-name's extension.
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(file-name-extension "main.c")} & \ex{".c"} \\
|
|
|
|
\ex{(file-name-extension "main.c.old")} & \ex{".old"} \\
|
|
|
|
\ex{(file-name-extension "/usr/shivers")} & \ex{""}
|
|
|
|
\end{exampletable}
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\header{Weird cases:}
|
|
|
|
\ex{(file-name-extension "foo.")} & \ex{"."} \\
|
|
|
|
\ex{(file-name-extension "foo..")} & \ex{"."}
|
|
|
|
\end{exampletable}
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\header{Dot files are not extensions:}
|
|
|
|
\ex{(file-name-extension "/usr/shivers/.login")} & \ex{""}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {file-name-sans-extension} {fname} \str
|
|
|
|
Return everything but the extension.
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(file-name-sans-extension "main.c")} & \ex{"main"} \\
|
|
|
|
\ex{(file-name-sans-extension "main.c.old")} & \ex{"main.c"} \\
|
|
|
|
\splitline{\ex{(file-name-sans-extension "/usr/shivers")}}
|
|
|
|
{\ex{"/usr/shivers"}}
|
|
|
|
\end{exampletable}
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\header{Weird cases:}
|
|
|
|
\ex{(file-name-sans-extension "foo.")} & \ex{"foo"} \\
|
|
|
|
\ex{(file-name-sans-extension "foo..")} & \ex{"foo."} \\[2ex]
|
|
|
|
%
|
|
|
|
\header{Dot files are not extensions:}
|
|
|
|
\splitline{\ex{(file-name-sans-extension "/usr/shivers/.login")}}
|
|
|
|
{\ex{"/usr/shivers/.login"}}
|
|
|
|
\end{exampletable}
|
|
|
|
|
|
|
|
Note that appending the result of \ex{file-name-extension} to the result of
{\ttt file\=name\=sans\=extension} in all cases produces the original file-name.
|
|
|
|
\end{defundesc}
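
The invariant relating the two extension procedures can be written down
directly.
This is only a sketch: \ex{extension-invariant-holds?} is a hypothetical
helper, and the expected results follow from the example tables above.
\begin{verbatim}
;; True if splitting off the extension and gluing it back on
;; reproduces the original file-name.
(define (extension-invariant-holds? fname)
  (string=? fname
            (string-append (file-name-sans-extension fname)
                           (file-name-extension fname))))

(extension-invariant-holds? "main.c.old")            ; => #t
(extension-invariant-holds? "foo..")                 ; => #t
(extension-invariant-holds? "/usr/shivers/.login")   ; => #t
\end{verbatim}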
|
|
|
|
|
|
|
|
|
|
|
|
\begin{defundesc} {parse-file-name} {fname} {[dir name extension]}
|
|
|
|
Let $f$ be \ex{(file-name-nondirectory \var{fname})}.
|
|
|
|
This function returns the three values:
|
|
|
|
\begin{itemize}
|
|
|
|
\item \ex{(file-name-directory \var{fname})}
|
|
|
|
\item \ex{(file-name-sans-extension \var{f})}
|
|
|
|
\item \ex{(file-name-extension \var{f}\/)}
|
|
|
|
\end{itemize}
|
|
|
|
The inverse of \ex{parse-file-name}, in all cases, is \ex{string-append}.
|
|
|
|
The boundary case of \ex{/} was chosen to preserve this inverse.
|
|
|
|
\end{defundesc}
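
Since \ex{parse-file-name} returns multiple values, a caller needs
\ex{call-with-values} (or an equivalent binding form) to receive them.
The following is a minimal sketch; the sample path is arbitrary and the
expected values are derived from the definitions above rather than from a run.
\begin{verbatim}
;; Collect the three parts into a list.
(call-with-values
  (lambda () (parse-file-name "/usr/shivers/main.c"))
  (lambda (dir name ext)
    (list dir name ext)))
;; => ("/usr/shivers/" "main" ".c")

;; string-append is the inverse.
(call-with-values
  (lambda () (parse-file-name "/usr/shivers/main.c"))
  string-append)
;; => "/usr/shivers/main.c"
\end{verbatim}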
|
|
|
|
|
|
|
|
\begin{defundesc} {replace-extension} {fname ext} \str
|
|
|
|
This procedure replaces \var{fname}'s extension with \var{ext}.
|
|
|
|
It is exactly equivalent to
|
|
|
|
\codex{(string-append (file-name-sans-extension \var{fname}) \var{ext})}
|
|
|
|
\end{defundesc}
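
A few sample calls, worked out from the equivalence above rather than
from a run.
Note that \var{ext} is expected to carry its own leading dot,
since \ex{file-name-sans-extension} strips the dot along with the extension.
\begin{verbatim}
(replace-extension "main.c" ".o")                 ; => "main.o"
(replace-extension "main.c.old" "")               ; => "main.c"

;; A dot file has no extension, so the new one is simply appended:
(replace-extension "/usr/shivers/.login" ".bak")  ; => "/usr/shivers/.login.bak"
\end{verbatim}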
|
|
|
|
|
|
|
|
\defun{simplify-file-name}{fname}\str
|
|
|
|
\begin{desc}
|
|
|
|
Removes leading and internal occurrences of dot.
|
|
|
|
A trailing dot is left alone, as the parent could be a symlink.
|
|
|
|
Removes internal and trailing double-slashes.
|
|
|
|
A leading double-slash is left alone, in accordance with {\Posix}.
|
|
|
|
However, triple and more leading slashes are reduced to a single slash,
|
|
|
|
in accordance with {\Posix}.
|
|
|
|
Double-dots (parent directory) are left alone, in case they come after
|
|
|
|
symlinks or appear in a \ex{/../\var{machine}/\ldots} ``super-root'' form
|
|
|
|
(which {\Posix} permits).
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{resolve-file-name}{fname [dir]}\str
|
|
|
|
\begin{desc}
|
|
|
|
\begin{itemize}
|
|
|
|
\item Do \ex{\~} expansion.
|
|
|
|
\item If \var{dir} is given,
|
|
|
|
convert a relative file-name to an absolute file-name,
|
|
|
|
relative to directory \var{dir}.
|
|
|
|
\end{itemize}
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\begin{defundesc} {expand-file-name} {fname [dir]} \str
|
|
|
|
Resolve and simplify the file-name.
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
\begin{defundesc} {absolute-file-name} {fname [dir]} \str
|
|
|
|
Convert file-name \var{fname} into an absolute file name,
|
|
|
|
relative to directory \var{dir}, which defaults to the current
|
|
|
|
working directory. The file name is simplified before being
|
|
|
|
returned.
|
|
|
|
|
|
|
|
This procedure does not treat a leading tilde character specially.
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
\begin{defundesc} {home-dir} {[user]} \str
|
|
|
|
\ex{home-dir} returns \var{user}'s home directory.
|
|
|
|
\var{User} defaults to the current user.
|
|
|
|
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(home-dir)} & \ex{"/user1/lecturer/shivers"} \\
|
|
|
|
\ex{(home-dir "ctkwan")} & \ex{"/user0/research/ctkwan"}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
\begin{defundesc} {home-file} {[user] fname} \str
|
|
|
|
Returns file-name \var{fname} relative to \var{user}'s home directory;
|
|
|
|
\var{user} defaults to the current user.
|
|
|
|
%
|
|
|
|
\begin{exampletable}
|
|
|
|
\ex{(home-file "man")} & \ex{"/usr/shivers/man"} \\
|
|
|
|
\ex{(home-file "fcmlau" "man")} & \ex{"/usr/fcmlau/man"}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
The general \ex{substitute-env-vars} string procedure,
|
|
|
|
defined in the next subsection,
|
|
|
|
is also frequently useful for expanding file-names.
|
|
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Other string manipulation facilities}
|
|
|
|
|
|
|
|
\defun {index} {string char [start]} {{\fixnum} or false}
|
|
|
|
\defunx {rindex} {string char [start]} {{\fixnum} or false}
|
|
|
|
\begin{desc}
|
|
|
|
These procedures search through \var{string} looking for an occurrence
|
|
|
|
of character \var{char}. \ex{index} searches left-to-right; \ex{rindex}
|
|
|
|
searches right-to-left.
|
|
|
|
|
|
|
|
\ex{index} returns the smallest index $i$ of \var{string} greater
|
|
|
|
than or equal to \var{start} such that $\var{string}[i] = \var{char}$.
|
|
|
|
The default for \var{start} is zero. If there is no such match,
|
|
|
|
\ex{index} returns false.
|
|
|
|
|
|
|
|
\ex{rindex} returns the largest index $i$ of \var{string} less than
|
|
|
|
\var{start} such that $\var{string}[i] = \var{char}$.
|
|
|
|
The default for \var{start} is \ex{(string-length \var{string})}.
|
|
|
|
If there is no such match, \ex{rindex} returns false.
|
|
|
|
\end{desc}
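
The effect of the \var{start} argument is easiest to see on a string with a
repeated character.
The values below are worked out by hand from the definitions above
(the characters of \ex{"hello"} have indices 0 through 4); they are an
illustration, not output captured from scsh.
\begin{verbatim}
(index  "hello" #\l)      ; => 2   leftmost match
(rindex "hello" #\l)      ; => 3   rightmost match
(index  "hello" #\l 3)    ; => 3   search begins at index 3
(rindex "hello" #\l 3)    ; => 2   only indices below 3 are considered
(index  "hello" #\z)      ; => #f  no match
\end{verbatim}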
|
|
|
|
|
|
|
|
I should probably snarf all the MIT Scheme string functions, and stick them
|
|
|
|
in a package. {\Unix} programs need to mung character strings a lot.
|
|
|
|
|
|
|
|
MIT string match commands:
|
|
|
|
\begin{tightcode}
[sub]string-match-{forward,backward}[-ci]
[sub]string-{prefix,suffix}[-ci]?
[sub]string-find-{next,previous}-char[-ci]
[sub]string-find-{next,previous}-char-in-set
[sub]string-replace[!]
\ldots\etc\end{tightcode}
|
|
|
|
These are not currently provided.
|
|
|
|
|
|
|
|
\begin{defundesc} {substitute-env-vars} {fname} \str
|
|
|
|
Replace occurrences of environment variables with their values.
|
|
|
|
An environment variable is denoted by a dollar sign followed by
|
|
|
|
alphanumeric chars and underscores; the name may be surrounded by braces.
|
|
|
|
|
|
|
|
\begin{exampletable}
|
|
|
|
\splitline{\ex{(substitute-env-vars "\$USER/.login")}}
|
|
|
|
{\ex{"shivers/.login"}} \\
|
|
|
|
\cd{(substitute-env-vars "$\{USER\}_log")} & \cd{"shivers_log"}
|
|
|
|
\end{exampletable}
|
|
|
|
\end{defundesc}
|
|
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\section{ASCII encoding}
|
|
|
|
|
|
|
|
\defun {char->ascii}{\character} \integer
|
|
|
|
\defunx {ascii->char}{\integer} \character
|
|
|
|
\begin{desc}
|
|
|
|
These are identical to \ex{char->integer} and \ex{integer->char} except that
|
|
|
|
they use the {\Ascii} encoding.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\section{Character sets}
|
|
|
|
\label{sec:char-sets}
|
|
|
|
|
|
|
|
Scsh provides a \ex{char-set} type for expressing sets of characters.
|
|
|
|
These sets are used by some of the delimited-input procedures
|
|
|
|
(section~\ref{sec:field-reader}).
|
|
|
|
Scsh's character set package was adapted and extended from
|
|
|
|
Project Mac's MIT Scheme package.
|
|
|
|
Note that the character type used in the current implementation corresponds
|
|
|
|
to the ASCII character set---but you would be wise not to build this
|
|
|
|
assumption into your code if you can help it.\footnote{
|
|
|
|
Actually, it's slightly uglier than that, albeit somewhat more
|
|
|
|
useful. The current character type corresponds to an eight-bit
|
|
|
|
superset of ASCII. The \ex{ascii->char} and \ex{char->ascii}
|
|
|
|
functions will preserve this eighth bit. However, none of the
|
|
|
|
high 128 characters appear in any of the standard character
|
|
|
|
sets defined in section~\ref{sec:std-csets}, except for
|
|
|
|
\ex{char-set:full}. If someone would email the authors a listing
|
|
|
|
of the full Latin-1 definition, we'll be happy to upgrade these
|
|
|
|
sets' definitions to make them Latin-1 compliant.}
|
|
|
|
|
|
|
|
\defun{char-set?}{x}\boolean
|
|
|
|
\begin{desc}
|
|
|
|
Is the object \var{x} a character set?
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set=}{\vari{cs}1 \vari{cs}2\ldots}\boolean
|
|
|
|
\begin{desc}
|
|
|
|
Are the character sets equal?
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set<=}{\vari{cs}1 \vari{cs}2\ldots}\boolean
|
|
|
|
\begin{desc}
|
|
|
|
Returns true if every character set \vari{cs}{i} is
|
|
|
|
a subset of character set \vari{cs}{i+1}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set-fold}{kons knil cs}\object
|
|
|
|
\begin{desc}
|
|
|
|
This is the fundamental iterator for character sets.
|
|
|
|
Applies the function \var{kons} across the character set \var{cs} using
|
|
|
|
initial state value \var{knil}.
|
|
|
|
That is, if \var{cs} is the empty set, the procedure returns \var{knil}.
|
|
|
|
Otherwise, some element \var{c} of \var{cs} is chosen; let \var{cs'} be
|
|
|
|
the remaining, unchosen characters.
|
|
|
|
The procedure returns
|
|
|
|
\begin{tightcode}
(char-set-fold \var{kons} (\var{kons} \var{c} \var{knil}) \var{cs'})\end{tightcode}
|
|
|
|
For example, we could define \ex{char-set-members} (see below)
|
|
|
|
as
|
|
|
|
\begin{tightcode}
(lambda (cs) (char-set-fold cons '() cs))\end{tightcode}
|
|
|
|
|
|
|
|
\remark{This procedure was formerly named \texttt{\indx{reduce-char-set}}.
|
|
|
|
The old binding is still provided, but is deprecated and will
|
|
|
|
probably vanish in a future release.}
|
|
|
|
\end{desc}
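
As a further illustration of \ex{char-set-fold}, the sketch below counts the
members of a set, which is what \ex{char-set-size} computes.
\ex{count-members} is a hypothetical name used only here.
Because the order in which characters are chosen is unspecified, a \var{kons}
procedure should produce a result that does not depend on that order, as this
one does.
\begin{verbatim}
;; kons receives a character and the running state; here the
;; character is ignored and the state is a count.
(define (count-members cs)
  (char-set-fold (lambda (c count) (+ count 1)) 0 cs))

(count-members (char-set #\a #\b #\c))   ; => 3
(count-members char-set:empty)           ; => 0
\end{verbatim}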
|
|
|
|
|
|
|
|
\defun{char-set-for-each}{p cs}{\undefined}
|
|
|
|
\begin{desc}
|
|
|
|
Apply procedure \var{p} to each character in the character set \var{cs}.
|
|
|
|
Note that the order in which \var{p} is applied to the characters in the
|
|
|
|
set is not specified, and may even change from application to application.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Creating character sets}
|
|
|
|
|
|
|
|
\defun{char-set}{\vari{char}1\ldots}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Return a character set containing the given characters.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{chars->char-set}{chars}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Return a character set containing the characters in the list \var{chars}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{string->char-set}{s}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Return a character set containing the characters in the string \var{s}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{predicate->char-set}{pred}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Returns a character set containing every character \var{c} such that
|
|
|
|
\ex{(\var{pred} \var{c})} returns true.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{ascii-range->char-set}{lower upper}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Returns a character set containing every character whose {\Ascii}
|
|
|
|
code lies in the half-open range $[\var{lower},\var{upper})$.
|
|
|
|
\end{desc}
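
The half-open convention means the upper bound itself is excluded.
As a sketch, assuming the {\Ascii} codes 48 through 57 for the decimal digits
and 58 for colon, the range $[48,58)$ yields exactly the digits.
The name \ex{digits} is introduced only for this example.
\begin{verbatim}
(define digits (ascii-range->char-set 48 58))    ; codes 48..57

(char-set-contains? digits #\9)    ; => #t   code 57, included
(char-set-contains? digits #\:)    ; => #f   code 58, excluded
(char-set= digits (string->char-set "0123456789"))   ; => #t
\end{verbatim}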
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Querying character sets}
|
|
|
|
\defun {char-set-members}{char-set}{character-list}
|
|
|
|
\begin{desc}
|
|
|
|
This procedure returns a list of the members of \var{char-set}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set-contains?}{char-set char}\boolean
|
|
|
|
\begin{desc}
|
|
|
|
This procedure tests \var{char} for membership in set \var{char-set}.
|
|
|
|
\remark{Previous releases of scsh called this procedure \ex{char-set-member?},
|
|
|
|
reversing the order of the arguments.
|
|
|
|
This made sense, but was unfortunately the reverse of the order in which the
|
|
|
|
arguments appear in MIT Scheme.
|
|
|
|
A reasonable argument order was not backwards-compatible with MIT Scheme;
|
|
|
|
on the other hand, the MIT Scheme argument order was counter-intuitive
|
|
|
|
and at odds with common mathematical notation and the \ex{member} family
|
|
|
|
of R4RS procedures.
|
|
|
|
|
|
|
|
We sought to escape the dilemma by shifting to a new name.}
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set-size}{cs}\integer
|
|
|
|
\begin{desc}
|
|
|
|
Returns the number of elements in character set \var{cs}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set-every?}{pred cs}\boolean
|
|
|
|
\defunx{char-set-any?}{pred cs}\object
|
|
|
|
\begin{desc}
|
|
|
|
The \ex{char-set-every?} procedure returns true if predicate \var{pred}
|
|
|
|
returns true for every character in the character set \var{cs}.
|
|
|
|
|
|
|
|
Likewise, \ex{char-set-any?} applies \var{pred} to every character in
|
|
|
|
character set \var{cs}, and returns the first true value it finds.
|
|
|
|
If no character produces a true value, it returns false.
|
|
|
|
|
|
|
|
The order in which these procedures sequence through the elements of
|
|
|
|
\var{cs} is not specified.
|
|
|
|
\end{desc}
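
The difference between the two result types can be seen in a short sketch.
The set \ex{digits} and the anonymous predicate are made up for this example;
the predicate returns the character itself when it matches, so that
\ex{char-set-any?} hands back something more informative than a boolean.
\begin{verbatim}
(define digits (string->char-set "0123456789"))

(char-set-every? char-numeric? digits)       ; => true
(char-set-any? char-alphabetic? digits)      ; => #f

;; char-set-any? returns the true value produced by the predicate:
(char-set-any? (lambda (c) (and (char=? c #\7) c))
               digits)                       ; => #\7
\end{verbatim}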
|
|
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Character-set algebra}
|
|
|
|
\defun {char-set-invert}{char-set}{char-set}
|
|
|
|
\defunx{char-set-union}{\vari{char-set}1\ldots}{char-set}
|
|
|
|
\defunx{char-set-intersection}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
|
|
|
|
\defunx{char-set-difference}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
These procedures implement set complement, union, intersection, and difference
|
|
|
|
for character sets.
|
|
|
|
The union, intersection, and difference operations are n-ary, associating
|
|
|
|
to the left; the difference function requires at least one argument, while
|
|
|
|
union and intersection may be applied to zero arguments.
|
|
|
|
\end{desc}
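
A brief sketch of the algebra in use, assuming an {\Ascii} character type in
which \ex{char-set:lower-case} is exactly a--z.
The names \ex{vowels} and \ex{consonants} are introduced only for this example.
\begin{verbatim}
(define vowels     (string->char-set "aeiou"))
(define consonants (char-set-difference char-set:lower-case vowels))

(char-set-contains? consonants #\b)    ; => #t
(char-set-contains? consonants #\a)    ; => #f

;; Difference associates to the left: (a - b) - c.
(char-set= (char-set-difference char-set:lower-case vowels (char-set #\b))
           (char-set-difference consonants (char-set #\b)))    ; => #t
\end{verbatim}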
|
|
|
|
|
|
|
|
\defun {char-set-adjoin}{cs \vari{char}1\ldots}{char-set}
|
|
|
|
\defunx{char-set-delete}{cs \vari{char}1\ldots}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Add/delete the \vari{char}i characters to/from character set \var{cs}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Standard character sets}
|
|
|
|
\label{sec:std-csets}
|
|
|
|
Several character sets are predefined for convenience:
|
|
|
|
|
|
|
|
\begin{center}
|
|
|
|
\newcommand{\entry}[1]{\ex{#1}\index{#1}}
|
|
|
|
\begin{tabular}{|ll|}
|
|
|
|
\hline
|
|
|
|
\entry{char-set:lower-case} & Lower-case alphabetic chars \\
|
|
|
|
\entry{char-set:upper-case} & Upper-case alphabetic chars \\
|
|
|
|
\entry{char-set:alphabetic} & Alphabetic chars \\
|
|
|
|
\entry{char-set:numeric} & Decimal digits: 0--9 \\
|
|
|
|
\entry{char-set:alphanumeric} & Alphabetic or numeric \\
|
|
|
|
\entry{char-set:graphic} & Printing characters except space \\
|
|
|
|
\entry{char-set:printing} & Printing characters including space \\
|
|
|
|
\entry{char-set:whitespace} & Whitespace characters \\
|
|
|
|
\entry{char-set:control} & Control characters \\
|
|
|
|
\entry{char-set:punctuation} & Punctuation characters \\
|
|
|
|
\entry{char-set:hex-digit} & A hexadecimal digit: 0--9, A--F, a--f \\
|
|
|
|
\entry{char-set:blank} & Blank characters \\
|
|
|
|
\entry{char-set:ascii} & A character in the ASCII set. \\
|
|
|
|
\entry{char-set:empty} & Empty set \\
|
|
|
|
\entry{char-set:full} & All characters \\
|
|
|
|
\hline
|
|
|
|
\end{tabular}
|
|
|
|
\end{center}
|
|
|
|
The first twelve of these correspond to the character classes defined in
|
|
|
|
Posix.
|
|
|
|
Note that there may be characters in \ex{char-set:alphabetic} that are
|
|
|
|
neither upper nor lower case---this might occur in implementations that
|
|
|
|
use a character type richer than ASCII, such as Unicode.
|
|
|
|
A ``graphic character'' is one that would put ink on your page.
|
|
|
|
While the exact composition of these sets may vary depending upon the
|
|
|
|
character type provided by the Scheme system upon which scsh is running,
|
|
|
|
here are the definitions for some of the sets in an ASCII character set:
|
|
|
|
\begin{center}
|
|
|
|
\newcommand{\entry}[1]{\ex{#1}\index{#1}}
|
|
|
|
\begin{tabular}{|ll|}
|
|
|
|
\hline
|
|
|
|
char-set:alphabetic & A--Z and a--z \\
|
|
|
|
char-set:lower-case & a--z \\
|
|
|
|
char-set:upper-case & A--Z \\
|
|
|
|
char-set:graphic & Alphanumeric + punctuation \\
|
|
|
|
char-set:whitespace & Space, newline, tab, page,
|
|
|
|
vertical tab, carriage return \\
|
|
|
|
char-set:blank & Space and tab \\
|
|
|
|
char-set:control & ASCII 0--31 and 127 \\
|
|
|
|
char-set:punctuation & \verb|!"#$%&'()*+,-./:;<=>|\verb#?@[\]^_`{|}~# \\
|
|
|
|
\hline
|
|
|
|
\end{tabular}
|
|
|
|
\end{center}
|
|
|
|
|
|
|
|
|
|
|
|
\defun {char-alphabetic?}\character\boolean
|
|
|
|
\defunx{char-lower-case?}\character\boolean
|
|
|
|
\defunx{char-upper-case?}\character\boolean
|
|
|
|
\defunx{char-numeric?}\character\boolean
|
|
|
|
\defunx{char-alphanumeric?}\character\boolean
|
|
|
|
\defunx{char-graphic?}\character\boolean
|
|
|
|
\defunx{char-printing?}\character\boolean
|
|
|
|
\defunx{char-whitespace?}\character\boolean
|
|
|
|
\defunx{char-blank?}\character\boolean
|
|
|
|
\defunx{char-control?}\character\boolean
|
|
|
|
\defunx{char-punctuation?}\character\boolean
|
|
|
|
\defunx{char-hex-digit?}\character\boolean
|
|
|
|
\defunx{char-ascii?}\character\boolean
|
|
|
|
\begin{desc}
|
|
|
|
These predicates are defined in terms of the above character sets.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
\subsection{Linear-update character-set operations}
|
|
|
|
These procedures have a hybrid pure-functional/side-effecting semantics:
|
|
|
|
they are allowed, but not required, to side-effect one of their parameters
|
|
|
|
in order to construct their result.
|
|
|
|
An implementation may legally implement these procedures as pure,
|
|
|
|
side-effect-free functions, or it may implement them using side effects,
|
|
|
|
depending upon the details of what is the most efficient or simple to
|
|
|
|
implement in terms of the underlying representation.
|
|
|
|
|
|
|
|
What this means is that clients of these procedures \emph{may not} rely
|
|
|
|
upon these procedures working by side effect.
|
|
|
|
For example, this is not guaranteed to work:
|
|
|
|
\begin{verbatim}
(let ((cs (char-set #\a #\b #\c)))
  (char-set-adjoin! cs #\d)
  cs)                           ; Could be either {a,b,c} or {a,b,c,d}.
\end{verbatim}
|
|
|
|
However, this is well-defined:
|
|
|
|
\begin{verbatim}
(let ((cs (char-set #\a #\b #\c)))
  (char-set-adjoin! cs #\d))    ; {a,b,c,d}
\end{verbatim}
|
|
|
|
So clients of these procedures write in a functional style, but must
|
|
|
|
additionally be sure that, when the procedure is called, there are no
|
|
|
|
other live pointers to the potentially-modified character set (hence the term
|
|
|
|
``linear update'').
|
|
|
|
|
|
|
|
There are two benefits to this convention:
|
|
|
|
\begin{itemize}
|
|
|
|
\item Implementations are free to provide the most efficient possible
|
|
|
|
implementation, either functional or side-effecting.
|
|
|
|
\item Programmers may nonetheless continue to assume that character sets
|
|
|
|
are purely functional data structures: they may be reliably shared
|
|
|
|
without needing to be copied, uniquified, and so forth.
|
|
|
|
\end{itemize}
|
|
|
|
|
|
|
|
In practice, these procedures are most useful for efficiently constructing
|
|
|
|
character sets in a side-effecting manner, in some limited local context,
|
|
|
|
before passing the character set outside the local construction scope to be
|
|
|
|
used in a functional manner.
|
|
|
|
|
|
|
|
Scsh provides no assistance in checking the linearity of the potentially
|
|
|
|
side-effected parameters passed to these functions --- there's no linear
|
|
|
|
type checker or run-time mechanism for detecting violations.
|
|
|
|
|
|
|
|
\defun{char-set-copy}{cs}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Returns a copy of the character set \var{cs}.
|
|
|
|
``Copy'' means that if either the input parameter or the
|
|
|
|
result value of this procedure is passed to one of the linear-update
|
|
|
|
procedures described below, the other character set is guaranteed
|
|
|
|
not to be altered.
|
|
|
|
(A system that provides pure-functional implementations of the rest of
|
|
|
|
the linear-operator suite could implement this procedure as the
|
|
|
|
identity function.)
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set-adjoin!}{cs \vari{char}1\ldots}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Add the \vari{char}i characters to character set \var{cs}, and
|
|
|
|
return the result.
|
|
|
|
This procedure is allowed, but not required, to side-effect \var{cs}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun{char-set-delete!}{cs \vari{char}1\ldots}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
Remove the \vari{char}i characters from character set \var{cs}, and
|
|
|
|
return the result.
|
|
|
|
This procedure is allowed, but not required, to side-effect \var{cs}.
|
|
|
|
\end{desc}
|
|
|
|
|
|
|
|
\defun {char-set-invert!}{char-set}{char-set}
|
|
|
|
\defunx{char-set-union!}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
|
|
|
|
\defunx{char-set-intersection!}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
|
|
|
|
\defunx{char-set-difference!}{\vari{char-set}1 \vari{char-set}2\ldots}{char-set}
|
|
|
|
\begin{desc}
|
|
|
|
These procedures implement set complement, union, intersection, and difference
|
|
|
|
for character sets.
|
|
|
|
They are allowed, but not required, to side-effect their first parameter.
|
|
|
|
The union, intersection, and difference operations are n-ary, associating
|
|
|
|
to the left.
|
|
|
|
\end{desc}
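
The intended usage pattern for the linear-update procedures might look like
the following sketch, which accumulates a character set inside a loop and
only then releases it to the caller.
\ex{strings->char-set} is a hypothetical helper, not part of scsh.
The accumulator starts as a private copy made with \ex{char-set-copy}, so the
shared constant \ex{char-set:empty} can never be altered, and the loop always
continues with the value \emph{returned} by \ex{char-set-union!} rather than
relying on a side effect.
\begin{verbatim}
;; Build the set of all characters occurring in a list of strings.
(define (strings->char-set strings)
  (let loop ((cs   (char-set-copy char-set:empty))  ; private accumulator
             (rest strings))
    (if (null? rest)
        cs
        ;; Only the first argument may be side-effected, and no other
        ;; pointer to it is live at this point.
        (loop (char-set-union! cs (string->char-set (car rest)))
              (cdr rest)))))

(char-set= (strings->char-set '("ab" "bc"))
           (char-set #\a #\b #\c))          ; => #t
\end{verbatim}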
|