scsh-0.5/doc/scsh-manual/strings.tex

483 lines
18 KiB
TeX

% -*- latex -*-
\chapter{Strings and characters}
Strings are the basic communication medium for {\Unix} processes, so a
Unix programming environment must have reasonable facilities for manipulating
them.
Scsh provides a powerful set of procedures for processing strings and
characters.
Besides the the facilities described in this chapter, scsh also provides
\begin{itemize}
\itum{Regular expressions (chapter~\ref{chapt:sre})}
A complete regular-expression system.
\itum{Field parsing, delimited record I/O and the awk loop
(chapter~\ref{chapt:fr-awk})}
These procedures let you read in chunks of text delimited by selected
characters, and
parse each record into fields based on regular expressions
(for example, splitting a string at every occurrence of colon or
white-space).
The \ex{awk} form allows you to loop over streams of these records
in a convenient way.
\itum{The SRFI-13 string libraries}
This pair of libraries contains procedures that create, fold, iterate over,
search, compare, assemble, cut, hash, case-map, and otherwise manipulate
strings.
They are provided by the \ex{string-lib} and \ex{string-lib-internals}
packages, and are also available in the default \ex{scsh} package.
More documentation on these procedures can be found at URLs
\begin{tightinset}
% The gratuitous mbox makes xdvi render the hyperlinks better.
\mbox{\url{http://srfi.schemers.org/srfi-13/srfi-13.html}}\\
\url{http://srfi.schemers.org/srfi-13/srfi-13.txt}
\end{tightinset}
\itum{The SRFI-14 character-set library}
This library provides a set-of-characters abstraction, which is frequently
useful when searching, parsing, filtering or otherwise operating on
strings and character data. The SRFI is provided by the \ex{char-set-lib}
package; it's bindings are also available in the default \ex{scsh} package.
More documentation on this library can be found at URLs
\begin{tightinset}
% The gratuitous mbox makes xdvi render the hyperlinks better.
\mbox{\url{http://srfi.schemers.org/srfi-14/srfi-14.html}}\\
\url{http://srfi.schemers.org/srfi-14/srfi-14.txt}
\end{tightinset}
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Manipulating file names}
\label{sec:filenames}
These procedures do not access the file-system at all; they merely operate
on file-name strings. Much of this structure is patterned after the gnu emacs
design. Perhaps a more sophisticated system would be better, something
like the pathname abstractions of {\CommonLisp} or MIT Scheme. However,
being {\Unix}-specific, we can be a little less general.
\subsection{Terminology}
These procedures carefully adhere to the {\Posix} standard for file-name
resolution, which occasionally entails some slightly odd things.
This section will describe these rules, and give some basic terminology.
A \emph{file-name} is either the file-system root (``/''),
or a series of slash-terminated directory components, followed by
a a file component.
Root is the only file-name that may end in slash.
Some examples:
\begin{center}
\begin{tabular}{lll}
File name & Dir components & File component \\\hline
\ex{src/des/main.c} & \ex{("src" "des")} & \ex{"main.c"} \\
\ex{/src/des/main.c} & \ex{("" "src" "des")} & \ex{"main.c"} \\
\ex{main.c} & \ex{()} & \ex{"main.c"} \\
\end{tabular}
\end{center}
Note that the relative filename \ex{src/des/main.c} and the absolute filename
\ex{/src/des/main.c} are distinguished by the presence of the root component
\ex{""} in the absolute path.
Multiple embedded slashes within a path have the same meaning as
a single slash.
More than two leading slashes at the beginning of a path have the same
meaning as a single leading slash---they indicate that the file-name
is an absolute one, with the path leading from root.
However, {\Posix} permits the OS to give special meaning to
\emph{two} leading slashes.
For this reason, the routines in this section do not simplify two leading
slashes to a single slash.
A file-name in \emph{directory form} is either a file-name terminated by
a slash, \eg, ``\ex{/src/des/}'', or the empty string, ``''.
The empty string corresponds to the current working directory,
whose file-name is dot (``\ex{.}'').
Working backwards from the append-a-slash rule,
we extend the syntax of {\Posix} file-names to define the empty string
to be a file-name form of the root directory ``\ex{/}''.
(However, ``\ex{/}'' is also acceptable as a file-name form for root.)
So the empty string has two interpretations:
as a file-name form, it is the file-system root;
as a directory form, it is the current working directory.
Slash is also an ambiguous form: \ex{/} is both a directory-form and
a file-name form.
The directory form of a file-name is very rarely used.
Almost all of the procedures in scsh name directories by giving
their file-name form (without the trailing slash), not their directory form.
So, you say ``\ex{/usr/include}'', and ``\ex{.}'', not
``\ex{/usr/include/}'' and ``''.
The sole exceptions are
\ex{file-name-as-directory} and \ex{directory-as-file-name},
whose jobs are to convert back-and-forth between these forms,
and \ex{file-name-directory}, whose job it is to split out the
directory portion of a file-name.
However, most procedures that expect a directory argument will coerce
a file-name in directory form to file-name form if it does not have
a trailing slash.
Bear in mind that the ambiguous case, empty string, will be
interpreted in file-name form, \ie, as root.
\subsection{Procedures}
\defun {file-name-directory?} {fname} \boolean
\defunx {file-name-non-directory?} {fname} \boolean
\begin{desc}
These predicates return true if the string is in directory form, or
file-name form (see the above discussion of these two forms).
Note that they both return true on the ambiguous case of empty string,
which is both a directory (current working directory), and a file name
(the file-system root).
\begin{center}
\begin{tabular}{lll}
File name & \ex{\ldots-directory?} & \ex{\ldots-non-directory?} \\
\hline
\ex{"src/des"} & \ex{\sharpf} & \ex{\sharpt} \\
\ex{"src/des/"} & \ex{\sharpt} & \ex{\sharpf} \\
\ex{"/"} & \ex{\sharpt} & \ex{\sharpf} \\
\ex{"."} & \ex{\sharpf} & \ex{\sharpt} \\
\ex{""} & \ex{\sharpt} & \ex{\sharpt}
\end{tabular}
\end{center}
\end{desc}
\begin{defundesc} {file-name-as-directory} {fname} \str
Convert a file-name to directory form.
Basically, add a trailing slash if needed:
\begin{exampletable}
\ex{(file-name-as-directory "src/des")} & \ex{"src/des/"} \\
\ex{(file-name-as-directory "src/des/")} & \ex{"src/des/"} \\[2ex]
%
\header{\ex{.}, \ex{/}, and \ex{""} are special:}
\ex{(file-name-as-directory ".")} & \ex{""} \\
\ex{(file-name-as-directory "/")} & \ex{"/"} \\
\ex{(file-name-as-directory "")} & \ex{"/"}
\end{exampletable}
\end{defundesc}
\begin{defundesc} {directory-as-file-name} {fname} \str
Convert a directory to a simple file-name.
Basically, kill a trailing slash if one is present:
\begin{exampletable}
\ex{(directory-as-file-name "foo/bar/")} & \ex{"foo/bar"} \\[2ex]
%
\header{\ex{/} and \ex{""} are special:}
\ex{(directory-as-file-name "/")} & \ex{"/"} \\
\ex{(directory-as-file-name "")} & \ex{"."} (\ie, the cwd) \\
\end{exampletable}
\end{defundesc}
\begin{defundesc} {file-name-absolute?} {fname} \boolean
Does \var{fname} begin with a root or \ex{\~} component?
(Recognising \ex{\~} as a home-directory specification
is an extension of {\Posix} rules.)
%
\begin{exampletable}
\ex{(file-name-absolute? "/usr/shivers")} & {\sharpt} \\
\ex{(file-name-absolute? "src/des")} & {\sharpf} \\
\ex{(file-name-absolute? "\~/src/des")} & {\sharpt} \\[2ex]
%
\header{Non-obvious case:}
\ex{(file-name-absolute? "")} & {\sharpt} (\ie, root)
\end{exampletable}
\end{defundesc}
\begin{defundesc} {file-name-directory} {fname} {{\str} or false}
Return the directory component of \var{fname} in directory form.
If the file-name is already in directory form, return it as-is.
%
\begin{exampletable}
\ex{(file-name-directory "/usr/bdc")} & \ex{"/usr/"} \\
{\ex{(file-name-directory "/usr/bdc/")}} &
{\ex{"/usr/bdc/"}} \\
\ex{(file-name-directory "bdc/.login")} & \ex{"bdc/"} \\
\ex{(file-name-directory "main.c")} & \ex{""} \\[2ex]
%
\header{Root has no directory component:}
\ex{(file-name-directory "/")} & \ex{""} \\
\ex{(file-name-directory "")} & \ex{""}
\end{exampletable}
\end{defundesc}
\begin{defundesc} {file-name-nondirectory} {fname} \str
Return non-directory component of fname.
%
\begin{exampletable}
{\ex{(file-name-nondirectory "/usr/ian")}} &
{\ex{"ian"}} \\
\ex{(file-name-nondirectory "/usr/ian/")} & \ex{""} \\
{\ex{(file-name-nondirectory "ian/.login")}} &
{\ex{".login"}} \\
\ex{(file-name-nondirectory "main.c")} & \ex{"main.c"} \\
\ex{(file-name-nondirectory "")} & \ex{""} \\
\ex{(file-name-nondirectory "/")} & \ex{"/"}
\end{exampletable}
\end{defundesc}
\begin{defundesc} {split-file-name} {fname} {{\str} list}
Split a file-name into its components.
%
\begin{exampletable}
\splitline{\ex{(split-file-name "src/des/main.c")}}
{\ex{("src" "des" "main.c")}} \\[1.5ex]
%
\splitline{\ex{(split-file-name "/src/des/main.c")}}
{\ex{("" "src" "des" "main.c")}} \\[1.5ex]
%
\splitline{\ex{(split-file-name "main.c")}} {\ex{("main.c")}} \\[1.5ex]
%
\splitline{\ex{(split-file-name "/")}} {\ex{("")}}
\end{exampletable}
\end{defundesc}
\begin{defundesc} {path-list->file-name} {path-list [dir]} \str
Inverse of \ex{split-file-name}.
\begin{code}
(path-list->file-name '("src" "des" "main.c"))
{\evalto} "src/des/main.c"
(path-list->file-name '("" "src" "des" "main.c"))
{\evalto} "/src/des/main.c"
\cb
{\rm{}Optional \var{dir} arg anchors relative path-lists:}
(path-list->file-name '("src" "des" "main.c")
"/usr/shivers")
{\evalto} "/usr/shivers/src/des/main.c"\end{code}
%
The optional \var{dir} argument is usefully \ex{(cwd)}.
\end{defundesc}
\begin{defundesc} {file-name-extension} {fname} \str
Return the file-name's extension.
%
\begin{exampletable}
\ex{(file-name-extension "main.c")} & \ex{".c"} \\
\ex{(file-name-extension "main.c.old")} & \ex{".old"} \\
\ex{(file-name-extension "/usr/shivers")} & \ex{""}
\end{exampletable}
%
\begin{exampletable}
\header{Weird cases:}
\ex{(file-name-extension "foo.")} & \ex{"."} \\
\ex{(file-name-extension "foo..")} & \ex{"."}
\end{exampletable}
%
\begin{exampletable}
\header{Dot files are not extensions:}
\ex{(file-name-extension "/usr/shivers/.login")} & \ex{""}
\end{exampletable}
\end{defundesc}
\begin{defundesc} {file-name-sans-extension} {fname} \str
Return everything but the extension.
%
\begin{exampletable}
\ex{(file-name-sans-extension "main.c")} & \ex{"main"} \\
\ex{(file-name-sans-extension "main.c.old")} & \ex{"main.c""} \\
\splitline{\ex{(file-name-sans-extension "/usr/shivers")}}
{\ex{"/usr/shivers"}}
\end{exampletable}
%
\begin{exampletable}
\header{Weird cases:}
\ex{(file-name-sans-extension "foo.")} & \ex{"foo"} \\
\ex{(file-name-sans-extension "foo..")} & \ex{"foo."} \\[2ex]
%
\header{Dot files are not extensions:}
\splitline{\ex{(file-name-sans-extension "/usr/shivers/.login")}}
{\ex{"/usr/shivers/.login}}
\end{exampletable}
Note that appending the results of \ex{file-name-extension} and
{\ttt file\=name\=sans\=extension} in all cases produces the original file-name.
\end{defundesc}
\begin{defundesc} {parse-file-name} {fname} {[dir name extension]}
Let $f$ be \ex{(file-name-nondirectory \var{fname})}.
This function returns the three values:
\begin{itemize}
\item \ex{(file-name-directory \var{fname})}
\item \ex{(file-name-sans-extension \var{f}))}
\item \ex{(file-name-extension \var{f}\/)}
\end{itemize}
The inverse of \ex{parse-file-name}, in all cases, is \ex{string-append}.
The boundary case of \ex{/} was chosen to preserve this inverse.
\end{defundesc}
\begin{defundesc} {replace-extension} {fname ext} \str
This procedure replaces \var{fname}'s extension with \var{ext}.
It is exactly equivalent to
\codex{(string-append (file-name-sans-extension \var{fname}) \var{ext})}
\end{defundesc}
\defun{simplify-file-name}{fname}\str
\begin{desc}
Removes leading and internal occurrences of dot.
A trailing dot is left alone, as the parent could be a symlink.
Removes internal and trailing double-slashes.
A leading double-slash is left alone, in accordance with {\Posix}.
However, triple and more leading slashes are reduced to a single slash,
in accordance with {\Posix}.
Double-dots (parent directory) are left alone, in case they come after
symlinks or appear in a \ex{/../\var{machine}/\ldots} ``super-root'' form
(which {\Posix} permits).
\end{desc}
\defun{resolve-file-name}{fname [dir]}\str
\begin{desc}
\begin{itemize}
\item Do \ex{\~} expansion.
\item If \var{dir} is given,
convert a relative file-name to an absolute file-name,
relative to directory \var{dir}.
\end{itemize}
\end{desc}
\begin{defundesc} {expand-file-name} {fname [dir]} \str
Resolve and simplify the file-name.
\end{defundesc}
\begin{defundesc} {absolute-file-name} {fname [dir]} \str
Convert file-name \var{fname} into an absolute file name,
relative to directory \var{dir}, which defaults to the current
working directory. The file name is simplified before being
returned.
This procedure does not treat a leading tilde character specially.
\end{defundesc}
\begin{defundesc} {home-dir} {[user]} \str
\ex{home-dir} returns \var{user}'s home directory.
\var{User} defaults to the current user.
\begin{exampletable}
\ex{(home-dir)} & \ex{"/user1/lecturer/shivers"} \\
\ex{(home-dir "ctkwan")} & \ex{"/user0/research/ctkwan"}
\end{exampletable}
\end{defundesc}
\begin{defundesc} {home-file} {[user] fname} \str
Returns file-name \var{fname} relative to \var{user}'s home directory;
\var{user} defaults to the current user.
%
\begin{exampletable}
\ex{(home-file "man")} & \ex{"/usr/shivers/man"} \\
\ex{(home-file "fcmlau" "man")} & \ex{"/usr/fcmlau/man"}
\end{exampletable}
\end{defundesc}
The general \ex{substitute-env-vars} string procedure,
defined in the previous section,
is also frequently useful for expanding file-names.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Other string manipulation facilities}
\begin{defundesc} {substitute-env-vars} {fname} \str
Replace occurrences of environment variables with their values.
An environment variable is denoted by a dollar sign followed by
alphanumeric chars and underscores, or is surrounded by braces.
\begin{exampletable}
\splitline{\ex{(substitute-env-vars "\$USER/.login")}}
{\ex{"shivers/.login"}} \\
\cd{(substitute-env-vars "$\{USER\}_log")} & \cd{"shivers_log"}
\end{exampletable}
\end{defundesc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{ASCII encoding}
\defun {char->ascii}{\character} \integer
\defunx {ascii->char}{\integer} \character
\begin{desc}
These are identical to \ex{char->integer} and \ex{integer->char} except that
they use the {\Ascii} encoding.
\end{desc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Character predicates}
\defun {char-letter?}\character\boolean
\defunx{char-lower-case?}\character\boolean
\defunx{char-upper-case?}\character\boolean
\defunx{char-title-case?}\character\boolean
\defunx{char-digit?}\character\boolean
\defunx{char-letter+digit?}\character\boolean
\defunx{char-graphic?}\character\boolean
\defunx{char-printing?}\character\boolean
\defunx{char-whitespace?}\character\boolean
\defunx{char-blank?}\character\boolean
\defunx{char-iso-control?}\character\boolean
\defunx{char-punctuation?}\character\boolean
\defunx{char-hex-digit?}\character\boolean
\defunx{char-ascii?}\character\boolean
\begin{desc}
Each of these predicates tests for membership in one of the standard
character sets provided by the SRFI-14 character-set library.
Additionally, the following redundant bindings are provided for {R5RS}
compatibility:
\begin{inset}
\begin{tabular}{ll}
{R5RS} name & scsh definition \\ \hline
\ex{char-alphabetic?} & \ex{char-letter+digit?} \\
\ex{char-numeric?} & \ex{char-digit?} \\
\ex{char-alphanumeric?} & \ex{char-letter+digit?}
\end{tabular}
\end{inset}
\end{desc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Deprecated character-set procedures}
\label{sec:char-sets}
The SRFI-13 character-set library grew out of an earlier library developed
for scsh.
However, the SRFI standardisation process introduced incompatibilities with
the original scsh bindings.
The current version of scsh provides the library
\ex{obsolete-char-set-lib}, which contains the old bindings found in
previous releases of scsh.
The following table lists the members of this library, along with
the equivalent SRFI-13 binding. This obsolete library is deprecated and
\emph{not} open by default in the standard \ex{scsh} environment;
new code should use the SRFI-13 bindings.
\begin{inset}
\begin{tabular}{ll}
Old \ex{obsolete-char-set-lib} & SRFI-13 \ex{char-set-lib} \\ \hline
\ex{char-set-members} & \ex{char-set->list} \\
\ex{chars->char-set} & \ex{list->char-set} \\
\ex{ascii-range->char-set} & \ex{ucs-range->char-set} (not exact) \\
\ex{predicate->char-set} & \ex{char-set-filter} (not exact) \\
\ex{char-set-every}? & \ex{char-set-every} \\
\ex{char-set-any}? & \ex{char-set-any} \\
\\
\ex{char-set-invert} & \ex{char-set-complement} \\
\ex{char-set-invert}! & \ex{char-set-complement!} \\
\\
\ex{char-set:alphabetic} & \ex{char-set:letter} \\
\ex{char-set:numeric} & \ex{char-set:digit} \\
\ex{char-set:alphanumeric} & \ex{char-set:letter+digit} \\
\ex{char-set:control} & \ex{char-set:iso-control}
\end{tabular}
\end{inset}
Note also that the \ex{->char-set} procedure no longer handles a predicate
argument.