scsh-0.5/scsh/rx/simp.notes

44 lines
1.6 KiB
Plaintext

The simplifier produces regexps with some simple invariants:
- DSM's are only top-level, never appearing in the body of a DSM,
repeat, sequence, choice, or submatch.
- A repeat's body is not a repeat, trivial match, or empty match.
- A choice's body contains more than one element; no element is
- a choice,
- a DSM, or
- an empty-match.
- A choice contains 0 or 1 char-set, bos, and eos elements.
- A sequence's body contains more than one element; no element is
- a sequence,
- a DSM,
- a trivial match, or
- an empty-match
- There are no empty matches in the regexp unless the entire regexp
is either an empty match, or a dsm whose body is an empty match.
(This is good, because there is no way to write an empty match
in Posix notation in a char-set independent way -- you have to
use the six-char "[^\000-\177]" for ASCII.)
To see these invariants:
- We can always bubble up empty matches:
- If a sequence has one, the whole sequence is reduced to an empty match.
- They can be deleted from a choice; if the choice reduces to 0 elements,
the choice can be reduced to an empty match.
- A repeat of an empty match is either an empty match or a trivial match,
depending upon whether FROM is >0 or 0, respectively.
- DSM of an empty match: the DSM itself can be bubbled upwards (see below).
- We can always bubble up DSM regexps:
- If an elt of a choice or sequence is a DSM, it can be "absorbed"
into the element's relocation offset.
- Repeat commutes with DSM.
- A DSM body can be "absorbed" into a submatch record by increasing the
submatch's DSM0 count.
- Nested DSM's can be collapsed together.