The simplifier produces regexps with some simple invariants: - DSM's are only top-level, never appearing in the body of a DSM, repeat, sequence, choice, or submatch. - A repeat's body is not a repeat, trivial match, or empty match. - A choice's body contains more than one element; no element is - a choice, - a DSM, or - an empty-match. - A choice contains 0 or 1 char-set, bos, and eos elements. - A sequence's body contains more than one element; no element is - a sequence, - a DSM, - a trivial match, or - an empty-match - There are no empty matches in the regexp unless the entire regexp is either an empty match, or a dsm whose body is an empty match. (This is good, because there is no way to write an empty match in Posix notation in a char-set independent way -- you have to use the six-char "[^\000-\177]" for ASCII.) To see these invariants: - We can always bubble up empty matches: - If a sequence has one, the whole sequence is reduced to an empty match. - They can be deleted from a choice; if the choice reduces to 0 elements, the choice can be reduced to an empty match. - A repeat of an empty match is either an empty match or a trivial match, depending upon whether FROM is >0 or 0, respectively. - DSM of an empty match: the DSM itself can be bubbled upwards (see below). - We can always bubble up DSM regexps: - If an elt of a choice or sequence is a DSM, it can be "absorbed" into the element's relocation offset. - Repeat commutes with DSM. - A DSM body can be "absorbed" into a submatch record by increasing the submatch's DSM0 count. - Nested DSM's can be collapsed together.