scsh-0.6/ps-compiler/doc/node.txt


In the compiler `continuation' means a continuation that is a lambda node.
Non-lambda continuation arguments, such as the argument to a RETURN, are
not referred to as continuations (the argument isn't a continuation, it
is a variable that is bound to a continuation).
Every node has the following fields:
    variant      ; one of LITERAL, REFERENCE, LAMBDA, or CALL
    parent       ; parent node
    index        ; index of this node in parent, if parent is a call node
    simplified?  ; true if it has already been simplified; if this is #F
                 ; then all of this node's ancestors must also be unsimplified
    flag         ; useful flag, all users must leave this as #F
Literal nodes:
    value        ; the value
    type         ; the type of the value (important for statically typed languages,
                 ; not so useful for Scheme)
Reference nodes:
    variable     ; the referenced variable; the binder of the variable must be
                 ; an ancestor of the reference node
Call nodes:
    primop       ; the primitive being called
    args         ; vector of argument nodes
    exits        ; the number of arguments that are continuations; the continuation
                 ; arguments come before the non-continuation ones
    source       ; source info; used for error messages
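Because the continuation arguments always come first, the EXITS count is
enough to separate them from the ordinary arguments. The following is a
minimal sketch of that split. SPLIT-CALL-ARGS is a hypothetical helper, not
a procedure from the compiler, and symbols and lists stand in for real
argument nodes.

```scheme
;; Hypothetical helper, for illustration only.
;; ARGS is a vector of argument nodes (any objects will do here) and
;; EXITS is the number of leading continuation arguments.
;; Returns a pair whose car is the list of continuation arguments and
;; whose cdr is the list of the remaining arguments.
(define (split-call-args args exits)
  (let loop ((i (- (vector-length args) 1)) (conts '()) (others '()))
    (cond ((< i 0)
           (cons conts others))
          ((< i exits)
           (loop (- i 1) (cons (vector-ref args i) conts) others))
          (else
           (loop (- i 1) conts (cons (vector-ref args i) others))))))

;; A call shaped like (test 2 ^g_10 ^g_11 (< n_3 '2)) has two exits:
(define split (split-call-args (vector 'g_10 'g_11 '(< n_3 2)) 2))
(car split)   ; the continuation arguments: (g_10 g_11)
(cdr split)   ; the other arguments:        ((< n_3 2))
```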
Primops are either trivial or nontrivial. Trivial primops only return a value
and have no side effects. Calls to trivial primops never have continuation
arguments and are always arguments to other calls. Calls to nontrivial primops
may or may not have continuations and are always the body of a lambda node.
Lambda nodes:
    type         ; one of PROC, CONT, or JUMP (and maybe THROW at some point)
    name         ; symbol (for debugging)
    id           ; unique integer (for debugging)
    body         ; the call-node that is the body of the lambda
    variables    ; a list of variable records, with #Fs for ignored positions
    source       ; source info; used for error messages
    protocol     ; calling protocol from the source language
    block        ; for use during code generation
    env          ; for use when adding explicit environments
PROC's are general procedures. The first variable of a PROC will be bound
to the PROC's continuation.
CONT's are continuation arguments to calls.
JUMP's are continuations bound by LET or LETREC, whose calling points are
known, and which are created and called within a single PROC.
Variables:
    name         ; source code name for variable (used for debugging only)
    id           ; unique numeric identifier (used for debugging only)
    type         ; type of variable's value
    binder       ; LAMBDA node which binds this variable (or #F if none)
    refs         ; list of reference nodes n for which (REFERENCE-VARIABLE n)
                 ; = this variable
    flag         ; useful slot, used by shapes, COPY-NODE, NODE->VECTOR, etc.
                 ; all users must leave this as #F
    flags        ; list of various annotations, e.g. IGNORABLE
    generate     ; for whatever code generation wants
----------------------------------------------------------------
The node tree has a very regular lexical structure:
  The body of every lambda node is a non-trivial call.
  The parent of every non-trivial call is a lambda node.
  Every CONT lambda is a continuation of a non-trivial call.
  Every JUMP lambda is an argument to either the LET or the LETREC
    primops (described below).
  The lambda node that binds a variable is an ancestor of every reference
    to that variable.
If you start from any leaf node and follow the parent pointers up through the
node tree, you first go through some number, possibly zero, of trivial calls
until a non-trivial call is reached. From that point on non-trivial calls
alternate with CONT nodes until a PROC or JUMP lambda is reached. Going up
from a PROC lambda is the same as going up from a leaf, while JUMP lambdas
are always arguments to LET or LETREC, both of which are non-trivial.
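The walk described above can be sketched on a toy model. Everything in this
sketch is an assumption made for illustration: a toy node is just a pair of
a kind symbol and a parent, whereas the compiler's nodes are records with
the fields listed earlier. The chain built here follows the reference to
n_3 in the FACT example that appears later in this document.

```scheme
;; Toy stand-in for nodes: (kind . parent).  KIND is one of the symbols
;; LEAF, TRIVIAL-CALL, CALL, CONT, PROC or JUMP; PARENT is #f at the root.
(define (make-toy-node kind parent) (cons kind parent))
(define (toy-node-kind node) (car node))
(define (toy-node-parent node) (cdr node))

;; Collect the kinds met while following parent pointers to the root.
(define (ancestor-kinds node)
  (let loop ((n node) (kinds '()))
    (if n
        (loop (toy-node-parent n) (cons (toy-node-kind n) kinds))
        (reverse kinds))))

;; The path from the reference to n_3 up to the FACT lambda.
(define fact-proc    (make-toy-node 'proc #f))               ; fact_7
(define letrec1-call (make-toy-node 'call fact-proc))
(define c14-cont     (make-toy-node 'cont letrec1-call))
(define letrec2-call (make-toy-node 'call c14-cont))
(define loop-proc    (make-toy-node 'proc letrec2-call))     ; loop_9
(define test-call    (make-toy-node 'call loop-proc))
(define less-call    (make-toy-node 'trivial-call test-call)) ; (< n_3 '2)
(define n3-ref       (make-toy-node 'leaf less-call))

(ancestor-kinds n3-ref)
;; => (leaf trivial-call call proc call cont call proc)
```

The result shows the pattern in the text: one trivial call, then a
non-trivial call, then a PROC; above the PROC the walk starts over and
non-trivial calls alternate with a CONT until the outer PROC is reached.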
A basic block appears as a sequence of non-trivial calls with a single
continuation apiece. The block begins with a PROC or JUMP lambda, or
with a CONT lambda that is an argument to a call with two or more
continuations, and ends with a call that has either no continuations,
or two or more.
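Stated as a predicate on the EXITS field, a call ends a basic block exactly
when it does not have a single continuation. This is a sketch only;
CALL-ENDS-BLOCK? is a hypothetical name and takes the exit count directly
instead of a call node.

```scheme
;; Hypothetical predicate: EXITS is the number of continuation arguments.
;; A call with exactly one continuation stays inside the current block;
;; zero continuations (a jump or return) or two or more (a branch) end it.
(define (call-ends-block? exits)
  (not (= exits 1)))

;; Exit counts for letrec1, letrec2 and unknown-tail-call in the FACT
;; example printed later in this document:
(map call-ends-block? '(1 1 0))   ; => (#f #f #t)
;; A two-way branch such as TEST:
(call-ends-block? 2)              ; => #t
```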
Basic blocks are grouped into trees. The root of every tree is either
a PROC or JUMP lambda, the branch points are calls with two or more
continuations, and the leaves are jumps or returns. Within a tree
the control flow follows the lexical structure of the program from
parent to child (if we ignore calls to other PROCs).
Every JUMP lambda is called from within only one PROC lambda, so a PROC
can be considered to consist of a set of trees, the leaves of which either
return from that PROC or jump to the top of another tree in the set.
----------------------------------------------------------------
Primops:
    id                  ; unique symbol identifying this primop
    trivial?            ; #t if this primop does not accept a continuation
    side-effects        ; one of #F, READ, WRITE, ALLOCATE, or IO
    simplify-call-proc  ; simplify method
    primop-cost-proc    ; cost of executing this operation
                        ; (in some undisclosed metric)
    return-type-proc    ; the type of the value returned (for trivial primops only)
    proc-data           ; more data for the procedure primops
    cond-data           ; more data for conditional primops
    code-data           ; code generation data
`procedure' primops are those that call one of their values.
`conditional' primops are those that have more than one continuation.
Below is a list of the standard primops. All but the last two are non-trivial.
For the following five primops, the lambda node being called, jumped to,
or whatever has been identified by the compiler, and the number of variables
that the lambda node has matches the number of arguments.
(CALL <cont> <proc> . <args>)
(TAIL-CALL <cont-var> <proc> . <args>)
(RETURN <cont-var> . <args>)
(JUMP <jump-var> . <args>)
; (THROW <throw-var> . <args>) not yet implemented
These are the same as the above except that the procedure has not been
identified by the compiler. There is no UNKNOWN-JUMP because all calls
to JUMP lambdas must be known.
(UNKNOWN-CALL <cont> <proc> . <args>)
(UNKNOWN-TAIL-CALL <cont-var> <proc> . <args>)
(UNKNOWN-RETURN <cont-var> . <args>)
PROC lambdas are called with either CALL or TAIL-CALL if all of their call
sites have been identified, or with UNKNOWN-CALL or UNKNOWN-TAIL-CALL if not.
JUMP lambdas are called using JUMP.
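The choice of calling primop described above can be written down as a small
decision procedure. This is a sketch under assumptions: CALL-PRIMOP-FOR and
its three arguments are hypothetical and do not correspond to a procedure
in the compiler; the primops are represented by their names as symbols.

```scheme
;; Hypothetical summary of the rules above.
;; LAMBDA-TYPE is the symbol PROC or JUMP, ALL-SITES-KNOWN? says whether
;; every call site of the lambda has been identified, and TAIL? says
;; whether the call passes on a continuation variable instead of
;; supplying a continuation lambda of its own.
(define (call-primop-for lambda-type all-sites-known? tail?)
  (cond ((eq? lambda-type 'jump) 'jump)    ; calls to JUMP lambdas are always known
        (all-sites-known? (if tail? 'tail-call 'call))
        (else (if tail? 'unknown-tail-call 'unknown-call))))

(call-primop-for 'proc #f #t)   ; => unknown-tail-call
(call-primop-for 'proc #t #f)   ; => call
(call-primop-for 'jump #t #t)   ; => jump
```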
LET binds random values, such as lambda nodes or the results of trivial
calls, to variables. This primop only exists because of the requirement
that every call have a primop; all it does is apply <cont> to <args>
(it is called LET instead of APPLY because LET forms in the source code
become calls to this primop).
(LET <cont> . <args>)
Recursive binding:
(LETREC1 <cont>)
(LETREC2 <cont> <id-var> <lambda1> <lambda2> ...)
These are always used together, with the body of the continuation to LETREC1
being a call to LETREC2. The two calls together look like:
 (LETREC1 (lambda (<id-var> <var1> ... <varN>)
            (LETREC2 <cont> <id-var> <lambda1> ... <lambdaN>)))
which the CPS pretty-printer prints as:
 (let* (...
        ((id-var var1 ... varN) (letrec1))
        (() (letrec2 id-var lambda1 ... lambdaN))
        ...)
   ...)
The end result is to bind <varI> to <lambdaI>. The point of the exercise
is that the lambdas occur within the scope of the variables.
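A rough source-level analogy, in ordinary Scheme, may help with the scoping
point. It is only an analogy and makes several assumptions: the outer
lambda plays the part of LETREC1's continuation, the two assignments play
the part of LETREC2, <id-var> is not modelled at all, and nothing here
claims that the compiler implements the binding by assignment. PARITY,
EV? and OD? are made-up names.

```scheme
;; Analogy only: the variables are brought into scope first (LETREC1),
;; and the lambdas are then attached to them (LETREC2).  Because each
;; lambda is written inside the scope of both variables, the two
;; procedures can refer to each other.
(define (parity n)
  ((lambda (ev? od?)
     (set! ev? (lambda (k) (if (= k 0) #t (od? (- k 1)))))
     (set! od? (lambda (k) (if (= k 0) #f (ev? (- k 1)))))
     (if (ev? n) 'even 'odd))
   #f #f))

(parity 10)   ; => even
(parity 7)    ; => odd
```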
Undefined effect. This takes a continuation variable as an argument only
so that the continuation variable is always reached.
(UNDEFINED-EFFECT <cont-var> ...)
Accessing and mutating the store.
Cells are used to implement SET! on lexically bound variables. GLOBAL-SET!
and GLOBAL-REF are used for module variables that may be set.
(CELL-SET! <cont> <cell> <value>)
(GLOBAL-SET! <cont> <global-var> <value>)
(CELL-REF <cell>) ; trivial
(GLOBAL-REF <global-var>) ; trivial
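As an illustration of how cells stand in for assigned lexical variables,
here is a sketch in ordinary direct-style Scheme. It rests on assumptions:
MAKE-CELL is an invented constructor (only CELL-REF and CELL-SET! are
named above), a cell is modelled as a one-element vector, and the
continuation argument that the CELL-SET! primop takes is left out because
the sketch is not in continuation-passing style.

```scheme
;; Toy cells, modelled as one-element vectors.  MAKE-CELL is assumed.
(define (make-cell value) (vector value))
(define (cell-ref cell) (vector-ref cell 0))
(define (cell-set! cell value) (vector-set! cell 0 value))

;; Source program, using SET! on the lexically bound variable N:
;;   (define (make-counter)
;;     (let ((n 0))
;;       (lambda () (set! n (+ n 1)) n)))
;;
;; The same program with N holding a cell, so that N itself is never
;; assigned:
(define (make-counter)
  (let ((n (make-cell 0)))
    (lambda ()
      (cell-set! n (+ (cell-ref n) 1))
      (cell-ref n))))

(define next (make-counter))
(next)   ; => 1
(next)   ; => 2
```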
----------------------------------------------------------------
Printing out the node tree.
The following procedure:
 (define (fact n)
   (let loop ((n n) (r 1))
     (if (< n 2)
         r
         (loop (- n 1) (* n r)))))
when converted into nodes is:
 (LAMBDAp (c_6 n_1)
   (letrec1 (LAMBDAc (x_13 loop_2)
              (letrec2 (LAMBDAc ()
                         (unknown-tail-call c_6 loop_2 n_1 '1))
                       x_13
                       (LAMBDAp (c_8 n_3 r_4)
                         (test
                           (LAMBDAc ()
                             (unknown-return c_8 r_4))
                           (LAMBDAc ()
                             (unknown-tail-call c_8 loop_2 (- n_3 '1) (* n_3 r_4)))
                           (< n_3 '2)))))))
where LAMBDAp is a PROC lambda and LAMBDAc is a CONT lambda. Lexically bound
variables are printed as <name>_<id> and constants as '<value>. This is not
very readable, and larger procedures are much worse. The first step in making
it more comprehensible is to print each lambda node separately with a marker
to indicate where it appears in the tree.
 (LAMBDAp fact_7 (c_6 n_1)
   (letrec1 1 ^c_14))
  (LAMBDAc c_14 (x_13 loop_2)
    (letrec2 1 ^c_12 x_13 ^loop_9))
   (LAMBDAc c_12 ()
     (unknown-tail-call 0 c_6 loop_2 n_1 '1))
   (LAMBDAp loop_9 (c_8 n_3 r_4)
     (test 2 ^g_10 ^g_11 (< n_3 '2)))
    (LAMBDAc g_10 ()
      (unknown-return 0 c_8 r_4))
    (LAMBDAc g_11 ()
      (unknown-tail-call 0 c_8 loop_2 (- n_3 '1) (* n_3 r_4)))
The labels used are the names and id's of the lambda nodes, with a ^ in front
to distinguish them from variables. The code for each lambda is indented
slightly more than the lambda in which it actually occurs. To make the
distinction between continuation and non-continuation lambdas clearer the
number of continuation arguments to a call is printed just after the primop
(for example the first two arguments to TEST are continuations).
The first three calls form a basic block because the first two calls have
exactly one continuation apiece. To make this more easily seen these
calls can be printed using a more condensed notation:
 (LAMBDAp fact_7 (c_6 n_1)
   (LET* (((x_13 loop_2) (letrec1))
          (() (letrec2 x_13 ^loop_9)))
     (unknown-tail-call 0 c_6 loop_2 n_1 '1)))
The continuations are not printed as arguments but instead their variables
are printed to the left of the call in a parody of Scheme's LET*. The results
of the LETREC1 are bound to the variables X_13 and LOOP_2 as would happen with
the real LET* (if it allowed calls to return multiple values).
Finally, here is the way the code for FACT is actually printed:
   7 (P fact_7 (c_6 n_1)
  14   (LET* (((x_13 loop_2)
                (letrec1))
  12          (() (letrec2 x_13 ^loop_9)))
         (unknown-tail-call 0 c_6 loop_2 n_1 '1)))
   9  (P loop_9 (c_8 n_3 r_4)
        (test 2 ^g_10 ^g_11 (< n_3 '2)))
  10   (C g_10 ()
         (unknown-return 0 c_8 r_4))
  11   (C g_11 ()
         (unknown-tail-call 0 c_8 loop_2 (- n_3 '1) (* n_3 r_4)))
The ID number of every lambda node is printed out at the beginning of the
line on which the code for the lambda appears. This is redundant for the
lambdas that are not printed as part of a LET*. The word `LAMBDA' is not
printed. The (letrec1) call appears on a new line because the printer
indents the calls in LET* a fixed amount.
The reason for printing the ID numbers is so that the actual nodes can be
obtained. Once a lambda has been printed (either by the pretty printer or
by the regular printer), (NODE-UNHASH <id>) will return it:
scheme-compiler> (node-unhash 9)
'#{Node lambda loop 9}
scheme-compiler> ,inspect ##
'#{Node lambda loop 9}
[0: variant] 'lambda
[1: parent] '#{Node call letrec2}
[2: index] 2
[3: simplified?] #t
[4: flag] #f
[5: stuff-0] '#{Node call test}
[6: stuff-1] '(#{Variable n 3} #{Variable r 4})
[7: stuff-2] '(#{Name #} (n r) (if # r #))
[8: stuff-3] '#{Lambda-data}
----------------------------------------------------------------
Simplification.
The node tree for FACT shown above is how the procedure looks when first
translated from the source. The next step in compilation is to simplify the
tree, doing constant folding, identifying call points, and so on. The
simplified version of FACT is:
   7 (P fact_7 (c_6 n_1)
  14   (LET* (((x_13 loop_2)
                (letrec1))
  12          (() (letrec2 x_13 ^loop_9)))
         (jump 0 loop_2 n_1 '1)))
   9  (J loop_9 (n_3 r_4)
        (test 2 ^g_10 ^g_11 (< n_3 '2)))
  10   (C g_10 ()
         (unknown-return 0 c_6 r_4))
  11   (C g_11 ()
         (jump 0 loop_2 (+ '-1 n_3) (* n_3 r_4)))
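The main change is that the loop has been turned into a JUMP lambda: LOOP_9
no longer has a continuation variable, the calls to it have become JUMPs,
and G_10 now returns through C_6, the continuation of FACT. The simplifier
has also rewritten (- n_3 '1) as (+ '-1 n_3).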
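One way to read the two node trees for FACT is as hand-written
continuation-passing Scheme. The following sketch is such a reading and
nothing more: FACT-BEFORE and FACT-AFTER are hypothetical names, the
compiler does not produce this source, and an ordinary procedure call
stands in for each of the calling primops.

```scheme
;; Before simplification: LOOP is a PROC, so it receives its own
;; continuation C8 and returns through it.
(define (fact-before c n)
  (letrec ((loop (lambda (c8 n r)
                   (if (< n 2)
                       (c8 r)                          ; unknown-return
                       (loop c8 (- n 1) (* n r))))))   ; unknown-tail-call
    (loop c n 1)))

;; After simplification: LOOP is a JUMP lambda.  It has no continuation
;; variable of its own and returns straight through C, the continuation
;; of the enclosing procedure.
(define (fact-after c n)
  (letrec ((loop (lambda (n r)
                   (if (< n 2)
                       (c r)                           ; unknown-return
                       (loop (+ -1 n) (* n r))))))     ; jump
    (loop n 1)))

;; Passing the identity procedure as the continuation recovers the value.
(fact-before (lambda (v) v) 5)   ; => 120
(fact-after (lambda (v) v) 5)    ; => 120
```

The second version is possible only because every call to the loop is made
from inside the one procedure whose continuation it uses, which is the
condition given earlier for JUMP lambdas.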
----------------------------------------------------------------
Still to describe:
protocol determination
simplifier moving stuff down, duplicating, later passes move values back up