
<!-- -*- mode: html; mode: font-lock -*-

  Hand-translated from LaTeX to HTML by lth on 2000-05-16, and
  converted footnotes to in-line text.  Fixed a small number of
  typos. No other changes. -->

<html>
<head>
<title>FFIGEN User's Manual</title>
</head>

<body>

<center>
<h1>FFIGEN User's Manual</h1><br>
(Preliminary)<br>
Lars Thomas Hansen<br>
<tt>lth@cs.uoregon.edu</tt><br>
February 6, 1996
</center>

<h2>1. Introduction</h2>

<p>FFIGEN is a program system which facilitates the writing of
translators from C header files to foreign function interfaces for
particular programming language implementations.  This document
describes its structure and use.  The discussion is aimed at translator
writers; everyone else should confine themselves to section 3.  A
companion document, <a href="manifesto.html">FFIGEN Manifesto and
Overview</a>, motivates the work, and other companion documents describe
specific translator implementations.  In particular, the document
<em>FFIGEN Back-end for Chez Scheme Version 5</em> describes one
translator in detail.</p>

<p>FFIGEN is based on the <em>lcc</em> C compiler, which is copyrighted
software.  See Section 10 for a full copyright notice.</p>

<h2>2. Writing Translators</h2>

<p>To translate a header file, you first run the <em>ffigen</em>
command to produce an intermediate form of the C header files you want
to translate, and then run the back-end on the resulting files to
generate the foreign function interface for the library.</p>

<p>Your task, should you choose to accept it, is to implement the
target-specific parts of the back-end for your particular target (which
is to say, combination of host language implementation, operating
system, architecture, foreign language implementation, and translation
policy).  You should be able to use the FFIGEN front-end and the
target-independent parts of the back-end pretty much as they are.</p>

<p>How to implement the target-specific parts of the back-end is
discussed in Section 6.  Use of the front end is described in Section 3.
The intermediate format is described in Section 4, and the
target-independent parts of the back-end and their interface to the
target-dependent part are described in Section 5. Finally, Section 7
covers some issues which need to be tackled in the future.</p>

<h2>3. Running FFIGEN</h2>

<p>The command <em>ffigen</em> is run on a set of header files together
with preprocessor and include-file options.  Arguments are processed
in order.  For each header file (type <tt>.h</tt>) and all the files it
includes, a single intermediate file (type <tt>.ffi</tt>) is
produced.</p>

<p>The options are:
<dl>
<dt><tt>-Dname[=value]</tt>
<dd>Define preprocessor macro.
<dt><tt>-Uname</tt>
<dd>Undefine preprocessor macro.
<dt><tt>-Idirectory</tt>
<dd>Add directory to the <em>beginning</em> of the list
of include directories.  Standard directories include the <em>lcc</em> include
directory, <tt>/usr/include</tt>, and the current directory (in that order).
See the release notes for information about how to change the defaults.
</dl>

<em>ffigen</em> performs full syntax and type checks on its input.</p>

<p>The back-end is run by starting your favorite Scheme system and then
loading first the target-independent file <tt>process.sch</tt> and second
the target-dependent part of the translator; in the case of the Chez
Scheme back-end the file is called <tt>chez.sch</tt>.  You then call the
procedure <tt>process</tt> with the name of the <tt>.ffi</tt> file to
process, as discussed in section 5.</p>
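<p>A back-end session along these lines might look as follows.  This is
only a sketch: it assumes that both Scheme files are in the current
directory and that <tt>stdio.ffi</tt> (a hypothetical file name) was
produced earlier by <em>ffigen</em>.

<pre>
    (load "process.sch")     ; target-independent part
    (load "chez.sch")        ; target-dependent part (Chez Scheme back-end)
    (process "stdio.ffi")    ; hypothetical intermediate file
</pre></p>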

<h2>4. Intermediate Format</h2>

<p>The intermediate format consists of s-expressions following this grammar:

<pre>
  &lt;file&gt;      -&gt; &lt;record&gt; ...
  &lt;record&gt;    -&gt; (function &lt;filename&gt; &lt;name&gt; &lt;type&gt; &lt;attrs&gt;)
               | (var &lt;filename&gt; &lt;name&gt; &lt;type&gt; &lt;attrs&gt;)
               | (type &lt;filename&gt; &lt;name&gt; &lt;type&gt;)
               | (struct &lt;filename&gt; &lt;name&gt; ((&lt;name&gt; &lt;type&gt;) ...))
               | (union &lt;filename&gt; &lt;name&gt; ((&lt;name&gt; &lt;type&gt;) ...))
               | (enum &lt;filename&gt; &lt;name&gt; ((&lt;name&gt; &lt;value&gt;) ...))
               | (enum-ident &lt;filename&gt; &lt;name&gt; &lt;value&gt;)
               | (macro &lt;filename&gt; &lt;name+args&gt; &lt;body&gt;)
  &lt;type&gt;      -&gt; (&lt;primitive&gt; &lt;attrs&gt;)
               | (struct-ref &lt;tag&gt;)
               | (union-ref &lt;tag&gt;)
               | (enum-ref &lt;tag&gt;)
               | (function (&lt;type&gt; ...) &lt;type&gt;)
               | (pointer &lt;type&gt;)
               | (array &lt;value&gt; &lt;type&gt;)
  &lt;attrs&gt;     -&gt; (&lt;attr&gt; ...)
  &lt;attr&gt;      -&gt; static | extern | const | volatile
  &lt;primitive&gt; -&gt; char | signed-char | unsigned-char | short
               | unsigned-short | int | unsigned | long
               | unsigned-long | float | double | void
  &lt;value&gt;     -&gt; &lt;integer&gt;
  &lt;filename&gt;  -&gt; &lt;string&gt;
  &lt;name&gt;      -&gt; &lt;string&gt;
  &lt;body&gt;      -&gt; &lt;string&gt;
  &lt;name+args&gt; -&gt; &lt;string&gt;
  &lt;tag&gt;       -&gt; &lt;string&gt;
</pre>

Notes relating to the grammar:</p>

<ul>
<li> <tt>...</tt> means "zero or more of" the preceding item.

<li> The grammar is a little more general than the actual output
language.  All structs, unions, and enums in parameter lists, return
types, and variable declarations are encoded as <tt>struct-ref</tt>,
<tt>union-ref</tt>, and <tt>enum-ref</tt>, respectively; structure, union,
and enum type definitions occur only in <tt>struct</tt>, <tt>union</tt>,
and <tt>enum</tt> records.

<li> The <tt>&lt;tag&gt;</tt> field in structs/unions/enums (and their
<tt>-ref</tt> forms) is the C tag of the type.  If the type
has a user-defined tag, then that tag is used in the <tt>-ref</tt>
item for the type; if it has no user-defined tag, then a tag has been
generated by <em>lcc</em>.  Generated tags have the syntax of positive
integers; in particular they start with a digit.  There is one namespace
each for structs, unions, and enums.

<li>
<tt>typedef</tt> names are never used inside types: they occur only as
the names of <tt>type</tt> records.

<li>
The attributes on primitive types are <tt>const</tt> or <tt>volatile</tt>; the
attributes <tt>static</tt> and <tt>extern</tt> are used only on functions and
global variables.

<li>
Functions which are known to take no parameters (i.e., <tt>t f(void)</tt>) have
one parameter, of type <tt>(void ())</tt>.  The void type appears in a
parameter list only as the last element.

<li>
Functions which take a variable number of arguments have at least one
defined non-void parameter and a last parameter of type <tt>(void ())</tt>.

<li>
Functions for which no parameters were defined (i.e., <tt>t f()</tt>) have
no parameters.

<li>
The ordering of records in the input has no relation to the
relative ordering of declarations in the original source.

<li>
The <tt>&lt;value&gt;</tt> field in the array is its size.  If the size is not
known, it is 0.

<li>
Multidimensional arrays are represented as nested array types with the
leftmost dimension outermost in the expected way; i.e., it looks like
an array of arrays.

<li>
Arrays are not valid return types.

<li>
Array parameters lose some semantic information in the translation in
the current system.  An array parameter <tt>t a[n]</tt> is always
converted to a pointer: <tt>(pointer t)</tt> regardless of whether
<tt>n</tt> is known or not.  As expected, then, something like
<tt>t a[n][m][o]</tt> gets the parameter type
<tt>(pointer (array m (array o t)))</tt>.  Note that this only pertains to
parameter types; variables of array type are not converted in this manner.
(The semantic information claimed lost is the size of the leftmost
dimension.  This lossage may make it impossible to perform array conversion
at call boundaries, for example.)

<li>
The grammar describes the current format, which will change: line number
and column information will be incorporated.  You should always use the
accessor functions defined in the target-independent part of the
back-end; see section 5.  The grammar does not allow
for bit fields or qualifications on anything but primitive
types, but these will be accommodated eventually.

</ul>
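<p>As an illustration of the grammar and the notes above, the following
sketch holds a few records written by hand to match the grammar (they
are not captured <em>ffigen</em> output, and <tt>sample.h</tt> is a
made-up file name).  The helper <tt>count-kind</tt> is hypothetical and
inspects the raw forms only for the sake of the example; real back-ends
should go through the accessors described in Section 5.

<pre>
    ;; Hand-written records following the grammar; illustrative only.
    (define sample-records
      '((function "sample.h" "f"
                  (function ((int ()) (pointer (char (const)))) (int ()))
                  (extern))
        (function "sample.h" "g" (function ((void ())) (void ())) (extern))
        (struct "sample.h" "point" (("x" (int ())) ("y" (int ()))))
        (type "sample.h" "point_t" (struct-ref "point"))
        (var "sample.h" "table" (array 0 (int ())) (extern))
        (enum-ident "sample.h" "RED" 0)))

    ;; Hypothetical helper: count the records whose leading symbol is KIND.
    (define (count-kind kind records)
      (let loop ((rs records) (n 0))
        (cond ((null? rs) n)
              ((eq? (car (car rs)) kind) (loop (cdr rs) (+ n 1)))
              (else (loop (cdr rs) n)))))

    (count-kind 'function sample-records)   ; 2
</pre>

Here <tt>f</tt> takes an <tt>int</tt> and a pointer to <tt>const char</tt>,
<tt>g</tt> is declared with a <tt>void</tt> parameter list, and
<tt>table</tt> is an array of unknown size.</p>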


<h2>5. The Target-Independent Back-End</h2>

<p>The target-independent back-end is a Scheme program called
<tt>process</tt> which reads the intermediate form into memory and
performs some initial processing.  It exports some global variables and
a number of procedures which are used to access the structures in the
database of intermediate records, and imports two target-dependent
functions from the target-dependent back-end.  This section describes
the interfaces.</p>

<p>The global variables which hold the database are:

<pre>
    (define functions '())      ; list of function records
    (define vars '())           ; list of var records
    (define types '())          ; list of type records
    (define structs '())        ; list of struct records
    (define unions '())         ; list of union records
    (define macros '())         ; list of macro records
    (define enums '())          ; list of enum records
    (define enum-idents '())    ; list of enum-ident records
</pre>

Each of these contains a list of all the records of the type indicated
by their names.  Note that records may look different internally than
in the defined intermediate form, so accessor functions (see below) should
always be used.</p>

<p>In addition, there are two globals which are set but not used by
the target-independent back-end:

<pre>
    (define source-file #f)     ; name of the input file itself
    (define filenames '())      ; names of all files in the input
</pre>
</p>

<p>The main entry point to the back end is the procedure <tt>process</tt>,
which takes a single file name as an argument.  <tt>Process</tt>
initializes globals, reads the file, and processes the records.

<pre>
    (define (process filename) ...)
</pre></p>

<p>Record processing consists of some general analysis and target-specific
code generation.  First, the target-specific procedure
<tt>select-functions</tt> is called; it must set or reset the
"referenced" bit in each function record depending on whether the function is
interesting to the back-end or not.  After computing reachability of
structured types and setting the referenced bits of those types which
are reachable, a translation is generated by a call to the back-end
function <tt>generate-translation</tt>, which takes no arguments.

<pre>
    (define (select-functions) ...)
    (define (generate-translation) ...)
</pre></p>
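<p>A minimal target-dependent part might therefore look like the
following sketch.  It is meant to be loaded after <tt>process.sch</tt>
and uses only procedures and variables named in this section.  The
selection policy (skip functions whose names begin with an underscore)
and the one-line-per-function output are arbitrary choices made for
illustration, and the code assumes that <tt>name</tt> returns a
non-empty string.

<pre>
    ;; Load after process.sch.  The policy below is an assumption of
    ;; this sketch, not something FFIGEN prescribes.
    (define (select-functions)
      (for-each (lambda (f)
                  (if (char=? (string-ref (name f) 0) #\_)
                      (unreferenced! f)
                      (referenced! f)))
                functions))

    ;; Print one line per selected function instead of real glue code.
    (define (generate-translation)
      (for-each (lambda (f)
                  (if (referenced? f)
                      (begin
                        (display (name f))
                        (display " from ")
                        (display (file f))
                        (newline))))
                functions))
</pre></p>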

<p>A number of data structure accessors and mutators are also available.
These are generic procedures which work on all of the record types.

<pre>
    (define (file r) ...)          ; file name of record
    (define (name r) ...)          ; name in records which have one
    (define (type r) ...)          ; type in records which have one
    (define (attrs r) ...)         ; attrs in records which have one
    (define (fields r) ...)        ; fields in struct/union record
    (define (value r) ...)         ; value of enum-ident record
    (define (tag r) ...)           ; tag in struct-ref/union-ref/enum-ref record

    (define (referenced? r) ...)   ; is record referenced?
    (define (referenced! r) ...)   ; set referenced bit
    (define (unreferenced! r) ...) ; reset referenced bit
</pre>

Arguably the <tt>tag</tt> accessor should go away and <tt>name</tt>
should simply be used in its place.  As it is, <tt>name</tt> is not
defined on <tt>struct-ref</tt>, <tt>union-ref</tt>, and
<tt>enum-ref</tt> records.</p>

<p>The procedure <tt>record-tag</tt> returns the tag of the record passed
to it.  It can also be applied to types.

<pre>
    (define (record-tag r) ...)    ; get record tag
</pre></p>

<p>All records can have back-end specific values attached to them; usually
these are cached names for operations on structured values, so for now
the procedures which manipulate the back-end specific data are called
<tt>cache-name</tt> to remember a value and <tt>cached-names</tt> to return
the list of remembered values:

<pre>
    (define (cache-name r v) ...)  ; remember value in record
    (define (cached-names r) ...)  ; retrieve remembered values
</pre>

We should probably replace this with a more general property-list-like
mechanism.</p>

<p>In addition, two procedures extract parts of function types:

<pre>
    (define (arglist r) ...)       ; function argument types
    (define (rett r) ...)          ; function return type
</pre></p>
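<p>The parameter-list conventions from Section 4 can be told apart by
looking at the last element of an argument list.  The following is a
minimal sketch, assuming that the list holds types in exactly the form
given by the grammar; <tt>void-type?</tt>, <tt>last-element</tt>, and
<tt>classify-parameters</tt> are hypothetical helpers and not part of
<tt>process.sch</tt>.

<pre>
    ;; Assumes types are the raw forms from the grammar in Section 4.
    (define (void-type? t)
      (and (pair? t) (eq? (car t) 'void)))

    (define (last-element lst)
      (if (null? (cdr lst))
          (car lst)
          (last-element (cdr lst))))

    ;; ARGS is a list of parameter types.
    (define (classify-parameters args)
      (cond ((null? args) 'unspecified)                        ; t f()
            ((not (void-type? (last-element args))) 'fixed)
            ((null? (cdr args)) 'none)                         ; t f(void)
            (else 'variadic)))                                 ; t f(t1, ...)

    (classify-parameters '())                                    ; unspecified
    (classify-parameters '((void ())))                           ; none
    (classify-parameters '((int ()) (pointer (char ()))))        ; fixed
    (classify-parameters '((pointer (char (const))) (void ())))  ; variadic
</pre>

Note that a trailing <tt>(pointer (void ()))</tt> parameter is an
ordinary pointer and is therefore classified as <tt>fixed</tt>.</p>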

<p>Some utilities to deal with file names are also provided:

<pre>
    (define (strip-extension fn) ...)
    (define (strip-path fn) ...)
    (define (get-path fn) ...)
</pre></p>

<p>A string macro expander makes it easier to generate C code, for the back
ends that need it.  The macro expander is called <tt>instantiate</tt> and
is called with a string template and a vector of arguments (which are
also strings).  The template contains patterns of the form <tt>@n</tt>
where <tt>n</tt> is a single digit; when such a pattern is seen it is
replaced with the corresponding value from the argument vector.

<pre>
    (define (instantiate template arguments) ...)
</pre></p>
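<p>To make the substitution rule concrete, here is a small stand-alone
expander written for this manual.  It is not the FFIGEN implementation;
the name <tt>expand-template</tt> is hypothetical, and the sketch assumes
that <tt>@0</tt> refers to the first element of the argument vector.

<pre>
    ;; Hypothetical stand-in for instantiate; assumes @0 names the
    ;; first element of the argument vector.
    (define (expand-template template arguments)
      (let ((len (string-length template)))
        (let loop ((i 0) (pieces '()))
          (cond ((= i len)
                 (apply string-append (reverse pieces)))
                ((and (char=? (string-ref template i) #\@)
                      (not (= (+ i 1) len))
                      (char-numeric? (string-ref template (+ i 1))))
                 ;; Replace the two-character pattern with its argument.
                 (let ((n (- (char->integer (string-ref template (+ i 1)))
                             (char->integer #\0))))
                   (loop (+ i 2) (cons (vector-ref arguments n) pieces))))
                (else
                 ;; Copy any other character unchanged.
                 (loop (+ i 1)
                       (cons (string (string-ref template i)) pieces)))))))

    (expand-template "extern @0 get_@1(struct @2 *p);"
                     (vector "int" "x" "point"))
    ; "extern int get_x(struct point *p);"
</pre></p>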

<p>Two procedures, <tt>struct-names</tt> and <tt>union-names</tt>, take a
structure (or union) and return a list of all the typedef names which
reference the structure directly.

<pre>
    (define (struct-names struct) ...)
    (define (union-names union) ...)
</pre></p>
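<p>One reading of "directly" is that the typedef's type is the structure
reference itself rather than, say, a pointer to it.  The sketch below
illustrates that reading on hand-written <tt>type</tt> records in the raw
grammar form; <tt>direct-typedef-names</tt> is a hypothetical helper, and
the real procedures take a structure or union record rather than a tag
string.

<pre>
    ;; Hand-written type records in the raw form of the Section 4 grammar.
    (define sample-types
      '((type "sample.h" "point_t" (struct-ref "point"))
        (type "sample.h" "point_ptr" (pointer (struct-ref "point")))
        (type "sample.h" "pt" (struct-ref "point"))
        (type "sample.h" "size" (unsigned ()))))

    ;; Hypothetical helper: names of the typedefs whose type is exactly
    ;; (struct-ref TAG).  A pointer to the structure does not count.
    (define (direct-typedef-names tag type-records)
      (cond ((null? type-records) '())
            ((equal? (cadddr (car type-records)) (list 'struct-ref tag))
             (cons (caddr (car type-records))
                   (direct-typedef-names tag (cdr type-records))))
            (else (direct-typedef-names tag (cdr type-records)))))

    (direct-typedef-names "point" sample-types)   ; ("point_t" "pt")
</pre></p>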

<p>An association function which searches one of the record lists for a
given record by the <tt>name</tt> field is also available:

<pre>
    (define (lookup key items) ...)
</pre></p>

<p>The procedure <tt>user-defined-tag?</tt> determines whether a tag was
defined by the user or generated by the system:

<pre>
    (define (user-defined-tag? x) ...)
</pre></p>
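<p>Since generated tags have the syntax of positive integers (see
Section 4), such a test can be as simple as looking at the first
character.  The following is a sketch under the assumption that the tag
is available as a non-empty string; <tt>looks-user-defined?</tt> is a
hypothetical name, not the procedure from <tt>process.sch</tt>.

<pre>
    ;; Generated tags start with a digit; user-defined tags cannot,
    ;; because C identifiers never begin with a digit.
    (define (looks-user-defined? tag)
      (not (char-numeric? (string-ref tag 0))))

    (looks-user-defined? "point")   ; #t
    (looks-user-defined? "17")      ; #f
</pre></p>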

<p>The procedure <tt>warn</tt> takes some arbitrary arguments and generates
a warning message on standard output:

<pre>
    (define (warn msg . rest) ...)
</pre></p>

<p>Some standard predicates take a type and test its kind:
<tt>primitive-type?</tt> is true if the argument is of a primitive type as
outlined in the grammar above; <tt>basic-type?</tt> is true if the
argument is a primitive type or a pointer type; <tt>array-type?</tt> is
true if the argument is an array type, and finally,
<tt>structured-type?</tt> is true if the argument is a <tt>struct-ref</tt>
or <tt>union-ref</tt> type:

<pre>
    (define (primitive-type? t) ...)
    (define (basic-type? t) ...)
    (define (array-type? t) ...)
    (define (structured-type? t) ...)
</pre></p>
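<p>The following sketch shows what these predicates amount to when types
are represented exactly as in the grammar of Section 4, which is an
assumption, since the internal representation may differ.  All names
beginning with <tt>raw-</tt>, and <tt>parameter-type</tt>, are
hypothetical.  The last procedure applies the array-to-pointer
conversion that Section 4 describes for parameters.

<pre>
    (define primitive-names
      '(char signed-char unsigned-char short unsigned-short int unsigned
        long unsigned-long float double void))

    (define (raw-primitive? t) (if (memq (car t) primitive-names) #t #f))
    (define (raw-pointer? t) (eq? (car t) 'pointer))
    (define (raw-basic? t) (or (raw-primitive? t) (raw-pointer? t)))
    (define (raw-array? t) (eq? (car t) 'array))
    (define (raw-structured? t)
      (if (memq (car t) '(struct-ref union-ref)) #t #f))

    ;; An array parameter loses its outermost dimension and becomes a
    ;; pointer to the element type; other types are left alone.
    (define (parameter-type t)
      (if (raw-array? t)
          (list 'pointer (caddr t))
          t))

    (raw-basic? '(pointer (struct-ref "point")))     ; #t
    (raw-structured? '(enum-ref "color"))            ; #f
    (parameter-type '(array 4 (array 3 (int ()))))   ; (pointer (array 3 (int ())))
</pre></p>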

<h2>6. Writing a Target-Dependent Back-End</h2>

<p>To write the target-dependent back-end, you must decide on the policy
for the translation and then implement the translation.  The policy
covers such issues as: which constructs in C are or are not handled; the
translation for each handled construct; how non-handled constructs are
dealt with (ignored, detected with warnings, detected with errors); how
to deal with exceptional cases (consider the <tt>fgets</tt> example from
the <a href="manifesto.html">Manifesto</a>).</p>

<p>For a concrete example, see the companion document <em>FFIGEN Back-end
for Chez Scheme Version 5</em>, which addresses many of the choices to be
made and their possible solutions.</p>

<h2>7. Future Work</h2>

<p>A number of features <em>will</em> be supported in the future:</p>

<ul>
<li> There will be a line and a column field in each record, giving the
source line on which the identifier was defined.

<li> Bitfields will be supported.

<li> Qualifiers (what are now called attributes, that is, <tt>const</tt> and
<tt>volatile</tt>) will be supported on all types, not just on primitive
non-pointer types as at present.

<li> The intermediate representation will include the name of the original
input file, and its path.

<li> The intermediate representation will include a representation of
the include file hierarchy which was traversed to produce the
intermediate representation.

</ul>

<p>A number of features will most likely be supported, but need
to be investigated:</p>

<ul>
<li> It would be nice to retain comments.

<li> Various popular extensions to C are not currently supported by
<em>lcc</em>, but would be extremely useful: <tt>long long</tt> is used
extensively in Unix header files, and header files for compilers on PCs
often use the common Microsoft extensions <tt>__huge</tt>, <tt>__far</tt>,
and <tt>__near</tt> (and their non-underscore equivalents).  Some C compilers
support <tt>__inline</tt> declarations, and although we can't generate
code for in-line procedures we can at least parse them if the compiler
can cope with <tt>__inline</tt>.  (<tt>__inline</tt> is the easiest, since it
can be ignored.  The others must show up as type qualifiers or new types.)

<li> The current shell-program driver will probably be replaced by
something based on the lcc driver.

<li> I'm going to experiment with partial macro application in the
front end so that back-ends can have simple support for macro
definitions.  Currently, for example, even something as simple as the
<tt>EOF</tt> macro will be ignored by the Chez Scheme back-end because its
form is <tt>"(-1)"</tt> rather than simply <tt>"-1"</tt>.

<li> Information about the layout of fields within structured types
should possibly be emitted; this information would be useful to
low-level FFIs which need byte offset and size to access the field of a
structure.

</ul>

<p>In addition, there are some issues to investigate in a larger
perspective:</p>

<ul>
<li> General (target-independent) support for useful policy mechanisms.

<li> How well can the intermediate language support other front-ends?
I don't want to fall into the UNCOL pit, but it would be interesting to
see how languages which resemble C in their parameter passing mechanisms
(Pascal, Modula, Oberon) could be mapped onto the intermediate language.
This is not high priority with me, however.  If I embark on supporting
another front-end language it will probably be (sigh) C++.

</ul>

<h2>8. Please Contribute!</h2>

<p>My goal is to support as many target languages as is reasonable, but I
can't write all the translators myself (I lack the time and, in many
cases, the knowledge).  Targets that I will take care of include STk,
and, if no-one beats me to it, Scsh, both Scheme systems.  Someone has
already volunteered to write the ILU back-end.  Others are interested
in back-ends for Modula-3 and Mercury.</p>

<p>Anyone willing to work on a translator back-end is welcome to e-mail
me.  I will coach, coordinate, and help out as much as
possible.</p>

<h2>9. Credits</h2>

<p>FFIGEN is based on the freely available <em>lcc</em> ANSI C compiler,
implemented by Christopher Fraser (of AT&amp;T Bell Labs) and David Hanson
(of Princeton University).</p>

<p>I would like to thank Fraser and Hanson for producing such an excellent
system; <em>lcc</em> has been a joy to work with, and their book, <em>A
Retargetable C Compiler: Design and Implementation</em>, made it possible
to implement the FFIGEN front end in roughly a single
work day.  Would that all software were this clean!</p>

<p>The development of FFIGEN was supported by ARPA
under U.S. Army grant No. DABT63-94-C-0029,
"Programming Environments, Compiler Technology and Runtime Systems
for Object Oriented Parallel Processing".</p>

<h2>10. Copyrights</h2>

<p><em>lcc</em> is covered by the following copyright notice:</p>

<blockquote>
<p>The authors of this software are Christopher W. Fraser and
David R. Hanson.</p>

<p>Copyright (c) 1991,1992,1993,1994,1995 by AT&amp;T, Christopher W. Fraser,
and David R. Hanson. All Rights Reserved.</p>

<p>Permission to use, copy, modify, and distribute this software for any
purpose, subject to the provisions described below, without fee is
hereby granted, provided that this entire notice is included in all
copies of any software that is or includes a copy or modification of
this software and in all copies of the supporting documentation for
such software.</p>

<p>THIS SOFTWARE IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED
WARRANTY. IN PARTICULAR, NEITHER THE AUTHORS NOR AT&amp;T MAKE ANY
REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.</p>

<p>lcc is not public-domain software, shareware, and it is not protected
by a `copyleft' agreement, like the code from the Free Software
Foundation.</p>

<p>lcc is available free for your personal research and instructional use
under the `fair use' provisions of the copyright law. You may,
however, redistribute the lcc in whole or in part provided you
acknowledge its source and include this COPYRIGHT file.</P>

<p>You may not sell lcc or any product derived from it in which it is a
significant part of the value of the product. Using the lcc front end
to build a C syntax checker is an example of this kind of product.</p>

<p>You may use parts of lcc in products as long as you charge for only
those components that are entirely your own and you acknowledge the use
of lcc clearly in all product documentation and distribution media. You
must state clearly that your product uses or is based on parts of lcc
and that lcc is available free of charge. You must also request that
bug reports on your product be reported to you. Using the lcc front
end to build a C compiler for the Motorola 88000 chip and charging for
and distributing only the 88000 code generator is an example of this
kind of product.</p>

<p>Using parts of lcc in other products is more problematic. For example,
using parts of lcc in a C++ compiler could save substantial time and
effort and therefore contribute significantly to the profitability of
the product. This kind of use, or any use where others stand to make a
profit from what is primarily our work, is subject to negotiation.</p>

<p>Chris Fraser / cwf@research.att.com <br>
David Hanson / drh@cs.princeton.edu<br>
Fri Jun 17 11:57:07 EDT 1994</p>
</blockquote>

<hr>
<address>
<A HREF="mailto:lth@acm.org">lth@acm.org</A>
</address>
<em>24 May 2000</em>
</body>
</html>