I think this might be my last blog entry on the subject of building SBCL for a while.
One of the premises behind SBCL as a separate entity from CMUCL, its parent, was to make the result of its build be independent of the compiler used to build it. To a world where separate compilation is the norm, the very idea that building some software should persistently modify the state of the compiler probably seems bizarre, but the Lisp world evolved in that way and Lisp environments (at least those written in themselves) developed build recipes where the steps to construct a new Lisp system from an old one and the source code would depend critically on internal details of both the old and the new one: substantial amounts of introspection on the build host were used to bootstrap the target, so if the details revealed by introspection were no longer valid for the new system, there would need to be some patching in the middle of the build process. (How would you know whether that was necessary? Typically, because the build would fail with a more-or-less – usually more – cryptic error.)
Enter SBCL, whose strategy is essentially to use the source files
first to build an SBCL!Compiler running in a host Common Lisp
implementation, and then to use that SBCL!Compiler to compile the
source files again to produce the target system. This requires some
contortions in the source files: we must write enough of the system in
portable Common Lisp so that an arbitrary host can execute
SBCL!Compiler to compile SBCL-flavoured sources (including the
standard headache-inducing (defun car (list) (car list))
and
similar, which works because SBCL!Compiler knows how to compile calls
to
car
).
How much is “enough” of the system? Well, one answer might be when the build output actually works, at least to the point of running and executing some Lisp code. We got there about twelve years ago, when OpenMCL (as it was then called) compiled SBCL. And yet... how do we know there aren't odd differences that depend on the host compiler lurking, which will not obviously affect normal operation but will cause hard-to-debug trouble later? (In fact there were plenty of those, popping up at inopportune moments).
I’ve been working intermittently on dealing with this, by attempting to make the Common Lisp code that SBCL!Compiler is written in sufficiently portable that executing it on different implementations generates bitwise-identical output. Because then, and only then, can we be confident that we are not depending in some unforseen way on a particular implementation-specific detail; if output files are different, it might be a harmless divergence, for example a difference in ordering of steps where neither depends on the other, or it might in fact indicate a leak from the host environment into the target. Before this latest attack at the problem, I last worked on it seriously in 2009, getting most of the way there but with some problems remaining, as measured by the number of output files (out of some 330 or so) whose contents differed depending on which host Common Lisp implementation SBCL!Compiler was running on.
Over the last month, then, I have been slowly solving these problems, one by one. This has involved refining what is probably my second most useless skill, working out what SBCL fasl files are doing by looking at their contents in a text editor, and from that intuiting the differences in the implementations that give rise to the differences in the output files. The final pieces of the puzzle fell into place earlier this week, and the triumphant commit announces that as of Wednesday all 335 target source files get compiled identically by SBCL!Compiler, whether that is running under Clozure Common Lisp (32- or 64-bit versions), CLISP, or a different version of SBCL itself.
Oh but wait. There is another component to the build: as well as
SBCL!Compiler, we have SBCL!Loader, which is responsible for taking
those 335 output files and constructing from them a Lisp image file
which the platform executable can use to start a Lisp session.
(SBCL!Loader is maybe better known as “genesis”; but it is to
load
what SBCL!Compiler is to
compile-file
).
And it was slightly disheartening to find that despite having 335
identical output files, the resulting cold-sbcl.core
file differed
between builds on different host compilers, even after I had
remembered to discount the build fingerprint constructed to be
different for every build.
Fortunately, the actual problem that needed fixing was relatively
small: a call to
maphash
,
which (understandably) makes no guarantees about ordering, was used to
affect the Lisp image data directly. I then spent a certain amount of
time being thoroughly confused, having managed to construct for myself
a Lisp image where the following forms executed with ... odd results:
(loop for x being the external-symbols of "CL" count 1)
; => 1032
(length (delete-duplicates (loop for x being the external-symbols of "CL" collect x)))
; => 978
(shades of times gone by). Eventually I realised that
(unless (member (package-name package) '("COMMON-LISP" "KEYWORD" :test #'string=))
...)
was not the same as
(unless (member (package-name package) '("COMMON-LISP" "KEYWORD") :test #'string=)
...)
and all was well again, and as of
this commit
the cold-sbcl.core
output file is identical no matter the build
host.
It might be interesting to survey the various implementation-specific behaviours that we have run into during the process of making this build completely repeatable. The following is a probably non-exhaustive list – it has been twelve years, after all – but maybe is some food for thought, or (if you have a particularly demonic turn of mind) an ingredients list for a maximally-irritating CL implementation...
- Perhaps most obviously, various constants are
implementation-defined. The ones which caused the most trouble were
undoubtably
most-positive-fixnum
andmost-negative-fixnum
– particularly since they could end up being used in ways where their presence wasn’t obvious. For example,(deftype fd () `(integer 0 ,most-positive-fixnum))
has, in the SBCL build process, a subtly different meaning from(deftype fd () (and fixnum unsigned-byte))
– in the second case, thefd
type will have the intended meaning in the target system, using the target’sfixnum
range, while in the first case we have no way of intercepting or translating the host’s value ofmost-positive-fixnum
. Special mentions go toarray-dimension-limit
, which caused Bill Newman to be cross on the Internet, and tointernal-time-units-per-second
; I ended up tracking down one difference in output machine code from a leak of the host’s value of that constant into target code. - Similarly,
random
andsxhash
quite justifiably differ between implementations. The practical upshot of that is that these functions can’t be used to implement a cache in SBCL!Compiler, because the access patterns and hence the patterns of cache hits and misses will be different depending on the host implementation. - As I’ve already mentioned,
maphash
does not iterate over hash-table contents in a specified order, and in fact that order need not be deterministic; similarly,with-package-iterator
can generate symbols in arbitrary orders, and set operations (intersection
,set-difference
and friends) will return the set as a list whose elements are in an arbitrary order. Incautious use of these functions tended to give rise to harmless but sometimes hard-to-diagnose differences in output; the solution was typically to sort the iteration output before operating on any of it, to introduce determinism... - ... but it was possible to get that wrong in a harder-to-detect way,
because
sort
isnot specified to be stable. In some implementations, it actually is a stable sort in some conditions, but for cases where it’s important to preserve an already-existing partial order,stable-sort
is the tool for the job. - The language specification explicitly says that the initial contents
of uninitialized arrays are undefined. In most implementations, at
most times, executing
(make-array 8 :element-type (unsigned-byte 8))
will give a zero-filled array, but there are circumstances in some implementations where the returned array will have arbitrary data. - Not only are some constants implementation-defined, but so also are
the effects of normal operation on some variables.
*gensym-counter*
is affected by macroexpansion if the macro function callsgensym
, and implementations are permitted to macroexpand macros an arbitrary number of times. That means that our use ofgensym
needs to be immune to whatever the host implementation’s macroexpansion and evaluation strategy is. - The object returned by
byte
to represent a bitfield with size and position is implementation-defined. Implementations (variously) return bitmasks, conses, structures, vectors; host return values ofbyte
must not be used during the execution of SBCL!Compiler. More subtly, the variousboole
-related constants (boole-and
and friends) also need special treatment; at one point, their host values were used when SBCL!Compiler compiled theboole
function itself, and it so happens that CLISP and SBCL both represent the constants as integers between 0 and 15... but with a different mapping between operation and integer. - my last blog entry talked
about constant coalescing, and about printing of
(quote foo)
. In fact printing in general has been a pain, and there are still significant differences in interpretation or at least in implementation of pretty-printing: to the extent that at one point we had to minimize printing at all in order for the build to complete under some implementations. - there are a number of things which are implementation-defined but
have caused a certain amount of difficulty. Floating point in
general is a problem, not completely solved (SBCL will not build
correctly if its host doesn’t have distinct single- and double-float
types that are at least approximately IEEE754-compliant). Some
implementations lack denormalized numbers; some do not expose signed
zeros to the user; and some implementations compute
(log 2d0 10d0)
more accurately than others, including SBCL itself, do. The behaviour of the host implementation on legal but dubious code is also potentially tricky: SBCL’s build treats fullwarning
s as worthy of stopping, but some hosts emit full warnings for constructs that are tricky to write in other ways: for example to write portable code to handle multiple kinds of string, one might write(typecase string (simple-base-string ...) ((simple-array character (*)) ...)) (string ...))
but some implementations emit fullwarning
s if a clause in atypecase
is completely shadowed by other clauses, and ifbase-char
andcharacter
are identical in that implementation thetypecase
above will signal.
There were probably other, more minor differences between implementations, but the above list gives a flavour of the things that needed doing in order to get to this point, where we have some assurance that our code is behaving as intended. And all of this is a month ahead of my self-imposed deadline of SBCL’s 15th birthday: SBCL was announced to the world on December 14th, 1999. (I’m hoping to be able to put on an sbcl15 workshop in conjunction with the European Lisp Symposium around April 20th/21st/22nd – if that sounds interesting, please pencil the dates in the diary and let me know...)