One of SBCL’s Google Summer of Code students, Krzysztof Drewniak (no relation) just got to merge in his development efforts, giving SBCL a far more complete set of Unicode operations.

Given that this was the merge of three months’ out-of-tree work, it’s not entirely surprising that there were some hiccups, and indeed we spent some time diagnosing and fixing a 1000-fold slowdown in char-downcase. Touch wood, all seems mostly well, except that Jan Moringen reported that, when building without the :sb-unicode feature (and hence having a Lisp with 8-bit characters) one of the printer consistency tests was resulting in an error.

Tracking this down was fun; it in fact had nothing in particular to do with the commit that first showed the symptom, but had been lying latent for a while and had simply never shown up in automated testing. I’ve expressed my admiration for the Common Lisp standard before, and I’ll do it again: both as a user of the language and as an implementor, I think the Common Lisp standard is a well-executed document. But that doesn’t stop it from having problems, and this is a neat one:

When a line break is inserted by any type of conditional newline, any blanks that immediately precede the conditional newline are omitted from the output and indentation is introduced at the beginning of the next line.

(from pprint-newline)

For the graphic standard characters, the character itself is always used for printing in #\ notation---even if the character also has a name[5].

(from CLHS 22.1.3.2)

Space is defined to be graphic.

(from CLHS glossary entry for ‘graphic’)

What do these three requirements together imply? Imagine printing the list (#\a #\b #\c #\Space #\d #\e #\f) with a right-margin of 17:

(write-to-string '(#\a #\b #\c #\Space #\d #\e #\f) :pretty t :right-margin 17)
; => "(#\\a #\\b #\\c #\\
; #\\d #\\e #\\f)"

The #\Space character is defined to be graphic; therefore, it must print as #\ rather than #\Space; if it happens to be printed just before a conditional newline (such as, for example, generated by using pprint-fill to print a list), the pretty-printer will helpfully remove the space character that has just been printed before inserting the newline. This means that a #\Space character, printed at or near the right margin, will be read back as a #\Newline character.

It’s interesting to see what other implementations do. CLISP 2.49 in its default mode always prints #\Space; in -ansi mode it prints #\ but preserves the space even before a conditional newline. CCL 1.10 similarly preserves the space; there’s an explicit check in output-line-and-setup-for-next for an “escaped” space (and a comment that acknowledges that this is a heuristic that can be wrong in the other direction). I’m not sure what the best fix for this is; it’s fairly clear that the requirements on the printer aren’t totally consistent. For SBCL, I have merged a one-line change that makes the printer print using character names even for graphic characters, if the *print-readably* printer control variable is true; it may not be ideal that print/read round-tripping was broken in the normal case, but in the case where it’s explicitly been asked for it is clearly wrong.