One of SBCL’s Google Summer of Code students, Krzysztof Drewniak (no relation), just got to merge in his development efforts, giving SBCL a far more complete set of Unicode operations.
Given that this was the merge of three months’ out-of-tree work, it’s
not entirely surprising that there were some hiccups, and indeed we
spent some time diagnosing and fixing a 1000-fold slowdown in
char-downcase.
Touch wood, all seems mostly well, except that Jan Moringen reported
that, when building without the :sb-unicode feature (and hence
having a Lisp with 8-bit characters) one of the printer consistency
tests was resulting in an error.
Tracking this down was fun; it in fact had nothing in particular to do with the commit that first showed the symptom, but had been lying latent for a while and had simply never shown up in automated testing. I’ve expressed my admiration for the Common Lisp standard before, and I’ll do it again: both as a user of the language and as an implementor, I think the Common Lisp standard is a well-executed document. But that doesn’t stop it from having problems, and this is a neat one:
When a line break is inserted by any type of conditional newline, any blanks that immediately precede the conditional newline are omitted from the output and indentation is introduced at the beginning of the next line.
(from pprint-newline)
For the graphic standard characters, the character itself is always used for printing in #\ notation---even if the character also has a name[5].
(from CLHS 22.1.3.2)
Space is defined to be graphic.
(from CLHS glossary entry for ‘graphic’)
What do these three requirements together imply? Imagine printing the
list (#\a #\b #\c #\Space #\d #\e #\f) with a right-margin of 17:
(write-to-string '(#\a #\b #\c #\Space #\d #\e #\f) :pretty t :right-margin 17)
; => "(#\\a #\\b #\\c #\\
; #\\d #\\e #\\f)"
The #\Space character is defined to be graphic; therefore, it must
print as #\ followed by a literal space, rather than as #\Space. If it
happens to be printed just before a conditional newline (such as one
generated by using pprint-fill to print a list), the pretty-printer
will helpfully remove the space character that has just been printed
before inserting the newline. This means that a #\Space character,
printed at or near the right margin, will be read back as a #\Newline
character.
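The reader's half of this can be checked directly. The following is a minimal sketch in portable Common Lisp; the variable names *wrapped-output* and *reread* are made up for illustration, and the string is built by hand to match the shape of the output above (a line break directly after the backslash, then one column of indentation) rather than produced by any particular printer.

```lisp
;; Hypothetical name: the text a pretty-printer emits once the blank
;; after the backslash has been dropped at the line break.  FORMAT's ~%
;; directive inserts the newline; each doubled backslash in the literal
;; is a single backslash in the resulting string.
(defparameter *wrapped-output*
  (format nil "(#\\a #\\b #\\c #\\~% #\\d #\\e #\\f)"))

;; Read the text back.  The character syntax now sees a newline right
;; after the backslash, so the fourth element is the newline character
;; rather than the space that was originally printed.
(defparameter *reread* (read-from-string *wrapped-output*))

(char= (fourth *reread*) #\Newline)   ; true
(char= (fourth *reread*) #\Space)     ; false
```

The other six elements survive unchanged, which is part of why a bug like this can sit unnoticed: the output still reads without error, it just reads as something else.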
It’s interesting to see what other implementations do. CLISP 2.49 in
its default mode always prints #\Space; in -ansi mode it prints #\
followed by a space, but preserves that space even before a
conditional newline. CCL 1.10 similarly preserves the space; there’s
an explicit check in output-line-and-setup-for-next for an “escaped”
space (and a comment that acknowledges that this is a heuristic that
can be wrong in the other direction).
I’m not sure what the best fix for this is; it’s fairly clear that the
requirements on the printer aren’t totally consistent. For SBCL, I
have merged a one-line change that makes the printer print using
character names even for graphic characters, if the *print-readably*
printer control variable is true; it may not be ideal that print/read
round-tripping was broken in the normal case, but in the case where
it’s explicitly been asked for it is clearly wrong.
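One way to check the property that matters here is a small round-trip test under readable printing. This is a sketch only: round-trips-readably-p is a hypothetical helper, not part of SBCL or its test suite, and what it returns for a list with a space character at the margin depends on the implementation and version in use.

```lisp
(defun round-trips-readably-p (object right-margin)
  "Pretty-print OBJECT readably within RIGHT-MARGIN columns, read the
text back, and return true if the result is EQUAL to OBJECT."
  (let ((printed (write-to-string object
                                  :pretty t
                                  :readably t
                                  :right-margin right-margin)))
    (equal object (read-from-string printed))))

;; The list from the example above, at the margin that puts the line
;; break right after the space character.
(round-trips-readably-p '(#\a #\b #\c #\Space #\d #\e #\f) 17)
```

With a change like the one described, this call should return true; on an implementation that still prints the bare space when readable output is requested, it would be expected to return false.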