code walking for pipe sequencing

← inquisitor 1352 | blog | inquisitor 1351 →

Since it seems still topical to talk about Lisp and code-transformation macros, here’s another worked example – this time inspired by the enthusiasm for the R magrittr package.

The basic idea behind the magrittr package is, as Hadley said at EARL2014, to convert from a form of code where arguments to the same function are far apart to one where they’re essentially close together; the example he presented was converting

arrange(
  summarise
    group_by(
      filter(babynames, name == "Hadley"),
      year),
    total = sum(n)
  desc(year))

b0 <- babynames
b1 <- filter(b0, name == "Hadley")
b2 <- group_by(b1, year)
b3 <- summarise(b2, total = sum(n))
b4 <- arrange(b3, desc(year))

only without the danger of mistyping one of the variable names along the way and failing to perform the computation that was intended.

R, as I have said before, is a Lisp-1 with weird syntax and wacky evaluation semantics. One of the things that ordinary user code can do is inspect the syntactic form of its arguments, before evaluating them. This means that when looking at a fragment of code such as

foo(bar(2,3), 4)

where a call-by-value language would first evaluate bar(2,3), then call foo with two arguments (the value resulting from the evaluation, and 4), R instead uses a form of call-by-need evaluation, and also provides operators for inspecting the promise directly. This means R users can do such horrible things as

foo <- function(x) {
    tmp <- substitute(x)
    sgn <- 1
    while(class(tmp) == "(") {
        tmp <- tmp[[2]]
        sgn <- sgn * -1
    }
    sgn * eval.parent(tmp)
}
foo(3) # 3
foo((3)) # -3
foo(((3))) # 3
foo((((3)))) # -3 (isn’t this awesome?  I did say “wacky”)

In the case of magrittr, the package authors have taken advantage of this to invent some new syntax; the pipe operator %>% is charged with inserting its first argument (its left-hand side, in normal operation) as the first argument to the call of its second argument (right-hand side). Hadley’s example is

babynames %>%
  filter(name == "Hadley") %>%
  group_by(year) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(year))

and this is effective because the data flow in this case really is a pipeline: there's a dataset, which needs filtering, then grouping, then summarization, then sorting, and each operation works on the result of the previous. This already needs to inspect the syntactic form of the argument; an additional feature is recognizing the presence of .s in the call, and placing the left-hand side value in that argument position instead of as the first argument if it is present.

In Common Lisp, there are some piping or chaining operators out there (e.g. one two three (search for ablock) four and probably many others), and they do well enough. However! They mostly suffer from similar problems that we’ve seen before: doing code transformations with not quite enough understanding of the semantics of the code that they’re transforming; again, that’s fine for normal use, but for didactic purposes let’s pretend that we really care about this.

The -> macro from http://stackoverflow.com/a/11080068 is basically the same as the magrittr %>% operator: it converts symbols in the pipeline to function calls, and places the result of the previous evaluation as the first argument of the current operator, except if a $ is present in the arguments, in which case it replaces that. (This version doesn’t support more than one $ in the argument list; it would be a little bit of a pain to support that, needing a temporary name, but it’s straightforward in principle).

Since the -> macro does its job, a code-walker implementation isn’t strictly necessary: pure syntactic manipulation is good enough, and if it’s used with just the code it expects, it will do it well. It is of course possible to express what it does using a code-walker; we’ll fix the multiple-$ ‘bug’ along the way, by explicitly introducing bindings rather than replacements of symbols:

(defmacro -> (form &body body)
  (labels ((find-$ (form env)
             (sb-walker:walk-form form env
              (lambda (f c e)
                (cond
                  ((eql f '$) (return-from find-$ t))
                  ((eql f form) f)
                  (t (values f t)))))
             nil)
           (walker (form context env)
             (cond
               ((symbolp form) (list form))
               ((atom form) form)
               (t (if (find-$ form env)
                      (values `(setq $ ,form) t)
                      (values `(setq $ ,(list* (car form) '$ (cdr form))) t))))))
    `(let (($ ,form))
       ,@(mapcar (lambda (f) (sb-walker:walk-form f nil #'walker)) body))))

How to understand this implementation? Well, clearly, we need to understand what sb-walker:walk does. Broadly, it calls the walker function (its third argument) on successive evaluated subforms of the original form (and on variable names set by setq); the primary return value is used as the interim result of the walk, subject to further walking (macroexpansion and walking of its subforms) except if the second return value from the walker function is t.

Now, let’s start with the find-$ local function: its job is to walk a form, and returns t if it finds a $ variable to be evaluated at toplevel and nil otherwise. It does that by returning t if the form it’s given is $; otherwise, if the form it’s given is the original form, we need to walk its subforms, so return f; otherwise, return its form argument f with a secondary value of t to inhibit further walking. This operation is slightly at odds with the use of a code walker: we are explicitly not taking advantage of the fact that it understands the semantics of the code it’s walking. This might explain why the find-$ function itself looks a bit weird.

The walker local function is responsible for most of the code transformation. It binds $ to the value of the first form, then repeatedly sets $ to the value of successive forms, rewritten to interpolate a $ in the first argument position if there isn’t one in the form already (as reported by find-$). If any of the forms is a symbol, it gets listified and subsequently re-walked. Thus

(macroexpand-1 '(-> "THREE" string-downcase (char 0)))
; => (LET (($ "THREE"))
;      (SETQ $ (STRING-DOWNCASE $))
;      (SETQ $ (CHAR $ 0))),
;    T

So far, so good. Now, what could we do with a code-walker that we can’t without? Well, the above implementation of -> supports chaining simple function calls, so one answer is “chaining things that aren’t just function calls”. Another refinement is to support eliding the insertion of $ when there are any uses of $ in the form, not just as a bare argument. Looking at the second one first, since it’s less controversial:

(defmacro -> (form &body body)
  (labels ((find-$ (form env)
             (sb-walker:walk-form form env
              (lambda (f c e)
                (cond
                  ((and (eql f '$) (eql c :eval))
                   (return-from find-$ t))
                  (t f))))
             nil)
           (walker (form context env)
             (cond
               ((symbolp form) (list form))
               ((atom form) form)
               (t (if (find-$ form env)
                      (values `(setq $ ,form) t)
                      (values `(setq $ ,(list* (car form) '$ (cdr form))) t))))))
    `(let (($ ,form))
       ,@(mapcar (lambda (f) (sb-walker:walk-form f nil #'walker)) body))))

The only thing that’s changed here is the definition of find-$, and in fact it’s a little simpler: the task is now to walk the entire form and find uses of $ in an evaluated position, no matter how deep in the evaluation. Because this is a code-walker, this will correctly handle macros, backquotes, quoted symbols, and so on, and this allows code of the form

(macroexpand-1 '(-> "THREE" string-downcase (char 0) char-code (complex (1+ $) (1- $))))
; => (LET (($ "THREE"))
;      (SETQ $ (STRING-DOWNCASE $))
;      (SETQ $ (CHAR-CODE $))
;      (SETQ $ (COMPLEX (1+ $) (1- $)))),
;    T

which, as far as I can tell, is not supported in magrittr: doing 3 %>% complex(.+1,.-1) is met with the error that “object '.' not found”. Supporting this might, of course, not be a good idea, but at least the code walker shows that it’s possible.

What if we wanted to augment -> to handle binding forms, or special forms in general? This is probably beyond the call of duty, but let’s just briefly imagine that we wanted to be able to support binding special variables around the individual calls in the chain; for example, we want

(-> 3 (let ((*random-state* (make-random-state))) rnorm) mean)

to expand to

(let (($ 3))
  (setq $ (let ((*random-state* (make-random-state))) (rnorm $)))
  (setq $ (mean $)))

and let us also say, to make it interesting, that uses of $ in the bindings clauses of the let should not count against inhibiting the insertion of $ in the first argument position of the first form in the body of the let, so

(-> 3 (let ((y (1+ $))) (atan y)))

should expand to

(let (($ 3)) (setq $ (let ((y (1+ $))) (atan $ y))))

So our code walker needs to walk the bindings of the let, merely collecting information into the walker’s lexical environment, then walk the body performing the same rewrite as before. CHALLENGE ACCEPTED:

(defmacro -> (&body forms)
  (let ((rewrite t))
    (declare (special rewrite))
    (labels ((find-$ (form env)
               (sb-walker:walk-form form env
                (lambda (f c e)
                  (cond
                    ((and (eql f '$) (eql c :eval))
                     (return-from find-$ t))
                    (t f))))
               nil)
             (walker (form context env)
               (declare (ignore context))
               (typecase form
                 (symbol (if rewrite (list form) form))
                 (atom form)
                 ((cons (member with-rewriting without-rewriting))
                  (let ((rewrite (eql (car form) 'with-rewriting)))
                    (declare (special rewrite))
                    (values (sb-walker:walk-form (cadr form) env #'walker) t)))
                 ((cons (member let let*))
                  (unless rewrite
                    (return-from walker form))
                  (let* ((body (member 'declare (cddr form)
                                       :key (lambda (x) (when (consp x) (car x))) :test-not #'eql))
                         (declares (ldiff (cddr form) body))
                         (rewritten (sb-walker:walk-form
                                     `(without-rewriting
                                          (,(car form) ,(cadr form)
                                            ,@declares
                                            (with-rewriting
                                                ,@body)))
                                     env #'walker)))
                    (values rewritten t)))
                 (t
                  (unless rewrite
                    (return-from walker form))
                  (if (find-$ form env)
                      (values `(setq $ ,form) t)
                      (values `(setq $ ,(list* (car form) '$ (cdr form))) t))))))
      `(let (($ ,(car forms)))
         ,@(mapcar (lambda (f) (sb-walker:walk-form f nil #'walker)) (cdr forms))))))

Here, find-$ is unchanged from the previous version; all the new functionality is in walker. How does it work? The default branch of the walker function is also unchanged; what has changed is handling of let and let* forms. The main trick is to communicate information between successive calls to the walker function, and turn the rewriting on and off appropriately: we wrap parts of the form in new pseudo-special operators with-rewriting and without-rewriting, which is basically a tacky and restricted implementation of compiler-let – if we needed to, we could do a proper one with macrolet. Within the scope of a without-rewriting, walker doesn’t do anything special, but merely return the form it was given, except if the form it’s given is a with-rewriting form. This is a nice illustration, incidentally, of the idea that lexical scope in the code translates nicely to dynamic scope in the compiler; I can’t remember where I read that first (but it’s certainly not a new idea).

And now

(macroexpand '(-> 3 (let ((*random-state* (make-random-state))) rnorm) mean))
; => (LET (($ 3))
;      (LET ((*RANDOM-STATE* (MAKE-RANDOM-STATE)))
;        (SETQ $ (RNORM $)))
;      (SETQ $ (MEAN $))),
;    T
(macroexpand '(-> 3 (let ((y (1+ $))) (atan y))))
; => (LET (($ 3))
;      (LET ((Y (1+ $)))
;        (SETQ $ (ATAN $ Y)))),
;    T

Just to be clear: this post isn’t advocating a smarter pipe operator; I don’t have a clear enough view, but I doubt that the benefits of the smartness outweigh the complexity. It is demonstrating what can be done, in a reasonably controlled way, using a code-walker: ascribing semantics to fragments of Common Lisp code, and combining those fragments in a particular way, and of course it’s another example of sb-walker:walk in use.

Finally, if something like this does in fact get used, people sometimes get tripped up by the package system: the special bits of syntax are symbols, and importing or package-qualifying -> without doing the corresponding thing to $ would lead to cryptic errors, wrong results and/or confusion. One possibility to handle that is to invent a bit more reader syntax:

(set-macro-character #\¦
 (defun pipe-reader (stream char)
   (let ((*readtable* (copy-readtable)))
     (set-macro-character #\·
      (lambda (stream char)
        (declare (ignore stream char))
        '$) t)
   (cons '-> (read-delimited-list char stream t)))) nil)
¦"THREE" string-downcase (find-if #'alpha-char-p ·) char-code¦

If this is the exported syntax, it has the advantage that the interface can only be misused intentionally: the actual macro and its anaphoric symbol are both hidden from the programmer; and the syntax is reasonably easy to type – on my keyboard ¦ is AltGr+| and · is AltGr+. – and moderately mnemonic from shell pipes and function notation respectively. It also has all the usual disadvantages of reader-based interfaces, such as composability, somewhat mitigated if pipe-reader is part of the macro’s exported interface.