london employment visualization part 2


Previously, I did all the hard work to obtain and transform some data related to London, including borough and MSOA shapes, population counts, and employment figures, and used them to generate some subjectively pretty pictures. I promised a followup on the gridSVG approach to generating visualizations with more potential for interactivity than a simple picture; this is the beginning of that.

Having done all the heavy lifting in the last post, including being able to generate ggplot objects (whose printing results in the pictures), it is relatively simple to wrap SVG output rather than PNG output around it all. In fact it is extremely simple to output to SVG: just use an SVG output device

svg("/tmp/london.svg", width=16, height=10)

rather than a PNG one

png("/tmp/london.png", width=1536, height=960)

(which brings back for me memories of McCLIM, and my implementation of an SVG backend, about a decade ago). So what does that look like? Well, if you’ve entered those forms at the R REPL, close the PNG device

dev.off()

and then (the currently active device being the SVG one)

print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()

That produces an SVG file, and if SVG in and of itself is the goal, that’s great. But I would expect that the main reason for producing SVG isn’t so much for the format itself (though it is nice that it is a vector image format rather than rasterized, so that zooming in principle doesn’t cause artifacts) but for the ability to add scripting to it: and since the output SVG doesn’t retain any information about the underlying data that was used to generate it, it is very difficult to do anything meaningful with it.

I write “very difficult” rather than “impossible”, because in fact the SVGAnnotation package aimed to do just that: specifically, read the SVG output produced by the R SVG output device, and (with a bit of user assistance and a liberal sprinkling of heuristics) attempt to identify the regions of the plot corresponding to particular slices of datasets. Then, using a standard XML library, the user could decorate the SVG with extra information, add links or scripts, and essentially do whatever they needed to do; this was all wrapped up in an svgPlot function. The problem with this approach is that it is fragile: for example, one heuristic used to identify a lattice plot area was that there should be no text in it, which fails for custom panel functions with labelled guidelines. It is possible to override the default heuristic, but it’s difficult to build a robust system this way (and in fact when I tried to run some two-year-old analysis routines recently, the custom SVG annotation that I wrote broke into multiple pieces given new data).

gridSVG’s approach is a little bit different. Instead of writing SVG out and reading it back in, it relies on the grid graphics engine (so does not work with so-called base graphics, the default graphics system in R), and on manipulating the grid object which represents the current scene. The gridsvg pseudo-graphics-device does the behind-the-scenes rendering for us, with some cost related to yet more wacky interactions with R’s argument evaluation semantics which we will pay later.

library(gridSVG)  # provides gridsvg(), grid.garnish(), grid.script() and grid.export()
gridsvg("/tmp/gridsvg-london.svg", width=16, height=10)
print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()

Because ggplot uses grid graphics, this just works, and generates a much more structured SVG file, which should render identically to the previous one:

If it renders identically, why bother? Well, because now we have something that writes out the current grid scene, we can alter that scene before writing out the document (at dev.off() time). For example, we might want to add tooltips to the MSOAs so that their name and the quantity value can be read off by a human. Wrapping it all up into a function, we get

gridsvg.london <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {

We need to compute the subset in this function, even though we’re going to be using the full dataset in ggplot.london when we call it, in order to get the values and zone labels.

    london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))

Then we need to map (pun mostly intended) the values in the fortified data frame to the polygons drawn; without delving into the format, my intuition is that the fortified data frame contains vertex information, whereas the grid (and hence SVG) data is organized by polygons, and there may be more than one polygon for a region (for example if there are islands in the Thames). Here we simply generate an index from a group identifier to the first row in the dataframe in that group, and use it to pull out the appropriate value and label.

    is <- match(levels(london.data$group), london.data$group)
    vals <- eval(substitute(expr), london.data)[is]
    labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]

Then we pay the cost of the argument evaluation semantics. My first try at this line was gridsvg(filename, width=16, height=10), which I would have (perhaps naïvely) expected to work, but which in fact gave me an odd error suggesting that the environment filename was being evaluated in was the wrong one. Calling gridsvg like this forces evaluation of filename before the call, so there should be less that can go wrong.

    do.call(gridsvg, list(filename, width=16, height=10))

And, as before, we have to do substitutions rather than evaluations to get the argument expressions evaluated in the right place:

    print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))

Now comes the payoff. At this point, we have a grid scene, which we can investigate using grid.ls(). Doing so suggests that the map data is in a grid object named like GRID.polygon followed by an integer, presumably in an attempt to make names unique. We can “garnish” that object with attributes that we want: some javascript callbacks, and the values and labels that we previously calculated.

    grid.garnish("GRID.polygon.*",
                 onmouseover=rep("showTooltip(evt)", length(is)),
                 onmouseout=rep("hideTooltip()", length(is)),
                 zonelabel=labels, value=vals,
                 group=FALSE, grep=TRUE)

We need also to provide implementations of those callbacks. It is possible to do that inline, but for simplicity here we simply link to an external resource.

    grid.script(filename="tooltip.js")

Then close the gridsvg device, and we’re done!

    dev.off()
}
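
The non-graphical half of that function (the group-to-first-row index, and the substitute/do.call dance) can be tried in isolation. What follows is a minimal sketch, not the real thing: the data frame toy stands in for london$msoa.fortified, and toy.plot stands in for ggplot.london; both are made up for illustration, as are toy.naive and toy.values.

```r
# Hypothetical stand-in for london$msoa.fortified: one row per polygon vertex,
# with region "Alpha" drawn as two polygons (groups a.1 and a.2).
toy <- data.frame(
    group     = factor(c("a.1", "a.1", "a.1", "a.2", "a.2", "b.1", "b.1", "b.1")),
    zonelabel = factor(c("Alpha", "Alpha", "Alpha", "Alpha", "Alpha",
                         "Beta", "Beta", "Beta")),
    fulltime  = c(10, 10, 10, 10, 10, 30, 30, 30),
    allages   = c(40, 40, 40, 40, 40, 60, 60, 60)
)

# Hypothetical stand-in for ggplot.london: like subset(), it captures its
# argument unevaluated and evaluates it with the data frame's columns in scope.
toy.plot <- function(expr) eval(substitute(expr), toy, parent.frame())

# Passing the argument straight through does not work: toy.plot sees the
# symbol `expr`, whose promise is then forced in the caller's environment,
# where the column names are not defined.
toy.naive <- function(expr) toy.plot(expr)
naive <- tryCatch(toy.naive(fulltime / allages), error = function(e) e)

toy.values <- function(expr) {
    # index of the first row of each group: one entry per polygon
    is <- match(levels(toy$group), toy$group)
    vals <- eval(substitute(expr), toy)[is]
    labels <- levels(toy$zonelabel)[toy$zonelabel[is]]
    # forward the unevaluated expression, as gridsvg.london does
    pervertex <- do.call(toy.plot, list(substitute(expr)))
    list(is = is, vals = vals, labels = labels, pervertex = pervertex)
}

res <- toy.values(fulltime / allages)
print(res$is)
print(res$labels)
print(inherits(naive, "error"))
```

The point of the index is that there are three polygons but only two regions, so "Alpha" has to show up twice in the labels handed to grid.garnish.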

Then gridsvg.london(fulltime/(allages-younger-older)) produces:

which is some kind of improvement over a static image for data of this complexity.

And yet... the perfectionist in me is not quite satisfied. At issue is a minor graphical glitch, but it’s enough to make me not quite content; the border of each MSOA is stroked in a slightly lighter colour than the fill colour, but that stroke extends beyond the border of the MSOA region (the stroke’s centre is along the polygon edge). This means that the strokes from adjacent MSOAs overlie each other, so that the most recently drawn obliterates any drawn previously. This also causes some odd artifacts around the edges of London (and into the Thames, and pretty much obscures the river Lea).

This can be fixed by clipping; I think the trick to clip a path to itself counts as well-known. But clipping in SVG is slightly hard, and the gridSVG facilities for doing it work on a grob-by-grob basis, while the map is all one big polygon grid object. So to get the output I want, I am going to have to perform surgery on the SVG document itself after all; we are still in a better position than before, because we will start with a sensible hierarchical arrangement of graphical objects in the SVG XML structure, and gridSVG furthermore provides some introspective capabilities to give XML ids or XPath query strings for particular grobs.

grid.export exports the current grid scene to SVG, returning a list with the SVG XML itself along with this mapping information. We have in the SVG output an arbitrary number of polygon objects; our task is to arrange that each of those polygons has a clip mask which is itself. In order to do that, we need for each polygon a clipPath entry with a unique id in a defs section somewhere, where each clipPath contains a use pointing to the original polygon’s ID; then each polygon needs to have a clip-path style property pointing to the corresponding clipPath object. Clear?
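
If not, it may help to see the target markup for a tiny case. This is a sketch in base R that just pastes strings together; the polygon ids, coordinates and colours are invented for illustration (gridSVG generates its own ids), and the real work below is done on the XML tree rather than on text.

```r
# Hypothetical ids and coordinates for two adjacent square "regions"
ids <- c("poly.1", "poly.2")
pts <- c("0,0 50,0 50,50 0,50", "50,0 100,0 100,50 50,50")
n <- seq_along(ids)

# one clipPath per polygon, each containing a use that refers back to it
clip_defs <- sprintf('<clipPath id="clipPath%d"><use xlink:href="#%s"/></clipPath>',
                     n, ids)

# each polygon is clipped by "its own" clipPath, so only the half of the
# stroke that lies inside the polygon is painted
polygons <- sprintf(
    '<polygon id="%s" points="%s" fill="steelblue" stroke="white" stroke-width="4" style="clip-path: url(#clipPath%d)"/>',
    ids, pts, n)

svg <- c('<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 100 50">',
         '  <defs>',
         paste0('    ', clip_defs),
         '  </defs>',
         paste0('  ', polygons),
         '</svg>')
writeLines(svg)
```

Writing that output to a file and opening it in a browser should show the two strokes meeting at the shared edge rather than one overlying the other.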

library(XML)  # for getNodeSet, addAttributes, newXMLNode, addChildren, xmlAttrs, saveXML
addClipPaths <- function(gridsvg, id) {

Given the return value of grid.export and the identifier of the map grob, we want to get the set of XML nodes corresponding to the polygons within that grob.

    ns <- getNodeSet(gridsvg$svg, sprintf("%s/*", gridsvg$mappings$grobs[[id]]$xpath))

Then for each of those nodes, we want to set a clip path.

    # seq_along() copes with an empty node set; 1:length(ns) would loop over c(1, 0)
    for (i in seq_along(ns)) {
        addAttributes(ns[[i]], style=sprintf("clip-path: url(#clipPath%s)", i))
    }

For each of those nodes, we also need to define a clip path

    clippaths <- list()
    for (i in seq_along(ns)) {
        clippaths[[i]] <- newXMLNode("clipPath", attrs=c(id=sprintf("clipPath%s", i)))
        use <- newXMLNode("use", attrs = c("xlink:href"=sprintf("#%s", xmlAttrs(ns[[i]])[["id"]])))
        addChildren(clippaths[[i]], kids=list(use))
    }

And hook it into the existing XML

    defs <- newXMLNode("defs")
    addChildren(defs, kids=clippaths)
    top <- getNodeSet(gridsvg$svg, "//*[@id='gridSVG']")[[1]]
    addChildren(top, kids=list(defs))
}

Then our driver function needs some slight modifications:

gridsvg.london2 <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {
    london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
    is <- match(levels(london.data$group), london.data$group)
    vals <- eval(substitute(expr), london.data)[is]
    labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]

Until here, everything is the same, but we can’t use the gridsvg pseudo-graphics device any more, so we need to do graphics device handling ourselves:

    pdf(width=16, height=10)
    print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))
    grid.garnish("GRID.polygon.*",
                 onmouseover=rep("showTooltip(evt)", length(is)),
                 onmouseout=rep("hideTooltip()", length(is)),
                 zonelabel=labels, value=vals,
                 group=FALSE, grep=TRUE)
    grid.script(filename="tooltip.js")

Now we export the scene to SVG,

    gridsvg <- grid.export()

find the grob containing all the map polygons,

    grobnames <- grid.ls(flatten=TRUE, print=FALSE)$name
    grobid <- grobnames[[grep("GRID.polygon", grobnames)[1]]]

add the clip paths,

    addClipPaths(gridsvg, grobid)
    saveXML(gridsvg$svg, file=filename)

and we’re done!

    dev.off()
}
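
The grob lookup in that function leans on grid's habit of auto-naming grobs that aren't given a name. A minimal sketch of that behaviour, using only the grid package and a throwaway device; the triangle and the rectangle named "frame" are invented for illustration and have nothing to do with the map.

```r
library(grid)

pdf(NULL)  # throwaway device: nothing is written to disk
grid.newpage()
# no name given, so grid generates one of the form GRID.polygon.<integer>
grid.polygon(x = c(0.1, 0.9, 0.5), y = c(0.1, 0.1, 0.9))
# an explicitly named grob keeps the name it was given
grid.rect(name = "frame")

# same listing and lookup as in gridsvg.london2
grobnames <- grid.ls(flatten = TRUE, print = FALSE)$name
polyname <- grobnames[[grep("GRID.polygon", grobnames)[1]]]
print(grobnames)
print(polyname)
invisible(dev.off())
```

The integer suffix depends on how many grobs have been auto-named so far in the session, which is why the lookup goes through grep rather than hard-coding a name.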

Then gridsvg.london2(fulltime/(allages-younger-older)) produces:

and I leave whether the graphical output is worth the effort to the beholder’s judgment.