london employment visualization part 2
http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/
2014-07-31T17:14:34Z
2014-07-31T17:07:34Z
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/">Previously</a>, I did all the hard
work to obtain and transform some data related to London, including
borough and MSOA shapes, population counts, and employment figures,
and used them to generate some subjectively pretty pictures. I
promised a followup on the
<a href="https://sjp.co.nz/projects/gridsvg/"><code>gridSVG</code></a> approach to
generating visualizations with more potential for interactivity than a
simple picture; this is the beginning of that.</p>
<p>Having done all the heavy lifting in the
<a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/">last post</a>, including being able to
generate <a href="http://ggplot2.org/"><code>ggplot</code></a> objects (whose printing
results in the pictures), it is relatively simple to wrap output to
SVG instead of output to PNG around it all. In fact it is <em>extremely</em>
simple to output to SVG: just use an SVG output device</p>
<pre><code>svg("/tmp/london.svg", width=16, height=10)
</code></pre>
<p>rather than a PNG one</p>
<pre><code>png("/tmp/london.png", width=1536, height=960)
</code></pre>
<p>(which brings back for me memories of
<a href="http://common-lisp.net/project/mcclim/">McCLIM</a>, and my
implementation of an SVG backend, about a decade ago). So what does
that look like? Well, if you’ve entered those forms at the R REPL,
close the png device</p>
<pre><code>dev.off()
</code></pre>
<p>and then (the currently active device being the SVG one)</p>
<pre><code>print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/default-svg-device.svg"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/default-svg-device.svg" width="256" alt="default (cairo) SVG device" class="img" /></a></p>
<p>That produces an SVG file, and if SVG in and of itself is the goal,
that’s great. But I would expect that the main reason for producing
SVG isn’t so much for the format itself (though it is nice that it is
a vector image format rather than rasterized, so that zooming in
principle doesn’t cause artifacts) but for the ability to add
scripting to it: and since the output SVG doesn’t retain any
information about the underlying data that was used to generate it, it
is very difficult to do anything meaningful with it.</p>
<p>I write “very difficult” rather than “impossible”, because in fact the
<a href="http://www.omegahat.org/SVGAnnotation/"><code>SVGAnnotation</code></a> package
aimed to do just that: specifically, read the SVG output produced by
the R SVG output device, and (with a bit of user assistance and a
liberal sprinkling of heuristics) attempt to identify the regions of
the plot corresponding to particular slices of datasets. Then, using
a standard XML library, the user could decorate the SVG with extra
information, add links or scripts, and essentially do whatever they
needed to do; this was all wrapped up in an <code>svgPlot</code> function. The
problem with this approach is that it is fragile: for example, one
heuristic used to identify a
<a href="http://lattice.r-forge.r-project.org/documentation.php"><code>lattice</code></a>
plot area was that there should be no text in it, which fails for
custom panel functions with labelled guidelines. It is possible to
override the default heuristic, but it’s difficult to build a robust
system this way (and in fact when I tried to run some two-year old
analysis routines recently, the custom SVG annotation that I wrote
broke into multiple pieces given new data).</p>
<p><code>gridSVG</code>’s approach is a little bit different. Instead of writing
SVG out and reading it back in, it relies on the <code>grid</code> graphics
engine (so does not work with so-called base graphics, the default
graphics system in R), and on manipulating the <code>grid</code> object which
represents the current scene. The <code>gridsvg</code> pseudo-graphics-device
does the behind-the-scenes rendering for us, with some cost related to
yet more wacky interactions with R’s argument evaluation semantics
which we will pay later.</p>
<pre><code>gridsvg("/tmp/gridsvg-london.svg", width=16, height=10)
print(ggplot.london(fulltime/(allages-younger-older)))
dev.off()
</code></pre>
<p>Because <code>ggplot</code> uses <code>grid</code> graphics, this just works, and generates
a much more structured SVG file, which should
render identically to the previous one:</p>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/gridsvg-device.svg"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/gridsvg-device.svg" width="256" alt="SVG from gridSVG device" class="img" /></a></p>
<p>If it renders identically, why bother? Well, because now we have
something that writes out the current <code>grid</code> scene, we can alter that
scene before writing out the document (at <code>dev.off()</code> time). For
example, we might want to add tooltips to the MSOAs so that their name
and the quantity value can be read off by a human. Wrapping it all up
into a function, we get</p>
<pre><code>gridsvg.london <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {
</code></pre>
<p>We need to compute the subset in this function, even though we’re
going to be using the full dataset in <code>ggplot.london</code> when we call it,
in order to get the values and zone labels.</p>
<pre><code> london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
</code></pre>
<p>Then we need to map (pun mostly intended) the values in the fortified
data frame to the polygons drawn; without delving into the format, my
intuition is that the fortified data frame contains vertex
information, whereas the grid (and hence SVG) data is organized by
polygons, and there may be more than one polygon for a region (for
example if there are islands in the Thames). Here we simply generate
an index from a group identifier to the first row in the dataframe in
that group, and use it to pull out the appropriate value and label.</p>
<pre><code> is <- match(levels(london.data$group), london.data$group)
vals <- eval(substitute(expr), london.data)[is]
labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]
</code></pre>
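<p>The indexing trick above may be easier to see with toy data (purely illustrative, not part of the post's pipeline): <code>match</code> against the factor levels returns the index of the first row in each group.</p>
<pre><code>g <- factor(c("a", "a", "b", "b", "b"))
match(levels(g), g)   # 1 3: the first occurrence of each level
</code></pre>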
<p>Then we pay the cost of the argument evaluation semantics. My first
try at this line was <code>gridsvg(filename, width=16, height=10)</code>, which I
would have (perhaps naïvely) expected to work, but which in fact gave
me an odd error suggesting that the environment <code>filename</code> was being
evaluated in was the wrong one. Calling <code>gridsvg</code> like this forces
evaluation of <code>filename</code> before the call, so there should be less that
can go wrong.</p>
<pre><code> do.call(gridsvg, list(filename, width=16, height=10))
</code></pre>
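<p>A toy illustration of the difference (not part of the post's code): a function that inspects its argument expression sees the bare symbol when called directly, but sees the already-evaluated value when invoked through <code>do.call</code>.</p>
<pre><code>f <- function(x) substitute(x)
filename <- "/tmp/london.svg"
f(filename)                  # the symbol `filename`, still unevaluated
do.call(f, list(filename))   # "/tmp/london.svg": evaluated before the call
</code></pre>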
<p>And, as before, we have to do substitutions rather than evaluations to
get the argument expressions evaluated in the right place:</p>
<pre><code> print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))
</code></pre>
<p>Now comes the payoff. At this point, we have a <code>grid</code> scene, which we
can investigate using <code>grid.ls()</code>. Doing so suggests that the map
data is in a grid object named like <code>GRID.polygon</code> followed by an
integer, presumably in an attempt to make names unique. We can
“garnish” that object with attributes that we want: some javascript
callbacks, and the values and labels that we previously calculated.</p>
<pre><code> grid.garnish("GRID.polygon.*",
onmouseover=rep("showTooltip(evt)", length(is)),
onmouseout=rep("hideTooltip()", length(is)),
zonelabel=labels, value=vals,
group=FALSE, grep=TRUE)
</code></pre>
<p>We need also to provide implementations of those callbacks. It is
possible to do that inline, but for simplicity here we simply link to
an external resource.</p>
<pre><code> grid.script(filename="tooltip.js")
</code></pre>
<p>Then close the <code>gridsvg</code> device, and we’re done!</p>
<pre><code> dev.off()
}
</code></pre>
<p>Then <code>gridsvg.london(fulltime/(allages-younger-older))</code> produces:</p>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/london-full-time.svg"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/london-full-time.svg" width="256" alt="proportion employed full-time" class="img" /></a></p>
<p>which is some kind of improvement over a static image for data of this
complexity.</p>
<p>And yet... the perfectionist in me is not quite satisfied. At issue
is a minor graphical glitch, but it’s enough to make me not quite
content; the border of each MSOA is stroked in a slightly lighter
colour than the fill colour, but that stroke extends beyond the border
of the MSOA region (the stroke’s centre is along the polygon edge).
This means that the strokes from adjacent MSOAs overlie each other, so
that the most recently drawn obliterates any drawn previously. This
also causes some odd artifacts around the edges of London (and into
the Thames), and pretty much obscures the river Lea.</p>
<p>This can be fixed by clipping; I think the trick to clip a path to
itself counts as well-known. But clipping in SVG is slightly hard,
and the gridSVG facilities for doing it work on a grob-by-grob basis,
while the map is all one big polygon grid object. So to get the
output I want, I am going to have to perform surgery on the SVG
document itself after all; we are still in a better position than
before, because we will start with a sensible hierarchical arrangement
of graphical objects in the SVG XML structure, and <code>gridSVG</code>
furthermore provides some introspective capabilities to give XML ids
or XPath query strings for particular grobs.</p>
<p><code>grid.export</code> exports the current grid scene to SVG, returning a list
with the SVG XML itself along with this mapping information. We have
in the SVG output an arbitrary number of <code>polygon</code> objects; our task
is to arrange such that each of those polygons has a clip path which
is itself. In order to do that, we need for each polygon a <code>clipPath</code>
entry with a unique <code>id</code> in a <code>defs</code> section somewhere, where each
<code>clipPath</code> contains a <code>use</code> pointing to the original polygon’s ID;
then each polygon needs to have a <code>clip-path</code> style property pointing
to the corresponding <code>clipPath</code> object. Clear?</p>
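<p>Concretely, the structure we are aiming for looks something like this (with made-up ids; the real ones are generated by <code>gridSVG</code>):</p>
<pre><code><defs>
  <clipPath id="clipPath1">
    <use xlink:href="#polygon1" />
  </clipPath>
</defs>
<polygon id="polygon1" style="clip-path: url(#clipPath1)" points="..." />
</code></pre>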
<pre><code>addClipPaths <- function(gridsvg, id) {
</code></pre>
<p>given the return value of <code>grid.export</code> and the identifier of the map
grob, we want to get the set of XML nodes corresponding to the
polygons within that grob.</p>
<pre><code> ns <- getNodeSet(gridsvg$svg, sprintf("%s/*", gridsvg$mappings$grobs[[id]]$xpath))
</code></pre>
<p>Then for each of those nodes, we want to set a clip path.</p>
<pre><code> for (i in 1:length(ns)) {
addAttributes(ns[[i]], style=sprintf("clip-path: url(#clipPath%s)", i))
}
</code></pre>
<p>For each of those nodes, we also need to define a clip path</p>
<pre><code> clippaths <- list()
for (i in 1:length(ns)) {
clippaths[[i]] <- newXMLNode("clipPath", attrs=c(id=sprintf("clipPath%s", i)))
use <- newXMLNode("use", attrs = c("xlink:href"=sprintf("#%s", xmlAttrs(ns[[i]])[["id"]])))
addChildren(clippaths[[i]], kids=list(use))
}
</code></pre>
<p>And hook it into the existing XML</p>
<pre><code> defs <- newXMLNode("defs")
addChildren(defs, kids=clippaths)
top <- getNodeSet(gridsvg$svg, "//*[@id='gridSVG']")[[1]]
addChildren(top, kids=list(defs))
}
</code></pre>
<p>Then our driver function needs some slight modifications:</p>
<pre><code>gridsvg.london2 <- function(expr, subsetexpr=TRUE, filename="/tmp/london.svg") {
london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
is <- match(levels(london.data$group), london.data$group)
vals <- eval(substitute(expr), london.data)[is]
labels <- levels(london.data$zonelabel)[london.data$zonelabel[is]]
</code></pre>
<p>Until here, everything is the same, but we can’t use the <code>gridsvg</code>
pseudo-graphics device any more, so we need to do graphics device
handling ourselves:</p>
<pre><code> pdf(width=16, height=10)
print(do.call(ggplot.london, list(substitute(expr), substitute(subsetexpr))))
grid.garnish("GRID.polygon.*",
onmouseover=rep("showTooltip(evt)", length(is)),
onmouseout=rep("hideTooltip()", length(is)),
zonelabel=labels, value=vals,
group=FALSE, grep=TRUE)
grid.script(filename="tooltip.js")
</code></pre>
<p>Now we export the scene to SVG,</p>
<pre><code> gridsvg <- grid.export()
</code></pre>
<p>find the grob containing all the map polygons,</p>
<pre><code> grobnames <- grid.ls(flatten=TRUE, print=FALSE)$name
grobid <- grobnames[[grep("GRID.polygon", grobnames)[1]]]
</code></pre>
<p>add the clip paths,</p>
<pre><code> addClipPaths(gridsvg, grobid)
saveXML(gridsvg$svg, file=filename)
</code></pre>
<p>and we’re done!</p>
<pre><code> dev.off()
}
</code></pre>
<p>Then <code>gridsvg.london2(fulltime/(allages-younger-older))</code> produces:</p>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/london-full-time-clipping.svg"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization_part_2/london-full-time-clipping.svg" width="256" alt="proportion employed full-time (with polygon clipping)" class="img" /></a></p>
<p>and I leave whether the graphical output is worth the effort to the
beholder’s judgment.</p>
<p>As before, these images contain National Statistics and Ordnance
Survey data © Crown copyright and database right 2012.</p>
london employment visualization
http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/
2014-07-19T20:18:42Z
2014-07-19T20:12:00Z
<p>I <a href="http://christophe.rhodes.io/notes/blog/posts/2014/londonr_june/">went to londonR</a> a month or so ago, and the talk
whose content stayed with me most was the one given by Simon Hailstone
on map-making and data visualisation with
<a href="http://ggplot2.org/">ggplot2</a>. The take-home message for me from
that talk was that it was likely to be fairly straightforward to come
up with interesting visualisations and data mashups, if you already
know what you are doing.</p>
<p>Not unrelatedly, in the Goldsmiths Data Science MSc programme which we
are running from October, it is quite likely that I will be teaching a
module in Open Data, where the students will be expected to produce
visualisations and mashups of publicly-accessible open data sets.</p>
<p>So, it is then important for me to know what I am doing, in order that
I can help the students to learn to know what they are doing. My
first, gentle steps are to look at life in London; motivated at least
partly by the expectation that I encounter when dropping off or
picking up children from school that I need more things to do with my
time (invitations to come and be trained in IT at a children's centre
at 11:00 are particularly misdirected, I feel) I was interested to try
to find out how local employment varies. This blog serves at least
partly as a stream-of-exploration of not just the data but also
<code>ggplot</code> and <code>gridsvg</code>; as a (mostly satisfied) user of <code>lattice</code> and
<code>SVGAnnotation</code> in a <a href="http://www.teclo.net/">previous life</a>, I'm
interested to see how the R graphing tools have evolved.</p>
<p>Onwards! Using R 3.1 or later: first, we load in the libraries we
will need.</p>
<pre><code>library(ggplot2)
library(extremevalues)
library(maps)
library(rgeos)
library(maptools)
library(munsell)
library(gridSVG)
</code></pre>
<p>Then, let's define a couple of placeholder lists to accumulate related
data structures.</p>
<pre><code>london <- list()
tmp <- list()
</code></pre>
<p>Firstly, let's get hold of some data. I want to look at employment;
that means getting hold of counts of employed people, broken down by
area, and also counts of people living in that area in the first
place. I found some Excel spreadsheets published by London datastore
and the Office for National Statistics, with data from the 2011
census, and extracted from them some useful data in CSV format. (See
the bottom of this post for more detail about the data sources).</p>
<pre><code>london$labour <- read.csv("2011-labour.csv", header=TRUE, skip=1)
london$population <- read.csv("mid-2011-lsoa-quinary-estimates.csv", header=TRUE, skip=3)
</code></pre>
<p>The labour information here does include working population counts by
area – specifically by
<a href="http://en.wikipedia.org/wiki/ONS_coding_system#Neighbourhood_Statistics_Geography">Middle Layer Super Output Area</a>
(MSOA), a unit of area holding roughly eight thousand people. Without
going into too much detail, these units of area are formed by
aggregating Lower Layer Super Output Areas (LSOAs), which themselves
are formed by aggregating census Output Areas. In order to get the
level of granularity we want, we have to do some of that aggregation
ourselves.</p>
<p>The labour data is published in one sheet by borough (administrative
area), ward (local electoral area) and LSOA. The zone IDs for these
areas begin with E09, E05 and E01; we need to manipulate the data frame we
have to include just the LSOA entries</p>
<pre><code>london$labour.lsoa <- droplevels(subset(london$labour, grepl("^E01", ZONEID)))
</code></pre>
<p>and then aggregate LSOAs into MSOAs “by hand”, by using the fact that
LSOAs are named such that the name of their corresponding MSOA is all
but the last letter of the LSOA name (e.g. “Newham 001A” and “Newham
001B” LSOAs both have “Newham 001” as their parent MSOA). In doing
the aggregation, all the numeric columns are count data, so the
correct aggregation is to sum corresponding entries of numerical data
(and discard non-numeric ones)</p>
<pre><code>tmp$nonNumericNames <- c("DISTLABEL", "X", "X.1", "ZONEID", "ZONELABEL")
tmp$msoaNames <- sub(".$", "", london$labour.lsoa$ZONELABEL)
tmp$data <- london$labour.lsoa[,!names(london$labour.lsoa) %in% tmp$nonNumericNames]
london$labour.msoa <- aggregate(tmp$data, list(tmp$msoaNames), sum)
tmp <- list()
</code></pre>
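<p>(The truncation of LSOA names to MSOA names is just a matter of dropping the final character; an illustrative check, using the example names from above:)</p>
<pre><code>sub(".$", "", c("Newham 001A", "Newham 001B"))   # "Newham 001" "Newham 001"
</code></pre>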
<p>The population data is also at LSOA granularity; it is in fact for all
LSOAs in England and Wales, rather than just London, and includes some
aggregate data at a higher level. We can restrict our attention to
London LSOAs by reusing the zone IDs from the labour data, and then
performing the same aggregation of LSOAs into MSOAs.</p>
<pre><code>london$population.lsoa <- droplevels(subset(london$population, Area.Codes %in% levels(london$labour.lsoa$ZONEID)))
tmp$nonNumericNames <- c("Area.Codes", "Area.Names", "X")
tmp$msoaNames <- sub(".$", "", london$population.lsoa$X)
tmp$data <- london$population.lsoa[,!(names(london$population.lsoa) %in% tmp$nonNumericNames)]
london$population.msoa <- aggregate(tmp$data, list(tmp$msoaNames), sum)
tmp <- list()
</code></pre>
<p>This is enough data to be getting on with; it only remains to join the
data frames up. The MSOA name column is common between the two
tables; we can therefore reorder one of them so that it lines up with
the other before simply binding columns together:</p>
<pre><code>london$employment.msoa <- cbind(london$labour.msoa, london$population.msoa[match(london$labour.msoa$Group.1, london$population.msoa$Group.1)])
</code></pre>
<p>(In practice in this case, the order of the two datasets is the same,
so the careful reordering is in fact the identity transformation.)</p>
<pre><code>tmp$notequals <- london$population.msoa$Group.1 != london$labour.msoa$Group.1
stopifnot(!sum(tmp$notequals))
tmp <- list()
</code></pre>
<p>Now, finally, we are in a position to look at some of the data. The
population data contains a count of all people in the MSOA, along with
counts broken down into 5-year age bands (so 0-5 years old, 5-10, and
so on, up to 90+). We can take a look at the proportion of each
MSOA's population in each age band:</p>
<pre><code>ggplot(reshape(london$population.msoa, varying=list(3:21),
direction="long", idvar="Group.1")) +
geom_line(aes(x=5*time-2.5, y=X0.4/All.Ages, group=Group.1),
alpha=0.03, show_guide=FALSE) +
scale_x_continuous(name="age / years") +
scale_y_continuous(name="fraction") +
theme_bw()
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/age-distribution.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-age-distribution.png" width="256" height="192" alt="age distributions in London MSOAs" class="img" /></a></p>
<p>This isn't a perfect visualisation, but it does show the broad overall
shape of the London population, along with some interesting deviations
(some MSOAs having a substantial peak in population in the 20-25 year
band, while rather more peak in the 25-30 band; a concentration around
7% of 0-5 year olds, with some significant outliers; and a tail of
older people with some interesting high outliers in the 45-50 and
60-65 year bands, as well as one outlier in the 75-80 and 80-85).</p>
<p>That was just population; what about the employment statistics?
Here's a visualisation of the fraction of people in full-time
employment, treated as a distribution</p>
<pre><code>ggplot(london$employment.msoa, aes(x=Full.time/All.Ages)) +
geom_density() + theme_bw()
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/employment-distribution.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-employment-distribution.png" width="256" height="192" alt="fraction in full-time employment in London MSOAs" class="img" /></a></p>
<p>showing an overall concentration of full-time employment levels around
35%. But this of course includes in the population estimate people of
non-working ages; it may be more relevant to view the distribution of
full-time employment as a fraction of those of “working age”, which
here we will define as those between 15 and 65 (not quite right but
close enough):</p>
<pre><code>london$employment.msoa$younger = rowSums(london$employment.msoa[,62:64])
london$employment.msoa$older = rowSums(london$employment.msoa[,75:80])
ggplot(london$employment.msoa, aes(x=Full.time/(All.Ages-younger-older))) +
geom_density() + theme_bw()
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/employment-working-age.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-employment-working-age.png" width="256" height="192" alt="fraction of those of working age in full-time employment in London MSOAs" class="img" /></a></p>
<p>That's quite a different picture: we've lost the bump of outliers of
high employment around 60%, and instead there's substantial structure
at lower employment rates (plateaus in the density function around 50%
and 35% of the working age population). What's going on? The
proportions of people not of working age must not be evenly
distributed between MSOAs; what does it look like?</p>
<pre><code>ggplot(london$employment.msoa, aes(x=older/All.Ages, y=younger/All.Ages)) +
geom_point(alpha=0.1) + geom_density2d() + coord_equal() + theme_bw()
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/younger-vs-older.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-younger-vs-older.png" width="256" height="192" alt="distribution of younger vs older people in London MSOAs" class="img" /></a></p>
<p>So here we see that a high proportion of older people in an MSOA
(around 20%) is a strong predictor of the proportion of younger people
(around 17%); similarly, if the younger population is 25% or so of the
total population, there are likely to be fewer than 10% of people of
older non-working ages. But at intermediate values of either older or
younger proportions, 10% of older or 15% of younger, the observed
variation of the other proportion is very high: what this says is that
the proportion of people of working age in an MSOA varies quite
strongly, and for a given proportion of people of working age, some
MSOAs have more older people and some more younger people. Not a
world-shattering discovery, but important to note.</p>
<p>So, going back to employment: what are the interesting cases? One way
of trying to characterize them is to fit the central portion of the
observations to some parameterized distribution, and then examining
cases where the observation is of a sufficiently low probability. The
<code>extremevalues</code> R package does some of this for us; it offers the
option to fit a variety of distributions to a set of observations with
associated quantiles. Unfortunately, the natural distribution to use
for proportions, the Beta distribution, is not one of the options
(fitting involves messing around with derivatives of the inverse of
the incomplete Beta function, which is a bit of a pain), but we
can cheat: instead of doing the reasonable thing and fitting a beta
distribution, we can transform the proportion using the logit
transform and fit a Normal distribution instead, because everything is
secretly a Normal distribution. Right? Right? Anyway.</p>
<pre><code>tmp$outliers <- with(london$employment.msoa, getOutliers(log(Full.time/All.Ages/(1-Full.time/All.Ages))))
tmp$high <- london$employment.msoa$Group.1[tmp$outliers$iRight]
tmp$low <- london$employment.msoa$Group.1[tmp$outliers$iLeft]
print(tmp[c("high", "low")])
# keep the outlier zone labels, for use when mapping them later
london$outliers <- c(as.character(tmp$high), as.character(tmp$low))
tmp <- list()
</code></pre>
<p>This shows up 9 MSOAs as having a surprisingly high full-time
employment rate, and one (poor old <code>Hackney 001</code>) as having a surprisingly low
rate (where “surprising” is defined relative to our
not-well-motivated-at-all Normal distribution of logit-transformed
full-time employment rates, and so shouldn't be taken too seriously).
Where are these things? Clearly the next step is to put these MSOAs
on a map. Firstly, let's parse some shape files (again, see the
bottom of this post for details of where to get them from).</p>
<pre><code>london$borough <- readShapePoly("London_Borough_Excluding_MHW")
london$msoa <- readShapePoly("MSOA_2011_London_gen_MHW")
</code></pre>
<p>In order to plot these shape files using <code>ggplot</code>, we need to convert
the shape file into a “fortified” form:</p>
<pre><code>london$borough.fortified <- fortify(london$borough)
london$msoa.fortified <- fortify(london$msoa)
</code></pre>
<p>and we will also need to make sure that we have the MSOA (and borough)
information as well as the shapes:</p>
<pre><code>london$msoa.fortified$borough <- london$msoa@data[as.character(london$msoa.fortified$id), "LAD11NM"]
london$msoa.fortified$zonelabel <- london$msoa@data[as.character(london$msoa.fortified$id), "MSOA11NM"]
</code></pre>
<p>For convenience, let's add to the “fortified” MSOA data frame the
columns that we've been investigating so far:</p>
<pre><code>tmp$is <- match(london$msoa.fortified$zonelabel, london$employment.msoa$Group.1)
london$msoa.fortified$fulltime <- london$employment.msoa$Full.time[tmp$is]
london$msoa.fortified$allages <- london$employment.msoa$All.Ages[tmp$is]
london$msoa.fortified$older <- london$employment.msoa$older[tmp$is]
london$msoa.fortified$younger <- london$employment.msoa$younger[tmp$is]
</code></pre>
<p>And now, a side note about R and its evaluation semantics. R
has... odd semantics. The basic evaluation model is that function
arguments are evaluated lazily, when the callee needs them, but in the
caller's environment – except if the argument is a defaulted one (when
no argument was actually passed), in which case it is evaluated in the
callee's environment, except that these rules
can be subverted using various mechanisms, such as simply looking at
the call object that is currently being evaluated. There's
significant complexity lurking, and libraries like <code>ggplot</code> (and
<code>lattice</code>, and pretty much anything you could mention) make use of
that complexity to give a convenient surface interface to their
functions. Which is great, until the time comes for someone to write
a function that calls these libraries, such as the one I'm about to
present which draws pictures of London. I have to know about the
evaluation semantics of the functions that I'm calling – which I
suppose would be reasonable if there weren't quite so many possible
options – and necessary workarounds that I need to install so that
<em>my</em> function has the right, convenient evaluation semantics while
still being able to call the library functions that I need to. I can
cope with this because I have substantial experience with programming
languages with customisable evaluation semantics; I wonder how to
present this to Data Science students.</p>
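<p>As a minimal sketch of the mechanism at issue (a toy function, not drawn from any of these libraries): <code>substitute</code> retrieves the unevaluated argument expression, and <code>eval</code> can then evaluate it in an environment of our choosing, which is essentially what the expression interfaces of <code>ggplot</code> and friends are built on.</p>
<pre><code>f <- function(x) substitute(x)
f(a + b)   # the unevaluated expression a + b
g <- function(x) eval(substitute(x), list(a = 1, b = 2))
g(a + b)   # 3: the expression evaluated in an environment we supply
</code></pre>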
<p>With all that said, I present a fairly flexible function for drawing
bits of London. This presentation of the finished object of course
obscures all my exploration of things that didn't work, some for
obvious reasons and some which remain a mystery to me.</p>
<pre><code>ggplot.london <- function(expr, subsetexpr=TRUE) {
</code></pre>
<p>We will call this function with an (unevaluated) expression to be
displayed on a choropleth, to be evaluated in the data frame
<code>london$msoa.fortified</code>, and an optional subsetting expression to
restrict the plotting. We compute the range of the expression,
independent of any data subsetting, so that we have a consistent scale
between calls</p>
<pre><code> expr.limits <- range(eval(substitute(expr), london$msoa.fortified))
</code></pre>
<p>and compute a suitable default <code>aes</code>thetic structure for <code>ggplot</code></p>
<pre><code> expr.aes <- do.call(aes, list(fill=substitute(expr), colour=substitute(expr)))
</code></pre>
<p>where <code>substitute</code> is used to propagate the convenience of an
expression interface in our function to the same expression interface
in the ggplot functions, and <code>do.call</code> is necessary because <code>aes</code> does
not evaluate its arguments, while we do need to perform the
substitution before the call.</p>
<pre><code> london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
</code></pre>
<p>This is another somewhat complicated set of calls to make sure that
the subset expression is evaluated when we need it, in the environment
of the data frame. <code>london.data</code> at this point contains the subset of
the MSOAs that we will be drawing.</p>
<pre><code> (ggplot(london$borough.fortified) + coord_equal() +
</code></pre>
<p>Then, we set up the plot, specify that we want a 1:1 aspect ratio,</p>
<pre><code> (theme_minimal() %+replace%
theme(panel.background = element_rect(fill="black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())) +
</code></pre>
<p>a black background with no grid lines,</p>
<pre><code> geom_map(map=london.data, data=london.data, modifyList(aes(map_id=id, x=long, y=lat), expr.aes), label="map") +
</code></pre>
<p>and plot the map.</p>
<pre><code> scale_fill_gradient(limits=expr.limits,
                     low=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#13432B", TRUE))))),
                     high=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#56F7B1", TRUE)))))) +
scale_colour_gradient(limits=expr.limits, low="#13432B", high="#56F7B1", guide=FALSE) +
</code></pre>
<p>We specify a greenish colour scale, with the fill colour being
slightly darker than the outline colour for prettiness, and</p>
<pre><code> geom_map(map=london$borough.fortified, aes(map_id=id, x=long, y=lat), colour="white", fill=NA, size=0.1)
</code></pre>
<p>also draw a superposition of borough boundaries to help guide the eye.</p>
<pre><code> )
}
</code></pre>
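<p>Putting the fragments together, the whole function looks roughly
like this (a sketch: the signature, with its default subset expression
selecting everything, is my reconstruction from the calls below rather
than quoted from the original source):</p>
<pre><code>ggplot.london <- function(expr, subsetexpr=TRUE) {
    expr.limits <- range(eval(substitute(expr), london$msoa.fortified))
    expr.aes <- do.call(aes, list(fill=substitute(expr), colour=substitute(expr)))
    london.data <- droplevels(do.call(subset, list(london$msoa.fortified, substitute(subsetexpr))))
    (ggplot(london$borough.fortified) + coord_equal() +
     (theme_minimal() %+replace%
      theme(panel.background = element_rect(fill="black"),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())) +
     geom_map(map=london.data, data=london.data,
              modifyList(aes(map_id=id, x=long, y=lat), expr.aes), label="map") +
     scale_fill_gradient(limits=expr.limits,
                         low=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#13432B", TRUE))))),
                         high=mnsl2hex(darker(rgb2mnsl(coords(hex2RGB("#56F7B1", TRUE)))))) +
     scale_colour_gradient(limits=expr.limits, low="#13432B", high="#56F7B1", guide=FALSE) +
     geom_map(map=london$borough.fortified, aes(map_id=id, x=long, y=lat),
              colour="white", fill=NA, size=0.1))
}
</code></pre>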
<p>So, now, we can draw a map of full time employment:</p>
<pre><code>ggplot.london(fulltime/(allages-younger-older))
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/london-full-time.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-london-full-time.png" width="256" height="192" alt="proportion employed full-time" class="img" /></a></p>
<p>or the same, but just the outlier MSOAs:</p>
<pre><code>ggplot.london(fulltime/(allages-younger-older), zonelabel %in% london$outliers)
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/london-full-time-outliers.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-london-full-time-outliers.png" width="256" height="192" alt="proportion employed full-time (outlier MSOAs)" class="img" /></a></p>
<p>or the same again, plotting only those MSOAs which are within 0.1 of
the extremal values:</p>
<pre><code>ggplot.london(fulltime/(allages-younger-older),
              pmin(fulltime/(allages-younger-older) - min(fulltime/(allages-younger-older)),
                   max(fulltime/(allages-younger-older)) - fulltime/(allages-younger-older)) < 0.1)
</code></pre>
<p><a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/london-full-time-extremals.png"><img src="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/256x-london-full-time-extremals.png" width="256" height="192" alt="proportion employed full-time (near-extremal MSOAs)" class="img" /></a></p>
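<p>The repetition of the employment ratio in that call is unwieldy;
since the subset expression is evaluated in the environment of the
data frame, an ordinary helper function in scope would tidy it up.
For instance (a hypothetical convenience of my own, not part of the
code above):</p>
<pre><code>## Hypothetical helper: TRUE for entries within tol of either extreme
near.extremes <- function(x, tol=0.1) {
    pmin(x - min(x), max(x) - x) < tol
}

ggplot.london(fulltime/(allages-younger-older),
              near.extremes(fulltime/(allages-younger-older)))
</code></pre>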
<p>Are there take-home messages from this? On the data analysis itself,
probably not: it's not really a surprise to discover that there's
substantial underemployment in the North and East of London, and
relatively full employment in areas not too far from the City of
London. It is not unreasonable for nursery staff to ask whether
parents dropping children off might want training; it might be that
they have the balance of probabilities on their side.</p>
<p>From the tools side, it seems that they can be made to work. I've
really only been kicking the tyres rather than pushing them hard, and
I had to rewrite calls to <code>ggplot</code>, <code>aes</code> or one of the
helper functions in response to error messages distressingly often,
presumably because some evaluation was not happening at quite the
right time; even so, the flexibility and interactivity they offer for
exploration feel to me as if the benefits outweigh the costs.</p>
<p>The R commands in this post are
<a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/london-employment-visualization.R">here</a>; the
<a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/london-employment-visualization-shapefiles.zip">shape files</a> are
extracted from
<a href="http://data.london.gov.uk/datastore/package/statistical-gis-boundary-files-london">Statistical GIS Boundary Files for London</a>
and redistributed here under the terms of the Open Government Licence;
they contain National Statistics and Ordnance Survey data © Crown
copyright and database right 2012. The
<a href="http://christophe.rhodes.io/notes/blog/posts/2014/london_employment_visualization/mid-2011-lsoa-quinary-estimates.csv">CSV data file for population counts</a>
is derived from
<a href="http://www.ons.gov.uk/ons/rel/sape/soa-mid-year-pop-est-engl-wales-exp/mid-2011--census-based-/rft---mid-2011-lsoa-table.zip">an ONS publication</a>,
distributed here under the terms of the Open Government Licence, and
contains National Statistics data © Crown copyright and database right
2013; the CSV data file for labour statistics is
derived from
<a href="http://data.london.gov.uk/datastorefiles/datafiles/demographics/2011-census/visualisation-data-labour.zip">a spreadsheet</a>
published by the <a href="http://data.london.gov.uk/">London datastore</a>,
distributed here under the terms of the Open Government Licence, and
contains National Statistics data © Crown copyright and database right
2013.</p>
<p>Still here after all that licence information? I know I promised some
<code>gridSVG</code> evaluation too, but it's getting late and I've been drafting
this post for long enough already. I will come back to it, though,
because it was the more fun bit, and it makes some nifty interactive
visualizations fairly straightforward. Watch this space.</p>