<h1>pages tagged reproducible_research</h1>
<h2><a href="http://christophe.rhodes.io/notes/blog/posts/2014/earl_conference/">earl conference</a> (2014-09-23)</h2>
<p>This week, I went to the
<a href="http://www.earl-conference.com/">Effective Applications of the R Language</a>
conference. I’d been alerted to its existence by my visit to a
<a href="http://christophe.rhodes.io/notes/blog/posts/2014/londonr_june/">londonR meeting in June</a>. Again, I went for at least
two reasons: one as an R enthusiast, though admittedly one (as usual)
more interested in tooling than in applications; and one as
postgraduate coordinator in Goldsmiths Computing, where one of our
modules in the new
<a href="http://www.gold.ac.uk/pg/msc-data-science/">Data Science programme</a>
(starting <s>today</s>yesterday! Hooray for
“<a href="http://www.gold.ac.uk/welcome/">Welcome Week</a>”!) involves exposing
students to live data science briefs from academia and industry, with
the aim of fostering a relevant and interesting final project.</p>
<p>A third reason? <a href="http://www.badscience.net/">Ben Goldacre</a> as invited
speaker. A fantastic choice of keynote, even if he did call us ‘R
dorks’ a lot and confess that he was a <a href="http://www.stata.com/">Stata</a>
user. The material, as one might expect, was derived from his books
and related experience, but the delivery was excellent, and the drive
clear to see. There were some lovely quotes that it’s tempting to
include out of context; in the light of my as-yet unposed ‘research
question’ for the final module of the
<a href="http://www.gold.ac.uk/pg/ma-pgdip-cert-management-of-learning-teaching/">PG Certificate in Higher Education</a>
– that I am <a href="http://christophe.rhodes.io/notes/blog/posts/2014/pgcert_essays/">still engaged on</a> – it is tempting to
bait the “infestation of qualitative researchers in the educational
research establishment”, and attempt a Randomised Controlled Trial, or
failing that a statistical analysis of assessments to try to uncover
suitable hypotheses for future testing.</p>
<p>Ben’s unintended warm-up act – they were the other way around in the
programme, but clearly travelling across London is complicated – was
<a href="http://had.co.nz/">Hadley Wickham</a>, of <a href="http://ggplot2.org/">ggplot2</a>
fame. His talk, about <a href="http://www.rstudio.com/">RStudio</a>, his current
view on the data analysis workflow, and new packages to support it,
was a nice counterpoint: mostly tools, not applications, but clearly
focussed to help make sense of complicated (and initially untidy)
datasets. I liked the <a href="http://shiny.rstudio.com/">shiny</a>-based
in-browser living documents in R-markdown, which is not a technology
that I’ve investigated yet; at this rate I will have more options for
reproducible research than reports written. He, and others at the
conference, were advocating a pipe-based code sequencing structure –
the R implementation of this is called
<a href="http://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><code>magrittr</code></a>
(ha, ha) and has properties that are made available to user code
by R’s nature as a Lisp-1 with crazy evaluation semantics, on
which more in another post.</p>
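<p>For a flavour of what that pipe-based sequencing looks like, here is a
tiny generic sketch (my own illustration, not an example from the talk):</p>

```r
library(magrittr)  # provides the %>% pipe operator

# Nested style: read inside-out.
result_nested <- summary(lm(mpg ~ wt, data = subset(mtcars, cyl == 4)))

# Piped style: read left-to-right; each value feeds the next call.
# By default the piped value becomes the first argument; `.` names it
# explicitly when it belongs elsewhere in the call.
result_piped <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ wt, data = .) %>%
  summary()
```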
<p>The rest of the event was made up of shorter, usually more
domain-specific talks: around 20 minutes for each speaker. I think it
suffered a little bit from many of the participants not being able to
speak freely – a natural consequence of a mostly-industrial event, but
frustrating. I think it was also probably a mistake to schedule a
reflective slot of three presentations comparing R with other
languages (Python, Julia, and a more positive one about R’s niche) in
the first of the regular (parallel) sessions of the event:
there hadn’t really been time for a positive tone to be established,
and it just felt like a bit of a downer. (Judging by the room, most
of the delegates – perhaps wisely – had opted for the other track, on
“Business Applications of R”).</p>
<p>Highlights of the shorter talks, for me:</p>
<ul>
<li><a href="https://yplanapp.com/">YPlan</a>’s John Sandall talking about “agile”
data analytics, leading to agile business practices relentlessly
focussed on one KPI. At the time, I wondered whether the focus on
the first time a user gives them money would act against building
something of lasting value – analytics give the power to make
decisions, but the underlying strategy still has to be thought
about. On the other hand, I’m oh-too familiar with the notion that
startups must survive first and building something “awesome” is a
side-effect, and focussing on money in is pretty sensible.</li>
<li><a href="http://www.mango-solutions.com/key_personnel.html">Richard Pugh</a>’s
(from Mango Solutions) talk about modelling and simulating the
behaviour of a sales team did suffer from the confidentiality
problem (“I can’t talk about the project this comes from, or the
data”) but was at least entertaining: the behaviours he talked about
(optimistic opportunity value, interactions of CRM closing dates
with quarter boundaries) were highly plausible, and the question of
whether he was applying the method to his <em>own</em> sales team quite
pointed. (no)</li>
<li>the team from
<a href="http://www.simpsoncarpenter.com/">Simpson Carpenter Ltd</a>, as well
as saying that “London has almost as many market research agencies
as pubs” (which rings true) had what I think is a fair insight: R is
perhaps less of a black-box than certain commercial tools; there’s a
certain retrocomputing feel to starting R, being at the prompt, and
thinking “now what?” That implies that to actually <em>do</em> something
with R, you need to know a bit more about what you’re doing. (That
didn’t stop a few egregiously bad graphs being used in other
presentations, including my personal favourite of a graph of
workflow expressed as business value against time, with the
inevitable backwards-arrows).</li>
<li>some other R-related tools to look into:
<ul>
<li><a href="https://github.com/Rdatatable/data.table/">data.table</a></li>
<li><a href="https://plot.ly/">plot.ly</a></li>
<li><a href="https://github.com/RevolutionAnalytics/RHadoop/wiki">RHadoop</a></li>
</ul>
</li>
</ul>
<p>And then there was of course the <s>hallway</s>break room track; Tower
Hotel catered admirably for us, with free-flowing coffee, nibbles and
lunch. I had some good conversations with a number of people, and am
optimistic that students with the right attitude could gain hugely
from data science internships. I’m sure I was among
the most tool-oriented of the attendees (most of the delegates were
actually <em>using</em> R), but I did get to have a conversation with Hadley
about “<a href="http://adv-r.had.co.nz/">Advanced R</a>”, and we discussed object
systems, and conditions and restarts. More free-form notes about the
event <a href="http://christophe.rhodes.io/notes/wiki/effective_applications_of_the_R_language/">on my wiki</a>.</p>
<p>Meanwhile, in related news, parts of the swank backend implementation
of <a href="http://common-lisp.net/project/slime/">SLIME</a>
<a href="https://github.com/slime/slime/commit/80cb30fb00d4130f9e06d43afc971efbf55d10f7">changed</a>,
mostly moving symbols to new packages. I've
<a href="https://github.com/csrhodes/swankr/commit/30b283c2edbb275ea1dcda80bffc46a0d29d8465">updated</a>
<a href="http://common-lisp.net/~crhodes/swankr/">swankr</a> to take account of
the changes, and (I believe, untested) preserved compatibility with
older (pre 2014-09-13) SLIMEs.</p>
<h2><a href="http://christophe.rhodes.io/notes/blog/posts/2014/still_writing_mini-projects_workshop_talk/">still writing mini-projects workshop talk</a> (2014-02-12)</h2>
<p>I'm on a train!</p>
<p>Specifically, I'm on a train, still writing my talk on <em>Tools for
Music Informatics</em> (<a href="http://christophe.rhodes.io/notes/blog/posts/2014/200_unread_music-ir_messages/">previously</a>).
It's to be delivered to a mixed audience, including technologists
already embedded in the Music Informatics area, and musicologists who
have no idea what it is but have heard that there might be some money
attached if they can spin an idea the right way. This mixed audience
has led to the talk having something of an identity crisis,
anthropomorphically speaking: should it be focussed on tools that have
been used to perform musicological tasks? Should it be focussed on
workflows that are simple to relate to? In the end, constrained by
the fact that my slot is 25 minutes long at most, I have gone for
surveying tools that are commonly used in the Music Informatics world,
and biasing my survey towards those which provide visuals that can be
used in a talk.</p>
<p>It turns out that the constraint on having example pictures is quite a
strong one. I appreciate that academic researchers are not
necessarily oriented towards powerpoint-style marketing; for many, and
generally I include myself in this category, the ideas and results
stand on their own merits. But in this area of software as output,
where there's the possibility of scaling one's impact, moving beyond
the set of people one can talk to, to anyone who can discover and use
the software, why is the user-facing documentation so poor?</p>
<p>Well, one reason is that creating good documentation is hard, and not
necessarily a skill that goes hand in hand with academic research or
software development, so in the lone wolf model of academia the
chances of finding someone who can do all three are low. But the fact
that there are several software packages in Music Informatics, all
doing approximately the same thing and all lamentably documented,
suggests that there might be other reasons, and one of the benefits of
being on a train and needing to procrastinate away from finishing the
talk is that I get to discuss reasons for this with my captive
colleagues.</p>
<p>So, here's a cynical view. There are two sets of competing incentives
that, together, might explain why there are a fair number of
underdocumented PhD-level software frameworks, and why there aren't
substantial communities formed around the further development of such
software. The first relates to the principal developer of the
software, assuming that they are still in academia – if they're not,
then there's probably no incentive to do anything more with the
software at all. In the academic case, the career ladder and prestige
indicators are constructed such that the credit for <em>new</em> things is
vastly greater than the credit for <em>improving</em> already-existing
things. In particular, once a permanent academic position has come
along, a potentially demanding userbase is actively detrimental,
taking time away from the real business of academics, which is to
write funding applications to be able to hire researchers to do
research and write papers on which you gain author credit. At that
point, the software might still be a help to generate the research
results, but the most valuable users are likely to be in close
physical proximity with the lead developer, and so they can learn
directly from the horse's mouth.</p>
<p>The second incentive applies to the principal source of uncommitted
labour in the system: the graduate student. A graduate student, in
theory, could be doing anything with their time; in practice, they are
guided in their activities by a supervisor, who suggests ideas and
projects and has as a goal the successful construction of a
dissertation showing substantial work at PhD level. As with the
previous case, the student gains more credit from new things than
improving existing things, and the contribution of the student to
independent work will be clearer for the supervisors and the doctoral
examiners than the contribution to a collaboration, particularly a
dynamic collaboration such as in software development and maintenance.
So an otherwise uncommitted supervisor will likely recommend to a
student interested in software development for music informatics to
develop their own software rather than invest time and effort in one
of the existing ones: the outcome for the student's dissertation is
likely to be better, as an independent implementation can make up a
chapter or more, while participating in a development community is
most likely not to be usable in a dissertation at all.</p>
<p>After all the cynicism, what's the solution? Changing the academic
prestige system is clearly a long-term prospect, and likely to run
into game theory buffers; while it's nice to imagine a world where
community contribution is as valued as independent “discovery”, I
wouldn't hold my breath. But this, it seems to me, is where agitating
for more reproducible research is important; not only does it lower
the barrier to participating in the scientific process, it also
provides built-in tutorial material in a real research context for any
and all software that is used in the research itself. Initiatives
such as conferences awarding prizes for reproducible research, or
journals mandating that reproducibility materials be deposited
alongside the manuscripts, have the potential to lead to an improvement
in reuse of software as well as in proving the research results
themselves.</p>
<h2><a href="http://christophe.rhodes.io/notes/blog/posts/2014/why_investigate_databases_of_audio?/">why investigate databases of audio?</a> (2014-01-13)</h2>
<p>At the <a href="http://christophe.rhodes.io/notes/blog/posts/2014/transforming_musicology_launch_meeting/">launch meeting</a> for
<a href="http://transforming-musicology.org/">Transforming Musicology</a> (I
should perhaps say that technically this was in the pub <em>after</em> the
launch meeting),
<a href="http://transforming-musicology.org/people/laurence-dreyfus/">Laurence Dreyfus</a>
asked me why I was experimenting with a collection of recordings when
planning my
<a href="http://christophe.rhodes.io/notes/wiki/a_case_study_in_contrafactum_and_parody/">Josquin/Gombert case study</a>,
as opposed to the perhaps more natural approach of investigating the
motivic or thematic similarity of music starting with the notated
works (manuscripts, contemporary or modern editions).</p>
<p>By this stage in the evening, I was not feeling at my best – a fact
unrelated to the venue of the conversation, I hasten to add – and I
wouldn't be surprised if my answer was not a model of clarity. Here's
another go, then, beginning with an anecdote.</p>
<p>When I was a PhD student, a long, long time ago in
<a href="http://www.damtp.cam.ac.uk/">a discipline far, far, away</a>, the
research group I was in had a weekly paper-reading club, where
students and postdocs, and the occasional permanent member of staff,
would gather together over lunch and try to work and think through an
article in the field. Usually, the paper selected was roughly
contemporary, maybe published in the last few years, and therefore
(because of the publishing norms in Applied Mathematics and
Theoretical Physics) the paper would be available on the
<a href="http://arXiv.org/">arXiv</a>, and so could be acquired, and printed,
without leaving one's desk. (Picking up the printed copy could double
up as one's daily exercise quota.) One week, however, some imp put
down a paper that predated the arXiv's rise to prominence. I have two
clear memories, which should nevertheless be treated as unreliable:
one of myself, spending about an hour of elapsed time in the
mathematical library, registering, finding the journal, finding the
volume and photocopying the article; the other of the lunch meeting,
where we discovered that those of us who had seen copies of the
article (let's not go as far as having <em>read</em> it) were definitively in
the minority.</p>
<p>There are several lessons to be drawn from this, beyond further
evidence of the laziness and hubris of the typical PhD student. The
availability of information, in whatever form, makes it more likely to
be useful and more likely to be used. It's timely to reflect on the
case of <a href="http://www.rememberaaronsw.com/oneyearlater/">Aaron Swartz</a>,
who confronted this issue and faced disproportionate consequences; I'm
pleased to say that I am involved in
<a href="http://www.gold.ac.uk/">my institution</a>'s working group on Open
Access, as we try to find a way of simultaneously satisfying all the
requirements imposed on us as well as extending our ability to
<a href="http://www.gold.ac.uk/about-goldsmiths/profile/">offer a transformative experience</a>.</p>
<p>How does this affect musicology, and our particular attempt to give it
the transformative experience? Well, one issue with music in general
is its lack of availability.</p>
<p>That is of course a ridiculous statement, designed to be provocative:
in this era of downloadable music, streaming music, ripped music,
music in your pocket and on your phone, music has never been more
available. But the music available in all those forms is audio; the
story for other forms of musical data (manuscript notation, chord
sequences, performing editions, even lyrics) is far less good.
Sometimes that is for copyright reasons, as rights-holders attempt to
limit the distribution of their artifacts; sometimes that is for
economic reasons, as the resources to digitize and encode
public-domain sources of notation are unavailable. But the net effect
is that while musical audio is merely limited and expensive, it is often not
possible to purchase large amounts of musical data in other forms at
<em>any</em> price.</p>
<p>So, in my still-hypothetical case study of Josquin/Gombert
attribution, I could use notation as my primary source of data if I
had encodings of sources or editions of the pieces I'm interested in –
the needles, if you will; if I had those, I could verify that some
computational algorithm could identify the association between
sections of the chansons, <em>Lugebat</em> and <em>Credo</em>. That encoding task
would be tedious, but achievable in – let's say – a week or so for the
three pieces. The problem is that this is only half the
battle: it is the construction of the needle, but for a method like
this to be properly tested it's not fair just to let it loose on the
needle; a search system has to be given the chance to fail by hiding
the needle in a suitable haystack. And constructing a haystack –
encoding a few hundreds of other Renaissance vocal works – is the
thing that makes it less than practical to work on search within
databases of notation, at least for this long-tail repertoire.</p>
<p>And that's why at the moment I investigate databases of musical audio:
I can get the data, and in principle at least so can other people. If
you can do it better, or just differently, why not consider applying
for one of the
<a href="http://www.transforming-musicology.org/news/2013-12-17_transforming-musicology-mini-projects/">Transforming Musicology mini-projects</a>?
There will be a one-day
<a href="http://www.transforming-musicology.org/news/2013-12-17_transforming-musicology-mini-projects/#exploratory-event">exploratory event</a>
at <a href="http://www.lancaster.ac.uk/">Lancaster University</a> on the 12th
February to explore possible project ideas and collaborations, and
having as wide a pool of interested parties as possible is crucial for
the production and execution of interesting ideas. Please come!</p>
<h2><a href="http://christophe.rhodes.io/notes/blog/posts/2014/slime_has_moved_to_github/">slime has moved to github</a> (2014-01-11)</h2>
<p>This is probably not news to anyone reading this, but:
<a href="http://common-lisp.net/project/slime/">SLIME</a> has moved its primary
source repository to <a href="https://github.com/slime/slime">github</a>.</p>
<p>One of my projects, <a href="http://common-lisp.net/~crhodes/swankr/">swankr</a>
(essentially, a SLIME for <a href="http://www.r-project.org">R</a>, reusing most
of the existing communication protocol and implementing a SWANK
backend in R, hence the name), obviously depends on SLIME: if nothing
else, the existing instructions for getting swankr up and running
included a suggestion to get the SLIME sources from <code>common-lisp.net</code>
CVS, which as <a href="http://xach.livejournal.com/321531.html">Zach says</a> is
a problem that it's nice no longer to have. (Incidentally, Zach is
<a href="http://xach.livejournal.com/321647.html">running a survey</a> –
<a href="https://docs.google.com/forms/d/1Ge0OViljLh724JZ6kIkkPf9x-wUt7-KgXBKab1s_7cs/viewform">direct link</a>
– to assess possible effects of changes on the SLIME userbase.)</p>
<p>So, time to update the instructions, and in the process also</p>
<ul>
<li>hoover up any obvious patches from my inbox;</li>
<li>improve the startup configuration so that the user need do no
explicit configuration beyond a suitable entry in
<code>slime-lisp-implementations</code>;</li>
<li>update the <code>BUGS.org</code> and <code>TODO.org</code> files for the current reality.</li>
</ul>
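<p>For illustration only, such an entry might look like the following in
an Emacs init file (the program name and flags here are assumptions
for the sketch, not taken from swankr's documentation):</p>

```elisp
;; Hypothetical sketch: register R alongside a Common Lisp in
;; `slime-lisp-implementations', so that an implementation named `r'
;; can be selected when starting SLIME.
(setq slime-lisp-implementations
      '((sbcl ("sbcl"))
        (r ("R" "--no-save"))))
```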
<p>That's all safely checked in and pushed to the
<a href="http://common-lisp.net/r/users/crhodes/swankr.git">many</a>
<a href="http://github.com/csrhodes/swankr">various</a>
<a href="http://christophe.rhodes.io/gitweb/">locations</a> which people might
think of as the primary swankr repository. But those org-mode files
are exported to HTML to produce the minimal swankr web presence, and
the script that does that was last run with org-mode version 7.x,
while now, in the future, org-mode is up to version 8 (and this time
the change in major version number is definitely indicative of
non-backwards-compatible API changes).</p>
<p>This has already bitten me. The principles of
<a href="http://reproducibleresearch.net">reproducible research</a> are good, and
org-mode offers many of the features that help: easily-edited source
format, integration with a polyglossia of programming languages,
high-quality PDF output (and adequate-quality HTML output, from the
same sources) – I've written some technical reports in this style, and
been generally happy with it. But, with the changes from org-mode 7
to org-mode 8, in order to reproduce my existing documents I needed to
make not only (minor) source document changes, but also (quite major)
changes to the bits of elisp to generate exported documents in house
style. Reproducible research is difficult; I suppose it's obvious
that exact reproduction depends on exact software versions, platform
configurations and so on – see
<a href="http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285">this paper</a>,
for example – but how many authors of supposedly reproducible research
are actually listing all the programs, up to exact versions of
dynamically-linked libraries they link to, used in the preparation and
treatment of the data and document?</p>
<p>In any case, the changes needed for swankr are
<a href="http://christophe.rhodes.io/gitweb/?p=swankr.git;a=commitdiff;h=5ca0271393659195cb5a87e423858eed16eacfa4">pretty minor</a>,
though I was briefly confused when I tried
<code>org-html-export-<b>as</b>-html</code> first (and no file was
produced, because that's the new name for exporting to an HTML
<strong>buffer</strong> rather than an HTML <strong>file</strong>). The
<a href="http://common-lisp.net/~crhodes/swankr/">swankr website</a> should now
reflect the current state of affairs.</p>