ecda day 1


I made it to Jacobs University Bremen, somewhat bleary-eyed thanks to the brutal ~~04:15~~04:30 start (Stansted airport at 06:00 reminded me why I hate air travel). I was travelling with my Transforming Musicology colleague-in-arms David Baker, both of us due to present small pieces of our work at the European Conference on Data Analysis. Mind you, I had the lucky draw; my presentation is scheduled for Friday, whereas David had to give his that afternoon.

We got here (photo: Wikimedia user Muck123, CC-BY-SA), and we’re definitely not in New Cross any more (photo: David Howard, CC-BY). I haven’t actually been to a campus University for ages, and I don’t remember being on one as charming as this one: it’s small, there are lawns everywhere, the pond has a duck house... after some confusion in attempting to learn how to operate doors, I got to the first keynote presentation, by Themis Palpanas of Université Paris Descartes – which happened to be on indexing structures for exact and approximate data series query and retrieval, so the whole trip was probably already worthwhile. I have a long list of references to chase.

The first “semi-plenary” was also directly relevant to me, this time to the work that I am due to present; Rebecca Nugent talked about random forest approaches to aggregating the same individuals in human-maintained records, for example in patent databases. The main point I took away from her talk was that the internal states of the trees in the random forests (and in the forests themselves if using forests of random forests) effectively induce a distribution over probabilities, and those distributions can be used to distinguish between certain and uncertain probability predictions. I found it odd that she and her colleagues at the NSF Census Research “node” had chosen skew as the distributional measure to evaluate, rather than some measure of spread – I probably missed something there. The twist at the two-thirds point, to consider casualty lists in Syria using the same methodology, was very effective: though the topic is of course gloomy, attempting to validate the various claims and counterclaims of the various sources is closer to my heart than identifying unique patent authors or making the US census a bit cheaper.
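To make the spread-versus-skew question concrete for myself, here is a minimal sketch of summarising the per-tree probabilities for a single candidate pair of records. It is not the method from the talk: the function name `summarise_tree_votes` and the example numbers are my own inventions, and I am assuming each tree simply emits a match probability between 0 and 1.

```python
from statistics import mean, pstdev

def summarise_tree_votes(tree_probs):
    """Summarise the per-tree match probabilities for one pair of records.

    Returns the mean (the forest's usual prediction), the population
    standard deviation (a measure of spread) and the standardised third
    moment (skewness) of the per-tree values.
    """
    m = mean(tree_probs)
    sd = pstdev(tree_probs)
    if sd == 0:
        skew = 0.0  # every tree agrees exactly; no asymmetry to measure
    else:
        skew = mean([((p - m) / sd) ** 3 for p in tree_probs])
    return {"mean": m, "spread": sd, "skew": skew}

# Two hypothetical pairs with the same mean prediction (0.9):
agreeing = summarise_tree_votes([0.90, 0.92, 0.88, 0.91, 0.89])
dissenting = summarise_tree_votes([0.5, 1.0, 1.0, 1.0, 1.0])
print(agreeing)
print(dissenting)
```

Both made-up pairs have the same headline probability, but the second has a much larger spread and a clearly negative skew (one tree dissenting from an otherwise unanimous forest), so either measure would flag it as the less certain prediction in this toy case.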

The afternoon brought on the parallel sessions – lots of them: seven sessions ranging from machine learning and knowledge discovery to data analysis in marketing, though where data analysis in musicology fits on that implied continuum I don't know. In any case, I went to support David, regretfully therefore not being able to attend the session on “high-dimensional data: using all dimensions”. I thought David’s talk went very well; the musical examples (and audience participation) kept everyone awake, and the questions from the small but engaged audience showed that they had definitely understood what we were up to. The other talks in the session were interesting too: Nadja Bauer presented a method for model-selection in the parameter space of onset detection methods. Denis Amelynck showed work on classifying people’s movement to Brahms’ Piano Concerto No. 1 – the clustering method based on Dirichlet Process Models was fine, but the surprising observation was that even something as crude as summing the “directogram”s of 35 participants’ movements (a similarity matrix where the feature was which of 32 directions the dominant hand moved in each frame) reproduced, by eye at least, the coarse similarity matrix of the music.
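As I understood the “directogram” idea, it can be sketched roughly as follows. This is my reconstruction and not Denis's code: the 2-D quantisation, the exact-match similarity and all the function names are assumptions of mine; only the 32 directions and the summing over participants come from the talk.

```python
import math

N_DIRECTIONS = 32  # from the talk; the 2-D quantisation below is my assumption

def direction_bin(dx, dy, n=N_DIRECTIONS):
    """Quantise a 2-D hand displacement into one of n equal angular bins."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle / (2 * math.pi) * n) % n

def directogram(bins):
    """Binary self-similarity matrix: 1 where two frames share a direction bin."""
    return [[1 if a == b else 0 for b in bins] for a in bins]

def summed_directogram(all_bins):
    """Element-wise sum of the per-participant matrices (same frame count each)."""
    n_frames = len(all_bins[0])
    total = [[0] * n_frames for _ in range(n_frames)]
    for bins in all_bins:
        for i, row in enumerate(directogram(bins)):
            for j, value in enumerate(row):
                total[i][j] += value
    return total

# Two hypothetical participants, three frames each, already quantised:
print(summed_directogram([[0, 0, 8], [0, 16, 16]]))
```

Cells of the summed matrix are large wherever many participants moved in the same direction at both of the two frames in question, which is presumably why block structure in the music can show through even this crude aggregate.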

Finally, in the session on Musicology, Tillman Weyde presented the somewhat surprising result that in attempting to separate solo piano music from other classes in a 600-strong database of recordings, the audio was no help given even extremely poor metadata. This result is similar to our result about chroma distance being only marginally associated with leitmotif recognition difficulty, and I think the conclusion is the same in each case: the audio feature just isn’t good enough. (In our case, there’s no particular surprise: a 12-dimensional chroma barely captures any melodic information, while leitmotifs are substantially melodic objects; in Tillman’s case, I suspect the issue is that the instrument identification, done using Emmanouil Benetos’ transcription method, depends on instrument sound templates, which might match poorly to the actual historical recordings – as well as the fact that even the very poor metadata is good enough to separate out most of the piano solo music in this database.)

The last event of the first day was a slightly chaotic panel discussion on the future of publications in classification and data sciences, which was definitely oriented from the point of view of the German Classification Society – and in particular how the conference (post-)proceedings should be published. The current state is as a Springer volume; the discussion centered around whether there would be benefit to the society, and its members, to switching to an Open Access journal working with a “less prestigious” publisher.

This is a difficult question, in the general context of academic publishing which at the moment I would characterize as “a total mess”. Some of the contradictions were explicitly stated: early career researchers are often made to “publish [in ‘high-quality’ journals] or perish”; UK funding bodies now insist that publications should be Open Access (“green” or “gold”); Open Access currently requires either funding to pay journal fees, or an embargo period and an institutional repository. (I am simplifying; it’s more complicated than this). The feeling in the room, it seemed to me, was that there was broad acceptance of an Open Access venue for the conference proceedings, provided that there were no author charges (other than, I guess, the need to attend the conference in the first place). There was also a tension between the needs of the Classification Society (a plausible venue for its proceedings) and the needs of the research community as a whole (do we really need another Open Access journal in Data Science?). But I left unsatisfied because of more than that; I was a bit depressed that there seemed to be a tacit acceptance that good work goes in “high-quality” journals, or even worse that the impersonal academic system judges that a work published in a “high-quality” journal is good – a mistake that no Data Scientist would make in their domain of expertise.

After that, it was time to explore the local area. There was a comedy period of striking out in completely the wrong direction, caused largely by the cunning trick of the University distributing a campus map with South pointing up the page; I noticed our directional confusion eventually by the lack of river, and corrected it by navigating by the sun and my memory of the map of the area. Eventually, we got to Bremen-Vegesack and had a fine beer and fish supper, and got back to the campus past an improbable number of impeccably-kept front gardens without getting substantially more lost.