ECDA 2014

Stansted at 6am: not a nice experience.

Wednesday 2nd July

Made it! Found accommodation, left my things, back in time for the first keynote.

Do It Yourself: Exploratory Analysis of Very Large Scientific Data (Themis Palpanas, Paris Descartes)

Data series... time: - wind; - stock quotes; - GPS trajectories; - motion capture/gesture (difference between pointing finger and gun)

angle: - paleontology (similarity over angle): “efficient similarity search” - astronomy (luminosity)

position: - text analysis - genome data

tasks: - classification - clustering - outlier detection - frequent pattern mining

all depend on similarity search => need for indexing - TS-tree [Assent et al '08] / TStree [XX '13] - iSAX [Shieh & Keogh '08]: average, alphabet (with breakpoints) (building index takes too long: 500 million data series takes 20 days) - iSAX2 and iSAX2+ ICDM'10 KAIS'14: scales to 1-2 billion series - Adaptive Data Series Index: ADS+: build index on-demand: 7x faster for 100,000 queries
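The "average, alphabet (with breakpoints)" part is the SAX discretisation that iSAX builds on. A minimal sketch of plain SAX, not the indexing itself; the function name `sax_word` and the assumptions (non-constant series, length divisible by the number of segments) are mine, not from the talk:

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, segments, alphabet_size):
    """Return a SAX word (list of symbol indices) for one data series."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalise; assumes the series is not constant (sigma > 0)
    z = [(x - mu) / sigma for x in series]
    # "average": mean of each equal-length segment
    # (assumes len(series) is a multiple of `segments`)
    seg_len = len(z) // segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(segments)]
    # "breakpoints": cut the standard normal into equiprobable regions
    std_normal = NormalDist()
    breakpoints = [std_normal.inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    # "alphabet": index of the region each segment average falls into
    return [bisect_right(breakpoints, value) for value in paa]


print(sax_word([0, 0, 0, 0, 10, 10, 10, 10], segments=2, alphabet_size=4))
```

A low half followed by a high half maps to the lowest and highest symbols of a four-letter alphabet.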

Exploratory!

Future: big sequence management system: - physical and logical data independence - rich, declarative query language - query optimization - inherent scalability - support for uncertain data series

Acknowledgments to Shieh & Keogh

Solving the Identity Crisis: Large-Scale Clustering with Distributions of Distances and Applications in Record Linkage (Rebecca Nugent with Samuel L. Ventura)

NSF Census Research Network: methodology for census (and reducing cost!)

capture/recapture
census and surveys
Confidentiality, privacy, online self-disclosure
record linkage
small area estimation
http://www.stat.cmu.edu/NCRN

patents: database includes - inventor names/locations - assignee names/locations

patents as proxy for research, *sigh*

David Miller: deduplication of inventors

everyone assumes that deduplicated / disambiguated dataset is correct

1. quantify similarity of record pair
2. find the probability that each record-pair is a match
3. hierarchical clustering to identify unique entities
4. sequential blocking

(point 2. closely related to deduplication)
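A toy sketch of the pipeline shape, to fix ideas. This is not the authors' method: a plain string similarity from `difflib` stands in for the match probability, and a thresholded single-linkage grouping (via union-find) stands in for the hierarchical clustering. The names `similarity`, `link_records` and the threshold value are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in pairwise score in [0, 1]; a real system would model match probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(records, threshold=0.85):
    """Group records whose pairwise score reaches the threshold (single linkage)."""
    parent = list(range(len(records)))

    def find(i):
        # follow parent pointers to the cluster representative, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return sorted(clusters.values())


print(link_records(["John Smith", "Jon Smith", "Mary Jones"]))
```

Comparing all pairs is quadratic, which is presumably where blocking comes in: only compare records that already agree on some cheap key.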

De Groot distributions

splitting / lumping

Syrian death lists and matching records

Fast Model Based Optimization of Tone Onset Detection by Instance Sampling (Nadja Bauer with Klaus Friedrichs, Bernd Bischl and Claus Weihs)

(optimization of hearing aids, transcript)

model selection on a subset before full evaluation

Model-Based Optimization vs fast Model-Based Optimization

Recognition of Leitmotives in Wagner (David Baker with Daniel Müllensiefen et al.)

not only skin conductance but also ECG

The surprising character of music (Denis Amelynck with Pieter-Jan Maes, Marc Leman, someone else)

surprise → embodiment (people's motion)

Dirichlet process model

small clusters... but is there something different between small clusters and “errors”? Duration threshold

directograms (are people dancing similarly) generating structural fingerprints.

Hm, the “surprise” thing applies to the clustering, but the directogram comes from the

Interactive learning for dataset demarcation (Tillman Weyde and others)

audio (Benetos 2013) turns out not to help at all

Panel Discussion on future of publication

What are your thoughts on publication?
- career options of young people? publish or perish. If it's not A+ we don't tenure you. Rethink conference proceedings?
- how can we make attending the conference worthwhile, from the publication point of view?
- can an open-access journal have guaranteed visibility? Also, quality?
- visibility: link structure of the 'net

Discussion: - long-term open access - conference for work in their infancy

why one OA journal? Why not a bundle? - something like a journal for the members - algorithms + datasets + etc - extra long research papers

don't like author pays

where's the added value?

Thursday 3rd July

Oops, missed the first plenary session at 8:00

Visualization and Data Mining for High Dimensional Data (Alfred Inselberg)

Answering questions we did not know how to ask

Come from Geometry

What's a pattern? anything that catches your eye. Trust your eyes.

Nice little application (ParallAX) for data visualization

Effects of Parenthood on Well-Being (Evgenia Smilova and Colin Vance)

Are happy people happy parents? Insights from a quantile regression

literature: negative relationship between parenthood and well-being (Hansen 2012, Stanca 2012, Blanchflower 2008)

take home message: happy people make happy parents (no strong negative effects for high-percentile well-being people). Women happy when 19+ children leave home; men less happy (positive happiness shift at very low happiness quantiles, negative elsewhere)
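For my own reference, the idea behind quantile regression: instead of squared error, minimise the asymmetric "pinball" loss, so the fit tracks a chosen quantile rather than the mean. A minimal sketch with no covariates (just the best constant), on made-up numbers that have nothing to do with the study's data:

```python
def pinball_loss(residual, tau):
    """Asymmetric loss: under-predictions weighted by tau, over-predictions by 1 - tau."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def best_constant(values, tau):
    """Constant prediction (chosen among the data points) minimising total pinball loss."""
    return min(values, key=lambda q: sum(pinball_loss(y - q, tau) for y in values))


scores = [1, 2, 3, 4, 100]  # illustrative values only
print(best_constant(scores, tau=0.5))  # median-like fit
print(best_constant(scores, tau=0.9))  # fit for the upper end of the distribution
```

With covariates, the same loss lets an effect (here, parenthood) differ between low and high quantiles of the outcome, which is what the take-home message relies on.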

How health literacy facilitates (J Paech)

what makes people actually do the physical activity they should do / think they should do?

Validation of questionnaires using a pilot trial and the English Longitudinal Study of Ageing (Adi Florea and others from Colchester Hospital and Essex)

N=74 women suffering from Urinary Incontinence

Symptom Severity Instrument
Rosenberg self-esteem scale
Incontinence Quality-of-Life Questionnaire

Student life-style revisited: values, attitudes and behaviour (Thomas Hummel)

Rokeach -- Mitchell Values and Life Style

A Hidden Markov Model to detect relevance in financial documents based on on/off topics (Dimitrios Kampas)

introduces dependence on word order to LDA

On a decision-maker without preferences (Andreas Geyer-Schulz)

causal models vs noise models. Want to reduce number of (noise) models
observed choices reveal preferences (Paul Samuelson)

Missing Data Methods for Big Data Analysis (Dieter William Joenssen)

Need to deal with big data problems (variety, volume, velocity) individually before you can deal with them all together.

Mean imputation not really good enough for missing data
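One well-known reason, sketched on made-up numbers (the helper `mean_impute` is mine, not from the talk): filling every gap with the observed mean leaves the mean unchanged but shrinks the variance.

```python
from statistics import mean, pvariance


def mean_impute(values):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


data = [1, None, 2, None, 3, None, 4, None, 5, None]  # illustrative only
observed = [v for v in data if v is not None]
imputed = mean_impute(data)

print(pvariance(observed))  # spread of what was actually measured
print(pvariance(imputed))   # smaller: the filled-in values sit exactly on the mean
```

Anything downstream that depends on spread or correlation inherits that distortion.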

Big Data Oriented Symbolic Data Analysis in Cloud (Hiroyuki MINAMI, Masahiro MIZUTA)

Hadoop (and not R)

Big Data Analytics vs Classical Data Science (Claus Weihs)

“Big Data”: impossible to exactly solve the learning problem computationally
“Data Science”: generalizable extraction of knowledge from data

particularly: Classification & Regression methods; many variables / many instances

Epistemic uncertainty sampling for active learning on data streams (Ammar Shaker and Eyke Hüllermeier)

aleatoric vs epistemic uncertainty

credal sets / plausibility-possibility: plausibility of 1 everywhere means "I don't know"

epistemic uncertainty sampling should work better than conventional uncertainty sampling
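For comparison, a minimal sketch of the conventional baseline for a binary classifier on a stream: ask for the label whenever the predicted probability is close to 0.5. The function names and the threshold are my own illustrative choices; the epistemic variant from the talk is not implemented here.

```python
def uncertainty(p_positive):
    """1.0 when p = 0.5 (maximally unsure), 0.0 when p is 0 or 1."""
    return 1.0 - 2.0 * abs(p_positive - 0.5)


def select_queries(stream_probs, threshold=0.8):
    """Indices of stream items whose label would be requested."""
    return [i for i, p in enumerate(stream_probs) if uncertainty(p) >= threshold]


# predicted P(class = 1) for four incoming instances (made-up numbers)
print(select_queries([0.95, 0.55, 0.10, 0.45]))
```

A probability near 0.5 does not say whether the model lacks data there (epistemic) or the classes genuinely overlap (aleatoric); as I understood it, separating the two is the point of the proposed method.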

Friday 4th July

Size and Shape in biplots (Michael Greenacre)

is size or shape captured by typical distance measures used in ecology? Bray-Curtis. Raw chi-squared (Greenacre 2010) + Box-Cox does better?
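A quick check of the size question with the standard Bray-Curtis formula, sum of absolute differences over sum of totals, on two made-up abundance vectors with identical proportions (same "shape") but different totals (different "size"):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


site_a = [1, 2, 3]
site_b = [2, 4, 6]  # same relative composition, twice the abundance
print(bray_curtis(site_a, site_b))
```

The result is non-zero even though the relative compositions match, so the measure mixes size into the distance.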

Reviewing Graphical Modelling of multi-variate temporal processes (Matthias Eckhart)

MRF: Lauritzen (1996)

random variables evolving over time

Learning Hierarchical Document Classifications from Recommender Graphs: Modularity Clustering (Fabian Ball with Andreas Geyer-Schulz)

Community detection in graphs: single partition

But what about hierarchical partitions? e.g. linkage graphs of bibliographic data in catalogues. Want to exploit the whole hierarchical structure of graphs
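For the single-partition case, the quantity being optimised is Newman-Girvan modularity. A minimal sketch for an undirected, unweighted graph given as an edge list; the toy graph (two triangles joined by one edge) and the function name are my own:

```python
def modularity(edges, community):
    """Modularity Q of a partition; `community` maps each node to its community label."""
    m = len(edges)
    internal = {}  # edges with both ends inside the community
    degree = {}    # total degree of the community's nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (fraction of internal edges - expected fraction)
    return sum(internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)


edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, two_triangles))
```

Putting every node in one community gives Q = 0, so the split into two triangles scores better, as expected.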

Group theory and S_n

Why the "wrong" minima? They look stable

Comparing audio and playlist features (Igor Vatolkin)

Hm but wait: this is genre classification again. Perhaps not surprising that playlist co-occurrence features can do well and outperform audio features

Big Data (Tillman Weyde)

“we're investigating [...] popular music, because it's ... popular”

dml.city.ac.uk

Machine Learning for the Analysis of a Large Collection of Music Scales (Tillman Weyde)

The representation (interval in cents relative to the root) is perhaps not quite the most helpful; I suggest all intervals present in the scale
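What I have in mind, as a rough sketch: start from the scale as cents above the root and derive every interval between any two scale degrees. Folding intervals into one 1200-cent octave is my assumption, as is the function name.

```python
def all_intervals(scale_cents, octave=1200):
    """Sorted set of intervals (in cents) between every ordered pair of scale degrees."""
    return sorted({(b - a) % octave
                   for a in scale_cents
                   for b in scale_cents
                   if a != b})


# toy three-note "scale": root, 400 cents, 700 cents
print(all_intervals([0, 400, 700]))
```

The root-relative view only lists 400 and 700; the all-intervals view also exposes the 300-cent step between the upper two notes, plus the octave complements.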

OpenML (Joaquin Vanschoren)

Galileo, Royal Society, Galaxy Zoo

“Data is the new soil”

increase visibility, findability, convenience, credit