ECDA 2014
Stansted at 6am: not a nice experience.
Wednesday 2nd July
Made it! Found accommodation, left my things, back in time for the first keynote.
Do It Yourself: Exploratory Analysis of Very Large Scientific Data (Themis Palpanas, Paris Descartes)
Data series...
time:
- wind
- stock quotes
- GPS trajectories
- motion capture/gesture (difference between pointing a finger and a gun)
angle:
- paleontology (similarity over angle): “efficient similarity search”
- astronomy (luminosity)
position:
- text analysis
- genome data
tasks:
- classification
- clustering
- outlier detection
- frequent pattern mining
all depend on similarity search => need for indexing
- TS-tree [Assent et al '08] / TStree [XX '13]
- iSax [Shieh & Keogh '08]: average, alphabet (with breakpoints) (building index takes too long: 500 million data series takes 20 days)
- iSax2 and iSax2+ ICDM'10 KAIS'14: scales to 1-2 billion series
- Adaptive Data Series Index: ADS+: build index on-demand: 7x faster for 100,000 queries
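To make the "average, alphabet (with breakpoints)" idea concrete for myself, here is a minimal sketch of the plain SAX representation that iSax builds on: z-normalise, average over equal-length segments, then map each average to a symbol via breakpoints. The function name `sax`, the 4-symbol alphabet and the toy series are my own assumptions, not anything shown in the talk, and the indexing layer itself is not sketched.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Breakpoints cutting a standard normal into 4 equiprobable regions
# (assumed alphabet size of 4; values are the normal quartiles).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def sax(series, n_segments, breakpoints=BREAKPOINTS):
    """Return a SAX word as a list of symbol indices (0 = lowest region)."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalise; a constant series maps to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # piecewise averages; assumes len(series) is divisible by n_segments
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]
    # each segment average becomes the index of the region it falls into
    return [bisect_right(breakpoints, v) for v in paa]

word = sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)
print(word)
```

A steadily rising series ends up as a steadily rising word, which is the kind of coarse summary that can be compared cheaply before touching the raw data.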
Exploratory!
Future: big sequence management system:
- physical and logical data independence
- rich, declarative query language
- query optimization
- inherent scalability
- support for uncertain data series
Acknowledgments to Shieh & Keogh
Solving the Identity Crisis: Large-Scale Clustering with Distributions of Distances and Applications in Record Linkage (Rebecca Nugent with Samuel L. Ventura)
NSF Census Research Network: methodology for census (and reducing cost!)
- capture/recapture
- census and surveys
- Confidentiality, privacy, online self-disclosure
- record linkage
- small area estimation
http://www.stat.cmu.edu/NCRN
patents: database includes
- inventor names/locations
- assignee names/locations
patents as proxy for research, *sigh*
David Miller: deduplication of inventors
everyone assumes that deduplicated / disambiguated dataset is correct
- quantify similarity of record pair
- find the probability that each record-pair is a match
- hierarchical clustering to identify unique entities
- sequential blocking
(point 2. closely related to deduplication)
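A toy sketch of the pipeline in the list above, mostly to fix the shape of it in my head. Everything here is my own stand-in: `difflib` string similarity replaces the talk's similarity measures, a fixed threshold replaces the estimated match probability, single linkage (via union-find) replaces whatever clustering was actually used, and blocking is left out. The names and records are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Stand-in pairwise similarity in [0, 1] between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def single_linkage(records, threshold):
    """Group records whose pairwise similarity chains above the threshold."""
    parent = list(range(len(records)))

    def find(i):
        # follow parent pointers to the cluster root, compressing the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        # a real system would use a fitted match probability here
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return sorted(clusters.values())

records = ["John Smith", "Jon Smith", "Mary Jones", "Mary Jones"]
clusters = single_linkage(records, threshold=0.85)
print(clusters)
```

Lowering the threshold lumps more records together and raising it splits them, which is exactly where the quality of the deduplicated dataset stops being something one can just assume.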
De Groot distributions
splitting / lumping
Syrian death lists and matching records
Fast Model Based Optimization of Tone Onset Detection by Instance Sampling (Nadja Bauer with Klaus Friedrichs, Bernd Bischl and Claus Weihs)
(optimization of hearing aids, transcript)
model selection on a subset before full evaluation
Model-Based Optimization vs fast Model-Based Optimization
Recognition of Leitmotives in Wagner (David Baker with Daniel Müllensiefen et al.)
not only skin conductance but also ECG
The surprising character of music (Denis Amelynck with Pieter-Jan Maes, Marc Leman, someone else)
surprise → embodiment (people's motion)
Dirichlet process model
small clusters... but is there something different between small clusters and “errors”? Duration threshold
directograms (are people dancing similarly) generating structural fingerprints.
Hm, the “surprise” thing applies to the clustering, but the directogram comes from the
Interactive learning for dataset demarcation (Tillman Weyde and others)
audio (Benetos 2013) turns out not to help at all
Panel Discussion on future of publication
- What are your thoughts on publication?
- career options of young people? publish or perish. If it's not A+ we don't tenure you. Rethink conference proceedings?
- how can we make attending the conference worthwhile, from the publication point of view?
- can an open-access journal have guaranteed visibility? Also, quality?
- visibility: link structure of the 'net
Discussion:
- long-term open access
- conference for work in its infancy
why one OA journal? Why not a bundle?
- something like a journal for the members
- algorithms + datasets + etc
- extra long research papers
don't like author pays
where's the added value?
Thursday 3rd July
Oops, missed the first plenary session at 8:00
Visualization and Data Mining for High Dimensional Data (Alfred Inselberg)
Answering questions we did not know how to ask
Come from Geometry
What's a pattern? anything that catches your eye. Trust your eyes.
Nice little application (ParallAX) for data visualization
Effects of Parenthood on Well-Being (Evgenia Smilova and Colin Vance)
Are happy people happy parents? Insights from a quantile regression
literature: negative relationship between parenthood and well-being (Hansen 2012, Stanca 2012, Blanchflower 2008)
take home message: happy people make happy parents (no strong negative effects for people at high well-being percentiles). Women are happier when children aged 19+ leave home; men less so (positive happiness shift at very low happiness quantiles, negative elsewhere)
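Reminder to self on why quantile regression can say different things at different happiness levels: it minimises the "pinball" loss, which weights under- and over-prediction asymmetrically by the quantile level tau. A minimal sketch with a constant-only model; the data and function names are mine, not from the talk.

```python
def pinball_loss(y, pred, tau):
    """Mean pinball (quantile) loss of a constant prediction at level tau."""
    return sum(max(tau * (v - pred), (tau - 1) * (v - pred)) for v in y) / len(y)

def best_constant(y, tau):
    """Constant minimising the pinball loss; a minimiser always lies on a data point."""
    return min(sorted(set(y)), key=lambda c: pinball_loss(y, c, tau))

scores = [1, 2, 3, 4, 100]  # made-up values with one extreme observation
median_fit = best_constant(scores, tau=0.5)
upper_fit = best_constant(scores, tau=0.9)
print(median_fit, upper_fit)
```

With tau = 0.5 the fit sits at the median and ignores the extreme value; with tau = 0.9 it moves to the top of the distribution. A regression version does the same thing with covariates, so effects are allowed to differ across quantiles.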
How health literacy facilitates (J Paech)
what makes people actually do the physical activity they should do / think they should do?
Validation of questionnaires using a pilot trial and the English Longitudinal Study of Ageing (Adi Florea and others from Colchester Hospital and Essex)
N=74 women suffering from Urinary Incontinence
- Symptom Severity Instrument
- Rosenberg self-esteem scale
- Incontinence Quality-of-Life Questionnaire
Student life-style revisited: values, attitudes and behaviour (Thomas Hummel)
Rokeach -- Mitchell Values and Life Style
A Hidden Markov Model to detect relevance in financial documents based on on/off topics (Dimitrios Kampas)
introduces dependence on word order to LDA
On a decision-maker without preferences (Andreas Geyer-Schulz)
causal models vs noise models. Want to reduce number of (noise) models
observed choices reveal preferences (Paul Samuelson)
Missing Data Methods for Big Data Analysis (Dieter William Joenssen)
Need to deal with big data problems (variety, volume, velocity) individually before you can deal with them all together.
Mean imputation not really good enough for missing data
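One quick way to see the problem with mean imputation: filling gaps with the observed mean leaves the mean untouched but shrinks the variance, so anything downstream that depends on spread is biased. Made-up numbers, my own illustration rather than the speaker's.

```python
from statistics import mean, pvariance

observed = [2.0, 4.0, 6.0, 8.0]  # values that were actually recorded
n_missing = 4                    # assumed number of missing entries

# mean imputation: every missing entry becomes the observed mean
imputed = observed + [mean(observed)] * n_missing

var_observed = pvariance(observed)
var_imputed = pvariance(imputed)
print(var_observed, var_imputed)
```

Here half the values are missing and the variance is halved; the imputed points add no squared deviation but do add to the count.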
Big Data Oriented Symbolic Data Analysis in Cloud (Hiroyuki MINAMI, Masahiro MIZUTA)
Hadoop (and not R)
Big Data Analytics vs Classical Data Science (Claus Weihs)
“Big Data”: impossible to exactly solve the learning problem computationally
“Data Science”: generalizable extraction of knowledge from data
particularly: Classification & Regression methods; many variables / many instances
Epistemic uncertainty sampling for active learning on data streams (Ammar Shaker and Eyke Hüllermeier)
aleatoric vs epistemic uncertainty
credal sets / plausibility-possibility: plausibility of 1 everywhere means "I don't know"
epistemic uncertainty sampling should work better than conventional uncertainty sampling
Friday 4th July
Size and Shape in biplots (Michael Greenacre)
is size or shape captured by typical distance measures used in ecology? Bray-Curtis. Raw chi-squared (Greenacre 2010) + Box-Cox does better?
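A small check of the size-versus-shape question for Bray-Curtis, using the standard definition (sum of absolute differences over sum of totals). The two made-up samples have identical composition but one has twice the abundance; function names are mine.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def proportions(x):
    """Rescale a sample to relative abundances (removes 'size')."""
    total = sum(x)
    return [a / total for a in x]

site_a = [10, 20, 30]
site_b = [20, 40, 60]  # same shape as site_a, double the size

raw = bray_curtis(site_a, site_b)
shape_only = bray_curtis(proportions(site_a), proportions(site_b))
print(raw, shape_only)
```

On raw counts the dissimilarity is one third even though the compositions are identical, so the measure is picking up size; on proportions it drops to zero.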
Reviewing Graphical Modelling of multi-variate temporal processes (Matthias Eckhart)
MRF: Lauritzen (1996)
random variables evolving over time
Learning Hierarchical Document Classifications from Recommender Graphs: Modularity Clustering (Fabian Ball with Andreas Geyer-Schulz)
Community detection in graphs: single partition
But what about hierarchical partitions? e.g. linkage graphs of bibliographic data in catalogues. Want to exploit the whole hierarchical structure of graphs
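For reference, the quantity being optimised in modularity clustering, as a minimal sketch of the standard Newman modularity of one flat partition on an undirected, unweighted graph. The toy graph (two triangles joined by a single edge) is my own, and none of the hierarchical machinery from the talk is reproduced.

```python
from collections import defaultdict

def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: c for c, nodes in enumerate(communities) for node in nodes}
    inside = defaultdict(int)  # edges with both ends in the community
    degree = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    # observed fraction of internal edges minus the fraction expected at random
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(q_split, q_single)
```

Splitting at the bridge scores well above the trivial one-community partition, which always scores zero.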
Group theory and S_n
Why the "wrong" minima? They look stable
Comparing audio and playlist features (Igor Vatolkin)
Hm but wait: this is genre classification again. Perhaps not surprising that playlist co-occurrence features can do well and outperform audio features
Big Data (Tillman Weyde)
“we're investigating [...] popular music, because it's ... popular”
dml.city.ac.uk
Machine Learning for the Analysis of a Large Collection of Music Scales (Tillman Weyde)
The representation (interval in cents relative to the root) is perhaps not quite the most helpful; I suggest all intervals present in the scale
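To spell out what I mean by the two representations, a rough sketch: cents relative to the root (1200 times the base-2 log of the frequency ratio) versus the set of all intervals between any two scale degrees, folded into one octave. The function names and the pentatonic example are mine, not from the talk.

```python
from math import log2

def cents(freq, root):
    """Interval from root to freq in cents (1200 cents per octave)."""
    return 1200 * log2(freq / root)

def all_intervals(scale_cents, octave=1200):
    """Every interval between two distinct scale degrees, folded into one octave."""
    return sorted({(b - a) % octave
                   for a in scale_cents for b in scale_cents if a != b})

# major pentatonic in equal temperament, as cents above the root
pentatonic = [0, 200, 400, 700, 900]
intervals = all_intervals(pentatonic)
print(cents(880.0, 440.0), intervals)
```

The root-relative list depends on which degree is called the root, whereas the interval set is the same for every rotation of the scale; here it also shows at a glance that the scale contains no semitone and no tritone.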
OpenML (Joaquin Vanschoren)
Galileo, Royal Society, Galaxy Zoo
“Data is the new soil”
increase visibility, findability, convenience, credit