Presentations

There were some presentations at LondonR 2014, and I took some notes.

validR (Chris Campbell, Mango Solutions)

Mango and ValidR

“hard programming languages” – Java, C#

UIs, training, consulting, code, ... , validR

“turning academic code into useful code”

validR means: R, with the packages you use, supported, validated, compliant with regulatory guidelines.

Validation

“Establishing documented evidence which provides a high degree of assurance that a specific process will consistently produce a product, meeting its predetermined specifications and quality attributes”

CFR Part 11

  • Define intended use
  • specify tolerance measures
  • test it, within and without tolerance
  • document
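The four steps above can be sketched in base R. This is only an illustration, not Mango's validation process: geo_mean, the tolerance value and the evidence table are all made up for the example.

```r
# Hypothetical function under validation: geometric mean of positive numbers.
geo_mean <- function(x) exp(mean(log(x)))

# Steps 1 and 2: intended use is positive numeric input; the result must
# match a known answer to within this (assumed) tolerance.
tolerance <- 1e-8

# TRUE when `actual` is within `tol` of `expected`.
within_tolerance <- function(actual, expected, tol = tolerance) {
  abs(actual - expected) <= tol
}

# Step 3a: a known answer, sqrt(2 * 8) = 4, must be accepted.
inside <- within_tolerance(geo_mean(c(2, 8)), 4)

# Step 3b: a deliberately perturbed answer must be rejected,
# which shows the check can actually fail.
outside <- within_tolerance(4 + 1e-6, 4)

# Step 4: keep the outcome as a small table that can go into a report.
evidence <- data.frame(
  test = c("known answer accepted", "perturbed answer rejected"),
  passed = c(inside, !outside),
  stringsAsFactors = FALSE
)
```

The evidence data frame is the sort of thing that could be dropped into a knitr report alongside reviewer comments.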

Requirements / functionMap

help files

vignettes!

functionMap

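library(functionMap)
# Parse every R source file in the package's R/ folder
prsVT <- parseRfolder("visualTest/R")
# Build the network of which function calls which
nVT <- createNetwork(prsVT)
plot(...)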

Testing and testCoverage

knitr: automated + reviewer comments for testing

write coverage reports

Testing for Graphics with visualTest

how to compare rendered outputs?

  • file size
  • file identity
  • pixel values
  • image summaries
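A rough base R sketch of the four comparisons listed above, to make the trade-offs concrete. The two "renderings" are made-up pixel matrices written to temporary files, not real plots, and none of this is taken from visualTest.

```r
# Two stand-in renderings: small greyscale pixel matrices with values in [0, 1].
img_a <- matrix(seq(0, 1, length.out = 64), nrow = 8)
img_b <- img_a
img_b[1, 1] <- img_b[1, 1] + 0.01   # one slightly different pixel

# Write both to temporary files so the file-level checks have something to read.
path_a <- tempfile(fileext = ".bin")
path_b <- tempfile(fileext = ".bin")
writeBin(as.vector(img_a), path_a)
writeBin(as.vector(img_b), path_b)

# File size: cheap, but blind to small changes.
same_size <- file.size(path_a) == file.size(path_b)

# File identity: a checksum flags any change at all, however small.
same_file <- unname(tools::md5sum(path_a) == tools::md5sum(path_b))

# Pixel values: exact comparison of the decoded image.
same_pixels <- identical(img_a, img_b)

# Image summary: a single number that tolerates tiny differences.
mean_diff <- mean(abs(img_a - img_b))
```

Here the sizes match while the checksum and the exact pixel comparison both differ, and the summary statistic stays tiny. That gap is what motivates a fuzzier comparison.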

Image fingerprint and fuzziness, based on the Fourier transform. Why not SIFT? They have thought about wavelets, so OK.
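To make the idea concrete, here is a minimal base R sketch of a Fourier-based fingerprint with a fuzziness threshold. It is not visualTest's actual algorithm: the function names, the choice of the k lowest frequencies and the fuzz value are all assumptions for illustration.

```r
# Fingerprint: magnitudes of the lowest-frequency 2-D Fourier coefficients,
# normalised to sum to 1 so overall brightness scaling does not matter.
fingerprint <- function(img, k = 4) {
  spec <- Mod(fft(img))                 # fft() on a matrix is a 2-D transform
  low <- spec[seq_len(k), seq_len(k)]   # keep only the coarse structure
  as.vector(low) / sum(low)
}

# Euclidean distance between two fingerprints.
fingerprint_distance <- function(a, b, k = 4) {
  sqrt(sum((fingerprint(a, k) - fingerprint(b, k))^2))
}

# "Fuzziness": images count as the same if their fingerprints are close enough.
is_similar <- function(a, b, fuzz = 0.05, k = 4) {
  fingerprint_distance(a, b, k) <= fuzz
}

# Stand-in images: horizontal stripes, the same with slight noise,
# and vertical stripes.
n <- 16
base_img <- outer(0:(n - 1), 0:(n - 1), function(i, j) 1 + cos(2 * pi * i / n))
set.seed(1)
noisy_img <- base_img + rnorm(n * n, sd = 0.01)
other_img <- outer(0:(n - 1), 0:(n - 1), function(i, j) 1 + cos(2 * pi * j / n))

noisy_matches <- is_similar(base_img, noisy_img)
other_matches <- is_similar(base_img, other_img)
```

The noisy copy should still match the original, while the rotated stripes should not. Discarding the phase and the high frequencies is what buys tolerance to small rendering differences, at the cost of missing fine detail.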

supervised / unsupervised learning in churn & fraud (Ana Costa e Silva, Tibco)

Spotfire.

Tower of big and fast data

hundreds (KPIs) -> millions (visual data discovery) -> billions (Big data / data mining) -> trillions (Fast data, real time)

TIBCO Enterprise Runtime for R (TERR) community edition

engine improvements:

  • data object representation
  • memory management

much faster (7-80x)

Same language.

in-database / in-Hadoop

integrates with RStudio

demo! Predictive Analytics for Cross-Sell Revenue Maximization. Gift cards... nice revenue stream, often unused. Offer the “opportunity” to buy a gift card. Web-based interface.

TIBCO's approach to Hadoop:

  • fit interface to user skills

answer to “embrace and extend” question was disturbingly unclear between syntax and semantics. Ah, second question pushes harder.

High-quality maps with R and ggplot (Simon Hailstone, Royal Free NHS)

introduced with “it’s no lattice”. A couple of other NHS R users in the audience...

intro

  • automation
  • business objects reporting system
    • limited charting
    • no maps

http://flowingdata.com/2009/11/12/how-to-make-a-us-county-thematic-map-using-free-tools/

http://www.thisisthegreenroom.com/2009/choropleths-in-r/

where to get data

http://data.london.gov.uk/datastore and Office for National Statistics

MSOA / LSOA: really easy matching to UK geography

census Output Areas; middle-layer (15k) or lower-layer (smaller) super output areas

find something interesting, e.g. ambulance service incidents + binge drinking + assault + deprivation + population

also A&E departments and sizes http://www.england.nhs.uk/statistics

where to get shapefiles

widely-used file format for geographical features; vector-based, points/polylines/polygons

  • http://geoportal.statistics.gov.uk (ONS)
  • http://www.ordnancesurvey.co.uk/business-and-government/
  • http://naturalearthdata.com
  • http://openstreetmap.org

getting them into R: maptools package with readShapeSpatial; rgeos package with gSimplify; filtering might also be necessary.
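A sketch of how those three steps (read, filter, simplify) might fit together. The function name, the path argument and the code_column / keep_codes arguments are hypothetical. Note also that maptools and rgeos were retired from CRAN in 2023, so this assumes an installation where they are still available.

```r
# Sketch only: read a shapefile, optionally keep a subset of areas,
# then simplify the boundaries so later plotting is faster.
# Defining the function does not load maptools or rgeos; calling it does.
load_simplified <- function(path, tol = 50,
                            code_column = NULL, keep_codes = NULL) {
  shp <- maptools::readShapeSpatial(path)

  # Filtering: keep only the areas of interest, matched on an attribute
  # column (the column name depends on the shapefile, hence hypothetical).
  if (!is.null(code_column) && !is.null(keep_codes)) {
    shp <- shp[shp@data[[code_column]] %in% keep_codes, ]
  }

  # tol is in the units of the shapefile's coordinates.
  # gSimplify returns geometry only; the attribute table is dropped.
  rgeos::gSimplify(shp, tol = tol, topologyPreserve = TRUE)
}
```

A larger tol gives coarser outlines and smaller objects; the right value depends on the coordinate units and the scale of the final map.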

how to geocode the easy way

geocoding: adding geographic information to data

usually involves adding postcodes. Bit of a pain. Maintaining a postcode database is tedious.

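library("ggmap")
# Build a free-text address for each A&E department
AAE$Address <- paste0(AAE$Name, ",LONDON,UK")
# Look up coordinates; geocode() returns a data frame with lon and lat columns
coords <- geocode(AAE$Address)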

2000 records a day or so.
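With a daily limit of that order, a larger address list has to be geocoded in batches. A minimal base R sketch of the batching, using a made-up AAE data frame and taking 2000 as the limit; the geocode() call itself is left commented out because it needs network access.

```r
# Hypothetical stand-in for the AAE data frame from the notes.
AAE <- data.frame(Name = paste("Hospital", 1:4500), stringsAsFactors = FALSE)
AAE$Address <- paste0(AAE$Name, ",LONDON,UK")

# Approximate daily quota mentioned in the talk.
daily_limit <- 2000

# Assign each address to a day: 1 for the first 2000, 2 for the next, etc.
batch_id <- ceiling(seq_along(AAE$Address) / daily_limit)
batches <- split(AAE$Address, batch_id)

# One batch would then be sent per day, e.g.
# coords_day1 <- ggmap::geocode(batches[[1]])
batch_sizes <- lengths(batches)
```

For 4500 addresses this gives three batches, the last one only part full.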

how to combine all of this in ggplot

fortify: converts spatial data into a data frame; time-consuming

  • CCG borders (health areas)

geom_polygon plots shapefiles; coord_map for projection; theme_bw to remove graphical elements. Lots of extra element_blank().

  • use strokes! Cairo, for anti-aliasing.
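A minimal sketch of how those pieces combine. It does not reproduce the speaker's code: the areas data frame is a made-up stand-in for fortify() output (one row per polygon vertex), and the value column is invented for shading.

```r
library(ggplot2)

# Hypothetical stand-in for fortify() output: two square "areas",
# one row per vertex, with a group column identifying each polygon.
areas <- data.frame(
  long  = c(-0.20, -0.10, -0.10, -0.20, -0.10, 0.00, 0.00, -0.10),
  lat   = c(51.50, 51.50, 51.55, 51.55, 51.50, 51.50, 51.55, 51.55),
  group = rep(c("area_a", "area_b"), each = 4),
  value = rep(c(12, 30), each = 4)   # made-up measure to shade by
)

p <- ggplot(areas, aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(colour = "white") +   # stroke separates neighbouring areas
  coord_map() +                      # projection; needs mapproj when drawn
  theme_bw() +
  theme(                             # strip the non-map graphical elements
    axis.text    = element_blank(),
    axis.ticks   = element_blank(),
    axis.title   = element_blank(),
    panel.grid   = element_blank(),
    panel.border = element_blank()
  )
```

Printing p draws the map. For anti-aliased strokes, the plot can be sent to a Cairo-backed device, for example png(..., type = "cairo") where that device is available.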

pros and cons

  • pros

    • reusable
    • shareable
    • transparent code
    • flexible
    • precise control
    • nice output
  • cons

    • labels, text formatting
    • processing time
    • not as user-friendly for single bits of analysis (QGIS wins)