Presentations
Notes from the presentations at LondonR 2014.
validR (Chris Campbell, Mango Solutions)
Mango and ValidR
“hard programming languages” – Java, C#
UIs, training, consulting, code, ... , validR
“turning academic code into useful code”
validR means: R, with the packages you use, supported, validated, compliant with regulatory guidelines.
Validation
“Establishing documented evidence which provides a high degree of assurance that a specific process will consistently produce a product, meeting its predetermined specifications and quality attributes”
- Define intended use
- specify tolerance measures
- test it, within and without tolerance
- document
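The four steps above can be sketched in base R. This is only an illustration of the idea, not Mango's validR procedure: the function `rel_change`, the reference value and the tolerance are all hypothetical, and "within and without tolerance" is read here as "a result inside tolerance must pass, a result outside tolerance must be flagged".

```r
# Hypothetical function under validation.
# Intended use: finite numeric input with a non-zero baseline.
rel_change <- function(x, baseline) {
  stopifnot(is.numeric(x), is.numeric(baseline), all(baseline != 0))
  (x - baseline) / baseline
}

# Specified tolerance (an assumed figure, chosen for illustration).
tol <- 1e-8

# Reference value worked out by hand: (125 - 100) / 100.
expected <- 0.25
observed <- rel_change(125, 100)

# Within tolerance: the result must agree with the reference.
within_ok <- isTRUE(all.equal(observed, expected, tolerance = tol))

# Outside tolerance: a deliberately wrong reference must be flagged.
outside_ok <- !isTRUE(all.equal(observed, expected + 1e-3, tolerance = tol))

# Document: keep a record of what was checked and the outcome.
validation_log <- data.frame(
  check  = c("agrees with reference within tolerance",
             "flags deviation outside tolerance"),
  passed = c(within_ok, outside_ok)
)
print(validation_log)
```

In practice the log would be written into a report (the talk mentioned knitr for this) rather than printed.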
Requirements / functionMap
help files
vignettes!
functionMap
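library(functionMap)
# parse the R source files of the visualTest package
prsVT <- parseRfolder("visualTest/R")
# build the function-call network from the parsed code
nVT <- createNetwork(prsVT)
plot(...)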
Testing and testCoverage
knitr: automated + reviewer comments for testing
write coverage reports
Testing for Graphics with visualTest
how to compare rendered outputs?
- file size
- file identity
- pixel values
- image summaries
Image fingerprint and fuzziness, based on the Fourier transform. Why not SIFT? They have thought about wavelets, so OK.
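To make the difference between exact pixel comparison and a fuzzy fingerprint concrete, here is a minimal base R sketch. It is not visualTest's algorithm: the images are invented matrices, and `fingerprint`, `fp_distance`, the block size `k` and the threshold `fuzz` are hypothetical names and values chosen for illustration.

```r
# Stand-ins for rendered plots: greyscale pixel matrices (made-up data).
set.seed(1)
grid  <- seq(0, 1, length.out = 64)
img_a <- outer(grid, grid, function(x, y) (sin(6 * x) + cos(4 * y) + 2) / 4)
img_b <- img_a + rnorm(length(img_a), sd = 0.01)    # same picture, slight noise
img_c <- matrix((sin(8 * pi * grid) + 1) / 2, 64, 64)  # a different, striped picture

# Exact comparison (file identity / pixel values) fails on the noisy copy.
exact_ab <- identical(img_a, img_b)
max_pixel_diff_ab <- max(abs(img_a - img_b))

# Fingerprint: magnitudes of a low-frequency block of the 2-D Fourier
# transform, normalised so that they sum to one.
fingerprint <- function(img, k = 8) {
  mag <- Mod(fft(img))        # fft() on a matrix is the 2-D transform
  low <- mag[1:k, 1:k]        # index 1 is the constant (DC) term
  as.vector(low) / sum(low)
}

# Euclidean distance between two fingerprints.
fp_distance <- function(a, b) {
  sqrt(sum((fingerprint(a) - fingerprint(b))^2))
}

fuzz <- 0.05                  # assumed fuzziness threshold
same_ab <- fp_distance(img_a, img_b) < fuzz   # noisy copy vs original
same_ac <- fp_distance(img_a, img_c) < fuzz   # different picture vs original
```

Small rendering noise barely moves the low-frequency magnitudes, while a structurally different image changes them a lot, which is what makes a threshold on the distance usable.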
Supervised / unsupervised learning in churn & fraud (Ana Costa e Silva, TIBCO)
Spotfire.
Tower of big and fast data
hundreds (KPIs) -> millions (visual data discovery) -> billions (Big data / data mining) -> trillions (Fast data, real time)
TIBCO Enterprise Runtime for R (TERR) community edition
engine improvements:
- data object representation
- memory management
much faster (7-80x)
Same language.
in-database / in-Hadoop
integrates with RStudio
demo! Predictive Analytics for Cross-Sell Revenue Maximization. Gift cards... nice revenue stream, often unused. Offer the “opportunity” to buy a gift card. Web-based interface.
TIBCO's approach to Hadoop:
- fit interface to user skills
The answer to the “embrace and extend” question was disturbingly unclear on syntax versus semantics. Ah, the second question pushed harder.
High-quality maps with R and ggplot (Simon Hailstone, Royal Free NHS)
introduced with “it’s no lattice”. A couple of other NHS R users in the audience...
intro
- automation
- business objects reporting system
- limited charting
- no maps
http://flowingdata.com/2009/11/12/how-to-make-a-us-county-thematic-map-using-free-tools/
http://www.thisisthegreenroom.com/2009/choropleths-in-r/
where to get data
http://data.london.gov.uk/datastore and Office for National Statistics
MSOA / LSOA: really easy matching to UK geography
census Output Areas; middle-layer (15k) or lower-layer (smaller) super output areas
find something interesting, e.g. ambulance service incidents + binge drinking + assault + deprivation + population
also A&E departments and sizes http://www.england.nhs.uk/statistics
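Because these datasets share area codes, combining them is a plain join. A minimal sketch in base R, assuming two tables keyed by LSOA code; the column names are hypothetical and the codes and numbers are placeholders, not real statistics.

```r
# Made-up example tables keyed by LSOA code (placeholder values).
incidents <- data.frame(
  lsoa_code = c("E01000001", "E01000002", "E01000003"),
  ambulance_incidents = c(12, 30, 7)
)
population <- data.frame(
  lsoa_code = c("E01000001", "E01000002", "E01000003"),
  residents = c(1500, 1600, 1400)
)

# Join on the shared area code, then derive a rate per 1,000 residents.
areas <- merge(incidents, population, by = "lsoa_code")
areas$incidents_per_1000 <- 1000 * areas$ambulance_incidents / areas$residents
print(areas)
```

The same key can later be used to attach these values to the polygons of a shapefile.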
where to get shapefiles
widely-used file for geographical features; vector-based, points/polylines/polygons
- http://geoportal.statistics.gov.uk (ONS)
- http://www.ordnancesurvey.co.uk/business-and-government/
- http://naturalearthdata.com
- http://openstreetmap.org
getting them into R: maptools package with readShapeSpatial; rgeos package with gSimplify; filtering might also be necessary.
how to geocode the easy way
geocoding: adding geographic information to data
usually involves adding postcodes. Bit of a pain. Maintaining a postcode database is tedious.
library("ggmap")
AAE$Address <- paste0(AAE$Name, ",LONDON,UK")
geocode(AAE$Address)
Limited to 2000 records a day or so.
how to combine all of this in ggplot
fortify: converts spatial data into a data frame; time consuming
- CCG borders (health areas)
geom_polygon plots shapefiles; coord_map for projection; theme_bw to remove graphical elements. Lots of extra element_blank().
- use strokes! Cairo, for anti-aliasing.
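Putting those pieces together might look like the sketch below. It is not the speaker's code: the data frame `shapes` is an invented stand-in for what fortify returns (one row per polygon vertex plus a grouping column), its coordinates and values are made up, and coord_quickmap is swapped in for coord_map so the example does not depend on the mapproj package.

```r
library(ggplot2)

# Hypothetical stand-in for fortify() output: one row per polygon vertex,
# with a group column identifying each area. All values are invented.
shapes <- data.frame(
  long  = c(-0.20, -0.10, -0.10, -0.20, -0.10, 0.00, 0.00, -0.10),
  lat   = c(51.50, 51.50, 51.55, 51.55, 51.50, 51.50, 51.55, 51.55),
  group = rep(c("area_a", "area_b"), each = 4),
  value = rep(c(10, 25), each = 4)
)

p <- ggplot(shapes, aes(x = long, y = lat, group = group, fill = value)) +
  geom_polygon(colour = "white") +   # stroke along the area borders
  coord_quickmap() +                 # approximate projection (swap for coord_map)
  theme_bw() +
  theme(                             # strip the non-map elements
    axis.text    = element_blank(),
    axis.title   = element_blank(),
    axis.ticks   = element_blank(),
    panel.grid   = element_blank(),
    panel.border = element_blank()
  )
```

For anti-aliased output, something like `ggsave("map.png", p, type = "cairo")` should work where R was built with cairo support; the file name is a placeholder.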
pros and cons
pros
- reusable
- shareable
- transparent code
- flexible
- precise control
- nice output
cons
- labels, text formatting
- processing time
- not as user-friendly for single bits of analysis (QGIS wins)