The KDD conference (#kdd2013), held in Chicago from August 10-14, is one of the premier events in the data mining and big data space. This is illustrated by this year’s record-breaking attendance of 1200+ data scientists (both researchers and practitioners) from academia, industry, and government. It is organized by ACM’s SIGKDD (Special Interest Group on Knowledge Discovery and Data Mining).
This is a personal summary of the event, based on my choice of sessions. All pictures are my own (@dirkvandenpoel). At any time, there were many sessions running in parallel. Let’s take an in-depth look at the event in chronological order.
On Saturday, KDD’s Big Data Camp kicked off the event. About 150 data scientists joined this pre-conference bootcamp. Prof. Dr. Robert Grossman (University of Chicago, Open Data Group, see picture below) opened with an introduction. He emphasized that both R and Python are gaining traction: R dominates the modelling space (top half of the figure below), while Python dominates the IT deployment space (mainly because of R’s restrictive GPL license). He also highlighted OSDC’s public datasets initiative and discussed the life cycle of a predictive model.
Next, Jeffrey Ryan (Lemnica Corporation, see picture below) gave an introduction to high-performance R for large datasets. His main expertise is in the analysis of financial time series in R. He emphasized the performance penalty of very large data.frames and highlighted the use of environments in R (they act like lists). Moreover, he promoted data.table as a fast, efficient alternative to data.frames. He is the author of several R packages (xts for comprehensive time series analysis, mmap for memory-mapped access to files) and a co-organizer of the R/Finance event.
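Much of data.table’s speed comes from keyed (indexed) lookups instead of scanning every row the way a filtered data.frame does. As a rough sketch of that idea in Python (my own illustration with made-up data, not code from the talk), compare a linear scan with a one-time index build:

```python
# Toy rows, standing in for a large table of financial quotes.
rows = [
    {"id": "AAPL", "price": 64.9},
    {"id": "GOOG", "price": 44.4},
    {"id": "MSFT", "price": 31.6},
]

# Linear scan, like filtering an unkeyed data.frame: O(n) per query.
def scan_lookup(rows, key):
    return [r for r in rows if r["id"] == key]

# Keyed lookup, like setting a key on a data.table:
# build the index once, then each query is O(1).
index = {r["id"]: r for r in rows}

print(scan_lookup(rows, "GOOG")[0]["price"])  # 44.4
print(index["GOOG"]["price"])                 # 44.4
```

The payoff is the same in both worlds: pay an indexing cost once, then answer many lookups cheaply.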
Dean Wampler (@deanwampler, Concurrent Thought, see picture below, click here for his slides) was up next with a talk on NoSQL databases. He started out by saying that SQL remains great for ACID transactions (e.g. for updating a bank account you would prefer a relational database). He then discussed hybrid projects such as Hive (he is co-author of the “Programming Hive” book) and NewSQL databases, which combine ACID transactions and the relational model with NoSQL scaling techniques. He covered the following categories of NoSQL databases: column-oriented stores (e.g. Cassandra, HBase, BigTable), document-oriented stores (e.g. MongoDB, Couchbase), key-value (tuple) stores (e.g. Amazon SimpleDB, MemcacheDB), eventually-consistent key-value stores, and graph stores (e.g. Neo4j, Titan), as well as some interesting new approaches such as Datomic (no destructive updates, so all previous history is retained and usable).
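To make the key-value category concrete, here is a minimal in-memory sketch in Python (my own illustration of the data model, not any particular product; real stores like MemcacheDB add networking, persistence, and replication):

```python
class KeyValueStore:
    """Toy key-value (tuple) store: opaque values addressed only by key.

    Unlike a relational database, there is no schema and no joins --
    the only operations are put, get, and delete by key, which is
    exactly what makes this model easy to partition and scale out.
    """

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "cart": ["book"]})
print(store.get("user:42"))
```

The trade-off Wampler stressed follows directly from this interface: you give up cross-record transactions and ad-hoc queries in exchange for horizontal scalability.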
After lunch, Q Ethan McCallum (@qethanm) talked about “R & Hadoop: Getting R to dance with the elephant”. He is co-author of the “Parallel R” book. He briefly discussed Segue, RHIPE, RHadoop (rmr), and R+Hadoop as approaches to doing big data with R. This was followed by a joint presentation by Collin Bennett and Jim Pivarski (Open Data Group, see picture below), titled “Building and Deploying Predictive Models Using Python, Augustus and PMML”. They illustrated the use of Augustus with lots of demos. Great content!
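The R-on-Hadoop approaches McCallum mentioned all ultimately express computations as MapReduce jobs. A minimal word-count sketch of the map/shuffle/reduce pattern in plain Python (my own illustration of the pattern itself, not any of those packages’ APIs):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, like a Hadoop mapper.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word, like a Hadoop reducer.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data mining"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Tools like rmr let you write the mapper and reducer in R while Hadoop handles the shuffle and the distribution across the cluster.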
To wrap up the day (and the Big Data Camp), Prof. Dr. Andrew Johnson (University of Illinois at Chicago, see picture below) talked about data visualization in the big data era. He gave plenty of examples of the good, the bad and the ugly…