Introduction to Mining Big Data with Map Reduce

This short course will familiarize attendees with Map Reduce and show how it can be applied to data mining problems. The course will provide an understanding of Map Reduce and of general techniques for arranging data mining algorithms into the Map Reduce framework. We'll reduce several data mining algorithms to mapper and reducer code in Python, and we'll demo this code in class, running on mrjob.


Course Outline


I.  Basics of Map Reduce (MR)


Map Reduce is a programming model for processing large data sets, and it lends itself to parallel implementations. Many map functions run concurrently over the data, producing intermediate key/value pairs. A reduce function then merges all the intermediate values associated with the same intermediate key to produce a combined result.
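The map, group-by-key, and reduce flow described above can be sketched in a few lines of plain Python. This is a single-process stand-in for illustration only, not the mrjob code used in the course; the names map_fn, reduce_fn, and run_map_reduce and the two sample lines are hypothetical.

```python
from collections import defaultdict

def map_fn(line):
    """Mapper: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(key, values):
    """Reducer: merge all values that share a key into one combined result."""
    return key, sum(values)

def run_map_reduce(records, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                 # in a real runtime, mappers run in parallel
        for key, value in mapper(record):
            groups[key].append(value)      # the "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in groups.items())

lines = ["big data big ideas", "map reduce big data"]
counts = run_map_reduce(lines, map_fn, reduce_fn)
print(counts)
```

Because each mapper call looks at one record in isolation and each reducer call looks at one key in isolation, both phases can be spread across machines without changing the result.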


II. General Scheme for Data Mining with MR 

Algorithms that can be written in Statistical Query and Summation Form are easily adapted to the MR model: the algorithm touches the data only through sums over data points, so each mapper can compute a partial sum over its share of the data and a reducer can add the partial sums together. Examples of algorithms that can naturally be expressed in this form include Locally Weighted Linear Regression, Gaussian Discriminant Analysis, Expectation Maximization, Support Vector Machines, K-Means, Logistic Regression, and both Principal and Independent Component Analysis.
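As a small illustration of the summation form, the sketch below fits a one-variable least-squares line. Ordinary linear regression is used here only because its sums are easy to read; it is an assumption of this example, not one of the course's worked algorithms. The function names and the toy chunks are hypothetical.

```python
def map_partial_sums(chunk):
    """Mapper: reduce one chunk of (x, y) pairs to its partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def reduce_sums(partials):
    """Reducer: add the partial sums from all mappers, component by component."""
    return tuple(sum(column) for column in zip(*partials))

def fit_line(stats):
    """Solve the least-squares line from the global sums alone."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Toy data lying on y = 2x + 1, split into two chunks as if on two machines.
chunks = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = fit_line(reduce_sums(map_partial_sums(c) for c in chunks))
print(slope, intercept)
```

The final solve never sees an individual data point, only five numbers, which is what makes this family of algorithms a natural fit for MR.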

III. Calculate Average 

This section details how to use the MR model to calculate an average over a large data set.
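One common way to arrange an average for MR, sketched here under the assumption that the data arrives in unequal chunks, is to have each mapper emit a (sum, count) pair rather than a local average. Averaging the local averages would weight small chunks too heavily. The names map_chunk and reduce_mean and the sample chunks are hypothetical.

```python
def map_chunk(values):
    """Mapper: summarize one chunk as a (sum, count) pair."""
    return (sum(values), len(values))

def reduce_mean(pairs):
    """Reducer: add up sums and counts, then divide once at the end."""
    total = 0.0
    count = 0
    for partial_sum, partial_count in pairs:
        total += partial_sum
        count += partial_count
    return total / count

chunks = [[1, 2, 3], [10], [4, 5]]
mean = reduce_mean(map_chunk(c) for c in chunks)

# For comparison: the naive average of per-chunk averages gives a different number.
naive = sum(sum(c) / len(c) for c in chunks) / len(chunks)
print(mean, naive)
```

Since (sum, count) pairs can be added in any order and in any grouping, the same addition can also run as an intermediate combining step before the final reducer.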

IV. Clustering 

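Canopy Clustering is an inexpensive pre-clustering step that groups points into overlapping canopies, and it is used to reduce required processing time. K-Means Clustering can be implemented in conjunction with Canopy Clustering, comparing points only with centers that share a canopy, to reduce the number of necessary operations.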
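A minimal single-machine sketch of canopy selection follows. It assumes the usual two-threshold formulation (a loose threshold t1 and a tight threshold t2, with t1 > t2) and uses Euclidean distance for simplicity; the function name, thresholds, and sample points are hypothetical, and the course's MR version may be organized differently.

```python
import math

def build_canopies(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every point within the loose threshold joins this canopy,
        # so a point may belong to more than one canopy.
        members = [p for p in points if math.dist(center, p) < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer become centers.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return canopies

points = [(0.0, 0.0), (0.5, 0.0), (1.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
canopies = build_canopies(points, t1=3.0, t2=1.0)
centers = [center for center, _ in canopies]
print(centers)
```

A K-Means pass that follows can skip any point-to-center distance where the point and the center do not share a canopy, which is where the savings come from.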

V. Supervised Learning

Two algorithms will be addressed. Elastic Net is linear regression with both L1 and L2 penalties added; an implementation based on the glmnet algorithm will be presented. The PEGASOS algorithm for training a Support Vector Machine will also be addressed.
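For orientation, here is a minimal single-machine sketch of the basic PEGASOS update: stochastic sub-gradient steps on the L2-regularized hinge loss with step size 1/(lam * t). It omits the optional projection step and does not show how the course distributes the work over mappers and reducers. The function names, the default lam and steps values, and the toy data are assumptions of this example.

```python
import random

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def pegasos(data, lam=0.1, steps=200, seed=0):
    """Train a linear SVM (no bias term) on (features, label) pairs, label in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for t in range(1, steps + 1):
        x, y = rng.choice(data)            # one random example per step
        eta = 1.0 / (lam * t)              # step size shrinks as 1 / (lam * t)
        violated = y * dot(w, x) < 1.0     # margin check uses the current w
        shrink = 1.0 - 1.0 / t             # equals 1 - eta * lam
        w = [shrink * wj for wj in w]      # regularization part of the step
        if violated:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]  # hinge-loss part
    return w

def predict(w, x):
    """Classify by the sign of the inner product."""
    return 1 if dot(w, x) >= 0 else -1

# Toy linearly separable data: the label follows the sign of the first feature.
data = [((2.0, 1.0), 1), ((3.0, -1.0), 1), ((-2.0, -1.0), -1), ((-3.0, 1.0), -1)]
w = pegasos(data)
print(w, [predict(w, x) for x, _ in data])
```

Each step touches a single example, so the per-step cost does not grow with the size of the data set.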