"Machine Learning with Big Data
using Apache Spark" was presented to the Lansing Big Data and Hadoop User Group by Muk Agaram and Amit Singh on 3/31/2015. It covers the basics of machine learning and demonstrates a use case: predicting recession with Apache Spark using the Logistic Regression, SVM, and Random Forest algorithms.
3. What is Machine Learning
• Branch of AI
– Alan Turing – “Can machines think?”
– “Field of study that gives computers the ability to learn without
being explicitly programmed” – Arthur Samuel
• Learn from data
• Improve with experience
• Iteratively refine a model that can be used to predict outcomes of
questions based on previous learning
4. Types of Machine Learning
Machine Learning
• Supervised
– Regression (e.g., interest rate prediction)
– Classification (e.g., spam/no-spam)
• Unsupervised
– Clustering (e.g., social network analysis)
• Recommender Systems
– Collaborative Filtering (e.g., Netflix recommendations)
5. Prevalent Use Cases
• Spam Detection – Google Gmail
• Voice Recognition – Apple Siri
• Stock Trading
–High Frequency
–Recommendation Systems
–Algorithmic Trading
• Robotics
–Acquire skills – grasping objects, locomotion, automated driving and navigation
• Medicine and Healthcare
–Healthcare Analytics, Prediction based on Genomes, Health sensor analysis
• Advertising:
–Targeted Advertising based on interests and social media
• Retail and E-Commerce:
–Loyalty and frequency club cards, targeted coupons and promotions
–Recommendation Engines
6. Use Cases
• Gaming Analytics
– Predictive Analytics for Sports Games, Console based gaming profiles, upsell and
targeting in-app purchases and mods
• Internet of Things
– Large scale sensor data analysis for prediction, classification
• Social Network Analysis
– Facebook, LinkedIn
• Astronomy
– Galaxy formation
7. Languages and Platforms
• Apache Spark – MLlib
– Scala, Java, Python
• Mahout
• Python Libraries
– Scikit-learn, PyML, PyBrain, matplotlib
• R
– Open Source statistical programming language
• Matlab
• SAS
• Weka
• Octave
• Clojure
13. Model Design
• Data Collection
–Identify key inputs to the model
• Data Transformation and Curation
–A human ‘analyst’ should be able to view the data sets and make predictions
–Data needs to be cleaned, scrubbed, transformed – normalized
–Generally the most important step for any type of supervised learning
algorithm
• Review the data
• Visually make predictions of individual learning indicators
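The normalization step described above can be sketched in plain Python (a minimal illustration with made-up sample values; MLlib’s feature transformations provide scalers that do this at scale):

```python
def min_max_normalize(values):
    """Rescale a feature column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalize(values):
    """Center a feature column to mean 0 and unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0           # guard against a zero-variance column
    return [(v - mean) / std for v in values]

# Example: a raw economic-indicator style column
raw = [96.0, 100.0, 104.0, 98.0]
print(min_max_normalize(raw))   # [0.0, 0.5, 1.0, 0.25]
```

Either transform puts differently scaled indicators on a common footing before training.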
14. Sample Econometrics Model
• Objective: Predict Economic Conditions (Growth/Recession)
• Supervised Learning
• Widely used algorithms
• Logistic Regression
• SVM
• Random Forest (ensembles of decision trees)
• Current Challenges
– Forecasts are either too early – 6-12 months before a contraction starts
– Or too late – reported by the NBER (National Bureau of Economic
Research) after the recession has already started
– Individuals and corporations cannot plan effectively based on prevailing economic
conditions
15. Sample Econometrics Model
• Use ‘leading indicators’ for economic health
–Treasury Yield Curve between 10 year and 3 month (T10Y3M)
–Industrial Production (INDPRO)
–Unemployment insurance claims
–Market Returns – S&P500
22. Model Training and Testing
• Model data should be divided into three sets:
–Training set
–Cross-validation set
–Test set
• Splitting into these three sets helps the model perform better in the
real world: it exposes bias and variance problems and helps get the
model closer to optimal results
• More features does not necessarily mean a better prediction
• MLlib provides API to help with these operations
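A minimal sketch of the three-way split in plain Python (MLlib’s randomSplit() does this on RDDs; the 60/20/20 ratios here are an assumption for illustration):

```python
import random

def train_cv_test_split(rows, seed=42, ratios=(0.6, 0.2, 0.2)):
    """Shuffle rows, then cut them into training / cross-validation / test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n = len(rows)
    n_train = int(ratios[0] * n)
    n_cv = int(ratios[1] * n)
    train = rows[:n_train]
    cv = rows[n_train:n_train + n_cv]
    test = rows[n_train + n_cv:]
    return train, cv, test

train, cv, test = train_cv_test_split(range(100))
print(len(train), len(cv), len(test))   # 60 20 20
```

Tuning decisions (feature choice, regularization) are made against the cross-validation set, and the test set is touched only once, for the final performance number.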
24. Model Performance Measurements
• Precision
Of the examples our algorithm predicted positive, what fraction
actually are positive?
= true positives / # predicted positive
= true positives / (true positives + false positives)
High precision is good (i.e. closer to 1)
You want a big number, because you want false positives
to be as close to 0 as possible
25. Model Performance Measurements
• Recall
How sensitive is our algorithm?
Of all patients in the set that actually have cancer, what
fraction did we correctly detect?
= true positives / # actual positives
= true positives / (true positives + false negatives)
• High recall is good (i.e. closer to 1)
• You want a big number, because you want false negatives to
be as close to 0 as possible
26. Model Performance Measurements
F1 Score (F-score)
• F1 = 2PR / (P + R)
• The F-score is the harmonic mean of precision and recall: it behaves
like an average that gives more weight to the lower of the two values
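The three measures above fit in a few lines of Python (the confusion-matrix counts below are made-up):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 8 true positives, 2 false positives, 4 false negatives
p, r, f = precision_recall_f1(8, 2, 4)
print(p, r, f)   # 0.8, 0.666..., 0.727... (F1 sits nearer the lower value)
```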
27. Model Results and Code Review
• Logistic Regression
• SVM
• Random Forest
28. Apache Spark MLlib
• Algorithms Supported:
–Linear SVM
–Logistic Regression SGD
–Classification and Regression Tree
–K-Means Clustering
–Recommendation via Alternating Least Squares (ALS)
–Singular Value Decomposition
–Linear Regression with L1 and L2 Regularization
–Multinomial Naïve Bayes
–Basic Statistics
–Feature Transformations
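To show what the SGD-based logistic regression in the list above does under the hood, here is a toy plain-Python version (the data and function names are ours; MLlib’s LogisticRegressionWithSGD runs the same idea in parallel over RDDs):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_sgd(data, lr=0.5, steps=400, seed=0):
    """Toy logistic regression trained with stochastic gradient descent.

    data: list of (features, label) pairs with label in {0, 1}.
    """
    rng = random.Random(seed)
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        x, y = data[rng.randrange(len(data))]   # pick one example per step
        pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = pred - y                          # gradient of the log-loss
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err
    return w, b

# Toy separable data: label is 1 when the single feature exceeds 1
data = [([0.0], 0), ([0.5], 0), ([1.5], 1), ([2.0], 1)]
w, b = logistic_sgd(data)
print(sigmoid(w[0] * 2.0 + b) > 0.5)   # x = 2.0 classified as positive: True
```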
29. Unsupervised Learning
• K-Means Clustering
– Customer Segmentation
– Social Network Analysis
– Computer Data Center Analysis
– Astronomical Galaxy formations
• Recommendation Engines
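The K-Means use cases above all reduce to the same two-step loop; here is a minimal plain-Python sketch on made-up 2-D points (MLlib’s KMeans does this distributed, with smarter k-means|| seeding, whereas this toy version naively seeds from the first k points):

```python
def kmeans(points, k, iters=20):
    """Plain-Python K-Means on 2-D points (list of (x, y) tuples)."""
    centers = list(points[:k])        # naive seeding from the first k points
    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two obvious "customer segments"
pts = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(sorted(kmeans(pts, 2)))   # one center near (0.1, 0.1), one near (5.0, 5.0)
```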
31. Recommender Systems
• Class of information filtering system that
predicts the ‘rating’ or ‘preference’ user
would give to an item
• Examples:
– Netflix
– Amazon
– Apple Genius
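A toy sketch of such a recommender, using user-user cosine similarity over a made-up ratings table (MLlib instead factorizes the ratings matrix with ALS, but the goal of predicting a missing rating is the same):

```python
import math

# Made-up user -> item ratings; illustrative only.
ratings = {
    "ann": {"MovieA": 5, "MovieB": 4},
    "bob": {"MovieA": 4, "MovieB": 5, "MovieC": 2},
    "cat": {"MovieA": 1, "MovieB": 2, "MovieC": 5},
}

def cosine(u, v):
    """Cosine similarity over the items two users rated in common."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        s = cosine(ratings[user], theirs)
        num += s * theirs[item]
        den += abs(s)
    return num / den if den else 0.0

# ann has not rated MovieC; blend bob's 2 and cat's 5 by similarity to ann
print(round(predict("ann", "MovieC"), 2))   # 3.45
```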