Introduction to machine learning with Mahout
John Ternent
@jaternent
Orlando Data Science – www.orlandods.com
May 13, 2014
Welcome!
Updates
Social Media
• Facebook.com/orlandodata
• Twitter.com/orlandodata
• LinkedIn
OrlandoDS.com
• Social Network
• Forum
• Articles and Content
• And More
• Send articles to: scott@orlandods.com
Orlando Wiki
• Completely Open
• Aggregate Learning Resources!
• Go NUTS
May 28th Event
• Full Sail, UCF, and Florida Polytechnic
• Submit Your Questions! @orlandodata
Member Survey
• Need n=30!!!
• OrlandoDS.com/member-survey
• OR: find it in our past meetup announcements
Learn Hadoop
• First Class: June 3rd.
• Location: Here
Future Plans
• Establish Non-Profit
• Increase Global Following
• Become Strong Networking and Education Resource for YOU
A (very) little bit about me…
• Consultant (Management & Technology)
• Open Source Evangelist
• Full-spectrum data nerd
A little about you!
• Rate yourself (1 – 10) on Mahout
• Rate yourself (1 – 10) on Machine Learning/Data Mining
• Rate yourself (1 – 10) on Big Data/Hadoop
• Please wait… optimizing presentation…
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
-- Tom M. Mitchell, 1997
Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to … an economic advantage.
-- Ian H. Witten & Eibe Frank, 2005
If you’re in academia, you call it “machine learning.” If you’re in business, you call it “data mining.”
-- Mark Hall
I create or improve general purpose algorithms for machine learning
I use multiple machine learning algorithms for practical data discovery
Source: xkcd (both images)
Machine Learning Uses
Clustering
Classification
Recommendation
Machine Learning Algorithms
• Regression
• K-means Clustering
• K-NN
• CART
• Neural Networks
• Support Vector Machines
• Association Rules
• Principal Component Analysis
• Singular Value Decomposition
• Ensemble Methods
• Naïve Bayes
• …
Real-World Applications
• Recommender Systems
• Image recognition
• Signal Processing
• Propensity to buy/churn
• Fraud analysis
• Text analytics
• Spam filtering
• Forecasting methods
• Revenue management
• …
The Problem … and Opportunity
Big Data™
“If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset.”
- Omar Tawakol
Weka Explorer can handle ~1M instances, 25 attributes (50 MB file)
- Ian Witten
Potential Solutions
• Expand RAM
• Use incremental algorithms
• Use distributable algorithms
Scale Up
Scale Out
Hadoop in 30 seconds
[Diagram: 7 input splits → 4 Map (K,V) tasks → Shuffle / Sort → 3 Reduce tasks → 3 outputs]
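The map, shuffle/sort, and reduce stages can be illustrated with a single-process word count. This is only a sketch of the data flow, not Hadoop code: the function names and the sample splits are invented for illustration, and a real cluster would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (key, value) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_sort(pairs):
    """Shuffle / sort: group all values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into one output record."""
    return key, sum(values)


def word_count(splits):
    # Each split is mapped independently; that independence is what
    # lets the framework spread the work across machines.
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return dict(reduce_phase(k, v) for k, v in shuffle_sort(mapped))


# Hypothetical input splits.
splits = ["big data big", "data mining", "big ideas"]
counts = word_count(splits)
print(counts)
```

Only the shuffle/sort step needs to see pairs from every split, which is why it sits between the two parallel stages in the diagram.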
Finally -- Mahout
• A Java-based library of machine learning algorithms designed to support distributed processing
• Initially on MapReduce, now leaning heavily towards Spark
• Primarily focused on Recommenders, Clustering, and Classification spaces.
Running Mahout
• Locally – download the Mahout distribution.
  bin/mahout is the wrapper script; by default it lists all the example programs available.
  Lots of tools are included to convert data into vector formats and pre-process text; worth a look.
• Amazon EC2
  Configure the stack from scratch on EC2 servers.
• Amazon EMR
  Quicker start: much of the build is already optimized for MapReduce jobs. Just add Mahout as a custom jar and pass the script as a parameter.
Running Recommenders
• Multiple Recommender Algorithms
  User-based
  Item-based
• A Recommender Needs:
  DataModel (e.g. FileDataModel)
  Similarity driver (PearsonCorrelationSimilarity)
  Neighborhood (NearestNUserNeighborhood, ThresholdUserNeighborhood)
  Recommender
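The four pieces a user-based recommender needs can be sketched in plain Python. This is not Mahout's API. The ratings dictionary stands in for the DataModel, `pearson` for the similarity, `nearest_n` for the neighborhood, and `recommend` for the recommender. All names and data are hypothetical, and the similarity-weighted average used to estimate a preference is one common choice, assumed here for illustration.

```python
from math import sqrt

# "DataModel": user -> {item: preference}. Hypothetical data.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
    "dave": {"A": 5.0, "B": 3.0, "C": 4.0, "D": 5.0},
}


def pearson(a, b):
    """Similarity: Pearson correlation over the items both users rated."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def nearest_n(user, n):
    """Neighborhood: the n most similar users with positive similarity."""
    sims = [(pearson(ratings[user], ratings[o]), o) for o in ratings if o != user]
    return sorted((s for s in sims if s[0] > 0), reverse=True)[:n]


def recommend(user, n_neighbors=2, how_many=3):
    """Recommender: score unseen items by similarity-weighted average."""
    totals, weights = {}, {}
    for sim, other in nearest_n(user, n_neighbors):
        for item, pref in ratings[other].items():
            if item in ratings[user]:
                continue  # only recommend items the user has not rated
            totals[item] = totals.get(item, 0.0) + sim * pref
            weights[item] = weights.get(item, 0.0) + sim
    scored = [(item, totals[item] / weights[item]) for item in totals]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:how_many]


recs = recommend("alice")
print(recs)
```

Swapping `nearest_n` for a function that keeps every user above a similarity cutoff would give the threshold-style neighborhood instead.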
Running Recommenders
• Tip: If your data has no preference values, there are Boolean equivalents of the recommender classes
• Evaluate user vs. item similarities
• Example
Clustering Algorithms
• To cluster you need:
  Location in n-dimensional space
  Distance metric
  Threshold
• K-means
• Canopy
• Dirichlet
• Fuzzy K-means
• Spectral Clustering
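A small k-means sketch shows how the three ingredients fit together: points are locations in n-dimensional space, Euclidean distance is the metric, and `tol` plays the role of the threshold. Reading the threshold as a convergence tolerance is an assumption. This is plain Python on made-up points, not Mahout's distributed implementation.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm: assign to nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]  # keep an empty cluster's centroid
            for i, members in enumerate(clusters)
        ]
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stop once the centroids stop moving
            break
    return centroids, clusters


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

The loop alternates two steps that each depend only on the current centroids, which is what makes the algorithm a natural fit for a map (assign) and reduce (recompute) pass per iteration.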
Clustering
Clustering Text
• Identify k topics in a document corpus
• Requires conversion of text into vectors
• Lucene utilities are available to vectorize text and apply stop-word or weighting criteria.
• seqdirectory – from a directory of text files
• lucene.vector – from a Lucene index
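To make "conversion of text into vectors" concrete, here is a minimal sketch of stop-word removal followed by TF-IDF weighting. It uses the textbook weight tf × log(N / df), which is an assumption and not necessarily the formula the Lucene or Mahout utilities apply. The stop-word list and documents are invented for illustration.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return the vocabulary and one dense TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical three-document corpus.
docs = ["The cat sat in the hat", "The dog sat", "Hadoop is a framework"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
```

Each document becomes a point in a space with one dimension per vocabulary term, which is the form a clustering algorithm such as k-means expects.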
Classifiers
• NaïveBayes
• RandomForests
• LogisticRegression (SGD)
• HiddenMarkov
• Example: 20 Newsgroups
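As a rough picture of what a text classifier does on newsgroup-style data, the sketch below trains a plain multinomial naive Bayes model with add-one smoothing. It is not Mahout's implementation, and the four training documents are made up; the labels merely borrow two newsgroup names.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labelled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Pick the class with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set.
training = [
    ("rec.autos", "engine car dealer"),
    ("rec.autos", "car tires brakes"),
    ("sci.space", "orbit launch nasa"),
    ("sci.space", "shuttle orbit moon"),
]
model = train(training)
print(classify(model, "nasa shuttle launch"))
```

Training is just counting, so the counts can be accumulated per input split and summed, which is why naive Bayes distributes well.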
Sidebar: Risks of Big Data
Unsupervised Learning

Mahout and Distributed Machine Learning 101
