Data Mining in Hadoop,
Making Sense of it in Mahout!


Hadoop World 2011
Michael Cutler @cotdp
Hello (Hadoop) World!
• Senior Research Engineer

• British Sky Broadcasting

• Lead the Hadoop initiative

• Fostering development
Topics
•   What is Data Mining?
•   Introducing Mahout
•   Using Mahout
•   Demo
•   Summary
•   Q&A
What is Data Mining?
It’s all about discovery...
• Grouping similar data records

• Identifying unusual records

• Detecting relationships between records

• Discovering previously unknown patterns
Trends...
• 1990s approach:
 “Think carefully first and get it right!”

• 2000s approach:
 “Think a little first, evolve it later...”

• 2010s approach:
 “... if we capture everything, sense will come(?)”
Cost of Storage




http://www.mkomo.com/cost-per-gigabyte
Other Reasons...
• Increased generation of data

• Complex interconnected datasets

• You can be lazy about it...

Consequence:
  – More data to process than ever before
Traditional Approach...
• Collate your data into files

• At 6pm, take your database offline

• Bulk load the previous 24 hours of data

• Run data mining, analytics, reporting overnight

• Bring the database back up for 9am
Modern Approach
• Stream data straight into Hadoop

• No need for downtime

• Analysis updated periodically or real-time

• Scalable approach
Introducing Mahout
What is it?
Library of scalable machine learning algorithms;

• Classification

• Clustering

• Collaborative Filtering (Recommendations)

• Frequent Pattern mining ... and many more
How do you use it?
• It’s just a Java library

• Simple to get started

• Easy to extend and enhance

• Powerful command-line tools & examples
Classification
• Labels input data with one or more categories

• Trained with known data
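To make "trained with known data" concrete, here is a minimal sketch of a word-count (naive Bayes style) classifier in plain Python. It is not Mahout's API; the function names and the four training messages are made up for illustration, and spam detection is used only because the speaker notes mention it as an example.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count words per label from (text, label) pairs of known data."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability for the text."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: a handful of already-labelled messages.
training_data = [
    ("win cash prize now", "spam"),
    ("cheap prize offer now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training_data)
predicted = classify("claim your cash prize", *model)
```

The same two-step shape (train on labelled records, then label new ones) applies whatever classifier is used; only the model in the middle changes.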
Clustering
• Groups data based on their similarity

• Unsupervised – no training
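As a rough illustration of grouping by similarity, the sketch below is a tiny k-means in plain Python. It follows the three inputs the speaker notes list (the data, a similarity metric, a limit on the number of clusters), but the function names, the Euclidean metric, the first-k seeding and the sample points are all assumptions made for this example, not Mahout code.

```python
import math


def euclidean(a, b):
    """Distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, distance=euclidean, iterations=10):
    """Group points into k clusters and return one cluster index per point."""
    centroids = [points[i] for i in range(k)]  # naive seeding: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


# Hypothetical data: two obvious groups, no labels supplied.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1)]
labels = k_means(points, k=2)
```

No labelled examples are passed in, which is what "unsupervised" means here: the groups come only from the metric and the choice of k.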
Collaborative Filtering
• User-based recommendations
  – Analyse user data
  – Build neighbourhoods of users
  – Other people like you, liked <these>

• Item-based recommendations
  – Analyse domain data
  – Build relationships between items
  – If you liked this, what about <these>
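The item-based flow above (analyse the data, build relationships between items, suggest related items) can be sketched in a few lines of plain Python. This is an illustration only: the ratings are invented, cosine similarity is one possible choice of metric, and none of the names below come from Mahout.

```python
import math
from collections import defaultdict

# Hypothetical preference data: user -> {item: score}.
ratings = {
    "alice": {"Toy Story": 5, "Shrek": 4, "Saw": 1},
    "bob": {"Toy Story": 4, "Shrek": 5},
    "carol": {"Saw": 5, "Hostel": 4},
    "dave": {"Toy Story": 5, "Shrek": 4, "Hostel": 1},
}


def item_vectors(ratings):
    """Invert user -> item scores into item -> user scores."""
    vectors = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, score in prefs.items():
            vectors[item][user] = score
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse item vectors."""
    shared = set(a) & set(b)
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_items(item, ratings, top_n=2):
    """'If you liked this, what about these': rank other items by similarity."""
    vectors = item_vectors(ratings)
    scores = [
        (other, cosine(vectors[item], vectors[other]))
        for other in vectors
        if other != item
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


suggestions = similar_items("Toy Story", ratings)
```

A user-based recommender would run the same comparison over the rows of `ratings` (users) instead of the inverted item vectors.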
Others
• Frequent Pattern mining




• High performance maths & utilities
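For frequent pattern mining, the idea is to count which items keep turning up together. The sketch below is a brute-force pair count in plain Python, not the scalable algorithm a library such as Mahout would use; the colour baskets are invented to echo the speaker note about Blue and Red appearing together three times, and `min_support` is a name assumed for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        # Sort so ("Blue", "Red") and ("Red", "Blue") count as the same pair.
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical baskets of colours.
transactions = [
    ["Blue", "Red", "Green"],
    ["Blue", "Red"],
    ["Red", "Blue", "Orange"],
    ["Purple", "Orange", "Green"],
    ["Purple", "Orange", "Green"],
]
pairs = frequent_pairs(transactions, min_support=3)
```

Lowering `min_support` to 2 would also surface the Purple, Orange and Green pairs, which is the trade-off between finding more patterns and keeping only the strong ones.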
Mahout is a toolbox
• Understand your data

• Determine what needs to be done

• Build a pipeline to compute results

• Think about performance from the start
Please Note
• Scalability through Map/Reduce jobs

• Like MR it is inherently Batch-driven

• Some are not implemented in MR yet

• Fast-paced development
Using Mahout
Building a Recommender
Objectives:

• Personalised

• Item-based recommendations

• Evolve with the times

• Implicit feedback through measurement
Problems with Recommenders
• “Cold start” problem

• “New stuff” problem

• Tainted profiles

• Stale profile data
When they go wrong...
Basic Strategy
• Pre-compute rarely-changing data

• Cache and serve them using traditional means

• Flag data when it needs refreshing

• Tailor the cache on-the-fly
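One way to picture this strategy is a small cache that holds pre-computed item-to-item lists, lets callers flag entries as stale, and filters the cached list per user at request time. The sketch below is plain Python under those assumptions; the class, its method names and the sample catalogue are hypothetical, and a real deployment would likely refresh flagged entries from a batch job rather than recompute them inline as done here.

```python
class RecommendationCache:
    """Serve pre-computed 'similar items' lists and tailor them per request."""

    def __init__(self, compute_similar):
        self._compute = compute_similar  # stands in for the expensive batch job
        self._cache = {}
        self._stale = set()
        self.computations = 0

    def flag_stale(self, item):
        """Mark an item's cached list as needing a refresh."""
        self._stale.add(item)

    def similar(self, item):
        """Return the cached list, recomputing only if missing or flagged."""
        if item not in self._cache or item in self._stale:
            self._cache[item] = self._compute(item)
            self._stale.discard(item)
            self.computations += 1
        return self._cache[item]

    def recommend(self, item, already_seen):
        """Tailor the cached list on the fly by dropping what the user has seen."""
        return [i for i in self.similar(item) if i not in already_seen]


# Hypothetical pre-computed relationships.
catalogue = {"Toy Story": ["Shrek", "Cars", "Up"]}
cache = RecommendationCache(lambda item: list(catalogue.get(item, [])))

first = cache.recommend("Toy Story", already_seen={"Cars"})
second = cache.recommend("Toy Story", already_seen=set())
```

Both requests are served from a single computation; only a call to `flag_stale("Toy Story")` would trigger another one.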
Demo
Summary
• Mahout is exciting!

• Wide range of applications

• Scalable algorithms

• Scalable community
Questions?
Thank you!


Hadoop World 2011
Michael Cutler @cotdp

Editor's notes

  1. Clustering, outliers, association. It’s not new; we’ve been doing it manually for years.
  2. So why has it changed?
  3. 1990 ~ $10,000; 2000 ~ $10; 2010 ~ $0.10. Currently 5 cents per GB.
  4. It’s easier than ever before to generate or collect data. Complexity has increased. Storage and processing power are relatively cheap.
  5. Call data records, web logs etc. Rinse, repeat. The problem is that as the volume of data has grown, you need to go about it in a better way.
  6. Files, HBase etc. Dashboards.
  7. Collaborative filtering for user-based and item-based recommendations. Various clustering algorithms.
  8. Two JARs, “core” and “math”. Basic implementations for everything. You can string together many use-cases just using the examples and CLI.
  9. Examples: detecting spam email, optical character recognition.
  10. You feed in the data. Give it a similarity metric. Set a limit on the number of clusters.
  11. Colors Blue & Red appear together three times. Purple, Orange and Green appear only twice.
  12. How do you recommend to users you know nothing about? If nobody has stumbled onto it, how do you recommend it? Outlier behaviour skewing results. Tastes can change over time or seasonally.
  13. On the face of it, the fact it recommended SAW based on a kids' movie just means that parents are likely to watch SAW.
  14. Item-to-item relationships rarely change. Historical data and trends rarely change. Easy to compute for new items.