Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data -- all very necessary, but largely unexciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight from your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.
5. It’s all about discovery...
• Grouping similar data records
• Identifying unusual records
• Detecting relationships between records
• Discovering previously unknown patterns
6. Trends...
• 1990s approach:
“Think carefully first and get it right!”
• 2000s approach:
“Think a little first, evolve it later...”
• 2010s approach:
“... if we capture everything, sense will come(?)”
8. Other Reasons...
• Increased generation of data
• Complex interconnected datasets
• You can be lazy about it...
Consequence:
– More data to process than ever before
9. Traditional Approach...
• Collate your data into files
• At 6pm, take your database offline
• Bulk-load the previous 24 hours of data
• Run data mining, analytics, reporting overnight
• Bring the database back up for 9am
10. Modern Approach
• Stream data straight into Hadoop
• No need for downtime
• Analysis updated periodically or in real time
• Scalable approach
12. What is it?
Library of scalable machine learning algorithms;
• Classification
• Clustering
• Collaborative Filtering (Recommendations)
• Frequent Pattern mining ... and many more
13. How do you use it?
• It’s just a Java library
• Simple to get started
• Easy to extend and enhance
• Powerful command-line tools & examples
16. Collaborative Filtering
• User-based recommendations
– Analyse user data
– Build neighbourhoods of users
– Other people like you, liked <these>
• Item-based recommendations
– Analyse domain data
– Build relationships between items
– If you liked this, what about <these>
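Mahout's Taste API provides production implementations of both flavours. Purely to illustrate the user-based idea (the class, helper names, and data below are hypothetical, not Mahout API), here is a toy recommender that finds the single most similar user by cosine similarity and suggests that neighbour's unseen items:

```java
import java.util.*;

public class ToyUserCF {
    // cosine similarity between two users' item-rating vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double r = b.get(e.getKey());
            if (r != null) dot += e.getValue() * r;
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // "Other people like you, liked <these>": pick the nearest neighbour,
    // then return their items the target user has not rated yet
    static List<String> recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.get(user);
        String best = null;
        double bestSim = -1;
        for (String other : ratings.keySet()) {
            if (other.equals(user)) continue;
            double s = cosine(mine, ratings.get(other));
            if (s > bestSim) { bestSim = s; best = other; }
        }
        List<String> recs = new ArrayList<>();
        if (best != null)
            for (String item : ratings.get(best).keySet())
                if (!mine.containsKey(item)) recs.add(item);
        Collections.sort(recs);
        return recs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("matrix", 5.0, "saw", 4.0));
        ratings.put("bob",   Map.of("matrix", 5.0, "saw", 4.0, "up", 3.0));
        ratings.put("carol", Map.of("up", 5.0));
        // bob is alice's nearest neighbour; he also rated "up"
        System.out.println(recommend("alice", ratings)); // [up]
    }
}
```

A real neighbourhood would use many neighbours and weight items by similarity; Mahout handles that, plus the scaling, for you.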
18. Mahout is a toolbox
• Understand your data
• Determine what needs to be done
• Build a pipeline to compute results
• Think about performance from the start
19. Please Note
• Scalability through MapReduce jobs
• Like MapReduce, it is inherently batch-driven
• Some algorithms are not implemented in MapReduce yet
• Fast-paced development
24. Basic Strategy
• Pre-compute rarely-changing data
• Cache and serve them using traditional means
• Flag data when it needs refreshing
• Tailor the cache on-the-fly
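A minimal sketch of that strategy (the class and its `compute` helper are hypothetical; `compute` stands in for whatever batch job produces the recommendations): serve from the cache, flag entries stale rather than recomputing immediately, and refresh only when an entry is actually requested.

```java
import java.util.*;

public class RecCache {
    private final Map<Long, List<String>> cache = new HashMap<>();
    private final Set<Long> stale = new HashSet<>();

    // hypothetical stand-in for the pre-computed batch results
    List<String> compute(long userId) {
        return List.of("item-" + userId);
    }

    // mark a user's cached recommendations as needing a refresh
    void flagStale(long userId) {
        stale.add(userId);
    }

    // serve cached results, refreshing lazily when flagged or missing
    List<String> recommendationsFor(long userId) {
        if (stale.remove(userId) || !cache.containsKey(userId)) {
            cache.put(userId, compute(userId));
        }
        return cache.get(userId);
    }
}
```

The point is the shape, not the store: in practice the cache would live in something like HBase or memcached rather than an in-process map.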
It’s easier than ever before to generate or collect data. Complexity has increased. Storage and processing power are relatively cheap.
Call data records, web logs etc. Rinse, repeat. The problem is that as the volume of data has grown, you need to go about it in a better way.
Files, HBase etc. Dashboards.
Collaborative filtering for user-based and item-based recommendations. Various clustering algorithms.
Two JARs, “core” and “math”. Basic implementations for everything. You can string together many use cases just using the examples and CLI.
Examples: detecting spam email, optical character recognition.
You feed in the data, give it a similarity metric, and set a limit on the number of clusters.
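Mahout ships distributed implementations of clustering (k-means among them); the "data + similarity metric + cluster limit" recipe can be sketched in plain Java on 1-D points (a toy illustration, not Mahout code):

```java
import java.util.*;

public class TinyKMeans {
    // one k-means iteration: assign each point to its nearest centroid
    // (absolute distance as the similarity metric), then re-average
    static double[] step(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] count = new int[centroids.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
            sum[best] += p;
            count[best]++;
        }
        double[] next = centroids.clone();
        for (int c = 0; c < centroids.length; c++)
            if (count[c] > 0) next[c] = sum[c] / count[c];
        return next;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centroids = {2, 11};            // k = 2 is the cluster-count limit
        for (int i = 0; i < 5; i++) centroids = step(points, centroids);
        System.out.println(Arrays.toString(centroids)); // [2.0, 11.0]
    }
}
```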
The colors Blue and Red appear together three times; Purple, Orange and Green appear together only twice.
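Those co-occurrence counts are what item-based filtering is built on, and they fall out of a simple pair-counting pass. In the sketch below the baskets are hypothetical data chosen to match the counts stated above:

```java
import java.util.*;

public class CoOccurrence {
    // count how often each unordered pair of items appears in the same basket
    static Map<String, Integer> pairCounts(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            for (int i = 0; i < basket.size(); i++)
                for (int j = i + 1; j < basket.size(); j++) {
                    String a = basket.get(i), b = basket.get(j);
                    String key = a.compareTo(b) < 0 ? a + "+" + b : b + "+" + a;
                    counts.merge(key, 1, Integer::sum);
                }
        }
        return counts;
    }

    public static void main(String[] args) {
        // hypothetical baskets: Blue & Red co-occur three times,
        // the Purple/Orange/Green pairs twice each
        List<List<String>> baskets = List.of(
            List.of("Blue", "Red"),
            List.of("Blue", "Red", "Green"),
            List.of("Blue", "Red"),
            List.of("Purple", "Orange", "Green"),
            List.of("Purple", "Orange", "Green"));
        System.out.println(pairCounts(baskets));
    }
}
```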
How do you recommend to users you know nothing about? If nobody has stumbled onto an item, how do you recommend it? Outlier behaviour skews results. Tastes can change over time or seasonally.
On the face of it, the fact that it recommended SAW based on a kids’ movie just means that parents are likely to watch SAW.
Item-to-item relationships rarely change. Historical data and trends rarely change. Easy to compute for new items.