Much of Hadoop adoption thus far has been for use cases such as processing log files, text mining, and storing masses of file data -- all very necessary, but largely unexciting. In this presentation, Michael Cutler presents a selection of methodologies, primarily using Mahout, that will enable you to derive real insight from your data (mined in Hadoop) and build a recommendation engine focused on the implicit data collected from your users.
5. It’s all about discovery...
• Grouping similar data records
• Identifying unusual records
• Detecting relationships between records
• Discovering previously unknown patterns
6. Trends...
• 1990s approach:
“Think carefully first and get it right!”
• 2000s approach:
“Think a little first, evolve it later...”
• 2010s approach:
“... if we capture everything, sense will come(?)”
8. Other Reasons...
• Increased generation of data
• Complex interconnected datasets
• You can be lazy about it...
Consequence:
– More data to process than ever before
9. Traditional Approach...
• Collate your data into files
• At 6pm, take your database offline
• Bulk-load the previous 24 hours of data
• Run data mining, analytics, reporting overnight
• Bring the database back up for 9am
10. Modern Approach
• Stream data straight into Hadoop
• No need for downtime
• Analysis updated periodically or in real time
• Scalable approach
12. What is it?
Library of scalable machine learning algorithms;
• Classification
• Clustering
• Collaborative Filtering (Recommendations)
• Frequent Pattern mining ... and many more
13. How do you use it?
• It’s just a Java library
• Simple to get started
• Easy to extend and enhance
• Powerful command-line tools & examples
16. Collaborative Filtering
• User-based recommendations
– Analyse user data
– Build neighbourhoods of users
– Other people like you, liked <these>
• Item-based recommendations
– Analyse domain data
– Build relationships between items
– If you liked this, what about <these>
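Mahout's Taste API provides production implementations of both flavours. Purely to illustrate the user-based idea (the class, helper names, and data below are hypothetical, not Mahout API), here is a toy recommender that finds the single most similar user by cosine similarity and suggests that neighbour's unseen items:

```java
import java.util.*;

public class ToyUserCF {
    // cosine similarity between two users' item-rating vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double r = b.get(e.getKey());
            if (r != null) dot += e.getValue() * r;
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // "Other people like you, liked <these>": pick the nearest neighbour,
    // then return their items the target user has not rated yet
    static List<String> recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.get(user);
        String best = null;
        double bestSim = -1;
        for (String other : ratings.keySet()) {
            if (other.equals(user)) continue;
            double s = cosine(mine, ratings.get(other));
            if (s > bestSim) { bestSim = s; best = other; }
        }
        List<String> recs = new ArrayList<>();
        if (best != null)
            for (String item : ratings.get(best).keySet())
                if (!mine.containsKey(item)) recs.add(item);
        Collections.sort(recs);
        return recs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("matrix", 5.0, "saw", 4.0));
        ratings.put("bob",   Map.of("matrix", 5.0, "saw", 4.0, "up", 3.0));
        ratings.put("carol", Map.of("up", 5.0));
        // bob is alice's nearest neighbour; he also rated "up"
        System.out.println(recommend("alice", ratings)); // [up]
    }
}
```

A real neighbourhood would use many neighbours and weight items by similarity; Mahout handles that, plus the scaling, for you.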
18. Mahout is a toolbox
• Understand your data
• Determine what needs to be done
• Build a pipeline to compute results
• Think about performance from the start
19. Please Note
• Scalability through MapReduce jobs
• Like MapReduce, it is inherently batch-driven
• Some algorithms are not implemented in MapReduce yet
• Fast-paced development
24. Basic Strategy
• Pre-compute rarely-changing data
• Cache and serve them using traditional means
• Flag data when it needs refreshing
• Tailor the cache on-the-fly
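A minimal sketch of that strategy (the class and its `compute` helper are hypothetical; `compute` stands in for whatever batch job produces the recommendations): serve from the cache, flag entries stale rather than recomputing immediately, and refresh only when an entry is actually requested.

```java
import java.util.*;

public class RecCache {
    private final Map<Long, List<String>> cache = new HashMap<>();
    private final Set<Long> stale = new HashSet<>();

    // hypothetical stand-in for the pre-computed batch results
    List<String> compute(long userId) {
        return List.of("item-" + userId);
    }

    // mark a user's cached recommendations as needing a refresh
    void flagStale(long userId) {
        stale.add(userId);
    }

    // serve cached results, refreshing lazily when flagged or missing
    List<String> recommendationsFor(long userId) {
        if (stale.remove(userId) || !cache.containsKey(userId)) {
            cache.put(userId, compute(userId));
        }
        return cache.get(userId);
    }
}
```

The point is the shape, not the store: in practice the cache would live in something like HBase or memcached rather than an in-process map.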
It’s easier than ever before to generate or collect data. Complexity has increased. Storage and processing power are relatively cheap.
Call data records, web logs etc. Rinse, repeat. The problem is that as the volume of data has grown, you need to go about it in a better way.
Files, HBase etc. Dashboards.
Collaborative filtering for user-based and item-based recommendations. Various clustering algorithms.
Two JARs, “core” and “math”. Basic implementations for everything. You can string together many use cases just using the examples and CLI.
Examples: detecting spam email, optical character recognition.
You feed in the data, give it a similarity metric, and set a limit on the number of clusters.
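Mahout ships distributed implementations of clustering (k-means among them); the "data + similarity metric + cluster limit" recipe can be sketched in plain Java on 1-D points (a toy illustration, not Mahout code):

```java
import java.util.*;

public class TinyKMeans {
    // one k-means iteration: assign each point to its nearest centroid
    // (absolute distance as the similarity metric), then re-average
    static double[] step(double[] points, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] count = new int[centroids.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) best = c;
            sum[best] += p;
            count[best]++;
        }
        double[] next = centroids.clone();
        for (int c = 0; c < centroids.length; c++)
            if (count[c] > 0) next[c] = sum[c] / count[c];
        return next;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centroids = {2, 11};            // k = 2 is the cluster-count limit
        for (int i = 0; i < 5; i++) centroids = step(points, centroids);
        System.out.println(Arrays.toString(centroids)); // [2.0, 11.0]
    }
}
```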
The colors Blue and Red appear together three times; Purple, Orange and Green appear together only twice.
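Those co-occurrence counts are what item-based filtering is built on, and they fall out of a simple pair-counting pass. In the sketch below the baskets are hypothetical data chosen to match the counts stated above:

```java
import java.util.*;

public class CoOccurrence {
    // count how often each unordered pair of items appears in the same basket
    static Map<String, Integer> pairCounts(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            for (int i = 0; i < basket.size(); i++)
                for (int j = i + 1; j < basket.size(); j++) {
                    String a = basket.get(i), b = basket.get(j);
                    String key = a.compareTo(b) < 0 ? a + "+" + b : b + "+" + a;
                    counts.merge(key, 1, Integer::sum);
                }
        }
        return counts;
    }

    public static void main(String[] args) {
        // hypothetical baskets: Blue & Red co-occur three times,
        // the Purple/Orange/Green pairs twice each
        List<List<String>> baskets = List.of(
            List.of("Blue", "Red"),
            List.of("Blue", "Red", "Green"),
            List.of("Blue", "Red"),
            List.of("Purple", "Orange", "Green"),
            List.of("Purple", "Orange", "Green"));
        System.out.println(pairCounts(baskets));
    }
}
```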
How do you recommend to users you know nothing about? If nobody has stumbled onto an item, how do you recommend it? Outlier behaviour skews results. Tastes can change over time or seasonally.
On the face of it, the fact that it recommended SAW based on a kids’ movie just means that parents are likely to watch SAW.
Item-to-item relationships rarely change. Historical data and trends rarely change. Easy to compute for new items.