Over the past few years, relevant recommendations have become an expected and essential part of the customer experience. From the customer’s perspective, marketing interactions are becoming helpful and time-saving instead of generic, out of context, and annoying. If you shop at any of the major online retailers, such as Amazon or Bluefly, you may think they have somehow gotten inside your head as they present and recommend products relevant to you. This is a dramatic improvement over the traditional psycho-demographic profiling and targeting of the “old world”.
We talked about how Mahout can be leveraged to build a Recommendation Engine with a minimum of coding. We also discussed how the open source search and machine learning capabilities of Apache Solr and Mahout can be combined to power large-scale, data-driven applications that pair real-time access with large-scale enrichment and discovery.
Caserta Concepts has grown beyond its roots as a provider of traditional data warehouse and BI consulting to also offer big data warehousing. If you’re a developer and are experienced in Hadoop, Hive, HBase, Mahout, Datameer or other Big Data technologies, we want to get to know you!
For more information, visit http://www.casertaconcepts.com/.
Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig
1. Big Data Warehousing Meetup
Today’s Topic: Building a Relevance
Engine using Hadoop, Mahout & Pig
Sponsored By:
2. WELCOME!
Joe Caserta
Founder & President, Caserta Concepts
3. Agenda
7:00 Networking
Grab a slice of pizza and a drink...
7:15 Welcome: About the Meetup and about Caserta Concepts
Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit
7:30 Big Data Facts and Figures: Interesting observations from the world of Big Data
Erik Laurence, VP Marketing, Caserta Concepts
7:45 Relevance: Building a Big Data recommendation engine with Mahout
Elliott Cordo, Principal Consultant, Caserta Concepts
8:15 Machine Learning: Powering large scale data driven real time apps with Apache Solr and Mahout
Grant Ingersoll, Chief Scientist, Lucidworks; Mahout co-founder; Lucene/Solr committer
8:45 - 9:00 More Networking
Tell us what you’re up to…
4. About BDW Meetup
• Big Data is a complex, rapidly
changing landscape
• We want to share our stories and
hear about yours
• Great networking opportunity for
like-minded data nerds
• Opportunities to collaborate on
exciting projects
5. About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems
Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education
Founded in 2001
• President: Joe Caserta, industry thought leader,
consultant, educator and co-author, The Data
Warehouse ETL Toolkit (Wiley, 2004)
7. Expertise & Offerings
Strategic Roadmap/
Assessment/Consulting
Big Data
Analytics
Data Warehousing/
ETL/Data Integration
BI/Visualization/
Analytics
Master Data Management
8. Big Data at Caserta Concepts
Caserta Concepts is a blend of the best designers in traditional
enterprise data with the best new designers in Big Data.
Traditional Data
• Tools: RDBMS, DQ, MDM, BI, ETL, Analytics
• Traditional Data: Transactions
Big Data
• Tools: Hadoop, Mahout, Relevance Engine, Analytics
• New Data: Social, Machine, Deep History, Unstructured
Immutable Data Concepts
• Transformation
• Profiling
• Conforming
• Processing Efficiency/Speed
10. BIG DATA FACTS AND FIGURES
Erik Laurence
VP Marketing, Caserta Concepts
11. What is Really Meant by Big Data?
• The 4 Vs of Big Data
• Volume
• More data than ever before
• Most of world’s data is unstructured, semi-structured or multi-structured
[Chart: 10% Structured, 90% Un/Semi/Multi-Structured]
• Variety
• More sources than ever before
• Social, web logs, machine logs, documents, geotags, video, …
• Velocity
• Some data only has value for a short period of time
• Relevance engines, financial fraud sensors, early warning sensors, etc.
• Vitality
• Agility is required in analytics
• Adapt quickly to changing business needs
12. Enterprise Involvement with Big Data
[Chart: 6% Beyond Pilot Stage, 18% Engaged in Pilot, 76% Not Yet Involved]
• Awareness of Big Data high among enterprises, but three-quarters still
wondering, “What is this all about?”
• Answer across all businesses, “We don't know what the business case
is.”
Source: WSJ November 29, 2012
13. Business Cases Have Been Identified
“The use of data and analytics …is going to be a basis of competition
going forward for individual firms, for sectors and even for countries.
Those companies that are able to use data effectively are more likely to
win in the marketplace.”
- Michael Chui, McKinsey Global Institute
In just one field—personal location data—$100 billion of value can be
created globally for service providers through use of data.
Benefits for consumers could be six times that.
Source: (WSJ 11/29/12)
14. Big Data Played A Role in the Election
“This was the first presidential election campaign where all of the
data that was coming into the campaign was successfully collected and
centralized. The Obama campaign did a successful job with that; the
Obama campaign hired an analytics department five times as large as
that of the 2008 operation. Romney campaign did not.”
- John Aristotle Phillips, Chief Executive of
Aristotle International (WSJ 11/29/12)
15. Big Data Example in Obama Campaign
• $40k-a-head dinner in June at Sarah Jessica
Parker’s home in NYC
• 7 different versions of the email solicitation for the
event
• Some mentioned a 2nd fundraiser that night, a Mariah
Carey concert
• Some said Ms. Parker is a mother
• Some said Vogue editor Anna Wintour would be at the
dinner
• Who got which email depended on big data
• Profile info about each prospect
• How they react to different messages
• Campaign created a single massive system to join
info from Democratic voter files to pollsters,
fundraisers, field workers and consumer
databases, social-media, and mobile contacts
Sources: WSJ, Time Magazine
16. Hadoop Market: Growing & Evolving
• Big data outranks virtualization as
#1 trend driving spending initiatives
• Barclays CIO Survey, April 2012
• Overall market at $100B
• Hadoop 2nd only to RDBMS in
potential
• Estimates put market growth at >
40% CAGR
• IDC expects Big Data tech and
services market to grow to $16.9B in
2015
• According to JPMC 50% of Big Data
market will be influenced by Hadoop
17. Hadoop Cost Effective for Archiving
• Hadoop is orders of magnitude cheaper than traditional
archival methods
• Annual cost of 1 TB of archival storage for a credit card
company
Tape SAN Hadoop
$30,000 $3,000 $300
18. Hadoop is Fast
• Sears' process to analyze loyalty club
marketing campaigns took six weeks on
mainframe, Teradata, and SAS servers
• In retail, that’s half the season!
• New process on Hadoop is done weekly
• For online and mobile, daily analysis is done
• What’s more, old models used 10% of data, new models use all
the data
• Source: Information Week (October 31, 2012)
20. Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not
have thought of
• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”
21. Where can recommendation
engines be found?
• They are used in a wide variety of industries
and applications:
• Travel
• Service Industry
• Music/Online radio
• TV and Video
• Online Publications
• Retail
...and countless others
22. Our Use Case: Online Magazine
Goals:
• Serve customers recommendations based on what their
peers are reading.
• Recommendation must have context to the article they
are currently viewing.
24. How we did it
Solution leverages three main algorithms:
• Mahout K-Means – identifying groups of similar articles
• Mahout Item-Based Recommender - recommendations
based on peer behavior
• Raw Popularity – custom Pig script: “people who read this
article also read...”
25. K-Means
• Treats items as coordinates
• Places a number of random
“centroids” and assigns the
nearest items
• Moves the centroids around
based on average location
• Process repeats until the
assignments stop changing
We used the major attributes of the articles to create
coordinate points:
Author, Topic, Section, Region, Media, etc.
*Diagram from Collective Intelligence by Toby Segaran
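The K-Means steps above can be sketched in plain Python. This is a minimal illustration, not Mahout's implementation: the one-hot encoding of the article attributes, the sample articles, and the function names `one_hot` and `kmeans` are all assumptions made for the example.

```python
import random

def one_hot(articles, fields):
    """Turn categorical attributes into 0/1 coordinates (an assumed encoding)."""
    vocab = sorted({(f, a[f]) for a in articles for f in fields})
    return [[1.0 if a[f] == v else 0.0 for f, v in vocab] for a in articles]

def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Start from k randomly chosen points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical articles described by a few of the attributes listed above.
articles = [
    {"author": "smith", "topic": "sports", "section": "news"},
    {"author": "jones", "topic": "sports", "section": "news"},
    {"author": "lee", "topic": "politics", "section": "opinion"},
    {"author": "lee", "topic": "politics", "section": "world"},
]
points = one_hot(articles, ["author", "topic", "section"])
labels, _ = kmeans(points, k=2)
print(labels)
```

With this toy data the two sports articles end up in one cluster and the two politics articles in the other. Which cluster gets which number depends on the random start.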
26. Item-Based Recommender
• Build an item-item matrix determining relationships
between pairs of items (usage)
• Using the matrix, and the data on the current user, infer
their taste
• We use a dataset containing Customer, Article and
Rating
• Since no rating was available we used a 1 to 5
scale based on age (a ramped 6 month decay)
• In the output a 0 to 5 scale is calculated, 5 being the
most highly recommended for this customer
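A rough Python sketch of the idea above follows. It is not Mahout's Item-Based Recommender: the linear shape of the six-month ramp, the cosine similarity measure, the sample reads, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from math import sqrt

def rating_from_age(days_ago, window_days=180):
    """Assumed linear ramp: 5.0 for a read today, 1.0 at six months or older."""
    fraction_left = max(0.0, 1.0 - days_ago / window_days)
    return 1.0 + 4.0 * fraction_left

def item_similarities(ratings):
    """Item-item matrix of cosine similarities; ratings = {customer: {article: rating}}."""
    by_item = defaultdict(dict)
    for customer, items in ratings.items():
        for item, r in items.items():
            by_item[item][customer] = r
    sims = defaultdict(dict)
    for a in by_item:
        for b in by_item:
            shared = by_item[a].keys() & by_item[b].keys()
            if a == b or not shared:
                continue
            dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
            norm_a = sqrt(sum(v * v for v in by_item[a].values()))
            norm_b = sqrt(sum(v * v for v in by_item[b].values()))
            sims[a][b] = dot / (norm_a * norm_b)
    return sims

def recommend(customer, ratings, sims):
    """Score unread articles as a similarity-weighted average of the customer's ratings."""
    seen = ratings[customer]
    candidates = {b for a in seen for b in sims.get(a, {})} - set(seen)
    scores = {}
    for cand in candidates:
        pairs = [(sims[cand][a], r) for a, r in seen.items() if a in sims[cand]]
        weight = sum(s for s, _ in pairs)
        if weight > 0:
            scores[cand] = sum(s * r for s, r in pairs) / weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage data: (customer, article, days since the article was read).
reads = [
    ("ann", "A", 0), ("ann", "B", 30),
    ("bob", "A", 10), ("bob", "B", 90), ("bob", "C", 5),
    ("cat", "B", 20), ("cat", "C", 60),
]
ratings = defaultdict(dict)
for customer, article, days_ago in reads:
    ratings[customer][article] = rating_from_age(days_ago)

sims = item_similarities(ratings)
recs = recommend("ann", ratings, sims)
print(recs)
```

Because each score is a weighted average of ratings between 1 and 5, the output stays inside the 0 to 5 scale described above.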
27. Popularity
• Self join the usage dataset on Customer_ID to pair each article
with the other articles its readers also read:
Also_Read_Data = JOIN Readers1 BY Customer_ID, Readers2 BY Customer_ID USING 'merge';
• Group by Article and “Also Read Article”
• Sort descending based on the number of distinct peer
customers
• Limit 25 (most popular “Also Read Article”)
• In the output a 0 to 5 scale is calculated, 5 being the most
popular for a given article
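For readers who do not use Pig, the same popularity steps can be sketched in Python. This is an illustrative equivalent, not the original script: the function name `also_read`, the sample data, and the way counts are scaled to 0 to 5 (each count divided by the largest count for that article) are assumptions.

```python
from collections import defaultdict

def also_read(reads, limit=25):
    """reads: iterable of (customer_id, article_id) pairs.

    Returns {article: [(also_read_article, distinct_peers, score_0_to_5), ...]}.
    """
    articles_by_customer = defaultdict(set)
    for customer, article in reads:
        articles_by_customer[customer].add(article)

    # Self join on customer: every ordered pair of different articles
    # read by the same customer.
    peers = defaultdict(set)
    for customer, articles in articles_by_customer.items():
        for a in articles:
            for b in articles:
                if a != b:
                    peers[(a, b)].add(customer)

    # Group by article and count distinct peer customers per "also read" article.
    grouped = defaultdict(list)
    for (a, b), customers in peers.items():
        grouped[a].append((b, len(customers)))

    result = {}
    for a, pairs in grouped.items():
        # Sort descending by peer count (ties broken by article id), then limit.
        top = sorted(pairs, key=lambda p: (-p[1], p[0]))[:limit]
        best = top[0][1]
        # Assumed scaling: the most popular "also read" article scores 5.
        result[a] = [(b, n, 5.0 * n / best) for b, n in top]
    return result

# Hypothetical usage data.
reads = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "B"), ("c2", "C"),
    ("c3", "A"), ("c3", "C"),
    ("c4", "A"), ("c4", "B"),
]
popular = also_read(reads)
print(popular["A"])
```

For article A, three distinct customers also read B and two also read C, so B ranks first with a score of 5.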
28. Delivering Recommendations
Customer views an article online and we are passed their
Customer ID and the Article they are viewing
We then do the following:
1. K-Means – get all items in the same cluster and calculate
Euclidean Distance. Reverse and scale 0-5.
2. Item-Based - get all peer recommendations for this customer
3. Popularity – get all popular recommendations for this article
4. Join the three data sets together, add the final rankings and
bring back the most highly rated articles.
[Diagram: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular]
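The delivery steps on this slide can be sketched as follows. The exact formulas are not given in the slides, so this sketch assumes that "reverse and scale" means 5 × (1 − distance / largest distance in the cluster) and that "add the final rankings" means summing the three scores. The function names and sample scores are hypothetical.

```python
from math import dist

def similarity_scores(current, cluster_points):
    """Score articles in the current article's cluster from 0 to 5.

    cluster_points: {article: coordinates}. Closer articles score higher
    (assumed formula: 5 * (1 - distance / largest distance)).
    """
    here = cluster_points[current]
    distances = {
        a: dist(here, p) for a, p in cluster_points.items() if a != current
    }
    farthest = max(distances.values(), default=0.0)
    if farthest == 0.0:
        return {a: 5.0 for a in distances}
    return {a: 5.0 * (1.0 - d / farthest) for a, d in distances.items()}

def combine(*score_sets, top_n=5):
    """Join the score sets on article, add the scores, return the top articles."""
    totals = {}
    for scores in score_sets:
        for article, score in scores.items():
            totals[article] = totals.get(article, 0.0) + score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical inputs for a customer who is viewing article "A".
cluster = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0)}
kmeans_scores = similarity_scores("A", cluster)  # step 1
item_based_scores = {"B": 4.0, "D": 4.5}         # step 2, from the recommender
popularity_scores = {"B": 5.0, "C": 3.0}         # step 3, from the popularity job
ranked = combine(kmeans_scores, item_based_scores, popularity_scores)  # step 4
print(ranked)
```

Article B appears in all three score sets, so it ranks first. Under the assumed scaling the farthest article in the cluster scores 0, which may or may not match the original behavior.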
29. Items recommended by more than 1
algorithm are the most highly rated
[Diagram: Item-Based: Peers are reading; K-Means: Similar; Popularity: Most popular; Best Recommendations]
30. Improvements/Ideas
• Conditionally swap algorithms: Peer recommendations
can be unwieldy for new users
• Allow users to rate how relevant this recommendation is
-> retrain the model
• Play with the weighting of current algorithms, evaluate
others
• Hybrid search platform: Replace or supplement K-Means
with Search platform