Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig

Big Data Warehousing Meetup

Today’s Topic: Building a Relevance Engine using Hadoop, Mahout & Pig

WELCOME!
Joe Caserta
Founder & President, Caserta Concepts

Agenda
7:00 Networking
  Grab a slice of pizza and a drink...

7:15 Joe Caserta: Welcome
  President, Caserta Concepts; Author, Data Warehouse ETL Toolkit
  About the Meetup and about Caserta Concepts

7:30 Erik Laurence: Big Data Facts and Figures
  VP Marketing, Caserta Concepts
  Interesting observations from the world of Big Data

7:45 Elliott Cordo: Relevance
  Principal Consultant, Caserta Concepts
  Building a Big Data recommendation engine with Mahout

8:15 Grant Ingersoll: Machine Learning
  Chief Scientist, Lucidworks; Mahout co-founder; Lucene/Solr committer
  Powering large scale data driven real time apps with Apache Solr and Mahout

8:45 - 9:00 More Networking
  Tell us what you’re up to…

About BDW Meetup
• Big Data is a complex, rapidly
changing landscape

• We want to share our stories and
hear about yours

• Great networking opportunity for
like-minded data nerds

• Opportunities to collaborate on
exciting projects

About Caserta Concepts
Focused Expertise
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Strategic Data Ecosystems

Industries Served
• Financial Services
• Healthcare / Insurance
• Retail / eCommerce
• Digital Media / Marketing
• K-12 / Higher Education

Founded in 2001

• President: Joe Caserta, industry thought leader,
consultant, educator and co-author, The Data
Warehouse ETL Toolkit (Wiley, 2004)

Client Portfolio
• Finance & Insurance
• Retail/eCommerce & Manufacturing
• Education & Services

Expertise & Offerings
• Strategic Roadmap/Assessment/Consulting
• Big Data Analytics
• Data Warehousing/ETL/Data Integration
• BI/Visualization/Analytics
• Master Data Management

Big Data at Caserta Concepts
Caserta Concepts is a blend of the best designers in traditional
enterprise data with the best new designers in Big Data.

Traditional Data
• Tools
  • RDBMS
  • DQ
  • MDM
  • BI
  • ETL
  • Analytics
• Traditional Data
  • Transactions

Big Data
• Tools
  • Hadoop
  • Mahout
  • Relevance Engine
  • Analytics
• New Data
  • Social
  • Machine
  • Deep History
  • Unstructured

Immutable Data Concepts
• Transformation
• Conforming
• Profiling
• Processing Efficiency/Speed

Contacts

Joe Caserta
President & Founder, Caserta Concepts
P: (855) 755-2246 x227
E: joe@casertaconcepts.com

Erik Laurence
VP Marketing, Caserta Concepts
P: (855) 755-2246 x528
E: erik@casertaconcepts.com

Elliott Cordo
Principal Consultant, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com

Caserta Concepts
E: info@casertaconcepts.com
P: 1 (855) 755-2246
www.casertaconcepts.com

BIG DATA FACTS AND FIGURES
Erik Laurence
VP Marketing, Caserta Concepts

What is Really Meant by Big Data?
• The 4 Vs of Big Data
• Volume
  • More data than ever before
  • Most of world’s data is unstructured, semi-structured or multi-structured
  [Pie chart: 10% structured, 90% un/semi/multi-structured]
• Variety
  • More sources than ever before
  • Social, web logs, machine logs, documents, geotags, video, …
• Velocity
  • Some data only has value for a short period of time
  • Relevance engines, financial fraud sensors, early warning sensors, etc.
• Vitality
  • Agility is required in analytics
  • Adapt quickly to changing business needs

Enterprise Involvement with Big Data
[Pie chart: 6% beyond pilot stage, 18% engaged in pilot, 76% not yet involved]

• Awareness of Big Data is high among enterprises, but three-quarters are still
wondering, “What is this all about?”
• Answer across all businesses: “We don't know what the business case
is.”

Source: WSJ November 29, 2012

Business Cases Have Been Identified
“The use of data and analytics …is going to be a basis of competition
going forward for individual firms, for sectors and even for countries.
Those companies that are able to use data effectively are more likely to
win in the marketplace.”
- Michael Chui, McKinsey Global Institute

In just one field—personal location data—$100 billion of value can be
created globally for service providers through use of data.

Benefits for consumers could be six times that.

Source: (WSJ 11/29/12)

Big Data Played A Role in the Election
“This was the first presidential election campaign where all of the
data that was coming into the campaign was successfully
collected and centralized.

“The Obama campaign did a successful job with that; the
Romney campaign did not.”

- John Aristotle Phillips, Chief Executive of
Aristotle International (WSJ 11/29/12)

• Obama campaign hired an analytics department five
times as large as that of the 2008 operation.

Big Data Example in Obama Campaign
• $40k-a-head dinner in June at Sarah Jessica
Parker’s home in NYC
• 7 different versions of the email solicitation for the
event
• Some mentioned a 2nd fundraiser that night, a Mariah
Carey concert
• Some said Ms. Parker is a mother
• Some said Vogue editor Anna Wintour would be at the
dinner
• Who got which email depended on big data
• Profile info about each prospect
• How they react to different messages
• Campaign created a single massive system to join
info from Democratic voter files to info from
pollsters, fundraisers, field workers, consumer
databases, social media, and mobile contacts

Sources: WSJ, Time Magazine

Hadoop Market: Growing & Evolving
• Big data outranks virtualization as
#1 trend driving spending initiatives
• Barclays CIO Survey, April 2012

• Overall market at $100B
• Hadoop 2nd only to RDBMS in
potential

• Estimates put market growth at more than
40% CAGR
• IDC expects Big Data tech and
services market to grow to $16.9B in
2015
• According to JPMC 50% of Big Data
market will be influenced by Hadoop

Hadoop Cost Effective for Archiving
• Hadoop is orders of magnitude cheaper than traditional
archival methods

• Annual cost of 1 TB of archival storage for a credit card
company:

  Tape       SAN       Hadoop
  $30,000    $3,000    $300

Hadoop is Fast
• Sears' process to analyze loyalty club
marketing campaigns took six weeks on
mainframe, Teradata, and SAS servers
• In retail, that’s half the season!

• New process on Hadoop is done weekly
• For online and mobile, daily analysis is done

• What’s more, old models used 10% of data, new models use all
the data

• Source: Information Week (October 31, 2012)

BUILDING A RECOMMENDATION ENGINE
Elliott Cordo
Principal Consultant, Caserta Concepts

Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not
have thought of

• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”

Where can recommendation
engines be found?
• Recommendation engines are used across a wide variety of
industries and applications:
• Travel
• Service Industry
• Music/Online radio
• TV and Video
• Online Publications
• Retail
..and countless others

Our Use Case: Online Magazine
Goals:
• Serve customers recommendations based on what their
peers are reading.
• Recommendations must be relevant to the article they
are currently viewing.

Technical Details
Core Platform:
• Cloudera Hadoop Cluster
• Mahout Machine Learning Library
• Apache Pig

Additional Technology:
• Talend Big Data Edition (ETL to/from relational)
• Datameer (Analysis and Visualization)

How we did it
Solution leverages three main algorithms:
• Mahout K-Means – identifying groups of similar articles
• Mahout Item-Based Recommender - recommendations
based on peer behavior
• Raw Popularity – custom Pig script: “people who read this
article also read…”

K-Means
• Treats items as coordinates
• Places a number of random
“centroids” and assigns the
nearest items
• Moves the centroids around
based on average location
• Process repeats until the
assignments stop changing

We used the major attributes of the articles to create
coordinate points:
Author, Topic, Section, Region, Media, etc.

*Diagram from Programming Collective Intelligence by Toby Segaran
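
The K-Means loop described on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the Mahout job used in the project: the one-hot encoding of categorical attributes, the seeding of centroids from randomly chosen articles, and the names `one_hot`, `kmeans`, and the sample articles are all assumptions made for the example.

```python
import random

def one_hot(articles, fields):
    """Turn categorical attributes into 0/1 coordinates (assumed encoding)."""
    vocab = sorted({(f, a[f]) for a in articles for f in fields})
    index = {key: i for i, key in enumerate(vocab)}
    vectors = []
    for a in articles:
        v = [0.0] * len(vocab)
        for f in fields:
            v[index[(f, a[f])]] = 1.0
        vectors.append(v)
    return vectors

def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Assign points to the nearest centroid, move centroids, repeat."""
    rng = random.Random(seed)
    # Assumption: start the centroids on k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical articles: the attribute names follow the slide, the values are made up.
articles = [
    {"author": "A", "topic": "politics", "section": "news"},
    {"author": "A", "topic": "politics", "section": "news"},
    {"author": "B", "topic": "travel", "section": "life"},
    {"author": "B", "topic": "travel", "section": "life"},
    {"author": "C", "topic": "travel", "section": "life"},
]
fields = ["author", "topic", "section"]
vectors = one_hot(articles, fields)
assignment, centroids = kmeans(vectors, k=2)
print(assignment)
```

Articles with identical attributes always share a cluster; how the remaining articles split depends on the starting centroids, which is one reason production runs usually try several seeds.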

Item-Based Recommender
• Build an item-item matrix determining relationships
between pairs of items (usage)
• Using the matrix, and the data on the current user, infer
their tastes
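
As a rough illustration of the two steps above, the sketch below builds an item-item similarity table and scores unseen articles for one customer. It is plain Python, not Mahout's API; the choice of cosine similarity, the weighted-average scoring, and the names `item_similarities`, `recommend`, and the sample ratings are assumptions for the example.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(ratings):
    """ratings: {customer: {article: rating}} -> {(a, b): cosine similarity}."""
    by_item = defaultdict(dict)
    for customer, items in ratings.items():
        for article, rating in items.items():
            by_item[article][customer] = rating
    norms = {a: sqrt(sum(r * r for r in users.values()))
             for a, users in by_item.items()}
    sims = {}
    for a in by_item:
        for b in by_item:
            if a == b:
                continue
            shared = by_item[a].keys() & by_item[b].keys()
            if not shared:
                continue  # no customer read both articles
            dot = sum(by_item[a][c] * by_item[b][c] for c in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims

def recommend(ratings, customer, sims):
    """Score each unseen article as a similarity-weighted average of the
    customer's own ratings, highest score first."""
    seen = ratings[customer]
    candidates = {b for (a, b) in sims if a in seen and b not in seen}
    scores = {}
    for cand in candidates:
        pairs = [(sims[(a, cand)], r) for a, r in seen.items() if (a, cand) in sims]
        weight = sum(s for s, _ in pairs)
        if weight > 0:
            scores[cand] = sum(s * r for s, r in pairs) / weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical usage data on a 1 to 5 scale.
ratings = {
    "c1": {"a1": 5.0, "a2": 3.0},
    "c2": {"a1": 4.0, "a2": 2.0, "a3": 5.0},
    "c3": {"a2": 4.0, "a3": 3.0},
}
sims = item_similarities(ratings)
recs = recommend(ratings, "c1", sims)
print(recs)
```

Because the score is a weighted average of the customer's own ratings, it stays within the range of the input ratings.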

• We use a dataset containing Customer, Article and
Rating
• Since no explicit rating was available, we used a 1 to 5
scale based on age (ramped down over a 6-month decay)
• In the output a 0 to 5 scale is calculated, 5 being the
most highly recommended for this customer
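
The slide does not give the exact shape of the age-based rating, so the following is only one plausible reading: a linear ramp from 5 for an article read today down to 1 for anything read six months ago or earlier. The function name `age_to_rating`, the 182-day window, and the linear ramp are all assumptions.

```python
from datetime import date

def age_to_rating(read_on, as_of, window_days=182, max_rating=5.0, min_rating=1.0):
    """Map the age of a read event to a rating on a 1 to 5 scale.

    Assumed linear decay: age 0 -> 5.0, age >= window_days -> 1.0.
    """
    age = (as_of - read_on).days
    if age <= 0:
        return max_rating
    if age >= window_days:
        return min_rating
    fraction = age / window_days
    return max_rating - fraction * (max_rating - min_rating)

# Hypothetical read dates, evaluated as of 1 December 2012.
today = date(2012, 12, 1)
print(age_to_rating(date(2012, 12, 1), today))  # read today
print(age_to_rating(date(2012, 9, 1), today))   # read 91 days ago
print(age_to_rating(date(2012, 1, 1), today))   # read more than six months ago
```

A steeper or stepped ramp would work the same way; only the mapping inside the function changes.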

Popularity
• Self join the usage dataset on Customer ID, pairing each article a
customer read with the other articles that customer read:
    Also_Read_Data = join Readers1 by Customer_ID,
                          Readers2 by Customer_ID using 'merge';
• Group by Article and “Also Read Article”
• Sort descending based on the number of distinct peer
customers
• Limit 25 (most popular “Also Read Article”)
• In the output a 0 to 5 scale is calculated, 5 being the most
popular for a given article
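
For readers who do not use Pig, the same "also read" logic can be sketched in plain Python. This is an illustration of the steps on the slide, not the production script: the function name `also_read`, the sample usage rows, and the scaling rule (peer count divided by the highest peer count for that article, times 5) are assumptions, since the slide does not say how the 0 to 5 scale is computed.

```python
from collections import defaultdict

def also_read(usage, limit=25):
    """usage: iterable of (customer_id, article) rows.

    Returns {article: [(also_read_article, score), ...]} with scores on a
    0 to 5 scale, most popular first.
    """
    by_customer = defaultdict(set)
    for customer, article in usage:
        by_customer[customer].add(article)

    # Self join on customer: every pair of distinct articles one customer read.
    peers = defaultdict(set)
    for customer, articles in by_customer.items():
        for a in articles:
            for b in articles:
                if a != b:
                    peers[(a, b)].add(customer)

    # Group by article, count distinct peer customers per also-read article.
    grouped = defaultdict(list)
    for (a, b), customers in peers.items():
        grouped[a].append((b, len(customers)))

    result = {}
    for a, pairs in grouped.items():
        pairs.sort(key=lambda p: (-p[1], p[0]))  # most peers first
        top = pairs[:limit]
        best = top[0][1]
        # Assumed scaling: the most popular also-read article scores 5.
        result[a] = [(b, 5.0 * n / best) for b, n in top]
    return result

# Hypothetical usage rows.
usage = [
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "a1"), ("c2", "a2"), ("c2", "a3"),
    ("c3", "a1"), ("c3", "a3"),
    ("c4", "a1"), ("c4", "a2"),
]
popular = also_read(usage)
print(popular["a1"])
```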

Delivering Recommendations
Customer views an article online and we are passed their
Customer ID and the Article they are viewing

We then do the following:
1. K-Means – get all items in the same cluster and calculate
Euclidean Distance. Reverse and scale 0-5.
2. Item-Based – get all peer recommendations for this customer
3. Popularity – get all popular recommendations for this article
4. Join the three data sets together, add the final rankings and
bring back the most highly rated articles.

Items recommended by more than 1
algorithm are the most highly rated

[Venn diagram: Item-Based (peers are reading), K-Means (similar), and
Popularity (most popular) overlap; the intersection holds the best
recommendations]
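
A minimal sketch of the final blending step follows, assuming each algorithm has already produced a score per candidate article. The helper names `distance_to_score` and `blend`, the rule for reversing distances (nearest item scores 5, farthest scores 0), and the sample numbers are assumptions; the slides only state that distances are reversed and scaled 0-5 and that the rankings are added.

```python
from collections import defaultdict

def distance_to_score(distances):
    """Reverse Euclidean distances onto a 0 to 5 scale (assumed linear rule)."""
    d_max = max(distances.values())
    if d_max == 0:
        return {item: 5.0 for item in distances}
    return {item: 5.0 * (1 - d / d_max) for item, d in distances.items()}

def blend(*score_sets, top_n=5):
    """Add the scores from each algorithm and return the highest totals."""
    total = defaultdict(float)
    for scores in score_sets:
        for item, score in scores.items():
            total[item] += score
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical scores for the article and customer being served.
similar = distance_to_score({"a2": 0.0, "a3": 1.0, "a4": 2.0})  # K-Means
peer_scores = {"a3": 4.0, "a5": 4.5}                            # Item-Based
popularity = {"a3": 5.0, "a2": 1.0}                             # Popularity
best = blend(similar, peer_scores, popularity)
print(best)
```

Because the scores are summed, an article suggested by more than one algorithm naturally rises to the top, which matches the behavior the slide describes.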

Improvements/Ideas
• Conditionally swap algorithms: Peer recommendations
can be unwieldy for new users
• Allow users to rate how relevant this recommendation is
-> retrain the model
• Play with the weighting of current algorithms, evaluate
others
• Hybrid search platform: Replace or supplement K-Means
with Search platform

MACHINE LEARNING
Grant Ingersoll
Chief Scientist, Lucidworks
Mahout co-founder
Lucene/Solr committer
