From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year
1. From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records
Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
Alex Hasha, Chief Data Scientist, Bundle.com
May 1, 2012
Architects of Fact-Based Decisions™
2. Agenda for Today's Talk
1. The Business Model
2. The Text Analytics Challenge
3. How We Overcame the Challenge
4. Key Takeaways
5. Q&A
3. Introduction
Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)
Responsible for:
• Transforming data into value for clients
• Creating meaningful careers for employees
At a company that:
• Helps clients convert Data to Dollars™
• Brings a strategic perspective to improve ROI on investments in technology, data, people, and processes
Also working on:
• Democratizing Analytics by reducing the "Barrier to Benefit" for non-profits, social entrepreneurs, and government
Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)
Responsible for:
• Leading development of data products
• Designing statistical methods and algorithms that transform data into insights for consumers
At a company that:
• Uses data to help consumers make better decisions with their money
• Bends valuable legacy data to new purposes
• Is growing and hiring!
Also working on:
• Learning about and implementing best practices for managing complex data pipelines
4. The Local Search Business
5. Gaps in Local Search Offerings
• Paid advertisements: not trusted
• User reviews: can be biased (selection bias)
• Can be gamed
• Not personalized (to you)
6. Bundle's Unique Contribution
Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
Example: Credit Card Statement Data
7. A Screen Shot From our Site
8. A Screen Shot From our Site
9. A Screen Shot From our Site
10. We Do This with Billions of Real Spending Records
Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
Example: Credit Card Statement Data
Key issues with this data:
1. Credit card data lacks a merchant identifier
2. So we rely on text analytics to associate transactions with merchants
11. Building our "Version of the Truth" from 3 Sources
Our Transaction Data
  • Pros: Proprietary, Differentiated, Special Sauce
  • Cons: Incomplete, Semi-Structured
Localeze
  • Pros: High Quality, Clean / Verified
  • Cons: Lag / Recency
Factual
  • Pros: Crowd Sourced, Up to the Minute
  • Cons: More variability in quality
12. Data: Not Useful Until Refined.
13. Key Steps in "Refinement" (Transformation)
Old Data:
• Card Transaction Data
• Merchant Listings (e.g., Address, Phone Number, Business Type)
• Other Data: Census, Bureau of Labor Statistics, User Feedback
Transformed in New Ways:
• Normalization
• Clustering
• Linking
• Aggregation
To Create New Features Such As:
• People Who Shop Here Also Like...
• The Bundle Loyalty Score
• Data-Driven Reviews From an Array of Customer Segments
14. Before the Fun Stuff Happens...
Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list: Credit Card Transactions (Billions, ~10^9) are text-matched against a Comprehensive Listing of US Merchants (Tens of Millions, ~10^7).
Two main problems:
1. Accurate fuzzy matching is difficult
   • Highly variable text descriptions
   • Noisy geographic info
   • Noisy merchant category info
2. Scale of data is enormous
   • A naïve item-by-item search takes O(10^16) expensive string comparisons: too slow!
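To make the scale concrete, here is a quick back-of-envelope estimate as a Python sketch; the roughly one millisecond per fuzzy comparison is the illustrative figure used in the talk, not a measured number:

```python
# Rough cost of the naive item-by-item approach (illustrative figures).
transactions = 10**9            # ~1 billion card transactions per year
merchants = 10**7               # ~tens of millions of US merchants
comparisons = transactions * merchants          # ~10^16 fuzzy string comparisons

seconds_per_comparison = 1e-3   # assume ~1 ms per fuzzy comparison
years = comparisons * seconds_per_comparison / (3600 * 24 * 365)
print(f"{comparisons:.0e} comparisons -> about {years:,.0f} years of compute")
# -> roughly 300,000 years: brute force is hopeless at this scale.
```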
15. A "Brute Force" Approach Would Never Work...
1. Matching within tens of millions of merchants nationwide would require massive processing... Fortunately, we don't need to match at this level.
2. By batching at the local-area level, we process orders of magnitude faster.
[Chart: Processing Time / Workload vs. # of Merchants in Comparison Set, rising steeply from Neighborhood (hundreds) to City (hundreds of thousands) to Nation (tens of millions).]
16. Solution to Scaling Problem
This is a "Cascade of Scale Reductions", parallelizing by location.
Keys to solving the scaling problem:
1. Scale reduction / parallelized text clustering
2. Free open source software
The pipeline:
• Credit Card Transactions (Billions, ~10^9)
• Batch transactions by geographic neighborhood
• Dedupe description strings
• Text clustering (not matching): consolidate strings belonging to the same merchant
• Preliminary merchant listing generated directly from transactions (Tens of Millions, ~10^7)
• A secondary fuzzy matching process reconciles the preliminary listings with the merchant "source of truth"
• Final merged transaction data set
Computational efficiency increased by a factor of 10^8! Eons -> Days -> Minutes
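A minimal single-machine sketch of this cascade, with placeholder implementations of the clustering and matching steps (their internals are sketched under the Data Preparation phases below); the field names "desc" and "neighborhood" and both helper functions are illustrative assumptions, not the production code:

```python
from collections import defaultdict

# Hypothetical stand-ins; richer versions are sketched under Phase 1 and Phase 2 below.
def cluster_descriptions(descs):
    """Dedupe and group description strings likely generated by the same merchant."""
    return [[d] for d in sorted(set(descs))]

def fuzzy_match(cluster, local_merchants):
    """Reconcile one cluster against the merchant 'source of truth' for its neighborhood."""
    return local_merchants[0] if local_merchants else None

def cascade(transactions, merchant_listing):
    """Cascade of scale reductions: batch by neighborhood, dedupe/cluster, then match."""
    # 1. Scale reduction: batch transactions by (noisy) geographic neighborhood,
    #    so each batch covers hundreds of merchants instead of tens of millions.
    batches = defaultdict(list)
    for txn in transactions:                      # txn = {"desc": ..., "neighborhood": ...}
        batches[txn["neighborhood"]].append(txn["desc"])

    links = {}
    # 2. Each neighborhood is independent, so this loop is trivially parallelizable.
    for hood, descs in batches.items():
        clusters = cluster_descriptions(descs)    # billions of rows collapse to ~10^7 clusters
        for cluster in clusters:
            merchant = fuzzy_match(cluster, merchant_listing.get(hood, []))
            for desc in cluster:
                links[desc] = merchant
    return links
```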
17. Data Preparation: Phase 1
DAMA Lens:
• Deduping
• Matching (Strings)
• Cleansing
Machine Learning Lens:
• Unsupervised Learning
• Text Clustering
• Pattern Discovery
Example: "Anthonys Restaurant #123 Brkly NY" x 10 -> "Anthony's Restaurant"
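A minimal sketch of what this phase might look like, using Python's standard-library difflib as a stand-in for the real clustering method; the normalization rules and the 0.8 similarity threshold are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher

def normalize(desc):
    """Strip store numbers, punctuation, and extra whitespace from a raw description."""
    desc = desc.upper()
    desc = re.sub(r"#\s*\d+", " ", desc)      # drop store numbers like "#123"
    desc = re.sub(r"[^A-Z0-9 ]", " ", desc)   # drop stray punctuation
    return re.sub(r"\s+", " ", desc).strip()

def cluster(descriptions, threshold=0.8):
    """Greedy single-pass clustering of normalized description strings."""
    clusters = []
    for desc in map(normalize, descriptions):
        for group in clusters:
            if SequenceMatcher(None, desc, group[0]).ratio() >= threshold:
                group.append(desc)
                break
        else:
            clusters.append([desc])
    return clusters

# Ten noisy variants of the same merchant collapse into a single cluster.
raw = ["Anthonys Restaurant #123 Brkly NY"] * 10 + ["Anthony's Restaurant Brooklyn NY"]
print(cluster(raw))
```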
18. Data Preparation: Phase 2
DAMA Lens:
• Deduping
• Record Linkage
• Data Quality Enhancement
Machine Learning Lens:
• Information Retrieval
• More Cleansing
• Data Enrichment
• Supervised Classifier
Process:
• Search retrieves the top 10 possible matches
• A classifier is applied to each and returns a confidence score
• If confidence is high, the records are linked (+30%)
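A minimal sketch of this retrieve-then-classify pattern, again using difflib as a stand-in both for the search index and for the supervised classifier's confidence score; the 0.9 threshold and the example strings are illustrative assumptions:

```python
from difflib import SequenceMatcher

def top_candidates(query, merchant_names, k=10):
    """Retrieval step: return the k most similar listed merchants (stand-in for a real search index)."""
    return sorted(merchant_names,
                  key=lambda name: SequenceMatcher(None, query.upper(), name.upper()).ratio(),
                  reverse=True)[:k]

def confidence(query, candidate):
    """Stand-in for the supervised classifier's confidence score (0 to 1)."""
    return SequenceMatcher(None, query.upper(), candidate.upper()).ratio()

def link(query, merchant_names, threshold=0.9):
    """Link a preliminary listing entry to a merchant only when confidence is high."""
    for candidate in top_candidates(query, merchant_names):
        if confidence(query, candidate) >= threshold:
            return candidate
    return None   # low confidence: better to leave unlinked than to guess

print(link("ANTHONYS RESTAURANT BROOKLYN NY",
           ["Anthony's Restaurant, Brooklyn NY", "Antonio's Pizza, Brooklyn NY"]))
```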
19. Takeaways
1. Tame your data before perfecting your methods: efficiency enables experimentation, iteration, and improvement.
2. Design your process to minimize unnecessary complexity (e.g., parallel processing at scale, normalization, pre-filtering).
3. Tools: take advantage of powerful (and inexpensive) open-source tools that enable your process...
Editor's Notes
Jaime intro. Alex intro: Thanks Jaime. Since Jaime has already introduced me, I'll introduce Bundle. Bundle is a company that uses data to help consumers make better decisions with their money. We do this on the one hand by providing free tools for managing personal financial data. But more to the point of today's talk, we are also mining mountains of credit card transaction data to extract actionable insights for consumers based on the spending behavior of their peers.
First to provide local merchant profiles for consumers that are deeply data-driven. Local search business (Yelp, Citysearch, FourSquare, Google, Bing). The share of local searches on mobile devices is growing very fast; this is a fast-growing sector among data-driven startups. Example: Ted's Montana Grill. Bundle addresses issues with other sites: selection bias (strong opinions over-represented), system gaming (just like SEO; interesting story about "reputation management" companies!), explicit rankings (rank by the actual metrics!).
Alex: So where does text analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. Its primary purpose is interacting with cardholders and generating statements, and not surprisingly it's formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It's semi-structured, but lacks a consistent format. Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business. AH: Just some background here: in the credit card industry there are "acquiring banks", which deal with merchants and process their credit card transactions over various payment networks, and "issuing banks", which issue cards to consumers and manage the generation of statements and billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer- or merchant-focused. We get our data from an issuing bank, so they don't have detailed merchant info beyond what they need to generate statements for cardholders. That is the root of our problem.
Alex: This is a screen shot of our core offering, the Bundle merchant recommender, which aims to help consumers with their most frequent money decisions: where to spend it. Visually, I'm sure you're reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different because it's generated directly from the credit card transactions of over 20 million US households.
Alex: (Review features left to right.) I just wanted to return to this screen shot to highlight the features that are made possible by transforming credit card data in this way. (Loyalty score) Unlike other sites, our star ratings are data driven: we assign each merchant what we call the "Bundle Loyalty Score", which is calculated from the share of wallet a merchant's customers devote to the business and how frequently they return. (Coverage) Because we capture transactions from a broad cross-section of the population, we have data on many small local merchants, not just the popular ones that attract a lot of reviews. (Segments and silent majority) We can break a merchant's customers down into demographic and behavioral segments, to show how well it serves different groups, and which groups it is most popular with. We're capturing information about the silent majority of shoppers, who shop without writing about it online, and we also avoid the common bias on review sites towards extremely positive or extremely negative reviews. (Real price levels) We have rich data about the real range of prices visitors to this merchant are paying, based on real transactions. (Web of merchants) Another unique feature on Bundle is that we can show you what other merchants are popular with customers of this merchant. We're all familiar with "People who bought this also bought" on Amazon and other online marketplaces, but I believe we're the first to take this to the offline marketplace on a massive scale.
(Top 10 possible matches, like a Google search.)
Jaime: Take it back to the audience. A common theme in converting data to dollars is extracting new value from old data by MATCHING it with other preexisting data. No need to dwell on the particulars of Bundle data on this slide, except as an instance of a more general pattern.
JF provides framing: This is a universal problem for companies seeking to convert Data to Dollars; repurposing old data sets often requires matching with other data sets without a common key. AH: It should be clear now how a robust, accurate algorithm for matching text descriptions to merchant listings is a prerequisite for our entire user experience. There are two aspects of this problem that created significant challenges for us. First, there's the basic issue that accurate fuzzy string matching is hard. Our inputs include highly variable transaction descriptions, sometimes dozens or hundreds per merchant, inconsistent coding, error-prone geographic indicators, and noisy merchant category indicators. These give us a lot to go on, but treating any of them as a source of truth gets you in trouble. We're at a text analytics conference, so I don't have to tell you that accurate fuzzy string matching can be hard, especially if supporting data like merchant category and geo information are not 100% reliable. But before we could even begin to attack that problem we had to do something about the sheer size of our data set. We receive about 1 billion credit card transactions per year, each of which must be associated with one of tens of millions of merchants in a comprehensive listing. Not that anyone would try this, but a brute-force attempt to take each transaction description and scan through the merchant listing item by item looking for a match would require on the order of 10^16 fuzzy string comparisons. To put that in perspective, if each comparison took about a millisecond, the match would take over 300,000 years to run. Clearly something needs to be done to reduce the scale of the input AND the matching search space. Broadly speaking, we accomplished this by breaking the matching process into two phases, using text clustering in the first phase to dramatically decrease the size of the data set, and then proceeding to a fuzzy match.
This isn't rocket science; there are a handful of obvious places to start simplifying the problem. One key lever is location: if you have a transaction that occurred in New Mexico, it doesn't make sense to include merchants in New York in your search. There are tens of millions of merchants nationally, but only hundreds of thousands in each city, and maybe a thousand at most in each neighborhood. If you can identify the neighborhood of a transaction, and only search the merchants in that neighborhood, the efficiency payoff is huge. This wasn't a completely obvious step for us, though, because as I mentioned before the geographic fields in our transaction data were not 100% reliable. We could identify the city with no problem, but at the neighborhood level there is a significant error rate. We eventually realized we had to ignore all the little complications and, at all costs, reduce the size of our data so we could work with it efficiently. It's worth creating an intermediate data set that's still pretty messy, if you can now load it into R on your laptop and try out a few fuzzy matching experiments in an afternoon.
This slide gives a high-level overview of how we achieved a cascade of scale reductions by batching transactions by neighborhood. Considering each neighborhood in isolation, we dedupe and then cluster transaction strings which are highly likely to be generated by the same merchant. Each of these clusters is assigned a preliminary merchant ID. At this point we have a preliminary merchant listing which still suffers from some of the quality issues of the original data set, but it can provide aggregated transaction data views to inform subsequent matching, and it is on a much more manageable scale. The output of the clustering algorithm feeds into a more resource-intensive fuzzy matching algorithm, which becomes feasible at this scale. Taking this approach on a single machine, we were able to get our processing time down to about a week. However, in startup time a week is not much better than 300K years. Thanks to the revolution in open source parallel computing, we were able to quickly set up a small Hadoop cluster which parallelizes the text clustering operations so all the neighborhoods run at the same time. This brought our processing down to about 20 minutes. While this isn't a complete solution to the initial problem, it vastly increases our capability to experiment with new methods and tweaks to the existing process. So that's a quick and dirty introduction to a part of our technology stack, and now I'll turn it over to Jaime to convert my case study into some high-level takeaways.
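Because the neighborhoods share no state, the per-neighborhood work parallelizes cleanly. A minimal local sketch of that idea, using a Python process pool as a stand-in for the Hadoop cluster described above; the neighborhood names, merchant strings, and the trivial clustering step are all illustrative assumptions:

```python
from multiprocessing import Pool

def process_neighborhood(batch):
    """Dedupe and cluster all transaction strings for one neighborhood (an independent unit of work)."""
    hood, descriptions = batch
    clusters = sorted(set(descriptions))   # trivial stand-in for the real dedupe + clustering logic
    return hood, clusters

if __name__ == "__main__":
    # Illustrative batches keyed by neighborhood.
    batches = {
        "Park Slope": ["ANTHONYS RESTAURANT #123 BRKLY NY"] * 3,
        "Astoria": ["JOES COFFEE 4TH AVE", "JOES COFFEE 4TH AVE"],
    }
    # Neighborhoods are processed concurrently; the production system ran the same
    # per-neighborhood logic on a small Hadoop cluster instead of a local pool.
    with Pool() as pool:
        for hood, clusters in pool.map(process_neighborhood, batches.items()):
            print(hood, clusters)
```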
Robin custbehavior PayComplainPay....then....ST vs LT RecAdvLoyalty
Comments: Consider trade-offs between false positives and false negatives. Related hot/emerging best practices we can mention to frame this: Metrics-Driven Development; Beginning with the End in Mind / Causal Clarity.