From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year
1. From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records
Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
Alex Hasha, Chief Data Scientist, Bundle.com
May 1, 2012
Architects of Fact-Based Decisions™
2. Agenda for Today's Talk
1. The Business Model
2. The Text Analytics Challenge
3. How We Overcame the Challenge
4. Key Takeaways
5. Q&A
3. Introduction
Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)
Responsible for:
• Transforming data into value for clients
• Creating meaningful careers for employees
At a company that:
• Helps clients convert Data to Dollars™
• Brings a strategic perspective to improve ROI on investments in technology, data, people, and processes
Also working on:
• Democratizing Analytics by reducing the "Barrier to Benefit" for non-profits, social entrepreneurs, and government
Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)
Responsible for:
• Leading development of data products
• Designing statistical methods and algorithms that transform data into insights for consumers
At a company that:
• Uses data to help consumers make better decisions with their money
• Bends valuable legacy data to new purposes
• Is growing and hiring!
Also working on:
• Learning about and implementing best practices for managing complex data pipelines
4. The Local Search Business
5. Gaps in Local Search Offerings
• Paid advertisements: not trusted
• User reviews: can be biased (selection bias)
• Can be gamed
• Not personalized (to you)
6. Bundle's Unique Contribution
Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
Example: Credit Card Statement Data
7. A Screen Shot From our Site
8. A Screen Shot From our Site
9. A Screen Shot From our Site
10. We Do This with Billions of Real Spending Records
Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households.
Example: Credit Card Statement Data
Key issues with this data:
1. Credit card data lacks a merchant identifier
2. So we rely on text analytics to associate transactions with merchants
11. Building our "Version of the Truth" from 3 Sources
Our Transaction Data
  • Pros: Proprietary, Differentiated, Special Sauce
  • Cons: Incomplete, Semi-Structured
Localeze
  • Pros: High Quality, Clean / Verified
  • Cons: Lag / Recency
Factual
  • Pros: Crowd Sourced, Up to the Minute
  • Cons: More variability in quality
12. Data: Not Useful Until Refined.
13. Key Steps in "Refinement" (Transformation)
Old Data:
• Card Transaction Data
• Merchant Listings (e.g., Address, Phone Number, Business Type)
• Other Data: Census, Bureau of Labor Statistics, User Feedback
Transformed in New Ways:
• Normalization
• Clustering
• Linking
• Aggregation
To Create New Features Such As:
• People Who Shop Here Also Like...
• The Bundle Loyalty Score
• Data-Driven Reviews From an Array of Customer Segments
14. Before the Fun Stuff Happens...
Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list: Credit Card Transactions (Billions, ~10^9) are text-matched against a Comprehensive Listing of US Merchants (Tens of Millions, ~10^7).
Two main problems:
1. Accurate fuzzy matching is difficult
   • Highly variable text descriptions
   • Noisy geographic info
   • Noisy merchant category info
2. Scale of data is enormous
   • A naïve item-by-item search takes O(10^16) expensive string comparisons: too slow!
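To make the scale concrete, here is a quick back-of-envelope estimate as a Python sketch; the roughly one millisecond per fuzzy comparison is the illustrative figure used in the talk, not a measured number:

```python
# Rough cost of the naive item-by-item approach (illustrative figures).
transactions = 10**9            # ~1 billion card transactions per year
merchants = 10**7               # ~tens of millions of US merchants
comparisons = transactions * merchants          # ~10^16 fuzzy string comparisons

seconds_per_comparison = 1e-3   # assume ~1 ms per fuzzy comparison
years = comparisons * seconds_per_comparison / (3600 * 24 * 365)
print(f"{comparisons:.0e} comparisons -> about {years:,.0f} years of compute")
# -> roughly 300,000 years: brute force is hopeless at this scale.
```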
15. A "Brute Force" Approach Would Never Work...
1. Matching within tens of millions of merchants nationwide would require massive processing... Fortunately, we don't need to match at this level.
2. By batching at the local-area level, we process orders of magnitude faster.
[Chart: Processing Time / Workload vs. # of Merchants in Comparison Set, rising steeply from Neighborhood (hundreds) to City (hundreds of thousands) to Nation (tens of millions).]
16. Solution to Scaling Problem
This is a "Cascade of Scale Reductions", parallelizing by location.
Keys to solving the scaling problem:
1. Scale reduction / parallelized text clustering
2. Free open source software
The pipeline:
• Credit Card Transactions (Billions, ~10^9)
• Batch transactions by geographic neighborhood
• Dedupe description strings
• Text clustering (not matching): consolidate strings belonging to the same merchant
• Preliminary merchant listing generated directly from transactions (Tens of Millions, ~10^7)
• A secondary fuzzy matching process reconciles the preliminary listings with the merchant "source of truth"
• Final merged transaction data set
Computational efficiency increased by a factor of 10^8! Eons -> Days -> Minutes
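A minimal single-machine sketch of this cascade, with placeholder implementations of the clustering and matching steps (their internals are sketched under the Data Preparation phases below); the field names "desc" and "neighborhood" and both helper functions are illustrative assumptions, not the production code:

```python
from collections import defaultdict

# Hypothetical stand-ins; richer versions are sketched under Phase 1 and Phase 2 below.
def cluster_descriptions(descs):
    """Dedupe and group description strings likely generated by the same merchant."""
    return [[d] for d in sorted(set(descs))]

def fuzzy_match(cluster, local_merchants):
    """Reconcile one cluster against the merchant 'source of truth' for its neighborhood."""
    return local_merchants[0] if local_merchants else None

def cascade(transactions, merchant_listing):
    """Cascade of scale reductions: batch by neighborhood, dedupe/cluster, then match."""
    # 1. Scale reduction: batch transactions by (noisy) geographic neighborhood,
    #    so each batch covers hundreds of merchants instead of tens of millions.
    batches = defaultdict(list)
    for txn in transactions:                      # txn = {"desc": ..., "neighborhood": ...}
        batches[txn["neighborhood"]].append(txn["desc"])

    links = {}
    # 2. Each neighborhood is independent, so this loop is trivially parallelizable.
    for hood, descs in batches.items():
        clusters = cluster_descriptions(descs)    # billions of rows collapse to ~10^7 clusters
        for cluster in clusters:
            merchant = fuzzy_match(cluster, merchant_listing.get(hood, []))
            for desc in cluster:
                links[desc] = merchant
    return links
```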
17. Data Preparation: Phase 1
DAMA Lens:
• Deduping
• Matching (Strings)
• Cleansing
Machine Learning Lens:
• Unsupervised Learning
• Text Clustering
• Pattern Discovery
Example: "Anthonys Restaurant #123 Brkly NY" x 10 -> "Anthony's Restaurant"
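A minimal sketch of what this phase might look like, using Python's standard-library difflib as a stand-in for the real clustering method; the normalization rules and the 0.8 similarity threshold are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher

def normalize(desc):
    """Strip store numbers, punctuation, and extra whitespace from a raw description."""
    desc = desc.upper()
    desc = re.sub(r"#\s*\d+", " ", desc)      # drop store numbers like "#123"
    desc = re.sub(r"[^A-Z0-9 ]", " ", desc)   # drop stray punctuation
    return re.sub(r"\s+", " ", desc).strip()

def cluster(descriptions, threshold=0.8):
    """Greedy single-pass clustering of normalized description strings."""
    clusters = []
    for desc in map(normalize, descriptions):
        for group in clusters:
            if SequenceMatcher(None, desc, group[0]).ratio() >= threshold:
                group.append(desc)
                break
        else:
            clusters.append([desc])
    return clusters

# Ten noisy variants of the same merchant collapse into a single cluster.
raw = ["Anthonys Restaurant #123 Brkly NY"] * 10 + ["Anthony's Restaurant Brooklyn NY"]
print(cluster(raw))
```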
18. Data Preparation: Phase 2
DAMA Lens:
• Deduping
• Record Linkage
• Data Quality Enhancement
Machine Learning Lens:
• Information Retrieval
• More Cleansing
• Data Enrichment
• Supervised Classifier
Process:
• Search retrieves the top 10 possible matches
• A classifier is applied to each and returns a confidence score
• If confidence is high, the records are linked (+30%)
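A minimal sketch of this retrieve-then-classify pattern, again using difflib as a stand-in both for the search index and for the supervised classifier's confidence score; the 0.9 threshold and the example strings are illustrative assumptions:

```python
from difflib import SequenceMatcher

def top_candidates(query, merchant_names, k=10):
    """Retrieval step: return the k most similar listed merchants (stand-in for a real search index)."""
    return sorted(merchant_names,
                  key=lambda name: SequenceMatcher(None, query.upper(), name.upper()).ratio(),
                  reverse=True)[:k]

def confidence(query, candidate):
    """Stand-in for the supervised classifier's confidence score (0 to 1)."""
    return SequenceMatcher(None, query.upper(), candidate.upper()).ratio()

def link(query, merchant_names, threshold=0.9):
    """Link a preliminary listing entry to a merchant only when confidence is high."""
    for candidate in top_candidates(query, merchant_names):
        if confidence(query, candidate) >= threshold:
            return candidate
    return None   # low confidence: better to leave unlinked than to guess

print(link("ANTHONYS RESTAURANT BROOKLYN NY",
           ["Anthony's Restaurant, Brooklyn NY", "Antonio's Pizza, Brooklyn NY"]))
```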
19. Takeaways
1. Tame your data before perfecting your methods: efficiency enables experimentation, iteration, and improvement.
2. Design your process to minimize unnecessary complexity (e.g., parallel processing at scale, normalization, pre-filtering).
3. Tools: take advantage of powerful (and inexpensive) open-source tools that enable your process...
Editor's Notes
Jaime intro. Alex intro: Thanks Jaime. Since Jaime has already introduced me, I'll introduce Bundle. Bundle is a company that uses data to help consumers make better decisions with their money. We do this on the one hand by providing free tools for managing personal financial data. But more to the point of today's talk, we are also mining mountains of credit card transaction data to extract actionable insights for consumers based on the spending behavior of their peers.
First to provide local merchant profiles for consumers that are deeply data-driven. Local search business (Yelp, Citysearch, FourSquare, Google, Bing). The share of local searches on mobile devices is growing very fast; this is a fast-growing sector among data-driven startups. Example: Ted's Montana Grill. Bundle addresses issues with other sites: selection bias (strong opinions over-represented), system gaming (just like SEO; interesting story about "reputation management" companies!), explicit rankings (rank by the actual metrics!).
Alex: So where does text analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. Its primary purpose is interacting with cardholders and generating statements, and not surprisingly it's formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It's semi-structured, but lacks a consistent format. Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business. AH: Just some background here: in the credit card industry there are "acquiring banks", which deal with merchants and process their credit card transactions over various payment networks, and "issuing banks", which issue cards to consumers and manage the generation of statements and billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer- or merchant-focused. We get our data from an issuing bank, so they don't have detailed merchant info beyond what they need to generate statements for cardholders. That is the root of our problem.
Alex: This is a screen shot of our core offering, the Bundle merchant recommender, which aims to help consumers with their most frequent money decisions: where to spend it. Visually, I'm sure you're reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different because it's generated directly from the credit card transactions of over 20 million US households.
Alex: (Review features left to right.) I just wanted to return to this screen shot to highlight the features that are made possible by transforming credit card data in this way. (Loyalty score) Unlike other sites, our star ratings are data driven: we assign each merchant what we call the "Bundle Loyalty Score", which is calculated from the share of wallet a merchant's customers devote to the business and how frequently they return. (Coverage) Because we capture transactions from a broad cross-section of the population, we have data on many small local merchants, not just the popular ones that attract a lot of reviews. (Segments and silent majority) We can break a merchant's customers down into demographic and behavioral segments, to show how well it serves different groups, and which groups it is most popular with. We're capturing information about the silent majority of shoppers, who shop without writing about it online, and we also avoid the common bias on review sites towards extremely positive or extremely negative reviews. (Real price levels) We have rich data about the real range of prices visitors to this merchant are paying, based on real transactions. (Web of merchants) Another unique feature on Bundle is that we can show you what other merchants are popular with customers of this merchant. We're all familiar with "People who bought this also bought" on Amazon and other online marketplaces, but I believe we're the first to take this to the offline marketplace on a massive scale.
(Top 10 possible matches, like a Google search.)
Jaime: Take it back to the audience. A common theme in converting data to dollars is extracting new value from old data by MATCHING it with other preexisting data. No need to dwell on the particulars of Bundle data on this slide, except as an instance of a more general pattern.
JF provides framing: This is a universal problem for companies seeking to convert Data to Dollars; repurposing old data sets often requires matching with other data sets without a common key. AH: It should be clear now how a robust, accurate algorithm for matching text descriptions to merchant listings is a prerequisite for our entire user experience. There are two aspects of this problem that created significant challenges for us. First, there's the basic issue that accurate fuzzy string matching is hard. Our inputs include highly variable transaction descriptions, sometimes dozens or hundreds per merchant, inconsistent coding, error-prone geographic indicators, and noisy merchant category indicators. These give us a lot to go on, but treating any of them as a source of truth gets you in trouble. We're at a text analytics conference, so I don't have to tell you that accurate fuzzy string matching can be hard, especially if supporting data like merchant category and geo information are not 100% reliable. But before we could even begin to attack that problem we had to do something about the sheer size of our data set. We receive about 1 billion credit card transactions per year, each of which must be associated with one of tens of millions of merchants in a comprehensive listing. Not that anyone would try this, but a brute-force attempt to take each transaction description and scan through the merchant listing item by item looking for a match would require on the order of 10^16 fuzzy string comparisons. To put that in perspective, if each comparison took about a millisecond, the match would take over 300,000 years to run. Clearly something needs to be done to reduce the scale of the input AND the matching search space. Broadly speaking, we accomplished this by breaking the matching process into two phases, using text clustering in the first phase to dramatically decrease the size of the data set, and then proceeding to a fuzzy match.
This isn't rocket science; there are a handful of obvious places to start simplifying the problem. One key lever is location: if you have a transaction that occurred in New Mexico, it doesn't make sense to include merchants in New York in your search. There are tens of millions of merchants nationally, but only hundreds of thousands in each city, and maybe a thousand at most in each neighborhood. If you can identify the neighborhood of a transaction, and only search the merchants in that neighborhood, the efficiency payoff is huge. This wasn't a completely obvious step for us, though, because as I mentioned before the geographic fields in our transaction data were not 100% reliable. We could identify the city with no problem, but at the neighborhood level there is a significant error rate. We eventually realized we had to ignore all the little complications and, at all costs, reduce the size of our data so we could work with it efficiently. It's worth creating an intermediate data set that's still pretty messy, if you can now load it into R on your laptop and try out a few fuzzy matching experiments in an afternoon.
This slide gives a high-level overview of how we achieved a cascade of scale reductions by batching transactions by neighborhood. Considering each neighborhood in isolation, we dedupe and then cluster transaction strings which are highly likely to be generated by the same merchant. Each of these clusters is assigned a preliminary merchant ID. At this point we have a preliminary merchant listing which still suffers from some of the quality issues of the original data set, but it can provide aggregated transaction data views to inform subsequent matching, and it is on a much more manageable scale. The output of the clustering algorithm feeds into a more resource-intensive fuzzy matching algorithm, which becomes feasible at this scale. Taking this approach on a single machine, we were able to get our processing time down to about a week. However, in startup time a week is not much better than 300K years. Thanks to the revolution in open source parallel computing, we were able to quickly set up a small Hadoop cluster which parallelizes the text clustering operations so all the neighborhoods run at the same time. This brought our processing down to about 20 minutes. While this isn't a complete solution to the initial problem, it vastly increases our capability to experiment with new methods and tweaks to the existing process. So that's a quick and dirty introduction to a part of our technology stack, and now I'll turn it over to Jaime to convert my case study into some high-level takeaways.
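Because the neighborhoods share no state, the per-neighborhood work parallelizes cleanly. A minimal local sketch of that idea, using a Python process pool as a stand-in for the Hadoop cluster described above; the neighborhood names, merchant strings, and the trivial clustering step are all illustrative assumptions:

```python
from multiprocessing import Pool

def process_neighborhood(batch):
    """Dedupe and cluster all transaction strings for one neighborhood (an independent unit of work)."""
    hood, descriptions = batch
    clusters = sorted(set(descriptions))   # trivial stand-in for the real dedupe + clustering logic
    return hood, clusters

if __name__ == "__main__":
    # Illustrative batches keyed by neighborhood.
    batches = {
        "Park Slope": ["ANTHONYS RESTAURANT #123 BRKLY NY"] * 3,
        "Astoria": ["JOES COFFEE 4TH AVE", "JOES COFFEE 4TH AVE"],
    }
    # Neighborhoods are processed concurrently; the production system ran the same
    # per-neighborhood logic on a small Hadoop cluster instead of a local pool.
    with Pool() as pool:
        for hood, clusters in pool.map(process_neighborhood, batches.items()):
            print(hood, clusters)
```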
Robin custbehavior PayComplainPay....then....ST vs LT RecAdvLoyalty
Comments: Consider trade-offs between false positives and false negatives. Related hot/emerging best practices we can mention to frame this: Metrics-Driven Development; Beginning with the End in Mind / Causal Clarity.