From Big Legacy Data to Insight: Lessons Learned Creating
New Value from a Billion Low Quality Records

Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.
Alex Hasha, Chief Data Scientist, Bundle.com

May 1, 2012


                                             Architects of Fact-Based Decisions™
Agenda for Today's Talk




                          1.       The Business Model


                          2.       The Text Analytics Challenge


                          3.       How We Overcame the Challenge


                          4.       Key Takeaways


                          5.       Q&A




From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records © 2012 Fitzgerald Analytics, Inc. All Rights Reserved   2
Introduction

        Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald)

          Responsible for…
            • Transforming data into value for clients
            • Creating meaningful careers for employees
          At a company that
            • Helps clients convert Data to Dollars™
            • Brings a strategic perspective to improve ROI on investments in
              technology, data, people, and processes
          Also working on
            • Working to Democratize Analytics by Reducing the "Barrier to Benefit"
              for non-profits, social entrepreneurs, and gov't

        Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha)

          Responsible for…
            • Leading development of data products
            • Designing statistical methods / algorithms that transform data into
              insights for consumers
          At a company that
            • Uses data to help consumers make better decisions with their money
            • Bends valuable legacy data to new purposes
            • Is growing and hiring!
          Also working on
            • Learning about and implementing best practices for managing complex
              data pipelines



The Local Search Business




Gaps in Local Search Offerings


        Paid Advertisement Not Trusted

        User-Reviews Can be Biased

            • Selection Bias
            • Can be Gamed
            • Not Personalized (to you)


Bundle's Unique Contribution
        Unlike other merchant listing sites, our content is based on real credit card
        spending by 20 million households

        Example: Credit Card Statement Data




A Screen Shot From our Site




We Do This with Billions of Real Spending Records
        Unlike other merchant listing sites, our content is based on real credit card
        spending by 20 million households

        Example: Credit Card Statement Data

        Key Issues with this Data:
          1. Credit card data lacks a merchant identifier
          2. So we rely on text analytics to associate transactions with merchants




Building our "Version of the Truth" from 3 sources

        Source                  Pros                      Cons
        ----------------------  ------------------------  ---------------------------
        Our Transaction Data    • Proprietary             • Semi-Structured
                                • Differentiated
                                • Special Sauce

        Localeze                • High Quality            • Incomplete
                                • Clean / Verified        • Lag / Recency

        Factual                 • Crowd Sourced           • More variability in quality
                                • Up to the Minute



Data: Not Useful Until Refined.




Key Steps in "Refinement" (Transformation)

        Old Data
            • Card Transaction Data
            • Merchant Listings (e.g., Address, Phone Number, Business Type)
            • Other Data: Census, Bureau of Labor Statistics, User Feedback

        Transformed in New Ways
            • Normalization
            • Clustering
            • Linking
            • Aggregation

        To Create New Features Such As…
            • People Who Shop Here Also Like…
            • The Bundle Loyalty Score
            • Data-Driven Reviews From an Array of Customer Segments



Before the Fun Stuff Happens…
        Before we can generate insights about merchants for our users, we must associate
        each transaction in our database with a specific merchant from a master list…

        Credit Card Transactions (Billions – 10^9)
            • Highly variable text descriptions
            • Noisy geographic info
            • Noisy merchant category info

                  -> Text Matching ->

        Comprehensive Listing of US Merchants (Tens of Millions – 10^7)

        Two main problems:
            1. Accurate Fuzzy Matching is Difficult
            2. Scale of Data is Enormous

        Naïve item by item search takes O(10^16)
        expensive string comparisons: Too Slow!
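
The O(10^16) figure is the product of the two table sizes on the slide. The short Python sketch below reproduces that arithmetic. The per-comparison cost of one microsecond is a hypothetical value chosen for illustration, not a measurement from the talk.

```python
# Order-of-magnitude sizes taken from the slide.
N_TRANSACTIONS = 10**9   # billions of card transactions
N_MERCHANTS = 10**7      # tens of millions of listed merchants

# Hypothetical cost of one fuzzy string comparison (assumption, not measured).
SECONDS_PER_COMPARISON = 1e-6


def naive_comparisons(n_transactions: int, n_merchants: int) -> int:
    """Every transaction is compared against every merchant."""
    return n_transactions * n_merchants


def years_on_one_core(comparisons: int, seconds_each: float) -> float:
    """Wall-clock years if comparisons run one after another on one core."""
    seconds_per_year = 365 * 24 * 3600
    return comparisons * seconds_each / seconds_per_year


total = naive_comparisons(N_TRANSACTIONS, N_MERCHANTS)
print(f"{total:.0e} comparisons")
print(f"{years_on_one_core(total, SECONDS_PER_COMPARISON):.0f} years on one core")
```

Under that assumed cost the naive search would run for centuries on a single core, which is why the comparison set has to shrink before any matching method is tuned.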

A "Brute Force" Approach Would Never Work…

        1. Matching w/in Hundreds of Millions of Merchants would require massive
           processing… Fortunately we don't need to match at this level.
        2. Batching at the local area level, we process orders of magnitude faster.

        [Chart: Processing Time / Workload (0 to 1) vs. # of Merchants in Comparison Set.
         Neighborhood: Hundreds; City: Hundreds of Thousands; Nation: Tens of Millions]
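
Restricting each comparison to a local area is commonly called blocking. The sketch below is a minimal Python illustration of the idea, not the pipeline described in the talk. The records, the `zip` field used as the neighborhood key, and the `difflib` similarity measure are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Toy records; "zip" stands in for whatever neighborhood key is available.
transactions = [
    {"desc": "ANTHONYS RESTAURANT #123", "zip": "11201"},
    {"desc": "JOES PIZZA BKLYN", "zip": "11201"},
    {"desc": "LAKESIDE HARDWARE", "zip": "60614"},
]
merchants = [
    {"name": "Anthony's Restaurant", "zip": "11201"},
    {"name": "Joe's Pizza", "zip": "11201"},
    {"name": "Lakeside Hardware", "zip": "60614"},
    {"name": "Lakeshore Hardware", "zip": "60657"},
]


def group_by_block(records, key="zip"):
    """Bucket records by their blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    return blocks


def match_within_blocks(transactions, merchants):
    """Compare each transaction only to merchants that share its block."""
    merchant_blocks = group_by_block(merchants)
    comparisons = 0
    best = {}  # description -> (merchant name, similarity)
    for txn in transactions:
        for merchant in merchant_blocks.get(txn["zip"], []):
            comparisons += 1
            score = SequenceMatcher(
                None, txn["desc"].lower(), merchant["name"].lower()
            ).ratio()
            if score > best.get(txn["desc"], ("", 0.0))[1]:
                best[txn["desc"]] = (merchant["name"], score)
    return best, comparisons


best, comparisons = match_within_blocks(transactions, merchants)
print(comparisons, "comparisons instead of", len(transactions) * len(merchants))
```

The saving grows with the number of blocks, since each transaction is compared against hundreds of merchants rather than tens of millions. The slides note that geographic info is noisy, so assigning a reliable block key is likely harder in practice than this toy data suggests.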

Solution to Scaling Problem
        This is a "Cascade of Scale Reductions", Parallelizing by Location

        1. Credit Card Transactions (Billions – 10^9)
        2. Batch Transactions by Geographic Neighborhood (batches 1, 2, … 10000)
        3. Dedupe Description Strings
        4. Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant
        5. Preliminary Merchant Listing Generated Directly from Transactions
           (Tens of Millions – 10^7)
        6. Secondary Fuzzy Matching Process Reconciles Preliminary Listings with
           Merchant "Source of Truth"
        7. Final Merged Transaction Data Set

        Keys to solving the scaling problem:
            1. Scale Reduction / Parallelized Text Clustering
            2. Free Open Source Software

        Computational Efficiency Increased by a Factor of 10^8!
        Eons -> Days -> Minutes
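
The dedupe and clustering steps for a single neighborhood batch can be sketched as follows. The slides do not name the clustering algorithm, so this uses a simple greedy pass with `difflib` as a stand-in. The 0.8 threshold and the sample strings are illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def dedupe(descriptions):
    """Collapse identical strings, keeping a count for each distinct one."""
    return Counter(descriptions)


def cluster_strings(distinct, threshold=0.8):
    """Greedy single-pass clustering: a string joins the first cluster whose
    representative is at least `threshold` similar, else it starts a new one.
    The threshold is an illustrative assumption."""
    clusters = []  # list of (representative, members)
    for text in sorted(distinct):
        for representative, members in clusters:
            if SequenceMatcher(None, text, representative).ratio() >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster was close enough.
            clusters.append((text, [text]))
    return clusters


batch = [
    "ANTHONYS RESTAURANT #123",
    "ANTHONYS RESTAURANT #123",
    "ANTHONYS RESTAURNT #123",
    "JOES PIZZA",
    "JOES PIZZA",
]
counts = dedupe(batch)
clusters = cluster_strings(counts)
for representative, members in clusters:
    print(representative, "<-", members)
```

Because each batch is processed without reference to any other, batches could be handed to separate workers, which is one plausible reading of "Parallelizing by Location" on the slide.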

Data Preparation: Phase 1


        DAMA Lens
            • Matching (Strings)

        Deduping X 10, Cleansing

        Machine Learning Lens
            • Unsupervised Learning
            • Text Clustering
            • Pattern Discovery

        Example:
            Anthonys Restaurant #123 Brkly NY  ->  Anthony's Restaurant




Data Preparation: Phase 2


        DAMA Lens
            • Record Linkage
            • Data Quality Enhancement

        • Deduping + 30%
        • More Cleansing
        • Data Enrichment

        Machine Learning Lens
            • Information Retrieval
            • Supervised Classifier

        Search Retrieves Top 10 Possible Matches
        Classifier applied to each, returns confidence score
        If Confidence = High, Records are linked
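
The retrieve-then-classify flow can be sketched in Python as below. This is a hedged illustration only: `difflib.get_close_matches` stands in for the search step, a plain similarity ratio stands in for the supervised classifier, and the 0.85 cut-off and listing names are assumptions rather than values from the talk.

```python
from difflib import SequenceMatcher, get_close_matches

HIGH_CONFIDENCE = 0.85  # illustrative cut-off, not a value from the talk


def retrieve_candidates(query, merchant_names, k=10):
    """Search step: return up to k listing names most similar to the query."""
    return get_close_matches(query, merchant_names, n=k, cutoff=0.0)


def confidence(query, candidate):
    """Stand-in for the supervised classifier: plain string similarity in
    [0, 1]. A trained model would likely use further features as well."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def link(query, merchant_names):
    """Return the best candidate if its score is high, otherwise None."""
    candidates = retrieve_candidates(query, merchant_names)
    scored = [(confidence(query, name), name) for name in candidates]
    if not scored:
        return None
    score, best = max(scored)
    return best if score >= HIGH_CONFIDENCE else None


listing = ["Anthony's Restaurant", "Anthony's Pizzeria", "Joe's Pizza"]
print(link("Anthonys Restaurant", listing))
print(link("Quick Mart", listing))
```

Splitting the work this way keeps the expensive scoring step limited to a handful of candidates per record, while the threshold decides which records are linked and which are left unlinked.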




Takeaways



           1. Tame your data before perfecting your methods.
           Efficiency enables experimentation, iteration, improvement.



           2. Design your process to minimize unnecessary complexity
           (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering)



           3. Tools: Take advantage of powerful (and inexpensive) open-source
           tools that enable your process...


From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records ยฉ 2012 Fitzgerald Analytics, Inc. All Rights Reserved   19

More Related Content

What's hot

Big Data and Analytics
Big Data and AnalyticsBig Data and Analytics
Big Data and Analytics
dmurph4
ย 
Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012
nickychu
ย 
B2Bdatapartners Capabilities
B2Bdatapartners CapabilitiesB2Bdatapartners Capabilities
B2Bdatapartners Capabilities
B2Bdatapartners
ย 
Analytical Revolution
Analytical RevolutionAnalytical Revolution
Analytical Revolution
NedODoherty
ย 
Intel Cloud Summit: Big Data
Intel Cloud Summit: Big DataIntel Cloud Summit: Big Data
Intel Cloud Summit: Big Data
IntelAPAC
ย 
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
Monetizing data  - An Evening with Eight of Chicago's Data Product Management...Monetizing data  - An Evening with Eight of Chicago's Data Product Management...
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
Randy Horton
ย 
Zy Vision Solutions Overview
Zy Vision Solutions OverviewZy Vision Solutions Overview
Zy Vision Solutions Overview
tresag71
ย 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overview
bgoverstreet
ย 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overview
cfsanders
ย 
Intel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick Knupffer
IntelAPAC
ย 

What's hot (20)

Data Discovery for Big Big Insights - Tableau Webinar Slides
Data Discovery for Big Big Insights - Tableau Webinar SlidesData Discovery for Big Big Insights - Tableau Webinar Slides
Data Discovery for Big Big Insights - Tableau Webinar Slides
ย 
Big Data and Analytics
Big Data and AnalyticsBig Data and Analytics
Big Data and Analytics
ย 
The Comprehensive Approach: A Unified Information Architecture
The Comprehensive Approach: A Unified Information ArchitectureThe Comprehensive Approach: A Unified Information Architecture
The Comprehensive Approach: A Unified Information Architecture
ย 
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinseySales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
Sales Growth: Find Big Growth in Big Data - Lattice Engines & McKinsey
ย 
Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012Big Transaction Data - CMG Vegas 2012
Big Transaction Data - CMG Vegas 2012
ย 
Crunching โ€œBig Dataโ€ to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
Crunching โ€œBig Dataโ€ to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...Crunching โ€œBig Dataโ€ to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
Crunching โ€œBig Dataโ€ to Drive 2012 Revenue Growth: The 5 Myths of Sales & Mar...
ย 
B2Bdatapartners Capabilities
B2Bdatapartners CapabilitiesB2Bdatapartners Capabilities
B2Bdatapartners Capabilities
ย 
Analytical Revolution
Analytical RevolutionAnalytical Revolution
Analytical Revolution
ย 
Knowledgelevers expanded
Knowledgelevers expandedKnowledgelevers expanded
Knowledgelevers expanded
ย 
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM USSmarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
Smarter Analytics giver dig indsigt i hele forretningen, Rich Holada, IBM US
ย 
Demystifying BI For Mid-Market Enterprises
Demystifying BI For Mid-Market EnterprisesDemystifying BI For Mid-Market Enterprises
Demystifying BI For Mid-Market Enterprises
ย 
Intel Cloud Summit: Big Data
Intel Cloud Summit: Big DataIntel Cloud Summit: Big Data
Intel Cloud Summit: Big Data
ย 
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
Monetizing data  - An Evening with Eight of Chicago's Data Product Management...Monetizing data  - An Evening with Eight of Chicago's Data Product Management...
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
ย 
Zy Vision Solutions Overview
Zy Vision Solutions OverviewZy Vision Solutions Overview
Zy Vision Solutions Overview
ย 
From Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics AppliedFrom Data Science to Business Value - Analytics Applied
From Data Science to Business Value - Analytics Applied
ย 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overview
ย 
Enfathom Service Overview
Enfathom Service OverviewEnfathom Service Overview
Enfathom Service Overview
ย 
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
Australian CIO Summit 2012: A Strategic Approach To BIG DATA Analytics. Separ...
ย 
Intel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick KnupfferIntel Cloud summit: Big Data by Nick Knupffer
Intel Cloud summit: Big Data by Nick Knupffer
ย 
Aod Narrative
Aod NarrativeAod Narrative
Aod Narrative
ย 

Similar to From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year

Big Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsBig Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability Analytics
DATAVERSITY
ย 
EDF2012 Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
EDF2012   Wolfgang Nimfuehr - Bringing Big Data to the EnterpriseEDF2012   Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
EDF2012 Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
European Data Forum
ย 
Robert LeBlanc - Why Big Data? Why Now?
Robert LeBlanc - Why Big Data? Why Now?Robert LeBlanc - Why Big Data? Why Now?
Robert LeBlanc - Why Big Data? Why Now?
Mauricio Godoy
ย 
Day 2 aziz apj aziz_big_datakeynote_press
Day 2 aziz apj aziz_big_datakeynote_pressDay 2 aziz apj aziz_big_datakeynote_press
Day 2 aziz apj aziz_big_datakeynote_press
IntelAPAC
ย 
01 im overview high level
01 im overview high level01 im overview high level
01 im overview high level
James Findlay
ย 

Similar to From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year (20)

Governing the Data to Dollars Value Chainโ„ข - Sept 2012 NYC Data Governance Co...
Governing the Data to Dollars Value Chainโ„ข - Sept 2012 NYC Data Governance Co...Governing the Data to Dollars Value Chainโ„ข - Sept 2012 NYC Data Governance Co...
Governing the Data to Dollars Value Chainโ„ข - Sept 2012 NYC Data Governance Co...
ย 
Big Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsBig Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability Analytics
ย 
Evaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics PlatformsEvaluating Big Data Predictive Analytics Platforms
Evaluating Big Data Predictive Analytics Platforms
ย 
Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...Big Data in Financial Services: How to Improve Performance with Data-Driven D...
Big Data in Financial Services: How to Improve Performance with Data-Driven D...
ย 
Data Activation For (Not So Much) Dummies
Data Activation For (Not So Much) DummiesData Activation For (Not So Much) Dummies
Data Activation For (Not So Much) Dummies
ย 
Search2012 ibm vf
Search2012 ibm vfSearch2012 ibm vf
Search2012 ibm vf
ย 
EDF2012 Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
EDF2012   Wolfgang Nimfuehr - Bringing Big Data to the EnterpriseEDF2012   Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
EDF2012 Wolfgang Nimfuehr - Bringing Big Data to the Enterprise
ย 
Big Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability AnalyticsBig Data Meets Customer Profitability Analytics
Big Data Meets Customer Profitability Analytics
ย 
The Big Deal About Big Data For Customer Engagement
The Big Deal About Big Data For Customer EngagementThe Big Deal About Big Data For Customer Engagement
The Big Deal About Big Data For Customer Engagement
ย 
Hadoop: What It Is and What It's Not
Hadoop: What It Is and What It's NotHadoop: What It Is and What It's Not
Hadoop: What It Is and What It's Not
ย 
Robert LeBlanc - Why Big Data? Why Now?
Robert LeBlanc - Why Big Data? Why Now?Robert LeBlanc - Why Big Data? Why Now?
Robert LeBlanc - Why Big Data? Why Now?
ย 
Enfathom service overview
Enfathom service overviewEnfathom service overview
Enfathom service overview
ย 
Day 2 aziz apj aziz_big_datakeynote_press
Day 2 aziz apj aziz_big_datakeynote_pressDay 2 aziz apj aziz_big_datakeynote_press
Day 2 aziz apj aziz_big_datakeynote_press
ย 
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
Big Data Meets Social Analytics - IBM Connect 2012 (CN-CC13)
ย 
Valtech - Big Data for marketing (EN)
Valtech - Big Data for marketing (EN)Valtech - Big Data for marketing (EN)
Valtech - Big Data for marketing (EN)
ย 
Scenari evolutivi nello snellimento dei sistemi informativi
Scenari evolutivi nello snellimento dei sistemi informativiScenari evolutivi nello snellimento dei sistemi informativi
Scenari evolutivi nello snellimento dei sistemi informativi
ย 
01 im overview high level
01 im overview high level01 im overview high level
01 im overview high level
ย 
OSC2012: Big Data Using Open Source: Netapp Project - Technical
OSC2012: Big Data Using Open Source: Netapp Project - TechnicalOSC2012: Big Data Using Open Source: Netapp Project - Technical
OSC2012: Big Data Using Open Source: Netapp Project - Technical
ย 
Making Money With Big Data
Making Money With Big DataMaking Money With Big Data
Making Money With Big Data
ย 
Building A Bi Strategy
Building A Bi StrategyBuilding A Bi Strategy
Building A Bi Strategy
ย 

More from Fitzgerald Analytics, Inc.

Profiting from customer profitability + big data fitzgerald analytics
Profiting from customer profitability + big data fitzgerald analyticsProfiting from customer profitability + big data fitzgerald analytics
Profiting from customer profitability + big data fitzgerald analytics
Fitzgerald Analytics, Inc.
ย 
2013 12-05 data-driven innovation - fitzgerald analytics workshop at gilbane ...
2013 12-05 data-driven innovation - fitzgerald analytics workshop at gilbane ...2013 12-05 data-driven innovation - fitzgerald analytics workshop at gilbane ...
2013 12-05 data-driven innovation - fitzgerald analytics workshop at gilbane ...
Fitzgerald Analytics, Inc.
ย 
Analytics in financial services prez behavioral finance + data visualizatio...
Analytics in financial services prez   behavioral finance + data visualizatio...Analytics in financial services prez   behavioral finance + data visualizatio...
Analytics in financial services prez behavioral finance + data visualizatio...
Fitzgerald Analytics, Inc.
ย 
Jaime Fitzgerald on Data-Driven Customer Experience in Financial Services and...
Jaime Fitzgerald on Data-Driven Customer Experience in Financial Services and...Jaime Fitzgerald on Data-Driven Customer Experience in Financial Services and...
Jaime Fitzgerald on Data-Driven Customer Experience in Financial Services and...
Fitzgerald Analytics, Inc.
ย 
Data to Dollarsโ„ข - Practical Analytics in the Big Data Era Jaime Fitzgerald A...
Data to Dollarsโ„ข - Practical Analytics in the Big Data Era Jaime Fitzgerald A...Data to Dollarsโ„ข - Practical Analytics in the Big Data Era Jaime Fitzgerald A...
Data to Dollarsโ„ข - Practical Analytics in the Big Data Era Jaime Fitzgerald A...
Fitzgerald Analytics, Inc.
ย 
Knowledge management for analytic teams jaime fitzgerald and alex hasha - p...
Knowledge management for analytic teams   jaime fitzgerald and alex hasha - p...Knowledge management for analytic teams   jaime fitzgerald and alex hasha - p...
Knowledge management for analytic teams jaime fitzgerald and alex hasha - p...
Fitzgerald Analytics, Inc.
ย 

From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records Per Year

  • 1. From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records. Jaime Fitzgerald, President, Fitzgerald Analytics, Inc.; Alex Hasha, Chief Data Scientist, Bundle.com. May 1, 2012. Architects of Fact-Based Decisions™
  • 2. Agenda for Today's Talk: 1. The Business Model 2. The Text Analytics Challenge 3. How We Overcame the Challenge 4. Key Takeaways 5. Q&A. From Big Legacy Data to Insight: Lessons Learned Creating New Value from a Billion Low Quality Records. © 2012 Fitzgerald Analytics, Inc. All Rights Reserved.
  • 3. Introduction. Jaime Fitzgerald, Founder @ Fitzgerald Analytics (@JaimeFitzgerald). Responsible for: transforming data into value for clients; creating meaningful careers for employees. At a company that: helps clients convert Data to Dollars™; brings a strategic perspective to improve ROI on investments in technology, data, people, and processes. Also working on: working to Democratize Analytics by reducing the "Barrier to Benefit" for non-profits, social entrepreneurs, and gov't. Alex Hasha, Data Scientist @ Bundle Corp (@AlexHasha). Responsible for: leading development of data products; designing statistical methods / algorithms that transform data into insights for consumers. At a company that: uses data to help consumers make better decisions with their money; bends valuable legacy data to new purposes; is growing and hiring! Also working on: learning about and implementing best practices for managing complex data pipelines.
  • 4. The Local Search Business
  • 5. Gaps in Local Search Offerings. Paid Advertisement: Not Trusted. User Reviews: Can be Biased (Selection Bias); Can be Gamed; Not Personalized (to you).
  • 6. Bundle's Unique Contribution. Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households. Example: Credit Card Statement Data.
  • 7. A Screen Shot From our Site
  • 8. A Screen Shot From our Site
  • 9. A Screen Shot From our Site
  • 10. We Do This with Billions of Real Spending Records. Unlike other merchant listing sites, our content is based on real credit card spending by 20 million households. Example: Credit Card Statement Data. Key Issues with this Data: 1. Credit card data lacks merchant identifier. 2. So we rely on text analytics to associate transactions with merchants.
  • 11. Building our "Version of the Truth" from 3 sources: Our Transaction Data, Localeze, Factual. Pros: Proprietary, Differentiated, Special Sauce (our transaction data); Crowd Sourced, Up to the Minute; High Quality, Clean / Verified. Cons: Incomplete, Semi-Structured (our transaction data); More variability in quality; Lag / Recency.
  • 12. Data: Not Useful Until Refined.
  • 13. Key Steps in "Refinement" (Transformation). Old Data: Card Transaction Data; Merchant Listings (e.g., Address, Phone Number, Business Type); Other Data: Census, Bureau of Labor Statistics, User Feedback. Transformed in New Ways: Normalization; Clustering; Linking; Aggregation. To Create New Features Such As…: People Who Shop Here Also Like…; The Bundle Loyalty Score; Data-Driven Reviews From an Array of Customer Segments.
  • 14. Before the Fun Stuff Happens… Before we can generate insights about merchants for our users, we must associate each transaction in our database with a specific merchant from a master list. Two main problems: 1. Accurate Fuzzy Matching is Difficult: highly variable text descriptions; noisy geographic info; noisy merchant category info. 2. Scale of Data is Enormous. Diagram: Credit Card Transactions (Billions – 10^9) → Text Matching → Comprehensive Listing of US Merchants (Tens of Millions – 10^7). Naïve item by item search takes O(10^16) expensive string comparisons: Too Slow!
  • 15. A "Brute Force" Approach Would Never Work… 1. Matching w/in Hundreds of Millions of Merchants would require massive processing… Fortunately we don't need to match at this level. 2. Batching at local area, process orders of magnitude faster. Chart: Processing Time / Workload (0 to 1) against # of Merchants in Comparison Set: Neighborhood (Hundreds), City (Hundreds of Thousands), Nation (Tens of Millions).
  • 16. Solution to Scaling Problem. This is a "Cascade of Scale Reductions", Parallelizing by Location. Keys to solving the scaling problem: 1. Scale Reduction / Parallelized Text Clustering 2. Free Open Source Software. Flow: Credit Card Transactions (Billions – 10^9) → Batch Transactions by Geographic Neighborhood (1, 2, … 10000) → Dedupe Description Strings → Text Clustering (Not Matching): Consolidate Strings Belonging to Same Merchant → Preliminary Merchant Listing Generated Directly from Transactions → Secondary Fuzzy Matching Process Reconciles Preliminary Listings with Merchant "Source of Truth" → Final Merged Transaction Data Set (Tens of Millions – 10^7). Computational Efficiency Increased by a Factor of 10^8! Eons -> Days -> Minutes.
  • 17. Data Preparation: Phase 1. DAMA Lens: Deduping; Matching (Strings); Cleansing. Machine Learning Lens: Unsupervised Learning; Text Clustering; Pattern Discovery. Example: "Anthonys Restaurant #123 Brkly NY" X 10, → "Anthony's Restaurant".
  • 18. Data Preparation: Phase 2. DAMA Lens: Deduping; Record Linkage; More Cleansing; Data Quality Enhancement; Data Enrichment. Machine Learning Lens: Information Retrieval; Supervised Classifier. Flow: Search Retrieves Top 10 Possible Matches → Classifier applied to each, returns confidence score → If Confidence = High, Records are linked. + 30%
  • 19. Takeaways. 1. Tame your data before perfecting your methods. Efficiency enables experimentation, iteration, improvement. 2. Design your process to minimize unnecessary complexity (e.g. Parallel Processing at Scale, Normalization, Pre-Filtering). 3. Tools: Take advantage of powerful (and inexpensive) open-source tools that enable your process...
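
The scale argument on slides 14 and 15 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the one-millisecond cost per comparison is the figure used in the speaker notes, and the figure of about 1,000 merchants per neighborhood is the rough upper bound mentioned there.

```python
# Back-of-the-envelope cost of matching transactions to merchants.
# All names here are hypothetical; the inputs come from the slides and notes.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def comparison_count(n_transactions: int, n_candidates: int) -> int:
    """Number of string comparisons if every transaction is checked
    against every candidate merchant in its comparison set."""
    return n_transactions * n_candidates


def years_to_run(comparisons: int, ms_per_comparison: float = 1.0) -> float:
    """Wall-clock years on one machine at a fixed cost per comparison."""
    seconds = comparisons * ms_per_comparison / 1000
    return seconds / SECONDS_PER_YEAR


# Brute force: ~10^9 transactions against ~10^7 merchants nationwide.
brute_force = comparison_count(10**9, 10**7)

# Blocking by neighborhood: each transaction is compared only against
# the (assumed) ~10^3 merchants in its own neighborhood.
blocked = comparison_count(10**9, 10**3)

print(f"brute force: {brute_force:.0e} comparisons, "
      f"{years_to_run(brute_force):,.0f} years")
print(f"blocked:     {blocked:.0e} comparisons, "
      f"{years_to_run(blocked):,.1f} years")
```

At one millisecond per comparison the brute-force figure comes out at roughly 317,000 years, consistent with the "over 300,000 years" quoted in the talk. Blocking alone removes four orders of magnitude; the remaining reduction in the talk comes from deduping and clustering the transaction strings and from running neighborhoods in parallel.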

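Phase 1 (slides 16 and 17) batches transactions by neighborhood, dedupes the description strings, and clusters strings that likely belong to the same merchant. The talk does not say which clustering algorithm was used, so the following is a minimal sketch under stated assumptions: `normalize`, `cluster_neighborhood`, and `cluster_by_neighborhood` are hypothetical names, the greedy single-pass clustering and the 0.85 similarity threshold are stand-ins, and Python's standard `difflib.SequenceMatcher` replaces whatever string similarity the authors used.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(description: str) -> str:
    """Lowercase, drop apostrophes, replace digits and punctuation
    with spaces, and collapse whitespace."""
    text = description.lower().replace("'", "")
    text = re.sub(r"[^a-z\s]", " ", text)  # store numbers, '#', etc.
    return " ".join(text.split())


def cluster_neighborhood(descriptions, threshold=0.85):
    """Dedupe normalized strings, then greedily assign each one to the
    first cluster whose representative (its first member) is similar
    enough; otherwise start a new cluster."""
    unique = sorted({normalize(d) for d in descriptions})  # dedupe step
    clusters = []
    for text in unique:
        for cluster in clusters:
            if SequenceMatcher(None, text, cluster[0]).ratio() >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def cluster_by_neighborhood(transactions):
    """transactions: iterable of (neighborhood, description) pairs.
    Each neighborhood is clustered independently of the others."""
    batches = defaultdict(list)
    for neighborhood, description in transactions:
        batches[neighborhood].append(description)
    return {name: cluster_neighborhood(ds) for name, ds in batches.items()}
```

Because each neighborhood is processed in isolation, the per-neighborhood calls are independent and could be distributed across workers, which is the property the talk exploits with Hadoop. In this sketch a shared suffix such as the city abbreviation inflates similarity scores, so a real pipeline would probably strip or down-weight location tokens before comparing.
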
Editor's Notes

  1. Jaime intro. Alex intro: Thanks, Jaime. Since Jaime has already introduced me, I'll introduce Bundle. Bundle is a company that uses data to help consumers make better decisions with their money. We do this on the one hand by providing free tools for managing personal financial data. But more to the point of today's talk, we are also mining mountains of credit card transaction data to extract actionable insights for consumers based on the spending behavior of their peers.
  2. First to provide local merchant profiles for consumers that are deeply data-driven. Local Search Business (Yelp, CitiSearch, FourSquare, Google, Bing). % of local searches on mobile devices is growing very fast. Fast-growing sector in data-driven startups. Example: Ted's Montana Grill. Bundle addresses issues with other sites: Selection Bias (strong opinions over-represented); System Gaming (just like SEO; interesting story: "reputation mgt" companies!); Explicit rankings (rank by the actual metrics!).
  3. Alex: So where does Text Analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. Its primary purpose is interacting with card holders and generating statements, and not surprisingly it's formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It's semi-structured, but lacks a consistent format. Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business. AH: Just some background here: In the credit card industry there are "acquiring banks", which deal with merchants and process their credit card transactions over various payment networks, and "issuing banks", which issue cards to consumers and manage the generation of statements and billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer or merchant focused. We get our data from an "issuing" bank, so they don't have detailed merchant info beyond what they need to generate statements for cardholders. That is the root of our problem.
  4. Alex: This is a screen shot of our core offering, the Bundle Merchant recommender, which aims to help consumers with their most frequent money decisions: where to spend it. Visually, I'm sure you're reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different because it's generated directly from the credit card transactions of over 20 million US households.
  5. Alex: (Review features left to right.) I just wanted to return to this screen shot to highlight the features that are made possible by transforming credit card data in this way. (Loyalty score) Unlike other sites, our star ratings are data driven: we assign each merchant what we call the "Bundle Loyalty Score", which is calculated from the share of wallet a merchant's customers devote to the business and how frequently they return. (Coverage) Because we capture transactions from a broad cross-section of the population, we have data on many small local merchants, not just the popular ones that attract a lot of reviews. (Segments and silent majority) We can break a merchant's customers down into demographic and behavioral segments, to show how well it serves different groups, and which groups it is most popular with. We're capturing information about the silent majority of shoppers, who shop without writing about it online, and we also avoid the common bias on review sites towards extremely positive or extremely negative reviews. (Real price levels) We have rich data about the real range of prices visitors to this merchant are paying, based on real transactions. (Web of merchants) Another unique feature on Bundle is that we can show you what other merchants are popular with customers of this merchant. We're all familiar with "People who bought this also bought" on Amazon and other online marketplaces, but I believe we're the first to take this to the offline marketplace on a massive scale.
  6. Alex: This is a screen shot of our core offering, the Bundle Merchant recommender, which aims to help consumers with their most frequent money decisions: where to spend it. Visually, I'm sure you're reminded of user review sites like Yelp or Citysearch, and the purpose, to help you discover great merchants, is similar. Our content, though, is very different because it's generated directly from the credit card transactions of over 20 million US households.
  7. Alex: So where does Text Analytics come into this? As you might imagine, bending old data to a new purpose is fraught with difficulties, because the dataset was designed with different applications in mind. A key problem we faced with our credit card transaction database was that the transaction records lack a merchant identifier. Its primary purpose is interacting with card holders and generating statements, and not surprisingly it's formatted very much like an enormous credit card statement. The merchant name is embedded in a text field, which also contains other information. It's semi-structured, but lacks a consistent format. Clearly, to unlock insights about merchants from this data, we have to associate the transactions with merchants using this text field, so text analytics is absolutely crucial to our business. AH: Just some background here: In the credit card industry there are "acquiring banks", which deal with merchants and process their credit card transactions over various payment networks, and "issuing banks", which issue cards to consumers and manage the generation of statements and billing of individuals. Since the interactions with merchants and consumers are split between two entities, you end up with data sets that are either consumer or merchant focused. We get our data from an "issuing" bank, so they don't have detailed merchant info beyond what they need to generate statements for cardholders. That is the root of our problem.
  8. (Top 10 Possible Matches, Like Google Search)
  9. Jaime: Take it back to audience. A common theme in converting data to dollars is to extract new value from old data by MATCHING with other preexisting data. No need to dwell on particulars of Bundle data on this slide, except as an instance of a more general pattern.
  10. JF provides framing: This is a universal problem for companies seeking to convert Data to Dollars: repurposing old data sets often requires matching with other data sets without a common key. AH: It should be clear now how a robust, accurate algorithm for matching text descriptions to merchant listings is a prerequisite for our entire user experience. There are two aspects of this problem that created significant challenges for us. First, there's the basic issue that accurate fuzzy string matching is hard. Our inputs are highly variable transaction descriptions, sometimes dozens or hundreds per merchant, with inconsistent coding, error-prone geographic indicators, and noisy merchant category indicators. These give us a lot to go on, but treating any of them as a source of truth gets you in trouble. We're at a Text Analytics conference, so I don't have to tell you that accurate fuzzy string matching can be hard, especially if supporting data like merchant category and geo information are not 100% reliable. But before we could even begin to attack that problem we had to do something about the sheer size of our data set. We receive about 1 billion credit card transactions per year, each of which must be associated with one of tens of millions of merchants in a comprehensive listing. Not that anyone would try this, but a brute force attempt to take each transaction description and scan through the merchant listing item by item looking for a match would require on the order of 10^16 fuzzy string comparisons. To put that in perspective, if each comparison took about a millisecond, the match would take over 300,000 years to run. Clearly something needs to be done to reduce the scale of the input AND the matching search space. Broadly speaking, we accomplished this by breaking the matching process into two phases, using text clustering in the first phase to dramatically decrease the size of the data set, and then proceeding to a fuzzy match.
  11. This isn't rocket science; there are a handful of obvious places to start simplifying the problem. One key lever is location: if you have a transaction that occurred in New Mexico, it doesn't make sense to include merchants in New York in your search. There are tens of millions of merchants nationally, but only hundreds of thousands in each city, and maybe a thousand max in each neighborhood. If you can identify the neighborhood of a transaction, and only search the merchants in that neighborhood, the efficiency payoff is huge. This wasn't a completely obvious step for us, though, because as I mentioned before the geographic fields in our transaction data were not 100% reliable. We could identify the city with no problem, but at the neighborhood level there is a significant error rate. But we eventually realized we had to ignore all the little complications and, at all costs, reduce the size of our data so we could work with it efficiently. It's worth creating an intermediate data set that's still pretty messy, if you can now load it into R on your laptop and try out a few fuzzy matching experiments in an afternoon.
  12. This slide gives a high-level overview of how we achieved a cascade of scale reductions by batching transactions by neighborhood. Considering each neighborhood in isolation, we dedupe and then cluster transaction strings which are highly likely to be generated by the same merchant. Each of these clusters is assigned a preliminary merchant ID. At this point we have a preliminary merchant listing which still suffers from some of the quality issues of the original data set, but it can provide aggregated views of the transaction data to inform subsequent matching, and it is on a much more manageable scale. The output of the clustering algorithm feeds into a more resource-intensive fuzzy matching algorithm, which becomes feasible at this scale. Taking this approach on a single machine, we were able to get our processing time down to about a week. However, in startup time a week is not much better than 300K years. Thanks to the revolution in open source parallel computing, we were able to quickly set up a small Hadoop cluster which parallelizes the text clustering operations so all the neighborhoods run at the same time. This brought our processing down to about 20 minutes. While this isn't a complete solution to the initial problem, it vastly increases our capability to experiment with new methods and tweaks to the existing process. So that's a quick and dirty introduction to a part of our technology stack, and now I'll turn it over to Jaime to convert my case study into some high-level takeaways.
  13. Robin custbehavior PayComplainPay....then....ST vs LT RecAdvLoyalty
  14. (Top 10 Possible Matches, Like Google Search)
  15. Comments: Consider trade-offs between false positives and false negatives. Related hot/emerging best practices we can mention to frame this: Metrics-Driven Development; Beginning with the End in Mind / Causal Clarity :)
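
Phase 2 (slide 18) retrieves the top 10 possible matches for each preliminary merchant, applies a classifier that returns a confidence score, and links records only when confidence is high. A minimal sketch of that control flow follows. Everything in it is an assumption made for illustration: the function names are hypothetical, token overlap stands in for the search step, and a plain `difflib` similarity ratio stands in for the trained supervised classifier, whose features and model the talk does not describe.

```python
import heapq
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens; a cheap retrieval score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_candidates(query: str, listing: List[str], k: int = 10) -> List[str]:
    """Retrieval step: keep the k listing entries that score highest
    against the query (stand-in for a real search index)."""
    return heapq.nlargest(k, listing, key=lambda m: token_overlap(query, m))


def confidence(query: str, candidate: str) -> float:
    """Stand-in for the supervised classifier's confidence score."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def link(query: str, listing: List[str], k: int = 10,
         threshold: float = 0.9) -> Optional[Tuple[str, float]]:
    """Return (merchant, score) when the best candidate clears the
    threshold, otherwise None (the record is left unlinked)."""
    candidates = top_candidates(query, listing, k)
    if not candidates:
        return None
    best = max(candidates, key=lambda m: confidence(query, m))
    score = confidence(query, best)
    return (best, score) if score >= threshold else None
```

The `threshold` parameter is where the trade-off between false positives and false negatives from the last note shows up: raising it links fewer records but with fewer wrong links, and lowering it does the opposite.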