Data Platform and Services

  Vipul Sharma and Eyal Reuveni
Agenda


            Eventbrite
           Data Products
           Data Platform
         Recommendations
            Questions
•   A social event ticketing and discovery platform
•   50 millionth ticket sold
•   Revenue doubled YOY
•   180 Employees in SOMA SF
•   Solving significant engineering problems
    • Data
    • Data, Infrastructure, Mobile, Web, Scale, Ops, QA
• Firing on all cylinders and hiring fast
www.eventbrite.com/jobs
Data Products
Eventbrite dataplatform and services - Interest graph based recommendations
Analytics




            • Ad hoc queries by analysts
Fraud and Spam
Data Platform
Hadoop Cluster




•   30 persistent EC2 High-Memory Instances
•   30TB disk with replication factor of 2, ext3 formatted
•   CDH3
•   Fair Scheduler
•   HBase
Infrastructure

• Search
   • Solr
   • Incremental updates, moving towards event-driven
• Recommendation/Graph
   • Hadoop
   • Native Java MapReduce
   • Bash for workflow
• Persistence
   •   MySQL
   •   HDFS
   •   HBase
   •   MongoDB (Investigating Cassandra and Riak)
Infrastructure


• Stream
   • RabbitMQ
   • Internal firehose (investigating Kafka)
• Offline
   •   MapReduce
   •   Streaming
   •   Hive
   •   Hue
Infrastructure - Sqoozie



• Workflow for MySQL imports to HDFS
    • Generate Sqoop commands
    • Run these imports in parallel
•   Transparent to schema changes
•   Include or exclude on column, data types, table level
•   Data type casting: tinyint(1) → Integer
•   Distributed Table Imports
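
The workflow above can be sketched roughly as follows. This is a minimal illustration, not Sqoozie's actual code: the function names, the shape of the `schema` dictionary, and the JDBC URL are assumptions made for the example. The `sqoop import` flags used (`--connect`, `--table`, `--columns`, `--target-dir`, `--map-column-java`) are standard Sqoop options.

```python
from concurrent.futures import ThreadPoolExecutor


def build_sqoop_import(jdbc_url, table, columns, target_root, casts=None):
    """Return the argv list for one `sqoop import` of a single MySQL table."""
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(columns),
        "--target-dir", f"{target_root}/{table}",
    ]
    if casts:
        mapping = ",".join(f"{col}={jtype}" for col, jtype in sorted(casts.items()))
        cmd += ["--map-column-java", mapping]
    return cmd


def plan_imports(jdbc_url, schema, target_root, excluded_tables=(), excluded_columns=()):
    """Generate one command per table.

    `schema` is {table: {column: mysql_type}}. Reading it fresh from MySQL on
    every run is what would make the workflow follow schema changes.
    `excluded_columns` holds (table, column) pairs.
    """
    commands = []
    for table, cols in sorted(schema.items()):
        if table in excluded_tables:
            continue
        kept = [c for c in cols if (table, c) not in excluded_columns]
        # tinyint(1) columns are cast to Integer, as on the slide.
        casts = {c: "Integer" for c in kept if cols[c] == "tinyint(1)"}
        commands.append(build_sqoop_import(jdbc_url, table, kept, target_root, casts))
    return commands


def run_parallel(commands, runner, workers=4):
    """Run the generated imports concurrently; `runner` executes one command."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```

In practice `runner` could be something like `subprocess.run`; it is left as a parameter here so the command generation can be checked without a cluster.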
Infrastructure - Blammo



•   Raw logs are imported to HDFS via Flume
•   Almost real-time: 5 min latency
•   Logs are key-value pairs in JSON
•   Each log producer publishes its schema in YAML
•   Hive schema and schema YAML are kept in sync using Thrift
•   Control exclusion and inclusion
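
As a rough illustration of keeping a Hive table definition aligned with a producer's published schema, the sketch below derives a `CREATE EXTERNAL TABLE` statement and filters a JSON log line from the same field list. It is not Blammo's implementation: the schema is shown as an already-parsed dictionary (standing in for the YAML file), the function names are hypothetical, and the slide's Thrift-based sync is replaced here by plain DDL generation. A real table over JSON logs would also need a JSON SerDe clause, which is omitted.

```python
import json


def included_fields(schema, exclude=()):
    """Apply the inclusion/exclusion rule to a {field: hive_type} schema."""
    return {name: htype for name, htype in schema.items() if name not in exclude}


def hive_ddl(table, schema, location, exclude=()):
    """Build a Hive DDL statement from the producer's schema."""
    cols = ",\n".join(
        f"  {name} {htype}" for name, htype in included_fields(schema, exclude).items()
    )
    return f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\nLOCATION '{location}'"


def project(log_line, schema, exclude=()):
    """Keep only the schema's included fields from one JSON log line."""
    record = json.loads(log_line)
    return {name: record.get(name) for name in included_fields(schema, exclude)}
```

Because both functions read the same schema, adding or excluding a field changes the table definition and the imported records together.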
Recommendations
You will like to attend this event
Recommendation Engines



•   Item Hierarchy (You bought camera so you need batteries - Amazon)
•   Collaborative Filtering - User-User Similarity (People who bought camera also bought batteries - Amazon)
•   Collaborative Filtering - Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)
•   Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK - Facebook, LinkedIn)
•   Interest Graph Based (Your friends who like rock music like you are attending Eric Clapton Event - Eventbrite)
Why Interest?




  Events are Social          Events are Interest




Dense Graph is Irrelevant
                            Interests are Changing
How do we know your Interest?


• We ask you
• Based on your activity
   • Events Attended
   • Events Browsed
• Facebook Interests
   • User Interest has to match Event category
   • Static
• Machine Learning
   • Logistic Regression using MLE
   • Sparse Matrix is generated using MapReduce
   • A model for each interest
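
A minimal sketch of the machine-learning step, assuming each user is a sparse feature dictionary and each interest gets its own binary labels. It fits logistic regression by stochastic gradient ascent on the log-likelihood, which is one common way to do maximum-likelihood estimation; the slides do not say which optimizer was used, and the feature names here are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, row):
    """Probability that the user has the interest; `row` is {feature: value}."""
    return sigmoid(bias + sum(weights.get(f, 0.0) * v for f, v in row.items()))


def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Maximize the log-likelihood with per-example gradient steps."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for row, y in zip(rows, labels):
            err = y - predict(weights, bias, row)  # d(log-likelihood)/d(logit)
            bias += lr * err
            for f, v in row.items():  # only non-zero features are touched
                weights[f] = weights.get(f, 0.0) + lr * err * v
    return weights, bias


def fit_per_interest(rows, labels_by_interest):
    """Train one independent model per interest."""
    return {
        interest: fit_logistic(rows, labels)
        for interest, labels in labels_by_interest.items()
    }
```

Storing rows as dictionaries keeps the sparse matrix sparse: a user who attended three events contributes three entries, however many features exist overall.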
Model Based vs Clustering

            Item-Item vs User-User

     Building Social Graph is Clustering Step

Social Graph Recommendation is a Ranking Problem
Implicit Social Graph


                                 U1


                            E1        E4

                  U2                       U3


             E2        E3

        U4                       U5
Mixed Social Graph


                                U1


                           E1

                 U2                  U3


            E2        E3
                                          FB
       U4                       U5
                                          LI
15M * 260 * 260 ≈ 1.01 Trillion Edges
               4 billion edges ranked
   Each node is a feature vector representing a User

Each edge is a feature vector representing a Relationship
Feature Generation

•   Mixed Features
•   A series of map-reduce jobs
•   Output on HDFS in flat files; Input to subsequent jobs
•   Orders = Event → Attendees
    • MAP: eid: uid
    • REDUCE: eid:[uid]
• Attendees → Social Graph
    • Input: eid:[uid]
    • MAP: uid_i:[uid]
    • REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining etc
• Upload feature values to HBase
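
The two jobs listed above can be simulated in memory to show how the keys flow. This is a sketch under stated assumptions: `run_job` is a toy stand-in for a MapReduce job, not Hadoop code, and the order records in the usage note are one reading of the implicit social graph diagram.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Toy MapReduce job: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: orders -> attendees per event (MAP eid: uid, REDUCE eid: [uid])
def map_order(order):
    eid, uid = order
    yield eid, uid


def reduce_attendees(eid, uids):
    return eid, sorted(set(uids))


# Job 2: attendees per event -> neighbors per user
def map_attendees(event):
    eid, uids = event
    for uid in uids:
        # MAP uid_i: [every other attendee of the same event]
        yield uid, [other for other in uids if other != uid]


def reduce_neighbors(uid, co_attendee_lists):
    neighbors = set()
    for others in co_attendee_lists:
        neighbors.update(others)
    return uid, sorted(neighbors)
```

For example, with orders `("E1","U1")`, `("E1","U2")`, `("E4","U1")`, `("E4","U3")`, running `run_job(orders, map_order, reduce_attendees)` and feeding the result to `run_job(..., map_attendees, reduce_neighbors)` links U1 to U2 and U3, because U1 shared an event with each of them.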
U1




U2        U3
HBase
HBase




• Collect data from multiple Map Reduce jobs
   • Stores entire social graph
   • Over one million writes per second
HBase




    rowid     neighbors   events   featureX
    2718282   101         3        0.3678795
HBase




rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
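
The second table appears to use one row per user, with a group of columns per neighbor named `<neighbor id>:<feature>`. The sketch below shows that layout with plain dictionaries. It is an interpretation, not the authors' schema: reading `n`, `e` and `fx` as per-edge counterparts of the `neighbors`, `events` and `featureX` columns is an assumption, and no HBase client API is used.

```python
def wide_row(edge_features):
    """Flatten {neighbor_id: {feature: value}} into one wide row.

    Each cell is keyed by the qualifier '<neighbor_id>:<feature>'.
    """
    row = {}
    for neighbor, features in sorted(edge_features.items()):
        for name, value in features.items():
            row[f"{neighbor}:{name}"] = value
    return row


def edge(row, neighbor):
    """Recover the feature vector of one edge from the wide row."""
    prefix = f"{neighbor}:"
    return {q[len(prefix):]: v for q, v in row.items() if q.startswith(prefix)}
```

With this layout, everything known about a user's relationships sits in a single row, so ranking that user's edges needs one row read rather than one lookup per neighbor.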
Tips & Tricks




• Distributed cache database
   • Sped up some Map Reduce jobs by hours
   • Be sure to use counters!
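
One way to read this tip is as a map-side join: a small lookup table is shipped to every task so the join happens in the mapper instead of in a separate reduce phase, and counters record how often the lookup hits or misses. The sketch below is a plain-Python illustration of that idea; the class, the lookup contents, and the counter names are hypothetical, and `collections.Counter` stands in for Hadoop job counters.

```python
from collections import Counter


class JoinMapper:
    def __init__(self, lookup):
        self.lookup = lookup       # small table made available to every task
        self.counters = Counter()  # stand-in for Hadoop job counters

    def map(self, record):
        """Join one (uid, eid) record against the in-memory lookup."""
        uid, eid = record
        category = self.lookup.get(eid)
        if category is None:
            # Count the miss instead of failing, so it shows up in job stats.
            self.counters["lookup_miss"] += 1
            return
        self.counters["joined"] += 1
        yield uid, category
```

Without the counters, records dropped by a stale or incomplete lookup table would disappear silently from the job output.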
Tips & Tricks




• Hive (ab)uses
   •   Almost as many Hive jobs as custom ones
   •   “flip join”
   •   Statistical functions using Hive
   •   UDF
Tips & Tricks


•   Memory Memory Memory
•   LZO, WAL
•   Combiners are great until
•   Shuffle and Sorting stage
•   Hadoop ecosystem is still new
Questions?
