How Klout is changing the
landscape of social media with
Hadoop and BI
Dave Mariani
VP Engineering, Klout


Denny Lee
Principal Program Manager
Microsoft
Klout uses Big Data to unify the social web
Klout’s Big Data makes all this possible


   15 Social Networks Processed Every Day
   120 Terabytes of Data Storage
   200,000 Indexed Users Added Every Day
   140,000,000 Users Indexed Every Day
   1,000,000,000 Social Signals Processed Every Day
   30,000,000,000 API Calls Delivered Every Month
   54,000,000,000 Rows of Data In Klout Data Warehouse
Scenario and Definitions


Project:    Collection of Events
Event:      Captured User Action
Category:   Attribute Type
Property:   Event Attribute

Example: the +K (Add a topic) event

   Category     Properties
   Topic        {Big Data, BI}
   Gender       {Male}
   Location     {Palo Alto}
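
The definitions above map onto a small record type. The following is a minimal sketch of that mapping, not Klout's actual schema: the class name `TrackedEvent`, the field names, and the project name `example-project` are all hypothetical, since the slide does not say which project the +K event belongs to.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrackedEvent:
    """One captured user action inside a project (hypothetical model)."""
    project: str   # Project: a collection of events
    event: str     # Event: the captured user action
    # Category (attribute type) -> Properties (attribute values for this event)
    properties: Dict[str, List[str]] = field(default_factory=dict)


# The +K example from the slide; the project name is a placeholder.
plus_k = TrackedEvent(
    project="example-project",
    event="+K",
    properties={
        "Topic": ["Big Data", "BI"],
        "Gender": ["Male"],
        "Location": ["Palo Alto"],
    },
)

for category, values in plus_k.properties.items():
    print(category, values)
```

Keeping categories as keys and properties as values means a new attribute type can be tracked without changing the record shape, which fits the "flexibly track" goal stated for the Event Tracker.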
Klout Event Tracker
1   Perform A|B Testing of User Flows
2   Optimize User Registration Funnels
3   Monitor consumer engagement & retention (DAUs & MAUs)
4   Flexibly track and report on user-generated events
Klout Event Tracker Requirements

                                             3rd Party Web     Hadoop    BI Query
Requirement                                  Analytics Tools   & Hive    Engines
Capture & store all user and visitor events  No                Yes       No
Integrate internal Klout Data                No                Yes       No
Support queries against granular data        No                Yes       No
Support interactive queries                  Yes               No        Yes
Support 3rd party BI tools                   No                No        Yes
“Query-able” by custom apps                  No                No        Yes




Klout Data Architecture – The Best Tool for the Job

[Architecture diagram. Stage labels: Store & Enhance, Serve, Analyze. Components shown:]

   Signal Collectors (Java/Scala)
   Data Enhancement Engine (Pig/Hive)
   Data Warehouse (Hive)
   Registrations DB (MySQL)
   Profile DB (HBase)
   Search Index (Elasticsearch)
   Streams (MongoDB)
   Analytics Cube (SSAS)
   Klout API (Scala)
   Klout.com (Node.js)
   Mobile (Objective-C)
   Monitoring (Nagios)
   Dashboards (Tableau)
   Perks Analytics (Scala)
   Event Tracker (Scala)
A Peek into Product Insights >
A|B Test Example for Viral Workflow




A Peek into Product Insights >
Projects: Mobile iOS




Projects > Mobile iOS > Scala/JavaScript API

 DB.withConnection("cube")(implicit conn => {
   // T-SQL select wrapped around an MDX query that is passed through openquery
   val sql1 = SQL("""
 select
    [[Date]].[Date]].[Date]].[MEMBER_CAPTION]]] AS date,
    ...
    convert(int, [[Measures]].[Counter]]]) AS cnt
 from openquery(productinsight, '
      SELECT {[Measures].[Counter]} ON COLUMNS,
      NON EMPTY CROSSJOIN (
      exists([Date].[Date].[Date].allmembers,
      {[Date].[Date].&[""" + dateFormat(past) + """]:...
 ) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
 FROM [ProductInsight]
 ')
 """)

   sql1().iterator.foreach(row => {
     // process row
     val event = row[String]("event")
     // ....
   })
 })
Projects > Mobile iOS > Actual MDX

 SELECT {
       [Measures].[Counter],
       [Measures].[PreviousPeriodCounter]
 } ON COLUMNS,
 NON EMPTY CROSSJOIN (
       exists([Date].[Date].[Date].allmembers,
             [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
       [Events].[Event].[Event].allmembers
 )
 DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
 FROM [ProductInsight]
 WHERE ({[Projects].[Project].[mobile-ios]})
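
The date range in the MDX above is a pair of member keys joined by the range operator `:`. As a rough sketch of how such a range could be generated for a trailing window, the snippet below builds the same string. The function names `date_member` and `date_range` are hypothetical; the deck's own `dateFormat` helper is not shown, so this is only an assumption about what it produces, based on the key format visible in the query.

```python
from datetime import date, timedelta


def date_member(d: date) -> str:
    """Format a date as a [Date].[Date] member key, matching the key style in the slide."""
    return f"[Date].[Date].&[{d.isoformat()}T00:00:00]"


def date_range(end: date, days_back: int) -> str:
    """Build an MDX range from `days_back` days before `end` up to `end`."""
    start = end - timedelta(days=days_back)
    return f"{date_member(start)}:{date_member(end)}"


# The slide's range covers the 14 days ending 2012-06-02.
print(date_range(date(2012, 6, 2), 14))
# prints [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]
```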
Drilling down to the Events >
Query Hive using Excel




Drilling down to the Events > HiveQL Query

CREATE TABLE mobile_ios_details_20120530 AS

SELECT
   get_json_object(json_text,'$.sid') as sid,
   get_json_object(json_text,'$.inc') as inc,
   get_json_object(json_text,'$.status') as status,
   event,
   json_text
FROM bi.event_log
WHERE project="mobile-ios"
   AND dt=20120530
   AND get_json_object(json_text,'$.v')!='1.5'
   AND (event = 'api_error' OR event = 'api_timeout')

DISTRIBUTE BY get_json_object(json_text,'$.sid')
SORT BY get_json_object(json_text,'$.sid') asc
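
To make the filter logic of the query above easier to follow, here is a small in-memory sketch of the same selection in Python. It is an illustration only: the sample rows are invented, `get_json_field` is a rough stand-in for Hive's `get_json_object` limited to top-level scalar keys, and a single sort stands in for the per-reducer `DISTRIBUTE BY` / `SORT BY`.

```python
import json
from typing import Dict, Iterable, List, Optional


def get_json_field(json_text: str, key: str) -> Optional[str]:
    """Return a top-level scalar field as a string, or None if missing or unparsable."""
    try:
        value = json.loads(json_text).get(key)
    except (ValueError, AttributeError):
        return None
    return None if value is None else str(value)


def drill_down(rows: Iterable[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    selected = []
    for row in rows:
        if row["project"] != "mobile-ios" or row["dt"] != "20120530":
            continue
        v = get_json_field(row["json_text"], "v")
        # In SQL, NULL != '1.5' is not true, so rows with no "v" field are dropped too.
        if v is None or v == "1.5":
            continue
        if row["event"] not in ("api_error", "api_timeout"):
            continue
        selected.append({
            "sid": get_json_field(row["json_text"], "sid"),
            "inc": get_json_field(row["json_text"], "inc"),
            "status": get_json_field(row["json_text"], "status"),
            "event": row["event"],
            "json_text": row["json_text"],
        })
    return sorted(selected, key=lambda r: r["sid"] or "")


# Invented sample rows, for illustration only.
rows = [
    {"project": "mobile-ios", "dt": "20120530", "event": "api_error",
     "json_text": '{"sid": "b2", "inc": "1", "status": "500", "v": "1.6"}'},
    {"project": "mobile-ios", "dt": "20120530", "event": "api_timeout",
     "json_text": '{"sid": "a1", "inc": "2", "status": "504", "v": "1.6"}'},
    {"project": "mobile-ios", "dt": "20120530", "event": "api_error",
     "json_text": '{"sid": "c3", "v": "1.5"}'},
    {"project": "mobile-ios", "dt": "20120530", "event": "login",
     "json_text": '{"sid": "d4", "v": "1.6"}'},
]

for r in drill_down(rows):
    print(r["sid"], r["event"], r["status"])
```

Only the first two sample rows survive: the third is excluded by the `$.v` filter and the fourth by the event filter.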
Adhoc Analysis >
Answering Questions on the Fly




Summary


•  Leverage the best tool for the function or job

•  Big Data != Business Intelligence

•  Go open source wherever possible but use commercial
   software when needed




Any Questions? What’s next



