SlideShare una empresa de Scribd logo
1 de 44
Scaling Big Data
Mining Infrastructure


            Jimmy Lin   Dmitriy Ryaboy
            @lintool    @squarecog

            Hadoop Summit Europe
            Thursday, March 21, 2013
From the Ivory Tower…

Source: Wikipedia (All Souls College, Oxford)
… to building sh*t that works.
Source: Wikipedia (Factory)
IMHO
     Represents personal opinion. Not official position of Twitter.
Management not responsible for misuse. Void where prohibited. YMMV.




              (If someone asks, I probably wasn’t here)
“Yesterday”



        ~150 people total
       ~60 Hadoop nodes
~6 people use analytics stack daily
“Today”


           ~1400 people total
10s of Ks of Hadoop nodes, multiple DCs
 10s of PBs total Hadoop DW capacity
          ~100 TB ingest daily
   dozens of teams use Hadoop daily
     10s of Ks of Hadoop jobs daily
Why?


         actionable insights
data     data products
Big data mining
Cool, you get to work on new algorithms!



        No, not really…
Big data mining isn’t mainly
 about data mining per se!
It’s impossible to overstress this: 80%
                                of the work in any data project is in
                                 cleaning the data. – DJ Patil ―Data
                                               Jujitsu‖




Source: Wikipedia (Jujitsu)
Reality
 Your boss says something vague
 You think very hard on how to move the needle
 Where’s the data?
 What’s in this dataset?
 What’s all the f#$!* crap in the data?
 Clean the data
 Run some off-the-shelf data mining algorithm
 …
 Productionize, act on the insight
 Rinse, repeat
Data science is less
     glamorous that you think!


How do we make data scientists’ lives a bit easier?
Gathering
Source: Wikipedia (Logging)
Moving




Source: Wikipedia (Timber rafting)
Organizing




Source: Wikipedia (Logging)
Log directly into a database!

Source: http://www.flickr.com/photos/snukkel/3206405352/
create table `my_audit_log` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `created_at` datetime,
  `user_id` int(11),
  `action` varchar(256),
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


           Don’t do this!
             —   Workload mismatch
             —   Scaling challenges
             —   Overkill
             —   Schema changes
Main Datacenter


                                Scribe
                              Aggregators



                                                HDFS                      Main Hadoop
                                                                              DW

                                   Staging Hadoop Cluster
Datacenter
         Scribe Daemons                     Datacenter
        (Production Hosts)
                           Scribe
                         Aggregators                                   Scribe
                                                                     Aggregators

                                            HDFS
                                                                                      HDFS

                              Staging Hadoop Cluster
                                                                          Staging Hadoop Cluster
     Scribe Daemons
    (Production Hosts)                           Scribe Daemons
                                                (Production Hosts)




                             Use Scribe. or Flume.or Kafka.
Scribe solves log transport only…




        System.out.println
             LOG.info
^(w+s+d+s+d+:d+:d+)s+
([^@]+?)@(S+)s+(S+):s+(S+)s+(S+)
s+((?:S+?,s+)*(?:S+?))s+(S+)s+(S+)
s+[([^]]+)]s+"(w+)s+([^"]*
(?:.[^"]*)*)s+(S+)"s+(S+)s+
(S+)s+"([^"]*(?:.[^"]*)*)
"s+"([^"]*(?:.[^"]*)*)"s*
(d*-[d-]*)?s*(d+)?s*(d*.[d.]*)?
(s+[-w]+)?.*$

  An actual Java regular expression used to
  parse log message at Twitter circa 2010

     Plain-text log messages suck
             Don’t do this!
userid
CamelCase

smallCamelCase    user_id

snake_case

camel_Snake

dunder__snake




             Naming things is hard!
JSON to the Rescue!

Source: http://www.flickr.com/photos/snukkel/3206405352/
This should really be a list…
Remember the camelSnake!
        {
            "token": 945842,
            "feature_enabled": "super_special",
            "userid": 229922,
            "page": "null",             Is this really an integer?
            "info": { "email": "my@place.com" }
        }


                         Is this really null?

    What keys? What values?


                 This does not scale.
struct MessageInfo {
   1: optional string name
   2: optional string email
 }

struct LogMessage {
   1: required i64 token
   2: required string user_id
   3: optional list<Feature> enabled_features
   4: optional i64 page = 0
   5: optional MessageInfo info
 }
                              + DDL provides type safety
 enum Feature {
   super_special,             + Auto codgen
   less_special
 }                            + Efficient serialization

                              +   Sane schema migration
                              +   Separate logical from
                                  physical
            Use Thrift. or Protobufs.or Avro.
Schemas aren’t enough!


We need a data discovery service!
          Where’s the data?
          How do I read it?
          Who produces it?
         Who consumes it?
      When was it last generated?
                  …
Where to find data?

 Old way:
                Hard-coded partitioning scheme, path, format
  A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’
      USING LzoStatusProtobufBlockPigLoader();
                                          Custom loader

 How do people know?            1.) Ask around 2.) Cargo-cult

 New way:
                   Nice UI for browsing
  A = LOAD ‘tables.statuses’
      USING TwadoopLoader(); Same loader each time

  B = FILTER A BY year == ‘2011’
      AND month == ‘11’
      AND day == ‘01’
      AND hour >= ’05’
      AND hour <= ‘07’;
                         Filters are pushed into the loader.
                            No need to understand partitioning
                            scheme.
Data Access Layer (DAL)
―All problems in computer science can be solved by another level
of indirection... Except for the problem of too many layers of
indirection.‖      – David Wheeler


        All data accesses go through DAL
           Thin layer on top of HCatalog
Data Access Layer (DAL)
          Who wrote what data, when?


                      #win
Automatically construct data/job dependency graph
        Automatically figure out ownership

       Hooks into alerting, auditing, repos,
              deploy systems, etc.
Plumbing
                               Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter.
                               SIGMOD 2012.
Source: Wikipedia (Plumbing)
Classification




Source: Wikipedia (Sorting)
label
Given
                    feature vector



                                       ?   features


    features
     features
      features      learner                    classifier
       features
        features



Induce                                            ?   ?




                      Training       Predicting
Stone age machine learning…




Source: Wikipedia (Stonehenge)
upload results


data munging
 Joining multiple dataset
 Feature extraction
 …                                  download
                      down-sample   test data



                            train    predict
What doesn’t work…
 1.   Down-sampling for training on single-processor
         Defeats the whole point of big data!
 2.   Ad hoc productionizing
         Disconnected from rest of production workflow
Production considerations:
    dependency management
           scheduling
       resource allocation
           monitoring
         error reporting
             alerting
                …


        We need…
Seamless scaling




Source: Wikipedia (Galaxy)
Integration with production workflows




Source: Wikipedia (Oil refinery)
Stochastic Gradient Descent
             Conceptually, classifier training is a like
               user-defined aggregate function!


             AVG                     SGD

initialize   sum = 0                 initialize
             count = 0               weights

 update      add to sum
             increment
             count
terminate    return sum / count      return weights
previous Pig dataflow                previous Pig dataflow



             map

Classifier
Training
                    reduce
                                label, feature vector
             Pig storage
                function

                           model                           model     model



                                    feature vector                 feature vector
  Making           model        UDF              model         UDF
Predictions                         prediction                     prediction
                             Just like any other parallel Pig dataflow
It’s just Pig!




                 For ―free‖: dependency management,
                 scheduling, resource allocation,
                 monitoring, error reporting, alerting, …
Source: Wikipedia (Road surface)
Takeaway messages
How do we make data scientists’ lives a bit easier?




     Adding a bit of structure goes a long way
Getting the plumbing right makes all the difference




 ―In theory, there is no difference between
 theory and practice. But, in practice, there
 is.‖
                                                Questions?
                 - Jan L.A. van de Snepscheut

Más contenido relacionado

La actualidad más candente

Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...Edureka!
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkFlink Forward
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouseVianney FOUCAULT
 

La actualidad más candente (20)

Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certificati...
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 

Destacado

Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platformsSigmoid
 
To Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameTo Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameMark Matienzo
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachDataWorks Summit
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterBill Graham
 

Destacado (6)

Composing and scaling data platforms
Composing and scaling data platformsComposing and scaling data platforms
Composing and scaling data platforms
 
To Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to NameTo Hell With Good Intentions: Linked Data and the Power to Name
To Hell With Good Intentions: Linked Data and the Power to Name
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at TwitterHadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter
 

Similar a Scaling Big Data Mining Infrastructure Twitter Experience

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking systemJesse Vincent
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache AccumuloJared Winick
 
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemTreasure Data, Inc.
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoopfvanvollenhoven
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys AdminsPuppet
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...Zhenzhong Xu
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Datacwensel
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 

Similar a Scaling Big Data Mining Infrastructure Twitter Experience (20)

Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking system
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Introduction to Apache Accumulo
Introduction to Apache AccumuloIntroduction to Apache Accumulo
Introduction to Apache Accumulo
 
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics SystemFour Problems You Run into When DIY-ing a “Big Data” Analytics System
Four Problems You Run into When DIY-ing a “Big Data” Analytics System
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Puppet for Sys Admins
Puppet for Sys AdminsPuppet for Sys Admins
Puppet for Sys Admins
 
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
FlinkForward Asia 2019 - Evolving Keystone to an Open Collaborative Real Time...
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Django at Scale
Django at ScaleDjango at Scale
Django at Scale
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 

Último (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Scaling Big Data Mining Infrastructure Twitter Experience

  • 1. Scaling Big Data Mining Infrastructure Jimmy Lin Dmitriy Ryaboy @lintool @squarecog Hadoop Summit Europe Thursday, March 21, 2013
  • 2. From the Ivory Tower… Source: Wikipedia (All Souls College, Oxford)
  • 3. … to building sh*t that works. Source: Wikipedia (Factory)
  • 4. IMHO Represents personal opinion. Not official position of Twitter. Management not responsible for misuse. Void where prohibited. YMMV. (If someone asks, I probably wasn’t here)
  • 5.
  • 6. “Yesterday” ~150 people total ~60 Hadoop nodes ~6 people use analytics stack daily
  • 7. “Today” ~1400 people total 10s of Ks of Hadoop nodes, multiple DCs 10s of PBs total Hadoop DW capacity ~100 TB ingest daily dozens of teams use Hadoop daily 10s of Ks of Hadoop jobs daily
  • 8. Why? actionable insights data data products
  • 9. Big data mining Cool, you get to work on new algorithms! No, not really…
  • 10. Big data mining isn’t mainly about data mining per se!
  • 11. It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data. – DJ Patil ―Data Jujitsu‖ Source: Wikipedia (Jujitsu)
  • 12. Reality Your boss says something vague You think very hard on how to move the needle Where’s the data? What’s in this dataset? What’s all the f#$!* crap in the data? Clean the data Run some off-the-shelf data mining algorithm … Productionize, act on the insight Rinse, repeat
  • 13. Data science is less glamorous that you think! How do we make data scientists’ lives a bit easier?
  • 14.
  • 18. Log directly into a database! Source: http://www.flickr.com/photos/snukkel/3206405352/
  • 19. create table `my_audit_log` ( `id` int(11) NOT NULL AUTO_INCREMENT, `created_at` datetime, `user_id` int(11), `action` varchar(256), ... ) ENGINE=InnoDB DEFAULT CHARSET=utf8; Don’t do this! — Workload mismatch — Scaling challenges — Overkill — Schema changes
  • 20. Main Datacenter Scribe Aggregators HDFS Main Hadoop DW Staging Hadoop Cluster Datacenter Scribe Daemons Datacenter (Production Hosts) Scribe Aggregators Scribe Aggregators HDFS HDFS Staging Hadoop Cluster Staging Hadoop Cluster Scribe Daemons (Production Hosts) Scribe Daemons (Production Hosts) Use Scribe. or Flume.or Kafka.
  • 21. Scribe solves log transport only… System.out.println LOG.info
  • 23. userid CamelCase smallCamelCase user_id snake_case camel_Snake dunder__snake Naming things is hard!
  • 24. JSON to the Rescue! Source: http://www.flickr.com/photos/snukkel/3206405352/
  • 25. This should really be a list… Remember the camelSnake! { "token": 945842, "feature_enabled": "super_special", "userid": 229922, "page": "null", Is this really an integer? "info": { "email": "my@place.com" } } Is this really null? What keys? What values? This does not scale.
  • 26. struct MessageInfo { 1: optional string name 2: optional string email } struct LogMessage { 1: required i64 token 2: required string user_id 3: optional list<Feature> enabled_features 4: optional i64 page = 0 5: optional MessageInfo info } + DDL provides type safety enum Feature { super_special, + Auto codgen less_special } + Efficient serialization + Sane schema migration + Separate logical from physical Use Thrift. or Protobufs.or Avro.
  • 27. Schemas aren’t enough! We need a data discovery service! Where’s the data? How do I read it? Who produces it? Who consumes it? When was it last generated? …
  • 28. Where to find data? Old way: Hard-coded partitioning scheme, path, format A = LOAD ‘/tables/statuses/2011/01/{05,06,07}/*.lzo’ USING LzoStatusProtobufBlockPigLoader(); Custom loader How do people know? 1.) Ask around 2.) Cargo-cult New way: Nice UI for browsing A = LOAD ‘tables.statuses’ USING TwadoopLoader(); Same loader each time B = FILTER A BY year == ‘2011’ AND month == ‘11’ AND day == ‘01’ AND hour >= ’05’ AND hour <= ‘07’; Filters are pushed into the loader. No need to understand partitioning scheme.
  • 29. Data Access Layer (DAL) ―All problems in computer science can be solved by another level of indirection... Except for the problem of too many layers of indirection.‖ – David Wheeler All data accesses go through DAL Thin layer on top of HCatalog
  • 30. Data Access Layer (DAL) Who wrote what data, when? #win Automatically construct data/job dependency graph Automatically figure out ownership Hooks into alerting, auditing, repos, deploy systems, etc.
  • 31. Plumbing Jimmy Lin and Alek Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012. Source: Wikipedia (Plumbing)
  • 33. label Given feature vector ? features features features features learner classifier features features Induce ? ? Training Predicting
  • 34. Stone age machine learning… Source: Wikipedia (Stonehenge)
  • 35. upload results data munging Joining multiple dataset Feature extraction … download down-sample test data train predict
  • 36. What doesn’t work… 1. Down-sampling for training on single-processor  Defeats the whole point of big data! 2. Ad hoc productionizing  Disconnected from rest of production workflow
  • 37. Production considerations: dependency management scheduling resource allocation monitoring error reporting alerting … We need…
  • 39. Integration with production workflows Source: Wikipedia (Oil refinery)
  • 40. Stochastic Gradient Descent Conceptually, classifier training is a like user-defined aggregate function! AVG SGD initialize sum = 0 initialize count = 0 weights update add to sum increment count terminate return sum / count return weights
  • 41. previous Pig dataflow previous Pig dataflow map Classifier Training reduce label, feature vector Pig storage function model model model feature vector feature vector Making model UDF model UDF Predictions prediction prediction Just like any other parallel Pig dataflow
  • 42. It’s just Pig! For ―free‖: dependency management, scheduling, resource allocation, monitoring, error reporting, alerting, …
  • 44. Takeaway messages How do we make data scientists’ lives a bit easier? Adding a bit of structure goes a long way Getting the plumbing right makes all the difference ―In theory, there is no difference between theory and practice. But, in practice, there is.‖ Questions? - Jan L.A. van de Snepscheut