Hadoop and Vertica
The Data Analytics Platform at Twitter
               Bill Graham - @billgraham
     Data Systems Engineer, Analytics Infrastructure
              Hadoop Summit, June 2012
About that pony giveaway...




Outline
  • Architecture
  • Data flow
  • Job coordination
  • Resource management
  • Vertica integration
  • Gotchas
  • Future work




We count things

  • 140 characters
  • 140M active users
  • 400M tweets per day
  • 80-100 TB ingested daily (uncompressed)
  • Tens of thousands of Hadoop jobs daily




Heterogeneous stack
  • Many job execution applications
    • Crane - Java ETL
    • Oink - Pig scheduler
    • Rasvelg - SQL aggregations
    • Scalding - Cascading via Scala
    • PyCascading - Cascading via Python
    • Indexing jobs
  • Our users
    • Analytics, Revenue, Growth, Search, Recommendations, etc.
    • PMs, Sales!


Data flow: Analytics

  [Diagram, built up over several slides; components as labeled]
  • Sources
    • Production Hosts: log events (via Scribe Aggregators) and application data
    • MySQL/Gizzard: social graph, tweets, user profiles
    • Third Party Imports
    • Distributed Crawler
  • Storage
    • HDFS on the Staging Hadoop Cluster
    • Main Hadoop DW
    • HCatalog
    • HBase
    • Vertica
    • MySQL
  • Data movement
    • Log Mover: Staging Hadoop Cluster to Main Hadoop DW
    • Crane, Oink and Rasvelg jobs label the paths between the stores
  • Consumers
    • Analytics Web Tools
    • Analysts, Engineers, PMs, Sales

Chaotic? Actually, no.




System concepts


  • Loose coupling
  • Job coordination as a service
  • Resource management as a service
  • Idempotence




Loose coupling


  • Multiple job frameworks
  • Right tool for the job
  • Common dependency management




Job coordination

  • Shared batch table for job state
    • batch table: (id, description, state, start_time, end_time,
      job_start_time, job_end_time)
  • Access via client libraries
  • Jobs & data are time-based
  • 3 types of preconditions
    1. other job success (e.g., predecessor job complete)
    2. existence of data (e.g., HDFS input exists)
    3. user-defined (e.g., MySQL slave lag)
  • Failed jobs get retried (usually)

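The precondition checks above can be sketched roughly as follows. This is a minimal illustration, not Twitter's client library: an in-memory SQLite table stands in for the shared batch table, and the column types, the 'SUCCESS' state value, the reading of start_time/end_time as the data window, and every function name are assumptions.

```python
import sqlite3

# Column names follow the slide; the SQL types are assumptions.
SCHEMA = """
CREATE TABLE batch (
    id INTEGER PRIMARY KEY,
    description TEXT,
    state TEXT,
    start_time TEXT,
    end_time TEXT,
    job_start_time TEXT,
    job_end_time TEXT
)
"""


def predecessor_succeeded(conn, description, start_time, end_time):
    """Precondition type 1: another job succeeded for the same time window."""
    row = conn.execute(
        "SELECT 1 FROM batch WHERE description = ? AND state = 'SUCCESS' "
        "AND start_time = ? AND end_time = ?",
        (description, start_time, end_time),
    ).fetchone()
    return row is not None


def ready_to_run(conn, predecessors, window, data_exists, user_checks=()):
    """A job may start only when all three precondition types hold."""
    start, end = window
    return (
        all(predecessor_succeeded(conn, p, start, end) for p in predecessors)
        and data_exists()                          # type 2, e.g. an HDFS path check
        and all(check() for check in user_checks)  # type 3, e.g. a replica-lag check
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    window = ("2012-06-14T00:00", "2012-06-15T00:00")
    before = ready_to_run(conn, ["log_mover"], window, lambda: True)
    conn.execute(
        "INSERT INTO batch (description, state, start_time, end_time) "
        "VALUES ('log_mover', 'SUCCESS', ?, ?)",
        window,
    )
    after = ready_to_run(conn, ["log_mover"], window, lambda: True)
    blocked = ready_to_run(conn, ["log_mover"], window, lambda: True, [lambda: False])
    return before, after, blocked
```

Passing the data and user-defined checks as callables keeps the coordination logic independent of any particular job framework, which matches the loose-coupling goal stated earlier.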
Resource management

  • Analytics Resource Manager - ARM!
  • Library above ZooKeeper
  • Throttles jobs and workers
    • Only 1 job of this name may run at once
    • Only N jobs may be run by this app at once
    • Only M mappers may write to Vertica at once

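The throttling rules above amount to counting leases per named resource. The sketch below shows that idea inside a single process only; it does not talk to ZooKeeper, and the class, method and resource names are hypothetical rather than ARM's actual API.

```python
import threading
from contextlib import contextmanager


class ResourceDenied(Exception):
    """Raised when a named resource has no free slots."""


class InMemoryArm:
    """Toy stand-in for ARM: counts leases per resource name in one process.

    The real ARM is described as a library above ZooKeeper; this class only
    mimics the limit check, not the distributed coordination.
    """

    def __init__(self, limits):
        self._limits = dict(limits)  # resource name -> max concurrent holders
        self._held = {name: 0 for name in self._limits}
        self._lock = threading.Lock()

    @contextmanager
    def lease(self, resource):
        with self._lock:
            if self._held[resource] >= self._limits[resource]:
                raise ResourceDenied(resource)
            self._held[resource] += 1
        try:
            yield
        finally:
            with self._lock:
                self._held[resource] -= 1


def demo():
    # Hypothetical limits: 1 job of this name, 2 concurrent Vertica writers.
    arm = InMemoryArm({"job:active_users": 1, "vertica_writers": 2})
    outcomes = []
    with arm.lease("vertica_writers"), arm.lease("vertica_writers"):
        try:
            with arm.lease("vertica_writers"):
                outcomes.append("granted")
        except ResourceDenied:
            outcomes.append("denied")
    with arm.lease("vertica_writers"):  # slots were released above
        outcomes.append("granted")
    return outcomes
```

A denied request raises instead of blocking, so the caller can stay idle and ask again later.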
Job DAG & state transition

  “Local View”
  • Is it time for me to run yet?
  • Are my dependencies satisfied?
  • Any resource constraints?

  [State diagram]
  • Idle → Execution when granted; stays Idle when denied
  • Execution → "Execution complete?": no → keep executing; yes → Completion
  • Completion → insert entry into batch table

  batch table:
  (id, description, state,
   start_time, end_time,
   job_start_time, job_end_time)

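One way to read the state diagram is as a small transition function driven by the three "local view" questions. The sketch below is an interpretation, not the talk's implementation: the state names come from the slide, while the function signature, the 'SUCCESS' value and the return to Idle after Completion are assumptions.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    EXECUTION = "execution"
    COMPLETION = "completion"


def step(state, *, time_to_run=False, deps_satisfied=False,
         resources_granted=False, execution_complete=False, batch_table=None):
    """Advance one job by one transition of the state diagram."""
    if state is State.IDLE:
        # The three "local view" questions gate the move to Execution.
        if time_to_run and deps_satisfied and resources_granted:
            return State.EXECUTION
        return State.IDLE  # denied: stay idle and ask again later
    if state is State.EXECUTION:
        return State.COMPLETION if execution_complete else State.EXECUTION
    if state is State.COMPLETION:
        if batch_table is not None:
            batch_table.append({"state": "SUCCESS"})  # stand-in for the batch-table insert
        return State.IDLE  # assumption: the job then waits for its next time window
    raise ValueError(state)


def demo():
    batch_table = []
    state = State.IDLE
    trace = [state.name]
    inputs = [
        dict(time_to_run=True, deps_satisfied=True, resources_granted=False),  # denied
        dict(time_to_run=True, deps_satisfied=True, resources_granted=True),   # granted
        dict(execution_complete=False),                                        # still running
        dict(execution_complete=True),                                         # finished
        dict(batch_table=batch_table),                                         # record result
    ]
    for kwargs in inputs:
        state = step(state, **kwargs)
        trace.append(state.name)
    return trace, batch_table
```

Each job evaluates only its own inputs here, so no central scheduler needs to know the whole DAG.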
Example: active users

  [Diagram, built up over several slides: data flow alongside its job DAG]
  1. Scribe collects web_events and sms_events from Production Hosts
  2. Log mover copies them (via staging cluster) into Main Hadoop DW
  3. Oink/Pig: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
  4. Oink loads user_sessions from Main Hadoop DW into Vertica
  5. Crane loads user_profiles from MySQL/Gizzard into Vertica
  6. Rasvelg: Join, Group, Count; aggregations:
     - active_by_geo
     - active_by_device
     - active_by_client
     - ...
  7. Crane copies active_by_* from Vertica to MySQL for Analytics Dashboards

  Job DAG nodes: Log mover, Oink, Oink, Crane, Rasvelg, Crane, ...

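The dependencies in this example can be written down as an explicit graph to see which orders of execution are valid. The job names below are hypothetical labels derived from the diagram, and the edges are a reading of it. In the system described, each job checks its own preconditions rather than consulting a central graph like this one.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it waits on (hypothetical names, edges read off the diagram)
DAG = {
    "log_mover:web_events": set(),
    "log_mover:sms_events": set(),
    "oink:build_user_sessions": {"log_mover:web_events", "log_mover:sms_events"},
    "oink:load_user_sessions": {"oink:build_user_sessions"},
    "crane:load_user_profiles": set(),
    "rasvelg:active_by": {"oink:load_user_sessions", "crane:load_user_profiles"},
    "crane:export_active_by": {"rasvelg:active_by"},
}


def run_order(dag):
    """Return one valid execution order: every job appears after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def respects_dependencies(order, dag):
    """Check that each job comes later in the order than everything it waits on."""
    position = {job: i for i, job in enumerate(order)}
    return all(position[dep] < position[job] for job, deps in dag.items() for dep in deps)
```

Several orders are valid; for instance the Crane load of user profiles can run at any point before the Rasvelg aggregation.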
Vertica or Hadoop?
  • Vertica
    • Loads hundreds of thousands of rows per second
    • Aggregates hundreds of millions of rows in seconds
    • Used for low latency queries and aggregations
    • Keeps a sliding window of data
  • Hadoop
    • Excels when data size is massive
    • Flexible and powerful
    • Great with nested data structures and unstructured data
    • Used for complex functions and ML



Vertica import options
  • Direct import via Crane
    • Load into dest table, single thread
  • Atomic import via Crane/Rasvelg
    • Crane loads to temp table, single thread
    • Rasvelg moves to dest table
  • Parallel import via Oink/Pig
    • Pig job via VerticaStorer
    • ARM throttles active DB connections

  [Diagram: MySQL/Gizzard, Main Hadoop DW and Vertica, with Crane, Oink and Rasvelg as the import paths]

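The atomic import option (load to a temp table, then move to the destination) can be illustrated as below. This is a sketch of the pattern only: an in-memory SQLite database stands in for Vertica, one function plays both the Crane and Rasvelg roles, and the table and column names are hypothetical.

```python
import sqlite3


def atomic_load(conn, rows):
    """Stage rows in a temp table, then publish them in a single transaction."""
    # Step 1 (the Crane role): fill the staging table. Readers of the
    # destination table see nothing yet, even if this step fails midway.
    conn.execute("DELETE FROM user_sessions_tmp")
    conn.executemany("INSERT INTO user_sessions_tmp VALUES (?, ?)", rows)
    conn.commit()
    # Step 2 (the Rasvelg role): move everything at once. The connection's
    # context manager commits on success and rolls back on error.
    with conn:
        conn.execute("INSERT INTO user_sessions SELECT * FROM user_sessions_tmp")
        conn.execute("DELETE FROM user_sessions_tmp")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_sessions (user_id INTEGER, session_date TEXT)")
    conn.execute("CREATE TABLE user_sessions_tmp (user_id INTEGER, session_date TEXT)")
    atomic_load(conn, [(1, "2012-06-14"), (2, "2012-06-14")])
    dest = conn.execute("SELECT COUNT(*) FROM user_sessions").fetchone()[0]
    tmp = conn.execute("SELECT COUNT(*) FROM user_sessions_tmp").fetchone()[0]
    return dest, tmp
```

Readers of the destination table see either none of the new rows or all of them, which is the property the direct and parallel options do not give.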
Vertica imports - pros/cons
  • Crane & Rasvelg
    • Good for smaller datasets, DB to DB transfers
    • Single threaded
    • Easy on Vertica
    • Hadoop not required
  • Pig
    • Great for larger datasets
    • More complex, not atomic
    • DDoS potential

VerticaStorer
  • Pig StoreFunc implementation
  • From Vertica’s Hadoop connector suite
  • Out of the box
    • Easy to get Hello World working
    • Well documented
    • Pig/Vertica data bindings work well
    • Fast!
    • Transaction-aware tasks
    • No bugs found
    • Open source?



                                            17
Pig VerticaStorage
  • Our enhancements
    • Connection credential management
    • Truncate before load option
    • Throttle concurrent writers via ZK
  • Future features
    • Counters for rows inserted/rejected
    • Name-based tuple-column bindings
    • Atomic load via temp table
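         -- Disable speculative map tasks so duplicate attempts do not write rows twice
         SET mapred.map.tasks.speculative.execution false;

         user_sessions = LOAD '/processed/user_sessions/2012/06/14';

         -- Arguments: DB config file, database name, ARM resource used for throttling
         STORE user_sessions INTO '{db_schema.user_sessions}' USING
               com.twitter.twadoop.pig.store.VerticaStorage(
               'config/db.yml', 'db_name', 'arm_resource_name');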
                                                                       18
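
To make "throttle concurrent writers" concrete, here is a minimal single-process sketch. It assumes a local semaphore in place of the ZooKeeper-based coordination the slide mentions, and `WriterThrottle` and `run_writers` are hypothetical names, not part of VerticaStorage or ARM.

```python
import threading
from contextlib import contextmanager

class WriterThrottle:
    """Caps how many writers may hold a DB connection at the same time."""

    def __init__(self, max_writers):
        self._slots = threading.BoundedSemaphore(max_writers)
        self._lock = threading.Lock()
        self.active = 0  # writers currently inside a slot
        self.peak = 0    # highest concurrency observed

    @contextmanager
    def slot(self):
        self._slots.acquire()  # blocks until a slot is free
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            yield
        finally:
            with self._lock:
                self.active -= 1
            self._slots.release()

def run_writers(throttle, n_tasks, write):
    """Run n_tasks writer threads; each waits for a throttle slot before writing."""
    def task(task_id):
        with throttle.slot():
            write(task_id)
    threads = [threading.Thread(target=task, args=(i,)) for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

In a real cluster the slot count would have to live in a shared service, since map tasks run on different hosts; the local semaphore only shows the shape of the idea.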
Gotcha #1


  • MR data load is not atomic
    • Avoid partial reads
    • Option 1: load to temp table, then insert direct
    • Option 2: add job dependency concept




                                                         19
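
Option 2 above (a job dependency) can be sketched as a precondition check against shared job state. The `State` enum, the dictionary standing in for the batch table, and the job names are all assumptions made for illustration.

```python
from enum import Enum

class State(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

def can_read(batch_states, loader_job, batch_date):
    """Return True only once the loader for this batch has fully succeeded.

    batch_states maps (job name, batch date) to a State; a missing entry
    means the loader has not started, so readers must keep waiting.
    """
    return batch_states.get((loader_job, batch_date)) is State.SUCCEEDED

# Hypothetical usage: a downstream job checks before querying the table.
states = {("load_user_sessions", "2012-06-14"): State.RUNNING}
ready_while_loading = can_read(states, "load_user_sessions", "2012-06-14")
states[("load_user_sessions", "2012-06-14")] = State.SUCCEEDED
ready_after_success = can_read(states, "load_user_sessions", "2012-06-14")
```

Readers gated this way never see a partially loaded batch, at the cost of waiting for the whole load to finish.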
Gotcha #2



  • Speculative execution is not always your friend
    • Launch more tasks than needed, just in case
    • For non-idempotent jobs, extra tasks == BAD




                                                      20
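
A toy simulation of why extra task attempts hurt non-idempotent loads. It assumes the worst case, where every attempt gets far enough to write its rows; the function names are hypothetical and nothing here is Hadoop API.

```python
def run_task(table, split_rows):
    """A non-idempotent loader task: it appends its rows to the table."""
    table.extend(split_rows)

def run_with_speculation(table, split_rows, attempts):
    """Simulate the framework launching several attempts of the same task."""
    for _ in range(attempts):
        run_task(table, split_rows)

rows = [("user_1", 3), ("user_2", 5)]

single = []
run_with_speculation(single, rows, attempts=1)       # normal execution

speculative = []
run_with_speculation(speculative, rows, attempts=2)  # one extra attempt
```

With one attempt the table holds each row once; with a speculative second attempt every row is duplicated, which is why loader jobs like the one above turn speculative execution off.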
Gotcha #3


  • isIdempotent() must be a first-class concept
    • Loader jobs will fail
    • Failure after first task success == not good
    • Can’t automate retry without cleanup




                                                    21
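
One way to treat idempotence as a first-class job property is sketched below. `LoaderJob` and `run_with_retry` are hypothetical names, and the retry policy shown is an assumption rather than how the talk's job frameworks actually behave.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LoaderJob:
    name: str
    run: Callable[[], None]
    is_idempotent: bool
    cleanup: Optional[Callable[[], None]] = None  # undoes partial writes

def run_with_retry(job, max_attempts=3):
    """Run the job, retrying on failure; return the attempt number that succeeded.

    Non-idempotent jobs are retried only after their cleanup has run.
    Without a cleanup, the first failure is re-raised for a human to handle.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job.run()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            if not job.is_idempotent:
                if job.cleanup is None:
                    raise  # retrying could duplicate rows
                job.cleanup()
```

Making the flag explicit lets the scheduler decide on its own whether an automatic retry is safe.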
Gotcha #4

  • Vendor code only gets you so far
    • Nice to haves == have to write
    • Favor the decorator pattern
    • Pig’s StoreFuncWrapper can help
    • Vendor open sourcing is ideal




                                        22
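
The decorator pattern mentioned above, sketched in Python rather than against Pig's actual Java API. `VendorStorer` is a made-up stand-in for vendor code that cannot be modified, and `TruncatingStorer` adds a truncate-before-load step around it.

```python
class VendorStorer:
    """Stand-in for a vendor-supplied storer that we cannot modify."""

    def __init__(self, table):
        self.table = table  # a list playing the role of the destination table

    def prepare(self):
        pass  # the vendor code does nothing special before loading

    def put(self, row):
        self.table.append(row)

class TruncatingStorer:
    """Decorator: same interface as the wrapped storer, plus truncate-before-load."""

    def __init__(self, inner):
        self._inner = inner

    def prepare(self):
        self._inner.table.clear()  # added behavior: empty the table first
        self._inner.prepare()      # then delegate to the vendor code

    def put(self, row):
        self._inner.put(row)       # pure delegation

# Hypothetical usage: callers see the same prepare()/put() interface.
table = ["stale row"]
storer = TruncatingStorer(VendorStorer(table))
storer.prepare()
storer.put("fresh row")
```

Because the wrapper only delegates, a vendor upgrade can replace the inner class without touching the added behavior.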
Future work
  • More VerticaStorer features
  • Multiple Vertica clusters
  • Atomic DB loads with Pig/Oink
  • Better DAG visibility
  • Better job history visibility
  • MR job optimizations via historic stats
  • HCatalog data registry
  • Job push events


                                              23
Acknowledgements




                   24
Questions?

 Bill Graham - @billgraham




                             25

Hadoop Summit 2012 - Hadoop and Vertica: The Data Analytics Platform at Twitter

  • 1. Hadoop and Vertica The Data Analytics Platform at Twitter Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Hadoop Summit, June 2012
  • 2. About that pony giveaway... 2
  • 3. Outline • Architecture • Data flow • Job coordination • Resource management • Vertica integration • Gotchas • Future work 3
  • 4. We count things • 140 characters • 140M active users • 400M tweets per day • 80-100 TB ingested daily (uncompressed) • 10s of Ks daily Hadoop jobs 4
  • 5. Heterogeneous stack • Many job execution applications • Crane - Java ETL • Oink - Pig scheduler • Rasvelg - SQL aggregations • Scalding - Cascading via Scala • PyCascading - Cascading via Python • Indexing jobs • Our users • Analytics, Revenue, Growth, Search, Recommendations, etc. • PMs, Sales! 5
  • 6. Data flow: Analytics (diagram) • Production Hosts: log events, application data • Scribe Aggregators • Third Party Imports • HDFS • Staging Hadoop Cluster • MySQL/Gizzard: social graph, tweets, user profiles • Main Hadoop DW • HBase • Vertica • MySQL • Analytics Web Tools
  • 7. Data flow: Analytics (same diagram, adding) • Distributed Crawler • Log Mover
  • 8. Data flow: Analytics (same diagram, adding) • Crane
  • 9. Data flow: Analytics (same diagram, adding) • Oink
  • 10. Data flow: Analytics (same diagram, adding) • Rasvelg
  • 11. Data flow: Analytics (same diagram, adding) • Analysts, Engineers, PMs, Sales
  • 12. Data flow: Analytics (same diagram, adding) • HCatalog
  • 14. System concepts • Loose coupling • Job coordination as a service • Resource management as a service • Idempotence
  • 15. Loose coupling • Multiple job frameworks • Right tool for the job • Common dependency management
  • 16. Job coordination • Shared batch table for job state • Access via client libraries • Jobs & data are time-based • 3 types of preconditions 1. other job success (e.g., predecessor job complete) 2. existence of data (e.g., HDFS input exists) 3. user-defined (e.g., MySQL slave lag) • Failed jobs get retried (usually)
  • 17. Job coordination (same slide, adding) • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
  • 18. Job coordination (same slide, adding) • Diagram label: Job
  • 19. Job coordination (same slide, adding) • Diagram label: Data
  • 20. Job coordination (same slide, adding) • Diagram label: ?
  • 21. Resource management • Analytics Resource Manager - ARM! • Library above Zookeeper • Throttles jobs and workers • Only 1 job of this name may run at once • Only N jobs may be run by this app at once • Only M mappers may write to Vertica at once
  • 24. Job DAG & state transition “Local View” • Is it time for me to run yet? • Are my dependencies satisfied? • Any resource constraints?
  • 25. Job DAG & state transition (same slide, adding a state diagram) • States: Idle, Execution, Completion • Resource request: granted / denied • Insert entry into batch table • Complete? yes / no
  • 26. Job DAG & state transition (same slide, adding) • batch table: (id, description, state, start_time, end_time, job_start_time, job_end_time)
  • 27. Example: active users (diagram) • Production Hosts • Main Hadoop DW • MySQL/Gizzard • Vertica • MySQL • Analytics Dashboards
  • 28. Example: active users (same diagram, adding) • Job DAG: Log mover, Log mover • Scribe: web_events, sms_events • Log mover (via staging cluster)
  • 29. Example: active users (same diagram, adding) • Job DAG: Oink • Oink/Pig: Cleanse, Filter, Transform, Geo lookup, Union, Distinct
  • 30. Example: active users (same diagram, adding) • Job DAG: Oink • Oink: user_sessions
  • 31. Example: active users (same diagram, adding) • Job DAG: Crane • Crane: user_profiles
  • 32. Example: active users (same diagram, adding) • Job DAG: Rasvelg • Rasvelg: Join, Group, Count • Aggregations: active_by_geo, active_by_device, active_by_client, ...
  • 33. Example: active users (same diagram, adding) • Job DAG: Crane, ... • Crane: active_by_*
  • 34. Vertica or Hadoop? • Vertica • Loads hundreds of thousands of rows per second • Aggregates hundreds of millions of rows in seconds • Used for low latency queries and aggregations • Keep a sliding window of data • Hadoop • Excels when data size is massive • Flexible and powerful • Great with nested data structures and unstructured data • Used for complex functions and ML
  • 35. Vertica import options • Direct import via Crane • Load into dest table, single thread • Atomic import via Crane/Rasvelg • Crane loads to temp table, single thread • Rasvelg moves to dest table • Parallel import via Oink/Pig • Pig job via VerticaStorer • ARM throttles active DB connections • Diagram: MySQL/Gizzard, Main Hadoop DW, Vertica, Crane, Rasvelg, Oink
  • 36. Vertica imports - pros/cons • Crane & Rasvelg • Good for smaller datasets, DB to DB transfers • Single threaded • Easy on Vertica • Hadoop not required • Pig • Great for larger datasets • More complex, not atomic • DDoS potential • Diagram: MySQL/Gizzard, Main Hadoop DW, Vertica, Crane, Rasvelg, Oink
  • 37. VerticaStorer • PigStorage implementation • From Vertica’s Hadoop connector suite • Out of the box • Easy to get Hello World working • Well documented • Pig/Vertica data bindings work well • Fast! • Transaction-aware tasks • No bugs found • Open source?
  • 38. Pig VerticaStorage • Our enhancements • Connection credential management • Truncate before load option • Throttle concurrent writers via ZK • Future features • Counters for rows inserted/rejected • Name-based tuple-column bindings • Atomic load via temp table
  • 39. Pig VerticaStorage (same slide, adding a Pig example)
        SET mapred.map.tasks.speculative.execution false;
        user_sessions = LOAD '/processed/user_sessions/2012/06/14';
        STORE user_sessions INTO '{db_schema.user_sessions}' USING
            com.twitter.twadoop.pig.store.VerticaStorage(
                'config/db.yml', 'db_name', 'arm_resource_name');
  • 40. Gotcha #1 • MR data load is not atomic • Avoid partial reads • Option 1: load to temp table, then insert direct • Option 2: add job dependency concept
  • 41. Gotcha #2 • Speculative execution is not always your friend • Launches more tasks than needed, just in case • For non-idempotent jobs, extra tasks == BAD
  • 42. Gotcha #3 • isIdempotant() must be a first-class concept • Loader jobs will fail • Failure after first task success == not good • Can’t automate retry without cleanup
  • 43. Gotcha #4 • Vendor code only gets you so far • Nice to haves == have to write • Favor the decorator pattern • Pig’s StoreFuncWrapper can help • Vendor open sourcing is ideal
  • 44. Future work • More VerticaStorer features • Multiple Vertica clusters • Atomic DB loads with Pig/Oink • Better DAG visibility • Better job history visibility • MR job optimizations via historic stats • HCatalog data registry • Job push events
  • 46. Questions? Bill Graham - @billgraham
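
Slides 16-26 describe job coordination from each job's "local view": is it time to run, are the preconditions met, and does the resource manager grant a slot. A minimal Python sketch of that decision is given here as an illustration only, not as Twitter's implementation. `Job`, `ResourceManager`, and `try_run` are hypothetical names; a plain dict stands in for the shared batch table, an in-process counter stands in for the ZooKeeper-backed ARM, and keying predecessor lookups by the same time window is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Set, Tuple

BatchTable = Dict[Tuple[str, datetime], str]  # (job name, time window) -> state


@dataclass
class ResourceManager:
    """Hypothetical stand-in for ARM: caps concurrent holders per resource name."""
    limits: Dict[str, int]
    held: Dict[str, int] = field(default_factory=dict)

    def acquire(self, name: str) -> bool:
        if self.held.get(name, 0) >= self.limits.get(name, 1):
            return False  # denied: limit reached
        self.held[name] = self.held.get(name, 0) + 1
        return True  # granted

    def release(self, name: str) -> None:
        self.held[name] = max(0, self.held.get(name, 0) - 1)


@dataclass
class Job:
    name: str
    window: datetime                 # jobs and data are time-based
    predecessors: List[str]          # precondition 1: other job success
    inputs: List[str]                # precondition 2: existence of data
    checks: List[Callable[[], bool]] = field(default_factory=list)  # precondition 3: user-defined


def try_run(job: Job, now: datetime, table: BatchTable,
            existing_paths: Set[str], arm: ResourceManager) -> str:
    """Make one scheduling attempt and return the resulting state."""
    key = (job.name, job.window)
    if table.get(key) == "complete":
        return "complete"            # already done for this window
    if now < job.window:
        return "idle"                # not time to run yet
    deps_ok = all(table.get((p, job.window)) == "complete" for p in job.predecessors)
    data_ok = all(path in existing_paths for path in job.inputs)
    user_ok = all(check() for check in job.checks)
    if not (deps_ok and data_ok and user_ok):
        return "idle"                # dependencies not satisfied
    if not arm.acquire(job.name):
        return "idle"                # resource request denied
    try:
        table[key] = "running"       # every state change is recorded
        # ... the real work (Pig script, ETL, SQL aggregation) would run here ...
        table[key] = "complete"
    finally:
        arm.release(job.name)
    return table[key]
```

Calling `try_run` repeatedly from a polling loop gives the retry behaviour the slides mention: a job stays idle until all three questions are answered with yes.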

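Gotchas #1 and #3 (slides 40 and 42) suggest loading into a temp table first and cleaning up before any retry. The sketch below illustrates that pattern in Python under loud assumptions: `sqlite3` stands in for Vertica purely so the example is runnable, and the function name, table names, and columns (`user_sessions`, `user_sessions_staging`, `day`, `user_id`, `client`) are hypothetical. It does not use the Vertica connector or any Twitter code.

```python
import sqlite3
from typing import Iterable, List, Tuple

Row = Tuple[int, str]  # (user_id, client); hypothetical columns


def atomic_load(conn: sqlite3.Connection, day: str, rows: Iterable[Row]) -> int:
    """Load one day's rows so that readers never observe a partial load."""
    batch: List[Row] = list(rows)

    # Step 1: write everything to a staging table. Dropping it first is the
    # cleanup that makes a retry after a half-finished attempt safe.
    conn.execute("DROP TABLE IF EXISTS user_sessions_staging")
    conn.execute("CREATE TABLE user_sessions_staging (user_id INTEGER, client TEXT)")
    conn.executemany("INSERT INTO user_sessions_staging VALUES (?, ?)", batch)
    conn.commit()

    # Step 2: publish in a single transaction. `with conn` commits on success
    # and rolls back if any statement raises.
    with conn:
        # Removing the day's existing rows makes a repeated run idempotent.
        conn.execute("DELETE FROM user_sessions WHERE day = ?", (day,))
        conn.execute(
            "INSERT INTO user_sessions (day, user_id, client) "
            "SELECT ?, user_id, client FROM user_sessions_staging",
            (day,),
        )
        conn.execute("DROP TABLE user_sessions_staging")
    return len(batch)
```

Because the destination table only changes inside the final transaction, a failure during staging leaves it untouched, and running the same load twice leaves one copy of the day's rows rather than two.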
Editor's notes

  5. Point out differences more: which ones move from where
  6-11. Describe colo
  13. Point out the develop-your-own-tools pattern more; opt in to common services like screech-owl
  15-18. Expand on the time-based aspect more (jobs and data)
  21-22. Point out that the batch table is updated for all state changes
  23-28. Talk about when we use Vertica and when we use Hadoop
  29. Writes are fast because they bypass the Vertica write buffer (copy direct)
  38. 2 Vertica clusters: one for just queries