HIVE
data warehousing using Hadoop




Facebook Data Team
Motivation

• Structured log and dimension data
  – Well known schemas, different serialization formats (binary/text)
  – Rich data structures – nesting/maps/lists

• Query language over structured data
  – SQL helps in easier adoption by business analysts + reduced learning
    curve for everyone
  – Developers love streaming and direct access to map-reduce
  – Query Language brings together SQL and Streaming

• Data Management
  – Tables/Partitions for easy data addressability
  – Abstractions allow optimizations:
        • Organize data for large joins/sampling
        • Add indices/manage compression/replication transparently
What is HIVE?

[Architecture diagram. Components shown: Mgmt. Web UI; Hive CLI (Browsing, Queries, DDL); Thrift API; MetaStore; Hive QL (Parser, Planner, Execution); SerDe (Thrift, Jute, JSON, ..); Map Reduce; HDFS]
Dealing with Structured Data

• Type system
  – Primitive types
  – Recursively build up using Composition/Maps/Lists
• Generic (De)Serialization Interface (SerDe)
  – To recursively list schema
  – To recursively access fields within a row object
• Serialization families implement interface
  – Thrift (Binary and Delimited Text), RecordIO, JSON/PADS(?)
• XPath like field expressions
  – profiles.network[@is_primary=1].id
• Inbuilt DDL
  – Define schema over delimited text files
  – Leverages Thrift DDL
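
The two SerDe duties listed above (recursively listing a schema, recursively accessing fields) can be pictured with a small Python sketch. It is only an illustration and not Hive's SerDe interface: the row contents, `list_schema`, `get_field`, and the hand-parsed path format are all assumptions made for this example.

```python
from typing import Any

# Hypothetical in-memory row; field names echo the slide's example expression.
ROW = {
    "uid": 42,
    "profiles": {
        "network": [
            {"id": 7, "is_primary": 0},
            {"id": 9, "is_primary": 1},
        ]
    },
}


def list_schema(value: Any) -> Any:
    """Recursively describe a value as a primitive type name, a map, or a list."""
    if isinstance(value, dict):
        return {key: list_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        # Assumes a homogeneous list; an empty list has an unknown element type.
        return [list_schema(value[0])] if value else ["unknown"]
    return type(value).__name__


def get_field(value: Any, path: list) -> Any:
    """Walk a path of steps: a str selects a map key, and a (key, wanted)
    tuple selects the first list element whose `key` equals `wanted`."""
    for step in path:
        if isinstance(step, tuple):
            key, wanted = step
            value = next(item for item in value if item[key] == wanted)
        else:
            value = value[step]
    return value


# Hand-parsed equivalent of profiles.network[@is_primary=1].id
primary_network_id = get_field(
    ROW, ["profiles", "network", ("is_primary", 1), "id"]
)
```

Here the path is supplied already parsed; turning the XPath-like text into such a list is left out of the sketch.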
Data Model

[Diagram. Tables (clicks, views, …) are laid out in HDFS by logical partitioning and then hash partitioning:
    /hive/clicks
    /hive/clicks/ds=2008-03-25
    /hive/clicks/ds=2008-03-25/0
 The MetaStore side shows Schema, Library, #Partitions=32, Sort-key=uid, and Dimensions: IP, userId, AdId.]
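
The path layout in this slide can be sketched in Python as follows. This is a minimal illustration, not Hive's code: the function name `partition_path` and the choice of `uid % num_buckets` as the hash are assumptions, since the slides only state that there are 32 hash partitions.

```python
def partition_path(table: str, ds: str, uid: int, num_buckets: int = 32) -> str:
    """Build the HDFS directory for one row: the logical partition (ds=...)
    comes first, then a hash bucket. Bucketing by `uid % num_buckets` is an
    assumption; the slides do not say which hash function is used."""
    bucket = uid % num_buckets
    return f"/hive/{table}/ds={ds}/{bucket}"


example = partition_path("clicks", "2008-03-25", uid=64)
```

With these assumptions, a click by `uid=64` on 2008-03-25 lands in the same directory as the slide's third example path.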
MetaStore

• Stores Table/Partition properties:
  –   Table schema and SerDe library
  –   Table Location on HDFS
  –   Logical Partitioning keys and types
  –   Sort column
  –   Mapping from columns to well known Dimensions


• Thrift API
  – Current clients in PHP (Web Interface), Python (CLI), Java (Query
    Engine), Perl (Tests)
• Stores all properties in text files
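
As a rough picture of what such a text-file record could hold, here is a hedged Python sketch. The class `TableProperties`, the `key=value` layout, and every sample value (column types, the `thrift` SerDe name) are assumptions for illustration; the slides do not describe the actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class TableProperties:
    """Hypothetical record of the properties the slide lists for one table."""
    name: str
    schema: dict            # column name -> type name
    serde_library: str
    location: str           # table location on HDFS
    partition_keys: dict    # logical partitioning key -> type name
    sort_column: str
    dimensions: dict = field(default_factory=dict)  # column -> dimension

    def to_text(self) -> str:
        """Serialize to key=value lines (an assumed format)."""
        def join(mapping: dict) -> str:
            return ",".join(f"{key}:{value}" for key, value in mapping.items())

        return "\n".join([
            f"name={self.name}",
            f"schema={join(self.schema)}",
            f"serde={self.serde_library}",
            f"location={self.location}",
            f"partition_keys={join(self.partition_keys)}",
            f"sort_column={self.sort_column}",
            f"dimensions={join(self.dimensions)}",
        ])


clicks = TableProperties(
    name="clicks",
    schema={"uid": "long", "adid": "long", "ip": "string"},
    serde_library="thrift",
    location="/hive/clicks",
    partition_keys={"ds": "string"},
    sort_column="uid",
    dimensions={"uid": "userId", "adid": "AdId", "ip": "IP"},
)
properties_text = clicks.to_text()
```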
Hive CLI

• Implemented in Python
  – uses MetaStore Thrift API
• DDL:
  – create table/drop table/rename table
  – alter table add column etc.
• Browsing:
  – show tables
  – describe table
  – cat table
• Loading Data
  – load data inpath <path1, …> into table <tablename/partition-spec>
    [bucketed <N> ways by <dimension>]
• Queries
  – Issue queries in Hive QL.
Hive Query Language

• Philosophy
  – SQL like constructs + Hadoop Streaming


• Query Operators in initial version
  –   Projections
  –   Equijoins and Cogroups
  –   Group by
  –   Sampling


• Output of these operators can be:
  – passed to Streaming mappers/reducers
  – stored in another Hive Table
  – written to HDFS files
Hive Query Language

• Package these capabilities into a more formal SQL like query language
  in next version
• Introduce other important constructs:
  –   Views
  –   Multi table inserts
  –   Order bys
  –   Select distincts
  –   SQL like column expressions
  –   A bunch of other builtin functions
• Still work in progress
Query Language - Examples

• Multi table inserts

  FROM ad_impressions_stg imps
   INSERT INTO ad_legals/ds=2008-03-08 select imps.* where imps.legal = 1
   INSERT INTO ad_non_legals/ds=2008-03-08 select imps.* where imps.legal = 0


• Joins

 FROM ad_impressions imps, ad_dimensions ads
  INSERT INTO ad_legals_joined select imps.*, ads.campaignid
             JOIN ON(imps.adid, ads.adid)
             WHERE imps.legal = 1
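
One way to read the multi table insert above is "scan the source once, route each row to every output whose filter accepts it". The Python sketch below illustrates that reading only; `multi_table_insert`, the sample rows, and the use of in-memory lists as outputs are assumptions, and it says nothing about how Hive plans the query.

```python
def multi_table_insert(rows, routes):
    """Scan `rows` once and append each row to every output whose
    predicate accepts it. `routes` maps an output name to a predicate."""
    outputs = {name: [] for name in routes}
    for row in rows:
        for name, predicate in routes.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs


# Hypothetical staging rows standing in for ad_impressions_stg.
impressions = [
    {"adid": 1, "uid": 10, "legal": 1},
    {"adid": 2, "uid": 11, "legal": 0},
    {"adid": 1, "uid": 12, "legal": 1},
]

result = multi_table_insert(
    impressions,
    {
        "ad_legals/ds=2008-03-08": lambda row: row["legal"] == 1,
        "ad_non_legals/ds=2008-03-08": lambda row: row["legal"] == 0,
    },
)
```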
Query Language - Examples

• Group By

 FROM ad_legals_joined imps
   INSERT INTO hdfs://hadoop001:9000/user/ads/adid_uu_summary
               select imps.adid, count_distinct(imps.uid)
               group by(imps.adid)
   INSERT INTO hdfs://hadoop001:9000/user/ads/campaignid_uu_summary
               select imps.campaignid, count_distinct(imps.uid)
               group by(imps.campaignid)
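
What `count_distinct` with a group by computes can be shown with a short Python sketch. The helper name `count_distinct_by` and the sample rows are assumptions for illustration; the sketch runs in memory and does not model the map-reduce stages.

```python
from collections import defaultdict


def count_distinct_by(rows, group_key, value_key):
    """Return {group value: number of distinct `value_key` values}."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_key]].add(row[value_key])
    return {group: len(values) for group, values in seen.items()}


# Hypothetical rows standing in for ad_legals_joined.
joined = [
    {"adid": 1, "campaignid": 100, "uid": 10},
    {"adid": 1, "campaignid": 100, "uid": 10},  # repeat view by the same user
    {"adid": 1, "campaignid": 100, "uid": 12},
    {"adid": 2, "campaignid": 100, "uid": 10},
]

adid_uu_summary = count_distinct_by(joined, "adid", "uid")
campaignid_uu_summary = count_distinct_by(joined, "campaignid", "uid")
```

The two calls mirror the two INSERT clauses: the same input feeds one summary per grouping column.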
Query Language – HadoopStreaming

• APPLY ON TABLE

 CREATE OPERATOR filter_legal using 'exec://filter_legal.py'
        (ts date, adid long, uid long)

 FROM (APPLY filter_legal ON TABLE ad_impressions)
   INSERT INTO ad_legals where ts >= '2008-03-11' and ts < '2008-03-12'


• APPLY can also be applied after JOIN as reducer script

 FROM ad_impressions imps, ad_dimensions ads
      INSERT INTO ad_legals_joined select imps.*, ads.campaignid
                  JOIN ON(imps.adid, ads.adid)
                  APPLY filter_legal BEFORE OUTPUT
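
The slides do not show `filter_legal.py` itself. A streaming-style script of that kind might look roughly like the sketch below, which reads tab-separated rows and writes the three declared output columns. The input column order (ts, adid, uid, legal) and the `legal == "1"` test are assumptions made for this example.

```python
import sys


def filter_legal(lines):
    """Yield 'ts<TAB>adid<TAB>uid' for each input row whose legal flag is '1'.
    Assumed input layout (not given in the slides): tab-separated
    ts, adid, uid, legal."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        ts, adid, uid, legal = fields
        if legal == "1":
            yield f"{ts}\t{adid}\t{uid}"


def main():
    """Entry point for streaming use: rows arrive on stdin, results go to stdout."""
    for out in filter_legal(sys.stdin):
        print(out)
```

Run as a script, it would end with a call to `main()`; the call is left out here so the function can be imported and tested on its own.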
Query Language – Views

• Used for expressing
  – Union alls
  – APPLY operators


• Example

 CREATE VIEW actions
 SELECT photo_views.*
 UNION ALL
 SELECT video_views.*
 UNION ALL
 SELECT profile_views.* …
Hive Usage in Facebook

• Applications:
  – Summarization
       • E.g., daily/weekly aggregations of impression/click counts
  – Ad hoc Analysis
       • E.g., how many group admins broken down by state/country
  – Data Mining (Assembling training data)
       • E.g., user engagement as a function of user attributes
• Usage statistics:
  – Total Users: ~40 (about 25% of engineering!)
  – Hive Data (compressed): 22 TB total, ~200 GB incoming per day
  – Jobs over last 7 days:
        • Total Jobs: 3514, Projections: 821, Joins: 152, Aggregates: 800,
          Loaders: 600
     * Aggregates biased because of multi-stage map-reduce
Conclusion

• Release to Open Source in 3-4 months
• People:
  –   Suresh Anthony (suresh@facebook.com)
  –   Jeff Hammerbacher (jeffh@)
  –   Joydeep Sarma (jssarma@)
  –   Ashish Thusoo (athusoo@)
  –   Pete Wyckoff (pwyckoff@)

Facebook hadoop-summit
