Coordinating the Many Tools of Big Data
Strata 2013

Alan F. Gates
@alanfgates




Big Data = Terabytes, Petabytes, …




Image Credit: Gizmodo
             © Hortonworks 2013
But It Is Also Complex Algorithms
• An example from a talk by Jimmy Lin at Hadoop Summit
  2012 on calculations Twitter is doing via UDFs in Pig.
  This equation uses stochastic gradient descent to do
  machine learning with their data:



   w^(t+1) = w^(t) − γ^(t) ∇ℓ(f(x; w^(t)), y)
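The update rule above can be sketched in a few lines of Python. This is only an illustration, not Twitter's Pig UDF code: the linear model f(x; w) = w·x, the squared loss, the decaying step size, and the toy data are all assumptions chosen to keep the example small.

```python
import random


def sgd_step(w, x, y, lr):
    """Apply one update w <- w - lr * gradient for a single sample (x, y)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))   # f(x; w), assumed linear
    err = pred - y
    grad = [err * xi for xi in x]                 # gradient of 0.5 * (f(x; w) - y)^2
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def train(samples, dim, epochs=500, lr0=0.1, decay=0.001, seed=0):
    """Run SGD over shuffled samples with a step size that shrinks over time."""
    rng = random.Random(seed)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x, y in order:
            lr = lr0 / (1.0 + decay * t)          # step size gamma(t), assumed schedule
            w = sgd_step(w, x, y, lr)
            t += 1
    return w


# Hypothetical noise-free data generated from y = 2*x1 - 1*x2.
SAMPLES = [
    ((1.0, 0.0), 2.0),
    ((0.0, 1.0), -1.0),
    ((1.0, 1.0), 1.0),
    ((2.0, 1.0), 3.0),
]

weights = train(SAMPLES, dim=2)
print(weights)
```

Each sample moves the weights a small step against the gradient of its own loss, which is why the method fits naturally into a per-record UDF.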




And New Tools
• Apache Hadoop brings with it a large selection of tools
  and paradigms
   – Apache HBase, Apache Cassandra – Distributed, high volume
     reads and writes of individual data records
   – Apache Hive – SQL
   – Apache Pig, Cascading – Data flow programming for ETL, data
     modeling, and exploration
   – Apache Giraph – Graph processing
   – MapReduce – Batch processing
   – Storm, S4 – Stream processing
   – Plus lots of commercial offerings




Pre-Cloud: One Tool per Machine
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g.
  SAS).



[Diagram: separate systems, each with its own tools: Data Warehouse, Data Mart, Statistical Analysis, Cube/MOLAP, OLTP]



Cloud: Many Tools One Platform
   • Users no longer want to be concerned with which platform their data is on – they
     just want to apply the tool to it
   • SQL is no longer the only or primary data access tool

[Diagram: the same systems (Data Warehouse, Data Mart, Statistical Analysis, Cube/MOLAP, OLTP) brought together on one platform]




Upside – Pick the Right Tool for the Job




Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user-defined function interfaces




Downside – Wasted Developer Time
• Wastes developer time, since each tool supplies
  redundant functionality


[Diagram: the Pig and Hive component stacks side by side, with the overlap between them highlighted]
   Pig:  Parser, Optimizer, Physical Planner, Executor
   Hive: Parser, Metadata, Optimizer, Physical Planner, Executor
Conclusion: We Need Services
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense




Hadoop = Distributed Data Operating System

Service                        Hadoop Component
Table Management               Hive
Access To Metadata             HCatalog
User authentication            Knox
Resource management            YARN
Notification                   HCatalog
REST/Connectors                webhcat, webhdfs, Hive, HBase, Oozie
Relational data processing     Tez

Legend (colors not preserved in this transcript): Exists | Pieces exist in this component | New Project

HCatalog – Table Management
• Opens up Hive’s tables to other tools inside and outside
  Hadoop
• Presents tools with a table paradigm that abstracts away
  storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access




[Diagram: Hive works against the Metastore; Pig (through HCatLoader) and MapReduce (through HCatInputFormat) reach the same tables; external systems connect over REST through WebHCat]
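As a rough sketch of the REST path for external systems, the snippet below builds the URL a client would use to ask WebHCat to describe a table. The `/templeton/v1/ddl/database/.../table/...` layout and port 50111 follow the WebHCat reference documentation for later releases. The host, database, table, and user names are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes one table (a GET request)."""
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    query = urlencode({"user.name": user})  # identity parameter on unsecured clusters
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host and table names, for illustration only.
print(webhcat_table_url("webhcat.example.com", "default", "clicks", "alice"))
```

The caller only names a database and a table. Where the data lives and how it is stored stays behind the metastore, which is the abstraction the bullets describe.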

Tez – Moving Beyond MapReduce
• Low level data-processing execution engine
• Serves as the base for MapReduce, Hive, Pig, Cascading,
  etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of
  the queue between steps in the pipeline
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• Built on YARN



Pig/Hive-MR versus Pig/Hive-Tez

    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state

[Diagram: Pig/Hive - MR runs the query as three jobs (Job 1, Job 2, Job 3) separated by I/O synchronization barriers; Pig/Hive - Tez runs it as a single job]
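The difference between the two execution models can be expressed as a toy count of jobs and HDFS hand-offs. This is a deliberately simplified model, not Tez or MapReduce code: the `Plan` class, the one-job-per-stage assumption, and the stage names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Plan:
    """Ordered logical stages of a query (hypothetical representation)."""
    stages: List[str]


def mapreduce_chain(plan):
    """Model each stage as its own MR job; consecutive jobs hand off data through HDFS."""
    jobs = len(plan.stages)
    return {"jobs": jobs, "hdfs_barriers": max(jobs - 1, 0)}


def tez_dag(plan):
    """Model all stages as vertices of one DAG job with no HDFS hand-off between them."""
    return {"jobs": 1 if plan.stages else 0, "hdfs_barriers": 0}


# Stage split assumed to match the three jobs drawn on the slide.
query = Plan(["join a with b", "join with c", "group by state"])
print(mapreduce_chain(query))
print(tez_dag(query))
```

Every barrier in the MapReduce chain stands for a write to HDFS, a read back from it, and a fresh job launch, and those are the costs the Tez bullets say are removed.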
FastQuery: Beyond Batch with YARN




• Tez Generalizes Map-Reduce
  – Simplified execution plans process data more efficiently
• Always-On Tez Service
  – Low latency processing for all Hadoop data processing




Knox – Single Sign On




Today’s Access Options
• Direct Access
   – Access Services via REST (WebHDFS, WebHCat)
   – Need knowledge of and access to whole cluster
   – Security handled by each component in the cluster
   – Kerberos details exposed to users


          User → {REST} → Hadoop Cluster


• Gateway / Portal Nodes
   – Dedicated nodes behind firewall
   – Users SSH to a node to access Hadoop services

          User → SSH → GW Node → Hadoop Cluster


Knox Design Goals
• Operators can firewall the cluster without end users needing
  access to a “gateway node”
• Users see one cluster end-point that aggregates
  capabilities for data access, metadata and job control
• Provide perimeter security to make Hadoop security setup
  easier
• Enable integration with enterprise and cloud identity
  management environments
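The single end-point goal can be illustrated by comparing the URL a client needs for direct WebHDFS access with the URL it would use through a gateway. The layouts follow the patterns documented for WebHDFS and for later Apache Knox releases. The host names, the topology name `sandbox`, and the default ports are assumptions for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def direct_webhdfs_url(namenode_host, hdfs_path, op, port=50070):
    """Direct access: the client must know and reach the NameNode itself."""
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        namenode_host, port, hdfs_path, urlencode({"op": op})
    )


def knox_webhdfs_url(gateway_host, topology, hdfs_path, op,
                     port=8443, gateway_path="gateway"):
    """Gateway access: the client only knows the gateway host and a cluster name."""
    return "https://{}:{}/{}/{}/webhdfs/v1{}?{}".format(
        gateway_host, port, gateway_path, topology, hdfs_path, urlencode({"op": op})
    )


# Hypothetical hosts; "sandbox" is an assumed topology (cluster) name.
print(direct_webhdfs_url("nn1.internal.example.com", "/tmp", "LISTSTATUS"))
print(knox_webhdfs_url("knox.example.com", "sandbox", "/tmp", "LISTSTATUS"))
```

With the gateway form, internal host names never leave the firewall, and authentication can be handled once at the perimeter.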




Perimeter Verification & Authentication
Verification
- Verify identity token
- SAML, propagation of identity
Authentication
- Establish identity at Gateway to authenticate with LDAP + AD

[Diagram: Client → {REST} → Knox Gateway → WebHDFS / WebHCat → Hadoop Cluster (NN, DNs, JT, Hive, HCat). For authentication, the gateway consults a User Store (KDC, AD, LDAP); for verification, it consults an ID Provider (KDC, AD, LDAP).]
Thank You




 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 

Coordinating the Many Tools of Big Data

  • 1. Coordinating the Many Tools of Big Data. Strata 2013. Alan F. Gates, @alanfgates
  • 2. Big Data = Terabytes, Petabytes, … (Image credit: Gizmodo) © Hortonworks 2013
  • 3. But It Is Also Complex Algorithms • An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data: $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell(f(x; w^{(t)}), y)$
  • 4. And New Tools • Apache Hadoop brings with it a large selection of tools and paradigms – Apache HBase, Apache Cassandra – Distributed, high-volume reads and writes of individual data records – Apache Hive – SQL – Apache Pig, Cascading – Data flow programming for ETL, data modeling, and exploration – Apache Giraph – Graph processing – MapReduce – Batch processing – Storm, S4 – Stream processing – Plus lots of commercial offerings
  • 5. Pre-Cloud: One Tool per Machine • Databases presented SQL or SQL-like paradigms for operating on data • Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS). [Diagram: separate systems for Data Warehouse, Data Mart, Statistical Analysis, Cube/MOLAP, OLTP]
  • 6. Cloud: Many Tools One Platform • Users no longer want to be concerned with what platform their data is in – just apply the tool to it • SQL is no longer the only or primary data access tool [Diagram: Data Warehouse, Data Mart, Statistical Analysis, Cube/MOLAP, and OLTP on one platform]
  • 7. Upside – Pick the Right Tool for the Job
  • 8. Downside – Tools Don’t Play Well Together • Hard for users to share data between tools – Different storage formats – Different data models – Different user-defined function interfaces
  • 9. Downside – Wasted Developer Time • Wastes developer time, since each tool reimplements the same functionality [Diagram: Hive and Pig stacks side by side, each with its own Parser, Optimizer, Physical Planner, and Executor, plus a Metadata box]
  • 10. Downside – Wasted Developer Time • Same bullet and diagram as slide 9, with the duplicated Hive and Pig components marked "Overlap"
  • 11. Conclusion: We Need Services • We need to find a way to share services where we can • Gives users the same experience across tools • Allows developers to share effort when it makes sense
  • 12. Hadoop = Distributed Data Operating System • Service: Hadoop component – Table management: Hive – Access to metadata: HCatalog – User authentication: Knox – Resource management: YARN – Notification: HCatalog – REST/Connectors: WebHCat, WebHDFS, Hive, HBase, Oozie – Relational data processing: Tez • Legend: Exists / Pieces exist in this component / New project
  • 13. Hadoop = Distributed Data Operating System • Same table as slide 12
  • 14. HCatalog – Table Management • Opens up Hive’s tables to other tools inside and outside Hadoop • Presents tools with a table paradigm that abstracts away storage details • Provides a shared data model • Provides a shared code path for data and metadata access
  • 15. HCatalog – Table Management • Same bullets as slide 14 [Diagram: Hive, Metastore]
  • 16. HCatalog – Table Management • Same bullets as slide 14 [Diagram adds: Pig via HCatLoader and MapReduce via HCatInputFormat, both reaching the Metastore]
  • 17. HCatalog – Table Management • Same bullets as slide 14 [Diagram adds: external systems reaching the Metastore over REST via WebHCat]
  • 18. Tez – Moving Beyond MapReduce • Low-level data-processing execution engine • Serves as the base for MapReduce, Hive, Pig, Cascading, etc. • Enables pipelining of jobs • Removes task and job launch times • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline • Does not write intermediate output to HDFS – Much lighter disk and network usage • Built on YARN
  • 19. Pig/Hive-MR versus Pig/Hive-Tez • SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state [Diagram, Pig/Hive on MR: Job 1, Job 2, and Job 3, separated by I/O synchronization barriers]
  • 20. Pig/Hive-MR versus Pig/Hive-Tez • Same query and MR diagram as slide 19 [Diagram adds, Pig/Hive on Tez: a single job]
  • 21. FastQuery: Beyond Batch with YARN • Tez Generalizes Map-Reduce – Simplified execution plans process data more efficiently • Always-On Tez Service – Low latency processing for all Hadoop data processing
  • 22. Knox – Single Sign On
  • 23. Today’s Access Options • Direct Access – Access services via REST (WebHDFS, WebHCat) – Need knowledge of and access to the whole cluster – Security handled by each component in the cluster – Kerberos details exposed to users [Diagram: User → {REST} → Hadoop Cluster] • Gateway / Portal Nodes – Dedicated nodes behind the firewall – Users SSH to a node to access Hadoop services [Diagram: User → SSH → gateway node → Hadoop Cluster]
  • 24. Knox Design Goals • Operators can firewall the cluster without giving end users access to a “gateway node” • Users see one cluster end-point that aggregates capabilities for data access, metadata, and job control • Provide perimeter security to make Hadoop security setup easier • Enable integration with enterprise and cloud identity management environments
  • 25. Perimeter Verification & Authentication • Verification – Verify identity token – SAML, propagation of identity • Authentication – Establish identity at Gateway to authenticate with LDAP + AD [Diagram: Client → {REST} → Knox Gateway → Hadoop Cluster (WebHDFS, NN, DNs, JT, WebHCat, Hive, HCat); ID Provider (KDC, AD, LDAP) for verification; User Store (KDC, AD, LDAP) for authentication]
  • 26. Thank You © Hortonworks 2012 Page 26
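
The stochastic gradient descent update quoted on slide 3 can be sketched in a few lines. This is only an illustration, not Twitter's Pig UDF code: it assumes a logistic model for f(x; w), log loss, and a step size that decays as 1/sqrt(t). The names `sgd_step` and `train` are hypothetical.

```python
import math


def sgd_step(w, x, y, gamma):
    """One update: w(t+1) = w(t) - gamma(t) * gradient of the loss at (x, y)."""
    # f(x; w): logistic model (an assumption made for this sketch)
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    # For log loss with y in {0, 1}, the gradient w.r.t. w is (p - y) * x
    return [wi - gamma * (p - y) * xi for wi, xi in zip(w, x)]


def train(examples, dim, gamma0=0.5, epochs=50):
    """Run SGD over (x, y) pairs, one example per update."""
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in examples:
            t += 1
            gamma = gamma0 / math.sqrt(t)  # decaying step size (assumed schedule)
            w = sgd_step(w, x, y, gamma)
    return w
```

Each step touches a single record, which is why the method suits data that is streamed through a UDF rather than loaded into memory at once.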

Editor's Notes

  1. This is how we tend to think of Big data
  2. Limited in a couple of ways: scalability is limited by being on one machine or a small cluster that counts on all participants being up, and it is hard to apply different types of processing without moving data around.
  3. Hive is the only SQL-based app in this pile. Other apps are still in the picture; it’s not like Hadoop is displacing everything.