SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
1
Introduction to Apache Hive (and
HCatalog)	
  
Mark Grover
github.com/markgrover/nyc-hug-
hive
Me!
•  Contributor	
  to	
  Apache	
  Hive	
  
•  Sec3on	
  Author	
  of	
  O’Reilly’s	
  Programming	
  Hive	
  book	
  
•  So?ware	
  Developer	
  at	
  Cloudera	
  
•  @mark_grover	
  
•  mgrover@cloudera.com	
  
•  hFps://github.com/markgrover/nyc-­‐hug-­‐hive	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Preamble
•  This	
  is	
  a	
  remote	
  talk	
  
•  Feel	
  free	
  to	
  ask	
  ques3ons	
  any	
  3me!	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive
•  Data	
  warehouse	
  system	
  for	
  Hadoop	
  
•  Enables	
  Extract/Transform/Load	
  (ETL)	
  
•  Associate	
  structure	
  with	
  a	
  variety	
  of	
  data	
  formats	
  
•  Logical	
  Table	
  -­‐>	
  Physical	
  Loca3on	
  
•  Logical	
  Table	
  -­‐>	
  Physical	
  Data	
  Format	
  Handler	
  (SerDe)	
  
•  Integrates	
  with	
  HDFS,	
  HBase,	
  MongoDB,	
  etc.	
  
•  Query	
  execu3on	
  in	
  MapReduce	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Why use Hive?
•  MapReduce	
  is	
  catered	
  towards	
  developers	
  
•  Run	
  SQL-­‐like	
  queries	
  that	
  get	
  compiled	
  and	
  run	
  as	
  
MapReduce	
  jobs	
  
•  Data	
  in	
  Hadoop	
  even	
  though	
  generally	
  unstructured	
  
has	
  some	
  vague	
  structure	
  associated	
  with	
  it	
  
•  Benefits	
  of	
  MapReduce	
  +	
  HDFS	
  (Hadoop)	
  
•  Fault	
  tolerant	
  
•  Robust	
  
•  Scalable	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive features
•  Create	
  table,	
  create	
  view,	
  create	
  index	
  -­‐	
  DDL	
  
•  Select,	
  where	
  clause,	
  group	
  by,	
  order	
  by,	
  joins	
  
•  Pluggable	
  User	
  Defined	
  Func3ons	
  -­‐	
  UDFs	
  (e.g	
  
from_unix3me)	
  
•  Pluggable	
  User	
  Defined	
  Aggregate	
  Func3ons	
  -­‐	
  
UDAFs	
  (e.g.	
  count,	
  avg)	
  
•  Pluggable	
  User	
  Defined	
  Table	
  Genera3ng	
  Func3ons	
  
-­‐	
  UDTFs	
  (e.g.	
  explode)	
  
Hive features
•  Pluggable	
  custom	
  Input/Output	
  format	
  
•  Pluggable	
  Serializa3on	
  Deserializa3on	
  libraries	
  
(SerDes)	
  
•  Pluggable	
  custom	
  map	
  and	
  reduce	
  scripts	
  
What Hive does NOT support
•  OLTP	
  workloads	
  -­‐	
  low	
  latency	
  
•  Correlated	
  subqueries	
  
•  Not	
  super	
  performant	
  with	
  small	
  amounts	
  of	
  data	
  
•  How	
  much	
  data	
  do	
  you	
  need	
  to	
  call	
  it	
  “Big	
  Data”?	
  
Other Hive features
•  Par33oning	
  
•  Sampling	
  
•  Bucke3ng	
  
•  Various	
  join	
  op3miza3ons	
  
•  Integra3on	
  with	
  HBase	
  and	
  other	
  storage	
  handlers	
  
•  Views	
  –	
  Unmaterialized	
  
•  Complex	
  data	
  types	
  –	
  arrays,	
  structs,	
  maps	
  
Connecting to Hive
•  Hive	
  Shell	
  
•  JDBC	
  driver	
  
•  ODBC	
  driver	
  
•  Thri?	
  client	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive metastore
•  Backed	
  by	
  RDBMS	
  
•  Derby,	
  MySQL,	
  PostgreSQL,	
  etc.	
  supported	
  
•  Default	
  Embedded	
  Derby	
  
•  Not	
  recommend	
  for	
  anything	
  but	
  a	
  quick	
  Proof	
  of	
  
Concept	
  
•  3	
  different	
  modes	
  of	
  opera3on:	
  
•  Embedded	
  Derby	
  (default)	
  
•  Local	
  
•  Remote	
  
Hive architecture
Hive Remote Mode
Hive server
Problems with Hive Server
•  No	
  sessions/concurrency	
  
•  Essen3ally	
  need	
  1	
  server	
  per	
  client	
  
•  Security	
  
•  Auding/Logging	
  
Hive server 2
Hive architecture
•  Compiler	
  
•  Parser	
  
•  Type	
  checking	
  
•  Seman3c	
  Analyzer	
  
•  Plan	
  Genera3on	
  
•  Task	
  Genera3on	
  
Hive architecture
•  Execu3on	
  Engine	
  
•  Plan	
  
•  Operators	
  
•  SerDes	
  
•  UDFs/UDAFs/UDTFs	
  
•  Metastore	
  
•  Stores	
  schema	
  of	
  data	
  
•  HCatalog	
  
Architecture Summary
•  Use	
  remote	
  metastore	
  service	
  for	
  sharing	
  the	
  
metastore	
  with	
  HCatalog	
  and	
  other	
  tools	
  
•  Use	
  Hive	
  Server2	
  for	
  concurrent	
  queries	
  
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Hive Metastore Remote Mode
HCatalog
•  Sub-­‐component	
  of	
  Hive	
  
•  Table	
  and	
  storage	
  management	
  service	
  
•  Public	
  APIs	
  and	
  webservice	
  wrappers	
  for	
  accessing	
  
metadata	
  in	
  Hive	
  metastore	
  
•  Metastore	
  contains	
  informa3on	
  of	
  interest	
  to	
  other	
  
tools	
  (Pig,	
  MapReduce	
  jobs)	
  
•  Expose	
  that	
  informa3on	
  as	
  REST	
  interface	
  
•  WebHCat:	
  Web	
  Server	
  for	
  engaging	
  with	
  the	
  Hive	
  
metastore	
  	
  
WebHCat
REST
REST
Agenda
•  What	
  is	
  Hive?	
  
•  Why	
  use	
  Hive?	
  
•  Hive	
  features	
  
•  Hive	
  architecture	
  
•  HCatalog	
  
•  Demo!	
  
Applications of Hive
•  Web	
  Analy3cs	
  
•  Retail	
  
•  Healthcare	
  
•  Spam	
  detec3on	
  
•  Data	
  Mining	
  
•  Ad	
  op3miza3on	
  
•  ETL	
  workloads	
  
How to install Hive
•  Download	
  Apache	
  Hive	
  tarball	
  
•  Use	
  Apache	
  Bigtop	
  packages	
  
•  Use	
  a	
  Demo	
  VM	
  
Want to learn more about Hive?
Contact info
@mark_grover	
  
github.com/markgrover	
  
linkedin.com/in/grovermark	
  
mgrover@cloudera.com	
  
	
  
	
  

Más contenido relacionado

La actualidad más candente

Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Simplilearn
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Inside open metadata—the deep dive
Inside open metadata—the deep diveInside open metadata—the deep dive
Inside open metadata—the deep dive
DataWorks Summit
 

La actualidad más candente (20)

Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Apache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop componentsApache Atlas: Tracking dataset lineage across Hadoop components
Apache Atlas: Tracking dataset lineage across Hadoop components
 
Hive Hadoop
Hive HadoopHive Hadoop
Hive Hadoop
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Inside open metadata—the deep dive
Inside open metadata—the deep diveInside open metadata—the deep dive
Inside open metadata—the deep dive
 
Big Data
Big DataBig Data
Big Data
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Hive
HiveHive
Hive
 

Destacado

HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache Hive
Carl Steinbach
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)
Prasan Samtani
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
lucenerevolution
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
DataWorks Summit
 

Destacado (20)

HCatalog
HCatalogHCatalog
HCatalog
 
HiveServer2 for Apache Hive
HiveServer2 for Apache HiveHiveServer2 for Apache Hive
HiveServer2 for Apache Hive
 
Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
Coordinating the Many Tools of Big Data - Apache HCatalog, Apache Pig and Apa...
 
HiveServer2
HiveServer2HiveServer2
HiveServer2
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)Inside Hulu's Data platform (BigDataCamp LA 2013)
Inside Hulu's Data platform (BigDataCamp LA 2013)
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 
Hadoop and Hive
Hadoop and HiveHadoop and Hive
Hadoop and Hive
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6
 
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
Hadoop World 2011: Hadoop vs. RDBMS for Big Data Analytics...Why Choose?
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
How to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot ProjectHow to Create 80% of a Big Data Pilot Project
How to Create 80% of a Big Data Pilot Project
 
Lessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at HuluLessons Learned - Monitoring the Data Pipeline at Hulu
Lessons Learned - Monitoring the Data Pipeline at Hulu
 
Restaurant management
Restaurant managementRestaurant management
Restaurant management
 
SC4 Workshop 1: Logistics and big data German herrero
SC4 Workshop 1: Logistics and big data  German herreroSC4 Workshop 1: Logistics and big data  German herrero
SC4 Workshop 1: Logistics and big data German herrero
 

Similar a Introduction to Hive and HCatalog

Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
gregchanan
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
gregchanan
 
Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3
Wen-Tien Chang
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
saipriyacoool
 

Similar a Introduction to Hive and HCatalog (20)

Search onhadoopsfhug081413
Search onhadoopsfhug081413Search onhadoopsfhug081413
Search onhadoopsfhug081413
 
Search On Hadoop
Search On HadoopSearch On Hadoop
Search On Hadoop
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Hadoop Training in Hyderabad
Hadoop Training in HyderabadHadoop Training in Hyderabad
Hadoop Training in Hyderabad
 
Search On Hadoop Frontier Meetup
Search On Hadoop Frontier MeetupSearch On Hadoop Frontier Meetup
Search On Hadoop Frontier Meetup
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Middleware in Golang: InVision's Rye
Middleware in Golang: InVision's RyeMiddleware in Golang: InVision's Rye
Middleware in Golang: InVision's Rye
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3Service-Oriented Design and Implement with Rails3
Service-Oriented Design and Implement with Rails3
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding EdgeCIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
CIS13: Big Data Platform Vendor’s Perspective: Insights from the Bleeding Edge
 
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anyninesCloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
Cloud Infrastructures Slide Set 7 - Docker - Neo4j | anynines
 

Más de markgrover

Más de markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 

Último

1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
AldoGarca30
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
MayuraD1
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
chumtiyababu
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
Epec Engineered Technologies
 

Último (20)

S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 

Introduction to Hive and HCatalog

  • 1. 1 Introduction to Apache Hive (and HCatalog)   Mark Grover github.com/markgrover/nyc-hug- hive
  • 2. Me! •  Contributor  to  Apache  Hive   •  Sec3on  Author  of  O’Reilly’s  Programming  Hive  book   •  So?ware  Developer  at  Cloudera   •  @mark_grover   •  mgrover@cloudera.com   •  hFps://github.com/markgrover/nyc-­‐hug-­‐hive  
  • 3. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 4. Preamble •  This  is  a  remote  talk   •  Feel  free  to  ask  ques3ons  any  3me!  
  • 5. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 6. Hive •  Data  warehouse  system  for  Hadoop   •  Enables  Extract/Transform/Load  (ETL)   •  Associate  structure  with  a  variety  of  data  formats   •  Logical  Table  -­‐>  Physical  Loca3on   •  Logical  Table  -­‐>  Physical  Data  Format  Handler  (SerDe)   •  Integrates  with  HDFS,  HBase,  MongoDB,  etc.   •  Query  execu3on  in  MapReduce  
  • 7. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 8. Why use Hive? •  MapReduce  is  catered  towards  developers   •  Run  SQL-­‐like  queries  that  get  compiled  and  run  as   MapReduce  jobs   •  Data  in  Hadoop  even  though  generally  unstructured   has  some  vague  structure  associated  with  it   •  Benefits  of  MapReduce  +  HDFS  (Hadoop)   •  Fault  tolerant   •  Robust   •  Scalable  
  • 9. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 10. Hive features •  Create  table,  create  view,  create  index  -­‐  DDL   •  Select,  where  clause,  group  by,  order  by,  joins   •  Pluggable  User  Defined  Func3ons  -­‐  UDFs  (e.g   from_unix3me)   •  Pluggable  User  Defined  Aggregate  Func3ons  -­‐   UDAFs  (e.g.  count,  avg)   •  Pluggable  User  Defined  Table  Genera3ng  Func3ons   -­‐  UDTFs  (e.g.  explode)  
  • 11. Hive features •  Pluggable  custom  Input/Output  format   •  Pluggable  Serializa3on  Deserializa3on  libraries   (SerDes)   •  Pluggable  custom  map  and  reduce  scripts  
  • 12. What Hive does NOT support •  OLTP  workloads  -­‐  low  latency   •  Correlated  subqueries   •  Not  super  performant  with  small  amounts  of  data   •  How  much  data  do  you  need  to  call  it  “Big  Data”?  
  • 13. Other Hive features •  Par33oning   •  Sampling   •  Bucke3ng   •  Various  join  op3miza3ons   •  Integra3on  with  HBase  and  other  storage  handlers   •  Views  –  Unmaterialized   •  Complex  data  types  –  arrays,  structs,  maps  
  • 14. Connecting to Hive •  Hive  Shell   •  JDBC  driver   •  ODBC  driver   •  Thri?  client  
  • 15. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 16. Hive metastore •  Backed  by  RDBMS   •  Derby,  MySQL,  PostgreSQL,  etc.  supported   •  Default  Embedded  Derby   •  Not  recommend  for  anything  but  a  quick  Proof  of   Concept   •  3  different  modes  of  opera3on:   •  Embedded  Derby  (default)   •  Local   •  Remote  
  • 20. Problems with Hive Server •  No  sessions/concurrency   •  Essen3ally  need  1  server  per  client   •  Security   •  Auding/Logging  
  • 22. Hive architecture •  Compiler   •  Parser   •  Type  checking   •  Seman3c  Analyzer   •  Plan  Genera3on   •  Task  Genera3on  
  • 23. Hive architecture •  Execu3on  Engine   •  Plan   •  Operators   •  SerDes   •  UDFs/UDAFs/UDTFs   •  Metastore   •  Stores  schema  of  data   •  HCatalog  
  • 24. Architecture Summary •  Use  remote  metastore  service  for  sharing  the   metastore  with  HCatalog  and  other  tools   •  Use  Hive  Server2  for  concurrent  queries  
  • 25. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 27. HCatalog •  Sub-­‐component  of  Hive   •  Table  and  storage  management  service   •  Public  APIs  and  webservice  wrappers  for  accessing   metadata  in  Hive  metastore   •  Metastore  contains  informa3on  of  interest  to  other   tools  (Pig,  MapReduce  jobs)   •  Expose  that  informa3on  as  REST  interface   •  WebHCat:  Web  Server  for  engaging  with  the  Hive   metastore    
  • 29. Agenda •  What  is  Hive?   •  Why  use  Hive?   •  Hive  features   •  Hive  architecture   •  HCatalog   •  Demo!  
  • 30. Applications of Hive •  Web  Analy3cs   •  Retail   •  Healthcare   •  Spam  detec3on   •  Data  Mining   •  Ad  op3miza3on   •  ETL  workloads  
  • 31. How to install Hive •  Download  Apache  Hive  tarball   •  Use  Apache  Bigtop  packages   •  Use  a  Demo  VM  
  • 32. Want to learn more about Hive?
  • 33. Contact info @mark_grover   github.com/markgrover   linkedin.com/in/grovermark   mgrover@cloudera.com