Sharing Metadata
Across the Data Lake
and Streams
Alan F. Gates
Co-founder Hortonworks,
Member Apache Hive PMC
June 2018
© Hortonworks Inc. 2011 – 2018. All Rights Reserved
Metadata in SQL
Big Data SQL Engines
There are many big data SQL engines:
Hive, Spark, Impala, Presto, …
Pro Pluribus Unum
There are many big data SQL engines:
Hive, Spark, Impala, Presto, …
These engines all store their metadata
in the Hive Metastore
The Good, …
Good: Shared metadata makes sharing
data between engines easier
The Bad, …
Bad: Non-Hive systems have to install
much of Hive to get the Metastore
Bad: Hard for other projects to
contribute to the Metastore
And the First Proposal
Proposal:
Separate the Metastore from Hive
Breaking out the Metastore
 Enables the Metastore to continue to be used by many engines
 In Hive 3.0 the Metastore was released as a separate module
 Can be installed and run without the rest of Hive
– A few features missing when Hive not present: e.g. the compactor
– Planning to add these in the future
 Backwards compatibility maintained for Thrift clients
– Older version clients can talk to the new, separate, metastore
 A few small changes for server hook implementations
 There is a proposal to make it a separate Apache project
– Will enable better collaboration with non-Hive projects
– Still in discussion with the Hive PMC on this
Enables Shared Metadata in the Cloud
[Diagram]
• Shared Data & Storage
• On-Demand Ephemeral Workloads
• Elastic Resource Management
• Shared Metadata, Security & Governance
Is this HCatalog 2.0?
 Didn’t we do this before? Wasn’t it called HCatalog?
 No, HCatalog is different
 HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other
applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
 The Metastore stores metadata, including which serdes etc. to use, but does not provide
readers and writers
 HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access
Schemas in Streams
Example: Hortonworks Schema Registry
 Provides a central repository for messages’ metadata
 Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi)
 Can be used by any application via a REST interface
 Schema defined in JSON
 Schema is tied to a Kafka topic or NiFi flow
 Every schema has a name: e.g. temp_sensor_data
 Schemas can have one or more versions
– Different messages in a topic will have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
 Lifecycle management: schema versions have state, e.g. INITIATED, ENABLED, ARCHIVED
 Serdes stored with schema so system knows how to (de)serialize data
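The compatibility modes listed above (none, backwards, forwards, or both) can be illustrated with a simplified backward-compatibility check in Python. This is a sketch of the idea, not the registry's actual algorithm: it only checks that shared fields keep their type and that newly added fields carry a default.

```python
# Simplified sketch of a backward-compatibility check between two schema
# versions: data written with the old version must be readable under the
# new one. The rules here are illustrative, not the registry's own.

def is_backward_compatible(old_fields, new_fields):
    """old_fields/new_fields: lists of {"name": ..., "type": ...} dicts."""
    old_by_name = {f["name"]: f["type"] for f in old_fields}
    for f in new_fields:
        if f["name"] in old_by_name:
            # A shared field must keep the same type (no type promotion here).
            if old_by_name[f["name"]] != f["type"]:
                return False
        elif "default" not in f:
            # A field added without a default cannot be filled from old data.
            return False
    return True

v1 = [{"name": "sensorId", "type": "long"}]
v2 = v1 + [{"name": "temperature", "type": "int", "default": 0}]
v3 = v1 + [{"name": "temperature", "type": "int"}]  # no default

print(is_backward_compatible(v1, v2))  # True: new field has a default
print(is_backward_compatible(v1, v3))  # False: new field lacks a default
```

A real registry would also handle type promotion and forward compatibility (old readers consuming new data), which this sketch omits.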
Example Schema Registry Schema
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long" },
    { "name": "location", "type": "record",
      "fields": [
        { "name": "longitude", "type": "double" },
        { "name": "latitude", "type": "double" }
      ]},
    { "name": "temperature", "type": "int" },
    { "name": "readAt", "type": "long" }
  ]
}
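One way to see what a schema like this describes is to walk it programmatically. A small sketch that flattens the nested record into dotted field paths; the `flatten` helper is illustrative, not a registry API.

```python
import json

# Walk the slide's temp_sensor_data schema and flatten nested records
# into dotted field paths, as a consumer inspecting a schema might.

schema_json = """
{ "name": "temp_sensor_data",
  "fields": [
    { "name": "sensorId", "type": "long"},
    { "name": "location", "type": "record",
      "fields": [
        { "name": "longitude", "type": "double"},
        { "name": "latitude", "type": "double"}
      ]},
    { "name": "temperature", "type": "int"},
    { "name": "readAt", "type": "long"}
  ]
}
"""

def flatten(fields, prefix=""):
    out = []
    for f in fields:
        path = prefix + f["name"]
        if f["type"] == "record":
            out.extend(flatten(f["fields"], path + "."))
        else:
            out.append((path, f["type"]))
    return out

schema = json.loads(schema_json)
for path, typ in flatten(schema["fields"]):
    print(path, typ)
# sensorId long
# location.longitude double
# location.latitude double
# temperature int
# readAt long
```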
Contrasting SQL and Registry Schemas
SQL: Schema tied to a table
Registry: Schema tied to a Kafka topic or NiFi flow

SQL: Schema applies to all records in a partition
Registry: Records in a topic may have different versions of the schema, with no given order

SQL: Schema defined in SQL DDL, e.g. CREATE TABLE T (A INT, B VARCHAR(20));
Registry: Schema defined in JSON

SQL: Primary access is via SQL for users and Thrift for SQL engines
Registry: Primary access is via UI for developers and Java/REST for streaming applications

SQL: Supports standard SQL types and Java types
Registry: Supports Java types

SQL: No concept of schema lifecycle
Registry: Schema lifecycle management via schema version state
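A connector bridging the two worlds would need a type mapping along these lines. The table below is a hypothetical sketch, not a mapping either system ships; a real one would also handle precision, scale, and many more types.

```python
# Hypothetical SQL-to-registry type mapping a connector might use.
# Entries and conversion choices here are illustrative only.

SQL_TO_REGISTRY = {
    "INT": "int",
    "BIGINT": "long",
    "DOUBLE": "double",
    "VARCHAR": "string",   # length is dropped going to the registry side
    "TIMESTAMP": "long",   # e.g. epoch millis; a lossy, illustrative choice
}

def map_sql_type(sql_type):
    # Strip any length/precision argument: VARCHAR(20) -> VARCHAR
    base = sql_type.split("(")[0].strip().upper()
    return SQL_TO_REGISTRY[base]

print(map_sql_type("VARCHAR(20)"))  # string
print(map_sql_type("BIGINT"))       # long
```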
Bringing the Strands Together
First Problem
With both the Hive Metastore and the Schema Registry we are adding yet another
component to the system
 Administrators have another system to install, monitor, update, …
 Developers must maintain two systems whose basic functionality, recording and
serving runtime metadata, is the same
 Other systems that want to integrate with runtime metadata, such as security
systems like Ranger and Sentry and governance systems like Atlas, have to
integrate with each component separately
Second Problem
Hardwiring a perspective into a metadata repository makes it harder to share data
between applications
 Sometimes your streaming application will want to read from a table
– It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka
stream
 Sometimes your query will want to read from a stream
– It needs to think of data as being in a table, whether it comes from a Hive table or a Kafka
stream
 To share data today, tools have to be able to read data using both paradigms
The Second Proposal: Cross the Streams
 Put the Schema Registry on top of the Metastore
 It will still support SQL and streaming perspectives
 One system means less work for admins, developers, and other tools
 One system with multiple perspectives means
– streaming tools can view data as a stream whether it is in Kafka or Hive
– batch tools can view data as a table whether it is in Hive or Kafka
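A minimal sketch of "one metadata object, multiple perspectives": the same entry can render as a SQL table description or as a registry-style schema. The class and method names are invented for illustration and are not Metastore or Schema Registry APIs.

```python
# One shared metadata object rendered under two perspectives.
# All names here are hypothetical, for illustration only.

class SharedSchema:
    def __init__(self, name, fields):
        self.name = name        # table name or topic name
        self.fields = fields    # list of (field_name, registry_type)

    def as_table_ddl(self):
        """SQL perspective: render as a CREATE TABLE statement."""
        reg_to_sql = {"long": "BIGINT", "string": "VARCHAR(255)",
                      "int": "INT", "double": "DOUBLE"}
        cols = ", ".join(f"{n} {reg_to_sql[t]}" for n, t in self.fields)
        return f"CREATE TABLE {self.name} ({cols});"

    def as_registry_schema(self):
        """Streaming perspective: render as a registry-style JSON schema."""
        return {"name": self.name,
                "fields": [{"name": n, "type": t} for n, t in self.fields]}

s = SharedSchema("user_events", [("userid", "long"), ("eventtype", "string")])
print(s.as_table_ddl())
# CREATE TABLE user_events (userid BIGINT, eventtype VARCHAR(255));
print(s.as_registry_schema())
```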
Streaming Application Reading from a Table
Example:
• A stream userEvents
• An application that flags users who have called support in the last 24 hours
• Hive has a record of support calls, Kafka does not

Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}

Hive table support_calls
userid long
calltime timestamp
summary string

Registry schema supportCalls:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}

• App can cache the table every hour and do a join as events arrive to flag users who need extra attention
• Possible today, but requires caching data in Kafka or coding the app to read both Hive and Kafka
• Because HMS and SR are unified, streaming apps can view this as an SR schema
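The flagging pattern described on this slide can be sketched in a few lines, with in-memory lists standing in for the cached Hive table and the Kafka topic. All names and data are illustrative.

```python
import time

# Sketch: cache the support_calls table snapshot, then flag incoming
# userEvents whose user called support in the last 24 hours.
# Plain lists stand in for the Hive table and the Kafka stream.

DAY = 24 * 60 * 60
now = time.time()

# Cached snapshot of the Hive table support_calls: (userid, calltime).
support_calls = [
    (1, now - 2 * 60 * 60),   # called 2 hours ago
    (2, now - 3 * DAY),       # called 3 days ago
]
recent_callers = {uid for uid, t in support_calls if now - t < DAY}

def flag(event):
    """event: a dict deserialized from the userEvents stream."""
    return event["userid"] in recent_callers

events = [{"userid": 1, "eventtype": "login"},
          {"userid": 2, "eventtype": "login"}]
flagged = [e for e in events if flag(e)]
print(flagged)  # only user 1's event: they called support 2 hours ago
```

In the unified design, the app would fetch both the topic schema and the table schema from the same registry rather than hand-coding two metadata paths.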
Query Reading from a Stream
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents

Hive table user_events, partitioned by event_hour
user_id long
event_type varchar(256)
event_hour datetime

Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}

• Today Hive streaming can quickly ingest data from Kafka, but queries will still miss the last few
seconds of events sitting in Kafka
• Would like to be able to read the latest events from Kafka rather than wait until they load into Hive
• Because HMS and SR are unified, Hive can view the Kafka topic as a partition of its table:
Hive table user_events, partition event_hour='latest'
• Hive queries can now read Kafka topic userEvents as a partition of user_events
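The virtual 'latest' partition can be sketched as a table scan that unions the loaded partitions with the stream tail. Hive partitions and the Kafka topic are stood in for by plain dicts and lists; the scan function is illustrative, not Hive code.

```python
# Sketch: scan user_events as its loaded partitions plus a virtual
# partition event_hour='latest' backed by the stream, as proposed.

hive_partitions = {
    "2018-06-18T10": [{"user_id": 1, "event_type": "click"}],
    "2018-06-18T11": [{"user_id": 2, "event_type": "view"}],
}
kafka_latest = [{"user_id": 3, "event_type": "click"}]  # not yet loaded

def scan_user_events(include_latest=True):
    rows = []
    for part in sorted(hive_partitions):   # the hourly-loaded partitions
        rows.extend(hive_partitions[part])
    if include_latest:
        rows.extend(kafka_latest)          # partition event_hour='latest'
    return rows

print(len(scan_user_events()))       # 3: includes the stream tail
print(len(scan_user_events(False)))  # 2: loaded partitions only
```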
Some Assembly Required
 Need to bridge the gaps between SQL and Registry schemas; this is nontrivial
– Schema consistent for all records in a partition versus different schema versions in the stream
– SQL types versus Java types
– Schema as an attribute of a table versus as a first class object with version and lifecycle
 Will require connectors so streaming apps can use batch serdes and vice versa
 Work in progress:
– https://github.com/apache/hive/pull/347
– https://github.com/apache/hive/pull/348
– https://github.com/apache/hive/pull/349
– https://issues.apache.org/jira/browse/HIVE-19521
– https://issues.apache.org/jira/browse/HIVE-19522
Can We Share Too Much?
Yes, Yes We Can
Example use case: Hive LLAP being used for analytics, Spark for ETL
I have been extolling the benefits of a shared Metastore for the last 20 slides, so
clearly we want to share one instance between them
But,
• Hive and Spark can't always read each other's data
• e.g. Spark can't read Hive's ACID tables
• Different use cases require different security models
• e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger
• Different defaults are appropriate for different use cases
• e.g. doAs=false for LLAP, doAs=true for Hive reads from Spark catalog
Third Proposal: Add Catalogs
 A catalog is the standard SQL top-level container
 Catalogs contain databases, so fully addressing a table becomes
catalog.database.table
 Default catalog 'hive' added in 3.0, and all existing databases placed in it
 In 3.0 catalogs exist only in the Metastore; they are not yet exposed to SQL
 Goal: different catalogs can have different security settings and defaults
 Ongoing work, can be tracked at HIVE-18685
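The catalog.database.table addressing can be illustrated with a small name resolver. The default names follow the slide ('hive' as the default catalog added in 3.0); the parsing rules and the 'default' database fallback are a simplified sketch, not Hive's actual resolution logic.

```python
# Sketch of resolving a possibly partial table reference against
# catalog.database.table addressing. Simplified for illustration.

DEFAULT_CATALOG = "hive"      # default catalog added in Hive 3.0
DEFAULT_DATABASE = "default"  # illustrative database fallback

def resolve(name):
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)                          # catalog.database.table
    if len(parts) == 2:
        return (DEFAULT_CATALOG, parts[0], parts[1])  # database.table
    if len(parts) == 1:
        return (DEFAULT_CATALOG, DEFAULT_DATABASE, parts[0])
    raise ValueError("too many name parts: " + name)

print(resolve("etl.staging.events"))  # ('etl', 'staging', 'events')
print(resolve("sales.orders"))        # ('hive', 'sales', 'orders')
print(resolve("t"))                   # ('hive', 'default', 't')
```

With catalogs in place, each resolved catalog could carry its own security settings and defaults, as the next slide's LLAP/Spark example shows.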
Example Installation With Catalogs
LLAP defaults to 'hive' catalog
• Tables are ACID by default
• Ranger for security
• doAs=false
Spark defaults to 'etl' catalog
• ACID tables not allowed
• StorageBasedAuth for security
• doAs=true
Each can still read from the other catalog (assuming permission granted),
but can now be aware of changing authorization, defaults, etc.
Also useful in the cloud, where multiple business units may share
storage but need different defaults, policies, etc.
What Next?
 Now that we have released the Metastore as a separate module, the Hive community needs
to decide whether it becomes a subproject or a separate top-level project
 Need to finish the work to integrate the Schema Registry
 Need to involve contributors from other, non-Hive projects
 Need to finish implementing Catalogs
 Patches accepted!
Credits
 Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig,
Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation
projects
– All are referred to herein without “Apache” for brevity
 HDFS and MapReduce are components of Apache Hadoop
 Thanks to the Hive community for their work in getting the Hive Metastore separated
out from much of the rest of Hive
 Google Translate used for Latin slide title
Thank You

Más contenido relacionado

La actualidad más candente

SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made Easy
DataWorks Summit
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
DataWorks Summit
 

La actualidad más candente (20)

What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
 
Druid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best PracticesDruid and Hive Together : Use Cases and Best Practices
Druid and Hive Together : Use Cases and Best Practices
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Multitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and OozieMultitenancy At Bloomberg - HBase and Oozie
Multitenancy At Bloomberg - HBase and Oozie
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service Deployment
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesReal-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
 
Provisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & AmbariProvisioning Big Data Platform using Cloudbreak & Ambari
Provisioning Big Data Platform using Cloudbreak & Ambari
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made Easy
 
Big Data Analytics from Edge to Core
Big Data Analytics from Edge to CoreBig Data Analytics from Edge to Core
Big Data Analytics from Edge to Core
 
Next Generation Execution Engine for Apache Storm
Next Generation Execution Engine for Apache StormNext Generation Execution Engine for Apache Storm
Next Generation Execution Engine for Apache Storm
 
Scale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARNScale-Out Resource Management at Microsoft using Apache YARN
Scale-Out Resource Management at Microsoft using Apache YARN
 
Data Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and KafkaData Ingest Self Service and Management using Nifi and Kafka
Data Ingest Self Service and Management using Nifi and Kafka
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution
 

Similar a Standalone metastore-dws-sjc-june-2018

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 

Similar a Standalone metastore-dws-sjc-june-2018 (20)

Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streamsSharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
 
Schema Registry & Stream Analytics Manager
Schema Registry  & Stream Analytics ManagerSchema Registry  & Stream Analytics Manager
Schema Registry & Stream Analytics Manager
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Curing the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging ManagerCuring the Kafka blindness—Streams Messaging Manager
Curing the Kafka blindness—Streams Messaging Manager
 
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San JoseCloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
Cloudy with a chance of Hadoop - DataWorks Summit 2017 San Jose
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Future of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep DiveFuture of Data New Jersey - HDF 3.0 Deep Dive
Future of Data New Jersey - HDF 3.0 Deep Dive
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?Is your Enterprise Data lake Metadata Driven AND Secure?
Is your Enterprise Data lake Metadata Driven AND Secure?
 
Classification based security in Hadoop
Classification based security in HadoopClassification based security in Hadoop
Classification based security in Hadoop
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Data Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat AlwellData Con LA 2018 - Streaming and IoT by Pat Alwell
Data Con LA 2018 - Streaming and IoT by Pat Alwell
 

Más de alanfgates

Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
alanfgates
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
alanfgates
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
alanfgates
 

Más de alanfgates (14)

Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache training
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016Hive ACID Apache BigData 2016
Hive ACID Apache BigData 2016
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
Hive & HBase for Transaction Processing Hadoop Summit EU Apr 2015
 
Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015Hive acid-updates-strata-sjc-feb-2015
Hive acid-updates-strata-sjc-feb-2015
 
Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014Hive acid-updates-summit-sjc-2014
Hive acid-updates-summit-sjc-2014
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
 

Último

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Último (20)

8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 

Standalone metastore-dws-sjc-june-2018

  • 1. Sharing Metadata Across the Data Lake and Streams Alan F. Gates Co-founder Hortonworks, Member Apache Hive PMC June 2018
  • 2. 2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
  • 3. 3 © Hortonworks Inc. 2011 – 2018. All Rights Reserved Metadata in SQL
  • 4. 4 © Hortonworks Inc. 2011 – 2018. All Rights Reserved Big Data SQL Engines There are many big data SQL engines: Hive, Spark, Impala, Presto, … Hive Impala Presto Spark
  • 5. 5 © Hortonworks Inc. 2011 – 2018. All Rights Reserved Pro Pluribus Unum There are many big data SQL engines: Hive, Spark, Impala, Presto, … These engines all store their metadata in the Hive Metastore Hive Metastore Hive Impala Presto Spark
  • 6. 6 © Hortonworks Inc. 2011 – 2018. All Rights Reserved The Good, … Hive Metastore These engines all store their metadata in the Hive Metastore Good: Shared metadata makes sharing data between engines easier Hive Impala Presto Spark There are many big data SQL engines: Hive, Spark, Impala, Presto, …
  • 7. 7 © Hortonworks Inc. 2011 – 2018. All Rights Reserved The Bad, … Hive Metastore These engines all store their metadata in the Hive Metastore Hive Impala Presto Spark Bad: Non-Hive systems have to install much of Hive to get the Metastore Bad: Hard for other projects to contribute to the Metastore Good: Shared metadata makes sharing data between engines easier There are many big data SQL engines: Hive, Spark, Impala, Presto, …
  • 8. 8 © Hortonworks Inc. 2011 – 2018. All Rights Reserved And the First Proposal Metastore These engines all store their metadata in the Hive Metastore Hive Impala Presto Spark Proposal: Separate the Metastore from Hive Good: Shared metadata makes sharing data between engines easier Bad: Non-Hive systems have to install much of Hive to get the Metastore Bad: Hard for other projects to contribute the Metastore There are many big data SQL engines: Hive, Spark, Impala, Presto, …
  • 9. Breaking out the Metastore
      Enables the Metastore to continue to be used by many engines
      In Hive 3.0 the Metastore was released as a separate module
      Can be installed and run without the rest of Hive
       – A few features missing when Hive not present: e.g. the compactor
       – Planning to add these in the future
      Backwards compatibility maintained for Thrift clients
       – Older version clients can talk to the new, separate, metastore
      A few small changes for server hook implementations
      There is a proposal to make it a separate Apache project
       – Will enable better collaboration with non-Hive projects
       – Still in discussion with the Hive PMC on this
  • 10. Enables Shared Metadata in the Cloud
     [Diagram: On-Demand Ephemeral Workloads over Shared Data & Storage, tied together by Elastic Resource Management and Shared Metadata, Security & Governance]
  • 11. Is this HCatalog 2.0?
      Didn’t we do this before? Wasn’t it called HCatalog?
      No, HCatalog is different
      HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other applications
       – Includes metadata access
       – Also includes data access (serdes, object inspectors, and input/output formats)
      Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
      HCatalog stays with Hive in this split, it does not go with the Metastore
       – Because it includes the data access
  • 12. Schemas in Streams
  • 13. Example: Hortonworks Schema Registry
      Provides a central repository for messages’ metadata
      Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi)
      Can be used by any application via REST interface
      Schema defined in JSON
      Schema is tied to a Kafka topic or NiFi flow
      Every schema has a name: e.g. temp_sensor_data
      Schemas can have one or more versions
       – Different messages in a topic will have different versions of the schema
       – Compatibility between schema versions can be none, backwards, forwards, or both
      Lifecycle management: schema versions have state, e.g. INITIATED, ENABLED, ARCHIVED
      Serdes stored with schema so system knows how to (de)serialize data
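The compatibility modes above can be illustrated with a small sketch. This is not the registry's actual algorithm, only a simplified model where a schema version is a flat dict of field name to type: "backward" means a reader on the new version can still read data written with the old one, and "forward" the reverse.

```python
# Illustrative-only model of schema-version compatibility; the real
# Schema Registry rules are richer (defaults, nested records, renames).

def is_backward_compatible(old_schema, new_schema):
    """New readers can read old data: every old field survives unchanged."""
    return all(new_schema.get(name) == typ for name, typ in old_schema.items())

def is_forward_compatible(old_schema, new_schema):
    """Old readers can read new data: every new field already existed."""
    return all(old_schema.get(name) == typ for name, typ in new_schema.items())

def compatibility(old_schema, new_schema):
    back = is_backward_compatible(old_schema, new_schema)
    fwd = is_forward_compatible(old_schema, new_schema)
    if back and fwd:
        return "both"
    return "backward" if back else ("forward" if fwd else "none")

v1 = {"sensorId": "long", "temperature": "int"}
v2 = {"sensorId": "long", "temperature": "int", "readAt": "long"}  # v2 adds a field
print(compatibility(v1, v2))  # adding a field keeps old fields readable
```

In this simplified model, adding a field is a backward-compatible change, which is why a topic can carry a mix of version 1 and version 2 messages.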
  • 14. Example Schema Registry Schema
     {
       "name": "temp_sensor_data",
       "fields": [
         { "name": "sensorId", "type": "long" },
         { "name": "location", "type": "record", "fields": [
           { "name": "longitude", "type": "double" },
           { "name": "latitude", "type": "double" }
         ]},
         { "name": "temperature", "type": "int" },
         { "name": "readAt", "type": "long" }
       ]
     }
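To make the schema above concrete, here is a minimal validator that checks a message against it. The type mapping (long/int to Python int, double to float) and the validator itself are assumptions for illustration, not the registry's serdes.

```python
# The temp_sensor_data schema from the slide, as a Python structure.
SCHEMA = {
    "name": "temp_sensor_data",
    "fields": [
        {"name": "sensorId", "type": "long"},
        {"name": "location", "type": "record", "fields": [
            {"name": "longitude", "type": "double"},
            {"name": "latitude", "type": "double"},
        ]},
        {"name": "temperature", "type": "int"},
        {"name": "readAt", "type": "long"},
    ],
}

# Assumed mapping of primitive schema types to Python types.
PRIMITIVES = {"long": int, "int": int, "double": float, "string": str}

def validate(record, fields):
    """Check every schema field is present with a value of the right type;
    recurse into nested record types."""
    for field in fields:
        value = record.get(field["name"]) if isinstance(record, dict) else None
        if field["type"] == "record":
            if not isinstance(value, dict) or not validate(value, field["fields"]):
                return False
        elif not isinstance(value, PRIMITIVES[field["type"]]):
            return False
    return True

reading = {"sensorId": 17, "location": {"longitude": -122.4, "latitude": 37.8},
           "temperature": 21, "readAt": 1528848000}
print(validate(reading, SCHEMA["fields"]))  # True
```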
  • 15. Contrasting SQL and Registry Schemas
      SQL: schema tied to a table. Registry: schema tied to a Kafka topic or NiFi flow
      SQL: schema applies to all records in a partition. Registry: records in a topic may have different versions of the schema, with no given order
      SQL: schema defined in SQL DDL, e.g. CREATE TABLE T (A INT, B VARCHAR(20));. Registry: schema defined in JSON
      SQL: primary access is via SQL for users and Thrift for SQL engines. Registry: primary access is via UI for developers and Java/REST for streaming applications
      SQL: supports standard SQL types and Java types. Registry: supports Java types
      SQL: no concept of schema lifecycle. Registry: schema lifecycle management via schema version state
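One of the gaps the contrast above exposes is type translation. The sketch below shows the shape of a SQL-to-Java type bridge; the mapping is a partial, assumed one (the real bridge would have to cover many more types, including parameterized ones like DECIMAL(10,2)).

```python
# Assumed, partial SQL -> Java type mapping for illustration only.
SQL_TO_JAVA = {
    "INT": "Integer",
    "BIGINT": "Long",
    "DOUBLE": "Double",
    "BOOLEAN": "Boolean",
    "TIMESTAMP": "java.sql.Timestamp",
}

def to_java_type(sql_type):
    """Map a SQL type name to an assumed Java type, dropping any
    length/precision parameters (VARCHAR(20) -> VARCHAR)."""
    base = sql_type.split("(")[0].strip().upper()
    if base in ("VARCHAR", "CHAR", "STRING"):
        return "String"
    return SQL_TO_JAVA[base]

print(to_java_type("VARCHAR(20)"))  # String
print(to_java_type("INT"))          # Integer
```

Note the direction matters: SQL has types (e.g. parameterized VARCHAR) with no exact Java equivalent, which is part of why the slides call the bridging work nontrivial.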
  • 16. Bringing the Strands Together
  • 17. First Problem
     With both the Hive Metastore and the Schema Registry we are adding yet another component to the system
      Administrators have another system to install, monitor, update, …
      Developers must maintain two systems whose basic functionality (record and serve runtime metadata) is the same
      Other systems that want to integrate with runtime metadata (security systems like Ranger and Sentry, governance systems like Atlas) have to integrate with each component separately
  • 18. Second Problem
     Hardwiring a perspective into a metadata repository makes it harder to share data between applications
      Sometimes your streaming application will want to read from a table
       – It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka stream
      Sometimes your query will want to read from a stream
       – It needs to think about data as being in a table, whether it comes from a Hive table or a Kafka stream
      To share data today, tools have to be able to read data using both paradigms
  • 19. The Second Proposal: Cross the Streams
      Put the Schema Registry on top of the Metastore
      It will still support SQL and streaming perspectives
      One system means less work for admins, developers, and other tools
      One system with multiple perspectives means
       – streaming tools can view data as a stream whether it is in Kafka or Hive
       – batch tools can view data as a table whether it is in Hive or Kafka
  • 20. Streaming Application Reading from a Table
     Example:
      A stream userEvents: Kafka topic with schema { "group": "kafka", "fields": [{ "userid": "long", "eventtype": "string", ... }] }
      An application that flags users who have called support in the last 24 hours
      Hive has the record of support calls, Kafka does not: Hive table support_calls (userid long, calltime timestamp, summary string)
      App can cache the table every hour, do a join as events arrive to flag users who need extra attention
      Possible today, but requires caching data in Kafka or coding the app to read both Hive and Kafka
      Because HMS and SR are unified, streaming apps can view the table as an SR schema: supportCalls { "group": "hive", "fields": [{ "userid": "long", "calltime": "timestamp", "summary": "string" }] }
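The cache-and-join pattern on this slide can be sketched in a few lines. Plain Python stands in for the Hive and Kafka clients; the row and event shapes follow the slide's schemas, and the sample data is made up.

```python
DAY_SECONDS = 24 * 60 * 60

def build_cache(support_calls, now):
    """Refreshed hourly in a real app: the set of userids with a
    support call in the last 24 hours, from the Hive table."""
    return {row["userid"] for row in support_calls
            if now - row["calltime"] <= DAY_SECONDS}

def flag_events(events, recent_callers):
    """Run per arriving Kafka event in a real app; a batch here for
    illustration: keep events from users who recently called support."""
    return [e for e in events if e["userid"] in recent_callers]

now = 1_000_000  # epoch seconds, stand-in for the current time
support_calls = [
    {"userid": 1, "calltime": now - 3600, "summary": "login issue"},
    {"userid": 2, "calltime": now - 2 * DAY_SECONDS, "summary": "old ticket"},
]
events = [{"userid": 1, "eventtype": "click"}, {"userid": 2, "eventtype": "click"}]

cache = build_cache(support_calls, now)
print(flag_events(events, cache))  # only userid 1 needs extra attention
```

The point of the unified metastore is that the first function could read support_calls through the same schema interface the app already uses for the topic, instead of needing a separate Hive code path.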
  • 21. Query Reading from a Stream
     Example:
      Hive table user_events, partitioned by event_hour (user_id long, event_type varchar(256), event_hour datetime)
      Kafka topic userEvents with schema { "group": "kafka", "fields": [{ "userid": "long", "eventtype": "string", ... }] }
      Hive table user_events is loaded every hour from Kafka topic userEvents
      Today Hive streaming can quickly ingest data from Kafka, but will still be missing the last few seconds from Kafka
      Would like to be able to read latest events from Kafka rather than wait until it loads into Hive
      Because HMS and SR are unified, Hive can view the Kafka topic as a partition of its table: partition event_hour='latest'
      Hive queries can now read Kafka topic userEvents as a partition of user_events
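The "Kafka topic as the latest partition" idea amounts to a union at scan time. In this sketch, lists stand in for partition files and the topic tail; the names are illustrative, not Hive internals.

```python
def scan_user_events(hive_partitions, kafka_tail):
    """Yield rows from every loaded hourly partition, then from the
    still-unloaded Kafka tail, which the unified metastore would let
    Hive expose as partition event_hour='latest'."""
    for hour in sorted(hive_partitions):
        yield from hive_partitions[hour]
    yield from kafka_tail

hive_partitions = {
    "2018-06-18T10": [{"user_id": 1, "event_type": "login"}],
    "2018-06-18T11": [{"user_id": 2, "event_type": "click"}],
}
kafka_tail = [{"user_id": 3, "event_type": "purchase"}]  # not yet loaded into Hive

rows = list(scan_user_events(hive_partitions, kafka_tail))
print(len(rows))  # 3: the query also sees the last few seconds of events
```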
  • 22. Some Assembly Required
      Need to bridge the gaps between SQL and Registry schemas; this is nontrivial:
       – Schema consistent for all records in a partition versus different schema versions in the stream
       – SQL types versus Java types
       – Schema as an attribute of a table versus as a first class object with version and lifecycle
      Will require connectors so streaming apps can use batch serdes and vice versa
      Work in progress:
       – https://github.com/apache/hive/pull/347
       – https://github.com/apache/hive/pull/348
       – https://github.com/apache/hive/pull/349
       – https://issues.apache.org/jira/browse/HIVE-19521
       – https://issues.apache.org/jira/browse/HIVE-19522
  • 23. Can We Share Too Much?
  • 24. Yes, Yes We Can
     Example use case: Hive LLAP being used for analytics, Spark for ETL
     I have been extolling the benefits of a shared Metastore for the last 20 slides, so clearly we want to share one instance between them
     But:
      Hive and Spark can't always read each other's data
       – e.g. Spark can't read Hive's ACID tables
      Different use cases require different security models
       – e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger
      Different defaults are appropriate for different use cases
       – e.g. doAs=false for LLAP, doAs=true for Hive reads from Spark catalog
  • 25. Third Proposal: Add Catalogs
      Catalog is the standard SQL top level container
      Catalogs contain databases, thus fully addressing a table will become catalog.database.table
      Default catalog 'hive' added in 3.0, and all existing databases placed in it
      In 3.0 catalogs exist only in the metastore, not yet exposed to SQL
      Goal: different catalogs can have different security settings and defaults
      Ongoing work, can be tracked at HIVE-18685
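The catalog.database.table addressing can be sketched as a small resolver. The per-catalog settings mirror the deck's LLAP/Spark example; the resolver itself is illustrative, not the metastore's implementation, and the 'etl' catalog name is taken from the following slide.

```python
# Per-catalog defaults, following the deck's example installation.
CATALOGS = {
    "hive": {"acid_default": True, "authorization": "Ranger", "doAs": False},
    "etl": {"acid_default": False, "authorization": "StorageBasedAuth", "doAs": True},
}

def resolve(name, default_catalog="hive", default_database="default"):
    """Split a possibly partial catalog.database.table name, filling in
    defaults for the missing leading parts, and return the catalog's
    settings alongside the resolved name."""
    parts = name.split(".")
    if len(parts) == 3:
        catalog, database, table = parts
    elif len(parts) == 2:
        catalog, (database, table) = default_catalog, parts
    else:
        catalog, database, table = default_catalog, default_database, parts[0]
    return catalog, database, table, CATALOGS[catalog]

catalog, db, table, settings = resolve("etl.staging.raw_events")
print(catalog, settings["doAs"])  # the etl catalog runs with doAs=true
```

This is the payoff of the proposal: the engine resolving a table name learns not just where the table lives but which security model and defaults govern it.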
  • 26. Example Installation With Catalogs
     LLAP defaults to the 'hive' catalog:
      Tables are ACID by default
      Ranger for security
      doAs=false
     Spark defaults to the 'etl' catalog:
      ACID tables not allowed
      StorageBasedAuth for security
      doAs=true
     Each can still read from the other catalog (assuming permission granted), but can now be aware of differing authorization, defaults, etc.
     Also useful in the cloud, where multiple business units may be sharing storage but need different defaults, policies, etc.
  • 27. What Next?
      Now that we have released the Metastore as a separate module, the Hive community needs to decide whether it becomes a subproject or a separate top level project
      Need to finish the work to integrate the Schema Registry
      Need to involve contributors from other, non-Hive projects
      Need to finish implementing Catalogs
      Patches accepted!
  • 28. Credits
      Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig, Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation projects
       – All are referred to herein without “Apache” for brevity
      HDFS and MapReduce are components of Apache Hadoop
      Thanks to the Hive community for their work in getting the Hive Metastore separated out from much of the rest of Hive
      Google Translate used for Latin slide title
  • 29. Thank You