Standalone metastore-dws-sjc-june-2018
- 4. Big Data SQL Engines
There are many big data SQL engines:
Hive, Spark, Impala, Presto, …
- 5. Pro Pluribus Unum
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
These engines all store their metadata in the Hive Metastore
(Diagram: Hive, Impala, Presto, and Spark all using the shared Hive Metastore)
- 6. The Good, …
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
These engines all store their metadata in the Hive Metastore
Good: Shared metadata makes sharing data between engines easier
- 7. The Bad, …
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
These engines all store their metadata in the Hive Metastore
Good: Shared metadata makes sharing data between engines easier
Bad: Non-Hive systems have to install much of Hive to get the Metastore
Bad: Hard for other projects to contribute to the Metastore
- 8. And the First Proposal
There are many big data SQL engines: Hive, Spark, Impala, Presto, …
These engines all store their metadata in the Hive Metastore
Good: Shared metadata makes sharing data between engines easier
Bad: Non-Hive systems have to install much of Hive to get the Metastore
Bad: Hard for other projects to contribute to the Metastore
Proposal: Separate the Metastore from Hive
- 9. Breaking out the Metastore
In Hive 3.0 the Metastore was released as a separate module
– Enables the Metastore to continue to be used by many engines
Can be installed and run without the rest of Hive
– A few features are missing when Hive is not present, e.g. the compactor
– These are planned for a future release
Backwards compatibility is maintained for Thrift clients (see the sketch below)
– Older clients can talk to the new, separate Metastore
A few small changes are needed for server hook implementations
There is a proposal to make the Metastore a separate Apache project
– This would enable better collaboration with non-Hive projects
– Still in discussion with the Hive PMC
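For example, an existing Thrift client keeps working against the standalone Metastore. A minimal sketch, assuming the Hive 3.x Java client (HiveMetaStoreClient) and an illustrative metastore URI; configuration keys and constructor details may vary slightly between releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreClientExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the standalone metastore; host and port are illustrative.
    Configuration conf = new Configuration();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // List databases and fetch one table's metadata over Thrift.
      for (String db : client.getAllDatabases()) {
        System.out.println("database: " + db);
      }
      Table t = client.getTable("default", "my_table");
      System.out.println("table location: " + t.getSd().getLocation());
    } finally {
      client.close();
    }
  }
}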
- 10. Enables Shared Metadata in the Cloud
(Diagram: on-demand ephemeral workloads sharing data & storage, with elastic resource management and shared metadata, security & governance)
- 11. Is this HCatalog 2.0?
Didn’t we do this before? Wasn’t it called HCatalog?
No, HCatalog is different
HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
The Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access
- 13. Example: Hortonworks Schema Registry
Provides a central repository for messages’ metadata
Intended for streaming data (e.g. Kafka) or edge data (e.g. NiFi)
Can be used by any application via a REST interface (see the sketch below)
Schema defined in JSON
Schema is tied to a Kafka topic or NiFi flow
Every schema has a name: e.g. temp_sensor_data
Schemas can have one or more versions
– Different messages in a topic will have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
Lifecycle management: schema versions have state, e.g. INITIATED, ENABLED, ARCHIVED
Serdes are stored with the schema so the system knows how to (de)serialize data
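Because the registry is exposed over REST, any application can look up a schema without pulling in Hive or registry client jars. A hedged sketch using plain Java 11 HTTP; the host, port, and endpoint path are assumptions about a typical deployment, not verified values:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegistryLookupExample {
  public static void main(String[] args) throws Exception {
    // Illustrative base URL and path; adjust to your Schema Registry deployment.
    String url = "http://registry-host:9090/api/v1/schemaregistry/schemas/temp_sensor_data";

    HttpClient http = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Accept", "application/json")
        .GET()
        .build();

    // The response body is the schema metadata as JSON (name, type, compatibility, ...).
    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode());
    System.out.println(response.body());
  }
}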
- 14. Example Schema Registry Schema
{ "name": "temp_sensor_data",
"fields": [
{ "name": "sensorId", "type": "long"},
{ "name": "location", "type": "record",
"fields": [
{ "name": "longitude", "type": "double"},
{ "name": "latitude", "type": "double"}
]},
{ "name": "temperature", "type": "int"},
{ "name": "readAt", "type": "long"}
]
}
- 15. Contrasting SQL and Registry Schemas
SQL | Schema Registry
Schema tied to a table | Schema tied to a Kafka topic or NiFi flow
Schema applies to all records in a partition | Records in a topic may have different versions of the schema, with no given order
Schema defined in SQL DDL, e.g. CREATE TABLE T (A INT, B VARCHAR(20)); | Schema defined in JSON
Primary access is via SQL for users and Thrift for SQL engines | Primary access is via UI for developers and Java/REST for streaming applications
Supports standard SQL types and Java types | Supports Java types
No concept of schema lifecycle | Schema lifecycle management via schema version state
- 16. Bringing the Strands Together
- 17. First Problem
With both the Hive Metastore and the Schema Registry we are adding yet another component to the system
Administrators have another system to install, monitor, update, …
Developers must maintain two systems whose basic functionality, recording and serving runtime metadata, is the same
Other systems that want to integrate with runtime metadata, such as security systems like Ranger and Sentry and governance systems like Atlas, have to integrate with each component separately
- 18. Second Problem
Sometimes your streaming application will want to read from a table
– It would prefer to think of data in the registry model, whether it comes from a Hive table or a Kafka stream
Sometimes your query will want to read from a stream
– It needs to think about data as being in a table, whether it comes from a Hive table or a Kafka stream
To share data today, tools have to be able to read data using both paradigms
Hardwiring a perspective into a metadata repository makes it harder to share data between applications
- 19. The Second Proposal: Cross the Streams
Put the Schema Registry on top of the Metastore
It will still support SQL and streaming perspectives
One system means less work for admins, developers, and other tools
One system with multiple perspectives means
– streaming tools can view data as a stream whether it is in Kafka or Hive
– batch tools can view data as a table whether it is in Hive or Kafka
- 20. Streaming Application Reading from a Table
Example:
• A stream userEvents
• An application that flags users who have called support in the last 24 hours
• Hive has a record of support calls, Kafka does not

Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}

Hive table support_calls
  userid long
  calltime timestamp
  summary string

supportCalls Schema:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}

• The app can cache the table every hour and do a join as events arrive to flag users who need extra attention (see the sketch below)
• Possible today, but requires caching the data in Kafka or coding the app to read both Hive and Kafka
• Because HMS and SR are unified, the streaming app can view the Hive table as an SR schema
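The flagging logic itself is simple once both sides are visible through one metadata view. A minimal, self-contained sketch of the cache-and-join pattern above, assuming events and support-call rows have already been read and deserialized; all class and field names are illustrative, not part of any real API:

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class SupportCallFlagger {
  // userid -> time of the most recent support call, refreshed from the Hive table every hour.
  private final Map<Long, Instant> recentCalls = new HashMap<>();

  // Replace the cached snapshot of support_calls (e.g. from an hourly scan of the table).
  public void refreshCache(Map<Long, Instant> latestCallTimes) {
    recentCalls.clear();
    recentCalls.putAll(latestCallTimes);
  }

  // Flag an incoming userEvents record if the user called support in the last 24 hours.
  public boolean shouldFlag(long userId, Instant eventTime) {
    Instant lastCall = recentCalls.get(userId);
    return lastCall != null
        && Duration.between(lastCall, eventTime).compareTo(Duration.ofHours(24)) <= 0;
  }
}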
- 21. Query Reading from a Stream
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents
• Today Hive streaming can quickly ingest data from Kafka, but queries will still be missing the last few seconds of data from Kafka
• We would like to be able to read the latest events from Kafka rather than wait until they load into Hive

Hive table user_events, partitioned by event_hour
  user_id long
  event_type varchar(256)
  event_hour datetime

Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}

• Because HMS and SR are unified, Hive can view the Kafka topic as a partition of its table: user_events, partition event_hour='latest'
• Hive queries can now read Kafka topic userEvents as a partition of user_events
- 22. Some Assembly Required
Need to bridge the gaps between SQL and Registry schemas – nontrivial
– Schema consistent for all records in a partition versus different schema versions in the stream
– SQL types versus Java types (see the illustrative sketch below)
– Schema as an attribute of a table versus as a first class object with version and lifecycle
Will require connectors so streaming apps can use batch serdes and vice versa
Work in progress:
– https://github.com/apache/hive/pull/347
– https://github.com/apache/hive/pull/348
– https://github.com/apache/hive/pull/349
– https://issues.apache.org/jira/browse/HIVE-19521
– https://issues.apache.org/jira/browse/HIVE-19522
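One of the gaps above, SQL types versus Java types, is mostly a mechanical mapping. A purely illustrative sketch; the pairs shown are common conventions, not the mapping any particular connector uses:

import java.util.Map;

public class SqlToJavaTypes {
  // Illustrative SQL-to-Java type pairs; a real connector would also handle
  // precision, nullability, and nested/complex types.
  static final Map<String, Class<?>> SQL_TO_JAVA = Map.of(
      "INT", Integer.class,
      "BIGINT", Long.class,
      "DOUBLE", Double.class,
      "VARCHAR", String.class,
      "TIMESTAMP", java.sql.Timestamp.class,
      "BOOLEAN", Boolean.class);

  public static void main(String[] args) {
    SQL_TO_JAVA.forEach((sqlType, javaClass) ->
        System.out.println(sqlType + " -> " + javaClass.getName()));
  }
}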
- 23. Can We Share Too Much?
- 24. Yes, Yes We Can
Example use case: Hive LLAP being used for analytics, Spark for ETL
(Diagram: LLAP and Spark sharing a single Metastore)
I have been extolling the benefits of a shared Metastore for the last 20 slides, so clearly we want to share one instance between them
But,
• Hive and Spark can't always read each other's data
• e.g. Spark can't read Hive's ACID tables
• Different use cases require different security models
• e.g. Spark ETL is likely to use StorageBasedAuth, while LLAP is likely to use Ranger
• Different defaults are appropriate for different use cases
• e.g. doAs=false for LLAP, doAs=true for Hive reads from Spark catalog
- 25. Third Proposal: Add Catalogs
A catalog is the standard SQL top level container
Catalogs contain databases, so fully addressing a table becomes catalog.database.table
A default catalog 'hive' was added in 3.0, and all existing databases were placed in it
In 3.0 catalogs exist only in the metastore; they are not yet exposed to SQL (see the sketch below)
Goal: different catalogs can have different security settings and defaults
Ongoing work, can be tracked at HIVE-18685
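Because catalogs in 3.0 exist only at the metastore level, they are manipulated through the metastore client rather than through SQL. A hedged sketch of creating a separate 'etl' catalog, assuming the Hive 3.x client and builder APIs (CatalogBuilder, createCatalog, getCatalogs); exact method names may differ between releases, and the URI and path are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Catalog;
import org.apache.hadoop.hive.metastore.client.builder.CatalogBuilder;

public class CatalogExample {
  public static void main(String[] args) throws Exception {
    // Illustrative metastore URI.
    Configuration conf = new Configuration();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // Create a second catalog for ETL work, alongside the default 'hive' catalog.
      Catalog etl = new CatalogBuilder()
          .setName("etl")
          .setLocation("hdfs:///warehouse/etl")   // illustrative warehouse path
          .build();
      client.createCatalog(etl);

      // Tables are now addressed as catalog.database.table.
      System.out.println("catalogs: " + client.getCatalogs());
    } finally {
      client.close();
    }
  }
}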
- 26. Example Installation With Catalogs
(Diagram: LLAP and Spark sharing one Metastore, each with its own default catalog)
LLAP defaults to the 'hive' catalog
• Tables are ACID by default
• Ranger for security
• doAs=false
Spark defaults to the 'etl' catalog
• ACID tables not allowed
• StorageBasedAuth for security
• doAs=true
Each can still read from the other catalog (assuming permission is granted), but can now be aware of the change in authorization, defaults, etc.
Also useful in the cloud, where multiple business units may share storage but need different defaults, policies, etc.
- 27. What Next?
Now that we have released the Metastore as a separate module, the Hive community needs to decide whether it becomes a subproject or a separate top level project
Need to finish the work to integrate the Schema Registry
Need to involve contributors from other, non-Hive projects
Need to finish implementing Catalogs
Patches accepted!
- 28. Credits
Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig,
Apache Ranger, Apache Sentry, and Apache Spark are Apache Software Foundation
projects
– All are referred to herein without “Apache” for brevity
HDFS and MapReduce are components of Apache Hadoop
Thanks to the Hive community for their work in getting the Hive Metastore separated
out from much of the rest of Hive
Google Translate used for Latin slide title