SlideShare a Scribd company logo
1 of 23
Š Hortonworks Inc. 2015
Hive 0.14 Does ACID
February 2015
Page 1
Alan Gates
gates@hortonworks.com
@alanfgates
Š Hortonworks Inc. 2015 Page 2
• Hive only updated partitions
–Insert overwrite rewrote an entire partition
–Forced daily or even hourly partitions
–Could add files to partition directory, but no file compaction
• What about concurrent readers?
–Ok for inserts, but overwrite caused races
–There is a zookeeper lock manager, but…
• No way to delete or update rows
• No INSERT INTO T VALUES…
–Breaks some tools
History
Š Hortonworks Inc. 2015 Page 3
•Hadoop and Hive have always…
–Worked without ACID
–Perceived as tradeoff for performance
•But, your data isn’t static
–It changes daily, hourly, or faster
–Ad hoc solutions require a lot of work
–Managing change makes the user’s life better
•Do or Do Not, There is NO Try
Why is ACID Critical?
Š Hortonworks Inc. 2015 Page 4
• NOT OLTP!!!
• Updating a Dimension Table
–Changing a customer’s address
• Delete Old Records
–Remove records for compliance
• Update/Restate Large Fact Tables
–Fix problems after they are in the warehouse
• Streaming Data Ingest
–A continual stream of data coming in
–Typically from Flume or Storm
• NOT OLTP!!!
Use Cases
Š Hortonworks Inc. 2015 Page 5
• New DML
– INSERT INTO T VALUES(1, ‘fred’, ...);
– UPDATE T SET (x = 5[, ...]) WHERE ...
– DELETE FROM T WHERE ...
– Supports partitioned and non-partitioned tables, WHERE clause can
specify partition but not required
• Restrictions
– Table must have format that extends AcidInputFormat
– currently ORC
– Table must be bucketed and not sorted
– can use 1 bucket but this will restrict write ||ism
– Table must be marked transactional
– create table T(...) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES
('transactional'='true');
New SQL in Hive 0.14
Š Hortonworks Inc. 2015 Page 6
•Good
–Handles compactions for us
–Already has similar data model with LSM
•Bad
–No cross row transactions
–Would require us to write a transaction manager over HBase,
doable, but not less work
–Hfile is column family based rather than columnar
–HBase focused on point lookups and range scans
–Warehousing requires full scans
Why Not HBase?
Š Hortonworks Inc. 2015 Page 7
•HDFS Does Not Allow Arbitrary Writes
–Store changes as delta files
–Stitched together by client on read
•Writes get a Transaction ID
–Sequentially assigned by Metastore
•Reads get Committed Transactions
–Provides snapshot consistency
–No locks required
–Provide a snapshot of data from start of query
Design
Š Hortonworks Inc. 2015
Stitching Buckets Together
Page 8
Š Hortonworks Inc. 2015 Page 9
•Partition locations remain unchanged
–Still warehouse/$db/$tbl/$part
•Bucket Files Structured By Transactions
–Base files $part/base_$tid/bucket_*
–Delta files $part/delta_$tid_$tid/bucket_*
HDFS Layout
Š Hortonworks Inc. 2015 Page 10
•Created new AcidInput/OutputFormat
–Unique key is transaction, bucket, row
•Reader returns correct version of row based on
transaction state
•Also Added Raw API for Compactor
–Provides previous events as well
•ORC implements new API
–Extends records with change metadata
–Add operation (d, u, i), transaction and key
Input and Output Formats
Š Hortonworks Inc. 2015 Page 11
•Need to split buckets for MapReduce
–Need to split base and deltas the same way
–Use key ranges
–Use indexes
Distributing the Work
Š Hortonworks Inc. 2015 Page 12
• Existing lock managers
–In memory - not durable
–ZooKeeper - requires additional components to install, administer,
etc.
• Locks need to be integrated with transactions
–commit/rollback must atomically release locks
• We sort of have this database lying around which has
ACID characteristics (metastore)
• Transactions and locks stored in metastore
• Uses metastore DB to provide unique, ascending ids for
transactions and locks
Transaction Manager
Š Hortonworks Inc. 2015 Page 13
•In Hive 0.14 DML statements are auto-commit
–Working on adding BEGIN, COMMIT, ROLLBACK
•Snapshot isolation
–Reader will see consistent data for the duration of
his/her query
–May extend to other isolation levels in the future
•Current transactions can be displayed using
new SHOW TRANSACTIONS statement
Transaction Model
Š Hortonworks Inc. 2015 Page 14
•Three types of locks
–shared
–semi-shared (can co-exist with shared, but not
other semi-shared)
–exclusive
•Operations require different locks
–SELECT, INSERT – shared
–UPDATE, DELETE – semi-shared
–DROP, INSERT OVERWRITE – exclusive
Locking Model
Š Hortonworks Inc. 2015 Page 15
•Each transaction (or batch of transactions
in streaming ingest) creates a new delta file
•Too many files = NameNode 
•Need a way to
–Collect many deltas into one delta – minor
compaction
–Rewrite base and delta to new base – major
compaction
Compactor
Š Hortonworks Inc. 2015 Page 16
•Run when there are 10 or more deltas
(configurable)
•Results in base + 1 delta
Minor Compaction
/hive/warehouse/purchaselog/ds=201403311000/base_0028000
/hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028100
/hive/warehouse/purchaselog/ds=201403311000/delta_0028101_0028200
/hive/warehouse/purchaselog/ds=201403311000/delta_0028201_0028300
/hive/warehouse/purchaselog/ds=201403311000/delta_0028301_0028400
/hive/warehouse/purchaselog/ds=201403311000/delta_0028401_0028500
/hive/warehouse/purchaselog/ds=201403311000/base_0028000
/hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028500
Š Hortonworks Inc. 2015 Page 17
•Run when deltas are 10% the size of base
(configurable)
•Results in new base
Major Compaction
/hive/warehouse/purchaselog/ds=201403311000/base_0028000
/hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028100
/hive/warehouse/purchaselog/ds=201403311000/delta_0028101_0028200
/hive/warehouse/purchaselog/ds=201403311000/delta_0028201_0028300
/hive/warehouse/purchaselog/ds=201403311000/delta_0028301_0028400
/hive/warehouse/purchaselog/ds=201403311000/delta_0028401_0028500
/hive/warehouse/purchaselog/ds=201403311000/base_0028500
Š Hortonworks Inc. 2015 Page 18
• Metastore thrift server will schedule and execute
compactions
–No need for user to schedule
–User can initiate via new ALTER TABLE COMPACT
statement
• No locking required, compactions run at same time
as select and DML
–Compactor aware of readers, does not remove old files until
readers have finished with them
• Current compactions can be viewed via new SHOW
COMPACTIONS statement
Compactor Continued
Š Hortonworks Inc. 2015 Page 19
• Data is flowing in from generators in a stream
• Without this, you have to add it to Hive in batches, often
every hour
–Thus your users have to wait an hour before they can see their
data
• New interface in hive.hcatalog.streaming lets applications
write small batches of records and commit them
–Users can now see data within a few seconds of it arriving from
the data generators
• Available for Apache Flume in HDP 2.1 and Storm in HDP
2.2
Application: Streaming Ingest
Š Hortonworks Inc. 2015 Page 20
•On the client
hive.support.concurrency=true
hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.enforce.bucketing=true
•On the metastore server
hive.compactor.initiator.on=true
hive.compactor.worker.threads=1 # or more
Configuration
Š Hortonworks Inc. 2015 Page 21
• Phase 1, Hive 0.13
– Transaction and new lock manager
– ORC file support
– Automatic and manual compaction
– Snapshot isolation
– Streaming ingest via Flume
• Phase 2, Hive 0.14
– INSERT … VALUES, UPDATE, DELETE
• Phase 3, Hive 1.2(?)
– Add support for only some columns in insert
– INSERT into T (a, b) select c, d from U;
– BEGIN, COMMIT, ROLLBACK
• Future (all speculative based on user feedback)
– Integration with HCatalog
– Versioned or point in time queries
– Streaming ingest of updates and deletes
– Additional isolation levels such as dirty read or read committed
– MERGE
Phases of Development
Š Hortonworks Inc. 2015 Page 22
•JIRA:
https://issues.apache.org/jira/browse/HI
VE-5317
•Adds ACID semantics to Hive
•Uses SQL standard commands
–INSERT, UPDATE, DELETE
•Provides scalable read and write access
Conclusion
Š Hortonworks Inc. 2015
Thank You!
Questions & Answers
Page 23

More Related Content

What's hot

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013alanfgates
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016alanfgates
 
Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0DataWorks Summit
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHortonworks
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetupt3rmin4t0r
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Deadt3rmin4t0r
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesDataWorks Summit
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationEyad Garelnabi
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopHortonworks
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleYifeng Jiang
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerDataWorks Summit
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoYu Liu
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 

What's hot (20)

Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Strata feb2013
Strata feb2013Strata feb2013
Strata feb2013
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
Apache Hive ACID Project
Apache Hive ACID ProjectApache Hive ACID Project
Apache Hive ACID Project
 
Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016Hive2.0 big dataspain-nov-2016
Hive2.0 big dataspain-nov-2016
 
Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0Meet HBase 2.0 and Phoenix-5.0
Meet HBase 2.0 and Phoenix-5.0
 
Hive - 1455: Cloud Storage
Hive - 1455: Cloud StorageHive - 1455: Cloud Storage
Hive - 1455: Cloud Storage
 
Data organization: hive meetup
Data organization: hive meetupData organization: hive meetup
Data organization: hive meetup
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Llap: Locality is Dead
Llap: Locality is DeadLlap: Locality is Dead
Llap: Locality is Dead
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
Ozone- Object store for Apache Hadoop
Ozone- Object store for Apache HadoopOzone- Object store for Apache Hadoop
Ozone- Object store for Apache Hadoop
 
Sub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scaleSub-second-sql-on-hadoop-at-scale
Sub-second-sql-on-hadoop-at-scale
 
ORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, SmallerORC 2015: Faster, Better, Smaller
ORC 2015: Faster, Better, Smaller
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
A TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with PrestoA TPC Benchmark of Hive LLAP and Comparison with Presto
A TPC Benchmark of Hive LLAP and Comparison with Presto
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 

Viewers also liked

Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016alanfgates
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache trainingalanfgates
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016alanfgates
 
ORC 2015
ORC 2015ORC 2015
ORC 2015t3rmin4t0r
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014alanfgates
 
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Hakka Labs
 
GNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesGNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesTanel Poder
 
GT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseGT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseRob Tweed
 
Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016Vrushali Channapattan
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...DataWorks Summit/Hadoop Summit
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 

Viewers also liked (19)

Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016Keynote apache bd-eu-nov-2016
Keynote apache bd-eu-nov-2016
 
Hortonworks apache training
Hortonworks apache trainingHortonworks apache training
Hortonworks apache training
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Big data spain keynote nov 2016
Big data spain keynote nov 2016Big data spain keynote nov 2016
Big data spain keynote nov 2016
 
ORC 2015
ORC 2015ORC 2015
ORC 2015
 
Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014Hive analytic workloads hadoop summit san jose 2014
Hive analytic workloads hadoop summit san jose 2014
 
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
Introduction to InfluxDB, an Open Source Distributed Time Series Database by ...
 
Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics Quark Virtualization Engine for Analytics
Quark Virtualization Engine for Analytics
 
The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit The Evolution of Big Data Pipelines at Intuit
The Evolution of Big Data Pipelines at Intuit
 
Sparksee overview
Sparksee overviewSparksee overview
Sparksee overview
 
GNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesGNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for Databases
 
GT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL DatabaseGT.M: A Tried and Tested Open-Source NoSQL Database
GT.M: A Tried and Tested Open-Source NoSQL Database
 
Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016Timeline service V2 at the Hadoop Summit SJ 2016
Timeline service V2 at the Hadoop Summit SJ 2016
 
Simplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & TroubleshootingSimplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & Troubleshooting
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
 
The Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache FlinkThe Stream Processor as a Database Apache Flink
The Stream Processor as a Database Apache Flink
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 

Similar to Hive acid-updates-strata-sjc-feb-2015

Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACIDHortonworks
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019alanfgates
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveDataWorks Summit
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in HiveEugene Koifman
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019alanfgates
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?DataWorks Summit
 
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...In-Memory Computing Summit
 
Thug feb 23 2015 Chen Zhang
Thug feb 23 2015 Chen ZhangThug feb 23 2015 Chen Zhang
Thug feb 23 2015 Chen ZhangChen Zhang
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudDataWorks Summit/Hadoop Summit
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?DataWorks Summit
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleHortonworks
 
Hive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingHive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingDataWorks Summit
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 

Similar to Hive acid-updates-strata-sjc-feb-2015 (20)

Apache Hive on ACID
Apache Hive on ACIDApache Hive on ACID
Apache Hive on ACID
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
ACID Transactions in Hive
ACID Transactions in HiveACID Transactions in Hive
ACID Transactions in Hive
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...
 
Thug feb 23 2015 Chen Zhang
Thug feb 23 2015 Chen ZhangThug feb 23 2015 Chen Zhang
Thug feb 23 2015 Chen Zhang
 
Moving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloudMoving towards enterprise ready Hadoop clusters on the cloud
Moving towards enterprise ready Hadoop clusters on the cloud
 
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
 
Apache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, ScaleApache Hive 2.0; SQL, Speed, Scale
Apache Hive 2.0; SQL, Speed, Scale
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Hive & HBase For Transaction Processing
Hive & HBase For Transaction ProcessingHive & HBase For Transaction Processing
Hive & HBase For Transaction Processing
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 

Recently uploaded

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptrcbcrtm
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 

Recently uploaded (20)

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
cpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.pptcpct NetworkING BASICS AND NETWORK TOOL.ppt
cpct NetworkING BASICS AND NETWORK TOOL.ppt
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 

Hive acid-updates-strata-sjc-feb-2015

  • 1. Š Hortonworks Inc. 2015 Hive 0.14 Does ACID February 2015 Page 1 Alan Gates gates@hortonworks.com @alanfgates
  • 2. Š Hortonworks Inc. 2015 Page 2 • Hive only updated partitions –Insert overwrite rewrote an entire partition –Forced daily or even hourly partitions –Could add files to partition directory, but no file compaction • What about concurrent readers? –Ok for inserts, but overwrite caused races –There is a zookeeper lock manager, but… • No way to delete or update rows • No INSERT INTO T VALUES… –Breaks some tools History
  • 3. Š Hortonworks Inc. 2015 Page 3 •Hadoop and Hive have always… –Worked without ACID –Perceived as tradeoff for performance •But, your data isn’t static –It changes daily, hourly, or faster –Ad hoc solutions require a lot of work –Managing change makes the user’s life better •Do or Do Not, There is NO Try Why is ACID Critical?
  • 4. Š Hortonworks Inc. 2015 Page 4 • NOT OLTP!!! • Updating a Dimension Table –Changing a customer’s address • Delete Old Records –Remove records for compliance • Update/Restate Large Fact Tables –Fix problems after they are in the warehouse • Streaming Data Ingest –A continual stream of data coming in –Typically from Flume or Storm • NOT OLTP!!! Use Cases
  • 5. Š Hortonworks Inc. 2015 Page 5 • New DML – INSERT INTO T VALUES(1, ‘fred’, ...); – UPDATE T SET (x = 5[, ...]) WHERE ... – DELETE FROM T WHERE ... – Supports partitioned and non-partitioned tables, WHERE clause can specify partition but not required • Restrictions – Table must have format that extends AcidInputFormat – currently ORC – Table must be bucketed and not sorted – can use 1 bucket but this will restrict write ||ism – Table must be marked transactional – create table T(...) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES ('transactional'='true'); New SQL in Hive 0.14
  • 6. Š Hortonworks Inc. 2015 Page 6 •Good –Handles compactions for us –Already has similar data model with LSM •Bad –No cross row transactions –Would require us to write a transaction manager over HBase, doable, but not less work –Hfile is column family based rather than columnar –HBase focused on point lookups and range scans –Warehousing requires full scans Why Not HBase?
  • 7. Š Hortonworks Inc. 2015 Page 7 •HDFS Does Not Allow Arbitrary Writes –Store changes as delta files –Stitched together by client on read •Writes get a Transaction ID –Sequentially assigned by Metastore •Reads get Committed Transactions –Provides snapshot consistency –No locks required –Provide a snapshot of data from start of query Design
  • 8. Š Hortonworks Inc. 2015 Stitching Buckets Together Page 8
  • 9. Š Hortonworks Inc. 2015 Page 9 •Partition locations remain unchanged –Still warehouse/$db/$tbl/$part •Bucket Files Structured By Transactions –Base files $part/base_$tid/bucket_* –Delta files $part/delta_$tid_$tid/bucket_* HDFS Layout
  • 10. Š Hortonworks Inc. 2015 Page 10 •Created new AcidInput/OutputFormat –Unique key is transaction, bucket, row •Reader returns correct version of row based on transaction state •Also Added Raw API for Compactor –Provides previous events as well •ORC implements new API –Extends records with change metadata –Add operation (d, u, i), transaction and key Input and Output Formats
  • 11. Š Hortonworks Inc. 2015 Page 11 •Need to split buckets for MapReduce –Need to split base and deltas the same way –Use key ranges –Use indexes Distributing the Work
  • 12. Š Hortonworks Inc. 2015 Page 12 • Existing lock managers –In memory - not durable –ZooKeeper - requires additional components to install, administer, etc. • Locks need to be integrated with transactions –commit/rollback must atomically release locks • We sort of have this database lying around which has ACID characteristics (metastore) • Transactions and locks stored in metastore • Uses metastore DB to provide unique, ascending ids for transactions and locks Transaction Manager
  • 13. Š Hortonworks Inc. 2015 Page 13 •In Hive 0.14 DML statements are auto-commit –Working on adding BEGIN, COMMIT, ROLLBACK •Snapshot isolation –Reader will see consistent data for the duration of his/her query –May extend to other isolation levels in the future •Current transactions can be displayed using new SHOW TRANSACTIONS statement Transaction Model
  • 14. Š Hortonworks Inc. 2015 Page 14 •Three types of locks –shared –semi-shared (can co-exist with shared, but not other semi-shared) –exclusive •Operations require different locks –SELECT, INSERT – shared –UPDATE, DELETE – semi-shared –DROP, INSERT OVERWRITE – exclusive Locking Model
  • 15. Š Hortonworks Inc. 2015 Page 15 •Each transaction (or batch of transactions in streaming ingest) creates a new delta file •Too many files = NameNode  •Need a way to –Collect many deltas into one delta – minor compaction –Rewrite base and delta to new base – major compaction Compactor
  • 16. Š Hortonworks Inc. 2015 Page 16 •Run when there are 10 or more deltas (configurable) •Results in base + 1 delta Minor Compaction /hive/warehouse/purchaselog/ds=201403311000/base_0028000 /hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028100 /hive/warehouse/purchaselog/ds=201403311000/delta_0028101_0028200 /hive/warehouse/purchaselog/ds=201403311000/delta_0028201_0028300 /hive/warehouse/purchaselog/ds=201403311000/delta_0028301_0028400 /hive/warehouse/purchaselog/ds=201403311000/delta_0028401_0028500 /hive/warehouse/purchaselog/ds=201403311000/base_0028000 /hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028500
  • 17. Š Hortonworks Inc. 2015 Page 17 •Run when deltas are 10% the size of base (configurable) •Results in new base Major Compaction /hive/warehouse/purchaselog/ds=201403311000/base_0028000 /hive/warehouse/purchaselog/ds=201403311000/delta_0028001_0028100 /hive/warehouse/purchaselog/ds=201403311000/delta_0028101_0028200 /hive/warehouse/purchaselog/ds=201403311000/delta_0028201_0028300 /hive/warehouse/purchaselog/ds=201403311000/delta_0028301_0028400 /hive/warehouse/purchaselog/ds=201403311000/delta_0028401_0028500 /hive/warehouse/purchaselog/ds=201403311000/base_0028500
  • 18. Š Hortonworks Inc. 2015 Page 18 • Metastore thrift server will schedule and execute compactions –No need for user to schedule –User can initiate via new ALTER TABLE COMPACT statement • No locking required, compactions run at same time as select and DML –Compactor aware of readers, does not remove old files until readers have finished with them • Current compactions can be viewed via new SHOW COMPACTIONS statement Compactor Continued
  • 19. Š Hortonworks Inc. 2015 Page 19 • Data is flowing in from generators in a stream • Without this, you have to add it to Hive in batches, often every hour –Thus your users have to wait an hour before they can see their data • New interface in hive.hcatalog.streaming lets applications write small batches of records and commit them –Users can now see data within a few seconds of it arriving from the data generators • Available for Apache Flume in HDP 2.1 and Storm in HDP 2.2 Application: Streaming Ingest
  • 20. Š Hortonworks Inc. 2015 Page 20 •On the client hive.support.concurrency=true hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager hive.enforce.bucketing=true •On the metastore server hive.compactor.initiator.on=true hive.compactor.worker.threads=1 # or more Configuration
  • 21. Š Hortonworks Inc. 2015 Page 21 • Phase 1, Hive 0.13 – Transaction and new lock manager – ORC file support – Automatic and manual compaction – Snapshot isolation – Streaming ingest via Flume • Phase 2, Hive 0.14 – INSERT … VALUES, UPDATE, DELETE • Phase 3, Hive 1.2(?) – Add support for only some columns in insert – INSERT into T (a, b) select c, d from U; – BEGIN, COMMIT, ROLLBACK • Future (all speculative based on user feedback) – Integration with HCatalog – Versioned or point in time queries – Streaming ingest of updates and deletes – Additional isolation levels such as dirty read or read committed – MERGE Phases of Development
  • 22. Š Hortonworks Inc. 2015 Page 22 •JIRA: https://issues.apache.org/jira/browse/HI VE-5317 •Adds ACID semantics to Hive •Uses SQL standard commands –INSERT, UPDATE, DELETE •Provides scalable read and write access Conclusion
  • 23. Š Hortonworks Inc. 2015 Thank You! Questions & Answers Page 23