This document discusses Treasure Data's migration architecture for managing resources across multiple clusters when upgrading from Hive 1.x to Hive 2.0. It introduces components such as PerfectQueue and Plazma that enable blue-green deployment without downtime, and describes how automatic testing and validation prevent performance degradation. Resource management covers how resources are defined per account across different job queues and Hadoop clusters. Brief performance comparisons show improvements from Hive 2.x features such as Tez and vectorization.
2. About Me
• Kai Sasaki (佐々木 海)
• @Lewuathe (Twitter)
• Software Engineer at Treasure Data Inc.
• Maintains and develops Hadoop/Presto infrastructure
3. Topic
• Treasure Data infrastructure
• Hive 2.0 change
• Migration architecture
• Resource management for multi tenancy
• Performance comparison
4. • Live Data Management Platform
• Original creator of Fluentd/Embulk/Digdag
• 70+ integrations with
• BI tools
• Mobile/IoT
• Cloud Storage
• and more
6. • Hive/Pig/Presto data processing interface
• 40000+ Hive queries / day
• 130000+ Presto queries / day
• Plazma Cloud Storage
• 450000+ records/sec imported
8. Hive 2.0
• Includes major new features
• Fixed 600+ bugs
• 140+ improvements or new features
• Backward compatible as much as possible
• Hive 1.x stable line
• 2.1.0 has been available since June 20th, 2016
http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale
9. Hive 2.0
• HPLSQL
• LLAP
• HBase metastore
• Improvements of Hive on Spark
• CBO improvements
10. HPLSQL
• Procedural SQL like Oracle’s PL/SQL
• Cursor
• loops (WHILE, FOR, LOOP)
• branches (IF)
• External library which communicates through JDBC
• http://www.hplsql.org/doc
11. LLAP
• Sub-second Queries in Hive
• Saves JVM container launch time
• Data caching
• Suited to ad hoc or interactive use cases
• Beta in 2.0
http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png
13. HBase metastore
• Uses HBase as the Hive metastore
• Fetching thousands of partitions
• Limits on concurrent connections
• Will support transactions with Apache Omid
• Alpha in Hive 2.0
16. That’s all?
• Operational cost of migration
• Managing multiple clusters
• Testing and verifying multiple packages
• Differences in configuration and parameters
• Need to reduce operational cost at the same time
18. Challenge
• NO DOWNTIME
• NO HARMFUL OPERATION
• Change packages easily
• Separated from other components (microservice)
• NO DEGRADATION
• Automatic query test and validation
19. NO DOWNTIME
• Hadoop cluster Blue-Green deployment
• Reliable queue system separated from Hadoop
→ PerfectQueue
• Reliable storage system separated from Hadoop
→ Plazma
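The blue-green idea on this slide can be sketched in a few lines. This is a toy model only, not Treasure Data's implementation: the class `BlueGreenRouter`, its method names, and the cluster labels are all hypothetical. It assumes jobs live in a queue outside Hadoop, so switching clusters only changes where newly pulled jobs run while the old cluster drains.

```python
class BlueGreenRouter:
    """Toy model of a blue-green cutover between two Hadoop clusters (hypothetical)."""

    def __init__(self, active="blue", standby="green"):
        self.active = active
        self.standby = standby
        # Job ids currently running on each cluster
        self.running = {active: set(), standby: set()}

    def start(self, job_id):
        """Run a job pulled from the external queue on the active cluster."""
        self.running[self.active].add(job_id)
        return self.active

    def finish(self, job_id):
        """Mark a job as done on whichever cluster was running it."""
        for jobs in self.running.values():
            jobs.discard(job_id)

    def switch(self):
        """Send new jobs to the other cluster; running jobs are left to drain."""
        self.active, self.standby = self.standby, self.active

    def can_retire(self, cluster):
        """A cluster may be shut down once it is inactive and has drained."""
        return cluster != self.active and not self.running[cluster]
```

Because the queue and the storage sit outside the clusters in this model, the old cluster holds no state that must be migrated; it only needs to finish what it already started.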
20. PerfectQueue
• Distributed queue built on top of RDBMS
• At-least-once semantics
• Graceful and live restarting
• State consistency by transaction
• https://github.com/treasure-data/perfectqueue
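To make "queue on top of an RDBMS with at-least-once semantics" concrete, here is a minimal sketch using SQLite from the Python standard library. It is not PerfectQueue's actual schema or API: the table layout, the class `TinyQueue`, and the lease `timeout` are assumptions for illustration. A task is hidden for a lease period when acquired and becomes visible again if the worker never finishes it, so it is delivered at least once.

```python
import sqlite3
import time


class TinyQueue:
    """Toy RDBMS-backed queue with at-least-once delivery (hypothetical schema)."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            " id TEXT PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " visible_at REAL NOT NULL)"  # acquirable once now >= visible_at
        )
        self.conn.commit()

    def submit(self, task_id, payload, now=None):
        now = time.time() if now is None else now
        with self.conn:  # commits on success, rolls back on error
            # The primary key makes resubmitting the same id a no-op
            self.conn.execute(
                "INSERT OR IGNORE INTO tasks VALUES (?, ?, ?)",
                (task_id, payload, now),
            )

    def acquire(self, timeout=300.0, now=None):
        """Lease the oldest visible task; returns (id, payload) or None."""
        now = time.time() if now is None else now
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM tasks WHERE visible_at <= ?"
                " ORDER BY visible_at, id LIMIT 1",
                (now,),
            ).fetchone()
            if row is None:
                return None
            # Hide the task until the lease expires; the WHERE clause
            # guards against another worker having taken it meanwhile.
            cur = self.conn.execute(
                "UPDATE tasks SET visible_at = ? WHERE id = ? AND visible_at <= ?",
                (now + timeout, row[0], now),
            )
            return row if cur.rowcount == 1 else None

    def finish(self, task_id):
        """Delete the task; only now does it stop being redelivered."""
        with self.conn:
            self.conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
```

A worker that crashes after `acquire` simply never calls `finish`, and the task reappears once the lease expires. The price of at-least-once delivery is that jobs should be safe to run twice.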
21. Plazma
• Distributed cloud-based storage
• PostgreSQL + S3/Riak CS
• Enable time-index push down for Hive/Pig/Presto
• Column-oriented IO (mpc1)
• Data consistency with transactional API
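Time-index push down can be illustrated as partition pruning. The sketch below is an assumption-laden toy, not Plazma's design: it supposes each stored file has a metadata row holding its minimum and maximum record time, and the names `Partition` and `prune_by_time` are hypothetical. A query with a time range then reads only the files that overlap it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Partition:
    path: str      # hypothetical object key in S3/Riak CS
    min_time: int  # earliest record timestamp (unix seconds), inclusive
    max_time: int  # latest record timestamp (unix seconds), inclusive


def prune_by_time(partitions: List[Partition], start: int, end: int) -> List[Partition]:
    """Keep only partitions overlapping the half-open range [start, end)."""
    return [p for p in partitions if p.max_time >= start and p.min_time < end]


# Three one-hour partitions; a query for the second hour touches one file.
index = [
    Partition("a", 0, 3599),
    Partition("b", 3600, 7199),
    Partition("c", 7200, 10799),
]
selected = prune_by_time(index, 3600, 7200)
```

The filter runs against metadata only, so the pruned files are never fetched from object storage.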
29. NO HARMFUL OPS
• Automatic package version upgrades
• Chef server specifies the version
• Hadoop package repository
• S3 remote package repository
• Hadoop as a REST service
• elephant-server
38. NO DEGRADATION
• Validation of
• Parameter differences
• Query result differences
• Performance deterioration
• Automatic testing and persistent result tables
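The three validations listed here could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual test harness: the function names and the 20% runtime tolerance are hypothetical, and result sets are compared as unordered multisets of rows.

```python
from collections import Counter


def diff_params(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if old.get(k) != new.get(k)
    }


def same_result(rows_a, rows_b):
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))


def is_degraded(old_seconds, new_seconds, tolerance=0.2):
    """Flag a query whose runtime grew by more than `tolerance` (assumed 20%)."""
    return new_seconds > old_seconds * (1.0 + tolerance)


# Example: the same query run on the old and the new package version.
v1_params = {"hive.execution.engine": "mr", "hive.exec.parallel": "true"}
v2_params = {"hive.execution.engine": "tez", "hive.exec.parallel": "true"}
changed = diff_params(v1_params, v2_params)
```

Row order is ignored because a query without ORDER BY may legitimately return rows in a different order on a different engine.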
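43. [Diagram: automatic validation flow]
• App → request (REST) → elephant-server; jobs are pulled through PQ (PerfectQueue)
• Jobs are submitted to clusters v1 and v2
• 1. Upload parameters and configurations (S3)
• 2. Upload query result (Plazma)
• 3. Send metrics
• Verification between persistent result sets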
44. Resource management
• Define one resource per account
• Workload type of an account varies
• Batch, ad hoc, BI tool…
• Requires high-level resource management across clusters
• An account can have multiple resource pools
• For service and internal purpose
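One way to picture per-account resource pools that span clusters is the data structure below. It is a sketch only: `ResourcePool`, `Account`, `resolve`, the `max_slots` unit, and the queue names are all hypothetical. The point is that a pool is defined once per account, independently of any cluster, and then applied to every cluster the account's jobs may run on.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ResourcePool:
    name: str       # purpose of the pool, e.g. "service" or "internal"
    queue: str      # job queue this pool feeds (hypothetical name)
    max_slots: int  # hypothetical capacity unit


@dataclass
class Account:
    account_id: int
    pools: Dict[str, ResourcePool] = field(default_factory=dict)

    def add_pool(self, pool: ResourcePool) -> None:
        """An account can hold several pools, keyed by purpose."""
        self.pools[pool.name] = pool


def resolve(account: Account, purpose: str, clusters: List[str]) -> Dict[str, Tuple[str, int]]:
    """Return {cluster: (queue, max_slots)}: one definition applied to all clusters."""
    pool = account.pools[purpose]
    return {cluster: (pool.queue, pool.max_slots) for cluster in clusters}


acct = Account(account_id=1)
acct.add_pool(ResourcePool("service", "hive-batch", 20))
acct.add_pool(ResourcePool("internal", "hive-adhoc", 5))
plan = resolve(acct, "internal", ["blue", "green"])
```

Keeping the definition outside any single cluster means a blue-green switch does not change what an account is entitled to.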