PLAZMA TD Tech Talk 2018 at Shibuya: Hive2 as a new TD Hadoop core engine
1. Copyright 1995-2018 Arm Limited (or its affiliates). All rights reserved.
Oct 17 2018, Ryu Kobayashi
PLAZMA TD Tech Talk 2018 at Shibuya
Hive2 as a new TD Hadoop core engine
Agenda
- PTD Hive
- Our storage is PlazmaDB
- Vectorization supported by default
- Test
- Next plan
Ryu Kobayashi
Software engineer on the Hadoop team
• Backend team -> Hadoop team -> MPP (Presto) team -> Hadoop team
• Hadoop usage history: about 10 years
– Background:
PTD Hive
• PTD = Patch set by Treasure Data
• Our Hadoop and Hive History
– CDH3 -> CDH4 -> HDP2 -> Apache Hadoop and Hive
• Why did we drop the distribution?
– We fix bugs ourselves
▪ But the fixes are not merged soon (Hive): e.g. HIVE-11353
– A distribution depends on specific versions
▪ The scope of testing becomes wider
PTD Hive
• The PTD project started in 2015
– Version at that time: Hive 2.1.0
– Currently supported version: Hive 2.3.2
• Why only from 2.1.0 to 2.3.2 between 2015 and 2018?
– See the self-introduction
– So, we restarted in 2018
• We have fixed many bugs in 2.3.2 as well
PTD Hive
• We also apply internal patches besides these:
– INSERT INTO/OVERWRITE
▪ Why?
– Our storage is PlazmaDB
– The storage is not HDFS
– So, output must be written to PlazmaDB
• Bugs of our own may occur
– Investigation is hard
▪ Is it our own bug or Hive itself?
Our storage is PlazmaDB
PlazmaDB
• What is PlazmaDB?
– Columnar compressed storage
• PlazmaDB’s contents
– plazmadb
– plazmadb-mpcfile
▪ What is mpcfile?
– A proprietary format that compresses MessagePack data
PlazmaDB
• We do not use HDFS for storage (but we do use it for intermediate files)
– Advantage: upgrading the Hadoop version is easy
• The internal PlazmaDB libraries were upgraded starting with Hive2
– Old:
▪ plazmadb
▪ plazmadb-mpcfile
▪ td-storage
▪ msgpack (0.6)
– New:
▪ plazmadb
▪ partition-manager
▪ msgpack (0.8)
Vectorization supported by default
Vectorization supported by default
• Our current Hive 0.13 does not support vectorization
– Because there are many bugs
• Since the bugs have been fixed as of Hive2, it is supported by default
– There are some problems internally
▪ Schema type problems: READ and WRITE
Vectorization supported by default
• Performance?
– About 2 times faster than our legacy Hive
▪ Vectorization
▪ New Storage Library
• The remaining challenges
– Vectorization support for our UDFs
▪ Mainly time-related
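The difference between row-at-a-time and vectorized execution for a time-related UDF can be sketched roughly as follows. This is only an illustration in Python, not Hive's Java implementation. `format_day` is a hypothetical UDF invented for the example, and the batch size of 1024 mirrors the default size of Hive's `VectorizedRowBatch`.

```python
from datetime import datetime, timezone

BATCH_SIZE = 1024  # mirrors the default VectorizedRowBatch size in Hive


def format_day(epoch_seconds):
    """Hypothetical time UDF: epoch seconds -> 'YYYY-MM-DD' in UTC."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


def row_mode(rows):
    """Row-at-a-time: the UDF is dispatched once per row object."""
    out = []
    for row in rows:
        value = row["time"]
        out.append(None if value is None else format_day(value))
    return out


def batches(column, size=BATCH_SIZE):
    """Split one column into fixed-size slices."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def format_day_batch(time_column):
    """Vectorized form: one call handles a whole column slice, nulls included."""
    out = [None] * len(time_column)
    for i, value in enumerate(time_column):
        if value is not None:
            out[i] = format_day(value)
    return out


def vectorized_mode(time_column, size=BATCH_SIZE):
    """Run the batch UDF over every slice of the column and concatenate."""
    result = []
    for batch in batches(time_column, size):
        result.extend(format_day_batch(batch))
    return result
```

Both modes must return the same values, nulls included. That equivalence is what a UDF has to preserve when it gains a vectorized form.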
Test
• How do we test?
– system-test
▪ scheduled run
– Hive 0.13 and Hive2
– elephant-testing
▪ scheduled run
– Registers queries that have been problematic so far
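Running the same query on Hive 0.13 and Hive2 only helps if the two result sets are then compared. The sketch below is one possible way to do that, not the actual system-test or elephant-testing code. `diff_results` is a hypothetical helper. It assumes row order does not matter, and it treats values of different types (for example `1` and `1.0`, or a null and an empty string) as different.

```python
from collections import Counter


def _typed(row):
    """Tag every value with its type name so that 1 and 1.0 do not compare equal."""
    return tuple((type(value).__name__, value) for value in row)


def diff_results(legacy_rows, hive2_rows):
    """Return (rows only in the legacy result, rows only in the Hive2 result).

    Row order is ignored; duplicate rows are counted.
    """
    legacy = Counter(_typed(row) for row in legacy_rows)
    hive2 = Counter(_typed(row) for row in hive2_rows)

    def untyped(counter):
        rows = [tuple(value for _, value in row) for row in counter.elements()]
        return sorted(rows, key=repr)

    return untyped(legacy - hive2), untyped(hive2 - legacy)
```

An empty pair of lists means the two engines agree on that query. Anything else is a candidate for the list of registered problem queries.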
Test
• What kind of problems happened?
– The results differ
▪ Schema type problem
– Null
– Decimal point
▪ This also affects INSERT INTO/OVERWRITE
– Specific UDFs do not work
▪ Compatibility between the jars used by Hive and the jars used by us
– Cross joins are not supported by default
▪ Because of the hive.strict.checks.cartesian.product property
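The talk does not say how the cross join restriction is handled. As a sketch only, assuming one wanted to relax the check for a single session and that `hive.mapred.mode` is not forcing strict mode, the property could be set ahead of the query. `render_settings` and `with_settings` are hypothetical helpers written for this example. Only the property name comes from the slide.

```python
def render_settings(settings):
    """Render a dict of Hive properties as SET statements, one per line."""
    lines = []
    for key in sorted(settings):
        value = settings[key]
        if isinstance(value, bool):
            # Hive expects lowercase true/false, not Python's True/False.
            value = "true" if value else "false"
        lines.append(f"SET {key}={value};")
    return "\n".join(lines)


def with_settings(query, settings):
    """Prefix a query with session-level SET statements."""
    prefix = render_settings(settings)
    body = query.strip().rstrip(";") + ";"
    return f"{prefix}\n{body}" if prefix else body


if __name__ == "__main__":
    print(with_settings(
        "SELECT * FROM a CROSS JOIN b",
        {"hive.strict.checks.cartesian.product": False},
    ))
```

Setting the property per query keeps the strict default for everything else.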
Next plan
• Alpha release next month
• Beta and stable releases next year
• Our new PlazmaDB
– CBO support
• Tez support
– Last time was 2015...
▪ 0.8.4 -> 0.9 (currently 0.9.1)
• Hive3 support