The document discusses HDInsight cluster architecture and configuration. It describes how HDInsight clusters connect to Azure data stores like Azure Blob Storage and Azure Data Lake Store. It also discusses using Azure Data Factory for HDInsight orchestration and monitoring an HDInsight cluster.
9. To grow your standard storage accounts past the advertised limits on capacity,
ingress/egress, and request rate, make a request through Azure Support.
24. [Diagram: multiple data sources (Source 1, Source 2, Source 3, … Source n) are queried through an HDInsight cluster, which returns the corresponding result sets (Result Set 1, Result Set 2, Result Set 3, … Result Set n).]
25.
26.
27. The HDInsight cluster has been scaled down to very few nodes, so the node count is below or
close to the HDFS replication factor and the NameNode can get stuck in safe mode. Diagnose with:
hdfs dfsadmin -D "fs.default.name=hdfs://mycluster/" -report
hdfs fsck -D "fs.default.name=hdfs://mycluster/" /
Fix the issue by leaving safe mode:
hdfs dfsadmin -D "fs.default.name=hdfs://mycluster/" -safemode leave
28.
29.
30.
31. Capability                     | Interactive Query          | Spark SQL           | Presto
    Interactive Query Speed        | High                       | High                | Medium
    Scale                          | High                       | High                | Low
    Caching                        | Yes                        | Yes                 | Early Support
    Intelligent Cache Eviction     | Yes                        | No                  | No
    Complex Fact-to-Fact Joins     | Yes                        | Yes                 | No
    Transactions                   | Yes                        | No                  | No
    Query Concurrency              | High                       | Low                 | Low
    Row/Column-Level Security      | Yes [Apache Ranger + AAD]  | High                | Medium
    Rich End-User Tools            | Yes                        | Yes                 | Yes
    Language Support               | SQL, UDF                   | SQL, Scala, Python  | SQL
    Data Source Connector Support  | Storage Handlers           | Data Sources        | High number of connectors
32. • LLAP, Spark, and Presto tested against 1 TB of data derived from the TPC-DS benchmark
• Out-of-the-box HDInsight configuration
• 45 queries derived from the TPC-DS benchmark that ran successfully on all engines
33.
34.
35. • We used a number of different concurrency levels to test concurrency performance
• 99 queries on 1 TB of data, on a 32-worker-node cluster with max concurrency set to 32.
Test 1: Run all 99 queries, 1 at a time - Concurrency = 1
Test 2: Run all 99 queries, 2 at a time - Concurrency = 2
Test 3: Run all 99 queries, 4 at a time - Concurrency = 4
Test 4: Run all 99 queries, 8 at a time - Concurrency = 8
Test 5: Run all 99 queries, 16 at a time - Concurrency = 16
Test 6: Run all 99 queries, 32 at a time - Concurrency = 32
Test 7: Run all 99 queries, 64 at a time - Concurrency = 64
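The sweep above can be sketched as a small shell driver. This is an illustrative sketch, not the harness we used: `run_query` is a placeholder for submitting one derived query (e.g. via beeline, spark-sql, or presto-cli), and the batch-and-wait loop is a simplification of true sliding-window concurrency.

```shell
# Placeholder for submitting one TPC-DS-derived query to the engine under test.
run_query() {
  echo "query $1 done"
}

# Run all 99 queries with at most $1 in flight at a time.
run_batch() {
  c=$1
  i=1
  while [ "$i" -le 99 ]; do
    j=0
    # Launch up to $c queries in the background, then wait for the batch.
    while [ "$j" -lt "$c" ] && [ "$i" -le 99 ]; do
      run_query "$i" &
      i=$((i + 1))
      j=$((j + 1))
    done
    wait
  done
}

# Sweep the same concurrency levels as the tests above.
for c in 1 2 4 8 16 32 64; do
  echo "--- concurrency=$c ---"
  run_batch "$c" > /dev/null
done
```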
39. • Use a custom Metastore whenever possible; this helps you separate compute and
metadata
• Start with the S2 tier, which gives you 50 DTUs and 250 GB of storage; you can always
scale the database up if you see bottlenecks
• Ensure that a Metastore created for one HDInsight cluster version is not shared
across different HDInsight cluster versions, because different Hive versions have
different schemas. Example: Hive 1.2 and Hive 2.1 clusters trying to use the same
Metastore.
• Back up your custom Metastore periodically for OOPS recovery and DR needs
• Keep the Metastore and HDInsight cluster in the same region
• Monitor your Metastore for performance and availability with Azure SQL DB
monitoring tools [Azure portal, Azure Log Analytics]
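Provisioning the S2-tier database from the second bullet can be sketched with the Azure CLI. The resource group, server, and database names below are placeholders, not values from the deck:

```shell
# Hypothetical sketch: create an S2-tier Azure SQL database to use as a
# custom Hive metastore (names are placeholders).
az sql db create \
  --resource-group my-rg \
  --server my-sql-server \
  --name hive-metastore \
  --service-objective S2
```

You would then point the cluster at this database as an external metastore when creating the HDInsight cluster.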
40. Export — generate DDL for every database and table, adding an MSCK REPAIR for partitioned tables:
for d in `hive -e "show databases"`; do
  echo "create database $d; use $d;" >> alltables.sql
  for t in `hive --database $d -e "show tables"`; do
    ddl=`hive --database $d -e "show create table $t"`
    echo "$ddl ;" >> alltables.sql
    echo "$ddl" | grep -q "PARTITIONED[[:space:]]*BY" && echo "MSCK REPAIR TABLE $t ;" >> alltables.sql
  done
done
Import — replay the generated DDL on the target cluster:
hive -f alltables.sql
58. HDInsight Log Analytics Architecture
[Diagram: the OMS Agent for Linux runs on the HDInsight nodes (Head, Worker, Zookeeper). FluentD, with the HDInsight plugin and a per-workload config for each of HBase, Spark, Hive/LLAP, Storm, and Kafka, ships logs and metrics to the Log Analytics (OMS) Service.]
The FluentD HDInsight plugin:
1. 'in_tail' plugin for all logs; allows a regexp to create a JSON object
2. Filter for WARN and above for each log type (`grep` filter plugin)
3. Output to the out_oms_api type
4. Exec plugin for metrics
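The four-step pipeline above can be sketched as a FluentD configuration fragment. This is illustrative only, not the shipped HDInsight config: the log path, position file, tag, and field names are placeholder assumptions; only the in_tail source, grep filter, and out_oms_api output type come from the slide.

```
# Hypothetical FluentD sketch of the pipeline described above.
<source>
  @type tail                           # step 1: tail the component logs
  path /var/log/spark/*.log            # placeholder path
  pos_file /var/run/fluentd/spark.pos  # placeholder position file
  tag hdinsight.spark
  <parse>
    @type regexp                       # regexp parse into a JSON object
    expression /^(?<time>\S+ \S+) (?<level>[A-Z]+) (?<message>.*)$/
  </parse>
</source>

<filter hdinsight.**>
  @type grep                           # step 2: keep WARN and above
  <regexp>
    key level
    pattern /WARN|ERROR|FATAL/
  </regexp>
</filter>

<match hdinsight.**>
  @type out_oms_api                    # step 3: ship to Log Analytics (OMS)
</match>
```

Metrics (step 4) would be collected separately via FluentD's exec input plugin.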