Since HDInsight launched Spark clusters last year, the HDInsight Spark team’s mission has been to make Spark easy to use and production-ready. In the process, we have explored many open source technologies, such as Livy, Jupyter, and Zeppelin. In this talk, we will demo top customer features, take a deep dive into the HDInsight Spark architecture, and share what we learned from building the perfect cluster.
Speakers: Judy Nash and Lin Chan
2. About Us
Azure HDInsight Service
Azure’s answer to big data with open source tech
Deploy and manage clusters hosting Hadoop, HBase, Storm, and now Spark
Our Goal – Make Spark easy to use on Azure
How Do We Make It Happen
Deploy new Spark clusters via SDK and Portal
Pre-configure and tune cluster for optimal experience
Adopt open source technologies to enhance Spark workloads
Contribute back to open source
3. About the Talk
How to Build an Enterprise-ready Spark System
Deep Dive of HDInsight’s Spark Cluster
Cluster Architecture
Resource Manager
End-to-end Workflows
Business Intelligence
Remote Job Submission
5. Why YARN?
Standalone
Better UI
Less memory overhead
Faster application launch time
YARN
Better community support
More powerful resource management
Share resources with other job workflows
Friendlier to users who already know Hadoop on YARN
7. Addressing Multi-tenancy
Fair Scheduler
Allows sharing resources between queries within the Thrift server
Important for BI customers who share a cluster: keeps one bad query from taking over the cluster
To use, set the default queue type to “fair” scheduling
Dynamic Allocation
Allows sharing resources between the Thrift server and other applications
Leaves a minimal footprint for customers who do not use Thrift, but can expand to the maximum resources allowed when customers execute expensive queries
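The two settings on this slide can be sketched as a set of Spark properties. This is a minimal illustration, not the HDInsight configuration itself: the property names are standard open source Spark settings, while the helper names (thrift_server_conf, to_conf_args) and the executor counts are assumptions made for the example. It also assumes that the “fair” setting on the slide corresponds to Spark's in-application scheduler mode.

```python
# Sketch only: property names are standard Spark settings; the helper names
# and the default executor counts are illustrative assumptions.

def thrift_server_conf(min_executors=1, max_executors=10):
    """Return Spark properties for a shared Thrift server (hypothetical helper)."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        # Fair scheduling between jobs (queries) inside one Spark application.
        "spark.scheduler.mode": "FAIR",
        # Let the application grow and shrink its executor count with load.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Dynamic allocation on YARN relies on the external shuffle service.
        "spark.shuffle.service.enabled": "true",
    }


def to_conf_args(conf):
    """Render properties as repeated '--conf key=value' command-line arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_conf_args(thrift_server_conf())))
```

A low minimum keeps the footprint small while the Thrift server is idle, and the maximum caps how far one expensive query can expand.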
8. What is Livy?
REST Server allowing remote job submission
2 modes currently: batch & interactive
Open source project
Co-development with Cloudera
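As a rough sketch of the two modes, the requests below are built against Livy's REST routes (/batches for batch jobs, /sessions and /sessions/{id}/statements for interactive use). The host name, jar path, and class name are placeholders, and the helper names are hypothetical. The requests are only constructed here, not sent; a real cluster would also need authentication.

```python
import json
import urllib.request

# Assumed endpoint: 8998 is Livy's default port; the host name is a placeholder.
LIVY_URL = "http://livy.example.invalid:8998"


def _post(url, body):
    """Build a JSON POST request object without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def batch_request(jar_path, class_name, args=()):
    """Batch mode: submit a packaged jar to run to completion."""
    body = {"file": jar_path, "className": class_name, "args": list(args)}
    return _post(f"{LIVY_URL}/batches", body)


def session_request(kind="spark"):
    """Interactive mode, step 1: open a long-lived session."""
    return _post(f"{LIVY_URL}/sessions", {"kind": kind})


def statement_request(session_id, code):
    """Interactive mode, step 2: run a code snippet inside an open session."""
    return _post(f"{LIVY_URL}/sessions/{session_id}/statements", {"code": code})


if __name__ == "__main__":
    # Hypothetical jar path and class name, for illustration only.
    req = batch_request("/path/to/app.jar", "com.example.Main", ["10"])
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing one of these objects to urllib.request.urlopen would perform the call. Batch mode suits packaged jars, while the session and statement pair is what a notebook would use.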
14. Livy vs Job Server
Had Job Server initially
Job Server is not easy to use for simple jar submission or the notebook case
Job Server is good for embedding Spark work within a bigger app
Client mode is coming to Livy soon
Partnership with Cloudera is important
15. More on Livy
HDI online documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-livy-rest-interface
Livy Repo: https://github.com/cloudera/livy
16. More on HDInsight
HDInsight Blog
https://blogs.msdn.microsoft.com/azuredatalake/
Contact Us
Lin Chan https://www.linkedin.com/in/linchanms
Judy Nash https://www.linkedin.com/in/judynash
Editor’s notes
HDInsight – an Azure service dedicated to hosting big data solutions from open source communities.
Azure service dedicated to deploying and managing clusters that host big data solutions from open source
Key concepts
* What do the node types do
* Introduce cluster daemons
* Mention HA, monitoring, telemetry – future Spark talk topics
Talk Points
What is business intelligence? Who are the customers?
What is Thrift? An open source protocol that handles data transfers between clients and services. Similar to SOAP in functionality.
Spark Thrift server -> at launch time creates a Spark SQL application session -> sends queries to Spark SQL for processing