Spark on HDInsight
Seattle Spark Meetup on March 9, 2016
Presenters: Judy Nash & Lin Chan
About Us
• Azure HDInsight Service
  • Azure’s answer to big data with open source tech
  • Deploys and manages clusters hosting Hadoop, HBase, Storm, and now Spark
• Our goal: make Spark easy to use on Azure
• How we make it happen
  • Deploy new Spark clusters via SDK and Portal
  • Pre-configure and tune clusters for an optimal experience
  • Adopt open source technologies to enhance Spark workloads
  • Contribute back to open source
About the Talk
• How to Build an Enterprise-ready Spark System
• Deep Dive of HDInsight’s Spark Cluster
  • Cluster Architecture
  • Resource Manager
• End-to-end Workflows
  • Business Intelligence
  • Remote Job Submission
Spark Cluster Architecture
Why YARN?
• Standalone
  • Better UI
  • Less memory overhead
  • Faster application launch time
• YARN
  • Better community support
  • More powerful resource management
  • Shares resources with other job workflows
  • Friendlier to users who already know Hadoop on YARN
Business Intelligence Workflow
Addressing Multi-tenancy
• Fair Scheduler
  • Shares resources between queries within the Thrift server
  • Important for BI customers who share a cluster: prevents one bad query from taking over the cluster
  • To use it, set the default queue type to “fair” scheduling
• Dynamic Allocation
  • Shares resources between the Thrift server and other applications
  • Leaves a minimal footprint for customers who do not use Thrift, yet can expand to the maximum resources allowed when customers run expensive queries
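
The two settings above can be sketched as a set of Spark properties. This is a minimal illustration, not the configuration HDInsight actually ships: it assumes the slide's "fair" setting corresponds to Spark's `spark.scheduler.mode` property, the executor bounds are made-up example values, and the helper names `multi_tenant_conf` and `render_spark_defaults` are hypothetical.

```python
def multi_tenant_conf(min_executors=1, max_executors=10):
    """Return Spark properties for fair scheduling plus dynamic allocation.

    The executor bounds are illustrative defaults, not values from the talk.
    """
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        # Fair scheduling between jobs inside one Spark application,
        # such as queries sharing the Thrift server.
        "spark.scheduler.mode": "FAIR",
        # Let the application grow and shrink its executor count on demand.
        "spark.dynamicAllocation.enabled": "true",
        # External shuffle service, which dynamic allocation on YARN relies on.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def render_spark_defaults(conf):
    """Format properties as 'key value' lines, the spark-defaults.conf layout."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


if __name__ == "__main__":
    print(render_spark_defaults(multi_tenant_conf(min_executors=1, max_executors=20)))
```

A low minimum keeps the idle footprint small, while the maximum caps how far a single expensive query can expand.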
What is Livy?
• REST server allowing remote job submission
• Two modes currently: batch and interactive
• Open source project
• Co-developed with Cloudera
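
To make the batch mode concrete, here is a rough Python sketch that builds (but does not send) the two HTTP requests a batch submission needs. It assumes the `/batches` endpoint, the `file` and `className` fields, and basic authentication as used in the deck's sample calls. The cluster URL is the deck's placeholder, and the function names are hypothetical.

```python
import base64
import json
import urllib.request

# Placeholder cluster endpoint, as used in the deck's sample calls.
LIVY_URL = "https://mysparkcluster.azurehdinsight.net/livy"


def basic_auth_header(user, password):
    """Encode credentials as an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def build_batch_request(jar_path, class_name, user, password, base_url=LIVY_URL):
    """Build the POST request that submits a jar as a Livy batch job."""
    body = json.dumps({"file": jar_path, "className": class_name}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/batches",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": basic_auth_header(user, password),
        },
    )


def build_status_request(batch_id, user, password, base_url=LIVY_URL):
    """Build the GET request that checks the state of a submitted batch."""
    return urllib.request.Request(
        url=f"{base_url}/batches/{batch_id}",
        method="GET",
        headers={"Authorization": basic_auth_header(user, password)},
    )
```

Passing either request to `urllib.request.urlopen` would send it. The sketch stops short of that so it can be inspected without a live cluster.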
Batch Job Submission
Sample Call
• Submit a batch job
curl -k --user "admin:mypassword1!" -v -H 'Content-Type: application/json' -X POST \
  -d '{ "file":"wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/SparkSimpleTest.jar", "className":"com.microsoft.spark.test.SimpleFile" }' \
  "https://mysparkcluster.azurehdinsight.net/livy/batches"
• Check the job status
curl -k --user "admin:mypassword1!" -v -X GET \
  "https://mysparkcluster.azurehdinsight.net/livy/batches/{batchId}"
Interactive session
Sample Call
• Start a Scala interactive session
curl -k --user "admin:mypassword1!" -v -H 'Content-Type: application/json' -X POST \
  -d '{ "kind":"spark" }' \
  "https://mysparkcluster.azurehdinsight.net/livy/sessions"
• Post a statement
curl -k --user "admin:mypassword1!" -v -H 'Content-Type: application/json' -X POST \
  -d '{ "code":"1+1" }' \
  "https://mysparkcluster.azurehdinsight.net/livy/sessions/{sessionId}/statements"
• Check the statement result
curl -k --user "admin:mypassword1!" -v -X GET \
  "https://mysparkcluster.azurehdinsight.net/livy/sessions/{sessionId}/statements"
• Terminate a session
curl -k --user "admin:mypassword1!" -v -X DELETE \
  "https://mysparkcluster.azurehdinsight.net/livy/sessions/{sessionId}"
Integration with Jupyter
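
The slide itself is an image, so how the notebook talks to the cluster is not stated here. Assuming the integration drives Livy's interactive sessions, the lifecycle from the sample calls (start a session, post a statement, check the result, terminate) can be chained roughly as below. The `send` callable is a hypothetical transport supplied by the caller. The `id`, `state`, and `output` fields and the per-statement endpoint follow the Apache Livy REST documentation and may differ between Livy versions.

```python
import time


def run_statement(send, code, kind="spark", poll_interval=1.0, max_polls=60):
    """Run one statement in a fresh Livy session and return its output.

    send(method, path, payload) is a hypothetical transport that performs
    the HTTP call against the Livy endpoint and returns the decoded JSON.
    """
    session_id = send("POST", "/sessions", {"kind": kind})["id"]
    try:
        # Wait for the session to finish starting before posting code.
        for _ in range(max_polls):
            if send("GET", f"/sessions/{session_id}", None)["state"] == "idle":
                break
            time.sleep(poll_interval)
        else:
            raise TimeoutError("session did not become idle")

        statement = send(
            "POST", f"/sessions/{session_id}/statements", {"code": code}
        )
        path = f"/sessions/{session_id}/statements/{statement['id']}"
        # Poll the statement until its result is available.
        for _ in range(max_polls):
            statement = send("GET", path, None)
            if statement["state"] == "available":
                return statement["output"]
            time.sleep(poll_interval)
        raise TimeoutError("statement did not finish")
    finally:
        # Always release cluster resources, even on failure.
        send("DELETE", f"/sessions/{session_id}", None)
```

Injecting the transport keeps the control flow testable without a cluster. The `finally` block means a failed or timed-out statement still terminates the session.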
Livy vs Job Server
• We started with Job Server
• Job Server is not easy to use for simple jar submission or the notebook case
• Job Server is good for embedding Spark work within a bigger app
• Client mode is coming to Livy soon
• The partnership with Cloudera is important
More on Livy
• HDI online documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-livy-rest-interface
• Livy repo: https://github.com/cloudera/livy
More on HDInsight
• HDInsight Blog
  • https://blogs.msdn.microsoft.com/azuredatalake/
• Contact Us
  • Lin Chan https://www.linkedin.com/in/linchanms
  • Judy Nash https://www.linkedin.com/in/judynash

Editor's notes

  1. HDInsight: an Azure service dedicated to deploying and managing clusters that host big data solutions from open source communities.
  2. Key concepts: what each node type does; an introduction to the cluster daemons; mentions of HA, monitoring, and telemetry (topics for future Spark talks).
  3. Talk points: What is business intelligence? Who are the customers? What is Thrift? An open source protocol that handles data transfers between clients and services, similar to SOAP in functionality. The Spark Thrift server creates a Spark SQL application session at launch time, then sends queries to Spark SQL for processing.