IPPON 2020
Accelerate Big Data Analytics
with Managed Cluster Services
Introduction.
Data engineer at Ippon
Technologies, a boutique
consulting firm specializing in
Data, Cloud, DevOps, and Full
Stack Application
Development.
Ippon has deep expertise
across all major cloud
platforms:
★ AWS
★ Azure
★ GCP
Delivers Data Initiative
Architecture consulting from scoping
Project delivery from POC
Robust client portfolio
Led by Peter Choe
We’re hiring!
About Ippon’s Data Practice
Engineer with hands-on
experience with AWS and
Azure across multiple data
projects.
Currently working with
Cassandra (NoSQL) and
AWS compute and analytics
services.
Fun fact: My team won first
place at the 2020 Virginia
Governor's Datathon
I also organize the
Richmond (VA) Data
Engineering Meetup
Sam Portillo
Data Engineer
Agenda.
1. Introduction
2. Architecture and Design
3. Contrast with Similar Services
4. Lessons Learned
5. Q&A
What is a cluster?
Background on Big Data Computing.
❖ Pre 1994 - mainframe era
❖ 1994 to 2010 - cluster era
➢ Identical machines connected to the same network to run
workloads
➢ Price varies from 5-6 figures for hardware alone
❖ 2000s to present - GPU era
➢ GPU processing accelerates specific types of
problems (ML training, gaming graphics, etc.)
➢ Hardware can also vary from 5-6 figures
❖ 2010s to present - Cloud era
➢ Introduces pay-as-you-go IaaS and PaaS, with which firms can run
Big Data applications without buying/managing hardware
Benefits of IaaS.
❖ No longer need to buy/manage
hardware
❖ “Wait minutes not months”
❖ Pay-as-you-go beats financing
hardware in most cases
❖ Scale without physical hardware
limits
❖ User needs to manage OS,
middleware, applications
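The pay-as-you-go comparison comes down to simple arithmetic. Below is a minimal sketch of that break-even calculation. The dollar figures and the helper name `break_even_hours` are hypothetical, chosen only for illustration, and the sketch ignores power, staffing, and depreciation, which a real comparison would need to include.

```python
def break_even_hours(hardware_cost: float, hourly_cloud_rate: float) -> float:
    """Hours of cloud usage at which total cloud spend equals the hardware price."""
    if hourly_cloud_rate <= 0:
        raise ValueError("hourly_cloud_rate must be positive")
    return hardware_cost / hourly_cloud_rate


# Hypothetical figures, not taken from any provider's price list.
hardware_cost = 100_000.0   # assumed six-figure cluster purchase
hourly_cloud_rate = 20.0    # assumed hourly cost of a comparable rented cluster

hours = break_even_hours(hardware_cost, hourly_cloud_rate)
print(f"Break-even after {hours:.0f} cluster-hours")
print(f"That is about {hours / 24:.0f} days of round-the-clock use")
```

A workload that runs only a few hours a day stays below the break-even point for a long time, which is the case where pay-as-you-go tends to win.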
Benefits of PaaS.
❖ Cloud provider manages OS,
database, development tools
❖ Can be cheaper than IaaS
❖ Faster development time because
software comes preconfigured
❖ User needs to manage applications, data,
etc.
IaaS, PaaS, SaaS Overview.
Managed Cluster Services.
❖ PaaS offering that allows cloud customers to easily run Big Data
workloads
❖ Support for open source tools like Spark and Hadoop
❖ Examples are
➢ AWS Elastic MapReduce
➢ Azure HDInsight
➢ Google Cloud Dataproc
Core Benefits of Managed Cluster Services.
❖ Cost savings associated with PaaS
❖ Ease of use for developers
❖ Integration with other cloud services
❖ Elastic
➢ Scales up and down as needed
❖ Flexibility
➢ Users can customize environments to solve a variety of problems
Use Cases.
❖ ETL Pipelines
➢ Build reliable data pipelines with Spark
❖ Big data migration
➢ Utilize the power of a cluster to transfer data in a distributed way
❖ Interactive analytics
➢ Quickly ingest terabytes of data to do initial analysis
❖ Machine learning
➢ Train ML models with TensorFlow or Spark MLlib
Takeaway.
❖ Managed cluster services can be versatile and used in different workloads
❖ They bring the benefits that cloud services are typically known for, such as scalability and cost savings
❖ A good understanding of architecture principles will help an engineer
support a variety of projects
Architecture and Design
Architecture in AWS EMR.
❖ Each node is an EC2 instance
❖ Master node
➢ Responsible for coordinating applications
among the cluster
➢ Runs driver component for Spark apps
➢ SSH access
❖ Core nodes
➢ Do heavy lifting for applications
➢ Store data for Hadoop Distributed File
System
How distributed computing speeds up workloads.
❖ Workloads get distributed among a cluster
❖ Hadoop MapReduce
❖ Heavily relies on disk; stores data back to HDFS after each operation
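To make the map and reduce phases concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It is not Hadoop's API: the function names and the input lines are assumptions for illustration, and on a real cluster the map calls would run on different nodes, with intermediate results written to disk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; on a cluster each line could go to a different mapper.
lines = ["deer bear river", "car car river", "deer car bear"]
print(word_count(lines))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across as many machines as the data requires.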
How distributed computing speeds up workloads.
❖ Apache Spark uses a directed acyclic graph (DAG) to plan workloads
❖ Advantages over MapReduce:
➢ Utilizes RAM and caches datasets between map-reduce jobs
➢ Doesn’t need to write to disk between map-reduce jobs
➢ Rich API that allows for dataset transformations
➢ Potentially up to 100 times faster than Hadoop MapReduce for in-memory workloads
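The idea of planning work lazily and caching an intermediate dataset in memory can be sketched in plain Python. This is a toy model, not Spark's API: the `LazyDataset` class and its methods are hypothetical names that only mimic the behavior described above, where transformations record a plan and nothing is computed until a result is requested.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: records a plan, computes only on demand."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source        # input rows; only set on the root node
        self._parent = parent        # upstream node in the plan (an edge of the DAG)
        self._fn = fn                # transformation applied to the parent's rows
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0       # how many times this node was actually evaluated

    def map(self, f):
        """Record a per-row transformation without running it."""
        return LazyDataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        """Record a row filter without running it."""
        return LazyDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        """Ask for this node's rows to be kept in memory after first evaluation."""
        self._cache_enabled = True
        return self

    def collect(self):
        """Action: walk the plan upstream and compute the rows."""
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent.collect())
        if self._cache_enabled:
            self._cached = rows
        return rows


numbers = LazyDataset(source=range(10))
evens = numbers.filter(lambda n: n % 2 == 0).cache()   # nothing runs yet
squares = evens.map(lambda n: n * n)                   # still only a plan

total = sum(evens.collect())    # first action: evaluates and caches `evens`
print(squares.collect(), total) # second action: reuses the cached rows
print(evens.compute_count)      # `evens` was evaluated once, not twice
```

Without the `cache()` call, the second action would recompute the filter from the source, which is roughly the cost MapReduce pays when it goes back to disk between jobs.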
Support for open-source software.
Customizing compute environments.
❖ Support for
➢ Logging
➢ SSH connection
➢ Bootstrap scripts to install custom software/packages
➢ Custom AMIs to achieve anything a bootstrap script can’t
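As a rough illustration of how these options fit together, the sketch below builds the keyword arguments for boto3's EMR `run_job_flow` call: one master node, a group of core nodes, a log location, an SSH key, and a bootstrap script. The helper name `build_cluster_request`, the bucket paths, key name, instance types, and release label are all assumptions for illustration. The role names are the EMR defaults and may differ in a given account.

```python
import json


def build_cluster_request(name, log_uri, bootstrap_script, key_name, core_count=2):
    """Return keyword arguments shaped for boto3's EMR `run_job_flow` call."""
    return {
        "Name": name,
        "LogUri": log_uri,                      # where EMR writes cluster logs
        "ReleaseLabel": "emr-6.2.0",            # assumed release; pick a current one
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {   # coordinates applications and runs the Spark driver
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": 1,
                },
                {   # does the heavy lifting and stores HDFS data
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            "Ec2KeyName": key_name,             # enables SSH to the master node
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {   # script that runs on every node before applications start
                "Name": "install-packages",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and key names.
request = build_cluster_request(
    name="demo-cluster",
    log_uri="s3://example-bucket/emr-logs/",
    bootstrap_script="s3://example-bucket/bootstrap/install.sh",
    key_name="example-key",
)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, `boto3.client("emr").run_job_flow(**request)` would submit this request. The sketch stops short of that so it runs without an AWS account.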
EMR in action.
Contrast with Similar Services
Services to address.
❖ Databricks
❖ ETL tools
❖ Batch services
Databricks.
❖ Databricks is a managed Spark service
➢ Databricks manages compute
➢ Workflow automation and data pipelines
➢ Integrated workspace
❖ A managed cluster service just runs Spark jobs
ETL Tools.
❖ Can have similar ETL functions but managed cluster
services offer more than ETL
❖ ETL on ETL tools is generally much easier to configure, but
less flexible
❖ AWS Glue works on top of a Spark environment
Batch Services.
❖ Batch services are meant to run any batch computing job
at any scale.
❖ They may be able to do some ETL use cases
❖ Not a cluster service; they don’t run Spark or Hadoop
❖ Not an environment for interactive analytics
Takeaway.
❖ Decision to use a managed cluster service heavily depends
on the workload
Lessons Learned
Things to keep in mind.
1. Tuning can be difficult
2. Adopt best practices early
3. Assess what parts of a workload warrant a cluster service
4. Know your data
5. Explore different options before committing to a solution
Connect with me :D
❖ Sam Portillo
❖ https://www.linkedin.com/in/portillosc
❖ Email: sportillo@ippon.fr
Q&A

Editor's notes

  1. Introduce myself and stuff. Talk about rvade maybe, or the cats. Been with Ippon for about a year. Things I like. I’m on the qomplx project and use a lot of AWS services and Cassandra. I like Python.
  2. Skim over this but focus on the about me. Offer: Ippon helps companies deliver data initiatives, from massive workloads (FastData) to massive storage (BigData). Services: architecture consulting from scoping; DataEngineering industrialisation; project delivery from POC. US reference: Ippon is delivering the full data capability of SwissRe in the US, from real-time analysis to legal long-term storage in a secured cloud.
  3. Read off slides. Small demo with architecture and design.
  4. Pre 1994 - supercomputer/mainframe; single big computer. 1994 - 2010 - Linux got popular, and this started with the Beowulf project. Folks started buying commodity hardware with two network cards and distributing workloads. Present - there’s still benefit in on-prem hardware, which is why the GPU era is still relevant. I know CarMax has on-prem resources for their machine learning workloads. Makes sense when you do the quick math between running a workload in the cloud vs buying the hardware (not the same as managing an entire data center). Acknowledge the overlap.
  5. Examples of these are Amazon EC2, Azure VMs, and GCP Compute Engine.
  6. AWS Elastic Beanstalk is a PaaS offered by Amazon Web Services for deploying applications; it orchestrates various AWS services.
  7. Here’s an overview of IaaS, PaaS, and SaaS. I see new as-a-service acronyms each year, but most cloud services fall into these categories. Most cloud certification courses ask questions about these. A SaaS example would be the Microsoft Office 365 suite.
  8. Tell audience we’re focusing on EMR because I’ve used it on a couple projects within the DP. Not a sales pitch for EMR; I’ve just used it too much.
  9. Ease of use: easy for devs to get started with; learning curve is pretty low
  10. Use “emerging problems”. Any questions so far? At its core, EMR is a platform as a service that offers on-demand distributed computing clusters. This service comes with benefits, such as scalability and cost savings, that AWS services are typically known for. AWS also manages the installation of a variety of popular distributed computing and data frameworks like Spark, Hadoop, TensorFlow, and many more. This makes EMR especially versatile, as engineers can work on almost any type of problem without too much trouble switching contexts. With EMR being a cornerstone for many workloads, it’s an incredibly helpful tool to keep in the toolbox. A strong understanding of its architecture principles, coupled with its variety of popular frameworks, makes it a good choice for emerging problems.
  11. Ignore the “mapr node” thing. I stole this image
  12. Most services are phasing out Hadoop verbiage on their product pages, and Apache retired some Hadoop-related projects recently. Explanation of graphic: say we want to count the occurrences of each animal in this input dataset. In the Map phase, each line gets parsed and all of the animal occurrences are extracted. The output of each step of the Map phase is just a list of each animal with a 1 associated with it. Now that we have all of the animal occurrences by line, we want a count of occurrences across the dataset for each animal. In the Reduce phase, we combine all of the same animals from different mappers and "reduce" them to a single animal and a count. Any questions?
  13. Introducing Spark
  14. Go straight to the console
  15. There’s a lot of similar services out there. I’d like to clear up some differences. Bottom line will be that the service you choose depends on your workload.
  16. Spark tuning is really hard; figure out what needs to be done correctly, otherwise you may end up just pulling a bunch of different levers. Whenever working with a new service or framework, get to know the best practices; you do not want to be three weeks into something when you realize you’re following an anti-pattern. There’s a lot of components in an ETL pipeline, but just because one part needs to be done in Spark doesn’t mean the whole thing does. Know your data: omg, self-explanatory. Exhaustively study it to make sure nothing will come in to break your pipeline/business logic. Found hundred-line SQL statements in breach data. Don’t fall into the hype of the latest and greatest service. Extensively research your use case. There’s a lot of ways to get something done; make sure you pick the right tool for the job.
  17. If you like to talk about data/tinker with tools/technology/services OR like to throw it down on the dance floor, connect with me, let’s be friends. I also welcome philosophical debate, whether it’s about material I presented, thoughts on architecture paradigms/strategies, or the future of AWS/data.