SlideShare una empresa de Scribd logo
1 de 38
Descargar para leer sin conexión
Massive Simulations In Spark:
Distributed Monte Carlo For Global
Health Forecasts
Kyle Foreman, PhD
07 June 2016
Simulations in Spark
• Context
• Motivation
• SimBuilder
• Backends
• Benchmarks
• Discussion
2
What is IHME?
3
• Independent research center at the
University of Washington
• Core funding by Bill & Melinda Gates
Foundation and State of Washington
• ~300 faculty, researchers, students and
staff
• Providing rigorous, scientific measurement
o What are the world’s major health problems?
o How well is society addressing these problems?
o How should we best dedicate resources to
improving health?
Goal:
improve health
by providing the best
information
on population health
Global Burden of Disease
• A systematic scientific effort to
quantify the comparative magnitude
of health loss due to diseases,
injuries and risk factors
o 188 countries
o 300+ diseases
o 1990 – 2015+
o 50+ risk factors
4
GBD Collaborative Model
1,100 experts in 106 countries
Death Rate in 2013 (age-standardized)
6
Childhood Death Rate in 2013
7
DALYs in
High
Income
Countries
(2010)
8
DALYs in
Low & Middle
Income
Countries
(2010)
9
Simulations in Spark
• Context
• Motivation
• SimBuilder
• Backends
• Benchmarks
• Discussion
10
Forecasting the Burden of Disease
11
1. Generate a baseline scenario projecting the GBD 25 years into the future
o Mortality, morbidity, population, and risk factors
o Every country, cause, sex, age
2. Create a comprehensive simulation framework to assess alternative
scenarios
o Capture the complex interrelationships between risk factors, interventions, and
diseases to explore the effects of changes to the system
3. Build a modular and flexible platform that can incorporate new models to
answer detailed “what if?” questions
o E.g. introduction of a new vaccine, effects of global warming, scale up of ART
coverage, risk of global pandemics, etc
Simulations
• Takes into account interdependencies between risk factors,
diseases, etc.
o Use draws of model parameters to advance from t to t+1, allowing for
forecasts of other quantities (e.g. risk factors and mortality) to interact
• Modular structure allows for detailed “what ifs”
o E.g. scale up of coverage of a new intervention
• Allows us to incorporate alternative models to test sensitivity to model
choice and specification
12
Basic Simulation Process
1. Fit statistical models (PyMB) capturing most important
relationships as statistical parameters
2. Generate many draws from those parameters, reflecting
their uncertainty and correlation structure
3. Run Monte Carlo simulations to update each quantity
over time, taking into account its dependencies
13
Simulations in Spark
• Context
• Motivation
• SimBuilder
• Backends
• Benchmarks
• Discussion
14
SimBuilder
• Directed Acyclic Graph (DAG) construction and validation
o yaml and sympy specification
o curses interface
o graphviz visualization
• Flexible execution backends
o pandas
o pyspark
15
Example Global Health DAG
16
GDP 2016
Deaths
2016
Births 2016
Population
2016
Workforce
2016
Deaths
2017
Births 2017
Population
2017
Workforce
2017
GDP 2017
Creating a DAG with SimBuilder
1. Create a YAML for the overall simulation
o Specify child models and the dimensions of the simulation
2. Build separate YAML files for each child model
o Dimensions along which the model varies
o Location of historical data
o Expression for how to generate the quantity in t+1
3. Run SimBuilder to validate and explore DAG
17
Creating a DAG with SimBuilder
18
name: gdp_pop
models:
- name: gdp_per_capita
- name: workforce
- name: mortality
- name: gdp
- name: pop
global_parameters:
- name: loc
set: List
values: ["USA", "MEX", "CAN", ..., ]
- name: t
set: Range
start: 2014
end: 2040
- name: draw
set: Range
start: 0
end: 999
	
sim.yaml
19
Creating a DAG with SimBuilder
name: pop
version: 0.1
variables: [loc,t]
history:
- type: csv
path: pop.csv
	
pop.yaml
Creating a DAG with SimBuilder
name: gdp_per_capita	
version: 0.1	
expr: gdp(draw, t, loc)/pop(draw, t-1, loc)	
variables: [draw, t, loc]	
history:	
- type: csv	
path: gdp_pc_w_draw.csv	
	
gdp_per_capita.yaml
	
20
21
Creating a DAG with SimBuilder
name: gdp
version: 0.1
expr: gdp(d,loc,t-1) * exp(
alpha_0(loc, t, draw) +
beta_gdp(draw) * log(gdp_per_capita(draw, loc, t-1)) +
beta_pop(draw, t) * workforce(loc, t)
)
variables: [draw ,loc,t]
history:
- type: csv
path: gdp_w_draw.csv
model_parameters:
- name: alpha_0
variables: [draw, loc, t]
history:
- type: csv
path: gdp/alpha_0.csv
- name: beta_gdp
variables: [d]
history:
- type: csv
path: gdp/beta_gdp.csv
- name: beta_pop
variables: [d]
history:
- type: csv
path: gdp/beta_pop.csv
	
gdp.yaml
DAG
22
gdp{0}
gdp_per_capita{0}
workforce{0}
pop{0}
mortality{0}
gdp_beta_gdp gdp_alpha_0{0} gdp_beta_pop{0}
mortality_beta_pop{0} mortality_alpha_0{0} mortality_beta_gdp
23
gdp{0}
gdp{1}
gdp_per_capita{0}
gdp_per_capita{1}
workforce{0}
mortality{1}
workforce{1}
pop{0}
pop{1}
mortality{0}
gdp_beta_gdpgdp_alpha_0{0}gdp_beta_pop{0}
gdp_alpha_0{1} gdp_beta_pop{1}
mortality_beta_pop{0} mortality_alpha_0{0} mortality_beta_gdp
mortality_beta_pop{1} mortality_alpha_0{1}
24
gdp{0}
gdp{1}
gdp_per_capita{0}
gdp{2}
gdp_per_capita{1}
gdp{3}
gdp_per_capita{2}
gdp{4}
gdp_per_capita{3}
gdp_per_capita{4}
workforce{0}
mortality{1}
workforce{1}
mortality{2}
workforce{2}
mortality{3}
workforce{3}
mortality{4}
workforce{4}
pop{0}
pop{1}
pop{2}
pop{3}
pop{4}
mortality{0}
gdp_beta_gdpgdp_alpha_0{0} gdp_beta_pop{0}
gdp_alpha_0{1} gdp_beta_pop{1}
gdp_alpha_0{2}gdp_beta_pop{2}
gdp_alpha_0{3} gdp_beta_pop{3}
gdp_alpha_0{4}gdp_beta_pop{4}
mortality_beta_pop{0}mortality_alpha_0{0}mortality_beta_gdp
mortality_beta_pop{1}mortality_alpha_0{1}
mortality_beta_pop{2} mortality_alpha_0{2}
mortality_beta_pop{3} mortality_alpha_0{3}
mortality_beta_pop{4} mortality_alpha_0{4}
Simulations in Spark
• Context
• Motivation
• SimBuilder
• Backends
• Benchmarks
• Discussion
25
SimBuilder Backends
• SimBuilder traverses the DAG
o Backends plugin via API
• Backends provide
o Data loading strategy
o Methods to combine data from different nodes to
calculate new nodes
o Method to save computed data
26
Local Pandas Backend
• Useful for debugging and small simulations
• Master process maintains list of Model objects:
o pandas DataFrames of all model data
─ Indexed by global variables
o sympy expression for how to calculate t+1
─ Vectorized when possible, fallback to row-wise apply
• Simulating over time traverses DAG to join necessary
DataFrames and compute results to fill in new columns
27
Spark DataFrame Backend
• Similar to local backend, swapping in Spark’s DataFrame
for Pandas’
• Spark context maintains list of Model objects:
o pyspark DataFrames of all model data
─ Columns for each global variable and
o sympy expression for how to calculate t+1
─ Calculated using row-wise apply
• Joins DataFrames where necessary
28
Spark + Numpy Backend
• Uses Spark to distribute and schedule the execution of
vectorized numpy operations
• One RDD for each node and year:
o Containing an ndarray and index vector for each axis
o Can optionally partition the data array into key-value pairs
along one or more axes (similar to bolt)
• To calculate a new node, all the parent RDD’s are
unioned and then reduced into a new child RDD
29
Simulations in Spark
• Context
• Motivation
• SimBuilder
• Backends
• Benchmarks
• Discussion
30
Benchmarking
• Single executor (8 cores, 32GB)
• Synthetic data
o 65 nodes in DAG
o 3 countries x 20 ages x 2 sexes x N draws
o Forecasted forward 25 years
31
Benchmarks
32
Benchmarks
33
Simulations in Spark
• Context
• Motivation
• SimBuilder
• Backends
• Benchmarks
• Discussion
34
Limitations of Spark DataFrame
• Majority of execution time is spent joining the DataFrame
and aligning the dimension columns
• These times were achieved after careful tuning of
partitions
o Perhaps custom partitioners could reduce runtime further
35
Taking Advantage of Numpy
• For multidimensional datasets of consistent shape and
size, simply aligning indices can eliminate the overhead
of joins
o Including generalizations to axes of size 1 to simplify coding
• Numpy’s vectorized operations are highly tuned and
scale well
o Including multithreading when e.g. compiled against MKL
36
Future Development
• Experiment with better partitioning of Numpy RDDs
o Perhaps use bolt?
• Improve partial graph execution
o Executing the entire DAG at once often crashes the driver
o Executing each year’s partial DAG in sequence results in idle CPU
time towards the end of that DAG
• Investigate more efficient Spark DataFrame joining
methods for Panel data
37
Team
Kyle Heuton
krheuton@uw.edu
Chun-Wei Yuan
cwyuan@uw.edu
Kyle Foreman
kfor@uw.edu
healthdata.org

Más contenido relacionado

La actualidad más candente

[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기Chris Ohk
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsAmy Hodler
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorialGanesh Venkataraman
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeDatabricks
 
개발자도 알아야 하는 DBMS튜닝
개발자도 알아야 하는 DBMS튜닝개발자도 알아야 하는 DBMS튜닝
개발자도 알아야 하는 DBMS튜닝정해 이
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedDatabricks
 
Large scale gpu cluster for ai
Large scale gpu cluster for aiLarge scale gpu cluster for ai
Large scale gpu cluster for aiKyunam Cho
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonIgor Anishchenko
 
Next-generation MMORPG service architecture
Next-generation MMORPG service architectureNext-generation MMORPG service architecture
Next-generation MMORPG service architectureJongwon Kim
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache SparkDatabricks
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 [데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 choi kyumin
 
[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례
[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례
[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례Hwanhee Kim
 
Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...TigerGraph
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsShubham Tagra
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiLev Brailovskiy
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...StreamNative
 

La actualidad más candente (20)

[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기[RLKorea] <하스스톤> 강화학습 환경 개발기
[RLKorea] <하스스톤> 강화학습 환경 개발기
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problems
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorial
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
개발자도 알아야 하는 DBMS튜닝
개발자도 알아야 하는 DBMS튜닝개발자도 알아야 하는 DBMS튜닝
개발자도 알아야 하는 DBMS튜닝
 
Photon Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think VectorizedPhoton Technical Deep Dive: How to Think Vectorized
Photon Technical Deep Dive: How to Think Vectorized
 
Large scale gpu cluster for ai
Large scale gpu cluster for aiLarge scale gpu cluster for ai
Large scale gpu cluster for ai
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Next-generation MMORPG service architecture
Next-generation MMORPG service architectureNext-generation MMORPG service architecture
Next-generation MMORPG service architecture
 
Geospatial Options in Apache Spark
Geospatial Options in Apache SparkGeospatial Options in Apache Spark
Geospatial Options in Apache Spark
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 [데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
 
[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례
[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례
[NDC2017] 딥러닝으로 게임 콘텐츠 제작하기 - VAE를 이용한 콘텐츠 생성 기법 연구 사례
 
Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analysts
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
 

Destacado

GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on MesosJen Aman
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkJen Aman
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics Jen Aman
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkJen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Spark Summit
 
Estimating Financial Risk with Apache Spark
Estimating Financial Risk with Apache SparkEstimating Financial Risk with Apache Spark
Estimating Financial Risk with Apache SparkCloudera, Inc.
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Morticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsMorticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsSpark Summit
 

Destacado (20)

GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Deploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using SparkDeploying Accelerators At Datacenter Scale Using Spark
Deploying Accelerators At Datacenter Scale Using Spark
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Big Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the CloudBig Data in Production: Lessons from Running in the Cloud
Big Data in Production: Lessons from Running in the Cloud
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
Scaling Unsupervised Ciliary Motion Analysis for Actionable Biomedical Insigh...
 
Estimating Financial Risk with Apache Spark
Estimating Financial Risk with Apache SparkEstimating Financial Risk with Apache Spark
Estimating Financial Risk with Apache Spark
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Morticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark WorkflowsMorticia: Visualizing And Debugging Complex Spark Workflows
Morticia: Visualizing And Debugging Complex Spark Workflows
 

Similar a Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisJason Riedy
 
Numerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parametersNumerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parametersMilan Milošević
 
Graphs and Financial Services Analytics
Graphs and Financial Services AnalyticsGraphs and Financial Services Analytics
Graphs and Financial Services AnalyticsNeo4j
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handoutYi-Shin Chen
 
DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingBigML, Inc
 
IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079ibankuk
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)Toshiyuki Shimono
 
Carbon dioxide emissions from cars of different brands.
Carbon dioxide emissions from cars of different brands.Carbon dioxide emissions from cars of different brands.
Carbon dioxide emissions from cars of different brands.Rohan386712
 
Euro30 2019 - Benchmarking tree approaches on street data
Euro30 2019 - Benchmarking tree approaches on street dataEuro30 2019 - Benchmarking tree approaches on street data
Euro30 2019 - Benchmarking tree approaches on street dataFabion Kauker
 

Similar a Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts (20)

ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph AnalysisICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis
 
19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
 
ONS local presents clustering
ONS local presents clusteringONS local presents clustering
ONS local presents clustering
 
[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊[系列活動] 資料探勘速遊
[系列活動] 資料探勘速遊
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Hadoop PDF
Hadoop PDFHadoop PDF
Hadoop PDF
 
Numerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parametersNumerical inflation: simulation of observational parameters
Numerical inflation: simulation of observational parameters
 
PCL (Point Cloud Library)
PCL (Point Cloud Library)PCL (Point Cloud Library)
PCL (Point Cloud Library)
 
Skillwise Big data
Skillwise Big dataSkillwise Big data
Skillwise Big data
 
Graphs and Financial Services Analytics
Graphs and Financial Services AnalyticsGraphs and Financial Services Analytics
Graphs and Financial Services Analytics
 
Quick tour all handout
Quick tour all handoutQuick tour all handout
Quick tour all handout
 
DutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision MakingDutchMLSchool. Automating Decision Making
DutchMLSchool. Automating Decision Making
 
IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079IBANK - Big data www.ibank.uk.com 07474222079
IBANK - Big data www.ibank.uk.com 07474222079
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)A Hacking Toolset for Big Tabular Files (3)
A Hacking Toolset for Big Tabular Files (3)
 
Big data
Big dataBig data
Big data
 
Carbon dioxide emissions from cars of different brands.
Carbon dioxide emissions from cars of different brands.Carbon dioxide emissions from cars of different brands.
Carbon dioxide emissions from cars of different brands.
 
Euro30 2019 - Benchmarking tree approaches on street data
Euro30 2019 - Benchmarking tree approaches on street dataEuro30 2019 - Benchmarking tree approaches on street data
Euro30 2019 - Benchmarking tree approaches on street data
 

Más de Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Jen Aman
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningJen Aman
 

Más de Jen Aman (15)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 

Último

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 

Último (20)

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 

Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts

  • 1. Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forecasts Kyle Foreman, PhD 07 June 2016
  • 2. Simulations in Spark • Context • Motivation • SimBuilder • Backends • Benchmarks • Discussion 2
  • 3. What is IHME? 3 • Independent research center at the University of Washington • Core funding by Bill & Melinda Gates Foundation and State of Washington • ~300 faculty, researchers, students and staff • Providing rigorous, scientific measurement o What are the world’s major health problems? o How well is society addressing these problems? o How should we best dedicate resources to improving health? Goal: improve health by providing the best information on population health
  • 4. Global Burden of Disease • A systematic scientific effort to quantify the comparative magnitude of health loss due to diseases, injuries and risk factors o 188 countries o 300+ diseases o 1990 – 2015+ o 50+ risk factors 4
  • 5. GBD Collaborative Model 1,100 experts in 106 countries
  • 6. Death Rate in 2013 (age-standardized) 6
  • 9. DALYs in Low & Middle Income Countries (2010) 9
  • 10. Simulations in Spark • Context • Motivation • SimBuilder • Backends • Benchmarks • Discussion 10
  • 11. Forecasting the Burden of Disease 11 1. Generate a baseline scenario projecting the GBD 25 years into the future o Mortality, morbidity, population, and risk factors o Every country, cause, sex, age 2. Create a comprehensive simulation framework to assess alternative scenarios o Capture the complex interrelationships between risk factors, interventions, and diseases to explore the effects of changes to the system 3. Build a modular and flexible platform that can incorporate new models to answer detailed “what if?” questions o E.g. introduction of a new vaccine, effects of global warming, scale up of ART coverage, risk of global pandemics, etc
  • 12. Simulations • Takes into account interdependencies between risk factors, diseases, etc. o Use draws of model parameters to advance from t to t+1, allowing for forecasts of other quantities (e.g. risk factors and mortality) to interact • Modular structure allows for detailed “what ifs” o E.g. scale up of coverage of a new intervention • Allows us to incorporate alternative models to test sensitivity to model choice and specification 12
  • 13. Basic Simulation Process 1. Fit statistical models (PyMB) capturing most important relationships as statistical parameters 2. Generate many draws from those parameters, reflecting their uncertainty and correlation structure 3. Run Monte Carlo simulations to update each quantity over time, taking into account its dependencies 13
  • 14. Simulations in Spark • Context • Motivation • SimBuilder • Backends • Benchmarks • Discussion 14
  • 15. SimBuilder • Directed Acyclic Graph (DAG) construction and validation o yaml and sympy specification o curses interface o graphviz visualization • Flexible execution backends o pandas o pyspark 15
  • 16. Example Global Health DAG 16 GDP 2016 Deaths 2016 Births 2016 Population 2016 Workforce 2016 Deaths 2017 Births 2017 Population 2017 Workforce 2017 GDP 2017
  • 17. Creating a DAG with SimBuilder 1. Create a YAML for the overall simulation o Specify child models and the dimensions of the simulation 2. Build separate YAML files for each child model o Dimensions along which the model varies o Location of historical data o Expression for how to generate the quantity in t+1 3. Run SimBuilder to validate and explore DAG 17
  • 18. Creating a DAG with SimBuilder 18 name: gdp_pop models: - name: gdp_per_capita - name: workforce - name: mortality - name: gdp - name: pop global_parameters: - name: loc set: List values: ["USA", "MEX", "CAN", ..., ] - name: t set: Range start: 2014 end: 2040 - name: draw set: Range start: 0 end: 999 sim.yaml
  • 19. 19 Creating a DAG with SimBuilder name: pop version: 0.1 variables: [loc,t] history: - type: csv path: pop.csv pop.yaml
  • 20. Creating a DAG with SimBuilder name: gdp_per_capita version: 0.1 expr: gdp(draw, t, loc)/pop(draw, t-1, loc) variables: [draw, t, loc] history: - type: csv path: gdp_pc_w_draw.csv gdp_per_capita.yaml 20
  • 21. 21 Creating a DAG with SimBuilder name: gdp version: 0.1 expr: gdp(d,loc,t-1) * exp( alpha_0(loc, t, draw) + beta_gdp(draw) * log(gdp_per_capita(draw, loc, t-1)) + beta_pop(draw, t) * workforce(loc, t) ) variables: [draw ,loc,t] history: - type: csv path: gdp_w_draw.csv model_parameters: - name: alpha_0 variables: [draw, loc, t] history: - type: csv path: gdp/alpha_0.csv - name: beta_gdp variables: [d] history: - type: csv path: gdp/beta_gdp.csv - name: beta_pop variables: [d] history: - type: csv path: gdp/beta_pop.csv gdp.yaml
  • 24. 24 gdp{0} gdp{1} gdp_per_capita{0} gdp{2} gdp_per_capita{1} gdp{3} gdp_per_capita{2} gdp{4} gdp_per_capita{3} gdp_per_capita{4} workforce{0} mortality{1} workforce{1} mortality{2} workforce{2} mortality{3} workforce{3} mortality{4} workforce{4} pop{0} pop{1} pop{2} pop{3} pop{4} mortality{0} gdp_beta_gdpgdp_alpha_0{0} gdp_beta_pop{0} gdp_alpha_0{1} gdp_beta_pop{1} gdp_alpha_0{2}gdp_beta_pop{2} gdp_alpha_0{3} gdp_beta_pop{3} gdp_alpha_0{4}gdp_beta_pop{4} mortality_beta_pop{0}mortality_alpha_0{0}mortality_beta_gdp mortality_beta_pop{1}mortality_alpha_0{1} mortality_beta_pop{2} mortality_alpha_0{2} mortality_beta_pop{3} mortality_alpha_0{3} mortality_beta_pop{4} mortality_alpha_0{4}
  • 25. Simulations in Spark • Context • Motivation • SimBuilder • Backends • Benchmarks • Discussion 25
  • 26. SimBuilder Backends • SimBuilder traverses the DAG o Backends plugin via API • Backends provide o Data loading strategy o Methods to combine data from different nodes to calculate new nodes o Method to save computed data 26
  • 27. Local Pandas Backend • Useful for debugging and small simulations • Master process maintains list of Model objects: o pandas DataFrames of all model data ─ Indexed by global variables o sympy expression for how to calculate t+1 ─ Vectorized when possible, fallback to row-wise apply • Simulating over time traverses DAG to join necessary DataFrames and compute results to fill in new columns 27
  • 28. Spark DataFrame Backend • Similar to local backend, swapping in Spark’s DataFrame for Pandas’ • Spark context maintains list of Model objects: o pyspark DataFrames of all model data ─ Columns for each global variable and o sympy expression for how to calculate t+1 ─ Calculated using row-wise apply • Joins DataFrames where necessary 28
  • 29. Spark + Numpy Backend • Uses Spark to distribute and schedule the execution of vectorized numpy operations • One RDD for each node and year: o Containing an ndarray and index vector for each axis o Can optionally partition the data array into key-value pairs along one or more axes (similar to bolt) • To calculate a new node, all the parent RDD’s are unioned and then reduced into a new child RDD 29
  • 30. Simulations in Spark • Context • Motivation • SimBuilder • Backends • Benchmarks • Discussion 30
  • 31. Benchmarking • Single executor (8 cores, 32GB) • Synthetic data o 65 nodes in DAG o 3 countries x 20 ages x 2 sexes x N draws o Forecasted forward 25 years 31
  • 34. Simulations in Spark • Context • Motivation • SimBuilder • Backends • Benchmarks • Discussion 34
  • 35. Limitations of Spark DataFrame • Majority of execution time is spent joining the DataFrame and aligning the dimension columns • These times were achieved after careful tuning of partitions o Perhaps custom partitioners could reduce runtime further 35
  • 36. Taking Advantage of Numpy • For multidimensional datasets of consistent shape and size, simply aligning indices can eliminate the overhead of joins o Including generalizations to axes of size 1 to simplify coding • Numpy’s vectorized operations are highly tuned and scale well o Including multithreading when e.g. compiled against MKL 36
  • 37. Future Development • Experiment with better partitioning of Numpy RDDs o Perhaps use bolt? • Improve partial graph execution o Executing the entire DAG at once often crashes the driver o Executing each year’s partial DAG in sequence results in idle CPU time towards the end of that DAG • Investigate more efficient Spark DataFrame joining methods for Panel data 37