SlideShare una empresa de Scribd logo
1 de 27
Descargar para leer sin conexión
Experimental Design for
Distributed Machine Learning
Myles Baker (@mydpy)
26 October, 2017
1
PROBLEM 1 PROBLEM 2
Common Modeling Problems
• Business problem is defined,
machine learning is a solution
candidate
• Multiple programming / machine
learning models apply
• Which model has the least error due
to generalization for a given
application?
• Business problem is defined,
machine learning is a solution
candidate
• A machine learning algorithm has
been selected and implemented
• What is the expected generalization
error? Can be the business be
confident in the model results?
22
PROBLEM 3 PROBLEM 4
Common Modeling Problems
• An existing business problem has
been solved with a machine learning
model
• How do I know how variations in the
input will affect the output and
predicted outcomes?
• How will results change after the
model is retrained?
• An existing business problem has
been solved with a machine learning
model
• The model was designed for a
serial-execution system and does
not scale with input set
• A parallel model must be developed
and tested
33
Steps in a Machine Learning Study
4
Formulate
Problem and
Plan the Study
Collect Data &
Insights, &
Define a Model
Conceptual
Model
Valid?
Conceptual Model
Yes
No
Conceptualizing a Distributed Model
• Model capabilities are different between serial and distributed applications
– Some algorithms do not have a distributed implementation
– Understanding computational complexity becomes an increasingly present
limitation
– Solver and optimizer implementations for existing algorithms may be different or not
supported
– Model assumptions may change when migrating a serial model to a distributed
model
• Data characteristics are more challenging to reveal
– Outliers are prevalent but may be incorrectly or poorly modeled
– Missing value compensation can significantly skew results
– Synthetic data can poorly represent actual system
5
6 6
Understanding Scale-out Learning
Steps in a Machine Learning Study
7
Build a
Learning Model
Model Tuning
Learning
Model
Valid?
Learning Model
Yes
Conceptual
Model
No, Redefine Conceptual Model
Steps in a Machine Learning Study
8
Design
Experiments
Evaluate
Learning
Model
Experimentation
Learning
Model
Analyze Output
Data
Present,
Launch,
Monitor,
& Maintain
Subject of this Talk
The goal of experimentation is to
understand the effect of model factors and
obtain conclusions which we can consider
statistically significant
9 9
This is challenging for distributed learning!
How do you design an experiment?
• An Experiment is an evaluation of a model using a combination of
controllable factors that affect the response
• Experiments must be designed correctly using statistical methodology
• An Experiment should be:
– Independent of other responses
– Controlled for variance and uncontrollable error
– Reproducible, especially between model candidates
• Techniques include:
– Measuring Classifier Performance
– Hypothesis Testing
10
CONTROLLABLE UNCONTROLLABLE
Factors that affect model outcomes
• Noise in the data
• Optimization randomness
• Outcomes not observed during
training but part of the system being
modeled (I.e., a rare disease
outcome)
• Learning algorithm
• Input data
• Model parameters
• Model hyperparameters
1111
1212
Generating Responses
O ( LF ) O ( LF ) O ( LF
)
1313
Generating Responses
val p1 = new ParamMap().put(factor1.w(3), factor2.w(1))
val p2 = new ParamMap().put(factor1.w(1), factor2.w(2))
val p3 = new ParamMap().put(factor1.w(4), factor2.w(4))
val factorGrid = new ParamGridBuilder()
.addGrid(factor1, Array(1, 2, 3, 4))
.addGrid(factor2, Array(3))
.build()
val factorGrid = new ParamGridBuilder()
.addGrid(factor1, Array(1, 2, 3, 4))
.addGrid(factor2, Array(1, 2, 3, 4))
.build()
1414
Generating Responses
Train-Validation Split
val tvs =
new TrainValidationSplit()
.setEstimatorParamMaps(factorGrid)
.setEvaluator(new RegressionEvaluator)
.setTrainRatio(r)
• Creates an estimator based on the parameter map or grid
• Randomly splits the input dataset into train and validation sets based on
the training ratio r
• Uses evaluation metric on the validation set to select the best model
val model = tvs.fit(data)
model.bestModel
.extractParamMap
Models support
different
methods /
metrics
1515
Generating Responses
Cross Validator
val cv = new CrossValidator()
.setEstimatorParamMaps(factorGrid)
.setEvaluator(new
BinaryClassificationEvaluator)
.setNumFolds(k)
• Creates k non-overlapping randomly partitioned folds which are used as
separate training and test datasets
• Controls for uncontrollable factors and variance
• The ‘bestModel’ contains the model with the highest average
cross-validation
• Tracks the metrics for each param map evaluated
val model = cv.fit(data)
model.bestModel
.extractParamMap
Models support
different
methods /
metrics
Reducing the Number of Responses
Computational Complexity Limits Practicality of Factorial
Search
• Use a well-formulated conceptual model. This informs factor choice and
reduces unnecessary model iterations
• Normalize factors where possible (i.e., factor is determined by input
as-opposed to arbitrarily chosen by the modeler)
• If your data is large enough, you can split your dataset into multiple parts
for use during cross-validation
16
CLASSIFICATION
• Precision / recall relationships for binary
classification problems
– Receiver Operating Characteristic Curve
• For multi-classification problems:
– Most packages only support 0/1 error functions
– Confusion matrix
• For multilabel classification:
– Again, 0/1 indicator function is only supported
– Measures by label are most appropriate
1717
How do you analyze model output?
REGRESSION
• Linear: RSME = Easy
• Non-linear: … *runs*
– SoftMax
– Cross Entropy
1818
Analyzing Model Output
Binary Classification
Metric Spark Implementation
Receiver Operating Characteristic roc
Area Under Receiver Operating
Characteristic Curve
areaUnderROC
Area Under Precision-Recall Curve areaUnderPR
Measures by Threshold {measure}ByThreshold
val metrics = new
BinaryClassificationMetrics(predictionAndLabels.rdd.map( r =>
(r.getAs[Double]("prediction"), r.getAs[Double]("label"))))
Dataframe of
(prediction,
label)
1919
Threshold Curves
val threshold = metrics.{measure}ByThreshold
.toDF('threshold', 'measure')
display(precision.join({measure}, 'threshold')
.join(recall, 'threshold'))
Optional beta
parameter for
F1 measure
(default = 1)
2020
Analyzing Model Output
Multiclass Classification
Metric Spark Implementation
Confusion Matrix confusionMatrix
Accuracy accuracy
Measures by Label {measure}ByLabel
Weighted Measures weighted{Measure}
val metrics = new
MulticlassMetrics(predictionAndLabels.rdd.map( r =>
(r.getAs[Double]("prediction"), r.getAs[Double]("label"))))
Dataframe of
(prediction,
label)
Usually not a
robust metric
by itself
2121
Confusion Matrix
metrics.confusionMatrix
.toArray
//Display using non-Scala tool
%python
confusion = np.array([[...],[...]])
How do I
know my
model is
correct?
Hypothesis testing describes the significance of the result and
provides a mechanism for assigning confidence to model
selection
22
Why conduct and test a hypothesis?
• 1 - Binomial Test
• 1 - Approximate Normal Test
• 1 - t Test
• 2 - McNemar’s Test
• 2 - K-Fold Cross-Validated Paired t Test
1. What is the likelihood that my model
will make a misclassification error?
This probability is not known!
2. Given two learning algorithms, which
has the lower expected error rate?
2323
Chi-Squared Test
Hypothesis: Outcomes are statistically independent
New in Spark 2.2!
• Conducts Pearson's independence test for every feature against the label
• Chi-squared statistics is computed from (feature, label) pairs
• All label and feature values must be categorical
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
val goodnessOfFitTestResult = Statistics.chiSqTest(labels)
val independenceTestResult = Statistics.chiSqTest(contingencyMatrix)
Chi squared test summary:
method: pearson
degrees of freedom = 4
statistic = 0.12499999999999999
pValue = 0.998126379239318
No presumption against null hypothesis: observed follows the same distribution as
expected..
Nice, but you still
need to do the
hard work of
constructing the
hypothesis and
validating!
2424
McNemar’s Test
Hypothesis: Model 1 and model 2 have the same rate of generalization error
val totalObs = test.count
val conditions = "..."
val p1c =
predictions1.where(conditions).count()
val p1m = totalObs - p1c
val p2c =
predictions2.where(conditions).count()
val p2m = totalObs - p2c
val e00 = p1m + p2m
val e01 = p1m
val e10 = p2m
val e11 = p1c + p2c
25
Analyzing of Variance (ANOVA)
Analysis of Variance is used to compare multiple models. What
is the statistical significance of running model 1 or model 2?
• Currently no techniques for ANOVA directly within MLlib
• Requires calculating statistics manually
• A very useful technique in-practice despite the manual work needed
Thank you
myles.baker@databricks.com
26
References
• Design and Analysis of Machine Learning, Chapter 19 of Introduction to Machine Learning, by Ethem
Alpaydın
• No free lunch in search and optimization
• Moral Machine (MIT)
• Design and Analysis of Classifier Learning Experiments in Bioinformatics: Survey and Case Studies
• Simulation Modeling and Analysis, 3rd Edition, Averill Law and David Kelton
• Spark Scala API documentation, source code
27

Más contenido relacionado

La actualidad más candente

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Databricks
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingDatabricks
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingDatabricks
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsDatabricks
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSpark Summit
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Spark Summit
 
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...Databricks
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Spark Summit
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowDatabricks
 
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Databricks
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Databricks
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
 
Distributed processing of large graphs in python
Distributed processing of large graphs in pythonDistributed processing of large graphs in python
Distributed processing of large graphs in pythonJose Quesada (hiring)
 
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Spark Summit
 

La actualidad más candente (20)

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time Bidding
 
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
Automated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and TrackingAutomated Hyperparameter Tuning, Scaling and Tracking
Automated Hyperparameter Tuning, Scaling and Tracking
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Predicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data AnalyticsPredicting Optimal Parallelism for Data Analytics
Predicting Optimal Parallelism for Data Analytics
 
Sparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya HristakevaSparking Science up with Research Recommendations by Maya Hristakeva
Sparking Science up with Research Recommendations by Maya Hristakeva
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
 
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache S...
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Semantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflowSemantic Image Logging Using Approximate Statistics & MLflow
Semantic Image Logging Using Approximate Statistics & MLflow
 
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark with Ya...
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
 
Distributed processing of large graphs in python
Distributed processing of large graphs in pythonDistributed processing of large graphs in python
Distributed processing of large graphs in python
 
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...
 

Destacado

Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Spark Summit
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathSpark Summit
 
Building Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William BentonBuilding Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William BentonSpark Summit
 
Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Spark Summit
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarSpark Summit
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanDatabricks
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Spark Summit
 
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...Spark Summit
 
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Spark Summit
 
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoHistogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoSpark Summit
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 

Destacado (20)

Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 
Building Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William BentonBuilding Machine Learning Algorithms on Apache Spark with William Benton
Building Machine Learning Algorithms on Apache Spark with William Benton
 
Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
 
A Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen FanA Developer's View Into Spark's Memory Model with Wenchen Fan
A Developer's View Into Spark's Memory Model with Wenchen Fan
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
 
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
 
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
 
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoHistogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 

Similar a Experimental Design for Distributed ML Model Testing

Customer choice probabilities
Customer choice probabilitiesCustomer choice probabilities
Customer choice probabilitiesAllan D. Butler
 
Feature enginnering and selection
Feature enginnering and selectionFeature enginnering and selection
Feature enginnering and selectionDavis David
 
Net campus2015 antimomusone
Net campus2015 antimomusoneNet campus2015 antimomusone
Net campus2015 antimomusoneDotNetCampus
 
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATAPREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATADotNetCampus
 
Machine learning with scikitlearn
Machine learning with scikitlearnMachine learning with scikitlearn
Machine learning with scikitlearnPratap Dangeti
 
AI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxAI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxkprasad8
 
20220914-MBT-Experiences-SB1-final.pptx
20220914-MBT-Experiences-SB1-final.pptx20220914-MBT-Experiences-SB1-final.pptx
20220914-MBT-Experiences-SB1-final.pptxMinh Nguyen
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10Roger Barga
 
Intro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft VenturesIntro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft Venturesmicrosoftventures
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsSri Ambati
 
STAT7440StudentIMLPresentationJishan.pptx
STAT7440StudentIMLPresentationJishan.pptxSTAT7440StudentIMLPresentationJishan.pptx
STAT7440StudentIMLPresentationJishan.pptxJishanAhmed24
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideSalford Systems
 
Smart like a Fox: How clever students trick dumb programming assignment asses...
Smart like a Fox: How clever students trick dumb programming assignment asses...Smart like a Fox: How clever students trick dumb programming assignment asses...
Smart like a Fox: How clever students trick dumb programming assignment asses...Nane Kratzke
 
Customer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R OpenCustomer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R OpenPoo Kuan Hoong
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientistMatthew Evans
 
The Current State of the Art of Regression Testing
The Current State of the Art of Regression TestingThe Current State of the Art of Regression Testing
The Current State of the Art of Regression TestingJohn Reese
 

Similar a Experimental Design for Distributed ML Model Testing (20)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Customer choice probabilities
Customer choice probabilitiesCustomer choice probabilities
Customer choice probabilities
 
Random Forest Decision Tree.pptx
Random Forest Decision Tree.pptxRandom Forest Decision Tree.pptx
Random Forest Decision Tree.pptx
 
Feature enginnering and selection
Feature enginnering and selectionFeature enginnering and selection
Feature enginnering and selection
 
Net campus2015 antimomusone
Net campus2015 antimomusoneNet campus2015 antimomusone
Net campus2015 antimomusone
 
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATAPREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
PREDICT THE FUTURE , MACHINE LEARNING & BIG DATA
 
Machine learning with scikitlearn
Machine learning with scikitlearnMachine learning with scikitlearn
Machine learning with scikitlearn
 
AI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxAI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptx
 
20220914-MBT-Experiences-SB1-final.pptx
20220914-MBT-Experiences-SB1-final.pptx20220914-MBT-Experiences-SB1-final.pptx
20220914-MBT-Experiences-SB1-final.pptx
 
Barga Data Science lecture 10
Barga Data Science lecture 10Barga Data Science lecture 10
Barga Data Science lecture 10
 
Intro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft VenturesIntro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft Ventures
 
Top 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner PitfallsTop 10 Data Science Practitioner Pitfalls
Top 10 Data Science Practitioner Pitfalls
 
STAT7440StudentIMLPresentationJishan.pptx
STAT7440StudentIMLPresentationJishan.pptxSTAT7440StudentIMLPresentationJishan.pptx
STAT7440StudentIMLPresentationJishan.pptx
 
Modeling and analysis
Modeling and analysisModeling and analysis
Modeling and analysis
 
Ds for finance day 3
Ds for finance day 3Ds for finance day 3
Ds for finance day 3
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User Guide
 
Smart like a Fox: How clever students trick dumb programming assignment asses...
Smart like a Fox: How clever students trick dumb programming assignment asses...Smart like a Fox: How clever students trick dumb programming assignment asses...
Smart like a Fox: How clever students trick dumb programming assignment asses...
 
Customer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R OpenCustomer Churn Analytics using Microsoft R Open
Customer Churn Analytics using Microsoft R Open
 
Pricing like a data scientist
Pricing like a data scientistPricing like a data scientist
Pricing like a data scientist
 
The Current State of the Art of Regression Testing
The Current State of the Art of Regression TestingThe Current State of the Art of Regression Testing
The Current State of the Art of Regression Testing
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 

Último (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 

Experimental Design for Distributed ML Model Testing

  • 1. Experimental Design for Distributed Machine Learning Myles Baker (@mydpy) 26 October, 2017 1
  • 2. PROBLEM 1 PROBLEM 2 Common Modeling Problems • Business problem is defined, machine learning is a solution candidate • Multiple programming / machine learning models apply • Which model has the least error due to generalization for a given application? • Business problem is defined, machine learning is a solution candidate • A machine learning algorithm has been selected and implemented • What is the expected generalization error? Can be the business be confident in the model results? 22
  • 3. PROBLEM 3 PROBLEM 4 Common Modeling Problems • An existing business problem has been solved with a machine learning model • How do I know how variations in the input will affect the output and predicted outcomes? • How will results change after the model is retrained? • An existing business problem has been solved with a machine learning model • The model was designed for a serial-execution system and does not scale with input set • A parallel model must be developed and tested 33
  • 4. Steps in a Machine Learning Study 4 Formulate Problem and Plan the Study Collect Data & Insights, & Define a Model Conceptual Model Valid? Conceptual Model Yes No
  • 5. Conceptualizing a Distributed Model • Model capabilities are different between serial and distributed applications – Some algorithms do not have a distributed implementation – Understanding computational complexity becomes an increasingly present limitation – Solver and optimizer implementations for existing algorithms may be different or not supported – Model assumptions may change when migrating a serial model to a distributed model • Data characteristics are more challenging to reveal – Outliers are prevalent but may be incorrectly or poorly modeled – Missing value compensation can significantly skew results – Synthetic data can poorly represent actual system 5
  • 7. Steps in a Machine Learning Study 7 Build a Learning Model Model Tuning Learning Model Valid? Learning Model Yes Conceptual Model No, Redefine Conceptual Model
  • 8. Steps in a Machine Learning Study 8 Design Experiments Evaluate Learning Model Experimentation Learning Model Analyze Output Data Present, Launch, Monitor, & Maintain Subject of this Talk
  • 9. The goal of experimentation is to understand the effect of model factors and obtain conclusions which we can consider statistically significant 9 9 This is challenging for distributed learning!
  • 10. How do you design an experiment? • An Experiment is an evaluation of a model using a combination of controllable factors that affect the response • Experiments must be designed correctly using statistical methodology • An Experiment should be: – Independent of other responses – Controlled for variance and uncontrollable error – Reproducible, especially between model candidates • Techniques include: – Measuring Classifier Performance – Hypothesis Testing 10
  • 11. CONTROLLABLE UNCONTROLLABLE Factors that affect model outcomes • Noise in the data • Optimization randomness • Outcomes not observed during training but part of the system being modeled (I.e., a rare disease outcome) • Learning algorithm • Input data • Model parameters • Model hyperparameters 1111
  • 12. 1212 Generating Responses O ( LF ) O ( LF ) O ( LF )
  • 13. 1313 Generating Responses val p1 = new ParamMap().put(factor1.w(3), factor2.w(1)) val p2 = new ParamMap().put(factor1.w(1), factor2.w(2)) val p3 = new ParamMap().put(factor1.w(4), factor2.w(4)) val factorGrid = new ParamGridBuilder() .addGrid(factor1, Array(1, 2, 3, 4)) .addGrid(factor2, Array(3)) .build() val factorGrid = new ParamGridBuilder() .addGrid(factor1, Array(1, 2, 3, 4)) .addGrid(factor2, Array(1, 2, 3, 4)) .build()
  • 14. 1414 Generating Responses Train-Validation Split val tvs = new TrainValidationSplit() .setEstimatorParamMaps(factorGrid) .setEvaluator(new RegressionEvaluator) .setTrainRatio(r) • Creates an estimator based on the parameter map or grid • Randomly splits the input dataset into train and validation sets based on the training ratio r • Uses evaluation metric on the validation set to select the best model val model = tvs.fit(data) model.bestModel .extractParamMap Models support different methods / metrics
  • 15. 1515 Generating Responses Cross Validator val cv = new CrossValidator() .setEstimatorParamMaps(factorGrid) .setEvaluator(new BinaryClassificationEvaluator) .setNumFolds(k) • Creates k non-overlapping randomly partitioned folds which are used as separate training and test datasets • Controls for uncontrollable factors and variance • The ‘bestModel’ contains the model with the highest average cross-validation • Tracks the metrics for each param map evaluated val model = cv.fit(data) model.bestModel .extractParamMap Models support different methods / metrics
  • 16. Reducing the Number of Responses Computational Complexity Limits Practicality of Factorial Search • Use a well-formulated conceptual model. This informs factor choice and reduces unnecessary model iterations • Normalize factors where possible (i.e., factor is determined by input as-opposed to arbitrarily chosen by the modeler) • If your data is large enough, you can split your dataset into multiple parts for use during cross-validation 16
  • 17. CLASSIFICATION • Precision / recall relationships for binary classification problems – Receiver Operating Characteristic Curve • For multi-classification problems: – Most packages only support 0/1 error functions – Confusion matrix • For multilabel classification: – Again, 0/1 indicator function is only supported – Measures by label are most appropriate 1717 How do you analyze model output? REGRESSION • Linear: RSME = Easy • Non-linear: … *runs* – SoftMax – Cross Entropy
  • 18. 1818 Analyzing Model Output Binary Classification Metric Spark Implementation Receiver Operating Characteristic roc Area Under Receiver Operating Characteristic Curve areaUnderROC Area Under Precision-Recall Curve areaUnderPR Measures by Threshold {measure}ByThreshold val metrics = new BinaryClassificationMetrics(predictionAndLabels.rdd.map( r => (r.getAs[Double]("prediction"), r.getAs[Double]("label")))) Dataframe of (prediction, label)
  • 19. 1919 Threshold Curves val threshold = metrics.{measure}ByThreshold .toDF('threshold', 'measure') display(precision.join({measure}, 'threshold') .join(recall, 'threshold')) Optional beta parameter for F1 measure (default = 1)
  • 20. 2020 Analyzing Model Output Multiclass Classification Metric Spark Implementation Confusion Matrix confusionMatrix Accuracy accuracy Measures by Label {measure}ByLabel Weighted Measures weighted{Measure} val metrics = new MulticlassMetrics(predictionAndLabels.rdd.map( r => (r.getAs[Double]("prediction"), r.getAs[Double]("label")))) Dataframe of (prediction, label) Usually not a robust metric by itself
  • 21. 2121 Confusion Matrix metrics.confusionMatrix .toArray //Display using non-Scala tool %python confusion = np.array([[...],[...]])
  • 22. How do I know my model is correct? Hypothesis testing describes the significance of the result and provides a mechanism for assigning confidence to model selection 22 Why conduct and test a hypothesis? • 1 - Binomial Test • 1 - Approximate Normal Test • 1 - t Test • 2 - McNemar’s Test • 2 - K-Fold Cross-Validated Paired t Test 1. What is the likelihood that my model will make a misclassification error? This probability is not known! 2. Given two learning algorithms, which has the lower expected error rate?
  • 23. 2323 Chi-Squared Test Hypothesis: Outcomes are statistically independent New in Spark 2.2! • Conducts Pearson's independence test for every feature against the label • Chi-squared statistics is computed from (feature, label) pairs • All label and feature values must be categorical import org.apache.spark.mllib.stat.Statistics import org.apache.spark.mllib.stat.test.ChiSqTestResult val goodnessOfFitTestResult = Statistics.chiSqTest(labels) val independenceTestResult = Statistics.chiSqTest(contingencyMatrix) Chi squared test summary: method: pearson degrees of freedom = 4 statistic = 0.12499999999999999 pValue = 0.998126379239318 No presumption against null hypothesis: observed follows the same distribution as expected.. Nice, but you still need to do the hard work of constructing the hypothesis and validating!
  • 24. 2424 McNemar’s Test Hypothesis: Model 1 and model 2 have the same rate of generalization error val totalObs = test.count val conditions = "..." val p1c = predictions1.where(conditions).count() val p1m = totalObs - p1c val p2c = predictions2.where(conditions).count() val p2m = totalObs - p2c val e00 = p1m + p2m val e01 = p1m val e10 = p2m val e11 = p1c + p2c
  • 25. 25 Analyzing of Variance (ANOVA) Analysis of Variance is used to compare multiple models. What is the statistical significance of running model 1 or model 2? • Currently no techniques for ANOVA directly within MLlib • Requires calculating statistics manually • A very useful technique in-practice despite the manual work needed
  • 27. References • Design and Analysis of Machine Learning, Chapter 19 of Introduction to Machine Learning, by Ethem Alpaydın • No free lunch in search and optimization • Moral Machine (MIT) • Design and Analysis of Classifier Learning Experiments in Bioinformatics: Survey and Case Studies • Simulation Modeling and Analysis, 3rd Edition, Averill Law and David Kelton • Spark Scala API documentation, source code 27