No More Cumbersomeness:
Automatic Predictive Modeling
on Apache Spark
Masato Asahara and Ryohei Fujimaki
NEC Corporation
Jun/07/2017 @Spark Summit 2017
* 9LivesData is our partner for this work.
2 © NEC Corporation 2017
Who are we?
▌Masato Asahara (Ph.D.)
▌Researcher, NEC System Platform Research Laboratory
Masato currently leads the development of Spark-based machine learning
and data analytics systems that fully automate predictive modeling.
Masato received his Ph.D. degree from Keio University and has worked at
NEC for 7 years as a researcher in the field of distributed computing
systems and computing resource management technologies.
▌Ryohei Fujimaki (Ph.D.)
▌Research Fellow, NEC Data Science Research Laboratory
Ryohei is a research fellow at the Data Science Research Laboratories of NEC
Corporation, a leading provider of advanced analytics technologies based on
artificial intelligence.
In addition to technology R&D, Ryohei is also heavily involved in
co-developing cutting-edge advanced analytics solutions with NEC’s global
business clients and partners.
Ryohei received his Ph.D. degree from the University of Tokyo and
became the youngest research fellow ever in NEC Corporation’s 117-year
history.
Agenda
[Figure: Training Data, Validate Data, and Test Data; predictive model; prediction results]
Agenda
[Figure: NEC Automatic Predictive Modeling takes Training Data, Validate Data, and Test Data as input and generates a highly accurate predictive model and prediction results at high speed]
Automatic Predictive Modeling
Enterprise Applications of Predictive Analysis
Driver Risk Assessment
Inventory Optimization
Churn Retention
Predictive Maintenance
Product Price Optimization
Sales Optimization
Energy/Water Operation Mgmt.
Predictive Model Design is a “Black Art”
Tuning: Best Balance
Feature Selection: Determine a Set of Features
Sales = f(Price, Location) or Sales = f(Price, Weather)
Algorithm Selection: Accuracy vs. Transparency
Black box: Input Data -> Prediction
White box: Input Data -> Prediction
Takes time…
NEC Automatic Predictive Modeling
Tuning: Best Balance
Feature Selection: Determine a Set of Features
Sales = f(Price, Location) or Sales = f(Price, Weather)
Algorithm Selection: Accuracy vs. Transparency
Black box: Input Data -> Prediction
White box: Input Data -> Prediction
Explore Massive Modeling Possibilities
Algorithms
Parameters
Data Preprocessing Strategies
Feature Selection!
Automate and Accelerate with Apache Spark!
Algorithms
Parameters
Data Preprocessing Strategies
Complete in hours!
Modeling and Prediction Flow
Training Data -> Train models -> Models
Validate Data + Models -> Validate models (Validation criteria) -> Best model
Test Data + Best model -> Predict -> Best Prediction
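The flow above can be sketched in a few lines of Python. This is a minimal illustration only, not NEC's implementation: the threshold classifier, the `train` and `accuracy` helpers, the use of accuracy as the validation criterion, and the toy data are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, int]          # (feature value, label); hypothetical toy schema
Model = Callable[[float], int]


def accuracy(model: Model, rows: Sequence[Row]) -> float:
    """Validation criterion assumed for this sketch: plain accuracy."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def train(rows: Sequence[Row], threshold: float) -> Model:
    """Toy trainer: choose the polarity of a threshold rule that fits the training rows best."""
    above: Model = lambda x: int(x >= threshold)
    below: Model = lambda x: int(x < threshold)
    return above if accuracy(above, rows) >= accuracy(below, rows) else below


def select_and_predict(train_rows: Sequence[Row],
                       validate_rows: Sequence[Row],
                       test_inputs: Sequence[float],
                       thresholds: Sequence[float]) -> Tuple[float, List[int]]:
    # Train one candidate model per hyperparameter value.
    models = {t: train(train_rows, t) for t in thresholds}
    # Validate every candidate and keep the best one.
    best_t = max(models, key=lambda t: accuracy(models[t], validate_rows))
    # Predict on the test data with the best model only.
    return best_t, [models[best_t](x) for x in test_inputs]


train_rows = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
validate_rows = [(0.2, 0), (0.45, 0), (0.55, 1), (0.8, 1)]
best_t, predictions = select_and_predict(
    train_rows, validate_rows, [0.3, 0.7], [0.25, 0.5, 0.75])
```

Training data is used only to fit the candidates, validate data only to rank them, and test data only for the final prediction, which mirrors the three inputs in the flow.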
Design Challenges and Solutions
3 Design Challenges
Implementation Gap between Spark and ML engines
Spark:
▌Distributed memory architecture
▌MapReduce computation model
▌Scala on JVM
ML engines:
▌Single or shared memory architecture
▌Standard computing framework
▌High-speed native binary code
Mini-gapped Integration
Smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines
[Figure: Training Data (Parquet) on HDFS -> Data Preprocessing (MapReduce) -> Convert to Matrix -> Machine Learning (Map operation) over RDD elements -> Models on HDFS]
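One possible reading of this pipeline, sketched in plain Python so it runs without a cluster. Everything here is an assumption for illustration: the built-in `map` stands in for a Spark map operation over RDD elements, `ml_engine` is a hypothetical stand-in for a single-machine native ML engine (a one-dimensional ridge regression), and the record fields `price` and `sales` are made up.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]
Matrix = List[List[float]]
Element = Tuple[Matrix, List[float], float]   # (features, targets, parameter)


def preprocess(records: List[Record]) -> List[Record]:
    """Stand-in for the distributed preprocessing step: drop incomplete records."""
    return [r for r in records
            if r["price"] is not None and r["sales"] is not None]


def to_matrix(records: List[Record]) -> Tuple[Matrix, List[float]]:
    """Convert cleaned records into the dense matrix form an ML engine expects."""
    X = [[r["price"]] for r in records]
    y = [r["sales"] for r in records]
    return X, y


def ml_engine(element: Element) -> Tuple[float, float]:
    """Hypothetical single-machine engine: 1-D ridge regression, returns (lam, weight)."""
    X, y, lam = element
    sxy = sum(row[0] * target for row, target in zip(X, y))
    sxx = sum(row[0] ** 2 for row in X)
    return lam, sxy / (sxx + lam)


raw = [
    {"price": 1.0, "sales": 2.0},
    {"price": None, "sales": 5.0},   # removed by preprocessing
    {"price": 2.0, "sales": 4.0},
]
X, y = to_matrix(preprocess(raw))

# Each element carries a matrix plus one parameter setting, so every
# engine call is independent and can run in parallel on a different executor.
elements = [(X, y, lam) for lam in (0.0, 5.0)]
models = dict(map(ml_engine, elements))
```

The point of the sketch is the hand-off: preprocessing works record by record, while the learning step receives a whole matrix per element and never needs to know it is running inside a distributed job.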
Mini-gapped Integration
Smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines
[Figure: Validate Data (Parquet) on HDFS -> Data Preprocessing (MapReduce) -> Convert to Matrix -> Predict (Map operation) over RDD elements -> Validation (MapReduce) -> Best Model on HDFS]
Mini-gapped Integration
Smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines
[Figure: Test Data (Parquet) on HDFS -> Data Preprocessing (MapReduce) -> Convert to Matrix -> Predict (Map operation) over RDD elements -> Prediction Results (Parquet) on HDFS]
Naive Implementation Requires Multiple Data Loads & Conversions
Waste of memory
Load of data from other servers
Multiple conversions of data to matrix
[Figure: Load & Convert; Parameter θ1 paired with Matrix X1, Matrix X2, and Matrix X3]
Parameter-aware Scheduling
Efficient memory usage
Low load of data from other hosts
Low frequency of data conversion to matrix
[Figure: Parameter θ1, Parameter θ2, and Parameter θ3 share Matrix X1]
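A rough sketch of the idea, under the assumption that a task is a (matrix, parameter) pair and that a host must load and convert each distinct matrix it works on exactly once. The function names, the round-robin baseline, and the cost model are illustrative choices, not the scheduler described in the talk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Task = Tuple[str, str]   # (matrix id, parameter id); hypothetical task shape


def naive_plan(tasks: List[Task], n_hosts: int) -> List[List[Task]]:
    """Round-robin tasks over hosts, ignoring which matrix each task needs."""
    hosts: List[List[Task]] = [[] for _ in range(n_hosts)]
    for i, task in enumerate(tasks):
        hosts[i % n_hosts].append(task)
    return hosts


def parameter_aware_plan(tasks: List[Task], n_hosts: int) -> List[List[Task]]:
    """Keep every parameter setting that shares a matrix on the same host."""
    by_matrix: Dict[str, List[Task]] = defaultdict(list)
    for matrix_id, theta in tasks:
        by_matrix[matrix_id].append((matrix_id, theta))
    hosts: List[List[Task]] = [[] for _ in range(n_hosts)]
    for i, group in enumerate(by_matrix.values()):
        hosts[i % n_hosts].extend(group)
    return hosts


def conversions(plan: List[List[Task]]) -> int:
    """Count load-and-convert operations: one per distinct matrix per host."""
    return sum(len({matrix_id for matrix_id, _ in host}) for host in plan)


tasks = [(m, t) for m in ("X1", "X2", "X3")
         for t in ("theta1", "theta2", "theta3")]

naive = naive_plan(tasks, 3)
aware = parameter_aware_plan(tasks, 3)
naive_cost = conversions(naive)
aware_cost = conversions(aware)
```

With three matrices and three parameter settings on three hosts, the round-robin plan makes every host convert all three matrices, while the grouped plan converts each matrix once.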
Major Part of Computing is Machine Learning
[Figure: Training Data (Parquet) on HDFS -> Data Preprocessing (MapReduce) -> Convert to Matrix -> Machine Learning (Map operation) -> HDFS]
Machine learning: major part of total execution time!
Bad Scheduling Degrades Execution Efficiency
[Figure: one worker runs two 5 min jobs; the other runs two 1 min jobs and then waits 8 min…]
Predictive Scheduling Increases Performance Scalability
Balance the learning of complex models with that of simple ones
[Figure: each worker runs one 5 min job and one 1 min job]
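The balancing effect can be reproduced with a standard longest-job-first greedy assignment. This technique is swapped in here purely for illustration, and the slides do not say which algorithm the system uses or how learning times are predicted. The predicted durations are simply given as input, and the job names are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def schedule(predicted_minutes: Dict[str, float],
             n_workers: int) -> Tuple[List[List[str]], float]:
    """Assign jobs longest-first to the currently least-loaded worker.

    Returns the per-worker job lists and the makespan (finish time of the
    busiest worker).
    """
    heap: List[Tuple[float, int]] = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(n_workers)]
    # Sort by predicted duration, longest first.
    for job, minutes in sorted(predicted_minutes.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)          # least-loaded worker
        assignment[worker].append(job)
        heapq.heappush(heap, (load + minutes, worker))
    return assignment, max(load for load, _ in heap)


# Two complex models (5 min each) and two simple ones (1 min each).
jobs = {"complex_a": 5, "complex_b": 5, "simple_a": 1, "simple_b": 1}
assignment, makespan = schedule(jobs, n_workers=2)
```

On this input both workers finish after 6 minutes, compared with 10 minutes when both complex jobs land on the same worker.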
Spark on YARN Tips for Stable Execution
With the default configuration, execution sometimes fails even with enough memory…
We give Spark plenty of memory, but it keeps failing.
Why!?
Spark WebUI says …
Spark on YARN Tips for Stable Execution
… because JVM system memory usage suddenly spikes above the YARN limit (*)
[Figure: memory (GB) over time; a spike of JVM system memory usage exceeds the YARN limit (6GB)]
(*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit,” Spark Summit 2016.
Spark on YARN Tips for Stable Execution
Tips: We have to carefully configure ‘spark.yarn.*.memoryOverhead’
(http://spark.apache.org/docs/2.1.1/running-on-yarn.html)
15% overhead was required in our case…
Optimizing the overhead memory configuration is our future work
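As a small worked example, the helper below derives overhead settings from heap sizes. The property names `spark.yarn.executor.memoryOverhead` and `spark.yarn.driver.memoryOverhead` are the ones documented for Spark 2.1.1 on YARN, where the value is given in megabytes and defaults to 10% of the heap with a 384 MB minimum. Treating the 15% from the slide as a fraction of the heap size, keeping the 384 MB floor, and the 4 GB and 2 GB heap sizes are assumptions for this sketch.

```python
from typing import Dict


def overhead_mb(heap_mb: int, fraction: float = 0.15, floor_mb: int = 384) -> int:
    """Overhead in MB: a fraction of the heap, but never below the floor."""
    return max(floor_mb, int(heap_mb * fraction))


def spark_conf(executor_mb: int, driver_mb: int,
               fraction: float = 0.15) -> Dict[str, str]:
    """Build Spark properties with an explicit YARN memory overhead."""
    return {
        "spark.executor.memory": f"{executor_mb}m",
        "spark.yarn.executor.memoryOverhead": str(overhead_mb(executor_mb, fraction)),
        "spark.driver.memory": f"{driver_mb}m",
        "spark.yarn.driver.memoryOverhead": str(overhead_mb(driver_mb, fraction)),
    }


# Hypothetical sizes: 4 GB executors and a 2 GB driver.
conf = spark_conf(executor_mb=4096, driver_mb=2048)

# Render as spark-submit arguments.
submit_args = [f"--conf {key}={value}" for key, value in conf.items()]
```

YARN sizes each container as heap plus overhead, so the sum of the two values is what has to stay below the container limit.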
Evaluation
Evaluation Setup
• Compare accuracy with manual predictive modeling
• Measure execution time
▌Prediction problem
Targeting the top 10% of potential positive samples
▌Manual predictive modeling
Done with scikit-learn v0.18.1
All parameters are left at their default values
• except RandomForest (n_estimators = 200)
▌Data set
KDDCUP 2014 competition data
• 557K records for training and validate data
• 62K records for test data
• Features: 500
KDDCUP 2015 competition data
• 108K records for training and validate data
• 12K records for test data
• Features: 500
IJCAI 2015 competition data
• 87K records for training, validate and test data
• Features: 500
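The metric implied by this prediction problem can be computed as follows. This is a sketch under an assumed definition: top-10% precision is taken to be the share of true positives among the 10% of samples the model scores highest. The slides do not spell out the formula, and the function name and toy data are hypothetical.

```python
from typing import Sequence


def top_fraction_precision(scores: Sequence[float],
                           labels: Sequence[int],
                           fraction: float = 0.10) -> float:
    """Precision among the `fraction` of samples with the highest scores."""
    k = max(1, int(len(scores) * fraction))       # size of the targeted group
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# 20 toy samples, so the top 10% is the 2 highest-scored ones.
scores = [i / 20 for i in range(20)]
labels = [0] * 20
labels[19] = 1    # highest score, truly positive
labels[10] = 1    # positive, but ranked outside the top 10%
precision = top_fraction_precision(scores, labels)
```

Only the ranking matters for this metric, so models with differently scaled scores can be compared directly.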
Cluster Spec.
▌Size: 3U!
▌# Server modules: 34
▌CPU: 272 Intel Xeon-D 2.1GHz cores
(128 cores used in the evaluation)
▌MEM: 2TB
High memory access: scored 16x on the TeraSort benchmark
of HiBench v5 compared to 3 x 1U servers!
▌Storage: 34TB SSD
▌Network: 10GbE for internal network
▌Spark v1.6.0, Hadoop v2.7.3
Scalable Modular Server
(DX2000)
http://www.nec.com/en/global/prod/dxseries/dx2000/product/index.html
Precision and Execution Time
NEC Automatic Predictive Modeling produces competitively accurate
models automatically in hours!

Top-10% Precision
Data         NEC    Logistic Regression  SVM    Random Forests
KDDCUP 2014  15.6%  13.5%                12.0%  14.8%
KDDCUP 2015  97.1%  95.5%                93.1%  97.2%
IJCAI 2015   8.2%   8.3%                 8.1%   8.2%

Execution time
Data         NEC
KDDCUP 2014  172 minutes
KDDCUP 2015  45 minutes
IJCAI 2015   36 minutes
Summary
Future work
Speed up by FPGA, Vector processors
Extend to other models (e.g., DeepLearning)
Reduce YARN memory overhead
Appendix
Software Stack
Automation Engine
Computing Infrastructure
Data Store / Resource Management
In-Memory Distributed Computing
Web UI
Native-speed Machine Learning Libraries
Scalable Modular Server (DX2000) Spec.: Server Module
Form factor: Server module that plugs into the Module Enclosure
Number of processors: 1
Processors: Intel® Xeon® Processor D-1527 (2.20GHz/4-core/6MB)
  Intel® Xeon® Processor D-1541 (2.10GHz/8-core/12MB)
  Intel® Xeon® Processor D-1571 (1.30GHz/16-core/24MB)
Memory type: DDR4-2133 ECC SO-DIMM
Memory slots: 4
Memory capacity: 16 GB / 32 GB / 64 GB
Storage type: M.2 SATA SSD
Internal storage capacity: 128 GB / 256 GB / 512 GB / 1TB
Network: 2 10GbE links to switch modules
  2 additional 10GbE links to switch modules with an optional 10G LAN module (occupies one server module slot)
Systems management: EXPRESSSCOPE Engine 3
Operating systems and virtualization software: Red Hat® Enterprise Linux® 7.2 / 6.8
  VMware ESXi™ 6.0
  Microsoft® Windows® Server 2012 R2
  CentOS 7.2
  Ubuntu 14.04 LTS / 16.04 LTS
Scalable Modular Server (DX2000) Spec.: Module Enclosure
Form factor / height: 3U Rack
Server module slots: 44 (22 slots can be used for 10G LAN modules and 8 slots can be used for PCIe cards)
  * The number of installable modules may change according to configuration.
Network switches: 2 network switch modules (L2)
Redundant cooling fan: Standard, hot plug
Power supplies: 3 hot-plug power supplies, 1600 Watt, 200-240 VAC ± 10%, 50 / 60 Hz ± 3 Hz
Redundant power supply: 2+1 redundant, hot plug
Temperature and humidity conditions (non-condensing): Operating: 10 to 35* °C / 50 to 95* °F, 20 to 80%
  Non-operating: -10 to 55 °C / 14 to 131 °F, 20 to 80%
  * In specific configurations, the operable ambient temperature is up to 40 °C / 104 °F
Dimensions (W x D x H) and maximum weight: 448.0 x 769.0 x 130.0 mm / 17.6 x 30.3 x 5.1 in, 48 kg / 105.82 lbs
Scalable Modular Server (DX2000) Spec.: Network Switch Module (L2)
Form factor: Network Switch Module that plugs into the Module Enclosure
Network: Up link: 8 40G QSFP+ ports plus 1 1000BASE-T for management
  Down link: 44 10GbE

Más contenido relacionado

La actualidad más candente

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkDatabricks
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyDatabricks
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with TransformersDatabricks
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3Databricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyDatabricks
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPDatabricks
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Databricks
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 

La actualidad más candente (20)

From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks Jump Start on Apache® Spark™ 2.x with Databricks
Jump Start on Apache® Spark™ 2.x with Databricks
 
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark
 
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph BradleyApache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
Apache Spark MLlib's Past Trajectory and New Directions with Joseph Bradley
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with Transformers
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Apache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise CompanyApache Spark for Cyber Security in an Enterprise Company
Apache Spark for Cyber Security in an Enterprise Company
 
Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)Announcing Databricks Cloud (Spark Summit 2014)
Announcing Databricks Cloud (Spark Summit 2014)
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Advanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLPAdvanced Natural Language Processing with Apache Spark NLP
Advanced Natural Language Processing with Apache Spark NLP
 
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
Experience of Running Spark on Kubernetes on OpenStack for High Energy Physic...
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 

Similar a No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Masato Asahara and Ryohei Fujimaki

How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...DataWorks Summit
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed_Hat_Storage
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghDeep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghData Con LA
 
Graph Data Science at Scale
Graph Data Science at ScaleGraph Data Science at Scale
Graph Data Science at ScaleNeo4j
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
Microsoft Azure Stack in Tunisia
Microsoft Azure Stack in TunisiaMicrosoft Azure Stack in Tunisia
Microsoft Azure Stack in TunisiaAymen Mami
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkSingleStore
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsHugo Blanco
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkSpark Summit
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Ahsan Javed Awan
 
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Community
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsIgor José F. Freitas
 
Cisco connect montreal 2018 compute v final
Cisco connect montreal 2018   compute v finalCisco connect montreal 2018   compute v final
Cisco connect montreal 2018 compute v finalCisco Canada
 
“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...
“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...
“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...Edge AI and Vision Alliance
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIBM Switzerland
 

Similar a No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Masato Asahara and Ryohei Fujimaki (20)

How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghDeep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
 
Graph Data Science at Scale
Graph Data Science at ScaleGraph Data Science at Scale
Graph Data Science at Scale
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Microsoft Azure Stack in Tunisia
Microsoft Azure Stack in TunisiaMicrosoft Azure Stack in Tunisia
Microsoft Azure Stack in Tunisia
 
The Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with SparkThe Fast Path to Building Operational Applications with Spark
The Fast Path to Building Operational Applications with Spark
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive Systems
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...Performance Characterization and Optimization of In-Memory Data Analytics on ...
Performance Characterization and Optimization of In-Memory Data Analytics on ...
 
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
Ceph Day Shanghai - SSD/NVM Technology Boosting Ceph Performance
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
 
Cisco connect montreal 2018 compute v final
Cisco connect montreal 2018   compute v finalCisco connect montreal 2018   compute v final
Cisco connect montreal 2018 compute v final
 
“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...
“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...
“Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL,” a Pres...
 
Ibm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bkIbm symp14 referentin_barbara koch_power_8 launch bk
Ibm symp14 referentin_barbara koch_power_8 launch bk
 

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Masato Asahara and Ryohei Fujimaki

  • 1. No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Masato Asahara and Ryohei Fujimaki NEC Corporation Jun/07/2017 @Spark Summit 2017 * 9LivesData is our partner for this work.
  • 2. 2 © NEC Corporation 2017 Who are we? ▌Masato Asahara (Ph.D.), Researcher, NEC System Platform Research Laboratory. Masato currently leads the development of Spark-based machine learning and data analytics systems that fully automate predictive modeling. He received his Ph.D. from Keio University and has worked at NEC for 7 years as a researcher in distributed computing systems and computing resource management technologies. ▌Ryohei Fujimaki (Ph.D.), Research Fellow, NEC Data Science Research Laboratory. Ryohei is a research fellow at the Data Science Research Laboratories of NEC Corporation, a leading provider of advanced analytics technologies based on artificial intelligence. In addition to technology R&D, he is heavily involved in co-developing cutting-edge advanced analytics solutions with NEC’s global business clients and partners. He received his Ph.D. from the University of Tokyo and became the youngest research fellow in NEC Corporation’s 117-year history.
  • 3. Agenda [Figure labels: Predictive model; Prediction results; Training Data; Validate Data; Test Data]
  • 4. Agenda [Figure labels: Input (Training Data, Validate Data, Test Data); NEC Automatic Predictive Modeling; High-speed Generation; Highly accurate predictive model & prediction results]
  • 5. Agenda [Figure labels: θ1, θ2, θ3]
  • 7. Enterprise Applications of Predictive Analysis: Driver Risk Assessment; Inventory Optimization; Churn Retention; Predictive Maintenance; Product Price Optimization; Sales Optimization; Energy/Water Operation Mgmt.
  • 8. Predictive Model Design is a “Black Art”. Tuning for the best balance takes time … Feature Selection: determine a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather). Algorithm Selection: Accuracy vs. Transparency (black box: Input Data, Prediction; white box: Input Data, Prediction).
  • 9. NEC Automatic Predictive Modeling. Tuning for the best balance. Feature Selection: determine a set of features, e.g. Sales = f(Price, Location) or Sales = f(Price, Weather). Algorithm Selection: Accuracy vs. Transparency (black box: Input Data, Prediction; white box: Input Data, Prediction).
  • 10. Explore Massive Modeling Possibilities: Algorithms; Parameters; Data Preprocessing Strategies; Feature Selection!
  • 11. Automate and Accelerate by Apache Spark! Algorithms; Parameters; Data Preprocessing Strategies. Complete in hours!
  • 12. Modeling and Prediction Flow [Figure labels: Training Data; Train models; Models; Validate Data; Validate models; Validation criteria; Best model; Test Data; Predict; Prediction]
  • 14. 3 Design Challenges [Figure labels: θ1, θ2, θ3]
  • 15. 3 Design Challenges [Figure labels: θ1, θ2, θ3]
  • 16. Implementation Gap between Spark and ML engines. Spark: ▌Distributed memory architecture ▌MapReduce computation model ▌Scala on JVM. ML engines: ▌Single or shared memory architecture ▌Standard computing framework ▌High-speed native binary code
  • 17. Mini-gapped Integration: smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines. [Figure labels: Training Data (Parquet) on HDFS; Data Preprocessing (MapReduce); Convert to Matrix (RDD elements); Machine Learning (Map operation); Models on HDFS]
  • 18. Mini-gapped Integration: smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines. [Figure labels: Validate Data (Parquet) on HDFS; Data Preprocessing (MapReduce); Convert to Matrix (RDD elements); Predict (Map operation); Validation (MapReduce); Best Model on HDFS]
  • 19. Mini-gapped Integration: smoothly bridge ‘distributed’ preprocessing and ‘parallel’ execution of ML engines. [Figure labels: Test Data (Parquet) on HDFS; Data Preprocessing (MapReduce); Convert to Matrix (RDD elements); Predict (Map operation); Prediction Results (Parquet) on HDFS]
  • 20. 3 Design Challenges [Figure labels: θ1, θ2, θ3]
  • 21. Naive Implementation requires Multiple Data Load & Convert: waste of memory; loading of data from other servers; multiple conversions of data to matrix. [Figure labels: Parameter θ1, Parameter θ1, Parameter θ1; Matrix X1, Matrix X2, Matrix X3]
  • 22. Parameter-aware Scheduling: efficient memory usage; little loading of data from other hosts; infrequent conversion of data to matrix. [Figure labels: Parameter θ1, Parameter θ2, Parameter θ3; Matrix X1]
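The difference between the naive plan and parameter-aware scheduling on the two slides above can be sketched in plain Python. This is only an illustration under stated assumptions, not NEC's implementation: `Task`, `naive_conversions`, and `parameter_aware_plan` are hypothetical names, and a "conversion" is merely counted rather than performed.

```python
from collections import defaultdict
from typing import NamedTuple


class Task(NamedTuple):
    matrix_id: str  # hypothetical key for the preprocessed matrix a task trains on
    param: str      # hyperparameter setting, e.g. "theta1"


def naive_conversions(tasks):
    """Naive plan: every task loads and converts its own copy of the matrix."""
    return len(tasks)


def parameter_aware_plan(tasks):
    """Group tasks by matrix so each matrix is loaded and converted once,
    then reused for every parameter setting scheduled with it."""
    plan = defaultdict(list)
    for task in tasks:
        plan[task.matrix_id].append(task.param)
    return dict(plan)


tasks = [Task("X1", "theta1"), Task("X1", "theta2"), Task("X1", "theta3")]
plan = parameter_aware_plan(tasks)
print(naive_conversions(tasks))  # one conversion per task
print(len(plan), plan)           # one conversion per distinct matrix
```

Grouping by matrix is what lets the three parameter settings share Matrix X1 instead of holding X1, X2, and X3 as separate copies.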
  • 23. 3 Design Challenges [Figure labels: θ1, θ2, θ3]
  • 24. Major Part of Computing is Machine Learning. [Figure labels: Training Data (Parquet) on HDFS; Data Preprocessing (MapReduce); Convert to Matrix; Machine Learning (Map operation): major part of total execution time!; HDFS]
  • 25. Bad Scheduling degrades Execution Efficiency. [Figure labels: tasks of 5 min, 5 min, 1 min, 1 min; Wait 8 min…]
  • 26. Predictive Scheduling increases Performance Scalability: balance complex model learning and simple model learning. [Figure labels: tasks of 5 min, 1 min, 5 min, 1 min]
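The 5/5/1/1-minute example on the two scheduling slides can be reproduced with a small sketch. The slides do not say which scheduling algorithm is used, so the greedy longest-first assignment below is a swapped-in, textbook heuristic, and it assumes the training time of each model has already been predicted. All function names are hypothetical.

```python
import heapq


def in_order_makespan(durations, n_workers):
    """Baseline: split the task list into contiguous chunks, one per worker."""
    chunk = -(-len(durations) // n_workers)  # ceiling division
    loads = [sum(durations[i:i + chunk]) for i in range(0, len(durations), chunk)]
    return max(loads)


def predictive_makespan(predicted, n_workers):
    """Greedy longest-first assignment driven by predicted training times."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    for d in sorted(predicted, reverse=True):
        load, w = heapq.heappop(heap)        # least-loaded worker so far
        heapq.heappush(heap, (load + d, w))
    return max(load for load, _ in heap)


minutes = [5, 5, 1, 1]  # two complex models and two simple ones, as on the slides
print(in_order_makespan(minutes, 2))    # 10: the other worker finishes after 2 and waits 8
print(predictive_makespan(minutes, 2))  # 6: each worker gets one 5-minute and one 1-minute task
```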
  • 27. Tip of Spark on YARN for Stable Execution: with the default configuration, execution sometimes fails even with enough memory… We give Spark plenty of memory, but it keeps failing. Why!? Spark WebUI says …
  • 28. Spark on YARN Tips for Stable Execution: … because JVM system memory usage suddenly spikes above the YARN limit (*). [Figure labels: Memory (GB) over Time; YARN limit (6GB); spike of JVM system memory usage] (*) Shivnath and Mayuresh. “Understanding Memory Management In Spark For Fun And Profit,” Spark Summit 2016.
  • 29. Spark on YARN Tips for Stable Execution. Tip: carefully configure ‘spark.yarn.*.memoryOverhead’ (http://spark.apache.org/docs/2.1.1/running-on-yarn.html). A 15% overhead was required in our case… Optimizing the overhead memory configuration is our future work.
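As a rough illustration of the tip above, the sketch below derives a `spark.yarn.executor.memoryOverhead` value from the executor heap size and checks that heap plus overhead fits a container limit. It assumes that the 15% is measured against executor memory and that the 6GB YARN limit from the earlier slide is the per-container limit; the slides confirm neither. The 384 MiB floor follows the Spark 2.1.1 documentation, and the helper names are hypothetical.

```python
MIN_OVERHEAD_MIB = 384  # minimum overhead documented for Spark 2.1.1 on YARN


def overhead_mib(executor_memory_mib, percent):
    """Overhead as a percentage of executor memory, rounded up, never below the minimum."""
    return max(MIN_OVERHEAD_MIB, -(-executor_memory_mib * percent // 100))


def yarn_conf(executor_memory_mib, percent=15, container_limit_mib=6 * 1024):
    """Build spark-submit --conf flags; fail early if heap + overhead exceeds the limit."""
    overhead = overhead_mib(executor_memory_mib, percent)
    total = executor_memory_mib + overhead
    if total > container_limit_mib:
        raise ValueError(f"{total} MiB exceeds the {container_limit_mib} MiB YARN limit")
    return [
        "--conf", f"spark.executor.memory={executor_memory_mib}m",
        "--conf", f"spark.yarn.executor.memoryOverhead={overhead}",  # value in MiB
    ]


print(" ".join(yarn_conf(5120)))
```

The driver-side property, `spark.yarn.driver.memoryOverhead`, could be sized the same way.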
  • 31. Evaluation Setup • Compare accuracy with manual predictive modeling • Measure execution time ▌Prediction problem: targeting the top 10% of potential positive samples ▌Manual predictive modeling: done with scikit-learn v0.18.1; all parameters set to default values, except RandomForest (n_estimators = 200) ▌Data sets: KDDCUP 2014 competition data (557K records for training and validation data; 62K records for test data; 500 features); KDDCUP 2015 competition data (108K records for training and validation data; 12K records for test data; 500 features); IJCAI 2015 competition data (87K records for training, validation, and test data; 500 features)
  • 32. Cluster Spec.: Scalable Modular Server (DX2000) http://www.nec.com/en/global/prod/dxseries/dx2000/product/index.html ▌Size: 3U! ▌# Server modules: 34 ▌CPU: 272 Intel Xeon-D 2.1GHz cores (128 cores used in the evaluation) ▌MEM: 2TB. High memory access: scored 16x on HiBench v5 Terasort compared to 3 x 1U servers! ▌Storage: 34TB SSD ▌Network: 10GbE for internal network ▌Spark v1.6.0, Hadoop v2.7.3
  • 33. Precision and Execution Time: NEC Automatic Predictive Modeling produces competitively accurate models automatically in hours!
       Top-10% Precision
       Data          NEC     Logistic Regression   SVM     Random Forests
       KDDCUP 2014   15.6%   13.5%                 12.0%   14.8%
       KDDCUP 2015   97.1%   95.5%                 93.1%   97.2%
       IJCAI 2015    8.2%    8.3%                  8.1%    8.2%
       Execution time (NEC)
       Data          Time
       KDDCUP 2014   172 minutes
       KDDCUP 2015   45 minutes
       IJCAI 2015    36 minutes
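The slides do not define the top-10% precision metric in detail. One plausible reading, sketched below as an assumption, ranks samples by predicted score and reports the fraction of true positives among the highest-scored 10%. The function name and the toy data are made up for illustration and are unrelated to the competition data sets.

```python
def top_percent_precision(scores, labels, percent=10):
    """Precision among the `percent` of samples with the highest scores."""
    k = max(1, -(-len(scores) * percent // 100))  # ceiling, at least one sample
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data: 20 samples, so the top 10% is the 2 highest-scored samples.
scores = [0.95, 0.90, 0.80, 0.70] + [0.1] * 16
labels = [1, 0, 1, 1] + [0] * 16
print(top_percent_precision(scores, labels))  # 0.5: one of the top two is positive
```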
  • 34. Summary [Figure labels: θ1, θ2, θ3]
  • 35. Future work: speed up with FPGAs and vector processors; extend to other models (e.g., deep learning); reduce YARN memory overhead
  • 38. Automation Engine Software Stack [Figure labels: Computing Infrastructure; Data Store / Resource Management; In-Memory Distributed Computing; Web UI; Native-speed Machine Learning Libraries]
  • 39. Scalable Modular Server (DX2000) Spec.: Server Module
       Form factor: Server module that plugs into the Module Enclosure
       Number of processors: 1
       Processors: Intel® Xeon® Processor D-1527 (2.20GHz/4-core/6MB); Intel® Xeon® Processor D-1541 (2.10GHz/8-core/12MB); Intel® Xeon® Processor D-1571 (1.30GHz/16-core/24MB)
       Memory type: DDR4-2133 ECC SO-DIMM
       Memory slots: 4
       Memory capacity: 16 GB / 32 GB / 64 GB
       Storage type: M.2 SATA SSD
       Internal storage capacity: 128 GB / 256 GB / 512 GB / 1TB
       Network: 2 10GbE links to switch modules; 2 additional 10GbE links to switch modules with an optional 10G LAN module (occupies one server module slot)
       Systems management: EXPRESSSCOPE Engine 3
       Operating systems and virtualization software: Red Hat® Enterprise Linux® 7.2 / 6.8; VMware ESXi™ 6.0; Microsoft® Windows® Server 2012 R2; CentOS 7.2; Ubuntu 14.04 LTS / 16.04 LTS
  • 40. Scalable Modular Server (DX2000) Spec.: Module Enclosure
       Form factor / height: 3U Rack
       Server module slots: 44 (22 slots can be used for 10G LAN modules and 8 slots can be used for PCIe cards) * The number of installable modules may change according to configuration.
       Network switches: 2 network switch modules (L2)
       Redundant cooling fan: Standard, hot plug
       Power supplies: 3 hot-plug power supplies, 1600 Watt, 200-240 VAC ± 10%, 50 / 60 Hz ± 3 Hz
       Redundant power supply: 2+1 redundant, hot plug
       Temperature and humidity conditions (non-condensing): Operating: 10 to 35* °C / 50 to 95* °F, 20 to 80%; Non-operating: -10 to 55°C / 14 to 131°F, 20 to 80% * In specific configurations, the operable ambient temperature is up to 40°C/104°F
       Dimensions (W x D x H) and maximum weight: 448.0 x 769.0 x 130.0 mm / 17.6 x 30.3 x 5.1 in, 48 kg / 105.82 lbs
  • 41. Scalable Modular Server (DX2000) Spec.: Network Switch Module (L2)
       Form factor: Network Switch Module that plugs into the Module Enclosure
       Network: Up link: 8 40G QSFP+ ports plus 1 1000BASE-T for management; Down link: 44 10GbE