Gianmario Spacagna
14th September, 2019 - Alluxio Meetup @ San Francisco, CA
Takeaways
¨  What a logical data warehouse is
¨  How to handle governance issues
¨  An Agile workflow made of iterative exploratory
analysis and production-quality development
¨  A fully in-memory stack for fast computation on top
of Spark and Alluxio
¨  How to successfully do data science if your data
resides in an RDBMS and you don’t have a data
lake
About me
¨  Engineering background in Distributed Systems
¤  (University of Cassino, Polytechnic of Turin, KTH of Stockholm)
¨  Data-relevant experience
¤  Predictive Marketing (AgilOne, StreamSend)
¤  Cyber Security (Cisco)
¤  Financial Services (Barclays)
¤  Automotive (Pirelli)
Areas of interest
¨  Functional Programming, Scala and Apache
Spark
¨  Contributor to the
Professional Data Science Manifesto
¨  Founder of Data Science Milan Meetup
community (datasciencemilan.org)
¨  Co-authoring Python Deep Learning book,
coming soon…
Building production-ready and scalable machine
learning systems
(continue with list of principles...)
Data Science Agile cycle
Get
access to
data
Explore
Transform
Train
Evaluate
Analyze
results
Even dozens of
iterations per
day!!!
Successful development
of new data products
requires proper
infrastructure and tools
Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
¨  You can’t solve problems with data science if
data is not largely available
¨  Data processing should be fast and reactive to
allow quick iterations
¨  The core team cannot depend on IT folks
Data Lake in a legacy enterprise
environment
Technical issues
¨  Engineering effort
¤  dedicated infrastructure team (expensive)
¨  Synchronization with new data from source
¤  Report what portion of data has been exported and what
not
¨  Consistency / Data Versioning / Duplication
¤  ETL logic and requirements change very often
¤  Memory is cheap, but having hundreds of sparse
copies of the same data is confusing
¨  I/O cost
¤  Reading/writing is expensive for iterative and exploratory
jobs (machine learning)
Logical Data Warehouse
¨  View and access cleaned versions of data
¨  Always show latest version by default
¨  Apply transformations on-the-fly
(discovery-oriented analytics)
¨  Abstract data representation from rigid structures
of the DB’s persistence store
¨  Simply add new data sources using virtualization
¨  Flexible, fast time-to-market, lower costs
What about governance issues?
¨  Large corporations can’t move data before an approved
governance plan
¨  Data can only be stored in a safe environment
administered by only a few authorized people who don’t
necessarily understand data scientists’ needs
¨  Data leakage paranoia, cloud-phobia!
¨  As a result, data cannot be easily/quickly pulled from the
central data warehouse and stored into an external
infrastructure
Long time and large investment for
setting up a new project
That’s not Agile!
Wait a moment, analysts don’t seem to
have this problem…
From disk to volatile memory
Distribute and make data temporarily available in memory
in an ad-hoc development cluster
¨  In-memory engine for distributed data processing
¨  JDBC drivers to connect to relational databases
¨  Structured data represented using DataFrame API
¨  Fully-functional data manipulation via RDD API
¨  Machine learning libraries (ML/MLlib)
¨  Interaction and visualization through
Spark Notebook or Zeppelin
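A minimal sketch of the JDBC-to-DataFrame step, assuming the Spark 1.x `SQLContext` API used elsewhere in these slides. The object name `JdbcLoad`, the split column `id`, the bounds and the partition count are hypothetical placeholders, not values from the talk; the matching JDBC driver jar is assumed to be on the classpath.

```scala
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SQLContext}

object JdbcLoad {

  /** Reads a relational table into a DataFrame using parallel JDBC connections. */
  def loadTable(sqlContext: SQLContext,
                jdbcUrl: String,
                table: String,
                user: String,
                password: String): DataFrame = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)

    sqlContext.read.jdbc(
      jdbcUrl,
      table,
      "id",        // hypothetical numeric column used to split the read
      0L,          // lowerBound: only used to compute the partition stride
      600000000L,  // upperBound: only used to compute the partition stride
      50,          // numPartitions: number of concurrent range queries (assumption)
      props)
  }
}
```

The bounds do not filter rows; they only decide how the key range is cut into partitions, so a skewed split column would still produce uneven partitions.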
In-memory workflow
Just Spark cache is not enough
¨  Data is dropped from memory
at each context restart due to
¤  Update dependency jar
(common for mixed IDE
development / notebook analysis)
¤  Re-submit the job execution
¤  Kerberos ticket expires
¨  Fetching 600M rows can take
~1 hour in a 5-node cluster
Dozens of iterations per day => spending most of the time
waiting for data to reload at each iteration!
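The cost of reloading can be made concrete with a little arithmetic. The sketch below uses only the figures from the slide (600M rows, about 1 hour, 5 nodes); the number of context restarts per day is a hypothetical input, and `ReloadCost` is an illustrative name.

```scala
object ReloadCost {
  val rows: Long = 600000000L      // rows fetched, from the slide
  val reloadSeconds: Long = 3600L  // roughly one hour, from the slide
  val nodes: Int = 5               // cluster size, from the slide

  /** Aggregate fetch throughput implied by the slide's figures (integer division). */
  def rowsPerSecond: Long = rows / reloadSeconds

  /** The same throughput divided evenly across the nodes. */
  def rowsPerSecondPerNode: Long = rowsPerSecond / nodes

  /** Hours spent waiting per day for a given (hypothetical) number of restarts. */
  def hoursWaiting(restartsPerDay: Int): Double =
    restartsPerDay * reloadSeconds / 3600.0

  def main(args: Array[String]): Unit = {
    println(s"Throughput: $rowsPerSecond rows/s, $rowsPerSecondPerNode rows/s per node")
    println(s"Waiting with 4 restarts per day: ${hoursWaiting(4)} hours")
  }
}
```

Under these assumptions, even four restarts consume half of an eight-hour working day, which is why dozens of iterations per day are out of reach if every restart triggers a full reload.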
Distribute and make data temporarily yet persistently
available in-memory in the development cluster and
shared among multiple concurrent applications
From volatile memory to persistent
memory storage
¨  Formerly known as Tachyon
¨  In-memory distributed storage system
¨  Long-term caching of raw data and intermediate
results
¨  Spark can read/write in Alluxio seamlessly instead
of using HDFS
¨  1-tier configuration safely leaves no traces on disk
¨  Data is loaded once and available for the whole
development period to multiple applications
Alluxio as the Key Enabling Technology
1-tier configuration
¨  ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk
¨  alluxio.worker.memory.size=24GB
¨  alluxio.worker.tieredstore
¤  levels=1
¤  level0.alias=MEM
¤  level0.dirs.path=${ALLUXIO_RAM_FOLDER}
¤  level0.dirs.quota=24G
¨  We leave the under FS configuration empty
¨  Deploy without mount (no root access required)
¤  ./bin/alluxio-start.sh all NoMount
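The nested bullets above abbreviate fully qualified property names. A small sketch that expands them, assuming the sub-bullets are suffixes of the `alluxio.worker.tieredstore` prefix and that the result is meant for the Alluxio site properties file. The object name `TieredStoreConfig` is hypothetical, and the RAM folder is written out literally instead of through `${ALLUXIO_RAM_FOLDER}`.

```scala
object TieredStoreConfig {
  // Value of ALLUXIO_RAM_FOLDER from the slide, inlined here for illustration.
  val ramFolder: String = "/dev/shm/ramdisk"

  // Common prefix of the tiered-store keys listed on the slide.
  val prefix: String = "alluxio.worker.tieredstore"

  // Single memory tier: one level, aliased MEM, backed by the RAM folder.
  val settings: Seq[(String, String)] = Seq(
    "alluxio.worker.memory.size" -> "24GB",
    s"$prefix.levels"            -> "1",
    s"$prefix.level0.alias"      -> "MEM",
    s"$prefix.level0.dirs.path"  -> ramFolder,
    s"$prefix.level0.dirs.quota" -> "24G"
  )

  /** Renders the settings as key=value lines. */
  def render: String =
    settings.map { case (key, value) => s"$key=$value" }.mkString("\n")

  def main(args: Array[String]): Unit = println(render)
}
```

Running `TieredStoreConfig.main` prints one `key=value` line per setting, which is the flat form a properties file expects.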
Spark read/write APIs
¨  DataFrame
¤  dataframe.write.save("alluxio://master_ip:port/mydata/mydataframe.parquet")
¤  val dataframe: DataFrame = sqlContext.read.load("alluxio://master_ip:port/mydata/mydataframe.parquet")
¨  RDD
¤  rdd.saveAsObjectFile("alluxio://master_ip:port/mydata/myrdd.object")
¤  val rdd: RDD[MyCaseClass] = sc.objectFile[MyCaseClass]("alluxio://master_ip:port/mydata/myrdd.object")
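The "load once, reuse at every iteration" idea can be sketched with the same read/write calls. This is an illustration under assumptions, not the author's implementation: `AlluxioCache` and `getOrLoad` are hypothetical names, the Alluxio client jar is assumed to be on the Spark classpath so that the `alluxio://` scheme resolves through the Hadoop `FileSystem` API, and the existence check is deliberately naive.

```scala
import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SQLContext}

object AlluxioCache {

  /**
   * Returns the DataFrame stored at `uri`. If nothing is stored there yet,
   * evaluates `load` (for example a slow JDBC fetch), saves the result,
   * and then reads it back from the store.
   */
  def getOrLoad(sqlContext: SQLContext, uri: String)(load: => DataFrame): DataFrame = {
    val fs = FileSystem.get(new URI(uri), sqlContext.sparkContext.hadoopConfiguration)

    if (!fs.exists(new Path(uri))) {
      // Uses Spark's default data source format (Parquet unless reconfigured).
      load.write.save(uri)
    }
    sqlContext.read.load(uri)
  }
}
```

A call such as `AlluxioCache.getOrLoad(sqlContext, "alluxio://master_ip:port/mydata/mydataframe.parquet")(slowFetch())` would pay the fetch cost only on the first run, with `slowFetch` standing in for whatever produces the DataFrame. A write that fails halfway would leave a partial directory behind, so a more careful version would also verify that the write completed.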
Making the impossible possible
¨  Agile workflow combining Spark, Scala, DataFrame,
JDBC, Parquet, Kryo and Alluxio to create a
scalable, in-memory, reactive stack to explore data
directly from source and develop production-quality
machine learning pipelines
¨  Data available from day 1 and at every iteration
¤  Alluxio decreased loading time from hours to seconds
¨  Avoid complicated and time-consuming Data
Plumbing operations
Further developments
1.  Memory size limitation
¤  Add external in-memory tiers?
2.  Set-up overhead
¤  JDBC drivers, partitioning strategy and data frame from/to case
class conversion (Spark 2 aims to solve this)
3.  Shared memory resources between Spark and Alluxio
¤  Set Alluxio as OFF_HEAP memory as well and divide memory into
storage and cache
4.  In-Memory replication for read availability
¤  If an Alluxio node fails, data is lost due to the absence of an
underlying file system
5.  Would be nice if Alluxio could handle this and mount a
relational table/view in the form of data files
(csv, parquet…)
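The DataFrame from/to case class conversion mentioned in point 2 can be sketched as follows for the Spark 1.x API. The `Customer` type and its fields are hypothetical; any case class whose field names match the column names would do.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical record type; field names are assumed to match the column names.
case class Customer(id: Long, country: String, spend: Double)

object Conversions {

  /** Case classes to DataFrame: the schema is derived from the case class fields. */
  def toDataFrame(sqlContext: SQLContext, customers: RDD[Customer]): DataFrame =
    sqlContext.createDataFrame(customers)

  /** DataFrame to case classes: every field is extracted by name, by hand. */
  def toCustomers(df: DataFrame): RDD[Customer] =
    df.rdd.map { row =>
      Customer(
        row.getAs[Long]("id"),
        row.getAs[String]("country"),
        row.getAs[Double]("spend"))
    }
}
```

The manual extraction in `toCustomers` is the boilerplate that grows with every new table. The typed Dataset API in Spark 2 (`df.as[Customer]` with the session implicits in scope) is the kind of feature the slide's remark about Spark 2 appears to point to.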
Follow-up links
¨  Original article on DZone:
¤  dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
¨  Professional Data Science Manifesto:
¤  datasciencemanifesto.org
¨  Vademecum of Practical Data Science:
¤  datasciencevademecum.wordpress.com
¨  Sparkz
¤  github.com/gm-spacagna/sparkz

Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
