Alluxio Day III
Exploring Alluxio & Dask integration
2020-04-27
whoami
Peter Roelants
Machine Learning Engineering Lead
@Aspect Analytics
@PeterRoelants
Outline
1. Aspect Analytics
2. Use case: Mass Spectrometry Imaging
3. Dask
4. Alluxio
5. Data access via FUSE POSIX API
Aspect Analytics
A brief overview of Aspect Analytics.
more info at https://aspect-analytics.com/
Software company dedicated to Mass Spectrometry Imaging bioinformatics
We build software tools to support clients’ workflows (off-the-shelf and custom)
Leverage the full potential of MSI data in high-throughput settings
Beyond bioinformatics: data analysis embedded in integrated platform solution
Mass Spectrometry Imaging
What data are we working with?
Mass spectrometry
● Measures the abundance of molecular weights in a sample.
● Output is a mass spectrum:
○ Histogram of molecular weights in sample.
more info at https://aspect-analytics.com/media/blog/2020-05-30-introduction-to-mass-spectrometry-data-analysis/
Mass spectrometry imaging
Measure spatial distribution of molecular masses over a slice of tissue.
Mass spectrometry imaging workflow
Overlay tissue slice with virtual grid of "pixels".
Mass spectrometry imaging workflow
Measure mass spectrum at each "pixel".
MSI data structure: 3D tensor
500 x 500 pixels
100,000 to 1,000,000 mass bins
⇒ 100GB - 1TB per data set
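The size estimate follows from simple arithmetic. A minimal sketch, assuming each intensity value is stored as a dense 4-byte float (the slides do not state the storage format):

```python
# Back-of-the-envelope size of a dense MSI data cube.
# Assumption (not stated in the slides): 4 bytes per value (float32), no compression.
BYTES_PER_VALUE = 4


def cube_size_bytes(rows: int, cols: int, mass_bins: int) -> int:
    """Size in bytes of a dense rows x cols x mass_bins tensor."""
    return rows * cols * mass_bins * BYTES_PER_VALUE


small = cube_size_bytes(500, 500, 100_000)
large = cube_size_bytes(500, 500, 1_000_000)

print(f"{small / 1e9:.0f} GB")   # 100 GB
print(f"{large / 1e12:.0f} TB")  # 1 TB
```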
Use case
Unsupervised analysis of mass spectral images to help with biomarker discovery.
⇒ new diagnostic tests
Illustration from "Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry", Nico Verbeeck, Richard M. Caprioli, Raf Van de Plas, 2019.
Use case
more info at https://aspect-analytics.com/media/blog/2020-05-31-introduction-to-msi-data-analysis/
● Spatial localisation of biomolecules
● Region of interest analysis (images shown)
● Clinical diagnostics
Data challenges
• Interactively explore data
○ Slice and subset data without loading the full data-cube into memory.
• Distributed machine learning
○ Find patterns and extract features from multiple large data-cubes.
• Process huge data arrays
○ Parallel processing
○ Out-of-core
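To make "slice and subset without loading the full data-cube" concrete, here is a minimal sketch using numpy.memmap on a small temporary file. The tiny shape, the float32 layout and the file name are illustrative assumptions, not the storage format used in the talk.

```python
import os
import tempfile

import numpy as np

# Hypothetical, tiny stand-in for an MSI cube: (rows, cols, mass_bins).
shape = (4, 5, 16)
path = os.path.join(tempfile.mkdtemp(), "cube.f32")

# Write the cube to disk once.
cube = np.arange(np.prod(shape), dtype=np.float32).reshape(shape)
cube.tofile(path)

# Memory-map the file: data is only read when a slice is accessed.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

# Read the spectrum of one pixel and one ion image (a single mass bin).
spectrum = np.array(mapped[2, 3, :])
ion_image = np.array(mapped[:, :, 7])

print(spectrum.shape, ion_image.shape)  # (16,) (4, 5)
```

The same slicing pattern applies to a file exposed through a FUSE mount, since the mount behaves like a regular path.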
Dask
What, Why?
Why Dask
Dask is like Apache Spark in Python with support for distributed data arrays.
• Parallel processing of data array chunks.
• Integration with Python machine learning ecosystem.
• Integration with our existing Python algorithms.
more info at https://docs.dask.org/en/latest/spark.html
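A rough sketch of parallel processing of data array chunks with dask.array. The cube shape and chunk sizes are toy values chosen for illustration; real data sets are far larger.

```python
import dask.array as da
import numpy as np

# Toy cube (rows, cols, mass_bins).
cube = np.random.default_rng(0).random((8, 8, 32))

# Split the array into chunks of 4 x 4 pixels that keep the full mass axis.
x = da.from_array(cube, chunks=(4, 4, 32))
print(x.numblocks)  # (2, 2, 1)

# Building the expression is lazy; compute() processes the chunks.
mean_spectrum = x.mean(axis=(0, 1)).compute()

print(mean_spectrum.shape)  # (32,)
```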
Why Dask
• Delayed compute that can be dynamically scheduled.
• Diagnostics dashboard.
more info at https://docs.dask.org/. Figure from http://matthewrocklin.com/slides/dask-scipy-2016.html
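Delayed compute can be sketched with dask.delayed. The load and total functions are hypothetical placeholders; they return toy values instead of reading real data.

```python
from dask import delayed


@delayed
def load(name):
    # Placeholder for reading one data set; returns toy values here.
    return [len(name), 2, 3]


@delayed
def total(values):
    return sum(values)


# Nothing runs yet: these calls only build a task graph.
parts = [total(load(name)) for name in ("a", "bb", "ccc")]
result = delayed(sum)(parts)

# The scheduler executes the graph when compute() is called.
print(result.compute())  # 21
```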
Alluxio
Why Alluxio?
Why Alluxio
Data access layer
• Not Python-specific
○ Our platform user application is built on Clojure.
• Standardized access via FUSE POSIX API
○ More on this in later slides
• Distributed and tiered caching layer
○ Download once, use multiple times
○ Share between different processes and services
• Centralized access to data
○ Analytics code does not need to deal with different storage implementations.
○ Avoid keeping object store credentials on client services.
Why Alluxio
Deployable in various scenarios
• Deployable in cloud as well as on-prem
○ Through Docker & Kubernetes
• Long-lived vs short-lived deployments
○ Long-running Alluxio server for continuous data access (e.g. to provide data for a notebook server)
○ Short-term Alluxio deployment for ad-hoc computations (e.g. to run a set of analyses on a new dataset)
• Integrate in automated testing.
Dask & Alluxio
Using Alluxio as a data access layer for Dask.
● Dataset access to Dask is provided via Alluxio FUSE
● Alluxio worker only loads data that is required locally
● Alluxio worker keeps data in cache
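The "load only what is required, then keep it in cache" behaviour can be illustrated with a toy read-through cache. This is a simplified model for intuition only, not Alluxio's actual implementation; the class and the fetch function are hypothetical.

```python
class ReadThroughCache:
    """Toy read-through cache; not Alluxio's actual implementation."""

    def __init__(self, fetch):
        self._fetch = fetch    # slow loader, e.g. an object store read
        self._blocks = {}      # block id -> cached bytes
        self.remote_reads = 0

    def read(self, block_id):
        # Only fetch a block the first time it is requested.
        if block_id not in self._blocks:
            self.remote_reads += 1
            self._blocks[block_id] = self._fetch(block_id)
        return self._blocks[block_id]


def fetch_from_store(block_id):
    # Hypothetical stand-in for a remote read.
    return f"block-{block_id}".encode()


cache = ReadThroughCache(fetch_from_store)
for block_id in (0, 1, 0, 0, 1):
    cache.read(block_id)

print(cache.remote_reads)  # 2
```

Five reads trigger only two remote fetches, and blocks that are never requested are never loaded.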
Alluxio FUSE
Data access via a POSIX API.
FUSE
Filesystem in Userspace (FUSE)
Filesystem:
● Expose virtual files
● POSIX filesystem API
Userspace:
● Refers to all code that runs outside the operating system's kernel.
FUSE makes it possible to create filesystems without modifying OS kernel code.
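Because a FUSE mount exposes a POSIX filesystem API, client code can use ordinary file calls. A minimal sketch, in which a temporary directory stands in for the mount point so the example runs anywhere; the path and file name are hypothetical.

```python
import os
import tempfile

# Hypothetical mount point; a temporary directory stands in for the
# Alluxio-FUSE mount in this sketch.
mount_point = tempfile.mkdtemp()
data_path = os.path.join(mount_point, "dataset.bin")

with open(data_path, "wb") as f:
    f.write(bytes(range(100)))

# Plain POSIX-style calls: open, seek, read. No storage-specific client.
with open(data_path, "rb") as f:
    f.seek(40)
    chunk = f.read(10)

print(list(chunk))  # the ten byte values 40 through 49
```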
Share Alluxio-FUSE via Bind-Mount
● Each service has its own Docker environment.
● FUSE filesystem connects Alluxio with the Analytics platform via a bind-mount.
Some anecdotal results
• We have custom Alluxio containers to reduce image size.
• It takes 30 s to 1 min to spin up the Alluxio services with FUSE.
• Dask reading from S3 through FUSE (without caching):
○ 30% slower compared to the native Dask S3 integration.
• Reading large files with Dask from local Alluxio cache:
○ 10x speedup compared to reading from S3 each time.
• Enabling FUSE kernel caching gave another 3x speedup when reading.
more info at https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html#:~:text=Tuning%20mount%20options
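The reported factors can be combined into relative read times. This sketch rests on two assumptions that the slides leave open: "30% slower" is read as 1.3x the time, and the 10x speedup is taken relative to the native S3 read.

```python
# Relative read times, normalised to a native Dask S3 read = 1.0.
# Assumptions: "30% slower" means 1.3x the time, and the 10x speedup is
# measured against the same native S3 baseline.
native_s3 = 1.0
fuse_uncached = native_s3 * 1.3
alluxio_cached = native_s3 / 10
kernel_cached = alluxio_cached / 3

combined_speedup = native_s3 / kernel_cached
print(round(combined_speedup))  # 30
```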
Alluxio-FUSE as a data access layer for Dask
