Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Daniel Galvez
Engineer
MLCommons
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
Agenda
▪ What is MLCommons?
▪ What is The People’s Speech Dataset?
▪ The Workload to Create the Dataset
▪ Limitations of Accelerator-aware Scheduling
▪ PySpark UDFs Gotchas
▪ TPU Gotchas
▪ Efficient joins on data reordered by bucketing by sequence length
What is MLCommons?
• Deep Learning Benchmarking Organization
• Originally known as MLPerf
• See “MLCommons: Better ML for Everyone” by David Kanter, Executive Director, on Thursday, at 4:25PM
• Expanding into:
  • (1) Machine Learning Best Practices
  • (2) Dataset Development
Motivation for The People’s Speech Dataset
• For widespread adoption, datasets need:
  • To be challenging
  • To be free as in beer
  • To have a commercial use license
• Historically, the majority of datasets used in tech companies’ machine learning papers are not internal datasets.
[Figure provided by Vijay Janapa Reddi: https://www.sigarch.org/data-engineering-for-everyone/]
The Conceptual Workload
• Given audio and transcripts, we must discover when each word in the transcript was said.
• Known as “forced alignment” or “segmentation”.
• We must split hour-long audio files into segments of ~15 seconds of audio.
• Segments longer than 1 minute typically use too much memory at training time.
• Uses a pre-trained speech recognition model.
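Once word timings are known, the splitting step can be sketched as below. This is a minimal illustration only, not the code used for the dataset: the `(word, start, end)` tuple layout, the function name `split_into_segments`, and the greedy grouping rule are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical layout for one aligned word: (word, start_seconds, end_seconds).
Word = Tuple[str, float, float]


def split_into_segments(words: List[Word], max_seconds: float = 15.0) -> List[List[Word]]:
    """Greedily group consecutive aligned words into segments of at most max_seconds."""
    segments: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current segment if adding this word would exceed the limit.
        if current and word[2] - current[0][1] > max_seconds:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


aligned = [
    ("hello", 0.0, 0.4),
    ("world", 0.5, 0.9),
    ("again", 14.8, 15.3),
    ("later", 29.0, 29.5),
]
segments = split_into_segments(aligned)
```

With these toy timings, "again" ends past the 15-second mark measured from "hello", so it starts a second segment.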
The Conceptual Workload (2)
SELECT FORCE_ALIGN(
    ASR_NEURAL_NET(DECODE_MP3(A.FILE)),
    NORMALIZE_TEXT(T.FILE)
)
FROM AUDIO A INNER JOIN TRANSCRIPT T ON A.IDENTIFIER = T.IDENTIFIER
• On CPUs, this runs in ~0.5x real time. For 86,000 hours, that is about 20 CPU-years.
• ASR_NEURAL_NET accounts for 99% of the pipeline’s runtime.
• Fundamental motivation for this talk’s topics.
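The CPU-years figure can be checked with a few lines of arithmetic. The sketch assumes that "0.5x real time" means one CPU processes half an hour of audio per hour of wall-clock time; the variable names are illustrative.

```python
AUDIO_HOURS = 86_000
REAL_TIME_FACTOR = 0.5  # assumed meaning: hours of audio processed per CPU-hour

cpu_hours = AUDIO_HOURS / REAL_TIME_FACTOR
cpu_years = cpu_hours / (24 * 365)
print(f"{cpu_hours:,.0f} CPU-hours, about {cpu_years:.1f} CPU-years")
# prints "172,000 CPU-hours, about 19.6 CPU-years"
```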
Accelerator-Aware Scheduling Limitations
• Cloud TPU is a network service, which precludes support in accelerator-aware scheduling.
• Accelerator-aware scheduling typically assigns one accelerator to each executor/task.
• But CPU-dependent parts of the workload usually require many more executors than you have accelerators.
• Therefore, we use multiple jobs, writing to disk in-between.
• Conclusion:
  • Good for data parallel training on existing Spark clusters.
  • Good for integration with NVIDIA RAPIDS.
  • Bad for heterogeneous inference workloads with UDFs.
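The "multiple jobs, writing to disk in-between" pattern can be sketched in plain Python. Everything here is a stand-in: the three function names, the JSON files, and the arithmetic placeholders for decoding and inference are hypothetical, and a real pipeline would run each job as a separate Spark application sized for its own resource needs.

```python
import json
import tempfile
from pathlib import Path


def cpu_decode_job(audio_ids, out_dir: Path) -> None:
    """Job 1 (many CPU workers): decode audio and write features to disk."""
    for audio_id in audio_ids:
        features = [float(audio_id)] * 3  # placeholder for decoded audio
        (out_dir / f"{audio_id}.json").write_text(json.dumps(features))


def accelerator_job(in_dir: Path, out_dir: Path) -> None:
    """Job 2 (few accelerator workers): read features, write model outputs."""
    for path in sorted(in_dir.glob("*.json")):
        features = json.loads(path.read_text())
        logits = [2.0 * x for x in features]  # placeholder for the neural network
        (out_dir / path.name).write_text(json.dumps(logits))


def cpu_align_job(in_dir: Path) -> dict:
    """Job 3 (many CPU workers): read model outputs and post-process them."""
    return {p.stem: sum(json.loads(p.read_text())) for p in sorted(in_dir.glob("*.json"))}


with tempfile.TemporaryDirectory() as tmp:
    decoded, scored = Path(tmp, "decoded"), Path(tmp, "scored")
    decoded.mkdir()
    scored.mkdir()
    cpu_decode_job([1, 2], decoded)
    accelerator_job(decoded, scored)
    results = cpu_align_job(scored)
```

Because each job only communicates through files, the CPU-heavy jobs and the accelerator job can use different numbers of workers.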
PySpark Arrow UDF Gotchas
[Figure: “Reality” vs. “Ideal”. In reality, the JVM executor and worker.py each serialize and deserialize the data they exchange; ideally, they would use shared memory.]
▪ Implication: Memory usage doubled.
▪ JVM GC does not return physical memory back to OS.
▪ Adding swap space prevents OOMs.
▪ Don’t set spark.executor.memory to fill entire physical memory.
▪ JVM will hog all physical memory, causing pyspark UDF to use swap disk memory.
▪ Minimize allocations in your python UDF.
▪ Since Java cannot handle byte arrays larger than 2GB and some MP3 files are almost 2GB in size, we must set spark.sql.execution.arrow.maxRecordsPerBatch=1
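The two settings named above could be assembled as in the sketch below. The helper names, the assumption of one executor per machine, and the 50% share left for the Python worker are illustrative choices, not values from the talk; only the two configuration keys come from the slide.

```python
def executor_memory_gib(physical_gib: int, python_fraction: float = 0.5) -> int:
    """JVM heap size that leaves a share of physical memory for the Python worker.

    python_fraction is an assumed value; tune it for your own UDFs.
    """
    return int(physical_gib * (1.0 - python_fraction))


def spark_submit_flags(physical_gib: int) -> list:
    """Render the settings as spark-submit --conf flags (one executor per machine assumed)."""
    conf = {
        "spark.executor.memory": f"{executor_memory_gib(physical_gib)}g",
        # One record per Arrow batch, so a single large MP3 never shares a batch.
        "spark.sql.execution.arrow.maxRecordsPerBatch": "1",
    }
    return [f"--conf {key}={value}" for key, value in conf.items()]


flags = spark_submit_flags(physical_gib=64)
```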
TPU Gotchas
• Used a TPUv3-8 Pod.
• Used Google’s lingvo codebase, but had to make several modifications in a custom fork.
• Link at end of slides.
• Used a 4-layer 1024 hidden unit LSTM network trained with CTC for inference.
• Requires usage of Google Cloud Storage as your file system.
• Cloud TPUs are prone to crash with a mean time between failures measured in hours.
• Need to write your own “restartability” logic.
• Not a TPU specific problem: All “Spot instances” require software redundancy.
• TPU code can’t use tf.string data type. Must use integer primary keys for “keyed prediction” machine learning design pattern.
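One possible shape for "restartability" logic is to record finished shards on durable storage and skip them after a crash. The sketch below is an assumption about how such logic might look, not the talk's implementation; `run_with_restarts`, `flaky_process`, and the JSON progress file are hypothetical, and the crash is simulated with a `RuntimeError`.

```python
import json
import tempfile
from pathlib import Path


def run_with_restarts(shards, process, done_file: Path, max_attempts: int = 5):
    """Process every shard once, recording finished shards so a restart skips them."""
    done = set(json.loads(done_file.read_text())) if done_file.exists() else set()
    for _ in range(max_attempts):
        try:
            for shard in shards:
                if shard in done:
                    continue
                process(shard)
                done.add(shard)
                # Persist progress after every shard so a crash loses little work.
                done_file.write_text(json.dumps(sorted(done)))
            return done
        except RuntimeError:
            continue  # simulated accelerator crash: retry the remaining shards
    raise RuntimeError("gave up after repeated failures")


calls = []


def flaky_process(shard):
    """Hypothetical stand-in for TPU inference that crashes once on shard 'b'."""
    calls.append(shard)
    if shard == "b" and calls.count("b") == 1:
        raise RuntimeError("simulated TPU crash")


with tempfile.TemporaryDirectory() as tmp:
    finished = run_with_restarts(["a", "b", "c"], flaky_process, Path(tmp, "done.json"))
```

In this toy run, shard "a" is processed only once even though the first attempt fails on "b".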
TPU Gotchas
• We used the “keyed prediction” design pattern to join the acoustic model output against the original transcript.
• Records are sorted by key on input to the acoustic model.
• They are no longer sorted on output.
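A toy version of keyed prediction with integer keys is sketched below. The file names, the `fake_acoustic_model` function, and the string placeholders for model output are all hypothetical; the point is only that the integer key travels with each record, so the outputs can be matched to transcripts even when they come back in a different order.

```python
import random

# Hypothetical transcripts keyed by string identifier.
transcripts = {
    "talk_a.mp3": "hello world",
    "talk_b.mp3": "good morning",
    "talk_c.mp3": "thank you",
}

# Assign a stable integer key to each string identifier (sorted for determinism).
key_of = {name: key for key, name in enumerate(sorted(transcripts))}
name_of = {key: name for name, key in key_of.items()}


def fake_acoustic_model(keyed_inputs):
    """Hypothetical model: passes the integer key through, returns outputs in shuffled order."""
    outputs = [(key, f"logits-for-{key}") for key, _audio in keyed_inputs]
    random.Random(0).shuffle(outputs)
    return outputs


keyed_inputs = [(key_of[name], name) for name in sorted(transcripts)]

# Join model output back to the transcript using the integer key.
joined = {
    name_of[key]: (logits, transcripts[name_of[key]])
    for key, logits in fake_acoustic_model(keyed_inputs)
}
```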
Bucketing by sequence length
Necessary to utilize modern accelerators fully
tf.data.experimental.bucket_by_sequence_length
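To show why bucketing reorders records, here is a plain-Python sketch of the idea. It is not the TensorFlow implementation named above: the function `bucket_by_length`, the record names, and the lengths are made up for illustration.

```python
from collections import defaultdict


def bucket_by_length(records, boundaries, batch_size):
    """Yield batches whose records all fall in the same length bucket.

    records: iterable of (key, length); boundaries: sorted bucket lower bounds.
    """
    buckets = defaultdict(list)
    for key, length in records:
        bucket = sum(length >= b for b in boundaries)  # index of the bucket for this length
        buckets[bucket].append(key)
        if len(buckets[bucket]) == batch_size:
            yield buckets.pop(bucket)
    for bucket in sorted(buckets):  # flush partially filled buckets at the end
        yield buckets[bucket]


# Hypothetical records as (key, sequence length), in sorted key order.
records = [("A1", 3), ("A2", 12), ("B1", 4), ("B2", 11), ("B3", 5), ("C1", 13)]
batches = list(bucket_by_length(records, boundaries=[10], batch_size=2))
```

The batches come out as A1/B1, A2/B2, B3, C1: similar lengths are grouped together, which reduces padding, but the original key order is lost.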
[Figure: two columns listing records A1, A2, B1, B2, B3, C1, C2, D1]
Bucketing by sequence length
• TPUv3-8 works best with batch size of 128 * 8 = 1024.
• Sort-Merge joins are expensive afterward.
• We must join speech recognizer output against ground truth transcript.
• Speech recognizer output is not small! Probability distribution of 40 tokens per 30ms. For 86,000 hours, that’s 1.5 TiB uncompressed.
• Two Solutions:
  • Map Side join: Join whatever you need before using accelerator.
    • Con: Reduces input bandwidth to accelerator.
  • Sharding, aka partitionBy(): Only need to sort each shard.
    • Con: If shards are too small, can reduce efficiency.
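The sharding option can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's code: keys are unique integers, shards are assigned by `key % num_shards`, and the names `shard_of`, `merge_join`, and `sharded_join` are hypothetical.

```python
from collections import defaultdict


def shard_of(key: int, num_shards: int) -> int:
    """Assumed sharding rule: both sides of the join must use the same one."""
    return key % num_shards


def merge_join(left_sorted, right_sorted):
    """Sort-merge join of two key-sorted lists of (key, value); keys assumed unique."""
    i = j = 0
    out = []
    while i < len(left_sorted) and j < len(right_sorted):
        left_key, right_key = left_sorted[i][0], right_sorted[j][0]
        if left_key == right_key:
            out.append((left_key, left_sorted[i][1], right_sorted[j][1]))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1
        else:
            j += 1
    return out


def sharded_join(model_outputs, transcripts, num_shards):
    """Join by sorting and merging one shard at a time instead of the whole dataset."""
    left, right = defaultdict(list), defaultdict(list)
    for key, value in model_outputs:
        left[shard_of(key, num_shards)].append((key, value))
    for key, value in transcripts:
        right[shard_of(key, num_shards)].append((key, value))
    joined = []
    for shard in range(num_shards):
        joined.extend(merge_join(sorted(left[shard]), sorted(right[shard])))
    return joined


model_outputs = [(3, "p3"), (0, "p0"), (2, "p2"), (1, "p1")]  # unsorted after bucketing
transcripts = [(0, "t0"), (1, "t1"), (2, "t2"), (3, "t3")]
joined = sharded_join(model_outputs, transcripts, num_shards=2)
```

Each sort touches only one shard's records, which is the saving the slide refers to; the trade-off is that very small shards leave the accelerator with too little data per shard.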
Conclusions
• Code is publicly available under Apache 2.0:
  • https://github.com/mlcommons/peoples-speech/tree/main/galvasr2/align/spark
• The ideal for sequence-based deep learning inference is for accelerators to act as an asynchronous queue, receiving input data until a batch is large enough to run efficiently.
• Would someone like to create a custom Spark Streaming sink?
• Contact: dt.galvez@gmail.com
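The queue idea from the conclusions can be sketched as a small batching buffer. This version is synchronous for brevity and is only an assumption about the shape such a component might take; the class name `BatchingQueue` and the `run_batch` callback are hypothetical, and a real version would receive records asynchronously (for example from a streaming sink) and would likely also flush on a timeout.

```python
class BatchingQueue:
    """Accumulate records until a full batch is available, then hand it to run_batch."""

    def __init__(self, batch_size, run_batch):
        self.batch_size = batch_size
        self.run_batch = run_batch  # callback standing in for accelerator inference
        self.pending = []

    def put(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Run whatever is pending, even if the batch is not full."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.run_batch(batch)


batches_run = []
batcher = BatchingQueue(batch_size=3, run_batch=batches_run.append)
for record in range(7):
    batcher.put(record)
batcher.flush()  # drain the final partial batch
```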

Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...gragchanchal546
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 

Último (20)

Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 

Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs

  • 1. Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs. Daniel Galvez, Engineer, MLCommons
  • 2. Feedback: Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 3. Agenda
      ▪ What is MLCommons?
      ▪ What is The People’s Speech Dataset?
      ▪ The Workload to Create the Dataset
      ▪ Limitations of Accelerator-aware Scheduling
      ▪ PySpark UDFs Gotchas
      ▪ TPU Gotchas
      ▪ Efficient joins on data reordered by bucketing by sequence length
  • 4. What is MLCommons?
      • Deep Learning Benchmarking Organization
      • Originally known as MLPerf
      • See “MLCommons: Better ML for Everyone” by David Kanter, Executive Director, on Thursday, at 4:25PM
      • Expanding into:
          • (1) Machine Learning Best Practices
          • (2) Dataset Development
  • 6. Motivation for The People’s Speech Dataset
      • For widespread adoption, datasets need:
          • To be challenging
          • To be free as in beer
          • To have a commercial use license
      • Historically, the majority of tech companies’ machine learning papers do not use internal datasets.
      • Provided by Vijay Janapa Reddi: https://www.sigarch.org/data-engineering-for-everyone/
  • 8. The Conceptual Workload
      • Given audio and transcripts, we must discover when each word in the transcript was said.
      • This is known as “forced alignment” or “segmentation”.
      • We must split hour-long audio files into segments of ~15 seconds of audio.
      • Time segments >1 minute typically use too much memory at training time.
      • The alignment uses a pre-trained speech recognition model.
  • 9. The Conceptual Workload (2)
      SELECT FORCE_ALIGN(
        ASR_NEURAL_NET(DECODE_MP3(A.FILE)),
        NORMALIZE_TEXT(T.FILE)
      )
      FROM AUDIO A INNER JOIN TRANSCRIPT T ON IDENTIFIER
      • On CPUs, this runs at ~0.5x real time. For 86,000 hours, that is about 20 CPU-years.
      • ASR_NEURAL_NET takes 99% of the pipeline’s runtime.
      • This is the fundamental motivation for this talk’s topics.
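The 20 CPU-year figure can be checked with a few lines of arithmetic. The sketch below assumes that "0.5x real time" means one CPU processes half an hour of audio per wall-clock hour; the function name `cpu_years` is made up for this illustration and does not come from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def cpu_years(audio_hours: float, realtime_factor: float) -> float:
    """CPU-years needed to process `audio_hours` of audio on one CPU.

    `realtime_factor` is hours of audio processed per wall-clock hour
    (an assumed reading of the slide's "~0.5x real time").
    """
    cpu_hours = audio_hours / realtime_factor
    return cpu_hours / HOURS_PER_YEAR


print(round(cpu_years(86_000, 0.5), 1))  # 19.6, i.e. roughly 20 CPU-years
```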
  • 11. Accelerator-Aware Scheduling Limitations
      • Cloud TPU is a network service, which precludes supporting it in accelerator-aware scheduling.
      • Accelerator-aware scheduling typically assigns one accelerator to each executor/task.
      • But CPU-dependent parts of the workload usually require many more executors than you have accelerators.
      • Therefore, we use multiple jobs, writing to disk in between.
      • Conclusion:
          • Good for data-parallel training on existing Spark clusters.
          • Good for integration with NVIDIA RAPIDS.
          • Bad for heterogeneous inference workloads with UDFs.
  • 13. PySpark Arrow UDF Gotchas
      [Diagram: Reality: the JVM executor and worker.py exchange data by serializing and deserializing in both directions. Ideal: the JVM executor and worker.py use shared memory.]
      ▪ Implication: memory usage is doubled.
      ▪ JVM GC does not return physical memory to the OS.
      ▪ Adding swap space prevents OOMs.
      ▪ Don’t set spark.executor.memory to fill the entire physical memory.
      ▪ Otherwise the JVM will hog all physical memory, forcing the PySpark UDF to use swap space on disk.
      ▪ Minimize allocations in your Python UDF.
      ▪ Since Java cannot handle byte arrays larger than 2GB and some MP3 files are almost 2GB in size, we must set spark.sql.execution.arrow.maxRecordsPerBatch=1
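The memory advice on this slide can be written down as a small helper that produces Spark settings. This is only a sketch: the helper name `executor_settings` and the default of giving the JVM half of physical memory are assumptions for illustration, not values from the talk. The two configuration keys are the ones named on the slide.

```python
def executor_settings(physical_memory_gib: int, jvm_fraction: float = 0.5) -> dict:
    """Build Spark settings that leave room for the Python UDF worker.

    `jvm_fraction` is a hypothetical knob: the share of physical memory given
    to the JVM heap. The rest stays free for worker.py and Arrow buffers.
    """
    if not 0 < jvm_fraction < 1:
        raise ValueError("jvm_fraction must be strictly between 0 and 1")
    heap_gib = max(1, int(physical_memory_gib * jvm_fraction))
    return {
        # Do not let the JVM heap fill the entire physical memory.
        "spark.executor.memory": f"{heap_gib}g",
        # One record per Arrow batch, so a batch holds a single (large) MP3 file.
        "spark.sql.execution.arrow.maxRecordsPerBatch": "1",
    }


settings = executor_settings(64)
print(settings["spark.executor.memory"])  # 32g
```

Each key and value could then be passed to `SparkSession.builder.config(key, value)` when the session is created.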
  • 15. TPU Gotchas
      • Used a TPUv3-8 Pod.
      • Used Google’s lingvo codebase, but had to make several modifications in a custom fork.
      • Link at end of slides.
      • Used a 4-layer, 1024-hidden-unit LSTM network trained with CTC for inference.
      • Requires using Google Cloud Storage as your file system.
      • Cloud TPUs are prone to crash, with a mean time between failures measured in hours.
      • You need to write your own “restartability” logic.
      • This is not a TPU-specific problem: all “Spot instances” require software redundancy.
      • TPU code can’t use the tf.string data type. You must use integer primary keys for the “keyed prediction” machine learning design pattern.
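The integer-key requirement can be illustrated without any TPU code. In the sketch below, string identifiers are mapped to integer keys before inference and mapped back afterward; `build_key_maps` and `fake_model` are hypothetical names, and `fake_model` is a stand-in for the real acoustic model, not the lingvo code used in the talk.

```python
def build_key_maps(identifiers):
    """Assign a stable integer key to each distinct string identifier."""
    unique_ids = sorted(set(identifiers))
    id_to_key = {ident: key for key, ident in enumerate(unique_ids)}
    key_to_id = dict(enumerate(unique_ids))
    return id_to_key, key_to_id


def fake_model(batch):
    """Stand-in for the accelerator: passes each integer key through untouched
    next to its prediction (here, just the number of input features)."""
    return [(key, len(features)) for key, features in batch]


id_to_key, key_to_id = build_key_maps(["talk_b.mp3", "talk_a.mp3"])
batch = [
    (id_to_key["talk_b.mp3"], [0.1, 0.2]),
    (id_to_key["talk_a.mp3"], [0.3]),
]
# Map integer keys back to the original string identifiers after inference.
predictions = {key_to_id[key]: pred for key, pred in fake_model(batch)}
print(predictions)  # {'talk_b.mp3': 2, 'talk_a.mp3': 1}
```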
  • 17. TPU Gotchas
      • We used the “keyed prediction” design pattern to join acoustic model output against the original transcript.
      • Records are sorted by key on input to the acoustic model.
      • They are no longer sorted on output.
  • 18. Bucketing by sequence length
      • Necessary to utilize modern accelerators fully
      • tf.data.experimental.bucket_by_sequence_length
      [Diagram: records A1, A2, B1, B2, B3, C1, C2, D1]
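A plain-Python sketch shows why bucketing by sequence length destroys key order. It is not the implementation of tf.data.experimental.bucket_by_sequence_length; the function `bucket_by_length` and the sample records are invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_by_length(records, boundaries, batch_size):
    """Group (key, length) records into batches of similar length.

    `boundaries` are sorted length thresholds. A batch is emitted as soon as
    its bucket holds `batch_size` keys; leftovers are emitted at the end.
    """
    buckets = defaultdict(list)
    for key, length in records:
        bucket = bisect_right(boundaries, length)
        buckets[bucket].append(key)
        if len(buckets[bucket]) == batch_size:
            yield buckets.pop(bucket)
    for bucket in sorted(buckets):
        if buckets[bucket]:
            yield buckets[bucket]


# Keys arrive sorted (0..4), with short and long sequences interleaved.
records = [(0, 5), (1, 50), (2, 7), (3, 48), (4, 6)]
batches = list(bucket_by_length(records, boundaries=[10], batch_size=2))
print(batches)  # [[0, 2], [1, 3], [4]]
```

Flattening the batches gives keys in the order 0, 2, 1, 3, 4, so any later join on the key has to cope with unsorted input.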
  • 19. Bucketing by sequence length
      • TPUv3-8 works best with a batch size of 128 * 8 = 1024.
      • Sort-merge joins are expensive afterward.
      • We must join speech recognizer output against the ground truth transcript.
      • Speech recognizer output is not small! It is a probability distribution over 40 tokens every 30ms. For 86,000 hours, that’s 1.5 TiB uncompressed.
      • Two solutions:
          • Map-side join: join whatever you need before using the accelerator.
              • Con: reduces input bandwidth to the accelerator.
          • Sharding, aka partitionBy(): you only need to sort each shard.
              • Con: if shards are too small, efficiency can drop.
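The 1.5 TiB estimate can be reproduced under one assumption the slide does not state: that each probability is stored as a 4-byte float32. The function name `output_size_tib` is made up for this sketch.

```python
def output_size_tib(audio_hours: float,
                    frame_ms: float = 30,
                    tokens_per_frame: int = 40,
                    bytes_per_value: int = 4) -> float:
    """Uncompressed size of the recognizer output in TiB.

    `bytes_per_value=4` assumes float32 probabilities (an assumption, not
    something stated in the talk).
    """
    frames = audio_hours * 3600 * 1000 / frame_ms
    total_bytes = frames * tokens_per_frame * bytes_per_value
    return total_bytes / 2**40


print(round(output_size_tib(86_000), 2))  # 1.5
```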
  • 20. Conclusions
      • Code is publicly available under Apache 2.0:
          • https://github.com/mlcommons/peoples-speech/tree/main/galvasr2/align/spark
      • The ideal for sequence-based deep learning inference is for accelerators to act as an asynchronous queue, receiving input data until a batch is large enough to run efficiently.
      • Would someone like to create a custom Spark Streaming sink?
      • Contact: dt.galvez@gmail.com