This document summarizes Daniel Galvez's presentation on creating The People's Speech Dataset using Apache Spark and TPUs. The key points are:
1) The dataset aims to provide 86,000 hours of speech data with forced alignments between audio and transcripts, in order to be challenging, free to use, and commercially licensed.
2) The conceptual workload is to take hour-long audio files, split them into ~15-second segments, and use a pretrained speech recognition model to discover when each word in the transcript was said.
3) Creating the dataset encountered limitations with accelerator-aware scheduling in Spark, memory issues with PySpark UDFs, crashes in TPUs, and the need to efficiently join data that had been reordered by bucketing by sequence length.
3. Agenda
▪ What is MLCommons?
▪ What is The People’s Speech Dataset?
▪ The Workload to Create the Dataset
▪ Limitations of Accelerator-aware Scheduling
▪ PySpark UDFs Gotchas
▪ TPU Gotchas
▪ Efficient joins on data reordered by bucketing by sequence length.
4. What is MLCommons?
• Deep Learning Benchmarking Organization
• Originally known as MLPerf
• See “MLCommons: Better ML for Everyone” by David Kanter, Executive Director, on Thursday at 4:25 PM
• Expanding into:
• (1) Machine Learning Best Practices
• (2) Dataset Development
6. Motivation for The People’s Speech Dataset
• For widespread adoption, datasets need:
• To be challenging
• To be free as in beer
• To have a commercial-use license
Provided by Vijay Janapa Reddi
https://www.sigarch.org/data-engineering-for-everyone/
• Historically, the majority of datasets used in tech companies’ machine learning papers are internal datasets, not publicly available ones.
8. The Conceptual Workload
• Given audio and transcripts, we must discover when each word in the transcript was said.
• Known as “forced alignment” or “segmentation”.
• We must split hour-long audio files into segments of ~15 seconds of audio.
• Time segments >1 minute typically use too much memory at training time.
• Uses a pre-trained speech recognition model.
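The splitting step above can be sketched in plain Python. The ~15-second target is from the slides; the simple contiguous, non-overlapping chunking policy is an assumption for illustration:

```python
def split_into_segments(duration_s: float, segment_s: float = 15.0):
    """Split an audio file of `duration_s` seconds into contiguous
    segments of at most `segment_s` seconds (the ~15 s target above)."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

# An hour-long file becomes 240 fifteen-second segments.
hour = split_into_segments(3600.0)
```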
9. The Conceptual Workload (2)
SELECT FORCE_ALIGN(
  ASR_NEURAL_NET(DECODE_MP3(A.FILE)),
  NORMALIZE_TEXT(T.FILE)
)
FROM AUDIO A INNER JOIN TRANSCRIPT T ON IDENTIFIER
• On CPUs, this runs at ~0.5x real time. For 86,000 hours, that is ~20 CPU-years.
• ASR_NEURAL_NET takes 99% of the pipeline’s runtime.
• This is the fundamental motivation for this talk’s topics.
11. Accelerator-Aware Scheduling Limitations
• Cloud TPU is a network service, which precludes its support in accelerator-aware scheduling.
• Accelerator-aware scheduling typically assigns one accelerator to each executor/task.
• But the CPU-dependent parts of the workload usually require many more executors than you have accelerators.
• Therefore, we use multiple jobs, writing to disk in between.
• Conclusion:
• Good for data-parallel training on existing Spark clusters.
• Good for integration with NVIDIA RAPIDS.
• Bad for heterogeneous inference workloads with UDFs.
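The multi-job pattern above can be sketched with the standard library alone. The real pipeline materializes Parquet between Spark jobs; the JSON-lines intermediate and the two stage functions here are illustrative stand-ins:

```python
import json
import os
import tempfile

def cpu_stage(records, path):
    # Job 1: CPU-heavy work (e.g. MP3 decoding) runs on many executors
    # and is materialized to disk rather than handed to accelerators.
    with open(path, "w") as f:
        for r in records:
            r = dict(r, decoded=True)  # stand-in for DECODE_MP3
            f.write(json.dumps(r) + "\n")

def accelerator_stage(path):
    # Job 2: a small number of accelerator-bound workers read the
    # materialized intermediate and run inference on it.
    with open(path) as f:
        return [dict(json.loads(line), logits="...") for line in f]

path = os.path.join(tempfile.mkdtemp(), "intermediate.jsonl")
cpu_stage([{"id": 1}, {"id": 2}], path)
results = accelerator_stage(path)
```

Splitting the work this way decouples the executor count of the CPU phase from the accelerator count, at the cost of an extra round trip through storage.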
13. PySpark Arrow UDF Gotchas
▪ Implication: memory usage is doubled.
▪ The JVM GC does not return physical memory to the OS.
▪ Adding swap space prevents OOMs.
▪ Don’t set spark.executor.memory to fill the entire physical memory.
▪ Otherwise the JVM will hog all physical memory, forcing the PySpark UDF into swap.
▪ Minimize allocations in your Python UDF.
▪ Since Java cannot handle byte arrays larger than 2 GB and some MP3 files are almost 2 GB in size, we must set spark.sql.execution.arrow.maxRecordsPerBatch=1.
[Diagram: Reality vs. Ideal — in reality, data is serialized by the JVM executor and deserialized in worker.py (and back again on the return path); ideally, the JVM executor and worker.py would use shared memory.]
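Concretely, the gotchas above translate into configuration along these lines (a spark-defaults.conf sketch; the executor-memory value is illustrative headroom, not the talk's actual configuration):

```
# Leave headroom for worker.py; do not give the JVM all physical memory.
spark.executor.memory                          24g
# One record per Arrow batch, since a single MP3 can approach
# the JVM's 2 GB byte-array limit.
spark.sql.execution.arrow.maxRecordsPerBatch   1
```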
15. TPU Gotchas
• Used a TPUv3-8 Pod.
• Used Google’s lingvo codebase, but had to make several modifications in a custom fork.
• Link at the end of the slides.
• Used a 4-layer, 1024-hidden-unit LSTM network trained with CTC for inference.
• Requires using Google Cloud Storage as your file system.
• Cloud TPUs are prone to crash, with a mean time between failures measured in hours.
• You need to write your own “restartability” logic.
• Not a TPU-specific problem: all “spot instances” require software redundancy.
• TPU code can’t use the tf.string data type. Must use integer primary keys for the “keyed prediction” machine-learning design pattern.
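Since tf.string is unavailable on TPUs, one way to apply keyed prediction is to map each utterance ID to an integer before inference and keep the reverse mapping on the host. A hedged sketch (the file names and helper are made up for illustration):

```python
def assign_integer_keys(utterance_ids):
    """Map string utterance IDs to dense int64 keys for the TPU,
    plus the reverse table used to recover strings after inference."""
    to_int = {uid: i for i, uid in enumerate(utterance_ids)}
    to_str = {i: uid for uid, i in to_int.items()}
    return to_int, to_str

to_int, to_str = assign_integer_keys(["audio_a.mp3", "audio_b.mp3"])
# The TPU sees only integer keys; outputs come back keyed by int,
# and the host maps them back to the original string IDs.
recovered = to_str[to_int["audio_b.mp3"]]
```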
17. TPU Gotchas (2)
• We used the “keyed prediction” design pattern to join acoustic model output against the original transcript.
• Records are sorted by key on input to the acoustic model.
• They are no longer sorted on output.
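Because the acoustic model's outputs come back in arbitrary order, they are re-associated with transcripts by key rather than by position. A plain-Python sketch of the join (field names are illustrative):

```python
def join_by_key(model_outputs, transcripts):
    """Join unordered (key, logits) model outputs against transcripts
    by key, mimicking the INNER JOIN ... ON IDENTIFIER in the pipeline."""
    by_key = {t["key"]: t["text"] for t in transcripts}
    return [
        {"key": o["key"], "logits": o["logits"], "text": by_key[o["key"]]}
        for o in model_outputs
        if o["key"] in by_key
    ]

outputs = [{"key": 2, "logits": "..."}, {"key": 1, "logits": "..."}]  # unsorted
transcripts = [{"key": 1, "text": "hello"}, {"key": 2, "text": "world"}]
joined = join_by_key(outputs, transcripts)
```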
18. Bucketing by sequence length
Necessary to fully utilize modern accelerators.
tf.data.experimental.bucket_by_sequence_length
[Diagram: records A1, A2, B1, B2, B3, C1, C2, D1 regrouped into buckets of similar sequence length.]
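A minimal pure-Python sketch of what tf.data.experimental.bucket_by_sequence_length does: records are routed into buckets by length so that each batch pads to a similar length. The boundaries and batch size here are illustrative, not the talk's values:

```python
from collections import defaultdict

def bucket_by_length(records, boundaries, batch_size):
    """Group (id, length) records into buckets delimited by
    `boundaries`, then emit fixed-size batches per bucket."""
    buckets = defaultdict(list)
    for rec_id, length in records:
        # The number of boundaries the length exceeds picks the bucket.
        idx = sum(1 for b in boundaries if length > b)
        buckets[idx].append(rec_id)
        if len(buckets[idx]) == batch_size:
            yield buckets[idx]
            buckets[idx] = []
    for batch in buckets.values():  # flush partial batches
        if batch:
            yield batch

records = [("A1", 5), ("A2", 6), ("B1", 20), ("B2", 21), ("C1", 50)]
batches = list(bucket_by_length(records, boundaries=[10, 30], batch_size=2))
```

Each emitted batch contains sequences of similar length, so padding waste per batch stays small and the accelerator stays busy.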
19. Bucketing by sequence length (2)
• A TPUv3-8 works best with a batch size of 128 × 8 = 1024.
• Sort-merge joins are expensive afterward.
• We must join the speech recognizer’s output against the ground-truth transcript.
• The speech recognizer’s output is not small: a probability distribution over 40 tokens every 30 ms. For 86,000 hours, that’s ~1.5 TiB uncompressed.
• Two solutions:
• Map-side join: join whatever you need before using the accelerator.
• Con: reduces input bandwidth to the accelerator.
• Sharding, a.k.a. partitionBy(): only need to sort each shard.
• Con: if shards are too small, efficiency can suffer.
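The sharding option can be sketched as partitioning by key first and sorting only within each shard, avoiding a global sort-merge. A plain-Python stand-in for partitionBy() plus a per-shard sort; the shard count and modulo partitioner are illustrative:

```python
def shard_then_sort(records, n_shards):
    """Partition (key, value) records by key modulo n_shards, then
    sort each shard locally -- no global sort is ever performed."""
    shards = [[] for _ in range(n_shards)]
    for key, value in records:
        shards[key % n_shards].append((key, value))
    return [sorted(s) for s in shards]

records = [(7, "g"), (2, "b"), (5, "e"), (0, "a"), (4, "d")]
shards = shard_then_sort(records, n_shards=2)
```

Each shard is small enough to sort cheaply, but as the slide notes, shards that are too small reduce efficiency.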
20. Conclusions
• Code is publicly available under Apache 2.0:
• https://github.com/mlcommons/peoples-speech/tree/main/galvasr2/align/spark
• The ideal for sequence-based deep-learning inference is for accelerators to act as an asynchronous queue, receiving input data until a batch is large enough to run efficiently.
• Would someone like to create a custom Spark Streaming sink?
• Contact: dt.galvez@gmail.com