Enabling Biobank-Scale Genomic Processing with Spark SQL

Karen Feng, Databricks

Agenda
• Genomics overview
– Big data problem
– Real-world applications
– Pain points at biobank scale
• Glow
– Datasources
– Built-in functions
– Extensibility

Genomics is a big data problem
• Cost to sequence a genome: from $2.7B to <$1,000
https://www.genome.gov/27541954/dna-sequencing-costs-data/
• 40,000 Petabytes / year by 2025
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195

The power of big genomic data
Accelerate Target Discovery
• Goal: identify a biological target (e.g. protein) that can be modulated with a drug
• Approach: large-scale regressions to correlate DNA variants and the trait
• Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/
Figure: Orthosteric inhibition

• Accelerate Target Discovery
• Reduce Costs via Precision Prevention
• Improve Survival with Optimized Treatment

Genomic analysis on big data is hard!
• Existing tools are often difficult to
– Scale
– Learn
– Integrate

# Split a VCF by chromosome with GATK
for i in chr2L chr2R chr3L chr3R chr4 chrX; do
  GenomeAnalysisTK -R ${ref_seq} \
    -T SelectVariants \
    -V my_flies.vcf \
    -L $i \
    -o my_flies.${i}.vcf
done

# Split a VCF by chromosome with vcftools
for i in {1..16}; do
  vcftools --vcf VCF_FILE --chr $i \
    --recode --recode-INFO-all --out VCF_$i
done

bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf

java -jar SnpSift.jar split file.vcf

838 results

• Data management: --make-bed, --recode, --output-chr, --zero-cluster, --split-x/--merge-x, --set-me-missing, --fill-missing-a2, --set-missing-var-ids, --update-map..., --update-ids..., --flip, --flip-scan, --keep-allele-order..., --indiv-sort, --write-covar..., --{,b}merge..., Merge failures, VCF reference merge, --merge-list, --write-snplist, --list-duplicate-vars
• Basic statistics: --freq{,x}, --missing, --test-mishap, --hardy, --mendel, --het/--ibc, --check-sex/--impute-sex, --fst
• Linkage disequilibrium: --indep..., --r/--r2, --show-tags, --blocks
• Distance matrices: Identity-by-state/Hamming (--distance...), Relationship/covariance (--make-grm-bin...), --rel-cutoff, Distance-pheno. analysis (--ibs-test...)
• Identity-by-descent: --genome, --homozyg...
• Population stratification: --cluster, --pca, --mds-plot, --neighbour
• Association analysis: Basic case/control (--assoc, --model), Stratified case/control (--mh, --mh2, --homog), Quantitative trait (--assoc, --gxe), Regression w/ covariates (--linear, --logistic), --dosage, --lasso, --test-missing, Monte Carlo permutation, Set-based tests, REML additive heritability
• Family-based association: --tdt, --dfam, --qfam..., --tucc
• Report postprocessing: --annotate, --clump, --gene-report, --meta-analysis
• Epistasis: --fast-epistasis, --epistasis, --twolocus
• Allelic scoring (--score)
• R plugins (--R)

1. Converting one file format to another file format.

“Give a statistical geneticist an awk line, feed him for a day; teach a statistical geneticist how to awk, feed him for a lifetime...”

Glow
• Open-source toolkit for large-scale genomic analysis
• Built on Spark for biobank scale
• Query and use built-in commands with familiar languages using Spark SQL
• Compatible with existing genomic tools and formats, as well as big data and ML tools

Genomic variant data
• Always present
– Chromosome: StringType
• Variant information: depends on metadata
– VCF header: ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
– As MapType(StringType, StringType): {"DP" -> "14", "AF" -> "0.5"}
– Header recoverable from the map: ##INFO=<ID=AF,Number=?,Type=?,Description=?>
– MapType(StringType, StringType): lose metadata and slow querying
– Dynamic schema: preserve metadata and fast querying

StructField(
  name = "INFO_AF",
  dataType = DoubleType,
  nullable = true,
  metadata = Map(
    "vcf_header_count" -> "A",
    "vcf_header_description" -> "Allele Frequency"))
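
As a rough illustration of the dynamic schema idea, the sketch below parses one ##INFO header line into a plain dict shaped like the StructField above. It is not Glow's implementation: the function name, the type mapping, and the use of a dict in place of a Spark StructField are all assumptions made for this example, and it follows the slide in mapping Type=Float to DoubleType without considering how Number affects the type.

```python
import re

# Assumed mapping from VCF header types to Spark SQL type names (illustrative only).
VCF_TO_SPARK_TYPE = {
    "Integer": "IntegerType",
    "Float": "DoubleType",
    "Flag": "BooleanType",
    "Character": "StringType",
    "String": "StringType",
}

INFO_LINE = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),\s*Number=(?P<number>[^,]+),\s*'
    r'Type=(?P<type>[^,]+),\s*Description="(?P<description>[^"]*)"'
)

def info_header_to_field(header_line):
    """Turn one ##INFO header line into a dict shaped like the StructField above."""
    match = INFO_LINE.match(header_line)
    if match is None:
        raise ValueError(f"not an INFO header line: {header_line!r}")
    return {
        "name": "INFO_" + match.group("id"),
        "dataType": VCF_TO_SPARK_TYPE[match.group("type")],
        "nullable": True,
        # The header details survive as field metadata instead of being dropped.
        "metadata": {
            "vcf_header_count": match.group("number"),
            "vcf_header_description": match.group("description"),
        },
    }

field = info_header_to_field(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
print(field)
```

Because the type and description travel with the field, a typed column can be queried directly, and the header can be rebuilt later when writing back out.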

45
Genotype information: depends on metadata

46
Genotype information: width depends on number of samples

47
Sample NA00001
Genotype 0|0
Genotype quality 48
Depth 1
Haplotype quality 51,51

48
Sample NA00001 NA0002
Genotype 0|0 0|0
Genotype quality 48 49
Depth 1 3
Haplotype quality 51,51 58,50

49
Sample NA00001 NA0002
Genotype 0|0 0|0
Genotype quality 48 49
Depth 1 3
Haplotype quality 51,51 58,50
...
UK Biobank has 500,000
participants!

50
Sample Genotype Genotype quality Depth Haplotype quality
NA0001 0|0 48 1 51,51
NA0002 0|0 49 3 58,50
...
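
The reshaping shown in the tables above can be sketched in plain Python: per-sample columns become a list of per-sample records, so the number of fields stays fixed however many samples there are. The field names (sampleId, genotypeQuality, and so on) are made up for this sketch and are not taken from the slides.

```python
def wide_to_long(samples, fields):
    """Turn one-column-per-sample genotype data into one record per sample.

    samples: list of sample IDs, in column order.
    fields: dict mapping a field name to a list of values, one per sample.
    """
    records = []
    for i, sample_id in enumerate(samples):
        record = {"sampleId": sample_id}
        for name, values in fields.items():
            record[name] = values[i]
        records.append(record)
    return records

genotypes = wide_to_long(
    ["NA00001", "NA00002"],
    {
        "genotype": ["0|0", "0|0"],
        "genotypeQuality": [48, 49],
        "depth": [1, 3],
        "haplotypeQuality": [[51, 51], [58, 50]],
    },
)
for record in genotypes:
    print(record)
```

Adding a third sample adds a third record, not a new set of columns.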

• Static fields
– e.g. Chromosome
• Dynamic fields
– Variant information
– Genotype information
• Preserves metadata
• Fast querying
• Limited number of columns

Read: VCF → VCF rows
df = spark.read.format("vcf").load("genotypes.vcf")

Write: VCF rows → VCF
df.write.format("vcf").save("genotypes.vcf")

Write: VCF rows → Delta Lake
df.write.format("delta").save("genotypes.delta")
Delta Lake
• Genomic data
– VCF, BGEN, BED
• Medical images
• Electronic health records
• Waveform data
• Real world evidence
• ...

Built-in functions
• Convert genotype probabilities to hard calls
• Normalize variants
• Liftover between reference assemblies
• Annotate variants
• Genome-wide association studies
• ...
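
To make the first item concrete, here is a minimal sketch of converting genotype probabilities to a hard call for one sample at a biallelic, diploid, unphased variant. It is not Glow's built-in function: the function name, the 0.9 default threshold, and the [-1, -1] encoding for a missing call are assumptions chosen for the example.

```python
def hard_call(probabilities, threshold=0.9):
    """Return the most likely diploid genotype, or [-1, -1] if no call is confident.

    probabilities: [P(0/0), P(0/1), P(1/1)] for one sample at a biallelic variant.
    """
    genotypes = [[0, 0], [0, 1], [1, 1]]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return [-1, -1]  # no genotype is probable enough: treat as missing
    return genotypes[best]

print(hard_call([0.95, 0.03, 0.02]))
print(hard_call([0.05, 0.92, 0.03]))
print(hard_call([0.50, 0.30, 0.20]))
```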

GWAS
• linear_regression_gwas
• logistic_regression_gwas
• Single-node bioinformatics tools

Single-node bioinformatics tools
• SAIGE
– R library
– VCF → CSV
http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250

• Require flat file splicing and combination
Figure: text files are split into pieces, each piece is passed through a command line tool, and the text outputs are combined

rdd.pipe()
Figure: Text RDD → stdin → command line tool (on each worker) → stdout → Text RDD

rdd.pipe()
• Input and output RDDs have single text column
– Input: set header as pipe context
– Output: mixed header and text data
• Convert between genomic file formats
– Changing specs
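
The "mixed header and text data" point can be seen in a small stand-alone sketch that does not use Spark at all. Each partition is sent to a command with the header prepended, and the outputs are concatenated. The pass-through command, the helper name, and the toy header are assumptions for illustration only.

```python
import subprocess
import sys

# Stand-in for a command line tool: copies stdin to stdout unchanged.
PASSTHROUGH_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

def pipe_partition(header, lines, cmd):
    """Feed the header plus one partition's text lines to cmd and return its stdout lines."""
    stdin_text = "\n".join([header] + lines) + "\n"
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

header = "#CHROM\tPOS\tREF\tALT"
partitions = [["22\t100\tA\tG"], ["22\t200\tC\tT"]]

output = []
for partition in partitions:
    output.extend(pipe_partition(header, partition, PASSTHROUGH_CMD))

# The header appears once per partition, interleaved with the data lines.
for line in output:
    print(line)
```

With plain text in and plain text out, separating headers from data again is left to the caller.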

glow.transform('pipe')
Figure: DataFrame (VCF, CSV, text) → stdin → command line tool (SAIGE, on each worker) → stdout → DataFrame (VCF, CSV, text)
glow.transform(
  "pipe",
  input_df,
  cmd=cmd,
  input_formatter='vcf',
  in_vcf_header='infer',
  output_formatter='csv',
  out_header='true',
  out_delimiter=' ')

• VCF input formatter (DataFrame → VCF)
– Set header based on schema
– Convert Spark Rows to Java objects
– Third-party library writes header and variant rows

Schema field:
StructField(
  name = "INFO_AF",
  dataType = DoubleType,
  nullable = true,
  metadata = Map(
    "vcf_header_count" -> "A",
    "vcf_header_description" -> "Allele Frequency"))
Header line written for it:
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
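
Setting the header from the schema can be sketched as the mapping below, which renders a field description as a ##INFO line. This is only an illustration: the dict stands in for a Spark StructField, and the function name and type mapping are assumptions, not the formatter's actual code (which, per the slide, relies on a third-party library to write the header).

```python
# Assumed mapping from Spark SQL type names back to VCF header types (illustrative only).
SPARK_TO_VCF_TYPE = {
    "IntegerType": "Integer",
    "DoubleType": "Float",
    "BooleanType": "Flag",
    "StringType": "String",
}

def field_to_info_header(field):
    """Render a dict shaped like the StructField above as a ##INFO header line."""
    if not field["name"].startswith("INFO_"):
        raise ValueError(f"not an INFO field: {field['name']}")
    info_id = field["name"][len("INFO_"):]
    metadata = field["metadata"]
    return (
        f'##INFO=<ID={info_id},'
        f'Number={metadata["vcf_header_count"]},'
        f'Type={SPARK_TO_VCF_TYPE[field["dataType"]]},'
        f'Description="{metadata["vcf_header_description"]}">'
    )

info_af = {
    "name": "INFO_AF",
    "dataType": "DoubleType",
    "nullable": True,
    "metadata": {
        "vcf_header_count": "A",
        "vcf_header_description": "Allele Frequency",
    },
}
print(field_to_info_header(info_af))
```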

• Command (cmd): Rscript step2_SPAtests.R
• For each partition
– Input formatter writes to the command's stdin
– Output formatter reads from the command's stdout
– If running the command triggers an exception, the error is propagated to the driver
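
The per-partition flow described above can be approximated outside Spark with the standard library. In this sketch the formatters are ordinary functions, the commands are small Python one-liners standing in for a real tool, and a non-zero exit status is turned into an exception; all names are hypothetical, and in Spark the raised error would be what reaches the driver.

```python
import subprocess
import sys

def run_partition(rows, cmd, input_formatter, output_formatter):
    """Pipe one partition through cmd, raising if the command fails."""
    stdin_text = input_formatter(rows)
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tool's stderr so the failure is visible to the caller.
        raise RuntimeError(f"command exited with {result.returncode}: {result.stderr.strip()}")
    return output_formatter(result.stdout)

def to_text(rows):
    """Toy input formatter: one space-delimited line per row."""
    return "".join(" ".join(row) + "\n" for row in rows)

def from_text(text):
    """Toy output formatter: split each stdout line back into fields."""
    return [line.split(" ") for line in text.splitlines()]

upper_cmd = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
failing_cmd = [sys.executable, "-c", "import sys; sys.exit('tool failed')"]

print(run_partition([["chr22", "a"]], upper_cmd, to_text, from_text))

try:
    run_partition([["chr22", "a"]], failing_cmd, to_text, from_text)
except RuntimeError as error:
    print("propagated:", error)
```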

• CSV output formatter (CSV → DataFrame)
– Write schema to first element in iterator
– Write remaining rows to iterator

Command output:
CHR POS BETA SE p.value
22 35292447 1.206 3.285 0.714
22 35292456 1.358 2.534 0.592

Iterator contents:
StructType(
  Seq("CHR", "POS", "BETA", "SE", "p.value").map(
    StructField(_, StringType)))
InternalRow("22", "35292447", "1.206", "3.285", "0.714")
InternalRow("22", "35292456", "1.358", "2.534", "0.592")
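
The "schema first, then rows" convention can be sketched with a Python generator. This is an illustrative stand-in rather than the formatter itself: the function name is made up, the schema is a list of (name, type name) pairs instead of a StructType, and rows are tuples instead of InternalRows.

```python
def csv_output_formatter(lines, delimiter=" "):
    """Yield the schema first, then one tuple of strings per data row."""
    iterator = iter(lines)
    header = next(iterator, None)
    if header is None:
        return  # the command produced no output
    # Every column is read as a string, mirroring StructField(_, StringType) above.
    yield [(name, "StringType") for name in header.split(delimiter)]
    for line in iterator:
        yield tuple(line.split(delimiter))

stdout_lines = [
    "CHR POS BETA SE p.value",
    "22 35292447 1.206 3.285 0.714",
    "22 35292456 1.358 2.534 0.592",
]
result = csv_output_formatter(stdout_lines)
schema = next(result)
rows = list(result)
print(schema)
print(rows)
```

The consumer reads the first element to learn the schema and treats everything after it as data.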

• Input and output DataFrames
– Input: infer header from schema
– Output: infer schema from header
• Convert genomic data under the hood
– Spark Row ↔ Java object ↔ text

GWAS
• Load variants
• Perform quality control
• Control for ancestry
• Run regression against trait
• Log Manhattan plot

GWAS: Load variants
variant_df = spark.read.format("vcf").load("genotypes.vcf")

GWAS: Perform quality control
from pyspark.sql.functions import col

# Add per-variant summary statistics and Hardy-Weinberg results, then filter on them
qc_df = (variant_df
  .selectExpr("*",
    "expand_struct(call_summary_stats(genotypes))",
    "expand_struct(hardy_weinberg(genotypes))")
  .where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) &
         (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) &
         (col("pValueHwe") >= hwe_cutoff)))
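
For intuition about what this filter keeps, the sketch below applies the same two cutoffs to genotype counts for a single biallelic variant. It swaps in a simple chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium, which may differ from the test behind hardy_weinberg; the function names are hypothetical, and treating the first allele frequency as the reference allele's is an assumption.

```python
import math

def allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Frequency of the reference allele from diploid genotype counts (needs >= 1 call)."""
    n = n_hom_ref + n_het + n_hom_alt
    return (2 * n_hom_ref + n_het) / (2 * n)

def hwe_chi_square_p_value(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 degree of freedom) p-value for Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    p = allele_frequency(n_hom_ref, n_het, n_hom_alt)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: nothing to test
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_hom_ref, n_het, n_hom_alt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(counts, allele_freq_cutoff, hwe_cutoff):
    """Apply the allele frequency and Hardy-Weinberg cutoffs to one variant."""
    freq = allele_frequency(*counts)
    return (allele_freq_cutoff <= freq <= 1.0 - allele_freq_cutoff
            and hwe_chi_square_p_value(*counts) >= hwe_cutoff)

print(passes_qc((25, 50, 25), allele_freq_cutoff=0.05, hwe_cutoff=1e-6))
print(passes_qc((50, 0, 50), allele_freq_cutoff=0.05, hwe_cutoff=1e-6))
print(passes_qc((99, 1, 0), allele_freq_cutoff=0.05, hwe_cutoff=1e-6))
```

The three calls cover a variant in equilibrium, one with no heterozygotes at all, and one whose minor allele is too rare.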

GWAS: Perform quality control (save the QC'd variants)
qc_df.write.format("delta").save(delta_path)

GWAS: Control for ancestry
matrix.computeSVD(num_pcs)

GWAS: Run regression against trait
gwas_results = (genotypes.crossJoin(phenotypeAndCovariates)
  .selectExpr(
    "expand_struct("
    "linear_regression_gwas("
    "genotype_states(genotypes), "
    "phenotype_values, covariates))"))
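
As a simplified picture of the per-variant regression, the sketch below fits an ordinary least-squares line of the phenotype on one variant's genotype states. It is deliberately not equivalent to linear_regression_gwas: covariates are left out, no p-value is computed, and the function name is hypothetical. It assumes more than two samples and at least two distinct genotype states.

```python
import math

def simple_linear_regression(genotype_states, phenotype_values):
    """Regress the phenotype on one variant's genotype states (with intercept).

    Returns (beta, standard_error) for the genotype effect.
    """
    n = len(genotype_states)
    mean_x = sum(genotype_states) / n
    mean_y = sum(phenotype_values) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_states)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotype_states, phenotype_values))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residual_ss = sum((y - intercept - beta * x) ** 2
                      for x, y in zip(genotype_states, phenotype_values))
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return beta, standard_error

# Genotype states count alternate alleles: 0, 1 or 2 per sample.
states = [0, 1, 2, 0, 1, 2]
trait = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
beta, standard_error = simple_linear_regression(states, trait)
print(beta, standard_error)
```

A genome-wide scan repeats this fit once per variant, which is why it parallelizes naturally across rows of a DataFrame.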

GWAS: Log Manhattan plot
gwas_results_rdf <- as.data.frame(gwas_results)
install.packages("qqman", repos="http://cran.us.r-project.org")
library(qqman)
png('/databricks/driver/manhattan.png')
manhattan(gwas_results_rdf)
dev.off()  # close the device so the PNG is written

GWAS: Log Manhattan plot
Figure source: http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
mlflow.log_artifact('/databricks/driver/manhattan.png')

GWAS pipeline
Figure: VCF DF → QC'd DataFrame → GWAS hits, with Phenotypes and Ancestry as additional inputs
