Enabling Biobank-Scale Genomic Processing with Spark SQL
Karen Feng, Databricks
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Genomics overview
– Big data problem
– Real-world applications
– Pain points at biobank scale
• Glow
– Datasources
– Built-in functions
– Extensibility
Genomics is a big data problem
From $2.7B to <$1,000: https://www.genome.gov/27541954/dna-sequencing-costs-data/
40,000 petabytes / year by 2025: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
The power of big genomic data
Accelerate Target Discovery
Goal: identify a biological target (e.g. protein) that can be modulated with a drug
Approach: large-scale regressions to correlate DNA variants and the trait
Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/
[Figure: orthosteric inhibition]
The power of big genomic data
• Accelerate Target Discovery
• Reduce Costs via Precision Prevention
• Improve Survival with Optimized Treatment
Genomic analysis on big data is hard!
• Existing tools are often difficult to
– Scale
– Learn
– Integrate
Four tools, four ways to split a VCF by chromosome:

for i in chr2L chr2R chr3L chr3R chr4 chrX; do
  GenomeAnalysisTK -R ${ref_seq} \
    -T SelectVariants \
    -V my_flies.vcf \
    -L $i \
    -o my_flies.${i}.vcf
done

for i in {1..16}; do
  vcftools --vcf VCF_FILE --chr $i \
    --recode --recode-INFO-all --out VCF_$i
done

bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf

java -jar SnpSift.jar split file.vcf
838 results
Data management
--make-bed
--recode
--output-chr
--zero-cluster
--split-x/--merge-x
--set-me-missing
--fill-missing-a2
--set-missing-var-ids
--update-map...
--update-ids...
--flip
--flip-scan
--keep-allele-order...
--indiv-sort
--write-covar...
--{,b}merge...
Merge failures
VCF reference merge
--merge-list
--write-snplist
--list-duplicate-vars
Basic statistics
--freq{,x}
--missing
--test-mishap
--hardy
--mendel
--het/--ibc
--check-sex/--impute-sex
--fst
Linkage disequilibrium
--indep...
--r/--r2
--show-tags
--blocks
Distance matrices
Identity-by-state/Hamming
(--distance...)
Relationship/covariance
(--make-grm-bin...)
--rel-cutoff
Distance-pheno. analysis
(--ibs-test...)
Identity-by-descent
--genome
--homozyg...
Population stratification
--cluster
--pca
--mds-plot
--neighbour
Association analysis
Basic case/control
(--assoc, --model)
Stratified case/control
(--mh, --mh2, --homog)
Quantitative trait
(--assoc, --gxe)
Regression w/ covariates
(--linear, --logistic)
--dosage
--lasso
--test-missing
Monte Carlo permutation
Set-based tests
REML additive heritability
Family-based association
--tdt
--dfam
--qfam...
--tucc
Report postprocessing
--annotate
--clump
--gene-report
--meta-analysis
Epistasis
--fast-epistasis
--epistasis
--twolocus
Allelic scoring (--score)
R plugins (--R)
1. Converting one file format to another file format.
2. Converting one file format to another file format.
3. Converting one file format to another file format.
“Give a statistical geneticist an
awk line, feed him for a day,
teach a statistical geneticist how
to awk, feed him for a lifetime...”
Glow
• Open-source toolkit for large-scale genomic analysis
• Built on Spark for biobank scale
• Query and use built-in commands with familiar languages using Spark SQL
• Compatible with existing genomic tools and formats, as well as big data and ML tools
Genomic variant data
Always present
Chromosome: StringType
Variant information: depends on metadata
MapType(StringType, StringType): {"DP" -> "14", "AF" -> "0.5"}
##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency">
→ ##INFO=<ID=AF, Number=?, Type=?, Description=?>
MapType(StringType, StringType): lose metadata and slow querying
Dynamic schema: preserve metadata and fast querying
##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency">
StructField(
  name = "INFO_AF",
  dataType = DoubleType,
  nullable = true,
  metadata = Map(
    "vcf_header_count" -> "A",
    "vcf_header_description" -> "Allele Frequency"))
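The dynamic-schema idea can be sketched in plain Python: parse one ##INFO header line into a typed field description that keeps the header metadata. This is only an illustration, not Glow's implementation. `parse_info_header` and the type-name table are hypothetical; only the `INFO_` prefix and the two `vcf_header_*` metadata keys come from the slide.

```python
import re

# Hypothetical mapping from VCF header types to Spark-style type names.
VCF_TO_SPARK_TYPE = {
    "Float": "DoubleType",
    "Integer": "IntegerType",
    "Flag": "BooleanType",
    "String": "StringType",
    "Character": "StringType",
}

# Matches the simple four-key INFO header form shown on the slide.
INFO_LINE = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),\s*Number=(?P<number>[^,]+),'
    r'\s*Type=(?P<type>[^,]+),\s*Description="(?P<desc>[^"]*)">'
)


def parse_info_header(line):
    """Turn one ##INFO header line into a typed field description."""
    match = INFO_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an INFO header line: {line!r}")
    return {
        "name": "INFO_" + match["id"],
        "dataType": VCF_TO_SPARK_TYPE[match["type"]],
        "nullable": True,
        "metadata": {
            "vcf_header_count": match["number"],
            "vcf_header_description": match["desc"],
        },
    }


field = parse_info_header(
    '##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency">'
)
print(field["name"], field["dataType"], field["metadata"])
```

Because the count and description travel with the field, the original header line can be rebuilt when the data is written back out, which a plain string-to-string map cannot do.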
Genotype information: depends on metadata
Genotype information: width depends on number of samples

Sample              NA00001   NA00002   ...
Genotype            0|0       0|0
Genotype quality    48        49
Depth               1         3
Haplotype quality   51,51     58,50

UK Biobank has 500,000 participants!

Sample    Genotype   Genotype quality   Depth   Haplotype quality
NA00001   0|0        48                 1       51,51
NA00002   0|0        49                 3       58,50
...
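A minimal sketch of the reshaping shown in the two tables above: per-sample columns are collapsed into one list of per-sample records, so the number of columns stays fixed however many samples there are. The function and key names are hypothetical and this is plain Python, not the Glow schema itself.

```python
# One variant row in "wide" form: one group of genotype values per sample.
wide_row = {
    "NA00001": {"genotype": "0|0", "quality": 48, "depth": 1,
                "haplotype_quality": [51, 51]},
    "NA00002": {"genotype": "0|0", "quality": 49, "depth": 3,
                "haplotype_quality": [58, 50]},
}


def to_genotype_array(row):
    """Collapse per-sample columns into one list of per-sample records."""
    return [{"sample": sample, **fields} for sample, fields in row.items()]


genotypes = to_genotype_array(wide_row)
for record in genotypes:
    print(record)
```

Adding a third sample lengthens the list by one record but leaves the set of fields unchanged.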
Genomic variant data
• Static fields
– e.g. Chromosome
• Dynamic fields
– Variant information
– Genotype information
• Preserves metadata
• Fast querying
• Limited number of columns
VCF → VCF rows
spark.read
  .format("vcf")
  .load("genotypes.vcf")

VCF rows → VCF
df.write
  .format("vcf")
  .save("genotypes.vcf")

VCF rows → Delta Lake
df.write
  .format("delta")
  .save("genotypes.delta")

Delta Lake
• Genomic data
– VCF, BGEN, BED
• Medical images
• Electronic health records
• Waveform data
• Real world evidence
• ...
Built-in functions
• Convert genotype probabilities to hard calls
• Normalize variants
• Liftover between reference assemblies
• Annotate variants
• Genome-wide association studies
• ...
GWAS
• linear_regression_gwas
• logistic_regression_gwas
• Single-node bioinformatics tools
Single-node bioinformatics tools
• SAIGE
– R library
– VCF → CSV
http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
Single-node bioinformatics tools
• Require flat file splicing and combination
[Diagram: Text → Command line tool → Text, repeated for each file split]
rdd.pipe()
[Diagram: Text RDD → stdin → Command line tool (worker) → stdout → Text RDD]
• Input and output RDDs have single text column
– Input: set header as pipe context
– Output: mixed header and text data
• Convert between genomic file formats
– Changing specs
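The text-only piping described above can be imitated without Spark. In this rough sketch each "partition" is a list of text lines, the header is prepended by hand, and the tool's stdout comes back as undifferentiated text. The stand-in tool (a Python one-liner that upper-cases its input) and all names are assumptions made for illustration.

```python
import subprocess
import sys

# Stand-in for a single-node command line tool: echoes stdin in upper case.
TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def pipe_partition(header, lines):
    """Send header + lines to the tool's stdin and return its stdout lines."""
    text = "\n".join([header, *lines]) + "\n"
    result = subprocess.run(
        TOOL, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


partitions = [["chr1\t100\ta"], ["chr2\t200\tc", "chr2\t300\tg"]]
outputs = [pipe_partition("#chrom\tpos\talt", part) for part in partitions]

# Every partition's output starts with its own copy of the header,
# so header and data lines come back mixed in one text column.
for out in outputs:
    print(out)
```

The caller has to strip the repeated header from each partition's output and parse the remaining text itself.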
glow.transform(‘pipe’)
[Diagram: DataFrame (VCF, CSV, text) → stdin → Command line tool (SAIGE) on worker → stdout → DataFrame (VCF, CSV, text)]
glow.transform(
    "pipe",
    input_df,
    cmd=cmd,
    input_formatter='vcf',
    in_vcf_header='infer',
    output_formatter='csv',
    out_header='true',
    out_delimiter=' ')
glow.transform(‘pipe’)
• VCF input formatter (DataFrame → VCF)
– Set header based on schema
– Convert Spark Rows to Java objects
– Third-party library writes header and variant rows
StructField(
  name = "INFO_AF",
  dataType = DoubleType,
  nullable = true,
  metadata = Map(
    "vcf_header_count" -> "A",
    "vcf_header_description" -> "Allele Frequency"))
##INFO=<ID=AF, Number=A, Type=Float, Description="Allele Frequency">
Command (cmd): Rscript step2_SPAtests.R
glow.transform(‘pipe’)
• For each partition
– Input formatter writes to the command’s stdin
– Output formatter reads from the command’s stdout
– If running the command triggers an exception, the
error is propagated to the driver
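The per-partition behaviour listed above can be sketched in plain Python. An input formatter turns rows into text for the command's stdin, an output formatter parses its stdout, and a failing command raises an error for the caller. This is an assumption-laden illustration, not Glow's code: `run_on_partition`, `to_text`, `from_text` and both stand-in commands are hypothetical.

```python
import subprocess
import sys


def run_on_partition(cmd, rows, input_formatter, output_formatter):
    """Format rows to stdin text, run cmd, parse stdout; raise if cmd fails."""
    proc = subprocess.run(
        cmd, input=input_formatter(rows), capture_output=True, text=True
    )
    if proc.returncode != 0:
        # Surface the tool's stderr so the caller sees why it failed.
        raise RuntimeError(
            f"command failed ({proc.returncode}): {proc.stderr.strip()}"
        )
    return output_formatter(proc.stdout)


def to_text(rows):
    """Input formatter: one tab-separated line per row."""
    return "".join(f"{chrom}\t{pos}\n" for chrom, pos in rows)


def from_text(text):
    """Output formatter: split each output line back into a tuple."""
    return [tuple(line.split("\t")) for line in text.splitlines()]


# Stand-in commands: one copies stdin to stdout, one exits with an error.
ok_cmd = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit('boom')"]

rows = [("22", "35292447"), ("22", "35292456")]
print(run_on_partition(ok_cmd, rows, to_text, from_text))

try:
    run_on_partition(bad_cmd, rows, to_text, from_text)
except RuntimeError as err:
    print("propagated:", err)
```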
glow.transform(‘pipe’)
• CSV output formatter (CSV → DataFrame)
– Write schema to first element in iterator
– Write remaining rows to iterator

CHR  POS       BETA   SE     p.value
22   35292447  1.206  3.285  0.714
22   35292456  1.358  2.534  0.592

StructType(
  Seq("CHR", "POS", "BETA", "SE", "p.value").map(
    StructField(_, StringType)))
InternalRow("22", "35292447", "1.206", "3.285", "0.714")
InternalRow("22", "35292456", "1.358", "2.534", "0.592")
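The same header-to-schema step can be sketched with Python's standard `csv` module: the first line of the tool's output becomes the schema, with every column typed as a string, and the remaining lines become rows. The function name and the `(name, "StringType")` pairs are illustrative assumptions, not the formatter's real data structures.

```python
import csv
import io

tool_output = (
    "CHR POS BETA SE p.value\n"
    "22 35292447 1.206 3.285 0.714\n"
    "22 35292456 1.358 2.534 0.592\n"
)


def read_delimited(text, delimiter=" "):
    """Return (schema, rows): schema from the header line, values as strings."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    schema = [(name, "StringType") for name in header]
    rows = [tuple(record) for record in reader]
    return schema, rows


schema, rows = read_delimited(tool_output)
print(schema)
print(rows)
```

Numeric columns such as BETA stay as strings at this stage and would need to be cast to a numeric type afterwards.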
glow.transform(‘pipe’)
• Input and output DataFrames
– Input: infer header from schema
– Output: infer schema from header
• Convert genomic data under the hood
– Spark Row ↔ Java object ↔ text
GWAS
• Load variants
• Perform quality control
• Control for ancestry
• Run regression against trait
• Log Manhattan plot
GWAS: Load variants
spark.read.format("vcf") \
  .load("genotypes.vcf")
GWAS: Perform quality control
variant_df.selectExpr("*",
    "expand_struct(call_summary_stats(genotypes))",
    "expand_struct(hardy_weinberg(genotypes))") \
  .where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) &
         (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) &
         (col("pValueHwe") >= hwe_cutoff))
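To make the filter concrete, here is a single-variant sketch in plain Python of the two quantities it tests. It takes genotype counts directly and uses a chi-square approximation for Hardy-Weinberg equilibrium; this is a swapped-in simplification and should not be read as the method behind `hardy_weinberg`. The default cutoffs are arbitrary example values.

```python
import math


def allele_frequencies(n_hom_ref, n_het, n_hom_alt):
    """Reference and alternate allele frequencies from diploid counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    ref = (2 * n_hom_ref + n_het) / total_alleles
    return [ref, 1.0 - ref]


def hwe_p_value(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 d.f.) test of Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    p, q = allele_frequencies(n_hom_ref, n_het, n_hom_alt)
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_hom_ref, n_het, n_hom_alt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(counts, allele_freq_cutoff=0.05, hwe_cutoff=1e-6):
    """Apply the same two conditions as the filter, to one variant."""
    ref_freq = allele_frequencies(*counts)[0]
    return (
        allele_freq_cutoff <= ref_freq <= 1.0 - allele_freq_cutoff
        and hwe_p_value(*counts) >= hwe_cutoff
    )


print(passes_qc((360, 480, 160)))  # common variant, counts match equilibrium
print(passes_qc((998, 2, 0)))      # reference allele frequency above 0.95
print(passes_qc((500, 0, 500)))    # no heterozygotes at all
```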
qc_df.write \
  .format("delta") \
  .save(delta_path)
GWAS: Control for ancestry
matrix.computeSVD(num_pcs)
GWAS: Run regression against trait
genotypes.crossJoin(
    phenotypeAndCovariates) \
  .selectExpr(
    "expand_struct("
    "linear_regression_gwas("
    "genotype_states(genotypes), "
    "phenotype_values, covariates))")
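For intuition, the per-variant regression can be written out by hand for the simplest case: one variant, an intercept, and no covariates. The sketch below is ordinary least squares in plain Python and is a simplification of what a covariate-adjusted GWAS regression computes. The function name, result keys and toy data are illustrative assumptions.

```python
import math


def linear_regression_one_variant(genotype_states, phenotype_values):
    """Ordinary least squares of phenotype on genotype, intercept only."""
    n = len(genotype_states)
    mean_x = sum(genotype_states) / n
    mean_y = sum(phenotype_values) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_states)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(genotype_states, phenotype_values)
    )
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    residual_ss = sum(
        (y - intercept - beta * x) ** 2
        for x, y in zip(genotype_states, phenotype_values)
    )
    # Residual variance uses n - 2 degrees of freedom (slope and intercept).
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return {"beta": beta, "standardError": standard_error}


# Genotype states count alternate alleles per sample (0, 1 or 2).
states = [0, 0, 1, 1, 2, 2]
phenotype = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
result = linear_regression_one_variant(states, phenotype)
print(result)
```

A genome-wide run repeats this fit once per variant, which is why the work parallelizes naturally across rows.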
GWAS: Log Manhattan plot
gwas_results_rdf <- as.data.frame(gwas_results)
install.packages("qqman", repos="http://cran.us.r-project.org")
library(qqman)
png('/databricks/driver/manhattan.png')
manhattan(gwas_results_rdf)
dev.off()  # close the device so the PNG is written to disk
http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
mlflow.log_artifact(
    '/databricks/driver/manhattan.png')
GWAS pipeline
[Diagram: VCF → DataFrame → QC’d DataFrame → GWAS hits, with Phenotypes and Ancestry as additional inputs]
projectglow.io
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schscnajjemba
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格q6pzkpark
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样wsppdmt
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 

Último (20)

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 

Enabling Biobank-Scale Genomic Processing with Spark SQL

  • 2. Enabling Biobank-Scale Genomic Processing with Spark SQL • Karen Feng, Databricks • #UnifiedDataAnalytics #SparkAISummit
  • 3. Agenda • Genomics overview – Big data problem – Real-world applications – Pain points at biobank scale • Glow – Datasources – Built-in functions – Extensibility
  • 5. Genomics is a big data problem • Cost of sequencing a human genome: from $2.7B to <$1,000 (https://www.genome.gov/27541954/dna-sequencing-costs-data/) • Data volume: 40,000 petabytes / year by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195)
  • 7. The power of big genomic data: Accelerate Target Discovery • Goal: identify a biological target (e.g., a protein) that can be modulated with a drug • Approach: large-scale regressions to correlate DNA variants with the trait • Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/) • Figure: orthosteric inhibition
  • 10. The power of big genomic data • Accelerate Target Discovery • Reduce Costs via Precision Prevention • Improve Survival with Optimized Treatment
  • 12. Genomic analysis on big data is hard! • Existing tools are often difficult to – Scale – Learn – Integrate
  • 17. Genomic analysis on big data is hard! • Example commands for splitting a VCF by chromosome with GATK, vcftools, tabix, and SnpSift:
      for i in chr2L chr2R chr3L chr3R chr4 chrX; do GenomeAnalysisTK -R ${ref_seq} -T SelectVariants -V my_flies.vcf -L $i -o my_flies.${i}.vcf; done
      for i in {1..16}; do vcftools --vcf VCF_FILE --chr $i --recode --recode-INFO-all --out VCF_$i; done
      bgzip -c myvcf.vcf > myvcf.vcf.gz
      tabix -p vcf myvcf.vcf.gz
      tabix myvcf.vcf.gz chr1 > chr1.vcf
      java -jar SnpSift.jar split file.vcf
  • 19. Genomic analysis on big data is hard! • 838 results
  • 20. Genomic analysis on big data is hard! • Options of a single command-line tool, by category:
      Data management: --make-bed, --recode, --output-chr, --zero-cluster, --split-x/--merge-x, --set-me-missing, --fill-missing-a2, --set-missing-var-ids, --update-map..., --update-ids..., --flip, --flip-scan, --keep-allele-order..., --indiv-sort, --write-covar..., --{,b}merge..., Merge failures, VCF reference merge, --merge-list, --write-snplist, --list-duplicate-vars
      Basic statistics: --freq{,x}, --missing, --test-mishap, --hardy, --mendel, --het/--ibc, --check-sex/--impute-sex, --fst
      Linkage disequilibrium: --indep..., --r/--r2, --show-tags, --blocks
      Distance matrices: Identity-by-state/Hamming (--distance...), Relationship/covariance (--make-grm-bin...), --rel-cutoff, Distance-pheno. analysis (--ibs-test...)
      Identity-by-descent: --genome, --homozyg...
      Population stratification: --cluster, --pca, --mds-plot, --neighbour
      Association analysis: Basic case/control (--assoc, --model), Stratified case/control (--mh, --mh2, --homog), Quantitative trait (--assoc, --gxe), Regression w/ covariates (--linear, --logistic), --dosage, --lasso, --test-missing, Monte Carlo permutation, Set-based tests, REML additive heritability
      Family-based association: --tdt, --dfam, --qfam..., --tucc
      Report postprocessing: --annotate, --clump, --gene-report, --meta-analysis
      Epistasis: --fast-epistasis, --epistasis, --twolocus
      Allelic scoring (--score)
      R plugins (--R)
  • 26. Genomic analysis on big data is hard! • 1. Converting one file format to another file format. • 2. Converting one file format to another file format. • 3. Converting one file format to another file format.
  • 27. Genomic analysis on big data is hard! • “Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime...”
  • 31. Glow • Open-source toolkit for large-scale genomic analysis • Built on Spark for biobank scale • Query and use built-in commands with familiar languages using Spark SQL • Compatible with existing genomic tools and formats, as well as big data and ML tools
  • 37. Genomic variant data • Variant information: depends on metadata
  • 38. Genomic variant data • MapType(StringType, StringType): {"DP" -> "14", "AF" -> "0.5"}
  • 40. Genomic variant data • MapType(StringType, StringType) • VCF header: ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> • After storing as a string map: ##INFO=<ID=AF,Number=?,Type=?,Description=?>
  • 41. Genomic variant data • MapType(StringType, StringType): loses metadata and is slow to query
  • 42. Genomic variant data • Dynamic schema: preserves metadata and is fast to query
  • 43. Genomic variant data • ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> becomes StructField(name = "INFO_AF", dataType = DoubleType, nullable = true, metadata = Map("vcf_header_count" -> "A", "vcf_header_description" -> "Allele Frequency"))
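The header-to-field mapping on this slide can be sketched in plain Python. This is a minimal illustration, not Glow's implementation: `FieldSpec`, `info_header_to_field`, and the `VCF_TO_SPARK_TYPE` table are hypothetical names, and the sketch follows the slide in ignoring `Number` when choosing the data type.

```python
import re
from dataclasses import dataclass, field

# Assumed mapping from VCF header types to Spark-style type names.
VCF_TO_SPARK_TYPE = {
    "Integer": "IntegerType",
    "Float": "DoubleType",
    "Flag": "BooleanType",
    "Character": "StringType",
    "String": "StringType",
}


@dataclass
class FieldSpec:
    """Hypothetical stand-in for a Spark StructField."""
    name: str
    data_type: str
    nullable: bool = True
    metadata: dict = field(default_factory=dict)


def info_header_to_field(line: str) -> FieldSpec:
    """Turn one ##INFO=<...> header line into a typed, annotated field."""
    match = re.fullmatch(r"##INFO=<(.*)>", line.strip())
    if match is None:
        raise ValueError(f"not an INFO header line: {line!r}")
    # Values are either double-quoted (may contain commas) or run to the next comma.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1))
    attrs = {key: value.strip('"') for key, value in pairs}
    return FieldSpec(
        name="INFO_" + attrs["ID"],
        data_type=VCF_TO_SPARK_TYPE[attrs["Type"]],
        metadata={
            "vcf_header_count": attrs["Number"],
            "vcf_header_description": attrs["Description"],
        },
    )


af = info_header_to_field(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
print(af)
```

Because the count and description travel with the field, the original header line can be rebuilt later, which a plain string-to-string map cannot do.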
  • 45. Genomic variant data • Genotype information: depends on metadata
  • 46. Genomic variant data • Genotype information: width depends on number of samples
  • 49. Genomic variant data • One column per sample; UK Biobank has 500,000 participants!
      Sample             NA00001  NA00002  ...
      Genotype           0|0      0|0
      Genotype quality   48       49
      Depth              1        3
      Haplotype quality  51,51    58,50
  • 50. Genomic variant data • One row per sample
      Sample   Genotype  Genotype quality  Depth  Haplotype quality
      NA00001  0|0       48                1      51,51
      NA00002  0|0       49                3      58,50
      ...
  • 51. Genomic variant data • Static fields – e.g., chromosome • Dynamic fields – Variant information – Genotype information • Preserves metadata • Fast querying • Limited number of columns
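The move from one column per sample to one record per sample can be sketched as a small pivot. The function name, the field names (`sampleId`, `genotypeQuality`, and so on), and the contig value are assumptions made for illustration; only the sample values come from the slides.

```python
def to_genotype_array(sample_ids, fields_by_name):
    """Pivot per-sample columns into one list of per-sample records."""
    genotypes = []
    for i, sample_id in enumerate(sample_ids):
        record = {"sampleId": sample_id}
        for field_name, values in fields_by_name.items():
            record[field_name] = values[i]
        genotypes.append(record)
    return genotypes


# Wide layout: each field holds one value per sample, in sample order.
wide = {
    "genotype": ["0|0", "0|0"],
    "genotypeQuality": [48, 49],
    "depth": [1, 3],
    "haplotypeQuality": [[51, 51], [58, 50]],
}

# The variant row keeps a fixed set of columns however many samples there are.
row = {
    "contigName": "20",  # hypothetical static field
    "genotypes": to_genotype_array(["NA00001", "NA00002"], wide),
}
print(row)
```

With this layout, adding more samples lengthens the `genotypes` list instead of widening the row.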
  • 52. Genomic variant data • VCF file to VCF rows: df = spark.read.format("vcf").load("genotypes.vcf")
  • 54. Genomic variant data • VCF rows to a Delta table: df.write.format("delta").save("genotypes.delta")
  • 55. Delta Lake • Genomic data – VCF, BGEN, BED • Medical images • Electronic health records • Waveform data • Real world evidence • ...
  • 57. Built-in functions • Convert genotype probabilities to hard calls • Normalize variants • Liftover between reference assemblies • Annotate variants • Genome-wide association studies • ...
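As an illustration of the first item, converting genotype probabilities to hard calls can be sketched as follows. This is a simplified stand-in rather than Glow's built-in: it assumes an unphased, diploid, biallelic site with probabilities ordered P(0/0), P(0/1), P(1/1), and the function name, the 0.9 threshold, and the (-1, -1) missing-call encoding are all assumptions.

```python
def hard_call(probabilities, threshold=0.9):
    """Return the most likely diploid genotype, or a missing call if uncertain."""
    genotypes = [(0, 0), (0, 1), (1, 1)]
    if len(probabilities) != len(genotypes):
        raise ValueError("expected three probabilities: P(0/0), P(0/1), P(1/1)")
    best = max(range(len(genotypes)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return (-1, -1)  # no genotype is confident enough
    return genotypes[best]


print(hard_call([0.95, 0.04, 0.01]))
print(hard_call([0.50, 0.40, 0.10]))
```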
  • 60. Single-node bioinformatics tools • SAIGE – R library – VCF → CSV • http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
  • 61. Single-node bioinformatics tools • Require flat file splicing and combination
  • 62. Single-node bioinformatics tools • Figure: text files going into and out of several instances of a command line tool
  • 64. rdd.pipe() • Input and output RDDs have single text column – Input: set header as pipe context – Output: mixed header and text data • Convert between genomic file formats – Changing specs
  • 66. glow.transform('pipe')
      glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header='infer',
                     output_formatter='csv', out_header='true', out_delimiter=' ')
  • 67. glow.transform('pipe') • input_formatter='vcf': DataFrame → VCF
  • 68. glow.transform('pipe') • VCF input formatter – Set header based on schema – Convert Spark Rows to Java objects – Third-party library writes header and variant rows • Example: StructField(name = "INFO_AF", dataType = DoubleType, nullable = true, metadata = Map("vcf_header_count" -> "A", "vcf_header_description" -> "Allele Frequency")) is written as ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
  • 69. glow.transform('pipe') • cmd: Rscript step2_SPAtests.R
  • 70. glow.transform('pipe') • For each partition – Input formatter writes to the command's stdin – Output formatter reads from the command's stdout – If running the command triggers an exception, the error is propagated to the driver
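The per-partition flow can be sketched with the standard library alone. This is an illustration of the pattern, not Glow's code: `pipe_partition` and its formatter arguments are hypothetical, and a small Python one-liner stands in for the real command.

```python
import subprocess
import sys


def pipe_partition(rows, cmd, format_input, parse_output):
    """Run cmd once for a partition: rows go to stdin, stdout becomes rows."""
    stdin_text = format_input(rows)
    result = subprocess.run(cmd, input=stdin_text, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure to the caller instead of returning partial output.
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return parse_output(result.stdout)


# Stand-in command that upper-cases whatever it reads from stdin.
upper_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

partitions = [["chr1 a", "chr1 c"], ["chr2 g"]]
results = [
    pipe_partition(p, upper_cmd, lambda rows: "\n".join(rows) + "\n", str.splitlines)
    for p in partitions
]
print(results)
```

Each partition gets its own process, so the command only ever sees the slice of data it was handed.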
  • 71. glow.transform(‘pipe’) glow.transform( "pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header='infer', output_formatter='csv', out_header='true', out_delimiter=' ') 71 DataFrame CSV
  • 72. glow.transform('pipe') • CSV output formatter – Write schema to first element in iterator – Write remaining rows to iterator • Example output:
      CHR  POS       BETA   SE     p.value
      22   35292447  1.206  3.285  0.714
      22   35292456  1.358  2.534  0.592
    Schema: StructType(Seq("CHR", "POS", "BETA", "SE", "p.value").map(StructField(_, StringType)))
    Rows: InternalRow("22", "35292447", "1.206", "3.285", "0.714"), InternalRow("22", "35292456", "1.358", "2.534", "0.592")
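A schema-first iterator of this kind can be sketched as a Python generator. The function name `csv_output_formatter` and the `(name, "string")` schema representation are assumptions made for illustration; the sample lines are the ones shown on the slide.

```python
def csv_output_formatter(lines, delimiter=" "):
    """Yield an all-string schema first, then one tuple of strings per row."""
    it = iter(lines)
    header_line = next(it, None)
    if header_line is None:
        return  # no output from the command, so no schema and no rows
    header = header_line.split(delimiter)
    # First element of the iterator: the schema inferred from the header.
    yield [(name, "string") for name in header]
    # Remaining elements: the data rows, kept as strings.
    for line in it:
        values = tuple(line.split(delimiter))
        if len(values) != len(header):
            raise ValueError("row does not match header: " + line)
        yield values


stdout_lines = [
    "CHR POS BETA SE p.value",
    "22 35292447 1.206 3.285 0.714",
    "22 35292456 1.358 2.534 0.592",
]

output = list(csv_output_formatter(stdout_lines))
schema, rows = output[0], output[1:]
print(schema)
print(rows)
```

Every column is left as a string here, matching the `StringType` fields on the slide; casting to numeric types would be a separate step.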
  • 73. glow.transform('pipe') • Input and output DataFrames – Input: infer header from schema – Output: infer schema from header • Convert genomic data under the hood – Spark Row ↔ Java object ↔ text
  • 75. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot
  • 76. GWAS • Load variants: spark.read.format("vcf").load("genotypes.vcf")
  • 77. GWAS • Perform quality control: variant_df.selectExpr("*", "expand_struct(call_summary_stats(genotypes))", "expand_struct(hardy_weinberg(genotypes))").where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) & (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) & (col("pValueHwe") >= hwe_cutoff))
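The quality-control filter can be illustrated on a single variant without Spark. This is a simplified sketch: `allele_frequencies` and `passes_qc` are hypothetical helpers, genotype calls are assumed to be lists of allele indices with -1 marking a missing call, and the Hardy-Weinberg p-value is taken as an already computed input rather than derived here.

```python
def allele_frequencies(genotype_calls, num_alleles=2):
    """Frequency of each allele index across all non-missing calls."""
    counts = [0] * num_alleles
    for calls in genotype_calls:
        for allele in calls:
            if allele >= 0:  # assumption: -1 marks a missing call
                counts[allele] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * num_alleles
    return [count / total for count in counts]


def passes_qc(frequencies, p_value_hwe, allele_freq_cutoff, hwe_cutoff):
    """Same three conditions as the where() clause on the slide."""
    return (
        frequencies[0] >= allele_freq_cutoff
        and frequencies[0] <= 1.0 - allele_freq_cutoff
        and p_value_hwe >= hwe_cutoff
    )


# Four samples at one variant; the last sample has one missing call.
calls = [[0, 0], [0, 1], [1, 1], [0, -1]]
af = allele_frequencies(calls)
print(af, passes_qc(af, 0.5, 0.05, 1e-6))

# A monomorphic variant is removed by the allele frequency bounds.
mono = allele_frequencies([[0, 0], [0, 0]])
print(mono, passes_qc(mono, 1.0, 0.05, 1e-6))
```

Bounding the first allele's frequency on both sides drops variants where either allele is too rare to support a regression.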
  • 78. GWAS • Save quality-controlled variants: qc_df.write.format("delta").save(delta_path)
  • 79. GWAS • Control for ancestry: matrix.computeSVD(num_pcs)
  • 80. GWAS • Run regression against trait: genotypes.crossJoin(phenotypeAndCovariates).selectExpr("expand_struct(linear_regression_gwas(genotype_states(genotypes), phenotype_values, covariates))")
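What a per-variant regression computes can be shown with a simplified stand-in. The sketch below fits ordinary least squares of the phenotype on the genotype state with an intercept only; it omits covariates and the p-value, so it is not equivalent to `linear_regression_gwas`, and `single_variant_regression` is a hypothetical name.

```python
import math


def single_variant_regression(genotype_states, phenotype_values):
    """Return (beta, standard_error) for phenotype ~ intercept + genotype."""
    n = len(genotype_states)
    if n != len(phenotype_values) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_mean = sum(genotype_states) / n
    y_mean = sum(phenotype_values) / n
    sxx = sum((x - x_mean) ** 2 for x in genotype_states)
    if sxx == 0:
        raise ValueError("genotype has no variation")
    sxy = sum(
        (x - x_mean) * (y - y_mean)
        for x, y in zip(genotype_states, phenotype_values)
    )
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    # Residual variance with n - 2 degrees of freedom (slope and intercept).
    rss = sum(
        (y - intercept - beta * x) ** 2
        for x, y in zip(genotype_states, phenotype_values)
    )
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return beta, standard_error


# Genotype states count alternate alleles per sample (0, 1, or 2).
states = [0, 1, 2, 0, 1, 2]
trait = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
beta, se = single_variant_regression(states, trait)
print(beta, se)
```

In a GWAS this fit is repeated independently for every variant, which is why it parallelizes naturally across DataFrame rows.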
  • 81. GWAS • Log Manhattan plot: gwas_results_rdf <- as.data.frame(gwas_results); install.packages("qqman", repos="http://cran.us.r-project.org"); library(qqman); png('/databricks/driver/manhattan.png'); manhattan(gwas_results_rdf); dev.off()
  • 82. GWAS • Log Manhattan plot (figure from http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250)
  • 83. GWAS • Log Manhattan plot: mlflow.log_artifact('/databricks/driver/manhattan.png')