Interactive Analysis of Large-Scale Sequencing Genomics 
Data Sets using a Real-Time Distributed Database
Hans-Martin Will, Ph.D., Dominic Suciu, Ph.D. – SpaceCurve, Seattle, WA, USA
SpaceCurve | 710 Second Avenue #620, Seattle, WA 98104 | spacecurve.com | hm@spacecurve.com
Conclusions
We have shown the feasibility of storing high-throughput sequencing
data fully schematized and indexed at the level of individual reads
within a next-generation Big Data platform, thereby providing an
attractive alternative to file-based informatics pipelines. While we
have conducted our study within a single framework for variation
analysis, we believe that the ability to quickly create analysis data
sets across studies and genomic regions applies to many other
analysis methodologies as well. We see the following areas for future
work: (1) improving the overall storage efficiency through better data
representation and compression techniques; (2) providing a more
natural representation of genomic locations at the SQL level.
Introduction
With the arrival of the $1,000 human genome, the dream and
promise of applying high-throughput sequencing (HTS) to large
populations has come true. However, as described elsewhere [1],
the complexity and cost of managing the generated data and
performing computation effectively for their analysis and
interpretation have become the major challenge and dominating cost
factor. Currently, the most workable approach to handling these data
sets is organizing sequencing data in the form of many individual
data files that are stored in distributed storage clusters or cloud
storage. Bioinformatics pipelines that analyze those data sets are
then implemented using distributed file-based batch-processing
engines, such as Apache Hadoop (MapReduce) or the Open Grid
Scheduler (formerly Sun Grid Engine). In such an approach, the
turnaround time of creating and running new batch jobs becomes the
limiting factor during data analysis. Ultimately, this leads to a situation
where researchers find themselves hampered in their ability to ask
“what if?” questions not due to lack of creativity or availability of data
but due to mere processing time and inconvenience.
 
In recent years, driven by the needs of mobile analytics and the
Internet of Things, a new generation of Big Data technology has
been developed that provides multi-dimensional indexing capabilities
at scale. Examples are Postgres-XC [2], SAP HANA with Spatial
Processing [3], or the SpaceCurve data platform [4] used here. 

In this poster, we describe our experience of applying such a next-
generation storage engine to the problem of characterizing genetic
variation in high-throughput sequencing data. Our particular analysis
is motivated by the framework described in DePristo et al. (2011) [5]
and aims to model the relevant data access patterns. 
Materials and Methods
Data
For our study we chose the 1000 Genomes data set [6] published
via Amazon Web Services (AWS) [7]. Specifically, we started with the
complete set of aligned reads that are provided in the form of BAM
[8] files across all 1092 individuals included in the study. Those BAM
files amount to 35 TB of data.

Spatial Mapping
We are using database capabilities defined by the spatial extensions
to the SQL standard [9]. Each aligned read is mapped into a 2-
dimensional space and represented as a SQL LINESTRING object.
The x-axis used for this mapping is a linearization of all
chromosomes making up the genome. That is, location P on
chromosome C is mapped to an x-coordinate 

X = C * 1,000,000,000 + P. 

The y-coordinate is the sample identifier: Y = <sample_id>.
See Figure 1 for an illustration of this 2-dimensional coordinate
system.
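The mapping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, chromosomes are assumed to carry integer codes, and the LINESTRING is assumed to run from the first to the last aligned reference position at the height of the sample identifier.

```python
from typing import Tuple

# Offset between consecutive chromosomes, from X = C * 1,000,000,000 + P.
CHROM_STRIDE = 1_000_000_000


def to_x(chrom: int, pos: int) -> int:
    """Linearize (chromosome, position) into a single x-coordinate."""
    if not 0 <= pos < CHROM_STRIDE:
        raise ValueError("position does not fit into one chromosome slot")
    return chrom * CHROM_STRIDE + pos


def from_x(x: int) -> Tuple[int, int]:
    """Invert to_x: recover (chromosome, position) from an x-coordinate."""
    return divmod(x, CHROM_STRIDE)


def read_linestring(chrom: int, start: int, end: int, sample_id: int) -> str:
    """Hypothetical helper: WKT LINESTRING for one aligned read."""
    x0, x1 = to_x(chrom, start), to_x(chrom, end)
    return f"LINESTRING({x0} {sample_id}, {x1} {sample_id})"
```

For example, position 126000 on chromosome 20 maps to x = 20000126000, which matches the coordinates used in the polygon query of the Analysis section.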

The rationale behind this indexing strategy is that for variant analysis
it is necessary to quickly extract specific genomic regions for a given
set of individuals. Using our mapping such data requests correspond
to range queries that a multi-dimensional index can answer
efficiently. 










Figure 1: Illustration of the two-dimensional coordinate system used
to index mapped reads across sample identifier and genomic
location.

Data Preparation
The SpaceCurve data platform ingests and returns data in the form
of streams of JSON objects. We developed a conversion tool using
the C++ programming language and the SeqAn library [10] to
convert BAM files into JSON for data ingestion. By using C++, the
conversion process is fast enough to run as a filter step between
reading the BAM files from disk and transferring them directly into
the SpaceCurve data platform.
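To make the conversion step concrete, the following Python sketch turns one alignment record, given as a tab-separated SAM text line, into a JSON object. It is not the C++/SeqAn tool described above. The JSON field names, the GeoJSON-style "genome" field, and the use of numeric reference names are assumptions made for illustration only.

```python
import json
import re

CHROM_STRIDE = 1_000_000_000          # from X = C * 1,000,000,000 + P
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = frozenset("MDN=X")    # CIGAR operations that advance the reference


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def sam_line_to_json(line: str, sample_id: int) -> str:
    """Convert one SAM text line into a JSON record (hypothetical schema)."""
    f = line.rstrip("\n").split("\t")
    chrom, pos, cigar = int(f[2]), int(f[3]), f[5]   # assumes numeric RNAME
    x_start = chrom * CHROM_STRIDE + pos
    x_end = x_start + max(reference_span(cigar) - 1, 0)
    record = {
        "name": f[0],
        "flag": int(f[1]),
        "mapq": int(f[4]),
        "cigar": cigar,
        "seq": f[9],      # stored as plain character sequences
        "qual": f[10],
        "sample": sample_id,
        "genome": {
            "type": "LineString",
            "coordinates": [[x_start, sample_id], [x_end, sample_id]],
        },
    }
    return json.dumps(record)
```

A streaming converter would apply such a function to every record and write one JSON object per line to the ingest end-point.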

References
[1] Sboner, A. et al. The real cost of sequencing: higher than you think! Genome Biology, 12:125 (2011)
[2] http://postgresxc.wikia.com/wiki/Postgres-XC_Wiki
[3] http://www.saphana.com/community/about-hana/advanced-analytics/spatial-processing
[4] http://spacecurve.com
[5] DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43:5, 491-498 (2011)
[6] The 1000 Genomes Consortium. A map of human genome variation from population-scale sequencing. Nature, 467, 1061-1073 (2010)
[7] http://s3.amazonaws.com/1000genomes
[8] http://samtools.sourceforge.net
[9] ISO/IEC 13249-3:2011 SQL Multimedia and Application Packages, Part 3: Spatial (2011)
[10] Döring, A., Weese, D., Rausch, T. and Reinert, K. SeqAn: an efficient, generic C++ library for sequence analysis. BMC Bioinformatics, 9:11 (2008)
[11] Rogers, J. A. Spatial Sieve Tree, US Patent US7734714 B2 (2010)




Materials and Methods (cont.)
System
The SpaceCurve data platform employs a novel spatial indexing data
structure called Spatial Sieve Tree [11]. A Spatial Sieve Tree is a
multi-level, multidimensional tree structure, which is based on
successive division of a root partition comprising the overall data
space. Spatial Sieve Trees can be implemented in a distributed
fashion, thereby providing an efficient locality-aware mechanism for
sharding data. See Figure 2 for an illustration of these concepts.
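The idea of successive division can be pictured with a deliberately simplified, one-dimensional Python sketch. It is not the patented Spatial Sieve Tree. The binary split, the half-open intervals, and the rule that an object rests in the deepest cell that fully contains it are all assumptions chosen for illustration.

```python
def sieve_cell(start, end, lo, hi, max_level):
    """Return (level, cell_lo, cell_hi) of the deepest cell containing [start, end).

    The root cell [lo, hi) is halved repeatedly. An interval that straddles
    the split point of a cell cannot descend further and stays at that level.
    """
    level = 0
    while level < max_level:
        mid = (lo + hi) // 2
        if end <= mid:        # fits entirely into the left half
            hi = mid
        elif start >= mid:    # fits entirely into the right half
            lo = mid
        else:                 # straddles the boundary: stop sieving
            break
        level += 1
    return level, lo, hi
```

In this toy model, short intervals usually fall through to the finest level, while an interval crossing a high-level split point is retained near the root regardless of its length.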

























Figure 2: Illustration of the Spatial Sieve Tree: A shows the
hierarchical structure of the tree. B illustrates the sieving process. C
shows how a tree can be distributed across cluster nodes. 

The SpaceCurve data platform was deployed as a small cluster
configuration on AWS EC2. We used a group of 3 storage nodes,
each using 32 TB of disk storage and managed by an additional
master node. On each storage node we ran 4 storage management
processes, each process responsible for a data volume spanning 4
physical drives in a striped configuration. Queries were issued from
a group of 20 perimeter nodes that provided an HTTP end-point to
issue ingest and query commands to the system, and which were
also used to run data conversion and analysis tools. All machines
of the cluster had 10-gigabit Ethernet ports and were located within
a single placement group. Table 1 summarizes the overall cluster
configuration.








[Figure 1 labels: X axis with Chromosome 1 starting at 1,000,000,000 and Chromosome 2 at 2,000,000,000; Y axis with Sample 1 to Sample 4 at Y = 1 to 4.]

Table 1: Cluster configuration used for our analysis

Node Type   Count   EC2 Type      Cores   RAM [GB]   Disk [GB]    Network [Gbit/sec]
Storage     3       hs1.8xlarge   16      117        24 * 2048    10
Master      1       hs1.8xlarge   16      117        24 * 2048    10
Perimeter   20      cc2.8xlarge   32      60.5       4 * 840      10

[Figure 2 labels: panels A, B, and C; tree levels 0 to 4; cluster nodes 1 to 3.]

Figure 3: Comparison of observed wild-type frequency versus
coverage in the data set. Locations of known SNPs from dbSNP are
indicated in black.
Results
Storage
Without using additional compression within the database, the
storage volume inside the database compared to the original BAM
files increased by a factor of 2.75. These increased storage
requirements can be attributed to the following factors:

1.  Due to the nature of the index, as more data gets loaded into the
system we see more and more reads straddling tree cell
boundaries; the resulting overhead is bounded by an overall factor of 2.
2.  For simplicity, we store the actual sequence and associated
quality scores as simple character sequences, thereby using a
less efficient representation than the original BAM files.
3.  Indexing and general system metadata give rise to additional
overhead in comparison to the original data files.

Query Times
Due to the nature of the index, query times scale linearly with the
number of samples and the size of the genomic region requested.

Figure 4: Detailed benchmarking of query times as more and more
data gets loaded into the system. Because partitions are created
proportionally to the total amount of data, query results are returned
at a rate effectively independent of the overall data volume in the
system. Data shown are restricted to chromosome 20.
Analysis
As pointed out by DePristo et al. (2011) [5], multi-sample low-pass
resequencing poses a major challenge for variant discovery and
genotyping due to the limited amount of experimental evidence that
is available at any particular locus in the genome for any given
sample. Hence, to obtain higher confidence it is necessary to
analyze measured variations across samples. In a traditional file-
based bioinformatics pipeline, one would need to find the BAM files
created for each sample, extract the relevant reads, and then
consolidate them into an analysis data set for statistical analysis.

By using a multi-dimensional index, we were able to extract such a
data set using a simple SQL query. For example, querying across a
consecutive group of samples translates into a polygonal intersection
query. The following SQL query extracts reads for the region
spanning positions 126000 to 1126000 on chromosome 20 for the
samples with identifiers 30 to 1000.

select * from tgd.bam as b where
b.genome.ST_intersects(ST_Geometry('POLYGON((20000126000.0 30.9, 20001126000.0 30.9, 20001126000.0 1000.1, 20000126000.0 1000.1, 20000126000.0 30.9))'))
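A query of this shape can be generated from a region and a sample range. The Python sketch below is an illustration only: the function name and its parameters are hypothetical, and only the table name, column name, and SQL functions are copied from the query above. The poster's query uses a lower bound of 30.9; the sketch instead pads both sample bounds by 0.1 so that the first and last sample fall inside the rectangle, which is an assumption about the intent.

```python
CHROM_STRIDE = 1_000_000_000  # from X = C * 1,000,000,000 + P


def region_query(chrom, start, end, first_sample, last_sample, pad=0.1):
    """Build the rectangle query for one genomic region and a sample range."""
    x0 = float(int(chrom) * CHROM_STRIDE + int(start))
    x1 = float(int(chrom) * CHROM_STRIDE + int(end))
    y0 = int(first_sample) - pad   # just below the first sample row
    y1 = int(last_sample) + pad    # just above the last sample row
    # Closed ring: the first corner is repeated at the end.
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]
    ring = ", ".join(f"{x:.1f} {y:.1f}" for x, y in corners)
    return (
        "select * from tgd.bam as b where "
        f"b.genome.ST_intersects(ST_Geometry('POLYGON(({ring}))'))"
    )
```

Calling region_query(20, 126000, 1126000, 30, 1000) yields the same x-coordinates as the query above, with sample bounds 29.9 and 1000.1.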

We then used the information contained in the CIGAR string
associated with each read to determine where mismatches had
been observed against the genomic reference during the alignment
process. These individual mismatches were then tabulated and
summarized. See Figure 3 for an example of such a summarization.
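The tabulation step can be sketched as follows. This Python example is not the analysis code used for the poster. It assumes that alignments carry the extended CIGAR operations "=" (match) and "X" (mismatch); a plain "M" operation does not distinguish the two, in which case the mismatch positions would have to come from another source such as the reference sequence. The function names are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def tabulate(reads):
    """Count coverage and mismatches per reference position.

    reads: iterable of (pos, cigar) pairs with 1-based leftmost position.
    """
    coverage, mismatches = Counter(), Counter()
    for pos, cigar in reads:
        ref = pos
        for n, op in CIGAR_OP.findall(cigar):
            n = int(n)
            if op in "M=X":                 # aligned bases cover the reference
                for p in range(ref, ref + n):
                    coverage[p] += 1
                    if op == "X":
                        mismatches[p] += 1
                ref += n
            elif op in "DN":                # deletion or skip: advance only
                ref += n
            # I, S, H, P do not consume reference positions
    return coverage, mismatches


def wild_type_fraction(coverage, mismatches):
    """Fraction of covering reads that agree with the reference, per position."""
    return {p: 1 - mismatches[p] / c for p, c in coverage.items()}
```

The resulting per-position values correspond to the kind of wild-type frequency versus coverage comparison shown in Figure 3.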


[Figure 4 plot: "Data Usage and Query Performance". X axis: Loaded Data (Gb). Y axes: Load Performance (krecs/sec), Query Performance (sec), Memory Usage in scdb (Gb), Num kPartitions. Legend: size in scdb, num kPartitions, Query krecs/sec.]
[Figure 3 plot: X axis ticks from 400000 to 402000. Series: coverage, wt/coverage, snp_loc.]

Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomdiscovermytutordmt
 
Call Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night Enjoy
Call Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night EnjoyCall Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night Enjoy
Call Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night Enjoybabeytanya
 
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...CALL GIRLS
 
Russian Escorts Girls Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls Delhi
Russian Escorts Girls  Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls DelhiRussian Escorts Girls  Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls Delhi
Russian Escorts Girls Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls DelhiAlinaDevecerski
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escortsvidya singh
 

Recently uploaded (20)

(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
 
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
 
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on DeliveryCall Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
Call Girls Colaba Mumbai ❤️ 9920874524 👈 Cash on Delivery
 
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls ServiceKesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
Kesar Bagh Call Girl Price 9548273370 , Lucknow Call Girls Service
 
Chandrapur Call girls 8617370543 Provides all area service COD available
Chandrapur Call girls 8617370543 Provides all area service COD availableChandrapur Call girls 8617370543 Provides all area service COD available
Chandrapur Call girls 8617370543 Provides all area service COD available
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
 
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
 
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Kochi Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Kochi Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Kochi Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Kochi Just Call 9907093804 Top Class Call Girl Service Available
 
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
 
Call Girls Cuttack Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Cuttack Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Cuttack Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Cuttack Just Call 9907093804 Top Class Call Girl Service Available
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
 
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel roomLucknow Call girls - 8800925952 - 24x7 service with hotel room
Lucknow Call girls - 8800925952 - 24x7 service with hotel room
 
Call Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night Enjoy
Call Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night EnjoyCall Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night Enjoy
Call Girl Number in Vashi Mumbai📲 9833363713 💞 Full Night Enjoy
 
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
Call Girls Service Surat Samaira ❤️🍑 8250192130 👄 Independent Escort Service ...
 
Russian Escorts Girls Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls Delhi
Russian Escorts Girls  Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls DelhiRussian Escorts Girls  Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls Delhi
Russian Escorts Girls Nehru Place ZINATHI 🔝9711199012 ☪ 24/7 Call Girls Delhi
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
 

Interactive Analysis of Large-Scale Genomics Data

  • 1. Interactive Analysis of Large-Scale Sequencing Genomics Data Sets using a Real-Time Distributed Database
Hans-Martin Will, Ph.D., Dominic Suciu, Ph.D., SpaceCurve, Seattle, WA, USA
SpaceCurve | 710 Second Avenue #620, Seattle, WA 98104 | spacecurve.com | hm@spacecurve.com

Conclusions
We have shown the feasibility of storing high-throughput sequencing data fully schematized and indexed at the level of individual reads within a next-generation Big Data platform, thereby providing an attractive alternative to file-based informatics pipelines. While we conducted our study within a single framework for variation analysis, we believe that the ability to quickly create analysis data sets across studies and genomic regions applies to many other analysis methodologies as well. We see the following areas for future work:
1.) Improving the overall storage efficiency using better data representation and compression techniques.
2.) Providing a more natural representation of genomic locations at the SQL level.

Introduction
With the arrival of the $1,000 human genome, the dream and promise of applying high-throughput sequencing (HTS) to large populations has come true. However, as described elsewhere [1], the complexity and cost of managing the generated data and of performing the computation needed for their analysis and interpretation have become the major challenge and the dominating cost factor. Currently, the most workable approach to handling these data sets is to organize sequencing data as many individual data files stored in distributed storage clusters or cloud storage. Bioinformatics pipelines that analyze those data sets are then implemented using distributed file-based batch-processing engines, such as Apache Hadoop (MapReduce) or the Open Grid Scheduler (formerly Sun Grid Engine). In such an approach, the turnaround time of creating and running new batch jobs becomes the limiting factor during data analysis.
Ultimately, this leads to a situation where researchers find themselves hampered in their ability to ask “what if?” questions, not for lack of creativity or data but because of sheer processing time and inconvenience.

In recent years, driven by the needs of mobile analytics and the Internet of Things, a new generation of Big Data technology has been developed that provides multi-dimensional indexing capabilities at scale. Examples are Postgres-XC [2], SAP HANA with Spatial Processing [3], and the SpaceCurve data platform [4] used here. In this poster, we describe our experience of applying such a next-generation storage engine to the problem of characterizing genetic variation in high-throughput sequencing data. Our particular analysis is motivated by the framework described in DePristo et al. (2011) [5] and aims to model the relevant data access patterns.

Materials and Methods

Data
For our study we chose the 1000 Genomes data set [6] published via Amazon Web Services (AWS) [7]. Specifically, we started with the complete set of aligned reads, which are provided as BAM [8] files across all 1092 individuals included in the study. Those BAM files amount to 35 TB of data.

Spatial Mapping
We use database capabilities defined by the spatial extensions to the SQL standard [9]. Each aligned read is mapped into a 2-dimensional space and represented as a SQL LINESTRING object. The x-axis used for this mapping is a linearization of all chromosomes making up the genome. That is, location P on chromosome C is mapped to the x-coordinate X = C * 1,000,000,000 + P. The y-coordinate is the identifier of the sample, Y = <sample_id>. See Figure 1 for an illustration of this 2-dimensional coordinate system. The rationale behind this indexing strategy is that variant analysis requires quickly extracting specific genomic regions for a given set of individuals.
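The coordinate mapping described under Spatial Mapping can be illustrated with a short Python sketch. It is not the conversion tool used in the study. The function names, the assumption that chromosomes are identified by integers, and the choice to place both endpoints of a read on the same sample row are illustrative assumptions; only the formula X = C * 1,000,000,000 + P and Y = <sample_id> come from the poster.

```python
# Sketch of the 2-D mapping: X = C * 1,000,000,000 + P, Y = sample id.
# All names here are hypothetical; only the formula is from the poster.

CHROMOSOME_STRIDE = 1_000_000_000  # x-axis width reserved per chromosome


def genome_x(chromosome: int, position: int) -> int:
    """Linearize (chromosome, position) into a single x-coordinate."""
    if not 0 <= position < CHROMOSOME_STRIDE:
        raise ValueError("position does not fit inside one chromosome stride")
    return chromosome * CHROMOSOME_STRIDE + position


def split_x(x: int) -> tuple[int, int]:
    """Invert genome_x: recover (chromosome, position) from x."""
    return divmod(x, CHROMOSOME_STRIDE)


def read_linestring(chromosome: int, start: int, end: int, sample_id: int) -> str:
    """Render an aligned read as WKT: a horizontal segment on the sample's row."""
    x0 = genome_x(chromosome, start)
    x1 = genome_x(chromosome, end)
    return f"LINESTRING({x0} {sample_id}, {x1} {sample_id})"


if __name__ == "__main__":
    print(genome_x(20, 126000))
    print(read_linestring(1, 100, 200, 3))
```

Because every chromosome occupies its own stride of the x-axis, a contiguous genomic region on one chromosome is always a contiguous x-interval.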
Using our mapping, such data requests correspond to range queries that a multi-dimensional index can answer efficiently.

Figure 1: Illustration of the two-dimensional coordinate system used to index mapped reads across sample identifier and genomic location.

Data Preparation
The SpaceCurve data platform ingests and returns data in the form of streams of JSON objects. We developed a conversion tool using the C++ programming language and the SeqAn library [10] to convert BAM files into JSON for data ingestion. Because it is written in C++, the conversion is fast enough to run as a filter step between reading the BAM files from disk and transferring them directly into the SpaceCurve data platform.

References
[1] Sboner, A. et al. The real cost of sequencing: higher than you think! Genome Biology, 12:125 (2011)
[2] http://postgresxc.wikia.com/wiki/Postgres-XC_Wiki
[3] http://www.saphana.com/community/about-hana/advanced-analytics/spatial-processing
[4] http://spacecurve.com
[5] DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43:5, 491-498 (2011)
[6] The 1000 Genomes Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073 (2010)
[7] http://s3.amazonaws.com/1000genomes
[8] http://samtools.sourceforge.net
[9] ISO/IEC 13249-3:2011 SQL Multimedia and Application Packages, Part 3: Spatial (2011)
[10] Döring, A., Weese, D., Rausch, T. and Reinert, K. SeqAn: an efficient, generic C++ library for sequence analysis. BMC Bioinformatics, 9:11 (2008)
[11] Rogers, J. A. Spatial Sieve Tree, US Patent US7734714 B2 (2010)

Materials and Methods (cont.)

System
The SpaceCurve data platform employs a novel spatial indexing data structure called the Spatial Sieve Tree [11]. A Spatial Sieve Tree is a multi-level, multidimensional tree structure based on successive division of a root partition comprising the overall data space.
Spatial Sieve Trees can be implemented in a distributed fashion, thereby providing an efficient locality-aware mechanism for sharding data. See Figure 2 for an illustration of these concepts.

Figure 2: Illustration of the Spatial Sieve Tree: A shows the hierarchical structure of the tree. B illustrates the sieving process. C shows how a tree can be distributed across cluster nodes.

The SpaceCurve data platform was deployed as a small cluster configuration on AWS EC2. We used a group of 3 storage nodes, each using 32 TB of disk storage, managed by an additional master node. On each storage node we ran 4 storage management processes, each responsible for a data volume spanning 4 physical drives in a striped configuration. Queries were issued from a group of 20 perimeter nodes, which provided an HTTP end-point for issuing ingest and query commands to the system and were also used to run the data conversion and analysis tools. All machines of the cluster had 10-gigabit Ethernet ports and were located within a single placement group. Table 1 summarizes the overall cluster configuration.

Table 1: Cluster configuration used for our analysis

Node Type   Count  EC2 Type     Cores  RAM [GB]  Disk [GB]  Network [Gbit/sec]
Storage     3      hs1.8xlarge  16     117       24 * 2048  10
Master      1      hs1.8xlarge  16     117       24 * 2048  10
Perimeter   20     cc2.8xlarge  32     60.5      4 * 840    10

[Figure 1 labels: x-axis X with Chromosome 1 and Chromosome 2 (ticks 1, 1,000,000,000, 2,000,000,000); y-axis Y with Sample 1 to Sample 4 (ticks 1 to 4)]
[Figure 2 labels: panels A, B, C; tree Level 0 to Level 4; cluster Node 1 to Node 3]

Figure 3: Comparison of observed wild-type frequency versus coverage in the data set. Locations of known SNPs from dbSNP are indicated in black.

Results

Storage
Without additional compression within the database, the storage volume inside the database increased by a factor of 2.75 compared to the original BAM files.
These increased storage requirements can be attributed to the following factors:
1. Due to the nature of the index, as more data is loaded into the system, more and more reads straddle tree cell boundaries; this effect is bounded by an overall factor of 2.
2. For simplicity, we store the actual sequence and associated quality scores as simple character sequences, which is a less efficient representation than that of the original BAM files.
3. Indexing and general system metadata add overhead in comparison to the original data files.

Query Times
Due to the nature of the index, query times scale linearly with the number of samples and the size of the genomic region requested.

Figure 4: Detailed benchmarking of query times as more and more data is loaded into the system. Because partitions are created proportionally to the total amount of data, query results are returned at a rate effectively independent of the overall data volume in the system. Data shown is restricted to Chromosome 20.

Analysis
As pointed out by DePristo et al. (2011) [5], multi-sample low-pass resequencing poses a major challenge for variant discovery and genotyping because only limited experimental evidence is available at any particular locus in the genome for any given sample. Hence, to obtain higher confidence it is necessary to analyze measured variations across samples. In a traditional file-based bioinformatics pipeline, one would need to find the BAM files created for each sample, extract the relevant reads, and then consolidate them into an analysis data set for statistical analysis. By using a multi-dimensional index, we were able to extract such a data set using a simple SQL query. For example, querying across a consecutive group of samples translates into a polygonal intersection query.
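To make the geometry of such a request concrete, the following Python sketch assembles the WKT rectangle for a genomic region and a consecutive range of samples, using the x-axis linearization from the Spatial Mapping section. The function name and the padding of 0.1 around the sample rows are assumptions made for illustration; with sample bounds 31 and 1000 the padding happens to yield the y-values 30.9 and 1000.1.

```python
# Sketch: build a WKT rectangle covering a genomic region and a sample range.
# The function name and the 0.1 padding are hypothetical choices.

CHROMOSOME_STRIDE = 1_000_000_000  # from X = C * 1,000,000,000 + P


def region_polygon_wkt(chromosome: int, start: int, end: int,
                       first_sample: int, last_sample: int,
                       pad: float = 0.1) -> str:
    """Return a closed POLYGON ring around [start, end] x [first, last] samples."""
    x_min = float(chromosome * CHROMOSOME_STRIDE + start)
    x_max = float(chromosome * CHROMOSOME_STRIDE + end)
    y_min = first_sample - pad   # pad so rows on the boundary fall inside
    y_max = last_sample + pad
    corners = [
        (x_min, y_min),
        (x_max, y_min),
        (x_max, y_max),
        (x_min, y_max),
        (x_min, y_min),          # WKT rings repeat the first point to close
    ]
    ring = ", ".join(f"{x:.1f} {y:.1f}" for x, y in corners)
    return f"POLYGON(({ring}))"


if __name__ == "__main__":
    print(region_polygon_wkt(20, 126000, 1126000, 31, 1000))
```

The resulting string can be passed to a geometry constructor in a spatial SQL dialect; which constructor and intersection predicate are available depends on the database in use.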
The following SQL query extracts reads for the region spanning position 126000 to 1126000 on chromosome 20 for the samples with identifier 30 to 1000.

    select * from tgd.bam as b where b.genome.ST_intersects(ST_Geometry('POLYGON((20000126000.0 30.9, 20001126000.0 30.9, 20001126000.0 1000.1, 20000126000.0 1000.1, 20000126000.0 30.9))'))

We then used the information contained in the CIGAR string associated with each read to determine where mismatches against the genomic reference had been observed during the alignment process. These individual mismatches were then tabulated and summarized. See Figure 3 for an example of such a summarization.

[Figure 4 plot: "Data Usage and Query Performance"; x-axis: Loaded Data (Gb); plotted quantities: Load Performance (krecs/sec), Query Performance (sec), Memory Usage in scdb (Gb), Num kPartitions]
[Figure 3 plot: series coverage, wt/coverage, snp_loc]
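The CIGAR-based mismatch tabulation described in the Analysis section might look like the Python sketch below. It rests on an important assumption that the poster does not state: the alignments use the extended CIGAR operations "=" (sequence match) and "X" (sequence mismatch). A plain "M" operation does not distinguish matches from mismatches, so reads encoded that way would need additional information. The function names and the (start, cigar) input format are hypothetical.

```python
# Sketch: locate and count mismatches from extended CIGAR strings.
# Assumes "=" / "X" operations are present; names are hypothetical.
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = frozenset("MDN=X")  # operations that advance the reference


def mismatch_positions(ref_start: int, cigar: str) -> list[int]:
    """Return reference positions covered by "X" operations in one read."""
    position = ref_start
    mismatches = []
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "X":
            mismatches.extend(range(position, position + n))
        if op in REF_CONSUMING:
            position += n
    return mismatches


def tabulate_mismatches(reads) -> Counter:
    """Count mismatches per reference position over (ref_start, cigar) pairs."""
    counts = Counter()
    for ref_start, cigar in reads:
        counts.update(mismatch_positions(ref_start, cigar))
    return counts


if __name__ == "__main__":
    reads = [(100, "5=1X4="), (103, "2=1X3=")]
    print(tabulate_mismatches(reads))
```

Dividing such per-position counts by the read coverage at each position would give a mismatch frequency of the kind a coverage comparison needs.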