SlideShare una empresa de Scribd logo
1 de 1
Background
The past decade has seen a significant increase in high-
throughput experimental studies that catalog variant
datasets using massively parallel sequencing experiments.
New insights of biological significance can be gained by this
information with multiple genomic locations based
annotations. However, efforts to obtain this information by
integrating and mining variant data have had limited
success so far and there has yet to be a method developed
that can be scalable, practical and applied to millions of
variants and their related annotations. We explored the use
of graph data structures as a proof of concept for scalable
interpretation of the impact of variant related data.
Methods and Materials Results
Leveraging Graph Data Structures for Variant Data and Related Annotations
Chris Zawora1, Jesse Milzman1, Yatpang Cheung1, Akshay Bhushan1, Michael S. Atkins2, Hue Vuong3, F. Pascal Girard2, Uma Mudunuri3
1Georgetown University, Washington, DC, 2FedCentric Technologies, LLC, Fairfax, VA, 3Frederick National Laboratory for Cancer Research, Frederick, MD
Conclusion
References
The 1000 Genomes Project Consortium. (2010). A map of human genome
variation from population-scale sequencing. Nature, 467, 1061–1073.
Gregg, B. (2014). Systems performance: enterprise and the cloud.
Pearson Hall: Upper Saddle River, NJ.
Acknowledgments
FedCentric acknowledges Frank D’Ippolito, Shafigh Mehraeen, Margreth Mpossi,
Supriya Nittoor, and Tianwen Chu for their invaluable assistance with this project.
This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of
Health, under contract HHSN261200800001E. The content of this publication does not necessarily reflect the views or
policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or
organizations imply endorsement by the U.S. Government.
Contract HHSN261200800001E - Funded by the National Cancer Institute
DEPARTMENT OF HEALTH AND HUMAN SERVICES • National Institutes of Health • National Cancer Institute
Frederick National Laboratory is a Federally Funded Research and Development Center
operated by Leidos Biomedical Research, Inc., for the National Cancer Institute
Introduction
Traditional approaches of data mining and integration in the
research field have relied on relational databases or
programming for deriving dynamic insights from research
related data. However, as more next generation sequencing
(NGS) becomes available, these approaches limit the
exploration of certain hypothesis. One such limitation is the
mining of variant data from publicly available databases
such as the 1000 genomes project and TCGA.
Although there are applications available for quickly finding
the public data with a certain set of variants or for finding
minor allele frequencies, there is no such application that
can be applied generically across all the projects allowing
researchers to globally mine and find patterns that would be
applicable to their specific research interests.
In this pilot project, we have investigated whether graph
database structures are applicable for mining variants from
individuals and populations in a scalable manner and
understanding their impact by integrating with known
annotations.
Phase I: As an initial evaluation of the graph structures we ran
several simple queries, also feasible through a relational
architecture, and measured performance speeds.
Simple Query Examples
• Get all information for a single variant
• Find annotations within a range of genomic locations
• Find variants associated with specific clinical phenotypes
Performance speeds
• Query times in milliseconds
• Better or equal to relational database query times
Queries
• Developed a new SQL-like query language called SparkQL
• Eases writing queries for non-programmatic users
Ingestion Times
•Slower than expected
•Sparksee is a multi-thread single write database
•Writes one node/edge at a time
•Each write involves creating connections with existing nodes
•Slows down as the graph size increases
Solution: Implement multi-threaded insertions in
combination with internal data structures to efficiently find
nodes and create edges
High Degree Vertices:
•Nodes with millions of edges
•Stored in a non-distributed list like format
•Searches for a specific edge might be slow
Example: nodes representing individuals with millions of
variants
Solution: Explore other graph clustering approaches that
can essentially condense the information presented
Phase II: We explored complex patterns and clusters inside the
graph and spectral clustering queries that were not feasible
through the relational architecture.
Complex Query Examples
• Compare variant profiles and find individuals that are closely
related
• Compare annotation profiles to find clusters of populations
Phase II Results
• Eight populations with 25 individuals from each population
• Strong eigenvalue support (near zero) for 3 main clusters
• Cluster pattern supported by population genetics (Fig. 4)
Performance speeds
• Spectral clustering took ca. 2 minutes
.
Our results indicate that a graph database, run on
an in-memory machine, can be a very powerful and
useful tool for cancer research. Performance using
the graph database for finding details on specific
nodes or a list of nodes is better or equal to a well-
architected relational database. We also see
promising initial results for identifying correlations
between genetic changes and specific phenotype
conditions.
We conclude that an in-memory graph
database would allow researchers to run known
queries while also providing the opportunity to
develop algorithms to explore complex correlations.
Graph models are useful for mining complex data
sets and could prove essential in the development
and implementation of tools aiding precision
medicine.
Data
• SNPs from 1000 genomes project
• Phenotype conditions from ClinVar
• Gene mappings & mRNA transcripts from Entrez Gene
• Amino acid changes from UniProt
The Graph
• Variants and annotations mapped to reference genomic
locations (Fig. 3)
• Includes all chromosomes and genomic locations
• 180 million nodes and 12 billion edges.
Graph Architecture
• Sparsity Technologies’ Sparsity Graph database
• API supports C, C++, C#, Python, and Java
• Implements graph and its attributes as maps and sparse
bitmap structures
• Allows it to scale with very limited memory requirements.
Hardware
• FedCentric Labs’ SGI UV300 system: x86/Linux system,
scales up to 1,152 cores & 64TB of memory
• Data in memory, very low latency, high performance (Fig. 2)
Fig. 2: Latency matters
Fig. 3: The Graph Model
Fig. 1: Graphs handle data complexity intuitively and interactively
Fig. 4: Results of spectral clustering of 1000 Genomes data

Más contenido relacionado

La actualidad más candente

Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkPaul Groth
 
Machines are people too
Machines are people tooMachines are people too
Machines are people tooPaul Groth
 
Cancer Analytics Poster
Cancer Analytics PosterCancer Analytics Poster
Cancer Analytics PosterMichael Atkins
 
API-Centric Data Integration for Human Genomics Reference Databases: Achieve...
 API-Centric Data Integration for Human Genomics Reference Databases: Achieve... API-Centric Data Integration for Human Genomics Reference Databases: Achieve...
API-Centric Data Integration for Human Genomics Reference Databases: Achieve...Genomika Diagnósticos
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
Model Organism Linked Data
Model Organism Linked DataModel Organism Linked Data
Model Organism Linked DataMichel Dumontier
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 
Next-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalNext-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalWaqas Tariq
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicinePaul Groth
 
Role of Data Accessibility During Pandemic
Role of Data Accessibility During PandemicRole of Data Accessibility During Pandemic
Role of Data Accessibility During PandemicDatabricks
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?Paul Groth
 
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental MetadataMaking it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental MetadataMichel Dumontier
 
Machine learning in biology
Machine learning in biologyMachine learning in biology
Machine learning in biologyPranavathiyani G
 
Tools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesTools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesValery Tkachenko
 
CEDAR work bench for metadata management
CEDAR work bench for metadata managementCEDAR work bench for metadata management
CEDAR work bench for metadata managementPistoia Alliance
 
is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.Chris Evelo
 

La actualidad más candente (20)

Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
Cancer Analytics Poster
Cancer Analytics PosterCancer Analytics Poster
Cancer Analytics Poster
 
API-Centric Data Integration for Human Genomics Reference Databases: Achieve...
 API-Centric Data Integration for Human Genomics Reference Databases: Achieve... API-Centric Data Integration for Human Genomics Reference Databases: Achieve...
API-Centric Data Integration for Human Genomics Reference Databases: Achieve...
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
Model Organism Linked Data
Model Organism Linked DataModel Organism Linked Data
Model Organism Linked Data
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Next-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalNext-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information Retrieval
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
RSC ChemSpider as an environment for teaching and sharing chemistry
RSC ChemSpider as an environment for teaching and sharing chemistryRSC ChemSpider as an environment for teaching and sharing chemistry
RSC ChemSpider as an environment for teaching and sharing chemistry
 
Role of Data Accessibility During Pandemic
Role of Data Accessibility During PandemicRole of Data Accessibility During Pandemic
Role of Data Accessibility During Pandemic
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
 
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental MetadataMaking it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
Making it Easier, Possibly Even Pleasant, to Author Rich Experimental Metadata
 
Machine learning in biology
Machine learning in biologyMachine learning in biology
Machine learning in biology
 
Tools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databasesTools and approaches for data deposition into nanomaterial databases
Tools and approaches for data deposition into nanomaterial databases
 
CEDAR work bench for metadata management
CEDAR work bench for metadata managementCEDAR work bench for metadata management
CEDAR work bench for metadata management
 
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
Embracing Semantic Technology for Better Metadata Authoring in Biomedicine (S...
 
is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.is there life between standards? Data interoperability for AI.
is there life between standards? Data interoperability for AI.
 
Bioinformatics Projects And Applications
Bioinformatics Projects And ApplicationsBioinformatics Projects And Applications
Bioinformatics Projects And Applications
 

Similar a FedCentric_Presentation

Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...Robert Grossman
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe
 
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...Amit Sheth
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Yasel Cruz
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Databricks
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018Alexander Pico
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemMaryann Martone
 
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...Syed Ahmad Chan Bukhari, PhD
 
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016Jisc
 
Knowledge Management in the AI Driven Scintific System
Knowledge Management in the AI Driven Scintific SystemKnowledge Management in the AI Driven Scintific System
Knowledge Management in the AI Driven Scintific SystemSubhasis Dasgupta
 
CINECA webinar slides: Making cohort data FAIR
CINECA webinar slides: Making cohort data FAIRCINECA webinar slides: Making cohort data FAIR
CINECA webinar slides: Making cohort data FAIRCINECAProject
 
Research Statement Chien-Wei Lin
Research Statement Chien-Wei LinResearch Statement Chien-Wei Lin
Research Statement Chien-Wei LinChien-Wei Lin
 
Recognising data sharing
Recognising data sharingRecognising data sharing
Recognising data sharingJisc RDM
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceCarole Goble
 

Similar a FedCentric_Presentation (20)

Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
How Data Commons are Changing the Way that Large Datasets Are Analyzed and Sh...
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
Semantic Web & Web 3.0 empowering real world outcomes in biomedical research ...
 
Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244Nucl. Acids Res.-2014-Howe-nar-gku1244
Nucl. Acids Res.-2014-Howe-nar-gku1244
 
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
 
NRNB Annual Report 2018
NRNB Annual Report 2018NRNB Annual Report 2018
NRNB Annual Report 2018
 
KnetMiner Overview Oct 2017
KnetMiner Overview Oct 2017KnetMiner Overview Oct 2017
KnetMiner Overview Oct 2017
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystem
 
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
MiAIRR:Minimum information about an Adaptive Immune Receptor Repertoire Seque...
 
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016
 
Knowledge Management in the AI Driven Scintific System
Knowledge Management in the AI Driven Scintific SystemKnowledge Management in the AI Driven Scintific System
Knowledge Management in the AI Driven Scintific System
 
CINECA webinar slides: Making cohort data FAIR
CINECA webinar slides: Making cohort data FAIRCINECA webinar slides: Making cohort data FAIR
CINECA webinar slides: Making cohort data FAIR
 
Research Statement Chien-Wei Lin
Research Statement Chien-Wei LinResearch Statement Chien-Wei Lin
Research Statement Chien-Wei Lin
 
Recognising data sharing
Recognising data sharingRecognising data sharing
Recognising data sharing
 
Cri big data
Cri big dataCri big data
Cri big data
 
Martone grethe
Martone gretheMartone grethe
Martone grethe
 
Being FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data ScienceBeing FAIR: Enabling Reproducible Data Science
Being FAIR: Enabling Reproducible Data Science
 

FedCentric_Presentation

  • 1. Background The past decade has seen a significant increase in high- throughput experimental studies that catalog variant datasets using massively parallel sequencing experiments. New insights of biological significance can be gained by this information with multiple genomic locations based annotations. However, efforts to obtain this information by integrating and mining variant data have had limited success so far and there has yet to be a method developed that can be scalable, practical and applied to millions of variants and their related annotations. We explored the use of graph data structures as a proof of concept for scalable interpretation of the impact of variant related data. Methods and Materials Results Leveraging Graph Data Structures for Variant Data and Related Annotations Chris Zawora1, Jesse Milzman1, Yatpang Cheung1, Akshay Bhushan1, Michael S. Atkins2, Hue Vuong3, F. Pascal Girard2, Uma Mudunuri3 1Georgetown University, Washington, DC, 2FedCentric Technologies, LLC, Fairfax, VA, 3Frederick National Laboratory for Cancer Research, Frederick, MD Conclusion References The 1000 Genomes Project Consortium. (2010). A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073. Gregg, B. (2014). Systems performance: enterprise and the cloud. Pearson Hall: Upper Saddle River, NJ. Acknowledgments FedCentric acknowledges Frank D’Ippolito, Shafigh Mehraeen, Margreth Mpossi, Supriya Nittoor, and Tianwen Chu for their invaluable assistance with this project. This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. Contract HHSN261200800001E - Funded by the National Cancer Institute DEPARTMENT OF HEALTH AND HUMAN SERVICES • National Institutes of Health • National Cancer Institute Frederick National Laboratory is a Federally Funded Research and Development Center operated by Leidos Biomedical Research, Inc., for the National Cancer Institute Introduction Traditional approaches of data mining and integration in the research field have relied on relational databases or programming for deriving dynamic insights from research related data. However, as more next generation sequencing (NGS) becomes available, these approaches limit the exploration of certain hypothesis. One such limitation is the mining of variant data from publicly available databases such as the 1000 genomes project and TCGA. Although there are applications available for quickly finding the public data with a certain set of variants or for finding minor allele frequencies, there is no such application that can be applied generically across all the projects allowing researchers to globally mine and find patterns that would be applicable to their specific research interests. In this pilot project, we have investigated whether graph database structures are applicable for mining variants from individuals and populations in a scalable manner and understanding their impact by integrating with known annotations. Phase I: As an initial evaluation of the graph structures we ran several simple queries, also feasible through a relational architecture, and measured performance speeds. Simple Query Examples • Get all information for a single variant • Find annotations within a range of genomic locations • Find variants associated with specific clinical phenotypes Performance speeds • Query times in milliseconds • Better or equal to relational database query times Queries • Developed a new SQL-like query language called SparkQL • Eases writing queries for non-programmatic users Ingestion Times •Slower than expected •Sparksee is a multi-thread single write database •Writes one node/edge at a time •Each write involves creating connections with existing nodes •Slows down as the graph size increases Solution: Implement multi-threaded insertions in combination with internal data structures to efficiently find nodes and create edges High Degree Vertices: •Nodes with millions of edges •Stored in a non-distributed list like format •Searches for a specific edge might be slow Example: nodes representing individuals with millions of variants Solution: Explore other graph clustering approaches that can essentially condense the information presented Phase II: We explored complex patterns and clusters inside the graph and spectral clustering queries that were not feasible through the relational architecture. Complex Query Examples • Compare variant profiles and find individuals that are closely related • Compare annotation profiles to find clusters of populations Phase II Results • Eight populations with 25 individuals from each population • Strong eigenvalue support (near zero) for 3 main clusters • Cluster pattern supported by population genetics (Fig. 4) Performance speeds • Spectral clustering took ca. 2 minutes . Our results indicate that a graph database, run on an in-memory machine, can be a very powerful and useful tool for cancer research. Performance using the graph database for finding details on specific nodes or a list of nodes is better or equal to a well- architected relational database. We also see promising initial results for identifying correlations between genetic changes and specific phenotype conditions. We conclude that an in-memory graph database would allow researchers to run known queries while also providing the opportunity to develop algorithms to explore complex correlations. Graph models are useful for mining complex data sets and could prove essential in the development and implementation of tools aiding precision medicine. Data • SNPs from 1000 genomes project • Phenotype conditions from ClinVar • Gene mappings & mRNA transcripts from Entrez Gene • Amino acid changes from UniProt The Graph • Variants and annotations mapped to reference genomic locations (Fig. 3) • Includes all chromosomes and genomic locations • 180 million nodes and 12 billion edges. Graph Architecture • Sparsity Technologies’ Sparsity Graph database • API supports C, C++, C#, Python, and Java • Implements graph and its attributes as maps and sparse bitmap structures • Allows it to scale with very limited memory requirements. Hardware • FedCentric Labs’ SGI UV300 system: x86/Linux system, scales up to 1,152 cores & 64TB of memory • Data in memory, very low latency, high performance (Fig. 2) Fig. 2: Latency matters Fig. 3: The Graph Model Fig. 1: Graphs handle data complexity intuitively and interactively Fig. 4: Results of spectral clustering of 1000 Genomes data