API-Centric Data Integration for Human Genomics Reference Databases: Achievements, Lessons Learned and Challenges
X-Meeting 2015
Authors: Jamisson Freitas, Marcel Caraciolo, Victor Diniz, Rodrigo Alexandre and João Bosco Oliveira
NGS Applications
Genomika Diagnósticos | Rua Senador José Henrique, 224, Alfred Nobel, Sala 1301 | Recife, PE | Brazil
bioinfo@genomika.com.br | genomika.com.br
MOTIVATION
Data integration is a central challenge in clinical genetics, where multiple heterogeneous databases span several domains and are presented in confusing formats without clear and common standards. In variant analysis for molecular diagnostics applications, one central task is to connect biological information to clinical data so that specialists can determine the potential impact of a variant associated with a disease [1, 2].
This task requires the flexible assembly of continuously curated, tailored data sets, without wasting biologists' and geneticists' time searching several databases individually online and then parsing, cleaning and integrating those data in complex spreadsheets.
We are building a platform that leverages Linked Data to provide integrated access to bioinformatics databases such as OMIM and ClinVar through a common and well-defined interface.
Our assumption is that exposing those datasets via Application Programming Interfaces (APIs) facilitates data access from several sources into a big data infrastructure, which provides integrated access to information covering biology, carrier testing, variant analysis and literature mining.
Lessons Learned
NEW SOURCE CONSUMPTION: the growing number of databases versus the variability of their schemata. To tackle this, we designed a global schema, using meta-modeling concepts to abstract the data fields and values.
DISTRIBUTED AGGREGATION: aggregating the facets that share the same key calls for novel approaches (illustrated below). Good solutions: NoSQL databases (Cassandra) and a large-scale data processing engine built on MapReduce concepts (Spark) [3, 4].
Loading several databases and their related versions requires a replication/distribution policy for the database engine. Some data engines have achieved great results here by using a distributed strategy for partitioning the data.
RESTful APIs for exposing the data: they support several formats (XML, JSON), and there are frameworks available that work out of the box [5].
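A minimal sketch, in plain Python, of that shuffle-and-merge aggregation step; the names (facet_pairs, aggregate_by_key) are illustrative and only stand in for what Cassandra wide rows or Spark's reduceByKey perform at scale:

from collections import defaultdict

# (rowID, (DataField, facetValue)) pairs as emitted by each source,
# mirroring the aggregation figure further below.
facet_pairs = [
    ("rowA", ("DataFieldA", "facetValue1")),
    ("rowA", ("DataFieldB", "facetValue3")),
    ("rowB", ("DataFieldA", "facetValue2")),
    ("rowB", ("DataFieldB", "facetValue5")),
]

def aggregate_by_key(pairs):
    """Group every (DataField, facetValue) facet under its row key."""
    merged = defaultdict(list)
    for row_id, facet in pairs:
        merged[row_id].append(facet)
    return dict(merged)

print(aggregate_by_key(facet_pairs))
# {'rowA': [('DataFieldA', 'facetValue1'), ('DataFieldB', 'facetValue3')],
#  'rowB': [('DataFieldA', 'facetValue2'), ('DataFieldB', 'facetValue5')]}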
Challenges
The underlying datasets can change their schemata, so there is real intellectual complexity in developing fixes to the source data consumption.
Building new versions is constrained: the whole process requires bandwidth and demanding computing power, so how do we bound the number of fetching jobs running simultaneously? (One conventional option is sketched after this list.)
How do we deal with semantic mappings between datasets or depositories? What should the single integrated vocabulary be in order to identify possible relationships?
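One conventional way to bound concurrent fetch jobs, offered here only as a sketch and not as the policy our platform adopts, is a fixed-size worker pool; fetch_source is a hypothetical stand-in for a per-depository download job:

from concurrent.futures import ThreadPoolExecutor

SOURCES = ["OMIM", "ClinVar", "dbSNP", "COSMIC"]  # depositories to refresh

def fetch_source(name):
    # Hypothetical stand-in for a bandwidth-heavy download job.
    print("fetching", name)
    return name

# At most two fetch jobs run at once, capping bandwidth and CPU use.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(fetch_source, SOURCES))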
[Figure: distributed aggregation in the pipeline. Samples leave the sequencing machine with genomic positions and pass through the annotator; each source emits (rowID, (DataField, facetValue)) pairs such as (rowA, (DataFieldA, facetValue1)), which are shuffled by row key and merged into one record per row, e.g. (rowA, [(DFA, FV1), (DFB, FV3), (DFC, FV4), (DFD, FV7), (DFE, FV8), (DFF, FV9)]).]
OUR COLLABORATION
[Figure: from 2003 to 2015 the number of variants observed (on the order of 150,000,000) grew far faster than the number of variants we understand. Our genotype annotator draws on depositories such as ClinVar, dbSNP, UniProt, OMIM, NCBI Gene, 1000 Genomes and Depository N, and connects patient data to the ClinGen tool.]
DATA EXPOSURE
Each depository (e.g. ClinVar, OMIM, ..., Depository N) is versioned (1.0.0, 2.0.0, ...) and split into datasets (Genes, Phenotypes, ..., Dataset N). A source table such as

Gene Symbol | omim_id | ... | DataField N
ALDH        | 100650  | ... | Facet #1
APP         | 104760  | ... | Facet #n

is abstracted in the global schema as (rowID, DataField, DataFacet) records:

rowID | DataField   | ... | DataFacet
1     | Gene_Symbol | ... | ALDH
1     | OMIM_ID     | ... | 100650
2     | Gene_Symbol | ... | APP
2     | OMIM_ID     | ... | 104760
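A minimal sketch of that source-to-meta-model flattening, under our reading of the tables above (to_meta_model and source_rows are invented names):

def to_meta_model(rows, fields):
    """Flatten source records into (rowID, DataField, DataFacet) records."""
    records = []
    for row_id, record in enumerate(rows, start=1):
        for field in fields:
            records.append((row_id, field, record[field]))
    return records

source_rows = [
    {"Gene_Symbol": "ALDH", "OMIM_ID": "100650"},
    {"Gene_Symbol": "APP", "OMIM_ID": "104760"},
]
for rec in to_meta_model(source_rows, ["Gene_Symbol", "OMIM_ID"]):
    print(rec)
# (1, 'Gene_Symbol', 'ALDH') ... (2, 'OMIM_ID', '104760')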
curl "https://$GENDB_API_KEY@api.gendb.com/v1/datasets/OMIM/3.5.0/Genes/data" \
  -H "Content-Type: application/json" \
  -d '{
        "filters": [
          ["gene_symbol", "BRCA1"]
        ]
      }'
{
"dataset": "OMIM/3.5.0/Genes",
"dataset_id": 65,
"genome_build": "GRCh37",
"limit": 100,
"total": 111425,
"took": 5,
"results": [ "..." ]
}
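For illustration, the same query issued from Python with the requests library; the endpoint and the key-in-URL authentication are taken from the curl call above, everything else is an assumption:

import os
import requests

# GENDB_API_KEY must be set in the environment, as in the curl example.
api_key = os.environ["GENDB_API_KEY"]
url = f"https://{api_key}@api.gendb.com/v1/datasets/OMIM/3.5.0/Genes/data"

resp = requests.post(
    url,
    headers={"Content-Type": "application/json"},
    json={"filters": [["gene_symbol", "BRCA1"]]},
)
resp.raise_for_status()
payload = resp.json()
print(payload["total"], "matching records in", payload["dataset"])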
As the number of human variant resources used in variant analysis increases, and the variants reported grow faster every year, there is still only initial work on understanding all this information and on how we can extract and link those variant sources.
DATA INFRASTRUCTURE
[Figure: the GENDB API fetches sequencer data and fetches data from external depositories such as OMIM, 1000 Genomes, Entrez Gene, dbSNP, dbNSFP, COSMIC, ClinVar and other sources.]
Source: abstract class for any data sources that we'll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a database.
  + name
  + output_dir
  - fetch(is_dl_forced=False)
  - parse()
  - prepare_new_dataset(name, version)
  - update_new_version(version_name)
  - check_if_remote_newer(remote, local)
  - get_files(is_dl_forced)
  - fetch_from_url(remote_file, local_file)
  - fetch_from_db(query, conn, limit, is_dl_forced)
  - fetch_from_source(...)

OMIM (extends Source)
  + name: OMIM
  + output_dir: "./raw/omim"
  - _get_omim_ids()
  - _process_all()
  - _process_morbidmap()
  - _process_phenotypicseries()

Source N (extends Source)
  + name: Source N
  + output_dir: "output/dir"
  - local functions()
  - inherited functions()
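A minimal Python sketch of this hierarchy; only the names and signatures come from the diagram, the method bodies are our own illustrative placeholders:

from abc import ABC, abstractmethod

class Source(ABC):
    """Abstract class for any data source that we'll import and
    process; subclasses fetch() data, scrub it, then parse() it."""

    name = None
    output_dir = None

    @abstractmethod
    def fetch(self, is_dl_forced=False):
        """Download the raw files (or re-download when forced)."""

    @abstractmethod
    def parse(self):
        """Turn the raw files into database records."""

    def check_if_remote_newer(self, remote, local):
        # Illustrative placeholder: compare timestamps or ETags in practice.
        return remote != local

class OMIM(Source):
    name = "OMIM"
    output_dir = "./raw/omim"

    def fetch(self, is_dl_forced=False):
        print("fetching OMIM into", self.output_dir)

    def parse(self):
        print("parsing morbidmap and phenotypic series")

OMIM().fetch()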
REFERENCES
[1] Anguita, A., et al. (2010) A review of methods and tools for database integration in biomedicine. Curr. Bioinform., 5, 253–269.
[2] Peterson, T.A., Doughty, E. and Kann, M.G. (2013) Towards precision medicine: advances in computational approaches for the analysis of human variants. J. Mol. Biol., 425(21), 4047–4063.
[3] Lakshman, A. and Malik, P. (2010) Cassandra: a decentralized structured storage system. ACM SIGOPS Oper. Syst. Rev., 44(2), 35–40.
[4] Apache Spark (2015) Lightning-fast cluster computing.
[5] Stockinger, H., et al. (2008) Experience using web services for biological sequence analysis. Brief. Bioinform., 9(6), 493–505.