APP
NGS Applications
J. S. Freitas¹, M. P. Caraciolo¹, V. M. Diniz¹, R. B. de Alexandre¹, J. B. Oliveira¹
¹ Genomika Diagnósticos
API-Centric Data Integration for Human Genomics Reference
Databases: Achievements, Lessons Learned and Challenges
MOTIVATION
Data integration is a major challenge in clinical genetics, where multiple
heterogeneous databases span several domains and are presented in confusing
formats without clear, common standards. In variant analysis for molecular
diagnostics, one central task is to connect biological information to clinical
data so that specialists can determine the potential impact of a variant
associated with a disease [1, 2].
This task requires the flexible assembly of tailored, continuously curated data
sets, without wasting biologists' and geneticists' time on searching several
databases individually online and then parsing, cleaning and integrating the
data in complex spreadsheets.
We are building a platform that leverages Linked Data to provide integrated
access to bioinformatics databases such as OMIM and ClinVar through a common,
well-defined interface.
Our assumption is that exposing those datasets via Application Programming
Interfaces (APIs) facilitates data access from several sources to a big data
infrastructure, which provides integrated access to information covering
biology, carrier testing, variant analysis and literature mining.
bioinfo@genomika.com.br | genomika.com.br
Rua Senador José Henrique, 224, Alfred Nobel, Sala 1301 | Recife, PE | Brazil
OUR COLLABORATION
DATA INFRASTRUCTURE
Lessons Learned
REFERENCES
[1] Anguita, A., et al. "A review of methods and tools for database integration in biomedicine." Curr. Bioinform. 5 (2010): 253-269.
[2] Peterson, Thomas A., Emily Doughty, and Maricel G. Kann. "Towards precision medicine: advances in computational approaches for the analysis of human variants." Journal of molecular biology 425.21 (2013): 4047-4063.
[3] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[4] Spark, Apache. "Lightning-fast cluster computing (2015)." (2015): 345-353.
[5] Stockinger, Heinz, et al. "Experience using web services for biological sequence analysis." Briefings in bioinformatics 9.6 (2008): 493-505.
New source consumption: the number of databases keeps growing, and so does the
variability of their schemata. To tackle this, we designed a global schema that
uses meta-modeling concepts to abstract the data fields and values.
Aggregation: novel approaches are needed to aggregate the facets that share the
same key. Good solutions: NoSQL databases (Cassandra) and large-scale data
processing engines built on MapReduce concepts (Spark) [3, 4].
Distributed: loading several databases and their versions requires a
replication/distribution policy for the database engine. Some data engines have
achieved good results here by using a distributed strategy for partitioning data.
Data exposure: RESTful APIs for exposing data. They support several formats
(XML, JSON), and there are frameworks available that work out of the box [5].
Challenges
The underlying datasets can change their schema, so there is intellectual
complexity in developing fixes for the source data consumption.
The number of new versions that can be built is limited: the whole process
requires bandwidth and demanding computing power, so how do we overcome the
limit on the number of fetching jobs running simultaneously?
How do we deal with semantic mappings between datasets or depositories? What
should the single integrated vocabulary be in order to identify possible
relationships?
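One common way to keep simultaneous fetching jobs within a bandwidth or compute budget is a bounded worker pool. The sketch below is only an illustration of that idea, not the poster's implementation; `run_fetch_jobs` and `max_parallel` are hypothetical names, and each job is assumed to be a plain callable.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fetch_jobs(jobs, max_parallel=2):
    """Run fetch callables with at most `max_parallel` of them active at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as `jobs`.
        return list(pool.map(lambda job: job(), jobs))
```

Jobs beyond the limit simply wait in the pool's queue, so the limit caps resource use without dropping any source.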
[Figure: aggregation of facets by row key. Labels: sample, genomic position,
Sequencing Machine, Annotator. Individual pairs such as
(rowA, (DataFieldA, facetValue1)) and (rowB, (DataFieldA, facetValue2))
are grouped into one list per row:
(rowA, [(DFA, FV1), (DFB, FV3), (DFC, FV4), (DFD, FV7), (DFE, FV8), (DFF, FV9)])
(rowB, [(DFA, FV2), (DFB, FV5), (DFC, FV6), (DFD, FV10), (DFE, FV11), (DFF, FV12)])]
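The grouping of facets by row key can be sketched in plain Python as follows. This is a single-machine stand-in for the key-based aggregation that the poster attributes to Spark, under the assumption that each input is a (row key, (data field, facet value)) pair; `aggregate_facets` is a hypothetical helper name.

```python
from collections import defaultdict


def aggregate_facets(pairs):
    """Group (row_key, (data_field, facet_value)) pairs by row key."""
    grouped = defaultdict(list)
    for row_key, facet in pairs:
        grouped[row_key].append(facet)
    return dict(grouped)


pairs = [
    ("rowA", ("DFA", "FV1")),
    ("rowB", ("DFA", "FV2")),
    ("rowA", ("DFB", "FV3")),
    ("rowB", ("DFB", "FV5")),
]
aggregated = aggregate_facets(pairs)
```

In a distributed engine the same step would run per partition and then merge the partial lists that share a key.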
[Figure: variants observed (150,000,000) versus variants we understand; time axis 2003, 2007, 2015.]
[Figure: labels ClinGen Tool, Patient Data, Genotype Annotator. Sources: ClinVar, dbSNP, UniProt, OMIM, NCBI Gene, 1000 Genomes, ..., Depository N.]
DATA EXPOSURE
Source table (example):
omim_id   Gene Symbol   ...   Datafield N
100650    ALDH          ...   Facet #1
104760    APP           ...   Facet #n

Global schema:
rowID   DataField     ...   DataFacet
1       Gene_Symbol   ...   ALDH
1       OMIM_ID       ...   100650
2       Gene_Symbol   ...   APP
2       OMIM_ID       ...   104760

Depository > Version (1.0.0, 2.0.0, ...) > Dataset (Genes, Phenotypes, ..., Dataset N)
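The mapping from a source table to the global schema can be illustrated with a short sketch. It assumes each source row arrives as a dictionary of field names to values; `to_global_schema` is a hypothetical helper, not a function named on the poster.

```python
def to_global_schema(row_id, record):
    """Turn one source record into (rowID, DataField, DataFacet) rows."""
    return [(row_id, field, value) for field, value in record.items()]


omim_records = [
    {"Gene_Symbol": "ALDH", "OMIM_ID": 100650},
    {"Gene_Symbol": "APP", "OMIM_ID": 104760},
]

rows = []
for row_id, record in enumerate(omim_records, start=1):
    rows.extend(to_global_schema(row_id, record))
```

Because every source is reduced to the same three-column shape, a new database with different columns does not require a new table layout.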
curl "https://$GENDB_API_KEY@api.gendb.com/v1/datasets/OMIM/3.5.0/Genes/data" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": [
      ["gene_symbol", "BRCA1"]
    ]
  }'
{
"dataset": "OMIM/3.5.0/Genes",
"dataset_id": 65,
"genome_build": "GRCh37",
"limit": 100,
"total": 111425,
"took": 5,
"results": [ "..." ]
}
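The same query could be prepared from Python with the standard library. This is a sketch under assumptions: `build_query` is a hypothetical helper, and the API key is assumed to travel as a Basic-auth user name with an empty password, which is what the `key@host` form of the curl URL implies. The request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_query(api_key, dataset_path, filters):
    """Build (but do not send) a POST request mirroring the curl call."""
    url = f"https://api.gendb.com/v1/datasets/{dataset_path}/data"
    body = json.dumps({"filters": filters}).encode("utf-8")
    # Assumption: the key is the Basic-auth user name, with an empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


request = build_query("my-key", "OMIM/3.5.0/Genes", [["gene_symbol", "BRCA1"]])
```

Passing `request` to `urllib.request.urlopen` would send it; the response body could then be decoded with `json.loads` to read fields such as `total` and `results`.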
As the number of human variant resources used in variant analysis increases
and the number of reported variants grows faster every year, there is only
initial work on understanding all this information and on how we can extract
and link those variant sources.
[Figure: GENDB architecture. Labels: Sequencer Data, fetch data, API, GENDB. Sources: OMIM, 1000 Genomes, Entrez Gene, dbSNP, dbNSFP, COSMIC, ClinVar, Other Sources.]
Source
Abstract class for any data sources that we'll import and process. Each of the
subclasses will fetch() the data, scrub() it as necessary, then parse() it into
a database.
  + name
  + output_dir
  - fetch(is_dl_forced=False)
  - parse()
  - prepare_new_dataset(name, version)
  - update_new_version(version_name)
  - check_if_remote_newer(remote, local)
  - get_files(is_dl_forced)
  - fetch_from_url(remote_file, local_file)
  - fetch_from_db(query, conn, limit, is_dl_forced)
  - fetch_from_source(...)

OMIM (extends Source)
  + name: OMIM
  + output_dir: "./raw/omim"
  - _get_omim_ids()
  - _process_all()
  - _process_morbidmap()
  - _process_phenotypicseries()

Source N (extends Source)
  + name: Source N
  + output_dir: "output/dir"
  - local functions()
  - inherited_functions()
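A minimal Python sketch of this class layout is given below. Only `Source`, `fetch`, `parse`, `check_if_remote_newer`, `name` and `output_dir` come from the diagram; `DemoSource`, its sample lines, and the rule that versions compare as dotted integers are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Source(ABC):
    """Base class for a data source that is fetched and then parsed."""

    def __init__(self, name, output_dir):
        self.name = name
        self.output_dir = output_dir

    @abstractmethod
    def fetch(self, is_dl_forced=False):
        """Retrieve the raw data for this source."""

    @abstractmethod
    def parse(self):
        """Turn the fetched data into records."""

    def check_if_remote_newer(self, remote, local):
        # Assumption: versions are dotted integers such as "2.0.0".
        def as_tuple(version):
            return tuple(int(part) for part in version.split("."))

        return as_tuple(remote) > as_tuple(local)


class DemoSource(Source):
    """Hypothetical subclass standing in for OMIM or any Source N."""

    def __init__(self):
        super().__init__(name="Demo", output_dir="./raw/demo")
        self._raw = []

    def fetch(self, is_dl_forced=False):
        # Stand-in for a download: two tab-separated lines.
        self._raw = ["ALDH\t100650", "APP\t104760"]
        return self._raw

    def parse(self):
        fields = ("Gene_Symbol", "OMIM_ID")
        return [dict(zip(fields, line.split("\t"))) for line in self._raw]
```

Keeping the download and version logic in the base class means a new source only has to supply its own `fetch()` and `parse()`.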

Handwritten Text Recognition for manuscripts and early printed texts
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

API-Centric Data Integration for Human Genomics Reference Databases: Achievements, Lessons Learned and Challenges

  • 1. APP NGS Applications

J. S. Freitas(1), M. P. Caraciolo(1), V. M. Diniz(1), R. B. de Alexandre(1), J. B. Oliveira(1)
(1) Genomika Diagnósticos
bioinfo@genomika.com.br | genomika.com.br
Rua Senador José Henrique, 224, Alfred Nobel, Sala 1301 | Recife, PE | Brazil

MOTIVATION

Data integration is a major challenge in clinical genetics, where multiple heterogeneous databases span several domains and are published in confusing formats without clear, common standards. In variant analysis for molecular diagnostics, one central task is to connect biological information to clinical data so that specialists can determine the potential impact of a variant associated with a disease [1, 2].

This task requires the flexible assembly of tailored, continuously curated data sets, without wasting biologists' and geneticists' time on searching several online databases individually and then parsing, cleaning and integrating the data in complex spreadsheets.

We are building a platform that leverages Linked Data to provide integrated access to bioinformatics databases such as OMIM and ClinVar through a common, well-defined interface.

Our assumption is that exposing those datasets via Application Programming Interfaces (APIs) facilitates data access from several sources to a big data infrastructure, which provides integrated access to information covering biology, carrier testing, variant analysis and literature mining.

[Poster panel titles: OUR COLLABORATION; DATA INFRASTRUCTURE; DISTRIBUTED AGGREGATION; NEW SOURCE CONSUMPTION]

REFERENCES

[1] Anguita, A., et al. "A review of methods and tools for database integration in biomedicine." Curr. Bioinform. 5 (2010): 253-269.
[2] Peterson, Thomas A., Emily Doughty, and Maricel G. Kann. "Towards precision medicine: advances in computational approaches for the analysis of human variants." Journal of Molecular Biology 425.21 (2013): 4047-4063.
[3] Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010): 35-40.
[4] Spark, Apache. "Lightning-fast cluster computing (2015)." (2015): 345-353.
[5] Stockinger, Heinz, et al. "Experience using web services for biological sequence analysis." Briefings in Bioinformatics 9.6 (2008): 493-505.

Lessons Learned

The number of databases keeps growing, and so does the variability of their schemata. To tackle this, we designed a global schema that uses meta-modeling concepts to abstract the data fields and values.

Novel approaches are needed to aggregate the facets by the same key. Good solutions: NoSQL databases (Cassandra) and large-scale data processing engines based on MapReduce concepts (Spark) [3, 4].

Loading several databases and their versions requires a replication/distribution policy for the database engine. Some data engines achieve great results here by using a distributed strategy for partitioning data.

RESTful APIs work well for exposing data. They support several formats (XML, JSON), and available frameworks work out of the box [5].

Challenges

The underlying datasets can change their schema, so there is intellectual complexity in developing fixes for the source data consumption.

The number of new versions that can be built is limited: the whole process requires bandwidth and demanding computing power, so how can we overcome the limit on the number of fetching jobs running simultaneously?

How should we deal with semantic mappings between datasets or depositories? What should the single integrated vocabulary be in order to identify possible relationships?
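The first two lessons, abstracting fields and values into a global schema and then aggregating facets by a shared key, can be illustrated with a small sketch. This is plain Python standing in for the group-by-key step that an engine such as Spark would distribute; it is not the authors' implementation. The function names `to_pairs` and `aggregate_by_key` and the two sample sources are assumptions made for illustration.

```python
from collections import defaultdict


def to_pairs(row_id, record):
    """Flatten one source record into (row_id, (data_field, facet)) pairs."""
    return [(row_id, (field, value)) for field, value in record.items()]


def aggregate_by_key(pairs):
    """Group (data_field, facet) tuples under their shared row key."""
    grouped = defaultdict(list)
    for row_id, field_facet in pairs:
        grouped[row_id].append(field_facet)
    return dict(grouped)


# Hypothetical records describing the same two rows in two different sources.
source_1 = {"rowA": {"gene_symbol": "ALDH"}, "rowB": {"gene_symbol": "APP"}}
source_2 = {"rowA": {"omim_id": 100650}, "rowB": {"omim_id": 104760}}

pairs = []
for source in (source_1, source_2):
    for row_id, record in source.items():
        pairs.extend(to_pairs(row_id, record))

merged = aggregate_by_key(pairs)
print(merged["rowA"])  # [('gene_symbol', 'ALDH'), ('omim_id', 100650)]
```

Because every source is reduced to the same pair shape before grouping, adding a new source only requires a new flattening step, not a change to the aggregation.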
[Diagram labels: sample; genomic position; Sequencing Machine; Annotator; ClinGen Tool; Patient Data]

[Diagram: distributed aggregation of facets by row key]

    Input pairs:
        (rowA, (DataFieldA, facetValue1))   (rowB, (DataFieldA, facetValue2))
        (rowA, (DFB, FV3))   (rowA, (DFC, FV4))   (rowB, (DFB, FV5))   (rowB, (DFC, FV6))
        (rowA, (DFD, FV7))   (rowA, (DFE, FV8))   (rowB, (DFE, FV11))  (rowB, (DFF, FV12))

    Aggregated by key:
        (rowA, [(DFA, FV1), (DFB, FV3), (DFC, FV4), (DFD, FV7), (DFE, FV8), (DFF, FV9)])
        (rowB, [(DFA, FV2), (DFB, FV5), (DFC, FV6), (DFD, FV10), (DFE, FV11), (DFF, FV12)])

As the number of human variant resources used in variant analysis increases, and the number of reported variants grows faster every year, there is only initial work on understanding all this information and on how to extract and link those variant sources.

[Chart: variants observed vs. variants we understand; years 2003, 2007, 2015; axis label 150,000,000]

[Diagram labels: Genotype Annotator; ClinVar; dbSNP; UniProt; OMIM; NCBI Gene; 1,000 Genomes; Depository N]

[Table: a source dataset as published]

    Gene Symbol   omim_id   ...   Datafield N
    ALDH          100650    ...   Facet #1
    APP           104760    ...   Facet #n

[Table: the same rows in the global schema]

    rowID   DataField     ...   DataFacet
    1       Gene_Symbol   ...   ALDH
    1       OMIM_ID       ...   100650
    2       Gene_Symbol   ...   APP
    2       OMIM_ID       ...   104760

[Hierarchy: Depository > Version (1.0.0, 2.0.0, ...) > Dataset (Genes, Phenotypes, ..., Dataset N)]

DATA EXPOSURE

Request:

    curl https://$GENDB_API_KEY@api.gendb.com/v1/datasets/OMIM/3.5.0/Genes/data \
      -H "Content-Type: application/json" \
      -d '{ "filters": [ ["gene_symbol", "BRCA1"] ] }'

Response:

    {
      "dataset": "OMIM/3.5.0/Genes",
      "dataset_id": 65,
      "genome_build": "GRCh37",
      "limit": 100,
      "total": 111425,
      "took": 5,
      "results": [ "..." ]
    }
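For readers who would rather call the API from a script, the curl request shown under DATA EXPOSURE might be built in Python roughly as follows. This is a sketch, not a documented client: the helper name `build_query` and the key `demo-key` are hypothetical, and it assumes that the API key in the URL's user part corresponds to HTTP Basic authentication with an empty password, which is how curl treats such URLs. The request is only constructed here, not sent.

```python
import base64
import json
import urllib.request

API_ROOT = "https://api.gendb.com/v1"  # host and version taken from the curl example


def build_query(depository, version, dataset, filters, api_key):
    """Build (but do not send) a POST request for one dataset's data endpoint."""
    url = f"{API_ROOT}/datasets/{depository}/{version}/{dataset}/data"
    body = json.dumps({"filters": filters}).encode("utf-8")
    # Assumption: the key is sent as the Basic-auth user name with no password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_query("OMIM", "3.5.0", "Genes", [["gene_symbol", "BRCA1"]], "demo-key")
print(request.get_method(), request.full_url)
# Sending it would look like: json.load(urllib.request.urlopen(request))
```

The URL path mirrors the depository, version and dataset levels used in the curl example, so the same helper would cover other datasets exposed in that layout.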
[Diagram labels: fetch data; Sequencer Data; API; GENDB; MIM; 1000 Genomes; Entrez Gene; dbSNP; dbNSFP; COSMIC; ClinVar; Other Sources]

[Class diagram]

Source
    Abstract class for any data sources that we'll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a database.

    + name
    + output_dir
    - fetch(is_dl_forced=False)
    - parse()
    - prepare_new_dataset(name, version)
    - update_new_version(version_name)
    - check_if_remote_newer(remote, local)
    - get_files(is_dl_forced)
    - fetch_from_url(remote_file, local_file)
    - fetch_from_db(query, conn, limit, is_dl_forced)
    - fetch_from_source(...)

OMIM (extends Source)
    + name: OMIM
    + output_dir: "./raw/omim"
    - _get_omim_ids()
    - _process_all()
    - _process_morbidmap()
    - _process_phenotypicseries()

Source N (extends Source)
    + name: Source N
    + output_dir: "output/dir"
    - local functions()
    - inherited functions()
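The class diagram's Source hierarchy could look something like the sketch below. Only `name`, `output_dir`, `fetch(is_dl_forced=False)`, `parse()` and `check_if_remote_newer(remote, local)` come from the diagram; their bodies, the `ToySource` subclass and the choice of timestamps as arguments to `check_if_remote_newer` are assumptions for illustration, and the remaining methods are left out.

```python
from abc import ABC, abstractmethod
from datetime import datetime


class Source(ABC):
    """Base class for a data source that is fetched and then parsed."""

    def __init__(self, name, output_dir):
        self.name = name
        self.output_dir = output_dir

    @abstractmethod
    def fetch(self, is_dl_forced=False):
        """Obtain the raw data for this source."""

    @abstractmethod
    def parse(self):
        """Turn the raw data into rows ready for loading."""

    @staticmethod
    def check_if_remote_newer(remote, local):
        """Assumed semantics: True when there is no local copy or it is older."""
        return local is None or remote > local


class ToySource(Source):
    """Hypothetical subclass standing in for OMIM or any other source."""

    def __init__(self):
        super().__init__(name="Toy", output_dir="./raw/toy")
        self.raw = None

    def fetch(self, is_dl_forced=False):
        # A real subclass would download files; here the data is inlined.
        if self.raw is None or is_dl_forced:
            self.raw = "ALDH\t100650\nAPP\t104760"

    def parse(self):
        return [tuple(line.split("\t")) for line in self.raw.splitlines()]


source = ToySource()
source.fetch()
rows = source.parse()
print(rows)  # [('ALDH', '100650'), ('APP', '104760')]
print(Source.check_if_remote_newer(datetime(2015, 2, 1), datetime(2015, 1, 1)))  # True
```

Keeping `fetch()` and `parse()` abstract means a schema change in one upstream dataset is confined to that dataset's subclass.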