NCBI API – Integration into analysis code
QBRC Tech Talk
Jiwoong Kim
Outline
• Introduction
• Usage Guidelines of the E-utilities
• Sample Applications of the E-utilities
NCBI & Entrez
• The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
• Entrez is NCBI’s primary text search and retrieval system. It integrates the PubMed database of biomedical literature with 39 other literature and molecular databases, including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.
E-utilities
• Entrez Programming Utilities
– The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI.
– The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.
– Input: URL → E-utilities → Output: XML, FASTA, Text, …
Usage Guidelines and Requirements
• Use the E-utility URL
– baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ …
– Python urllib/urlopen, Perl LWP::Simple, Linux wget, …
• Frequency, Timing and Registration of E-utility URL Requests
– Make no more than 3 requests per second, e.g. sleep(0.5) between requests
– Run large jobs on weekends or between 9 PM and 5 AM Eastern time on weekdays
– Include &tool and &email in all requests
• Minimizing the Number of Requests
– &retmax=500
• Handling Special Characters Within URLs
– Space → +, " → %22, # → %23
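The guidelines above can be sketched in Python roughly as follows. This is a minimal illustration, not an official client: the names build_url and Throttle, the placeholder tool and email values, and the use of the HTTPS form of the base URL are all assumptions made for this example.

```python
import time
from urllib.parse import urlencode

# HTTPS form of the base URL shown on the slide (assumed for this sketch).
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(utility, tool="my_tool", email="me@example.org", **params):
    """Return an E-utility URL with &tool and &email attached.

    The default tool and email values are placeholders, not registered values.
    """
    query = dict(params, tool=tool, email=email)
    # urlencode percent-encodes special characters and turns spaces into "+".
    return f"{BASE_URL}{utility}.fcgi?{urlencode(query)}"


class Throttle:
    """Keep successive requests at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Space becomes "+", the double quote becomes %22 and "#" becomes %23.
url = build_url("esearch", db="pubmed", term='"copy number" #1', retmax=500)
print(url)
```

Calling Throttle.wait() before every request keeps a script at two requests per second, which stays under the limit of three.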
ESearch (text searches)
• Responds to a text query with the list of matching UIDs in a
given database (for later use in ESummary, EFetch or ELink),
along with the term translations of the query.
• Syntax: esearch.fcgi?db=<database>&term=<query>
– Input: Entrez database (&db); Any Entrez text query (&term)
– Output: List of UIDs matching the Entrez query
• Example: Get the PubMed IDs (PMIDs) for articles about
osteosarcoma
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]
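An ESearch response can be read with a standard XML parser. The sketch below assumes the usual eSearchResult layout (Count, IdList/Id, and QueryKey/WebEnv when the history server is used). The sample XML is hand-written for illustration and is not real server output; parse_esearch is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ESearch response (not real output).
SAMPLE_XML = """<eSearchResult>
  <Count>3</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>24450072</Id>
    <Id>24333720</Id>
    <Id>24333432</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Extract the hit count, the UID list and any history-server keys."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count", default="0")),
        "ids": [element.text for element in root.findall("./IdList/Id")],
        # These two are present only when the request used usehistory=y.
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


result = parse_esearch(SAMPLE_XML)
print(result["count"], result["ids"])
```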
[Diagram: ESearch, UIDs, ESummary, EFetch]
ESummary (document summary downloads)
• Responds to a list of UIDs from a given database with the
corresponding document summaries.
• Syntax: esummary.fcgi?db=<database>&id=<uid_list>
– Input: List of UIDs (&id); Entrez database (&db)
– Output: XML DocSums
• Example: Download DocSums for these PubMed IDs:
24450072, 24333720, 24333432
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=24450072,24333720,24333432
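The XML DocSums can be flattened into a dictionary keyed by UID. The sketch below assumes the classic DocSum layout, in which each DocSum holds an Id and a series of Item elements with a Name attribute. The sample XML and its titles are placeholders written for this example; parse_docsums is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ESummary response (placeholder titles).
SAMPLE_XML = """<eSummaryResult>
  <DocSum>
    <Id>24450072</Id>
    <Item Name="Source" Type="String">PLACEHOLDER JOURNAL</Item>
    <Item Name="Title" Type="String">PLACEHOLDER TITLE A</Item>
  </DocSum>
  <DocSum>
    <Id>24333720</Id>
    <Item Name="Title" Type="String">PLACEHOLDER TITLE B</Item>
  </DocSum>
</eSummaryResult>"""


def parse_docsums(xml_text, fields=("Title",)):
    """Map each UID to the requested DocSum fields (None when a field is absent)."""
    summaries = {}
    for docsum in ET.fromstring(xml_text).findall("DocSum"):
        uid = docsum.findtext("Id")
        items = {item.get("Name"): item.text for item in docsum.findall("Item")}
        summaries[uid] = {name: items.get(name) for name in fields}
    return summaries


docsums = parse_docsums(SAMPLE_XML)
print(docsums)
```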
EFetch (data record downloads)
• Responds to a list of UIDs in a given database with the
corresponding data records in a specified format.
• Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>&retmode=<retrieval_mode>
– Input: List of UIDs (&id); Entrez database (&db); Retrieval type
(&rettype); Retrieval mode (&retmode)
– Output: Formatted data records as specified
• Example: Download the abstract of PubMed ID 24333432
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24333432&rettype=abstract&retmode=text
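When many records are needed, the UID list can be split into batches so that each EFetch URL carries several UIDs instead of one. The following is a minimal sketch: efetch_urls and the default batch size of 200 are choices made for this example, and the HTTPS base URL is assumed.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_urls(db, uids, rettype, retmode, batch_size=200):
    """Yield one EFetch URL per batch of UIDs."""
    uids = [str(uid) for uid in uids]
    for start in range(0, len(uids), batch_size):
        batch = uids[start:start + batch_size]
        params = {
            "db": db,
            "id": ",".join(batch),
            "rettype": rettype,
            "retmode": retmode,
        }
        # safe="," keeps the comma-separated UID list unescaped.
        yield BASE_URL + "efetch.fcgi?" + urlencode(params, safe=",")


urls = list(efetch_urls("pubmed", [24450072, 24333720, 24333432],
                        "abstract", "text", batch_size=2))
for url in urls:
    print(url)
```

Three UIDs with a batch size of 2 give two URLs, the first with two UIDs and the second with one.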
ELink (Entrez links)
• Responds to a list of UIDs in a given database with either a list
of related UIDs (and relevancy scores) in the same database
or a list of linked UIDs in another Entrez database
• Checks for the existence of a specified link from a list of one
or more UIDs
• Creates a hyperlink to the primary LinkOut provider for a
specific UID and database, or lists LinkOut URLs and attributes
for multiple UIDs.
ELink (Entrez links)
• Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<uid_list>
– Input: List of UIDs (&id); Source Entrez database (&dbfrom);
Destination Entrez database (&db)
– Output: XML containing linked UIDs from source and destination
databases
• Example: Find Gene IDs linked to PubMed IDs 24333432 and 24314238, either as one combined set or as separate sets per PubMed ID
– One set (comma-separated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432,24314238
– Separate sets (repeated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432&id=24314238
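The two URL forms differ only in how &id is written, which is easy to generate programmatically. A small sketch follows; elink_url and its separate flag are hypothetical names for this example, and the HTTPS base URL is assumed.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(dbfrom, db, uids, separate=False):
    """Build an ELink URL.

    separate=False joins the UIDs into one comma-separated &id (one link set);
    separate=True repeats &id once per UID (one link set per UID).
    """
    ids = [str(uid) for uid in uids]
    params = [("dbfrom", dbfrom), ("db", db)]
    if separate:
        params += [("id", uid) for uid in ids]
    else:
        params.append(("id", ",".join(ids)))
    # safe="," keeps the comma in the combined UID list unescaped.
    return BASE_URL + "elink.fcgi?" + urlencode(params, safe=",")


combined = elink_url("pubmed", "gene", [24333432, 24314238])
per_uid = elink_url("pubmed", "gene", [24333432, 24314238], separate=True)
print(combined)
print(per_uid)
```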
EGQuery (global query)
• Responds to a text query with the number of records
matching the query in each Entrez database.
• Syntax: egquery.fcgi?term=<query>
– Input: Entrez text query (&term)
– Output: XML containing the number of hits in each database.
• Example: Determine the number of records for mouse in
Entrez.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[orgn]&retmode=xml
ESpell (spelling suggestions)
• Retrieves spelling suggestions for a text query in a given
database.
• Syntax: espell.fcgi?term=<query>&db=<database>
– Input: Entrez text query (&term); Entrez database (&db)
– Output: XML containing the original query and spelling suggestions.
• Example: Find spelling suggestions for the PubMed Central query "osteosacoma".
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosacoma&db=pmc
EInfo (database statistics)
• Provides the number of records indexed in each field of a
given database, the date of the last update of the database,
and the available links from the database to other Entrez
databases.
• Syntax: einfo.fcgi?db=<database>
– Input: Entrez database (&db)
– Output: XML containing database statistics
• Example: Find database statistics for Entrez Protein.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein
EPost (UID uploads)
• Accepts a list of UIDs from a given database, stores the set on
the History Server, and responds with a query key and web
environment for the uploaded dataset.
• Syntax: epost.fcgi?db=<database>&id=<uid_list>
– Input: List of UIDs (&id); Entrez database (&db)
– Output: Web environment (&WebEnv) and query key (&query_key)
parameters specifying the location on the Entrez history server of the
list of uploaded UIDs
• Example: Upload five Gene IDs (7173, 22018, 54314, 403521,
525013) for later processing.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=7173,22018,54314,403521,525013
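The query key and web environment returned by EPost are what later calls pass back in place of a UID list. The sketch below assumes an ePostResult document with QueryKey and WebEnv children. The sample XML uses a placeholder WebEnv string; parse_epost and history_url are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hand-written sample in the shape of an EPost response (placeholder WebEnv).
SAMPLE_XML = """<ePostResult>
  <QueryKey>1</QueryKey>
  <WebEnv>PLACEHOLDER_WEBENV</WebEnv>
</ePostResult>"""


def parse_epost(xml_text):
    """Return (query_key, webenv) from an EPost response."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def history_url(utility, db, query_key, webenv, **extra):
    """Build a URL that refers to a set stored on the History Server."""
    params = dict(db=db, query_key=query_key, WebEnv=webenv, **extra)
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"


query_key, webenv = parse_epost(SAMPLE_XML)
summary_url = history_url("esummary", "gene", query_key, webenv)
print(summary_url)
```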
Application 1
• Find human genes related to articles retrieved with the MeSH major topic "Osteosarcoma" without term explosion ([majr:noexp]) (PubMed → Gene)
1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]&usehistory=y
2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266&term=%22homo+sapiens%22[organism]&cmd=neighbor_history
3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266
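The three requests can be chained in a script by reading the query key and WebEnv from each response rather than hard-coding them. The sketch below makes several assumptions: the function and parameter names are invented for this example, the HTTPS base URL is assumed, and the ELink history response is assumed to carry its key under LinkSetDbHistory/QueryKey. The fetch argument is any callable that takes a URL and returns the response text, which allows the pipeline to be exercised without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_fetch(url):
    """Download a URL and return its body as text (performs a real request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def eutil_url(utility, **params):
    return f"{BASE_URL}{utility}.fcgi?{urlencode(params)}"


def pubmed_to_gene_summaries(fetch, term, gene_filter):
    """ESearch (PubMed) -> ELink (Gene) -> ESummary via the History Server."""
    # Step 1: store the PubMed hits on the History Server.
    search = ET.fromstring(
        fetch(eutil_url("esearch", db="pubmed", term=term, usehistory="y")))
    webenv = search.findtext("WebEnv")
    query_key = search.findtext("QueryKey")

    # Step 2: link the stored PubMed set to Gene, restricted by gene_filter.
    link = ET.fromstring(fetch(eutil_url(
        "elink", dbfrom="pubmed", db="gene", query_key=query_key,
        WebEnv=webenv, term=gene_filter, cmd="neighbor_history")))
    gene_key = link.findtext(".//LinkSetDbHistory/QueryKey")
    webenv = link.findtext(".//WebEnv") or webenv

    # Step 3: download DocSums for the linked Gene records.
    return fetch(eutil_url("esummary", db="gene", query_key=gene_key, WebEnv=webenv))
```

A real run would pass http_fetch together with the two query strings from the slide, '"osteosarcoma"[majr:noexp]' and '"homo sapiens"[organism]', ideally with a pause between requests.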
Application 1
• Alternative for the same task (PubMed → Gene) using bulk files from the NCBI FTP site
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz can be used instead of ELink.
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz can be used instead of ESummary.
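A local lookup against gene2pubmed might look like the sketch below. It assumes the file is tab-delimited with the columns tax_id, GeneID and PubMed_ID and a header line starting with "#", and that human records carry taxonomy ID 9606. The function name and the sample rows (with placeholder Gene IDs) are invented for this example.

```python
import gzip


def genes_for_pmids(lines, pmids, tax_id="9606"):
    """Map each wanted PubMed ID to the set of Gene IDs for one organism."""
    wanted = {str(pmid) for pmid in pmids}
    genes = {}
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line
        tax, gene_id, pmid = line.rstrip("\n").split("\t")[:3]
        if tax == tax_id and pmid in wanted:
            genes.setdefault(pmid, set()).add(gene_id)
    return genes


# Placeholder rows in the assumed gene2pubmed layout (not real associations).
sample_lines = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t111\t24333432\n",
    "10090\t222\t24333432\n",
    "9606\t333\t24314238\n",
    "9606\t444\t99999999\n",
]
print(genes_for_pmids(sample_lines, [24333432, 24314238]))

# With the downloaded file, the same function can read the gzip stream directly:
#     with gzip.open("gene2pubmed.gz", "rt") as handle:
#         genes = genes_for_pmids(handle, pmids)
```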
Application 2
• Find nucleotide sequences of "Burkholderia cepacia complex"
and download in GenBank format
1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=%22burkholderia+cepacia+complex%22[organism]&usehistory=y
2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001_1396244608_457974498&rettype=gb&retmode=text
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
– PubMed branch: query cancer "copy number" → esearch.fcgi?db=pubmed → WebEnv, query_key → esummary.fcgi?db=pubmed (PubMed title) and elink.fcgi?dbfrom=pubmed&db=gds
– GEO branch: query Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] → esearch.fcgi?db=gds → WebEnv, query_key → esummary.fcgi?db=gds → platforms GPL9704, GPL8226, GPL6804, GPL6801 → esearch.fcgi?db=gds
– Parsing: keep the GEO DataSets common to both branches → Result table
Make custom scripts with XML-parser
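Once the XML responses have been parsed, the final step of Application 3 reduces to a set intersection. The sketch below shows one way to build the result table. The function name, the input structures and the sample IDs are placeholders invented for this example and are not taken from the talk's scripts.

```python
def articles_on_platform(pubmed_to_gds, platform_gds_ids, titles):
    """Return (PMID, title, shared GDS IDs) rows for articles on the platform.

    pubmed_to_gds: dict mapping each PMID to the GDS IDs linked to it (via ELink)
    platform_gds_ids: GDS IDs found for the platform accessions (via ESearch)
    titles: dict mapping each PMID to its article title (via ESummary)
    """
    platform = set(platform_gds_ids)
    rows = []
    for pmid, gds_ids in pubmed_to_gds.items():
        common = sorted(platform.intersection(gds_ids))
        if common:
            rows.append((pmid, titles.get(pmid, ""), ", ".join(common)))
    return rows


# Placeholder identifiers, for illustration only.
table = articles_on_platform(
    pubmed_to_gds={"PMID_A": ["G1", "G2"], "PMID_B": ["G9"], "PMID_C": []},
    platform_gds_ids=["G2", "G1", "G5"],
    titles={"PMID_A": "Title A", "PMID_B": "Title B"},
)
for row in table:
    print("\t".join(row))
```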
EBot
• EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page linked below.
– http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi
Entrez Direct
• E-utilities on the UNIX Command Line
• Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/
• Entrez Direct Functions
– esearch performs a new Entrez search using terms in indexed fields.
– elink looks up neighbors (within a database) or links (between databases).
– efilter filters or restricts the results of a previous query.
– efetch downloads records or reports in a designated format.
– xtract converts XML into a table of data values.
– einfo obtains information on indexed fields in an Entrez database.
– epost uploads unique identifiers (UIDs) or sequence accession numbers.
– nquire sends a URL request to a web page or CGI service.
• Entering Query Commands
– esearch -db pubmed -query "opsin gene conversion" | elink -related
Links
• References
– Entrez Programming Utilities Help
• http://www.ncbi.nlm.nih.gov/books/NBK25501/
– Entrez Help
• http://www.ncbi.nlm.nih.gov/books/NBK3836/
• Useful Links
– Entrez Unique Identifiers (UIDs) for selected databases
• http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?report=objectonly
– Valid values of &retmode and &rettype for EFetch (null = empty string)
• http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly
– The full list of Entrez links
• http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
NCBI databases
• Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search
• Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA
• Organisms: Taxonomy
• Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe
• Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone
• Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets
• Proteins: Protein, Conserved Domains, Protein Clusters, Structure
• Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay
• Pathways: BioSystems
E-utilities
• Eight server-side programs
– ESearch : Searching a Database
– EPost : Uploading UIDs to Entrez
– ESummary : Downloading Document Summaries
– EFetch : Downloading Full Records
– ELink : Finding Related Data Through Entrez Links
– EInfo : Getting Database Statistics and Search Fields
– EGQuery : Performing a Global Entrez Search
– ESpell : Retrieving Spelling Suggestions
Sample Applications of the E-utilities
• Basic pipelines
– ESearch - ESummary/EFetch
– EPost - ESummary/EFetch
– ELink - ESummary/EFetch
– ESearch - ELink - ESummary/EFetch
– EPost - ELink - ESummary/EFetch
– EPost - ESearch
– ELink - ESearch
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
1. tr '\n' '\t' < cancer_copy_number.pubmed_result.txt | sed 's/\t\t/\n/g' | sed 's/^\t[0-9]*: //' | sed 's/\t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt
2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/.$//' > cancer_copy_number.pubmed_ids.txt
3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$id\t/"; done > cancer_copy_number.pubmed_gds_ids.txt
4. awk -F'\t' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt
5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt
6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt

NCBI API - Integration into analysis code

  • 1. NCBI API – Integration into analysis code QBRC Tech Talk Jiwoong Kim
  • 2. Outlines • Introduction • Usage Guidelines of the E-utilities • Sample Applications of the E-utilities
  • 3. NCBI & Entrez • The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. • Entrez is NCBI’s primary text search and retrieval system that integrates the PubMed database of biomedical literature with 39 other literature and molecular databases including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.
  • 4. E-utilities • Entrez Programming Utilities – The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI. – The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data. E-utilitiesURL XML, FASTA, Text … Input Output
  • 5. Usage Guidelines and Requirements • Use the E-utility URL – baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ … – Python urllib/urlopen, Perl LWP::Simple, Linux wget, … • Frequency, Timing and Registration of E-utility URL Requests – Make no more than 3 requests per second → sleep(0.5) – Run large jobs on weekends or between 5 PM and 9 AM EST – Include &tool and &email in all requests • Minimizing the Number of Requests – &retmax=500 • Handling Special Characters Within URLs – Space → +, " → %22, # → %23
  • 7. ESearch (text searches) • Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query. • Syntax: esearch.fcgi?db=<database>&term=<query> – Input: Entrez database (&db); Any Entrez text query (&term) – Output: List of UIDs matching the Entrez query • Example: Get the PubMed IDs (PMIDs) for articles about osteosarcoma – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed& term=%22osteosarcoma%22[majr:noexp]
  • 9. ESummary (document summary downloads) • Responds to a list of UIDs from a given database with the corresponding document summaries. • Syntax: esummary.fcgi?db=<database>&id=<uid_list> – Input: List of UIDs (&id); Entrez database (&db) – Output: XML DocSums • Example: Download DocSums for these PubMed IDs: 24450072, 24333720, 24333432 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubme d&id=24450072,24333720,24333432
  • 11. EFetch (data record downloads) • Responds to a list of UIDs in a given database with the corresponding data records in a specified format. • Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval _type>&retmode=<retrieval_mode> – Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode) – Output: Formatted data records as specified • Example: Download the abstract of PubMed ID 24333432 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&i d=24333432&rettype=abstract&retmode=text
  • 12. ELink (Entrez links) • Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database • Checks for the existence of a specified link from a list of one or more UIDs • Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.
  • 13. ELink (Entrez links) • Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<uid_list> – Input: List of UIDs (&id); Source Entrez database (&dbfrom); Destination Entrez database (&db) – Output: XML containing linked UIDs from source and destination databases • Example: Find one set/separate sets of Gene IDs linked to PubMed IDs 24333432 and 24314238 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432,24314238 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432&id=24314238
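The two URL forms in this example differ only in how &id is supplied: a comma-separated list yields one combined link set, while repeated &id parameters yield one link set per UID. A sketch of both, plus a parser for the reply, follows. The sample XML is hand-written, its Gene IDs are illustrative rather than real links, and `links_by_source` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ELINK = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?"
pmids = ["24333432", "24314238"]

# One combined set: a single id parameter holding a comma-separated list.
one_set_url = ELINK + urlencode(
    {"dbfrom": "pubmed", "db": "gene", "id": ",".join(pmids)}, safe=",")
# Separate sets: the id parameter repeated once per UID.
separate_sets_url = ELINK + urlencode(
    {"dbfrom": "pubmed", "db": "gene", "id": pmids}, doseq=True)

# Hand-written sample with one LinkSet per source UID; the Gene IDs are illustrative.
SAMPLE_ELINK_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>24333432</Id></IdList>
    <LinkSetDb>
      <DbTo>gene</DbTo>
      <LinkName>pubmed_gene</LinkName>
      <Link><Id>7173</Id></Link>
      <Link><Id>22018</Id></Link>
    </LinkSetDb>
  </LinkSet>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>24314238</Id></IdList>
    <LinkSetDb>
      <DbTo>gene</DbTo>
      <LinkName>pubmed_gene</LinkName>
      <Link><Id>54314</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def links_by_source(xml_text):
    """Map each LinkSet's source UIDs (as a tuple) to its linked UIDs."""
    root = ET.fromstring(xml_text)
    result = {}
    for linkset in root.findall("LinkSet"):
        sources = tuple(node.text for node in linkset.findall("./IdList/Id"))
        result[sources] = [node.text for node in linkset.findall("./LinkSetDb/Link/Id")]
    return result


links = links_by_source(SAMPLE_ELINK_XML)
print(one_set_url)
print(separate_sets_url)
print(links)
```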
  • 15. EGQuery (global query) • Responds to a text query with the number of records matching the query in each Entrez database. • Syntax: egquery.fcgi?term=<query> – Input: Entrez text query (&term) – Output: XML containing the number of hits in each database. • Example: Determine the number of records for mouse in Entrez. – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[orgn]&retmode=xml
  • 17. ESpell (spelling suggestions) • Retrieves spelling suggestions for a text query in a given database. • Syntax: espell.fcgi?term=<query>&db=<database> – Input: Entrez text query (&term); Entrez database (&db) – Output: XML containing the original query and spelling suggestions. • Example: Find spelling suggestions for the PubMed query "osteosacoma". – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosacoma&db=pmc
  • 18. EInfo (database statistics) • Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases. • Syntax: einfo.fcgi?db=<database> – Input: Entrez database (&db) – Output: XML containing database statistics • Example: Find database statistics for Entrez Protein. – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein
  • 19. EPost (UID uploads) • Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset. • Syntax: epost.fcgi?db=<database>&id=<uid_list> – Input: List of UIDs (&id); Entrez database (&db) – Output: Web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of uploaded UIDs • Example: Upload five Gene IDs (7173, 22018, 54314, 403521, 525013) for later processing. – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=7173,22018,54314,403521,525013
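For long UID lists, the upload is commonly sent as an HTTP POST so the list travels in the request body rather than the URL. The sketch below only constructs such a request with `urllib.request.Request` and does not send it. `make_epost_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

EPOST_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def make_epost_request(db, uids):
    """Build (but do not send) a POST request that uploads `uids` to `db`."""
    body = urlencode({"db": db, "id": ",".join(str(uid) for uid in uids)})
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return Request(EPOST_URL, data=body.encode("ascii"))


request = make_epost_request("gene", [7173, 22018, 54314, 403521, 525013])
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Passing `request` to `urllib.request.urlopen` would perform the upload. The reply carries the query key and web environment described on this slide.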
  • 20. Application 1 • Find human genes related to articles retrieved with the non-exploded MeSH major topic "Osteosarcoma" (PubMed → Gene) 1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]&usehistory=y 2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266&term=%22homo+sapiens%22[organism]&cmd=neighbor_history 3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266
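Chaining these steps in a script means reading the query key and web environment out of one reply and passing them to the next request. A minimal sketch follows, assuming a reply in the shape returned when &usehistory=y is set. The sample reply and its `NCID_EXAMPLE` value are placeholders, and `history_params` and `next_step_url` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hand-written reply in the shape returned with &usehistory=y; values are placeholders.
SAMPLE_REPLY = """<eSearchResult>
  <Count>1234</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>NCID_EXAMPLE</WebEnv>
</eSearchResult>"""


def history_params(xml_text):
    """Extract the history-server handle (query_key, WebEnv) from a reply."""
    root = ET.fromstring(xml_text)
    return {"query_key": root.findtext("QueryKey"), "WebEnv": root.findtext("WebEnv")}


def next_step_url(utility, history, **params):
    """Build the URL of the next pipeline step, reusing the stored result set."""
    return BASE_URL + utility + ".fcgi?" + urlencode({**params, **history})


history = history_params(SAMPLE_REPLY)
elink_url = next_step_url(
    "elink", history, dbfrom="pubmed", db="gene",
    term='"homo sapiens"[organism]', cmd="neighbor_history")
print(elink_url)
```

The ELink reply would be parsed in the same way to obtain the query key for the final ESummary call.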
  • 21. Application 1 • Find human genes related to articles retrieved with the non-exploded MeSH major topic "Osteosarcoma" (PubMed → Gene) – ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz • It can be used instead of "ELink". – ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz • It can be used instead of "ESummary".
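The local-file alternative to ELink can be sketched as a filter over gene2pubmed, assuming its tab-separated columns are tax_id, GeneID and PubMed_ID. The rows below are invented for illustration, and `genes_for_pmids` is a hypothetical helper name.

```python
import csv
import io

# Invented rows assuming the gene2pubmed column layout: tax_id, GeneID, PubMed_ID.
SAMPLE_GENE2PUBMED = (
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t7157\t24333432\n"
    "10090\t22059\t24333432\n"
    "9606\t5925\t24314238\n"
    "9606\t672\t11111111\n"
)


def genes_for_pmids(handle, pmids, tax_id="9606"):
    """Return the Gene IDs of one organism (default: human) linked to any of `pmids`."""
    wanted = set(pmids)
    genes = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        if row[0] == tax_id and row[2] in wanted:
            genes.add(row[1])
    return genes


human_genes = genes_for_pmids(io.StringIO(SAMPLE_GENE2PUBMED), ["24333432", "24314238"])
print(sorted(human_genes))
```

With the downloaded file, `gzip.open("gene2pubmed.gz", "rt")` could be passed as `handle` in place of the in-memory sample.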
  • 22. Application 2 • Find nucleotide sequences of "Burkholderia cepacia complex" and download in GenBank format 1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=%22burkholderia+cepacia+complex%22[organism]&usehistory=y 2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001_1396244608_457974498&rettype=gb&retmode=text
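When the search matches many sequences, the EFetch step is usually split into batches with &retstart and &retmax. The sketch below only generates the batch URLs. The count, `NCID_EXAMPLE` and the helper name `efetch_batch_urls` are placeholders for illustration.

```python
from urllib.parse import urlencode

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"


def efetch_batch_urls(count, web_env, query_key, batch_size=500):
    """Yield one EFetch URL per batch of `batch_size` records from the history server."""
    for start in range(0, count, batch_size):
        yield EFETCH + urlencode({
            "db": "nuccore",
            "query_key": query_key,
            "WebEnv": web_env,
            "rettype": "gb",
            "retmode": "text",
            "retstart": start,
            "retmax": batch_size,
        })


# Placeholder count and WebEnv; real values come from the ESearch reply.
urls = list(efetch_batch_urls(1200, "NCID_EXAMPLE", 1))
for url in urls:
    print(url)
```

Each URL would be fetched in turn, with a pause between requests, and the text appended to one GenBank file.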
  • 23. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets • [Flowchart] – PubMed branch: query cancer "copy number" → esearch.fcgi?db=pubmed → WebEnv, query_key → esummary.fcgi?db=pubmed → elink.fcgi?dbfrom=pubmed&db=gds – GEO DataSets branch: query Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] → esearch.fcgi?db=gds → WebEnv, query_key → esummary.fcgi?db=gds → platforms GPL9704, GPL8226, GPL6804, GPL6801 → esearch.fcgi?db=gds – Parsing → Result table of the common PubMed titles
  • 24. "cancer copy number" articles "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  • 26. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  • 29. Make custom scripts with XML-parser
  • 30. EBot • EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page linked below. – http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi
  • 31. Entrez Direct • E-utilities on the UNIX Command Line • Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/ • Entrez Direct Functions – esearch performs a new Entrez search using terms in indexed fields. – elink looks up neighbors (within a database) or links (between databases). – efilter filters or restricts the results of a previous query. – efetch downloads records or reports in a designated format. – xtract converts XML into a table of data values. – einfo obtains information on indexed fields in an Entrez database. – epost uploads unique identifiers (UIDs) or sequence accession numbers. – nquire sends a URL request to a web page or CGI service. • Entering Query Commands – esearch -db pubmed -query "opsin gene conversion" | elink -related
  • 32. Links • References – Entrez Programming Utilities Help • http://www.ncbi.nlm.nih.gov/books/NBK25501/ – Entrez Help • http://www.ncbi.nlm.nih.gov/books/NBK3836/ • Useful Links – Entrez Unique Identifiers (UIDs) for selected databases • http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?report=objectonly – Valid values of &retmode and &rettype for EFetch (null = empty string) • http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly – The full list of Entrez links • http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
  • 33. NCBI databases • Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search • Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA • Organisms: Taxonomy • Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe • Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone • Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets • Proteins: Protein, Conserved Domains, Protein Clusters, Structure • Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay • Pathways: BioSystems
  • 34. E-utilities • Eight server-side programs – ESearch : Searching a Database – EPost : Uploading UIDs to Entrez – ESummary : Downloading Document Summaries – EFetch : Downloading Full Records – ELink : Finding Related Data Through Entrez Links – EInfo : Getting Database Statistics and Search Fields – EGQuery : Performing a Global Entrez Search – ESpell : Retrieving Spelling Suggestions
  • 35. Sample Applications of the E-utilities • Basic pipelines – ESearch - ESummary/EFetch – EPost - ESummary/EFetch – ELink - ESummary/EFetch – ESearch - ELink - ESummary/EFetch – EPost - ELink - ESummary/EFetch – EPost - ESearch – ELink - ESearch
  • 36. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets 1. tr '\n' '\t' < cancer_copy_number.pubmed_result.txt | sed 's/\t\t/\n/g' | sed 's/^\t[0-9]*: //' | sed 's/\t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt 2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/.$//' > cancer_copy_number.pubmed_ids.txt 3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$id\t/"; done > cancer_copy_number.pubmed_gds_ids.txt 4. awk -F'\t' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt 5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt
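The final join in step 6 amounts to keeping the articles whose linked GEO DataSets UIDs also appear in the platform search results. A rough Python equivalent is sketched below. It is not a port of the author's Perl table scripts, the UIDs are invented, and `articles_on_platforms` is a hypothetical helper name.

```python
def articles_on_platforms(pubmed_to_gds, platform_gds_ids):
    """Keep PMIDs with at least one linked GDS UID among the platform's GDS UIDs."""
    hits = {}
    for pmid, gds_ids in pubmed_to_gds.items():
        shared = set(gds_ids) & platform_gds_ids
        if shared:
            hits[pmid] = sorted(shared)
    return hits


# Invented UIDs standing in for the contents of the files written in steps 3 and 5.
pubmed_to_gds = {
    "24333432": ["200012345", "200067890"],
    "24314238": ["200011111"],
    "24450072": [],
}
platform_gds_ids = {"200067890", "200022222"}

result = articles_on_platforms(pubmed_to_gds, platform_gds_ids)
for pmid, gds_ids in result.items():
    print(pmid, ", ".join(gds_ids), sep="\t")
```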