3. Twenty Two Years of Growth: NCBI Data and User Services
[Figure: timeline chart, 1989 to 2011. Axes: GenBank base pairs (millions) and average users per weekday. NCBI resources are plotted along the timeline, from GenBank, BLAST and Entrez through to ClinVar and GTR.]
4. NCBI
Tools: BLAST, GBench, Splign, Cn3D, e-PCR, e-Utilities, …
Literature: PubMed, PubMed Central, Bookshelf, MeSH, Gene Reviews, …
Data: GenBank, Protein DB, SRA, GEO, dbSNP, Gene, RefSeq, …
5. Entrez: Pathway to Discovery
[Diagram: three linked databases. MEDLINE abstracts are related by term frequency statistics, nucleotide sequences by nucleotide sequence similarity, and protein sequences by amino acid sequence similarity. MEDLINE abstracts link to each sequence database through literature citations in sequence databases; nucleotide and protein sequences link through coding region features.]
14. GRC Beginnings
Distributed data
Old Assembly Model
Genome not in INSDC Database
15.
16.
17.
18. Build sequence contigs based on contigs defined in TPF.
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
23. Distributed data
Centralized Data
Old Assembly Model
Genome not in INSDC Database
24. Large-Scale Variation Complicates Genome Assembly
Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
27. GRCh37 (hg19)
Regions with alternate loci: UGT2B17, MHC, MAPT
7 alternate haplotypes at the MHC
Alternate loci released as:
FASTA
AGP
Alignment to chromosome
http://genomereference.org
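AGP is a tab-delimited text format that describes how an object (here, an alternate locus) is built from components and gaps. The following is a minimal reading sketch, assuming the standard nine-column layout with `N`/`U` rows marking gaps; the object and component names in the sample are hypothetical and not taken from any GRC release.

```python
from typing import List, NamedTuple, Optional


class AgpRow(NamedTuple):
    obj: str                      # object being assembled (e.g. an alternate locus)
    obj_beg: int                  # 1-based start on the object
    obj_end: int                  # 1-based inclusive end on the object
    part_number: int
    component_type: str           # e.g. F (finished), N/U (gap)
    component_id: Optional[str]   # None for gap rows
    gap_length: Optional[int]     # None for component rows
    orientation: Optional[str]    # None for gap rows


def parse_agp(text: str) -> List[AgpRow]:
    """Parse AGP text into rows, skipping blank lines and '#' comments."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        obj, beg, end, part, ctype = cols[:5]
        if ctype in ("N", "U"):
            # Gap row: column 6 holds the gap length.
            rows.append(AgpRow(obj, int(beg), int(end), int(part), ctype,
                               None, int(cols[5]), None))
        else:
            # Component row: column 6 is the component ID, column 9 the orientation.
            rows.append(AgpRow(obj, int(beg), int(end), int(part), ctype,
                               cols[5], None, cols[8]))
    return rows


# Hypothetical sample: two components joined across a 100 bp gap.
SAMPLE_AGP = "\n".join([
    "# hypothetical example",
    "alt_locus_demo\t1\t1000\t1\tF\tCOMPONENT_A.1\t1\t1000\t+",
    "alt_locus_demo\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "alt_locus_demo\t1101\t1600\t3\tF\tCOMPONENT_B.1\t501\t1000\t-",
])

agp_rows = parse_agp(SAMPLE_AGP)
```

Each row's span on the object (`obj_end - obj_beg + 1`) should equal either the component span or the gap length, which makes a simple consistency check when reading such files.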
28.
29. Assembly (e.g. GRCh37)
[Diagram labels: Primary Assembly; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT)]
31. Oh No! Not a new version of the human genome!
http://genomereference.org
32.
33. Assembly (e.g. GRCh37.p5)
[Diagram labels: Primary Assembly; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT); Genomic Region (ABO); Genomic Region (SMA); Genomic Region (PECAM1); Patches; …]
34. Myo19 region (17q21)
[Figure labels: TBC1D3C, TBC1D3, TBC1D3H]
35. 60 Fix PATCHES: Chromosome will update in GRCh38
(adds >1 Mb of novel sequence to the assembly)
70 Novel PATCHES: Additional sequence added
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
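One way to picture the two patch types is as tagged records attached to a patch release. This is only an illustrative sketch: the class names, patch names, region assignments, and lengths below are all hypothetical and do not describe actual GRC patches.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class PatchType(Enum):
    FIX = "fix"      # corrects the chromosome; folded in at the next major release
    NOVEL = "novel"  # adds sequence not represented in the current assembly


@dataclass
class Patch:
    name: str        # hypothetical label
    region: str      # genomic region the patch belongs to
    patch_type: PatchType
    length: int      # patch length in bases (made-up values below)


@dataclass
class PatchRelease:
    assembly: str
    patches: List[Patch] = field(default_factory=list)

    def totals(self) -> Dict[PatchType, Dict[str, int]]:
        """Count patches and sum their lengths, grouped by patch type."""
        out = {t: {"count": 0, "bases": 0} for t in PatchType}
        for patch in self.patches:
            out[patch.patch_type]["count"] += 1
            out[patch.patch_type]["bases"] += patch.length
        return out


# Hypothetical release content, for illustration only.
release = PatchRelease("GRCh37.p5", [
    Patch("patch_a", "ABO", PatchType.FIX, 20_000),
    Patch("patch_b", "SMA", PatchType.NOVEL, 150_000),
    Patch("patch_c", "PECAM1", PatchType.NOVEL, 90_000),
])
release_totals = release.totals()
```

Keeping the patch type on each record lets a consumer decide whether to treat a patch as a preview of the next major release (fix) or as additional sequence to align against (novel).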
36. Distributed data
Centralized Data
Old Assembly Model
Updated Assembly Model
Genome not in INSDC Database
Genome in INSDC Database
Editor's notes
TPFs (tiling path files) are loaded into a centralized system for tracking. This system also manages QA on the files as an ongoing process. The first level of QA is to examine the overlap between adjacent sequences on the TPF.
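The first-level check can be pictured as walking the tiling path and testing each adjacent pair for an overlap. The sketch below is a simplification: it assumes every component is already in the same orientation and looks only for an exact suffix-prefix match, whereas a real QA system would rely on sequence alignments that tolerate mismatches. The `Component` class and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Component:
    accession: str  # hypothetical identifier
    sequence: str


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def check_tpf_overlaps(components: List[Component],
                       min_len: int = 3) -> List[Tuple[str, str, int]]:
    """Report (left accession, right accession, overlap length) for each adjacent pair."""
    report = []
    for left, right in zip(components, components[1:]):
        size = longest_overlap(left.sequence, right.sequence, min_len)
        report.append((left.accession, right.accession, size))
    return report


# Toy tiling path: A and B overlap by "CGGA"; B and C do not overlap.
tiling_path = [
    Component("A", "ACGTACGGA"),
    Component("B", "CGGATTTCA"),
    Component("C", "GGGGCCCC"),
]
overlap_report = check_tpf_overlaps(tiling_path)
```

A pair reported with overlap length 0 is the kind of join that would need further review or external evidence.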
When certifying an overlap, external evidence supporting the alignment must be available. Evidence typically consists of sequence data from another source, spanning clone ends, or experimental verification (such as a PCR assay detecting the join). These certificates are reviewed by other GRC members and may be approved or rejected. Certification information is publicly available.
Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can string the pairs along into a contig. The contig is essentially the consensus sequence produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
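The role of switch points can be illustrated with a small sketch. It assumes the components are already ordered along the tiling path and in a consistent orientation, and that each switch point is given as a pair of 0-based offsets; the `SwitchPoint` class and `build_contig` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SwitchPoint:
    left_stop: int    # stop using the left component at this index (exclusive)
    right_start: int  # begin using the right component at this index


def build_contig(components: List[str], switch_points: List[SwitchPoint]) -> str:
    """Instantiate the contig sequence from ordered components and switch points."""
    if len(switch_points) != len(components) - 1:
        raise ValueError("need exactly one switch point per adjacent pair")
    pieces = []
    start = 0
    for i, seq in enumerate(components):
        # The last component is used through to its end.
        stop = switch_points[i].left_stop if i < len(switch_points) else len(seq)
        if not 0 <= start <= stop <= len(seq):
            raise ValueError(f"switch point out of range for component {i}")
        pieces.append(seq[start:stop])
        if i < len(switch_points):
            start = switch_points[i].right_start
    return "".join(pieces)


# Two components overlapping by "CGGA"; switch in the middle of the overlap.
contig = build_contig(
    ["ACGTACGGA", "CGGATTTCA"],
    [SwitchPoint(left_stop=7, right_start=2)],
)
```

Because the switch point falls inside the overlap, the first component contributes `ACGTACG` and the second contributes `GATTTCA`, so the overlapping bases appear once in the resulting sequence.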