EBI is an Outstation of the European Molecular Biology Laboratory.
Ensembl annotation
Bronwen Aken
21 September 2014
How Ensembl started
• Ewan Birney
• Michele Clamp
• Tim Hubbard
Ensembl’s goals
• Annotate (vertebrate) genome
• Integrate with other biological data
• Make publicly available
• Stable, automatic annotation
• High quality
• Regular release cycles
• Open source
“Provide a bioinformatics framework to organise biology around the sequences of large genomes”
Challenges
1. Find functional elements in a genome
• Data have lots of noise
2. Software / hardware
• Storing and manipulating data
3. Intuitive and comprehensive access to data
• Visualization
GRCh38 annotation in Ensembl
What is Genebuilding?
• Automatic, evidence-based annotation of
genes
• Not ab initio
• Based on sequence alignment
• “Best-in-genome”
• Aim for high specificity
• Prefer to miss a few features rather than heavily over-predict
The automated gene annotation pipeline is designed around decisions made during manual annotation
Advantages of re-annotating
• Add new genes to new / fixed genomic regions
• Update supporting evidence: remove models built on data that have since been deleted from the archives
• Move alignments to regions with better mapping
Gene annotation pipeline – the basics
1. Identify interesting regions
• Rough alignment of sequences to genome
2. Exhaustive alignment to produce transcript models
3. Filter models
• Prioritize data sources
4. Produce ‘best guess’ gene set
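The "filter models, prioritize data sources" step can be sketched as a layered filter. This is a minimal illustration only, not the Ensembl implementation: the `Model` class, the `layer_filter` function, the priority order, and all coordinates are hypothetical, and a single chromosome and strand are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    source: str  # evidence type the model was built from (hypothetical labels)
    start: int   # 1-based inclusive genomic start
    end: int     # 1-based inclusive genomic end


def overlaps(a: Model, b: Model) -> bool:
    """True if the two models share at least one base."""
    return a.start <= b.end and b.start <= a.end


def layer_filter(models, priority):
    """Keep a model only if no model from a higher-priority source overlaps it."""
    kept = []
    for source in priority:  # highest-priority evidence first
        higher_layers = list(kept)  # snapshot: models already accepted from better sources
        for m in models:
            if m.source == source and not any(overlaps(m, k) for k in higher_layers):
                kept.append(m)
    return kept


# Assumed priority order, for illustration only.
PRIORITY = ["same_species_protein", "cdna", "other_species_protein"]

models = [
    Model("other_species_protein", 100, 500),
    Model("same_species_protein", 150, 450),
    Model("cdna", 900, 1200),
    Model("other_species_protein", 2000, 2500),
]

best_guess = layer_filter(models, PRIORITY)
for m in best_guess:
    print(m)
```

The other-species model at 100-500 is dropped because a same-species model already covers that locus, while the one at 2000-2500 survives because nothing better is available there.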
Repeatmasking
Same-species proteins
Other-species proteins
cDNAs/ESTs
UTR addition
Final gene set
Filtering
Protein-coding genebuild
Filtering
TranscriptConsensus
LayerAnnotation
Also:
Small ncRNAs
LincRNAs
Pseudogenes
Repeatmasking
Same-species proteins
Other-species proteins
cDNAs/ESTs
UTR addition
Final gene set
Filtering
Protein-coding genebuild
Filtering
RNA-Seq models
Also:
Small ncRNAs
LincRNAs
Pseudogenes
MERGE WITH HAVANA
Release cycle
26 September 2014
Regulation
Gene
Allele
Conserved sequence
Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/
Genes
• Coding & noncoding
• Protein & mRNA alignments
• GTF & BAM files
Compara
• Conserved DNA sequence
• Multiple genome alignments
• Homologues
• Protein families
Regulatory regions
• DNA methylation
• TFBS
• Open chromatin
Variation
• SNPs, indels, structural variation
• Phenotypes
• QTLs
Integrate with other species
Chimpanzee
Human
Gene SLC12A1
‘Patch’ annotation in Ensembl
Genome assembly representation
• Coord_system table
• Lists the allowed coordinate systems
• chromosome, scaffold, contig
• With ‘versions’
• GRCh37, GRCh38
• Contigs are shared between assemblies so have no version
• ‘Toplevel’ coordinate system
• Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences
• Most popular means to access the whole genome
• API options for including/excluding alternate sequences and PAR
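One way to picture ‘toplevel’ is as the set of sequence regions that are not assembled into anything larger. The sketch below assumes a tiny hand-made hierarchy; the region names, the `assembled_into` mapping, and the `include_alt` flag are hypothetical and do not reproduce the real schema or API.

```python
# Hypothetical sequence regions and the coordinate system each belongs to.
seq_regions = {
    "17": "chromosome",
    "scaffold_17a": "scaffold",
    "scaffold_17b": "scaffold",
    "unplaced_scaffold_1": "scaffold",
    "alt_scaffold_1": "scaffold",
    "contig_1": "contig",
    "contig_2": "contig",
    "contig_3": "contig",
    "contig_4": "contig",
}

# component -> the larger region it is assembled into (hypothetical).
assembled_into = {
    "contig_1": "scaffold_17a",
    "contig_2": "scaffold_17b",
    "contig_3": "unplaced_scaffold_1",
    "contig_4": "alt_scaffold_1",
    "scaffold_17a": "17",
    "scaffold_17b": "17",
}

# Regions flagged as alternate sequences (hypothetical).
alt_regions = {"alt_scaffold_1"}


def toplevel_regions(include_alt=True):
    """Regions that are not a component of any larger region."""
    top = [name for name in seq_regions if name not in assembled_into]
    if not include_alt:
        top = [name for name in top if name not in alt_regions]
    return sorted(top)


print(toplevel_regions())
print(toplevel_regions(include_alt=False))
```

Contigs and placed scaffolds drop out because they are components of something larger; the unplaced scaffold stays because nothing sits above it.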
Genome assembly representation
GRCh38
Scaffolds
Contigs
Chromosome
DNA only loaded for contigs
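Because DNA is stored only at the contig level, a chromosome sequence has to be stitched together from contig pieces. The following is a rough sketch under simplifying assumptions: it maps chromosome coordinates straight to contigs (skipping the scaffold level), and the contig names, sequences, and `assembly` rows are invented for illustration.

```python
# Hypothetical contig sequences: the only place DNA is stored.
contig_dna = {
    "contig_1": "ACGTACGT",
    "contig_2": "TTGGCCAA",
}

# Hypothetical assembly rows:
# (chr_start, chr_end, contig, contig_start, contig_end, orientation), 1-based inclusive.
assembly = [
    (1, 6, "contig_1", 2, 7, 1),
    (11, 14, "contig_2", 1, 4, -1),
]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def chromosome_sequence(start: int, end: int) -> str:
    """Build chromosome sequence for [start, end]; uncovered positions are 'N'."""
    seq = ["N"] * (end - start + 1)
    for c_start, c_end, contig, t_start, t_end, orientation in assembly:
        piece = contig_dna[contig][t_start - 1:t_end]
        if orientation == -1:
            piece = revcomp(piece)
        for offset, base in enumerate(piece):
            pos = c_start + offset
            if start <= pos <= end:
                seq[pos - start] = base
    return "".join(seq)


print(chromosome_sequence(1, 14))
```

Positions 7 to 10 are not covered by any contig in this toy assembly, so they come back as `N`, which is how an assembly gap would appear.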
Genome assembly representation
GRCh38
Scaffolds
Contigs
Chromosome
GRCh37
Seq_region names
• Regions of the genome are given a slice name, which works like an address
• e.g. chromosome:GRCh37:6:133090509:133119701:1
• Users like to say, ‘chromosome 6’
• INSDC coordinates are versioned, but less human-readable
• chromosome:GRCh37:CM000668.1:133090509:133119701:1
Slice name fields: coord_system : assembly : seq_region.name : start : end : strand
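A slice name is six colon-separated fields: coordinate system, assembly, seq_region name, start, end, and strand. A minimal parser might look like the sketch below; the `SliceName` type and `parse_slice_name` function are hypothetical helpers, not part of the Ensembl API.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system: str
    assembly: str
    seq_region_name: str
    start: int
    end: int
    strand: int

    @property
    def length(self) -> int:
        # Coordinates are 1-based and inclusive.
        return self.end - self.start + 1


def parse_slice_name(name: str) -> SliceName:
    """Split a slice name into its six fields and convert the numeric ones."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}: {name!r}")
    coord_system, assembly, region, start, end, strand = parts
    return SliceName(coord_system, assembly, region, int(start), int(end), int(strand))


by_name = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
by_accession = parse_slice_name("chromosome:GRCh37:CM000668.1:133090509:133119701:1")
print(by_name)
print(by_name.length)
```

Both example names from the slide parse to the same coordinates; only the seq_region name differs between the human-readable and the INSDC form.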
Alternate sequences
• Assembly_exception table defines ‘bubbles’
• Initially set up to handle Y chromosome PAR
• Adapted to work for MHC haplotypes
• Now also used for GRC patches
• Assumes an ‘equivalent’ region will be present in the primary assembly
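The ‘bubble’ idea can be illustrated by swapping an alternate region in for its equivalent stretch of the primary assembly. This is a simplified sketch: the `AssemblyException` fields, region names, and coordinates are hypothetical and are not the columns of the real assembly_exception table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyException:
    alt_region: str      # alternate sequence, e.g. a patch or haplotype (hypothetical name)
    alt_start: int
    alt_end: int
    primary_region: str  # primary assembly region holding the equivalent stretch
    primary_start: int
    primary_end: int


def patched_layout(primary_length: int, exc: AssemblyException):
    """Return (region, start, end) segments of the primary with the alternate swapped in."""
    segments = []
    if exc.primary_start > 1:
        segments.append((exc.primary_region, 1, exc.primary_start - 1))
    segments.append((exc.alt_region, exc.alt_start, exc.alt_end))
    if exc.primary_end < primary_length:
        segments.append((exc.primary_region, exc.primary_end + 1, primary_length))
    return segments


def layout_length(segments) -> int:
    """Total number of bases covered by a list of 1-based inclusive segments."""
    return sum(end - start + 1 for _, start, end in segments)


# A 300 bp alternate sequence replaces 200 bp (401-600) of a 1000 bp primary region.
exc = AssemblyException("PATCH_1", 1, 300, "17", 401, 600)
layout = patched_layout(1000, exc)
print(layout)
print(layout_length(layout))
```

The flanks come from the primary assembly and only the bubble itself comes from the alternate sequence, which is why the approach depends on an equivalent primary region existing.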
Gene annotation on a ‘patched’ genome
[Figure: Ensembl browser views of the same locus on the patched chromosome (Hsap HG183_PATCH, 62.3 Mb to 62.5 Mb) and on the primary chromosome (Hsap Chr. 17, 62.225 Mb to 62.475 Mb). Tracks include contigs, GENCODE genes, assembly exceptions, and H.sap-H.sap lastz alignments. Gene legend: protein coding, merged Ensembl/Havana, RNA gene, pseudogene, alternative alleles, projection. Panel lengths: 331.04 kb and 276.06 kb.]
• TEX2 gene lies across the patch boundary
• PECAM1 is annotated only on patch HG183
• Gap in primary assembly
Gene annotation on patches
Patch
Primary
1. Manual annotation
2. Project models to patch
3. Gap-fill with mini genebuild
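Step 2, projecting models from the primary assembly onto the patch, can be sketched as a coordinate lift through ungapped alignment blocks. This is an assumption-laden illustration: the `blocks` list, the coordinates, and the rule that an exon must fall entirely inside one block are all hypothetical simplifications, not the actual projection code.

```python
# Hypothetical ungapped alignment blocks: (primary_start, primary_end, patch_start).
# The patch carries extra sequence between the two blocks.
blocks = [
    (100, 199, 100),
    (200, 299, 350),
]


def project_interval(start: int, end: int):
    """Map a primary interval onto the patch if it lies within a single block."""
    for p_start, p_end, patch_start in blocks:
        if p_start <= start and end <= p_end:
            shift = patch_start - p_start
            return (start + shift, end + shift)
    return None  # spans a block boundary or falls outside the alignment


def project_exons(exons):
    """Project every exon of a model; return None if any exon fails to project."""
    projected = []
    for start, end in exons:
        mapped = project_interval(start, end)
        if mapped is None:
            return None
        projected.append(mapped)
    return projected


print(project_exons([(120, 180), (220, 260)]))
print(project_exons([(190, 210)]))
```

Models that come back as `None` mark loci where projection fails; in the slide's workflow those are the regions left for the gap-filling mini genebuild in step 3.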
Ongoing challenges
• How strict should we be when aligning proteins and cDNAs to the genome?
1. Genome assembly
• Sequencing error (inversion, artificial duplication)
• Assembly incomplete
• Alignments must allow for truncated matches
2. Population variation
• The linear genome is built from ‘one’ individual, whereas protein databases contain data from many unknown individuals
• Paralogues, gene families, pseudogenes
3. Public databases, e.g. UniProt
• Include suspect data and are incomplete for many species
• When there’s a match, or no match, is it biologically real?
• Aligning proteins from other species must allow for mismatches
Specificity
Sensitivity
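The trade-off between specificity and sensitivity can be made concrete with a small calculation. The sketch below assumes the convention common in gene prediction, where specificity is the fraction of predictions that are correct (TP / (TP + FP)) and sensitivity is the fraction of real features that are found (TP / (TP + FN)); the gene labels are made up.

```python
def sensitivity_specificity(predicted: set, reference: set):
    """Return (sensitivity, specificity) of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


reference = {"A", "B", "C", "D"}               # hypothetical real genes
strict = {"A", "B"}                            # strict alignment: few calls, all correct
lenient = {"A", "B", "C", "X", "Y", "Z"}       # lenient alignment: more calls, more errors

print(sensitivity_specificity(strict, reference))
print(sensitivity_specificity(lenient, reference))
```

With these toy sets the strict settings miss half the genes but predict nothing false, while the lenient settings recover more genes at the cost of three false calls.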
Funding
European Commission
Framework Programme 7
Ensembl Acknowledgements
Questions?
Reporting data to users
Visualisation and data querying:
• When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available?
• How do we show when the alternate genomic sequences are identical to or differ from one another?
• How do we show whether the alternate genome sequences result in identical or different transcribed / translated products?
• How do we make a qualitative call about which allele is “better” to use? e.g. ABO
• Data download options
• Concept of a ‘canonical’ transcript per gene (per tissue)
Data analysis:
• Linking between alternate alleles (and paralogues?)
• How do we show when data have been mapped from an old to a new assembly, rather than freshly aligned to the new assembly? When is it right to map instead of align?
• In a non-linear genome model, how will SNPs (rsIDs) work?
• In a non-linear genome model, what coordinate system should be used?

Pests of tenai_Identification,Binomics_Dr.UPR
 
Exploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & ResearchExploration Method’s in Archaeological Studies & Research
Exploration Method’s in Archaeological Studies & Research
 
Pests of Redgram_Identification, Binomics_Dr.UPR
Pests of Redgram_Identification, Binomics_Dr.UPRPests of Redgram_Identification, Binomics_Dr.UPR
Pests of Redgram_Identification, Binomics_Dr.UPR
 
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data science
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptx
 
Lehninger_Chapter 17_Fatty acid Oxid.ppt
Lehninger_Chapter 17_Fatty acid Oxid.pptLehninger_Chapter 17_Fatty acid Oxid.ppt
Lehninger_Chapter 17_Fatty acid Oxid.ppt
 
Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...
Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...
Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
 
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening Test
 
CW marking grid Analytical BS - M Ahmad.docx
CW  marking grid Analytical BS - M Ahmad.docxCW  marking grid Analytical BS - M Ahmad.docx
CW marking grid Analytical BS - M Ahmad.docx
 
Role of Herbs in Cosmetics in Cosmetic Science.
Role of Herbs in Cosmetics in Cosmetic Science.Role of Herbs in Cosmetics in Cosmetic Science.
Role of Herbs in Cosmetics in Cosmetic Science.
 
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdfPests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
Pests of cumbu_Identification, Binomics, Integrated ManagementDr.UPR.pdf
 
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxQ3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
 
Alternative system of medicine herbal drug technology syllabus
Alternative system of medicine herbal drug technology syllabusAlternative system of medicine herbal drug technology syllabus
Alternative system of medicine herbal drug technology syllabus
 
Human brain.. It's parts and function.
Human brain.. It's parts and function. Human brain.. It's parts and function.
Human brain.. It's parts and function.
 

Ensembl annotation

  • 1. EBI is an Outstation of the European Molecular Biology Laboratory. Ensembl annotation Bronwen Aken 21 September 2014
  • 2. How Ensembl started • Ewan Birney • Michele Clamp • Tim Hubbard
  • 3. Ensembl’s goals Annotate (vertebrate) genome Integrate with other biological data Make publicly available • Stable, automatic annotation • High quality • Regular release cycles • Open source “Provide a bioinformatics framework to organise biology around the sequences of large genomes”
  • 4. Challenges 1. Find functional elements in a genome • Data have lots of noise 2. Software / hardware • Storing and manipulating data 3. Intuitive and comprehensive access to data • Visualization
  • 6. What is Genebuilding? • Automatic, evidence-based annotation of genes • Not ab initio • Based on sequence alignment • “Best-in-genome” • Aim for high specificity • Prefer to miss a few features rather than heavily over-predict. The automated gene annotation pipeline is designed around decisions made during manual annotation
  • 7. Advantages of re-annotating • Add new genes to new / fixed genomic regions • Updated supporting evidence: Remove models built on data that has been deleted from archives • Move alignments to regions with better mapping
  • 8. Gene annotation pipeline – the basics Identify interesting regions • Rough alignment of sequences to genome Exhaustive alignment to produce transcript models Filter models • Prioritize data sources Produce ‘best guess’ gene set
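The "filter models / prioritize data sources" step on slide 8 can be pictured with a minimal sketch. This is not the Ensembl implementation: the `Model` class, the source names, and the priority order in `PRIORITY` are assumptions made up for illustration. The idea shown is only that a model from a lower-priority evidence source is dropped when it overlaps a model already accepted from a higher-priority source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    source: str  # evidence type the model was built from
    start: int   # inclusive genomic start
    end: int     # inclusive genomic end


# Hypothetical priority order, highest first (an assumption, not Ensembl's list).
PRIORITY = ["same_species_protein", "cdna", "other_species_protein"]


def overlaps(a: Model, b: Model) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def filter_by_priority(models):
    """Walk the sources from highest to lowest priority and keep a model
    only if it does not overlap a model kept from a higher-priority source."""
    kept = []
    for source in PRIORITY:
        higher = list(kept)  # models accepted from earlier (higher) layers
        for m in models:
            if m.source == source and not any(overlaps(m, k) for k in higher):
                kept.append(m)
    return kept


models = [
    Model("m1", "other_species_protein", 100, 500),
    Model("m2", "same_species_protein", 120, 480),
    Model("m3", "other_species_protein", 900, 1200),
]
best = filter_by_priority(models)
print([m.name for m in best])
```

Here `m1` is discarded because the same locus already has a same-species model, while `m3` survives because nothing better covers its region.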
  • 9. Repeatmasking Same-species proteins Other-species proteins cDNAs/ESTs UTR addition Final gene set Filtering Protein-coding genebuild Filtering TranscriptConsensus LayerAnnotation Also: Small ncRNAs LincRNAs Pseudogenes
  • 10. Repeatmasking Same-species proteins Other-species proteins cDNAs/ESTs UTR addition Final gene set Filtering Protein-coding genebuild Filtering RNA-Seq models Also: Small ncRNAs LincRNAs Pseudogenes MERGE WITH HAVANA
  • 11. Release cycle 26 September 2014. Figure labels: Regulation, Gene, Allele, Conserved sequence (figure adapted from the ENCODE project, www.nature.com/nature/focus/encode/). Genes • Coding & noncoding • Protein & mRNA alignments • GTF & BAM files. Compara • Conserved DNA sequence • Multiple genome alignments • Homologues • Protein families. Regulatory regions • DNA methylation • TFBS • Open chromatin. Variation • SNPs, indels, structural variation • Phenotypes • QTLs
  • 12. Integrate with other species: Chimpanzee, Human. Gene SLC12A1
  • 14. Genome assembly representation • Coord_system table • Lists the allowed coordinate systems • chromosome, scaffold, contig • With ‘versions’ • GRCh37, GRCh38 • Contigs are shared between assemblies so have no version • ‘Toplevel’ coordinate system • Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences • Most popular means to access the whole genome • API options for including/excluding alternate sequences and PAR
  • 20. Seq_region names • Regions of the genome are given a slice name; it’s like an address • e.g. chromosome:GRCh37:6:133090509:133119701:1 • Users like to say, ‘chromosome 6’ • INSDC coordinates are versioned, but less human-readable • chromosome:GRCh37:CM000668.1:133090509:133119701:1 • Slice name fields, in order: coord_system : assembly : seq_region.name : start : end : strand
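As a small illustration of the slice name on slide 20, the sketch below splits such a string into its parts. It assumes the six colon-separated fields are, in order, coord_system, assembly, seq_region name, start, end and strand; the `SliceName` type and `parse_slice_name` function are hypothetical helpers, not part of the Ensembl API.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system: str
    assembly: str
    seq_region: str
    start: int
    end: int
    strand: int


def parse_slice_name(name: str) -> SliceName:
    """Split a colon-separated slice name into typed fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    coord_system, assembly, seq_region, start, end, strand = parts
    return SliceName(coord_system, assembly, seq_region,
                     int(start), int(end), int(strand))


friendly = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
insdc = parse_slice_name("chromosome:GRCh37:CM000668.1:133090509:133119701:1")

# Both names address the same interval; only the seq_region label differs.
print(friendly.seq_region, insdc.seq_region)
print(friendly.end - friendly.start + 1)  # length of the slice in bases
```

The two example strings are the ones from the slide: the human-friendly name (‘6’) and the versioned INSDC accession (‘CM000668.1’) differ only in the seq_region field.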
  • 21. Alternate sequences • Assembly_exception table defines ‘bubbles’ • Initially set up to handle Y chromosome PAR • Adapted to work for MHC haplotypes • Now also used for GRC patches • Assumes ‘equivalent’ region will be present in primary assembly
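The ‘bubble’ idea on slide 21 can be sketched as a lookup from a position on an alternate sequence to the equivalent region of the primary assembly. The record below is a simplified, hypothetical stand-in for a row of the assembly_exception table (the field names are assumptions, not the real schema), and the coordinates are invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AssemblyException:
    # Simplified, hypothetical record; field names are illustrative only.
    alt_region: str  # alternate sequence (patch, haplotype, PAR copy)
    alt_start: int
    alt_end: int
    exc_type: str    # e.g. "PAR", "HAP", "PATCH"
    ref_region: str  # primary-assembly region assumed to be equivalent
    ref_start: int
    ref_end: int


def equivalent_primary_region(
    exceptions, region: str, pos: int
) -> Optional[Tuple[str, int, int]]:
    """Return the primary-assembly interval paired with the bubble that
    contains `pos` on `region`, or None if no exception covers it."""
    for e in exceptions:
        if e.alt_region == region and e.alt_start <= pos <= e.alt_end:
            return (e.ref_region, e.ref_start, e.ref_end)
    return None


# Invented coordinates; only the patch name comes from the slides.
exceptions = [
    AssemblyException("HG183_PATCH", 1000, 5000, "PATCH", "17", 1000, 4200),
]
inside = equivalent_primary_region(exceptions, "HG183_PATCH", 2000)
outside = equivalent_primary_region(exceptions, "HG183_PATCH", 9000)
print(inside, outside)
```

The lookup returns the whole paired interval rather than a base-by-base coordinate, because the alternate and primary sequences may differ in length inside the bubble.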
  • 22. Gene annotation on a ‘patched’ genome [Figure: two browser views, the patched chromosome (Hsap HG183_PATCH) and the primary chromosome (Hsap Chr. 17), with tracks for contigs, genes (GENCODE), assembly exceptions and H.sap-H.sap lastz alignments. Gene legend: protein coding, merged Ensembl/Havana, RNA gene, pseudogene, alternative alleles, projection.] • TEX2 gene lies across the patch boundary • PECAM1 is annotated only on patch HG183 • Gap in primary assembly
  • 23. Gene annotation on a ‘patched’ genome
  • 24. Gene annotation on patches Patch Primary
  • 25. Gene annotation on patches Patch Primary 1. Manual annotation
  • 26. Gene annotation on patches (Patch vs. Primary) 1. Manual annotation 2. Project models to patch
  • 27. Gene annotation on patches (Patch vs. Primary) 1. Manual annotation 2. Project models to patch 3. Gap-fill with mini genebuild
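One way to picture step 3 on slide 27 is to ask which parts of a patch received no projected model in step 2, since those are the candidate regions for a mini genebuild. The sketch below is only an illustration of that interval logic, under the assumption that projected models are given as inclusive (start, end) pairs lying within the patch; `uncovered_gaps` is a hypothetical helper, not a pipeline component.

```python
def uncovered_gaps(patch_start, patch_end, projected):
    """Return the inclusive intervals of the patch that are not covered
    by any projected model."""
    gaps = []
    cursor = patch_start  # first base not yet known to be covered
    for start, end in sorted(projected):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= patch_end:
        gaps.append((cursor, patch_end))
    return gaps


# Invented coordinates for three projected models on a 1000-base patch.
projected = [(100, 300), (250, 400), (700, 900)]
gaps = uncovered_gaps(1, 1000, projected)
print(gaps)
```

Overlapping projections (here 100-300 and 250-400) are merged implicitly, so only genuinely empty stretches are reported.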
  • 28. Ongoing challenges • How strict should we be when aligning proteins and cDNAs to the genome? 1. Genome assembly • Sequencing error (inversion, artificial duplication) • Assembly incomplete • Alignments must allow for truncated matches 2. Population variation • Linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals • Paralogues, gene families, pseudogenes 3. Public databases, e.g. UniProt • Include suspect data and are incomplete for many species • When there’s a match, or no match, is it biologically real? • Aligning proteins from other species must allow for mismatches • Trade-off: specificity vs. sensitivity
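The strictness question on slide 28 can be made concrete with a small sketch of threshold-based filtering. The cut-off values below are invented for illustration and are not Ensembl's settings; the point is only that same-species evidence can be held to stricter identity than other-species evidence, and that raising the thresholds trades sensitivity for specificity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query: str
    same_species: bool
    identity: float  # percent identity of the aligned region, 0-100
    coverage: float  # percent of the query that is aligned, 0-100


# Hypothetical (min_identity, min_coverage) cut-offs, keyed by same_species.
THRESHOLDS = {
    True: (95.0, 90.0),   # same species: few mismatches expected
    False: (60.0, 80.0),  # other species: mismatches must be tolerated
}


def passes(aln: Alignment) -> bool:
    """Accept an alignment only if it meets both cut-offs for its class."""
    min_identity, min_coverage = THRESHOLDS[aln.same_species]
    return aln.identity >= min_identity and aln.coverage >= min_coverage


alignments = [
    Alignment("human_protein_A", True, 99.0, 98.0),
    Alignment("human_protein_B", True, 88.0, 97.0),
    Alignment("mouse_protein_C", False, 72.0, 85.0),
]
accepted = [a.query for a in alignments if passes(a)]
print(accepted)
```

With these made-up numbers, an 88% identical same-species hit is rejected while a 72% identical other-species hit is kept, which is the kind of asymmetry the slide describes.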
  • 31. Reporting data to users. Visualisation and data querying: • When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available? • How do we show when the alternate genomic sequences are identical or differ from one another? • How do we show whether the alternate genome sequences result in identical or different transcribed / translated products? • How do we make a qualitative call about which allele is “better” to use? e.g. ABO • Data download options • Concept of a ‘canonical’ transcript per gene (per tissue). Data analysis: • Linking between alternate alleles (and paralogues?) • How do we show when data have been mapped from an old to new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align? • In a non-linear genome model, how will SNPs (rsIDs) work? • In a non-linear genome model, what coordinate system should be used?