IGB genome genometry data models by Gregg Helt and Cyrus Harmon

Genometry
Gregg Helt
Cyrus Harmon

Genometry
•  Motivation and Purpose
•  Points of Reference
•  Genometry interfaces
•  Genometry manipulations
•  Genometry implementation
•  Representation examples
•  Prototype apps
•  Current status, future work

Motivation and Goals
•  Desire for a more unified data model to represent
relationships between biological sequences, such as:
–  Annotations
–  Alignments
–  Sequence composition
•  More networked, less hierarchical (genome-centric,
transcript-centric)
•  Simplicity
•  Expressivity / Flexibility
•  Memory and Computational Efficiency
•  Use by others to provide core functionality for various
Affy projects

Points of Reference
•  com.neomorphic.bio models
•  Genisys DB and Genisys IDL
•  EBI mapping models
•  Apollo data models
•  BioPerl
•  BioJava
•  Closest similarity to bio alignment models and
Genisys alignment models

Basic Annotations
Sequences: Genome G, Transcript T
•  Transcript T: G:1000..5000
–  Exon E1: G:1000..1200
–  Exon E2: G:3000..3500
–  Exon E3: G:4500..5000

Genometry Annotations – Specify All Coordinates
Sequences: Genome G, Transcript T
•  Transcript T: G:1000..5000, T:0..1200
–  Exon E1: G:1000..1200, T:0..200
–  Exon E2: G:3000..3500, T:200..700
–  Exon E3: G:4500..5000, T:700..1200

Genometry Annotations – All coordinates are relative to BioSeqs
BioSeqs: Genome G, Transcript T
•  TranscriptAnnot T1: G:1000..5000, T:0..1200
–  ExonAnnot E1: G:1000..1200, T:0..200
–  ExonAnnot E2: G:3000..3500, T:200..700
–  ExonAnnot E3: G:4500..5000, T:700..1200

Genometry Annotations – SeqSpans encapsulate a range along a BioSeq
BioSeqs: Genome G, Transcript T
•  TranscriptAnnot T1, SeqSpans: G:1000..5000, T:0..1200
–  ExonAnnot E1, SeqSpans: G:1000..1200, T:0..200
–  ExonAnnot E2, SeqSpans: G:3000..3500, T:200..700
–  ExonAnnot E3, SeqSpans: G:4500..5000, T:700..1200

Genometry Core Core
•  BioSeq
–  length, residues (optional)
•  SeqSpan
–  start, end, BioSeq
•  SeqSymmetry
–  SeqSpans (breadth)
–  SeqSymmetry parent / child hierarchy (depth)
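
As a rough illustration of the three core types, here is a minimal Python sketch. The type names BioSeq, SeqSpan and SeqSymmetry come from the slide; the `name` field, the helper methods (`span_on`, `breadth`, `depth`) and the genome length of 10,000 are assumptions made for the example, not part of any actual Genometry API. It rebuilds the transcript and exon example from the earlier annotation slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BioSeq:
    name: str
    length: int
    residues: Optional[str] = None  # residues are optional, as on the slide


@dataclass(frozen=True)
class SeqSpan:
    start: int
    end: int
    seq: BioSeq


@dataclass
class SeqSymmetry:
    name: str  # hypothetical label, only used to make the example readable
    spans: List[SeqSpan] = field(default_factory=list)           # breadth
    children: List["SeqSymmetry"] = field(default_factory=list)  # depth

    def span_on(self, seq: BioSeq) -> Optional[SeqSpan]:
        """Return this symmetry's span on the given sequence, if it has one."""
        return next((s for s in self.spans if s.seq == seq), None)

    def breadth(self) -> int:
        return len(self.spans)

    def depth(self) -> int:
        return 1 + max((c.depth() for c in self.children), default=0)


G = BioSeq("G", 10_000)  # genome length is a made-up value
T = BioSeq("T", 1_200)

# (genome start, genome end, transcript start, transcript end) per exon
exon_coords = [(1000, 1200, 0, 200), (3000, 3500, 200, 700), (4500, 5000, 700, 1200)]
exons = [
    SeqSymmetry(f"E{i}", [SeqSpan(gs, ge, G), SeqSpan(ts, te, T)])
    for i, (gs, ge, ts, te) in enumerate(exon_coords, start=1)
]
t1 = SeqSymmetry("T1", [SeqSpan(1000, 5000, G), SeqSpan(0, 1200, T)], exons)
```

In this sketch the sequences hold no references to their annotations; all relationships live in the symmetries.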

Expressiveness of Core Core
•  “Standard” annotations
•  Singleton annotations
•  Alternative Splicing
•  Pairwise alignments
•  Annotations with depth > 2
•  Annotations with breadth > 2
•  Indels
•  Structure of analyzed sequence
•  Fuzzy locations
•  All without explicit pointers from BioSeq to annotation

Genometry Modelling of Insertions and Deletions #1a
Genome G:      …AGGCAATTAATTGATCCAGGTG……GAGTCCGAATAGGGTTAGCG…
Transcript T:  GCAATTCAATTGATCCAG TCCGAATAGGTTAGCG
•  G:1000..2017, T:0..34
–  G:1000..1017, T:0..18: insertion in transcript relative to genome (deletion in genome relative to transcript)
    G:1000..1006, T:0..6
    G:1006..1017, T:7..18
–  G:2000..2017, T:18..34: deletion in transcript relative to genome (insertion in genome relative to transcript)
    G:2000..2010, T:18..28
    G:2011..2017, T:28..34
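
In this representation an indel is not stored explicitly; it shows up as a gap between adjacent child spans on one sequence that has no matching gap on the other. The sketch below assumes the child spans pair up as G:1000..1006 with T:0..6, G:1006..1017 with T:7..18, G:2000..2010 with T:18..28 and G:2011..2017 with T:28..34 (the pairing implied by the aligned residues). The function name `classify_gaps` and the tuple layout are hypothetical.

```python
def classify_gaps(children):
    """Classify gaps between adjacent child span pairs.

    children: list of ((g_start, g_end), (t_start, t_end)) tuples, ordered
    along both sequences. Returns (label, gap_start, gap_end) tuples, with
    the gap given in the coordinates of the sequence that has extra bases.
    """
    events = []
    for (g_prev, t_prev), (g_next, t_next) in zip(children, children[1:]):
        g_gap = g_next[0] - g_prev[1]
        t_gap = t_next[0] - t_prev[1]
        if t_gap > 0 and g_gap == 0:
            # Transcript has bases with no genomic counterpart.
            events.append(("insertion in transcript", t_prev[1], t_next[0]))
        elif g_gap > 0 and t_gap == 0:
            # Genome has bases with no transcript counterpart.
            events.append(("deletion in transcript", g_prev[1], g_next[0]))
    return events


first_block = [((1000, 1006), (0, 6)), ((1006, 1017), (7, 18))]
second_block = [((2000, 2010), (18, 28)), ((2011, 2017), (28, 34))]

print(classify_gaps(first_block))   # gap on the transcript only
print(classify_gaps(second_block))  # gap on the genome only
```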

Genometry Modelling of Insertions and Deletions #1b
Genome G:      g0 g1 g2 g3 g4 g4+1 g5
Transcript T:  t0 t1 t1+1 t2 t3 t4 t5
•  G:g0..g5, T:t0..t5
–  G:g0..g2, T:t0..t2
    G:g0..g1, T:t0..t1
    G:g1..g2, T:t1+1..t2
–  G:g3..g5, T:t3..t5
    G:g3..g4, T:t3..t4
    G:g4+1..g5, T:t4..t5

Genometry Modelling of Insertions and Deletions #2
Genome G:      g0 g1 g2 g3 g4 g4+1 g5
Transcript T:  t0 t1 t1+1 t2 t3 t4 t5
•  G:g0..g5, T:t0..t5
–  G:g0..g2, T:t0..t2
    G:g0..g1, T:t0..t1
    G:g1..g2, T:t1+1..t2
–  G:g3..g5, T:t3..t5
    G:g3..g4, T:t3..t4
    G:g4+1..g5, T:t4..t5
•  Each indel gets its own symmetry, shown in three variants:
–  Indel span plus a span on a one-residue sequence
    T:t1..t1+1, “C”:0..1
    G:g4..g4+1, “G”:0..1
–  Indel span plus a zero-length span on the other sequence
    T:t1..t1+1, G:g1..g1
    G:g4..g4+1, T:t4..t4
–  Both combined
    T:t1..t1+1, G:g1..g1, “C”:0..1
    G:g4..g4+1, T:t4..t4, “G”:0..1

Modelling SNPs with Genometry: Two Approaches
SeqA = reference chromosome
SeqB = exactly same as reference chromosome, except for one SNP
SeqA …GGCAAGGAATGATC…
SeqB …GGCAAGTAATGATC…
(SNP at x..x+1)
•  I. SNPs as annotations of differences between sequences (spans shown: SeqA:x..x+1, SeqB:x..x+1, “T”:0..1)
–  I.a. annotation of just reference seq
–  I.b. annotation of reference seq w/ variant base
–  I.c. annotation of reference and variant seq
•  II. SNPs as gaps in similarity between two sequences
–  SeqA:0..m, SeqB:0..n
    SeqA:0..x, SeqB:0..x
    SeqA:x+1..m, SeqB:x+1..n

Sequence-oriented annotations
•  AnnotatedBioSeq
–  Contains a collection of SeqSymmetries that annotate the
sequence
–  Interfaces to retrieve annotations covered by a span within the
sequence
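
A minimal sketch of the retrieval idea follows. The class name AnnotatedBioSeq is from the slide; the method names, the `(label, start, end)` tuple layout, the genome length and the linear scan are assumptions for illustration only. "Covered by" is read here as "fully contained in the query span"; an overlap test would be an equally plausible reading.

```python
class AnnotatedBioSeq:
    """A sequence that keeps a collection of annotations on itself."""

    def __init__(self, name, length):
        self.name = name
        self.length = length
        self.annotations = []  # (label, start, end) in this sequence's coords

    def add_annotation(self, label, start, end):
        self.annotations.append((label, start, end))

    def annotations_covered_by(self, start, end):
        """Return annotations lying entirely inside start..end."""
        return [a for a in self.annotations if start <= a[1] and a[2] <= end]


genome = AnnotatedBioSeq("G", 10_000)  # length is a made-up value
genome.add_annotation("T1", 1000, 5000)
genome.add_annotation("E1", 1000, 1200)
genome.add_annotation("E2", 3000, 3500)
genome.add_annotation("E3", 4500, 5000)

covered = genome.annotations_covered_by(900, 3600)
print([label for label, _, _ in covered])
```

A real implementation would likely index the annotations rather than scan them, but the slide does not say how.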

Annotation Networks
•  Can traverse networks of annotations, alternating between
AnnotatedBioSeqs and SeqSymmetries
(Diagram: three AnnotatedBioSeqs, GenomicSeq G, mRNASeq M and ProteinSeq P, linked by SeqSymmetries. mRNA2genomic has spans genomicSpanC and mrnaSpanC and children m2gSub0, m2gSub1, m2gSub2, each with spans gSpanCn and mSpanCn. protein2mRNA has spans proteinSpanB and mrnaSpanB. domainOnProtein has span proteinSpanA.)

Sequence Composition
•  CompositeBioSeq
– Contains a SeqSymmetry describing the mapping
of BioSeqs used in composition to the
CompositeBioSeq itself

Sequence Composition Representations
•  Sequence Assembly / Golden Path / etc.
•  Piecewise data loading / lazy data loading
•  Genotypes
•  Chromosomal Rearrangements
•  Primer construction
•  Reverse Complement
•  Coordinate Shifting

Genometry Modelling of Reverse Complement
Sequence B = reverse complement of Sequence A
•  BioSeq A, length: x
•  CompositeBioSeq B, length: x
•  Sym AB (composition of B): A:0..x, B:x..0
SeqA AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA
SeqB TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT
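
The A:0..x to B:x..0 symmetry says that a position p on A corresponds to position x - p on B, with orientation flipped. The sketch below checks this against the two sequences on the slide. The function names are hypothetical, and spans are treated as half-open, between-base coordinates, which is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(residues):
    """Complement each base, then reverse the string."""
    return residues.translate(COMPLEMENT)[::-1]


def map_span_to_revcomp(start, end, length):
    """Map A:start..end through the symmetry A:0..length <-> B:length..0.

    Returns the span on B in ascending order.
    """
    return (length - end, length - start)


seq_a = "AGGCAATTAATTGATCCAGGTGGAGTCCGAATAGGGTTAGCGA"
seq_b = "TCGCTAACCCTATTCGGACTCCACCTGGATCAATTAATTGCCT"
x = len(seq_a)

b_start, b_end = map_span_to_revcomp(0, 4, x)
# The first four bases of A correspond to the last four bases of B.
print(seq_a[0:4], seq_b[b_start:b_end])
```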

MultiSequence Alignments
•  MultiSeqAlignment
–  Alignments sliced “horizontally” -- each “row” in an alignment is a
CompositeBioSeq whose composition maps another BioSeq to the same
coord space as the alignment
•  Can also slice vertically (synteny)

Alignment Representations
•  Can represent same alignment as either MultiSeqAlignment or Synteny
•  Transformation from horizontal slicing (MultiSeqAlignment) to vertical
slicing (Synteny)

Complete Genometry Core Models
•  Mutability
•  Curations

Genometry Manipulations
•  Symmetry Intersection (AND)
•  Symmetry Union (OR)
•  Symmetry Inverse (NOT)
•  Symmetry Mutual Exclusion (XOR)
•  Symmetry Transformation / Mapping

Symmetry Combination Operations
SymA
SymB
XOR(A, B)
AND(A, B)
OR(A, B)
NOT(A)
NOT(B)
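
The combination operations can be sketched on a single sequence by treating each symmetry as a list of half-open `(start, end)` spans. This is only an interval-arithmetic illustration under that assumption; the function names and the example coordinates are made up, and it ignores symmetry hierarchy and breadth.

```python
def union(a, b):
    """OR: merge all spans from both lists into sorted, non-overlapping spans."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersection(a, b):
    """AND: regions present in both lists."""
    result = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                result.append((s, e))
    return sorted(result)


def inverse(a, length):
    """NOT: regions of 0..length not covered by the list."""
    result, pos = [], 0
    for start, end in union(a, []):
        if start > pos:
            result.append((pos, start))
        pos = max(pos, end)
    if pos < length:
        result.append((pos, length))
    return result


def xor(a, b, length):
    """XOR: regions in exactly one list, built as OR minus AND."""
    return intersection(union(a, b), inverse(intersection(a, b), length))


sym_a = [(100, 300), (500, 700)]
sym_b = [(200, 600)]
seq_length = 1000

print(intersection(sym_a, sym_b))
print(union(sym_a, sym_b))
print(inverse(sym_a, seq_length))
print(xor(sym_a, sym_b, seq_length))
```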

Genometry Transformations
•  Every symmetry of breadth > 1 describes a mapping
between different sequences
•  Therefore every symmetry can be used to transform
coordinates of other symmetries from one sequence
to another
•  Because sequence annotations, alignments, and
composition are all based on symmetries, can use
any of them as mappings
•  Discontiguous linear mapping algorithm
•  Results of transformation are also symmetries
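
To make the idea concrete, the sketch below maps a span from transcript coordinates to genome coordinates using the exon spans from the earlier annotation slides as the mapping. It is a simplified stand-in, not the discontiguous linear mapping algorithm named on the slide: it assumes forward orientation, equal-length span pairs, and half-open coordinates, and the function name `transform` is hypothetical.

```python
def transform(span, mapping):
    """Map a (start, end) span on the source sequence onto the destination.

    mapping: list of ((src_start, src_end), (dst_start, dst_end)) pairs,
    one per child of the mapping symmetry. Returns one destination span
    per child that the input span overlaps.
    """
    start, end = span
    pieces = []
    for (src_start, src_end), (dst_start, _dst_end) in mapping:
        lo, hi = max(start, src_start), min(end, src_end)
        if lo < hi:
            offset = dst_start - src_start  # linear shift within this child
            pieces.append((lo + offset, hi + offset))
    return pieces


# Transcript-to-genome mapping built from exons E1..E3.
t_to_g = [
    ((0, 200), (1000, 1200)),
    ((200, 700), (3000, 3500)),
    ((700, 1200), (4500, 5000)),
]

# A feature at T:150..400 straddles the E1/E2 boundary.
print(transform((150, 400), t_to_g))
```

The result is itself a list of spans, one per overlapped exon, which mirrors the point that the result of a transformation is also a symmetry with children.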

Coordinate Mapping
Example – mapping domain from protein coords to genomic coords
(Diagram: AnnotatedBioSeqs GenomicSeq G, mRNASeq M and ProteinSeq P. SeqSymmetries: domainOnProtein with proteinSpanA; protein2mRNA with proteinSpanB and mrnaSpanB; mRNA2genomic with genomicSpanC, mrnaSpanC and children m2gSub0, m2gSub1, m2gSub2. The “growing” domain2genomic result is a MutableSeqSymmetry: it starts with proteinSpanA, gains mrnaSpanA after the transform via protein2mRNA, then gains genomicSpanA and children d2gSub0 and d2gSub1, each with pSpanAn, mSpanAn and gSpanAn, after the transform via mRNA2genomic.)
(Note that the domain mapped to the spliced transcript overlaps only two of the three exons,
hence the resulting domain2genomic symmetry ends up with only two children.)
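
The two-step transform in this example can be sketched by chaining span mappings. Everything numeric here is invented for illustration: a protein of 300 residues whose coding region sits at M:100..1000, and the exon layout from the earlier annotation slides for mRNA2genomic. The function names are hypothetical, and the scaling assumes each mapping pair has an integer length ratio (3 for protein to mRNA, 1 for mRNA to genome) and forward orientation.

```python
from functools import reduce


def map_spans(spans, mapping):
    """Map a list of (start, end) spans through one mapping symmetry.

    mapping: list of ((src_start, src_end), (dst_start, dst_end)) pairs.
    """
    out = []
    for start, end in spans:
        for (ss, se), (ds, de) in mapping:
            lo, hi = max(start, ss), min(end, se)
            if lo < hi:
                scale = (de - ds) // (se - ss)  # e.g. 3 nt per residue
                out.append((ds + (lo - ss) * scale, ds + (hi - ss) * scale))
    return out


def map_through(spans, mappings):
    """Apply a chain of mappings in order."""
    return reduce(map_spans, mappings, spans)


protein2mrna = [((0, 300), (100, 1000))]  # made-up coding region
mrna2genomic = [
    ((0, 200), (1000, 1200)),
    ((200, 700), (3000, 3500)),
    ((700, 1200), (4500, 5000)),
]

domain_on_protein = [(20, 120)]  # made-up domain location
print(map_spans(domain_on_protein, protein2mrna))
print(map_through(domain_on_protein, [protein2mrna, mrna2genomic]))
```

With these numbers the domain lands on only two of the three exons, so the final result has two pieces, matching the note on the slide.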

Details of “split” mapping
(Diagram: the domain2genomic result, starting with proteinSpanA and mrnaSpanA, is transformed via mRNA2genomic and its children m2gSub0, m2gSub1, m2gSub2. Child d2gSub0 is built up span by span with mSpanA0, pSpanA0 and gSpanA0, then d2gSub1 with mSpanA1, pSpanA1 and gSpanA1, and finally the parent gains genomicSpanA.)
•  Step 1 (loop 2) [a, b, c]
–  Step 1a: “sit still”
–  Step 1b: “roll back”
–  Step 1c: “roll forward”
•  Step 2: “roll up”

Transformations Applications
•  Mapping Affy probes to genome
•  Mapping contig annotations to larger genomic assemblies
•  Mapping protein annotations to genome
•  Mapping genomic annotations to proteins and transcripts
(SNPs, for example)
•  Sequence slice-and-dice with annotation propagation
•  Propagation of annotations across versioned sequences (such
as Golden Path)
•  Deep mappings (for example, SNP to genomeA to transcriptB to
proteinC to homolog proteinD to transcriptE to genomeF to
putative SNP location in genomeF – symmetry path of depth 5)
•  Etc., etc.

Prototypes & Applications
•  GenometryTest
•  Generic Genometry Viewer
•  ProtAnnot (Ann)
•  GPView (Cyrus)
•  AlignView (Eric)
•  ContigViewer (Peter, Barry)
•  Unibrow (Transcriptome Group)

Genometry Summary
•  Genometry presents a unified model for
location-based sequence relationships
•  Sequence annotation, composition, and
alignment are all based on SeqSymmetry
•  Provides powerful genometry manipulations --
any SeqSymmetry can be used to map other
SeqSymmetries across sequences /
coordinate spaces
•  Work in progress
