Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
April 5, 2013
UCSD DBMI Seminar
2. Few genes are well annotated…
[Figure: GO annotation counts per gene, with genes sorted by decreasing counts; 20,473 protein-coding genes. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Marked values: 41%, 65%. Data: NCBI, February 2013]
4. … because the literature is sparsely curated?
[Figure: plot spanning 1979 to 2009, y-axis 0 to 20. Labels: “Average capacity of human scientist”; “Number of articles read by typical scientist”]
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into a Short Head and a Long Tail]
Category          Short Head         Long Tail
News              Newspapers         Blogs
Video             TV/Hollywood       YouTube
Product reviews   Consumer reports   Amazon reviews
Food reviews      Food critics       Yelp
Talent judging    Olympics           American Idol
9. Wikipedia has breadth and depth
[Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008]
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
14. Wiki success depends on a positive feedback loop
[Diagram: cycle linking Gene Wiki page utility, number of users, and number of contributors; residual labels: 1001, 2002]
15. 10,000 gene “stubs” within Wikipedia
Stub page components:
• Protein structure
• Symbols and identifiers
• Tissue expression pattern
• Gene Ontology annotations
• Links to structured databases
• Gene summary
• Protein interactions
• Linked references
Huss, PLoS Biol, 2008
16. Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
17. Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure: editor count and edit count (series: Editors, Edits)]
18. A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
19. Making the Gene Wiki more computable
Free text → Structured annotations
20. Filling the gaps in gene annotation
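[Diagram: a wikilink in a Gene Wiki article is matched exactly to a GO term (GO:0006897); the article is mapped by the Gene Wiki mapping to its gene (NCBI Entrez Gene: 334); the pair forms a candidate assertion]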
6319 novel GO annotations
2147 novel DO annotations
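The mining step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name, the toy lookup tables, and the article title are all hypothetical, and only the two identifiers (Entrez Gene 334 and GO:0006897) come from the slide.

```python
def candidate_assertions(article_links, go_label_to_id, article_to_entrez):
    """Return sorted (entrez_gene_id, go_id) pairs for wikilinks whose
    target exactly matches a GO term label (case-insensitive)."""
    # Normalize GO labels once so matching is a dictionary lookup.
    labels = {label.lower(): go_id for label, go_id in go_label_to_id.items()}
    found = set()
    for article, links in article_links.items():
        gene_id = article_to_entrez.get(article)
        if gene_id is None:
            continue  # article is not mapped to a gene; skip it
        for target in links:
            go_id = labels.get(target.strip().lower())
            if go_id is not None:
                found.add((gene_id, go_id))
    return sorted(found)


# Toy inputs (hypothetical, apart from the two identifiers on the slide).
article_links = {"Example gene article": ["Endocytosis", "Cell membrane"]}
go_label_to_id = {"endocytosis": "GO:0006897"}
article_to_entrez = {"Example gene article": 334}

assertions = candidate_assertions(article_links, go_label_to_id, article_to_entrez)
print(assertions)
```

To count only novel annotations, the candidates would presumably then be filtered against the annotations already present in the reference databases; that step is not shown here.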
21. Gene Wiki content improves enrichment analysis
[Diagram: GO term → gene list; concept recognition on PubMed abstracts → linked genes; both feed enrichment analysis]
Example: axon guidance (GO:0007411), 264 genes; 811 articles
Contingency table (rows: gene linked through PubMed; columns: gene in the GO gene list):
        Yes    No
Yes     13     2
No      251    12033
P = 1.55 E-20
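The enrichment P value for a 2x2 table like this one can be computed with a one-sided Fisher's exact test. The sketch below uses only the Python standard library. It assumes that rows are "linked through PubMed" and columns are "in the GO gene list" (13 + 251 = 264 matches the gene count on the slide); the slide does not state which test was used, so treat the choice of test as an assumption as well.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided (enrichment) Fisher's exact test for the 2x2 table
    [[a, b], [c, d]]: probability of a top-left count >= a when all
    row and column totals are held fixed."""
    row1 = a + b            # genes linked through PubMed (assumed row meaning)
    col1 = a + c            # genes in the GO gene list (assumed column meaning)
    total = a + b + c + d
    upper = min(row1, col1)
    # Sum the hypergeometric probabilities of the observed table and
    # every more extreme table.
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)


p_value = fisher_one_sided(13, 2, 251, 12033)
print(f"P = {p_value:.3g}")
```

With the counts from the slide, the result should land close to the reported P = 1.55 E-20, which is consistent with (though not proof of) a one-sided test of this kind.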
22. Gene Wiki content improves enrichment analysis
[Diagram: GO term → gene list; concept recognition on PubMed abstracts + Gene Wiki → linked genes; both feed enrichment analysis]
Example: muscle contraction (GO:0006936), 87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22 E-09
Article counts shown: 251 articles, 87 articles
23. Gene Wiki content improves enrichment analysis
[Figure: plot of p-value (PubMed only) vs. p-value (PubMed + GW); regions labeled “More significant PubMed + GW” and “More significant PubMed only”; muscle contraction highlighted]
24. Making the Gene Wiki more computable
Free text → Structured annotations → Analyses
25. Making the Gene Wiki more computable
Free text → Structured annotations → Databases
26. Making the Gene Wiki more computable
Databases → Linked Data
27. The Long Tail of scientists is a valuable source of information on gene function
39. Utility: A simple and universal plugin interface
Total of > 540 gene-centric online databases registered as BioGPS plugins
40. Users: BioGPS has critical mass
• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month
Top 10 organizations: 1. Harvard; 2. NIH; 3. UCSD; 4. Scripps; 5. MIT; 6. Cambridge; 7. U Penn; 8. Stanford; 9. Wash U; 10. UNC
[Figure: daily pageviews]
41. Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users, spanning 280+ domains
51. Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved the enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2012)
57. No good gene-disease annotation database
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Query: Apolipoprotein E
58. No good gene-disease annotation database
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
Query: Apolipoprotein E
59. No good gene-disease annotation database
Alzheimer's disease (AD)
Lipoprotein glomerulopathy
Sea-blue histiocyte disease
Hyperlipoproteinemia, type III
Macular degeneration, age-related
Myocardial infarction susceptibility
HIV
Psoriasis
Vascular Diseases
Query: Apolipoprotein E
[Five entries marked “?” on the slide]
60. No good gene-disease annotation database
Alzheimer's disease (AD)
Neuropsychological Tests
Cognition Disorders
Dementia
Cognition
Disease Progression
Cardiovascular Diseases
Coronary Disease
Diabetes Mellitus, Type 2
Memory Disorders
Query: Apolipoprotein E
Memory
Coronary Artery Disease
Hypertension
Mental Status Schedule
Psychiatric Status Rating Scales
Hyperlipidemias
Atrophy
Dementia, Vascular
Parkinson Disease
Brain Injuries
Myocardial Infarction
…
477 diseases!
61. Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it's “right”, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
62. Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
Occurrences  Gene    Disease
11           NBPF3   neuroblastoma
11           SOX8    mental retardation
9            ABL1    leukemia
9            SSX1    synovial sarcoma
8            APC     colorectal cancer
8            FES     sarcoma
8            RBP3    retinoblastoma
8            GAST    gastrinoma
8            DCC     colorectal cancer
8            MAP3K5  cancer
[Additional columns on the slide: Gene Wiki, OMIM, PharmGKB, PubMed]
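One plausible way to build an occurrence table like the one on this slide is to tally how often players assert each gene-disease pair. The sketch below assumes a simple log of (gene, disease) guesses; the function name, the `min_count` threshold, and the sample guesses are hypothetical, and the slides do not describe how the game actually aggregates its data.

```python
from collections import Counter


def tally_guesses(guesses, min_count=2):
    """Count how often each (gene, disease) pair was guessed and keep
    pairs seen at least min_count times, most frequent first."""
    counts = Counter((gene, disease) for gene, disease in guesses)
    ranked = [(n, gene, disease)
              for (gene, disease), n in counts.items()
              if n >= min_count]
    # Sort by descending count, then alphabetically for a stable order.
    ranked.sort(key=lambda row: (-row[0], row[1], row[2]))
    return ranked


# Hypothetical guess log; the real game data is not shown in the slides.
guesses = [
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
    ("APC", "melanoma"),
]
table = tally_guesses(guesses)
for n, gene, disease in table:
    print(n, gene, disease)
```

Requiring a minimum count is one simple way to let agreement among independent players filter out isolated wrong guesses.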
63. Using games to predict phenotype from genotype?
http://genegames.org
64. Classification problems in genome biology
[Diagram: training samples labeled cancer or normal → find patterns → classify new samples as cancer or normal]
Methods: SVM, neural networks, naïve Bayes, KNN, …
Data: 100s of samples, 100,000s of features
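As a concrete instance of the workflow on this slide, here is a minimal k-nearest-neighbors (KNN) classifier, one of the methods listed. It is a sketch with made-up numbers: the feature values, the choice of k = 3, and the function name are assumptions for illustration only.

```python
from collections import Counter
from math import dist


def knn_classify(train, labels, sample, k=3):
    """Label `sample` by majority vote among its k nearest training
    samples, using Euclidean distance."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], sample))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy profiles (hypothetical values, 3 features per sample).
train = [
    [5.1, 0.2, 3.3], [4.8, 0.1, 3.0], [5.3, 0.4, 3.1],   # cancer
    [1.0, 2.9, 0.5], [1.2, 3.1, 0.4], [0.9, 3.3, 0.7],   # normal
]
labels = ["cancer"] * 3 + ["normal"] * 3

prediction = knn_classify(train, labels, [5.0, 0.3, 3.2])
print(prediction)
```

With hundreds of samples and hundreds of thousands of features, distance-based methods like this usually depend heavily on which features are kept, which is why feature selection is the hard part of the problem.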
80. Results
• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having a Ph.D.
• Prediction results
– 70% correct on survival concordance index
– Best scoring model was 76%
– Player registrations still increasing!
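The results above are reported as a survival concordance index. A common form of that metric is Harrell's C, sketched below. Whether the study used exactly this variant is an assumption, and the cohort values are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable patient pairs in which the
    patient with the higher predicted risk has the shorter survival."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if patient i had an observed event
            # before patient j's (event or censoring) time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: survival time (months), event observed?, predicted risk.
times = [5, 12, 20, 30]
events = [True, True, False, True]
risks = [0.9, 0.4, 0.6, 0.1]

c_index = concordance_index(times, events, risks)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so scores of 70% and 76% sit between chance and a perfect ranking.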
81. The Long Tail of gamers can collaboratively build an accurate disease classifier.
82. Collaborators:
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizarro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members:
Katie Fisch
Ben Good
Salvatore Loguercio
Max Nanis
Chunlei Wu
Key group alumni:
Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Funding and Support:
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact:
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
83. Doctoral Program in Chemical and Biological Sciences
Office of Graduate Studies
10550 N. Torrey Pines Road
La Jolla, CA 92037
Email: gradprgrm@scripps.edu
Phone: 858.784.8469
http://education.scripps.edu
Editor's notes
We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. No IEA.
If you believe that more than 1.5% of articles are relevant to gene function, then there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource, then describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
Developer resources do not scale with usage. Practical effects: core developers’ time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap. 80% duplication for 20% innovation.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases
Empire State Building
Question: how to interject biological knowledge in the feature selection process?