The document describes a study that developed an online game called "The Cure" to capture knowledge from over 1,000 players regarding genes that could be used to predict breast cancer survival. Gene sets assembled from the game data showed significant enrichment for cancer-related genes and provided prediction accuracy comparable to other methods. The game successfully tapped into the collective knowledge and reasoning of many players to identify predictive gene signatures.
The Cure: Making a game of gene selection for breast cancer survival prediction
Background: Molecular signatures for predicting breast cancer prognosis
could greatly improve care through personalization of treatment.
Computational analyses of genome-wide expression datasets have
identified such signatures, but these signatures leave much to be desired
in terms of accuracy, reproducibility and biological interpretability.
Methods that take advantage of structured prior knowledge (e.g. protein
interaction networks) show promise in helping to define better
signatures but most knowledge remains unstructured. Crowdsourcing via
scientific discovery games is an emerging methodology that has the
potential to tap into human intelligence at scales and in modes
previously unheard of.
Objective: The main objective of this study was to test the hypothesis
that knowledge linking expression patterns of specific genes to breast
cancer outcomes could be captured from players of an open, Web-based
game. We envisioned capturing knowledge both from players’ prior
experience and from their ability to interpret text related to candidate
genes presented to them in the context of the game.
Methods: We developed and evaluated an online game called “The
Cure” that captured information from players regarding genes for use in
predictors of breast cancer survival. Information gathered from game
play was aggregated using a voting approach and used to create rankings
of genes. The top genes from these rankings were evaluated using
annotation enrichment analysis, comparison to prior predictor gene
sets, and by using them to train and test machine learning systems for
predicting 10-year survival.
Results: Between its launch in Sept. 2012 and Sept. 2013, The Cure
attracted more than 1,000 registered players who collectively played
nearly 10,000 games. Gene sets assembled through aggregation of the
collected data showed significant enrichment for genes known to be
related to key concepts such as Cancer, Disease Progression, and
Recurrence (P < 1.1e-07). In terms of the accuracy of models trained
using them, these gene sets provided comparable performance to gene
sets generated using other methods including those used in commercial
tests. The Cure is available at http://genegames.org/cure/
ABSTRACT
Benjamin M. Good (1), Karthik Gangavarapu (1), Salvatore Loguercio (1), Obi L. Griffith (2), Max Nanis (1), Chunlei Wu (1), Andrew I. Su (1)
(1) The Scripps Research Institute, (2) Washington University School of Medicine
Molecular survival prediction
How Gene Wiki?
REFERENCES
CONTACT
Benjamin Good: bgood@scripps.edu @bgood
Andrew Su: asu@scripps.edu @andrewsu
Cure2.0: Interactive, Collaborative, Genomic Decision Tree Construction, now live!
FUNDING
ACKNOWLEDGEMENTS
Thanks to all of the players of The Cure!
Crowdsourcing via scientific discovery games
We acknowledge support from the National Institute of
General Medical Sciences (GM089820 and GM083924).
The Cure game. Players alternate turns taking a gene card from the
board and adding it to their hand. The tabbed display provides gene
annotations (‘ontology’, ‘Rifs’) and views of decision trees
constructed by the system using the selected genes. There are one
hundred boards to choose from in a given round of the game (four
rounds were completed).
[Figure: find patterns, then make predictions on new samples (survival < 10 years vs. > 10 years)]
• With tens of thousands of measurements but only
hundreds of samples, many possible patterns are found.
• But which ones are real?
• Which genes should we use to build predictors?
Online games are successfully tapping into the knowledge
and reasoning abilities of thousands of people [4].
Devise protein folding algorithms
Design RNA molecules
The purpose
Prior knowledge encoded in protein-protein interaction
databases [1,2] and pathway databases [3] has been used to
improve prediction.
What about knowledge that is not recorded in structured databases?
1. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology.
2. Winter et al. (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology.
3. Liu et al. (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics.
4. Good and Su (2011) Games with a Scientific Purpose. Genome Biology.
5. Wang et al. (2013) WEB-based GEne SeT AnaLysis Toolkit (WebGestalt): update 2013. Nucleic Acids Research.
• Goal: pick the best set of genes.
• Best: the gene set that produces the best decision tree classifier.
• Classifier: created using training data and selected genes, used to predict 10-year survival.
• Score: accuracy of the tree inferred using the selected genes
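The scoring rule above can be illustrated with a small sketch. It is not the game's actual implementation: the poster does not say which tree learner or evaluation split was used, so the Gini-based splitter, the depth limit, the held-out test set, and the gene names `GENE_A` and `GENE_B` are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label (ties resolved by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, genes, depth=0, max_depth=3):
    """Grow a small decision tree that may only split on the genes in the hand.

    rows: list of dicts mapping gene name -> expression value.
    Returns a label (leaf) or a tuple (gene, threshold, left, right).
    """
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for g in genes:
        values = sorted({r[g] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[g] <= t]
            right = [y for r, y in zip(rows, labels) if r[g] > t]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[0]:
                best = (impurity, g, t)
    if best is None:  # no gene separates the samples any further
        return majority(labels)
    _, g, t = best
    li = [i for i, r in enumerate(rows) if r[g] <= t]
    ri = [i for i, r in enumerate(rows) if r[g] > t]
    return (
        g, t,
        build_tree([rows[i] for i in li], [labels[i] for i in li], genes, depth + 1, max_depth),
        build_tree([rows[i] for i in ri], [labels[i] for i in ri], genes, depth + 1, max_depth),
    )

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        g, t, left, right = tree
        tree = left if row[g] <= t else right
    return tree

def score_hand(hand, train_rows, train_labels, test_rows, test_labels):
    """Score a hand of genes as the accuracy of the tree built from it."""
    tree = build_tree(train_rows, train_labels, hand)
    hits = sum(predict(tree, r) == y for r, y in zip(test_rows, test_labels))
    return hits / len(test_labels)

# Toy data with hypothetical genes: GENE_A tracks survival, GENE_B does not.
train_rows = [
    {"GENE_A": 0.1, "GENE_B": 0.5},
    {"GENE_A": 0.2, "GENE_B": 0.9},
    {"GENE_A": 0.8, "GENE_B": 0.4},
    {"GENE_A": 0.9, "GENE_B": 0.8},
]
train_labels = ["<10y", "<10y", ">10y", ">10y"]
test_rows = [{"GENE_A": 0.15, "GENE_B": 0.7}, {"GENE_A": 0.85, "GENE_B": 0.6}]
test_labels = ["<10y", ">10y"]

print(score_hand(["GENE_A"], train_rows, train_labels, test_rows, test_labels))
```

In this sketch a hand containing an informative gene earns a higher score than one containing only uninformative genes, which is the incentive the rules describe.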
The Cure is a game designed to focus the collective intelligence of a
diverse community on the challenge of selecting genes for building
prognostic classifiers.
The rules
The game
Results – recruitment and engagement
• One year, 1077 players, 9904 games played
Key result: Genes selected in high frequencies by
the player community performed comparably to
genes selected using statistical approaches and to
genes used in commercial tests when used to train
machine learning models for survival prediction
Results – knowledge captured
Workflow for Synthesizing Knowledge Regarding Gene Selection
1. Select a set of played games based on player information such as education.
2. Measure the frequency with which each gene was selected by these players
across many different games and boards. Each time a gene is added to a hand
a ‘vote’ is recorded for that gene.
3. Measure the likelihood of observing the number of votes a gene has received
by chance and calculate a P value for that gene.
4. Rank genes by P value and select those with P <= 0.001.
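The four workflow steps can be sketched as follows. The poster does not specify the null model behind the P value, so this sketch assumes a one-sided binomial test: each time a gene is offered to a player it is picked by chance with probability `p_chance`. The record layout, the `offers` table, the value of `p_chance`, and the gene names are hypothetical.

```python
from collections import Counter
from math import comb

def select_games(games, keep):
    """Step 1: keep only games whose record satisfies the predicate."""
    return [g for g in games if keep(g)]

def count_votes(games):
    """Step 2: one vote per gene each time it is added to a hand."""
    votes = Counter()
    for game in games:
        votes.update(game["hand"])
    return votes

def binomial_tail(k, n, p):
    """Step 3: P(X >= k) for X ~ Binomial(n, p), the assumed chance model."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def rank_genes(games, offers, p_chance, alpha=0.001):
    """Step 4: rank genes by P value and keep those with P <= alpha.

    offers maps each gene to the number of times it was available to pick.
    """
    votes = count_votes(games)
    scored = sorted((binomial_tail(votes[g], n, p_chance), g) for g, n in offers.items())
    return [(g, p) for p, g in scored if p <= alpha]

# Toy example: GENE_X is picked in all 20 games, GENE_Y in only 4 of them.
games = [
    {"player": {"expert": i % 2 == 0},
     "hand": ["GENE_X"] + (["GENE_Y"] if i < 4 else [])}
    for i in range(20)
]
offers = {"GENE_X": 20, "GENE_Y": 20}

selected = rank_genes(games, offers, p_chance=0.2)
print([gene for gene, _ in selected])
```

Passing the output of `select_games` to `rank_genes` yields population-specific gene sets, for example one built only from games played by experts.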
Three gene sets were extracted: one from all games, one from
games played by experts, and one from games played by novices
Overlap of ‘expert’ player selected gene
set with known predictor gene sets
Disease terms associated with the 61 genes
preferentially selected by all players
(WebGestalt [5], adj. P < 10^-5)
Overlap between genes selected
by different player populations
61 genes preferentially
selected by all players,
P <= 0.001
Changes in Cure 2.0
1. Adapted for advanced players / scientists.
2. Players choose from all genes in the dataset.
3. Clinical features supported.
4. Players control the structure of trees.
5. Scoring based on accuracy, complexity and
novelty of trees.
6. Collaborative: players can build from other
players' trees.
7. Trees can also be kept private.
http://genegames.org/cure/
Try it Now!