3. CHALLENGE
Need to build biological class predictors that:
1. Have high accuracy
2. Use relatively few variables
To do this we have to use datasets that:
1. Are very noisy
2. Contain enormous numbers of variables
4. EXAMPLE: BREAST CANCER PROGNOSIS
van 't Veer et al. (2002) Nature
98 breast cancer samples:
34 developed metastases within 5 years, 44 did not
18 had BRCA1 mutations, 2 had BRCA2 mutations
expression levels of 25,000 genes measured
5,000 genes “significantly regulated across the sample groups”
6. METASTASIS PREDICTOR
231 genes were found to be significantly associated with disease outcome
Using leave-one-out cross-validation they empirically selected the 70 best
individual genes to build their predictor
Of the 78 samples in the training set
the predictor correctly classified 65 (83%)
->MammaPrint test from Agendia
Still in clinical trials (MINDACT) 10 years after the original study
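A minimal sketch of leave-one-out cross-validation for a gene-based predictor. The toy data, the difference-of-means gene ranking, and the nearest-centroid classifier are all assumptions made for illustration; they are not the published procedure. Gene selection is repeated inside each fold here so that the held-out sample never influences which genes are chosen.

```python
import random

def rank_genes(X, y, k):
    """Return indices of the k genes with the largest absolute difference in class means."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, label in zip(X, y) if label == 1]
        neg = [row[g] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:k]

def centroid(rows, genes):
    """Mean expression of each selected gene over the given rows."""
    return [sum(r[g] for r in rows) / len(rows) for g in genes]

def predict(train_X, train_y, sample, genes):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    c1 = centroid([r for r, l in zip(train_X, train_y) if l == 1], genes)
    c0 = centroid([r for r, l in zip(train_X, train_y) if l == 0], genes)
    d1 = sum((sample[g] - c) ** 2 for g, c in zip(genes, c1))
    d0 = sum((sample[g] - c) ** 2 for g, c in zip(genes, c0))
    return 1 if d1 < d0 else 0

def loocv_accuracy(X, y, k):
    """Hold out each sample once; select genes and fit on the remaining samples."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        genes = rank_genes(train_X, train_y, k)  # selection happens inside the fold
        correct += predict(train_X, train_y, X[i], genes) == y[i]
    return correct / len(X)

# Toy data (hypothetical): 20 samples, 50 genes, only the first 3 genes carry signal.
rng = random.Random(0)
y = [1] * 10 + [0] * 10
X = [[rng.gauss(2.0 * label if g < 3 else 0.0, 1.0) for g in range(50)] for label in y]

acc = loocv_accuracy(X, y, k=3)
print(f"LOOCV accuracy on toy data: {acc:.2f}")
print(f"Slide arithmetic: 65/78 = {65 / 78:.0%}")
```

The last line only reproduces the arithmetic on the slide: 65 correct out of 78 rounds to 83%.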
7. WE CAN DO BETTER
This signature does not take advantage of:
• interactions between genes
(together two variables may be much more predictive than either one alone)
• biological knowledge
(this signature leaves out several known cancer predictors and does not make
use of biological knowledge in any way)
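A small illustration of the interaction point, using made-up binary data: the outcome is 1 only when exactly one of two hypothetical genes is "high". Neither gene alone does better than chance, but the pair classifies every sample.

```python
# Toy samples (hypothetical): (gene_a, gene_b, outcome), outcome = gene_a XOR gene_b
samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 5

def best_single_gene_accuracy(data, idx):
    """Best accuracy of any rule that maps one gene's value (0 or 1) to a class."""
    best = 0.0
    for rule in [(0, 0), (0, 1), (1, 0), (1, 1)]:  # rule[value] = predicted class
        acc = sum(rule[s[idx]] == s[2] for s in data) / len(data)
        best = max(best, acc)
    return best

def pair_accuracy(data):
    """Accuracy of a rule that looks at both genes together."""
    return sum((a ^ b) == out for a, b, out in data) / len(data)

acc_a = best_single_gene_accuracy(samples, 0)
acc_b = best_single_gene_accuracy(samples, 1)
acc_pair = pair_accuracy(samples)
print(acc_a, acc_b, acc_pair)  # 0.5 0.5 1.0
```

A signature built from individually ranked genes would discard both of these genes, which is the limitation the slide points at.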
10. WE CAN DO BETTER BY INTEGRATING MACHINE LEARNING WITH BIOLOGICAL EXPERTISE
The standard signature does not take advantage of:
• interactions between genes
machine learning algorithms can find and use these but can have
problems when faced with large feature spaces
• biological knowledge
can be used to guide the machine learning process towards
meaningful features in the data and thus reduce chances of overfitting
11. MACHINE LEARNING ALGORITHM OF THE MOMENT: RANDOM FOREST
• In each of many iterations, a small subset
of features is chosen randomly and used
to build one decision tree
• Decision trees are stored and
classifications are made based on the
majority vote of all of the trees.
• Good classifier!
• But you get different forests every time
you run it and it faces the same challenges
of generalizability as any other learning
algorithm.
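The steps above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the tree depth limit, the Gini split criterion, the bootstrap resampling of samples, and the toy data are assumptions that the slide does not specify.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, features, depth=0, max_depth=3):
    """Grow a small tree that may only split on the given feature indices."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority  # leaf: majority class
    best = None
    for f in features:
        for t in sorted(set(row[f] for row in X))[1:]:  # candidate thresholds
            left = [l for row, l in zip(X, y) if row[f] < t]
            right = [l for row, l in zip(X, y) if row[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # every allowed feature is constant
        return majority
    _, f, t = best
    lo = [(row, l) for row, l in zip(X, y) if row[f] < t]
    hi = [(row, l) for row, l in zip(X, y) if row[f] >= t]
    return (f, t,
            build_tree([r for r, _ in lo], [l for _, l in lo], features, depth + 1, max_depth),
            build_tree([r for r, _ in hi], [l for _, l in hi], features, depth + 1, max_depth))

def tree_predict(tree, row):
    """Follow splits until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] < t else hi
    return tree

def build_forest(X, y, n_trees, n_features, rng):
    forest = []
    for _ in range(n_trees):
        features = rng.sample(range(len(X[0])), n_features)  # random feature subset
        idx = [rng.randrange(len(X)) for _ in X]              # bootstrap sample (assumption)
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], features))
    return forest

def forest_predict(forest, row):
    """Majority vote over all stored trees."""
    return Counter(tree_predict(t, row) for t in forest).most_common(1)[0][0]

# Toy data (hypothetical): 40 samples, 10 features, the first 2 carry signal.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(1.5 * label if g < 2 else 0.0, 1.0) for g in range(10)] for label in y]

forest = build_forest(X, y, n_trees=25, n_features=3, rng=rng)
predictions = [forest_predict(forest, row) for row in X]
train_acc = sum(p == l for p, l in zip(predictions, y)) / len(y)
print(f"training accuracy: {train_acc:.2f}")
```

Changing the seed passed to `random.Random` changes which features each tree sees, which is why a different forest comes out of every run.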
12. NETWORK GUIDED FOREST (NGF)
Same algorithm except each tree is
constructed from a particular area
of a relevant protein-protein
interaction network.
1) Pick a gene randomly
2) Walk out along the network to
get the other N genes to use to
build that tree
3) repeat
The premise is that biologically
coherent modules will give better
signal than individual genes
randomly grouped together
Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
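A sketch of the feature-selection step only, which is where NGF differs from a plain random forest. The network below is a made-up toy graph with placeholder gene names, and the breadth-first walk is an assumption about what "walk out along the network" means; the cited paper may grow modules differently.

```python
import random
from collections import deque

def walk_features(network, n_genes, rng):
    """Pick a random seed gene, then collect neighbours breadth-first until n_genes are gathered."""
    seed = rng.choice(sorted(network))
    chosen = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(chosen) < n_genes:
        gene = queue.popleft()
        neighbours = sorted(network[gene])
        rng.shuffle(neighbours)  # visit neighbours in random order
        for nb in neighbours:
            if nb not in seen and len(chosen) < n_genes:
                seen.add(nb)
                chosen.append(nb)
                queue.append(nb)
    return chosen

# Hypothetical interaction network: gene -> set of interacting genes.
network = {
    "A1": {"A2", "A3"},
    "A2": {"A1", "A3"},
    "A3": {"A1", "A2", "B1"},
    "B1": {"A3", "B2"},
    "B2": {"B1", "B3"},
    "B3": {"B2"},
}

rng = random.Random(0)
feature_sets = [walk_features(network, 3, rng) for _ in range(5)]
for genes in feature_sets:
    print(genes)  # each set would be used to build one tree
```

Every gene after the seed is a network neighbour of a gene already in the set, so each tree is built from a connected module rather than from genes grouped at random.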
13. NGF RESULTS
A) Performance identical to random forest and
to NGF on randomly permuted networks, as assessed by
5-fold cross-validation repeated 100 times.
B) More known breast cancer genes show up in the
forest
C) Similar genes selected for forests in two different
training sets (different patient cohorts)
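For readers unfamiliar with the evaluation scheme in (A), here is a minimal sketch of 5-fold cross-validation repeated 100 times. The `evaluate` callback, the toy labels, and the majority-class baseline are placeholders invented for this example; a real run would train and score NGF or RF inside `evaluate`.

```python
import random
import statistics

def k_fold_indices(n, k, rng):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv(n_samples, k, repeats, evaluate, seed=0):
    """Return mean and standard deviation of the per-repeat average score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        fold_scores = []
        for test in k_fold_indices(n_samples, k, rng):
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            fold_scores.append(evaluate(train, test))
        scores.append(statistics.mean(fold_scores))
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder model: always predict the majority class of the training fold.
labels = [1] * 35 + [0] * 15

def majority_baseline(train, test):
    ones = sum(labels[i] for i in train)
    guess = 1 if ones * 2 >= len(train) else 0
    return sum(labels[i] == guess for i in test) / len(test)

mean_acc, sd_acc = repeated_cv(len(labels), k=5, repeats=100, evaluate=majority_baseline)
print(f"mean accuracy {mean_acc:.2f}, sd over repeats {sd_acc:.3f}")
```

Reporting the standard deviation across repeats is what produces error bars of the kind described in the editor's notes for this slide.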
14. HUMAN GUIDED RANDOM FOREST (HGF)
Same algorithm again, except each
tree is constructed from a
manually selected subset of genes
(or other features).
1) Find a person
2) Let them select what they
think is an optimal feature set
3) back to step one, N times
4) aggregate
The premise is that biological
knowledge can produce better
than random decision modules
and that not all biological
knowledge is captured in
interaction networks
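A sketch of the aggregation step, assuming each person's selection arrives as a list of feature names. To keep it short, each "tree" is a depth-1 decision stump, which is a simplification chosen for this example. The gene names, expression values, and selected sets are invented.

```python
from collections import Counter

def fit_stump(X, y, features):
    """Depth-1 tree: the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted(set(row[f] for row in X))[1:]:
            left = [l for row, l in zip(X, y) if row[f] < t]
            right = [l for row, l in zip(X, y) if row[f] >= t]
            lo = Counter(left).most_common(1)[0][0]
            hi = Counter(right).most_common(1)[0][0]
            errors = sum(l != lo for l in left) + sum(l != hi for l in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, lo, hi)
    if best is None:
        raise ValueError("selected features are constant across samples")
    return best[1:]  # (feature, threshold, class_below, class_at_or_above)

def predict_stump(stump, row):
    f, t, lo, hi = stump
    return lo if row[f] < t else hi

def human_guided_forest(X, y, hands):
    """One stump per manually selected feature set."""
    return [fit_stump(X, y, hand) for hand in hands]

def vote(forest, row):
    """Majority vote over all stumps."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]

# Hypothetical expression data: one dict per sample.
X = [
    {"gene_a": 0.1, "gene_b": 1.2, "gene_c": 0.5},
    {"gene_a": 0.3, "gene_b": 0.4, "gene_c": 0.6},
    {"gene_a": 0.2, "gene_b": 0.9, "gene_c": 0.1},
    {"gene_a": 1.1, "gene_b": 1.0, "gene_c": 0.9},
    {"gene_a": 1.4, "gene_b": 0.3, "gene_c": 0.2},
    {"gene_a": 0.9, "gene_b": 0.8, "gene_c": 0.7},
]
y = [0, 0, 0, 1, 1, 1]

# Feature sets as three hypothetical people might select them.
hands = [["gene_a", "gene_b"], ["gene_a", "gene_c"], ["gene_b", "gene_c"]]

forest = human_guided_forest(X, y, hands)
train_acc = sum(vote(forest, row) == label for row, label in zip(X, y)) / len(y)
print(forest[0], train_acc)
```

Two of the three selections include the informative `gene_a`, so the vote is correct on this toy data even though the third selection is poor.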
15. HGF CHALLENGES
1) Find a person
2) Let them select what they think is an optimal feature set
3) back to step one, N times
• N may be large (e.g. 1,000)
• Need many knowledgeable people to work hard... for free
17. COMBO! COMPONENTS
Java server
• hosts training data
• executes decision tree algorithm
• performs cross-validation tests
• logs data generated by game
Server provides features for games (e.g. gene cards).
Client sends groups of features (“hands”) to the server for scoring.
Game interface(s)
• manages game events: users, moves, etc.
• two implementations so far: server side (JSP) card game; client side (JavaScript) table game
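A rough in-memory sketch of the server's role: deal feature cards, score a submitted hand by cross-validation, and log the result. It is written in Python for brevity, although the actual server is Java. The class and method names are hypothetical, the data are invented, and a leave-one-out 1-nearest-neighbour score stands in for the decision tree algorithm named on the slide.

```python
import random

class GameServer:
    """Hypothetical stand-in for the scoring server."""

    def __init__(self, features, labels):
        self.features = features  # feature name -> list of values, one per sample
        self.labels = labels
        self.log = []             # (player, hand, score) records

    def deal(self, n_cards, rng):
        """Offer a random set of feature cards to a player."""
        return rng.sample(sorted(self.features), n_cards)

    def score_hand(self, player, hand):
        """Leave-one-out accuracy of a 1-nearest-neighbour rule using only the hand's features."""
        n = len(self.labels)
        correct = 0
        for i in range(n):
            nearest = min(
                (j for j in range(n) if j != i),
                key=lambda j: sum((self.features[g][i] - self.features[g][j]) ** 2 for g in hand),
            )
            correct += self.labels[nearest] == self.labels[i]
        score = correct / n
        self.log.append((player, tuple(hand), score))
        return score

# Hypothetical training data hosted by the server.
features = {
    "gene_a": [0.1, 0.3, 0.2, 1.1, 1.4, 0.9],
    "gene_b": [1.2, 0.4, 0.9, 1.0, 0.3, 0.8],
    "gene_c": [0.5, 0.6, 0.1, 0.9, 0.2, 0.7],
}
labels = [0, 0, 0, 1, 1, 1]

server = GameServer(features, labels)
cards = server.deal(2, random.Random(0))
score_a = server.score_hand("player1", ["gene_a"])
score_b = server.score_hand("player2", ["gene_b"])
print(cards, score_a, score_b)
```

The log of scored hands is the raw material for building a forest from the better-scoring hands.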
18. COMBO CODE AND DEMO
Next steps
1. Better preprocessing of training data
• map contigs to genes where possible, filter out clearly useless genes
• identify individually predictive genes
2. Game
• build domain-specific boards, let players pick their knowledge area
• Real two player feeling with robot partner
• Special cards: robber card, any-gene selector card
• High scores
• ??????????????
Editor's notes
“at least a twofold difference and a P-value of less than 0.01 in more than five tumours”
a, Two-dimensional presentation of transcript ratios for 98 breast tumours. There were 4,968 significant genes across the group. Each row represents a tumour and each column a single gene. As shown in the colour bar, red indicates upregulation, green downregulation, black no change, and grey no data available. The yellow line marks the subdivision into two dominant tumour clusters.
b, Selected clinical data for the 98 patients in a: BRCA1 germline mutation carrier (or sporadic patient), ER expression, tumour grade 3 (versus grade 1 and 2), lymphocytic infiltrate, angioinvasion, and metastasis status. White indicates positive, black negative and grey denotes tumours derived from BRCA1 germline carriers who were excluded from the metastasis evaluation. The cluster below the yellow line consists of 36 tumours, of which 34 are ER negative (total 39 ER-negative) and 16 are carriers of the BRCA1 mutation (total 18).
c, Enlarged portion from a containing a group of genes that co-regulate with the ER-α gene (ESR1). Each gene is labelled by its gene name or accession number from GenBank. Contig ESTs ending with RC are reverse-complementary of the named contig EST.
d, Enlarged portion from a containing a group of co-regulated genes that are the molecular reflection of extensive lymphocytic infiltrate, and comprise a set of genes expressed in T and B cells. (Gene annotation as in c.)
(A) Average area under the ROC curve for NGF, RF, NGF applied to permuted networks (NGF**), and Naïve Bayes, compared to reported scores for representative previous methods (error bars denote standard deviation estimated over 100 runs).
(B) General cancer and breast cancer associated genes identified among the 100 top-scoring genes or 100 most abundant genes in the forest created using RF or NGF, using the real network or networks with permuted edges (average over 100 permutations is shown).
(C) Genes ranked by their importance for classification in two independent breast cancer patient cohorts (y vs. x axis). Network-Guided Forest, blue points; regular Random Forest, green points.