Human Guided Forests (HGF)

HUMAN GUIDED FORESTS
Su lab meeting March 30, 2012
Benjamin Good

AGENDA

1. Motivation
2. Idea
3. Game

CHALLENGE

Need to build biological class predictors that:
1. Have high accuracy
2. Use relatively few variables

To do this we have to use datasets that:
1. Are very noisy
2. Contain enormous numbers of variables

EXAMPLE: BREAST CANCER PROGNOSIS

van 't Veer et al. 2002, Nature
98 breast cancer samples:
34 developed metastases within 5 years, 44 did not
18 had BRCA1 mutations, 2 had BRCA2 mutations
expression levels of 25,000 genes measured
5,000 genes “significantly regulated across the sample groups”

[Figure: expression of 5,000 genes across 98 tumors. Labeled gene clusters: genes that co-regulate with ER; co-regulated genes indicating lymphocytic infiltrate. Tumor groups labeled “70% bad” and “30% bad”.]

METASTASIS PREDICTOR

231 genes were found to be significantly associated with disease outcome.
Using leave-one-out cross-validation, they empirically selected the 70 best individual genes to build their predictor.
Of the 78 samples in the training set, the predictor correctly classified 65 (83%).

-> MammaPrint test from Agendia
Still in clinical trials (MINDACT), 10 years after the original study
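
To make the leave-one-out idea concrete, here is a minimal sketch. It is not the study's pipeline: the nearest-centroid classifier, the two-gene toy values, and the function names are all assumptions chosen for brevity. Each sample is held out once, the classifier is trained on the rest, and the fraction of correct calls is reported.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class mean is closest to the sample."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((s - c) ** 2 for s, c in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, count correct calls."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy data (hypothetical): two genes, two well-separated outcome groups.
xs = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [0.9, 1.0], [1.1, 0.8], [1.0, 1.2]]
ys = ["good", "good", "good", "poor", "poor", "poor"]

loo_accuracy = leave_one_out_accuracy(xs, ys)
print(loo_accuracy)

# The slide's figure: 65 of 78 training samples classified correctly.
print(round(65 / 78, 2))  # 0.83
```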

WE CAN DO BETTER

This signature does not take advantage of:
• interactions between genes
(together, two variables may be much more predictive than either one alone)
• biological knowledge
(this signature leaves out several known cancer predictors and does not use biological knowledge in any way)
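
A toy illustration of the interaction point, using made-up binarized expression values rather than data from the study: in an XOR-like pattern, each gene alone predicts the outcome no better than chance, while the pair predicts it perfectly. The function names are hypothetical.

```python
from itertools import product

# Hypothetical binarized expression for two genes (1 = high, 0 = low).
# The outcome label is 1 only when exactly one of the two genes is high.
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)] * 5


def best_single_gene_accuracy(rows, gene_index):
    """Best accuracy from one gene, allowing either direction of the rule."""
    hits = sum(1 for row in rows if row[gene_index] == row[2])
    return max(hits, len(rows) - hits) / len(rows)


def pair_accuracy(rows):
    """Accuracy of a rule that uses both genes together (a XOR b)."""
    return sum(1 for a, b, label in rows if (a ^ b) == label) / len(rows)


print(best_single_gene_accuracy(samples, 0))  # 0.5
print(best_single_gene_accuracy(samples, 1))  # 0.5
print(pair_accuracy(samples))                 # 1.0
```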

THERE ARE MANY MANY CHALLENGES LIKE THIS IN BIOLOGY

WE CAN DO BETTER BY INTEGRATING MACHINE LEARNING WITH BIOLOGICAL EXPERTISE

The standard signature does not take advantage of:
• interactions between genes
(machine learning algorithms can find and use these, but can struggle when faced with large feature spaces)
• biological knowledge
(can be used to guide the machine learning process towards meaningful features in the data and thus reduce the chance of overfitting)

MACHINE LEARNING ALGORITHM OF THE MOMENT: RANDOM FOREST

• In each of many iterations, a small subset of features is chosen randomly and used to build one decision tree.

• Decision trees are stored and classifications are made based on the majority vote of all of the trees.

• Good classifier!

• But you get a different forest every time you run it, and it faces the same challenges of generalizability as any other learning algorithm.
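
The bullets above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not a reference implementation: features are sampled once per tree, as the slide describes, with no bootstrap resampling of the samples; trees are limited to depth 2; and the data and names (`build_tree`, `build_forest`) are made up.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def split_errors(labels, indices):
    """Training errors if this side of a split predicts its majority label."""
    counts = Counter(labels[i] for i in indices)
    return len(indices) - counts.most_common(1)[0][1]


def build_tree(rows, labels, features, depth=2):
    """Grow a small tree that may only split on the given feature indices."""
    if depth == 0 or len(set(labels)) == 1:
        return majority(labels)
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= threshold]
            right = [i for i, r in enumerate(rows) if r[f] > threshold]
            if not left or not right:
                continue
            errors = split_errors(labels, left) + split_errors(labels, right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left, right)
    if best is None:
        return majority(labels)
    _, f, threshold, left, right = best
    return (f, threshold,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       features, depth - 1),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       features, depth - 1))


def tree_predict(tree, row):
    # Internal nodes are tuples; leaves are plain labels.
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def build_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """One tree per iteration, each built from a random feature subset."""
    rng = random.Random(seed)
    all_features = range(len(rows[0]))
    return [build_tree(rows, labels, rng.sample(all_features, n_features))
            for _ in range(n_trees)]


def forest_predict(forest, row):
    """Classify by majority vote over all stored trees."""
    return majority([tree_predict(tree, row) for tree in forest])


# Toy data (hypothetical): features 0 and 1 are informative, feature 2 is noise.
rows = [[0.1, 0.2, 0.7], [0.2, 0.1, 0.3], [0.3, 0.3, 0.9],
        [0.8, 0.9, 0.4], [0.9, 0.7, 0.8], [0.7, 0.8, 0.2]]
labels = ["good", "good", "good", "poor", "poor", "poor"]

forest = build_forest(rows, labels)
print(forest_predict(forest, [0.15, 0.25, 0.5]))  # good
print(forest_predict(forest, [0.85, 0.75, 0.5]))  # poor
```

Changing `seed` changes which feature subsets are drawn, which is the "different forest every time" behavior noted above.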

NETWORK GUIDED FOREST (NGF)

Same algorithm, except each tree is constructed from a particular area of a relevant protein-protein interaction network.

1) Pick a gene randomly
2) Walk out along the network to get the other N genes to use to build that tree
3) Repeat

The premise is that biologically coherent modules will give better signal than individual genes randomly grouped together.

Dutkowski & Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology
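
The per-tree gene selection in steps 1 and 2 might look like the sketch below. The slide only says "walk out along the network", so the breadth-first walk used here is an assumption, as are the function name `network_feature_set` and the placeholder gene names and edges in the toy network.

```python
import random
from collections import deque


def network_feature_set(network, n_genes, rng):
    """Pick a random seed gene, then walk outward breadth-first
    until n_genes have been collected (or the component is exhausted)."""
    seed_gene = rng.choice(sorted(network))
    selected = [seed_gene]
    seen = {seed_gene}
    queue = deque([seed_gene])
    while queue and len(selected) < n_genes:
        gene = queue.popleft()
        for neighbor in sorted(network.get(gene, ())):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            selected.append(neighbor)
            queue.append(neighbor)
            if len(selected) == n_genes:
                break
    return selected


# Hypothetical interaction network: placeholder gene -> set of neighbors.
toy_network = {
    "geneA": {"geneB", "geneC"},
    "geneB": {"geneA", "geneD"},
    "geneC": {"geneA", "geneE"},
    "geneD": {"geneB", "geneF"},
    "geneE": {"geneC"},
    "geneF": {"geneD"},
}

rng = random.Random(1)
# Step 3 ("repeat"): one gene set per tree in the forest.
gene_sets = [network_feature_set(toy_network, 3, rng) for _ in range(4)]
for genes in gene_sets:
    print(genes)
```

Each returned gene set would then be handed to the tree builder in place of a randomly drawn feature subset.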

NGF RESULTS

A) Identical performance to random forest and random network guided forest, as assessed by 5-fold cross-validation repeated 100 times.

B) More known breast cancer genes show up in the forest.

C) Similar genes are selected for forests in two different training sets (different patient cohorts).
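
For reference, the evaluation protocol in (A), 5-fold cross-validation repeated 100 times, can be sketched as follows. The helper names, the majority-class baseline, and the toy labels are assumptions for illustration; any classifier with the same `fit_predict` shape could be plugged in.

```python
import random
from collections import Counter


def repeated_kfold_indices(n_samples, k=5, repeats=100, seed=0):
    """Yield (train, test) index lists: k folds per repeat, reshuffled each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for test in folds:
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def mean_accuracy(xs, ys, fit_predict, k=5, repeats=100):
    """Average test accuracy over every fold of every repeat."""
    scores = []
    for train, test in repeated_kfold_indices(len(xs), k, repeats):
        preds = fit_predict([xs[i] for i in train],
                            [ys[i] for i in train],
                            [xs[i] for i in test])
        hits = sum(1 for p, i in zip(preds, test) if p == ys[i])
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x):
    """Placeholder classifier: always predict the most common training label."""
    top = Counter(train_y).most_common(1)[0][0]
    return [top for _ in test_x]


# Toy labels (hypothetical); the features are unused by the baseline.
ys = ["poor"] * 8 + ["good"] * 12
xs = [[0.0] for _ in ys]

splits = list(repeated_kfold_indices(len(ys)))
print(len(splits))  # 500 train/test splits: 5 folds x 100 repeats
baseline_accuracy = mean_accuracy(xs, ys, majority_baseline)
print(baseline_accuracy)
```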

HUMAN GUIDED RANDOM FOREST (HGF)

Same algorithm again, except each tree is constructed from a manually selected subset of genes (or other features).

1) Find a person
2) Let them select what they think is an optimal feature set
3) Back to step 1, N times
4) Aggregate

The premise is that biological knowledge can produce better-than-random decision modules, and that not all biological knowledge is captured in interaction networks.
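
The aggregation in step 4 could be sketched like this. To keep it short, each "tree" is a one-split decision stump restricted to one person's chosen genes; that simplification, the placeholder gene names, the toy values, and the three hypothetical player selections are all assumptions.

```python
from collections import Counter


def train_stump(samples, labels, genes):
    """Fit a one-split tree that may only use the hand-picked genes."""
    best = None
    for gene in genes:
        for threshold in sorted({s[gene] for s in samples}):
            low = [l for s, l in zip(samples, labels) if s[gene] <= threshold]
            high = [l for s, l in zip(samples, labels) if s[gene] > threshold]
            if not low or not high:
                continue
            low_label = Counter(low).most_common(1)[0][0]
            high_label = Counter(high).most_common(1)[0][0]
            errors = (sum(1 for l in low if l != low_label)
                      + sum(1 for l in high if l != high_label))
            if best is None or errors < best[0]:
                best = (errors, gene, threshold, low_label, high_label)
    if best is None:
        raise ValueError("no usable split among the selected genes")
    return best[1:]  # (gene, threshold, label_if_low, label_if_high)


def build_human_guided_forest(samples, labels, selections):
    """One tree per human-selected feature set."""
    return [train_stump(samples, labels, genes) for genes in selections]


def predict(forest, sample):
    """Aggregate the trees by majority vote."""
    votes = [low if sample[gene] <= threshold else high
             for gene, threshold, low, high in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy expression data (hypothetical genes g1..g4 and values).
samples = [
    {"g1": 0.1, "g2": 0.2, "g3": 0.7, "g4": 0.5},
    {"g1": 0.2, "g2": 0.1, "g3": 0.3, "g4": 0.9},
    {"g1": 0.3, "g2": 0.3, "g3": 0.9, "g4": 0.1},
    {"g1": 0.8, "g2": 0.9, "g3": 0.4, "g4": 0.6},
    {"g1": 0.9, "g2": 0.7, "g3": 0.8, "g4": 0.2},
    {"g1": 0.7, "g2": 0.8, "g3": 0.2, "g4": 0.8},
]
labels = ["good", "good", "good", "poor", "poor", "poor"]

# Feature sets chosen by three hypothetical players.
selections = [["g1", "g3"], ["g2", "g4"], ["g1", "g2"]]

forest = build_human_guided_forest(samples, labels, selections)
new_sample = {"g1": 0.85, "g2": 0.75, "g3": 0.5, "g4": 0.5}
print(predict(forest, new_sample))  # poor
```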

HGF CHALLENGES

1) Find a person
2) Let them select what they think is an optimal feature set
3) back to step one, N times

• N may be large (e.g. 1,000)
• Need many knowledgeable people to work hard... for free

COMBO!
COMPONENTS

Java server
• hosts training data
• executes decision tree algorithm
• performs cross-validation tests
• logs data generated by game

Between server and game interface
• server provides features for games (e.g. gene cards)
• client sends groups of features ('hands') to the server for scoring

Game interface(s)
• manages game events: users, moves, etc.
• two implementations so far
  • server side (JSP) card game
  • client side (javascript) table game

COMBO CODE AND DEMO

Next steps
1. Better preprocessing of training data
• map contigs to genes where possible, filter out clearly useless genes
• identify individually predictive genes
2. Game
• build domain-specific boards, let players pick their knowledge area
• real two-player feel with a robot partner
• special cards: robber card, any-gene selector card
• high scores
• ??????????????
