A crucial task in modern biology is the prediction of complex phenotypes, such as breast cancer prognosis, from genome-wide measurements. Machine learning algorithms can sometimes infer predictive patterns, but there is rarely enough data to train and test them effectively, and the patterns that they identify are often expressed in forms (e.g. support vector machines, neural networks, random forests composed of tens of thousands of trees) that are very difficult to understand. In addition, it is generally unclear how to include prior knowledge in the course of their construction.
Decision trees provide an intuitive visual form that can capture complex interactions between multiple variables. Effective methods exist for inferring decision trees automatically but it has been shown that these techniques can be improved upon via the manual interventions of experts. Here, we introduce Branch, a new Web-based tool for the interactive construction of decision trees from genomic datasets. Branch offers the ability to: (1) upload and share datasets intended for classification tasks (in progress), (2) construct decision trees by manually selecting features such as genes for a gene expression dataset, (3) collaboratively edit decision trees, (4) create feature functions that aggregate content from multiple independent features into single decision nodes (e.g. pathways) and (5) evaluate decision tree classifiers in terms of precision and recall. The tool is optimized for genomic use cases through the inclusion of gene and pathway-based search functions.
Branch enables expert biologists to easily engage directly with high-throughput datasets without the need for a team of bioinformaticians. The tree building process allows researchers to rapidly test hypotheses about interactions between biological variables and phenotypes in ways that would otherwise require extensive computational sophistication. In so doing, this tool can both inform biological research and help to produce more accurate, more meaningful classifiers.
A prototype of Branch is available at http://biobranch.org/
Branch: An interactive, web-based tool for building decision tree classifiers
Benjamin M. Good, Karthik Gangavarapu, Vyshakh Babji, Max Nanis, Andrew I. Su
The Scripps Research Institute
Background
Feature types
REFERENCES
CONTACT
Benjamin Good: bgood@scripps.edu @bgood
Andrew Su: asu@scripps.edu @andrewsu
Dataset library
Building a decision tree
Research reported in this poster was supported by the National Institute of General Medical Sciences
of the National Institutes of Health under award numbers R01GM089820 and R01GM083924, and by
the National Center for Advancing Translational Sciences of the National Institutes of Health under
award number UL1TR001114.
Goals
(1) Find patterns
(2) Make predictions on new samples
[Figure: samples labeled < 10 year or > 10 year; new samples to be classified as < 10 year or > 10 year]
1. Griffith et al. (2013) A robust prognostic signature for hormone-positive node-negative breast cancer. Genome Medicine.
2. Dutkowski and Ideker (2011) Protein Networks as Logic Functions in Development and Cancer. PLoS Computational Biology.
3. Winter et al. (2012) Google Goes Cancer: Improving Outcome Prediction for Cancer Patients by Network-Based Ranking of Marker Genes. PLoS Computational Biology.
4. Liu et al. (2012) Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics.
5. Paik et al. (2004) A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer. The New England Journal of Medicine.
6. Ankerst et al. (1999) Visual classification: an interactive approach to decision tree construction. Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.
7. Ware et al. (2002) Interactive machine learning: letting users build classifiers. International Journal of Human-Computer Studies.
Example: breast cancer survival prediction
Gene expression data (+ CNVs, SNPs, etc.)
(3) Understand the biology that the pattern indicates
Statistics and machine learning
• Example: Random Forests [1]
• Good at (1) finding patterns
• Have mixed results at (2) identifying patterns that generalize well across cohorts
• Sometimes offer little help for (3) increasing understanding of the underlying biology
Prior knowledge
• Known relationships between the data elements (e.g. genes) can be used to improve predictor accuracy and generalizability.
• Examples of inputs to automated methods: protein-protein interactions [2,3], pathway databases [4]
• Manual consideration by domain experts is a vital part of inferring new classifiers and is fundamental to forming an understanding of them. See, for example, the creation of the OncoTypeDx predictor for breast cancer prognosis [5]
Funding
Decision Trees
• Can be inferred automatically, but...
• Engaging domain experts in their creation: (1) provides access to prior knowledge, (2) results in smaller, more understandable trees, (3) can improve predictive performance, (4) can increase users' comprehension of both the classifier and the data [6,7]
Clicking on a node shows the
percentage of the dataset that
passes through it and its
accuracy.
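The two numbers shown for a node can be illustrated with a minimal Python sketch. It assumes that "accuracy" means the fraction of samples at the node that belong to the node's majority class; the poster does not define it, and the function name `node_summary` and the example labels are hypothetical, not part of Branch.

```python
from collections import Counter

def node_summary(node_labels, total_samples):
    """Summarize one tree node (illustrative only, not Branch's code).

    node_labels: class labels of the samples that reach this node.
    total_samples: number of samples in the whole dataset.
    """
    if not node_labels:
        raise ValueError("node has no samples")
    counts = Counter(node_labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return {
        # Percentage of the dataset that passes through the node.
        "coverage_pct": 100.0 * len(node_labels) / total_samples,
        # Assumption: the node predicts its majority class.
        "predicted_class": majority_class,
        # Assumption: accuracy is the majority-class fraction at the node.
        "accuracy": majority_count / len(node_labels),
    }

# Hypothetical node reached by 4 of 20 samples.
summary = node_summary(
    ["<10 year", "<10 year", "<10 year", ">10 year"], total_samples=20
)
```

Here the node covers 20% of the dataset and, under the majority-class assumption, has an accuracy of 0.75.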
View/use trees shared by community
• Gene (e.g. expression)
• Non-gene (e.g. clinical data)
• Custom feature (manually created feature
combination)
• Classifier node (e.g. a trained SVM)
• Pre-existing tree
• Visual (manually defined decision
boundary using GUI)
• Create a classifier node.
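A custom feature that combines several features into one decision node might look like the following Python sketch. Averaging the member genes is an assumption made for illustration; the poster does not say how Branch aggregates features. The gene names, the helper `make_mean_feature` and the threshold are hypothetical.

```python
from statistics import mean

def make_mean_feature(name, gene_names):
    """Build a feature function that averages several genes (illustrative)."""
    def feature(sample):
        # Assumption: the aggregate is the mean expression of the member genes.
        return mean(sample[g] for g in gene_names)
    feature.__name__ = name
    return feature

# Hypothetical pathway made of three placeholder genes.
pathway_score = make_mean_feature(
    "hypothetical_pathway", ["gene_a", "gene_b", "gene_c"]
)

sample = {"gene_a": 1.0, "gene_b": 2.0, "gene_c": 6.0, "age": 54}
score = pathway_score(sample)

# The aggregated value can then be used like any single feature in a split.
goes_right = score > 2.5
```

Because the combined feature returns one number per sample, it can sit in a decision node exactly as a single gene would.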
Iteratively select a feature to create each split (an if-then rule)
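The iterative process can be sketched in Python as repeated application of a threshold rule chosen by the user. This is a minimal illustration under stated assumptions, not Branch's implementation: the sample values, the gene names `gene_a` and `gene_b`, the thresholds, and the helpers `split_node` and `class_counts` are all hypothetical.

```python
from collections import Counter

def split_node(samples, feature, threshold):
    """Apply one if-then rule: if feature <= threshold go left, else go right."""
    left = [s for s in samples if s[feature] <= threshold]
    right = [s for s in samples if s[feature] > threshold]
    return left, right

def class_counts(samples, label_key="label"):
    """Count how many samples of each class reach a node."""
    return Counter(s[label_key] for s in samples)

# Hypothetical expression values and survival labels.
samples = [
    {"gene_a": 2.1, "gene_b": 0.4, "label": "<10 year"},
    {"gene_a": 5.3, "gene_b": 1.9, "label": ">10 year"},
    {"gene_a": 1.7, "gene_b": 2.2, "label": "<10 year"},
    {"gene_a": 6.0, "gene_b": 0.3, "label": ">10 year"},
    {"gene_a": 4.8, "gene_b": 2.5, "label": "<10 year"},
]

# First split, chosen by the user: gene_a at 3.0.
low, high = split_node(samples, "gene_a", 3.0)

# The high branch is still mixed, so the user picks a second feature for it.
high_low, high_high = split_node(high, "gene_b", 2.0)
```

After the second split every leaf in this toy example contains a single class, which is the point at which a user would typically stop adding rules.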
Transplant rejection
HIV-1 coreceptor usage
• Test datasets loaded:
• Breast cancer survival (gene expression)
• Kidney transplant rejection (gene expression)
• HIV coreceptor usage (amino acid sequences)
• Coming soon: upload your own data
The number of colored squares indicates the
number of samples that pass through the
node. The colors are associated with the
classes to be predicted. Ideal leaf nodes are
'pure' in that they contain samples of only
one class.
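Leaf purity can be quantified in several ways. The Python sketch below uses Gini impurity, a standard measure for decision trees, as an assumption; the poster does not say which measure, if any, Branch reports. The function names and labels are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def is_pure(labels):
    """A node is pure when every sample in it has the same class."""
    return len(set(labels)) == 1

pure_leaf = ["<10 year", "<10 year", "<10 year"]
mixed_leaf = ["<10 year", "<10 year", ">10 year", ">10 year"]

pure_score = gini_impurity(pure_leaf)    # all squares one color
mixed_score = gini_impurity(mixed_leaf)  # squares evenly split between two colors
```

For two classes the impurity ranges from 0.0 (one color only) to 0.5 (an even mix), so lower values correspond to leaves closer to the ideal described above.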
Breast cancer survival
Decision trees can be made
private or shared with the
public when saved. Public
trees may be used as a
starting point for others.
For collaboratively authored
trees, the author associated
with each node is tracked.