1. Data mining approach to
predict BRCA1 genes
mutation
Olegas Niakšu1, Jurgita Gedminaitė2, Olga Kurasova1
1 Vilnius
University, Institute of Mathematics and Informatics,
2 Lithuanian University of Health Sciences, Oncology Institute
2. Breast Cancer
• Cancer is the second main cause of death in
the developed countries and one of the main
causes in the world.
• There are more than 10 million people,
diagnosed with oncologic disease, about 6
million people will die for it in each year.
• Breast cancer is the most common cancer in
women worldwide.
• Survivability is a major concern and is highly
related with early diagnosis and optimal
treatment plan.
2
3. BRCA genes (1)
The gene named BRCA stands for breast cancer
(BC) susceptibility gene. In normal cells, BRCA
genes help ensure the stability of the cell’s
genetic material (DNA) and help prevent
uncontrolled cell growth.
Mutation of these genes has been linked to the
development of hereditary breast and ovarian
cancer.
Patients with pathological mutation of a BRCA
gene have 65% lifelong breast cancer
probability.
3
4. BRCA genes (2)
The research deals with the issue of cancer
suppression genes BRCA1 mutations.
We methodically apply a set of classical and
emerging statistical and data mining tools
having a goal to answer questions formulated by
clinicians:
• what are BRCA1 mutation prognostic factors,
• what are BC tumour reoccurrence factors,
• if-and-what BRCA1 mutations influence to the
course of decease.
4
5. Medical data mining
Healthcare domain is known for its ontological
complexity and variety of medical data
standards and variable data quality.
Typically, the available medical datasets are
fragmented and distributed; thereby the process
of data cleaning and integration is a challenging
task.
Other important issues related to the use of
personal healthcare data have origins in legal,
ethical and social aspects.
5
6. Research data
The original medical research had been carried
out in Oncology institute of Lithuanian University
of Health Sciences from 2010 till 2013.
The study group consisted of 83 women, who
have been diagnosed with I-II stage breast
cancer with the following tumour morphology
(T1 N0, T2 N0, T3 N0, T1 N1, T2 N1).
Research duration was determined considering
amount of patients and not less than 2 years
period of disease progress monitoring.
6
7. The full list of attributes of initial dataset (1)
#
1
2
3
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
Attribute
Age
Histology type
cT
pT
Multifocality
cN
pN
G
L
V
ER
PR
HER2
BRCA mutation
Bilateral BC
Tumor size
CHEK2 mutation
Affected l_m number
Attribute type*
Continuous
Nominal (5)
Nominal (5)
Nominal (6)
Nominal (2)
Nominal (3)
Nominal (2)
Nominal (3)
Nominal (2)
Nominal (2)
Nominal (4)
Nominal (4)
Nominal (2)
Nominal (6)
Nominal (2)
Continuous
Nominal (4)
Continuous
7
8. The full list of attributes of initial dataset (2)
#
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Attribute
Triple neg. BC
Family history type
Prostate cancer fam. Hist.
Pancreatic cancer fam. hist.
Colorectal cancer fam. Hist.
Surgery type
Chemotherapy type
Herceptin
Cht. complications
Reoccurrence
Metastases
Time to diseased
Is Diseased
Monitoring period
Time to reoccurrence
Adjuv. ST
Adjuv. HT
Attribute type*
Nominal (2)
Nominal (3)
Nominal (2)
Nominal (2)
Nominal (2)
Nominal (4)
Nominal (3)
Nominal (2)
Nominal (4)
Nominal (2)
Nominal (2)
Continuous
Nominal (2)
Continuous
Continuous
Nominal (2)
Nominal (5)
8
9. The distribution of prediction class
attributes
Attribute
BRCA1
mutation
BC
reoccurrence
Diseased
patients
Positive attribute
Negative attribute
value
value
Number Percentage Number Percentage
of
of the
of
of the
patients
whole
patients whole
group
group
12
14%
71
86%
22
27%
61
73%
2
2%
81
98%
9
11. Data mining approach
1 step: preprocessing. Continuous attributes
Age, Tumor size, Time to reoccurrence were
discretized accordingly to D_age group, D_tumor
size, and D_time to reoccurrence. Dataset was
checked for outliers and missing values.
2 step: dimensionality reduction. Feature
subset selection algorithms were used.
3 step: classification. A comparative analysis of
the classification techniques was performed in
WEKA, Orange, Tibco Spotfire Mining.
11
12. Classification methods
• Classification trees – J48, Random Forest,
Random tree, tree ensemble,
• Classification rules – ZeroR, OneR, and FURIA,
• Artificial neural networks – Multi-layer Perceptron,
SOM,
• Regression – logistic regression,
• Bayes – Naïve Bayes,
• Meta – Ada Boost, Bagging.
12
16. Association rules discovery
Apriori, PredictiveApriori, and HotSpot algorithms
were used.
Generic and class specific rules with a minimum
support in the range of [0.01; 0.2] with confidence
greater than 0.75.
Association rules search has found from 46
thousand to 78 thousand rules.
However there were no clinically interesting
(novel) patterns found, though the patterns
reconfirmed already known features of the BC.
16
17. Balancing data set results
Data set was changed by incrementally equalling
the proportion of dependent binary (class)
attribute values till it reached 50% to 50%
distribution.
The balancing of data set gave significant results
to the most of the classification algorithms.
Meta algorithm Bagging:
◦ accuracy – 0.90, sensitivity – 0.95,
◦ specificity – 0.85, ROC area value – 0.96
J48 tree algorithm:
◦ accuracy – 0.88, sensitivity – 0.93,
◦ specificity – 0.83, ROC area value – 0.85
17
22. Conclusions (1)
By analyzing breast cancer patient data, we have
realized the importance of systematic approach
in knowledge discovery process.
The study has showed high importance of an
optimal dataset forming for the classification
accuracy.
22
23. Conclusions (2)
Artificial neural networks have showed the best
performance for BRCA1 gene mutation carrier
prediction, but due to the lack of its expressivity,
decision tree and decision rules methods were
preferred by the clinicians.
Our overall conclusion is that after additional
validation on a larger dataset the created
prediction models can be used as clinical
decision support systems.
23
24. Thank you for your attention
Olegas Niakšu (Olegas.Niaksu@mii.vu.lt)
Jurgita Gedminaitė (Jurgita.Gedminaite@lmu.lt)
Olga Kurasova (Olga.Kurasova@mii.vu.lt)
24