2. Overview:
• Introduction
• Analysis of commonly used terms and explanation of data sets
• Overall Programming Process
• Generating a merged file with CDGLOBAL
• Generation of files for future status prediction
• Data Preprocessing
• Classification (algorithms) used on the data
• Analysis of the output data from WEKA
3. Introduction
• What is Alzheimer's Disease?
  – A brain disorder
  – The most common form of dementia, a term for the loss of memory and other intellectual abilities serious enough to interfere with daily life
• Clinical Dementia Rating (0, 0.5, 1, 2, 3):
  – 0 = Normal
  – 0.5 = Questionable dementia
  – 1.0 to 3.0 = Mild to severe dementia
4. Datasets (60 Files)
• 56 comma-separated files
• 1 file – Data Dictionary (explains the terms used)
• 1 file – Clinical Dementia Rating (has CDGLOBAL)
• The rest – assessments, data definitions, and others such as visit abbreviations
5. Environment Setup
• Programming languages used for the project are PHP, MySQL, Java, and PostgreSQL.
• Tools used are WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans.
• Front end: PHP
• Back end: MySQL
6. Overall Programming Process
• A selected dataset (FAQ) is given by the user.
• At the back end, MySQL queries are defined to create the required tables and insert the required data into the corresponding tables.
• Thereafter, the required operations are performed on the tables.
• The final output files are stored in .csv format (a minimal sketch of this flow follows below).
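As a rough illustration only, here is a minimal JDBC sketch of this load-query-export flow; the connection details, table schema, and output file name are assumptions, not the project's actual code:

    import java.io.FileWriter;
    import java.sql.*;

    public class Pipeline {
        public static void main(String[] args) throws Exception {
            // Connect to the local MySQL database (credentials are placeholders).
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/adni", "user", "password");
            Statement st = con.createStatement();

            // 1. Create a table for the user-selected dataset (schema abbreviated).
            st.execute("CREATE TABLE IF NOT EXISTS faq (RID INT, VISCODE VARCHAR(8), FAQTOTAL INT)");

            // 2. Load the raw CSV into the table (requires local_infile to be enabled).
            st.execute("LOAD DATA LOCAL INFILE 'adni_faq_2011-01-20.csv' INTO TABLE faq "
                     + "FIELDS TERMINATED BY ',' IGNORE 1 LINES");

            // 3. Perform the required operations and write the result out as .csv.
            ResultSet rs = st.executeQuery("SELECT RID, VISCODE, FAQTOTAL FROM faq");
            try (FileWriter out = new FileWriter("output.csv")) {
                out.write("RID,VISCODE,FAQTOTAL\n");
                while (rs.next()) {
                    out.write(rs.getInt(1) + "," + rs.getString(2) + "," + rs.getInt(3) + "\n");
                }
            }
            con.close();
        }
    }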
7. Generating a merged file with CDGLOBAL (for current status)
• For the given input dataset (e.g. adni_faq_2011-01-20.csv) and the adni_cdr_2011-01-20.csv file, the RIDs and VISCODEs of faq and cdr are compared, and based on that the CDGLOBAL column of the cdr file is merged into the faq file.
• During the merge, rows whose CDGLOBAL is -1 are removed and the VISCODEs f, nv, and uns1 are trimmed off.
• The result file is "Merged_dataset_file.csv".
8. Query used for generating the merged file:

    SELECT f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE,
           f.FAQFINAN, f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG,
           f.FAQMEAL, f.FAQEVENT, f.FAQTV, f.FAQREM, f.FAQTRAVL,
           f.FAQTOTAL, cdr.cdglobal
    FROM cdr, faq f
    WHERE cdr.rid = f.rid
      AND cdr.VISCODE = f.VISCODE
      AND cdr.cdglobal NOT IN (-1);
9. Generation of files for future status prediction
• The prediction dataset is generated by mapping the first-time visit to the 6-month class, the 6-month visit to the 12-month class, and so on.
• SQL query operations are performed on the merged file to separate the 6-month time-interval classes.
• The following files are generated: File_dataset_m06.csv, File_dataset_m12.csv, and so on.
10. Query used for generating class files:

    SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE,
           v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG,
           v.FAQMEAL, v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL,
           v.FAQTOTAL, m12.cdglobal
    FROM `table_adni_faq_2011-01-20_m06` v, `table_adni_faq_2011-01-20_m12` m12
    WHERE v.rid = m12.rid;
11. Preprocessing
• After we get the required .csv files, we use WEKA to preprocess the data.
• Load the file into WEKA.
• Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
• Apply "NumericToNominal" to convert all the values in the data to nominal before feeding them to a classifier algorithm (see the sketch below).
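The same two steps can also be scripted through WEKA's Java API; a minimal sketch, where the file name and the attribute indices to remove are assumptions:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;
    import weka.filters.unsupervised.attribute.Remove;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            // Load the merged CSV file into WEKA.
            Instances data = DataSource.read("Merged_dataset_file.csv");

            // Trim off unused fields (the indices here are placeholders).
            Remove remove = new Remove();
            remove.setAttributeIndices("1-4");
            remove.setInputFormat(data);
            data = Filter.useFilter(data, remove);

            // Convert numeric attributes to nominal before classification.
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("first-last");
            toNominal.setInputFormat(data);
            data = Filter.useFilter(data, toNominal);

            // The class attribute is the last column (CDGLOBAL).
            data.setClassIndex(data.numAttributes() - 1);
        }
    }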
12. Classification Algorithms Used
• The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in WEKA) to the resulting dataset, to estimate the accuracy of the resulting predictive model.
• J48, which uses the C4.5 algorithm (a successor of ID3)
• The Naïve Bayesian classification algorithm
13. What is classification?
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (see the sketch below).
• Example: if the items in a house are not classified, we cannot arrange them. We classify the items depending on their usage, as cooking items, decoration items, etc., so that we can arrange them accordingly and use them in an efficient and easier way.
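A minimal sketch of such a train/test split with WEKA's Instances class; the 66/34 proportion and the file name are assumed choices:

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Split {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1)); // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
            System.out.println("Train: " + train.numInstances() + ", Test: " + test.numInstances());
        }
    }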
14. Decision Tree Classification Task
[Figure: a worked example of applying a decision tree to a test record. The tree splits on Refund (Yes / No), then MarSt (Single, Divorced / Married), then TaxInc (< 80K / > 80K), with YES/NO leaf outcomes; the test record is assigned Cheat = "No".]

15. Decision Tree
[Figure: the same Refund / MarSt / TaxInc decision tree, shown again with the test data.]
16. J48 uses the C4.5 Algorithm
• Decision trees represent a supervised approach to classification.
• Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
• A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• The basic algorithm recursively classifies until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible. This process ensures maximum accuracy on the training data.
• The latest public-domain implementation of Quinlan's model is C4.5. The WEKA classifier package has its own version of C4.5, known as J48 (see the sketch below).
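A minimal sketch of building a J48 tree through WEKA's Java API; the input file is assumed to be the preprocessed merged dataset:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setOptions(new String[] {"-C", "0.25", "-M", "2"}); // default confidence factor and minimum leaf size
            tree.buildClassifier(data);

            System.out.println(tree); // prints the induced decision tree
        }
    }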
17. Why the decision tree algorithm?
• Advantages:
  – Inexpensive to construct
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
  – There can be more than one possible tree for the same data
• Disadvantages:
  – Underfitting: when the model is too simple, both training and test errors are large
18. All about Cross-Validation
• We perform cross-validation when the amount of data is small and we need independent training and test sets from it.
• It is important that each class is represented in its actual proportions in the training and test sets: stratification.
• An important cross-validation technique is stratified 10-fold cross-validation, where the instance set is divided into 10 folds.
• We run 10 iterations, each time taking a different single fold for testing and the rest for training (see the sketch below).
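A minimal sketch of stratified 10-fold cross-validation in WEKA's Java API (crossValidateModel stratifies the folds internally); the classifier and input file are carried over from the earlier sketches:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);

            // 10 folds, fixed random seed for reproducible fold assignment.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
        }
    }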
19. Evaluation
• Metrics for performance evaluation
  – How do we evaluate the performance of a model?
• Methods for model comparison
  – How do we compare the relative performance among competing models?
20. Metrics for Performance Evaluation: Confusion Matrix
• A confusion matrix contains information about actual and predicted classifications done by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier:

                        Predicted positive      Predicted negative
    Actual positive     True Positives (TP)     False Negatives (FN)
    Actual negative     False Positives (FP)    True Negatives (TN)

• We get the confusion matrix after supplying data to a classifier.
• Based on the confusion matrix, we can evaluate the model using measures like precision, F-measure, accuracy, and recall (as shown below).
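Continuing the cross-validation sketch, WEKA's Evaluation object exposes the confusion matrix and the per-class measures directly:

    // Continues from the CrossValidate sketch above (eval is the Evaluation object).
    System.out.println(eval.toMatrixString("Confusion matrix:"));
    System.out.println("Precision (class 0): " + eval.precision(0));
    System.out.println("Recall (class 0):    " + eval.recall(0));
    System.out.println("F-measure (class 0): " + eval.fMeasure(0));
    System.out.println("Accuracy:            " + eval.pctCorrect() + " %");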
21. Example
• Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
• Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
• We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.
• All correct guesses are located in the diagonal of the table, so it's easy to visually inspect the table for errors, as they will be represented by any non-zero values outside the diagonal.
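For reference, a matrix consistent with the description above (the exact cell counts are an assumption, following the standard version of this example):

                       Predicted: Cat   Dog   Rabbit
    Actual: Cat                   5     3     0
    Actual: Dog                   2     3     1
    Actual: Rabbit                0     2     11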
22. Limitation of accuracy
• Consider a 2-class problem:
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
  – Accuracy has some disadvantages as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so you'll probably want to look at some of the other numbers. ROC Area, or area under the ROC curve, is also taken as a preferred measure.
  – Accuracy is misleading here because the model does not detect any Class 1 example.
23. Metrics for Evaluation
• Accuracy: the accuracy (AC) is the proportion of the total number of predictions that were correct, i.e. what percentage of people were correctly classified. It is determined using the equation:
  Accuracy = (# True Positives + # True Negatives) / N, where N = total # of predictions.
  Equivalently: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: precision (P) is the proportion of the predicted positive cases that were correct. Of all the people classified as demented, what percentage of them is actually demented? It is calculated using the equation:
  Precision = (# True Positives) / (# True Positives + # False Positives)
24. Evaluation
• F-measure: it is calculated using the equation:
  F-measure = (2 × # True Positives) / (2 × # True Positives + # False Positives + # False Negatives)
• Recall: recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:
  Recall = (# True Positives) / (# True Positives + # False Negatives)
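As a quick worked check of these formulas (the counts are made up for illustration):

    public class Metrics {
        public static void main(String[] args) {
            // Hypothetical counts read off a confusion matrix.
            double tp = 40, tn = 45, fp = 5, fn = 10;

            double accuracy  = (tp + tn) / (tp + tn + fp + fn); // 85 / 100 = 0.85
            double precision = tp / (tp + fp);                  // 40 / 45  ~ 0.889
            double recall    = tp / (tp + fn);                  // 40 / 50  = 0.80
            double fMeasure  = (2 * tp) / (2 * tp + fp + fn);   // 80 / 95  ~ 0.842

            System.out.printf("Accuracy=%.3f Precision=%.3f Recall=%.3f F=%.3f%n",
                    accuracy, precision, recall, fMeasure);
        }
    }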
25. Methods for Model Comparison: ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory, to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
26. Using ROC for Model Comparison
[Figure: ROC curves for two models, M1 and M2. M1 is better for small FPR; M2 is better for large FPR.]
• A rough guide for classifying the accuracy of a diagnostic test by the area under the ROC curve is the traditional academic point system:
  – 0.90–1.00 = excellent (A)
  – 0.80–0.90 = good (B)
  – 0.70–0.80 = fair (C)
  – 0.60–0.70 = poor (D)
  – 0.50–0.60 = fail (F)
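WEKA's Evaluation object reports this area directly; continuing the earlier cross-validation sketch:

    // Continues from the CrossValidate sketch (eval is the Evaluation object).
    System.out.println("ROC Area (class 0): " + eval.areaUnderROC(0));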
27. Naïve Bayes
• It is a simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
• An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined, and not the entire covariance matrix. It is best suited for attributes that are independent, and it is very simple and very fast.
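The corresponding WEKA sketch for Naïve Bayes mirrors the J48 one; the input file is again assumed:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainNaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);

            NaiveBayes nb = new NaiveBayes();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(nb, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }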
28. Challenges faced
• Initially all the data files were processed using JDBC and MySQL; this later proved cumbersome whenever another dataset was used. Hence PHP-based MySQL is used, which is generalized for all datasets.
• Tables were initially created by hand for loading the data; this was later done with file-operating functions.
• All the MySQL commands were initially run sequentially; this was later enhanced using PHP as the front end.
• Initially the J48 tree was not able to process the data because it was in numerical values. This was later handled by discretization (NumericToNominal) of the CDGLOBAL columns.