2. Overview:
• Introduction
• Analysis of commonly used terms and explanation of data sets
• Overall Programming Process
• Generating a merged file with CDGLOBAL
• Generation of files for future status prediction
• Data Preprocessing
• Classification (algorithms) used on the data
• Analysis of the output data from WEKA
3. Introduction
• What is Alzheimer's Disease?
  – A brain disorder
  – The most common form of dementia, a term for the loss of memory and other intellectual abilities serious enough to interfere with daily life
• Clinical Dementia Rating (0, 0.5, 1, 2, 3):
  – 0 = Normal
  – 0.5 = Questionable dementia
  – 1.0 to 3.0 = Mild to severe dementia
4. Datasets (60 Files)
• 56 comma-separated files
• 1 file – Data Dictionary (explains the terms used)
• 1 file – Clinical Dementia Rating (has CDGLOBAL)
• The rest – assessments, data definitions, and others such as visit abbreviations
5. Environment Setup
• Programming languages used for the project are PHP, MySQL, Java, and PostgreSQL.
• Tools used are WEKA (Waikato Environment for Knowledge Analysis), MySQL Workbench, and NetBeans.
• Front end: PHP
• Back end: MySQL
6. Overall Programming Process
• A selected dataset (FAQ) is given by the user.
• At the back end, MySQL queries are defined to create the required tables and insert the required data into the corresponding tables.
• Thereafter, the required operations are performed on the tables.
• The final output files are stored in .csv format (a minimal sketch of this flow follows below).
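As a rough illustration only, here is a minimal JDBC sketch of this load-query-export flow; the connection details, table schema, and output file name are assumptions, not the project's actual code:

    import java.io.FileWriter;
    import java.sql.*;

    public class Pipeline {
        public static void main(String[] args) throws Exception {
            // Connect to the local MySQL database (credentials are placeholders).
            Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/adni", "user", "password");
            Statement st = con.createStatement();

            // 1. Create a table for the user-selected dataset (schema abbreviated).
            st.execute("CREATE TABLE IF NOT EXISTS faq (RID INT, VISCODE VARCHAR(8), FAQTOTAL INT)");

            // 2. Load the raw CSV into the table (requires local_infile to be enabled).
            st.execute("LOAD DATA LOCAL INFILE 'adni_faq_2011-01-20.csv' INTO TABLE faq "
                     + "FIELDS TERMINATED BY ',' IGNORE 1 LINES");

            // 3. Perform the required operations and write the result out as .csv.
            ResultSet rs = st.executeQuery("SELECT RID, VISCODE, FAQTOTAL FROM faq");
            try (FileWriter out = new FileWriter("output.csv")) {
                out.write("RID,VISCODE,FAQTOTAL\n");
                while (rs.next()) {
                    out.write(rs.getInt(1) + "," + rs.getString(2) + "," + rs.getInt(3) + "\n");
                }
            }
            con.close();
        }
    }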
7. Generating a merged file with CDGLOBAL (for current status)
• For the given input dataset (e.g. adni_faq_2011-01-20.csv) and the adni_cdr_2011-01-20.csv file, the RIDs and VISCODEs of faq and cdr are compared, and based on that the CDGLOBAL column of the cdr file is merged into the faq file.
• During the merge, rows whose CDGLOBAL is -1 are removed and the VISCODEs f, nv, and uns1 are trimmed off.
• The result file is "Merged_dataset_file.csv".
8. Query used for generating the merged file:

    SELECT f.cID, f.RID, f.VISCODE, f.EXAMDATE, f.FAQSOURCE,
           f.FAQFINAN, f.FAQFORM, f.FAQSHOP, f.FAQGAME, f.FAQBEVG,
           f.FAQMEAL, f.FAQEVENT, f.FAQTV, f.FAQREM, f.FAQTRAVL,
           f.FAQTOTAL, cdr.cdglobal
    FROM cdr, faq f
    WHERE cdr.rid = f.rid
      AND cdr.VISCODE = f.VISCODE
      AND cdr.cdglobal NOT IN (-1);
9. Generation of files for future status prediction
• The prediction dataset is generated by mapping the first-time visit to the 6-month class, the 6-month visit to the 12-month class, and so on.
• SQL query operations are performed on the merged file to separate the 6-month time-interval classes.
• The following files are generated: File_dataset_m06.csv, File_dataset_m12.csv, and so on.
10. Query used for generating class files:

    SELECT v.ID AS ID, v.RID AS RID, v.VISCODE, v.EXAMDATE, v.FAQSOURCE,
           v.FAQFINAN, v.FAQFORM, v.FAQSHOP, v.FAQGAME, v.FAQBEVG,
           v.FAQMEAL, v.FAQEVENT, v.FAQTV, v.FAQREM, v.FAQTRAVL,
           v.FAQTOTAL, m12.cdglobal
    FROM `table_adni_faq_2011-01-20_m06` v, `table_adni_faq_2011-01-20_m12` m12
    WHERE v.rid = m12.rid;
11. Preprocessing
• After we get the required .csv files, we use WEKA to preprocess the data.
• Load the file into WEKA.
• Apply the filter "weka.filters.unsupervised.attribute.Remove" to trim off the unused fields.
• Apply "NumericToNominal" to convert all the values in the data to nominal before feeding them to a classifier algorithm (see the sketch below).
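The same two steps can also be scripted through WEKA's Java API; a minimal sketch, where the file name and the attribute indices to remove are assumptions:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NumericToNominal;
    import weka.filters.unsupervised.attribute.Remove;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            // Load the merged CSV file into WEKA.
            Instances data = DataSource.read("Merged_dataset_file.csv");

            // Trim off unused fields (the indices here are placeholders).
            Remove remove = new Remove();
            remove.setAttributeIndices("1-4");
            remove.setInputFormat(data);
            data = Filter.useFilter(data, remove);

            // Convert numeric attributes to nominal before classification.
            NumericToNominal toNominal = new NumericToNominal();
            toNominal.setAttributeIndices("first-last");
            toNominal.setInputFormat(data);
            data = Filter.useFilter(data, toNominal);

            // The class attribute is the last column (CDGLOBAL).
            data.setClassIndex(data.numAttributes() - 1);
        }
    }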
12. Classification Algorithms Used
• The Classify panel enables the user to apply classification and regression algorithms (indiscriminately called classifiers in WEKA) to the resulting dataset, to estimate the accuracy of the resulting predictive model.
• J48, which uses the C4.5 algorithm (a successor of ID3)
• The Naïve Bayesian classification algorithm
13. What is classification?
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (see the sketch below).
• Example: if the items in a house are not classified, we cannot arrange them. We classify the items depending on their usage, as cooking items, decoration items, etc., so that we can arrange them accordingly and use them in an efficient and easier way.
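A minimal sketch of such a train/test split with WEKA's Instances class; the 66/34 proportion and the file name are assumed choices:

    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Split {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1)); // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);
            System.out.println("Train: " + train.numInstances() + ", Test: " + test.numInstances());
        }
    }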
14. Decision Tree Classification Task
[Figure: a worked example of applying a decision tree to a test record. The tree splits on Refund (Yes / No), then MarSt (Single, Divorced / Married), then TaxInc (< 80K / > 80K), with YES/NO leaf outcomes; the test record is assigned Cheat = "No".]

15. Decision Tree
[Figure: the same Refund / MarSt / TaxInc decision tree, shown again with the test data.]
16. J48 uses the C4.5 Algorithm
• Decision trees represent a supervised approach to classification.
• Decision trees are a classic way to represent information from a machine learning algorithm, and offer a fast and powerful way to express structures in data.
• A decision tree is a simple structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
• The basic algorithm recursively classifies until each leaf is pure, meaning that the data has been categorized as close to perfectly as possible. This process ensures maximum accuracy on the training data.
• The latest public-domain implementation of Quinlan's model is C4.5. The WEKA classifier package has its own version of C4.5, known as J48 (see the sketch below).
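A minimal sketch of building a J48 tree through WEKA's Java API; the input file is assumed to be the preprocessed merged dataset:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.setOptions(new String[] {"-C", "0.25", "-M", "2"}); // default confidence factor and minimum leaf size
            tree.buildClassifier(data);

            System.out.println(tree); // prints the induced decision tree
        }
    }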
17. Why the decision tree algorithm?
• Advantages:
  – Inexpensive to construct
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
  – There can be more than one possible tree for the same data
• Disadvantages:
  – Underfitting: when the model is too simple, both training and test errors are large
18. All about Cross-Validation
• We perform cross-validation when the amount of data is small and we need independent training and test sets from it.
• It is important that each class is represented in its actual proportions in the training and test sets: stratification.
• An important cross-validation technique is stratified 10-fold cross-validation, where the instance set is divided into 10 folds.
• We run 10 iterations, each time taking a different single fold for testing and the rest for training (see the sketch below).
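A minimal sketch of stratified 10-fold cross-validation in WEKA's Java API (crossValidateModel stratifies the folds internally); the classifier and input file are carried over from the earlier sketches:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);

            // 10 folds, fixed random seed for reproducible fold assignment.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
        }
    }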
19. Evaluation
• Metrics for performance evaluation
  – How do we evaluate the performance of a model?
• Methods for model comparison
  – How do we compare the relative performance among competing models?
20. Metrics for Performance Evaluation: Confusion Matrix
• A confusion matrix contains information about actual and predicted classifications done by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier:

                        Predicted positive      Predicted negative
    Actual positive     True Positives (TP)     False Negatives (FN)
    Actual negative     False Positives (FP)    True Negatives (TN)

• We get the confusion matrix after supplying data to a classifier.
• Based on the confusion matrix, we can evaluate the model using measures like precision, F-measure, accuracy, and recall (as shown below).
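Continuing the cross-validation sketch, WEKA's Evaluation object exposes the confusion matrix and the per-class measures directly:

    // Continues from the CrossValidate sketch above (eval is the Evaluation object).
    System.out.println(eval.toMatrixString("Confusion matrix:"));
    System.out.println("Precision (class 0): " + eval.precision(0));
    System.out.println("Recall (class 0):    " + eval.recall(0));
    System.out.println("F-measure (class 0): " + eval.fMeasure(0));
    System.out.println("Accuracy:            " + eval.pctCorrect() + " %");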
21. Example
• Suppose there is a sample of 27 animals: 8 cats, 6 dogs, and 13 rabbits.
• Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
• We can see from the matrix that the system in question has trouble distinguishing between cats and dogs, but can make the distinction between rabbits and other types of animals pretty well.
• All correct guesses are located in the diagonal of the table, so it's easy to visually inspect the table for errors, as they will be represented by any non-zero values outside the diagonal.
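For reference, a matrix consistent with the description above (the exact cell counts are an assumption, following the standard version of this example):

                       Predicted: Cat   Dog   Rabbit
    Actual: Cat                   5     3     0
    Actual: Dog                   2     3     1
    Actual: Rabbit                0     2     11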
22. Limitation of accuracy
• Consider a 2-class problem:
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%.
  – Accuracy has some disadvantages as a performance estimate. For example, if there were 95 cats and only 5 dogs in the data set, the classifier could easily be biased into classifying all the samples as cats. The overall accuracy would be 95%, but in practice the classifier would have a 100% recognition rate for the cat class and a 0% recognition rate for the dog class, so you'll probably want to look at some of the other numbers. ROC Area, or area under the ROC curve, is also taken as a preferred measure.
  – Accuracy is misleading here because the model does not detect any Class 1 example.
23. Metrics for Evaluation
• Accuracy: the accuracy (AC) is the proportion of the total number of predictions that were correct, i.e. what percentage of people were correctly classified. It is determined using the equation:
  Accuracy = (# True Positives + # True Negatives) / N, where N = total # of predictions.
  Equivalently: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: precision (P) is the proportion of the predicted positive cases that were correct. Of all the people classified as demented, what percentage of them is actually demented? It is calculated using the equation:
  Precision = (# True Positives) / (# True Positives + # False Positives)
24. Evaluation
• F-measure: it is calculated using the equation:
  F-measure = (2 × # True Positives) / (2 × # True Positives + # False Positives + # False Negatives)
• Recall: recall is the ratio of the number of true positives to the sum of true positives and false negatives. It is calculated using the equation:
  Recall = (# True Positives) / (# True Positives + # False Negatives)
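As a quick worked check of these formulas (the counts are made up for illustration):

    public class Metrics {
        public static void main(String[] args) {
            // Hypothetical counts read off a confusion matrix.
            double tp = 40, tn = 45, fp = 5, fn = 10;

            double accuracy  = (tp + tn) / (tp + tn + fp + fn); // 85 / 100 = 0.85
            double precision = tp / (tp + fp);                  // 40 / 45  ~ 0.889
            double recall    = tp / (tp + fn);                  // 40 / 50  = 0.80
            double fMeasure  = (2 * tp) / (2 * tp + fp + fn);   // 80 / 95  ~ 0.842

            System.out.printf("Accuracy=%.3f Precision=%.3f Recall=%.3f F=%.3f%n",
                    accuracy, precision, recall, fMeasure);
        }
    }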
25. Methods for Model Comparison: ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory, to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the true positive rate (on the y-axis) against the false positive rate (on the x-axis).
26. Using ROC for Model Comparison
[Figure: ROC curves for two models, M1 and M2. M1 is better for small FPR; M2 is better for large FPR.]
• A rough guide for classifying the accuracy of a diagnostic test by the area under the ROC curve is the traditional academic point system:
  – 0.90–1.00 = excellent (A)
  – 0.80–0.90 = good (B)
  – 0.70–0.80 = fair (C)
  – 0.60–0.70 = poor (D)
  – 0.50–0.60 = fail (F)
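WEKA's Evaluation object reports this area directly; continuing the earlier cross-validation sketch:

    // Continues from the CrossValidate sketch (eval is the Evaluation object).
    System.out.println("ROC Area (class 0): " + eval.areaUnderROC(0));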
27. Naïve Bayes
• It is a simple probabilistic classifier based on applying Bayes' theorem with independence assumptions. A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.
• For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
• An advantage of the naive Bayes classifier is that it requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because independent variables are assumed, only the variances of the variables for each class need to be determined, and not the entire covariance matrix. It is best suited for attributes that are independent, and it is very simple and very fast.
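The corresponding WEKA sketch for Naïve Bayes mirrors the J48 one; the input file is again assumed:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainNaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("Merged_dataset_file.csv");
            data.setClassIndex(data.numAttributes() - 1);

            NaiveBayes nb = new NaiveBayes();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(nb, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }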
28. Challenges faced
• Initially all the data files were processed using JDBC and MySQL; this later proved cumbersome whenever another dataset was used. Hence PHP-based MySQL is used, which is generalized for all datasets.
• Tables were initially created by hand for loading the data; this was later done with file-operating functions.
• All the MySQL commands were initially run sequentially; this was later enhanced using PHP as the front end.
• Initially the J48 tree was not able to process the data because it was in numerical values. This was later handled by discretization (NumericToNominal) of the CDGLOBAL columns.