SEMESTRAL WORK FOR THE COURSE 336VD (DATA MINING), CZECH TECHNICAL UNIVERSITY IN PRAGUE, 2008/2009




    Classification of newborn's sleeping phases from
                       their EEG

                                                          Dominik Franěk



   Abstract— Correct classification of a newborn's sleeping phases
from their EEG can help to predict brain problems or other mental
defects. The aim of this semestral work was to find the optimal k for
a nearest neighbor classifier. The choice of kNN is motivated by its
simplicity, its flexibility to incorporate different data types and its
adaptability to irregular feature spaces. The best k for the nearest
neighbor classifier was found to be 3, with an accuracy of 83.69%.
This means that whenever a newborn's EEG is given, the algorithm can
classify the newborn's sleeping phases by looking at the 3 nearest
EEG records.


                         I. ASSIGNMENT
   Use the method of k Nearest Neighbors for classification of the
target attribute of a chosen dataset. Choose one of the classes as the
target (positive) class. Find the best classifier that has a False
Positive rate (FPr) < 0.3. Compute the accuracy and the True Positive
rate (TPr) of this classifier.

Fig. 1. Graph showing the original values of the attributes; x-axis:
attributes, y-axis: values of attributes (−5 to 543)
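
For reference, the three quantities requested above can be computed from
the counts of a binary confusion matrix. The small Python helper below is
only an illustration and is not part of the RapidMiner process used in
this work; the example counts are taken from Table II later in the report.

    # Illustrative helper only; "positive" is the chosen target class.
    def binary_metrics(tp, fp, tn, fn):
        """Return (accuracy, FPr, TPr) from confusion-matrix counts."""
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        fpr = fp / (fp + tn)   # False Positive rate: negatives flagged as positive
        tpr = tp / (tp + fn)   # True Positive rate (recall): positives found
        return accuracy, fpr, tpr

    # Example with the counts reported in Table II for k = 3:
    # TP = 361, FP = 159, TN = 1609, FN = 225
    print(binary_metrics(tp=361, fp=159, tn=1609, fn=225))
    # -> (0.8369..., 0.0899..., 0.6160...)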

                       II. INTRODUCTION
   The problem is to find the optimal k for a Nearest Neighbor
classifier (written as NN from here on) for the given dataset.

Fig. 2. Graph showing the normalized values of the attributes; x-axis:
attributes, y-axis: values of attributes (0.0 to 1.0)

   The algorithm can be briefly summarized as follows: in the training
phase, it computes the similarity measures from all rows in the training
set and combines them into a global similarity measure using the
XValidation method. In the testing phase, for rows with "unknown"
classes, it chooses their k nearest neighbors in the training set
according to the trained similarity measure and then uses a customized
voting scheme to generate a list of predictions with confidence
scores [4].
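
As a minimal sketch of the testing-phase logic just described — plain
majority voting among the k nearest rows under Euclidean distance, with
the confidence taken as the fraction of neighbors voting for the winner —
a Python/NumPy function could look as follows. The actual work relies on
RapidMiner operators rather than hand-written code.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k=3):
        """Classify one query row by a majority vote among its k nearest
        training rows (Euclidean distance). The confidence is simply the
        fraction of the k neighbors that voted for the winning class."""
        dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # distance to every training row
        nearest = np.argsort(dists)[:k]                          # indices of the k closest rows
        votes = Counter(y_train[i] for i in nearest)
        label, count = votes.most_common(1)[0]
        return label, count / k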
   The dataset is in *.arff format and each row has 55 attributes. The
attribute called "class" has 4 nominal values (0, 1, 2, 3) and represents
the classified newborn's sleeping phases. I did not find anywhere what
this means exactly, but from my observations I expect that from the given
attributes (EEG c1 alpha, ...) it can be computed which kind of sleeping
phase these attribute values represent [5].
   The given dataset is already preprocessed a little. There are no rows
with zero attributes and all attributes are numerical values. Fig. 1
shows all attributes of the dataset and their values. These values are
not normalized, so the range of attribute values is from −5 to 543. The
normalized dataset is shown in Fig. 2, where all values are in the range
from 0.0 to 1.0. Each class (0, 1, 2, 3) has a different color. The
dataset is too big to process at once, because it consists of 42027 rows,
each with 55 regular attributes.

                       III. EXPERIMENTS
   The chosen positive class of the original data is class 0 (renamed to
class 1 in the normalized subset). The other classes (1, 2, 3) are set as
negative classes.
   The dataset clearly has to be preprocessed before starting the
experiments. First, the Normalization operator is used (see Fig. 5),
which normalizes all numerical values to the range from 0.0 to 1.0. No
special handling of extreme values is done, because the next
preprocessing step selects just 1/70 of all rows (2942 rows), which
"eliminates" the extreme values. This subset is chosen with the
Stratified Sampling method, with the attribute named "class" set as the
label. Of the 2942 chosen rows, 2210 are labeled as class 0 and 732 as
class 1 (Tab. I). Class 0 is merged from the original classes 1, 2 and 3;
class 1 is renamed from the original class 0. The normalized data subset
is shown in Fig. 3.
   After attribute normalization, the training of the model begins. As
shown in Fig. 5 (right side), the normalized subset is divided into 2
parts: 1/5 of the subset goes to the training phase and 4/5 is used for
testing.
   In the training phase the Parameter Iteration operator is used to
iterate the k of the NN. k is iterated from 1 to 15, increasing by +1.
To avoid overfitting of the NN, a method called K-fold cross-validation
(CV) is used (a rough code sketch of this pipeline is given below).
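
The sketch below is a rough Python equivalent of the preprocessing and
k-search pipeline described in this section, using scikit-learn as a
stand-in for the RapidMiner operators (Normalization, Stratified
Sampling, SplitChain with ratio 0.2 and Parameter Iteration with
cross-validation). The load_arff helper is a hypothetical loader, not
part of the original process.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_arff("newborn_eeg.arff")        # hypothetical loader: 55 attributes + "class"
    y = (np.asarray(y) == 0).astype(int)        # original class 0 becomes positive class 1

    X = MinMaxScaler().fit_transform(X)         # normalize every attribute to [0.0, 1.0]

    # 1) stratified subset of 2942 rows (as the Stratified Sampling operator does)
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=2942, stratify=y, random_state=2001)

    # 2) split the subset: 1/5 for training, 4/5 for testing (SplitChain ratio 0.2)
    X_train, X_test, y_train, y_test = train_test_split(
        X_sub, y_sub, train_size=0.2, stratify=y_sub, random_state=2001)

    # 3) iterate k from 1 to 15, keep the k with the best mean 10-fold CV accuracy
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=10).mean()
              for k in range(1, 16)}
    best_k = max(scores, key=scores.get)        # the report finds the best k to be 3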
   For each iteration of k, CV is run 10 times: CV divides the training
set 10 times into 2 parts, trains the kNN on the first part and
validates it on the second part. After the 10 iterations of K-fold
cross-validation, the average accuracy of the kNN over these 10 runs is
computed. After k has been iterated from 1 to 15, the k with the highest
average accuracy is selected and used in the testing phase. The graph
with the average accuracies for each k of the NN is shown in Fig. 4.

Fig. 3. Graph showing the normalized data subset with positive class = 1;
x-axis: attributes, y-axis: values of attributes (0.0 to 1.0)

Fig. 4. Average accuracy of the kNN; x-axis: k of the NN; y-axis: accuracy

                              TABLE I
          STATISTICS OF ATTRIBUTES OF THE NORMALIZED SUBSET

              Attr. name          Statistics        Range
              class               label             0.0 (2210), 1.0 (732)
              PNG                 0.427 +/- 0.111   [0.000 ; 0.919]
              PNG filtered        0.359 +/- 0.166   [0.000 ; 1.000]
              EMG std             0.114 +/- 0.090   [0.033 ; 0.766]
              EMG std filtered    0.126 +/- 0.138   [0.004 ; 0.874]
              ECG beat            0.427 +/- 0.135   [0.212 ; 0.993]
              ECG beat filtered   0.444 +/- 0.138   [0.225 ; 0.987]
              EEG fp1 delta       0.216 +/- 0.065   [0.081 ; 0.964]
              EEG fp2 delta       0.218 +/- 0.067   [0.071 ; 0.958]
              EEG t3 delta        0.202 +/- 0.074   [0.064 ; 0.906]
              EEG t4 delta        0.232 +/- 0.089   [0.062 ; 0.956]
              EEG c3 delta        0.243 +/- 0.072   [0.091 ; 0.961]
              EEG c4 delta        0.244 +/- 0.070   [0.089 ; 0.968]
              EEG o1 delta        0.212 +/- 0.077   [0.066 ; 0.958]
              EEG o2 delta        0.211 +/- 0.083   [0.046 ; 0.933]
              EEG fp1 theta       0.188 +/- 0.072   [0.068 ; 0.976]
              EEG fp2 theta       0.216 +/- 0.075   [0.090 ; 0.972]
              EEG t3 theta        0.222 +/- 0.065   [0.077 ; 0.970]
              EEG t4 theta        0.264 +/- 0.079   [0.082 ; 0.938]
              EEG c3 theta        0.308 +/- 0.061   [0.101 ; 0.962]
              EEG c4 theta        0.299 +/- 0.060   [0.098 ; 0.960]
              EEG o1 theta        0.219 +/- 0.067   [0.080 ; 0.922]
              EEG o2 theta        0.271 +/- 0.079   [0.080 ; 0.931]
              EEG fp1 alpha       0.112 +/- 0.077   [0.043 ; 0.981]
              EEG fp2 alpha       0.124 +/- 0.081   [0.046 ; 0.956]
              EEG t3 alpha        0.158 +/- 0.080   [0.055 ; 0.946]
              EEG t4 alpha        0.181 +/- 0.082   [0.055 ; 0.928]
              EEG c3 alpha        0.249 +/- 0.070   [0.088 ; 0.943]
              EEG c4 alpha        0.246 +/- 0.069   [0.085 ; 0.957]
              EEG o1 alpha        0.116 +/- 0.066   [0.039 ; 0.910]
              EEG o2 alpha        0.151 +/- 0.066   [0.048 ; 0.935]
              EEG fp1 beta1       0.114 +/- 0.079   [0.043 ; 0.985]
              EEG fp2 beta1       0.123 +/- 0.083   [0.046 ; 0.943]
              EEG t3 beta1        0.152 +/- 0.084   [0.045 ; 0.957]
              EEG t4 beta1        0.168 +/- 0.087   [0.053 ; 0.930]
              EEG c3 beta1        0.234 +/- 0.077   [0.092 ; 0.942]
              EEG c4 beta1        0.226 +/- 0.074   [0.079 ; 0.949]
              EEG o1 beta1        0.091 +/- 0.070   [0.028 ; 0.916]
              EEG o2 beta1        0.129 +/- 0.070   [0.041 ; 0.970]
              EEG fp1 beta2       0.217 +/- 0.081   [0.086 ; 0.990]
              EEG fp2 beta2       0.211 +/- 0.076   [0.083 ; 0.958]
              EEG t3 beta2        0.189 +/- 0.070   [0.063 ; 0.927]
              EEG t4 beta2        0.226 +/- 0.083   [0.065 ; 0.922]
              EEG c3 beta2        0.248 +/- 0.066   [0.092 ; 0.960]
              EEG c4 beta2        0.246 +/- 0.065   [0.090 ; 0.966]
              EEG o1 beta2        0.230 +/- 0.085   [0.076 ; 0.958]
              EEG o2 beta2        0.220 +/- 0.080   [0.055 ; 0.932]
              EEG fp1 gama        0.154 +/- 0.073   [0.058 ; 0.976]
              EEG fp2 gama        0.172 +/- 0.076   [0.075 ; 0.956]
              EEG t3 gama         0.196 +/- 0.069   [0.067 ; 0.958]
              EEG t4 gama         0.227 +/- 0.078   [0.071 ; 0.897]
              EEG c3 gama         0.289 +/- 0.063   [0.097 ; 0.959]
              EEG c4 gama         0.281 +/- 0.061   [0.095 ; 0.959]
              EEG o1 gama         0.168 +/- 0.065   [0.062 ; 0.915]
              EEG o2 gama         0.237 +/- 0.077   [0.072 ; 0.912]

                         IV. METHODOLOGY
A. Used tool
   The tool used is RapidMiner (v4.0) [1]. RapidMiner allows the user to
carry out all phases of the data mining process in one tool, so it is
enough to become familiar with a single environment. All operators used
in this work are accessible from the basic version of RapidMiner.

B. Configuration
   The project is built by combining many operators in RapidMiner. The
complete tree view of the operators used to get the best k in Nearest
Neighbor classification is shown in Fig. 5.
   •  All operators have the local random seed set to -1. Only the Root
      operator has the value 2001, so that the random operations generate
      the same values on every run. If an operator has a sampling type,
      it is set to stratified sampling.
   •  The SplitChain operator has the split ratio set to 0.2.
   •  The XValidation operator has the number of validations set to 10
      and the measure set to Euclidean Distance.
   •  The NearestNeighbor trying k operator has k set to 15, but this
      parameter is overridden by the Iterating k - training operator.
   •  The ClassificationPerformance (1) operator has accuracy checked.
   •  The NearestNeighbor defined k operator has k set to 3 and the
      measure set to Euclidean Distance.
   •  The ClassificationPerformance (2) operator has accuracy checked.
   •  The BinominalClassificationPerformance operator has fallout
      checked.
   •  The ProcessLog operator logs the accuracy from
      ClassificationPerformance.

Fig. 5. "Box view" of the complete project in RapidMiner

C. Experiments setup
   The Nearest Neighbor classification uses the Euclidean distance to
compute the kNN. In plain words it can be described as "find the closest
point x_i to x_j". The Euclidean distance between x_i and x_j
(j = 1, 2, ..., n) is defined as:

   d(x_i, x_j) = sqrt( (x_i1 − x_j1)^2 + (x_i2 − x_j2)^2 + ... + (x_in − x_jn)^2 )

   The NN algorithm can be built as follows [6]:
   •  Training phase: build the set of training examples T.
   •  Testing phase:
      – a query instance x_q to be classified is given;
      – let x_1 ... x_k denote the k instances from T that are nearest
        to x_q;
      – the predicted class is

           F(x_q) = argmax_v Σ_{i=1}^{k} δ(v, f(x_i)),

        where f(x_i) is the class of x_i and δ(a, b) = 1 if a = b and 0
        otherwise.

   The best k for the Nearest Neighbor classifier is found by iterating
k from 1 to 15; the upper value of 15 was chosen as sufficient. In each
iteration the "ClassificationPerformance" operator computes the accuracy
for the given k. The ProcessLog operator records the results of
"ClassificationPerformance" and generates the report (Fig. 4). From the
report it is evident that the best k is 3.

                              TABLE II
    NN CLASSIFICATION FOR k = 3; accuracy = 83.69%, FPr = 8.99%,
                           TPr = 61.60%

                        True 0   True 1
               Pred 0     1609      225
               Pred 1      159      361

   The positive class is the class with value 1. For k = 3 the accuracy
was 83.69%, as shown in Tab. II. The False Positive rate (FPr) of this
classifier is 8.99%:

   FPr = 159 / (159 + 1609) = 0.0899

The True Positive rate (TPr) is 61.60%: there are 586 examples with
class = 1 and just 361 of them were classified correctly:

   TPr = 361 / (361 + 225) = 0.616

                            V. DISCUSSION
   The False Positive rate seems to be very good. It may seem surprisingly
low, but that is probably due to the size of the data subset used.
   A point for discussion is whether the ratio of 0.2 used to divide the
data subset into training and testing parts was set correctly. With a
faster computer the opposite ratio of 0.8 could be used. In my opinion
584 training examples were enough, and the FPr suggests the ratio was not
chosen badly. On the other hand, the TPr is 61.60%, which is not much and
could easily be pushed higher or lower while also influencing the FPr.
   Another question is whether the algorithm should use weighted kNN (one
common variant is sketched below). I tried to find dependencies between
the 55 attributes but was not successful, so I do not think that setting
weights on the attributes would be helpful.
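
One common form of weighted kNN mentioned above is distance-weighted
voting, where each of the k neighbors contributes a vote proportional to
1/d² instead of a unit vote. A minimal Python sketch is given below; it
is not part of the submitted RapidMiner process, and whether it would
improve the TPr on this dataset was not tested.

    import numpy as np
    from collections import defaultdict

    def weighted_knn_predict(X_train, y_train, x_query, k=3):
        """Distance-weighted kNN vote: each of the k nearest neighbors
        contributes 1/d^2 to its class instead of a single unit vote."""
        dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        votes = defaultdict(float)
        for i in nearest:
            votes[y_train[i]] += 1.0 / (dists[i] ** 2 + 1e-12)  # guard against zero distance
        return max(votes, key=votes.get)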
                           VI. CONCLUSION
   In my opinion I found a very good classifier for the subset of the
dataset. Some improvements could still be made so that the algorithm
works better. They would be useful if such a classifier were used in
practice, but for school work they are not so important.
   The hardest part of the work was exploring the operators in
RapidMiner and finding the right ones I needed. I know some of them
could still be replaced by better operators, but this solution worked
and, what is more, it gave good results. Most of my time was spent
waiting for RapidMiner to process all the operators with the given
dataset. Unfortunately the program is written in Java, which is not a
language for scientific computing, and I had to restart Java quite often
because it ran out of memory. The most interesting part for me was
generating the graphs and writing this report.
   I am very satisfied that I finished the work, and I can say that I
learned a lot about data mining and about classifying a dataset. I am
afraid one can feel from this work that my future specialization will be
Software Engineering and that such scientific work is not my cup of tea.

                              REFERENCES
[1] CENTRAL QUEENSLAND UNIVERSITY. RapidMiner GUI manual [online]. May 29,
    2007 [cit. 2008-02-08]. Available from WWW:
    <http://os.cqu.edu.au/oswins/datamining/rapidminer/rapidminer-4.0beta-guimanual.pdf>.
[2] FARKASOVA, Blanka, KRCAL, Martin. Project Bibliographic citations
    [online]. c2004-2008 [cit. 2008-05-08]. CZ. Available from WWW:
    <http://www.citace.com/>.
[3] LAURIKKALA, Jorma. Improving Identification of Difficult Small Classes
    by Balancing Class Distribution. [s.l.], 2001. 14 p. Department of
    Computer and Information Sciences, University of Tampere. Report.
    Available from WWW: <http://www.cs.uta.fi/reports/pdf/A-2001-2.pdf>.
    ISBN 951-44-5093-0.
[4] TEKNOMO, Kardi. K-Nearest Neighbors Tutorial [online]. c2006 [cit.
    2008-05-08]. Available from WWW:
    <http://people.revoledu.com/kardi/tutorial/KNN/>.
[5] POBLANO, Adrian and GUTIERREZ, Roberto. Correlation between the
    neonatal EEG and the neurological examination in the first year of
    life in infants with bacterial meningitis. Arq. Neuro-Psiquiatr.
    [online]. 2007, vol. 65, no. 3a [cited 2008-05-10], pp. 576-580.
    Available from:
    <http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0004-282X2007000400005&lng=en&nrm=iso>.
    ISSN 0004-282X. doi: 10.1590/S0004-282X2007000400005
[6] SOLOMATINE, D.P. Instance-based learning and k-Nearest neighbor
    algorithm [online]. c1988-2003 [cit. 2008-05-10]. EN. Available from
    WWW:
    <http://www.xs4all.nl/~dpsol/data-machine/nmtutorial/instancebasedlearningandknearestneighboralgorithm.htm>.
[7] VAYATIS, Nicolas, CLEMENCON, Stéphan. Advanced Machine Learning Course
    [online]. [2008] [cit. 2008-05-08]. EN. Available from WWW:
    <http://www.cmla.ens-cachan.fr/Membres/vayatis/teaching/cours-de-machine-learning-ecp.html>.
[8] ZHU, Xiaojin. K-nearest-neighbor: an introduction to machine learning.
    CS 540: Introduction to Artificial Intelligence [online]. 2005 [cit.
    2008-05-08]. Available from WWW:
    <http://pages.cs.wisc.edu/~jerryzhu/cs540/knn.pdf>.
[9] van den BOSCH, Antal. Video: K-nearest neighbor classification
    [online]. Tilburg University, c2007 [cit. 2008-05-10]. EN. Available
    from WWW: <http://videolectures.net/aaai07_bosch_knnc/>.
