A Mixed Discrete-Continuous Attribute
  List Representation for Large Scale
         Classification Domains
           Jaume Bacardit
          Natalio Krasnogor
       {jqb,nxk}@cs.nott.ac.uk
       University of Nottingham
Outline

• Motivation and objectives
• Framework: The BioHEL GBML system
• Improving the Attribute List Knowledge
  Representation
• Experimental design
• Results and discussion
• Conclusions and further work
Motivation

• We live in times of a great “data deluge”
• Many different disciplines and industries
  generate vast amounts of data
• Large scale can mean
  – Many records, many dimensions, many classes, …
• Our work is focused on representations that
  – Can deal with large attribute spaces
  – Are efficient, as this can make a big difference
    when dealing with really large datasets
The Attribute List knowledge
          representation (ALKR)
• This representation was recently proposed [Bacardit et al.,
  09] to achieve these aims
• This representation exploits a very frequent situation
   – In high-dimensionality domains it is usual that each rule only uses a
      very small subset of the attributes
• Example of a rule for predicting a Bioinformatics dataset [Bacardit and
  Krasnogor, 2009]
   •   Att Leu-2 ∈ [-0.51,7] and Glu ∈ [0.19,8] and
       Asp+1 ∈ [-5.01,2.67] and Met+1∈ [-3.98,10] and
       Pro+2 ∈ [-7,-4.02] and Pro+3 ∈ [-7,-1.89] and
       Trp+3 ∈ [-8,13] and Glu+4 ∈ [0.70,5.52] and
       Lys+4 ∈ [-0.43,4.94] → alpha
   •   Only 9 attributes out of 300 were actually in the rule
    – Can we get rid of the 291 irrelevant attributes?
The Attribute List knowledge
            representation
• Thus, if we can get rid of the irrelevant attributes
   – The representation will be more efficient, avoiding the
     waste of cycles dealing with irrelevant data
   – Exploration will be more focused, as the chromosomes will
     only contain data that matters
• This representation automatically identifies the
  relevant attributes in the domain for each rule
• It was tested on several small datasets and a couple
  of large protein datasets, showing good performance
Objectives of this work

• We propose an efficient extension of the
  representation that can deal at the same time with
  continuous and discrete attributes
   – The original representation only dealt with continuous
     variables
• We evaluate the representation using several large-
  scale domains
   – To assess its performance, and to identify where to
     improve it
• We compare ALKR against other standard machine
  learning techniques
The BioHEL GBML System

• BIOinformatics-oriented Hierarchical Evolutionary
  Learning – BioHEL (Bacardit et al., 2007)

• BioHEL is a GBML system that employs the
  Iterative Rule Learning (IRL) paradigm
  – First used in EC in Venturini’s SIA system (Venturini, 1993)
  – Widely used for both Fuzzy and non-fuzzy evolutionary learning
• BioHEL inherits most of its components from
  GAssist [Bacardit, 04], a Pittsburgh GBML system
Iterative Rule Learning
• IRL has been used for many years in the ML
  community, under the name of separate-and-conquer
Characteristics of BioHEL
• A fitness function based on the Minimum-Description-Length (MDL)
  (Rissanen,1978) principle that tries to
   – Evolve accurate rules
   – Evolve high coverage rules
   – Evolve rules with low complexity, as general as possible
• The Attribute List Knowledge representation
   – Representation designed to handle high-dimensionality domains
• The ILAS windowing scheme
   – Efficiency enhancement method, not all training points are used for
     each fitness computation
• An explicit default rule mechanism
   – Generating more compact rule sets
• Ensembles for consensus prediction
   – Easy system to boost robustness
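The ILAS windowing idea in the list above can be sketched as follows. This is a minimal illustration with invented function names, not BioHEL's actual code; the real scheme also preserves the class distribution within each stratum.

```python
def make_strata(examples, num_strata):
    """Split the training set into roughly equal, interleaved strata."""
    return [examples[i::num_strata] for i in range(num_strata)]

def stratum_for_iteration(strata, iteration):
    """Each GA iteration computes fitness on one stratum, round-robin,
    so no single fitness computation touches the whole training set."""
    return strata[iteration % len(strata)]

examples = list(range(10))            # placeholder training set
strata = make_strata(examples, 3)     # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
window = stratum_for_iteration(strata, 0)
```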
Fitness function of BioHEL
• Coverage term penalizes rules that do not cover a minimum
  percentage of examples




• Choice of the coverage breakpoint is crucial for the proper
  performance of the system
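The formula itself is not reproduced in this text version, but the qualitative shape of the coverage term can be sketched as follows. This is a hypothetical simplification, not BioHEL's actual MDL-based fitness: rules below the breakpoint are penalized linearly, rules at or above it get full credit.

```python
def coverage_term(coverage, coverage_break):
    """Hypothetical simplification of the coverage term: linear penalty
    below the breakpoint, full credit (1.0) at or above it."""
    return min(coverage / coverage_break, 1.0)

coverage_term(0.05, 0.10)   # covers too few examples -> penalized
coverage_term(0.50, 0.10)   # above the breakpoint -> full credit
```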
Improving the Attribute List
         Knowledge Representation
• Mixed discrete-continuous representation
   – Interval representation for continuous variables [Llora et al.,
     07]
      • If Att ∈ [LB, UB]
      • 2 real-valued parameters, specifying the bounds
   – GABIL binary representation [De Jong & Spears, 91] for
     discrete variables
      • If Att takes value A or B
      • One bit for each possible value, indicating if value is included in the
        disjunction
• If Att1 ∈ [0.2,0.5] and Att2 is (A or B) → Class 1
• {0.2,0.5|1,1,0|1}
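Decoding and matching that example rule can be sketched as follows (a minimal illustration with invented names; Att2 is assumed to range over the values {A, B, C}):

```python
# The rule "If Att1 in [0.2,0.5] and Att2 is (A or B) -> Class 1",
# decoded from its serialized form {0.2,0.5|1,1,0|1}:
rule = {
    "lb": 0.2, "ub": 0.5,   # interval bounds for continuous Att1
    "bits": [1, 1, 0],      # GABIL bits for discrete Att2 over (A, B, C)
    "cls": 1,               # predicted class
}
VALUE_INDEX = {"A": 0, "B": 1, "C": 2}

def matches(rule, att1, att2):
    """True iff the example satisfies both predicates of the rule."""
    if not (rule["lb"] <= att1 <= rule["ub"]):
        return False
    return rule["bits"][VALUE_INDEX[att2]] == 1
```

For instance, `matches(rule, 0.3, "A")` holds, while `matches(rule, 0.3, "C")` fails because C's GABIL bit is 0.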
Improving the Attribute List
       Knowledge Representation
• Each rule contains:
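The diagram from this slide is not preserved in the text export; a hedged reconstruction of the rule structure, following the attribute-list idea (field names are illustrative), would be:

```python
from dataclasses import dataclass

@dataclass
class ALKRRule:
    att_indices: list   # which attributes the rule expresses (e.g. 9 of 300)
    predicates: list    # per expressed attribute: interval or GABIL bitstring
    cls: int            # predicted class

# A rule expressing 2 attributes: one continuous, one discrete
rule = ALKRRule(att_indices=[4, 17],
                predicates=[(0.2, 0.5), [1, 1, 0]],
                cls=1)
```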
Improving the Attribute List
             Knowledge Representation
The match process is a crucial element in the performance of the system.

This code is run millions of times.

Do you think that this code is efficient? Look at the If.
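The code shown on the slide is not preserved in this export, but the kind of match loop being critiqued can be sketched like this (illustrative Python, not BioHEL's actual code): the type test, "the If", sits inside the hot loop and is re-evaluated for every attribute of every example.

```python
def match_naive(rule, example, is_discrete):
    """Single loop over the rule's expressed attributes, with a type
    test ("the If") re-evaluated on every pass."""
    for i, att in enumerate(rule["atts"]):
        if is_discrete[i]:                        # <- run millions of times
            if rule["preds"][i][example[att]] != 1:
                return False
        else:
            lb, ub = rule["preds"][i]
            if not (lb <= example[att] <= ub):
                return False
    return True

rule = {"atts": [0, 1], "preds": [(0.2, 0.5), [1, 1, 0]]}
flags = [False, True]          # attribute 0 continuous, attribute 1 discrete
match_naive(rule, [0.3, 0], flags)   # 0.3 in [0.2,0.5] and value 0 allowed
```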
Improving the Attribute List Knowledge
                 Representation
• Doing supervised learning
  allows us to exploit one trick
   – When we evaluate a rule, we
     test it against each example in
     the training set
   – Thus, we can precalculate two
     lists, of discrete and continuous
     attributes
• The match process is
  performed separately for
  both kinds of attributes
• Essentially, we have unrolled
  the loop
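The unrolled version can be sketched as follows (illustrative names): because the expressed attributes are pre-split into a continuous list and a discrete list once per rule, each inner loop is type-homogeneous and carries no branch.

```python
def match_split(cont_atts, cont_intervals, disc_atts, disc_bits, example):
    """Two type-homogeneous loops; the per-attribute type test is gone
    because the attribute lists were pre-split when the rule was built."""
    for att, (lb, ub) in zip(cont_atts, cont_intervals):
        if not (lb <= example[att] <= ub):
            return False
    for att, bits in zip(disc_atts, disc_bits):
        if bits[example[att]] != 1:
            return False
    return True

# Same example rule as before, pre-split into its two lists:
match_split([0], [(0.2, 0.5)], [1], [[1, 1, 0]], [0.3, 0])
```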
Improving the Attribute List Knowledge
            Representation
• Recombination remains unchanged
  – Simulated 1-point crossover to deal with the
    variable-length lists of attributes
  – Standard GA mutation
  – Two operators (specialize and generalize) add or
    remove attributes from the list with a given
    probability, hence exploring the space of the
    relevant attributes for this rule
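The specialize and generalize operators described above can be sketched as follows (function names, signatures and the use of a single application per call are illustrative, not BioHEL's actual implementation):

```python
import random

def generalize(expressed, p, rng=random):
    """With probability p, drop one random attribute from the rule's list."""
    if expressed and rng.random() < p:
        expressed.remove(rng.choice(expressed))

def specialize(expressed, num_atts, p, rng=random):
    """With probability p, add one attribute not yet in the list."""
    unexpressed = [a for a in range(num_atts) if a not in expressed]
    if unexpressed and rng.random() < p:
        expressed.append(rng.choice(unexpressed))

atts = [0, 1]               # rule currently expresses attributes 0 and 1
specialize(atts, 5, 1.0)    # adds one of {2, 3, 4}
generalize(atts, 1.0)       # removes one attribute again
```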
Experimental design

• Seven datasets were used
  – They represent a broad range of characteristics in
    terms of instances, attributes, classes, type of
    attributes and class balance/unbalance
Experimental design
• First, ALKR was compared against BioHEL using its
  original representation (labelled orig)
• Also, three standard machine learning techniques
  were used in the comparison:
   – C4.5 [Quinlan, 93]
   – Naive Bayes [John and Langley, 95]
   – LIBSVM [Chang & Lin, 01]
• The default parameters of BioHEL were used, except
  for two of them:
   – The number of strata of the ILAS windowing scheme
   – The coverage breakpoint of BioHEL’s fitness function
   – These two parameters were strongly problem-dependent
The traditional big table of results
And one more (much larger) dataset
    • Protein Structure Prediction dataset (Solvent
      Accessibility - SA) with 270 attributes, 550K
      instances and 2 classes
     Method         Accuracy   Size solution   #exp atts   Run-time (h)
     BioHEL-orig    79.0±0.3   236.23±5.7      14.9±3.7    20.7±1.4
     BioHEL-ALKR    79.2±0.3   243.23±5.2      8.4±2.7     14.8±1.0
     BioHEL-naive   79.2±0.3   242.62±4.5      8.4±2.7     19.4±1.0
     C4.5           ---
     Naïve Bayes    74.1±0.4
     SVM            79.9±0.3                               10 days

     (C4.5, Naïve Bayes and SVM were run in a different cluster with
     more memory and faster nodes)
ALKR vs Original BioHEL
• Except for one dataset (and the difference is
  minor), ALKR always obtains better accuracy
• Datasets where ALKR is much better are those
  with a larger number of attributes
  – ALKR is better at exploring the search space
• ALKR generates more compact solutions, in
  #rules and, especially, in #attributes
• Except for the ParMX domain (with a very
  small number of attributes), ALKR is always
  faster (72 times faster in the Germ dataset!)
BioHEL vs other ML methods

• The accuracy results were analyzed overall using a Friedman
  test for multiple comparisons
• The test detected with a 97.77% confidence that there were
  significant differences in the performance of the compared
  methods
• A post-hoc Holm test indicated that ALKR was significantly
  better than Naive Bayes with 95% confidence.
• If we look at individual datasets, BioHEL is only outperformed
  largely in the wav and SA datasets by SVM
• BioHEL’s advantage in the Germ dataset is especially large
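The ranking step that underlies the Friedman test used above can be sketched as follows (the accuracy values are invented for illustration, and ties, which the real test handles with averaged ranks, are not handled here):

```python
def average_ranks(scores):
    """scores[d][m] = accuracy of method m on dataset d (higher is better).
    Returns each method's average rank across datasets (1 = best)."""
    num_methods = len(scores[0])
    totals = [0.0] * num_methods
    for row in scores:
        order = sorted(range(num_methods), key=lambda m: -row[m])
        for rank, m in enumerate(order, start=1):
            totals[m] += rank
    return [t / len(scores) for t in totals]

# Invented accuracies for three methods over three datasets:
scores = [[0.79, 0.74, 0.80],
          [0.85, 0.80, 0.83],
          [0.90, 0.88, 0.89]]
average_ranks(scores)   # method 1 ranks last on every dataset -> rank 3.0
```

The Friedman statistic is then computed from these average ranks, and a post-hoc test such as Holm's compares methods pairwise.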
Where can we improve BioHEL?
•   ParMX is a synthetic
    dataset for which the
    optimal solution
    consists of 257 rules.
    BioHEL generated 402
    rules
•   The rules were
    accurate but
    suboptimal
•   The coverage pressure
    introduced by the
    coverage breakpoint
    parameter was not
    appropriate for the
    whole learning process
•   BioHEL also had some problems in datasets with class unbalance (c-4)
Conclusions
• In this work we have
   – Extended the Attribute List Knowledge Representation of
     the BioHEL LCS to deal with mixed discrete-continuous
     domains in an efficient way
   – Assessed the performance of BioHEL using a broad range
     of large-scale scenarios
   – Compared BioHEL’s performance against other
     representations/learning techniques
• The experiments have shown that BioHEL+ALKR is
  efficient, it generates compact and accurate
  solutions and it is competitive against other machine
  learning methods
• We also identified several directions of improvement
Future work
• Identify the causes and address the issues that were
  observed in these experiments about BioHEL’s
  performance
• Compare and combine ALKR against similar recent
  LCS work [Butz et al., 08]
• Is it possible to create a parameter-less BioHEL?
• The development of theoretical models that can
  explain the behavior of both BioHEL and ALKR would
   – Make all of the above easier
   – Be an important milestone in the principled application of
     LCS to large-scale domains
Questions?

 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 

The Attribute List knowledge representation
• Thus, if we can get rid of the irrelevant attributes
  – The representation will be more efficient, avoiding the waste of cycles dealing with irrelevant data
  – Exploration will be more focused, as the chromosomes will only contain data that matters
• This representation automatically identifies the relevant attributes in the domain for each rule
• It was tested on several small datasets and a couple of large protein datasets, showing good performance
Objectives of this work
• We propose an efficient extension of the representation that can deal with continuous and discrete attributes at the same time
  – The original representation only dealt with continuous variables
• We evaluate the representation on several large-scale domains
  – To assess its performance, and to identify where to improve it
• We compare ALKR against other standard machine learning techniques
The BioHEL GBML System
• BIOinformatics-oriented Hierarchical Evolutionary Learning – BioHEL (Bacardit et al., 2007)
• BioHEL is a GBML system that employs the Iterative Rule Learning (IRL) paradigm
  – First used in EC in Venturini’s SIA system (Venturini, 1993)
  – Widely used for both fuzzy and non-fuzzy evolutionary learning
• BioHEL inherits most of its components from GAssist [Bacardit, 04], a Pittsburgh GBML system
Iterative Rule Learning
• IRL has been used for many years in the ML community under the name separate-and-conquer
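The separate-and-conquer loop can be sketched as follows; `learn_rule` stands in for BioHEL's GA-based rule induction, and all names here are illustrative rather than BioHEL's actual API:

```python
def separate_and_conquer(train, learn_rule):
    """Iterative Rule Learning: learn one rule at a time, remove the
    examples it covers, and repeat until the training set is exhausted.
    `learn_rule` is a stand-in for the evolutionary rule induction step."""
    rules = []
    remaining = list(train)
    while remaining:
        rule = learn_rule(remaining)      # e.g. evolve one rule with a GA
        covered = [x for x in remaining if rule.matches(x)]
        if not covered:                   # safeguard: no progress was made
            break
        rules.append(rule)
        remaining = [x for x in remaining if not rule.matches(x)]
    return rules
```

In BioHEL a default rule handles whatever the evolved rules leave uncovered; this sketch simply stops when the training set is empty.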
Characteristics of BioHEL
• A fitness function based on the Minimum Description Length (MDL) principle (Rissanen, 1978) that tries to
  – Evolve accurate rules
  – Evolve high-coverage rules
  – Evolve rules with low complexity, as general as possible
• The Attribute List Knowledge Representation
  – A representation designed to handle high-dimensionality domains
• The ILAS windowing scheme
  – An efficiency-enhancement method: not all training points are used for each fitness computation
• An explicit default rule mechanism
  – Generates more compact rule sets
• Ensembles for consensus prediction
  – An easy way to boost robustness
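As a rough illustration of the ILAS windowing idea, one way to build class-stratified windows is sketched below; the exact stratification procedure is an assumption of this sketch, not BioHEL's code. Each GA iteration would then compute fitness on window `it % num_strata` only:

```python
import random
from collections import defaultdict

def ilas_strata(examples, labels, num_strata, rng=random):
    """Partition the training set into num_strata windows, dealing the
    shuffled members of each class round-robin so every window keeps
    roughly the original class proportions."""
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    strata = [[] for _ in range(num_strata)]
    for members in by_class.values():
        rng.shuffle(members)
        for i, x in enumerate(members):
            strata[i % num_strata].append(x)
    return strata
```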
Fitness function of BioHEL
• The coverage term penalizes rules that do not cover a minimum percentage of examples
• The choice of the coverage break is crucial for the proper performance of the system
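A minimal sketch of what a coverage term with a breakpoint might look like; the linear ramp and the breakpoint value are assumptions for illustration, and BioHEL's actual fitness combines such a term with MDL-based accuracy and complexity terms:

```python
def coverage_term(covered, total, coverage_break=0.1):
    """Illustrative coverage term: below the breakpoint the reward grows
    linearly with coverage (penalizing low-coverage rules); above it,
    the coverage requirement is considered satisfied."""
    cov = covered / total
    if cov < coverage_break:
        return cov / coverage_break   # penalized region, in [0, 1)
    return 1.0                        # coverage requirement satisfied
```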
Improving the Attribute List Knowledge Representation
• Mixed discrete-continuous representation
  – Interval representation for continuous variables [Llora et al., 07]
    • If Att ∈ [LB, UB]
    • Two real-valued parameters, specifying the bounds
  – GABIL binary representation [De Jong & Spears, 91] for discrete variables
    • If Att takes value A or B
    • One bit for each possible value, indicating whether that value is included in the disjunction
• Example: If Att1 ∈ [0.2,0.5] and Att2 is (A or B) → Class 1
  – Encoded as {0.2,0.5|1,1,0|1}
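The encoded example rule {0.2,0.5|1,1,0|1} can be illustrated as follows; the attribute names, the value ordering for Att2, and the data layout are assumptions made for this sketch:

```python
# Example rule {0.2,0.5|1,1,0|1}: a continuous attribute as a [LB, UB]
# interval, a discrete attribute as GABIL bits (one per value: A, B, C).
rule = {
    "Att1": ("interval", (0.2, 0.5)),   # If Att1 in [0.2, 0.5]
    "Att2": ("gabil", (1, 1, 0)),       # and Att2 is (A or B)
}
rule_class = 1

VALUES = {"Att2": ["A", "B", "C"]}      # assumed value ordering

def matches(rule, example):
    for att, (kind, pred) in rule.items():
        v = example[att]
        if kind == "interval":
            lo, hi = pred
            if not (lo <= v <= hi):
                return False
        else:                           # GABIL: bit set => value allowed
            if not pred[VALUES[att].index(v)]:
                return False
    return True
```

For instance, `{"Att1": 0.3, "Att2": "A"}` matches, while an example with Att2 = C falls outside the disjunction and does not.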
Improving the Attribute List Knowledge Representation
• Each rule contains:
Improving the Attribute List Knowledge Representation
• The match process is a crucial element in the performance of the system
• This code is run millions of times
• Do you think that this code is efficient? Look at the if
Improving the Attribute List Knowledge Representation
• Doing supervised learning allows us to exploit one trick
  – When we evaluate a rule, we test it against each example in the training set
  – Thus, we can precalculate two lists, of discrete and of continuous attributes
• The match process is performed separately for both kinds of attributes
• Essentially, we have unrolled the loop
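The precalculated-lists trick can be sketched like this; the rule layout (a list of attribute/kind/predicate triples) is an assumption, and BioHEL's implementation differs in detail:

```python
def split_rule(rule):
    """Precompute the continuous and discrete attribute lists once per
    rule, so the match loops carry no per-attribute type test (this is
    the 'unrolled' if from the naive match code)."""
    cont = [(i, pred) for i, kind, pred in rule if kind == "interval"]
    disc = [(i, pred) for i, kind, pred in rule if kind == "gabil"]
    return cont, disc

def match(cont, disc, example):
    for i, (lo, hi) in cont:        # continuous pass: interval tests only
        if not (lo <= example[i] <= hi):
            return False
    for i, bits in disc:            # discrete pass: bit lookups only
        if not bits[example[i]]:    # example stores the value's index
            return False
    return True
```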
Improving the Attribute List Knowledge Representation
• Recombination remains unchanged
  – Simulated 1-point crossover to deal with the variable-length lists of attributes
  – Standard GA mutation
  – Two operators (specialize and generalize) add or remove attributes from the list with a given probability, hence exploring the space of the relevant attributes for each rule
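The specialize and generalize operators can be sketched as follows; function names, signatures, and the predicate factory are illustrative, not BioHEL's actual code:

```python
import random

def specialize(used, all_atts, make_predicate, p, rng=random):
    """With probability p, add one currently-unused attribute to the
    rule's attribute list with a fresh predicate, making the rule more
    specific."""
    unused = [a for a in all_atts if a not in used]
    if unused and rng.random() < p:
        att = rng.choice(unused)
        used[att] = make_predicate(att)
    return used

def generalize(used, p, rng=random):
    """With probability p, drop one attribute from the list, making the
    rule more general."""
    if used and rng.random() < p:
        del used[rng.choice(list(used))]
    return used
```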
Experimental design
• Seven datasets were used
  – They represent a broad range of characteristics in terms of instances, attributes, classes, types of attributes, and class balance/imbalance
Experimental design
• First, ALKR was compared against BioHEL using its original representation (labelled orig)
• Three standard machine learning techniques were also used in the comparison:
  – C4.5 [Quinlan, 93]
  – Naive Bayes [John and Langley, 95]
  – LIBSVM [Chang & Lin, 01]
• The default parameters of BioHEL were used, except for two:
  – The number of strata of the ILAS windowing scheme
  – The coverage breakpoint of BioHEL’s fitness function
  – These two parameters were strongly problem-dependent
And one more (much larger) dataset
• Protein Structure Prediction dataset (Solvent Accessibility - SA) with 270 attributes, 550K instances and 2 classes

  Method        Accuracy   Solution size   #exp atts   Run-time (h)
  BioHEL-orig   79.0±0.3   236.23±5.7      14.9±3.7    20.7±1.4
  BioHEL-ALKR   79.2±0.3   243.23±5.2      8.4±2.7     14.8±1.0
  BioHEL-naive  79.2±0.3   242.62±4.5      8.4±2.7     19.4±1.0
  C4.5          ---
  Naïve Bayes   74.1±0.4
  SVM           79.9±0.3                               10 days

• C4.5, Naïve Bayes and SVM were run in a different cluster, with more memory and faster nodes
ALKR vs Original BioHEL
• Except for one dataset (and the difference is minor), ALKR always obtains better accuracy
• The datasets where ALKR is much better are those with a larger number of attributes
  – ALKR is better at exploring the search space
• ALKR generates more compact solutions, in #rules and, especially, in #attributes
• Except for the ParMX domain (with a very small number of attributes), ALKR is always faster (72 times faster on the Germ dataset!)
BioHEL vs other ML methods
• The accuracy results were analyzed overall using a Friedman test for multiple comparisons
• The test detected, with 97.77% confidence, that there were significant differences in the performance of the compared methods
• A post-hoc Holm test indicated that ALKR was significantly better than Naive Bayes with 95% confidence
• Looking at individual datasets, BioHEL is only outperformed by a large margin on the wav and SA datasets, by SVM
• BioHEL’s advantage on the Germ dataset is especially large
Where can we improve BioHEL?
• ParMX is a synthetic dataset whose optimal solution consists of 257 rules; BioHEL generated 402 rules
  – The rules were accurate but suboptimal
  – The coverage pressure introduced by the coverage breakpoint parameter was not appropriate for the whole learning process
• BioHEL also had some problems on datasets with class imbalance (c-4)
Conclusions
• In this work we have
  – Extended the Attribute List Knowledge Representation of the BioHEL LCS to deal with mixed discrete-continuous domains in an efficient way
  – Assessed the performance of BioHEL on a broad range of large-scale scenarios
  – Compared BioHEL’s performance against other representations/learning techniques
• The experiments have shown that BioHEL+ALKR is efficient, generates compact and accurate solutions, and is competitive with other machine learning methods
• We also identified several directions for improvement
Future work
• Identify the causes of, and address, the issues observed in these experiments about BioHEL’s performance
• Compare and combine ALKR with similar recent LCS work [Butz et al., 08]
• Is it possible to create a parameter-less BioHEL?
• Developing theoretical models that can explain the behavior of both BioHEL and ALKR would
  – Make all of the above easier
  – Be an important milestone in the principled application of LCS to large-scale domains