Knowledge extraction and
visualisation using rule-based
       machine learning
                 Dr. Jaume Bacardit
Interdisciplinary Computing and Complex Systems
               (ICOS) research group
              University of Nottingham
        jaume.bacardit@nottingham.ac.uk

           ICOS seminar. 11/10/2012
Preface
• I came to Nottingham in 2005 to work as a postdoc on a project applying
  evolutionary rule learning to protein structure prediction (EPSRC
  GR/T07534/01). In the project we managed to:
    – Generate predictors that are competitive with the state of the art
    – Extract human-readable explanations providing new
      knowledge
    – Propose several improvements to the learning algorithms so they
      could scale to big problems
• When I became a lecturer in 2008 I started several collaborations with
  experimentalists analysing biological data of all kinds, always with the goal
  of extracting knowledge
    – Thanks to having sets of rules, it is relatively straightforward to
      develop a generic methodology to extract knowledge from them, that
      can be applied almost straight away to a variety of datasets
    – Still, we are only at the tip of the iceberg: there are many ways in
      which this analysis can be made more efficient/reliable/useful
RULE LEARNING
A set of rules as a knowledge
               representation


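    [Figure: unit square with axes X and Y, each running from 0 to 1, divided
    into regions by the rules below; each rule assigns a class marker to its region]
                          If (X<0.25 and Y>0.75) or
                             (X>0.75 and Y<0.25) then [class marker]
                          If (X>0.75 and Y>0.75) then [class marker]
                          If (X<0.25 and Y<0.25) then [class marker]
                          Everything else: [class marker]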
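The rule set above can be read as an ordered decision list: rules are tried top to bottom and the first match decides the class. A minimal Python sketch follows. The labels "A", "B", "C" and "default" are placeholders chosen for illustration, since the slide's class markers are graphical.

```python
# Placeholder labels ("A", "B", "C", "default") are assumptions for illustration;
# the slide uses graphical class markers instead of names.
def classify(x: float, y: float) -> str:
    """Evaluate the rules in order; the first rule that matches decides."""
    if (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25):
        return "A"
    if x > 0.75 and y > 0.75:
        return "B"
    if x < 0.25 and y < 0.25:
        return "C"
    return "default"  # the "Everything else" rule


print(classify(0.1, 0.9))
print(classify(0.5, 0.5))
```

Because the last rule catches everything, every point in the square receives a class.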
Another example




Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
The BioHEL rule learning system
• BioHEL [Bacardit et al., 09] is an evolutionary
  learning system that applies the Iterative Rule
  Learning (IRL) approach
• Designed explicitly to deal with noisy large-scale
  datasets
• IRL was first used in EC by the SIA system
  [Venturini, 93]
BioHEL’s learning paradigm
– IRL has been used for many years in the ML community
  under the name separate-and-conquer
– A standard elitist Genetic Algorithm generates each rule
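The separate-and-conquer loop can be sketched as follows. This is an illustration under stated assumptions, not BioHEL's code: `toy_learner` is a hypothetical exhaustive search over "attribute >= threshold" predicates that stands in for the Genetic Algorithm, and all names and data are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Values = Tuple[float, ...]
Example = Tuple[Values, int]                 # (attribute values, class label)
Rule = Tuple[Callable[[Values], bool], int]  # (condition, predicted class)


def separate_and_conquer(
    examples: List[Example],
    learn_one_rule: Callable[[List[Example]], Optional[Rule]],
    default_class: int,
) -> Tuple[List[Rule], int]:
    """Learn rules one at a time, removing the examples each new rule covers."""
    remaining = list(examples)
    rules: List[Rule] = []
    while remaining:
        rule = learn_one_rule(remaining)  # in BioHEL this step is a GA run
        if rule is None:
            break
        condition, _ = rule
        uncovered = [ex for ex in remaining if not condition(ex[0])]
        if len(uncovered) == len(remaining):
            break  # the rule covers nothing, so stop
        rules.append(rule)
        remaining = uncovered
    return rules, default_class


def toy_learner(examples: List[Example]) -> Optional[Rule]:
    """Hypothetical stand-in for the GA: the error-free 'attribute >= threshold'
    rule for class 1 that covers the most examples, or None if there is none."""
    best: Optional[Rule] = None
    best_cover = 0
    for a in range(len(examples[0][0])):
        for values, _ in examples:
            t = values[a]
            covered = [label for v, label in examples if v[a] >= t]
            if all(label == 1 for label in covered) and len(covered) > best_cover:
                best_cover = len(covered)
                best = ((lambda v, a=a, t=t: v[a] >= t), 1)
    return best


def predict(rules: List[Rule], default_class: int, values: Values) -> int:
    """Decision-list prediction: first matching rule wins, else the default."""
    for condition, label in rules:
        if condition(values):
            return label
    return default_class


# Invented toy data: class 1 lives in two corners of the unit square.
DATA: List[Example] = [
    ((0.9, 0.1), 1), ((0.8, 0.2), 1), ((0.1, 0.9), 1),
    ((0.2, 0.3), 0), ((0.3, 0.2), 0),
]
rules, default = separate_and_conquer(DATA, toy_learner, default_class=0)
print(len(rules), predict(rules, default, (0.85, 0.0)))
```

Each pass only sees the examples that earlier rules left uncovered, which is why the learned rules must be applied in order.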
BioHEL’s characteristics 1/2

• Objective function that tries to balance the
  generation of accurate and general rules
   – Accurate: not making many mistakes
   – General: covering as many examples as possible and covering as much
     of the search space as possible
• Attribute list rule representation
   – Automatically identifying the relevant attributes for a given rule and
     discarding all the other ones
• Ensemble mechanisms
   – Exploiting the GA’s stochasticity to construct ensembles of rule sets, all
     of them generated from the same data but with different random
     seeds. Ensembles are also used for ordinal classification
BioHEL’s characteristics 2/2
• The ILAS windowing scheme
  – Efficiency enhancement method. Training set divided into strata.
    Different GA iterations use different strata for their evaluation using a
    round-robin policy
• GPGPU-based fitness evaluation
  – Obtaining ~50x speedups on large datasets on its own and ~700x
    speedups in combination with ILAS
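The round-robin use of strata described for ILAS can be sketched like this. The function names are hypothetical, and the sketch simply shuffles and deals out example indices; it does not try to preserve class proportions within each stratum.

```python
import random
from typing import List


def make_strata(n_examples: int, n_strata: int, seed: int = 0) -> List[List[int]]:
    """Shuffle example indices and deal them into n_strata near-equal strata."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[s::n_strata] for s in range(n_strata)]


def stratum_for_iteration(strata: List[List[int]], iteration: int) -> List[int]:
    """Round-robin policy: GA iteration i is evaluated on stratum i mod n_strata."""
    return strata[iteration % len(strata)]


strata = make_strata(n_examples=10, n_strata=3)
for it in range(4):
    print(it, stratum_for_iteration(strata, it))
```

Each fitness evaluation then touches only about 1/n_strata of the training set, which is where the efficiency gain comes from.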
CASE STUDIES
• Mining –omics data
• Protein contact map prediction
Functional Network Reconstruction for
          seed germination
• Microarray data obtained from seed tissue of
  Arabidopsis thaliana
• 122 samples, each represented by the expression levels
  of almost 14,000 genes
• It had been experimentally determined whether
  each of the seeds had germinated or not
• Can we learn to predict germination/dormancy
  from the microarray data?
• Bassel et al., Plant Cell 23(9):3101-3116, 2011
Generating rule sets
• BioHEL was able to predict the
  outcome of the samples with
  93.5% accuracy (10 x 10-fold cross-validation)
• Learning from a scrambled dataset
  (labels randomly assigned to
  samples) produced ~50% accuracy
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination
If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination
Everything else → Predict dormancy
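To show how such a rule set is applied to a sample, here is a small Python sketch. The gene identifiers and thresholds are copied from the rules above. The sample expression values are invented, and treating a missing gene as failing its condition is an assumption of this sketch.

```python
from typing import Dict, List

# Each rule maps gene -> threshold; a rule fires when every gene exceeds its threshold.
RULES: List[Dict[str, float]] = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76, "At1g48320": 56.80},
]


def predict(expression: Dict[str, float]) -> str:
    """Return 'germination' if any rule fires, otherwise the default 'dormancy'."""
    for rule in RULES:
        # Assumption: a gene missing from the sample fails its condition.
        if all(expression.get(gene, float("-inf")) > thr for gene, thr in rule.items()):
            return "germination"
    return "dormancy"


# Invented sample values, for illustration only.
sample = {"At3g03050": 40.0, "At2g20630": 100.0, "At3g02885": 12.0}
print(predict(sample))
```

All rules here predict the same class, so their order does not change the outcome; only the default rule predicts dormancy.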
Identifying regulators
• The rule building process is stochastic
    – It generates different rule sets each time the system is
      run
• But if we run the system many times, we can see
  some patterns in the rule sets
    – Some genes appear much more frequently than the rest
       • Some associated with dormancy
       • Some associated with germination
• We generated 10K rule sets for each outcome
    – Rules predicted one of the two outcomes
    – The default rule captured the other
Known regulators appear with high
     frequency in the rules
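Ranking genes by how often they appear across many rule sets can be sketched with a simple counter. The data layout (a rule as a tuple of gene identifiers) and the gene names are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Sequence

# A rule is represented here only by the genes it tests (hypothetical layout).
RuleSet = Sequence[Sequence[str]]


def gene_frequencies(rule_sets: Iterable[RuleSet]) -> Counter:
    """Count, over all rule sets, the number of rules in which each gene appears."""
    counts: Counter = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            counts.update(set(rule))  # count a gene at most once per rule
    return counts


# Invented toy rule sets standing in for the thousands of real ones.
toy_rule_sets = [
    [("g1", "g2"), ("g3",)],
    [("g1", "g4")],
    [("g1", "g2", "g5")],
]
print(gene_frequencies(toy_rule_sets).most_common(2))
```

Genes at the top of this ranking are the candidate regulators.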
Generating co-prediction networks of
                  interactions
•   For each of the rules shown before to be
    true, all of the conditions in it need to be
    true at the same time
     – Each rule expresses an interaction between
       certain genes
•   From a large number of rule sets we can
    identify pairs of genes that co-occur with
    high frequency and generate functional
    networks, a methodology we call
    co-prediction
•   The network shows a different topology when
    compared to other types of network
    construction methods (e.g. by gene
    co-expression)
•   Different regions in the network contain the
    germination and dormancy genes.
•   Other visualisations providing the big picture
    exist (Urbanowicz et al., 2012)
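The core of the co-prediction idea, counting gene pairs that share a rule and keeping the frequent ones as network edges, can be sketched as below. The `min_count` cut-off and the toy data are assumptions; the slides do not say how the published networks were thresholded.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Sequence, Tuple

RuleSet = Sequence[Sequence[str]]  # a rule is the tuple of genes it tests


def co_prediction_edges(
    rule_sets: Iterable[RuleSet], min_count: int
) -> Dict[Tuple[str, str], int]:
    """Return gene pairs that share a rule at least min_count times, with counts."""
    pair_counts: Counter = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            # Sorting gives each unordered pair a single canonical key.
            for a, b in combinations(sorted(set(rule)), 2):
                pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Invented toy rule sets, for illustration only.
toy_rule_sets = [
    [("g1", "g2"), ("g3",)],
    [("g1", "g4")],
    [("g2", "g1", "g5")],
]
print(co_prediction_edges(toy_rule_sets, min_count=2))
```

The resulting dictionary is an edge list with weights and can be loaded into any graph tool for visualisation.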
Experimental validation
• We have experimentally verified this analysis
    – By ordering and planting knockouts for the highly ranked
      genes
    – We have been able to identify four new regulators of
      germination, with phenotypes different from the wild type
Same analysis. Different datasets
• We applied the same principle to three cancer
  datasets from the literature (E. Glaab et al., PLoS
  ONE (2012) 7(7):e39932)
• We checked PubMed to see if the genes linked
  together in BioHEL’s rules appeared together in
  the literature
• We used Pointwise Mutual Information (PMI) to
  quantify whether the genes appear together in the
  literature more often than expected by chance
• Compared the PMI scores of the highly ranked
  pairs of genes with random pairs
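PMI compares how often two genes are mentioned together with how often that would happen if mentions were independent: PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ). The sketch below estimates the probabilities from document counts. That estimator, the function name, and the counts are assumptions for illustration and may differ from the exact procedure in the cited paper.

```python
import math


def pmi(n_a: int, n_b: int, n_ab: int, n_total: int) -> float:
    """Pointwise mutual information (in bits) from document counts.

    n_a, n_b : documents mentioning gene A / gene B
    n_ab     : documents mentioning both
    n_total  : documents in the corpus
    """
    if n_a <= 0 or n_b <= 0 or n_ab <= 0:
        return float("-inf")  # never co-mentioned: no evidence of association
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return math.log2(p_ab / (p_a * p_b))


# Invented counts: gene A in 100 abstracts, gene B in 200, both in 50,
# out of 10,000 abstracts in total.
print(round(pmi(100, 200, 50, 10_000), 2))
```

A PMI near zero means the pair co-occurs about as often as chance predicts; large positive values indicate a literature link.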
BioHEL’s scores were much better than
                random
And to lots of other datasets!
• These datasets were generated using transcriptomics
  technology
   – Looks at RNA
• There are lots of other –omics (hundreds of them)
   –   Proteomics
   –   Lipidomics
   –   Metabolomics
   –   Next-generation sequencing
• Each –omics requires specific preprocessing, but the
  learning and knowledge extraction process is exactly
  the same
• Lots of datasets out there
Another example different from -omics
• Protein Structure Prediction aims to predict the 3D
  structure of a protein based on its primary sequence
Prediction types of PSP
• There are several kinds of prediction problems within
  the scope of PSP
   – The main one, of course, is to predict the 3D coordinates
     of all atoms of a protein (or at least the backbone) based
     on its primary sequence
   – There are many structural properties of individual residues
     within a protein that can be predicted, for instance:
      • The secondary structure state of the residue
      • If a residue is buried in the core of the protein or exposed in the
        surface
   – Accurate predictions of these sub-problems can simplify
     the general 3D PSP problem
Contact Map prediction
•   Prediction, for each pair of residues in a
    protein, whether these residues are in
    contact (have a small distance between
    them in the 3D structure) or not
•   This problem can be represented by a
    binary matrix. 1= contact, 0 = non
    contact. Plotting this matrix reveals the
    main traits in the protein structure
•   Very sparse problem: fewer than 2% of residue
    pairs are in contact in native structures
•   Training sets easily reach millions of
    residue pairs
•   Our method was one of the top
    predictors in the last two editions of the
    CASP competition (actually, the best
    sequence-based predictor in last CASP)

    [Figure: example contact map with the patterns produced by helices and sheets labelled]
    (Bacardit et al., Bioinformatics (2012) 28 (19): 2441-2448)
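For readers new to contact maps, the following sketch derives the binary matrix from 3D coordinates. The 8 Å cut-off and the use of one representative atom per residue are assumptions of this sketch, not details from the slides, and the coordinates are invented.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_map(coords: Sequence[Point], threshold: float = 8.0) -> List[List[int]]:
    """Binary matrix: 1 if two residues are closer than `threshold`, else 0.

    `coords` holds one representative atom per residue; the 8.0 default
    is an assumed cut-off in angstroms.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # the map is symmetric
    return cmap


# Invented coordinates: three residues along a line and one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
for row in contact_map(toy_coords):
    print(row)
```

Prediction works in the opposite direction: the matrix is estimated from sequence-derived features, without access to the coordinates.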
Steps for CM prediction
1. Prediction of
     • Secondary structure (using PSIPRED)
     • Using BioHEL [Bacardit et al., 09]:
         – Solvent Accessibility
         – Recursive Convex Hull
         – Coordination Number
2. Integration of all these predictions plus other
   sources of information
3. Final CM prediction (using BioHEL)
Characterisation of the contact map
                        problem
• Three types of input information were used
    1. Detailed information of three different windows of
       residues centered around
         – The two target residues (2x)
         – The middle point between them
    2. Information about the connecting segment between the
       two target residues and
    3. Global protein information.
[Figure: protein chain with the three types of input information labelled 1, 2 and 3]
Samples and ensembles
• The training set contained 32 million
  pairs of amino acids (AA) and 631 attributes
  (over 60 GB of disk space)
• 50 samples of 660K examples are
  generated from the training set with a
  ratio of 2:1 non-contacts/contacts
• BioHEL is run 25 times for each sample
• Prediction is done by a consensus of
  1250 rule sets
• Confidence of prediction is computed
  based on the distribution of votes in the
  ensemble
• The whole training process took about 25K
  CPU hours
[Diagram: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
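The numbers on this slide fit together as follows, and the consensus step can be sketched as a majority vote. Using the winning vote share as the confidence is an assumption of this sketch; the slide only states that confidence is derived from the vote distribution.

```python
from typing import List, Tuple

N_SAMPLES = 50        # samples drawn from the training set (from the slide)
RUNS_PER_SAMPLE = 25  # BioHEL runs per sample (from the slide)
ENSEMBLE_SIZE = N_SAMPLES * RUNS_PER_SAMPLE  # 50 x 25 = 1250 rule sets

SAMPLE_SIZE = 660_000                         # examples per sample (from the slide)
N_CONTACTS = SAMPLE_SIZE // 3                 # 2:1 ratio -> one third are contacts
N_NON_CONTACTS = SAMPLE_SIZE - N_CONTACTS     # the remaining two thirds


def consensus(votes: List[int]) -> Tuple[int, float]:
    """Majority vote over 0/1 predictions; confidence is the winning vote share."""
    contact_share = sum(votes) / len(votes)
    label = 1 if contact_share > 0.5 else 0
    confidence = max(contact_share, 1.0 - contact_share)
    return label, confidence


# Invented example: 1000 of the 1250 rule sets vote "contact".
print(ENSEMBLE_SIZE, N_CONTACTS, N_NON_CONTACTS)
print(consensus([1] * 1000 + [0] * 250))
```

Ranking residue pairs by this confidence makes it possible to report only the most reliable predicted contacts.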
Knowledge extraction in contact map
             prediction
• Basic analysis is exactly the same
[Figures: frequent attributes; frequent pairs of attributes]
But analysis can be much more refined
• Because the representation has a very clear structure
  and we have lots of domain knowledge
• For instance, there are several ways to aggregate the
  ranks of individual attributes based on characteristics
  of the representation/domain
[Figures: ranks aggregated by source of information; ranks aggregated by amino acid type]
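One way to picture this aggregation is to group attribute counts by a property encoded in the attribute name. The "source:amino_acid" naming scheme, the counts, and summing as the aggregation are all assumptions invented for this sketch; the slides do not specify how the ranks were combined.

```python
from collections import defaultdict
from typing import Callable, Dict


def aggregate_counts(
    attr_counts: Dict[str, int], group_of: Callable[[str], str]
) -> Dict[str, int]:
    """Sum per-attribute rule counts within each group returned by group_of."""
    totals: Dict[str, int] = defaultdict(int)
    for attr, count in attr_counts.items():
        totals[group_of(attr)] += count
    return dict(totals)


# Hypothetical attribute names of the form "source:amino_acid" with invented counts.
toy_counts = {
    "window_left:ALA": 40,
    "window_left:GLY": 10,
    "window_middle:ALA": 25,
    "segment:GLY": 5,
}
by_source = aggregate_counts(toy_counts, lambda a: a.split(":")[0])
by_amino_acid = aggregate_counts(toy_counts, lambda a: a.split(":")[1])
print(by_source)
print(by_amino_acid)
```

The same counts give two different views depending only on the grouping function.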
CHALLENGES AND OPPORTUNITIES
The knowledge extraction can be
          much more refined
• So far we have only looked at which attributes appear in the
  rules, not yet at the shape of the predicates
• Sometimes biasing the representation helps
  generate knowledge that is more useful to the
  domain experts
   – In the experiments with the seed data BioHEL was
     constrained to generate only predicates “Att>X”
   – But we always have to be careful when introducing
     bias
Is the knowledge real?

• Data is far from perfect, lots of spurious peaks
• Probably many of the edges in the network are false
  positives
• Strategies for filtering the knowledge
   – Classic blind feature selection?
   – Contrast the knowledge with databases of curated
     information about the genes/interactions
      • Some of these are quite pricey!
      • Or we need strong text mining skills
   – A careful balance is needed: we don’t want to filter out true
     positives
   – Using expert knowledge to bias the learning process (Moore
     & White, 2006)
Modelling the ML problem
• Datasets annotated as “case/controls” are easy
• What happens with N>2 labels?
  – Tricky for decision lists, as there is an implicit overlap
    between rules
• What happens with continuous annotations?
  – There are similar examples in the literature using
    model trees (Nepomuceno-Chamorro et al., 2010)
• What happens when the annotation is a time
  course?
  – Ordinal classification problem
References
• BioHEL
   – Improving the scalability of rule-based evolutionary learning. J. Bacardit, E.K.
     Burke and N. Krasnogor. Memetic Computing journal 1(1):55-67, 2009
   – Speeding Up the Evaluation of Evolutionary Learning Systems using GPGPUs.
     M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 12th Annual
     Conference on Genetic and Evolutionary Computation (GECCO2010),
     pages 1039-1046, ACM Press, 2010
   – Modelling the Initialisation Stage of the ALKR Representation for Discrete
     Domains and GABIL Encoding. M. Franco, N. Krasnogor and J. Bacardit. In
     Proceedings of the 13th Annual Conference on Genetic and Evolutionary
     Computation - GECCO2011, pages 1291-1298. ACM, 2011
   – Post-processing Operators for Decision Lists. M. Franco, N. Krasnogor and J.
     Bacardit. In Proceedings of the 14th Annual Conference on Genetic and
     Evolutionary Computation - GECCO2012, pages 847-854. ACM, 2012
   – Analysing BioHEL using challenging boolean functions. M. Franco, N.
     Krasnogor and J. Bacardit. Evolutionary Intelligence, 5(2):87-102, June 2012
References
• Knowledge extraction and visualisation
    – Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Stout, M.,
      Bacardit, J., Hirst, J.D. and Krasnogor, N. Bioinformatics, 24(7):916-923, 2008
    – Automated Alphabet Reduction for Protein Datasets. J. Bacardit, M. Stout, J.D. Hirst, A.
      Valencia, R.E. Smith and N. Krasnogor. BMC Bioinformatics 10:6, 2009
    – Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on
      Large-Scale Data Sets. George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J.
      Holdsworth and Jaume Bacardit. The Plant Cell, 23(9):3101-3116, 2011
    – E. Glaab, J. Bacardit, J.M. Garibaldi and N. Krasnogor. Using Rule-Based Machine
      Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer
      Gene Expression Data. PLoS ONE 7(7):e39932. 2012. doi:10.1371/journal.pone.0039932
    – J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and N.
      Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the
      fusion of multiple predicted structural features. Bioinformatics 28(19):2441-2448,
      2012. doi:10.1093/bioinformatics/bts472
    – HP Fainberg, K. Bodley, J. Bacardit, D. Li, F. Wessely, NP. Mongan, ME. Symonds, L. Clarke
      and A. Mostyn, Reduced neonatal mortality in Meishan piglets: a role for hepatic fatty
      acids? PLoS ONE, in press, 2012
References
• Related work
  – Nepomuceno-Chamorro, I.A., Aguilar-Ruiz, J.S., and
    Riquelme, J.C. (2010). Inferring gene regression networks
    with model trees. BMC Bioinformatics 11: 517
  – Moore, J. and White, B. (2006). Exploiting expert knowledge in
    genetic programming for genome-wide genetic analysis.
    Parallel Problem Solving from Nature - PPSN IX, pp. 969-977
  – Urbanowicz, R.J., Granizo-MacKenzie, A., and Moore, J.H. (2012).
    Instance-linked attribute tracking and feedback for
    Michigan-style supervised learning classifier systems. In
    GECCO ’12: Proceedings of the 14th Annual Conference on
    Genetic and Evolutionary Computation, pages 927–934.
    ACM Press
Acknowledgements
•   Natalio Krasnogor
•   Michael Holdsworth
•   George Bassel
•   Enrico Glaab
•   Pawel Widera
•   Maria Franco
•   Anna Swan
•   Hernan Fainberg

• EPSRC GR/T07534/01 & EP/H016597/1
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 

Knowledge extraction and visualisation using rule-based machine learning

  • 1. Knowledge extraction and visualisation using rule-based machine learning Dr. Jaume Bacardit Interdisciplinary Computing and Complex Systems (ICOS) research group University of Nottingham jaume.bacardit@nottingham.ac.uk ICOS seminar. 11/10/2012
  • 2. Preface • I came to Nottingham in 2005 to work as a postdoc in a project applying evolutionary rule learning to protein structure prediction (EPSRC GR/T07534/01). In the project we managed to: – Generate predictors that are competitive with the state of the art – Extract human-readable explanations providing new knowledge – Propose several improvements to the learning algorithms so they could scale to big problems • When I became a lecturer in 2008 I started several collaborations with experimentalists analysing biological data of all kinds, always with the goal of extracting knowledge – Thanks to having sets of rules, it is relatively straightforward to develop a generic methodology to extract knowledge from them that can be applied almost straight away to a variety of datasets – Still, we have only seen the tip of the iceberg; there are many ways in which this analysis can be made more efficient/reliable/useful
  • 4. A set of rules as a knowledge representation 1 If (X<0.25 and Y>0.75) or (X>0.75 and Y<0.25) then  If (X>0.75 and Y>0.75) then  Y If (X<0.25 and Y<0.25) then  Everything else  0 1 X
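
A rule set of this kind is evaluated as an ordered list: the first rule whose condition holds decides the class, and the last rule is the default. The sketch below encodes the four rules from the slide; the class names "A" to "D" are placeholders (assumptions), since the slide identifies the classes only by plot markers.

```python
def classify(x, y):
    """Evaluate the rules in order; the first rule that fires decides the class."""
    if (x < 0.25 and y > 0.75) or (x > 0.75 and y < 0.25):
        return "A"  # placeholder class name (assumption)
    if x > 0.75 and y > 0.75:
        return "B"  # placeholder class name (assumption)
    if x < 0.25 and y < 0.25:
        return "C"  # placeholder class name (assumption)
    return "D"      # default rule: everything else


print(classify(0.1, 0.9), classify(0.9, 0.9), classify(0.5, 0.5))
```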
  • 5. Another example Witten and Frank, 2005 (http://www.cs.waikato.ac.nz/~eibe/Slides2edRev2.zip)
  • 6. The BioHEL rule learning system • BioHEL [Bacardit et al., 09] is an evolutionary learning system that applies the Iterative Rule Learning (IRL) approach • Designed explicitly to deal with noisy large-scale datasets • IRL was first used in EC by the SIA system [Venturini, 93]
  • 7. BioHEL’s learning paradigm – IRL has been used for many years in the ML community, with the name of separate-and-conquer – A standard elitist Genetic Algorithm generates each rule
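
The separate-and-conquer loop can be sketched as follows. This is not BioHEL's code: the genetic algorithm is replaced by a hypothetical exhaustive search over single "attribute > threshold" rules (`learn_one_rule`), and the score used to trade accuracy against coverage is an assumption chosen for brevity. Only the outer loop (learn a rule, remove the examples it covers, repeat) reflects the IRL idea described above.

```python
def learn_one_rule(examples):
    """Stand-in for the GA: best single 'attribute > threshold' rule for class 1."""
    best, best_score = None, 0.0
    n_attrs = len(examples[0][0])
    for a in range(n_attrs):
        for t in sorted({x[a] for x, _ in examples}):
            covered = [label for x, label in examples if x[a] > t]
            if not covered:
                continue
            correct = sum(covered)                    # labels are 0 or 1
            score = correct * correct / len(covered)  # precision times correct count
            if score > best_score:
                best, best_score = (a, t), score
    return best


def iterative_rule_learning(examples, max_rules=10):
    """Learn one rule, remove the examples it covers, repeat on the rest."""
    rules, remaining = [], list(examples)
    while len(rules) < max_rules and any(label == 1 for _, label in remaining):
        rule = learn_one_rule(remaining)
        if rule is None:
            break
        a, t = rule
        rules.append(rule)
        remaining = [(x, label) for x, label in remaining if not x[a] > t]
    return rules


def predict(rules, x):
    """Class 1 if any learnt rule fires; class 0 is the default rule."""
    return 1 if any(x[a] > t for a, t in rules) else 0
```

Each learnt rule covers at least one remaining positive example, so the loop always terminates.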
  • 8. BioHEL’s characteristics 1/2 • Objective function that tries to balance the generation of accurate and general rules – Accurate: not making many mistakes – General: covering as many examples as possible and covering as much of the search space as possible • Attribute list rule representation – Automatically identifying the relevant attributes for a given rule and discarding all the other ones • Ensemble mechanisms – Exploiting the GA’s stochasticity to construct ensembles of rule sets, all of them generated from the same data, but with different random seeds, also ensembles for ordinal classification
  • 9. BioHEL’s characteristics 2/2 • The ILAS windowing scheme – Efficiency enhancement method. Training set divided into strata. Different GA iterations use different strata for their evaluation using a round-robin policy • GPGPU-based fitness evaluation – Obtaining ~50x speedups on large datasets on its own and ~700x speedups in combination with ILAS
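
The round-robin use of strata can be illustrated with a few lines. This is a minimal sketch: the function names are hypothetical, and the interleaved split ignores class proportions, which the actual ILAS scheme may well handle differently.

```python
def make_strata(n_examples, n_strata):
    """Split example indices 0..n_examples-1 into n_strata disjoint strata."""
    return [list(range(s, n_examples, n_strata)) for s in range(n_strata)]


def stratum_for_iteration(strata, iteration):
    """Round-robin policy: GA iteration i is evaluated on stratum i mod n_strata."""
    return strata[iteration % len(strata)]


strata = make_strata(10, 3)
for it in range(4):
    print(it, stratum_for_iteration(strata, it))
```

With this policy each fitness evaluation touches only about 1/n_strata of the training set, which is where the efficiency gain comes from.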
  • 10. CASE STUDIES: Mining –omics data; Protein contact map prediction
  • 11. Functional Network Reconstruction for seed germination • Microarray data obtained from seed tissue of Arabidopsis thaliana • 122 samples represented by the expression level of almost 14000 genes • It had been experimentally determined whether each of the seeds had germinated or not • Can we learn to predict germination/dormancy from the microarray data? • Bassel et al., Plant Cell 23(9):3101-3116, 2011
  • 12. Generating rule sets • BioHEL was able to predict the outcome of the samples with 93.5% accuracy (10 x 10-fold cross-validation) • Learning from a scrambled dataset (labels randomly assigned to samples) produced ~50% accuracy If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination Everything else → Predict dormancy
  • 13. Identifying regulators • Rule building process is stochastic • Generates different rule sets each time the system is run • But if we run the system many times, we can see some patterns in the rule sets • Some genes appear much more frequently than the rest – Some associated with dormancy – Some associated with germination • We generated 10K rule sets for each outcome – Rules predicted one of the two outcomes – Default rule captured the other
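
Counting how often each gene appears across many rule sets is a simple tally. The sketch below assumes a hypothetical in-memory format in which a rule set is a list of rules and each rule is a list of attribute names; the gene names in the example are made up.

```python
from collections import Counter


def attribute_frequencies(rule_sets):
    """Count, over all rule sets, the number of rules in which each attribute appears."""
    counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            counts.update(set(rule))  # count an attribute once per rule
    return counts


# Made-up rule sets from two hypothetical runs with different random seeds
runs = [
    [["geneA", "geneB"], ["geneC"]],
    [["geneA", "geneC"], ["geneA", "geneD"]],
]
print(attribute_frequencies(runs).most_common(2))
```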
  • 14. Known regulators appear with high frequency in the rules
  • 15. Generating co-prediction networks of interactions • For each of the rules shown before to be true, all of the conditions in it need to be true at the same time – Each rule is expressing an interaction between certain genes • From a high number of rule sets we can identify pairs of genes that co-occur with high frequency and generate functional networks with a methodology coined as co-prediction • The network shows a different topology when compared to other types of network construction methods (e.g. by gene co-expression) • Different regions in the network contain the germination and dormancy genes • Other visualisations providing the big picture exist (Urbanowicz et al., 2012)
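
The pair-counting step behind a co-prediction network might look like the sketch below. The rule-set format (lists of attribute names) and the `min_count` cut-off are assumptions for illustration; how the published method selects which pairs become edges is not stated on the slide.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(rule_sets):
    """Count how often each pair of attributes appears together in the same rule."""
    pairs = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            for a, b in combinations(sorted(set(rule)), 2):
                pairs[(a, b)] += 1
    return pairs


def network_edges(pairs, min_count):
    """Keep pairs seen at least min_count times (hypothetical cut-off) as edges."""
    return sorted(pair for pair, n in pairs.items() if n >= min_count)


runs = [
    [["geneA", "geneB", "geneC"]],
    [["geneB", "geneA"], ["geneC", "geneD"]],
]
print(network_edges(co_occurrence_counts(runs), min_count=2))
```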
  • 16. Experimental validation • We have experimentally verified this analysis – By ordering and planting knockouts for the highly ranked genes • We have been able to identify four new regulators of germination, with phenotypes different from the wild type
  • 17. Same analysis. Different datasets • We applied the same principle to three cancer datasets from the literature (E. Glaab et al., PLoS ONE (2012) 7(7):e39932) • We checked PubMed to see if the genes linked together in BioHEL’s rules appeared together in the literature • We used Point-Wise Mutual Information (PMI) to quantify that the genes do not appear linked together in the literature by chance • Compared the PMI scores of the highly ranked pairs of genes with random pairs
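
For reference, the standard definition of point-wise mutual information for two genes a and b is PMI(a, b) = log2( p(a, b) / (p(a) p(b)) ), which is 0 when the two co-occur exactly as often as chance predicts and positive when they co-occur more often. The sketch below estimates the probabilities from document counts; the function and its arguments are hypothetical, and the exact estimator used in the cited study is not given on the slide.

```python
import math


def pmi(n_a, n_b, n_ab, n_total):
    """PMI in bits from document counts.

    n_a, n_b: documents mentioning each gene; n_ab: documents mentioning both;
    n_total: corpus size. Counts are assumed to come from a literature search.
    """
    if n_ab == 0:
        return float("-inf")  # the genes never co-occur
    return math.log2((n_ab * n_total) / (n_a * n_b))


print(pmi(100, 100, 10, 1000))  # co-occurrence at chance level
print(pmi(4, 4, 4, 16))         # always mentioned together
```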
  • 18. BioHEL’s scores were much better than random
  • 19. And to lots of other datasets! • These datasets were generated using transcriptomics technology – Looks at RNA • There are lots of other –omics (hundreds of them) – Proteomics – Lipidomics – Metabolomics – Next-generation sequencing • Each –omics requires specific preprocessing, but the learning and knowledge extraction process is exactly the same • Lots of datasets out there
  • 20. Another example different from -omics • Protein Structure Prediction aims to predict the 3D structure of a protein based on its primary sequence
  • 21. Prediction types of PSP • There are several kinds of prediction problems within the scope of PSP – The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence – There are many structural properties of individual residues within a protein that can be predicted, for instance: • The secondary structure state of the residue • If a residue is buried in the core of the protein or exposed on the surface – Accurate predictions of these sub-problems can simplify the general 3D PSP problem
  • 22. Contact Map prediction • Prediction, for each pair of residues in a protein, whether these residues are in contact (have a small distance between them in the 3D structure) or not • This problem can be represented by a binary matrix. 1 = contact, 0 = non-contact. Plotting this matrix reveals the main traits in the protein structure [Figure labels: helices, sheets] • Very sparse characteristic: Less than 2% of contacts in native structures • Training sets easily reach millions of residue pairs • Our method was one of the top predictors in the last two editions of the CASP competition (actually, the best sequence-based predictor in the last CASP) (Bacardit et al., Bioinformatics (2012) 28 (19): 2441-2448)
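
The binary matrix described above can be built from known 3D coordinates as in the sketch below. The 8.0 distance cut-off, the use of one coordinate per residue, and the choice to leave the diagonal at 0 are all assumptions for illustration; the slide only says that residues in contact have a small distance between them.

```python
import math


def contact_map(coords, threshold=8.0):
    """Return an n x n matrix: 1 if two distinct residues are closer than threshold."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


# Three made-up residue positions, one (x, y, z) point per residue
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in contact_map(coords):
    print(row)
```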
  • 23. Steps for CM prediction 1. Prediction of – Secondary structure (using PSIPRED) – Solvent Accessibility, Recursive Convex Hull and Coordination Number (using BioHEL [Bacardit et al., 09]) 2. Integration of all these predictions plus other sources of information 3. Final CM prediction (using BioHEL)
  • 24. Characterisation of the contact map problem • Three types of input information were used 1. Detailed information of three different windows of residues centered around – The two target residues (2x) – The middle point between them 2. Information about the connecting segment between the two target residues 3. Global protein information
  • 25. Samples and ensembles • Training set contained 32 million pairs of AA and 631 attributes (+60GB of disk space) • 50 samples of 660K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts • BioHEL is run 25 times for each sample • Prediction is done by a consensus of 1250 rule sets • Confidence of prediction is computed based on the votes distribution in the ensemble • Whole training process took about 25K CPU hours [Figure: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
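
The sampling and voting steps can be sketched as follows. Both functions are hypothetical: rule sets are modelled as plain callables returning 0 or 1, and confidence is taken to be the fraction of votes for the winning class, which is only one plausible reading of "based on the votes distribution". With 50 samples and 25 runs per sample the ensemble would hold 50 × 25 = 1250 rule sets.

```python
import random


def balanced_sample(contacts, non_contacts, n_contacts, ratio=2, seed=0):
    """Draw n_contacts contacts plus ratio * n_contacts non-contacts (2:1 by default)."""
    rng = random.Random(seed)
    return rng.sample(contacts, n_contacts) + rng.sample(non_contacts, ratio * n_contacts)


def consensus_predict(rule_sets, example):
    """Majority vote over all rule sets, with the winning vote share as confidence."""
    votes = [rule_set(example) for rule_set in rule_sets]
    fraction = sum(votes) / len(votes)  # share of votes for class 1 (contact)
    prediction = 1 if fraction >= 0.5 else 0
    confidence = fraction if prediction == 1 else 1 - fraction
    return prediction, confidence


# Toy ensemble of three stand-in rule sets
ensemble = [lambda e: 1, lambda e: 1, lambda e: 0]
print(consensus_predict(ensemble, example=None))
```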
  • 26. Knowledge extraction in contact map prediction • Basic analysis is exactly the same [Figure panels: Frequent attributes; Frequent pairs of attributes]
  • 27. But analysis can be much more refined • Because the representation has a very clear structure and we have lots of domain knowledge • For instance, there are several ways to aggregate the ranks of individual attributes based on characteristics from the representation/domain [Figure panels: Ranks aggregated by source of information; Ranks aggregated by amino acid type]
  • 29. The knowledge extraction can be much more refined • We just looked at what attributes appear in the rules, but not yet at the shape of the predicates • Sometimes biasing the representation helps generate knowledge that is more useful to the domain experts – In the experiments with the seed data BioHEL was constrained to generate only predicates “Att>X” – But we always have to be careful when introducing bias
  • 30. Is the knowledge real? • Data is far from perfect, lots of spurious peaks • Probably many of the edges in the network are false positives • Strategies for filtering the knowledge – Classic blind feature selection? – Contrast the knowledge with databases of curated information about the genes/interactions • Some of these are quite pricey! • Or we need strong text mining skills – Careful balance is needed, we don’t want to filter out true positives – Using expert knowledge to bias the learning process (Moore & White, 2006)
  • 31. Modelling the ML problem • Datasets annotated as “case/controls” are easy • What happens with N>2 labels? – Tricky for decision lists, as there is an implicit overlap between rules • What happens with continuous annotations? – There are similar examples in the literature using model trees (Nepomuceno-Chamorro et al., 2010) • What happens when the annotation is a time course? – Ordinal classification problem
  • 32. References • BioHEL – Improving the scalability of rule-based evolutionary learning. J. Bacardit, E.K. Burke and N. Krasnogor. Memetic Computing journal 1(1):55-67, 2009 – Speeding Up the Evaluation of Evolutionary Learning Systems using GPGPUs. M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO2010), 1039-1046, ACM Press, 2010 – Modelling the Initialisation Stage of the ALKR Representation for Discrete Domains and GABIL Encoding. M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation - GECCO2011, pages 1291-1298. ACM, 2011 – Post-processing Operators for Decision Lists. M. Franco, N. Krasnogor and J. Bacardit. In Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation - GECCO2012, pages 847-854. ACM, 2012 – Analysing BioHEL using challenging boolean functions. M. Franco, N. Krasnogor and J. Bacardit. Evolutionary Intelligence, 5(2):87-102, June 2012
  • 33. References • Knowledge extraction and visualisation – Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Bioinformatics, 24(7):916-923, 2008 – Automated Alphabet Reduction for Protein Datasets. J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. BMC Bioinformatics 10:6, 2009 – Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit. The Plant Cell, 23(9):3101-3116, 2011 – E. Glaab, J. Bacardit, J.M. Garibaldi and N. Krasnogor. Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data. PLoS ONE 7(7):e39932. 2012. doi:10.1371/journal.pone.0039932 – J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics (2012) 28 (19): 2441-2448. doi:10.1093/bioinformatics/bts472 – HP Fainberg, K. Bodley, J. Bacardit, D. Li, F. Wessely, NP. Mongan, ME. Symonds, L. Clarke and A. Mostyn, Reduced neonatal mortality in Meishan piglets: a role for hepatic fatty acids? PLoS ONE, in press, 2012
  • 34. References • Related work – Nepomuceno-Chamorro, I.A., Aguilar-Ruiz, J.S., and Riquelme, J.C. (2010). Inferring gene regression networks with model trees. BMC Bioinformatics 11: 517 – Moore, J. and White, B., Exploiting expert knowledge in genetic programming for genome-wide genetic analysis, Parallel Problem Solving from Nature-PPSN IX, pp. 969-977, 2006 – R. J. Urbanowicz, A. Granizo-MacKenzie, and J. H. Moore. Instance-linked attribute tracking and feedback for Michigan-style supervised learning classifier systems. In GECCO ’12: Proceedings of the 14th annual conference on Genetic and evolutionary computation, pages 927–934. ACM Press, 2012
  • 35. Acknowledgements • Natalio Krasnogor • Michael Holdsworth • George Bassel • Enrico Glaab • Pawel Widera • Maria Franco • Anna Swan • Hernan Fainberg • EPSRC GR/T07534/01 & EP/H016597/1