SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
An Evolution algorithm approach for Feature
generation from Sequence data and its
application to DNA splice data prediction
Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu,
Jack Comton, Rezarta Islamaj-Dogan.
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics,
Vol. 9, No. 5, September/October 2012.
Presenter: Nguyen Dinh Chien (阮庭戰)
1
An EA approach for FG from Sequence data and
its application to DNA splice data prediction
 A challenge for machine learning methods is associating functional
information with biological sequence. Their performance often depends
on deriving predictive features from sequence sought to be classified.
 Feature generation is a difficult problem. It is often the task of domain
experts or exhaustive feature enumeration techniques to generate a few
features. Their predictive power is tested in context of classification.
 Therefore, the authors proposed an evolution algorithm to effectively
explore a large feature space and generate predictive features from
sequence data.
2
Outline
 Introduction
 Methods
 Materials
 Discussion
 Conclusions
3
Introduction
 A variety of general purpose search techniques are effective for NP-
Hard problems.
 In this paper, they explored the use of evolutionary algorithms to search
a large and complex features space, with the goal is to obtain features
from sequence data that can significantly improve the classification
accuracy of a Support Vector Machine (SVM). This approach is
evaluated on the difficult problem of DNA splice site prediction.
 They used Genetic Programming (GP) techniques to evolve the kind of
structures. This approach is called FG-EA, means Feature Generation with
Evolution Algorithm.
4
Introduction
5
 Using an efficient fitness function, they identified a set of candidate features (a
hall of fame) to be used as input to a standard SVM classification procedure.
 Comparison with state-of-the-
art feature-based classification
method, they realized that FG-
EA features significantly
improve the classification
performance.
The DNA splice site prediction problem
 Transcription of a eukaryotic DNA sequence into messenger RNA (mRNA)
occurs only after enzymes splice away nocoding regions (intron) from
precursor (pre-mRNA) sequence to leave only coding regions (exons), so that,
prediction of splice sites is a fundamental component of the gene-finding
problem.
 Splice site prediction is a difficult problem. AG and GT (GU) can not be used
as features due to their abundance in non-splice site sequences.
6
Related EA work
 Many studies have demonstrated the advantages of EAs for feature generation in
different domains, such as, Fast genetic selection (F.A Brill et al. 1992), Nearest
neighbor classifier (L.I Kucheva and L.C Jain. 1999), Dimensionality reduction
(M.L Raymer et al. 2007), …
 All of above methods obtain predictive
features from sequence data have
shown success in diverse
bioinformatics problems.
 Work on predicting enzymatic activity
in proteins additionally shows the
power of EAs in feature generation.
7
Outline
 Introduction
 Methods
 Materials
 Discussion
 Conclusions
8
Methods
9
The FG-EA algorithm
10
Feature representation
 In the FG-EA, the leaves of a parse tree, also referred as a terminals, are either
characters from the DNA alphabet {A,C,G,T} or integers corresponding to
positions or motif (k-mers) length; the internal nodes are operators Length,
Position, Motif, Matches, MatchAsPosition, AND, OR, and NOT.
positional featurescompositional features
11
Generating features
 Generating random initial features from 0 consists of N=15,000
features.
 The tree representing features are generated using the well-known
ramped half-and-half generative method, which includes both Full
and Grow techniques. These techniques obtain a mixture of full-
balanced trees and bushy trees with each technique is employed
with equal probability of 0.5.
 Subsequent generations are evolved using standard GP (Genetic
Programming) selection, crossover, and mutation mechanisms.
12
Genetic Operators
 Given a set of m features
extracted from the hall of
fame to serve as parents in a
generation, the rest of
GenSize-m features are
generated using the mutation
and crossover operators.
 Employed three breeding
pipelines:
 mutation
 mutation-ERC
 crossover
13
Fitness function
 The fitness function is key to achieving an efficient and effective
EA search heuristic.
 FG-EA uses a surrogate fitness function given by:
 f – feature; the ratio C+,f/C+ is weighted by the information gain (IG)
 IG is often employed as a criterion of a feature’s goodness in machine
learning. Given m class attributes:
)(*)(
,
fIG
C
C
fFitness
f



   

m
i ii
m
i i
m
i iii fcPfcPfPfcPfcpfPcPcPfIG 11 1
))|(log().|(.)())|(log().|().())(log().()(
14
Post FG-EA feature selection
 The set of features in the hall of fame can be further narrowed
through Recursive Feature Elimination (RFE). RFE start with a
large feature set and gradually reduce this set by removing the
least successful features until a stopping criterion is met.
 They employed RFE to estimate the impact of feature set sizes on
the precision and accuracy of the classification, and directly
compare with existing work.
15
Support vector machines as classifier
 SVMs is popular and successful in a wide variety of binary
classification problems.
 In this paper, they use it with three steps:
 Map sequence data into a Euclidean vector space;
 Select a kernel function to map the vector space into higher dimensional and more
effective Euclidean space;
 Turn parameters for the kernel and other SVM parameters to improve performance.
16
Outline
 Introduction
 Methods
 Materials
 Discussion
 Conclusions
17
Data sets
 Compare the classification performance to two different groups of state-of-the-
art methods in splice site prediction, feature-based, and kernel-based.
 Feature-based: (FGA and GeneSplicer) extracted from the 2005 NCBI RefSeq
collection of 5057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov)
 Used to extract 51008 positive (contain splice sites) and 200000 negative sequences
 25504 acceptor and 25504 donor consist of 162 nucleotides each (80 upstream + AG|GT +
80 downstream)
 Kernel-based: (WD and WDS) extracted from the worm data set with EST
sequences (http://www.wormbase.org).
 Using 64844 donor and 64838 acceptor splice site sequences. Each sequence is 142
nucleotides long (60+AG|GT+80)
 1,777,912 sequences are centered around nonsplice site AG dinucleotides, and
2,846,598 sequences are centered around nonsplice site GT dinucleotides.
18
Overview of conducted experiments
 Two sets of classification experiments are conducted
 Compare the performance of FG-EA to FGA and GeneSplicer
 The SVM is trained over 2/3 of the data tested on the remaining 1/3.
 These process is repeated three times to obtain an average performance, with 30 difference
sets of hall of fame features.
 The trained SVM is applied to classify the B2hum testing data set.
 Compare with WD and WDS methods
 Employ 30 independent runs of FG-EA and SVM evaluation of resulting features.
 40,000 sequences are sampled from the worm data set.
 Ten different subsets of 360,000 sequences are randomly sampled from the worm data set.
 The values of these parameters can be found on our website
(http://www.cs.gmu.edu/~ashehu/?q=OurTools).
19
Overview of conducted experiments
 They measure performance in terms of 11ptAVG, FPR, auROC, and auPRC.
 The 11ptAVG is the average of the precisions calculated at 11 recall values
{0%,10%,…,100%}
 PRCs are employed to show the ability of FG-EA to discriminate true splice sites from
other sequences.
 FPR is also computed for recall values by varying the confidence threshold to employ
FPR-recall curves and show that FG-EA make very few mistakes.
Performance measurements
20
Evaluation of fitness quality and convergence
Mean and maximum fitness values per generation (left: acceptor, right: donor) are averaged over
30 independent GP runs. Error bars are standard deviations
21
Performance on Human training data set
Precision versus recall on training data set
Precision values are plotted over recall points (left: acceptor, right: donor). Values are averages
over 30 FG-EA run. Error bars are standard deviations.
22
Performance on B2Hum testing data set
Precision versus recall on testing data set
Precision over recall (left: acceptor, right: donor) are plotted for the B2Hum testing dataset
23
Performance on worm training data set
Precision over recall (left: acceptor, right: donor) are plotted for the 40K subset sampled from the
worm training data set
24
Performance on worm training data set
25
Outline
 Introduction
 Methods
 Materials
 Discussion
 Conclusions
26
Discussion
 They divide the hall of fame
features in three types of subsets
 All composition features
 All region-specific compositional,
positional, correlational features
 All remaining features include
conjunctive and disjunctive
features
27
Outline
 Introduction
 Methods
 Materials
 Discussion
 Conclusions
28
Conclussions
 FG-EA outperforms state-of-the-art feature generation methods in splice site
classification.
 FG-EA reveals the significant role of novel complex conjunctive and
disjunctive features.
 The proposed FG-EA algorithm can easily be employed in other prediction
problems on biological sequences.
 Further extensions of the FG-EA can combine the evolution of features with
evolution of SVM kernels for greater classification accuracy.
 Plan on employing regular expressions to further combine and reduce the bloat
in the expressions and so improve readability and performance.
29
 Evolutionary algorithms and genetic programming was widely-used in
bioinformatics, computational sciences, economics, chemistry and other fields.
However, they also have some limitations as,
 Operating on dynamic data sets is difficult.
 Cannot effectively solve problems in which the only fitness measure is a single
right/wrong measure.
 In some cases, the optimization algorithm can be more effective than the evolutionary
algorithm.
 Therefore, I think that we can combine FG-EA algorithm with optimization
algorithms to get better results in DNA splice site prediction problem.
30
Thank you for your listening!

Más contenido relacionado

Destacado

Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14Nguyen Chien
 
Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)Nguyen Chien
 
Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)Nguyen Chien
 
Year 8 – soldering
Year 8 – solderingYear 8 – soldering
Year 8 – solderingLee Ainsworth
 
Soldering techniques
Soldering techniquesSoldering techniques
Soldering techniquesrupert1977
 
oil & gas drilling preliminaries
oil & gas drilling preliminariesoil & gas drilling preliminaries
oil & gas drilling preliminariesKartikeya Pandey
 
T.L.E. GRADE 7 LESSONS
T.L.E. GRADE 7 LESSONST.L.E. GRADE 7 LESSONS
T.L.E. GRADE 7 LESSONSDayleen Hijosa
 
Noi quy van phong
Noi quy van phongNoi quy van phong
Noi quy van phongNgan Hoang
 
Development Length Hook Splice Of Reinforcements
Development Length Hook Splice Of ReinforcementsDevelopment Length Hook Splice Of Reinforcements
Development Length Hook Splice Of ReinforcementsSHERAZ HAMEED
 
cementation in Fixed prosthodontics
cementation in Fixed prosthodonticscementation in Fixed prosthodontics
cementation in Fixed prosthodonticsmalek mohammed
 
Modyul 20 neo-kolonyalismo
Modyul 20   neo-kolonyalismoModyul 20   neo-kolonyalismo
Modyul 20 neo-kolonyalismo南 睿
 
Machining operations and machine tools
Machining operations and machine toolsMachining operations and machine tools
Machining operations and machine toolsMuhammad Muddassir
 
DIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLS
DIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLSDIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLS
DIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLSLyn Gile Facebook
 

Destacado (16)

Mips Assembly
Mips AssemblyMips Assembly
Mips Assembly
 
Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14
 
Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)
 
Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)
 
Simple Machines
Simple MachinesSimple Machines
Simple Machines
 
7. quy che khen thuong ky luat
7. quy che khen thuong ky luat7. quy che khen thuong ky luat
7. quy che khen thuong ky luat
 
Year 8 – soldering
Year 8 – solderingYear 8 – soldering
Year 8 – soldering
 
Soldering techniques
Soldering techniquesSoldering techniques
Soldering techniques
 
oil & gas drilling preliminaries
oil & gas drilling preliminariesoil & gas drilling preliminaries
oil & gas drilling preliminaries
 
T.L.E. GRADE 7 LESSONS
T.L.E. GRADE 7 LESSONST.L.E. GRADE 7 LESSONS
T.L.E. GRADE 7 LESSONS
 
Noi quy van phong
Noi quy van phongNoi quy van phong
Noi quy van phong
 
Development Length Hook Splice Of Reinforcements
Development Length Hook Splice Of ReinforcementsDevelopment Length Hook Splice Of Reinforcements
Development Length Hook Splice Of Reinforcements
 
cementation in Fixed prosthodontics
cementation in Fixed prosthodonticscementation in Fixed prosthodontics
cementation in Fixed prosthodontics
 
Modyul 20 neo-kolonyalismo
Modyul 20   neo-kolonyalismoModyul 20   neo-kolonyalismo
Modyul 20 neo-kolonyalismo
 
Machining operations and machine tools
Machining operations and machine toolsMachining operations and machine tools
Machining operations and machine tools
 
DIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLS
DIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLSDIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLS
DIFFERENT TYPES OF ELECTRICAL POWER AND HYDRAULIC TOOLS
 

Similar a P0126557 slides

Genetic Algorithm Processor for Image Noise Filtering Using Evolvable Hardware
Genetic Algorithm Processor for Image Noise Filtering Using Evolvable HardwareGenetic Algorithm Processor for Image Noise Filtering Using Evolvable Hardware
Genetic Algorithm Processor for Image Noise Filtering Using Evolvable HardwareCSCJournals
 
Combinational circuit designer using 2D Genetic Algorithm
Combinational circuit designer using 2D Genetic AlgorithmCombinational circuit designer using 2D Genetic Algorithm
Combinational circuit designer using 2D Genetic AlgorithmVivek Maheshwari
 
Deep_Learning__INAF_baroncelli.pdf
Deep_Learning__INAF_baroncelli.pdfDeep_Learning__INAF_baroncelli.pdf
Deep_Learning__INAF_baroncelli.pdfasdfasdf214078
 
A Brain Computer Interface Speller for Smart Devices
A Brain Computer Interface Speller for Smart DevicesA Brain Computer Interface Speller for Smart Devices
A Brain Computer Interface Speller for Smart DevicesMahmoud Helal
 
GENETIC GAIN BY GENOMIC SELECTION PPT.pptx
GENETIC GAIN BY GENOMIC SELECTION PPT.pptxGENETIC GAIN BY GENOMIC SELECTION PPT.pptx
GENETIC GAIN BY GENOMIC SELECTION PPT.pptxPABOLU TEJASREE
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...cscpconf
 
Improved feature selection using a hybrid side-blotched lizard algorithm and ...
Improved feature selection using a hybrid side-blotched lizard algorithm and ...Improved feature selection using a hybrid side-blotched lizard algorithm and ...
Improved feature selection using a hybrid side-blotched lizard algorithm and ...IJECEIAES
 
PPSN 2004 - 3rd session
PPSN 2004 - 3rd sessionPPSN 2004 - 3rd session
PPSN 2004 - 3rd sessionJuan J. Merelo
 
IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...
IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...
IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...IRJET Journal
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger Eli Kaminuma
 
modelling assignment
modelling assignmentmodelling assignment
modelling assignmentShwetA Kumari
 

Similar a P0126557 slides (20)

Genetic Algorithm Processor for Image Noise Filtering Using Evolvable Hardware
Genetic Algorithm Processor for Image Noise Filtering Using Evolvable HardwareGenetic Algorithm Processor for Image Noise Filtering Using Evolvable Hardware
Genetic Algorithm Processor for Image Noise Filtering Using Evolvable Hardware
 
D046062030
D046062030D046062030
D046062030
 
Combinational circuit designer using 2D Genetic Algorithm
Combinational circuit designer using 2D Genetic AlgorithmCombinational circuit designer using 2D Genetic Algorithm
Combinational circuit designer using 2D Genetic Algorithm
 
Deep_Learning__INAF_baroncelli.pdf
Deep_Learning__INAF_baroncelli.pdfDeep_Learning__INAF_baroncelli.pdf
Deep_Learning__INAF_baroncelli.pdf
 
A Brain Computer Interface Speller for Smart Devices
A Brain Computer Interface Speller for Smart DevicesA Brain Computer Interface Speller for Smart Devices
A Brain Computer Interface Speller for Smart Devices
 
GENETIC GAIN BY GENOMIC SELECTION PPT.pptx
GENETIC GAIN BY GENOMIC SELECTION PPT.pptxGENETIC GAIN BY GENOMIC SELECTION PPT.pptx
GENETIC GAIN BY GENOMIC SELECTION PPT.pptx
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
04 1 evolution
04 1 evolution04 1 evolution
04 1 evolution
 
Da35573574
Da35573574Da35573574
Da35573574
 
1207.2600
1207.26001207.2600
1207.2600
 
2015-03-31_MotifGP
2015-03-31_MotifGP2015-03-31_MotifGP
2015-03-31_MotifGP
 
Improved feature selection using a hybrid side-blotched lizard algorithm and ...
Improved feature selection using a hybrid side-blotched lizard algorithm and ...Improved feature selection using a hybrid side-blotched lizard algorithm and ...
Improved feature selection using a hybrid side-blotched lizard algorithm and ...
 
PPSN 2004 - 3rd session
PPSN 2004 - 3rd sessionPPSN 2004 - 3rd session
PPSN 2004 - 3rd session
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
 
IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...
IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...
IRJET- Gene Mutation Data using Multiplicative Adaptive Algorithm and Gene On...
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 
An26247254
An26247254An26247254
An26247254
 
modelling assignment
modelling assignmentmodelling assignment
modelling assignment
 
B017410916
B017410916B017410916
B017410916
 
B017410916
B017410916B017410916
B017410916
 

Último

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Último (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

P0126557 slides

  • 1. An Evolution algorithm approach for Feature generation from Sequence data and its application to DNA splice data prediction Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu, Jack Comton, Rezarta Islamaj-Dogan. Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5, September/October 2012. Presenter: Nguyen Dinh Chien (阮庭戰) 1
  • 2. An EA approach for FG from Sequence data and its application to DNA splice data prediction  A challenge for machine learning methods is associating functional information with biological sequence. Their performance often depends on deriving predictive features from sequence sought to be classified.  Feature generation is a difficult problem. It is often the task of domain experts or exhaustive feature enumeration techniques to generate a few features. Their predictive power is tested in context of classification.  Therefore, the authors proposed an evolution algorithm to effectively explore a large feature space and generate predictive features from sequence data. 2
  • 3. Outline  Introduction  Methods  Materials  Discussion  Conclusions 3
  • 4. Introduction  A variety of general purpose search techniques are effective for NP- Hard problems.  In this paper, they explored the use of evolutionary algorithms to search a large and complex features space, with the goal is to obtain features from sequence data that can significantly improve the classification accuracy of a Support Vector Machine (SVM). This approach is evaluated on the difficult problem of DNA splice site prediction.  They used Genetic Programming (GP) techniques to evolve the kind of structures. This approach is called FG-EA, means Feature Generation with Evolution Algorithm. 4
  • 5. Introduction 5  Using an efficient fitness function, they identified a set of candidate features (a hall of fame) to be used as input to a standard SVM classification procedure.  Comparison with state-of-the- art feature-based classification method, they realized that FG- EA features significantly improve the classification performance.
  • 6. The DNA splice site prediction problem  Transcription of a eukaryotic DNA sequence into messenger RNA (mRNA) occurs only after enzymes splice away nocoding regions (intron) from precursor (pre-mRNA) sequence to leave only coding regions (exons), so that, prediction of splice sites is a fundamental component of the gene-finding problem.  Splice site prediction is a difficult problem. AG and GT (GU) can not be used as features due to their abundance in non-splice site sequences. 6
  • 7. Related EA work  Many studies have demonstrated the advantages of EAs for feature generation in different domains, such as, Fast genetic selection (F.A Brill et al. 1992), Nearest neighbor classifier (L.I Kucheva and L.C Jain. 1999), Dimensionality reduction (M.L Raymer et al. 2007), …  All of above methods obtain predictive features from sequence data have shown success in diverse bioinformatics problems.  Work on predicting enzymatic activity in proteins additionally shows the power of EAs in feature generation. 7
  • 8. Outline  Introduction  Methods  Materials  Discussion  Conclusions 8
  • 11. Feature representation  In the FG-EA, the leaves of a parse tree, also referred as a terminals, are either characters from the DNA alphabet {A,C,G,T} or integers corresponding to positions or motif (k-mers) length; the internal nodes are operators Length, Position, Motif, Matches, MatchAsPosition, AND, OR, and NOT. positional featurescompositional features 11
  • 12. Generating features  Generating random initial features from 0 consists of N=15,000 features.  The tree representing features are generated using the well-known ramped half-and-half generative method, which includes both Full and Grow techniques. These techniques obtain a mixture of full- balanced trees and bushy trees with each technique is employed with equal probability of 0.5.  Subsequent generations are evolved using standard GP (Genetic Programming) selection, crossover, and mutation mechanisms. 12
  • 13. Genetic Operators  Given a set of m features extracted from the hall of fame to serve as parents in a generation, the rest of GenSize-m features are generated using the mutation and crossover operators.  Employed three breeding pipelines:  mutation  mutation-ERC  crossover 13
  • 14. Fitness function  The fitness function is key to achieving an efficient and effective EA search heuristic.  FG-EA uses a surrogate fitness function given by:  f – feature; the ratio C+,f/C+ is weighted by the information gain (IG)  IG is often employed as a criterion of a feature’s goodness in machine learning. Given m class attributes: )(*)( , fIG C C fFitness f         m i ii m i i m i iii fcPfcPfPfcPfcpfPcPcPfIG 11 1 ))|(log().|(.)())|(log().|().())(log().()( 14
  • 15. Post FG-EA feature selection  The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination (RFE). RFE start with a large feature set and gradually reduce this set by removing the least successful features until a stopping criterion is met.  They employed RFE to estimate the impact of feature set sizes on the precision and accuracy of the classification, and directly compare with existing work. 15
  • 16. Support vector machines as classifier  SVMs is popular and successful in a wide variety of binary classification problems.  In this paper, they use it with three steps:  Map sequence data into a Euclidean vector space;  Select a kernel function to map the vector space into higher dimensional and more effective Euclidean space;  Turn parameters for the kernel and other SVM parameters to improve performance. 16
  • 17. Outline  Introduction  Methods  Materials  Discussion  Conclusions 17
  • 18. Data sets  Compare the classification performance to two different groups of state-of-the- art methods in splice site prediction, feature-based, and kernel-based.  Feature-based: (FGA and GeneSplicer) extracted from the 2005 NCBI RefSeq collection of 5057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov)  Used to extract 51008 positive (contain splice sites) and 200000 negative sequences  25504 acceptor and 25504 donor consist of 162 nucleotides each (80 upstream + AG|GT + 80 downstream)  Kernel-based: (WD and WDS) extracted from the worm data set with EST sequences (http://www.wormbase.org).  Using 64844 donor and 64838 acceptor splice site sequences. Each sequence is 142 nucleotides long (60+AG|GT+80)  1,777,912 sequences are centered around nonsplice site AG dinucleotides, and 2,846,598 sequences are centered around nonsplice site GT dinucleotides. 18
  • 19. Overview of conducted experiments  Two sets of classification experiments are conducted  Compare the performance of FG-EA to FGA and GeneSplicer  The SVM is trained over 2/3 of the data tested on the remaining 1/3.  These process is repeated three times to obtain an average performance, with 30 difference sets of hall of fame features.  The trained SVM is applied to classify the B2hum testing data set.  Compare with WD and WDS methods  Employ 30 independent runs of FG-EA and SVM evaluation of resulting features.  40,000 sequences are sampled from the worm data set.  Ten different subsets of 360,000 sequences are randomly sampled from the worm data set.  The values of these parameters can be found on our website (http://www.cs.gmu.edu/~ashehu/?q=OurTools). 19
  • 20. Overview of conducted experiments  They measure performance in terms of 11ptAVG, FPR, auROC, and auPRC.  The 11ptAVG is the average of the precisions calculated at 11 recall values {0%,10%,…,100%}  PRCs are employed to show the ability of FG-EA to discriminate true splice sites from other sequences.  FPR is also computed for recall values by varying the confidence threshold to employ FPR-recall curves and show that FG-EA make very few mistakes. Performance measurements 20
  • 21. Evaluation of fitness quality and convergence Mean and maximum fitness values per generation (left: acceptor, right: donor) are averaged over 30 independent GP runs. Error bars are standard deviations 21
  • 22. Performance on Human training data set Precision versus recall on training data set Precision values are plotted over recall points (left: acceptor, right: donor). Values are averages over 30 FG-EA run. Error bars are standard deviations. 22
  • 23. Performance on B2Hum testing data set Precision versus recall on testing data set Precision over recall (left: acceptor, right: donor) are plotted for the B2Hum testing dataset 23
  • 24. Performance on worm training data set Precision over recall (left: acceptor, right: donor) are plotted for the 40K subset sampled from the worm training data set 24
  • 25. Performance on worm training data set 25
  • 26. Outline  Introduction  Methods  Materials  Discussion  Conclusions 26
  • 27. Discussion  They divide the hall of fame features in three types of subsets  All composition features  All region-specific compositional, positional, correlational features  All remaining features include conjunctive and disjunctive features 27
  • 28. Outline  Introduction  Methods  Materials  Discussion  Conclusions 28
  • 29. Conclussions  FG-EA outperforms state-of-the-art feature generation methods in splice site classification.  FG-EA reveals the significant role of novel complex conjunctive and disjunctive features.  The proposed FG-EA algorithm can easily be employed in other prediction problems on biological sequences.  Further extensions of the FG-EA can combine the evolution of features with evolution of SVM kernels for greater classification accuracy.  Plan on employing regular expressions to further combine and reduce the bloat in the expressions and so improve readability and performance. 29
  • 30.  Evolutionary algorithms and genetic programming was widely-used in bioinformatics, computational sciences, economics, chemistry and other fields. However, they also have some limitations as,  Operating on dynamic data sets is difficult.  Cannot effectively solve problems in which the only fitness measure is a single right/wrong measure.  In some cases, the optimization algorithm can be more effective than the evolutionary algorithm.  Therefore, I think that we can combine FG-EA algorithm with optimization algorithms to get better results in DNA splice site prediction problem. 30
  • 31. Thank you for your listening!