Transcription Factor DNA Binding Prediction


1. Defining the Scope of the Project:

In this project, we were given a number of labeled DNA sequences (labeled p or n) and a number of
unlabeled DNA sequences, which we have to label using a model built from the labeled
sequences. The problem therefore reduces to building a binary classifier from the given
training DNA sequences and applying it to label the unlabeled DNA sequences.

1.1 Challenges of the Project:

In a conventional classification problem, a number of different attributes are readily available for
building the classifier. In this project, we are given only the sequences and their labels. Part of the work
is therefore to find a way to generate meaningful attributes from the sequences.

Fig. 1: Overall scope of the project.

2. K-mer Based Approach:

In the K-mer approach, we generated all possible combinations of DNA characters for a
specified length K. The K-mer approach is shown in detail in Figure 2. The important steps of the
K-mer approach are discussed in the following paragraphs.
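
The enumeration step can be sketched as follows. This is a minimal illustration, not the project's actual code: the alphabet "ACGT" and the function name all_kmers are assumptions introduced here, since the report does not list the characters or the implementation used.

```python
from itertools import product

# Assumed alphabet; the report only says "DNA characters".
DNA_ALPHABET = "ACGT"


def all_kmers(k):
    """Return every string of length k over the DNA alphabet, in lexicographic order."""
    return ["".join(chars) for chars in product(DNA_ALPHABET, repeat=k)]


if __name__ == "__main__":
    kmers = all_kmers(5)
    # There are 4**k strings of length k over a 4-letter alphabet.
    print(len(kmers))
```

Each K-mer then becomes one attribute, so the number of attributes grows as 4^K (1024 for K = 5 under the assumed alphabet).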

Fig. 2: Overall K-mer based process.

After generating the K-mers, we counted their frequencies in three different ways:
i) strict matching, ii) matching with mismatches, and iii) matching based
on regular expressions.
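
The three counting modes might look like the sketch below. It rests on assumptions not stated in the report: occurrences are counted with overlap, a mismatch means a Hamming-distance tolerance (one substitution by default), and the regular-expression mode takes a pattern such as "AC[GT]T". All function names are hypothetical.

```python
import re


def count_strict(sequence, kmer):
    """Count exact occurrences of kmer in sequence, allowing overlaps."""
    k = len(kmer)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer)


def count_with_mismatch(sequence, kmer, max_mismatches=1):
    """Count windows that differ from kmer in at most max_mismatches positions."""
    k = len(kmer)
    count = 0
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, kmer))
        if mismatches <= max_mismatches:
            count += 1
    return count


def count_regex(sequence, pattern):
    """Count start positions where the regular expression matches.

    The pattern is wrapped in a lookahead so that overlapping matches are counted.
    """
    return len(re.findall(f"(?=({pattern}))", sequence))


if __name__ == "__main__":
    seq = "ACGTAAGT"
    print(count_strict(seq, "ACG"))         # 1: only the window at position 0
    print(count_with_mismatch(seq, "ACG"))  # 2: "ACG" and "AAG"
    print(count_regex("ACGTACTT", "AC[GT]T"))  # 2: "ACGT" and "ACTT"
```

Counting one value per K-mer for each sequence gives a fixed-length attribute vector regardless of sequence length.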

To build an optimal model, we tuned several parameters of the model. Some of
these parameters and their impact on the classifier are shown in Table I.

3. PWM Based Approach:

We used the motif-finding tool MEME [1] to generate a specified number of motifs with
specified minimum and maximum lengths, and the Motif Alignment and Search Tool MAST [2] to obtain an
E-value (bounded to 100) for each sequence. We derived a score from each E-value by
subtracting it from 100, so that a lower E-value yields a higher score and the sequences can be
ordered according to their E-values. We used these motif-specific scores as attributes of the
sequences and fed them to different classifiers. Table II summarizes the parameters and their
impact on the model.
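
The score derivation can be sketched as below. Parsing MAST output is deliberately left out. The dictionary layout, the function names, and the choice to give a motif with no reported E-value the bound of 100 (score 0) are assumptions made for illustration only.

```python
# Bound on the E-value stated in the text.
E_VALUE_CAP = 100.0


def evalue_to_score(e_value, cap=E_VALUE_CAP):
    """Map an E-value to a score; a smaller E-value gives a larger score."""
    bounded = min(max(e_value, 0.0), cap)  # keep the E-value within [0, cap]
    return cap - bounded


def motif_features(e_values_by_motif, motif_ids):
    """Build one attribute per motif for a single sequence.

    Assumption: a motif without a reported E-value is treated as having
    the cap value, which gives it a score of 0.
    """
    return [evalue_to_score(e_values_by_motif.get(m, E_VALUE_CAP)) for m in motif_ids]


if __name__ == "__main__":
    # Hypothetical E-values for one sequence; motif "m2" has no reported value.
    print(motif_features({"m1": 2.0, "m3": 250.0}, ["m1", "m2", "m3"]))
```

With 10 motifs, for example, each sequence would be described by 10 such scores.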

Table I: Synopsis of the parameters and their effect on the K-mer model building process.

K-mer Value | Classifier Selection | String Match | Mismatch | Regular Expression
5 (best) | Logistic (best) | When applied (performs best) | When not applied (performs best) | Not significant
4 (reasonably good) | SMO (good) | When not applied (performs relatively worse) | When applied (performs relatively worse) |
6 (comparatively bad) | J48 (comparatively weak) | | |

Table II: Synopsis of the parameters for the PWM approach and their effect on the model.

No. of Motifs | No. of Sites a Motif Appears | Min / Max Length of Motif | Classifier
10 | 18 | 6-15 | J48 (best)
8 | 20 | 5-16 | Logistic (moderate)
5 | 10 | 6-15 | Naïve Bayes (comparatively bad)

4. Combining K-mer & PWM approach:

To obtain a better model, we combined the K-mer and PWM approaches, using the
best known parameters for each. The combined approach showed a reasonable improvement when
applied to the training data.
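
The report does not say how the two approaches were combined. One simple possibility, shown here purely as an assumption, is to concatenate the K-mer counts and the PWM scores of each sequence into a single attribute vector before training. The function names are hypothetical.

```python
def combine_features(kmer_counts, pwm_scores):
    """Concatenate the K-mer attributes and the PWM attributes of one sequence."""
    return list(kmer_counts) + list(pwm_scores)


def build_dataset(kmer_rows, pwm_rows, labels):
    """Pair each combined attribute vector with its class label (p or n)."""
    if not (len(kmer_rows) == len(pwm_rows) == len(labels)):
        raise ValueError("every sequence needs K-mer counts, PWM scores, and a label")
    return [
        (combine_features(kmer, pwm), label)
        for kmer, pwm, label in zip(kmer_rows, pwm_rows, labels)
    ]


if __name__ == "__main__":
    # Made-up toy values: two sequences, three K-mer counts and two PWM scores each.
    dataset = build_dataset(
        kmer_rows=[[2, 0, 1], [0, 3, 0]],
        pwm_rows=[[98.0, 0.0], [10.5, 99.9]],
        labels=["p", "n"],
    )
    for attributes, label in dataset:
        print(label, attributes)
```

The resulting rows can then be passed to any of the classifiers compared in Tables I and II.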

5. Some Difficulties and Limitations of Our Work:

Tuning the parameters of the classifier was the most challenging part of the project. We believe
we carried out a reasonable set of experiments for choosing the parameters, given the limited timeline.

6. Acknowledgement:

We would like to thank Dr. Ruan for assigning us such a challenging
project. It gave us good working knowledge of practical machine learning and data mining.
Working in a group was also a good experience and an opportunity to share knowledge.

References:

[1-2] "MEME Suite", available at: http://meme.sdsc.edu/meme/meme-download.html
[3] "Weka", available at: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html

Final Project, CS6243: Transcription Factor DNA Binding Prediction
Team Members: Badri Sampath, Iffat Sharmin Chowdhury, Prosunjit Biswas, Tahmina Ahmed
Department of Computer Science, University of Texas at San Antonio
