Transcription Factor DNA Binding Prediction
Final Project – CS6243
Team Members:
Badri Sampath α
Iffat Sharmin Chowdhury α
Prosunjit Biswas α
Tahmina Ahmed α
α Department of Computer Science
University of Texas at San Antonio.
1. Defining the Scope of the Project:
In this project, we are given a number of labeled (p or n) DNA sequences and a number of
unlabeled DNA sequences, which we have to label using a model built from the labeled
sequences. The scope of the problem is therefore to build a binary classifier from the given
training DNA sequences and apply it to label the unlabeled DNA sequences.
1.1 Challenges of the Project:
In a conventional classification problem, there are a number of different attributes that we can readily use to
build the classifier. In this project, we are given only the sequences and their labels. So, part of the work for this
project is to find a way to generate meaningful attributes from the sequences.
Fig. 1: Overall scope of the project.
2. K-mer Based Approach:
In the K-mer approach, we generate all possible combinations of the DNA characters (A, C, G, T) for a
specified length K. The K-mer approach is shown in detail in Figure 2. The important steps of the
K-mer approach are discussed in the following paragraphs.
Fig. 2: Overall K-mer based process.
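The generation step can be illustrated with a minimal Python sketch. It is not the project's code; the names DNA_ALPHABET and all_kmers are hypothetical and chosen only for this illustration.

```python
from itertools import product

# Assumption: sequences use only the four standard DNA characters.
DNA_ALPHABET = "ACGT"


def all_kmers(k):
    """Return every string of length k over A, C, G, T, in lexicographic order."""
    return ["".join(chars) for chars in product(DNA_ALPHABET, repeat=k)]


if __name__ == "__main__":
    kmers = all_kmers(5)
    # There are 4**k possible K-mers, so K = 5 yields 1024 candidate attributes.
    print(len(kmers), kmers[:3])
```

Since the number of K-mers grows as 4^K, each increase of K by one quadruples the number of candidate attributes.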
After generating the K-mers, we followed three different approaches to count
their frequencies in each sequence: i) strict matching, ii) matching with mismatch, and iii) matching based
on regular expressions.
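The three counting strategies can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: occurrences are counted with overlaps, "mismatch" is taken to mean a bounded Hamming distance, and all function names and the max_mismatches parameter are hypothetical.

```python
import re


def windows(sequence, k):
    """Yield every length-k substring of the sequence, left to right."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def count_strict(sequence, kmer):
    """Count exact (overlapping) occurrences of kmer."""
    return sum(1 for w in windows(sequence, len(kmer)) if w == kmer)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_with_mismatch(sequence, kmer, max_mismatches=1):
    """Count windows within max_mismatches substitutions of kmer (assumed meaning of 'mismatch')."""
    return sum(1 for w in windows(sequence, len(kmer))
               if hamming(w, kmer) <= max_mismatches)


def count_regex(sequence, pattern):
    """Count (overlapping) positions where the regular expression matches."""
    # A zero-width lookahead lets matches that overlap each other all be counted.
    return sum(1 for _ in re.finditer("(?=(?:" + pattern + "))", sequence))


if __name__ == "__main__":
    seq = "ACGTACGA"
    print(count_strict(seq, "ACGT"))
    print(count_with_mismatch(seq, "ACGT", max_mismatches=1))
    print(count_regex(seq, "ACG[AT]"))
```

Applying one of these counters to every K-mer gives one numeric attribute per K-mer for each sequence, which is the form a classifier expects.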
In order to build an optimal model, we tuned different parameters of the model. Some of
the parameters and their impact on the classifier are shown in Table I.
3. PWM Based Approach:
We used the motif finding tool MEME [1] to generate a specified number of motifs of
specified minimum and maximum length, and the Motif Alignment and Search Tool MAST [2] to get the
E-value (bounded to 100) for each sequence. We derived a score from each E-value by
subtracting the E-value from 100, so that a lower E-value gives a higher score and the sequences
can be ordered according to their E-value. We
used these scores, one per motif, as attributes of the sequences and fed them to
different classifiers. Table II gives a synopsis of the parameters and their impact on the model.
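The score derivation described above is simple arithmetic and can be sketched in Python. The sketch assumes the per-motif E-values have already been obtained from MAST and are passed in as plain numbers; it does not call MEME or MAST, and the names E_VALUE_CAP, evalue_to_score and motif_score_features are hypothetical.

```python
# Bound stated in the text: E-values are bounded to 100.
E_VALUE_CAP = 100.0


def evalue_to_score(e_value, cap=E_VALUE_CAP):
    """Turn an E-value into a score: cap minus the (bounded) E-value."""
    bounded = min(e_value, cap)  # enforce the upper bound
    return cap - bounded         # lower E-value gives a higher score


def motif_score_features(e_values_per_motif):
    """Build one attribute per motif for a single sequence."""
    return [evalue_to_score(e) for e in e_values_per_motif]


if __name__ == "__main__":
    # Hypothetical E-values of one sequence against three motifs.
    print(motif_score_features([0.5, 42.0, 250.0]))
```

With ten motifs, for example, each sequence is represented by ten such scores.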
Table I: Synopsis of the parameters and their effect in the K-mer model building process.
K-mer Value           | Classifier Selection     | String Match                                 | Mismatch                                 | Regular Expression
5 (best)              | Logistic (best)          | When applied (performs best)                 | When not applied (performs best)         | Not significant
4 (reasonably good)   | SMO (good)               | When not applied (performs relatively worse) | When applied (performs relatively worse) |
6 (comparatively bad) | J48 (comparatively weak) |                                              |                                          |
Table II: Synopsis of the parameters for the PWM approach and their effect in the model
No. of Motifs | No. of Sites a Motif Appears | Min / Max Length of Motif | Classifier
10            | 18                           | 6-15                      | J48 (best)
8             | 20                           | 5-16                      | Logistic (moderate)
5             | 10                           | 6-15                      | Naïve Bayes (comparatively bad)
4. Combining K-mer & PWM Approach:
In order to obtain a better model, we combined the K-mer and PWM approaches using the
best known parameters of each. The combined approach gave a reasonable improvement when
applied to the training data.
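The report does not state how the two attribute sets were combined. One common way, assumed here purely for illustration, is to concatenate the K-mer counts and the motif scores of each sequence into a single attribute vector. The function names below are hypothetical.

```python
def combine_features(kmer_counts, motif_scores):
    """Concatenate the two attribute lists of one sequence (assumed combination scheme)."""
    return list(kmer_counts) + list(motif_scores)


def build_dataset(kmer_rows, motif_rows, labels):
    """Pair each combined attribute vector with its label (p or n)."""
    if not (len(kmer_rows) == len(motif_rows) == len(labels)):
        raise ValueError("every sequence needs K-mer counts, motif scores and a label")
    return [(combine_features(k, m), y)
            for k, m, y in zip(kmer_rows, motif_rows, labels)]


if __name__ == "__main__":
    # Toy values only: two sequences, three K-mer counts and two motif scores each.
    dataset = build_dataset([[2, 0, 1], [0, 3, 0]],
                            [[99.5, 0.0], [12.0, 58.0]],
                            ["p", "n"])
    for features, label in dataset:
        print(label, features)
```

Under this assumption, K = 5 and ten motifs would give 1024 + 10 attributes per sequence.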
5. Some Difficulties and Limitations of Our Work:
Tuning the parameters for the classifier was the most challenging part of the project. We believe
we ran a reasonable set of experiments for choosing the parameters given the limited timeline.
6. Acknowledgement:
At the end of the project, we would like to thank Dr. Ruan for assigning us such a challenging
project. It gave us good working knowledge of practical machine learning and data mining.
Working in a group was also a nice experience and an opportunity for knowledge sharing.
References:
[1-2] "MEME Suite", available at: http://meme.sdsc.edu/meme/meme-download.html
[3] "Weka", available at: http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html