2. Homework
● ATCurve.py
● take an input string from the user
● check if the sequence only contains DNA – if
not, prompt for new sequence.
● calculate a running average of AT content
along the sequence. Window size should be
3, and the step size should be 1. Print one
value per line.
● Note: you need to include several runtime
examples to show that all parts of the code
work.
3. ATCurve.py - thinking
● Take input from user:
● raw_input
● Check for the presence of !ATCG
● use sets – very easy
● Calculate AT – window = 3, step = 1
● iterate over string in slices of three
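The ideas above can be sketched as two small functions. This is a minimal sketch rather than the course solution: the names `is_dna` and `at_curve` are made up for illustration, and `print` is called with a single parenthesised argument so that it behaves the same in Python 2 and Python 3.

```python
def is_dna(sequence):
    """Return True if the sequence contains only A, T, G and C."""
    # The set difference leaves only the characters that are not valid DNA.
    return len(set(sequence.upper()) - set("ATGC")) == 0


def at_curve(sequence, window=3, step=1):
    """Return the AT fraction of every window along the sequence."""
    sequence = sequence.upper()
    values = []
    # The last window starts at len(sequence) - window, hence the + 1.
    for i in range(0, len(sequence) - window + 1, step):
        chunk = sequence[i:i + window]
        at_count = chunk.count("A") + chunk.count("T")
        values.append(float(at_count) / window)
    return values


if is_dna("ATGCA"):
    for value in at_curve("ATGCA"):
        print(value)
```

A sequence shorter than the window gives an empty list, so nothing is printed in that case.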
4. ATCurve.py
# variable valid is used to see if the string is ok or not.
valid = False
while not valid:
    # prompt user for input using raw_input() and store in string,
    # convert all characters into uppercase
    test_string = raw_input("Enter string: ")
    upper_string = test_string.upper()
    # Figure out if anything other than ATGC is present
    dnaset = set("ATGC")
    upper_string_set = set(upper_string)
    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your string, try again"
    else:
        valid = True
if valid:
    # the last window starts at len - 3, so the range must end at len - 3 + 1
    for i in range(0, len(upper_string)-3+1, 1):
        at_sum = 0.0
        # count() excludes the end index, so i+3 covers three characters
        at_sum += upper_string.count("A",i,i+3)
        at_sum += upper_string.count("T",i,i+3)
        print at_sum/3
5. Homework
● CodonFrequency.py
● take an input string from the user
● if the sequence only contains DNA
– find a start codon in your string
– if startcodon is present
● count the occurrences of each three-mer from start
codon and onwards
● print the results
6. CodonFrequency.py - thinking
● First part – same as earlier
● Find start codon: locate index of AUG
● Note, can simplify and find ATG
● If start codon is found:
● create dictionary
● for slice of three in input[StartCodon:]:
– get codon
– if codon is in dict:
● add to count
– if not:
● create key-value pair in dict
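The outline above could look roughly like this as a function. It is a sketch under assumptions: `count_codons` is a hypothetical name, the input is taken to be validated, uppercase DNA already, and a trailing fragment shorter than three bases is ignored.

```python
def count_codons(dna):
    """Count each three-mer from the first ATG onwards.

    Returns None when no start codon is found.
    """
    start = dna.find("ATG")
    if start == -1:
        return None
    counts = {}
    # Stop at len(dna) - 2 so that only complete codons are counted.
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in counts:
            counts[codon] += 1
        else:
            counts[codon] = 1
    return counts


counts = count_codons("CCATGAAAATGAAATT")
for codon in sorted(counts):
    print(codon + " " + str(counts[codon]))
```

Sorting the keys before printing is a choice made here so that the output order is predictable; it is not part of the assignment.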
7. CodonFrequency.py
dna = raw_input("Type a piece of DNA here: ")
if len(set(dna) - set("ATGC")) > 0:
    print "Not a valid DNA sequence"
else:
    atg = dna.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # stop at len(dna)-2 so the last complete codon is included
        for i in xrange(atg,len(dna)-2,3):
            codon = dna[i:i+3]
            if codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] +=1
        for codon in codondict:
            print codon, codondict[codon]
8. CodonFrequency.py w/ stopcodon
dna = raw_input("Type a piece of DNA here: ")
if len(set(dna) - set("ATGC")) > 0:
    print "Not a valid DNA sequence"
else:
    atg = dna.find("ATG")
    if atg == -1:
        print "Start codon not found"
    else:
        codondict = {}
        # stop at len(dna)-2 so the last complete codon is included
        for i in xrange(atg,len(dna)-2,3):
            codon = dna[i:i+3]
            # the input is DNA, so the stop codons are written with T, not U
            if codon in ['TAG', 'TAA', 'TGA']:
                break
            elif codon not in codondict:
                codondict[codon] = 1
            else:
                codondict[codon] +=1
        for codon in codondict:
            print codon, codondict[codon]
10. Working with files
● Reading – get info into your program
● Parsing – processing file contents
● Writing – get info out of your program
11. Reading and writing
● Three-step process
● Open file
– create file handle – reference to file
● Read or write to file
● Close file
– will be closed automatically when the program ends,
but it is bad form not to close it yourself
12. Opening files
● Opening modes:
● "r" - read file
● "w" - write file
● "a" - append to end of file
● fh = open("filename", "mode")
● fh = filehandle, reference to a file, NOT the
file itself
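A short illustration of the three modes, as a sketch only: the file `example.txt` is hypothetical and is created in a temporary directory so that the example does not touch any real file.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

fh = open(path, "w")      # "w" creates the file, or empties an existing one
fh.write("first line\n")
fh.close()

fh = open(path, "a")      # "a" keeps the contents and adds to the end
fh.write("second line\n")
fh.close()

fh = open(path, "r")      # "r" opens an existing file for reading
contents = fh.read()
fh.close()

print(contents)
```

Opening the same file again with "w" would discard both lines, which is why "a" is the mode to use when adding to existing output.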
13. Reading a file
● Three ways to read
● read([n]) - n = bytes to read, default is all
● readline() - read one line, incl. newline
● readlines() - read file into a list, one element
per line, including newline
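The three methods can be combined on one file handle, which also shows that each call continues from where the previous one stopped. This is a sketch using a made-up two-line file written to a temporary directory.

```python
import os
import tempfile

# Hypothetical two-line file, written only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fh = open(path, "w")
fh.write("line one\nline two\n")
fh.close()

fh = open(path, "r")
first_four = fh.read(4)       # the first four characters
rest_of_line = fh.readline()  # continues from the current position
remaining = fh.readlines()    # whatever is left, as a list of lines
nothing_left = fh.read()      # the handle is now at the end of the file
fh.close()

print(repr(first_four))
print(repr(rest_of_line))
print(repr(remaining))
print(repr(nothing_left))
```

`repr()` is used when printing so that the newline characters are visible.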
14. Reading example
● Log on to freebee, and go to your area
● do cp ../Karin/fastafile.fsa .
● open python
>>> fh = open("fastafile.fsa", "r")
>>> fh
● Q: what does the response mean?
15. Read example
● Use all three methods to read the file. Print
the results.
● read
● readlines
● readline
● Q: what happens after you have read the
file?
● Q: What is the difference between the
three?
16. Read example
>>> fh = open("fastafile.fsa", "r")
>>> withread = fh.read()
>>> withread
'>This is the description line\nATGCGCTTAGGATCGATAGCGATTTAGA\nTTAGCGGA\n'
>>> withreadlines = fh.readlines()
>>> withreadlines
[]
>>> fh = open("fastafile.fsa", "r")
>>> withreadlines = fh.readlines()
>>> withreadlines
['>This is the description line\n', 'ATGCGCTTAGGATCGATAGCGATTTAGA\n', 'TTAGCGGA\n']
>>> fh = open("fastafile.fsa", "r")
>>> withreadline = fh.readline()
>>> withreadline
'>This is the description line\n'
>>>
17. Parsing
● Getting information out of a file
● Commonly used string methods
● split([character]) – default is whitespace
● replace("text to find", "text to put in instead")
● "separator".join(list)
– joins all elements in the list with string
character as a separator
– common construction: ''.join(list)
● slicing
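The string methods listed above, applied to one made-up tab-separated line. The line content and the variable names are assumptions chosen for illustration.

```python
line = "gene1\t120\tATG-CGC\n"

fields = line.replace("\n", "").split("\t")   # drop the newline, split on tabs
words = "a  b   c".split()                    # default: any run of whitespace
cleaned = fields[2].replace("-", "")          # remove every "-"
joined = "".join(["ATG", "CGC", "TTA"])       # glue list elements together
dashed = "-".join(["ATG", "CGC", "TTA"])      # same, with a separator
first_codon = cleaned[0:3]                    # slicing: characters 0, 1 and 2

print(fields)
print(words)
print(cleaned)
print(joined)
print(dashed)
print(first_codon)
```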
18. Type conversions
● Everything that comes on the command
line or from a file is a string
● Conversions:
● int(X)
– string cannot have decimals
– floats are truncated toward zero, e.g. int(-2.7) gives -2
● float(X)
● str(X)
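A few conversions side by side, including the case that fails. This is only a sketch; the variable names are arbitrary.

```python
whole = int("42")          # string to integer
fraction = float("3.75")   # string to float
truncated = int(3.75)      # the fractional part is dropped
negative = int(-3.75)      # truncation is toward zero, not downwards
text = str(42)             # number to string

# int() refuses a string with a decimal point; convert via float() first.
try:
    int("3.75")
    failed = False
except ValueError:
    failed = True
via_float = int(float("3.75"))

print(whole)
print(fraction)
print(truncated)
print(negative)
print(text)
print(failed)
print(via_float)
```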
19. Parsing example
● Continue using fastafile.fsa
● Print only the description line to screen
● Print the whole DNA string
>>> fh = open("fastafile.fsa", "r")
>>> firstline = fh.readline()
>>> print firstline[1:-1]
This is the description line
>>> sequence = ''
>>> for line in fh:
...     sequence += line.replace("\n", "")
...
>>> print sequence
ATGCGCTTAGGATCGATAGCGATTTAGATTAGCGGA
>>>
20. Accepting input from command line
● Need to be able to specify file name on
command line
● Command line parameters are stored in a list
called sys.argv – the program name is at index 0
● Usage:
● python pythonscript.py arg1 arg2 arg3....
● In script:
● at the top of the file, write import sys
● arg1 = sys.argv[1]
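Because everything in sys.argv is a string, numeric arguments have to be converted. The sketch below wraps this in a hypothetical helper, `parse_args`, and feeds it a simulated argument list so that it can be run without a command line; the script name `atcurve.py` is made up.

```python
import sys


def parse_args(argv):
    """Return (filename, window size) from a list shaped like sys.argv."""
    # argv[0] is the script name; the real arguments start at index 1.
    if len(argv) < 3:
        return None
    filename = argv[1]
    windowsize = int(argv[2])   # arguments arrive as strings
    return filename, windowsize


# Simulated command line: python atcurve.py fastafile.fsa 3
print(parse_args(["atcurve.py", "fastafile.fsa", "3"]))
# In a real script you would pass sys.argv instead:
# args = parse_args(sys.argv)
```

Returning None when arguments are missing lets the caller print a usage message instead of crashing with an IndexError.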
21. Batch example
● Read fastafile.fsa with all three methods
● Per method, print method, name and
sequence
● Remember to close the file at the end!
22. Batch example
import sys
filename = sys.argv[1]
# using readline()
fh = open(filename, "r")
firstline = fh.readline()
name = firstline[1:-1]
sequence = ''
for line in fh:
    sequence += line.replace("\n", "")
fh.close()
print "Readline", name, sequence
# using readlines()
fh = open(filename, "r")
inputlines = fh.readlines()
fh.close()
name = inputlines[0][1:-1]
sequence = ''
for line in inputlines[1:]:
    sequence += line.replace("\n", "")
print "Readlines", name, sequence
# using read()
fh = open(filename, "r")
inputlines = fh.read()
fh.close()
# split("\n") already removes the newline, so only the ">" is sliced off
name = inputlines.split("\n")[0][1:]
sequence = "".join(inputlines.split("\n")[1:])
print "Read", name, sequence
23. Classroom exercise
● Modify ATCurve.py script so that it accepts
the following input on the command line:
● fasta filename
● window size
● Let the user input an alternate filename if it
contains !ATGC
● Print results to screen
24. ATCurve2.py
import sys
# Define filename and window size
filename = sys.argv[1]
windowsize = int(sys.argv[2])
# variable valid is used to see if the string is ok or not.
valid = False
while not valid:
    fh = open(filename, "r")
    inputlines = fh.readlines()
    fh.close()
    name = inputlines[0][1:-1]
    sequence = ''
    for line in inputlines[1:]:
        sequence += line.replace("\n", "")
    upper_string = sequence.upper()
    # Figure out if anything other than ATGC is present
    dnaset = set("ATGC")
    upper_string_set = set(upper_string)
    if len(upper_string_set - dnaset) > 0:
        print "Non-DNA present in your file, try again"
        filename = raw_input("Type in filename: ")
    else:
        valid = True
if valid:
    for i in range(0, len(upper_string)-windowsize + 1, 1):
        at_sum = 0.0
        at_sum += upper_string.count("A",i,i+windowsize)
        at_sum += upper_string.count("T",i,i+windowsize)
        print i + 1, at_sum/windowsize
25. Writing to files
● Similar procedure as for read
● Open file, mode is “w” or “a”
● fh.write(string)
– Note: one single string
– No newlines are added
● fh.close()
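A small sketch of the two notes above: write() takes one string, and the newline has to be added by hand. The file name `codons.txt` and the list of codons are made up, and the file is written to a temporary directory.

```python
import os
import tempfile

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "codons.txt")

codons = ["ATG", "CGC", "TTA"]
fh = open(path, "w")
for number, codon in enumerate(codons):
    # Build the whole line as one string, including the newline.
    fh.write(str(number + 1) + " " + codon + "\n")
fh.close()

fh = open(path, "r")
written = fh.read()
fh.close()
print(written)
```

Without the explicit "\n" all three entries would end up on a single line.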
26. ATContent3.py
● Modify previous script so that you have the
following on the command line
● fasta filename for input file
● window size
● output file
● Output should be in the format
● number, AT content
● number is the 1-based position of the first
nucleotide in the window
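One possible way to produce the requested output format, sketched under assumptions: `write_at_curve` is a hypothetical helper, the sequence is passed in directly instead of being read from a fasta file, and three decimals are chosen arbitrarily for the AT content.

```python
import os
import tempfile


def write_at_curve(sequence, windowsize, outname):
    """Write one 'position, AT content' line per window to outname."""
    sequence = sequence.upper()
    fh = open(outname, "w")
    for i in range(0, len(sequence) - windowsize + 1):
        window = sequence[i:i + windowsize]
        at_fraction = float(window.count("A") + window.count("T")) / windowsize
        # i is 0-based, so the 1-based position of the window start is i + 1.
        fh.write("%d, %.3f\n" % (i + 1, at_fraction))
    fh.close()


outname = os.path.join(tempfile.mkdtemp(), "atcontent.txt")
write_at_curve("ATGCA", 3, outname)
fh = open(outname, "r")
result = fh.read()
fh.close()
print(result)
```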
28. Homework: TranslateProtein.py
● Input files are in
/projects/temporary/cees-python-course/Karin
● translationtable.txt - tab separated
● dna31.fsa
● Script should:
● Open the translationtable.txt file and read it into a
dictionary
● Open the dna31.fsa file and read the contents.
● Translate the DNA into protein using the dictionary
● Print the translation in fasta format to the file
TranslateProtein.fsa. Each protein line should be 60
characters long.
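The building blocks for this homework could be sketched as below. Everything here is an assumption made for illustration: the layout of translationtable.txt is taken to be one "codon, tab, amino acid" pair per line, the three-entry table is made up, unknown codons are written as "X", and the function names are hypothetical. Reading the real files and writing the fasta output are left out.

```python
def parse_table(lines):
    """Build a codon -> amino acid dictionary from tab-separated lines."""
    table = {}
    for line in lines:
        fields = line.replace("\n", "").split("\t")
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table


def translate(dna, table):
    """Translate complete codons; unknown codons become 'X' (an assumption)."""
    protein = ""
    for i in range(0, len(dna) - 2, 3):
        protein += table.get(dna[i:i + 3], "X")
    return protein


def wrap(text, width=60):
    """Split text into lines of at most width characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


# Made-up table lines, in the assumed "codon<TAB>amino acid" layout.
table = parse_table(["ATG\tM\n", "AAA\tK\n", "GGC\tG\n"])
protein = translate("ATGAAAGGCAAA", table)
print(protein)
print([len(piece) for piece in wrap("A" * 130)])
```

`wrap` returns the pieces without newlines, so each one still needs a "\n" added when it is written to the output file.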