Homework: Finding ORFs
Q1. Use regular expressions to find potential open reading frames (ORFs) in chromosome 1 of Saccharomyces cerevisiae strain S288C (baker’s yeast). Here, we define an ORF to have the following properties:
- START codon is ATG
- STOP codon is TGA, TAA or TAG
- The sequence between START and STOP:
  - Must have a length that is a multiple of 3
  - Must be at least 10 codons long
  - Must not contain another in-frame START or STOP codon
A codon is a sequence of 3 non-overlapping bases that codes for an amino
acid. For example, the first ORF in chromosome 1 of Saccharomyces
cerevisiae strain S288C is
ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA
which consists of 21 codons. Each ORF should include the START codon
(which codes for methionine) but not the STOP codon.
- 2.1 Write code to download the FASTA file from http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa (5 points)
- 2.2 Write a single regular expression to capture all ORFs meeting the criteria given above (20 points)
- 2.3 Apply the regular expression to find all matches in the downloaded sequence. You will have to exclude the FASTA comment lines that start with ‘>’ and do some processing to generate a single DNA sequence. (20 points)
- 2.4 How many ORFs are found in total? (5 points)
- 2.5 Write a script to convert each ORF to the putative peptide that it would translate to, using this translation table. Express the peptide sequence as a string of single-letter codes (SLC), one per amino acid. What is the peptide sequence for the last ORF found? (50 points)
2.1 Write code to download the FASTA file from
http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa
(5 points)
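One way to approach the download is sketched below using only the Python standard library. The function name `download_fasta` and the local file name `chr01.fsa` are illustrative choices, not part of the assignment, and the sketch assumes the URL above is reachable from the machine running the notebook.

```python
import urllib.request
from pathlib import Path

URL = (
    "http://downloads.yeastgenome.org/sequence/S288C_reference/"
    "chromosomes/fasta/chr01.fsa"
)

def download_fasta(url: str = URL, dest: str = "chr01.fsa") -> Path:
    """Save the resource at `url` to `dest`, skipping the download if the file exists."""
    path = Path(dest)
    if not path.exists():
        with urllib.request.urlopen(url) as response:
            path.write_bytes(response.read())
    return path

# Usage (requires network access):
# fasta_path = download_fasta()
```

Skipping the download when the file already exists makes it cheap to re-run the notebook cell.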
2.2 Write a single regular expression to capture all ORFs meeting the criteria given above (20 points)
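One possible pattern is sketched below. It rests on several assumptions: the sequence is uppercase; "at least 10 codons" means ten or more codons strictly between START and STOP (use `{9,}` instead if the START codon is meant to count); and "other START or STOP codon" refers to in-frame codons only, since the example ORF given earlier contains out-of-frame ATG and TGA. The name `ORF_PATTERN` is illustrative, and only the forward strand is searched.

```python
import re

# ATG, then at least 10 in-frame codons that are neither START nor STOP,
# then a lookahead for a STOP codon so the STOP stays out of the match.
ORF_PATTERN = re.compile(
    r"ATG(?:(?!ATG|TAA|TAG|TGA)[ACGT]{3}){10,}(?=TAA|TAG|TGA)"
)

# The example ORF quoted in the problem statement, flanked by filler bases
# and followed by a STOP codon.
example = "ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA"
match = ORF_PATTERN.search("CC" + example + "TGA" + "GG")
print(match.group(0) == example)  # True
```

The negative lookahead `(?!ATG|TAA|TAG|TGA)` is checked once per codon, which is what keeps the exclusion in-frame; the trailing lookahead `(?=...)` requires a STOP codon without consuming it.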
2.3 Apply the regular expression to find all matches in the downloaded sequence. You will have to exclude the FASTA comment lines that start with ‘>’ and do some processing to generate a single DNA sequence. (20 points)
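The two steps described here, building one sequence string and then matching against it, might look like the following sketch. The helper names `read_sequence` and `find_orfs` are hypothetical, and the pattern carries its own assumptions (uppercase bases, ten or more in-frame codons between START and STOP, forward strand only).

```python
import re

# Assumed pattern: ATG, >= 10 in-frame non-START/STOP codons, STOP via lookahead.
ORF_PATTERN = re.compile(
    r"ATG(?:(?!ATG|TAA|TAG|TGA)[ACGT]{3}){10,}(?=TAA|TAG|TGA)"
)

def read_sequence(fasta_text: str) -> str:
    """Drop '>' comment lines and join the rest into one uppercase string."""
    return "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    ).upper()

def find_orfs(sequence: str) -> list[str]:
    """Return all non-overlapping matches, scanning left to right."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ORF_PATTERN.findall(sequence)

# Usage, assuming the FASTA file was saved locally as chr01.fsa:
# with open("chr01.fsa") as handle:
#     orfs = find_orfs(read_sequence(handle.read()))
```

Joining the lines before matching matters because an ORF can span several lines of the FASTA file, and a newline inside a codon would otherwise break the match.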
2.4 How many ORFs are found in total? (5 points)
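Counting the matches can be sketched as below. The function name `count_orfs` is illustrative and the pattern is the same assumed one (ten or more in-frame codons between START and STOP, uppercase input). Because Python's `re` module reports non-overlapping matches, an ORF that starts inside an earlier match is not counted, so the total depends on that choice.

```python
import re

# Assumed pattern: ATG, >= 10 in-frame non-START/STOP codons, STOP via lookahead.
ORF_PATTERN = re.compile(
    r"ATG(?:(?!ATG|TAA|TAG|TGA)[ACGT]{3}){10,}(?=TAA|TAG|TGA)"
)

def count_orfs(sequence: str) -> int:
    """Count non-overlapping ORF matches without building a list of them."""
    return sum(1 for _ in ORF_PATTERN.finditer(sequence))

# Toy input: two ORFs that are long enough, then one that is too short.
toy = ("ATG" + "AAA" * 10 + "TAA") * 2 + "ATG" + "AAA" * 3 + "TAG"
print(count_orfs(toy))  # 2
```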
2.5 Write a script to convert each ORF to the putative peptide that it would translate to, using this translation table. Express the peptide sequence as a string of single-letter codes (SLC), one per amino acid. What is the peptide sequence for the last ORF found? (50 points)
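A minimal translation sketch follows. It assumes the translation table referred to is the standard genetic code, written here in the compact form where codons are enumerated with bases in the order T, C, A, G and `*` marks a STOP codon. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(orf: str) -> str:
    """Translate a DNA string codon by codon; trailing partial codons are ignored."""
    usable = len(orf) - len(orf) % 3
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, usable, 3))

# The example ORF quoted in the problem statement (21 codons).
example = "ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA"
print(translate(example))  # MTYSPYCSSTHHIETLTNDRK
```

With a list of ORFs in hand, say a hypothetical `orfs`, the peptide for the last one would be `translate(orfs[-1])`.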