Homework: Finding ORFs

Q1. Use regular expressions to find potential open reading frames (ORFs) in chromosome 1 of Saccharomyces cerevisiae strain S288C (baker’s yeast). Here, we define an ORF to have the following properties:

  • START codon is ATG
  • STOP codon is TGA, TAA or TAG
  • Between START and STOP
    • Length must be a multiple of 3 bases
    • At least 10 codons
    • No other in-frame START or STOP codon

A codon is a sequence of 3 consecutive bases that codes for an amino acid; codons are read back to back without overlapping. For example, the first ORF in chromosome 1 of Saccharomyces cerevisiae strain S288C is ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA which consists of 21 codons. Each ORF should include the START codon (which codes for methionine) but not the STOP codon.
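As a quick check of the codon definition, the example ORF can be cut into consecutive triplets. This is only an illustrative sketch; the variable names `example_orf` and `codons` are not part of the assignment.

```python
# Example ORF copied from the assignment text.
example_orf = "ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA"

# Step through the sequence three bases at a time, without overlap.
codons = [example_orf[i:i + 3] for i in range(0, len(example_orf), 3)]

print(len(example_orf), len(codons))  # 63 21
print(codons[0], codons[-1])          # ATG AAA
```

Note that ATG, TGA and TAA all occur inside this sequence when it is read out of frame (for example ATGA at the very start), so the START/STOP restriction only makes sense for in-frame codons.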

  • 2.1 Write code to download the FASTA file from http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa (5 points)
  • 2.2 Write a single regular expression to capture all ORFs meeting the criteria given above (20 points)
  • 2.3 Apply the regular expression to find all matches in the downloaded sequence. You will have to exclude the FASTA comment lines that start with ‘>’ and do some processing to generate a single DNA sequence. (20 points)
  • 2.4 How many ORFs are found in total? (5 points)
  • 2.5 Write a script to convert each ORF to the putative peptide that it would translate to, using this translation table. Express the peptide sequence as a string of single-letter codes (SLC), one per amino acid. What is the peptide sequence for the last ORF found? (50 points)

2.1 Write code to download the FASTA file from http://downloads.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/chr01.fsa (5 points)
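One possible way to fetch the file uses only the Python standard library. This is a minimal sketch, not a reference solution: the function name `download_fasta` and the local file name `chr01.fsa` are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path

# URL given in the assignment.
CHR01_URL = (
    "http://downloads.yeastgenome.org/sequence/S288C_reference/"
    "chromosomes/fasta/chr01.fsa"
)


def download_fasta(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the local file `dest` and return its path."""
    dest_path = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return dest_path
```

Calling `download_fasta(CHR01_URL, "chr01.fsa")` would then save the chromosome 1 FASTA file in the working directory.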

In [1]:




2.2 Write a single regular expression to capture all ORFs meeting the criteria given above (20 points)
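The criteria can be expressed as one pattern along the following lines. This sketch rests on two assumptions that the assignment does not state explicitly: the "at least 10 codons" are counted between START and STOP (so START itself is not counted), and only the forward strand in upper-case A/C/G/T is searched. The name `ORF_PATTERN` is illustrative.

```python
import re

ORF_PATTERN = re.compile(
    r"""
    ATG                       # START codon, kept in the match
    (?:                       # one in-frame codon that is ...
        (?!ATG|TGA|TAA|TAG)   # ... neither another START nor a STOP
        [ACGT]{3}
    ){10,}                    # at least 10 such codons (assumed to exclude START)
    (?=TGA|TAA|TAG)           # a STOP must follow, but is not part of the match
    """,
    re.VERBOSE,
)

# Synthetic check: 10 internal codons match, 9 do not.
long_enough = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
too_short = "CC" + "ATG" + "GCT" * 9 + "TAA" + "GG"
print(ORF_PATTERN.findall(long_enough))  # ['ATGGCTGCTGCTGCTGCTGCTGCTGCTGCTGCT']
print(ORF_PATTERN.findall(too_short))    # []
```

Because every repetition consumes exactly three bases, the match length is automatically a multiple of 3, and the negative lookahead is only tested at codon boundaries, so out-of-frame ATG or STOP triplets are ignored.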

In [1]:




2.3 Apply the regular expression to find all matches in the downloaded sequence. You will have to exclude the FASTA comment lines that start with ‘>’ and do some processing to generate a single DNA sequence. (20 points)
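A sketch of the processing step follows: drop the header lines, join the remaining lines into one string, and scan it. The helper names `fasta_to_sequence` and `find_orfs` are hypothetical, and the pattern repeats the same assumptions as above (10 codons counted between START and STOP, forward strand only).

```python
import re

# Compact form of an ORF pattern; the STOP codon is checked but not captured.
ORF_PATTERN = re.compile(r"ATG(?:(?!ATG|TGA|TAA|TAG)[ACGT]{3}){10,}(?=TGA|TAA|TAG)")


def fasta_to_sequence(fasta_text: str) -> str:
    """Join all non-header lines of a FASTA record into one upper-case string."""
    lines = (line.strip() for line in fasta_text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">")).upper()


def find_orfs(sequence: str) -> list[str]:
    """Return all non-overlapping ORF matches, scanning left to right."""
    return ORF_PATTERN.findall(sequence)


# Small synthetic FASTA record; the ORF spans several lines.
demo = ">demo record\nCCATGGCTGCT\nGCTGCTGCTGCTGCT\nGCTGCTGCTTAAGG\n"
print(find_orfs(fasta_to_sequence(demo)))  # ['ATGGCTGCTGCTGCTGCTGCTGCTGCTGCTGCT']
```

For the real data, the text of the downloaded file would be passed to `fasta_to_sequence`, and `len(find_orfs(...))` then gives the total number of matches under these assumptions.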

In [1]:




2.4 How many ORFs are found in total? (5 points)

In [1]:




2.5 Write a script to convert each ORF to the putative peptide that it would translate to, using this translation table. Express the peptide sequence as a string of single-letter codes (SLC), one per amino acid. What is the peptide sequence for the last ORF found? (50 points)
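Translation can be sketched as a dictionary lookup per codon. The translation table linked in the assignment is not reproduced here, so this example assumes the standard genetic code, written in the usual T, C, A, G ordering; the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code (assumed), listed in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # first base T
    "LLLLPPPPHHQQRRRR"  # first base C
    "IIIMTTTTNNKKSSRR"  # first base A
    "VVVVAAAADDEEGGGG"  # first base G
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate a DNA string codon by codon into single-letter amino acid codes."""
    usable = len(orf) - len(orf) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, usable, 3))


# Example ORF copied from the assignment text.
example_orf = "ATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAA"
print(translate(example_orf))  # MTYSPYCSSTHHIETLTNDRK
```

Applying `translate` to the last element of the list of ORFs found earlier would give the peptide asked for in this question.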

In [1]: