# Matplotlib - exercises¶

## Yeast cell cycle data¶

The data were originally downloaded from the Yeast Cell Cycle Analysis Project Page. These data [1] by Spellman and Sherlock have likely been used in over a hundred papers.

1. look at the file so you know what you are getting into (less,wc)
2. copy this script into an editor and lets go over it
#!/usr/bin/env python

import csv,os,sys,pickle
import numpy as np

## read the file once to get numRows and numCols
txtFilePath = os.path.join("..","data","cellcycle.txt")
numRows = 0
expListIDs = np.array(expListIDs[1:])
numRows+=1

## populate a matrix and name vectors with file info
numColumns = len(expListIDs)
exprMat = np.zeros([numRows,numColumns])
rowInd = 0
geneList = []

row = np.array(linja[1:])
newRow = np.zeros(len(row),)
nanInds = np.where(row == '')
goodInds = np.where(row != '')
newRow[nanInds] = np.nan
newRow[goodInds] = [float(element) for element in row[goodInds]]
exprMat[rowInd,:] = newRow
rowInd +=1
geneList.append(linja[0])
geneList = np.array(geneList)

## print out info
print ".............."
print "matrix of size (%s,%s)  created..."%(exprMat.shape)
print "gene list size - %s"%geneList.size
print "exp list size  - %s"%expListIDs.size
print ".............."

## write the data to a file
outFilePath = os.path.join(".","excercise-np.pickle")
tmp = open(outFilePath,'w')
pickle.dump([geneList,expListIDs,exprMat],tmp)
tmp.close()

1. save the file using your editor to a directory and use it to read cellcycle.txt. To do this you will need to change at least one line?

2. create your own script(s) that does the following

• opens the pickle file
• calculates mean expression value for a gene
• plot expression values for a given gene in a histogram
• [extra] for a given gene create a lineplot that shows expression values for all the conditions
• [extra] make plot that has boxplots for the 5 genes with the greatest expression mean
• [extra] create a scatter plot (1 condition) where the negative expression values are green and positive ones are red

Note that:

>>> a = np.array([1,2,3,np.nan])
>>> a.max()
nan
>>> a.min()
nan
>>> a.mean()
nan
>>> np.where(np.isnan(a)==False)[0]
array([0, 1, 2])


Also, note that we do not transform, normalize or otherwise process the data in this example. We are using this data set as a learning tool. The missing data difficulties that arise are common in the biological sciences.

## Bibliographic notes¶

1. Spellman P T, Sherlock, G, Zhang, M Q, Iyer, V R, Anders, K, Eisen, M B, Brown, P O, Botstein, D, Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular biology of the cell, Vol. 9 (12): 3273-97, 1998. PubMed.