Data analysis with Python

We have seen how to perform data munging with regular expressions and Python. For a refresher, here is a Python program using regular expressions to munge the Ch3observations.txt file that we did on day 1 using TextWrangler.

import re

find = r'(\d+)\s+(\w{3})[\w\,\.]*\s+(\d+)\sat\s(\d+):(\d+)\s+([-\d\.]+)\s+([-\d\.]+).*'
replace = r'\3\t\2.\t\1\t\4\t\5\t\6\t\7'

for line in open('examples/Ch3observations.txt'):
    newline = re.sub(find, replace, line)
    print newline,

which when run produces this output

eris:pcfb cliburn$ python examples/
1752    Jan.    13      13      53      -1.414  5.781
1961    Mar.    17      03      46      14      3.6
2002    Oct.    1       18      22      36.51   -3.4221
1863    Jul.    20      12      02      1.74    133

This session will introduce you to the next stage of data analysis and data visualization. Because it is impossible to do any justice to these areas in a few hours, the aim of this session is to provide a taste of what analysis and visualization in Python look like, and a tour of some of the many modules available for scientific computation in Python.

Curve Fitting

One common analysis task performed by biologists is curve fitting. For example, we may want to fit a 4 parameter logistic (4PL) equation to ELISA data. The usual formula for the 4PL model is

\[f(x) = \frac{A-D}{1+(x/C)^B}+ D\]

where \(x\) is the concentration, \(A\) is the minimum asymptote, \(B\) is the steepness, \(C\) is the inflection point and \(D\) is the maximum asymptote.

import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

def logistic4(x, A, B, C, D):
    """4PL lgoistic equation."""
    return ((A-D)/(1.0+((x/C)**B))) + D

def residuals(p, y, x):
    """Deviations of data from fitted 4PL curve"""
    A,B,C,D = p
    err = y-logistic4(x, A, B, C, D)
    return err

def peval(x, p):
    """Evaluated value at x with current parameters."""
    A,B,C,D = p
    return logistic4(x, A, B, C, D)

# Make up some data for fitting and add noise
# In practice, y_meas would be read in from a file
x = np.linspace(0,20,20)
A,B,C,D = 0.5,2.5,8,7.3
y_true = logistic4(x, A, B, C, D)
y_meas = y_true + 0.2*npr.randn(len(x))

# Initial guess for parameters
p0 = [0, 1, 1, 1]

# Fit equation using least squares optimization
plsq = leastsq(residuals, p0, args=(y_meas, x))

# Plot results
plt.title('Least-squares 4PL fit to noisy data')
plt.legend(['Fit', 'Noisy', 'True'], loc='upper left')
for i, (param, actual, est) in enumerate(zip('ABCD', [A,B,C,D], plsq[0])):
    plt.text(10, 3-i*0.5, '%s = %.2f, est(%s) = %.2f' % (param, actual, param, est))

It will be straightforward to modify this code to use, for example, a five parameter logistic or other equation, offering a flexibility rarely available with standard analysis software.

Simulation-based statistics

With increasing computational power, it is now feasible to run many, many simulations to estimate parameters instead of, or in addition to, the traditiional parameteric statistial methods. Most of these methods are based on some form of resampling of the data available to estimate the null distribution, with well known examples being the bootstrap and permuation resampling.

Before we do this, we need to understand a little about how to get random numbers. The numpy.random module has random number generators for a variety of common probabiltiy distributions. These numbers are then used to simulate the generation of new random samples. If the samples are chosen in a certain way, the statistics of the randomly drawn samples can provide useful information about the properties of our original data sample. Here are some examples of random number generation in iptyhon.

In [1]: import numpy.random as npr

In [2]: npr.random(5)
Out[2]: array([ 0.88509491,  0.1530188 ,  0.86235945,  0.77324214,  0.27968431])

In [3]: npr.random((3,4))
array([[ 0.09011934,  0.2435081 ,  0.2463418 ,  0.49527452],
       [ 0.54057567,  0.30539368,  0.34848276,  0.64897038],
       [ 0.51525096,  0.19594408,  0.86945157,  0.02309191]])

In [4]: npr.normal(5, 1, 4)
Out[4]: array([ 5.39112784,  4.9500269 ,  6.05270335,  5.95333509])

In [5]: npr.randint(1, 7, 10)
Out[5]: array([4, 6, 4, 1, 4, 5, 3, 4, 4, 6])

In [6]: npr.uniform(1, 7, 10)
array([ 2.04285874,  3.94923612,  5.93212699,  2.39859964,  3.202536  ,
        2.30029199,  6.39038325,  4.43109617,  1.93328928,  1.58893167])

In [7]: npr.binomial(n=10, p=0.2, size=(4,4))
array([[1, 3, 0, 4],
       [1, 4, 5, 1],
       [5, 2, 1, 0],
       [2, 3, 2, 1]])

In [8]: x =  [1,2,3,4,5,6]

In [9]: npr.shuffle(x)

In [10]: x
Out[10]: [4, 3, 6, 2, 1, 5]

In [11]: npr.permutation(10)
Out[11]: array([8, 4, 7, 6, 5, 1, 0, 2, 3, 9])

For example, choosing a new sample with replacement from an existing sample (i.e. we draw one item from the data, record what it is, then replace it in the data and repeat to get a new sample) can be done efficiently in this way:

In [1]: import numpy as np

In [2]: import numpy.random as npr

In [3]: data = np.array(['tom', 'jerry', 'mickey', 'minnie', 'pocahontas'])

In [4]: idx = npr.randint(0, len(data), (4,len(data)))

In [5]: idx
array([[4, 0, 1, 3, 2],
       [4, 1, 0, 1, 4],
       [0, 4, 4, 2, 3],
       [0, 4, 2, 1, 3]])

In [6]: samples_with_replacement = data[idx]

In [7]: samples_with_replacement
array([['pocahontas', 'tom', 'jerry', 'minnie', 'mickey'],
       ['pocahontas', 'jerry', 'tom', 'jerry', 'pocahontas'],
       ['tom', 'pocahontas', 'pocahontas', 'mickey', 'minnie'],
       ['tom', 'pocahontas', 'mickey', 'jerry', 'minnie']], 

In the next version of numpy (1.7.0), a new function choice is available in numpy.random to do the same thing with a nicer syntax. Version 1.7.0 is only currently available from the git repository as source code that you must compile yourself, but should be available for easy_install/pip installation soon.

In [1]: import numpy.random as npr

In [2]: data = ['tom', 'jerry', 'mickey', 'minnie', 'pocahontas']

# only availlable if you install numpy 1.7.0 from the git repository
In [3]: npr.choice(data, size=(4, len(data)), replace=True)
array([['tom', 'minnie', 'pocahontas', 'pocahontas', 'pocahontas'],
       ['minnie', 'tom', 'mickey', 'mickey', 'minnie'],
       ['minnie', 'pocahontas', 'tom', 'pocahontas', 'mickey'],
       ['tom', 'mickey', 'jerry', 'jerry', 'mickey']], 

Moving on our first simulation example - if we want to plot the 95% confidence interval for the mean of our data samples, we can use the bootstrap to do so. The basic idea is simple - draw many, many samples with replacement from the data available, estimate the mean from each sample, then rank order the means to estimate the 2.5 and 97.5 percentile values for 95% confidence interval. Unlike using normal assumptions to calculate 95% CI, the results generated by the bootstrap are robust even if the underlying data are very far from normal.

import numpy as np
import numpy.random as npr
import pylab

def bootstrap(data, num_samples, statistic, alpha):
    """Returns bootstrap estimate of 100.0*(1-alpha) CI for statistic."""
    n = len(data)
    idx = npr.randint(0, n, (num_samples, n))
    samples = data[idx]
    stat = np.sort(statistic(samples, 1))
    return (stat[int((alpha/2.0)*num_samples)],

if __name__ == '__main__':
    # data of interest is bimodal and obviously not normal
    x = np.concatenate([npr.normal(3, 1, 100), npr.normal(6, 2, 200)])

    # find mean 95% CI and 100,000 bootstrap samples
    low, high = bootstrap(x, 100000, np.mean, 0.05)

    # make plots
    pylab.hist(x, 50, histtype='step')
    pylab.title('Historgram of data')
    pylab.plot([-0.03,0.03], [np.mean(x), np.mean(x)], 'r', linewidth=2)
    pylab.scatter(0.1*(npr.random(len(x))-0.5), x)
    pylab.plot([0.19,0.21], [low, low], 'r', linewidth=2)
    pylab.plot([0.19,0.21], [high, high], 'r', linewidth=2)
    pylab.plot([0.2,0.2], [low, high], 'r', linewidth=2)
    pylab.xlim([-0.2, 0.3])
    pylab.title('Bootstrap 95% CI for mean')

Note that the bootstrap function is a higher order function, and will return the boostrap CI for any valid statistical function, not just the mean. For example, to find the 95% CI for the standard deviation, we only need to change np.mean to np.std in the arguments:

# find standard deviation 95% CI bootstrap samples
low, high =  bootstrap(x, 100000, np.std, 0.05)

The function is also highly optimized, and takes under 2 seconds to calculate the boostrap mean for a data sample of size 300 using 100,000 bootstrap samples on a 4 year old MacBook Pro with 2.4 GHz Intel Core 2 Duo processor.

Permutation-resampling is another form of simulation-based statistical calculation, and is often used to evaluate the p-value for the difference between two groups, under the null hypothesis that the groups are invariant under label permutation. For example, in a case-control study, it can be used to find the p-value that hypothesis that the mean of the case group is different from that of the control group, and we cannot use the t-test because the distributions are highly skewed.

import numpy as np
import numpy.random as npr
import pylab

def permutation_resampling(case, control, num_samples, statistic):
    """Returns p-value that statistic for case is different
    from statistc for control."""

    observed_diff = abs(statistic(case) - statistic(control))
    num_case = len(case)

    combined = np.concatenate([case, control])
    diffs = []
    for i in range(num_samples):
        xs = npr.permutation(combined)
        diff = np.mean(xs[:num_case]) - np.mean(xs[num_case:])

    pval = (np.sum(diffs > observed_diff) +
            np.sum(diffs < -observed_diff))/float(num_samples)
    return pval, observed_diff, diffs

if __name__ == '__main__':
    # make up some data
    case = [94, 38, 23, 197, 99, 16, 141]
    control = [52, 10, 40, 104, 51, 27, 146, 30, 46]

    # find p-value by permutation resampling
    pval, observed_diff, diffs = \
          permutation_resampling(case, control, 10000, np.mean)

    # make plots
    pylab.title('Empirical null distribution for differences in mean')
    pylab.hist(diffs, bins=100, histtype='step', normed=True)
    pylab.axvline(observed_diff, c='red', label='diff')
    pylab.axvline(-observed_diff, c='green', label='-diff')
    pylab.text(60, 0.01, 'p = %.3f' % pval, fontsize=16)

Data visualization

Data visualization is important for exploratory data analysis as well as for communication of scientific results. Using ipython with the -- pylab option provides an interarctive environment that is ideal for exploratory data analysis. The classic Anscombe data set illustrates the importance of visualization when analysing data. The Anscombe data set consists of 4 different sets of (x,y) values, with esentially identical values for the

  1. mean of x
  2. variance of x
  3. mean of y
  4. variance of y
  5. correlation between x and y
  6. linear regression intercept and slope

Plotting the actual data sets quickly shows that the data sets are not as similar as suggested by the summary stattstics!

import numpy as np
import pylab

xs = np.loadtxt('anscombe.txt')
for i in range(4):
    x = xs[:,i*2]
    y = xs[:,i*2+1]
    A = np.vstack([x, np.ones(len(x))]).T
    m, c = np.linalg.lstsq(A, y)[0]

    pylab.scatter(x, y)
    pylab.plot(x, m*x+c, 'r')


For communication of results, matplotlib offers a huge range of graphics. We have only scratched the surface of what the package has to offer. The fastest way to get a custom graphic to communicate your results is to look at the thumbnails at If one of the graphics looks appropirate for your needs, just click on the thumbnail to get the source code. You should now know enough Python to customize the graphic to your specific needs.