Resampling and Monte Carlo Simulations¶

Broadly, any simulation that relies on random sampling to obtain results falls into the category of Monte Carlo methods. Another common type of statistical experiment is repeated sampling from a data set, which includes the bootstrap, the jackknife and permutation resampling. The two are often combined, as when we use a random subset of permutations rather than the full set, whose size grows as \(O(n!)\) and is typically infeasible to enumerate. What Monte Carlo simulations have in common is that they are typically more flexible but also more computationally demanding than methods based on asymptotic results. Because of their flexibility and the inexorable growth of computing power, I expect these computational simulation methods to only become more popular over time.

Setting the random seed¶

In any probabilistic simulation, it is prudent to set the random number seed so that results can be replicated.

In [1]:

import numpy as np
np.random.seed(123)
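
A newer alternative to the global seed, shown here as a minimal sketch, is the NumPy Generator interface created with np.random.default_rng. The names rng_a and rng_b are illustrative and not part of the original notebook. Two generators built from the same seed produce the same stream, which is what makes a simulation replicable:

```python
import numpy as np

# Two independent generators seeded identically
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)

draw_a = rng_a.integers(0, 10, size=5)
draw_b = rng_b.integers(0, 10, size=5)

# Same seed, same sequence of draws
same = bool(np.array_equal(draw_a, draw_b))
print(same)
```

Passing a generator object around explicitly avoids hidden dependence on global state, at the cost of a slightly more verbose call style.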

Sampling with and without replacement¶

In [2]:

# Sampling is done with replacement by default
np.random.choice(4, 12)

Out[2]:

array([2, 1, 2, 2, 0, 2, 2, 1, 3, 2, 3, 1])

In [3]:

# Probability weights can be given
np.random.choice(4, 12, p=[.4, .1, .1, .4])

Out[3]:

array([3, 3, 1, 0, 0, 3, 1, 0, 0, 3, 0, 0])

In [4]:

x = np.random.randint(0, 10, (8, 12))
x

Out[4]:

array([[7, 2, 4, 8, 0, 7, 9, 3, 4, 6, 1, 5],
       [6, 2, 1, 8, 3, 5, 0, 2, 6, 2, 4, 4],
       [6, 3, 0, 6, 4, 7, 6, 7, 1, 5, 7, 9],
       [2, 4, 8, 1, 2, 1, 1, 3, 5, 9, 0, 8],
       [1, 6, 3, 3, 5, 9, 7, 9, 2, 3, 3, 3],
       [8, 6, 9, 7, 6, 3, 9, 6, 6, 6, 1, 3],
       [4, 3, 1, 0, 5, 8, 6, 8, 9, 1, 0, 3],
       [1, 3, 4, 7, 6, 1, 4, 3, 3, 7, 6, 8]])

In [5]:

# sampling individual elements
np.random.choice(x.ravel(), 12)

Out[5]:

array([1, 2, 4, 7, 1, 2, 2, 6, 7, 3, 8, 4])

In [6]:

# sampling rows
idx = np.random.choice(x.shape[0], 4)
x[idx, :]

Out[6]:

array([[4, 3, 1, 0, 5, 8, 6, 8, 9, 1, 0, 3],
       [4, 3, 1, 0, 5, 8, 6, 8, 9, 1, 0, 3],
       [6, 2, 1, 8, 3, 5, 0, 2, 6, 2, 4, 4],
       [4, 3, 1, 0, 5, 8, 6, 8, 9, 1, 0, 3]])

In [7]:

# sampling columns
idx = np.random.choice(x.shape[1], 4)
x[:, idx]

Out[7]:

array([[9, 4, 3, 1],
       [0, 6, 2, 4],
       [6, 1, 7, 7],
       [1, 5, 3, 0],
       [7, 2, 9, 3],
       [9, 6, 6, 1],
       [6, 9, 8, 0],
       [4, 3, 3, 6]])

Sampling without replacement¶

In [8]:

# Give the argument replace=False
try:
    np.random.choice(4, 12, replace=False)
except ValueError as e:
    print(e)

Cannot take a larger sample than population when 'replace=False'
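
The error above arises only because 12 values were requested from a population of 4. As a small sketch (the population size of 10 is an arbitrary choice for illustration), sampling without replacement works whenever the sample is no larger than the population, and every drawn value is then distinct:

```python
import numpy as np

np.random.seed(123)

# Draw 4 distinct values from range(10)
sample = np.random.choice(10, 4, replace=False)
n_unique = len(set(sample.tolist()))
print(sample, n_unique)
```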

Random shuffling¶

In [9]:

x

Out[9]:

array([[7, 2, 4, 8, 0, 7, 9, 3, 4, 6, 1, 5],
       [6, 2, 1, 8, 3, 5, 0, 2, 6, 2, 4, 4],
       [6, 3, 0, 6, 4, 7, 6, 7, 1, 5, 7, 9],
       [2, 4, 8, 1, 2, 1, 1, 3, 5, 9, 0, 8],
       [1, 6, 3, 3, 5, 9, 7, 9, 2, 3, 3, 3],
       [8, 6, 9, 7, 6, 3, 9, 6, 6, 6, 1, 3],
       [4, 3, 1, 0, 5, 8, 6, 8, 9, 1, 0, 3],
       [1, 3, 4, 7, 6, 1, 4, 3, 3, 7, 6, 8]])

In [10]:

# Shuffling occurs "in place" for efficiency
np.random.shuffle(x)
x

Out[10]:

array([[7, 2, 4, 8, 0, 7, 9, 3, 4, 6, 1, 5],
       [4, 3, 1, 0, 5, 8, 6, 8, 9, 1, 0, 3],
       [8, 6, 9, 7, 6, 3, 9, 6, 6, 6, 1, 3],
       [2, 4, 8, 1, 2, 1, 1, 3, 5, 9, 0, 8],
       [6, 3, 0, 6, 4, 7, 6, 7, 1, 5, 7, 9],
       [6, 2, 1, 8, 3, 5, 0, 2, 6, 2, 4, 4],
       [1, 3, 4, 7, 6, 1, 4, 3, 3, 7, 6, 8],
       [1, 6, 3, 3, 5, 9, 7, 9, 2, 3, 3, 3]])

In [11]:

# To shuffle columns instead, transpose before shuffling
np.random.shuffle(x.T)
x

Out[11]:

array([[7, 0, 4, 7, 9, 8, 1, 6, 4, 3, 2, 5],
       [8, 5, 1, 4, 6, 0, 0, 1, 9, 8, 3, 3],
       [3, 6, 9, 8, 9, 7, 1, 6, 6, 6, 6, 3],
       [1, 2, 8, 2, 1, 1, 0, 9, 5, 3, 4, 8],
       [7, 4, 0, 6, 6, 6, 7, 5, 1, 7, 3, 9],
       [5, 3, 1, 6, 0, 8, 4, 2, 6, 2, 2, 4],
       [1, 6, 4, 1, 4, 7, 6, 7, 3, 3, 3, 8],
       [9, 5, 3, 1, 7, 3, 3, 3, 2, 9, 6, 3]])

In [12]:

# numpy.random.permutation does the same thing but returns a copy
np.random.permutation(x)

Out[12]:

array([[7, 0, 4, 7, 9, 8, 1, 6, 4, 3, 2, 5],
       [1, 6, 4, 1, 4, 7, 6, 7, 3, 3, 3, 8],
       [1, 2, 8, 2, 1, 1, 0, 9, 5, 3, 4, 8],
       [7, 4, 0, 6, 6, 6, 7, 5, 1, 7, 3, 9],
       [9, 5, 3, 1, 7, 3, 3, 3, 2, 9, 6, 3],
       [3, 6, 9, 8, 9, 7, 1, 6, 6, 6, 6, 3],
       [8, 5, 1, 4, 6, 0, 0, 1, 9, 8, 3, 3],
       [5, 3, 1, 6, 0, 8, 4, 2, 6, 2, 2, 4]])

In [13]:

# When given an integer n, permutation treats it as the array arange(n)
np.random.permutation(10)

Out[13]:

array([4, 0, 6, 7, 5, 1, 8, 2, 3, 9])

In [14]:

# Use indices if you need to shuffle collections of arrays in synchrony
x = np.arange(12).reshape(4,3)
y = x + 10
idx = np.random.permutation(x.shape[0])
list(zip(x[idx, :], y[idx, :]))

Out[14]:

[(array([ 9, 10, 11]), array([19, 20, 21])),
 (array([3, 4, 5]), array([13, 14, 15])),
 (array([6, 7, 8]), array([16, 17, 18])),
 (array([0, 1, 2]), array([10, 11, 12]))]

Bootstrap¶

The bootstrap is commonly used to estimate the sampling distribution of a statistic when no convenient theoretical result is available. We have already seen the bootstrap used to estimate confidence bounds for the convergence of Monte Carlo integration.

In [1]:

# For example, what is the 95% confidence interval for
# the 10th percentile of this data set if you didn't know how it was generated?

import matplotlib.pyplot as plt

x = np.concatenate([np.random.exponential(size=200), np.random.normal(size=100)])
plt.hist(x, 25, histtype='step', linewidth=1)
pass

_images/15B_ResamplingAndSimulation_20_0.png

In [7]:

import seaborn as sns

n = len(x)
reps = 10000
xb = np.random.choice(x, (n, reps))
mb = np.percentile(xb, 10, axis=0)
mb.sort()

lower, upper = np.percentile(mb, [2.5, 97.5])
sns.kdeplot(mb)
for v in (lower, upper):
    plt.axvline(v, color='red')

_images/15B_ResamplingAndSimulation_21_0.png
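
The percentile bootstrap used above can be wrapped in a small helper. This is a sketch under assumptions: the function name bootstrap_ci, its parameters and the use of np.random.default_rng are illustrative choices, not part of the original notebook, and the data are freshly simulated.

```python
import numpy as np

def bootstrap_ci(data, stat, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Each row of idx indexes one bootstrap resample (with replacement)
    idx = rng.integers(0, n, size=(reps, n))
    boot = np.array([stat(data[row]) for row in idx])
    lower, upper = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

rng = np.random.default_rng(1)
data = rng.exponential(size=300)

# 95% interval for the 10th percentile
lower, upper = bootstrap_ci(data, lambda v: np.percentile(v, 10))
print(lower, upper)
```

Taking the statistic as a function argument means the same helper covers medians, trimmed means or any other summary that has no simple standard error formula.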

Bootstrap example for Monte Carlo integration¶

In [17]:

def f(x):
    return x * np.cos(71*x) + np.sin(13*x)

In [18]:

x = np.linspace(0, 1, 100)
plt.plot(x, f(x))
pass

_images/15B_ResamplingAndSimulation_24_0.png

In [19]:

# data sample for integration
n = 100
x = f(np.random.random(n))

In [20]:

# bootstrap MC integration
reps = 1000
xb = np.random.choice(x, (n, reps), replace=True)
yb = 1/np.arange(1, n+1)[:, None] * np.cumsum(xb, axis=0)
lower, upper = np.percentile(yb, [2.5, 97.5], axis=1)

In [21]:

plt.plot(np.arange(1, n+1)[:, None], yb, c='grey', alpha=0.02)
plt.plot(np.arange(1, n+1), yb[:, 0], c='red', linewidth=1)
plt.plot(np.arange(1, n+1), upper, 'b', np.arange(1, n+1), lower, 'b')
pass

_images/15B_ResamplingAndSimulation_27_0.png
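
As a cross-check on the integration example, the sketch below compares a plain Monte Carlo estimate of the integral of f over [0, 1] with the value obtained from the antiderivative. The sample size, the seed and the variable names are assumptions made for illustration; the standard error here comes from the usual formula rather than from the bootstrap.

```python
import numpy as np

def f(x):
    return x * np.cos(71 * x) + np.sin(13 * x)

rng = np.random.default_rng(0)
n = 100_000

# Monte Carlo estimate: the mean of f at uniform random points
fx = f(rng.random(n))
estimate = fx.mean()
std_err = fx.std(ddof=1) / np.sqrt(n)

# Closed form from integrating x*cos(71x) by parts and sin(13x) directly
exact = (1 - np.cos(13)) / 13 + np.sin(71) / 71 + (np.cos(71) - 1) / 71**2

print(estimate, std_err, exact)
```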

Permutation resampling¶

For flexible hypothesis testing¶

Suppose you have 2 data sets from unknown distributions and you want to test whether some arbitrary statistic (e.g. the 7th percentile) is the same in the 2 data sets. What can you do?

An appropriate test statistic is the difference between the 7th percentiles, and if we knew the null distribution of this statistic, we could test the null hypothesis that the difference is 0. Permuting the labels of the 2 data sets allows us to create the empirical null distribution.

Create two data sets for comparison¶

In [9]:

x = np.r_[np.random.exponential(size=200),
          np.random.normal(0, 1, size=100)]
y = np.r_[np.random.exponential(size=250),
          np.random.normal(0, 1, size=50)]

Generate permutations of labels for 10,000 comparisons¶

In [10]:

n1, n2 = map(len, (x, y))
reps = 10000

data = np.r_[x, y]
ps = np.array([np.random.permutation(n1+n2) for i in range(reps)])

Estimate empirical null distribution for differences between samples¶

In [11]:

xp = data[ps[:, :n1]]
yp = data[ps[:, n1:]]
samples = np.percentile(xp, 7, axis=1) - np.percentile(yp, 7, axis=1)

Plot the results¶

In [12]:

plt.hist(samples, 25, histtype='step', color='red')
test_stat = np.percentile(x, 7) - np.percentile(y, 7)
plt.axvline(test_stat)
plt.axvline(np.percentile(samples, 2.5), linestyle='--')
plt.axvline(np.percentile(samples, 97.5), linestyle='--')
print("p-value =", 2*np.sum(samples >= np.abs(test_stat))/reps)

p-value = 0.0008

_images/15B_ResamplingAndSimulation_37_1.png
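
The same procedure can be packaged as a reusable function. The sketch below makes two choices that differ from the cells above and should be read as assumptions: it counts permuted differences at least as extreme in absolute value instead of doubling one tail, and it uses the add-one correction (count + 1)/(reps + 1) so that the p-value is never exactly zero. The name permutation_test and the simulated data are hypothetical.

```python
import numpy as np

def permutation_test(x, y, stat, reps=999, seed=0):
    """Two-sided permutation p-value for stat(x) - stat(y)."""
    rng = np.random.default_rng(seed)
    data = np.concatenate([x, y])
    n1 = len(x)
    observed = stat(x) - stat(y)
    null = np.empty(reps)
    for i in range(reps):
        # Shuffle the pooled data, then split at the original group size
        perm = rng.permutation(data)
        null[i] = stat(perm[:n1]) - stat(perm[n1:])
    # Add-one correction keeps the p-value strictly positive
    p = (np.sum(np.abs(null) >= np.abs(observed)) + 1) / (reps + 1)
    return observed, p

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 100)
y = rng.normal(3, 1, 100)
observed, p = permutation_test(x, y, np.mean)
print(observed, p)
```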

Adjusting p-values for multiple testing¶

We will make up some data. A typical example is trying to identify genes that are differentially expressed in two groups of people, perhaps those who are healthy and those who are sick. For each gene, we can perform a t-test to see if the gene is differentially expressed across the two groups at some nominal significance level, typically 0.05. When we have many genes, this is unsatisfactory, since we expect about 5% of the genes that are not differentially expressed to be flagged as significant just by chance.

One possible solution is to control the family-wise error rate (FWER) instead, most simply by using the Bonferroni adjusted p-value. The Bonferroni adjustment is valid whatever the dependence between tests, but it becomes very conservative when the tests are strongly correlated. An alternative is the non-parametric method originally proposed by Westfall and Young, which uses permutation resampling to estimate the adjusted p-values and takes the correlation between tests into account.

In [35]:

x = np.array([1,2,3]).reshape((-1,1))

In [36]:

x @ x.T

Out[36]:

array([[1, 2, 3],
       [2, 4, 6],
       [3, 6, 9]])

In [96]:

ngenes = 100
ncases = 500
nctrls = 500
nsamples = ncases + nctrls
x = np.random.normal(0, 1, (ngenes, nsamples))

target_genes = [5,15,25,35,45]
x[target_genes, ncases:] += np.random.normal(1, 1, (len(target_genes), ncases))

In [97]:

import scipy.stats as stats

In [98]:

%precision 3

Out[98]:

'%.3f'

In [99]:

t, p0 = stats.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
idx = p0 < 0.05
list(zip(np.nonzero(idx)[0], p0[idx]))

Out[99]:

[(5, 0.000),
 (15, 0.000),
 (25, 0.000),
 (33, 0.042),
 (35, 0.000),
 (45, 0.000),
 (65, 0.002),
 (81, 0.028),
 (82, 0.045)]

In [100]:

vmin = x.min()
vmax = x.max()

plt.subplot(121)
plt.imshow(x[:, :ncases], extent=[0, 1, 0, 2], interpolation='nearest',
           vmin=vmin, vmax=vmax, cmap='jet')
plt.xticks([])
plt.yticks([])
plt.title('Controls')
plt.subplot(122)
plt.imshow(x[:, ncases:], extent=[0, 1, 0, 2], interpolation='nearest',
           vmin=vmin, vmax=vmax, cmap='jet')
plt.xticks([])
plt.yticks([])
plt.title('Cases')
plt.colorbar()
pass

_images/15B_ResamplingAndSimulation_47_0.png

In [101]:

p1 = np.clip(len(p0) * p0, 0, 1)
idx = p1 < 0.05
list(zip(np.nonzero(idx)[0], p1[idx]))

Out[101]:

[(5, 0.000), (15, 0.000), (25, 0.000), (35, 0.000), (45, 0.000)]

The Westfall-Young permutation method gives results similar to Bonferroni when features are uncorrelated, but is more powerful when features are correlated.

In [110]:

nperms = 10000
k = ngenes

t, p0 = stats.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
ranks = np.argsort(np.abs(t))[::-1]
counts = np.zeros((nperms, k))
for i in range(nperms):
    u = np.zeros(k)
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    tb, pb = stats.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)
    u[k-1] = np.abs(tb[ranks[k-1]])
    for j in range(k-2, -1, -1):
        u[j] = max(u[j+1], np.abs(tb[ranks[j]]))
    counts[i] = (u >= np.abs(t[ranks]))

p2 = np.sum(counts, axis=0)/nperms
for i in range(1, k):
    p2[i] = max(p2[i],p2[i-1])
idx = p2 < 0.05
list(zip(ranks[idx], p2[idx]))

Out[110]:

[(99, 0.012),
 (36, 0.012),
 (26, 0.012),
 (27, 0.012),
 (28, 0.012),
 (29, 0.012),
 (30, 0.012),
 (31, 0.012),
 (32, 0.012),
 (33, 0.012),
 (34, 0.012),
 (35, 0.012),
 (37, 0.012),
 (98, 0.012),
 (38, 0.012),
 (39, 0.012),
 (40, 0.012),
 (41, 0.012),
 (42, 0.012),
 (43, 0.012),
 (44, 0.012),
 (45, 0.012),
 (46, 0.012),
 (47, 0.012),
 (25, 0.012),
 (24, 0.012),
 (23, 0.012),
 (22, 0.012),
 (1, 0.012),
 (2, 0.012),
 (3, 0.012),
 (4, 0.012),
 (5, 0.012),
 (6, 0.012),
 (7, 0.012),
 (8, 0.012),
 (9, 0.012),
 (10, 0.012),
 (11, 0.012),
 (12, 0.012),
 (13, 0.012),
 (14, 0.012),
 (15, 0.012),
 (16, 0.012),
 (17, 0.012),
 (18, 0.012),
 (19, 0.012),
 (20, 0.012),
 (21, 0.012),
 (48, 0.012),
 (49, 0.012),
 (50, 0.012),
 (75, 0.012),
 (77, 0.012),
 (78, 0.012),
 (79, 0.012),
 (80, 0.012),
 (81, 0.012),
 (82, 0.012),
 (83, 0.012),
 (84, 0.012),
 (85, 0.012),
 (86, 0.012),
 (87, 0.012),
 (88, 0.012),
 (89, 0.012),
 (90, 0.012),
 (91, 0.012),
 (92, 0.012),
 (93, 0.012),
 (94, 0.012),
 (95, 0.012),
 (96, 0.012),
 (97, 0.012),
 (76, 0.012),
 (74, 0.012),
 (51, 0.012),
 (73, 0.012),
 (52, 0.012),
 (53, 0.012),
 (54, 0.012),
 (55, 0.012),
 (56, 0.012),
 (57, 0.012),
 (58, 0.012),
 (59, 0.012),
 (60, 0.012),
 (61, 0.012),
 (62, 0.012),
 (63, 0.012),
 (64, 0.012),
 (65, 0.012),
 (66, 0.012),
 (67, 0.012),
 (68, 0.012),
 (69, 0.012),
 (70, 0.012),
 (71, 0.012),
 (72, 0.012),
 (0, 0.012)]

In [103]:

plt.plot(sorted(p0), label='No correction')
plt.plot(sorted(p1), label='Bonferroni')
plt.plot(sorted(p2), label='Westfall-Young')
plt.ylim([0,1])
plt.legend(loc='best')
pass

_images/15B_ResamplingAndSimulation_52_0.png
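
The step-down loop used in the permutation cell can also be written with cumulative maxima. The following is a sketch rather than the notebook's own code: the function name westfall_young_maxt, its arguments and the small simulated data set are assumptions, and the adjusted p-values are returned in the original gene order instead of rank order.

```python
import numpy as np
from scipy import stats

def westfall_young_maxt(x, n1, nperms=1000, seed=0):
    """Step-down maxT adjusted p-values; rows of x are features."""
    rng = np.random.default_rng(seed)
    k, nsamples = x.shape
    t, _ = stats.ttest_ind(x[:, :n1], x[:, n1:], axis=1)
    order = np.argsort(np.abs(t))[::-1]      # most significant first
    t_sorted = np.abs(t[order])
    counts = np.zeros(k)
    for _ in range(nperms):
        y = x[:, rng.permutation(nsamples)]  # permute sample labels
        tb, _pb = stats.ttest_ind(y[:, :n1], y[:, n1:], axis=1)
        tb_sorted = np.abs(tb[order])
        # Successive maxima, starting from the least significant feature
        u = np.maximum.accumulate(tb_sorted[::-1])[::-1]
        counts += (u >= t_sorted)
    # Enforce monotone adjusted p-values along the ranking
    p_sorted = np.maximum.accumulate(counts / nperms)
    p_adj = np.empty(k)
    p_adj[order] = p_sorted
    return p_adj

rng = np.random.default_rng(1)
x = rng.normal(0, 1, (20, 80))
x[0, 40:] += 3.0                             # one truly different feature
p_adj = westfall_young_maxt(x, n1=40, nperms=500)
print(p_adj[0], p_adj.min())
```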

The Bonferroni adjustment makes no use of the correlation between tests. Test results are often strongly correlated (e.g. genes in the same pathway behave similarly), and in that case Bonferroni will be too conservative. The permutation-resampling method, however, still works well in the presence of correlations.

In [105]:

ngenes = 100
ncases = 500
nctrls = 500
nsamples = ncases + nctrls

# use random number seed known to give a difference
np.random.seed(52)
x = np.repeat(np.random.normal(0, 1, (1, nsamples)), ngenes, axis=0)

In [106]:

# In this extreme case, we measure the same gene 100 times
x[:5, :5]

Out[106]:

array([[ 0.519, -1.269,  0.24 , -0.804,  0.017],
       [ 0.519, -1.269,  0.24 , -0.804,  0.017],
       [ 0.519, -1.269,  0.24 , -0.804,  0.017],
       [ 0.519, -1.269,  0.24 , -0.804,  0.017],
       [ 0.519, -1.269,  0.24 , -0.804,  0.017]])

In [107]:

t, p0 = stats.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
idx = p0 < 0.05
print('Minimum p-value', p0.min(), '# significant', idx.sum())

Minimum p-value 0.0119317780363 # significant 100

Bonferroni tells us none of the adjusted p-values are significant, which we know is the wrong answer.

In [108]:

p1 = np.clip(len(p0) * p0, 0, 1)
idx = p1 < 0.05
print('Minimum p-value', p1.min(), '# significant', idx.sum())

Minimum p-value 1.0 # significant 0

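The Westfall-Young permutation method tells us that every gene is significant, which is the correct answer, since the 100 genes are copies of a single measurement that is significant at the 0.05 level on its own.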

In [111]:

nperms = 10000

counts = np.zeros((nperms, k))
t, p0 = stats.ttest_ind(x[:, :ncases], x[:, ncases:], axis=1)
ranks = np.argsort(np.abs(t))[::-1]
for i in range(nperms):
    u = np.zeros(k)
    sidx = np.random.permutation(nsamples)
    y = x[:, sidx]
    tb, pb = stats.ttest_ind(y[:, :ncases], y[:, ncases:], axis=1)
    u[k-1] = np.abs(tb[ranks[k-1]])
    for j in range(k-2, -1, -1):
        u[j] = max(u[j+1], np.abs(tb[ranks[j]]))
    counts[i] = (u >= np.abs(t[ranks]))

p2 = np.sum(counts, axis=0)/nperms
for i in range(1, k):
    p2[i] = max(p2[i],p2[i-1])
idx = p2 < 0.05

print('Minimum p-value', p2.min(), '# significant', idx.sum())

Minimum p-value 0.0112 # significant 100

In [112]:

plt.plot(sorted(p1), label='Bonferroni')
plt.plot(sorted(p2), label='Westfall-Young')
plt.ylim([-0.05,1.05])
plt.legend(loc='best')
pass

_images/15B_ResamplingAndSimulation_62_0.png

“Leave one out” resampling methods¶

Jackknife estimate of parameters¶

This shows the leave-one-out calculation idiom for Python. Unlike R, a -k index to an array does not delete the kth entry, but returns the kth entry from the end, so we need another way to drop one scalar or vector. This can be done using Boolean indexing as shown in the examples below. Note that Boolean indexing returns a copy of the selected elements rather than a view, so each leave-one-out subset allocates a new array.

In [40]:

def jackknife(x, func):
    """Jackknife estimate of the estimator func"""
    n = len(x)
    idx = np.arange(n)
    return np.mean([func(x[idx!=i]) for i in range(n)])

In [41]:

# Jackknife estimate of standard deviation
x = np.random.normal(0, 2, 100)
jackknife(x, np.std)

Out[41]:

2.029

In [42]:

def jackknife_var(x, func):
    """Jackknife estimate of the variance of the estimator func."""
    n = len(x)
    idx = np.arange(n)
    j_est = jackknife(x, func)
    return (n-1)/(n + 0.0) * np.sum([(func(x[idx!=i]) - j_est)**2.0 for i in range(n)])

In [43]:

# estimate of the variance of an estimator
jackknife_var(x, np.std)

Out[43]:

0.022
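
One way to sanity-check the variance formula, sketched here with freshly simulated data and an illustrative seed, is to apply it to the sample mean. For that estimator the jackknife variance reduces algebraically to the familiar s²/n, so the two numbers should agree up to floating point error.

```python
import numpy as np

def jackknife_var(x, func):
    """Jackknife estimate of the variance of the estimator func."""
    x = np.asarray(x)
    n = len(x)
    idx = np.arange(n)
    # Leave-one-out estimates
    loo = np.array([func(x[idx != i]) for i in range(n)])
    return (n - 1) / n * np.sum((loo - loo.mean()) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(0, 2, 100)

jk = jackknife_var(x, np.mean)
classic = x.var(ddof=1) / len(x)   # usual variance of the sample mean
print(jk, classic)
```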

Leave one out cross validation (LOOCV)¶

LOOCV also uses the same idiom, and a simple example of LOOCV for model selection is illustrated.

In [113]:

a, b, c = 2, 3, 4
x = np.linspace(0, 5, 6)
y = a*x**2 + b*x + c + np.random.normal(0, 3, len(x))

In [114]:

plt.figure(figsize=(15,3))
for deg in range(1, 6):
    plt.subplot(1, 6, deg)
    beta = np.polyfit(x, y, deg)
    plt.plot(x, y, 'r:o')
    plt.plot(x, np.polyval(beta, x), 'b-')
    plt.title('Degree = %d' % deg)
    plt.margins(0.04)

_images/15B_ResamplingAndSimulation_71_0.png

In [115]:

def loocv(x, y, fit, pred, deg):
    """Sum of squared residuals over all points, accumulated over the leave-one-out fits of a polynomial model."""
    n = len(x)
    idx = np.arange(n)
    rss = np.sum([(y - pred(fit(x[idx!=i], y[idx!=i], deg), x))**2.0 for i in range(n)])
    return rss

In [116]:

# RSS does not detect overfitting and selects the most complex model
for deg in range(1, 6):
    print('Degree = %d, RSS=%.2f' % (deg, np.sum((y - np.polyval(np.polyfit(x, y, deg), x))**2.0)))

Degree = 1, RSS=247.13
Degree = 2, RSS=4.65
Degree = 3, RSS=4.57
Degree = 4, RSS=0.75
Degree = 5, RSS=0.00

In [117]:

# LOOCV selects a more conservative model
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    for deg in range(1, 6):
        print('Degree = %d, RSS=%.2f' % (deg, loocv(x, y, np.polyfit, np.polyval, deg)))

Degree = 1, RSS=1875.41
Degree = 2, RSS=39.46
Degree = 3, RSS=208.48
Degree = 4, RSS=398.62
Degree = 5, RSS=245.29
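
The loocv function above evaluates each refitted model on every point, including the ones it was trained on. A stricter variant, sketched below, scores each fit only on the single held-out point. The function name loocv_heldout, the 12-point grid and the noise level are assumptions chosen for illustration, so the numbers will not match the output above.

```python
import numpy as np

def loocv_heldout(x, y, deg):
    """Sum of squared prediction errors on the held-out points only."""
    n = len(x)
    idx = np.arange(n)
    sq_err = []
    for i in range(n):
        keep = idx != i
        beta = np.polyfit(x[keep], y[keep], deg)   # fit without point i
        sq_err.append((y[i] - np.polyval(beta, x[i])) ** 2)
    return float(np.sum(sq_err))

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 12)
y = 2 * x**2 + 3 * x + 4 + rng.normal(0, 1, len(x))

scores = {deg: loocv_heldout(x, y, deg) for deg in range(1, 6)}
for deg, score in scores.items():
    print('Degree = %d, held-out RSS=%.2f' % (deg, score))
```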

Simulations to estimate power¶

What sample size is needed for the t-test to have a power of 0.8 with an effect size of 0.5?

This is a toy example, since you can calculate it exactly, but the simulation approach extends to settings where no closed form exists, including arbitrarily complex experimental designs, corrections for multiple comparisons and so on (given sufficient computational resources).

In [49]:

# Run nreps simulations
# The power is simply the fraction of reps where
# the p-value is less than 0.05

nreps = 10000
d = 0.5

n = 50
power = 0
while power < 0.8:
    n1 = n2 = n
    x = np.random.normal(0, 1, (n1, nreps))
    y = np.random.normal(d, 1, (n2, nreps))
    t, p = stats.ttest_ind(x, y)
    power = (p < 0.05).sum()/nreps
    print(n, power)
    n += 1

Check with R¶

In [50]:

%load_ext rpy2.ipython

In [51]:

%%R
library(pwr)

power.t.test(sig.level=0.05, power=0.8, delta = 0.5)

Two-sample t test power calculation

              n = 63.76576
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
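
The same calculation can be sketched in Python using the noncentral t distribution from scipy.stats. The helper name t_test_power is hypothetical, and the search simply increases the per-group sample size until the power reaches 0.8, so it returns the smallest integer n rather than the fractional value reported by R.

```python
import numpy as np
from scipy import stats

def t_test_power(n, d, alpha=0.05):
    """Power of a two-sided, two-sample t-test with n per group and sd 1."""
    df = 2 * n - 2
    nc = d * np.sqrt(n / 2)                  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    upper = stats.nct.sf(t_crit, df, nc)     # reject in the upper tail
    lower = stats.nct.cdf(-t_crit, df, nc)   # reject in the lower tail
    return upper + lower

n_needed = 2
while t_test_power(n_needed, 0.5) < 0.8:
    n_needed += 1
print(n_needed, t_test_power(n_needed, 0.5))
```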

Estimating CDF and PDF from Monte Carlo samples¶

Given a bunch of random numbers from a simulation experiment, one of the first steps is to visualize the CDF and PDF. The ECDF is quite useful for, say, visualizing how similar or different two sets of data are.

Estimating the CDF¶

In [52]:

# Make up some random data
x = np.r_[np.random.normal(0, 1, 10000),
          np.random.normal(4, 1, 10000)]

In [53]:

# Roll our own ECDF function

def ecdf(x):
    """Return empirical CDF of x."""

    sx = np.sort(x)
    cdf = (1.0 + np.arange(len(sx)))/len(sx)
    return sx, cdf

In [54]:

sx, y = ecdf(x)
plt.plot(sx, y)
pass

_images/15B_ResamplingAndSimulation_84_0.png
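
To read the ECDF at arbitrary query points rather than only at the sorted data values, one option is np.searchsorted. This is a minimal sketch; the function name ecdf_at and the tiny data set are illustrative.

```python
import numpy as np

def ecdf_at(sample, q):
    """Fraction of sample values less than or equal to each query point."""
    sx = np.sort(sample)
    # side='right' counts values <= q
    return np.searchsorted(sx, q, side='right') / len(sx)

vals = ecdf_at(np.array([3, 1, 2, 2]), np.array([0, 1, 2, 3, 10]))
print(vals)
```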

In [55]:

from statsmodels.distributions.empirical_distribution import ECDF

ecdf = ECDF(x)
plt.plot(ecdf.x, ecdf.y)
pass

_images/15B_ResamplingAndSimulation_86_0.png

Estimating the PDF¶

The simplest approach is to plot a normalized histogram as shown above, but we will also look at how to estimate density functions using kernel density estimation (KDE). KDE works by placing a kernel (a small unit-area bump) on each data point and summing the kernels, which gives a smoother estimate than you would get with an (n-d) histogram.
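
In symbols, writing \(y_1, \dots, y_n\) for the data, \(K\) for the kernel and \(h\) for the bandwidth (notation introduced here for illustration), the estimate at a point \(x\) is:

```latex
\[
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - y_i}{h}\right),
\qquad \int K(u)\, du = 1
\]
```

Because each scaled kernel integrates to 1 and the sum is divided by \(n\), the estimate is itself a density.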

In [56]:

def epanechnikov(u):
    """Epanechnikov kernel."""
    return np.where(np.abs(u) <= np.sqrt(5), 3/(4*np.sqrt(5)) * (1 - u*u/5.0), 0)

def silverman(y):
    """Find bandwidth using heuristic suggested by Silverman
    0.9 * min(standard deviation, interquartile range/1.34) * n**(-1/5)
    """
    n = len(y)
    iqr = np.subtract(*np.percentile(y, [75, 25]))
    h = 0.9*np.min([y.std(ddof=1), iqr/1.34])*n**-0.2
    return h

def kde(x, y, bandwidth=silverman, kernel=epanechnikov):
    """Returns kernel density estimate.
    x are the points for evaluation
    y is the data to be fitted
    bandwidth is a function that returns the smoothing parameter h
    kernel is a function that gives weights to neighboring data
    """
    h = bandwidth(y)
    return np.sum(kernel((x-y[:, None])/h)/h, axis=0)/len(y)

In [57]:

xs = np.linspace(-5,8,100)
density = kde(xs, x)
plt.plot(xs, density)
xlim = plt.xlim()
pass

_images/15B_ResamplingAndSimulation_89_0.png
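
A quick check on any hand-rolled KDE is that the estimated density integrates to roughly 1. The sketch below does this with a Gaussian kernel swapped in for the Epanechnikov kernel used above, the Silverman rule for the bandwidth and a simple Riemann sum. The function name gaussian_kde_1d, the grid and the simulated sample are assumptions.

```python
import numpy as np

def gaussian_kde_1d(grid, data, h):
    """Gaussian kernel density estimate evaluated on grid."""
    u = (grid[None, :] - data[:, None]) / h
    return np.exp(-0.5 * u**2).sum(axis=0) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = np.r_[rng.normal(0, 1, 500), rng.normal(4, 1, 500)]

# Silverman's rule of thumb for the bandwidth
iqr = np.subtract(*np.percentile(data, [75, 25]))
h = 0.9 * min(data.std(ddof=1), iqr / 1.34) * len(data) ** -0.2

grid = np.linspace(-6, 10, 801)
density = gaussian_kde_1d(grid, data, h)

# Riemann sum approximation to the integral of the density
area = density.sum() * (grid[1] - grid[0])
print(h, area)
```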

In [58]:

sns.kdeplot(x, kernel='epa', bw='silverman', shade=True)
plt.xlim(xlim)
pass

_images/15B_ResamplingAndSimulation_91_0.png

There are several kernel density estimation routines available in scipy, statsmodels and scikit-learn. Here we will use the statsmodels and scikit-learn routines as examples.

In [59]:

import statsmodels.api as sm

dens = sm.nonparametric.KDEUnivariate(x)
dens.fit(kernel='gau')
plt.plot(xs, dens.evaluate(xs))
pass

_images/15B_ResamplingAndSimulation_93_0.png

In [60]:

from sklearn.neighbors import KernelDensity

# expects n x p matrix with p features
x.shape = (len(x), 1)
xs.shape = (len(xs), 1)

kde = KernelDensity(kernel='epanechnikov').fit(x)
dens = np.exp(kde.score_samples(xs))
plt.plot(xs, dens)
pass

_images/15B_ResamplingAndSimulation_94_0.png

In [61]:

%load_ext version_information
%version_information numpy, scipy, statsmodels, sklearn, rpy2

Out[61]:

Software	Version
Python	3.5.1 64bit [GCC 4.2.1 (Apple Inc. build 5577)]
IPython	4.0.1
OS	Darwin 15.3.0 x86_64 i386 64bit
numpy	1.10.4
scipy	0.16.1
statsmodels	0.6.1
sklearn	0.17
rpy2	2.7.6
Tue Feb 23 09:09:05 2016 EST
