import os
import sys
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
%precision 4
np.random.seed(1)
plt.style.use('ggplot')
import scipy.stats as st
Starting with some data (which may come from an experiment or a simulation), we often use statsitics to answer a few typcical questions:
Most commonly, the computational approaches used to address these questions will involve
Rarely (i.e. textbook examples), we can find a closed form solution to these problems.
Data comes from simulation.
n = 100
pcoin = 0.62 # actual value of p for coin
results = st.bernoulli(pcoin).rvs(n)
h = sum(results)
print h
62
# Expected distribution for fair coin
p = 0.5
rv = st.binom(n, p)
mu = rv.mean()
sd = rv.std()
mu, sd
(50.0000, 5.0000)
Use of approximation when true solution is computatioanlly expensive.
z = (h-0.5-mu)/sd
z
2.3000
2*(1 - st.norm.cdf(z))
0.0214
Use simulaiton when we don’t have any theory (e.g. data doesen’t meet assumptions of test)
nsamples = 100000
xs = np.random.binomial(n, p, nsamples)
2*np.sum(xs >= h)/(xs.size + 0.0)
0.0202
Point estimate of parameter.
print "Maximum likelihood", np.sum(results)/float(len(results))
Maximum likelihood 0.62
Interval etsimate of parameter.
bs_samples = np.random.choice(results, (nsamples, len(results)), replace=True)
bs_ps = np.mean(bs_samples, axis=1)
bs_ps.sort()
print "Bootstrap CI: (%.4f, %.4f)" % (bs_ps[int(0.025*nsamples)], bs_ps[int(0.975*nsamples)])
Bootstrap CI: (0.5200, 0.7100)
The Bayesian approach directly estimates the posterior distribution, from which all other point/interval statistics can be estimated.
a, b = 10, 10
prior = st.beta(a, b)
post = st.beta(h+a, n-h+b)
ci = post.interval(0.95)
map_ =(h+a-1.0)/(n+a+b-2.0)
xs = np.linspace(0, 1, 100)
plt.plot(prior.pdf(xs), label='Prior')
plt.plot(post.pdf(xs), label='Posterior')
plt.axvline(mu, c='red', linestyle='dashed', alpha=0.4)
plt.xlim([0, 100])
plt.axhline(0.3, ci[0], ci[1], c='black', linewidth=2, label='95% CI');
plt.axvline(n*map_, c='blue', linestyle='dashed', alpha=0.4)
plt.legend();
All the above calculations have simple analytic solutions. For most real life problems reuqireing more complex statistical models, we will need to search for solutions using more advanced numerical methods and simulations. However, the types of problems that we will be addressing are largely similar to those asked of the toy coin toss problem. These include
and most will require some knowledge of numerical methods for
The next section of the course will focus on the ideas behiind these numerical methods.