More Python, numpy and sklearn
We will use the following data sets:
import seaborn as sns
titanic = sns.load_dataset("titanic")
iris = sns.load_dataset("iris")
Q1 (20 pts) Working with numpy.random.
Part 1 (10 pts) Consider a sequence of \(n\) Bernoulli trials with success probability \(p\) per trial. A string of consecutive successes is known as a success run. Write a function that returns a dictionary mapping each observed run length \(k\) to the number of runs of that length.
For example: if the trials were [0, 1, 0, 1, 1, 0, 0, 0, 0, 1], the function should return
{1: 2, 2: 1}
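One possible way to count the runs, shown here only as a minimal sketch: group consecutive equal values and tally the lengths of the groups of successes. The function name `count_runs` is an illustrative choice, not part of the assignment, and the sketch assumes successes are coded as 1.

```python
from collections import Counter
from itertools import groupby


def count_runs(trials):
    """Return {run_length: number_of_runs} for maximal runs of successes.

    Assumes successes are coded as 1 and failures as 0.
    """
    # groupby yields one group per maximal block of equal consecutive values;
    # keep only the blocks of successes and record their lengths.
    lengths = (
        sum(1 for _ in block)
        for value, block in groupby(trials)
        if value == 1
    )
    return dict(Counter(lengths))
```

The same function accepts a numpy array of 0s and 1s, for instance one drawn with `numpy.random.default_rng().binomial(1, p, size=n)`.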
In [1]:
Part 2 (10 pts) Continuing from Part 1, what is the probability of observing at least one run of length 5 or more when \(n=100\) and \(p=0.5\)? Estimate this from 100,000 simulated experiments. Is this more likely, less likely, or equally likely compared with observing at least one run of length 7 or more when \(p=0.7\)?
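A Monte Carlo estimate of this kind can be sketched as follows. The sketch simulates all experiments at once as a boolean matrix and uses the fact that a run of length at least \(k\) exists exactly when some window of \(k\) consecutive trials contains only successes. The function name `prob_run_at_least`, its parameters, and the fixed seed are assumptions made for illustration.

```python
import numpy as np


def prob_run_at_least(k, n, p, n_sims=100_000, seed=0):
    """Estimate P(at least one success run of length >= k) in n Bernoulli(p) trials.

    Requires 1 <= k <= n.
    """
    rng = np.random.default_rng(seed)
    # One row per simulated experiment, one column per trial.
    trials = rng.random((n_sims, n)) < p
    # Prefix sums with a leading zero column, so that
    # csum[:, j] is the number of successes among the first j trials.
    csum = np.zeros((n_sims, n + 1), dtype=np.int32)
    csum[:, 1:] = np.cumsum(trials, axis=1)
    # Number of successes in every window of k consecutive trials.
    window_sums = csum[:, k:] - csum[:, :-k]
    # A window made up entirely of successes means a run of length >= k.
    has_run = (window_sums == k).any(axis=1)
    return float(has_run.mean())


if __name__ == "__main__":
    print(prob_run_at_least(k=5, n=100, p=0.5))
    print(prob_run_at_least(k=7, n=100, p=0.7))
```

Comparing the two printed estimates answers the second half of the question; the standard error of each estimate is roughly \(\sqrt{\hat{p}(1-\hat{p})/100000}\), which helps judge whether a difference is meaningful.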
In [1]:
Q2. (30 pts)
Using RandomForestClassifier from sklearn, find the 5 most
important predictors of survival on the Titanic. Compare the accuracy of
prediction using only these 5 predictors and using all non-redundant
predictors. Some initial pre-processing code is provided. Hint: check out
the pandas.get_dummies() function.
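The workflow described above might be sketched as below, assuming a data frame whose target column is named `survived` and whose categorical columns still need one-hot encoding. The helper names `rank_predictors` and `cv_accuracy`, the number of trees, and the use of 5-fold cross-validation are illustrative assumptions rather than requirements of the question.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def rank_predictors(df, target="survived", seed=0):
    """One-hot encode the predictors, fit a forest, and rank features by importance."""
    # drop_first=True drops one dummy column per categorical variable,
    # which removes the redundant (perfectly collinear) column.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    y = df[target]
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return X, y, importances.sort_values(ascending=False)


def cv_accuracy(X, y, seed=0, folds=5):
    """Mean cross-validated accuracy of a random forest on the given predictors."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    return cross_val_score(forest, X, y, cv=folds, scoring="accuracy").mean()


# Possible usage with the pre-processed titanic frame (not executed here):
# X, y, importances = rank_predictors(titanic)
# top5 = importances.index[:5]
# print(cv_accuracy(X[top5], y), cv_accuracy(X, y))
```

Note that after one-hot encoding, importances are reported per dummy column, so a single categorical variable such as `deck` contributes several entries to the ranking.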
In [1]:
titanic = sns.load_dataset("titanic")
titanic.head()
Out[1]:
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
In [2]:
titanic.drop(['alive', 'embarked', 'class', 'who', 'adult_male'], axis=1, inplace=True)
titanic.dropna(axis=0, inplace=True)
titanic.head()
Out[2]:
| | survived | pclass | sex | age | sibsp | parch | fare | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | Cherbourg | False |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C | Southampton | False |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | E | Southampton | True |
| 10 | 1 | 3 | female | 4.0 | 1 | 1 | 16.7000 | G | Southampton | False |
| 11 | 1 | 1 | female | 58.0 | 0 | 0 | 26.5500 | C | Southampton | True |
In [3]:
Q2. (25 pts)
Using sklearn, perform unsupervised learning of the iris data using
2 different clustering methods. Do NOT assume you know the number of
clusters - rather the code should either determine it from the data or
compare models with different numbers of components using some
appropriate test statistic. Make a pairwise scatter plot of the four
predictor variables indicating cluster by color for each unsupervised
learning method used.
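As a rough sketch of two ways to choose the number of clusters from the data, the code below selects a Gaussian mixture by minimum BIC and a k-means model by maximum silhouette score. These two methods, the helper names, and the candidate ranges for the number of clusters are assumptions chosen for illustration; other clustering methods and criteria would satisfy the question equally well.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def gmm_by_bic(X, k_values=range(1, 8), seed=0):
    """Fit Gaussian mixtures for each k and keep the one with the lowest BIC."""
    models = [
        GaussianMixture(n_components=k, n_init=3, random_state=seed).fit(X)
        for k in k_values
    ]
    best = min(models, key=lambda model: model.bic(X))
    return best.n_components, best.predict(X)


def kmeans_by_silhouette(X, k_values=range(2, 8), seed=0):
    """Fit k-means for each k >= 2 and keep the labels with the highest silhouette."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels


# Possible usage with the iris frame (not executed here):
# X = iris.drop(columns=["species"]).to_numpy()
# k, labels = gmm_by_bic(X)
# sns.pairplot(iris.assign(cluster=labels.astype(str)), vars=iris.columns[:4], hue="cluster")
```

The silhouette score is undefined for a single cluster, which is why the k-means search starts at 2 while the BIC search can start at 1.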
In [3]:
Q3. (50 pts)
Write code to generate a plot similar to the following using the
explanation for generation of 1D Cellular Automata found here.
You should only need to use standard Python, numpy and
matplotlib.
To make it simpler, I have provided the code for plotting below. All you
need to do is to supply the make_ca function (which may of course
use as many other custom functions as you deem necessary). As you can see
from the code below, the make_ca function takes 3 arguments:
rule - an integer e.g. 30
init - an initial state i.e. the first row of the image
niter - the number of iterations i.e. the number of rows in the image
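A minimal sketch of one way to write `make_ca` is given below. It assumes the usual elementary cellular automaton convention, in which bit \(i\) of the rule number gives the new state for the neighborhood whose (left, center, right) cells read as the 3-bit number \(i\). It also assumes that cells beyond the edges are treated as 0; a wrap-around boundary is an equally common choice and the linked explanation may use either.

```python
import numpy as np


def make_ca(rule, init, niter):
    """Return an (niter, len(init)) grid whose first row is init.

    Assumption: cells outside the grid are treated as 0.
    """
    # rule_bits[i] is the new state for the neighborhood with 3-bit value i.
    rule_bits = (int(rule) >> np.arange(8)) & 1
    grid = np.zeros((niter, len(init)), dtype=int)
    grid[0] = init
    for t in range(1, niter):
        # Pad the previous row with one zero on each side (fixed boundary).
        padded = np.pad(grid[t - 1], 1, mode="constant")
        # Encode each (left, center, right) neighborhood as a number in 0..7.
        neighborhood = 4 * padded[:-2] + 2 * padded[1:-1] + padded[2:]
        grid[t] = rule_bits[neighborhood]
    return grid
```

For example, rule 30 is `00011110` in binary, so the neighborhoods 100, 011, 010 and 001 produce a live cell and the other four produce a dead one.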
In [3]:
In [3]:
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter, IndexLocator

def plot_grid(rule, grid, ax=None):
    # Draw the automaton as a black-and-white image with unlabeled grid lines.
    if ax is None:
        ax = plt.subplot(111)
    ax.grid(True, which='major', color='grey', linewidth=0.5)
    ax.imshow(grid, interpolation='none', cmap='Greys', aspect=1, alpha=0.8)
    ax.xaxis.set_major_locator(IndexLocator(1, 0))
    ax.yaxis.set_major_locator(IndexLocator(1, 0))
    ax.xaxis.set_major_formatter(NullFormatter())
    ax.yaxis.set_major_formatter(NullFormatter())
    ax.set_title('Rule %d' % rule)
In [4]:
import numpy as np
import matplotlib.pyplot as plt

niter = 15
width = niter*2+1
# Start from a single live cell in the middle of the first row.
init = np.zeros(width, 'int')
init[width//2] = 1
rules = np.array([30, 54, 60, 62, 90, 94, 102, 110, 122, 126,
                  150, 158, 182, 188, 190, 220, 222, 250]).reshape((-1, 3))
nrows, ncols = rules.shape
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*3, nrows*2))
for i in range(nrows):
    for j in range(ncols):
        grid = make_ca(rules[i, j], init, niter)
        plot_grid(rules[i, j], grid, ax=axes[i,j])
plt.tight_layout()