More Python, numpy and sklearn

We will use the following data sets:

import seaborn as sns

titanic = sns.load_dataset("titanic")
iris = sns.load_dataset("iris")

Q1 (20 pts) Working with numpy.random.

Part 1 (10 pts) Consider a sequence of \(n\) Bernoulli trials with success probability \(p\) per trial. A string of consecutive successes is known as a success run. Write a function that returns a dictionary mapping each observed run length \(k\) to the number of runs of that length.

For example: if the trials were [0, 1, 0, 1, 1, 0, 0, 0, 0, 1], the function should return

{1: 2, 2: 1}
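The run-counting idea can be sketched as follows. This is one possible approach, not necessarily the intended solution; the function name count_runs is a hypothetical choice, and successes are assumed to be coded as 1.

```python
from collections import Counter
from itertools import groupby


def count_runs(trials):
    """Return {run_length: count} for runs of consecutive successes (1s)."""
    # groupby splits the sequence into maximal blocks of equal values;
    # keep only the blocks of successes and tally their lengths.
    lengths = (
        sum(1 for _ in block)
        for value, block in groupby(trials)
        if value == 1
    )
    return dict(Counter(lengths))


print(count_runs([0, 1, 0, 1, 1, 0, 0, 0, 0, 1]))  # {1: 2, 2: 1}
```

The function accepts any iterable of 0/1 values, so a list or a 1D numpy array should both work.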
In [1]:





Part 2 (10 pts) Continuing from Part 1, what is the probability of observing at least one run of length 5 or more when \(n=100\) and \(p=0.5\)? Estimate this from 100,000 simulated experiments. Is this more likely, less likely, or equally likely compared with finding at least one run of length 7 or more when \(p=0.7\)?
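A Monte Carlo estimate of this kind might be set up as in the sketch below. The helper names has_run and estimate_run_prob, the fixed seed, and the sliding-window test are illustrative assumptions rather than a prescribed method. The demo call uses only 2,000 experiments to keep it quick; the question asks for 100,000.

```python
import numpy as np


def has_run(trials, k):
    """True if the 0/1 array `trials` contains at least k consecutive 1s."""
    # A window of k consecutive successes sums to exactly k.
    windows = np.convolve(trials, np.ones(k, dtype=int), mode="valid")
    return bool(np.any(windows == k))


def estimate_run_prob(n, p, k, n_experiments, seed=0):
    """Estimate P(at least one success run of length >= k in n trials)."""
    rng = np.random.default_rng(seed)
    # One row per simulated experiment, one column per Bernoulli trial.
    trials = rng.binomial(1, p, size=(n_experiments, n))
    hits = sum(has_run(row, k) for row in trials)
    return hits / n_experiments


# Small demo run; increase n_experiments to 100_000 for the actual estimate.
print(estimate_run_prob(100, 0.5, 5, n_experiments=2_000))
```

Calling estimate_run_prob(100, 0.7, 7, ...) with the same number of experiments gives the second quantity to compare against.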

In [1]:





Q2. (30 pts)

Using RandomForestClassifier from sklearn, find the 5 most important predictors of survival on the Titanic. Compare the accuracy of prediction using only these 5 predictors and using all non-redundant predictors. Some initial pre-processing code is provided. Hint: check out the pandas.get_dummies() function.

In [1]:
titanic = sns.load_dataset("titanic")
titanic.head()
Out[1]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [2]:
titanic.drop(['alive', 'embarked', 'class', 'who', 'adult_male'], axis=1, inplace=True)
titanic.dropna(axis=0, inplace=True)
titanic.head()
Out[2]:
survived pclass sex age sibsp parch fare deck embark_town alone
1 1 1 female 38.0 1 0 71.2833 C Cherbourg False
3 1 1 female 35.0 1 0 53.1000 C Southampton False
6 0 1 male 54.0 0 0 51.8625 E Southampton True
10 1 3 female 4.0 1 1 16.7000 G Southampton False
11 1 1 female 58.0 0 0 26.5500 C Southampton True
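To illustrate the get_dummies hint, here is a hedged sketch of ranking one-hot encoded predictors by random forest importance. It runs on synthetic stand-in data (not the Titanic data) so that it works offline; the helper name top_predictors, the toy columns, and the use of 5-fold cross-validation for accuracy are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def top_predictors(df, target, n_top=5, seed=0):
    """Rank one-hot encoded predictors by random forest importance."""
    y = df[target]
    # get_dummies expands categorical columns into indicator columns;
    # drop_first=True omits one redundant indicator per category.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    forest = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return X, y, importances.sort_values(ascending=False).head(n_top)


# Synthetic stand-in data (NOT the Titanic data), made up for this sketch.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "age": rng.uniform(1, 80, size=n),
    "fare": rng.uniform(5, 100, size=n),
    "deck": rng.choice(list("ABC"), size=n),
})
toy["survived"] = (toy["sex"] == "female").astype(int)

X, y, top = top_predictors(toy, "survived", n_top=2)
# Compare cross-validated accuracy with all predictors vs. only the top ones.
acc_all = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
acc_top = cross_val_score(RandomForestClassifier(random_state=0), X[top.index], y, cv=5).mean()
print(top)
print(acc_all, acc_top)
```

The same helper could be called on the pre-processed titanic frame with target="survived" and n_top=5.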
In [3]:










Q2. (25 pts)

Using sklearn, perform unsupervised learning of the iris data using 2 different clustering methods. Do NOT assume you know the number of clusters - rather the code should either determine it from the data or compare models with different numbers of components using some appropriate test statistic. Make a pairwise scatter plot of the four predictor variables indicating cluster by color for each unsupervised learning method used.
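One way to choose the number of clusters from the data is sketched below, using a Gaussian mixture selected by BIC and k-means selected by silhouette score. These two methods and criteria are illustrative choices, not the required ones. The sketch loads iris through sklearn.datasets.load_iris (swapped in for sns.load_dataset so it runs offline, and assumed to hold the same four measurements); the helper names best_gmm and best_kmeans are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X = load_iris().data  # 150 rows, four numeric predictors


def best_gmm(X, max_k=6, seed=0):
    """Fit mixtures with 1..max_k components and keep the lowest-BIC model."""
    models = [
        GaussianMixture(n_components=k, random_state=seed).fit(X)
        for k in range(1, max_k + 1)
    ]
    return min(models, key=lambda m: m.bic(X))


def best_kmeans(X, max_k=6, seed=0):
    """Fit k-means with 2..max_k clusters and keep the best silhouette score."""
    models = [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        for k in range(2, max_k + 1)
    ]
    return max(models, key=lambda m: silhouette_score(X, m.labels_))


gmm = best_gmm(X)
km = best_kmeans(X)
gmm_labels = gmm.predict(X)  # cluster index per row, usable as a plot color
km_labels = km.labels_
print(gmm.n_components, km.n_clusters)
```

The resulting label arrays can be added as a column to the iris data frame and used as the color variable in a pairwise scatter plot.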

In [3]:











Q3. (50 pts)

Write code to generate a plot similar to the following figure using the explanation for generation of 1D Cellular Automata found here. You should only need to use standard Python, numpy and matplotlib.

To make it simpler, I have provided the code for plotting below. All you need to do is to supply the make_ca function (which may of course use as many other custom functions as you deem necessary). As you can see from the code below, the make_ca function takes 3 arguments:

rule - an integer e.g. 30
init - an initial state i.e. the first row of the image
niter - the number of iterations i.e. the number of rows in the image
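A function with this signature could look like the sketch below. It assumes the standard elementary cellular automaton encoding (bit i of the rule number gives the new state for the neighborhood whose left, center and right cells spell i in binary) and a periodic boundary; both are assumptions, since the linked explanation is not reproduced here.

```python
import numpy as np


def make_ca(rule, init, niter):
    """Evolve a 1D cellular automaton; returns a (niter, len(init)) grid."""
    # Lookup table: entry i is the new state for the neighborhood
    # encoded as i = 4*left + 2*center + right.
    table = np.array([(int(rule) >> i) & 1 for i in range(8)])
    grid = np.zeros((niter, len(init)), dtype=int)
    grid[0] = init  # the first row of the image is the initial state
    for t in range(1, niter):
        row = grid[t - 1]
        # Assumption: periodic boundary, so the edge cells wrap around.
        left, right = np.roll(row, 1), np.roll(row, -1)
        grid[t] = table[4 * left + 2 * row + right]
    return grid


# Rule 30 from a single live cell in the middle of a width-7 row.
init = np.zeros(7, dtype=int)
init[3] = 1
print(make_ca(30, init, 3))
```

With the width of niter*2+1 used in the plotting code, the pattern grown from the single center cell never reaches the edges, so the boundary assumption should not affect those plots.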
In [3]:











In [3]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter, IndexLocator

def plot_grid(rule, grid, ax=None):
    if ax is None:
        ax = plt.subplot(111)
    ax.grid(True, which='major', color='grey', linewidth=0.5)
    ax.imshow(grid, interpolation='none', cmap='Greys', aspect=1, alpha=0.8)
    ax.xaxis.set_major_locator(IndexLocator(1, 0))
    ax.yaxis.set_major_locator(IndexLocator(1, 0))
    ax.xaxis.set_major_formatter( NullFormatter() )
    ax.yaxis.set_major_formatter( NullFormatter() )
    ax.set_title('Rule %d' % rule)
In [4]:
niter = 15
width = niter*2+1
init = np.zeros(width, 'int')
init[width//2] = 1
rules = np.array([30, 54, 60, 62, 90, 94, 102, 110, 122, 126,
                  150, 158, 182, 188, 190, 220, 222, 250]).reshape((-1, 3))

nrows, ncols = rules.shape
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols*3, nrows*2))
for i in range(nrows):
    for j in range(ncols):
        grid = make_ca(rules[i, j], init, niter)
        plot_grid(rules[i, j], grid, ax=axes[i,j])
plt.tight_layout()
[Figure: homework_Homework03_16_1.png]