STA 663 Midterm Exams

Please observe the Duke honor code for this closed book exam.

Permitted exceptions to the closed book rule

You may use any of the links accessible from the Help Menu for reference; that is, you may follow a chain of clicks from the landing pages of the sites accessible through the Help Menu. If you find yourself outside the help/reference pages of python, ipython, numpy, scipy, matplotlib, sympy or pandas (e.g. on a Google search page, on stackoverflow, or in current/past versions of the STA 663 notes), you are in danger of violating the honor code and should exit immediately.
You may also use TAB or SHIFT-TAB completion, as well as ?foo, foo? and help(foo) for any function, method or class foo.

The total points allocated is 125, but the maximum possible is 100. Hence it is possible to score 100 even with some errors or incomplete solutions.

Imports

All the necessary packages have been imported for you in the Code cells below. You should not need any additional imports.

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.linalg as la
from collections import Counter
from functools import reduce

In [2]:

%load_ext rpy2.ipython

1. (10 points)

Read the flights data at https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv into a pandas data frame. Find the average number of passengers per quarter (Q1, Q2, Q3, Q4) across the years 1950-1959 (inclusive of 1950 and 1959), where

Q1 = Jan, Feb, Mar
Q2 = Apr, May, Jun
Q3 = Jul, Aug, Sep
Q4 = Oct, Nov, Dec
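
The month-to-quarter grouping above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the column names year, month and passengers and the toy rows are hypothetical, and "average" is taken here to mean the mean of the monthly counts within each quarter.

```python
import pandas as pd

# Hypothetical toy rows standing in for the flights data; the column names
# (year, month, passengers) are an assumption about the CSV layout.
toy = pd.DataFrame({
    "year": [1949, 1950, 1950, 1950, 1950, 1951],
    "month": ["Jan", "Jan", "Feb", "Apr", "Oct", "Jul"],
    "passengers": [100, 110, 130, 150, 170, 200],
})

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Map each month abbreviation to its quarter label: Jan -> Q1, ..., Dec -> Q4.
quarter_of = {m: f"Q{i // 3 + 1}" for i, m in enumerate(months)}

# Series.between is inclusive at both ends by default.
in_range = toy[toy["year"].between(1950, 1959)]
# Slicing to three letters lets the mapping handle full month names as well.
quarters = in_range["month"].str[:3].map(quarter_of)
# Mean of the monthly passenger counts within each quarter.
quarter_means = in_range.groupby(quarters)["passengers"].mean()
print(quarter_means)
```

The 1949 row is dropped by the year filter before grouping.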

In [ ]:

2. (10 points)

The Collatz sequence is defined by the following rules for finding the next number

if the current number is even, divide by 2
if the current number is odd, multiply by 3 and add 1
if the current number is 1, stop

Find the starting integer that gives the longest Collatz sequence for integers in the range(1, 10000). What is the starting number and length of this Collatz sequence?
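
The three rules can be sketched as a small helper. The name collatz_length is hypothetical, and the sketch assumes the length counts both the starting number and the final 1; a different counting convention shifts every length by a constant.

```python
def collatz_length(n):
    """Number of terms in the Collatz sequence starting at n,
    counting both n itself and the final 1 (an assumed convention)."""
    length = 1
    while n != 1:  # the stop rule is checked first, since 1 is itself odd
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length

# Small illustration on a much shorter range than the one in the question.
best = max(range(1, 10), key=collatz_length)
print(best, collatz_length(best))
```

Using max with a key function keeps the search to one line; for larger ranges, caching previously computed lengths would avoid repeated work.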

In [ ]:

3. (10 points)

Recall that a covariance matrix is a matrix whose entries are

\[\Sigma_{ij} = \text{cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right]\]

Find the sample covariance matrix of the 4 features of the iris data set at http://bit.ly/2ow0oJO using basic numpy operations on ndarrays. Do not use np.cov or equivalent functions in pandas (except for checking). Remember to scale by \(1/(n-1)\) for the sample covariance.
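
As a minimal sketch of the centering-and-scaling idea, the snippet below computes a sample covariance matrix for a made-up array. The toy values are hypothetical (not the iris data), and np.cov appears only as a check.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 3 features (columns).
X = np.array([[2.0, 1.0, 0.5],
              [4.0, 3.0, 1.5],
              [6.0, 2.0, 2.5],
              [8.0, 5.0, 2.0],
              [10.0, 4.0, 4.0]])

n = X.shape[0]
centered = X - X.mean(axis=0)        # subtract each column's mean
S = centered.T @ centered / (n - 1)  # sample covariance, scaled by 1/(n-1)

# np.cov treats rows as variables by default, so rowvar=False is needed here.
print(np.allclose(S, np.cov(X, rowvar=False)))
```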

In [ ]:

4. (10 points)

How many numbers in range(100, 10000) are divisible by 17 after you square them and add 1? Find this out using only lambda functions, map, filter and reduce on xs, where xs = range(100, 10000).

In pseudo-code, you want to achieve

xs = range(100, 10000)
count(y for y in (x**2 + 1 for x in xs) if y % 17 == 0)
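
One way the pseudo-code could translate into map, filter and reduce is sketched below. The range used is a deliberately small hypothetical one, not the range in the question.

```python
from functools import reduce

xs = range(1, 20)  # hypothetical small range, not the one in the question

squares_plus_one = map(lambda x: x**2 + 1, xs)            # transform each x
divisible = filter(lambda y: y % 17 == 0, squares_plus_one)  # keep multiples of 17
# Counting with reduce: add 1 to the accumulator for every surviving element.
count = reduce(lambda acc, _: acc + 1, divisible, 0)
print(count)
```

The initial value 0 passed to reduce matters: without it, reduce would start from the first element rather than from a count.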

In [ ]:

5. (20 points)

Given the DNA sequence below, create a \(4 \times 4\) transition matrix \(A\) where \(A[i,j]\) is the probability of the base \(j\) appearing immediately after base \(i\). Note that a base is one of the four letters a, c, t or g. The letters below should be treated as a single sequence, broken into separate lines just for formatting purposes. You should check that row probabilities sum to 1. (10 points)
Find the steady state distribution of the 4 bases from the row stochastic transition matrix, that is, the values of \(x\) for which \(x^TA = x^T\). Find the solution by solving a set of linear equations. Hint: you need to add a constraint on the values of \(x\). Only partial credit will be given for other methods of finding the steady state distribution. (10 points)

gggttgtatgtcacttgagcctgtgcggacgagtgacacttgggacgtgaacagcggcggccgatacgttctctaagatc
ctctcccatgggcctggtctgtatggctttcttgttgtgggggcggagaggcagcgagtgggtgtacattaagcatggcc
accaccatgtggagcgtggcgtggtcgcggagttggcagggtttttgggggtggggagccggttcaggtattccctccgc
gtttctgtcgggtaggggggcttctcgtaagggattgctgcggccgggttctctgggccgtgatgactgcaggtgccatg
gaggcggtttggggggcccccggaagtctagcgggatcgggcttcgtttgtggaggagggggcgagtgcggaggtgttct
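
Both parts can be illustrated on a short made-up sequence. This is a sketch under assumptions: the toy string, the base ordering "acgt" and the choice to replace one redundant equation with the constraint that the entries of \(x\) sum to 1 are all illustrative.

```python
import numpy as np

seq = "aacgtgcatg"   # hypothetical toy sequence, not the one given above
bases = "acgt"       # assumed row/column ordering
idx = {b: i for i, b in enumerate(bases)}

# Count each adjacent pair (previous base, next base).
counts = np.zeros((4, 4))
for prev, nxt in zip(seq, seq[1:]):
    counts[idx[prev], idx[nxt]] += 1

# Normalise rows so that each row is a probability distribution.
A = counts / counts.sum(axis=1, keepdims=True)

# Steady state: x^T A = x^T is equivalent to (A^T - I) x = 0.
# These equations are linearly dependent, so one of them is replaced
# by the constraint sum(x) = 1.
M = A.T - np.eye(4)
M[-1, :] = 1.0
b = np.array([0.0, 0.0, 0.0, 1.0])
x = np.linalg.solve(M, b)
print(A.sum(axis=1), x)
```

The row normalisation assumes every base is followed by at least one other base in the sequence; a base that only appears last would produce a row of zeros.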

Part 1

In [ ]:

Part 2

In [ ]:

6. (10 points)

Find the matrix \(A\) that rotates the standard basis vectors in \(\mathbb{R}^2\) by 30 degrees counter-clockwise, stretches \(e_1\) by a factor of 3 and contracts \(e_2\) by a factor of \(0.5\).
What is the inverse of this matrix? How you find the inverse should reflect your understanding.

The effects of the matrix \(A\) and \(A^{-1}\) are shown in the figure below:

[Figure: effects of \(A\) and \(A^{-1}\)]
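
A rough sketch of how a rotation and a scaling compose, and how the inverse follows from the factors, is given below. The order of operations is an assumption (scale first, then rotate, so \(A = RS\)); the reverse order gives a different matrix.

```python
import numpy as np

theta = np.deg2rad(30)
# Counter-clockwise rotation by theta.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Stretch e1 by 3 and contract e2 by 0.5.
S = np.diag([3.0, 0.5])

# Assumption: scale first, then rotate.
A = R @ S

# (R S)^{-1} = S^{-1} R^{-1}. A rotation's inverse is its transpose,
# and a diagonal matrix's inverse has reciprocal diagonal entries.
A_inv = np.diag([1 / 3.0, 1 / 0.5]) @ R.T
print(np.allclose(A @ A_inv, np.eye(2)))
```

Building the inverse from the factors, rather than calling a general-purpose inverse routine, makes the geometric meaning explicit: undo the rotation, then undo the scaling.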

In [ ]:

7. (55 points)

We observe some data points \((x_i, y_i)\), and believe that an appropriate model for the data is that

\[f(x) = ax^2 + bx^3 + c\sin{x}\]

with some added noise. Find optimal values of the parameters \(\beta = (a, b, c)\) that minimize \(\Vert y - f(x) \Vert^2\)

using scipy.linalg.lstsq (10 points)
solving the normal equations \(X^TX \beta = X^Ty\) (10 points)
using scipy.linalg.svd (10 points)
using gradient descent with RMSprop (no bias correction) and starting with an initial value of \(\beta = \begin{bmatrix}1 & 1 & 1\end{bmatrix}\). Use a learning rate of 0.01 and 10,000 iterations, and set the \(\beta\) parameter of RMSprop to be 0.9 (this is a different \(\beta\) from the parameters of the function we are minimizing). Running gradient descent should take a few seconds to complete. (25 points)

In each case, plot the data and fitted curve using matplotlib.
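
Because the model is linear in \(\beta\), the three direct approaches share one design matrix with columns \(x^2\), \(x^3\) and \(\sin x\). The sketch below illustrates this on hypothetical noise-free data generated from known coefficients (not the data in this question). It uses numpy.linalg routines as a stand-in; scipy.linalg provides functions with the same names (lstsq, solve, svd), though their return values and defaults differ slightly.

```python
import numpy as np

# Hypothetical noise-free data generated from known coefficients, so the
# recovered values can be checked; this is not the exam data.
true_beta = np.array([2.0, -0.1, 5.0])
x = np.linspace(0.5, 5.0, 20)
X = np.column_stack([x**2, x**3, np.sin(x)])  # one column per model term
y = X @ true_beta

# 1. Library least-squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# 2. Normal equations: X^T X beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 3. SVD: X = U diag(s) V^T, so beta = V diag(1/s) U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

print(beta_lstsq, beta_normal, beta_svd)
```

The reduced SVD (full_matrices=False) keeps U the same shape as X, so the division by the singular values lines up elementwise.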

Data

x = np.array([ 3.4027718 ,  4.29209002,  5.88176277,  6.3465969 ,  7.21397852,
               8.26972154, 10.27244608, 10.44703778, 10.79203455, 14.71146298])
y = np.array([ 25.54026428,  29.4558919 ,  58.50315846,  70.24957254,
               90.55155435, 100.56372833,  91.83189927,  90.41536733,
               90.43103028,  23.0719842 ])

Using lstsq

In [ ]:

Using normal equations

In [ ]:

Using SVD

In [ ]:

Using gradient descent with RMSprop
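
A minimal sketch of the RMSprop update applied to a squared-error objective follows. Several names are assumptions introduced for illustration: the function name rmsprop_least_squares, the parameter name decay (the RMSprop \(\beta\), renamed to avoid clashing with the model parameters), the small constant eps added for numerical safety, and the synthetic data.

```python
import numpy as np

def rmsprop_least_squares(X, y, beta0, lr=0.01, decay=0.9,
                          n_iter=10_000, eps=1e-8):
    """Minimise ||y - X beta||^2 with RMSprop (no bias correction)."""
    beta = np.asarray(beta0, dtype=float).copy()
    v = np.zeros_like(beta)  # running average of squared gradients
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)        # gradient of the squared error
        v = decay * v + (1 - decay) * grad**2  # update the running average
        beta -= lr * grad / (np.sqrt(v) + eps) # per-coordinate scaled step
    return beta

# Hypothetical synthetic data from known coefficients (not the exam data).
x = np.linspace(0.5, 5.0, 20)
X = np.column_stack([x**2, x**3, np.sin(x)])
y = X @ np.array([2.0, -0.1, 5.0])

def loss(beta):
    """Sum of squared residuals for the parameter vector beta."""
    return np.sum((y - X @ beta) ** 2)

beta0 = np.ones(3)
beta_hat = rmsprop_least_squares(X, y, beta0)
print(loss(beta0), loss(beta_hat))
```

With a fixed learning rate the iterates tend to hover near the minimum rather than settle exactly on it, so comparing the loss before and after is a more reliable check than expecting an exact match with the direct solutions.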

In [ ]: