STA 663 Midterm Exams
Please observe the Duke honor code for this closed book exam.
Permitted exceptions to the closed book rule
- You may use any of the links accessible from the Help Menu for
reference - that is, you may follow a chain of clicks from the
landing pages of the sites accessible through the Help Menu. If you
find yourself outside the help/reference pages of python, ipython, numpy, scipy, matplotlib, sympy or pandas (e.g. on a Google search page or stackoverflow or current/past versions of the STA 663 notes) you are in danger of violating the honor code and should exit immediately.
- You may also use TAB or SHIFT-TAB completion, as well as ?foo, foo? and help(foo) for any function, method or class foo.
The total number of points allocated is 125, but the maximum possible score is 100. Hence it is possible to score 100 even with some errors or incomplete solutions.
Imports
All the necessary packages have been imported for you in the Code cells below. You should not need any additional imports.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.linalg as la
from collections import Counter
from functools import reduce
In [2]:
%load_ext rpy2.ipython
1. (10 points)
Read the flights data at
https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv
into a pandas data frame. Find the average number of passengers per
quarter (Q1, Q2, Q3, Q4) across the years 1950-1959 (inclusive of 1950
and 1959), where
- Q1 = Jan, Feb, Mar
- Q2 = Apr, May, Jun
- Q3 = Jul, Aug, Sep
- Q4 = Oct, Nov, Dec
In [ ]:
2. (10 points)
The Collatz sequence is defined by the following rules for finding the next number:
- if the current number is 1, stop
- if the current number is even, divide by 2
- if the current number is odd, multiply by 3 and add 1
Find the starting integer that gives the longest Collatz sequence for integers in range(1, 10000). What is the starting number and length of this Collatz sequence?
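The rules above can be sketched on a small case as follows. The helper name collatz_length is hypothetical, and counting both the starting number and the final 1 as part of the length is an assumption about the intended convention.

```python
def collatz_length(n):
    """Number of terms in the Collatz sequence starting at n.

    Assumed convention: both the starting number and the final 1 are counted.
    """
    length = 1
    while n != 1:
        # Even: halve. Odd: multiply by 3 and add 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length

# Worked case: 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1 has 9 terms.
example_length = collatz_length(6)

# Longest sequence among a deliberately tiny range of starting values.
longest_start = max(range(1, 10), key=collatz_length)
```

The same max(..., key=...) pattern extends to a larger range of starting values.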
In [ ]:
3. (10 points)
Recall that a covariance matrix is a matrix whose entries are
\(\text{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]\).
Find the sample covariance matrix of the 4 features of the iris data
set at http://bit.ly/2ow0oJO using basic numpy operations on ndarrays. Do not use np.cov or equivalent functions in pandas (except for checking). Remember to scale by \(1/(n-1)\) for the sample covariance.
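As a minimal sketch of the computation described above, the following uses random toy data rather than the iris measurements. The function name sample_cov and the assumption that rows are observations and columns are features are illustrative choices.

```python
import numpy as np

def sample_cov(X):
    """Sample covariance of the columns of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)          # subtract each column's mean
    return centered.T @ centered / (n - 1)  # scale by 1/(n-1)

# Toy data standing in for a table with 4 features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 4))
S = sample_cov(toy)

# np.cov is used here only as a check, with columns treated as variables.
matches_numpy = np.allclose(S, np.cov(toy, rowvar=False))
```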
In [ ]:
4. (10 points)
How many numbers in range(100, 10000) are divisible by 17 after you
square them and add 1? Find this out using only lambda functions,
map, filter and reduce on xs, where xs = range(100, 10000).
In pseudo-code, you want to achieve
xs = range(100, 10000)
count(y for y in (x**2 + 1 for x in xs) if y % 17 == 0)
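The pseudo-code can be expressed with map, filter and reduce roughly as below. This sketch runs on a deliberately small toy range, not the range in the question, and counting with reduce via an accumulator is one possible choice among several.

```python
from functools import reduce

xs = range(1, 20)  # toy range for illustration only

# Square each number and add 1.
squares_plus_one = map(lambda x: x**2 + 1, xs)

# Keep only the values divisible by 17.
divisible = filter(lambda y: y % 17 == 0, squares_plus_one)

# Count the survivors by adding 1 to an accumulator for each element.
count = reduce(lambda acc, _: acc + 1, divisible, 0)
```

In this toy range only x = 4 (giving 17) and x = 13 (giving 170) qualify.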
In [ ]:
5. (20 points)
- Given the DNA sequence below, create a \(4 \times 4\) transition
matrix \(A\) where \(A[i,j]\) is the probability of the base
\(j\) appearing immediately after base \(i\). Note that a
base is one of the four letters a, c, t or g. The letters below should be treated as a single sequence, broken into separate lines just for formatting purposes. You should check that row probabilities sum to 1. (10 points)
- Find the steady state distribution of the 4 bases from the row stochastic transition matrix - that is, the values of \(x\) for which \(x^TA = x^T\). Find the solution by solving a set of linear equations. Hint: you need to add a constraint on the values of \(x\). Only partial credit will be given for other methods of finding the steady state distribution. (10 points)
gggttgtatgtcacttgagcctgtgcggacgagtgacacttgggacgtgaacagcggcggccgatacgttctctaagatc
ctctcccatgggcctggtctgtatggctttcttgttgtgggggcggagaggcagcgagtgggtgtacattaagcatggcc
accaccatgtggagcgtggcgtggtcgcggagttggcagggtttttgggggtggggagccggttcaggtattccctccgc
gtttctgtcgggtaggggggcttctcgtaagggattgctgcggccgggttctctgggccgtgatgactgcaggtgccatg
gaggcggtttggggggcccccggaagtctagcgggatcgggcttcgtttgtggaggagggggcgagtgcggaggtgttct
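Both parts can be illustrated on a short toy sequence, not the sequence above. The base ordering "acgt" and the choice to impose the constraint by replacing one equation with \(\sum_i x_i = 1\) are assumptions; other ways of adding the constraint are possible.

```python
import numpy as np

seq = "acgtacggtacca"  # toy sequence for illustration only
bases = "acgt"         # assumed row/column ordering
idx = {b: i for i, b in enumerate(bases)}

# Count transitions from each base to the next one.
counts = np.zeros((4, 4))
for cur, nxt in zip(seq, seq[1:]):
    counts[idx[cur], idx[nxt]] += 1

# Normalise each row so that it sums to 1 (row stochastic matrix).
A = counts / counts.sum(axis=1, keepdims=True)

# Steady state: x^T A = x^T  is equivalent to  (A^T - I) x = 0.
# These equations are linearly dependent, so one of them is replaced
# by the constraint sum(x) = 1.
M = A.T - np.eye(4)
M[-1, :] = 1.0
b = np.array([0.0, 0.0, 0.0, 1.0])
x = np.linalg.solve(M, b)
```

The replacement trick relies on the chain having a unique steady state, which holds for this toy sequence because every base can reach every other base.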
Part 1
In [ ]:
Part 2
In [ ]:
6. (10 points)
- Find the matrix \(A\) that rotates the standard basis vectors in \(\mathbb{R}^2\) by 30 degrees counter-clockwise, stretches \(e_1\) by a factor of 3 and contracts \(e_2\) by a factor of \(0.5\).
- What is the inverse of this matrix? How you find the inverse should reflect your understanding.
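One way to read the description is as a scaling followed by a rotation, \(A = RS\). The order of the two operations is an assumption here, since the wording admits other readings. Under that assumption the inverse follows from the structure of the factors, as \(A^{-1} = S^{-1}R^T\), without calling a general-purpose inverse routine.

```python
import numpy as np

theta = np.deg2rad(30)

# Counter-clockwise rotation by theta.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Stretch e1 by 3, contract e2 by 0.5.
S = np.diag([3.0, 0.5])

# Assumed order: scale first, then rotate.
A = R @ S

# A rotation is orthogonal (inverse = transpose) and a diagonal matrix
# is inverted by taking reciprocals of its diagonal entries.
A_inv = np.diag(1.0 / np.diag(S)) @ R.T
```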
The effects of the matrix \(A\) and \(A^{-1}\) are shown in the figure below:
In [ ]:
7. (55 points)
We observe some data points \((x_i, y_i)\), and believe that an appropriate model for the data is that
with some added noise. Find optimal values of the parameters \(\beta = (a, b, c)\) that minimize \(\Vert y - f(x) \Vert^2\)
- using scipy.linalg.lstsq (10 points)
- solving the normal equations \(X^TX \beta = X^Ty\) (10 points)
- using scipy.linalg.svd (10 points)
- using gradient descent with RMSProp (no bias correction) and starting with an initial value of \(\beta = \begin{bmatrix}1 & 1 & 1\end{bmatrix}\). Use a learning rate of 0.01 and 10,000 iterations, and set the \(\beta\) parameter of RMSprop to be 0.9 (this is a different \(\beta\) from the parameters of the function we are minimizing). Running gradient descent should take a few seconds to complete. (25 points)
In each case, plot the data and fitted curve using matplotlib.
Data
x = np.array([ 3.4027718 , 4.29209002, 5.88176277, 6.3465969 , 7.21397852,
8.26972154, 10.27244608, 10.44703778, 10.79203455, 14.71146298])
y = np.array([ 25.54026428, 29.4558919 , 58.50315846, 70.24957254,
90.55155435, 100.56372833, 91.83189927, 90.41536733,
90.43103028, 23.0719842 ])
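The three linear-algebra approaches listed above should agree for any model that is linear in its parameters. The sketch below checks this on synthetic data with a placeholder design matrix whose columns are 1, x and x squared. That basis, the synthetic data and the variable names are all assumptions for illustration and are not the model or data of this question.

```python
import numpy as np
import scipy.linalg as la

# Synthetic data from a placeholder model that is linear in its parameters.
rng = np.random.default_rng(0)
x_toy = np.linspace(0, 5, 30)
X = np.column_stack([np.ones_like(x_toy), x_toy, x_toy**2])  # placeholder basis
true_beta = np.array([1.0, -2.0, 0.5])
y_toy = X @ true_beta + 0.1 * rng.normal(size=x_toy.size)

# 1. Library least-squares solver.
beta_lstsq, *_ = la.lstsq(X, y_toy)

# 2. Normal equations: X^T X beta = X^T y.
beta_normal = la.solve(X.T @ X, X.T @ y_toy)

# 3. SVD: with X = U diag(s) V^T, beta = V diag(1/s) U^T y.
U, s, Vt = la.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y_toy) / s)
```

For a different model only the columns of X change; the three solution steps stay the same.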
Using lstsq
In [ ]:
Using normal equations
In [ ]:
Using SVD
In [ ]:
Using Gradient descent with RMSprop
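A minimal sketch of the RMSprop update without bias correction is given below, applied to a noise-free toy least-squares problem. The placeholder design matrix, the function name rmsprop, the argument name decay (standing for the RMSprop \(\beta\) parameter) and the small constant eps are assumptions for illustration and are not the model or data of this question.

```python
import numpy as np

def rmsprop(grad, beta0, lr=0.01, decay=0.9, n_iter=10_000, eps=1e-8):
    """Gradient descent with RMSprop scaling and no bias correction."""
    beta = np.asarray(beta0, dtype=float).copy()
    v = np.zeros_like(beta)  # running average of squared gradients
    for _ in range(n_iter):
        g = grad(beta)
        v = decay * v + (1 - decay) * g**2
        beta -= lr * g / (np.sqrt(v) + eps)
    return beta

# Toy problem: minimise ||y - X beta||^2 for a placeholder design matrix.
x_toy = np.linspace(-1, 1, 50)
X = np.column_stack([np.ones_like(x_toy), x_toy, x_toy**2])
target = np.array([2.0, -1.0, 0.5])
y_toy = X @ target

# Gradient of the squared error with respect to beta.
grad = lambda beta: 2 * X.T @ (X @ beta - y_toy)

beta_hat = rmsprop(grad, beta0=[1.0, 1.0, 1.0])
```

With a fixed learning rate the iterates typically settle into small oscillations around the minimiser rather than converging exactly, so agreement with the least-squares solution is only approximate.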
In [ ]: