# STA 663 Midterm Exams

Please observe the Duke honor code for this **closed book** exam.

**Permitted exceptions to the closed book rule**

- You may use any of the links accessible from the Help Menu for reference - that is, you may follow a chain of clicks from the landing pages of the sites accessible through the Help Menu. If you find yourself outside the help/reference pages of `python`, `ipython`, `numpy`, `scipy`, `matplotlib`, `sympy`, `pandas` (e.g. on a Google search page or stackoverflow or current/past versions of the STA 663 notes) you are in danger of violating the honor code and should exit immediately.
- You may also use TAB or SHIFT-TAB completion, as well as `?foo`, `foo?` and `help(foo)` for any function, method or class `foo`.

A total of 125 points is allocated, but the maximum possible score is 100. Hence it is possible to score 100 even with some errors or incomplete solutions.

**Imports**

All the necessary packages have been imported for you in the Code cells below. You should not need any additional imports.

```
In [1]:
```

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.linalg as la
from collections import Counter
from functools import reduce
```

```
In [2]:
```

```
%load_ext rpy2.ipython
```

**1**. (10 points)

Read the flights data at
https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv
into a `pandas` data frame. Find the average number of passengers per
quarter (Q1, Q2, Q3, Q4) across the years 1950-1959 (inclusive of 1950
and 1959), where

- Q1 = Jan, Feb, Mar
- Q2 = Apr, May, Jun
- Q3 = Jul, Aug, Sep
- Q4 = Oct, Nov, Dec

```
In [ ]:
```

```
```
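
One possible approach to question 1 is sketched below. It assumes the CSV has `year`, `month` and `passengers` columns with English month names (only the first three letters are used), and it reads "average per quarter" as the mean of the monthly passenger counts that fall in each quarter. The helper name `quarterly_mean` is hypothetical.

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
# Map a three-letter month abbreviation to its quarter label.
QUARTER = {m: f"Q{i // 3 + 1}" for i, m in enumerate(MONTHS)}

def quarterly_mean(df, start=1950, end=1959):
    """Mean monthly passengers per quarter for start <= year <= end."""
    # Assumed column names: year, month, passengers.
    sub = df[(df["year"] >= start) & (df["year"] <= end)]
    quarter = (sub["month"].astype(str).str[:3].str.lower()
               .map(QUARTER).rename("quarter"))
    return sub.groupby(quarter)["passengers"].mean()

# Usage (needs network access):
# url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv"
# print(quarterly_mean(pd.read_csv(url)))
```

If the intended quantity is instead the average quarterly total, sum within each (year, quarter) pair first and then average over years.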

**2**. (10 points)

The Collatz sequence is defined by the following rules for finding the next number

```
if the current number is even, divide by 2
if the current number is odd, multiply by 3 and add 1
if the current number is 1, stop
```

- Find the starting integer that gives the longest Collatz sequence for integers in `range(1, 10000)`. What is the starting number and length of this Collatz sequence?

```
In [ ]:
```

```
```
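
A minimal sketch of the Collatz rules above, assuming the "length" of a sequence counts every term including the starting number and the final 1. The function name `collatz_length` is hypothetical.

```python
def collatz_length(n):
    """Number of terms in the Collatz sequence from n down to 1 (inclusive)."""
    length = 1
    while n != 1:
        # Even: halve. Odd: multiply by 3 and add 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        length += 1
    return length

# Starting value in range(1, 10000) with the longest sequence.
best_start = max(range(1, 10000), key=collatz_length)
best_length = collatz_length(best_start)
print(best_start, best_length)
```

This brute-force version recomputes every sequence from scratch; caching lengths (for example with `functools.lru_cache` on a recursive variant) would avoid the repeated work.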

**3**. (10 points)

Recall that a covariance matrix is a matrix whose entries are

\[\Sigma_{ij} = \operatorname{Cov}(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right]\]

Find the sample covariance matrix of the 4 features of the **iris** data
set at http://bit.ly/2ow0oJO using basic `numpy` operations on `ndarrays`.
Do **not** use `np.cov` or equivalent functions in `pandas` (except for
checking). Remember to scale by \(1/(n-1)\) for the sample covariance.

```
In [ ]:
```

```
```
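
The centering and \(1/(n-1)\) scaling can be sketched as follows. A synthetic array stands in for the iris features here (an assumption, so the example runs without network access), and `sample_cov` is a hypothetical helper name. `np.cov` is only needed for checking.

```python
import numpy as np

def sample_cov(X):
    """Sample covariance of the columns of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    centered = X - X.mean(axis=0)           # subtract each column's mean
    return centered.T @ centered / (n - 1)  # scale by 1/(n-1)

# Synthetic stand-in for the 150 x 4 iris feature array (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
S = sample_cov(X)
```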

**4**. (10 points)

How many numbers in `range(100, 10000)` are divisible by 17 after you
square them and add 1? Find this out using only **lambda** functions,
**map**, **filter** and **reduce** on `xs`, where
`xs = range(100, 10000)`.

In pseudo-code, you want to achieve

```
xs = range(100, 10000)
count(y for y in (x**2 + 1 for x in xs) if y % 17 == 0)
```

```
In [ ]:
```

```
```
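
The pseudo-code above could be translated along these lines. This is a sketch of one possible arrangement, and the intermediate names are illustrative only.

```python
from functools import reduce

xs = range(100, 10000)

# Square each number and add 1.
squares_plus_one = map(lambda x: x**2 + 1, xs)
# Keep only the values divisible by 17.
divisible = filter(lambda y: y % 17 == 0, squares_plus_one)
# Count the survivors by folding with an accumulator that ignores the value.
count = reduce(lambda acc, _: acc + 1, divisible, 0)
print(count)
```

`map` and `filter` return lazy iterators in Python 3, so nothing is computed until `reduce` consumes them.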

**5**. (20 points)

- Given the DNA sequence below, create a \(4 \times 4\) transition matrix \(A\) where \(A[i,j]\) is the probability of the base \(j\) appearing immediately after base \(i\). Note that a *base* is one of the four letters `a`, `c`, `t` or `g`. The letters below should be treated as a single sequence, broken into separate lines just for formatting purposes. You should check that row probabilities sum to 1. (10 points)
- Find the steady state distribution of the 4 bases from the row stochastic transition matrix - that is, the values of \(x\) for which \(x^T A = x^T\). Find the solution by solving a set of linear equations. Hint: you need to add a constraint on the values of \(x\). Only partial credit will be given for other methods of finding the steady state distribution. (10 points)

```
gggttgtatgtcacttgagcctgtgcggacgagtgacacttgggacgtgaacagcggcggccgatacgttctctaagatc
ctctcccatgggcctggtctgtatggctttcttgttgtgggggcggagaggcagcgagtgggtgtacattaagcatggcc
accaccatgtggagcgtggcgtggtcgcggagttggcagggtttttgggggtggggagccggttcaggtattccctccgc
gtttctgtcgggtaggggggcttctcgtaagggattgctgcggccgggttctctgggccgtgatgactgcaggtgccatg
gaggcggtttggggggcccccggaagtctagcgggatcgggcttcgtttgtggaggagggggcgagtgcggaggtgttct
```

**Part 1**

```
In [ ]:
```

```
```
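
A sketch of how the transition counts for Part 1 might be accumulated, shown on a short hypothetical sequence rather than the exam data. It assumes the base order `a`, `c`, `t`, `g` for rows and columns and that every base is followed by another base at least once. `transition_matrix` is a hypothetical name.

```python
import numpy as np

def transition_matrix(seq, bases="actg"):
    """Row-stochastic A with A[i, j] = P(next base is j | current base is i)."""
    index = {b: k for k, b in enumerate(bases)}
    counts = np.zeros((len(bases), len(bases)))
    for cur, nxt in zip(seq, seq[1:]):   # consecutive pairs of bases
        counts[index[cur], index[nxt]] += 1
    # Assumes no row of counts is all zero, so the division is safe.
    return counts / counts.sum(axis=1, keepdims=True)

# Short hypothetical sequence; for a multi-line block of text held in a
# string `dna`, "".join(dna.split()) would give a single sequence.
A = transition_matrix("aacgtgcat")
print(A.sum(axis=1))   # each row should sum to 1
```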

**Part 2**

```
In [ ]:
```

```
```
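
For Part 2, one way to add the constraint from the hint is to rewrite \(x^T A = x^T\) as \((A^T - I)x = 0\) and replace one of its (redundant) equations with \(\sum_i x_i = 1\). The sketch below does this on a small hypothetical two-state chain. It assumes the chain is irreducible so that the resulting system has a unique solution, and `steady_state` is a hypothetical name.

```python
import numpy as np

def steady_state(A):
    """Solve x^T A = x^T subject to sum(x) = 1 for a row-stochastic A."""
    n = A.shape[0]
    M = A.T - np.eye(n)   # (A^T - I) x = 0
    M[-1, :] = 1.0        # replace the last equation with sum(x) = 1
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(M, b)

# Hypothetical two-state chain used only for illustration.
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
x = steady_state(A)
print(x, x @ A)   # x @ A should reproduce x
```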

**6**. (10 points)

- Find the matrix \(A\) that rotates the standard vectors in \(\mathbb{R}^2\) by 30 degrees counter-clockwise, stretches \(e_1\) by a factor of 3 and contracts \(e_2\) by a factor of \(0.5\).
- What is the inverse of this matrix? How you find the inverse should reflect your understanding.

The effects of the matrix \(A\) and \(A^{-1}\) are shown in the figure below:

```
In [ ]:
```

```
```
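
A sketch of one reading of question 6, assuming the scaling is applied before the rotation so that \(A = RS\) and the image of \(e_1\) is the rotated \(e_1\) stretched by 3. Under that assumption the inverse follows from the structure, \(A^{-1} = S^{-1}R^T\), without calling a general-purpose inverse routine.

```python
import numpy as np

theta = np.deg2rad(30)
# Counter-clockwise rotation by theta.
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Stretch e1 by 3, contract e2 by 0.5.
S = np.diag([3.0, 0.5])

A = R @ S   # assumed order: scale first, then rotate

# Rotations are orthogonal (R^-1 = R^T) and a diagonal matrix is inverted
# by taking reciprocals, so A^-1 = S^-1 R^T.
A_inv = np.diag(1 / np.diag(S)) @ R.T
print(A @ A_inv)   # should be close to the identity
```

If the intended order is the opposite (rotate, then scale along the original axes), the same reasoning gives \(A = SR\) and \(A^{-1} = R^T S^{-1}\).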

**7**. (55 points)

We observe some data points \((x_i, y_i)\), and believe that an appropriate model for the data is that

with some added noise. Find optimal values of the parameters \(\beta = (a, b, c)\) that minimize \(\Vert y - f(x) \Vert^2\)

- using `scipy.linalg.lstsq` (10 points)
- solving the normal equations \(X^TX \beta = X^Ty\) (10 points)
- using `scipy.linalg.svd` (10 points)
- using gradient descent with RMSprop (no bias correction) and starting with an initial value of \(\beta = \begin{bmatrix}1 & 1 & 1\end{bmatrix}\). Use a learning rate of 0.01 and 10,000 iterations, and set the \(\beta\) parameter of RMSprop to be 0.9 (this is a different \(\beta\) from the parameters of the function we are minimizing). Running gradient descent should take a few seconds to complete. (25 points)

In each case, plot the data and fitted curve using `matplotlib`.

Data

```
x = np.array([ 3.4027718 ,  4.29209002,  5.88176277,  6.3465969 ,  7.21397852,
               8.26972154, 10.27244608, 10.44703778, 10.79203455, 14.71146298])
y = np.array([ 25.54026428,  29.4558919 ,  58.50315846,  70.24957254,
               90.55155435, 100.56372833,  91.83189927,  90.41536733,
               90.43103028,  23.0719842 ])
```

**Using `lstsq`**

```
In [ ]:
```

```
```

**Using normal equations**

```
In [ ]:
```

```
```

**Using SVD**

```
In [ ]:
```

```
```
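
The three direct approaches can be compared on any design matrix \(X\) whose columns are the model's basis functions evaluated at the data points. The sketch below works with a generic \(X\), and the quadratic basis used to exercise it is purely hypothetical, not the exam's model. The function names are also illustrative.

```python
import numpy as np
import scipy.linalg as la

def fit_lstsq(X, y):
    """Least squares via scipy.linalg.lstsq."""
    beta, _, _, _ = la.lstsq(X, y)
    return beta

def fit_normal_equations(X, y):
    """Solve X^T X beta = X^T y directly."""
    return la.solve(X.T @ X, X.T @ y)

def fit_svd(X, y):
    """Use X = U diag(s) V^T, so beta = V diag(1/s) U^T y."""
    U, s, Vt = la.svd(X, full_matrices=False)
    return Vt.T @ ((U.T @ y) / s)

# Hypothetical noise-free data with an assumed basis (1, x, x^2).
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x, x**2])
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta
betas = [f(X, y) for f in (fit_lstsq, fit_normal_equations, fit_svd)]
```

The normal equations square the condition number of \(X\), so on poorly conditioned design matrices the `lstsq` and SVD routes are usually the more reliable of the three.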

**Using Gradient descent with RMSprop**

```
In [ ]:
```

```
```
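
A minimal sketch of gradient descent with RMSprop (no bias correction) for the objective \(\Vert y - X\beta \Vert^2\), whose gradient is \(2X^T(X\beta - y)\). The RMSprop \(\beta\) parameter is called `decay` here to avoid clashing with the regression coefficients, `eps` is a small constant added for numerical safety (an assumption), and the two-parameter linear data set is hypothetical.

```python
import numpy as np

def rmsprop_lstsq(X, y, beta0, lr=0.01, decay=0.9, n_iter=10_000, eps=1e-8):
    """Minimize ||y - X @ beta||^2 with RMSprop, without bias correction."""
    beta = np.array(beta0, dtype=float)
    v = np.zeros_like(beta)               # running average of squared gradients
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)   # gradient of the squared error
        v = decay * v + (1 - decay) * grad**2
        beta -= lr * grad / (np.sqrt(v) + eps)
    return beta

# Hypothetical noise-free data with an assumed basis (1, x).
x = np.linspace(-1, 1, 21)
X = np.column_stack([np.ones_like(x), x])
true_beta = np.array([2.0, 3.0])
y = X @ true_beta
beta_hat = rmsprop_lstsq(X, y, beta0=[1.0, 1.0])
print(beta_hat)
```

Because each RMSprop step is normalized by the recent gradient magnitude, the iterates tend to hover within a few multiples of the learning rate of the minimizer rather than converging exactly, so a small discrepancy from the direct solutions is expected.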