{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# STA 663 Midterm Exams\n", "\n", "Please observe the Duke honor code for this **closed book** exam.\n", "\n", "**Permitted exceptions to the closed book rule**\n", "\n", "- You may use any of the links accessible from the Help Menu for reference - that is, you may follow a chain of clicks from the landing pages of the sites accessible through the Help Menu. If you find yourself outside the help/reference pages of `python`, `ipython`, `numpy`, `scipy`, `matplotlib`, `sympy`, `pandas` (e.g. on a Google search page, Stack Overflow, or current/past versions of the STA 663 notes), you are in danger of violating the honor code and should exit immediately.\n", "\n", "- You may also use TAB or SHIFT-TAB completion, as well as `?foo`, `foo?` and `help(foo)` for any function, method or class `foo`.\n", "\n", "The total number of points allocated is 125, but the maximum possible score is 100. Hence it is possible to score 100 even with some errors or incomplete solutions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Imports**\n", "\n", "All the necessary packages have been imported for you in the Code cells below. You should not need any additional imports." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import scipy.linalg as la\n", "from collections import Counter\n", "from functools import reduce" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext rpy2.ipython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1**. (10 points)\n", "\n", "Read the flights data at https://raw.githubusercontent.com/mwaskom/seaborn-data/master/flights.csv into a `pandas` data frame. 
Find the average number of passengers per quarter (Q1, Q2, Q3, Q4) across the years 1950-1959 (inclusive of 1950 and 1959), where\n", "\n", "- Q1 = Jan, Feb, Mar\n", "- Q2 = Apr, May, Jun\n", "- Q3 = Jul, Aug, Sep\n", "- Q4 = Oct, Nov, Dec" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2**. (10 points)\n", "\n", "The Collatz sequence is defined by the following rules for finding the next number\n", "\n", "```\n", "if the current number is even, divide by 2\n", "if the current number is odd, multiply by 3 and add 1\n", "if the current number is 1, stop\n", "```\n", "\n", "- Find the starting integer that gives the longest Collatz sequence for integers in `range(1, 10000)`. What are the starting number and the length of this Collatz sequence?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3**. (10 points)\n", "\n", "Recall that a covariance matrix is a matrix whose entries are\n", "\n", "![img](https://wikimedia.org/api/rest_v1/media/math/render/svg/4df2969e65403dd04f2c64137d21ff59b5f54190)\n", "\n", "Find the sample covariance matrix of the 4 features of the **iris** data set at http://bit.ly/2ow0oJO using basic `numpy` operations on `ndarrays`. Do **not** use `np.cov` or equivalent functions in `pandas` (except for checking). Remember to scale by $1/(n-1)$ for the sample covariance." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4**. (10 points)\n", "\n", "How many numbers in `range(100, 10000)` are divisible by 17 after you square them and add 1? 
Find this out using only **lambda** functions, **map**, **filter** and **reduce** on `xs`, where `xs = range(100, 10000)`.\n", "\n", "In pseudo-code, you want to achieve\n", "\n", "```python\n", "xs = range(100, 10000)\n", "count(y for y in (x**2 + 1 for x in xs) if y % 17 == 0)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5**. (20 points)\n", "\n", "- Given the DNA sequence below, create a $4 \\times 4$ transition matrix $A$ where $A[i,j]$ is the probability of the base $j$ appearing immediately after base $i$. Note that a *base* is one of the four letters `a`, `c`, `t` or `g`. The letters below should be treated as a single sequence, broken into separate lines just for formatting purposes. You should check that the probabilities in each row sum to 1. (10 points)\n", "- Find the steady state distribution of the 4 bases from the row stochastic transition matrix - that is, the values of $x$ for which $x^TA = x^T$. Find the solution by solving a set of linear equations. Hint: you need to add a constraint on the values of $x$. Only partial credit will be given for other methods of finding the steady state distribution. 
(10 points)\n", "\n", "```\n", "gggttgtatgtcacttgagcctgtgcggacgagtgacacttgggacgtgaacagcggcggccgatacgttctctaagatc\n", "ctctcccatgggcctggtctgtatggctttcttgttgtgggggcggagaggcagcgagtgggtgtacattaagcatggcc\n", "accaccatgtggagcgtggcgtggtcgcggagttggcagggtttttgggggtggggagccggttcaggtattccctccgc\n", "gtttctgtcgggtaggggggcttctcgtaagggattgctgcggccgggttctctgggccgtgatgactgcaggtgccatg\n", "gaggcggtttggggggcccccggaagtctagcgggatcgggcttcgtttgtggaggagggggcgagtgcggaggtgttct\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Part 1**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Part 2**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6**. (10 points)\n", "\n", "- Find the matrix $A$ that rotates the standard basis vectors in $\\mathbb{R}^2$ by 30 degrees counter-clockwise, stretches $e_1$ by a factor of 3, and contracts $e_2$ by a factor of $0.5$. \n", "- What is the inverse of this matrix? How you find the inverse should reflect your understanding.\n", "\n", "The effects of the matrices $A$ and $A^{-1}$ are shown in the figure below:\n", "\n", "![image](./vecs.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7**. (55 points) \n", "\n", "We observe some data points $(x_i, y_i)$, and believe that an appropriate model for the data is\n", "\n", "$$\n", "f(x) = ax^2 + bx^3 + c\\sin{x}\n", "$$\n", "\n", "with some added noise. Find the optimal values of the parameters $\\beta = (a, b, c)$ that minimize $\\Vert y - f(x) \\Vert^2$\n", "\n", "1. using `scipy.linalg.lstsq` (10 points)\n", "2. 
solving the normal equations $X^TX \\beta = X^Ty$ (10 points)\n", "3. using `scipy.linalg.svd` (10 points)\n", "4. using gradient descent with RMSProp (no bias correction) and starting with an initial value of $\\beta = \\begin{bmatrix}1 & 1 & 1\\end{bmatrix}$. Use a learning rate of 0.01 and 10,000 iterations, and set the $\\beta$ parameter of RMSprop to be 0.9 (this is a different $\\beta$ from the parameters of the function we are minimizing). Running gradient descent should take a few seconds to complete. (25 points)\n", "\n", "In each case, plot the data and fitted curve using `matplotlib`.\n", "\n", "Data\n", "```\n", "x = array([ 3.4027718 , 4.29209002, 5.88176277, 6.3465969 , 7.21397852,\n", " 8.26972154, 10.27244608, 10.44703778, 10.79203455, 14.71146298])\n", "y = array([ 25.54026428, 29.4558919 , 58.50315846, 70.24957254,\n", " 90.55155435, 100.56372833, 91.83189927, 90.41536733,\n", " 90.43103028, 23.0719842 ])\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Using `lstsq`**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Using normal equations**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**Using SVD**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Using Gradient descent with RMSprop**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": 
"ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }