{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment 10: Review\n", "\n", "We will review material covered so far in this course with short questions similar to what you might expect on the Final Exam. \n", "\n", "First review the lecture notes. Then try to do this assignment without referring to ANY external material to simulate exam conditions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setup for Q1" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from collections import OrderedDict" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "pid = ['a', 'c', 'a', 'b', 'c', 'a', 'c', 'c', 'a', 'a', 'b', 'b']\n", "visit = [1, 1, 2, 1, 2, 3, 3, 4, 4, 5, 2, 3]\n", "n = len(pid)\n", "\n", "readings = pd.DataFrame(OrderedDict(pid=pid, visit=visit, sbp=np.random.normal(120, 25, n)))\n", "readings['dbp'] = readings.sbp - np.random.normal(40, 10, n)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "readings[['sbp', 'dbp']] = readings[['sbp', 'dbp']].astype('int')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "patients = pd.DataFrame(OrderedDict(pid=['a', 'b', 'c', 'd'], ages=[23,34,45,56]))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Connected: @tables.db'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql sqlite:///tables.db" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///tables.db\n", "Done.\n", " * sqlite:///tables.db\n" ] }, { "data": { "text/plain": [ "'Persisted patients'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql drop table patients\n", "%sql persist patients" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///tables.db\n", "Done.\n", " * sqlite:///tables.db\n" ] }, { "data": { "text/plain": [ "'Persisted readings'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql drop table readings\n", "%sql persist readings" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///tables.db\n", "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexagespid
023a
134b
245c
356d
" ], "text/plain": [ "[(0, 23, 'a'), (1, 34, 'b'), (2, 45, 'c'), (3, 56, 'd')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "select * from patients" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " * sqlite:///tables.db\n", "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
indexsbpvisitpiddbp
01331a85
11221c90
21442a117
31041b50
41192c66
51353a92
61143c88
71284c108
81494a103
91425a107
101242b80
111143b83
" ], "text/plain": [ "[(0, 133, 1, 'a', 85),\n", " (1, 122, 1, 'c', 90),\n", " (2, 144, 2, 'a', 117),\n", " (3, 104, 1, 'b', 50),\n", " (4, 119, 2, 'c', 66),\n", " (5, 135, 3, 'a', 92),\n", " (6, 114, 3, 'c', 88),\n", " (7, 128, 4, 'c', 108),\n", " (8, 149, 4, 'a', 103),\n", " (9, 142, 5, 'a', 107),\n", " (10, 124, 2, 'b', 80),\n", " (11, 114, 3, 'b', 83)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql \n", "\n", "select * from readings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1**. (20 points)\n", "\n", "- Write an SQL statement to merge the patient and pressure tables using an inner join\n", "- Write an SQL statement to find the average systolic (sbp) and diastolic (dbp) blood pressure for each patient, sorted in ascending order of sbp. The function to calculate averages in SQL is AVG.\n", "- (optional - ungraded) Write an SQL statement to find the running average of systolic blood pressure for each patient across successive visits. Show the following columns pid, visit, sbp and running average of sbp. (NOTE: This will no work unless you have the version 3.2.5 or higher of SQLite3 or swithc to a database like PostgreSQL - in particular, it will not work on Docker. Just write the SQL even if it does not execute.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2**. (30 points)\n", "\n", "- Use a raw count bag of words model for unigrams and bigrams to generate feature vectors for these two documents. For simplicity, you may tokenize by removing punctuation, splitting by white space to find words, and converting all words to lowercase. \n", "- Implement a function to calculate cosine *similarity* between two vectors without using any trigonometric functions, built-in distance functions or linear algebra modules. Find the cosine similarity between the two documents. Recall that the cosine similarity is the dot product of two unit vectors.\n", "\n", "Only use the Python standard library and `numpy` to do this exercise." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "doc1 = \"\"\"As I was going by Charing Cross,\n", "I saw a black man upon a black horse;\n", "They told me it was King Charles the First-\n", "Oh dear, my heart was ready to burst!\"\"\"\n", "\n", "doc2 = \"\"\"As I was going to St. Ives,\n", "I met a man with seven wives,\n", "Each wife had seven sacks,\n", "Each sack had seven cats,\n", "Each cat had seven kits:\n", "Kits, cats, sacks, and wives,\n", "How many were there going to St. Ives\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3**. (30 points)\n", "\n", "- Fit polynomials of order 2, 3 and 4 to the data set `x` and `y` by solving the normal equations $(X^TX) \\hat{\\beta} = X^Ty$. \n", "- Plot the fit against the data for each model.\n", "- Calculate the sum of squares error for each model using leave one out cross-validation.\n", "\n", "You may use `numpy.linalg` for this." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "x = np.array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00,\n", " 1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00,\n", " 3.96767474e+00, 5.38816734e+00])\n", "y = np.array([29.05627699, 22.38450486, 3.33047527, 23.84338844, 16.98396787,\n", " 9.32107716, 17.8343173 , 25.23079674, 28.068074 , 26.74943485])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4** (20 points)\n", "\n", "- Write a gradient descent algorithm to fit a cubic polynomial to the data from question 5. Use a learning rate of 1e-5 and 1 million iterations, and start with a $\\beta_0 = (1,10,1,1)$\n", "- Use a JIT decorator to create a compiled version and report the fold-change improvement in run time. Use the timeit.timeit function with argument `number=1`, and use a lambda function to pass in a function that takes 0 arguments.\n", "- Plot the fitted curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }