Assignment 10: Review
We will review material covered so far in this course with short questions similar to what you might expect on the Final Exam.
First review the lecture notes. Then try to do this assignment without referring to ANY external material to simulate exam conditions.
Setup for Q1
In [1]:
%load_ext sql
In [2]:
import pandas as pd
import numpy as np
In [3]:
from collections import OrderedDict
In [4]:
pid = ['a', 'c', 'a', 'b', 'c', 'a', 'c', 'c', 'a', 'a', 'b', 'b']
visit = [1, 1, 2, 1, 2, 3, 3, 4, 4, 5, 2, 3]
n = len(pid)
readings = pd.DataFrame(OrderedDict(pid=pid, visit=visit, sbp=np.random.normal(120, 25, n)))
readings['dbp'] = readings.sbp - np.random.normal(40, 10, n)
In [5]:
readings[['sbp', 'dbp']] = readings[['sbp', 'dbp']].astype('int')
In [6]:
patients = pd.DataFrame(OrderedDict(pid=['a', 'b', 'c', 'd'], ages=[23,34,45,56]))
In [7]:
%sql sqlite:///tables.db
Out[7]:
'Connected: @tables.db'
In [8]:
%sql drop table if exists patients
%sql persist patients
* sqlite:///tables.db
Done.
* sqlite:///tables.db
Out[8]:
'Persisted patients'
In [9]:
%sql drop table if exists readings
%sql persist readings
* sqlite:///tables.db
Done.
* sqlite:///tables.db
Out[9]:
'Persisted readings'
In [10]:
%%sql
select * from patients
* sqlite:///tables.db
Done.
Out[10]:
index | ages | pid |
---|---|---|
0 | 23 | a |
1 | 34 | b |
2 | 45 | c |
3 | 56 | d |
In [11]:
%%sql
select * from readings
* sqlite:///tables.db
Done.
Out[11]:
index | sbp | visit | pid | dbp |
---|---|---|---|---|
0 | 133 | 1 | a | 85 |
1 | 122 | 1 | c | 90 |
2 | 144 | 2 | a | 117 |
3 | 104 | 1 | b | 50 |
4 | 119 | 2 | c | 66 |
5 | 135 | 3 | a | 92 |
6 | 114 | 3 | c | 88 |
7 | 128 | 4 | c | 108 |
8 | 149 | 4 | a | 103 |
9 | 142 | 5 | a | 107 |
10 | 124 | 2 | b | 80 |
11 | 114 | 3 | b | 83 |
1. (20 points)
- Write an SQL statement to merge the patients and readings tables using an inner join
- Write an SQL statement to find the average systolic (sbp) and diastolic (dbp) blood pressure for each patient, sorted in ascending order of sbp. The function to calculate averages in SQL is AVG.
- (optional - ungraded) Write an SQL statement to find the running average of systolic blood pressure for each patient across successive visits. Show the following columns: pid, visit, sbp and the running average of sbp. (NOTE: This will not work unless you have version 3.25 or higher of SQLite3 or switch to a database like PostgreSQL. In particular, it will not work on Docker. Just write the SQL even if it does not execute.)
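As a rough illustration of the query shapes involved (not the graded answer), the sketch below uses Python's built-in sqlite3 module with a small in-memory database. The toy rows are made up for this example and are not the assignment data; the table and column names mirror the setup above. The window-function query is only run when the linked SQLite library is new enough, which is an assumption checked at run time.

```python
import sqlite3

# Toy in-memory database; the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table patients (pid text, ages integer)")
cur.execute("create table readings (pid text, visit integer, sbp integer, dbp integer)")
cur.executemany("insert into patients values (?, ?)",
                [("a", 23), ("b", 34), ("d", 56)])
cur.executemany("insert into readings values (?, ?, ?, ?)",
                [("a", 1, 130, 80), ("a", 2, 140, 90), ("b", 1, 100, 60)])

# Inner join: only pids present in both tables survive (patient 'd' drops out).
joined = cur.execute("""
    select p.pid, p.ages, r.visit, r.sbp, r.dbp
    from patients p
    inner join readings r on p.pid = r.pid
    order by p.pid, r.visit
""").fetchall()

# Per-patient averages, sorted by ascending mean sbp.
averages = cur.execute("""
    select pid, avg(sbp) as mean_sbp, avg(dbp) as mean_dbp
    from readings
    group by pid
    order by mean_sbp asc
""").fetchall()

# Running average via a window function; needs SQLite 3.25 or newer.
running = None
if sqlite3.sqlite_version_info >= (3, 25, 0):
    running = cur.execute("""
        select pid, visit, sbp,
               avg(sbp) over (partition by pid order by visit) as running_avg
        from readings
        order by pid, visit
    """).fetchall()

conn.close()
```

The same SQL text can be pasted into a %%sql cell; only the toy tables would need to be swapped for the persisted ones.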
In [ ]:
2. (30 points)
- Use a raw count bag of words model for unigrams and bigrams to generate feature vectors for these two documents. For simplicity, you may tokenize by removing punctuation, splitting by white space to find words, and converting all words to lowercase.
- Implement a function to calculate cosine similarity between two vectors without using any trigonometric functions, built-in distance functions or linear algebra modules. Find the cosine similarity between the two documents. Recall that the cosine similarity is the dot product of two unit vectors.
Only use the Python standard library and numpy to do this exercise.
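To make the two steps concrete, here is a minimal sketch on two made-up toy sentences rather than the documents for this question. The helper names tokenize, ngram_counts and cosine_similarity are hypothetical, chosen for this example, and only the standard library is used. The cosine similarity is computed as the dot product divided by the product of the vector lengths, which is the same as the dot product of the two unit vectors.

```python
import string
from collections import Counter

def tokenize(text):
    """Strip punctuation, lowercase, and split on white space."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()

def ngram_counts(tokens):
    """Raw counts of unigrams and bigrams (bigrams joined by a space)."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def cosine_similarity(u, v):
    """Dot product of u and v after scaling each to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy inputs (not the assignment documents).
toy1 = "The cat sat."
toy2 = "The cat ran!"

c1 = ngram_counts(tokenize(toy1))
c2 = ngram_counts(tokenize(toy2))

# A shared, sorted vocabulary fixes the feature order for both vectors.
vocab = sorted(set(c1) | set(c2))
v1 = [c1[term] for term in vocab]
v2 = [c2[term] for term in vocab]

sim = cosine_similarity(v1, v2)
```

Each toy sentence has five features with count 1, and three of them ("the", "cat", "the cat") are shared, so the similarity works out to 3/5.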
In [12]:
doc1 = """As I was going by Charing Cross,
I saw a black man upon a black horse;
They told me it was King Charles the First-
Oh dear, my heart was ready to burst!"""
doc2 = """As I was going to St. Ives,
I met a man with seven wives,
Each wife had seven sacks,
Each sack had seven cats,
Each cat had seven kits:
Kits, cats, sacks, and wives,
How many were there going to St. Ives"""
In [ ]:
3. (30 points)
- Fit polynomials of order 2, 3 and 4 to the data set x and y by solving the normal equations \((X^TX) \hat{\beta} = X^Ty\).
- Plot the fit against the data for each model.
- Calculate the sum of squares error for each model using leave one out cross-validation.
You may use numpy.linalg for this.
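One possible way to organize the computation is sketched below on synthetic data generated from a known quadratic, not the x and y for this question. The function names design_matrix, fit_normal_equations and loocv_sse are hypothetical. The sketch assumes the design matrix has columns 1, x, x^2, ... and solves \((X^TX) \hat{\beta} = X^Ty\) directly with numpy.linalg.solve; plotting is left out.

```python
import numpy as np

def design_matrix(x, order):
    """Columns 1, x, x**2, ..., x**order."""
    return np.vander(x, order + 1, increasing=True)

def fit_normal_equations(x, y, order):
    """Solve (X^T X) beta = X^T y for beta."""
    X = design_matrix(x, order)
    return np.linalg.solve(X.T @ X, X.T @ y)

def loocv_sse(x, y, order):
    """Sum of squared prediction errors, each point held out once."""
    sse = 0.0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # every index except i
        beta = fit_normal_equations(x[keep], y[keep], order)
        pred = design_matrix(x[i:i + 1], order) @ beta
        sse += float((y[i] - pred[0]) ** 2)
    return sse

# Synthetic noise-free data from y = 1 + 2x + 3x^2 (illustration only).
x_toy = np.linspace(-1, 1, 8)
y_toy = 1 + 2 * x_toy + 3 * x_toy ** 2

beta_toy = fit_normal_equations(x_toy, y_toy, 2)
sse_toy = loocv_sse(x_toy, y_toy, 2)
```

Because the toy data are exactly quadratic, the order 2 fit should recover the coefficients (1, 2, 3) and the leave-one-out error should be zero up to rounding. On noisy data the leave-one-out error is what distinguishes a model that fits well from one that overfits.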
In [13]:
x = np.array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00,
1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00,
3.96767474e+00, 5.38816734e+00])
y = np.array([29.05627699, 22.38450486, 3.33047527, 23.84338844, 16.98396787,
9.32107716, 17.8343173 , 25.23079674, 28.068074 , 26.74943485])
In [ ]:
4. (20 points)
- Write a gradient descent algorithm to fit a cubic polynomial to the data from question 3. Use a learning rate of 1e-5 and 1 million iterations, and start with \(\beta_0 = (1,10,1,1)\).
- Use a JIT decorator to create a compiled version and report the fold-change improvement in run time. Use the timeit.timeit function with argument number=1, and use a lambda function to pass in a function that takes 0 arguments.
- Plot the fitted curve
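The loop structure and the timing pattern might look like the sketch below. It makes several assumptions that differ from the question: the data are synthetic and scaled to [-1, 1], the gradient is that of the mean squared error, and the learning rate (0.1) and iteration count (20000) are chosen so the toy problem converges quickly. The name gradient_descent is hypothetical. No JIT decorator is applied here; a compiled version would be timed with the same timeit.timeit call, and the fold-change is the uncompiled time divided by the compiled time.

```python
import timeit
import numpy as np

def gradient_descent(X, y, beta0, learning_rate, n_iter):
    """Minimize mean squared error of X @ beta against y by gradient descent."""
    beta = np.array(beta0, dtype=float)
    n = len(y)
    for _ in range(n_iter):
        residual = X @ beta - y
        grad = 2.0 / n * (X.T @ residual)   # gradient of mean((X beta - y)^2)
        beta -= learning_rate * grad
    return beta

# Synthetic cubic data on [-1, 1] (illustration only, not the question's data).
x_toy = np.linspace(-1, 1, 20)
X_toy = np.vander(x_toy, 4, increasing=True)   # columns 1, x, x^2, x^3
true_beta = np.array([1.0, -2.0, 0.5, 3.0])
y_toy = X_toy @ true_beta

beta_hat = gradient_descent(X_toy, y_toy, [1, 10, 1, 1], 0.1, 20000)

# timeit.timeit expects a zero-argument callable, hence the lambda.
elapsed = timeit.timeit(
    lambda: gradient_descent(X_toy, y_toy, [1, 10, 1, 1], 0.1, 20000),
    number=1,
)
```

When timing a JIT-compiled function, the first call typically includes compilation time, so it is common to call the function once before measuring.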
In [ ]: