Lab04: Data manipulation¶

Brief Honor Code. Do the homework on your own. You may discuss ideas with your classmates, but DO NOT copy the solutions from someone else or the Internet. If stuck, discuss with TA.

In [1]:

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import string
import toolz as tz

1. (10 points)

Write a function pdsist(xs) which returns a matrix of the pairwise distance between the collection of vectors in xs using Euclidean distance.

Recall that Euclidean distance between two vectors \(x\) and \(y\) is

\[d(x, y) = \sqrt{\sum{(y-x)^2}}\]

Find the square distance matrix for

xs = np.array([[0.20981496, 0.54777461, 0.9398527 ],
       [0.63149939, 0.935947  , 0.29834026],
       [0.46302941, 0.25515557, 0.0698739 ],
       [0.38192644, 0.42378508, 0.26055664],
       [0.46307302, 0.05943961, 0.60204931]])

Do this without using any for loops.

In [2]:

2. (20 points)

Wikipedia gives this algorithm for finding prime numbers

To find all the prime numbers less than or equal to a given integer n by Eratosthenes’ method:

Create a list of consecutive integers from 2 through n: (2, 3, 4, …, n).
Initially, let p equal 2, the smallest prime number.
Enumerate the multiples of p by counting to n from 2p in increments of p, and mark them in the list (these will be 2p, 3p, 4p, …; the p itself should not be marked).
Find the first number greater than p in the list that is not marked. If there was no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3.
When the algorithm terminates, the numbers remaining not marked in the list are all the primes below n.

Find all primes less than 1,000 using this method.

You may use numpy and do not have to follow the algorithm exactly if you can achieve the same results.

In [2]:

3. (20 points)

Load the iris data set from R into a pandas DataFrame and create a new DataFrame iris_scaled where each feature (column) has zero mean and unit standard deviation (5)
Calculate a pairwise distance matrix (using Euclidean distance) of the normalized features for each flower using the function you wrote in 1 (5)
Using seaborn, plot a clustermap of the original iris data, with row color labels for the Species (5)
Using a single seaborn function, make a 2 by 2 grid of boxplots for each feature to compare across iris species. The y-axis scale need not be the same across subplots. (5)

In [2]:

4. (50 points)

This is an example of the messiness of real-life data munging, using a (truncated and randomized) data set provided by an HIV researcher. In this data set, HIV-positive subjects were tested for sexually transmitted infections (STI) at each visit (and treated if they were positive).

You will probably need to review the pandas documentation and examples carefully to complete this exercise.

The file ‘hiv.csv’ contains data the HIV subjects. The variables are

PID: A randomly generated patient ID
Race: The race of the subject as a single letter code
Age: The age of the subject at study entry
Visit Date: Date of visit at which STI data was collected
STIs: Type of STI detected if any

The Race and Age of each subject are only recorded for the first visit.

Objectives:

Read the data set into a pandas DataFrame (5)
Fill up the missing Race and Age values (5)
Reclassify the STIs into one of the following categories (‘none’, ‘syphilis’, ‘chlamydia’, ‘gonorrhea’, ‘HCV’, NaN) by grouping in the obvious way. Note that ‘none’ means no infection and NaN (np.nan) means missing data. (10)
FInd the subject(s) with the most number of visits (how many are there?) (5)
Perform a chi-square test to see if an STI at the first visit is associated with the risk of an STI at any subsequent visit. You only need to do this for subjects who have more than 1 visit. You can use from scipy.stats import chi2_contingency to do the \(\chi^2\) test. (20)

In [2]: