# Lab04: Data manipulation¶

**Brief Honor Code**. Do the homework on your own. You may discuss ideas
with your classmates, but DO NOT copy the solutions from someone else or
the Internet. If stuck, discuss with TA.

```
In [1]:
```

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import string
import toolz as tz
```

**1**. (10 points)

Write a function `pdsist(xs)`

which returns a matrix of the pairwise
distance between the collection of vectors in `xs`

using Euclidean
distance.

Recall that Euclidean distance between two vectors \(x\) and \(y\) is

Find the square distance matrix for

```
xs = np.array([[0.20981496, 0.54777461, 0.9398527 ],
[0.63149939, 0.935947 , 0.29834026],
[0.46302941, 0.25515557, 0.0698739 ],
[0.38192644, 0.42378508, 0.26055664],
[0.46307302, 0.05943961, 0.60204931]])
```

Do this **without** using any for loops.

```
In [2]:
```

```
```

**2**. (20 points)

Wikipedia gives this algorithm for finding prime numbers

To find all the prime numbers less than or equal to a given integer n by Eratosthenes’ method:

- Create a list of consecutive integers from 2 through n: (2, 3, 4, …, n).
- Initially, let p equal 2, the smallest prime number.
- Enumerate the multiples of p by counting to n from 2p in increments of p, and mark them in the list (these will be 2p, 3p, 4p, …; the p itself should not be marked).
- Find the first number greater than p in the list that is not marked. If there was no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3.
- When the algorithm terminates, the numbers remaining not marked in the list are all the primes below n.

Find all primes less than 1,000 using this method.

- You may use
`numpy`

and do not have to follow the algorithm exactly if you can achieve the same results.

```
In [2]:
```

```
```

**3**. (20 points)

- Load the
`iris`

data set from R into a`pandas`

DataFrame and create a new DataFrame`iris_scaled`

where each feature (column) has zero mean and unit standard deviation (5) - Calculate a pairwise distance matrix (using Euclidean distance) of
the normalized features for each flower using the function you wrote
in
**1**(5) - Using
`seaborn`

, plot a`clustermap`

of the original`iris`

data, with row color labels for the Species (5) - Using a single
`seaborn`

function, make a 2 by 2 grid of boxplots for each feature to compare across iris species. The y-axis scale need not be the same across subplots. (5)

```
In [2]:
```

```
```

**4**. (50 points)

This is an example of the messiness of real-life data munging, using a (truncated and randomized) data set provided by an HIV researcher. In this data set, HIV-positive subjects were tested for sexually transmitted infections (STI) at each visit (and treated if they were positive).

You will probably need to review the `pandas`

documentation and
examples carefully to complete this exercise.

The file ‘hiv.csv’ contains data the HIV subjects. The variables are

- PID: A randomly generated patient ID
- Race: The race of the subject as a single letter code
- Age: The age of the subject at study entry
- Visit Date: Date of visit at which STI data was collected
- STIs: Type of STI detected if any

The Race and Age of each subject are only recorded for the first visit.

Objectives:

- Read the data set into a
`pandas`

DataFrame (5) - Fill up the missing Race and Age values (5)
- Reclassify the STIs into one of the following categories (‘none’, ‘syphilis’, ‘chlamydia’, ‘gonorrhea’, ‘HCV’, NaN) by grouping in the obvious way. Note that ‘none’ means no infection and NaN (np.nan) means missing data. (10)
- FInd the subject(s) with the most number of visits (how many are there?) (5)
- Perform a chi-square test to see if an STI at the
**first**visit is associated with the risk of an STI at**any**subsequent visit. You only need to do this for subjects who have more than 1 visit. You can use`from scipy.stats import chi2_contingency`

to do the \(\chi^2\) test. (20)

```
In [2]:
```

```
```