In [4]:

%matplotlib inline

In [5]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Homework: Data Munging with `pandas`¶

1 (10 points)

You are given the URLs of two data sets. The first is for North Carolina School Performance Data 2016-2017 in JSON format, and the second is for Durham Public Schools Locations in a delimited format.

url1 = "goo.gl/3JTQkF"
url2 = "goo.gl/y2Ru2v"

Use curl in a bash cell to download these files preserving their original filenames into a directory called data

Note: This may take a while as one of the files is about 100 MB.

In [ ]:

2. (10 points)

Use head in a bash shell to view the first few lines of the file public-schools.csv
Read the file into a pandas DataFrame called dps
Show the names of all public schools in Durham (without repeats)

In [ ]:

3. (10 points)

Read the contents of the file north-carolina-school-performance-data.json into a pandas DataFrame called df. This wil have 4 columns
- datasetid
- fields
- record_timestamp
- recordid
Convert the fields column into a pandas DataFrame called nc with 85766 rows and 32 columns

In [9]:

4. (10 points)

List the names of all public schools in Durham found in the nc DataFrame.

In [ ]:

5. (20 points)

Create a dictionary mapping the short school names in dps to the full school names in nc.
Use this dictionary to rename the entries in the SCHOOL column of dps
You should be able to find matches for all schools except one (Chewning) by preprocessing strings before matching

In [ ]:

6. (10 points)

Find the top 10 districts with the highest average number of students in a school. Do not include ‘State of North Carolina’ as a district.

In [ ]:

7. (10 points)

Add the ‘Geo Point’ data from dps as a column in nc.

In [ ]:

8. (20 points)

Make a table of the mean percentage across the races (asian, black, hispanic, white) where the columns are subjects and the rows are district_names
Make boxplots of the percentage of the races (asian, black, hispanic, white) by subject (column) and district_name (row). Save the image as school_math.png

Note

In both cases, we only want to consider data for the 4 races listed
In both cases, only include subjects with the phrase ‘EOG Math Grade N’ in them, where N is 3,4 or 5.

You may need to do some preprocessing and create a DataFrame in the long format.

Replace * with np.nan
Replace <5 with a random integer from 0,1,2,3,4
Replace >95 with a random integer from 96,97,98,99,100

Format the figure so it looks like this

Figure

In [19]:

Homework: Data Munging with `pandas`¶

Previous page

Next page

This Page

Homework: Data Munging with pandas¶

Homework: Data Munging with `pandas`¶