In [4]:
%matplotlib inline
In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Homework: Data Munging with pandas

1 (10 points)

You are given the URLs of two data sets. The first is for North Carolina School Performance Data 2016-2017 in JSON format, and the second is for Durham Public Schools Locations in a delimited format.

url1 = "goo.gl/3JTQkF"
url2 = "goo.gl/y2Ru2v"
  • Use curl in a bash cell to download these files preserving their original filenames into a directory called data

Note: This may take a while as one of the files is about 100 MB.

In [ ]:




2. (10 points)

  • Use head in a bash shell to view the first few lines of the file public-schools.csv
  • Read the file into a pandas DataFrame called dps
  • Show the names of all public schools in Durham (without repeats)
In [ ]:




3. (10 points)

  • Read the contents of the file north-carolina-school-performance-data.json into a pandas DataFrame called df. This wil have 4 columns
    • datasetid
    • fields
    • record_timestamp
    • recordid
  • Convert the fields column into a pandas DataFrame called nc with 85766 rows and 32 columns
In [9]:




4. (10 points)

  • List the names of all public schools in Durham found in the nc DataFrame.
In [ ]:




5. (20 points)

  • Create a dictionary mapping the short school names in dps to the full school names in nc.
  • Use this dictionary to rename the entries in the SCHOOL column of dps
  • You should be able to find matches for all schools except one (Chewning) by preprocessing strings before matching
In [ ]:




6. (10 points)

  • Find the top 10 districts with the highest average number of students in a school. Do not include ‘State of North Carolina’ as a district.
In [ ]:




7. (10 points)

  • Add the ‘Geo Point’ data from dps as a column in nc.
In [ ]:




8. (20 points)

  • Make a table of the mean percentage across the races (asian, black, hispanic, white) where the columns are subjects and the rows are district_names
  • Make boxplots of the percentage of the races (asian, black, hispanic, white) by subject (column) and district_name (row). Save the image as school_math.png

Note

  • In both cases, we only want to consider data for the 4 races listed
  • In both cases, only include subjects with the phrase ‘EOG Math Grade N’ in them, where N is 3,4 or 5.

You may need to do some preprocessing and create a DataFrame in the long format.

  • Replace * with np.nan
  • Replace <5 with a random integer from 0,1,2,3,4
  • Replace >95 with a random integer from 96,97,98,99,100

Format the figure so it looks like this

Figure

Figure

In [19]: