In [4]:
%matplotlib inline
In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Homework: Data Munging with pandas
1. (10 points)
You are given the URLs of two data sets. The first is for North Carolina School Performance Data 2016-2017 in JSON format, and the second is for Durham Public Schools Locations in a delimited format.
url1 = "goo.gl/3JTQkF"
url2 = "goo.gl/y2Ru2v"
- Use curl in a bash cell to download these files, preserving their original filenames, into a directory called data
Note: This may take a while as one of the files is about 100 MB.
In [ ]:
2. (10 points)
- Use head in a bash shell to view the first few lines of the file public-schools.csv
- Read the file into a pandas DataFrame called dps
- Show the names of all public schools in Durham (without repeats)
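As a minimal sketch of the read-then-deduplicate step, the snippet below parses a small in-memory stand-in for the file. The semicolon delimiter and the toy rows are assumptions made for illustration; the head output shows which delimiter public-schools.csv really uses. The SCHOOL and Geo Point column names are taken from later questions.

```python
import io
import pandas as pd

# Toy stand-in for public-schools.csv. The ";" delimiter and these rows are
# assumptions; inspect the real file with head before choosing sep.
raw = io.StringIO(
    "SCHOOL;Geo Point\n"
    "Alpha Elementary;35.99, -78.90\n"
    "Beta Middle;36.01, -78.88\n"
    "Alpha Elementary;35.99, -78.90\n"
)

dps_toy = pd.read_csv(raw, sep=";")

# drop_duplicates keeps the first occurrence of each name, in file order.
unique_names = dps_toy["SCHOOL"].drop_duplicates().tolist()
print(unique_names)
```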
In [ ]:
3. (10 points)
- Read the contents of the file north-carolina-school-performance-data.json into a pandas DataFrame called df. This will have 4 columns:
  - datasetid
  - fields
  - record_timestamp
  - recordid
- Convert the fields column into a pandas DataFrame called nc with 85766 rows and 32 columns
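One way to expand a column of dictionaries into its own DataFrame is sketched below on two invented records. The keys inside fields (school_name, num_tested) are hypothetical placeholders, not the real 32 columns, and the sketch assumes every entry of fields is a plain dict.

```python
import pandas as pd

# Toy records mimicking the four-column layout described above.
# The keys inside "fields" are invented for illustration.
records = [
    {"datasetid": "toy", "recordid": "a1", "record_timestamp": "2017-01-01",
     "fields": {"school_name": "Alpha Elementary", "num_tested": 40}},
    {"datasetid": "toy", "recordid": "b2", "record_timestamp": "2017-01-01",
     "fields": {"school_name": "Beta Middle", "num_tested": 55}},
]
df_toy = pd.DataFrame(records)

# Each entry of "fields" is a dict, so the list of dicts builds a DataFrame
# with one column per key; reusing the index keeps rows aligned with df_toy.
nc_toy = pd.DataFrame(df_toy["fields"].tolist(), index=df_toy.index)
print(nc_toy.shape)
```

On the real data, checking nc.shape against the stated 85766 rows and 32 columns is a quick sanity test.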
In [9]:
4. (10 points)
- List the names of all public schools in Durham found in the nc DataFrame.
In [ ]:
5. (20 points)
- Create a dictionary mapping the short school names in dps to the full school names in nc.
- Use this dictionary to rename the entries in the SCHOOL column of dps
- You should be able to find matches for all schools except one (Chewning) by preprocessing strings before matching
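The matching idea can be sketched as follows: normalize both sets of names the same way, then accept a match only when it is unambiguous. The school names here are hypothetical, and prefix matching is just one possible rule; the real names may need different preprocessing.

```python
import re

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())

# Hypothetical names standing in for the short (dps) and full (nc) spellings.
short_names = ["Alpha", "Beta-Gamma", "Delta"]
full_names = ["Alpha Elementary", "Beta Gamma Middle"]

name_map = {}
for short in short_names:
    key = normalize(short)
    matches = [full for full in full_names if normalize(full).startswith(key)]
    # Keep only unambiguous matches; unmatched names are simply left out.
    if len(matches) == 1:
        name_map[short] = matches[0]

print(name_map)
```

With a dictionary like this, dps["SCHOOL"].replace(name_map) renames the matched entries and leaves unmatched ones unchanged.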
In [ ]:
6. (10 points)
- Find the top 10 districts with the highest average number of students in a school. Do not include ‘State of North Carolina’ as a district.
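A sketch of the filter, group, and rank pattern on toy data follows. The num_students and school_name columns are hypothetical, and the sketch assumes one row per school; if the real table repeats a school across subjects or subgroups, it would need to be reduced to one row per school first.

```python
import pandas as pd

# Toy data: one row per school. Column names other than district_name
# are assumptions, not the real schema.
toy = pd.DataFrame({
    "district_name": ["A", "A", "B", "State of North Carolina", "C"],
    "school_name": ["a1", "a2", "b1", "all", "c1"],
    "num_students": [100, 300, 500, 10000, 50],
})

top = (
    toy[toy["district_name"] != "State of North Carolina"]  # drop the statewide row
    .groupby("district_name")["num_students"]
    .mean()          # average school size per district
    .nlargest(10)    # at most 10 districts, sorted in descending order
)
print(top)
```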
In [ ]:
7. (10 points)
- Add the ‘Geo Point’ data from dps as a column in nc.
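One possible approach is a lookup by school name, sketched here on toy frames. It assumes the SCHOOL entries in dps have already been renamed to match nc, that each school appears once in dps, and that nc holds the name in a column called school_name (a hypothetical name).

```python
import pandas as pd

dps_toy = pd.DataFrame({
    "SCHOOL": ["Alpha Elementary", "Beta Middle"],
    "Geo Point": ["35.99, -78.90", "36.01, -78.88"],
})
# "school_name" and "subject" are placeholder column names for this sketch.
nc_toy = pd.DataFrame({
    "school_name": ["Alpha Elementary", "Alpha Elementary", "Gamma High"],
    "subject": ["EOG Math Grade 3", "EOG Math Grade 4", "EOG Math Grade 3"],
})

# A Series indexed by school name acts as a lookup table for map();
# schools absent from dps_toy receive NaN.
geo_lookup = dps_toy.set_index("SCHOOL")["Geo Point"]
nc_toy["Geo Point"] = nc_toy["school_name"].map(geo_lookup)
print(nc_toy)
```

A left merge on the name columns would give the same result; map is used here because it adds exactly one column and cannot change the number of rows.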
In [ ]:
8. (20 points)
- Make a table of the mean percentage across the races (asian, black, hispanic, white) where the columns are subjects and the rows are district_names
- Make boxplots of the percentage of the races (asian, black, hispanic, white) by subject (column) and district_name (row). Save the image as school_math.png
Note
- In both cases, we only want to consider data for the 4 races listed
- In both cases, only include subjects with the phrase ‘EOG Math Grade N’ in them, where N is 3, 4, or 5.
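The two restrictions and the table can be sketched on a small long-format frame. The race and pct column names and all values are invented for illustration; the real data may store the race percentages as separate columns that first need reshaping.

```python
import pandas as pd

# Toy long-format data: one row per (district, subject, race) observation.
long_toy = pd.DataFrame({
    "district_name": ["A", "A", "A", "B", "B", "B"],
    "subject": ["EOG Math Grade 3", "EOG Math Grade 3", "EOG Reading Grade 3",
                "EOG Math Grade 3", "EOG Math Grade 4", "EOG Math Grade 4"],
    "race": ["asian", "black", "white", "hispanic", "white", "other"],
    "pct": [80.0, 60.0, 70.0, 50.0, 90.0, 10.0],
})

races = ["asian", "black", "hispanic", "white"]
keep = (
    long_toy["race"].isin(races)
    & long_toy["subject"].str.contains(r"EOG Math Grade [345]", regex=True)
)

# Rows are districts, columns are subjects, cells are the mean percentage
# over the races that remain after filtering.
table = long_toy[keep].pivot_table(
    index="district_name", columns="subject", values="pct", aggfunc="mean"
)
print(table)
```

The same filtered long-format frame could feed seaborn's catplot with kind="box", using row and col for district_name and subject, though the exact styling needed to match the target figure is not shown here.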
You may need to do some preprocessing and create a DataFrame in the long format.
- Replace * with np.nan
- Replace <5 with a random integer from 0, 1, 2, 3, 4
- Replace >95 with a random integer from 96, 97, 98, 99, 100
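The three replacements above can be sketched as a single cleaning function applied element by element. This assumes the masked values appear exactly as the strings "*", "<5" and ">95" and that every other entry parses as a number; the helper name clean_percentage is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def clean_percentage(value):
    """Convert one raw entry to a float, resolving the masked codes."""
    if value == "*":
        return np.nan
    if value == "<5":
        return float(rng.integers(0, 5))     # upper bound exclusive: 0..4
    if value == ">95":
        return float(rng.integers(96, 101))  # upper bound exclusive: 96..100
    return float(value)

raw = pd.Series(["*", "<5", ">95", "42.5", "<5"])
cleaned = raw.map(clean_percentage)
print(cleaned)
```

Drawing a fresh random integer per cell, rather than one value for the whole column, keeps the imputed values from piling up on a single number in the boxplots.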
Format the figure so it looks like this:
[Figure: target layout for school_math.png]
In [19]: