In [1]:
import pandas as pd
from pandas import DataFrame, Series
Introduction to DataFrames and Series¶
DataFrame¶
The workshop emphasizes the use of data frames, because most data analysis and plotting libraries are designed to work best with this data structure. The data frame is basically a table with rows and columns, just like a spreadsheet. Generally (and it is good practice), each row contains one observation and each column contains one variable. If the data frame is organized this way, it is known as a tidy data frame and is easy to manipulate and query. Much of data preprocessing involves getting the data into this tidy format.
Read from CSV¶
In [2]:
iris_1 = pd.read_csv('data/iris.csv')
In [3]:
iris_1.head()
Out[3]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Write to CSV¶
In [4]:
iris_1.to_csv('data/iris_1.csv', index=False)
Read from Excel¶
In [5]:
iris_2 = pd.read_excel('data/iris.xlsx')
In [6]:
iris_2.head()
Out[6]:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Write to Excel¶
In [7]:
iris_2.to_excel('data/iris_2.xlsx', index=False)
Creating a DataFrame from scratch¶
In [8]:
subject = ['ann', 'bob', 'charles', 'david', 'ella']
sex = ['F', 'M', 'M', 'M', 'F']
group = ['A', 'A', 'A', 'B', 'B']
age = [23, 34, 27, 30, 27]
score = [89, 90, 78, 98, 90]
df = DataFrame(data = {'name': subject, 'sex': sex, 'age': age, 'group': group, 'score': score})
In [9]:
df
Out[9]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 0 | 23 | A | ann | 89 | F |
| 1 | 34 | A | bob | 90 | M |
| 2 | 27 | A | charles | 78 | M |
| 3 | 30 | B | david | 98 | M |
| 4 | 27 | B | ella | 90 | F |
Series¶
The data about a variable in a single column of the DataFrame is known as a Series. We sill start by learning how to manipulate a single variable, in other words, how to work with a Series data type.
Extracting a Series from a DataFrame¶
Using dot notation¶
In [11]:
age = df.age
age
Out[11]:
0 23
1 34
2 27
3 30
4 27
Name: age, dtype: int64
Using an index¶
In [12]:
age = df.ix[:,0]
age
Out[12]:
0 23
1 34
2 27
3 30
4 27
Name: age, dtype: int64
Getting information about a series¶
In [13]:
type(age)
Out[13]:
pandas.core.series.Series
In [14]:
age.index
Out[14]:
RangeIndex(start=0, stop=5, step=1)
In [15]:
age.values
Out[15]:
array([23, 34, 27, 30, 27])
In [16]:
age.unique()
Out[16]:
array([23, 34, 27, 30])
Manipulating a Series¶
In [17]:
age.sort_values()
Out[17]:
0 23
2 27
4 27
3 30
1 34
Name: age, dtype: int64
In [18]:
age.nsmallest(3)
Out[18]:
0 23
2 27
4 27
Name: age, dtype: int64
Working with Series containing strings¶
In [19]:
name = df.name
name.head(3)
Out[19]:
0 ann
1 bob
2 charles
Name: name, dtype: object
In [20]:
name.str.title()
Out[20]:
0 Ann
1 Bob
2 Charles
3 David
4 Ella
Name: name, dtype: object
In [21]:
name.str[1:3]
Out[21]:
0 nn
1 ob
2 ha
3 av
4 ll
Name: name, dtype: object
In [22]:
name.replace({'ann': 'angelina jolie', 'bob': 'brad pitt'})
Out[22]:
0 angelina jolie
1 brad pitt
2 charles
3 david
4 ella
Name: name, dtype: object
Working with Series containing categorical data¶
In [23]:
df.sex
Out[23]:
0 F
1 M
2 M
3 M
4 F
Name: sex, dtype: object
In [24]:
sex = df.sex.astype('category')
sex
Out[24]:
0 F
1 M
2 M
3 M
4 F
Name: sex, dtype: category
Categories (2, object): [F, M]
In [25]:
sex.cat.codes
Out[25]:
0 0
1 1
2 1
3 1
4 0
dtype: int8
In [26]:
sex.cat.categories
Out[26]:
Index(['F', 'M'], dtype='object')
Back to DataFrame¶
Getting information about a DataFrame¶
In [27]:
df.shape
Out[27]:
(5, 5)
In [28]:
df.index
Out[28]:
RangeIndex(start=0, stop=5, step=1)
In [29]:
df.columns
Out[29]:
Index(['age', 'group', 'name', 'score', 'sex'], dtype='object')
In [30]:
df.dtypes
Out[30]:
age int64
group object
name object
score int64
sex object
dtype: object
Extracting rows¶
In [31]:
df.head(2)
Out[31]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 0 | 23 | A | ann | 89 | F |
| 1 | 34 | A | bob | 90 | M |
In [32]:
df.tail(2)
Out[32]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 3 | 30 | B | david | 98 | M |
| 4 | 27 | B | ella | 90 | F |
In [33]:
df.sample(2)
Out[33]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 4 | 27 | B | ella | 90 | F |
| 0 | 23 | A | ann | 89 | F |
Extracting columns¶
In [34]:
df['name']
Out[34]:
0 ann
1 bob
2 charles
3 david
4 ella
Name: name, dtype: object
In [35]:
df.name
Out[35]:
0 ann
1 bob
2 charles
3 david
4 ella
Name: name, dtype: object
In [36]:
df[[0]]
Out[36]:
| age | |
|---|---|
| 0 | 23 |
| 1 | 34 |
| 2 | 27 |
| 3 | 30 |
| 4 | 27 |
In [37]:
df[['age', 'sex']]
Out[37]:
| age | sex | |
|---|---|---|
| 0 | 23 | F |
| 1 | 34 | M |
| 2 | 27 | M |
| 3 | 30 | M |
| 4 | 27 | F |
In [38]:
df[[1,2]]
Out[38]:
| group | name | |
|---|---|---|
| 0 | A | ann |
| 1 | A | bob |
| 2 | A | charles |
| 3 | B | david |
| 4 | B | ella |
Using indexing to extract parts of a DataFrame¶
There are actually 3 separate methods to index a DataFrame - using
loc, iloc and ix. We will only show ix because it is the
most flexible, but see
here
for an example if you want to know more. In particular, working with
indexes containing only integers can be confusing at first.
In [39]:
df.ix[0,0]
Out[39]:
23
In [40]:
df.ix[1,1]
Out[40]:
'A'
In [41]:
df.ix[1:4, :2]
Out[41]:
| age | group | |
|---|---|---|
| 1 | 34 | A |
| 2 | 27 | A |
| 3 | 30 | B |
| 4 | 27 | B |
In [42]:
df.ix[[1,3,4], ['age', 'name']]
Out[42]:
| age | name | |
|---|---|---|
| 1 | 34 | bob |
| 3 | 30 | david |
| 4 | 27 | ella |
Logical indexing¶
In [43]:
df[df.age > 25]
Out[43]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 1 | 34 | A | bob | 90 | M |
| 2 | 27 | A | charles | 78 | M |
| 3 | 30 | B | david | 98 | M |
| 4 | 27 | B | ella | 90 | F |
In [44]:
df[(df.age > 25) & (df.sex == 'M')]
Out[44]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 1 | 34 | A | bob | 90 | M |
| 2 | 27 | A | charles | 78 | M |
| 3 | 30 | B | david | 98 | M |
Sorting¶
In [45]:
df.sort_values('age')
Out[45]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 0 | 23 | A | ann | 89 | F |
| 2 | 27 | A | charles | 78 | M |
| 4 | 27 | B | ella | 90 | F |
| 3 | 30 | B | david | 98 | M |
| 1 | 34 | A | bob | 90 | M |
In [46]:
df.sort_values('age', ascending = False)
Out[46]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 1 | 34 | A | bob | 90 | M |
| 3 | 30 | B | david | 98 | M |
| 2 | 27 | A | charles | 78 | M |
| 4 | 27 | B | ella | 90 | F |
| 0 | 23 | A | ann | 89 | F |
In [47]:
df.sort_values(['age', 'score'], ascending = ['True', 'True'])
Out[47]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 0 | 23 | A | ann | 89 | F |
| 2 | 27 | A | charles | 78 | M |
| 4 | 27 | B | ella | 90 | F |
| 3 | 30 | B | david | 98 | M |
| 1 | 34 | A | bob | 90 | M |
In [48]:
df.sort_values(['age', 'score'], ascending = ['True', 'False'], inplace = True)
df
Out[48]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 0 | 23 | A | ann | 89 | F |
| 2 | 27 | A | charles | 78 | M |
| 4 | 27 | B | ella | 90 | F |
| 3 | 30 | B | david | 98 | M |
| 1 | 34 | A | bob | 90 | M |
In [49]:
df.sort_index()
Out[49]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 0 | 23 | A | ann | 89 | F |
| 1 | 34 | A | bob | 90 | M |
| 2 | 27 | A | charles | 78 | M |
| 3 | 30 | B | david | 98 | M |
| 4 | 27 | B | ella | 90 | F |
Getting summary Statistics¶
In [50]:
df.mean()
Out[50]:
age 28.2
score 89.0
dtype: float64
In [51]:
df.score.var()
Out[51]:
51.0
In [52]:
df.count()
Out[52]:
age 5
group 5
name 5
score 5
sex 5
dtype: int64
In [53]:
df.describe()
Out[53]:
| age | score | |
|---|---|---|
| count | 5.000000 | 5.000000 |
| mean | 28.200000 | 89.000000 |
| std | 4.086563 | 7.141428 |
| min | 23.000000 | 78.000000 |
| 25% | 27.000000 | 89.000000 |
| 50% | 27.000000 | 90.000000 |
| 75% | 30.000000 | 90.000000 |
| max | 34.000000 | 98.000000 |
Exercises¶
1. Read in the file at data/dummy.xlsx into a DataFrame called
data.
In [3]:
data = pd.read_excel('data/dummy.xlsx')
2. Display just rows 2,3 and 4.
In [5]:
data.iloc[2:5]
Out[5]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 2 | 27 | A | charles | 78 | M |
| 3 | 30 | B | david | 98 | M |
| 4 | 27 | B | ella | 90 | F |
3. Display just the ‘name’ and ‘age’ columns from the DataFrame.
In [8]:
data[['name', 'age']]
Out[8]:
| name | age | |
|---|---|---|
| 0 | ann | 23 |
| 1 | bob | 34 |
| 2 | charles | 27 |
| 3 | david | 30 |
| 4 | ella | 27 |
4. Sort the DataFrame in descending order of age.
In [9]:
data.sort_values('age', ascending=False)
Out[9]:
| age | group | name | score | sex | |
|---|---|---|---|---|---|
| 1 | 34 | A | bob | 90 | M |
| 3 | 30 | B | david | 98 | M |
| 2 | 27 | A | charles | 78 | M |
| 4 | 27 | B | ella | 90 | F |
| 0 | 23 | A | ann | 89 | F |