In [1]:
import pandas as pd
from pandas import DataFrame, Series

Introduction to DataFrames and Series

DataFrame

The workshop emphasizes the use of data frames, because most data analysis and plotting libraries are designed to work best with this data structure. The data frame is basically a table with rows and columns, just like a spreadsheet. Generally (and it is good practice), each row contains one observation and each column contains one variable. If the data frame is organized this way, it is known as a tidy data frame and is easy to manipulate and query. Much of data preprocessing involves getting the data into this tidy format.

Read from CSV

In [2]:
iris_1 = pd.read_csv('data/iris.csv')
In [3]:
iris_1.head()
Out[3]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Write to CSV

In [4]:
iris_1.to_csv('data/iris_1.csv', index=False)

Read from Excel

In [5]:
iris_2 = pd.read_excel('data/iris.xlsx')
In [6]:
iris_2.head()
Out[6]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Write to Excel

In [7]:
iris_2.to_excel('data/iris_2.xlsx', index=False)

Creating a DataFrame from scratch

In [8]:
subject = ['ann', 'bob', 'charles', 'david', 'ella']
sex = ['F', 'M', 'M', 'M', 'F']
group = ['A', 'A', 'A', 'B', 'B']
age = [23, 34, 27, 30, 27]
score = [89, 90, 78, 98, 90]
df = DataFrame(data = {'name': subject, 'sex': sex, 'age': age, 'group': group, 'score': score})
In [9]:
df
Out[9]:
age group name score sex
0 23 A ann 89 F
1 34 A bob 90 M
2 27 A charles 78 M
3 30 B david 98 M
4 27 B ella 90 F

Series

The data about a variable in a single column of the DataFrame is known as a Series. We sill start by learning how to manipulate a single variable, in other words, how to work with a Series data type.

Extracting a Series from a DataFrame

By name

In [10]:
age = df['age']
age
Out[10]:
0    23
1    34
2    27
3    30
4    27
Name: age, dtype: int64

Using dot notation

In [11]:
age = df.age
age
Out[11]:
0    23
1    34
2    27
3    30
4    27
Name: age, dtype: int64

Using an index

In [12]:
age = df.ix[:,0]
age
Out[12]:
0    23
1    34
2    27
3    30
4    27
Name: age, dtype: int64

Getting information about a series

In [13]:
type(age)
Out[13]:
pandas.core.series.Series
In [14]:
age.index
Out[14]:
RangeIndex(start=0, stop=5, step=1)
In [15]:
age.values
Out[15]:
array([23, 34, 27, 30, 27])
In [16]:
age.unique()
Out[16]:
array([23, 34, 27, 30])

Manipulating a Series

In [17]:
age.sort_values()
Out[17]:
0    23
2    27
4    27
3    30
1    34
Name: age, dtype: int64
In [18]:
age.nsmallest(3)
Out[18]:
0    23
2    27
4    27
Name: age, dtype: int64

Working with Series containing strings

In [19]:
name = df.name
name.head(3)
Out[19]:
0        ann
1        bob
2    charles
Name: name, dtype: object
In [20]:
name.str.title()
Out[20]:
0        Ann
1        Bob
2    Charles
3      David
4       Ella
Name: name, dtype: object
In [21]:
name.str[1:3]
Out[21]:
0    nn
1    ob
2    ha
3    av
4    ll
Name: name, dtype: object
In [22]:
name.replace({'ann': 'angelina jolie', 'bob': 'brad pitt'})
Out[22]:
0    angelina jolie
1         brad pitt
2           charles
3             david
4              ella
Name: name, dtype: object

Working with Series containing categorical data

In [23]:
df.sex
Out[23]:
0    F
1    M
2    M
3    M
4    F
Name: sex, dtype: object
In [24]:
sex = df.sex.astype('category')
sex
Out[24]:
0    F
1    M
2    M
3    M
4    F
Name: sex, dtype: category
Categories (2, object): [F, M]
In [25]:
sex.cat.codes
Out[25]:
0    0
1    1
2    1
3    1
4    0
dtype: int8
In [26]:
sex.cat.categories
Out[26]:
Index(['F', 'M'], dtype='object')

Back to DataFrame

Getting information about a DataFrame

In [27]:
df.shape
Out[27]:
(5, 5)
In [28]:
df.index
Out[28]:
RangeIndex(start=0, stop=5, step=1)
In [29]:
df.columns
Out[29]:
Index(['age', 'group', 'name', 'score', 'sex'], dtype='object')
In [30]:
df.dtypes
Out[30]:
age       int64
group    object
name     object
score     int64
sex      object
dtype: object

Extracting rows

In [31]:
df.head(2)
Out[31]:
age group name score sex
0 23 A ann 89 F
1 34 A bob 90 M
In [32]:
df.tail(2)
Out[32]:
age group name score sex
3 30 B david 98 M
4 27 B ella 90 F
In [33]:
df.sample(2)
Out[33]:
age group name score sex
4 27 B ella 90 F
0 23 A ann 89 F

Extracting columns

In [34]:
df['name']
Out[34]:
0        ann
1        bob
2    charles
3      david
4       ella
Name: name, dtype: object
In [35]:
df.name
Out[35]:
0        ann
1        bob
2    charles
3      david
4       ella
Name: name, dtype: object
In [36]:
df[[0]]
Out[36]:
age
0 23
1 34
2 27
3 30
4 27
In [37]:
df[['age', 'sex']]
Out[37]:
age sex
0 23 F
1 34 M
2 27 M
3 30 M
4 27 F
In [38]:
df[[1,2]]
Out[38]:
group name
0 A ann
1 A bob
2 A charles
3 B david
4 B ella

Using indexing to extract parts of a DataFrame

There are actually 3 separate methods to index a DataFrame - using loc, iloc and ix. We will only show ix because it is the most flexible, but see here for an example if you want to know more. In particular, working with indexes containing only integers can be confusing at first.

In [39]:
df.ix[0,0]
Out[39]:
23
In [40]:
df.ix[1,1]
Out[40]:
'A'
In [41]:
df.ix[1:4, :2]
Out[41]:
age group
1 34 A
2 27 A
3 30 B
4 27 B
In [42]:
df.ix[[1,3,4], ['age', 'name']]
Out[42]:
age name
1 34 bob
3 30 david
4 27 ella

Logical indexing

In [43]:
df[df.age > 25]
Out[43]:
age group name score sex
1 34 A bob 90 M
2 27 A charles 78 M
3 30 B david 98 M
4 27 B ella 90 F
In [44]:
df[(df.age > 25) & (df.sex == 'M')]
Out[44]:
age group name score sex
1 34 A bob 90 M
2 27 A charles 78 M
3 30 B david 98 M

Sorting

In [45]:
df.sort_values('age')
Out[45]:
age group name score sex
0 23 A ann 89 F
2 27 A charles 78 M
4 27 B ella 90 F
3 30 B david 98 M
1 34 A bob 90 M
In [46]:
df.sort_values('age', ascending = False)
Out[46]:
age group name score sex
1 34 A bob 90 M
3 30 B david 98 M
2 27 A charles 78 M
4 27 B ella 90 F
0 23 A ann 89 F
In [47]:
df.sort_values(['age', 'score'], ascending = ['True', 'True'])
Out[47]:
age group name score sex
0 23 A ann 89 F
2 27 A charles 78 M
4 27 B ella 90 F
3 30 B david 98 M
1 34 A bob 90 M
In [48]:
df.sort_values(['age', 'score'], ascending = ['True', 'False'], inplace = True)
df
Out[48]:
age group name score sex
0 23 A ann 89 F
2 27 A charles 78 M
4 27 B ella 90 F
3 30 B david 98 M
1 34 A bob 90 M
In [49]:
df.sort_index()
Out[49]:
age group name score sex
0 23 A ann 89 F
1 34 A bob 90 M
2 27 A charles 78 M
3 30 B david 98 M
4 27 B ella 90 F

Getting summary Statistics

In [50]:
df.mean()
Out[50]:
age      28.2
score    89.0
dtype: float64
In [51]:
df.score.var()
Out[51]:
51.0
In [52]:
df.count()
Out[52]:
age      5
group    5
name     5
score    5
sex      5
dtype: int64
In [53]:
df.describe()
Out[53]:
age score
count 5.000000 5.000000
mean 28.200000 89.000000
std 4.086563 7.141428
min 23.000000 78.000000
25% 27.000000 89.000000
50% 27.000000 90.000000
75% 30.000000 90.000000
max 34.000000 98.000000

Exercises

1. Read in the file at data/dummy.xlsx into a DataFrame called data.

In [3]:
data = pd.read_excel('data/dummy.xlsx')

2. Display just rows 2,3 and 4.

In [5]:
data.iloc[2:5]
Out[5]:
age group name score sex
2 27 A charles 78 M
3 30 B david 98 M
4 27 B ella 90 F

3. Display just the ‘name’ and ‘age’ columns from the DataFrame.

In [8]:
data[['name', 'age']]
Out[8]:
name age
0 ann 23
1 bob 34
2 charles 27
3 david 30
4 ella 27

4. Sort the DataFrame in descending order of age.

In [9]:
data.sort_values('age', ascending=False)
Out[9]:
age group name score sex
1 34 A bob 90 M
3 30 B david 98 M
2 27 A charles 78 M
4 27 B ella 90 F
0 23 A ann 89 F