In [1]:

import pandas as pd
from pandas import DataFrame, Series

Introduction to DataFrames and Series¶

DataFrame¶

The workshop emphasizes the use of data frames, because most data analysis and plotting libraries are designed to work best with this data structure. The data frame is basically a table with rows and columns, just like a spreadsheet. Generally (and it is good practice), each row contains one observation and each column contains one variable. If the data frame is organized this way, it is known as a tidy data frame and is easy to manipulate and query. Much of data preprocessing involves getting the data into this tidy format.

Read from CSV¶

In [2]:

iris_1 = pd.read_csv('data/iris.csv')

In [3]:

iris_1.head()

Out[3]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Write to CSV¶

In [4]:

iris_1.to_csv('data/iris_1.csv', index=False)

Read from Excel¶

In [5]:

iris_2 = pd.read_excel('data/iris.xlsx')

In [6]:

iris_2.head()

Out[6]:

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Write to Excel¶

In [7]:

iris_2.to_excel('data/iris_2.xlsx', index=False)

Creating a DataFrame from scratch¶

In [8]:

subject = ['ann', 'bob', 'charles', 'david', 'ella']
sex = ['F', 'M', 'M', 'M', 'F']
group = ['A', 'A', 'A', 'B', 'B']
age = [23, 34, 27, 30, 27]
score = [89, 90, 78, 98, 90]
df = DataFrame(data = {'name': subject, 'sex': sex, 'age': age, 'group': group, 'score': score})

In [9]:

df

Out[9]:

	age	group	name	score	sex
0	23	A	ann	89	F
1	34	A	bob	90	M
2	27	A	charles	78	M
3	30	B	david	98	M
4	27	B	ella	90	F

Series¶

The data about a variable in a single column of the DataFrame is known as a Series. We sill start by learning how to manipulate a single variable, in other words, how to work with a Series data type.

Extracting a Series from a DataFrame¶

By name¶

In [10]:

age = df['age']
age

Out[10]:

  23
  34
  27
  30
  27
Name: age, dtype: int64

Using dot notation¶

In [11]:

age = df.age
age

Out[11]:

  23
  34
  27
  30
  27
Name: age, dtype: int64

Using an index¶

In [12]:

age = df.ix[:,0]
age

Out[12]:

  23
  34
  27
  30
  27
Name: age, dtype: int64

Getting information about a series¶

In [13]:

type(age)

Out[13]:

pandas.core.series.Series

In [14]:

age.index

Out[14]:

RangeIndex(start=0, stop=5, step=1)

In [15]:

age.values

Out[15]:

array([23, 34, 27, 30, 27])

In [16]:

age.unique()

Out[16]:

array([23, 34, 27, 30])

Manipulating a Series¶

In [17]:

age.sort_values()

Out[17]:

  23
  27
  27
  30
  34
Name: age, dtype: int64

In [18]:

age.nsmallest(3)

Out[18]:

  23
  27
  27
Name: age, dtype: int64

Working with Series containing strings¶

In [19]:

name = df.name
name.head(3)

Out[19]:

      ann
      bob
  charles
Name: name, dtype: object

In [20]:

name.str.title()

Out[20]:

      Ann
      Bob
  Charles
    David
     Ella
Name: name, dtype: object

In [21]:

name.str[1:3]

Out[21]:

  nn
  ob
  ha
  av
  ll
Name: name, dtype: object

In [22]:

name.replace({'ann': 'angelina jolie', 'bob': 'brad pitt'})

Out[22]:

  angelina jolie
       brad pitt
         charles
           david
            ella
Name: name, dtype: object

Working with Series containing categorical data¶

In [23]:

df.sex

Out[23]:

  F
  M
  M
  M
  F
Name: sex, dtype: object

In [24]:

sex = df.sex.astype('category')
sex

Out[24]:

  F
  M
  M
  M
  F
Name: sex, dtype: category
Categories (2, object): [F, M]

In [25]:

sex.cat.codes

Out[25]:

  0
  1
  1
  1
  0
dtype: int8

In [26]:

sex.cat.categories

Out[26]:

Index(['F', 'M'], dtype='object')

Back to DataFrame¶

Getting information about a DataFrame¶

In [27]:

df.shape

Out[27]:

(5, 5)

In [28]:

df.index

Out[28]:

RangeIndex(start=0, stop=5, step=1)

In [29]:

df.columns

Out[29]:

Index(['age', 'group', 'name', 'score', 'sex'], dtype='object')

In [30]:

df.dtypes

Out[30]:

age       int64
group    object
name     object
score     int64
sex      object
dtype: object

Extracting rows¶

In [31]:

df.head(2)

Out[31]:

	age	group	name	score	sex
0	23	A	ann	89	F
1	34	A	bob	90	M

In [32]:

df.tail(2)

Out[32]:

	age	group	name	score	sex
3	30	B	david	98	M
4	27	B	ella	90	F

In [33]:

df.sample(2)

Out[33]:

	age	group	name	score	sex
4	27	B	ella	90	F
0	23	A	ann	89	F

Extracting columns¶

In [34]:

df['name']

Out[34]:

      ann
      bob
  charles
    david
     ella
Name: name, dtype: object

In [35]:

df.name

Out[35]:

      ann
      bob
  charles
    david
     ella
Name: name, dtype: object

In [36]:

df[[0]]

Out[36]:

	age
0	23
1	34
2	27
3	30
4	27

In [37]:

df[['age', 'sex']]

Out[37]:

	age	sex
0	23	F
1	34	M
2	27	M
3	30	M
4	27	F

In [38]:

df[[1,2]]

Out[38]:

	group	name
0	A	ann
1	A	bob
2	A	charles
3	B	david
4	B	ella

Using indexing to extract parts of a DataFrame¶

There are actually 3 separate methods to index a DataFrame - using loc, iloc and ix. We will only show ix because it is the most flexible, but see here for an example if you want to know more. In particular, working with indexes containing only integers can be confusing at first.

In [39]:

df.ix[0,0]

Out[39]:

In [40]:

df.ix[1,1]

Out[40]:

'A'

In [41]:

df.ix[1:4, :2]

Out[41]:

	age	group
1	34	A
2	27	A
3	30	B
4	27	B

In [42]:

df.ix[[1,3,4], ['age', 'name']]

Out[42]:

	age	name
1	34	bob
3	30	david
4	27	ella

Logical indexing¶

In [43]:

df[df.age > 25]

Out[43]:

	age	group	name	score	sex
1	34	A	bob	90	M
2	27	A	charles	78	M
3	30	B	david	98	M
4	27	B	ella	90	F

In [44]:

df[(df.age > 25) & (df.sex == 'M')]

Out[44]:

	age	group	name	score	sex
1	34	A	bob	90	M
2	27	A	charles	78	M
3	30	B	david	98	M

Sorting¶

In [45]:

df.sort_values('age')

Out[45]:

	age	group	name	score	sex
0	23	A	ann	89	F
2	27	A	charles	78	M
4	27	B	ella	90	F
3	30	B	david	98	M
1	34	A	bob	90	M

In [46]:

df.sort_values('age', ascending = False)

Out[46]:

	age	group	name	score	sex
1	34	A	bob	90	M
3	30	B	david	98	M
2	27	A	charles	78	M
4	27	B	ella	90	F
0	23	A	ann	89	F

In [47]:

df.sort_values(['age', 'score'], ascending = ['True', 'True'])

Out[47]:

	age	group	name	score	sex
0	23	A	ann	89	F
2	27	A	charles	78	M
4	27	B	ella	90	F
3	30	B	david	98	M
1	34	A	bob	90	M

In [48]:

df.sort_values(['age', 'score'], ascending = ['True', 'False'], inplace = True)
df

Out[48]:

	age	group	name	score	sex
0	23	A	ann	89	F
2	27	A	charles	78	M
4	27	B	ella	90	F
3	30	B	david	98	M
1	34	A	bob	90	M

In [49]:

df.sort_index()

Out[49]:

	age	group	name	score	sex
0	23	A	ann	89	F
1	34	A	bob	90	M
2	27	A	charles	78	M
3	30	B	david	98	M
4	27	B	ella	90	F

Getting summary Statistics¶

In [50]:

df.mean()

Out[50]:

age      28.2
score    89.0
dtype: float64

In [51]:

df.score.var()

Out[51]:

51.0

In [52]:

df.count()

Out[52]:

age      5
group    5
name     5
score    5
sex      5
dtype: int64

In [53]:

df.describe()

Out[53]:

	age	score
count	5.000000	5.000000
mean	28.200000	89.000000
std	4.086563	7.141428
min	23.000000	78.000000
25%	27.000000	89.000000
50%	27.000000	90.000000
75%	30.000000	90.000000
max	34.000000	98.000000

Exercises¶

1. Read in the file at data/dummy.xlsx into a DataFrame called data.

In [3]:

data = pd.read_excel('data/dummy.xlsx')

2. Display just rows 2,3 and 4.

In [5]:

data.iloc[2:5]

Out[5]:

	age	group	name	score	sex
2	27	A	charles	78	M
3	30	B	david	98	M
4	27	B	ella	90	F

3. Display just the ‘name’ and ‘age’ columns from the DataFrame.

In [8]:

data[['name', 'age']]

Out[8]:

	name	age
0	ann	23
1	bob	34
2	charles	27
3	david	30
4	ella	27

4. Sort the DataFrame in descending order of age.

In [9]:

data.sort_values('age', ascending=False)

Out[9]:

	age	group	name	score	sex
1	34	A	bob	90	M
3	30	B	david	98	M
2	27	A	charles	78	M
4	27	B	ella	90	F
0	23	A	ann	89	F