{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas import DataFrame, Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to DataFrames and Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame\n",
"\n",
"The workshop emphasizes the use of data frames, because most data analysis and plotting libraries are designed to work best with this data structure. The data frame is basically a table with rows and columns, just like a spreadsheet. Generally (and it is good practice), each row contains one **observation** and each column contains one **variable**. If the data frame is organized this way, it is known as a **tidy** data frame and is easy to manipulate and query. Much of data preprocessing involves getting the data into this **tidy** format."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read from CSV"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"iris_1 = pd.read_csv('data/iris.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" Sepal.Length | \n",
" Sepal.Width | \n",
" Petal.Length | \n",
" Petal.Width | \n",
" Species | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_1.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Write to CSV"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"iris_1.to_csv('data/iris_1.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Read from Excel"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"iris_2 = pd.read_excel('data/iris.xlsx')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sepal.Length | \n",
" Sepal.Width | \n",
" Petal.Length | \n",
" Petal.Width | \n",
" Species | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 5.1 | \n",
" 3.5 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 1 | \n",
" 4.9 | \n",
" 3.0 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 2 | \n",
" 4.7 | \n",
" 3.2 | \n",
" 1.3 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 3 | \n",
" 4.6 | \n",
" 3.1 | \n",
" 1.5 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
" | 4 | \n",
" 5.0 | \n",
" 3.6 | \n",
" 1.4 | \n",
" 0.2 | \n",
" setosa | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n",
"0 5.1 3.5 1.4 0.2 setosa\n",
"1 4.9 3.0 1.4 0.2 setosa\n",
"2 4.7 3.2 1.3 0.2 setosa\n",
"3 4.6 3.1 1.5 0.2 setosa\n",
"4 5.0 3.6 1.4 0.2 setosa"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"iris_2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Write to Excel"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"iris_2.to_excel('data/iris_2.xlsx', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Creating a DataFrame from scratch"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"subject = ['ann', 'bob', 'charles', 'david', 'ella']\n",
"sex = ['F', 'M', 'M', 'M', 'F']\n",
"group = ['A', 'A', 'A', 'B', 'B']\n",
"age = [23, 34, 27, 30, 27]\n",
"score = [89, 90, 78, 98, 90]\n",
"df = DataFrame(data = {'name': subject, 'sex': sex, 'age': age, 'group': group, 'score': score})"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"0 23 A ann 89 F\n",
"1 34 A bob 90 M\n",
"2 27 A charles 78 M\n",
"3 30 B david 98 M\n",
"4 27 B ella 90 F"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data about a variable in a single column of the DataFrame is known as a Series. We sill start by learning how to manipulate a single variable, in other words, how to work with a Series data type. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting a Series from a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### By name"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 23\n",
"1 34\n",
"2 27\n",
"3 30\n",
"4 27\n",
"Name: age, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age = df['age']\n",
"age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using dot notation"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 23\n",
"1 34\n",
"2 27\n",
"3 30\n",
"4 27\n",
"Name: age, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age = df.age\n",
"age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Using an index"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 23\n",
"1 34\n",
"2 27\n",
"3 30\n",
"4 27\n",
"Name: age, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age = df.ix[:,0]\n",
"age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting information about a series"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(age)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=5, step=1)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age.index"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([23, 34, 27, 30, 27])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age.values"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([23, 34, 27, 30])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Manipulating a Series"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 23\n",
"2 27\n",
"4 27\n",
"3 30\n",
"1 34\n",
"Name: age, dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age.sort_values()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 23\n",
"2 27\n",
"4 27\n",
"Name: age, dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"age.nsmallest(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with Series containing strings"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 ann\n",
"1 bob\n",
"2 charles\n",
"Name: name, dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"name = df.name\n",
"name.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 Ann\n",
"1 Bob\n",
"2 Charles\n",
"3 David\n",
"4 Ella\n",
"Name: name, dtype: object"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"name.str.title()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 nn\n",
"1 ob\n",
"2 ha\n",
"3 av\n",
"4 ll\n",
"Name: name, dtype: object"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"name.str[1:3]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 angelina jolie\n",
"1 brad pitt\n",
"2 charles\n",
"3 david\n",
"4 ella\n",
"Name: name, dtype: object"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"name.replace({'ann': 'angelina jolie', 'bob': 'brad pitt'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Working with Series containing categorical data"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 F\n",
"1 M\n",
"2 M\n",
"3 M\n",
"4 F\n",
"Name: sex, dtype: object"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sex"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 F\n",
"1 M\n",
"2 M\n",
"3 M\n",
"4 F\n",
"Name: sex, dtype: category\n",
"Categories (2, object): [F, M]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sex = df.sex.astype('category')\n",
"sex"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 1\n",
"3 1\n",
"4 0\n",
"dtype: int8"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sex.cat.codes"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['F', 'M'], dtype='object')"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sex.cat.categories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Back to DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting information about a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(5, 5)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=5, step=1)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.index"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['age', 'group', 'name', 'score', 'sex'], dtype='object')"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"age int64\n",
"group object\n",
"name object\n",
"score int64\n",
"sex object\n",
"dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting rows"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"0 23 A ann 89 F\n",
"1 34 A bob 90 M"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"3 30 B david 98 M\n",
"4 27 B ella 90 F"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.tail(2)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"4 27 B ella 90 F\n",
"0 23 A ann 89 F"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Extracting columns"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 ann\n",
"1 bob\n",
"2 charles\n",
"3 david\n",
"4 ella\n",
"Name: name, dtype: object"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['name']"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 ann\n",
"1 bob\n",
"2 charles\n",
"3 david\n",
"4 ella\n",
"Name: name, dtype: object"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.name"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age\n",
"0 23\n",
"1 34\n",
"2 27\n",
"3 30\n",
"4 27"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[0]]"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" F | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age sex\n",
"0 23 F\n",
"1 34 M\n",
"2 27 M\n",
"3 30 M\n",
"4 27 F"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['age', 'sex']]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" group | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" A | \n",
" ann | \n",
"
\n",
" \n",
" | 1 | \n",
" A | \n",
" bob | \n",
"
\n",
" \n",
" | 2 | \n",
" A | \n",
" charles | \n",
"
\n",
" \n",
" | 3 | \n",
" B | \n",
" david | \n",
"
\n",
" \n",
" | 4 | \n",
" B | \n",
" ella | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" group name\n",
"0 A ann\n",
"1 A bob\n",
"2 A charles\n",
"3 B david\n",
"4 B ella"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[1,2]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using indexing to extract parts of a DataFrame\n",
"\n",
"There are actually 3 separate methods to index a DataFrame - using `loc`, `iloc` and `ix`. We will only show `ix` because it is the most flexible, but see [here](http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation) for an example if you want to know more. In particular, working with indexes containing only integers can be confusing at first."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"23"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.ix[0,0]"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'A'"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.ix[1,1]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group\n",
"1 34 A\n",
"2 27 A\n",
"3 30 B\n",
"4 27 B"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.ix[1:4, :2]"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 34 | \n",
" bob | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" david | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" ella | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age name\n",
"1 34 bob\n",
"3 30 david\n",
"4 27 ella"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.ix[[1,3,4], ['age', 'name']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Logical indexing"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"1 34 A bob 90 M\n",
"2 27 A charles 78 M\n",
"3 30 B david 98 M\n",
"4 27 B ella 90 F"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df.age > 25]"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"1 34 A bob 90 M\n",
"2 27 A charles 78 M\n",
"3 30 B david 98 M"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[(df.age > 25) & (df.sex == 'M')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sorting"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"0 23 A ann 89 F\n",
"2 27 A charles 78 M\n",
"4 27 B ella 90 F\n",
"3 30 B david 98 M\n",
"1 34 A bob 90 M"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values('age')"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"1 34 A bob 90 M\n",
"3 30 B david 98 M\n",
"2 27 A charles 78 M\n",
"4 27 B ella 90 F\n",
"0 23 A ann 89 F"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values('age', ascending = False)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"0 23 A ann 89 F\n",
"2 27 A charles 78 M\n",
"4 27 B ella 90 F\n",
"3 30 B david 98 M\n",
"1 34 A bob 90 M"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(['age', 'score'], ascending = ['True', 'True'])"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"0 23 A ann 89 F\n",
"2 27 A charles 78 M\n",
"4 27 B ella 90 F\n",
"3 30 B david 98 M\n",
"1 34 A bob 90 M"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_values(['age', 'score'], ascending = ['True', 'False'], inplace = True)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"0 23 A ann 89 F\n",
"1 34 A bob 90 M\n",
"2 27 A charles 78 M\n",
"3 30 B david 98 M\n",
"4 27 B ella 90 F"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting summary Statistics"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"age 28.2\n",
"score 89.0\n",
"dtype: float64"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.mean()"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"51.0"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.score.var()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"age 5\n",
"group 5\n",
"name 5\n",
"score 5\n",
"sex 5\n",
"dtype: int64"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.count()"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" | count | \n",
" 5.000000 | \n",
" 5.000000 | \n",
"
\n",
" \n",
" | mean | \n",
" 28.200000 | \n",
" 89.000000 | \n",
"
\n",
" \n",
" | std | \n",
" 4.086563 | \n",
" 7.141428 | \n",
"
\n",
" \n",
" | min | \n",
" 23.000000 | \n",
" 78.000000 | \n",
"
\n",
" \n",
" | 25% | \n",
" 27.000000 | \n",
" 89.000000 | \n",
"
\n",
" \n",
" | 50% | \n",
" 27.000000 | \n",
" 90.000000 | \n",
"
\n",
" \n",
" | 75% | \n",
" 30.000000 | \n",
" 90.000000 | \n",
"
\n",
" \n",
" | max | \n",
" 34.000000 | \n",
" 98.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age score\n",
"count 5.000000 5.000000\n",
"mean 28.200000 89.000000\n",
"std 4.086563 7.141428\n",
"min 23.000000 78.000000\n",
"25% 27.000000 89.000000\n",
"50% 27.000000 90.000000\n",
"75% 30.000000 90.000000\n",
"max 34.000000 98.000000"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1**. Read in the file at `data/dummy.xlsx` into a DataFrame called `data`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = pd.read_excel('data/dummy.xlsx')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2**. Display just rows 2,3 and 4."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"2 27 A charles 78 M\n",
"3 30 B david 98 M\n",
"4 27 B ella 90 F"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.iloc[2:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3**. Display just the 'name' and 'age' columns from the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" age | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" ann | \n",
" 23 | \n",
"
\n",
" \n",
" | 1 | \n",
" bob | \n",
" 34 | \n",
"
\n",
" \n",
" | 2 | \n",
" charles | \n",
" 27 | \n",
"
\n",
" \n",
" | 3 | \n",
" david | \n",
" 30 | \n",
"
\n",
" \n",
" | 4 | \n",
" ella | \n",
" 27 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name age\n",
"0 ann 23\n",
"1 bob 34\n",
"2 charles 27\n",
"3 david 30\n",
"4 ella 27"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[['name', 'age']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4**. Sort the DataFrame in descending order of age."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" group | \n",
" name | \n",
" score | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 34 | \n",
" A | \n",
" bob | \n",
" 90 | \n",
" M | \n",
"
\n",
" \n",
" | 3 | \n",
" 30 | \n",
" B | \n",
" david | \n",
" 98 | \n",
" M | \n",
"
\n",
" \n",
" | 2 | \n",
" 27 | \n",
" A | \n",
" charles | \n",
" 78 | \n",
" M | \n",
"
\n",
" \n",
" | 4 | \n",
" 27 | \n",
" B | \n",
" ella | \n",
" 90 | \n",
" F | \n",
"
\n",
" \n",
" | 0 | \n",
" 23 | \n",
" A | \n",
" ann | \n",
" 89 | \n",
" F | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age group name score sex\n",
"1 34 A bob 90 M\n",
"3 30 B david 98 M\n",
"2 27 A charles 78 M\n",
"4 27 B ella 90 F\n",
"0 23 A ann 89 F"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.sort_values('age', ascending=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"latex_envs": {
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 0
}
},
"nbformat": 4,
"nbformat_minor": 1
}