{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Part 1\n", "\n", "In Part 1, we'll deal with the basic `pandas` data structures (Series and DataFrame), loading data from a table, spreadsheet or database, and the basics of data manipulation. In Part 2, we'll dive into more data manipulation techniques.\n", "\n", "For those already familiar with R data munging, see how to do the [same thing](https://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html) in `pandas`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from pandas import Series, DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series are similar to `numpy` arrays." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s1 = Series([1,1,2,4,5,8])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 2\n", "3 4\n", "4 5\n", "5 8\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 2, 4, 5, 8])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.values" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 1, 2, 4, 5, 8]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.tolist()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=6, step=1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.index" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 4\n", "4 5\n", "5 8\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1[s1 > 3]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 16\n", "4 25\n", "5 64\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1[s1 > 3] ** 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Conversions" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 2.0\n", "3 4.0\n", "4 5.0\n", "5 8.0\n", "dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.astype('float')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 1\n", "2 2\n", "3 4\n", "4 5\n", "5 8\n", "dtype: category\n", "Categories (5, int64): [1, 2, 4, 5, 8]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Series can also behave like Python dictionaries." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s1.index = ['ann', 'bob', 'charles', 'david', 'esther', 'fred']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ann 1\n", "bob 1\n", "charles 2\n", "david 4\n", "esther 5\n", "fred 8\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1['esther']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s2 = Series(dict(ann=3, charles=5, david=7, fred=9, gloria=11))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can name the Series and its index." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s2.name = 'foo'\n", "s2.index.name = 'name'" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name\n", "ann 3\n", "charles 5\n", "david 7\n", "fred 9\n", "gloria 11\n", "Name: foo, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Automatic alignment on index occurs" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ann 4.0\n", "bob NaN\n", "charles 7.0\n", "david 11.0\n", "esther NaN\n", "fred 17.0\n", "gloria NaN\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1 + s2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be careful when using custom integer indexes." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s3 = Series([1,2,3,4])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "s4 = Series([1,2,3,4], index=range(1, 5))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 NaN\n", "1 3.0\n", "2 5.0\n", "3 7.0\n", "4 NaN\n", "dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3 + s4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conceptually, a DataFrame is a dict of Series sharing the same index values." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = len(s1)\n", "name = ['ann arbor', 'bob blanks', 'charles chin', 'david dumas', 'esther einstein', 'fred foster']\n", "major = ['math', 'math', 'bio', 'stats', 'bio', 'stats']\n", "age = np.random.randint(18, 30, n)\n", "ht = np.random.normal(170, 15, n)\n", "wt = np.random.normal(65, 25, n)\n", "df1 = DataFrame(dict(name=name, age=age, major=major, height=ht, weight=wt), \n", " columns=['name', 'age', 'height', 'weight', 'major'])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageheightweightmajor
0ann arbor26157.64379546.121029math
1bob blanks22160.55859215.930474math
2charles chin27156.66372083.714374bio
3david dumas25183.49066662.801966stats
4esther einstein27179.07564871.689364bio
5fred foster20162.30189684.537286stats
\n", "
" ], "text/plain": [ " name age height weight major\n", "0 ann arbor 26 157.643795 46.121029 math\n", "1 bob blanks 22 160.558592 15.930474 math\n", "2 charles chin 27 156.663720 83.714374 bio\n", "3 david dumas 25 183.490666 62.801966 stats\n", "4 esther einstein 27 179.075648 71.689364 bio\n", "5 fred foster 20 162.301896 84.537286 stats" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sampling from DataFrame" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageheightweightmajor
0ann arbor26157.64379546.121029math
1bob blanks22160.55859215.930474math
2charles chin27156.66372083.714374bio
\n", "
" ], "text/plain": [ " name age height weight major\n", "0 ann arbor 26 157.643795 46.121029 math\n", "1 bob blanks 22 160.558592 15.930474 math\n", "2 charles chin 27 156.663720 83.714374 bio" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.head(3)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageheightweightmajor
3david dumas25183.49066662.801966stats
4esther einstein27179.07564871.689364bio
5fred foster20162.30189684.537286stats
\n", "
" ], "text/plain": [ " name age height weight major\n", "3 david dumas 25 183.490666 62.801966 stats\n", "4 esther einstein 27 179.075648 71.689364 bio\n", "5 fred foster 20 162.301896 84.537286 stats" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.tail(3)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameageheightweightmajor
5fred foster20162.30189684.537286stats
2charles chin27156.66372083.714374bio
4esther einstein27179.07564871.689364bio
\n", "
" ], "text/plain": [ " name age height weight major\n", "5 fred foster 20 162.301896 84.537286 stats\n", "2 charles chin 27 156.663720 83.714374 bio\n", "4 esther einstein 27 179.075648 71.689364 bio" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Slicing a DataFrame\n", "\n", "You can slice by position or by index value." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df2 = DataFrame(np.random.randint(0, 10, (5,4)), index=range(1,6), columns=list('abcd'))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
19372
20947
37833
44264
50693
\n", "
" ], "text/plain": [ " a b c d\n", "1 9 3 7 2\n", "2 0 9 4 7\n", "3 7 8 3 3\n", "4 4 2 6 4\n", "5 0 6 9 3" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
37833
44264
\n", "
" ], "text/plain": [ " a b c d\n", "3 7 8 3 3\n", "4 4 2 6 4" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `loc` if you want to retieve by value rather than position." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
20947
37833
44264
\n", "
" ], "text/plain": [ " a b c d\n", "2 0 9 4 7\n", "3 7 8 3 3\n", "4 4 2 6 4" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `iloc` to use `numpy` style indexing." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
37833
44264
\n", "
" ], "text/plain": [ " a b c d\n", "3 7 8 3 3\n", "4 4 2 6 4" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.iloc[2:4]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
383
426
\n", "
" ], "text/plain": [ " b c\n", "3 8 3\n", "4 2 6" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.iloc[2:4, 1:3]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bc
137
294
383
426
569
\n", "
" ], "text/plain": [ " b c\n", "1 3 7\n", "2 9 4\n", "3 8 3\n", "4 2 6\n", "5 6 9" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.iloc[:, 1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Getting a Series from a DataFrame" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 26\n", "1 22\n", "2 27\n", "3 25\n", "4 27\n", "5 20\n", "Name: age, dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1['age']" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 26\n", "1 22\n", "2 27\n", "3 25\n", "4 27\n", "5 20\n", "Name: age, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting`numpy` arrays from a DataFrame" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([['ann arbor', 26, 157.64379546715057, 46.12102942626177, 'math'],\n", " ['bob blanks', 22, 160.55859183627487, 15.93047432813055, 'math'],\n", " ['charles chin', 27, 156.66371998112533, 83.71437370950592, 'bio'],\n", " ['david dumas', 25, 183.49066607121097, 62.80196617116858, 'stats'],\n", " ['esther einstein', 27, 179.07564818672492, 71.68936436315059, 'bio'],\n", " ['fred foster', 20, 162.30189564187515, 84.53728614200384, 'stats']], dtype=object)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indexes are immutable collections used to store metadata for `pandas` Series and DataFrames. They behave like multisets or bags." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=6, step=1)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.index" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['name', 'age', 'height', 'weight', 'major'], dtype='object')" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## I/O" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting data from R" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext rpy2.ipython" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris = %R iris" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa\n", "2 4.9 3.0 1.4 0.2 setosa\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa\n", "5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have no access to the R magic, you can do this." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa\n", "2 4.9 3.0 1.4 0.2 setosa\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa\n", "5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from rpy2.robjects import r, pandas2ri\n", "pandas2ri.activate()\n", "\n", "iris2 = r['iris']\n", "iris2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading CSV files" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "\n", "base = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master'\n", "path = os.path.join(base, 'tips.csv')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tips = pd.read_csv(path)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head(4)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "titanic = pd.read_csv(os.path.join(base, 'titanic.csv'))" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealone
003male22.0107.2500SThirdmanTrueNaNSouthamptonnoFalse
111female38.01071.2833CFirstwomanFalseCCherbourgyesFalse
213female26.0007.9250SThirdwomanFalseNaNSouthamptonyesTrue
311female35.01053.1000SFirstwomanFalseCSouthamptonyesFalse
\n", "
" ], "text/plain": [ " survived pclass sex age sibsp parch fare embarked class \\\n", "0 0 3 male 22.0 1 0 7.2500 S Third \n", "1 1 1 female 38.0 1 0 71.2833 C First \n", "2 1 3 female 26.0 0 0 7.9250 S Third \n", "3 1 1 female 35.0 1 0 53.1000 S First \n", "\n", " who adult_male deck embark_town alive alone \n", "0 man True NaN Southampton no False \n", "1 woman False C Cherbourg yes False \n", "2 woman False NaN Southampton yes True \n", "3 woman False C Southampton yes False " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titanic.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Excel files" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tips.to_excel('tips.xlsx')" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tips.xlsx\r\n" ] } ], "source": [ "ls *.xlsx" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tips2 = pd.read_excel('tips.xlsx')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips2.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with relational databases" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 181 100 181 0 0 877 0 --:--:-- --:--:-- --:--:-- 878\n", "100 1042k 100 1042k 0 0 1013k 0 0:00:01 0:00:01 --:--:-- 1903k\n" ] } ], "source": [ "! curl -L -O https://github.com/lerocha/chinook-database/raw/master/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chinook_Sqlite.sqlite\r\n" ] } ], "source": [ "ls *sqlite" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Standard access" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sqlalchemy import create_engine" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "engine = 'sqlite:///Chinook_Sqlite.sqlite'" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": true }, "outputs": [], "source": [ "db = create_engine(engine)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Album',\n", " 'Artist',\n", " 'Customer',\n", " 'Employee',\n", " 'Genre',\n", " 'Invoice',\n", " 'InvoiceLine',\n", " 'MediaType',\n", " 'Playlist',\n", " 'PlaylistTrack',\n", " 'Track']" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db.table_names()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": true }, "outputs": [], "source": [ "artist = pd.read_sql('select * from Artist where Name like \"A%\"', db)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ArtistIdName
01AC/DC
12Accept
23Aerosmith
34Alanis Morissette
\n", "
" ], "text/plain": [ " ArtistId Name\n", "0 1 AC/DC\n", "1 2 Accept\n", "2 3 Aerosmith\n", "3 4 Alanis Morissette" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "artist.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ussing SQLmagic" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Connected: None@Chinook_Sqlite.sqlite'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql sqlite:///Chinook_Sqlite.sqlite" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
Album
Artist
Customer
Employee
Genre
Invoice
InvoiceLine
MediaType
Playlist
PlaylistTrack
Track
" ], "text/plain": [ "[('Album',),\n", " ('Artist',),\n", " ('Customer',),\n", " ('Employee',),\n", " ('Genre',),\n", " ('Invoice',),\n", " ('InvoiceLine',),\n", " ('MediaType',),\n", " ('Playlist',),\n", " ('PlaylistTrack',),\n", " ('Track',)]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql SELECT name FROM sqlite_master WHERE type='table'" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] } ], "source": [ "result = %sql select * from Album;" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": true }, "outputs": [], "source": [ "album = result.DataFrame()" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlbumIdTitleArtistId
01For Those About To Rock We Salute You1
12Balls to the Wall2
23Restless and Wild2
34Let There Be Rock1
\n", "
" ], "text/plain": [ " AlbumId Title ArtistId\n", "0 1 For Those About To Rock We Salute You 1\n", "1 2 Balls to the Wall 2\n", "2 3 Restless and Wild 2\n", "3 4 Let There Be Rock 1" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "album.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic data manipulation" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa\n", "2 4.9 3.0 1.4 0.2 setosa\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa\n", "5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.head()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Extracting columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A single value or list of value selects columns." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 setosa\n", "2 setosa\n", "3 setosa\n", "Name: Species, dtype: object" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris['Species'].head(3)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthPetal.Length
15.11.4
24.91.4
34.71.3
\n", "
" ], "text/plain": [ " Sepal.Length Petal.Length\n", "1 5.1 1.4\n", "2 4.9 1.4\n", "3 4.7 1.3" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris[['Sepal.Length', 'Petal.Length']].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use `loc` and `iloc`." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.Width
15.13.5
24.93.0
34.73.2
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width\n", "1 5.1 3.5\n", "2 4.9 3.0\n", "3 4.7 3.2" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.loc[:, :'Sepal.Width'].head(3)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthPetal.LengthSpecies
15.11.4setosa
24.91.4setosa
34.71.3setosa
\n", "
" ], "text/plain": [ " Sepal.Length Petal.Length Species\n", "1 5.1 1.4 setosa\n", "2 4.9 1.4 setosa\n", "3 4.7 1.3 setosa" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[:, [0,2,4]].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A single slice returns rows." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
34.73.21.30.2setosa
44.63.11.50.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use `loc` or `iloc`" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
34.73.21.30.2setosa
44.63.11.50.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.loc[3:4]" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
34.73.21.30.2setosa
44.63.11.50.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[2:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or use Boolean indexing" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1016.33.36.02.5virginica
1025.82.75.11.9virginica
1037.13.05.92.1virginica
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "101 6.3 3.3 6.0 2.5 virginica\n", "102 5.8 2.7 5.1 1.9 virginica\n", "103 7.1 3.0 5.9 2.1 virginica" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris[iris.Species == 'virginica'].head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `select` and `filter`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`select` uses a predicate to select rows or columns." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthSpecies
15.13.5setosa
24.93.0setosa
34.73.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Species\n", "1 5.1 3.5 setosa\n", "2 4.9 3.0 setosa\n", "3 4.7 3.2 setosa" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.select(lambda x: x.startswith('S'), axis=1).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`filter` subsets based on *labels* in an index." ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SpeciesSepal.Width
1setosa3.5
2setosa3.0
3setosa3.2
\n", "
" ], "text/plain": [ " Species Sepal.Width\n", "1 setosa 3.5\n", "2 setosa 3.0\n", "3 setosa 3.2" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.filter(items=['Species', 'Sepal.Width']).head(3)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
55.03.61.40.2setosa
104.93.11.50.1setosa
155.84.01.20.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "5 5.0 3.6 1.4 0.2 setosa\n", "10 4.9 3.1 1.5 0.1 setosa\n", "15 5.8 4.0 1.2 0.2 setosa" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.filter(items=[5,10,15], axis=0).head(3)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.WidthPetal.Width
13.50.2
23.00.2
33.20.2
\n", "
" ], "text/plain": [ " Sepal.Width Petal.Width\n", "1 3.5 0.2\n", "2 3.0 0.2\n", "3 3.2 0.2" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.filter(like='Width').head(3)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
15.13.51.40.2
24.93.01.40.2
34.73.21.30.2
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "1 5.1 3.5 1.4 0.2\n", "2 4.9 3.0 1.4 0.2\n", "3 4.7 3.2 1.3 0.2" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.filter(regex='.+\\..+').head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting on index." ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1505.93.05.11.8virginica
1496.23.45.42.3virginica
1486.53.05.22.0virginica
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "150 5.9 3.0 5.1 1.8 virginica\n", "149 6.2 3.4 5.4 2.3 virginica\n", "148 6.5 3.0 5.2 2.0 virginica" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.sort_index(ascending=False).head(3)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Petal.LengthPetal.WidthSepal.LengthSepal.WidthSpecies
11.40.25.13.5setosa
21.40.24.93.0setosa
31.30.24.73.2setosa
\n", "
" ], "text/plain": [ " Petal.Length Petal.Width Sepal.Length Sepal.Width Species\n", "1 1.4 0.2 5.1 3.5 setosa\n", "2 1.4 0.2 4.9 3.0 setosa\n", "3 1.3 0.2 4.7 3.2 setosa" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.sort_index(axis=1).head(3)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Petal.LengthPetal.WidthSepal.LengthSepal.WidthSpecies
11.40.25.13.5setosa
21.40.24.93.0setosa
31.30.24.73.2setosa
\n", "
" ], "text/plain": [ " Petal.Length Petal.Width Sepal.Length Sepal.Width Species\n", "1 1.4 0.2 5.1 3.5 setosa\n", "2 1.4 0.2 4.9 3.0 setosa\n", "3 1.3 0.2 4.7 3.2 setosa" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.sort_index(axis=1).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sorting on value." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
144.33.01.10.1setosa
434.43.21.30.2setosa
394.43.01.30.2setosa
94.42.91.40.2setosa
424.52.31.30.3setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "14 4.3 3.0 1.1 0.1 setosa\n", "43 4.4 3.2 1.3 0.2 setosa\n", "39 4.4 3.0 1.3 0.2 setosa\n", "9 4.4 2.9 1.4 0.2 setosa\n", "42 4.5 2.3 1.3 0.3 setosa" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.sort_values(by=['Sepal.Length', 'Sepal.Width'], ascending=[True, False]).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Getting ranks." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
133.0126.012.06.0
217.058.013.07.0
310.095.05.08.0
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "1 33.0 126.0 12.0 6.0\n", "2 17.0 58.0 13.0 7.0\n", "3 10.0 95.0 5.0 8.0" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.rank(axis=0, numeric_only=True, method='first').head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chaining\n", "\n", "DataFrame methods can be chained together (like piping)." ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement\n", " \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
1187.73.86.72.2
584.92.43.31.0
705.62.53.91.1
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "118 7.7 3.8 6.7 2.2\n", "58 4.9 2.4 3.3 1.0\n", "70 5.6 2.5 3.9 1.1" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " iris.select(lambda x: x % 2 == 0).\n", " filter(regex='.+\\..+').\n", " sort_values('Sepal.Length').\n", " sample(3)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can add your own function to the chain using `pipe`." ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def f(x, n):\n", " return x**n" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'select' is deprecated and will be removed in a future release. You can use .loc[labels.map(crit)] as a replacement\n", " \n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
9237.219.0021.161.96
12439.697.2924.013.24
7844.899.0025.002.89
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "92 37.21 9.00 21.16 1.96\n", "124 39.69 7.29 24.01 3.24\n", "78 44.89 9.00 25.00 2.89" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " iris.select(lambda x: x % 2 == 0).\n", " filter(regex='.+\\..+').\n", " sort_values('Sepal.Length').\n", " pipe(f, n=2).\n", " sample(3)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transformations" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
15.13.51.40.2
24.93.01.40.2
34.73.21.30.2
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "1 5.1 3.5 1.4 0.2\n", "2 4.9 3.0 1.4 0.2\n", "3 4.7 3.2 1.3 0.2" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[:, :-1].head(3)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
126.0112.251.960.04
224.019.001.960.04
322.0910.241.690.04
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "1 26.01 12.25 1.96 0.04\n", "2 24.01 9.00 1.96 0.04\n", "3 22.09 10.24 1.69 0.04" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(iris.iloc[:, :-1] ** 2).head(3)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris2 = iris.copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating new columns" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesLength
15.13.51.40.2setosa6.5
24.93.01.40.2setosa6.3
34.73.21.30.2setosa6.0
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length\n", "1 5.1 3.5 1.4 0.2 setosa 6.5\n", "2 4.9 3.0 1.4 0.2 setosa 6.3\n", "3 4.7 3.2 1.3 0.2 setosa 6.0" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris2['Length'] = (iris2.filter(like='Length')).sum(axis=1)\n", "iris2.head(3)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesLengthWidth
15.13.51.40.2setosa6.53.7
24.93.01.40.2setosa6.33.2
34.73.21.30.2setosa6.03.4
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species Length Width\n", "1 5.1 3.5 1.4 0.2 setosa 6.5 3.7\n", "2 4.9 3.0 1.4 0.2 setosa 6.3 3.2\n", "3 4.7 3.2 1.3 0.2 setosa 6.0 3.4" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris2.assign(Width=iris2.filter(like='Width').sum(axis=1)).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use `applymap` to perform an element-wise transformation" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
133336
233336
333336
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 3 3 3 3 6\n", "2 3 3 3 3 6\n", "3 3 3 3 3 6" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f = lambda x: len(str(x))\n", "iris.applymap(f).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For Series, there is a `map` function" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 6\n", "2 6\n", "3 6\n", "Name: Species, dtype: int64" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris['Species'].apply(f).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accumulations" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
2106.52.80.4setosasetosa
314.79.74.10.6setosasetosasetosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa\n", "2 10 6.5 2.8 0.4 setosasetosa\n", "3 14.7 9.7 4.1 0.6 setosasetosasetosa" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.cumsum().head(3)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
25.13.51.40.2setosa
35.13.51.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa\n", "2 5.1 3.5 1.4 0.2 setosa\n", "3 5.1 3.5 1.4 0.2 setosa" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.cummax().head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summaries" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
count150.000000150.000000150.000000150.000000
mean5.8433333.0573333.7580001.199333
std0.8280660.4358661.7652980.762238
min4.3000002.0000001.0000000.100000
25%5.1000002.8000001.6000000.300000
50%5.8000003.0000004.3500001.300000
75%6.4000003.3000005.1000001.800000
max7.9000004.4000006.9000002.500000
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "count 150.000000 150.000000 150.000000 150.000000\n", "mean 5.843333 3.057333 3.758000 1.199333\n", "std 0.828066 0.435866 1.765298 0.762238\n", "min 4.300000 2.000000 1.000000 0.100000\n", "25% 5.100000 2.800000 1.600000 0.300000\n", "50% 5.800000 3.000000 4.350000 1.300000\n", "75% 6.400000 3.300000 5.100000 1.800000\n", "max 7.900000 4.400000 6.900000 2.500000" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Many statistical summaries are built-in" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sepal.Length 5.80\n", "Sepal.Width 3.00\n", "Petal.Length 4.35\n", "Petal.Width 1.30\n", "dtype: float64" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.median()" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sepal.Length 7.9\n", "Sepal.Width 4.4\n", "Petal.Length 6.9\n", "Petal.Width 2.5\n", "Species virginica\n", "dtype: object" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.max()" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sepal.Length 5.1\n", "Sepal.Width 2.8\n", "Petal.Length 1.6\n", "Petal.Width 0.3\n", "Name: 0.25, dtype: float64" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.quantile(0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can do this over rows as well, even if it does not make much sense in this example." ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 10.2\n", "2 9.5\n", "3 9.4\n", "dtype: float64" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.sum(axis=1).head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use `apply` for a custom summary" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def cv(x):\n", " \"\"\"Coefficient of variation.\"\"\"\n", " \n", " return x.mean() / x.std()" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Sepal.Length 7.056602\n", "Sepal.Width 7.014384\n", "Petal.Length 2.128819\n", "Petal.Width 1.573438\n", "dtype: float64" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[:, : -1].apply(cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation and covariance\n", "\n", "These work on pairs of values." ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
Sepal.Length1.000000-0.1175700.8717540.817941
Sepal.Width-0.1175701.000000-0.428440-0.366126
Petal.Length0.871754-0.4284401.0000000.962865
Petal.Width0.817941-0.3661260.9628651.000000
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "Sepal.Length 1.000000 -0.117570 0.871754 0.817941\n", "Sepal.Width -0.117570 1.000000 -0.428440 -0.366126\n", "Petal.Length 0.871754 -0.428440 1.000000 0.962865\n", "Petal.Width 0.817941 -0.366126 0.962865 1.000000" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.corr()" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
Sepal.Length0.685694-0.0424341.2743150.516271
Sepal.Width-0.0424340.189979-0.329656-0.121639
Petal.Length1.274315-0.3296563.1162781.295609
Petal.Width0.516271-0.1216391.2956090.581006
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "Sepal.Length 0.685694 -0.042434 1.274315 0.516271\n", "Sepal.Width -0.042434 0.189979 -0.329656 -0.121639\n", "Petal.Length 1.274315 -0.329656 3.116278 1.295609\n", "Petal.Width 0.516271 -0.121639 1.295609 0.581006" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.cov()" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Petal.Length -0.428440\n", "Petal.Width -0.366126\n", "Sepal.Length -0.117570\n", "Sepal.Width 1.000000\n", "dtype: float64" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[:,:-1].corrwith(iris['Sepal.Width']).sort_values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding unique values and count for a Series" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'], dtype=object)" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.Species.unique()" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "setosa 50\n", "versicolor 50\n", "virginica 50\n", "Name: Species, dtype: int64" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.Species.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split-apply-combine\n", "\n", "\"Drawing\"" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": true }, "outputs": [], "source": [ "g = iris.groupby('Species')" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
1016.33.36.02.5virginica
1025.82.75.11.9virginica
1037.13.05.92.1virginica
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "101 6.3 3.3 6.0 2.5 virginica\n", "102 5.8 2.7 5.1 1.9 virginica\n", "103 7.1 3.0 5.9 2.1 virginica" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.get_group('virginica').head(3)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
Species
setosa5.0063.4281.4620.246
versicolor5.9362.7704.2601.326
virginica6.5882.9745.5522.026
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "Species \n", "setosa 5.006 3.428 1.462 0.246\n", "versicolor 5.936 2.770 4.260 1.326\n", "virginica 6.588 2.974 5.552 2.026" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.mean()" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.Length
Species
setosa5.006
versicolor5.936
virginica6.588
\n", "
" ], "text/plain": [ " Sepal.Length\n", "Species \n", "setosa 5.006\n", "versicolor 5.936\n", "virginica 6.588" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g[['Sepal.Length']].mean()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
Species
setosa14.2018349.0433198.4185562.334285
versicolor11.5000628.8273269.0655476.705345
virginica10.3604529.22180210.0598907.376660
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width\n", "Species \n", "setosa 14.201834 9.043319 8.418556 2.334285\n", "versicolor 11.500062 8.827326 9.065547 6.705345\n", "virginica 10.360452 9.221802 10.059890 7.376660" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.apply(cv)" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
minmaxcvminmaxcvminmaxcvminmaxcv
Species
setosa4.35.814.2018342.34.49.0433191.01.98.4185560.10.62.334285
versicolor4.97.011.5000622.03.48.8273263.05.19.0655471.01.86.705345
virginica4.97.910.3604522.23.89.2218024.56.910.0598901.42.57.376660
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width \\\n", " min max cv min max cv \n", "Species \n", "setosa 4.3 5.8 14.201834 2.3 4.4 9.043319 \n", "versicolor 4.9 7.0 11.500062 2.0 3.4 8.827326 \n", "virginica 4.9 7.9 10.360452 2.2 3.8 9.221802 \n", "\n", " Petal.Length Petal.Width \n", " min max cv min max cv \n", "Species \n", "setosa 1.0 1.9 8.418556 0.1 0.6 2.334285 \n", "versicolor 3.0 5.1 9.065547 1.0 1.8 6.705345 \n", "virginica 4.5 6.9 10.059890 1.4 2.5 7.376660 " ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.agg(['min', 'max', cv])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Flexibility of `agg`" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.Width
smallestlargestmeanstdcv
Species
setosa4.35.83.4280.3790649.043319
versicolor4.97.02.7700.3137988.827326
virginica4.97.92.9740.3224979.221802
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width \n", " smallest largest mean std cv\n", "Species \n", "setosa 4.3 5.8 3.428 0.379064 9.043319\n", "versicolor 4.9 7.0 2.770 0.313798 8.827326\n", "virginica 4.9 7.9 2.974 0.322497 9.221802" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g.agg({\n", " 'Sepal.Length': [('smallest', 'min'), ('largest', 'max')],\n", " 'Sepal.Width': ['mean', 'std', cv],\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you need to do something other than an aggregate function, use apply." ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def top_k(df, col, k=3):\n", " return df.sort_values(col)[:k]" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
Species
setosa144.33.01.10.1setosa
94.42.91.40.2setosa
versicolor584.92.43.31.0versicolor
615.02.03.51.0versicolor
virginica1074.92.54.51.7virginica
1225.62.84.92.0virginica
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width \\\n", "Species \n", "setosa 14 4.3 3.0 1.1 0.1 \n", " 9 4.4 2.9 1.4 0.2 \n", "versicolor 58 4.9 2.4 3.3 1.0 \n", " 61 5.0 2.0 3.5 1.0 \n", "virginica 107 4.9 2.5 4.5 1.7 \n", " 122 5.6 2.8 4.9 2.0 \n", "\n", " Species \n", "Species \n", "setosa 14 setosa \n", " 9 setosa \n", "versicolor 58 versicolor \n", " 61 versicolor \n", "virginica 107 virginica \n", " 122 virginica " ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.groupby('Species').apply(top_k, col='Sepal.Length', k=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternaive without group keys." ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
144.33.01.10.1setosa
94.42.91.40.2setosa
584.92.43.31.0versicolor
615.02.03.51.0versicolor
1074.92.54.51.7virginica
1225.62.84.92.0virginica
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "14 4.3 3.0 1.1 0.1 setosa\n", "9 4.4 2.9 1.4 0.2 setosa\n", "58 4.9 2.4 3.3 1.0 versicolor\n", "61 5.0 2.0 3.5 1.0 versicolor\n", "107 4.9 2.5 4.5 1.7 virginica\n", "122 5.6 2.8 4.9 2.0 virginica" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.groupby('Species', group_keys=False).apply(top_k, col='Sepal.Length', k=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Grouping on multiple columns" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2\n", "1 10.34 1.66 Male No Sun Dinner 3\n", "2 21.01 3.50 Male No Sun Dinner 3\n", "3 23.68 3.31 Male No Sun Dinner 2\n", "4 24.59 3.61 Female No Sun Dinner 4" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head()" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_billtipsize
sexsmokertime
FemaleNoDinner20.8242863.3292863.071429
YesDinner16.5400003.5000002.500000
MaleNoDinner20.4032563.1153492.883721
YesDinner26.1413333.5213332.600000
\n", "
" ], "text/plain": [ " total_bill tip size\n", "sex smoker time \n", "Female No Dinner 20.824286 3.329286 3.071429\n", " Yes Dinner 16.540000 3.500000 2.500000\n", "Male No Dinner 20.403256 3.115349 2.883721\n", " Yes Dinner 26.141333 3.521333 2.600000" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time']).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternative without hierarchical index." ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sexsmokertimetotal_billtipsize
0FemaleNoDinner20.8242863.3292863.071429
1FemaleYesDinner16.5400003.5000002.500000
2MaleNoDinner20.4032563.1153492.883721
3MaleYesDinner26.1413333.5213332.600000
\n", "
" ], "text/plain": [ " sex smoker time total_bill tip size\n", "0 Female No Dinner 20.824286 3.329286 3.071429\n", "1 Female Yes Dinner 16.540000 3.500000 2.500000\n", "2 Male No Dinner 20.403256 3.115349 2.883721\n", "3 Male Yes Dinner 26.141333 3.521333 2.600000" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips[tips.day=='Sun'].groupby(['sex', 'smoker', 'time'], as_index=False).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Grouping can be on objects other than an existing column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a dictionary to provide mapping." ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mapping = {\n", " 'Sepal.Length' : 'Sepal', \n", " 'Sepal.Width': 'Sepal', \n", " 'Petal.Length': 'Petal', \n", " 'Petal.Width': 'Petal',\n", " }" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PetalSepal
10.804.30
20.803.95
30.753.95
40.853.85
\n", "
" ], "text/plain": [ " Petal Sepal\n", "1 0.80 4.30\n", "2 0.80 3.95\n", "3 0.75 3.95\n", "4 0.85 3.85" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[:, :-1].groupby(mapping, axis=1).mean().head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a function to provide mapping." ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PetalSepal
10.804.30
20.803.95
30.753.95
40.853.85
\n", "
" ], "text/plain": [ " Petal Sepal\n", "1 0.80 4.30\n", "2 0.80 3.95\n", "3 0.75 3.95\n", "4 0.85 3.85" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.iloc[:, :-1].groupby(lambda x: x.split('.')[0], axis=1).mean().head(4)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }