{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Science: Data processing\n", "\n", "Typical packages: `pandas`, `plotnine`, `plotly`, `streamlit`\n", "\n", "**References**\n", "\n", "- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)\n", "- [Python for Data Analysis, 2nd Edition](https://github.com/wesm/pydata-book)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to `pandas`" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series and Data Frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Series objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `Series` is like a vector. All elements must have the same type or are nulls." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 2.0\n", "3 3.0\n", "4 NaN\n", "dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series([1,1,2,3] + [None])\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Size" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unique Counts" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 2\n", "3.0 1\n", "2.0 1\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Special types of series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Strings" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 the quick\n", "1 quick brown\n", "2 brown fox\n", "3 fox jumps\n", "4 jumps over\n", "5 over the\n", "6 the lazy\n", "7 lazy dog\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words = 'the quick brown fox jumps over the lazy dog'.split()\n", "s1 = pd.Series([' '.join(item) for item in zip(words[:-1], words[1:])])\n", "s1" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 THE QUICK\n", "1 QUICK BROWN\n", "2 BROWN FOX\n", "3 FOX JUMPS\n", "4 JUMPS OVER\n", "5 OVER THE\n", "6 THE LAZY\n", "7 LAZY DOG\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.upper()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [the, quick]\n", "1 [quick, brown]\n", "2 [brown, fox]\n", "3 [fox, jumps]\n", "4 [jumps, over]\n", "5 [over, the]\n", "6 [the, lazy]\n", "7 [lazy, dog]\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.split()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 quick\n", "1 brown\n", "2 fox\n", "3 jumps\n", "4 over\n", "5 the\n", "6 lazy\n", "7 dog\n", "dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s1.str.split().str[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Categories" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Asian\n", "1 Asian\n", "2 White\n", "3 Black\n", "4 White\n", "5 Hispanic\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2 = pd.Series(['Asian', 'Asian', 'White', 'Black', 'White', 'Hispanic'])\n", "s2" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Asian\n", "1 Asian\n", "2 White\n", "3 Black\n", "4 White\n", "5 Hispanic\n", "dtype: category\n", "Categories (4, object): ['Asian', 'Black', 'Hispanic', 'White']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2 = s2.astype('category')\n", "s2" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Asian', 'Black', 'Hispanic', 'White'], dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2.cat.categories" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 3\n", "3 1\n", "4 3\n", "5 2\n", "dtype: int8" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s2.cat.codes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dates and times\n", "\n", "Datetimes are often useful as indices to a time series." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import pendulum" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "d = pendulum.today()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2020-11-11'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.to_date_string()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "k = 18\n", "s3 = pd.Series(range(k), \n", " index=pd.date_range(d.to_date_string(),\n", " periods=k, \n", " freq='M'))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-11-30 0\n", "2020-12-31 1\n", "2021-01-31 2\n", "2021-02-28 3\n", "2021-03-31 4\n", "2021-04-30 5\n", "2021-05-31 6\n", "2021-06-30 7\n", "2021-07-31 8\n", "2021-08-31 9\n", "2021-09-30 10\n", "2021-10-31 11\n", "2021-11-30 12\n", "2021-12-31 13\n", "2022-01-31 14\n", "2022-02-28 15\n", "2022-03-31 16\n", "2022-04-30 17\n", "Freq: M, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-01-31 2\n", "2021-02-28 3\n", "2021-03-31 4\n", "2021-04-30 5\n", "2021-05-31 6\n", "2021-06-30 7\n", "2021-07-31 8\n", "2021-08-31 9\n", "2021-09-30 10\n", "2021-10-31 11\n", "2021-11-30 12\n", "2021-12-31 13\n", "Freq: M, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3['2021']" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2021-01-31 2\n", "2021-02-28 3\n", "2021-03-31 4\n", "2021-04-30 5\n", "2021-05-31 6\n", "2021-06-30 7\n", "Freq: M, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s3['2021-01':'2021-06']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If used as a series, then need `dt` accessor method" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "s4 = s3.index.to_series()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2020-11-30 Monday\n", "2020-12-31 Thursday\n", "2021-01-31 Sunday\n", "2021-02-28 Sunday\n", "2021-03-31 Wednesday\n", "2021-04-30 Friday\n", "2021-05-31 Monday\n", "2021-06-30 Wednesday\n", "2021-07-31 Saturday\n", "2021-08-31 Tuesday\n", "2021-09-30 Thursday\n", "2021-10-31 Sunday\n", "2021-11-30 Tuesday\n", "2021-12-31 Friday\n", "2022-01-31 Monday\n", "2022-02-28 Monday\n", "2022-03-31 Thursday\n", "2022-04-30 Saturday\n", "Freq: M, dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s4.dt.day_name()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame objects" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `DataFrame` is like a matrix. Columns in a `DataFrame` are `Series`.\n", "\n", "- Each column in a DataFrame represents a **variale**\n", "- Each row in a DataFrame represents an **observation**\n", "- Each cell in a DataFrame represents a **value**" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num
01.0
12.0
23.0
3NaN
\n", "
" ], "text/plain": [ " num\n", "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 NaN" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(dict(num=[1,2,3] + [None]))\n", "df" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 NaN\n", "Name: num, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.num" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Index\n", "\n", "Row and column identifiers are of `Index` type.\n", "\n", "Somewhat confusingly, index is also a a synonym for the row identifiers." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=4, step=1)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Setting a column as the row index" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num
01.0
12.0
23.0
3NaN
\n", "
" ], "text/plain": [ " num\n", "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 NaN" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num
1.0
2.0
3.0
NaN
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: []\n", "Index: [1.0, 2.0, 3.0, nan]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df.set_index('num')\n", "df1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Making an index into a column" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num
01.0
12.0
23.0
3NaN
\n", "
" ], "text/plain": [ " num\n", "0 1.0\n", "1 2.0\n", "2 3.0\n", "3 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Columns\n", "\n", "This is just a different index object" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['num'], dtype='object')" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting raw values\n", "\n", "Sometimes you just want a `numpy` array, and not a `pandas` object." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 1.],\n", " [ 2.],\n", " [ 3.],\n", " [nan]])" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Data Frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Manual" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from collections import OrderedDict" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
073088.630097195.9227372020-11-11 19:25:24.062141
111253.958069169.0223812020-11-12 19:25:24.062141
274089.709304180.3276362020-11-13 19:25:24.062141
348552.879337142.8062232020-11-14 19:25:24.062141
490580.488371168.3847922020-11-15 19:25:24.062141
\n", "
" ], "text/plain": [ " pid weight height date\n", "0 730 88.630097 195.922737 2020-11-11 19:25:24.062141\n", "1 112 53.958069 169.022381 2020-11-12 19:25:24.062141\n", "2 740 89.709304 180.327636 2020-11-13 19:25:24.062141\n", "3 485 52.879337 142.806223 2020-11-14 19:25:24.062141\n", "4 905 80.488371 168.384792 2020-11-15 19:25:24.062141" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = 5\n", "dates = pd.date_range(start='now', periods=n, freq='d')\n", "df = pd.DataFrame(OrderedDict(pid=np.random.randint(100, 999, n), \n", " weight=np.random.normal(70, 20, n),\n", " height=np.random.normal(170, 15, n),\n", " date=dates,\n", " ))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From file\n", "\n", "You can read in data from many different file types - plain text, JSON, spreadsheets, databases etc. Functions to read in data look like `read_X` where X is the data type." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Overwriting measures.txt\n" ] } ], "source": [ "%%file measures.txt\n", "pid\tweight\theight\tdate\n", "328\t72.654347\t203.560866\t2018-11-11 14:16:18.148411\n", "756\t34.027679\t189.847316\t2018-11-12 14:16:18.148411\n", "185\t28.501914\t158.646074\t2018-11-13 14:16:18.148411\n", "507\t17.396343\t180.795993\t2018-11-14 14:16:18.148411\n", "919\t64.724301\t173.564725\t2018-11-15 14:16:18.148411" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
032872.654347203.5608662018-11-11 14:16:18.148411
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_table('measures.txt')\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing Data Frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implicit defaults\n", "\n", "if you provide a slice, it is assumed that you are asking for rows." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[1:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you provide a singe value or list, it is assumed that you are asking for columns." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweight
032872.654347
175634.027679
218528.501914
350717.396343
491964.724301
\n", "
" ], "text/plain": [ " pid weight\n", "0 328 72.654347\n", "1 756 34.027679\n", "2 185 28.501914\n", "3 507 17.396343\n", "4 919 64.724301" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['pid', 'weight']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dictionary style access" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 328\n", "1 756\n", "2 185\n", "3 507\n", "4 919\n", "Name: pid, dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['pid']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Property style access\n", "\n", "This only works for column names tat are also valid Python identifier (i.e., no spaces or dashes or keywords)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 328\n", "1 756\n", "2 185\n", "3 507\n", "4 919\n", "Name: pid, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing by location\n", "\n", "This is similar to `numpy` indexing" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1:3, :]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightdate
134.0276792018-11-12 14:16:18.148411
228.5019142018-11-13 14:16:18.148411
\n", "
" ], "text/plain": [ " weight date\n", "1 34.027679 2018-11-12 14:16:18.148411\n", "2 28.501914 2018-11-13 14:16:18.148411" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1:3, 1:4:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing by name" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
134.027679189.847316
228.501914158.646074
317.396343180.795993
\n", "
" ], "text/plain": [ " weight height\n", "1 34.027679 189.847316\n", "2 28.501914 158.646074\n", "3 17.396343 180.795993" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1:3, 'weight':'height']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Warning**: When using `loc`, the row slice indicates row names, not positions." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
132872.654347203.5608662018-11-11 14:16:18.148411
275634.027679189.8473162018-11-12 14:16:18.148411
318528.501914158.6460742018-11-13 14:16:18.148411
450717.396343180.7959932018-11-14 14:16:18.148411
591964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "1 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "2 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "3 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "4 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "5 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = df.copy()\n", "df1.index = df.index + 1\n", "df1" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
172.654347203.560866
234.027679189.847316
328.501914158.646074
\n", "
" ], "text/plain": [ " weight height\n", "1 72.654347 203.560866\n", "2 34.027679 189.847316\n", "3 28.501914 158.646074" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.loc[1:3, 'weight':'height']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structure of a Data Frame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data types" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pid int64\n", "weight float64\n", "height float64\n", "date object\n", "dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting data types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using `astype` on one column" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "df.pid = df.pid.astype('category')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using `astype` on multiple columns" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "df = df.astype(dict(weight=float, height=float))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using a conversion function" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "df.date = pd.to_datetime(df.date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Check" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pid category\n", "weight float64\n", "height float64\n", "date datetime64[ns]\n", "dtype: object" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic properties" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.size" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5, 4)" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
count5.0000005.000000
mean43.460917181.282995
std23.96094516.895933
min17.396343158.646074
25%28.501914173.564725
50%34.027679180.795993
75%64.724301189.847316
max72.654347203.560866
\n", "
" ], "text/plain": [ " weight height\n", "count 5.000000 5.000000\n", "mean 43.460917 181.282995\n", "std 23.960945 16.895933\n", "min 17.396343 158.646074\n", "25% 28.501914 173.564725\n", "50% 34.027679 180.795993\n", "75% 64.724301 189.847316\n", "max 72.654347 203.560866" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5 entries, 0 to 4\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 pid 5 non-null category \n", " 1 weight 5 non-null float64 \n", " 2 height 5 non-null float64 \n", " 3 date 5 non-null datetime64[ns]\n", "dtypes: category(1), datetime64[ns](1), float64(2)\n", "memory usage: 453.0 bytes\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspection" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
032872.654347203.5608662018-11-11 14:16:18.148411
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(n=3)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(n=3)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
491964.724301173.5647252018-11-15 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
032872.654347203.5608662018-11-11 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(n=3)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
491964.724301173.5647252018-11-15 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(frac=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting, Renaming and Removing Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting columns" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
piddate
03282018-11-11 14:16:18.148411
17562018-11-12 14:16:18.148411
21852018-11-13 14:16:18.148411
35072018-11-14 14:16:18.148411
49192018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid date\n", "0 328 2018-11-11 14:16:18.148411\n", "1 756 2018-11-12 14:16:18.148411\n", "2 185 2018-11-13 14:16:18.148411\n", "3 507 2018-11-14 14:16:18.148411\n", "4 919 2018-11-15 14:16:18.148411" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.filter(items=['pid', 'date'])" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
072.654347203.560866
134.027679189.847316
228.501914158.646074
317.396343180.795993
464.724301173.564725
\n", "
" ], "text/plain": [ " weight height\n", "0 72.654347 203.560866\n", "1 34.027679 189.847316\n", "2 28.501914 158.646074\n", "3 17.396343 180.795993\n", "4 64.724301 173.564725" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.filter(regex='.*ght')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note that you can also use regular string methods on the columns" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
piddate
03282018-11-11 14:16:18.148411
17562018-11-12 14:16:18.148411
21852018-11-13 14:16:18.148411
35072018-11-14 14:16:18.148411
49192018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid date\n", "0 328 2018-11-11 14:16:18.148411\n", "1 756 2018-11-12 14:16:18.148411\n", "2 185 2018-11-13 14:16:18.148411\n", "3 507 2018-11-14 14:16:18.148411\n", "4 919 2018-11-15 14:16:18.148411" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:, df.columns.str.contains('d')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming columns" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidwhdate
032872.654347203.5608662018-11-11 14:16:18.148411
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid w h date\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename(dict(weight='w', height='h'), axis=1)" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "orig_cols = df.columns " ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "df.columns = list('abcd')" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcd
032872.654347203.5608662018-11-11 14:16:18.148411
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " a b c d\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "df.columns = orig_cols" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
032872.654347203.5608662018-11-11 14:16:18.148411
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing columns" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
072.654347203.560866
134.027679189.847316
228.501914158.646074
317.396343180.795993
464.724301173.564725
\n", "
" ], "text/plain": [ " weight height\n", "0 72.654347 203.560866\n", "1 34.027679 189.847316\n", "2 28.501914 158.646074\n", "3 17.396343 180.795993\n", "4 64.724301 173.564725" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(['pid', 'date'], axis=1)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
072.654347203.560866
134.027679189.847316
228.501914158.646074
317.396343180.795993
464.724301173.564725
\n", "
" ], "text/plain": [ " weight height\n", "0 72.654347 203.560866\n", "1 34.027679 189.847316\n", "2 28.501914 158.646074\n", "3 17.396343 180.795993\n", "4 64.724301 173.564725" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(columns=['pid', 'date'])" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
072.654347203.560866
134.027679189.847316
228.501914158.646074
317.396343180.795993
464.724301173.564725
\n", "
" ], "text/plain": [ " weight height\n", "0 72.654347 203.560866\n", "1 34.027679 189.847316\n", "2 28.501914 158.646074\n", "3 17.396343 180.795993\n", "4 64.724301 173.564725" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop(columns=df.columns[df.columns.str.contains('d')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting, Renaming and Removing Rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting rows" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.weight.between(60,70)]" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [pid, weight, height, date]\n", "Index: []" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(69 <= df.weight) & (df.weight < 70)]" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.date.between(pd.to_datetime('2018-11-13'), \n", " pd.to_datetime('2018-11-15 23:59:59'))]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Renaming rows" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
a32872.654347203.5608662018-11-11 14:16:18.148411
b75634.027679189.8473162018-11-12 14:16:18.148411
c18528.501914158.6460742018-11-13 14:16:18.148411
d50717.396343180.7959932018-11-14 14:16:18.148411
e91964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "a 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "b 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "c 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "d 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "e 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.rename({i:letter for i,letter in enumerate('abcde')})" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [], "source": [ "df.index = ['the', 'quick', 'brown', 'fox', 'jumphs']" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
the32872.654347203.5608662018-11-11 14:16:18.148411
quick75634.027679189.8473162018-11-12 14:16:18.148411
brown18528.501914158.6460742018-11-13 14:16:18.148411
fox50717.396343180.7959932018-11-14 14:16:18.148411
jumphs91964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "the 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "quick 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "brown 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "fox 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "jumphs 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "df = df.reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
032872.654347203.5608662018-11-11 14:16:18.148411
175634.027679189.8473162018-11-12 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
350717.396343180.7959932018-11-14 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dropping rows" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdate
032872.654347203.5608662018-11-11 14:16:18.148411
218528.501914158.6460742018-11-13 14:16:18.148411
491964.724301173.5647252018-11-15 14:16:18.148411
\n", "
" ], "text/plain": [ " pid weight height date\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop([1,3], axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dropping duplicated data" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "df['something'] = [1,1,None,2,None]" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomething
175634.027679189.8473162018-11-12 14:16:18.1484111.0
491964.724301173.5647252018-11-15 14:16:18.148411NaN
\n", "
" ], "text/plain": [ " pid weight height date something\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df.something.duplicated()]" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomething
032872.654347203.5608662018-11-11 14:16:18.1484111.0
218528.501914158.6460742018-11-13 14:16:18.148411NaN
350717.396343180.7959932018-11-14 14:16:18.1484112.0
\n", "
" ], "text/plain": [ " pid weight height date something\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates(subset='something')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dropping missing data" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomething
032872.654347203.5608662018-11-11 14:16:18.1484111.0
175634.027679189.8473162018-11-12 14:16:18.1484111.0
218528.501914158.6460742018-11-13 14:16:18.148411NaN
350717.396343180.7959932018-11-14 14:16:18.1484112.0
491964.724301173.5647252018-11-15 14:16:18.148411NaN
\n", "
" ], "text/plain": [ " pid weight height date something\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 0.0\n", "3 2.0\n", "4 0.0\n", "Name: something, dtype: float64" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.something.fillna(0)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 1.0\n", "3 2.0\n", "4 2.0\n", "Name: something, dtype: float64" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.something.ffill()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 2.0\n", "3 2.0\n", "4 NaN\n", "Name: something, dtype: float64" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.something.bfill()" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 1.0\n", "1 1.0\n", "2 1.5\n", "3 2.0\n", "4 2.0\n", "Name: something, dtype: float64" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.something.interpolate()" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomething
032872.654347203.5608662018-11-11 14:16:18.1484111.0
175634.027679189.8473162018-11-12 14:16:18.1484111.0
350717.396343180.7959932018-11-14 14:16:18.1484112.0
\n", "
" ], "text/plain": [ " pid weight height date something\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transforming and Creating Columns" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmi
032872.654347203.5608662018-11-11 14:16:18.1484111.017.533678
175634.027679189.8473162018-11-12 14:16:18.1484111.09.441118
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404
350717.396343180.7959932018-11-14 14:16:18.1484112.05.322067
491964.724301173.5647252018-11-15 14:16:18.148411NaN21.485449
\n", "
" ], "text/plain": [ " pid weight height date something bmi\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign(bmi=df['weight'] / (df['height']/100)**2)" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [], "source": [ "df['bmi'] = df['weight'] / (df['height']/100)**2" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmi
032872.654347203.5608662018-11-11 14:16:18.1484111.017.533678
175634.027679189.8473162018-11-12 14:16:18.1484111.09.441118
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404
350717.396343180.7959932018-11-14 14:16:18.1484112.05.322067
491964.724301173.5647252018-11-15 14:16:18.148411NaN21.485449
\n", "
" ], "text/plain": [ " pid weight height date something bmi\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 1.0 17.533678\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 1.0 9.441118\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 2.0 5.322067\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN 21.485449" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "df['something'] = [2,2,None,None,3]" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmi
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449
\n", "
" ], "text/plain": [ " pid weight height date something bmi\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sorting Data Frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort on indexes" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
bmidateheightpidsomethingweight
017.5336782018-11-11 14:16:18.148411203.5608663282.072.654347
19.4411182018-11-12 14:16:18.148411189.8473167562.034.027679
211.3244042018-11-13 14:16:18.148411158.646074185NaN28.501914
35.3220672018-11-14 14:16:18.148411180.795993507NaN17.396343
421.4854492018-11-15 14:16:18.148411173.5647259193.064.724301
\n", "
" ], "text/plain": [ " bmi date height pid something weight\n", "0 17.533678 2018-11-11 14:16:18.148411 203.560866 328 2.0 72.654347\n", "1 9.441118 2018-11-12 14:16:18.148411 189.847316 756 2.0 34.027679\n", "2 11.324404 2018-11-13 14:16:18.148411 158.646074 185 NaN 28.501914\n", "3 5.322067 2018-11-14 14:16:18.148411 180.795993 507 NaN 17.396343\n", "4 21.485449 2018-11-15 14:16:18.148411 173.564725 919 3.0 64.724301" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index(axis=1)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmi
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678
\n", "
" ], "text/plain": [ " pid weight height date something bmi\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index(axis=0, ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sort on values" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmi
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067
\n", "
" ], "text/plain": [ " pid weight height date something bmi\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 17.533678\n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 9.441118\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 21.485449\n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN 11.324404\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN 5.322067" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by=['something', 'bmi'], ascending=[True, False])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply an aggregation function" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheightsomethingbmi
072.654347203.5608662.017.533678
134.027679189.8473162.09.441118
228.501914158.646074NaN11.324404
317.396343180.795993NaN5.322067
464.724301173.5647253.021.485449
\n", "
" ], "text/plain": [ " weight height something bmi\n", "0 72.654347 203.560866 2.0 17.533678\n", "1 34.027679 189.847316 2.0 9.441118\n", "2 28.501914 158.646074 NaN 11.324404\n", "3 17.396343 180.795993 NaN 5.322067\n", "4 64.724301 173.564725 3.0 21.485449" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select_dtypes(include=np.number)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "weight 217.304584\n", "height 906.414974\n", "something 7.000000\n", "bmi 65.106716\n", "dtype: float64" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.select_dtypes(include=np.number).agg(np.sum)" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmi
count5.05.0000005.00000053.0000005.000000
sumNaN217.304584906.414974NaN7.00000065.106716
meanNaN43.460917181.2829952018-11-13 14:16:18.1484108802.33333313.021343
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "count 5.0 5.000000 5.000000 5 3.000000 \n", "sum NaN 217.304584 906.414974 NaN 7.000000 \n", "mean NaN 43.460917 181.282995 2018-11-13 14:16:18.148410880 2.333333 \n", "\n", " bmi \n", "count 5.000000 \n", "sum 65.106716 \n", "mean 13.021343 " ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.agg(['count', np.sum, np.mean])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split-Apply-Combine\n", "\n", "We often want to perform subgroup analysis (conditioning by some discrete or categorical variable). This is done with `groupby` followed by an aggregate function. Conceptually, we split the data frame into separate groups, apply the aggregate function to each group separately, then combine the aggregated results back into a single data frame." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "df['treatment'] = list('ababa')" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatment
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678a
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118b
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404a
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067b
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449a
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "\n", " bmi treatment \n", "0 17.533678 a \n", "1 9.441118 b \n", "2 11.324404 a \n", "3 5.322067 b \n", "4 21.485449 a " ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "grouped = df.groupby('treatment')" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatment
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678a
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404a
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449a
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "\n", " bmi treatment \n", "0 17.533678 a \n", "2 11.324404 a \n", "4 21.485449 a " ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.get_group('a')" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheightsomethingbmi
treatment
a55.293521178.5905552.516.781177
b25.712011185.3216542.07.381592
\n", "
" ], "text/plain": [ " weight height something bmi\n", "treatment \n", "a 55.293521 178.590555 2.5 16.781177\n", "b 25.712011 185.321654 2.0 7.381592" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `agg` with `groupby`" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheightsomethingbmi
treatment
a55.293521178.5905552.516.781177
b25.712011185.3216542.07.381592
\n", "
" ], "text/plain": [ " weight height something bmi\n", "treatment \n", "a 55.293521 178.590555 2.5 16.781177\n", "b 25.712011 185.321654 2.0 7.381592" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.agg('mean')" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheightsomethingbmi
meanstdmeanstdmeanstdmeanstd
treatment
a55.29352123.538565178.59055522.8752892.50.70710716.7811775.122148
b25.71201111.760130185.3216546.4002522.0NaN7.3815922.912608
\n", "
" ], "text/plain": [ " weight height something \\\n", " mean std mean std mean std \n", "treatment \n", "a 55.293521 23.538565 178.590555 22.875289 2.5 0.707107 \n", "b 25.712011 11.760130 185.321654 6.400252 2.0 NaN \n", "\n", " bmi \n", " mean std \n", "treatment \n", "a 16.781177 5.122148 \n", "b 7.381592 2.912608 " ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.agg(['mean', 'std'])" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheightbmi
meanstdminmax<lambda>
treatment
a55.29352123.538565158.646074203.560866897.296525
b25.71201111.760130180.795993189.847316117.459100
\n", "
" ], "text/plain": [ " weight height bmi\n", " mean std min max \n", "treatment \n", "a 55.293521 23.538565 158.646074 203.560866 897.296525\n", "b 25.712011 11.760130 180.795993 189.847316 117.459100" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped.agg({'weight': ['mean', 'std'], 'height': ['min', 'max'], 'bmi': lambda x: (x**2).sum()})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `trasnform` wtih `groupby`" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
055.293521178.590555
125.712011185.321654
255.293521178.590555
325.712011185.321654
455.293521178.590555
\n", "
" ], "text/plain": [ " weight height\n", "0 55.293521 178.590555\n", "1 25.712011 185.321654\n", "2 55.293521 178.590555\n", "3 25.712011 185.321654\n", "4 55.293521 178.590555" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g_mean = grouped[['weight', 'height']].transform(np.mean)\n", "g_mean" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
023.53856522.875289
111.7601306.400252
223.53856522.875289
311.7601306.400252
423.53856522.875289
\n", "
" ], "text/plain": [ " weight height\n", "0 23.538565 22.875289\n", "1 11.760130 6.400252\n", "2 23.538565 22.875289\n", "3 11.760130 6.400252\n", "4 23.538565 22.875289" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g_std = grouped[['weight', 'height']].transform(np.std)\n", "g_std" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
weightheight
00.7375481.091584
10.7071070.707107
2-1.138201-0.871879
3-0.707107-0.707107
40.400652-0.219706
\n", "
" ], "text/plain": [ " weight height\n", "0 0.737548 1.091584\n", "1 0.707107 0.707107\n", "2 -1.138201 -0.871879\n", "3 -0.707107 -0.707107\n", "4 0.400652 -0.219706" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(df[['weight', 'height']] - g_mean)/g_std" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining Data Frames" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatment
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678a
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118b
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404a
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067b
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449a
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "\n", " bmi treatment \n", "0 17.533678 a \n", "1 9.441118 b \n", "2 11.324404 a \n", "3 5.322067 b \n", "4 21.485449 a " ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "df1 = df.iloc[3:].copy()" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatebmitreatment
350717.396343180.7959932018-11-14 14:16:18.1484115.322067b
491964.724301173.5647252018-11-15 14:16:18.14841121.485449a
\n", "
" ], "text/plain": [ " pid weight height date bmi treatment\n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 5.322067 b\n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 21.485449 a" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.drop('something', axis=1, inplace=True)\n", "df1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding rows\n", "\n", "Note that `pandas` aligns by column indexes automatically." ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatment
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678a
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118b
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404a
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067b
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449a
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067b
491964.724301173.5647252018-11-15 14:16:18.148411NaN21.485449a
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN \n", "\n", " bmi treatment \n", "0 17.533678 a \n", "1 9.441118 b \n", "2 11.324404 a \n", "3 5.322067 b \n", "4 21.485449 a \n", "3 5.322067 b \n", "4 21.485449 a " ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.append(df1, sort=False)" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatment
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678a
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118b
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404a
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067b
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449a
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067b
491964.724301173.5647252018-11-15 14:16:18.148411NaN21.485449a
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 NaN \n", "\n", " bmi treatment \n", "0 17.533678 a \n", "1 9.441118 b \n", "2 11.324404 a \n", "3 5.322067 b \n", "4 21.485449 a \n", "3 5.322067 b \n", "4 21.485449 a " ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df, df1], sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding columns" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 328\n", "1 756\n", "2 185\n", "3 507\n", "4 919\n", "Name: pid, dtype: category\n", "Categories (5, int64): [185, 328, 507, 756, 919]" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pid" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(OrderedDict(pid=[649, 533, 400, 600], age=[23,34,45,56]))" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 649\n", "1 533\n", "2 400\n", "3 600\n", "Name: pid, dtype: int64" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.pid" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [], "source": [ "df.pid = df.pid.astype('int')" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatmentage
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [pid, weight, height, date, something, bmi, treatment, age]\n", "Index: []" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df, df2, on='pid', how='inner')" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatmentage
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678aNaN
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118bNaN
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404aNaN
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067bNaN
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449aNaN
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "\n", " bmi treatment age \n", "0 17.533678 a NaN \n", "1 9.441118 b NaN \n", "2 11.324404 a NaN \n", "3 5.322067 b NaN \n", "4 21.485449 a NaN " ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df, df2, on='pid', how='left')" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatmentage
0649NaNNaNNaTNaNNaNNaN23
1533NaNNaNNaTNaNNaNNaN34
2400NaNNaNNaTNaNNaNNaN45
3600NaNNaNNaTNaNNaNNaN56
\n", "
" ], "text/plain": [ " pid weight height date something bmi treatment age\n", "0 649 NaN NaN NaT NaN NaN NaN 23\n", "1 533 NaN NaN NaT NaN NaN NaN 34\n", "2 400 NaN NaN NaT NaN NaN NaN 45\n", "3 600 NaN NaN NaT NaN NaN NaN 56" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df, df2, on='pid', how='right')" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidweightheightdatesomethingbmitreatmentage
032872.654347203.5608662018-11-11 14:16:18.1484112.017.533678aNaN
175634.027679189.8473162018-11-12 14:16:18.1484112.09.441118bNaN
218528.501914158.6460742018-11-13 14:16:18.148411NaN11.324404aNaN
350717.396343180.7959932018-11-14 14:16:18.148411NaN5.322067bNaN
491964.724301173.5647252018-11-15 14:16:18.1484113.021.485449aNaN
5649NaNNaNNaTNaNNaNNaN23.0
6533NaNNaNNaTNaNNaNNaN34.0
7400NaNNaNNaTNaNNaNNaN45.0
8600NaNNaNNaTNaNNaNNaN56.0
\n", "
" ], "text/plain": [ " pid weight height date something \\\n", "0 328 72.654347 203.560866 2018-11-11 14:16:18.148411 2.0 \n", "1 756 34.027679 189.847316 2018-11-12 14:16:18.148411 2.0 \n", "2 185 28.501914 158.646074 2018-11-13 14:16:18.148411 NaN \n", "3 507 17.396343 180.795993 2018-11-14 14:16:18.148411 NaN \n", "4 919 64.724301 173.564725 2018-11-15 14:16:18.148411 3.0 \n", "5 649 NaN NaN NaT NaN \n", "6 533 NaN NaN NaT NaN \n", "7 400 NaN NaN NaT NaN \n", "8 600 NaN NaN NaT NaN \n", "\n", " bmi treatment age \n", "0 17.533678 a NaN \n", "1 9.441118 b NaN \n", "2 11.324404 a NaN \n", "3 5.322067 b NaN \n", "4 21.485449 a NaN \n", "5 NaN NaN 23.0 \n", "6 NaN NaN 34.0 \n", "7 NaN NaN 45.0 \n", "8 NaN NaN 56.0 " ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(df, df2, on='pid', how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merging on the index" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame(dict(x=[1,2,3]), index=list('abc'))\n", "df2 = pd.DataFrame(dict(y=[4,5,6]), index=list('abc'))\n", "df3 = pd.DataFrame(dict(z=[7,8,9]), index=list('abc'))" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
x
a1
b2
c3
\n", "
" ], "text/plain": [ " x\n", "a 1\n", "b 2\n", "c 3" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
y
a4
b5
c6
\n", "
" ], "text/plain": [ " y\n", "a 4\n", "b 5\n", "c 6" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
z
a7
b8
c9
\n", "
" ], "text/plain": [ " z\n", "a 7\n", "b 8\n", "c 9" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df3" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xyz
a147
b258
c369
\n", "
" ], "text/plain": [ " x y z\n", "a 1 4 7\n", "b 2 5 8\n", "c 3 6 9" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.join([df2, df3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fixing common DataFrame issues" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple variables in a column" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pid_treat
0A-1
1B-2
2C-1
3D-2
\n", "
" ], "text/plain": [ " pid_treat\n", "0 A-1\n", "1 B-2\n", "2 C-1\n", "3 D-2" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(dict(pid_treat = ['A-1', 'B-2', 'C-1', 'D-2']))\n", "df" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 [A, 1]\n", "1 [B, 2]\n", "2 [C, 1]\n", "3 [D, 2]\n", "Name: pid_treat, dtype: object" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pid_treat.str.split('-')" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidtreat
0A1
1B2
2C1
3D2
\n", "
" ], "text/plain": [ " pid treat\n", "0 A 1\n", "1 B 2\n", "2 C 1\n", "3 D 2" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.pid_treat.str.split('-').apply(pd.Series, index=['pid', 'treat'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple values in a cell" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidvals
0a(1, 2, 3)
1b(4, 5, 6)
2c(7, 8, 9)
\n", "
" ], "text/plain": [ " pid vals\n", "0 a (1, 2, 3)\n", "1 b (4, 5, 6)\n", "2 c (7, 8, 9)" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(dict(pid=['a', 'b', 'c'], vals = [(1,2,3), (4,5,6), (7,8,9)]))\n", "df" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidvalst1t2t3
0a(1, 2, 3)123
1b(4, 5, 6)456
2c(7, 8, 9)789
\n", "
" ], "text/plain": [ " pid vals t1 t2 t3\n", "0 a (1, 2, 3) 1 2 3\n", "1 b (4, 5, 6) 4 5 6\n", "2 c (7, 8, 9) 7 8 9" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['t1', 't2', 't3']] = df.vals.apply(pd.Series)\n", "df" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "df.drop('vals', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pidvals
0a1
1b4
2c7
3a2
4b5
5c8
6a3
7b6
8c9
\n", "
" ], "text/plain": [ " pid vals\n", "0 a 1\n", "1 b 4\n", "2 c 7\n", "3 a 2\n", "4 b 5\n", "5 c 8\n", "6 a 3\n", "7 b 6\n", "8 c 9" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.melt(df, id_vars='pid', value_name='vals').drop('variable', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reshaping Data Frames\n", "\n", "Sometimes we need to make rows into columns or vice versa." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Converting multiple columns into a single column\n", "\n", "This is often useful if you need to condition on some variable." ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "url = 'https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv'\n", "iris = pd.read_csv(url)" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.head()" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(150, 5)" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.shape" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [], "source": [ "df_iris = pd.melt(iris, id_vars='species')" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciesvariablevalue
463setosapetal_width0.1
210versicolorsepal_width2.0
70versicolorsepal_length5.9
468setosapetal_width0.3
57versicolorsepal_length4.9
98versicolorsepal_length5.1
111virginicasepal_length6.4
385versicolorpetal_length4.5
558virginicapetal_width1.8
469setosapetal_width0.3
\n", "
" ], "text/plain": [ " species variable value\n", "463 setosa petal_width 0.1\n", "210 versicolor sepal_width 2.0\n", "70 versicolor sepal_length 5.9\n", "468 setosa petal_width 0.3\n", "57 versicolor sepal_length 4.9\n", "98 versicolor sepal_length 5.1\n", "111 virginica sepal_length 6.4\n", "385 versicolor petal_length 4.5\n", "558 virginica petal_width 1.8\n", "469 setosa petal_width 0.3" ] }, "execution_count": 137, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_iris.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivoting\n", "\n", "Sometimes we need to convert categorical values in a column into separate columns. This is often done at the same time as performing a summary." ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
speciessetosaversicolorvirginica
variable
petal_length1.4644.2605.552
petal_width0.2441.3262.026
sepal_length5.0065.9366.588
sepal_width3.4182.7702.974
\n", "
" ], "text/plain": [ "species setosa versicolor virginica\n", "variable \n", "petal_length 1.464 4.260 5.552\n", "petal_width 0.244 1.326 2.026\n", "sepal_length 5.006 5.936 6.588\n", "sepal_width 3.418 2.770 2.974" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_iris.pivot_table(index='variable', columns='species', values='value', aggfunc='mean')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functional style - `apply`, `applymap` and `map`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`apply` can be used to apply a custom function" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "scores = pd.DataFrame(\n", " np.around(np.clip(np.random.normal(90, 10, (5,3)), 0, 100), 1),\n", " columns = ['math', 'stat', 'biol'],\n", " index = ['anne', 'bob', 'charles', 'dirk', 'edgar']\n", ")" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mathstatbiol
anne93.499.580.9
bob67.582.994.5
charles93.591.596.4
dirk89.493.384.9
edgar88.392.497.0
\n", "
" ], "text/plain": [ " math stat biol\n", "anne 93.4 99.5 80.9\n", "bob 67.5 82.9 94.5\n", "charles 93.5 91.5 96.4\n", "dirk 89.4 93.3 84.9\n", "edgar 88.3 92.4 97.0" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "def convert_grade_1(score):\n", " return np.where(score > 90, 'A', \n", " np.where(score > 80, 'B',\n", " np.where(score > 70, 'C', 'F')))" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mathstatbiol
anneAAB
bobFBA
charlesAAA
dirkBAB
edgarBAA
\n", "
" ], "text/plain": [ " math stat biol\n", "anne A A B\n", "bob F B A\n", "charles A A A\n", "dirk B A B\n", "edgar B A A" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores.apply(convert_grade_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `np.where` is a little clumsy - here is an alternative." ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "def convert_grade_2(score):\n", " if score.name == 'math': # math professors are mean\n", " return np.choose(\n", " pd.cut(score, [-1, 80, 90, 95, 100], labels=False),\n", " ['F', 'C', 'B', 'A']\n", " ) \n", " else:\n", " return np.choose(\n", " pd.cut(score, [-1, 70, 80, 90, 100], labels=False),\n", " ['F', 'C', 'B', 'A']\n", " )" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mathstatbiol
anneBAB
bobFBA
charlesBAA
dirkCAB
edgarCAA
\n", "
" ], "text/plain": [ " math stat biol\n", "anne B A B\n", "bob F B A\n", "charles B A A\n", "dirk C A B\n", "edgar C A A" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores.apply(convert_grade_2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`apply` can be used to avoid explicit looping" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "def likely_profession(row):\n", " if (row.biol > row.math) and (row.biol > row.stat):\n", " return 'farmer'\n", " elif (row.math > row.biol) and (row.math > row.stat):\n", " return 'high school teacher'\n", " elif (row.stat > row.math) and (row.stat > row.biol):\n", " return 'actuary'\n", " else:\n", " return 'doctor'" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "anne actuary\n", "bob farmer\n", "charles farmer\n", "dirk actuary\n", "edgar farmer\n", "dtype: object" ] }, "execution_count": 146, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scores.apply(likely_profession, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If all else fails, you can loop over `pandas` data frames.\n", "\n", "- Be prepared for pitying looks from more snobbish Python coders\n", "\n", "Loops are frowned upon because they are not efficient, but sometimes pragmatism beats elegance." ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "idx = anne\n", "row = Index(['math', 'stat', 'biol'], dtype='object'): [93.4 99.5 80.9]\n", "------------------------------\n", "idx = bob\n", "row = Index(['math', 'stat', 'biol'], dtype='object'): [67.5 82.9 94.5]\n", "------------------------------\n", "idx = charles\n", "row = Index(['math', 'stat', 'biol'], dtype='object'): [93.5 91.5 96.4]\n", "------------------------------\n", "idx = dirk\n", "row = Index(['math', 'stat', 'biol'], dtype='object'): [89.4 93.3 84.9]\n", "------------------------------\n", "idx = edgar\n", "row = Index(['math', 'stat', 'biol'], dtype='object'): [88.3 92.4 97. ]\n", "------------------------------" ] } ], "source": [ "for idx, row in scores.iterrows():\n", " print(f'\\nidx = {idx}\\nrow = {row.index}: {row.values}\\n', \n", " end='-'*30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`apply` can be used for reductions along margins" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(np.random.randint(0, 10, (4,5)), columns=list('abcde'), index=list('wxyz'))" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
w37037
x95525
y94583
z04918
\n", "
" ], "text/plain": [ " a b c d e\n", "w 3 7 0 3 7\n", "x 9 5 5 2 5\n", "y 9 4 5 8 3\n", "z 0 4 9 1 8" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 21\n", "b 20\n", "c 19\n", "d 14\n", "e 23\n", "dtype: int64" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(sum, axis=0)" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "w 20\n", "x 26\n", "y 29\n", "z 22\n", "dtype: int64" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(sum, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For element-wise mapping operations" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "import string" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [], "source": [ "char_map = {i: c for i,c in enumerate(string.ascii_uppercase)}" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
wDHADH
xJFFCF
yJEFID
zAEJBI
\n", "
" ], "text/plain": [ " a b c d e\n", "w D H A D H\n", "x J F F C F\n", "y J E F I D\n", "z A E J B I" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.applymap(lambda x: char_map[x])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### For mapping a series" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcdeb_map
w37037H
x95525F
y94583E
z04918E
\n", "
" ], "text/plain": [ " a b c d e b_map\n", "w 3 7 0 3 7 H\n", "x 9 5 5 2 5 F\n", "y 9 4 5 8 3 E\n", "z 0 4 9 1 8 E" ] }, "execution_count": 155, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.assign(b_map = df.b.map(char_map))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chaining commands\n", "\n", "Sometimes you see this functional style of method chaining that avoids the need for temporary intermediate variables." ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthboth
meansummeansummeansum
species
setosa5.150.73.535.16.665.7
versicolor5.852.22.824.810.190.9
virginica7.077.03.133.812.7140.2
\n", "
" ], "text/plain": [ " sepal_length sepal_width both \n", " mean sum mean sum mean sum\n", "species \n", "setosa 5.1 50.7 3.5 35.1 6.6 65.7\n", "versicolor 5.8 52.2 2.8 24.8 10.1 90.9\n", "virginica 7.0 77.0 3.1 33.8 12.7 140.2" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " iris.\n", " sample(frac=0.2).\n", " filter(regex='s.*').\n", " assign(both=iris.sepal_length + iris.petal_length).\n", " query('both > 2').\n", " groupby('species').agg(['mean', 'sum']).\n", " pipe(lambda x: np.around(x, 1))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Moving between R and Python in Jupyter" ] }, { "cell_type": "code", "execution_count": 157, "metadata": {}, "outputs": [], "source": [ "%load_ext rpy2.ipython" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter('ignore', FutureWarning)" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, "outputs": [], "source": [ "iris = %R iris" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 setosa\n", "2 4.9 3.0 1.4 0.2 setosa\n", "3 4.7 3.2 1.3 0.2 setosa\n", "4 4.6 3.1 1.5 0.2 setosa\n", "5 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris.head()" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [], "source": [ "iris_py = iris.copy()\n", "iris_py.Species = iris_py.Species.str.upper()" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [], "source": [ "%%R -i iris_py -o iris_r\n", "\n", "iris_r <- iris_py[1:3,]" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2SETOSA
24.93.01.40.2SETOSA
34.73.21.30.2SETOSA
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "1 5.1 3.5 1.4 0.2 SETOSA\n", "2 4.9 3.0 1.4 0.2 SETOSA\n", "3 4.7 3.2 1.3 0.2 SETOSA" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_r" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "! python3 -m pip install --quiet watermark" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "%load_ext watermark" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pendulum 2.1.2\n", "pandas 1.1.1\n", "numpy 1.18.5\n", "CPython 3.8.5\n", "IPython 7.17.0\n" ] } ], "source": [ "%watermark -v -iv" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }