{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "from pandas import DataFrame, Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to DataFrames and Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## DataFrame\n", "\n", "The workshop emphasizes the use of data frames, because most data analysis and plotting libraries are designed to work best with this data structure. The data frame is basically a table with rows and columns, just like a spreadsheet. Generally (and it is good practice), each row contains one **observation** and each column contains one **variable**. If the data frame is organized this way, it is known as a **tidy** data frame and is easy to manipulate and query. Much of data preprocessing involves getting the data into this **tidy** format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read from CSV" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris_1 = pd.read_csv('data/iris.csv')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write to CSV" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris_1.to_csv('data/iris_1.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read from Excel" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris_2 = pd.read_excel('data/iris.xlsx')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iris_2.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write to Excel" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "iris_2.to_excel('data/iris_2.xlsx', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a DataFrame from scratch" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "subject = ['ann', 'bob', 'charles', 'david', 'ella']\n", "sex = ['F', 'M', 'M', 'M', 'F']\n", "group = ['A', 'A', 'A', 'B', 'B']\n", "age = [23, 34, 27, 30, 27]\n", "score = [89, 90, 78, 98, 90]\n", "df = DataFrame(data = {'name': subject, 'sex': sex, 'age': age, 'group': group, 'score': score})" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
023Aann89F
134Abob90M
227Acharles78M
330Bdavid98M
427Bella90F
\n", "
" ], "text/plain": [ " age group name score sex\n", "0 23 A ann 89 F\n", "1 34 A bob 90 M\n", "2 27 A charles 78 M\n", "3 30 B david 98 M\n", "4 27 B ella 90 F" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data about a variable in a single column of the DataFrame is known as a Series. We sill start by learning how to manipulate a single variable, in other words, how to work with a Series data type. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting a Series from a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### By name" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 23\n", "1 34\n", "2 27\n", "3 30\n", "4 27\n", "Name: age, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age = df['age']\n", "age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using dot notation" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 23\n", "1 34\n", "2 27\n", "3 30\n", "4 27\n", "Name: age, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age = df.age\n", "age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using an index" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 23\n", "1 34\n", "2 27\n", "3 30\n", "4 27\n", "Name: age, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age = df.ix[:,0]\n", "age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting information about a series" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(age)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=5, step=1)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age.index" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([23, 34, 27, 30, 27])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age.values" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([23, 34, 27, 30])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Manipulating a Series" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 23\n", "2 27\n", "4 27\n", "3 30\n", "1 34\n", "Name: age, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age.sort_values()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 23\n", "2 27\n", "4 27\n", "Name: age, dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age.nsmallest(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Series containing strings" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 ann\n", "1 bob\n", "2 charles\n", "Name: name, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name = df.name\n", "name.head(3)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 Ann\n", "1 Bob\n", "2 Charles\n", "3 David\n", "4 Ella\n", "Name: name, dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name.str.title()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 nn\n", "1 ob\n", "2 ha\n", "3 av\n", "4 ll\n", "Name: name, dtype: object" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name.str[1:3]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 angelina jolie\n", "1 brad pitt\n", "2 charles\n", "3 david\n", "4 ella\n", "Name: name, dtype: object" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name.replace({'ann': 'angelina jolie', 'bob': 'brad pitt'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Working with Series containing categorical data" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 F\n", "1 M\n", "2 M\n", "3 M\n", "4 F\n", "Name: sex, dtype: object" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sex" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 F\n", "1 M\n", "2 M\n", "3 M\n", "4 F\n", "Name: sex, dtype: category\n", "Categories (2, object): [F, M]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sex = df.sex.astype('category')\n", "sex" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 1\n", "4 0\n", "dtype: int8" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sex.cat.codes" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['F', 'M'], dtype='object')" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sex.cat.categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Back to DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting information about a DataFrame" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(5, 5)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=5, step=1)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['age', 'group', 'name', 'score', 'sex'], dtype='object')" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "age int64\n", "group object\n", "name object\n", "score int64\n", "sex object\n", "dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting rows" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
023Aann89F
134Abob90M
\n", "
" ], "text/plain": [ " age group name score sex\n", "0 23 A ann 89 F\n", "1 34 A bob 90 M" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(2)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
330Bdavid98M
427Bella90F
\n", "
" ], "text/plain": [ " age group name score sex\n", "3 30 B david 98 M\n", "4 27 B ella 90 F" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail(2)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
427Bella90F
023Aann89F
\n", "
" ], "text/plain": [ " age group name score sex\n", "4 27 B ella 90 F\n", "0 23 A ann 89 F" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting columns" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 ann\n", "1 bob\n", "2 charles\n", "3 david\n", "4 ella\n", "Name: name, dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['name']" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 ann\n", "1 bob\n", "2 charles\n", "3 david\n", "4 ella\n", "Name: name, dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.name" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age
023
134
227
330
427
\n", "
" ], "text/plain": [ " age\n", "0 23\n", "1 34\n", "2 27\n", "3 30\n", "4 27" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[[0]]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agesex
023F
134M
227M
330M
427F
\n", "
" ], "text/plain": [ " age sex\n", "0 23 F\n", "1 34 M\n", "2 27 M\n", "3 30 M\n", "4 27 F" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['age', 'sex']]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupname
0Aann
1Abob
2Acharles
3Bdavid
4Bella
\n", "
" ], "text/plain": [ " group name\n", "0 A ann\n", "1 A bob\n", "2 A charles\n", "3 B david\n", "4 B ella" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[[1,2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using indexing to extract parts of a DataFrame\n", "\n", "There are actually 3 separate methods to index a DataFrame - using `loc`, `iloc` and `ix`. We will only show `ix` because it is the most flexible, but see [here](http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation) for an example if you want to know more. In particular, working with indexes containing only integers can be confusing at first." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "23" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ix[0,0]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'A'" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ix[1,1]" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroup
134A
227A
330B
427B
\n", "
" ], "text/plain": [ " age group\n", "1 34 A\n", "2 27 A\n", "3 30 B\n", "4 27 B" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ix[1:4, :2]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agename
134bob
330david
427ella
\n", "
" ], "text/plain": [ " age name\n", "1 34 bob\n", "3 30 david\n", "4 27 ella" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ix[[1,3,4], ['age', 'name']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logical indexing" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
134Abob90M
227Acharles78M
330Bdavid98M
427Bella90F
\n", "
" ], "text/plain": [ " age group name score sex\n", "1 34 A bob 90 M\n", "2 27 A charles 78 M\n", "3 30 B david 98 M\n", "4 27 B ella 90 F" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.age > 25]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
134Abob90M
227Acharles78M
330Bdavid98M
\n", "
" ], "text/plain": [ " age group name score sex\n", "1 34 A bob 90 M\n", "2 27 A charles 78 M\n", "3 30 B david 98 M" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.age > 25) & (df.sex == 'M')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
023Aann89F
227Acharles78M
427Bella90F
330Bdavid98M
134Abob90M
\n", "
" ], "text/plain": [ " age group name score sex\n", "0 23 A ann 89 F\n", "2 27 A charles 78 M\n", "4 27 B ella 90 F\n", "3 30 B david 98 M\n", "1 34 A bob 90 M" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('age')" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
134Abob90M
330Bdavid98M
227Acharles78M
427Bella90F
023Aann89F
\n", "
" ], "text/plain": [ " age group name score sex\n", "1 34 A bob 90 M\n", "3 30 B david 98 M\n", "2 27 A charles 78 M\n", "4 27 B ella 90 F\n", "0 23 A ann 89 F" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values('age', ascending = False)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
023Aann89F
227Acharles78M
427Bella90F
330Bdavid98M
134Abob90M
\n", "
" ], "text/plain": [ " age group name score sex\n", "0 23 A ann 89 F\n", "2 27 A charles 78 M\n", "4 27 B ella 90 F\n", "3 30 B david 98 M\n", "1 34 A bob 90 M" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['age', 'score'], ascending = ['True', 'True'])" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
023Aann89F
227Acharles78M
427Bella90F
330Bdavid98M
134Abob90M
\n", "
" ], "text/plain": [ " age group name score sex\n", "0 23 A ann 89 F\n", "2 27 A charles 78 M\n", "4 27 B ella 90 F\n", "3 30 B david 98 M\n", "1 34 A bob 90 M" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['age', 'score'], ascending = ['True', 'False'], inplace = True)\n", "df" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
023Aann89F
134Abob90M
227Acharles78M
330Bdavid98M
427Bella90F
\n", "
" ], "text/plain": [ " age group name score sex\n", "0 23 A ann 89 F\n", "1 34 A bob 90 M\n", "2 27 A charles 78 M\n", "3 30 B david 98 M\n", "4 27 B ella 90 F" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting summary Statistics" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "age 28.2\n", "score 89.0\n", "dtype: float64" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mean()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "51.0" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.score.var()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "age 5\n", "group 5\n", "name 5\n", "score 5\n", "sex 5\n", "dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.count()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agescore
count5.0000005.000000
mean28.20000089.000000
std4.0865637.141428
min23.00000078.000000
25%27.00000089.000000
50%27.00000090.000000
75%30.00000090.000000
max34.00000098.000000
\n", "
" ], "text/plain": [ " age score\n", "count 5.000000 5.000000\n", "mean 28.200000 89.000000\n", "std 4.086563 7.141428\n", "min 23.000000 78.000000\n", "25% 27.000000 89.000000\n", "50% 27.000000 90.000000\n", "75% 30.000000 90.000000\n", "max 34.000000 98.000000" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1**. Read in the file at `data/dummy.xlsx` into a DataFrame called `data`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = pd.read_excel('data/dummy.xlsx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2**. Display just rows 2,3 and 4." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
227Acharles78M
330Bdavid98M
427Bella90F
\n", "
" ], "text/plain": [ " age group name score sex\n", "2 27 A charles 78 M\n", "3 30 B david 98 M\n", "4 27 B ella 90 F" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.iloc[2:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3**. Display just the 'name' and 'age' columns from the DataFrame." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameage
0ann23
1bob34
2charles27
3david30
4ella27
\n", "
" ], "text/plain": [ " name age\n", "0 ann 23\n", "1 bob 34\n", "2 charles 27\n", "3 david 30\n", "4 ella 27" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[['name', 'age']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4**. Sort the DataFrame in descending order of age." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agegroupnamescoresex
134Abob90M
330Bdavid98M
227Acharles78M
427Bella90F
023Aann89F
\n", "
" ], "text/plain": [ " age group name score sex\n", "1 34 A bob 90 M\n", "3 30 B david 98 M\n", "2 27 A charles 78 M\n", "4 27 B ella 90 F\n", "0 23 A ann 89 F" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.sort_values('age', ascending=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 } }, "nbformat": 4, "nbformat_minor": 1 }