{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# S02 RDBMS and SQL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more SQL examples in the SQLite3 dialect, seee [SQLite3 tutorial](https://www.techonthenet.com/sqlite/index.php). \n", "\n", "For a deep dive, see [SQL Queries for Mere Mortals](https://www.amazon.com/SQL-Queries-Mere-Mortals-Hands/dp/0134858336/ref=dp_ob_title_bk)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A. More RDBMS concepts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### OLTP and OLAP\n", "\n", "- OLTP\n", " - Normalized schema\n", "- OLAP\n", " - Denormalized schema\n", " - Star\n", " - Facts\n", " - Dimensions\n", " - Snowflake \n", " - Generated from OLTP databases by ETL (Extract-Transform-Load) operations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Types of REBMS\n", " \n", "- Data lake\n", "- Data warehouse\n", "- Data mart \n", " \n", "Data marts typically use a star schema that is customized for the analysis needs. For example, the finance department in a hospital may be most interested in Facts about Claims.\n", " \n", "![img](https://www.researchgate.net/profile/Hugh_Watson3/publication/239823035/figure/fig2/AS:375035886882819@1466426931869/A-Star-Schema-for-Health-Care-courtesy-of-Arthur-Andersen.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Robustness and scaling\n", "\n", "- Replication\n", "- Sharding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B. Basic SQL queries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data we will work with in Part B" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Connected: None@data/faculty.db'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql sqlite:///data/faculty.db" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typenametbl_namerootpagesql
tablepersonperson2CREATE TABLE person (
\t"index" BIGINT,
\tperson_id BIGINT,
\tfirst TEXT,
\tlast TEXT,
\tage BIGINT,
\theight FLOAT,
\tweight BIGINT,
\tcountry_id TEXT,
\tgender_id BIGINT
)
tableconfidentialconfidential18CREATE TABLE confidential (
\t"index" BIGINT,
\tperson_id BIGINT,
\tsalary BIGINT
)
tableperson_languageperson_language33CREATE TABLE person_language (
\t"index" BIGINT,
\tperson_id BIGINT,
\tlanguage_id BIGINT
)
tablelanguagelanguage50CREATE TABLE language (
\t"index" BIGINT,
\tlanguage_id BIGINT,
\tlanguage_name TEXT
)
tablegendergender55CREATE TABLE gender (
\t"index" BIGINT,
\tgender_id BIGINT,
\tgender TEXT
)
tablecountrycountry57CREATE TABLE country (
\t"index" BIGINT,
\tcountry_id TEXT,
\tcountry TEXT,
\tnationality TEXT
)
" ], "text/plain": [ "[('table', 'person', 'person', 2, 'CREATE TABLE person (\\n\\t\"index\" BIGINT, \\n\\tperson_id BIGINT, \\n\\tfirst TEXT, \\n\\tlast TEXT, \\n\\tage BIGINT, \\n\\theight FLOAT, \\n\\tweight BIGINT, \\n\\tcountry_id TEXT, \\n\\tgender_id BIGINT\\n)'),\n", " ('table', 'confidential', 'confidential', 18, 'CREATE TABLE confidential (\\n\\t\"index\" BIGINT, \\n\\tperson_id BIGINT, \\n\\tsalary BIGINT\\n)'),\n", " ('table', 'person_language', 'person_language', 33, 'CREATE TABLE person_language (\\n\\t\"index\" BIGINT, \\n\\tperson_id BIGINT, \\n\\tlanguage_id BIGINT\\n)'),\n", " ('table', 'language', 'language', 50, 'CREATE TABLE language (\\n\\t\"index\" BIGINT, \\n\\tlanguage_id BIGINT, \\n\\tlanguage_name TEXT\\n)'),\n", " ('table', 'gender', 'gender', 55, 'CREATE TABLE gender (\\n\\t\"index\" BIGINT, \\n\\tgender_id BIGINT, \\n\\tgender TEXT\\n)'),\n", " ('table', 'country', 'country', 57, 'CREATE TABLE country (\\n\\t\"index\" BIGINT, \\n\\tcountry_id TEXT, \\n\\tcountry TEXT, \\n\\tnationality TEXT\\n)')]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT * FROM sqlite_master WHERE type='table';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Structure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT DISTINCT value_expression AS alias\n", "FROM tables AS alias\n", "WHERE predicate\n", "ORDER BY value_expression\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Character (Fixed width, variable width)\n", "- National Character (Fixed width, variable width)\n", "- Binary\n", "- Numeric (Exact, Arpproximate)\n", "- Boolean\n", "- DateTime\n", "- Interval\n", "\n", "The SQL standard specifies that character strings and datetime literals are enclosed by single quotes. Two single quotes wihtin a string is intepreted as a literal single quote.\n", "\n", "```sql\n", "'Gilligan''s island'\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The CAST function" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```sql\n", "CAST(X as CHARACTER(10))\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Value expreesion\n", "\n", "- Literal\n", "- Column reference\n", "- Function\n", "- CASES\n", "- (Value expression)\n", "- (SELECT expression)\n", "\n", "which may be prefixed with unary operaors `-` and `+` and combined with binary operators appropriate for the data type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bineary operators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Concatenation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "A || B\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mathematical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "A + B\n", "A - B\n", "A * B\n", "A / B\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Data and time arithmetic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "'2018-08-29' + 3\n", "'11:59' + '00:01'\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
language_name
PHP
Clojure
Dylan
GNU Octave
D
" ], "text/plain": [ "[('PHP',), ('Clojure',), ('Dylan',), ('GNU Octave',), ('D',)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT DISTINCT language_name\n", "FROM language\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT DISTINCT value_expression AS alias\n", "FROM tables AS alias\n", "ORDER BY value_expression\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
language_name
ASP
Assembly
AutoIt
Awk
Bash
" ], "text/plain": [ "[('ASP',), ('Assembly',), ('AutoIt',), ('Awk',), ('Bash',)]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT DISTINCT language_name\n", "FROM language\n", "ORDER BY language_name ASC\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filtering\n", "\n", "For efficiency, place the most stringent filters first." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT DISTINCT value_expression AS alias\n", "FROM tables AS alias\n", "WHERE predicate\n", "ORDER BY value_expression\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predicates for filtering rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Comparison operators (=, <>, <, >, <=, >=)\n", "- BETWEEN start AND end\n", "- IN(A, B, C)\n", "- LIKE\n", "- IS NULL\n", "- REGEX\n", "\n", "Use NOT prefix for negation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Combining predicates\n", "\n", "```sql\n", "AND\n", "OR\n", "```\n", "\n", "USe parenthesis to indicate order of evaluation for compound statements." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastage
AntoineBeard16
AugustineMejia16
BorisMejia16
BrainHaney16
BurlMayo17
" ], "text/plain": [ "[('Antoine', 'Beard', 16),\n", " ('Augustine', 'Mejia', 16),\n", " ('Boris', 'Mejia', 16),\n", " ('Brain', 'Haney', 16),\n", " ('Burl', 'Mayo', 17)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT first, last, age\n", "FROM person\n", "WHERE age BETWEEN 16 AND 17\n", "LIMIT 5;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Joins\n", "\n", "Joins combine data from 1 or more tables to form a new result set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Natural join\n", "\n", "Uses all common columns in Tables 1 and 2 for JOIN\n", "\n", "```SQL\n", "FROM Table1 \n", "NATURAL INNER JOIN Table 2\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Inner join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General form of INNER JOIN uisng ON\n", "\n", "```SQL\n", "FROM Table1 \n", "INNER JOIN Table2\n", "ON Table1.Column = Table2.Column\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there is a common column in both tables\n", "\n", "```SQL\n", "FROM Table1\n", "INNER JOIN Table2\n", "USING Column\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Joining more than two tables\n", "\n", "```SQL\n", "From (Table1 \n", " INNER JOIN Table2\n", " ON Table1.column1 = Table2.Column1)\n", " INNER JOIN Table3 \n", " ON Table3.column2 = Table2.Column2\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Outer join" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General form of OUTER JOIN uisng ON\n", "\n", "```SQL\n", "FROM Table1 \n", "RIGHT OUTER JOIN Table2\n", "ON Table1.Column = Table2.Column\n", "```\n", "\n", "```SQL\n", "FROM Table1 \n", "LEFT OUTER JOIN Table2\n", "ON Table1.Column = Table2.Column\n", "```\n", "\n", "```SQL\n", "FROM Table1 \n", "FULL OUTER JOIN Table2\n", "ON Table1.Column = Table2.Column\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastlanguage_name
AaronAlexanderHaskell
AaronKirbyGNU Octave
AaronKirbyhaXe
AaronKirbyFalcon
AbramAllenTypeScript
AbramBoyerIo
AbramBoyerLua
AbramBoyerFalcon
AdanBrownF#
AdolphDaltonDart
" ], "text/plain": [ "[('Aaron', 'Alexander', 'Haskell'),\n", " ('Aaron', 'Kirby', 'GNU Octave'),\n", " ('Aaron', 'Kirby', 'haXe'),\n", " ('Aaron', 'Kirby', 'Falcon'),\n", " ('Abram', 'Allen', 'TypeScript'),\n", " ('Abram', 'Boyer', 'Io'),\n", " ('Abram', 'Boyer', 'Lua'),\n", " ('Abram', 'Boyer', 'Falcon'),\n", " ('Adan', 'Brown', 'F#'),\n", " ('Adolph', 'Dalton', 'Dart')]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT first, last, language_name \n", "FROM person\n", "INNER JOIN person_language ON person.person_id = person_language.person_id\n", "INNER JOIN language ON language.language_id = person_language.language_id\n", "LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set operations " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT a, b \n", "FROM table1\n", "SetOp\n", "SELECT a, b \n", "FROM table2\n", "```\n", "\n", "wehre SetOp is `INTERSECT`, `EXCEPT`, `UNION` or `UNION ALL`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Intersection\n", "\n", "```sql\n", "INTERSECT\n", "```\n", "\n", "Alternative using `INNER JOIN`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Union" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "UNION\n", "UNION ALL (does not eliminate duplicate rows)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Difference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "EXCEPT\n", "```\n", "\n", "Alternative using `OUTER JOIN` with test for `NULL`" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n", "Done.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "DROP VIEW IF EXISTS language_view;\n", "CREATE VIEW language_view AS\n", "SELECT first, last, language_name \n", "FROM person\n", "INNER JOIN person_language ON person.person_id = person_language.person_id\n", "INNER JOIN language ON language.language_id = person_language.language_id\n", ";" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastlanguage_name
AaronAlexanderHaskell
AaronKirbyGNU Octave
AaronKirbyhaXe
AaronKirbyFalcon
AbramAllenTypeScript
AbramBoyerIo
AbramBoyerLua
AbramBoyerFalcon
AdanBrownF#
AdolphDaltonDart
" ], "text/plain": [ "[('Aaron', 'Alexander', 'Haskell'),\n", " ('Aaron', 'Kirby', 'GNU Octave'),\n", " ('Aaron', 'Kirby', 'haXe'),\n", " ('Aaron', 'Kirby', 'Falcon'),\n", " ('Abram', 'Allen', 'TypeScript'),\n", " ('Abram', 'Boyer', 'Io'),\n", " ('Abram', 'Boyer', 'Lua'),\n", " ('Abram', 'Boyer', 'Falcon'),\n", " ('Adan', 'Brown', 'F#'),\n", " ('Adolph', 'Dalton', 'Dart')]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECt * \n", "FROM language_view \n", "LIMIT 10;" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastlanguage_name
AaronAlexanderHaskell
AndreeDouglasHaskell
ArlieTerrellPython
BoydBlackwellHaskell
BuckHoweHaskell
CarltonRichardHaskell
CarylonZamoraPython
ClarisaRodgersPython
DinorahO'brienHaskell
DorianLloydHaskell
" ], "text/plain": [ "[('Aaron', 'Alexander', 'Haskell'),\n", " ('Andree', 'Douglas', 'Haskell'),\n", " ('Arlie', 'Terrell', 'Python'),\n", " ('Boyd', 'Blackwell', 'Haskell'),\n", " ('Buck', 'Howe', 'Haskell'),\n", " ('Carlton', 'Richard', 'Haskell'),\n", " ('Carylon', 'Zamora', 'Python'),\n", " ('Clarisa', 'Rodgers', 'Python'),\n", " ('Dinorah', \"O'brien\", 'Haskell'),\n", " ('Dorian', 'Lloyd', 'Haskell')]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECt * \n", "FROM language_view \n", "WHERE language_name = 'Python'\n", "UNION\n", "SELECt * \n", "FROM language_view \n", "WHERE language_name = 'Haskell'\n", "LIMIT 10;" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastlanguage_name
AaronAlexanderHaskell
AndreeDouglasHaskell
ArlieTerrellPython
BoydBlackwellHaskell
BuckHoweHaskell
CarltonRichardHaskell
CarylonZamoraPython
ClarisaRodgersPython
DinorahO'brienHaskell
DorianLloydHaskell
" ], "text/plain": [ "[('Aaron', 'Alexander', 'Haskell'),\n", " ('Andree', 'Douglas', 'Haskell'),\n", " ('Arlie', 'Terrell', 'Python'),\n", " ('Boyd', 'Blackwell', 'Haskell'),\n", " ('Buck', 'Howe', 'Haskell'),\n", " ('Carlton', 'Richard', 'Haskell'),\n", " ('Carylon', 'Zamora', 'Python'),\n", " ('Clarisa', 'Rodgers', 'Python'),\n", " ('Dinorah', \"O'brien\", 'Haskell'),\n", " ('Dorian', 'Lloyd', 'Haskell')]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECt * \n", "FROM language_view \n", "WHERE language_name IN ('Python', 'Haskell')\n", "ORDER BY first\n", "LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Subqueries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As column expresions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT a, b, \n", "(\n", " SELECT MAX(c) \n", " FROM table2\n", " INNER JOIN table1\n", " USING column1\n", ") as max_c\n", "FROM table1\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As filters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT a, b, \n", "FROM table1\n", "WHERE b > \n", "(\n", " SELECT AVG(b)\n", " FROM table1\n", ")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Quantified Subqueires\n", "\n", "```SQL\n", "ALl\n", "SOME\n", "ANY\n", "EXISTS\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT a, b, \n", "FROM table1\n", "WHERE EXISTS\n", "(\n", " SELECT c\n", " FROM table2\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastlanguage_name
AaronAlexanderIo
AaronKirbyIo
AbramAllenIo
AbramBoyerIo
AdanBrownIo
AdolphDaltonIo
AdrianBlevinsIo
AgustinFultonIo
AgustinMcdonaldIo
AlbertoDudleyIo
" ], "text/plain": [ "[('Aaron', 'Alexander', 'Io'),\n", " ('Aaron', 'Kirby', 'Io'),\n", " ('Abram', 'Allen', 'Io'),\n", " ('Abram', 'Boyer', 'Io'),\n", " ('Adan', 'Brown', 'Io'),\n", " ('Adolph', 'Dalton', 'Io'),\n", " ('Adrian', 'Blevins', 'Io'),\n", " ('Agustin', 'Fulton', 'Io'),\n", " ('Agustin', 'Mcdonald', 'Io'),\n", " ('Alberto', 'Dudley', 'Io')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT first, last, language_name\n", "FROM person, language\n", "WHERE language_name IN (\n", " SELECT language_name \n", " FROM language_view\n", " WHERe first='Abram' AND last='Boyer'\n", ")\n", "LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregate functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "COUNT\n", "MIN\n", "MAX\n", "AVG\n", "SUM\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
count(language_name)
2297
" ], "text/plain": [ "[(2297,)]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT count(language_name) \n", "FROM language_view;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT a, MIN(b) AS min_b, MAX(b) AS max_b, AVG(b) AS mean_b\n", "FROM table\n", "GROUP BY a\n", "HAVING mean_b > 5\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `HAVING` is analagous to the `WHERE` clause, but filters on aggregate conditions. Note that the `WHERE` statement filters rows BEFORE the grouping is done." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Any variable in the SELECT part that is not an aggregte function needs to be in the GROUP BY part.\n", "\n", "```SQL\n", "SELECT a, b, c, COUNT(d)\n", "FROM table\n", "GROUP BY a, b, c\n", "```" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
language_namen
AutoIt61
Bash48
ECMAScript48
GNU Octave49
JavaScript48
Perl55
PowerShell50
Prolog50
" ], "text/plain": [ "[('AutoIt', 61),\n", " ('Bash', 48),\n", " ('ECMAScript', 48),\n", " ('GNU Octave', 49),\n", " ('JavaScript', 48),\n", " ('Perl', 55),\n", " ('PowerShell', 50),\n", " ('Prolog', 50)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT language_name, count(*) AS n\n", "FROM language_view\n", "GROUP BY language_name\n", "HAVING n > 45;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The CASE switch" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple CASE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT name,\n", "(CASE sex \n", " WHEN 'M' THEN 1.5*dose\n", " WHEN 'F' THEN dose\n", " END) as adjusted_dose\n", "FROM table\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Searched CASE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```SQL\n", "SELECT name,\n", "(CASE \n", " WHEN sex = 'M' THEN 1.5*dose\n", " WHEN sex = 'F' THEN dose\n", " END) as adjusted_dose\n", "FROM table\n", "```" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastlanguage_nameoutcome
AaronAlexanderHaskellHire
AaronKirbyGNU OctaveFIRE
AaronKirbyhaXeHire
AaronKirbyFalconFIRE
AbramAllenTypeScriptFIRE
AbramBoyerIoFIRE
AbramBoyerLuaFIRE
AbramBoyerFalconFIRE
AdanBrownF#FIRE
AdolphDaltonDartFIRE
" ], "text/plain": [ "[('Aaron', 'Alexander', 'Haskell', 'Hire'),\n", " ('Aaron', 'Kirby', 'GNU Octave', 'FIRE'),\n", " ('Aaron', 'Kirby', 'haXe', 'Hire'),\n", " ('Aaron', 'Kirby', 'Falcon', 'FIRE'),\n", " ('Abram', 'Allen', 'TypeScript', 'FIRE'),\n", " ('Abram', 'Boyer', 'Io', 'FIRE'),\n", " ('Abram', 'Boyer', 'Lua', 'FIRE'),\n", " ('Abram', 'Boyer', 'Falcon', 'FIRE'),\n", " ('Adan', 'Brown', 'F#', 'FIRE'),\n", " ('Adolph', 'Dalton', 'Dart', 'FIRE')]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT first, last, language_name,\n", "(CASE\n", " WHEN language_name LIKE 'H%' THEN 'Hire'\n", " ELSE 'FIRE'\n", "END\n", ") AS outcome\n", "FROM language_view\n", "LIMIT 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## C. Window Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the PostgreSQL databsaee because window functions are not supported in SQLite3 [yet](https://www.sqlite.org/draft/releaselog/3_25_0.html)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Connected: cliburn@bios-823'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql postgresql://cliburn@localhost/bios-823" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "from collections import OrderedDict" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "np.random.seed(23)\n", "n = 10\n", "df = pd.DataFrame(\n", " OrderedDict(person=np.random.choice(['A', 'B', 'C', 'D'], n,),\n", " time=np.random.randint(0, 10, n), \n", " bsl=np.random.randint(50, 400, n)))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebsl
8A0115
5A2237
2A3129
7B586
3B6396
4C1107
1C9347
6D589
9D5221
0D798
\n", "
" ], "text/plain": [ " person time bsl\n", "8 A 0 115\n", "5 A 2 237\n", "2 A 3 129\n", "7 B 5 86\n", "3 B 6 396\n", "4 C 1 107\n", "1 C 9 347\n", "6 D 5 89\n", "9 D 5 221\n", "0 D 7 98" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(['person', 'time'])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql DROP TABLE IF EXISTS df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Magic shortcut to creating a database table from `pandas` DataFrame." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Persisted df'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql persist df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Over creates widows" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslrow_number
D7981
C93472
A31293
B63964
C11075
A22376
D5897
B5868
A01159
D522110
" ], "text/plain": [ "[('D', 7, 98, 1),\n", " ('C', 9, 347, 2),\n", " ('A', 3, 129, 3),\n", " ('B', 6, 396, 4),\n", " ('C', 1, 107, 5),\n", " ('A', 2, 237, 6),\n", " ('D', 5, 89, 7),\n", " ('B', 5, 86, 8),\n", " ('A', 0, 115, 9),\n", " ('D', 5, 221, 10)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, row_number() \n", "OVER ()\n", "FROM df;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Order by" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslrow_number
A01151
A22372
A31293
B5864
B63965
C11076
C93477
D52218
D5899
D79810
" ], "text/plain": [ "[('A', 0, 115, 1),\n", " ('A', 2, 237, 2),\n", " ('A', 3, 129, 3),\n", " ('B', 5, 86, 4),\n", " ('B', 6, 396, 5),\n", " ('C', 1, 107, 6),\n", " ('C', 9, 347, 7),\n", " ('D', 5, 221, 8),\n", " ('D', 5, 89, 9),\n", " ('D', 7, 98, 10)]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, row_number() \n", "OVER (ORDER BY person, time)\n", "FROM df;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Partition by" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslrow_number
A01151
A22372
A31293
B5861
B63962
C11071
C93472
D52211
D5892
D7983
" ], "text/plain": [ "[('A', 0, 115, 1),\n", " ('A', 2, 237, 2),\n", " ('A', 3, 129, 3),\n", " ('B', 5, 86, 1),\n", " ('B', 6, 396, 2),\n", " ('C', 1, 107, 1),\n", " ('C', 9, 347, 2),\n", " ('D', 5, 221, 1),\n", " ('D', 5, 89, 2),\n", " ('D', 7, 98, 3)]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, row_number() \n", "OVER (PARTITION BY person ORDER BY time)\n", "FROM df;" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslstring_agg
A0115115
A2237115, 237
A3129115, 237, 129
B58686
B639686, 396
C1107107
C9347107, 347
D5221221, 89
D589221, 89
D798221, 89, 98
" ], "text/plain": [ "[('A', 0, 115, '115'),\n", " ('A', 2, 237, '115, 237'),\n", " ('A', 3, 129, '115, 237, 129'),\n", " ('B', 5, 86, '86'),\n", " ('B', 6, 396, '86, 396'),\n", " ('C', 1, 107, '107'),\n", " ('C', 9, 347, '107, 347'),\n", " ('D', 5, 221, '221, 89'),\n", " ('D', 5, 89, '221, 89'),\n", " ('D', 7, 98, '221, 89, 98')]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, STRING_AGG(CAST(bsl AS TEXT), ', ')\n", "OVER (PARTITION BY person ORDER BY time)\n", "FROM df;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying rows in window" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslstring_agg
A0115115, 237
A2237115, 237, 129
A3129237, 129
B58686, 396
B639686, 396
C1107107, 347
C9347107, 347
D5221221, 89
D589221, 89, 98
D79889, 98
" ], "text/plain": [ "[('A', 0, 115, '115, 237'),\n", " ('A', 2, 237, '115, 237, 129'),\n", " ('A', 3, 129, '237, 129'),\n", " ('B', 5, 86, '86, 396'),\n", " ('B', 6, 396, '86, 396'),\n", " ('C', 1, 107, '107, 347'),\n", " ('C', 9, 347, '107, 347'),\n", " ('D', 5, 221, '221, 89'),\n", " ('D', 5, 89, '221, 89, 98'),\n", " ('D', 7, 98, '89, 98')]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, STRING_AGG(CAST(bsl AS TEXT), ', ')\n", "OVER (\n", " PARTITION BY person\n", " ORDER BY time\n", " ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING\n", " )\n", "FROM df;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using window functions" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslrow_numberrankdense_rankpercent_rankcume_dist
A01151110.00.3
A22372110.00.3
A31293110.00.3
B63964420.3333333333333330.5
B5865420.3333333333333330.5
C11076630.5555555555555560.7
C93477630.5555555555555560.7
D52218840.7777777777777781.0
D5899840.7777777777777781.0
D79810840.7777777777777781.0
" ], "text/plain": [ "[('A', 0, 115, 1, 1, 1, 0.0, 0.3),\n", " ('A', 2, 237, 2, 1, 1, 0.0, 0.3),\n", " ('A', 3, 129, 3, 1, 1, 0.0, 0.3),\n", " ('B', 6, 396, 4, 4, 2, 0.333333333333333, 0.5),\n", " ('B', 5, 86, 5, 4, 2, 0.333333333333333, 0.5),\n", " ('C', 1, 107, 6, 6, 3, 0.555555555555556, 0.7),\n", " ('C', 9, 347, 7, 6, 3, 0.555555555555556, 0.7),\n", " ('D', 5, 221, 8, 8, 4, 0.777777777777778, 1.0),\n", " ('D', 5, 89, 9, 8, 4, 0.777777777777778, 1.0),\n", " ('D', 7, 98, 10, 8, 4, 0.777777777777778, 1.0)]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, \n", " row_number() OVER win AS row_number,\n", " rank() OVER win AS rank,\n", " dense_rank() OVER win AS dense_rank,\n", " percent_rank() OVER win AS percent_rank,\n", " cume_dist() OVER win AS cume_dist\n", "FROM df\n", "WINDOW win AS (ORDER BY person);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using aggregate functions" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslbsl_sumbsl_avgbsl_minbsl_maxbsl_startbsl_end
A0115115115.0000000000000000115115115115
A2237352176.0000000000000000115237115237
A3129481160.3333333333333333115237115129
B5868686.000000000000000086868686
B6396482241.00000000000000008639686396
C1107107107.0000000000000000107107107107
C9347454227.0000000000000000107347107347
D5221310155.00000000000000008922122189
D589310155.00000000000000008922122189
D798408136.00000000000000008922122198
" ], "text/plain": [ "[('A', 0, 115, Decimal('115'), Decimal('115.0000000000000000'), 115, 115, 115, 115),\n", " ('A', 2, 237, Decimal('352'), Decimal('176.0000000000000000'), 115, 237, 115, 237),\n", " ('A', 3, 129, Decimal('481'), Decimal('160.3333333333333333'), 115, 237, 115, 129),\n", " ('B', 5, 86, Decimal('86'), Decimal('86.0000000000000000'), 86, 86, 86, 86),\n", " ('B', 6, 396, Decimal('482'), Decimal('241.0000000000000000'), 86, 396, 86, 396),\n", " ('C', 1, 107, Decimal('107'), Decimal('107.0000000000000000'), 107, 107, 107, 107),\n", " ('C', 9, 347, Decimal('454'), Decimal('227.0000000000000000'), 107, 347, 107, 347),\n", " ('D', 5, 221, Decimal('310'), Decimal('155.0000000000000000'), 89, 221, 221, 89),\n", " ('D', 5, 89, Decimal('310'), Decimal('155.0000000000000000'), 89, 221, 221, 89),\n", " ('D', 7, 98, Decimal('408'), Decimal('136.0000000000000000'), 89, 221, 221, 98)]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl,\n", " SUM(bsl) OVER win AS bsl_sum,\n", " AVG(bsl) OVER win AS bsl_avg,\n", " MIN(bsl) OVER win AS bsl_min,\n", " MAX(bsl) over win as bsl_max,\n", " FIRST_VALUE(bsl) OVER win as bsl_start,\n", " LAST_VALUE(bsl) OVER win as bsl_end\n", "FROM df\n", "WINDOW win AS (PARTITION BY person ORDER BY time);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using rows and range to constrain windows" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslvalsbsl_sumbsl_avg
A0115115, 237352176.0000000000000000
A2237115, 237, 129481160.3333333333333333
A3129237, 129366183.0000000000000000
B58686, 396482241.0000000000000000
B639686, 396482241.0000000000000000
C1107107, 347454227.0000000000000000
C9347107, 347454227.0000000000000000
D5221221, 89310155.0000000000000000
D589221, 89, 98408136.0000000000000000
D79889, 9818793.5000000000000000
" ], "text/plain": [ "[('A', 0, 115, '115, 237', Decimal('352'), Decimal('176.0000000000000000')),\n", " ('A', 2, 237, '115, 237, 129', Decimal('481'), Decimal('160.3333333333333333')),\n", " ('A', 3, 129, '237, 129', Decimal('366'), Decimal('183.0000000000000000')),\n", " ('B', 5, 86, '86, 396', Decimal('482'), Decimal('241.0000000000000000')),\n", " ('B', 6, 396, '86, 396', Decimal('482'), Decimal('241.0000000000000000')),\n", " ('C', 1, 107, '107, 347', Decimal('454'), Decimal('227.0000000000000000')),\n", " ('C', 9, 347, '107, 347', Decimal('454'), Decimal('227.0000000000000000')),\n", " ('D', 5, 221, '221, 89', Decimal('310'), Decimal('155.0000000000000000')),\n", " ('D', 5, 89, '221, 89, 98', Decimal('408'), Decimal('136.0000000000000000')),\n", " ('D', 7, 98, '89, 98', Decimal('187'), Decimal('93.5000000000000000'))]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, \n", " STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,\n", " SUM(bsl) OVER win AS bsl_sum,\n", " AVG(bsl) OVER win AS bsl_avg\n", "FROM df\n", "WINDOW win AS (\n", " PARTITION BY person\n", " ORDER BY time\n", " ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING\n", ")\n", "ORDER BY person, time;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Frames using Rows and Range\n", "\n", "For Range, all rows with the same ORDER BY value are considered peers." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslvalsbsl_sumbsl_avg
A0115115115115.0000000000000000
A2237115, 237352176.0000000000000000
A3129115, 237, 129481160.3333333333333333
B586115, 237, 129, 396, 86963192.6000000000000000
B6396115, 237, 129, 396877219.2500000000000000
C1107115, 237, 129, 396, 86, 1071070178.3333333333333333
C9347115, 237, 129, 396, 86, 107, 3471417202.4285714285714286
D5221115, 237, 129, 396, 86, 107, 347, 2211638204.7500000000000000
D589115, 237, 129, 396, 86, 107, 347, 221, 891727191.8888888888888889
D798115, 237, 129, 396, 86, 107, 347, 221, 89, 981825182.5000000000000000
" ], "text/plain": [ "[('A', 0, 115, '115', Decimal('115'), Decimal('115.0000000000000000')),\n", " ('A', 2, 237, '115, 237', Decimal('352'), Decimal('176.0000000000000000')),\n", " ('A', 3, 129, '115, 237, 129', Decimal('481'), Decimal('160.3333333333333333')),\n", " ('B', 5, 86, '115, 237, 129, 396, 86', Decimal('963'), Decimal('192.6000000000000000')),\n", " ('B', 6, 396, '115, 237, 129, 396', Decimal('877'), Decimal('219.2500000000000000')),\n", " ('C', 1, 107, '115, 237, 129, 396, 86, 107', Decimal('1070'), Decimal('178.3333333333333333')),\n", " ('C', 9, 347, '115, 237, 129, 396, 86, 107, 347', Decimal('1417'), Decimal('202.4285714285714286')),\n", " ('D', 5, 221, '115, 237, 129, 396, 86, 107, 347, 221', Decimal('1638'), Decimal('204.7500000000000000')),\n", " ('D', 5, 89, '115, 237, 129, 396, 86, 107, 347, 221, 89', Decimal('1727'), Decimal('191.8888888888888889')),\n", " ('D', 7, 98, '115, 237, 129, 396, 86, 107, 347, 221, 89, 98', Decimal('1825'), Decimal('182.5000000000000000'))]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, \n", " STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,\n", " SUM(bsl) OVER win AS bsl_sum,\n", " AVG(bsl) OVER win AS bsl_avg\n", "FROM df\n", "WINDOW win AS (\n", " ORDER BY person\n", " ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW\n", ")\n", "ORDER BY person, time;" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 rows affected.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
persontimebslvalsbsl_sumbsl_avg
A0115115, 237, 129481160.3333333333333333
A2237115, 237, 129481160.3333333333333333
A3129115, 237, 129481160.3333333333333333
B586115, 237, 129, 396, 86963192.6000000000000000
B6396115, 237, 129, 396, 86963192.6000000000000000
C1107115, 237, 129, 396, 86, 107, 3471417202.4285714285714286
C9347115, 237, 129, 396, 86, 107, 3471417202.4285714285714286
D5221115, 237, 129, 396, 86, 107, 347, 221, 89, 981825182.5000000000000000
D589115, 237, 129, 396, 86, 107, 347, 221, 89, 981825182.5000000000000000
D798115, 237, 129, 396, 86, 107, 347, 221, 89, 981825182.5000000000000000
" ], "text/plain": [ "[('A', 0, 115, '115, 237, 129', Decimal('481'), Decimal('160.3333333333333333')),\n", " ('A', 2, 237, '115, 237, 129', Decimal('481'), Decimal('160.3333333333333333')),\n", " ('A', 3, 129, '115, 237, 129', Decimal('481'), Decimal('160.3333333333333333')),\n", " ('B', 5, 86, '115, 237, 129, 396, 86', Decimal('963'), Decimal('192.6000000000000000')),\n", " ('B', 6, 396, '115, 237, 129, 396, 86', Decimal('963'), Decimal('192.6000000000000000')),\n", " ('C', 1, 107, '115, 237, 129, 396, 86, 107, 347', Decimal('1417'), Decimal('202.4285714285714286')),\n", " ('C', 9, 347, '115, 237, 129, 396, 86, 107, 347', Decimal('1417'), Decimal('202.4285714285714286')),\n", " ('D', 5, 221, '115, 237, 129, 396, 86, 107, 347, 221, 89, 98', Decimal('1825'), Decimal('182.5000000000000000')),\n", " ('D', 5, 89, '115, 237, 129, 396, 86, 107, 347, 221, 89, 98', Decimal('1825'), Decimal('182.5000000000000000')),\n", " ('D', 7, 98, '115, 237, 129, 396, 86, 107, 347, 221, 89, 98', Decimal('1825'), Decimal('182.5000000000000000'))]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person, time, bsl, \n", " STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,\n", " SUM(bsl) OVER win AS bsl_sum,\n", " AVG(bsl) OVER win AS bsl_avg\n", "FROM df\n", "WINDOW win AS (\n", " ORDER BY person\n", " RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW\n", ")\n", "ORDER BY person, time;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }