S02 RDBMS and SQL¶
For more SQL examples in the SQLite3 dialect, seee SQLite3 tutorial.
For a deep dive, see SQL Queries for Mere Mortals.
A. More RDBMS concepts¶
OLTP and OLAP¶
- OLTP
- Normalized schema
- OLAP
- Denormalized schema
- Star
- Facts
- Dimensions
- Snowflake
- Generated from OLTP databases by ETL (Extract-Transform-Load) operations
- Star
- Denormalized schema
Types of REBMS¶
- Data lake
- Data warehouse
- Data mart
Data marts typically use a star schema that is customized for the analysis needs. For example, the finance department in a hospital may be most interested in Facts about Claims.

img
Robustness and scaling¶
- Replication
- Sharding
B. Basic SQL queries¶
Data we will work with in Part B¶
In [1]:
%load_ext sql
In [2]:
%sql sqlite:///data/faculty.db
Out[2]:
'Connected: None@data/faculty.db'
In [3]:
%%sql
SELECT * FROM sqlite_master WHERE type='table';
Done.
Out[3]:
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | person | person | 2 | CREATE TABLE person ( "index" BIGINT, person_id BIGINT, first TEXT, last TEXT, age BIGINT, height FLOAT, weight BIGINT, country_id TEXT, gender_id BIGINT ) |
table | confidential | confidential | 18 | CREATE TABLE confidential ( "index" BIGINT, person_id BIGINT, salary BIGINT ) |
table | person_language | person_language | 33 | CREATE TABLE person_language ( "index" BIGINT, person_id BIGINT, language_id BIGINT ) |
table | language | language | 50 | CREATE TABLE language ( "index" BIGINT, language_id BIGINT, language_name TEXT ) |
table | gender | gender | 55 | CREATE TABLE gender ( "index" BIGINT, gender_id BIGINT, gender TEXT ) |
table | country | country | 57 | CREATE TABLE country ( "index" BIGINT, country_id TEXT, country TEXT, nationality TEXT ) |
Basic Structure¶
SELECT DISTINCT value_expression AS alias
FROM tables AS alias
WHERE predicate
ORDER BY value_expression
Types¶
- Character (Fixed width, variable width)
- National Character (Fixed width, variable width)
- Binary
- Numeric (Exact, Arpproximate)
- Boolean
- DateTime
- Interval
The SQL standard specifies that character strings and datetime literals are enclosed by single quotes. Two single quotes wihtin a string is intepreted as a literal single quote.
'Gilligan''s island'
The CAST function¶
CAST(X as CHARACTER(10))
Value expreesion¶
- Literal
- Column reference
- Function
- CASES
- (Value expression)
- (SELECT expression)
which may be prefixed with unary operaors -
and +
and combined
with binary operators appropriate for the data type.
Bineary operators¶
Concatenation¶
A || B
Mathematical¶
A + B
A - B
A * B
A / B
Data and time arithmetic¶
'2018-08-29' + 3
'11:59' + '00:01'
In [4]:
%%sql
SELECT DISTINCT language_name
FROM language
LIMIT 5;
Done.
Out[4]:
language_name |
---|
PHP |
Clojure |
Dylan |
GNU Octave |
D |
Sorting¶
SELECT DISTINCT value_expression AS alias
FROM tables AS alias
ORDER BY value_expression
In [5]:
%%sql
SELECT DISTINCT language_name
FROM language
ORDER BY language_name ASC
LIMIT 5;
Done.
Out[5]:
language_name |
---|
ASP |
Assembly |
AutoIt |
Awk |
Bash |
Filtering¶
For efficiency, place the most stringent filters first.
SELECT DISTINCT value_expression AS alias
FROM tables AS alias
WHERE predicate
ORDER BY value_expression
Predicates for filtering rows¶
- Comparison operators (=, <>, <, >, <=, >=)
- BETWEEN start AND end
- IN(A, B, C)
- LIKE
- IS NULL
- REGEX
Use NOT prefix for negation
Combining predicates¶
AND
OR
USe parenthesis to indicate order of evaluation for compound statements.
In [6]:
%%sql
SELECT first, last, age
FROM person
WHERE age BETWEEN 16 AND 17
LIMIT 5;
Done.
Out[6]:
first | last | age |
---|---|---|
Antoine | Beard | 16 |
Augustine | Mejia | 16 |
Boris | Mejia | 16 |
Brain | Haney | 16 |
Burl | Mayo | 17 |
Joins¶
Joins combine data from 1 or more tables to form a new result set.
Natural join¶
Uses all common columns in Tables 1 and 2 for JOIN
FROM Table1
NATURAL INNER JOIN Table 2
Inner join¶
General form of INNER JOIN uisng ON
FROM Table1
INNER JOIN Table2
ON Table1.Column = Table2.Column
If there is a common column in both tables
FROM Table1
INNER JOIN Table2
USING Column
Joining more than two tables
From (Table1
INNER JOIN Table2
ON Table1.column1 = Table2.Column1)
INNER JOIN Table3
ON Table3.column2 = Table2.Column2
Outer join¶
General form of OUTER JOIN uisng ON
FROM Table1
RIGHT OUTER JOIN Table2
ON Table1.Column = Table2.Column
FROM Table1
LEFT OUTER JOIN Table2
ON Table1.Column = Table2.Column
FROM Table1
FULL OUTER JOIN Table2
ON Table1.Column = Table2.Column
In [7]:
%%sql
SELECT first, last, language_name
FROM person
INNER JOIN person_language ON person.person_id = person_language.person_id
INNER JOIN language ON language.language_id = person_language.language_id
LIMIT 10;
Done.
Out[7]:
first | last | language_name |
---|---|---|
Aaron | Alexander | Haskell |
Aaron | Kirby | GNU Octave |
Aaron | Kirby | haXe |
Aaron | Kirby | Falcon |
Abram | Allen | TypeScript |
Abram | Boyer | Io |
Abram | Boyer | Lua |
Abram | Boyer | Falcon |
Adan | Brown | F# |
Adolph | Dalton | Dart |
Set operations¶
SELECT a, b
FROM table1
SetOp
SELECT a, b
FROM table2
wehre SetOp is INTERSECT
, EXCEPT
, UNION
or UNION ALL
.
Union¶
UNION
UNION ALL (does not eliminate duplicate rows)
Difference¶
EXCEPT
Alternative using OUTER JOIN
with test for NULL
In [8]:
%%sql
DROP VIEW IF EXISTS language_view;
CREATE VIEW language_view AS
SELECT first, last, language_name
FROM person
INNER JOIN person_language ON person.person_id = person_language.person_id
INNER JOIN language ON language.language_id = person_language.language_id
;
Done.
Done.
Out[8]:
[]
In [9]:
%%sql
SELECt *
FROM language_view
LIMIT 10;
Done.
Out[9]:
first | last | language_name |
---|---|---|
Aaron | Alexander | Haskell |
Aaron | Kirby | GNU Octave |
Aaron | Kirby | haXe |
Aaron | Kirby | Falcon |
Abram | Allen | TypeScript |
Abram | Boyer | Io |
Abram | Boyer | Lua |
Abram | Boyer | Falcon |
Adan | Brown | F# |
Adolph | Dalton | Dart |
In [10]:
%%sql
SELECt *
FROM language_view
WHERE language_name = 'Python'
UNION
SELECt *
FROM language_view
WHERE language_name = 'Haskell'
LIMIT 10;
Done.
Out[10]:
first | last | language_name |
---|---|---|
Aaron | Alexander | Haskell |
Andree | Douglas | Haskell |
Arlie | Terrell | Python |
Boyd | Blackwell | Haskell |
Buck | Howe | Haskell |
Carlton | Richard | Haskell |
Carylon | Zamora | Python |
Clarisa | Rodgers | Python |
Dinorah | O'brien | Haskell |
Dorian | Lloyd | Haskell |
In [11]:
%%sql
SELECt *
FROM language_view
WHERE language_name IN ('Python', 'Haskell')
ORDER BY first
LIMIT 10;
Done.
Out[11]:
first | last | language_name |
---|---|---|
Aaron | Alexander | Haskell |
Andree | Douglas | Haskell |
Arlie | Terrell | Python |
Boyd | Blackwell | Haskell |
Buck | Howe | Haskell |
Carlton | Richard | Haskell |
Carylon | Zamora | Python |
Clarisa | Rodgers | Python |
Dinorah | O'brien | Haskell |
Dorian | Lloyd | Haskell |
Subqueries¶
As column expresions¶
SELECT a, b,
(
SELECT MAX(c)
FROM table2
INNER JOIN table1
USING column1
) as max_c
FROM table1
As filters¶
SELECT a, b,
FROM table1
WHERE b >
(
SELECT AVG(b)
FROM table1
)
Quantified Subqueires¶
ALl
SOME
ANY
EXISTS
SELECT a, b,
FROM table1
WHERE EXISTS
(
SELECT c
FROM table2
)
In [12]:
%%sql
SELECT first, last, language_name
FROM person, language
WHERE language_name IN (
SELECT language_name
FROM language_view
WHERe first='Abram' AND last='Boyer'
)
LIMIT 10;
Done.
Out[12]:
first | last | language_name |
---|---|---|
Aaron | Alexander | Io |
Aaron | Kirby | Io |
Abram | Allen | Io |
Abram | Boyer | Io |
Adan | Brown | Io |
Adolph | Dalton | Io |
Adrian | Blevins | Io |
Agustin | Fulton | Io |
Agustin | Mcdonald | Io |
Alberto | Dudley | Io |
Aggregate functions¶
COUNT
MIN
MAX
AVG
SUM
In [13]:
%%sql
SELECT count(language_name)
FROM language_view;
Done.
Out[13]:
count(language_name) |
---|
2297 |
Grouping¶
SELECT a, MIN(b) AS min_b, MAX(b) AS max_b, AVG(b) AS mean_b
FROM table
GROUP BY a
HAVING mean_b > 5
The HAVING
is analagous to the WHERE
clause, but filters on
aggregate conditions. Note that the WHERE
statement filters rows
BEFORE the grouping is done.
Note: Any variable in the SELECT part that is not an aggregte function needs to be in the GROUP BY part.
SELECT a, b, c, COUNT(d)
FROM table
GROUP BY a, b, c
In [14]:
%%sql
SELECT language_name, count(*) AS n
FROM language_view
GROUP BY language_name
HAVING n > 45;
Done.
Out[14]:
language_name | n |
---|---|
AutoIt | 61 |
Bash | 48 |
ECMAScript | 48 |
GNU Octave | 49 |
JavaScript | 48 |
Perl | 55 |
PowerShell | 50 |
Prolog | 50 |
The CASE switch¶
Simple CASE¶
SELECT name,
(CASE sex
WHEN 'M' THEN 1.5*dose
WHEN 'F' THEN dose
END) as adjusted_dose
FROM table
Searched CASE¶
SELECT name,
(CASE
WHEN sex = 'M' THEN 1.5*dose
WHEN sex = 'F' THEN dose
END) as adjusted_dose
FROM table
In [15]:
%%sql
SELECT first, last, language_name,
(CASE
WHEN language_name LIKE 'H%' THEN 'Hire'
ELSE 'FIRE'
END
) AS outcome
FROM language_view
LIMIT 10;
Done.
Out[15]:
first | last | language_name | outcome |
---|---|---|---|
Aaron | Alexander | Haskell | Hire |
Aaron | Kirby | GNU Octave | FIRE |
Aaron | Kirby | haXe | Hire |
Aaron | Kirby | Falcon | FIRE |
Abram | Allen | TypeScript | FIRE |
Abram | Boyer | Io | FIRE |
Abram | Boyer | Lua | FIRE |
Abram | Boyer | Falcon | FIRE |
Adan | Brown | F# | FIRE |
Adolph | Dalton | Dart | FIRE |
C. Window Functions¶
We use the PostgreSQL databsaee because window functions are not supported in SQLite3 yet
In [16]:
%sql postgresql://cliburn@localhost/bios-823
Out[16]:
'Connected: cliburn@bios-823'
In [17]:
import pandas as pd
In [18]:
import numpy as np
In [19]:
from collections import OrderedDict
In [20]:
np.random.seed(23)
n = 10
df = pd.DataFrame(
OrderedDict(person=np.random.choice(['A', 'B', 'C', 'D'], n,),
time=np.random.randint(0, 10, n),
bsl=np.random.randint(50, 400, n)))
In [21]:
df.sort_values(['person', 'time'])
Out[21]:
person | time | bsl | |
---|---|---|---|
8 | A | 0 | 115 |
5 | A | 2 | 237 |
2 | A | 3 | 129 |
7 | B | 5 | 86 |
3 | B | 6 | 396 |
4 | C | 1 | 107 |
1 | C | 9 | 347 |
6 | D | 5 | 89 |
9 | D | 5 | 221 |
0 | D | 7 | 98 |
In [22]:
%sql DROP TABLE IF EXISTS df
Done.
Out[22]:
[]
Magic shortcut to creating a database table from pandas
DataFrame.
In [23]:
%sql persist df
Out[23]:
'Persisted df'
Over creates widows¶
In [24]:
%%sql
SELECT person, time, bsl, row_number()
OVER ()
FROM df;
10 rows affected.
Out[24]:
person | time | bsl | row_number |
---|---|---|---|
D | 7 | 98 | 1 |
C | 9 | 347 | 2 |
A | 3 | 129 | 3 |
B | 6 | 396 | 4 |
C | 1 | 107 | 5 |
A | 2 | 237 | 6 |
D | 5 | 89 | 7 |
B | 5 | 86 | 8 |
A | 0 | 115 | 9 |
D | 5 | 221 | 10 |
Order by¶
In [25]:
%%sql
SELECT person, time, bsl, row_number()
OVER (ORDER BY person, time)
FROM df;
10 rows affected.
Out[25]:
person | time | bsl | row_number |
---|---|---|---|
A | 0 | 115 | 1 |
A | 2 | 237 | 2 |
A | 3 | 129 | 3 |
B | 5 | 86 | 4 |
B | 6 | 396 | 5 |
C | 1 | 107 | 6 |
C | 9 | 347 | 7 |
D | 5 | 221 | 8 |
D | 5 | 89 | 9 |
D | 7 | 98 | 10 |
Partition by¶
In [26]:
%%sql
SELECT person, time, bsl, row_number()
OVER (PARTITION BY person ORDER BY time)
FROM df;
10 rows affected.
Out[26]:
person | time | bsl | row_number |
---|---|---|---|
A | 0 | 115 | 1 |
A | 2 | 237 | 2 |
A | 3 | 129 | 3 |
B | 5 | 86 | 1 |
B | 6 | 396 | 2 |
C | 1 | 107 | 1 |
C | 9 | 347 | 2 |
D | 5 | 221 | 1 |
D | 5 | 89 | 2 |
D | 7 | 98 | 3 |
In [27]:
%%sql
SELECT person, time, bsl, STRING_AGG(CAST(bsl AS TEXT), ', ')
OVER (PARTITION BY person ORDER BY time)
FROM df;
10 rows affected.
Out[27]:
person | time | bsl | string_agg |
---|---|---|---|
A | 0 | 115 | 115 |
A | 2 | 237 | 115, 237 |
A | 3 | 129 | 115, 237, 129 |
B | 5 | 86 | 86 |
B | 6 | 396 | 86, 396 |
C | 1 | 107 | 107 |
C | 9 | 347 | 107, 347 |
D | 5 | 221 | 221, 89 |
D | 5 | 89 | 221, 89 |
D | 7 | 98 | 221, 89, 98 |
Specifying rows in window¶
In [28]:
%%sql
SELECT person, time, bsl, STRING_AGG(CAST(bsl AS TEXT), ', ')
OVER (
PARTITION BY person
ORDER BY time
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)
FROM df;
10 rows affected.
Out[28]:
person | time | bsl | string_agg |
---|---|---|---|
A | 0 | 115 | 115, 237 |
A | 2 | 237 | 115, 237, 129 |
A | 3 | 129 | 237, 129 |
B | 5 | 86 | 86, 396 |
B | 6 | 396 | 86, 396 |
C | 1 | 107 | 107, 347 |
C | 9 | 347 | 107, 347 |
D | 5 | 221 | 221, 89 |
D | 5 | 89 | 221, 89, 98 |
D | 7 | 98 | 89, 98 |
Using window functions¶
In [29]:
%%sql
SELECT person, time, bsl,
row_number() OVER win AS row_number,
rank() OVER win AS rank,
dense_rank() OVER win AS dense_rank,
percent_rank() OVER win AS percent_rank,
cume_dist() OVER win AS cume_dist
FROM df
WINDOW win AS (ORDER BY person);
10 rows affected.
Out[29]:
person | time | bsl | row_number | rank | dense_rank | percent_rank | cume_dist |
---|---|---|---|---|---|---|---|
A | 0 | 115 | 1 | 1 | 1 | 0.0 | 0.3 |
A | 2 | 237 | 2 | 1 | 1 | 0.0 | 0.3 |
A | 3 | 129 | 3 | 1 | 1 | 0.0 | 0.3 |
B | 6 | 396 | 4 | 4 | 2 | 0.333333333333333 | 0.5 |
B | 5 | 86 | 5 | 4 | 2 | 0.333333333333333 | 0.5 |
C | 1 | 107 | 6 | 6 | 3 | 0.555555555555556 | 0.7 |
C | 9 | 347 | 7 | 6 | 3 | 0.555555555555556 | 0.7 |
D | 5 | 221 | 8 | 8 | 4 | 0.777777777777778 | 1.0 |
D | 5 | 89 | 9 | 8 | 4 | 0.777777777777778 | 1.0 |
D | 7 | 98 | 10 | 8 | 4 | 0.777777777777778 | 1.0 |
Using aggregate functions¶
In [30]:
%%sql
SELECT person, time, bsl,
SUM(bsl) OVER win AS bsl_sum,
AVG(bsl) OVER win AS bsl_avg,
MIN(bsl) OVER win AS bsl_min,
MAX(bsl) over win as bsl_max,
FIRST_VALUE(bsl) OVER win as bsl_start,
LAST_VALUE(bsl) OVER win as bsl_end
FROM df
WINDOW win AS (PARTITION BY person ORDER BY time);
10 rows affected.
Out[30]:
person | time | bsl | bsl_sum | bsl_avg | bsl_min | bsl_max | bsl_start | bsl_end |
---|---|---|---|---|---|---|---|---|
A | 0 | 115 | 115 | 115.0000000000000000 | 115 | 115 | 115 | 115 |
A | 2 | 237 | 352 | 176.0000000000000000 | 115 | 237 | 115 | 237 |
A | 3 | 129 | 481 | 160.3333333333333333 | 115 | 237 | 115 | 129 |
B | 5 | 86 | 86 | 86.0000000000000000 | 86 | 86 | 86 | 86 |
B | 6 | 396 | 482 | 241.0000000000000000 | 86 | 396 | 86 | 396 |
C | 1 | 107 | 107 | 107.0000000000000000 | 107 | 107 | 107 | 107 |
C | 9 | 347 | 454 | 227.0000000000000000 | 107 | 347 | 107 | 347 |
D | 5 | 221 | 310 | 155.0000000000000000 | 89 | 221 | 221 | 89 |
D | 5 | 89 | 310 | 155.0000000000000000 | 89 | 221 | 221 | 89 |
D | 7 | 98 | 408 | 136.0000000000000000 | 89 | 221 | 221 | 98 |
Using rows and range to constrain windows¶
In [31]:
%%sql
SELECT person, time, bsl,
STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,
SUM(bsl) OVER win AS bsl_sum,
AVG(bsl) OVER win AS bsl_avg
FROM df
WINDOW win AS (
PARTITION BY person
ORDER BY time
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)
ORDER BY person, time;
10 rows affected.
Out[31]:
person | time | bsl | vals | bsl_sum | bsl_avg |
---|---|---|---|---|---|
A | 0 | 115 | 115, 237 | 352 | 176.0000000000000000 |
A | 2 | 237 | 115, 237, 129 | 481 | 160.3333333333333333 |
A | 3 | 129 | 237, 129 | 366 | 183.0000000000000000 |
B | 5 | 86 | 86, 396 | 482 | 241.0000000000000000 |
B | 6 | 396 | 86, 396 | 482 | 241.0000000000000000 |
C | 1 | 107 | 107, 347 | 454 | 227.0000000000000000 |
C | 9 | 347 | 107, 347 | 454 | 227.0000000000000000 |
D | 5 | 221 | 221, 89 | 310 | 155.0000000000000000 |
D | 5 | 89 | 221, 89, 98 | 408 | 136.0000000000000000 |
D | 7 | 98 | 89, 98 | 187 | 93.5000000000000000 |
Frames using Rows and Range¶
For Range, all rows with the same ORDER BY value are considered peers.
In [32]:
%%sql
SELECT person, time, bsl,
STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,
SUM(bsl) OVER win AS bsl_sum,
AVG(bsl) OVER win AS bsl_avg
FROM df
WINDOW win AS (
ORDER BY person
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY person, time;
10 rows affected.
Out[32]:
person | time | bsl | vals | bsl_sum | bsl_avg |
---|---|---|---|---|---|
A | 0 | 115 | 115 | 115 | 115.0000000000000000 |
A | 2 | 237 | 115, 237 | 352 | 176.0000000000000000 |
A | 3 | 129 | 115, 237, 129 | 481 | 160.3333333333333333 |
B | 5 | 86 | 115, 237, 129, 396, 86 | 963 | 192.6000000000000000 |
B | 6 | 396 | 115, 237, 129, 396 | 877 | 219.2500000000000000 |
C | 1 | 107 | 115, 237, 129, 396, 86, 107 | 1070 | 178.3333333333333333 |
C | 9 | 347 | 115, 237, 129, 396, 86, 107, 347 | 1417 | 202.4285714285714286 |
D | 5 | 221 | 115, 237, 129, 396, 86, 107, 347, 221 | 1638 | 204.7500000000000000 |
D | 5 | 89 | 115, 237, 129, 396, 86, 107, 347, 221, 89 | 1727 | 191.8888888888888889 |
D | 7 | 98 | 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 | 1825 | 182.5000000000000000 |
In [33]:
%%sql
SELECT person, time, bsl,
STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,
SUM(bsl) OVER win AS bsl_sum,
AVG(bsl) OVER win AS bsl_avg
FROM df
WINDOW win AS (
ORDER BY person
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY person, time;
10 rows affected.
Out[33]:
person | time | bsl | vals | bsl_sum | bsl_avg |
---|---|---|---|---|---|
A | 0 | 115 | 115, 237, 129 | 481 | 160.3333333333333333 |
A | 2 | 237 | 115, 237, 129 | 481 | 160.3333333333333333 |
A | 3 | 129 | 115, 237, 129 | 481 | 160.3333333333333333 |
B | 5 | 86 | 115, 237, 129, 396, 86 | 963 | 192.6000000000000000 |
B | 6 | 396 | 115, 237, 129, 396, 86 | 963 | 192.6000000000000000 |
C | 1 | 107 | 115, 237, 129, 396, 86, 107, 347 | 1417 | 202.4285714285714286 |
C | 9 | 347 | 115, 237, 129, 396, 86, 107, 347 | 1417 | 202.4285714285714286 |
D | 5 | 221 | 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 | 1825 | 182.5000000000000000 |
D | 5 | 89 | 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 | 1825 | 182.5000000000000000 |
D | 7 | 98 | 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 | 1825 | 182.5000000000000000 |
In [ ]: