S02 RDBMS and SQL

For more SQL examples in the SQLite3 dialect, seee SQLite3 tutorial.

For a deep dive, see SQL Queries for Mere Mortals.

A. More RDBMS concepts

OLTP and OLAP

  • OLTP
    • Normalized schema
  • OLAP
    • Denormalized schema
      • Star
        • Facts
        • Dimensions
      • Snowflake
      • Generated from OLTP databases by ETL (Extract-Transform-Load) operations

Types of REBMS

  • Data lake
  • Data warehouse
  • Data mart

Data marts typically use a star schema that is customized for the analysis needs. For example, the finance department in a hospital may be most interested in Facts about Claims.

img

img

Robustness and scaling

  • Replication
  • Sharding

B. Basic SQL queries

Data we will work with in Part B

In [1]:
%load_ext sql
In [2]:
%sql sqlite:///data/faculty.db
Out[2]:
'Connected: None@data/faculty.db'
In [3]:
%%sql

SELECT * FROM sqlite_master WHERE type='table';
Done.
Out[3]:
type name tbl_name rootpage sql
table person person 2 CREATE TABLE person (
"index" BIGINT,
person_id BIGINT,
first TEXT,
last TEXT,
age BIGINT,
height FLOAT,
weight BIGINT,
country_id TEXT,
gender_id BIGINT
)
table confidential confidential 18 CREATE TABLE confidential (
"index" BIGINT,
person_id BIGINT,
salary BIGINT
)
table person_language person_language 33 CREATE TABLE person_language (
"index" BIGINT,
person_id BIGINT,
language_id BIGINT
)
table language language 50 CREATE TABLE language (
"index" BIGINT,
language_id BIGINT,
language_name TEXT
)
table gender gender 55 CREATE TABLE gender (
"index" BIGINT,
gender_id BIGINT,
gender TEXT
)
table country country 57 CREATE TABLE country (
"index" BIGINT,
country_id TEXT,
country TEXT,
nationality TEXT
)

Basic Structure

SELECT DISTINCT value_expression AS alias
FROM tables AS alias
WHERE predicate
ORDER BY value_expression

Types

  • Character (Fixed width, variable width)
  • National Character (Fixed width, variable width)
  • Binary
  • Numeric (Exact, Arpproximate)
  • Boolean
  • DateTime
  • Interval

The SQL standard specifies that character strings and datetime literals are enclosed by single quotes. Two single quotes wihtin a string is intepreted as a literal single quote.

'Gilligan''s island'

The CAST function

CAST(X as CHARACTER(10))

Value expreesion

  • Literal
  • Column reference
  • Function
  • CASES
  • (Value expression)
  • (SELECT expression)

which may be prefixed with unary operaors - and + and combined with binary operators appropriate for the data type.

Bineary operators

Concatenation

A || B

Mathematical

A + B
A - B
A * B
A / B

Data and time arithmetic

'2018-08-29' + 3
'11:59' + '00:01'
In [4]:
%%sql

SELECT DISTINCT language_name
FROM language
LIMIT 5;
Done.
Out[4]:
language_name
PHP
Clojure
Dylan
GNU Octave
D

Sorting

SELECT DISTINCT value_expression AS alias
FROM tables AS alias
ORDER BY value_expression
In [5]:
%%sql

SELECT DISTINCT language_name
FROM language
ORDER BY language_name ASC
LIMIT 5;
Done.
Out[5]:
language_name
ASP
Assembly
AutoIt
Awk
Bash

Filtering

For efficiency, place the most stringent filters first.

SELECT DISTINCT value_expression AS alias
FROM tables AS alias
WHERE predicate
ORDER BY value_expression

Predicates for filtering rows

  • Comparison operators (=, <>, <, >, <=, >=)
  • BETWEEN start AND end
  • IN(A, B, C)
  • LIKE
  • IS NULL
  • REGEX

Use NOT prefix for negation

Combining predicates

AND
OR

USe parenthesis to indicate order of evaluation for compound statements.

In [6]:
%%sql

SELECT first, last, age
FROM person
WHERE age BETWEEN 16 AND 17
LIMIT 5;
Done.
Out[6]:
first last age
Antoine Beard 16
Augustine Mejia 16
Boris Mejia 16
Brain Haney 16
Burl Mayo 17

Joins

Joins combine data from 1 or more tables to form a new result set.

Natural join

Uses all common columns in Tables 1 and 2 for JOIN

FROM Table1
NATURAL INNER JOIN Table 2

Inner join

General form of INNER JOIN uisng ON

FROM Table1
INNER JOIN Table2
ON Table1.Column = Table2.Column

If there is a common column in both tables

FROM Table1
INNER JOIN Table2
USING Column

Joining more than two tables

From (Table1
      INNER JOIN Table2
      ON Table1.column1 = Table2.Column1)
      INNER JOIN Table3
      ON Table3.column2 = Table2.Column2

Outer join

General form of OUTER JOIN uisng ON

FROM Table1
RIGHT OUTER JOIN Table2
ON Table1.Column = Table2.Column
FROM Table1
LEFT OUTER JOIN Table2
ON Table1.Column = Table2.Column
FROM Table1
FULL OUTER JOIN Table2
ON Table1.Column = Table2.Column
In [7]:
%%sql

SELECT first, last, language_name
FROM person
INNER JOIN person_language ON person.person_id = person_language.person_id
INNER JOIN language ON language.language_id = person_language.language_id
LIMIT 10;
Done.
Out[7]:
first last language_name
Aaron Alexander Haskell
Aaron Kirby GNU Octave
Aaron Kirby haXe
Aaron Kirby Falcon
Abram Allen TypeScript
Abram Boyer Io
Abram Boyer Lua
Abram Boyer Falcon
Adan Brown F#
Adolph Dalton Dart

Set operations

SELECT a, b
FROM table1
SetOp
SELECT a, b
FROM table2

wehre SetOp is INTERSECT, EXCEPT, UNION or UNION ALL.

Intersection

INTERSECT

Alternative using INNER JOIN

Union

UNION
UNION ALL (does not eliminate duplicate rows)

Difference

EXCEPT

Alternative using OUTER JOIN with test for NULL

In [8]:
%%sql

DROP VIEW IF EXISTS language_view;
CREATE VIEW language_view AS
SELECT first, last, language_name
FROM person
INNER JOIN person_language ON person.person_id = person_language.person_id
INNER JOIN language ON language.language_id = person_language.language_id
;
Done.
Done.
Out[8]:
[]
In [9]:
%%sql

SELECt *
FROM language_view
LIMIT 10;
Done.
Out[9]:
first last language_name
Aaron Alexander Haskell
Aaron Kirby GNU Octave
Aaron Kirby haXe
Aaron Kirby Falcon
Abram Allen TypeScript
Abram Boyer Io
Abram Boyer Lua
Abram Boyer Falcon
Adan Brown F#
Adolph Dalton Dart
In [10]:
%%sql

SELECt *
FROM language_view
WHERE language_name = 'Python'
UNION
SELECt *
FROM language_view
WHERE language_name = 'Haskell'
LIMIT 10;
Done.
Out[10]:
first last language_name
Aaron Alexander Haskell
Andree Douglas Haskell
Arlie Terrell Python
Boyd Blackwell Haskell
Buck Howe Haskell
Carlton Richard Haskell
Carylon Zamora Python
Clarisa Rodgers Python
Dinorah O'brien Haskell
Dorian Lloyd Haskell
In [11]:
%%sql

SELECt *
FROM language_view
WHERE language_name IN ('Python', 'Haskell')
ORDER BY first
LIMIT 10;
Done.
Out[11]:
first last language_name
Aaron Alexander Haskell
Andree Douglas Haskell
Arlie Terrell Python
Boyd Blackwell Haskell
Buck Howe Haskell
Carlton Richard Haskell
Carylon Zamora Python
Clarisa Rodgers Python
Dinorah O'brien Haskell
Dorian Lloyd Haskell

Subqueries

As column expresions

SELECT a, b,
(
    SELECT MAX(c)
    FROM table2
    INNER JOIN table1
    USING column1
) as max_c
FROM table1

As filters

SELECT a, b,
FROM table1
WHERE b >
(
    SELECT AVG(b)
    FROM table1
)

Quantified Subqueires

ALl
SOME
ANY
EXISTS
SELECT a, b,
FROM table1
WHERE EXISTS
(
    SELECT c
    FROM table2
)
In [12]:
%%sql

SELECT first, last, language_name
FROM person, language
WHERE language_name IN (
    SELECT language_name
    FROM language_view
    WHERe first='Abram' AND last='Boyer'
)
LIMIT 10;
Done.
Out[12]:
first last language_name
Aaron Alexander Io
Aaron Kirby Io
Abram Allen Io
Abram Boyer Io
Adan Brown Io
Adolph Dalton Io
Adrian Blevins Io
Agustin Fulton Io
Agustin Mcdonald Io
Alberto Dudley Io

Aggregate functions

COUNT
MIN
MAX
AVG
SUM
In [13]:
%%sql

SELECT count(language_name)
FROM language_view;
Done.
Out[13]:
count(language_name)
2297

Grouping

SELECT a, MIN(b) AS min_b, MAX(b) AS max_b, AVG(b) AS mean_b
FROM table
GROUP BY a
HAVING mean_b > 5

The HAVING is analagous to the WHERE clause, but filters on aggregate conditions. Note that the WHERE statement filters rows BEFORE the grouping is done.

Note: Any variable in the SELECT part that is not an aggregte function needs to be in the GROUP BY part.

SELECT a, b, c, COUNT(d)
FROM table
GROUP BY a, b, c
In [14]:
%%sql

SELECT language_name, count(*) AS n
FROM language_view
GROUP BY language_name
HAVING n > 45;
Done.
Out[14]:
language_name n
AutoIt 61
Bash 48
ECMAScript 48
GNU Octave 49
JavaScript 48
Perl 55
PowerShell 50
Prolog 50

The CASE switch

Simple CASE

SELECT name,
(CASE sex
 WHEN 'M' THEN 1.5*dose
 WHEN 'F' THEN dose
 END) as adjusted_dose
FROM table

Searched CASE

SELECT name,
(CASE
 WHEN sex = 'M' THEN 1.5*dose
 WHEN sex = 'F' THEN dose
 END) as adjusted_dose
FROM table
In [15]:
%%sql

SELECT first, last, language_name,
(CASE
    WHEN language_name LIKE 'H%' THEN 'Hire'
    ELSE 'FIRE'
END
) AS outcome
FROM language_view
LIMIT 10;
Done.
Out[15]:
first last language_name outcome
Aaron Alexander Haskell Hire
Aaron Kirby GNU Octave FIRE
Aaron Kirby haXe Hire
Aaron Kirby Falcon FIRE
Abram Allen TypeScript FIRE
Abram Boyer Io FIRE
Abram Boyer Lua FIRE
Abram Boyer Falcon FIRE
Adan Brown F# FIRE
Adolph Dalton Dart FIRE

C. Window Functions

We use the PostgreSQL databsaee because window functions are not supported in SQLite3 yet

In [16]:
%sql postgresql://cliburn@localhost/bios-823
Out[16]:
'Connected: cliburn@bios-823'
In [17]:
import pandas as pd
In [18]:
import numpy as np
In [19]:
from collections import OrderedDict
In [20]:
np.random.seed(23)
n = 10
df = pd.DataFrame(
    OrderedDict(person=np.random.choice(['A', 'B', 'C', 'D'], n,),
                time=np.random.randint(0, 10, n),
                bsl=np.random.randint(50, 400, n)))
In [21]:
df.sort_values(['person', 'time'])
Out[21]:
person time bsl
8 A 0 115
5 A 2 237
2 A 3 129
7 B 5 86
3 B 6 396
4 C 1 107
1 C 9 347
6 D 5 89
9 D 5 221
0 D 7 98
In [22]:
%sql DROP TABLE IF EXISTS df
Done.
Out[22]:
[]

Magic shortcut to creating a database table from pandas DataFrame.

In [23]:
%sql persist df
Out[23]:
'Persisted df'

Over creates widows

In [24]:
%%sql

SELECT person, time, bsl, row_number()
OVER ()
FROM df;
10 rows affected.
Out[24]:
person time bsl row_number
D 7 98 1
C 9 347 2
A 3 129 3
B 6 396 4
C 1 107 5
A 2 237 6
D 5 89 7
B 5 86 8
A 0 115 9
D 5 221 10

Order by

In [25]:
%%sql

SELECT person, time, bsl, row_number()
OVER (ORDER BY person, time)
FROM df;
10 rows affected.
Out[25]:
person time bsl row_number
A 0 115 1
A 2 237 2
A 3 129 3
B 5 86 4
B 6 396 5
C 1 107 6
C 9 347 7
D 5 221 8
D 5 89 9
D 7 98 10

Partition by

In [26]:
%%sql

SELECT person, time, bsl, row_number()
OVER (PARTITION BY person ORDER BY time)
FROM df;
10 rows affected.
Out[26]:
person time bsl row_number
A 0 115 1
A 2 237 2
A 3 129 3
B 5 86 1
B 6 396 2
C 1 107 1
C 9 347 2
D 5 221 1
D 5 89 2
D 7 98 3
In [27]:
%%sql

SELECT person, time, bsl, STRING_AGG(CAST(bsl AS TEXT), ', ')
OVER (PARTITION BY person ORDER BY time)
FROM df;
10 rows affected.
Out[27]:
person time bsl string_agg
A 0 115 115
A 2 237 115, 237
A 3 129 115, 237, 129
B 5 86 86
B 6 396 86, 396
C 1 107 107
C 9 347 107, 347
D 5 221 221, 89
D 5 89 221, 89
D 7 98 221, 89, 98

Specifying rows in window

In [28]:
%%sql

SELECT person, time, bsl, STRING_AGG(CAST(bsl AS TEXT), ', ')
OVER (
    PARTITION BY person
    ORDER BY time
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
     )
FROM df;
10 rows affected.
Out[28]:
person time bsl string_agg
A 0 115 115, 237
A 2 237 115, 237, 129
A 3 129 237, 129
B 5 86 86, 396
B 6 396 86, 396
C 1 107 107, 347
C 9 347 107, 347
D 5 221 221, 89
D 5 89 221, 89, 98
D 7 98 89, 98

Using window functions

In [29]:
%%sql

SELECT person, time, bsl,
       row_number() OVER win    AS row_number,
       rank() OVER win          AS rank,
       dense_rank() OVER win    AS dense_rank,
       percent_rank() OVER win  AS percent_rank,
       cume_dist() OVER win     AS cume_dist
FROM df
WINDOW win AS (ORDER BY person);
10 rows affected.
Out[29]:
person time bsl row_number rank dense_rank percent_rank cume_dist
A 0 115 1 1 1 0.0 0.3
A 2 237 2 1 1 0.0 0.3
A 3 129 3 1 1 0.0 0.3
B 6 396 4 4 2 0.333333333333333 0.5
B 5 86 5 4 2 0.333333333333333 0.5
C 1 107 6 6 3 0.555555555555556 0.7
C 9 347 7 6 3 0.555555555555556 0.7
D 5 221 8 8 4 0.777777777777778 1.0
D 5 89 9 8 4 0.777777777777778 1.0
D 7 98 10 8 4 0.777777777777778 1.0

Using aggregate functions

In [30]:
%%sql

SELECT person, time, bsl,
       SUM(bsl) OVER win    AS bsl_sum,
       AVG(bsl) OVER win    AS bsl_avg,
       MIN(bsl) OVER win    AS bsl_min,
       MAX(bsl) over win as bsl_max,
       FIRST_VALUE(bsl) OVER win as bsl_start,
       LAST_VALUE(bsl) OVER win as bsl_end
FROM df
WINDOW win AS (PARTITION BY person ORDER BY time);
10 rows affected.
Out[30]:
person time bsl bsl_sum bsl_avg bsl_min bsl_max bsl_start bsl_end
A 0 115 115 115.0000000000000000 115 115 115 115
A 2 237 352 176.0000000000000000 115 237 115 237
A 3 129 481 160.3333333333333333 115 237 115 129
B 5 86 86 86.0000000000000000 86 86 86 86
B 6 396 482 241.0000000000000000 86 396 86 396
C 1 107 107 107.0000000000000000 107 107 107 107
C 9 347 454 227.0000000000000000 107 347 107 347
D 5 221 310 155.0000000000000000 89 221 221 89
D 5 89 310 155.0000000000000000 89 221 221 89
D 7 98 408 136.0000000000000000 89 221 221 98

Using rows and range to constrain windows

In [31]:
%%sql

SELECT person, time, bsl,
       STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,
       SUM(bsl) OVER win AS bsl_sum,
       AVG(bsl) OVER win AS bsl_avg
FROM df
WINDOW win AS (
    PARTITION BY person
    ORDER BY time
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)
ORDER BY person, time;
10 rows affected.
Out[31]:
person time bsl vals bsl_sum bsl_avg
A 0 115 115, 237 352 176.0000000000000000
A 2 237 115, 237, 129 481 160.3333333333333333
A 3 129 237, 129 366 183.0000000000000000
B 5 86 86, 396 482 241.0000000000000000
B 6 396 86, 396 482 241.0000000000000000
C 1 107 107, 347 454 227.0000000000000000
C 9 347 107, 347 454 227.0000000000000000
D 5 221 221, 89 310 155.0000000000000000
D 5 89 221, 89, 98 408 136.0000000000000000
D 7 98 89, 98 187 93.5000000000000000

Frames using Rows and Range

For Range, all rows with the same ORDER BY value are considered peers.

In [32]:
%%sql

SELECT person, time, bsl,
       STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,
       SUM(bsl) OVER win AS bsl_sum,
       AVG(bsl) OVER win AS bsl_avg
FROM df
WINDOW win AS (
    ORDER BY person
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY person, time;
10 rows affected.
Out[32]:
person time bsl vals bsl_sum bsl_avg
A 0 115 115 115 115.0000000000000000
A 2 237 115, 237 352 176.0000000000000000
A 3 129 115, 237, 129 481 160.3333333333333333
B 5 86 115, 237, 129, 396, 86 963 192.6000000000000000
B 6 396 115, 237, 129, 396 877 219.2500000000000000
C 1 107 115, 237, 129, 396, 86, 107 1070 178.3333333333333333
C 9 347 115, 237, 129, 396, 86, 107, 347 1417 202.4285714285714286
D 5 221 115, 237, 129, 396, 86, 107, 347, 221 1638 204.7500000000000000
D 5 89 115, 237, 129, 396, 86, 107, 347, 221, 89 1727 191.8888888888888889
D 7 98 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 1825 182.5000000000000000
In [33]:
%%sql

SELECT person, time, bsl,
       STRING_AGG(CAST(bsl AS TEXT), ', ') OVER win AS vals,
       SUM(bsl) OVER win AS bsl_sum,
       AVG(bsl) OVER win AS bsl_avg
FROM df
WINDOW win AS (
    ORDER BY person
    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY person, time;
10 rows affected.
Out[33]:
person time bsl vals bsl_sum bsl_avg
A 0 115 115, 237, 129 481 160.3333333333333333
A 2 237 115, 237, 129 481 160.3333333333333333
A 3 129 115, 237, 129 481 160.3333333333333333
B 5 86 115, 237, 129, 396, 86 963 192.6000000000000000
B 6 396 115, 237, 129, 396, 86 963 192.6000000000000000
C 1 107 115, 237, 129, 396, 86, 107, 347 1417 202.4285714285714286
C 9 347 115, 237, 129, 396, 86, 107, 347 1417 202.4285714285714286
D 5 221 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 1825 182.5000000000000000
D 5 89 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 1825 182.5000000000000000
D 7 98 115, 237, 129, 396, 86, 107, 347, 221, 89, 98 1825 182.5000000000000000
In [ ]: