Relational Databases Overview

For a simple tutorial on database design, see Introduction to Database Design

For a deep dive, see Database Design for Mere Mortals

0. Packages for working with relational databases in Python

Motivation

Why relational databases and SQL?

  • History of databases

  • ACID

  • Data integrity

  • Schema

RDBMS

  • Memory

  • Storage

  • Dictionary

  • Query language

What is a database?

A database contains tables with rows and columns. Tables belong to a schema, and schemas belong to a catalog. In other words, a database contains catalogs, which contain schemas, which contain tables (or views). Most simple databases only use the schema/table part of this hierarchy.

Schema

Represents a collection of tables.

Table (Relation)

There are two definitions of relation: in one, relation is a synonym for table; in the other, a relation describes how two tables are connected via foreign/primary keys.

Represents a subject or an event.

Column (Attribute)

Represents a single variable or feature.

Row (Tuple)

Represents an observation.

Concepts

Constraints

You can impose constraints on the values that a column can take. For example, you can specify that values are compulsory (NOT NULL), UNIQUE, or must fall within a certain range (CHECK).
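
A minimal sketch of these constraints using the built-in sqlite3 module; the table and column names here are made up for illustration. Violating a declared constraint raises `sqlite3.IntegrityError`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE Student (
    student_id INTEGER PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,                     -- compulsory and unique
    grade      INTEGER CHECK (grade BETWEEN 0 AND 100)   -- range constraint
)""")
con.execute("INSERT INTO Student (email, grade) VALUES ('ann@example.com', 78)")

# Violating a constraint raises IntegrityError
err = None
try:
    con.execute("INSERT INTO Student (email, grade) VALUES ('bob@example.com', 120)")
except sqlite3.IntegrityError as e:
    err = e
print(err)  # CHECK constraint failed
con.close()
```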

Referential integrity

  • Primary key represents a unique identifier of a row. It may be simple or composite.

    • Unique

    • Non-null

    • Never optional

  • Foreign key is a column containing the primary key of a different table. It enforces referential integrity.

Relationships

  • One to one

  • One to many

  • Many to many

  • What happens on delete?

    • Restrict

    • Cascade
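
A sketch of the on-delete behavior using sqlite3. Note that SQLite ships with foreign-key enforcement off, so it must be enabled per connection; the tables below loosely mirror the Country/Person schema used later but are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enforcement is OFF by default in SQLite
con.execute("CREATE TABLE Country (country_id TEXT PRIMARY KEY)")
con.execute("""
CREATE TABLE Person (
    person_id  INTEGER PRIMARY KEY,
    country_id TEXT REFERENCES Country(country_id) ON DELETE CASCADE
)""")
con.execute("INSERT INTO Country VALUES ('FR')")
con.execute("INSERT INTO Person (country_id) VALUES ('FR')")

# Deleting the parent row cascades to the child row in Person
con.execute("DELETE FROM Country WHERE country_id = 'FR'")
n_remaining = con.execute("SELECT COUNT(*) FROM Person").fetchone()[0]
print(n_remaining)  # 0
con.close()
```

With RESTRICT (or SQLite's default NO ACTION) instead of CASCADE, the DELETE would fail with an IntegrityError while the child row exists.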

Indexes

An index is a data structure that allows fast search of a column (typically reducing lookups from linear to logarithmic time complexity). Most databases will automatically build an index for every primary key column, but you can also manually specify additional columns to index.
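
A small sketch using sqlite3: a manually created index shows up in the data dictionary (sqlite_master) alongside tables. The index name here is illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (person_id INTEGER PRIMARY KEY, person_last TEXT)")

# Speed up lookups by last name with a manually created index
con.execute("CREATE INDEX idx_person_last ON Person (person_last)")

# Indexes appear in sqlite_master alongside tables
rows = con.execute("SELECT name FROM sqlite_master WHERE type = 'index'").fetchall()
print(rows)  # [('idx_person_last',)]
con.close()
```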

Views

  • Temporary virtual table returned as the result of a query.

  • Views only specify the structure of a table - the contents are constructed on the fly from existing tables.

  • Queries return a Result Set
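
A sketch with sqlite3 showing that a view stores only a query, so its result set tracks the underlying table automatically (the view name `french` and the data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (person_first TEXT, country_id TEXT)")
con.execute("INSERT INTO Person VALUES ('Napoleon', 'FR'), ('Luis', 'CU')")
con.execute("""
CREATE VIEW french AS
SELECT person_first FROM Person WHERE country_id = 'FR'
""")
before = con.execute("SELECT * FROM french").fetchall()

# The view is re-evaluated on the fly, so new base-table rows show up in it
con.execute("INSERT INTO Person VALUES ('Marie', 'FR')")
after = con.execute("SELECT * FROM french").fetchall()
print(before, after)  # [('Napoleon',)] [('Napoleon',), ('Marie',)]
con.close()
```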

Design

Columns

  • Use singular form for names

  • Use informative names

  • Use unique names not shared by any other table (except foreign keys)

  • Column must be an attribute of the table’s subject

  • Eliminate multi-part columns

  • Eliminate multi-value columns

  • Eliminate redundant columns

Tables

  • Use singular or plural form for names consistently (which to prefer is controversial)

  • Ensure every table has a primary key

  • Eliminate duplicate columns

Relationships

  • Establish participation type and degree of relationship

    • One to one

    • One to many

    • Many to many


Database administration

Use sql magic (ipython-sql) as an interface to the sqlite3 driver.

[1]:
! python3 -m pip install --quiet ipython-sql
[2]:
%load_ext sql

Connect to an SQLite3 database on disk (it is created if it does not exist)

[3]:
%sql sqlite:///data/dummy.db

SQL for table deletion and creation

[4]:
%%sql

DROP TABLE IF EXISTS Country;
DROP TABLE IF EXISTS Person;

CREATE TABLE Country (
    country_id varchar(2) PRIMARY KEY,
    country_name varchar(255)
);

CREATE TABLE Person (
    person_id INTEGER PRIMARY KEY,
    person_first varchar(255),
    person_last varchar(255),
    country_id varchar(2) NOT NULL,
      FOREIGN KEY (country_id) REFERENCES Country(country_id)
);
 * sqlite:///data/dummy.db
Done.
Done.
Done.
Done.
[4]:
[]

Accessing the RDBMS data dictionary.

[5]:
%%sql

SELECT name FROM sqlite_master
WHERE type = 'table';
 * sqlite:///data/dummy.db
Done.
[5]:
name
Country
Person
[6]:
%%sql

SELECT sql FROM sqlite_master
WHERE name='Person';
 * sqlite:///data/dummy.db
Done.
[6]:
sql
CREATE TABLE Person (
person_id INTEGER PRIMARY KEY,
person_first varchar(255),
person_last varchar(255),
country_id varchar(2) NOT NULL,
FOREIGN KEY (country_id) REFERENCES Country(country_id)
)

CRUD

SQL to insert rows.

[7]:
%%sql

INSERT INTO Country(country_id, country_name)
VALUES ('FR', 'France'), ('CU', 'CUBA');
 * sqlite:///data/dummy.db
2 rows affected.
[7]:
[]
[8]:
%%sql

INSERT INTO Person(person_first, person_last, country_id)
VALUES
('Napoleon', 'Bonaparte', 'FR'),
('Luis','Alvarez', 'CU');
 * sqlite:///data/dummy.db
2 rows affected.
[8]:
[]

Read rows.

SQL as a Query Language.

[9]:
%%sql

SELECT *
FROM Person
 * sqlite:///data/dummy.db
Done.
[9]:
person_id person_first person_last country_id
1 Napoleon Bonaparte FR
2 Luis Alvarez CU
[10]:
%%sql

SELECT person_first as first, person_last AS last, country_name AS nationality
FROM Person
INNER JOIN country
ON Person.country_id = Country.country_id;
 * sqlite:///data/dummy.db
Done.
[10]:
first last nationality
Napoleon Bonaparte France
Luis Alvarez CUBA
[11]:
%%sql

UPDATE Person
SET person_first = 'Walter'
WHERE person_last = 'Alvarez'
 * sqlite:///data/dummy.db
1 rows affected.
[11]:
[]
[12]:
%%sql

SELECT * from person
 * sqlite:///data/dummy.db
Done.
[12]:
person_id person_first person_last country_id
1 Napoleon Bonaparte FR
2 Walter Alvarez CU
[13]:
%%sql

DELETE
FROM Person
WHERE person_last = 'Alvarez'
 * sqlite:///data/dummy.db
1 rows affected.
[13]:
[]
[14]:
%%sql

SELECT * from person
 * sqlite:///data/dummy.db
Done.
[14]:
person_id person_first person_last country_id
1 Napoleon Bonaparte FR

Communicating with database without magic

[15]:
import sqlite3
[16]:
con = sqlite3.connect('data/dummy.db')
[17]:
c = con.cursor()
[18]:
c.execute('select * from Person')
c.fetchall()
[18]:
[(1, 'Napoleon', 'Bonaparte', 'FR')]
[19]:
c.close()
con.close()
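
One practical point when using the driver directly: pass values with ? placeholders rather than formatting them into the SQL string, which also guards against SQL injection. A minimal sketch (the data values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Person (person_first TEXT, person_last TEXT)")

# Single insert with placeholders
con.execute("INSERT INTO Person VALUES (?, ?)", ("Luis", "Alvarez"))

# Bulk insert with executemany
con.executemany("INSERT INTO Person VALUES (?, ?)",
                [("Napoleon", "Bonaparte"), ("Marie", "Curie")])

rows = con.execute("SELECT person_first FROM Person WHERE person_last = ?",
                   ("Alvarez",)).fetchall()
print(rows)  # [('Luis',)]
con.close()
```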

Communicating with database from pandas

[20]:
import pandas as pd
[21]:
con = sqlite3.connect('data/dummy.db')
sql = """
SELECT *
FROM Person
"""
df = pd.read_sql(sql, con)
[22]:
df
[22]:
person_id person_first person_last country_id
0 1 Napoleon Bonaparte FR
[23]:
con.close()

Database normalization

Database normalization is performed for two main reasons - to reduce redundancy and to prevent inconsistencies (anomalies) on insert, update and delete.

Note: A fully normalized database is in domain-key normal form (DK/NF) if every constraint is a logical consequence of the definition of the candidate key and domains. However, most practical normalization procedures go through a series of steps known as first, second and third normal forms, and ignore potential modification anomalies that may remain.

[24]:
import pandas as pd

First Normal Form (1NF)

  1. Table has a primary key (unique, non-null column that identifies each row)

  2. No repeating groups of columns

  3. Each cell contains a single value

No repeating columns

[25]:
names = ['ann', 'bob', 'charles']
grades = [(78,57), (99, 99), (67, 98)]
grade1, grade2 = zip(*grades)
df = pd.DataFrame(dict(name=names, grade1=grade1, grade2=grade2))
df
[25]:
name grade1 grade2
0 ann 78 57
1 bob 99 99
2 charles 67 98
[26]:
df = pd.melt(df, id_vars='name',
             var_name='homework',
             value_name='grade')
df
[26]:
name homework grade
0 ann grade1 78
1 bob grade1 99
2 charles grade1 67
3 ann grade2 57
4 bob grade2 99
5 charles grade2 98
[27]:
df.homework = df.homework.str.extract(r'.*(\d+)$', expand=False)
df
[27]:
name homework grade
0 ann 1 78
1 bob 1 99
2 charles 1 67
3 ann 2 57
4 bob 2 99
5 charles 2 98

Each column contains a single value

[28]:
names = ['ann', 'bob', 'charles']
grades = [(78,57), (99, 99), (67, 98)]
df = pd.DataFrame(dict(name=names, grade=grades))
df
[28]:
name grade
0 ann (78, 57)
1 bob (99, 99)
2 charles (67, 98)
[29]:
df.explode('grade').reset_index(drop=True)
[29]:
name grade
0 ann 78
1 ann 57
2 bob 99
3 bob 99
4 charles 67
5 charles 98

Second Normal Form (2NF)

  1. All columns in each row depend fully on candidate keys

This can be quite tricky to understand. Look for candidate composite keys that can uniquely identify a row. Then see if the other columns depend on ALL columns of the composite key.

In the example below, suppose we have a table of academic books. Note that (publisher, title) is a candidate key. However, headquarters depends only on publisher and not on title, so this violates 2NF. Notice how the redundancy has already let inconsistencies creep in ('Gernamy' vs 'Germany', 'CUB' vs 'CUP') - exactly the kind of anomaly normalization prevents.

[30]:
import numpy as np
[31]:
publishers = ['Springer', 'Springer', 'CUB', 'CUP']
headquarters = ['Germany', 'Gernamy', 'England', 'England']
titles = ['Linear Algebra Done Wrong',
          'Undergraduate Algebra',
          'Stochastic Modelling of Reaction–Diffusion Processes',
          'An Introduction to Stochastic Dynamics']
df = pd.DataFrame(dict(publisher=publishers,
                       headquarter=headquarters,
                       title=titles))
df
[31]:
publisher headquarter title
0 Springer Germany Linear Algebra Done Wrong
1 Springer Gernamy Undergraduate Algebra
2 CUB England Stochastic Modelling of Reaction–Diffusion Pro...
3 CUP England An Introduction to Stochastic Dynamics
[32]:
publisher_ids = [0, 0, 1, 1]
titles = ['Linear Algebra Done Wrong',
          'Undergraduate Algebra',
          'Stochastic Modelling of Reaction–Diffusion Processes',
          'An Introduction to Stochastic Dynamics']
df1 = pd.DataFrame(dict(title=titles, publisher_id=publisher_ids))
df1
[32]:
title publisher_id
0 Linear Algebra Done Wrong 0
1 Undergraduate Algebra 0
2 Stochastic Modelling of Reaction–Diffusion Pro... 1
3 An Introduction to Stochastic Dynamics 1
[33]:
publishers = ['Springer', 'CUP']
headquarters = ['Germany', 'England']
df2 = pd.DataFrame(dict(publisher=publishers,
                       headquarter=headquarters))
df2
[33]:
publisher headquarter
0 Springer Germany
1 CUP England

Third Normal Form (3NF)

  1. No transitive dependencies between non-candidate columns

In the table below, both major and major_description depend on the name (or row number), but major_description only depends on name via the major. This is a transitive dependency and violates 3NF.

[34]:
names = ['ann', 'bob', 'charles', 'david']
ages = [21, 22, 21, 23]
majors = ['math', 'stats', 'bio', 'math']
major_descriptions = ['Mathematics', 'Statistics', 'Biohazards in the University', 'Mathematics']
df = pd.DataFrame(dict(name=names, age=ages, major=majors, major_description=major_descriptions))
df
[34]:
name age major major_description
0 ann 21 math Mathematics
1 bob 22 stats Statistics
2 charles 21 bio Biohazards in the University
3 david 23 math Mathematics
[35]:
names = ['ann', 'bob', 'charles', 'david']
ages = [21, 22, 21, 23]
major_ids = [0,1,2, 0]
df1 = pd.DataFrame(dict(name=names, age=ages, major=major_ids))
df1
[35]:
name age major
0 ann 21 0
1 bob 22 1
2 charles 21 2
3 david 23 0
[36]:
majors = ['math', 'stats', 'bio']
major_descriptions = ['Mathematics', 'Statistics', 'Biohazards in the University']
df2 = pd.DataFrame(dict(major=majors, description=major_descriptions))
df2
[36]:
major description
0 math Mathematics
1 stats Statistics
2 bio Biohazards in the University

OLTP and OLAP

  • OLTP

    • Normalized schema

  • OLAP

    • Denormalized schema

      • Star

        • Facts

        • Dimensions

      • Snowflake

      • Generated from OLTP databases by ETL (Extract-Transform-Load) operations

Denormalized schemas

  • Data lake

  • Data warehouse

  • Data mart

Data marts typically use a star schema that is customized for the analysis needs. For example, the finance department in a hospital may be most interested in Facts about Claims.

These schemas are generally designed to reduce the need for complex joins and to answer queries efficiently. Typically, an Extract-Transform-Load (ETL) script periodically batch-converts data entered into an OLTP database into such a star schema on an OLAP data mart or warehouse.
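
A toy sketch of the Transform step in pandas: normalized tables are joined out into one wide fact table so analysts can query it without runtime joins. All table and column names here are invented:

```python
import pandas as pd

# Normalized OLTP tables (all names invented for illustration)
person = pd.DataFrame({"person_id": [1, 2], "name": ["ann", "bob"],
                       "country_id": ["FR", "CU"]})
country = pd.DataFrame({"country_id": ["FR", "CU"],
                        "country_name": ["France", "Cuba"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "person_id": [1, 1, 2],
                       "amount": [5.0, 7.5, 3.0]})

# Transform: join the lookups out into one wide, denormalized fact table
fact_orders = (orders
               .merge(person, on="person_id")
               .merge(country, on="country_id")
               [["order_id", "name", "country_name", "amount"]])
print(fact_orders)
```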


Generating ER diagrams

Visualizing the entity-relationship diagram (ERD).

[37]:
! python3 -m pip install --quiet eralchemy
[38]:
import os
from eralchemy import render_er

if not os.path.exists('erd_from_sqlalchemy.png'):
    render_er('sqlite:///data/dummy.db', 'erd_from_sqlalchemy.png')

Robustness and scaling

As the database grows, it may be necessary to scale your system. In vertical scaling, you get a machine with more resources (more disk space, more RAM, more CPUs). This is obviously limited and quickly becomes prohibitively expensive. In horizontal scaling, you add more (commodity) machines to grow. Two concepts important for horizontal scaling are replication and sharding.

In replication, you duplicate the entire database over multiple machines. This does not do anything to make your queries faster, but increases robustness since there is no longer a single point of failure.

In sharding you divide the rows of your tables into partitions that are distributed over different machines. This can improve query efficiency since queries can be run in parallel.
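
A toy sketch of hash-based sharding in plain Python, with dicts standing in for separate machines; real systems typically use consistent hashing so shards can be added without rehashing everything:

```python
# Dicts stand in for separate machines; keys are routed by hashing
N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]

def shard_for(key):
    """Route a key to a shard; real systems use consistent hashing."""
    return shards[hash(key) % N_SHARDS]

def insert(key, row):
    shard_for(key)[key] = row

def lookup(key):
    return shard_for(key).get(key)

for pid in range(100):
    insert(pid, {"person_id": pid})

result = lookup(42)   # served by a single shard, no need to scan the others
total = sum(len(s) for s in shards)
print(result, total)  # {'person_id': 42} 100
```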

Sharding is also known as horizontal partitioning. In vertical partitioning, you split the columns into partitions. For relational databases, vertical partitioning is generally hard to achieve unless there is a natural, domain-specific way to logically split the data.

Sharding is technically challenging to achieve with relational databases, and the need to perform horizontal scaling as data sets grew exponentially was a major driver in the development of NoSQL systems.