{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# S01 Relational Databases\n", "\n", "For a simple tutorial on database design, see [Introduction to Database Design](https://www.datanamic.com/support/lt-dez005-introduction-db-modeling.html)\n", "\n", "For a deep dive, see [Database Design for Mere Mortals](https://www.amazon.com/Database-Design-Mere-Mortals-Hands/dp/0321884493/ref=dp_ob_title_bk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Packages for working with relational databases in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Python Database API Specification v2.0](https://www.python.org/dev/peps/pep-0249/) - The standard Python Database API\n", "- [sqlite3](https://docs.python.org/3.7/library/sqlite3.html) - API for builit-in `sqlite3` package\n", "- [Database drivers](https://github.com/vinta/awesome-python#database-drivers) - For connecting to other databases\n", "- [ipython-sql](https://github.com/catherinedevlin/ipython-sql) - SQL magic in Jupyter\n", "- [SQLAlchemy](https://www.sqlalchemy.org) - Most well-known Object Relational Mapper (ORM)\n", "- [Pony ORM](https://ponyorm.com) - Alternative ORM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Motivation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why relational databases and SQL?\n", "\n", "- History of databases\n", "- ACID\n", "- Data integrity\n", "- Schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. RDBMS\n", "\n", "- Memory\n", "- Storage\n", "- Dictionary\n", "- Query language" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Anatomy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table (Relation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Represents a *subject* or an *event*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column (Attribute)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Represents a single *variable* or *feature*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Row (Tuple)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Represents an *observation*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Concepts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Constraints\n", "\n", "You can impose constraints that values in a column have to take. For example, you can specify that values are compulsory (NOT NULL), or UNIQUE or fall within a certain range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Referential integrity" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Primary key represents a unique identifier of a row. It may be simple or composite.\n", " - Unique\n", " - Non-null\n", " - Never optional\n", "- Foreign key is a column containing the primary key of a different table. It enforces *referential integrity*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relationships" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- One to one\n", "- One to many\n", "- Many to many\n", "\n", "- What happens on delete?\n", " - Restrict\n", " - Cascade" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexes\n", "\n", "An index is a data structure that allows fast search of a column (typically from linear to log time complexity). Most databases will automatically build an index for every primary key column, but you can also manually specify columns to build indexes for. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Views" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Temporary virtual table retuned as a result of a *query*.\n", "- Views only specify the strucutre of a table - the contents are constructed on the fly from existing tables.\n", "- Queries return a Result Set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Design" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Use singlular form for name \n", "- Use informative names\n", "- Use unique names not shared by any other table (except foreign keys)\n", "- Column must be an attribute of the table's subject\n", "- Eliminate multi-part columns\n", "- Eliminate multi-value columsn\n", "- Eliminate redundant columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Use singular/plural forms for name (controversial)\n", "- Enusre every table has a primary key\n", "- Eliminate duplicate columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relationships" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Establish participation type and degree of relationship\n", " - One to one\n", " - One to many\n", " - Many to many" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `sqlmagic` as alternative to using `sqlite3` driver." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%load_ext sql" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Connect to SQLite3 database on disk (creates it if it does not exist)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Connected: None@data/dummy.db'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%sql sqlite:///data/dummy.db" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL for table deletion and creation" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n", "Done.\n", "Done.\n", "Done.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "DROP TABLE IF EXISTS Country;\n", "DROP TABLE IF EXISTS Person;\n", "\n", "CREATE TABLE Country (\n", " country_id varcarh(2) PRIMARY KEY,\n", " country_name varchar(255)\n", ");\n", "\n", "CREATE TABLE Person (\n", " person_id INTEGER PRIMARY KEY,\n", " person_first varchar(255),\n", " person_last varchar(255),\n", " country_id INTEGER NOT NULL,\n", " FOREIGN KEY (country_id) REFERENCES Country(country_id)\n", ");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL to insert rows." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 rows affected.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "INSERT INTO Country(country_id, country_name) \n", "VALUES ('FR', 'France'), ('CU', 'CUBA');" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 rows affected.\n" ] }, { "data": { "text/plain": [ "[]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "INSERT INTO Person(person_first, person_last, country_id) \n", "VALUES \n", "('Napolean', 'Bonaparte', 'FR'),\n", "('Luis','Alvarez', 'CU');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accessing the RDBMS dictionary." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name
Country
Person
" ], "text/plain": [ "[('Country',), ('Person',)]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT name FROM sqlite_master \n", "WHERE type = \"table\";" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
sql
CREATE TABLE Person (
person_id INTEGER PRIMARY KEY,
person_first varchar(255),
person_last varchar(255),
country_id INTEGER NOT NULL,
FOREIGN KEY (country_id) REFERENCES Country(country_id)
)
" ], "text/plain": [ "[('CREATE TABLE Person (\\n person_id INTEGER PRIMARY KEY,\\n person_first varchar(255),\\n person_last varchar(255),\\n country_id INTEGER NOT NULL,\\n FOREIGN KEY (country_id) REFERENCES Country(country_id)\\n)',)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT sql FROM sqlite_master \n", "WHERE name='Person';" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SQL as a Query Language." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Done.\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
firstlastnationality
NapoleanBonaparteFrance
LuisAlvarezCUBA
" ], "text/plain": [ "[('Napolean', 'Bonaparte', 'France'), ('Luis', 'Alvarez', 'CUBA')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%sql\n", "\n", "SELECT person_first as first, person_last AS last, country_name AS nationality\n", "FROM Person \n", "INNER JOIN country \n", "ON Person.country_id = Country.country_id;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualizing the entitry-relationship diagram (ERd)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import os\n", "from eralchemy import render_er\n", "\n", "if not os.path.exists('erd_from_sqlalchemy.png'):\n", " render_er('sqlite:///data/dummy.db', 'erd_from_sqlalchemy.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](erd_from_sqlalchemy.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework walk-through\n", "\n", "Convert the flat file data in `data/flat.csv` into a well-structured relational database in SQLite3 stored as `data/faculty.db`. Note - salary information is confidential and should be kept in a separate table from other personal data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namegenderageheightweightsalarynationalitycodecountrylanguage1language2language3firstlast
7Adan BrownMale491.816863000BelgianBEBelgiumF#AdanBrown
859Lester MonroeMale181.815843000AustralianAUAustraliaLuaGoLesterMonroe
28Aleshia ReillyFemale611.768697000SwissCHSwitzerlandAleshiaReilly
743Kathaleen FaulknerFemale521.977356000JapaneseJPJapanPowerShellKathaleenFaulkner
255Collin KaneMale471.876584000VenezuelanVEVenezuelaCollinKane
800Ladawn KochFemale241.835064000GermanDEGermanyAutoItJ#TypeScriptLadawnKoch
122Bari ColonFemale651.6062116000UkrainianUAUkrainePowerShellTclBariColon
253Coletta VegaFemale571.874494000BrazilianBRBrazilC++ColettaVega
1343Stephan McfaddenMale401.7454149000CubanCUCubaAwkStephanMcfadden
551Gonzalo HouseMale421.8144114000DanishDKDenmarkAutoItPerlGonzaloHouse
\n", "
" ], "text/plain": [ " name gender age height weight salary nationality \\\n", "7 Adan Brown Male 49 1.81 68 63000 Belgian \n", "859 Lester Monroe Male 18 1.81 58 43000 Australian \n", "28 Aleshia Reilly Female 61 1.76 86 97000 Swiss \n", "743 Kathaleen Faulkner Female 52 1.97 73 56000 Japanese \n", "255 Collin Kane Male 47 1.87 65 84000 Venezuelan \n", "800 Ladawn Koch Female 24 1.83 50 64000 German \n", "122 Bari Colon Female 65 1.60 62 116000 Ukrainian \n", "253 Coletta Vega Female 57 1.87 44 94000 Brazilian \n", "1343 Stephan Mcfadden Male 40 1.74 54 149000 Cuban \n", "551 Gonzalo House Male 42 1.81 44 114000 Danish \n", "\n", " code country language1 language2 language3 first last \n", "7 BE Belgium F# Adan Brown \n", "859 AU Australia Lua Go Lester Monroe \n", "28 CH Switzerland Aleshia Reilly \n", "743 JP Japan PowerShell Kathaleen Faulkner \n", "255 VE Venezuela Collin Kane \n", "800 DE Germany AutoIt J# TypeScript Ladawn Koch \n", "122 UA Ukraine PowerShell Tcl Bari Colon \n", "253 BR Brazil C++ Coletta Vega \n", "1343 CU Cuba Awk Stephan Mcfadden \n", "551 DK Denmark AutoIt Perl Gonzalo House " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "flat = pd.read_csv('data/flat.csv', keep_default_na=False)\n", "flat.sample(10)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }