SQL¶

SQL is a language for getting data from databases. It is also becoming essential to have some basic familiarity with SQL because it is universally used to slice and dice data across many different data storage formats and technologies.

We will only show the use of SQL as a query language here, since that is probably all that a statistician will use SQL for. However, SQL can also be used to create and modify tables, as well as manage database permissions.

SQL via `pandas` DataFrames¶

We will use the pandas-sql package to practice SQL syntax for querying a pandas DataFrame. Later, we will see how to use an actual relatoinal database.

In [1]:

from pandasql import PandaSQL
pdsql = PandaSQL()

In [2]:

tips = sns.load_dataset('tips')
tips.head()

Out[2]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

Using SQL for Queries¶

Note that SQL is case-insensitive, but it is traditional to use ALL CAPS for SQL keywords. It is also standard to end SQL statements with a semi-colon.

In [3]:

pdsql("SELECT * FROM tips LIMIT 5;")

Out[3]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

In [4]:

pdsql("SELECT * FROM tips WHERE sex='Female' LIMIT 5;")

Out[4]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	24.59	3.61	Female	No	Sun	Dinner	4
2	35.26	5.00	Female	No	Sun	Dinner	4
3	14.83	3.02	Female	No	Sun	Dinner	2
4	10.33	1.67	Female	No	Sun	Dinner	3

In [5]:

pdsql("SELECT tip, sex, size FROM tips WHERE total_bill< 10 LIMIT 5;")

Out[5]:

	tip	sex	size
0	2.00	Male	2
1	1.45	Male	2
2	1.32	Male	2
3	1.56	Male	2
4	1.00	Female	1

Ordering¶

In [6]:

query = """
SELECT * FROM tips
WHERE sex='Female' and smoker='Yes'
ORDER BY total_bill ASC
LIMIT 5;
"""
pdsql(query)

Out[6]:

	total_bill	tip	sex	smoker	day	time	size
0	3.07	1.00	Female	Yes	Sat	Dinner	1
1	5.75	1.00	Female	Yes	Fri	Dinner	2
2	9.60	4.00	Female	Yes	Sun	Dinner	2
3	10.09	2.00	Female	Yes	Fri	Lunch	2
4	10.59	1.61	Female	Yes	Sat	Dinner	2

Aggregate queries¶

In [7]:

query = """
SELECT count(*) AS count, max(tip) AS max, min(tip) AS min FROM tips
WHERE size > 1
GROUP BY sex, day
HAVING max < 6
ORDER BY count DESC
LIMIT 5;
"""
pdsql(query)

Out[7]:

	count	max	min
0	31	5.17	1.25
1	18	5.20	1.01
2	9	4.30	1.00
3	9	4.73	1.50

Joins¶

A join occurs when you combine information from two or more database tables, based on information in a column that is common among the tables. As usual, it is easier to understand the concept with examples.

In [8]:

student = pd.read_csv('../data/student.txt')
student

Out[8]:

	student_id	first	last	email	major_id
0	1	frodo	baggins	frodo.baggins@duke.edu	1
1	2	bilbo	baggins	b_baggins@duke.edu	3
2	3	golum	golum	golum.golum@duke.edu	2
3	4	gandalf	white	g.white@duke.edu	5
4	5	gandalf	grey	g.grey@duke.edu	6
5	6	saruman	wise	s.wise@duke.edu	2

In [9]:

cls = pd.read_csv('../data/class.txt')
cls

Out[9]:

	class_id	code	name	credits
0	1	ANT01	Introduction to Hobbits	4
1	2	MAT802	Abstrct Nonsense	8
2	3	ENG234	Jabberwocky	2
3	4	STA007	Statistics for Secret Agens	4
4	5	PHY211	Physics of Star Wars	4

In [10]:

major = pd.read_csv('../data/major.txt')
major

Out[10]:

	major_id	name
0	1	Computer Science
1	2	Physics
2	3	Statisitcs
3	4	English
4	5	History

In [11]:

student_cls = pd.read_csv('../data/student_class.txt')
student_cls

Out[11]:

	student_id	class_id
0	1	3
1	1	4
2	2	1
3	2	4
4	3	1
5	3	2
6	3	3
7	3	5
8	4	2
9	4	5

Matching students and majors¶

Inner join¶

In [12]:

query = """
SELECT s.first, s.last, m.name
FROM student s
INNER JOIN major m
ON s.major_id = m.major_id;
"""
pdsql(query)

Out[12]:

	first	last	name
0	frodo	baggins	Computer Science
1	bilbo	baggins	Statisitcs
2	golum	golum	Physics
3	gandalf	white	History
4	saruman	wise	Physics

Left outer join¶

SQL also has RIGHT OUTER JOIN and FULL OUTER JOIN but these are not currently supported by SQLite3 (the database engine used by pdsql).

In [13]:

query = """
SELECT s.first, s.last, m.name
FROM student s
LEFT OUTER JOIN major m
ON s.major_id = m.major_id;
"""
pdsql(query)

Out[13]:

	first	last	name
0	frodo	baggins	Computer Science
1	bilbo	baggins	Statisitcs
2	golum	golum	Physics
3	gandalf	white	History
4	gandalf	grey	None
5	saruman	wise	Physics

Emulating a full outer join with UNION ALL¶

Only necessary if the database does not proivde FULL OUTER JOIN

In [14]:

query = """
SELECT s.first, s.last, m.name
FROM   student s
       LEFT JOIN major m
          ON s.major_id = m.major_id
UNION All
SELECT s.first, s.last, m.name
FROM   major m
       LEFT JOIN student s
          ON s.major_id = m.major_id
WHERE  s.major_id IS NULL;
"""
pdsql(query)

Out[14]:

	first	last	name
0	frodo	baggins	Computer Science
1	bilbo	baggins	Statisitcs
2	golum	golum	Physics
3	gandalf	white	History
4	gandalf	grey	None
5	saruman	wise	Physics
6	None	None	English

Using linker tables to match students to classes (a MANY TO MANY join)¶

In [15]:

query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
INNER JOIN student_cls sc ON s.student_id = sc.student_id
INNER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)

Out[15]:

	first	last	code	name	credits
0	frodo	baggins	ENG234	Jabberwocky	2
1	frodo	baggins	STA007	Statistics for Secret Agens	4
2	bilbo	baggins	ANT01	Introduction to Hobbits	4
3	bilbo	baggins	STA007	Statistics for Secret Agens	4
4	golum	golum	ANT01	Introduction to Hobbits	4
5	golum	golum	MAT802	Abstrct Nonsense	8
6	golum	golum	ENG234	Jabberwocky	2
7	golum	golum	PHY211	Physics of Star Wars	4
8	gandalf	white	MAT802	Abstrct Nonsense	8
9	gandalf	white	PHY211	Physics of Star Wars	4

Same thing but including students with no majors¶

In [16]:

query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
LEFT OUTER JOIN student_cls sc ON s.student_id = sc.student_id
LEFT OUTER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)

Out[16]:

	first	last	code	name	credits
0	frodo	baggins	ENG234	Jabberwocky	2
1	frodo	baggins	STA007	Statistics for Secret Agens	4
2	bilbo	baggins	ANT01	Introduction to Hobbits	4
3	bilbo	baggins	STA007	Statistics for Secret Agens	4
4	golum	golum	ANT01	Introduction to Hobbits	4
5	golum	golum	MAT802	Abstrct Nonsense	8
6	golum	golum	ENG234	Jabberwocky	2
7	golum	golum	PHY211	Physics of Star Wars	4
8	gandalf	white	MAT802	Abstrct Nonsense	8
9	gandalf	white	PHY211	Physics of Star Wars	4
10	gandalf	grey	None	None	NaN
11	saruman	wise	None	None	NaN

Using SQLite3¶

SQLite3 is part of the standard library. However, the mechanics of using essentially any database in Python is similar, because of the Python DB-API.

In [17]:

import sqlite3

In [18]:

c = sqlite3.connect('../data/Chinook_Sqlite.sqlite')

SQLite specific commands to get metadata¶

Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You’ll have to read the docs to find out what is needed for your SQL flavor.

What tables are there in the database?¶

In [19]:

list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))

Out[19]:

[('Album',),
 ('Artist',),
 ('Customer',),
 ('Employee',),
 ('Genre',),
 ('Invoice',),
 ('InvoiceLine',),
 ('MediaType',),
 ('Playlist',),
 ('PlaylistTrack',),
 ('Track',)]

What are the columns of the table “Album”?¶

In [20]:

list(c.execute("PRAGMA table_info(Album);"))

Out[20]:

[(0, 'AlbumId', 'INTEGER', 1, None, 1),
 (1, 'Title', 'NVARCHAR(160)', 1, None, 0),
 (2, 'ArtistId', 'INTEGER', 1, None, 0)]

Standard SQL statements with parameter substitution¶

Note: Using Python string substitution for Python defined parameters is dangerous because of the risk of SQL injection attacks. Use parameter substitution with ? instead.

Do this¶

In [21]:

t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))

Out[21]:

[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]

Not this¶

In [22]:

t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))

Out[22]:

[(1, 'For Those About To Rock We Salute You', 1),
 (4, 'Let There Be Rock', 1),
 (59, 'Deep Purple In Rock', 58),
 (108, 'Rock In Rio [CD1]', 90),
 (109, 'Rock In Rio [CD2]', 90)]

User defined functions¶

Sometiems it is useful to have custom functions that run on the database server rather than on the client. These are called User Defined Functions or UDF. How do do this varies wtih the database used, but it is fairly simple with Python and SQLite.

In [23]:

def encode(text, offset):
    """Caesar cipher of text with given offset."""
    from string import ascii_lowercase, ascii_uppercase
    tbl = dict(zip(map(ord, ascii_lowercase + ascii_uppercase),
               ascii_lowercase[offset:] + ascii_lowercase[:offset] +
               ascii_uppercase[offset:] + ascii_uppercase[:offset]))
    return text.translate(tbl)

In [24]:

c.create_function("encode", 2, encode)

In [25]:

list(c.execute("SELECT Title, encode(Title, 3) FROM Album limit 5;"))

Out[25]:

[('For Those About To Rock We Salute You',
  'Iru Wkrvh Derxw Wr Urfn Zh Vdoxwh Brx'),
 ('Balls to the Wall', 'Edoov wr wkh Zdoo'),
 ('Restless and Wild', 'Uhvwohvv dqg Zlog'),
 ('Let There Be Rock', 'Ohw Wkhuh Eh Urfn'),
 ('Big Ones', 'Elj Rqhv')]

We cna also add aggregate UDFs similar to SQL MIN, SUM, COUNT etc. Aggregate UDFs require you to write a class __init__, step and finalize methods.

In [26]:

class CV:
    """Aggregate UDF for coefficient of varation in %."""
    import numpy as np

    def __init__(self):
        self.s = []

    def step(self, value):
        self.s.append(value)

    def finalize(self):
        if len(self.s) < 2:
            return 0
        else:
            return 100.0*np.std(self.s)/np.mean(self.s)

In [27]:

c.create_aggregate("cv", 1, CV)

In [28]:

list(c.execute("PRAGMA table_info(Invoice);"))

Out[28]:

[(0, 'InvoiceId', 'INTEGER', 1, None, 1),
 (1, 'CustomerId', 'INTEGER', 1, None, 0),
 (2, 'InvoiceDate', 'DATETIME', 1, None, 0),
 (3, 'BillingAddress', 'NVARCHAR(70)', 0, None, 0),
 (4, 'BillingCity', 'NVARCHAR(40)', 0, None, 0),
 (5, 'BillingState', 'NVARCHAR(40)', 0, None, 0),
 (6, 'BillingCountry', 'NVARCHAR(40)', 0, None, 0),
 (7, 'BillingPostalCode', 'NVARCHAR(10)', 0, None, 0),
 (8, 'Total', 'NUMERIC(10,2)', 1, None, 0)]

In [29]:

list(c.execute("SELECT cv(Total) from Invoice limit 10;"))

Out[29]:

[(83.85715075292478,)]

In [30]:

c.close()

Using SQL magic functions¶

We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.

In [31]:

import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    %load_ext sql

Configuring the SqlMagic extension¶

In [32]:

%config SqlMagic

SqlMagic options
--------------
SqlMagic.autolimit=<Int>
    Current: 0
    Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
    Current: False
    Return Pandas DataFrames instead of regular result sets
SqlMagic.displaylimit=<Int>
    Current: 0
    Automatically limit the number of rows displayed (full result set is still
    stored)
SqlMagic.dsn_filename=<Unicode>
    Current: 'odbc.ini'
    Path to DSN file. When the first argument is of the form [section], a
    sqlalchemy connection string is formed from the matching section in the DSN
    file.
SqlMagic.feedback=<Bool>
    Current: True
    Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
    Current: True
    Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
    Current: 'DEFAULT'
    Set the table printing style to any of prettytable's defined styles
    (currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)

In [33]:

%config SqlMagic.displaylimit=10

Connect to SQLite3 database¶

In [34]:

%sql sqlite:///../data/Chinook_Sqlite.sqlite

Out[34]:

'Connected: None@../data/Chinook_Sqlite.sqlite'

Other databases¶

See SQLAlchemy connection strings for how to connect to oher databases such as Oracle, MySQL or PostgreSQL.

Line magic¶

In [35]:

%sql SELECT * from Album LIMIT 5;

Done.

Out[35]:

AlbumId	Title	ArtistId
1	For Those About To Rock We Salute You	1
2	Balls to the Wall	2
3	Restless and Wild	2
4	Let There Be Rock	1
5	Big Ones	3

In [36]:

%sql SELECT * from Artist LIMIT 5;

Done.

Out[36]:

ArtistId	Name
1	AC/DC
2	Accept
3	Aerosmith
4	Alanis Morissette
5	Alice In Chains

Cell magic¶

In [37]:

%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;

Done.

Out[37]:

Name	Title
AC/DC	For Those About To Rock We Salute You
AC/DC	Let There Be Rock
Aaron Copland & London Symphony Orchestra	A Copland Celebration, Vol. I
Aaron Goldberg	Worlds
Academy of St. Martin in the Fields & Sir Neville Marriner	The World of Classical Favourites

You can assign results of queries to Python names¶

In [38]:

result = %sql SELECT * from Album;

Done.

In [39]:

type(result)

Out[39]:

sql.run.ResultSet

Resutls behave like lists¶

In [40]:

result[2:4]

Out[40]:

[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]

You can use Python variables in your queires.¶

Use :varname where you want to use a Python variable in your query.

In [41]:

artist_id = 10

%sql select * from Artist where ArtistId < :artist_id;

Done.

Out[41]:

ArtistId	Name
1	AC/DC
2	Accept
3	Aerosmith
4	Alanis Morissette
5	Alice In Chains
6	Antônio Carlos Jobim
7	Apocalyptica
8	Audioslave
9	BackBeat

In [42]:

word = '%rock%'

%sql select * from Album WHERE Title LIKE :word;

Done.

Out[42]:

AlbumId	Title	ArtistId
1	For Those About To Rock We Salute You	1
4	Let There Be Rock	1
59	Deep Purple In Rock	58
108	Rock In Rio [CD1]	90
109	Rock In Rio [CD2]	90
213	Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK]	139
216	Hot Rocks, 1964-1971 (Disc 1)	142

Convert to `pandas` dataframe¶

In [43]:

df = result.DataFrame()
df.head(5)

Out[43]:

	AlbumId	Title	ArtistId
0	1	For Those About To Rock We Salute You	1
1	2	Balls to the Wall	2
2	3	Restless and Wild	2
3	4	Let There Be Rock	1
4	5	Big Ones	3

In [ ]:

SQL¶

SQL via `pandas` DataFrames¶

Using SQL for Queries¶

Ordering¶

Aggregate queries¶

Joins¶

Matching students and majors¶

Inner join¶

Left outer join¶

Emulating a full outer join with UNION ALL¶

Using linker tables to match students to classes (a MANY TO MANY join)¶

Same thing but including students with no majors¶

Using SQLite3¶

SQLite specific commands to get metadata¶

What tables are there in the database?¶

What are the columns of the table “Album”?¶

Standard SQL statements with parameter substitution¶

Do this¶

Not this¶

User defined functions¶

Using SQL magic functions¶

Configuring the SqlMagic extension¶

Connect to SQLite3 database¶

Other databases¶

Line magic¶

Cell magic¶

You can assign results of queries to Python names¶

Resutls behave like lists¶

You can use Python variables in your queires.¶

Convert to `pandas` dataframe¶

Page contents

Previous page

Next page

This Page

SQL¶

SQL via pandas DataFrames¶

Using SQL for Queries¶

Ordering¶

Aggregate queries¶

Joins¶

Matching students and majors¶

Inner join¶

Left outer join¶

Emulating a full outer join with UNION ALL¶

Using linker tables to match students to classes (a MANY TO MANY join)¶

Same thing but including students with no majors¶

Using SQLite3¶

SQLite specific commands to get metadata¶

What tables are there in the database?¶

What are the columns of the table “Album”?¶

Standard SQL statements with parameter substitution¶

Do this¶

Not this¶

User defined functions¶

Using SQL magic functions¶

Configuring the SqlMagic extension¶

Connect to SQLite3 database¶

Other databases¶

Line magic¶

Cell magic¶

You can assign results of queries to Python names¶

Resutls behave like lists¶

You can use Python variables in your queires.¶

Convert to pandas dataframe¶

SQL via `pandas` DataFrames¶

Convert to `pandas` dataframe¶