SQL¶
SQL is a language for getting data from databases. It is also becoming essential to have some basic familiarity with SQL because it is universally used to slice and dice data across many different data storage formats and technologies.
We will only show the use of SQL as a query language here, since that is probably all that a statistician will use SQL for. However, SQL can also be used to create and modify tables, as well as manage database permissions.
SQL via pandas
DataFrames¶
We will use the pandas-sql
package to practice SQL syntax for querying a pandas
DataFrame.
Later, we will see how to use an actual relatoinal database.
In [1]:
from pandasql import PandaSQL
pdsql = PandaSQL()
In [2]:
tips = sns.load_dataset('tips')
tips.head()
Out[2]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Using SQL for Queries¶
Note that SQL is case-insensitive, but it is traditional to use ALL CAPS for SQL keywords. It is also standard to end SQL statements with a semi-colon.
In [3]:
pdsql("SELECT * FROM tips LIMIT 5;")
Out[3]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
In [4]:
pdsql("SELECT * FROM tips WHERE sex='Female' LIMIT 5;")
Out[4]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
2 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
3 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
4 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
In [5]:
pdsql("SELECT tip, sex, size FROM tips WHERE total_bill< 10 LIMIT 5;")
Out[5]:
tip | sex | size | |
---|---|---|---|
0 | 2.00 | Male | 2 |
1 | 1.45 | Male | 2 |
2 | 1.32 | Male | 2 |
3 | 1.56 | Male | 2 |
4 | 1.00 | Female | 1 |
Ordering¶
In [6]:
query = """
SELECT * FROM tips
WHERE sex='Female' and smoker='Yes'
ORDER BY total_bill ASC
LIMIT 5;
"""
pdsql(query)
Out[6]:
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 |
1 | 5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 |
2 | 9.60 | 4.00 | Female | Yes | Sun | Dinner | 2 |
3 | 10.09 | 2.00 | Female | Yes | Fri | Lunch | 2 |
4 | 10.59 | 1.61 | Female | Yes | Sat | Dinner | 2 |
Aggregate queries¶
In [7]:
query = """
SELECT count(*) AS count, max(tip) AS max, min(tip) AS min FROM tips
WHERE size > 1
GROUP BY sex, day
HAVING max < 6
ORDER BY count DESC
LIMIT 5;
"""
pdsql(query)
Out[7]:
count | max | min | |
---|---|---|---|
0 | 31 | 5.17 | 1.25 |
1 | 18 | 5.20 | 1.01 |
2 | 9 | 4.30 | 1.00 |
3 | 9 | 4.73 | 1.50 |
Joins¶
A join occurs when you combine information from two or more database tables, based on information in a column that is common among the tables. As usual, it is easier to understand the concept with examples.
In [8]:
student = pd.read_csv('../data/student.txt')
student
Out[8]:
student_id | first | last | major_id | ||
---|---|---|---|---|---|
0 | 1 | frodo | baggins | frodo.baggins@duke.edu | 1 |
1 | 2 | bilbo | baggins | b_baggins@duke.edu | 3 |
2 | 3 | golum | golum | golum.golum@duke.edu | 2 |
3 | 4 | gandalf | white | g.white@duke.edu | 5 |
4 | 5 | gandalf | grey | g.grey@duke.edu | 6 |
5 | 6 | saruman | wise | s.wise@duke.edu | 2 |
In [9]:
cls = pd.read_csv('../data/class.txt')
cls
Out[9]:
class_id | code | name | credits | |
---|---|---|---|---|
0 | 1 | ANT01 | Introduction to Hobbits | 4 |
1 | 2 | MAT802 | Abstrct Nonsense | 8 |
2 | 3 | ENG234 | Jabberwocky | 2 |
3 | 4 | STA007 | Statistics for Secret Agens | 4 |
4 | 5 | PHY211 | Physics of Star Wars | 4 |
In [10]:
major = pd.read_csv('../data/major.txt')
major
Out[10]:
major_id | name | |
---|---|---|
0 | 1 | Computer Science |
1 | 2 | Physics |
2 | 3 | Statisitcs |
3 | 4 | English |
4 | 5 | History |
In [11]:
student_cls = pd.read_csv('../data/student_class.txt')
student_cls
Out[11]:
student_id | class_id | |
---|---|---|
0 | 1 | 3 |
1 | 1 | 4 |
2 | 2 | 1 |
3 | 2 | 4 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 3 | 3 |
7 | 3 | 5 |
8 | 4 | 2 |
9 | 4 | 5 |
Matching students and majors¶
Inner join¶
In [12]:
query = """
SELECT s.first, s.last, m.name
FROM student s
INNER JOIN major m
ON s.major_id = m.major_id;
"""
pdsql(query)
Out[12]:
first | last | name | |
---|---|---|---|
0 | frodo | baggins | Computer Science |
1 | bilbo | baggins | Statisitcs |
2 | golum | golum | Physics |
3 | gandalf | white | History |
4 | saruman | wise | Physics |
Left outer join¶
SQL also has RIGHT OUTER JOIN and FULL OUTER JOIN but these are not
currently supported by SQLite3 (the database engine used by pdsql
).
In [13]:
query = """
SELECT s.first, s.last, m.name
FROM student s
LEFT OUTER JOIN major m
ON s.major_id = m.major_id;
"""
pdsql(query)
Out[13]:
first | last | name | |
---|---|---|---|
0 | frodo | baggins | Computer Science |
1 | bilbo | baggins | Statisitcs |
2 | golum | golum | Physics |
3 | gandalf | white | History |
4 | gandalf | grey | None |
5 | saruman | wise | Physics |
Emulating a full outer join with UNION ALL¶
Only necessary if the database does not proivde FULL OUTER JOIN
In [14]:
query = """
SELECT s.first, s.last, m.name
FROM student s
LEFT JOIN major m
ON s.major_id = m.major_id
UNION All
SELECT s.first, s.last, m.name
FROM major m
LEFT JOIN student s
ON s.major_id = m.major_id
WHERE s.major_id IS NULL;
"""
pdsql(query)
Out[14]:
first | last | name | |
---|---|---|---|
0 | frodo | baggins | Computer Science |
1 | bilbo | baggins | Statisitcs |
2 | golum | golum | Physics |
3 | gandalf | white | History |
4 | gandalf | grey | None |
5 | saruman | wise | Physics |
6 | None | None | English |
Using linker tables to match students to classes (a MANY TO MANY join)¶
In [15]:
query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
INNER JOIN student_cls sc ON s.student_id = sc.student_id
INNER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)
Out[15]:
first | last | code | name | credits | |
---|---|---|---|---|---|
0 | frodo | baggins | ENG234 | Jabberwocky | 2 |
1 | frodo | baggins | STA007 | Statistics for Secret Agens | 4 |
2 | bilbo | baggins | ANT01 | Introduction to Hobbits | 4 |
3 | bilbo | baggins | STA007 | Statistics for Secret Agens | 4 |
4 | golum | golum | ANT01 | Introduction to Hobbits | 4 |
5 | golum | golum | MAT802 | Abstrct Nonsense | 8 |
6 | golum | golum | ENG234 | Jabberwocky | 2 |
7 | golum | golum | PHY211 | Physics of Star Wars | 4 |
8 | gandalf | white | MAT802 | Abstrct Nonsense | 8 |
9 | gandalf | white | PHY211 | Physics of Star Wars | 4 |
Same thing but including students with no majors¶
In [16]:
query = """
SELECT s.first, s.last, c.code, c.name, c.credits
FROM student s
LEFT OUTER JOIN student_cls sc ON s.student_id = sc.student_id
LEFT OUTER JOIN cls c ON c.class_id = sc.class_id;
"""
pdsql(query)
Out[16]:
first | last | code | name | credits | |
---|---|---|---|---|---|
0 | frodo | baggins | ENG234 | Jabberwocky | 2 |
1 | frodo | baggins | STA007 | Statistics for Secret Agens | 4 |
2 | bilbo | baggins | ANT01 | Introduction to Hobbits | 4 |
3 | bilbo | baggins | STA007 | Statistics for Secret Agens | 4 |
4 | golum | golum | ANT01 | Introduction to Hobbits | 4 |
5 | golum | golum | MAT802 | Abstrct Nonsense | 8 |
6 | golum | golum | ENG234 | Jabberwocky | 2 |
7 | golum | golum | PHY211 | Physics of Star Wars | 4 |
8 | gandalf | white | MAT802 | Abstrct Nonsense | 8 |
9 | gandalf | white | PHY211 | Physics of Star Wars | 4 |
10 | gandalf | grey | None | None | NaN |
11 | saruman | wise | None | None | NaN |
Using SQLite3¶
SQLite3 is part of the standard library. However, the mechanics of using essentially any database in Python is similar, because of the Python DB-API.
In [17]:
import sqlite3
In [18]:
c = sqlite3.connect('../data/Chinook_Sqlite.sqlite')
SQLite specific commands to get metadata¶
Unlike SQL syntax for queries, how you get metadata from a relational database is vendor-specific. You’ll have to read the docs to find out what is needed for your SQL flavor.
What tables are there in the database?¶
In [19]:
list(c.execute("SELECT name FROM sqlite_master WHERE type='table';"))
Out[19]:
[('Album',),
('Artist',),
('Customer',),
('Employee',),
('Genre',),
('Invoice',),
('InvoiceLine',),
('MediaType',),
('Playlist',),
('PlaylistTrack',),
('Track',)]
What are the columns of the table “Album”?¶
In [20]:
list(c.execute("PRAGMA table_info(Album);"))
Out[20]:
[(0, 'AlbumId', 'INTEGER', 1, None, 1),
(1, 'Title', 'NVARCHAR(160)', 1, None, 0),
(2, 'ArtistId', 'INTEGER', 1, None, 0)]
Standard SQL statements with parameter substitution¶
Note: Using Python string substitution for Python defined parameters is
dangerous because of the risk of SQL injection
attacks. Use parameter substitution with ?
instead.
Do this¶
In [21]:
t = ['%rock%']
list(c.execute("SELECT * FROM Album WHERE Title like ? LIMIT 5;", t))
Out[21]:
[(1, 'For Those About To Rock We Salute You', 1),
(4, 'Let There Be Rock', 1),
(59, 'Deep Purple In Rock', 58),
(108, 'Rock In Rio [CD1]', 90),
(109, 'Rock In Rio [CD2]', 90)]
Not this¶
In [22]:
t = ["'%rock%'"]
list(c.execute("SELECT * FROM Album WHERE Title like %s LIMIT 5;" % t[0]))
Out[22]:
[(1, 'For Those About To Rock We Salute You', 1),
(4, 'Let There Be Rock', 1),
(59, 'Deep Purple In Rock', 58),
(108, 'Rock In Rio [CD1]', 90),
(109, 'Rock In Rio [CD2]', 90)]
User defined functions¶
Sometiems it is useful to have custom functions that run on the database server rather than on the client. These are called User Defined Functions or UDF. How do do this varies wtih the database used, but it is fairly simple with Python and SQLite.
In [23]:
def encode(text, offset):
"""Caesar cipher of text with given offset."""
from string import ascii_lowercase, ascii_uppercase
tbl = dict(zip(map(ord, ascii_lowercase + ascii_uppercase),
ascii_lowercase[offset:] + ascii_lowercase[:offset] +
ascii_uppercase[offset:] + ascii_uppercase[:offset]))
return text.translate(tbl)
In [24]:
c.create_function("encode", 2, encode)
In [25]:
list(c.execute("SELECT Title, encode(Title, 3) FROM Album limit 5;"))
Out[25]:
[('For Those About To Rock We Salute You',
'Iru Wkrvh Derxw Wr Urfn Zh Vdoxwh Brx'),
('Balls to the Wall', 'Edoov wr wkh Zdoo'),
('Restless and Wild', 'Uhvwohvv dqg Zlog'),
('Let There Be Rock', 'Ohw Wkhuh Eh Urfn'),
('Big Ones', 'Elj Rqhv')]
We cna also add aggregate UDFs similar to SQL MIN, SUM, COUNT etc.
Aggregate UDFs require you to write a class __init__
, step
and
finalize
methods.
In [26]:
class CV:
"""Aggregate UDF for coefficient of varation in %."""
import numpy as np
def __init__(self):
self.s = []
def step(self, value):
self.s.append(value)
def finalize(self):
if len(self.s) < 2:
return 0
else:
return 100.0*np.std(self.s)/np.mean(self.s)
In [27]:
c.create_aggregate("cv", 1, CV)
In [28]:
list(c.execute("PRAGMA table_info(Invoice);"))
Out[28]:
[(0, 'InvoiceId', 'INTEGER', 1, None, 1),
(1, 'CustomerId', 'INTEGER', 1, None, 0),
(2, 'InvoiceDate', 'DATETIME', 1, None, 0),
(3, 'BillingAddress', 'NVARCHAR(70)', 0, None, 0),
(4, 'BillingCity', 'NVARCHAR(40)', 0, None, 0),
(5, 'BillingState', 'NVARCHAR(40)', 0, None, 0),
(6, 'BillingCountry', 'NVARCHAR(40)', 0, None, 0),
(7, 'BillingPostalCode', 'NVARCHAR(10)', 0, None, 0),
(8, 'Total', 'NUMERIC(10,2)', 1, None, 0)]
In [29]:
list(c.execute("SELECT cv(Total) from Invoice limit 10;"))
Out[29]:
[(83.85715075292478,)]
In [30]:
c.close()
Using SQL magic functions¶
We will use the ipython-sql notebook extension for convenience. This will only work in notebooks and IPython scripts with the .ipy extension.
In [31]:
import warnings
with warnings.catch_warnings():
warnings.simplefilter('ignore')
%load_ext sql
Configuring the SqlMagic extension¶
In [32]:
%config SqlMagic
SqlMagic options
--------------
SqlMagic.autolimit=<Int>
Current: 0
Automatically limit the size of the returned result sets
SqlMagic.autopandas=<Bool>
Current: False
Return Pandas DataFrames instead of regular result sets
SqlMagic.displaylimit=<Int>
Current: 0
Automatically limit the number of rows displayed (full result set is still
stored)
SqlMagic.dsn_filename=<Unicode>
Current: 'odbc.ini'
Path to DSN file. When the first argument is of the form [section], a
sqlalchemy connection string is formed from the matching section in the DSN
file.
SqlMagic.feedback=<Bool>
Current: True
Print number of rows affected by DML
SqlMagic.short_errors=<Bool>
Current: True
Don't display the full traceback on SQL Programming Error
SqlMagic.style=<Unicode>
Current: 'DEFAULT'
Set the table printing style to any of prettytable's defined styles
(currently DEFAULT, MSWORD_FRIENDLY, PLAIN_COLUMNS, RANDOM)
In [33]:
%config SqlMagic.displaylimit=10
Connect to SQLite3 database¶
In [34]:
%sql sqlite:///../data/Chinook_Sqlite.sqlite
Out[34]:
'Connected: None@../data/Chinook_Sqlite.sqlite'
Other databases¶
See SQLAlchemy connection strings for how to connect to oher databases such as Oracle, MySQL or PostgreSQL.
Line magic¶
In [35]:
%sql SELECT * from Album LIMIT 5;
Done.
Out[35]:
AlbumId | Title | ArtistId |
---|---|---|
1 | For Those About To Rock We Salute You | 1 |
2 | Balls to the Wall | 2 |
3 | Restless and Wild | 2 |
4 | Let There Be Rock | 1 |
5 | Big Ones | 3 |
In [36]:
%sql SELECT * from Artist LIMIT 5;
Done.
Out[36]:
ArtistId | Name |
---|---|
1 | AC/DC |
2 | Accept |
3 | Aerosmith |
4 | Alanis Morissette |
5 | Alice In Chains |
Cell magic¶
In [37]:
%%sql
SELECT Artist.Name, Album.Title
FROM Album
INNER JOIN Artist on Album.ArtistId = Artist.ArtistId
ORDER BY Artist.Name ASC
LIMIT 5;
Done.
Out[37]:
Name | Title |
---|---|
AC/DC | For Those About To Rock We Salute You |
AC/DC | Let There Be Rock |
Aaron Copland & London Symphony Orchestra | A Copland Celebration, Vol. I |
Aaron Goldberg | Worlds |
Academy of St. Martin in the Fields & Sir Neville Marriner | The World of Classical Favourites |
You can assign results of queries to Python names¶
In [38]:
result = %sql SELECT * from Album;
Done.
In [39]:
type(result)
Out[39]:
sql.run.ResultSet
Resutls behave like lists¶
In [40]:
result[2:4]
Out[40]:
[(3, 'Restless and Wild', 2), (4, 'Let There Be Rock', 1)]
You can use Python variables in your queires.¶
Use :varname
where you want to use a Python variable in your query.
In [41]:
artist_id = 10
%sql select * from Artist where ArtistId < :artist_id;
Done.
Out[41]:
ArtistId | Name |
---|---|
1 | AC/DC |
2 | Accept |
3 | Aerosmith |
4 | Alanis Morissette |
5 | Alice In Chains |
6 | Antônio Carlos Jobim |
7 | Apocalyptica |
8 | Audioslave |
9 | BackBeat |
In [42]:
word = '%rock%'
%sql select * from Album WHERE Title LIKE :word;
Done.
Out[42]:
AlbumId | Title | ArtistId |
---|---|---|
1 | For Those About To Rock We Salute You | 1 |
4 | Let There Be Rock | 1 |
59 | Deep Purple In Rock | 58 |
108 | Rock In Rio [CD1] | 90 |
109 | Rock In Rio [CD2] | 90 |
213 | Pure Cult: The Best Of The Cult (For Rockers, Ravers, Lovers & Sinners) [UK] | 139 |
216 | Hot Rocks, 1964-1971 (Disc 1) | 142 |
Convert to pandas
dataframe¶
In [43]:
df = result.DataFrame()
df.head(5)
Out[43]:
AlbumId | Title | ArtistId | |
---|---|---|---|
0 | 1 | For Those About To Rock We Salute You | 1 |
1 | 2 | Balls to the Wall | 2 |
2 | 3 | Restless and Wild | 2 |
3 | 4 | Let There Be Rock | 1 |
4 | 5 | Big Ones | 3 |
In [ ]: