In [1]:
from IPython.display import display, HTML, clear_output
HTML('''<script> code_show=true; function code_toggle() {if (code_show){$('div.input').hide();}
else {$('div.input').show();}code_show = !code_show} $( document ).ready(code_toggle);
</script> <form action="javascript:code_toggle()"><input type="submit" value="Hide Raw Code"></form>''')
Out[1]:
In [2]:
from ipywidgets import interact, interactive, fixed, widgets
import pandas as pd
import sqlite3
import re
In [3]:
# just testing out the youtube player capabilites of Jupyter
#from IPython.display import YouTubeVideo
#YouTubeVideo("a1Y73sPHKxw", width=700, height=500)
In [4]:
# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sql_sample_db_new.sqlite")
book_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/book_table.csv')
auth_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/author_table.csv')
sales_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/sales_table.csv')
tech_cos = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/tech_cos.csv')
public_cos = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/public_cos.csv')
movie_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/movie_table.csv')
tables = [book_table,
auth_table,
sales_table,
tech_cos,
public_cos,
movie_table]
table_names = ['book_table',
'auth_table',
'sales_table',
'tech_cos',
'public_cos',
'movie_table']
# drop each table name if it already exists to avoid error if you rerun this bit of code
# then add it back (or add it for the first time, if the table didn't already exist)
for i in range(len(tables)):
table_name = table_names[i]
table = tables[i]
con.execute("DROP TABLE IF EXISTS {}".format(table_name))
pd.io.sql.to_sql(table, "{}".format(table_name), con, index=False)
In [5]:
# Function to make it easy to run queries on this mini-database
def run(query):
try:
results = pd.read_sql("{}".format(query), con).fillna(' ')
return results
except:
pass
def run_q(query, button):
def on_button_clicked(b):
clear_output()
new_value = query.value.replace('\n', ' ')
if new_value != '':
df = run(new_value)
try:
output = HTML(df.to_html(index=False))
display(output)
#except AttributeError:
except:
print('''SQL error! Check your query:
1. Text values are in quotation marks and capitalized correctly
2. Items in the SELECT clause are comma-separated
3. No dangling comma in the SELECT clause right before the FROM clause
4. If you are joining tables that have columns with the same name, use table_name.column_name format
5. Try "PRAGMA TABLE_INFO(table_name) to double-check the column names in the table
6. Correct order of clauses:
SELECT
FROM
JOIN...ON
WHERE
GROUP BY
ORDER BY
LIMIT
''')
button.on_click(on_button_clicked)
on_button_clicked(None)
def cheat(answer):
def f(Reveal):
if Reveal == False:
clear_output()
else:
print(answer)
interact(f, Reveal=False)
clear_output()
Structure and Formatting Basics
Determine a table's strucure.............................PRAGMA TABLE_INFO()
Query building blocks.......................................SELECT
& FROM
Filter your data..................................................WHERE
Wildcards and vague search.............................LIKE
and %
Sort your data...................................................ORDER BY
Limit the number of rows you see.....................LIMIT
Combining tables..............................................JOIN
• Combining 3+ tables........................................Multiple JOIN
's
• Different JOIN
Types.......................................Overview
• Simple Join......................................................INNER JOIN
aka JOIN
• One-Sided Join................................................LEFT JOIN
• Full Join............................................................OUTER JOIN
• Practice combining tables...............................JOIN
Drills
Column & Table Aliases.....................................AS
Add, subtract, multiply, & divide data................Operators
Apply functions to columns...............................Functions
Group data by categories..................................GROUP BY
Filter out certain groups.....................................HAVING
Conditional values..............................................IF
& CASE WHEN
SQL-ception: Queries within queries..................Nesting
Run multiple queries at once..............................UNION
& UNION ALL
Add a summaryt/total row..................................ROLLUP
Full table of RDBMS dialect differences
SQL stands for "Structured Query Language", but no one calls it that. You can pronouce it as either "S-Q-L" or "sequel". Some people feel strongly in favor of a particular pronunciation. I don't. I'll say "sequel" in class, but I'll never correct you for saying S-Q-L.
SQL is the database language of choice for most businesses; you use it to communicate with databases. "Communication" can take the form of creating, reading, updating, and deleting data. This course only covers reading data. That's all most MBAs do with SQL.
Companies use relational databases because they can store and easily recall A LOT of data. Excel can't handle more than a million rows. If you're Amazon and you need to record every click, Excel is useless. Relational databases are much more efficient.
What do we mean by "efficient"? Every recorded "bit" takes up server space, which costs money. It also slows everything down. So an efficient database should allow you to record and recall a lot of information using the minimal number of bits.
Imagine you want to store the names of four books and some information about their authors. Think of the character count as a proxy for how many bits of storage your table takes up:
Now imagine you want to add some more books by each of those authors. Some of the information gets redundant. Imagine if you had to do this for millions of different books:
This is where relational databases can help. With a relational database, you'd create two separate tables that relate to each other. You still storing the same information, but you're doing it by using fewer characters. You've eliminated the need to repeat yourself, so you've made a much more efficient database.
There are different softwares that can manage relational databases. SQL varies a little from software to software, just like English varies a little between England and the U.S. We'll address these instances whenever possible.
Each software is called a Relational Database Management System, or RDBMS. These are some of the most popular that you might encounter at work:
Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|
Proprietary, more common at older companies | Open source, frequently used by startups and tech companies | Proprietary, more common at older companies | Frequently used for mobile apps (and this class!) |
The goal is to start simple and practice often. By the end of this class, you should feel extremely comfortable writing moderately complicated SQL code, which will save you countless hours trying to figure out SQL on the job or waiting for someone else with SQL knowledge to pull data for you. Using this interactive program, we'll explore a small sample database by learning new SQL concepts one at a time. Concepts will build on each other.
Sometimes you'll be asked to edit or delete parts of a provided query. Rerun the query with each step, taking care to understand what changed with the output each time. Note that none of the changes that you make to these queries will be saved when you close this program.
Try it by changing something in the cell below and hitting "Run"!
In [6]:
test_ex = widgets.Textarea(value=
'''You can change text in these boxes to edit and re-run queries!''',
width = '50em', height = '7em')
display(test_ex)
In [7]:
text_ex_button = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(text_ex_button)
def on_button_clicked2(b):
clear_output()
print('''Here's the output from the cell above:
''', test_ex.value)
text_ex_button.on_click(on_button_clicked2)
After we've learned a new concept and you've practiced with some quick exercises, you'll be challenged to write your own query. Read each challenge carefully, and keep re-running it until you get the results you are looking for.
Need to cheat a little? Check the "Reveal" box to see the answer to a challenge.
In [8]:
chall ='''When you click a checkbox, you reveal the answer to a challenge!
Uncheck it to hide the answer again.'''
cheat(chall)
The content you're currently reading is written in Python, Markdown and HTML and runs in a Jupyter Notebook. No need to know what any of that means, I only told you in case you were terribly curious.
You will not be using this interface at work - the point of this class is to teach you SQL the language, which can be typed into a variety of different software programs. You'll be able to learn the quirks of a different software program pretty easily as long as you know SQL.
Still, there are some things you should know about this program to help you with the class:
SHIFT-RETURN
or 'Cell ➤ Run'. Below is an example of a query, SQL code that requests data from a database. Try to make a habit of writing queries by following these formatting conventions. Queries can get very long and complicated, and formatting makes them easier to read.
Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|
SP_Help some_table |
DESCRIBE some_table |
DESCRIBE some_table |
PRAGMA TABLE_INFO(some_table) |
PRAGMA TABLE_INFO(some_table)
➞ result-set lists the column names and data types within the table
We're using SQLite, so we're going to be using the PRAGMA TABLE_INFO()
option. Put the name of a table in the parentheses, and the output tells you the names and data types in each column in the table.
So far, we've learned about 2 tables in our relational database, which we'll call book_table
and auth_table
. We're also going to use a sales_table
, which we'll take a look at later on. Combined, these three tables will make up the "database" of a very tiny, very limited, very imaginary bookstore.
We'll start by reviewing the book_table
and auth_table
:
Now, we'll use PRAGMA TABLE_INFO()
to read the table structure of the book_table
. In plain English, the query below says "show me the names of the columns in the book_table
, and what type of data (text, numbers?) is in each column."
In [9]:
pragma = widgets.Textarea(
value='''PRAGMA TABLE_INFO(book_table)''',
width = '50em', height = '3em')
display(pragma)
In [10]:
prag_button = widgets.Button(description='Run',width='10em', height='2.5em', color='white', background_color='black', border_color='black')
display(prag_button)
run_q(pragma, prag_button)
book_table
has columns headed Book
, COGs
, and Author
Book
column has TEXT data, and that COGs
contains REAL numbers - numbers that can have a fractional value.Change the query above to look at the auth_table
instead. Why is the author's birth_year
data type not REAL like we saw with COGs
?
Rewrite the query above to take a look at the sales_table
structure. Judging from what your query returns, can you guess what you'll probably see once you actually look all the data in the sales_table
?
In [11]:
pragma_sales = widgets.Textarea(
value='',
width = '50em',
height = '4em'
)
display(pragma_sales)
In [12]:
prag_sales_button = widgets.Button(description='Run',width='10em', height='2.5em', color='white', background_color='black', border_color='black')
display(prag_sales_button)
run_q(pragma_sales, prag_sales_button)
In [13]:
prag_sales_answer = 'PRAGMA TABLE_INFO(sales_table)'
cheat(prag_sales_answer)
SELECT
*
➞ an asterisk means "all columns"
FROM
table_name
To see the actual data in a table, we'll use SELECT
and FROM
clauses. In the SELECT
clause, you tell SQL which columns you want to see. In the FROM
clause, you tell SQL the table where those columns are located. An asterisk returns all columns from a particular table.
In plain English, the query below says: "Show me all columns and their data from the book_table
"
In [14]:
select = widgets.Textarea(value=
'''SELECT
*
FROM
book_table''',
width = '50em', height = '7em')
display(select)
In [15]:
select_button = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_button)
run_q(select, select_button)
In [16]:
select_c = widgets.Textarea(value='', width = '50em')
display(select_c)
In [17]:
select_c_button = widgets.Button(
description='Run',
width='10em',
height='2.5em',
color='white',
background_color='black',
border_color='black')
display(select_c_button)
run_q(select_c, select_c_button)
In [18]:
select_answer ='''SELECT
*
FROM
sales_table'''
cheat(select_answer)
Use asterisks sparingly. Usually, you'll select specific columns from a table rather than all columns. Using an asterisk to select all columns is okay when the table is small or when you tightly constrain your selection of rows. Otherwise, select specific columns and use WHERE and LIMIT (taught below) to go easy on your servers.
SELECT
specific columns:
SELECT
column_a,
➞ separate multiple columns with commas
column_b
➞ optional, but conventional, to also use a return
FROM
table_name
Instead of using an asterisk for "all columns", you can specify a particular column or columns. In plain English: "Show me the data in the book
and author
columns from the book_table
"
In [19]:
select_col = widgets.Textarea(value=
'''SELECT
book,
author
FROM
book_table''',
width = '50em', height = '8em')
display(select_col)
In [20]:
select_col_button = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_col_button)
run_q(select_col, select_col_button)
In [21]:
select_cols_chall = widgets.Textarea(value=
'',
width = '50em', height = '8em')
display(select_cols_chall)
In [22]:
select_cols_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_cols_chall_b)
run_q(select_cols_chall, select_cols_chall_b)
In [23]:
select_cols_chall_cheat ='''SELECT
first_name,
last_name
FROM
auth_table'''
cheat(select_cols_chall_cheat)
In [24]:
select_cols_chall2 = widgets.Textarea(value='', width = '50em', height = '8em')
display(select_cols_chall2)
In [25]:
select_cols_chall_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_cols_chall_b2)
run_q(select_cols_chall2, select_cols_chall_b2)
In [26]:
select_cols_chall_cheat2 ='''SELECT
book
FROM
sales_table'''
cheat(select_cols_chall_cheat2)
SELECT DISTINCT
:
SELECT
DISTINCT column_a
➞ returns only unique values
FROM
table_name
Use DISTINCT
to return unique values from a column, so if there are any repeats in a column, your output will include each value just once. The query below displays each book in the sales_table
just once, even though we know each shows up multiple times in the table.
In [27]:
distinct_q = widgets.Textarea(value=
'''SELECT
DISTINCT book
FROM
sales_table''',
width = '50em', height = '7em')
display(distinct_q)
In [28]:
distinct_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(distinct_b)
run_q(distinct_q, distinct_b)
In [29]:
distinct_q_chall = widgets.Textarea(value='', width = '50em', height = '7em')
display(distinct_q_chall)
In [30]:
distinct_q_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(distinct_q_chall_b)
run_q(distinct_q_chall, distinct_q_chall_b)
In [31]:
REP ='''SELECT
DISTINCT author
FROM
book_table'''
cheat(REP)
SELECT
column_a
FROM
table_name
WHERE
column_a = x
➞ result-set will only include rows where value of column_a is x
WHERE
lets you filter results so you only see rows that specifically match your criteria. Below there are few more options for the WHERE
clause:
Options for WHERE | Description |
---|---|
col = 'some_text' |
Put text in quotations. Capitalization is important! |
col != x |
Return rows where col's values DO NOT equal x |
col < x |
Return rows where col's value is less than x |
col <= x |
Return rows where col's value is less OR EQUAL TO than x |
col IN (x, y) |
Values can equal EITHER x OR y |
col NOT IN (x, y) |
Value are NEITHER x NOR y |
col BETWEEN x AND y |
Values are between x and y |
col = x AND another_col = y |
Returns rows when col's values are x AND another_col's values are y |
col = x OR another_col = y |
Returns rows when col's values are x OR another_col's values are y |
WHERE
& text valuesBelow, we use WHERE
to tell SQL to only show us rows in the book_table
when Hemingway is the author. In plain English, we're saying "Show me information about books that are written by Hemingway in the book_table
"
In [32]:
where_q = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
WHERE
author = 'Hemingway' ''',
width = '50em', height = '10em')
display(where_q)
In [33]:
where_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(where_b)
run_q(where_q, where_b)
'Hemingway'
to 'Shakespeare'
, rerunShakespeare
, rerun. Why the error?"Shakespeare"
to "shakespeare"
, rerun"shakespeare"
to "Twain"
, rerun"Twain"
to 'Hemingway'
, rerun to get back to where we startedWHERE
clause to !=, rerunWrite a query to return all columns of the auth_table
, but only rows where the author's country is England.
In [34]:
where_q_chall = widgets.Textarea(value='',
width = '50em', height = '10em')
display(where_q_chall)
In [35]:
where_b_chall = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(where_b_chall)
run_q(where_q_chall, where_b_chall)
In [36]:
where_q_chall_cheat ='''SELECT
*
FROM
auth_table
WHERE
country = 'England' '''
cheat(where_q_chall_cheat)
We use IN (value_1, value_2)
to return rows that can match more than one value. In plain English, the query below says, "Show me all columns from the book table when the author is EITHER Hemingway OR Austen"
In [37]:
in_q = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
WHERE
author IN ('Hemingway', 'Austen')''',
width = '50em', height = '10em')
display(in_q)
In [38]:
in_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(in_b)
run_q(in_q, in_b)
'Faulkner'
to the list, rerun.IN
with NOT IN
, rerun.WHERE
& number valuesThe WHERE
clause is useful with numbers as well. We can start throwing in comparisons like less than (<) and greater than (>):
In [39]:
greater_q = widgets.Textarea(value=
'''SELECT
*
FROM
sales_table
WHERE
revenue > 18''',
width = '50em', height = '10em')
display(greater_q)
In [40]:
greater_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(greater_b)
run_q(greater_q, greater_b)
In [41]:
born_chall = widgets.Textarea(value='',width = '50em', height = '10em')
display(born_chall)
In [42]:
born_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(born_chall_b)
run_q(born_chall, born_chall_b)
In [43]:
born_chall_c ='''SELECT
*
FROM
auth_table
WHERE
birth_year < 1800
'''
cheat(born_chall_c)
In [44]:
and_q = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
WHERE
author = 'Hemingway'
AND cogs > 11''',
width = '50em', height = '12em')
display(and_q)
In [45]:
and_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(and_b)
run_q(and_q, and_b)
In [46]:
and_chall = widgets.Textarea(value=
'',
width = '50em', height = '14em')
display(and_chall)
In [47]:
and_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(and_chall_b)
run_q(and_chall, and_chall_b)
In [48]:
and_chall_c ='''SELECT
last_Name,
country,
birth_Year
FROM
auth_table
WHERE
country = 'England'
AND birth_Year > 1650 '''
cheat(and_chall_c)
In [49]:
or_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(or_chall)
In [50]:
or_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(or_chall_b)
run_q(or_chall, or_chall_b)
In [51]:
or_chall_cheat ='''SELECT
*
FROM
sales_table
WHERE
book = 'Macbeth'
OR revenue > 17 '''
cheat(or_chall_cheat)
SELECT
column_a
FROM
table_name
WHERE
column_a LIKE 's%Me_t%xT'
➞ correct capitalization isn't necessary withLIKE
, and%
stands in for any missing character
LIKE
lets you search for a value even if you capitalize it incorrectly. It also allows you to work with percentage signs that act as wildcards, which stand in for an unlimited number of missing characters (helpful if you don't know how to spell something). Take a look at the query below. Recall that earlier when we wrote author = 'shakespeare'
, we got no results.
In [52]:
like_q = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
WHERE
author LIKE 'hemingway' ''',
width = '50em', height = '10em')
display(like_q)
In [53]:
like_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(like_b)
run_q(like_q, like_b)
'hemingway'
with 'hemingWAY'
, rerunLIKE
with =, rerunLIKE
again, but change 'hemingWAY
' to 'Hemmingway
', rerunHemmingway
' with 'Hem'
, rerunWith exercises #3 and #4, you saw that LIKE
alone has a limitation - it only lets you mess with capitalization. You need wildcards to do more with LIKE
. Let's say you can't remember if Hemingway is spelled with 1 "m" or 2. Use a percentage sign (%) to get the value you're looking for:
In [54]:
like_q2 = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
WHERE
author LIKE 'He%ingway' ''',
width = '50em', height = '11em')
display(like_q2)
In [55]:
like_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(like_b2)
run_q(like_q2, like_b2)
'He%ingway'
to 'Hemm%ingway'
. Why doesn't this work?'Hemm%ingway'
to 'Hem%'
, rerun'Hem%'
to '%us%'
, rerunLIKE
to =, rerun (see how wildcards only work with LIKE
?)Write a query to pull the book
and author
columns from the book_table
. Pretend you can't remember the full name of the book you're looking for. You just know it starts with the word "Pride".
In [56]:
like_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(like_chall)
In [57]:
like_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(like_chall_b)
run_q(like_chall, like_chall_b)
In [58]:
like_chall_c ='''SELECT
book,
author
FROM
book_table
WHERE
book LIKE 'Pride %' '''
cheat(like_chall_c)
Use LIKE
sparingly: It's a great tool, but it really puts the strain on your database's servers. Use it only when a table is pretty small or when you've limited your result-set by using additional filters in the WHERE
clause.
SELECT
column_a
FROM
table_name
[WHERE clause, optional]
ORDER BY
➞ sorts the result-set by column_a
column_a DESC
➞DESC
is optional, it sorts results in descending order
Without an ORDER BY
clause, the default result-set will be sorted by however it appears in the database (which is crap-shoot depending on the type of table). Use ORDER BY
to sort your result-set by a particular column, and add DESC
to sort in descending order (Z→A, 100→1).
In [59]:
order_q = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
ORDER BY
book''',
width = '50em', height = '11em')
display(order_q)
In [60]:
order_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(order_b)
run_q(order_q, order_b)
author
insteadDESC
and rerunauthor DESC
, replace it with author, book
, rerunDESC
so it reads author, book DESC
, rerun author DESC, book
Write a query to see the book
and revenue
columns from the sales_table
, and sort the results by revenue
in descending order.
In [61]:
order_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(order_chall)
In [62]:
order_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(order_chall_b)
run_q(order_chall, order_chall_b)
In [63]:
order_chall_c ='''SELECT
book,
revenue
FROM
sales_table
ORDER BY
revenue DESC'''
cheat(order_chall_c)
In [64]:
order_chall2 = widgets.Textarea(value=
'',
width = '50em', height = '16em')
display(order_chall2)
In [65]:
order_chall_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(order_chall_b2)
run_q(order_chall2, order_chall_b2)
In [66]:
order_chall_c2 ='''SELECT
*
FROM
book_table
WHERE
author LIKE '%pear%'
OR cogs > 12
ORDER BY
cogs'''
cheat(order_chall_c2)
SELECT TOP N column_name
| LIMIT N
| WHERE ROWNUM <= N
| LIMIT N
SELECT
column_a
FROM
table_name
[WHERE clause]
[ORDER BY clause]
LIMIT N
➞ limits the result-set to N rows
LIMIT
lets you set a maximum limit to the number of rows that your query returns. You've seen the query below before, but now we've added a LIMIT
clause:
In [67]:
limit_q = widgets.Textarea(value=
'''SELECT
*
FROM
sales_table
LIMIT 5''',
width = '50em', height = '14em')
display(limit_q)
In [68]:
limit_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(limit_b)
run_q(limit_q, limit_b)
ORDER BY
clause so that you see the top 10 transactions in terms of revenue
, rerunWHERE
clause so you only see transactions relating to EmmaWrite a query to view the title
and cogs
of the two books with the lowest cogs in the book_table
In [69]:
limit_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(limit_chall)
In [70]:
limit_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(limit_chall_b)
run_q(limit_chall, limit_chall_b)
In [71]:
limit_chall_c ='''SELECT
book,
cogs
FROM
book_table
ORDER BY
cogs
LIMIT 2'''
cheat(limit_chall_c)
In [72]:
limit_chall2 = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(limit_chall2)
In [73]:
limit_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(limit_chall2_b)
run_q(limit_chall2, limit_chall2_b)
In [74]:
limit_chall2_c ='''SELECT
book,
revenue
FROM
sales_table
ORDER BY
book, revenue
LIMIT 15'''
cheat(limit_chall2_c)
SELECT
table_x.column_a,
➞ read this as "column_a from table_x"
table_y.column_b,
➞ "column_b from table_y"
FROM
table_x
JOIN table_y
ON table_x.key_column_x = table_y.key_column_y
➞ table_x's key_column_x has corresponding values with table_y's key_column_y
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]
The ability to join tables is the most fundamental and useful part about relational databases. Different tables have columns with corresponding values, and you can use these columns as "keys" to join the two tables.
The format table_x.key_column
can be read as "key_column
from table_x
"; it tells SQL the tables where a column is located. We didn't need this before because we were only using one table at a time, so SQL knew exactly which table we were talking about. When we deal with more than one table, we need to be more specific. So for example, book_table.book
means "the book
column from the book_table
", and auth_table.last_name
means "the last_name
column from the auth_table
."
Think back to our original discussion of splitting up our author and book data onto two separate tables:
You could think of the columns in these tables in terms of a Venn Diagram. Again, the format table_x.key_column
is read as "key_column
from table_x
", so book_table.author
means "the author
column from the book_table
":
The author
column from the book_table
corresponds with the last_name
column in the auth_table
- they both list the last names of the writers. Whenever you have two tables corresponding columns, you can "join" them by telling SQL to use these corresponding columns as keys. book_table.author
and auth_table.last_name
are the key columns for our join.
In [75]:
join_q = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
JOIN auth_table
ON book_table.author = auth_table.last_name''',
width = '50em', height = '11em')
display(join_q)
In [76]:
join_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_b)
run_q(join_q, join_b)
What just happened? SQL went through these steps:
FROM
clause: FROM book_table JOIN auth_table
.ON
: ON book_table.author = auth_table.last_name
. Next, it matched up the values on the key columns:
Whenever it found a match, it made a kind of copy of the row from the auth_table
and pasted it to the book_table
The JOIN
query we've been discussing has been reproduced in the box below for these exercises.
book
, first_name
and the author's last name (you can do this with either author
or last_name
), and the birth_year
, then rerun. WHERE
clause so that you only see books by Hemingway and Austen, rerun.ORDER BY
clause so that the author born first appears first, and so that their books appear in alphabetical order. Rerun.
In [77]:
join_q2 = widgets.Textarea(value=
'''SELECT
*
FROM
book_table
JOIN auth_table
ON book_table.author = auth_table.last_name''',
width = '50em', height = '18em')
display(join_q2)
In [78]:
join_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_b2)
run_q(join_q2, join_b2)
In [79]:
join_chall = widgets.Textarea(value=
'',
width = '50em', height = '15em')
display(join_chall)
In [80]:
join_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_chall_b)
run_q(join_chall, join_chall_b)
In [81]:
join_chall_c ='''SELECT
*
FROM
book_table
JOIN sales_table
ON book_table.book = sales_table.book
LIMIT 20'''
cheat(join_chall_c)
Start by copying the query that you wrote in the previous challenge and pasting it in the box below.
book_table.book = sales_table.book
to book = book
and rerun. What's going wrong? Fix it and rerun. table_x.column_a
means "column_a from table_x".
In [82]:
join_chall2 = widgets.Textarea(value=
'',
width = '50em', height = '17em')
display(join_chall2)
In [83]:
join_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_chall2_b)
run_q(join_chall2, join_chall2_b)
In [84]:
join_chall_c2 ='''SELECT
book_table.book, [NOTE: or you could use "sales_table.book" instead]
cogs,
author,
id,
revenue
FROM
book_table
JOIN sales_table
ON book_table.book = sales_table.book
LIMIT 20'''
cheat(join_chall_c2)
SELECT
table_x.column_a,
table_y.column_b
table_z.column_c
FROM
table_x
JOIN table_y ON table_x.key_column = table_y.key_column
JOIN table_z ON table_x.other_key_column = table_z.other_key_column
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]
As long as they are directly related or related by the transitive property, you can join multiple tables. Consider the sales_table
and the auth_table
in a Venn Diagram - there's no relation at all:
However, when the book_table
enters the picture, suddenly the sales_table
and auth_table
have a connection:
Now we have an opportunity to join all three!
Write a query to show the first and last name of the author, the book title, the COGs, and the revenue from each transaction.
Extra credit once you've completed the challenge: Only return rows where the book was written by an English author. Sort your results so that the transaction with the most revenue appears first.
In [85]:
multi_join_chall = widgets.Textarea(value=
'',
width = '50em', height = '25em')
display(multi_join_chall)
In [86]:
multi_join_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(multi_join_chall_b)
run_q(multi_join_chall, multi_join_chall_b)
In [87]:
multi_join_chall_c ='''SELECT
first_name,
last_name,
book_table.book,
cogs,
revenue
FROM
book_table
JOIN auth_table
ON book_table.author = auth_table.last_name
JOIN
sales_table
ON book_table.book = sales_table.book'''
cheat(multi_join_chall_c)
Answer to the extra credit:
In [88]:
multi_join_chall_c2 ='''SELECT
first_name,
last_name,
book_table.book,
cogs,
revenue
FROM
book_table
JOIN auth_table
ON book_table.author = auth_table.last_name
JOIN
sales_table
ON book_table.book = sales_table.book
WHERE
country = 'England'
ORDER BY
revenue DESC'''
cheat(multi_join_chall_c2)
Use multiple joins sparingly: Multiple joins can put a lot of strain on servers because SQL has to do a lot of work matching up all that data. The more tangential the relationship, the worse it gets. Avoid more than 2 degrees of separatation, and avoid joining 2 or more large tables. It's ok if one of your tables is big, but the others should be small.
There are more ways to join two tables than the method we just covered. However, not all RDBMS support these different join methods. We'll learn about each of these methods, even if we can only practice 2 of them in SQLite.
Join Type | Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|---|
JOIN or INNER JOIN |
✓ | ✓ | ✓ | ✓ |
LEFT JOIN or LEFT OUTER JOIN |
✓ | ✓ | ✓ | ✓ |
RIGHT JOIN or RIGHT OUTER JOIN |
✓ | ✓ | ✓ | not supported |
OUTER JOIN or FULL OUTER JOIN |
✓ | not supported | ✓ | not supported |
We're going to leave behind the book database for the next lesson, since a different data set will help illustrate the point a little better.
Until now, the tables that we've joined have had columns that correspond perfectly - that is to say, every value that appears in one table also appears in the other. There aren't any authors that appear in the auth_table that don't also appear at least once in the book_table, and vice versa.
Sometimes, you'll have two tables with corresponding columns, but they don't match perfectly. Consider the two tables below. The first lists tech companies and their CEOs, the second lists publicly traded companies and their share price.
Amazon, Alphabet, and Microsoft all appear on both tables. But Uber, SpaceX and AirBnB - which haven't IPO'd - aren't on the public_cos
table. Conversely, Walmart, GE and P&G only appear on the public_cos
table.
Even though it's not a perfect match, there is some overlap. So, we can still use the company
columns from each table as keys to join the tables:
SELECT
table_x.column_a,
➞ read this as "column_a from table_x"
table_y.column_b,
➞ "column_b from table_y"
FROM
table_x
JOIN table_y
➞ SQL interpretsJOIN
andINNER JOIN
as the same thing
ON table_x.key_column = table_y.key_column
So what would happen if we tried to join public_cos
and tech_cos
using the method we just learned with the book database? We'll give it a shot:
In [89]:
inner_join_q = widgets.Textarea(value=
'''SELECT
*
FROM
tech_cos
JOIN public_cos
ON tech_cos.company = public_cos.company''',
width = '50em', height = '11em')
display(inner_join_q)
In [90]:
inner_join_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(inner_join_b)
run_q(inner_join_q, inner_join_b)
This time, SQL can't find a match for every value in the two different company
columns:
So it performs an "INNER JOIN". You can write either JOIN
or INNER JOIN
- SQL will interpret them as the same thing. This eliminates any rows that don't have matching values, then combined the tables:
SELECT
table_x.column_a,
table_y.column_b,
FROM
table_x
LEFT JOIN table_y
➞ see all results from the first ("left") table & results when available from second table
ON table_x.key_column = table_y.key_column
If you want to make sure you see all the rows from a particular table - even if there's no match in the other table - you can do a LEFT JOIN
instead. It lets you prioritize the results from one table over another. Let's say your priority is to see all tech companies in your result-set, but you also want to see the share_price
when that data is available:
In [91]:
left_q = widgets.Textarea(value=
'''SELECT
*
FROM
tech_cos
LEFT JOIN public_cos
ON tech_cos.company = public_cos.company''',
width = '50em', height = '11em')
display(left_q)
In [92]:
left_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(left_b)
run_q(left_q, left_b)
With a LEFT JOIN
, SQL still starts by looking for matching values.
When it fails to find a match, it will still keep the values on the "left" table, but get rid of the unmatched values on the "right" table.
The query from above has been reproduced below for these exercises.
public_cos
becomes the priority table instead.
In [93]:
left_q2 = widgets.Textarea(value=
'''SELECT
*
FROM
tech_cos
LEFT JOIN public_cos
ON tech_cos.company = public_cos.company''',
width = '50em', height = '11em')
display(left_q2)
In [94]:
left_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(left_b2)
run_q(left_q2, left_b2)
SELECT
table_x.column_a,
table_y.column_b,
FROM
table_x
RIGHT JOIN table_y
➞ see all results from the second ("right") table, results where available from first table
ON table_x.key_column = table_y.key_column
It's exactly the same as LEFT JOIN
, except it prioritizes the second table (the "right" table) over the first ("left") table. We can't practice it because SQLite doesn't support it, and it's super redundant anyway because we can just use LEFT JOIN
. Boom. We're done with RIGHT JOIN
.
SELECT
table_x.column_a,
table_y.column_b,
FROM
table_x
OUTER JOIN table_y
➞ see all results from BOTH the first and second table
ON table_x.key_column = table_y.key_column
What if you want to see all values from both tables? You can do this with an OUTER JOIN
. Unfortunately, MySQL and SQLite (what we're using right now!) doesn't support it, so we can't practice it.
If you are using Oracle or Microsoft SQL, then you'd use the example code above. For MySQL and SQLite, there is a workaround. You don't need to understand what's going on in the code for now, just look at the output to make sure you understand what the OUTER JOIN
output should look like.
In [95]:
full_join_q = widgets.Textarea(value=
'''SELECT
ceo,
tech_cos.company,
public_cos.company,
share_price
FROM
tech_cos
LEFT JOIN public_cos ON tech_cos.company = public_cos.company
UNION ALL
SELECT
' ',
' ',
public_cos.company,
share_price
FROM
public_cos
WHERE
public_cos.company NOT IN (SELECT company FROM tech_cos)''',
width = '50em', height = '26em')
display(full_join_q)
In [96]:
full_join_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(full_join_b)
run_q(full_join_q, full_join_b)
The OUTER JOIN
starts out the same as the INNER JOIN
and LEFT JOIN
, trying to find matches wherever it can:
But when it can't find a match, instead of eliminating any the rows, it makes room for them:
In [97]:
insert_b = widgets.Button(description='Click here JUST ONCE before starting', width='20em', height='3em', color='white',background_color='#1f5fd6', border_color='#1f5fd6')
display(insert_b)
def insert_button(b):
insert_q1 = '''INSERT INTO auth_table VALUES ('Tolstoy', 'Leo', 'Russia', 1828)'''
insert_q2 = '''INSERT INTO auth_table VALUES ('Twain', 'Mark', 'USA', 1835)'''
insert_q3 = '''INSERT INTO book_table VALUES ('Jude the Obscure', '11.25', 'Hardy')'''
insert_q4 = '''INSERT INTO book_table VALUES ('The Age of Innocence', '14.20', 'Wharton')'''
query_list = [insert_q1, insert_q2, insert_q3, insert_q4]
for query in query_list:
run(query)
print('New rows have been added to auth_table and book_table!')
insert_b.on_click(insert_button)
In [98]:
join_drill1 = widgets.Textarea(value='',width = '50em', height = '7em')
display(join_drill1)
In [99]:
join_drill1_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill1_b)
run_q(join_drill1, join_drill1_b)
In [100]:
join_drill1_c ='''SELECT
*
FROM
auth_table [change to "book_table" for second part of challenge]'''
cheat(join_drill1_c)
Now auth_table has 2 authors listed (Tolstoy and Twain) that don't appear on the book_table
, and the book_table
has two books (Jude the Obscure and The Age of Innocence) whose authors don't appear in the auth_table
.
Write a query to view the book titles, first names, and last names of authors that appear on both the auth_table
and the book_table
.
In [101]:
join_drill2 = widgets.Textarea(value='', width = '50em', height = '12em')
display(join_drill2)
In [102]:
join_drill2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill2_b)
run_q(join_drill2, join_drill2_b)
In [103]:
join_drill2_c ='''SELECT
book,
author,
first_name
FROM
book_table
JOIN auth_table ON book_table.author = auth_table.last_name'''
cheat(join_drill2_c)
In [104]:
join_drill3 = widgets.Textarea(value='', width = '50em', height = '13em')
display(join_drill3)
In [105]:
join_drill3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill3_b)
run_q(join_drill3, join_drill3_b)
In [106]:
join_drill3_b ='''SELECT
book,
country
FROM
book_table
LEFT JOIN auth_table
ON book_table.author = auth_table.last_name'''
cheat(join_drill3_b)
In [107]:
join_drill4 = widgets.Textarea(value='',width = '50em', height = '16em')
display(join_drill4)
In [108]:
join_drill4_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill4_b)
run_q(join_drill4, join_drill4_b)
In [109]:
join_drill4_c ='''SELECT
first_name,
book
FROM
auth_table
LEFT JOIN book_table ON auth_table.last_name = book_table.author'''
cheat(join_drill4_c)
In [110]:
join_drill4_ex_c ='''SELECT
first_name,
book
FROM
auth_table
LEFT JOIN book_table ON auth_table.last_name = book_table.author
WHERE
first_name = 'William'
ORDER BY
book
LIMIT 3'''
cheat(join_drill4_ex_c)
In [111]:
join_drill4 = widgets.Textarea(value='',width = '50em', height = '12em')
display(join_drill4)
In [112]:
join_drill4_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill4_b)
run_q(join_drill4, join_drill4_b)
In [113]:
join_drill4_c ='''SELECT
book_table.book,
revenue
FROM
book_table
LEFT JOIN sales_table ON book_table.book = sales_table.book
ORDER BY
revenue'''
cheat(join_drill4_c)
SELECT
column_a AS alias_a
➞ creates an alias for column_a
FROM
table_name
WHERE
alias_a = x
➞ optional; use the alias in theWHERE
clause
ORDER BY
alias_a
➞ optional; use the alias in theORDER BY
clause
[LIMIT clause]
Aliases allow you to rename columns and tables in your query. They will come in handy as we learn to do more with the data.
In plain English, the query below can be read as "Show me the book
column from the book_table
, but rename the column to book_title
.
In [114]:
as_q = widgets.Textarea(value=
'''SELECT
book AS book_title
FROM
book_table''',
width = '50em', height = '14em')
display(as_q)
In [115]:
as_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_b)
run_q(as_q, as_b)
titles
. AS
and rerun. (You'll see that AS
is totally optional when assigning aliases. It just make the query easier to read.)author_name
. Austen
. Use the column's alias in your WHERE
clause. ORDER BY
clause.
In [116]:
as_ex_c ='''SELECT
book titles,
author AS author_name
FROM
book_table
WHERE
author_name = 'Austen'
ORDER BY
titles DESC'''
cheat(as_ex_c)
In [117]:
as_chall = widgets.Textarea(value='', width = '50em', height = '15em')
display(as_chall)
In [118]:
as_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_chall_b)
run_q(as_chall, as_chall_b)
In [119]:
as_chall_c ='''SELECT
last_name AS author,
country as nationality,
birth_year AS year_born
FROM
auth_table
WHERE
nationality != 'England'
AND year_born BETWEEN 1800 and 1850'''
cheat(as_chall_c)
SELECT
X.column_a,
Y.column_b
FROM
table_x X
➞ assigns table_x the alias X
JOIN table Y
➞ assigns table_y the alias Y
ON X.key_column = Y.key_column
➞ table aliases can be used as substitutes in thetable_x.column_a
format
When dealing with one or more tables in a query, we commonly assign capitalized one-letter aliases to tables. Writing X.key_column
is much shorter than table_x.key_column
, and coders like shortcuts. They also typically won't use AS
when assigning aliases to tables (although it makes no difference either way).
When you're dealing with only one table, it's unnecessary to use table aliases because SQL knows exactly what columns you are referring to. However, when you are dealing with 2 or more tables, particularly tables that have columns with the same names (like book
, which is a column in both sales_table
and book_table
), then aliases are extremely handy.
In [120]:
as_table_q = widgets.Textarea(value=
'''SELECT
S.book,
S.revenue,
B.cogs
FROM
book_table B
JOIN sales_table S
ON S.book = B.book''',
width = '50em', height = '12em')
display(as_table_q)
In [121]:
as_table_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_table_b)
run_q(as_table_q, as_table_b)
SELECT
clause, change S.book
to just book
, rerun. What's going wrong?book
to B.book
, rerun.S.revenue
to B.revenue
, rerun. What's going wrong?B.revenue
to just revenue
and rerun. Note that when you're joining tables, it's standard to use table aliases even on columns that don't need them. This makes it easier for someone to read your query even if they are unfamaliar with the tables that you're working with. However, as you see, it's not technically necessary.Write a query to view books and the author's country by joining auth_table
and book_table
. Give auth_table
the alias A
and book_table
the alias B
. Use the aliases in the ON
part of the JOIN
clause.
In [122]:
as_table_chall = widgets.Textarea(value='', width = '50em', height = '10em')
display(as_table_chall)
In [123]:
as_table_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_table_chall_b)
run_q(as_table_chall, as_table_chall_b)
In [124]:
as_table_chall_c ='''SELECT
B.book,
A.country
FROM
book_table B
JOIN auth_table A ON B.author = A.last_name'''
cheat(as_table_chall_c)
titles
earnings
author_name
year_born
SELECT
and JOIN
clausesWHERE
and ORDER BY
clauses:
In [125]:
as_chall2 = widgets.Textarea(value='', width = '50em', height = '20em')
display(as_chall2)
In [126]:
as_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_chall2_b)
run_q(as_chall2, as_chall2_b)
In [127]:
as_chall2_c ='''SELECT
B.book as titles,
S.revenue as earnings,
A.birth_year as year_born
FROM
book_table B
JOIN auth_table A ON B.author = A.last_name
JOIN sales_table S on B.book = S.book
WHERE
year_born BETWEEN 1700 and 1900
AND earnings > 12
ORDER BY
earnings
LIMIT 20'''
cheat(as_chall2_b)
SELECT
column_a + column_b,
➞ adds the values incolumn_a
andcolumns_b
column_a - column_b,
➞ subtracts
column_a * column_b,
➞ multiplies
column_a / column_b,
➞ divides
(column_a + column_b) * column_c,
➞ use parentheses to make more complex calculations
FROM
table_name
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]
This is pretty straightforward. Let's start by calculating gross profit per transaction: revenue
minus cogs
. Recall again that we can use S.book
or B.book
- we'll get the same results.
In [128]:
op_q = widgets.Textarea(value=
'''SELECT
B.book,
S.revenue,
B.cogs,
S.revenue - B.cogs
FROM
book_table B
JOIN sales_table S ON B.book = S.Book''',
width = '50em', height = '18em')
display(op_q)
In [129]:
op_q_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(op_q_b)
run_q(op_q, op_q_b)
gross_grofit
, rerun. See how nice aliases are?WHERE
clause to only see transactions where gross_grofit
is over $5, rerun. ORDER BY
clause to sort by gross_profit
with the most profitable transaction is listed first, rerun. gross_margin
In [130]:
op_chall = widgets.Textarea(value='',width = '50em', height = '18em')
display(op_chall)
In [131]:
op_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(op_chall_b)
run_q(op_chall, op_chall_b)
In [132]:
op_chall_c ='''SELECT
B.book,
B.author,
(S.revenue - B.cogs) / S.revenue AS gross_margin
FROM
book_table B
JOIN sales_table S ON B.book = S.book
WHERE
B.author NOT IN ('Falkner', 'Austen')
ORDER BY
gross_margin DESC
LIMIT 10'''
cheat(op_chall_c)
Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|
CONCAT(column_a, column_b) or + |
CONCAT(column_a, column_b) |
CONCAT(column_a, column_b) or || |
|| |
SELECT
column_a || column_b,
➞ combines the characters of column_a & column_b
column_a || ' ' || column_b
➞ combines the characters of column_a & column_b with a space inbetween
FROM
table_name
This one is extremely straightforward. This allows you to non-mathematically combine values. So "some || word
" becomes "someword
", and "some || ' ' || word
" becomes "some word
".
In [133]:
conc_q = widgets.Textarea(value=
'''SELECT
first_name || last_name
FROM
auth_table''',
width = '50em', height = '7em')
display(conc_q)
In [134]:
conc_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(conc_b)
run_q(conc_q, conc_b)
SELECT
SOME_FUNCTION(column_a),
➞ performs the function on the column
FROM
table_name
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]
Functions work similar to funcitons in Excel - you can apply them to entire columns. There are tons more functions than the ones listed below, just Google what you want to do to find more.
FUNCTION | DESCRIPTION |
---|---|
AVG(col) |
Averages values |
COUNT(col) |
Counts the number of rows with non-null values in the column |
COUNT(*) |
Counts the number of rows in the table |
COUNT(DISTINCT(col)) |
Counts the number of unique values in the column |
GROUP_CONCAT(col, 'separator') |
Returns a comma-separated list of values, specify a separator in quotes |
MAX(col) |
Returns the maximum value |
MIN(col) |
Returns the minimum value |
ROUND(AVG(col), x) |
Rounds value to x decimals |
SUM(col) |
Sums values |
UPPER(col) |
If column is text, it will return all-caps version of the text |
First, we'll start with SUM()
to find the total revenue for all our transactions.
In [135]:
sum_q = widgets.Textarea(value=
'''SELECT
sum(revenue)
FROM
sales_table''',
width = '50em', height = '17em')
display(sum_q)
In [136]:
sum_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(sum_b)
run_q(sum_q, sum_b)
SELECT
clause to find the average revenue per transaction and give the column an alias, rerunSELECT
clause to count the total number of transactions and give the column an alias, rerunSELECT
clause to see the minimum and maximum revenue earned on a single transaction, rerunSELECT
clause to see a count of the number of distinct books that appear in sales_table
WHERE
clause to only see results for the books "For Whom the Bell Tolls" and "Emma"
In [137]:
sum_q_c ='''SELECT
SUM(revenue) AS total_rev,
ROUND(AVG(revenue), 2) AS avg_rev,
COUNT(revenue) AS total_transactions,
MAX(revenue) AS max_rev,
MIN(revenue) AS min_rev,
COUNT(DISTINCT(book)) AS distinct_books
FROM
sales_table
WHERE
book IN ('For Whom the Bell Tolls', 'Emma') '''
cheat(sum_q_c)
In [138]:
function_chall = widgets.Textarea(value='', width = '50em', height = '11em')
display(function_chall)
In [139]:
function_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(function_chall_b)
run_q(function_chall, function_chall_b)
In [140]:
function_chall_c ='''SELECT
ROUND(AVG(B.cogs), 2) AS avg_cogs
FROM
book_table B
JOIN auth_table A ON B.author = A.last_name
WHERE
A.country = 'USA'
'''
cheat(function_chall_c)
In [141]:
function_chall2 = widgets.Textarea(value='',width = '50em', height = '10em')
display(function_chall2)
In [142]:
function_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(function_chall2_b)
run_q(function_chall2, function_chall2_b)
In [143]:
function_chall2_c ='''SELECT
GROUP_CONCAT(last_name)
FROM
auth_table
WHERE
last_name NOT IN ('Austen', 'Shakespeare')
'''
cheat(function_chall2_c)
In [144]:
insert_null_b = widgets.Button(description='Click here JUST ONCE before starting', width='20em', height='3em', color='white',background_color='#1f5fd6', border_color='#1f5fd6')
display(insert_null_b)
def insert_button(b):
null_query1 = '''INSERT INTO auth_table VALUES ('Homer', NULL, 'Greece', NULL)'''
run(null_query1)
print('A new row has been added to the auth_table!')
insert_null_b.on_click(insert_button)
It's much more common to use COUNT(*)
than COUNT(column_name)
when you are trying to get a count of the number of rows in your result-set. This is because COUNT(*)
will capture all rows, while COUNT(column_name)
will skip over NULL values in that particular row.
We've just added new a new row to the auth_table
that has some NULL (blank) values. Start by writing a query to view everything (SELECT *
) in the auth_table
and make a note of the new row.
In [145]:
star_chall = widgets.Textarea(value='', width = '50em', height = '12em')
display(star_chall)
In [146]:
star_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(star_chall_b)
run_q(star_chall, star_chall_b)
In [147]:
star_chall_c ='''SELECT
COUNT(first_name),
COUNT(*)
FROM
auth_table
'''
cheat(star_chall_c)
When you use COUNT(first_name)
, SQL skipped over the Homer row because there was no value in the first_name column. COUNT(*)
, on the other hand, looks across all columns, so as long as a row has a value in at least one column, it'll get included in the count. You might argue that you could just use country
or last_name
, but they fact is that *
is just way easier and less time-consuming to type out. Overwhelmingly, people opt for COUNT(*)
instead of COUNT(column_name)
unless they are interested in overlooking NULL values.
In [148]:
fun_op_q = widgets.Textarea(value=
'''SELECT
SUM(S.revenue) - SUM(B.cogs) AS gross_profit
FROM
book_table B
JOIN sales_table S ON S.book = B.book''',
width = '50em', height = '8em')
display(fun_op_q)
In [149]:
fun_op_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(fun_op_b)
run_q(fun_op_q, fun_op_b)
In [150]:
fun_op_chall = widgets.Textarea(value='', width = '50em', height = '8em')
display(fun_op_chall)
In [151]:
fun_op_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(fun_op_chall_b)
run_q(fun_op_chall, fun_op_chall_b)
In [152]:
fun_op_chall_c ='''SELECT
SUM(S.revenue - B.cogs) / SUM(S.revenue) AS gross_margin
FROM
book_table B
JOIN sales_table S ON S.book = B.book
- OR - you can use AVG() instead of SUM() for all functions'''
cheat(fun_op_chall_c)
SELECT
column_a,
SUM(column_b)
➞ sums up the values in column_b
FROM
table_name
[WHERE clause]
GROUP BY
➞ creates one group for each unique value in column_a
column_a
[ORDER BY clause]
[LIMIT clause]
GROUP BY
creates a group for each unique value in the column you specify. You'll always use it in conjunction with functions - it creates segments for your results. In plain English, the query below says: "Show me the average revenue
per book
from the sales_table
"
In [153]:
group_q = widgets.Textarea(value=
'''SELECT
book,
AVG(revenue)
FROM
sales_table
GROUP BY
book''',
width = '50em', height = '20em')
display(group_q)
In [154]:
group_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_b)
run_q(group_q, group_b)
AVG()
to SUM()
, rerunbook
column the alias book_title
, then use the alias in the GROUP BY
clause, rerunSELECT
clause: COUNT(*)
, rerunWHERE
clause to only return results that are not written by Faulkner (hint: you'll have to join a table for this)Write a query to count the number of books each author has listed in the book_table
.
In [155]:
group_chall = widgets.Textarea(value='', width = '50em', height = '12em')
display(group_chall)
In [156]:
group_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall_b)
run_q(group_chall, group_chall_b)
In [157]:
group_chall_c ='''SELECT
author,
count(*)
FROM
book_table
GROUP BY
author'''
cheat(group_chall_c)
In [158]:
group_chall2 = widgets.Textarea(value='', width = '50em', height = '12em')
display(group_chall2)
In [159]:
group_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall2_b)
run_q(group_chall2, group_chall2_b)
In [160]:
group_chall2_c ='''SELECT
B.author,
SUM(S.revenue)
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
B.author'''
cheat(group_chall2_c)
In [161]:
group_chall3 = widgets.Textarea(value='',width = '50em', height = '15em')
display(group_chall3)
In [162]:
group_chall3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall3_b)
run_q(group_chall3, group_chall3_b)
In [163]:
group_chall3_c ='''SELECT
book,
MAX(revenue),
MIN(revenue)
FROM
sales_table
WHERE
book NOT IN ('Macbeth','Hamlet')
GROUP BY
book'''
cheat(group_chall3_c)
In [164]:
group_fun_op_q = widgets.Textarea(value=
'''SELECT
B.book,
SUM(S.revenue) - SUM(B.cogs) AS gross_profit
FROM
sales_table S
JOIN book_table B ON S.book = B.book
GROUP BY
B.book''',
width = '50em', height = '12em')
display(group_fun_op_q)
In [165]:
group_fun_op_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_fun_op_b)
run_q(group_fun_op_q, group_fun_op_b)
In [166]:
group_func_op_chall = widgets.Textarea(value='', width = '50em', height = '13em')
display(group_func_op_chall)
In [167]:
group_func_op_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_func_op_chall_b)
run_q(group_func_op_chall, group_func_op_chall_b)
In [168]:
group_func_op_chall_c ='''SELECT
B.author,
(SUM(S.revenue) - SUM(B.cogs)) / SUM(S.revenue) AS gross_margin
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
B.author'''
cheat(group_func_op_chall_c)
SELECT
clause (in addition to author and gross margin columns), and give every column an alias:
In [169]:
group_chall3 = widgets.Textarea(value='', width = '50em', height = '26em')
display(group_chall3)
In [170]:
group_chall3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall3_b)
run_q(group_chall3, group_chall3_b)
In [171]:
group_chall3_c ='''SELECT
B.author AS author_name,
SUM(S.revenue) AS total_revenue,
SUM(B.cogs) AS total_cogs,
(SUM(S.revenue) - SUM(B.cogs))/SUM(S.revenue) AS gross_margin,
COUNT(*) AS transaction_count,
COUNT(DISTINCT(S.book)) AS distinct_book_titles,
GROUP_CONCAT(DISTINCT(S.book)) AS book_list
FROM
book_table B
JOIN sales_table S ON B.book = S.book
WHERE
author_name != 'Faulkner'
AND S.book != 'Hamlet'
GROUP BY
author_name
ORDER BY
gross_margin DESC'''
cheat(group_chall3_c)
SELECT
column_a,
FUNCTION(column_b)
FROM
table_name
[WHERE clause]
GROUP BY
column_a HAVING FUNCTION(column_b) > x
➞ returns groups whose value is greater than x
[ORDER BY clause]
[LIMIT clause]
Use HAVING
with GROUP BY
in order to filter out groups that don't meet your criteria. Below, the plain English translation of this query says, "show my the total revenue for each book, but only show me books that have total revenue over $100"
In [172]:
having_q = widgets.Textarea(value=
'''SELECT
book,
SUM(revenue)
FROM
sales_table
GROUP BY
book HAVING SUM(revenue) > 100''',
width = '50em', height = '11em')
display(having_q)
In [173]:
having_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(having_b)
run_q(having_q, having_b)
In [174]:
having_chall = widgets.Textarea(value='',width = '50em', height = '10em')
display(having_chall)
In [175]:
having_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(having_chall_b)
run_q(having_chall, having_chall_b)
In [176]:
having_chall_c ='''SELECT
author,
AVG(cogs) AS avg_cogs
FROM
book_table
GROUP BY
author HAVING avg_cogs > 10'''
cheat(having_chall_c)
In [177]:
hw_q1 = widgets.Textarea(value=
'''SELECT
author,
AVG(cogs)
FROM
book_table
WHERE
author != 'Faulkner'
GROUP BY
author''',
width = '50em', height = '14em')
display(hw_q1)
In [178]:
hw_b1 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b1)
run_q(hw_q1, hw_b1)
Now let's say you want to filter out Faulkner AND you only want to see authors whose average COGs are over $11. Your first thought might be to use the WHERE
clause. Add AND AVG(cogs) > 11
, rerun. Why do you think you're hitting an error?
You hit an error because the AVG(cogs)
column was created by a function, and SQL doesn't let you put function-generated columns in the WHERE
clause. You have to use HAVING
instead. The query below will accomplish what we're trying to do, and returns a result-set that doesn't include Faulkner AND only shows authors whose average COGs are over $11.
In [179]:
hw_q2 = widgets.Textarea(value=
'''SELECT
author,
AVG(cogs)
FROM
book_table
WHERE
author != 'Faulkner'
GROUP BY
author HAVING AVG(cogs) > 11''',
width = '50em', height = '14em')
display(hw_q2)
In [180]:
hw_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b2)
run_q(hw_q2, hw_b2)
This seems relatively straightforward - but it's easy to forget and wind up with inaccurate results. Consider the following query. Why would this be wrong?
In [181]:
hw_q3 = widgets.Textarea(value=
'''SELECT
author,
AVG(cogs)
FROM
book_table
WHERE
author != 'Faulkner'
AND cogs > 11
GROUP BY
author''',
width = '50em', height = '15em')
display(hw_q3)
In [182]:
hw_b3 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b3)
run_q(hw_q3, hw_b3)
These results are inaccurate because instead of telling SQL to only return authors with average COGs over \$11, we've told SQL "only consider rows where COGs are over \$11". SQL dropped the rows with COGs under \$11 before it started grouping and averaging.
When we read, we start at the top of a page and work our way to the bottom. That's not how SQL works. It actually starts with the FROM
clause and jumps around. It's helpful to understand the order it follows to determine when to use HAVING
and when to use WHERE
.
We write the clauses in this order:
SELECT
FROM
JOIN...ON
WHERE
GROUP BY...HAVING
ORDER BY
LIMIT
However, SQL reads and executes the clauses in this order:
FROM
JOIN...ON
WHERE
SELECT
GROUP BY...HAVING
ORDER BY
LIMIT
Here's a query we've seen before, but now we've added a few more clauses so that we can see all of them in action:
In [183]:
sql_order_q = widgets.Textarea(value=
'''SELECT
author,
AVG(cogs)
FROM
book_table
WHERE
author != 'Faulkner'
GROUP BY
author HAVING AVG(cogs) > 11
ORDER BY
AVG(cogs)
LIMIT 3''',
width = '50em', height = '17.5em')
display(sql_order_q)
In [184]:
sql_order_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(sql_order_b)
run_q(sql_order_q, sql_order_b)
The GIF below shows the order that SQL follows the steps:
Let's revisit the query that gave us the skewed average:
In [185]:
hw_q4 = widgets.Textarea(value=
'''SELECT
author,
AVG(cogs)
FROM
book_table
WHERE
author != 'Faulkner'
AND cogs > 11
GROUP BY
author''',
width = '50em', height = '15em')
display(hw_q4)
In [186]:
hw_b4 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b4)
run_q(hw_q4, hw_b4)
Now that we know the order in which SQL executes commands, we can see what went wrong. The rows with COGs under $11 were eliminated before SQL averaged COGs for each group:
In [187]:
have_chall2 = widgets.Textarea(value='',width = '50em', height = '12em')
display(have_chall2)
In [188]:
have_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(have_chall2_b)
run_q(have_chall2, have_chall2_b)
In [189]:
have_chall2_c ='''SELECT
B.author,
SUM(S.revenue) AS total_rev
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
author HAVING total_rev > 200'''
cheat(have_chall2_c)
In [190]:
have_chall3 = widgets.Textarea(value='', width = '50em', height = '16em')
display(have_chall3)
In [191]:
have_chall3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(have_chall3_b)
run_q(have_chall3, have_chall3_b)
In [192]:
have_chall3_c ='''SELECT
A.country,
COUNT(*) AS count_of_sales
FROM
sales_table S
JOIN book_table B ON S.book = B.book
JOIN auth_table A ON A.last_name = B.author
WHERE
A.last_name != 'Hemingway'
GROUP BY
A.country
NOTE: you can also use "B.author != 'Hemingway'" in the WHERE clause to get the same results'''
cheat(have_chall3_c)
Conditional Type | Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|---|
IF | IF logical_test PRINT value_if_true |
IF(logical_test, value_if_true, value_if_false) (same as Excel) |
IF logical_test THEN value_if_true ELSIF...END IF |
NOT SUPPORTED |
CASE WHEN | ✓ | ✓ | ✓ | ✓ |
SELECT
CASE WHEN some_column = x THEN value_if_true
WHEN some_column = y THEN other_value_if_true
ELSE value_if_false
END
FROM
some_table
Because SQLite doesn't support IF
statements, we're going to focus on CASE WHEN
. CASE WHEN
lets you accomplish the same thing by setting logical tests and conditional values, but it has the added bonus of freeing you from ever needing to nest multiple IF
statements.
Let's start very simple. The following query uses a logical test to create a column where the value is "true" if the author is Austen, and "false" if the author is not Austen:
In [193]:
case1_q = widgets.Textarea(value=
'''SELECT
last_name,
CASE WHEN last_name = 'Austen' THEN 'True'
ELSE 'False'
END
FROM
auth_table''',
width = '50em', height = '13em')
display(case1_q)
In [194]:
case1_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case1_b)
run_q(case1_q, case1_b)
CASE WHEN
column an alias (immediately after END
), rerunCASE WHEN
column so that the query returns Faulkner's first name as well (look at the example code above for help), rerun.CASE WHEN
to create categoriesCASE WHEN
allows you to set multiple logical tests, which can help you create buckets or categories. In Excel, you'd have to nest multiple logical tests in an IF
statement (ie. IF(logical_test, value_if_true, IF(other_logical_test, value_if_true, value_if_false))...very messy). With CASE WHEN
, you can add an infinite number very easily.
Let's say that rather than caring about the exact revenue for each transaction, you only cared whether it was under \$10, between \$10 and \$15, or over \$15. That's easy to do with CASE WHEN
. We'll include the revenue
column as well so you can more easily see what's going on:
In [195]:
case2_q = widgets.Textarea(value=
'''SELECT
book,
revenue,
CASE WHEN revenue < 10 THEN "<$10"
WHEN revenue BETWEEN 10 AND 15 THEN "$10-15"
WHEN revenue > 15 THEN ">$15"
END AS revenue_category
FROM
sales_table
''',
width = '50em', height = '15em')
display(case2_q)
In [196]:
case2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case2_b)
run_q(case2_q, case2_b)
This might not immediately seem useful, but when you start grouping by your newly created categories, you'll be able to do all kinds of new analysis. Consider the query below. We use CASE WHEN
to create a column that we then use in the GROUP BY
clause. We've essentially created new groups where there were none before, and now we can assess the number of sales and total revenue from each revenue group.
In [197]:
case3_q = widgets.Textarea(value=
'''SELECT
CASE WHEN revenue < 10 THEN "<$10"
WHEN revenue BETWEEN 10 AND 15 THEN "$10-15"
WHEN revenue > 15 THEN ">$15"
END AS revenue_category,
COUNT(*) AS total_sales,
SUM(revenue) AS total_revenue
FROM
sales_table
GROUP BY
revenue_category
''',
width = '50em', height = '18em')
display(case3_q)
In [198]:
case3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case3_b)
run_q(case3_q, case3_b)
In [199]:
case_chall = widgets.Textarea(value='', width = '50em', height = '16em')
display(case_chall)
In [200]:
case_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_chall_b)
run_q(case_chall, case_chall_b)
In [201]:
case_chall_c ='''SELECT
CASE WHEN
B.author = 'Austen' THEN 'Female'
ELSE 'Male' --- or you can say, "WHEN B.author IN ('Faulkner', 'Shakespeare', 'Hemingway') THEN 'Male' "
END AS gender,
SUM(S.revenue) AS total_revenue
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
gender'''
cheat(case_chall_c)
CASE WHEN
to create a pivot tableSay you want to see revenue broken out by gender and by date. Right now, the only way we know how to do this is to add "date" to the GROUP BY
clause. The query below is the same as the one from your last challenge, only we've added date
to both the SELECT
clause and the GROUP BY
clause.
In [202]:
case_pivot_ex = widgets.Textarea(value='''SELECT
date,
CASE WHEN
B.author = 'Austen' THEN 'Female'
ELSE 'Male'
END AS gender,
SUM(S.revenue) AS total_revenue
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
date, gender'''
, width = '50em', height = '18em')
display(case_pivot_ex)
In [203]:
case_pivot_ex_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_ex_b)
run_q(case_pivot_ex, case_pivot_ex_b)
The result-set above is almost useless - it's impossible to do meaningful analysis when you group by multiple columns. Instead, we'll use CASE WHEN
nested inside a function to essentially create a pivot table:
In [204]:
case_pivot = widgets.Textarea(value=
'''SELECT
date,
SUM(CASE WHEN B.author = 'Austen' THEN revenue END) AS Female_Rev,
SUM(CASE WHEN B.author != 'Austen' THEN revenue END) AS Male_Rev
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
date''',
width = '50em', height = '14em')
display(case_pivot)
In [205]:
case_pivot_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_b)
run_q(case_pivot, case_pivot_b)
Note that while this CASE WHEN
method will work in other RDBMSs, it's more common to use IF
when you are only using a single logical test. In MySQL, for instance, the line for Female_Rev would look like this instead, which would translate to "sum up the revenue for any row where the author is Austen, and the number 0 whenever the author is not Austen":
SUM(IF(B.author = 'Austen', revenue, 0))
The query from above has been reproduced below for these exercises (so you don't have to keep scrolling up and down).
SUM
to AVG
, rerunAVG
to COUNT
- note that with conditional statements, you don't use an asterisk with COUNT
. You need to stick with a specific column name
In [206]:
case_pivot1 = widgets.Textarea(value=
'''SELECT
date,
SUM(CASE WHEN B.author = 'Austen' THEN revenue END) AS Female_Rev,
SUM(CASE WHEN B.author != 'Austen' THEN revenue END) AS Male_Rev
FROM
book_table B
JOIN sales_table S ON B.book = S.book
GROUP BY
date''',
width = '50em', height = '18em')
display(case_pivot1)
In [207]:
case_pivot_b1 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_b1)
run_q(case_pivot1, case_pivot_b1)
SELECT
clause:CASE WHEN
to create a column that creates buckets for author's birth_year
: "Before 1700", "1700-1800", "After 1800"CASE WHEN
to create a column that returns the count of books by authors from USACASE WHEN
to create a column that returns the count of books by authors from EnglandGROUP BY
the birth_year bucket column that you created
In [208]:
case_pivot_chall = widgets.Textarea(value='',width = '50em', height = '18em')
display(case_pivot_chall)
In [209]:
case_pivot_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_chall_b)
run_q(case_pivot_chall, case_pivot_chall_b)
In [210]:
case_pivot_chall_c ='''SELECT
CASE WHEN A.birth_year < 1700 THEN "Before 1700"
WHEN A.birth_year BETWEEN 1700 AND 1800 THEN "1700-1800"
WHEN A.birth_year > 1800 THEN "After 1800"
END AS era,
COUNT(CASE WHEN A.country = 'USA' THEN book END) AS count_from_USA,
COUNT(CASE WHEN A.country = 'England' THEN book END) AS count_from_England
FROM
book_table B
JOIN auth_table A ON B.author = A.last_name
GROUP BY
era
'''
cheat(case_pivot_chall_c)
In [211]:
case_pivot_chall2 = widgets.Textarea(value='', width = '50em', height = '14em')
display(case_pivot_chall2)
In [212]:
case_pivot_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_chall2_b)
run_q(case_pivot_chall2, case_pivot_chall2_b)
In [213]:
case_pivot_chall2_c ='''SELECT
date,
COUNT(CASE WHEN book = 'For Whom the Bell Tolls' THEN revenue END) Bell_Tolls_Count,
COUNT(CASE WHEN book = 'Emma' THEN revenue END) Emma_Count,
COUNT(CASE WHEN book IN ('Macbeth', 'Hamlet') THEN revenue END) Macbeth_Hamlet_Count
FROM
sales_table
GROUP BY
date'''
cheat(case_pivot_chall2_c)
SELECT
column_a
FROM
table_x
WHERE
column_a IN (SELECT column_b FROM table_y)
Read first: For this section, we'll need the extra rows in the auth_table
and book_table
that we added during the JOIN
exercises. If you've closed the program or re-run it since you last added those rows, then click the button below to re-add them.
In [214]:
insert_b = widgets.Button(description="Read the paragraph above before clicking", width='25em', height='3em', color='white',background_color='#1f5fd6', border_color='#1f5fd6')
display(insert_b)
def insert_button(b):
insert_q1 = '''INSERT INTO auth_table VALUES ('Tolstoy', 'Leo', 'Russia', 1828)'''
insert_q2 = '''INSERT INTO auth_table VALUES ('Twain', 'Mark', 'USA', 1835)'''
insert_q3 = '''INSERT INTO book_table VALUES ('Jude the Obscure', '11.25', 'Hardy')'''
insert_q4 = '''INSERT INTO book_table VALUES ('The Age of Innocence', '14.20', 'Wharton')'''
query_list = [insert_q1, insert_q2, insert_q3, insert_q4]
for query in query_list:
run(query)
print('New rows have been added to auth_table and book_table!')
insert_b.on_click(insert_button)
To be totally honest, you likely won't be writing nested queries yourself until you've become much more comfortable with SQL. However, it's good to learn about them because you'll likely encounter them when coworkers share queries with you.
Start by looking at the two queries and their outputs below:
In [215]:
nested_q1 = widgets.Textarea(value=
'''SELECT
COUNT(DISTINCT(book)) AS Count_of_Distinct_Books
FROM
sales_table''',
width = '50em', height = '7em')
display(nested_q1)
In [216]:
nested_b1 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nested_b1)
run_q(nested_q1, nested_b1)
NOTE THAT YOU NEED TO HIT "RUN" AGAIN FOR THE QUERY BELOW
(It should return the number 13. If it doesn't, click the blue button above to update the book_table
, then re-run the query below)
In [217]:
nested_q2 = widgets.Textarea(value=
'''SELECT
COUNT(DISTINCT(book)) AS Count_of_Distinct_Books
FROM
book_table''',
width = '50em', height = '7em')
display(nested_q2)
In [218]:
nested_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nested_b2)
run_q(nested_q2, nested_b2)
From the count of distinct books in each table, we see that there are two books in our book_table
(our inventory) that haven't made a single sale. Imagine if both tables had thousands of rows - it'd be a nightmare to try to figure out which were the books with no sales. However, a nested query can help us out.
The query below uses a nested query in the WHERE
clause. In plain English, it says "Show me the books from the book_table, but not the ones that also show up in the sales_table":
In [219]:
nested_q3 = widgets.Textarea(value=
'''SELECT
book
FROM
book_table
WHERE
book NOT IN (SELECT book FROM sales_table)''',
width = '50em', height = '10em')
display(nested_q3)
In [220]:
nested_q3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nested_q3_b)
run_q(nested_q3, nested_q3_b)
In [221]:
nest_chall = widgets.Textarea(value='', width = '50em', height = '12em')
display(nest_chall)
In [222]:
nest_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_chall_b)
run_q(nest_chall, nest_chall_b)
In [223]:
nest_chall_c ='''SELECT
last_name
FROM
auth_table
WHERE
last_name NOT IN (SELECT author FROM book_table)
'''
cheat(nest_chall_c)
You can also use nested queries to avoid the need for multiple JOIN
clauses. Suppose you wanted to see the total revenue for books by authors from England. Previously, we would have joined the sales_table
to the book_table
, and then the book_table
to the auth_table
in order to be able to work with both the revenue
column and the country
column:
In [224]:
nest_q3 = widgets.Textarea(value=
'''SELECT
SUM(revenue)
FROM
sales_table
WHERE
book IN (SELECT
book
FROM
book_table B
JOIN auth_table A ON B.author = A.last_name
WHERE
A.country = 'England')''',
width = '50em', height = '18em')
display(nest_q3)
In [225]:
nest_b3 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_b3)
run_q(nest_q3, nest_b3)
Let's break down what's going on here. First, copy and paste the nested query (the part in parentheses) in the cell below, then run it:
In [226]:
nest_q3_explained = widgets.Textarea(value='',width = '50em', height = '12em')
display(nest_q3_explained)
In [227]:
nest_b3_explained = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_b3_explained)
run_q(nest_q3_explained, nest_b3_explained)
Next, take the output that you just produced and:
• comma-separate each book
• wrap each book in quotation marks
• paste your list between the parentheses in the WHERE
clause below:
• rerun the query
In [228]:
nest_q3_explained2 = widgets.Textarea(value=
'''SELECT
SUM(revenue)
FROM
sales_table
WHERE
book IN ( )''',
width = '50em', height = '12em')
display(nest_q3_explained2)
In [229]:
nest_b3_explained2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_b3_explained2)
run_q(nest_q3_explained2, nest_b3_explained2)
This is essentially the same process that SQL walks through when you run a nested query. It pulls the list of books from the nested query, then uses that list in the WHERE
clause of the dominant query.
The query above produces the same results as if you'd done a multiple join, but it's much more efficient. That's because SQL can just get the book titles it needs from the nested query and plug them into the dominent query, rather than needing to do all the work of duplicating rows to create joined tables. Use nested queries for more efficient joins whenever possible.
SELECT
some_column
FROM
table_x
UNION
➞ or useUNION ALL
, see explanation below
SELECT
some_other_column
FROM
table_y
UNION
and UNION ALL
allow you to attach two completely separate queries. UNION
will result in the output from the first query and the second query to be sorted by default (or you can add an ORDER BY
clause). UNION ALL
will ensure that the results from the second query will all appear after the results from the first query.
We'll start with a very, very simple illustration and work our way into more complex versions of UNION
queries. First, consider the query below. We're pulling all books from the book table with the first query, and all the authors' first names from the auth_table with the second query. By using UNION
, we're telling SQL to return the results of both these queries in the same column.
In [230]:
union_q = widgets.Textarea(value=
'''SELECT
book AS selection
FROM
book_table
UNION
SELECT
first_name AS selection
FROM
auth_table''',
width = '50em', height = '16em')
display(union_q)
In [231]:
union_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_b)
run_q(union_q, union_b)
UNION ALL
instead of UNION
and re-run. Make sure you understand how the output changes. AS selection
in the second query and rerun. selection
to something else in the first query and rerun. cogs
in the SELECT
clause in the first query and country
to the SELECT
clause in the second query, rerun.cogs
in the first query and rerun. Can you think why you're hitting an error?UNION
The above example is just a simple illustration of how UNION
functions, but it's not very useful as a practical application. Now lets try UNION
in a more useful way.
Let's say that our imaginary book store decides to start stocking a few movies, so we've created a new table to manage this new inventory:
Now let's say we wanted to view all of our store inventory, COGs, and the author or director of each item. We don't want to join the movie_table
and the book_table
- there's nothing to really join them on. However, it'd be useful to stack them.
Use UNION
to write a query to view the contents of both movie_table
and book_table
in a single table. The column-headers should be: Item
, COGs
, and Creator
. Order by item title (hint: with UNION
, the ORDER BY
clause can only go after the second query).
In [232]:
union_chall = widgets.Textarea(value='',width = '50em', height = '25em')
display(union_chall)
In [233]:
union_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_chall_b)
run_q(union_chall, union_chall_b)
In [234]:
union_chall_c ='''SELECT
book as Item,
cogs as COGs,
author as Creator
FROM
book_table
UNION
SELECT
film,
cogs,
director
FROM
movie_table
ORDER BY
Item'''
cheat(union_chall_c)
In [235]:
union_q2 = widgets.Textarea(value=
'''SELECT
book,
cogs
FROM
book_table
UNION ALL
SELECT
'Average COGs',
ROUND(AVG(cogs), 2)
FROM
book_table''',
width = '50em', height = '19em')
display(union_q2)
In [236]:
union_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_b2)
run_q(union_q2, union_b2)
ALL
from UNION ALL
and rerun. See how ALL
can be useful?'Average COGs',
from the second query and rerun. Make sure you understand why there's an error. Fix it and reruun.Write a query that totals revenue per book from the sales_table
. Use UNION ALL
to add a summary line that totals revenue for all books.
In [237]:
union_chall1 = widgets.Textarea(value='',
width = '50em', height = '22em')
display(union_chall1)
In [238]:
union_chall1_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_chall1_b)
run_q(union_chall1, union_chall1_b)
In [239]:
union_chall1_c ='''SELECT
book,
SUM(revenue) AS total_revenue
FROM
sales_table
GROUP BY
book
UNION ALL
SELECT
'NULL',
SUM(revenue)
FROM
sales_table
'''
cheat(union_chall1_c)
In [240]:
HTML(run('''SELECT
B.author AS author_last_name,
B.book AS book_title,
SUM(S.revenue) AS sum_revenue
FROM
sales_table S
JOIN book_table B on S.book = B.book
GROUP BY
B.book
UNION
SELECT
UPPER(B.author),
'TOTAL REVENUE',
SUM(S.revenue)
FROM
sales_table S
JOIN book_table B on S.book = B.book
GROUP BY
B.author
''').to_html(index=False))
Out[240]:
In [241]:
union_chall2 = widgets.Textarea(value='', width = '50em', height = '30em')
display(union_chall2)
In [242]:
union_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_chall2_b)
run_q(union_chall2, union_chall2_b)
In [243]:
union_chall2_c ='''SELECT
B.author AS author_last_name,
B.book AS book_title,
SUM(S.revenue) AS sum_revenue
FROM
sales_table S
JOIN book_table B on S.book = B.book
GROUP BY
B.book
UNION
SELECT
UPPER(B.author),
'TOTAL REVENUE',
SUM(S.revenue)
FROM
sales_table S
JOIN book_table B on S.book = B.book
GROUP BY
B.author'''
cheat(union_chall2_c)
Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|
GROUP BY column_a WITH ROLLUP |
GROUP BY column_a WITH ROLLUP |
GROUP BY ROLLUP (column_a) |
not supported |
Unfortunately, SQLite doesn't have a simple way to do ROLLUP like the other RDBMSs, so we can't practice it here. However, the concept is very straightforward: it's exactly like using UNION
to add a summary row, except way simpler. Below is what the query would look like if we were using Microsoft or MySQL. Take a look at the query and the output to understand what's going on, even if you can't practice it:
SELECT
book,
SUM(revenue) AS total_revenue,
COUNT(*) AS count_of_sales
FROM
sales_table
GROUP BY
book WITH ROLLUP
In [244]:
HTML(run('''SELECT
book,
SUM(revenue) AS total_revenue,
COUNT(*) AS count_of_sales
FROM
sales_table
GROUP BY
book
UNION ALL
SELECT
'NULL',
SUM(revenue),
COUNT(*)
FROM
sales_table''').to_html(index=False))
Out[244]:
Note that rollup produces the word "NULL" for any row that it cannot sum up. Essentially its function is to find all columns with numbers and add these up. Numberless columns can't be added, so rollup just skips them.
Review and PRACTICE! Review the lessons and terms, re-do the quick exercises and challenges. Seriously. This stuff is easily forgotten if you don't use it, so be sure to refresh what you've learned before you start working.
Here's some additional reading on SQL if you're interested.
SELECT * FROM table_name
query unless you are absolutely certain that the table is very, very smallJOIN
clauses out.LIKE
and % sparingly or use WHERE
to limit your search as much as possible unless you are dealing with a small-ish tableSeveral times in class we've discussed certain practices that should only be applied to small tables to avoid strain on your server. You can use SELECT COUNT(*) FROM table_name
to see how many rows it has. You can also think of it like this: the more often the table is updated, the larger it probably is. A table that adds a row every time a user views a webpage is updated constantly and is probably huge. A table that simply lists all the ZIP codes in the US probably doesn't get updated often, it'll be pretty small.
Description | Microsoft SQL Server | MySQL | Oracle | SQLite |
---|---|---|---|---|
Reading a table's structure | SP_Help tablename |
DESCRIBE tablename |
DESCRIBE tablename |
PRAGMA TABLE_INFO(tablename) |
Limiting rows | SELECT TOP N column_name |
LIMIT N |
WHERE ROWNUM <= N |
LIMIT N |
JOIN or INNER JOIN |
✓ | ✓ | ✓ | ✓ |
LEFT JOIN or LEFT OUTER JOIN |
✓ | ✓ | ✓ | ✓ |
RIGHT JOIN or RIGHT OUTER JOIN |
✓ | ✓ | ✓ | not supported |
OUTER JOIN or FULL OUTER JOIN |
✓ | not supported | ✓ | not supported |
IF |
IF logical_test PRINT value_if_true |
IF(logical_test, value_if_true, value_if_false) (same as Excel) |
IF logical_test THEN value_if_true ELSIF...END IF |
NOT SUPPORTED |
CASE WHEN |
✓ | ✓ | ✓ | ✓ |
ROLLUP |
GROUP BY column_a WITH ROLLUP |
GROUP BY column_a WITH ROLLUP |
GROUP BY ROLLUP (column_a) |
not supported |