Sarah Beckett-Hile | NYU Stern School of Business | Spring 2016


To get started:

Click 'Cell ➤ Run All' above. When the program runs, it will skip down the page. After it stops skipping, scroll back up to the top to continue the lesson. Just a weird quirk of Jupyter.


In [1]:
from IPython.display import display, HTML, clear_output
HTML('''<script> code_show=true; function code_toggle() {if (code_show){$('div.input').hide();} 
else {$('div.input').show();}code_show = !code_show} $( document ).ready(code_toggle);
</script> <form action="javascript:code_toggle()"><input type="submit" value="Hide Raw Code"></form>''')


Out[1]:

In [2]:
from ipywidgets import interact, interactive, fixed, widgets
import pandas as pd
import sqlite3
import re

In [3]:
# just testing out the youtube player capabilites of Jupyter
#from IPython.display import YouTubeVideo
#YouTubeVideo("a1Y73sPHKxw", width=700, height=500)

In [4]:
# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sql_sample_db_new.sqlite")

book_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/book_table.csv')
auth_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/author_table.csv')
sales_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/sales_table.csv')
tech_cos = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/tech_cos.csv')
public_cos = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/public_cos.csv')
movie_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/movie_table.csv')

tables = [book_table, 
          auth_table, 
          sales_table, 
          tech_cos, 
          public_cos,
          movie_table]

table_names = ['book_table', 
               'auth_table', 
               'sales_table', 
               'tech_cos', 
               'public_cos',
               'movie_table']

# drop each table name if it already exists to avoid error if you rerun this bit of code
# then add it back (or add it for the first time, if the table didn't already exist)
for i in range(len(tables)):
    table_name = table_names[i]
    table = tables[i]
    con.execute("DROP TABLE IF EXISTS {}".format(table_name))
    pd.io.sql.to_sql(table, "{}".format(table_name), con, index=False)

In [5]:
# Function to make it easy to run queries on this mini-database
def run(query):
    try:
        results = pd.read_sql("{}".format(query), con).fillna(' ')
        return results
    except:
        pass

def run_q(query, button):
    def on_button_clicked(b):
        clear_output()
        new_value = query.value.replace('\n', ' ')
        if new_value != '':
            df = run(new_value)
            try:
                output = HTML(df.to_html(index=False))
                display(output)
            #except AttributeError:
            except:
                print('''SQL error!  Check your query:
    1. Text values are in quotation marks and capitalized correctly
    2. Items in the SELECT clause are comma-separated
    3. No dangling comma in the SELECT clause right before the FROM clause
    4. If you are joining tables that have columns with the same name, use table_name.column_name format
    5. Try "PRAGMA TABLE_INFO(table_name) to double-check the column names in the table
    6. Correct order of clauses:
        SELECT
        FROM
            JOIN...ON 
        WHERE
        GROUP BY
        ORDER BY
        LIMIT
        ''')
    button.on_click(on_button_clicked)
    on_button_clicked(None)
    
def cheat(answer):
    def f(Reveal):
        if Reveal == False:
            clear_output()
        else:
            print(answer)
    interact(f, Reveal=False)
    clear_output()



Table of Contents

Course Details

Introduction to SQL

Structure and Formatting Basics

Determine a table's strucure.............................PRAGMA TABLE_INFO()

Query building blocks.......................................SELECT & FROM

Filter your data..................................................WHERE

Wildcards and vague search.............................LIKE and %

Sort your data...................................................ORDER BY

Limit the number of rows you see.....................LIMIT

Combining tables..............................................JOIN
      •   Combining 3+ tables........................................Multiple JOIN's
      •   Different JOIN Types.......................................Overview
      •   Simple Join......................................................INNER JOIN aka JOIN
      •   One-Sided Join................................................LEFT JOIN
      •   Full Join............................................................OUTER JOIN
      •   Practice combining tables...............................JOIN Drills

Column & Table Aliases.....................................AS

Add, subtract, multiply, & divide data................Operators

Apply functions to columns...............................Functions

Group data by categories..................................GROUP BY

Filter out certain groups.....................................HAVING

Conditional values..............................................IF & CASE WHEN

SQL-ception: Queries within queries..................Nesting

Run multiple queries at once..............................UNION & UNION ALL

Add a summaryt/total row..................................ROLLUP

Summary

Full table of RDBMS dialect differences

Additional Resources:




      [Table of Contents](#table_of_contents)       |       [Next](#course_details)

SQL, "sequel", "ESS CUE ELL"

SQL stands for "Structured Query Language", but no one calls it that. You can pronouce it as either "S-Q-L" or "sequel". Some people feel strongly in favor of a particular pronunciation. I don't. I'll say "sequel" in class, but I'll never correct you for saying S-Q-L.

SQL is the database language of choice for most businesses; you use it to communicate with databases. "Communication" can take the form of creating, reading, updating, and deleting data. This course only covers reading data. That's all most MBAs do with SQL.


Relational Databases

Companies use relational databases because they can store and easily recall A LOT of data. Excel can't handle more than a million rows. If you're Amazon and you need to record every click, Excel is useless. Relational databases are much more efficient.

What do we mean by "efficient"? Every recorded "bit" takes up server space, which costs money. It also slows everything down. So an efficient database should allow you to record and recall a lot of information using the minimal number of bits.

Imagine you want to store the names of four books and some information about their authors. Think of the character count as a proxy for how many bits of storage your table takes up:

Now imagine you want to add some more books by each of those authors. Some of the information gets redundant. Imagine if you had to do this for millions of different books:

This is where relational databases can help. With a relational database, you'd create two separate tables that relate to each other. You still storing the same information, but you're doing it by using fewer characters. You've eliminated the need to repeat yourself, so you've made a much more efficient database.


SQL Dialects

There are different softwares that can manage relational databases. SQL varies a little from software to software, just like English varies a little between England and the U.S. We'll address these instances whenever possible.

Each software is called a Relational Database Management System, or RDBMS. These are some of the most popular that you might encounter at work:

Microsoft SQL Server MySQL Oracle SQLite
Proprietary, more common at older companies Open source, frequently used by startups and tech companies Proprietary, more common at older companies Frequently used for mobile apps (and this class!)





[Previous](#introduction)      |       [Table of Contents](#table_of_contents)       |       [Next](#formatting)

The goal is to start simple and practice often. By the end of this class, you should feel extremely comfortable writing moderately complicated SQL code, which will save you countless hours trying to figure out SQL on the job or waiting for someone else with SQL knowledge to pull data for you. Using this interactive program, we'll explore a small sample database by learning new SQL concepts one at a time. Concepts will build on each other.

Quick Exercises

Sometimes you'll be asked to edit or delete parts of a provided query. Rerun the query with each step, taking care to understand what changed with the output each time. Note that none of the changes that you make to these queries will be saved when you close this program.

Try it by changing something in the cell below and hitting "Run"!


In [6]:
test_ex = widgets.Textarea(value=
'''You can change text in these boxes to edit and re-run queries!''',
width = '50em', height = '7em')
display(test_ex)

In [7]:
text_ex_button = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(text_ex_button)
def on_button_clicked2(b):
        clear_output()
        print('''Here's the output from the cell above:
        ''', test_ex.value)
text_ex_button.on_click(on_button_clicked2)

Challenges

After we've learned a new concept and you've practiced with some quick exercises, you'll be challenged to write your own query. Read each challenge carefully, and keep re-running it until you get the results you are looking for.

Need to cheat a little? Check the "Reveal" box to see the answer to a challenge.


In [8]:
chall ='''When you click a checkbox, you reveal the answer to a challenge! 
Uncheck it to hide the answer again.'''
cheat(chall)

Using this program

The content you're currently reading is written in Python, Markdown and HTML and runs in a Jupyter Notebook. No need to know what any of that means, I only told you in case you were terribly curious.

You will not be using this interface at work - the point of this class is to teach you SQL the language, which can be typed into a variety of different software programs. You'll be able to learn the quirks of a different software program pretty easily as long as you know SQL.

Still, there are some things you should know about this program to help you with the class:

  • If you accidentally double-click on a block of text, and suddenly it looks like code, hit SHIFT-RETURN or 'Cell ➤ Run'.
  • If you try to run a query and the output doesn't refresh, select 'Cell ➤ Run All' to reboot the program.
  • If you accidentally delete a cell, click 'Edit ➤ Undo Delete Cell'
  • Nothing that you write in the challenges and exercises will save after you close this program.
  • If you want to save something that you've written, follow the steps below:
A grey "cell" like this will appear when you follow Step 1 (click the + sign in the toolbar) and Step 2 (change the cell to Raw NBConvert). Write your notes in the cell; the program will automatically save it.














[Previous](#course_details)       |       [Table of Contents](#table_of_contents)       |       [Next](#pragma_table)

Structure and Formatting Query Basics:

Below is an example of a query, SQL code that requests data from a database. Try to make a habit of writing queries by following these formatting conventions. Queries can get very long and complicated, and formatting makes them easier to read.




[Previous](#formatting)       |       [Table of Contents](#table_of_contents)       |       [Next](#select_from)

Microsoft SQL Server MySQL Oracle SQLite
SP_Help some_table DESCRIBE some_table DESCRIBE some_table PRAGMA TABLE_INFO(some_table)

SQLite version that we'll be using for this class:

PRAGMA TABLE_INFO(some_table)                 ➞ result-set lists the column names and data types within the table

We're using SQLite, so we're going to be using the PRAGMA TABLE_INFO() option. Put the name of a table in the parentheses, and the output tells you the names and data types in each column in the table.

So far, we've learned about 2 tables in our relational database, which we'll call book_table and auth_table. We're also going to use a sales_table, which we'll take a look at later on. Combined, these three tables will make up the "database" of a very tiny, very limited, very imaginary bookstore.

We'll start by reviewing the book_table and auth_table:

Now, we'll use PRAGMA TABLE_INFO() to read the table structure of the book_table. In plain English, the query below says "show me the names of the columns in the book_table, and what type of data (text, numbers?) is in each column."


In [9]:
pragma = widgets.Textarea(
    value='''PRAGMA TABLE_INFO(book_table)''',
    width = '50em', height = '3em')
display(pragma)

In [10]:
prag_button = widgets.Button(description='Run',width='10em', height='2.5em', color='white', background_color='black', border_color='black')
display(prag_button)
run_q(pragma, prag_button)


cid name type notnull dflt_value pk
0 book TEXT 0 0
1 cogs REAL 0 0
2 author TEXT 0 0
  • name tells us the names of each column in the table. So now we know that the book_table has columns headed Book, COGs, and Author
  • type tells us what type of data is in each column. So now we know that the Book column has TEXT data, and that COGs contains REAL numbers - numbers that can have a fractional value.
  • All other columns you can ignore. Seriously.

Quick Exercise:

Change the query above to look at the auth_table instead. Why is the author's birth_year data type not REAL like we saw with COGs?


Challenge:

Rewrite the query above to take a look at the sales_table structure. Judging from what your query returns, can you guess what you'll probably see once you actually look all the data in the sales_table?


In [11]:
pragma_sales = widgets.Textarea(
    value='',
    width = '50em',
    height = '4em'
)
display(pragma_sales)

In [12]:
prag_sales_button = widgets.Button(description='Run',width='10em', height='2.5em', color='white', background_color='black', border_color='black')
display(prag_sales_button)
run_q(pragma_sales, prag_sales_button)

In [13]:
prag_sales_answer =  'PRAGMA TABLE_INFO(sales_table)'
cheat(prag_sales_answer)



[Previous](#pragma_table)       |       [Table of Contents](#table_of_contents)       |       [Next](#where)

SELECT
     *                   
➞ an asterisk means "all columns"
FROM
     table_name

To see the actual data in a table, we'll use SELECT and FROM clauses. In the SELECT clause, you tell SQL which columns you want to see. In the FROM clause, you tell SQL the table where those columns are located. An asterisk returns all columns from a particular table.

In plain English, the query below says: "Show me all columns and their data from the book_table"


In [14]:
select = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table''',
width = '50em', height = '7em')
display(select)

In [15]:
select_button = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_button)
run_q(select, select_button)


book cogs author
For Whom the Bell Tolls 13.50 Hemingway
A Farewell to Arms 10.35 Hemingway
The Sun Also Rises 11.35 Hemingway
Sense and Sensibility 11.00 Austen
Emma 12.50 Austen
Pride and Prejudice 13.00 Austen
Romeo and Juliet 7.50 Shakespeare
Hamlet 6.95 Shakespeare
Macbeth 11.10 Shakespeare
The Sound and the Fury 8.50 Faulkner
Absalom! Absalom! 5.15 Faulkner

Quick Exercise:

Change the query above to show us all columns and their data from the auth_table instead of the book_table


Challenge:

Write a query to view all columns and their data from the sales_table


In [16]:
select_c = widgets.Textarea(value='', width = '50em')
display(select_c)

In [17]:
select_c_button = widgets.Button(
    description='Run',
    width='10em',
    height='2.5em', 
    color='white', 
    background_color='black',
    border_color='black')
display(select_c_button)
run_q(select_c, select_c_button)

In [18]:
select_answer ='''SELECT 
    *
FROM 
    sales_table'''
cheat(select_answer)



Use asterisks sparingly. Usually, you'll select specific columns from a table rather than all columns. Using an asterisk to select all columns is okay when the table is small or when you tightly constrain your selection of rows. Otherwise, select specific columns and use WHERE and LIMIT (taught below) to go easy on your servers.


SELECT specific columns:

SELECT
      column_a,
                ➞ separate multiple columns with commas
      column_b                   ➞ optional, but conventional, to also use a return
FROM
      table_name

Instead of using an asterisk for "all columns", you can specify a particular column or columns. In plain English: "Show me the data in the book and author columns from the book_table"


In [19]:
select_col = widgets.Textarea(value=
'''SELECT
    book,
    author
FROM
    book_table''',
width = '50em', height = '8em')
display(select_col)

In [20]:
select_col_button = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_col_button)
run_q(select_col, select_col_button)


book author
For Whom the Bell Tolls Hemingway
A Farewell to Arms Hemingway
The Sun Also Rises Hemingway
Sense and Sensibility Austen
Emma Austen
Pride and Prejudice Austen
Romeo and Juliet Shakespeare
Hamlet Shakespeare
Macbeth Shakespeare
The Sound and the Fury Faulkner
Absalom! Absalom! Faulkner

Challenge:

Write a query to show the first_name and last_name columns from the auth_table


In [21]:
select_cols_chall = widgets.Textarea(value=
'',
width = '50em', height = '8em')
display(select_cols_chall)

In [22]:
select_cols_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_cols_chall_b)
run_q(select_cols_chall, select_cols_chall_b)

In [23]:
select_cols_chall_cheat ='''SELECT 
    first_name,
    last_name
FROM
    auth_table'''
cheat(select_cols_chall_cheat)

Challenge:

Write a query to select only the book column from the sales_table


In [24]:
select_cols_chall2 = widgets.Textarea(value='', width = '50em', height = '8em')
display(select_cols_chall2)

In [25]:
select_cols_chall_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(select_cols_chall_b2)
run_q(select_cols_chall2, select_cols_chall_b2)

In [26]:
select_cols_chall_cheat2 ='''SELECT 
    book
FROM
    sales_table'''
cheat(select_cols_chall_cheat2)


SELECT DISTINCT:

SELECT
      DISTINCT column_a
                ➞ returns only unique values
FROM
      table_name

Use DISTINCT to return unique values from a column, so if there are any repeats in a column, your output will include each value just once. The query below displays each book in the sales_table just once, even though we know each shows up multiple times in the table.


In [27]:
distinct_q = widgets.Textarea(value=
'''SELECT
    DISTINCT book
FROM
    sales_table''',
width = '50em', height = '7em')
display(distinct_q)

In [28]:
distinct_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(distinct_b)
run_q(distinct_q, distinct_b)


book
Absalom! Absalom!
For Whom the Bell Tolls
Macbeth
The Sound and the Fury
The Sun Also Rises
Hamlet
Pride and Prejudice
Sense and Sensibility
Emma
Romeo and Juliet
A Farewell to Arms

Challenge:

Write a query to return each author from the book_table without any names repeating.


In [29]:
distinct_q_chall = widgets.Textarea(value='', width = '50em', height = '7em')
display(distinct_q_chall)

In [30]:
distinct_q_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(distinct_q_chall_b)
run_q(distinct_q_chall, distinct_q_chall_b)

In [31]:
REP ='''SELECT 
    DISTINCT author
FROM
    book_table'''
cheat(REP)



[Previous](#select_from)       |       [Table of Contents](#table_of_contents)       |       [Next](#where_like)

SELECT
      column_a
FROM
      table_name
WHERE
      column_a = x
                ➞ result-set will only include rows where value of column_a is x

WHERE lets you filter results so you only see rows that specifically match your criteria. Below there are few more options for the WHERE clause:

Options for WHERE Description
col = 'some_text' Put text in quotations. Capitalization is important!
col != x Return rows where col's values DO NOT equal x
col < x Return rows where col's value is less than x
col <= x Return rows where col's value is less OR EQUAL TO than x
col IN (x, y) Values can equal EITHER x OR y
col NOT IN (x, y) Value are NEITHER x NOR y
col BETWEEN x AND y Values are between x and y
col = x AND another_col = y Returns rows when col's values are x AND another_col's values are y
col = x OR another_col = y Returns rows when col's values are x OR another_col's values are y

WHERE & text values

Below, we use WHERE to tell SQL to only show us rows in the book_table when Hemingway is the author. In plain English, we're saying "Show me information about books that are written by Hemingway in the book_table"


In [32]:
where_q = widgets.Textarea(value=
'''SELECT 
    *
FROM
    book_table
WHERE
    author = 'Hemingway' ''',
width = '50em', height = '10em')
display(where_q)

In [33]:
where_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(where_b)
run_q(where_q, where_b)


book cogs author
For Whom the Bell Tolls 13.50 Hemingway
A Farewell to Arms 10.35 Hemingway
The Sun Also Rises 11.35 Hemingway

Quick Exercises:

  1. Above, change the name from 'Hemingway' to 'Shakespeare', rerun
  2. Delete the quotation marks around the word Shakespeare, rerun. Why the error?
  3. Put double quotation marks, rerun
  4. Change "Shakespeare" to "shakespeare", rerun
  5. Change "shakespeare" to "Twain", rerun
  6. Change "Twain" to 'Hemingway', rerun to get back to where we started
  7. Change = in the WHERE clause to !=, rerun

Challenge:

Write a query to return all columns of the auth_table, but only rows where the author's country is England.


In [34]:
where_q_chall = widgets.Textarea(value='',
width = '50em', height = '10em')
display(where_q_chall)

In [35]:
where_b_chall = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(where_b_chall)
run_q(where_q_chall, where_b_chall)

In [36]:
where_q_chall_cheat ='''SELECT
    *
FROM
    auth_table
WHERE
    country = 'England' '''
cheat(where_q_chall_cheat)

We use IN (value_1, value_2) to return rows that can match more than one value. In plain English, the query below says, "Show me all columns from the book table when the author is EITHER Hemingway OR Austen"


In [37]:
in_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
WHERE
    author IN ('Hemingway', 'Austen')''',
width = '50em', height = '10em')
display(in_q)

In [38]:
in_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(in_b)
run_q(in_q, in_b)


book cogs author
For Whom the Bell Tolls 13.50 Hemingway
A Farewell to Arms 10.35 Hemingway
The Sun Also Rises 11.35 Hemingway
Sense and Sensibility 11.00 Austen
Emma 12.50 Austen
Pride and Prejudice 13.00 Austen

Quick Exercise:

  1. Add 'Faulkner' to the list, rerun.
  2. Replace IN with NOT IN, rerun.
  3. Delete the whole last line and replace it so that the query returns all books except for Emma and Macbeth.


WHERE & number values

The WHERE clause is useful with numbers as well. We can start throwing in comparisons like less than (<) and greater than (>):


In [39]:
greater_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    sales_table
WHERE
    revenue > 18''',
width = '50em', height = '10em')
display(greater_q)

In [40]:
greater_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(greater_b)
run_q(greater_q, greater_b)


id book revenue date
2 For Whom the Bell Tolls 19.74 1/1/16
4 For Whom the Bell Tolls 19.26 1/1/16
24 For Whom the Bell Tolls 18.94 1/5/16
25 Emma 18.21 1/6/16
47 For Whom the Bell Tolls 18.34 1/13/16
64 Pride and Prejudice 18.82 1/15/16
88 Emma 18.18 1/24/16
91 Pride and Prejudice 18.62 1/26/16

Quick Exercises:

  1. Replace > with <, rerun
  2. Add an = directly after the <, rerun
  3. Change the line to revenue BETWEEN 10 AND 12, rerun

Challenge:

Write a query that returns all columns from the auth_table for authors with a birth_year before 1800:


In [41]:
born_chall = widgets.Textarea(value='',width = '50em', height = '10em')
display(born_chall)

In [42]:
born_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(born_chall_b)
run_q(born_chall, born_chall_b)

In [43]:
born_chall_c ='''SELECT
    *
FROM
    auth_table
WHERE
    birth_year < 1800
'''
cheat(born_chall_c)


WHERE with AND/OR

So far, we've only filtered by a specific column (like the revenue column , country, or author columns). Sometimes you'll want to filter by multiple columns. This is where AND and OR come in handy.


In [44]:
and_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
WHERE
    author = 'Hemingway'
    AND cogs > 11''',
width = '50em', height = '12em')
display(and_q)

In [45]:
and_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(and_b)
run_q(and_q, and_b)


book cogs author
For Whom the Bell Tolls 13.50 Hemingway
The Sun Also Rises 11.35 Hemingway

Quick Exercises:

  1. Delete AND cogs > 11 and rerun the query. Then replace it and run it again.
  2. Change the word AND to OR, rerun. What's going on?

Challenge:

Write a query to pull the last_name, country, and birth_year of authors who were from England AND born after 1650


In [46]:
and_chall = widgets.Textarea(value=
'',
width = '50em', height = '14em')
display(and_chall)

In [47]:
and_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(and_chall_b)
run_q(and_chall, and_chall_b)

In [48]:
and_chall_c ='''SELECT
    last_Name,
    country,
    birth_Year
FROM
    auth_table
WHERE
    country = 'England'
    AND birth_Year > 1650 '''
cheat(and_chall_c)

Challenge:

Write a query to see all columns from the sales_table where the book name is Macbeth OR revenue was greater than $17.


In [49]:
or_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(or_chall)

In [50]:
or_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(or_chall_b)
run_q(or_chall, or_chall_b)

In [51]:
or_chall_cheat ='''SELECT
    *
FROM
    sales_table
WHERE
    book = 'Macbeth'
    OR revenue > 17 '''
cheat(or_chall_cheat)

[Previous](#where)       |       [Table of Contents](#table_of_contents)       |       [Next](#order_by)

SELECT
      column_a
FROM
     table_name
WHERE
     column_a LIKE 's%Me_t%xT'
                 ➞ correct capitalization isn't necessary with LIKE, and % stands in for any missing character

LIKE lets you search for a value even if you capitalize it incorrectly. It also allows you to work with percentage signs that act as wildcards, which stand in for an unlimited number of missing characters (helpful if you don't know how to spell something). Take a look at the query below. Recall that earlier when we wrote author = 'shakespeare', we got no results.


In [52]:
like_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
WHERE
    author LIKE 'hemingway' ''',
width = '50em', height = '10em')
display(like_q)

In [53]:
like_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(like_b)
run_q(like_q, like_b)


book cogs author
For Whom the Bell Tolls 13.50 Hemingway
A Farewell to Arms 10.35 Hemingway
The Sun Also Rises 11.35 Hemingway

Quick Exercises:

  1. Replace 'hemingway' with 'hemingWAY', rerun
  2. Replace LIKE with =, rerun
  3. Replace = with LIKE again, but change 'hemingWAY' to 'Hemmingway', rerun
  4. Replace 'Hemmingway' with 'Hem', rerun

Using % as a "wildcard"

With exercises #3 and #4, you saw that LIKE alone has a limitation - it only lets you mess with capitalization. You need wildcards to do more with LIKE. Let's say you can't remember if Hemingway is spelled with 1 "m" or 2. Use a percentage sign (%) to get the value you're looking for:


In [54]:
like_q2 = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
WHERE
    author LIKE 'He%ingway' ''',
width = '50em', height = '11em')
display(like_q2)

In [55]:
like_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(like_b2)
run_q(like_q2, like_b2)


book cogs author
For Whom the Bell Tolls 13.50 Hemingway
A Farewell to Arms 10.35 Hemingway
The Sun Also Rises 11.35 Hemingway

Quick Exercises:

  1. Change 'He%ingway' to 'Hemm%ingway'. Why doesn't this work?
  2. Change 'Hemm%ingway' to 'Hem%', rerun
  3. Change 'Hem%' to '%us%', rerun
  4. Change LIKE to =, rerun (see how wildcards only work with LIKE?)

Challenge:

Write a query to pull the book and author columns from the book_table. Pretend you can't remember the full name of the book you're looking for. You just know it starts with the word "Pride".


In [56]:
like_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(like_chall)

In [57]:
like_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(like_chall_b)
run_q(like_chall, like_chall_b)

In [58]:
like_chall_c ='''SELECT
    book,
    author
FROM
    book_table
WHERE
    book LIKE 'Pride %' '''
cheat(like_chall_c)



Use LIKE sparingly: It's a great tool, but it really puts the strain on your database's servers. Use it only when a table is pretty small or when you've limited your result-set by using additional filters in the WHERE clause.




[Previous](#where_like)       |       [Table of Contents](#table_of_contents)       |       [Next](#limit)

SELECT
      column_a
FROM
      table_name

[WHERE clause, optional]
ORDER BY                                              ➞ sorts the result-set by column_a
      column_a DESC                             ➞ DESC is optional, it sorts results in descending order

Without an ORDER BY clause, the default result-set will be sorted by however it appears in the database (which is crap-shoot depending on the type of table). Use ORDER BY to sort your result-set by a particular column, and add DESC to sort in descending order (Z→A, 100→1).


In [59]:
order_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
ORDER BY
    book''',
width = '50em', height = '11em')
display(order_q)

In [60]:
order_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(order_b)
run_q(order_q, order_b)


book cogs author
A Farewell to Arms 10.35 Hemingway
Absalom! Absalom! 5.15 Faulkner
Emma 12.50 Austen
For Whom the Bell Tolls 13.50 Hemingway
Hamlet 6.95 Shakespeare
Macbeth 11.10 Shakespeare
Pride and Prejudice 13.00 Austen
Romeo and Juliet 7.50 Shakespeare
Sense and Sensibility 11.00 Austen
The Sound and the Fury 8.50 Faulkner
The Sun Also Rises 11.35 Hemingway

Quick Exercises:

  1. Change the query so it sorts by author instead
  2. Add DESC and rerun
  3. Delete author DESC, replace it with author, book, rerun
  4. Add DESC so it reads author, book DESC, rerun
  5. Change the line to author DESC, book

Challenge:

Write a query to see the book and revenue columns from the sales_table, and sort the results by revenue in descending order.


In [61]:
order_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(order_chall)

In [62]:
order_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(order_chall_b)
run_q(order_chall, order_chall_b)

In [63]:
order_chall_c ='''SELECT
    book,
    revenue
FROM
    sales_table
ORDER BY
    revenue DESC'''
cheat(order_chall_c)

Challenge:

Write a query to view all columns from the book_table, but only where the author's name is something like "pear" or COGs are over $12. Sort your results by COGs with the cheapest book first.


In [64]:
order_chall2 = widgets.Textarea(value=
'',
width = '50em', height = '16em')
display(order_chall2)

In [65]:
order_chall_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(order_chall_b2)
run_q(order_chall2, order_chall_b2)

In [66]:
order_chall_c2 ='''SELECT 
    *
FROM
    book_table
WHERE
    author LIKE '%pear%'
    OR cogs > 12
ORDER BY 
    cogs'''
cheat(order_chall_c2)



[Previous](#order_by)       |       [Table of Contents](#table_of_contents)       |       [Next](#join_tables)

Microsoft SQL Server | MySQL | Oracle | SQLite :------------------: | :---: | :----: | :----: SELECT TOP N column_name | LIMIT N | WHERE ROWNUM <= N | LIMIT N

SELECT
     column_a
FROM
     table_name
[WHERE clause]
[ORDER BY clause]
LIMIT N                             ➞ limits the result-set to N rows

LIMIT lets you set a maximum limit to the number of rows that your query returns. You've seen the query below before, but now we've added a LIMIT clause:


In [67]:
limit_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    sales_table
LIMIT 5''',
width = '50em', height = '14em')
display(limit_q)

In [68]:
limit_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(limit_b)
run_q(limit_q, limit_b)


id book revenue date
1 Absalom! Absalom! 6.44 1/1/16
2 For Whom the Bell Tolls 19.74 1/1/16
3 Macbeth 8.90 1/1/16
4 For Whom the Bell Tolls 19.26 1/1/16
5 The Sound and the Fury 10.64 1/1/16

Quick Exercises:

  1. Change 5 to 10, rerun
  2. Add an ORDER BY clause so that you see the top 10 transactions in terms of revenue, rerun
  3. Add a WHERE clause so you only see transactions relating to Emma

Challenge:

Write a query to view the title and cogs of the two books with the lowest cogs in the book_table


In [69]:
limit_chall = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(limit_chall)

In [70]:
limit_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(limit_chall_b)
run_q(limit_chall, limit_chall_b)

In [71]:
limit_chall_c ='''SELECT
    book,
    cogs
FROM
    book_table
ORDER BY
    cogs
LIMIT 2'''
cheat(limit_chall_c)

Challenge:

Write a query to view the book and revenue columns from the sales_table and sort by book title first, then by revenue (ascending). Limit your results to 15 rows.


In [72]:
limit_chall2 = widgets.Textarea(value=
'',
width = '50em', height = '12em')
display(limit_chall2)

In [73]:
limit_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(limit_chall2_b)
run_q(limit_chall2, limit_chall2_b)

In [74]:
limit_chall2_c ='''SELECT 
    book,
    revenue
FROM
    sales_table
ORDER BY
    book, revenue
LIMIT 15'''
cheat(limit_chall2_c)



[Previous](#limit)       |       [Table of Contents](#table_of_contents)       |       [Next](#multi_join)

SELECT
      table_x.column_a,
                               ➞ read this as "column_a from table_x"
      table_y.column_b,                                ➞ "column_b from table_y"
FROM
     table_x
     JOIN table_y

      ON table_x.key_column_x = table_y.key_column_y
                               ➞ table_x's key_column_x has corresponding values with table_y's key_column_y
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]

The ability to join tables is the most fundamental and useful part about relational databases. Different tables have columns with corresponding values, and you can use these columns as "keys" to join the two tables.

The format table_x.key_column can be read as "key_column from table_x"; it tells SQL the tables where a column is located. We didn't need this before because we were only using one table at a time, so SQL knew exactly which table we were talking about. When we deal with more than one table, we need to be more specific. So for example, book_table.book means "the book column from the book_table", and auth_table.last_name means "the last_name column from the auth_table."

Think back to our original discussion of splitting up our author and book data onto two separate tables:

You could think of the columns in these tables in terms of a Venn Diagram. Again, the format table_x.key_column is read as "key_column from table_x", so book_table.author means "the author column from the book_table":

The author column from the book_table corresponds with the last_name column in the auth_table - they both list the last names of the writers. Whenever you have two tables corresponding columns, you can "join" them by telling SQL to use these corresponding columns as keys. book_table.author and auth_table.last_name are the key columns for our join.


In [75]:
join_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
    JOIN auth_table
        ON book_table.author = auth_table.last_name''',
width = '50em', height = '11em')
display(join_q)

In [76]:
join_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_b)
run_q(join_q, join_b)


book cogs author last_name first_name country birth_year
For Whom the Bell Tolls 13.50 Hemingway Hemingway Ernest USA 1899
A Farewell to Arms 10.35 Hemingway Hemingway Ernest USA 1899
The Sun Also Rises 11.35 Hemingway Hemingway Ernest USA 1899
Sense and Sensibility 11.00 Austen Austen Jane England 1775
Emma 12.50 Austen Austen Jane England 1775
Pride and Prejudice 13.00 Austen Austen Jane England 1775
Romeo and Juliet 7.50 Shakespeare Shakespeare William England 1564
Hamlet 6.95 Shakespeare Shakespeare William England 1564
Macbeth 11.10 Shakespeare Shakespeare William England 1564
The Sound and the Fury 8.50 Faulkner Faulkner William USA 1897
Absalom! Absalom! 5.15 Faulkner Faulkner William USA 1897

What just happened? SQL went through these steps:

  • First, SQL pulled up the two tables that we named in the FROM clause: FROM book_table JOIN auth_table.
  • Then it identified the "key" columns that we named with ON: ON book_table.author = auth_table.last_name.
  • Next, it matched up the values on the key columns:

  • Whenever it found a match, it made a kind of copy of the row from the auth_table and pasted it to the book_table

  • Finally, it literally "joined" the two tables by returning their columns in a single table

Quick Exercises:

The JOIN query we've been discussing has been reproduced in the box below for these exercises.

  1. Change the query so you only see the book, first_name and the author's last name (you can do this with either author or last_name), and the birth_year, then rerun.
  2. Add a WHERE clause so that you only see books by Hemingway and Austen, rerun.
  3. Add an ORDER BY clause so that the author born first appears first, and so that their books appear in alphabetical order. Rerun.

In [77]:
join_q2 = widgets.Textarea(value=
'''SELECT
    *
FROM
    book_table
    JOIN auth_table
        ON book_table.author = auth_table.last_name''',
width = '50em', height = '18em')
display(join_q2)

In [78]:
join_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_b2)
run_q(join_q2, join_b2)


book cogs author last_name first_name country birth_year
For Whom the Bell Tolls 13.50 Hemingway Hemingway Ernest USA 1899
A Farewell to Arms 10.35 Hemingway Hemingway Ernest USA 1899
The Sun Also Rises 11.35 Hemingway Hemingway Ernest USA 1899
Sense and Sensibility 11.00 Austen Austen Jane England 1775
Emma 12.50 Austen Austen Jane England 1775
Pride and Prejudice 13.00 Austen Austen Jane England 1775
Romeo and Juliet 7.50 Shakespeare Shakespeare William England 1564
Hamlet 6.95 Shakespeare Shakespeare William England 1564
Macbeth 11.10 Shakespeare Shakespeare William England 1564
The Sound and the Fury 8.50 Faulkner Faulkner William USA 1897
Absalom! Absalom! 5.15 Faulkner Faulkner William USA 1897

Challenge:

Our "database" has another Venn Diagram relationship: the book_table is related to the sales_table. Write a query to join these tables and view all their columns but limit your results to 20 rows. Use the Venn Diagram below as a guide:


In [79]:
join_chall = widgets.Textarea(value=
'',
width = '50em', height = '15em')
display(join_chall)

In [80]:
join_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_chall_b)
run_q(join_chall, join_chall_b)

In [81]:
join_chall_c ='''SELECT
    *
FROM
    book_table
    JOIN sales_table
        ON book_table.book = sales_table.book
LIMIT 20'''
cheat(join_chall_c)

Challenge Continued:

Start by copying the query that you wrote in the previous challenge and pasting it in the box below.

  1. Change the line book_table.book = sales_table.book to book = book and rerun. What's going wrong? Fix it and rerun.
  2. Change the query so that you only see the book title listed once. If you get stuck, remember that table_x.column_a means "column_a from table_x".

In [82]:
join_chall2 = widgets.Textarea(value=
'',
width = '50em', height = '17em')
display(join_chall2)

In [83]:
join_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_chall2_b)
run_q(join_chall2, join_chall2_b)

In [84]:
join_chall_c2 ='''SELECT
    book_table.book,           [NOTE: or you could use "sales_table.book" instead]
    cogs, 
    author,
    id,
    revenue
FROM
    book_table
    JOIN sales_table
        ON book_table.book = sales_table.book
LIMIT 20'''
cheat(join_chall_c2)



[Previous](#join_tables)       |       [Table of Contents](#table_of_contents)       |       [Next](#join_types)

SELECT
      table_x.column_a,
     table_y.column_b
     table_z.column_c
FROM
      table_x
      JOIN table_y ON table_x.key_column = table_y.key_column
      JOIN table_z ON table_x.other_key_column = table_z.other_key_column
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]

As long as they are directly related or related by the transitive property, you can join multiple tables. Consider the sales_table and the auth_table in a Venn Diagram - there's no relation at all:

However, when the book_table enters the picture, suddenly the sales_table and auth_table have a connection:

Now we have an opportunity to join all three!

Challenge:

Write a query to show the first and last name of the author, the book title, the COGs, and the revenue from each transaction.

Extra credit once you've completed the challenge: Only return rows where the book was written by an English author. Sort your results so that the transaction with the most revenue appears first.


In [85]:
multi_join_chall = widgets.Textarea(value=
'',
width = '50em', height = '25em')
display(multi_join_chall)

In [86]:
multi_join_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(multi_join_chall_b)
run_q(multi_join_chall, multi_join_chall_b)

In [87]:
multi_join_chall_c ='''SELECT
    first_name, 
    last_name,
    book_table.book,
    cogs,
    revenue
FROM
    book_table
    JOIN auth_table 
        ON book_table.author = auth_table.last_name
    JOIN
        sales_table
        ON book_table.book = sales_table.book'''
cheat(multi_join_chall_c)

Answer to the extra credit:


In [88]:
multi_join_chall_c2 ='''SELECT
    first_name, 
    last_name,
    book_table.book,
    cogs,
    revenue
FROM
    book_table
    JOIN auth_table 
        ON book_table.author = auth_table.last_name
    JOIN
        sales_table
        ON book_table.book = sales_table.book
WHERE
    country = 'England'
ORDER BY
    revenue DESC'''
cheat(multi_join_chall_c2)


Use multiple joins sparingly: Multiple joins can put a lot of strain on servers because SQL has to do a lot of work matching up all that data. The more tangential the relationship, the worse it gets. Avoid more than 2 degrees of separatation, and avoid joining 2 or more large tables. It's ok if one of your tables is big, but the others should be small.

[Previous](#multi_join)       |       [Table of Contents](#table_of_contents)       |       [Next](#inner_joins)

There are more ways to join two tables than the method we just covered. However, not all RDBMS support these different join methods. We'll learn about each of these methods, even if we can only practice 2 of them in SQLite.

Join Type Microsoft SQL Server MySQL Oracle SQLite
JOIN or INNER JOIN
LEFT JOIN or LEFT OUTER JOIN
RIGHT JOIN or RIGHT OUTER JOIN not supported
OUTER JOIN or FULL OUTER JOIN not supported not supported

We're going to leave behind the book database for the next lesson, since a different data set will help illustrate the point a little better.

Until now, the tables that we've joined have had columns that correspond perfectly - that is to say, every value that appears in one table also appears in the other. There aren't any authors that appear in the auth_table that don't also appear at least once in the book_table, and vice versa.

Sometimes, you'll have two tables with corresponding columns, but they don't match perfectly. Consider the two tables below. The first lists tech companies and their CEOs, the second lists publicly traded companies and their share price.

Amazon, Alphabet, and Microsoft all appear on both tables. But Uber, SpaceX and AirBnB - which haven't IPO'd - aren't on the public_cos table. Conversely, Walmart, GE and P&G only appear on the public_cos table.

Even though it's not a perfect match, there is some overlap. So, we can still use the company columns from each table as keys to join the tables:


[Previous](#join_types)       |       [Table of Contents](#table_of_contents)       |       [Next](#left_joins)

SELECT
      table_x.column_a,
                               ➞ read this as "column_a from table_x"
      table_y.column_b,                                ➞ "column_b from table_y"
FROM
     table_x
     JOIN table_y
                                           ➞ SQL interprets JOIN and INNER JOIN as the same thing
           ON table_x.key_column = table_y.key_column

So what would happen if we tried to join public_cos and tech_cos using the method we just learned with the book database? We'll give it a shot:


In [89]:
inner_join_q = widgets.Textarea(value=
'''SELECT
    *
FROM
    tech_cos
    JOIN public_cos 
        ON tech_cos.company = public_cos.company''',
width = '50em', height = '11em')
display(inner_join_q)

In [90]:
inner_join_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(inner_join_b)
run_q(inner_join_q, inner_join_b)


CEO Company Company Share_Price
Jeff Bezos Amazon Amazon 690
Larry Page Alphabet Alphabet 570
Satya Nadellla Microsoft Microsoft 60

This time, SQL can't find a match for every value in the two different company columns:

So it performs an "INNER JOIN". You can write either JOIN or INNER JOIN - SQL will interpret them as the same thing. This eliminates any rows that don't have matching values, then combined the tables:


[Previous](#inner_joins)       |       [Table of Contents](#table_of_contents)       |       [Next](#outer_joins)

SELECT
      table_x.column_a,
      table_y.column_b,
FROM
     table_x
     LEFT JOIN table_y
                ➞ see all results from the first ("left") table & results when available from second table
           ON table_x.key_column = table_y.key_column

If you want to make sure you see all the rows from a particular table - even if there's no match in the other table - you can do a LEFT JOIN instead. It lets you prioritize the results from one table over another. Let's say your priority is to see all tech companies in your result-set, but you also want to see the share_price when that data is available:


In [91]:
left_q = widgets.Textarea(value=
'''SELECT 
    *
FROM
    tech_cos
    LEFT JOIN public_cos
        ON tech_cos.company = public_cos.company''',
width = '50em', height = '11em')
display(left_q)

In [92]:
left_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(left_b)
run_q(left_q, left_b)


CEO Company Company Share_Price
Jeff Bezos Amazon Amazon 690
Travis Kalanick Uber
Larry Page Alphabet Alphabet 570
Brian Chesky Airbnb
Elon Musk SpaceX
Satya Nadellla Microsoft Microsoft 60

With a LEFT JOIN, SQL still starts by looking for matching values.

When it fails to find a match, it will still keep the values on the "left" table, but get rid of the unmatched values on the "right" table.

Quick Exercise:

The query from above has been reproduced below for these exercises.

  1. Figure out how to change the query above so that you only see the company column appear once.
  2. Rewrite the query so that public_cos becomes the priority table instead.

In [93]:
left_q2 = widgets.Textarea(value=
'''SELECT 
    *
FROM
    tech_cos
    LEFT JOIN public_cos
        ON tech_cos.company = public_cos.company''',
width = '50em', height = '11em')
display(left_q2)

In [94]:
left_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(left_b2)
run_q(left_q2, left_b2)


CEO Company Company Share_Price
Jeff Bezos Amazon Amazon 690
Travis Kalanick Uber
Larry Page Alphabet Alphabet 570
Brian Chesky Airbnb
Elon Musk SpaceX
Satya Nadellla Microsoft Microsoft 60

SELECT
      table_x.column_a,
      table_y.column_b,
FROM
     table_x
     RIGHT JOIN table_y
                ➞ see all results from the second ("right") table, results where available from first table
           ON table_x.key_column = table_y.key_column

It's exactly the same as LEFT JOIN, except it prioritizes the second table (the "right" table) over the first ("left") table. We can't practice it because SQLite doesn't support it, and it's super redundant anyway because we can just use LEFT JOIN. Boom. We're done with RIGHT JOIN.


[Previous](#left_joins)       |       [Table of Contents](#table_of_contents)       |       [Next](#join_drills)

SELECT
      table_x.column_a,
      table_y.column_b,
FROM
     table_x
     OUTER JOIN table_y
                ➞ see all results from BOTH the first and second table
           ON table_x.key_column = table_y.key_column

What if you want to see all values from both tables? You can do this with an OUTER JOIN. Unfortunately, MySQL and SQLite (what we're using right now!) doesn't support it, so we can't practice it.

If you are using Oracle or Microsoft SQL, then you'd use the example code above. For MySQL and SQLite, there is a workaround. You don't need to understand what's going on in the code for now, just look at the output to make sure you understand what the OUTER JOIN output should look like.


In [95]:
full_join_q = widgets.Textarea(value=
'''SELECT
    ceo,
    tech_cos.company,
    public_cos.company,
    share_price
FROM
    tech_cos
    LEFT JOIN public_cos ON tech_cos.company = public_cos.company
UNION ALL
SELECT
    ' ',
    ' ',
    public_cos.company,
    share_price
FROM
    public_cos
WHERE 
    public_cos.company NOT IN (SELECT company FROM tech_cos)''',
width = '50em', height = '26em')
display(full_join_q)

In [96]:
full_join_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(full_join_b)
run_q(full_join_q, full_join_b)


CEO Company Company Share_Price
Jeff Bezos Amazon Amazon 690
Travis Kalanick Uber
Larry Page Alphabet Alphabet 570
Brian Chesky Airbnb
Elon Musk SpaceX
Satya Nadellla Microsoft Microsoft 60
Walmart 60
GE 30
Procter & Gamble 75

The OUTER JOIN starts out the same as the INNER JOIN and LEFT JOIN, trying to find matches wherever it can:

But when it can't find a match, instead of eliminating any the rows, it makes room for them:




[Previous](#outer_joins)       |       [Table of Contents](#table_of_contents)       |       [Next](#as)


In [97]:
insert_b = widgets.Button(description='Click here JUST ONCE before starting', width='20em', height='3em', color='white',background_color='#1f5fd6', border_color='#1f5fd6')
display(insert_b)
def insert_button(b):
    insert_q1 = '''INSERT INTO auth_table VALUES ('Tolstoy', 'Leo', 'Russia', 1828)'''
    insert_q2 = '''INSERT INTO auth_table VALUES ('Twain', 'Mark', 'USA', 1835)'''
    insert_q3 = '''INSERT INTO book_table VALUES ('Jude the Obscure', '11.25', 'Hardy')'''
    insert_q4 = '''INSERT INTO book_table VALUES ('The Age of Innocence', '14.20', 'Wharton')'''
    query_list = [insert_q1, insert_q2, insert_q3, insert_q4]
    for query in query_list:
        run(query)
    print('New rows have been added to auth_table and book_table!')
insert_b.on_click(insert_button)

We've now added some rows to auth_table and book_table so we can practice our new join skills!

Challenge

Start simple by taking a look at the new rows we've added. Write a query to see all columns and rows from book_table, then change the query so you can take a look at auth_table instead:


In [98]:
join_drill1 = widgets.Textarea(value='',width = '50em', height = '7em')
display(join_drill1)

In [99]:
join_drill1_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill1_b)
run_q(join_drill1, join_drill1_b)

In [100]:
join_drill1_c ='''SELECT 
    *
FROM
    auth_table       [change to "book_table" for second part of challenge]'''
cheat(join_drill1_c)

Now auth_table has 2 authors listed (Tolstoy and Twain) that don't appear on the book_table, and the book_table has two books (Jude the Obscure and The Age of Innocence) whose authors don't appear in the auth_table.

Challenge:

Write a query to view the book titles, first names, and last names of authors that appear on both the auth_table and the book_table.


In [101]:
join_drill2 = widgets.Textarea(value='', width = '50em', height = '12em')
display(join_drill2)

In [102]:
join_drill2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill2_b)
run_q(join_drill2, join_drill2_b)

In [103]:
join_drill2_c ='''SELECT
    book, 
    author,
    first_name 
FROM
    book_table
    JOIN auth_table ON book_table.author = auth_table.last_name'''
cheat(join_drill2_c)

Challenge:

Write a query to see the titles of all the books from the book_table, and the author's country when that information is available.


In [104]:
join_drill3 = widgets.Textarea(value='', width = '50em', height = '13em')
display(join_drill3)

In [105]:
join_drill3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill3_b)
run_q(join_drill3, join_drill3_b)

In [106]:
join_drill3_b ='''SELECT
    book, 
    country
FROM
    book_table
    LEFT JOIN auth_table
        ON book_table.author = auth_table.last_name'''
cheat(join_drill3_b)

Quick Exercise:

  1. Edit the query above so that you only see books by authors from England
  2. Edit it so that you only see books by authors NOT from England.

Challenge

Write a query to see the first names of all authors in the auth_table, and the books they've written when that information is available.


In [107]:
join_drill4 = widgets.Textarea(value='',width = '50em', height = '16em')
display(join_drill4)

In [108]:
join_drill4_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill4_b)
run_q(join_drill4, join_drill4_b)

In [109]:
join_drill4_c ='''SELECT 
    first_name,
    book
FROM
    auth_table
    LEFT JOIN book_table ON auth_table.last_name = book_table.author'''
cheat(join_drill4_c)

Quick Exercise:

  1. Change the query so that you only see results when the writer's first name is William, rerun
  2. Change the query to sort the books in alphabethical order, rerun
  3. Limit the number of rows to 3, rerun

In [110]:
join_drill4_ex_c ='''SELECT 
    first_name,
    book
FROM
    auth_table
    LEFT JOIN book_table ON auth_table.last_name = book_table.author
WHERE 
    first_name = 'William'
ORDER BY
    book
LIMIT 3'''
cheat(join_drill4_ex_c)

Challenge:

Write a query to return all books from the book_table, and their revenue data whenever that information is available. Try to figure out how you might sort your results so that you see books with no sales first.


In [111]:
join_drill4 = widgets.Textarea(value='',width = '50em', height = '12em')
display(join_drill4)

In [112]:
join_drill4_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(join_drill4_b)
run_q(join_drill4, join_drill4_b)

In [113]:
join_drill4_c ='''SELECT 
    book_table.book,
    revenue
FROM
    book_table
    LEFT JOIN sales_table ON book_table.book = sales_table.book
ORDER BY
    revenue'''
cheat(join_drill4_c)



[Previous](#join_drills)       |       [Table of Contents](#table_of_contents)       |       [Next](#operators)

Assigning aliases to columns

SELECT
      column_a AS alias_a
                ➞ creates an alias for column_a
FROM
     table_name
WHERE
     alias_a = x                                   ➞ optional; use the alias in the WHERE clause
ORDER BY
     alias_a                                            ➞ optional; use the alias in the ORDER BY clause
[LIMIT clause]

Aliases allow you to rename columns and tables in your query. They will come in handy as we learn to do more with the data.

In plain English, the query below can be read as "Show me the book column from the book_table, but rename the column to book_title.


In [114]:
as_q = widgets.Textarea(value=
'''SELECT
    book AS book_title
FROM
    book_table''',
width = '50em', height = '14em')
display(as_q)

In [115]:
as_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_b)
run_q(as_q, as_b)


book_title
For Whom the Bell Tolls
A Farewell to Arms
The Sun Also Rises
Sense and Sensibility
Emma
Pride and Prejudice
Romeo and Juliet
Hamlet
Macbeth
The Sound and the Fury
Absalom! Absalom!

Quick Exercises:

  1. Change the query to rename the column to titles.
  2. Delete the word AS and rerun. (You'll see that AS is totally optional when assigning aliases. It just make the query easier to read.)
  3. Change the query so that you also pull the author, but rename the column author_name.
  4. Change the query so that you only see books by Austen. Use the column's alias in your WHERE clause.
  5. Order results by book in reverse alphabetical order. Use the column's alias in your ORDER BY clause.

In [116]:
as_ex_c ='''SELECT
    book titles,
    author AS author_name
FROM
    book_table 
WHERE 
    author_name = 'Austen'
ORDER BY
    titles DESC'''
cheat(as_ex_c)

Challenge:

  • Write a query to pull:
    • last_name, but renamed author
    • country, but renamed nationality
    • birth_year, but renamed year_born
  • Use each column's alias in the WHERE clause; use WHERE clause to only return results where the author is not from England, AND was born between 1800 and 1850.

In [117]:
as_chall = widgets.Textarea(value='', width = '50em', height = '15em')
display(as_chall)

In [118]:
as_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_chall_b)
run_q(as_chall, as_chall_b)

In [119]:
as_chall_c ='''SELECT 
    last_name AS author,
    country as nationality,
    birth_year AS year_born
FROM
    auth_table
WHERE
    nationality != 'England'
    AND year_born BETWEEN 1800 and 1850'''
cheat(as_chall_c)

Assigning aliases to tables

SELECT
      X.column_a,
      Y.column_b
FROM
      table_x X
                                              ➞ assigns table_x the alias X
      JOIN table Y                                         ➞ assigns table_y the alias Y
           ON X.key_column = Y.key_column            ➞ table aliases can be used as substitutes in the table_x.column_a format

When dealing with one or more tables in a query, we commonly assign capitalized one-letter aliases to tables. Writing X.key_column is much shorter than table_x.key_column, and coders like shortcuts. They also typically won't use AS when assigning aliases to tables (although it makes no difference either way).

When you're dealing with only one table, it's unnecessary to use table aliases because SQL knows exactly what columns you are referring to. However, when you are dealing with 2 or more tables, particularly tables that have columns with the same names (like book, which is a column in both sales_table and book_table), then aliases are extremely handy.


In [120]:
as_table_q = widgets.Textarea(value=
'''SELECT
    S.book,
    S.revenue,
    B.cogs
FROM
    book_table B
    JOIN sales_table S 
        ON S.book = B.book''',
width = '50em', height = '12em')
display(as_table_q)

In [121]:
as_table_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_table_b)
run_q(as_table_q, as_table_b)


book revenue cogs
For Whom the Bell Tolls 15.58 13.50
For Whom the Bell Tolls 15.82 13.50
For Whom the Bell Tolls 15.95 13.50
For Whom the Bell Tolls 16.07 13.50
For Whom the Bell Tolls 16.76 13.50
For Whom the Bell Tolls 16.90 13.50
For Whom the Bell Tolls 16.93 13.50
For Whom the Bell Tolls 17.31 13.50
For Whom the Bell Tolls 18.34 13.50
For Whom the Bell Tolls 18.94 13.50
For Whom the Bell Tolls 19.26 13.50
For Whom the Bell Tolls 19.74 13.50
A Farewell to Arms 12.23 10.35
A Farewell to Arms 13.01 10.35
A Farewell to Arms 13.30 10.35
A Farewell to Arms 13.56 10.35
A Farewell to Arms 14.28 10.35
A Farewell to Arms 14.52 10.35
A Farewell to Arms 14.84 10.35
A Farewell to Arms 14.92 10.35
The Sun Also Rises 13.60 11.35
The Sun Also Rises 13.62 11.35
The Sun Also Rises 13.87 11.35
The Sun Also Rises 14.51 11.35
The Sun Also Rises 14.70 11.35
The Sun Also Rises 15.05 11.35
The Sun Also Rises 15.08 11.35
The Sun Also Rises 15.32 11.35
The Sun Also Rises 15.35 11.35
The Sun Also Rises 15.41 11.35
The Sun Also Rises 16.82 11.35
Sense and Sensibility 12.28 11.00
Sense and Sensibility 12.75 11.00
Sense and Sensibility 12.88 11.00
Sense and Sensibility 13.86 11.00
Sense and Sensibility 15.15 11.00
Sense and Sensibility 15.90 11.00
Sense and Sensibility 16.18 11.00
Sense and Sensibility 16.48 11.00
Emma 13.82 12.50
Emma 14.01 12.50
Emma 14.36 12.50
Emma 14.99 12.50
Emma 15.63 12.50
Emma 16.25 12.50
Emma 17.26 12.50
Emma 18.18 12.50
Emma 18.21 12.50
Pride and Prejudice 14.63 13.00
Pride and Prejudice 14.84 13.00
Pride and Prejudice 15.57 13.00
Pride and Prejudice 16.93 13.00
Pride and Prejudice 17.54 13.00
Pride and Prejudice 17.59 13.00
Pride and Prejudice 18.62 13.00
Pride and Prejudice 18.82 13.00
Romeo and Juliet 8.98 7.50
Romeo and Juliet 9.14 7.50
Romeo and Juliet 10.28 7.50
Romeo and Juliet 10.74 7.50
Romeo and Juliet 10.92 7.50
Romeo and Juliet 10.95 7.50
Hamlet 7.67 6.95
Hamlet 7.96 6.95
Hamlet 8.06 6.95
Hamlet 8.29 6.95
Hamlet 8.78 6.95
Hamlet 9.00 6.95
Hamlet 9.21 6.95
Hamlet 9.26 6.95
Hamlet 9.91 6.95
Hamlet 10.34 6.95
Macbeth 8.08 11.10
Macbeth 8.14 11.10
Macbeth 8.19 11.10
Macbeth 8.24 11.10
Macbeth 8.64 11.10
Macbeth 8.68 11.10
Macbeth 8.90 11.10
Macbeth 9.06 11.10
Macbeth 9.55 11.10
Macbeth 9.62 11.10
Macbeth 10.38 11.10
The Sound and the Fury 9.83 8.50
The Sound and the Fury 9.97 8.50
The Sound and the Fury 10.06 8.50
The Sound and the Fury 10.29 8.50
The Sound and the Fury 10.64 8.50
The Sound and the Fury 10.77 8.50
The Sound and the Fury 10.77 8.50
The Sound and the Fury 11.11 8.50
The Sound and the Fury 11.56 8.50
The Sound and the Fury 11.61 8.50
The Sound and the Fury 12.21 8.50
The Sound and the Fury 12.52 8.50
The Sound and the Fury 12.61 8.50
Absalom! Absalom! 6.24 5.15
Absalom! Absalom! 6.31 5.15
Absalom! Absalom! 6.44 5.15
Absalom! Absalom! 6.92 5.15

Quick Exercises:

  1. In the SELECT clause, change S.book to just book, rerun. What's going wrong?
  2. Now change book to B.book, rerun.
  3. Change S.revenue to B.revenue, rerun. What's going wrong?
  4. Change B.revenue to just revenue and rerun. Note that when you're joining tables, it's standard to use table aliases even on columns that don't need them. This makes it easier for someone to read your query even if they are unfamaliar with the tables that you're working with. However, as you see, it's not technically necessary.
  5. Give each of these columns an alias (any alias) and rerun.

Challenge:

Write a query to view books and the author's country by joining auth_table and book_table. Give auth_table the alias A and book_table the alias B. Use the aliases in the ON part of the JOIN clause.


In [122]:
as_table_chall = widgets.Textarea(value='', width = '50em', height = '10em')
display(as_table_chall)

In [123]:
as_table_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_table_chall_b)
run_q(as_table_chall, as_table_chall_b)

In [124]:
as_table_chall_c ='''SELECT
    B.book,
    A.country
FROM
    book_table B
    JOIN auth_table A ON B.author = A.last_name'''
cheat(as_table_chall_c)

Challenge:

  • Write a query to view these columns:
    • book titles with the alias titles
    • revenue with the alias earnings
    • author's last name with the alias author_name
    • year in which the author was born with the alias year_born
  • Use one-letter aliases for table names in your SELECT and JOIN clauses
  • For your WHERE and ORDER BY clauses:
    • Use column aliases
    • Only view results where author was born between 1700 and 1900 AND where revenue is more than $12.
    • Sort your results so that earnings appear in ascending order
  • Limit your results to 20 rows

In [125]:
as_chall2 = widgets.Textarea(value='', width = '50em', height = '20em')
display(as_chall2)

In [126]:
as_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(as_chall2_b)
run_q(as_chall2, as_chall2_b)

In [127]:
as_chall2_c ='''SELECT 
    B.book as titles, 
    S.revenue as earnings,
    A.birth_year as year_born
FROM
    book_table B
    JOIN auth_table A ON B.author = A.last_name
    JOIN sales_table S on B.book = S.book
WHERE 
    year_born BETWEEN 1700 and 1900
    AND earnings > 12
ORDER BY
    earnings
LIMIT 20'''
cheat(as_chall2_b)



[Previous](#as)       |       [Table of Contents](#table_of_contents)       |       [Next](#functions)

SELECT
      column_a + column_b,
                ➞ adds the values in column_a and columns_b
      column_a - column_b,                 ➞ subtracts
      column_a * column_b,                 ➞ multiplies
      column_a / column_b,                 ➞ divides
      (column_a + column_b) * column_c,                 ➞ use parentheses to make more complex calculations
FROM
      table_name

[WHERE clause]
[ORDER BY clause]
[LIMIT clause]

This is pretty straightforward. Let's start by calculating gross profit per transaction: revenue minus cogs. Recall again that we can use S.book or B.book - we'll get the same results.


In [128]:
op_q = widgets.Textarea(value=
'''SELECT
    B.book,
    S.revenue,
    B.cogs,
    S.revenue - B.cogs
FROM
    book_table B 
    JOIN sales_table S ON B.book = S.Book''',
width = '50em', height = '18em')
display(op_q)

In [129]:
op_q_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(op_q_b)
run_q(op_q, op_q_b)


book revenue cogs S.revenue - B.cogs
For Whom the Bell Tolls 15.58 13.50 2.08
For Whom the Bell Tolls 15.82 13.50 2.32
For Whom the Bell Tolls 15.95 13.50 2.45
For Whom the Bell Tolls 16.07 13.50 2.57
For Whom the Bell Tolls 16.76 13.50 3.26
For Whom the Bell Tolls 16.90 13.50 3.40
For Whom the Bell Tolls 16.93 13.50 3.43
For Whom the Bell Tolls 17.31 13.50 3.81
For Whom the Bell Tolls 18.34 13.50 4.84
For Whom the Bell Tolls 18.94 13.50 5.44
For Whom the Bell Tolls 19.26 13.50 5.76
For Whom the Bell Tolls 19.74 13.50 6.24
A Farewell to Arms 12.23 10.35 1.88
A Farewell to Arms 13.01 10.35 2.66
A Farewell to Arms 13.30 10.35 2.95
A Farewell to Arms 13.56 10.35 3.21
A Farewell to Arms 14.28 10.35 3.93
A Farewell to Arms 14.52 10.35 4.17
A Farewell to Arms 14.84 10.35 4.49
A Farewell to Arms 14.92 10.35 4.57
The Sun Also Rises 13.60 11.35 2.25
The Sun Also Rises 13.62 11.35 2.27
The Sun Also Rises 13.87 11.35 2.52
The Sun Also Rises 14.51 11.35 3.16
The Sun Also Rises 14.70 11.35 3.35
The Sun Also Rises 15.05 11.35 3.70
The Sun Also Rises 15.08 11.35 3.73
The Sun Also Rises 15.32 11.35 3.97
The Sun Also Rises 15.35 11.35 4.00
The Sun Also Rises 15.41 11.35 4.06
The Sun Also Rises 16.82 11.35 5.47
Sense and Sensibility 12.28 11.00 1.28
Sense and Sensibility 12.75 11.00 1.75
Sense and Sensibility 12.88 11.00 1.88
Sense and Sensibility 13.86 11.00 2.86
Sense and Sensibility 15.15 11.00 4.15
Sense and Sensibility 15.90 11.00 4.90
Sense and Sensibility 16.18 11.00 5.18
Sense and Sensibility 16.48 11.00 5.48
Emma 13.82 12.50 1.32
Emma 14.01 12.50 1.51
Emma 14.36 12.50 1.86
Emma 14.99 12.50 2.49
Emma 15.63 12.50 3.13
Emma 16.25 12.50 3.75
Emma 17.26 12.50 4.76
Emma 18.18 12.50 5.68
Emma 18.21 12.50 5.71
Pride and Prejudice 14.63 13.00 1.63
Pride and Prejudice 14.84 13.00 1.84
Pride and Prejudice 15.57 13.00 2.57
Pride and Prejudice 16.93 13.00 3.93
Pride and Prejudice 17.54 13.00 4.54
Pride and Prejudice 17.59 13.00 4.59
Pride and Prejudice 18.62 13.00 5.62
Pride and Prejudice 18.82 13.00 5.82
Romeo and Juliet 8.98 7.50 1.48
Romeo and Juliet 9.14 7.50 1.64
Romeo and Juliet 10.28 7.50 2.78
Romeo and Juliet 10.74 7.50 3.24
Romeo and Juliet 10.92 7.50 3.42
Romeo and Juliet 10.95 7.50 3.45
Hamlet 7.67 6.95 0.72
Hamlet 7.96 6.95 1.01
Hamlet 8.06 6.95 1.11
Hamlet 8.29 6.95 1.34
Hamlet 8.78 6.95 1.83
Hamlet 9.00 6.95 2.05
Hamlet 9.21 6.95 2.26
Hamlet 9.26 6.95 2.31
Hamlet 9.91 6.95 2.96
Hamlet 10.34 6.95 3.39
Macbeth 8.08 11.10 -3.02
Macbeth 8.14 11.10 -2.96
Macbeth 8.19 11.10 -2.91
Macbeth 8.24 11.10 -2.86
Macbeth 8.64 11.10 -2.46
Macbeth 8.68 11.10 -2.42
Macbeth 8.90 11.10 -2.20
Macbeth 9.06 11.10 -2.04
Macbeth 9.55 11.10 -1.55
Macbeth 9.62 11.10 -1.48
Macbeth 10.38 11.10 -0.72
The Sound and the Fury 9.83 8.50 1.33
The Sound and the Fury 9.97 8.50 1.47
The Sound and the Fury 10.06 8.50 1.56
The Sound and the Fury 10.29 8.50 1.79
The Sound and the Fury 10.64 8.50 2.14
The Sound and the Fury 10.77 8.50 2.27
The Sound and the Fury 10.77 8.50 2.27
The Sound and the Fury 11.11 8.50 2.61
The Sound and the Fury 11.56 8.50 3.06
The Sound and the Fury 11.61 8.50 3.11
The Sound and the Fury 12.21 8.50 3.71
The Sound and the Fury 12.52 8.50 4.02
The Sound and the Fury 12.61 8.50 4.11
Absalom! Absalom! 6.24 5.15 1.09
Absalom! Absalom! 6.31 5.15 1.16
Absalom! Absalom! 6.44 5.15 1.29
Absalom! Absalom! 6.92 5.15 1.77

Quick Exercise:

  1. Give the calculated column the alias gross_grofit, rerun. See how nice aliases are?
  2. Add a WHERE clause to only see transactions where gross_grofit is over $5, rerun.
  3. Add an ORDER BY clause to sort by gross_profit with the most profitable transaction is listed first, rerun.

Challenge:

  • Pull book name and author's last name
  • Calculate the gross margin per transaction, give the calculated column the alias gross_margin
  • Use one-letter aliases for all the table names
  • Only return rows where the author's name is NOT Faulkner or Austen
  • Sort your results with the highest margin transaction listed first
  • Limit your results to 10 rows

In [130]:
op_chall = widgets.Textarea(value='',width = '50em', height = '18em')
display(op_chall)

In [131]:
op_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(op_chall_b)
run_q(op_chall, op_chall_b)

In [132]:
op_chall_c ='''SELECT
    B.book,
    B.author,
    (S.revenue - B.cogs) / S.revenue AS gross_margin
FROM
    book_table B
    JOIN sales_table S ON B.book = S.book
WHERE
    B.author NOT IN ('Falkner', 'Austen')
ORDER BY
    gross_margin DESC
LIMIT 10'''
cheat(op_chall_c)

Concatonating

Microsoft SQL Server MySQL Oracle SQLite
CONCAT(column_a, column_b) or + CONCAT(column_a, column_b) CONCAT(column_a, column_b) or || ||

SELECT
      column_a || column_b,
                             ➞ combines the characters of column_a & column_b
      column_a || ' ' || column_b                 ➞ combines the characters of column_a & column_b with a space inbetween
FROM
      table_name

This one is extremely straightforward. This allows you to non-mathematically combine values. So "some || word" becomes "someword", and "some || ' ' || word" becomes "some word".


In [133]:
conc_q = widgets.Textarea(value=
'''SELECT
    first_name || last_name
FROM 
    auth_table''',
width = '50em', height = '7em')
display(conc_q)

In [134]:
conc_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(conc_b)
run_q(conc_q, conc_b)


first_name || last_name
WilliamFaulkner
ErnestHemingway
JaneAusten
WilliamShakespeare

Quick Exercises:

  1. Fix the query so that there is a space between the names, rerun
  2. Give the concatenated column an alias, rerun
  3. Rewrite the query so that it follows the format "last_name, first_name" instead



[Previous](#operators)       |       [Table of Contents](#table_of_contents)       |       [Next](#group_by)

SELECT
      SOME_FUNCTION(column_a),                                   ➞ performs the function on the column
FROM
      table_name
[WHERE clause]
[ORDER BY clause]
[LIMIT clause]

Functions work similar to funcitons in Excel - you can apply them to entire columns. There are tons more functions than the ones listed below, just Google what you want to do to find more.

Short List of Functions:

FUNCTION DESCRIPTION
AVG(col) Averages values
COUNT(col) Counts the number of rows with non-null values in the column
COUNT(*) Counts the number of rows in the table
COUNT(DISTINCT(col)) Counts the number of unique values in the column
GROUP_CONCAT(col, 'separator') Returns a comma-separated list of values, specify a separator in quotes
MAX(col) Returns the maximum value
MIN(col) Returns the minimum value
ROUND(AVG(col), x) Rounds value to x decimals
SUM(col) Sums values
UPPER(col) If column is text, it will return all-caps version of the text

First, we'll start with SUM() to find the total revenue for all our transactions.


In [135]:
sum_q = widgets.Textarea(value=
'''SELECT
    sum(revenue)
FROM
    sales_table''',
width = '50em', height = '17em')
display(sum_q)

In [136]:
sum_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(sum_b)
run_q(sum_q, sum_b)


sum(revenue)
1291.15

Quick Exercises:

  1. Give the calculated column an alias, rerun
  2. Add a line to the SELECT clause to find the average revenue per transaction and give the column an alias, rerun
  3. Edit the average column so that your results are rounded to the nearest cent, rerun
  4. Add a line to the SELECT clause to count the total number of transactions and give the column an alias, rerun
  5. Add a 2 lines to the SELECT clause to see the minimum and maximum revenue earned on a single transaction, rerun
  6. Add a line to the SELECT clause to see a count of the number of distinct books that appear in sales_table
  7. Add a WHERE clause to only see results for the books "For Whom the Bell Tolls" and "Emma"

In [137]:
sum_q_c ='''SELECT
    SUM(revenue) AS total_rev,
    ROUND(AVG(revenue), 2) AS avg_rev,
    COUNT(revenue) AS total_transactions,
    MAX(revenue) AS max_rev,
    MIN(revenue) AS min_rev,
    COUNT(DISTINCT(book)) AS distinct_books
FROM
    sales_table
WHERE 
    book IN ('For Whom the Bell Tolls', 'Emma') '''
cheat(sum_q_c)

Challenge:

Write a query to find the average cost of goods for books whose authors are from the US (USA). Round the number to the nearest cent. Use an alias for your column.


In [138]:
function_chall = widgets.Textarea(value='', width = '50em', height = '11em')
display(function_chall)

In [139]:
function_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(function_chall_b)
run_q(function_chall, function_chall_b)

In [140]:
function_chall_c ='''SELECT
    ROUND(AVG(B.cogs), 2) AS avg_cogs
FROM
    book_table B
    JOIN auth_table A ON B.author = A.last_name 
WHERE
    A.country = 'USA'
'''
cheat(function_chall_c)

Challenge:

Try out the GROUP_CONCAT function. Write a query to select GROUP_CONCAT(last_name) from the auth_table, only return results where the author is NEITHER Austen NOR Shakespeare. After you get your query to work, change it to GROUP_CONCAT(last_name, ' / ') and rerun.


In [141]:
function_chall2 = widgets.Textarea(value='',width = '50em', height = '10em')
display(function_chall2)

In [142]:
function_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(function_chall2_b)
run_q(function_chall2, function_chall2_b)

In [143]:
function_chall2_c ='''SELECT 
    GROUP_CONCAT(last_name)
FROM
    auth_table
WHERE
    last_name NOT IN ('Austen', 'Shakespeare')
'''
cheat(function_chall2_c)

COUNT(*) vs COUNT(column_name)


In [144]:
insert_null_b = widgets.Button(description='Click here JUST ONCE before starting', width='20em', height='3em', color='white',background_color='#1f5fd6', border_color='#1f5fd6')
display(insert_null_b)
def insert_button(b):
    null_query1 = '''INSERT INTO auth_table VALUES ('Homer', NULL, 'Greece', NULL)'''
    run(null_query1)
    print('A new row has been added to the auth_table!')
insert_null_b.on_click(insert_button)

It's much more common to use COUNT(*) than COUNT(column_name) when you are trying to get a count of the number of rows in your result-set. This is because COUNT(*) will capture all rows, while COUNT(column_name) will skip over NULL values in that particular row.

Challenge:

We've just added new a new row to the auth_table that has some NULL (blank) values. Start by writing a query to view everything (SELECT *) in the auth_table and make a note of the new row.


In [145]:
star_chall = widgets.Textarea(value='', width = '50em', height = '12em')
display(star_chall)

In [146]:
star_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(star_chall_b)
run_q(star_chall, star_chall_b)

Challenge Continued...

  1. Delete * and replace it with COUNT(first_name), rerun
  2. Add a line (but don't erase anything) to the SELECT clause: COUNT(*)

In [147]:
star_chall_c ='''SELECT 
    COUNT(first_name),
    COUNT(*)
FROM 
    auth_table
'''
cheat(star_chall_c)

What's going on:

When you use COUNT(first_name), SQL skipped over the Homer row because there was no value in the first_name column. COUNT(*), on the other hand, looks across all columns, so as long as a row has a value in at least one column, it'll get included in the count. You might argue that you could just use country or last_name, but they fact is that * is just way easier and less time-consuming to type out. Overwhelmingly, people opt for COUNT(*) instead of COUNT(column_name) unless they are interested in overlooking NULL values.


Functions + Operators

You can use functions together with operators to do more complex calculations. Below, we've calculated our total gross profit using both the SUM() function and subtraction:


In [148]:
fun_op_q = widgets.Textarea(value=
'''SELECT
    SUM(S.revenue) - SUM(B.cogs) AS gross_profit
FROM
    book_table B
    JOIN sales_table S ON S.book = B.book''',
width = '50em', height = '8em')
display(fun_op_q)

In [149]:
fun_op_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(fun_op_b)
run_q(fun_op_q, fun_op_b)


gross_profit
249.3

Quick Exercise:

Rewrite the SELECT clause so that you get the same results but only have to use SUM once.


Challenge:

Write a query to view gross margin for all transactions using functions in conjunction with operators. Extra credit: round your results to the nearest cent.


In [150]:
fun_op_chall = widgets.Textarea(value='', width = '50em', height = '8em')
display(fun_op_chall)

In [151]:
fun_op_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(fun_op_chall_b)
run_q(fun_op_chall, fun_op_chall_b)

In [152]:
fun_op_chall_c ='''SELECT
    SUM(S.revenue - B.cogs) / SUM(S.revenue) AS gross_margin
FROM
    book_table B
    JOIN sales_table S ON S.book = B.book

- OR - you can use AVG() instead of SUM() for all functions'''
cheat(fun_op_chall_c)



[Previous](#functions)       |       [Table of Contents](#table_of_contents)       |       [Next](#having)

SELECT
      column_a,
      SUM(column_b)                  ➞ sums up the values in column_b
FROM
      table_name
[WHERE clause]
GROUP BY                                   ➞ creates one group for each unique value in column_a
     column_a
[ORDER BY clause]
[LIMIT clause]

GROUP BY creates a group for each unique value in the column you specify. You'll always use it in conjunction with functions - it creates segments for your results. In plain English, the query below says: "Show me the average revenue per book from the sales_table"


In [153]:
group_q = widgets.Textarea(value=
'''SELECT
    book, 
    AVG(revenue)
FROM
    sales_table
GROUP BY
    book''',
width = '50em', height = '20em')
display(group_q)

In [154]:
group_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_b)
run_q(group_q, group_b)


book AVG(revenue)
A Farewell to Arms 13.832500
Absalom! Absalom! 6.477500
Emma 15.856667
For Whom the Bell Tolls 17.300000
Hamlet 8.848000
Macbeth 8.861818
Pride and Prejudice 16.817500
Romeo and Juliet 10.168333
Sense and Sensibility 14.435000
The Sound and the Fury 11.073077
The Sun Also Rises 14.848182

Quick Exercises:

  1. Change AVG() to SUM(), rerun
  2. Give the book column the alias book_title, then use the alias in the GROUP BY clause, rerun
  3. Sort the results so that the most profitable book is listed first, rerun
  4. Add this to the SELECT clause: COUNT(*), rerun
  5. Add a WHERE clause to only return results that are not written by Faulkner (hint: you'll have to join a table for this)

Challenge:

Write a query to count the number of books each author has listed in the book_table.


In [155]:
group_chall = widgets.Textarea(value='', width = '50em', height = '12em')
display(group_chall)

In [156]:
group_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall_b)
run_q(group_chall, group_chall_b)

In [157]:
group_chall_c ='''SELECT 
    author, 
    count(*)
FROM
    book_table
GROUP BY 
    author'''
cheat(group_chall_c)

Challenge:

Write a query that joins the book_table and the sales_table to see total revenue per author.


In [158]:
group_chall2 = widgets.Textarea(value='', width = '50em', height = '12em')
display(group_chall2)

In [159]:
group_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall2_b)
run_q(group_chall2, group_chall2_b)

In [160]:
group_chall2_c ='''SELECT
    B.author,
    SUM(S.revenue)
FROM
    book_table B
    JOIN sales_table S ON B.book = S.book 
GROUP BY
    B.author'''
cheat(group_chall2_c)

Challenge:

Write a query to see the maximum and minimum prices that each book sold for, but don't include Macbeth or Hamlet in your result-set:


In [161]:
group_chall3 = widgets.Textarea(value='',width = '50em', height = '15em')
display(group_chall3)

In [162]:
group_chall3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall3_b)
run_q(group_chall3, group_chall3_b)

In [163]:
group_chall3_c ='''SELECT
    book, 
    MAX(revenue), 
    MIN(revenue)
FROM
    sales_table
WHERE
    book NOT IN ('Macbeth','Hamlet')
GROUP BY
    book'''
cheat(group_chall3_c)


GROUP BY + Functions + Operators

You can use GROUP BY with functions and operators to do more complex analysis. Below, we use SUM() and the subtraction operator to see gross profit for each book.


In [164]:
group_fun_op_q = widgets.Textarea(value=
'''SELECT
    B.book,
    SUM(S.revenue) - SUM(B.cogs) AS gross_profit
FROM 
    sales_table S 
    JOIN book_table B ON S.book = B.book
GROUP BY 
    B.book''',
width = '50em', height = '12em')
display(group_fun_op_q)

In [165]:
group_fun_op_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_fun_op_b)
run_q(group_fun_op_q, group_fun_op_b)


book gross_profit
A Farewell to Arms 27.86
Absalom! Absalom! 5.31
Emma 30.21
For Whom the Bell Tolls 45.60
Hamlet 18.98
Macbeth -24.62
Pride and Prejudice 30.54
Romeo and Juliet 16.01
Sense and Sensibility 27.48
The Sound and the Fury 33.45
The Sun Also Rises 38.48

Challenge:

Write a query to find the gross margin per author using GROUP BY, functions and operators. Give the gross margin column an alias.


In [166]:
group_func_op_chall = widgets.Textarea(value='', width = '50em', height = '13em')
display(group_func_op_chall)

In [167]:
group_func_op_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_func_op_chall_b)
run_q(group_func_op_chall, group_func_op_chall_b)

In [168]:
group_func_op_chall_c ='''SELECT 
    B.author,
    (SUM(S.revenue) - SUM(B.cogs)) / SUM(S.revenue) AS gross_margin
FROM
    book_table B 
    JOIN sales_table S ON B.book = S.book
GROUP BY 
    B.author'''
cheat(group_func_op_chall_c)

Challenge:

  • Copy and paste the query you just wrote for the previous challenge.
  • Add the following columns in the SELECT clause (in addition to author and gross margin columns), and give every column an alias:
    • total revenue
    • total cogs
    • a count of the number of individual transactions
    • BONUS: a count of the distinct book titles sold
    • BONUS: a comma-separated list of the book titles with no repeats
  • Only include results where the author isn't Faulker and the book isn't Hamlet
  • Sort your results so that the author with the highest average gross margin is listed first

In [169]:
group_chall3 = widgets.Textarea(value='', width = '50em', height = '26em')
display(group_chall3)

In [170]:
group_chall3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(group_chall3_b)
run_q(group_chall3, group_chall3_b)

In [171]:
group_chall3_c ='''SELECT
    B.author AS author_name,
    SUM(S.revenue) AS total_revenue,
    SUM(B.cogs) AS total_cogs,
    (SUM(S.revenue) - SUM(B.cogs))/SUM(S.revenue) AS gross_margin,
    COUNT(*) AS transaction_count,
    COUNT(DISTINCT(S.book)) AS distinct_book_titles,
    GROUP_CONCAT(DISTINCT(S.book)) AS book_list
FROM
    book_table B
    JOIN sales_table S ON B.book = S.book
WHERE 
    author_name != 'Faulkner'
    AND S.book != 'Hamlet'
GROUP BY
    author_name
ORDER BY
    gross_margin DESC'''
cheat(group_chall3_c)



[Previous](#group_by)       |       [Table of Contents](#table_of_contents)       |       [Next](#case_when)

SELECT
      column_a,
      FUNCTION(column_b)
FROM
      table_name
[WHERE clause]
GROUP BY
      column_a HAVING FUNCTION(column_b) > x                 ➞ returns groups whose value is greater than x
[ORDER BY clause]
[LIMIT clause]

Use HAVING with GROUP BY in order to filter out groups that don't meet your criteria. Below, the plain English translation of this query says, "show my the total revenue for each book, but only show me books that have total revenue over $100"


In [172]:
having_q = widgets.Textarea(value=
'''SELECT
    book, 
    SUM(revenue)
FROM
    sales_table
GROUP BY
    book HAVING SUM(revenue) > 100''',
width = '50em', height = '11em')
display(having_q)

In [173]:
having_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(having_b)
run_q(having_q, having_b)


book SUM(revenue)
A Farewell to Arms 110.66
Emma 142.71
For Whom the Bell Tolls 207.60
Pride and Prejudice 134.54
Sense and Sensibility 115.48
The Sound and the Fury 143.95
The Sun Also Rises 163.33

Quick Exercises:

  1. Change the > to <, rerun
  2. Give the SUM(revenue) column an alias, and change the GROUP BY clause so that you're using the alias instead, rerun
  3. Think about why this is different from WHERE. Take a moment to discuss this with your partner in class.

Challenge:

Write a query to see average COGs per author, but use HAVING to return authors whose average COGs is greater than $10. Assign the average COGs column an alias and use it in the GROUP BY clause.


In [174]:
having_chall = widgets.Textarea(value='',width = '50em', height = '10em')
display(having_chall)

In [175]:
having_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(having_chall_b)
run_q(having_chall, having_chall_b)

In [176]:
having_chall_c ='''SELECT 
    author,
    AVG(cogs) AS avg_cogs
FROM
    book_table
GROUP BY
    author HAVING avg_cogs > 10'''
cheat(having_chall_c)

HAVING vs. WHERE

HAVING and WHERE both let you change the results you see in your result-set, but they operate quite differently. Take a look at the query below. It looks at the average cogs per author, but uses a WHERE clause to filter out 'Faulkner':


In [177]:
hw_q1 = widgets.Textarea(value=
'''SELECT 
    author,
    AVG(cogs)
FROM
    book_table
WHERE 
    author != 'Faulkner'
GROUP BY
    author''',
width = '50em', height = '14em')
display(hw_q1)

In [178]:
hw_b1 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b1)
run_q(hw_q1, hw_b1)


author AVG(cogs)
Austen 12.166667
Hemingway 11.733333
Shakespeare 8.516667

Quick Exercise:

Now let's say you want to filter out Faulkner AND you only want to see authors whose average COGs are over $11. Your first thought might be to use the WHERE clause. Add AND AVG(cogs) > 11, rerun. Why do you think you're hitting an error?


You hit an error because the AVG(cogs) column was created by a function, and SQL doesn't let you put function-generated columns in the WHERE clause. You have to use HAVING instead. The query below will accomplish what we're trying to do, and returns a result-set that doesn't include Faulkner AND only shows authors whose average COGs are over $11.


In [179]:
hw_q2 = widgets.Textarea(value=
'''SELECT 
    author,
    AVG(cogs)
FROM
    book_table
WHERE 
    author != 'Faulkner'
GROUP BY
    author HAVING AVG(cogs) > 11''',
width = '50em', height = '14em')
display(hw_q2)

In [180]:
hw_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b2)
run_q(hw_q2, hw_b2)


author AVG(cogs)
Austen 12.166667
Hemingway 11.733333

This seems relatively straightforward - but it's easy to forget and wind up with inaccurate results. Consider the following query. Why would this be wrong?


In [181]:
hw_q3 = widgets.Textarea(value=
'''SELECT 
    author,
    AVG(cogs)
FROM
    book_table
WHERE 
    author != 'Faulkner'
    AND cogs > 11
GROUP BY
    author''',
width = '50em', height = '15em')
display(hw_q3)

In [182]:
hw_b3 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b3)
run_q(hw_q3, hw_b3)


author AVG(cogs)
Austen 12.750
Hemingway 12.425
Shakespeare 11.100

These results are inaccurate because instead of telling SQL to only return authors with average COGs over \$11, we've told SQL "only consider rows where COGs are over \$11". SQL dropped the rows with COGs under \$11 before it started grouping and averaging.

SQL's Order of Execution

When we read, we start at the top of a page and work our way to the bottom. That's not how SQL works. It actually starts with the FROM clause and jumps around. It's helpful to understand the order it follows to determine when to use HAVING and when to use WHERE.

We write the clauses in this order:

SELECT
FROM
      JOIN...ON
WHERE
GROUP BY...HAVING
ORDER BY
LIMIT

However, SQL reads and executes the clauses in this order:

FROM
      JOIN...ON
WHERE
SELECT
GROUP BY...HAVING
ORDER BY
LIMIT

Here's a query we've seen before, but now we've added a few more clauses so that we can see all of them in action:


In [183]:
sql_order_q = widgets.Textarea(value=
'''SELECT 
    author,
    AVG(cogs)
FROM
    book_table
WHERE 
    author != 'Faulkner'
GROUP BY
    author HAVING AVG(cogs) > 11
ORDER BY 
    AVG(cogs)
LIMIT 3''',
width = '50em', height = '17.5em')
display(sql_order_q)

In [184]:
sql_order_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(sql_order_b)
run_q(sql_order_q, sql_order_b)


author AVG(cogs)
Hemingway 11.733333
Austen 12.166667

The GIF below shows the order that SQL follows the steps:

Let's revisit the query that gave us the skewed average:


In [185]:
hw_q4 = widgets.Textarea(value=
'''SELECT 
    author,
    AVG(cogs)
FROM
    book_table
WHERE 
    author != 'Faulkner'
    AND cogs > 11
GROUP BY
    author''',
width = '50em', height = '15em')
display(hw_q4)

In [186]:
hw_b4 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(hw_b4)
run_q(hw_q4, hw_b4)


author AVG(cogs)
Austen 12.750
Hemingway 12.425
Shakespeare 11.100

Now that we know the order in which SQL executes commands, we can see what went wrong. The rows with COGs under $11 were eliminated before SQL averaged COGs for each group:

Challenge:

Write a query to join book_table and sales_table. Select author and total revenue, but only return authors whose total revenue was over $200


In [187]:
have_chall2 = widgets.Textarea(value='',width = '50em', height = '12em')
display(have_chall2)

In [188]:
have_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(have_chall2_b)
run_q(have_chall2, have_chall2_b)

In [189]:
have_chall2_c ='''SELECT
    B.author,
    SUM(S.revenue) AS total_rev
FROM
    book_table B
    JOIN sales_table S ON B.book = S.book
GROUP BY 
    author HAVING total_rev > 200'''
cheat(have_chall2_c)

Challenge:

Write a query to join the auth_table with the sales_table (remember that this requires multiple joins). Count the number of sales per country (author's country of origin in the auth_table), but don't include sales from Hemingway.


In [190]:
have_chall3 = widgets.Textarea(value='', width = '50em', height = '16em')
display(have_chall3)

In [191]:
have_chall3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(have_chall3_b)
run_q(have_chall3, have_chall3_b)

In [192]:
have_chall3_c ='''SELECT
    A.country,
    COUNT(*) AS count_of_sales
FROM
    sales_table S 
    JOIN book_table B ON S.book = B.book
    JOIN auth_table A ON A.last_name = B.author
WHERE
    A.last_name != 'Hemingway' 
GROUP BY
    A.country
    
NOTE: you can also use "B.author != 'Hemingway'" in the WHERE clause to get the same results'''
cheat(have_chall3_c)



[Previous](#having)       |       [Table of Contents](#table_of_contents)       |       [Next](#nesting)

Conditional Type Microsoft SQL Server MySQL Oracle SQLite
IF IF logical_test PRINT value_if_true IF(logical_test, value_if_true, value_if_false) (same as Excel) IF logical_test THEN value_if_true ELSIF...END IF NOT SUPPORTED
CASE WHEN

SELECT
      CASE WHEN some_column = x THEN value_if_true
                 WHEN some_column = y THEN other_value_if_true
                 ELSE value_if_false
                 END
FROM
      some_table

Because SQLite doesn't support IF statements, we're going to focus on CASE WHEN. CASE WHEN lets you accomplish the same thing by setting logical tests and conditional values, but it has the added bonus of freeing you from ever needing to nest multiple IF statements.

Let's start very simple. The following query uses a logical test to create a column where the value is "true" if the author is Austen, and "false" if the author is not Austen:


In [193]:
case1_q = widgets.Textarea(value=
'''SELECT 
    last_name,
    CASE WHEN last_name = 'Austen' THEN 'True' 
        ELSE 'False' 
        END
FROM
    auth_table''',
width = '50em', height = '13em')
display(case1_q)

In [194]:
case1_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case1_b)
run_q(case1_q, case1_b)


last_name CASE WHEN last_name = 'Austen' THEN 'True' ELSE 'False' END
Faulkner False
Hemingway False
Austen True
Shakespeare False

Quick Exercises:

  1. Give the CASE WHEN column an alias (immediately after END), rerun
  2. Change the query so that instead of "True", the query returns Austen's first name (use the first_name column), rerun
  3. Add something to the CASE WHEN column so that the query returns Faulkner's first name as well (look at the example code above for help), rerun.

Using CASE WHEN to create categories

CASE WHEN allows you to set multiple logical tests, which can help you create buckets or categories. In Excel, you'd have to nest multiple logical tests in an IF statement (ie. IF(logical_test, value_if_true, IF(other_logical_test, value_if_true, value_if_false))...very messy). With CASE WHEN, you can add an infinite number very easily.

Let's say that rather than caring about the exact revenue for each transaction, you only cared whether it was under \$10, between \$10 and \$15, or over \$15. That's easy to do with CASE WHEN. We'll include the revenue column as well so you can more easily see what's going on:


In [195]:
case2_q = widgets.Textarea(value=
'''SELECT
    book,
    revenue,
    CASE WHEN revenue < 10 THEN "<$10"
        WHEN revenue BETWEEN 10 AND 15 THEN "$10-15"
        WHEN revenue > 15 THEN ">$15" 
        END AS revenue_category
FROM
    sales_table
    ''',
width = '50em', height = '15em')
display(case2_q)

In [196]:
case2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case2_b)
run_q(case2_q, case2_b)


book revenue revenue_category
Absalom! Absalom! 6.44 <$10
For Whom the Bell Tolls 19.74 >$15
Macbeth 8.90 <$10
For Whom the Bell Tolls 19.26 >$15
The Sound and the Fury 10.64 $10-15
The Sun Also Rises 13.60 $10-15
The Sound and the Fury 12.52 $10-15
Absalom! Absalom! 6.31 <$10
Hamlet 10.34 $10-15
The Sun Also Rises 14.51 $10-15
Pride and Prejudice 14.84 $10-15
The Sound and the Fury 11.61 $10-15
Macbeth 8.64 <$10
For Whom the Bell Tolls 15.82 >$15
Hamlet 8.06 <$10
The Sound and the Fury 12.21 $10-15
Macbeth 10.38 $10-15
Macbeth 8.24 <$10
Sense and Sensibility 15.90 >$15
The Sun Also Rises 15.41 >$15
Emma 16.25 >$15
For Whom the Bell Tolls 15.95 >$15
Hamlet 8.29 <$10
For Whom the Bell Tolls 18.94 >$15
Emma 18.21 >$15
Sense and Sensibility 12.88 $10-15
Romeo and Juliet 10.74 $10-15
Absalom! Absalom! 6.24 <$10
A Farewell to Arms 14.52 $10-15
Hamlet 7.67 <$10
Emma 14.36 $10-15
For Whom the Bell Tolls 16.90 >$15
Emma 15.63 >$15
Pride and Prejudice 15.57 >$15
For Whom the Bell Tolls 16.76 >$15
The Sun Also Rises 15.08 >$15
Romeo and Juliet 10.92 $10-15
The Sun Also Rises 13.62 $10-15
A Farewell to Arms 13.01 $10-15
Sense and Sensibility 15.15 >$15
A Farewell to Arms 14.28 $10-15
Macbeth 8.14 <$10
For Whom the Bell Tolls 16.07 >$15
Hamlet 9.26 <$10
Pride and Prejudice 16.93 >$15
The Sound and the Fury 12.61 $10-15
For Whom the Bell Tolls 18.34 >$15
The Sound and the Fury 10.77 $10-15
Sense and Sensibility 13.86 $10-15
Sense and Sensibility 16.48 >$15
Pride and Prejudice 17.54 >$15
Macbeth 9.62 <$10
A Farewell to Arms 12.23 $10-15
The Sun Also Rises 14.70 $10-15
The Sound and the Fury 10.29 $10-15
For Whom the Bell Tolls 16.93 >$15
A Farewell to Arms 14.92 $10-15
The Sun Also Rises 13.87 $10-15
Pride and Prejudice 17.59 >$15
The Sun Also Rises 15.05 >$15
Sense and Sensibility 12.75 $10-15
Emma 13.82 $10-15
Sense and Sensibility 12.28 $10-15
Pride and Prejudice 18.82 >$15
Macbeth 8.08 <$10
The Sun Also Rises 16.82 >$15
Romeo and Juliet 10.95 $10-15
Macbeth 9.06 <$10
The Sound and the Fury 9.97 <$10
The Sound and the Fury 11.56 $10-15
Pride and Prejudice 14.63 $10-15
The Sound and the Fury 10.06 $10-15
A Farewell to Arms 13.56 $10-15
Absalom! Absalom! 6.92 <$10
A Farewell to Arms 14.84 $10-15
Emma 14.99 $10-15
Hamlet 9.91 <$10
The Sun Also Rises 15.35 >$15
The Sun Also Rises 15.32 >$15
Romeo and Juliet 9.14 <$10
The Sound and the Fury 9.83 <$10
Romeo and Juliet 10.28 $10-15
The Sound and the Fury 10.77 $10-15
Sense and Sensibility 16.18 >$15
For Whom the Bell Tolls 15.58 >$15
For Whom the Bell Tolls 17.31 >$15
Emma 17.26 >$15
Emma 18.18 >$15
Macbeth 8.19 <$10
A Farewell to Arms 13.30 $10-15
Pride and Prejudice 18.62 >$15
The Sound and the Fury 11.11 $10-15
Hamlet 9.21 <$10
Macbeth 8.68 <$10
Hamlet 7.96 <$10
Emma 14.01 $10-15
Hamlet 8.78 <$10
Hamlet 9.00 <$10
Romeo and Juliet 8.98 <$10
Macbeth 9.55 <$10

This might not immediately seem useful, but when you start grouping by your newly created categories, you'll be able to do all kinds of new analysis. Consider the query below. We use CASE WHEN to create a column that we then use in the GROUP BY clause. We've essentially created new groups where there were none before, and now we can assess the number of sales and total revenue from each revenue group.


In [197]:
case3_q = widgets.Textarea(value=
'''SELECT
    CASE WHEN revenue < 10 THEN "<$10"
        WHEN revenue BETWEEN 10 AND 15 THEN "$10-15"
        WHEN revenue > 15 THEN ">$15" 
        END AS revenue_category,
    COUNT(*) AS total_sales,
    SUM(revenue) AS total_revenue
FROM
    sales_table
GROUP BY
    revenue_category
''',
width = '50em', height = '18em')
display(case3_q)

In [198]:
case3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case3_b)
run_q(case3_q, case3_b)


revenue_category total_sales total_revenue
$10-15 40 507.14
<$10 27 229.07
>$15 33 554.94

Challenge:

Suppose you want to see total revenues broken out by male vs. female authors. Use CASE WHEN to create these groups - with Austen in the "female" group and Faulkner, Hemingway, and Shakespeare in the "male" group.


In [199]:
case_chall = widgets.Textarea(value='', width = '50em', height = '16em')
display(case_chall)

In [200]:
case_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_chall_b)
run_q(case_chall, case_chall_b)

In [201]:
case_chall_c ='''SELECT
    CASE WHEN
        B.author = 'Austen' THEN 'Female'
        ELSE 'Male'          --- or you can say, "WHEN B.author IN ('Faulkner', 'Shakespeare', 'Hemingway') THEN 'Male' "
        END AS gender,
    SUM(S.revenue) AS total_revenue
FROM
    book_table B 
    JOIN sales_table S ON B.book = S.book
GROUP BY 
    gender'''
cheat(case_chall_c)

Using CASE WHEN to create a pivot table

Say you want to see revenue broken out by gender and by date. Right now, the only way we know how to do this is to add "date" to the GROUP BY clause. The query below is the same as the one from your last challenge, only we've added date to both the SELECT clause and the GROUP BY clause.


In [202]:
case_pivot_ex = widgets.Textarea(value='''SELECT
    date,
    CASE WHEN
        B.author = 'Austen' THEN 'Female'
        ELSE 'Male'
        END AS gender,
    SUM(S.revenue) AS total_revenue
FROM
    book_table B 
    JOIN sales_table S ON B.book = S.book
GROUP BY 
    date, gender'''
, width = '50em', height = '18em')
display(case_pivot_ex)

In [203]:
case_pivot_ex_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_ex_b)
run_q(case_pivot_ex, case_pivot_ex_b)


date gender total_revenue
1/1/16 Male 64.98
1/10/16 Female 15.57
1/10/16 Male 42.76
1/11/16 Female 15.15
1/11/16 Male 40.91
1/12/16 Male 33.47
1/13/16 Female 64.81
1/13/16 Male 41.72
1/14/16 Male 92.56
1/15/16 Female 75.26
1/15/16 Male 23.13
1/16/16 Male 36.83
1/17/16 Male 9.97
1/18/16 Female 14.63
1/18/16 Male 21.62
1/19/16 Female 14.99
1/19/16 Male 45.23
1/2/16 Male 32.43
1/20/16 Male 39.81
1/21/16 Male 20.11
1/22/16 Male 10.77
1/23/16 Female 33.44
1/23/16 Male 32.89
1/24/16 Female 18.18
1/25/16 Male 21.49
1/26/16 Female 18.62
1/26/16 Male 11.11
1/27/16 Male 17.89
1/28/16 Female 14.01
1/28/16 Male 16.74
1/29/16 Male 17.98
1/3/16 Female 14.84
1/3/16 Male 45.10
1/30/16 Male 9.55
1/4/16 Female 15.90
1/4/16 Male 70.12
1/5/16 Female 16.25
1/5/16 Male 43.18
1/6/16 Female 31.09
1/7/16 Male 10.74
1/8/16 Male 6.24
1/9/16 Female 29.99
1/9/16 Male 39.09

The result-set above is almost useless - it's impossible to do meaningful analysis when you group by multiple columns. Instead, we'll use CASE WHEN nested inside a function to essentially create a pivot table:


In [204]:
case_pivot = widgets.Textarea(value=
'''SELECT
    date,
    SUM(CASE WHEN B.author = 'Austen' THEN revenue END) AS Female_Rev,
    SUM(CASE WHEN B.author != 'Austen' THEN revenue END) AS Male_Rev
FROM
    book_table B 
    JOIN sales_table S ON B.book = S.book
GROUP BY 
    date''',
width = '50em', height = '14em')
display(case_pivot)

In [205]:
case_pivot_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_b)
run_q(case_pivot, case_pivot_b)


date Female_Rev Male_Rev
1/1/16 64.98
1/10/16 15.57 42.76
1/11/16 15.15 40.91
1/12/16 33.47
1/13/16 64.81 41.72
1/14/16 92.56
1/15/16 75.26 23.13
1/16/16 36.83
1/17/16 9.97
1/18/16 14.63 21.62
1/19/16 14.99 45.23
1/2/16 32.43
1/20/16 39.81
1/21/16 20.11
1/22/16 10.77
1/23/16 33.44 32.89
1/24/16 18.18
1/25/16 21.49
1/26/16 18.62 11.11
1/27/16 17.89
1/28/16 14.01 16.74
1/29/16 17.98
1/3/16 14.84 45.1
1/30/16 9.55
1/4/16 15.9 70.12
1/5/16 16.25 43.18
1/6/16 31.09
1/7/16 10.74
1/8/16 6.24
1/9/16 29.99 39.09

Note that while this CASE WHEN method will work in other RDBMSs, it's more common to use IF when you are only using a single logical test. In MySQL, for instance, the line for Female_Rev would look like this instead, which would translate to "sum up the revenue for any row where the author is Austen, and the number 0 whenever the author is not Austen":

SUM(IF(B.author = 'Austen', revenue, 0))


Quick Exercises:

The query from above has been reproduced below for these exercises (so you don't have to keep scrolling up and down).

  1. Change the query so that you have separate columns for each individual author's revenue, rerun
  2. Change SUM to AVG, rerun
  3. Change AVG to COUNT - note that with conditional statements, you don't use an asterisk with COUNT. You need to stick with a specific column name

In [206]:
case_pivot1 = widgets.Textarea(value=
'''SELECT
    date,
    SUM(CASE WHEN B.author = 'Austen' THEN revenue END) AS Female_Rev,
    SUM(CASE WHEN B.author != 'Austen' THEN revenue END) AS Male_Rev
FROM
    book_table B 
    JOIN sales_table S ON B.book = S.book
GROUP BY 
    date''',
width = '50em', height = '18em')
display(case_pivot1)

In [207]:
case_pivot_b1 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_b1)
run_q(case_pivot1, case_pivot_b1)


date Female_Rev Male_Rev
1/1/16 64.98
1/10/16 15.57 42.76
1/11/16 15.15 40.91
1/12/16 33.47
1/13/16 64.81 41.72
1/14/16 92.56
1/15/16 75.26 23.13
1/16/16 36.83
1/17/16 9.97
1/18/16 14.63 21.62
1/19/16 14.99 45.23
1/2/16 32.43
1/20/16 39.81
1/21/16 20.11
1/22/16 10.77
1/23/16 33.44 32.89
1/24/16 18.18
1/25/16 21.49
1/26/16 18.62 11.11
1/27/16 17.89
1/28/16 14.01 16.74
1/29/16 17.98
1/3/16 14.84 45.1
1/30/16 9.55
1/4/16 15.9 70.12
1/5/16 16.25 43.18
1/6/16 31.09
1/7/16 10.74
1/8/16 6.24
1/9/16 29.99 39.09

Challenge:

  • In the SELECT clause:
    • Use CASE WHEN to create a column that creates buckets for author's birth_year: "Before 1700", "1700-1800", "After 1800"
    • Use CASE WHEN to create a column that returns the count of books by authors from USA
    • Use CASE WHEN to create a column that returns the count of books by authors from England
  • GROUP BY the birth_year bucket column that you created

In [208]:
case_pivot_chall = widgets.Textarea(value='',width = '50em', height = '18em')
display(case_pivot_chall)

In [209]:
case_pivot_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_chall_b)
run_q(case_pivot_chall, case_pivot_chall_b)

In [210]:
case_pivot_chall_c ='''SELECT
    CASE WHEN A.birth_year < 1700 THEN "Before 1700"
        WHEN A.birth_year BETWEEN 1700 AND 1800 THEN "1700-1800"
        WHEN A.birth_year > 1800 THEN "After 1800" 
        END AS era,
    COUNT(CASE WHEN A.country = 'USA' THEN book END) AS count_from_USA,
    COUNT(CASE WHEN A.country = 'England' THEN book END) AS count_from_England
FROM
    book_table B
    JOIN auth_table A ON B.author = A.last_name
GROUP BY
    era
'''
cheat(case_pivot_chall_c)

Challenge:

  • Write a query that returns a daily count of the sales of:
    • For Whom the Bell Tolls (in its own column)
    • Emma (in its own column)
    • Macbeth and Hamlet (in a combined column)

In [211]:
case_pivot_chall2 = widgets.Textarea(value='', width = '50em', height = '14em')
display(case_pivot_chall2)

In [212]:
case_pivot_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(case_pivot_chall2_b)
run_q(case_pivot_chall2, case_pivot_chall2_b)

In [213]:
case_pivot_chall2_c ='''SELECT 
    date,
    COUNT(CASE WHEN book = 'For Whom the Bell Tolls' THEN revenue END) Bell_Tolls_Count,
    COUNT(CASE WHEN book = 'Emma' THEN revenue END) Emma_Count,
    COUNT(CASE WHEN book IN ('Macbeth', 'Hamlet') THEN revenue END) Macbeth_Hamlet_Count
FROM
    sales_table
GROUP BY
    date'''
cheat(case_pivot_chall2_c)



[Previous](#case_when)       |       [Table of Contents](#table_of_contents)       |       [Next](#union)

SELECT
      column_a
FROM
      table_x
WHERE
      column_a IN (SELECT column_b FROM table_y)

Read first: For this section, we'll need the extra rows in the auth_table and book_table that we added during the JOIN exercises. If you've closed the program or re-run it since you last added those rows, then click the button below to re-add them.


In [214]:
insert_b = widgets.Button(description="Read the paragraph above before clicking", width='25em', height='3em', color='white',background_color='#1f5fd6', border_color='#1f5fd6')
display(insert_b)
def insert_button(b):
    insert_q1 = '''INSERT INTO auth_table VALUES ('Tolstoy', 'Leo', 'Russia', 1828)'''
    insert_q2 = '''INSERT INTO auth_table VALUES ('Twain', 'Mark', 'USA', 1835)'''
    insert_q3 = '''INSERT INTO book_table VALUES ('Jude the Obscure', '11.25', 'Hardy')'''
    insert_q4 = '''INSERT INTO book_table VALUES ('The Age of Innocence', '14.20', 'Wharton')'''
    query_list = [insert_q1, insert_q2, insert_q3, insert_q4]
    for query in query_list:
        run(query)
    print('New rows have been added to auth_table and book_table!')
insert_b.on_click(insert_button)

To be totally honest, you likely won't be writing nested queries yourself until you've become much more comfortable with SQL. However, it's good to learn about them because you'll likely encounter them when coworkers share queries with you.

Start by looking at the two queries and their outputs below:


In [215]:
nested_q1 = widgets.Textarea(value=
'''SELECT 
    COUNT(DISTINCT(book)) AS Count_of_Distinct_Books
FROM 
    sales_table''',
width = '50em', height = '7em')
display(nested_q1)

In [216]:
nested_b1 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nested_b1)
run_q(nested_q1, nested_b1)


Count_of_Distinct_Books
11

NOTE THAT YOU NEED TO HIT "RUN" AGAIN FOR THE QUERY BELOW (It should return the number 13. If it doesn't, click the blue button above to update the book_table, then re-run the query below)


In [217]:
nested_q2 = widgets.Textarea(value=
'''SELECT 
    COUNT(DISTINCT(book)) AS Count_of_Distinct_Books 
FROM 
    book_table''',
width = '50em', height = '7em')
display(nested_q2)

In [218]:
nested_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nested_b2)
run_q(nested_q2, nested_b2)


Count_of_Distinct_Books
11

From the count of distinct books in each table, we see that there are two books in our book_table (our inventory) that haven't made a single sale. Imagine if both tables had thousands of rows - it'd be a nightmare to try to figure out which were the books with no sales. However, a nested query can help us out.

The query below uses a nested query in the WHERE clause. In plain English, it says "Show me the books from the book_table, but not the ones that also show up in the sales_table":


In [219]:
nested_q3 = widgets.Textarea(value=
'''SELECT
    book
FROM
    book_table
WHERE 
    book NOT IN (SELECT book FROM sales_table)''',
width = '50em', height = '10em')
display(nested_q3)

In [220]:
nested_q3_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nested_q3_b)
run_q(nested_q3, nested_q3_b)


book

Challenge:

Write a query to see authors who appear in the auth_table but don't show up in the book_table.


In [221]:
nest_chall = widgets.Textarea(value='', width = '50em', height = '12em')
display(nest_chall)

In [222]:
nest_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_chall_b)
run_q(nest_chall, nest_chall_b)

In [223]:
nest_chall_c ='''SELECT 
    last_name
FROM
    auth_table
WHERE
    last_name NOT IN (SELECT author FROM book_table)
'''
cheat(nest_chall_c)

You can also use nested queries to avoid the need for multiple JOIN clauses. Suppose you wanted to see the total revenue for books by authors from England. Previously, we would have joined the sales_table to the book_table, and then the book_table to the auth_table in order to be able to work with both the revenue column and the country column:


In [224]:
nest_q3 = widgets.Textarea(value=
'''SELECT
    SUM(revenue)
FROM
    sales_table
WHERE
    book IN (SELECT 
                book 
            FROM 
                book_table B 
                JOIN auth_table A ON B.author = A.last_name 
            WHERE 
                A.country = 'England')''',
width = '50em', height = '18em')
display(nest_q3)

In [225]:
nest_b3 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_b3)
run_q(nest_q3, nest_b3)


SUM(revenue)
639.7

Let's break down what's going on here. First, copy and paste the nested query (the part in parentheses) in the cell below, then run it:


In [226]:
nest_q3_explained = widgets.Textarea(value='',width = '50em', height = '12em')
display(nest_q3_explained)

In [227]:
nest_b3_explained = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_b3_explained)
run_q(nest_q3_explained, nest_b3_explained)

Next, take the output that you just produced and:
• comma-separate each book
• wrap each book in quotation marks
• paste your list between the parentheses in the WHERE clause below:
• rerun the query


In [228]:
nest_q3_explained2 = widgets.Textarea(value=
'''SELECT
    SUM(revenue)
FROM
    sales_table
WHERE
    book IN ( )''',
width = '50em', height = '12em')
display(nest_q3_explained2)

In [229]:
nest_b3_explained2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(nest_b3_explained2)
run_q(nest_q3_explained2, nest_b3_explained2)


SUM(revenue)

This is essentially the same process that SQL walks through when you run a nested query. It pulls the list of books from the nested query, then uses that list in the WHERE clause of the dominant query.


The query above produces the same results as if you'd done a multiple join, but it's much more efficient. That's because SQL can just get the book titles it needs from the nested query and plug them into the dominent query, rather than needing to do all the work of duplicating rows to create joined tables. Use nested queries for more efficient joins whenever possible.




[Previous](#case_when)       |       [Table of Contents](#table_of_contents)       |       [Next](#rollup)

SELECT
      some_column
FROM
      table_x

UNION                        ➞ or use UNION ALL, see explanation below

SELECT
      some_other_column
FROM
      table_y

UNION and UNION ALL allow you to attach two completely separate queries. UNION will result in the output from the first query and the second query to be sorted by default (or you can add an ORDER BY clause). UNION ALL will ensure that the results from the second query will all appear after the results from the first query.

We'll start with a very, very simple illustration and work our way into more complex versions of UNION queries. First, consider the query below. We're pulling all books from the book table with the first query, and all the authors' first names from the auth_table with the second query. By using UNION, we're telling SQL to return the results of both these queries in the same column.


In [230]:
union_q = widgets.Textarea(value=
'''SELECT
    book AS selection
FROM
    book_table

UNION

SELECT
    first_name AS selection
FROM
    auth_table''',
width = '50em', height = '16em')
display(union_q)

In [231]:
union_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_b)
run_q(union_q, union_b)


selection
A Farewell to Arms
Absalom! Absalom!
Emma
Ernest
For Whom the Bell Tolls
Hamlet
Jane
Macbeth
Pride and Prejudice
Romeo and Juliet
Sense and Sensibility
The Sound and the Fury
The Sun Also Rises
William

Quick Exercise:

  1. Change the query above to use UNION ALL instead of UNION and re-run. Make sure you understand how the output changes.
  2. Delete AS selection in the second query and rerun.
  3. Rename selection to something else in the first query and rerun.
  4. Add cogs in the SELECT clause in the first query and country to the SELECT clause in the second query, rerun.
  5. Delete cogs in the first query and rerun. Can you think why you're hitting an error?

Useful applications for UNION

The above example is just a simple illustration of how UNION functions, but it's not very useful as a practical application. Now lets try UNION in a more useful way.

Let's say that our imaginary book store decides to start stocking a few movies, so we've created a new table to manage this new inventory:

Now let's say we wanted to view all of our store inventory, COGs, and the author or director of each item. We don't want to join the movie_table and the book_table - there's nothing to really join them on. However, it'd be useful to stack them.

Challenge:

Use UNION to write a query to view the contents of both movie_table and book_table in a single table. The column-headers should be: Item, COGs, and Creator. Order by item title (hint: with UNION, the ORDER BY clause can only go after the second query).


In [232]:
union_chall = widgets.Textarea(value='',width = '50em', height = '25em')
display(union_chall)

In [233]:
union_chall_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_chall_b)
run_q(union_chall, union_chall_b)

In [234]:
union_chall_c ='''SELECT
    book as Item,
    cogs as COGs,
    author as Creator
FROM
    book_table
    
UNION

SELECT
    film, 
    cogs,
    director
FROM
    movie_table
ORDER BY 
    Item'''
cheat(union_chall_c)

Using UNION to add totals and subtotals:

Take a look at the query below. You'll see it pulls the COGs and book title for each book. It also uses UNION ALL to add a final line - a summary row averaging all cogs:


In [235]:
union_q2 = widgets.Textarea(value=
'''SELECT
    book,
    cogs
FROM
    book_table
    
UNION ALL

SELECT
    'Average COGs',
    ROUND(AVG(cogs), 2)
FROM
    book_table''',
width = '50em', height = '19em')
display(union_q2)

In [236]:
union_b2 = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_b2)
run_q(union_q2, union_b2)


book cogs
For Whom the Bell Tolls 13.50
A Farewell to Arms 10.35
The Sun Also Rises 11.35
Sense and Sensibility 11.00
Emma 12.50
Pride and Prejudice 13.00
Romeo and Juliet 7.50
Hamlet 6.95
Macbeth 11.10
The Sound and the Fury 8.50
Absalom! Absalom! 5.15
Average COGs 10.08

Quick Exercise:

  1. Remove the ALL from UNION ALL and rerun. See how ALL can be useful?
  2. Delete 'Average COGs', from the second query and rerun. Make sure you understand why there's an error. Fix it and reruun.

Challenge:

Write a query that totals revenue per book from the sales_table. Use UNION ALL to add a summary line that totals revenue for all books.


In [237]:
union_chall1 = widgets.Textarea(value='',
width = '50em', height = '22em')
display(union_chall1)

In [238]:
union_chall1_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_chall1_b)
run_q(union_chall1, union_chall1_b)

In [239]:
union_chall1_c ='''SELECT
    book,
    SUM(revenue) AS total_revenue
FROM
    sales_table
GROUP BY
    book

UNION ALL

SELECT
   'NULL',
   SUM(revenue)
FROM
    sales_table 
'''
cheat(union_chall1_c)

Extra Challenging Challenge:

Write a query that totals revenue per book. Add subtotal lines for each author's revenue above their books. The output should look like this (use Google to figure out how to capitalize the authors' names for the subtotal rows).


In [240]:
HTML(run('''SELECT
    B.author AS author_last_name,
    B.book AS book_title,
    SUM(S.revenue) AS sum_revenue
FROM
    sales_table S
    JOIN book_table B on S.book = B.book
GROUP BY
    B.book

UNION 

SELECT
   UPPER(B.author),
   'TOTAL REVENUE',
   SUM(S.revenue)
FROM
    sales_table S
    JOIN book_table B on S.book = B.book
GROUP BY
    B.author
''').to_html(index=False))


Out[240]:
author_last_name book_title sum_revenue
AUSTEN TOTAL REVENUE 392.73
Austen Emma 142.71
Austen Pride and Prejudice 134.54
Austen Sense and Sensibility 115.48
FAULKNER TOTAL REVENUE 169.86
Faulkner Absalom! Absalom! 25.91
Faulkner The Sound and the Fury 143.95
HEMINGWAY TOTAL REVENUE 481.59
Hemingway A Farewell to Arms 110.66
Hemingway For Whom the Bell Tolls 207.60
Hemingway The Sun Also Rises 163.33
SHAKESPEARE TOTAL REVENUE 246.97
Shakespeare Hamlet 88.48
Shakespeare Macbeth 97.48
Shakespeare Romeo and Juliet 61.01

In [241]:
union_chall2 = widgets.Textarea(value='', width = '50em', height = '30em')
display(union_chall2)

In [242]:
union_chall2_b = widgets.Button(description='Run', width='10em', height='2.5em', color='white',background_color='black', border_color='black')
display(union_chall2_b)
run_q(union_chall2, union_chall2_b)

In [243]:
union_chall2_c ='''SELECT
    B.author AS author_last_name,
    B.book AS book_title,
    SUM(S.revenue) AS sum_revenue
FROM
    sales_table S
    JOIN book_table B on S.book = B.book
GROUP BY
    B.book

UNION 

SELECT
   UPPER(B.author),
   'TOTAL REVENUE',
   SUM(S.revenue)
FROM
    sales_table S
    JOIN book_table B on S.book = B.book
GROUP BY
    B.author'''
cheat(union_chall2_c)



[Previous](#union)       |       [Table of Contents](#table_of_contents)       |       [Next](#wrapping_up)

Microsoft SQL Server MySQL Oracle SQLite
GROUP BY column_a WITH ROLLUP GROUP BY column_a WITH ROLLUP GROUP BY ROLLUP (column_a) not supported

Unfortunately, SQLite doesn't have a simple way to do ROLLUP like the other RDBMSs, so we can't practice it here. However, the concept is very straightforward: it's exactly like using UNION to add a summary row, except way simpler. Below is what the query would look like if we were using Microsoft or MySQL. Take a look at the query and the output to understand what's going on, even if you can't practice it:

SELECT
      book,
      SUM(revenue) AS total_revenue,
      COUNT(*) AS count_of_sales
FROM
      sales_table
GROUP BY
      book WITH ROLLUP


In [244]:
HTML(run('''SELECT
    book,
    SUM(revenue) AS total_revenue, 
    COUNT(*) AS count_of_sales
FROM
    sales_table
GROUP BY
    book

UNION ALL

SELECT
   'NULL',
   SUM(revenue),
   COUNT(*)
FROM
    sales_table''').to_html(index=False))


Out[244]:
book total_revenue count_of_sales
A Farewell to Arms 110.66 8
Absalom! Absalom! 25.91 4
Emma 142.71 9
For Whom the Bell Tolls 207.60 12
Hamlet 88.48 10
Macbeth 97.48 11
Pride and Prejudice 134.54 8
Romeo and Juliet 61.01 6
Sense and Sensibility 115.48 8
The Sound and the Fury 143.95 13
The Sun Also Rises 163.33 11
NULL 1291.15 100

Note that rollup produces the word "NULL" for any row that it cannot sum up. Essentially its function is to find all columns with numbers and add these up. Numberless columns can't be added, so rollup just skips them.




[Previous](#rollup)       |       [Table of Contents](#table_of_contents)      

One or two days before your job or internship:

Review and PRACTICE! Review the lessons and terms, re-do the quick exercises and challenges. Seriously. This stuff is easily forgotten if you don't use it, so be sure to refresh what you've learned before you start working.

Here's some additional reading on SQL if you're interested.

When you first start work:

  1. Find out what relational database management system your company uses, and get acquainted with how that system differs from what we've learned in class (use the table below as a quide).
  2. Ask coworkers if they have any pre-written queries that will be useful to your work. Read through them and make sure you understand them.
  3. Read the structure of any table that seems important so you understand what data can be found in each.
  4. Figure out how tables join to one another, and which columns come from which tables
  5. ALWAYS ALWAYS avoid "slow server" traps when you are exploring your database. That means:
    • NEVER Run a simple SELECT * FROM table_name query unless you are absolutely certain that the table is very, very small
    • Avoid joining 3 or more tables whenever possible. If you find yourself needing them, try to see if you can use a nested query to cut one of the JOIN clauses out.
    • If your table is recording dates, use these to limit how much data you pull. Depending on the table, more than 1-2 months at a time will usually slow down a system.
    • Use LIKE and % sparingly or use WHERE to limit your search as much as possible unless you are dealing with a small-ish table

What do I mean by small vs. large tables?

Several times in class we've discussed certain practices that should only be applied to small tables to avoid strain on your server. You can use SELECT COUNT(*) FROM table_name to see how many rows it has. You can also think of it like this: the more often the table is updated, the larger it probably is. A table that adds a row every time a user views a webpage is updated constantly and is probably huge. A table that simply lists all the ZIP codes in the US probably doesn't get updated often, it'll be pretty small.

Dialect Differences:

Description Microsoft SQL Server MySQL Oracle SQLite
Reading a table's structure SP_Help tablename DESCRIBE tablename DESCRIBE tablename PRAGMA TABLE_INFO(tablename)
Limiting rows SELECT TOP N column_name LIMIT N WHERE ROWNUM <= N LIMIT N
JOIN or INNER JOIN
LEFT JOIN or LEFT OUTER JOIN
RIGHT JOIN or RIGHT OUTER JOIN not supported
OUTER JOIN or FULL OUTER JOIN not supported not supported
IF IF logical_test PRINT value_if_true IF(logical_test, value_if_true, value_if_false) (same as Excel) IF logical_test THEN value_if_true ELSIF...END IF NOT SUPPORTED
CASE WHEN
ROLLUP GROUP BY column_a WITH ROLLUP GROUP BY column_a WITH ROLLUP GROUP BY ROLLUP (column_a) not supported