SQL Bootcamp

Sarah Beckett-Hile | NYU Stern School of Business | March 2015

Today's plan

  • SQL, the tool of business
  • Relational Databases
  • Why can't I do this in Excel?
  • Setting up this course
  • Basic Clauses

About SQL

  • SQL = "Structured Query Language" (pronounced "S-Q-L" or "sequel")

  • Database language of choice for most businesses

  • The software optimized for storing relational databases that you access with SQL varies. Relational Database Management Systems (RDBMS) include MySQL, Microsoft SQL Server, Oracle, and SQLite. We will be working with SQLite.

  • Relational Databases have multiple tables. Visualize it like an Excel file:

    • Database = a single Excel file/workbook
    • Table = a single worksheet in the same Excel file
  • SQL lets you perform four basic functions: C.R.U.D. = Create, Read, Update, Delete

  • "Read" is all you'll need for business analytics

  • Additional reading: http://www.w3schools.com/sql/sql_intro.asp

  • Find examples of queries for business analysis at botton of this lesson page

About this file

  • We'll use SQL in Python, specifically an IPython Notebook
  • No need to know what that means, but be sure you have SQL_support_code.py saved in the same folder as this file.
  • All SQL queries are in red
  • If you get stumped on a challenge, there are cheats at the bottom of a challenge cell. You'll see something like "#print(cheat1)". Delete the hash and run the cell (SHIFT-RETURN). Once you've figured it out, replace the hash, and try again.
TO GET STARTED, CLICK "CELL" IN THE MENU BAR ABOVE, THEN SELECT "RUN ALL"

In [1]:
from SQL_support_code import *

Structure and Formatting Query Basics:

  • Indentations and Returns:

    • Mostly arbitrary in SQL
    • Usually for readability
  • Capitalization:

    • Convention to put keywords (functions, clauses) in CAPS
    • Consistency is best
  • Order of Clauses:

    • Very strict
    • Not all clauses need to be present in a query, but when they are present, then they must be in the correct order
    • Below are the major clauses that we are going to cover. Use this list as reference if you are getting errors with your queries - there's a chance you just have the clauses in the wrong order:
        SELECT
        FROM
        JOIN...ON 
        WHERE
        GROUP BY
        UNION
        ORDER BY
        LIMIT

Reading a table's structure:

PRAGMA TABLE_INFO(table_name)

Running this will let you see the column heads and data types of any table.

The SQL query above only works for SQLite, which is what we're using here. If you're interested in knowing the equivalent versions for other RDBMS options, see the table below.


In [2]:
describe_differences


Out[2]:
Microsoft SQL Server MySQL Oracle SQLite
Reading a table SP_HELP table_name DESCRIBE table_name DESCRIBE table_table PRAGMA TABLE_INFO(table_name)

These are the names of the tables in our mini SQLite database:

sales_table
car_table
salesman_table
cust_table

Start by looking at the columns and their data types in the sales_table.


In [3]:
run('''
    PRAGMA TABLE_INFO(sales_table)
    ''')


Out[3]:
cid name type notnull dflt_value pk
0 0 id INTEGER 0 0
1 1 model_id INTEGER 0 0
2 2 customer_id INTEGER 0 0
3 3 revenue INTEGER 0 0
4 4 payment_type TEXT 0 0
5 5 salesman_id INTEGER 0 0
6 6 date TEXT 0 0

Rewrite the query to look at the other tables:


In [4]:
run('''
    PRAGMA TABLE_INFO(sales_table)
    ''')
#print(describe_cheat)


Out[4]:
cid name type notnull dflt_value pk
0 0 id INTEGER 0 0
1 1 model_id INTEGER 0 0
2 2 customer_id INTEGER 0 0
3 3 revenue INTEGER 0 0
4 4 payment_type TEXT 0 0
5 5 salesman_id INTEGER 0 0
6 6 date TEXT 0 0





SELECT & FROM:

  • Basically every "read" query will contain a SELECT and FROM clause
  • In the SELECT clause, you tell SQL which columns you want to see
  • In the FROM clause, you tell SQL the table where those columns are located
  • More on SELECT: http://www.w3schools.com/sql/sql_select.asp

SELECT * (ALL COLUMNS)

SELECT          # specifies which columns you want to see 
    *           # asterisk returns all columns
FROM            # specifies the table or tables where these columns can be found
    table_name

Use an asterisk to tell SQL to return all columns from the table:


In [5]:
run('''
    SELECT
        *
    FROM
        sales_table
    ''')


Out[5]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 276 1/7/2014
1 43161 20 967016 19140 finance 225 1/26/2014
2 40112 46 819010 14720 cash 147 1/17/2014
3 92495 31 633030 19010 finance 215 1/13/2014
4 78000 51 341877 22022 finance 803 1/12/2014
5 13154 75 720210 21409 cash 215 1/20/2014
6 36535 31 908558 19894 finance 215 1/30/2014
7 22813 46 705508 12960 finance 813 1/29/2014
8 56245 36 248621 25938 cash 492 1/19/2014
9 88118 51 341344 19844 finance 215 1/23/2014
10 84469 31 733566 21441 finance 492 1/20/2014
11 37412 31 750195 17462 cash 949 1/11/2014
12 68513 31 461723 17020 cash 147 1/18/2014
13 74380 20 468665 18040 finance 803 1/5/2014
14 24047 75 556188 18671 finance 225 1/12/2014
15 40603 51 241759 22506 finance 949 1/4/2014
16 69883 20 161369 20460 finance 215 1/25/2014
17 43338 46 731692 13440 finance 225 1/29/2014
18 39727 51 656750 18634 finance 215 1/1/2014
19 21022 22 619020 17312 cash 492 1/14/2014
20 18303 36 413891 23318 cash 215 1/17/2014
21 28176 20 965672 21780 cash 862 1/11/2014
22 62604 75 217720 19418 cash 680 1/30/2014
23 83930 22 946265 17756 cash 813 1/13/2014
24 87626 31 140795 21441 finance 225 1/13/2014
25 10682 31 145479 17905 cash 949 1/20/2014
26 23444 20 961650 16720 finance 215 1/1/2014
27 87998 31 457742 20778 cash 680 1/3/2014
28 95912 75 978159 19667 cash 949 1/25/2014
29 74410 20 474218 19360 finance 276 1/6/2014
... ... ... ... ... ... ... ...
70 61014 36 253239 20174 finance 215 1/21/2014
71 18946 20 614301 18480 cash 680 1/28/2014
72 75834 51 168495 23474 cash 492 1/10/2014
73 57366 31 699431 19231 cash 813 1/12/2014
74 57711 22 528354 18865 cash 225 1/4/2014
75 65870 36 751098 24104 cash 492 1/23/2014
76 22147 36 480153 20436 finance 225 1/23/2014
77 99759 51 208677 22990 cash 492 1/16/2014
78 64406 20 349494 19580 cash 276 1/16/2014
79 55585 22 206310 17756 finance 276 1/1/2014
80 24710 75 185223 18920 cash 147 1/16/2014
81 53831 75 656181 24646 finance 803 1/18/2014
82 97494 19 393613 14006 cash 862 1/14/2014
83 13004 36 488910 24890 finance 813 1/12/2014
84 30779 75 846630 23401 finance 147 1/25/2014
85 29708 51 534633 20328 finance 215 1/29/2014
86 14082 46 909110 15200 cash 813 1/28/2014
87 82119 19 628031 16189 cash 862 1/12/2014
88 31815 46 743097 15040 cash 147 1/25/2014
89 32345 22 476047 20197 finance 862 1/19/2014
90 63526 51 862765 23232 finance 276 1/23/2014
91 20256 19 597655 16734 cash 215 1/16/2014
92 15930 36 486534 20960 cash 862 1/26/2014
93 68446 75 339830 24148 cash 225 1/22/2014
94 68330 20 283797 19140 cash 862 1/9/2014
95 85407 22 635204 21751 finance 803 1/18/2014
96 42428 75 619016 20413 cash 147 1/11/2014
97 37620 51 183947 20328 finance 949 1/3/2014
98 96344 20 731677 17600 finance 680 1/15/2014
99 53062 31 907549 19894 finance 225 1/28/2014

100 rows × 7 columns

Write a query to select all columns from the car_table:


In [6]:
run('''
    SELECT NULL
    ''')
#print(select_cheat1)


Out[6]:
NULL
0

SELECT COLUMN:

SELECT 
    column_a,   # comma-separate multiple columns
    column_b
FROM 
    table_name

Instead of using an asterisk for "all columns", you can specify a particular column or columns:


In [7]:
run('''
    SELECT
        model_id, 
        revenue
    FROM
        sales_table
    ''')


Out[7]:
model_id revenue
0 36 21222
1 20 19140
2 46 14720
3 31 19010
4 51 22022
5 75 21409
6 31 19894
7 46 12960
8 36 25938
9 51 19844
10 31 21441
11 31 17462
12 31 17020
13 20 18040
14 75 18671
15 51 22506
16 20 20460
17 46 13440
18 51 18634
19 22 17312
20 36 23318
21 20 21780
22 75 19418
23 22 17756
24 31 21441
25 31 17905
26 20 16720
27 31 20778
28 75 19667
29 20 19360
... ... ...
70 36 20174
71 20 18480
72 51 23474
73 31 19231
74 22 18865
75 36 24104
76 36 20436
77 51 22990
78 20 19580
79 22 17756
80 75 18920
81 75 24646
82 19 14006
83 36 24890
84 75 23401
85 51 20328
86 46 15200
87 19 16189
88 46 15040
89 22 20197
90 51 23232
91 19 16734
92 36 20960
93 75 24148
94 20 19140
95 22 21751
96 75 20413
97 51 20328
98 20 17600
99 31 19894

100 rows × 2 columns

Write a query to select model_id and model from the car_table:


In [8]:
run('''
    SELECT NULL
    ''')
#print(select_cheat2)


Out[8]:
NULL
0

One more quick note on the basics of SELECT - technically you can SELECT a value without using FROM to specify a table. You could just tell the query exactly what you want to see in the result-set. If it's a number, you can write the exact number. If you are using various characters, put them in quotes.

See the query below as an example:


In [9]:
run('''
    SELECT 
    4, 
    5, 
    7, 
    'various characters or text'
    ''')


Out[9]:
4 5 7 'various characters or text'
0 4 5 7 various characters or text

SELECT DISTINCT VALUES IN COLUMNS:

SELECT
    DISTINCT column_a     # returns a list of each unique value in column_a
FROM
    table_name

The query below pulls each distinct value from the model_id column in the sales_table, so each value is only listed one time:


In [10]:
run('''
    SELECT
        DISTINCT model_id
    FROM
        sales_table
    ''')


Out[10]:
model_id
0 36
1 20
2 46
3 31
4 51
5 75
6 22
7 19

Use DISTINCT to select unqiue values from the salesman_id column in sales_table. Delete DISTINCT and rerun to see the effect.


In [11]:
run('''
    SELECT NULL
    ''')
#print(select_cheat3)


Out[11]:
NULL
0





WHERE

SELECT 
    column_a
FROM
    table_name
WHERE
    column_a = x      # filters the result-set to rows where column_a's value is exactly x

A few more options for the where clause:

WHERE column_a = 'some_text'      # put text in quotations. CAPITALIZATION IS IMPORTANT

WHERE column_a != x      # filters the result-set to rows where column_a's value DOES NOT EQUAL x

WHERE column_a < x       # filters the result-set to rows where column_a's value is less than x

WHERE columna_a <= x      # filters the result-set to rows where column_a's value is less than or equal to x

WHERE column_a IN (x, y)       # column_a's value can be EITHER x OR y 

WHERE column_a NOT IN (x, y)     #  column_a's value can be NEITHER x NOR y  

WHERE column_a BETWEEN x AND y      # BETWEEN lets you specify a range   

WHERE column_a = x AND column_b = y    # AND lets you add more filters

WHERE column_a = x OR column_b = y     # OR will include results that fulfill either criteria

WHERE (column_a = x AND column_b = y) OR (column_c = z)   # use parentheses to create complex AND/OR statements 


  • WHERE allows you to filter the result-set to only include rows matching specific values/criteria. If the value/criteria is text, remember to put it in single or double quotation marks
  • More on WHERE: http://www.w3schools.com/sql/sql_where.asp

Below, WHERE filters out any rows that don't match the criteria. The result-set will only contain rows where the payment type is cash AND where the model_id is 46:


In [12]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash'
        AND model_id = 46 
    ''')


Out[12]:
id model_id customer_id revenue payment_type salesman_id date
0 40112 46 819010 14720 cash 147 1/17/2014
1 82630 46 618295 15040 cash 949 1/16/2014
2 70434 46 944843 13600 cash 680 1/25/2014
3 30046 46 982483 12640 cash 215 1/29/2014
4 14082 46 909110 15200 cash 813 1/28/2014
5 31815 46 743097 15040 cash 147 1/25/2014

Rewrite the query to return rows where payment_type is NOT cash, and the model_id is either 31 or 36

  • Extra: Try changing 'cash' to 'Cash' to see what happens.

In [13]:
run('''
    SELECT NULL
    ''')
#print(where_cheat1)


Out[13]:
NULL
0

Using BETWEEN, rewrite the query to return rows where the revenue was between 24,000 and 25,000:


In [14]:
run('''
    SELECT NULL
    ''')
#print(where_cheat2)


Out[14]:
NULL
0





WHERE column LIKE:

SELECT 
    column_a
FROM
    table_name
WHERE
    column_a LIKE '%text or number%'      # Filters the result_set to rows where that text or value can be found, with % standing in as a wildcard

Note that you don't have to use the whole word "cash" when you use LIKE, and that the capital "C" now doesn't cause a problem:


In [15]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type LIKE 'Cas%'    
    ''').head()


Out[15]:
id model_id customer_id revenue payment_type salesman_id date
0 40112 46 819010 14720 cash 147 1/17/2014
1 13154 75 720210 21409 cash 215 1/20/2014
2 56245 36 248621 25938 cash 492 1/19/2014
3 37412 31 750195 17462 cash 949 1/11/2014
4 68513 31 461723 17020 cash 147 1/18/2014

Be careful with LIKE though - it can't deal with extra characters or mispellings:


In [16]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type LIKE 'ces%'
    LIMIT 5
    ''')


Out[16]:
id model_id customer_id revenue payment_type salesman_id date

LIKE and % will also return too much if you're not specific enough. This returns both 'cash' and 'finance' because both have a 'c' with some letters before or after:


In [17]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type LIKE '%c%'
    LIMIT 5
    ''')


Out[17]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 276 1/7/2014
1 43161 20 967016 19140 finance 225 1/26/2014
2 40112 46 819010 14720 cash 147 1/17/2014
3 92495 31 633030 19010 finance 215 1/13/2014
4 78000 51 341877 22022 finance 803 1/12/2014

You can use different wildcards besides % to get more specific. An underscore is a substitute for a single letter or character, rather than any number. The query below uses 3 underscores after c to get 'cash':


In [18]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type LIKE 'c___'
    LIMIT 5
    ''')


Out[18]:
id model_id customer_id revenue payment_type salesman_id date
0 40112 46 819010 14720 cash 147 1/17/2014
1 13154 75 720210 21409 cash 215 1/20/2014
2 56245 36 248621 25938 cash 492 1/19/2014
3 37412 31 750195 17462 cash 949 1/11/2014
4 68513 31 461723 17020 cash 147 1/18/2014

Say you can't remember the model of the car you're trying to look up. You know it's "out"...something. Outcast? Outstanding? Write a query to return the model_id and model from the car_table and use LIKE to help you search:


In [19]:
run('''
    SELECT NULL
    ''')
#print(where_cheat3)


Out[19]:
NULL
0





ORDER BY

SELECT 
    column_a
FROM
    table_name
WHERE                 # optional 
    column_a = x
ORDER BY              # sorts the result-set by column_a
    column_a DESC     # DESC is optional. It sorts results in descending order (100->1) instead of ascending (1->100)

  • Without an ORDER BY clause, the default result-set will be ordered by however it appears in the database
  • By default, ORDER BY will sort values in ascending order (A→Z, 1→100). Add DESC to order results in desceding order instead (Z→A, 100→1)
  • More on ORDER BY: http://www.w3schools.com/sql/sql_orderby.asp

The query below orders the result-set by revenue amount, starting with the smallest amount listed first:


In [20]:
run('''
    SELECT
        *
    FROM
        sales_table
    ORDER BY
        revenue DESC
    LIMIT 5
    ''')


Out[20]:
id model_id customer_id revenue payment_type salesman_id date
0 56245 36 248621 25938 cash 492 1/19/2014
1 37050 36 513614 25937 cash 492 1/2/2014
2 13004 36 488910 24890 finance 813 1/12/2014
3 53831 75 656181 24646 finance 803 1/18/2014
4 91723 36 490244 24628 cash 225 1/9/2014

Rewrite the query above to look at the sticker_price of cars from the car_table in descending order:


In [21]:
run('''
    SELECT NULL
    ''')
#print(order_cheat)


Out[21]:
NULL
0





LIMIT

SELECT
    column_a
FROM
    table_name
WHERE
    columna_a = x      # optional
ORDER BY
    column_a           # optional
LIMIT                  # Limits the result-set to N rows
    N  
  • LIMIT just limits the number of rows in your result set
  • More on LIMIT: http://www.w3schools.com/sql/sql_top.asp
  • The ability to limit results varies by RBDSM. Below you can see the different ways to do this:

In [22]:
limit_differences


Out[22]:
Microsoft SQL Server MySQL Oracle SQLite
LIMITING SELECT TOP N column_a... LIMIT N WHERE ROWNUM <=N LIMIT N

The query below limits the number of rows to 5 results. Change it to 10 to get a quick sense of what we're doing here:


In [23]:
run('''
    SELECT
        *
    FROM
        sales_table
    LIMIT 5
    ''')


Out[23]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 276 1/7/2014
1 43161 20 967016 19140 finance 225 1/26/2014
2 40112 46 819010 14720 cash 147 1/17/2014
3 92495 31 633030 19010 finance 215 1/13/2014
4 78000 51 341877 22022 finance 803 1/12/2014





ALIASES

SELECT
    T.column_a AS alias_a    # creates a nickname for column_a, and states that it's from table_name (whose alias is T)
FROM
    table_name AS T          # creates a nickname for table_name
WHERE
    alias_a = z           # refer to an alias in the WHERE clause
ORDER BY
    alias_a                  # refer to an alias in the ORDER BY clause

  • Aliases are optional, but save you time and make column headers cleaner
  • AS isn't necessary to create and alias, but commonly used
  • The convention is to use "AS" in the "SELECT" clause, but not in the "FROM" clause.
  • More on Aliases: http://www.w3schools.com/sql/sql_alias.asp

Change the aiases for model_id and revenue, or add extra columns to see how they work:


In [24]:
run('''
    SELECT
        model_id AS Model_of_car,
        revenue AS Rev_per_car
    FROM 
        sales_table
    ''')


Out[24]:
Model_of_car Rev_per_car
0 36 21222
1 20 19140
2 46 14720
3 31 19010
4 51 22022
5 75 21409
6 31 19894
7 46 12960
8 36 25938
9 51 19844
10 31 21441
11 31 17462
12 31 17020
13 20 18040
14 75 18671
15 51 22506
16 20 20460
17 46 13440
18 51 18634
19 22 17312
20 36 23318
21 20 21780
22 75 19418
23 22 17756
24 31 21441
25 31 17905
26 20 16720
27 31 20778
28 75 19667
29 20 19360
... ... ...
70 36 20174
71 20 18480
72 51 23474
73 31 19231
74 22 18865
75 36 24104
76 36 20436
77 51 22990
78 20 19580
79 22 17756
80 75 18920
81 75 24646
82 19 14006
83 36 24890
84 75 23401
85 51 20328
86 46 15200
87 19 16189
88 46 15040
89 22 20197
90 51 23232
91 19 16734
92 36 20960
93 75 24148
94 20 19140
95 22 21751
96 75 20413
97 51 20328
98 20 17600
99 31 19894

100 rows × 2 columns

You can use an alias in the ORDER BY and WHERE clauses now. Write a query to:

  • pull the model_id and revenue for each transaction
  • give model_id the alias "Model"
  • give revenue the alias "Rev"
  • limit the results to only include rows where the model_id id 36, use the alias in the WHERE clause
  • order the results by revenue in descending order, use the alias in the ORDER BY clause
  • Run the query

THEN:

  • Try giving model_id the alias "ID" and use it in the WHERE clause, then rerun the query. What do you think is causing the error?

In [25]:
run('''
    SELECT NULL
    ''') 
#print(alias_cheat)


Out[25]:
NULL
0

You can also assign an alias to a table, and use the alias to tell SQL which table the column is coming from. This isn't of much use when you're only using one table, but it will come in handy when you start using multiple tables.

Below,the sales_table has the alias "S". Read "S.model_id" as "the model_id column from S, which is the sales_table"

Change the S to another letter in the FROM clause and run. Why did you hit an error? What can you do to fix it?


In [26]:
run('''
    SELECT
        S.model_id,
        S.revenue
    FROM 
        sales_table AS S
    LIMIT 5
    ''')


Out[26]:
model_id revenue
0 36 21222
1 20 19140
2 46 14720
3 31 19010
4 51 22022





JOINS

SELECT
    *
FROM
    table_x                                   
    JOIN table_y                                 # use JOIN to add the second table
        ON table_x.column_a = table_y.column_a   # use ON to specify which columns correspond on each table

  • Joining tables is the most fundamental and useful part about relational databases
  • Use columns on different tables with corresponding values to join the two tables
  • The format "table_x.column_a" can be read as "column_a from table_x"; it tells SQL the table where it can find that column
  • More on JOINS: http://www.w3schools.com/sql/sql_join.asp

Start by looking at the first few rows of sales_table again:


In [27]:
run('''
    SELECT
        *
    FROM
        sales_table
    LIMIT 5
    ''')


Out[27]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 276 1/7/2014
1 43161 20 967016 19140 finance 225 1/26/2014
2 40112 46 819010 14720 cash 147 1/17/2014
3 92495 31 633030 19010 finance 215 1/13/2014
4 78000 51 341877 22022 finance 803 1/12/2014

Now the first few rows of the car_table:


In [28]:
run('''
    SELECT
        *
    FROM
        car_table
    LIMIT 5
    ''')


Out[28]:
model_id make model sticker_price cogs
0 20 Toyota Camry 22000 13200
1 46 Toyota Corolla 16000 9600
2 51 Toyota Prius 24200 14520
3 19 Honda Civic 18190 10914
4 31 Honda Accord 22105 13263

These tables are related. There's a column named "model_id" in the sales_table and a "model_id" in the car_table - but the column names don't need to be the same, what's important is that the values in the sales_table's model_id column correspond to the values in the car_table's model_id column.

You can join these tables by using these columns as keys.


In [29]:
run('''
    SELECT
        *
    FROM
        sales_table
        JOIN car_table ON sales_table.model_id = car_table.model_id
    LIMIT 10
    ''')


Out[29]:
id model_id customer_id revenue payment_type salesman_id date model_id make model sticker_price cogs
0 54858 36 237906 21222 finance 276 1/7/2014 36 Toyota Tundra 26200 15720
1 43161 20 967016 19140 finance 225 1/26/2014 20 Toyota Camry 22000 13200
2 40112 46 819010 14720 cash 147 1/17/2014 46 Toyota Corolla 16000 9600
3 92495 31 633030 19010 finance 215 1/13/2014 31 Honda Accord 22105 13263
4 78000 51 341877 22022 finance 803 1/12/2014 51 Toyota Prius 24200 14520
5 13154 75 720210 21409 cash 215 1/20/2014 75 Subaru Outback 24895 14937
6 36535 31 908558 19894 finance 215 1/30/2014 31 Honda Accord 22105 13263
7 22813 46 705508 12960 finance 813 1/29/2014 46 Toyota Corolla 16000 9600
8 56245 36 248621 25938 cash 492 1/19/2014 36 Toyota Tundra 26200 15720
9 88118 51 341344 19844 finance 215 1/23/2014 51 Toyota Prius 24200 14520

Write a query to join the cust_table to the sales_table, using the customer_id columns in both tables as the key:


In [30]:
run('''
    SELECT NULL
    ''') 
#print(join_cheat1)


Out[30]:
NULL
0

Rewrite the query from above, but instead of selecting all columns, specify just the customer gender and the revenue:


In [31]:
run('''
    SELECT NULL
    ''')
#print(join_cheat2)


Out[31]:
NULL
0

Rewrite the query from above, but this time select the customer_id, gender, and revenue:

  • You'll probably hit an error at first. Try to use what you've learned about this structure "table_x.column_a" to fix the issue. Why do you think you need to use this?

In [32]:
run('''
    SELECT NULL
    ''')
#print(join_cheat3)


Out[32]:
NULL
0

A column with the name customer_id appears in both the cust_table and the sales_table. SQL doesn't know which one you want to see. You have to tell it from which table you want the customer_id.

This can be important when columns in different tables have the same names but totally unrelated values.

Look at the sales_table again:


In [33]:
run('''
    SELECT
        *
    FROM
        sales_table
    LIMIT 5
    ''')


Out[33]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 276 1/7/2014
1 43161 20 967016 19140 finance 225 1/26/2014
2 40112 46 819010 14720 cash 147 1/17/2014
3 92495 31 633030 19010 finance 215 1/13/2014
4 78000 51 341877 22022 finance 803 1/12/2014

Above, there's a column called "id".

Now look at the salesman_table again:


In [34]:
run('''
    SELECT
        *
    FROM
        salesman_table
    LIMIT 5
    ''')


Out[34]:
id first_name last_name
0 949 Samantha Douglas
1 215 Jared Case
2 813 Michael Hill
3 680 Claudine Hatch
4 276 Joseph Seney

There's a column named "id" in the salesman_table too. However, it doesn't look like those IDs correspond to the sales_table IDs. In fact, it's the salesman_id column in the sales_table that corresponds to the id column in the salesman_table. More often than not, your tables will use different names for corresponding columns, and will have columns with identical names that don't correspond at all.

Write a query to join the salesman_table with the sales_table (select all columns using an asterisk)


In [35]:
run('''
    SELECT NULL
    ''') 
#print(join_cheat4)


Out[35]:
NULL
0

Practice applying this "table_x.column_a" format to all columns in the SELECT clause when you are joining multiple tables, since multiple tables frequenty use the same column names even when they don't correspond.

It's common to use single-letter aliases for tables to make queries shorter. Take a look at the query below and make sure you understand what's going on with the table aliases. It's the same query that you wrote earlier, but with aliases to help identify the columns


In [36]:
run('''
    SELECT
        S.customer_id,
        C.gender,
        S.revenue
    FROM
        sales_table AS S
        JOIN cust_table AS C on S.customer_id = C.customer_id
    ''')


Out[36]:
customer_id gender revenue
0 237906 female 21222
1 967016 male 19140
2 819010 male 14720
3 633030 male 19010
4 341877 female 22022
5 720210 male 21409
6 908558 male 19894
7 705508 female 12960
8 248621 male 25938
9 341344 female 19844
10 733566 female 21441
11 750195 female 17462
12 461723 male 17020
13 468665 female 18040
14 556188 female 18671
15 241759 male 22506
16 161369 female 20460
17 731692 female 13440
18 656750 male 18634
19 619020 female 17312
20 413891 male 23318
21 965672 male 21780
22 217720 female 19418
23 946265 male 17756
24 140795 female 21441
25 145479 male 17905
26 961650 female 16720
27 457742 male 20778
28 978159 male 19667
29 474218 male 19360
... ... ... ...
70 253239 male 20174
71 614301 female 18480
72 168495 male 23474
73 699431 female 19231
74 528354 female 18865
75 751098 male 24104
76 480153 female 20436
77 208677 male 22990
78 349494 female 19580
79 206310 female 17756
80 185223 male 18920
81 656181 female 24646
82 393613 female 14006
83 488910 female 24890
84 846630 male 23401
85 534633 male 20328
86 909110 female 15200
87 628031 male 16189
88 743097 female 15040
89 476047 male 20197
90 862765 male 23232
91 597655 male 16734
92 486534 male 20960
93 339830 male 24148
94 283797 male 19140
95 635204 female 21751
96 619016 female 20413
97 183947 male 20328
98 731677 male 17600
99 907549 female 19894

100 rows × 3 columns

Join the sales_table (assign it the alias S) and salesman_table (alias SM) again.

  • Select the id and salesman_id column from the sales_table
  • Also, select the id column from the salesman_table
  • Optional: assign aliases to the columns in the SELECT clause to make the result-set easier to read

In [37]:
run('''
    SELECT NULL
    ''') 
#print(join_cheat5)


Out[37]:
NULL
0

Different Types of Joins

There are different types of joins you can do according to your needs. Here's a helpful way to visualize your options: http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

However, not all types of joins are compatible with SQLite and MySQL. The table below breaks down compatibility:


In [38]:
join_differences


Out[38]:
Microsoft SQL Server MySQL Oracle SQLite
JOIN or INNER JOIN
LEFT JOIN or LEFT OUTER JOIN
RIGHT JOIN or RIGHT OUTER JOIN not supported
OUTER JOIN or FULL OUTER JOIN not supported not supported

So far, we've just done a simple join, also called an "inner join". To illustrate different types of joins, we're going to use a different "database" for the following lesson. First, let's take a look at each one:


In [39]:
run('''
    SELECT 
        *
    FROM
        Dog_Table
    ''')


Out[39]:
Owner_Name Dog_Name
0 Michael Martinez Chewbacca
1 Gilbert Henkel Max
2 May Reeves Georgie
3 Elizabeth Minier Sunny
4 Donna Ona Daisy
5 Daniel Abner Jake
6 Keith Bursey Muffin
7 Jane Garcia Admiral
8 Sarah Flores Bogie
9 David Brace Fritz

In [40]:
run('''
    SELECT 
        *
    FROM
        Cat_Table
    ''')


Out[40]:
Owner_Name Cat_Name
0 Michael Martinez Mrs. Paws
1 Gilbert Henkel Mittens
2 May Reeves Clover
3 Elizabeth Minier Sweetcakes
4 Donna Ona Barnie
5 Monika Turner Cara
6 Adrian Hardiman Annabelle
7 Loyd Mossman Garfield
8 Heather Emery Engine
9 Lauren Larson Midnight

Notice that the Owner_Name columns on each table have some corresponding values (Michael, Gilbert, May and Elizabeth and Donna are in both tables), but they both also have values that don't overlap.

JOINS or INNER JOINS

SELECT
    *
FROM 
    table_x X
    JOIN table_y Y ON X.column_a = Y.column_a      # Returns rows when values match on both tables.

This is what we used in the initial example. Simple joins, (also called Inner Joins) will combine tables only where there are corresponding values on both tables.

Write a query below to join the Cat_Table and Dog_Table using the same method we've used before:


In [41]:
run('''
    SELECT NULL
    ''')
#print(inner_join_cheat)


Out[41]:
NULL
0

Notice that the result-set only includes the names that are in both tables. Think of inner joins as being the overlapping parts of a Venn Diagram. So, essentially we're looking at results only where the pet owner has both a cat and a dog.


LEFT JOINS or LEFT OUTER JOINS

SELECT
    *
FROM 
    table_x X
    LEFT JOIN table_y Y ON X.column_a = Y.column_a   # Returns all rows from 1st table, rows that match from 2nd

  • LEFT JOINS will return all rows from the first table, but only rows from the second table if a value matches on the key column.

Rewrite your query from above, but instead of "JOIN", write "LEFT JOIN":


In [42]:
run('''
    SELECT NULL
    ''')
#print(left_join_cheat)


Out[42]:
NULL
0

This time, you're seeing everything from the Dog_Table, but only results from the Cat_Table IF the owner also has a dog.


OUTER JOINS or FULL OUTER JOINS:

SELECT
    *
FROM 
    table_x X
    OUTER JOIN table_y Y ON X.column_a = Y.column_a  # Returns all rows, regardless of whether values match

For now, this query won't totally make sense, just pay attention to the results so you can visualize an outer join:


In [43]:
run('''
    SELECT
        C.Owner_Name, 
        Cat_Name, 
        Dog_Name
    FROM
        Cat_Table C 
        LEFT JOIN Dog_Table D ON D.Owner_Name = C.Owner_Name

    UNION ALL
    
    SELECT
      D.Owner_Name, 
      ' ', 
      Dog_Name  
    FROM
        Dog_Table D 
    WHERE 
        Owner_Name NOT IN (SELECT Owner_Name from Cat_Table)
    ''')


Out[43]:
Owner_Name Cat_Name Dog_Name
0 Michael Martinez Mrs. Paws Chewbacca
1 Gilbert Henkel Mittens Max
2 May Reeves Clover Georgie
3 Elizabeth Minier Sweetcakes Sunny
4 Donna Ona Barnie Daisy
5 Monika Turner Cara
6 Adrian Hardiman Annabelle
7 Loyd Mossman Garfield
8 Heather Emery Engine
9 Lauren Larson Midnight
10 Daniel Abner Jake
11 Keith Bursey Muffin
12 Jane Garcia Admiral
13 Sarah Flores Bogie
14 David Brace Fritz

Essentially, in Venn Diagram terms, and outer join lets you see all contents of both circles. This join will let you see all pet owners, regardless of whether the own only a cat or only a dog


Using the "WHERE" Clause to Join Tables

SELECT
    *
FROM
    table_x X
    JOIN table y Y
WHERE
    X.column_a = Y.column_a         # tells SQL the key for the join

  • Some people prefer to use the WHERE clause to specify the key for a join
  • Fine if the query is short, but SUPER messy when the query is complex
  • We won't use this moving forward, but it's good to see it in case you run across someone else's code and you need to make sense of it

When it's simple, it's not so bad:


In [44]:
run('''
    SELECT
        C.model, 
        S.revenue
    FROM
        sales_table S, car_table C 
    WHERE
        S.model_id = C.model_id
    LIMIT 5
    ''')


Out[44]:
model revenue
0 Tundra 21222
1 Camry 19140
2 Corolla 14720
3 Accord 19010
4 Prius 22022

When the query is longer, this method is messy. Suddenly it's harder to parse out which parts of the "WHERE" clause are actual filters, and which parts are just facilitating the join.

Note that we've covered all of these clauses and expressions by now, try to parse out what's going on:


In [45]:
run('''
    SELECT
        C.make,
        C.model, 
        S.revenue,
        CUST.gender,
        SM.first_name
    FROM
        sales_table S
        JOIN car_table C
        JOIN salesman_table SM
        JOIN cust_table CUST
    WHERE
        S.customer_id = CUST.customer_id
        AND S.model_id = C.model_id
        AND S.salesman_id = SM.id
        AND (C.model in ('Tundra', 'Camry', 'Corolla') OR C.make = 'Subaru')
        AND S.revenue between 17000 and 22000
        AND CUST.gender = 'female' 
        AND SM.first_name NOT IN ('Kathleen', 'Samantha')
    LIMIT 5
    ''')


Out[45]:
make model revenue gender first_name
0 Toyota Camry 20460 female Jared
1 Subaru Forester 17756 female Joseph
2 Subaru Forester 17090 female Claudine
3 Subaru Outback 19418 female Claudine
4 Toyota Tundra 21222 female Joseph





OPERATORS

ADDING / SUBSTRACTING / MULTIPLYING / DIVIDING

SELECT
    column_a + column_b     # adds the values in column_a to the values in columns_b
FROM
    table_name

Use the standard formats for add, subtract, mutiply, and divide: + - * /

The query below subtracts cogs (from the car_table) from revenue (from the sales_table) to show us the gross_profit per transaction


In [46]:
run('''
    SELECT
        S.id,
        C.model, 
        S.revenue,
        C.cogs,
        S.revenue - C.cogs AS gross_profit
    FROM
        sales_table S 
        JOIN car_table C on S.model_id = C.model_id
    LIMIT 5
    ''')


Out[46]:
id model revenue cogs gross_profit
0 54858 Tundra 21222 15720 5502
1 43161 Camry 19140 13200 5940
2 40112 Corolla 14720 9600 5120
3 92495 Accord 19010 13263 5747
4 78000 Prius 22022 14520 7502

Rewrite the query above to return gross margin instead of gross profit. Rename the alias as well. Limit it to 5 results


In [47]:
run('''
    SELECT NULL
    ''') 
#print(operator_cheat)


Out[47]:
NULL
0

CONCATENATING:

Concatenating varies by RDBMS:


In [48]:
concat_differences


Out[48]:
Microsoft SQL Server MySQL Oracle SQLite
Concatenating CONCAT(column_a, column_b) or + CONCAT(column_a, column_b) CONCAT(column_a, column_b) or || ||

Here we'll use SQLite and use the concatenating operator || to combine words/values in different columns:


In [49]:
run('''
    SELECT
        last_name, 
        first_name,
        last_name || ', ' || first_name AS full_name
    FROM
        salesman_table
    ''')


Out[49]:
last_name first_name full_name
0 Douglas Samantha Douglas, Samantha
1 Case Jared Case, Jared
2 Hill Michael Hill, Michael
3 Hatch Claudine Hatch, Claudine
4 Seney Joseph Seney, Joseph
5 March Kathleen March, Kathleen
6 Self Rosemarie Self, Rosemarie
7 Avellaneda Justin Avellaneda, Justin
8 Elzy Elton Elzy, Elton
9 Luna Matthew Luna, Matthew

Use || to pull the make and model from the car_table and make it appear in this format: "Model (Make)"

  • give it an alias to clean up the column header, otherwise it'll look pretty messy

In [50]:
run('''
    SELECT NULL
    ''') 
#print(concat_cheat)


Out[50]:
NULL
0





FUNCTIONS:

SELECT
    SUM(column_a),             # sums up the values in column_a
    AVG(column_a),             # averages the values in column_a
    ROUND(AVG(column_a), 2),   # rounds the averaged values in column_a to 2 digits
    COUNT(column_a),           # counts the number of rows in column_a  
    MAX(column_a),             # returns the maximum value in column_a
    MIN(column_a),             # returns the minimum value in column_a
    GROUP_CONCAT(column_a)     # returns a comma separated list of all values in column_a
FROM
    table_name

  • Functions can be applied to columns to help analyze data
  • You can find more than just these basic few in the link below, or just Google what you're looking to do - there's a lot of help available on forums
  • More on functions: http://www.w3schools.com/sql/sql_functions.asp

The function below will sum up everything in the revenue column. Note that now we only get one row:


In [51]:
run('''
    SELECT
        SUM(revenue) AS Total_Revenue
    FROM 
        sales_table
    ''')


Out[51]:
Total_Revenue
0 1941252

Rewrite the query to return the average cost of goods for a car in the car table. Try rounding it to cents.

  • If you can't remember the name of the column for cost of goods in the car_table, remember you can use "SELECT * FROM car_table LIMIT 1" to see the first row of all columns, or you can use "PRAGMA TABLE_INFO(car_table)"

In [52]:
run('''
    SELECT NULL
    ''')
#print(avg_cheat)


Out[52]:
NULL
0

Using COUNT(*) will return the number of rows in any given table. Rewrite the query to return the number of rows in the car_table:

  • After you've run the query, try changing it by adding "WHERE make = 'Subaru'" and see what happens

In [53]:
run('''
    SELECT NULL
    ''')
#print(count_cheat)


Out[53]:
NULL
0

You can apply functions on top of other operators. Below is the sum of gross profits:


In [54]:
run('''
    SELECT
        '$ ' || SUM(S.revenue - C.cogs) total_gross_profit
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    ''')


Out[54]:
total_gross_profit
0 $ 596814.0

Write a query to show the average difference between the sticker_price (in car_table) and the revenue.

  • If you want a challenge, try to join cust_table and limit the query to only look at transactions where the customer's age is over 35

In [55]:
run('''
    SELECT NULL
    ''') 
#print(avg_cheat2)


Out[55]:
NULL
0

GROUP_CONCAT

SELECT
    GROUP_CONCAT(column_a, '[some character separating items]') 
FROM
    table_x

This function is useful to return comma-separated lists of the values in a column


In [56]:
run('''
    SELECT 
        GROUP_CONCAT(model, ', ') as Car_Models
    FROM
        car_table
    ''')


Out[56]:
Car_Models
0 Camry, Corolla, Prius, Civic, Accord, Outback,...

Use GROUP_CONCAT to return a comma-separated list of last names from the salesman_table:


In [57]:
run('''
    SELECT NULL
''')
#print(concat_cheat)


Out[57]:
NULL
0





GROUP BY:

SELECT
    column_a,                      
    SUM(column_b)    # sums up the values in column_b
FROM
    table_name
GROUP BY             # creates one group for each unique value in column_a
    column_a


The query below creates a group for each unique value in the car_table's model column, then sums up the revenue for each group. Note that you can use an alias in the GROUP BY clause.


In [58]:
run('''
    SELECT
        C.model AS Car_Model, 
        SUM(revenue) AS Total_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id
    GROUP BY
        Car_Model
    ''')


Out[58]:
Car_Model Total_Revenue
0 Accord 234748
1 Camry 212080
2 Civic 157886
3 Corolla 154880
4 Forester 201083
5 Outback 317406
6 Prius 297418
7 Tundra 365751

Rewrite the query above to return the average gross profit (revenue - cogs) per make (remember that "make" is in the car_table)

Extra things to try:

  • Round average revenue to two decimal points
  • Order the results by gross profit in descending order
  • Rename the make column as "Car_Maker" and use the alias in the GROUP BY clause
  • Rename gross profit column as "Avg_Gross_Profit" and use the alias in the ORDER BY clause
  • Join the salesman_table and filter results to only look at revenue where first_name is Michael
    • After you've gotten the query to run with all of these adjustments, think about the risks involved with adding something in the WHERE clause that doesn't show up in the SELECT clause. Think about a potential solution to these risks.

In [59]:
run('''
    SELECT NULL
    ''')
#print(group_cheat)


Out[59]:
NULL
0

Write a query to make a comma-separated list of models for each car maker:


In [60]:
run('''
    SELECT NULL
    ''')
#print(group_cheat1)


Out[60]:
NULL
0

GROUP BY, when used with joins and functions, can help you quickly see trends in your data. Parse out what's going on here:


In [61]:
run('''
    SELECT
        C.model AS Car_Model, 
        MIN(S.revenue) || ' - ' || MAX(S.revenue) AS Min_to_Max_Sale,
        MAX(S.revenue) - MIN(S.revenue) AS Range,
        ROUND(AVG(S.revenue), 2) AS Average_Sale
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    GROUP BY
        Car_Model
    ORDER BY
        Average_Sale DESC
    ''')


Out[61]:
Car_Model Min_to_Max_Sale Range Average_Sale
0 Tundra 20174 - 25938 5764 22859.44
1 Prius 18150 - 23958 5808 21244.14
2 Outback 18671 - 24646 5975 21160.40
3 Accord 17020 - 21441 4421 19562.33
4 Camry 16720 - 21780 5060 19280.00
5 Forester 16646 - 21751 5105 18280.27
6 Civic 13642 - 18008 4366 15788.60
7 Corolla 12640 - 15360 2720 14080.00

You can also use GROUP BY with multiple columns to segment out the results further:


In [62]:
run('''
    SELECT
        C.make AS car_caker, 
        payment_type,
        ROUND(AVG(revenue)) as avg_revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    GROUP BY
        C.Make, 
        payment_type
    ''')


Out[62]:
car_caker payment_type avg_revenue
0 Honda cash 17028
1 Honda finance 19030
2 Subaru cash 19581
3 Subaru finance 20363
4 Toyota cash 20437
5 Toyota finance 19183

Rewrite the query to find the total revenue grouped by each salesperson's first_name and by the customer's gender (gender column in cust_table)

  • For an extra challenge, use the concatenating operator to use the salesperson's full name instead
  • Add COUNT(S.id) to the SELECT clause to see the number of transactions in each group

In [63]:
run('''
    SELECT NULL
    ''')
#print(group_cheat2)


Out[63]:
NULL
0





"HAVING" in GROUP BY statements:

SELECT
    column_a,
    SUM(column_b) AS alias_b
FROM
    table_name
GROUP BY
    column_a HAVING alias_b > x   # only includes groups in column_a when the sum of column_b is greater than x 

The query below will sum up all the revenue for each car maker, but it will only show you results for car maker's whose total revenue is greater than 50,000:


In [64]:
run('''
    SELECT
        C.Make as Car_Maker, 
        SUM(revenue) as Total_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id
    GROUP BY
        Car_Maker HAVING Total_Revenue > 500000
    ''')


Out[64]:
Car_Maker Total_Revenue
0 Subaru 518489
1 Toyota 1030129

Rewrite the query above to look at average revenue per model, and using HAVING to filter your result-set to only include models whose average revenue is less than 18,000:


In [65]:
run('''
    SELECT NULL
    ''')
#print(having_cheat)


Out[65]:
NULL
0

HAVING vs WHERE:

WHERE filters which rows will be included in the function, whereas HAVING filters what's returned after the function has been applied.

Take a look at the query below. It might look like the query you just wrote (above) if you'd tried to use WHERE instead of HAVING:

SELECT 
    C.model as Car_Model,
    AVG(S.revenue) as Avg_Revenue
FROM
    sales_table S
    JOIN car_table C on S.model_id = C.model_id      
WHERE 
    S.revenue < 18000
GROUP BY
    Car_Model

  1. Find the sales_table and join it to the car_table
  2. Pull the data from the 'model' column in car_table and 'revenue' column in sales_table
  3. Filter out all rows where revenue is less than 18000
  4. Average remaining rows for each Car_Model

Even though AVG( ) appears early in the query, it's not actually being applied until after the WHERE statement has filtered out rows with less than 18,000 in revenue.

This is the result:


In [66]:
run('''
    SELECT 
        C.model as Car_Model,
        AVG(S.revenue) as Avg_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id      
    WHERE 
        S.revenue < 18000
    GROUP BY
        Car_Model
    ''')


Out[66]:
Car_Model Avg_Revenue
0 Accord 17462.333333
1 Camry 17160.000000
2 Civic 15542.000000
3 Corolla 14080.000000
4 Forester 17438.714286

All model_ids are returned, but the averages are all much lower than they should be. That's because the query first drops all rows that have revenue greater than 18000, and then averages the remaining rows.

When you use HAVING, SQL follows these steps instead (this query should look like the one you wrote in the last challenge):

SELECT
    C.model as Car_Model,
    AVG(S.revenue) as Avg_Revenue
FROM
    sales_table S
    JOIN car_table C on S.model_id = C.model_id
GROUP BY
    Car_Model HAVING Avg_Revenue < 18000

  1. Find the sales_table and join it to the car_table (same as before)
  2. Pull the data from the 'model' column in car_table and 'revenue' column in sales_table (same as before)
  3. Average the rows for each Car_Model
  4. Return only the Car_Models whose averages are less than 18,000

And as you can see, there's a big difference in these results and the results of the query that used "WHERE" instead of HAVING:


In [67]:
run('''
    SELECT
        C.model as Car_Model,
        AVG(S.revenue) as Avg_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id
    GROUP BY
        Car_Model HAVING Avg_Revenue < 18000
    ''')


Out[67]:
Car_Model Avg_Revenue
0 Civic 15788.6
1 Corolla 14080.0

HAVING & WHERE in the same query:

  • Sometimes, you will want to use WHERE and HAVING in the same query
  • Just be aware of the order of the steps that SQL takes
  • Rule of thumb: if you're applying a function to a column, you probably don't want that column in there WHERE clause

This query is only looking at Toyotas whose revenue is less than 18,000, using WHERE to limit the results to Toyotas, and HAVING to limit the results by revenue:


In [68]:
run('''
    SELECT
        C.model as Car_Model,
        AVG(S.revenue) as Avg_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id
    WHERE
        C.make = 'Toyota'
    GROUP BY
        Car_Model HAVING Avg_Revenue < 18000
    ''')


Out[68]:
Car_Model Avg_Revenue
0 Corolla 14080

Write a query with the following criteria:

  • SELECT clause:
    • salesman's last name and average revenue, rounded to the nearest cent
  • FROM clause:
    • sales_table joined with the salesman_table and the cust_table
  • WHERE clause:
    • only female customers
  • GROUP BY clause:
    • only salespeople whose average revenue was greater than 20,000

So, in plain English, we want to see salespeople whose average revenue for female customers is greater than 20,000


In [69]:
run('''
    SELECT NULL
    ''')
#print(having_where_cheat)


Out[69]:
NULL
0





ROLLUP

SELECT
    column_a,
    SUM(column_b)
FROM
    table_x
GROUP BY
    ROLLUP(column_a)     # adds up all groups' values in a single final row     


  • Rollup, used with GROUP BY, provides subtotals and totals for your groups
  • Useful for quick analysis
  • Varies by RDBMS

In [70]:
rollup_differences


Out[70]:
Microsoft SQL Server MySQL Oracle SQLite
ROLLUP GROUP BY column_a WITH ROLLUP GROUP BY column_a WITH ROLLUP GROUP BY ROLLUP (column_a) not supported

Because SQLite doesn't support ROLLUP, the query below is just intended to illustrate how ROLLUP would work. Don't worry about understanding the query itself, just get familiar with what's going on in the result-set:


In [71]:
run('''
    SELECT
        C.model AS Car_Model, 
        SUM(S.revenue) as Sum_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    GROUP BY C.model
    
    UNION ALL
    
    SELECT 
        'NULL', 
        SUM(S.revenue)
    FROM 
        sales_table S
''')


Out[71]:
Car_Model Sum_Revenue
0 Accord 234748
1 Camry 212080
2 Civic 157886
3 Corolla 154880
4 Forester 201083
5 Outback 317406
6 Prius 297418
7 Tundra 365751
8 NULL 1941252





Conditional Expressions: IF & CASE WHEN

SELECT
    CASE WHEN column_a = x THEN some_value
         WHEN column_a = y THEN some_value2
         ELSE some_other_value
         END some_alias                         # alias optional after END
FROM
    table_name


In [72]:
conditional_differences


Out[72]:
Microsoft SQL Server MySQL Oracle SQLite
IF IF condition PRINT value_if_true... IF(condition, value_if_true, value_if_false) IF condition THEN value_if_true ELSIF...END IF not supported
CASE WHEN

Starting with a simple example, here we'll use CASE WHEN to create a new column on the sales_table:


In [73]:
run('''
    SELECT
        revenue, 
        CASE WHEN revenue > 20000 THEN 'Revenue is more than 20,000' 
            END Conditional_Column
    FROM
        sales_table
    LIMIT 10
    ''')


Out[73]:
revenue Conditional_Column
0 21222 Revenue is more than 20,000
1 19140
2 14720
3 19010
4 22022 Revenue is more than 20,000
5 21409 Revenue is more than 20,000
6 19894
7 12960
8 25938 Revenue is more than 20,000
9 19844

CASE WHEN gives you the value "Revenue is more MORE 20,000" when revenue in that same row is greater than 20,000. Otherwise, it has no value.

Now let's add a level:


In [74]:
run('''
    SELECT
        revenue, 
        CASE WHEN revenue > 20000 THEN 'Revenue is MORE than 20,000' 
             WHEN revenue < 15000 THEN 'Revenue is LESS than 15,000'
             END Conditional_Column
    FROM
        sales_table
    LIMIT 10
    ''')


Out[74]:
revenue Conditional_Column
0 21222 Revenue is MORE than 20,000
1 19140
2 14720 Revenue is LESS than 15,000
3 19010
4 22022 Revenue is MORE than 20,000
5 21409 Revenue is MORE than 20,000
6 19894
7 12960 Revenue is LESS than 15,000
8 25938 Revenue is MORE than 20,000
9 19844

Now to deal with the blank spaces. You can assign an "ELSE" value to catch anything that's not included in the prior expressions:


In [75]:
run('''
    SELECT
        revenue,
        CASE WHEN revenue > 20000 THEN 'Revenue is MORE than 20,000' 
             WHEN revenue < 15000 THEN 'Revenue is LESS than 15,000'
             ELSE 'NEITHER'
             END Conditional_Column
    FROM
        sales_table
    LIMIT 10
    ''')


Out[75]:
revenue Conditional_Column
0 21222 Revenue is MORE than 20,000
1 19140 NEITHER
2 14720 Revenue is LESS than 15,000
3 19010 NEITHER
4 22022 Revenue is MORE than 20,000
5 21409 Revenue is MORE than 20,000
6 19894 NEITHER
7 12960 Revenue is LESS than 15,000
8 25938 Revenue is MORE than 20,000
9 19844 NEITHER

You can use values from another column as well. Remember this query from the GROUP BY lesson? It's often helpful to look at information broken out by multiple groups, but it's not especially easy to digest:


In [76]:
run('''
    SELECT
        C.Make as car_maker, 
        payment_type,
        ROUND(AVG(S.revenue)) as avg_revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    GROUP BY
        C.Make, 
        payment_type
    ''')


Out[76]:
car_maker payment_type avg_revenue
0 Honda cash 17028
1 Honda finance 19030
2 Subaru cash 19581
3 Subaru finance 20363
4 Toyota cash 20437
5 Toyota finance 19183

Look at what's going on in that query without the AVG( ) function and the GROUP BY clause:


In [77]:
run('''
    SELECT
        C.Make as Car_Maker, 
        payment_type,
        S.revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    ''')


Out[77]:
Car_Maker payment_type revenue
0 Toyota finance 21222
1 Toyota finance 19140
2 Toyota cash 14720
3 Honda finance 19010
4 Toyota finance 22022
5 Subaru cash 21409
6 Honda finance 19894
7 Toyota finance 12960
8 Toyota cash 25938
9 Toyota finance 19844
10 Honda finance 21441
11 Honda cash 17462
12 Honda cash 17020
13 Toyota finance 18040
14 Subaru finance 18671
15 Toyota finance 22506
16 Toyota finance 20460
17 Toyota finance 13440
18 Toyota finance 18634
19 Subaru cash 17312
20 Toyota cash 23318
21 Toyota cash 21780
22 Subaru cash 19418
23 Subaru cash 17756
24 Honda finance 21441
25 Honda cash 17905
26 Toyota finance 16720
27 Honda cash 20778
28 Subaru cash 19667
29 Toyota finance 19360
... ... ... ...
70 Toyota finance 20174
71 Toyota cash 18480
72 Toyota cash 23474
73 Honda cash 19231
74 Subaru cash 18865
75 Toyota cash 24104
76 Toyota finance 20436
77 Toyota cash 22990
78 Toyota cash 19580
79 Subaru finance 17756
80 Subaru cash 18920
81 Subaru finance 24646
82 Honda cash 14006
83 Toyota finance 24890
84 Subaru finance 23401
85 Toyota finance 20328
86 Toyota cash 15200
87 Honda cash 16189
88 Toyota cash 15040
89 Subaru finance 20197
90 Toyota finance 23232
91 Honda cash 16734
92 Toyota cash 20960
93 Subaru cash 24148
94 Toyota cash 19140
95 Subaru finance 21751
96 Subaru cash 20413
97 Toyota finance 20328
98 Toyota finance 17600
99 Honda finance 19894

100 rows × 3 columns

The result-set above is essentially what SQL is working with right before it separates the rows into groups and averages the revenue within those groups.

Now, we're going to use some CASE WHEN statements to change this a little:


In [78]:
run('''
    SELECT
        C.Make as Car_Maker, 
        payment_type,
        CASE WHEN payment_type = 'cash' THEN S.revenue END Cash_Revenue,
        CASE WHEN payment_type = 'finance' THEN S.revenue END Finance_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    ''')


Out[78]:
Car_Maker payment_type Cash_Revenue Finance_Revenue
0 Toyota finance 21222
1 Toyota finance 19140
2 Toyota cash 14720
3 Honda finance 19010
4 Toyota finance 22022
5 Subaru cash 21409
6 Honda finance 19894
7 Toyota finance 12960
8 Toyota cash 25938
9 Toyota finance 19844
10 Honda finance 21441
11 Honda cash 17462
12 Honda cash 17020
13 Toyota finance 18040
14 Subaru finance 18671
15 Toyota finance 22506
16 Toyota finance 20460
17 Toyota finance 13440
18 Toyota finance 18634
19 Subaru cash 17312
20 Toyota cash 23318
21 Toyota cash 21780
22 Subaru cash 19418
23 Subaru cash 17756
24 Honda finance 21441
25 Honda cash 17905
26 Toyota finance 16720
27 Honda cash 20778
28 Subaru cash 19667
29 Toyota finance 19360
... ... ... ... ...
70 Toyota finance 20174
71 Toyota cash 18480
72 Toyota cash 23474
73 Honda cash 19231
74 Subaru cash 18865
75 Toyota cash 24104
76 Toyota finance 20436
77 Toyota cash 22990
78 Toyota cash 19580
79 Subaru finance 17756
80 Subaru cash 18920
81 Subaru finance 24646
82 Honda cash 14006
83 Toyota finance 24890
84 Subaru finance 23401
85 Toyota finance 20328
86 Toyota cash 15200
87 Honda cash 16189
88 Toyota cash 15040
89 Subaru finance 20197
90 Toyota finance 23232
91 Honda cash 16734
92 Toyota cash 20960
93 Subaru cash 24148
94 Toyota cash 19140
95 Subaru finance 21751
96 Subaru cash 20413
97 Toyota finance 20328
98 Toyota finance 17600
99 Honda finance 19894

100 rows × 4 columns

Now let's add back the ROUND() and AVG() functions and the GROUP BY statement:


In [79]:
run('''
    SELECT
        C.Make as Car_Maker, 
        ROUND(AVG(CASE WHEN payment_type = 'cash' THEN S.revenue END)) AS Avg_Cash_Revenue,
        ROUND(AVG(CASE WHEN payment_type = 'finance' THEN S.revenue END)) AS Avg_Finance_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    GROUP BY
        C.Make
    ''')


Out[79]:
Car_Maker Avg_Cash_Revenue Avg_Finance_Revenue
0 Honda 17028 19030
1 Subaru 19581 20363
2 Toyota 20437 19183

CASE WHEN makes this same information a lot easier to read by letting you pivot the result set a little.

Write a query using CASE WHEN to look at total revenue per gender, grouped by each car model


In [80]:
run('''
    SELECT NULL
    ''')
#print(case_cheat)


Out[80]:
NULL
0

CASE WHEN also lets you create new groups. Start by looking at the cust_table grouped by age - remember that COUNT(***) tells you how many rows are in each group (which is the same as telling you the number of customers in each group):


In [81]:
run('''
    SELECT
        age,
        COUNT(*) customers
    FROM
        cust_table
    GROUP BY
        age
    ''')


Out[81]:
age customers
0 18 4
1 19 2
2 20 2
3 21 2
4 22 2
5 23 1
6 24 3
7 25 1
8 26 1
9 27 3
10 28 3
11 29 4
12 30 1
13 31 5
14 32 3
15 33 6
16 34 2
17 36 3
18 38 4
19 39 1
20 40 3
21 41 4
22 42 5
23 43 2
24 44 2
25 45 3
26 46 4
27 47 1
28 48 2
29 49 1
30 50 1
31 51 1
32 52 3
33 53 1
34 54 2
35 55 1
36 56 3
37 57 3
38 58 4
39 59 1

When you want to segment your results, but there are too many different values for GROUP BY to be helpful, use CASE WHEN to make your own groups. GROUP BY the column you created with CASE WHEN to look at your newly created segments.


In [82]:
run('''
    SELECT
        CASE WHEN age BETWEEN 18 AND 24 THEN '18-24 years'
             WHEN age BETWEEN 25 AND 34 THEN '25-34 years'
             WHEN age BETWEEN 35 AND 44 THEN '35-45 years'
             WHEN age BETWEEN 45 AND 54 THEN '45-54 years'
             WHEN age BETWEEN 55 AND 64 THEN '55-64 years'
             END Age_Group,
        COUNT(*) as Customers
    FROM
        cust_table
    GROUP BY
        Age_Group
    ''')


Out[82]:
Age_Group Customers
0 18-24 years 16
1 25-34 years 29
2 35-45 years 24
3 45-54 years 19
4 55-64 years 12

Ta-DA! Useful customer segments!

Try to break up the "Customers" column into 2 columns - one for male and one for female. Keep the age segments intact.

  • Note that COUNT(***) cannot be wrapped around a CASE WHEN expression the way that other functions can. Try to think of a different way to get a count.
  • Extra challenge: try to express male and female customers as a percentage of the total for each group, rounded to 2 decimal points

In [83]:
run('''
    SELECT NULL
    ''')
#print(case_cheat2)


Out[83]:
NULL
0





NESTING

  • Nested queries allow you to put a query within a query
  • Depending on your needs, you might put a nested query in the SELECT clause, the FROM clause, or the WHERE clause

Consider the following query. We're using a nested query in the SELECT clause to see the sum of all revenue in the sales_table, and then using it again to what percentage of total revenue can be attributed to each Car_Model.


In [84]:
run('''
    SELECT
        C.model AS Car_Model,
        SUM(S.revenue) AS Revenue_Per_Model,
        (SELECT SUM(revenue) FROM sales_table) AS Total_Revenue,
        SUM(S.revenue) / (SELECT SUM(revenue) FROM sales_table) AS Contribution_to_Revenue
    FROM
        sales_table S
        JOIN car_table C ON C.model_id = S.model_id
    GROUP BY
        Car_Model
    ''')


Out[84]:
Car_Model Revenue_Per_Model Total_Revenue Contribution_to_Revenue
0 Accord 234748 1941252 0
1 Camry 212080 1941252 0
2 Civic 157886 1941252 0
3 Corolla 154880 1941252 0
4 Forester 201083 1941252 0
5 Outback 317406 1941252 0
6 Prius 297418 1941252 0
7 Tundra 365751 1941252 0

Write a query to look at the model name and COGs for each car in car_table, then use a nested query to also look at the average COGs off all car models in a third column

  • Extra Challenge: add a fourth colum using another nested query to return the difference between each car model's COGs and the average COGs

In [85]:
run('''
    SELECT NULL
    ''')
#print(nest_cheat1)


Out[85]:
NULL
0





UNION & UNION ALL

SELECT
    column_a
FROM
    table_x

UNION               # or UNION ALL

SELECT
    column_b
FROM
    table_y
  • UNION allows you to run a 2nd query (or 3rd or 4th), the results will be ordered by default with the results of the first query
  • UNION ALL ensures that the results in the result set appear in order that the queries are written
  • The number of columns in each query must be the same in order for UNION & UNION ALL to work

Starting with something simple (and a little nonsensical), UNION basically lets you run two entirely separate queries. Technically, they could have nothing to do with each other:


In [86]:
run('''
    SELECT
        model
    FROM
        car_table
    WHERE
        model = 'Tundra'
    
    UNION
    
    SELECT 
        first_name
    FROM
        salesman_table
    WHERE first_name = 'Jared'
    ''')


Out[86]:
model
0 Jared
1 Tundra

Some things to note:

  • Although these queries and their results are unrelated, the column header is dictated by the query that appears first
  • Even though the query for "Tundra" is first, "Tundra" is second in the results. UNION will sort all results according to the normal default rules, acending order.
  • Replace UNION with UNION ALL and run the query again. What changes?

Use UNION to join two queries. The first should have two columns: car model and COGs per car. The second query should show you to average COGs for all the car models, rounded to cents. You want the the average COGs to appear in the last row.

  • Remember that united queries need to have the same number of columns.

In [87]:
run('''
    SELECT NULL
    ''')
#print(union_cheat1)


Out[87]:
NULL
0

Consider the issue we had before, where SQLite didn't support WITH ROUNDUP. We used this query as a workaround. Does it make sense now?


In [88]:
run('''
    SELECT
        C.model AS Car_Model, 
        SUM(S.revenue) as Sum_Revenue
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    GROUP BY C.model
    
    UNION ALL
    
    SELECT 
        'NULL', 
        SUM(S.revenue)
    FROM 
        sales_table S
''')


Out[88]:
Car_Model Sum_Revenue
0 Accord 234748
1 Camry 212080
2 Civic 157886
3 Corolla 154880
4 Forester 201083
5 Outback 317406
6 Prius 297418
7 Tundra 365751
8 NULL 1941252





Optimization:

Non-optimized queries can cause a lot of problems because tables frequently have thousands or millions of rows:

If you haven't optimized your query, it might:

  • Take several minutes (or even hours) to return the information you're requesting
  • Crash your computer
  • Muck up the server's processes, and you'll face the wrath of your company's system administrators once they figure out that you are the reason why the whole system has slowed down and everyone is sending them angry emails (this will probably happen to you no matter what. It's a rite of passage).

Find a few more useful optimization tips here: http://hungred.com/useful-information/ways-optimize-sql-queries/

Some of these seem strange, because we're going ling you NOT to do a bunch of things that you've learned how to do. Stick to this principal: if you're dealing with a small table, you can break a few of these rules. The larger the table, the fewer rules you can break.

DO name specific columns in the SELECT CLAUSE:


In [89]:
run(''' 
    SELECT
        date,
        revenue
    FROM 
        sales_table
    ''').head()


Out[89]:
date revenue
0 1/7/2014 21222
1 1/26/2014 19140
2 1/17/2014 14720
3 1/13/2014 19010
4 1/12/2014 22022

DON'T use an asterisk unless you absolutely have to:

This can put a lot of strain on servers. Only use if you know for certain that your using a small table


In [90]:
run(''' 
    SELECT
        *
    FROM 
        sales_table
    ''').head()


Out[90]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 276 1/7/2014
1 43161 20 967016 19140 finance 225 1/26/2014
2 40112 46 819010 14720 cash 147 1/17/2014
3 92495 31 633030 19010 finance 215 1/13/2014
4 78000 51 341877 22022 finance 803 1/12/2014

DO use LIKE on small tables and in simple queries:

LIKE is helpful if you know where to find something but you can't quite remember what it's called. Try to use a wildcard sparingly - don't use 2 when 1 will suffice:


In [91]:
run('''
    SELECT
        model_id, model
    FROM 
        car_table
    WHERE
        model LIKE '%undra'
    ''')


Out[91]:
model_id model
0 36 Tundra

DON'T use LIKE on large tables or when using JOINs:


In [92]:
run('''
    SELECT 
        C.model, 
        AVG(revenue)
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
    WHERE
        C.model LIKE '%undra'
''')


Out[92]:
model AVG(revenue)
0 Tundra 22859.4375

If you want to look at average revenue for car models that are like "%undra", run the LIKE query on the small table (car_table) first to figure out exacly what you're looking for, then use that information to search for the data you need from the sales_table

DO dip your toe in by starting with a small data set

Use WHERE to only view a few days of data at first. If the query runs quickly, add a few days at a time. If it starts to run slowly, run just a few days at a time and paste results into excel to combine results (or use Python...ask me later!!!).

The query below won't work because SQLite doesn't recognize dates, but remember these concepts when working with other RDBMS


In [93]:
run('''
    SELECT
        revenue,
        date
    FROM
        sales_table
    WHERE
        date = '1/1/14'
    ''')


Out[93]:
revenue date

DO use a UNION to look at result-sets that aren't mutually exclusive

Let's say you were interested in seeing all Toyotas as well as cars with COGs of more than 13000. Write a query for the first group, then a query for the second group, and unite them with UNION. The result set won't show you repeats - if a row matches both result sets, it will only display once.


In [94]:
run('''
    SELECT
        make, model, cogs
    FROM
        car_table
    WHERE 
        make = 'Toyota'
    
    UNION
    
    SELECT 
        make, model, cogs
    FROM
        car_table
    WHERE
        cogs > 13000
    ''')


Out[94]:
make model cogs
0 Honda Accord 13263
1 Subaru Forester 13317
2 Subaru Outback 14937
3 Toyota Camry 13200
4 Toyota Corolla 9600
5 Toyota Prius 14520
6 Toyota Tundra 15720

DON'T use OR when a UNION will generate the same results

Note that we'll get the same results as above, but this query could run MUCH slower on a large table. It's tempting to use OR because it's faster to write, but unless you're dealing with very small tables, avoid the temptation. In 5 years of doing business analytics with SQL, I never used OR once. It's slow. Use a UNION.


In [95]:
run('''
    SELECT
        make, model, cogs
    FROM
        car_table
    WHERE
        make = 'Toyota' OR cogs > 13000
    ''')


Out[95]:
make model cogs
0 Toyota Camry 13200
1 Toyota Corolla 9600
2 Toyota Prius 14520
3 Honda Accord 13263
4 Subaru Outback 14937
5 Subaru Forester 13317
6 Toyota Tundra 15720

DON'T use negative filters when a positive filter is possible

Let's say you want to look at cars made by Toyato and Honda, but you don't care about Subaru. It might be tempting to use a negative filter:


In [96]:
run('''
    SELECT
        *
    FROM
        car_table
    WHERE
        make != 'Subaru'
    ''')


Out[96]:
model_id make model sticker_price cogs
0 20 Toyota Camry 22000 13200
1 46 Toyota Corolla 16000 9600
2 51 Toyota Prius 24200 14520
3 19 Honda Civic 18190 10914
4 31 Honda Accord 22105 13263
5 36 Toyota Tundra 26200 15720

On a big table, this will run much more slowly than if you use a positive filter. Try this instead - it might require a little extra typing, but it will run much faster:


In [97]:
run('''
    SELECT
        *
    FROM
        car_table
    WHERE
        make in ('Toyota', 'Honda')
    ''')


Out[97]:
model_id make model sticker_price cogs
0 20 Toyota Camry 22000 13200
1 46 Toyota Corolla 16000 9600
2 51 Toyota Prius 24200 14520
3 19 Honda Civic 18190 10914
4 31 Honda Accord 22105 13263
5 36 Toyota Tundra 26200 15720




Wrapping Up:

Debugging:

If you run into errors when you start writing your own queries, here are some things to make sure your query has:

  • The right names for columns in the SELECT clause
  • Columns that can be found in the tables in the FROM clause
  • Consistent use of aliases throughout (if using aliases)
  • Joined tables on the corresponding column and proper aliases to indicate each table
  • The correct order of clauses:
        SELECT
        FROM
        JOIN...ON 
        WHERE
        GROUP BY
        UNION
        ORDER BY
        LIMIT
  • Consistent use of capitalization for variables in quotes
  • Fuctions and opperators for real numbers, not integers
  • The same number of columns/expressions in their SELECT clauses of your queries when using UNION

Gain a deeper understanding:

http://tech.pro/tutorial/1555/10-easy-steps-to-a-complete-understanding-of-sql

Practice on other databases:

http://sqlzoo.net/wiki/SELECT_.._WHERE

Sample Queries for Business Analysis:

Let's say you recently opened a car dealership, and you now have one month's worth of sales data. You want to know how your sales team is doing.

Start by looking at the number of cars each person sold last month. The names of the sales team and the list of transactions are on different tables in your database, but SQL can help you with that:


In [98]:
run('''
    SELECT
        first_name ||  ' ' || last_name as Salesperson,
        COUNT(*) as Cars_Sold
    FROM
        sales_table S
        JOIN salesman_table M ON S.salesman_id = M.id
    GROUP BY 
        Salesperson
    ORDER BY 
        Cars_Sold DESC
    ''')


Out[98]:
Salesperson Cars_Sold
0 Jared Case 14
1 Claudine Hatch 11
2 Michael Hill 11
3 Rosemarie Self 11
4 Elton Elzy 10
5 Matthew Luna 10
6 Joseph Seney 9
7 Justin Avellaneda 9
8 Samantha Douglas 8
9 Kathleen March 7

Add on the average amount of revenue made per sale:


In [99]:
run('''
    SELECT
        first_name ||  ' ' || last_name as Salesperson,
        COUNT(*) as Cars_Sold, 
        ROUND(AVG(revenue)) as Revenue_per_Sale
    FROM
        sales_table S
        JOIN salesman_table M ON S.salesman_id = M.id
    GROUP BY 
        Salesperson
    ORDER BY 
        Cars_Sold DESC
    ''')


Out[99]:
Salesperson Cars_Sold Revenue_per_Sale
0 Jared Case 14 18952
1 Claudine Hatch 11 17890
2 Michael Hill 11 18323
3 Rosemarie Self 11 19462
4 Elton Elzy 10 19467
5 Matthew Luna 10 21349
6 Joseph Seney 9 20002
7 Justin Avellaneda 9 19242
8 Samantha Douglas 8 19149
9 Kathleen March 7 21277

Make it easier to compare the average revenue of Jared's sales to the average revenue of per sale overall by adding a column to see by what percent each salesperson's sales are more or less than average:


In [100]:
run('''
    SELECT
        first_name ||  ' ' || last_name as Salesperson,
        COUNT(*) as Cars_Sold, 
        ROUND(AVG(revenue), 2) as Rev_per_Sale,
        ROUND((((AVG(revenue) 
            - (SELECT AVG(revenue) from sales_table))
            /(SELECT AVG(revenue) from sales_table))*100), 1) || ' %'
            as RPS_Compared_to_Avg
    FROM
        sales_table S
        JOIN salesman_table M ON S.salesman_id = M.id
    GROUP BY 
        Salesperson
    ORDER BY 
        Cars_Sold DESC
    ''')


Out[100]:
Salesperson Cars_Sold Rev_per_Sale RPS_Compared_to_Avg
0 Jared Case 14 18951.64 -2.4 %
1 Claudine Hatch 11 17890.00 -7.8 %
2 Michael Hill 11 18323.45 -5.6 %
3 Rosemarie Self 11 19462.27 0.3 %
4 Elton Elzy 10 19466.90 0.3 %
5 Matthew Luna 10 21349.20 10.0 %
6 Joseph Seney 9 20002.22 3.0 %
7 Justin Avellaneda 9 19242.44 -0.9 %
8 Samantha Douglas 8 19149.13 -1.4 %
9 Kathleen March 7 21277.14 9.6 %

So maybe Jared is just selling cheaper cars.

Let's go further and compare the sale price of each car against the sticker price to see how low Jared was willing to negotiate with customers. Sticker price is in anther table, but again, that's no problem with SQL:


In [101]:
run('''
    SELECT
        first_name ||  ' ' || last_name as Salesperson,
        COUNT(*) as Cars_Sold, 
        '$ ' || ROUND(AVG(revenue), 2)  as Rev_per_Sale,
        ROUND((((AVG(revenue) 
            - (SELECT AVG(revenue) from sales_table where salesman_id != 215))
            /(SELECT AVG(revenue) from sales_table where salesman_id != 215))*100), 1) || ' %'
            AS RPS_Compared_to_Avg,
        ROUND((1-(SUM(revenue) / SUM(sticker_price)))*100, 1) || ' %' as Avg_Customer_Discount
    FROM
        sales_table S
        JOIN salesman_table M ON S.salesman_id = M.id
        JOIN car_table C ON S.model_id = C.model_id
    GROUP BY 
        Salesperson
    ORDER BY 
        Cars_Sold DESC
    ''')


Out[101]:
Salesperson Cars_Sold Rev_per_Sale RPS_Compared_to_Avg Avg_Customer_Discount
0 Jared Case 14 $ 18951.64 -2.7 % 15.7 %
1 Claudine Hatch 11 $ 17890.0 -8.2 % 13.9 %
2 Michael Hill 11 $ 18323.45 -6.0 % 14.7 %
3 Rosemarie Self 11 $ 19462.27 -0.1 % 12.8 %
4 Elton Elzy 10 $ 19466.9 -0.1 % 13.4 %
5 Matthew Luna 10 $ 21349.2 9.6 % 10.3 %
6 Joseph Seney 9 $ 20002.22 2.6 % 12.5 %
7 Justin Avellaneda 9 $ 19242.44 -1.3 % 13.4 %
8 Samantha Douglas 8 $ 19149.13 -1.7 % 16.1 %
9 Kathleen March 7 $ 21277.14 9.2 % 9.5 %

Looks like Jared is letting customers negotiate prices down much more than his peers.

But is this a real problem? How much is each salesperson contributing to our gross profits?


In [102]:
run('''
    SELECT
        first_name ||  ' ' || last_name as Salesperson,
        COUNT(*) as Cars_Sold, 
        '$ ' || ROUND(AVG(revenue), 2)  as Rev_per_Sale,
        ROUND((((AVG(revenue) 
            - (SELECT AVG(revenue) from sales_table where salesman_id != 215))
            /(SELECT AVG(revenue) from sales_table where salesman_id != 215))*100), 1) || ' %'
            AS RPS_Compared_to_Peers,
        ROUND((1-(SUM(revenue) / SUM(sticker_price)))*100, 1) || ' %' as Avg_Customer_Discount, 
        ROUND(((SUM(revenue)-sum(C.cogs))
            /(SELECT SUM(revenue)-sum(cogs) FROM sales_table S join car_table C on S.model_id = C.model_id))*100, 1) || ' %' as Gross_Profit_Contribution
    FROM
        sales_table S
        JOIN salesman_table M ON S.salesman_id = M.id
        JOIN car_table C ON S.model_id = C.model_id
    GROUP BY 
        Salesperson
    ORDER BY 
        Cars_Sold DESC
    ''')


Out[102]:
Salesperson Cars_Sold Rev_per_Sale RPS_Compared_to_Peers Avg_Customer_Discount Gross_Profit_Contribution
0 Jared Case 14 $ 18951.64 -2.7 % 15.7 % 12.8 %
1 Claudine Hatch 11 $ 17890.0 -8.2 % 13.9 % 10.0 %
2 Michael Hill 11 $ 18323.45 -6.0 % 14.7 % 10.0 %
3 Rosemarie Self 11 $ 19462.27 -0.1 % 12.8 % 11.2 %
4 Elton Elzy 10 $ 19466.9 -0.1 % 13.4 % 10.0 %
5 Matthew Luna 10 $ 21349.2 9.6 % 10.3 % 11.9 %
6 Joseph Seney 9 $ 20002.22 2.6 % 12.5 % 9.5 %
7 Justin Avellaneda 9 $ 19242.44 -1.3 % 13.4 % 8.9 %
8 Samantha Douglas 8 $ 19149.13 -1.7 % 16.1 % 7.3 %
9 Kathleen March 7 $ 21277.14 9.2 % 9.5 % 8.4 %

SQL really lets you dig.

Some other quick examples - we could do a gender breakdown of customers per car model and add a total at the bottom:


In [103]:
run('''
    SELECT
        C.model as Car_Model, 
        ROUND(SUM(CASE WHEN CUST.gender = 'female' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Female Customers',
        ROUND(SUM(CASE WHEN CUST.gender = 'male' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Male Customers'
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id 
        JOIN cust_table CUST on S.customer_id = CUST.customer_id
    GROUP BY 
        Car_Model
    
    UNION ALL
    
    SELECT
        'Total:', 
        ROUND(SUM(CASE WHEN CUST.gender = 'female' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Female Customers',
        ROUND(SUM(CASE WHEN CUST.gender = 'male' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Male Customers'
    FROM
        sales_table S
        JOIN cust_table CUST on S.customer_id = CUST.customer_id
    ''')


Out[103]:
Car_Model % Female Customers % Male Customers
0 Accord 0.50 0.50
1 Camry 0.45 0.55
2 Civic 0.30 0.70
3 Corolla 0.55 0.45
4 Forester 0.64 0.36
5 Outback 0.33 0.67
6 Prius 0.14 0.86
7 Tundra 0.50 0.50
8 Total: 0.42 0.58

Easily create age groups and see how aggressively each group negotiates (judged by the difference between the actual sale amount and the sticker price):


In [104]:
run('''
    SELECT
        CASE WHEN age BETWEEN 18 AND 24 THEN '18-24 years'
             WHEN age BETWEEN 25 AND 34 THEN '25-34 years'
             WHEN age BETWEEN 35 AND 44 THEN '35-44 years'
             WHEN age BETWEEN 45 AND 54 THEN '45-54 years'
             WHEN age BETWEEN 55 AND 64 THEN '55-64 years'
             END Age_Group,
        ROUND((SUM(S.revenue)-SUM(C.sticker_price))/SUM(C.sticker_price), 2) as '% Paid Below Sticker Price'
    FROM
        sales_table S
        JOIN car_table C on S.model_id = C.model_id
        JOIN cust_table CUST on S.customer_id = CUST.customer_id
    GROUP BY
        Age_Group
    ''')


Out[104]:
Age_Group % Paid Below Sticker Price
0 18-24 years -0.14
1 25-34 years -0.12
2 35-44 years -0.11
3 45-54 years -0.16
4 55-64 years -0.14