Sarah Beckett-Hile | NYU Stern School of Business | March 2015
Today's plan
SQL = "Structured Query Language" (pronounced "S-Q-L" or "sequel")
Database language of choice for most businesses
The software optimized for storing relational databases that you access with SQL varies. Relational Database Management Systems (RDBMS) include MySQL, Microsoft SQL Server, Oracle, and SQLite. We will be working with SQLite.
Relational Databases have multiple tables. Visualize it like an Excel file:
SQL lets you perform four basic functions: C.R.U.D. = Create, Read, Update, Delete
"Read" is all you'll need for business analytics
Additional reading: http://www.w3schools.com/sql/sql_intro.asp
Find examples of queries for business analysis at botton of this lesson page
In [1]:
from SQL_support_code import *
Indentations and Returns:
Capitalization:
Order of Clauses:
SELECT
FROM
JOIN...ON
WHERE
GROUP BY
UNION
ORDER BY
LIMIT
PRAGMA TABLE_INFO(table_name)
Running this will let you see the column heads and data types of any table.
The SQL query above only works for SQLite, which is what we're using here. If you're interested in knowing the equivalent versions for other RDBMS options, see the table below.
In [2]:
describe_differences
Out[2]:
In [3]:
run('''
PRAGMA TABLE_INFO(sales_table)
''')
Out[3]:
In [4]:
run('''
PRAGMA TABLE_INFO(sales_table)
''')
#print(describe_cheat)
Out[4]:
Different RDBMS have different datatypes available:
SELECT # specifies which columns you want to see
* # asterisk returns all columns
FROM # specifies the table or tables where these columns can be found
table_name
Use an asterisk to tell SQL to return all columns from the table:
In [5]:
run('''
SELECT
*
FROM
sales_table
''')
Out[5]:
In [6]:
run('''
SELECT NULL
''')
#print(select_cheat1)
Out[6]:
In [7]:
run('''
SELECT
model_id,
revenue
FROM
sales_table
''')
Out[7]:
In [8]:
run('''
SELECT NULL
''')
#print(select_cheat2)
Out[8]:
One more quick note on the basics of SELECT - technically you can SELECT a value without using FROM to specify a table. You could just tell the query exactly what you want to see in the result-set. If it's a number, you can write the exact number. If you are using various characters, put them in quotes.
See the query below as an example:
In [9]:
run('''
SELECT
4,
5,
7,
'various characters or text'
''')
Out[9]:
SELECT
DISTINCT column_a # returns a list of each unique value in column_a
FROM
table_name
The query below pulls each distinct value from the model_id column in the sales_table, so each value is only listed one time:
In [10]:
run('''
SELECT
DISTINCT model_id
FROM
sales_table
''')
Out[10]:
In [11]:
run('''
SELECT NULL
''')
#print(select_cheat3)
Out[11]:
SELECT
column_a
FROM
table_name
WHERE
column_a = x # filters the result-set to rows where column_a's value is exactly x
A few more options for the where clause:
WHERE column_a = 'some_text' # put text in quotations. CAPITALIZATION IS IMPORTANT
WHERE column_a != x # filters the result-set to rows where column_a's value DOES NOT EQUAL x
WHERE column_a < x # filters the result-set to rows where column_a's value is less than x
WHERE columna_a <= x # filters the result-set to rows where column_a's value is less than or equal to x
WHERE column_a IN (x, y) # column_a's value can be EITHER x OR y
WHERE column_a NOT IN (x, y) # column_a's value can be NEITHER x NOR y
WHERE column_a BETWEEN x AND y # BETWEEN lets you specify a range
WHERE column_a = x AND column_b = y # AND lets you add more filters
WHERE column_a = x OR column_b = y # OR will include results that fulfill either criteria
WHERE (column_a = x AND column_b = y) OR (column_c = z) # use parentheses to create complex AND/OR statements
Below, WHERE filters out any rows that don't match the criteria. The result-set will only contain rows where the payment type is cash AND where the model_id is 46:
In [12]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type = 'cash'
AND model_id = 46
''')
Out[12]:
Rewrite the query to return rows where payment_type is NOT cash, and the model_id is either 31 or 36
In [13]:
run('''
SELECT NULL
''')
#print(where_cheat1)
Out[13]:
In [14]:
run('''
SELECT NULL
''')
#print(where_cheat2)
Out[14]:
SELECT
column_a
FROM
table_name
WHERE
column_a LIKE '%text or number%' # Filters the result_set to rows where that text or value can be found, with % standing in as a wildcard
Note that you don't have to use the whole word "cash" when you use LIKE, and that the capital "C" now doesn't cause a problem:
In [15]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type LIKE 'Cas%'
''').head()
Out[15]:
Be careful with LIKE though - it can't deal with extra characters or mispellings:
In [16]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type LIKE 'ces%'
LIMIT 5
''')
Out[16]:
LIKE and % will also return too much if you're not specific enough. This returns both 'cash' and 'finance' because both have a 'c' with some letters before or after:
In [17]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type LIKE '%c%'
LIMIT 5
''')
Out[17]:
You can use different wildcards besides % to get more specific. An underscore is a substitute for a single letter or character, rather than any number. The query below uses 3 underscores after c to get 'cash':
In [18]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type LIKE 'c___'
LIMIT 5
''')
Out[18]:
In [19]:
run('''
SELECT NULL
''')
#print(where_cheat3)
Out[19]:
SELECT
column_a
FROM
table_name
WHERE # optional
column_a = x
ORDER BY # sorts the result-set by column_a
column_a DESC # DESC is optional. It sorts results in descending order (100->1) instead of ascending (1->100)
The query below orders the result-set by revenue amount, starting with the smallest amount listed first:
In [20]:
run('''
SELECT
*
FROM
sales_table
ORDER BY
revenue DESC
LIMIT 5
''')
Out[20]:
In [21]:
run('''
SELECT NULL
''')
#print(order_cheat)
Out[21]:
SELECT
column_a
FROM
table_name
WHERE
columna_a = x # optional
ORDER BY
column_a # optional
LIMIT # Limits the result-set to N rows
N
In [22]:
limit_differences
Out[22]:
The query below limits the number of rows to 5 results. Change it to 10 to get a quick sense of what we're doing here:
In [23]:
run('''
SELECT
*
FROM
sales_table
LIMIT 5
''')
Out[23]:
SELECT
T.column_a AS alias_a # creates a nickname for column_a, and states that it's from table_name (whose alias is T)
FROM
table_name AS T # creates a nickname for table_name
WHERE
alias_a = z # refer to an alias in the WHERE clause
ORDER BY
alias_a # refer to an alias in the ORDER BY clause
Change the aiases for model_id and revenue, or add extra columns to see how they work:
In [24]:
run('''
SELECT
model_id AS Model_of_car,
revenue AS Rev_per_car
FROM
sales_table
''')
Out[24]:
You can use an alias in the ORDER BY and WHERE clauses now. Write a query to:
THEN:
In [25]:
run('''
SELECT NULL
''')
#print(alias_cheat)
Out[25]:
You can also assign an alias to a table, and use the alias to tell SQL which table the column is coming from. This isn't of much use when you're only using one table, but it will come in handy when you start using multiple tables.
Below,the sales_table has the alias "S". Read "S.model_id" as "the model_id column from S, which is the sales_table"
Change the S to another letter in the FROM clause and run. Why did you hit an error? What can you do to fix it?
In [26]:
run('''
SELECT
S.model_id,
S.revenue
FROM
sales_table AS S
LIMIT 5
''')
Out[26]:
SELECT
*
FROM
table_x
JOIN table_y # use JOIN to add the second table
ON table_x.column_a = table_y.column_a # use ON to specify which columns correspond on each table
Start by looking at the first few rows of sales_table again:
In [27]:
run('''
SELECT
*
FROM
sales_table
LIMIT 5
''')
Out[27]:
Now the first few rows of the car_table:
In [28]:
run('''
SELECT
*
FROM
car_table
LIMIT 5
''')
Out[28]:
These tables are related. There's a column named "model_id" in the sales_table and a "model_id" in the car_table - but the column names don't need to be the same, what's important is that the values in the sales_table's model_id column correspond to the values in the car_table's model_id column.
You can join these tables by using these columns as keys.
In [29]:
run('''
SELECT
*
FROM
sales_table
JOIN car_table ON sales_table.model_id = car_table.model_id
LIMIT 10
''')
Out[29]:
Write a query to join the cust_table to the sales_table, using the customer_id columns in both tables as the key:
In [30]:
run('''
SELECT NULL
''')
#print(join_cheat1)
Out[30]:
Rewrite the query from above, but instead of selecting all columns, specify just the customer gender and the revenue:
In [31]:
run('''
SELECT NULL
''')
#print(join_cheat2)
Out[31]:
Rewrite the query from above, but this time select the customer_id, gender, and revenue:
In [32]:
run('''
SELECT NULL
''')
#print(join_cheat3)
Out[32]:
A column with the name customer_id appears in both the cust_table and the sales_table. SQL doesn't know which one you want to see. You have to tell it from which table you want the customer_id.
This can be important when columns in different tables have the same names but totally unrelated values.
Look at the sales_table again:
In [33]:
run('''
SELECT
*
FROM
sales_table
LIMIT 5
''')
Out[33]:
Above, there's a column called "id".
Now look at the salesman_table again:
In [34]:
run('''
SELECT
*
FROM
salesman_table
LIMIT 5
''')
Out[34]:
There's a column named "id" in the salesman_table too. However, it doesn't look like those IDs correspond to the sales_table IDs. In fact, it's the salesman_id column in the sales_table that corresponds to the id column in the salesman_table. More often than not, your tables will use different names for corresponding columns, and will have columns with identical names that don't correspond at all.
Write a query to join the salesman_table with the sales_table (select all columns using an asterisk)
In [35]:
run('''
SELECT NULL
''')
#print(join_cheat4)
Out[35]:
Practice applying this "table_x.column_a" format to all columns in the SELECT clause when you are joining multiple tables, since multiple tables frequenty use the same column names even when they don't correspond.
It's common to use single-letter aliases for tables to make queries shorter. Take a look at the query below and make sure you understand what's going on with the table aliases. It's the same query that you wrote earlier, but with aliases to help identify the columns
In [36]:
run('''
SELECT
S.customer_id,
C.gender,
S.revenue
FROM
sales_table AS S
JOIN cust_table AS C on S.customer_id = C.customer_id
''')
Out[36]:
Join the sales_table (assign it the alias S) and salesman_table (alias SM) again.
In [37]:
run('''
SELECT NULL
''')
#print(join_cheat5)
Out[37]:
There are different types of joins you can do according to your needs. Here's a helpful way to visualize your options: http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins
However, not all types of joins are compatible with SQLite and MySQL. The table below breaks down compatibility:
In [38]:
join_differences
Out[38]:
So far, we've just done a simple join, also called an "inner join". To illustrate different types of joins, we're going to use a different "database" for the following lesson. First, let's take a look at each one:
In [39]:
run('''
SELECT
*
FROM
Dog_Table
''')
Out[39]:
In [40]:
run('''
SELECT
*
FROM
Cat_Table
''')
Out[40]:
Notice that the Owner_Name columns on each table have some corresponding values (Michael, Gilbert, May and Elizabeth and Donna are in both tables), but they both also have values that don't overlap.
SELECT
*
FROM
table_x X
JOIN table_y Y ON X.column_a = Y.column_a # Returns rows when values match on both tables.
This is what we used in the initial example. Simple joins, (also called Inner Joins) will combine tables only where there are corresponding values on both tables.
Write a query below to join the Cat_Table and Dog_Table using the same method we've used before:
In [41]:
run('''
SELECT NULL
''')
#print(inner_join_cheat)
Out[41]:
Notice that the result-set only includes the names that are in both tables. Think of inner joins as being the overlapping parts of a Venn Diagram. So, essentially we're looking at results only where the pet owner has both a cat and a dog.
SELECT
*
FROM
table_x X
LEFT JOIN table_y Y ON X.column_a = Y.column_a # Returns all rows from 1st table, rows that match from 2nd
Rewrite your query from above, but instead of "JOIN", write "LEFT JOIN":
In [42]:
run('''
SELECT NULL
''')
#print(left_join_cheat)
Out[42]:
This time, you're seeing everything from the Dog_Table, but only results from the Cat_Table IF the owner also has a dog.
SELECT
*
FROM
table_x X
OUTER JOIN table_y Y ON X.column_a = Y.column_a # Returns all rows, regardless of whether values match
For now, this query won't totally make sense, just pay attention to the results so you can visualize an outer join:
In [43]:
run('''
SELECT
C.Owner_Name,
Cat_Name,
Dog_Name
FROM
Cat_Table C
LEFT JOIN Dog_Table D ON D.Owner_Name = C.Owner_Name
UNION ALL
SELECT
D.Owner_Name,
' ',
Dog_Name
FROM
Dog_Table D
WHERE
Owner_Name NOT IN (SELECT Owner_Name from Cat_Table)
''')
Out[43]:
Essentially, in Venn Diagram terms, and outer join lets you see all contents of both circles. This join will let you see all pet owners, regardless of whether the own only a cat or only a dog
SELECT
*
FROM
table_x X
JOIN table y Y
WHERE
X.column_a = Y.column_a # tells SQL the key for the join
When it's simple, it's not so bad:
In [44]:
run('''
SELECT
C.model,
S.revenue
FROM
sales_table S, car_table C
WHERE
S.model_id = C.model_id
LIMIT 5
''')
Out[44]:
When the query is longer, this method is messy. Suddenly it's harder to parse out which parts of the "WHERE" clause are actual filters, and which parts are just facilitating the join.
Note that we've covered all of these clauses and expressions by now, try to parse out what's going on:
In [45]:
run('''
SELECT
C.make,
C.model,
S.revenue,
CUST.gender,
SM.first_name
FROM
sales_table S
JOIN car_table C
JOIN salesman_table SM
JOIN cust_table CUST
WHERE
S.customer_id = CUST.customer_id
AND S.model_id = C.model_id
AND S.salesman_id = SM.id
AND (C.model in ('Tundra', 'Camry', 'Corolla') OR C.make = 'Subaru')
AND S.revenue between 17000 and 22000
AND CUST.gender = 'female'
AND SM.first_name NOT IN ('Kathleen', 'Samantha')
LIMIT 5
''')
Out[45]:
SELECT
column_a + column_b # adds the values in column_a to the values in columns_b
FROM
table_name
Use the standard formats for add, subtract, mutiply, and divide: + - * /
The query below subtracts cogs (from the car_table) from revenue (from the sales_table) to show us the gross_profit per transaction
In [46]:
run('''
SELECT
S.id,
C.model,
S.revenue,
C.cogs,
S.revenue - C.cogs AS gross_profit
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
LIMIT 5
''')
Out[46]:
Rewrite the query above to return gross margin instead of gross profit. Rename the alias as well. Limit it to 5 results
In [47]:
run('''
SELECT NULL
''')
#print(operator_cheat)
Out[47]:
Concatenating varies by RDBMS:
In [48]:
concat_differences
Out[48]:
Here we'll use SQLite and use the concatenating operator || to combine words/values in different columns:
In [49]:
run('''
SELECT
last_name,
first_name,
last_name || ', ' || first_name AS full_name
FROM
salesman_table
''')
Out[49]:
Use || to pull the make and model from the car_table and make it appear in this format: "Model (Make)"
In [50]:
run('''
SELECT NULL
''')
#print(concat_cheat)
Out[50]:
SELECT
SUM(column_a), # sums up the values in column_a
AVG(column_a), # averages the values in column_a
ROUND(AVG(column_a), 2), # rounds the averaged values in column_a to 2 digits
COUNT(column_a), # counts the number of rows in column_a
MAX(column_a), # returns the maximum value in column_a
MIN(column_a), # returns the minimum value in column_a
GROUP_CONCAT(column_a) # returns a comma separated list of all values in column_a
FROM
table_name
The function below will sum up everything in the revenue column. Note that now we only get one row:
In [51]:
run('''
SELECT
SUM(revenue) AS Total_Revenue
FROM
sales_table
''')
Out[51]:
Rewrite the query to return the average cost of goods for a car in the car table. Try rounding it to cents.
In [52]:
run('''
SELECT NULL
''')
#print(avg_cheat)
Out[52]:
Using COUNT(*) will return the number of rows in any given table. Rewrite the query to return the number of rows in the car_table:
In [53]:
run('''
SELECT NULL
''')
#print(count_cheat)
Out[53]:
You can apply functions on top of other operators. Below is the sum of gross profits:
In [54]:
run('''
SELECT
'$ ' || SUM(S.revenue - C.cogs) total_gross_profit
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
''')
Out[54]:
In [55]:
run('''
SELECT NULL
''')
#print(avg_cheat2)
Out[55]:
In [56]:
run('''
SELECT
GROUP_CONCAT(model, ', ') as Car_Models
FROM
car_table
''')
Out[56]:
Use GROUP_CONCAT to return a comma-separated list of last names from the salesman_table:
In [57]:
run('''
SELECT NULL
''')
#print(concat_cheat)
Out[57]:
SELECT
column_a,
SUM(column_b) # sums up the values in column_b
FROM
table_name
GROUP BY # creates one group for each unique value in column_a
column_a
The query below creates a group for each unique value in the car_table's model column, then sums up the revenue for each group. Note that you can use an alias in the GROUP BY clause.
In [58]:
run('''
SELECT
C.model AS Car_Model,
SUM(revenue) AS Total_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
Car_Model
''')
Out[58]:
Rewrite the query above to return the average gross profit (revenue - cogs) per make (remember that "make" is in the car_table)
Extra things to try:
In [59]:
run('''
SELECT NULL
''')
#print(group_cheat)
Out[59]:
Write a query to make a comma-separated list of models for each car maker:
In [60]:
run('''
SELECT NULL
''')
#print(group_cheat1)
Out[60]:
GROUP BY, when used with joins and functions, can help you quickly see trends in your data. Parse out what's going on here:
In [61]:
run('''
SELECT
C.model AS Car_Model,
MIN(S.revenue) || ' - ' || MAX(S.revenue) AS Min_to_Max_Sale,
MAX(S.revenue) - MIN(S.revenue) AS Range,
ROUND(AVG(S.revenue), 2) AS Average_Sale
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
Car_Model
ORDER BY
Average_Sale DESC
''')
Out[61]:
You can also use GROUP BY with multiple columns to segment out the results further:
In [62]:
run('''
SELECT
C.make AS car_caker,
payment_type,
ROUND(AVG(revenue)) as avg_revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
C.Make,
payment_type
''')
Out[62]:
Rewrite the query to find the total revenue grouped by each salesperson's first_name and by the customer's gender (gender column in cust_table)
In [63]:
run('''
SELECT NULL
''')
#print(group_cheat2)
Out[63]:
SELECT
column_a,
SUM(column_b) AS alias_b
FROM
table_name
GROUP BY
column_a HAVING alias_b > x # only includes groups in column_a when the sum of column_b is greater than x
The query below will sum up all the revenue for each car maker, but it will only show you results for car maker's whose total revenue is greater than 50,000:
In [64]:
run('''
SELECT
C.Make as Car_Maker,
SUM(revenue) as Total_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
Car_Maker HAVING Total_Revenue > 500000
''')
Out[64]:
Rewrite the query above to look at average revenue per model, and using HAVING to filter your result-set to only include models whose average revenue is less than 18,000:
In [65]:
run('''
SELECT NULL
''')
#print(having_cheat)
Out[65]:
WHERE filters which rows will be included in the function, whereas HAVING filters what's returned after the function has been applied.
Take a look at the query below. It might look like the query you just wrote (above) if you'd tried to use WHERE instead of HAVING:
SELECT
C.model as Car_Model,
AVG(S.revenue) as Avg_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
WHERE
S.revenue < 18000
GROUP BY
Car_Model
Even though AVG( ) appears early in the query, it's not actually being applied until after the WHERE statement has filtered out rows with less than 18,000 in revenue.
This is the result:
In [66]:
run('''
SELECT
C.model as Car_Model,
AVG(S.revenue) as Avg_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
WHERE
S.revenue < 18000
GROUP BY
Car_Model
''')
Out[66]:
All model_ids are returned, but the averages are all much lower than they should be. That's because the query first drops all rows that have revenue greater than 18000, and then averages the remaining rows.
When you use HAVING, SQL follows these steps instead (this query should look like the one you wrote in the last challenge):
SELECT
C.model as Car_Model,
AVG(S.revenue) as Avg_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
Car_Model HAVING Avg_Revenue < 18000
And as you can see, there's a big difference in these results and the results of the query that used "WHERE" instead of HAVING:
In [67]:
run('''
SELECT
C.model as Car_Model,
AVG(S.revenue) as Avg_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
Car_Model HAVING Avg_Revenue < 18000
''')
Out[67]:
This query is only looking at Toyotas whose revenue is less than 18,000, using WHERE to limit the results to Toyotas, and HAVING to limit the results by revenue:
In [68]:
run('''
SELECT
C.model as Car_Model,
AVG(S.revenue) as Avg_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
WHERE
C.make = 'Toyota'
GROUP BY
Car_Model HAVING Avg_Revenue < 18000
''')
Out[68]:
Write a query with the following criteria:
So, in plain English, we want to see salespeople whose average revenue for female customers is greater than 20,000
In [69]:
run('''
SELECT NULL
''')
#print(having_where_cheat)
Out[69]:
In [70]:
rollup_differences
Out[70]:
Because SQLite doesn't support ROLLUP, the query below is just intended to illustrate how ROLLUP would work. Don't worry about understanding the query itself, just get familiar with what's going on in the result-set:
In [71]:
run('''
SELECT
C.model AS Car_Model,
SUM(S.revenue) as Sum_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY C.model
UNION ALL
SELECT
'NULL',
SUM(S.revenue)
FROM
sales_table S
''')
Out[71]:
SELECT
CASE WHEN column_a = x THEN some_value
WHEN column_a = y THEN some_value2
ELSE some_other_value
END some_alias # alias optional after END
FROM
table_name
In [72]:
conditional_differences
Out[72]:
Starting with a simple example, here we'll use CASE WHEN to create a new column on the sales_table:
In [73]:
run('''
SELECT
revenue,
CASE WHEN revenue > 20000 THEN 'Revenue is more than 20,000'
END Conditional_Column
FROM
sales_table
LIMIT 10
''')
Out[73]:
CASE WHEN gives you the value "Revenue is more MORE 20,000" when revenue in that same row is greater than 20,000. Otherwise, it has no value.
Now let's add a level:
In [74]:
run('''
SELECT
revenue,
CASE WHEN revenue > 20000 THEN 'Revenue is MORE than 20,000'
WHEN revenue < 15000 THEN 'Revenue is LESS than 15,000'
END Conditional_Column
FROM
sales_table
LIMIT 10
''')
Out[74]:
Now to deal with the blank spaces. You can assign an "ELSE" value to catch anything that's not included in the prior expressions:
In [75]:
run('''
SELECT
revenue,
CASE WHEN revenue > 20000 THEN 'Revenue is MORE than 20,000'
WHEN revenue < 15000 THEN 'Revenue is LESS than 15,000'
ELSE 'NEITHER'
END Conditional_Column
FROM
sales_table
LIMIT 10
''')
Out[75]:
You can use values from another column as well. Remember this query from the GROUP BY lesson? It's often helpful to look at information broken out by multiple groups, but it's not especially easy to digest:
In [76]:
run('''
SELECT
C.Make as car_maker,
payment_type,
ROUND(AVG(S.revenue)) as avg_revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
C.Make,
payment_type
''')
Out[76]:
Look at what's going on in that query without the AVG( ) function and the GROUP BY clause:
In [77]:
run('''
SELECT
C.Make as Car_Maker,
payment_type,
S.revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
''')
Out[77]:
The result-set above is essentially what SQL is working with right before it separates the rows into groups and averages the revenue within those groups.
Now, we're going to use some CASE WHEN statements to change this a little:
In [78]:
run('''
SELECT
C.Make as Car_Maker,
payment_type,
CASE WHEN payment_type = 'cash' THEN S.revenue END Cash_Revenue,
CASE WHEN payment_type = 'finance' THEN S.revenue END Finance_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
''')
Out[78]:
Now let's add back the ROUND() and AVG() functions and the GROUP BY statement:
In [79]:
run('''
SELECT
C.Make as Car_Maker,
ROUND(AVG(CASE WHEN payment_type = 'cash' THEN S.revenue END)) AS Avg_Cash_Revenue,
ROUND(AVG(CASE WHEN payment_type = 'finance' THEN S.revenue END)) AS Avg_Finance_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY
C.Make
''')
Out[79]:
CASE WHEN makes this same information a lot easier to read by letting you pivot the result set a little.
Write a query using CASE WHEN to look at total revenue per gender, grouped by each car model
In [80]:
run('''
SELECT NULL
''')
#print(case_cheat)
Out[80]:
CASE WHEN also lets you create new groups. Start by looking at the cust_table grouped by age - remember that COUNT(***) tells you how many rows are in each group (which is the same as telling you the number of customers in each group):
In [81]:
run('''
SELECT
age,
COUNT(*) customers
FROM
cust_table
GROUP BY
age
''')
Out[81]:
When you want to segment your results, but there are too many different values for GROUP BY to be helpful, use CASE WHEN to make your own groups. GROUP BY the column you created with CASE WHEN to look at your newly created segments.
In [82]:
run('''
SELECT
CASE WHEN age BETWEEN 18 AND 24 THEN '18-24 years'
WHEN age BETWEEN 25 AND 34 THEN '25-34 years'
WHEN age BETWEEN 35 AND 44 THEN '35-45 years'
WHEN age BETWEEN 45 AND 54 THEN '45-54 years'
WHEN age BETWEEN 55 AND 64 THEN '55-64 years'
END Age_Group,
COUNT(*) as Customers
FROM
cust_table
GROUP BY
Age_Group
''')
Out[82]:
Ta-DA! Useful customer segments!
Try to break up the "Customers" column into 2 columns - one for male and one for female. Keep the age segments intact.
In [83]:
run('''
SELECT NULL
''')
#print(case_cheat2)
Out[83]:
Consider the following query. We're using a nested query in the SELECT clause to see the sum of all revenue in the sales_table, and then using it again to what percentage of total revenue can be attributed to each Car_Model.
In [84]:
run('''
SELECT
C.model AS Car_Model,
SUM(S.revenue) AS Revenue_Per_Model,
(SELECT SUM(revenue) FROM sales_table) AS Total_Revenue,
SUM(S.revenue) / (SELECT SUM(revenue) FROM sales_table) AS Contribution_to_Revenue
FROM
sales_table S
JOIN car_table C ON C.model_id = S.model_id
GROUP BY
Car_Model
''')
Out[84]:
Write a query to look at the model name and COGs for each car in car_table, then use a nested query to also look at the average COGs off all car models in a third column
In [85]:
run('''
SELECT NULL
''')
#print(nest_cheat1)
Out[85]:
SELECT
column_a
FROM
table_x
UNION # or UNION ALL
SELECT
column_b
FROM
table_y
Starting with something simple (and a little nonsensical), UNION basically lets you run two entirely separate queries. Technically, they could have nothing to do with each other:
In [86]:
run('''
SELECT
model
FROM
car_table
WHERE
model = 'Tundra'
UNION
SELECT
first_name
FROM
salesman_table
WHERE first_name = 'Jared'
''')
Out[86]:
Some things to note:
Use UNION to join two queries. The first should have two columns: car model and COGs per car. The second query should show you to average COGs for all the car models, rounded to cents. You want the the average COGs to appear in the last row.
In [87]:
run('''
SELECT NULL
''')
#print(union_cheat1)
Out[87]:
Consider the issue we had before, where SQLite didn't support WITH ROUNDUP. We used this query as a workaround. Does it make sense now?
In [88]:
run('''
SELECT
C.model AS Car_Model,
SUM(S.revenue) as Sum_Revenue
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
GROUP BY C.model
UNION ALL
SELECT
'NULL',
SUM(S.revenue)
FROM
sales_table S
''')
Out[88]:
Non-optimized queries can cause a lot of problems because tables frequently have thousands or millions of rows:
If you haven't optimized your query, it might:
Find a few more useful optimization tips here: http://hungred.com/useful-information/ways-optimize-sql-queries/
Some of these seem strange, because we're going ling you NOT to do a bunch of things that you've learned how to do. Stick to this principal: if you're dealing with a small table, you can break a few of these rules. The larger the table, the fewer rules you can break.
In [89]:
run('''
SELECT
date,
revenue
FROM
sales_table
''').head()
Out[89]:
In [90]:
run('''
SELECT
*
FROM
sales_table
''').head()
Out[90]:
In [91]:
run('''
SELECT
model_id, model
FROM
car_table
WHERE
model LIKE '%undra'
''')
Out[91]:
In [92]:
run('''
SELECT
C.model,
AVG(revenue)
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
WHERE
C.model LIKE '%undra'
''')
Out[92]:
If you want to look at average revenue for car models that are like "%undra", run the LIKE query on the small table (car_table) first to figure out exacly what you're looking for, then use that information to search for the data you need from the sales_table
Use WHERE to only view a few days of data at first. If the query runs quickly, add a few days at a time. If it starts to run slowly, run just a few days at a time and paste results into excel to combine results (or use Python...ask me later!!!).
The query below won't work because SQLite doesn't recognize dates, but remember these concepts when working with other RDBMS
In [93]:
run('''
SELECT
revenue,
date
FROM
sales_table
WHERE
date = '1/1/14'
''')
Out[93]:
Let's say you were interested in seeing all Toyotas as well as cars with COGs of more than 13000. Write a query for the first group, then a query for the second group, and unite them with UNION. The result set won't show you repeats - if a row matches both result sets, it will only display once.
In [94]:
run('''
SELECT
make, model, cogs
FROM
car_table
WHERE
make = 'Toyota'
UNION
SELECT
make, model, cogs
FROM
car_table
WHERE
cogs > 13000
''')
Out[94]:
Note that we'll get the same results as above, but this query could run MUCH slower on a large table. It's tempting to use OR because it's faster to write, but unless you're dealing with very small tables, avoid the temptation. In 5 years of doing business analytics with SQL, I never used OR once. It's slow. Use a UNION.
In [95]:
run('''
SELECT
make, model, cogs
FROM
car_table
WHERE
make = 'Toyota' OR cogs > 13000
''')
Out[95]:
Let's say you want to look at cars made by Toyato and Honda, but you don't care about Subaru. It might be tempting to use a negative filter:
In [96]:
run('''
SELECT
*
FROM
car_table
WHERE
make != 'Subaru'
''')
Out[96]:
On a big table, this will run much more slowly than if you use a positive filter. Try this instead - it might require a little extra typing, but it will run much faster:
In [97]:
run('''
SELECT
*
FROM
car_table
WHERE
make in ('Toyota', 'Honda')
''')
Out[97]:
If you run into errors when you start writing your own queries, here are some things to make sure your query has:
SELECT
FROM
JOIN...ON
WHERE
GROUP BY
UNION
ORDER BY
LIMIT
http://tech.pro/tutorial/1555/10-easy-steps-to-a-complete-understanding-of-sql
http://sqlzoo.net/wiki/SELECT_.._WHERE
Let's say you recently opened a car dealership, and you now have one month's worth of sales data. You want to know how your sales team is doing.
Start by looking at the number of cars each person sold last month. The names of the sales team and the list of transactions are on different tables in your database, but SQL can help you with that:
In [98]:
run('''
SELECT
first_name || ' ' || last_name as Salesperson,
COUNT(*) as Cars_Sold
FROM
sales_table S
JOIN salesman_table M ON S.salesman_id = M.id
GROUP BY
Salesperson
ORDER BY
Cars_Sold DESC
''')
Out[98]:
Add on the average amount of revenue made per sale:
In [99]:
run('''
SELECT
first_name || ' ' || last_name as Salesperson,
COUNT(*) as Cars_Sold,
ROUND(AVG(revenue)) as Revenue_per_Sale
FROM
sales_table S
JOIN salesman_table M ON S.salesman_id = M.id
GROUP BY
Salesperson
ORDER BY
Cars_Sold DESC
''')
Out[99]:
Make it easier to compare the average revenue of Jared's sales to the average revenue of per sale overall by adding a column to see by what percent each salesperson's sales are more or less than average:
In [100]:
run('''
SELECT
first_name || ' ' || last_name as Salesperson,
COUNT(*) as Cars_Sold,
ROUND(AVG(revenue), 2) as Rev_per_Sale,
ROUND((((AVG(revenue)
- (SELECT AVG(revenue) from sales_table))
/(SELECT AVG(revenue) from sales_table))*100), 1) || ' %'
as RPS_Compared_to_Avg
FROM
sales_table S
JOIN salesman_table M ON S.salesman_id = M.id
GROUP BY
Salesperson
ORDER BY
Cars_Sold DESC
''')
Out[100]:
So maybe Jared is just selling cheaper cars.
Let's go further and compare the sale price of each car against the sticker price to see how low Jared was willing to negotiate with customers. Sticker price is in anther table, but again, that's no problem with SQL:
In [101]:
run('''
SELECT
first_name || ' ' || last_name as Salesperson,
COUNT(*) as Cars_Sold,
'$ ' || ROUND(AVG(revenue), 2) as Rev_per_Sale,
ROUND((((AVG(revenue)
- (SELECT AVG(revenue) from sales_table where salesman_id != 215))
/(SELECT AVG(revenue) from sales_table where salesman_id != 215))*100), 1) || ' %'
AS RPS_Compared_to_Avg,
ROUND((1-(SUM(revenue) / SUM(sticker_price)))*100, 1) || ' %' as Avg_Customer_Discount
FROM
sales_table S
JOIN salesman_table M ON S.salesman_id = M.id
JOIN car_table C ON S.model_id = C.model_id
GROUP BY
Salesperson
ORDER BY
Cars_Sold DESC
''')
Out[101]:
Looks like Jared is letting customers negotiate prices down much more than his peers.
But is this a real problem? How much is each salesperson contributing to our gross profits?
In [102]:
run('''
SELECT
first_name || ' ' || last_name as Salesperson,
COUNT(*) as Cars_Sold,
'$ ' || ROUND(AVG(revenue), 2) as Rev_per_Sale,
ROUND((((AVG(revenue)
- (SELECT AVG(revenue) from sales_table where salesman_id != 215))
/(SELECT AVG(revenue) from sales_table where salesman_id != 215))*100), 1) || ' %'
AS RPS_Compared_to_Peers,
ROUND((1-(SUM(revenue) / SUM(sticker_price)))*100, 1) || ' %' as Avg_Customer_Discount,
ROUND(((SUM(revenue)-sum(C.cogs))
/(SELECT SUM(revenue)-sum(cogs) FROM sales_table S join car_table C on S.model_id = C.model_id))*100, 1) || ' %' as Gross_Profit_Contribution
FROM
sales_table S
JOIN salesman_table M ON S.salesman_id = M.id
JOIN car_table C ON S.model_id = C.model_id
GROUP BY
Salesperson
ORDER BY
Cars_Sold DESC
''')
Out[102]:
SQL really lets you dig.
Some other quick examples - we could do a gender breakdown of customers per car model and add a total at the bottom:
In [103]:
run('''
SELECT
C.model as Car_Model,
ROUND(SUM(CASE WHEN CUST.gender = 'female' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Female Customers',
ROUND(SUM(CASE WHEN CUST.gender = 'male' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Male Customers'
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
JOIN cust_table CUST on S.customer_id = CUST.customer_id
GROUP BY
Car_Model
UNION ALL
SELECT
'Total:',
ROUND(SUM(CASE WHEN CUST.gender = 'female' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Female Customers',
ROUND(SUM(CASE WHEN CUST.gender = 'male' THEN 1 END)/(COUNT(S.id)*1.0), 2) AS '% Male Customers'
FROM
sales_table S
JOIN cust_table CUST on S.customer_id = CUST.customer_id
''')
Out[103]:
Easily create age groups and see how aggressively each group negotiates (judged by the difference between the actual sale amount and the sticker price):
In [104]:
run('''
SELECT
CASE WHEN age BETWEEN 18 AND 24 THEN '18-24 years'
WHEN age BETWEEN 25 AND 34 THEN '25-34 years'
WHEN age BETWEEN 35 AND 44 THEN '35-44 years'
WHEN age BETWEEN 45 AND 54 THEN '45-54 years'
WHEN age BETWEEN 55 AND 64 THEN '55-64 years'
END Age_Group,
ROUND((SUM(S.revenue)-SUM(C.sticker_price))/SUM(C.sticker_price), 2) as '% Paid Below Sticker Price'
FROM
sales_table S
JOIN car_table C on S.model_id = C.model_id
JOIN cust_table CUST on S.customer_id = CUST.customer_id
GROUP BY
Age_Group
''')
Out[104]: