In [2]:
import pandas as pd
import sqlite3
Stands for "Structured Query Language"
Can be pronounced "S. Q. L." or "sequel"
Requests are often called "queries"
One of the most widely-used languages for managing relational databases
A relational database has multiple tables. It's sort of like an Excel file:
SQL is the language, but the software optimized for storing relational databases that you can access with SQL varies (MySQL, Microsoft SQL Server, Oracle, SQLite)
For a little more background, this is a quick but helpful read: http://sql.learncodethehardway.org/book/introduction.html
Structure and formatting:
INDENTATIONS and RETURNS:
CAPITALIZATION:
DATA TYPES:
ORDER OF COMMANDS:
It's not necessary to learn what's going on here (for now), just run the code to set up the database
In [10]:
# Set up a SQLite database using an excel file:
## http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%209%20-%20Loading%20data%20from%20SQL%20databases.ipynb
import xlrd as xl
# Select a sample file.
# Using an excel file purely because pandas can select different sheets from .xlsx files and this will be useful when creating multiple tables within a single SQLite database
path = ('/Users/sarahbeckett-hile/Dropbox/Data_Bootcamp/Code/SQL/book_sales.xlsx')
# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sqlite_cars.sqlite")
# this pulls out the names of the sheets in the workbook. We'll use these to name the different tables in the SQLite database that we'll create
table_names = xl.open_workbook(path).sheet_names()
# this loop makes it possible to use any other .xls sheet, since the sheet names aren't called out specifically
for table in table_names:
df = pd.read_excel(path, sheetname='{}'.format(table))
con.execute("DROP TABLE IF EXISTS {}".format(table))
pd.io.sql.to_sql(df, "{}".format(table), con, index=False)
# now the spreadsheets are in tables in a mini database!
# Finally, a little function to make it easy to run queries on this mini-database
def run(query):
results = pd.read_sql("{}".format(query), con)
return results
In [61]:
# Set up a SQLite database using several .csv files:
# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sqlite_cars.sqlite")
# create dataframes from each .csv file:
sales_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/sales_table.csv')
car_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/car_table.csv')
salesman_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/salesman_table.csv')
cust_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/cust_table.csv')
# make a list of the tables (dataframes) and table names:
tables = [sales_table, car_table, salesman_table, cust_table]
table_names = ['sales_table', 'car_table', 'salesman_table', 'cust_table']
# drop each table name if it already exists to avoid error if you rerun this bit of code
# then add it back (or for the first time, if the table didn't already exist)
for i in range(len(tables)):
table_name = table_names[i]
table = tables[i]
con.execute("DROP TABLE IF EXISTS {}".format(table_name))
pd.io.sql.to_sql(table, "{}".format(table_name), con, index=False)
# Finally, a little function to make it easy to run queries on this mini-database
def run(query):
results = pd.read_sql("{}".format(query), con)
return results
We've created a mini SQLite database containing 4 tables that you might expect to see in a company's system. Each of these tables has one or more columns that will correspond to similar columns in other tables.
Table names:
Start by looking at the columns and their data types in the sales_table:
In [64]:
run('''
PRAGMA TABLE_INFO(sales_table)
''')
Out[64]:
In [65]:
run('''
''')
Out[65]:
First, we'll look at everything - all rows and columns - from the sales_table, basically like looking at a simple excel spreadsheet. You do this with an asterisk after "SELECT":
In [60]:
run('''
SELECT
*
FROM
sales_table
''')
# "Show me all columns from the sales table"
Out[60]:
In [ ]:
run('''
''')
# Show me all columns from the car_table
Risks of using an asterisk:
- LIMIT N
Alternatively, you can use "LIMIT" and specify a few rows. This way, you'll just see the first few rows (and avoid overloading your system):
In [16]:
run('''
SELECT
*
FROM
sales_table
LIMIT
5
''')
# "Show me all columns from the sales_table, but limit it to the first 5 rows"
Out[16]:
Write a query to select all columns and the first 10 rows from the car_table:
In [ ]:
run('''
''')
# Show me all columns from the car_table, but but limit it to the first 10 rows
Instead of using an asterisk for "all columns", you can specify a particular column or columns:
In [77]:
run('''
SELECT
model_id, customer_id
FROM
sales_table
LIMIT
5
''')
# "Show me all columns from the sales_table, but limit it to the first 5 rows"
Out[77]:
In [ ]:
run('''
''')
# Show me x and y columns from the car_table, but but limit it to the first 10 rows
- WHERE column_name = x / != x / in (x, y) / not in (x, y)
WHERE allows you to select rows matching specific values:
= x : returns all rows where the values in column_name are equal to x
!= x: returns all rows that DO NOT match x
IN (x, y) : returns all rows where the values in column_name match either x or y
NOT IN (x, y) : returns all rows where the values in column_name DOES NOT match x or y
In [67]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type = 'cash'
LIMIT 5
''')
# "Show me all columns and rows from the sales table, but only where the payment type is cash, and limit it to the first 5 rows"
Out[67]:
Rewrite the query above but ask for all rows that DO NOT involve cash transactions:
In [ ]:
run('''
''')
# Show me all columns and rows from the sales table, but only where the payment type is NOT cash, and limit it to the first 5 rows
In [68]:
run('''
''')
# Show me all columns and rows from the sales table, but only where the model_id is either 31 or 36
Out[68]:
You can add additional conditions to WHERE with the following format:
WHERE column_a = x AND column_b = y
In [79]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type = 'cash'
AND model_id in (31, 36)
LIMIT 5
''')
Out[79]:
ORDER BY column_name
ORDER BY sorts the data by a particular column.
By default, ORDER BY sorts numbers or alphabetically in ascending order.
To sort in descending order, add DESC after the column name
In [70]:
run('''
SELECT
*
FROM
sales_table
WHERE
payment_type = 'cash'
ORDER BY
revenue
LIMIT 5
''')
# Show me all columns and rows for cash transactions, starting with the lowest sale price, limit it to 5 results
Out[70]:
In [ ]:
run('''
''')
# Translation: Show me all columns and rows for cash transactions, starting with the lowest sale price
In [74]:
run('''
SELECT
model_id, sum(revenue)
FROM
sales_table
GROUP BY
model_id
''')
Out[74]:
In [ ]:
Add
In [ ]:
In [ ]:
The model_id isn't very helpful for analysis - the name would be better. However, each car's Model and Make information isn't located on the sales_table. Write out a query to return all columns and rows from the car_table to find out what information might be helpful in there.
In [15]:
run('''
???
''')
Out[15]:
You can add data from two different tables using a "JOIN" in the FROM clause:
In [16]:
run('''
SELECT
make,
model,
revenue
FROM
sales_table
JOIN car_table
ON sales_table.model_id = car_table.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
5
''')
Out[16]:
What's going on here:
Let's say you wanted to see the model_id in the table as well. If you just try adding "model_id" in the SELECT clause, suddenly you'll hit an error:
In [17]:
run('''
SELECT
model_id,
make,
model,
revenue
FROM
sales_table
JOIN car_table
ON sales_table.model_id = car_table.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
10
''')
This is because you've introduced another table with a column named "model_id" - the query doesn't know if it should pick the data from the model_id column in the sales_table or the car_table. So, you have to specify which table you want the query to use:
In [18]:
run('''
SELECT
sales_table.model_id,
make,
model,
revenue
FROM
sales_table
JOIN car_table
ON sales_table.model_id = car_table.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
5
''')
Out[18]:
Note: because the data in the sales_table.model_id column corresponds to the data in car_table.model_id, you can specify either one in the SELECT clause in this case. There are instances where this won't be the case, but that can be addressed later.
Although you technically only need to specify the column's table name in the SELECT clause when there are one or more columns of the same name (so if the joined tables both have a column named "model_id", for instance), it's a good habit to get into labeling all column names with the table they are coming from.
You can also give tables nicknames so that you don't have to write out the entire name of the table each time. You can use the nickname in the SELECT clause, even though you don't state the nicknae until the FROM clause. Just put the nickname immediately after in the FROM clause. One or two letters for a nickname is pretty common.
In this case, I'm going to give sales_table the nickname "S", and the car_table the nickname "C". Now it's easy to label the columns in the SELECT clause:
In [34]:
run('''
SELECT
S.model_id,
C.make,
C.model,
S.revenue
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
5
''')
Out[34]:
Now that we can see the Make and Model information for each car, let's throw in the sticker_price as well - also located on the car_table:
In [26]:
run('''
SELECT
C.make,
C.model,
S.revenue,
C.sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
5
''')
Out[26]:
Now we can see that each car sold for some amount less than the listing price, presumably because the customer negotiated a better price with the salesman. You can add a function in the SELECT clause to see the percent under the listing price. You can perform any mathematical function by using the column names:
In [27]:
## hacky fix, need to figure out the deal with integers and pandas.to_sql on SQLite
run('''
SELECT
C.make,
C.model,
S.revenue,
C.sticker_price,
S.revenue*100/C.sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
5
''')
Out[27]:
But that column name looks like a mess, so you can rename it by adding "as [desired_column_name]"
In [31]:
run('''
SELECT
C.make,
C.model,
S.revenue,
C.sticker_price,
S.revenue*100/C.sticker_price AS percent_of_sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
WHERE
payment_type = 'cash'
ORDER BY
revenue DESC
LIMIT
5
''')
Out[31]:
Let's say you want to see the average price that each model of car sold for compared to the list price for the entire month. This is where the GROUP BY clause helps. We'll also need to use the AVG( ) function in the SELECT clause:
In [34]:
# I've elimated the constraints from the WHERE, ORDER BY and LIMIT clause so all rows from the sale_table are included in the average
run('''
SELECT
C.make,
C.model,
AVG(S.revenue) AS average_revenue,
C.sticker_price,
S.revenue*100/C.sticker_price AS percent_of_sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
GROUP BY
model
''')
Out[34]:
The Average_Revenue numbers are messy, you can round them with ROUND(). To round to the nearest cent: ROUND( ,2)
In [51]:
run('''
SELECT
make,
model,
ROUND(AVG(revenue), 2) AS average_revenue,
sticker_price,
revenue*100/sticker_price AS percent_of_sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
WHERE
C.model in ('Accord', 'Camry')
GROUP BY
model
''')
Out[51]:
ORDER BY the percent_of_sticker_price to highlight which cars are selling for the furthest below their list price:
In [29]:
run('''
SELECT
make,
model,
ROUND(AVG(revenue), 2) AS average_revenue,
sticker_price,
revenue*100/sticker_price AS percent_of_sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
GROUP BY
model
ORDER BY
percent_of_sticker_price
''')
Out[29]:
One last thing - we'll add a count of how many sales were made of each car model. If we use "id" column in car_sales, we can count
In [46]:
run('''
SELECT
make,
model,
count() as units_sold,
ROUND(AVG(revenue), 2) AS average_revenue,
sticker_price,
revenue*100/sticker_price AS percent_of_sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
GROUP BY
model
ORDER BY
percent_of_sticker_price
''')
Out[46]:
So for some reason, customers are negotiating harder (or more successfully) on the Toyota Camry and Tundra - and it would probably be causing a problem for the dealership. This kind of issue comes up a lot in B2B as well, since companies have bargaining power with their partners.
Being able to pull this information quickly makes it easy to pinpoint problems, although more queries are then needed to dig into the reasons. The data in this SQLite db could help answer questions like:
Count the number of women and men who bought each model:
In [31]:
#run('''
# SELECT
# model,
# sum(if(gender='female', 1, 0)) as female,
# sum(if(gender='male', 1, 0)) as male
# FROM
# sales_table
# JOIN car_table
# ON sales_table.model_id = car_table.model_id
# JOIN cust_table
# ON sales_table.customer_id = cust_table.customer_id
# GROUP BY
# model
# ''')
In [72]:
run('''
SELECT
first_name,
last_name,
count() as units_sold,
ROUND(AVG(revenue), 2) AS avg_revenue,
ROUND(AVG(sticker_price), 2) avg_sticker_price,
ROUND(AVG(revenue)/AVG(sticker_price), 2) AS percent_of_sticker_price
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
JOIN salesman_table SM
ON S.salesman_id = SM.id
GROUP BY
first_name
ORDER BY
percent_of_sticker_price
''')
Out[72]:
Although Claudine Hatch has good volume, she is giving up too much in negotations - she might need extra training. Meanwhile, Jared Case might have earned himself a bonus for performance this month.
What impact is Claudine's weak negotiating having on the overall margins for the dealership? How much could they gain if she got up to Jared's standards? We'll include cogs to get some Net info:
In [89]:
run('''
SELECT
first_name,
last_name,
count() as units_sold,
sum(revenue) AS total_revenue,
sum(cogs) total_cogs,
sum(revenue) - sum(cogs) AS Net,
(sum(revenue) - sum(cogs))*100 / sum(revenue) Margin,
(sum(revenue)*100 / (select sum(revenue) from sales_table)) as percent_of_gross,
(sum(revenue) - sum(cogs))*100 / (select sum(revenue)-sum(cogs) from sales_table S join car_table C on S.model_id = C.model_id) as percent_of_net
FROM
sales_table S
JOIN car_table C
ON S.model_id = C.model_id
JOIN salesman_table SM
ON S.salesman_id = SM.id
GROUP BY
first_name
ORDER BY
total_revenue desc
''')
Out[89]:
Claudine's sales make up 12% of gross revenues, but only contribute 10% to net - she's hurting margins and could benefit from training, but doesn't appear to be a lost cause. However, Rosemarie Self has weak margins and is barely moving inventory. This might not be the right role for her.
You could even use this data to figure out which salesperson would be best to approach a new customer who just walked in the door.
It's crazy easy to make mistakes in SQL,
Remeber this structure: Although capitalization and indentation do not matter, the order of the clauses cannot change. You don't need every clause in a query to make it work, but if you do use it, it has to appear in the right order.
Some of these we haven't covered yet, but will below. For now, just use
In [ ]:
SELECT
FROM
WHERE
GROUP BY
ORDER BY
LIMIT
In [42]:
debugging = pd.DataFrame()
In [48]:
tip_1 = ["If you are joining more than one table, check to make sure that all of your columns are labeled"]
debugging.append(tip_1)
Out[48]:
In [47]:
Out[47]:
In [ ]:
How to read basic queries in English:
Queries are much more like normal English syntax than most programming languages out there. Reading them out loud can help with understanding their structure.
The two most basic components of a query are nearly always "SELECT" and "FROM". Think about going to the grocery store to look for milk. This could be the structure of your search: