In [2]:
import pandas as pd
import sqlite3

About SQL:

  • Stands for "Structured Query Language"

  • Can be pronounced "S. Q. L." or "sequel"

  • Requests are often called "queries"

  • One of the most widely-used languages for managing relational databases

  • A relational database has multiple tables. It's sort of like an Excel file:

    • Database = a single Excel file
    • Table = a single sheet in the same Excel file
  • SQL is the language, but the software optimized for storing relational databases that you can access with SQL varies (MySQL, Microsoft SQL Server, Oracle, SQLite)

    • We're using SQLite in these examples
  • For a little more background, this is a quick but helpful read: http://sql.learncodethehardway.org/book/introduction.html

Structure and formatting:

  • INDENTATIONS and RETURNS:

    • indentations and returns are arbitrary in MySQL
    • a query can be written entirely on a single line or on multiple lines
    • most people opt to use structure to organize their thoughts (but not necessary)
  • CAPITALIZATION:

    • Important:
      • Variables (anything that appears in quotation marks)
      • Nicknames
    • NOT Important:
      • Clauses like SELECT/select
      • Functions like SUM/sum
      • Column names like COLUMN_A/column_a
      • Table names like TABLE_A/table_a
  • DATA TYPES:

    • Varies depending on the database software that you're using (MySQL, Microsoft, etc) - easy to google it
    • SQLite (used in examples below):
      • To see the data types for each column in a table, run:
        • PRAGMA TABLE_INFO(sales_table)
      • Common data types:
        • TEXT: various characters. Dates will be saved as text in SQLite (major SQLite detractor)
        • INTEGER: integer
        • REAL: floating-point value
      • For more info: https://www.sqlite.org/datatype3.html
  • ORDER OF COMMANDS:

    • Very strict
    • Not all clauses need to be present in a query, but when they are present, then must be in the following order:
SELECT column_x, column_y FROM table_a A JOIN table_b B ON A.column = B.column WHERE column_x = 'variable1' AND column_y = 'variable2' GROUP BY column_x ORDER BY column_y LIMIT [some number]
DEBUGGING CHECKLIST:
  • Are all clauses in the right order?
  • If any variables are TEXT, are they:
    • enclosed in quotation marks?
    • properly capitalized?
  • IN SQL ONLY: If any variables are dates, are they:
    • enclosed in quotation marks?
    • in the standard date format? 'YYYY-MM-DD'

Setting up the SQLite Database:

It's not necessary to learn what's going on here (for now), just run the code to set up the database


In [10]:
# Set up a SQLite database using an excel file:
## http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%209%20-%20Loading%20data%20from%20SQL%20databases.ipynb

import xlrd as xl

# Select a sample file. 
# Using an excel file purely because pandas can select different sheets from .xlsx files and this will be useful when creating multiple tables within a single SQLite database
path = ('/Users/sarahbeckett-hile/Dropbox/Data_Bootcamp/Code/SQL/book_sales.xlsx')

# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sqlite_cars.sqlite")

# this pulls out the names of the sheets in the workbook. We'll use these to name the different tables in the SQLite database that we'll create
table_names = xl.open_workbook(path).sheet_names()

# this loop makes it possible to use any other .xls sheet, since the sheet names aren't called out specifically
for table in table_names:
    df = pd.read_excel(path, sheetname='{}'.format(table))
    con.execute("DROP TABLE IF EXISTS {}".format(table))
    pd.io.sql.to_sql(df, "{}".format(table), con, index=False)

# now the spreadsheets are in tables in a mini database!

# Finally, a little function to make it easy to run queries on this mini-database
def run(query):
    results = pd.read_sql("{}".format(query), con)
    return results

In [61]:
# Set up a SQLite database using several .csv files:

# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sqlite_cars.sqlite")

# create dataframes from each .csv file:
sales_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/sales_table.csv')
car_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/car_table.csv')
salesman_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/salesman_table.csv')
cust_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/cust_table.csv')

# make a list of the tables (dataframes) and table names:
tables = [sales_table, car_table, salesman_table, cust_table]
table_names = ['sales_table', 'car_table', 'salesman_table', 'cust_table']

# drop each table name if it already exists to avoid error if you rerun this bit of code
# then add it back (or for the first time, if the table didn't already exist)
for i in range(len(tables)):
    table_name = table_names[i]
    table = tables[i]
    con.execute("DROP TABLE IF EXISTS {}".format(table_name))
    pd.io.sql.to_sql(table, "{}".format(table_name), con, index=False)

# Finally, a little function to make it easy to run queries on this mini-database
def run(query):
    results = pd.read_sql("{}".format(query), con)
    return results

SELECT milk FROM dairy_aisle # "Show me milk, from the dairy aisle"

EXAMPLES & EXERCISES:

We've created a mini SQLite database containing 4 tables that you might expect to see in a company's system. Each of these tables has one or more columns that will correspond to similar columns in other tables.

Table names:

sales_table car_table salesman_table cust_table

Start by looking at the columns and their data types in the sales_table:


In [64]:
run('''
    PRAGMA TABLE_INFO(sales_table)
    ''')


Out[64]:
cid name type notnull dflt_value pk
0 0 id INTEGER 0 None 0
1 1 model_id INTEGER 0 None 0
2 2 customer_id INTEGER 0 None 0
3 3 revenue INTEGER 0 None 0
4 4 payment_type TEXT 0 None 0
5 5 salesman_id INTEGER 0 None 0
6 6 date TEXT 0 None 0

Do the same for the other tables to get a sense of they contain:


In [65]:
run('''
    
    ''')


Out[65]:
cid name type notnull dflt_value pk
0 0 model_id INTEGER 0 None 0
1 1 make TEXT 0 None 0
2 2 model TEXT 0 None 0
3 3 sticker_price INTEGER 0 None 0
4 4 cogs INTEGER 0 None 0

SELECT * FROM table_name

First, we'll look at everything - all rows and columns - from the sales_table, basically like looking at a simple excel spreadsheet. You do this with an asterisk after "SELECT":


In [60]:
run('''
    SELECT
        *
    FROM
        sales_table
    ''')
# "Show me all columns from the sales table"


Out[60]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 492 1/7/14
1 43161 20 967016 19140 finance 215 1/26/14
2 40112 46 819010 14720 cash 862 1/17/14
3 92495 31 633030 19010 finance 803 1/13/14
4 78000 51 341877 22022 finance 862 1/12/14
5 13154 75 720210 21409 cash 492 1/20/14
6 36535 31 908558 19894 finance 862 1/30/14
7 22813 46 705508 12960 finance 225 1/29/14
8 56245 36 248621 25938 cash 276 1/19/14
9 88118 51 341344 19844 finance 492 1/23/14
10 84469 31 733566 21441 finance 215 1/20/14
11 37412 31 750195 17462 cash 276 1/11/14
12 68513 31 461723 17020 cash 803 1/18/14
13 74380 20 468665 18040 finance 215 1/5/14
14 24047 75 556188 18671 finance 862 1/12/14
15 40603 51 241759 22506 finance 862 1/4/14
16 69883 20 161369 20460 finance 803 1/25/14
17 43338 46 731692 13440 finance 225 1/29/14
18 39727 51 656750 18634 finance 492 1/1/14
19 21022 22 619020 17312 cash 862 1/14/14
20 18303 36 413891 23318 cash 492 1/17/14
21 28176 20 965672 21780 cash 813 1/11/14
22 62604 75 217720 19418 cash 862 1/30/14
23 83930 22 946265 17756 cash 862 1/13/14
24 87626 31 140795 21441 finance 215 1/13/14
25 10682 31 145479 17905 cash 147 1/20/14
26 23444 20 961650 16720 finance 276 1/1/14
27 87998 31 457742 20778 cash 680 1/3/14
28 95912 75 978159 19667 cash 813 1/25/14
29 74410 20 474218 19360 finance 276 1/6/14
... ... ... ... ... ... ... ...
70 61014 36 253239 20174 finance 680 1/21/14
71 18946 20 614301 18480 cash 225 1/28/14
72 75834 51 168495 23474 cash 215 1/10/14
73 57366 31 699431 19231 cash 225 1/12/14
74 57711 22 528354 18865 cash 813 1/4/14
75 65870 36 751098 24104 cash 862 1/23/14
76 22147 36 480153 20436 finance 492 1/23/14
77 99759 51 208677 22990 cash 492 1/16/14
78 64406 20 349494 19580 cash 147 1/16/14
79 55585 22 206310 17756 finance 949 1/1/14
80 24710 75 185223 18920 cash 276 1/16/14
81 53831 75 656181 24646 finance 949 1/18/14
82 97494 19 393613 14006 cash 680 1/14/14
83 13004 36 488910 24890 finance 225 1/12/14
84 30779 75 846630 23401 finance 147 1/25/14
85 29708 51 534633 20328 finance 276 1/29/14
86 14082 46 909110 15200 cash 813 1/28/14
87 82119 19 628031 16189 cash 492 1/12/14
88 31815 46 743097 15040 cash 215 1/25/14
89 32345 22 476047 20197 finance 803 1/19/14
90 63526 51 862765 23232 finance 225 1/23/14
91 20256 19 597655 16734 cash 492 1/16/14
92 15930 36 486534 20960 cash 949 1/26/14
93 68446 75 339830 24148 cash 215 1/22/14
94 68330 20 283797 19140 cash 215 1/9/14
95 85407 22 635204 21751 finance 949 1/18/14
96 42428 75 619016 20413 cash 680 1/11/14
97 37620 51 183947 20328 finance 276 1/3/14
98 96344 20 731677 17600 finance 813 1/15/14
99 53062 31 907549 19894 finance 680 1/28/14

100 rows × 7 columns

Write a query to select everything from the car_table:


In [ ]:
run('''
    
    ''')
# Show me all columns from the car_table

Risks of using an asterisk:

  • Works well when a table doesn't have a significant number of rows
  • However, most table will have thousands or even millions of rows
  • With a large table, you'll run into some problems, running "SELECT * FROM table_name" might:
    • Take several minutes (or even hours) to return the information
    • Crash your computer
    • Muck up the server's processes, and you'll face the wrath of your company's system administrators once they figure out that you are the reason why the whole system has slowed down
LIMIT
- LIMIT N

Alternatively, you can use "LIMIT" and specify a few rows. This way, you'll just see the first few rows (and avoid overloading your system):


In [16]:
run('''
    SELECT
        *
    FROM
        sales_table
    LIMIT
        5
    ''')
# "Show me all columns from the sales_table, but limit it to the first 5 rows"


Out[16]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 492 1/7/14
1 43161 20 967016 19140 finance 215 1/26/14
2 40112 46 819010 14720 cash 862 1/17/14
3 92495 31 633030 19010 finance 803 1/13/14
4 78000 51 341877 22022 finance 862 1/12/14

Write a query to select all columns and the first 10 rows from the car_table:


In [ ]:
run('''
    
    ''')
# Show me all columns from the car_table, but but limit it to the first 10 rows
SELECT SPECIFIC COLUMNS:

Instead of using an asterisk for "all columns", you can specify a particular column or columns:


In [77]:
run('''
    SELECT
        model_id, customer_id
    FROM
        sales_table
    LIMIT
        5
    ''')
# "Show me all columns from the sales_table, but limit it to the first 5 rows"


Out[77]:
model_id customer_id
0 36 237906
1 20 967016
2 46 819010
3 31 633030
4 51 341877

Write a query to select model_id and model from the car_table and limit it to 10 results:


In [ ]:
run('''
    
    ''')
# Show me x and y columns from the car_table, but but limit it to the first 10 rows
WHERE
- WHERE column_name = x / != x / in (x, y) / not in (x, y)

WHERE allows you to select rows matching specific values:

= x : returns all rows where the values in column_name are equal to x

!= x: returns all rows that DO NOT match x

IN (x, y) : returns all rows where the values in column_name match either x or y

NOT IN (x, y) : returns all rows where the values in column_name DOES NOT match x or y


In [67]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash'
    LIMIT 5
    ''')
# "Show me all columns and rows from the sales table, but only where the payment type is cash, and limit it to the first 5 rows"


Out[67]:
id model_id customer_id revenue payment_type salesman_id date
0 40112 46 819010 14720 cash 862 1/17/14
1 13154 75 720210 21409 cash 492 1/20/14
2 56245 36 248621 25938 cash 276 1/19/14
3 37412 31 750195 17462 cash 276 1/11/14
4 68513 31 461723 17020 cash 803 1/18/14

Rewrite the query above but ask for all rows that DO NOT involve cash transactions:


In [ ]:
run('''
    
    ''')
# Show me all columns and rows from the sales table, but only where the payment type is NOT cash, and limit it to the first 5 rows

Try to rewrite the query to return rows where the model_id is either 31 or 36


In [68]:
run('''
    
    ''')
# Show me all columns and rows from the sales table, but only where the model_id is either 31 or 36


Out[68]:
id model_id customer_id revenue payment_type salesman_id date
0 54858 36 237906 21222 finance 492 1/7/14
1 92495 31 633030 19010 finance 803 1/13/14
2 36535 31 908558 19894 finance 862 1/30/14
3 56245 36 248621 25938 cash 276 1/19/14
4 84469 31 733566 21441 finance 215 1/20/14

You can add additional conditions to WHERE with the following format:

WHERE column_a = x AND column_b = y

In [79]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash' 
        AND model_id in (31, 36)
    LIMIT 5
    ''')


Out[79]:
id model_id customer_id revenue payment_type salesman_id date
0 56245 36 248621 25938 cash 276 1/19/14
1 37412 31 750195 17462 cash 276 1/11/14
2 68513 31 461723 17020 cash 803 1/18/14
3 18303 36 413891 23318 cash 492 1/17/14
4 10682 31 145479 17905 cash 147 1/20/14

ORDER BY:

ORDER BY column_name

ORDER BY sorts the data by a particular column.

By default, ORDER BY sorts numbers or alphabetically in ascending order.

To sort in descending order, add DESC after the column name


In [70]:
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue
    LIMIT 5
    ''')
# Show me all columns and rows for cash transactions, starting with the lowest sale price, limit it to 5 results


Out[70]:
id model_id customer_id revenue payment_type salesman_id date
0 30046 46 982483 12640 cash 803 1/29/14
1 70434 46 944843 13600 cash 949 1/25/14
2 98201 19 697575 13642 cash 680 1/4/14
3 97494 19 393613 14006 cash 680 1/14/14
4 40112 46 819010 14720 cash 862 1/17/14

Rewrite the query above to order by a different column and in descending order.


In [ ]:
run('''
    
    ''')
# Translation: Show me all columns and rows for cash transactions, starting with the lowest sale price

GROUP BY & FUNCTIONS:

  • GROUP BY column_name : lets you specify one or more columns to create groups. Each unique value in the GROUP BY column will get its own group
  • FUNCTIONS:

In [74]:
run('''
    SELECT
        model_id, sum(revenue)
    FROM
        sales_table
    GROUP BY
        model_id
    ''')


Out[74]:
model_id sum(revenue)
0 19 157886
1 20 212080
2 22 201083
3 31 234748
4 36 365751
5 46 154880
6 51 297418
7 75 317406

In [ ]:
Add

In [ ]:


In [ ]:

The model_id isn't very helpful for analysis - the name would be better. However, each car's Model and Make information isn't located on the sales_table. Write out a query to return all columns and rows from the car_table to find out what information might be helpful in there.


In [15]:
run('''
    ???
    ''')


Out[15]:
model_id make model sticker_price
0 20 Toyota Camry 22000
1 46 Toyota Corolla 16000
2 51 Toyota Prius 24200
3 19 Honda Civic 18190
4 31 Honda Accord 22105
5 75 Subaru Outback 24895
6 22 Subaru Forester 22195
7 36 Toyota Tundra 26200

You can add data from two different tables using a "JOIN" in the FROM clause:


In [16]:
run('''
    SELECT
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')


Out[16]:
make model revenue
0 Toyota Tundra 25938
1 Toyota Tundra 25937
2 Toyota Tundra 24628
3 Toyota Tundra 24366
4 Subaru Outback 24148

What's going on here:

  • To JOIN two tables, find columns on each table that contain corresponding data
  • In this case, the model_id column in the sales_table matched up to the model_id column in the car_table. Both have the ID numbers that the dealership has assigned to each model of the cars they are selling.
  • Follow the format below, where a column is referred to by table_name.column_name:
FROM tableA JOIN tableB ON tableA.column_name = tableB.column_name
  • Take care: corresponding columns from different tables don't necessarily always have identical names. So, if the model_id column in the car_table were named car_model_id instead, this would be the FROM clause:
FROM sales_table JOIN car_table ON sales_table.model_id = car_table.car_model_id

Let's say you wanted to see the model_id in the table as well. If you just try adding "model_id" in the SELECT clause, suddenly you'll hit an error:


In [17]:
run('''
    SELECT
        model_id,
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        10
    ''')


ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 4))

---------------------------------------------------------------------------
DatabaseError                             Traceback (most recent call last)
<ipython-input-17-5ca1644591a3> in <module>()
     15     LIMIT
     16         10
---> 17     ''')

<ipython-input-7-6fb62674a9bd> in run(query)
      4 # a pandas dataframe
      5 def run(query):
----> 6     results = pd.read_sql("{}".format(query), con)
      7     return results

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns)
    386         return pandas_sql.read_sql(
    387             sql, index_col=index_col, params=params,
--> 388             coerce_float=coerce_float, parse_dates=parse_dates)
    389     else:
    390         if isinstance(pandas_sql, PandasSQLLegacy):

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in read_sql(self, sql, index_col, coerce_float, params, parse_dates)
   1020                  parse_dates=None):
   1021         args = _convert_params(sql, params)
-> 1022         cursor = self.execute(*args)
   1023         columns = [col_desc[0] for col_desc in cursor.description]
   1024         data = self._fetchall_as_list(cursor)

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1015 
   1016             ex = DatabaseError("Execution failed on sql: %s" % args[0])
-> 1017             raise_with_traceback(ex)
   1018 
   1019     def read_sql(self, sql, index_col=None, coerce_float=True, params=None,

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    703         if traceback == Ellipsis:
    704             _, _, traceback = sys.exc_info()
--> 705         raise exc.with_traceback(traceback)
    706 else:
    707     # this version of raise is a syntax error in Python 3

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1004                 cur.execute(*args, **kwargs)
   1005             else:
-> 1006                 cur.execute(*args)
   1007             return cur
   1008         except Exception as e:

DatabaseError: Execution failed on sql: 
    SELECT
        model_id,
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        10
    

This is because you've introduced another table with a column named "model_id" - the query doesn't know if it should pick the data from the model_id column in the sales_table or the car_table. So, you have to specify which table you want the query to use:


In [18]:
run('''
    SELECT
        sales_table.model_id,
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')


Out[18]:
model_id make model revenue
0 36 Toyota Tundra 25938
1 36 Toyota Tundra 25937
2 36 Toyota Tundra 24628
3 36 Toyota Tundra 24366
4 75 Subaru Outback 24148

Note: because the data in the sales_table.model_id column corresponds to the data in car_table.model_id, you can specify either one in the SELECT clause in this case. There are instances where this won't be the case, but that can be addressed later.

Although you technically only need to specify the column's table name in the SELECT clause when there are one or more columns of the same name (so if the joined tables both have a column named "model_id", for instance), it's a good habit to get into labeling all column names with the table they are coming from.

You can also give tables nicknames so that you don't have to write out the entire name of the table each time. You can use the nickname in the SELECT clause, even though you don't state the nicknae until the FROM clause. Just put the nickname immediately after in the FROM clause. One or two letters for a nickname is pretty common.

In this case, I'm going to give sales_table the nickname "S", and the car_table the nickname "C". Now it's easy to label the columns in the SELECT clause:


In [34]:
run('''
    SELECT
        S.model_id,
        C.make,
        C.model,
        S.revenue
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')


Out[34]:
model_id make model revenue
0 36 Toyota Tundra 25938
1 36 Toyota Tundra 25937
2 36 Toyota Tundra 24628
3 36 Toyota Tundra 24366
4 75 Subaru Outback 24148

Now that we can see the Make and Model information for each car, let's throw in the sticker_price as well - also located on the car_table:


In [26]:
run('''
    SELECT
        C.make,
        C.model,
        S.revenue, 
        C.sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')


Out[26]:
make model revenue sticker_price
0 Toyota Tundra 25938 26200
1 Toyota Tundra 25937 26200
2 Toyota Tundra 24628 26200
3 Toyota Tundra 24366 26200
4 Subaru Outback 24148 24895

Now we can see that each car sold for some amount less than the listing price, presumably because the customer negotiated a better price with the salesman. You can add a function in the SELECT clause to see the percent under the listing price. You can perform any mathematical function by using the column names:

  • columnA + columnB = sumAB
  • columnA * columnB = productAB

In [27]:
## hacky fix, need to figure out the deal with integers and pandas.to_sql on SQLite
run('''
    SELECT
        C.make,
        C.model,
        S.revenue, 
        C.sticker_price,
        S.revenue*100/C.sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')


Out[27]:
make model revenue sticker_price S.revenue*100/C.sticker_price
0 Toyota Tundra 25938 26200 99
1 Toyota Tundra 25937 26200 98
2 Toyota Tundra 24628 26200 94
3 Toyota Tundra 24366 26200 93
4 Subaru Outback 24148 24895 96

But that column name looks like a mess, so you can rename it by adding "as [desired_column_name]"


In [31]:
run('''
    SELECT
        C.make,
        C.model,
        S.revenue, 
        C.sticker_price,
        S.revenue*100/C.sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')


Out[31]:
make model revenue sticker_price percent_of_sticker_price
0 Toyota Tundra 25938 26200 99
1 Toyota Tundra 25937 26200 98
2 Toyota Tundra 24628 26200 94
3 Toyota Tundra 24366 26200 93
4 Subaru Outback 24148 24895 96

Let's say you want to see the average price that each model of car sold for compared to the list price for the entire month. This is where the GROUP BY clause helps. We'll also need to use the AVG( ) function in the SELECT clause:


In [34]:
# I've elimated the constraints from the WHERE, ORDER BY and LIMIT clause so all rows from the sale_table are included in the average
run('''
    SELECT
        C.make,
        C.model,
        AVG(S.revenue) AS average_revenue, 
        C.sticker_price,
        S.revenue*100/C.sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    GROUP BY
        model
    ''')


Out[34]:
make model average_revenue sticker_price percent_of_sticker_price
0 Honda Accord 19562.333333 22105 89
1 Toyota Camry 19280.000000 22000 80
2 Honda Civic 15788.600000 18190 91
3 Toyota Corolla 14080.000000 16000 94
4 Subaru Forester 18280.272727 22195 97
5 Subaru Outback 21160.400000 24895 81
6 Toyota Prius 21244.142857 24200 84
7 Toyota Tundra 22859.437500 26200 80

The Average_Revenue numbers are messy, you can round them with ROUND(). To round to the nearest cent: ROUND( ,2)


In [51]:
run('''
    SELECT
        make,
        model,
        ROUND(AVG(revenue), 2) AS average_revenue, 
        sticker_price,
        revenue*100/sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        C.model in ('Accord', 'Camry')
    GROUP BY
        model
    ''')


Out[51]:
make model average_revenue sticker_price percent_of_sticker_price
0 Honda Accord 19562.33 22105 96
1 Toyota Camry 19280.00 22000 99

ORDER BY the percent_of_sticker_price to highlight which cars are selling for the furthest below their list price:


In [29]:
run('''
    SELECT
        make,
        model,
        ROUND(AVG(revenue), 2) AS average_revenue, 
        sticker_price,
        revenue*100/sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    GROUP BY
        model
    ORDER BY 
        percent_of_sticker_price
    ''')


Out[29]:
make model average_revenue sticker_price percent_of_sticker_price
0 Toyota Camry 19280.00 22000 80
1 Toyota Tundra 22859.44 26200 80
2 Subaru Outback 21160.40 24895 81
3 Toyota Prius 21244.14 24200 84
4 Honda Accord 19562.33 22105 89
5 Honda Civic 15788.60 18190 91
6 Toyota Corolla 14080.00 16000 94
7 Subaru Forester 18280.27 22195 97

One last thing - we'll add a count of how many sales were made of each car model. If we use "id" column in car_sales, we can count


In [46]:
run('''
    SELECT
        make,
        model,
        count() as units_sold,
        ROUND(AVG(revenue), 2) AS average_revenue, 
        sticker_price,
        revenue*100/sticker_price AS percent_of_sticker_price
    FROM
         sales_table S
            JOIN car_table C
                ON S.model_id = C.model_id 
    GROUP BY
        model
    ORDER BY 
        percent_of_sticker_price
    ''')


Out[46]:
make model units_sold average_revenue sticker_price percent_of_sticker_price
0 Toyota Camry 11 19280.00 22000 80
1 Toyota Tundra 16 22859.44 26200 80
2 Subaru Outback 15 21160.40 24895 81
3 Toyota Prius 14 21244.14 24200 84
4 Honda Accord 12 19562.33 22105 89
5 Honda Civic 10 15788.60 18190 91
6 Toyota Corolla 11 14080.00 16000 94
7 Subaru Forester 11 18280.27 22195 97

So for some reason, customers are negotiating harder (or more successfully) on the Toyota Camry and Tundra - and it would probably be causing a problem for the dealership. This kind of issue comes up a lot in B2B as well, since companies have bargaining power with their partners.

Being able to pull this information quickly makes it easy to pinpoint problems, although more queries are then needed to dig into the reasons. The data in this SQLite db could help answer questions like:

  • What age group is buying Tundras and Camrys? Maybe if younger people are drawn to them, they bargain harder because they have less expenadable income
  • What salesperson has been selling these cars? Maybe he's not a very good negotiator

Salesperson performance:

Count the number of women and men who bought each model:


In [31]:
#run('''
#    SELECT
#        model,
#        sum(if(gender='female', 1, 0)) as female,
#        sum(if(gender='male', 1, 0)) as male
#    FROM
#        sales_table
#        JOIN car_table 
#            ON sales_table.model_id = car_table.model_id
#        JOIN cust_table
#            ON sales_table.customer_id = cust_table.customer_id
#    GROUP BY
#        model
#    ''')

In [72]:
run('''
    SELECT
        first_name,
        last_name,
        count() as units_sold,
        ROUND(AVG(revenue), 2) AS avg_revenue, 
        ROUND(AVG(sticker_price), 2) avg_sticker_price,
        ROUND(AVG(revenue)/AVG(sticker_price), 2) AS percent_of_sticker_price
    FROM
         sales_table S
            JOIN car_table C
                ON S.model_id = C.model_id 
            JOIN salesman_table SM 
                ON S.salesman_id = SM.id
    GROUP BY
        first_name
    ORDER BY 
        percent_of_sticker_price
    ''')


Out[72]:
first_name last_name units_sold avg_revenue avg_sticker_price percent_of_sticker_price
0 Claudine Hatch 12 19463.92 23465.00 0.83
1 Kathleen March 8 19053.00 22549.38 0.84
2 Rosemarie Self 5 19803.20 23479.00 0.84
3 Justin Avellaneda 10 19393.70 22908.00 0.85
4 Matthew Luna 11 19856.64 23343.18 0.85
5 Michael Hill 10 17110.30 20167.00 0.85
6 Joseph Seney 11 19521.09 22198.18 0.88
7 Elton Elzy 12 20012.67 22525.00 0.89
8 Samantha Douglas 8 19600.75 22095.63 0.89
9 Jared Case 13 20084.00 21737.69 0.92

Although Claudine Hatch has good volume, she is giving up too much in negotations - she might need extra training. Meanwhile, Jared Case might have earned himself a bonus for performance this month.

What impact is Claudine's weak negotiating having on the overall margins for the dealership? How much could they gain if she got up to Jared's standards? We'll include cogs to get some Net info:


In [89]:
run('''
    SELECT
        first_name,
        last_name,
        count() as units_sold,
        sum(revenue) AS total_revenue, 
        sum(cogs) total_cogs, 
        sum(revenue) - sum(cogs) AS Net,
        (sum(revenue) - sum(cogs))*100 / sum(revenue) Margin,
        (sum(revenue)*100 / (select sum(revenue) from sales_table)) as percent_of_gross,
        (sum(revenue) - sum(cogs))*100 / (select sum(revenue)-sum(cogs) from sales_table S join car_table C on S.model_id = C.model_id) as percent_of_net
    FROM
         sales_table S
            JOIN car_table C
                ON S.model_id = C.model_id 
            JOIN salesman_table SM 
                ON S.salesman_id = SM.id
    GROUP BY
        first_name
    ORDER BY 
        total_revenue desc
    ''')


Out[89]:
first_name last_name units_sold total_revenue total_cogs Net Margin percent_of_gross percent_of_net
0 Jared Case 13 261092 169554 91538 35 13 15
1 Elton Elzy 12 240152 162180 77972 32 12 13
2 Claudine Hatch 12 233567 168948 64619 27 12 10
3 Matthew Luna 11 218423 154065 64358 29 11 10
4 Joseph Seney 11 214732 146508 68224 31 11 11
5 Justin Avellaneda 10 193937 137448 56489 29 9 9
6 Michael Hill 10 171103 121002 50101 29 8 8
7 Samantha Douglas 8 156806 106059 50747 32 8 8
8 Kathleen March 8 152424 108237 44187 28 7 7
9 Rosemarie Self 5 99016 70437 28579 28 5 4

Claudine's sales make up 12% of gross revenues, but only contribute 10% to net - she's hurting margins and could benefit from training, but doesn't appear to be a lost cause. However, Rosemarie Self has weak margins and is barely moving inventory. This might not be the right role for her.

You could even use this data to figure out which salesperson would be best to approach a new customer who just walked in the door.

WORK IN PROGRESS STUFF

Debugging Tips:

It's crazy easy to make mistakes in SQL,

Remeber this structure: Although capitalization and indentation do not matter, the order of the clauses cannot change. You don't need every clause in a query to make it work, but if you do use it, it has to appear in the right order.

Some of these we haven't covered yet, but will below. For now, just use


In [ ]:
SELECT
FROM 
WHERE
GROUP BY 
ORDER BY
LIMIT

In [42]:
debugging = pd.DataFrame()

In [48]:
tip_1 = ["If you are joining more than one table, check to make sure that all of your columns are labeled"]
debugging.append(tip_1)


Out[48]:
0
0 If you are joining more than one table, check ...

In [47]:



Out[47]:
0
0 If you are joining more than one table, check ...

In [ ]:
How to read basic queries in English:

Queries are much more like normal English syntax than most programming languages out there. Reading them out loud can help with understanding their structure. 

The two most basic components of a query are nearly always "SELECT" and "FROM".  Think about going to the grocery store to look for milk. This could be the structure of your search: