In [2]:

    
import pandas as pd
import sqlite3

About SQL:

Stands for "Structured Query Language"
Can be pronounced "S. Q. L." or "sequel"
Requests are often called "queries"
One of the most widely-used languages for managing relational databases
A relational database has multiple tables. It's sort of like an Excel file:
- Database = a single Excel file
- Table = a single sheet in the same Excel file
SQL is the language, but the software optimized for storing relational databases that you can access with SQL varies (MySQL, Microsoft SQL Server, Oracle, SQLite)
- We're using SQLite in these examples
For a little more background, this is a quick but helpful read: http://sql.learncodethehardway.org/book/introduction.html

Structure and formatting:

INDENTATIONS and RETURNS:
- indentations and returns are arbitrary in MySQL
- a query can be written entirely on a single line or on multiple lines
- most people opt to use structure to organize their thoughts (but not necessary)
CAPITALIZATION:
- Important:
  - Variables (anything that appears in quotation marks)
  - Nicknames
- NOT Important:
  - Clauses like SELECT/select
  - Functions like SUM/sum
  - Column names like COLUMN_A/column_a
  - Table names like TABLE_A/table_a
DATA TYPES:
- Varies depending on the database software that you're using (MySQL, Microsoft, etc) - easy to google it
- SQLite (used in examples below):
  - To see the data types for each column in a table, run:
    - PRAGMA TABLE_INFO(sales_table)
  - Common data types:
    - TEXT: various characters. Dates will be saved as text in SQLite (major SQLite detractor)
    - INTEGER: integer
    - REAL: floating-point value
  - For more info: https://www.sqlite.org/datatype3.html
ORDER OF COMMANDS:
- Very strict
- Not all clauses need to be present in a query, but when they are present, then must be in the following order:

SELECT column_x, column_y FROM table_a A JOIN table_b B ON A.column = B.column WHERE column_x = 'variable1' AND column_y = 'variable2' GROUP BY column_x ORDER BY column_y LIMIT [some number]

DEBUGGING CHECKLIST:

Are all clauses in the right order?
If any variables are TEXT, are they:
- enclosed in quotation marks?
- properly capitalized?
IN SQL ONLY: If any variables are dates, are they:
- enclosed in quotation marks?
- in the standard date format? 'YYYY-MM-DD'

Setting up the SQLite Database:

It's not necessary to learn what's going on here (for now), just run the code to set up the database



In [10]:

    
# Set up a SQLite database using an excel file:
## http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%209%20-%20Loading%20data%20from%20SQL%20databases.ipynb

import xlrd as xl

# Select a sample file. 
# Using an excel file purely because pandas can select different sheets from .xlsx files and this will be useful when creating multiple tables within a single SQLite database
path = ('/Users/sarahbeckett-hile/Dropbox/Data_Bootcamp/Code/SQL/book_sales.xlsx')

# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sqlite_cars.sqlite")

# this pulls out the names of the sheets in the workbook. We'll use these to name the different tables in the SQLite database that we'll create
table_names = xl.open_workbook(path).sheet_names()

# this loop makes it possible to use any other .xls sheet, since the sheet names aren't called out specifically
for table in table_names:
    df = pd.read_excel(path, sheetname='{}'.format(table))
    con.execute("DROP TABLE IF EXISTS {}".format(table))
    pd.io.sql.to_sql(df, "{}".format(table), con, index=False)

# now the spreadsheets are in tables in a mini database!

# Finally, a little function to make it easy to run queries on this mini-database
def run(query):
    results = pd.read_sql("{}".format(query), con)
    return results



In [61]:

    
# Set up a SQLite database using several .csv files:

# if this .sqlite db doesn't already exists, this will create it
# if the .sqlite db *does* already exist, this establishes the desired connection
con = sqlite3.connect("sqlite_cars.sqlite")

# create dataframes from each .csv file:
sales_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/sales_table.csv')
car_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/car_table.csv')
salesman_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/salesman_table.csv')
cust_table = pd.read_csv('https://raw.githubusercontent.com/DaveBackus/Data_Bootcamp/master/Code/SQL/cust_table.csv')

# make a list of the tables (dataframes) and table names:
tables = [sales_table, car_table, salesman_table, cust_table]
table_names = ['sales_table', 'car_table', 'salesman_table', 'cust_table']

# drop each table name if it already exists to avoid error if you rerun this bit of code
# then add it back (or for the first time, if the table didn't already exist)
for i in range(len(tables)):
    table_name = table_names[i]
    table = tables[i]
    con.execute("DROP TABLE IF EXISTS {}".format(table_name))
    pd.io.sql.to_sql(table, "{}".format(table_name), con, index=False)

# Finally, a little function to make it easy to run queries on this mini-database
def run(query):
    results = pd.read_sql("{}".format(query), con)
    return results

SELECT milk FROM dairy_aisle # "Show me milk, from the dairy aisle"

EXAMPLES & EXERCISES:

We've created a mini SQLite database containing 4 tables that you might expect to see in a company's system. Each of these tables has one or more columns that will correspond to similar columns in other tables.

Table names:

sales_table car_table salesman_table cust_table

Start by looking at the columns and their data types in the sales_table:



In [64]:

    
run('''
    PRAGMA TABLE_INFO(sales_table)
    ''')









    Out[64]:






  
    
      
      cid
      name
      type
      notnull
      dflt_value
      pk
    
  
  
    
      0
       0
                 id
       INTEGER
       0
       None
       0
    
    
      1
       1
           model_id
       INTEGER
       0
       None
       0
    
    
      2
       2
        customer_id
       INTEGER
       0
       None
       0
    
    
      3
       3
            revenue
       INTEGER
       0
       None
       0
    
    
      4
       4
       payment_type
          TEXT
       0
       None
       0
    
    
      5
       5
        salesman_id
       INTEGER
       0
       None
       0
    
    
      6
       6
               date
          TEXT
       0
       None
       0

Do the same for the other tables to get a sense of they contain:



In [65]:

    
run('''
    
    ''')









    Out[65]:






  
    
      
      cid
      name
      type
      notnull
      dflt_value
      pk
    
  
  
    
      0
       0
            model_id
       INTEGER
       0
       None
       0
    
    
      1
       1
                make
          TEXT
       0
       None
       0
    
    
      2
       2
               model
          TEXT
       0
       None
       0
    
    
      3
       3
       sticker_price
       INTEGER
       0
       None
       0
    
    
      4
       4
                cogs
       INTEGER
       0
       None
       0

SELECT * FROM table_name

First, we'll look at everything - all rows and columns - from the sales_table, basically like looking at a simple excel spreadsheet. You do this with an asterisk after "SELECT":



In [60]:

    
run('''
    SELECT
        *
    FROM
        sales_table
    ''')
# "Show me all columns from the sales table"









    Out[60]:






  
    
      
      id
      model_id
      customer_id
      revenue
      payment_type
      salesman_id
      date
    
  
  
    
      0 
       54858
       36
       237906
       21222
       finance
       492
         1/7/14
    
    
      1 
       43161
       20
       967016
       19140
       finance
       215
        1/26/14
    
    
      2 
       40112
       46
       819010
       14720
          cash
       862
        1/17/14
    
    
      3 
       92495
       31
       633030
       19010
       finance
       803
        1/13/14
    
    
      4 
       78000
       51
       341877
       22022
       finance
       862
        1/12/14
    
    
      5 
       13154
       75
       720210
       21409
          cash
       492
        1/20/14
    
    
      6 
       36535
       31
       908558
       19894
       finance
       862
        1/30/14
    
    
      7 
       22813
       46
       705508
       12960
       finance
       225
        1/29/14
    
    
      8 
       56245
       36
       248621
       25938
          cash
       276
        1/19/14
    
    
      9 
       88118
       51
       341344
       19844
       finance
       492
        1/23/14
    
    
      10
       84469
       31
       733566
       21441
       finance
       215
        1/20/14
    
    
      11
       37412
       31
       750195
       17462
          cash
       276
        1/11/14
    
    
      12
       68513
       31
       461723
       17020
          cash
       803
        1/18/14
    
    
      13
       74380
       20
       468665
       18040
       finance
       215
         1/5/14
    
    
      14
       24047
       75
       556188
       18671
       finance
       862
        1/12/14
    
    
      15
       40603
       51
       241759
       22506
       finance
       862
         1/4/14
    
    
      16
       69883
       20
       161369
       20460
       finance
       803
        1/25/14
    
    
      17
       43338
       46
       731692
       13440
       finance
       225
        1/29/14
    
    
      18
       39727
       51
       656750
       18634
       finance
       492
         1/1/14
    
    
      19
       21022
       22
       619020
       17312
          cash
       862
        1/14/14
    
    
      20
       18303
       36
       413891
       23318
          cash
       492
        1/17/14
    
    
      21
       28176
       20
       965672
       21780
          cash
       813
        1/11/14
    
    
      22
       62604
       75
       217720
       19418
          cash
       862
        1/30/14
    
    
      23
       83930
       22
       946265
       17756
          cash
       862
        1/13/14
    
    
      24
       87626
       31
       140795
       21441
       finance
       215
        1/13/14
    
    
      25
       10682
       31
       145479
       17905
          cash
       147
        1/20/14
    
    
      26
       23444
       20
       961650
       16720
       finance
       276
         1/1/14
    
    
      27
       87998
       31
       457742
       20778
          cash
       680
         1/3/14
    
    
      28
       95912
       75
       978159
       19667
          cash
       813
        1/25/14
    
    
      29
       74410
       20
       474218
       19360
       finance
       276
         1/6/14
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      70
       61014
       36
       253239
       20174
       finance
       680
        1/21/14
    
    
      71
       18946
       20
       614301
       18480
          cash
       225
        1/28/14
    
    
      72
       75834
       51
       168495
       23474
          cash
       215
        1/10/14
    
    
      73
       57366
       31
       699431
       19231
          cash
       225
        1/12/14
    
    
      74
       57711
       22
       528354
       18865
          cash
       813
         1/4/14
    
    
      75
       65870
       36
       751098
       24104
          cash
       862
        1/23/14
    
    
      76
       22147
       36
       480153
       20436
       finance
       492
        1/23/14
    
    
      77
       99759
       51
       208677
       22990
          cash
       492
        1/16/14
    
    
      78
       64406
       20
       349494
       19580
          cash
       147
        1/16/14
    
    
      79
       55585
       22
       206310
       17756
       finance
       949
         1/1/14
    
    
      80
       24710
       75
       185223
       18920
          cash
       276
        1/16/14
    
    
      81
       53831
       75
       656181
       24646
       finance
       949
        1/18/14
    
    
      82
       97494
       19
       393613
       14006
          cash
       680
        1/14/14
    
    
      83
       13004
       36
       488910
       24890
       finance
       225
        1/12/14
    
    
      84
       30779
       75
       846630
       23401
       finance
       147
        1/25/14
    
    
      85
       29708
       51
       534633
       20328
       finance
       276
        1/29/14
    
    
      86
       14082
       46
       909110
       15200
          cash
       813
        1/28/14
    
    
      87
       82119
       19
       628031
       16189
          cash
       492
        1/12/14
    
    
      88
       31815
       46
       743097
       15040
          cash
       215
        1/25/14
    
    
      89
       32345
       22
       476047
       20197
       finance
       803
        1/19/14
    
    
      90
       63526
       51
       862765
       23232
       finance
       225
        1/23/14
    
    
      91
       20256
       19
       597655
       16734
          cash
       492
        1/16/14
    
    
      92
       15930
       36
       486534
       20960
          cash
       949
        1/26/14
    
    
      93
       68446
       75
       339830
       24148
          cash
       215
        1/22/14
    
    
      94
       68330
       20
       283797
       19140
          cash
       215
         1/9/14
    
    
      95
       85407
       22
       635204
       21751
       finance
       949
        1/18/14
    
    
      96
       42428
       75
       619016
       20413
          cash
       680
        1/11/14
    
    
      97
       37620
       51
       183947
       20328
       finance
       276
         1/3/14
    
    
      98
       96344
       20
       731677
       17600
       finance
       813
        1/15/14
    
    
      99
       53062
       31
       907549
       19894
       finance
       680
        1/28/14
    
  

100 rows × 7 columns

Write a query to select everything from the car_table:



In [ ]:

    
run('''
    
    ''')
# Show me all columns from the car_table

Risks of using an asterisk:

Works well when a table doesn't have a significant number of rows
However, most table will have thousands or even millions of rows
With a large table, you'll run into some problems, running "SELECT * FROM table_name" might:
- Take several minutes (or even hours) to return the information
- Crash your computer
- Muck up the server's processes, and you'll face the wrath of your company's system administrators once they figure out that you are the reason why the whole system has slowed down

LIMIT

- LIMIT N

Alternatively, you can use "LIMIT" and specify a few rows. This way, you'll just see the first few rows (and avoid overloading your system):



In [16]:

    
run('''
    SELECT
        *
    FROM
        sales_table
    LIMIT
        5
    ''')
# "Show me all columns from the sales_table, but limit it to the first 5 rows"









    Out[16]:






  
    
      
      id
      model_id
      customer_id
      revenue
      payment_type
      salesman_id
      date
    
  
  
    
      0
       54858
       36
       237906
       21222
       finance
       492
         1/7/14
    
    
      1
       43161
       20
       967016
       19140
       finance
       215
        1/26/14
    
    
      2
       40112
       46
       819010
       14720
          cash
       862
        1/17/14
    
    
      3
       92495
       31
       633030
       19010
       finance
       803
        1/13/14
    
    
      4
       78000
       51
       341877
       22022
       finance
       862
        1/12/14

Write a query to select all columns and the first 10 rows from the car_table:



In [ ]:

    
run('''
    
    ''')
# Show me all columns from the car_table, but but limit it to the first 10 rows

SELECT SPECIFIC COLUMNS:

Instead of using an asterisk for "all columns", you can specify a particular column or columns:



In [77]:

    
run('''
    SELECT
        model_id, customer_id
    FROM
        sales_table
    LIMIT
        5
    ''')
# "Show me all columns from the sales_table, but limit it to the first 5 rows"









    Out[77]:






  
    
      
      model_id
      customer_id
    
  
  
    
      0
       36
       237906
    
    
      1
       20
       967016
    
    
      2
       46
       819010
    
    
      3
       31
       633030
    
    
      4
       51
       341877

Write a query to select model_id and model from the car_table and limit it to 10 results:



In [ ]:

    
run('''
    
    ''')
# Show me x and y columns from the car_table, but but limit it to the first 10 rows

WHERE

- WHERE column_name = x / != x / in (x, y) / not in (x, y)

WHERE allows you to select rows matching specific values:

= x : returns all rows where the values in column_name are equal to x

!= x: returns all rows that DO NOT match x

IN (x, y) : returns all rows where the values in column_name match either x or y

NOT IN (x, y) : returns all rows where the values in column_name DOES NOT match x or y



In [67]:

    
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash'
    LIMIT 5
    ''')
# "Show me all columns and rows from the sales table, but only where the payment type is cash, and limit it to the first 5 rows"









    Out[67]:






  
    
      
      id
      model_id
      customer_id
      revenue
      payment_type
      salesman_id
      date
    
  
  
    
      0
       40112
       46
       819010
       14720
       cash
       862
        1/17/14
    
    
      1
       13154
       75
       720210
       21409
       cash
       492
        1/20/14
    
    
      2
       56245
       36
       248621
       25938
       cash
       276
        1/19/14
    
    
      3
       37412
       31
       750195
       17462
       cash
       276
        1/11/14
    
    
      4
       68513
       31
       461723
       17020
       cash
       803
        1/18/14

Rewrite the query above but ask for all rows that DO NOT involve cash transactions:



In [ ]:

    
run('''
    
    ''')
# Show me all columns and rows from the sales table, but only where the payment type is NOT cash, and limit it to the first 5 rows

Try to rewrite the query to return rows where the model_id is either 31 or 36



In [68]:

    
run('''
    
    ''')
# Show me all columns and rows from the sales table, but only where the model_id is either 31 or 36









    Out[68]:






  
    
      
      id
      model_id
      customer_id
      revenue
      payment_type
      salesman_id
      date
    
  
  
    
      0
       54858
       36
       237906
       21222
       finance
       492
         1/7/14
    
    
      1
       92495
       31
       633030
       19010
       finance
       803
        1/13/14
    
    
      2
       36535
       31
       908558
       19894
       finance
       862
        1/30/14
    
    
      3
       56245
       36
       248621
       25938
          cash
       276
        1/19/14
    
    
      4
       84469
       31
       733566
       21441
       finance
       215
        1/20/14

You can add additional conditions to WHERE with the following format:

WHERE column_a = x AND column_b = y



In [79]:

    
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash' 
        AND model_id in (31, 36)
    LIMIT 5
    ''')









    Out[79]:






  
    
      
      id
      model_id
      customer_id
      revenue
      payment_type
      salesman_id
      date
    
  
  
    
      0
       56245
       36
       248621
       25938
       cash
       276
        1/19/14
    
    
      1
       37412
       31
       750195
       17462
       cash
       276
        1/11/14
    
    
      2
       68513
       31
       461723
       17020
       cash
       803
        1/18/14
    
    
      3
       18303
       36
       413891
       23318
       cash
       492
        1/17/14
    
    
      4
       10682
       31
       145479
       17905
       cash
       147
        1/20/14

ORDER BY:

ORDER BY column_name

ORDER BY sorts the data by a particular column.

By default, ORDER BY sorts numbers or alphabetically in ascending order.

To sort in descending order, add DESC after the column name



In [70]:

    
run('''
    SELECT
        *
    FROM
        sales_table
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue
    LIMIT 5
    ''')
# Show me all columns and rows for cash transactions, starting with the lowest sale price, limit it to 5 results









    Out[70]:






  
    
      
      id
      model_id
      customer_id
      revenue
      payment_type
      salesman_id
      date
    
  
  
    
      0
       30046
       46
       982483
       12640
       cash
       803
        1/29/14
    
    
      1
       70434
       46
       944843
       13600
       cash
       949
        1/25/14
    
    
      2
       98201
       19
       697575
       13642
       cash
       680
         1/4/14
    
    
      3
       97494
       19
       393613
       14006
       cash
       680
        1/14/14
    
    
      4
       40112
       46
       819010
       14720
       cash
       862
        1/17/14

Rewrite the query above to order by a different column and in descending order.



In [ ]:

    
run('''
    
    ''')
# Translation: Show me all columns and rows for cash transactions, starting with the lowest sale price

GROUP BY & FUNCTIONS:

GROUP BY column_name : lets you specify one or more columns to create groups. Each unique value in the GROUP BY column will get its own group
FUNCTIONS:
- in the SELECT clause, you can apply functions to columns
- http://www.w3schools.com/sql/sql_functions.asp



In [74]:

    
run('''
    SELECT
        model_id, sum(revenue)
    FROM
        sales_table
    GROUP BY
        model_id
    ''')









    Out[74]:






  
    
      
      model_id
      sum(revenue)
    
  
  
    
      0
       19
       157886
    
    
      1
       20
       212080
    
    
      2
       22
       201083
    
    
      3
       31
       234748
    
    
      4
       36
       365751
    
    
      5
       46
       154880
    
    
      6
       51
       297418
    
    
      7
       75
       317406



In [ ]:

    
Add



In [ ]:



In [ ]:

The model_id isn't very helpful for analysis - the name would be better. However, each car's Model and Make information isn't located on the sales_table. Write out a query to return all columns and rows from the car_table to find out what information might be helpful in there.



In [15]:

    
run('''
    ???
    ''')









    Out[15]:






  
    
      
      model_id
      make
      model
      sticker_price
    
  
  
    
      0
       20
       Toyota
          Camry
       22000
    
    
      1
       46
       Toyota
        Corolla
       16000
    
    
      2
       51
       Toyota
          Prius
       24200
    
    
      3
       19
        Honda
          Civic
       18190
    
    
      4
       31
        Honda
         Accord
       22105
    
    
      5
       75
       Subaru
        Outback
       24895
    
    
      6
       22
       Subaru
       Forester
       22195
    
    
      7
       36
       Toyota
         Tundra
       26200

You can add data from two different tables using a "JOIN" in the FROM clause:



In [16]:

    
run('''
    SELECT
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')









    Out[16]:






  
    
      
      make
      model
      revenue
    
  
  
    
      0
       Toyota
        Tundra
       25938
    
    
      1
       Toyota
        Tundra
       25937
    
    
      2
       Toyota
        Tundra
       24628
    
    
      3
       Toyota
        Tundra
       24366
    
    
      4
       Subaru
       Outback
       24148

What's going on here:

To JOIN two tables, find columns on each table that contain corresponding data
In this case, the model_id column in the sales_table matched up to the model_id column in the car_table. Both have the ID numbers that the dealership has assigned to each model of the cars they are selling.
Follow the format below, where a column is referred to by table_name.column_name:

FROM tableA JOIN tableB ON tableA.column_name = tableB.column_name

Take care: corresponding columns from different tables don't necessarily always have identical names. So, if the model_id column in the car_table were named car_model_id instead, this would be the FROM clause:

FROM sales_table JOIN car_table ON sales_table.model_id = car_table.car_model_id

Let's say you wanted to see the model_id in the table as well. If you just try adding "model_id" in the SELECT clause, suddenly you'll hit an error:



In [17]:

    
run('''
    SELECT
        model_id,
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        10
    ''')









    



ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 4))







    



---------------------------------------------------------------------------
DatabaseError                             Traceback (most recent call last)
<ipython-input-17-5ca1644591a3> in <module>()
     15     LIMIT
     16         10
---> 17     ''')

<ipython-input-7-6fb62674a9bd> in run(query)
      4 # a pandas dataframe
      5 def run(query):
----> 6     results = pd.read_sql("{}".format(query), con)
      7     return results

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in read_sql(sql, con, index_col, coerce_float, params, parse_dates, columns)
    386         return pandas_sql.read_sql(
    387             sql, index_col=index_col, params=params,
--> 388             coerce_float=coerce_float, parse_dates=parse_dates)
    389     else:
    390         if isinstance(pandas_sql, PandasSQLLegacy):

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in read_sql(self, sql, index_col, coerce_float, params, parse_dates)
   1020                  parse_dates=None):
   1021         args = _convert_params(sql, params)
-> 1022         cursor = self.execute(*args)
   1023         columns = [col_desc[0] for col_desc in cursor.description]
   1024         data = self._fetchall_as_list(cursor)

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1015 
   1016             ex = DatabaseError("Execution failed on sql: %s" % args[0])
-> 1017             raise_with_traceback(ex)
   1018 
   1019     def read_sql(self, sql, index_col=None, coerce_float=True, params=None,

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/compat/__init__.py in raise_with_traceback(exc, traceback)
    703         if traceback == Ellipsis:
    704             _, _, traceback = sys.exc_info()
--> 705         raise exc.with_traceback(traceback)
    706 else:
    707     # this version of raise is a syntax error in Python 3

//anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/sql.py in execute(self, *args, **kwargs)
   1004                 cur.execute(*args, **kwargs)
   1005             else:
-> 1006                 cur.execute(*args)
   1007             return cur
   1008         except Exception as e:

DatabaseError: Execution failed on sql: 
    SELECT
        model_id,
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        10

This is because you've introduced another table with a column named "model_id" - the query doesn't know if it should pick the data from the model_id column in the sales_table or the car_table. So, you have to specify which table you want the query to use:



In [18]:

    
run('''
    SELECT
        sales_table.model_id,
        make,
        model,
        revenue
    FROM
        sales_table
        JOIN car_table 
            ON sales_table.model_id = car_table.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')









    Out[18]:






  
    
      
      model_id
      make
      model
      revenue
    
  
  
    
      0
       36
       Toyota
        Tundra
       25938
    
    
      1
       36
       Toyota
        Tundra
       25937
    
    
      2
       36
       Toyota
        Tundra
       24628
    
    
      3
       36
       Toyota
        Tundra
       24366
    
    
      4
       75
       Subaru
       Outback
       24148

Note: because the data in the sales_table.model_id column corresponds to the data in car_table.model_id, you can specify either one in the SELECT clause in this case. There are instances where this won't be the case, but that can be addressed later.

Although you technically only need to specify the column's table name in the SELECT clause when there are one or more columns of the same name (so if the joined tables both have a column named "model_id", for instance), it's a good habit to get into labeling all column names with the table they are coming from.

You can also give tables nicknames so that you don't have to write out the entire name of the table each time. You can use the nickname in the SELECT clause, even though you don't state the nicknae until the FROM clause. Just put the nickname immediately after in the FROM clause. One or two letters for a nickname is pretty common.

In this case, I'm going to give sales_table the nickname "S", and the car_table the nickname "C". Now it's easy to label the columns in the SELECT clause:



In [34]:

    
run('''
    SELECT
        S.model_id,
        C.make,
        C.model,
        S.revenue
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')









    Out[34]:






  
    
      
      model_id
      make
      model
      revenue
    
  
  
    
      0
       36
       Toyota
        Tundra
       25938
    
    
      1
       36
       Toyota
        Tundra
       25937
    
    
      2
       36
       Toyota
        Tundra
       24628
    
    
      3
       36
       Toyota
        Tundra
       24366
    
    
      4
       75
       Subaru
       Outback
       24148

Now that we can see the Make and Model information for each car, let's throw in the sticker_price as well - also located on the car_table:



In [26]:

    
run('''
    SELECT
        C.make,
        C.model,
        S.revenue, 
        C.sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')









    Out[26]:






  
    
      
      make
      model
      revenue
      sticker_price
    
  
  
    
      0
       Toyota
        Tundra
       25938
       26200
    
    
      1
       Toyota
        Tundra
       25937
       26200
    
    
      2
       Toyota
        Tundra
       24628
       26200
    
    
      3
       Toyota
        Tundra
       24366
       26200
    
    
      4
       Subaru
       Outback
       24148
       24895

Now we can see that each car sold for some amount less than the listing price, presumably because the customer negotiated a better price with the salesman. You can add a function in the SELECT clause to see the percent under the listing price. You can perform any mathematical function by using the column names:

columnA + columnB = sumAB
columnA * columnB = productAB



In [27]:

    
## hacky fix, need to figure out the deal with integers and pandas.to_sql on SQLite
run('''
    SELECT
        C.make,
        C.model,
        S.revenue, 
        C.sticker_price,
        S.revenue*100/C.sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')









    Out[27]:






  
    
      
      make
      model
      revenue
      sticker_price
      S.revenue*100/C.sticker_price
    
  
  
    
      0
       Toyota
        Tundra
       25938
       26200
       99
    
    
      1
       Toyota
        Tundra
       25937
       26200
       98
    
    
      2
       Toyota
        Tundra
       24628
       26200
       94
    
    
      3
       Toyota
        Tundra
       24366
       26200
       93
    
    
      4
       Subaru
       Outback
       24148
       24895
       96

But that column name looks like a mess, so you can rename it by adding "as [desired_column_name]"



In [31]:

    
run('''
    SELECT
        C.make,
        C.model,
        S.revenue, 
        C.sticker_price,
        S.revenue*100/C.sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        payment_type = 'cash'
    ORDER BY
        revenue DESC
    LIMIT 
        5
    ''')









    Out[31]:






  
    
      
      make
      model
      revenue
      sticker_price
      percent_of_sticker_price
    
  
  
    
      0
       Toyota
        Tundra
       25938
       26200
       99
    
    
      1
       Toyota
        Tundra
       25937
       26200
       98
    
    
      2
       Toyota
        Tundra
       24628
       26200
       94
    
    
      3
       Toyota
        Tundra
       24366
       26200
       93
    
    
      4
       Subaru
       Outback
       24148
       24895
       96

Let's say you want to see the average price that each model of car sold for compared to the list price for the entire month. This is where the GROUP BY clause helps. We'll also need to use the AVG( ) function in the SELECT clause:



In [34]:

    
# I've elimated the constraints from the WHERE, ORDER BY and LIMIT clause so all rows from the sale_table are included in the average
run('''
    SELECT
        C.make,
        C.model,
        AVG(S.revenue) AS average_revenue, 
        C.sticker_price,
        S.revenue*100/C.sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    GROUP BY
        model
    ''')









    Out[34]:






  
    
      
      make
      model
      average_revenue
      sticker_price
      percent_of_sticker_price
    
  
  
    
      0
        Honda
         Accord
       19562.333333
       22105
       89
    
    
      1
       Toyota
          Camry
       19280.000000
       22000
       80
    
    
      2
        Honda
          Civic
       15788.600000
       18190
       91
    
    
      3
       Toyota
        Corolla
       14080.000000
       16000
       94
    
    
      4
       Subaru
       Forester
       18280.272727
       22195
       97
    
    
      5
       Subaru
        Outback
       21160.400000
       24895
       81
    
    
      6
       Toyota
          Prius
       21244.142857
       24200
       84
    
    
      7
       Toyota
         Tundra
       22859.437500
       26200
       80

The Average_Revenue numbers are messy, you can round them with ROUND(). To round to the nearest cent: ROUND( ,2)



In [51]:

    
run('''
    SELECT
        make,
        model,
        ROUND(AVG(revenue), 2) AS average_revenue, 
        sticker_price,
        revenue*100/sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    WHERE
        C.model in ('Accord', 'Camry')
    GROUP BY
        model
    ''')









    Out[51]:






  
    
      
      make
      model
      average_revenue
      sticker_price
      percent_of_sticker_price
    
  
  
    
      0
        Honda
       Accord
       19562.33
       22105
       96
    
    
      1
       Toyota
        Camry
       19280.00
       22000
       99

ORDER BY the percent_of_sticker_price to highlight which cars are selling for the furthest below their list price:



In [29]:

    
run('''
    SELECT
        make,
        model,
        ROUND(AVG(revenue), 2) AS average_revenue, 
        sticker_price,
        revenue*100/sticker_price AS percent_of_sticker_price
    FROM
        sales_table S
        JOIN car_table C
            ON S.model_id = C.model_id
    GROUP BY
        model
    ORDER BY 
        percent_of_sticker_price
    ''')









    Out[29]:






  
    
      
      make
      model
      average_revenue
      sticker_price
      percent_of_sticker_price
    
  
  
    
      0
       Toyota
          Camry
       19280.00
       22000
       80
    
    
      1
       Toyota
         Tundra
       22859.44
       26200
       80
    
    
      2
       Subaru
        Outback
       21160.40
       24895
       81
    
    
      3
       Toyota
          Prius
       21244.14
       24200
       84
    
    
      4
        Honda
         Accord
       19562.33
       22105
       89
    
    
      5
        Honda
          Civic
       15788.60
       18190
       91
    
    
      6
       Toyota
        Corolla
       14080.00
       16000
       94
    
    
      7
       Subaru
       Forester
       18280.27
       22195
       97

One last thing - we'll add a count of how many sales were made of each car model. If we use "id" column in car_sales, we can count



In [46]:

    
run('''
    SELECT
        make,
        model,
        count() as units_sold,
        ROUND(AVG(revenue), 2) AS average_revenue, 
        sticker_price,
        revenue*100/sticker_price AS percent_of_sticker_price
    FROM
         sales_table S
            JOIN car_table C
                ON S.model_id = C.model_id 
    GROUP BY
        model
    ORDER BY 
        percent_of_sticker_price
    ''')









    Out[46]:






  
    
      
      make
      model
      units_sold
      average_revenue
      sticker_price
      percent_of_sticker_price
    
  
  
    
      0
       Toyota
          Camry
       11
       19280.00
       22000
       80
    
    
      1
       Toyota
         Tundra
       16
       22859.44
       26200
       80
    
    
      2
       Subaru
        Outback
       15
       21160.40
       24895
       81
    
    
      3
       Toyota
          Prius
       14
       21244.14
       24200
       84
    
    
      4
        Honda
         Accord
       12
       19562.33
       22105
       89
    
    
      5
        Honda
          Civic
       10
       15788.60
       18190
       91
    
    
      6
       Toyota
        Corolla
       11
       14080.00
       16000
       94
    
    
      7
       Subaru
       Forester
       11
       18280.27
       22195
       97

So for some reason, customers are negotiating harder (or more successfully) on the Toyota Camry and Tundra - and it would probably be causing a problem for the dealership. This kind of issue comes up a lot in B2B as well, since companies have bargaining power with their partners.

Being able to pull this information quickly makes it easy to pinpoint problems, although more queries are then needed to dig into the reasons. The data in this SQLite db could help answer questions like:

What age group is buying Tundras and Camrys? Maybe if younger people are drawn to them, they bargain harder because they have less expenadable income
What salesperson has been selling these cars? Maybe he's not a very good negotiator

Salesperson performance:

Count the number of women and men who bought each model:



In [31]:

    
#run('''
#    SELECT
#        model,
#        sum(if(gender='female', 1, 0)) as female,
#        sum(if(gender='male', 1, 0)) as male
#    FROM
#        sales_table
#        JOIN car_table 
#            ON sales_table.model_id = car_table.model_id
#        JOIN cust_table
#            ON sales_table.customer_id = cust_table.customer_id
#    GROUP BY
#        model
#    ''')



In [72]:

    
run('''
    SELECT
        first_name,
        last_name,
        count() as units_sold,
        ROUND(AVG(revenue), 2) AS avg_revenue, 
        ROUND(AVG(sticker_price), 2) avg_sticker_price,
        ROUND(AVG(revenue)/AVG(sticker_price), 2) AS percent_of_sticker_price
    FROM
         sales_table S
            JOIN car_table C
                ON S.model_id = C.model_id 
            JOIN salesman_table SM 
                ON S.salesman_id = SM.id
    GROUP BY
        first_name
    ORDER BY 
        percent_of_sticker_price
    ''')









    Out[72]:






  
    
      
      first_name
      last_name
      units_sold
      avg_revenue
      avg_sticker_price
      percent_of_sticker_price
    
  
  
    
      0
        Claudine
            Hatch
       12
       19463.92
       23465.00
       0.83
    
    
      1
        Kathleen
            March
        8
       19053.00
       22549.38
       0.84
    
    
      2
       Rosemarie
             Self
        5
       19803.20
       23479.00
       0.84
    
    
      3
          Justin
       Avellaneda
       10
       19393.70
       22908.00
       0.85
    
    
      4
         Matthew
             Luna
       11
       19856.64
       23343.18
       0.85
    
    
      5
         Michael
             Hill
       10
       17110.30
       20167.00
       0.85
    
    
      6
          Joseph
            Seney
       11
       19521.09
       22198.18
       0.88
    
    
      7
           Elton
             Elzy
       12
       20012.67
       22525.00
       0.89
    
    
      8
        Samantha
          Douglas
        8
       19600.75
       22095.63
       0.89
    
    
      9
           Jared
             Case
       13
       20084.00
       21737.69
       0.92

Although Claudine Hatch has good volume, she is giving up too much in negotations - she might need extra training. Meanwhile, Jared Case might have earned himself a bonus for performance this month.

What impact is Claudine's weak negotiating having on the overall margins for the dealership? How much could they gain if she got up to Jared's standards? We'll include cogs to get some Net info:



In [89]:

    
run('''
    SELECT
        first_name,
        last_name,
        count() as units_sold,
        sum(revenue) AS total_revenue, 
        sum(cogs) total_cogs, 
        sum(revenue) - sum(cogs) AS Net,
        (sum(revenue) - sum(cogs))*100 / sum(revenue) Margin,
        (sum(revenue)*100 / (select sum(revenue) from sales_table)) as percent_of_gross,
        (sum(revenue) - sum(cogs))*100 / (select sum(revenue)-sum(cogs) from sales_table S join car_table C on S.model_id = C.model_id) as percent_of_net
    FROM
         sales_table S
            JOIN car_table C
                ON S.model_id = C.model_id 
            JOIN salesman_table SM 
                ON S.salesman_id = SM.id
    GROUP BY
        first_name
    ORDER BY 
        total_revenue desc
    ''')









    Out[89]:






  
    
      
      first_name
      last_name
      units_sold
      total_revenue
      total_cogs
      Net
      Margin
      percent_of_gross
      percent_of_net
    
  
  
    
      0
           Jared
             Case
       13
       261092
       169554
       91538
       35
       13
       15
    
    
      1
           Elton
             Elzy
       12
       240152
       162180
       77972
       32
       12
       13
    
    
      2
        Claudine
            Hatch
       12
       233567
       168948
       64619
       27
       12
       10
    
    
      3
         Matthew
             Luna
       11
       218423
       154065
       64358
       29
       11
       10
    
    
      4
          Joseph
            Seney
       11
       214732
       146508
       68224
       31
       11
       11
    
    
      5
          Justin
       Avellaneda
       10
       193937
       137448
       56489
       29
        9
        9
    
    
      6
         Michael
             Hill
       10
       171103
       121002
       50101
       29
        8
        8
    
    
      7
        Samantha
          Douglas
        8
       156806
       106059
       50747
       32
        8
        8
    
    
      8
        Kathleen
            March
        8
       152424
       108237
       44187
       28
        7
        7
    
    
      9
       Rosemarie
             Self
        5
        99016
        70437
       28579
       28
        5
        4

Claudine's sales make up 12% of gross revenues, but only contribute 10% to net - she's hurting margins and could benefit from training, but doesn't appear to be a lost cause. However, Rosemarie Self has weak margins and is barely moving inventory. This might not be the right role for her.

You could even use this data to figure out which salesperson would be best to approach a new customer who just walked in the door.

WORK IN PROGRESS STUFF

Debugging Tips:

It's crazy easy to make mistakes in SQL,

Remeber this structure: Although capitalization and indentation do not matter, the order of the clauses cannot change. You don't need every clause in a query to make it work, but if you do use it, it has to appear in the right order.

Some of these we haven't covered yet, but will below. For now, just use



In [ ]:

    
SELECT
FROM 
WHERE
GROUP BY 
ORDER BY
LIMIT



In [42]:

    
debugging = pd.DataFrame()



In [48]:

    
tip_1 = ["If you are joining more than one table, check to make sure that all of your columns are labeled"]
debugging.append(tip_1)









    Out[48]:






  
    
      
      0
    
  
  
    
      0
       If you are joining more than one table, check ...



In [47]:









    Out[47]:






  
    
      
      0
    
  
  
    
      0
       If you are joining more than one table, check ...



In [ ]:

    
How to read basic queries in English:

Queries are much more like normal English syntax than most programming languages out there. Reading them out loud can help with understanding their structure. 

The two most basic components of a query are nearly always "SELECT" and "FROM".  Think about going to the grocery store to look for milk. This could be the structure of your search:

	cid	name	type	dflt_value
0	0	id	INTEGER	None
1	1	model_id	INTEGER	None
2	2	customer_id	INTEGER	None
3	3	revenue	INTEGER	None
4	4	payment_type	TEXT	None
5	5	salesman_id	INTEGER	None
6	6	date	TEXT	None

	id	model_id	customer_id	revenue	payment_type	salesman_id	date
0	54858	36	237906	21222	finance	492	1/7/14
1	43161	20	967016	19140	finance	215	1/26/14
2	40112	46	819010	14720	cash	862	1/17/14
3	92495	31	633030	19010	finance	803	1/13/14
4	78000	51	341877	22022	finance	862	1/12/14
5	13154	75	720210	21409	cash	492	1/20/14
6	36535	31	908558	19894	finance	862	1/30/14
7	22813	46	705508	12960	finance	225	1/29/14
8	56245	36	248621	25938	cash	276	1/19/14
9	88118	51	341344	19844	finance	492	1/23/14
10	84469	31	733566	21441	finance	215	1/20/14
11	37412	31	750195	17462	cash	276	1/11/14
12	68513	31	461723	17020	cash	803	1/18/14
13	74380	20	468665	18040	finance	215	1/5/14
14	24047	75	556188	18671	finance	862	1/12/14
15	40603	51	241759	22506	finance	862	1/4/14
16	69883	20	161369	20460	finance	803	1/25/14
17	43338	46	731692	13440	finance	225	1/29/14
18	39727	51	656750	18634	finance	492	1/1/14
19	21022	22	619020	17312	cash	862	1/14/14
20	18303	36	413891	23318	cash	492	1/17/14
21	28176	20	965672	21780	cash	813	1/11/14
22	62604	75	217720	19418	cash	862	1/30/14
23	83930	22	946265	17756	cash	862	1/13/14
24	87626	31	140795	21441	finance	215	1/13/14
25	10682	31	145479	17905	cash	147	1/20/14
26	23444	20	961650	16720	finance	276	1/1/14
27	87998	31	457742	20778	cash	680	1/3/14
28	95912	75	978159	19667	cash	813	1/25/14
29	74410	20	474218	19360	finance	276	1/6/14
...	...	...	...	...	...	...	...
70	61014	36	253239	20174	finance	680	1/21/14
71	18946	20	614301	18480	cash	225	1/28/14
72	75834	51	168495	23474	cash	215	1/10/14
73	57366	31	699431	19231	cash	225	1/12/14
74	57711	22	528354	18865	cash	813	1/4/14
75	65870	36	751098	24104	cash	862	1/23/14
76	22147	36	480153	20436	finance	492	1/23/14
77	99759	51	208677	22990	cash	492	1/16/14
78	64406	20	349494	19580	cash	147	1/16/14
79	55585	22	206310	17756	finance	949	1/1/14
80	24710	75	185223	18920	cash	276	1/16/14
81	53831	75	656181	24646	finance	949	1/18/14
82	97494	19	393613	14006	cash	680	1/14/14
83	13004	36	488910	24890	finance	225	1/12/14
84	30779	75	846630	23401	finance	147	1/25/14
85	29708	51	534633	20328	finance	276	1/29/14
86	14082	46	909110	15200	cash	813	1/28/14
87	82119	19	628031	16189	cash	492	1/12/14
88	31815	46	743097	15040	cash	215	1/25/14
89	32345	22	476047	20197	finance	803	1/19/14
90	63526	51	862765	23232	finance	225	1/23/14
91	20256	19	597655	16734	cash	492	1/16/14
92	15930	36	486534	20960	cash	949	1/26/14
93	68446	75	339830	24148	cash	215	1/22/14
94	68330	20	283797	19140	cash	215	1/9/14
95	85407	22	635204	21751	finance	949	1/18/14
96	42428	75	619016	20413	cash	680	1/11/14
97	37620	51	183947	20328	finance	276	1/3/14
98	96344	20	731677	17600	finance	813	1/15/14
99	53062	31	907549	19894	finance	680	1/28/14

	id	model_id	customer_id	revenue	payment_type	salesman_id	date
0	30046	46	982483	12640	cash	803	1/29/14
1	70434	46	944843	13600	cash	949	1/25/14
2	98201	19	697575	13642	cash	680	1/4/14
3	97494	19	393613	14006	cash	680	1/14/14
4	40112	46	819010	14720	cash	862	1/17/14

	model_id	sum(revenue)
0	19	157886
1	20	212080
2	22	201083
3	31	234748
4	36	365751
5	46	154880
6	51	297418
7	75	317406

	model_id	make	model	sticker_price
0	20	Toyota	Camry	22000
1	46	Toyota	Corolla	16000
2	51	Toyota	Prius	24200
3	19	Honda	Civic	18190
4	31	Honda	Accord	22105
5	75	Subaru	Outback	24895
6	22	Subaru	Forester	22195
7	36	Toyota	Tundra	26200

	make	model	average_revenue	sticker_price	percent_of_sticker_price
0	Honda	Accord	19562.333333	22105	89
1	Toyota	Camry	19280.000000	22000	80
2	Honda	Civic	15788.600000	18190	91
3	Toyota	Corolla	14080.000000	16000	94
4	Subaru	Forester	18280.272727	22195	97
5	Subaru	Outback	21160.400000	24895	81
6	Toyota	Prius	21244.142857	24200	84
7	Toyota	Tundra	22859.437500	26200	80

	make	model	average_revenue	sticker_price	percent_of_sticker_price
0	Toyota	Camry	19280.00	22000	80
1	Toyota	Tundra	22859.44	26200	80
2	Subaru	Outback	21160.40	24895	81
3	Toyota	Prius	21244.14	24200	84
4	Honda	Accord	19562.33	22105	89
5	Honda	Civic	15788.60	18190	91
6	Toyota	Corolla	14080.00	16000	94
7	Subaru	Forester	18280.27	22195	97

	first_name	last_name	units_sold	avg_revenue	avg_sticker_price	percent_of_sticker_price
0	Claudine	Hatch	12	19463.92	23465.00	0.83
1	Kathleen	March	8	19053.00	22549.38	0.84
2	Rosemarie	Self	5	19803.20	23479.00	0.84
3	Justin	Avellaneda	10	19393.70	22908.00	0.85
4	Matthew	Luna	11	19856.64	23343.18	0.85
5	Michael	Hill	10	17110.30	20167.00	0.85
6	Joseph	Seney	11	19521.09	22198.18	0.88
7	Elton	Elzy	12	20012.67	22525.00	0.89
8	Samantha	Douglas	8	19600.75	22095.63	0.89
9	Jared	Case	13	20084.00	21737.69	0.92

	first_name	last_name	units_sold	total_revenue	total_cogs	Net	Margin	percent_of_gross	percent_of_net
0	Jared	Case	13	261092	169554	91538	35	13	15
1	Elton	Elzy	12	240152	162180	77972	32	12	13
2	Claudine	Hatch	12	233567	168948	64619	27	12	10
3	Matthew	Luna	11	218423	154065	64358	29	11	10
4	Joseph	Seney	11	214732	146508	68224	31	11	11
5	Justin	Avellaneda	10	193937	137448	56489	29	9	9
6	Michael	Hill	10	171103	121002	50101	29	8	8
7	Samantha	Douglas	8	156806	106059	50747	32	8	8
8	Kathleen	March	8	152424	108237	44187	28	7	7
9	Rosemarie	Self	5	99016	70437	28579	28	5	4