In [ ]:
import sqlite3 as sl

Problem 11.1. SQLite Benchmarks

You might have wondered why relational databases and SQLite were introduced in weeks 10 and 11. One of the strengths of SQLite is that, once the database has been created, repeated queries against it are much faster than rescanning the raw data. In this problem, we will test the question, "how fast is SQLite really?", by measuring the CPU time to

  • use pure Python to brute-force search a CSV file,
  • create an SQLite database by reading in a CSV file, and
  • query an existing SQLite database.

For simplicity, we will extract only two columns: month (the 2nd column) and cancelled (the 24th column).

(Note: The cancelled column is actually the 22nd column, and the 24th column is diverted. This bug was discovered too late, so in the following simply pretend that the 24th column is cancelled, but keep in mind that we are actually counting the number of diverted flights.)


In [ ]:
%%bash
cat /data/airline/2001.csv | awk -F, '{print $2 "," $24}' > /data/airline/week11.csv

The count_cancelled() function uses a simple for loop to read the CSV file line by line. On my machine, the results were

%time n = count_cancelled(9, "/data/airline/week11.csv")
CPU times: user 26.4 s, sys: 24 ms, total: 26.4 s
Wall time: 26.4 s

The create_db() function creates an SQLite database.

%time db_name = create_db("/data/airline/week11.csv")
CPU times: user 2min 17s, sys: 6.87 s, total: 2min 24s
Wall time: 2min 35s

The query() function uses the database we have created with create_db() and makes a SELECT query to count the number of cancellations in September.

%time n = query(db_name, 9)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 7.93 ms

That's about 8 thousandths of a second per query. We can see that it takes a while to create an SQLite database, but once it has been created, the query speed is impressive. This is a toy problem, but imagine we had to make 1,000 such queries. Plain Python would take 26.4 sec $\times$ 1000 $\approx$ 7 hours 20 minutes, while SQLite would take 2 min 17 sec + 7.93 msec $\times$ 1000 $\approx$ 2 minutes 25 seconds.
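To sanity-check that arithmetic, here is a small back-of-the-envelope calculation using the wall times reported above (the variable names are just for illustration):

# Projected cost of 1,000 queries, using the wall times reported above.
python_total = 26.4 * 1000                       # about 26,400 s, i.e. roughly 7 h 20 min
sqlite_total = (2 * 60 + 17) + 0.00793 * 1000    # build once (~137 s) + 1,000 queries (~7.9 s)
print(python_total / 3600)                       # roughly 7.3 hours
print(sqlite_total)                              # roughly 145 s, i.e. about 2 min 25 s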

Function: create_db()

First, write a function named create_db() that takes a string (the name of the CSV file) and returns a string (the name of the SQLite database you have created). You should also create a multi-column index on both the month and cancelled columns (name the index idx). Here's the schema you should use:


In [ ]:
create_flights = '''
CREATE TABLE flights (
    month INT,
    cancelled INT
);
'''

In [ ]:
def create_db(filename):
    '''
    Takes a string and returns a string.
    
    Parameter
    ---------
    filename: The file name of the CSV file to be imported.
    
    Returns
    -------
    A str. The file path and/or the file name of the SQL database created.
    '''
    
    #### your code goes here
    
    return db_name
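For reference, here is a minimal sketch (not the required solution) of one way create_db() could be implemented; the helper name create_db_sketch and the database file name week11.db are arbitrary choices for illustration:

import sqlite3 as sl

def create_db_sketch(filename, db_name="week11.db"):
    # Illustrative sketch only: build the flights table, bulk-insert the CSV rows,
    # and add the multi-column index named idx on (month, cancelled).
    conn = sl.connect(db_name)
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS flights")
    cur.execute("CREATE TABLE flights (month INT, cancelled INT)")
    with open(filename) as f:
        next(f)  # skip the CSV header line
        rows = (line.strip().split(',') for line in f)
        cur.executemany("INSERT INTO flights VALUES (?, ?)", rows)
    # Creating the index after the bulk insert is usually faster than before it.
    cur.execute("CREATE INDEX idx ON flights (month, cancelled)")
    conn.commit()
    conn.close()
    return db_name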

In [ ]:
%time db_name = create_db("/data/airline/week11.csv")

Function: query()

Next, write a function named query that takes a string (the name of the SQL database you created) and an integer (the month of the year to query). It should return an integer (the number of cancellations in that month).

Hint: The multi-column index we created in create_db() is optimized for WHERE ... AND ... clauses, so use WHERE ... AND ... in your query.


In [ ]:
def query(db_name, month):
    '''
    Takes a string and returns an integer.
    
    Parameters
    ----------
    db_name: A str. The file path and/or the file name of SQL database.
    month: An int. The month to query.
    
    Returns
    -------
    An int. The number of cancellations (cancelled == 1) where
    the value of 'month' column in the SQL database is equal to the 'month' variable.
    '''

    #### your code goes here
    
    return n
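For reference, here is a minimal sketch of one way query() could look, following the hint above (the helper name query_sketch is only illustrative):

import sqlite3 as sl

def query_sketch(db_name, month):
    # Illustrative sketch only: count the rows for the requested month with
    # cancelled == 1, using WHERE ... AND ... so that the idx index can be used.
    conn = sl.connect(db_name)
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM flights WHERE month = ? AND cancelled = 1",
        (month,)
    )
    n = cur.fetchone()[0]
    conn.close()
    return n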

In [ ]:
%time n = query(db_name, 9)

In [ ]:
print("SQLite: {0} flights were cancelled in September, 2001".format(n))

Finally, I have provided a function that goes through a CSV file line by line and counts the number of cancellations in a given month. Run the following cells to see how long it takes on your machine.


In [ ]:
def count_cancelled(month, filename, header=True):
    '''
    Counts lines in the CSV file where the month column equals `month`
    and the cancelled column equals 1, using a plain Python loop.
    '''
    n = 0
    with open(filename) as f:
        for i, line in enumerate(f):
            # Skip the header row if there is one.
            if header and i == 0:
                continue

            values = line.strip().split(',')
            if int(values[0]) == month and int(values[1]) == 1:
                n += 1
    return n

In [ ]:
%time n = count_cancelled(9, "/data/airline/week11.csv")
print("Python: {0} flights were cancelled in September, 2001".format(n))

In [ ]: