In this Notebook, we start to examine the topic of data exploration. Before one can begin to visually or analytically explore a dataset, many mundane tasks often must be performed first. In this lesson, we cover many of these steps by using the flights database.
This will require us to read data from the SQLite3 database we constructed previously, specifically the flight information data. To complete this task, we first need to construct a suitable SQL query, establish a database connection, execute the query, and process the query results, which in this case simply means accumulating a list of rows extracted by the query.
In [ ]:
# Let's grab data from the sqlite database
import sqlite3 as sl

# Our list of airport rows
airports = []

# The database location
database = '/notebooks/i2ds/database/i2ds'

# The flight query, all columns selected in their stored order
query = "SELECT * FROM flights"

# Establish a connection
with sl.connect(database) as con:

    # We need this to read in latin-1 characters
    con.text_factory = bytes

    # We will wrap our database access statements with exception handling
    try:
        # Get a cursor and execute our query
        cur = con.cursor()
        cur.execute(query)

        # Iterate through the query results to build the airports list
        for row in cur:
            airports.append(row)

    except sl.Warning as w:
        print("Warning:", w.args[0])

    except sl.Error as e:
        print("Error:", e.args[0])

# Now display the first few entries
print(airports[0:3])
If you execute the previous code cell, rather than getting an airports list, you will most likely receive the following message window:
First, note that this message means all definitions, including code imports and variable or function definitions, have been lost and will need to be re-entered (in a new code cell!) before being used again. Second, we will need to perform some detective work to uncover the issue. As a next step, let's take a look at the size of the input data file we used to construct the database. For this we can simply use the wc command to count the lines in the 2001.csv file.
In [1]:
!wc -l /notebooks/i2ds/data/2001.csv
The input file contains just under six million rows, which shouldn't be a problem, but let's take a look at the database file size. That should give some idea of the amount of data we are trying to pull out of the database and into our IPython kernel. We can do this with the ls command by using the -la flag and the full path to our database file, which is located at /notebooks/i2ds/database/i2ds.
In [2]:
!ls -la /notebooks/i2ds/database/i2ds
The database file is over half a gigabyte, which is a lot of data, and even though we may have a lot of memory, it could be the problem. One thought is that the general SQL query, SELECT * FROM flights, was the problem, and that we should modify our original query to explicitly list all columns:
query = '''SELECT
year, Month, dayOfMonth, dayOfWeek, actualDepartureTime, scheduledDepartureTime,
arrivalArrivalTime, scheduledArrivalTime, uniqueCarrierCode, flightNumber,
tailNumber, actualElapsedTime, scheduledElapsedTime, airTime, arrivalDelay,
departureDelay, originCode, destinationCode, distance, taxiIn, taxiOut, cancelled,
cancellationCode, diverted, carrierDelay, weatherDelay, nasDelay,
securityDelay, lateAircraftDelay
FROM flights;
'''
I don't recommend trying this, however, as it will probably still require a kernel restart. To better understand the issue, we can look at our available memory with the free command, which will run in our Docker container. To make it easier to read, we will use the -m flag, which lists all values in megabytes.
In [3]:
!free -m
For completeness, the output from my running Docker container is shown below.
Interestingly, our Docker container doesn't have much memory. On my system, the container has only a little over 2000 megabytes in total, of which about 1700 is free for use. Since we are all running a Docker container inside a VirtualBox Ubuntu image, we can see the explicit limitation this produces:
The VirtualBox image has only 2048 Mbytes available for the entire virtual machine, inside which our Docker container is running. In fact, this would be a good time to verify that only one Docker container is running inside our virtual machine. We can do this by entering docker ps at the command line of a Boot2Docker prompt (you can open a new Boot2Docker shell to do this). So, we have no other containers running (if you do, you might close them down with docker stop and the container name as shown by docker ps). Be sure to leave the current IPython Notebook server running, or you will need to restart the container to use this Notebook.
At this point, we should simply read the data straight into a Pandas DataFrame from the original CSV file.
In [4]:
# Let's read the data directly from the original CSV file
import pandas as pd

csvfile = '/notebooks/i2ds/data/2001.csv'

try:
    data = pd.read_csv(csvfile, encoding='latin-1')

    # Now display the first few entries
    print(data.head())

except Exception as e:
    print("Error:", type(e))
Well, at least we didn't get a Kernel Restart dialog box, but this isn't good. We still haven't read in the data (you can verify that data was never defined by simply entering data in a code cell and executing the cell; this will produce a NameError: name 'data' is not defined message).
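If you prefer a check that does not itself stop with an error, a small sketch like the following (just a try/except around the name) will report whether data exists in the current kernel session:

# Check whether the name 'data' exists in this kernel session
try:
    data
    print("data is defined")
except NameError:
    print("data is not defined")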
We can look at the available memory again with free -m, either in a code cell or in a Docker container, preferably the one in which our IPython Notebook server is running (remember you can attach to a running Docker container by using docker exec -it name /bin/bash, where name is the name of the running container to which you want to attach). Look at the available memory by using free -m, either in this Notebook or at the Docker container prompt. Afterwards, manually restart the kernel (either by clicking on the circular arrow button or by choosing the Restart item from the Kernel menu). Now check the available memory again and notice the change. The results of these commands as run in my Docker container are shown in the following image.
First, notice how much free space is recovered by restarting the kernel. Second, notice how much swap memory is recovered by restarting the kernel. In both cases, over 700 Megabytes were recovered on my system.
In [5]:
!free -m
But this lesson is not (solely) about managing Docker containers, especially in terms of their memory footprint. Let's look again at the original CSV file, this time to see the full character count, which we can get with the wc command with no flags.
In [6]:
!wc /notebooks/i2ds/data/2001.csv
So, as we see from the third numerical column of the wc output, the CSV file has over 600 million characters of data. This is a lot of data to be pulling into one program, so let's take a different approach. We can first identify the columns we want to analyze, and extract only those columns. To do this, the easiest approach is to simply examine the header row of the CSV file, which we can do by using the head command.
In [7]:
!head -5 /notebooks/i2ds/data/2001.csv
It should be interesting to explore the relationship between arrival and departure times and delays. We also might want to know if this varies with month, airport, or day of the week, so let's grab those columns from the data file. We can do this using SQL and Pandas.
import sqlite3 as sl
import pandas as pd

# The database location
database = '/notebooks/i2ds/database/i2ds'

query = """
SELECT dayOfWeek, month, actualDepartureTime, arrivalArrivalTime,
       flightNumber, arrivalDelay, departureDelay, distance, originCode,
       destinationCode
FROM flights
"""

with sl.connect(database) as con:
    data = pd.read_sql(query, con)
If we do this, though, we will again run out of memory (if you don't believe me, try running this code and, while it is running, keep entering free -m in a Docker container and watch first the free memory and then the swap memory disappear). This will lead to the following window:
To keep the memory requirements to a minimum, we can grab just two columns and explore what this can teach us about extracting more data. For now, we can extract the arrival and departure delay columns from the database.
In [8]:
# Let's grab data from the sqlite database
import sqlite3 as sl
import pandas as pd

database = '/notebooks/i2ds/database/i2ds'
query = '''SELECT arrivalDelay, departureDelay FROM flights '''

with sl.connect(database) as con:
    data = pd.read_sql(query, con)

# Now display the first few entries
print(data.head())
In [9]:
!free -m
Notice how simply extracting these two columns from the database consumed a significant amount of memory. As seen above, you might have consumed enough memory to prevent the IPython kernel from creating a new Unix process to execute the free -m command (you can still issue this command at a Unix prompt in a Docker container). Given this result, it is no wonder our previous efforts produced a MemoryError or a kernel restart. But the better question is why? We have two values for each of about 6 million rows, which even at 8 bytes per value should total less than 100 Megabytes.
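As a quick back-of-the-envelope check, here is a minimal sketch of that estimate (the six million row count comes from our earlier wc -l output):

# Naive size estimate for two numeric columns of roughly six million rows
rows = 6000000          # approximate row count from wc -l
columns = 2             # arrivalDelay and departureDelay
bytes_per_value = 8     # a 64-bit integer or float

estimate_mb = rows * columns * bytes_per_value / 1024.**2
print("Naive estimate: about %d Megabytes" % estimate_mb)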
To understand this issue better, we can examine our pandas DataFrame.
In [11]:
data.describe()
Out[11]:
The output is informative in that we have the correct number of rows; however, we do not get the normal numerical summary, which lists, among other statistics, the min, max, and mean values. To continue this line of investigation, we can examine the data types used for the columns in our DataFrame.
In [12]:
data.dtypes
Out[12]:
So our DataFrame is using objects to hold the values in both columns. This is why we have memory issues, as the object data type consumes significantly more memory than a simple numeric type. Note that we could also see this information by using the info method, as shown in the next code cell.
In [13]:
data.info(verbose=True)
But this doesn't explain why Pandas chose the object data type. To see why, let's take a look at the original data file again, in this case the last few rows, which we can do by using the tail command. But first, we need to perform a manual kernel restart to reclaim enough memory to issue the tail command. The easiest method to accomplish this is to select Restart from the Kernel menu.
In [1]:
!tail -5 /notebooks/i2ds/data/2001.csv
So, the columns of interest (among others) contain NA values, which are stored in the database as text values (recall that sqlite3 uses flexible typing, which allows a value of essentially any type to be stored in a column of any declared type). Thus, these bad values are causing our problems, which we can rectify in one of two ways: we can either filter out the NA rows in the SQL query itself, or treat the NA strings as missing values when we read the data.
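To make this concrete, here is a minimal sketch (using a throwaway in-memory database, not our flights database) that shows SQLite storing a text 'NA' in a column declared as INTEGER:

# Demonstrate SQLite's flexible typing with an in-memory database
import sqlite3 as sl

con = sl.connect(':memory:')
con.execute('CREATE TABLE demo (delay INTEGER)')
con.execute("INSERT INTO demo VALUES (15), ('NA')")

# Both rows are stored; typeof() shows the second value is kept as text
for row in con.execute('SELECT delay, typeof(delay) FROM demo'):
    print(row)

con.close()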
The best solution, even in the presence of more memory, is to use the appropriate data type, which in this case requires us to remove the NA values. Another issue, however, is that according to the documentation for the most recent version of Pandas, these issues should not be this severe. In Pandas 0.15.2 or higher, for example, we can monitor our memory usage as follows:
data.memory_usage()
This does not work in our Docker container, however, as demonstrated in the next code cell.
In [2]:
import pandas as pd
print(pd.version.version)
Our version of Pandas is approximately a year out of date! This raises another important point: always know the software versions of all tools being used. Even though this Docker container was built by using the latest versions of the tools and libraries available on Ubuntu, for example by using sudo apt-get install, you might be using a version different than expected. This is obviously very important when checking the documentation, as you must be sure to consult the appropriate version of the documentation.
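One lightweight habit that helps here is to record the versions of the key libraries near the top of a Notebook; a minimal sketch (the __version__ attribute is available on both libraries) follows:

# Record the versions of the main tools used in this Notebook
import sys
import numpy as np
import pandas as pd

print("Python :", sys.version.split()[0])
print("NumPy  :", np.__version__)
print("Pandas :", pd.__version__)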
We can return to the data extraction, but now we will explicitly remove rows where either column of interest has a bad value. We can do this by including a WHERE clause as shown below.
In [3]:
# Let's grab data from the sqlite database
import sqlite3 as sl
import pandas as pd

database = '/notebooks/i2ds/database/i2ds'
query = '''SELECT arrivalDelay, departureDelay
           FROM flights
           WHERE arrivalDelay != 'NA' AND
                 departureDelay != 'NA' ; '''

with sl.connect(database) as con:
    data = pd.read_sql(query, con)

# Now display the first few entries
print(data.head())
In [4]:
!free -m
Notice how this new data extraction consumed very little memory; we are clearly on the right track!
We can now verify the DataFrame data types with the dtypes attribute and summarize the data we obtained by using the describe method.
In [5]:
data.dtypes
Out[5]:
In [6]:
data.describe()
Out[6]:
The DataFrame is using 8-byte integers to hold the data, which consume less space than the object data type did previously. However, the data description indicates that we really only need two bytes per value, since the numerical range (min to max) of each column spans less than three thousand, well within the limits of a 16-bit integer. We can convert these columns to a smaller data type, in this case by using NumPy.
In [7]:
import numpy as np
data[['arrivalDelay']] = data[['arrivalDelay']].astype(np.int16)
data[['departureDelay']] = data[['departureDelay']].astype(np.int16)
In [8]:
!free -m
In [9]:
data.dtypes
Out[9]:
First, the conversion was quite fast, which demonstrates the speed of NumPy operations. Second, notice how the DataFrame is now using the correct data types. Finally, as seen in the memory usage, the conversion required less than thirty Megabytes (2 columns x 2 bytes x six million rows = 24 Megabytes).
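Since the memory_usage method is not available in our older Pandas, one hedged way to confirm this footprint is to drop down to the underlying NumPy array (this assumes the data DataFrame from the previous cells is still in memory):

# After the conversion both columns are int16, so nbytes reports the raw storage
print("Raw data size: %.1f Megabytes" % (data.values.nbytes / 1024.**2))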
But this would have been more efficiently done by simply reading the data straight into the correct data type. While we could do this with the read_sql method, let's switch to the read_csv method. With this method, we can explicitly list the data types, the values that indicate Not Available entries, and the columns to read from the file. In the following code cell, we read directly from the CSV file into a Pandas DataFrame. Following that, we call the dropna method to eliminate rows with bad values, and show the results.
In [10]:
newdata = pd.read_csv('/notebooks/i2ds/data/2001.csv', dtype=np.float, header=0, na_values=['NA'], usecols=(14, 15))
In [11]:
ndata = newdata.dropna()
ndata.describe()
Out[11]:
We now have both columns read into a Pandas DataFrame, the columns are of float type, and the process was fast and consumed minimal memory. At this point, we are ready to explicitly read in the entire set of data of interest. In the following code cell, we read the columns of interest into a new DataFrame, after which we drop rows with bad values.
In [12]:
import pandas as pd

# ucs = (14, 15, 3, 4, 6, 1, 8, 9, 16, 17, 18)
ucs = (1, 3, 4, 14, 15, 16, 17, 18)
cnms = ['month', 'Day', 'dTime', 'aDelay', 'dDelay', 'depart', 'arrive', 'distance']
# cnms = ['aDelay', 'dDelay', 'Day', 'dTime', 'aTime',
#         'month', 'uc', 'fnum', 'depart', 'arrive', 'distance']

newdata = pd.read_csv('/notebooks/i2ds/data/2001.csv',  # dtype=np.float32,
                      header=0, na_values=['NA'], usecols=ucs, names=cnms)
In [13]:
newdata = newdata.dropna()
At this point, we can persist our extracted data to simplify later analysis. For this, we can use the HDF file format, which is easy with Pandas: we simply call the to_hdf method and pass in the value table to indicate the key we will later use to read the data back out of the HDF file. For completeness, we can also examine the resulting file size by using the ls command.
In [14]:
newdata.to_hdf('flights.h5', 'table', append=False)
In [15]:
!ls -la *.h5
This file requires nearly 400 Megabytes, which is a considerable size. We can examine the DataFrame to get a better idea of why this is the case. First, we can look at the selected data types, and second, we can look at the range of numerical values in each column.
In [16]:
newdata.dtypes
Out[16]:
In [17]:
newdata.describe()
Out[17]:
In each case, Pandas has selected the largest (default) data type for each column. Note how the month and Day columns only include integers from 1-12 and 1-7, respectively; we can use one-byte integers for those values. Likewise, the other columns can also be stored more compactly by using optimal data types. In the following code cell, we explicitly convert each column to a more appropriate target data type. Afterwards, we verify the data types of our new DataFrame before also displaying the numerical column statistics.
In [18]:
import numpy as np

# Target column names and their compact data types
cnms = ['month', 'Day', 'dTime', 'aDelay', 'dDelay', 'depart', 'arrive', 'distance']
dts = [np.uint8, np.uint8, np.uint16, np.int16, np.int16, object, object, np.uint16]

# Build a new DataFrame by converting one column at a time
nd = pd.DataFrame()
for cnm, dt in zip(cnms, dts):
    nd[cnm] = newdata[cnm].astype(dt)
In [19]:
nd.dtypes
Out[19]:
In [20]:
nd.describe()
Out[20]:
At this point, we have successfully read the target columns from our data file (we also could have performed these steps by reading directly from our database). We can now save the new DataFrame to an HDF file, and compare the more compact DataFrame representation file size with the original HDF file.
In [21]:
nd.to_hdf('sflights.h5', 'table', append=False)
In [22]:
!ls -la *.h5
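Later, when we need these data again, a minimal sketch of reading them back (assuming the sflights.h5 file is in the working directory) looks like the following:

import pandas as pd

# Read the compact DataFrame back by using the same 'table' key we passed to to_hdf
flights = pd.read_hdf('sflights.h5', 'table')
print(flights.dtypes)
print(flights.head())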
While it was a bit of work to get to this point, we now have a compact file that contains our data of interest, all done while using standard Python data analysis tools. In the next lesson, we will use data from this file to learn about basic data exploration by using Python.