After the previous IPython Notebook, which explored how to work with a database from standard Python, you have probably realized how much boilerplate code is required simply to execute a few SQL commands. Several Python libraries exist to simplify these steps; we will focus on the Pandas library, the standard library for doing data analysis in Python.
In [1]:
import sqlite3 as sl
import pandas as pd
Pandas provides built-in support for executing a SQL query and retrieving the result as a DataFrame. This is demonstrated in the next code cell, where we execute a SQL query on the airlines database. We select several columns, and for simplicity we restrict the query result to only ten rows by using the LIMIT clause, which SQLite supports.
The Pandas method for executing a SQL statement is read_sql, and it mimics in appearance other Pandas methods for reading data into a DataFrame. In this case, the method takes our SQL statement, a database connection, and an optional parameter, index_col, that we can use to specify which column in the result should be treated as the index. If no column is explicitly supplied, Pandas attaches an auto-incrementing integer index. To save space in the output display, we specify our own index column in these examples.
In [2]:
query = "SELECT code, airport, city, state, latitude, longitude FROM airports LIMIT 10 ;"
database = '/notebooks/i2ds/database/i2ds'
with sl.connect(database) as con:
data = pd.read_sql(query, con, index_col ='code')
print(data)
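As a quick check of the default indexing behavior described above, we can omit the index_col parameter; Pandas then attaches its own auto-incrementing integer index (a RangeIndex starting at zero). A minimal sketch, reusing the same query and database path:

import sqlite3 as sl
import pandas as pd

query = "SELECT code, airport, city, state, latitude, longitude FROM airports LIMIT 10 ;"
database = '/notebooks/i2ds/database/i2ds'

with sl.connect(database) as con:
    # No index_col: Pandas supplies its own integer index (0, 1, 2, ...)
    data = pd.read_sql(query, con)

    print(data.index)
    print(data.head(3))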
In the next code cell, we use Boolean indexing on the Pandas DataFrame to select only those rows whose airports are in the state of Mississippi. We do this by selecting the state attribute of the DataFrame, which corresponds to the state column, and applying a Boolean condition.
In [3]:
query = "SELECT code, airport, city, state, latitude, longitude FROM airports LIMIT 100 ;"
with sl.connect(database) as con:
data = pd.read_sql(query, con, index_col ='code')
print(data[data.state == 'MS'])
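The same filter can also be pushed into the database itself with a SQL WHERE clause, which avoids transferring rows that are immediately discarded. A minimal sketch; note that without the LIMIT clause this returns every Mississippi airport, not just those among the first 100 rows:

import sqlite3 as sl
import pandas as pd

database = '/notebooks/i2ds/database/i2ds'

# Let the database do the filtering before the data reaches Pandas.
query = ("SELECT code, airport, city, state, latitude, longitude "
         "FROM airports WHERE state = 'MS' ;")

with sl.connect(database) as con:
    data = pd.read_sql(query, con, index_col='code')

    print(data)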
Pandas also simplifies the insertion of new data into a SQL database. For this, we can simply take an existing Pandas DataFrame and call its to_sql() method. This method requires two parameters: the name of the database table and the database connection. If the table does not exist, a new table is created to match the DataFrame, including appropriate column names and data types.
In the next two code blocks, we first query the airports table and use Pandas to extract all airports in Illinois. We then insert this data back into our database as a new table called ILAirports. The following code block queries this new table and displays the results for confirmation.
In [4]:
# Creating table automatically works better if columns are explicitly listed.
query = "SELECT code, airport, city, state, latitude, longitude FROM airports ;"

with sl.connect(database) as con:
    data = pd.read_sql(query, con)

    data[data.state == 'IL'].to_sql('ILAirports', con)
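One caveat with this cell: by default, to_sql raises a ValueError if the target table already exists, so re-running the notebook fails on the second pass. The if_exists parameter controls this behavior; a minimal sketch using if_exists='replace' (and index=False so the DataFrame's auto-generated index is not written as an extra column):

import sqlite3 as sl
import pandas as pd

database = '/notebooks/i2ds/database/i2ds'
query = "SELECT code, airport, city, state, latitude, longitude FROM airports ;"

with sl.connect(database) as con:
    data = pd.read_sql(query, con)

    # Drop and recreate the table on every run; skip the integer index column.
    data[data.state == 'IL'].to_sql('ILAirports', con,
                                    if_exists='replace', index=False)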
In [5]:
with sl.connect(database) as con:
    data = pd.read_sql('SELECT code, city, airport, latitude, longitude FROM ILAirports',
                       con, index_col='code')

    print(data[10:20])
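When a query depends on a value computed in Python, it is safer to pass it through the params argument of read_sql than to build the SQL string by hand; with SQLite, values are bound to ? placeholders. A minimal sketch querying the new table for a single city (the city name here is only an illustrative value):

import sqlite3 as sl
import pandas as pd

database = '/notebooks/i2ds/database/i2ds'
query = "SELECT code, airport, city FROM ILAirports WHERE city = ? ;"

with sl.connect(database) as con:
    # The tuple supplies values for the ? placeholders in order.
    data = pd.read_sql(query, con, index_col='code', params=('Chicago',))

    print(data)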