Introduction to Pandas & Databases

Professor Robert J. Brunner

</DIV>


Pandas and SQL

After the previous IPython Notebook exploring how to use standard Python to work with a database, you have probably realized that there is a lot of standard code we must write to simply execute some SQL commands. While there are different Python libraries that exist to simplify these steps, we will focus on using the Pandas library, which is a standard library for doing Data Analysis in Python.



In [1]:
import sqlite3 as sl
import pandas as pd

Pandas provides built-in support for executing a SQL query and retrieving the result as a DataFrame. This is demonstrated in the next code cell, where we execute a SQL query on the airlines database. We select several columns, and for simplicity we restrict our query result to only ten rows by using the ANSI SQL LIMIT clause.

The Pandas method to execute a SQL statement is read_sql, and mimics in appearance other Panda methods for reading data into a Pandas DataFrame. In this case, the method takes our SQL statement, database connection, and an optional parameter, index_col that we can use to specify which column in our result should be treated as an index column. Pandas will supply an auto-incrementing column if no column is explicitly supplied. To save space in the output display, we specify our own column in these examples.



In [2]:
query = "SELECT code, airport, city, state, latitude, longitude FROM airports LIMIT 10 ;"

database = '/notebooks/i2ds/database/i2ds'

with sl.connect(database) as con:
    data = pd.read_sql(query, con, index_col ='code')
    
    print(data)


                   airport              city state   latitude   longitude
code                                                                     
00M               Thigpen        Bay Springs    MS  31.953765  -89.234505
00R   Livingston Municipal        Livingston    TX  30.685861  -95.017928
00V            Meadow Lake  Colorado Springs    CO  38.945749 -104.569893
01G           Perry-Warsaw             Perry    NY  42.741347  -78.052081
01J       Hilliard Airpark          Hilliard    FL  30.688012  -81.905944
01M      Tishomingo County           Belmont    MS  34.491667  -88.201111
02A            Gragg-Wade            Clanton    AL  32.850487  -86.611453
02C                Capitol        Brookfield    WI  43.087510  -88.177869
02G      Columbiana County    East Liverpool    OH  40.673313  -80.641406
03D       Memphis Memorial           Memphis    MO  40.447259  -92.226961

[10 rows x 5 columns]

In the next code cell, we use the column selection feature with a Pandas DataFrame to select only those rows that have airports in the state of Mississippi. We do this by selecting the state attribute of the DataFrame, which corresponds to the state column, and applying a Boolean condition.



In [3]:
query = "SELECT code, airport, city, state, latitude, longitude FROM airports LIMIT 100 ;"

with sl.connect(database) as con:
    data = pd.read_sql(query, con, index_col ='code')
    
    print(data[data.state == 'MS'])


                     airport         city state   latitude  longitude
code                                                                 
00M                 Thigpen   Bay Springs    MS  31.953765 -89.234505
01M        Tishomingo County      Belmont    MS  34.491667 -88.201111
04M           Calhoun County    Pittsboro    MS  33.930112 -89.342852
06M         Eupora Municipal       Eupora    MS  33.534566 -89.312569
08M    Carthage-Leake County     Carthage    MS  32.761246 -89.530071
09M     Charleston Municipal   Charleston    MS  33.991502 -90.078145
0M6            Panola County   Batesville    MS  34.366774 -89.900089
0R0   Columbia-Marion County     Columbia    MS  31.297008 -89.812829

[8 rows x 5 columns]

Pandas also simplifies the insertion of new data into a SQL database. For this, we can simply take an existing Pandas DataFrame and call the to_sql() method. This method requires two parameters, the name of the database table, and the database connection. If the table does not exist, a new table will be created to match the DataFrame, including appropriate column names and data types.

In the next two code blocks, we first query the airports table, and use Pandas to extract all airports in Illinois. We next insert this data back into our database as a new table called ILAirports. The following code block queries this new table and display the results for confirmation.



In [4]:
# Creating table automatically works better if columns are explicitly listed.

query = "SELECT code, airport, city, state, latitude, longitude FROM airports ;"
with sl.connect(database) as con:
    data = pd.read_sql(query, con)

    data[data.state == 'IL'].to_sql('ILAirports', con)

In [5]:
with sl.connect(database) as con:
    data = pd.read_sql('SELECT code, city, airport, latitude, longitude FROM ILAirports', 
                       con, index_col ='code')
    
    print(data[10:20])


                       city                airport   latitude  longitude
code                                                                    
3CK       Lake In The Hills      Lake In The Hills  42.206803 -88.323040
3K6   Troy/Marine/St. Louis    St Louis-Metro East  38.732909 -89.806567
3LC                 Lincoln           Logan County  40.158472 -89.334972
3LF              Litchfield   Litchfield Municipal  39.166353 -89.674897
3MY                  Peoria   Mt. Hawley Auxiliary  40.795259 -89.613403
9I0                  Havana        Havana Regional  40.221155 -90.022894
AJG               Mt Carmel    Mt Carmel Municipal  38.606547 -87.726694
ALN         Alton/St. Louis     St. Louis Regional  38.890291 -90.046043
ARR          Chicago/Aurora       Aurora Municipal  41.771929 -88.475659
BLV    Belleville/St. Louis  Scott AFB/MidAmerica   38.545179 -89.835184

[10 rows x 4 columns]

Additional References

  1. Pandas Documentation
  2. A slightly dated Pandas tutorial