Interacting with Databases

In many applications data rarely comes from text files, that being a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL (so-called NoSQL) databases have become quite popular. The choice of database is usually de- pendent on the performance, data integrity, and scalability needs of an application.f


In [16]:
from pandas import DataFrame

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I’ll use an in-memory SQLite database using Python’s built-in sqlite3 driver:


In [17]:
import sqlite3

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

Then, insert a few rows of data:


In [18]:
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:


In [22]:
cursor = con.execute("select * from test where b LIKE 'Georgia'")
rows = cursor.fetchall()
rows


Out[22]:
[(u'Atlanta', u'Georgia', 1.25, 6)]

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor’s description attribute:


In [20]:
cursor.description


Out[20]:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [21]:
df = DataFrame(rows, columns=zip(*cursor.description)[0])
df


Out[21]:
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

This is quite a bit of munging that you’d rather not repeat each time you query the database. pandas has a read_sql function that simplifies the process. Just pass the select statement and the connection object:


In [24]:
import pandas.io.sql as sql
sql.read_sql('select * from test where c>2.0', con)


Out[24]:
a b c d
0 Tallahassee Florida 2.6 3

In [ ]: