In many applications, data rarely comes from text files, as that is a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL (so-called NoSQL) databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.
In [16]:
from pandas import DataFrame
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I'll use an in-memory SQLite database created with Python's built-in sqlite3 driver:
In [17]:
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""
con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()
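As a quick sanity check (an aside, not required for the example), you can confirm the table was created by querying sqlite_master, SQLite's built-in catalog of schema objects:

# List user tables in the database
con.execute("select name from sqlite_master where type='table'").fetchall()

This returns [('test',)] here.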
Then, insert a few rows of data:
In [18]:
data = [('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:
In [22]:
cursor = con.execute("select * from test where b LIKE 'Georgia'")
rows = cursor.fetchall()
rows
Out[22]:
[('Atlanta', 'Georgia', 1.25, 6)]
You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor’s description attribute:
In [20]:
cursor.description
Out[20]:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))
In [21]:
df = DataFrame(rows, columns=[x[0] for x in cursor.description])
df
Out[21]:
         a        b     c  d
0  Atlanta  Georgia  1.25  6
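This is quite a bit of munging that you'd rather not repeat each time you query the database. One option is to factor it into a small helper of your own; here is a minimal sketch (the name read_query is mine, not a pandas or sqlite3 API):

def read_query(select_stmt, con):
    # Run the query and fetch all rows as a list of tuples
    cursor = con.execute(select_stmt)
    rows = cursor.fetchall()
    # Column names are the first field of each 7-tuple in description
    columns = [desc[0] for desc in cursor.description]
    return DataFrame(rows, columns=columns)

read_query('select * from test', con)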
In practice you don't need to write this yourself: pandas has a read_sql function that simplifies the process. Just pass the SELECT statement and the connection object:
In [24]:
import pandas.io.sql as sql
sql.read_sql('select * from test where c>2.0', con)
Out[24]:
             a        b    c  d
0  Tallahassee  Florida  2.6  3
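read_sql also has a counterpart for the reverse direction: DataFrame.to_sql writes a frame out as a database table. Below is a sketch against the same connection (the table name test_copy is arbitrary; newer pandas versions prefer a SQLAlchemy engine and may warn when handed a raw DBAPI connection like this one):

df = sql.read_sql('select * from test', con)
# Write the frame back out as a new table; if_exists='replace'
# drops and recreates the table if it already exists
df.to_sql('test_copy', con, index=False, if_exists='replace')
sql.read_sql('select * from test_copy', con)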