Interacting with Databases

In many applications data rarely comes from text files, that being a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL (so-called NoSQL) databases have become quite popular. The choice of database is usually de- pendent on the performance, data integrity, and scalability needs of an application.f



In [16]:

    
from pandas import DataFrame

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I’ll use an in-memory SQLite database using Python’s built-in sqlite3 driver:



In [17]:

    
import sqlite3

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect(':memory:')
con.execute(query)
con.commit()

Then, insert a few rows of data:



In [18]:

    
data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

con.executemany(stmt, data)
con.commit()

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:



In [22]:

    
cursor = con.execute("select * from test where b LIKE 'Georgia'")
rows = cursor.fetchall()
rows









    Out[22]:





[(u'Atlanta', u'Georgia', 1.25, 6)]

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor’s description attribute:



In [20]:

    
cursor.description









    Out[20]:





(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))



In [21]:

    
df = DataFrame(rows, columns=zip(*cursor.description)[0])
df









    Out[21]:






  
    
      
      a
      b
      c
      d
    
  
  
    
      0
      Atlanta
      Georgia
      1.25
      6
    
    
      1
      Tallahassee
      Florida
      2.60
      3
    
    
      2
      Sacramento
      California
      1.70
      5

This is quite a bit of munging that you’d rather not repeat each time you query the database. pandas has a read_sql function that simplifies the process. Just pass the select statement and the connection object:



In [24]:

    
import pandas.io.sql as sql
sql.read_sql('select * from test where c>2.0', con)









    Out[24]:






  
    
      
      a
      b
      c
      d
    
  
  
    
      0
      Tallahassee
      Florida
      2.6
      3



In [ ]:

	a	b	c	d
0	Atlanta	Georgia	1.25	6
1	Tallahassee	Florida	2.60	3
2	Sacramento	California	1.70	5