Database Interaction with Python

PEP 249 - The Python Database API Specification gives very specific instructions for how Python developers should interact with databases. Although there are some notable differences in the database backends, what that means is that from the Python perspective, the use of different databases like SQLite, PostgreSQL, MySQL, etc. should all be very similar. In this tutorial, we will go over the use of Python with SQLite - a lightweight and simple database that will immediately make the data management in your apps more effective.


In [ ]:
## Imports

import sqlite3

The first thing to keep in mind is that we have to import driver code - that is the API for the specific database that we want to use. The most common are:

Though a host of other databases for vendors like IBM, Microsoft, and Oracle can be found at Python Database Interfaces.

In this tutorial we will be using sqlite3 because it ships with Python (the other drivers are third party) and because it is so simple to use. SQLite databases are the embedded backbone of many applications, though they should be kept small.

Connecting to a Database

The first thing you have to do is make a connection to a database. Often times this means you'll need the following information to connect to a database server:

  • hostname
  • port
  • username
  • password
  • database name

SQLite is an embedded database, however - which means it is stored in a file on disk, and operated on soley by a single program (not multiple programs at once). Therefore in order to create a connection to a SQLite database, we simply need to point it to a file on disk.


In [ ]:
DBPATH = 'people.db'
conn = sqlite3.connect(DBPATH)

At this point, you should notice that a file called people.db has been created in your current working directory!

The connect method returned a connection object that we've called conn. With conn you can manipulate your connection to the database including the following methods:

  • conn.commit() - commit any changes back to the database
  • conn.close() - close our connection to the database and tidy up

However, to execute SQL against the database to INSERT or SELECT rows, we'll need to create a cursor:


In [ ]:
cursor = conn.cursor()

A cursor is essentially a pointer into the database. Think of it like a mouse cursor that keeps track of where you are on in the database table or tables. Cursors have the following methods:

  • cursor.execute() - executes a SQL string against the database
  • cursor.fetchone() - fetch a single row back from the executed query
  • cursor.fetchall() - fetch all results back from the executed query.

Together, connections and cursors are the basic way to interact with a SQL database.

Describing the Database

The first thing we have to do is describe the type of data that we'll be putting in the database by creating a schema. For this workshop, we'll be creating a very simple contacts application, our schema is as follows:

Here we have two tables, contacts which keeps track of people, their email, and who they are affiliated with, and companies which keeps tracks of organizations. To create the companies table we would execute SQL as follows:


In [ ]:
sql = (
    "CREATE TABLE IF NOT EXISTS companies ("
    "    id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "    name TEXT NOT NULL"
    ")"
)

cursor.execute(sql)

A note on the syntax above - since I like to write clean, well-indented SQL; I used a string concatentation method in Python, by opening up a parentheses and adding strings without commas on new lines between them. If you print sql you'll see it's just one long string with spaces inside of it. You could also use docstrings with the three quotes """ to write a multiline string, or even read in the SQL from a file.

Now write and execute the contacts table create statement.


In [ ]:
# Create the contacts table

Inserting Records

The next thing we'll want to do is insert some records into the database; let's add Georgetown University to the companies table.


In [ ]:
sql = "INSERT INTO companies (name) VALUES (?)"
cursor.execute(sql, ("Georgetown University",))
conn.commit()

Here we've created essentially a SQL template for inserting the names of companies into the table. Note that we don't have to assign an id, since it will be automatically assigned using the AUTOINCREMENT property of that field.

The ? is a parameter to the query, and can be used as a placeholder for any user input. Values for the parameters are then passed to the second argument of the execute method as a tuple. You should not use string formatting methods like:

sql = "INSERT INTO companies (name) VALUES ({})".format("Georgetown University")

This is potentially unsafe behavior, and the ? parameters do a lot of work on your behalf to make sure things work correctly and securely.

Let's go ahead and insert another record using the same sql statement.


In [ ]:
cursor.execute(sql, ("US Department of Commerce",))
conn.commit()

The last thing we should mention is the commit call. Nothing will be written to the database until commit is called. This gives us an interesting ability to do transactions - a series of SQL queries that when completed together succesfully, we commit them. However if something goes wrong during execution, we don't commit and therefore "rollback".

Selecting Records

Before we go on to insert contact information, we need to know the id of the company of the contact we're inserting. However, because we inserted the data using the auto increment feature, we don't know what the company's ids are. To read them, we'll have to fetch them as follows:


In [ ]:
cursor.execute("SELECT id FROM companies WHERE name=?", ("Georgetown University",))
print cursor.fetchone()

Using the same parameter statement in our predicate clause and passing in the tuple containing "Georgetown University" as an argument to execute, we can select the id that we need, which is returned as a Row. Rows present themselves as tuples, and since we only fetched the ID, it is the first element in the record. Note that we can use SELECT * to select all fields in a record if so desired.

The fetchone statement goes and gets the first record it finds. Note that the name of companies are not constrained uniqueness, therefore there could be multiple "Georgetown University" records fetched from this query. If you wanted all of them, you would use fetchall.

Now to insert a person who works for Georgetown University you would write a statement similar to:

sql = "INSERT INTO contacts (name, email, company_id) VALUES (?,?,?)" 
cursor.execute(sql, ("Benjamin Bengfort", "bb830@georgetown.edu", 1))
conn.commit()

And at this point you should start inserting some contacts and companies.


In [ ]:
# Insert some contacts and companies using the methods described above.

Consider the following questions while you're working with the database:

  • What errors have you been receiving? What would you do to prevent them (e.g. create guarentees)?
  • What happens if you pass in a name that doesn't exist in the database to our company select?
  • What happens if you run the insert statements twice in a row with the same values?
  • How would you ensure that company name, and email are unique? What other fields should be unique?

Workshop

At this point, you should be able to create a small Python application that does the following:

  • Inserts contacts into the database by gathering their name, email, and company name
  • Prints out a list of companies and the number of contacts associated with each

Using the following incomplete code as a template to help you design your project:


In [ ]:
import os
import sqlite3

def create_tables(conn):
    """
    Write your CREATE TABLE statements in this function and execute
    them with the passed in connection. 
    """
    # TODO: fill in. 
    pass


def connect(path="people.db", syncdb=False):
    """
    Connects to the database and ensures there are tables.
    """
    
    # Check if the SQLite file exists, if not create it.
    if not os.path.exists(path):
        syncdb=True

    # Connect to the sqlite database
    conn = sqlite3.connect(path)
    if syncdb:
        create_tables(conn)
    
    return conn


def insert(name, email, company, conn=None):
    if not conn: conn = connect()

    # Attempt to select company by name first. 
    
    # If not exists, insert and select new id.
    
    # Insert contact

    
if __name__ == "__main__":
    name    = raw_input("Enter name: ")
    email   = raw_input("Enter email: ")
    company = raw_input("Enter company: ")
    
    conn = connect()
    insert(name, email, company, conn)

    # Change below to count contacts per company! 
    contacts = conn.execute("SELECT count(id) FROM contacts").fetchone()
    print "There are now {} contacts".format(*contacts)

    conn.close()

Questions to Think about for Next Time

  1. Why use a database? Why not just use Pandas dataframes?
  2. Why is it easier to work with a database then with raw JSON?
  3. How is using PostgreSQL different from using SQLite?
  4. How is MongoDB different than a relational database?
  5. What would make you choose a relational database over a NoSQL database (and vice versa)?