Today's lab is a crash-course in relational databases, as well as SQL (Structured Query Language), which is the most popular language for managing relational databases.
There are many database management system ("DBMS") products that support SQL. The one we will consider in this class is the simplest, called sqlite3. It stores the database in a simple file and can be run in a "standalone" mode. However, we will consider invoking it from Python.
With a little luck, you might by the end of this class understand this xkcd comic on SQL injection attacks.
In [ ]:
import sqlite3 as db
# Connect to a database (or create one if it doesn't exist)
conn = db.connect ('example.db')
sqlite
maintains databases as files; in this example, the name of that file is example.db
.
If the named file does not yet exist, connecting to it in this way will create it.
To issue commands to the database, you also need to create a cursor.
In [ ]:
# Create a 'cursor' for executing commands
c = conn.cursor ()
A cursor tracks the current state of the database, and you will mostly be using the cursor to manipulate or query the database.
The main object of a relational database is a table.
Conceptually, your data consists of items and attributes. In a database table, the items are rows and the attributes are columns.
For instance, suppose we wish to maintain a database of Georgia Tech students, whose attributes are their names and GT IDs. You might start by creating a table named Students
to hold this data. You can create the table using the command, create table
.
In [ ]:
c.execute ("create table Students (gtid integer, name text)")
Note: This command will fail if the table already exists. If you are trying to carry out these exercises from scratch, you may need to remove any existing
example.db
first.
To populate the table with items, you can use the command, insert into
.
In [ ]:
c.execute ("insert into Students values (123, 'Vuduc')")
c.execute ("insert into Students values (456, 'Chau')")
c.execute ("insert into Students values (381, 'Bader')")
c.execute ("insert into Students values (991, 'Sokol')")
Given a table, the most common operation is a query. The simplest kind of query is called a select
.
The following example selects all rows (items) from the Students
table.
In [ ]:
c.execute ("select * from Students")
Conceptually, the database is now in a new state in which you can ask for results of the query. One way to do that is to call fetchone()
on the cursor object, which will return a tuple corresponding to a row of the table.
This example calls fetchone()
twice to get the first two query results.
In [ ]:
print (c.fetchone ())
print (c.fetchone ())
An alternative to fetchone()
is fetchall()
, which will return a list of tuples for all rows, starting at the cursor.
Since the preceding code has already fetched the first two results, calling
fetchall()
at this point will return all remaining results.
In [ ]:
print (c.fetchall ())
Question. What will calling fetchone()
at this point return?
In [ ]:
print (c.fetchone ())
Here is an alternative, an arguably more natural, idiom for executing a query and iterating over its results.
In [ ]:
query = 'select * from Students'
for student in c.execute (query):
print (student)
In [ ]:
# An important (and secure!) idiom
more_students = [(723, 'Rozga'),
(882, 'Zha'),
(401, 'Park'),
(377, 'Vetter'),
(904, 'Brown')]
c.executemany ('insert into Students values (?, ?)', more_students)
query = 'select * from Students'
for student in c.execute (query):
print (student)
Exercise. Suppose we wish to maintain a second table, called Takes
, which records classes that students have taken and the grades they earn.
In particular, each row of Takes
stores a student by his/her GT ID, the course he/she took, and the grade he/she earned. More formally, suppose this table is defined as follows:
In [ ]:
c.execute ('create table Takes (gtid integer, course text, grade real)')
Write a command to insert the following records into the Takes
table.
In [ ]:
# Insert your solution here; use the next cell to test the output
taken_spring2015 = [(991, "CSE 6040", 4.0),
(456, "CSE 6040", 4.0),
(123, "CSE 6040", 2.0),
(123, "ISYE 6644", 3.0),
(123, "MGMT 8803", 1.0),
(991, "ISYE 6740", 4.0),
(456, "CSE 6740", 2.0),
(456, "MGMT 8803", 3.0)]
c.executemany ('insert into Takes values (?, ?, ?)', taken_spring2015)
In [ ]:
# Displays the results of your code
for row in c.execute ('select * from Takes'):
print (row)
The "big idea" in a relational database is to build queries that combine information from multiple tables. A join query is one such operation.
There are many types of joins, but the simplest is one in which you use the where
clause of a select
statement to specify how to match rows from the tables being joined.
For example, recall that the Takes
table stores classes taken by each student. However, these classes are recorded by a student's GT ID. Suppose we want a report where we want each student's name rather than his/her ID. We can get the matching name from the Students
table. Here is a query to accomplish this matching:
In [ ]:
# See all (name, course, grade) tuples
query = '''
select Students.name, Takes.course, Takes.grade
from Students, Takes
where Students.gtid=Takes.gtid
'''
for match in c.execute (query):
print (match)
Exercise. Write a query to select only the names and grades of students who took CSE 6040.
In [ ]:
In [ ]:
query = '''
select Students.name, avg (Takes.grade)
from Takes, Students
where Students.gtid=Takes.gtid
group by Takes.gtid
'''
for match in c.execute (query):
print (match)
Exercise. Compute the GPA of every student, but report the name (rather than GT ID) and GPA.
In [ ]:
In [ ]:
c.close()
conn.close()
In [ ]: