In this lesson, we focus on a basic component of relational database management systems, SQL. In this Notebook we will use the SQLite database to build a fictitious database. We will cover SQL data types and how to create SQL schemas before moving on to creating and executing queries and finishing with updating and deleting data.
We can now start using the SQLite database. First we will test SQLite
from within the IPython Notebook, before switching to the command line
to actually create and populate a database. Note that if we run the
sqlite3
command line client from within an IPython Notebook cell, the
process will continue to run in the background since we can not
directly enter commands. Thus, you should either redirect STDIN for
the sqlite3
client to be a file of commands, or work directly with
this tool at the command line.
By default, the sqlite3
command line client will operate in
interactive mode. However, this tool will also read and execute commands
either in from a separate file by redirecting STDIN or by enclosing
the commands in quotes. Since SQLite databases are files, unless
explicitly created from within a program as in memory databases, we pass
the name of the database as a command line argument. Thus, to connect to
the i2ds
database with the sqlite3
command line client in
interactive mode, we simply enter the following at a command prompt in
our Docker container:
/notebooks/i2ds/database# sqlite3 i2ds
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
To exit from the sqlite3
client, simply enter either ctrl-d or use the
.quit
command. The sqlite3
client can either accept SQL commands,
which we will discus in more detail in the next lesson, or the client
can accept dot commands, which are instructions to the SQLite database
engine that begin with a .
character. These commands can be explicitly
listed by entering .help
at the sqlite3
client prompt. We can do
this from within our Notebook by creating and using a file as shown below.
In [ ]:
# Run the SQLite command line client
!sqlite3 -help
In [1]:
%%writefile help.txt
.help
In [1]:
# List 'dot' commands
!sqlite3 < help.txt
The previous code block listed the available dot commands that we can
use within the sqlite3
client. We will use several of these, including
.header
: If this is on
, headers will be displayed. Generally this
is used to see the names of columns in a table.
.separator
: This specifies the separator character for fields, either
in output displays or when importing data. By default, the separator
character is the vertical bar, |
. However, to import data from a
comma-separated-value (CSV) file, you would change the separator to a
comma, .separator ","
.
.import
: This command is used to read data from a file and insert
this data into a specific table in the database.
.schema
: This command will list the schema commands required to
recreate the tables contained in the database.
.stats
: This command lists statistics after each command entered at
the sqlite prompt. This can be useful for profiling SQL commands.
.width
: This command changes the default width for columns, which
can improve the visual formatting of the results from database queries
that are displayed to the screen.
We can easily create and populate a database by using the sqlite3
client. While we could do this at the command line (and advanced users
are encouraged to do so), we can also complete these tasks from within
this IPython Notebook. The steps we must complete include
Create the new database. We do this by simply passing the name of our
new database to the sqlite3
client. If the file does not exist, a new
file will be created. This file will hold the entire contents of the new
database.
Create the schema for our new database. A relational database is
built on a tabular data model. Thus our schema consists of the table
definitions as well as the relationships that might exist between
tables. To accomplish this, we must execute SQL CREATE TABLE
statements. For now, we simply create the schema, the next lesson
explores SQL statements in more detail. A schema file for the airline
data is included in this lesson, under the schema
directory.
Populate the tables with data. For simplicity, we will use the
.import
command within the sqlite3
client to import data from a file
directly into the relevant table in our database.
In [ ]:
# First we make a new directory to hold our database
!mkdir /notebooks/i2ds/database
In [ ]:
!ls -l /notebooks/i2ds/database
In [ ]:
# We could create a schema SQL file, then use Bulk imort to load the databases.
Before you can begin to develop database applications, you need to understand the basic concepts. Relational databases hold data. This data can be of different types, such as numbers, characters, or dates. Within the database, the data are organized into logical units called tables. A table is like a spreadsheet, because it contains rows of data. Each row is made up of a number of columns. The columns hold data of a specific data type, like integer values or strings of characters. In most cases, a database has more than one table. To relate the tables together, a database designer takes advantage of natural (or artificial) links between the tables. In a spreadsheet, you can link rows in different sheets by cell values. The same idea holds in a relational database, and the column used to make the link is referred to as a key column.
To make it easier to understand the purpose of a table or a particular column, you should select appropriate names. The naming conventions can vary between databases. For a SQLite database, individual names:
by default, case insensitive (although this can be changed),
are unlimited in length, but should be kept to a reasonable length (given readability constraints),
must begin with a letter or an underscore character, and
must only code alphanumeric characters or underscores.
You can escape these rules by placing the name in double quotation marks, which allows names to be case sensitive and to include additional characters (including spaces). Doing this, however, is generally a bad practice: It requires the name to always be enclosed in double quotation marks and can easily confuse someone else who may be maintaining your code. Finally, a name can not be one of the reserved keywords.
For this class, we will follow a specific style: All SQL commands are presented entirely in uppercase, and item names use camelCase. In camelCase style, words are joined together, and the first letter of each word-following the first one-is capitalized, such as aLongIdentifier. Combining these two styles together, these articles write SQL commands using the following style: SELECT aLongIdentifier FROM dataTable ;.
Related tables are often grouped together into a schema. You can think of a schema as a container for all the related structure definitions within a particular database. A table name must be unique within a given schema. Thus, by using schemas, you can have identically named objects (such as tables) enclosed within different schemas. When you're using the Apache Derby database, a table is always in a schema. If you don't specify a schema explicitly, Derby implicitly uses the built-in apps schema. A second built-in schema called sys is used to isolate system tables.
You can use the schema name to qualify a name. By default, SQLite uses the database name as the schema, and you do not need to prefix names to indicate the correct schema. For other databases, however, this is not the case. For these databases, you specify the schema name followed by a period and then the table name. For example, bigdog.products references the products table in the bigdog schema. Without the relevant schema name, a table name is said to be unqualified, as in products. When the schema name and the table name are completely specified, as in bigdog.products, the name is said to be fully qualified.
In an abstract sense, these database concepts may seem confusing, but in practice they're fairly straightforward. For example, imagine you own a store called Bigdog's Surf Shop that sells a variety of items like sunglasses, shirts, and so on. If you want to be profitable, you must keep a close eye on your inventory so you can easily order additional inventory or change vendors to keep your overhead to a minimum. One simple method for tracking this information is to write entries in a table-like format:
Item# | Price | Stock Date | Description |
---|---|---|---|
1 | 29.95 | 1/15/15 | Basic Sunglasses |
2 | 9.95 | 12/14/14 | Generic Shirt |
3 | 99.95 | 8/04/14 | Boogie Board |
Item# | Vendor# | Vendor Name |
---|---|---|
1 | 101 | Mikal Arroyo |
2 | 102 | Quiet Beach Industries |
3 | 103 | Vista Luna |
From this simple visual design you can easily map the business logic straight into database tables. You have two database tables, Products and Vendors, which are naturally linked by the item number. The data types for the columns in each table are easy to determine. Later in this lesson we will actually create this sample schema for Bigdog's Surf Shop, which consists of these two tables, in a SQLite database. But first, we need to address how data is stored in a relational database table.
Database systems can be complex pieces of software, especially when they scale to support enterprise-level applications. As a result, you may expect that every database has its own application programming interface (API) and that these APIs may be different from one system to the next. When relational databases were first developed, this was the case; but, fortunately, a number of vendors agreed to develop a standard language for accessing and manipulating relational databases. This language is officially called Structured Query Language (or SQL, pronounced sea-quill). Several official standard versions have been produced, including one in 1992 that is referred to as SQL-92, and one in 1999 that is referred to as SQL-99. The Apache Derby database provides a nearly complete implementation of the SQL-92 standard, so applications developed with Derby can be easily transported to other database systems.
SQL has two main components: a Data Definition Language (DDL) and a Data Manipulation Language (DML). DDL commands are used to create, modify, or delete items (such as tables) in a database. DML commands are used to add, modify, delete, or select data from a table in the database. The rest of this article provides a basic introduction to the DDL components of SQL. Future articles will focus on the DML commands and more advanced DDL commands.
SQL, being a programming language in its own right, defines a rich data-type hierarchy. Persisting these data types is one of the most important responsibilities of the database. As databases have become more powerful, this type hierarchy has grown more complex. But most simple databases don't require the full range of allowed types, and often they need to store only numerical, character, and date or time data.
While the SQL standard defines basic data types, different database systems can support the standard to varying degrees. While this might seem odd, doing so provides more flexibility in allowing a particular implementation to achieve a market niche. In the case of SQLite, the design decisions support a compact, zero-configuration database file that is platform-independent. As a result, SQLite does not support a rich data type hierarchy, and instead focuses on ease-of-use.
SQLite supports five storage classes:
NULL: A null value.
INTEGER: A signed integer, the number of bytes (1, 2, 3, 4, 6, or 8) used depends on the magnitude of the value.
REAL: A floating-point value stored as an 8 byte IEEE floating-point value.
TEXT: A string of character values stored in the default database encoding (e.g., UTF-8).
BLOB: A blob of data stored exactly as is in the database.
Note that SQLite does not support Boolean or Date/Time valus directly. Instead, Boolean values are encoded as INTEGERs (0 = False, 1 = True). Likewise Date/Time values can be encoded either as TEXT, REAL, or INTEGER values. For full details, see the SQLite documentation. In addition, SQLite supports the concept of Type Affinity, whereby different data types can be easily mapped into each other. This simplifies moving schemas from other database systems to SQLite.
Before you begin creating database tables, you must know what to do when no value is specified for a column. To illustrate this point, imagine that you've been asked to fill out a Web form. If you leave a particular column blank, what is inserted into the database? As you can imagine, this problem could be cumbersome if you had to track no value markers. Fortunately, SQL defines a special value, NULL, to indicate that a column has no value.
So far, you've learned how to design a table, including mapping out the table columns and defining the data type for each column. After you've properly designed a table, the method for creating a table in SQL is straightforward. Listing 1 shows the formal syntax for creating a table in Derby.
-- Comment describing the purpose and layout of the table
CREATE TABLE tableName ( { <columnDefinition> | <tableLevelConstraint> }
[, { <columnDefinition> | <tableLevelConstraint> } ]* ) ;
You may feel bewildered after looking at this syntax for the first time. But it's easy to follow once you have the basics down. The square brackets ([ and ]) enclose optional parameters. As you can see from the formal syntax, any column definitions or table-level constraints after the required initial one (it wouldn't make sense to create a table with no columns!) are optional.
You probably understand what is meant by a column definition, but you might not understand the idea of a constraint. Constraints come in two types: table-level constraints and column constraints. A constraint limits either a column or a table in some manner. For example, you can use a constraint to require that a column always be assigned an actual value (no NULL values), or that every entry in a column must be unique, or that a column is automatically assigned a default value.
The asterisk (*) after the last closing square bracket indicates that more than one of the enclosing items can be included. This implies that the table must have one or more columns or table-level constraints. The vertical line (|) indicates an either/or condition. In this syntax example, you must either define a new column or define a new table-level constraint. The curly brackets ({ and }) group related items together, and the parentheses (( and )) are required elements. Finally, the semicolon (;) indicates the end of a SQL statement.
In the following code block, we create our schema for Bigdog's Surf
Shop, which includes two new tables: myProducts
and myVendors
. The
myProducts table has four columns: itemNumber, price, stockDate, and
description. The itemNumber column provides a unique identity for each
item (or row) and has an attached column-level constraint that enforces
a valid value to always be supplied (NOT NULL). Without this
requirement, the itemNumber column isn't guaranteed to be unique because
multiple columns could be assigned a NULL value. The price column is
created as a REAL data type. The last two columns are simple: The
stockDate column is stored as a TEXT (we can use application logic to
transform the date/time information into and out of the appropriate
YYYY-MM-DD format), and description is also stored in a TEXT field.
The myVendors table has three columns: itemNumber, vendorNumber, and vendorName. In this case, both the itemNumber and vendorNumber columns have attached column-level constraints (NOT NULL). In addition, the vendorName column is stored as a TEXT field.
No one is perfect. What do you do when you incorrectly create a table or a table is no longer needed? The simple answer is to delete the table from the database and, if necessary, create a replacement table. Deleting a table is easy, which means, of course, that you should exercise great care when doing so -- no dialog box pops up and asks if you're sure you want to proceed!
The full syntax for deleting-or, more formally, dropping-a table from a database in SQLite is
DROP TABLE tableName ;
The syntax is simple: You append the fully qualified name and a semicolon to the DROP TABLE SQL command, and you're finished.
In the next few code cells, we first create our schema file, before
executing the SQL commands on our new database by using the sqlite3
client tool.
In [1]:
%%writefile create.sql
-- First we drop any tables if they exist
-- Ignore the no such Table error if present
DROP TABLE myVendors ;
DROP TABLE myProducts ;
-- Vendor Table: Could contain full vendor contact information.
CREATE TABLE myVendors (
itemNumber INT NOT NULL,
vendornumber INT NOT NULL,
vendorName TEXT
) ;
-- Product Table: Could include additional data like quantity
CREATE TABLE myProducts (
itemNumber INT NOT NULL,
price REAL,
stockDate TEXT,
description TEXT
) ;
In [3]:
# Now create the schema in a new test database
!sqlite3 test < create.sql
In [4]:
# We can test the results
!sqlite3 test ".schema"
As the previous code blocks demonstrated, we can write SQL commands into
a script file that can be easily executed by the sqlite3
client. While
this might seem like overkill given the simplicity of our current
schema, it is actually a useful technique. Often, you'll need to execute
multiple, complex commands. To simplify debugging a set of complex SQL
commands, it's generally easier to write them in a text file and then
execute the commands in the text file all at once. By placing SQL
commands in a script file, you gain the additional benefit of being able
to execute the commands as many times as necessary.
A script file is just a plain text file that contains a combination of
SQL commands and SQLite commands that can be run directly from the
sqlite3
tool. A script file simplifies the development and maintenance
of relational databases and provides a self-documenting technique for
building databases. You should store these files as ASCII text files,
not as RTF files (or any other format), to prevent text-encoding errors.
Some text applications may try to automatically save your file as a rich
text file. Be careful to avoid doing so, or you may have problems when
you try to execute your script file. For this reason, you probably will
want to always create and edit your SQL scripts at the Unix command
prompt.
Our previous script file, create.sql
, includes several lines that
start with two dashes (--). These lines are SQL comments; you should use
them to provide a basic description of the purpose of each major
component within the script file. The first actual commands ion the
script file are SQL DROP statements that delete the myProducts and
myVendors tables from the database. If the tables don't exist (which is
the case if the database was just created), an error message is
displayed; but as the preceding SQL comments indicate, you can safely
ignore those messages.
You first drop the tables, if they exist, so that you can cleanly create
new tables with the exact column definitions you need. The next two SQL
statements do just that, creating the myProducts and myVendors tables.
To run this script file, we can either have the sqlite3
client tool
read the file form STDIN, as shown above, or we can start sqlite3
and use the .read
command to read and execute SQL commands from our
create.sql
file.
Sometimes, no matter how hard you try, things don't work out quite right. If you can't safely execute the create.sql script, there are a number of possibilities to check:
Be sure the SQLite client tool starts up properly. You can do this at the Unix command prompt.
Be sure you have free disk space in which to create a new database.
Be sure you have proper permissions (to read the script file and to create the new database) in the directory where you try to execute the script file.
Be sure your script file is a simple ASCII text file (and not an RTF file).
If the output of the !sqlite3 test ".schema"
matches the data in the
create.sql
file, congratulations are in order. You now have a new test
database with two new tables ready to hold data.
Now we'll focus on the process of inserting data into a table by using SQLite. To follow along, you will need a SQLite database with the myProducts table available. If you haven't already done so, you should execute the create.sql script file.
Several sites exist that allow you to try out SQL commands online.