Introduction to Data Persistence

Professor Robert J. Brunner



Introduction

Persisting data is an important task, and not just for data science applications. Programs may need to persist data to ensure state, to share information, and to improve performance. As a result, many different approaches exist for saving data, spanning everything from basic file input/output techniques to enterprise level database management software. In this lesson, we explore some of these different techniques with the goal of leveraging them to facilitate data science investigations.


Course Information

Before jumping into a discussion of data persistence technologies, and database systems in particular, I want to take some time to discuss how I recommend you work with these IPython Notebooks.

  1. Create a Docker container from our course Docker image by executing docker run. This command should be run either with the -d flag (to run the container in the background, as shown below) or with the -it flags for interactive use, and with the -v flag to share local folders with the running Docker container. On my Mac OSX laptop, I run the docker container with this command:

docker run -d -p 8888:8888 -e "PASSWORD=temp123" -v /Users/rb/i2ds:/notebooks/i2ds lcdm/info490

You should change the local directory (for me this is /Users/rb/i2ds) to your local equivalent. On a Mac OSX system this means changing rb to your username. On a Windows system, the change will be similar to /c/Users/rb/i2ds, assuming your username is rb. Note that you should understand every flag in this invocation.

  2. Open a terminal window in the running Docker container. This requires knowing the name of the running container, which you get from docker ps, and then issuing a docker exec command to open a terminal in the running container. So assuming docker ps tells me my IPython server is running in a container named silly_lovelace, I would open a terminal window into this container by issuing the following command:

docker exec -it silly_lovelace /bin/bash

  3. Once I have a terminal open into this running Docker container (which I can tell because the prompt changes to #), I cd into the /notebooks/i2ds directory and obtain the latest course git repository. For example, I can issue the following git clone command the first time I do this; afterwards, I can change into the repository and issue a git pull.

git clone https://github.com/INFO490/spring2015

At this point, we have a running IPython server that holds the full course repository. You can browse the local IPython notebooks by opening a web browser to your container's IP address (usually http://192.168.59.103:8888), entering the password, and then browsing the directory structure until you find the relevant notebook.

Data

You can copy the stat-computing data either at the Unix command line by using wget, or in an IPython notebook cell by using !wget, as sketched below. I strongly encourage you to put the data in a directory called /notebooks/i2ds/data. This will match what I have in the notebooks for this week. At this point, you should be ready to start this week's lessons.
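For example, a notebook cell along the following lines would place the airports.csv file in the recommended directory (the URL below is only a placeholder; use the actual download address from the course instructions):

!mkdir -p /notebooks/i2ds/data
!wget -P /notebooks/i2ds/data http://example.com/path/to/airports.csv

The -P flag tells wget which directory to save the file into, so the resulting path matches the one used in this week's notebooks.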


Data Persistence Techniques

We have already discussed the simplest persistence technique, basic file input/output, in previous lessons. By using the Python programming language, you can open a file for reading and writing, and even use binary mode to save storage space (or directly apply a compression technique by using an appropriate Python library, such as the bz2 module). As a reminder, the following code segment demonstrates how to write data to a file called test.dat.



In [1]:
data = """Data to write to the file, which can easily include any Python datatype 
by using string formatting techniques."""
    
with open('test.dat', 'w') as fout:
    fout.write(data)

and we can easily read data back into a Python program (and in this example, simply echo the text to STDOUT) in a similar manner:



In [2]:
with open('test.dat', 'r') as fin:
    for line in fin:
        print(line)


Data to write to the file, which can easily include any Python datatype 

by using string formatting techniques.
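As noted above, we can also apply compression directly from Python. The following is a minimal sketch using the standard bz2 module (any comparable compression library would work similarly), writing and then reading a compressed copy of the same text:

import bz2

data = """Data to write to the file, which can easily include any Python datatype 
by using string formatting techniques."""

# Write the text in compressed form; bz2.open handles the compression for us.
with bz2.open('test.dat.bz2', 'wt') as fout:
    fout.write(data)

# Read the compressed file back and echo it to STDOUT.
with bz2.open('test.dat.bz2', 'rt') as fin:
    print(fin.read())

Even with compression, however, we are still simply reading and writing strings through the file system.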

While this works, it is not optimal for several reasons:

  1. All data is written and read as Python strings. Complex arrangements of heterogeneous data thus require potentially complex (and costly in execution time) transformations.

  2. All concurrency is provided by the file system; thus, we are not guaranteed consistent results if multiple writers work at the same time.

  3. Without extra effort (for example, writing to a binary file or employing compression), this approach is costly in terms of storage space.

  4. We rely completely on the underlying file system for consistency and durability. Thus, persisted application state may have unintentional dependencies on the underlying file system.

An alternative approach is available for more advanced data structures, like the NumPy array.

import numpy as np

data = np.genfromtxt('i2ds/data/airports.csv', names=True, 
    delimiter=',', dtype=None, invalid_raise=False)

And if we develop our own data types, we can create our own custom input/output routines to read/write any new objects we have created. But this can be a lot of extra work, especially verifying that the routines remain correct as a program is continually developed or maintained. Furthermore, this doesn't solve all of the problems, as we still rely entirely on the file system to maintain concurrency, consistency, and durability.
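As a purely illustrative sketch, a hypothetical Point class might carry its own input/output routines like this (the class, methods, and file name are made up for this example):

class Point:
    """A hypothetical data type with its own input/output routines."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def write(self, fout):
        # Persist the object as a single formatted line of text.
        fout.write('{0},{1}\n'.format(self.x, self.y))

    @classmethod
    def read(cls, line):
        # Reconstitute a Point from a line previously written by write().
        x, y = line.strip().split(',')
        return cls(float(x), float(y))

with open('points.dat', 'w') as fout:
    Point(1.5, 2.5).write(fout)

with open('points.dat', 'r') as fin:
    p = Point.read(fin.readline())
    print(p.x, p.y)

Every new data type needs this kind of boilerplate, which is exactly the burden that pickling, described next, removes.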


Pickling

Fortunately, Python provides a simple technique, called pickling, that we can use to easily save data to a file and to later reconstitute the data into a Python program. Pickling writes the class information for any data being written to the file along with the data. When you unpickle data, this class information is used to properly reconstitute the data in the pickled file. Pickling is easy to use and can often suffice for simple data persistence tasks. To pickle data to a file, you must import the pickle module and open a file in binary writing mode. After this, simply call the pickle.dump() method with the data to write and the file stream.



In [3]:
import numpy as np
import pickle

data = np.random.rand(100)

with open('test.p', 'wb') as fout:
    pickle.dump(data, fout)

Unpickling data is also easy: simply open the appropriate file in binary read mode and call the pickle.load() method to retrieve the data from the file and assign it to a variable.



In [4]:
with open('test.p', 'rb') as fin:
    newData = pickle.load(fin)

print(newData[0:20:4])


[ 0.09178283  0.50327161  0.63610371  0.42434451  0.66831167]

In [5]:
!ls -l test*


-rw-r--r-- 1 1000 staff 111 Feb 25 00:38 test.dat
-rw-r--r-- 1 1000 staff 958 Feb 25 00:40 test.p

While easier than custom read/write routines, pickling still requires the file system to provide support for concurrency, consistency, and durability. To go any further with data persistence, we need to move beyond Python language constructs and employ additional software tools.


Database Systems

Whether you realize it or not, as you surf the Internet you're interacting with a variety of database-backed Web applications. This nomenclature may be unfamiliar, but it simply means that a website you visit is dynamically created using data saved in a database. To demonstrate, consider the following types of Web sites that you may visit:

  • An information portal, like Yahoo

  • A newspaper Web site to catch up on the local news or sports

  • A financial Web site, like that of a bank or investment institution, to monitor your financial portfolio

  • A map website to find driving directions

  • A search engine where you can identify interesting Web sites for more detailed information on a subject

Each of these examples uses databases to store, locate, and retrieve information dynamically. In each of these applications, the website collects necessary information from the user (such as a street address), queries the application database, and assembles the requested data into a suitable visual result.

Many of these database systems are large and complex; imagine holding all the map information needed to provide accurate driving directions with pictures! Clearly, storing data and making it available to applications is a big task, one that has been addressed by a number of vendors that provide different solutions optimized for different tasks. Many of these open-source or commercial database systems provide full, enterprise-class capabilities. As a result, they can hold enormous quantities of data, concurrently interact with a large number of users, and scale across large computational systems.

We can broadly classify these systems into two categories:

  1. Relational Database Management Systems, like the open-source MySQL and PostgreSQL and commercial systems like IBM DB2, Microsoft SQL Server, or Oracle Database, that rely on a tabular data model.

  2. NoSQL (or Not only SQL) systems that abandon the tabular data model to achieve a simpler design, better scaling, or higher availability than is traditionally possible with relational databases. NoSQL databases can be classified based on their data model, and include key-value store databases like Amazon's Dynamo, object databases like ZopeDB, document store databases like MongoDB, and column databases like Cassandra or HBase, an open-source implementation of Google's BigTable model.

While NoSQL databases are extremely interesting, many of them were developed to meet the big data challenges faced by companies like Google, Facebook, or Amazon. For the rest of this week's lessons, we will focus on relational database systems.


Database Roles

As you might expect, working with these systems isn't trivial, and they can be expensive to operate. Historically, the tasks involved in working with these databases have been divided into three categories. Although the roles sometimes overlap, their individual responsibilities are easy to comprehend:

Database administrator (DBA): Responsible for the overall operation of the database system, which includes the selection and layout of the underlying hardware, the installation and optimization of the database server (especially given the hardware being used), and the day-to-day operations of the database server, such as data backup and recovery.

Database developer: Responsible for the actual databases in operation, including designing databases, schemas, tables, table relationships, and indexes as well as optimizing queries.

Database application developer: Responsible for integrating application code with the underlying database by using database application programming interfaces (APIs) to store and retrieve data as necessary.

If the previous discussion leaves you feeling intimidated, that's OK; working with databases has historically been difficult. To understand why, let's examine a specific example in more detail: online banking. When you connect to your bank's Web site, you provide your credentials (most likely a username and password) and thereby gain access to your financial accounts. You can view your data, pay bills, and transfer funds. The database your bank uses must quickly locate the relevant information, safely manage the transactions, securely interact with users, and, most importantly, not lose any data! And the bank must do this for a large number of users concurrently. To ensure these tasks are performed correctly, relational database systems are measured against a special test, known as the ACID test.


The ACID Test

Diamonds are obviously a valuable commodity, so valuable that counterfeits are a serious concern. One simple and (at least in the movies) popular test to determine whether a diamond is real is to run it across a piece of glass. Because diamonds are one of the hardest materials known, a real diamond easily cuts the glass surface; a fake, especially if it's made of glass itself, won't.

To a software developer, databases are equally valuable. If you use a database, you want to be sure it will safely store your data and let you easily retrieve the data later. You also want your database to allow multiple programs (or people) to work with the database without interfering with each other. To demonstrate, imagine you own a bank. The database for your bank must do the following, among other things:

  • Safely store the appropriate data
  • Quickly retrieve the appropriate data
  • Support multiple, concurrent user sessions

These tasks can be collectively referred to as the ACID test; ACID is an acronym for Atomicity, Consistency, Isolation, and Durability.

Atomicity means that operations with the database can be grouped together and treated as a single unit.

Consistency guarantees that either all the operations in this single unit (or transaction) are performed successfully, or none of them is performed. In other words, a database can't be in an unfinished state. To understand why these characteristics are important, think about a bank transaction during which money is transferred from a savings account into a checking account. If the transfer process fails after subtracting the money from your savings account and before it was added to your checking account, you would become poorer, and the bank would have an angry (ex)customer! Atomicity enables the two operations -- the subtraction from the savings account and the addition to the checking account -- to be treated as a single transaction. Consistency guarantees that both operations in the transaction either succeed or fail. That way, your money isn't lost.

Isolation means that independent sets of database transactions are performed in such a way that they don't conflict with each other. Continuing the bank analogy, consider two customers who transfer funds between accounts at the same time. The database must track both transfers separately; otherwise, the funds could go into the wrong accounts, and the bank might be left with two angry (ex)customers.

Durability guarantees that the database is safe against unexpected terminations. It may be a minor inconvenience if your television or computer won't work when the power goes out, but the same can't be said for a database. If the bank's computers lose power when transferring your funds, you won't be a happy customer if the transaction is lost. Durability guarantees that if the database terminates abnormally during a funds transfer, then when the database is brought back up, it will be able to recover the transaction and continue with normal operations.
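To make the funds-transfer example concrete, here is a minimal sketch using Python's built-in sqlite3 module (introduced in the next section); the account names and amounts are invented purely for illustration. The two updates are committed together or rolled back together, which is the atomic, consistent behavior described above.

import sqlite3

# Build a throwaway in-memory database with two illustrative accounts.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE accounts (name TEXT, balance REAL)')
db.execute("INSERT INTO accounts VALUES ('savings', 100.0), ('checking', 0.0)")
db.commit()

try:
    # The two updates form a single transaction:
    # subtract from savings, add to checking.
    db.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'savings'")
    db.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'checking'")
    db.commit()
except sqlite3.Error:
    # If anything fails, undo both updates so no money is lost.
    db.rollback()

print(db.execute('SELECT name, balance FROM accounts').fetchall())
db.close()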

Passing the ACID test is nontrivial, and many simple databases fall short. For critical e-business or Web-based applications, passing the ACID test is a must. This is one of the reasons so many companies and individuals utilize enterprise-level database systems, such as IBM DB2, Oracle Database, or Microsoft SQL Server. These databases are fully compliant with the ACID test, and can meet many of the data persistence needs of large corporations or organizations. To do so, however, often requires a large team that includes database administrators, database developers, and database application developers to ensure that data is effectively persisted and available as necessary for business applications.


SQLite

But not all applications are this demanding, especially when you're starting out and trying to learn the basic relational database concepts. If you're just learning to work with databases, or if you want to quickly prototype a database application, most commercial database systems can be cumbersome. Fortunately, open-source, ACID-compliant database systems exist, including the zero-configuration, serverless relational database system known as SQLite. By using SQLite, you can learn to work with a relational database by using SQL as well as the Python programming language. If you later find that your application requires a more powerful database system, you can always migrate your efforts; however, many systems continue to embed SQLite within their own applications, as demonstrated in the following webpage.



In [6]:
from IPython.display import HTML
HTML('<iframe src=https://www.sqlite.org/famous.html width=750 height=300></iframe>')


Out[6]:

What is SQLite?

SQLite is quite different from traditional relational database systems. SQLite does not have a separate server process; instead, it is a software library that, as the website states:

implements a self-contained, serverless, zero-configuration, transactional SQL database engine.

Before progressing, let's examine each of these concepts in turn:

  • self-contained: Nothing else is needed to use SQLite but the software library. Since this library comes with Python by default, we can use SQLite without any additional software downloads or installations. In addition, if you want to embed SQLite in your own application, you can obtain a single ANSI-C file that contains the entire SQLite library.

  • serverless: We interact with the SQLite database by using the SQLite library. The database is stored in a single file that is platform independent (so you can simply copy it over to a new machine with no further effort).

  • zero-configuration: SQLite does not use a server process, so there is no configuration required. While you can customize SQLite to change default limits, for most applications this is unnecessary. You can also pre-specify certain options for the sqlite3 command-line client in a separate configuration file (e.g., .sqliterc, located in the current user's home directory).

  • transactional: A transaction is a logical set of operations. SQLite is ACID-compliant and implements atomic commits, which means that either every operation within the transaction completes successfully or none of them does. No partial writes are persisted, so the database is always in a consistent state.

With this power, it is even more surprising that the SQLite library is quite small and can be compacted to as little as 300 KB if required.

SQLite by default stores data in a single database file; however, it can also be used as an in-memory database. SQLite has been distributed as a component of the Python language for many years, but it also has a stand-alone command-line client, called sqlite3, that we will use in this lesson to create a database, create a schema within that database, and import data.
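As a small preview (the table definition and rows below are made up purely for illustration, not the actual course data), the same basic steps of creating a database, defining a schema, and importing data can also be performed from Python with the built-in sqlite3 module:

import sqlite3

# Create (or open) a database file, define a simple table, and insert a few rows.
db = sqlite3.connect('test.db')
db.execute('CREATE TABLE IF NOT EXISTS airports (iata TEXT, city TEXT, state TEXT)')

rows = [('CMI', 'Champaign/Urbana', 'IL'),
        ('ORD', 'Chicago', 'IL')]
db.executemany('INSERT INTO airports VALUES (?, ?, ?)', rows)
db.commit()

# Query the table to confirm the rows were persisted.
for row in db.execute('SELECT * FROM airports'):
    print(row)

db.close()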


Additional References

  1. SQLite Documentation
  2. Free SQLite Tutorial
