Data Science

Structured Data: Relational Databases and Data Frames

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  1. APIs & JSON Lecture
  2. NoSQL Databases
  3. Lab: Twitter & MongoDB

Questions? (besides "what is the CAP theorem?")

Agenda

  1. Project Discussion
  2. Readiness Assessment
  3. Relational Databases
  4. Data Frames
  5. Lab: SQLite and Pandas

Projects!

¡¡¡Data (and 1 visualization) are due on Monday!!!

Readiness Assessment:

Q: What is the CAP theorem? (And why does it matter?)

Relational Databases

Those who cannot remember the past are condemned to repeat it.

The Life of Reason, by George Santayana

DB Evolution

  • 1960s
    • Hierarchical data structure (IBM IMS)
    • Network data structure (CODASYL)
  • 1970s
    • Relational data model
      • A Relational Model of Data for Large Shared Data Banks – E. F. Codd [1970]
    • System R (IBM), Ingres (Berkeley)

DB Evolution

  • 1980s
    • Commercialization of RDBMS (relational database management systems)
      • Oracle, Sybase, IBM DB2, Informix
    • SQL (Structured Query Language)
    • ACID (Atomic, Consistent, Isolated, Durable)
  • 1990s
    • PC RDBMS
      • Paradox, Microsoft SQL Server & Access
    • Larger DBs, driven by internet
    • Consolidation among commercial DB vendors

DB Evolution

  • 2000s
    • Commercialization of Open Source RDBMS
      • MySQL, Postgres
    • Evolving requirements expose RDBMS limitations
      • Storing complex and dynamic objects
      • Processing increasing data volumes
      • Analyzing massive amounts of data

Normalization

The objectives of normalization were stated as follows by Codd:

  1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
  2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
  3. To make the relational model more informative to users;
  4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34

First normal form (1NF)

A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.

What?

The following scenario illustrates how a database design might violate first normal form.
Taken from http://en.wikipedia.org/wiki/First_normal_form

Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

Customer

Customer ID | First Name | Surname   | Telephone Number
123         | Robert     | Ingram    | 555-861-2025
456         | Jane       | Wright    | 555-403-1659
789         | Maria      | Fernandez | 555-808-9633

The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

Customer

Customer ID | First Name | Surname   | Telephone Number
123         | Robert     | Ingram    | 555-861-2025
456         | Jane       | Wright    | 555-403-1659, 555-776-4100
789         | Maria      | Fernandez | 555-808-9633

Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.

A design that complies with 1NF:

A design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.

Customer Name

Customer ID | First Name | Surname
123         | Robert     | Ingram
456         | Jane       | Wright
789         | Maria      | Fernandez

Customer Telephone Number

Customer ID | Telephone Number
123         | 555-861-2025
456         | 555-403-1659
456         | 555-776-4100
789         | 555-808-9633

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer.
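As a concrete sketch of this design (table and column names follow the example above; the SQL types and the foreign-key declaration are assumptions, since the example doesn't specify any), the two tables could be declared in SQLite like so:

import sqlite3

# A minimal sketch of the 1NF design: one row per customer, one row per
# (customer, telephone number) pair. Types are assumed.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE CustomerName (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    surname     TEXT
);
CREATE TABLE CustomerTelephoneNumber (
    customer_id      INTEGER REFERENCES CustomerName(customer_id),
    telephone_number TEXT,
    PRIMARY KEY (customer_id, telephone_number)
);
""")
conn.execute("INSERT INTO CustomerName VALUES (456, 'Jane', 'Wright')")
conn.executemany("INSERT INTO CustomerTelephoneNumber VALUES (?, ?)",
                 [(456, '555-403-1659'), (456, '555-776-4100')])

Each additional telephone number becomes another row in the child table rather than a second value crammed into one field.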

Second normal form (2NF)

A table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key.

What?

Consider a table describing employees' skills:
Taken from http://en.wikipedia.org/wiki/Second_normal_form

Employees' Skills

Employee | Skill          | Current Work Location
Brown    | Light Cleaning | 73 Industrial Way
Brown    | Typing         | 73 Industrial Way
Harrison | Light Cleaning | 73 Industrial Way
Jones    | Shorthand      | 114 Main Street
Jones    | Typing         | 114 Main Street
Jones    | Whittling      | 114 Main Street

Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"

For example, after updating Jones' work location on his "Shorthand" and "Typing" records but not his "Whittling" record, the table would look like this:

Employees' Skills (after an inconsistent update)

Employee | Skill          | Current Work Location
Brown    | Light Cleaning | 73 Industrial Way
Brown    | Typing         | 73 Industrial Way
Harrison | Light Cleaning | 73 Industrial Way
Jones    | Shorthand      | 414 Brannon Street
Jones    | Typing         | 414 Brannon Street
Jones    | Whittling      | 114 Main Street

A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:

Employees

Employee | Current Work Location
Brown    | 73 Industrial Way
Harrison | 73 Industrial Way
Jones    | 114 Main Street

Employees' Skills

Employee | Skill
Brown    | Light Cleaning
Brown    | Typing
Harrison | Light Cleaning
Jones    | Shorthand
Jones    | Typing
Jones    | Whittling

Neither of these tables can suffer from update anomalies.

Not all 2NF tables are free from update anomalies, however. This brings us to...

Third normal form (3NF)

3NF was originally defined by E.F. Codd in 1971

A table is in 3NF if and only if it is in 2NF and every non-prime attribute of the table is non-transitively (i.e. directly) dependent on every superkey of that table.

Or...
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."

  • Requiring existence of "the key" ensures that the table is in 1NF
  • Requiring that non-key attributes be dependent on "the whole key" ensures 2NF
  • Requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF

An example of a 2NF table that fails to meet the requirements of 3NF is:
Taken from http://en.wikipedia.org/wiki/Third_normal_form

Tournament Winners

Tournament           | Year | Winner         | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open       | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters   | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

Tournament Winners

Tournament           | Year | Winner
Indiana Invitational | 1998 | Al Fredrickson
Cleveland Open       | 1999 | Bob Albertson
Des Moines Masters   | 1999 | Al Fredrickson
Indiana Invitational | 1999 | Chip Masterson

Player Dates of Birth

Player         | Date of Birth
Chip Masterson | 14 March 1977
Al Fredrickson | 21 July 1975
Bob Albertson  | 28 September 1968

Update anomalies cannot occur in these tables, which are both in 3NF.
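A sketch of the 3NF decomposition in SQLite (names follow the example; the types and the foreign-key declaration are assumptions):

import sqlite3

# Each player's date of birth is stored exactly once; tournament results
# reference the player by name.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE PlayerDatesOfBirth (
    player        TEXT PRIMARY KEY,
    date_of_birth TEXT
);
CREATE TABLE TournamentWinners (
    tournament TEXT,
    year       INTEGER,
    winner     TEXT REFERENCES PlayerDatesOfBirth(player),
    PRIMARY KEY (tournament, year)
);
""")
# The original single-table view can still be produced with a join:
rows = conn.execute("""
    SELECT tournament, year, winner, date_of_birth
      FROM TournamentWinners
      JOIN PlayerDatesOfBirth ON winner = player
""").fetchall()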

I believe firmly that anything less than a fully normalized design is strongly contraindicated ... [Y]ou should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements.

Date, C.J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 152

Example

Enron

Download the sample database at: tinyurl.com/enron-db

pip install ipython-sql
(this provides the %sql and %%sql magics used below)


In [52]:
%load_ext sql

In [243]:
%%bash
sqlite3 enron.db .tables


Employee          EmployeeWithVars  MessageBase       RecipientBase   
EmployeeBase      Message           Recipient       
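
The same table listing can be obtained from Python if the sqlite3 command-line tool isn't installed (a sketch; it assumes enron.db is in the working directory):

import sqlite3

# sqlite_master is SQLite's built-in catalog of schema objects.
conn = sqlite3.connect('enron.db')
print([name for (name,) in
       conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])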

We only care about the *Base tables:

  • EmployeeBase
  • MessageBase
  • RecipientBase

(The other tables are derived from those. We don't need them for this exercise.)


In [68]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5


Done.
Out[68]:
eid name department longdepartment title gender seniority
1 John Arnold Trading ENA Gas Financial VP Trading Male Senior
2 Harry Arora Trading ENA East Power VP Trading Male Senior
3 Robert Badeer Trading ENA West Power Mgr Trading Male Junior
4 Susan Bailey Legal ENA Legal Specialist Legal Female Junior
5 Eric Bass Trading ENA Gas Texas Trader Male Junior

In [67]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5


Done.
Out[67]:
mid filename unix_time subject from_eid
1 taylor-m/sent/11 910930020 Cd$ CME letter 138
2 taylor-m/sent/17 911459940 Indemnification 138
3 taylor-m/sent/18 911463840 Re: Indemnification 138
4 taylor-m/sent/23 911874180 Re: Coral Energy, L.P. 138
5 taylor-m/sent/27 912396120 Bankruptcy Code revisions 138

from_eid looks like a foreign key referencing eid in EmployeeBase


In [247]:
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename 
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5


Done.
Out[247]:
unix_time name department title gender seniority subject filename
910930020 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Cd$ CME letter taylor-m/sent/11
911459940 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Indemnification taylor-m/sent/17
911463840 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Re: Indemnification taylor-m/sent/18
911874180 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Re: Coral Energy, L.P. taylor-m/sent/23
912396120 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Bankruptcy Code revisions taylor-m/sent/27

Finally, RecipientBase:


In [64]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9


Done.
Out[64]:
mid rno to_eid
1 1 59
2 1 15
3 1 15
4 1 109
4 2 49
4 3 120
4 4 59
5 1 45
5 2 53

It looks like we can use this table to determine how many recipients each message had,
i.e. MAX(rno) for each mid
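
A sketch of that per-message count (same enron.db; a fuller version appears later in the lecture). The largest value turns out to be 57, which is why the next query filters on rno = 57:

import sqlite3

# MAX(rno) per message id gives the number of recipients of that message.
conn = sqlite3.connect('enron.db')
conn.execute("""
    SELECT mid, MAX(rno) AS recipient_count
      FROM RecipientBase
     GROUP BY mid
     ORDER BY recipient_count DESC
     LIMIT 5
""").fetchall()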


In [80]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57


Done.
Out[80]:
mid rno to_eid
12116 57 51
12151 57 51

In [81]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)


Done.
Out[81]:
mid filename unix_time subject from_eid
12116 baughman-d/all_documents/398 990510780 67
12151 baughman-d/ect_admin/22 990546780 67

In [82]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67


Done.
Out[82]:
eid name department longdepartment title gender seniority
67 John J. Lavorato Other EWS ENA President & CEO Male Senior

Which email went to the most recipients, and who sent it? (SQLite lets us select the bare name and title columns alongside MAX(rno); they are taken from the row where the maximum occurs.)


In [248]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid)


Done.
Out[248]:
name title MAX(rno)
John J. Lavorato ENA President & CEO 57

Who are the top five senders of mass emails?


In [84]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid) 
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5


Done.
Out[84]:
name title recipient_count
John J. Lavorato ENA President & CEO 57
Kenneth Lay President & CEO 52
Sally Beck VP 49
Monique Sanchez Associate 24
Louise Kitchen COO 21

Pandas Data Frames


In [71]:
import sqlite3
import pandas as pd
from pandas.io import sql
conn = sqlite3.connect('enron.db')

In [259]:
sql.frame_query("""SELECT mid, unix_time, subject, filename, from_eid 
                FROM MessageBase LIMIT 5""", conn, "mid")


Out[259]:
unix_time subject filename from_eid
mid
1 910930020 Cd$ CME letter taylor-m/sent/11 138
2 911459940 Indemnification taylor-m/sent/17 138
3 911463840 Re: Indemnification taylor-m/sent/18 138
4 911874180 Re: Coral Energy, L.P. taylor-m/sent/23 138
5 912396120 Bankruptcy Code revisions taylor-m/sent/27 138

5 rows × 4 columns
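
A note on API drift: frame_query comes from the pandas version used in this lecture; later pandas versions deprecated it in favor of read_sql_query. A rough modern equivalent (a sketch, assuming a current pandas) is:

import sqlite3
import pandas as pd

# read_sql_query replaces pandas.io.sql.frame_query in newer pandas.
conn = sqlite3.connect('enron.db')
pd.read_sql_query(
    "SELECT mid, unix_time, subject, filename, from_eid FROM MessageBase LIMIT 5",
    conn,
    index_col="mid",
)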


In [268]:
MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase""", conn, "mid")
MessageBase.head()


Out[268]:
unix_time subject from_eid
mid
1 910930020 Cd$ CME letter 138
2 911459940 Indemnification 138
3 911463840 Re: Indemnification 138
4 911874180 Re: Coral Energy, L.P. 138
5 912396120 Bankruptcy Code revisions 138

5 rows × 3 columns

SQL JOIN (with USING):


In [269]:
sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid 
                FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")


Out[269]:
unix_time subject from_eid rno to_eid
mid
1 910930020 Cd$ CME letter 138 1 59
2 911459940 Indemnification 138 1 15
3 911463840 Re: Indemnification 138 1 15
4 911874180 Re: Coral Energy, L.P. 138 1 109
4 911874180 Re: Coral Energy, L.P. 138 2 49

5 rows × 5 columns

pandas.DataFrame.join


In [262]:
RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()


Out[262]:
unix_time subject filename from_eid rno to_eid
mid
1 910930020 Cd$ CME letter taylor-m/sent/11 138 1 59
2 911459940 Indemnification taylor-m/sent/17 138 1 15
3 911463840 Re: Indemnification taylor-m/sent/18 138 1 15
4 911874180 Re: Coral Energy, L.P. taylor-m/sent/23 138 1 109
4 911874180 Re: Coral Energy, L.P. taylor-m/sent/23 138 2 49

5 rows × 6 columns
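
DataFrame.join aligns on the index by default, which is why both frames were loaded with mid as their index. Note that join defaults to a left join; to mirror the SQL inner join exactly, how='inner' can be passed (a sketch):

# Same join as above, but keeping only mids present in both frames.
MessageBase.join(RecipientBase, how='inner').head()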

SQL JOIN (with ON):


In [273]:
sql.frame_query("""SELECT mid, unix_time, name,  
                          title, seniority, subject 
                     FROM EmployeeBase 
                     JOIN MessageBase ON eid = from_eid 
                    LIMIT 5""", conn, "mid")


Out[273]:
unix_time name title seniority subject
mid
1 910930020 Mark E. Taylor VP & Gen Cnsl Senior Cd$ CME letter
2 911459940 Mark E. Taylor VP & Gen Cnsl Senior Indemnification
3 911463840 Mark E. Taylor VP & Gen Cnsl Senior Re: Indemnification
4 911874180 Mark E. Taylor VP & Gen Cnsl Senior Re: Coral Energy, L.P.
5 912396120 Mark E. Taylor VP & Gen Cnsl Senior Bankruptcy Code revisions

5 rows × 5 columns

pandas.DataFrame.merge


In [274]:
EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority 
                                    FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()


Out[274]:
name title seniority unix_time subject from_eid
mid
1611 John Arnold VP Trading Senior 954317280 Re: Insurance Call Spread 1
1615 John Arnold VP Trading Senior 954328920 New curve generation methodology 1
1722 John Arnold VP Trading Senior 955447080 Option Analysis on NG Price Book 1
1723 John Arnold VP Trading Senior 955448160 1
1763 John Arnold VP Trading Senior 955701240 Re: options 1

5 rows × 6 columns
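
merge is the more general tool: it can match an index on one side against an ordinary column on the other, which is exactly what happens here (EmployeeBase's index, eid, against MessageBase's from_eid column). The same call spelled out with the top-level function (a sketch):

import pandas as pd

# left_index=True uses EmployeeBase's index (eid) as the join key on the left;
# right_on='from_eid' uses an ordinary column on the right.
pd.merge(EmployeeBase, MessageBase, left_index=True, right_on='from_eid').head()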

Histograms are a good way to get a feel for the data. For example:


In [104]:
MessageBase.unix_time.hist()


Out[104]:
<matplotlib.axes.AxesSubplot at 0x10800cd10>

In [277]:
MessageBase.unix_time.mean()


Out[277]:
985319365.65994918

What does this tell us?

unix_time (i.e. unixepoch) is easy for computers to read, but difficult for humans. It is the number of seconds since midnight UTC on January 1, 1970. To make it more readable, we will map datetime.fromtimestamp onto that value to return a datetime object. We can then downsample the data from seconds to weeks.


In [278]:
datetime.fromtimestamp(985319365)


Out[278]:
datetime.datetime(2001, 3, 22, 19, 49, 25)

In [184]:
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()


Out[184]:
<matplotlib.axes.AxesSubplot at 0x10f277350>
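
(The how='sum' argument to resample also belongs to the older pandas API and was removed in later releases. A sketch of the same weekly message counts in current pandas:)

import pandas as pd

# Convert the epoch seconds to timestamps, then count messages per week.
ts = pd.to_datetime(MessageBase.unix_time, unit='s')
MessageBase.groupby(ts.dt.to_period('W')).size().plot()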

What does this tell us?

Perhaps the recipient counts would be interesting as well. This time we will cast unix_time to a datetime string in our query. Since this is a new query, we will also make sure it looks okay by calling .head() before we do anything else:


In [218]:
recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent, MAX(rno) AS recipient_count 
                        FROM MessageBase JOIN RecipientBase USING (mid) GROUP BY mid, time_sent""", conn, "mid")
recipient_counts.head()


Out[218]:
time_sent recipient_count
mid
1 1998-11-13 04:07:00 1
2 1998-11-19 07:19:00 1
3 1998-11-19 08:24:00 1
4 1998-11-24 02:23:00 4
5 1998-11-30 03:22:00 2

5 rows × 2 columns

When using pandas.DataFrame.groupby it is sometimes useful to create a grouping object first. In this case, we will do it by mapping the datetime strings we received from SQLite to a weekly period. (This could also have been done in SQL, but it would have come across as a string. This way it exists as a pandas.Period object which has more intelligence.)


In [279]:
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))


Out[279]:
<matplotlib.axes.AxesSubplot at 0x1135be150>
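
The per-element lambda above can also be replaced with the vectorized .dt accessor in current pandas (a sketch, equivalent to the cell above):

import pandas as pd

# Parse the datetime strings once, then convert the whole column to weekly periods.
ts = pd.to_datetime(recipient_counts.time_sent)
w = ts.dt.to_period('W')
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0, 10))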

Lab

Your turn. Follow what I did at http://nbviewer.ipython.org/github/eklypse/GA_DS_SF_2014_01/blob/master/DS_Lec05.ipynb, but use .max() and .median() instead of .mean(). Also, try to make your own discoveries! For example: is there anything interesting you could say about the subjects of the emails? What about who sent the first ones and who sent the last ones? What about the recipients themselves? (Hint: to_eid is a foreign key to eid.)
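
A possible starting point (a sketch, reusing the recipient_counts frame and the weekly grouping w from the cells above):

# Same weekly grouping as before, with different aggregations.
recipient_counts.groupby(w).recipient_count.max().plot()
recipient_counts.groupby(w).recipient_count.median().plot()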

See http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/Pandas%20for%20SQL%20Developers.ipynb for more examples about how Pandas and SQL compare.

Discussion

Next Time:

R & ggplot2