Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com
Those who cannot remember the past are condemned to repeat it.
The Life of Reason, by George Santayana
The objectives of normalization were stated as follows by Codd:
- To free the collection of relations from undesirable insertion, update and deletion dependencies;
- To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
- To make the relational model more informative to users;
- To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34
The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:
| Customer ID | First Name | Surname | Telephone Number |
|---|---|---|---|
| 123 | Robert | Ingram | 555-861-2025 |
| 456 | Jane | Wright | 555-403-1659, 555-776-4100 |
| 789 | Maria | Fernandez | 555-808-9633 |
Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.
A design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.
| Customer ID | First Name | Surname |
|---|---|---|
| 123 | Robert | Ingram |
| 456 | Jane | Wright |
| 789 | Maria | Fernandez |
| Customer ID | Telephone Number |
|---|---|
| 123 | 555-861-2025 |
| 456 | 555-403-1659 |
| 456 | 555-776-4100 |
| 789 | 555-808-9633 |
Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one, customer.
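The two-table design above can be sketched in SQLite (the table and column names here are illustrative, chosen to match the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerName (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    surname     TEXT
);
CREATE TABLE CustomerTelephoneNumber (
    customer_id      INTEGER REFERENCES CustomerName(customer_id),
    telephone_number TEXT,
    PRIMARY KEY (customer_id, telephone_number)
);
""")
conn.executemany("INSERT INTO CustomerName VALUES (?, ?, ?)",
                 [(123, "Robert", "Ingram"),
                  (456, "Jane", "Wright"),
                  (789, "Maria", "Fernandez")])
conn.executemany("INSERT INTO CustomerTelephoneNumber VALUES (?, ?)",
                 [(123, "555-861-2025"), (456, "555-403-1659"),
                  (456, "555-776-4100"), (789, "555-808-9633")])

# Each Customer-to-Telephone Number link is its own row,
# so Jane Wright's two numbers are two rows in the child table:
jane_numbers = [row[0] for row in conn.execute(
    """SELECT telephone_number FROM CustomerTelephoneNumber
       WHERE customer_id = 456 ORDER BY telephone_number""")]
```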
Consider the following table, which is in 1NF but not in 2NF:

| Employee | Skill | Current Work Location |
|---|---|---|
| Brown | Light Cleaning | 73 Industrial Way |
| Brown | Typing | 73 Industrial Way |
| Harrison | Light Cleaning | 73 Industrial Way |
| Jones | Shorthand | 114 Main Street |
| Jones | Typing | 114 Main Street |
| Jones | Whittling | 114 Main Street |
Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.
The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"
For example, after updating only two of Jones' three records:

| Employee | Skill | Current Work Location |
|---|---|---|
| Brown | Light Cleaning | 73 Industrial Way |
| Brown | Typing | 73 Industrial Way |
| Harrison | Light Cleaning | 73 Industrial Way |
| Jones | Shorthand | 414 Brannon Street |
| Jones | Typing | 414 Brannon Street |
| Jones | Whittling | 114 Main Street |
A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:
| Employee | Current Work Location |
|---|---|
| Brown | 73 Industrial Way |
| Harrison | 73 Industrial Way |
| Jones | 114 Main Street |
| Employee | Skill |
|---|---|
| Brown | Light Cleaning |
| Brown | Typing |
| Harrison | Light Cleaning |
| Jones | Shorthand |
| Jones | Typing |
| Jones | Whittling |
Neither of these tables can suffer from update anomalies.
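The 2NF decomposition can be sketched in SQLite (again with illustrative table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (
    employee              TEXT PRIMARY KEY,
    current_work_location TEXT
);
CREATE TABLE EmployeesSkills (
    employee TEXT REFERENCES Employees(employee),
    skill    TEXT,
    PRIMARY KEY (employee, skill)
);
""")
conn.executemany("INSERT INTO Employees VALUES (?, ?)",
                 [("Brown", "73 Industrial Way"),
                  ("Harrison", "73 Industrial Way"),
                  ("Jones", "114 Main Street")])
conn.executemany("INSERT INTO EmployeesSkills VALUES (?, ?)",
                 [("Brown", "Light Cleaning"), ("Brown", "Typing"),
                  ("Harrison", "Light Cleaning"), ("Jones", "Shorthand"),
                  ("Jones", "Typing"), ("Jones", "Whittling")])

# Moving Jones is now a single-row update, so the partial-update
# anomaly from the 1NF table cannot occur:
conn.execute("""UPDATE Employees SET current_work_location = '414 Brannon Street'
                WHERE employee = 'Jones'""")
jones_locations = conn.execute(
    "SELECT DISTINCT current_work_location FROM Employees WHERE employee = 'Jones'"
).fetchall()
```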
Not all 2NF tables are free from update anomalies, however. This brings us to third normal form (3NF). Or, as the well-known summary puts it:
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."
Consider the following table, which is in 2NF but not in 3NF:

| Tournament | Year | Winner | Winner Date of Birth |
|---|---|---|---|
| Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
| Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
| Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
| Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.
The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.
In order to express the same facts without violating 3NF, it is necessary to split the table into two:
| Tournament | Year | Winner |
|---|---|---|
| Indiana Invitational | 1998 | Al Fredrickson |
| Cleveland Open | 1999 | Bob Albertson |
| Des Moines Masters | 1999 | Al Fredrickson |
| Indiana Invitational | 1999 | Chip Masterson |
| Player | Date of Birth |
|---|---|
| Chip Masterson | 14 March 1977 |
| Al Fredrickson | 21 July 1975 |
| Bob Albertson | 28 September 1968 |
Update anomalies cannot occur in these tables, which are both in 3NF.
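The 3NF split can be sketched in SQLite (illustrative names; dates of birth stored as ISO strings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Player (
    player        TEXT PRIMARY KEY,
    date_of_birth TEXT
);
CREATE TABLE TournamentWinners (
    tournament TEXT,
    year       INTEGER,
    winner     TEXT REFERENCES Player(player),
    PRIMARY KEY (tournament, year)
);
""")
conn.executemany("INSERT INTO Player VALUES (?, ?)",
                 [("Chip Masterson", "1977-03-14"),
                  ("Al Fredrickson", "1975-07-21"),
                  ("Bob Albertson", "1968-09-28")])
conn.executemany("INSERT INTO TournamentWinners VALUES (?, ?, ?)",
                 [("Indiana Invitational", 1998, "Al Fredrickson"),
                  ("Cleveland Open", 1999, "Bob Albertson"),
                  ("Des Moines Masters", 1999, "Al Fredrickson"),
                  ("Indiana Invitational", 1999, "Chip Masterson")])

# A player's date of birth is stored exactly once,
# no matter how many tournaments he wins:
dob_rows = conn.execute(
    "SELECT COUNT(*) FROM Player WHERE player = 'Al Fredrickson'").fetchone()[0]
wins = conn.execute(
    "SELECT COUNT(*) FROM TournamentWinners WHERE winner = 'Al Fredrickson'").fetchone()[0]
```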
I believe firmly that anything less than a fully normalized design is strongly contraindicated ... [Y]ou should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements.
Date, C.J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 152
pip install ipython-sql
for sql magic
In [52]:
%load_ext sql
In [243]:
%%bash
sqlite3 enron.db .tables
We only care about the *Base tables:
(The other tables are derived from those. We don't need them for this exercise.)
In [68]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5
Out[68]:
In [67]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5
Out[67]:
from_eid looks like a foreign key
In [247]:
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5
Out[247]:
Finally, RecipientBase:
In [64]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9
Out[64]:
It looks like we can use this to determine how many recipients each message had, i.e. MAX(rno) for each mid.
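The idea can be sketched with toy data (the column names follow the Enron schema, but the rows here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE RecipientBase (mid INTEGER, rno INTEGER, to_eid INTEGER)")
conn.executemany("INSERT INTO RecipientBase VALUES (?, ?, ?)",
                 [(1, 1, 10), (1, 2, 11), (1, 3, 12),  # message 1: three recipients
                  (2, 1, 10)])                         # message 2: one recipient

# rno numbers the recipients within a message,
# so MAX(rno) per mid gives each message's recipient count:
counts = dict(conn.execute(
    "SELECT mid, MAX(rno) FROM RecipientBase GROUP BY mid"))
```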
In [80]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57
Out[80]:
In [81]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)
Out[81]:
In [82]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67
Out[82]:
Who sent the email with the most recipients?
In [248]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
JOIN RecipientBase USING(mid)
Out[248]:
Who are the top five senders of mass emails?
In [84]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
JOIN RecipientBase USING(mid)
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5
Out[84]:
In [71]:
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')
In [259]:
sql.frame_query("""SELECT mid, unix_time, subject, filename, from_eid
FROM MessageBase LIMIT 5""", conn, "mid")
Out[259]:
In [268]:
MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid
FROM MessageBase""", conn, "mid")
MessageBase.head()
Out[268]:
SQL JOIN (with USING):
In [269]:
sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid
FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")
Out[269]:
pandas.DataFrame.join
In [262]:
RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()
Out[262]:
SQL JOIN (with ON):
In [273]:
sql.frame_query("""SELECT mid, unix_time, name,
title, seniority, subject
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
LIMIT 5""", conn, "mid")
Out[273]:
pandas.DataFrame.merge
In [274]:
EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority
FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()
Out[274]:
Histograms are a good way to get a feel for the data. For example:
In [104]:
MessageBase.unix_time.hist()
Out[104]:
In [277]:
MessageBase.unix_time.mean()
Out[277]:
What does this tell us?
unix_time (i.e. unixepoch) is easy for computers to read, but difficult for humans: it is the number of seconds since midnight UTC, January 1, 1970. To make it more readable, we will map datetime.fromtimestamp onto that value to return a datetime object. We can then downsample the data from seconds to weeks.
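For example, converting in UTC so the result does not depend on the local time zone:

```python
from datetime import datetime, timezone

# 985319365 seconds after the Unix epoch:
dt = datetime.fromtimestamp(985319365, tz=timezone.utc)
print(dt)  # 2001-03-23 03:49:25+00:00
```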
In [278]:
datetime.fromtimestamp(985319365)
Out[278]:
In [184]:
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()
Out[184]:
What does this tell us?
Perhaps the recipient counts would be interesting as well. This time we will cast unix_time to a datetime string in our query. Since this is a new query, we will also make sure it looks okay by calling .head() before we do anything else:
In [218]:
recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent, MAX(rno) AS recipient_count
FROM MessageBase JOIN RecipientBase USING (mid) GROUP BY mid, time_sent""", conn, "mid")
recipient_counts.head()
Out[218]:
When using pandas.DataFrame.groupby it is sometimes useful to create a grouping object first. In this case, we will do it by mapping the datetime strings we received from SQLite to a weekly period. (This could also have been done in SQL, but it would have come across as a string. This way it exists as a pandas.Period object which has more intelligence.)
In [279]:
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))
Out[279]:
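The weekly Period mapping can be sketched with toy timestamps (the dates here are made up):

```python
import pandas as pd

times = pd.Series(["2001-03-23 03:49:25",   # Friday
                   "2001-03-24 10:00:00",   # Saturday, same week
                   "2001-03-30 09:15:00"])  # the following Friday

# pd.Period parses each string and snaps it to its weekly period;
# weekly periods end on Sunday by default, so the first two share a period:
weeks = times.map(lambda t: pd.Period(t, freq="W"))
```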
Your turn. Follow what I did at http://nbviewer.ipython.org/github/eklypse/GA_DS_SF_2014_01/blob/master/DS_Lec05.ipynb, but use .max() and .median() instead of .mean(). Also, try to make your own discoveries! For example: is there anything interesting you could say about the subjects of the emails? Who sent the first emails, and who sent the last ones? What about the recipients themselves? (Hint: to_eid is a foreign key to eid.)
See http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/Pandas%20for%20SQL%20Developers.ipynb for more examples about how Pandas and SQL compare.