Data Science

Structured Data: Relational Databases and Data Frames

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time:

  1. APIs & JSON Lecture
  2. NoSQL Databases
  3. Lab: Twitter & MongoDB

Questions? (besides "what is the CAP theorem?")

Agenda

  1. Project Discussion
  2. Readiness Assessment
  3. Relational Databases
  4. Data Frames
  5. Lab: SQLite and Pandas

Projects!

¡¡¡Data (and 1 visualization) are due on Monday!!!

Readiness Assessment:

Q: What is the CAP theorem? (And why does it matter?)

Relational Databases

Those who cannot remember the past are condemned to repeat it.

The Life of Reason, by George Santayana

DB Evolution

  • 1960s
    • Hierarchical data structure (IBM IMS)
    • Network data structure (CODASYL)
  • 1970s
    • Relational data model
      • A Relational Model of Data for Large Shared Data Banks – E. F. Codd [1970]
    • System R (IBM), Ingres (Berkeley)

DB Evolution

  • 1980s
    • Commercialization of RDBMS (relational database management systems)
      • Oracle, Sybase, IBM DB2, Informix
    • SQL (Structured Query Language)
    • ACID (Atomic, Consistent, Isolated, Durable)
  • 1990s
    • PC RDBMS
      • Paradox, Microsoft SQL Server & Access
    • Larger DBs, driven by internet
    • Consolidation among commercial DB vendors

DB Evolution

  • 2000s
    • Commercialization of Open Source RDBMS
      • MySQL, Postgres
    • Evolving requirements expose RDBMS limitations
      • Storing complex and dynamic objects
      • Processing increasing data volumes
      • Analyzing massive amounts of data

Normalization

The objectives of normalization were stated as follows by Codd:

  1. To free the collection of relations from undesirable insertion, update and deletion dependencies;
  2. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
  3. To make the relational model more informative to users;
  4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

Codd, E.F. "Further Normalization of the Data Base Relational Model", p. 34

First normal form (1NF)

A relation is in first normal form if the domain of each attribute contains only atomic values, and the value of each attribute contains only a single value from that domain.

What?

The following scenario illustrates how a database design might violate first normal form.
Taken from http://en.wikipedia.org/wiki/First_normal_form

Suppose a designer wishes to record the names and telephone numbers of customers. He defines a customer table which looks like this:

Customer

Customer ID | First Name | Surname   | Telephone Number
123         | Robert     | Ingram    | 555-861-2025
456         | Jane       | Wright    | 555-403-1659
789         | Maria      | Fernandez | 555-808-9633

The designer then becomes aware of a requirement to record multiple telephone numbers for some customers. He reasons that the simplest way of doing this is to allow the "Telephone Number" field in any given record to contain more than one value:

Customer

Customer ID | First Name | Surname   | Telephone Number
123         | Robert     | Ingram    | 555-861-2025
456         | Jane       | Wright    | 555-403-1659, 555-776-4100
789         | Maria      | Fernandez | 555-808-9633

Assuming, however, that the Telephone Number column is defined on some telephone number-like domain, such as the domain of 12-character strings, the representation above is not in first normal form. It is in violation of first normal form as a single field has been allowed to contain multiple values. A typical relational database management system will not allow fields in a table to contain multiple values in this way.

A design that complies with 1NF:

A design that is unambiguously in first normal form makes use of two tables: a Customer Name table and a Customer Telephone Number table.

Customer Name

Customer ID | First Name | Surname
123         | Robert     | Ingram
456         | Jane       | Wright
789         | Maria      | Fernandez

Customer Telephone Number

Customer ID | Telephone Number
123         | 555-861-2025
456         | 555-403-1659
456         | 555-776-4100
789         | 555-808-9633

Repeating groups of telephone numbers do not occur in this design. Instead, each Customer-to-Telephone Number link appears on its own record. With Customer ID as key, a one-to-many relationship exists between the two tables. A record in the "parent" table, Customer Name, can have many telephone number records in the "child" table, Customer Telephone Number, but each telephone number belongs to one, and only one customer.
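As a concrete sketch of this design (table and column names follow the example above; the SQL types and the foreign-key declaration are assumptions, since the example doesn't specify any), the two tables could be declared in SQLite like so:

import sqlite3

# A minimal sketch of the 1NF design: one row per customer, one row per
# (customer, telephone number) pair. Types are assumed.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE CustomerName (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT,
    surname     TEXT
);
CREATE TABLE CustomerTelephoneNumber (
    customer_id      INTEGER REFERENCES CustomerName(customer_id),
    telephone_number TEXT,
    PRIMARY KEY (customer_id, telephone_number)
);
""")
conn.execute("INSERT INTO CustomerName VALUES (456, 'Jane', 'Wright')")
conn.executemany("INSERT INTO CustomerTelephoneNumber VALUES (?, ?)",
                 [(456, '555-403-1659'), (456, '555-776-4100')])

Each additional telephone number becomes another row in the child table rather than a second value crammed into one field.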

Second normal form (2NF)

A table is in 2NF if and only if it is in 1NF and every non-prime attribute of the table is dependent on the whole of a candidate key.

What?

Consider a table describing employees' skills:
Taken from http://en.wikipedia.org/wiki/Second_normal_form

Employees' Skills

Employee | Skill          | Current Work Location
Brown    | Light Cleaning | 73 Industrial Way
Brown    | Typing         | 73 Industrial Way
Harrison | Light Cleaning | 73 Industrial Way
Jones    | Shorthand      | 114 Main Street
Jones    | Typing         | 114 Main Street
Jones    | Whittling      | 114 Main Street

Neither {Employee} nor {Skill} is a candidate key for the table. This is because a given Employee might need to appear more than once (he might have multiple Skills), and a given Skill might need to appear more than once (it might be possessed by multiple Employees). Only the composite key {Employee, Skill} qualifies as a candidate key for the table.

The remaining attribute, Current Work Location, is dependent on only part of the candidate key, namely Employee. Therefore the table is not in 2NF. Note the redundancy in the way Current Work Locations are represented: we are told three times that Jones works at 114 Main Street, and twice that Brown works at 73 Industrial Way. This redundancy makes the table vulnerable to update anomalies: it is, for example, possible to update Jones' work location on his "Shorthand" and "Typing" records and not update his "Whittling" record. The resulting data would imply contradictory answers to the question "What is Jones' current work location?"

For example, after updating Jones' work location on his "Shorthand" and "Typing" records but not his "Whittling" record, the table would look like this:

Employees' Skills (after an inconsistent update)

Employee | Skill          | Current Work Location
Brown    | Light Cleaning | 73 Industrial Way
Brown    | Typing         | 73 Industrial Way
Harrison | Light Cleaning | 73 Industrial Way
Jones    | Shorthand      | 414 Brannon Street
Jones    | Typing         | 414 Brannon Street
Jones    | Whittling      | 114 Main Street

A 2NF alternative to this design would represent the same information in two tables: an "Employees" table with candidate key {Employee}, and an "Employees' Skills" table with candidate key {Employee, Skill}:

Employees

Employee | Current Work Location
Brown    | 73 Industrial Way
Harrison | 73 Industrial Way
Jones    | 114 Main Street

Employees' Skills

Employee | Skill
Brown    | Light Cleaning
Brown    | Typing
Harrison | Light Cleaning
Jones    | Shorthand
Jones    | Typing
Jones    | Whittling

Neither of these tables can suffer from update anomalies.

Not all 2NF tables are free from update anomalies, however. This brings us to...

Third normal form (3NF)

3NF was originally defined by E.F. Codd in 1971

A table is in 3NF if and only if it is in 2NF and every non-prime attribute of the table is non-transitively (i.e. directly) dependent on every superkey of that table.

Or...
"[Every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key (so help me Codd)."

  • Requiring existence of "the key" ensures that the table is in 1NF
  • Requiring that non-key attributes be dependent on "the whole key" ensures 2NF
  • Requiring that non-key attributes be dependent on "nothing but the key" ensures 3NF

An example of a 2NF table that fails to meet the requirements of 3NF is:
Taken from http://en.wikipedia.org/wiki/Third_normal_form

Tournament Winners

Tournament           | Year | Winner         | Winner Date of Birth
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975
Cleveland Open       | 1999 | Bob Albertson  | 28 September 1968
Des Moines Masters   | 1999 | Al Fredrickson | 21 July 1975
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977

Because each row in the table needs to tell us who won a particular Tournament in a particular Year, the composite key {Tournament, Year} is a minimal set of attributes guaranteed to uniquely identify a row. That is, {Tournament, Year} is a candidate key for the table.

The breach of 3NF occurs because the non-prime attribute Winner Date of Birth is transitively dependent on the candidate key {Tournament, Year} via the non-prime attribute Winner. The fact that Winner Date of Birth is functionally dependent on Winner makes the table vulnerable to logical inconsistencies, as there is nothing to stop the same person from being shown with different dates of birth on different records.

In order to express the same facts without violating 3NF, it is necessary to split the table into two:

Tournament Winners

Tournament           | Year | Winner
Indiana Invitational | 1998 | Al Fredrickson
Cleveland Open       | 1999 | Bob Albertson
Des Moines Masters   | 1999 | Al Fredrickson
Indiana Invitational | 1999 | Chip Masterson

Player Dates of Birth

Player         | Date of Birth
Chip Masterson | 14 March 1977
Al Fredrickson | 21 July 1975
Bob Albertson  | 28 September 1968

Update anomalies cannot occur in these tables, which are both in 3NF.
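A sketch of the 3NF decomposition in SQLite (names follow the example; the types and the foreign-key declaration are assumptions):

import sqlite3

# Each player's date of birth is stored exactly once; tournament results
# reference the player by name.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE PlayerDatesOfBirth (
    player        TEXT PRIMARY KEY,
    date_of_birth TEXT
);
CREATE TABLE TournamentWinners (
    tournament TEXT,
    year       INTEGER,
    winner     TEXT REFERENCES PlayerDatesOfBirth(player),
    PRIMARY KEY (tournament, year)
);
""")
# The original single-table view can still be produced with a join:
rows = conn.execute("""
    SELECT tournament, year, winner, date_of_birth
      FROM TournamentWinners
      JOIN PlayerDatesOfBirth ON winner = player
""").fetchall()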

I believe firmly that anything less than a fully normalized design is strongly contraindicated ... [Y]ou should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements.

Date, C.J. Database in Depth: Relational Theory for Practitioners. O'Reilly (2005), p. 152

Example

Enron

Download the sample database at: tinyurl.com/enron-db

pip install ipython-sql
(this provides the %sql and %%sql magics used below)


In [52]:
%load_ext sql

In [243]:
%%bash
sqlite3 enron.db .tables


Employee          EmployeeWithVars  MessageBase       RecipientBase   
EmployeeBase      Message           Recipient       
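
The same table listing can be obtained from Python if the sqlite3 command-line tool isn't installed (a sketch; it assumes enron.db is in the working directory):

import sqlite3

# sqlite_master is SQLite's built-in catalog of schema objects.
conn = sqlite3.connect('enron.db')
print([name for (name,) in
       conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])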

We only care about the *Base tables:

  • EmployeeBase
  • MessageBase
  • RecipientBase

(The other tables are derived from those. We don't need them for this exercise.)


In [68]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5


Done.
Out[68]:
eid name department longdepartment title gender seniority
1 John Arnold Trading ENA Gas Financial VP Trading Male Senior
2 Harry Arora Trading ENA East Power VP Trading Male Senior
3 Robert Badeer Trading ENA West Power Mgr Trading Male Junior
4 Susan Bailey Legal ENA Legal Specialist Legal Female Junior
5 Eric Bass Trading ENA Gas Texas Trader Male Junior

In [67]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5


Done.
Out[67]:
mid filename unix_time subject from_eid
1 taylor-m/sent/11 910930020 Cd$ CME letter 138
2 taylor-m/sent/17 911459940 Indemnification 138
3 taylor-m/sent/18 911463840 Re: Indemnification 138
4 taylor-m/sent/23 911874180 Re: Coral Energy, L.P. 138
5 taylor-m/sent/27 912396120 Bankruptcy Code revisions 138

from_eid looks like a foreign key referencing eid in EmployeeBase


In [247]:
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename 
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5


Done.
Out[247]:
unix_time name department title gender seniority subject filename
910930020 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Cd$ CME letter taylor-m/sent/11
911459940 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Indemnification taylor-m/sent/17
911463840 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Re: Indemnification taylor-m/sent/18
911874180 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Re: Coral Energy, L.P. taylor-m/sent/23
912396120 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Bankruptcy Code revisions taylor-m/sent/27

Finally, RecipientBase:


In [64]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9


Done.
Out[64]:
mid rno to_eid
1 1 59
2 1 15
3 1 15
4 1 109
4 2 49
4 3 120
4 4 59
5 1 45
5 2 53

It looks like we can use this table to determine how many recipients each message had,
i.e. MAX(rno) for each mid
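
A sketch of that per-message count (same enron.db; a fuller version appears later in the lecture). The largest value turns out to be 57, which is why the next query filters on rno = 57:

import sqlite3

# MAX(rno) per message id gives the number of recipients of that message.
conn = sqlite3.connect('enron.db')
conn.execute("""
    SELECT mid, MAX(rno) AS recipient_count
      FROM RecipientBase
     GROUP BY mid
     ORDER BY recipient_count DESC
     LIMIT 5
""").fetchall()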


In [80]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57


Done.
Out[80]:
mid rno to_eid
12116 57 51
12151 57 51

In [81]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)


Done.
Out[81]:
mid filename unix_time subject from_eid
12116 baughman-d/all_documents/398 990510780 67
12151 baughman-d/ect_admin/22 990546780 67

In [82]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67


Done.
Out[82]:
eid name department longdepartment title gender seniority
67 John J. Lavorato Other EWS ENA President & CEO Male Senior

Which email went to the most recipients, and who sent it? (SQLite lets us select the bare name and title columns alongside MAX(rno); they are taken from the row where the maximum occurs.)


In [248]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid)


Done.
Out[248]:
name title MAX(rno)
John J. Lavorato ENA President & CEO 57

Who are the top five senders of mass emails?


In [84]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid) 
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5


Done.
Out[84]:
name title recipient_count
John J. Lavorato ENA President & CEO 57
Kenneth Lay President & CEO 52
Sally Beck VP 49
Monique Sanchez Associate 24
Louise Kitchen COO 21

Pandas Data Frames


In [71]:
import sqlite3
import pandas as pd
from pandas.io import sql
conn = sqlite3.connect('enron.db')

In [259]:
sql.frame_query("""SELECT mid, unix_time, subject, filename, from_eid 
                FROM MessageBase LIMIT 5""", conn, "mid")


Out[259]:
unix_time subject filename from_eid
mid
1 910930020 Cd$ CME letter taylor-m/sent/11 138
2 911459940 Indemnification taylor-m/sent/17 138
3 911463840 Re: Indemnification taylor-m/sent/18 138
4 911874180 Re: Coral Energy, L.P. taylor-m/sent/23 138
5 912396120 Bankruptcy Code revisions taylor-m/sent/27 138

5 rows × 4 columns
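
A note on API drift: frame_query comes from the pandas version used in this lecture; later pandas versions deprecated it in favor of read_sql_query. A rough modern equivalent (a sketch, assuming a current pandas) is:

import sqlite3
import pandas as pd

# read_sql_query replaces pandas.io.sql.frame_query in newer pandas.
conn = sqlite3.connect('enron.db')
pd.read_sql_query(
    "SELECT mid, unix_time, subject, filename, from_eid FROM MessageBase LIMIT 5",
    conn,
    index_col="mid",
)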


In [268]:
MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase""", conn, "mid")
MessageBase.head()


Out[268]:
unix_time subject from_eid
mid
1 910930020 Cd$ CME letter 138
2 911459940 Indemnification 138
3 911463840 Re: Indemnification 138
4 911874180 Re: Coral Energy, L.P. 138
5 912396120 Bankruptcy Code revisions 138

5 rows × 3 columns

SQL JOIN (with USING):


In [269]:
sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid 
                FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")


Out[269]:
unix_time subject from_eid rno to_eid
mid
1 910930020 Cd$ CME letter 138 1 59
2 911459940 Indemnification 138 1 15
3 911463840 Re: Indemnification 138 1 15
4 911874180 Re: Coral Energy, L.P. 138 1 109
4 911874180 Re: Coral Energy, L.P. 138 2 49

5 rows × 5 columns

pandas.DataFrame.join


In [262]:
RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()


Out[262]:
unix_time subject filename from_eid rno to_eid
mid
1 910930020 Cd$ CME letter taylor-m/sent/11 138 1 59
2 911459940 Indemnification taylor-m/sent/17 138 1 15
3 911463840 Re: Indemnification taylor-m/sent/18 138 1 15
4 911874180 Re: Coral Energy, L.P. taylor-m/sent/23 138 1 109
4 911874180 Re: Coral Energy, L.P. taylor-m/sent/23 138 2 49

5 rows × 6 columns
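
DataFrame.join aligns on the index by default, which is why both frames were loaded with mid as their index. Note that join defaults to a left join; to mirror the SQL inner join exactly, how='inner' can be passed (a sketch):

# Same join as above, but keeping only mids present in both frames.
MessageBase.join(RecipientBase, how='inner').head()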

SQL JOIN (with ON):


In [273]:
sql.frame_query("""SELECT mid, unix_time, name,  
                          title, seniority, subject 
                     FROM EmployeeBase 
                     JOIN MessageBase ON eid = from_eid 
                    LIMIT 5""", conn, "mid")


Out[273]:
unix_time name title seniority subject
mid
1 910930020 Mark E. Taylor VP & Gen Cnsl Senior Cd$ CME letter
2 911459940 Mark E. Taylor VP & Gen Cnsl Senior Indemnification
3 911463840 Mark E. Taylor VP & Gen Cnsl Senior Re: Indemnification
4 911874180 Mark E. Taylor VP & Gen Cnsl Senior Re: Coral Energy, L.P.
5 912396120 Mark E. Taylor VP & Gen Cnsl Senior Bankruptcy Code revisions

5 rows × 5 columns

pandas.DataFrame.merge


In [274]:
EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority 
                                    FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()


Out[274]:
name title seniority unix_time subject from_eid
mid
1611 John Arnold VP Trading Senior 954317280 Re: Insurance Call Spread 1
1615 John Arnold VP Trading Senior 954328920 New curve generation methodology 1
1722 John Arnold VP Trading Senior 955447080 Option Analysis on NG Price Book 1
1723 John Arnold VP Trading Senior 955448160 1
1763 John Arnold VP Trading Senior 955701240 Re: options 1

5 rows × 6 columns
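
merge is the more general tool: it can match an index on one side against an ordinary column on the other, which is exactly what happens here (EmployeeBase's index, eid, against MessageBase's from_eid column). The same call spelled out with the top-level function (a sketch):

import pandas as pd

# left_index=True uses EmployeeBase's index (eid) as the join key on the left;
# right_on='from_eid' uses an ordinary column on the right.
pd.merge(EmployeeBase, MessageBase, left_index=True, right_on='from_eid').head()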

Histograms are a good way to get a feel for the data. For example:


In [104]:
MessageBase.unix_time.hist()


Out[104]:
<matplotlib.axes.AxesSubplot at 0x10800cd10>

In [277]:
MessageBase.unix_time.mean()


Out[277]:
985319365.65994918

What does this tell us?

unix_time (i.e. unixepoch) is easy for computers to read, but difficult for humans. It is the number of seconds since midnight UTC on January 1, 1970. To make it more readable, we will map datetime.fromtimestamp onto that value to return a datetime object. We can then downsample the data from seconds to weeks.


In [278]:
datetime.fromtimestamp(985319365)


Out[278]:
datetime.datetime(2001, 3, 22, 19, 49, 25)

In [184]:
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()


Out[184]:
<matplotlib.axes.AxesSubplot at 0x10f277350>
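
(The how='sum' argument to resample also belongs to the older pandas API and was removed in later releases. A sketch of the same weekly message counts in current pandas:)

import pandas as pd

# Convert the epoch seconds to timestamps, then count messages per week.
ts = pd.to_datetime(MessageBase.unix_time, unit='s')
MessageBase.groupby(ts.dt.to_period('W')).size().plot()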

What does this tell us?

Perhaps the recipient counts would be interesting as well. This time we will cast unix_time to a datetime string in our query. Since this is a new query, we will also make sure it looks okay by calling .head() before we do anything else:


In [218]:
recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent, MAX(rno) AS recipient_count 
                        FROM MessageBase JOIN RecipientBase USING (mid) GROUP BY mid, time_sent""", conn, "mid")
recipient_counts.head()


Out[218]:
time_sent recipient_count
mid
1 1998-11-13 04:07:00 1
2 1998-11-19 07:19:00 1
3 1998-11-19 08:24:00 1
4 1998-11-24 02:23:00 4
5 1998-11-30 03:22:00 2

5 rows × 2 columns

When using pandas.DataFrame.groupby it is sometimes useful to create a grouping object first. In this case, we will do it by mapping the datetime strings we received from SQLite to a weekly period. (This could also have been done in SQL, but it would have come across as a string. This way it exists as a pandas.Period object which has more intelligence.)


In [279]:
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))


Out[279]:
<matplotlib.axes.AxesSubplot at 0x1135be150>
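
The per-element lambda above can also be replaced with the vectorized .dt accessor in current pandas (a sketch, equivalent to the cell above):

import pandas as pd

# Parse the datetime strings once, then convert the whole column to weekly periods.
ts = pd.to_datetime(recipient_counts.time_sent)
w = ts.dt.to_period('W')
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0, 10))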

Lab

Your turn. Follow what I did at http://nbviewer.ipython.org/github/eklypse/GA_DS_SF_2014_01/blob/master/DS_Lec05.ipynb, but use .max() and .median() instead of .mean(). Also, try to make your own discoveries! For example: is there anything interesting you could say about the subjects of the emails? What about who sent the first ones and who sent the last ones? What about the recipients themselves? (Hint: to_eid is a foreign key to eid.)
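
A possible starting point (a sketch, reusing the recipient_counts frame and the weekly grouping w from the cells above):

# Same weekly grouping as before, with different aggregations.
recipient_counts.groupby(w).recipient_count.max().plot()
recipient_counts.groupby(w).recipient_count.median().plot()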

See http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/Pandas%20for%20SQL%20Developers.ipynb for more examples about how Pandas and SQL compare.

Discussion

Next Time:

R & ggplot2