Lab

SQL and Data Frames

Alessandro D. Gagliardi

Enron

Download the sample database at: tinyurl.com/enron-db

pip install ipython-sql
for sql magic


In [1]:
%load_ext sql

In [2]:
%%bash
sqlite3 enron.db .tables


Employee          EmployeeWithVars  MessageBase       RecipientBase   
EmployeeBase      Message           Recipient       

We only care about the *Base tables:

  • EmployeeBase
  • MessageBase
  • RecipientBase

(The other tables are derived from those. We don't need them for this exercise.)


In [3]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5


Done.
Out[3]:
eid name department longdepartment title gender seniority
1 John Arnold Trading ENA Gas Financial VP Trading Male Senior
2 Harry Arora Trading ENA East Power VP Trading Male Senior
3 Robert Badeer Trading ENA West Power Mgr Trading Male Junior
4 Susan Bailey Legal ENA Legal Specialist Legal Female Junior
5 Eric Bass Trading ENA Gas Texas Trader Male Junior

In [4]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5


Done.
Out[4]:
mid filename unix_time subject from_eid
1 taylor-m/sent/11 910930020 Cd$ CME letter 138
2 taylor-m/sent/17 911459940 Indemnification 138
3 taylor-m/sent/18 911463840 Re: Indemnification 138
4 taylor-m/sent/23 911874180 Re: Coral Energy, L.P. 138
5 taylor-m/sent/27 912396120 Bankruptcy Code revisions 138

from_eid looks like a foreign key


In [5]:
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename 
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5


Done.
Out[5]:
unix_time name department title gender seniority subject filename
910930020 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Cd$ CME letter taylor-m/sent/11
911459940 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Indemnification taylor-m/sent/17
911463840 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Re: Indemnification taylor-m/sent/18
911874180 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Re: Coral Energy, L.P. taylor-m/sent/23
912396120 Mark E. Taylor Legal VP & Gen Cnsl Male Senior Bankruptcy Code revisions taylor-m/sent/27

Finally, RecipientBase:


In [6]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9


Done.
Out[6]:
mid rno to_eid
1 1 59
2 1 15
3 1 15
4 1 109
4 2 49
4 3 120
4 4 59
5 1 45
5 2 53

It looks like we can use this to determine how many recipients each message received
i.e. MAX(rno)


In [7]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57


Done.
Out[7]:
mid rno to_eid
12116 57 51
12151 57 51

In [8]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)


Done.
Out[8]:
mid filename unix_time subject from_eid
12116 baughman-d/all_documents/398 990510780 67
12151 baughman-d/ect_admin/22 990546780 67

In [9]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67


Done.
Out[9]:
eid name department longdepartment title gender seniority
67 John J. Lavorato Other EWS ENA President & CEO Male Senior

Who sent the email to the most recipients?


In [10]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid)


Done.
Out[10]:
name title MAX(rno)
John J. Lavorato ENA President & CEO 57

Who are the top five senders of mass emails?


In [11]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid) 
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5


Done.
Out[11]:
name title recipient_count
John J. Lavorato ENA President & CEO 57
Kenneth Lay President & CEO 52
Sally Beck VP 49
Monique Sanchez Associate 24
Louise Kitchen COO 21

Pandas Data Frames


In [12]:
import pandas as pd
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')

In [13]:
sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase LIMIT 5""", conn, "mid")


Out[13]:
unix_time subject from_eid
mid
1 910930020 Cd$ CME letter 138
2 911459940 Indemnification 138
3 911463840 Re: Indemnification 138
4 911874180 Re: Coral Energy, L.P. 138
5 912396120 Bankruptcy Code revisions 138

5 rows × 3 columns


In [14]:
MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase""", conn, "mid")
MessageBase.head()


Out[14]:
unix_time subject from_eid
mid
1 910930020 Cd$ CME letter 138
2 911459940 Indemnification 138
3 911463840 Re: Indemnification 138
4 911874180 Re: Coral Energy, L.P. 138
5 912396120 Bankruptcy Code revisions 138

5 rows × 3 columns

SQL JOIN (with USING):


In [15]:
sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid 
                FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")


Out[15]:
unix_time subject from_eid rno to_eid
mid
1 910930020 Cd$ CME letter 138 1 59
2 911459940 Indemnification 138 1 15
3 911463840 Re: Indemnification 138 1 15
4 911874180 Re: Coral Energy, L.P. 138 1 109
4 911874180 Re: Coral Energy, L.P. 138 2 49

5 rows × 5 columns

pandas.DataFrame.join


In [16]:
RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()


Out[16]:
unix_time subject from_eid rno to_eid
mid
1 910930020 Cd$ CME letter 138 1 59
2 911459940 Indemnification 138 1 15
3 911463840 Re: Indemnification 138 1 15
4 911874180 Re: Coral Energy, L.P. 138 1 109
4 911874180 Re: Coral Energy, L.P. 138 2 49

5 rows × 5 columns

SQL JOIN (with ON):


In [17]:
sql.frame_query("""SELECT mid, name, title, seniority,
                          unix_time, subject, from_eid 
                     FROM EmployeeBase 
                     JOIN MessageBase ON eid = from_eid 
                    LIMIT 5""", conn, "mid")


Out[17]:
name title seniority unix_time subject from_eid
mid
1 Mark E. Taylor VP & Gen Cnsl Senior 910930020 Cd$ CME letter 138
2 Mark E. Taylor VP & Gen Cnsl Senior 911459940 Indemnification 138
3 Mark E. Taylor VP & Gen Cnsl Senior 911463840 Re: Indemnification 138
4 Mark E. Taylor VP & Gen Cnsl Senior 911874180 Re: Coral Energy, L.P. 138
5 Mark E. Taylor VP & Gen Cnsl Senior 912396120 Bankruptcy Code revisions 138

5 rows × 6 columns

pandas.DataFrame.merge


In [18]:
EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority 
                                    FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()


Out[18]:
name title seniority unix_time subject from_eid
mid
1611 John Arnold VP Trading Senior 954317280 Re: Insurance Call Spread 1
1615 John Arnold VP Trading Senior 954328920 New curve generation methodology 1
1722 John Arnold VP Trading Senior 955447080 Option Analysis on NG Price Book 1
1723 John Arnold VP Trading Senior 955448160 1
1763 John Arnold VP Trading Senior 955701240 Re: options 1

5 rows × 6 columns

Histograms are a good way to get a feel for the data. For example:


In [19]:
%matplotlib inline
MessageBase.unix_time.hist()


Out[19]:
<matplotlib.axes.AxesSubplot at 0x10a824cd0>

In [20]:
MessageBase.unix_time.mean()


Out[20]:
985319365.65994918

What does this tell us?

Convert unix_time to human readable form (i.e. datetime)

unix_time (i.e. unixepoch) is easy for computers to read, but difficult for humans. It is the number of seconds since midnight, January 1, 1970. To make it more readable, we will map datetime.fromtimestamp onto that value to return a datetime object. We can then downsample the data from seconds to weeks.


In [21]:
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()


Out[21]:
<matplotlib.axes.AxesSubplot at 0x10a830d90>

What does this tell us?

Consider recipient counts (i.e. MAX(rno))

Perhaps the recipient counts would be interesting as well. This time we will cast unix_time to a datetime string in our query. Since this is a new query, we will also make sure it looks okay by calling .head() before we do anyting else:


In [22]:
recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent, 
                                             MAX(rno) AS recipient_count 
                                        FROM MessageBase JOIN RecipientBase USING (mid) 
                                       GROUP BY mid, time_sent""", 
                                   conn, "mid")
recipient_counts.head()


Out[22]:
time_sent recipient_count
mid
1 1998-11-13 04:07:00 1
2 1998-11-19 07:19:00 1
3 1998-11-19 08:24:00 1
4 1998-11-24 02:23:00 4
5 1998-11-30 03:22:00 2

5 rows × 2 columns

grouping objects

When using pandas.DataFrame.groupby it is sometimes useful to create a grouping object first. In this case, we will do it by mapping the datetime strings we received from SQLite to a weekly period. (This could also have been done in SQL, but it would have come across as a string. This way it exists as a pandas.Period object which has more intelligence.)


In [23]:
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))


Out[23]:
<matplotlib.axes.AxesSubplot at 0x10a98ba10>

Lab

1-2 Pairs

Raise two fingers if you understand the material well.
Raise one finger if you understand this material OK.
Now, 1 find a 2 and sit next to them. While you work through this exercise, only 1's can type.

Use matplotlib.pyplot.scatter (or plt.scatter for short) to plot the length of the subjects against the recipient_count.

Hint: Take advantage of IPython's help functionality:


In [24]:
import matplotlib.pyplot as plt
subjects = sql.frame_query("""SELECT mid, name, LENGTH(subject) as subject_length, 
                                 MAX(rno) AS recipient_count, 
                                 subject,
                                 datetime(unix_time, 'unixepoch') AS time_sent
                                        FROM MessageBase 
                                        JOIN RecipientBase 
                                        USING (mid)
                                        JOIN EmployeeBase ON eid = from_eid
                                        GROUP BY subject_length
                                        ORDER BY time_sent DESC
                                        """, 
                                   conn, "mid")
plt.scatter(subjects.subject_length, subjects.recipient_count, c='c')


Out[24]:
<matplotlib.collections.PathCollection at 0x10b56b590>

In [25]:
subjects.head(10)


Out[25]:
name subject_length recipient_count subject time_sent
mid
21103 Kam Keiser 10 20 book names 2002-02-11 06:28:39
21047 Louise Kitchen 17 19 YOU HAVE 48 HOURS 2002-02-06 08:03:03
21030 Louise Kitchen 33 15 Commercial Re-start Sub-Committee 2002-02-05 15:08:44
21025 Sara Shackleton 134 2 FW: TOP TEN counterparties (for ENA) - Non-Te... 2002-02-05 14:45:10
21021 Sally Beck 121 1 Updated: ECS 4102 - LReeves, SWhite, BSuperty,... 2002-02-05 13:35:17
20919 Elizabeth Sager 90 9 RE: Final draft of Power GTC (really EEI) for ... 2002-02-02 12:02:09
20332 Lynn Blair 69 7 RE: Procedures for Transwestern Gas Control/Ma... 2002-01-14 04:45:39
20031 Louise Kitchen 9 13 RE: NETCO 2002-01-02 16:43:49
19796 Tana Jones 100 3 FW: (01-438) EXCHANGE TO INTRODUCE STRIP TRADI... 2001-12-19 14:27:51
19731 Jeff Dasovich 141 5 RE: Conference Call with PG&E to Discuss the G... 2001-12-17 08:26:58

10 rows × 5 columns

Individually:

Plot the median subject length over time (using the same weekly grouping object above)


In [27]:
#weekly = subjects.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
subjects.groupby(w).subject_length.median().plot()


Out[27]:
<matplotlib.axes.AxesSubplot at 0x10bbc1f90>

Further exercises: Try to find your own discoveries! (e.g. is there anything interesting you could say about the subjects of the emails. What about who sent the first ones and who sent the last ones? What about the recipients themselves? (hint: to_eid is a foreign key to eid)

See Pandas for SQL Developers for more examples about how Pandas and SQL compare.

Next Time:

APIs & NoSQL