Install ipython-sql to get the SQL magic:
pip install ipython-sql
In [1]:
%load_ext sql
In [2]:
%%bash
sqlite3 enron.db .tables
We only care about the *Base tables:
(The other tables are derived from those. We don't need them for this exercise.)
In [3]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5
Out[3]:
In [4]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5
Out[4]:
from_eid looks like a foreign key
In [5]:
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5
Out[5]:
Finally, RecipientBase:
In [6]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9
Out[6]:
It looks like we can use this table to determine how many recipients each message had,
i.e. MAX(rno) for each mid.
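That pattern can be sketched first on a fabricated in-memory RecipientBase (toy rows, not the Enron data) before running it against enron.db:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway toy database, not enron.db
conn.executescript("""
CREATE TABLE RecipientBase (mid INT, rno INT, to_eid INT);
INSERT INTO RecipientBase VALUES
  (1, 1, 10), (1, 2, 11), (1, 3, 12),  -- message 1: three recipients
  (2, 1, 13);                          -- message 2: one recipient
""")
rows = conn.execute(
    "SELECT mid, MAX(rno) AS recipient_count FROM RecipientBase GROUP BY mid"
).fetchall()
print(rows)  # [(1, 3), (2, 1)]
```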
In [7]:
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57
Out[7]:
In [8]:
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)
Out[8]:
In [9]:
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67
Out[9]:
Who sent the email with the most recipients?
In [10]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
JOIN RecipientBase USING(mid)
Out[10]:
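Note that selecting name and title alongside a bare MAX with no GROUP BY is SQLite-specific: SQLite returns the other columns from the row that holds the maximum, while most other engines reject the query. A minimal sketch on a fabricated in-memory table (not the Enron data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway toy table, not enron.db
conn.executescript("""
CREATE TABLE t (name TEXT, score INT);
INSERT INTO t VALUES ('alice', 3), ('bob', 7), ('carol', 5);
""")
row = conn.execute("SELECT name, MAX(score) FROM t").fetchone()
print(row)  # ('bob', 7): the bare column comes from the row holding the max
```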
Who are the top five senders of mass emails?
In [11]:
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
JOIN RecipientBase USING(mid)
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5
Out[11]:
In [12]:
import pandas as pd
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')
In [13]:
pd.read_sql_query("""SELECT mid, unix_time, subject, from_eid
FROM MessageBase LIMIT 5""", conn, index_col="mid")
Out[13]:
In [14]:
MessageBase = pd.read_sql_query("""SELECT mid, unix_time, subject, from_eid
FROM MessageBase""", conn, index_col="mid")
MessageBase.head()
Out[14]:
SQL JOIN (with USING):
In [15]:
pd.read_sql_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid
FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, index_col="mid")
Out[15]:
pandas.DataFrame.join
In [16]:
RecipientBase = pd.read_sql_query("SELECT * FROM RecipientBase", conn, index_col="mid")
MessageBase.join(RecipientBase).head()
Out[16]:
SQL JOIN (with ON):
In [17]:
pd.read_sql_query("""SELECT mid, name, title, seniority,
unix_time, subject, from_eid
FROM EmployeeBase
JOIN MessageBase ON eid = from_eid
LIMIT 5""", conn, index_col="mid")
Out[17]:
pandas.DataFrame.merge
In [18]:
EmployeeBase = pd.read_sql_query("""SELECT eid, name, title, seniority
FROM EmployeeBase""", conn, index_col="eid")
EmployeeBase.merge(MessageBase, right_on='from_eid', left_index=True).head()
Out[18]:
Histograms are a good way to get a feel for the data. For example:
In [19]:
%matplotlib inline
MessageBase.unix_time.hist()
Out[19]:
In [20]:
MessageBase.unix_time.mean()
Out[20]:
What does this tell us?
Convert unix_time to human-readable form (i.e. datetime)
unix_time (i.e. seconds since the Unix epoch) is easy for computers to read but difficult for humans: it is the number of seconds since midnight UTC, January 1, 1970. To make it readable, we will map datetime.fromtimestamp over the column to get datetime objects. We can then downsample the data from seconds to weeks.
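A quick sanity check of the conversion, using an explicit UTC timezone so the result does not depend on the machine's local clock (the notebook's bare datetime.fromtimestamp uses local time instead):

```python
from datetime import datetime, timezone

# 915148800 seconds after the epoch is 1999-01-01 00:00:00 UTC
dt = datetime.fromtimestamp(915148800, tz=timezone.utc)
print(dt.isoformat())  # 1999-01-01T00:00:00+00:00
```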
In [21]:
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W').sum().plot()
Out[21]:
What does this tell us?
Consider recipient counts (i.e. MAX(rno))
Perhaps the recipient counts would be interesting as well. This time we will cast unix_time to a datetime string directly in the query. Since this is a new query, we will also make sure it looks okay by calling .head() before we do anything else:
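SQLite's datetime() function performs the same epoch conversion inside the query; a minimal sketch against a throwaway in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # no tables needed for a scalar SELECT
(ts,) = conn.execute("SELECT datetime(915148800, 'unixepoch')").fetchone()
print(ts)  # 1999-01-01 00:00:00
```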
In [22]:
recipient_counts = pd.read_sql_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent,
MAX(rno) AS recipient_count
FROM MessageBase JOIN RecipientBase USING (mid)
GROUP BY mid, time_sent""",
conn, index_col="mid")
recipient_counts.head()
Out[22]:
Grouping objects
When using pandas.DataFrame.groupby it is sometimes useful to create a grouping object first. In this case, we will do it by mapping the datetime strings we received from SQLite to a weekly period. (This could also have been done in SQL, but it would have come across as a string. This way it exists as a pandas.Period object which has more intelligence.)
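A quick illustration of what a pd.Period carries that a plain string would not (the date here is an arbitrary example, not taken from the Enron data):

```python
import pandas as pd

# A weekly Period knows its own boundaries; a string would not.
p = pd.Period('2001-05-15', freq='W')  # arbitrary example date
print(p.start_time.date(), p.end_time.date())
```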
In [23]:
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))
Out[23]:
Raise two fingers if you understand this material well.
Raise one finger if you understand this material OK.
Now, 1s find a 2 and sit next to them. While you work through this exercise, only 1s can type.
Use matplotlib.pyplot.scatter (or plt.scatter for short) to plot the length of the subjects against the recipient_count.
Hint: Take advantage of IPython's help functionality:
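For instance, appending ? to any name (e.g. plt.scatter?) shows its signature and docstring in IPython. A plain-Python sketch of the same idea, using a built-in as a stand-in so it runs even without matplotlib:

```python
import inspect

# In IPython:  plt.scatter?   (shows signature and docstring)
# Plain-Python equivalents: help(plt.scatter) or inspect.getdoc(plt.scatter).
doc = inspect.getdoc(sorted)  # built-in used as a stand-in here
print(doc.splitlines()[0])
```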
In [24]:
import matplotlib.pyplot as plt
subjects = pd.read_sql_query("""SELECT mid, name, LENGTH(subject) AS subject_length,
MAX(rno) AS recipient_count,
subject,
datetime(unix_time, 'unixepoch') AS time_sent
FROM MessageBase
JOIN RecipientBase
USING (mid)
JOIN EmployeeBase ON eid = from_eid
GROUP BY subject_length
ORDER BY time_sent DESC
""",
conn, index_col="mid")
plt.scatter(subjects.subject_length, subjects.recipient_count, c='c')
Out[24]:
In [25]:
subjects.head(10)
Out[25]:
In [27]:
weekly = subjects.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
subjects.groupby(weekly).subject_length.median().plot()
Out[27]:
Further exercises: try to make your own discoveries! For example: is there anything interesting you could say about the subjects of the emails? Who sent the first ones, and who sent the last ones? What about the recipients themselves? (Hint: to_eid is a foreign key to eid.)
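As a starting point for the recipients question, the join simply runs through to_eid instead of from_eid. A toy sketch with fabricated rows (the real query would run against enron.db):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # fabricated toy data, not the Enron tables
conn.executescript("""
CREATE TABLE EmployeeBase (eid INT, name TEXT);
CREATE TABLE RecipientBase (mid INT, rno INT, to_eid INT);
INSERT INTO EmployeeBase VALUES (1, 'alice'), (2, 'bob');
INSERT INTO RecipientBase VALUES (10, 1, 1), (10, 2, 2), (11, 1, 1);
""")
rows = conn.execute("""
    SELECT name, COUNT(*) AS emails_received
    FROM EmployeeBase JOIN RecipientBase ON eid = to_eid
    GROUP BY eid ORDER BY emails_received DESC
""").fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```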
See Pandas for SQL Developers for more examples about how Pandas and SQL compare.