Lab

SQL and Data Frames

Alessandro D. Gagliardi

Enron

Download the sample database at: tinyurl.com/enron-db

pip install ipython-sql
for sql magic



In [1]:

    
%load_ext sql



In [2]:

    
%%bash
sqlite3 enron.db .tables









    



Employee          EmployeeWithVars  MessageBase       RecipientBase   
EmployeeBase      Message           Recipient

We only care about the *Base tables:

EmployeeBase
MessageBase
RecipientBase

(The other tables are derived from those. We don't need them for this exercise.)



In [3]:

    
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase LIMIT 5









    



Done.






    Out[3]:





    
        eid
        name
        department
        longdepartment
        title
        gender
        seniority
    
    
        1
        John Arnold
        Trading
        ENA Gas Financial
        VP Trading
        Male
        Senior
    
    
        2
        Harry Arora
        Trading
        ENA East Power
        VP Trading
        Male
        Senior
    
    
        3
        Robert Badeer
        Trading
        ENA West Power
        Mgr Trading
        Male
        Junior
    
    
        4
        Susan Bailey
        Legal
        ENA Legal
        Specialist Legal
        Female
        Junior
    
    
        5
        Eric Bass
        Trading
        ENA Gas Texas
        Trader
        Male
        Junior



In [4]:

    
%%sql sqlite:///enron.db
SELECT * FROM MessageBase LIMIT 5









    



Done.






    Out[4]:





    
        mid
        filename
        unix_time
        subject
        from_eid
    
    
        1
        taylor-m/sent/11
        910930020
        Cd$ CME letter
        138
    
    
        2
        taylor-m/sent/17
        911459940
        Indemnification
        138
    
    
        3
        taylor-m/sent/18
        911463840
        Re: Indemnification
        138
    
    
        4
        taylor-m/sent/23
        911874180
        Re: Coral Energy, L.P.
        138
    
    
        5
        taylor-m/sent/27
        912396120
        Bankruptcy Code revisions
        138

from_eid looks like a foreign key



In [5]:

    
%%sql sqlite:///enron.db
SELECT unix_time, name, department, title, gender, seniority, subject, filename 
FROM EmployeeBase JOIN MessageBase ON eid = from_eid LIMIT 5









    



Done.






    Out[5]:





    
        unix_time
        name
        department
        title
        gender
        seniority
        subject
        filename
    
    
        910930020
        Mark E. Taylor
        Legal
        VP & Gen Cnsl
        Male
        Senior
        Cd$ CME letter
        taylor-m/sent/11
    
    
        911459940
        Mark E. Taylor
        Legal
        VP & Gen Cnsl
        Male
        Senior
        Indemnification
        taylor-m/sent/17
    
    
        911463840
        Mark E. Taylor
        Legal
        VP & Gen Cnsl
        Male
        Senior
        Re: Indemnification
        taylor-m/sent/18
    
    
        911874180
        Mark E. Taylor
        Legal
        VP & Gen Cnsl
        Male
        Senior
        Re: Coral Energy, L.P.
        taylor-m/sent/23
    
    
        912396120
        Mark E. Taylor
        Legal
        VP & Gen Cnsl
        Male
        Senior
        Bankruptcy Code revisions
        taylor-m/sent/27

Finally, RecipientBase:



In [6]:

    
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase LIMIT 9

It looks like we can use this to determine how many recipients each message received
i.e. MAX(rno)



In [7]:

    
%%sql sqlite:///enron.db
SELECT * FROM RecipientBase WHERE rno = 57



In [8]:

    
%%sql sqlite:///enron.db
SELECT * FROM MessageBase WHERE mid IN (12116, 12151)









    



Done.






    Out[8]:





    
        mid
        filename
        unix_time
        subject
        from_eid
    
    
        12116
        baughman-d/all_documents/398
        990510780
        
        67
    
    
        12151
        baughman-d/ect_admin/22
        990546780
        
        67



In [9]:

    
%%sql sqlite:///enron.db
SELECT * FROM EmployeeBase WHERE eid=67









    



Done.






    Out[9]:





    
        eid
        name
        department
        longdepartment
        title
        gender
        seniority
    
    
        67
        John J. Lavorato
        Other
        EWS
        ENA President & CEO
        Male
        Senior

Who sent the email to the most recipients?



In [10]:

    
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno)
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid)









    



Done.






    Out[10]:





    
        name
        title
        MAX(rno)
    
    
        John J. Lavorato
        ENA President & CEO
        57

Who are the top five senders of mass emails?



In [11]:

    
%%sql sqlite:///enron.db
SELECT name, title, MAX(rno) AS recipient_count
FROM EmployeeBase 
JOIN MessageBase ON eid = from_eid 
JOIN RecipientBase USING(mid) 
GROUP BY name, title ORDER BY recipient_count DESC LIMIT 5









    



Done.






    Out[11]:





    
        name
        title
        recipient_count
    
    
        John J. Lavorato
        ENA President & CEO
        57
    
    
        Kenneth Lay
        President & CEO
        52
    
    
        Sally Beck
        VP
        49
    
    
        Monique Sanchez 
        Associate
        24
    
    
        Louise Kitchen
        COO
        21

Pandas Data Frames



In [12]:

    
import pandas as pd
import sqlite3
from pandas.io import sql
conn = sqlite3.connect('enron.db')



In [13]:

    
sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase LIMIT 5""", conn, "mid")









    Out[13]:






  
    
      
      unix_time
      subject
      from_eid
    
    
      mid
      
      
      
    
  
  
    
      1
       910930020
                  Cd$ CME letter
       138
    
    
      2
       911459940
                 Indemnification
       138
    
    
      3
       911463840
             Re: Indemnification
       138
    
    
      4
       911874180
          Re: Coral Energy, L.P.
       138
    
    
      5
       912396120
       Bankruptcy Code revisions
       138
    
  

5 rows × 3 columns



In [14]:

    
MessageBase = sql.frame_query("""SELECT mid, unix_time, subject, from_eid 
                FROM MessageBase""", conn, "mid")
MessageBase.head()









    Out[14]:






  
    
      
      unix_time
      subject
      from_eid
    
    
      mid
      
      
      
    
  
  
    
      1
       910930020
                  Cd$ CME letter
       138
    
    
      2
       911459940
                 Indemnification
       138
    
    
      3
       911463840
             Re: Indemnification
       138
    
    
      4
       911874180
          Re: Coral Energy, L.P.
       138
    
    
      5
       912396120
       Bankruptcy Code revisions
       138
    
  

5 rows × 3 columns

SQL JOIN (with USING):



In [15]:

    
sql.frame_query("""SELECT mid, unix_time, subject, from_eid, rno, to_eid 
                FROM MessageBase JOIN RecipientBase USING (mid) LIMIT 5""", conn, "mid")









    Out[15]:






  
    
      
      unix_time
      subject
      from_eid
      rno
      to_eid
    
    
      mid
      
      
      
      
      
    
  
  
    
      1
       910930020
               Cd$ CME letter
       138
       1
        59
    
    
      2
       911459940
              Indemnification
       138
       1
        15
    
    
      3
       911463840
          Re: Indemnification
       138
       1
        15
    
    
      4
       911874180
       Re: Coral Energy, L.P.
       138
       1
       109
    
    
      4
       911874180
       Re: Coral Energy, L.P.
       138
       2
        49
    
  

5 rows × 5 columns

pandas.DataFrame.join



In [16]:

    
RecipientBase = sql.frame_query("SELECT * FROM RecipientBase", conn, "mid")
MessageBase.join(RecipientBase).head()









    Out[16]:






  
    
      
      unix_time
      subject
      from_eid
      rno
      to_eid
    
    
      mid
      
      
      
      
      
    
  
  
    
      1
       910930020
               Cd$ CME letter
       138
       1
        59
    
    
      2
       911459940
              Indemnification
       138
       1
        15
    
    
      3
       911463840
          Re: Indemnification
       138
       1
        15
    
    
      4
       911874180
       Re: Coral Energy, L.P.
       138
       1
       109
    
    
      4
       911874180
       Re: Coral Energy, L.P.
       138
       2
        49
    
  

5 rows × 5 columns

SQL JOIN (with ON):



In [17]:

    
sql.frame_query("""SELECT mid, name, title, seniority,
                          unix_time, subject, from_eid 
                     FROM EmployeeBase 
                     JOIN MessageBase ON eid = from_eid 
                    LIMIT 5""", conn, "mid")









    Out[17]:






  
    
      
      name
      title
      seniority
      unix_time
      subject
      from_eid
    
    
      mid
      
      
      
      
      
      
    
  
  
    
      1
       Mark E. Taylor
       VP & Gen Cnsl
       Senior
       910930020
                  Cd$ CME letter
       138
    
    
      2
       Mark E. Taylor
       VP & Gen Cnsl
       Senior
       911459940
                 Indemnification
       138
    
    
      3
       Mark E. Taylor
       VP & Gen Cnsl
       Senior
       911463840
             Re: Indemnification
       138
    
    
      4
       Mark E. Taylor
       VP & Gen Cnsl
       Senior
       911874180
          Re: Coral Energy, L.P.
       138
    
    
      5
       Mark E. Taylor
       VP & Gen Cnsl
       Senior
       912396120
       Bankruptcy Code revisions
       138
    
  

5 rows × 6 columns

pandas.DataFrame.merge



In [18]:

    
EmployeeBase = sql.frame_query("""SELECT eid, name, title, seniority 
                                    FROM EmployeeBase""", conn, "eid")
EmployeeBase.merge(MessageBase, right_on = 'from_eid', left_index = True).head()









    Out[18]:






  
    
      
      name
      title
      seniority
      unix_time
      subject
      from_eid
    
    
      mid
      
      
      
      
      
      
    
  
  
    
      1611
       John Arnold
       VP Trading
       Senior
       954317280
              Re: Insurance Call Spread
       1
    
    
      1615
       John Arnold
       VP Trading
       Senior
       954328920
       New curve generation methodology
       1
    
    
      1722
       John Arnold
       VP Trading
       Senior
       955447080
       Option Analysis on NG Price Book
       1
    
    
      1723
       John Arnold
       VP Trading
       Senior
       955448160
                                       
       1
    
    
      1763
       John Arnold
       VP Trading
       Senior
       955701240
                            Re: options
       1
    
  

5 rows × 6 columns

Histograms are a good way to get a feel for the data. For example:



In [19]:

    
%matplotlib inline
MessageBase.unix_time.hist()









    Out[19]:





<matplotlib.axes.AxesSubplot at 0x10a824cd0>



In [20]:

    
MessageBase.unix_time.mean()









    Out[20]:





985319365.65994918

What does this tell us?

Convert unix_time to human readable form (i.e. datetime)

unix_time (i.e. unixepoch) is easy for computers to read, but difficult for humans. It is the number of seconds since midnight, January 1, 1970. To make it more readable, we will map datetime.fromtimestamp onto that value to return a datetime object. We can then downsample the data from seconds to weeks.



In [21]:

    
from datetime import datetime
dt = MessageBase.unix_time.map(datetime.fromtimestamp)
MessageBase.groupby(dt).unix_time.count().resample('1W', how='sum').plot()









    Out[21]:





<matplotlib.axes.AxesSubplot at 0x10a830d90>

What does this tell us?

Consider recipient counts (i.e. MAX(rno))

Perhaps the recipient counts would be interesting as well. This time we will cast unix_time to a datetime string in our query. Since this is a new query, we will also make sure it looks okay by calling .head() before we do anyting else:



In [22]:

    
recipient_counts = sql.frame_query("""SELECT mid, datetime(unix_time, 'unixepoch') AS time_sent, 
                                             MAX(rno) AS recipient_count 
                                        FROM MessageBase JOIN RecipientBase USING (mid) 
                                       GROUP BY mid, time_sent""", 
                                   conn, "mid")
recipient_counts.head()









    Out[22]:






  
    
      
      time_sent
      recipient_count
    
    
      mid
      
      
    
  
  
    
      1
       1998-11-13 04:07:00
       1
    
    
      2
       1998-11-19 07:19:00
       1
    
    
      3
       1998-11-19 08:24:00
       1
    
    
      4
       1998-11-24 02:23:00
       4
    
    
      5
       1998-11-30 03:22:00
       2
    
  

5 rows × 2 columns

grouping objects

When using pandas.DataFrame.groupby it is sometimes useful to create a grouping object first. In this case, we will do it by mapping the datetime strings we received from SQLite to a weekly period. (This could also have been done in SQL, but it would have come across as a string. This way it exists as a pandas.Period object which has more intelligence.)



In [23]:

    
w = recipient_counts.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
recipient_counts.groupby(w).recipient_count.mean().plot(ylim=(0,10))









    Out[23]:





<matplotlib.axes.AxesSubplot at 0x10a98ba10>

Lab

1-2 Pairs

Raise two fingers if you understand the material well.
Raise one finger if you understand this material OK.
Now, 1 find a 2 and sit next to them. While you work through this exercise, only 1's can type.

Use matplotlib.pyplot.scatter (or plt.scatter for short) to plot the length of the subjects against the recipient_count.

Hint: Take advantage of IPython's help functionality:



In [24]:

    
import matplotlib.pyplot as plt
subjects = sql.frame_query("""SELECT mid, name, LENGTH(subject) as subject_length, 
                                 MAX(rno) AS recipient_count, 
                                 subject,
                                 datetime(unix_time, 'unixepoch') AS time_sent
                                        FROM MessageBase 
                                        JOIN RecipientBase 
                                        USING (mid)
                                        JOIN EmployeeBase ON eid = from_eid
                                        GROUP BY subject_length
                                        ORDER BY time_sent DESC
                                        """, 
                                   conn, "mid")
plt.scatter(subjects.subject_length, subjects.recipient_count, c='c')









    Out[24]:





<matplotlib.collections.PathCollection at 0x10b56b590>



In [25]:

    
subjects.head(10)









    Out[25]:






  
    
      
      name
      subject_length
      recipient_count
      subject
      time_sent
    
    
      mid
      
      
      
      
      
    
  
  
    
      21103
            Kam Keiser
        10
       20
                                              book names
       2002-02-11 06:28:39
    
    
      21047
        Louise Kitchen
        17
       19
                                       YOU HAVE 48 HOURS
       2002-02-06 08:03:03
    
    
      21030
        Louise Kitchen
        33
       15
                       Commercial Re-start Sub-Committee
       2002-02-05 15:08:44
    
    
      21025
       Sara Shackleton
       134
        2
       FW:  TOP TEN counterparties (for ENA) - Non-Te...
       2002-02-05 14:45:10
    
    
      21021
            Sally Beck
       121
        1
       Updated: ECS 4102 - LReeves, SWhite, BSuperty,...
       2002-02-05 13:35:17
    
    
      20919
       Elizabeth Sager
        90
        9
       RE: Final draft of Power GTC (really EEI) for ...
       2002-02-02 12:02:09
    
    
      20332
            Lynn Blair
        69
        7
       RE: Procedures for Transwestern Gas Control/Ma...
       2002-01-14 04:45:39
    
    
      20031
        Louise Kitchen
         9
       13
                                               RE: NETCO
       2002-01-02 16:43:49
    
    
      19796
            Tana Jones
       100
        3
       FW: (01-438) EXCHANGE TO INTRODUCE STRIP TRADI...
       2001-12-19 14:27:51
    
    
      19731
         Jeff Dasovich
       141
        5
       RE: Conference Call with PG&E to Discuss the G...
       2001-12-17 08:26:58
    
  

10 rows × 5 columns

Individually:

Plot the median subject length over time (using the same weekly grouping object above)



In [27]:

    
#weekly = subjects.time_sent.map(lambda dt: pd.Period(dt, freq='W'))
subjects.groupby(w).subject_length.median().plot()









    Out[27]:





<matplotlib.axes.AxesSubplot at 0x10bbc1f90>

Further exercises: Try to find your own discoveries! (e.g. is there anything interesting you could say about the subjects of the emails. What about who sent the first ones and who sent the last ones? What about the recipients themselves? (hint: to_eid is a foreign key to eid)

See Pandas for SQL Developers for more examples about how Pandas and SQL compare.

Lab