In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Load the donations table produced by step 21 (trusted local pickle artifact).
df = pd.read_pickle('out/21/donations.pkl')
In [3]:
# Inspect the available columns before deciding which ones we need.
df.columns
Out[3]:
In [4]:
# How many distinct appeals (event names) appear among service transactions?
df.loc[df.is_service == True, 'appeal'].nunique()
Out[4]:
In [5]:
# We will only consider services for this step, so limit what we need
dfs = df.loc[df.is_service == True]
del df  # release the full frame; only the service subset is used from here on
In [6]:
# One row per (appeal, location, month) combination with its transaction count.
events_by_attendance = (
    dfs.groupby(['appeal', 'campaign_location_id', 'campaign_month_id'])
       .size()
       .reset_index(name='transaction_count')
)
In [7]:
# One row per (appeal, location, month) combination with its total amount.
# NOTE: after `.amount.sum()` the resulting column is already named 'amount',
# so the original `.rename(columns={0: 'amount'})` was a no-op and is dropped.
events_by_amount = (
    dfs.groupby(['appeal', 'campaign_location_id', 'campaign_month_id'])
       .amount
       .sum()
       .reset_index()
)
In [8]:
# Distribution of per-event transaction counts, for events above 10 transactions
events_by_attendance.loc[events_by_attendance.transaction_count > 10, 'transaction_count'].plot(kind='hist', bins=30)
Out[8]:
In [9]:
# Distribution of per-event totals, limited to events that raised under $10k
events_by_amount.loc[events_by_amount.amount < 10000, 'amount'].plot(kind='hist', bins=50)
Out[9]:
In [10]:
# Raw transaction count for one appeal, as a sanity-check baseline
dfs.loc[dfs.appeal == 'Event_TollywoodThriller'].shape
Out[10]:
In [11]:
# Per-(location, month) transaction counts for this appeal, straight from dfs
dfs.loc[dfs.appeal == 'Event_TollywoodThriller'] \
   .groupby(['appeal', 'campaign_location_id', 'campaign_month_id']) \
   .size()
Out[11]:
In [12]:
# The same appeal as seen through the aggregated attendance table — should match
events_by_attendance.loc[events_by_attendance.appeal == 'Event_TollywoodThriller']
Out[12]:
In [13]:
# Per-(location, month) amount totals for this appeal, straight from dfs
dfs.loc[dfs.appeal == 'Event_TollywoodThriller'] \
   .groupby(['appeal', 'campaign_location_id', 'campaign_month_id']) \
   .amount.sum()
Out[13]:
In [14]:
# The same appeal as seen through the aggregated amount table — should match
events_by_amount.loc[events_by_amount.appeal == 'Event_TollywoodThriller']
Out[14]:
In [15]:
# How many events survive a >= 10 transactions cut-off?
events_by_attendance.loc[events_by_attendance.transaction_count >= 10].shape
Out[15]:
In [16]:
# How many transactions would that cut-off discard in total?
events_by_attendance.loc[events_by_attendance.transaction_count < 10, 'transaction_count'].sum()
Out[16]:
In [17]:
## Trim the event attendance data to at least 10 transactions and merge with the other dataframe to get the amounts
events = events_by_attendance[events_by_attendance.transaction_count >= 10]\
.merge(events_by_amount, how='left', on=['appeal','campaign_location_id','campaign_month_id'])
In [18]:
# Merge with the original dataframe to get the month centers
month_cols = ['appeal', 'campaign_month_id', 'activity_year', 'activity_month', 'activity_ym']
events = events.merge(
    dfs.loc[dfs.is_month_center == True, month_cols],
    how='left', on=['appeal', 'campaign_month_id']).drop_duplicates()
# Merge with the original dataframe to get the location centers
location_cols = ['appeal', 'campaign_location_id', 'county', 'state']
events = events.merge(
    dfs.loc[dfs.is_location_center == True, location_cols],
    how='left', on=['appeal', 'campaign_location_id']).drop_duplicates()
In [19]:
# Sanity check what we got thus far
events.loc[events.appeal == 'Event_TollywoodThriller']
Out[19]:
In [20]:
# Distinct locations for this appeal, from the location-center rows
dfs.loc[(dfs.is_location_center == True) & (dfs.appeal == 'Event_TollywoodThriller'),
        ['county', 'state']].drop_duplicates()
Out[20]:
In [21]:
# Distinct year/month pairs for this appeal, from the month-center rows
dfs.loc[(dfs.is_month_center == True) & (dfs.appeal == 'Event_TollywoodThriller'),
        ['activity_year', 'activity_month']].drop_duplicates()
Out[21]:
We had events in a total of 13 different states.
California had the highest number of events (33). Washington had the next highest number of events (11).
The events were more or less distributed throughout the year.
March 2014 was by far the busiest month for the organization (8 events).
2014 was the busiest year in terms of number of events (22).
Other than that, in general events were spread out quite evenly over the year across the board.
In [22]:
# Number of events per state
events.groupby(by='state').size().plot(kind='barh')
Out[22]:
In [23]:
# Number of events per calendar month
events.groupby(by='activity_month').size().plot(kind='barh')
Out[23]:
In [24]:
# Total amount raised per calendar month
events.groupby('activity_month')['amount'].sum().plot(kind='barh')
Out[24]:
In [25]:
# Number of events per year
events.groupby(by='activity_year').size().plot(kind='barh')
Out[25]:
In [26]:
# Total amount raised per year. Use the string 'sum' rather than the Python
# builtin `sum`: passing the builtin callable to `.agg` is deprecated in modern
# pandas (raises a FutureWarning) and the named aggregation uses the fast path.
events.groupby(['activity_year']).agg({'amount': 'sum'}).plot(kind='bar', legend=True)
Out[26]:
In [27]:
# Build a unique, human-readable event identifier: <appeal>_<location>_<month>.
# Vectorized string construction instead of the original per-row `.apply`
# (same output — `astype(int)` truncates like `int()` — without Python-level
# iteration over every row).
events['event_name'] = (
    events.appeal
    + '_' + events.campaign_location_id.astype(int).astype(str)
    + '_' + events.campaign_month_id.astype(int).astype(str)
)
In [28]:
!mkdir -p out/41
events\
.reset_index()\
.drop('index', axis=1)\
.to_pickle('out/41/events.pkl')
In [ ]: