Extract events data

The donations dataframe has event-related information. Pull that out into a separate events dataframe.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_pickle('out/21/donations.pkl')

In [3]:
df.columns


Out[3]:
Index([u'activity_date', u'city', u'fund', u'batch_num', u'amount_initial',
       u'amount_cleanup', u'zipcode', u'longitude', u'sales', u'county',
       u'charitable', u'amount', u'state', u'donor_id', u'timezone',
       u'latitude', u'appeal', u'activity_year', u'activity_month',
       u'activity_dow', u'activity_ym', u'activity_yq', u'activity_ymd',
       u'county_norm', u'census_region_name', u'state_name', u'county_id',
       u'is_service', u'channel', u'campaign_location_id',
       u'is_location_center', u'campaign_month_id', u'is_month_center'],
      dtype='object')

We have 155 appeal codes that are for events

These break down into a larger number of distinct events, since an appeal code can be re-used across campaign locations and months (the per-event table built below makes the count explicit)


In [4]:
df[(df.is_service==True)].appeal.nunique()


Out[4]:
155

In [5]:
# We will only consider service records (events) for this step, so limit what we need
dfs = df[df.is_service==True]
del df  # free the full dataframe

In [6]:
events_by_attendance = dfs\
    .groupby(['appeal', 'campaign_location_id', 'campaign_month_id'])\
    .size()\
    .to_frame()\
    .reset_index()\
    .rename(columns={0: 'transaction_count'})
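
As a sanity check on the claim above, each row of events_by_attendance is one distinct (appeal, location, month) event, so its length gives the total event count alongside the number of event appeal codes. A quick sketch:


In [ ]:
# Distinct events vs. distinct event appeal codes (sketch)
len(events_by_attendance), dfs.appeal.nunique()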

In [7]:
events_by_amount = dfs\
    .groupby(['appeal', 'campaign_location_id', 'campaign_month_id'])\
    .amount\
    .sum()\
    .to_frame()\
    .reset_index()
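
The two aggregations could also be done in a single pass. A sketch of an equivalent one-step groupby (not what the rest of this notebook uses; assumes the installed pandas accepts 'size' in an agg list):


In [ ]:
# One-pass alternative (sketch): transaction count and total amount together
events_summary = dfs\
    .groupby(['appeal', 'campaign_location_id', 'campaign_month_id'])\
    .amount\
    .agg(['size', 'sum'])\
    .reset_index()\
    .rename(columns={'size': 'transaction_count', 'sum': 'amount'})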

Histogram plots


In [8]:
events_by_attendance[events_by_attendance.transaction_count > 10].transaction_count.plot(kind='hist', bins=30)


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0xa438a36c>

In [9]:
events_by_amount.amount[events_by_amount.amount < 10000].plot(kind='hist', bins=50)


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0xa436ca4c>

Check for data correctness


In [10]:
dfs[(dfs.appeal=='Event_TollywoodThriller')].shape


Out[10]:
(161, 33)

In [11]:
dfs[(dfs.appeal=='Event_TollywoodThriller')]\
    .groupby(['appeal', 'campaign_location_id', 'campaign_month_id']).size()


Out[11]:
appeal                   campaign_location_id  campaign_month_id
Event_TollywoodThriller  0                     0                    48
                                               1                    41
                                               2                    33
                                               3                    39
dtype: int64

In [12]:
events_by_attendance[events_by_attendance.appeal=='Event_TollywoodThriller']


Out[12]:
appeal campaign_location_id campaign_month_id transaction_count
349 Event_TollywoodThriller 0 0 48
350 Event_TollywoodThriller 0 1 41
351 Event_TollywoodThriller 0 2 33
352 Event_TollywoodThriller 0 3 39

In [13]:
dfs[(dfs.appeal=='Event_TollywoodThriller')]\
    .groupby(['appeal', 'campaign_location_id', 'campaign_month_id']).amount.sum()


Out[13]:
appeal                   campaign_location_id  campaign_month_id
Event_TollywoodThriller  0                     0                    3781
                                               1                    5054
                                               2                    6854
                                               3                    5415
Name: amount, dtype: int64

In [14]:
events_by_amount[events_by_amount.appeal=='Event_TollywoodThriller']


Out[14]:
appeal campaign_location_id campaign_month_id amount
349 Event_TollywoodThriller 0 0 3781
350 Event_TollywoodThriller 0 1 5054
351 Event_TollywoodThriller 0 2 6854
352 Event_TollywoodThriller 0 3 5415
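
The eyeball check above covers a single appeal; the same comparison can be asserted over every event. A sketch, assuming the three id columns are exactly the groupby keys used earlier:


In [ ]:
# Programmatic version of the check above (sketch): per-event counts in
# events_by_attendance must match a fresh groupby over dfs for all events
keys = ['appeal', 'campaign_location_id', 'campaign_month_id']
expected = dfs.groupby(keys).size()
actual = events_by_attendance.set_index(keys).transaction_count
assert (actual.sort_index() == expected.sort_index()).all()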

Filter the data: we will consider only the events with at least 10 transactions, which leaves 178 events


In [15]:
events_by_attendance[events_by_attendance.transaction_count >= 10].shape


Out[15]:
(178, 4)

In [16]:
events_by_attendance[events_by_attendance.transaction_count < 10].transaction_count.sum()


Out[16]:
608L
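
So the cutoff discards 608 transactions in total. Relative to all event transactions, the share removed can be computed directly; a sketch:


In [ ]:
# Share of transactions dropped by the >= 10 cutoff (sketch)
dropped = events_by_attendance[events_by_attendance.transaction_count < 10].transaction_count.sum()
total = events_by_attendance.transaction_count.sum()
float(dropped) / total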

In [17]:
# Trim the event attendance data to events with at least 10 transactions
# and merge with events_by_amount to pick up the amounts
events = events_by_attendance[events_by_attendance.transaction_count >= 10]\
    .merge(events_by_amount, how='left', on=['appeal', 'campaign_location_id', 'campaign_month_id'])

In [18]:
# Merge with the original dataframe to get the month centers
events = events.merge(
    dfs[(dfs.is_month_center==True)]\
    [['appeal', 'campaign_month_id', 'activity_year', 'activity_month', 'activity_ym']],
    how='left', on=['appeal', 'campaign_month_id']).drop_duplicates()

# Merge with the original dataframe to get the location centers
events = events.merge(
    dfs[(dfs.is_location_center==True)]\
    [['appeal', 'campaign_location_id', 'county', 'state']],
    how='left', on=['appeal', 'campaign_location_id']).drop_duplicates()
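
Before moving on, it is worth asserting that the two merges neither fanned out rows nor left gaps. A sketch, assuming every event has exactly one month center and one location center in dfs:


In [ ]:
# Merge sanity checks (sketch): still one row per event, no missing centers
keys = ['appeal', 'campaign_location_id', 'campaign_month_id']
assert not events.duplicated(keys).any()
assert events[['activity_ym', 'county', 'state']].notnull().all().all()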

In [19]:
# Sanity check what we got thus far
events[events.appeal == 'Event_TollywoodThriller']


Out[19]:
appeal campaign_location_id campaign_month_id transaction_count amount activity_year activity_month activity_ym county state
8727 Event_TollywoodThriller 0 0 48 3781 2014 11 201411 King WA
8773 Event_TollywoodThriller 0 1 41 5054 2012 12 201212 King WA
8819 Event_TollywoodThriller 0 2 33 6854 2009 8 200908 King WA
8865 Event_TollywoodThriller 0 3 39 5415 2011 7 201107 King WA

In [20]:
dfs[(dfs.is_location_center==True) & (dfs.appeal=='Event_TollywoodThriller')]\
    [['county', 'state']].drop_duplicates()


Out[20]:
county state
115022 King WA

In [21]:
dfs[(dfs.is_month_center==True) & (dfs.appeal=='Event_TollywoodThriller')]\
    [['activity_year', 'activity_month']].drop_duplicates()


Out[21]:
activity_year activity_month
66124 2014 11
115130 2009 8
116918 2012 12
117693 2011 7

Understand the data

We had events in a total of 13 different states.
California had the highest number of events (33), followed by Washington (11).
March 2014 was by far the busiest single month for the organization (8 events), and 2014 was the busiest year overall (22 events).
Otherwise, events were spread fairly evenly across the months of the year.
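
The counts behind these observations can be tabulated directly before plotting; a quick sketch using value_counts:


In [ ]:
# Tabulate the observations above (sketch): events per state and per
# year-month, sorted descending
print events.state.value_counts().head()
print events.activity_ym.value_counts().head()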


In [22]:
events.groupby('state').size().plot(kind='barh')


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xa43bee2c>

In [23]:
events.groupby('activity_month').size().plot(kind='barh')


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0xa43a742c>

In [24]:
events.groupby('activity_month').amount.sum().plot(kind='barh')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0xa44015cc>

In [25]:
events.groupby('activity_year').size().plot(kind='barh')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0xa44031cc>

In [26]:
events.groupby('activity_year').agg({'amount': 'sum'}).plot(kind='bar', legend=True)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0xa445c9ac>

In [27]:
# Build a unique event name from the appeal code plus the campaign
# location and month ids
events['event_name'] = \
    events.apply(
        lambda row:
            row.appeal + '_' +
            str(int(row.campaign_location_id)) + '_' +
            str(int(row.campaign_month_id)),
        axis=1)
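
The row-wise apply works, but the same column can be built with vectorized string concatenation. A sketch, assuming the two id columns contain no NaNs:


In [ ]:
# Vectorized alternative (sketch): no per-row Python call;
# assumes campaign_location_id / campaign_month_id are NaN-free
events['event_name'] = (events.appeal + '_'
    + events.campaign_location_id.astype(int).astype(str) + '_'
    + events.campaign_month_id.astype(int).astype(str))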

Save the data


In [28]:
!mkdir -p out/41

events\
    .reset_index(drop=True)\
    .to_pickle('out/41/events.pkl')
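
A quick read-back confirms the pickle round-trips cleanly; a sketch:


In [ ]:
# Read back the saved events (sketch) and confirm the shape
pd.read_pickle('out/41/events.pkl').shape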
