In [1]:
import pandas as pd
In [2]:
ads_csv = pd.read_csv("ad_table.csv.csv")
In [3]:
ads_csv.head()
Out[3]:
In [4]:
ads_csv.shape
Out[4]:
In [5]:
#Explore ad groups
ads_csv.ad.unique()
Out[5]:
In [6]:
# Convert the date column to a datetime object and store it in a new column, dateDT
ads_csv["dateDT"]=pd.to_datetime(ads_csv["date"])
In [7]:
print "unique days" , ads_csv["dateDT"].dt.day.unique()
print "month", ads_csv["dateDT"].dt.month.unique()
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [9]:
# factorplot was renamed catplot in newer seaborn (size -> height); kind="point" matches the old default
g = sns.catplot(x="dateDT", y="shown", hue="ad", data=ads_csv, kind="point", height=10)
g.set(xticklabels=[])
Out[9]:
In [10]:
min(ads_csv["shown"])
Out[10]:
In [11]:
# Look for outliers; the table below summarizes the distribution of the variables in our data frame.
ads_csv.describe()
Out[11]:
In [12]:
# A minimum of -200.15 under total_revenue doesn't make sense; explore further.
ads_csv[ads_csv["total_revenue"]<0]
Out[12]:
In [13]:
# Drop rows where total_revenue is negative; there are only 4 such rows, so we exclude them.
ads_csv=ads_csv[ads_csv["total_revenue"] >=0]
In [14]:
# Explore rows where "shown" (the number of times an ad group was displayed) equals 0.
ads_csv[ads_csv["shown"]==0]
Out[14]:
In [15]:
ads_csv=ads_csv.drop(ads_csv[ads_csv["shown"]==0].index)
In [16]:
## Check that the desired rows have been dropped; per the criterion above, row 1273 should be gone.
1273 in ads_csv.index
Out[16]:
In [17]:
ad_groups_csv=ads_csv.groupby(['ad']).mean()
ad_groups_csv
Out[17]:
For ad_group_25, both the average total_revenue and the average converted counts are 0. Some ad groups, e.g. ad_group_9 and ad_group_39, have very low average revenue (14.81 and 19.53 dollars respectively), but that is acceptable.
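A quick filter makes it straightforward to flag such weak ad groups programmatically; a minimal sketch using the grouped frame ad_groups_csv from the cell above (the 20-dollar revenue threshold is an illustrative choice, not derived from the analysis):
In [ ]:
# Flag ad groups with zero average conversions or very low average revenue.
# The 20-dollar cutoff is illustrative, not taken from the data.
weak_groups = ad_groups_csv[(ad_groups_csv["converted"] == 0) |
                            (ad_groups_csv["total_revenue"] < 20)]
weak_groups.index.tolist()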
In [39]:
### Plot all variables, sorting ad groups by descending average revenue. Ad groups 25 and 9 perform worst.
## Drop the 'shown' and 'clicked' columns (for plotting only) to better visualize revenue.
ads_group_csv_drop_cols = ad_groups_csv.drop(['shown', 'clicked'], axis=1)
## Plot only the bottom 10 ad groups based on revenue.
ads_groups_csv_plot = ads_group_csv_drop_cols.sort_values(by='total_revenue', ascending=False).iloc[30:, :].\
plot(kind='bar', title='Plot of bottom 10 worst-performing ad_groups by revenue', figsize=(10, 9))
ads_group_csv_drop_cols_plot = ads_group_csv_drop_cols.sort_values(by='total_revenue', ascending=False).\
plot(kind='bar', title='Plot of all ad_groups by revenue', figsize=(10, 9))
To measure campaign effectiveness, it is important to consider company XYZ's ultimate goals. If the goal is to increase brand awareness, consider impressions, as well as the number of unique visitors to the XYZ website. If the goal is to increase sales, then conversion rate, revenue, and average order value (revenue/conversions) are a few metrics to consider.
Assuming that XYZ's goals are visibility, conversions, and building a good user base, the metric considered here is CPI, conversions per impression. It takes into account both CTR, the click-through rate (clicks/impressions), and CR, the conversion rate (conversions/clicks): CPI = conversions/impressions.
This metric does not take revenue into account.
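Since CPI = conversions/impressions = (clicks/impressions) × (conversions/clicks), it factors exactly as CTR × CR. A quick sanity check with made-up numbers (not from the data):
In [ ]:
# Illustrative check that CPI == CTR * CR; the counts below are invented.
shown, clicked, converted = 10000, 250, 20
ctr = clicked / shown        # click-through rate: 0.025
cr = converted / clicked     # conversion rate: 0.08
cpi = converted / shown      # conversions per impression: 0.002
assert abs(cpi - ctr * cr) < 1e-12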
In [19]:
ad_groups_csv["ClickThroRate"]=ad_groups_csv.clicked/ad_groups_csv.shown
ad_groups_csv["ConvPerImpression"]=ad_groups_csv.converted/ad_groups_csv.shown
ad_groups_csv["ConvRate"]=ad_groups_csv.converted/ad_groups_csv.clicked
In [20]:
ads_group_csv_metrics_plot= ad_groups_csv[['ClickThroRate','ConvPerImpression','ConvRate']].\
sort_values(by='ConvPerImpression',ascending= False).\
plot(kind='bar',title = 'Plot of metrics CTR, CPI, CR all ad_groups, sorted by decreasing CPI',figsize=(10,9))
In [21]:
#### Plot ad_group_1
df_ad_group_1 = ads_csv.loc[ads_csv["ad"] == "ad_group_1"].copy()  # .copy() avoids SettingWithCopyWarning below
df_ad_group_1.set_index(df_ad_group_1["dateDT"], inplace=True)
df_ad_group_1.loc[:, "shown"] = df_ad_group_1["shown"].astype('float')
In [22]:
# Time series model: rolling statistics over a 5-day window
rolmean1 = df_ad_group_1["shown"].rolling(window=5).mean()
rolstd1 = df_ad_group_1["shown"].rolling(window=5).std()
#Plot rolling statistics:
orig = plt.plot(df_ad_group_1["shown"], color='blue',label='Original')
mean = plt.plot(rolmean1, color='red', label='Rolling Mean')
std = plt.plot(rolstd1, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)
In [23]:
from statsmodels.tsa.stattools import adfuller
#Perform Dickey-Fuller test:
print('Results of Dickey-Fuller Test:')
dftest = adfuller(df_ad_group_1["shown"], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print(dfoutput)
#### The test statistic is smaller than the critical values, so we reject the null hypothesis of
#### non-stationarity: this series is stationary.
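Wrapping the test in a small helper makes this decision reusable for other ad groups; a minimal sketch (the 5% significance level is an assumption, not stated above):
In [ ]:
from statsmodels.tsa.stattools import adfuller

def is_stationary(series, alpha=0.05):
    """True if the ADF test rejects the non-stationarity null at level alpha."""
    p_value = adfuller(series.dropna(), autolag='AIC')[1]
    return p_value < alpha

is_stationary(df_ad_group_1["shown"])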
In [24]:
# Autocorrelation and partial autocorrelation plots to determine ARIMA model parameters
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
y = df_ad_group_1["shown"]
plot_acf(y, lags=20, alpha=0.5)
plot_pacf(y, lags=20, alpha=0.5)
Out[24]:
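Reading p and q off the ACF/PACF plots is subjective; an alternative is a small AIC grid search over candidate orders. A sketch under the modern statsmodels API (the grid bounds are illustrative; the analysis below uses order=(6,0,3)):
In [ ]:
import itertools
from statsmodels.tsa.arima.model import ARIMA

# Try small (p, q) grids for a non-differenced ARIMA and keep the lowest AIC.
best_aic, best_order = float("inf"), None
for p, q in itertools.product(range(7), range(4)):
    try:
        aic = ARIMA(y, order=(p, 0, q)).fit().aic
        if aic < best_aic:
            best_aic, best_order = aic, (p, 0, q)
    except Exception:
        continue  # skip orders that fail to converge
print(best_order, best_aic)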
In [25]:
# Prepare training and test set
# We need to forecast 3 weeks ahead (from Nov. 23 to Dec. 15), so ensure the test set is at least 2 weeks long.
series = df_ad_group_1["shown"]
size_train=int(len(series)*0.80)
training_data=series[0:size_train] #Dates from Oct. 1 to Nov. 6
testing_data=series[size_train:len(series)]#Dates from Nov.7 to Nov.22
In [26]:
# Prepare the model: fit an ARIMA(6,0,3) on the training data
from statsmodels.tsa.arima.model import ARIMA
modelA = ARIMA(training_data, order=(6,0,3))
model_fit = modelA.fit()
print(model_fit.summary())
In [27]:
# plot residual errors
residual = pd.DataFrame(model_fit.resid)
residual.plot()
plt.show()
residual.plot(kind='kde')
print(residual.describe())
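Beyond eyeballing the residual plots, a Ljung-Box test gives a formal check that the residuals are uncorrelated; a minimal sketch (lag 10 is an arbitrary illustrative choice):
In [ ]:
from statsmodels.stats.diagnostic import acorr_ljungbox

# Null hypothesis: residuals are independently distributed (no autocorrelation).
# A large p-value suggests the ARIMA model has captured the autocorrelation structure.
acorr_ljungbox(model_fit.resid, lags=[10])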
In [28]:
### Helper: determine the dates for the test set and the future out-of-sample dates.
import datetime
start_date = testing_data.index[0]
end_date_test=testing_data.index[-1]
end_date=testing_data.index[-1]+datetime.timedelta(days=23)
start_date=start_date.date()
end_date=end_date.date()
end_date_test=end_date_test.date()
start_date_str=str(start_date)
end_date_str=str(end_date)
In [29]:
# Multi-step out-of-sample forecast using the predict function
df_pred = model_fit.predict(start=start_date_str, end=end_date_str)
# (If differencing were used, the legacy API needed typ='levels' to return the original scale.)
df_pred_test_data = df_pred[start_date_str:end_date_test]  # used to compute error metrics
df_pred_future = df_pred[end_date_test + datetime.timedelta(days=1):end_date]
In [30]:
# Plot the original time series (blue) with ARIMA fitted (red), predicted (green), and forecasted (brown) values
plt.plot(df_ad_group_1["shown"],color='blue')
plt.plot(model_fit.fittedvalues, color='red')
plt.plot(df_pred_test_data,color='green')
plt.plot(df_pred_future,color='brown')
plt.title('Plot of underlying time series, fitted (red), predicted(green) and future forecasted values(brown)')
Out[30]:
In [31]:
import sklearn.metrics, math
predicted = df_pred_test_data
expected= testing_data
mae = sklearn.metrics.mean_absolute_error(expected, predicted)
mse=sklearn.metrics.mean_squared_error(expected,predicted)
rmse=math.sqrt(mse)
In [32]:
pd.DataFrame({'series':'ad_group_1','mae':[mae],'mse':[mse],'rmse':[rmse], 'predicted_value':df_pred_future[end_date]})
Out[32]:
Fit a trend line and compute its slope. The slope could be computed from the beginning of the data, over the last week, or over the previous 10 days; here it is computed over the entire duration of the data to determine the trend.
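For reference, a minimal sketch of the trailing-window variant mentioned above (the 10-day window is illustrative; the cells below use the full history):
In [ ]:
import numpy as np
from scipy import stats

def trailing_slope(series, window=10):
    """Slope of a least-squares line over the last `window` observations."""
    tail = series.iloc[-window:]
    return stats.linregress(np.arange(len(tail)), tail).slope

trailing_slope(df_ad_group_1["avg_cost_per_click"])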
In [33]:
ad_strings = ads_csv["ad"].unique()
import numpy as np
from scipy import stats

slope_df = dict()
p_value_df = dict()
fig_num = 0
for i in ad_strings:
    fig_num = fig_num + 1
    df_ad_group = ads_csv.loc[ads_csv["ad"] == i].copy()
    df_ad_group.set_index(df_ad_group["dateDT"], inplace=True)
    x = np.arange(len(df_ad_group))
    # Least-squares trend line for avg_cost_per_click over time
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, df_ad_group["avg_cost_per_click"])
    y_line = x * slope + intercept
    slope_df.update({i: slope})
    p_value_df.update({i: p_value})
    plt.figure(fig_num)
    plt.plot(x, df_ad_group["avg_cost_per_click"])
    plt.plot(y_line)
    plt.title('Trend line for avg_cost_per_click for ' + i)
In [34]:
#Print the slope values for each ad group.
slope_df
Out[34]:
From the values above, the ad groups whose trend lines appear flat are those whose slope is on the order of ten-thousandths, e.g. ad_group_1: 0.000926. The trend appears increasing when the slope is on the order of thousandths, e.g. ad_group_3: 0.002444, and decreasing when the slope is negative.
In [35]:
p_value_df
# Looking at the p-values, none of the trends is statistically significant except for ad_group_8 and ad_group_32.
Out[35]:
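Combining the two dictionaries makes that conclusion explicit; a minimal sketch (the 0.05 cutoff is an assumption, not stated in the analysis):
In [ ]:
# Classify each ad group's trend: 'up'/'down' if significant, otherwise 'flat'.
trend = {ad: ('up' if slope_df[ad] > 0 else 'down') if p_value_df[ad] < 0.05 else 'flat'
         for ad in slope_df}
trend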
In [40]:
# Plot the sorted slope values to see the overall spread of trends across ad groups
plt.plot(sorted(slope_df.values()))
Out[40]: