In [1]:
import pandas as pd

In [2]:
ads_csv = pd.read_csv("ad_table.csv.csv")

Company XYZ is a food delivery company. Like pretty much any other site, in order to get customers, they have been relying significantly on online ads, such as those you see on Google or Facebook.

At the moment, they are running 40 different ad campaigns and want you to help them understand their performance.

Specifically, you are asked to:

1. If you had to identify the 5 best ad groups, which ones would they be? Which metric did you choose to identify the best ad groups? Why? Explain the pros of your metric as well as the possible cons.

2. For any one ad group, predict how many ads will be shown on Dec. 15 (assume each ad group keeps following its trend). Note: slightly changed from the question in the original Data Science challenge book

3. Cluster ads into 3 groups: the ones whose avg_cost_per_click is going up, the ones whose avg_cost_per_click is flat and the ones whose avg_cost_per_click is going down.


In [3]:
ads_csv.head()


Out[3]:
date shown clicked converted avg_cost_per_click total_revenue ad
0 10/1/2015 65877 2339 43 0.90 641.62 ad_group_1
1 10/2/2015 65100 2498 38 0.94 756.37 ad_group_1
2 10/3/2015 70658 2313 49 0.86 970.90 ad_group_1
3 10/4/2015 69809 2833 51 1.01 907.39 ad_group_1
4 10/5/2015 68186 2696 41 1.00 879.45 ad_group_1

In [4]:
ads_csv.shape


Out[4]:
(2115, 7)

Data Exploration


In [5]:
#Explore ad groups
ads_csv.ad.unique()


Out[5]:
array(['ad_group_1', 'ad_group_2', 'ad_group_3', 'ad_group_4',
       'ad_group_5', 'ad_group_6', 'ad_group_7', 'ad_group_8',
       'ad_group_9', 'ad_group_10', 'ad_group_11', 'ad_group_12',
       'ad_group_13', 'ad_group_14', 'ad_group_15', 'ad_group_16',
       'ad_group_17', 'ad_group_18', 'ad_group_19', 'ad_group_20',
       'ad_group_21', 'ad_group_22', 'ad_group_23', 'ad_group_24',
       'ad_group_25', 'ad_group_26', 'ad_group_27', 'ad_group_28',
       'ad_group_29', 'ad_group_30', 'ad_group_31', 'ad_group_32',
       'ad_group_33', 'ad_group_34', 'ad_group_35', 'ad_group_36',
       'ad_group_37', 'ad_group_38', 'ad_group_39', 'ad_group_40'], dtype=object)

In [6]:
#convert date column to date-time object. Create new column dateDT
ads_csv["dateDT"]=pd.to_datetime(ads_csv["date"])

In [7]:
print "unique days" , ads_csv["dateDT"].dt.day.unique()
print "month",  ads_csv["dateDT"].dt.month.unique()


unique days [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31]
month [10 11]

So we have data for two months, Oct. and Nov. 2015.


In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

In [9]:
g = sns.factorplot(x="dateDT", y="shown",hue="ad", data=ads_csv,size=10)
g.set(xticklabels=[])


Out[9]:
<seaborn.axisgrid.FacetGrid at 0x7f26748>

Clearly there are differences in the number of times each ad group is shown, and each group tends to stay roughly within the range it started in. Also, the number of times ads are shown drops to zero for some ad groups on certain days.


In [10]:
min(ads_csv["shown"])


Out[10]:
0

Outlier detection

Look for outliers in the data. Deal with missing data or out-of-bounds values, such as negative values for money.


In [11]:
#Look for outliers; the table below gives us an idea of the distribution of the variables in our data frame
ads_csv.describe()


Out[11]:
shown clicked converted avg_cost_per_click total_revenue
count 2115.000000 2115.000000 2115.000000 2115.000000 2115.000000
mean 68299.844444 3056.077069 126.453901 1.373749 1966.517589
std 48884.821409 3783.579969 233.420826 0.754331 3942.018757
min 0.000000 0.000000 0.000000 0.000000 -200.150000
25% 28030.500000 744.000000 18.000000 0.760000 235.470000
50% 54029.000000 1392.000000 41.000000 1.400000 553.300000
75% 97314.500000 3366.500000 103.000000 1.920000 1611.495000
max 192507.000000 20848.000000 1578.000000 4.190000 39623.710000

In [12]:
#A minimum of -200.15 doesn't make sense for total_revenue; explore further
ads_csv[ads_csv["total_revenue"]<0]


Out[12]:
date shown clicked converted avg_cost_per_click total_revenue ad dateDT
1186 10/25/2015 50459 1904 97 0.89 -200.15 ad_group_23 2015-10-25
1404 11/1/2015 70145 584 15 0.36 -25.09 ad_group_27 2015-11-01
1524 11/15/2015 22070 1400 20 1.67 -3.56 ad_group_29 2015-11-15
2089 10/28/2015 80270 1151 41 0.96 -92.06 ad_group_40 2015-10-28

The values in all the columns except total_revenue look legitimate when compared to the summary table above. Some ways to deal with missing or invalid values: imputing with the mean, median, or mode of nearby similar values; dropping rows, depending on the percentage of data affected; or using nearest neighbors or decision trees to fill in the values.
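
As an illustration of the imputation option mentioned above (not what is done below, where the four bad rows are simply dropped), here is a minimal sketch that treats negative total_revenue as missing and fills it with the median revenue of the same ad group:

import numpy as np

ads_imputed = ads_csv.copy()
# treat negative revenue as missing
ads_imputed.loc[ads_imputed["total_revenue"] < 0, "total_revenue"] = np.nan
# fill each missing value with the median revenue of its own ad group
ads_imputed["total_revenue"] = ads_imputed.groupby("ad")["total_revenue"]\
    .transform(lambda s: s.fillna(s.median()))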


In [13]:
# Keep only rows with non-negative revenue; there are only 4 bad rows, so drop them.
ads_csv=ads_csv[ads_csv["total_revenue"] >=0]

Look for rows where the number of times the ads were displayed equals zero, and remove those ad-campaign days since the ads were not shown at all.


In [14]:
#Explore rows where "shown" or the number of times an ad group was displayed equaled 0.
ads_csv[ads_csv["shown"]==0]


Out[14]:
date shown clicked converted avg_cost_per_click total_revenue ad dateDT
204 11/15/2015 0 0 0 0.0 0.0 ad_group_4 2015-11-15
223 10/12/2015 0 0 0 0.0 0.0 ad_group_5 2015-10-12
554 10/26/2015 0 0 0 0.0 0.0 ad_group_11 2015-10-26
580 11/21/2015 0 0 0 0.0 0.0 ad_group_11 2015-11-21
632 11/20/2015 0 0 0 0.0 0.0 ad_group_12 2015-11-20
650 10/16/2015 0 0 0 0.0 0.0 ad_group_13 2015-10-16
676 11/11/2015 0 0 0 0.0 0.0 ad_group_13 2015-11-11
811 10/19/2015 0 0 0 0.0 0.0 ad_group_16 2015-10-19
821 10/29/2015 0 0 0 0.0 0.0 ad_group_16 2015-10-29
1000 11/19/2015 0 0 0 0.0 0.0 ad_group_19 2015-11-19
1005 10/2/2015 0 0 0 0.0 0.0 ad_group_20 2015-10-02
1211 11/19/2015 0 0 0 0.0 0.0 ad_group_23 2015-11-19
1273 10/7/2015 0 0 0 0.0 0.0 ad_group_25 2015-10-07

Drop rows where the total number of ads shown is 0. It is fine for clicks or conversions to be 0, but not the initial impressions (ads shown); in those rows there is no usable data in the other columns either.


In [15]:
ads_csv=ads_csv.drop(ads_csv[ads_csv["shown"]==0].index)

In [16]:
## Check to see if the desired rows have been dropped. row 1273 has been dropped according to our criteria 
# above.
1273 in ads_csv.index


Out[16]:
False

Data Analysis

Explore data frame based on grouping ads campaigns together and compute mean for each column.


In [17]:
ad_groups_csv=ads_csv.groupby(['ad']).mean()
ad_groups_csv


Out[17]:
shown clicked converted avg_cost_per_click total_revenue
ad
ad_group_1 69345.566038 2705.830189 47.113208 0.971321 743.015849
ad_group_10 117701.365385 1671.115385 19.538462 1.166154 299.612115
ad_group_11 19447.078431 1187.549020 57.549020 1.684118 797.104314
ad_group_12 29174.807692 1743.884615 159.519231 2.048269 2486.270000
ad_group_13 161519.176471 13841.980392 1308.647059 1.740392 20685.530000
ad_group_14 8461.096154 253.211538 20.961538 0.709231 320.639615
ad_group_15 15539.566038 356.566038 16.660377 0.802830 253.585660
ad_group_16 30755.666667 818.901961 72.588235 0.563922 1098.290784
ad_group_17 139477.811321 1228.962264 14.358491 0.396792 189.874717
ad_group_18 89124.346154 8819.634615 616.134615 2.000192 10052.245769
ad_group_19 19506.884615 1836.461538 48.173077 2.257500 731.529808
ad_group_2 51076.981132 1168.150943 117.905660 0.638113 1783.559811
ad_group_20 123004.509804 7519.333333 350.372549 1.827647 5508.405098
ad_group_21 27432.698113 247.245283 12.622642 1.065472 178.322453
ad_group_22 24333.094340 895.547170 41.320755 0.802453 627.441132
ad_group_23 47570.000000 2196.431373 95.156863 1.097451 1404.683529
ad_group_24 36033.500000 352.903846 3.480769 3.183654 53.403654
ad_group_25 174133.307692 623.711538 0.000000 0.717308 0.000000
ad_group_26 72090.660377 6559.283019 328.245283 2.237547 5192.878113
ad_group_27 65725.519231 539.846154 16.711538 0.379423 258.538654
ad_group_28 19886.339623 1770.792453 55.962264 1.970755 864.178868
ad_group_29 20652.692308 1311.076923 19.769231 1.634423 272.124231
ad_group_3 169248.924528 16073.320755 58.301887 2.682264 867.241509
ad_group_30 116388.358491 1273.754717 16.849057 1.094528 266.097170
ad_group_31 130457.132075 679.867925 59.094340 0.147358 860.718113
ad_group_32 38446.622642 3380.754717 26.754717 1.933396 375.154340
ad_group_33 15944.000000 735.226415 25.905660 0.927736 366.206038
ad_group_34 35371.622642 2979.169811 260.320755 1.733208 4019.534340
ad_group_35 60303.396226 860.547170 1.735849 0.354528 24.675094
ad_group_36 67026.547170 554.981132 25.547170 0.750377 379.212642
ad_group_37 72666.811321 5865.188679 312.377358 1.898679 4743.911887
ad_group_38 174986.075472 10948.867925 78.169811 1.394717 1102.008491
ad_group_39 25258.679245 1000.867925 1.264151 1.594340 19.538491
ad_group_4 92289.903846 7649.173077 480.846154 2.496923 7331.175192
ad_group_40 77369.826923 1285.192308 34.711538 0.996154 532.098269
ad_group_5 52406.461538 3343.173077 215.538462 2.027308 3574.216923
ad_group_6 39575.264151 1326.905660 23.094340 1.585094 340.284340
ad_group_7 55782.132075 4869.698113 25.000000 1.812830 410.156604
ad_group_8 43938.396226 2655.490566 75.622642 1.961132 1054.843585
ad_group_9 121083.018868 249.056604 1.000000 0.092642 14.814340

For ad_group_25, the average total_revenue and converted columns are 0. Some ad groups, e.g. ad_group_9 and ad_group_39, have very low average revenue (about 14.81 and 19.54 dollars respectively), but that is still valid data.

Plot all variables, sorted by ad group in descending order of revenue. Ad groups 25 and 9 are the worst performing in terms of revenue.


In [39]:
### Plot all variables and sort by ad_group based on descending revenue. Ad groups 25 and 9 are the lowest by revenue.
import matplotlib.pyplot as plt
%matplotlib inline

## Drop 'shown' and 'clicked' columns to better visualize, just for plotting and to look at Revenue.
ads_group_csv_drop_cols = ad_groups_csv.drop(['shown','clicked'],axis=1)

## plot only the bottom 10 ad_groups based on revenue
ads_groups_csv_plot = ads_group_csv_drop_cols.sort_values(by='total_revenue',ascending= False).ix[30:,:].\
plot(kind='bar', title='Plot of bottom 10 worst-performing ad_groups by revenue',figsize=(10, 9))
ads_group_csv_drop_cols_plot= ads_group_csv_drop_cols.sort_values(by='total_revenue',ascending= False).\
plot(kind='bar',title = 'Plot of all ad_groups by revenue',figsize=(10,9))


Data analysis after excluding the unreliable rows. Answer to the first of the three questions (see the beginning of the post).

1. If you had to identify the 5 best ad groups, which ones would they be? Which metric did you choose to identify the best ad groups? Why? Explain the pros of your metric as well as the possible cons.

In order to measure campaign effectiveness, it is important to consider company XYZ's ultimate goals. If the goal is to increase brand awareness, then impressions and the number of unique visitors to the XYZ website are natural metrics. If the goal is to increase sales, then conversion rate, revenue, and average order value (revenue/conversions) are a few metrics to consider.

Metric for identifying 5 best ad groups: Conversion per Impression

PROS:

Assuming that XYZ's goals are visibility, conversions, and building a good user base, CPI (conversions per impression) is the metric considered here. It combines both CTR, the click-through rate (clicks/impressions), and CR, the conversion rate (conversions/clicks): CPI = conversions/impressions = CTR × CR (a quick worked example follows below).

CONS:

This method does not take into account revenue.
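
As a quick check of the identity CPI = CTR × CR, using the ad_group_1 averages from the grouped table above:

CTR = 2705.83 / 69345.57 ≈ 0.0390
CR  = 47.11 / 2705.83 ≈ 0.0174
CPI = 47.11 / 69345.57 ≈ 0.00068 ≈ 0.0390 × 0.0174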

Compute the additional metrics CTR, CPI, and CR in the dataframe.


In [19]:
ad_groups_csv["ClickThroRate"]=ad_groups_csv.clicked/ad_groups_csv.shown
ad_groups_csv["ConvPerImpression"]=ad_groups_csv.converted/ad_groups_csv.shown
ad_groups_csv["ConvRate"]=ad_groups_csv.converted/ad_groups_csv.clicked

In [20]:
ads_group_csv_metrics_plot= ad_groups_csv[['ClickThroRate','ConvPerImpression','ConvRate']].\
sort_values(by='ConvPerImpression',ascending= False).\
plot(kind='bar',title = 'Plot of metrics CTR, CPI, CR all ad_groups, sorted by decreasing CPI',figsize=(10,9))


Based on CPI, the top 5 ad groups are 13, 34, 18, 12, and 4, listed in decreasing order of conversions/impressions. A higher CTR indicates relevance to the search term and is a sign of quality; the conversion rate is the percentage of people who completed a predefined goal out of those who clicked on the ad. CPI combines both and helps us choose which campaigns perform better; it does not, however, consider revenue.

From our previous revenue chart: ad groups 13, 18 and 4 are the top three in terms of revenue with ad groups 34 and 12 in 7th and 9th positions respectively.

2. For any group, predict how many ads will be shown on Dec. 15 (assume each ad group keeps following its trend).

Follow general steps for time series analysis:

  1. Plot data and test for stationarity
  2. Prepare a model with parameters
  3. Fit the model from step 2
  4. Predict model on test data and/or for future values
  5. Evaluate performance metrics

Test for stationarity of the ad groups, starting with one group. Since the data cover only two months, I don't expect to see a meaningful trend or seasonality.


In [21]:
#### Plot ad_group_1
df_ad_group_1=ads_csv.ix[ads_csv["ad"]=="ad_group_1"]
#df_ad_group_1.index=df_ad_group_1["dateDT"]
df_ad_group_1.set_index(df_ad_group_1["dateDT"],inplace =True)
#df_ad_group_1.unstack(level=1)
#df_ad_group_1.plot()

df_ad_group_1.loc[:,"shown"]=df_ad_group_1["shown"].astype('float')


C:\Users\susudars\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\indexing.py:461: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

In [22]:
# Time series model 
#df_ad_group_1["shown"]
rolmean1 = pd.rolling_mean(df_ad_group_1["shown"],window=5)
rolstd1 = pd.rolling_std(df_ad_group_1["shown"], window=5)

#Plot rolling statistics:
orig = plt.plot(df_ad_group_1["shown"], color='blue',label='Original')
mean = plt.plot(rolmean1, color='red', label='Rolling Mean')
std = plt.plot(rolstd1, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)


C:\Users\susudars\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: pd.rolling_mean is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=5,center=False).mean()
  app.launch_new_instance()
C:\Users\susudars\AppData\Local\Continuum\Anaconda2\lib\site-packages\ipykernel\__main__.py:4: FutureWarning: pd.rolling_std is deprecated for Series and will be removed in a future version, replace with 
	Series.rolling(window=5,center=False).std()

In [23]:
from statsmodels.tsa.stattools import adfuller
#Perform Dickey-Fuller test:
print 'Results of Dickey-Fuller Test:'
dftest = adfuller(df_ad_group_1["shown"], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value

print dfoutput

#### Test statistic is less than the critical values, so we reject the null hypothesis of non-stationarity;
#### this series is stationary


Results of Dickey-Fuller Test:
Test Statistic                 -4.458884
p-value                         0.000233
#Lags Used                      0.000000
Number of Observations Used    52.000000
Critical Value (5%)            -2.918973
Critical Value (1%)            -3.562879
Critical Value (10%)           -2.597393
dtype: float64

Plot the ACF and PACF to help determine the p, d, q parameters.


In [24]:
# Autocorrelation plot and partial autocorrelation plot to determine ARIMA model parameters
from statsmodels.tsa.arima_model import ARIMA
#from pandas.tools.plotting import autocorrelation_plot

from statsmodels.tsa.stattools import acf, pacf

y= df_ad_group_1["shown"]

import statsmodels.tsa.api as smt
smt.graphics.plot_acf(y, lags=20, alpha = 0.5)
smt.graphics.plot_pacf(y, lags=20,  alpha =0.5)


Out[24]:

Choose p = 6, d = 0, and q = 3. A grid search could be implemented to determine the best (p, d, q); here the values were chosen after manually running different configurations (see the grid-search sketch after the train/test split below).

Prepare training and test set, fit a model and run on test set


In [25]:
# Prepare training and test set

# We need to forecast about 3 weeks ahead (from Nov. 23 to Dec. 15), so ensure the test set is at least 2 weeks long.
series = df_ad_group_1["shown"]
size_train=int(len(series)*0.80)
training_data=series[0:size_train] #Dates from Oct. 1 to Nov. 6
testing_data=series[size_train:len(series)]#Dates from Nov.7 to Nov.22
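
The grid search mentioned earlier was not actually run; below is a minimal sketch of how it could look, assuming we prefer the order with the lowest AIC on the training data and simply skip configurations that fail to estimate.

import warnings
from statsmodels.tsa.arima_model import ARIMA

best_aic, best_order = float("inf"), None
for p in range(7):
    for d in range(2):
        for q in range(4):
            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    candidate = ARIMA(training_data, order=(p, d, q)).fit(disp=-1)
            except Exception:
                continue  # skip orders that fail to converge or estimate
            if candidate.aic < best_aic:
                best_aic, best_order = candidate.aic, (p, d, q)
print "Best (p, d, q) by AIC:", best_order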

In [26]:
#prepare model
modelA = ARIMA(training_data, order=(6,0,3))


#fit model on entire series
model_fit = modelA.fit(disp=-1)
print(model_fit.summary())


                              ARMA Model Results                              
==============================================================================
Dep. Variable:                  shown   No. Observations:                   42
Model:                     ARMA(6, 3)   Log Likelihood                -382.778
Method:                       css-mle   S.D. of innovations           2064.870
Date:                Tue, 18 Jul 2017   AIC                            787.557
Time:                        09:46:14   BIC                            806.671
Sample:                    10-01-2015   HQIC                           794.563
                         - 11-11-2015                                         
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
const         6.88e+04    643.062    106.989      0.000      6.75e+04  7.01e+04
ar.L1.shown     1.0251      0.336      3.054      0.005         0.367     1.683
ar.L2.shown    -0.9702      0.411     -2.358      0.025        -1.776    -0.164
ar.L3.shown     0.0032      0.423      0.008      0.994        -0.825     0.832
ar.L4.shown     0.4074      0.244      1.671      0.104        -0.070     0.885
ar.L5.shown    -0.4526      0.237     -1.911      0.065        -0.917     0.012
ar.L6.shown     0.3714      0.171      2.177      0.037         0.037     0.706
ma.L1.shown    -0.7355      0.338     -2.176      0.037        -1.398    -0.073
ma.L2.shown     0.9323      0.326      2.864      0.007         0.294     1.570
ma.L3.shown     0.0827      0.367      0.225      0.823        -0.636     0.802
                                    Roots                                    
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -1.2509           -0.0000j            1.2509           -0.5000
AR.2            1.3061           -0.0000j            1.3061           -0.0000
AR.3            0.5537           -0.8815j            1.0410           -0.1607
AR.4            0.5537           +0.8815j            1.0410            0.1607
AR.5            0.0280           -1.2329j            1.2333           -0.2464
AR.6            0.0280           +1.2329j            1.2333            0.2464
MA.1            0.4091           -0.9125j            1.0000           -0.1829
MA.2            0.4091           +0.9125j            1.0000            0.1829
MA.3          -12.0933           -0.0000j           12.0933           -0.5000
-----------------------------------------------------------------------------
C:\Users\susudars\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\base\model.py:466: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)

In [27]:
# plot residual errors
residual = pd.DataFrame(model_fit.resid)
residual.plot()
plt.show()
residual.plot(kind='kde')
print(residual.describe())


                 0
count    42.000000
mean    -37.411271
std    2227.636585
min   -4853.391703
25%   -1395.070459
50%    -327.491389
75%    1113.258778
max    6295.280572

There seems to be a slight bias because the residual mean isn't zero.
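
Whether that bias is statistically meaningful can be checked with a one-sample t-test on the residuals; a quick sketch (with a mean of about -37 against a standard deviation of roughly 2200 over 42 points, it is unlikely to be significant):

from scipy import stats
# H0: the mean of the ARMA residuals is zero
t_stat, p_val = stats.ttest_1samp(residual[0], 0.0)
print "t-statistic:", t_stat, "p-value:", p_val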


In [28]:
### Helper: determine the dates for the test set and for the future out-of-sample forecast.
import datetime
#start_date = testing_data.index[-1] +datetime.timedelta(days=1)
#end_date=testing_data.index[-1]+datetime.timedelta(days=55)
start_date=testing_data.index[0]
end_date_test=testing_data.index[-1]
end_date=testing_data.index[-1]+datetime.timedelta(days=23)
start_date=start_date.date()
end_date=end_date.date()
end_date_test=end_date_test.date()
start_date_str=str(start_date)
end_date_str=str(end_date)

Prediction step


In [29]:
# multi-step out-of-sample forecast using predict function 
df_pred=model_fit.predict(start_date_str, end_date_str, testing_data) 
# if differencing is used, pass typ='levels' to get the data back on the original scale.

df_pred_test_data = df_pred[start_date_str:end_date_test ] # to compute metrics
df_pred_future=df_pred[end_date_test+datetime.timedelta(days=1):end_date]

#print df_pred

#print df_pred_test_data
#print df_pred_future


C:\Users\susudars\AppData\Local\Continuum\Anaconda2\lib\site-packages\statsmodels\base\data.py:503: FutureWarning: TimeSeries is deprecated. Please use Series
  return TimeSeries(result, index=self.predict_dates)

In [30]:
# plot original timeseries and ARIMA FITTED values

plt.plot(df_ad_group_1["shown"],color='blue')
plt.plot(model_fit.fittedvalues, color='red')

plt.plot(df_pred_test_data,color='green')
plt.plot(df_pred_future,color='brown')
plt.title('Plot of underlying time series, fitted (red), predicted(green) and future forecasted values(brown)')


Out[30]:
<matplotlib.text.Text at 0x1385dba8>

Metrics to evaluate


In [31]:
import sklearn.metrics, math

predicted = df_pred_test_data
expected= testing_data
mae = sklearn.metrics.mean_absolute_error(expected, predicted)

mse=sklearn.metrics.mean_squared_error(expected,predicted)

rmse=math.sqrt(mse)

Final prediction for ad group 1


In [32]:
pd.DataFrame({'series':'ad_group_1','mae':[mae],'mse':[mse],'rmse':[rmse], 'predicted_value':df_pred_future[end_date]})


Out[32]:
mae mse predicted_value rmse series
0 2844.9284 1.199582e+07 68890.440349 3463.498225 ad_group_1

Part 3: Cluster ad groups by whether avg_cost_per_click is going up, flat, or going down

Fit a trend line and compute its slope. The window over which the slope is computed could be the full history, the last week, or the previous 10 days; here the slope is computed over the entire duration of the data (a sketch of the 10-day variant follows below).
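
As a sketch of the shorter-window variant mentioned above, the same fit could be restricted to, say, the last 10 days of one group (the full-duration fit is what is actually used below):

import numpy as np
from scipy import stats

last10 = ads_csv[ads_csv["ad"] == "ad_group_1"].sort_values("dateDT").tail(10)
slope10 = stats.linregress(np.arange(len(last10)), last10["avg_cost_per_click"])[0]
print "ad_group_1 slope over the last 10 days:", slope10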


In [33]:
ad_strings = ads_csv["ad"].unique()
import pdb, numpy as np
from scipy import stats
slope_df=dict()
p_value_df=dict()
fig_num=0
for i in ad_strings:
    fig_num=fig_num+1
    df_ad_group=ads_csv.ix[ads_csv["ad"]==i]
    df_ad_group.set_index(df_ad_group["dateDT"],inplace =True)
    x=np.arange(len(df_ad_group))
    #coeffs = np.polyfit(x,df_ad_group["avg_cost_per_click"], 1)
    #print coeffs
    #slope = coeffs[0]
    #slope_df.update({slope:i})
    #y_line = x * coeffs[0] + coeffs[1]
    #plt.figure(fig_num)
    #plt.plot(x,df_ad_group["avg_cost_per_click"])
    #plt.plot(y_line)
    #plt.title('Trend line for Ad Revenue by  '+ i)
    
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, df_ad_group["avg_cost_per_click"])
    y_line = x * slope + intercept
    slope_df.update({i:slope})
    p_value_df.update({i:p_value})
    plt.figure(fig_num)
    plt.plot(x,df_ad_group["avg_cost_per_click"])
    plt.plot(y_line)
    plt.title('Trend line for avg_cost_per_click for ' + i)
    #print "slope 2 " + i , slope
    #print "p value" ,p_value


C:\Users\susudars\AppData\Local\Continuum\Anaconda2\lib\site-packages\matplotlib\pyplot.py:516: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

In [34]:
#Print the slope values for each ad group. 
slope_df


Out[34]:
{'ad_group_1': 0.00092646347363328568,
 'ad_group_10': -0.00018782549304191949,
 'ad_group_11': -0.0036298642533936659,
 'ad_group_12': 0.00096004439511653689,
 'ad_group_13': 0.00022171945701357414,
 'ad_group_14': -0.00080252710663365474,
 'ad_group_15': 0.00040557974520238649,
 'ad_group_16': -0.00062895927601809977,
 'ad_group_17': -2.7414933075310693e-05,
 'ad_group_18': 0.00065098608383847012,
 'ad_group_19': -0.0029518483736019796,
 'ad_group_2': -6.4505724883081191e-06,
 'ad_group_20': 0.00066515837104072462,
 'ad_group_21': 0.0004604096113530068,
 'ad_group_22': -0.00024350911143364011,
 'ad_group_23': 0.00062171945701357527,
 'ad_group_24': -0.006357466063348415,
 'ad_group_25': -0.00010074276444975712,
 'ad_group_26': 0.00072407676181260924,
 'ad_group_27': -0.00030009391274652092,
 'ad_group_28': -0.0018319625866795683,
 'ad_group_29': -0.0013587466917100658,
 'ad_group_3': 0.00244476697306886,
 'ad_group_30': -0.00011369134010643496,
 'ad_group_31': 7.982583454281562e-05,
 'ad_group_32': 0.004258184163844542,
 'ad_group_33': -0.00096677955168521278,
 'ad_group_34': 0.00012820512820512856,
 'ad_group_35': -0.00041203031769069517,
 'ad_group_36': -0.0010716013546202227,
 'ad_group_37': 0.0026100628930817633,
 'ad_group_38': -0.00080874052572165779,
 'ad_group_39': 0.0011409450088695381,
 'ad_group_4': -0.0041680184410484078,
 'ad_group_40': 0.0019832664560744478,
 'ad_group_5': -0.0010509690087936471,
 'ad_group_6': -0.0017335913562328662,
 'ad_group_7': -0.0025689404934687949,
 'ad_group_8': -0.0026890824060635389,
 'ad_group_9': 4.4347685857119878e-05}

From the values above, the trend line appears flat when the slope is on the order of ten-thousandths, e.g. 'ad_group_1': 0.000926; increasing when the slope is positive and on the order of thousandths, e.g. 'ad_group_3': 0.002444; and decreasing when the slope is negative. A simple three-way grouping based on these slopes and their p-values is sketched after the p-value output below.


In [35]:
p_value_df
#Looking at the p-values, the trend is statistically significant (p < 0.05) only for ad_group_8, ad_group_24, ad_group_32 and ad_group_40.


Out[35]:
{'ad_group_1': 0.14542867450916003,
 'ad_group_10': 0.84284571273621578,
 'ad_group_11': 0.29618167238853588,
 'ad_group_12': 0.74825519306604438,
 'ad_group_13': 0.8663146096941341,
 'ad_group_14': 0.16484237436212743,
 'ad_group_15': 0.53328731560595566,
 'ad_group_16': 0.28020372286269046,
 'ad_group_17': 0.89407098165600041,
 'ad_group_18': 0.6838909025425266,
 'ad_group_19': 0.14627796912041327,
 'ad_group_2': 0.99109489230856662,
 'ad_group_20': 0.48587503579351665,
 'ad_group_21': 0.82726757983195209,
 'ad_group_22': 0.69584633202051149,
 'ad_group_23': 0.74576254413220111,
 'ad_group_24': 0.03185556037758816,
 'ad_group_25': 0.87054063617249178,
 'ad_group_26': 0.69979371725836759,
 'ad_group_27': 0.16080216981889919,
 'ad_group_28': 0.1884405442342153,
 'ad_group_29': 0.32661125117324219,
 'ad_group_3': 0.25556406953994876,
 'ad_group_30': 0.87811681496053484,
 'ad_group_31': 0.44989062463568441,
 'ad_group_32': 0.033788051992211347,
 'ad_group_33': 0.31907931450774302,
 'ad_group_34': 0.93279046185028314,
 'ad_group_35': 0.16306017476970144,
 'ad_group_36': 0.056770704742749124,
 'ad_group_37': 0.076241171443170649,
 'ad_group_38': 0.43233635478720434,
 'ad_group_39': 0.32789438454433295,
 'ad_group_4': 0.29936660104654189,
 'ad_group_40': 0.0061436042068532207,
 'ad_group_5': 0.43435770718401867,
 'ad_group_6': 0.07712859731587636,
 'ad_group_7': 0.12006813490601213,
 'ad_group_8': 0.026230860406837624,
 'ad_group_9': 0.43462574695789546}
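
Putting the slopes and p-values together, here is a minimal sketch of the three-way clustering asked for in question 3 (assumption: a trend is only labelled "up" or "down" when the fitted slope is significant at the 0.05 level; everything else is treated as flat):

going_up, going_down, flat = [], [], []
for ad, slope in slope_df.items():
    if p_value_df[ad] < 0.05 and slope > 0:
        going_up.append(ad)      # ad_group_32, ad_group_40
    elif p_value_df[ad] < 0.05 and slope < 0:
        going_down.append(ad)    # ad_group_8, ad_group_24
    else:
        flat.append(ad)          # no statistically significant trend
print "going up:  ", going_up
print "going down:", going_down
print "flat:      ", len(flat), "ad groups"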


In [40]:
plt.plot(sorted(slope_df.values()))


Out[40]:
[<matplotlib.lines.Line2D at 0x14c542e8>]
