In this project, I'm going to look at TV ad data on the date of 1/31/16. Ads are tracked across a number of cities across the US. The idea is to examine WHEN different industries and brands advertise, and also how that might differ in various cities.
I originally wanted to combine this with social engagement data, but I wasn't able to pull that data in time.
In [4]:
#4/27/16
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
import seaborn as sns
import random
import string
from python_vlookup import python_vlookup
# these lines make our graphics show up in the notebook
%matplotlib inline
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Seaborn version: ', sns.__version__)
print('Today: ', dt.date.today())
import datetime
In [5]:
#Let's read in the data we'll be using
#This is proprietary TV ad data that has been tracked at the company i intern with
df = pd.read_csv("DailyData_2016_01_31.csv")
In [6]:
#let's see what the data looks like.
df.dtypes
Out[6]:
In [7]:
#let's see what the size is
df.shape
Out[7]:
In [8]:
#Holy cow that's a big dataset. Here's a preview of the first few rows
df.head(10)
Out[8]:
In [9]:
#Need to split some columsn into things we can use!
df['Time_sec'] = df['Asset: Length'].str.split(':').apply(lambda x: int(x[1]) * 60 + int(x[2]))
df['Time_sec_detect'] = df['Actual detection length'].str.split(':').apply(lambda x: int(x[1]) * 60 + int(x[2]))
# This is how you split the time column = df['Asset: Length'].str.split(':')
#This is how you convert time col to seconds = df['Asset: Length'].str.split(':').apply(lambda x: int(x[1]) * 60 + int(x[2]))
df.head()
Out[9]:
In [10]:
#Let's see if Pivot Tables help us at all
#Below, we see total seconds of Ads per Brand per Program, and count of Ads
pd.pivot_table(df,index=['EPG Program name', 'Brand'], values = ['Asset: ResourceID', 'Time_sec'], aggfunc=[np.sum,len])
Out[10]:
In [11]:
#Now let's add columns to get time in minutes
df['Time_min'] = df['Time_sec']/60
df['Time_min_detect'] = df['Time_sec_detect']/60
df.head()
Out[11]:
In [12]:
cat_totals = df[["Category","Time_min_detect"]].groupby('Category').sum().sort_values('Time_min_detect')
cat_totals
# This is all categories advertised on this day, sorted by minutes detected
Out[12]:
In [13]:
#How many channels to we track?
df[['Channel: ResourceID']].groupby('Channel: ResourceID').count()
Out[13]:
In [14]:
#Let's get a visual idea of what the popular ad lengths are using KDE plot
fig, ax = plt.subplots()
sns.kdeplot(df["Time_min"], ax=ax)
ax.hist(df["Time_min"], bins = 2, alpha=0.25, normed=True, label="tip")
ax.legend()
fig.suptitle("How long are ads, typically? (in Seaborn)")
plt.show()
In [15]:
#Let's experiment by grouping by Category, Brand, and Product. Maybe it'll be useful later.
df[['Brand','Product','Category','Time_min_detect']].groupby(['Category','Brand','Product']).sum()
Out[15]:
In [16]:
#Let's see which Auto brands advertise the most - at least for this day
autos1 = df[df['Category'] == 'Automotive / Vehicles'].groupby(['Brand']).sum()
autos1.sort_values(['Time_min_detect'],ascending=False)
Out[16]:
In [17]:
#Who are the top auto brands?
TopAutoBrands = autos1.sort_values(['Time_min'],ascending=False).head(20)
TopAutoBrands
Out[17]:
In [18]:
#Let's see how the top auto brands look in a chart!
plt.style.use("fivethirtyeight")
TopAutoBrands.plot(y='Time_min',kind='bar',stacked=True,title="Top 20 Advertising Auto Brands", figsize=(15,5) )
Out[18]:
In [19]:
#Now let's see which auto PRODUCTS advertise the most
autos2 = df[df['Category'] == 'Automotive / Vehicles'].groupby(['Brand','Product']).sum()
TopAutoProducts = autos2.sort_values(['Time_min'],ascending=False).head(20)
plt.style.use("fivethirtyeight")
TopAutoProducts.plot(y='Time_min',kind='bar',stacked=True,title="Top 20 Advertising Auto Products", figsize=(15,5) )
Out[19]:
In [20]:
#Let's figure out top Retail and Restaurant Brands and compare to Auto
Ret1 = df[df['Category'] == 'Retail'].groupby(['Brand']).sum()
Ret1.sort_values(['Time_min'],ascending=False)
TopRetBrands = Ret1.sort_values(['Time_min'],ascending=False).head(20)
Resto1 = df[df['Category'] == 'Restaurants'].groupby(['Brand']).sum()
Resto1.sort_values(['Time_min'],ascending=False)
TopRestoBrands = Resto1.sort_values(['Time_min'],ascending=False).head(20)
plt.style.use("fivethirtyeight")
TopRetBrands.plot.bar(y='Time_min', color = "Red", stacked=True,title="Top 20 Advertising Retail Brands", figsize=(15,5) )
TopRestoBrands.plot.bar(y='Time_min', color = "Blue", stacked=True,title="Top 20 Advertising Restaurant Brands", figsize=(15,5) )
Out[20]:
In [21]:
#Let's split out the Local Detection Start column to compare brands/products at what TIME OF DAY they start
df['Date'] = df['Local detection start'].apply(lambda x: x.split(' ')[0])
df['TimeOfDay'] = df['Local detection start'].apply(lambda x: x.split(' ')[1])
df['Hour'] = df['TimeOfDay'].apply(lambda x: x.split(':')[0])
df.head()
Out[21]:
In [22]:
# Let's focus in on one product, JUST the Hyundai Genesis
HyundaiGen = df[df['Product'] == 'Hyundai - Genesis'].sort_values(['TimeOfDay'],ascending=True)
HyundaiGen.head()
Out[22]:
In [23]:
#drop what you don't need
HyundaiGen.drop('Asset: ResourceID', axis=1, inplace=True)
HyundaiGen.drop('Category', axis=1, inplace=True)
HyundaiGen.drop('Asset: Type', axis=1, inplace=True)
HyundaiGen.drop('Video', axis=1, inplace=True)
HyundaiGen.drop('Market: ResourceID', axis=1, inplace=True)
HyundaiGen.drop('TMS ID', axis=1, inplace=True)
In [24]:
#Always make sure you've got the right format...
HyundaiGen['Hour'] = HyundaiGen['Hour'].apply(lambda x: int(x))
In [25]:
HyundaiGen.dtypes
Out[25]:
In [36]:
#Here we see the # Hyundai commercials in each hour timeslot
HyundaiGenByHr = HyundaiGen[["Hour","Time_min"]].sort_values(by="Hour").groupby('Hour',as_index=False).count()
In [37]:
HyundaiGenByHr
Out[37]:
In [28]:
HyundaiGenByHr.dtypes
Out[28]:
In [29]:
#Finally! Now we can look at how this product's ads are spread across the daytime.
#Keep in mind i've used local detection time. That means, 5pm in NYC and 5PM in Chicago would show up as the same.
plt.style.use("fivethirtyeight")
Plot1 = HyundaiGenByHr.plot(x='Hour',y='Time_min',kind='bar',stacked=True,title="Hyundai Genesis Ad Minutes per Hr",figsize=(15,5))
Plot1.set_xlabel("Hour")
Plot1.set_ylabel("# of Ads")
Out[29]:
In [30]:
#Very interesting, lots of late-night ads and not so much in prime time.
#Let's look at that with another view to see how ads are dispersed across the day time
HyundaiGenByHr.plot.area(x='Hour',y='Time_min',stacked=True,title="Hyundai Genesis Ad Minutes per Hr",figsize=(15,5))
Out[30]:
In [31]:
#Now let's bring in another data set - mapping to GEOGRAPHY!
BroadcastId = pd.read_csv("4CBroadcastID.csv")
#Merging is useful!
df3 = pd.merge(df,BroadcastId,left_on='Channel: ResourceID', right_on='Resource.ID', how = 'left')
In [32]:
#Remove unnecessary columns
df3.drop('Channel.Name', axis=1, inplace=True)
df3.drop('Affiliation', axis=1, inplace=True)
df3.drop('mkt_id', axis=1, inplace=True)
df3.drop('DMA.Rank', axis=1, inplace=True)
In [33]:
#Let's quickly see how many Markets we're tracking
df3[['Market.Name']].groupby('Market.Name').count()
Out[33]:
In [34]:
#Now, let's see who's advertising in NYC
NYCcat = df3[(df3['Market.Name'] == 'New York')].groupby(['Brand']).sum()
NYCcat.sort_values(['Time_min'],ascending=False)
NYCtopcat = NYCcat.sort_values(['Time_min'],ascending=False).head(20)
NYCtopcat.plot.bar(y='Time_min', color ="g", stacked=True,title="Top 20 Brands advertising, NYC", figsize=(15,5) )
Out[34]:
In [35]:
#Now let's see look at Cars again - and compare NYC to the rest of the US
NYCcat = df3[(df3['Category'] == 'Automotive / Vehicles')].groupby(['Brand']).sum()
NYCcat.sort_values(['Time_min'],ascending=False)
NYCtopcat = NYCcat.sort_values(['Time_min'],ascending=False).head(20)
NYCtopcat.plot.bar(y='Time_min', color ="y", stacked=True,title="Top 20 Ad Categories, NYC", figsize=(15,5) )
TopAutoBrands.plot(y='Time_min',kind='bar',stacked=True,title="Top 20 Advertising Auto Brands", figsize=(15,5) )
Out[35]:
So we got to see how various categories, brands, and products are advertising across a span of a day. It's really quite interesting, but i lacked time to go deeper.
All in all, learned a ton in this project. Merging was especially tricky, mainly due to data format problems. I thoroughly enjoyed this project and it forced me to learn so much.
Thank you!
In [ ]: