By: Siddharth Srikanth and Vishal Bailoor
We sought to discover the true effect of streaming on TV by looking at changes in TV show ratings and sports programming ratings over time. TV show data captures the comparable set of programming to a Netflix or Amazon Video. Sports, meanwhile, has historically been popular and tied to a live experience. The media has trumpeted the effect of streaming on TV, including sports programming, and we wanted to see what the data has to say.
Introduction: The Death of Television (may have been greatly exaggerated)
The rise of Netflix and Amazon Video have dominated the news for a number of years, alongside concurrent trends like cord-cutting, streaming, and the decline of media and cable television. Logic would suggest the two are connected, that the rise of convenient, centralized streaming would itself harm current media, especially TV. Netflix statistics are used here as a proxy for streaming numbers, as a market-dominant firm.
We sought to answer two key questions surrounding this emerging trend.
Key metrics here include [INSERT KEY METRICS HERE SID]
Key metrics here include ratings numbers, viewership numbers, year-over-year change, and weekly and yearly numbers.
Data
Question 1:
[Fill in Data Methodology]
Question 2:
Sports are a very different phenomenon from the remainder of network TV, which the other portion of the project covers. Sports are consistently higher rated across channels, are less channel-specific, and on a business side have separate media rights contracts to production shows.
This part draws from a number of web sources. The bulk of the data comes from http://www.sportsmediawatch.com/nfl-tv-ratings-viewership-nbc-cbs-fox-espn-nfln-regular-season-playoffs/. This data is on 2014-current ratings and viewership for various football matches. Additional data has been filled in from ESPN.com (which reports viewers/ratings for certain high-ticket games) and http://tvbythenumbers.zap2it.com/tag/nfl-football-ratings/. The latter source was primarily used to fill in unreported games from sportsmediawatch.
SportsMediaWatch("SMW") provides week-by-week data. We then manually downloaded and cleaned the data from 2014-now. 2013 data can be derived from the change numbers in 2014, but we also got comparable raw 2013 data from TV-by-the-numbers. ESPN.com served as a fact-check for blue-chip games.
Operationally, we first downloaded the raw data in CSV form from sportsmediawatch. The CSV link on a public Github file is below. We then cleaned it and organized it in this notebook.
In [1]:
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import datetime as dt
In [2]:
url = "https://raw.githubusercontent.com/vishbail/DB-Final-Project/master/DB%20Final%20Project%20Data%20Raw.csv"
#sdf stands for sports data frame, distinguished from the non sports data frame
sdf = pd.read_csv(url)
In [3]:
#remove all internal header rows - found through unfortunate trial and error
sdf = sdf.drop(sdf.index[80:82])
sdf = sdf.drop(sdf.index[186:188])
sdf = sdf.drop(sdf.index[293:294])
In [4]:
##Part 0: Cleaning and Summarizing Data
# drop unnecessary first row
sdf = sdf.drop(sdf.index[0:1])
#drop empty end columns
sdf = sdf.drop(sdf.columns[8:], axis=1)
#Sets internal column headers into overall headers
sdf.columns = sdf.iloc[0]
sdf = sdf.drop(sdf.index[0])
###failed column change attempts (remove/ignore)
#sdf["Year"][2:3] = 2015
#sdf.iloc[9][2:3] = 2015
#for i in range(2,79):
# sdf.set_value(2, 9, 2015)s
#Rename Unclear columns
sdf.columns.values[3] = "Vwrs. Change"
sdf.columns.values[5] = "Rtg. Change"
#drop rows with no game ratings and convert others to floats
sdf = sdf.dropna(subset = ["Rtg."])
sdf["Rtg."] = sdf["Rtg."].astype(float)
sdf = sdf.dropna(subset = ["Vwrs."])
sdf["Vwrs."] = sdf["Vwrs."].astype(float)
sdf = sdf.dropna(subset = ["Week"])
sdf["Week"] = sdf["Week"].astype(int)
sdf = sdf.set_index(["Year","Week"])
The rise of Netflix and Amazon Video have dominated the news for a number of years, alongside concurrent trends like cord-cutting, streaming, and the decline of media and cable television. We drew from SportsMediaWatch and ESPN data to test our hypothesis:
Given cord-cutting and a turn from television to streaming, consumption of sports on conventional channels will decrease.
This first figure makes a simple line plot of ratings data against time.
In [5]:
fig = plt.figure()
sdf["Rtg."].plot(fontsize=6)
fig.suptitle("NFL Ratings Trend", fontsize=16)
plt.xlabel("Year and Week", fontsize=12)
plt.show()
Surprisingly, this heartbeat-esque plot shows that there is little or no change over time, with the range of starting and ending ratings being within the same 8-15 range. Perhaps gross viewership, a component of ratings (which are at least partially driven by percentage of viewers watching) but a separate statistic, shows a different story?
In [6]:
fig1 = plt.figure()
sdf["Vwrs."].plot(fontsize=6)
fig1.suptitle("NFL Viewership Trend", fontsize=16)
plt.xlabel("Year and Week", fontsize=12)
plt.show()
Aside from an aberrative number in the fall of 2013 (negative viewers would certainly be bad for TV), viewership numbers have at least weakly increased since 2013, with a good portion of 2013 under or around 10 million viewers and almost all of 2014-2015 above it, though 2016 appears to show decline.
Netflix, though has been strong over the same period. A look at (manually summarized) subscriber numbers shows this:
In [7]:
##ndf stands for netflix data frame
#data sourced from https://www.statista.com/statistics/250934/quarterly-number-of-netflix-streaming-subscribers-worldwide/
ndf = pd.DataFrame(dict(subscribers = pd.Series([44.35, 57.39, 74.76, 86.74]), year = pd.Series([2013,2014,2015,2016])))
ndf = ndf.set_index(ndf["year"])
ndf = ndf.drop(ndf.columns[1], axis=1)
ndf
fig2 = plt.figure()
ndf["subscribers"].plot(fontsize=6, kind="bar")
fig.suptitle("Netflix Subscriber Data", fontsize=16)
plt.xlabel("Year", fontsize=12)
plt.show()
Netflix subscribers have almost doubled over the same time period, indicating that though Netflix has grown, it has certainly not come at the expense of sports. Sports ratings and viewership are steady, and small inter-week fluctuations aside, sports programming shows no sign of change in this age of Netflix.
In [8]:
sdf = sdf.reset_index()
sdf = sdf.sort_values(by="Year")
sdf = sdf.sort_values(by="Week")
sdf = sdf.set_index(["Week"])
fig3 = plt.figure()
sdf["Rtg."].plot(fontsize=8, kind="bar")
fig3.suptitle("Weekly NFL Ratings Data", fontsize=16)
plt.xlabel("Week By Week", fontsize=12)
plt.show()
(Apologies for the nasty x-axis labeling, I tried to fix it for over an hour and abjectly failed. Essentially the x axis tracks weeks over time.)
A small digression, we wanted to test if weekly numbers showed a pattern across years, for example if sports rating increased on holidays or when college is in session. As the bar graph above attests, there is little pattern in the weekly data, an especially odd thought given the hype around start-of-season games and end-of-season playoff-deciding games.
Though specific networks such as ESPN may or may not be declining, the data shows that cross-channel trends in sports viewership are relatively stable.
With viewership and ratings holding steady as Netflix rises, and knowing that weekly trends are not masking other patterns, we believe that sports viewership has not been affected by streaming.
In [9]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for pandas
%matplotlib inline
# check versions (overkill, but why not?)
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
#I started by importing all the necessary packages.
So, because of my inability to get Metacritic ratings per episode for TV shows, I chose to stick with one source for my tv ratings: The A/V Club. I was able to find the information courtesy of a post on r/dataisbeautiful with about 50-60 shows worth of information, including critic ratings, community rating, number of people who voted, etc. So before I used the critic data I needed, I decided to play around with the data and just see what I could find by messing around with it.
In [10]:
url = "http://jespajo.neocities.org/clubData.csv"
av = pd.read_csv(url)
avc = av.set_index(['show', 'season', 'epno'])
print(av.columns.values)
avc247 = avc.head(24)
In [11]:
av.show.view
Out[11]:
In [12]:
avc247
print(avc247.dtypes)
The code above describes the following: how I got the data I'd be working with in terms of critical appraisal and began to modify it. The way the data was originally formatted, the index was just simple numbers. So, I changed the index to show, season, and episode number, in that order. So from there, I wanted to see if there was some consistent correlation between critical and community reviews. Was there consistent over or under-scoring?
In [13]:
avcomm = av.groupby(['show', 'season'])['commrating'].mean()
avcrit = av.groupby(['show', 'season'])['critrating'].mean()
In [14]:
avcomm.head(25).plot.barh()
avcrit.head(25).plot.barh(alpha=0.5, color='red')
Out[14]:
In [15]:
avcrit[20:40]
Out[15]:
In [16]:
avdiff = avcomm - avcrit
avdiff.sort_values(ascending=True)
avdiff.dropna().sort_values(ascending=True).head(10)
Out[16]:
In [17]:
avc.T
Out[17]:
In [18]:
avdiff.dropna().sort_values(ascending=True).tail(10)
Out[18]:
In [19]:
avc248 = avc.T[('24', '8')].T
avc248
Out[19]:
In [20]:
e=avc.T[('How I Met Your Mother', '3')].T
Now, this is just the data from the AVClub. I was having a lot of trouble installing IMDB and Metacritic packages on my computer. Once the packages install, I can query the sites and use their data to see if the AVClub is a consistent over- or under-scorer. In addition, with the TV ratings data Vishal has, I can look at the series broadcasting from 2013 onwards and compare viewership trends juxtaposed against quality.
In [21]:
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import datetime as dt # date tools, used to note current date
import sys
# these are new
import requests
print('\nPython version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())
In [22]:
import json
import pandas as pd
In [23]:
import pandas as pd
himymratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse()
Hey Professors, so something odd happens with this following bit of code. The "a" variable basically generates a list of lists, but the ordering of those lists changes every time this notebook is opened for the first time. So, if you run this code and the graphs don't work, check the list ordering in 'a' and refill the values for the date, rating, title and number variables below, for the three shows that use this data. I know it's tedious and honestly bad this problem is occuring in the first place without a solution, but I couldn't find a solution in time, and would just let you know. Sorry in advance.
In [24]:
url = 'https://raw.githubusercontent.com/leosartaj/tvstats/master/data/jsonData/himym.json'
himym = pd.read_json(url)
himym.episodes[1]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[1]])]
himymdate1 = a[0]
himymrating1 = a[1]
himymtitle1 = a[2]
himymnumber1 = a[3]
himyms1 = pd.DataFrame(
{'Episode Number': himymnumber1,
'Episode Title': himymtitle1,
'Rating': himymrating1
})
himym.episodes[2]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[2]])]
himymdate2 = a[0]
himymrating2 = a[1]
himymtitle2 = a[2]
himymnumber2 = a[3]
himyms2 = pd.DataFrame(
{'Episode Number': himymnumber2,
'Episode Title': himymtitle2,
'Rating': himymrating2
})
himym.episodes[3]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[3]])]
himymdate3 = a[0]
himymrating3 = a[1]
himymtitle3 = a[2]
himymnumber3 = a[3]
himyms3 = pd.DataFrame(
{'Episode Number': himymnumber3,
'Episode Title': himymtitle3,
'Rating': himymrating3
})
himym.episodes[4]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[4]])]
himymdate4 = a[0]
himymrating4 = a[1]
himymtitle4 = a[2]
himymnumber4 = a[3]
himyms4 = pd.DataFrame(
{'Episode Number': himymnumber4,
'Episode Title': himymtitle4,
'Rating': himymrating4
})
himym.episodes[5]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[5]])]
himymdate5 = a[0]
himymrating5 = a[1]
himymtitle5 = a[2]
himymnumber5 = a[3]
himyms5 = pd.DataFrame(
{'Episode Number': himymnumber5,
'Episode Title': himymtitle5,
'Rating': himymrating5
})
himym.episodes[6]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[6]])]
himymdate6 = a[0]
himymrating6 = a[1]
himymtitle6 = a[2]
himymnumber6 = a[3]
himyms6 = pd.DataFrame(
{'Episode Number': himymnumber6,
'Episode Title': himymtitle6,
'Rating': himymrating6
})
himym.episodes[7]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[7]])]
himymdate7 = a[0]
himymrating7 = a[1]
himymtitle7 = a[2]
himymnumber7 = a[3]
himyms7 = pd.DataFrame(
{'Episode Number': himymnumber7,
'Episode Title': himymtitle7,
'Rating': himymrating7
})
himym.episodes[8]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[8]])]
himymdate8 = a[0]
himymrating8 = a[1]
himymtitle8 = a[2]
himymnumber8 = a[3]
himyms8 = pd.DataFrame(
{'Episode Number': himymnumber8,
'Episode Title': himymtitle8,
'Rating': himymrating8
})
himym.episodes[9]
a = [list(col) for col in zip(*[d.values() for d in himym.episodes[9]])]
himymdate9 = a[0]
himymrating9 = a[1]
himymtitle9 = a[2]
himymnumber9 = a[3]
himyms9 = pd.DataFrame(
{'Episode Number': himymnumber9,
'Episode Title': himymtitle9,
'Rating': himymrating9
})
a
Out[24]:
In [25]:
himymratings2 = himymratings.set_index(['Season', 'No. in Season'])
d = himymratings2.T[(8)].T
e = avc.T[('How I Met Your Mother', '8')].T
In [26]:
fig, axe = plt.subplots()
e.plot(y='critrating', ax=axe)
himyms8.Rating.convert_objects(convert_numeric = True).plot(ax=axe)
In [112]:
e.critrating - himyms8.Rating.convert_objects(convert_numeric = True)
Out[112]:
In [113]:
himyms1 = himyms1.set_index('Episode Number')
himyms2 = himyms2.set_index('Episode Number')
himyms3 = himyms3.set_index('Episode Number')
himyms4 = himyms4.set_index('Episode Number')
himyms5 = himyms5.set_index('Episode Number')
himyms6 = himyms6.set_index('Episode Number')
himyms7 = himyms7.set_index('Episode Number')
himyms8 = himyms8.set_index('Episode Number')
himyms9 = himyms9.set_index('Episode Number')
In [114]:
frames = [himyms1, himyms2, himyms3, himyms4, himyms5, himyms6, himyms7, himyms8, himyms9]
result = pd.concat(frames, keys=['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'])
result
Out[114]:
In [115]:
resultrating = result.Rating.convert_objects(convert_numeric = True)
resultrating['Season 8']
Out[115]:
In [116]:
url = 'https://raw.githubusercontent.com/leosartaj/tvstats/master/data/jsonData/hannibal.json'
hannibal = pd.read_json(url)
hannibal.episodes[1]
a = [list(col) for col in zip(*[d.values() for d in hannibal.episodes[1]])]
hannibaldate1 = a[0]
hannibalrating1 = a[1]
hannibaltitle1 = a[2]
hannibalnumber1 = a[3]
hannibals1 = pd.DataFrame(
{'Episode Number': hannibalnumber1,
'Episode Title': hannibaltitle1,
'Rating': hannibalrating1
})
hannibal.episodes[2]
a = [list(col) for col in zip(*[d.values() for d in hannibal.episodes[2]])]
hannibaldate2 = a[0]
hannibalrating2 = a[1]
hannibaltitle2 = a[2]
hannibalnumber2 = a[3]
hannibals2 = pd.DataFrame(
{'Episode Number': hannibalnumber2,
'Episode Title': hannibaltitle2,
'Rating': hannibalrating2
})
hannibal.episodes[3]
a = [list(col) for col in zip(*[d.values() for d in hannibal.episodes[3]])]
hannibaldate3 = a[0]
hannibalrating3 = a[1]
hannibaltitle3 = a[2]
hannibalnumber3 = a[3]
hannibals3 = pd.DataFrame(
{'Episode Number': hannibalnumber3,
'Episode Title': hannibaltitle3,
'Rating': hannibalrating3
})
hannibals1 = hannibals1.set_index('Episode Number')
hannibals2 = hannibals2.set_index('Episode Number')
hannibals3 = hannibals3.set_index('Episode Number')
haframes = [hannibals1, hannibals2]
hannibalfinal = pd.concat(haframes, keys=['Season 1', 'Season 2'])
In [286]:
avhannibal = avc.T['Hannibal'].T
Out[286]:
In [118]:
hannibaltvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Hannibal')
hannibaltvratings = hannibaltvratings.set_index('Number Overall')
In [119]:
url = 'https://raw.githubusercontent.com/leosartaj/tvstats/master/data/jsonData/breakingBad.json'
bb = pd.read_json(url)
bb.episodes[1]
a = [list(col) for col in zip(*[d.values() for d in bb.episodes[1]])]
bbdate1 = a[0]
bbrating1 = a[1]
bbtitle1 = a[2]
bbnumber1 = a[3]
bbs1 = pd.DataFrame(
{'Episode Number': bbnumber1,
'Episode Title': bbtitle1,
'Rating': bbrating1
})
bb.episodes[2]
a = [list(col) for col in zip(*[d.values() for d in bb.episodes[2]])]
bbdate2 = a[0]
bbrating2 = a[1]
bbtitle2 = a[2]
bbnumber2 = a[3]
bbs2 = pd.DataFrame(
{'Episode Number': bbnumber2,
'Episode Title': bbtitle2,
'Rating': bbrating2
})
bb.episodes[3]
a = [list(col) for col in zip(*[d.values() for d in bb.episodes[3]])]
bbdate3 = a[2]
bbrating3 = a[0]
bbtitle3 = a[3]
bbnumber3 = a[1]
bbs3 = pd.DataFrame(
{'Episode Number': bbnumber3,
'Episode Title': bbtitle3,
'Rating': bbrating3
})
bb.episodes[4]
a = [list(col) for col in zip(*[d.values() for d in bb.episodes[4]])]
bbdate4 = a[0]
bbrating4 = a[1]
bbtitle4 = a[2]
bbnumber4 = a[3]
bbs4 = pd.DataFrame(
{'Episode Number': bbnumber4,
'Episode Title': bbtitle4,
'Rating': bbrating4
})
bb.episodes[5]
a = [list(col) for col in zip(*[d.values() for d in bb.episodes[5]])]
bbdate5 = a[0]
bbrating5 = a[1]
bbtitle5 = a[2]
bbnumber5 = a[3]
bbs5 = pd.DataFrame(
{'Episode Number': bbnumber5,
'Episode Title': bbtitle5,
'Rating': bbrating5}
)
bbs1
bbs1 = bbs1.set_index('Episode Number')
bbs2
Out[119]:
In [120]:
bbs1
bbs2 = bbs2.set_index('Episode Number')
bbs3 = bbs3.set_index('Episode Number')
bbs4 = bbs4.set_index('Episode Number')
bbs5 = bbs5.set_index('Episode Number')
bbframes = [bbs1, bbs2, bbs3, bbs4, bbs5]
bbfinal = pd.concat(bbframes, keys=['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5'])
bbfinal
Out[120]:
In [121]:
bbtvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Breaking Bad')
bbtvratings
Out[121]:
In [122]:
avbb = avc.T['Breaking Bad'].T
bbtvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Breaking Bad')
bbtvratings = bbtvratings.set_index('Number Overall')
In [123]:
empireavc = avc.T['Empire'].T
empireavc
Out[123]:
In [124]:
empireavc = avc.T['Empire'].T
empireavc
empiretvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Empire')
empiretvratings['IMDb Rating']
Out[124]:
Now that we have all the data, it's time to plot. First, let's see if there's any correlation between critical appraisal and community appraisal.
In [125]:
f = avc.T[('How I Met Your Mother', '3')].T
pet1 = himyms1.reset_index()
afs = f.reset_index()
himyms3diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('How I Met Your Mother', '4')].T
pet = himyms4.reset_index()
afs = f.reset_index()
pet1 = pet.drop(pet.index[[0,1,2, 3]]).reset_index()
himyms4diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('How I Met Your Mother', '5')].T
pet = himyms5.reset_index()
afs = f.reset_index()
pet1 = pet.drop(pet.index[[0,1,2,3]]).reset_index()
himyms5diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('How I Met Your Mother', '6')].T
pet = himyms6.reset_index()
afs = f.reset_index()
pet1 = pet.drop(pet.index[[0,1,2,3]]).reset_index()
himyms6diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('How I Met Your Mother', '7')].T
pet = himyms7.reset_index()
afs = f.reset_index()
pet1 = pet.drop(pet.index[[0,1,2, 23]]).reset_index()
himyms7diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
e = avc.T[('How I Met Your Mother', '8')].T
prt = himyms8.reset_index()
ads = e.reset_index()
prt1 = prt.drop(prt.index[[0,1,2]])
prt1 = prt1.reset_index()
himyms8diff = prt1.Rating.convert_objects(convert_numeric = True) - ads.critrating + 1
himyms8diff.sort_values()
f = avc.T[('How I Met Your Mother', '9')].T
pet = himyms9.reset_index()
afs = f.reset_index()
pet1 = pet.drop(pet.index[[0,1,2]]).reset_index()
himyms9diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
In [126]:
himymframes = [himyms3diff, himyms4diff, himyms5diff, himyms6diff, himyms7diff, himyms8diff, himyms9diff]
himymdifffinal = pd.concat(himymframes, keys=['Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'])
himymdifffinal.dropna().sort_values()
Out[126]:
In [127]:
f = avc.T[('Breaking Bad', '1')].T
pet = bbs1.reset_index()
afs = f.reset_index()
pet1 = pet.reset_index()
bbs1diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('Breaking Bad', '2')].T
pet = bbs2.reset_index()
afs = f.reset_index()
pet1 = pet.reset_index()
bbs2diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('Breaking Bad', '3')].T
pet = bbs3.reset_index()
afs = f.reset_index()
bbs3diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('Breaking Bad', '4')].T
pet = bbs4.reset_index()
afs = f.reset_index()
pet1 = pet.reset_index()
bbs4diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
f = avc.T[('Breaking Bad', '5')].T
pet = bbs5.reset_index()
afs = f.reset_index()
pet1 = pet.reset_index()
bbs5diff = pet1.Rating.convert_objects(convert_numeric = True) - afs.critrating + 1
bbframes = [bbs1diff, bbs2diff, bbs3diff, bbs4diff, bbs5diff]
bbdifffinal = pd.concat(bbframes, keys=['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5'])
bbdifffinal.dropna().sort_values()
Out[127]:
In [128]:
g = avc.T[('Hannibal', '1')].T
pet = hannibals1.reset_index()
ags = g.reset_index()
pet1 = pet.reset_index()
hannibals1diff = pet1.Rating.convert_objects(convert_numeric = True) - ags.critrating + 1
g = avc.T[('Hannibal', '2')].T
pet = hannibals2.reset_index()
ags = g.reset_index()
pet1 = pet.reset_index()
hannibals2diff = pet1.Rating.convert_objects(convert_numeric = True) - ags.critrating + 1
hannibalframes = [hannibals1diff, hannibals2diff]
hannibaldifffinal = pd.concat(hannibalframes, keys=['Season 1', 'Season 2'])
hannibaldifffinal.dropna().sort_values()
Out[128]:
In [129]:
empires1 = avc.T[('Empire', '1')].T
empires2 = avc.T[('Empire', '2')].T
empiretvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Empire')
empiretvratings
es1 = empires1.reset_index().critrating
es2 = empires2.reset_index().critrating
eframe = [es1, es2]
efinal = pd.concat(eframe)
beta = efinal.reset_index().critrating - empiretvratings['IMDb Rating']
beta.sort_values()
Out[129]:
The main point of this data project is to see if there is any immediate correlation between Television Ratings and critical opinion or community opinion. We have our data, so the most efficient way of going about this would be plotting a dot plot and getting some regression analysis of it done via Python.
In [302]:
result = pd.concat(frames, keys=['Season 1', 'Season 2', 'Season 3', 'Season 4', 'Season 5', 'Season 6', 'Season 7', 'Season 8', 'Season 9'])
result = result.drop(['Season 1', 'Season 2',])
result = result.drop([('Season 4', '1'), ('Season 4', '2'), ('Season 4', '3'), ('Season 4', '4'), ('Season 4', '16')])
result = result.drop([('Season 5', '1'), ('Season 5', '2'), ('Season 5', '3'), ('Season 5', '4')])
result = result.drop([('Season 6', '1'), ('Season 6', '2'), ('Season 6', '3'), ('Season 6', '4')])
result = result.drop([('Season 7', '1'), ('Season 7', '2'), ('Season 7', '3')])
result = result.drop([('Season 8', '1'), ('Season 8', '2'), ('Season 8', '3')])
result = result.drop([('Season 9', '1'), ('Season 9', '2'), ('Season 9', '3'), ('Season 9', '24')])
i = avc.T['How I Met Your Mother'].T.critrating - 1
j = result.Rating
k = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'HIMYM for AVClub').set_index(['Season', 'No. in Season']).Rating
fig, axe = plt.subplots()
k.to_frame().plot(ax=axe)
j.convert_objects(convert_numeric=True).plot(ax=axe)
axe.set_title('IMDb User Score against Viewership')
axe.set_xlabel('Season & Episode Number')
axe.legend(['Viewership in Millions', 'Avg. IMDb User Rating'])
Out[302]:
In [308]:
fig, axe = plt.subplots()
i.to_frame().plot(ax=axe, alpha = 0.5)
j.convert_objects(convert_numeric=True).plot(ax=axe)
axe.set_ylim(0, 10)
axe.set_xlabel('Season and Episode Number')
axe.set_title('IMDb v. AV Club Scores per Episode')
axe.legend(['AV Club Score', 'IMDb Score'], loc = 4)
Out[308]:
In [132]:
i2 = i.reset_index().critrating.convert_objects(convert_numeric=True)
j2 = j.reset_index().Rating.convert_objects(convert_numeric = True)
k2 = k.reset_index().Rating
plt.scatter(k2, j2)
Out[132]:
In [336]:
plt.scatter(k2, i2)
Out[336]:
In [337]:
plt.scatter(i2, j2)
Out[337]:
In [133]:
k2.describe()
Out[133]:
In [134]:
i = 0
delta = []
for h in j2:
while i < 140:
d = j2[i+1] - j2[i]
delta.append(d)
i = i + 1
sum(delta)/len(delta)
Out[134]:
So, what do these graphs show us? The graph that plots IMDb score against AV Club score shows us that the volatility of the AV Club's score is much higher than that of the IMDb score. Let's see if we can prove that with some numbers. The mean of the AV Club scores is 7.9, while the average IMDb score was 8.04. A 0.1 point difference exists, but I think the more interesting value to find would be the delta of successive episodes, i.e. how the rating changed for each successive episode, as well as a breakdown of the standard deviation of the show as a whole. So, beginning with the IMDb user score, we see that, for the series as a whole, the mean is 8.04 with a standard deviation of 0.61. 68% of episodes were rated between 7.43 and 8.65, showing the show as a whole was considered good, not great. What's more, the average change in "quality" for the series as a whole was -0.01, implying the show was more consistent in its quality.
What can we ascertain from this? Well, that HIMYM was a reliable show in its genre. The critical volatility contrasted with the less volatile community scores shows that while HIMYM had bad episodes, it was still an overall "reliable" show when it came to comedy. It wasn't offensively bad but it wasn't consistently phenomenal either. It, if anything, looks like the graph for any long-running live-action comedy show: "Friends", "Cheers", etc. "How I Met Your Mother" exists as our baseline, as our example for a show that had a solid fan following backed up by a decent critical opinion.
In [287]:
himdb = hannibalfinal
havc = avc.T['Hannibal'].T
hannibaltvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Hannibal')
htvr = hannibaltvratings.set_index(['Season', 'No. in Season'])
n
Out[287]:
In [292]:
c = himdb.Rating
s = havc.critrating - 1
n = htvr.Rating
c2 = c.reset_index()
c3 = c2.Rating.convert_objects(convert_numeric=True)
n2 = n.drop(3).reset_index().Rating
plt.scatter(n2, c3)
Out[292]:
In [282]:
n3 = n.reset_index().drop(6).Rating
plt.scatter(n3, s)
Out[282]:
In [335]:
s4 = s.drop('3').reset_index().critrating
c4 = c3.drop(25)
plt.scatter(s4, c4)
Out[335]:
In [283]:
c = c.convert_objects(convert_numeric = True)
fig, axr = plt.subplots()
n.plot(ax=axr)
s.plot(ax=axr)
c.plot(ax=axr, color='red')
axr.set_ylim(0, 11)
axr.set_xlabel('Season & Episode Number')
axr.legend(['Viewership in Millions', 'AV Club Rating', 'IMDb User Rating'], loc = 5)
axr.set_title('Critical Appraisal v. Viewership')
Out[283]:
In [284]:
c = c.convert_objects(convert_numeric = True)
fig, axr = plt.subplots()
c.plot(ax=axr, color='red')
s.plot(ax=axr, color='green')
axr.set_ylim(0, 10)
axr.set_xlabel('Season & Episode Number')
axr.legend(['IMDB User Rating', 'AV Club Rating'], loc = 5)
axr.set_title('Critical v. Communal Appraisal')
Out[284]:
In [164]:
i = 0
delta = []
for h in c:
while i < 24:
d = c[i+1] - c[i]
delta.append(d)
i = i + 1
c.mean()
sum(delta)/len(delta)
delta = []
i = 0
for f in s:
while i < 24:
d = s[i+1] - s[i]
delta.append(d)
i = i + 1
s.convert_objects(convert_numeric=True).mean()
sum(delta)/len(delta)
c.std()
Out[164]:
Average user rating for Hannibal: 8.79. Average successive change: 0.03, Standard deviation for series as a whole, 0.38.
Average crit rating for Hannibal: 8.52. Average successive change: 0.04, Std: 0.38
Avg. viewers: 2.67 mil, with standard dev. of .647.
So, with "Hannibal", it's kind of clear to see viewership dived with Season 3, as the graph below shows. Each season had worse and worse viewership, despite the high review scores of the show and the evaluations of its fanbase. If anything, it appears "Hannibal" is an example of a show that was too niche. I think that, its audience initially bought into the case-of-the-week style show of Season 1, but as the show grew more and more serialized (e.g. episodes became more and more connected to one another, making it hard for a newcomer to "get into" the show), viewership dropped. There were very likely more factors that led to the show's demise, such as the emphasis on horror and grotesque imagery or the lack of big stars, but I think that by changing each season, from a set of cases wrapped in an overarching story to just one big overarching story, it left new fans unable to join the viewerbase while testing caught-up fans with its unique pacing.
In [298]:
fig, axe = plt.subplots(nrows=3, ncols=1, sharey = True)
n[1].plot(ax=axe[0])
n[2].plot(ax=axe[1])
n[3].plot(ax=axe[2])
Out[298]:
In [141]:
bbac = avc.T['Breaking Bad'].T.critrating - 1
bbtvr = bbtvratings['Ratings (in Millions)']
bbimdb = bbfinal.Rating
fig, axt = plt.subplots()
bbac.plot(ax=axt)
bbimdb.convert_objects(convert_numeric = True).plot(ax=axt, color='red')
axt.set_ylim(0, 10)
axt.set_xlabel('Season & Episode Number')
axt.legend(['AV Club Rating', 'Avg. IMDb User Rating'], loc = 3)
axt.set_title('Critical v. Communal Consensus')
Out[141]:
In [142]:
fig, axt = plt.subplots()
bbac.plot(ax=axt)
bbtvr.plot(ax=axt)
axt.set_ylim(0, 11)
axt.set_title("AV Club Rating against Viewership")
axt.set_xlabel('Season & Episode Number')
Out[142]:
In [143]:
fig, axt = plt.subplots()
bbimdb.convert_objects(convert_numeric = True).plot(ax=axt, color='red')
bbtvr.plot(ax=axt)
axt.set_title('IMDb User Rating against Viewership')
axt.legend(['Viewership in Millions', 'Avg. IMDb User Score'], loc = 10)
axt.set_xticklabels('')
axt.set_xlabel('Episode Number Aired')
axt.set_ylim(0, 12)
Out[143]:
In [311]:
bbisc = bbimdb.reset_index().Rating.convert_objects(convert_numeric = True)
babcs = bbac.reset_index().critrating
plt.scatter(bbtvr, babcs)
Out[311]:
From this we can see, the second half of Season 5 completely skews the scatter plot of the ratings against critical appraisal. So, if we omit the last half of the season, and look at it...
In [312]:
bbacs = babcs.drop([54, 55, 56, 57, 58, 59, 60, 61])
bbtvr4 = bbtvr.drop([55, 56, 57, 58, 59, 60, 61, 62])
bbimsc2 = bbimdb.reset_index().Rating.convert_objects(convert_numeric = True).drop([54, 55, 56, 57, 58, 59, 60, 61])
plt.scatter(bbtvr4, bbimsc2)
Out[312]:
We see a scatter plot that appears to have some semblance of a correlation. Although any line of best fit would have a small R^2 value, it is still possible to see some correlation existed between viewership and quality.
In [317]:
plt.scatter(bbimdb.reset_index().Rating.convert_objects(convert_numeric = True), babcs)
Out[317]:
In [146]:
fig, axe = plt.subplots()
bbtvr4.plot(ax=axe)
bbimsc2.plot(ax=axe, color='red')
axe.legend(['Viewership (in Millions)', 'Avg. IMDb User Score'], loc=5)
axe.set_title('Viewership against Community Reviews')
Out[146]:
What we see is in fact a more consistent viewerbase. The peaks and troughs of the average user's opinions on the show are much closer than those of other shows, so it can be hyothesized the consistency in quality also led to a consistency in viewership.
In [163]:
bbimsc2
i = 0
delta = []
for h in bbimsc2:
while i < 53:
d = bbimsc2[i+1] - bbimsc2[i]
delta.append(d)
i = i + 1
delta
bbimsc2.mean()
sum(delta)/len(delta)
bbacs.mean()
delta = []
i = 0
for h in bbacs:
while i < 53:
d = bbacs[i+1] - bbacs[i]
delta.append(d)
i = i + 1
bbtvr.describe()
Out[163]:
In [ ]:
IMDB: mean = 8.34, std = 0.61,
AV Club: mean 9.14, std = 0.96
From this what we see is, the largest dip in quality from successive episode is -1. This means that, even before the second half started airing, the average episode quality was 8.14 with an average "consistency", or average change in quality, of 0.01. Critically, the average rating per episode was a 9.07 with an average drop between episodes of 0.02, with the largest successive drop being 3 points. With this and the information in the cell above, it's apparent that "Breaking Bad" was an example of a show that maintained a high bar of quality, despite a small initial viewership. As the AVClub graph against ratings with S5 Pt. 2 included shows, the show had a conistent viewerbase that followed the show for seasons 1 - season 5 pt. 1. Suddenly, when S5 Pt. 2 premieres, the viewership spikes.
My hypothesis for this is as follows: between Season 5 Part 1 and Season 5 Part 2, AMC scheduled a year-long break i.e. the finale of Pt. 1 aired in September 2012 and the premiere of Part 2 aired in August 2013. Because of this unexpected hiatus, AMC enabled fans of the show to spread the word about how good "Breaking Bad" was and gave new fans the time to catchup to the show. By letting the show' massive critical acclaim and word-of-mouth buzz converge, the show gained more and more fans.
This is also why AMC did the same trick with Mad Men's final season, although in that case, it didn't workout as well for them as the premiere of Season 7 Pt. 1 and the premiere of Season 7 Pt. 2 attracted the same size audience. So, there was some x-factor about "Breaking Bad" that enabled the surge in fanbase between parts, and I believe it has something to do with "watercooler" moments. If I had Twitter data, I believe if you were to analyze the gaps between parts for "Mad Men" and "Breaking Bad", you'd find "Breaking Bad" would have a higher volume of hashtag usage than "Mad Men". I'd attribute it to the fact "Breaking Bad" is a show about definite action (e.g. guns being shot, lives being taken, etc), meaning it has moments that can hook new-comers easily (like "Did you see what Jesse did to Gale?" or "OMG that finale with the plane!"), while "Mad Men" is more about character and personal breakthrough, giving fans less moments to really talk about to convince people to watch the show.
In [166]:
empireavc = avc.T['Empire'].T
empireavc
empiretvratings = pd.ExcelFile("C:/Users/Sidd/Desktop/Data_Bootcamp/TV Ratings.xlsx").parse(sheetname = 'Empire')
eavcsc = empireavc.critrating.drop(['3']) - 1
etvrsc = empiretvratings['Viewers (in millions)']
eimdbsc = empiretvratings['IMDb Rating']
eavsc1 = eavcsc.reset_index().critrating
fig, axy = plt.subplots()
eavsc1.plot(ax=axy, color='green')
eimdbsc.plot(ax=axy, color='red')
axy.set_title('Average IMDb User Score against Critical Score')
axy.set_xlabel('Total Episode Number')
axy.legend(['AV Club Score', 'Avg. IMDb User Score'], loc=3)
Out[166]:
In [167]:
fig, axy = plt.subplots()
etvrsc.plot(ax=axy)
eimdbsc.plot(ax=axy, color='red')
axy.set_title('Average IMDb User Rating against Viewership')
axy.set_xlabel('Total Episode Number')
axy.legend(['Viewership (in Millions)', 'Avg. IMDb User Score'])
Out[167]:
In [168]:
fig, axy = plt.subplots()
axy.set_title('Critical score against Viewership')
axy.set_xlabel('Total Episode Number')
axy.set_xlim(0, 12)
eavsc1.plot(ax=axy)
eavsc1.columns = ['AV Club Score']
etvrsc.plot(ax=axy)
axy.legend(['Critical Rating', 'Viewership (in Millions)'])
Out[168]:
In [309]:
plt.scatter(eavsc1, eimdbsc)
Out[309]:
In [207]:
i = 0
delta = []
for h in etvrsc:
while i < 25:
d = etvrsc[i+1] - etvrsc[i]
delta.append(d)
i = i + 1
delta
sum(delta[13:30])/len(delta[13:30])
Out[207]:
Critics: mean was 5.7, std was 2.40, change during season 1 was -0.08, avg. change during season 2 was -0.08.
Communit: mean was 8.4, std was 0.33, change during season 1 was 0.08, avg. change during season 2 was 0.
Ratings: Season 1's delta was 0.3, season 2's was -0.598
So, from this, we can see that critics maintained a much harsher view of "Empire" than its fanbase did. Critics viewed the show as more volatile in quality. For seasons, even though through the average change per successive episode was small, what the graph shows is that the actual change per successive episode varies, with some episodes being huge drops in quality and other huge gainers.
For the community, the graph backs up the numerical findings: namely that it was rated more consistently, hence showing it was viewed as a more quality-consistent show. Surprisingly enough though, even though the quality was constant throughout both seasons, viewership actively declined during the second season despite community members rating episodes on the same level as Season 1. Maybe this was because of the more volatile critics reviews which alternate between B grades and D grades, week in and week out.
In [213]:
fig, axe = plt.subplots()
etvrsc.plot(ax=axe)
bbtvr.plot(ax=axe)
n.plot(ax=axe)
k.plot(ax=axe)
axe.legend(['Empire', 'Breaking Bad', 'Hannibal', 'How I Met Your Mother'])
axe.set_xticklabels('')
axe.set_xlabel('Episode Number in Series')
axe.set_ylabel('Viewers in Millions')
axe.set_title('Viewership Visualization')
Out[213]:
The reason I made the graph about is just for you, the reader, to have context on the relative ratings of each of these series. As expected, the two shows on the networks with the lowest barrier to access have the largest ratings. "Empire" and "How I Met Your Mother" were also shows that belonged to very traditional television genres: the soap opera and the sitcom. It's widely acknowledged that the more "genre" your show is, the less audience it typically attracts e.g. a broad comedy like "How I Met Your Mother" will attract more viewers than a artistic horror-themed show like "Hannibal." In addition, Season 1 of "Empire" is the only instance of a season on this chart where viewership every episode increases. Let's plot these seperately.
In [234]:
fig, axe = plt.subplots(nrows=3, ncols=1)
bbtvr[54:63].plot(ax=axe[0])
etvrsc[0:12].plot(ax=axe[1])
k[0:24].plot(ax=axe[2])
Out[234]:
It is taken as gospel that the highest rating a TV show will get is during finales. And that reason generally remains true here. The anomaly, as you can see, remains in "HIMYM" Season 3. The reason for that episode in particular having such high viewership is likely because of the presence of Britney Spears, during her infamous 2007-08 period when she had her public meltdown. So, with the exception of notable event episodes (i.e. special guest stars), the statement has been proven true.
With that in mind, the topmost graph, which is for "Breaking Bad" Season 5 Part 2, shows a dip throughout the season, same with "HIMYM" Season 3. And if I were to plot the rest of the seasons for the TV shows I'm examining, the presence of dips is almost guaranteed. In the past 5 years, only one show has had a season where viewership increased episode-to-episode, and that was "Empire" Season 1. If I had time, I'd have loved to examine the Twitter data to see the volume of hashtags associated with the show during the 2014-15 television season, to see if there was indeed a correlation between so-called "watercooler" moments and this show. No correlation could be found regarding critical opinion or community opinion and the show's ratings, so I'm left to hypothesize that the reason "Empire" bucked all those trends was because it was a soap opera that played predominantly to a segment of the population left ignored by traditional TV shows and that it had enough gasp-worthy moments to basically have people covert new watchers to the show by just talking about it.
So, I began this project in the hopes, I'd find my beliefs validated: that there does exist some substantive correlation between viewership and critical data. In the end, it ended up being false. There exists no correlation between viewership and critical or popular opinion, neither does there really exist a correlation between critical and popular opinion. Instead, the elimination of these as reasons for popularity gives me insight into what I could do to build on this idea. Reviews can impact a show's viewing audience by convincing their reader to give a show a chance. But, critics are impersonal to us: we don't trust their beliefs because of who they are, rather it is because we trust the publication letting this person act as their offical opinion-giver. But, because of that impersonality, we don't really value their opinion all to highly compared to, say, a friend. Thus, what is the impact if a friend or a circle of friends give a positive opinion when compared to critics? If your friends like a show, say "Westworld", and talk about how much they like, does it make you more prone to watching the show compared to if a New York Times Arts reviewer says you should see it? That'd be an interesting experiment.
In [ ]: