During June 2015 NZRS logged into Netflix from within New Zealand and the USA and observed the content that was available. From each page, the titles offered by the service in that geography were extracted and stored.
Each title was compared against the OMDb API (http://www.omdbapi.com/), an alternative interface to the Internet Movie Database (IMDb) data. This allowed each title to be matched against data held within IMDb and the title data to be augmented.
This included the following:
The data was compiled and the serialised outputs were saved, using the Python pickle module, for further analysis. The data can be found here. [[[link to pickles]]]
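As a minimal sketch of that serialisation step, the snippet below pickles a small dictionary and loads it back. The `sample` dictionary and the temporary path are illustrative assumptions; only the file name `all_movies_dict.p` comes from the notebook.

```python
import os
import pickle
import tempfile

# Hypothetical miniature of a per-region dictionary: title -> OMDb response.
sample = {
    'Example Title': {'Response': 'True', 'imdbRating': '7.1'},
    'Unknown Title': {'Response': 'False'},
}

# Write to a throwaway directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'all_movies_dict.p')

# Pickle files should be opened in binary mode for writing and reading.
with open(path, 'wb') as handle:
    pickle.dump(sample, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.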
A Python module was created to help with analysis and is available on GitHub (https://github.com/NZRS/content-analysis/blob/master/content_stats.py).
The analysis focussed on titles, so it does not at present identify whether a title is a movie or a series. This may be ascertainable from the OMDb 'Type' field, though that still gives neither the total number of episodes nor how many of them are on Netflix. Qualitatively, Netflix NZ is missing the latest series of Doctor Who as well as series from the 'classic Who'.
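A rough sketch of how the 'Type' field could be used to separate movies from series is shown below. The function `count_by_type` and the sample records are hypothetical; the sketch assumes each stored record is the OMDb response dictionary and that unmatched titles carry no 'Type' key.

```python
from collections import Counter


def count_by_type(region_data):
    """Count titles by OMDb 'Type'; records without one are grouped as 'unknown'."""
    return Counter(record.get('Type', 'unknown') for record in region_data.values())


# Hypothetical records standing in for nz_data or us_data.
sample = {
    'Film A': {'Response': 'True', 'Type': 'movie'},
    'Film B': {'Response': 'True', 'Type': 'movie'},
    'Show C': {'Response': 'True', 'Type': 'series'},
    'Mystery D': {'Response': 'False'},
}

type_counts = count_by_type(sample)
```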
In [1]:
%matplotlib inline
import pickle
import plotly.plotly as py
from plotly.graph_objs import *
# module from NZRS
import content_stats
from IPython.core.display import Image
from urllib2 import quote
from IPython.display import display
import plotly.tools as tls
from IPython.display import HTML
from collections import Counter
In [2]:
# load previously pickled dictionaries
nz_data = pickle.load(open('nz/all_movies_dict.p', 'rb'))
us_data = pickle.load(open('us/all_movies_dict.p', 'rb'))
For every chart we present, the data we use is available for exploration and reuse. There is a 'Play with this data' link at the bottom right-hand side of each chart.
If you are running these IPython notebooks yourself, please note that we are embedding the graphs rather than creating them. You can uncomment the creation code and comment out the embed code if you are using the notebooks interactively.
The simplest test we can carry out is to look at library size. We are looking at the count of titles, not the count of discrete episodes or total viewing time. It is a useful first test for beginning to understand the libraries.
In [3]:
total_titles_nz = len(nz_data)
total_titles_us = len(us_data)
data = (
[Bar( x = ['NZ', 'USA'],
y = [total_titles_nz, total_titles_us],
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)]
)
layout = Layout(
title ='Netflix Library Comparison- USA vs NZ - June 2015',
yaxis = YAxis(title = 'Count of Titles'),
xaxis = XAxis(title = 'Geographic Service'),
)
fig = Figure(data=data, layout=layout)
# Run this to generate the plot.ly plot for yourself, once created we will embed it
# py.iplot(fig, filename = 'Netflix-Library-Comparison-June-2015')
# Run this to embed the plot after creation, this is necessary for rendering in Github in particular.
HTML('''<div>
<a href="https://plot.ly/~gotofftherails/97/" title="Netflix Library Comparison- USA vs NZ - June 2015" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/97.png" alt="Netflix Library Comparison- USA vs NZ - June 2015" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:97" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/97/">Link to interactive chart and data</a>
</div>
'''
)
Out[3]:
In [4]:
# Number of titles in common
common = len(content_stats.Compare_regions(us_data, nz_data).common_titles())
print 'Titles in common between USA and NZ:', common
# Number of titles unique to us
unique_us = len(content_stats.Compare_regions(us_data, nz_data).unique_to_first())
print 'Titles Unique to the USA :', unique_us
# Number of titles unique to nz
unique_nz = len(content_stats.Compare_regions(nz_data, us_data).unique_to_first())
print 'Titles Unique to the NZ :', unique_nz
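The comparison above boils down to set operations on the title keys of the two dictionaries. The sketch below shows one way this could work; `compare_titles` is a hypothetical helper written for illustration and is not claimed to be how `content_stats.Compare_regions` is implemented.

```python
def compare_titles(first, second):
    """Split the titles of two region dictionaries into common and unique sets."""
    first_titles, second_titles = set(first), set(second)
    return {
        'common': first_titles & second_titles,
        'unique_to_first': first_titles - second_titles,
        'unique_to_second': second_titles - first_titles,
    }


# Hypothetical miniature catalogues keyed by title.
us_sample = {'Title A': {}, 'Title B': {}, 'Title C': {}}
nz_sample = {'Title B': {}, 'Title D': {}}

result = compare_titles(us_sample, nz_sample)
```

The three sets are disjoint, so their sizes can be stacked in a single bar without double counting.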
We can represent this graphically again.
In [5]:
trace1 = Bar(
y=['Service'],
x=[unique_nz],
name='Unique to NZ',
orientation = 'h',
marker = Marker(
color = 'rgba(255,127,14,1.0)' )
)
trace2 = Bar(
y=['Service'],
x=[common],
name='Common',
orientation = 'h',
marker = Marker(
color = 'rgba(44,160,44,1.0)' )
)
trace3 = Bar(
y = ['Service'],
x = [unique_us],
name = 'Unique to USA',
orientation = 'h',
marker = Marker(
color = 'rgba(39,119,180,1.0)' )
)
data = Data([trace1, trace2, trace3])
layout = Layout(
barmode='stack',
title = 'Content Unique to and Common Between US and NZ Netflix Services - June 2015',
yaxis = YAxis(title = ''),
xaxis = XAxis(title = 'Count of Titles')
)
fig = Figure(data=data, layout=layout)
# used to create
#py.iplot(fig, filename = 'Netflix-Library-Comparison-June-2015-Uniqueness of Content')
# used to embed
HTML('''<div>
<a href="https://plot.ly/~gotofftherails/127/" target="_blank" title="Unique to NZ, Common Between NZ and USA, Unique to USA" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/127.png" alt="Unique to NZ, Common Between NZ and USA, Unique to USA" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:127" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/127/">Link to interactive chart and data</a>
</div>
''')
Out[5]:
Quality is by definition qualitative. With enough measures of quality, though, we can hopefully arrive at something quantitative, in the same way that we can say a five-star hotel is normally going to be better than a one-star hotel. We know this is not always the case, and we do have the Napoleon Dynamite effect, where both the hate and the love for a title can be strong.
To assess quality we looked towards IMDb, who make some of their data available via alternative interfaces, though not in a structured API. Luckily OMDb offers a RESTful API that allows querying by title or ID, and returns XML or JSON.
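As an illustration of such a query, the sketch below builds a request URL without sending it. The helper `omdb_url` is hypothetical. The parameter names `t` (title), `i` (IMDb ID), `r` (response format) and `apikey` follow OMDb's public documentation as far as we know, and `apikey` is left optional because the notebook does not mention one.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

OMDB_ENDPOINT = 'http://www.omdbapi.com/'


def omdb_url(title=None, imdb_id=None, fmt='json', api_key=None):
    """Build an OMDb query URL for exactly one of a title or an IMDb ID."""
    if (title is None) == (imdb_id is None):
        raise ValueError('pass exactly one of title or imdb_id')
    params = [('t', title)] if title is not None else [('i', imdb_id)]
    params.append(('r', fmt))
    if api_key is not None:
        params.append(('apikey', api_key))
    # urlencode takes care of escaping spaces and punctuation in titles.
    return OMDB_ENDPOINT + '?' + urlencode(params)


url_by_title = omdb_url(title='Doctor Who')
url_by_id = omdb_url(imdb_id='tt0000001', fmt='xml')  # placeholder ID
```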
We used this to query each title against OMDb. Not all queries returned a useful response; Doctor Who fans will be pleased to know that 1995's made-for-TV movie was not recognised when we queried.
We can look at which titles we did not get a response for. This is useful, although it does not tell us where we got a false match; we are hoping there were not too many of those.
In [6]:
# Count of titles we did not get a response for
nz_no_response = len([k for (k, v) in nz_data.iteritems() if v['Response'] == 'False'])
print 'Count of NZ titles we did not get a response for: ', nz_no_response
print 'Percentage of NZ titles: ', round(float(nz_no_response)/float(len(nz_data))*100), '%'
#US count?
We're pretty happy with these percentages for getting an understanding of the available content from a quality perspective. We cannot be absolutely certain, but it is good enough.
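The same check can be wrapped in a small function so that it can be run against either region. This is a sketch under the assumption, taken from the cell above, that OMDb marks unmatched titles with the string 'False' in the 'Response' field; `no_response_rate` and the sample data are hypothetical.

```python
def no_response_rate(region_data):
    """Return (count, percentage) of titles OMDb did not recognise."""
    missing = [title for title, record in region_data.items()
               if record.get('Response') == 'False']
    total = len(region_data)
    percentage = 100.0 * len(missing) / total if total else 0.0
    return len(missing), percentage


# Hypothetical records standing in for nz_data or us_data.
sample = {
    'Title A': {'Response': 'True'},
    'Title B': {'Response': 'True'},
    'Title C': {'Response': 'False'},
    'Title D': {'Response': 'True'},
}

missing_count, missing_pct = no_response_rate(sample)
```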
We can now look at the difference in quality.
In our assessment of quality we are not looking at Netflix ratings; we are looking at the ratings held in IMDb. It could be that Netflix has content more suited to its customers, who would rate those titles higher. Those rating on IMDb may be skewed in a particular way, as they may have more interest in esoteric aspects of movies and content. Another useful metric might be Rotten Tomatoes scores, though we do not have complete information on these and we were declined access to the Rotten Tomatoes API.
In [7]:
# Average IMDB score of NZ geographic content
nz_avg_score = content_stats.Title_stats(nz_data).average_score()
# Average IMDB score of US geographic content
us_avg_score = content_stats.Title_stats(us_data).average_score()
print 'NZ Average Score via OMDB: ', round(nz_avg_score,2)
print 'US Average Score via OMDB: ', round(us_avg_score,2)
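For readers without the `content_stats` module to hand, the sketch below shows one plausible way to compute such an average. It assumes each record stores the rating as a string under 'imdbRating', with 'N/A' or a missing key for unrated titles, and it is not claimed to match `Title_stats.average_score` exactly.

```python
def average_imdb_score(region_data):
    """Mean of the parseable 'imdbRating' values, or None if there are none."""
    scores = []
    for record in region_data.values():
        try:
            scores.append(float(record.get('imdbRating', 'N/A')))
        except (TypeError, ValueError):
            # Unrated or unmatched titles are left out of the average.
            continue
    return sum(scores) / len(scores) if scores else None


# Hypothetical records standing in for nz_data or us_data.
sample = {
    'Title A': {'imdbRating': '8.0'},
    'Title B': {'imdbRating': '6.0'},
    'Title C': {'imdbRating': 'N/A'},
    'Title D': {'Response': 'False'},
}

sample_average = average_imdb_score(sample)
```

Leaving unrated titles out, rather than counting them as zero, keeps missing data from dragging the average down.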
New Zealand may have a smaller catalogue, but it does seem to have a marginally higher average quality than the US. We can look at what the top movies are (based on IMDb/OMDb ratings) in the two countries.
This gives an interesting view of the top movies. At first look we can see the titles are very different.
In [8]:
top_nz_titles = content_stats.Title_stats(nz_data).top_movies(25)
top_us_titles = content_stats.Title_stats(us_data).top_movies(25)
print 'Top NZ Titles'
print
# Truncate to 5
for tup in top_nz_titles[:5]:
    print 'Title :', tup[0]
    print 'Rating: ', tup[1]
    print '========='
print
print
print 'Top US Titles'
print
# Truncate to 5
for tup in top_us_titles[:5]:
    print 'Title :', tup[0]
    print 'Rating: ', tup[1]
    print '========='
print
print
print 'Average score of top 25 Titles'
print
print 'NZ top 25 average'
score = 0
count = 0
for tup in top_nz_titles:
    count += 1
    score += tup[1]
print score/count
print
print 'US top 25 average'
score = 0
count = 0
for tup in top_us_titles:
    count += 1
    score += tup[1]
print score/count
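The ranking and averaging above can also be sketched in a few lines with the standard library. The helpers `top_titles` and `mean_rating` are hypothetical and operate on a plain title-to-rating mapping rather than on the notebook's pickled dictionaries.

```python
import heapq


def top_titles(ratings, n):
    """Return the n highest-rated (title, rating) pairs, best first."""
    return heapq.nlargest(n, ratings.items(), key=lambda pair: pair[1])


def mean_rating(pairs):
    """Mean rating of a list of (title, rating) pairs, or None if it is empty."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(rating for _, rating in pairs) / float(len(pairs))


# Hypothetical ratings, already converted to floats.
sample_ratings = {'Title A': 9.0, 'Title B': 7.5, 'Title C': 8.5, 'Title D': 5.0}

top_two = top_titles(sample_ratings, 2)
top_two_mean = mean_rating(top_two)
```

Dividing by `float(len(pairs))` avoids integer division under Python 2 if the ratings happen to be integers.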
In [9]:
# Truncate to top 5
for tup in top_nz_titles[:5]:
    print 'Title :', tup[0]
    print 'Rating:', tup[1]
    try:
        poster = Image(nz_data[quote(tup[0])]['Poster'])
        display(poster)
    except Exception:
        # No usable poster for this title
        print '''
|
| No Poster
|
'''
In [10]:
bottom_nz_titles = content_stats.Title_stats(nz_data).bottom_movies(25)
bottom_us_titles = content_stats.Title_stats(us_data).bottom_movies(25)
print 'Bottom ranked NZ title', bottom_nz_titles[-1][0], ':' , bottom_nz_titles[-1][1]
poster = Image(nz_data[quote(bottom_nz_titles[-1][0])]['Poster'])
display(poster)
print 'Bottom ranked US title', bottom_us_titles[-1][0], ':' , bottom_us_titles[-1][1]
poster = Image(us_data[quote(bottom_us_titles[-1][0])]['Poster'])
display(poster)
print
print
print 'Average score of bottom 25 Titles'
print
print 'NZ bottom 25 average'
score = 0
count = 0
for tup in bottom_nz_titles:
    count += 1
    score += tup[1]
print score/count
print
print 'US bottom 25 average'
score = 0
count = 0
for tup in bottom_us_titles:
    count += 1
    score += tup[1]
print score/count
It seems NZ might have a smaller library but better quality (based on IMDb ratings). In absolute terms this makes some sense. We can look at the distribution of ratings to get a better understanding.
In [11]:
hist_nz_titles = content_stats.Title_stats(nz_data).ratings_distribution()
hist_us_titles = content_stats.Title_stats(us_data).ratings_distribution()
In [12]:
trace1 = Histogram(
x = hist_nz_titles,
opacity=0.75,
name = 'Count of NZ Titles'
)
trace2 = Histogram(
x = hist_us_titles,
opacity = 0.75,
name = 'Count of US Titles'
)
data = Data([trace2, trace1])
layout = Layout(
barmode ='overlay',
yaxis = YAxis(title = 'Count of Titles'),
xaxis = XAxis(title = 'IMDb Score (rounded down)'),
title = 'Netflix Library Comparison June 2015 Distribution of IMDB scores'
)
fig = Figure(data=data, layout=layout)
# py.iplot(fig, filename = 'Netflix-Library-Comparison-June-2015-Distribution of IMDB scores')
HTML('''
<div>
<a href="https://plot.ly/~gotofftherails/207/" target="_blank" title="Netflix Library Comparison June 2015 Distribution of IMDB scores" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/207.png" alt="Netflix Library Comparison June 2015 Distribution of IMDB scores" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:207" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/207/">Link to interactive chart and data</a>
</div>
''')
Out[12]:
In [13]:
titles_score_count_nz = Counter(hist_nz_titles)
titles_score_count_us = Counter(hist_us_titles)
# make relative
nz_tot = sum(titles_score_count_nz.values())
titles_score_count_nz_relative = { k:round((float(v)/float(nz_tot))*100, 1) for (k,v) in titles_score_count_nz.items()}
us_tot = sum(titles_score_count_us.values())
titles_score_count_us_relative = { k:round((float(v)/float(us_tot))*100,1) for (k,v) in titles_score_count_us.items()}
In [14]:
nz_x_list = []
nz_y_list = []
for score, count in titles_score_count_nz_relative.iteritems():
    nz_x_list.append(score)
    nz_y_list.append(count)
us_x_list = []
us_y_list = []
for score, count in titles_score_count_us_relative.iteritems():
    us_x_list.append(score)
    us_y_list.append(count)
trace1 = (
Bar( x = us_x_list,
y = us_y_list,
name = 'USA',
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)
)
trace2 = (
Bar( x = nz_x_list,
y = nz_y_list,
name = 'NZ',
marker = Marker(
color = 'rgba(255, 144, 33, 0.6)')
)
)
layout = Layout(
title ='Netflix - IMDb scores - US and NZ June 2015 - Percentage of Titles',
yaxis = YAxis(title = 'Percentage of Titles'),
xaxis = XAxis(title = 'IMDb Score'),
barmode='group'
)
data = Data([trace1, trace2])
fig = Figure(data=data, layout=layout)
py.iplot(fig, filename = 'Netflix-Library-Comparison-Release-US-NZ-June-2015-relative')
Out[14]:
It appears that the NZ library, while smaller, has a greater proportion of higher-quality content.
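Comparing proportions rather than raw counts is what makes libraries of different sizes comparable. The sketch below buckets ratings by their whole-number part and converts the counts to percentages; `relative_distribution` is a hypothetical helper, and rounding scores down is an assumption taken from the histogram's axis label.

```python
import math
from collections import Counter


def relative_distribution(ratings):
    """Map each rounded-down score to its percentage share of all ratings."""
    buckets = Counter(int(math.floor(rating)) for rating in ratings)
    total = float(sum(buckets.values()))
    return {score: round(100.0 * count / total, 1)
            for score, count in buckets.items()}


# Hypothetical ratings for a four-title library.
sample_distribution = relative_distribution([7.2, 7.9, 8.1, 6.5])
```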
In [15]:
# Count titles whose country of origin is New Zealand, per service
nz_origin_nz = content_stats.Title_stats(nz_data).nz_origin()
nz_origin_us = content_stats.Title_stats(us_data).nz_origin()
data = (
[Bar( x = ['NZ', 'USA'],
y = [nz_origin_nz, nz_origin_us],
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)]
)
layout = Layout(
title ='Netflix - Count of Titles with Country of Origin as New Zealand via IMDb',
yaxis = YAxis(title = 'Count of Titles'),
xaxis = XAxis(title = 'Geographic Service'),
)
fig = Figure(data=data, layout=layout)
# py.iplot(fig, filename = 'Netflix-Library-Comparison-Country-June-2015')
HTML('''<div>
<a href="https://plot.ly/~gotofftherails/130/" target="_blank" title="Netflix - Count of Titles with Country of Origin as New Zealand via IMDb" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/130.png" alt="Netflix - Count of Titles with Country of Origin as New Zealand via IMDb" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:130" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/130/">Link to interactive chart and data</a>
</div>
''')
Out[15]:
Neck and neck. We can see what these titles are to get a feel for how Kiwi they are.
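One way to list those titles is sketched below. It assumes the OMDb 'Country' field is a comma-separated string such as 'USA, New Zealand'; `titles_from_country` and the sample records are hypothetical and not part of `content_stats`.

```python
def titles_from_country(region_data, country='New Zealand'):
    """Return the sorted titles whose 'Country' field lists the given country."""
    matches = []
    for title, record in region_data.items():
        countries = [name.strip() for name in record.get('Country', '').split(',')]
        if country in countries:
            matches.append(title)
    return sorted(matches)


# Hypothetical records standing in for nz_data or us_data.
sample = {
    'Title A': {'Country': 'New Zealand'},
    'Title B': {'Country': 'USA, New Zealand'},
    'Title C': {'Country': 'USA'},
    'Title D': {'Response': 'False'},
}

kiwi_titles = titles_from_country(sample)
```

Co-productions count as matches here, which is one reason a title can be listed as New Zealand in origin without feeling especially Kiwi.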
We can look at age of titles within the catalogues. There are some interesting caveats to this as age is represented in three ways:
Initially we could look at year of first release.
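A sketch of how year of first release might be extracted and counted follows. It assumes the OMDb 'Year' field is a string that can hold a single year, a range for series, or 'N/A'; the helpers `first_release_year` and `release_year_counts` are hypothetical.

```python
import re
from collections import Counter


def first_release_year(year_field):
    """Return the first four-digit year in an OMDb 'Year' string, or None."""
    match = re.search(r'\d{4}', year_field or '')
    return int(match.group()) if match else None


def release_year_counts(region_data):
    """Return (year, count) pairs sorted by year, skipping unparseable records."""
    years = (first_release_year(record.get('Year'))
             for record in region_data.values())
    return sorted(Counter(year for year in years if year is not None).items())


# Hypothetical records; the series uses a year range.
sample = {
    'Film A': {'Year': '1999'},
    'Film B': {'Year': '1999'},
    'Show C': {'Year': '2005-2012'},
    'Mystery D': {'Year': 'N/A'},
}

year_counts = release_year_counts(sample)
```

Sorting the pairs gives the chart a stable left-to-right order, which plain dictionary iteration does not guarantee.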
In [16]:
release_year = content_stats.Title_stats(nz_data).year_first_release_count()
x_list = []
y_list = []
for year, count in release_year.iteritems():
    x_list.append(year)
    y_list.append(count)
data = (
[Bar( x = x_list,
y = y_list,
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)]
)
layout = Layout(
title ='Netflix - Year of Titles First Release - NZ June 2015',
yaxis = YAxis(title = 'Count of Titles'),
xaxis = XAxis(title = 'Year'),
)
fig = Figure(data=data, layout=layout)
#py.iplot(fig, filename = 'Netflix-Library-Comparison-Release-June-2015')
HTML('''<div>
<a href="https://plot.ly/~gotofftherails/132/" target="_blank" title="Netflix - Year of Titles First Release - NZ June 2015" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/132.png" alt="Netflix - Year of Titles First Release - NZ June 2015" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:132" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/132/">Link to interactive chart and data</a>
</div>
''')
Out[16]:
We can do the same for the USA catalogue.
In [17]:
release_year = content_stats.Title_stats(us_data).year_first_release_count()
x_list = []
y_list = []
for year, count in release_year.iteritems():
    x_list.append(year)
    y_list.append(count)
data = (
[Bar( x = x_list,
y = y_list,
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)]
)
layout = Layout(
title ='Netflix - Year of Titles First Release - US June 2015',
yaxis = YAxis(title = 'Count of Titles'),
xaxis = XAxis(title = 'Year'),
)
fig = Figure(data=data, layout=layout)
# py.iplot(fig, filename = 'Netflix-Library-Comparison-Release-US-June-2015')
HTML('''<div>
<a href="https://plot.ly/~gotofftherails/142/" target="_blank" title="Netflix - Year of Titles First Release - US June 2015" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/142.png" alt="Netflix - Year of Titles First Release - US June 2015" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:142" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/142/">Link to interactive chart and data</a>
</div>
''')
Out[17]:
We can also look at them side by side. Let's look at absolute counts first.
In [18]:
release_year_us = content_stats.Title_stats(us_data).year_first_release_count()
us_x_list = []
us_y_list = []
for year, count in release_year_us.iteritems():
    us_x_list.append(year)
    us_y_list.append(count)
release_year_nz = content_stats.Title_stats(nz_data).year_first_release_count()
nz_x_list = []
nz_y_list = []
for year, count in release_year_nz.iteritems():
    nz_x_list.append(year)
    nz_y_list.append(count)
trace1 = (
Bar( x = us_x_list,
y = us_y_list,
name = 'USA',
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)
)
trace2 = (
Bar( x = nz_x_list,
y = nz_y_list,
name = 'NZ',
marker = Marker(
color = 'rgba(255, 144, 33, 0.6)')
)
)
layout = Layout(
title ='Netflix - Year of Titles First Release - US and NZ June 2015',
yaxis = YAxis(title = 'Count of Titles'),
xaxis = XAxis(title = 'Year'),
barmode='group'
)
data = Data([trace1, trace2])
fig = Figure(data=data, layout=layout)
#py.iplot(fig, filename = 'Netflix-Library-Comparison-Release-US-NZ-June-2015')
HTML('''<div>
<a href="https://plot.ly/~gotofftherails/157/" target="_blank" title="Netflix - Year of Titles First Release - US and NZ June 2015" style="display: block; text-align: center;"><img src="https://plot.ly/~gotofftherails/157.png" alt="Netflix - Year of Titles First Release - US and NZ June 2015" style="max-width: 100%;" onerror="this.onerror=null;this.src='https://plot.ly/404.png';" /></a>
<script data-plotly="gotofftherails:157" src="https://plot.ly/embed.js" async></script>
</div>
<div>
<a href="https://plot.ly/~gotofftherails/157/">Link to interactive chart and data</a>
</div>
''')
Out[18]:
The smaller New Zealand library makes it a bit trickier to tell which service has newer or older content when it is put against the larger US library, so we also look at each year's share of its own library.
In [19]:
total_us_titles = sum(us_y_list)
us_y_list = [(float(count)/(float(total_us_titles)) * 100) for count in us_y_list]
total_nz_titles = sum(nz_y_list)
nz_y_list = [(float(count)/(float(total_nz_titles)) * 100) for count in nz_y_list]
trace1 = (
Bar( x = us_x_list,
y = us_y_list,
name = 'USA',
marker = Marker(
color = 'rgba(34, 95, 250, 0.6)')
)
)
trace2 = (
Bar( x = nz_x_list,
y = nz_y_list,
name = 'NZ',
marker = Marker(
color = 'rgba(255, 144, 33, 0.6)')
)
)
layout = Layout(
title ='Netflix - Year of Titles First Release - US and NZ June 2015 - Percentage',
yaxis = YAxis(title = 'Percentage of Titles'),
xaxis = XAxis(title = 'Year'),
barmode='group'
)
data = Data([trace1, trace2])
fig = Figure(data=data, layout=layout)
py.iplot(fig, filename = 'Relative- Netflix-Library-Comparison-Release-US-NZ-June-2015')
Out[19]:
In [20]:
nz_actors_count = content_stats.Title_stats(nz_data).top_actors(21)
us_actors_count = content_stats.Title_stats(us_data).top_actors(21)
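The actor counts can be approximated with a `Counter` over the OMDb 'Actors' field, assuming it is a comma-separated string. The sketch below is hypothetical and not the `content_stats` implementation; it drops 'N/A' entries explicitly, which may be why the notebook asks for 21 results and then skips the first one (an assumption on our part).

```python
from collections import Counter


def top_actors(region_data, n):
    """Return the n most frequent actor names as (name, count) pairs."""
    counts = Counter()
    for record in region_data.values():
        for name in record.get('Actors', '').split(','):
            name = name.strip()
            # Skip blanks and OMDb's placeholder for missing cast lists.
            if name and name != 'N/A':
                counts[name] += 1
    return counts.most_common(n)


# Hypothetical records with placeholder actor names.
sample = {
    'Title A': {'Actors': 'Actor X, Actor Y'},
    'Title B': {'Actors': 'Actor X, Actor Z'},
    'Title C': {'Actors': 'Actor X, Actor Y'},
    'Title D': {'Actors': 'N/A'},
}

leading_actors = top_actors(sample, 2)
```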
In [21]:
print '<table>'
print ' <tr><th>Actor Name</th><th>Count of Titles with Actor Name</th></tr>'
for tup in nz_actors_count[1:]:
    print ' <tr>'
    print ' <td>'
    print ' ', tup[0].lstrip()
    print ' </td>'
    print ' <td>'
    print ' ', tup[1]
    print ' </td>'
    print ' </tr>'
print '</table>'
print
print '====================='
print '<table>'
print ' <tr><th>Actor Name</th><th>Count of Titles with Actor Name</th></tr>'
for tup in us_actors_count[1:]:
    print ' <tr>'
    print ' <td>'
    print ' ', tup[0].lstrip()
    print ' </td>'
    print ' <td>'
    print ' ', tup[1]
    print ' </td>'
    print ' </tr>'
print '</table>'
In [22]:
print '<table>'
print '<tr><th></th><th>NZ Service</th><th></th><th>US Service</th><th></th></tr>'
print '<tr><th>Rank</th><th>Actor Name</th><th>Count of Titles with Actor Name</th><th>Actor Name</th><th>Count of Titles with Actor Name</th></tr>'
for x in range(1,len(nz_actors_count)):
    print '<tr>'
    print '<td>'
    print str(x)
    print '</td>'
    print '<td>'
    print nz_actors_count[x][0].lstrip()
    print '</td>'
    print '<td>'
    print nz_actors_count[x][1]
    print '</td>'
    print '<td>'
    print us_actors_count[x][0].lstrip()
    print '</td>'
    print '<td>'
    print us_actors_count[x][1]
    print '</td>'
    print '</tr>'
print '</table>'
In [45]:
print '<table>'
print '<tr><th>New Zealand Service </th></tr>'
print '<tr><th>Movie Name</th><th>IMDb Score</th></tr>'
nzx = []
for tup in top_nz_titles[:21]:
    print '<tr>'
    print '<td>',tup[0],'</td>','<td>', tup[1],'</td>'
    print '</tr>'
    nzx.append(tup[0])
print '</table>'
print
print
print '==============='
print '<table>'
print '<tr><th>Movie Name</th><th>IMDb Score</th></tr>'
usx = []
for tup in top_us_titles[:21]:
    print '<tr>'
    print '<td>',tup[0],'</td>','<td>', tup[1],'</td>'
    print '</tr>'
    usx.append(tup[0])
print '</table>'
In [48]:
print '<table>'
print '<tr><th></th><th>NZ Service</th><th></th><th>US Service</th><th></th></tr>'
print '<tr><th>Rank</th><th>Movie Name</th><th>IMDb Score</th><th>Movie Name</th><th>IMDb Score</th></tr>'
for x in range(0,len(top_nz_titles[:20])):
    print '<tr>'
    print '<td>'
    print str(x+1)
    print '</td>'
    print '<td>'
    print top_nz_titles[x][0].lstrip()
    print '</td>'
    print '<td>'
    print top_nz_titles[x][1]
    print '</td>'
    print '<td>'
    print top_us_titles[x][0].lstrip()
    print '</td>'
    print '<td>'
    print top_us_titles[x][1]
    print '</td>'
    print '</tr>'
print '</table>'
In [49]:
my_set = (set(usx)).intersection(set(nzx))
for title in my_set:
    print '* ', title