YTPAK Video Analyses

Author: Atique Ur Rehman

This notebook is part of a tutorial on Scrapy. Scrapy was used to scrape data about videos from YTPAK, a video content website.

First, an initial search was made with a specific title, in our case "The Kapil Sharma Show".
All the videos in the results were visited.
For each visited video, all the data, including likes, views and comments, was scraped.
Then the suggestions of each visited video were visited recursively.
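
The crawl order described above can be sketched without Scrapy as a breadth-first traversal. This is only an illustration and not the tutorial's spider: `search`, `fetch_video` and the in-memory `FAKE_SITE` are hypothetical stand-ins for real page requests, and `max_hops` is an assumed safety limit.

```python
from collections import deque

# Hypothetical in-memory stand-in for the website: id -> (metadata, suggested ids)
FAKE_SITE = {
    "a": ({"title": "The Kapil Sharma Show ep. 1", "likes": 10, "views": 100}, ["b", "c"]),
    "b": ({"title": "The Kapil Sharma Show ep. 2", "likes": 5, "views": 60}, ["c", "d"]),
    "c": ({"title": "Comedy clip", "likes": 2, "views": 30}, []),
    "d": ({"title": "Unrelated video", "likes": 1, "views": 9}, ["a"]),
}

def search(query):
    """Return ids whose title contains the query (stand-in for the site search)."""
    return [vid for vid, (meta, _) in FAKE_SITE.items()
            if query.lower() in meta["title"].lower()]

def fetch_video(video_id):
    """Return (metadata, suggested ids) for one video (stand-in for a page request)."""
    return FAKE_SITE[video_id]

def crawl(query, max_hops=10):
    """Visit the search results, then follow suggestions, recording each referrer."""
    items, connections = {}, []
    queue = deque((vid, None, 0) for vid in search(query))
    seen = {vid for vid, _, _ in queue}
    while queue:
        video_id, referrer, hops = queue.popleft()
        meta, suggestions = fetch_video(video_id)
        items[video_id] = meta                     # first dataset: per-video data
        connections.append((video_id, referrer))   # second dataset: who suggested whom
        if hops >= max_hops:
            continue
        for s in suggestions:
            if s not in seen:                      # visit every video only once
                seen.add(s)
                queue.append((s, video_id, hops + 1))
    return items, connections

items, connections = crawl("kapil sharma")
print(sorted(items))
print(connections)
```

The two return values mirror the two datasets used in the rest of the notebook: one record per video, and one (video, referrer) pair per visit.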

Data

I ended up with two datasets from two different spiders:

The first with the comments, likes, views and dates
The second with a list of videos and the next suggested video

Analyses

I conducted three different analyses, purely as a proof of concept:

Plot of No. of views vs. No. of likes
Plot of the popularity of "The Kapil Sharma Show" over time
Plot of the suggestion graph of the videos with hop counts (how far each video is from the original video). Using this graph we can visualize
- How suggestions span very different classes of videos, and can take you places
- How the related videos are clustered within fewer than 10 hops, after which irrelevant videos start to appear
- The diversity of the suggestion algorithm

Note: I do not claim any of this data to be mine; it was scraped for academic purposes only. The YTPAK website has a robots.txt file which, as of 5 March 2017, reads:

User-agent: *
Allow: /

Sitemap: https://www.ytpak.com/sitemap_index.php

This means the website allows scraping of all its content. The code is released under the MIT license; a copy of the license can be found in the root folder.
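
As a small check of that reading, the rules quoted above can be fed to the standard library's robots.txt parser. This sketch assumes Python 3 (`urllib.robotparser`) and parses the quoted text directly instead of downloading it; the URL passed to `can_fetch` is a made-up path used only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content exactly as quoted in this notebook
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://www.ytpak.com/sitemap_index.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page path; any path is covered by the "Allow: /" rule
allowed = parser.can_fetch("*", "https://www.ytpak.com/some/video/page")
print(allowed)
```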

P.S.: For the hover effect on the graph, either run the complete notebook with the data, or download the HTML version and open it in a browser.



In [31]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.plotly as py
import plotly
import networkx as nx
import matplotlib.dates
import ast
import re

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from datetime import datetime
from plotly.graph_objs import *
from collections import defaultdict

%matplotlib inline
plt.style.use('ggplot')



In [32]:

    
frame = pd.read_csv("./results.csv")

Description



In [33]:

    
frame.describe()









    Out[33]:

        data                                               type  id
count   9322                                               9322  9322
unique  9322                                               2     8022
top     {u'Mustafa B. Ketanci': u'Hem bitmeyen \u015fa...  meta  N9AHsZI9iSE
freq    1                                                  8022  2

Total data



In [34]:

    
print("Total Comments {}".format(np.sum(frame['type'] == 'comments')))
print("Total meta data items {}".format(np.sum(frame['type'] == 'meta')))









    



Total Comments 1300
Total meta data items 8022

Pre-processing



In [35]:

    
frame['data'] = frame['data'].apply(lambda st : ast.literal_eval(st))

Information included in meta data



In [36]:

    
first_meta_item = frame[frame['type'] == 'meta'].iloc[0]
meta_data = first_meta_item['data']
print(list(meta_data.keys()))









    



['description', 'title', 'views', 'dislikes', 'likes', 'date']

Sanity check of data

Plotting No. of views vs. No. of likes



In [37]:

    
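def get_clean_likes(st):
    # Like counts are stored as strings; drop thousands separators and whitespace
    val = st["likes"]
    val = val.replace(",", "").strip()
    return np.int64(val)

def get_clean_views(st):
    # View counts are stored as strings; keep only the token before the first space
    val = st["views"]
    val = val.replace(",", "")
    val = val.split(" ")[0]
    return np.int64(val)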



meta_items = frame[frame['type'] == 'meta']
data = meta_items['data']
likes = data.apply(get_clean_likes)
views = data.apply(get_clean_views)

likes = np.array(likes)
views = np.array(views)
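
The cleaning functions above assume the number is the first space-separated token. A slightly more defensive variant is sketched below: `parse_count` is a hypothetical helper, not part of the notebook, and returning `None` for text without digits is an assumed convention. The first sample string is the `views` value shown in the connections table later in this notebook.

```python
import re

def parse_count(text):
    """Return the first integer found in text, ignoring thousands separators.

    Returns None when the text contains no digits (assumed convention).
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))

print(parse_count("\n9 views\n1 on YTPak\n"))  # leading newline is tolerated
print(parse_count(" 1,234 "))
print(parse_count("no count"))
```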



In [38]:

    
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(likes, views)
ax.set_title("Likes VS Views")
ax.set_xlabel("No. of Likes")
ax.set_ylabel("No. of Views")

# Least-squares straight-line fit of views against likes
a,b = np.polyfit(likes, views,1)
ax.plot(likes, a*likes + b, label = "Linear fit")
plt.legend()









    Out[38]:





<matplotlib.legend.Legend at 0x7f662ca43c50>

Popularity of 'The Kapil Sharma Show' with time



In [39]:

    
def get_clean_dates(st):
    val = st["date"]
    date = re.search(r"Published on (.*) \|", val).group(1)
    date = datetime.strptime(date, '%d %b %Y')
    return date
    
    
def is_kapils_video(st):
    title = st['title']
    des = st['description']
    return "kapil" in title.lower() or "kapil" in des.lower()
    
meta_items = frame[frame['type'] == 'meta']
data = meta_items['data']
related_data_mask = data.apply(is_kapils_video)
filtered_data = data[related_data_mask]

likes = filtered_data.apply(get_clean_likes)
views = filtered_data.apply(get_clean_views)
dates = filtered_data.apply(get_clean_dates)

likes = np.array(likes)
views = np.array(views)
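
For reference, the date parsing step can be tried on its own. The sketch below uses the sample string "Published on 03 Mar 2017 | 1 day ago" that appears in the scraped data; `parse_published_date` is a hypothetical name, returning `None` on a mismatch is an assumed convention, and `%b` assumes English month abbreviations in the current locale.

```python
import re
from datetime import datetime

def parse_published_date(text):
    """Extract the date from strings such as 'Published on 03 Mar 2017 | 1 day ago'.

    Returns None if the text does not match (assumed convention).
    """
    match = re.search(r"Published on (\d{1,2} \w{3} \d{4})", text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d %b %Y")

sample = "Published on 03 Mar 2017 | 1 day ago"
print(parse_published_date(sample))
print(parse_published_date("unknown"))
```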



In [40]:

    
fig, ax = plt.subplots(1,2,figsize=(20,10))
ax[0].plot_date(dates, likes)
ax[0].set_title("Date VS Likes")
ax[0].set_xlabel("Date")
ax[0].set_ylabel("Likes")

ax[1].plot_date(dates, views)
ax[1].set_title("Date VS Views")
ax[1].set_xlabel("Date")
ax[1].set_ylabel("Views")









    Out[40]:





<matplotlib.text.Text at 0x7f66348cc110>

Video Suggestion Analyses



In [41]:

    
connection_frame = pd.read_csv("./connections.csv")



In [42]:

    
connection_frame.describe()









    Out[42]:

        likes  title                                       date                                  dislikes  views                    refrer       id           description
count   1636   1636                                        1635                                  1635      1635                     1634         1635         1501
unique  1349   1633                                        952                                   865       1634                     959          1635         1402
top     0      مش صافيناز .رقص شرقي مصري .Hot Belly Dance  Published on 03 Mar 2017 | 1 day ago  0         \n9 views\n1 on YTPak\n  TAhNGYDNRt0  vi2tOCWuePw  SUBSCRIBE OUR CHANNEL FOR REGULAR UPLOADS OF \...
freq    34     3                                           7                                     53        2                        10           1            34

Calculating hop count (distance from searched video)



In [43]:

    
ids = connection_frame["id"]
refrer_ids = connection_frame["refrer"]
titles = connection_frame['title']

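suggested = defaultdict(list)  # referrer id -> ids of the videos it suggested
start = str(ids[0])
for i,r in zip(ids[1:], refrer_ids[1:]):
    suggested[r].append(i)
    
def distance_count(root, suggested_tree, hop_count=None, current_hop_count=1):
    # Walk the suggestion tree depth-first, recording each video's hop distance.
    # The dict is created per top-level call; a mutable default argument
    # would be shared between calls.
    if hop_count is None:
        hop_count = defaultdict(int)
    suggestions = suggested_tree[root]
    
    for s in suggestions:
        hop_count[s] = current_hop_count
        distance_count(s, suggested_tree, hop_count, current_hop_count + 1)
    return hop_count

hop_count = distance_count(start, suggested)
hop_count[start] = 0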
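
The same hop counts can also be computed iteratively. The sketch below is an alternative to the recursive function, not the notebook's own method: it uses a breadth-first search, which avoids Python's recursion limit on long suggestion chains and still terminates if the data contains a cycle. `hop_counts` and the toy ids are made up for illustration.

```python
from collections import defaultdict, deque

def hop_counts(start, edges):
    """Breadth-first hop distance from start over (video_id, referrer_id) pairs."""
    children = defaultdict(list)
    for video_id, referrer_id in edges:
        children[referrer_id].append(video_id)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in hops:  # skip ids already reached, so cycles terminate
                hops[child] = hops[node] + 1
                queue.append(child)
    return hops

# Toy data with made-up ids: a -> b -> c, a -> d, and c suggests a again
edges = [("b", "a"), ("c", "b"), ("d", "a"), ("a", "c")]
print(hop_counts("a", edges))
```

Because breadth-first search reaches every video by its shortest path first, each id keeps the smallest hop count even when it is suggested from several places.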

Initializing network graph



In [44]:

    
G=nx.Graph()

Generating edges as a list and adding to graph



In [45]:

    
Nodes=ids
G.add_nodes_from(Nodes)
Edges=[(i,r) for i, r in zip(ids, refrer_ids)]
G.add_edges_from(Edges)

Plotting raw graph



In [46]:

    
plt.figure(figsize=(20,20))
plt.title("Raw graph")
nx.draw(G, node_color='c',edge_color='k', with_labels=False)

The graph above does not convey any useful information, so we have to add titles and colors to it.

These are some utility functions that use the plotly library to generate nodes and edges with colors, titles and other information. They can be skipped for now if you just want a general idea of what is happening.



In [47]:

    
def scatter_nodes(pos, keys, hop_count, labels, color=None, size=10, opacity=1):
    # pos is the dict of node positions
    # labels is a list  of labels to be displayed when hovering the mouse over the nodes
    # color is the color for nodes. When it is set as None the Plotly default color is used
    # size is the size of the dots representing the nodes
    # opacity is a value between [0,1] defining the node color opacity
    
    trace = Scatter(
        x=[], 
        y=[], 
        mode='markers',
        marker=Marker(size=[], 
                      colorscale='Hot',
                      reversescale=False,
                      color=[],
                      colorbar=dict(
                            thickness=15,
                            title='Hop count from searched video',
                            xanchor='left',
                            titleside='right')))
    for k in keys:
        trace['x'].append(pos[k][0])
        trace['y'].append(pos[k][1])
        trace['marker']['color'].append(hop_count[k])
        
    attrib=dict(name='', text=labels , hoverinfo='text', opacity=0.9) # a dict of Plotly node attributes
    trace=dict(trace, **attrib)# concatenate the dict trace and attrib
    trace['marker']['size']=size
    return trace   

def scatter_edges(G, pos, line_color=None, line_width=1):
    trace = Scatter(x=[], y=[], mode='lines')
    for edge in G.edges():
        trace['x'] += [pos[edge[0]][0],pos[edge[1]][0], None]
        trace['y'] += [pos[edge[0]][1],pos[edge[1]][1], None]  
        trace['hoverinfo']='none'
        trace['line']['width']=line_width
        if line_color is not None: # when it is None a default Plotly color is used
            trace['line']['color']=line_color
    return trace
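
The edge trace relies on a Plotly convention: a `None` value in the coordinate lists breaks the line, so all edges can live in a single trace. The plain-Python sketch below shows only how those lists are built; `edge_coordinates` and the toy positions are hypothetical and no plotting library is needed to run it.

```python
def edge_coordinates(edges, pos):
    """Flatten edges into x and y lists, with None separating the segments."""
    xs, ys = [], []
    for u, v in edges:
        xs += [pos[u][0], pos[v][0], None]
        ys += [pos[u][1], pos[v][1], None]
    return xs, ys

# Made-up node positions, standing in for the output of a graph layout
pos = {"a": (0.0, 0.0), "b": (1.0, 0.5), "c": (0.5, 1.0)}
xs, ys = edge_coordinates([("a", "b"), ("b", "c")], pos)
print(xs)
print(ys)
```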

Using the above functions, generating titles, edges and nodes



In [48]:

    
pos=nx.fruchterman_reingold_layout(G)   
labels = [ "Title : " + titles[i] + "<br> Hop count : " + str(hop_count[k]) for i, k in enumerate(ids) ]
trace1=scatter_edges(G, pos)
trace2=scatter_nodes(pos, ids, hop_count, labels=labels)

Setting up the layout



In [49]:

    
width=1000
height=1000
axis=dict(showline=False, # hide axis line, grid, ticklabels and  title
          zeroline=False,
          showgrid=False,
          showticklabels=False,
          title='' 
          )
layout=Layout(title= 'YTPAK video suggestion graph',
    font= Font(),
    showlegend=False,
    autosize=False,
    width=width,
    height=height,
    xaxis=XAxis(axis),
    yaxis=YAxis(axis),
    margin=Margin(
        l=40,
        r=40,
        b=85,
        t=100,
        pad=0,
       
    ),
    hovermode='closest',
#    plot_bgcolor='#EFECEA', #set background color            
    )

data=Data([trace1, trace2])
fig = Figure(data=data, layout=layout)

Plotting the final graph



In [52]:

    
init_notebook_mode(connected=False)
plt.figure(figsize=(10,10))
py.iplot(fig, filename='YtpakSugessions')









    











    Out[52]:











    





<matplotlib.figure.Figure at 0x7f6639209890>



In [54]:

    
from IPython.display import Image
Image(filename='./YTPAKSugessions.png')









    Out[54]:


