To measure traffic on English Wikipedia from January 2008 through September 2017, collect data from two different API endpoints: the legacy Pagecounts API and the Pageviews API.
In [1]:
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
import copy
from datetime import datetime
import plotly
import plotly.graph_objs as go
from plotly import tools
plotly.__version__
plotly.offline.init_notebook_mode(connected=True)
headers={'User-Agent' : 'https://github.com/mbzhuang', 'From' : 'mbzhuang@uw.edu'}
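The headers defined above only take effect if they are passed with each request. Below is a minimal sketch of a reusable download helper, assuming the imports and `headers` from the cell above; the name `fetch_json` is illustrative and not part of the original notebook:
def fetch_json(endpoint, params, outfile_name):
    # str.format() silently ignores keys that do not appear in the URL template,
    # so one params dict can be reused across both APIs.
    api_call = requests.get(endpoint.format(**params), headers=headers)
    api_call.raise_for_status()  # fail loudly on HTTP errors
    response = api_call.json()
    with open(outfile_name, 'w') as outfile:
        json.dump(response, outfile)
    return response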
Get pagecounts of en.wikipedia.org through the desktop site.
In [2]:
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'
params = {'project' : 'en.wikipedia.org',
          'access' : 'desktop-site',
          'agent' : 'user',
          'granularity' : 'monthly',
          'start' : '2008010100',
          'end' : '2016080100'}
# Note: the legacy Pagecounts URL template has no {agent} placeholder, so that key is ignored by format().
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
with open('pagecounts_desktop-site_200801-201607.json', 'w') as outfile:
    json.dump(response, outfile)
Get pagecounts of en.wikipedia.org through mobile.
In [3]:
params = {'project' : 'en.wikipedia.org',
          'access' : 'mobile-site',
          'agent' : 'user',
          'granularity' : 'monthly',
          'start' : '2008010100',
          'end' : '2016080100'}
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
with open('pagecounts_mobile-site_200801-201607.json', 'w') as outfile:
    json.dump(response, outfile)
Get page view counts of en.wikipedia.org through desktop, using the Pageviews API.
In [4]:
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'
params = {'project' : 'en.wikipedia.org',
          'access' : 'desktop',
          'agent' : 'user',
          'granularity' : 'monthly',
          'start' : '2015070100',
          'end' : '2017100100'}
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
with open('pageviews_desktop-site_201507-201709.json', 'w') as outfile:
    json.dump(response, outfile)
Get page view counts of en.wikipedia.org through mobile-web and mobile-app.
In [5]:
params = {'project' : 'en.wikipedia.org',
          'access' : 'mobile-web',
          'agent' : 'user',
          'granularity' : 'monthly',
          'start' : '2015070100',
          'end' : '2017100100'}
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
with open('pageviews_mobile-web_201507-201709.json', 'w') as outfile:
    json.dump(response, outfile)

params = {'project' : 'en.wikipedia.org',
          'access' : 'mobile-app',
          'agent' : 'user',
          'granularity' : 'monthly',
          'start' : '2015070100',
          'end' : '2017100100'}
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
with open('pageviews_mobile-app_201507-201709.json', 'w') as outfile:
    json.dump(response, outfile)
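The five download cells above repeat the same pattern, so they could also be driven by a short loop. A sketch using the hypothetical `fetch_json` helper introduced earlier, shown here for the two mobile pageview requests:
pageviews_endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'
for access in ['mobile-web', 'mobile-app']:
    params = {'project': 'en.wikipedia.org',
              'access': access,
              'agent': 'user',
              'granularity': 'monthly',
              'start': '2015070100',
              'end': '2017100100'}
    # File names mirror the ones used in the cells above.
    fetch_json(pageviews_endpoint, params,
               'pageviews_{}_201507-201709.json'.format(access))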
Define a function that loads data from a JSON file into a dataframe, then use it to load the five JSON files created in step 1.
In [6]:
def json_to_DataFrame(json_file):
    with open(json_file, 'r') as infile:
        json_content = json.load(infile)
    DataFrame = json_normalize(json_content['items'])
    return DataFrame

df_pagecounts_desktop = json_to_DataFrame("pagecounts_desktop-site_200801-201607.json")
df_pagecounts_mobile = json_to_DataFrame("pagecounts_mobile-site_200801-201607.json")
df_pageviews_desktop = json_to_DataFrame("pageviews_desktop-site_201507-201709.json")
df_pageviews_mobile_web = json_to_DataFrame("pageviews_mobile-web_201507-201709.json")
df_pageviews_mobile_app = json_to_DataFrame("pageviews_mobile-app_201507-201709.json")
Take a look at the dataframes.
In [7]:
df_pagecounts_desktop.head()
Out[7]:
In [8]:
df_pagecounts_mobile.head()
Out[8]:
In [9]:
df_pageviews_desktop.head()
Out[9]:
In [10]:
df_pageviews_mobile_web.head()
Out[10]:
In [11]:
df_pageviews_mobile_app.head()
Out[11]:
Combine mobile-app and mobile-web views into a new dataframe, df_pageviews_mobile. Since both dataframes have the same dimensions and are sorted on timestamp, we can simply add the two views columns to get total mobile views.
In [12]:
df_pageviews_mobile = copy.copy(df_pageviews_mobile_web)
df_pageviews_mobile.views = df_pageviews_mobile.views + df_pageviews_mobile_app.views
df_pageviews_mobile.head()
Out[12]:
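The element-wise addition above relies on the two mobile frames having identical, aligned timestamps. A small check of that assumption, plus a timestamp-keyed merge as a more defensive alternative (a sketch, not part of the original notebook):
# Verify that both frames cover exactly the same months, in the same order.
assert (df_pageviews_mobile_web.timestamp.values ==
        df_pageviews_mobile_app.timestamp.values).all()

# Defensive alternative: join on timestamp and sum the two views columns.
merged = pd.merge(df_pageviews_mobile_web, df_pageviews_mobile_app,
                  on='timestamp', suffixes=('_web', '_app'))
merged['views'] = merged.views_web + merged.views_app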
Merge the four dataframes on timestamp and reshape the result into the final table.
In [13]:
# Merge the desktop and mobile pagecounts dataframes.
df_pagecounts = pd.merge(df_pagecounts_desktop, df_pagecounts_mobile, on='timestamp', how='left')
# Merge the desktop and mobile pageviews dataframes.
df_pageviews = pd.merge(df_pageviews_desktop, df_pageviews_mobile, on='timestamp')
# Subset the merged dataframes and rename column names
df_pagecounts = df_pagecounts[['timestamp', 'count_x', 'count_y']]
df_pagecounts.columns = ['timestamp', 'pagecount_desktop_views', 'pagecount_mobile_views']
df_pageviews = df_pageviews[['timestamp', 'views_x', 'views_y']]
df_pageviews.columns = ['timestamp', 'pageview_desktop_views', 'pageview_mobile_views']
# Finally, merge the pagecount and pageview dataframes. An outer merge keeps months
# that appear in only one source (pagecounts end in 2016-07, pageviews start in 2015-07).
df_total = pd.merge(df_pagecounts, df_pageviews, on='timestamp', how='outer')
df_total = df_total.sort_values('timestamp')
# Fill the missing values with 0
df_total = df_total.fillna(value = 0)
# Create two new columns, 'pagecount_all_views' and 'pageview_all_views', each the sum
# of the corresponding desktop and mobile counts.
df_total['pagecount_all_views'] = df_total.pagecount_desktop_views + df_total.pagecount_mobile_views
df_total['pageview_all_views'] = df_total.pageview_desktop_views + df_total.pageview_mobile_views
# Create two new columns, 'year' and 'month', that are both based on the column 'timestamp'
df_total.timestamp = pd.to_datetime(df_total.timestamp, format='%Y%m%d%H')
df_total['year'] = df_total['timestamp'].dt.year
df_total['month'] = df_total['timestamp'].dt.month
# Do subsetting to get the final dataframe
df_total = df_total[['year', 'month',
'pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views',
'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']]
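A few quick sanity checks on the combined table (a sketch; the expected date range assumes the outer merge above, which keeps the pageview months after July 2016):
# Each 'all_views' column should equal the sum of its desktop and mobile parts.
assert (df_total.pagecount_all_views ==
        df_total.pagecount_desktop_views + df_total.pagecount_mobile_views).all()
assert (df_total.pageview_all_views ==
        df_total.pageview_desktop_views + df_total.pageview_mobile_views).all()
# The table should run from January 2008 through September 2017.
print(df_total.year.min(), df_total.year.max(), len(df_total))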
In [14]:
df_total.head()
Out[14]:
Save the df_total dataframe to a CSV file (tab-separated).
In [15]:
df_total.to_csv("en-wikipedia_traffic_200801-201709.csv", sep='\t', index=False)
Read the CSV file created in step 2 into a dataframe.
In [16]:
df = pd.read_csv("en-wikipedia_traffic_200801-201709.csv", sep='\t')
Create a 'date' column that combines year, month, and day; it will serve as the x-axis when plotting.
In [17]:
df['day'] = '01'
df['date']= df.apply(lambda x:datetime.strptime("{0} {1} {2}".format(x['year'],x['month'], x['day']), "%Y %m %d"),axis=1)
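An equivalent, slightly more direct construction assembles the dates from the year and month columns in one call (a sketch; the apply/strptime version above produces the same result):
df['date'] = pd.to_datetime(dict(year=df.year, month=df.month, day=1))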
Use Plotly to create an interactive visualization to present the page view information on English Wikipedia.
In [18]:
trace1 = go.Scatter(
    x = df[df.pagecount_all_views != 0].date,
    y = df[df.pagecount_all_views != 0].pagecount_all_views,
    name = 'Total (Pagecount definition)',
    line = dict(color = 'black', dash = 'dash')
)
trace2 = go.Scatter(
    x = df[df.pagecount_desktop_views != 0].date,
    y = df[df.pagecount_desktop_views != 0].pagecount_desktop_views,
    name = 'Main site (Pagecount definition)',
    line = dict(color = 'green', dash = 'dash')
)
trace3 = go.Scatter(
    x = df[df.pagecount_mobile_views != 0].date,
    y = df[df.pagecount_mobile_views != 0].pagecount_mobile_views,
    name = 'Mobile (Pagecount definition)',
    line = dict(color = 'blue', dash = 'dash')
)
trace4 = go.Scatter(
    x = df[df.pageview_all_views != 0].date,
    y = df[df.pageview_all_views != 0].pageview_all_views,
    name = 'Total (Pageview definition)',
    line = dict(color = 'black')
)
trace5 = go.Scatter(
    x = df[df.pageview_desktop_views != 0].date,
    y = df[df.pageview_desktop_views != 0].pageview_desktop_views,
    name = 'Main site (Pageview definition)',
    line = dict(color = 'green')
)
trace6 = go.Scatter(
    x = df[df.pageview_mobile_views != 0].date,
    y = df[df.pageview_mobile_views != 0].pageview_mobile_views,
    name = 'Mobile (Pageview definition)',
    line = dict(color = 'blue')
)
data = [trace1, trace2, trace3, trace4, trace5, trace6]
layout = dict(title = 'Page Views on English Wikipedia',
              height = 600, width = 1000,
              xaxis = dict(title = ''),
              yaxis = dict(title = 'Page views', range = [0, 12000000000]),
              legend = dict(orientation = "h")
)
fig = dict(data=data, layout=layout)
plotly.offline.iplot(fig, filename='PageViewsonEnglishWikipedia')
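To keep a shareable copy of the interactive chart outside the notebook, the same figure can also be written to a standalone HTML file (a sketch; the output file name is illustrative):
plotly.offline.plot(fig, filename='en-wikipedia_traffic_200801-201709.html', auto_open=False)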