English Wikipedia page views, 2008 - 2017

In this assignment, I analyzed traffic on English Wikipedia from 2008 to 2017 by building a visualization from data retrieved through Wikimedia's Analytics/AQS APIs. This Jupyter notebook starts with the API pulls that obtain the raw data, followed by a series of data processing steps, and ends with a visualization of the Wikipedia traffic over time.

Step 1: Data Acquisition

This step retrieves traffic data on English Wikipedia from 2008 to 2017 using Wikimedia's Legacy Pagecount API and Pageview API.


In [2]:
# import all python libraries needed in this analysis
import requests
import json
import pandas as pd
from pandas import json_normalize
import matplotlib.pyplot as plt
from datetime import datetime

In [57]:
# provide my credential for API access
headers={'User-Agent' : 'https://github.com/jasonfeiwang', 'From' : 'fwang16@uw.edu'}

Retrieve data from the Legacy Pagecount API


In [5]:
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

desktop-site data


In [6]:
params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop-site',
            'granularity' : 'monthly',
            'start' : '2008010100',
            'end' : '2016080100' #use the first day of the following month to ensure a full month of data is collected
            }
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()

# save the retrieved data as json file
with open('pagecounts_desktop-site_200801-201607.json', 'w') as outfile:
    json.dump(response, outfile)

mobile-site data


In [7]:
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-site',
            'granularity' : 'monthly',
            'start' : '2008010100',
            'end' : '2016080100' #use the first day of the following month to ensure a full month of data is collected
            }
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()

# save the retrieved data as json file
with open('pagecounts_mobile-site_200801-201607.json', 'w') as outfile:
    json.dump(response, outfile)

Retrieve data from the Pageview API


In [8]:
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

desktop-site data


In [9]:
params = {'project' : 'en.wikipedia.org',
            'access' : 'desktop',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100' #use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()

# save the retrieved data as json file
with open('pageviews_desktop_201507-201709.json', 'w') as outfile:
    json.dump(response, outfile)

mobile-web data


In [10]:
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-web',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100' #use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()

# save the retrieved data as json file
with open('pageviews_mobile-web_201507-201709.json', 'w') as outfile:
    json.dump(response, outfile)

mobile-app data


In [11]:
params = {'project' : 'en.wikipedia.org',
            'access' : 'mobile-app',
            'agent' : 'user',
            'granularity' : 'monthly',
            'start' : '2015070100',
            'end' : '2017100100' #use the first day of the following month to ensure a full month of data is collected
            }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()

# save the retrieved data as json file
with open('pageviews_mobile-app_201507-201709.json', 'w') as outfile:
    json.dump(response, outfile)
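
The five cells above repeat the same fetch-and-save pattern. A small helper along these lines (a sketch; fetch_and_save is a hypothetical name, not part of the original code) would remove the duplication and guarantee that the User-Agent headers are sent with every request:

def fetch_and_save(endpoint, params, filename):
    """Call a Wikimedia AQS endpoint and dump the JSON response to a file."""
    api_call = requests.get(endpoint.format(**params), headers=headers)
    api_call.raise_for_status()  # fail loudly on HTTP errors
    with open(filename, 'w') as outfile:
        json.dump(api_call.json(), outfile)

# example usage, equivalent to the mobile-app cell above:
# fetch_and_save(endpoint, params, 'pageviews_mobile-app_201507-201709.json')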

Step 2: Data Processing

This step merges the legacy Pagecount and Pageview data into one consolidated dataset and performs data cleansing.

read data from json files into pandas dataframes


In [ ]:
# sample code from "https://stackoverflow.com/questions/21104592/json-to-pandas-dataframe"

with open("pageviews_mobile-app_201507-201709.json", 'r') as f:
    json_content = json.load(f)

df_pageviews_mobile_app = json_normalize(json_content['items'])

In [3]:
with open("pageviews_mobile-web_201507-201709.json", 'r') as f:
        json_content = json.load(f)

df_pageviews_mobile_web = json_normalize(json_content['items'])

In [4]:
with open("pageviews_desktop_201507-201709.json", 'r') as f:
        json_content = json.load(f)

df_pageviews_desktop = json_normalize(json_content['items'])

In [5]:
with open("pagecounts_desktop-site_200801-201607.json", 'r') as f:
        json_content = json.load(f)

df_pagecounts_desktop = json_normalize(json_content['items'])

In [6]:
with open("pagecounts_mobile-site_200801-201607.json", 'r') as f:
        json_content = json.load(f)

df_pagecounts_mobile = json_normalize(json_content['items'])
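
Since the five load cells differ only in the filename, the same dataframes could be built in a loop (a sketch; the files dictionary and its keys are my own layout):

files = {
    'df_pageviews_mobile_app': 'pageviews_mobile-app_201507-201709.json',
    'df_pageviews_mobile_web': 'pageviews_mobile-web_201507-201709.json',
    'df_pageviews_desktop': 'pageviews_desktop_201507-201709.json',
    'df_pagecounts_desktop': 'pagecounts_desktop-site_200801-201607.json',
    'df_pagecounts_mobile': 'pagecounts_mobile-site_200801-201607.json',
}
frames = {}
for name, filename in files.items():
    with open(filename, 'r') as f:
        # each AQS response stores its records under the 'items' key
        frames[name] = json_normalize(json.load(f)['items'])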

merge the pageviews mobile data together


In [7]:
df_pageviews_mobile = pd.merge(df_pageviews_mobile_app, df_pageviews_mobile_web.loc[:, ['timestamp', 'views']], on = 'timestamp', how = 'inner')
df_pageviews_mobile['views'] = df_pageviews_mobile.views_x + df_pageviews_mobile.views_y
df_pageviews_mobile = df_pageviews_mobile.drop(columns = ['views_x', 'views_y'])

add year and month columns to each dataframe


In [8]:
df_pageviews_mobile['year'] = df_pageviews_mobile['timestamp'].map(lambda x: x[0:4])
df_pageviews_mobile['month'] = df_pageviews_mobile['timestamp'].map(lambda x: x[4:6])

In [9]:
df_pageviews_desktop['year'] = df_pageviews_desktop['timestamp'].map(lambda x: x[0:4])
df_pageviews_desktop['month'] = df_pageviews_desktop['timestamp'].map(lambda x: x[4:6])

In [10]:
df_pagecounts_mobile['year'] = df_pagecounts_mobile['timestamp'].map(lambda x: x[0:4])
df_pagecounts_mobile['month'] = df_pagecounts_mobile['timestamp'].map(lambda x: x[4:6])

In [11]:
df_pagecounts_desktop['year'] = df_pagecounts_desktop['timestamp'].map(lambda x: x[0:4])
df_pagecounts_desktop['month'] = df_pagecounts_desktop['timestamp'].map(lambda x: x[4:6])
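
The same timestamp split is applied to four dataframes; a helper keeps the slicing logic in one place (a sketch; add_year_month is a hypothetical name). AQS timestamps are strings of the form YYYYMMDDHH, so plain string slicing is safe here:

def add_year_month(df):
    """Split the AQS timestamp (YYYYMMDDHH) into year and month columns."""
    df['year'] = df['timestamp'].str[0:4]
    df['month'] = df['timestamp'].str[4:6]

for frame in (df_pageviews_mobile, df_pageviews_desktop,
              df_pagecounts_mobile, df_pagecounts_desktop):
    add_year_month(frame)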

combine the desktop and mobile data for pageview data


In [12]:
df_pageviews = pd.merge(df_pageviews_mobile.loc[:, ['year', 'month', 'views']]
              , df_pageviews_desktop.loc[:, ['year', 'month', 'views']]
              , on = ['year', 'month'], how = 'outer')

df_pageviews.columns = ['year', 'month', 'pageview_mobile_views', 'pageview_desktop_views']

combine the desktop and mobile data for legacy pagecount data


In [14]:
df_pagecounts = pd.merge(df_pagecounts_mobile.loc[:, ['year', 'month', 'count']]
              , df_pagecounts_desktop.loc[:, ['year', 'month', 'count']]
              , on = ['year', 'month'], how = 'outer')

df_pagecounts.columns = ['year', 'month', 'pagecount_mobile_views', 'pagecount_desktop_views']

combine the pageview and legacy pagecount data


In [16]:
df = pd.merge(df_pageviews
              , df_pagecounts
              , on = ['year', 'month'], how = 'outer')

replace null values with 0; add columns for total views; reorder the columns; sort the data by year and month


In [17]:
df = df.fillna(value = 0)
df['pagecount_all_views'] = df.pagecount_mobile_views + df.pagecount_desktop_views
df['pageview_all_views'] = df.pageview_mobile_views + df.pageview_desktop_views

df = df[['year', 'month', 'pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views'
           , 'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']]

df = df.sort_values(by = ['year', 'month'], ascending=[1, 1])
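
Before writing the file out, a quick consistency check (a sketch) confirms that each total column equals the sum of its desktop and mobile parts:

assert (df.pagecount_all_views ==
        df.pagecount_desktop_views + df.pagecount_mobile_views).all()
assert (df.pageview_all_views ==
        df.pageview_desktop_views + df.pageview_mobile_views).all()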

save the consolidated dataset as a tab-separated csv file


In [20]:
df.to_csv("en-wikipedia_traffic_200801-201709.csv", sep='\t')

Step 3: Analysis

This step prepares the data for time-series plotting and produces a visualization in a format similar to this sample graph: https://wiki.communitydata.cc/upload/a/a8/PlotPageviewsEN_overlap.png

read the data from the csv file and remove the redundant index column


In [3]:
df = pd.read_csv("en-wikipedia_traffic_200801-201709.csv", sep='\t')
df.drop('Unnamed: 0', axis = 1, inplace = True)

add a time column of datetime type for time-series plotting


In [4]:
df['day'] = '01'
df['time'] = pd.to_datetime(df.loc[:, ['year', 'month', 'day']])
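
Equivalently, the time column could be assembled without the helper day column (a sketch; note that read_csv parses year and month back as integers, hence the string conversion):

df['time'] = pd.to_datetime(df['year'].astype(str) + df['month'].astype(str).str.zfill(2),
                            format='%Y%m')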

change zeros to NaN so that they are not plotted in the visualization


In [5]:
# code from https://stackoverflow.com/questions/18697417/not-plotting-zero-in-matplotlib-or-change-zero-to-none-python

def zero_to_nan(values):
    """Replace every 0 with 'nan' and return a copy."""
    return [float('nan') if x==0 else x for x in values]

df.pageview_all_views = zero_to_nan(df.pageview_all_views)
df.pageview_desktop_views = zero_to_nan(df.pageview_desktop_views)
df.pageview_mobile_views = zero_to_nan(df.pageview_mobile_views)
df.pagecount_all_views = zero_to_nan(df.pagecount_all_views)
df.pagecount_desktop_views = zero_to_nan(df.pagecount_desktop_views)
df.pagecount_mobile_views = zero_to_nan(df.pagecount_mobile_views)
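
The list comprehension works, but pandas can do the same replacement natively and in one step (a sketch, assuming only the six view columns should be touched):

view_cols = ['pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views',
             'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']
df[view_cols] = df[view_cols].replace(0, float('nan'))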

make the plot


In [21]:
# set figure size
plt.rcParams["figure.figsize"] = [20,8]

# format plot edges and facecolor
# plt.rcParams['axes.linewidth'] = 2
# plt.rcParams['axes.edgecolor'] = 'black'
# plt.rcParams['axes.facecolor'] = 'lightgrey'

# format the legend
# plt.rcParams['legend.facecolor'] = 'grey'
# plt.rcParams['legend.edgecolor'] = 'black'
# plt.rcParams['legend.borderpad'] = 0.4

# plot the 6 data series on a single axis (3 for pageview data, 3 for legacy pagecount data)
fig, ax = plt.subplots()
line1, = ax.plot(df.time, df.pagecount_desktop_views/1000000, '--', linewidth=2,
                 label = '_nolegend_', color = 'green')
line2, = ax.plot(df.time, df.pagecount_mobile_views/1000000, '--', linewidth=2,
                 label = '_nolegend_', color = 'blue')
line3, = ax.plot(df.time, df.pagecount_all_views/1000000, '--', linewidth=2,
                 label = '_nolegend_', color = 'black')
line4, = ax.plot(df.time, df.pageview_desktop_views/1000000, '-', linewidth=2,
                 label='main site', color = 'green')
line5, = ax.plot(df.time, df.pageview_mobile_views/1000000, '-', linewidth=2,
                 label='mobile site', color = 'blue')
line6, = ax.plot(df.time, df.pageview_all_views/1000000, '-', linewidth=2,
                 label='total', color = 'black')

# add vertical lines to mark the beginning of each year
xcoords = df.time.iloc[::12]
for xc in xcoords:
    plt.axvline(x=xc, color = 'black', linewidth=1)

# set the range of the x and y axes
ax.set_xlim([df.time.iloc[0], df.time.iloc[-1]])
ax.set_ylim([0, 12000])

# add title and footnote
ax.set_title("Page Views on English Wikipedia (in millions)", size = 20)
ax.text(0.1, -0.15, 'May 2015: a new pageview definition took effect, which eliminated all crawler traffic. Dashed lines mark the old definition.',
        verticalalignment='bottom', horizontalalignment='left',
        transform=ax.transAxes,
        color='red', fontsize=15)

# specify legend position and font size
ax.legend(loc='upper left', prop={'size': 15})

# specify the x and y tick label font size
plt.rc('xtick', labelsize=15) 
plt.rc('ytick', labelsize=15) 
plt.grid(True)
plt.show()
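
To keep a copy of the chart alongside the notebook, the figure can also be written to disk (a sketch; the output filename is my own choice):

fig.savefig('en-wikipedia_traffic_200801-201709.png', dpi=150, bbox_inches='tight')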