Here I retrieve, aggregate, and visualize the number of monthly visitors to English Wikipedia from January 2008 through September 2017. I group the data by how the site was accessed: via the desktop website or via mobile, which includes the mobile website and the mobile application. I also present total visits by month, which is the sum of monthly desktop and mobile visits. I visualize the results (see below) and save the output to a .csv file.
This work is meant to be fully reproducible. I welcome comments if any errors or hurdles are discovered during an attempted reproduction.
In [1]:
import json
import matplotlib.pyplot as plt
import os
import pandas as pd
import requests
%matplotlib inline
The first step in all data projects is data retrieval. For this project we will be downloading data from Wikimedia's REST API. We'll pull from two endpoints:
Before we make our first call, we define an empty dictionary to hold the results. We also define a second dictionary with some contact information which we'll pass to the API calls when they are made.
In [2]:
# create a dictionary to hold API response data
data_dict = dict()
# define headers to pass to the API call
headers={'User-Agent' : '', 'From' : ''}
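One detail worth checking when passing contact information: `requests.get` takes the URL first and query-string parameters second, so a headers dictionary only reaches the server as headers if it is passed by keyword. The sketch below illustrates the difference without touching the network, by preparing (but not sending) two requests. The URL and contact values are placeholders, not the ones used in this project.

```python
import requests

# placeholder values for illustration only (not this project's URL or contact details)
url = 'https://example.org/api'
example_headers = {'User-Agent': 'example-notebook', 'From': 'jane@example.org'}

# passed by keyword: the values travel as HTTP headers
as_headers = requests.Request('GET', url, headers=example_headers).prepare()

# passed as params: the values end up in the query string instead
as_params = requests.Request('GET', url, params=example_headers).prepare()

print(as_headers.headers['User-Agent'])
print(as_params.url)
```

Printing the prepared URL makes it easy to see whether contact details have leaked into the query string.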
The code below performs the following steps on the Pagecount data, for both the desktop and mobile websites:
I chose to perform this step in a loop since we would otherwise be repeating much of the same code.
In [3]:
# define endpoint and access sites
endpoint = '{project}/{access-site}/{granularity}/{start}/{end}'
access_sites = ['desktop-site', 'mobile-site']
# repeat for each access site of interest
for access_site in access_sites:
    # set filename for access-specific API call response JSON file
    filename = 'data/raw/pagecounts_' + access_site + '_200807_201709.json'
    # check if file already exists; load if so, create if not
    if os.path.isfile(filename):
        with open(filename) as json_data:
            response = json.load(json_data)
        print('loaded JSON data from ./' + filename)
    else:
        # define parameters
        params = {'project' : '',
                  'access-site' : access_site, # [all-sites, desktop-site, mobile-site]
                  'granularity' : 'monthly', # [hourly, daily, monthly]
                  'start' : '2008010100',
                  'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
                  }
        # fetch and format data; headers must be passed by keyword (the second positional argument is params)
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        # format and save output as JSON file
        with open(filename, 'w') as f:
            json.dump(response, f)
        print('saved JSON data to ./' + filename)
    # convert to dataframe
    temp_df = pd.DataFrame.from_dict(response['items'])
    temp_df['yyyymm'] = temp_df.timestamp.str[0:6]
    col_name = 'pc_' + access_site
    temp_df.rename(columns={'count': col_name}, inplace=True)
    # save to dictionary for later combination
    data_dict[col_name] = temp_df[['yyyymm', col_name]]
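The "load if the file exists, otherwise fetch and save" step could also be factored into a small helper. The following is a minimal sketch of that idea, not the code used in this notebook: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fake response is made up so the example runs without calling the API.

```python
import json
import os
import tempfile

def load_or_fetch(filename, fetch):
    """Return cached JSON from filename if present; otherwise call fetch() and cache the result."""
    if os.path.isfile(filename):
        with open(filename) as f:
            return json.load(f)
    response = fetch()
    # create the target folder if needed before writing the cache file
    os.makedirs(os.path.dirname(filename) or '.', exist_ok=True)
    with open(filename, 'w') as f:
        json.dump(response, f)
    return response

# stand-in for an API call; records how often it is invoked (made-up response)
calls = []
def fake_fetch():
    calls.append(1)
    return {'items': [{'timestamp': '2008010100', 'count': 1}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'raw', 'example.json')
    first = load_or_fetch(path, fake_fetch)   # fetches and writes the file
    second = load_or_fetch(path, fake_fetch)  # reads the file, no second fetch

print(len(calls), first == second)
```

Passing the fetch step in as a function keeps the caching logic independent of any particular endpoint.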
We now repeat the same process as above, but for the Pageviews data. I chose to perform this step separately from the loop above since I felt there were enough differences between the API call parameters and schema to warrant breaking these steps into two. See Pageview API and Legacy Pagecounts API.
The loop below continues appending dataframes to our data dictionary. When complete, the dictionary contains data for all five API call endpoints.
In [4]:
# define endpoint and access sites
endpoint = '{project}/{access}/{agent}/{granularity}/{start}/{end}'
access_sites = ['desktop', 'mobile-app', 'mobile-web']
# repeat for each access site of interest
for access_site in access_sites:
    # set filename for access-specific API call response JSON file
    filename = 'data/raw/pageviews_' + access_site + '_200807_201709.json'
    # check if file already exists; load if so, create if not
    if os.path.isfile(filename):
        with open(filename) as f:
            response = json.load(f)
        print('loaded JSON data from ./' + filename)
    else:
        # define parameters
        params = {'project' : '',
                  'access' : access_site, # [all-access, desktop, mobile-app, mobile-web]
                  'agent' : 'user', # [all-agents, user, spider]
                  'granularity' : 'monthly', # [hourly, daily, monthly]
                  'start' : '2008010100',
                  'end' : '2017100100' # use the first day of the following month to ensure a full month of data is collected
                  }
        # fetch and format data; headers must be passed by keyword (the second positional argument is params)
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        # format and save output as JSON file
        with open(filename, 'w') as f:
            json.dump(response, f)
        print('saved JSON data to ./' + filename)
    # convert to dataframe
    temp_df = pd.DataFrame.from_dict(response['items'])
    temp_df['yyyymm'] = temp_df.timestamp.str[0:6]
    col_name = 'pv_' + access_site
    temp_df.rename(columns={'views': col_name}, inplace=True)
    # save to dictionary for later combination
    data_dict[col_name] = temp_df[['yyyymm', col_name]]
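To make the difference between the two parameter sets concrete, here is a small sketch of how each endpoint template is filled in with `str.format`. The base URL and project name are placeholders chosen for illustration; only the path templates and parameter names come from the cells above.

```python
# hypothetical base URL and project name, used only for illustration
base = 'https://example.org/metrics/'

pagecounts_template = base + '{project}/{access-site}/{granularity}/{start}/{end}'
pageviews_template = base + '{project}/{access}/{agent}/{granularity}/{start}/{end}'

# parameters shared by both endpoints
common = {'project': 'example.project', 'granularity': 'monthly',
          'start': '2008010100', 'end': '2017100100'}

# 'access-site' is not a valid Python identifier, so it is supplied through ** unpacking
pagecounts_url = pagecounts_template.format(**common, **{'access-site': 'desktop-site'})

# the Pageviews template takes 'access' and 'agent' instead of 'access-site'
pageviews_url = pageviews_template.format(**common, access='desktop', agent='user')

print(pagecounts_url)
print(pageviews_url)
```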
At this point our data consists of five separate dataframes, stored in a single data dictionary. We want to get this all into the same dataframe so we can work with it, export it, and plot it. To do this, we will define a new dataframe (df) which consists of the first dataframe in the dictionary. We'll then loop over the remaining dictionary elements and merge them all together.
The end result is a single dataframe with a single date column (yyyymm) and a column for each of the original five dataframes. The dataframes are merged on the date column so they are sure to align vertically.
In [5]:
keys = list(data_dict.keys())
df = data_dict[keys[0]]
for i in range(1, len(keys)):
    df = df.merge(data_dict[keys[i]], how='outer', on='yyyymm')
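The same outer merge can be tried on toy data to see how months missing from one source are handled. The sketch below uses `functools.reduce` as an alternative to the explicit loop; the frame names and numbers are made up for illustration.

```python
from functools import reduce
import pandas as pd

# toy frames with made-up numbers; each one is missing a different month on purpose
frames = {
    'pc_a': pd.DataFrame({'yyyymm': ['200801', '200802'], 'pc_a': [10, 20]}),
    'pv_b': pd.DataFrame({'yyyymm': ['200801', '200803'], 'pv_b': [1, 3]}),
    'pv_c': pd.DataFrame({'yyyymm': ['200802', '200803'], 'pv_c': [5, 6]}),
}

# fold the frames together pairwise, keeping every month seen in any frame
merged = reduce(lambda left, right: left.merge(right, how='outer', on='yyyymm'),
                frames.values())
merged = merged.sort_values('yyyymm').reset_index(drop=True)

print(merged)
```

Months absent from a source show up as NaN in that source's column, which is why a fill step follows the merge.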
In [6]:
df = df.fillna(0)
We then run the following line to convert the data to integers, since we are working with discrete counts. Omitting this step would result in unnecessary decimal precision in the output .csv file.
In [7]:
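# assign by column label so the integer dtype is kept (assignment via .iloc can retain the float dtype in recent pandas)
count_cols = df.columns[1:]
df[count_cols] = df[count_cols].astype(int)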
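A quick way to confirm the effect of filling and converting is to inspect the CSV text that pandas would write. This is a minimal sketch on made-up numbers; the column names `pc` and `pv` are placeholders.

```python
import pandas as pd

# made-up counts; NaN marks a month with no data from one source
toy = pd.DataFrame({'yyyymm': ['200801', '200802'],
                    'pc': [100.0, float('nan')],
                    'pv': [float('nan'), 250.0]})

# fill missing values with zero, then convert the count columns to integers
count_cols = ['pc', 'pv']
toy[count_cols] = toy[count_cols].fillna(0).astype(int)

# the CSV text now contains whole numbers rather than values like 100.0
csv_lines = toy.to_csv(index=False).splitlines()
print(csv_lines)
```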
Now we perform a few simple string splitting and aggregation steps to create the final dataframe that we'll export to CSV.
In [8]:
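# create dataframe for .csv and plot
# (the month and per-platform columns are the ones referenced by the plotting cells below)
df_new = pd.DataFrame({'year': df['yyyymm'].str[0:4],
                       'month': df['yyyymm'].str[4:6],
                       'pagecount_all_views': df['pc_desktop-site'] + df['pc_mobile-site'],
                       'pagecount_desktop_views': df['pc_desktop-site'],
                       'pagecount_mobile_views': df['pc_mobile-site'],
                       'pageview_all_views': df['pv_desktop'] + df['pv_mobile-app'] + df['pv_mobile-web'],
                       'pageview_desktop_views': df['pv_desktop'],
                       'pageview_mobile_views': df['pv_mobile-app'] + df['pv_mobile-web']})
# reorder columns
df_new = df_new[['year', 'month',
                 'pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views',
                 'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views']]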
Now we'll save the data to CSV, unless the file already exists, in which case we load it instead for transparency. We'll call this data "prelim" since we aren't sure yet whether there are any problems with it.
In [9]:
# set filename for combined data CSV
filename = 'data/en-wikipedia_traffic_200801-201709_prelim.csv'
# check if file already exists; load if so, create if not
if os.path.isfile(filename):
    df_new = pd.read_csv(filename)
    print('loaded CSV data from ./' + filename)
else:
    df_new.to_csv(filename, index=False)
    print('saved CSV data to ./' + filename)
In [10]:
df_new.replace(0, float('nan'), inplace=True)
We also pull out a series that consists of timestamps. This is a necessary step since the plot function won't recognize the current date integers as dates.
In [11]:
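# zero-pad the month so single-digit months loaded from CSV still form a six-character YYYYMM string
yyyymm = pd.to_datetime(df_new['year'].astype(str) + df_new['month'].astype(str).str.zfill(2), format='%Y%m')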
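The conversion from integer year and month columns to timestamps can be checked on a few rows. The sketch below uses made-up rows and assumes the columns hold integers, as they would after being read back from a CSV file.

```python
import pandas as pd

# made-up rows; integer year and month, as they would appear after reading a CSV
toy = pd.DataFrame({'year': [2008, 2008, 2017], 'month': [1, 12, 9]})

# zero-pad the month so every string has the form YYYYMM before parsing
stamps = pd.to_datetime(toy['year'].astype(str) + toy['month'].astype(str).str.zfill(2),
                        format='%Y%m')

print(stamps)
```

Each timestamp falls on the first day of its month, which is what a monthly line plot needs on the x-axis.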
Now we plot the data.
In [12]:
plt.rcParams["figure.figsize"] = (14, 4)
plt.plot(yyyymm, df_new['pagecount_desktop_views']/1e6, 'g--')
plt.plot(yyyymm, df_new['pagecount_mobile_views']/1e6, 'b--')
plt.plot(yyyymm, df_new['pagecount_all_views']/1e6, 'k--')
plt.legend(['main site','mobile site','total'], loc=2, framealpha=1)
plt.plot(yyyymm, df_new['pageview_desktop_views']/1e6, 'g-')
plt.plot(yyyymm, df_new['pageview_mobile_views']/1e6, 'b-')
plt.plot(yyyymm, df_new['pageview_all_views']/1e6, 'k-')
plt.title('Page Views on English Wikipedia (x 1,000,000)')
plt.suptitle('May 2015: a new pageview definition took effect, which eliminated all crawler traffic. Solid lines mark new definition.', y=0.04, color='#b22222');
This looks pretty good! However, we notice in the plot above that there appears to be some bad data in mid-2016.
In [13]:
df_new.loc[(pd.DatetimeIndex(yyyymm).year == 2016)]
Row 103 has much lower values than its neighbors in the three pagecount columns. This month wasn't supposed to be included in the analysis, so these values are probably an artifact of the API request, since we were instructed to set the end date to the first day of the following month.
We therefore set these values to NaN so they won't clutter the analysis.
In [14]:
# update values
df_new.loc[103,['pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views']] = None
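Hard-coding a row index works as long as the row order never changes. As a hedged alternative, the row can be selected by its date instead. The sketch below uses made-up values, and the year and month in the mask are arbitrary choices for illustration rather than a statement about which month row 103 is.

```python
import pandas as pd

# made-up values; the middle row's pagecount entry is deliberately tiny
toy = pd.DataFrame({'year': [2016, 2016, 2016], 'month': [7, 8, 9],
                    'pagecount_all_views': [500.0, 3.0, float('nan')],
                    'pageview_all_views': [700.0, 710.0, 720.0]})

# select the row by its date and blank only the pagecount column
mask = (toy['year'] == 2016) & (toy['month'] == 8)
toy.loc[mask, ['pagecount_all_views']] = float('nan')

print(toy)
```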
In [15]:
That looks better. Let's plot again to see how it looks graphically.
In [16]:
plt.rcParams["figure.figsize"] = (14, 4)
plt.plot(yyyymm, df_new['pagecount_desktop_views']/1e6, 'g--')
plt.plot(yyyymm, df_new['pagecount_mobile_views']/1e6, 'b--')
plt.plot(yyyymm, df_new['pagecount_all_views']/1e6, 'k--')
plt.legend(['main site','mobile site','total'], loc=2, framealpha=1)
plt.plot(yyyymm, df_new['pageview_desktop_views']/1e6, 'g-')
plt.plot(yyyymm, df_new['pageview_mobile_views']/1e6, 'b-')
plt.plot(yyyymm, df_new['pageview_all_views']/1e6, 'k-')
plt.title('Page Views on English Wikipedia (x 1,000,000)')
plt.suptitle('May 2015: a new pageview definition took effect, which eliminated all crawler traffic. Solid lines mark new definition.', y=0.04, color='#b22222')
plt.savefig('en-wikipedia_traffic_200801-201709.png', dpi=80);
That looks better!! So, it seems we have successfully reproduced the plot found here:
In [17]:
# data formatting updates
df_new = df_new.fillna(0)
# assign by column label so the integer dtype is kept
count_cols = df_new.columns[1:]
df_new[count_cols] = df_new[count_cols].astype(int)
# set filename for combined data CSV
filename = 'data/en-wikipedia_traffic_200801-201709.csv'
# check if file already exists; load if so, create if not
if os.path.isfile(filename):
    df_new = pd.read_csv(filename)
    print('loaded CSV data from ./' + filename)
else:
    df_new.to_csv(filename, index=False)
    print('saved CSV data to ./' + filename)
This concludes the analysis.