Urls we need are from wikipedia API. The legacy Pagecounts API provides access to desktop and mobile traffic data from January 2008 through July 2016. The Pageviews API provides access to desktop, mobile web, and mobile app traffic data from July 2015 through September 2017.
First thing to do is to import all packages needed as follows:
In [1]:
import urllib.request
import json
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pandas.io.json import json_normalize
Then I write a function to get scource data from url and then store data into jason files. The inputs of the function are source data url and corresponding file name you want to store.
In [2]:
def get_data(dataurl, filename):
"""get data from url then store into jason file in local folder
Args:
dataurl (str): url of the data source.
filename (str): file name you want to store.
"""
with urllib.request.urlopen(dataurl) as url:
data = json.loads(url.read().decode())
with open('raw_data/' + filename, 'w') as outfile:
json.dump(data, outfile)
Next, I generate url variables and corresponding file name variables to put into above function.
In [3]:
# source data url and corresponding filename
pageviews_desktop_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/desktop/user/monthly/2015070100/2017093000"
pageviews_desktop_filename = "pageviews_desktop_201507-201709.json"
pageviews_mobileapp_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/mobile-app/user/monthly/2015070100/2017093000"
pageviews_mobileapp_filename = "pageviews_mobile-app_201507-201709.json"
pageviews_mobileweb_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/mobile-web/user/monthly/2015070100/2017093000"
pageviews_mobileweb_filename = "pageviews_mobile-web_201507-201709.json"
pagecounts_desktop_site_url = "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/en.wikipedia.org/desktop-site/monthly/2008010100/2016073000"
pagecounts_desktop_site_filename = "pagecounts_desktop-site_200801-201607.json"
pagecounts_mobile_site_url = "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/en.wikipedia.org/mobile-site/monthly/2008010100/2016073000"
pagecounts_mobile_site_filename = "pagecounts_mobile-site_200801-201607.json"
Last, we just need to call the function. We get the data from url and then store it into jason file under data
folder.
In [5]:
# get data from url then store it into jason file
get_data(pageviews_desktop_url, pageviews_desktop_filename)
get_data(pageviews_mobileapp_url, pageviews_mobileapp_filename)
get_data(pageviews_mobileweb_url, pageviews_mobileweb_filename)
get_data(pagecounts_desktop_site_url, pagecounts_desktop_site_filename)
get_data(pagecounts_mobile_site_url, pagecounts_mobile_site_filename)
For data collected from the Pageviews API, combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month. For all data, separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH).
Combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month. Separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH). Only keep year, month and views columns.
In [6]:
# load json files
with open('raw_data/' + pageviews_mobileweb_filename) as data_file:
views_web = json.load(data_file)
with open('raw_data/' + pageviews_mobileapp_filename) as data_file:
views_app = json.load(data_file)
# transform json file to pandas.dataframe
views_web_df = json_normalize(views_web['items'])
views_app_df = json_normalize(views_app['items'])
# seperate timestamp into year and month
views_web_year = views_web_df['timestamp'].str[0:4]
views_web_month = views_web_df['timestamp'].str[4:6]
# combine web views and app views into total views
mobile_total = views_web_df['views'] + views_app_df['views']
# create pageview_mobile_views dataframe
pageview_mobile_views_df = pd.DataFrame({'year' : views_web_year, 'month' : views_web_month,
'pageview_mobile_views':mobile_total,})
pageview_mobile_views_df = pageview_mobile_views_df[['year', 'month', 'pageview_mobile_views']]
pageview_mobile_views_df[:5]
Out[6]:
In [7]:
# load json files
with open('raw_data/' + pageviews_desktop_filename) as data_file:
views_desktop = json.load(data_file)
# transform json file to pandas.dataframe
views_desktop_df = json_normalize(views_desktop['items'])
# seperate timestamp into year and month
views_desktop_year = views_desktop_df['timestamp'].str[0:4]
views_desktop_month = views_desktop_df['timestamp'].str[4:6]
# create pageview_desktop_views dataframe
pageview_desktop_views_df = pd.DataFrame({'pageview_desktop_views': views_desktop_df['views'],
'year' : views_desktop_year, 'month' : views_desktop_month})
pageview_desktop_views_df = pageview_desktop_views_df[['year', 'month', 'pageview_desktop_views']]
pageview_desktop_views_df[:5]
Out[7]:
In [8]:
# create pageview_all_views dataframe
pageview_all_views_df = pd.DataFrame({'pageview_all_views': pageview_desktop_views_df['pageview_desktop_views']
+ pageview_mobile_views_df['pageview_mobile_views'],
'year' : views_desktop_year, 'month' : views_desktop_month})
pageview_all_views_df = pageview_all_views_df[['year', 'month', 'pageview_all_views']]
pageview_all_views_df[:5]
Out[8]:
In [9]:
# load json files
with open('raw_data/' + pagecounts_mobile_site_filename) as data_file:
counts_mobile = json.load(data_file)
# transform json file to pandas.dataframe
counts_mobile_df = json_normalize(counts_mobile['items'])
# seperate timestamp into year and month
counts_mobile_year = counts_mobile_df['timestamp'].str[0:4]
counts_mobile_month = counts_mobile_df['timestamp'].str[4:6]
# create pageview_desktop_views dataframe
pagecount_mobile_views_df = pd.DataFrame({'pagecount_mobile_views': counts_mobile_df['count'],
'year' : counts_mobile_year, 'month' : counts_mobile_month})
pagecount_mobile_views_df = pagecount_mobile_views_df[['year', 'month', 'pagecount_mobile_views']]
pagecount_mobile_views_df[:5]
Out[9]:
In [10]:
with open('raw_data/' + pagecounts_desktop_site_filename) as data_file:
counts_desktop = json.load(data_file)
# transform json file to pandas.dataframe
counts_desktop_df = json_normalize(counts_desktop['items'])
# seperate timestamp into year and month
counts_desktop_year = counts_desktop_df['timestamp'].str[0:4]
counts_desktop_month = counts_desktop_df['timestamp'].str[4:6]
# create pageview_desktop_views dataframe
pagecount_desktop_views_df = pd.DataFrame({'pagecount_desktop_views': counts_desktop_df['count'],
'year' : counts_desktop_year, 'month' : counts_desktop_month})
pagecount_desktop_views_df = pagecount_desktop_views_df[['year', 'month', 'pagecount_desktop_views']]
pagecount_desktop_views_df[:5]
Out[10]:
In [11]:
# create pageview_all_views dataframe
pagecount_all_views_df = pd.merge(pagecount_desktop_views_df, pagecount_mobile_views_df, how='left', on=['year', 'month'])
pagecount_all_views_df.fillna(0, inplace=True)
pagecount_all_views_df['pagecount_desktop_views'] += pagecount_all_views_df['pagecount_mobile_views']
pagecount_all_views_df = pagecount_all_views_df.rename(columns = {'pagecount_desktop_views':'pagecount_all_views'})
del pagecount_all_views_df['pagecount_mobile_views']
pagecount_all_views_df.pagecount_all_views = pagecount_all_views_df.pagecount_all_views.astype(int)
pagecount_all_views_df = pagecount_all_views_df[['year', 'month', 'pagecount_all_views']]
pagecount_all_views_df[:5]
Out[11]:
In [12]:
df_list = [pagecount_all_views_df, pagecount_mobile_views_df, pagecount_desktop_views_df,
pageview_all_views_df, pageview_mobile_views_df, pageview_desktop_views_df]
df = df_list[0]
# merge each dataframe
for i in range(1, len(df_list)):
df = pd.merge(df, df_list[i], how='left', on=['year', 'month'])
# replace NAN into 0
df.fillna(0, inplace=True)
# set column type to int
df.pagecount_mobile_views = df.pagecount_mobile_views.astype(int)
df.pageview_all_views = df.pageview_all_views.astype(int)
df.pageview_mobile_views = df.pageview_mobile_views.astype(int)
df.pageview_desktop_views = df.pageview_desktop_views.astype(int)
df[:5]
Out[12]:
Combine all data into a single CSV file with the following headers:
Column | Value |
---|---|
year | YYYY |
month | MM |
pagecount_all_views | nums_views |
pagecount_desktop_views | nums_views |
pagecount_mobile_views | nums_views |
pageview_all_views | nums_views |
pageview_desktop_views | nums_views |
pageview_mobile_views | nums_views |
In [13]:
df.to_csv('cleaned_data/' + 'en-wikipedia_traffic_200801-201709.csv', sep=',')
To analyze pageviews of English Wikipedia, I generate a visualization to plot mobile views, desktop views, and all views.
In [14]:
df = df.replace(0, np.nan)
date = pd.to_datetime(df.year + df.month, format = '%Y%m')
In [15]:
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
fmt = mdates.DateFormatter('%Y')
# make plot
fig, ax = plt.subplots()
ax.plot(date, df['pagecount_mobile_views'], color = 'green', linestyle = '--', alpha=0.7)
ax.plot(date, df['pagecount_desktop_views'], color = 'blue', linestyle = '--', alpha=0.7)
ax.plot(date, df['pagecount_all_views'], color = 'red', linestyle = '--', alpha=0.7)
ax.plot(date, df['pageview_all_views'], color = 'red', alpha=0.7)
ax.plot(date, df['pageview_mobile_views'], color = 'green', alpha=0.7)
ax.plot(date, df['pageview_desktop_views'], color = 'blue', alpha=0.7)
# format the ticks
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(fmt)
ax.xaxis.set_minor_locator(months)
datemin = datetime.date(date.min().year, 1, 1)
datemax = datetime.date(date.max().year + 1, 1, 1)
ax.set_xlim(datemin, datemax)
fig.autofmt_xdate()
# set plot
fig.set_size_inches(13, 8)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, prop={'size': 15}, borderaxespad=0.)
fig.suptitle('Page Views on English Wikipedia', fontsize=20)
ax.set_xlabel('Year')
ax.set_ylabel('Page Views')
ax.xaxis.label.set_size(15)
ax.yaxis.label.set_size(15)
plt.show()
From the figure above, we can see that in general desktop views are much higher than mobile views. For all views, we got the highest point around October 2013, and second highest point around 2011. And we got the lowest point around August 2009. From May 2015, a new pageview definition took effect, which elimilated all crawler traffic. Dashed lines in the figure mark old definition.