HW1: Data Curation

The goal of this assignment is to construct, analyze, and publish a dataset of monthly traffic on English Wikipedia from January 1, 2008 through September 30, 2017.

Step 1: Data acquisition

The URLs we need come from the Wikimedia REST API. The legacy Pagecounts API provides access to desktop and mobile traffic data from January 2008 through July 2016. The Pageviews API provides access to desktop, mobile web, and mobile app traffic data from July 2015 through September 2017.
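
The five request URLs in Step 1 below are hard-coded, but they all follow two endpoint templates. The short sketch below is not part of the original notebook; it only illustrates the URL pattern by filling in the path parameters with str.format, and the example reproduces the desktop Pageviews URL defined later.

# Endpoint templates behind the hard-coded URLs used in Step 1 (illustration only)
PAGEVIEWS_TEMPLATE = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
                      "{project}/{access}/{agent}/{granularity}/{start}/{end}")
PAGECOUNTS_TEMPLATE = ("https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/"
                       "{project}/{access_site}/{granularity}/{start}/{end}")

# Example: monthly desktop pageviews by real users, July 2015 through September 2017
example_url = PAGEVIEWS_TEMPLATE.format(project="en.wikipedia.org", access="desktop",
                                        agent="user", granularity="monthly",
                                        start="2015070100", end="2017093000")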

The first step is to import all the packages needed, as follows:


In [1]:
import urllib.request
import json 
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pandas.io.json import json_normalize

Then I write a function that gets source data from a URL and stores it in a JSON file. The inputs of the function are the source data URL and the file name under which to store the result.


In [2]:
def get_data(dataurl, filename):
    """get data from url then store into jason file in local folder
    Args:
        dataurl (str): url of the data source.
        filename (str): file name you want to store.
    """
    with urllib.request.urlopen(dataurl) as url:
        data = json.loads(url.read().decode())
    with open('raw_data/' + filename, 'w') as outfile:
        json.dump(data, outfile)

Next, I define the URL variables and the corresponding file name variables to pass into the above function.


In [3]:
# source data url and corresponding filename
pageviews_desktop_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/desktop/user/monthly/2015070100/2017093000"
pageviews_desktop_filename = "pageviews_desktop_201507-201709.json"

pageviews_mobileapp_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/mobile-app/user/monthly/2015070100/2017093000"
pageviews_mobileapp_filename = "pageviews_mobile-app_201507-201709.json"

pageviews_mobileweb_url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia.org/mobile-web/user/monthly/2015070100/2017093000"
pageviews_mobileweb_filename = "pageviews_mobile-web_201507-201709.json"

pagecounts_desktop_site_url = "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/en.wikipedia.org/desktop-site/monthly/2008010100/2016073000"
pagecounts_desktop_site_filename = "pagecounts_desktop-site_200801-201607.json"

pagecounts_mobile_site_url = "https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/en.wikipedia.org/mobile-site/monthly/2008010100/2016073000"
pagecounts_mobile_site_filename = "pagecounts_mobile-site_200801-201607.json"

Last, we just need to call the function for each URL. It fetches the data and stores it as a JSON file under the raw_data folder.


In [5]:
# get data from each url and store it as a json file under raw_data
get_data(pageviews_desktop_url, pageviews_desktop_filename)
get_data(pageviews_mobileapp_url, pageviews_mobileapp_filename)
get_data(pageviews_mobileweb_url, pageviews_mobileweb_filename)
get_data(pagecounts_desktop_site_url, pagecounts_desktop_site_filename)
get_data(pagecounts_mobile_site_url, pagecounts_mobile_site_filename)
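
As an optional sanity check (a sketch, not part of the assignment steps, and assuming the downloads above succeeded), one of the files can be reloaded to inspect its structure. Each response is a dict whose 'items' list holds one record per month, which is what Step 2 operates on.

# Optional sanity check: reload one file and inspect its structure
with open('raw_data/' + pageviews_desktop_filename) as f:
    sample = json.load(f)
print(len(sample['items']))          # one record per month
print(sample['items'][0].keys())     # includes 'timestamp' and 'views'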

Step 2: Data processing

For data collected from the Pageviews API, combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month. For all data, separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH).
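
As a minimal illustration of the timestamp handling described above (the cells below apply the same slicing to pandas string columns rather than plain strings):

# Plain-Python sketch of the timestamp split (illustration only)
ts = "2015070100"                 # YYYYMMDDHH, as returned by both APIs
year, month = ts[0:4], ts[4:6]    # ('2015', '07'); day and hour are discarded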

(1) process pageview_mobile_views

Combine the monthly values for mobile-app and mobile-web to create a total mobile traffic count for each month. Separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH). Only keep year, month and views columns.


In [6]:
# load json files
with open('raw_data/' + pageviews_mobileweb_filename) as data_file:    
    views_web = json.load(data_file)

with open('raw_data/' + pageviews_mobileapp_filename) as data_file:    
    views_app = json.load(data_file)
    
# transform json file to pandas.dataframe
views_web_df = json_normalize(views_web['items'])
views_app_df = json_normalize(views_app['items'])

# separate timestamp into year and month
views_web_year = views_web_df['timestamp'].str[0:4]
views_web_month = views_web_df['timestamp'].str[4:6]

# combine web views and app views into total views
mobile_total = views_web_df['views'] + views_app_df['views']

# create pageview_mobile_views dataframe
pageview_mobile_views_df = pd.DataFrame({'year' : views_web_year, 'month' : views_web_month,
                                         'pageview_mobile_views':mobile_total,})
pageview_mobile_views_df = pageview_mobile_views_df[['year', 'month', 'pageview_mobile_views']]
pageview_mobile_views_df[:5]


Out[6]:
year month pageview_mobile_views
0 2015 07 3288755294
1 2015 08 3302333038
2 2015 09 3170203333
3 2015 10 3268499132
4 2015 11 3236601070

(2) process pageview_desktop_views

Separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH). Only keep year, month and views columns.


In [7]:
# load json files
with open('raw_data/' + pageviews_desktop_filename) as data_file:    
    views_desktop = json.load(data_file)

# transform json file to pandas.dataframe
views_desktop_df = json_normalize(views_desktop['items'])

# separate timestamp into year and month
views_desktop_year = views_desktop_df['timestamp'].str[0:4]
views_desktop_month = views_desktop_df['timestamp'].str[4:6]

# create pageview_desktop_views dataframe
pageview_desktop_views_df = pd.DataFrame({'pageview_desktop_views': views_desktop_df['views'], 
                                         'year' : views_desktop_year, 'month' : views_desktop_month})
pageview_desktop_views_df = pageview_desktop_views_df[['year', 'month', 'pageview_desktop_views']]
pageview_desktop_views_df[:5]


Out[7]:
year month pageview_desktop_views
0 2015 07 4376666686
1 2015 08 4332482183
2 2015 09 4485491704
3 2015 10 4477532755
4 2015 11 4287720220

(3) process pageview_all_views

Combine mobile views and desktop views to get all views.


In [8]:
# create pageview_all_views dataframe
pageview_all_views_df = pd.DataFrame({'pageview_all_views': pageview_desktop_views_df['pageview_desktop_views'] 
                                      + pageview_mobile_views_df['pageview_mobile_views'], 
                                         'year' : views_desktop_year, 'month' : views_desktop_month})
pageview_all_views_df = pageview_all_views_df[['year', 'month', 'pageview_all_views']]
pageview_all_views_df[:5]


Out[8]:
year month pageview_all_views
0 2015 07 7665421980
1 2015 08 7634815221
2 2015 09 7655695037
3 2015 10 7746031887
4 2015 11 7524321290

(4) process pagecount_mobile_views

Separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH). Only keep year, month and views columns.


In [9]:
# load json files
with open('raw_data/' + pagecounts_mobile_site_filename) as data_file:    
    counts_mobile = json.load(data_file)

# transform json file to pandas.dataframe
counts_mobile_df = json_normalize(counts_mobile['items'])

# separate timestamp into year and month
counts_mobile_year = counts_mobile_df['timestamp'].str[0:4]
counts_mobile_month = counts_mobile_df['timestamp'].str[4:6]

# create pagecount_mobile_views dataframe
pagecount_mobile_views_df = pd.DataFrame({'pagecount_mobile_views': counts_mobile_df['count'], 
                                         'year' : counts_mobile_year, 'month' : counts_mobile_month})
pagecount_mobile_views_df = pagecount_mobile_views_df[['year', 'month', 'pagecount_mobile_views']]
pagecount_mobile_views_df[:5]


Out[9]:
year month pagecount_mobile_views
0 2014 10 3091546685
1 2014 11 3027489668
2 2014 12 3278950021
3 2015 01 3485302091
4 2015 02 3091534479

(5) process pagecount_desktop_views

Separate the value of timestamp into four-digit year (YYYY) and two-digit month (MM) and discard values for day and hour (DDHH). Only keep year, month and views columns.


In [10]:
with open('raw_data/' + pagecounts_desktop_site_filename) as data_file:    
    counts_desktop = json.load(data_file)

# transform json file to pandas.dataframe
counts_desktop_df = json_normalize(counts_desktop['items'])

# separate timestamp into year and month
counts_desktop_year = counts_desktop_df['timestamp'].str[0:4]
counts_desktop_month = counts_desktop_df['timestamp'].str[4:6]

# create pagecount_desktop_views dataframe
pagecount_desktop_views_df = pd.DataFrame({'pagecount_desktop_views': counts_desktop_df['count'], 
                                         'year' : counts_desktop_year, 'month' : counts_desktop_month})
pagecount_desktop_views_df = pagecount_desktop_views_df[['year', 'month', 'pagecount_desktop_views']]
pagecount_desktop_views_df[:5]


Out[10]:
year month pagecount_desktop_views
0 2008 01 4930902570
1 2008 02 4818393763
2 2008 03 4955405809
3 2008 04 5159162183
4 2008 05 5584691092

(6) process pagecount_all_views

Combine mobile views and desktop views to get all views. Because the legacy mobile-site counts only begin in October 2014, the mobile counts are left-joined onto the desktop counts and missing months are filled with 0 before summing.


In [11]:
# create pagecount_all_views dataframe by merging desktop and mobile counts
pagecount_all_views_df = pd.merge(pagecount_desktop_views_df, pagecount_mobile_views_df, how='left', on=['year', 'month'])
pagecount_all_views_df.fillna(0, inplace=True)
pagecount_all_views_df['pagecount_desktop_views'] += pagecount_all_views_df['pagecount_mobile_views']
pagecount_all_views_df = pagecount_all_views_df.rename(columns = {'pagecount_desktop_views':'pagecount_all_views'})
del pagecount_all_views_df['pagecount_mobile_views']
pagecount_all_views_df.pagecount_all_views = pagecount_all_views_df.pagecount_all_views.astype(int)
pagecount_all_views_df = pagecount_all_views_df[['year', 'month', 'pagecount_all_views']]
pagecount_all_views_df[:5]


Out[11]:
year month pagecount_all_views
0 2008 01 4930902570
1 2008 02 4818393763
2 2008 03 4955405809
3 2008 04 5159162183
4 2008 05 5584691092

(7) merge all dataframes

Merge all six dataframes on year and month; months where a given source has no data become NaN and are replaced with 0.


In [12]:
df_list = [pagecount_all_views_df, pagecount_mobile_views_df, pagecount_desktop_views_df,
          pageview_all_views_df, pageview_mobile_views_df, pageview_desktop_views_df]
df = df_list[0]

# merge each dataframe
for i in range(1, len(df_list)):
    df = pd.merge(df, df_list[i], how='left', on=['year', 'month'])

# replace NaN with 0
df.fillna(0, inplace=True)

# set column type to int
df.pagecount_mobile_views = df.pagecount_mobile_views.astype(int)
df.pageview_all_views = df.pageview_all_views.astype(int)
df.pageview_mobile_views = df.pageview_mobile_views.astype(int)
df.pageview_desktop_views = df.pageview_desktop_views.astype(int)
df[:5]


Out[12]:
year month pagecount_all_views pagecount_mobile_views pagecount_desktop_views pageview_all_views pageview_mobile_views pageview_desktop_views
0 2008 01 4930902570 0 4930902570 0 0 0
1 2008 02 4818393763 0 4818393763 0 0 0
2 2008 03 4955405809 0 4955405809 0 0 0
3 2008 04 5159162183 0 5159162183 0 0 0
4 2008 05 5584691092 0 5584691092 0 0 0

(8) output csv file

Combine all data into a single CSV file with the following headers:

Column                     Value
year                       YYYY
month                      MM
pagecount_all_views        num_views
pagecount_desktop_views    num_views
pagecount_mobile_views     num_views
pageview_all_views         num_views
pageview_desktop_views     num_views
pageview_mobile_views      num_views

In [13]:
# write the final table without the pandas index so only the listed headers appear
df.to_csv('cleaned_data/' + 'en-wikipedia_traffic_200801-201709.csv', sep=',', index=False)
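
As an optional check (a sketch assuming the cleaned_data folder exists and the CSV above was written successfully), the file can be read back to confirm its headers and date range:

# Optional check: read the CSV back and confirm the headers and date range
check = pd.read_csv('cleaned_data/en-wikipedia_traffic_200801-201709.csv')
print(list(check.columns))                        # the eight columns listed above
print(check[['year', 'month']].iloc[[0, -1]])     # first and last month covered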

Step 3: Analysis

To analyze traffic on English Wikipedia, I generate a visualization that plots mobile views, desktop views, and all views over time, for both the legacy Pagecounts data (dashed lines) and the Pageviews data (solid lines).


In [14]:
# replace 0 with NaN so months with no data are left out of the plot
df = df.replace(0, np.nan)
# build a datetime series from the year and month strings for the x-axis
date = pd.to_datetime(df.year + df.month, format = '%Y%m')

In [15]:
years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month
fmt = mdates.DateFormatter('%Y')

# make plot
fig, ax = plt.subplots()

# dashed lines: legacy Pagecounts data; solid lines: Pageviews data
ax.plot(date, df['pagecount_mobile_views'], color = 'green', linestyle = '--', alpha=0.7, label='pagecount_mobile_views')
ax.plot(date, df['pagecount_desktop_views'], color = 'blue', linestyle = '--', alpha=0.7, label='pagecount_desktop_views')
ax.plot(date, df['pagecount_all_views'], color = 'red', linestyle = '--', alpha=0.7, label='pagecount_all_views')
ax.plot(date, df['pageview_all_views'], color = 'red', alpha=0.7, label='pageview_all_views')
ax.plot(date, df['pageview_mobile_views'], color = 'green', alpha=0.7, label='pageview_mobile_views')
ax.plot(date, df['pageview_desktop_views'], color = 'blue', alpha=0.7, label='pageview_desktop_views')


# format the ticks
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(fmt)
ax.xaxis.set_minor_locator(months)

datemin = datetime.date(date.min().year, 1, 1)
datemax = datetime.date(date.max().year + 1, 1, 1)
ax.set_xlim(datemin, datemax)
fig.autofmt_xdate()

# set plot
fig.set_size_inches(13, 8)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, prop={'size': 15}, borderaxespad=0.)
fig.suptitle('Page Views on English Wikipedia', fontsize=20)
ax.set_xlabel('Year')
ax.set_ylabel('Page Views')
ax.xaxis.label.set_size(15)
ax.yaxis.label.set_size(15)
plt.show()


From the figure above, we can see that desktop views are in general much higher than mobile views. Total views peaked around October 2013, with a second-highest point around 2011, and reached their lowest point around August 2009. Starting in May 2015, a new pageview definition took effect, which eliminated all crawler traffic; the dashed lines in the figure mark the old definition (Pagecounts) and the solid lines mark the new definition (Pageviews).