A1 Data Curation

Step 1: Data Acquisition


In [1]:
# Import packages that will be used in this assignment
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

To get monthly traffic data for English Wikipedia from January 2008 through September 2017, we need two API endpoints: the Pagecounts API and the Pageviews API. The Pagecounts API provides monthly desktop and mobile traffic data from January 2008 through July 2016, and the Pageviews API provides monthly desktop, mobile-web, and mobile-app traffic data from July 2015 through September 2017. Once the request parameters are set, each API returns the traffic data in JSON format. The code below gets all pagecounts for English Wikipedia accessed through desktop from January 2008 through July 2016.


In [2]:
# Collect desktop traffic data from January 2008 through July 2016 using the Pagecounts API
endpoint_pagecounts = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

params_pc_desktop = {
                    'project' : 'en.wikipedia.org',
                    'access' : 'desktop-site',
                    'granularity' : 'monthly',
                    'start' : '2008010100',
                    'end' : '2016080100'  # use the first day of the following month to ensure a full month of data is collected
                    }

api_call = requests.get(endpoint_pagecounts.format(**params_pc_desktop))
response_pc_desktop = api_call.json()
with open('pagecounts_desktop-site_200801-201607.json', 'w') as outfile:
    json.dump(response_pc_desktop, outfile)
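Before moving on, it can help to confirm the shape of what came back. A minimal sanity check, using a hand-written sample that mirrors the two fields the processing step below relies on (`timestamp` and `count`; the live response may carry additional fields):

```python
# Sample mirroring the Pagecounts response shape used later in this notebook:
# a top-level 'items' list whose entries carry 'timestamp' and 'count'.
sample_response = {
    'items': [
        {'timestamp': '2008010100', 'count': 4930902570},
        {'timestamp': '2008020100', 'count': 4818393763},
    ]
}

for item in sample_response['items']:
    # Timestamps are 10 characters: YYYYMMDDHH
    assert len(item['timestamp']) == 10
    assert isinstance(item['count'], int)
```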

The code below gets all pagecounts for English Wikipedia accessed through mobile from January 2008 through July 2016.


In [3]:
# Collect mobile traffic data from January 2008 through July 2016 using the Pagecounts API
endpoint_pagecounts = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

params_pc_mobile = {
                    'project' : 'en.wikipedia.org',
                    'access' : 'mobile-site',
                    'granularity' : 'monthly',
                    'start' : '2008010100',
                    'end' : '2016080100'
                    }

api_call = requests.get(endpoint_pagecounts.format(**params_pc_mobile))
response_pc_mobile = api_call.json()
with open('pagecounts_mobile-site_200801-201607.json', 'w') as outfile:
    json.dump(response_pc_mobile, outfile)

The code below gets all pageviews for English Wikipedia accessed through desktop from July 2015 through September 2017. Note that by setting the agent parameter to 'user', the data excludes traffic from web crawlers and spiders.


In [4]:
# Collect desktop traffic data from July 2015 through September 2017 using the Pageviews API
endPoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

params_pv_desktop = {
        'project' : 'en.wikipedia.org',
        'access' : 'desktop',
        'agent' : 'user',
        'granularity' : 'monthly',
        'start' : '2015070100',
        'end' : '2017100100'
        }

api_call = requests.get(endPoint_pageviews.format(**params_pv_desktop), headers=headers)
response_pv_desktop = api_call.json()
with open('pageviews_desktop_201507-201709.json', 'w') as outfile:
    json.dump(response_pv_desktop, outfile)

The code below gets all pageviews for English Wikipedia accessed through the mobile website from July 2015 through September 2017. Again, the data excludes traffic from web crawlers and spiders.


In [5]:
# Collect mobile web traffic data from July 2015 through September 2017 using the Pageviews API
endPoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

params_pv_mobile_web = {
        'project' : 'en.wikipedia.org',
        'access' : 'mobile-web',
        'agent' : 'user',
        'granularity' : 'monthly',
        'start' : '2015070100',
        'end' : '2017100100'
        }

api_call = requests.get(endPoint_pageviews.format(**params_pv_mobile_web), headers=headers)
response_pv_mobile_web = api_call.json()
with open('pageviews_mobile-web_201507-201709.json', 'w') as outfile:
    json.dump(response_pv_mobile_web, outfile)

The code below gets all pageviews for English Wikipedia accessed through the mobile app from July 2015 through September 2017. Again, the data excludes traffic from web crawlers and spiders.


In [6]:
# Collect mobile app traffic data from July 2015 through September 2017 using the Pageviews API
endPoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

params_pv_mobile_app = {
        'project' : 'en.wikipedia.org',
        'access' : 'mobile-app',
        'agent' : 'user',
        'granularity' : 'monthly',
        'start' : '2015070100',
        'end' : '2017100100'
        }

api_call = requests.get(endPoint_pageviews.format(**params_pv_mobile_app), headers=headers)
response_pv_mobile_app = api_call.json()
with open('pageviews_mobile-app_201507-201709.json', 'w') as outfile:
    json.dump(response_pv_mobile_app, outfile)
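All five downloads above follow the same request-then-cache pattern, so they could be folded into one helper. A sketch, not the notebook's actual structure (`fetch_json` is a hypothetical name):

```python
import json
import requests

def fetch_json(endpoint, params, headers=None, out_path=None):
    """Fetch a Wikimedia REST endpoint and optionally cache the JSON to disk."""
    response = requests.get(endpoint.format(**params), headers=headers)
    response.raise_for_status()  # fail loudly instead of caching an error payload
    data = response.json()
    if out_path is not None:
        with open(out_path, 'w') as outfile:
            json.dump(data, outfile)
    return data
```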

Step 2: Data Processing

Now we have 5 JSON files containing the traffic data we're interested in. In this step, we iterate over these 5 JSON files one by one and combine the data into a Python dictionary. The keys of the dictionary are the time stamps (from January 2008 through September 2017), and the value for each key is a list of 5 numbers: pagecounts accessed through desktop, pagecounts accessed through mobile, pageviews accessed through desktop, pageviews accessed through mobile web, and pageviews accessed through mobile app.


In [7]:
data_cleaned = {}

for item in response_pc_desktop['items']:
    timeStamp = item['timestamp']
    data_cleaned[timeStamp] = [item['count'], 0, 0, 0, 0]
        
for item in response_pc_mobile['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][1] = item['count'] 
    else:
        data_cleaned[timeStamp] = [0, item['count'], 0, 0, 0]
    
for item in response_pv_desktop['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][2] = item['views'] 
    else:
        data_cleaned[timeStamp] = [0, 0, item['views'], 0, 0]
        
for item in response_pv_mobile_web['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][3] = item['views'] 
    else:
        data_cleaned[timeStamp] = [0, 0, 0, item['views'], 0]
        
for item in response_pv_mobile_app['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][4] = item['views'] 
    else:
        data_cleaned[timeStamp] = [0, 0, 0, 0, item['views']]
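For reference, the same merge can be sketched with pandas instead of hand-rolled loops. Tiny hypothetical sample lists stand in for the `response_*['items']` data above:

```python
import pandas as pd

# Samples standing in for response_pc_desktop['items'] etc.
pc_desktop_items = [{'timestamp': '2008010100', 'count': 100}]
pc_mobile_items = [{'timestamp': '2008010100', 'count': 40},
                   {'timestamp': '2008020100', 'count': 35}]

frames = []
for items, name in [(pc_desktop_items, 'pc_desktop'),
                    (pc_mobile_items, 'pc_mobile')]:
    frame = pd.DataFrame(items).rename(columns={'count': name})
    frames.append(frame.set_index('timestamp'))

# Outer-join on timestamp; months missing from a source become 0,
# matching the behavior of the loops above.
merged = pd.concat(frames, axis=1).fillna(0).astype(int)
```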

After we get the dictionary, we can convert it into a Pandas dataframe and save the dataframe to a CSV file.


In [8]:
df = pd.DataFrame.from_dict(data_cleaned,  orient='index')
df['timestamp'] = df.index
df['year'] = [t[0:4] for t in df['timestamp']]
df['month'] = [t[4:6] for t in df['timestamp']]
df['pagecount_all_views'] = df[0] + df[1]
df['pagecount_desktop_views'] = df[0]
df['pagecount_mobile_views'] = df[1]
df['pageview_all_views'] = df[2] + df[3] + df[4]
df['pageview_desktop_views'] = df[2]
df['pageview_mobile_views'] = df[3] + df[4]
df = df.loc[:, 'year' : 'pageview_mobile_views']
df.to_csv('en-wikipedia_traffic_200801-201709.csv', index=False)
df


Out[8]:
year month pagecount_all_views pagecount_desktop_views pagecount_mobile_views pageview_all_views pageview_desktop_views pageview_mobile_views
2008010100 2008 01 4930902570 4930902570 0 0 0 0
2008020100 2008 02 4818393763 4818393763 0 0 0 0
2008030100 2008 03 4955405809 4955405809 0 0 0 0
2008040100 2008 04 5159162183 5159162183 0 0 0 0
2008050100 2008 05 5584691092 5584691092 0 0 0 0
2008060100 2008 06 5712104279 5712104279 0 0 0 0
2008070100 2008 07 5306302874 5306302874 0 0 0 0
2008080100 2008 08 5140155519 5140155519 0 0 0 0
2008090100 2008 09 5479533823 5479533823 0 0 0 0
2008100100 2008 10 5679440782 5679440782 0 0 0 0
2008110100 2008 11 5415832071 5415832071 0 0 0 0
2008120100 2008 12 5211708451 5211708451 0 0 0 0
2009010100 2009 01 5802681551 5802681551 0 0 0 0
2009020100 2009 02 5547320860 5547320860 0 0 0 0
2009030100 2009 03 6295159057 6295159057 0 0 0 0
2009040100 2009 04 5988817321 5988817321 0 0 0 0
2009050100 2009 05 6267516733 6267516733 0 0 0 0
2009060100 2009 06 5818924182 5818924182 0 0 0 0
2009070100 2009 07 5801646978 5801646978 0 0 0 0
2009080100 2009 08 5790850384 5790850384 0 0 0 0
2009090100 2009 09 4057515768 4057515768 0 0 0 0
2009100100 2009 10 6016107147 6016107147 0 0 0 0
2009110100 2009 11 5768486910 5768486910 0 0 0 0
2009120100 2009 12 5426505977 5426505977 0 0 0 0
2010010100 2010 01 5703465285 5703465285 0 0 0 0
2010020100 2010 02 5762451418 5762451418 0 0 0 0
2010030100 2010 03 6661347946 6661347946 0 0 0 0
2010040100 2010 04 6618552152 6618552152 0 0 0 0
2010050100 2010 05 6410578775 6410578775 0 0 0 0
2010060100 2010 06 4898035014 4898035014 0 0 0 0
... ... ... ... ... ... ... ... ...
2015040100 2015 04 9421035574 6198945657 3222089917 0 0 0
2015050100 2015 05 9657871297 6323801814 3334069483 0 0 0
2015060100 2015 06 8203576103 5165413640 3038162463 0 0 0
2015070100 2015 07 8483698717 5229226022 3254472695 7665421980 4376666686 3288755294
2015080100 2015 08 8304022031 5035534449 3268487582 7634815221 4332482183 3302333038
2015090100 2015 09 8582061182 5409631355 3172429827 7655695037 4485491704 3170203333
2015100100 2015 10 8781786976 5535704471 3246082505 7746031887 4477532755 3268499132
2015110100 2015 11 8515190628 5296956116 3218234512 7524321290 4287720220 3236601070
2015120100 2015 12 8651858036 5264446173 3387411863 7476287344 4100012037 3376275307
2016010100 2016 01 9309261244 5569632502 3739628742 8154016303 4436179457 3717836846
2016020100 2016 02 8680940753 5347709361 3333231392 7585859457 4250997185 3334862272
2016030100 2016 03 8827529692 5407676056 3419853636 7673274617 4286590426 3386684191
2016040100 2016 04 8873620523 5572235399 3301385124 7408147859 4149383857 3258764002
2016050100 2016 05 8748968139 5330532334 3418435805 7586811330 4191778094 3395033236
2016060100 2016 06 8347710510 4975092447 3372618063 7243630656 3888839711 3354790945
2016070100 2016 07 8864627560 5363966439 3500661121 7834439589 4337865827 3496573762
2016080100 2016 08 0 0 0 8210865519 4695046216 3515819303
2016090100 2016 09 0 0 0 7528292279 4135006498 3393285781
2016100100 2016 10 0 0 0 7871021581 4361737690 3509283891
2016110100 2016 11 0 0 0 7983113161 4392068236 3591044925
2016120100 2016 12 0 0 0 7986152433 4209608578 3776543855
2017010100 2017 01 0 0 0 8753941940 4521980398 4231961542
2017020100 2017 02 0 0 0 7738463562 4026702163 3711761399
2017030100 2017 03 0 0 0 8223465891 4319971902 3903493989
2017040100 2017 04 0 0 0 7591080111 3951456992 3639623119
2017050100 2017 05 0 0 0 7874558299 4187870579 3686687720
2017060100 2017 06 0 0 0 7123934190 3604550997 3519383193
2017070100 2017 07 0 0 0 7290503797 3565444544 3725059253
2017080100 2017 08 0 0 0 7196978615 3575572313 3621406302
2017090100 2017 09 0 0 0 7079052261 3547447892 3531604369

117 rows × 8 columns
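As a spot check on the table above, each "all" column should equal the sum of its components. The July 2015 row, the first month where the two APIs overlap, can be verified by hand:

```python
# Values copied from the 2015070100 row of the combined table.
row_201507 = {
    'pagecount_all_views': 8483698717,
    'pagecount_desktop_views': 5229226022,
    'pagecount_mobile_views': 3254472695,
    'pageview_all_views': 7665421980,
    'pageview_desktop_views': 4376666686,
    'pageview_mobile_views': 3288755294,
}

assert row_201507['pagecount_all_views'] == (row_201507['pagecount_desktop_views']
                                             + row_201507['pagecount_mobile_views'])
assert row_201507['pageview_all_views'] == (row_201507['pageview_desktop_views']
                                            + row_201507['pageview_mobile_views'])
```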

Step 3: Analysis

In the final step, we make a time series plot of the data processed above. The X-axis of the plot is a date range, and the Y-axis is the amount of traffic in millions (the data are scaled down by a factor of 1,000,000).


In [9]:
dateRange = pd.date_range('2008-01', '2017-10', freq='M')
scale = 1e-6
sns.set_style("whitegrid")
fig = plt.figure(figsize=(18, 12))
plt.plot(dateRange, df['pagecount_all_views'] * scale, linestyle=':', label='pagecount_all_views')
plt.plot(dateRange, df['pagecount_desktop_views'] * scale, label='pagecount_desktop_views')
plt.plot(dateRange, df['pagecount_mobile_views'] * scale, label='pagecount_mobile_views')
plt.plot(dateRange, df['pageview_all_views'] * scale, linestyle=':', label='pageview_all_views')
plt.plot(dateRange, df['pageview_desktop_views'] * scale, label='pageview_desktop_views')
plt.plot(dateRange, df['pageview_mobile_views'] * scale, label='pageview_mobile_views')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Amount of Traffic (in millions)')
fig.savefig('en-wikipedia_traffic_200801-201709.jpg')


