A1 Data Curation

Step 1: Data Acquisition


In [1]:
# Import packages that will be used in this assignment
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

To get monthly traffic data for English Wikipedia from January 2008 through September 2017, we need two API endpoints: the Pagecounts API and the Pageviews API. The Pagecounts API provides monthly desktop and mobile traffic data from January 2008 through July 2016, and the Pageviews API provides monthly desktop, mobile-web, and mobile-app traffic data from July 2015 through September 2017. Once the request parameters are set, each API returns the traffic data in JSON format. The code below gets all pagecounts for English Wikipedia accessed through desktop from January 2008 through July 2016.


In [2]:
# Collect desktop traffic data from January 2008 through July 2016 using the Pagecounts API
endpoint_pagecounts = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

params_pc_desktop = {
                    'project' : 'en.wikipedia.org',
                    'access' : 'desktop-site',
                    'granularity' : 'monthly',
                    'start' : '2008010100',
                    'end' : '2016080100'  # use the first day of the following month to ensure a full month of data is collected
                    }

api_call = requests.get(endpoint_pagecounts.format(**params_pc_desktop))
response_pc_desktop = api_call.json()
with open('pagecounts_desktop-site_200801-201607.json', 'w') as outfile:
    json.dump(response_pc_desktop, outfile)
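Before moving on, it can help to confirm the shape of what came back. A minimal sanity check, using a hand-written sample that mirrors the two fields the processing step below relies on (`timestamp` and `count`; the live response may carry additional fields):

```python
# Sample mirroring the Pagecounts response shape used later in this notebook:
# a top-level 'items' list whose entries carry 'timestamp' and 'count'.
sample_response = {
    'items': [
        {'timestamp': '2008010100', 'count': 4930902570},
        {'timestamp': '2008020100', 'count': 4818393763},
    ]
}

for item in sample_response['items']:
    # Timestamps are 10 characters: YYYYMMDDHH
    assert len(item['timestamp']) == 10
    assert isinstance(item['count'], int)
```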

The code below gets all pagecounts for English Wikipedia accessed through mobile from January 2008 through July 2016.


In [3]:
# Collect mobile traffic data from January 2008 through July 2016 using the Pagecounts API
endpoint_pagecounts = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access}/{granularity}/{start}/{end}'

params_pc_mobile = {
                    'project' : 'en.wikipedia.org',
                    'access' : 'mobile-site',
                    'granularity' : 'monthly',
                    'start' : '2008010100',
                    'end' : '2016080100'
                    }

api_call = requests.get(endpoint_pagecounts.format(**params_pc_mobile))
response_pc_mobile = api_call.json()
with open('pagecounts_mobile-site_200801-201607.json', 'w') as outfile:
    json.dump(response_pc_mobile, outfile)

The code below gets all pageviews for English Wikipedia accessed through desktop from July 2015 through September 2017. Note that by setting the agent parameter to 'user', the data excludes traffic from web crawlers and spiders.


In [4]:
# Collect desktop traffic data from July 2015 through September 2017 using the Pageviews API
endPoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

params_pv_desktop = {
        'project' : 'en.wikipedia.org',
        'access' : 'desktop',
        'agent' : 'user',
        'granularity' : 'monthly',
        'start' : '2015070100',
        'end' : '2017100100'
        }

api_call = requests.get(endPoint_pageviews.format(**params_pv_desktop), headers=headers)
response_pv_desktop = api_call.json()
with open('pageviews_desktop_201507-201709.json', 'w') as outfile:
    json.dump(response_pv_desktop, outfile)

The code below gets all pageviews for English Wikipedia accessed through the mobile website from July 2015 through September 2017. Again, the data excludes traffic from web crawlers and spiders.


In [5]:
# Collect mobile web traffic data from July 2015 through September 2017 using the Pageviews API
endPoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

params_pv_mobile_web = {
        'project' : 'en.wikipedia.org',
        'access' : 'mobile-web',
        'agent' : 'user',
        'granularity' : 'monthly',
        'start' : '2015070100',
        'end' : '2017100100'
        }

api_call = requests.get(endPoint_pageviews.format(**params_pv_mobile_web), headers=headers)
response_pv_mobile_web = api_call.json()
with open('pageviews_mobile-web_201507-201709.json', 'w') as outfile:
    json.dump(response_pv_mobile_web, outfile)

The code below gets all pageviews for English Wikipedia accessed through the mobile app from July 2015 through September 2017. Again, the data excludes traffic from web crawlers and spiders.


In [6]:
# Collect mobile app traffic data from July 2015 through September 2017 using the Pageviews API
endPoint_pageviews = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers = {'User-Agent' : 'https://github.com/HWNi', 'From' : 'haowen2@uw.edu'}

params_pv_mobile_app = {
        'project' : 'en.wikipedia.org',
        'access' : 'mobile-app',
        'agent' : 'user',
        'granularity' : 'monthly',
        'start' : '2015070100',
        'end' : '2017100100'
        }

api_call = requests.get(endPoint_pageviews.format(**params_pv_mobile_app), headers=headers)
response_pv_mobile_app = api_call.json()
with open('pageviews_mobile-app_201507-201709.json', 'w') as outfile:
    json.dump(response_pv_mobile_app, outfile)
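All five downloads above follow the same request-then-cache pattern, so they could be folded into one helper. A sketch, not the notebook's actual structure (`fetch_json` is a hypothetical name):

```python
import json
import requests

def fetch_json(endpoint, params, headers=None, out_path=None):
    """Fetch a Wikimedia REST endpoint and optionally cache the JSON to disk."""
    response = requests.get(endpoint.format(**params), headers=headers)
    response.raise_for_status()  # fail loudly instead of caching an error payload
    data = response.json()
    if out_path is not None:
        with open(out_path, 'w') as outfile:
            json.dump(data, outfile)
    return data
```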

Step 2: Data Processing

Now we have 5 JSON files containing the traffic data we're interested in. In this step, we iterate over these 5 JSON files one by one and combine the data into a Python dictionary. The keys of the dictionary are the time stamps (from January 2008 through September 2017), and the value for each key is a list of 5 numbers: pagecounts accessed through desktop, pagecounts accessed through mobile, pageviews accessed through desktop, pageviews accessed through mobile web, and pageviews accessed through mobile app.


In [7]:
data_cleaned = {}

for item in response_pc_desktop['items']:
    timeStamp = item['timestamp']
    data_cleaned[timeStamp] = [item['count'], 0, 0, 0, 0]
        
for item in response_pc_mobile['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][1] = item['count'] 
    else:
        data_cleaned[timeStamp] = [0, item['count'], 0, 0, 0]
    
for item in response_pv_desktop['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][2] = item['views'] 
    else:
        data_cleaned[timeStamp] = [0, 0, item['views'], 0, 0]
        
for item in response_pv_mobile_web['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][3] = item['views'] 
    else:
        data_cleaned[timeStamp] = [0, 0, 0, item['views'], 0]
        
for item in response_pv_mobile_app['items']:
    timeStamp = item['timestamp']
    if timeStamp in data_cleaned:
        data_cleaned[timeStamp][4] = item['views'] 
    else:
        data_cleaned[timeStamp] = [0, 0, 0, 0, item['views']]
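For reference, the same merge can be sketched with pandas instead of hand-rolled loops. Tiny hypothetical sample lists stand in for the `response_*['items']` data above:

```python
import pandas as pd

# Samples standing in for response_pc_desktop['items'] etc.
pc_desktop_items = [{'timestamp': '2008010100', 'count': 100}]
pc_mobile_items = [{'timestamp': '2008010100', 'count': 40},
                   {'timestamp': '2008020100', 'count': 35}]

frames = []
for items, name in [(pc_desktop_items, 'pc_desktop'),
                    (pc_mobile_items, 'pc_mobile')]:
    frame = pd.DataFrame(items).rename(columns={'count': name})
    frames.append(frame.set_index('timestamp'))

# Outer-join on timestamp; months missing from a source become 0,
# matching the behavior of the loops above.
merged = pd.concat(frames, axis=1).fillna(0).astype(int)
```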

After we get the dictionary, we can convert it into a Pandas dataframe and save the dataframe to a CSV file.


In [8]:
df = pd.DataFrame.from_dict(data_cleaned,  orient='index')
df['timestamp'] = df.index
df['year'] = [t[0:4] for t in df['timestamp']]
df['month'] = [t[4:6] for t in df['timestamp']]
df['pagecount_all_views'] = df[0] + df[1]
df['pagecount_desktop_views'] = df[0]
df['pagecount_mobile_views'] = df[1]
df['pageview_all_views'] = df[2] + df[3] + df[4]
df['pageview_desktop_views'] = df[2]
df['pageview_mobile_views'] = df[3] + df[4]
df = df.loc[:, 'year' : 'pageview_mobile_views']
df.to_csv('en-wikipedia_traffic_200801-201709.csv', index=False)
df


Out[8]:
year month pagecount_all_views pagecount_desktop_views pagecount_mobile_views pageview_all_views pageview_desktop_views pageview_mobile_views
2008010100 2008 01 4930902570 4930902570 0 0 0 0
2008020100 2008 02 4818393763 4818393763 0 0 0 0
2008030100 2008 03 4955405809 4955405809 0 0 0 0
2008040100 2008 04 5159162183 5159162183 0 0 0 0
2008050100 2008 05 5584691092 5584691092 0 0 0 0
2008060100 2008 06 5712104279 5712104279 0 0 0 0
2008070100 2008 07 5306302874 5306302874 0 0 0 0
2008080100 2008 08 5140155519 5140155519 0 0 0 0
2008090100 2008 09 5479533823 5479533823 0 0 0 0
2008100100 2008 10 5679440782 5679440782 0 0 0 0
2008110100 2008 11 5415832071 5415832071 0 0 0 0
2008120100 2008 12 5211708451 5211708451 0 0 0 0
2009010100 2009 01 5802681551 5802681551 0 0 0 0
2009020100 2009 02 5547320860 5547320860 0 0 0 0
2009030100 2009 03 6295159057 6295159057 0 0 0 0
2009040100 2009 04 5988817321 5988817321 0 0 0 0
2009050100 2009 05 6267516733 6267516733 0 0 0 0
2009060100 2009 06 5818924182 5818924182 0 0 0 0
2009070100 2009 07 5801646978 5801646978 0 0 0 0
2009080100 2009 08 5790850384 5790850384 0 0 0 0
2009090100 2009 09 4057515768 4057515768 0 0 0 0
2009100100 2009 10 6016107147 6016107147 0 0 0 0
2009110100 2009 11 5768486910 5768486910 0 0 0 0
2009120100 2009 12 5426505977 5426505977 0 0 0 0
2010010100 2010 01 5703465285 5703465285 0 0 0 0
2010020100 2010 02 5762451418 5762451418 0 0 0 0
2010030100 2010 03 6661347946 6661347946 0 0 0 0
2010040100 2010 04 6618552152 6618552152 0 0 0 0
2010050100 2010 05 6410578775 6410578775 0 0 0 0
2010060100 2010 06 4898035014 4898035014 0 0 0 0
... ... ... ... ... ... ... ... ...
2015040100 2015 04 9421035574 6198945657 3222089917 0 0 0
2015050100 2015 05 9657871297 6323801814 3334069483 0 0 0
2015060100 2015 06 8203576103 5165413640 3038162463 0 0 0
2015070100 2015 07 8483698717 5229226022 3254472695 7665421980 4376666686 3288755294
2015080100 2015 08 8304022031 5035534449 3268487582 7634815221 4332482183 3302333038
2015090100 2015 09 8582061182 5409631355 3172429827 7655695037 4485491704 3170203333
2015100100 2015 10 8781786976 5535704471 3246082505 7746031887 4477532755 3268499132
2015110100 2015 11 8515190628 5296956116 3218234512 7524321290 4287720220 3236601070
2015120100 2015 12 8651858036 5264446173 3387411863 7476287344 4100012037 3376275307
2016010100 2016 01 9309261244 5569632502 3739628742 8154016303 4436179457 3717836846
2016020100 2016 02 8680940753 5347709361 3333231392 7585859457 4250997185 3334862272
2016030100 2016 03 8827529692 5407676056 3419853636 7673274617 4286590426 3386684191
2016040100 2016 04 8873620523 5572235399 3301385124 7408147859 4149383857 3258764002
2016050100 2016 05 8748968139 5330532334 3418435805 7586811330 4191778094 3395033236
2016060100 2016 06 8347710510 4975092447 3372618063 7243630656 3888839711 3354790945
2016070100 2016 07 8864627560 5363966439 3500661121 7834439589 4337865827 3496573762
2016080100 2016 08 0 0 0 8210865519 4695046216 3515819303
2016090100 2016 09 0 0 0 7528292279 4135006498 3393285781
2016100100 2016 10 0 0 0 7871021581 4361737690 3509283891
2016110100 2016 11 0 0 0 7983113161 4392068236 3591044925
2016120100 2016 12 0 0 0 7986152433 4209608578 3776543855
2017010100 2017 01 0 0 0 8753941940 4521980398 4231961542
2017020100 2017 02 0 0 0 7738463562 4026702163 3711761399
2017030100 2017 03 0 0 0 8223465891 4319971902 3903493989
2017040100 2017 04 0 0 0 7591080111 3951456992 3639623119
2017050100 2017 05 0 0 0 7874558299 4187870579 3686687720
2017060100 2017 06 0 0 0 7123934190 3604550997 3519383193
2017070100 2017 07 0 0 0 7290503797 3565444544 3725059253
2017080100 2017 08 0 0 0 7196978615 3575572313 3621406302
2017090100 2017 09 0 0 0 7079052261 3547447892 3531604369

117 rows × 8 columns
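As a spot check on the table above, each "all" column should equal the sum of its components. The July 2015 row, the first month where the two APIs overlap, can be verified by hand:

```python
# Values copied from the 2015070100 row of the combined table.
row_201507 = {
    'pagecount_all_views': 8483698717,
    'pagecount_desktop_views': 5229226022,
    'pagecount_mobile_views': 3254472695,
    'pageview_all_views': 7665421980,
    'pageview_desktop_views': 4376666686,
    'pageview_mobile_views': 3288755294,
}

assert row_201507['pagecount_all_views'] == (row_201507['pagecount_desktop_views']
                                             + row_201507['pagecount_mobile_views'])
assert row_201507['pageview_all_views'] == (row_201507['pageview_desktop_views']
                                            + row_201507['pageview_mobile_views'])
```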

Step 3: Analysis

In the final step, we make a time series plot of the data processed above. The X-axis of the plot is a date range, and the Y-axis is the amount of traffic in millions (the data are scaled down by a factor of 1,000,000).


In [9]:
dateRange = pd.date_range('2008-01', '2017-10', freq='M')
scale = 1e-6
sns.set_style("whitegrid")
fig = plt.figure(figsize=(18, 12))
plt.plot(dateRange, df['pagecount_all_views'] * scale, linestyle=':', label='pagecount_all_views')
plt.plot(dateRange, df['pagecount_desktop_views'] * scale, label='pagecount_desktop_views')
plt.plot(dateRange, df['pagecount_mobile_views'] * scale, label='pagecount_mobile_views')
plt.plot(dateRange, df['pageview_all_views'] * scale, linestyle=':', label='pageview_all_views')
plt.plot(dateRange, df['pageview_desktop_views'] * scale, label='pageview_desktop_views')
plt.plot(dateRange, df['pageview_mobile_views'] * scale, label='pageview_mobile_views')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Amount of Traffic (in millions)')
fig.savefig('en-wikipedia_traffic_200801-201709.jpg')


