English Wikipedia page views, 2008 - 2017

For this assignment, your job is to analyze traffic on English Wikipedia over time, and then document your process and the resulting dataset and visualization according to best practices for open research that were outlined for you in class.

PageView API

Here, I have queried the APIs multiple times and also converted the results into a dataframe of the required format.

We don't have mobile site date between 2008 and 2014 for pageCounts.


In [1]:
import requests
import pandas

endpoint = 'https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/r1rajiv92', 'From' : 'rajiv92@uw.edu'}

yearMonthCombinations = { '2015' : [ 7, 8, 9, 10, 11, 12],
                          '2016' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2017' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9] }

for accessType in  [ 'desktop', 'mobile-web', 'mobile-app' ]:
    for year in range(2015, 2018):
        for month in yearMonthCombinations[str(year)]:
        
            if int(month / 10) == 0:
                startMonth = ''.join( [ '0', str(month) ] )
                startParam = ''.join([ str(year), startMonth, '0100' ])
                if int((month+1) / 10) == 0:
                    endParam = ''.join([ str(year), '0', str(month+1), '0100' ])
                else:
                    endParam = ''.join([ str(year), str(month+1), '0100' ])
            else:
                startMonth = str(month)
                startParam = ''.join([ str(year), startMonth, '0100' ])
                endParam = ''.join([ str(year), str(month+1), '0100' ])
                if month + 1 == 13:
                    endParam = ''.join( [str(year+1), '01', '0100'])

            params = {'project' : 'en.wikipedia.org',
                    'access' : accessType,
                    'agent' : 'user',
                    'granularity' : 'monthly',
                    'start' : startParam,
                    'end' : endParam
                    }
            api_call = requests.get(endpoint.format(**params))
            response = api_call.json()
        
            for result in response['items']:
                result['year'] = str(year)
                result['month'] = startMonth
        
            if 'PageViewDataFrame' in locals():
                PageViewDataFrame = pandas.concat([ PageViewDataFrame, pandas.DataFrame.from_dict(response['items']) ])
            else:
                PageViewDataFrame = pandas.DataFrame.from_dict(response['items'])

PageCount API

Here, I have queried the APIs multiple times and also converted the results into a dataframe of the required format.


In [2]:
endpoint = 'https://wikimedia.org/api/rest_v1/metrics/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end}'

headers={'User-Agent' : 'https://github.com/r1rajiv92', 'From' : 'rajiv92@uw.edu'}

yearMonthCombinations = { '2008' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2009' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2010' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2011' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2012' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2013' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2014' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2015' : [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                          '2016' : [ 1, 2, 3, 4, 5, 6, 7]
                          }

for accessType in  [ 'desktop-site', 'mobile-site' ]:
    for year in range(2008, 2017):
        for month in yearMonthCombinations[str(year)]:
        
            if int(month / 10) == 0:
                startMonth = ''.join( [ '0', str(month) ] )
                startParam = ''.join([ str(year), startMonth, '0100' ])
                if int((month+1) / 10) == 0:
                    endParam = ''.join([ str(year), '0', str(month+1), '0100' ])
                else:
                    endParam = ''.join([ str(year), str(month+1), '0100' ])
            else:
                startMonth = str(month)
                startParam = ''.join([ str(year), startMonth, '0100' ])
                endParam = ''.join([ str(year), str(month+1), '0100' ])
                if month + 1 == 13:
                    endParam = ''.join( [str(year+1), '01', '0100'])

            params = {'project' : 'en.wikipedia.org',
                    'access-site' : accessType,
                    'granularity' : 'monthly',
                    'start' : startParam,
                    'end' : endParam
                    }
            api_call = requests.get(endpoint.format(**params))
            response = api_call.json()
            if 'items' in response.keys():
                for result in response['items']:
                    result['year'] = str(year)
                    result['month'] = startMonth
            else:
                print('Page Count Data Missing for', accessType, 'on', month, year )
                continue
                
            if 'PageCountDataFrame' in locals():
                PageCountDataFrame = pandas.concat([ PageCountDataFrame, pandas.DataFrame.from_dict(response['items']) ])
            else:
                PageCountDataFrame = pandas.DataFrame.from_dict(response['items'])


Page Count Data Missing for mobile-site on 1 2008
Page Count Data Missing for mobile-site on 2 2008
Page Count Data Missing for mobile-site on 3 2008
Page Count Data Missing for mobile-site on 4 2008
Page Count Data Missing for mobile-site on 5 2008
Page Count Data Missing for mobile-site on 6 2008
Page Count Data Missing for mobile-site on 7 2008
Page Count Data Missing for mobile-site on 8 2008
Page Count Data Missing for mobile-site on 9 2008
Page Count Data Missing for mobile-site on 10 2008
Page Count Data Missing for mobile-site on 11 2008
Page Count Data Missing for mobile-site on 12 2008
Page Count Data Missing for mobile-site on 1 2009
Page Count Data Missing for mobile-site on 2 2009
Page Count Data Missing for mobile-site on 3 2009
Page Count Data Missing for mobile-site on 4 2009
Page Count Data Missing for mobile-site on 5 2009
Page Count Data Missing for mobile-site on 6 2009
Page Count Data Missing for mobile-site on 7 2009
Page Count Data Missing for mobile-site on 8 2009
Page Count Data Missing for mobile-site on 9 2009
Page Count Data Missing for mobile-site on 10 2009
Page Count Data Missing for mobile-site on 11 2009
Page Count Data Missing for mobile-site on 12 2009
Page Count Data Missing for mobile-site on 1 2010
Page Count Data Missing for mobile-site on 2 2010
Page Count Data Missing for mobile-site on 3 2010
Page Count Data Missing for mobile-site on 4 2010
Page Count Data Missing for mobile-site on 5 2010
Page Count Data Missing for mobile-site on 6 2010
Page Count Data Missing for mobile-site on 7 2010
Page Count Data Missing for mobile-site on 8 2010
Page Count Data Missing for mobile-site on 9 2010
Page Count Data Missing for mobile-site on 10 2010
Page Count Data Missing for mobile-site on 11 2010
Page Count Data Missing for mobile-site on 12 2010
Page Count Data Missing for mobile-site on 1 2011
Page Count Data Missing for mobile-site on 2 2011
Page Count Data Missing for mobile-site on 3 2011
Page Count Data Missing for mobile-site on 4 2011
Page Count Data Missing for mobile-site on 5 2011
Page Count Data Missing for mobile-site on 6 2011
Page Count Data Missing for mobile-site on 7 2011
Page Count Data Missing for mobile-site on 8 2011
Page Count Data Missing for mobile-site on 9 2011
Page Count Data Missing for mobile-site on 10 2011
Page Count Data Missing for mobile-site on 11 2011
Page Count Data Missing for mobile-site on 12 2011
Page Count Data Missing for mobile-site on 1 2012
Page Count Data Missing for mobile-site on 2 2012
Page Count Data Missing for mobile-site on 3 2012
Page Count Data Missing for mobile-site on 4 2012
Page Count Data Missing for mobile-site on 5 2012
Page Count Data Missing for mobile-site on 6 2012
Page Count Data Missing for mobile-site on 7 2012
Page Count Data Missing for mobile-site on 8 2012
Page Count Data Missing for mobile-site on 9 2012
Page Count Data Missing for mobile-site on 10 2012
Page Count Data Missing for mobile-site on 11 2012
Page Count Data Missing for mobile-site on 12 2012
Page Count Data Missing for mobile-site on 1 2013
Page Count Data Missing for mobile-site on 2 2013
Page Count Data Missing for mobile-site on 3 2013
Page Count Data Missing for mobile-site on 4 2013
Page Count Data Missing for mobile-site on 5 2013
Page Count Data Missing for mobile-site on 6 2013
Page Count Data Missing for mobile-site on 7 2013
Page Count Data Missing for mobile-site on 8 2013
Page Count Data Missing for mobile-site on 9 2013
Page Count Data Missing for mobile-site on 10 2013
Page Count Data Missing for mobile-site on 11 2013
Page Count Data Missing for mobile-site on 12 2013
Page Count Data Missing for mobile-site on 1 2014
Page Count Data Missing for mobile-site on 2 2014
Page Count Data Missing for mobile-site on 3 2014
Page Count Data Missing for mobile-site on 4 2014
Page Count Data Missing for mobile-site on 5 2014
Page Count Data Missing for mobile-site on 6 2014
Page Count Data Missing for mobile-site on 7 2014
Page Count Data Missing for mobile-site on 8 2014
Page Count Data Missing for mobile-site on 9 2014
Summing up accross mobile-app and mobile-site for PageViews into a single access type called 'mobile'. Also, grouping by year, month and accessTypes to get the results

In [3]:
PageViewDataFrame.loc[PageViewDataFrame.access != 'desktop', 'access'] = 'mobile'

PageCountDataFrame['access'] = "desktop"
PageCountDataFrame.loc[PageCountDataFrame['access-site'] != 'desktop-site', 'access'] = 'mobile'

pageCounts = pandas.DataFrame( PageCountDataFrame.groupby(['year', 'month', 'access'])['count'].sum().reset_index() )
pageViews = pandas.DataFrame( PageViewDataFrame.groupby(['year', 'month', 'access'])['views'].sum().reset_index() )

Generating the final CSV Data frame used for visualization!!


In [4]:
finalCSVDataFrame = pandas.DataFrame( columns = ['year', 'month', 'pagecount_all_views', 'pagecount_desktop_views', 'pagecount_mobile_views',
                                                 'pageview_all_views', 'pageview_desktop_views', 'pageview_mobile_views' ] )
for year in range( 2008, 2018):
    for month in range(1,13):
        if year == 2017 and month > 9:
            continue
        else:
            if month < 9:
                monthString = ''.join(['0', str(month)])
                
            pagecount_desktop_views = pageCounts[(pageCounts['year'] == str(year)) & (pageCounts['month'] == monthString) 
                                                                                 & (pageCounts['access'] == 'desktop' )]
            
            if len(pagecount_desktop_views) != 0:
                pagecount_desktop_views = int(pagecount_desktop_views['count'])
            else:
                pagecount_desktop_views = 0
                
            pagecount_mobile_views = pageCounts[(pageCounts['year'] == str(year)) & (pageCounts['month'] == monthString) 
                                                                                 & (pageCounts['access'] == 'mobile' )]
            
            if len(pagecount_mobile_views) != 0:
                pagecount_mobile_views = int(pagecount_mobile_views['count'])
            else:
                pagecount_mobile_views = 0
                
            pagecount_all_views = pagecount_desktop_views + pagecount_mobile_views
            
            pageview_desktop_views = pageViews[(pageViews['year'] == str(year)) & (pageViews['month'] == monthString) 
                                                                                 & (pageViews['access'] == 'desktop' )]
            
            if len(pageview_desktop_views) != 0:
                pageview_desktop_views = int(pageview_desktop_views['views'])
            else:
                pageview_desktop_views = 0
            
            pageview_mobile_views = pageViews[(pageViews['year'] == str(year)) & (pageViews['month'] == monthString) 
                                                                                 & (pageViews['access'] == 'mobile' )]
            
            if len(pageview_mobile_views) != 0:
                pageview_mobile_views = int(pageview_mobile_views['views'])
            else:
                pageview_mobile_views = 0
            
            pageview_all_views = pageview_desktop_views + pageview_mobile_views
            
            finalCSVDataFrame = finalCSVDataFrame.append( {'year': int(year), 
                           'month': int(month), 
                           'pagecount_all_views': int(pagecount_all_views),
                           'pagecount_desktop_views': int(pagecount_desktop_views),
                           'pagecount_mobile_views': int(pagecount_mobile_views),
                           'pageview_all_views': int(pageview_all_views),
                           'pageview_desktop_views': int(pageview_desktop_views),
                           'pageview_mobile_views': int(pageview_mobile_views)
                           }, ignore_index=True )

In [5]:
finalCSVDataFrame


Out[5]:
year month pagecount_all_views pagecount_desktop_views pagecount_mobile_views pageview_all_views pageview_desktop_views pageview_mobile_views
0 2008.0 1.0 4.930903e+09 4.930903e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
1 2008.0 2.0 4.818394e+09 4.818394e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
2 2008.0 3.0 4.955406e+09 4.955406e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
3 2008.0 4.0 5.159162e+09 5.159162e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
4 2008.0 5.0 5.584691e+09 5.584691e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
5 2008.0 6.0 5.712104e+09 5.712104e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
6 2008.0 7.0 5.306303e+09 5.306303e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
7 2008.0 8.0 5.140156e+09 5.140156e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
8 2008.0 9.0 5.140156e+09 5.140156e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
9 2008.0 10.0 5.140156e+09 5.140156e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
10 2008.0 11.0 5.140156e+09 5.140156e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
11 2008.0 12.0 5.140156e+09 5.140156e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
12 2009.0 1.0 5.802682e+09 5.802682e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
13 2009.0 2.0 5.547321e+09 5.547321e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
14 2009.0 3.0 6.295159e+09 6.295159e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
15 2009.0 4.0 5.988817e+09 5.988817e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
16 2009.0 5.0 6.267517e+09 6.267517e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
17 2009.0 6.0 5.818924e+09 5.818924e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
18 2009.0 7.0 5.801647e+09 5.801647e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
19 2009.0 8.0 5.790850e+09 5.790850e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
20 2009.0 9.0 5.790850e+09 5.790850e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
21 2009.0 10.0 5.790850e+09 5.790850e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
22 2009.0 11.0 5.790850e+09 5.790850e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
23 2009.0 12.0 5.790850e+09 5.790850e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
24 2010.0 1.0 5.703465e+09 5.703465e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25 2010.0 2.0 5.762451e+09 5.762451e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
26 2010.0 3.0 6.661348e+09 6.661348e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
27 2010.0 4.0 6.618552e+09 6.618552e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
28 2010.0 5.0 6.410579e+09 6.410579e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
29 2010.0 6.0 4.898035e+09 4.898035e+09 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
... ... ... ... ... ... ... ... ...
87 2015.0 4.0 9.421036e+09 6.198946e+09 3.222090e+09 0.000000e+00 0.000000e+00 0.000000e+00
88 2015.0 5.0 9.657871e+09 6.323802e+09 3.334069e+09 0.000000e+00 0.000000e+00 0.000000e+00
89 2015.0 6.0 8.203576e+09 5.165414e+09 3.038162e+09 0.000000e+00 0.000000e+00 0.000000e+00
90 2015.0 7.0 8.483699e+09 5.229226e+09 3.254473e+09 7.665422e+09 4.376667e+09 3.288755e+09
91 2015.0 8.0 8.304022e+09 5.035534e+09 3.268488e+09 7.634815e+09 4.332482e+09 3.302333e+09
92 2015.0 9.0 8.304022e+09 5.035534e+09 3.268488e+09 7.634815e+09 4.332482e+09 3.302333e+09
93 2015.0 10.0 8.304022e+09 5.035534e+09 3.268488e+09 7.634815e+09 4.332482e+09 3.302333e+09
94 2015.0 11.0 8.304022e+09 5.035534e+09 3.268488e+09 7.634815e+09 4.332482e+09 3.302333e+09
95 2015.0 12.0 8.304022e+09 5.035534e+09 3.268488e+09 7.634815e+09 4.332482e+09 3.302333e+09
96 2016.0 1.0 9.309261e+09 5.569633e+09 3.739629e+09 8.154016e+09 4.436179e+09 3.717837e+09
97 2016.0 2.0 8.680941e+09 5.347709e+09 3.333231e+09 7.585859e+09 4.250997e+09 3.334862e+09
98 2016.0 3.0 8.827530e+09 5.407676e+09 3.419854e+09 7.673275e+09 4.286590e+09 3.386684e+09
99 2016.0 4.0 8.873621e+09 5.572235e+09 3.301385e+09 7.408148e+09 4.149384e+09 3.258764e+09
100 2016.0 5.0 8.748968e+09 5.330532e+09 3.418436e+09 7.586811e+09 4.191778e+09 3.395033e+09
101 2016.0 6.0 8.347711e+09 4.975092e+09 3.372618e+09 7.243631e+09 3.888840e+09 3.354791e+09
102 2016.0 7.0 8.864628e+09 5.363966e+09 3.500661e+09 7.834440e+09 4.337866e+09 3.496574e+09
103 2016.0 8.0 0.000000e+00 0.000000e+00 0.000000e+00 8.210866e+09 4.695046e+09 3.515819e+09
104 2016.0 9.0 0.000000e+00 0.000000e+00 0.000000e+00 8.210866e+09 4.695046e+09 3.515819e+09
105 2016.0 10.0 0.000000e+00 0.000000e+00 0.000000e+00 8.210866e+09 4.695046e+09 3.515819e+09
106 2016.0 11.0 0.000000e+00 0.000000e+00 0.000000e+00 8.210866e+09 4.695046e+09 3.515819e+09
107 2016.0 12.0 0.000000e+00 0.000000e+00 0.000000e+00 8.210866e+09 4.695046e+09 3.515819e+09
108 2017.0 1.0 0.000000e+00 0.000000e+00 0.000000e+00 8.753942e+09 4.521980e+09 4.231962e+09
109 2017.0 2.0 0.000000e+00 0.000000e+00 0.000000e+00 7.738464e+09 4.026702e+09 3.711761e+09
110 2017.0 3.0 0.000000e+00 0.000000e+00 0.000000e+00 8.223466e+09 4.319972e+09 3.903494e+09
111 2017.0 4.0 0.000000e+00 0.000000e+00 0.000000e+00 7.591080e+09 3.951457e+09 3.639623e+09
112 2017.0 5.0 0.000000e+00 0.000000e+00 0.000000e+00 7.874558e+09 4.187871e+09 3.686688e+09
113 2017.0 6.0 0.000000e+00 0.000000e+00 0.000000e+00 7.123934e+09 3.604551e+09 3.519383e+09
114 2017.0 7.0 0.000000e+00 0.000000e+00 0.000000e+00 7.290504e+09 3.565445e+09 3.725059e+09
115 2017.0 8.0 0.000000e+00 0.000000e+00 0.000000e+00 7.196979e+09 3.575572e+09 3.621406e+09
116 2017.0 9.0 0.000000e+00 0.000000e+00 0.000000e+00 7.196979e+09 3.575572e+09 3.621406e+09

117 rows × 8 columns


In [12]:
finalCSVDataFrame.to_csv('finalCSV.csv')

Plotting the time series Visualization


In [7]:
import matplotlib.pyplot as plt

dates = pandas.date_range('2008-01', '2017-10',freq='M')

plt.plot(dates, finalCSVDataFrame["pagecount_all_views"]/1000000, label = " All PageCounts", 
         color = "black", linewidth = 0.5)
plt.plot(dates, finalCSVDataFrame["pagecount_desktop_views"]/1000000, label = "Desktop PageCounts", 
         color = "blue", linewidth = 0.5)
plt.plot(dates, finalCSVDataFrame["pagecount_mobile_views"]/1000000, label = "Mobile PageCounts", 
         color = "brown", linewidth = 0.5)
plt.plot(dates, finalCSVDataFrame["pageview_all_views"]/1000000, label = " All PageViews", 
         color = "black", linewidth = 2)
plt.plot(dates, finalCSVDataFrame["pageview_desktop_views"]/1000000, label = "Desktop PageViews", 
         color = "blue", linewidth = 2)
plt.plot(dates, finalCSVDataFrame["pageview_mobile_views"]/1000000, label = "Mobile PageViews", 
         color = "brown", linewidth = 2)

fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = 15
fig_size[1] = 10

plt.xlabel("Dates")
plt.ylabel("PageViews - millions")
plt.title("Pageviews on English Wikipedia from July 2015 to Sept 2017")
plt.legend(loc=2)
plt.show()



In [ ]: