By Stuart Geiger (@staeiou, User:Staeiou), licensed under the MIT license
Did you know that Wikipedia has been tracking aggregate, anonymized, hourly data about the number of times each page is viewed? There are data dumps, an API, and a web tool for exploring small sets of pages (see this blog post for more on those three). In this notebook, I show how to use Python to get data on hundreds of pages at once -- every member of the U.S. Senate and House of Representatives in the 115th Congress, which was in session from 2017 to 2019.
In [1]:
!pip install mwviews pandas matplotlib seaborn
In [2]:
import mwviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.5)
The .txt files are manually curated lists of article titles, first created by copying and pasting the name columns from List_of_United_States_Senators_in_the_115th_Congress_by_seniority and List_of_United_States_Representatives_in_the_115th_Congress_by_seniority. Each article link was then manually checked to make sure it matched the linked page, and updated if, for example, the text said "Dan Sullivan" but the article was at "Dan Sullivan (U.S. Senator)". Many thanks to Amy Johnson, who helped curate these lists.
I tried programmatically getting lists of all current members of Congress, but failed.
The files have one title per line, so we read each one in and split it into a list with .split("\n").
In [3]:
with open("senators.txt") as f:
    senate_txt = f.read()
senate_list = senate_txt.split("\n")
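One caveat worth guarding against: if the file ends with a newline, .split("\n") leaves a trailing empty string in the list, which would send an empty title to the API. A minimal sketch of a filtered version, using stand-in file contents rather than the real senators.txt:

```python
# Stand-in for the contents of senators.txt, ending with a newline
senate_txt = "Chuck Grassley\nPatrick Leahy\n"

# Filtering blank entries drops the trailing empty string that
# split("\n") produces when the file ends with a newline
senate_list = [line for line in senate_txt.split("\n") if line.strip()]
print(senate_list)
```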
In [4]:
senate_list[0:5]
Out[4]:
Checking the length of the list, we see it has 100 entries -- one per senator, which is what we want!
In [5]:
len(senate_list)
Out[5]:
We do the same with the House list; it has 431 entries rather than 435 because there were some vacancies at the time.
In [6]:
with open("house_reps.txt") as f:
    house_txt = f.read()
house_list = house_txt.split("\n")
In [7]:
house_list[0:5]
Out[7]:
In [8]:
len(house_list)
Out[8]:
mwviews makes it much easier to query the pageviews API, so we don't have to call the API directly. We can also pass in a (very long!) list of pages to get data. We get back a nicely JSON-formatted response, which pandas can convert to a dataframe without any help.
The main way to interact via mwviews is the PageviewsClient object, which we will create as p for short.
In [9]:
your_contact_info = "stuart@stuartgeiger.com"
In [10]:
from mwviews.api import PageviewsClient
p = PageviewsClient(user_agent="Python query script by " + your_contact_info)
When we query the API for view data, we can set many parameters in p.article_views(). We pass in senate_list as our list of articles. Granularity can be monthly or daily, and start and end dates are formatted as YYYYMMDDHH. You have to include precise start and end dates down to the hour, and the API will not give very helpful error messages if you do things like set your end date before your start date. Also know that the pageview data only goes back a few years.
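Since those error messages are unhelpful, it can be worth validating the timestamps yourself before querying. A small sketch using only the standard library (the date strings match the ones used below):

```python
from datetime import datetime

# Timestamps for the pageviews API are formatted YYYYMMDDHH; parsing
# them with strptime catches malformed strings, and comparing the parsed
# values catches an end date that falls before the start date
start, end = '2016010100', '2018123123'
start_dt = datetime.strptime(start, "%Y%m%d%H")
end_dt = datetime.strptime(end, "%Y%m%d%H")
assert start_dt < end_dt, "end date must come after start date"
```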
In [11]:
senate_views = p.article_views(project='en.wikipedia',
                               articles=senate_list,
                               granularity='monthly',
                               start='2016010100',
                               end='2018123123')
senate_df = pd.DataFrame(senate_views)
If we peek at the first 10 rows and 5 columns in the dataframe, we see it is formatted with one row per page, and one column per month:
In [12]:
senate_df.iloc[0:10, 0:5]
Out[12]:
We transpose this (switching rows and columns), then set the index of each row to a more readable string, Year-Month:
In [13]:
senate_df = senate_df.transpose()
senate_df = senate_df.set_index(senate_df.index.strftime("%Y-%m")).sort_index()
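The strftime-and-sort pattern is easier to see on a tiny synthetic frame (the page name and values here are made up, standing in for the transposed senate_df):

```python
import pandas as pd

# A toy frame with a DatetimeIndex, deliberately out of order
idx = pd.to_datetime(["2016-02-01", "2016-01-01"])
df = pd.DataFrame({"Some_Page": [10, 20]}, index=idx)

# Reformat the index to Year-Month strings, then sort chronologically
df = df.set_index(df.index.strftime("%Y-%m")).sort_index()
print(df.index.tolist())
```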
In [14]:
senate_df.iloc[0:5, 0:10]
Out[14]:
We can get the sum for each page by running .sum(), and we can peek into the first five pages:
In [15]:
senate_sum = senate_df.sum()
senate_sum[0:5]
Out[15]:
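Once we have per-page sums, sort_values makes it easy to rank pages by total views. A sketch with hypothetical totals standing in for senate_sum:

```python
import pandas as pd

# Hypothetical per-page totals, standing in for senate_sum
totals = pd.Series({"Page_A": 500, "Page_B": 2000, "Page_C": 800})

# Sort descending so the most-viewed page comes first
top = totals.sort_values(ascending=False)
print(top.index[0])
```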
We can get the sum for each month by transposing back and running .sum() on the dataframe:
In [16]:
senate_monthly_sum = senate_df.transpose().sum()
senate_monthly_sum
Out[16]:
And we can get the sum of all the months from 2016-01 to 2018-12 by summing the monthly sum, which gives us 163.6 million pageviews:
In [17]:
senate_monthly_sum.sum()
Out[17]:
We can use the built-in plotting functionality of pandas dataframes to show a monthly plot. The kind argument accepts many plot types, including bar, line, and area.
In [18]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for all U.S. Senators")
plt.ticklabel_format(style = 'plain')
ax = senate_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[18]:
In [19]:
house_list[0:5]
Out[19]:
In [20]:
house_views = p.article_views(project='en.wikipedia',
                              articles=house_list,
                              granularity='monthly',
                              start='2016010100',
                              end='2018123123')
house_df = pd.DataFrame(house_views)
house_df.iloc[0:5, 0:5]
Out[20]:
In [21]:
house_df = house_df.transpose()
house_df = house_df.set_index(house_df.index.strftime("%Y-%m")).sort_index()
house_df.iloc[0:5, 0:5]
Out[21]:
In [22]:
house_sum = house_df.sum()
house_sum[0:5]
Out[22]:
In [23]:
house_monthly_sum = house_df.transpose().sum()
house_monthly_sum
Out[23]:
In [24]:
house_monthly_sum.sum()
Out[24]:
This gives us 126.6 million total pageviews for House reps.
In [25]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for U.S. House of Representatives")
plt.ticklabel_format(style = 'plain')
ax = house_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[25]:
In [26]:
congress_df = pd.concat([house_df.transpose(), senate_df.transpose()])
congress_df.iloc[0:10,0:10]
Out[26]:
In [27]:
congress_monthly_sum = congress_df.sum()
congress_monthly_sum
Out[27]:
Then to find the total pageviews, we run .sum() on the monthly sums. This gives 290 million pageviews from January 2016 to December 2018 for all U.S. Members of Congress:
In [28]:
congress_monthly_sum.sum()
Out[28]:
In [29]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for current U.S. Members of Congress")
plt.ticklabel_format(style = 'plain')
ax = congress_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[29]:
We can query the dataframe by index for a specific page, then plot it:
In [32]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for Al Lawson")
plt.ticklabel_format(style = 'plain')
ax = congress_df.loc['Al_Lawson'].plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[32]:
In [33]:
house_df.to_csv("data/house_views.csv")
house_df.to_excel("data/house_views.xlsx")
senate_df.to_csv("data/senate_views.csv")
senate_df.to_excel("data/senate_views.xlsx")
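To reload these exports later, pd.read_csv with index_col=0 restores the Year-Month index. A sketch with inline CSV text standing in for one of the saved files:

```python
import pandas as pd
from io import StringIO

# Stand-in for an exported CSV; the first (unnamed) column holds the
# Year-Month index, so index_col=0 restores it on read
csv_text = ",Page_A\n2016-01,100\n2016-02,150\n"
df = pd.read_csv(StringIO(csv_text), index_col=0)
print(df.loc["2016-01", "Page_A"])
```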