Using Python to explore Wikipedia pageview data for the 115th U.S. Congress (2016-2018)

By Stuart Geiger (@staeiou, User:Staeiou), licensed under the MIT license

Did you know that Wikipedia has been tracking aggregate, anonymized, hourly data about the number of times each page is viewed? There are data dumps, an API, and a web tool for exploring small sets of pages (see this blog post for more on those three). In this notebook, I show how to use Python to get data on hundreds of pages at once -- every member of the U.S. Senate and House of Representatives in the 115th Congress, which was in session from January 2017 to January 2019. The pageview data here covers January 2016 through December 2018.

Libraries

We're using mwviews to get the pageview data, pandas for the dataframes, and seaborn/matplotlib for plotting. (I also tried pywikibot to get the article titles programmatically, but gave up on that approach.)


In [1]:
!pip install mwviews pandas matplotlib seaborn



In [2]:
import mwviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.5)

Data

The .txt files are manually curated lists of article titles, based on first copying and pasting the columns displaying the names of the members of Congress at List_of_United_States_Senators_in_the_115th_Congress_by_seniority and List_of_United_States_Representatives_in_the_115th_Congress_by_seniority. Each article link was then manually checked to make sure the displayed name matched the linked page, and updated if, for example, the text said "Dan Sullivan" but the article was at "Dan Sullivan (U.S. Senator)". Many thanks to Amy Johnson, who helped curate these lists.

I tried programmatically getting lists of all current members of Congress, but failed.

The files have one title per line, so we read each file in and split its contents into a list with .split("\n"):


In [3]:
with open("senators.txt") as f:
    senate_txt = f.read()

senate_list = senate_txt.split("\n")

In [4]:
senate_list[0:5]


Out[4]:
['Richard Shelby',
 'Luther Strange',
 'Lisa Murkowski',
 'Dan Sullivan (U.S. Senator)',
 'John McCain']

Checking the length of the list, we see it has 100 entries, one for each senator, which is good!


In [5]:
len(senate_list)


Out[5]:
100
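
One thing to watch for: if the file ends with a trailing newline, .split("\n") leaves an empty string at the end of the list, which would then get passed along to the API as a bogus title. A minimal, more defensive version of the read (same senators.txt file, skipping blank lines):

with open("senators.txt") as f:
    # strip surrounding whitespace and drop blank lines, e.g. a trailing newline
    senate_list = [line.strip() for line in f if line.strip()]

len(senate_list)  # should still be 100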

We do the same with the house list, and we get 431 rather than 435 because there are currently some vacancies.


In [6]:
with open("house_reps.txt") as f:
    house_txt = f.read()
    
house_list = house_txt.split("\n")

In [7]:
house_list[0:5]


Out[7]:
['Bradley Byrne', 'Martha Roby', 'Mike Rogers', 'Robert Aderholt', 'Mo Brooks']

In [8]:
len(house_list)


Out[8]:
431

Querying the pageviews API

mwviews makes it much easier to query the pageviews API, so we don't have to construct the REST calls ourselves. We can also pass in a (very long!) list of pages in a single query. We get back a nicely nested set of results, which pandas can convert to a dataframe without any help.
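
For context, here is roughly what one raw call to the underlying REST API looks like for a single article -- a minimal sketch using requests, with a placeholder User-Agent string (mwviews issues requests like this for us):

import requests

# Pageviews REST API: one article, monthly granularity, dates as YYYYMMDDHH
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/user/Bernie_Sanders/monthly/2016010100/2016123100")
response = requests.get(url, headers={"User-Agent": "example script (you@example.com)"})
print(response.json()["items"][0])  # one record per month, each with a 'views' count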

The main way to interact via mwviews is the PageviewsClient object, which we will create as p for short.


In [9]:
your_contact_info = "stuart@stuartgeiger.com"

In [10]:
from mwviews.api import PageviewsClient

p = PageviewsClient(user_agent="Python query script by " + your_contact_info)

When we query the API for the view data, we can set several parameters in p.article_views(). We pass in senate_list as our list of articles. Granularity can be monthly or daily, and start and end dates are formatted as YYYYMMDDHH. You have to include precise start and end dates down to the hour, and the API will not give very helpful error messages if you do things like set your end date before your start date. Also know that the pageview data only goes back to mid-2015.


In [11]:
senate_views = p.article_views(project='en.wikipedia', 
                            articles=senate_list, 
                            granularity='monthly', 
                            start='2016010100', 
                            end='2018123123')

senate_df = pd.DataFrame(senate_views)

If we peek at the first 10 rows and 5 columns in the dataframe, we see it is formatted with one row per page, and one column per month:


In [12]:
senate_df.iloc[0:10, 0:5]


Out[12]:
2016-03-01 00:00:00 2016-07-01 00:00:00 2016-09-01 00:00:00 2017-09-01 00:00:00 2017-11-01 00:00:00
Al_Franken 69646.0 143641.0 37181.0 83442.0 941293.0
Amy_Klobuchar 22588.0 36931.0 9495.0 32885.0 39919.0
Angus_King 18929.0 16043.0 10410.0 15717.0 21210.0
Ben_Cardin 7535.0 6656.0 4803.0 7834.0 10892.0
Ben_Sasse 46198.0 21502.0 10977.0 33255.0 25537.0
Bernie_Sanders 2026130.0 684501.0 221556.0 146213.0 204920.0
Bill_Cassidy 5963.0 6895.0 3856.0 72223.0 8282.0
Bill_Nelson 20002.0 13005.0 9691.0 25957.0 12515.0
Bob_Casey_Jr. 110.0 8934.0 3562.0 8948.0 10328.0
Bob_Corker 10139.0 41705.0 8985.0 48535.0 27702.0

We transpose this (switching rows and columns), then set the index of each row to a more readable string, Year-Month:


In [13]:
senate_df = senate_df.transpose()
senate_df = senate_df.set_index(senate_df.index.strftime("%Y-%m")).sort_index()

In [14]:
senate_df.iloc[0:5, 0:10]


Out[14]:
Al_Franken Amy_Klobuchar Angus_King Ben_Cardin Ben_Sasse Bernie_Sanders Bill_Cassidy Bill_Nelson Bob_Casey_Jr. Bob_Corker
2016-01 40768.0 10282.0 13743.0 5774.0 19451.0 1727456.0 4499.0 13754.0 88.0 7418.0
2016-02 42661.0 40245.0 19517.0 7716.0 28501.0 3588261.0 5482.0 20045.0 101.0 10224.0
2016-03 69646.0 22588.0 18929.0 7535.0 46198.0 2026130.0 5963.0 20002.0 110.0 10139.0
2016-04 43087.0 19740.0 13951.0 7733.0 9943.0 1337991.0 5047.0 12413.0 95.0 8403.0
2016-05 66366.0 16663.0 13341.0 5532.0 78686.0 787078.0 4644.0 11750.0 2146.0 53781.0

We can get the sum for each page by running .sum(), and we can peek into the first five pages:


In [15]:
senate_sum = senate_df.sum()
senate_sum[0:5]


Out[15]:
Al_Franken       4729093.0
Amy_Klobuchar    1682276.0
Angus_King       1107258.0
Ben_Cardin        458593.0
Ben_Sasse        1733932.0
dtype: float64
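
As a quick usage example (not part of the original analysis), you could sort these per-page sums to see which senators' articles drew the most views over the whole period:

senate_sum.sort_values(ascending=False)[0:5]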

We can get the sum for each month by transposing back and running .sum() on the dataframe:


In [16]:
senate_monthly_sum = senate_df.transpose().sum()
senate_monthly_sum


Out[16]:
2016-01     5261734.0
2016-02    10081944.0
2016-03     7136609.0
2016-04     3931109.0
2016-05     3493508.0
2016-06     3358614.0
2016-07     6661905.0
2016-08     2012990.0
2016-09     2000842.0
2016-10     3647561.0
2016-11     6361233.0
2016-12     2352725.0
2017-01     5803284.0
2017-02     4912876.0
2017-03     3882319.0
2017-04     2520009.0
2017-05     3626457.0
2017-06     5212799.0
2017-07     4328612.0
2017-08     3130511.0
2017-09     2922062.0
2017-10     3207716.0
2017-11     4221156.0
2017-12     4313695.0
2018-01     4608436.0
2018-02     2645557.0
2018-03     2469501.0
2018-04     3187750.0
2018-05     2793076.0
2018-06     2431813.0
2018-07     2885088.0
2018-08     8292917.0
2018-09    11879426.0
2018-10     8232705.0
2018-11     6258168.0
2018-12     3545886.0
dtype: float64

And we can get the sum of all the months from 2016-01 to 2018-12 by summing the monthly sum, which gives us 163.6 million pageviews:


In [17]:
senate_monthly_sum.sum()


Out[17]:
163612593.0

We can use the built-in plotting functionality in pandas dataframes to show a monthly plot. You can set kind to many plot types, including bar, line, and area (a line variant is sketched after this cell).


In [18]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for all U.S. Senators")
plt.ticklabel_format(style = 'plain')

ax = senate_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")

ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")


Out[18]:
Text(0, 0.5, 'Monthly pageviews')
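
As a quick sketch of adjusting kind, here are the same monthly totals as a line plot (same data, just a different kind argument):

ax = senate_monthly_sum.plot(kind='line', figsize=[14,8], color="purple")
ax.set_title("Monthly Wikipedia pageviews for all U.S. Senators")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")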

The House

We do the same thing for the House of Representatives, only with different variables. Recall that house_list is our list of titles:


In [19]:
house_list[0:5]


Out[19]:
['Bradley Byrne', 'Martha Roby', 'Mike Rogers', 'Robert Aderholt', 'Mo Brooks']

In [20]:
house_views = p.article_views(project='en.wikipedia', 
                              articles=house_list, 
                              granularity='monthly', 
                              start='2016010100', 
                              end='2018123123')
                              
house_df = pd.DataFrame(house_views)
house_df.iloc[0:5, 0:5]


Out[20]:
2016-03-01 00:00:00 2016-07-01 00:00:00 2016-09-01 00:00:00 2017-09-01 00:00:00 2017-11-01 00:00:00
Adam_Kinzinger 12940.0 7217.0 6846.0 7401 10311
Adam_Schiff 5578.0 7501.0 5068.0 18257 20692
Adam_Smith_(politician) 3012.0 2939.0 2802.0 2551 3023
Adrian_Smith_(politician) 1354.0 1151.0 1363.0 1689 1893
Adriano_Espaillat 987.0 5360.0 2754.0 6982 5930

In [21]:
house_df = house_df.transpose()
house_df = house_df.set_index(house_df.index.strftime("%Y-%m")).sort_index()
house_df.iloc[0:5, 0:5]


Out[21]:
Adam_Kinzinger Adam_Schiff Adam_Smith_(politician) Adrian_Smith_(politician) Adriano_Espaillat
2016-01 8169.0 4703.0 2587.0 1187.0 1000.0
2016-02 8125.0 4105.0 2627.0 1226.0 901.0
2016-03 12940.0 5578.0 3012.0 1354.0 987.0
2016-04 6579.0 6541.0 2712.0 1368.0 1296.0
2016-05 10515.0 6649.0 2400.0 1295.0 1061.0

In [22]:
house_sum = house_df.sum()
house_sum[0:5]


Out[22]:
Adam_Kinzinger                517026.0
Adam_Schiff                  1529166.0
Adam_Smith_(politician)       137867.0
Adrian_Smith_(politician)      65034.0
Adriano_Espaillat             187250.0
dtype: float64

In [23]:
house_monthly_sum = house_df.transpose().sum()
house_monthly_sum


Out[23]:
2016-01     1608732.0
2016-02     1918133.0
2016-03     2131159.0
2016-04     1727960.0
2016-05     1940369.0
2016-06     1983199.0
2016-07     3009143.0
2016-08     1644636.0
2016-09     1609682.0
2016-10     2558133.0
2016-11     5095820.0
2016-12     2408666.0
2017-01     4190713.0
2017-02     3905450.0
2017-03     5931667.0
2017-04     3067144.0
2017-05     3794859.0
2017-06     3921347.0
2017-07     2791601.0
2017-08     2388308.0
2017-09     2264702.0
2017-10     3138822.0
2017-11     3182808.0
2017-12     3905840.0
2018-01     4212512.0
2018-02     3938454.0
2018-03     3329259.0
2018-04     4593766.0
2018-05     2741945.0
2018-06     3504612.0
2018-07     3776133.0
2018-08     4429837.0
2018-09     4419768.0
2018-10     5783828.0
2018-11    11368461.0
2018-12     4448236.0
dtype: float64

In [24]:
house_monthly_sum.sum()


Out[24]:
126665704.0

This gives us about 126.7 million total pageviews for House reps.


In [25]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for U.S. House of Representatives")
plt.ticklabel_format(style = 'plain')

ax = house_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")


Out[25]:
Text(0, 0.5, 'Monthly pageviews')

Combining the datasets

We have to transpose each dataset back, then append one to the other:


In [26]:
congress_df = house_df.transpose().append(senate_df.transpose())
congress_df.iloc[0:10,0:10]


Out[26]:
2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 2016-07 2016-08 2016-09 2016-10
Adam_Kinzinger 8169.0 8125.0 12940.0 6579.0 10515.0 12002.0 7217.0 22613.0 6846.0 6869.0
Adam_Schiff 4703.0 4105.0 5578.0 6541.0 6649.0 12993.0 7501.0 4760.0 5068.0 8318.0
Adam_Smith_(politician) 2587.0 2627.0 3012.0 2712.0 2400.0 2770.0 2939.0 2458.0 2802.0 2841.0
Adrian_Smith_(politician) 1187.0 1226.0 1354.0 1368.0 1295.0 1285.0 1151.0 1432.0 1363.0 2004.0
Adriano_Espaillat 1000.0 901.0 987.0 1296.0 1061.0 5591.0 5360.0 1729.0 2754.0 2017.0
Al_Green_(politician) 2568.0 3326.0 2853.0 4527.0 3047.0 3243.0 3141.0 2028.0 2878.0 2915.0
Al_Lawson 44.0 27.0 34.0 30.0 36.0 34.0 68.0 479.0 1070.0 1185.0
Alan_Lowenthal 1550.0 1708.0 2245.0 2164.0 2151.0 2575.0 1760.0 1597.0 1455.0 2278.0
Albio_Sires 1791.0 2042.0 2663.0 2348.0 2126.0 2467.0 1960.0 1679.0 3582.0 2483.0
Alcee_Hastings 8234.0 5275.0 6950.0 4795.0 5958.0 5533.0 9017.0 4581.0 4075.0 4711.0
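
A note for readers on newer versions of pandas: DataFrame.append was later deprecated and removed, so the equivalent combination with pd.concat would be (a minimal sketch producing the same combined dataframe):

congress_df = pd.concat([house_df.transpose(), senate_df.transpose()])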

In [27]:
congress_monthly_sum = congress_df.sum()
congress_monthly_sum


Out[27]:
2016-01     6870466.0
2016-02    12000077.0
2016-03     9267768.0
2016-04     5659069.0
2016-05     5433877.0
2016-06     5341813.0
2016-07     9671048.0
2016-08     3657626.0
2016-09     3610524.0
2016-10     6205694.0
2016-11    11457053.0
2016-12     4761391.0
2017-01     9993997.0
2017-02     8818326.0
2017-03     9813986.0
2017-04     5587153.0
2017-05     7421316.0
2017-06     9134146.0
2017-07     7120213.0
2017-08     5518819.0
2017-09     5186764.0
2017-10     6346538.0
2017-11     7403964.0
2017-12     8219535.0
2018-01     8820948.0
2018-02     6584011.0
2018-03     5798760.0
2018-04     7781516.0
2018-05     5535021.0
2018-06     5936425.0
2018-07     6661221.0
2018-08    12722754.0
2018-09    16299194.0
2018-10    14016533.0
2018-11    17626629.0
2018-12     7994122.0
dtype: float64

Then to find the total pageviews, we run .sum() on the monthly sums, which gives about 290 million pageviews from January 2016 to December 2018 for all U.S. Members of Congress:


In [28]:
congress_monthly_sum.sum()


Out[28]:
290278297.0

In [29]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for current U.S. Members of Congress")
plt.ticklabel_format(style = 'plain')

ax = congress_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")

ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")


Out[29]:
Text(0, 0.5, 'Monthly pageviews')

Plotting a single page's views over time

We can query the dataframe by index for a specific page, then plot it:


In [32]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for Al Lawson")
plt.ticklabel_format(style = 'plain')

ax = congress_df.loc['Al_Lawson'].plot(kind='bar', figsize=[14,8], color="purple")

ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")


Out[32]:
Text(0, 0.5, 'Monthly pageviews')

Output data

We will export these to a folder called data, in CSV and Excel formats:


In [33]:
house_df.to_csv("data/house_views.csv")
house_df.to_excel("data/house_views.xlsx")

senate_df.to_csv("data/senate_views.csv")
senate_df.to_excel("data/senate_views.xlsx")
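
One caveat, not shown in the cell above: pandas will raise a FileNotFoundError if the data folder does not already exist, and to_excel needs an Excel writer library (such as openpyxl) installed. A minimal sketch that creates the folder first:

import os

os.makedirs("data", exist_ok=True)  # create the output folder if it doesn't exist
house_df.to_csv("data/house_views.csv")
house_df.to_excel("data/house_views.xlsx")  # requires an Excel writer, e.g. openpyxl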
