By Stuart Geiger (@staeiou, User:Staeiou), licensed under the MIT license
Did you know that Wikipedia has been tracking aggregate, anonymized, hourly data about the number of times each page is viewed? There are data dumps, an API, and a web tool for exploring small sets of pages (see this blog post for more on those three). In this notebook, I show how to use Python to get data on hundreds of pages at once -- every member of the U.S. Senate and House of Representatives in the 115th Congress, which was in session from 2017 to 2019.
In [1]:
!pip install mwviews pandas matplotlib seaborn
In [2]:
import mwviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.5)
The .txt files are manually curated lists of article titles, first created by copying and pasting the name columns from List_of_United_States_Senators_in_the_115th_Congress_by_seniority and List_of_United_States_Representatives_in_the_115th_Congress_by_seniority. Each article link was then manually checked to make sure it matched the linked page, and updated if, for example, the text said "Dan Sullivan" but the article was at "Dan Sullivan (U.S. Senator)". Many thanks to Amy Johnson, who helped curate these lists.
I tried programmatically getting lists of all current members of Congress, but failed.
The files have one title per line, so we read each one in and split it into a list with .split("\n").
In [3]:
with open("senators.txt") as f:
    senate_txt = f.read()
senate_list = senate_txt.split("\n")
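One caveat worth guarding against: if the file ends with a newline, .split("\n") leaves a trailing empty string in the list, which would send an empty title to the API. A minimal sketch of a filtered version, using stand-in file contents rather than the real senators.txt:

```python
# Stand-in for the contents of senators.txt, ending with a newline
senate_txt = "Chuck Grassley\nPatrick Leahy\n"

# Filtering blank entries drops the trailing empty string that
# split("\n") produces when the file ends with a newline
senate_list = [line for line in senate_txt.split("\n") if line.strip()]
print(senate_list)
```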
In [4]:
senate_list[0:5]
Out[4]:
Checking the length of the list, we see it has 100 entries -- one per senator, which is what we want!
In [5]:
len(senate_list)
Out[5]:
We do the same with the House list; it has 431 entries rather than 435 because there were some vacancies at the time.
In [6]:
with open("house_reps.txt") as f:
    house_txt = f.read()
house_list = house_txt.split("\n")
In [7]:
house_list[0:5]
Out[7]:
In [8]:
len(house_list)
Out[8]:
mwviews makes it much easier to query the pageviews API, so we don't have to call the API directly. We can also pass in a (very long!) list of pages to get data. We get back a nicely JSON-formatted response, which pandas can convert to a dataframe without any help.
The main way to interact via mwviews is the PageviewsClient object, which we will create as p for short.
In [9]:
your_contact_info = "stuart@stuartgeiger.com"
In [10]:
from mwviews.api import PageviewsClient
p = PageviewsClient(user_agent="Python query script by " + your_contact_info)
When we query the API for view data, we can set many parameters in p.article_views(). We pass in senate_list as our list of articles. Granularity can be monthly or daily, and start and end dates are formatted as YYYYMMDDHH. You have to include precise start and end dates down to the hour, and the API will not give very helpful error messages if you do things like set your end date before your start date. Also know that the pageview data only goes back a few years.
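Since those error messages are unhelpful, it can be worth validating the timestamps yourself before querying. A small sketch using only the standard library (the date strings match the ones used below):

```python
from datetime import datetime

# Timestamps for the pageviews API are formatted YYYYMMDDHH; parsing
# them with strptime catches malformed strings, and comparing the parsed
# values catches an end date that falls before the start date
start, end = '2016010100', '2018123123'
start_dt = datetime.strptime(start, "%Y%m%d%H")
end_dt = datetime.strptime(end, "%Y%m%d%H")
assert start_dt < end_dt, "end date must come after start date"
```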
In [11]:
senate_views = p.article_views(project='en.wikipedia',
                               articles=senate_list,
                               granularity='monthly',
                               start='2016010100',
                               end='2018123123')
senate_df = pd.DataFrame(senate_views)
If we peek at the first 10 rows and 5 columns in the dataframe, we see it is formatted with one row per page, and one column per month:
In [12]:
senate_df.iloc[0:10, 0:5]
Out[12]:
We transpose this (switching rows and columns), then set the index of each row to a more readable string, Year-Month:
In [13]:
senate_df = senate_df.transpose()
senate_df = senate_df.set_index(senate_df.index.strftime("%Y-%m")).sort_index()
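The strftime-and-sort pattern is easier to see on a tiny synthetic frame (the page name and values here are made up, standing in for the transposed senate_df):

```python
import pandas as pd

# A toy frame with a DatetimeIndex, deliberately out of order
idx = pd.to_datetime(["2016-02-01", "2016-01-01"])
df = pd.DataFrame({"Some_Page": [10, 20]}, index=idx)

# Reformat the index to Year-Month strings, then sort chronologically
df = df.set_index(df.index.strftime("%Y-%m")).sort_index()
print(df.index.tolist())
```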
In [14]:
senate_df.iloc[0:5, 0:10]
Out[14]:
We can get the sum for each page by running .sum(), and we can peek into the first five pages:
In [15]:
senate_sum = senate_df.sum()
senate_sum[0:5]
Out[15]:
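Once we have per-page sums, sort_values makes it easy to rank pages by total views. A sketch with hypothetical totals standing in for senate_sum:

```python
import pandas as pd

# Hypothetical per-page totals, standing in for senate_sum
totals = pd.Series({"Page_A": 500, "Page_B": 2000, "Page_C": 800})

# Sort descending so the most-viewed page comes first
top = totals.sort_values(ascending=False)
print(top.index[0])
```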
We can get the sum for each month by transposing back and running .sum() on the dataframe:
In [16]:
senate_monthly_sum = senate_df.transpose().sum()
senate_monthly_sum
Out[16]:
And we can get the sum of all the months from 2016-01 to 2018-12 by summing the monthly sum, which gives us 163.6 million pageviews:
In [17]:
senate_monthly_sum.sum()
Out[17]:
We can use the built-in plotting functionality of pandas dataframes to show a monthly plot. The kind argument accepts many plot types, including bar, line, and area.
In [18]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for all U.S. Senators")
plt.ticklabel_format(style = 'plain')
ax = senate_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[18]:
In [19]:
house_list[0:5]
Out[19]:
In [20]:
house_views = p.article_views(project='en.wikipedia',
                              articles=house_list,
                              granularity='monthly',
                              start='2016010100',
                              end='2018123123')
house_df = pd.DataFrame(house_views)
house_df.iloc[0:5, 0:5]
Out[20]:
In [21]:
house_df = house_df.transpose()
house_df = house_df.set_index(house_df.index.strftime("%Y-%m")).sort_index()
house_df.iloc[0:5, 0:5]
Out[21]:
In [22]:
house_sum = house_df.sum()
house_sum[0:5]
Out[22]:
In [23]:
house_monthly_sum = house_df.transpose().sum()
house_monthly_sum
Out[23]:
In [24]:
house_monthly_sum.sum()
Out[24]:
This gives us 126.6 million total pageviews for House reps.
In [25]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for U.S. House of Representatives")
plt.ticklabel_format(style = 'plain')
ax = house_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[25]:
In [26]:
congress_df = pd.concat([house_df.transpose(), senate_df.transpose()])
congress_df.iloc[0:10,0:10]
Out[26]:
In [27]:
congress_monthly_sum = congress_df.sum()
congress_monthly_sum
Out[27]:
Then to find the total pageviews, we run .sum() on the monthly sums. This gives 290 million pageviews from January 2016 to December 2018 for all U.S. Members of Congress:
In [28]:
congress_monthly_sum.sum()
Out[28]:
In [29]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for current U.S. Members of Congress")
plt.ticklabel_format(style = 'plain')
ax = congress_monthly_sum.plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[29]:
We can query the dataframe by index for a specific page, then plot it:
In [32]:
fig = plt.figure()
plt.title("Monthly Wikipedia pageviews for Al Lawson")
plt.ticklabel_format(style = 'plain')
ax = congress_df.loc['Al_Lawson'].plot(kind='bar', figsize=[14,8], color="purple")
ax.set_xlabel("Month")
ax.set_ylabel("Monthly pageviews")
Out[32]:
In [33]:
house_df.to_csv("data/house_views.csv")
house_df.to_excel("data/house_views.xlsx")
senate_df.to_csv("data/senate_views.csv")
senate_df.to_excel("data/senate_views.xlsx")
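To reload these exports later, pd.read_csv with index_col=0 restores the Year-Month index. A sketch with inline CSV text standing in for one of the saved files:

```python
import pandas as pd
from io import StringIO

# Stand-in for an exported CSV; the first (unnamed) column holds the
# Year-Month index, so index_col=0 restores it on read
csv_text = ",Page_A\n2016-01,100\n2016-02,150\n"
df = pd.read_csv(StringIO(csv_text), index_col=0)
print(df.loc["2016-01", "Page_A"])
```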