The first thing I tried was just counting up the number of articles in each of the "[year] deaths" categories from 2000 to 2016.
In [141]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
%matplotlib inline
matplotlib.style.use('seaborn-darkgrid')
In [142]:
import pywikibot
site = pywikibot.Site('en', 'wikipedia')
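Before wrapping this in a function, here is a minimal sanity-check sketch that asks pywikibot for a single category's metadata ("Category:2016_deaths" is just an example year); categoryinfo returns a dictionary of counts for the category.
In [ ]:
deathcat = pywikibot.Page(site, 'Category:2016_deaths')
# returns a dict of counts for the category, including 'pages'
site.categoryinfo(deathcat)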
In [146]:
def yearly_death_counts(startyear, endyear):
    # add 1 to endyear because np.arange doesn't include the stop
    years = np.arange(startyear, endyear + 1)
    deaths_per_year = {}
    for year in years:
        yearstr = 'Category:' + str(year) + "_deaths"
        deathcat = pywikibot.Page(site, yearstr)
        deathcat_o = site.categoryinfo(deathcat)
        deaths_per_year[year] = deathcat_o['pages']
    yearly_articles_df = pd.DataFrame.from_dict(deaths_per_year, orient='index')
    yearly_articles_df.columns = ['articles in category']
    yearly_articles_df = yearly_articles_df.sort_index()
    return yearly_articles_df
In [147]:
yearly_articles_df = yearly_death_counts(2000,2016)
yearly_articles_df
Out[147]:
In [148]:
ax = yearly_articles_df.plot(kind='bar',figsize=[10,4])
ax.legend_.remove()
ax.set_ylabel("Number of articles")
ax.set_title("""Articles in the "[year] deaths" category in the English Wikipedia""")
Out[148]:
One of the first things we see in this graph is that the data is far from uniform; it has a distinct trend. This should make us suspicious. There are about 4,945 articles in the "2000 deaths" category, and the number rises steadily each year to 7,486 articles in the "2010 deaths" category. Is there any compelling reason to believe that the number of notable people in the world steadily increased by a few percent each year from 2000 to 2010, then plateaued? Or is this more of an artifact of what Wikipedia's volunteer editors choose to work on?
What if we look at this over a much longer timescale, like 1800-2016?
In [81]:
yearly_articles_df = yearly_death_counts(1800,2016)
In [96]:
ax = yearly_articles_df.plot(kind='line',figsize=[10,4])
ax.legend_.remove()
ax.set_ylabel("Number of articles")
ax.set_title("""Articles in the "[year] deaths" category in the English Wikipedia""")
We can see two big jumps in the 20th century, likely reflecting the events around World Wars I and II. This makes sense, as those periods certainly saw sharp increases in the total number of deaths, as well as in the number of notable deaths. Remember: we have already assumed that Wikipedia's biographical articles don't represent all of humanity -- in fact, we are counting on it, so that we can distinguish celebrity deaths.
However, for the purposes of our question, is it safe to assume that having a Wikipedia article makes someone a celebrity? When I hear people talk about so many celebrities dying in 2016, they seem to mean far fewer than the ~7,000 people with Wikipedia articles who died in each of the years 2010-2016. The number is maybe two orders of magnitude lower, somewhere closer to 70 than 7,000. So is there a way we can filter Wikipedia articles?
To get at this, I first thought of using the pageview data that Wikimedia collects. There is a nice API that reports how many times every article in every language version of Wikipedia is viewed each hour. I hadn't played around with that API before, so I wanted to try it out.
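To get a feel for what that API returns, here is a minimal sketch that queries the underlying REST endpoint directly with the requests library; the article, date range, and User-Agent string are illustrative choices, not part of the original analysis.
In [ ]:
import requests

# Per-article pageviews: .../per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia/all-access/all-agents/Prince_(musician)/daily/2016010100/2016013100")
# Wikimedia asks API clients to identify themselves via the User-Agent header
resp = requests.get(url, headers={'User-Agent': 'pageview-exploration-example'})
for item in resp.json()['items']:
    print(item['timestamp'], item['views'])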
The mwviews python package has support for hourly, daily, and monthly granularity, but not annual. So I wrote a function that gets the pageview counts for a given article for an entire year. But, as we will see, the data in the historical pageview API only goes back to mid-2015.
In [108]:
!pip install mwviews
from mwviews.api import PageviewsClient

def yearly_views(title, year):
    p = PageviewsClient(2)
    startdate = str(year) + "010100"
    enddate = str(year) + "123123"
    d = p.article_views('en.wikipedia', title, granularity='monthly',
                        start=startdate, end=enddate)
    # article_views returns {date: {title: count}}; sum the counts across all months
    total = 0
    for month in d.values():
        for titlecount in month.values():
            if titlecount is not None:
                total += titlecount
    return total
In [101]:
yearly_views("Prince_(musician)", 2016)
Out[101]:
In [102]:
yearly_views("Prince_(musician)", 2015)
Out[102]:
In [103]:
yearly_views("Prince_(musician)", 2014)
I wanted to get 2016 pageview data for 2016 deaths, 2015 pageview data for 2015 deaths, and so on. But the pageview API doesn't have full historical data. However, we can take a detour and do some interesting exploration with just the 2016 dataset.
This code iterates through the "2016 deaths" category and, for each page, queries the pageview API for the total number of pageviews in 2016. It takes a few minutes to run, and it throws errors for a few articles (in pink boxes below), which we will ignore.
In [109]:
year = 2016
yearstr = 'Category:' + str(year) + "_deaths"
deathcat = pywikibot.Page(site, yearstr)
pageviews_2016 = {}
for page in site.categorymembers(deathcat):
    # skip "List of ..." pages and subcategories; note that page.title()
    # returns spaces rather than underscores, so we match on "List of"
    if page.title().find("List of") == -1 and page.title().find("Category:") == -1:
        try:
            page_yearly_views = yearly_views(page.title(), year)
        except Exception:
            page_yearly_views = 0
        pageviews_2016[page.title()] = page_yearly_views
In [110]:
pageviews_df = pd.DataFrame.from_dict(pageviews_2016,orient='index')
pageviews_df = pageviews_df.sort_values(0, ascending=False)
In [113]:
pageviews_df.head(25)
Out[113]:
In [140]:
pageviews_df.to_csv("enwiki_pageviews_2016.csv")
In [114]:
articles = []
for index, row in pageviews_df.head(6).iterrows():
    articles.append(index)
In [116]:
from mwviews.api import PageviewsClient
p = PageviewsClient(10)
startdate = "2016010100"
enddate = "2016123123"
In [117]:
counts_dict = p.article_views('en.wikipedia', articles, granularity='daily', start=startdate, end=enddate)
In [118]:
counts_df = pd.DataFrame.from_dict(counts_dict, orient='index')
counts_df = counts_df.fillna(0)
In [119]:
counts_df.to_csv("deaths-enwiki-2016.csv")
In [122]:
matplotlib.style.use('seaborn-darkgrid')
font = {'family': 'normal',
        'weight': 'normal',
        'size': 18}
matplotlib.rc('font', **font)
plt.figure(figsize=[14, 7.2])
for title in counts_df:
    # Series.plot returns the shared matplotlib Axes
    ax = counts_df[title].plot(legend=True, linewidth=2)
ax.set_ylabel('Views per day')
plt.legend(loc='best')
Out[122]:
To get data about the number of times each article in the "[year] deaths" categories has been edited, we could use the API, but it would take a long time: there are over 100,000 articles in the 2000-2016 categories, and each one would require a new API call. This is the kind of query that SQL is meant for, and we can use the Quarry service to run it directly against a replica of Wikipedia's database.
I've included the query below in a code cell, but it was actually run on Quarry. The pattern '20___deaths' uses SQL's single-character wildcard, so it matches categories like '2016_deaths' but also the decade category '200s_deaths', which the query explicitly excludes. We download the results as a TSV file, then load them into a pandas dataframe for processing.
In [123]:
sql_query = """
select cl_to, cl_from, count(rev_id) as edits, page_title
from (select * from categorylinks where cl_to LIKE '20___deaths') as d
inner join revision on cl_from = rev_page
inner join page on rev_page = page_id
where page_namespace = 0 and cl_to NOT LIKE '200s_deaths' and page_title NOT LIKE 'List_of%'
group by cl_from
"""
In [124]:
!wget https://quarry.wmflabs.org/run/139193/output/0/tsv?download=true -O deaths.tsv
In [149]:
deaths_df = pd.read_csv("deaths.tsv", sep='\t')
deaths_df.columns = ['year', 'page_id', 'edits', 'title']
deaths_df.head(15)
Out[149]:
We can filter the articles in the various death-by-year categories by their total edit counts. But what should our threshold be? What are we looking for? I've chosen 7 different thresholds (over 10, 50, 100, 250, 500, 750, and 1,000 edits). The results these different thresholds produce give rise to different interpretations of the same question.
In [150]:
deaths_over10 = deaths_df[deaths_df.edits>10]
deaths_over50 = deaths_df[deaths_df.edits>50]
deaths_over100 = deaths_df[deaths_df.edits>100]
deaths_over250 = deaths_df[deaths_df.edits>250]
deaths_over500 = deaths_df[deaths_df.edits>500]
deaths_over750 = deaths_df[deaths_df.edits>750]
deaths_over1000 = deaths_df[deaths_df.edits>1000]
In [151]:
deaths_over10 = deaths_over10[['year','edits']]
deaths_over50 = deaths_over50[['year','edits']]
deaths_over100 = deaths_over100[['year','edits']]
deaths_over250 = deaths_over250[['year','edits']]
deaths_over500 = deaths_over500[['year','edits']]
deaths_over750 = deaths_over750[['year','edits']]
deaths_over1000 = deaths_over1000[['year','edits']]
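As an aside, these two cells could be collapsed into a single dictionary comprehension. This is just a sketch, and the deaths_over dict it builds is a name introduced here for illustration, not used elsewhere in the original analysis.
In [ ]:
# One filtered ['year', 'edits'] frame per threshold, keyed by the threshold value
thresholds = [10, 50, 100, 250, 500, 750, 1000]
deaths_over = {t: deaths_df[deaths_df.edits > t][['year', 'edits']]
               for t in thresholds}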
In [152]:
matplotlib.style.use('seaborn-darkgrid')
font = {'family': 'normal',
        'weight': 'normal',
        'size': 10}
matplotlib.rc('font', **font)
In [155]:
ax = deaths_over10.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >10 edits in "[year] deaths" category""")
Out[155]:
In [156]:
ax = deaths_over50.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >50 edits in "[year] deaths" category""")
Out[156]:
In [157]:
ax = deaths_over100.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >100 edits in "[year] deaths" category""")
Out[157]:
In [158]:
ax = deaths_over250.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >250 edits in "[year] deaths" category""")
Out[158]:
In [159]:
ax = deaths_over500.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >500 edits in "[year] deaths" category""")
Out[159]:
In [160]:
ax = deaths_over750.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >750 edits in "[year] deaths" category""")
Out[160]:
In [161]:
ax = deaths_over1000.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >1,000 edits in "[year] deaths" category""")
Out[161]:
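The seven plotting cells above are nearly identical, so the same charts could also be generated in a single loop. This sketch builds on the hypothetical deaths_over dict from the aside above.
In [ ]:
for t, df in deaths_over.items():
    ax = df.groupby(['year']).agg(['count']).plot(kind='barh')
    ax.legend_.remove()
    ax.set_title('Number of articles with >{:,} edits in "[year] deaths" category'.format(t))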