Exploring deaths of notable people by year in Wikipedia

By R. Stuart Geiger, last updated 2016-12-28

Dual-licensed under CC-BY-SA 4.0 and the MIT License.

How many articles are in the "[year] deaths" categories in the English Wikipedia?

The first thing I tried was just counting up the number of articles in each of the "[year] deaths" categories, from 2000-2016.
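As an aside, these per-category counts come from the category metadata that the MediaWiki API exposes via `prop=categoryinfo`. A minimal sketch of fetching one count without pywikibot, using only the standard library (the `categoryinfo_url` helper is just for illustration):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://en.wikipedia.org/w/api.php"

def categoryinfo_url(year):
    # build the query URL for the "[year] deaths" category's metadata
    params = {
        "action": "query",
        "titles": "Category:%d deaths" % year,
        "prop": "categoryinfo",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)

def category_page_count(year):
    # fetch and parse the JSON response; results are keyed by page id
    with urlopen(categoryinfo_url(year)) as resp:
        pages = json.load(resp)["query"]["pages"]
    info = next(iter(pages.values()))
    return info["categoryinfo"]["pages"]
```

This returns the same `pages` field that `site.categoryinfo()` exposes below; pywikibot just wraps the same API call.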


In [141]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
%matplotlib inline
matplotlib.style.use('seaborn-darkgrid')

In [142]:
import pywikibot
site = pywikibot.Site('en', 'wikipedia')

In [146]:
def yearly_death_counts(startyear,endyear):
    years = np.arange(startyear,endyear+1) # add 1 to endyear because np.arange doesn't include the stop 
    deaths_per_year = {}
    for year in years:
        deaths_per_year[year] = 0

    for year in years:
        yearstr = 'Category:' + str(year) + "_deaths"
        deathcat = pywikibot.Page(site, yearstr)
        deathcat_o = site.categoryinfo(deathcat)
        deaths_per_year[year] = deathcat_o['pages']

    yearly_articles_df = pd.DataFrame.from_dict(deaths_per_year, orient='index')
    yearly_articles_df.columns = ['articles in category']
    yearly_articles_df = yearly_articles_df.sort_index()
    
    return yearly_articles_df

In [147]:
yearly_articles_df = yearly_death_counts(2000,2016)
yearly_articles_df


Out[147]:
articles in category
2000 4945
2001 5112
2002 5378
2003 5566
2004 5691
2005 6128
2006 6586
2007 6872
2008 7178
2009 7322
2010 7486
2011 7258
2012 7242
2013 7606
2014 7721
2015 7591
2016 6766

In [148]:
ax = yearly_articles_df.plot(kind='bar',figsize=[10,4])
ax.legend_.remove()
ax.set_ylabel("Number of articles")
ax.set_title("""Articles in the "[year] deaths" category in the English Wikipedia""")


Out[148]:
<matplotlib.text.Text at 0x7fd06199b128>

Interpreting total article counts

One of the first things we see in this graph is that the data is far from uniform and has a distinct trend. This should make us suspicious. There are about 4,945 articles in the "2000 deaths" category, and the number rises steadily each year to 7,486 articles in the "2010 deaths" category. Is there any compelling reason to believe that the number of notable people in the world would steadily increase by a few percent each year from 2000 to 2010, then plateau? Or is it more of an artifact of what Wikipedia's volunteer editors choose to work on?

What if we look at this over a much longer timescale, like 1800-2016?


In [81]:
yearly_articles_df = yearly_death_counts(1800,2016)

In [96]:
ax = yearly_articles_df.plot(kind='line',figsize=[10,4])
ax.legend_.remove()
ax.set_ylabel("Number of articles")
ax.set_title("""Articles in the "[year] deaths" category in the English Wikipedia""")


We can see two big jumps in the 20th century, likely reflecting the events around World War I and II. This makes sense, as those periods saw sharp increases in the total number of deaths, as well as in the number of notable deaths. Remember: we have already assumed that Wikipedia's biographical articles don't represent all of humanity -- in fact, we are counting on it, so that we can distinguish celebrity deaths.

However, for the purposes of our question, is it safe to assume that having a Wikipedia article means being a celebrity? When people talk about so many celebrities dying in 2016, they seem to mean a much smaller number than the ~7,000 people with Wikipedia articles who died in each of the years 2010-2016. The number is maybe two orders of magnitude lower, somewhere closer to 70 than 7,000. So is there a way we can filter Wikipedia articles?

To get at this, I first thought of using the pageview data that Wikimedia collects. There is a nice API that reports how many times every article in every language version of Wikipedia is viewed each hour. I hadn't played around with that API, so I wanted to try it out.
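For reference, the raw REST endpoint behind that API can also be hit directly. A minimal sketch that builds the per-article URL for one calendar year, following the URL pattern visible in the error messages later in this notebook (the `pageview_url` helper is just for illustration):

```python
from urllib.parse import quote

def pageview_url(article, year, project="en.wikipedia",
                 access="all-access", agent="all-agents", granularity="monthly"):
    # one article, one calendar year
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    start = "%d010100" % year   # January 1st, hour 00
    end = "%d123100" % year     # December 31st
    return "/".join([base, project, access, agent,
                     quote(article, safe=""), granularity, start, end])
```

Fetching that URL returns JSON with one entry per month, which is essentially what the mwviews client does under the hood.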

Pageviews for articles in the "2016 Deaths" category

The mwviews python package has support for hourly, daily, and monthly granularity, but not annual. So I wrote a function that gets the pageview counts for a given article for an entire year. But, as we will see, the data in the historical pageview API only goes back to mid-2015.


In [108]:
!pip install mwviews
from mwviews.api import PageviewsClient

def yearly_views(title,year):
    p = PageviewsClient(2)  # 2 parallel request threads
    startdate = str(year) + "010100"  # January 1st, hour 00
    enddate = str(year) + "123123"    # December 31st, hour 23
    d = p.article_views('en.wikipedia', title, granularity='monthly', start=startdate, end=enddate)
    # sum the monthly counts into an annual total, skipping months with no data
    total = 0
    for month in d.values():
        for titlecount in month.values():
            if titlecount is not None:
                total += titlecount
    return total


Requirement already satisfied: mwviews in /srv/paws/lib/python3.4/site-packages
Requirement already satisfied: requests in /srv/paws/lib/python3.4/site-packages (from mwviews)
Requirement already satisfied: futures in /srv/paws/lib/python3.4/site-packages (from mwviews)

In [101]:
yearly_views("Prince_(musician)", 2016)


Out[101]:
22810614

In [102]:
yearly_views("Prince_(musician)", 2015)


Out[102]:
1642187

In [103]:
yearly_views("Prince_(musician)", 2014)


ERROR while fetching and parsing ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Prince_%28musician%29/daily/2014010100/2014123100']
Traceback (most recent call last):
  File "/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py", line 139, in article_views
    'The pageview API returned nothing useful at: {}'.format(urls)
Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Prince_%28musician%29/daily/2014010100/2014123100']
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-103-0c851dd11c05> in <module>()
----> 1 yearly_views("Prince_(musician)", 2014)

<ipython-input-97-aed6b3042346> in yearly_views(title, year)
      6     startdate = str(year) + "010100"
      7     enddate = str(year) + "123100"
----> 8     d = p.article_views('en.wikipedia', title, granularity='monthly', start=startdate, end=enddate)
      9     total = 0
     10     for month in d.values():

/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py in article_views(self, project, articles, access, agent, granularity, start, end)
    137             if not some_data_returned:
    138                 raise Exception(
--> 139                     'The pageview API returned nothing useful at: {}'.format(urls)
    140                 )
    141 

Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Prince_%28musician%29/daily/2014010100/2014123100']

Querying the pageview API for all the articles in the "2016 deaths" category

I wanted to get 2016 pageview data for 2016 deaths, 2015 pageview data for 2015 deaths, and so on. But the pageview API doesn't have full historical data. However, we can take a detour and do some interesting exploration with only the 2016 dataset.

This code iterates through the category for "2016 deaths" and for each page, queries the pageview API to get the number of total pageviews in 2016. It takes a few minutes to run. This throws some errors for a few articles (in pink boxes below), which we will ignore.


In [109]:
year = 2016
yearstr = 'Category:' + str(year) + "_deaths"
deathcat = pywikibot.Page(site, yearstr)

pageviews_2016 = {}

for page in site.categorymembers(deathcat):
    # skip list articles and subcategories; page.title() uses spaces, so match "List of"
    if page.title().find("List of") == -1 and page.title().find("Category:") == -1:

        try:
            page_yearly_views = yearly_views(page.title(),year)
        except Exception as e:
            page_yearly_views = 0

        pageviews_2016[page.title()] = page_yearly_views


ERROR while fetching and parsing ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Jaume_Camprodon_i_Rovira/daily/2016010100/2016123123']
Traceback (most recent call last):
  File "/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py", line 139, in article_views
    'The pageview API returned nothing useful at: {}'.format(urls)
Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Jaume_Camprodon_i_Rovira/daily/2016010100/2016123123']
ERROR while fetching and parsing ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Koichi_Kato_%28politician%2C_born_1939%29/daily/2016010100/2016123123']
Traceback (most recent call last):
  File "/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py", line 139, in article_views
    'The pageview API returned nothing useful at: {}'.format(urls)
Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Koichi_Kato_%28politician%2C_born_1939%29/daily/2016010100/2016123123']
ERROR while fetching and parsing ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Scott_Eric_Kaufman/daily/2016010100/2016123123']
Traceback (most recent call last):
  File "/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py", line 139, in article_views
    'The pageview API returned nothing useful at: {}'.format(urls)
Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Scott_Eric_Kaufman/daily/2016010100/2016123123']
ERROR while fetching and parsing ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Vladimir_Stupishin/daily/2016010100/2016123123']
Traceback (most recent call last):
  File "/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py", line 139, in article_views
    'The pageview API returned nothing useful at: {}'.format(urls)
Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Vladimir_Stupishin/daily/2016010100/2016123123']
ERROR while fetching and parsing ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Tricia_McCauley/daily/2016010100/2016123123']
Traceback (most recent call last):
  File "/srv/paws/lib/python3.4/site-packages/mwviews/api/pageviews.py", line 139, in article_views
    'The pageview API returned nothing useful at: {}'.format(urls)
Exception: The pageview API returned nothing useful at: ['https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Tricia_McCauley/daily/2016010100/2016123123']

In [110]:
pageviews_df = pd.DataFrame.from_dict(pageviews_2016,orient='index')
pageviews_df = pageviews_df.sort_values(0, ascending=False)

In [113]:
pageviews_df.head(25)


Out[113]:
0
Deaths in 2016 35737460
Prince (musician) 22810614
David Bowie 19095595
Muhammad Ali 16365019
Alan Rickman 9802068
Carrie Fisher 8321692
George Michael 8211608
Fidel Castro 7611586
Jayalalithaa 6104233
Gene Wilder 5918108
Christina Grimmie 5712389
Anton Yelchin 5076375
Chyna 4638369
Leonard Cohen 4379613
Kimbo Slice 3980979
Antonin Scalia 3940872
Zsa Zsa Gabor 3928903
Alan Thicke 3776664
Glenn Frey 3695988
Nancy Reagan 3071559
Jerry Heller 2937351
Garry Shandling 2598703
Patty Duke 2582687
John Glenn 2550106
Bhumibol Adulyadej 2420709

In [140]:
pageviews_df.to_csv("enwiki_pageviews_2016.csv")

Getting the daily pageview counts for the 6 most viewed articles in "2016 deaths" (including the "Deaths in 2016" article)


In [114]:
articles = []
for index,row in pageviews_df.head(6).iterrows():
    articles.append(index)

In [116]:
from mwviews.api import PageviewsClient
p = PageviewsClient(10)

startdate = "2016010100"
enddate = "2016123123"

In [117]:
counts_dict = p.article_views('en.wikipedia', articles, granularity='daily', start=startdate, end=enddate)

In [118]:
counts_df = pd.DataFrame.from_dict(counts_dict, orient='index')
counts_df = counts_df.fillna(0)

In [119]:
counts_df.to_csv("deaths-enwiki-2016.csv")

Plotting pageviews per day for the 6 most viewed articles


In [121]:
articles = []
for index,row in pageviews_df.head(6).iterrows():
    articles.append(index)
    
counts_dict = p.article_views('en.wikipedia', articles, granularity='daily', start=startdate, end=enddate)
counts_df = pd.DataFrame.from_dict(counts_dict, orient='index')
counts_df = counts_df.fillna(0)

In [122]:
matplotlib.style.use('seaborn-darkgrid')

font = {'family' : 'normal',
        'weight'  : 'normal',
        'size'   : 18}

matplotlib.rc('font', **font)

plt.figure(figsize=[14,7.2])
for title in counts_df:
    fig = counts_df[title].plot(legend=True, linewidth=2)
    fig.set_ylabel('Views per day')
plt.legend(loc='best')


Out[122]:
<matplotlib.legend.Legend at 0x7fd070594c88>
WARNING: /srv/paws/lib/python3.4/site-packages/matplotlib/font_manager.py:1288: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Querying edit counts for articles in the "[year] deaths" categories using SQL/Quarry

To get data about the number of times each article in the "[year] deaths" categories has been edited, we could use the API, but it would take a long time: there are over 100,000 articles in the 2000-2016 categories, and that would require a new API call for each one. This is the kind of query that SQL is meant for, and we can use the Quarry service to run this query directly on a replica of Wikipedia's database.

I've included the query below in a code cell, but it was actually run on Quarry. We will download the results as a TSV file, then load them into a pandas dataframe for processing.


In [123]:
sql_query = """
select cl_to, cl_from, count(rev_id) as edits, page_title
from (select * from categorylinks where cl_to LIKE '20___deaths') as d -- _ is a single-character wildcard: matches 2000_deaths through 2099_deaths (and 200s_deaths, excluded below)
inner join revision on cl_from = rev_page
inner join page on rev_page = page_id
where page_namespace = 0 and cl_to NOT LIKE '200s_deaths' and page_title NOT LIKE 'List_of%' -- main-namespace articles only; drop the decade category and list articles
group by cl_from
"""

In [124]:
!wget https://quarry.wmflabs.org/run/139193/output/0/tsv?download=true -O deaths.tsv


--2016-12-28 23:10:48--  https://quarry.wmflabs.org/run/139193/output/0/tsv?download=true
Resolving quarry.wmflabs.org (quarry.wmflabs.org)... 10.68.21.68
Connecting to quarry.wmflabs.org (quarry.wmflabs.org)|10.68.21.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘deaths.tsv’

deaths.tsv              [                <=>   ]   4.47M  1.03MB/s   in 4.6s   

2016-12-28 23:10:53 (997 KB/s) - ‘deaths.tsv’ saved [4686976]


In [149]:
deaths_df = pd.read_csv("deaths.tsv", sep='\t')
deaths_df.columns = ['year', 'page_id', 'edits', 'title']
deaths_df.head(15)


Out[149]:
year page_id edits title
0 2000_deaths 888 600 A._E._van_Vogt
1 2016_deaths 930 613 Alvin_Toffler
2 2008_deaths 1625 2758 Aleksandr_Solzhenitsyn
3 2007_deaths 2021 252 Atle_Selberg
4 2014_deaths 2042 1366 Alexander_Grothendieck
5 2001_deaths 2144 10128 Aaliyah
6 2001_deaths 2176 2666 Ahmad_Shah_Massoud
7 2001_deaths 2198 379 Abdulaziz_al-Omari
8 2009_deaths 2201 415 Aage_Bohr
9 2005_deaths 2310 5331 Arthur_Miller
10 2014_deaths 2944 6011 Ariel_Sharon
11 2011_deaths 2999 403 Arthur_Laurents
12 2012_deaths 3029 578 Arthur_Jensen
13 2001_deaths 3221 371 Ahmed_al-Nami
14 2001_deaths 3222 247 Ahmed_al-Haznawi

Filtering articles by number of edits

We can filter the articles in the various "[year] deaths" categories by their total edit count. But what should our threshold be? What are we looking for? I've chosen 7 different thresholds (over 10, 50, 100, 250, 500, 750, and 1,000 edits). The results these different thresholds produce give rise to different interpretations of the same question.
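As an aside, the seven separately filtered dataframes could also be computed in one pass with a dict comprehension over the thresholds. A minimal sketch, using a toy stand-in for deaths_df (the real frame comes from the Quarry TSV above):

```python
import pandas as pd

# toy stand-in for deaths_df, which in the notebook is loaded from deaths.tsv
deaths_df = pd.DataFrame({
    'year':  ['2000_deaths', '2000_deaths', '2001_deaths', '2001_deaths'],
    'edits': [5, 600, 120, 1200],
})

thresholds = [10, 50, 100, 250, 500, 750, 1000]

# one column of per-year article counts for each edit threshold
threshold_counts = pd.DataFrame({
    '>%d edits' % t: deaths_df[deaths_df.edits > t].groupby('year').size()
    for t in thresholds
}).fillna(0).astype(int)

print(threshold_counts)
```

Each column of `threshold_counts` matches one of the bar charts below, so a single `threshold_counts[col].plot(kind='barh')` loop could replace the seven near-identical plotting cells.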


In [150]:
deaths_over10 = deaths_df[deaths_df.edits>10]
deaths_over50 = deaths_df[deaths_df.edits>50]
deaths_over100 = deaths_df[deaths_df.edits>100]
deaths_over250 = deaths_df[deaths_df.edits>250]
deaths_over500 = deaths_df[deaths_df.edits>500]
deaths_over750 = deaths_df[deaths_df.edits>750]
deaths_over1000 = deaths_df[deaths_df.edits>1000]

In [151]:
deaths_over10 = deaths_over10[['year','edits']]
deaths_over50 = deaths_over50[['year','edits']]
deaths_over100 = deaths_over100[['year','edits']]
deaths_over250 = deaths_over250[['year','edits']]
deaths_over500 = deaths_over500[['year','edits']]
deaths_over750 = deaths_over750[['year','edits']]
deaths_over1000 = deaths_over1000[['year','edits']]

In [152]:
matplotlib.style.use('seaborn-darkgrid')

font = {'family' : 'normal',
        'weight'  : 'normal',
        'size'   : 10}

matplotlib.rc('font', **font)

In [155]:
ax = deaths_over10.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >10 edits in "[year] deaths" category""")


Out[155]:
<matplotlib.text.Text at 0x7fd071640e48>

In [156]:
ax = deaths_over50.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >50 edits in "[year] deaths" category""")


Out[156]:
<matplotlib.text.Text at 0x7fd089888358>

In [157]:
ax = deaths_over100.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >100 edits in "[year] deaths" category""")


Out[157]:
<matplotlib.text.Text at 0x7fd0897cc0f0>

In [158]:
ax = deaths_over250.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >250 edits in "[year] deaths" category""")


Out[158]:
<matplotlib.text.Text at 0x7fd089710390>

In [159]:
ax = deaths_over500.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >500 edits in "[year] deaths" category""")


Out[159]:
<matplotlib.text.Text at 0x7fd0896412e8>

In [160]:
ax = deaths_over750.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >750 edits in "[year] deaths" category""")


Out[160]:
<matplotlib.text.Text at 0x7fd08923d908>

In [161]:
ax = deaths_over1000.groupby(['year']).agg(['count']).plot(kind='barh')
ax.legend_.remove()
ax.set_title("""Number of articles with >1,000 edits in "[year] deaths" category""")


Out[161]:
<matplotlib.text.Text at 0x7fd070376b00>
