The goal of this project is to explore the concept of 'bias' in data by analyzing Wikipedia articles on political figures from different countries.
The data will include a dataset of political articles on Wikipedia, the predicted article quality scores for those articles, and a dataset of country populations.
The analysis will quantify the number of Wikipedia articles devoted to politicians in each country, label the quality of those articles, and consider how those measurements vary between countries.
The data visualization will include a series of plots that show the countries with the highest and lowest percentages of Wikipedia articles per capita, and the countries with the highest and lowest proportions of high-quality (FA- or GA-rated) articles.
In [ ]:
"""
The code in this notebook cell is optional; you do not need to run this
cell in order to run the code in subsequent cells, but if this cell
isn't run, some intermediate values won't be displayed.
To run a cell, position the cursor inside the cell so the cell border
turns green, then press Control and Return (or Enter) together.
"""
# This code displays all results created within a jupyter notebook cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# This code displays Matplotlib objects inline.
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
We will be combining three sources of data: the Wikipedia article data, the ORES article-quality predictions, and the country population data.
The Wikipedia dataset about political articles (also called pages) can be found on Figshare. The English-language article data was extracted using the Wikimedia API, saved as a CSV file named page_data.csv, and uploaded to Figshare. For more information, see the README.md file in the data-512-a2 repository. A copy of the page_data.csv file is also available in this data-512-a2 repository.
The columns in the page_data.csv file are page (the article title), country, and rev_id (the revision ID of the article's last edit).
In [1]:
"""
The code in this cell reads in the Wikipedia article data from the
file page_data.csv, which contains the following data:
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
page 47197 non-null object
country 47197 non-null object
rev_id 47197 non-null int64
The data is then stored in a pandas DataFrame (DF) object.
"""
import csv
import pandas as pd
page_data = pd.read_csv("page_data.csv")
page_data.head()
Out[1]:
In [2]:
print("Number of rows and columns in original page_data DataFrame", page_data.shape)
In [3]:
"""
The code in this cell:
(1) standardizes some of the country names in page_data for merging page_data
with the population data; and
(2) removes two rows whose revision ID values (807367030 and 807367166) are not
recognized by the ORES API and therefore return errors.
"""
# Part 1: standardize country names
# COUNTRY_MAP from Gary
COUNTRY_MAP = {
"East Timorese" : "Timor-Leste",
"Hondura" : "Honduras",
"Rhodesian" : "Zimbabwe",
"Salvadoran" : "El Salvador",
"Samoan" : "Samoa",
"São Tomé and Príncipe" : "Sao Tome and Principe",
"South African Republic" : "South Africa",
"South Korean" : "Korea, South"
}
# A sample of original rows with values we want to replace
page_data.loc[page_data.index.isin([272, 443, 448, 541, 602])]
# Replace the old country names with the standardized names, if any are present
if page_data["country"].isin(COUNTRY_MAP.keys()).any():
    page_data["country"].replace(COUNTRY_MAP, inplace=True)
# Verify that the values were replaced in those sample rows
page_data.loc[page_data.index.isin([272, 443, 448, 541, 602])]
# Part 2: removal of revision ID values
# check if rev_id values in page_data
page_data.loc[page_data["rev_id"].isin([807367030, 807367166])]
# remove rows with rev_id in [807367030, 807367166]
page_data = page_data.loc[~page_data["rev_id"].isin([807367030, 807367166])]
page_data.head(10)
Out[3]:
In [4]:
print("Ending number of rows and columns in page_data", page_data.shape)
In [5]:
"""
The code in this cell reads in the country population data from
the file population_mid-2015.csv and does some processing.
When population_mid-2015.csv is read without any parameters
(as page_data.csv was read above), the title row becomes a
single header column. To get 6 columns instead of 1, set the
second row (index 1) as the header with the parameter header=1.
The original data in population_mid-2015.csv looks like this:
RangeIndex: 210 entries, 0 to 209
Data columns (total 6 columns):
Location 210 non-null object
Location Type 210 non-null object
TimeFrame 210 non-null object
Data Type 210 non-null object
Data 210 non-null int64
Footnotes 0 non-null float64
"""
pop_data = pd.read_csv("population_mid-2015.csv",
                       header=1, sep=",", thousands=",")
# Drop the unneeded columns only if they are still present in the pop_data DF
if len(pop_data.columns) > 2:
    pop_data.drop(["Location Type", "TimeFrame",
                   "Data Type", "Footnotes"], axis=1, inplace=True)
# Rename columns to standardize names for future merging
pop_data.rename(columns={"Location" : "country",
"Data" : "population"}, inplace=True)
pop_data.head()
Out[5]:
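As a quick sanity check on the header=1 choice, the column names can be inspected directly; a minimal sketch (it assumes population_mid-2015.csv is in the working directory, and the expected names are those listed in the docstring above):
# Read with the second line (index 1) as the header, as in the cell above,
# and confirm that the six expected columns are present.
pop_check = pd.read_csv("population_mid-2015.csv", header=1, thousands=",")
print(list(pop_check.columns))
# Expected: ['Location', 'Location Type', 'TimeFrame', 'Data Type', 'Data', 'Footnotes']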
The predicted quality scores for each article in the Wikipedia dataset come from a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article at a particular point in time and assigns a probability to each of the quality categories listed below.
The quality scores, from best to worst, are: FA (Featured article), GA (Good article), B (B-class article), C (C-class article), Start (Start-class article), and Stub (Stub-class article).
These quality scores are a subset of the quality assessment categories developed by Wikipedia editors. For more information about the scores, see Project Assessment.
The ORES API documentation can be found here and the web API is here. The API requires a revision ID, which is the third column in page_data.csv (originally titled "last_edit"), and the name of the machine learning model, which is "wp10".
When you query the API, ORES returns a JSON object that includes the predicted quality score as well as the probability values for each of the six possible quality classes. For the analysis in this project, however, you only need the predicted quality score, not the probabilities.
Below is an example of a response in JSON format from the ORES API:
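(The revision ID and the probability values in this example are illustrative; the nesting of the keys matches the fields accessed by the code in the next cell.)
{
    "enwiki": {
        "scores": {
            "123456789": {
                "wp10": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "B": 0.011, "C": 0.032, "FA": 0.002,
                            "GA": 0.004, "Start": 0.151, "Stub": 0.800
                        }
                    }
                }
            }
        }
    }
}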
In [6]:
import requests
import json
import time
def get_page_quality_scores(rev_ids):
"""
Function takes revision id values, calls the ORES API, and returns the
revision id values recognized by the ORES API, the article quality scores, and
the time for each API call.
"""
# start is the time in seconds as a floating point number
start = time.time()
# Local variables
local_scores = []
local_rev_ids = []
endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
params = {"project" : "enwiki",
"model" : "wp10",
"revids" : "|".join(str(x) for x in rev_ids)
}
api_call = requests.get(endpoint.format(**params))
response = api_call.json()
#print(json.dumps(response, indent=4, sort_keys=True))
# Strip out the quality score from the JSON object and save in list (scores)
for rev_id in rev_ids:
try:
local_rev_ids.append(rev_id)
local_scores.append(response["enwiki"]["scores"][str(rev_id)]["wp10"]["score"]["prediction"])
except:
print("exception with rev_id:", str(rev_id))
pass
# Measure time to run batch of Get requests in minutes
end = (time.time() - start)/60
return local_rev_ids, local_scores, end
try:
    # This is a shortcut: redefine page_data as a DataFrame that already includes
    # the article_quality data, since fetching it from the API takes about 3.52 minutes
page_data = pd.read_csv("page_quality_data.csv")
if page_data["Unnamed: 0"].any():
page_data.drop(["Unnamed: 0"], axis=1, inplace=True)
print("New page_data DataFrame rows and columns:", page_data.shape)
print(page_data.head(10))
except FileNotFoundError:
    # If there is no cached page_quality_data.csv file, fetch the data from the ORES API
lst_rev_ids = []
lst_article_scores = []
run_time = 0
for i in range(0, len(page_data["rev_id"]), 100):
# batches of 100 rev_id values to send to get_page_quality_scores()
revision_ids = page_data["rev_id"][i:(i + 100)]
rev_id_vals, quality_scores, batch_time = get_page_quality_scores(revision_ids)
for id_value in rev_id_vals:
# Keep a running list of rev_id values as an index for quality scores
lst_rev_ids.append(id_value)
for score in quality_scores:
# Keep a running list of quality scores to add to page_data
lst_article_scores.append(score)
run_time = run_time + batch_time
# Test print statement for tracking time
# print("i:", str(i) + ", batch_time: %f, run_time: %f" % (batch_time, run_time))
# Add a column with the predicted article quality scores to the page_data
df_scores = pd.DataFrame({"rev_id" : lst_rev_ids, "article_quality" : lst_article_scores})
page_score_data = page_data.merge(df_scores, how='outer',
on=["rev_id"]).dropna(axis=0, how='any')
page_score_data.to_csv("page_quality_data.csv")
print("\nTotal Run Time: {0:5.2f} minutes\n".format(run_time))
In [7]:
page_data.shape
Out[7]:
After retrieving the ORES data for each article, we merge the Wikipedia article data and the population data together. Both DataFrames have a column named "country", so we merge page_data and pop_data on the country values.
While merging the DataFrames, some rows will not match because a country name in one DataFrame does not appear in the other; for example, the population dataset has no entry for some of the countries in the Wikipedia data. We therefore remove the rows that do not have matching data: page_data initially had 47,195 rows, and the final combine_df has 46,408 rows.
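As a side note, an inner merge produces the same result as the outer merge followed by dropna used in the next cell; a minimal sketch (assuming neither input DataFrame has missing values of its own):
# An inner merge keeps only the rows whose country appears in both DataFrames,
# which is equivalent here to the outer merge + dropna(how='any') used below.
combined_inner = page_data.merge(pop_data, how="inner", on="country").reset_index(drop=True)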
After consolidating the data, we save the DataFrame as a single CSV file with these columns:
country
article_name
revision_id
article_quality
population
In [8]:
# Rename page_data columns to the final names; rename() simply ignores
# mapping keys that are not present, so no try/except is needed
page_data.rename(columns={"page" : "article_name",
                          "rev_id" : "revision_id"}, inplace=True)
# Combine DataFrames on country names and remove any rows with empty values
combine_df = page_data.merge(pop_data,
how='outer', on=["country"]).dropna(axis=0,
how='any').reset_index()
# Remove the old index which is now a column named index
# because I reset the index
combine_df.drop(["index"], axis=1, inplace=True)
# Change revision_id & population values from float to int
combine_df[["revision_id",
"population"]] = combine_df[["revision_id",
"population"]].astype(int)
combine_df.info()
In [9]:
# Save final combined DataFrame to CSV file
combine_df.to_csv("page_pop_data_final.csv")
combine_df.head(10)
Out[9]:
The analysis calculates two proportions (expressed as percentages) for each country: articles per population, and high-quality articles as a share of all of that country's articles. "High-quality" articles are defined as articles about politicians that ORES predicted to be in either the "FA" (featured article) or "GA" (good article) class.
Examples (a worked sketch with hypothetical numbers follows):
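For instance, with hypothetical numbers:
# A hypothetical country with 10 politician articles, 2 of them rated FA or GA,
# and a population of 1,000,000.
total_articles = 10
high_quality_articles = 2
population = 1_000_000
percent_articles_per_person = total_articles / population * 100      # 0.001 (percent)
percent_high_quality = high_quality_articles / total_articles * 100  # 20.0 (percent)
print(percent_articles_per_person, percent_high_quality)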
In [10]:
# Get the total number of articles for each country
df_test = combine_df[["country",
"article_name"]].groupby("country").count().astype(int).reset_index()
# Merge total number of articles for each country with population for each country
df_article_pop = df_test.merge(pop_data, on=["country"])
df_article_pop.rename(columns={"article_name" : "total_articles"},
inplace=True)
# Divide the total number of articles by the country's population
art_per_pop = df_article_pop["total_articles"].div(df_article_pop["population"],
axis='index')
# Get a percentage value by multiplying by 100
df_article_pop["percent_articles_per_person"] = round(art_per_pop*100, 4)
prop_df = df_article_pop.sort_values(["percent_articles_per_person"],
axis=0,
ascending=False,
inplace=False,
kind='quicksort')
# The 10 countries with the highest percentage of articles per person
prop_df.head(10)
Out[10]:
In [11]:
# The 10 countries with the lowest percentage of articles per person
prop_df.tail(10)
Out[11]:
In [12]:
# number of GA and FA-quality articles as a proportion of all articles from country
# Gets all the articles rated "FA" or "GA"
fa_ga = combine_df[(combine_df["article_quality"]=="FA") | (combine_df["article_quality"]=="GA")]
# Gets all the articles rated neither "FA" nor "GA" (not used in the calculations below)
no_fa_ga = combine_df[(combine_df["article_quality"]!="FA") & (combine_df["article_quality"]!="GA")]
# Adds up all the FA & GA articles for each country
total_fa_ga = fa_ga[["country",
"article_quality"]].groupby("country",
as_index=False).count()
# Adds up all the articles for each country regardless of quality rating
total_articles = combine_df[["country",
"article_quality"]].groupby("country",
as_index=False).count()
# Merge total_fa_ga and total_articles on country values
art_type_df = total_fa_ga.merge(total_articles, on="country")
art_type_df.rename(columns={"article_quality_x" : "total_FA_GA",
"article_quality_y" : "total_pages"},
inplace=True)
art_type_df["percent_FA_GA"] = round(art_type_df["total_FA_GA"]/art_type_df["total_pages"].astype(float)*100, 2)
sort_prop_type_df = art_type_df.sort_values(["percent_FA_GA"],
axis=0,
ascending=False,
inplace=False,
kind='quicksort')
# The 10 countries with the highest percentage of high quality articles
sort_prop_type_df.head(10)
Out[12]:
In [13]:
# The 10 countries with the lowest percentage of high quality articles
sort_prop_type_df.tail(10)
Out[13]:
In [14]:
# This code produces a crosstab of countries with their quality articles
# This code is partially based on the code from StackOverflow
# https://stackoverflow.com/questions/46290726/how-to-make-dummy-variables-with-comma-separated-valued-columns
df_articles = combine_df.set_index("country")["article_quality"].apply(pd.Series).stack()
# margins=True adds the All column & All row
cross_tab = pd.crosstab(df_articles.index.get_level_values(0),
                        df_articles, margins=True).rename_axis(None, axis=0).rename_axis(None, axis=1)
cross_tab.sort_values(["All"]).head(20)
Out[14]:
In [15]:
cross_tab.sort_values(["All"]).tail(20)
Out[15]:
In [16]:
# This code produces a list of the countries with no FA- or GA-rated articles;
# the 41 in head(41) below was determined by inspecting the sorted crosstab
countries_noQA = cross_tab.sort_values(["FA", "GA"]).head(41)
countries_noQA.index
Out[16]:
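The hard-coded head(41) above relies on knowing the count of such countries in advance. A boolean filter on the crosstab gives the same list without that assumption; a sketch (it assumes the FA and GA columns exist in cross_tab, i.e. that at least one article of each class appears in the data):
# Select countries with zero FA-rated and zero GA-rated articles directly,
# after dropping the "All" margin row added by margins=True.
no_quality = cross_tab.drop(index="All")
no_quality = no_quality[(no_quality["FA"] == 0) & (no_quality["GA"] == 0)]
print(len(no_quality))
print(list(no_quality.index))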
Produce four visualizations that show: the 10 countries with the highest and lowest percentages of Wikipedia articles per person, and the 10 countries with the highest and lowest percentages of articles rated FA or GA.
In [41]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from textwrap import wrap
# based on example from http://matplotlib.org/examples/lines_bars_and_markers/barh_demo.html
fig, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=(20,15))
top_country = prop_df["country"].head(10)
y_pos0 = np.arange(len(top_country))
top_percent = prop_df["percent_articles_per_person"].head(10)
ax0.barh(y_pos0, top_percent, align='center',
color="cadetblue", alpha=0.8)
ax0.set_yticks(y_pos0)
ax0.set_yticklabels(top_country, fontsize=14)
ax0.invert_yaxis() # labels read top-to-bottom
ax0.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5])  # pin tick positions so the labels below match them
ax0.set_xticklabels(labels=[0, 0.1, 0.2, 0.3, 0.4, 0.5], fontsize=12)
ax0.set_xlabel("Percent of Articles per Person", fontsize=14)
ax0.set_title("Countries with Highest Percent of Wikipedia Articles per Capita",
fontsize=16, fontvariant="small-caps", fontweight="semibold")
bottom_country = prop_df["country"].tail(10)[::-1]
y_pos1 = np.arange(len(bottom_country))
bottom_percent = round(prop_df["percent_articles_per_person"].tail(10)[::-1], 5)
ax1.barh(y_pos1, bottom_percent, align='center',
color="darkcyan", alpha=0.8)
ax1.set_yticks(y_pos1)
ax1.set_yticklabels(bottom_country, fontsize=14)
ax1.invert_yaxis()
ax1.set_xlabel("Percent of Articles per Person",
fontsize=14, fontstretch="semi-condensed")
ax1.set_title("Countries with Lowest Percent of Wikipedia Articles per Capita",
fontsize=16, fontweight="semibold")
top_FA_GA_country = sort_prop_type_df["country"].head(10)
y_pos2 = np.arange(len(top_FA_GA_country))
top_FA_GA_prop = sort_prop_type_df["percent_FA_GA"].head(10)
ax2.barh(y_pos2, top_FA_GA_prop, align='center',
color="firebrick", alpha=0.8)
ax2.set_yticks(y_pos2)
ax2.set_yticklabels(top_FA_GA_country, fontsize=14)
ax2.invert_yaxis() # labels read top-to-bottom
ax2.set_xticks([0, 5.0, 10.0, 15.0, 20.0, 25.0])  # pin tick positions so the labels below match them
ax2.set_xticklabels(labels=[0, 5.0, 10.0, 15.0, 20.0, 25.0], fontsize=12)
ax2.set_xlabel("Percent of Wikipedia Articles Rated FA or GA",
fontsize=14, fontstretch="semi-condensed")
ax2.set_title("Countries with Highest Proportion of FA and GA Rated Wikipedia Articles",
fontsize=16, fontweight="semibold")
bottom_FA_GA_country = sort_prop_type_df["country"].tail(10)[::-1]
y_pos3 = np.arange(len(bottom_FA_GA_country))
bottom_FA_GA_prop = sort_prop_type_df["percent_FA_GA"].tail(10)[::-1]
ax3.barh(y_pos3, bottom_FA_GA_prop, align='center',
color="darkred", alpha=0.8)
ax3.set_yticks(y_pos3)
ax3.set_yticklabels(bottom_FA_GA_country, fontsize=14)
ax3.invert_yaxis() # labels read top-to-bottom
ax3.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # pin tick positions so the labels below match them
ax3.set_xticklabels(labels=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6], fontsize=12)
ax3.set_xlabel("Percent of Wikipedia Articles Rated FA or GA",
fontsize=14, fontstretch="semi-condensed")
ax3.set_title("Countries with Lowest Proportion of FA and GA Rated Wikipedia Articles",
fontsize=16, fontweight="semibold",
fontstretch="semi-condensed")
plt.tight_layout()
plt.show()
# Save plot to file
fig.savefig("WikipediaBiasDataPlot.png")
You are also expected to write a short reflection on the project that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.
Write a few paragraphs, either in the README or in the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist).
You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.
In [27]:
combine_df.loc[combine_df["article_name"].isin(["George Washington",
"John Adams",
"Thomas Jefferson"
"James Madison",
"James Monroe",
"John Quincy Adams",
"Andrew Jackson",
"Abraham Lincoln",
"Ulysses S. Grant",
"Rutherford B. Hayes",
"James A. Garfield",
"Theodore Roosevelt",
"William Howard Taft",
"Woodrow Wilson",
"Warren G. Harding",
"Benjamin Harrison",
"Grover Cleveland",
"Calvin Coolidge",
"Herbert Hoover",
"Franklin D. Roosevelt",
"Harry S. Truman",
"Dwight D. Eisenhower",
"John F. Kennedy",
"Lyndon B. Johnson",
"Richard Nixon",
"Richard M. Nixon"
"Gerald Ford",
"Jimmy Carter",
"Ronald Reagan",
"George H. W. Bush",
"Bill Clinton",
"Hillary Clinton",
"Hillary Rodham Clinton",
"William J. Clinton",
"George W. Bush",
"George Bush",
"Barack Obama",
"Donald Trump",
"Donald J. Trump",
"John McCain",
"Bernie Sanders",
"Sarah Palin",
"Mahatma Gandhi",
"Margaret Thatcher",
"Vladimir Putin",
"Vicente Fox",
"Enrique Peña Nieto",
"Enrique Nieto"])]
Out[27]:
First and foremost, this English-language Wikipedia article data was generated from the category "Politicians by nationality" and one of its subcategories. So the editors who created these articles had to categorize the subjects of the articles as "politicians" and decide on those subjects' "nationality."
The Merriam-Webster Dictionary defines a politician as "a person experienced in the art or science of government," or "a person engaged in party politics as a profession," or "often disparaging: a person primarily interested in political office for selfish or other narrow usually short-sighted reasons."
In particular, since the term "politician" can be considered disparaging, it is possible that editors would be reluctant to categorize the subjects of their articles as politicians. For example, of the 33 well-known U.S. presidents queried above, only four have articles in our dataset, and one of those is John F. Kennedy, whose country is listed as "Ireland."
For reference, the Wikipedia category listing (C = number of subcategories, P = number of pages):
Politicians by nationality (243 C)
Assassinated politicians by nationality (129 C)
Political candidates by nationality (15 C)
Politicians by ethnic or national descent (12 C)
Politicians by former country (18 C)
Politicians by nationality and city (15 C)
Politicians by nationality and party (232 C)
Politicians convicted of crimes by nationality (72 C)
Politicians by nationality and century (45 C)
Leaders of political parties by country (39 C)
Politicians by century and nationality (4 C)
Politicians from dependent territories (9 C)
Women in politics by nationality (234 C)
LGBT politicians by nationality (41 C)
Sportsperson-politicians by nationality (62 C)
Politicians of African nations (65 C)
Politicians of Asian nations (49 C, 1 P)
Politicians of Caribbean nations (28 C)
Politicians of European nations (62 C)
Politicians of North American nations (10 C)
Politicians of Oceanian nations (31 C)
Politicians of South American nations (16 C)
Country with the Highest Proportion of Per Capita Articles: Nauru (0.4880 %)
Country with the Lowest Proportion of Per Capita Articles ( > 0 ): Bangladesh (0.0002 %)
Country with the Highest Proportion of Quality Articles: North Korea (23.08 %)
Country with the Lowest Proportion of Quality Articles ( > 0 ): Finland (0.17 %)
Countries with No High Quality Articles: 'Andorra', 'Antigua and Barbuda', 'Bahamas', 'Bahrain', 'Barbados', 'Belgium', 'Belize', 'Burundi', 'Cape Verde', 'Comoros', 'Djibouti', 'Dominica', 'Eritrea', 'Federated States of Micronesia', 'French Guiana', 'Guadeloupe', 'Guyana', 'Honduras', 'Kazakhstan', 'Kiribati', 'Lesotho', 'Liechtenstein', 'Macedonia', 'Marshall Islands', 'Monaco', 'Mozambique', 'Nauru', 'Nepal', 'San Marino', 'Sao Tome and Principe', 'Seychelles', 'Solomon Islands', 'Suriname', 'Swaziland', 'Switzerland', 'Tajikistan', 'Timor-Leste', 'Tonga', 'Tunisia', 'Turkmenistan', 'Zambia'
5 Countries with the Highest Number of Articles:
France: 1,689 (population 64,346,720)
Australia: 1,566
China: 1,138 (population 1,371,920,000)
United States: 1,098
Mexico: 1,081
For comparison: Spain population 46,368,000; United Kingdom population 65,092,000
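The headline figures above can be read directly off the DataFrames built earlier; a brief sketch (using prop_df and sort_prop_type_df from the analysis cells):
# Highest and lowest (non-zero) percent of articles per person.
print(prop_df.iloc[0][["country", "percent_articles_per_person"]])
nonzero = prop_df[prop_df["percent_articles_per_person"] > 0]
print(nonzero.iloc[-1][["country", "percent_articles_per_person"]])
# Highest and lowest percent of FA/GA-rated articles.
print(sort_prop_type_df.iloc[0][["country", "percent_FA_GA"]])
print(sort_prop_type_df.iloc[-1][["country", "percent_FA_GA"]])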
Initially, one might think that countries with higher populations would have more politicians, and therefore more articles about those politicians on Wikipedia. That hypothesis is anecdotally supported by China, one of the most populous countries on Earth, having the third most articles in this dataset, and by India having the seventh most with 990 articles. However, France has the highest number of articles despite having a population roughly one twenty-first the size of China's.
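The population ratio in that comparison checks out against the figures quoted above:
# Population figures quoted in the summary above.
china_population = 1_371_920_000
france_population = 64_346_720
print(round(china_population / france_population, 1))  # roughly 21.3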