Sean Miller (millsea0@u.washington.edu)
This notebook outlines an analysis of English Wikipedia articles on political figures from many countries. For each country, we examine the number of politician articles relative to its population and the percentage of those articles that are high quality, to understand how the English Wikipedia's coverage might be biased.
All of the following code was written and tested against the default packages present in Anaconda3 v4.4.0. You can find a download for Anaconda and its latest versions at https://repo.continuum.io/archive/.
In [1]:
import json
import os
import pandas as pd
import requests
%matplotlib inline
First, we'll prepare our folder structure for our analysis. Any data sets we've downloaded or will scrape from the web will be stored in the raw_data folder; any data sets that have been processed by our code will be stored in clean_data; and any visualizations or tables used for our final analysis will be stored in the outputs folder.
In [2]:
# If the folder raw_data doesn't already exist, create it
# raw_data is where any initial data sets are stored
if not os.path.exists("raw_data"):
    os.makedirs("./raw_data")
# If the folder clean_data doesn't already exist, create it
# clean_data is where any processed data sets are stored
if not os.path.exists("clean_data"):
    os.makedirs("./clean_data")
# If the folder outputs doesn't already exist, create it
# The outputs folder is where visualizations for our analysis will be stored
if not os.path.exists("outputs"):
    os.makedirs("./outputs")
To perform this analysis, we'll be joining data from three different data sets. These data sets and relevant information are listed below.
Data Set | File Name | URL | Documentation | License |
---|---|---|---|---|
EN-Wikipedia Articles On Politicians | page_data.csv | Figshare | Same as URL | CC-BY-SA 4.0 |
Country Population Data (Mid-2015) | Population Mid-2015.csv | Population Research Bureau website | Same as URL | Not specified on the website |
Wikipedia ORES | N/A | ORES | ORES Swagger | CC-BY-SA 3.0 |
For the first two data sets, we'll be manually downloading the data from the provided links, copying the files to the raw_data folder and reading in the csv files with the pandas library.
In [9]:
# Paths to files
population_data_file = "./raw_data/Population Mid-2015.csv"
politician_file_path = "./raw_data/page_data.csv"
# Read in population data
# We skip the first line using header=1 as we're uninterested in information before the column headers
population_df = pd.read_csv(population_data_file, header=1)
# Remove "," characters and cast the population column Data to a numeric value
population_df["Data"] = population_df["Data"].str.replace(",", "")
population_df["Data"] = population_df["Data"].apply(pd.to_numeric)
# Write the cleaned data back out to a csv
population_df.to_csv(population_data_file, index=False)
# Read in Wikipedia politician data
politician_df = pd.read_csv(politician_file_path)
# Print out sample of population DataFrame
population_df.head(4)
Out[9]:
In [10]:
# Print out sample of politician DataFrame
politician_df.head(4)
Out[10]:
After reading in our initial two data sets, we'll want to map the rev_id column of the politician DataFrame to a corresponding article quality using the ORES API. The predicted article quality can be one of six values, from highest to lowest: FA (Featured Article), GA (Good Article), B, C, Start and Stub. Documentation for how to format the URLs for this API can be found at the ORES Swagger.
To Note
You can submit up to 50 articles at a time to be evaluated by the ORES API.
If a page has been deleted, the ORES API will return "RevisionNotFound: Could not find revision". In the function below we handle that by printing out the JSON blob of any article that could not be found.
As part of the Terms and conditions of the Wikimedia REST API, we send a unique User-Agent header in our requests so Wikimedia can contact us if any problem arises from our script.
In [29]:
# Example call to the ORES API for a single revision ID
# ORES API endpoint
endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
# Create user-agent header
headers = {"User-Agent": "https://github.com/awfuldynne", "From": "millsea0@uw.edu"}
params = \
    {
        "project": "enwiki",
        "model": "wp10",
        "revids": "391862070"
    }
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
print(json.dumps(response, indent=4, sort_keys=True))
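The cell above simply prints the raw JSON. Based on how the response is parsed in the function below, the predicted class for a revision sits under enwiki → scores → &lt;rev_id&gt; → wp10 → score → prediction; a minimal sketch of pulling it out of the response above, assuming the call succeeded and the revision exists:
# Extract the predicted article quality for the single revision ID requested above
rev_id = params["revids"]
prediction = response["enwiki"]["scores"][rev_id]["wp10"]["score"]["prediction"]
print("Revision {} is predicted to be of quality {}".format(rev_id, prediction))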
In [12]:
def get_ores_page_quality_prediction(rev_ids, batch_size=50):
    """Method to get the wp10 model's prediction of page quality for a list of Wikipedia pages identified by revision ID
    https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades

    :param rev_ids: List of revision IDs for Wikipedia pages.
    :type rev_ids: list of int.
    :param batch_size: Number of pages to send to ORES per iteration.
    :type batch_size: int.
    :returns: Pandas DataFrame -- DataFrame with columns rev_id and article_quality
    """
    # ORES API endpoint
    endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
    # Create user-agent header
    headers = {"User-Agent": "https://github.com/awfuldynne", "From": "millsea0@uw.edu"}
    # Create column list
    columns = ["rev_id", "article_quality"]
    # Create empty DataFrame for article quality result set
    df = pd.DataFrame(columns=columns)
    # Indexes to keep track of which subset of the rev_id list we should be processing
    start_index = 0
    end_index = start_index + batch_size
    done_processing = False
    # Iterate through our list of revision IDs, appending to df as we process the results
    while not done_processing:
        params = \
            {
                "project": "enwiki",
                "model": "wp10",
                # Create a string of revision IDs like "123123|123124"
                "revids": "|".join(str(rev) for rev in rev_ids[start_index:end_index])
            }
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        for quality_score in response["enwiki"]["scores"]:
            # Create a new Series to append to the DataFrame
            new_row = pd.Series(index=columns)
            new_row.rev_id = quality_score
            try:
                new_row.article_quality = response["enwiki"]["scores"][quality_score]["wp10"]["score"]["prediction"]
                df = df.append(new_row, ignore_index=True)
            except KeyError:
                # The target article no longer exists on Wikipedia. Print each data point that
                # couldn't be retrieved
                print(response["enwiki"]["scores"][quality_score])
        # Update indexes
        start_index += batch_size
        end_index += batch_size
        # If start_index is greater than or equal to the length of rev_ids we are finished processing our list
        done_processing = start_index >= len(rev_ids)
    return df
article_quality_df = get_ores_page_quality_prediction(politician_df.rev_id.tolist())
article_quality_df.to_csv("./raw_data/article_quality_data.csv", index=False)
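As a quick optional sanity check before joining, we can look at the distribution of predicted quality classes that came back:
# Inspect how many articles fall into each predicted quality class
article_quality_df.article_quality.value_counts()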
After creating the mapping of revision ID to article quality, we then want to join this to the politician DataFrame.
In [15]:
def get_article_quality(rev_id):
    """Method used to map a Wikipedia revision ID to an article quality within article_quality_df

    :param rev_id: Wikipedia Revision ID
    :type rev_id: int.
    :return: str -- Article quality from article_quality_df if exists, None if not
    """
    article_quality = None
    # If the revision ID exists in the article quality DataFrame, set article quality to the mapped value
    if (article_quality_df.rev_id == rev_id).any():
        article_quality = article_quality_df.loc[article_quality_df.rev_id == rev_id].article_quality.iloc[0]
    return article_quality
# Join the politician DataFrame to the article quality DataFrame
politician_df["article_quality"] = politician_df.apply(lambda row: get_article_quality(row.rev_id), axis=1)
In a similar fashion, we also want to join the population data to the politician DataFrame.
In [16]:
def get_country_population(country_name):
    """Method used to map country name to a population within population_df

    :param country_name: Country name
    :type country_name: str.
    :return: int -- Population value from population_df if exists, None if not
    """
    population = None
    # If the country exists in the population DataFrame, set population to the mapped value
    if (population_df.Location == country_name).any():
        population = population_df.loc[population_df.Location == country_name].Data.iloc[0]
    return population
# Join the politician DataFrame to the country population DataFrame
politician_df["population"] = \
politician_df.apply(lambda row: get_country_population(row.country), axis=1)
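Both of the row-wise apply lookups above could also be expressed as vectorized pandas merges, which is typically much faster on a data set of this size. A minimal sketch of that alternative, assuming it is run in place of the two apply cells (not in addition to them) and using the same politician_df, article_quality_df and population_df as above:
# Make sure rev_id has the same dtype in both frames before merging
# (the ORES response keys come back as strings)
article_quality_df["rev_id"] = article_quality_df["rev_id"].astype(int)
# Left-join the article quality predictions onto the politician data by revision ID
merged_df = politician_df.merge(article_quality_df, on="rev_id", how="left")
# Rename the population columns so they line up with politician_df's country column, then left-join
population_lookup = population_df[["Location", "Data"]].rename(
    columns={"Location": "country", "Data": "population"})
merged_df = merged_df.merge(population_lookup, on="country", how="left")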
To simplify our analysis, any row without a corresponding country population or a corresponding article quality will be removed from the data set. We perform some additional cleaning by reordering our columns, renaming them and representing population as an integer before writing the result out to the clean_data directory.
Our DataFrame will look like the following:
Column | Value |
---|---|
country | Name of the Country the article belongs to |
article_name | Name of the Wikipedia article |
revision_id | Integer ID that maps to the given Wikipedia page's last edit |
article_quality | Quality of the Article as determined by ORES |
population | Number of people living in the country in mid-2015 |
In [23]:
# Filter out any countries without a population or without an article quality
df = politician_df[(pd.notnull(politician_df.population)) & (pd.notnull(politician_df.article_quality))]
print("{} rows were removed".format(politician_df.shape[0] - df.shape[0]))
# Reorder columns
df = df[["country", "page", "rev_id", "article_quality", "population"]]
# Rename columns to match assignment definition
df.columns = ["country", "article_name", "revision_id", "article_quality", "population"]
# Change population column to integer
df.loc[:, "population"] = df["population"].astype(int)
# Write analysis data set out to file
cleaned_data_file_path = "./clean_data/en-wikipedia_politician_article_quality.csv"
df.to_csv(cleaned_data_file_path, index=False)
# Print example of analysis DataFrame
df.head(4)
Out[23]:
As mentioned at the start of this notebook, our analysis seeks to understand bias on Wikipedia through two metrics:
1. The number of politician articles as a percent of the country's population (articles-per-population).
2. The percent of a country's politician articles that are high quality, where high quality means an ORES prediction of FA or GA.
We also output population and the number of articles within the aggregate DataFrame for readability.
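As a worked example with hypothetical numbers: a country with a population of 1,000,000 and 50 politician articles, 2 of which are rated FA or GA, would have an articles-per-population percent of 50 * 100 / 1,000,000 = 0.005 and a percent of high quality articles of 2 * 100 / 50 = 4.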
In [21]:
# Group our DataFrame by country
country_group = df.groupby("country")
# Returns the number of articles as a percent of the population
def articles_per_population(group):
    articles = group.article_name.nunique()
    population = group.population.max()
    return articles * 100 / float(population)

# Returns the proportion of articles which are ranked FA or GA in quality
def high_quality_articles(group):
    high_quality_rating_list = ["FA", "GA"]
    article_count = group.shape[0]
    high_quality_article_count = group[group.article_quality.isin(high_quality_rating_list)].shape[0]
    return high_quality_article_count * 100 / article_count

# Returns the population for a given country
def population(group):
    return group.population.max()

# Returns the number of articles a country has
def number_of_articles(group):
    return group.shape[0]

# https://stackoverflow.com/questions/40532024/pandas-apply-multiple-functions-of-multiple-columns-to-groupby-object
# Aggregate method which generates our four aggregate metrics
def get_aggregate_stats(group):
    return pd.Series({"articles_per_population_percent": articles_per_population(group),
                      "population": population(group),
                      "percent_high_quality_article": high_quality_articles(group),
                      "number_of_articles": number_of_articles(group)})
agg_df = country_group.apply(get_aggregate_stats)
agg_df.index.name = "Country"
# Print example of aggregate DataFrame
agg_df.head(4)
Out[21]:
Next we create our four DataFrames to look at the top and bottom 10 countries for both of these metrics.
In [22]:
# Suppress scientific notation
# SO Post: https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results
pd.set_option('display.float_format', lambda x: '%.6f' % x)
# Top 10 of Articles per Population
print("Top 10 Countries - Percent of Articles-Per-Population")
top_10_article_per_pop = \
agg_df.sort_values(by=["articles_per_population_percent"], ascending=False).head(10)[["articles_per_population_percent"]]
top_10_article_per_pop.columns = ["Percent of Articles-Per-Population"]
print(top_10_article_per_pop)
print("\n")
# Bottom 10 of Articles per Population
print("Bottom 10 Countries - Percent of Articles-Per-Population")
bottom_10_article_per_pop = \
agg_df.sort_values(by=["articles_per_population_percent"], ascending=True).head(10)[["articles_per_population_percent"]]
bottom_10_article_per_pop.columns = ["Percent of Articles-Per-Population"]
print(bottom_10_article_per_pop)
print("\n")
# Top 10 of High Quality Articles
print("Top 10 Countries - Percent of Articles that are High Quality")
top_10_high_quality_articles = \
agg_df.sort_values(by=["percent_high_quality_article"], ascending=False).head(10)[["percent_high_quality_article"]]
top_10_high_quality_articles.columns = ["Percent of High Quality Articles"]
print(top_10_high_quality_articles)
print("\n")
# Bottom 10 of High Quality Articles
print("Bottom 10 Countries - Percent of Articles that are High Quality")
bottom_10_high_quality_articles = \
agg_df.sort_values(by=["percent_high_quality_article"], ascending=True).head(10)[["percent_high_quality_article"]]
bottom_10_high_quality_articles.columns = ["Percent of High Quality Articles"]
print(bottom_10_high_quality_articles)
print("\n")
For those interested in downloading the population data directly from the Population Research Bureau website rather than saving it manually, the following code downloads the file and writes it out to the raw_data directory.
In [4]:
population_file_path = "./raw_data/Population Mid-2015.csv"
population_url = "http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2" \
"c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2" \
"c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c2" \
"88%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306" \
"%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2" \
"c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2" \
"c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c3" \
"64%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381" \
"%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2" \
"c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c4" \
"19%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437" \
"%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2" \
"c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c4" \
"72%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480"
# Use pandas read_csv function to read the file directly from the website
# We skip the first line using header=1 as we're uninterested in information before the column headers
population_df = pd.read_csv(population_url, header=1)
# Remove "," characters and cast the population column Data to a numeric value
population_df["Data"] = population_df["Data"].str.replace(",", "")
population_df["Data"] = population_df["Data"].apply(pd.to_numeric)
# Write the data out to a csv
population_df.to_csv(population_file_path, index=False)
# Print a few lines of the data set
population_df.head(4)
Out[4]: