Assignment 2 - Bias in Data

Sean Miller (millsea0@u.washington.edu)

Overview

This notebook outlines an analysis of English Wikipedia articles on political figures from many countries. We explore two metrics per country, the number of articles relative to the country's population and the percent of those articles that are high quality, to understand how the English Wikipedia might be biased.

Libraries

All of the following code was written and tested against the default packages present in Anaconda3 v4.4.0. You can find a download for Anaconda and its latest versions at https://repo.continuum.io/archive/.

Preparation


In [1]:
import json
import os
import pandas as pd
import requests
%matplotlib inline
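
If you'd like to confirm your environment matches, a quick optional check of the installed library versions (a minimal sketch, not part of the original analysis) looks like this:

# Print the versions of the two third-party libraries we rely on
print(pd.__version__)
print(requests.__version__)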

First, we'll prepare our folder structure for our analysis. Any data sets we've downloaded or will scrape from the web will be stored in the raw_data folder, any data sets that have been processed by our code will be stored in clean_data, and any visualizations or tables used for our final analysis will be stored in the outputs folder.


In [2]:
# If the folder raw_data doesn't already exist, create it
# raw_data is where any initial data sets are stored
if not os.path.exists("raw_data"):
    os.makedirs("./raw_data")
    
# If the folder clean_data doesn't already exist, create it
# clean_data is where any processed data sets are stored
if not os.path.exists("clean_data"):
    os.makedirs("./clean_data")

# If the folder outputs doesn't already exist, create it
# The outputs folder is where visualizations for our analysis will be stored
if not os.path.exists("outputs"):
    os.makedirs("./outputs")

Reading in the Data

To perform this analysis, we'll be joining data from three different data sets. These data sets and relevant information are listed below.

Data Set                              File Name                URL                                 Documentation  License
EN-Wikipedia Articles On Politicians  page_data.csv            Figshare                            Same as URL    CC-BY-SA 4.0
Country Population Data (Mid-2015)    Population Mid-2015.csv  Population Research Bureau website  Same as URL    Unknown (not specified)
Wikipedia ORES                        N/A (accessed via API)   ORES                                ORES Swagger   CC-BY-SA 3.0

For the first two data sets, we'll be manually downloading the data from the provided links, copying the files to the raw_data folder and reading in the csv files with the pandas library.


In [9]:
# Paths to files
population_file_path = "./raw_data/Population Mid-2015.csv"
politician_file_path = "./raw_data/page_data.csv"

# Read in population data
# We skip the first line using header=1 as we're uninterested in information before the column headers
population_df = pd.read_csv(population_file_path, header=1)

# Remove "," characters and cast the population column Data to a numeric value
population_df["Data"] = population_df["Data"].str.replace(",", "")
population_df["Data"] = population_df["Data"].apply(pd.to_numeric)

# Write the normalized data back out to a csv
population_df.to_csv(population_file_path, index=False)

# Read in Wikipedia politician data
politician_df = pd.read_csv(politician_file_path)

# Print out sample of population DataFrame
population_df.head(4)


Out[9]:
      Location Location Type TimeFrame Data Type      Data  Footnotes
0  Afghanistan       Country  Mid-2015    Number  32247000        NaN
1      Albania       Country  Mid-2015    Number   2892000        NaN
2      Algeria       Country  Mid-2015    Number  39948000        NaN
3      Andorra       Country  Mid-2015    Number     78000        NaN

In [10]:
# Print out sample of politician DataFrame
politician_df.head(4)


Out[10]:
                                 page   country     rev_id
0  Template:ZambiaProvincialMinisters    Zambia  235107991
1                      Bir I of Kanem      Chad  355319463
2   Template:Zimbabwe-politician-stub  Zimbabwe  391862046
3     Template:Uganda-politician-stub    Uganda  391862070

ORES

After reading in our initial two data sets, we'll want to map the rev_id column of the politician DataFrame to a corresponding article quality using the ORES API. The predicted article quality maps to one of the following six values. Documentation for how to format the URLs for this API can be found at the ORES Swagger.

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article

To Note

You can submit up to 50 articles at a time to be evaluated by the ORES API.

If a page has been deleted, the ORES API returns a "RevisionNotFound: Could not find revision" error instead of a score. Within the function below we handle this by printing the JSON blob of each article that could not be scored.

As part of the Terms and conditions from the Wikimedia REST API, we agree to send a unique User-Agent header in our requests so Wikimedia can contact us if any problem arises from our script.
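
The batching itself is just index arithmetic over the list of revision IDs. As a standalone sketch, a small hypothetical generator like chunk below yields successive 50-item slices:

def chunk(seq, size=50):
    # Yield successive slices of at most `size` elements from seq
    for i in range(0, len(seq), size):
        yield seq[i:i + size]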


In [29]:
# Example ORES API call for a single revision ID
# ORES API endpoint
endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
# Create user-agent header
headers = {"User-Agent": "https://github.com/awfuldynne", "From": "millsea0@uw.edu"}

params = \
    {
        "project": "enwiki",
        "model": "wp10",
        "revids": "391862070"
    }

api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
print(json.dumps(response, indent=4, sort_keys=True))


{
    "enwiki": {
        "models": {
            "wp10": {
                "version": "0.5.0"
            }
        },
        "scores": {
            "391862070": {
                "wp10": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "B": 0.03460022211051763,
                            "C": 0.10152025001080041,
                            "FA": 0.022405202755090857,
                            "GA": 0.004661806667863751,
                            "Start": 0.12578014679847194,
                            "Stub": 0.7110323716572554
                        }
                    }
                }
            }
        }
    }
}
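
Pulling the predicted quality class out of that response is a chain of dictionary lookups. As a minimal sketch against the JSON blob above:

# Extract the predicted quality class for revision 391862070
prediction = response["enwiki"]["scores"]["391862070"]["wp10"]["score"]["prediction"]
print(prediction)  # Stub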

In [12]:
def get_ores_page_quality_prediction(rev_ids, batch_size=50):
    """Method to get the wp10 model's prediction of page quality for a list of Wikipedia pages identified by revision ID
    https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades

    :param rev_ids: List of revision IDs for Wikipedia pages.
    :type rev_ids: list of int.
    :param batch_size: Number of pages to send to ORES per iteration.
    :type batch_size: int.
    :returns:   Pandas DataFrame -- DataFrame with columns rev_id and article_quality
    """
    # ORES API endpoint
    endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
    
    # Create user-agent header
    headers = {"User-Agent": "https://github.com/awfuldynne", "From": "millsea0@uw.edu"}

    # Create column list
    columns = ["rev_id", "article_quality"]
    
    # Create empty DataFrame for article quality result set
    df = pd.DataFrame(columns=columns)

    # Indexes to keep track of what subset in the rev_id list we should be processing
    start_index = 0
    end_index = start_index + batch_size
    done_processing = False

    # Iterate through our list of revision IDs appending to df as we process the results
    while not done_processing:
        params = \
            {
                "project": "enwiki",
                "model": "wp10",
                # Create a string of revision IDs like "123123|123124"
                "revids": "|".join(str(rev) for rev in rev_ids[start_index:end_index])
            }

        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        for quality_score in response["enwiki"]["scores"]:
            # Create a new Series to append to the DataFrame
            new_row = pd.Series(index=columns)
            # JSON keys are strings, so cast the revision ID back to an int
            new_row.rev_id = int(quality_score)
            try:
                new_row.article_quality = response["enwiki"]["scores"][quality_score]["wp10"]["score"]["prediction"]
                df = df.append(new_row, ignore_index=True)
            except KeyError:
                # The target article no longer exists on Wikipedia. Print each data point that
                # couldn't be retrieved
                print(response["enwiki"]["scores"][quality_score])

        # Update indexes
        start_index += batch_size
        end_index += batch_size
        # If start_index is greater than the length of rev_ids, we are finished processing our list
        done_processing = start_index >= len(rev_ids)

    return df

article_quality_df = get_ores_page_quality_prediction(politician_df.rev_id.tolist())
article_quality_df.to_csv("./raw_data/article_quality_data.csv", index=False)


{'wp10': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807367030)', 'type': 'RevisionNotFound'}}}
{'wp10': {'error': {'message': 'RevisionNotFound: Could not find revision ({revision}:807367166)', 'type': 'RevisionNotFound'}}}

After creating the mapping of revision ID to article quality, we then want to join this to the politician DataFrame.


In [15]:
def get_article_quality(rev_id):
    """Method used to map a Wikipedia revision ID to an article quality within article_quality_df

    :param rev_id: Wikipedia Revision ID
    :type rev_id: int.
    :return:    str -- Article quality from article_quality_df if exists, None if not
    """
    article_quality = None
    # If the revision ID exists in the article quality DataFrame, set article quality to the mapped value
    if (article_quality_df.rev_id == rev_id).any():
        article_quality = article_quality_df.loc[article_quality_df.rev_id == rev_id].article_quality.iloc[0]
    return article_quality

# Join the politician DataFrame to the article quality DataFrame
politician_df["article_quality"] = politician_df.apply(lambda row: get_article_quality(row.rev_id), axis=1)

In a similar fashion, we also want to join the population data to the politician DataFrame.


In [16]:
def get_country_population(country_name):
    """Method used to map country name to a population within population_df

    :param country_name: Country name
    :type country_name: str.
    :return:    int -- Population value from population_df if exists, None if not
    """
    population = None
    # If the country exists in the population DataFrame, set population to the mapped value
    if (population_df.Location == country_name).any():
        population = population_df.loc[population_df.Location == country_name].Data.iloc[0]
    return population

# Join the politician DataFrame to the country population DataFrame
politician_df["population"] = \
    politician_df.apply(lambda row: get_country_population(row.country), axis=1)
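
Likewise, the population lookup could be written as a Series map keyed on Location; a sketch (not executed here), assuming Location values are unique:

# Build a Location -> Data lookup and map it onto the country column
pop_by_country = population_df.set_index("Location")["Data"]
politician_df["population"] = politician_df["country"].map(pop_by_country)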

Cleaning our Analysis DataFrame

To simplify our analysis, any row without a corresponding country population or a corresponding article quality will be removed from the data set. We perform some additional cleaning by reordering our columns, renaming them to match the assignment definition and casting population to an integer before writing the result out to the clean_data directory.

Our DataFrame will look like the following:

Column           Description
country          Name of the country the article belongs to
article_name     Name of the Wikipedia article
revision_id      Integer ID that maps to the given Wikipedia page's last edit
article_quality  Quality of the article as determined by ORES
population       Number of people living in the country in mid-2015

In [23]:
# Filter out any countries without a population or without an article quality
df = politician_df[(pd.notnull(politician_df.population)) & (pd.notnull(politician_df.article_quality))]

print("{} rows were removed".format(politician_df.shape[0] - df.shape[0]))

# Reorder columns
df = df[["country", "page", "rev_id", "article_quality", "population"]]

# Rename columns to match assignment definition
df.columns = ["country", "article_name", "revision_id", "article_quality", "population"]

# Change population column to integer
df.loc[:, "population"] = df["population"].astype(int)

# Write analysis data set out to file
cleaned_data_file_path = "./clean_data/en-wikipedia_politician_article_quality.csv"
df.to_csv(cleaned_data_file_path, index=False)

# Print example of analysis DataFrame
df.head(4)


1400 rows were removed
Out[23]:
    country                        article_name  revision_id article_quality  population
0    Zambia  Template:ZambiaProvincialMinisters    235107991            Stub    15473900
1      Chad                      Bir I of Kanem    355319463            Stub    13707000
2  Zimbabwe   Template:Zimbabwe-politician-stub    391862046            Stub    17354000
3    Uganda     Template:Uganda-politician-stub    391862070            Stub    40141000

Analysis

As mentioned at the start of this notebook, our analysis seeks to understand bias on Wikipedia through two metrics:

  1. The number of articles per population for each country, expressed as a percent
  2. The percent of each country's articles that are high quality (rated FA or GA)

We also output population and the number of articles within the aggregate DataFrame for readability.
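
As a quick sanity check on the first metric, the Afghanistan figures from the aggregate output below (327 articles, population 32,247,000) reproduce the reported value:

# 327 articles * 100 / 32,247,000 people ~= 0.001014 percent
print(327 * 100 / 32247000.0)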


In [21]:
# Group our DataFrame by country
country_group = df.groupby("country")

# Returns the number of articles as a percent of the population
def articles_per_population(group):
    articles = group.article_name.nunique()
    population = group.population.max()
    return articles * 100 / float(population)

# Returns the proportion of articles which are ranked FA or GA in quality
def high_quality_articles(group):
    high_quality_rating_list = ["FA", "GA"]
    article_count = group.shape[0]
    high_quality_article_count = group[group.article_quality.isin(high_quality_rating_list)].shape[0]
    return high_quality_article_count * 100 / article_count

# Returns the population for a given country.
def population(group):
    return group.population.max()

# Returns the number of articles a country has
def number_of_articles(group):
    return group.shape[0]

# https://stackoverflow.com/questions/40532024/pandas-apply-multiple-functions-of-multiple-columns-to-groupby-object
# Aggregate method which generates our four aggregate metrics
def get_aggregate_stats(group):
    return pd.Series({"articles_per_population_percent": articles_per_population(group),
                      "population": population(group),
                      "percent_high_quality_article": high_quality_articles(group),
                      "number_of_articles": number_of_articles(group)})

agg_df = country_group.apply(get_aggregate_stats)
agg_df.index.name = "Country"

# Print example of aggregate DataFrame
agg_df.head(4)


Out[21]:
             articles_per_population_percent  number_of_articles  percent_high_quality_article       population
Country
Afghanistan                         0.001014          327.000000                      5.810398  32247000.000000
Albania                             0.015906          460.000000                      1.086957   2892000.000000
Algeria                             0.000298          119.000000                      2.521008  39948000.000000
Andorra                             0.043590           34.000000                      0.000000     78000.000000

Next, we create four DataFrames to look at the top and bottom 10 countries for each of these metrics.


In [22]:
# Suppress scientific notation
# SO Post: https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results
pd.set_option('display.float_format', lambda x: '%.6f' % x)

# Top 10 of Articles per Population
print("Top 10 Countries - Percent of Articles-Per-Population")
top_10_article_per_pop = \
    agg_df.sort_values(by=["articles_per_population_percent"], ascending=False).head(10)[["articles_per_population_percent"]]
top_10_article_per_pop.columns = ["Percent of Articles-Per-Population"]
print(top_10_article_per_pop)
print("\n")

# Bottom 10 of Articles per Population
print("Bottom 10 Countries - Percent of Articles-Per-Population")
bottom_10_article_per_pop = \
    agg_df.sort_values(by=["articles_per_population_percent"], ascending=True).head(10)[["articles_per_population_percent"]]
bottom_10_article_per_pop.columns = ["Percent of Articles-Per-Population"]
print(bottom_10_article_per_pop)
print("\n")

# Top 10 of High Quality Articles
print("Top 10 Countries - Percent of Articles that are High Quality")
top_10_high_quality_articles = \
    agg_df.sort_values(by=["percent_high_quality_article"], ascending=False).head(10)[["percent_high_quality_article"]]
top_10_high_quality_articles.columns = ["Percent of High Quality Articles"]
print(top_10_high_quality_articles)
print("\n")

# Bottom 10 of High Quality Articles
print("Bottom 10 Countries - Percent of Articles that are High Quality")
bottom_10_high_quality_articles = \
    agg_df.sort_values(by=["percent_high_quality_article"], ascending=True).head(10)[["percent_high_quality_article"]]
bottom_10_high_quality_articles.columns = ["Percent of High Quality Articles"]
print(bottom_10_high_quality_articles)
print("\n")


Top 10 Countries - Percent of Articles-Per-Population
                                Percent of Articles-Per-Population
Country                                                           
Nauru                                                     0.488029
Tuvalu                                                    0.466102
San Marino                                                0.248485
Monaco                                                    0.105020
Liechtenstein                                             0.077189
Marshall Islands                                          0.067273
Iceland                                                   0.062268
Tonga                                                     0.060987
Andorra                                                   0.043590
Federated States of Micronesia                            0.036893


Bottom 10 Countries - Percent of Articles-Per-Population
                     Percent of Articles-Per-Population
Country                                                
India                                          0.000075
China                                          0.000083
Indonesia                                      0.000084
Uzbekistan                                     0.000093
Ethiopia                                       0.000107
Korea, North                                   0.000156
Zambia                                         0.000168
Thailand                                       0.000172
Congo, Dem. Rep. of                            0.000194
Bangladesh                                     0.000202


Top 10 Countries - Percent of Articles that are High Quality
                          Percent of High Quality Articles
Country                                                   
Korea, North                                     23.076923
Romania                                          12.931034
Saudi Arabia                                     12.605042
Central African Republic                         11.764706
Qatar                                             9.803922
Guinea-Bissau                                     9.523810
Vietnam                                           9.424084
Bhutan                                            9.090909
Ireland                                           8.136483
United States                                     7.832423


Bottom 10 Countries - Percent of Articles that are High Quality
                       Percent of High Quality Articles
Country                                                
Sao Tome and Principe                          0.000000
Turkmenistan                                   0.000000
Marshall Islands                               0.000000
Guyana                                         0.000000
Comoros                                        0.000000
Tunisia                                        0.000000
Djibouti                                       0.000000
Dominica                                       0.000000
Macedonia                                      0.000000
Tonga                                          0.000000


Appendix

Population Data

For those who are interested in downloading the population data directly from the Population Research Bureau website, the following code downloads the file and writes it out to the raw_data directory.


In [4]:
population_file_path = "./raw_data/Population Mid-2015.csv"
population_url = "http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2" \
                 "c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2" \
                 "c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c2" \
                 "88%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306" \
                 "%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2" \
                 "c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2" \
                 "c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c3" \
                 "64%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381" \
                 "%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2" \
                 "c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c4" \
                 "19%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437" \
                 "%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2" \
                 "c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c4" \
                 "72%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480"

# Use pandas read_csv function to read the file directly from the website
# We skip the first line using header=1 as we're uninterested in information before the column headers
population_df = pd.read_csv(population_url, header=1)

# Remove "," characters and cast the population column Data to a numeric value
population_df["Data"] = population_df["Data"].str.replace(",", "")
population_df["Data"] = population_df["Data"].apply(pd.to_numeric)

# Write the data out to a csv
population_df.to_csv(population_file_path, index=False)

# Print a few lines of the data set
population_df.head()


Out[4]:
      Location Location Type TimeFrame Data Type      Data  Footnotes
0  Afghanistan       Country  Mid-2015    Number  32247000        NaN
1      Albania       Country  Mid-2015    Number   2892000        NaN
2      Algeria       Country  Mid-2015    Number  39948000        NaN
3      Andorra       Country  Mid-2015    Number     78000        NaN
4       Angola       Country  Mid-2015    Number  25000000        NaN