Sean Miller (millsea0@u.washington.edu)
This notebook outlines an analysis of English Wikipedia articles on political figures from many countries. For each country, we examine the number of politician articles relative to its population and the percentage of those articles that are high quality, to understand how the English Wikipedia's coverage might be biased.
All of the following code was written and tested against the default packages present in Anaconda3 v4.4.0. You can find a download for Anaconda and its latest versions at https://repo.continuum.io/archive/.
In [1]:
import json
import os
import pandas as pd
import requests
%matplotlib inline
First, we'll prepare our folder structure for our analysis. Any data sets we've downloaded or will scrape from the web will be stored in the raw_data folder; any data sets that have been processed by our code will be stored in clean_data; and any visualizations or tables used for our final analysis will be stored in the outputs folder.
In [2]:
# If the folder raw_data doesn't already exist, create it
# raw_data is where any initial data sets are stored
if not os.path.exists("raw_data"):
    os.makedirs("./raw_data")
# If the folder clean_data doesn't already exist, create it
# clean_data is where any processed data sets are stored
if not os.path.exists("clean_data"):
    os.makedirs("./clean_data")
# If the folder outputs doesn't already exist, create it
# The outputs folder is where visualizations for our analysis will be stored
if not os.path.exists("outputs"):
    os.makedirs("./outputs")
To perform this analysis, we'll be joining data from three different data sets. These data sets and relevant information are listed below.
Data Set | File Name | URL | Documentation | License |
---|---|---|---|---|
EN-Wikipedia Articles On Politicians | page_data.csv | Figshare | Same as URL | CC-BY-SA 4.0 |
Country Population Data (Mid-2015) | Population Mid-2015.csv | Population Research Bureau website | Same as URL | Not specified on the website |
Wikipedia ORES | N/A | ORES | ORES Swagger | CC-BY-SA 3.0 |
For the first two data sets, we'll be manually downloading the data from the provided links, copying the files to the raw_data folder and reading in the csv files with the pandas library.
In [9]:
# Paths to files
population_data_file = "./raw_data/Population Mid-2015.csv"
politician_file_path = "./raw_data/page_data.csv"
# Read in population data
# We skip the first line using header=1 as we're uninterested in information before the column headers
population_df = pd.read_csv(population_data_file, header=1)
# Remove "," characters and cast the population column Data to a numeric value
population_df["Data"] = population_df["Data"].str.replace(",", "")
population_df["Data"] = population_df["Data"].apply(pd.to_numeric)
# Write the cleaned data back out to a csv
population_df.to_csv(population_data_file, index=False)
# Read in Wikipedia politician data
politician_df = pd.read_csv(politician_file_path)
# Print out sample of population DataFrame
population_df.head(4)
Out[9]:
In [10]:
# Print out sample of politician DataFrame
politician_df.head(4)
Out[10]:
After reading in our initial two data sets, we'll want to map the rev_id column of the politician DataFrame to a corresponding article quality using the ORES API. The predicted article quality can be one of six values, from highest to lowest: FA (Featured Article), GA (Good Article), B, C, Start and Stub. Documentation for how to format the URLs for this API can be found at the ORES Swagger.
To Note
You can submit up to 50 articles at a time to be evaluated by the ORES API.
If a page has been deleted, the ORES API will return "RevisionNotFound: Could not find revision". In the function below we handle that by printing out the JSON blob of any article that could not be found.
As part of the Terms and conditions of the Wikimedia REST API, we send a unique User-Agent header in our requests so Wikimedia can contact us if any problem arises from our script.
In [29]:
# Example call to the ORES API for a single revision ID
# ORES API endpoint
endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
# Create user-agent header
headers = {"User-Agent": "https://github.com/awfuldynne", "From": "millsea0@uw.edu"}
params = \
    {
        "project": "enwiki",
        "model": "wp10",
        "revids": "391862070"
    }
api_call = requests.get(endpoint.format(**params), headers=headers)
response = api_call.json()
print(json.dumps(response, indent=4, sort_keys=True))
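The cell above simply prints the raw JSON. Based on how the response is parsed in the function below, the predicted class for a revision sits under enwiki → scores → &lt;rev_id&gt; → wp10 → score → prediction; a minimal sketch of pulling it out of the response above, assuming the call succeeded and the revision exists:
# Extract the predicted article quality for the single revision ID requested above
rev_id = params["revids"]
prediction = response["enwiki"]["scores"][rev_id]["wp10"]["score"]["prediction"]
print("Revision {} is predicted to be of quality {}".format(rev_id, prediction))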
In [12]:
def get_ores_page_quality_prediction(rev_ids, batch_size=50):
    """Method to get the wp10 model's prediction of page quality for a list of Wikipedia pages identified by revision ID
    https://en.wikipedia.org/wiki/Wikipedia:WikiProject_assessment#Grades

    :param rev_ids: List of revision IDs for Wikipedia pages.
    :type rev_ids: list of int.
    :param batch_size: Number of pages to send to ORES per iteration.
    :type batch_size: int.
    :returns: Pandas DataFrame -- DataFrame with columns rev_id and article_quality
    """
    # ORES API endpoint
    endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
    # Create user-agent header
    headers = {"User-Agent": "https://github.com/awfuldynne", "From": "millsea0@uw.edu"}
    # Create column list
    columns = ["rev_id", "article_quality"]
    # Create empty DataFrame for article quality result set
    df = pd.DataFrame(columns=columns)
    # Indexes to keep track of which subset of the rev_id list we should be processing
    start_index = 0
    end_index = start_index + batch_size
    done_processing = False
    # Iterate through our list of revision IDs, appending to df as we process the results
    while not done_processing:
        params = \
            {
                "project": "enwiki",
                "model": "wp10",
                # Create a string of revision IDs like "123123|123124"
                "revids": "|".join(str(rev) for rev in rev_ids[start_index:end_index])
            }
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        for quality_score in response["enwiki"]["scores"]:
            # Create a new Series to append to the DataFrame
            new_row = pd.Series(index=columns)
            new_row.rev_id = quality_score
            try:
                new_row.article_quality = response["enwiki"]["scores"][quality_score]["wp10"]["score"]["prediction"]
                df = df.append(new_row, ignore_index=True)
            except KeyError:
                # The target article no longer exists on Wikipedia. Print each data point that
                # couldn't be retrieved
                print(response["enwiki"]["scores"][quality_score])
        # Update indexes
        start_index += batch_size
        end_index += batch_size
        # If start_index is greater than or equal to the length of rev_ids we are finished processing our list
        done_processing = start_index >= len(rev_ids)
    return df
article_quality_df = get_ores_page_quality_prediction(politician_df.rev_id.tolist())
article_quality_df.to_csv("./raw_data/article_quality_data.csv", index=False)
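As a quick optional sanity check before joining, we can look at the distribution of predicted quality classes that came back:
# Inspect how many articles fall into each predicted quality class
article_quality_df.article_quality.value_counts()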
After creating the mapping of revision ID to article quality, we then want to join this to the politician DataFrame.
In [15]:
def get_article_quality(rev_id):
    """Method used to map a Wikipedia revision ID to an article quality within article_quality_df

    :param rev_id: Wikipedia Revision ID
    :type rev_id: int.
    :return: str -- Article quality from article_quality_df if exists, None if not
    """
    article_quality = None
    # If the revision ID exists in the article quality DataFrame, set article quality to the mapped value
    if (article_quality_df.rev_id == rev_id).any():
        article_quality = article_quality_df.loc[article_quality_df.rev_id == rev_id].article_quality.iloc[0]
    return article_quality
# Join the politician DataFrame to the article quality DataFrame
politician_df["article_quality"] = politician_df.apply(lambda row: get_article_quality(row.rev_id), axis=1)
In a similar fashion, we also want to join the population data to the politician DataFrame.
In [16]:
def get_country_population(country_name):
    """Method used to map country name to a population within population_df

    :param country_name: Country name
    :type country_name: str.
    :return: int -- Population value from population_df if exists, None if not
    """
    population = None
    # If the country exists in the population DataFrame, set population to the mapped value
    if (population_df.Location == country_name).any():
        population = population_df.loc[population_df.Location == country_name].Data.iloc[0]
    return population
# Join the politician DataFrame to the country population DataFrame
politician_df["population"] = \
politician_df.apply(lambda row: get_country_population(row.country), axis=1)
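Both of the row-wise apply lookups above could also be expressed as vectorized pandas merges, which is typically much faster on a data set of this size. A minimal sketch of that alternative, assuming it is run in place of the two apply cells (not in addition to them) and using the same politician_df, article_quality_df and population_df as above:
# Make sure rev_id has the same dtype in both frames before merging
# (the ORES response keys come back as strings)
article_quality_df["rev_id"] = article_quality_df["rev_id"].astype(int)
# Left-join the article quality predictions onto the politician data by revision ID
merged_df = politician_df.merge(article_quality_df, on="rev_id", how="left")
# Rename the population columns so they line up with politician_df's country column, then left-join
population_lookup = population_df[["Location", "Data"]].rename(
    columns={"Location": "country", "Data": "population"})
merged_df = merged_df.merge(population_lookup, on="country", how="left")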
To simplify our analysis, any row without a corresponding country population or a corresponding article quality will be removed from the data set. We perform some additional cleaning by reordering our columns, renaming them and representing population as an integer before writing the result out to the clean_data directory.
Our DataFrame will look like the following:
Column | Value |
---|---|
country | Name of the Country the article belongs to |
article_name | Name of the Wikipedia article |
revision_id | Integer ID that maps to the given Wikipedia page's last edit |
article_quality | Quality of the Article as determined by ORES |
population | Number of people living in the country in mid-2015 |
In [23]:
# Filter out any countries without a population or without an article quality
df = politician_df[(pd.notnull(politician_df.population)) & (pd.notnull(politician_df.article_quality))]
print("{} rows were removed".format(politician_df.shape[0] - df.shape[0]))
# Reorder columns
df = df[["country", "page", "rev_id", "article_quality", "population"]]
# Rename columns to match assignment definition
df.columns = ["country", "article_name", "revision_id", "article_quality", "population"]
# Change population column to integer
df.loc[:, "population"] = df["population"].astype(int)
# Write analysis data set out to file
cleaned_data_file_path = "./clean_data/en-wikipedia_politician_article_quality.csv"
df.to_csv(cleaned_data_file_path, index=False)
# Print example of analysis DataFrame
df.head(4)
Out[23]:
As mentioned at the start of this notebook, our analysis seeks to understand bias on Wikipedia through two metrics:
1. The number of politician articles as a percent of the country's population (articles-per-population).
2. The percent of a country's politician articles that are high quality, where high quality means an ORES prediction of FA or GA.
We also output population and the number of articles within the aggregate DataFrame for readability.
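As a worked example with hypothetical numbers: a country with a population of 1,000,000 and 50 politician articles, 2 of which are rated FA or GA, would have an articles-per-population percent of 50 * 100 / 1,000,000 = 0.005 and a percent of high quality articles of 2 * 100 / 50 = 4.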
In [21]:
# Group our DataFrame by country
country_group = df.groupby("country")
# Returns the number of articles as a percent of the population
def articles_per_population(group):
    articles = group.article_name.nunique()
    population = group.population.max()
    return articles * 100 / float(population)

# Returns the proportion of articles which are ranked FA or GA in quality
def high_quality_articles(group):
    high_quality_rating_list = ["FA", "GA"]
    article_count = group.shape[0]
    high_quality_article_count = group[group.article_quality.isin(high_quality_rating_list)].shape[0]
    return high_quality_article_count * 100 / article_count

# Returns the population for a given country
def population(group):
    return group.population.max()

# Returns the number of articles a country has
def number_of_articles(group):
    return group.shape[0]

# https://stackoverflow.com/questions/40532024/pandas-apply-multiple-functions-of-multiple-columns-to-groupby-object
# Aggregate method which generates our four aggregate metrics
def get_aggregate_stats(group):
    return pd.Series({"articles_per_population_percent": articles_per_population(group),
                      "population": population(group),
                      "percent_high_quality_article": high_quality_articles(group),
                      "number_of_articles": number_of_articles(group)})
agg_df = country_group.apply(get_aggregate_stats)
agg_df.index.name = "Country"
# Print example of aggregate DataFrame
agg_df.head(4)
Out[21]:
Next we create our four DataFrames to look at the top and bottom 10 countries for both of these metrics.
In [22]:
# Suppress scientific notation
# SO Post: https://stackoverflow.com/questions/21137150/format-suppress-scientific-notation-from-python-pandas-aggregation-results
pd.set_option('display.float_format', lambda x: '%.6f' % x)
# Top 10 of Articles per Population
print("Top 10 Countries - Percent of Articles-Per-Population")
top_10_article_per_pop = \
agg_df.sort_values(by=["articles_per_population_percent"], ascending=False).head(10)[["articles_per_population_percent"]]
top_10_article_per_pop.columns = ["Percent of Articles-Per-Population"]
print(top_10_article_per_pop)
print("\n")
# Bottom 10 of Articles per Population
print("Bottom 10 Countries - Percent of Articles-Per-Population")
bottom_10_article_per_pop = \
agg_df.sort_values(by=["articles_per_population_percent"], ascending=True).head(10)[["articles_per_population_percent"]]
bottom_10_article_per_pop.columns = ["Percent of Articles-Per-Population"]
print(bottom_10_article_per_pop)
print("\n")
# Top 10 of High Quality Articles
print("Top 10 Countries - Percent of Articles that are High Quality")
top_10_high_quality_articles = \
agg_df.sort_values(by=["percent_high_quality_article"], ascending=False).head(10)[["percent_high_quality_article"]]
top_10_high_quality_articles.columns = ["Percent of High Quality Articles"]
print(top_10_high_quality_articles)
print("\n")
# Bottom 10 of High Quality Articles
print("Bottom 10 Countries - Percent of Articles that are High Quality")
bottom_10_high_quality_articles = \
agg_df.sort_values(by=["percent_high_quality_article"], ascending=True).head(10)[["percent_high_quality_article"]]
bottom_10_high_quality_articles.columns = ["Percent of High Quality Articles"]
print(bottom_10_high_quality_articles)
print("\n")
For those interested in downloading the population data directly from the Population Research Bureau website rather than saving it manually, the following code downloads the file and writes it out to the raw_data directory.
In [4]:
population_file_path = "./raw_data/Population Mid-2015.csv"
population_url = "http://www.prb.org/RawData.axd?ind=14&fmt=14&tf=76&loc=34235%2c249%2c250%2c251%2c252%2c253%2c254%2" \
"c34227%2c255%2c257%2c258%2c259%2c260%2c261%2c262%2c263%2c264%2c265%2c266%2c267%2c268%2c269%2c270%2" \
"c271%2c272%2c274%2c275%2c276%2c277%2c278%2c279%2c280%2c281%2c282%2c283%2c284%2c285%2c286%2c287%2c2" \
"88%2c289%2c290%2c291%2c292%2c294%2c295%2c296%2c297%2c298%2c299%2c300%2c301%2c302%2c304%2c305%2c306" \
"%2c307%2c308%2c311%2c312%2c315%2c316%2c317%2c318%2c319%2c320%2c321%2c322%2c324%2c325%2c326%2c327%2" \
"c328%2c34234%2c329%2c330%2c331%2c332%2c333%2c334%2c336%2c337%2c338%2c339%2c340%2c342%2c343%2c344%2" \
"c345%2c346%2c347%2c348%2c349%2c350%2c351%2c352%2c353%2c354%2c358%2c359%2c360%2c361%2c362%2c363%2c3" \
"64%2c365%2c366%2c367%2c368%2c369%2c370%2c371%2c372%2c373%2c374%2c375%2c377%2c378%2c379%2c380%2c381" \
"%2c382%2c383%2c384%2c385%2c386%2c387%2c388%2c389%2c390%2c392%2c393%2c394%2c395%2c396%2c397%2c398%2" \
"c399%2c400%2c401%2c402%2c404%2c405%2c406%2c407%2c408%2c409%2c410%2c411%2c415%2c416%2c417%2c418%2c4" \
"19%2c420%2c421%2c422%2c423%2c424%2c425%2c427%2c428%2c429%2c430%2c431%2c432%2c433%2c434%2c435%2c437" \
"%2c438%2c439%2c440%2c441%2c442%2c443%2c444%2c445%2c446%2c448%2c449%2c450%2c451%2c452%2c453%2c454%2" \
"c455%2c456%2c457%2c458%2c459%2c460%2c461%2c462%2c464%2c465%2c466%2c467%2c468%2c469%2c470%2c471%2c4" \
"72%2c473%2c474%2c475%2c476%2c477%2c478%2c479%2c480"
# Use pandas read_csv function to read the file directly from the website
# We skip the first line using header=1 as we're uninterested in information before the column headers
population_df = pd.read_csv(population_url, header=1)
# Remove "," characters and cast the population column Data to a numeric value
population_df["Data"] = population_df["Data"].str.replace(",", "")
population_df["Data"] = population_df["Data"].apply(pd.to_numeric)
# Write the data out to a csv
population_df.to_csv(population_file_path, index=False)
# Print a few lines of the data set
population_df.head(4)
Out[4]: