The goal of this project is to explore the concept of 'bias' in data by analyzing Wikipedia articles on political figures from different countries.
The data will include a dataset of political articles on Wikipedia, the predicted article quality scores for those articles, and a dataset of country populations.
The analysis will quantify the number of Wikipedia articles devoted to politicians in each country, label the quality of those articles, and consider how those measurements vary between countries.
The data visualization will include a series of plots that show the countries with the highest and lowest percentages of Wikipedia articles per capita, and the countries with the highest and lowest proportions of high-quality (FA- or GA-rated) articles.
In [ ]:
"""
The code in this notebook cell is optional; you do not need to run this
cell in order to run the code in subsequent cells, but if this cell
isn't run, some intermediate values won't be displayed.
To run a cell, position the cursor inside the cell so the cell border
turns green, then press Control and Return (or Enter) together.
"""
# This code displays all results created within a jupyter notebook cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# This code displays Matplotlib objects inline.
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'inline')
We will be combining three sources of data: the Wikipedia article data, the ORES article-quality predictions, and the country population data.
The Wikipedia dataset about political articles (also called pages) can be found on Figshare. The English-language article data was extracted using the Wikimedia API, saved as a CSV file named page_data.csv, and uploaded to Figshare. For more information, see the README.md file in the data-512-a2 repository. A copy of the page_data.csv file is also available in this data-512-a2 repository.
The columns in the page_data.csv file are page (the article title), country, and rev_id (the revision ID of the article's last edit).
In [1]:
"""
The code in this cell reads in the Wikipedia article data from the
file page_data.csv, which contains the following data:
RangeIndex: 47197 entries, 0 to 47196
Data columns (total 3 columns):
page 47197 non-null object
country 47197 non-null object
rev_id 47197 non-null int64
The data is then stored in a pandas DataFrame (DF) object.
"""
import csv
import pandas as pd
page_data = pd.read_csv("page_data.csv")
page_data.head()
Out[1]:
In [2]:
print("Number of rows and columns in original page_data DataFrame", page_data.shape)
In [3]:
"""
The code in this cell:
(1) standardizes some of the country names in page_data for merging page_data
with the population data; and
(2) removes two rows whose revision ID values (807367030 and 807367166) are not
recognized by the ORES API and therefore return errors.
"""
# Part 1: standardize country names
# COUNTRY_MAP from Gary
COUNTRY_MAP = {
"East Timorese" : "Timor-Leste",
"Hondura" : "Honduras",
"Rhodesian" : "Zimbabwe",
"Salvadoran" : "El Salvador",
"Samoan" : "Samoa",
"São Tomé and Príncipe" : "Sao Tome and Principe",
"South African Republic" : "South Africa",
"South Korean" : "Korea, South"
}
# A sample of original rows with values we want to replace
page_data.loc[page_data.index.isin([272, 443, 448, 541, 602])]
# Replace the old country names with the standardized names, if any are present
if page_data["country"].isin(COUNTRY_MAP.keys()).any():
    page_data["country"].replace(COUNTRY_MAP, inplace=True)
# Verify that the values were replaced in those sample rows
page_data.loc[page_data.index.isin([272, 443, 448, 541, 602])]
# Part 2: removal of revision ID values
# check if rev_id values in page_data
page_data.loc[page_data["rev_id"].isin([807367030, 807367166])]
# remove rows with rev_id in [807367030, 807367166]
page_data = page_data.loc[~page_data["rev_id"].isin([807367030, 807367166])]
page_data.head(10)
Out[3]:
In [4]:
print("Ending number of rows and columns in page_data", page_data.shape)
In [5]:
"""
The code in this cell reads in the country population data from
the file population_mid-2015.csv and does some processing.
When population_mid-2015.csv is read without any parameters
(as page_data.csv was read above), the title row becomes a
single header column. To get 6 columns instead of 1, set the
second row (index 1) as the header with the parameter header=1.
The original data in population_mid-2015.csv looks like this:
RangeIndex: 210 entries, 0 to 209
Data columns (total 6 columns):
Location 210 non-null object
Location Type 210 non-null object
TimeFrame 210 non-null object
Data Type 210 non-null object
Data 210 non-null int64
Footnotes 0 non-null float64
"""
pop_data = pd.read_csv("population_mid-2015.csv",
                       header=1, sep=",", thousands=",")
# Drop the unneeded columns only if they are still present in the pop_data DF
if len(pop_data.columns) > 2:
    pop_data.drop(["Location Type", "TimeFrame",
                   "Data Type", "Footnotes"], axis=1, inplace=True)
# Rename columns to standardize names for future merging
pop_data.rename(columns={"Location" : "country",
"Data" : "population"}, inplace=True)
pop_data.head()
Out[5]:
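As a quick sanity check on the header=1 choice, the column names can be inspected directly; a minimal sketch (it assumes population_mid-2015.csv is in the working directory, and the expected names are those listed in the docstring above):
# Read with the second line (index 1) as the header, as in the cell above,
# and confirm that the six expected columns are present.
pop_check = pd.read_csv("population_mid-2015.csv", header=1, thousands=",")
print(list(pop_check.columns))
# Expected: ['Location', 'Location Type', 'TimeFrame', 'Data Type', 'Data', 'Footnotes']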
The predicted quality scores for each article in the Wikipedia dataset come from a Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"). ORES estimates the quality of an article at a particular point in time and assigns a probability to each of the quality categories listed below.
The quality scores, from best to worst, are: FA (Featured article), GA (Good article), B (B-class article), C (C-class article), Start (Start-class article), and Stub (Stub-class article).
These quality scores are a subset of the quality assessment categories developed by Wikipedia editors. For more information about the scores, see Project Assessment.
The ORES API documentation can be found here and the web API is here. The API requires a revision ID, which is the third column in page_data.csv (originally titled "last_edit"), and the name of the machine learning model, which is "wp10".
When you query the API, ORES returns a JSON object that includes the predicted quality score as well as the probability values for each of the six possible quality classes. For the analysis in this project, however, you only need the predicted quality score, not the probabilities.
Below is an example of a response in JSON format from the ORES API:
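(The revision ID and the probability values in this example are illustrative; the nesting of the keys matches the fields accessed by the code in the next cell.)
{
    "enwiki": {
        "scores": {
            "123456789": {
                "wp10": {
                    "score": {
                        "prediction": "Stub",
                        "probability": {
                            "B": 0.011, "C": 0.032, "FA": 0.002,
                            "GA": 0.004, "Start": 0.151, "Stub": 0.800
                        }
                    }
                }
            }
        }
    }
}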
In [6]:
import requests
import json
import time
def get_page_quality_scores(rev_ids):
"""
Function takes revision id values, calls the ORES API, and returns the
revision id values recognized by the ORES API, the article quality scores, and
the time for each API call.
"""
# start is the time in seconds as a floating point number
start = time.time()
# Local variables
local_scores = []
local_rev_ids = []
endpoint = "https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}"
params = {"project" : "enwiki",
"model" : "wp10",
"revids" : "|".join(str(x) for x in rev_ids)
}
api_call = requests.get(endpoint.format(**params))
response = api_call.json()
#print(json.dumps(response, indent=4, sort_keys=True))
# Strip out the quality score from the JSON object and save in list (scores)
for rev_id in rev_ids:
try:
local_rev_ids.append(rev_id)
local_scores.append(response["enwiki"]["scores"][str(rev_id)]["wp10"]["score"]["prediction"])
except:
print("exception with rev_id:", str(rev_id))
pass
# Measure time to run batch of Get requests in minutes
end = (time.time() - start)/60
return local_rev_ids, local_scores, end
try:
    # This is a shortcut: redefine page_data as a DataFrame that already includes
    # the article_quality data, since fetching it from the API takes about 3.52 minutes
page_data = pd.read_csv("page_quality_data.csv")
if page_data["Unnamed: 0"].any():
page_data.drop(["Unnamed: 0"], axis=1, inplace=True)
print("New page_data DataFrame rows and columns:", page_data.shape)
print(page_data.head(10))
except FileNotFoundError:
    # If there is no cached page_quality_data.csv file, fetch the data from the ORES API
lst_rev_ids = []
lst_article_scores = []
run_time = 0
for i in range(0, len(page_data["rev_id"]), 100):
# batches of 100 rev_id values to send to get_page_quality_scores()
revision_ids = page_data["rev_id"][i:(i + 100)]
rev_id_vals, quality_scores, batch_time = get_page_quality_scores(revision_ids)
for id_value in rev_id_vals:
# Keep a running list of rev_id values as an index for quality scores
lst_rev_ids.append(id_value)
for score in quality_scores:
# Keep a running list of quality scores to add to page_data
lst_article_scores.append(score)
run_time = run_time + batch_time
# Test print statement for tracking time
# print("i:", str(i) + ", batch_time: %f, run_time: %f" % (batch_time, run_time))
# Add a column with the predicted article quality scores to the page_data
df_scores = pd.DataFrame({"rev_id" : lst_rev_ids, "article_quality" : lst_article_scores})
page_score_data = page_data.merge(df_scores, how='outer',
on=["rev_id"]).dropna(axis=0, how='any')
page_score_data.to_csv("page_quality_data.csv")
print("\nTotal Run Time: {0:5.2f} minutes\n".format(run_time))
In [7]:
page_data.shape
Out[7]:
After retrieving the ORES data for each article, we merge the Wikipedia article data and the population data together. Both DataFrames have a column named "country", so we merge page_data and pop_data on the country values.
While merging the DataFrames, some rows will not match because a country name in one DataFrame does not appear in the other; for example, the population dataset has no entry for some of the countries in the Wikipedia data. We therefore remove the rows that do not have matching data: page_data initially had 47,195 rows, and the final combine_df has 46,408 rows.
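As a side note, an inner merge produces the same result as the outer merge followed by dropna used in the next cell; a minimal sketch (assuming neither input DataFrame has missing values of its own):
# An inner merge keeps only the rows whose country appears in both DataFrames,
# which is equivalent here to the outer merge + dropna(how='any') used below.
combined_inner = page_data.merge(pop_data, how="inner", on="country").reset_index(drop=True)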
After consolidating the data, we save the DataFrame as a single CSV file with these columns:
country
article_name
revision_id
article_quality
population
In [8]:
# Rename page_data columns to the final names; rename() simply ignores
# mapping keys that are not present, so no try/except is needed
page_data.rename(columns={"page" : "article_name",
                          "rev_id" : "revision_id"}, inplace=True)
# Combine DataFrames on country names and remove any rows with empty values
combine_df = page_data.merge(pop_data,
how='outer', on=["country"]).dropna(axis=0,
how='any').reset_index()
# Remove the old index which is now a column named index
# because I reset the index
combine_df.drop(["index"], axis=1, inplace=True)
# Change revision_id & population values from float to int
combine_df[["revision_id",
"population"]] = combine_df[["revision_id",
"population"]].astype(int)
combine_df.info()
In [9]:
# Save final combined DataFrame to CSV file
combine_df.to_csv("page_pop_data_final.csv")
combine_df.head(10)
Out[9]:
The analysis calculates two proportions (expressed as percentages) for each country: articles per population, and high-quality articles as a share of all of that country's articles. "High-quality" articles are defined as articles about politicians that ORES predicted to be in either the "FA" (featured article) or "GA" (good article) class.
Examples (a worked sketch with hypothetical numbers follows):
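For instance, with hypothetical numbers:
# A hypothetical country with 10 politician articles, 2 of them rated FA or GA,
# and a population of 1,000,000.
total_articles = 10
high_quality_articles = 2
population = 1_000_000
percent_articles_per_person = total_articles / population * 100      # 0.001 (percent)
percent_high_quality = high_quality_articles / total_articles * 100  # 20.0 (percent)
print(percent_articles_per_person, percent_high_quality)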
In [10]:
# Get the total number of articles for each country
df_test = combine_df[["country",
"article_name"]].groupby("country").count().astype(int).reset_index()
# Merge total number of articles for each country with population for each country
df_article_pop = df_test.merge(pop_data, on=["country"])
df_article_pop.rename(columns={"article_name" : "total_articles"},
inplace=True)
# Divide the total number of articles by the country's population
art_per_pop = df_article_pop["total_articles"].div(df_article_pop["population"],
axis='index')
# Get a percentage value by multiplying by 100
df_article_pop["percent_articles_per_person"] = round(art_per_pop*100, 4)
prop_df = df_article_pop.sort_values(["percent_articles_per_person"],
axis=0,
ascending=False,
inplace=False,
kind='quicksort')
# The 10 countries with the highest percentage of articles per person
prop_df.head(10)
Out[10]:
In [11]:
# The 10 countries with the lowest percentage of articles per person
prop_df.tail(10)
Out[11]:
In [12]:
# number of GA and FA-quality articles as a proportion of all articles from country
# Gets all the articles rated "FA" or "GA"
fa_ga = combine_df[(combine_df["article_quality"]=="FA") | (combine_df["article_quality"]=="GA")]
# Gets all the articles rated neither "FA" nor "GA" (not used in the calculations below)
no_fa_ga = combine_df[(combine_df["article_quality"]!="FA") & (combine_df["article_quality"]!="GA")]
# Adds up all the FA & GA articles for each country
total_fa_ga = fa_ga[["country",
"article_quality"]].groupby("country",
as_index=False).count()
# Adds up all the articles for each country regardless of quality rating
total_articles = combine_df[["country",
"article_quality"]].groupby("country",
as_index=False).count()
# Merge total_fa_ga and total_articles on country values
art_type_df = total_fa_ga.merge(total_articles, on="country")
art_type_df.rename(columns={"article_quality_x" : "total_FA_GA",
"article_quality_y" : "total_pages"},
inplace=True)
art_type_df["percent_FA_GA"] = round(art_type_df["total_FA_GA"]/art_type_df["total_pages"].astype(float)*100, 2)
sort_prop_type_df = art_type_df.sort_values(["percent_FA_GA"],
axis=0,
ascending=False,
inplace=False,
kind='quicksort')
# The 10 countries with the highest percentage of high quality articles
sort_prop_type_df.head(10)
Out[12]:
In [13]:
# The 10 countries with the lowest percentage of high quality articles
sort_prop_type_df.tail(10)
Out[13]:
In [14]:
# This code produces a crosstab of countries with their quality articles
# This code is partially based on the code from StackOverflow
# https://stackoverflow.com/questions/46290726/how-to-make-dummy-variables-with-comma-separated-valued-columns
df_articles = combine_df.set_index("country")["article_quality"].apply(pd.Series).stack()
# margins=True adds the All column & All row
cross_tab = pd.crosstab(df_articles.index.get_level_values(0),
                        df_articles, margins=True).rename_axis(None, axis=0).rename_axis(None, axis=1)
cross_tab.sort_values(["All"]).head(20)
Out[14]:
In [15]:
cross_tab.sort_values(["All"]).tail(20)
Out[15]:
In [16]:
# This code produces a list of the countries with no FA- or GA-rated articles;
# the 41 in head(41) below was determined by inspecting the sorted crosstab
countries_noQA = cross_tab.sort_values(["FA", "GA"]).head(41)
countries_noQA.index
Out[16]:
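The hard-coded head(41) above relies on knowing the count of such countries in advance. A boolean filter on the crosstab gives the same list without that assumption; a sketch (it assumes the FA and GA columns exist in cross_tab, i.e. that at least one article of each class appears in the data):
# Select countries with zero FA-rated and zero GA-rated articles directly,
# after dropping the "All" margin row added by margins=True.
no_quality = cross_tab.drop(index="All")
no_quality = no_quality[(no_quality["FA"] == 0) & (no_quality["GA"] == 0)]
print(len(no_quality))
print(list(no_quality.index))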
Produce four visualizations that show: the 10 countries with the highest and lowest percentages of Wikipedia articles per person, and the 10 countries with the highest and lowest percentages of articles rated FA or GA.
In [41]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from textwrap import wrap
# based on example from http://matplotlib.org/examples/lines_bars_and_markers/barh_demo.html
fig, ((ax0, ax1), (ax2, ax3)) = plt.subplots(nrows=2, ncols=2, figsize=(20,15))
top_country = prop_df["country"].head(10)
y_pos0 = np.arange(len(top_country))
top_percent = prop_df["percent_articles_per_person"].head(10)
ax0.barh(y_pos0, top_percent, align='center',
color="cadetblue", alpha=0.8)
ax0.set_yticks(y_pos0)
ax0.set_yticklabels(top_country, fontsize=14)
ax0.invert_yaxis() # labels read top-to-bottom
ax0.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5])  # pin tick positions so the labels below match them
ax0.set_xticklabels(labels=[0, 0.1, 0.2, 0.3, 0.4, 0.5], fontsize=12)
ax0.set_xlabel("Percent of Articles per Person", fontsize=14)
ax0.set_title("Countries with Highest Percent of Wikipedia Articles per Capita",
fontsize=16, fontvariant="small-caps", fontweight="semibold")
bottom_country = prop_df["country"].tail(10)[::-1]
y_pos1 = np.arange(len(bottom_country))
bottom_percent = round(prop_df["percent_articles_per_person"].tail(10)[::-1], 5)
ax1.barh(y_pos1, bottom_percent, align='center',
color="darkcyan", alpha=0.8)
ax1.set_yticks(y_pos1)
ax1.set_yticklabels(bottom_country, fontsize=14)
ax1.invert_yaxis()
ax1.set_xlabel("Percent of Articles per Person",
fontsize=14, fontstretch="semi-condensed")
ax1.set_title("Countries with Lowest Percent of Wikipedia Articles per Capita",
fontsize=16, fontweight="semibold")
top_FA_GA_country = sort_prop_type_df["country"].head(10)
y_pos2 = np.arange(len(top_FA_GA_country))
top_FA_GA_prop = sort_prop_type_df["percent_FA_GA"].head(10)
ax2.barh(y_pos2, top_FA_GA_prop, align='center',
color="firebrick", alpha=0.8)
ax2.set_yticks(y_pos2)
ax2.set_yticklabels(top_FA_GA_country, fontsize=14)
ax2.invert_yaxis() # labels read top-to-bottom
ax2.set_xticks([0, 5.0, 10.0, 15.0, 20.0, 25.0])  # pin tick positions so the labels below match them
ax2.set_xticklabels(labels=[0, 5.0, 10.0, 15.0, 20.0, 25.0], fontsize=12)
ax2.set_xlabel("Percent of Wikipedia Articles Rated FA or GA",
fontsize=14, fontstretch="semi-condensed")
ax2.set_title("Countries with Highest Proportion of FA and GA Rated Wikipedia Articles",
fontsize=16, fontweight="semibold")
bottom_FA_GA_country = sort_prop_type_df["country"].tail(10)[::-1]
y_pos3 = np.arange(len(bottom_FA_GA_country))
bottom_FA_GA_prop = sort_prop_type_df["percent_FA_GA"].tail(10)[::-1]
ax3.barh(y_pos3, bottom_FA_GA_prop, align='center',
color="darkred", alpha=0.8)
ax3.set_yticks(y_pos3)
ax3.set_yticklabels(bottom_FA_GA_country, fontsize=14)
ax3.invert_yaxis() # labels read top-to-bottom
ax3.set_xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # pin tick positions so the labels below match them
ax3.set_xticklabels(labels=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6], fontsize=12)
ax3.set_xlabel("Percent of Wikipedia Articles Rated FA or GA",
fontsize=14, fontstretch="semi-condensed")
ax3.set_title("Countries with Lowest Proportion of FA and GA Rated Wikipedia Articles",
fontsize=16, fontweight="semibold",
fontstretch="semi-condensed")
plt.tight_layout()
plt.show()
# Save plot to file
fig.savefig("WikipediaBiasDataPlot.png")
You are also expected to write a short reflection on the project that describes how this assignment helps you understand the causes and consequences of bias on Wikipedia.
Write a few paragraphs, either in the README or in the notebook, reflecting on what you have learned, what you found, what (if anything) surprised you about your findings, and/or what theories you have about why any biases might exist (if you find they exist).
You can also include any questions this assignment raised for you about bias, Wikipedia, or machine learning.
In [27]:
combine_df.loc[combine_df["article_name"].isin(["George Washington",
"John Adams",
"Thomas Jefferson"
"James Madison",
"James Monroe",
"John Quincy Adams",
"Andrew Jackson",
"Abraham Lincoln",
"Ulysses S. Grant",
"Rutherford B. Hayes",
"James A. Garfield",
"Theodore Roosevelt",
"William Howard Taft",
"Woodrow Wilson",
"Warren G. Harding",
"Benjamin Harrison",
"Grover Cleveland",
"Calvin Coolidge",
"Herbert Hoover",
"Franklin D. Roosevelt",
"Harry S. Truman",
"Dwight D. Eisenhower",
"John F. Kennedy",
"Lyndon B. Johnson",
"Richard Nixon",
"Richard M. Nixon"
"Gerald Ford",
"Jimmy Carter",
"Ronald Reagan",
"George H. W. Bush",
"Bill Clinton",
"Hillary Clinton",
"Hillary Rodham Clinton",
"William J. Clinton",
"George W. Bush",
"George Bush",
"Barack Obama",
"Donald Trump",
"Donald J. Trump",
"John McCain",
"Bernie Sanders",
"Sarah Palin",
"Mahatma Gandhi",
"Margaret Thatcher",
"Vladimir Putin",
"Vicente Fox",
"Enrique Peña Nieto",
"Enrique Nieto"])]
Out[27]:
First and foremost, this English-language Wikipedia article data was generated from the category "Politicians by nationality" and one of its subcategories. So the editors who created these articles had to categorize the subjects of the articles as "politicians" and decide on those subjects' "nationality."
The Merriam-Webster Dictionary defines a politician as "a person experienced in the art or science of government," or "a person engaged in party politics as a profession," or "often disparaging: a person primarily interested in political office for selfish or other narrow usually short-sighted reasons."
In particular, since the term "politician" can be considered disparaging, it is possible that editors would be reluctant to categorize the subjects of their articles as politicians. For example, of the 33 well-known U.S. presidents queried above, only four have articles in our dataset, and one of those is John F. Kennedy, whose country is listed as "Ireland."
For reference, the Wikipedia category listing (C = number of subcategories, P = number of pages):
Politicians by nationality (243 C)
Assassinated politicians by nationality (129 C)
Political candidates by nationality (15 C)
Politicians by ethnic or national descent (12 C)
Politicians by former country (18 C)
Politicians by nationality and city (15 C)
Politicians by nationality and party (232 C)
Politicians convicted of crimes by nationality (72 C)
Politicians by nationality and century (45 C)
Leaders of political parties by country (39 C)
Politicians by century and nationality (4 C)
Politicians from dependent territories (9 C)
Women in politics by nationality (234 C)
LGBT politicians by nationality (41 C)
Sportsperson-politicians by nationality (62 C)
Politicians of African nations (65 C)
Politicians of Asian nations (49 C, 1 P)
Politicians of Caribbean nations (28 C)
Politicians of European nations (62 C)
Politicians of North American nations (10 C)
Politicians of Oceanian nations (31 C)
Politicians of South American nations (16 C)
Country with the Highest Proportion of Per Capita Articles: Nauru (0.4880 %)
Country with the Lowest Proportion of Per Capita Articles ( > 0 ): Bangladesh (0.0002 %)
Country with the Highest Proportion of Quality Articles: North Korea (23.08 %)
Country with the Lowest Proportion of Quality Articles ( > 0 ): Finland (0.17 %)
Countries with No High Quality Articles: 'Andorra', 'Antigua and Barbuda', 'Bahamas', 'Bahrain', 'Barbados', 'Belgium', 'Belize', 'Burundi', 'Cape Verde', 'Comoros', 'Djibouti', 'Dominica', 'Eritrea', 'Federated States of Micronesia', 'French Guiana', 'Guadeloupe', 'Guyana', 'Honduras', 'Kazakhstan', 'Kiribati', 'Lesotho', 'Liechtenstein', 'Macedonia', 'Marshall Islands', 'Monaco', 'Mozambique', 'Nauru', 'Nepal', 'San Marino', 'Sao Tome and Principe', 'Seychelles', 'Solomon Islands', 'Suriname', 'Swaziland', 'Switzerland', 'Tajikistan', 'Timor-Leste', 'Tonga', 'Tunisia', 'Turkmenistan', 'Zambia'
5 Countries with the Highest Number of Articles:
France: 1,689 (population 64,346,720)
Australia: 1,566
China: 1,138 (population 1,371,920,000)
United States: 1,098
Mexico: 1,081
For comparison: Spain population 46,368,000; United Kingdom population 65,092,000
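The headline figures above can be read directly off the DataFrames built earlier; a brief sketch (using prop_df and sort_prop_type_df from the analysis cells):
# Highest and lowest (non-zero) percent of articles per person.
print(prop_df.iloc[0][["country", "percent_articles_per_person"]])
nonzero = prop_df[prop_df["percent_articles_per_person"] > 0]
print(nonzero.iloc[-1][["country", "percent_articles_per_person"]])
# Highest and lowest percent of FA/GA-rated articles.
print(sort_prop_type_df.iloc[0][["country", "percent_FA_GA"]])
print(sort_prop_type_df.iloc[-1][["country", "percent_FA_GA"]])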
Initially, one might think that countries with higher populations would have more politicians, and therefore more articles about those politicians on Wikipedia. That hypothesis is anecdotally supported by China, one of the most populous countries on Earth, having the third most articles in this dataset, and by India having the seventh most with 990 articles. However, France has the highest number of articles despite having a population roughly one twenty-first the size of China's.
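The population ratio in that comparison checks out against the figures quoted above:
# Population figures quoted in the summary above.
china_population = 1_371_920_000
france_population = 64_346_720
print(round(china_population / france_population, 1))  # roughly 21.3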