Bias on Wikipedia

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), your job is to analyze what political articles on Wikipedia, both their existence and their quality, can tell us about bias in Wikipedia's content.

Getting the article and population data

The first step is to load data files downloaded from different online resources. The data files are:

  1. page_data.csv: Wikipedia political articles data
  2. Population Mid-2015.csv: population data of a variety of countries

Getting the data from the page_data.csv file


In [1]:
import csv

data = []
revid = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0],row[1],row[2]])
        revid.append(row[2])
# Remove the first element ('rev_id') from revid so that the list only contains revision IDs.
revid.pop(0)


Out[1]:
'rev_id'

Getting the data (country and population) from the population file


In [2]:
from itertools import islice
import csv

import pandas as pd
population = []
with open('Population Mid-2015.csv') as population_file:
    reader = csv.reader(population_file)
    # The first row is a title; the second row and the last two rows are blank.
    # Skip the first two rows and the last two rows of the CSV file.
    for row in islice(reader,2,213):
        population.append([row[0],row[4]])
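
The population values are read as strings, and the analysis later strips the thousands separators before converting them to integers. The following is a minimal sketch of doing that conversion at load time with pandas. The in-memory CSV, its layout (a title row, a blank row, then a header row) and the column name `Data` are assumptions made up for illustration; they are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample: a title row, a blank row, then a header row and data.
sample_csv = io.StringIO(
    'Population Mid-2015\n'
    '\n'
    'Location,Data\n'
    'Zambia,"15,473,900"\n'
    'Fiji,"867,000"\n'
)

# skiprows=2 drops the title and blank rows;
# thousands="," parses values such as "15,473,900" as integers.
population_sample = pd.read_csv(sample_csv, skiprows=2, thousands=",")
print(population_sample)
```

Parsing the numbers once at load time means later steps can divide by the population column directly, without repeating the string cleanup.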

Getting article quality predictions

In this step, we'll get article quality predictions from the ORES API. To avoid hitting the limits in ORES, we split the revision IDs into chunks of 50 and send one request per chunk. For each article, ORES predicts one of 6 quality classes:

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article

Split revision IDs into chunks of 50


In [3]:
chunks = [revid[x:x+50] for x in range(0, len(revid), 50)]
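
To see what this slicing pattern produces, here is a small sketch on made-up IDs with a chunk size of 3. The names `sample_ids`, `chunk_size` and `sample_chunks` are illustrative only. The last chunk simply holds whatever is left over.

```python
import math

# Hypothetical revision IDs, used only to illustrate the slicing pattern.
sample_ids = [str(n) for n in range(1, 8)]  # '1' .. '7'
chunk_size = 3

sample_chunks = [sample_ids[x:x + chunk_size]
                 for x in range(0, len(sample_ids), chunk_size)]
print(sample_chunks)  # [['1', '2', '3'], ['4', '5', '6'], ['7']]

# Number of requests needed for 47197 revision IDs in chunks of 50.
num_requests = math.ceil(47197 / 50)
print(num_requests)  # 944
```

With the 47197 revision IDs counted later in this notebook, chunks of 50 give 944 requests, the last one carrying 47 IDs.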

Write a function to make a request with multiple revision IDs


In [4]:
import requests
import json

def get_ores_data(revision_ids, headers):
    
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    
    # Specify the parameters - smushing all the revision IDs together separated by | marks.
    # Yes, 'smush' is a technical term, trust me I'm a scientist.
    # What do you mean "but people trusting scientists regularly goes horribly wrong" who taught you tha- oh.  
    params = {'project' : 'enwiki',
              'model'   : 'wp10',
              'revids'  : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the headers so that the request identifies the caller
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response

Request the values for prediction (the quality of an article) from ORES API.


In [5]:
headers = {'User-Agent' : 'https://github.com/yawen32', 'From' : 'liy44@uw.edu'}
article_quality = []
for i in range(len(chunks)):
    response = get_ores_data(chunks[i],headers)
    aq = response['enwiki']['scores']
    for j in range(len(chunks[i])):
        for key in aq[chunks[i][j]]["wp10"]:
            # Flag the articles that have been deleted (ORES returns an error instead of a score)
            if key == "error":
                article_quality.append("None")
            else:
                article_quality.append(aq[chunks[i][j]]['wp10']['score']['prediction'])
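
The parsing logic can be checked offline, without calling ORES. The sketch below uses a hand-written dictionary whose nesting mirrors the keys the loop above reads (`enwiki`, `scores`, `wp10`, `score`, `prediction`, `error`). It is not a captured API response, and the helper `extract_predictions`, the revision IDs and the error message are hypothetical.

```python
# Hand-written stand-in for the parsed JSON; not a real ORES response.
sample_response = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {"prediction": "Stub"}}},
            "1002": {"wp10": {"error": {"message": "revision not found"}}},
            "1003": {"wp10": {"score": {"prediction": "GA"}}},
        }
    }
}

def extract_predictions(response, revision_ids):
    """Return one label per revision ID, using "None" where an error is reported."""
    scores = response["enwiki"]["scores"]
    labels = []
    for rev_id in revision_ids:
        wp10 = scores[rev_id]["wp10"]
        if "error" in wp10:
            labels.append("None")
        else:
            labels.append(wp10["score"]["prediction"])
    return labels

print(extract_predictions(sample_response, ["1001", "1002", "1003"]))
# ['Stub', 'None', 'GA']
```

Iterating over the requested IDs, rather than over the response keys, keeps the labels in the same order as the revision IDs, which matters because the labels are later attached to the article table by position.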

Save prediction values to a file


In [6]:
# Use a context manager so the file is closed even if a write fails
with open("article_quality.txt","w") as aq_file:
    for item in article_quality:
        aq_file.write("{}\n".format(item))

In [7]:
with open("article_quality.csv","w",newline="") as f:
    aqcsv = csv.writer(f)
    aqcsv.writerow(article_quality)

Read prediction values from the saved file


In [8]:
with open('article_quality.txt','r') as f:
    articleQuality = f.read().splitlines()

Combining the datasets

In this step, we'll combine the article quality data, the article data and the population data. Rows without matching data are removed during the merge. The merged data is then written to a single CSV file with five columns: country, article_name, revision_id, article_quality, population

First, add the ORES data into the Wikipedia data, then merge the Wikipedia data and population data together on the common key value (country).


In [9]:
wiki_data = pd.DataFrame(data[1:],columns=data[0])

In [10]:
wiki_data
len(pd.Series(articleQuality).values)


Out[10]:
47197

In [11]:
# Add the ORES data into the Wikipedia data 
wiki_data["article_quality"] = pd.Series(articleQuality).values

In [12]:
# Rename columns of the Wikipedia data
wiki_data.columns = ["article_name","country","revision_id","article_quality"]

In [13]:
# Convert data (country and population) from the population file to dataframe
population_data = pd.DataFrame(population[1:],columns=population[0])

In [14]:
# Rename the columns with suitable names
population_data.columns = ["Location","population"]

In [15]:
# Merge the two datasets (wiki_data and population_data) on the common key (country name).
# The inner join automatically removes the rows that do not have matching data.
merge_data = pd.merge(wiki_data, population_data, left_on = 'country', right_on = 'Location', how = 'inner')
merge_data = merge_data.drop('Location', axis=1)
# Swap first and second columns so that the dataframe follows the formatting conventions
merge_data = merge_data[["country","article_name","revision_id","article_quality","population"]]
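
Because an inner merge drops unmatched rows silently, it can help to look at what would be lost before committing to it. The sketch below is one way to do that with the `indicator` option of `pd.merge`, on made-up rows. The frames `wiki_sample` and `population_sample` and the country "Atlantis" are illustrative assumptions.

```python
import pandas as pd

# Made-up rows: "Atlantis" has no population entry and "Fiji" has no articles.
wiki_sample = pd.DataFrame({
    "article_name": ["A", "B", "C"],
    "country": ["Zambia", "Chad", "Atlantis"],
})
population_sample = pd.DataFrame({
    "Location": ["Zambia", "Chad", "Fiji"],
    "population": ["15,473,900", "13,707,000", "867,000"],
})

# An outer merge with indicator=True labels each row by where its key was found.
audit = pd.merge(wiki_sample, population_sample,
                 left_on="country", right_on="Location",
                 how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["country", "Location", "_merge"]])

# The inner merge keeps only the keys present on both sides.
inner = pd.merge(wiki_sample, population_sample,
                 left_on="country", right_on="Location", how="inner")
print(len(inner))  # 2
```

Rows marked `left_only` are articles whose country name has no match in the population file, and rows marked `right_only` are countries with no articles. Spelling differences between the two sources would show up here.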

Write merged data to a CSV file


In [16]:
# index=False keeps the file to the five data columns (no extra index column)
merge_data.to_csv("final_data.csv", index=False)

Analysis

In this step, we'll analyze the merged dataset ("final_data.csv") to understand how the coverage of politicians on Wikipedia and the quality of articles about politicians vary among countries.

Calculate the proportion (as a percentage) of articles-per-population


In [26]:
# Extract column "country" from merge data
merge_country = merge_data.iloc[:,0].tolist()

In [27]:
# Count the number of articles for each country
from collections import Counter
count_article = Counter(merge_country)

In [28]:
prop_article_per_population = []
df_prop_article_per_population = pd.DataFrame(columns=['country', 'population', 'num_articles','prop_article_per_population'])
num_country = 0

for country in count_article:
    population = int(population_data.loc[population_data["Location"] == country, "population"].iloc[0].replace(",",""))
    percentage = count_article[country] / population
    prop_article_per_population.append("{:.10%}".format(percentage))
    df_prop_article_per_population.loc[num_country] = [country,population,count_article[country],"{:.10%}".format(percentage)]
    num_country += 1
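
As a worked example of the formula, Zambia has 26 articles and a population of 15,473,900, so the proportion is 26 / 15,473,900, or about 0.000168%. The same calculation can also be sketched without an explicit loop by grouping the merged rows. The sample frame and the column name `articles_per_population_pct` below are assumptions for illustration, not the notebook's own data.

```python
import pandas as pd

# Made-up merged rows; population is a string with thousands separators.
merged_sample = pd.DataFrame({
    "country": ["Zambia", "Zambia", "Fiji"],
    "article_name": ["A", "B", "C"],
    "population": ["15,473,900", "15,473,900", "867,000"],
})

# Convert the population strings to integers once.
cleaned = merged_sample.assign(
    population=merged_sample["population"].str.replace(",", "", regex=False).astype(int)
)

# One row per country: its population and the number of articles.
per_country = (
    cleaned.groupby("country")
    .agg(population=("population", "first"),
         num_articles=("article_name", "count"))
    .reset_index()
)

# Keep the ratio numeric; format it as a percentage only when displaying.
per_country["articles_per_population_pct"] = (
    100 * per_country["num_articles"] / per_country["population"]
)
print(per_country)
```

Keeping the proportion as a float means it can be sorted directly, without parsing the formatted percentage strings back into numbers.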

In [29]:
# Show the table of the proportion of articles-per-population for each country
df_prop_article_per_population


Out[29]:
country population num_articles prop_article_per_population
0 Zambia 15473900.0 26.0 0.0001680249%
1 Chad 13707000.0 100.0 0.0007295542%
2 Zimbabwe 17354000.0 167.0 0.0009623142%
3 Uganda 40141000.0 188.0 0.0004683491%
4 Namibia 2482100.0 165.0 0.0066475968%
5 Nigeria 181839400.0 684.0 0.0003761561%
6 Colombia 48218000.0 288.0 0.0005972873%
7 Chile 18025000.0 352.0 0.0019528433%
8 Fiji 867000.0 199.0 0.0229527105%
9 Solomon Islands 641900.0 98.0 0.0152671756%
10 Palestinian Territory 4481195.0 183.0 0.0040837321%
11 Somalia 11123000.0 339.0 0.0030477389%
12 Cambodia 15417100.0 217.0 0.0014075280%
13 Slovakia 5424051.0 119.0 0.0021939322%
14 Slovenia 2064000.0 59.0 0.0028585271%
15 Afghanistan 32247000.0 327.0 0.0010140478%
16 Iraq 37056000.0 302.0 0.0008149827%
17 Nepal 28039000.0 363.0 0.0012946253%
18 Sri Lanka 20868800.0 465.0 0.0022282067%
19 Laos 6903049.0 109.0 0.0015790124%
20 Albania 2892000.0 460.0 0.0159059474%
21 Costa Rica 4832000.0 150.0 0.0031043046%
22 Czech Republic 10551227.0 254.0 0.0024073030%
23 Canada 35833000.0 852.0 0.0023776965%
24 Tunisia 11026000.0 140.0 0.0012697261%
25 Guatemala 16183752.0 84.0 0.0005190391%
26 Burkina Faso 18450400.0 97.0 0.0005257339%
27 Angola 25000000.0 110.0 0.0004400000%
28 Panama 3980000.0 109.0 0.0027386935%
29 Japan 126866820.0 441.0 0.0003476086%
... ... ... ... ...
157 Thailand 65121250.0 112.0 0.0001719869%
158 Latvia 1978454.0 56.0 0.0028304929%
159 Suriname 576000.0 40.0 0.0069444444%
160 Niger 18884462.0 80.0 0.0004236287%
161 Martinique 379000.0 34.0 0.0089709763%
162 Mauritania 3641288.0 52.0 0.0014280661%
163 Cameroon 23739000.0 106.0 0.0004465226%
164 Lesotho 1924381.0 30.0 0.0015589428%
165 Cyprus 1153000.0 102.0 0.0088464874%
166 Gambia 2021893.0 82.0 0.0040556053%
167 Uzbekistan 31290791.0 29.0 0.0000926790%
168 Bahrain 1412299.0 42.0 0.0029738745%
169 Eritrea 5200000.0 16.0 0.0003076923%
170 Kuwait 3837700.0 37.0 0.0009641191%
171 Burundi 10742000.0 76.0 0.0007075033%
172 Central African Republic 5551900.0 68.0 0.0012248059%
173 Equatorial Guinea 805000.0 32.0 0.0039751553%
174 Guadeloupe 407000.0 49.0 0.0120393120%
175 Kosovo 1802000.0 48.0 0.0026637070%
176 Cape Verde 514000.0 37.0 0.0071984436%
177 Andorra 78000.0 34.0 0.0435897436%
178 Comoros 764000.0 51.0 0.0066753927%
179 Trinidad and Tobago 1351000.0 28.0 0.0020725389%
180 Federated States of Micronesia 103000.0 38.0 0.0368932039%
181 Dominica 68000.0 12.0 0.0176470588%
182 Bahamas 377000.0 20.0 0.0053050398%
183 Swaziland 1286000.0 32.0 0.0024883359%
184 Barbados 278000.0 14.0 0.0050359712%
185 Belize 368000.0 16.0 0.0043478261%
186 Seychelles 92833.0 22.0 0.0236984693%

187 rows × 4 columns

Calculate the proportion (as a percentage) of high-quality articles for each country.


In [30]:
prop_high_quality_articles_each_country = []
df_prop_high_quality_articles_each_country = pd.DataFrame(columns=["country","num_high_quality_articles","num_articles","prop_high_quality_articles"])
num_country = 0

for country in count_article:
    # Count the quality classes once per country, then add up the FA and GA articles
    quality_counts = Counter(merge_data.loc[merge_data['country'] == country].iloc[:,3].tolist())
    num_high_quality = quality_counts['FA'] + quality_counts['GA']
    percentage = num_high_quality / count_article[country]
    prop_high_quality_articles_each_country.append("{:.10%}".format(percentage))
    df_prop_high_quality_articles_each_country.loc[num_country] = [country,num_high_quality,count_article[country],"{:.10%}".format(percentage)]
    num_country += 1
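
The same proportion can be sketched with a boolean flag and a group-by, which avoids filtering the table once per country. The sample rows and the names `is_high_quality`, `summary` and `prop_high_quality_pct` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows: one country with high-quality articles and one without.
merged_sample = pd.DataFrame({
    "country": ["Chad", "Chad", "Chad", "Chad", "Zambia", "Zambia"],
    "article_quality": ["FA", "Stub", "GA", "Start", "Stub", "C"],
})

# True where the predicted class is FA or GA.
is_high_quality = merged_sample["article_quality"].isin(["FA", "GA"])

summary = (
    merged_sample.assign(high_quality=is_high_quality)
    .groupby("country")
    .agg(num_high_quality_articles=("high_quality", "sum"),
         num_articles=("high_quality", "count"))
    .reset_index()
)
summary["prop_high_quality_pct"] = (
    100 * summary["num_high_quality_articles"] / summary["num_articles"]
)
print(summary)
```

In this sample, Chad has 2 high-quality articles out of 4 (50%) and Zambia has 0 out of 2 (0%).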

In [31]:
# Show the table of the proportion of high-quality articles for each country
df_prop_high_quality_articles_each_country


Out[31]:
country num_high_quality_articles num_articles prop_high_quality_articles
0 Zambia 0.0 26.0 0.0000000000%
1 Chad 2.0 100.0 2.0000000000%
2 Zimbabwe 2.0 167.0 1.1976047904%
3 Uganda 1.0 188.0 0.5319148936%
4 Namibia 1.0 165.0 0.6060606061%
5 Nigeria 5.0 684.0 0.7309941520%
6 Colombia 3.0 288.0 1.0416666667%
7 Chile 3.0 352.0 0.8522727273%
8 Fiji 1.0 199.0 0.5025125628%
9 Solomon Islands 0.0 98.0 0.0000000000%
10 Palestinian Territory 11.0 183.0 6.0109289617%
11 Somalia 9.0 339.0 2.6548672566%
12 Cambodia 5.0 217.0 2.3041474654%
13 Slovakia 2.0 119.0 1.6806722689%
14 Slovenia 1.0 59.0 1.6949152542%
15 Afghanistan 15.0 327.0 4.5871559633%
16 Iraq 8.0 302.0 2.6490066225%
17 Nepal 0.0 363.0 0.0000000000%
18 Sri Lanka 8.0 465.0 1.7204301075%
19 Laos 3.0 109.0 2.7522935780%
20 Albania 5.0 460.0 1.0869565217%
21 Costa Rica 0.0 150.0 0.0000000000%
22 Czech Republic 1.0 254.0 0.3937007874%
23 Canada 29.0 852.0 3.4037558685%
24 Tunisia 1.0 140.0 0.7142857143%
25 Guatemala 6.0 84.0 7.1428571429%
26 Burkina Faso 3.0 97.0 3.0927835052%
27 Angola 1.0 110.0 0.9090909091%
28 Panama 5.0 109.0 4.5871559633%
29 Japan 9.0 441.0 2.0408163265%
... ... ... ... ...
157 Thailand 3.0 112.0 2.6785714286%
158 Latvia 1.0 56.0 1.7857142857%
159 Suriname 0.0 40.0 0.0000000000%
160 Niger 3.0 80.0 3.7500000000%
161 Martinique 1.0 34.0 2.9411764706%
162 Mauritania 4.0 52.0 7.6923076923%
163 Cameroon 1.0 106.0 0.9433962264%
164 Lesotho 0.0 30.0 0.0000000000%
165 Cyprus 1.0 102.0 0.9803921569%
166 Gambia 6.0 82.0 7.3170731707%
167 Uzbekistan 3.0 29.0 10.3448275862%
168 Bahrain 0.0 42.0 0.0000000000%
169 Eritrea 0.0 16.0 0.0000000000%
170 Kuwait 1.0 37.0 2.7027027027%
171 Burundi 1.0 76.0 1.3157894737%
172 Central African Republic 7.0 68.0 10.2941176471%
173 Equatorial Guinea 1.0 32.0 3.1250000000%
174 Guadeloupe 0.0 49.0 0.0000000000%
175 Kosovo 1.0 48.0 2.0833333333%
176 Cape Verde 0.0 37.0 0.0000000000%
177 Andorra 0.0 34.0 0.0000000000%
178 Comoros 0.0 51.0 0.0000000000%
179 Trinidad and Tobago 1.0 28.0 3.5714285714%
180 Federated States of Micronesia 0.0 38.0 0.0000000000%
181 Dominica 1.0 12.0 8.3333333333%
182 Bahamas 0.0 20.0 0.0000000000%
183 Swaziland 0.0 32.0 0.0000000000%
184 Barbados 0.0 14.0 0.0000000000%
185 Belize 0.0 16.0 0.0000000000%
186 Seychelles 0.0 22.0 0.0000000000%

187 rows × 4 columns

Tables

Produce four tables that show:

  1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
  2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
  3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
  4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
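
If the proportions are kept as floats, these rankings reduce to `nlargest` and `nsmallest`. The following is a small sketch on made-up values, using 2 rows instead of 10. The frame `ranking_sample` and the column `prop_pct` are hypothetical.

```python
import pandas as pd

# Made-up proportions kept as floats rather than formatted strings.
ranking_sample = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "prop_pct": [0.0, 2.5, 0.0, 7.1, 0.4],
})

top_2 = ranking_sample.nlargest(2, "prop_pct")
bottom_2 = ranking_sample.nsmallest(2, "prop_pct")

# Excluding zeros before ranking gives the lowest non-zero entries.
nonzero = ranking_sample[ranking_sample["prop_pct"] != 0]
bottom_2_nonzero = nonzero.nsmallest(2, "prop_pct")

print(top_2["country"].tolist())             # ['D', 'B']
print(bottom_2["country"].tolist())
print(bottom_2_nonzero["country"].tolist())  # ['E', 'B']
```

When many countries tie at 0%, which of them appear in a "lowest 10" table depends on how ties are ordered, so a second table restricted to non-zero proportions is often more informative.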

10 highest-ranked countries in terms of number of politician articles as a proportion of country population


In [32]:
# Get index of 10 highest-ranked countries
idx = df_prop_article_per_population["prop_article_per_population"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=False).index[0:10]
# Retrieve these rows by index values
highest_rank_10_prop_article_per_population = df_prop_article_per_population.loc[idx]
highest_rank_10_prop_article_per_population.to_csv("highest_rank_10_prop_article_per_population.csv")
highest_rank_10_prop_article_per_population


Out[32]:
country population num_articles prop_article_per_population
124 Nauru 10860.0 53.0 0.4880294659%
114 Tuvalu 11800.0 55.0 0.4661016949%
98 San Marino 33000.0 82.0 0.2484848485%
134 Monaco 38088.0 40.0 0.1050199538%
142 Liechtenstein 37570.0 29.0 0.0771892467%
148 Marshall Islands 55000.0 37.0 0.0672727273%
53 Iceland 330828.0 206.0 0.0622680063%
138 Tonga 103300.0 63.0 0.0609874153%
177 Andorra 78000.0 34.0 0.0435897436%
180 Federated States of Micronesia 103000.0 38.0 0.0368932039%

10 lowest-ranked countries in terms of number of politician articles as a proportion of country population


In [33]:
# Get index of 10 lowest-ranked countries
idx = df_prop_article_per_population["prop_article_per_population"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True).index[0:10]
# Retrieve these rows by index values
lowest_rank_10_prop_article_per_population = df_prop_article_per_population.loc[idx]
lowest_rank_10_prop_article_per_population.to_csv("lowest_rank_10_prop_article_per_population.csv")
lowest_rank_10_prop_article_per_population


Out[33]:
country population num_articles prop_article_per_population
44 India 1.314098e+09 990.0 0.0000753369%
80 China 1.371920e+09 1138.0 0.0000829494%
30 Indonesia 2.557420e+08 215.0 0.0000840691%
167 Uzbekistan 3.129079e+07 29.0 0.0000926790%
113 Ethiopia 9.814800e+07 105.0 0.0001069813%
119 Korea, North 2.498300e+07 39.0 0.0001561062%
0 Zambia 1.547390e+07 26.0 0.0001680249%
157 Thailand 6.512125e+07 112.0 0.0001719869%
110 Congo, Dem. Rep. of 7.334020e+07 142.0 0.0001936182%
43 Bangladesh 1.604110e+08 324.0 0.0002019812%

10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


In [34]:
# Get index of 10 highest-ranked countries
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=False).index[0:10]
# Retrieve these rows by index values
highest_rank_10_prop_high_quality_articles = df_prop_high_quality_articles_each_country.loc[idx]
highest_rank_10_prop_high_quality_articles.to_csv("highest_rank_10_prop_high_quality_articles.csv")
highest_rank_10_prop_high_quality_articles


Out[34]:
country num_high_quality_articles num_articles prop_high_quality_articles
119 Korea, North 9.0 39.0 23.0769230769%
128 Saudi Arabia 14.0 119.0 11.7647058824%
167 Uzbekistan 3.0 29.0 10.3448275862%
172 Central African Republic 7.0 68.0 10.2941176471%
55 Romania 34.0 348.0 9.7701149425%
144 Guinea-Bissau 2.0 21.0 9.5238095238%
156 Bhutan 3.0 33.0 9.0909090909%
91 Vietnam 16.0 191.0 8.3769633508%
181 Dominica 1.0 12.0 8.3333333333%
162 Mauritania 4.0 52.0 7.6923076923%

10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country


In [35]:
# Get index of 10 lowest-ranked countries
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True).index[0:10]
# Retrieve these rows by index values
lowest_rank_10_prop_high_quality_articles = df_prop_high_quality_articles_each_country.loc[idx]
lowest_rank_10_prop_high_quality_articles.to_csv("lowest_rank_10_prop_high_quality_articles_allzeros.csv")
lowest_rank_10_prop_high_quality_articles


Out[35]:
country num_high_quality_articles num_articles prop_high_quality_articles
0 Zambia 0.0 26.0 0.0000000000%
138 Tonga 0.0 63.0 0.0000000000%
134 Monaco 0.0 40.0 0.0000000000%
131 Tajikistan 0.0 40.0 0.0000000000%
127 Mozambique 0.0 60.0 0.0000000000%
124 Nauru 0.0 53.0 0.0000000000%
115 Antigua and Barbuda 0.0 25.0 0.0000000000%
142 Liechtenstein 0.0 29.0 0.0000000000%
107 Malta 0.0 103.0 0.0000000000%
102 French Guiana 0.0 28.0 0.0000000000%

In [70]:
# Get the index of the 10 lowest-ranked countries whose proportion of high-quality articles is NOT equal to 0
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True)!=0
idx_not_zero = idx[idx == True].index[0:10]
lowest_rank_10_prop_high_quality_articles_not_zero = df_prop_high_quality_articles_each_country.loc[idx_not_zero]
lowest_rank_10_prop_high_quality_articles_not_zero.to_csv("lowest_rank_10_prop_high_quality_articles_notzeros.csv")
lowest_rank_10_prop_high_quality_articles_not_zero


Out[70]:
country num_high_quality_articles num_articles prop_high_quality_articles
72 Tanzania 1.0 408.0 0.2450980392%
22 Czech Republic 1.0 254.0 0.3937007874%
89 Lithuania 1.0 248.0 0.4032258065%
135 Morocco 1.0 208.0 0.4807692308%
8 Fiji 1.0 199.0 0.5025125628%
3 Uganda 1.0 188.0 0.5319148936%
68 Bolivia 1.0 187.0 0.5347593583%
57 Luxembourg 1.0 180.0 0.5555555556%
37 Peru 2.0 354.0 0.5649717514%
73 Sierra Leone 1.0 166.0 0.6024096386%