For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), your job is to analyze what the nature of articles about politicians on Wikipedia - both their existence and their quality - can tell us about bias in Wikipedia's content.
In [1]:
import csv
data = []
revid = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append([row[0], row[1], row[2]])
        revid.append(row[2])
# Remove the first element ('rev_id') from revid so that the list only contains revision IDs.
revid.pop(0)
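For reference, page_data.csv has three columns: page (the article name), country, and rev_id (the article's last revision ID), which is why the code above keeps row[0], row[1], and row[2]. The rows below are illustrative only, not actual entries from the file:

# page,country,rev_id                       <- header row, dropped from revid by revid.pop(0)
# "Some Politician",Somecountry,123456789   <- illustrative data row (made-up values)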
Getting the data (country and population) from the population file
In [2]:
from itertools import islice
import csv
import pandas as pd

population = []
with open('Population Mid-2015.csv') as population_file:
    reader = csv.reader(population_file)
    # The first row is a title and the second row is blank; the last two rows are blank as well.
    # Skip those rows and keep only the data rows (country name in column 0, population in column 4).
    for row in islice(reader, 2, 213):
        population.append([row[0], row[4]])
In this step, we'll get article quality predictions from the ORES API. To avoid hitting ORES request limits, we split the revision IDs into chunks of 50. For each article, ORES predicts one of six quality classes: FA (Featured Article), GA (Good Article), B, C, Start, and Stub.
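For reference, the parsing code below assumes each entry in the ORES response has roughly the following shape (the revision ID and probability values here are made up for illustration):

{"enwiki": {"scores": {"123456789": {"wp10": {"score": {
    "prediction": "Stub",
    "probability": {"FA": 0.01, "GA": 0.02, "B": 0.05, "C": 0.10, "Start": 0.22, "Stub": 0.60}
}}}}}
# If a revision has been deleted, the "wp10" entry contains an "error" object instead of a "score".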
Split revision IDs into chunks of 50
In [3]:
chunks = [revid[x:x+50] for x in range(0, len(revid), 50)]
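As a quick illustration of the chunking pattern (using a toy list instead of real revision IDs):

sample = list(range(7))
[sample[x:x+3] for x in range(0, len(sample), 3)]   # -> [[0, 1, 2], [3, 4, 5], [6]]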
Write a function to make a request with multiple revision IDs
In [4]:
import requests
import json
def get_ores_data(revision_ids, headers):
    # Define the endpoint
    endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
    # Specify the parameters, joining all the revision IDs into one string separated by '|' marks
    params = {'project' : 'enwiki',
              'model' : 'wp10',
              'revids' : '|'.join(str(x) for x in revision_ids)
              }
    # Pass the request headers so ORES can identify the requester
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    return response
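A minimal usage sketch (the revision IDs below are placeholders, not IDs from the dataset):

sample_headers = {'User-Agent' : 'https://github.com/yawen32', 'From' : 'liy44@uw.edu'}
sample_response = get_ores_data(['123456789', '987654321'], sample_headers)
# sample_response['enwiki']['scores'] maps each requested revision ID to its wp10 result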
Request the predicted quality class of each article from the ORES API.
In [5]:
headers = {'User-Agent' : 'https://github.com/yawen32', 'From' : 'liy44@uw.edu'}
article_quality = []
for i in range(len(chunks)):
    response = get_ores_data(chunks[i], headers)
    aq = response['enwiki']['scores']
    for j in range(len(chunks[i])):
        for key in aq[chunks[i][j]]["wp10"]:
            # Flag articles whose revisions have been deleted (ORES returns an "error" instead of a score)
            if key == "error":
                article_quality.append("None")
            else:
                article_quality.append(aq[chunks[i][j]]['wp10']['score']['prediction'])
Save prediction values to a file
In [6]:
aq = open("article_quality.txt", "w")
for item in article_quality:
    aq.write("{}\n".format(item))
aq.close()
In [7]:
with open("article_quality.csv","w",newline="") as f:
aqcsv = csv.writer(f)
aqcsv.writerow(article_quality)
Read prediction values from the saved file
In [8]:
with open('article_quality.txt', 'r') as f:
    articleQuality = f.read().splitlines()
In this step, we'll combine the article quality data, the article data, and the population data. Rows without matching data are removed during the merge. The merged data is then written to a single CSV file with five columns: country, article_name, revision_id, article_quality, population.
First, add the ORES data to the Wikipedia data, then merge the Wikipedia data and the population data on the common key (country).
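As a small illustration of the inner-merge behavior used below (made-up values, not from the datasets), only countries present in both frames survive the merge:

left = pd.DataFrame({'country': ['A', 'B'], 'article_name': ['x', 'y']})
right = pd.DataFrame({'Location': ['A', 'C'], 'population': ['1,000', '2,000']})
pd.merge(left, right, left_on='country', right_on='Location', how='inner')
# -> one row (country 'A'); 'B' and 'C' have no match and are dropped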
In [9]:
wiki_data = pd.DataFrame(data[1:],columns=data[0])
In [10]:
# Sanity check: the number of ORES predictions should match the number of article rows
len(pd.Series(articleQuality).values)
Out[10]:
In [11]:
# Add the ORES data into the Wikipedia data
wiki_data["article_quality"] = pd.Series(articleQuality).values
In [12]:
# Rename columns of the Wikipedia data
wiki_data.columns = ["article_name","country","revision_id","article_quality"]
In [13]:
# Convert data (country and population) from the population file to dataframe
population_data = pd.DataFrame(population[1:],columns=population[0])
In [14]:
# Rename the columns with suitable names
population_data.columns = ["Location","population"]
In [15]:
# Merge the two datasets (wiki_data and population_data) on the common key (country name).
# An inner merge automatically removes rows that do not have matching data.
merge_data = pd.merge(wiki_data, population_data, left_on = 'country', right_on = 'Location', how = 'inner')
merge_data = merge_data.drop('Location', axis=1)
# Reorder the columns so that the dataframe follows the required format
merge_data = merge_data[["country","article_name","revision_id","article_quality","population"]]
Write merged data to a CSV file
In [16]:
# index=False keeps the output to the five required columns (no index column)
merge_data.to_csv("final_data.csv", index=False)
Calculate the proportion (as a percentage) of articles-per-population
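As a quick check of the arithmetic with made-up numbers (not values from the dataset): a country with 100 politician articles and a population of 10,000,000 has 100 / 10,000,000 = 0.001% articles-per-population.

"{:.10%}".format(100 / 10000000)   # -> '0.0010000000%'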
In [26]:
# Extract the "country" column from the merged data
merge_country = merge_data.iloc[:,0].tolist()
In [27]:
# Count the number of articles for each country
from collections import Counter
count_article = Counter(merge_country)
In [28]:
prop_article_per_population = []
df_prop_article_per_population = pd.DataFrame(columns=['country', 'population', 'num_articles', 'prop_article_per_population'])
num_country = 0
for country in count_article:
    # Look up the country's population (stored as a string with thousands separators, e.g. "1,234,567")
    population = int(population_data.loc[population_data["Location"] == country, "population"].iloc[0].replace(",", ""))
    percentage = count_article[country] / population
    prop_article_per_population.append("{:.10%}".format(percentage))
    df_prop_article_per_population.loc[num_country] = [country, population, count_article[country], "{:.10%}".format(percentage)]
    num_country += 1
In [29]:
# Show the table of the proportion of articles-per-population for each country
df_prop_article_per_population
Out[29]:
Calculate the proportion (as a percentage) of high-quality articles for each country.
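A worked example with made-up counts (not from the dataset): a country with 2 FA and 3 GA articles out of 200 politician articles has (2 + 3) / 200 = 2.5% high-quality articles.

"{:.10%}".format((2 + 3) / 200)   # -> '2.5000000000%'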
In [30]:
prop_high_quality_articles_each_country = []
df_prop_high_quality_articles_each_country = pd.DataFrame(columns=["country", "num_high_quality_articles", "num_articles", "prop_high_quality_articles"])
num_country = 0
for country in count_article:
    # Count the FA- and GA-class articles for this country
    quality_counts = Counter(merge_data.loc[merge_data['country'] == country].iloc[:, 3].tolist())
    num_high_quality = quality_counts['FA'] + quality_counts['GA']
    percentage = num_high_quality / count_article[country]
    prop_high_quality_articles_each_country.append("{:.10%}".format(percentage))
    df_prop_high_quality_articles_each_country.loc[num_country] = [country, num_high_quality, count_article[country], "{:.10%}".format(percentage)]
    num_country += 1
In [31]:
# Show the table of the proportion of high-quality articles for each country
df_prop_high_quality_articles_each_country
Out[31]:
Produce four tables that show:
10 highest-ranked countries in terms of number of politician articles as a proportion of country population
In [32]:
# Get index of 10 highest-ranked countries
idx = df_prop_article_per_population["prop_article_per_population"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=False).index[0:10]
# Retrieve these rows by index values
highest_rank_10_prop_article_per_population = df_prop_article_per_population.loc[idx]
highest_rank_10_prop_article_per_population.to_csv("highest_rank_10_prop_article_per_population.csv")
highest_rank_10_prop_article_per_population
Out[32]:
10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
In [33]:
# Get index of 10 lowest-ranked countries
idx = df_prop_article_per_population["prop_article_per_population"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True).index[0:10]
# Retrieve these rows by index values
lowest_rank_10_prop_article_per_population = df_prop_article_per_population.loc[idx]
lowest_rank_10_prop_article_per_population.to_csv("lowest_rank_10_prop_article_per_population.csv")
lowest_rank_10_prop_article_per_population
Out[33]:
10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
In [34]:
# Get index of 10 highest-ranked countries
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=False).index[0:10]
# Retrieve these rows by index values
highest_rank_10_prop_high_quality_articles = df_prop_high_quality_articles_each_country.loc[idx]
highest_rank_10_prop_high_quality_articles.to_csv("highest_rank_10_prop_high_quality_articles.csv")
highest_rank_10_prop_high_quality_articles
Out[34]:
10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
In [35]:
# Get index of 10 lowest-ranked countries
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True).index[0:10]
# Retrieve these rows by index values
lowest_rank_10_prop_high_quality_articles = df_prop_high_quality_articles_each_country.loc[idx]
lowest_rank_10_prop_high_quality_articles.to_csv("lowest_rank_10_prop_high_quality_articles_allzeros.csv")
lowest_rank_10_prop_high_quality_articles
Out[35]:
In [70]:
# Get the 10 lowest-ranked countries whose proportion of high-quality articles is NOT zero
idx = df_prop_high_quality_articles_each_country["prop_high_quality_articles"].apply(lambda x:float(x.strip('%'))/100).sort_values(ascending=True)!=0
idx_not_zero = idx[idx == True].index[0:10]
lowest_rank_10_prop_high_quality_articles_not_zero = df_prop_high_quality_articles_each_country.loc[idx_not_zero]
lowest_rank_10_prop_high_quality_articles_not_zero.to_csv("lowest_rank_10_prop_high_quality_articles_notzeros.csv")
lowest_rank_10_prop_high_quality_articles_not_zero
Out[70]: