Due: November 2, 2017
For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), your job is to analyze what the nature of political articles on Wikipedia (both their existence and their quality) can tell us about bias in Wikipedia's content.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import json
%matplotlib notebook
Import the data on politicians by country provided by Oliver Keyes, found at https://figshare.com/articles/Untitled_Item/5513449.
In [2]:
politicianFile = 'PolbyCountry_data.csv'
politicianNames = pd.read_csv(politicianFile)
# rename columns to match the rest of the analysis
politicianNames.rename(columns={'page': 'article_name', 'last_edit': 'revision_id'}, inplace=True)
politicianNames[0:4]
Out[2]:
In [3]:
politicianNames.shape
Out[3]:
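Before building on these data, a couple of quick sanity checks (not part of the original notebook) confirm the fields relied on later: article names should be unique, and every row needs a revision ID to send to ORES.
In [ ]:
# Optional sanity checks on the politician data
print(politicianNames['article_name'].duplicated().sum(), 'duplicated article names')
print(politicianNames['revision_id'].isnull().sum(), 'missing revision IDs')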
Import the population by country provided by the PRB (Population Reference Bureau), found at http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14. The data is from mid-2015.
In [4]:
countryFile = 'Population Mid-2015.csv'
# the PRB file's column header sits on the second line, so skip the title row
tempDF = pd.read_csv(countryFile, header=1)
countryPop = pd.DataFrame(data={'country': tempDF['Location'], 'population': tempDF['Data']})
countryPop[0:5]
Out[4]:
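One caveat: PRB exports often format population counts with thousands separators, and pandas then reads the column as strings. If that is the case here, a small coercion step (a sketch, assuming comma-formatted strings) keeps later per-capita calculations from failing.
In [ ]:
# If 'population' was read as comma-formatted strings, coerce it to numbers
if countryPop['population'].dtype == object:
    countryPop['population'] = pd.to_numeric(countryPop['population'].str.replace(',', ''))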
In [5]:
# First, add a placeholder column to politicianNames for the ORES article quality
politicianNames = politicianNames.assign(article_quality="")
# Next, inner-join politicianNames with countryPop on the shared 'country' column
politicData = politicianNames.merge(countryPop, on='country', how='inner')
politicData[0:5]
Out[5]:
In [6]:
politicData.shape
Out[6]:
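The inner join silently drops politicians whose country string has no exact match in the PRB table. A left join with pandas' merge indicator (a diagnostic sketch, not part of the analysis itself) shows how many rows fall out this way.
In [ ]:
# Diagnostic: count politician rows whose country has no population match
check = politicianNames.merge(countryPop, on='country', how='left', indicator=True)
print((check['_merge'] == 'left_only').sum(), 'rows have no matching country')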
We use the Wikimedia API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"), described at https://www.mediawiki.org/wiki/ORES with documentation at https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model. ORES estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article is in one of six quality categories. The options are, from best to worst:
FA - Featured article
GA - Good article
B - B-class article
C - C-class article
Start - Start-class article
Stub - Stub-class article
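As a minimal illustration, a single request scores one revision; the original course example did this for the article on Aaron Halfaker (the person who created ORES). The revision ID below is a placeholder, not a real revision of that article.
In [ ]:
# Single-revision ORES request (sketch; the revision ID is a placeholder)
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}
params = {'project': 'enwiki', 'model': 'wp10', 'revid': '123456789'}  # placeholder revid
response = requests.get(endpoint.format(**params), headers=headers).json()
print(response['enwiki']['scores'][params['revid']]['wp10']['score']['prediction'])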
The cell below makes one such request per article in politicData and stores ORES's predicted quality class in the article_quality column:
In [ ]:
# ORES
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

for irevid in range(politicData.shape[0]):
    revidstr = str(politicData['revision_id'][irevid])
    params = {'project' : 'enwiki',
              'model' : 'wp10',
              'revid' : revidstr}
    try:
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()
        # Record ORES's predicted quality class for this revision
        politicData.loc[irevid, 'article_quality'] = response['enwiki']['scores'][revidstr]['wp10']['score']['prediction']
    except Exception:
        print('Error at ' + str(irevid))
    if irevid % 500 == 0:
        print(irevid)  # progress indicator
# Write out csv file
politicData.to_csv('en-wikipedia_bias_2015.csv', index=False)
politicData[0:4]
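Making one HTTP request per revision is slow for a dataset this size. The ORES v3 API also exposes a batch form of the scores endpoint that takes several pipe-separated revision IDs per call; below is a minimal sketch assuming that batch form (the helper score_batch is ours, not part of the original notebook).
In [ ]:
# Sketch: batch ORES scoring (assumes the v3 scores endpoint accepts
# pipe-separated revids; check the ORES documentation before relying on it)
batch_endpoint = 'https://ores.wikimedia.org/v3/scores/enwiki/'

def score_batch(revids):
    # Return {revid: predicted quality class} for a small batch of revision IDs,
    # reusing the 'headers' dict defined above
    params = {'models': 'wp10', 'revids': '|'.join(str(r) for r in revids)}
    response = requests.get(batch_endpoint, params=params, headers=headers).json()
    predictions = {}
    for revid, scores in response['enwiki']['scores'].items():
        score = scores['wp10'].get('score')
        if score is not None:  # unscorable revisions carry an 'error' key instead
            predictions[revid] = score['prediction']
    return predictions

score_batch(politicData['revision_id'][0:50])
Batches of around 50 revisions would cut the job from tens of thousands of requests to a few hundred.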
In [11]:
politicData.shape[0]
#politicData[-5:]
Out[11]:
Importing the other data is just a matter of reading CSV files in (for the R programmers: an R example will be up as soon as the Hub supports the language).
In [3]:
## getting the data from the CSV files
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        # keep the first three columns; note the header row is included in data
        data.append([row[0], row[1], row[2]])
In [5]:
print(data[782])
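Note that data[0] is the CSV's header row, so an index like data[782] refers to the 782nd data row, not the 783rd. A small sketch separating the two:
In [ ]:
# Separate the header row from the data rows
header, rows = data[0], data[1:]
print(header)     # the CSV's column names
print(len(rows))  # number of data rows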