Bias on Wikipedia

Todd Schultz

Due: November 2, 2017

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), the goal is to analyze what the nature of political articles on Wikipedia, both their existence and their quality, can tell us about bias in Wikipedia's content.

Imports


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import json

%matplotlib notebook

Import data of politicians by country

Import the data of politicians by country, provided by Oliver Keyes, found at https://figshare.com/articles/Untitled_Item/5513449.


In [2]:
politicianFile = 'PolbyCountry_data.csv'
politicianNames = pd.read_csv(politicianFile)

# rename variables
politicianNames.rename(columns = {'page':'article_name'}, inplace = True)
politicianNames.rename(columns = {'last_edit':'revision_id'}, inplace = True)
politicianNames[0:4]


Out[2]:
country article_name revision_id
0 Abkhazia Zurab Achba 802551672
1 Abkhazia Garri Aiba 774499188
2 Abkhazia Zaur Avidzba 803841397
3 Abkhazia Raul Eshba 789818648

In [3]:
politicianNames.shape


Out[3]:
(47997, 3)

Import population by country

Import the population by country, provided by PRB, found at http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14. The data is from mid-2015.


In [4]:
countryFile = 'Population Mid-2015.csv'
tempDF = pd.read_csv(countryFile, header=1)
countryPop = pd.DataFrame(data={'country': tempDF['Location'], 'population': tempDF['Data']})

countryPop[0:5]


Out[4]:
country population
0 Afghanistan 32,247,000
1 Albania 2,892,000
2 Algeria 39,948,000
3 Andorra 78,000
4 Angola 25,000,000
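Note that the population values are read in as comma-formatted strings (e.g. "32,247,000"), so they must be converted to integers before any per-capita calculations. A minimal sketch on a toy frame mirroring the structure above:

```python
import pandas as pd

# Illustrative frame with the same structure as countryPop above
countryPop = pd.DataFrame({'country': ['Afghanistan', 'Albania'],
                           'population': ['32,247,000', '2,892,000']})

# Strip the thousands separators, then convert to integers
countryPop['population'] = countryPop['population'].str.replace(',', '').astype(int)

print(countryPop['population'].tolist())  # [32247000, 2892000]
```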

Combined data

Combine the data frames into a single data frame with the following columns: country, article_name, revision_id, article_quality, population. Create an empty placeholder column for article_quality, to be filled in next.


In [5]:
# First add placeholder to politicianNames dataframe for article quality
politicianNames = politicianNames.assign(article_quality = "")
article_quality = politicianNames['article_quality']

# Next, join politicianNames with countryPop
politicData = politicianNames.merge(countryPop,how = 'inner')

#politicianNames[0:5]
politicData[0:5]


Out[5]:
country article_name revision_id article_quality population
0 Afghanistan Laghman Province 778690357 32,247,000
1 Afghanistan Roqia Abubakr 779839643 32,247,000
2 Afghanistan Sitara Achakzai 803055503 32,247,000
3 Afghanistan Khadija Ahrari 805920528 32,247,000
4 Afghanistan Rahila Bibi Kobra Alamshahi 717743144 32,247,000

In [6]:
politicData.shape


Out[6]:
(46348, 5)
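The inner join drops roughly 1,600 articles whose country names do not match between the two sources (47,997 rows down to 46,348). One way to inspect what was lost, sketched here on toy frames, is pandas' `indicator` option on an outer merge:

```python
import pandas as pd

# Toy frames standing in for politicianNames and countryPop
articles = pd.DataFrame({'country': ['Afghanistan', 'Abkhazia'],
                         'article_name': ['Roqia Abubakr', 'Zurab Achba']})
populations = pd.DataFrame({'country': ['Afghanistan'],
                            'population': [32247000]})

# indicator=True adds a _merge column flagging rows found in only one source
merged = articles.merge(populations, how='outer', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

print(unmatched['country'].tolist())  # ['Abkhazia']
```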

ORES article quality data

Wikimedia provides an API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"), described at https://www.mediawiki.org/wiki/ORES with documentation at https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model. ORES estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article is in one of six quality categories. The options are, from best to worst:

FA - Featured article
GA - Good article
B - B-class article
C - C-class article
Start - Start-class article
Stub - Stub-class article
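For later analysis it can help to treat these classes as an ordered scale. A small sketch (the numeric coding and helper name are my own choices, not part of the assignment; counting FA and GA as "high quality" is one common convention):

```python
# Map ORES quality classes to an ordinal scale, best (6) to worst (1)
QUALITY_RANK = {'FA': 6, 'GA': 5, 'B': 4, 'C': 3, 'Start': 2, 'Stub': 1}

def is_high_quality(prediction):
    """Treat FA and GA predictions as 'high quality' articles."""
    return QUALITY_RANK.get(prediction, 0) >= QUALITY_RANK['GA']

print(is_high_quality('GA'))    # True
print(is_high_quality('Stub'))  # False
```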

Below, the ORES endpoint is queried in Python for every revision ID in the combined data set, and the predicted quality class is stored in the article_quality column:


In [ ]:
# Query ORES for the predicted quality class of each article revision
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

for irevid in range(politicData.shape[0]):
    revidstr = str(politicData['revision_id'][irevid])
    params = {'project' : 'enwiki',
              'model' : 'wp10',
              'revid' : revidstr
              }

    try:
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()

        # Store the predicted quality class (e.g. 'Stub', 'GA') for this revision
        politicData.loc[irevid,'article_quality'] = response['enwiki']['scores'][revidstr]['wp10']['score']['prediction']
    except (requests.RequestException, KeyError, ValueError):
        # Some revision IDs return no score (e.g. deleted revisions)
        print('Error at ' + str(irevid))

    # Progress indicator
    if irevid % 500 == 0:
        print(irevid)

# Write out csv file
politicData.to_csv('en-wikipedia_bias_2015.csv', index=False)
politicData[0:4]


0

In [11]:
politicData.shape[0]
#politicData[-5:]


Out[11]:
46348

Importing the other data is just a matter of reading the CSV files in.


In [3]:
## getting the data from the CSV files
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    # Note: the first row appended here is the CSV header, not a data row
    for row in reader:
        data.append([row[0], row[1], row[2]])

In [5]:
print(data[782])


['Albania', 'Aćif Hadžiahmetović', '742544909']
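Since the loop above also appends the CSV header as data[0], a header-aware alternative is csv.DictReader, which consumes the header row and yields one dict per data row. A sketch on an in-memory sample (the column names here are assumptions; verify them against the actual file):

```python
import csv
import io

# In-memory stand-in for page_data.csv (column names assumed)
sample = io.StringIO(
    "country,page,rev_id\n"
    "Albania,Aćif Hadžiahmetović,742544909\n"
)

# DictReader consumes the header and yields one dict per data row
rows = list(csv.DictReader(sample))
print(rows[0]['country'])  # Albania
print(rows[0]['rev_id'])   # 742544909
```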
