Bias on Wikipedia

Todd Schultz

Due: November 2, 2017

For this assignment (https://wiki.communitydata.cc/HCDS_(Fall_2017)/Assignments#A2:_Bias_in_data), the goal is to analyze what the nature of political articles on Wikipedia, both their existence and their quality, can tell us about bias in Wikipedia's content.

Imports


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import json

%matplotlib notebook

Import data of politicians by country

Import the data of politicians by country, provided by Oliver Keyes, found at https://figshare.com/articles/Untitled_Item/5513449.


In [2]:
politicianFile = 'PolbyCountry_data.csv'
politicianNames = pd.read_csv(politicianFile)

# rename variables
politicianNames.rename(columns = {'page':'article_name'}, inplace = True)
politicianNames.rename(columns = {'last_edit':'revision_id'}, inplace = True)
politicianNames[0:4]


Out[2]:
country article_name revision_id
0 Abkhazia Zurab Achba 802551672
1 Abkhazia Garri Aiba 774499188
2 Abkhazia Zaur Avidzba 803841397
3 Abkhazia Raul Eshba 789818648

In [3]:
politicianNames.shape


Out[3]:
(47997, 3)

Import population by country

Import the population by country, provided by PRB, found at http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14. The data is from mid-2015.


In [4]:
countryFile = 'Population Mid-2015.csv'
tempDF = pd.read_csv(countryFile, header=1)
countryPop = pd.DataFrame(data={'country': tempDF['Location'], 'population': tempDF['Data']})

countryPop[0:5]


Out[4]:
country population
0 Afghanistan 32,247,000
1 Albania 2,892,000
2 Algeria 39,948,000
3 Andorra 78,000
4 Angola 25,000,000
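Note that the population values are read in as comma-formatted strings (e.g. "32,247,000"), so they must be converted to integers before any per-capita calculations. A minimal sketch on a toy frame mirroring the structure above:

```python
import pandas as pd

# Illustrative frame with the same structure as countryPop above
countryPop = pd.DataFrame({'country': ['Afghanistan', 'Albania'],
                           'population': ['32,247,000', '2,892,000']})

# Strip the thousands separators, then convert to integers
countryPop['population'] = countryPop['population'].str.replace(',', '').astype(int)

print(countryPop['population'].tolist())  # [32247000, 2892000]
```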

Combined data

Combine the data frames into a single data frame with the following columns: country, article_name, revision_id, article_quality, population. Create an empty placeholder column for article_quality, to be filled in next.


In [5]:
# First add placeholder to politicianNames dataframe for article quality
politicianNames = politicianNames.assign(article_quality = "")
article_quality = politicianNames['article_quality']

# Next, join politicianNames with countryPop
politicData = politicianNames.merge(countryPop,how = 'inner')

#politicianNames[0:5]
politicData[0:5]


Out[5]:
country article_name revision_id article_quality population
0 Afghanistan Laghman Province 778690357 32,247,000
1 Afghanistan Roqia Abubakr 779839643 32,247,000
2 Afghanistan Sitara Achakzai 803055503 32,247,000
3 Afghanistan Khadija Ahrari 805920528 32,247,000
4 Afghanistan Rahila Bibi Kobra Alamshahi 717743144 32,247,000

In [6]:
politicData.shape


Out[6]:
(46348, 5)
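The inner join drops roughly 1,600 articles whose country names do not match between the two sources (47,997 rows down to 46,348). One way to inspect what was lost, sketched here on toy frames, is pandas' `indicator` option on an outer merge:

```python
import pandas as pd

# Toy frames standing in for politicianNames and countryPop
articles = pd.DataFrame({'country': ['Afghanistan', 'Abkhazia'],
                         'article_name': ['Roqia Abubakr', 'Zurab Achba']})
populations = pd.DataFrame({'country': ['Afghanistan'],
                            'population': [32247000]})

# indicator=True adds a _merge column flagging rows found in only one source
merged = articles.merge(populations, how='outer', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']

print(unmatched['country'].tolist())  # ['Abkhazia']
```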

ORES article quality data

Wikimedia provides an API endpoint for a machine learning system called ORES ("Objective Revision Evaluation Service"), described at https://www.mediawiki.org/wiki/ORES with documentation at https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model. ORES estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article is in one of six quality categories. The options are, from best to worst:

FA - Featured article
GA - Good article
B - B-class article
C - C-class article
Start - Start-class article
Stub - Stub-class article
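For later analysis it can help to treat these classes as an ordered scale. A small sketch (the numeric coding and helper name are my own choices, not part of the assignment; counting FA and GA as "high quality" is one common convention):

```python
# Map ORES quality classes to an ordinal scale, best (6) to worst (1)
QUALITY_RANK = {'FA': 6, 'GA': 5, 'B': 4, 'C': 3, 'Start': 2, 'Stub': 1}

def is_high_quality(prediction):
    """Treat FA and GA predictions as 'high quality' articles."""
    return QUALITY_RANK.get(prediction, 0) >= QUALITY_RANK['GA']

print(is_high_quality('GA'))    # True
print(is_high_quality('Stub'))  # False
```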

Below, the ORES endpoint is queried in Python for every revision ID in the combined data set, and the predicted quality class is stored in the article_quality column:


In [ ]:
# Query ORES for the predicted quality class of each article revision
endpoint = 'https://ores.wikimedia.org/v3/scores/{project}/{revid}/{model}'
headers = {'User-Agent' : 'https://github.com/your_github_username', 'From' : 'your_uw_email@uw.edu'}

for irevid in range(politicData.shape[0]):
    revidstr = str(politicData['revision_id'][irevid])
    params = {'project' : 'enwiki',
              'model' : 'wp10',
              'revid' : revidstr
              }

    try:
        api_call = requests.get(endpoint.format(**params), headers=headers)
        response = api_call.json()

        # Store the predicted quality class (e.g. 'Stub', 'GA') for this revision
        politicData.loc[irevid,'article_quality'] = response['enwiki']['scores'][revidstr]['wp10']['score']['prediction']
    except (requests.RequestException, KeyError, ValueError):
        # Some revision IDs return no score (e.g. deleted revisions)
        print('Error at ' + str(irevid))

    # Progress indicator
    if irevid % 500 == 0:
        print(irevid)

# Write out csv file
politicData.to_csv('en-wikipedia_bias_2015.csv', index=False)
politicData[0:4]


0

In [11]:
politicData.shape[0]
#politicData[-5:]


Out[11]:
46348

Importing the other data is just a matter of reading the CSV files in.


In [3]:
## getting the data from the CSV files
import csv

data = []
with open('page_data.csv') as csvfile:
    reader = csv.reader(csvfile)
    # Note: the first row appended here is the CSV header, not a data row
    for row in reader:
        data.append([row[0], row[1], row[2]])

In [5]:
print(data[782])


['Albania', 'Aćif Hadžiahmetović', '742544909']
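Since the loop above also appends the CSV header as data[0], a header-aware alternative is csv.DictReader, which consumes the header row and yields one dict per data row. A sketch on an in-memory sample (the column names here are assumptions; verify them against the actual file):

```python
import csv
import io

# In-memory stand-in for page_data.csv (column names assumed)
sample = io.StringIO(
    "country,page,rev_id\n"
    "Albania,Aćif Hadžiahmetović,742544909\n"
)

# DictReader consumes the header and yields one dict per data row
rows = list(csv.DictReader(sample))
print(rows[0]['country'])  # Albania
print(rows[0]['rev_id'])   # 742544909
```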
