Our principal question in this investigation is what makes a good article about a country, in the hope of improving articles related to Côte d'Ivoire and Uganda.
We will explore several aspects of article quality and their relationship to different language Wikipedias. First we look at the articles literally describing the countries of Ivory Coast and Uganda and compare them to other country articles in the English, French, and Swahili Wikipedias. This is done both to sanity-check our method and to determine which subsequent articles to explore. Here we utilise the research "Tell Me More: An Actionable Quality Model for Wikipedia" by Warncke-Wang, Cosley, and Riedl, which is meant to measure points of quality that can be more easily addressed. Secondly, we move to sets of articles about a subject of each nation. These subjects are chosen as most important by how often they occur as section headings in a specific Wikipedia. Next we attempt to apply the lessons from "Predicting Quality Flaws in User-generated Content: The Case of Wikipedia" by Anderka, Stein, and Lipka, which is about finding user-specified quality flaws.
This report is an IPython Notebook, so it contains all the code necessary to reproduce these results. Much of the code is at the end, to maintain non-technical readability.
We want to compare literal country articles worldwide. Since we will be analysing the English, French, and Swahili Wikipedias, we find the Wikidata items related to these countries (here denoted by their identifiers beginning with "Q"). Each Wikidata item points to all the different Wikipedia language versions available. In an attempt not to include language bias, this list was first queried from the Wikidata property "instance of" with target country (through the offline Wikidata parser). However this only yielded 133 items. Even though it is Anglocentric, our list is instead scraped from the English Wikipedia List of Countries by Population, yielding:
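For reference, a minimal sketch of the "instance of" filtering we first tried, here phrased against pywikibot claims rather than the offline parser (the helper name is hypothetical):
In [ ]:
import pywikibot

# P31 is "instance of"; Q6256 is "country".
repo = pywikibot.Site('en', 'wikipedia').data_repository()

def is_instance_of_country(qid):
    item = pywikibot.ItemPage(repo, qid)
    claims = item.get().get('claims', {})
    targets = [c.getTarget() for c in claims.get('P31', [])]
    return any(t.title() == 'Q6256' for t in targets)

print is_instance_of_country('Q1036')  # Uganda -> True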
In [5]:
country_qids = json.load(open('country_qids.json'))
ugandaqid, cdiqid = 'Q1036', 'Q1008'
print "We are comparing:", len(country_qids), 'countries.'
print "Some of their Wikidata QIDs are:", map(lambda qid: 'http://wikidata.org/wiki/' + qid, country_qids[:5])
We look at the 5 statistics that are mentioned in the literature as "Actionable Metrics". The metrics are: completeness (a weighted count of wikilinks to other articles), informativeness (a weighted combination of the infonoise ratio and the image count), number of section headings, article length (readable text length), and reference rate (references per unit of readable text).
In Tell Me More, weights are used; we use the same weights, but for experimentation they could be changed.
In [140]:
def report_actionable_metrics(wikicode, completeness_weight=0.8, infonoise_weight=0.6, images_weight=0.3):
    # Completeness: weighted count of regular wikilinks.
    completeness = completeness_weight * num_reg_links(wikicode)
    # Informativeness: weighted infonoise ratio plus weighted image count.
    informativeness = (infonoise_weight * infonoise(wikicode)) + (images_weight * num_file_links(wikicode))
    numheadings = len(section_headings(wikicode))
    articlelength = readable_text_length(wikicode)
    # References per unit of readable text.
    referencerate = article_refs(wikicode) / readable_text_length(wikicode)
    return {'completeness': completeness, 'informativeness': informativeness,
            'numheadings': numheadings, 'articlelength': articlelength,
            'referencerate': referencerate}
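As a quick illustration, a minimal (and hypothetical) usage of this function on a scrap of raw wikitext:
In [ ]:
import mwparserfromhell as pfh

# Hypothetical wikitext, just to exercise the metrics.
sample = u"== History ==\n[[Buganda]] was a kingdom.<ref>A source.</ref> [[File:Map.png|thumb]]"
print report_actionable_metrics(pfh.parse(sample))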
We will now look at each metric one by one, on the set of our Wikidata Q-IDs to get a feel for how these metrics will be used later on.
For each metric, the positions of the Uganda and Côte d'Ivoire articles amongst other countries are shown by the z-score, which is a measure of how many standard deviations they are above or below the average. Additionally we display a graph of the distribution of how the country articles score, separated by Wikipedia language.
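The z-score computation is the standard one; a minimal sketch with pandas, on toy numbers rather than our data:
In [ ]:
import pandas as pd

# Rescale each value by how many standard deviations it sits from the mean.
scores = pd.Series([3.0, 5.0, 4.0, 10.0])
print (scores - scores.mean()) / scores.std()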
In [499]:
metric_analyse_density('informativeness', 12)
Out[499]:
Most notably here, Côte d'Ivoire has a very high score in French Wikipedia: 3.31 standard deviations above average.
In general these distributions are quite standard, with a relatively long tail (positive skew), which we would expect from user-generated content. Also English and French outpace a spikier Swahili, which fits with standard conceptions about the popularity of these languages.
In [500]:
metric_analyse_density('completeness', 1500)
Out[500]:
Reassuringly, Uganda performs highly here in Swahili, meaning that it is well interlinked with other articles.
In [501]:
metric_analyse_density('numheadings', 75)
Out[501]:
Here English appears to have an almost bimodal distribution; we welcome explanations of why that should be so.
In [505]:
metric_analyse_density('articlelength',150000)
Out[505]:
Again Côte d'Ivoire scores very high, 2.77, in French Wikipedia. It must be quite a long article.
In [503]:
metric_analyse_density('referencerate', 0.01)
Out[503]:
These results lend credibility to this technique, because the Wikipedias of the languages native to those countries show the highest quality. For the Uganda articles, performance is slightly below average in English and French but strong in Swahili. For the Côte d'Ivoire articles, the real standout is French Wikipedia, at times 3 standard deviations above average, but again below average in other languages. So we can somewhat confidently apply this technique to other sets of articles to see what improvements could be made. Remember, a weakness in any one of the 5 metrics is an indicator of a specific type of task that could be done on those articles. Next we will inspect how to determine other sets of articles to which we can apply this method.
Now we will find the highest-occurring section names in country articles by language. The assumption here is that if a section name occurs frequently, it is an important subject in that language.
In [27]:
top_n_sections('fr',20)
Out[27]:
In [28]:
top_n_sections('en',20)
Out[28]:
In [29]:
top_n_sections('sw',7)
Out[29]:
Looking at the most frequent sections in each, what patterns emerge? First let us acknowledge that some of the sections here are not content, like "Notes" and "References". The lead section (the section before any named sections) usually starts with code to include the Infobox Country in Swahili, which is where that cryptic result comes from. Ignoring non-content sections, though, there is a clear winner in all three languages: History.
In all three languages History is the one section you can most rely on being there, occurring in 88% of French, 95% of English, and 41% of Swahili country articles. In Swahili and French the second most popular section is Geography, but in English it is Economy. So this informs which Wikipedia categories to apply our metrics to next: the History, Economy, and Geography categories.
Given that we now have the top sections for each language, we devise a method to look at representative articles. We find a category in each Wikipedia for each subject in our highest-occurring section headings, and for each major nation associated with our languages. We create a script to present every article in every category and all its recursive subcategories to a human judge (a sketch of the traversal follows the list below). The full list of categories used for the second part of this analysis is:
In [266]:
categoryjson = json.load(open('ethnosets-categories-capitalized.json', 'r'))
for subject, nationdict in categoryjson.iteritems():
    print subject
    for nation, langdict in nationdict.iteritems():
        print "  |---" + nation
        for lang in ['fr', 'en', 'sw']:
            try:
                print "    |---" + langdict[lang]
            except KeyError:
                pass
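The recursive traversal itself can be sketched as follows; a minimal version with pywikibot, where the category name and the depth cap are illustrative assumptions:
In [ ]:
import pywikibot

site = pywikibot.Site('en', 'wikipedia')

def all_category_titles(catname, depth=0, maxdepth=5):
    # Yield every article title in a category and, recursively, its subcategories.
    cat = pywikibot.Category(site, catname)
    for article in cat.articles():
        yield article.title()
    if depth < maxdepth:
        for subcat in cat.subcategories():
            for title in all_category_titles(subcat.title(), depth + 1, maxdepth):
                yield title

for title in all_category_titles('Category:History of Uganda'):
    print title  # each title is presented to the human judge to accept or reject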
Looking at every title of every article in every subcategory of these categories, a human (your author) accepted or rejected each article. The purpose of this filtering process was to remove noise, such as articles about economists from Economy, or high schools from Geography, while still being able to select articles from many diverse categories. In the end we accepted a total of:
In [271]:
ethnosets = !ls ethnosets/
reduce(lambda a, b: a+b, map(lambda l: len(l), map(lambda f: json.load(open('ethnosets/'+f)), ethnosets)))
Out[271]:
24,630 articles, which were fetched as live data off Wikipedia; live so that we could compare the results after any editing efforts. We used Wikimedia Labs for the data pull script, if you are interested. Then for each of those articles we ask Wikidata to give us the page, if possible, in all our desired languages. And then on all those pages we call the Actionable Metrics as described earlier.
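The Wikidata step is the same lookup used by do_sitelinks in the code appendix; a minimal sketch for a single item (Q1036 is Uganda):
In [ ]:
import pywikibot

repo = pywikibot.Site('fr', 'wikipedia').data_repository()
wditem = pywikibot.ItemPage(repo, 'Q1036').get()
for lang in ['en', 'fr', 'sw']:
    try:
        print lang, wditem['sitelinks'][lang + 'wiki']
    except KeyError:
        print lang, '(no sitelink in this language)'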
If you are keeping track, that means we have four active variables: Wikipedia language (English, French, Swahili), nation (USA, France, Côte d'Ivoire, Uganda), subject (History, Economy, Geography), and metric (the 5 actionable metrics).
Below is a matrix of heatmaps displaying the mean average (not the full distribution as we previously graphed) of the specified metrics versus the specified subjects. In the inner dimensions we graph across all pages in a specified language versus a specified nation. That is, if you look at any individual heatmap you can see it compares a Wikipedia language and a nation. Looking down the rows of the greater matrix we have varying metric types, and looking across the columns of the greater matrix we have the subject categories.
In [146]:
make_heat_map()
This heatmap can act as a guide to where one could find good examples of articles. Say, for instance, you wanted to improve History articles of Uganda. First look at the vertical History column and, within it, the vertical Uganda column. Then moving your eyes up and down you can see that while English Wikipedia has the best reference rate, French Wikipedia has a slightly higher code & images score; these would be differently useful depending on how you wanted to improve these articles.
For Economy, we can see that the clearest horizontal band across each metric is English. This is a promising result, because we knew that English, unlike French or Swahili, placed more of an emphasis on Economy in its country articles. However if we take a by-nation approach we see that the bluest square is almost always in the France column, not in the USA column. Yet wouldn't we expect, if English values Economy, that Economy of the USA articles would perform most strongly? One explanation is that there are so many English USA company articles that in fact the average quality is brought down. Some data that would give credence to this idea is that English articles about France's economy outperform in the code and images metric. Bot operations often leave a lot of templated code on many relatively obscure articles, for which this could be a trace. For Economy articles, then, it would be recommended to look at French articles about France for written quality, and English ones for templating ideas.
Looking in the History column, in each of the metric-rows there is a strong horizontal band of blue in the French sub-row. This indicates that French history articles are good across most nations. Except in the bottom two metrics, article length and references per article length, English seems to outperform, particularly for Uganda and the USA. Swahili here performs better for the USA than for Uganda. In fact in the Swahili band many datapoints do not register, because there are 25 or fewer articles in question (see caveat below). However, within the Uganda column English and French do register; we can interpret these as translation opportunities to get more Swahili content. As for Côte d'Ivoire, we also find that in all but referencing, French does better than English. This would suggest that for Côte d'Ivoire's history there are unfortunately no better direct examples to follow (but there is French history to look up to).
Moving to Geography, a notable pattern is the domination of French Wikipedia about France in, again, all but the referencing metric. We encounter some movement from Swahili Wikipedia for the first time: you can see two fainter, but still noticeable, vertical strips in Côte d'Ivoire and Uganda for all Wikipedia languages. So Geography is the relative strong point for these African countries. To improve their relative content coverage, Swahili editors would do better to focus on Economy and History if they wanted to work on the more-needed areas of their Wikipedia. And within all the things they could do, we see the bluest points within code and images, which is likely from bot-created Africa place articles. The needed improvements to those articles are section count and article length, for which English and French about France, and English about the USA, could best be a guide.
Since we are using the mean average, if there are not a lot of pages in a language for a category and one of the pages is particularly impressive, then the average can appear high without that language being a useful information source. This occurred, for instance, with Swahili Economy of France in Wikilinks. We set a minimum threshold of 25 contributing articles per heatmap cell to reduce this noise, so a white square appears if there were fewer than 25 articles. All subject-nations not meeting this threshold can be found in the code appendix. In fact these are the total counts of how many articles are being considered in each group.
In [52]:
display_article_counts()
Out[52]:
The previous approach of looking at the top section headings was not very useful here, because within a broad category like "Economy" or "Geography" there are too many sub-subjects which have different expected flavours of section headings. The code appendix shows the results of this experimentation. Instead we look at the top templates, to see if specific quality flaws could be identified using the techniques of the Anderka, Stein, and Lipka quality-flaws paper. Regrettably that study only discusses English Wikipedia, so here we would need a French or Kiswahili Wikipedian to identify the top clean-up tags in French Wikipedia. While "Citation needed" does appear for some categories, neither "Citation nécessaire" nor "Ukweli" crops up, so this indicates that maybe a direct translation is not possible.
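For concreteness, here is a minimal sketch of the per-article clean-up tag count; the cross-language tag table is the hypothetical part, and exactly the part that did not pan out:
In [ ]:
import json
import mwparserfromhell as pfh

# Hypothetical clean-up tag equivalents per language.
CLEANUP_TAGS = {'en': [u'citation needed'],
                'fr': [u'citation nécessaire'],
                'sw': [u'ukweli']}

def cleanup_tag_rate(ethnoset_filename, lang):
    articles = json.load(open(ethnoset_filename))
    total, hits = 0, 0
    for qid, attribdict in articles.iteritems():
        wikitext = attribdict['wikitext'].get(lang)
        if not wikitext:
            continue
        total += 1
        for temp in pfh.parse(wikitext).filter_templates():
            if temp.name.strip().lower() in CLEANUP_TAGS[lang]:
                hits += 1
    return hits / float(total) if total else 0.0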
In fact all of the most common clean-up tags make no apperance in our top 20 of each subject-nation files, except "Citation needed". These subject-nation combinations have the highest incidence of "Citation needed":
| Subject-nation | "Citation needed" per article |
|----------------|-------------------------------|
| history-usa    | 0.501708 |
| economy-fra    | 0.282183 |
| economy-usa    | 0.270058 |
| history-fra    | 0.226929 |
| history-cdi    | 0.173913 |
| geography-usa  | 0.154013 |
| economy-cdi    | 0.128205 |
This probably highlights a lack of tagging with clean-up templates more than anything. It seems counter-intuitive that the more highly scoring subject-nations would also need more citing than those which scored less in all other areas. Our sample of Wikipedia articles here is too small for, or is otherwise somehow not the right application for, this clean-up tag analysis.
Still, looking over the data one can see that other citation templates feature highly. Therefore we can investigate the relative types of citations used. We do this only in English Wikipedia, for lack of understanding of French and Swahili citation philosophies.
First we retrieve the statistics for all occurrences of English Wikipedia citation templates from Wikimedia Labs, and then convert them into proportions.
MariaDB [enwiki_p]> select tl_title, count(*) from templatelinks where tl_title like 'Cite_web' or tl_title like 'Cite_news' or tl_title like 'Cite_journal' or tl_title like 'Cite_book' group by tl_title;
+--------------+----------+
| tl_title | count(*) |
+--------------+----------+
| Cite_book | 536258 |
| Cite_journal | 328129 |
| Cite_news | 444447 |
| Cite_web | 1560207 |
+--------------+----------+
4 rows in set (11 min 36.68 sec)
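Converted to proportions, those four counts break down as follows (a quick check in Python):
In [ ]:
cite_counts = {'Cite_book': 536258, 'Cite_journal': 328129,
               'Cite_news': 444447, 'Cite_web': 1560207}
total = float(sum(cite_counts.itervalues()))
for name, count in sorted(cite_counts.iteritems()):
    print name, round(count / total, 3)
# Cite_book 0.187, Cite_journal 0.114, Cite_news 0.155, Cite_web 0.544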
This global statistic serves as a benchmark for all the compositions we can infer from our top-template work, if we consider all the occurrences of Cite book, Cite journal, Cite news, and Cite web from our subject-nation files. See below the graphical representation.
In [125]:
make_cite_plot()
Immediately, Geography seems to be less diversely cited for all nations. We can also see signs that Economy on the whole has more news citations, and that History on average utilises more book citations. However the more major fact seems to be that web citations almost always account for more than half. Notably, when they do not, two of the three cases, History of Côte d'Ivoire and History of Uganda, are heavy on news citations.
All of this should be quite reassuring. A main disadvantage facing African content, previously postulated, was the lack of citable sources. That problem is exacerbated if we consider only printed historical works as useful for expanding content. However, since across the board citations are not heavily reliant on books or journals, even in the encyclopedia's strong suits like the Economy of the USA, citing should be less of an impediment to Kumusha Takes Wiki editors.
As a curiosity, we investigate the frequency of occurrence of the geo-coordinate template in our subject-nation sets.
In [12]:
coordinate_frequencies()
Out[12]:
There are some sets that show as almost entirely tagged with the coordinate template. In English, the nations with the closest cultural ties have higher frequencies of coordinate tagging, which is what those familiar with English Wikipedia would expect. But French Wikipedia does not seem to follow the same trend. The percentage of coordinate-tagged Geography of France articles in French (18%) seems low compared to Geography of the USA in English (93%). This seems especially odd as both Côte d'Ivoire and the USA have higher coordinate template frequencies than France herself. Unlike in English Wikipedia, the coordinate tagging here is inversely proportional to cultural ties, a proxy for domain expertise. This could be related to French Wikipedia categorization techniques. Recall we are looking at all the subcategories of "[subject] of [nation]", so it could be that there are types of Geography articles in French Wikipedia that are not tied to a coordinate. Those would be, so to speak, "non-obvious" non-coordinate articles, and would be more likely to be written about France in the French Wikipedia. This would be supported by the fact that the total number of coordinate-tagged articles about France is almost the same in our English and French sets; the French set is simply larger, containing more non-coordinate-tagged articles.
Our main task was to understand what makes a good article with respect to a nation. Despite the question's subjectivity, there are facts we have shown about the current state of Wikipedia, and if you consider any of them "useful" then you can try to emulate them. Our main findings are that, when it comes to encyclopedically describing a nation, all of the English, French, and Swahili Wikipedias most often write about the nation's history. French and Swahili then consider Geography next most important, while English considers it Economy. We have created a graphical guide involving our 3 languages, 3 subjects, 4 nations, and 5 metrics. One can consult this guide to know where work needs to be done, and in which areas of our Wikipedias to look for an example. Lastly, we determined that we do not have much information about where users have left requests for improvement for the countries in question, but if English Wikipedia is a model, then web citations are usually sufficient.
The way in which the swaths of articles were chosen for the subject-nation sets could be unsatisfactory for several reasons. Firstly, the category system may not accurately represent a Wikipedia's available items on a subject. Secondly, since the process involved a human judge, error is certainly introduced. A better way of determining these subject-nation sets would be useful. Also, no by-hand human-reading investigation was done on those sets; instead we opted for algorithmic methods. If a sound methodology for the human analysis of pages is available, that would be a good technique to compare to the algorithmic ones presented here.
Code also available by git on github https://github.com/notconfusing/kumusha_takes_wiki
In [1]:
# Infonoise metric of Stvilia (2005) in concept, although the implementation may
# differ since we are not stopping and stemming words, because of the multiple
# languages we need to handle.
def readable_text_length(wikicode):
    # could also use wikicode.filter_text()
    return float(len(wikicode.strip_code()))

def infonoise(wikicode):
    # ratio of readable text length to total markup length
    ratio = readable_text_length(wikicode) / float(len(wikicode))
    return ratio
# Helper function to mine for section headings; of course for the lead section it
# doesn't quite make sense.
def section_headings(wikicode):
    sections = wikicode.get_sections()
    # take the first line of each section and strip the '=' heading markers
    sec_headings = map(lambda s: filter(lambda l: l != '=', s),
                       map(lambda a: a.split('\n', 1)[0], sections))
    return sec_headings
# i don't know why mwparserfromhell's .filter_tags() isn't working at the moment,
# so we count <ref> tags with a regular expression for now.
import re

def num_refs(wikicode):
    text = str(wikicode)
    reftags = re.findall(r'<\s*ref', text)
    return len(reftags)

def article_refs(wikicode):
    sections = wikicode.get_sections()
    return float(reduce(lambda a, b: a + b, map(num_refs, sections)))
# Predicates for file and category links in English, French, and Swahili.
def link_a_file(linkstr):
    fnames = [u'File:', u'Fichier:', u'Image:', u'Picha:']
    bracknames = map(lambda a: '[[' + a, fnames)
    return any(map(lambda b: linkstr.startswith(b), bracknames))

def link_a_cat(linkstr):
    cnames = [u'Category:', u'Catégorie:', u'Jamii:']
    bracknames = map(lambda a: '[[' + a, cnames)
    return any(map(lambda b: linkstr.startswith(b), bracknames))

def num_reg_links(wikicode):
    reg_links = filter(lambda a: not link_a_file(a) and not link_a_cat(a), wikicode.filter_wikilinks())
    return float(len(reg_links))

def num_file_links(wikicode):
    file_links = filter(lambda a: link_a_file(a), wikicode.filter_wikilinks())
    return float(len(file_links))
In [1]:
import pywikibot
import mwparserfromhell as pfh
import os
import datetime
import pandas as pd
import json
from collections import defaultdict
from ggplot import *
import operator
from IPython.display import HTML
%pylab inline
langs = ['en','fr','sw']
nations = ['usa', 'fra', 'cdi', 'uga']
wikipedias = {lang: pywikibot.Site(lang, 'wikipedia') for lang in langs}
wikidata = wikipedias['fr'].data_repository()
This will be our data structure, a dict of dicts, until we have some numbers to put into pandas.
In [143]:
def enfrsw():
    return {lang: None for lang in langs}

def article_attributes():
    return {attrib: enfrsw() for attrib in ['sitelinks', 'wikitext', 'wikicode', 'metrics']}

def do_sitelinks(langs, qids, data):
    for qid in qids:
        page = pywikibot.ItemPage(wikidata, qid)
        wditem = page.get()
        for lang in langs:
            try:
                data[qid]['sitelinks'][lang] = wditem['sitelinks'][lang + 'wiki']
            except KeyError:
                pass
    return data
Functions to get the page texts in all our desired languages
In [494]:
def get_wikitext(lang, title):
    page = pywikibot.Page(wikipedias[lang], title)
    def get_page(page):
        try:
            pagetext = page.get()
            return pagetext
        except pywikibot.exceptions.IsRedirectPage:
            redir = page.getRedirectTarget()
            return get_page(redir)
        except pywikibot.exceptions.NoPage:
            print 're raising'
            raise
    return get_page(page)

def do_wikitext(langs, data):
    for qid, attribs in data.iteritems():
        for lang, sl in attribs['sitelinks'].iteritems():
            if sl:
                try:
                    if randint(0, 100) == 99:  # occasionally print progress
                        print sl
                    data[qid]['wikitext'][lang] = get_wikitext(lang, sl)
                except:
                    print 'bad sitelink', sl
                    continue
    return data
def do_wikicode(langs, data):
    for qid, attribs in data.iteritems():
        for lang, pagetext in attribs['wikitext'].iteritems():
            if pagetext:
                data[qid]['wikicode'][lang] = pfh.parse(pagetext)
    return data

def do_metrics(data):
    for qid, attribs in data.iteritems():
        for lang, wikicode in attribs['wikicode'].iteritems():
            if wikicode:
                data[qid]['metrics'][lang] = report_actionable_metrics(wikicode)
    return data
# this will take a lot of network time since we are going to load about 300
# pages, but we'll save the data off so we don't have to do it unnecessarily
def make_data(langs, qids, savename):
    print 'getting these qids: ', qids
    data = defaultdict(article_attributes)
    print 'getting sitelinks'
    data = do_sitelinks(langs, qids, data)
    print 'getting wikitext'
    data = do_wikitext(langs, data)
    print 'converting to wikicode'
    data = do_wikicode(langs, data)
    print 'computing metrics'
    data = do_metrics(data)
    hashable_data = {qid:
                     {'wikitext': attribdict['wikitext'],
                      'metrics': attribdict['metrics'],
                      'sitelinks': attribdict['sitelinks']}
                     for qid, attribdict in data.iteritems()}
    print 'saving now'
    # save the results, once timestamped and once as 'latest'
    safefilename = savename + str(datetime.datetime.now()) + '.json'
    with open(safefilename, 'w') as f3:
        json.dump(hashable_data, f3)
    with open(savename + 'latest.json', 'w') as f4:
        json.dump(hashable_data, f4)
    return data

# i don't call this unless i have time, so it stays commented out
# arts = make_data(langs, country_qids, 'countrydata')
# time to get into pandas, let's throw everything into a DataFrame
df = pd.DataFrame(columns=['Country', 'language', 'metric', 'val'])
arts = json.load(open('countrydata-latest.json', 'r'))
for qid, attribdict in arts.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'metrics':
            for lang, metrics in langdict.iteritems():
                try:
                    # sometimes there wasn't an article in that language and
                    # thus no corresponding metrics
                    for metric_name, metric_val in metrics.iteritems():
                        df = df.append({'Country': qid, 'language': lang, 'metric': metric_name, 'val': float(metric_val)}, ignore_index=True)
                except:
                    pass
df = df.convert_objects(convert_numeric=True)
metric_list = ['completeness', 'informativeness', 'numheadings', 'articlelength', 'referencerate']
langs_df_dict = {lang: df[df['language'] == lang] for lang in langs}
metric_df_dict = {metric: df[df['metric'] == metric] for metric in metric_list}
# for later calculation
uganda_zscores = defaultdict(list)
cdi_zscores = defaultdict(list)
In [495]:
def metric_analyse_density(ametric, xlimit):
    inf_df = metric_df_dict[ametric]
    zscore = lambda x: (x - x.mean()) / x.std()
    inf_piv = inf_df.pivot(index='Country', columns='language', values='val')
    inf_piv_z = inf_piv.apply(zscore)
    metric_analyse_density_plot(ametric, xlimit, inf_df)
    print 'Uganda (' + ugandaqid + "), Côte d'Ivoire (" + cdiqid + ") " + ametric + " z-scores."
    return inf_piv_z.ix[[ugandaqid, cdiqid]]
In [496]:
def metric_analyse_density_plot(ametric, xlimit, inf_df):
    p = ggplot(aes(x='val', colour='language', fill=True, alpha=0.3), data=inf_df) + \
        geom_density() + labs("score", "frequency") + \
        scale_x_continuous(limits=(0, xlimit)) + ggtitle(ametric + '\nall country articles\n ')
    p.rcParams["figure.figsize"] = "4, 3"
    p.draw()
In [18]:
def defaultint():
    return defaultdict(int)

section_count = defaultdict(defaultint)
sorted_secs = defaultdict(list)
total_articles = defaultdict(int)
articles = json.load(open('countrydata-latest.json', 'r'))
for qid, attribdict in articles.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'wikitext':
            for lang, wikitext in langdict.iteritems():
                if wikitext:
                    total_articles[lang] += 1
                    wikicode = pfh.parse(wikitext)
                    secs = section_headings(wikicode)
                    for sec in secs:
                        sec = sec.strip()
                        section_count[lang][sec] += 1
section_df = pd.DataFrame(columns=['lang', 'secname', 'freq'])
for lang, sec_dict in section_count.iteritems():
    for secname, seccount in sec_dict.iteritems():
        freq = seccount / float(total_articles[lang])
        section_df = section_df.append({'lang': lang, 'secname': secname, 'freq': freq}, ignore_index=True)
# section_df = section_df.convert_objects(convert_numeric=True)
section_df.head()
top_secs = section_df[section_df.freq > 0.1]
sort_secs = top_secs.sort(columns='freq', ascending=False)
In [26]:
def top_n_sections(lang, n):
    top = sort_secs[sort_secs.lang == lang].iloc[:n].convert_objects(convert_numeric=True)
    print str(total_articles[lang]) + ' total articles inspected in ' + lang + '.'
    return HTML(top.to_html(index=False, columns=['lang', 'secname', 'freq']))
In [9]:
def top_sections_ethnoset(ethnoset_filename):
    def defaultint():
        return defaultdict(int)
    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)
    articles = json.load(open(ethnoset_filename, 'r'))
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems():
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        secs = section_headings(wikicode)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1
    section_df = pd.DataFrame(columns=['lang', 'secname', 'freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount / float(total_articles[lang])
            section_df = section_df.append({'lang': lang, 'secname': secname, 'freq': freq}, ignore_index=True)
    # section_df = section_df.convert_objects(convert_numeric=True)
    top_secs = section_df[section_df.freq > 0.1]
    sort_secs = top_secs.sort(columns='freq', ascending=False)
    return sort_secs
In [31]:
!ls ethnosave/
In [50]:
def display_article_counts():
    filenames = !ls ethnosave
    art_counts = pd.DataFrame(columns=['subj', 'nation', 'lang', 'count'])
    for filename in filenames:
        spl = filename.split('-')
        subj, nation = spl[0], spl[1].split('.')[0]
        fileaddr = 'ethnosave/' + filename
        articles = json.load(open(fileaddr, 'r'))
        total_articles = defaultdict(int)
        for qid, attribdict in articles.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'wikitext':
                    for lang, wikitext in langdict.iteritems():
                        if wikitext:
                            total_articles[lang] += 1
        for lang, count in total_articles.iteritems():
            art_counts = art_counts.append({'subj': subj, 'nation': nation, 'lang': lang, 'count': count}, ignore_index=True)
    return HTML(art_counts.sort(columns='count', ascending=False).to_html(index=False))
In [137]:
def make_heat_map():
    subj_list = ['economy', 'history', 'geography']
    metric_list = ['completeness', 'informativeness', 'numheadings', 'articlelength', 'referencerate']
    fig, axes = plt.subplots(nrows=len(metric_list), ncols=len(subj_list), sharex='col', sharey='row')
    for axarr, metric in zip(axes, metric_list):
        for ax, subj in zip(axarr, subj_list):
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            heatmap = ax.pcolor(natlangpiv, cmap='Blues')
            ax.set_yticks(np.arange(0.5, len(natlangpiv.index), 1))
            ax.set_yticklabels(natlangpiv.index)
            ax.set_xticks(np.arange(0.5, len(natlangpiv.columns), 1))
            ax.set_xticklabels(natlangpiv.columns)
            cbar = plt.colorbar(mappable=heatmap, ax=ax)
    fig.suptitle('Heatmap of Actionable Metrics by Country versus Wikipedia Language, \n by Subject Category', fontsize=18)
    fig.set_size_inches(12, 12, dpi=600)
    # fig.tight_layout()
    subj_titles = ['Economy', 'History', 'Geography']
    metric_titles = ['Wikilinks', 'Code & Images to Text Ratio', 'Section Count', 'Article Length', 'References per Article Length']
    for i in range(len(subj_titles)):
        axes[0][i].set_title(subj_titles[i])
    for j in range(len(metric_titles)):
        axes[j][0].set_ylabel(metric_titles[j])
In [82]:
means_df[(means_df.metric == 'referencerate') & (means_df.subj == 'geography')]
Out[82]:
In [135]:
def load_ethnosaves():
    ethnosaves = !ls ethnosave
    subj_df_dict = {subj: pd.DataFrame(columns=['qid', 'subj', 'nation', 'lang', 'metric', 'val']) for subj in ethnosaves}
    for ethnosavefile in ethnosaves:
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
        arts = json.load(open('ethnosave/' + ethnosavefile, 'r'))
        print subj, nation
        sdf = subj_df_dict[ethnosavefile]
        for qid, attribdict in arts.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'metrics':
                    for lang, metrics in langdict.iteritems():
                        try:
                            # sometimes there wasn't an article in that language
                            # and thus no corresponding metrics
                            for metric_name, metric_val in metrics.iteritems():
                                sdf = sdf.append({'qid': qid, 'subj': subj, 'nation': nation, 'lang': lang, 'metric': metric_name, 'val': float(metric_val)}, ignore_index=True)
                        except:
                            pass
        subj_df_dict[ethnosavefile] = sdf
    lens = map(lambda d: len(d), subj_df_dict.itervalues())
    print lens
    return subj_df_dict

subj_df_dict = load_ethnosaves()
subj_df = pd.concat(subj_df_dict)
assert(len(subj_df) == reduce(lambda a, b: a + b, map(lambda df: len(df), subj_df_dict.itervalues())))
subj_df = subj_df.convert_objects()
In [145]:
means_df = pd.DataFrame(columns=['subj', 'nation', 'lang', 'metric', 'means'])
for subj in ['geography', 'history', 'economy']:
    for metric in ['completeness', 'informativeness', 'numheadings', 'articlelength', 'referencerate']:
        for nation in ['usa', 'fra', 'cdi', 'uga']:
            for lang in ['en', 'fr', 'sw']:
                spec_df = subj_df[(subj_df.subj == subj) & (subj_df.nation == nation) & (subj_df.metric == metric) & (subj_df.lang == lang)]['val']
                mean = spec_df.mean()
                # crude NaN check: the string form of NaN doesn't start with a digit
                if (not str(mean)[0] in '0123456789'):
                    mean = 0.0
                # mask groups with 25 or fewer contributing articles (see caveat above)
                if len(spec_df) <= 25:
                    print len(spec_df), subj, metric, nation, lang
                    mean = 0.0
                means_df = means_df.append({'subj': subj, 'nation': nation, 'lang': lang, 'metric': metric, 'means': mean}, ignore_index=True)
means_df = means_df.convert_objects(convert_numeric=True)
In [131]:
def top_sections_ethnoset(ethnoset_filename):
    print ethnoset_filename
    def defaultint():
        return defaultdict(int)
    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)
    articles = json.load(open(ethnoset_filename, 'r'))
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems():
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        secs = section_headings(wikicode)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1
    section_df = pd.DataFrame(columns=['lang', 'secname', 'freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount / float(total_articles[lang])
            section_df = section_df.append({'lang': lang, 'secname': secname, 'freq': freq}, ignore_index=True)
    # section_df = section_df.convert_objects(convert_numeric=True)
    top_secs = section_df[section_df.freq > 0.1]
    sort_secs = top_secs.sort(columns='freq', ascending=False)
    return sort_secs
In [132]:
def top_templates_ethnoset(ethnoset_filename):
    def defaultint():
        return defaultdict(int)
    template_count = defaultdict(defaultint)
    sorted_templates = defaultdict(list)
    total_articles = defaultdict(int)
    articles = json.load(open(ethnoset_filename, 'r'))
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems():
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        temps = wikicode.filter_templates()
                        for temp in temps:
                            tempname = temp.name
                            tempname = tempname.strip().lower()
                            template_count[lang][tempname] += 1
    temp_df = pd.DataFrame(columns=['lang', 'tempname', 'freq'])
    for lang, temp_dict in template_count.iteritems():
        for tempname, tempcount in temp_dict.iteritems():
            freq = tempcount / float(total_articles[lang])
            temp_df = temp_df.append({'lang': lang, 'tempname': tempname, 'freq': freq}, ignore_index=True)
    top_templates = temp_df[temp_df.freq > 0.1]
    sort_temps = top_templates.sort(columns='freq', ascending=False)
    temps_dict = dict()
    for lang in template_count.iterkeys():
        try:
            temps_dict[lang] = sort_temps[sort_temps.lang == lang].iloc[:20].convert_objects(convert_numeric=True)
        except:
            temps_dict[lang] = sort_temps[sort_temps.lang == lang].convert_objects(convert_numeric=True)
    return temps_dict
In [384]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/' + name, ethnosaves)
sort_dfs = map(top_sections_ethnoset, filenames)
In [124]:
def make_cite_plot():
    citedf = pd.DataFrame(columns=['setname', 'cite', 'freq'])
    for i in range(len(filenames)):
        for lang, df in temp_dfs[i].iteritems():
            if lang == 'en':
                df = df[(df.tempname == 'cite web') | (df.tempname == 'cite book') | (df.tempname == 'cite news') | (df.tempname == 'cite journal')]
                setname = filenames[i][10:-16]
                tot = 0
                for row in df.iterrows():
                    cols = row[1]
                    tot += cols['freq']
                for row in df.iterrows():
                    cols = row[1]
                    citedf = citedf.append({'setname': setname, 'cite': cols['tempname'], 'freq': cols['freq'] / float(tot)}, ignore_index=True)
    cite_dict = {"cite book": 536258, "cite journal": 328129, "cite news": 444447, "cite web": 1560207}
    globaltot = reduce(lambda a, b: a + b, cite_dict.itervalues())
    globaltotfloat = float(globaltot)
    globciteratio = map(lambda cd: (cd[0], cd[1] / globaltotfloat), cite_dict.iteritems())
    for cite in globciteratio:
        citetype, freq = cite[0], cite[1]
        citedf = citedf.append({'setname': 'English WP Global', 'cite': citetype, 'freq': freq}, ignore_index=True)
    citedf = citedf.convert_objects(convert_numeric=True)
    citepiv = citedf.pivot(index='setname', columns='cite')
    citeplot = citepiv.plot(kind='bar', stacked=True)
    citeplot.legend(('Citation type', 'Cite book', 'Cite journal', 'Cite news', 'Cite web'), loc=9)
    citeplot.figure.set_size_inches(12, 8)
    citeplot.set_xlabel('subject-nation')
    citeplot.set_title('Composition of Citation Type, by Subject-Nation')
This is where we look at the template occurrences.
In [383]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/' + name, ethnosaves)
temp_dfs = map(top_templates_ethnoset, filenames)
for i in range(len(filenames)):
    for lang, df in temp_dfs[i].iteritems():
        pass  # inspection prints disabled:
        # print ''
        # print filenames[i]
        # print df
In [11]:
def coordinate_frequencies():
    def coord_templates_ethnoset(ethnosavefile, coord_df):
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
        total_articles = defaultdict(int)
        coord_articles = defaultdict(int)
        articles = json.load(open('ethnosave/' + ethnosavefile, 'r'))
        for qid, attribdict in articles.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'wikitext':
                    for lang, wikitext in langdict.iteritems():
                        if wikitext:
                            total_articles[lang] += 1
                            wikicode = pfh.parse(wikitext)
                            temps = wikicode.filter_templates()
                            for temp in temps:
                                tempname = temp.name
                                tempname = tempname.strip().lower()
                                if tempname == 'coord':
                                    coord_articles[lang] += 1
        for lang, coord_count in coord_articles.iteritems():
            freq = coord_count / float(total_articles[lang])
            coord_df = coord_df.append({'subj': subj, 'nation': nation, 'lang': lang, 'coord_count': coord_count, 'freq': freq}, ignore_index=True)
        return coord_df
    coord_df = pd.DataFrame(columns=['subj', 'nation', 'lang', 'coord_count', 'freq'])
    ethnosaves = !ls ethnosave
    for ethnosave in ethnosaves:
        coord_df = coord_templates_ethnoset(ethnosave, coord_df)
    coord_df_sort = coord_df.sort(columns='freq', ascending=False)
    return HTML(coord_df_sort.to_html(index=False, columns=['subj', 'nation', 'lang', 'coord_count', 'freq']))