Investigating Good Articles for Côte d'Ivoire and Uganda

For Kumusha Takes Wiki - by Max Klein
10 January 2014

Our principal question in this investigation is what makes a good article about a country, in the hope of improving articles related to Côte d'Ivoire and Uganda.

We will explore several aspects of the quality of the articles and their relationship to different language Wikipedias. First we look at the articles literally describing the countries of Ivory Coast and Uganda and compare them to other country articles in the English, French and Swahili Wikipedias. This is done both to sanity check our method and to determine which subsequent articles to explore. Here we utilise the research Tell Me More: An Actionable Quality Model for Wikipedia by Warncke-Wang, Cosley, and Riedl, which is meant to measure points of quality that can be more easily addressed. Secondly, we move to sets of articles about a subject of each nation. These sets are chosen as most important by how often their subject occurs in a specific Wikipedia. Next we attempt to apply the lessons from Predicting Quality Flaws in User-generated Content: The Case of Wikipedia by Anderka, Stein, and Lipka, which is about finding user-specified quality flaws.

This report is an IPython Notebook, so it contains all the code necessary to reproduce these results. Much of the code is at the end, to maintain non-technical readability.

Getting the Wikidata item for every country

We want to compare literal country articles worldwide. Since we will be analysing the English, French, and Swahili Wikipedias, we find the Wikidata items related to these countries (here denoted by their identifiers beginning with "Q"). Each Wikidata item points to all the different Wikipedia language versions available. In an attempt to avoid language bias, this list was first queried from the Wikidata property "instance of" with target "country" (through the offline Wikidata parser). However this only yielded 133 items. So, even though it is Anglocentric, our list is scraped from the English Wikipedia List of countries by population, yielding:


In [5]:
country_qids = json.load(open('country_qids.json'))
ugandaqid, cdiqid= 'Q1036', 'Q1008'
print "We are comparing: ", len(country_qids), 'countries".'
print "Some of their Wikidata QIDs are: ", map(lambda qid: 'http://wikidata.org/wiki/'+qid, country_qids[:5])


We are comparing:  242 countries.
Some of their Wikidata QIDs are:  [u'http://wikidata.org/wiki/Q16', u'http://wikidata.org/wiki/Q33788', u'http://wikidata.org/wiki/Q25230', u'http://wikidata.org/wiki/Q874', u'http://wikidata.org/wiki/Q37']

Let's inspect each metric

We look at the 5 statistics that are mentioned in the literature as "Actionable Metrics". The metrics are:

  1. Completeness - This counts the number of intra-wiki links to other articles.
  2. Informativeness - This is a combination of the ratio of readable text to wikitext markup, plus the number of images.
  3. Number of Headings - A count of the sections and subsections.
  4. Article Length - The length of the readable article in characters.
  5. Reference Rate - The number of references per unit of article length.

In Tell Me More weights are used, and we use the same weights, but for experimentation they could be changed.


In [140]:
def report_actionable_metrics(wikicode, completeness_weight=0.8, infonoise_weight=0.6, images_weight=0.3):
    completeness = completeness_weight * num_reg_links(wikicode)
    informativeness = (infonoise_weight * infonoise(wikicode) ) + (images_weight * num_file_links(wikicode) )
    numheadings = len(section_headings(wikicode))
    articlelength = readable_text_length(wikicode)
    referencerate = article_refs(wikicode) / readable_text_length(wikicode)
    
    return {'completeness': completeness, 'informativeness': informativeness, 'numheadings': numheadings, 'articlelength': articlelength, 'referencerate': referencerate}

We will now look at each metric one by one, on the set of our Wikidata Q-IDs to get a feel for how these metrics will be used later on.

For each metric, the positions of the Uganda and Côte d'Ivoire articles amongst the other countries are shown by their z-scores, which measure how many standard deviations they are above or below the average. Additionally we display a graph of the distribution of how the country articles score, separated by Wikipedia language.
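
For concreteness, here is a minimal sketch of the z-score calculation (the same formula is used inside metric_analyse_density in the code appendix); the example values are made up purely for illustration:

import pandas as pd

def zscore(column):
    # how many standard deviations each value sits above or below the column mean
    return (column - column.mean()) / column.std()

# hypothetical metric values for four country articles in one language
scores = pd.Series([10.0, 12.0, 30.0, 8.0])
print zscore(scores)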


In [499]:
metric_analyse_density('informativeness', 12)


Uganda (Q1036), Côte d'Ivoire (Q1008) informativeness z-scores.
Out[499]:
language en fr sw
Q1036 0.381577 -0.643210 -0.272597
Q1008 -0.431621 3.311523 -0.384729

Most notable here, Côte d'Ivoire has a very high score in French Wikipedia - 3.31 standard deviations above average.

In general these distributions are quite standard, with a relatively long tail, also known as positive skew, which we would expect from user-generated content. Also English and French outpace a spikier Swahili, which is in line with standard conceptions about the popularity of these languages.


In [500]:
metric_analyse_density('completeness', 1500)


Uganda (Q1036), Côte d'Ivoire (Q1008) completeness z-scores.
Out[500]:
language en fr sw
Q1036 -0.443975 -0.176491 1.736298
Q1008 -0.973200 1.767785 -0.505017

Reassuringly, Uganda performs highly here in Swahili, meaning that it is quite well interlinked with other articles.


In [501]:
metric_analyse_density('numheadings', 75)


Uganda (Q1036), Côte d'Ivoire (Q1008) numheadings z-scores.
Out[501]:
language en fr sw
Q1036 -1.695575 0.028298 0.023235
Q1008 0.660221 -0.411489 -0.427908

Here English appears to have an almost bimodal distribution; we welcome explanations of why that should be so.


In [505]:
metric_analyse_density('articlelength',150000)


Uganda (Q1036), Côte d'Ivoire (Q1008) articlelength z-scores.
Out[505]:
language en fr sw
Q1036 0.105196 -0.498248 0.190613
Q1008 0.335907 2.764950 -0.244492

Again Côte d'Ivoire scores very highly, 2.76, in French Wikipedia. It must be quite a long article.


In [503]:
metric_analyse_density('referencerate', 0.01)


Uganda (Q1036), Côte d'Ivoire (Q1008) referencerate z-scores.
Out[503]:
language en fr sw
Q1036 0.014790 -0.608334 3.129056
Q1008 -1.183966 1.663328 -0.265412

These results lend credibility to this technique, because the Wikipedias of the native languages of those countries show the highest quality. For the Uganda article, performance is slightly below average in English and French but strong in Swahili. For the Côte d'Ivoire article, the real stand-out is French Wikipedia, at times 3 standard deviations above average, but it is again below average in the other languages. So we can somewhat confidently apply this technique to other sets of articles to see what improvements could be made. Remember that a weakness in any one of the 5 metrics is an indicator of a specific type of task that could be done on those articles. Next we will inspect how to determine other sets of articles to which we can apply this method.

Highest Occurring Sections

Now we will find the highest occurring section names in country articles by language. The assumption we use here is that if a section name occurs frequently, it is an important subject in that language.
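
As a rough sketch of that counting (the full version, using the section_headings helper, is in the code appendix; this simplified variant uses mwparserfromhell's filter_headings instead):

import mwparserfromhell as pfh
from collections import Counter

def heading_counts(wikitexts):
    # count how often each heading name appears across a set of articles
    counts = Counter()
    for text in wikitexts:
        for heading in pfh.parse(text).filter_headings():
            counts[unicode(heading.title).strip()] += 1
    return counts

# toy usage with two hypothetical wikitexts
print heading_counts([u"== History ==\nsome text", u"== History ==\n== Economy ==\nmore text"])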


In [27]:
top_n_sections('fr',20)


242 total articles inspected in fr.
Out[27]:
lang secname freq
fr Liens externes 0.904959
fr Histoire 0.884298
fr Géographie 0.826446
fr Notes et références 0.809917
fr Économie 0.785124
fr Démographie 0.752066
fr Politique 0.685950
fr Voir aussi 0.685950
fr Articles connexes 0.648760
fr Codes 0.644628
fr Culture 0.640496
fr Bibliographie 0.491736
fr Climat 0.301653
fr Subdivisions 0.301653
fr Langues 0.289256
fr Sport 0.243802
fr Éducation 0.239669
fr Religions 0.231405
fr Religion 0.219008
fr Santé 0.198347

In [28]:
top_n_sections('en',20)


242 total articles inspected in en.
Out[28]:
lang secname freq
en History 0.954545
en Economy 0.888430
en Demographics 0.876033
en External links 0.867769
en See also 0.842975
en Geography 0.822314
en References 0.805785
en Culture 0.797521
en Religion 0.657025
en Education 0.648760
en Climate 0.545455
en Military 0.462810
en Administrative divisions 0.438017
en Politics 0.438017
en Sports 0.396694
en Foreign relations 0.384298
en Languages 0.367769
en Cuisine 0.363636
en Health 0.359504
en Further reading 0.342975

In [29]:
top_n_sections('sw',7)


233 total articles inspected in sw.
Out[29]:
lang secname freq
sw Historia 0.412017
sw Jiografia 0.334764
sw {{Infobox Country 0.334764
sw Viungo vya nje 0.330472
sw Tazama pia 0.184549
sw {{Infobox country 0.175966
sw Viungo vya Nje 0.154506

Looking at the most frequent sections in each, what patterns emerge? First let us acknowledge that some of the sections here are not content - like "Notes" and "References". The lead section - the section before any named sections - usually starts with code to include the country infobox in Swahili, which is where that cryptic "{{Infobox Country" result comes from. Ignoring non-content sections though, there is a clear winner in all three languages - History.

In all three languages History is the one section you can most rely on being there, occurring in 88% of French, 95% of English and 41% of Swahili country articles. In Swahili and French the second most popular section is Geography, but in English it's Economy. So this informs which Wikipedia categories to apply our metrics to next: the History, Economy, and Geography categories.

Subject and Nation Analysis

Given that we now have the top sections for each language, we devise a method to look at representative articles. We find a category in each Wikipedia for each subject in our highest occurring section headings, and for each major nation associated with our languages. We create a script to present every article in every category and all of its recursive subcategories to a human judge (a sketch of this crawl appears after the listing below). The full list of categories used for the second part of this analysis is:


In [266]:
categoryjson = json.load(open('ethnosets-categories-capitalized.json','r'))
for subject, nationdict in categoryjson.iteritems():
    print subject
    for nation, langdict in nationdict.iteritems():
        print " |---"  + nation
        for lang in ['fr','en','sw']:
            try:
                print"    |---" + langdict[lang]
            except:
                pass


Geography
 |---Côte d'Ivoire
    |---Catégorie:Géographie_de_la_Côte_d'Ivoire
    |---Geography_of_Ivory_Coast
    |---Jamii:Jiografia_ya_Cote_d'Ivoire
 |---USA
    |---Catégorie:Géographie_des_États-Unis
    |---Category:Geography_of_the_United_States
    |---Jamii:Jiografia_ya_Marekani
 |---Uganda
    |---Catégorie:Géographie_de_l'Ouganda
    |---Category:Geography_of_Uganda
    |---Jamii:Jiografia_ya_Uganda
 |---France
    |---Catégorie:Géographie_de_la_France
    |---Category:Geography_of_France
History
 |---Côte d'Ivoire
    |---Catégorie:Histoire_de_la_Côte_d'Ivoire
    |---Category:History_of_Ivory_Coast
 |---Uganda
    |---Catégorie:Histoire_de_l'Ouganda
    |---Category:History_of_Uganda
    |---Jamii:Historia_ya_Uganda
 |---USA
    |---Catégorie:Histoire_des_États-Unis
    |---Category:History_of_the_United_States
    |---Jamii:Historia_ya_Marekani
 |---France
    |---Catégorie:Histoire_de_France
    |---Category:History_of_France
    |---Jamii:Historia_ya_Ufaransa
Economy
 |---Côte d'Ivoire
    |---Catégorie:Économie_de_la_Côte_d'Ivoire
    |---Category:Economy_of_Ivory_Coast
    |---Jamii:Uchumi_wa_Cote_d'Ivoire
 |---USA
    |---Catégorie:Économie_des_États-Unis
    |---Category:Economy_of_the_United_States
    |---Jamii:Uchumi_wa_Marekani
 |---Uganda
    |---Catégorie:Économie_de_l'Ouganda
    |---Category:Economy_of_Uganda
    |---Jamii:Uchumi_wa_Uganda
 |---France
    |---Catégorie:Économie_de_la_France
    |---Category:Economy_of_France
    |---Jamii:Uchumi wa Ufaransa
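
Every article in each of these categories and their recursive subcategories was crawled. A minimal sketch of how such a walk can be set up with pywikibot (the category name and the recursion depth below are illustrative assumptions, not the exact values used):

import pywikibot
from pywikibot import pagegenerators

# walk one category and its subcategories, yielding candidate titles for a human judge
site = pywikibot.Site('en', 'wikipedia')
cat = pywikibot.Category(site, 'Category:History_of_Uganda')  # example category

# recurse=2 limits the subcategory depth
for page in pagegenerators.CategorizedPageGenerator(cat, recurse=2):
    print page.title()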

Looking at every title of every article in every subcategory of these categories, a human (your author) accepted or rejected each article. The purpose of this filtering process was to remove noise, such as articles about economists out of Economy, or high schools out of Geography, while still being able to select articles from many diverse categories. In the end we accepted a total of:


In [271]:
ethnosets = !ls ethnosets/
reduce(lambda a, b: a+b, map(lambda l: len(l), map(lambda f: json.load(open('ethnosets/'+f)), ethnosets)))


Out[271]:
24630

24,630 articles, which were fetched as live data off Wikipedia. We fetched live data so that we could compare the results after any editing efforts. We used Wikimedia Labs for the data pull script, if you are interested. Then for each of those articles we ask Wikidata to give us the page, if it exists, in all our desired languages, and on all those pages we compute the Actionable Metrics described earlier.
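
A minimal sketch of the per-item sitelink lookup (the full pipeline is do_sitelinks and friends in the code appendix):

import pywikibot

site = pywikibot.Site('en', 'wikipedia')
repo = site.data_repository()

item = pywikibot.ItemPage(repo, 'Q1036')  # Uganda
wditem = item.get()
for lang in ['en', 'fr', 'sw']:
    try:
        # sitelinks are keyed by database name, e.g. 'enwiki', 'frwiki', 'swwiki'
        print lang, wditem['sitelinks'][lang + 'wiki']
    except KeyError:
        print lang, '(no article in this language)'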

If you are keeping track that means we have four active variables:

  1. the Nation,
  2. the Wikipedia language,
  3. the subject, and
  4. the metric type.

Below is a matrix of heatmaps displaying the mean - not the full distribution as we previously graphed - of the specified metrics versus the specified subjects. In the inner dimensions we graph across all pages in a specified language versus a specified nation. That is, if you look at any individual heatmap you can see it compares Wikipedia languages against nations. Looking down the rows of the greater matrix the metric type varies, and looking across the columns of the greater matrix the subject category varies.
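
Each inner heatmap is just a language-by-nation pivot of mean metric values. A minimal sketch, using the same (older) pandas pivot_table keywords as make_heat_map in the code appendix, with made-up numbers:

import pandas as pd

# hypothetical mean metric values for one metric and one subject
means = pd.DataFrame({'nation': ['uga', 'uga', 'cdi', 'cdi'],
                      'lang':   ['en',  'sw',  'en',  'fr'],
                      'means':  [3.0,   1.0,   2.5,   4.0]})

# rows are Wikipedia languages, columns are nations, cells are the means to be coloured
print pd.pivot_table(means, values='means', rows='lang', cols='nation')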


In [146]:
make_heat_map()


Analysis

This heatmap can act as a guide to where one could find good examples of articles. Say, for instance, you wanted to improve History articles about Uganda. First look at the vertical History column and, within it, the vertical Uganda column. Then, moving your eyes up and down, you can see that while English Wikipedia has the best reference rate, French Wikipedia has a slightly higher code & images score, which would be differently useful depending on how you wanted to improve these articles.

For Economy, we can see that the clearest horizontal band across each metric is English. This is a promising result because we knew that English, unlike French or Swahili, placed more of an emphasis on Economy in its country articles. However if we take a by-nation approach we see that the bluest square is almost always in the France column, not in the USA column. Yet wouldn't we expect that if English values Economy, then Economy of the USA articles would perform most strongly? One explanation is that there are so many English USA company articles that the average quality is in fact brought down. Some data that would give credence to this idea is that English articles about France's economy outperform in the code and images metric. Bot operations often leave a lot of templated code on many relatively obscure articles, and this could be a trace of that. For Economy articles then, it would be recommended to look at French articles about France for written quality, and English articles for templating ideas.

Looking in the History column, in each of the metric rows there is a strong horizontal band of blue in the French sub-row. This indicates that French history articles are good across most nations. The exception is the bottom two metrics, article length and references per article length, where English seems to outperform, particularly for Uganda and the USA. Swahili here performs better for the USA than for Uganda. In fact in the Swahili band many data points do not register because there are 25 or fewer articles in question (see the caveat below). However, within the Uganda column English and French do register, and we can interpret these as translation opportunities to get more Swahili content. As for Côte d'Ivoire, we also find that in all but referencing French does better than English. This would suggest that for Côte d'Ivoire's history they unfortunately have no better direct examples to follow (but they do have French history to look up to).

Moving to Geography, a notable pattern is the domination of French Wikipedia about France in - again - all but the referencing metric. We also encounter some movement from Swahili Wikipedia for the first time. You can see two fainter, but still noticeable, vertical strips for Côte d'Ivoire and Uganda across all Wikipedia languages. So this means that Geography is the relative strong point for these African countries. To improve their relative content coverage, Swahili editors would do better to focus on Economy and History if they wanted to work on the more-needed areas of their Wikipedia. And within all the things they could do, we see the bluest points in code and images, which is likely from bot-created Africa place articles. The areas where those articles most need improvement are section count and article length, for which English and French articles about France, and English articles about the USA, could best serve as a guide.

Caveats

Since we are using the mean, if there are not a lot of pages in a language for a category and one of those pages is particularly impressive, then the average can appear high without that language being a useful information source. This occurred, for instance, for Swahili Economy of France in Wikilinks. To reduce this noise we set the minimum number of articles that must contribute to a heatmap cell to 25, so a white square will appear if there were fewer than 25 articles. All subject-nations not meeting this threshold can be found in the code appendix. In fact, these are the total counts of how many articles are being considered in each group.


In [52]:
display_article_counts()


Out[52]:
subj nation lang count
geography usa en 7376
geography fra fr 4201
economy usa en 3303
history fra fr 3135
economy fra fr 1962
history usa en 1756
history fra en 1322
geography cdi fr 1249
geography cdi en 1139
geography usa fr 1135
geography fra en 1103
economy fra en 971
geography uga en 602
history usa fr 473
economy usa fr 397
history uga en 320
geography usa sw 288
geography uga fr 219
geography uga sw 183
history cdi en 115
geography cdi sw 111
history uga fr 110
economy cdi fr 107
history cdi fr 84
economy uga en 78
history usa sw 39
economy cdi en 39
economy uga sw 11
economy usa sw 10
economy uga fr 10
history fra sw 8
geography fra sw 6
economy fra sw 5
history uga sw 3
economy cdi sw 1

Top Templates by Subject and Nation

The previous approach of looking at the top section headings was not very useful here, because within a broad category like "Economy" or "Geography" there are too many sub-subjects with different expected flavours of section headings. The code appendix shows the results of this experimentation. Instead we look at the top templates, to see if specific quality flaws could be identified using the techniques of Stein's Quality Flaws paper. Regrettably that study only discusses English Wikipedia, so here we would need a French or Kiswahili Wikipedian to identify the top clean-up tags in those Wikipedias. While "Citation needed" does appear for some categories, neither "Citation nécessaire" nor "Ukweli" crops up, which indicates that a direct translation may not be possible.

In fact, none of the most common clean-up tags makes an appearance in our top 20 for each subject-nation file, except "Citation needed". These subject-nation combinations have the highest incidence of "Citation needed" (the per-article frequency is sketched after the table):

Subject-nation| Frequency of "citation needed" (per article)

history-usa   |  0.501708
economy-fra   |  0.282183
economy-usa   |  0.270058
history-fra   |  0.226929
history-cdi   |  0.173913
geography-usa |  0.154013
economy-cdi   |  0.128205
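
The per-article frequency above is simply the number of template occurrences divided by the number of articles in the set. A minimal sketch (the full version is top_templates_ethnoset in the code appendix):

import mwparserfromhell as pfh

def template_rate(wikitexts, target='citation needed'):
    # occurrences of one template per article, over a set of wikitexts
    occurrences = 0
    for text in wikitexts:
        for temp in pfh.parse(text).filter_templates():
            if unicode(temp.name).strip().lower() == target:
                occurrences += 1
    return occurrences / float(len(wikitexts))

# toy usage with hypothetical wikitext
print template_rate([u"A claim.{{citation needed}}", u"A sourced claim.<ref>a book</ref>"])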

This probably highlights a lack of tagging with clean-up templates more than anything. It seems counter-intuitive that the more highly scoring subject-nations would also need more citing than those which scored lower in all other areas. Our sample of Wikipedia articles here is either too small for, or otherwise not quite the right application of, this clean-up tag analysis.

Still, looking over the data one can see that other citation templates feature highly. Therefore we can investigate the relative types of citations used. We do this only in English Wikipedia, for lack of understanding of French and Swahili citation philosophies.

First we retrieve the statistics for all occurrences of English Wikipedia citation templates from Wikimedia Labs, and then convert them into proportions.

MariaDB [enwiki_p]> select tl_title, count(*) from templatelinks where tl_title like 'Cite_web' or tl_title like 'Cite_news' or tl_title like 'Cite_journal' or tl_title like 'Cite_book' group by tl_title;
+--------------+----------+
| tl_title     | count(*) |
+--------------+----------+
| Cite_book    |   536258 |
| Cite_journal |   328129 |
| Cite_news    |   444447 |
| Cite_web     |  1560207 |
+--------------+----------+
4 rows in set (11 min 36.68 sec)

This global statistic serves as a benchmark for the compositions we can infer from our top-template work, if we consider all the occurrences of Cite book, Cite journal, Cite news, and Cite web in our subject-nation files. See below for the graphical representation.
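
For reference, converting the global counts above into proportions is straightforward; a quick sketch:

# global English Wikipedia citation-template counts from the query above
cite_counts = {'Cite_book': 536258, 'Cite_journal': 328129, 'Cite_news': 444447, 'Cite_web': 1560207}
total = float(sum(cite_counts.values()))
for name in sorted(cite_counts):
    print name, round(cite_counts[name] / total, 3)
# roughly 0.19 book, 0.11 journal, 0.15 news, 0.54 web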


In [125]:
make_cite_plot()


Immediately, Geography seems to be less diversely cited for all nations. We can also see signs that Economy on the whole has more news citations, and that History on average utilises more book citations. However the more major fact seems to be that web citations almost always make up more than half. Notably, in two of the three cases where they do not - History of Côte d'Ivoire and History of Uganda - news citations dominate instead.

All of this should be quite reassuring. A main disadvantage facing African content, previously postulated, was the lack of citable sources. That problem is exacerbated if we consider only printed historical works as useful for expanding content. However, since across the board citations are not heavily reliant on books or journals - even in the encyclopedia's strong suits like the Economy of the USA - citing should be less of an impediment to Kumusha Takes Wiki editors.

Coordinate Frequencies

As a curiosity, we investigate the frequency of occurrence of the geo-coordinate template in our subject-nation sets.


In [12]:
coordinate_frequencies()


Out[12]:
subj nation lang coord_count freq
geography usa en 6859 0.9299078
geography uga en 550 0.9136213
geography fra en 781 0.7080689
economy uga en 42 0.5384615
geography cdi en 490 0.4302019
geography cdi fr 428 0.3426741
geography usa fr 360 0.3171806
history usa en 510 0.2904328
geography fra fr 792 0.1885265
geography fra sw 1 0.1666667
economy cdi en 6 0.1538462
history fra en 202 0.1527988
economy usa fr 43 0.1083123
geography uga fr 23 0.1050228
economy usa en 345 0.1044505
economy fra en 91 0.09371782
history usa fr 42 0.08879493
history uga en 16 0.05
geography uga sw 6 0.03278689
economy fra fr 62 0.03160041
history fra fr 94 0.02998405
economy cdi fr 2 0.01869159
history cdi en 1 0.008695652
geography usa sw 1 0.003472222

There are some sets that appear to be almost entirely tagged with the coordinate template. In English, the nations with the closest cultural ties have higher frequencies of coordinate tagging, which is what those familiar with English Wikipedia would expect. But French Wikipedia does not seem to follow the same trend. The percentage of coordinate-tagged Geography of France articles in French (18%) seems low compared to Geography of the USA in English (93%). This seems especially odd as both Côte d'Ivoire and the USA have higher coordinate template frequencies than France herself. Unlike in English Wikipedia, the coordinate tagging appears inversely proportional to cultural ties - a proxy for domain expertise. This could be related to French Wikipedia categorization techniques. Recall we are looking at all the subcategories of "[insert subject] of [insert nation]", so it could be that there are types of Geography articles in French Wikipedia that are not tied to a coordinate. Those would be the sort of "non-obvious", non-coordinate articles, and they would be more likely to be written about France in the French Wikipedia. This would be supported by the fact that the total number of coordinate-tagged articles about France is almost the same in our sets in both English and French, but the French set is simply larger, containing more non-coordinate-tagged articles.

Conclusions

Our main task was to understand what makes a good article with respect to a nation. Despite the question's subjectivity, there are facts we have shown about the current state of Wikipedia, and if you consider any of them "useful" then you can try to emulate them. Our main finding is that when it comes to encyclopedically describing a nation, the English, French and Swahili Wikipedias all most often write about the nation's history. French and Swahili then consider Geography the next most important, while English favours Economy. We have created a graphical guide involving our 3 languages, 3 subjects, 4 nations, and 5 metrics. One can consult this guide to know where work needs to be done, and which areas of our Wikipedias to look to as an example. Lastly we determined that we do not have much information about where users have left requests for improvement on the countries in question, but if English Wikipedia is a model, then web citations are usually sufficient.

Further Directions

The way in which the swaths of articles were chosen for the subject-nation sets could be unsatisfactory for several reasons. Firstly, the category system may not accurately represent a Wikipedia's available items on a subject. Secondly, since the process involved a human judge, error is certainly introduced. A better way of determining these subject-nation sets would be useful. Also, no by-hand, human-reading investigation was done on those sets; instead we opted for algorithmic methods. If a sound methodology for the human analysis of pages is available, it would be a good technique to compare to the algorithmic ones presented here.

Start of Supporting Code

Code is also available via git on GitHub: https://github.com/notconfusing/kumusha_takes_wiki


In [1]:
#Infonoise metric of Stvilia (2005) in concept, although the implementation may differ since we are not stopping and stemming words, because of the multiple languages we need to handle

def readable_text_length(wikicode):
    #could also use wikicode.filter_text()
    return float(len(wikicode.strip_code()))

def infonoise(wikicode):
    wikicode.strip_code()
    ratio = readable_text_length(wikicode) / float(len(wikicode))
    return ratio

#Helper function to mine for section headings, of course if there is a lead it doesn't quite make sense.

def section_headings(wikicode):
    sections = wikicode.get_sections()
    sec_headings = map( lambda s: filter( lambda l: l != '=', s), map(lambda a: a.split(sep='\n', maxsplit=1)[0], sections))
    return sec_headings

#i don't know why mwparserfromhell's .filter_tags() isn't working at the moment. going to hack it for now
import re
def num_refs(wikicode):
    text = str(wikicode)
    reftags = re.findall('<(\ )*?ref', text)
    return len(reftags)

def article_refs(wikicode):
    sections = wikicode.get_sections()
    return float(reduce( lambda a,b: a+b ,map(num_refs, sections)))

#Predicate for links and files in English French and Swahili

def link_a_file(linkstr):
    fnames = [u'File:', u'Fichier:', u'Image:', u'Picha:']
    bracknames = map(lambda a: '[[' + a, fnames)
    return any(map(lambda b: linkstr.startswith(b), bracknames))

def link_a_cat(linkstr):
    cnames =[u'Category:', u'Catégorie:', u'Jamii:']
    bracknames = map(lambda a: '[[' + a, cnames)
    return any(map(lambda b: linkstr.startswith(b), bracknames))

def num_reg_links(wikicode):
    reg_links = filter(lambda a: not link_a_file(a) and not link_a_cat(a), wikicode.filter_wikilinks())
    return float(len(reg_links))

def num_file_links(wikicode):
    file_links = filter(lambda a: link_a_file(a), wikicode.filter_wikilinks())
    return float(len(file_links))

In [1]:
import pywikibot
import mwparserfromhell as pfh
import os
import datetime
import pandas as pd
import json
from collections import defaultdict
from ggplot import *
import operator
from IPython.display import HTML

%pylab inline


langs = ['en','fr','sw']
nations = ['usa', 'fra', 'cdi', 'uga']

wikipedias = {lang: pywikibot.Site(lang, 'wikipedia') for lang in langs}
wikidata = wikipedias['fr'].data_repository()


VERBOSE:pywiki:Starting 1 threads...
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['xlim', 'mpl', 'colors', 'ylim']
`%pylab --no-import-all` prevents importing * from pylab and numpy

This will be our data structure, a dict of dicts, until we have some numbers to put into pandas.


In [143]:
def enfrsw():
    return {lang: None for lang in langs}
def article_attributes():
    return {attrib: enfrsw() for attrib in ['sitelinks', 'wikitext', 'wikicode', 'metrics']}

def do_sitelinks(langs, qids, data):
    for qid in qids:
        page = pywikibot.ItemPage(wikidata, qid)
        wditem = page.get()
        for lang in langs:
            try:
                data[qid]['sitelinks'][lang] = wditem['sitelinks'][lang+'wiki']
            except KeyError:
                pass
    return data

Functions to get the page texts in all our desired languages


In [494]:
def get_wikitext(lang, title):
    page = pywikibot.Page(wikipedias[lang],title)
    def get_page(page):
        try:
            pagetext = page.get()
            return pagetext
        except pywikibot.exceptions.IsRedirectPage:
            redir = page.getRedirectTarget()
            # follow the redirect and return its text
            return get_page(redir)
        except pywikibot.exceptions.NoPage:
            print 're-raising'
            raise
    return get_page(page)

def do_wikitext(langs, data):
    for qid, attribs in data.iteritems():
        for lang, sl in attribs['sitelinks'].iteritems():
            if sl:
                try:
                    if randint(0,100) == 99:
                        print sl
                    data[qid]['wikitext'][lang] = get_wikitext(lang, sl)
                except:
                    print 'bad sitelink', sl
                    continue
    return data

def do_wikicode(langs, data):
    for qid, attribs in data.iteritems():
        for lang, pagetext in attribs['wikitext'].iteritems():
            if pagetext:
                data[qid]['wikicode'][lang] = pfh.parse(pagetext)
    return data

def do_metrics(data):
    for qid, attribs in data.iteritems():
        for lang, wikicode in attribs['wikicode'].iteritems():
            if wikicode:
                data[qid]['metrics'][lang] = report_actionable_metrics(wikicode)
    return data

# this will take a lot of network time since we are going to load about 300 pages, but we'll save the data off so we don't have to do it unnecessarily

def make_data(langs, qids, savename):
    print 'getting these qids: ', qids 
    data = defaultdict(article_attributes)
    print 'getting sitelinks'
    data = do_sitelinks(langs, qids, data)
    print 'getting wikitext'
    data = do_wikitext(langs, data)
    print 'converting to wikicode'
    data = do_wikicode(langs, data)
    print 'computing metrics'
    data = do_metrics(data)

        
    hashable_data = {qid: 
                        {'wikitext': attribdict['wikitext'],
                         'metrics': attribdict['metrics'],
                         'sitelinks': attribdict['sitelinks']}
                             for qid, attribdict in data.iteritems()}
    print 'saving now'
    #save the results
    safefilename = savename+str(datetime.datetime.now())+'.json'
    with open(safefilename,'w') as f3:
        json.dump(hashable_data,f3)
    with open(savename+'latest.json','w') as f4:
        json.dump(hashable_data, f4)
    return data

#i don't call this unless i have time to uncomment it
#arts = make_data(langs, country_qids, 'countrydata')

#time to get into pandas, let's throw everything into a data frame

df = pd.DataFrame(columns=['Country','language','metric','val'])

arts = json.load(open('countrydata-latest.json','r'))

for qid, attribdict in arts.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'metrics':
            for lang, metrics in langdict.iteritems(): 
                try:
                #sometimes there wasn't an article in that language and thus no corresponding len
                    for metric_name, metric_val in metrics.iteritems():
                        df = df.append({'Country': qid, 'language':lang, 'metric':metric_name, 'val':float(metric_val)}, ignore_index=True)
                except:
                    pass
df = df.convert_objects(convert_numeric=True)

metric_list = ['completeness','informativeness','numheadings','articlelength','referencerate']

langs_df_dict = {lang: df[df['language'] == lang] for lang in langs}
metric_df_dict = {metric: df[df['metric'] == metric] for metric in metric_list}

#for later calculation
uganda_zscores = defaultdict(list)
cdi_zscores = defaultdict(list)

In [495]:
def metric_analyse_density(ametric, xlimit):
    inf_df = metric_df_dict[ametric]
    
    zscore = lambda x: (x - x.mean()) / x.std()
    
    inf_piv = inf_df.pivot(index='Country', columns='language', values='val')
    
    inf_piv_z = inf_piv.apply(zscore)
    metric_analyse_density_plot(ametric, xlimit, inf_df)
    print 'Uganda ('+ugandaqid+"), Côte d'Ivoire ("+cdiqid+") " +ametric+ " z-scores." 
    return inf_piv_z.ix[[ugandaqid,cdiqid]]

In [496]:
def metric_analyse_density_plot(ametric, xlimit, inf_df):
    p = ggplot(aes(x='val', colour='language', fill=True, alpha = 0.3), data=inf_df) + geom_density() + labs("score", "frequency") + \
    scale_x_continuous(limits=(0,xlimit)) + ggtitle(ametric + '\nall country articles\n                                                                        ')
    p.rcParams["figure.figsize"] = "4, 3"
    p.draw()

In [18]:
def defaultint():
    return defaultdict(int)

section_count = defaultdict(defaultint)
sorted_secs = defaultdict(list)
total_articles = defaultdict(int)

articles = json.load(open('countrydata-latest.json','r'))

for qid, attribdict in articles.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'wikitext':
            for lang, wikitext in langdict.iteritems(): 
                if wikitext:
                    total_articles[lang] += 1
                    wikicode = pfh.parse(wikitext)
                    secs = section_headings(wikicode)
                    for sec in secs:
                        sec = sec.strip()
                        section_count[lang][sec] += 1
                        
section_df = pd.DataFrame(columns=['lang','secname','freq'])
for lang, sec_dict in section_count.iteritems():
    for secname, seccount in sec_dict.iteritems():
        freq = seccount/float(total_articles[lang])
        section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)  
        
#section_df = section_df.convert_objects(convert_numeric=True)
section_df.head()
top_secs = section_df[section_df.freq > 0.1]

sort_secs= top_secs.sort(columns='freq', ascending=False)

In [26]:
def top_n_sections(lang,n):
    top = sort_secs[sort_secs.lang==lang].iloc[:n].convert_objects(convert_numeric=True)
    print str(total_articles[lang]) + ' total articles inspected in ' + lang + '.' 
    return HTML(top.to_html(index=False, columns=['lang','secname','freq']))

In [9]:
def top_sections_ethnoset(ethnoset_filename):
    
    def defaultint():
        return defaultdict(int)

    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)
    
    articles = json.load(open(ethnoset_filename,'r'))
    
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems(): 
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        secs = section_headings(wikicode)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1
                        
    section_df = pd.DataFrame(columns=['lang','secname','freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount/float(total_articles[lang])
            section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)
            
    #section_df = section_df.convert_objects(convert_numeric=True)
    section_df.head()
    top_secs = section_df[section_df.freq > 0.1]
    
    sort_secs= top_secs.sort(columns='freq', ascending=False)

    return sort_secs

In [31]:
!ls ethnosave/


economy-cdi.jsonlatest.json    geography-uga.jsonlatest.json
economy-fra.jsonlatest.json    geography-usa.jsonlatest.json
economy-uga.jsonlatest.json    history-cdi.jsonlatest.json
economy-usa.jsonlatest.json    history-fra.jsonlatest.json
geography-cdi.jsonlatest.json  history-uga.jsonlatest.json
geography-fra.jsonlatest.json  history-usa.jsonlatest.json

In [50]:
def display_article_counts():
    filenames = !ls ethnosave
    
    art_counts = pd.DataFrame(columns=['subj','nation', 'lang','count'])
    for filename in filenames:
        spl = filename.split('-')
        subj, nation = spl[0], spl[1].split('.')[0]
        
        fileaddr = 'ethnosave/' + filename
        articles = json.load(open(fileaddr,'r'))
        total_articles = defaultdict(int)
        for qid, attribdict in articles.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'wikitext':
                    for lang, wikitext in langdict.iteritems(): 
                        if wikitext:
                            total_articles[lang] += 1
        for lang, count in total_articles.iteritems():
            art_counts = art_counts.append({'subj': subj, 'nation': nation, 'lang': lang, 'count': count} ,ignore_index = True)
    return HTML(art_counts.sort(columns='count', ascending=False).to_html(index=False))

In [137]:
def make_heat_map():
    

    subj_list = ['economy','history','geography']
    metric_list = ['completeness','informativeness','numheadings','articlelength','referencerate']
    #pivtables = {metric: {subj: None for subj in subj_list} for metric in metric_list}
    
    fig, axes = plt.subplots(nrows = len(metric_list), ncols = len(subj_list), sharex='col', sharey='row' )
    '''
    for metric, subjdict in pivtables.iteritems():
        for subj, pivtab in subjdict.iteritems():
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            pivtables[metric][subj] = natlangpiv
            
    '''
    
    for axarr, metric in zip(axes, metric_list):
        for ax, subj in zip(axarr, subj_list):
            
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            heatmap = ax.pcolor(natlangpiv, cmap='Blues')
            ax.set_yticks(np.arange(0.5, len(natlangpiv.index), 1))
            ax.set_yticklabels(natlangpiv.index)
            ax.set_xticks(np.arange(0.5, len(natlangpiv.columns), 1))
            ax.set_xticklabels(natlangpiv.columns)
            cbar = plt.colorbar(mappable=heatmap, ax=ax)
        
    fig.suptitle('Heatmap of Actionable Metrics by Country versus Wikipedia Language, \n by Subject Category', fontsize=18)
    fig.set_size_inches(12,12,dpi=600)
    #fig.tight_layout()
    
    subj_titles = ['Economy','History','Geography']
    metric_titles =['Wikilinks','Code & Images to Text Ratio','Section Count','Article Length', 'References per Article Length'] 
    for i in range(len(subj_titles)):
        axes[0][i].set_title(subj_titles[i])
    for j in range(len(metric_titles)):
        axes[j][0].set_ylabel(metric_titles[j])

In [82]:
means_df[(means_df.metric == 'referencerate') & (means_df.subj == 'geography')]


Out[82]:
subj nation lang metric means
48 geography usa en referencerate 0.003102
49 geography usa fr referencerate 0.001050
50 geography usa sw referencerate 0.000012
51 geography fra en referencerate 0.000573
52 geography fra fr referencerate 0.001929
53 geography fra sw referencerate 0.000000
54 geography cdi en referencerate 0.004123
55 geography cdi fr referencerate 0.002138
56 geography cdi sw referencerate 0.006302
57 geography uga en referencerate 0.001590
58 geography uga fr referencerate 0.000560
59 geography uga sw referencerate 0.000114

In [135]:
def load_ethnosaves():
    ethnosaves = !ls ethnosave
    
    subj_df_dict = {subj: pd.DataFrame(columns=['qid','subj','nation','lang','metric','val']) for subj in ethnosaves}
    
    for ethnosavefile in ethnosaves:
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
        arts = json.load(open('ethnosave/'+ethnosavefile,'r'))
        print subj, nation
        sdf = subj_df_dict[ethnosavefile]
        for qid, attribdict in arts.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'metrics':
                    for lang, metrics in langdict.iteritems(): 
                        try:
                        #sometimes there wasn't an article in that language and thus no corresponding len
                            for metric_name, metric_val in metrics.iteritems():
                                sdf = sdf.append({'qid': qid, 'subj':subj, 'nation':nation, 'lang':lang, 'metric':metric_name, 'val':float(metric_val)}, ignore_index=True)
                        except:
                            pass
        subj_df_dict[ethnosavefile] = sdf
        lens = map(lambda d: len(d), subj_df_dict.itervalues())
        print lens
    return subj_df_dict

subj_df_dict = load_ethnosaves()
subj_df = pd.concat(subj_df_dict)
assert(len(subj_df) == reduce(lambda a, b: a+b, map(lambda df: len(df), subj_df_dict.itervalues())))

subj_df = subj_df.convert_objects()


economy cdi
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 735]
economy fra
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14690, 735]
economy uga
[0, 0, 0, 495, 0, 0, 0, 0, 0, 0, 14690, 735]
economy usa
[0, 0, 0, 495, 0, 0, 0, 0, 0, 18550, 14690, 735]
geography cdi
[12495, 0, 0, 495, 0, 0, 0, 0, 0, 18550, 14690, 735]
geography fra
[12495, 0, 0, 495, 0, 26550, 0, 0, 0, 18550, 14690, 735]
geography uga
[12495, 0, 0, 495, 0, 26550, 0, 0, 5020, 18550, 14690, 735]
geography usa
[12495, 0, 43992, 495, 0, 26550, 0, 0, 5020, 18550, 14690, 735]
history cdi
[12495, 0, 43992, 495, 0, 26550, 995, 0, 5020, 18550, 14690, 735]
history fra
[12495, 0, 43992, 495, 0, 26550, 995, 22325, 5020, 18550, 14690, 735]
history uga
[12495, 0, 43992, 495, 2165, 26550, 995, 22325, 5020, 18550, 14690, 735]
history usa
[12495, 11340, 43992, 495, 2165, 26550, 995, 22325, 5020, 18550, 14690, 735]

In [145]:
means_df = pd.DataFrame(columns=['subj','nation','lang','metric','means'])

for subj in ['geography','history','economy']:
    for metric in ['completeness','informativeness','numheadings','articlelength','referencerate']:
        for nation in ['usa','fra','cdi','uga']:
            for lang in ['en','fr','sw']:
                spec_df = subj_df[(subj_df.subj == subj) & (subj_df.nation == nation) & (subj_df.metric == metric) & (subj_df.lang == lang)]['val']
                mean = spec_df.mean()
                if (not str(mean)[0] in '0123456789'):
                    mean = 0.0
                if len(spec_df) <= 25:
                    print len(spec_df), subj, metric, nation, lang
                    mean = 0.0
                means_df = means_df.append({'subj':subj, 'nation':nation, 'lang':lang, 'metric':metric, 'means':mean}, ignore_index=True)
                
means_df = means_df.convert_objects(convert_numeric=True)


6 geography completeness fra sw
6 geography informativeness fra sw
6 geography numheadings fra sw
6 geography articlelength fra sw
6 geography referencerate fra sw
8 history completeness fra sw
0 history completeness cdi sw
3 history completeness uga sw
8 history informativeness fra sw
0 history informativeness cdi sw
3 history informativeness uga sw
8 history numheadings fra sw
0 history numheadings cdi sw
3 history numheadings uga sw
8 history articlelength fra sw
0 history articlelength cdi sw
3 history articlelength uga sw
8 history referencerate fra sw
0 history referencerate cdi sw
3 history referencerate uga sw
10 economy completeness usa sw
5 economy completeness fra sw
1 economy completeness cdi sw
10 economy completeness uga fr
11 economy completeness uga sw
10 economy informativeness usa sw
5 economy informativeness fra sw
1 economy informativeness cdi sw
10 economy informativeness uga fr
11 economy informativeness uga sw
10 economy numheadings usa sw
5 economy numheadings fra sw
1 economy numheadings cdi sw
10 economy numheadings uga fr
11 economy numheadings uga sw
10 economy articlelength usa sw
5 economy articlelength fra sw
1 economy articlelength cdi sw
10 economy articlelength uga fr
11 economy articlelength uga sw
10 economy referencerate usa sw
5 economy referencerate fra sw
1 economy referencerate cdi sw
10 economy referencerate uga fr
11 economy referencerate uga sw

In [131]:
def top_sections_ethnoset(ethnoset_filename):
    print ethnoset_filename
    
    def defaultint():
        return defaultdict(int)

    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)
    
    articles = json.load(open(ethnoset_filename,'r'))
    
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems(): 
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        secs = section_headings(wikicode)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1
                        
    section_df = pd.DataFrame(columns=['lang','secname','freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount/float(total_articles[lang])
            section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)
            
    #section_df = section_df.convert_objects(convert_numeric=True)
    section_df.head()
    top_secs = section_df[section_df.freq > 0.1]
    
    sort_secs= top_secs.sort(columns='freq', ascending=False)
    return sort_secs

In [132]:
def top_templates_ethnoset(ethnoset_filename):
    
    def defaultint():
        return defaultdict(int)

    template_count = defaultdict(defaultint)
    sorted_templates = defaultdict(list)
    total_articles = defaultdict(int)
    
    articles = json.load(open(ethnoset_filename,'r'))
    
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems(): 
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        temps = wikicode.filter_templates()
                        for temp in temps:
                            tempname = temp.name
                            tempname = tempname.strip().lower()
                            template_count[lang][tempname] += 1
                        
    temp_df = pd.DataFrame(columns=['lang','tempname','freq'])
    for lang, temp_dict in template_count.iteritems():
        for tempname, tempcount in temp_dict.iteritems():
            freq = tempcount/float(total_articles[lang])
            temp_df = temp_df.append({'lang':lang, 'tempname':tempname, 'freq':freq}, ignore_index=True)
            
    #section_df = section_df.convert_objects(convert_numeric=True)
    top_templates = temp_df[temp_df.freq > 0.1]
    
    sort_temps= top_templates.sort(columns='freq', ascending=False)
    temps_dict = dict()
    for lang in template_count.iterkeys():
        try:
            temps_dict[lang] = sort_temps[sort_temps.lang==lang].iloc[:20].convert_objects(convert_numeric=True)
        except:
            temps_dict[lang] = sort_temps[sort_temps.lang==lang].convert_objects(convert_numeric=True)

    return temps_dict

In [384]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/'+name, ethnosaves)
sort_dfs = map(top_sections_ethnoset, filenames)


ethnosave/economy-cdi.jsonlatest.json
ethnosave/economy-fra.jsonlatest.json
ethnosave/economy-uga.jsonlatest.json
ethnosave/economy-usa.jsonlatest.json
ethnosave/geography-cdi.jsonlatest.json
ethnosave/geography-fra.jsonlatest.json
ethnosave/geography-uga.jsonlatest.json
ethnosave/geography-usa.jsonlatest.json
ethnosave/history-cdi.jsonlatest.json
ethnosave/history-fra.jsonlatest.json
ethnosave/history-uga.jsonlatest.json
ethnosave/history-usa.jsonlatest.json

In [124]:
def make_cite_plot(): 
    citedf = pd.DataFrame(columns=['setname','cite','freq'])
    for i in range(len(filenames)):
        for lang, df in temp_dfs[i].iteritems():
            if lang == 'en':
                df = df[(df.tempname == 'cite web') | (df.tempname == 'cite book') | (df.tempname == 'cite news') | (df.tempname == 'cite journal')]
                setname = filenames[i][10:-16]
                tot = 0
                for row in df.iterrows():
                    cols = row[1]
                    tot += cols['freq']
                for row in df.iterrows():
                    cols = row[1]
                    citedf = citedf.append({'setname':setname, 'cite': cols['tempname'], 'freq':cols['freq']/float(tot)}, ignore_index=True)
    
    
    cite_dict = {"cite book":536258, "cite journal":328129, "cite news":444447, "cite web":1560207}
    globaltot = reduce(lambda a,b: a+b, cite_dict.itervalues())
    globaltotfloat = float(globaltot)
    globciteratio = map(lambda cd: (cd[0], cd[1]/globaltotfloat), cite_dict.iteritems() )
    
    for cite in globciteratio:
        citetype, freq = cite[0], cite[1]
        citedf = citedf.append({'setname':'English WP Global', 'cite': citetype, 'freq':freq}, ignore_index=True)       
    
    citedf = citedf.convert_objects(convert_numeric=True)
    citepiv = citedf.pivot(index = 'setname', columns = 'cite')
    citeplot = citepiv.plot(kind='bar', stacked=True)
    citeplot.legend(('Citation type', 'Cite book', 'Cite journal', 'Cite news', 'Cite web'), loc=9)
    citeplot.figure.set_size_inches(12,8)
    citeplot.set_xlabel('subject-nation')
    citeplot.set_title('Composition of Citation Type, by Subject-Nation')

This is where I look at the template occurrences.


In [383]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/'+name, ethnosaves)
temp_dfs = map(top_templates_ethnoset, filenames)
for i in range(len(filenames)):
    for lang, df in temp_dfs[i].iteritems():
        # the per-set template tables were inspected interactively; prints left commented out
        #print ''
        #print filenames[i]
        #print df
        pass

In [11]:
def coordinate_frequencies():
    def coord_templates_ethnoset(ethnosavefile, coord_df):
        
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
      
        total_articles = defaultdict(int)
        coord_articles = defaultdict(int)
        
        articles = json.load(open('ethnosave/'+ethnosavefile,'r'))    
    
        for qid, attribdict in articles.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'wikitext':
                    for lang, wikitext in langdict.iteritems(): 
                        if wikitext:
                            total_articles[lang] += 1
                            wikicode = pfh.parse(wikitext)
                            temps = wikicode.filter_templates()
                            for temp in temps:
                                tempname = temp.name
                                tempname = tempname.strip().lower()
                                if tempname == 'coord':
                                    coord_articles[lang] += 1
        
        for lang, coord_count in coord_articles.iteritems():
            freq = coord_count / float(total_articles[lang])
            coord_df = coord_df.append({'subj':subj, 'nation':nation, 'lang':lang, 'coord_count':coord_count, 'freq':freq}, ignore_index=True)       
        return coord_df
    
    coord_df = pd.DataFrame(columns=['subj','nation','lang','coord_count','freq'])
    
    ethnosaves = !ls ethnosave
    for ethnosave in ethnosaves:
        coord_df = coord_templates_ethnoset(ethnosave, coord_df)
    
    coord_df_sort = coord_df.sort(columns='freq', ascending=False) 
    
    return HTML(coord_df_sort.to_html(index=False, columns=['subj','nation','lang','coord_count','freq']))

In [ ]: