Investigating Good Articles for Côte d'Ivoire and Uganda

For Kumusha Takes Wiki - by Max Klein
10 January 2014

Our principal question in this investigation is what makes a good article about a country, in the hope of improving articles related to Côte d'Ivoire and Uganda.

We will explore several aspects of the quality of the articles and their relationship to different language Wikipedias. First we look at the articles literally describing the countries of Ivory Coast and Uganda and compare them to other country articles in the English, French and Swahili Wikipedias. This is done both to sanity check our method and to determine which subsequent articles to explore. Here we utilise the research Tell Me More: An Actionable Quality Model for Wikipedia by Warncke-Wang, Cosley, and Riedl, which is meant to measure points of quality that can be more easily addressed. Secondly, we move to sets of articles about a subject of each nation. These sets are chosen as most important by how often their subject occurs in a specific Wikipedia. Next we attempt to apply the lessons from Predicting Quality Flaws in User-generated Content: The Case of Wikipedia by Anderka, Stein, and Lipka, which is about finding user-specified quality flaws.

This report is an IPython Notebook, so it contains all the code necessary to reproduce these results. Much of the code is at the end, to maintain non-technical readability.

Getting the Wikidata item for every country

We want to compare literal country articles worldwide. Since we will be analysing the English, French, and Swahili Wikipedias, we find the Wikidata items related to these countries (here denoted by their identifiers beginning with "Q"). Each Wikidata item points to all the different Wikipedia language versions available. In an attempt to avoid language bias, this list was first queried from the Wikidata property "instance of" with target "country" (through the offline Wikidata parser). However this only yielded 133 items. So, even though it is Anglocentric, our list is scraped from the English Wikipedia List of countries by population, yielding:


In [5]:
country_qids = json.load(open('country_qids.json'))
ugandaqid, cdiqid= 'Q1036', 'Q1008'
print "We are comparing: ", len(country_qids), 'countries".'
print "Some of their Wikidata QIDs are: ", map(lambda qid: 'http://wikidata.org/wiki/'+qid, country_qids[:5])


We are comparing:  242 countries.
Some of their Wikidata QIDs are:  [u'http://wikidata.org/wiki/Q16', u'http://wikidata.org/wiki/Q33788', u'http://wikidata.org/wiki/Q25230', u'http://wikidata.org/wiki/Q874', u'http://wikidata.org/wiki/Q37']

Let's inspect each metric

We look at the 5 statistics that are mentioned in the literature as "Actionable Metrics". The metrics are:

  1. Completeness - This counts the number of intra-wiki links to other articles.
  2. Informativeness - This is a combination of the ratio of readable text to wikitext markup, plus the number of images.
  3. Number of Headings - A count of the sections and subsections.
  4. Article Length - The length of the readable article in characters.
  5. Reference Rate - The number of references per unit of article length.

In Tell Me More weights are used, and we use the same weights, but for experimentation they could be changed.


In [140]:
def report_actionable_metrics(wikicode, completeness_weight=0.8, infonoise_weight=0.6, images_weight=0.3):
    completeness = completeness_weight * num_reg_links(wikicode)
    informativeness = (infonoise_weight * infonoise(wikicode) ) + (images_weight * num_file_links(wikicode) )
    numheadings = len(section_headings(wikicode))
    articlelength = readable_text_length(wikicode)
    referencerate = article_refs(wikicode) / readable_text_length(wikicode)
    
    return {'completeness': completeness, 'informativeness': informativeness, 'numheadings': numheadings, 'articlelength': articlelength, 'referencerate': referencerate}

We will now look at each metric one by one, on the set of our Wikidata Q-IDs to get a feel for how these metrics will be used later on.

For each metric, the positions of the Uganda and Côte d'Ivoire articles amongst the other countries are shown by their z-scores, which measure how many standard deviations they are above or below the average. Additionally we display a graph of the distribution of how the country articles score, separated by Wikipedia language.
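
For concreteness, here is a minimal sketch of the z-score calculation (the same formula is used inside metric_analyse_density in the code appendix); the example values are made up purely for illustration:

import pandas as pd

def zscore(column):
    # how many standard deviations each value sits above or below the column mean
    return (column - column.mean()) / column.std()

# hypothetical metric values for four country articles in one language
scores = pd.Series([10.0, 12.0, 30.0, 8.0])
print zscore(scores)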


In [499]:
metric_analyse_density('informativeness', 12)


Uganda (Q1036), Côte d'Ivoire (Q1008) informativeness z-scores.
Out[499]:
language en fr sw
Q1036 0.381577 -0.643210 -0.272597
Q1008 -0.431621 3.311523 -0.384729

Most notable here, Côte d'Ivoire has a very high score in French Wikipedia - 3.31 standard deviations above average.

In general these distributions are quite standard, with a relatively long tail, also known as positive skew, which we would expect from user-generated content. Also English and French outpace a spikier Swahili, which is in line with standard conceptions about the popularity of these languages.


In [500]:
metric_analyse_density('completeness', 1500)


Uganda (Q1036), Côte d'Ivoire (Q1008) completeness z-scores.
Out[500]:
language en fr sw
Q1036 -0.443975 -0.176491 1.736298
Q1008 -0.973200 1.767785 -0.505017

Reassuringly, Uganda performs highly here in Swahili, meaning that it is quite well interlinked with other articles.


In [501]:
metric_analyse_density('numheadings', 75)


Uganda (Q1036), Côte d'Ivoire (Q1008) numheadings z-scores.
Out[501]:
language en fr sw
Q1036 -1.695575 0.028298 0.023235
Q1008 0.660221 -0.411489 -0.427908

Here English appears to have an almost bimodal distribution; we welcome explanations of why that should be so.


In [505]:
metric_analyse_density('articlelength',150000)


Uganda (Q1036), Côte d'Ivoire (Q1008) articlelength z-scores.
Out[505]:
language en fr sw
Q1036 0.105196 -0.498248 0.190613
Q1008 0.335907 2.764950 -0.244492

Again Côte d'Ivoire scores very highly, 2.76, in French Wikipedia. It must be quite a long article.


In [503]:
metric_analyse_density('referencerate', 0.01)


Uganda (Q1036), Côte d'Ivoire (Q1008) referencerate z-scores.
Out[503]:
language en fr sw
Q1036 0.014790 -0.608334 3.129056
Q1008 -1.183966 1.663328 -0.265412

These results lend credibility to this technique, because the Wikipedias of the native languages of those countries show the highest quality. For the Uganda article, performance is slightly below average in English and French but strong in Swahili. For the Côte d'Ivoire article, the real stand-out is French Wikipedia, at times 3 standard deviations above average, but it is again below average in the other languages. So we can somewhat confidently apply this technique to other sets of articles to see what improvements could be made. Remember that a weakness in any one of the 5 metrics is an indicator of a specific type of task that could be done on those articles. Next we will inspect how to determine other sets of articles to which we can apply this method.

Highest Occurring Sections

Now we will find the highest occurring section names in country articles by language. The assumption we use here is that if a section name occurs frequently, it is an important subject in that language.
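
As a rough sketch of that counting (the full version, using the section_headings helper, is in the code appendix; this simplified variant uses mwparserfromhell's filter_headings instead):

import mwparserfromhell as pfh
from collections import Counter

def heading_counts(wikitexts):
    # count how often each heading name appears across a set of articles
    counts = Counter()
    for text in wikitexts:
        for heading in pfh.parse(text).filter_headings():
            counts[unicode(heading.title).strip()] += 1
    return counts

# toy usage with two hypothetical wikitexts
print heading_counts([u"== History ==\nsome text", u"== History ==\n== Economy ==\nmore text"])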


In [27]:
top_n_sections('fr',20)


242 total articles inspected in fr.
Out[27]:
lang secname freq
fr Liens externes 0.904959
fr Histoire 0.884298
fr Géographie 0.826446
fr Notes et références 0.809917
fr Économie 0.785124
fr Démographie 0.752066
fr Politique 0.685950
fr Voir aussi 0.685950
fr Articles connexes 0.648760
fr Codes 0.644628
fr Culture 0.640496
fr Bibliographie 0.491736
fr Climat 0.301653
fr Subdivisions 0.301653
fr Langues 0.289256
fr Sport 0.243802
fr Éducation 0.239669
fr Religions 0.231405
fr Religion 0.219008
fr Santé 0.198347

In [28]:
top_n_sections('en',20)


242 total articles inspected in en.
Out[28]:
lang secname freq
en History 0.954545
en Economy 0.888430
en Demographics 0.876033
en External links 0.867769
en See also 0.842975
en Geography 0.822314
en References 0.805785
en Culture 0.797521
en Religion 0.657025
en Education 0.648760
en Climate 0.545455
en Military 0.462810
en Administrative divisions 0.438017
en Politics 0.438017
en Sports 0.396694
en Foreign relations 0.384298
en Languages 0.367769
en Cuisine 0.363636
en Health 0.359504
en Further reading 0.342975

In [29]:
top_n_sections('sw',7)


233 total articles inspected in sw.
Out[29]:
lang secname freq
sw Historia 0.412017
sw Jiografia 0.334764
sw {{Infobox Country 0.334764
sw Viungo vya nje 0.330472
sw Tazama pia 0.184549
sw {{Infobox country 0.175966
sw Viungo vya Nje 0.154506

Looking at the most frequent sections in each, what patterns emerge? First let us acknowledge that some of the sections here are not content - like "Notes" and "References". The lead section - the section before any named sections - usually starts with code to include the country infobox in Swahili, which is where that cryptic "{{Infobox Country" result comes from. Ignoring non-content sections though, there is a clear winner in all three languages - History.

In all three languages History is the one section you can most rely on being there, occurring in 88% of French, 95% of English and 41% of Swahili country articles. In Swahili and French the second most popular section is Geography, but in English it's Economy. So this informs which Wikipedia categories to apply our metrics to next: the History, Economy, and Geography categories.

Subject and Nation Analysis

Given that we now have the top sections for each language, we devise a method to look at representative articles. We find a category in each Wikipedia for each subject in our highest occurring section headings, and for each major nation associated with our languages. We create a script to present every article in every category and all of its recursive subcategories to a human judge (a sketch of this crawl appears after the listing below). The full list of categories used for the second part of this analysis is:


In [266]:
categoryjson = json.load(open('ethnosets-categories-capitalized.json','r'))
for subject, nationdict in categoryjson.iteritems():
    print subject
    for nation, langdict in nationdict.iteritems():
        print " |---"  + nation
        for lang in ['fr','en','sw']:
            try:
                print"    |---" + langdict[lang]
            except:
                pass


Geography
 |---Côte d'Ivoire
    |---Catégorie:Géographie_de_la_Côte_d'Ivoire
    |---Geography_of_Ivory_Coast
    |---Jamii:Jiografia_ya_Cote_d'Ivoire
 |---USA
    |---Catégorie:Géographie_des_États-Unis
    |---Category:Geography_of_the_United_States
    |---Jamii:Jiografia_ya_Marekani
 |---Uganda
    |---Catégorie:Géographie_de_l'Ouganda
    |---Category:Geography_of_Uganda
    |---Jamii:Jiografia_ya_Uganda
 |---France
    |---Catégorie:Géographie_de_la_France
    |---Category:Geography_of_France
History
 |---Côte d'Ivoire
    |---Catégorie:Histoire_de_la_Côte_d'Ivoire
    |---Category:History_of_Ivory_Coast
 |---Uganda
    |---Catégorie:Histoire_de_l'Ouganda
    |---Category:History_of_Uganda
    |---Jamii:Historia_ya_Uganda
 |---USA
    |---Catégorie:Histoire_des_États-Unis
    |---Category:History_of_the_United_States
    |---Jamii:Historia_ya_Marekani
 |---France
    |---Catégorie:Histoire_de_France
    |---Category:History_of_France
    |---Jamii:Historia_ya_Ufaransa
Economy
 |---Côte d'Ivoire
    |---Catégorie:Économie_de_la_Côte_d'Ivoire
    |---Category:Economy_of_Ivory_Coast
    |---Jamii:Uchumi_wa_Cote_d'Ivoire
 |---USA
    |---Catégorie:Économie_des_États-Unis
    |---Category:Economy_of_the_United_States
    |---Jamii:Uchumi_wa_Marekani
 |---Uganda
    |---Catégorie:Économie_de_l'Ouganda
    |---Category:Economy_of_Uganda
    |---Jamii:Uchumi_wa_Uganda
 |---France
    |---Catégorie:Économie_de_la_France
    |---Category:Economy_of_France
    |---Jamii:Uchumi wa Ufaransa
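
Every article in each of these categories and their recursive subcategories was crawled. A minimal sketch of how such a walk can be set up with pywikibot (the category name and the recursion depth below are illustrative assumptions, not the exact values used):

import pywikibot
from pywikibot import pagegenerators

# walk one category and its subcategories, yielding candidate titles for a human judge
site = pywikibot.Site('en', 'wikipedia')
cat = pywikibot.Category(site, 'Category:History_of_Uganda')  # example category

# recurse=2 limits the subcategory depth
for page in pagegenerators.CategorizedPageGenerator(cat, recurse=2):
    print page.title()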

Looking at every title of every article in every subcategory of these categories, a human (your author) accepted or rejected each article. The purpose of this filtering process was to remove noise, such as articles about economists out of Economy, or high schools out of Geography, while still being able to select articles from many diverse categories. In the end we accepted a total of:


In [271]:
ethnosets = !ls ethnosets/
reduce(lambda a, b: a+b, map(lambda l: len(l), map(lambda f: json.load(open('ethnosets/'+f)), ethnosets)))


Out[271]:
24630

24,630 articles, which were fetched as live data off Wikipedia. We fetched live data so that we could compare the results after any editing efforts. We used Wikimedia Labs for the data pull script, if you are interested. Then for each of those articles we ask Wikidata to give us the page, if it exists, in all our desired languages, and on all those pages we compute the Actionable Metrics described earlier.
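
A minimal sketch of the per-item sitelink lookup (the full pipeline is do_sitelinks and friends in the code appendix):

import pywikibot

site = pywikibot.Site('en', 'wikipedia')
repo = site.data_repository()

item = pywikibot.ItemPage(repo, 'Q1036')  # Uganda
wditem = item.get()
for lang in ['en', 'fr', 'sw']:
    try:
        # sitelinks are keyed by database name, e.g. 'enwiki', 'frwiki', 'swwiki'
        print lang, wditem['sitelinks'][lang + 'wiki']
    except KeyError:
        print lang, '(no article in this language)'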

If you are keeping track that means we have four active variables:

  1. the Nation,
  2. the Wikipedia language,
  3. the subject, and
  4. the metric type.

Below is a matrix of heatmaps displaying the mean - not the full distribution as we previously graphed - of the specified metrics versus the specified subjects. In the inner dimensions we graph across all pages in a specified language versus a specified nation. That is, if you look at any individual heatmap you can see it compares Wikipedia languages against nations. Looking down the rows of the greater matrix the metric type varies, and looking across the columns of the greater matrix the subject category varies.
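
Each inner heatmap is just a language-by-nation pivot of mean metric values. A minimal sketch, using the same (older) pandas pivot_table keywords as make_heat_map in the code appendix, with made-up numbers:

import pandas as pd

# hypothetical mean metric values for one metric and one subject
means = pd.DataFrame({'nation': ['uga', 'uga', 'cdi', 'cdi'],
                      'lang':   ['en',  'sw',  'en',  'fr'],
                      'means':  [3.0,   1.0,   2.5,   4.0]})

# rows are Wikipedia languages, columns are nations, cells are the means to be coloured
print pd.pivot_table(means, values='means', rows='lang', cols='nation')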


In [146]:
make_heat_map()


Analysis

This heatmap can act as a guide to where one could find good examples of articles. Say, for instance, you wanted to improve History articles about Uganda. First look at the vertical History column and, within it, the vertical Uganda column. Then, moving your eyes up and down, you can see that while English Wikipedia has the best reference rate, French Wikipedia has a slightly higher code & images score, which would be differently useful depending on how you wanted to improve these articles.

For Economy, we can see that the clearest horizontal band across each metric is English. This is a promising result because we knew that English, unlike French or Swahili, placed more of an emphasis on Economy in its country articles. However if we take a by-nation approach we see that the bluest square is almost always in the France column, not in the USA column. Yet wouldn't we expect that if English values Economy, then Economy of the USA articles would perform most strongly? One explanation is that there are so many English USA company articles that the average quality is in fact brought down. Some data that would give credence to this idea is that English articles about France's economy outperform in the code and images metric. Bot operations often leave a lot of templated code on many relatively obscure articles, and this could be a trace of that. For Economy articles then, it would be recommended to look at French articles about France for written quality, and English articles for templating ideas.

Looking in the History column, in each of the metric rows there is a strong horizontal band of blue in the French sub-row. This indicates that French history articles are good across most nations. The exception is the bottom two metrics, article length and references per article length, where English seems to outperform, particularly for Uganda and the USA. Swahili here performs better for the USA than for Uganda. In fact in the Swahili band many data points do not register because there are 25 or fewer articles in question (see the caveat below). However, within the Uganda column English and French do register, and we can interpret these as translation opportunities to get more Swahili content. As for Côte d'Ivoire, we also find that in all but referencing French does better than English. This would suggest that for Côte d'Ivoire's history they unfortunately have no better direct examples to follow (but they do have French history to look up to).

Moving to Geography, a notable pattern is the domination of French Wikipedia about France in - again - all but the referencing metric. We also encounter some movement from Swahili Wikipedia for the first time. You can see two fainter, but still noticeable, vertical strips for Côte d'Ivoire and Uganda across all Wikipedia languages. So this means that Geography is the relative strong point for these African countries. To improve their relative content coverage, Swahili editors would do better to focus on Economy and History if they wanted to work on the more-needed areas of their Wikipedia. And within all the things they could do, we see the bluest points in code and images, which is likely from bot-created Africa place articles. The areas where those articles most need improvement are section count and article length, for which English and French articles about France, and English articles about the USA, could best serve as a guide.

Caveats

Since we are using the mean, if there are not a lot of pages in a language for a category and one of those pages is particularly impressive, then the average can appear high without that language being a useful information source. This occurred, for instance, for Swahili Economy of France in Wikilinks. To reduce this noise we set the minimum number of articles that must contribute to a heatmap cell to 25, so a white square will appear if there were fewer than 25 articles. All subject-nations not meeting this threshold can be found in the code appendix. In fact, these are the total counts of how many articles are being considered in each group.


In [52]:
display_article_counts()


Out[52]:
subj nation lang count
geography usa en 7376
geography fra fr 4201
economy usa en 3303
history fra fr 3135
economy fra fr 1962
history usa en 1756
history fra en 1322
geography cdi fr 1249
geography cdi en 1139
geography usa fr 1135
geography fra en 1103
economy fra en 971
geography uga en 602
history usa fr 473
economy usa fr 397
history uga en 320
geography usa sw 288
geography uga fr 219
geography uga sw 183
history cdi en 115
geography cdi sw 111
history uga fr 110
economy cdi fr 107
history cdi fr 84
economy uga en 78
history usa sw 39
economy cdi en 39
economy uga sw 11
economy usa sw 10
economy uga fr 10
history fra sw 8
geography fra sw 6
economy fra sw 5
history uga sw 3
economy cdi sw 1

Top Templates by Subject and Nation

The previous approach of looking at the top section headings was not very useful here, because within a broad category like "Economy" or "Geography" there are too many sub-subjects with different expected flavours of section headings. The code appendix shows the results of this experimentation. Instead we look at the top templates, to see if specific quality flaws could be identified using the techniques of Stein's Quality Flaws paper. Regrettably that study only discusses English Wikipedia, so here we would need a French or Kiswahili Wikipedian to identify the top clean-up tags in those Wikipedias. While "Citation needed" does appear for some categories, neither "Citation nécessaire" nor "Ukweli" crops up, which indicates that a direct translation may not be possible.

In fact, none of the most common clean-up tags makes an appearance in our top 20 for each subject-nation file, except "Citation needed". These subject-nation combinations have the highest incidence of "Citation needed" (the per-article frequency is sketched after the table):

Subject-nation| Frequency of "citation needed" (per article)

history-usa   |  0.501708
economy-fra   |  0.282183
economy-usa   |  0.270058
history-fra   |  0.226929
history-cdi   |  0.173913
geography-usa |  0.154013
economy-cdi   |  0.128205
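
The per-article frequency above is simply the number of template occurrences divided by the number of articles in the set. A minimal sketch (the full version is top_templates_ethnoset in the code appendix):

import mwparserfromhell as pfh

def template_rate(wikitexts, target='citation needed'):
    # occurrences of one template per article, over a set of wikitexts
    occurrences = 0
    for text in wikitexts:
        for temp in pfh.parse(text).filter_templates():
            if unicode(temp.name).strip().lower() == target:
                occurrences += 1
    return occurrences / float(len(wikitexts))

# toy usage with hypothetical wikitext
print template_rate([u"A claim.{{citation needed}}", u"A sourced claim.<ref>a book</ref>"])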

This probably highlights a lack of tagging with clean-up templates more than anything. It seems counter-intuitive that the more highly scoring subject-nations would also need more citing than those which scored lower in all other areas. Our sample of Wikipedia articles here is either too small for, or otherwise not quite the right application of, this clean-up tag analysis.

Still, looking over the data one can see that other citation templates feature highly. Therefore we can investigate the relative types of citations used. We do this only in English Wikipedia, for lack of understanding of French and Swahili citation philosophies.

First we retrieve the statistics for all occurrences of English Wikipedia citation templates from Wikimedia Labs, and then convert them into proportions.

MariaDB [enwiki_p]> select tl_title, count(*) from templatelinks where tl_title like 'Cite_web' or tl_title like 'Cite_news' or tl_title like 'Cite_journal' or tl_title like 'Cite_book' group by tl_title;
+--------------+----------+
| tl_title     | count(*) |
+--------------+----------+
| Cite_book    |   536258 |
| Cite_journal |   328129 |
| Cite_news    |   444447 |
| Cite_web     |  1560207 |
+--------------+----------+
4 rows in set (11 min 36.68 sec)

This global statistic serves as a benchmark for the compositions we can infer from our top-template work, if we consider all the occurrences of Cite book, Cite journal, Cite news, and Cite web in our subject-nation files. See below for the graphical representation.
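
For reference, converting the global counts above into proportions is straightforward; a quick sketch:

# global English Wikipedia citation-template counts from the query above
cite_counts = {'Cite_book': 536258, 'Cite_journal': 328129, 'Cite_news': 444447, 'Cite_web': 1560207}
total = float(sum(cite_counts.values()))
for name in sorted(cite_counts):
    print name, round(cite_counts[name] / total, 3)
# roughly 0.19 book, 0.11 journal, 0.15 news, 0.54 web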


In [125]:
make_cite_plot()


Immediately, Geography seems to be less diversely cited for all nations. We can also see signs that Economy on the whole has more news citations, and that History on average utilises more book citations. However the more major fact seems to be that web citations almost always make up more than half. Notably, in two of the three cases where they do not - History of Côte d'Ivoire and History of Uganda - news citations dominate instead.

All of this should be quite reassuring. A main disadvantage facing African content, previously postulated, was the lack of citable sources. That problem is exacerbated if we consider only printed historical works as useful for expanding content. However, since across the board citations are not heavily reliant on books or journals - even in the encyclopedia's strong suits like the Economy of the USA - citing should be less of an impediment to Kumusha Takes Wiki editors.

Coordinate Frequencies

As a curiosity, we investigate the frequency of occurrence of the geo-coordinate template in our subject-nation sets.


In [12]:
coordinate_frequencies()


Out[12]:
subj nation lang coord_count freq
geography usa en 6859 0.9299078
geography uga en 550 0.9136213
geography fra en 781 0.7080689
economy uga en 42 0.5384615
geography cdi en 490 0.4302019
geography cdi fr 428 0.3426741
geography usa fr 360 0.3171806
history usa en 510 0.2904328
geography fra fr 792 0.1885265
geography fra sw 1 0.1666667
economy cdi en 6 0.1538462
history fra en 202 0.1527988
economy usa fr 43 0.1083123
geography uga fr 23 0.1050228
economy usa en 345 0.1044505
economy fra en 91 0.09371782
history usa fr 42 0.08879493
history uga en 16 0.05
geography uga sw 6 0.03278689
economy fra fr 62 0.03160041
history fra fr 94 0.02998405
economy cdi fr 2 0.01869159
history cdi en 1 0.008695652
geography usa sw 1 0.003472222

There are some sets that appear to be almost entirely tagged with the coordinate template. In English, the nations with the closest cultural ties have higher frequencies of coordinate tagging, which is what those familiar with English Wikipedia would expect. But French Wikipedia does not seem to follow the same trend. The percentage of coordinate-tagged Geography of France articles in French (18%) seems low compared to Geography of the USA in English (93%). This seems especially odd as both Côte d'Ivoire and the USA have higher coordinate template frequencies than France herself. Unlike in English Wikipedia, the coordinate tagging appears inversely proportional to cultural ties - a proxy for domain expertise. This could be related to French Wikipedia categorization techniques. Recall we are looking at all the subcategories of "[insert subject] of [insert nation]", so it could be that there are types of Geography articles in French Wikipedia that are not tied to a coordinate. Those would be the sort of "non-obvious", non-coordinate articles, and they would be more likely to be written about France in the French Wikipedia. This would be supported by the fact that the total number of coordinate-tagged articles about France is almost the same in our sets in both English and French, but the French set is simply larger, containing more non-coordinate-tagged articles.

Conclusions

Our main task was to understand what makes a good article with respect to a nation. Despite the question's subjectivity, there are facts we have shown about the current state of Wikipedia, and if you consider any of them "useful" then you can try to emulate them. Our main finding is that when it comes to encyclopedically describing a nation, the English, French and Swahili Wikipedias all most often write about the nation's history. French and Swahili then consider Geography the next most important, while English favours Economy. We have created a graphical guide involving our 3 languages, 3 subjects, 4 nations, and 5 metrics. One can consult this guide to know where work needs to be done, and which areas of our Wikipedias to look to as an example. Lastly we determined that we do not have much information about where users have left requests for improvement on the countries in question, but if English Wikipedia is a model, then web citations are usually sufficient.

Further Directions

The way in which the swaths of articles were chosen for the subject-nation sets could be unsatisfactory for several reasons. Firstly, the category system may not accurately represent a Wikipedia's available items on a subject. Secondly, since the process involved a human judge, error is certainly introduced. A better way of determining these subject-nation sets would be useful. Also, no by-hand, human-reading investigation was done on those sets; instead we opted for algorithmic methods. If a sound methodology for the human analysis of pages is available, it would be a good technique to compare to the algorithmic ones presented here.

Start of Supporting Code

Code is also available via git on GitHub: https://github.com/notconfusing/kumusha_takes_wiki


In [1]:
#Infonoise metric of Stvilia (2005) in concept, although the implementation may differ since we are not stopping and stemming words, because of the multiple languages we need to handle

def readable_text_length(wikicode):
    #could also use wikicode.filter_text()
    return float(len(wikicode.strip_code()))

def infonoise(wikicode):
    wikicode.strip_code()
    ratio = readable_text_length(wikicode) / float(len(wikicode))
    return ratio

#Helper function to mine for section headings, of course if there is a lead it doesn't quite make sense.

def section_headings(wikicode):
    sections = wikicode.get_sections()
    sec_headings = map( lambda s: filter( lambda l: l != '=', s), map(lambda a: a.split(sep='\n', maxsplit=1)[0], sections))
    return sec_headings

#i don't know why mwparserfromhell's .filter_tags() isn't working at the moment. going to hack it for now
import re
def num_refs(wikicode):
    text = str(wikicode)
    reftags = re.findall('<(\ )*?ref', text)
    return len(reftags)

def article_refs(wikicode):
    sections = wikicode.get_sections()
    return float(reduce( lambda a,b: a+b ,map(num_refs, sections)))

#Predicate for links and files in English French and Swahili

def link_a_file(linkstr):
    fnames = [u'File:', u'Fichier:', u'Image:', u'Picha:']
    bracknames = map(lambda a: '[[' + a, fnames)
    return any(map(lambda b: linkstr.startswith(b), bracknames))

def link_a_cat(linkstr):
    cnames =[u'Category:', u'Catégorie:', u'Jamii:']
    bracknames = map(lambda a: '[[' + a, cnames)
    return any(map(lambda b: linkstr.startswith(b), bracknames))

def num_reg_links(wikicode):
    reg_links = filter(lambda a: not link_a_file(a) and not link_a_cat(a), wikicode.filter_wikilinks())
    return float(len(reg_links))

def num_file_links(wikicode):
    file_links = filter(lambda a: link_a_file(a), wikicode.filter_wikilinks())
    return float(len(file_links))

In [1]:
import pywikibot
import mwparserfromhell as pfh
import os
import datetime
import pandas as pd
import json
from collections import defaultdict
from ggplot import *
import operator
from IPython.display import HTML

%pylab inline


langs = ['en','fr','sw']
nations = ['usa', 'fra', 'cdi', 'uga']

wikipedias = {lang: pywikibot.Site(lang, 'wikipedia') for lang in langs}
wikidata = wikipedias['fr'].data_repository()


VERBOSE:pywiki:Starting 1 threads...
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['xlim', 'mpl', 'colors', 'ylim']
`%pylab --no-import-all` prevents importing * from pylab and numpy

This will be our data structure, a dict of dicts, until we have some numbers to put into pandas.


In [143]:
def enfrsw():
    return {lang: None for lang in langs}
def article_attributes():
    return {attrib: enfrsw() for attrib in ['sitelinks', 'wikitext', 'wikicode', 'metrics']}

def do_sitelinks(langs, qids, data):
    for qid in qids:
        page = pywikibot.ItemPage(wikidata, qid)
        wditem = page.get()
        for lang in langs:
            try:
                data[qid]['sitelinks'][lang] = wditem['sitelinks'][lang+'wiki']
            except KeyError:
                pass
    return data

Functions to get the page texts in all our desired languages


In [494]:
def get_wikitext(lang, title):
    page = pywikibot.Page(wikipedias[lang],title)
    def get_page(page):
        try:
            pagetext = page.get()
            return pagetext
        except pywikibot.exceptions.IsRedirectPage:
            redir = page.getRedirectTarget()
            # follow the redirect and return its text
            return get_page(redir)
        except pywikibot.exceptions.NoPage:
            print 're-raising'
            raise
    return get_page(page)

def do_wikitext(langs, data):
    for qid, attribs in data.iteritems():
        for lang, sl in attribs['sitelinks'].iteritems():
            if sl:
                try:
                    if randint(0,100) == 99:
                        print sl
                    data[qid]['wikitext'][lang] = get_wikitext(lang, sl)
                except:
                    print 'bad sitelink', sl
                    continue
    return data

def do_wikicode(langs, data):
    for qid, attribs in data.iteritems():
        for lang, pagetext in attribs['wikitext'].iteritems():
            if pagetext:
                data[qid]['wikicode'][lang] = pfh.parse(pagetext)
    return data

def do_metrics(data):
    for qid, attribs in data.iteritems():
        for lang, wikicode in attribs['wikicode'].iteritems():
            if wikicode:
                data[qid]['metrics'][lang] = report_actionable_metrics(wikicode)
    return data

# this will take a lot of network time since we are going to load about 300 pages, but we'll save the data off so we don't have to do it unnecessarily

def make_data(langs, qids, savename):
    print 'getting these qids: ', qids 
    data = defaultdict(article_attributes)
    print 'getting sitelinks'
    data = do_sitelinks(langs, qids, data)
    print 'getting wikitext'
    data = do_wikitext(langs, data)
    print 'converting to wikicode'
    data = do_wikicode(langs, data)
    print 'computing metrics'
    data = do_metrics(data)

        
    hashable_data = {qid: 
                        {'wikitext': attribdict['wikitext'],
                         'metrics': attribdict['metrics'],
                         'sitelinks': attribdict['sitelinks']}
                             for qid, attribdict in data.iteritems()}
    print 'saving now'
    #save the results
    safefilename = savename+str(datetime.datetime.now())+'.json'
    with open(safefilename,'w') as f3:
        json.dump(hashable_data,f3)
    with open(savename+'latest.json','w') as f4:
        json.dump(hashable_data, f4)
    return data

#i don't call this unless i have time to uncomment it
#arts = make_data(langs, country_qids, 'countrydata')

#time to get into pandas, let's throw everything into a data frame

df = pd.DataFrame(columns=['Country','language','metric','val'])

arts = json.load(open('countrydata-latest.json','r'))

for qid, attribdict in arts.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'metrics':
            for lang, metrics in langdict.iteritems(): 
                try:
                #sometimes there wasn't an article in that language and thus no corresponding len
                    for metric_name, metric_val in metrics.iteritems():
                        df = df.append({'Country': qid, 'language':lang, 'metric':metric_name, 'val':float(metric_val)}, ignore_index=True)
                except:
                    pass
df = df.convert_objects(convert_numeric=True)

metric_list = ['completeness','informativeness','numheadings','articlelength','referencerate']

langs_df_dict = {lang: df[df['language'] == lang] for lang in langs}
metric_df_dict = {metric: df[df['metric'] == metric] for metric in metric_list}

#for later calculation
uganda_zscores = defaultdict(list)
cdi_zscores = defaultdict(list)

In [495]:
def metric_analyse_density(ametric, xlimit):
    inf_df = metric_df_dict[ametric]
    
    zscore = lambda x: (x - x.mean()) / x.std()
    
    inf_piv = inf_df.pivot(index='Country', columns='language', values='val')
    
    inf_piv_z = inf_piv.apply(zscore)
    metric_analyse_density_plot(ametric, xlimit, inf_df)
    print 'Uganda ('+ugandaqid+"), Côte d'Ivoire ("+cdiqid+") " +ametric+ " z-scores." 
    return inf_piv_z.ix[[ugandaqid,cdiqid]]

In [496]:
def metric_analyse_density_plot(ametric, xlimit, inf_df):
    p = ggplot(aes(x='val', colour='language', fill=True, alpha = 0.3), data=inf_df) + geom_density() + labs("score", "frequency") + \
    scale_x_continuous(limits=(0,xlimit)) + ggtitle(ametric + '\nall country articles\n                                                                        ')
    p.rcParams["figure.figsize"] = "4, 3"
    p.draw()

In [18]:
def defaultint():
    return defaultdict(int)

section_count = defaultdict(defaultint)
sorted_secs = defaultdict(list)
total_articles = defaultdict(int)

articles = json.load(open('countrydata-latest.json','r'))

for qid, attribdict in articles.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'wikitext':
            for lang, wikitext in langdict.iteritems(): 
                if wikitext:
                    total_articles[lang] += 1
                    wikicode = pfh.parse(wikitext)
                    secs = section_headings(wikicode)
                    for sec in secs:
                        sec = sec.strip()
                        section_count[lang][sec] += 1
                        
section_df = pd.DataFrame(columns=['lang','secname','freq'])
for lang, sec_dict in section_count.iteritems():
    for secname, seccount in sec_dict.iteritems():
        freq = seccount/float(total_articles[lang])
        section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)  
        
#section_df = section_df.convert_objects(convert_numeric=True)
section_df.head()
top_secs = section_df[section_df.freq > 0.1]

sort_secs= top_secs.sort(columns='freq', ascending=False)

In [26]:
def top_n_sections(lang,n):
    top = sort_secs[sort_secs.lang==lang].iloc[:n].convert_objects(convert_numeric=True)
    print str(total_articles[lang]) + ' total articles inspected in ' + lang + '.' 
    return HTML(top.to_html(index=False, columns=['lang','secname','freq']))

In [9]:
def top_sections_ethnoset(ethnoset_filename):
    
    def defaultint():
        return defaultdict(int)

    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)
    
    articles = json.load(open(ethnoset_filename,'r'))
    
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems(): 
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        secs = section_headings(wikicode)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1
                        
    section_df = pd.DataFrame(columns=['lang','secname','freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount/float(total_articles[lang])
            section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)
            
    #section_df = section_df.convert_objects(convert_numeric=True)
    section_df.head()
    top_secs = section_df[section_df.freq > 0.1]
    
    sort_secs= top_secs.sort(columns='freq', ascending=False)

    return sort_secs

In [31]:
!ls ethnosave/


economy-cdi.jsonlatest.json    geography-uga.jsonlatest.json
economy-fra.jsonlatest.json    geography-usa.jsonlatest.json
economy-uga.jsonlatest.json    history-cdi.jsonlatest.json
economy-usa.jsonlatest.json    history-fra.jsonlatest.json
geography-cdi.jsonlatest.json  history-uga.jsonlatest.json
geography-fra.jsonlatest.json  history-usa.jsonlatest.json

In [50]:
def display_article_counts():
    filenames = !ls ethnosave
    
    art_counts = pd.DataFrame(columns=['subj','nation', 'lang','count'])
    for filename in filenames:
        spl = filename.split('-')
        subj, nation = spl[0], spl[1].split('.')[0]
        
        fileaddr = 'ethnosave/' + filename
        articles = json.load(open(fileaddr,'r'))
        total_articles = defaultdict(int)
        for qid, attribdict in articles.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'wikitext':
                    for lang, wikitext in langdict.iteritems(): 
                        if wikitext:
                            total_articles[lang] += 1
        for lang, count in total_articles.iteritems():
            art_counts = art_counts.append({'subj': subj, 'nation': nation, 'lang': lang, 'count': count} ,ignore_index = True)
    return HTML(art_counts.sort(columns='count', ascending=False).to_html(index=False))

In [137]:
def make_heat_map():
    

    subj_list = ['economy','history','geography']
    metric_list = ['completeness','informativeness','numheadings','articlelength','referencerate']
    #pivtables = {metric: {subj: None for subj in subj_list} for metric in metric_list}
    
    fig, axes = plt.subplots(nrows = len(metric_list), ncols = len(subj_list), sharex='col', sharey='row' )
    '''
    for metric, subjdict in pivtables.iteritems():
        for subj, pivtab in subjdict.iteritems():
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            pivtables[metric][subj] = natlangpiv
            
    '''
    
    for axarr, metric in zip(axes, metric_list):
        for ax, subj in zip(axarr, subj_list):
            
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            heatmap = ax.pcolor(natlangpiv, cmap='Blues')
            ax.set_yticks(np.arange(0.5, len(natlangpiv.index), 1))
            ax.set_yticklabels(natlangpiv.index)
            ax.set_xticks(np.arange(0.5, len(natlangpiv.columns), 1))
            ax.set_xticklabels(natlangpiv.columns)
            cbar = plt.colorbar(mappable=heatmap, ax=ax)
        
    fig.suptitle('Heatmap of Actionable Metrics by Country versus Wikipedia Language, \n by Subject Category', fontsize=18)
    fig.set_size_inches(12,12,dpi=600)
    #fig.tight_layout()
    
    subj_titles = ['Economy','History','Geography']
    metric_titles =['Wikilinks','Code & Images to Text Ratio','Section Count','Article Length', 'References per Article Length'] 
    for i in range(len(subj_titles)):
        axes[0][i].set_title(subj_titles[i])
    for j in range(len(metric_titles)):
        axes[j][0].set_ylabel(metric_titles[j])

In [82]:
means_df[(means_df.metric == 'referencerate') & (means_df.subj == 'geography')]


Out[82]:
subj nation lang metric means
48 geography usa en referencerate 0.003102
49 geography usa fr referencerate 0.001050
50 geography usa sw referencerate 0.000012
51 geography fra en referencerate 0.000573
52 geography fra fr referencerate 0.001929
53 geography fra sw referencerate 0.000000
54 geography cdi en referencerate 0.004123
55 geography cdi fr referencerate 0.002138
56 geography cdi sw referencerate 0.006302
57 geography uga en referencerate 0.001590
58 geography uga fr referencerate 0.000560
59 geography uga sw referencerate 0.000114

In [135]:
def load_ethnosaves():
    ethnosaves = !ls ethnosave
    
    subj_df_dict = {subj: pd.DataFrame(columns=['qid','subj','nation','lang','metric','val']) for subj in ethnosaves}
    
    for ethnosavefile in ethnosaves:
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
        arts = json.load(open('ethnosave/'+ethnosavefile,'r'))
        print subj, nation
        sdf = subj_df_dict[ethnosavefile]
        for qid, attribdict in arts.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'metrics':
                    for lang, metrics in langdict.iteritems(): 
                        try:
                        #sometimes there wasn't an article in that language and thus no corresponding len
                            for metric_name, metric_val in metrics.iteritems():
                                sdf = sdf.append({'qid': qid, 'subj':subj, 'nation':nation, 'lang':lang, 'metric':metric_name, 'val':float(metric_val)}, ignore_index=True)
                        except:
                            pass
        subj_df_dict[ethnosavefile] = sdf
        lens = map(lambda d: len(d), subj_df_dict.itervalues())
        print lens
    return subj_df_dict

subj_df_dict = load_ethnosaves()
subj_df = pd.concat(subj_df_dict)
assert(len(subj_df) == reduce(lambda a, b: a+b, map(lambda df: len(df), subj_df_dict.itervalues())))

subj_df = subj_df.convert_objects()


economy cdi
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 735]
economy fra
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14690, 735]
economy uga
[0, 0, 0, 495, 0, 0, 0, 0, 0, 0, 14690, 735]
economy usa
[0, 0, 0, 495, 0, 0, 0, 0, 0, 18550, 14690, 735]
geography cdi
[12495, 0, 0, 495, 0, 0, 0, 0, 0, 18550, 14690, 735]
geography fra
[12495, 0, 0, 495, 0, 26550, 0, 0, 0, 18550, 14690, 735]
geography uga
[12495, 0, 0, 495, 0, 26550, 0, 0, 5020, 18550, 14690, 735]
geography usa
[12495, 0, 43992, 495, 0, 26550, 0, 0, 5020, 18550, 14690, 735]
history cdi
[12495, 0, 43992, 495, 0, 26550, 995, 0, 5020, 18550, 14690, 735]
history fra
[12495, 0, 43992, 495, 0, 26550, 995, 22325, 5020, 18550, 14690, 735]
history uga
[12495, 0, 43992, 495, 2165, 26550, 995, 22325, 5020, 18550, 14690, 735]
history usa
[12495, 11340, 43992, 495, 2165, 26550, 995, 22325, 5020, 18550, 14690, 735]

In [145]:
means_df = pd.DataFrame(columns=['subj','nation','lang','metric','means'])

for subj in ['geography','history','economy']:
    for metric in ['completeness','informativeness','numheadings','articlelength','referencerate']:
        for nation in ['usa','fra','cdi','uga']:
            for lang in ['en','fr','sw']:
                spec_df = subj_df[(subj_df.subj == subj) & (subj_df.nation == nation) & (subj_df.metric == metric) & (subj_df.lang == lang)]['val']
                mean = spec_df.mean()
                if (not str(mean)[0] in '0123456789'):
                    mean = 0.0
                if len(spec_df) <= 25:
                    print len(spec_df), subj, metric, nation, lang
                    mean = 0.0
                means_df = means_df.append({'subj':subj, 'nation':nation, 'lang':lang, 'metric':metric, 'means':mean}, ignore_index=True)
                
means_df = means_df.convert_objects(convert_numeric=True)


6 geography completeness fra sw
6 geography informativeness fra sw
6 geography numheadings fra sw
6 geography articlelength fra sw
6 geography referencerate fra sw
8 history completeness fra sw
0 history completeness cdi sw
3 history completeness uga sw
8 history informativeness fra sw
0 history informativeness cdi sw
3 history informativeness uga sw
8 history numheadings fra sw
0 history numheadings cdi sw
3 history numheadings uga sw
8 history articlelength fra sw
0 history articlelength cdi sw
3 history articlelength uga sw
8 history referencerate fra sw
0 history referencerate cdi sw
3 history referencerate uga sw
10 economy completeness usa sw
5 economy completeness fra sw
1 economy completeness cdi sw
10 economy completeness uga fr
11 economy completeness uga sw
10 economy informativeness usa sw
5 economy informativeness fra sw
1 economy informativeness cdi sw
10 economy informativeness uga fr
11 economy informativeness uga sw
10 economy numheadings usa sw
5 economy numheadings fra sw
1 economy numheadings cdi sw
10 economy numheadings uga fr
11 economy numheadings uga sw
10 economy articlelength usa sw
5 economy articlelength fra sw
1 economy articlelength cdi sw
10 economy articlelength uga fr
11 economy articlelength uga sw
10 economy referencerate usa sw
5 economy referencerate fra sw
1 economy referencerate cdi sw
10 economy referencerate uga fr
11 economy referencerate uga sw

In [131]:
def top_sections_ethnoset(ethnoset_filename):
    print ethnoset_filename
    
    def defaultint():
        return defaultdict(int)

    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)
    
    articles = json.load(open(ethnoset_filename,'r'))
    
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems(): 
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        secs = section_headings(wikicode)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1
                        
    section_df = pd.DataFrame(columns=['lang','secname','freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount/float(total_articles[lang])
            section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)
            
    #section_df = section_df.convert_objects(convert_numeric=True)
    section_df.head()
    top_secs = section_df[section_df.freq > 0.1]
    
    sort_secs= top_secs.sort(columns='freq', ascending=False)
    return sort_secs

In [132]:
def top_templates_ethnoset(ethnoset_filename):
    
    def defaultint():
        return defaultdict(int)

    template_count = defaultdict(defaultint)
    sorted_templates = defaultdict(list)
    total_articles = defaultdict(int)
    
    articles = json.load(open(ethnoset_filename,'r'))
    
    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems(): 
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        temps = wikicode.filter_templates()
                        for temp in temps:
                            tempname = temp.name
                            tempname = tempname.strip().lower()
                            template_count[lang][tempname] += 1
                        
    temp_df = pd.DataFrame(columns=['lang','tempname','freq'])
    for lang, temp_dict in template_count.iteritems():
        for tempname, tempcount in temp_dict.iteritems():
            freq = tempcount/float(total_articles[lang])
            temp_df = temp_df.append({'lang':lang, 'tempname':tempname, 'freq':freq}, ignore_index=True)
            
    #section_df = section_df.convert_objects(convert_numeric=True)
    top_templates = temp_df[temp_df.freq > 0.1]
    
    sort_temps= top_templates.sort(columns='freq', ascending=False)
    temps_dict = dict()
    for lang in template_count.iterkeys():
        try:
            temps_dict[lang] = sort_temps[sort_temps.lang==lang].iloc[:20].convert_objects(convert_numeric=True)
        except:
            temps_dict[lang] = sort_temps[sort_temps.lang==lang].convert_objects(convert_numeric=True)

    return temps_dict

In [384]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/'+name, ethnosaves)
sort_dfs = map(top_sections_ethnoset, filenames)


ethnosave/economy-cdi.jsonlatest.json
ethnosave/economy-fra.jsonlatest.json
ethnosave/economy-uga.jsonlatest.json
ethnosave/economy-usa.jsonlatest.json
ethnosave/geography-cdi.jsonlatest.json
ethnosave/geography-fra.jsonlatest.json
ethnosave/geography-uga.jsonlatest.json
ethnosave/geography-usa.jsonlatest.json
ethnosave/history-cdi.jsonlatest.json
ethnosave/history-fra.jsonlatest.json
ethnosave/history-uga.jsonlatest.json
ethnosave/history-usa.jsonlatest.json

In [124]:
def make_cite_plot(): 
    citedf = pd.DataFrame(columns=['setname','cite','freq'])
    for i in range(len(filenames)):
        for lang, df in temp_dfs[i].iteritems():
            if lang == 'en':
                df = df[(df.tempname == 'cite web') | (df.tempname == 'cite book') | (df.tempname == 'cite news') | (df.tempname == 'cite journal')]
                setname = filenames[i][10:-16]
                tot = 0
                for row in df.iterrows():
                    cols = row[1]
                    tot += cols['freq']
                for row in df.iterrows():
                    cols = row[1]
                    citedf = citedf.append({'setname':setname, 'cite': cols['tempname'], 'freq':cols['freq']/float(tot)}, ignore_index=True)
    
    
    cite_dict = {"cite book":536258, "cite journal":328129, "cite news":444447, "cite web":1560207}
    globaltot = reduce(lambda a,b: a+b, cite_dict.itervalues())
    globaltotfloat = float(globaltot)
    globciteratio = map(lambda cd: (cd[0], cd[1]/globaltotfloat), cite_dict.iteritems() )
    
    for cite in globciteratio:
        citetype, freq = cite[0], cite[1]
        citedf = citedf.append({'setname':'English WP Global', 'cite': citetype, 'freq':freq}, ignore_index=True)       
    
    citedf = citedf.convert_objects(convert_numeric=True)
    citepiv = citedf.pivot(index = 'setname', columns = 'cite')
    citeplot = citepiv.plot(kind='bar', stacked=True)
    citeplot.legend(('Citation type', 'Cite book', 'Cite journal', 'Cite news', 'Cite web'), loc=9)
    citeplot.figure.set_size_inches(12,8)
    citeplot.set_xlabel('subject-nation')
    citeplot.set_title('Composition of Citation Type, by Subject-Nation')

This is where I look at the template occurrences.


In [383]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/'+name, ethnosaves)
temp_dfs = map(top_templates_ethnoset, filenames)
for i in range(len(filenames)):
    for lang, df in temp_dfs[i].iteritems():
        # the per-set template tables were inspected interactively; prints left commented out
        #print ''
        #print filenames[i]
        #print df
        pass

In [11]:
def coordinate_frequencies():
    def coord_templates_ethnoset(ethnosavefile, coord_df):
        
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
      
        total_articles = defaultdict(int)
        coord_articles = defaultdict(int)
        
        articles = json.load(open('ethnosave/'+ethnosavefile,'r'))    
    
        for qid, attribdict in articles.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'wikitext':
                    for lang, wikitext in langdict.iteritems(): 
                        if wikitext:
                            total_articles[lang] += 1
                            wikicode = pfh.parse(wikitext)
                            temps = wikicode.filter_templates()
                            for temp in temps:
                                tempname = temp.name
                                tempname = tempname.strip().lower()
                                if tempname == 'coord':
                                    coord_articles[lang] += 1
        
        for lang, coord_count in coord_articles.iteritems():
            freq = coord_count / float(total_articles[lang])
            coord_df = coord_df.append({'subj':subj, 'nation':nation, 'lang':lang, 'coord_count':coord_count, 'freq':freq}, ignore_index=True)       
        return coord_df
    
    coord_df = pd.DataFrame(columns=['subj','nation','lang','coord_count','freq'])
    
    ethnosaves = !ls ethnosave
    for ethnosave in ethnosaves:
        coord_df = coord_templates_ethnoset(ethnosave, coord_df)
    
    coord_df_sort = coord_df.sort(columns='freq', ascending=False) 
    
    return HTML(coord_df_sort.to_html(index=False, columns=['subj','nation','lang','coord_count','freq']))

In [ ]: