Wikidata and the Measure of Nationality: The Germanic Shift

Max Klein December 2014

My latest project has been WIGI, the Wikipedia Gender Inequality Index, whose aim is to take a long, systematic look at the gender representation of biography articles across all Wikipedias using Wikidata. Piotr Konieczny and I were looking at the evolution of gender ratios over time and by culture (see our open notebook research thread). In comparing the ratios to the absolute population sizes over time, we spotted a data problem with the biographies of people born in Modern Protestant Europe. Investigating that data problem raised a new sociological question about the way we represent culture on Wikidata, specifically with regard to Germans and Austrians around WWII.

The first way I classified culture was by the place of birth property on Wikidata. If the value of place of birth was a country-item, I used it directly; if it was a city-item, I checked whether that city-item had a country property. This led to a country determination for 23% of the data. Continuing with this sample population, we classified the countries into 9 "world cultures". Viewing the chart, the problem was a heavy drop in biographies represented in Protestant Europe around WWII.
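The two-step resolution described above can be sketched as follows (a minimal sketch with hypothetical lookup tables; the real project derives these mappings from Wikidata itself):

```python
# Hypothetical lookup tables -- in reality these come from Wikidata dumps.
country_items = {'Q183', 'Q40'}      # items that are themselves countries (Germany, Austria)
city_to_country = {'Q64': 'Q183'}    # city-item -> its country property (Berlin -> Germany)

def resolve_country(place_of_birth):
    """Resolve a place-of-birth QID to a country QID, or None if undeterminable."""
    if place_of_birth in country_items:
        return place_of_birth                    # the value is already a country-item
    return city_to_country.get(place_of_birth)   # otherwise follow the city's country property
```

For the 77% of items where neither lookup succeeds, the function returns `None` and the biography goes unclassified.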


In [71]:
from IPython.display import Image
Image('https://upload.wikimedia.org/wikipedia/commons/7/74/Culture_dob_later.png', width=720)


Out[71]:

Piotr thought this was an error in the data, and sure enough, as I tucked into it, I noticed that many Germans and Austrians born after this time did not have a place of birth, but did have the property P:27 - Citizenship. So I compared the rates of citizenship-data-having biographies of Germans and Austrians to the rest, by date of birth. Now 63% of our data could be mapped to a culture. Let's take a look:


In [41]:
for data, label_prefix in ((modgeraus, ''), (modnotgeraus, 'NOT ')):
    moddata = data.query('1800 < dob < 2000')
    r = pd.DataFrame(moddata.groupby(by=['dob', 'classification'])['qid'].count()).unstack()['qid'].plot()
    r.set_title('Classification available for Humans {}of "Germany" or "Austria" \n by Date of Birth'.format(label_prefix))
    r.set_xticks(range(1800, 2000, 10))
    r.set_xticklabels(range(1800, 2000, 10), rotation=90)
    r.set_xlabel('Date of Birth')
    r.set_ylabel('number of humans in Wikidata')
    plt.show()


Interpretation

For biographies whose citizenship or place of birth is Germany or Austria, there was a large switch from using both classifications to using only citizenship information. Why this should be so could be explained by several hypotheses. (A) Wikidata is incomplete, and place of birth information has not been merged from Wikipedias to Wikidata. (B) A sociological and psychological effect is present in how Germans and Austrians retrospectively choose their cultural identification. (C) The trend reflects how Wikipedia and Wikidata editors are choosing to record history. There may be more hypotheses, and I would be glad to hear them. I don't know how to test any of these hypotheses very well. However, a look at the non-German, non-Austrian population reveals no similar classification shift, so at least we can reject the hypothesis that this is owed to a global trend.

If you have a constructive theory or test, comment below or find me on Twitter @notconfusing.

Update: Explanation

User:Saehrimnir, a friend from sudo room, wrote in to offer some history:

I checked a couple of entries and found an explanation. It's option (A). Once upon a time there was a bot, https://www.wikidata.org/wiki/Special:Contributions/FischBot, which harvested person data, mainly in the German Wikipedia I guess. It was working chronologically, and guess what, it stopped when it was at 1930. Other bots took over the work for en, but never for de, where most of the citizenship-only entries can be found. Why are there still so many with citizenship? That is likely due to Wikidata: The Game, at least for the samples I checked.
So there seems to be strong evidence that this is an artefact of the way that bots are migrating data into Wikidata. That is informative when hypothesizing about any of Wikidata's other data at this still early stage of the project.

WIGI's status

By the way, the WIGI project is ongoing. I will soon start releasing monthly updates of inequality data, indexed by: date of birth, date of death, place of birth, citizenship, ethnicity, and Wikipedia language. I invite you to play around with any ideas you might get by hacking on a rough example notebook on how to use the files.

Supporting Code


In [2]:
import pandas as pd
import numpy
import json

%pylab inline
java_min_int = -2147483648


Populating the interactive namespace from numpy and matplotlib

In [3]:
allrecs = pd.read_csv('snapshot_data/2014-10-13/gender-index-data-2014-10-13.csv',na_values=[java_min_int]) #you can get this data in the github project WIGI

In [4]:
def split_column(q_str):
    if type(q_str) is float:
        if numpy.isnan(q_str):
            return q_str
    if type(q_str) is str:
        qs = q_str.split('|')
        return qs[0]  # the format always ends with a trailing |, so take the first element

In [5]:
for col in ['place_of_birth','gender', 'citizenship','ethnic_group']:
    allrecs[col] = allrecs[col].apply(split_column)

In [6]:
allrecs.head(10)


Out[6]:
qid dob dod gender ethnic_group citizenship place_of_birth site_links
0 Q23 1732 1799 Q6581097 NaN Q30 Q494413 zhwiki|kywiki|euwiki|plwiki|bswiki|angwiki|uzw...
1 Q42 1952 2001 Q6581097 NaN Q145 Q350 zhwiki|jvwiki|euwiki|plwiki|bswiki|eswiki|tawi...
2 Q207 1946 NaN Q6581097 NaN Q30 Q49145 uzwiki|eswiki|kowikiquote|huwiki|liwikiquote|p...
3 Q297 NaN 1660 Q6581097 NaN Q29 Q8717 zhwiki|kywiki|plwiki|euwiki|bswiki|uzwiki|eswi...
4 Q326 1942 NaN Q6581097 NaN Q298 Q2887 zhwiki|plwiki|euwiki|kowiki|frwiki|eswiki|yowi...
5 Q368 1915 2006 Q6581097 NaN Q298 Q33986 lbwiki|zhwiki|plwiki|euwiki|bswiki|angwiki|esw...
6 Q377 1882 1942 Q6581097 NaN Q34266 Q658871 zhwiki|kywiki|ukwikisource|jvwiki|plwiki|euwik...
7 Q475 1911 1982 Q6581097 NaN Q298 Q2887 plwiki|euwiki|kowiki|frwiki|eswiki|yowiki|ocwi...
8 Q501 1821 1867 Q6581097 NaN Q142 Q90 zhwiki|glwikisource|plwiki|euwiki|bswiki|ptwik...
9 Q530 1956 NaN Q6581097 NaN Q34 Q499415 plwiki|euwiki|frwiki|bswiki|bewiki|eswiki|ocwi...

In [10]:
pobs_map = json.load(open('helpers/aggregation_maps/pobs_map.json','r'))

def map_country(qid):
    if not type(qid) is str:
        return None
    else:
        country_list = pobs_map[qid]
        if len(country_list) == 0:
            return None
        else: return country_list[0]
#TODO use citizenship as well.        
allrecs['country'] = allrecs['place_of_birth'].apply(map_country)

In [11]:
geraus = allrecs.query('''country == "Q183" or country == "Q40" or citizenship == "Q183" or citizenship =="Q40" or ethnic_group == "Q183" or ethnic_group == "Q40" ''')
notgeraus = allrecs.query('''country != "Q183" and country != "Q40" and citizenship != "Q183" and citizenship !="Q40" and ethnic_group != "Q183" and ethnic_group != "Q40" ''')

In [38]:
for data, label_prefix in ((geraus, ''), (notgeraus, 'OUT')):
    moddata = data.query('1800 < dob < 2000')
    p = moddata['dob'].hist(bins=199, xrot=90)
    p.set_title('Number of Humans with{} place of Birth in, or citizenship of "Germany" or "Austria" \n by Date of Birth'.format(label_prefix))
    p.set_xticks(range(1800, 2000, 10))
    plt.show()



In [12]:
def classification_type(row):
    has_country = True if row['country'] else False
    has_citizenship = True if str(row['citizenship']).lower() != 'nan' else False
    classification = 'neither'
    if has_country and has_citizenship:
        classification = 'both'
    if has_country and not has_citizenship:
        classification = 'country_only'
    if not has_country and has_citizenship:
        classification = 'citizenship_only'
    return classification

In [13]:
geraus['classification'] = geraus.apply(classification_type, axis=1)
notgeraus['classification'] = notgeraus.apply(classification_type, axis=1)


-c:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
-c:2: SettingWithCopyWarning: (the same warning, for the second assignment)
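As the warning text says, pandas flags these assignments because `geraus` and `notgeraus` are views of a query result. A minimal way to avoid the warning, sketched here on toy data (the real frame is loaded from the WIGI snapshot), is to take an explicit `.copy()` before adding the new column:

```python
import pandas as pd

# Toy stand-in for allrecs, for illustration only.
allrecs = pd.DataFrame({'qid': ['Q1', 'Q2'],
                        'country': ['Q183', None],
                        'citizenship': [None, 'Q40']})

geraus = allrecs.query('country == "Q183"').copy()  # .copy() yields an owned frame
geraus['classification'] = 'country_only'           # plain assignment, no warning
```

The same pattern applies to `notgeraus`; the resulting columns are identical, just without the view-versus-copy ambiguity.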

In [35]:
gerclass = geraus.pivot(index=geraus.index, columns='classification', values='qid')

In [28]:
print len(geraus)
print len(notgeraus)
print len(geraus) + len(notgeraus)
print len(allrecs)


213940
2348059
2561999
2561999

In [63]:
len(allrecs[allrecs['country'].apply(lambda x: isinstance(x,type(None)))]) /float(len(allrecs))


Out[63]:
0.7652766453070434

In [69]:
(len(notgeraus[notgeraus['classification'] == 'neither']) + len(geraus) ) /float(len(allrecs))


Out[69]:
0.6314986852063564
