Wikidata and the Measure of Nationality: The Germanic Shift

Max Klein December 2014

My latest project has been WIGI, the Wikipedia Gender Inequality Index, whose aim is to take a long, systematic look at the gender representation of biography articles across all Wikipedias using Wikidata. Piotr Konieczny and I were looking at the evolution of gender ratios over time and by culture (see our open notebook research thread). In comparing the ratios to the absolute population sizes over time, we spotted a data problem with the biographies of people born in Modern Protestant Europe. Investigating that data problem raised a new sociological question about the way we represent culture on Wikidata, specifically with regard to Germans and Austrians around WWII.

The first way I classified culture was by the place of birth property on Wikidata. If the value of place of birth was a country-item, I used it directly; if it was a city-item, I checked whether that city-item had a country property. This led to a country determination for 23% of the data. Continuing with this sample population, we classified the countries into 9 "world cultures". Viewing the chart, the problem was a heavy drop in biographies represented in Protestant Europe around WWII.
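The two-step resolution described above can be sketched as follows (a minimal sketch with hypothetical lookup tables; the real project derives these mappings from Wikidata itself):

```python
# Hypothetical lookup tables -- in reality these come from Wikidata dumps.
country_items = {'Q183', 'Q40'}      # items that are themselves countries (Germany, Austria)
city_to_country = {'Q64': 'Q183'}    # city-item -> its country property (Berlin -> Germany)

def resolve_country(place_of_birth):
    """Resolve a place-of-birth QID to a country QID, or None if undeterminable."""
    if place_of_birth in country_items:
        return place_of_birth                    # the value is already a country-item
    return city_to_country.get(place_of_birth)   # otherwise follow the city's country property
```

For the 77% of items where neither lookup succeeds, the function returns `None` and the biography goes unclassified.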


In [71]:
from IPython.display import Image
Image('https://upload.wikimedia.org/wikipedia/commons/7/74/Culture_dob_later.png', width=720)


Out[71]:

Piotr thought this was an error in the data, and sure enough, as I tucked into it, I noticed that many Germans and Austrians born after this time did not have a place of birth, but did have the property P:27 - Citizenship. So I compared the rates of citizenship-data-having biographies of Germans and Austrians to the rest, by date of birth. Now 63% of our data could be mapped to a culture. Let's take a look:


In [41]:
for data, label_prefix in ((modgeraus, ''), (modnotgeraus, 'NOT ')):
    moddata = data.query('1800 < dob < 2000')
    r = pd.DataFrame(moddata.groupby(by=['dob', 'classification'])['qid'].count()).unstack()['qid'].plot()
    r.set_title('Classification available for Humans {}of "Germany" or "Austria" \n by Date of Birth'.format(label_prefix))
    r.set_xticks(range(1800, 2000, 10))
    r.set_xticklabels(range(1800, 2000, 10), rotation=90)
    r.set_xlabel('Date of Birth')
    r.set_ylabel('number of humans in Wikidata')
    plt.show()


Interpretation

For biographies whose citizenship or place of birth is Germany or Austria, there was a large switch from using both classifications to using only citizenship information. Why this should be so could be explained by several hypotheses. (A) Wikidata is incomplete, and place of birth information has not been merged from Wikipedias to Wikidata. (B) A sociological and psychological effect is present in how Germans and Austrians retrospectively choose their cultural identification. (C) The trend reflects how Wikipedia and Wikidata editors are choosing to record history. There may be more hypotheses, and I would be glad to hear them. I don't know how to test any of these hypotheses very well. However, a look at the non-German, non-Austrian population reveals no similar classification shift, so at least we can reject the hypothesis that this is owed to a global trend.

If you have a constructive theory or test, comment below or find me on Twitter @notconfusing.

Update: Explanation

User:Saehrimnir, a friend from sudo room, wrote in to offer some history:

I checked a couple of entries and found an explanation. It's option (A). Once upon a time there was a bot, https://www.wikidata.org/wiki/Special:Contributions/FischBot, which harvested person data, mainly in the German Wikipedia I guess. It was working chronologically, and guess what, it stopped when it was at 1930. Other bots took over the work for en, but never for de, where most of the citizenship-only entries can be found. Why are there still so many with citizenship? That is likely due to Wikidata: The Game, at least for the samples I checked.
So there seems to be strong evidence that this is an artefact of the way that bots are migrating data into Wikidata. That is informative when hypothesizing about any of Wikidata's other data at this still early stage of the project.

WIGI's status

By the way, the WIGI project is ongoing. I will soon start releasing monthly updates of inequality data, indexed by: date of birth, date of death, place of birth, citizenship, ethnicity, and Wikipedia language. I invite you to play around with any ideas you might get by hacking on a rough example notebook on how to use the files.

Supporting Code


In [2]:
import pandas as pd
import numpy
import json

%pylab inline
java_min_int = -2147483648


Populating the interactive namespace from numpy and matplotlib

In [3]:
allrecs = pd.read_csv('snapshot_data/2014-10-13/gender-index-data-2014-10-13.csv',na_values=[java_min_int]) #you can get this data in the github project WIGI

In [4]:
def split_column(q_str):
    if type(q_str) is float:
        if numpy.isnan(q_str):
            return q_str
    if type(q_str) is str:
        qs = q_str.split('|')
        return qs[0]  # the format always ends with a trailing |, so take the first element

In [5]:
for col in ['place_of_birth','gender', 'citizenship','ethnic_group']:
    allrecs[col] = allrecs[col].apply(split_column)

In [6]:
allrecs.head(10)


Out[6]:
qid dob dod gender ethnic_group citizenship place_of_birth site_links
0 Q23 1732 1799 Q6581097 NaN Q30 Q494413 zhwiki|kywiki|euwiki|plwiki|bswiki|angwiki|uzw...
1 Q42 1952 2001 Q6581097 NaN Q145 Q350 zhwiki|jvwiki|euwiki|plwiki|bswiki|eswiki|tawi...
2 Q207 1946 NaN Q6581097 NaN Q30 Q49145 uzwiki|eswiki|kowikiquote|huwiki|liwikiquote|p...
3 Q297 NaN 1660 Q6581097 NaN Q29 Q8717 zhwiki|kywiki|plwiki|euwiki|bswiki|uzwiki|eswi...
4 Q326 1942 NaN Q6581097 NaN Q298 Q2887 zhwiki|plwiki|euwiki|kowiki|frwiki|eswiki|yowi...
5 Q368 1915 2006 Q6581097 NaN Q298 Q33986 lbwiki|zhwiki|plwiki|euwiki|bswiki|angwiki|esw...
6 Q377 1882 1942 Q6581097 NaN Q34266 Q658871 zhwiki|kywiki|ukwikisource|jvwiki|plwiki|euwik...
7 Q475 1911 1982 Q6581097 NaN Q298 Q2887 plwiki|euwiki|kowiki|frwiki|eswiki|yowiki|ocwi...
8 Q501 1821 1867 Q6581097 NaN Q142 Q90 zhwiki|glwikisource|plwiki|euwiki|bswiki|ptwik...
9 Q530 1956 NaN Q6581097 NaN Q34 Q499415 plwiki|euwiki|frwiki|bswiki|bewiki|eswiki|ocwi...

In [10]:
pobs_map = json.load(open('helpers/aggregation_maps/pobs_map.json','r'))

def map_country(qid):
    if not type(qid) is str:
        return None
    else:
        country_list = pobs_map[qid]
        if len(country_list) == 0:
            return None
        else: return country_list[0]
#TODO use citizenship as well.        
allrecs['country'] = allrecs['place_of_birth'].apply(map_country)

In [11]:
geraus = allrecs.query('''country == "Q183" or country == "Q40" or citizenship == "Q183" or citizenship =="Q40" or ethnic_group == "Q183" or ethnic_group == "Q40" ''')
notgeraus = allrecs.query('''country != "Q183" and country != "Q40" and citizenship != "Q183" and citizenship !="Q40" and ethnic_group != "Q183" and ethnic_group != "Q40" ''')

In [38]:
for data, label_prefix in ((geraus, ''), (notgeraus, 'OUT')):
    moddata = data.query('1800 < dob < 2000')
    p = moddata['dob'].hist(bins=199, xrot=90)
    p.set_title('Number of Humans with{} place of Birth in, or citizenship of "Germany" or "Austria" \n by Date of Birth'.format(label_prefix))
    p.set_xticks(range(1800, 2000, 10))
    plt.show()



In [12]:
def classification_type(row):
    has_country = True if row['country'] else False
    has_citizenship = True if str(row['citizenship']).lower() != 'nan' else False
    classification = 'neither'
    if has_country and has_citizenship:
        classification = 'both'
    if has_country and not has_citizenship:
        classification = 'country_only'
    if not has_country and has_citizenship:
        classification = 'citizenship_only'
    return classification

In [13]:
geraus['classification'] = geraus.apply(classification_type, axis=1)
notgeraus['classification'] = notgeraus.apply(classification_type, axis=1)


-c:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
-c:2: SettingWithCopyWarning: (the same warning, for the second assignment)
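As the warning text says, pandas flags these assignments because `geraus` and `notgeraus` are views of a query result. A minimal way to avoid the warning, sketched here on toy data (the real frame is loaded from the WIGI snapshot), is to take an explicit `.copy()` before adding the new column:

```python
import pandas as pd

# Toy stand-in for allrecs, for illustration only.
allrecs = pd.DataFrame({'qid': ['Q1', 'Q2'],
                        'country': ['Q183', None],
                        'citizenship': [None, 'Q40']})

geraus = allrecs.query('country == "Q183"').copy()  # .copy() yields an owned frame
geraus['classification'] = 'country_only'           # plain assignment, no warning
```

The same pattern applies to `notgeraus`; the resulting columns are identical, just without the view-versus-copy ambiguity.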

In [35]:
gerclass = geraus.pivot(index=geraus.index, columns='classification', values='qid')

In [28]:
print len(geraus)
print len(notgeraus)
print len(geraus) + len(notgeraus)
print len(allrecs)


213940
2348059
2561999
2561999

In [63]:
len(allrecs[allrecs['country'].apply(lambda x: isinstance(x,type(None)))]) /float(len(allrecs))


Out[63]:
0.7652766453070434

In [69]:
(len(notgeraus[notgeraus['classification'] == 'neither']) + len(geraus) ) /float(len(allrecs))


Out[69]:
0.6314986852063564
