My latest project has been WIGI, the Wikipedia Gender Inequality Index, whose aim is to take a long, systematic look at the gender representation of biography articles across all Wikipedias using Wikidata. Piotr Konieczny and I were looking at the evolution of gender ratios over time and by culture (see our open notebook research thread). In comparing the ratios to the absolute population sizes over time, we spotted a data problem with the biographies of people born in modern Protestant Europe. Investigating that data problem raised a new sociological question about the way we represent culture on Wikidata, specifically with regard to Germans and Austrians around WWII.
The first way I classified culture was by the place of birth property on Wikidata. I checked whether the value of place of birth was itself a country-item, or, if it was a city-item, whether that city-item had a country property. This led to a country determination for 23% of the data. Continuing with this sample population, we classified the countries into 9 "world cultures". Viewing the chart, the problem was a heavy drop in biographies represented in Protestant Europe around WWII.
In [71]:
from IPython.display import Image
Image('https://upload.wikimedia.org/wikipedia/commons/7/74/Culture_dob_later.png', width=720)
Out[71]:
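To make the place-of-birth classification concrete, here is a minimal sketch of the lookup against the public wbgetentities API. The function names and the first-claim simplification are mine (it also ignores novalue/somevalue snaks), and the real pipeline uses a precomputed place-of-birth map (the pobs_map loaded further down) rather than live API calls, so treat this as an illustration only.
In [ ]:
import requests

WIKIDATA_API = 'https://www.wikidata.org/w/api.php'

def get_claims(qid):
    # fetch the statements for a single Wikidata item
    params = {'action': 'wbgetentities', 'ids': qid, 'props': 'claims', 'format': 'json'}
    return requests.get(WIKIDATA_API, params=params).json()['entities'][qid].get('claims', {})

def country_of_birth(person_qid):
    # resolve P19 (place of birth) to a country QID, either directly or via the place's P17 (country)
    claims = get_claims(person_qid)
    if 'P19' not in claims:
        return None
    pob = claims['P19'][0]['mainsnak']['datavalue']['value']['id']
    pob_claims = get_claims(pob)
    if 'P17' in pob_claims:  # a city-item carrying a country property
        return pob_claims['P17'][0]['mainsnak']['datavalue']['value']['id']
    return pob  # the place of birth may itself be a country-item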
Piotr thought this was an error in the data, and sure enough, as I tucked into it, I noticed that many Germans and Austrians born after this time did not have a place of birth, but did have the property P27 (country of citizenship). So then I compared the rates of citizenship-data-having biographies of Germans and Austrians to the rest, by date of birth. Now 63% of our data could be mapped to a culture. Let's take a look:
In [41]:
for data, labelniggle in ((modgeraus, ''), (modnotgeraus, 'NOT ')):
    moddata = data.query('1800 < dob < 2000')
    r = pd.DataFrame(moddata.groupby(by=['dob', 'classification'])['qid'].count()).unstack()['qid'].plot()
    r.set_title('Classification available for Humans {}of "Germany" or "Austria" \n by Date of Birth'.format(labelniggle))
    r.set_xticks(range(1800, 2000, 10))
    r.set_xticklabels(range(1800, 2000, 10), rotation=90)
    r.set_xlabel('Date of Birth')
    r.set_ylabel('number of humans in Wikidata')
    plt.show()
For biographies whose citizenship or place of birth is Germany or Austria, there was a large switch from having both classifications to having only citizenship information. Why this should be so could be explained by several hypotheses. (A) Wikidata is incomplete, and place of birth information has not been merged from Wikipedias to Wikidata. (B) A sociological and psychological effect is present in what Germans and Austrians retrospectively use for cultural identification. (C) The trend reflects how Wikipedia and Wikidata editors are choosing to record history. There may be more hypotheses, and I would be glad to hear them. I don't know how to test any of these hypotheses very well. However, a look at the non-German, non-Austrian population reveals no similar classification shift, so at least we can rule out the hypothesis that this is owed to a global trend.
If you have a constructive theory or test, comment below or find me on Twitter @notconfusing.
Update: User:Saehrimnir, a friend from sudo room, wrote in to offer some history.
I checked a couple of entries and found an explanation. It's option __A)__. Once upon a time there was a bot https://www.wikidata.org/wiki/Special:Contributions/FischBot which harvested person data, mainly in the German Wikipedia I guess. It was working chronologically, and guess what, it stopped when it was at 1930. Other bots took over the work for en, but never for de, where most of the citizenship-only entries can be found exclusively. Why are there still so many with citizenship? That is likely due to Wikidata: The Game, at least it was for the samples I checked.

So there seems to be strong evidence that this is an artefact of the way that bots are migrating data into Wikidata. That is informative when hypothesizing about any other of Wikidata's data at this still early stage of the project.
By the way, the WIGI project is ongoing. I will start releasing monthly updates of inequality data soon, indexed by date of birth, date of death, place of birth, citizenship, ethnicity, and Wikipedia language. I invite you to play around with any ideas you might get by hacking on a rough example notebook on how to use the files.
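As a taste of what those files make possible, here is a rough sketch of my own, using the same snapshot CSV as the notebook below and assuming the gender column holds Wikidata QIDs (Q6581072 for female), that computes the female share of biographies by decade of birth:
In [ ]:
import pandas as pd

java_min_int = -2147483648
recs = pd.read_csv('snapshot_data/2014-10-13/gender-index-data-2014-10-13.csv', na_values=[java_min_int])
# take the first value of the pipe-delimited gender field, as split_column does below
recs['gender'] = recs['gender'].apply(lambda g: g.split('|')[0] if isinstance(g, str) else g)

modern = recs.query('1800 < dob < 2000').copy()
modern['decade'] = (modern['dob'] // 10) * 10
female_share = modern.groupby('decade')['gender'].apply(lambda g: (g == 'Q6581072').mean())  # Q6581072 = female
female_share.plot()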
In [2]:
import pandas as pd
import numpy
import json
%pylab inline
java_min_int = -2147483648  # sentinel for missing values in the source CSV; treated as NaN on read
In [3]:
allrecs = pd.read_csv('snapshot_data/2014-10-13/gender-index-data-2014-10-13.csv',na_values=[java_min_int]) #you can get this data in the github project WIGI
In [4]:
def split_column(q_str):
    if type(q_str) is float:
        if numpy.isnan(q_str):
            return q_str
    if type(q_str) is str:
        qs = q_str.split('|')
        return qs[0]  # cos the format will always end with a |
In [5]:
for col in ['place_of_birth', 'gender', 'citizenship', 'ethnic_group']:
    allrecs[col] = allrecs[col].apply(split_column)
In [6]:
allrecs.head(10)
Out[6]:
In [10]:
pobs_map = json.load(open('helpers/aggregation_maps/pobs_map.json', 'r'))

def map_country(qid):
    if not type(qid) is str:
        return None
    else:
        country_list = pobs_map[qid]
        if len(country_list) == 0:
            return None
        else:
            return country_list[0]

#TODO use citizenship as well.
allrecs['country'] = allrecs['place_of_birth'].apply(map_country)
In [11]:
# Q183 = Germany, Q40 = Austria
geraus = allrecs.query('''country == "Q183" or country == "Q40" or citizenship == "Q183" or citizenship == "Q40" or ethnic_group == "Q183" or ethnic_group == "Q40" ''')
notgeraus = allrecs.query('''country != "Q183" and country != "Q40" and citizenship != "Q183" and citizenship != "Q40" and ethnic_group != "Q183" and ethnic_group != "Q40" ''')
In [38]:
for data, labelniggle in ((geraus, ''), (notgeraus, 'OUT')):
    moddata = data.query('1800 < dob < 2000')
    p = moddata['dob'].hist(bins=199, xrot=90)
    p.set_title('Number of Humans with{} place of Birth in, or citizenship of "Germany" or "Austria" \n by Date of Birth'.format(labelniggle))
    p.set_xticks(range(1800, 2000, 10))
    plt.show()
In [12]:
def classification_type(row):
    has_country = True if row['country'] else False
    has_citizenship = True if str(row['citizenship']).lower() != 'nan' else False
    classification = 'neither'
    if has_country and has_citizenship:
        classification = 'both'
    if has_country and not has_citizenship:
        classification = 'country_only'
    if not has_country and has_citizenship:
        classification = 'citizenship_only'
    return classification
In [13]:
geraus['classification'] = geraus.apply(classification_type, axis=1)
notgeraus['classification'] = notgeraus.apply(classification_type, axis=1)
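As a quick sanity check, one could look at how the classifications are distributed in each group:
In [ ]:
print geraus['classification'].value_counts()
print notgeraus['classification'].value_counts()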
In [35]:
gerclass = geraus.pivot(index=geraus.index, columns='classification', values='qid')  # 'qid' assumed as the values column; the original argument was cut off
In [28]:
print len(geraus)
print len(notgeraus)
print len(geraus) + len(notgeraus)
print len(allrecs)
In [63]:
len(allrecs[allrecs['country'].apply(lambda x: isinstance(x, type(None)))]) / float(len(allrecs))  # fraction of records with no country determined from place of birth
Out[63]:
In [69]:
(len(notgeraus[notgeraus['classification'] == 'neither']) + len(geraus) ) /float(len(allrecs))
Out[69]: