In the previous post, I used the MOMA collection dataset to demonstrate some features of pandas, a python library that makes SQL-like operations relatively easy.
In this post I'm going to apply another python library to this dataset. sexmachine is a python library that determines the gender of a name. It requires Python 2.
Despite its charming name, sexmachine does not infer sex (it infers gender) and it doesn't use machine learning (it's a table of names.) But despite its limitations, we can use it to look at gross trends in the evolving gender ratio of the MOMA collection.
Determining gender from a name is a difficult thing to do accurately for many reasons, among them:
This is not a complete list of the problems with this approach, and sexmachine is not necessarily the best way of handling them. J. Nathan Mathias at MIT wrote a great post on this.
That said, provided we are willing to make the reasonable assumption that sexmachine screws up equally often for men and women, and consider only ratios we can make statements about trends in the gender split of the MOMA collection.
First let's recreate the environment from the previous post.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set_context('poster')
moma = pd.read_csv('Artworks.csv', index_col=12, parse_dates=[10])
moma = moma.dropna(subset=['DateAcquired'])
firsts = moma.drop_duplicates('Artist')
Then we define a helper function that takes a full name as a string and returns sexmachine's best guess for the gender of the first word in that string (which is assumed to be the first name)
In [2]:
import sexmachine.detector as detector # requires python2
g = detector.Detector()
def infer_gender(fullname):
try:
return g.get_gender(fullname.split()[0])
except:
return
Now we apply the infer_gender
function to the Artists
field in the dataset of unique artists, and take a look at the distributions of inferred gender.
In [3]:
firsts.loc[:, 'Gender'] = firsts['Artist'].apply(infer_gender)
firsts.groupby('Gender').size()
Out[3]:
As expected, its mostly male. But more improtantly, we see a couple of the problems with sexmachine. Names it cannot guess are called andy
(for androgynous?!). Names that it is not confident are mostly_male
or mostly_female
. Among the 12921 artists in the collection, nearly a quarter have first names whose gender sexmachine is unable to guess.
Depite this: we're now in a position to examine the how many artists in the MOMA collection who have first names that are usually female, as a fraction of the number of artists whose gender sexmachine determined with confidence.
This is not exactly what we're interested in, but it should at least be directionally correct (when this number goes up, the fraction of new artists added to the collection who really do identify as female has gone up).
And if, as discussed above, we're willing to assuming that sexmachine fails to correctly determine the gender of women as often as men, it should be more than just directionally correct; it should be pretty close to the right answer.
We create a new DataFrame to record the number of people in each gender newly added to the collection in five year buckets. We do this by grouping by two of the fields (DateAcquired
and Gender
). This yields a Series with a hierarchical index, which we turn into a regular DataFrame using the unstack()
method.
In [4]:
gender_trends = (firsts
.groupby([pd.Grouper(key='DateAcquired', freq='5A'), 'Gender'])
.size()
.unstack())
gender_trends['percent female'] = 100. * gender_trends['female'] / (gender_trends['male'] + gender_trends['female'])
In [5]:
ax = gender_trends['percent female'].plot()
ax.set_title('Estimated percentage of new artists added to the MOMA collection who are female')
ax.set_xlabel('')
ax.plot(ax.get_xlim(), [50, 50], 'r--')
ax.set_ylim(0, 100);
Assuming you think it's desirable that the MOMA collection represents female artists, there's bad news and good news from this plot:
Seaborn has a nice function to generate a regression plot (i.e. fit a trendline to this data) with uncertainties, which we can use to see what lies in the future if this trend continues at the present rate.
In [6]:
fig, ax = plt.subplots()
ax.set_ylim(0, 100)
ax.set_xlim(1930, 2300)
ax = sns.regplot(x=gender_trends.reset_index()['DateAcquired'].apply(lambda x: x.year),
y=gender_trends["percent female"])
ax.set_title('Estimated percentage of new artists added to the MOMA collection who are female')
ax.set_xlabel('')
ax.set_ylabel('')
ax.plot(ax.get_xlim(), [50, 50], 'r--');
So, at the present rate, half the new artists MOMA acquires will be female by the mid 22nd century, just after the institution's 200th anniversay in 2129.