As a social media platform, Twitter chose not to display the gender of its users. However, as humans, we can often guess the gender of most users from the features displayed on their profile. For statistical purposes, it can be useful to determine the gender of users automatically. The objective of this project is to find features that can help determine a Twitter user's gender using machine learning algorithms.
This project is based on a Kaggle dataset. We will first explore the data, then train some models and test how well they predict the gender. Finally, we will collect another dataset and try to apply our models to it.
The dataset we will use is the Twitter User Gender Classification dataset made available by CrowdFlower. It contains 20,000 entries, each of them being a tweet from a different user, together with many associated features, which are listed here:
Most of these features are not relevant for our analysis; we will focus on only a few of them: the colors of the sidebar and links, the text in the description and in the tweets, and finally the content of the profile picture.
In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter
from IPython.display import display
# graphs
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import ndimage
# 3D visualization
import pylab
from mpl_toolkits.mplot3d import Axes3D
# machine learning
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model, metrics
from sklearn import naive_bayes
from sklearn import neural_network
# our own helper functions
from twtgender import *
# we need latin-1 encoding because the file contains special characters (é, ...) that are not valid UTF-8
dataFrame = pd.read_csv('gender-classifier-DFE-791531.csv', encoding='latin-1')
# Show a sample of the dataset
dataFrame.head()
Out[1]:
The first features we are going to use for our analysis are link_color and sidebar_color. On Twitter, it is possible to personalize your account by changing the colors of the links or of the sidebar, and we expect people of different genders to personalize their page differently. For example, we can expect female users to favor "girly" colors such as pink or purple, while male users would keep it more "manly", perhaps with some blue.
We wrote the colorsGraphs function to extract and plot the most used sidebar and link colors for each gender. As a color is not especially easy to picture from its HEX code, we found the plots easier to read when each bar is drawn in its associated color. The first thing we notice is that most users do not personalize their page much and keep one of the standard Twitter themes, regardless of their gender. To better visualize how personalization differs, we removed these most used themes from the bar graphs.
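For reference, here is a minimal sketch of what such a plotting helper could look like (the actual colorsGraphs implementation lives in our twtgender module; the HEX-validity filter and the skip parameter below are assumptions of this sketch):
import matplotlib.pyplot as plt

def plot_top_colors(df, column, gender, n_top=10, skip=1):
    """Sketch: plot the n_top most used colors in `column` for one gender,
    drawing each bar in its associated color. `skip` drops the most common
    (default-theme) colors so the personalized choices become visible."""
    counts = df.loc[df['gender'] == gender, column].str.lower().value_counts()
    counts = counts[counts.index.str.match(r'^[0-9a-f]{6}$')]  # keep valid 6-digit HEX codes only
    counts = counts.iloc[skip:skip + n_top]
    fig, ax = plt.subplots()
    ax.barh(range(len(counts)), counts.values, color=['#' + c for c in counts.index])
    ax.set_yticks(range(len(counts)))
    ax.set_yticklabels(counts.index)
    ax.set_title('{} - {}'.format(column, gender))
    plt.show()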
In [2]:
# Data Exploration - Colors
colorsGraphs(dataFrame, 'sidebar_color', 1, 4)
colorsGraphs(dataFrame, 'link_color', 1, 1)
From the graphs, we can draw the following conclusions:
These are only intuitions confirmed by the data; to actually predict the gender from the colors, we need a prediction model.
Now that we have seen how users personalize the colors of their pages, let's have a deeper look at what they actually write on Twitter. Here, we will explore both the text from the user descriptions and the text from the tweets themselves. As these two texts live in different columns of the dataframe, we first need to process them a bit. The first step is to normalize the text by removing separators such as commas and converting everything to lowercase; to do so, we wrote the text_normalizer function. We then concatenated the description and tweet texts. Finally, we wrote the compute_bag_of_words and print_most_frequent functions to visualize which words are most used by each gender; sketches of these helpers are shown below.
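For reference, here are minimal sketches of these three helpers (the real implementations live in twtgender; the exact separator set and the output format are assumptions of the sketch):
import re
from sklearn.feature_extraction.text import CountVectorizer

def text_normalizer_sketch(s):
    """Sketch: lowercase the text and replace common separators by spaces."""
    if not isinstance(s, str):  # descriptions can be missing (NaN)
        return ''
    return re.sub(r'[,.;:!?/()"]', ' ', s.lower())

def compute_bag_of_words_sketch(texts):
    """Sketch: return the sparse word-count matrix and vocabulary of a corpus."""
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(texts)
    vocab = vectorizer.get_feature_names()  # get_feature_names_out() in newer sklearn
    return bow, vocab

def print_most_frequent_sketch(bow, vocab, gender, n_top=20):
    """Sketch: print the n_top most frequent words of a bag of words."""
    counts = bow.sum(axis=0).A1  # total count of each word over the corpus
    for idx in counts.argsort()[::-1][:n_top]:
        print('{}: {} ({})'.format(gender, vocab[idx], int(counts[idx])))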
In [3]:
# Data Exploration - Text
# Normalize the text in the descriptions and tweet messages,
# adding columns to the dataframe containing the normalized texts
dataFrameText = dataFrame.copy()  # work on a copy so the original dataframe stays untouched
dataFrameText['text_norm'] = [text_normalizer(s) for s in dataFrameText['text']]
dataFrameText['description_norm'] = [text_normalizer(s) for s in dataFrameText['description']]
# Now let's put all the interesting text, i.e. the description and the tweet itself, in one string for each tweet
dataFrameText['all_text'] = dataFrameText['text_norm'].str.cat(dataFrameText['description_norm'], sep=' ')
# Keep confident, labeled entries and drop malformed link colors (codes mangled into scientific notation)
dataFrameText = dataFrameText[(dataFrameText['gender:confidence'] == 1)
                              & (dataFrameText['gender'] != 'unknown')
                              & (dataFrameText['link_color'].str.contains(r'E\+') != True)]
# Extract separate gender dataframes
male_data = dataFrameText[dataFrameText['gender'] == 'male']
female_data = dataFrameText[dataFrameText['gender'] == 'female']
brand_data = dataFrameText[dataFrameText['gender'] == 'brand']
male_data.head()
male_bow, male_voc = compute_bag_of_words(male_data['all_text'])
print_most_frequent(male_bow, male_voc, 'male', feature = 'all_text')
female_bow, female_voc = compute_bag_of_words(female_data['all_text'])
print_most_frequent(female_bow, female_voc, 'female', feature = 'all_text')
brand_bow, brand_voc = compute_bag_of_words(brand_data['all_text'])
print_most_frequent(brand_bow, brand_voc, 'brand', feature = 'all_text')
# the most frequent words are common function words; nothing gender-specific stands out
The results are not quite as conclusive as with the colors. The most used words, regardless of gender, are very common words such as "the", "and", "to" or "of", which give us no information about the gender. One interesting thing we noticed is that brands tend to use the words "weather", "channel" and "news" more than regular male and female users, which means our dataset probably contains many news or weather channel accounts. Another interesting fact concerns the usage of the word "https": brands seem to post more links than standard users.
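One quick way to look past these function words would be to discard English stop words before counting; CountVectorizer supports this directly. A sketch, reusing the male corpus built above:
from sklearn.feature_extraction.text import CountVectorizer

# Sketch: recount the male corpus without English stop words ("the", "and", "to", ...)
vectorizer = CountVectorizer(stop_words='english')
male_bow_ns = vectorizer.fit_transform(male_data['all_text'])
counts = male_bow_ns.sum(axis=0).A1
vocab = vectorizer.get_feature_names()
print([vocab[i] for i in counts.argsort()[::-1][:10]])  # content words should now dominate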
Using the profile picture information is a bit more difficult than using simple text or color codes. The first thing we need to do is to extract a description of the picture content from the picture itself. To do so, we used the Clarifai API; however, as the process takes approximately 12 hours to run on the whole dataFrame, we do not recommend running the code. Instead, we created a new dataFrame containing all the picture content keywords, which we will use in the further analysis.
Now that we have the content of the profile pictures as text, we can run the same data exploration process as earlier, and see which content is most used by which gender.
In [4]:
# Data Exploration - Pictures
#import imghdr
#import requests
#import io
#import urllib.request as ur
#from clarifai.rest import ClarifaiApp
#app = ClarifaiApp()
## We used the Clarifai API to do the image content extraction, which takes about 12 hours.
## We therefore commented out this code to avoid running it again.
#raw_data['pic_text'] = ' '
#for i in range(20048, 20049):
#    url = raw_data['profileimage'][i]
#    index = url.rfind('.')
#    url_new = url[:index-7] + url[index:]  # strip the size suffix (e.g. '_normal') to get the full-size picture
#    if '.gif' not in url_new:
#        if 'pb.com' not in url_new:
#            response = requests.get(url_new)
#            if response.status_code != 404:
#                result = app.tag_urls([url_new])
#                for k in range(0, 4):  # keep the 4 most confident concepts
#                    if k == 0:
#                        raw_data['pic_text'][i] = result['outputs'][0]['data']['concepts'][k]['name']
#                    else:
#                        raw_data['pic_text'][i] = raw_data['pic_text'][i] + ' ' + result['outputs'][0]['data']['concepts'][k]['name']
##                print(i)
##                print(raw_data['pic_text'][i])
#new_data = raw_data
#df = pd.DataFrame(new_data, columns=['gender', 'gender:confidence', 'profileimage', 'pic_text'])
#df.to_csv('new_data.csv')
new_data=pd.read_csv('new_data.csv',encoding='latin-1')
#Show a sample of the dataset
new_data.head()
Out[4]:
In [5]:
#create separate gender dataFrames
male_data = new_data[(new_data['gender']=='male')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
female_data = new_data[(new_data['gender']=='female')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
brand_data = new_data[(new_data['gender']=='brand')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
genderConf = new_data[(new_data['gender:confidence']==1)&(new_data['gender']!='unknown')&(new_data['pic_text']!=' ')]
male_bow, male_voc = compute_bag_of_words(male_data['pic_text'])
print_most_frequent(male_bow, male_voc,'male', feature = 'pic_text')
female_bow, female_voc = compute_bag_of_words(female_data['pic_text'])
print_most_frequent(female_bow, female_voc,'female', feature = 'pic_text')
brand_bow, brand_voc = compute_bag_of_words(brand_data['pic_text'])
print_most_frequent(brand_bow, brand_voc,'brand', feature = 'pic_text')
The results here are very interesting. First, since it is supposed to be a "profile picture", most male and female users use a picture of themselves, which is why the words "adult", "people", "portrait" and, respectively, "man" and "woman" are the most recurrent. On the brands' side, most of the pictures are tagged "symbol", "illustration" or "design", which means they are probably brand logos.
We wrote the predictors function to extract the best predictors and anti-predictors of a specific feature for gender prediction. Here, we applied it to link_color, using different linear models for the prediction. We chose linear models first because they are simple, yet good enough to be efficient, and have a nice implementation in the sklearn library.
More specifically, some of these models have an attribute called coef_ which gives the weight of each word (here, each color HEX code) in the model. A high weight of a word for a given gender means that a user employing that word has a strong probability of being of that gender; a sketch of how these weights can be read off a fitted model is shown below.
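Concretely, once one of these models is fitted on the bag-of-words matrix, the predictors and anti-predictors can be read directly off coef_. A minimal sketch (the actual predictors function in twtgender also handles the train/test split, the accuracy computation and the plots):
import numpy as np

def top_predictors_sketch(model, vocab, n_top=10):
    """Sketch: for each class of a fitted linear model, print the words with the
    largest positive (predictors) and most negative (anti-predictors) weights."""
    for class_idx, class_name in enumerate(model.classes_):
        order = np.argsort(model.coef_[class_idx])
        print(class_name, 'predictors:     ', [vocab[i] for i in order[::-1][:n_top]])
        print(class_name, 'anti-predictors:', [vocab[i] for i in order[:n_top]])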
First, we performed the classification using the color features:
In [6]:
# Classifier - Colors
dataFrameColor = dataFrame.loc[:, ['gender:confidence', 'gender', 'link_color']]
dataFrameColorFiltered = dataFrameColor[(dataFrameColor['gender:confidence'] == 1)
                                        & (dataFrameColor['link_color'].str.contains(r'E\+') != True)
                                        & (dataFrameColor['gender'] != 'unknown')]
feature = 'link_color'
df = dataFrameColorFiltered
# List of the classifiers we tested
modelListColor = [linear_model.RidgeClassifier(),
                  linear_model.SGDClassifier(),
                  linear_model.LogisticRegression(),
                  linear_model.PassiveAggressiveClassifier(),
                  naive_bayes.MultinomialNB(),
                  neural_network.MLPClassifier()]
modelNamesList = ['Ridge Classifier',
                  'SGD Classifier',
                  'Logistic regression',
                  'Passive Aggressive Classifier',
                  'Multinomial NB',
                  'Multi-layer Perceptron']
acc_color = np.zeros(len(modelListColor))
for i in range(0, len(modelListColor)):
    modelName = modelNamesList[i]
    modelListColor[i], voc_color, acc_color[i] = predictors(df, feature, modelListColor[i], modelName, displayResults=False, displayColors=True)
fig, ax1 = plt.subplots()
ax1.set_xlim([0, 1])
bar_width = 0.5
model_number = np.arange(len(modelListColor)) + 1
plt.barh(model_number, acc_color, bar_width, color='#f5abb5')
plt.yticks(model_number, modelNamesList)
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.title('Accuracy of the different Classifiers')
plt.tight_layout()
plt.show()
We observe that all the models yield approximately the same accuracy, around 40-45%.
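To put this number in perspective, it helps to compare it with the baseline of always predicting the most frequent class (a quick sketch on the filtered color dataframe):
# Sketch: accuracy of a trivial classifier that always predicts the majority class
baseline = df['gender'].value_counts(normalize=True).max()
print('Majority-class baseline: {:.1%}'.format(baseline))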
In [7]:
modelName = modelNamesList[2]
modelListColor[2], voc_color, acc_color[2] = predictors(df, feature, modelListColor[2], modelName, displayResults = True, displayColors=True)
From these bar graphs, we can see that our intuitions are confirmed by the models. The strongest female color predictors are almost all pink, red or purple shades, and these colors are also quite strong anti-predictors for both males and brands. However, the linear models only achieve about 45% accuracy in predicting the gender from the colors alone. Once again, this is mostly because the vast majority of users do not change their sidebar and link colors.
In [8]:
# Classifier - Text
# Looking at the most used words per gender doesn't yield anything particular since we all use the same common words,
# so let's try to find predictors
feature = 'all_text'
df = dataFrameText[dataFrameText['_golden'] == False]
modelListText = [linear_model.RidgeClassifier(),
                 linear_model.SGDClassifier(),
                 linear_model.LogisticRegression(),
                 linear_model.PassiveAggressiveClassifier(),
                 naive_bayes.MultinomialNB(),
                 neural_network.MLPClassifier()]
acc_text = np.zeros(len(modelListText))
for i in range(0, len(modelListText)):
    modelName = modelNamesList[i]
    modelListText[i], voc_text, acc_text[i] = predictors(df, feature, modelListText[i], modelName, displayResults=False)
fig, ax1 = plt.subplots()
ax1.set_xlim([0, 1])
bar_width = 0.5
model_number = np.arange(len(modelListText)) + 1
plt.barh(model_number, acc_text, bar_width, color='#f5abb5')
plt.yticks(model_number, modelNamesList)
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.title('Accuracy of the different Classifiers')
plt.tight_layout()
plt.show()
We observe that the models yield approximately the same accuracy, around 60-65%.
In [9]:
modelName = modelNamesList[2]
modelListText[2], voc_text, acc_text[2] = predictors(df, feature, modelListText[2], modelName, displayResults = True)
Here, as the data is much more varied and meaningful than simple color codes, we manage to obtain a prediction accuracy of about 65%. Strong predictors for male users are words such as "father", "boy", "man" or "niggas", while predictors for female users are "mom", "girl", "feminist" or "makeup"; of course, these words are anti-predictors of the opposite gender.
From the anti-predictors, it seems that female users do not tweet about sports ("player", "hit", "team", "season", "game"), while male users are less likely to tweet about girls ("girl", "mother", "queen").
On the brands' side, we see that our intuitions are confirmed, as posting a link ("https") or tweeting about "news" and "weather" are typical of the brand "gender".
Finally, some predictors for the female gender might look quite odd, for example "_ù" or "ï_", but we think these are fragments of emoji encodings. However, we did not manage to find which ones due to encoding issues.
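Our best guess is that these tokens are mojibake: emoji bytes that were UTF-8 encoded in the original file but read here as latin-1. When the tokenizer has not truncated the byte sequence, re-encoding a token to latin-1 and decoding it as UTF-8 recovers the original character; a sketch:
def try_recover_emoji(token):
    """Sketch: undo a latin-1 mis-decoding of UTF-8 bytes."""
    try:
        return token.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # byte sequence truncated mid-character: unrecoverable

print(try_recover_emoji('ï_'))  # returns None when the emoji bytes are incomplete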
Now, let's finally apply our classifiers using the profile picture contents:
In [10]:
# Classifier - Pictures
# Let's find which picture contents predict each gender
feature = 'pic_text'
df = genderConf
modelListPic = [linear_model.RidgeClassifier(),
                linear_model.SGDClassifier(),
                linear_model.LogisticRegression(),
                linear_model.PassiveAggressiveClassifier(),
                naive_bayes.MultinomialNB(),
                neural_network.MLPClassifier()]
acc_pic = np.zeros(len(modelListPic))
for i in range(0, len(modelListPic)):
    modelName = modelNamesList[i]
    modelListPic[i], voc_pic, acc_pic[i] = predictors(df, feature, modelListPic[i], modelName, displayResults=False)
fig, ax1 = plt.subplots()
ax1.set_xlim([0, 1])
bar_width = 0.5
model_number = np.arange(len(modelListPic)) + 1
plt.barh(model_number, acc_pic, bar_width, color='#f5abb5')
plt.yticks(model_number, modelNamesList)
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.title('Accuracy of the different Classifiers')
plt.tight_layout()
plt.show()
We observe that the models yield approximately the same accuracy, around 80-85%.
In [11]:
modelName = modelNamesList[2]
modelListPic[2], voc_pic, acc_pic[2] = predictors(df, feature, modelListPic[2], modelName, displayResults = True)
As the content of the profile picture is very representative of the user, the classifiers manage to reach up to 85% accuracy, which is quite impressive. However, the predictors are not exactly as we expected them to be: although "man" and "woman" are among the best predictors for their respective genders, we expected them to carry much more weight in most of the classifiers.
Also, it seems that we have many Twitter accounts of "actresses", or maybe many female users use pictures of actresses as their profile pictures. Unsurprisingly, "bikini" is an anti-predictor for male users. What is more surprising, though, is that "leather" and "lips" are strong predictors for brands.
Having predicted the gender from the text, the profile picture content and the theme color, we now want to combine the results of the three models, weighted by their accuracies, in order to improve the prediction. (For example, the theme color and the profile picture may not be informative for a given user, in which case the text will drive the decision, and so on.) We obtain the new probability for each class as follows: $$p(X=k) = \frac{\sum_{i=1}^{3}{\alpha_{i}\,p_{i}(X=k)}}{\sum_{i=1}^{3}{\alpha_{i}}}$$ We took $\alpha_{i} = e^{10\,\mathrm{acc}_{i}}$, with $\mathrm{acc}_{i}$ being the accuracy of the model for feature $i$.
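A minimal sketch of this weighted combination, assuming the three fitted models expose predict_proba with the same class ordering (true of the logistic regression models we use below; the actual combine_features in twtgender also handles the vectorization of each feature):
import numpy as np

def combine_probas_sketch(probas, accuracies):
    """Sketch: weighted average of per-model class probabilities,
    with weights alpha_i = exp(10 * acc_i)."""
    alphas = np.exp(10 * np.asarray(accuracies))
    probas = np.asarray(probas)  # shape: (3 models, n classes)
    return (alphas[:, None] * probas).sum(axis=0) / alphas.sum()

# e.g. the accurate picture model dominates the weaker color model
p = combine_probas_sketch([[0.2, 0.7, 0.1],   # text model:    brand / male / female
                           [0.3, 0.4, 0.3],   # picture model
                           [0.4, 0.3, 0.3]],  # color model
                          [0.65, 0.85, 0.45])
print(p, p.sum())  # still a valid probability distribution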
In [12]:
text_test2 = "bros in Paris, gospel with the squad and my father sport team"
pic_test2 = "man sunglass portrait"
color_test2 = "cc3300"
# build a one-row test dataframe (pd.DataFrame.from_items is deprecated, so we use a dict)
dataFrametest2 = pd.DataFrame({'all_text': [text_test2], 'pic_text': [pic_test2],
                               'link_color': [color_test2], 'user_name': ['blabla'],
                               'gender': ['male']})
dataFrametest2.head()
# pass the weights alpha_i = exp(10*acc_i) of the logistic regression models, as in the later cells
resultList = combine_features(modelListText[2], modelListPic[2], modelListColor[2], dataFrametest2,
                              voc_text, voc_pic, voc_color,
                              np.exp(10*acc_text[2]), np.exp(10*acc_pic[2]), np.exp(10*acc_color[2]))
We see that providing the model with a fake, stereotypical example of tweet text, profile picture features and link_color gives the expected result.
We then wanted to test our model on an external dataset, which we created using the Twitter API together with the Clarifai content extraction API. We selected 48 Twitter profiles, 16 of each class (male, female, brand), including writers, comedians and public figures from various backgrounds, and extracted the profile picture features, the theme color and the text of 100 tweets for each. We then applied the model combining the features as explained above and got the following results.
In [13]:
# import configparser
# import pandas as pd
# import tweepy  # pip install tweepy
# # Twitter API credentials (redacted here); we need to update our key and secret every time we predict
# consumer_key = 'REDACTED'
# consumer_secret = 'REDACTED'
# access_token = 'REDACTED'
# access_token_secret = 'REDACTED'
# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_token_secret)
# api = tweepy.API(auth)
# n = 100  # number of tweets we want to get per user
# id_list = [('realDonaldTrump', 'male'), ('augusten', 'male'), ('Eminem', 'male'), ('IrvineWelsh', 'male'), ('alaindebotton', 'male'),
#            ('joedunthorne', 'male'), ('paulocoelho', 'male'), ('tejucole', 'male'), ('Shteyngart', 'male'), ('KingJames', 'male'),
#            ('stephenathome', 'male'), ('jimmyfallon', 'male'), ('normaltweetguy', 'male'), ('beer4agoodtime', 'male'), ('BasedMelGibson', 'male'), ('BigMoneyRalph', 'male'),
#            ('askanyone', 'female'), ('jackiejcollins', 'female'), ('JoyceCarolOates', 'female'),
#            ('rihanna', 'female'), ('JLo', 'female'), ('Beyonce', 'female'), ('Sia', 'female'), ('MileyCyrus', 'female'), ('CharlizeAfrica', 'female'), ('theashleygraham', 'female'),
#            ('missjerrikak', 'female'), ('iyana93', 'female'), ('ihateMORGZ', 'female'), ('nose_inthee_air', 'female'), ('70sluvchild', 'female'), ('lovepreciousway', 'female'),
#            ('McDonalds', 'brand'), ('Nike', 'brand'), ('UNDEFEATEDinc', 'brand'), ('CocaCola', 'brand'), ('CanadaGooseInc', 'brand'),
#            ('Hersheys', 'brand'), ('Nestle', 'brand'), ('Oracle', 'brand'), ('LouisVuitton', 'brand'), ('omegawatches', 'brand'),
#            ('DeloitteUS', 'brand'), ('Danone', 'brand'), ('easyJet', 'brand'), ('Toyota', 'brand'), ('CNN', 'brand'), ('Infosys', 'brand')]
# def createDataFrame(id_list):
#     df = pd.DataFrame(columns=['user_name', 'description', 'text', 'profile_image_url',
#                                'profile_background_color', 'profile_sidebar_border_color', 'gender', 'pic_text'])
#     for user, gender in id_list:
#         tweet = api.get_user(user)
#         serie = dict(user_name=user, description=tweet.description, profile_image_url=tweet.profile_image_url, gender=gender)
#         serie.update(dict(profile_background_color=tweet.profile_background_color,
#                           profile_sidebar_border_color=tweet.profile_sidebar_border_color))
#         tw = ''
#         for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
#             tw = tw + ' ' + tweet.text
#         serie.update(dict(text=tw))
#         df = df.append(serie, ignore_index=True)
#     return df
# # run the function and get our results
# results = createDataFrame(id_list)
# # extract the profile picture contents with the Clarifai API, as before
# import requests
# from clarifai.rest import ClarifaiApp
# app = ClarifaiApp()
# for i in range(0, len(id_list)):
#     url = results['profile_image_url'][i]
#     index = url.rfind('.')
#     url_new = url[:index-7] + url[index:]
#     if '.gif' not in url_new:
#         if 'pb.com' not in url_new:
#             response = requests.get(url_new)
#             if response.status_code != 404:
#                 result = app.tag_urls([url_new])
#                 for k in range(0, 4):
#                     if k == 0:
#                         results['pic_text'][i] = result['outputs'][0]['data']['concepts'][k]['name']
#                     else:
#                         results['pic_text'][i] = results['pic_text'][i] + ' ' + result['outputs'][0]['data']['concepts'][k]['name']
# results
# # save as csv
# results.to_csv('test_data_utf.csv', encoding='utf-8')
We apply our feature-combining function to the newly created dataset, using the logistic regression model.
In [14]:
dataFrametest4 = pd.read_csv('test_data_utf.csv', encoding='utf-8')
dataFrametest4['link_color'] = dataFrametest4['profile_background_color']
dataFrametest4['all_text'] = [text_normalizer(s) for s in dataFrametest4['description'].str.cat(dataFrametest4['text'])]
dataFrametestfem = dataFrametest4[dataFrametest4['gender']=='female']
overallList = combine_features(modelListText[2], modelListPic[2], modelListColor[2], dataFrametest4, voc_text, voc_pic, voc_color, np.exp(10*acc_text[2]), np.exp(10*acc_pic[2]), np.exp(10*acc_color[2]))
In [15]:
resultList = overallList
display_resultList(resultList)
We observe that the brands are well classified, while female users are misclassified more often. The dataset is relatively small, but overall the model works well.
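To see exactly where the errors land, a confusion matrix is more informative than the overall accuracy. A sketch, assuming the true and predicted labels have been extracted from resultList into two lists (the lists below are hypothetical):
from sklearn.metrics import confusion_matrix

# hypothetical labels; rows are true classes, columns are predicted classes
y_true = ['male', 'female', 'brand', 'female']
y_pred = ['male', 'male',   'brand', 'female']
print(confusion_matrix(y_true, y_pred, labels=['brand', 'female', 'male']))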
In [16]:
dataFrameTextWithPics = dataFrameText.copy()  # copy so dataFrameText is left untouched
dataFrameTextWithPics['pic_text'] = new_data['pic_text']
dataFrameTextWithPics = dataFrameTextWithPics[dataFrameTextWithPics['pic_text'] != ' ']
In [17]:
overallList = combine_features(modelListText[2], modelListPic[2], modelListColor[2], dataFrameTextWithPics, voc_text, voc_pic, voc_color, np.exp(10*acc_text[2]), np.exp(10*acc_pic[2]), np.exp(10*acc_color[2]),display = False)
display_resultList(overallList)
By combining the features, we manage to increase the prediction accuracy on the original dataset to 89.3%.
At the end of this project, we can see that it is possible to predict the gender of a Twitter user from the profile picture, the text of the tweets and the theme color, with good accuracy. However, as our dataset contained more brands and males than females, the accuracy is better for these two classes; a dataset with more data for each gender would be desirable.
The profile picture often contains the most valuable information, as it is usually very representative of its user, whereas the theme color, which users seldom personalize, is more difficult to exploit.
The model that gave the best overall results was logistic regression.
We only trained simple models here; combining the features more efficiently or using more complex models might yield better results, as sketched below.
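As one illustration of such a direction, TF-IDF weighting and a soft-voting ensemble could replace the raw counts and the single linear model; a sketch under the same dataframe assumptions, with untuned hyperparameters:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# Sketch: TF-IDF features + a soft vote between two probabilistic models
clf = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    VotingClassifier([('lr', LogisticRegression()),
                      ('nb', MultinomialNB())], voting='soft'))
scores = cross_val_score(clf, dataFrameText['all_text'], dataFrameText['gender'], cv=5)
print(scores.mean())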