Twitter users gender classification

Ramet Gaétan, Schloesing Benjamin, Yao Yuan

Introduction

As a social media platform, Twitter chose not to display the gender of its users. As humans, however, we can often guess a user's gender from the features displayed on their profile. Inferring the gender automatically can be useful for statistical purposes. The objective of this project is to find features that help determine a Twitter user's gender using machine learning algorithms.

This project is based on a Kaggle dataset. We first explore the data, then train several models and test how well they predict the gender. Finally, we collect a second dataset and apply our models to it.

Step 1: Import data

The dataset we use is the Twitter User Gender Classification dataset made available by Crowdflower. This dataset contains 20000 entries, each being a tweet from a different user, together with many associated features, listed here:

  • _unit_id : a unique id for each user
  • _golden : a boolean which states whether the user is included in the golden standard for the model
  • _unit_state : the state of the observation, either golden for gold standards or finalized for contributor-judged
  • _trusted_judgments : the number of judgments on a user's gender. 3 for non-golden, or a unique id for golden
  • _last_judgment_at : date and time of the last judgment, blank for golden observations
  • gender : either male, female or brand for non-human profiles
  • gender:confidence : a float representing the confidence of the gender judgment
  • profile_yn : either yes or no, no meaning that the user's profile was not available when contributors went to judge it
  • profile_yn:confidence : confidence in the existence/non-existence of the profile
  • created : date and time of when the profile was created
  • description : the user's Twitter profile description
  • fav_number : the number of tweets the user has favorited
  • gender_gold : the gender if the profile is golden
  • link_color : the link color of the profile as a hex value
  • name : the Twitter user's name
  • profile_yn_gold : yes or no whether the profile y/n value is golden
  • profileimage : a link to the profile image
  • retweet_count : the number of times the user has retweeted something
  • sidebar_color : color of the profile sidebar as a hex value
  • text : text of a random tweet from the user
  • tweet_coord : if the location was available at the time of the tweet, the coordinates as a string with the format [latitude, longitude]
  • tweet_count : the number of tweets the user has posted
  • tweet_created : the time of the random tweet in text
  • tweet_id : the tweet id of the random tweet
  • tweet_location : the location of the tweet, based on the coordinates
  • user_timezone : the timezone of the user

Most of these features are not relevant to our analysis. We focus on only a few of them: the sidebar and link colors, the text of the description and of the tweets, and finally the content of the profile picture.


In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import re

#graph
from bokeh.plotting import output_notebook, figure, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource

%matplotlib inline 
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy import ndimage

# 3D visualization
import pylab
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot

from collections import Counter

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import linear_model, metrics
from sklearn import naive_bayes
from sklearn import neural_network

from twtgender import *


# The file is not UTF-8 encoded, so the default decoding fails on some special characters (é,...); we read it as latin-1 instead
dataFrame = pd.read_csv('gender-classifier-DFE-791531.csv', encoding='latin-1')

#Show a sample of the dataset
dataFrame.head()


Out[1]:
_unit_id _golden _unit_state _trusted_judgments _last_judgment_at gender gender:confidence profile_yn profile_yn:confidence created ... profileimage retweet_count sidebar_color text tweet_coord tweet_count tweet_created tweet_id tweet_location user_timezone
0 815719226 False finalized 3 10/26/15 23:24 male 1.0000 yes 1.0 12/5/13 1:48 ... https://pbs.twimg.com/profile_images/414342229... 0 FFFFFF Robbie E Responds To Critics After Win Against... NaN 110964 10/26/15 12:40 6.587300e+17 main; @Kan1shk3 Chennai
1 815719227 False finalized 3 10/26/15 23:30 male 1.0000 yes 1.0 10/1/12 13:51 ... https://pbs.twimg.com/profile_images/539604221... 0 C0DEED ‰ÛÏIt felt like they were my friends and I was... NaN 7471 10/26/15 12:40 6.587300e+17 NaN Eastern Time (US & Canada)
2 815719228 False finalized 3 10/26/15 23:33 male 0.6625 yes 1.0 11/28/14 11:30 ... https://pbs.twimg.com/profile_images/657330418... 1 C0DEED i absolutely adore when louis starts the songs... NaN 5617 10/26/15 12:40 6.587300e+17 clcncl Belgrade
3 815719229 False finalized 3 10/26/15 23:10 male 1.0000 yes 1.0 6/11/09 22:39 ... https://pbs.twimg.com/profile_images/259703936... 0 C0DEED Hi @JordanSpieth - Looking at the url - do you... NaN 1693 10/26/15 12:40 6.587300e+17 Palo Alto, CA Pacific Time (US & Canada)
4 815719230 False finalized 3 10/27/15 1:15 female 1.0000 yes 1.0 4/16/14 13:23 ... https://pbs.twimg.com/profile_images/564094871... 0 0 Watching Neighbours on Sky+ catching up with t... NaN 31462 10/26/15 12:40 6.587300e+17 NaN NaN

5 rows × 26 columns

Step 2: Data exploration

Color features exploration

The first features we use for our analysis are link_color and sidebar_color. On Twitter, users can personalize their account by changing the colors of the links or the sidebars, and we expect users of different genders to personalize their pages differently. For example, we might expect female users to pick more "girly" colors such as pink or purple, while male users would keep to more "manly" colors such as blue.

We wrote the colorsGraphs function to extract and plot the colors most used for sidebars and for links by each gender. Since a color is hard to recognize from its hex code, we plot each bar in the color it represents. The first thing we notice is that most users do not personalize their page much and keep one of the standard Twitter themes, regardless of their gender. To better visualize how the personalization differs, we removed these most used themes from the bar graphs.
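The colorsGraphs function lives in the twtgender module and is not shown here. The counting step it relies on can be sketched as follows; this is only an illustration, and the function name top_colors, the toy dataframe and the set of default theme colors are assumptions rather than the code actually used.

```python
from collections import Counter

import pandas as pd

# Hypothetical set of default theme colors to hide; the list actually
# removed by colorsGraphs is not shown in this notebook.
DEFAULT_COLORS = {"C0DEED", "0084B4"}


def top_colors(frame, column, gender, n=5, ignored=DEFAULT_COLORS):
    """Return the n most common hex colors in `column` for one gender."""
    colors = frame.loc[frame["gender"] == gender, column].dropna().astype(str)
    # Hex codes are compared case-insensitively and default themes are skipped
    counts = Counter(c.upper() for c in colors if c.upper() not in ignored)
    return counts.most_common(n)


# Toy data standing in for the real dataframe
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "link_color": ["FF69B4", "ff69b4", "0084B4", "0084B4", "009999"],
})
female_top = top_colors(toy, "link_color", "female")
male_top = top_colors(toy, "link_color", "male")
```

Each (color, count) pair can then be drawn as a bar filled with its own color, for instance by prefixing the hex code with "#" when passing it to the plotting library.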


In [2]:
#Data Exploration Colors
colorsGraphs(dataFrame, 'sidebar_color', 1, 4)
colorsGraphs(dataFrame, 'link_color', 1, 1)


From the graphs, we can draw the following conclusions:

  • First, users seem to change their link color more often than their sidebar color.
  • Female users do show a preference for purple, pink and red, while male users tend to use more green and blue. Brands usually have their pages in blue or green as well.

These observations only confirm our intuitions; to actually predict the gender from the colors, we need a prediction model.

Text features exploration

Now that we have seen how users personalize the colors of their pages, let's look more closely at what they actually write on Twitter. We explore both the text of the user descriptions and the text of the tweets themselves. As these two texts sit in different columns of the dataframe, some preprocessing is needed first. We normalize the text by removing separators such as commas and converting everything to lowercase, using the text_normalizer function. We then concatenate the description and tweet texts. Finally, the compute_bag_of_words and print_most_frequent functions let us visualize which words each gender uses most.
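The helpers text_normalizer, compute_bag_of_words and print_most_frequent are defined in twtgender and are not reproduced here. A minimal sketch of what such a pipeline might look like is given below; the names normalize_text and most_frequent_words, the exact set of separators removed and the toy sentences are all assumptions.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def normalize_text(s):
    """Lowercase a string and replace punctuation separators with spaces."""
    if not isinstance(s, str):  # missing descriptions are read as NaN
        return ""
    s = s.lower()
    s = re.sub(r"[,.;:!?\"()\[\]]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()


def most_frequent_words(texts, n=3):
    """Return the n most frequent words as (word, count) pairs."""
    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(texts)            # sparse document-term matrix
    counts = np.asarray(bow.sum(axis=0)).ravel()     # total count of each word
    # Vocabulary words ordered by their column index in the matrix
    vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    order = np.argsort(counts)[::-1][:n]
    return [(vocab[i], int(counts[i])) for i in order]


texts = [normalize_text("Hello, World!"), normalize_text("hello again")]
top_word = most_frequent_words(texts, n=1)
```

Note that CountVectorizer ignores single-character tokens by default, so very short words never appear in the counts.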


In [3]:
#Data Exploration - Text
# Normalize text in the descriptions and tweet messages

# Add columns containing the normalized texts to the dataframe
# (dataFrameText is an alias of dataFrame, so dataFrame gets these columns too)
dataFrameText = dataFrame
dataFrameText['text_norm'] = [text_normalizer(s) for s in dataFrameText['text']]
dataFrameText['description_norm'] = [text_normalizer(s) for s in dataFrameText['description']]

# Put all the interesting text, i.e. the tweet itself and the description, in one string for each tweet
dataFrameText['all_text'] = dataFrameText['text_norm'].str.cat(dataFrameText['description_norm'], sep=' ')

# Keep only rows with a fully confident, known gender, and drop link colors
# containing 'E+' (presumably hex codes mangled into scientific notation)
dataFrameText = dataFrameText[(dataFrameText['gender:confidence']==1)&(dataFrameText['gender']!='unknown')&(dataFrameText['link_color'].str.contains(r'E\+') != True)]

# Extract separate genders dataframes
male_data = dataFrameText[dataFrameText['gender']=='male']
female_data = dataFrameText[dataFrameText['gender']=='female']
brand_data = dataFrameText[dataFrameText['gender']=='brand']

male_bow, male_voc = compute_bag_of_words(male_data['all_text'])
print_most_frequent(male_bow, male_voc, 'male',  feature = 'all_text')

female_bow, female_voc = compute_bag_of_words(female_data['all_text'])
print_most_frequent(female_bow, female_voc, 'female', feature = 'all_text')

brand_bow, brand_voc = compute_bag_of_words(brand_data['all_text'])
print_most_frequent(brand_bow, brand_voc, 'brand',  feature = 'all_text')
#nothing special about these words really



The results are not as conclusive as with the colors. The most used words, regardless of the gender, are very common words such as "the", "and", "to" or "of", which tell us nothing about the gender. One interesting observation is that brands tend to use the words "weather", "channel" and "news" more than male and female users, which suggests that the dataset contains many news or weather channel accounts. Another is the usage of the word "https": brands seem to post more links than individual users.
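One common remedy for uninformative frequent words, which this notebook does not apply, is to drop stop words before counting. The sketch below uses the built-in English stop word list of scikit-learn's CountVectorizer on two made-up sentences; it is only an illustration of the idea.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, not taken from the dataset
texts = ["the cat and the weather", "news of the weather"]

# Default vectorizer keeps every word of two characters or more
plain = CountVectorizer().fit(texts)
# With stop_words="english", common function words are discarded
filtered = CountVectorizer(stop_words="english").fit(texts)

plain_vocab = set(plain.vocabulary_)
filtered_vocab = set(filtered.vocabulary_)
```

With the stop words removed, only "cat", "news" and "weather" remain in the vocabulary, so frequency rankings focus on content words.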

Profile picture features exploration

Using the profile picture information is a bit more difficult than using simple text or color codes. First we need to extract a description of the content from the picture itself. To do so, we used the Clarifai API. As the process takes very long on the whole dataframe (approximately 12 hours), we do not recommend re-running the code. Instead, we saved the picture content keywords to a new dataframe, which we use in the rest of the analysis.

Now that we have the content of the profile pictures as text, we can run the same data exploration process as before and see which content is most common for each gender.
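Before tagging a picture, the commented-out extraction code in the next cell rewrites each profile image URL by dropping the seven characters that precede the file extension, presumably the "_normal" thumbnail suffix, so that the full-size image is requested. A slightly more defensive version of that step might look like the sketch below; the function name full_size_url and the example URL are hypothetical.

```python
import posixpath


def full_size_url(url, suffix="_normal"):
    """Drop the thumbnail suffix that precedes the file extension, if present."""
    stem, ext = posixpath.splitext(url)
    if stem.endswith(suffix):
        stem = stem[: -len(suffix)]
    return stem + ext


# Hypothetical URL, shaped like the ones in the profileimage column
thumb = "https://pbs.twimg.com/profile_images/123/abc_normal.jpeg"
original = full_size_url(thumb)
```

Unlike fixed-offset slicing, this leaves URLs without the suffix untouched.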


In [4]:
#Data Exploration - Pictures

#import imghdr 
#import requests 
#import io
#import urllib.request as ur
#from clarifai.rest import ClarifaiApp
#app = ClarifaiApp()


## We used the Clarifai API to extract the image tags, which takes about 12 hours.
## The code is therefore commented out to avoid running it again.

#raw_data['pic_text']=' '
#for i in range(20048,20049): 
    
#    url=raw_data['profileimage'][i]
#    index=url.rfind('.')
#    url_new=url[:index-7]+url[index:]
#    if '.gif' not in url_new:
#        if 'pb.com'not in url_new:
#            response = requests.get(url_new)
#            if(response.status_code != 404):
#                result=app.tag_urls([url_new])
#                for k in range(0,4):
#                    if(k==0):
#                        raw_data['pic_text'][i]=result['outputs'][0]['data']['concepts'][k]['name']
#                    else:
#                        raw_data['pic_text'][i]=raw_data['pic_text'][i] + ' ' + result['outputs'][0]['data']['concepts'][k]['name']
##                print(i)
##            print(raw_data['pic_text'][i])    



#new_data=raw_data
#df = pd.DataFrame(new_data, columns = ['gender', 'gender:confidence', 'profileimage', 'pic_text'])
#df.to_csv('new_data.csv')

new_data=pd.read_csv('new_data.csv',encoding='latin-1')
#Show a sample of the dataset
new_data.head()


Out[4]:
Unnamed: 0 gender gender:confidence profileimage pic_text
0 0 male 1.0000 https://pbs.twimg.com/profile_images/414342229... music people fun fashion
1 1 male 1.0000 https://pbs.twimg.com/profile_images/539604221... man people portrait one
2 2 male 0.6625 https://pbs.twimg.com/profile_images/657330418...
3 3 male 1.0000 https://pbs.twimg.com/profile_images/259703936...
4 4 female 1.0000 https://pbs.twimg.com/profile_images/564094871... man people portrait two

In [5]:
#create separate gender dataFrames
male_data = new_data[(new_data['gender']=='male')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
female_data = new_data[(new_data['gender']=='female')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
brand_data = new_data[(new_data['gender']=='brand')&(new_data['gender:confidence']==1)&(new_data['pic_text']!=' ')]
genderConf = new_data[(new_data['gender:confidence']==1)&(new_data['gender']!='unknown')&(new_data['pic_text']!=' ')]

male_bow, male_voc = compute_bag_of_words(male_data['pic_text'])
print_most_frequent(male_bow, male_voc,'male', feature = 'pic_text')
#print('----------------------------')

female_bow, female_voc = compute_bag_of_words(female_data['pic_text'])
print_most_frequent(female_bow, female_voc,'female', feature = 'pic_text')
#print('----------------------------')

brand_bow, brand_voc = compute_bag_of_words(brand_data['pic_text'])
print_most_frequent(brand_bow, brand_voc,'brand', feature = 'pic_text')



The results here are very interesting. Since these are profile pictures, most male and female users use a picture of themselves, which is why the words "adult", "people", "portrait" and, respectively, "man" and "woman" are the most frequent. For brands, most of the pictures are tagged "symbol", "illustration" or "design", which suggests they are logos of the brand.

Step 3: Prediction model

Gender Prediction based on color features

We wrote the predictors function to extract the best predictors and anti-predictors of a given feature for gender prediction. Here, we apply it to link_color, using different classifiers. We chose to start with linear models because they are simple yet reasonably effective, and have a good implementation in the sklearn library.

More specifically, some of these models have an attribute called coef_ which gives the weight of each word (here, the color hex codes) for each class. A word with a high weight for a given gender means that a user who uses it has a strong probability of being of that gender.

First, we performed the classification using the color features:
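To make the role of coef_ concrete, here is a minimal sketch on made-up texts. It is not the predictors function from twtgender; the helper name top_predictors and the toy data are assumptions. For a multi-class LogisticRegression, coef_ has one row per class, in the order given by classes_, and one column per vocabulary word.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training texts, two per class
texts = ["father team game", "team game player",
         "mom girl makeup", "girl makeup queen",
         "news weather https", "weather https channel"]
labels = ["male", "male", "female", "female", "brand", "brand"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Vocabulary words ordered by their column index in X
vocab = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))


def top_predictors(model, vocab, label, n=3):
    """Return the n highest- and n lowest-weighted words for one class."""
    row = list(model.classes_).index(label)
    weights = model.coef_[row]
    best = np.argsort(weights)[::-1][:n]   # largest weights: predictors
    worst = np.argsort(weights)[:n]        # smallest weights: anti-predictors
    return list(vocab[best]), list(vocab[worst])


male_best, male_worst = top_predictors(model, vocab, "male")
```

On this toy data, the best male predictors are drawn from the words that only appear in the male texts, and the anti-predictors from the words used by the other two classes.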


In [6]:
# Classifier colors

dataFrameColor = dataFrame.loc[:,['gender:confidence', 'gender', 'link_color']]
dataFrameColorFiltered = dataFrameColor[(dataFrameColor['gender:confidence'] == 1)&(dataFrameColor['link_color'].str.contains(r'E\+') != True)&(dataFrameColor['gender']!='unknown')]

feature = 'link_color'
df = dataFrameColorFiltered

# List of the classifiers we tested
modelListColor = [linear_model.RidgeClassifier(), 
             linear_model.SGDClassifier(),
             linear_model.LogisticRegression(),
             linear_model.PassiveAggressiveClassifier(),
             naive_bayes.MultinomialNB(),
             neural_network.MLPClassifier()]
modelNamesList = ['Ridge Classifier', 
                  'SGD Classifier',
                  'Logistic regression',
                  'Passive Aggressive Classifier',
                  'Multinomial NB',
                  'Multi-layer Perceptron',
                  ]
acc_color = np.zeros(len(modelListColor))
for i in range(0, len(modelListColor)):

#for i in range(2,3):
#     model_color = modelListColor[i]
    modelName = modelNamesList[i]
    modelListColor[i], voc_color, acc_color[i] = predictors(df, feature, modelListColor[i], modelName, displayResults = False, displayColors=True)


fig, ax1 = plt.subplots()
ax1.set_xlim([0, 1])
bar_width = 0.5
model_number = np.arange(len(modelListColor))+1
rects1 = plt.barh(model_number,acc_color, bar_width, label = 'Brand Predictors', color = '#f5abb5')
plt.yticks(model_number,modelNamesList)
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.title('Accuracy of the different Classifiers')
plt.tight_layout()
plt.show()


Testing Ridge Classifier model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  Ridge Classifier
mse: 1.1318
score:  0.446460980036
Testing SGD Classifier model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  SGD Classifier
mse: 1.2276
score:  0.424682395644
Testing Logistic regression model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  Logistic regression
mse: 1.0740
score:  0.443194192377
Testing Passive Aggressive Classifier model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  Passive Aggressive Classifier
mse: 1.2131
score:  0.388021778584
Testing Multinomial NB model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  Multinomial NB
mse: 1.0915
score:  0.460617059891
Testing Multi-layer Perceptron model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  Multi-layer Perceptron
mse: 0.6860
score:  0.42613430127

We observe that all the models yield similar accuracies, between roughly 39% and 46%.

Display of the predictors of color features for the logistic regression model


In [7]:
modelName = modelNamesList[2]
modelListColor[2], voc_color, acc_color[2] = predictors(df, feature, modelListColor[2], modelName, displayResults = True, displayColors=True)


Testing Logistic regression model for gender prediction using link_color
Split: 2755 testing and 11023 training samples
model:  Logistic regression
mse: 1.0577
score:  0.471506352087
Best 20 male predictors:
Best 20 male anti-predictors  for theme color:
Best 20 female predictors  for theme color:
Best 20 Female anti-predictors for theme color:
Best 20 brand predictors for theme color:
Best 20 Brand anti-predictors for theme color:

From these bar graphs, we can see that the models confirm our intuitions. The strongest female color predictors are almost all pinks, reds and purples, and these colors are also quite strong anti-predictors for both males and brands. However, the models only reach about 45% accuracy when predicting the gender from the colors alone. Once again, this is mostly because the vast majority of users do not change their sidebar or link colors.

Gender prediction based on text features

Now, let's do the same and try to predict the user's gender using text features:


In [8]:
# Classifier - Text
#Looking at the most used words per gender doesnt yield anything particular since we all use the same common words,
#so let's try to find predictors

feature = 'all_text'
df = dataFrameText[dataFrameText['_golden']==False]
modelListText = [linear_model.RidgeClassifier(), 
             linear_model.SGDClassifier(),
             linear_model.LogisticRegression(),
             linear_model.PassiveAggressiveClassifier(),
             naive_bayes.MultinomialNB(),
             neural_network.MLPClassifier()
             ]
acc_text = np.zeros(len(modelListText))
for i in range(0, len(modelListText)):
#for i in range(2,3):
#     model_text = modelListText[i]
    modelName = modelNamesList[i]
    modelListText[i], voc_text, acc_text[i] = predictors(df, feature, modelListText[i], modelName, displayResults = False)
    
fig, ax1 = plt.subplots()
ax1.set_xlim([0, 1])
bar_width = 0.5
model_number = np.arange(len(modelListText))+1
rects1 = plt.barh(model_number,acc_text, bar_width, label = 'Brand Predictors', color = '#f5abb5')
plt.yticks(model_number,modelNamesList)
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.title('Accuracy of the different Classifiers')
plt.tight_layout()

plt.show()


Testing Ridge Classifier model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  Ridge Classifier
mse: 0.4969
score:  0.678792288105
Testing SGD Classifier model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  SGD Classifier
mse: 0.6006
score:  0.642779192434
Testing Logistic regression model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  Logistic regression
mse: 0.5013
score:  0.679883594034
Testing Passive Aggressive Classifier model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  Passive Aggressive Classifier
mse: 0.5864
score:  0.648235722081
Testing Multinomial NB model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  Multinomial NB
mse: 0.4722
score:  0.688250272826
Testing Multi-layer Perceptron model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  Multi-layer Perceptron
mse: 0.5678
score:  0.65041833394

We observe that the models yield similar accuracies, between roughly 64% and 69%.

Display of the predictors of text for the logistic regression model


In [9]:
modelName = modelNamesList[2]
modelListText[2], voc_text, acc_text[2] = predictors(df, feature, modelListText[2], modelName, displayResults = True)


Testing Logistic regression model for gender prediction using all_text
Split: 2749 testing and 11000 training samples
model:  Logistic regression
mse: 0.5060
score:  0.670789377956
Best 20 male predictors:
Best 20 male anti-predictors  for text:
Best 20 female predictors  for text:
Best 20 Female anti-predictors for text:
Best 20 brand predictors for text:
Best 20 Brand anti-predictors for text:

Here, as the data is much more varied and meaningful than simple color codes, we obtain a prediction accuracy between 64% and 69%. Strong predictors for male users are words such as "father", "boy", "man" or "niggas", while predictors for female users are "mom", "girl", "feminist" or "makeup", and of course, these words are anti-predictors of the opposite gender.

From the anti-predictors, it seems that female users tweet less about sports ("player", "hit", "team", "season", "game"), while male users are less likely to tweet about girls ("girl", "mother", "queen").

For brands, our intuitions are confirmed: posting a link ("https") or tweeting about "news" and "weather" is typical of the brand "gender".

Finally, some predictors for the female gender might look quite odd, "_ù" or "ï_" for example. We think these are emojis whose bytes were garbled when the file was decoded, but we did not manage to identify which ones due to encoding issues.

Gender prediction based on profile pictures features

Now, let's finally apply our classifiers using the profile picture contents:


In [10]:
# Classifier - Pictures
#Train the same classifiers on the profile picture keywords to find predictors

feature = 'pic_text'
df = genderConf
modelListPic = [linear_model.RidgeClassifier(), 
             linear_model.SGDClassifier(),
             linear_model.LogisticRegression(),
             linear_model.PassiveAggressiveClassifier(),
             naive_bayes.MultinomialNB(),
             neural_network.MLPClassifier()
             ]
acc_pic = np.zeros(len(modelListPic))
for i in range(0, len(modelListPic)):
#for i in range(2,3):
#     model_pic = modelListPic[i]
    modelName = modelNamesList[i]
    modelListPic[i], voc_pic, acc_pic[i] = predictors(df, feature, modelListPic[i], modelName, displayResults = False)
    
fig, ax1 = plt.subplots()
ax1.set_xlim([0, 1])
bar_width = 0.5
model_number = np.arange(len(modelListPic))+1
rects1 = plt.barh(model_number,acc_pic, bar_width, label = 'Brand Predictors', color = '#f5abb5')
plt.yticks(model_number,modelNamesList)
plt.xlabel('Accuracy')
plt.ylabel('Model')
plt.title('Accuracy of the different Classifiers')
plt.tight_layout()
plt.show()


Testing Ridge Classifier model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Ridge Classifier
mse: 0.4000
score:  0.816113744076
Testing SGD Classifier model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  SGD Classifier
mse: 0.3100
score:  0.854976303318
Testing Logistic regression model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Logistic regression
mse: 0.3630
score:  0.838862559242
Testing Passive Aggressive Classifier model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Passive Aggressive Classifier
mse: 0.3763
score:  0.825592417062
Testing Multinomial NB model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Multinomial NB
mse: 0.3583
score:  0.832227488152
Testing Multi-layer Perceptron model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Multi-layer Perceptron
mse: 0.3564
score:  0.81990521327

We observe that the models yield similar accuracies, between roughly 82% and 85%.

Display of the predictors of profile picture features for the logistic regression model


In [11]:
modelName = modelNamesList[2]
modelListPic[2], voc_pic, acc_pic[2] = predictors(df, feature, modelListPic[2], modelName, displayResults = True)


Testing Logistic regression model for gender prediction using pic_text
Split: 1055 testing and 4223 training samples
model:  Logistic regression
mse: 0.3327
score:  0.835071090047
Best 20 male predictors:
Best 20 male anti-predictors  for profile picture features:
Best 20 female predictors  for profile picture features:
Best 20 Female anti-predictors for profile picture features:
Best 20 brand predictors for profile picture features:
Best 20 Brand anti-predictors for profile picture features:

As the content of the profile picture is very representative of the user, the classifiers reach up to 85% accuracy, which is quite impressive. However, the predictors are not exactly what we expected. Although "man" and "woman" are among the best predictors for their respective genders, we expected them to carry much more weight in the prediction for most of the classifiers.

Also, it seems that we have many Twitter accounts from "actresses", or that many female users use pictures of actresses as their profile pictures. Unsurprisingly, "bikini" is an anti-predictor for male users. More surprisingly, "leather" and "lips" are strong predictors for brands.

Testing the models with an external dataset

Combining the different features

Having made predictions based on text, profile picture content and theme color, we want to combine the three models, weighting each by its accuracy, in order to improve the prediction. (For example, when the theme color and the profile picture are not informative, the text should drive the decision, and so on.) We obtain the new probability for each class as follows: $$p(X=k) = \frac{\sum_{i=1}^{3}{\alpha_{i}p_{i}(X=k)}}{\sum_{i=1}^{3}{\alpha_{i}}}$$ We take $\alpha_{i} = e^{10 \cdot \text{acc}_{i}}$, where $\text{acc}_{i}$ is the accuracy of the model for feature $i$ and $p_{i}(X=k)$ is the probability that this model assigns to class $k$.

Test the model with a homemade example
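The weighted average above can be written in a few lines of NumPy. This is a sketch of the formula only, not the combine_features function from twtgender; the function name combine_probabilities and the probabilities and accuracies used below are made up for illustration.

```python
import numpy as np


def combine_probabilities(probas, accuracies, scale=10.0):
    """Accuracy-weighted average of per-model class probabilities.

    probas: shape (n_models, n_classes), one probability row per model.
    accuracies: shape (n_models,), accuracy of each model.
    """
    probas = np.asarray(probas, dtype=float)
    alphas = np.exp(scale * np.asarray(accuracies, dtype=float))  # alpha_i = e^(10 * acc_i)
    return alphas @ probas / alphas.sum()


classes = ["brand", "female", "male"]
# Made-up probabilities for one user
probas = [[0.1, 0.2, 0.7],   # text model
          [0.1, 0.1, 0.8],   # picture model
          [0.4, 0.3, 0.3]]   # color model
combined = combine_probabilities(probas, accuracies=[0.67, 0.84, 0.45])
predicted = classes[int(np.argmax(combined))]
```

Because the weights grow exponentially with the accuracy, the most accurate model dominates: with these illustrative accuracies, the picture model receives over 80% of the total weight.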


In [12]:
text_test2 = str("bros in Paris, gospel with the squad and my father sport team")
pic_test2 = "man sunglass portrait"
color_test2 = "cc3300"
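# pd.DataFrame.from_items was removed in pandas 1.0; a dict of lists builds the one-row frame instead
dataFrametest2 = pd.DataFrame({'all_text': [text_test2], 'pic_text': [pic_test2], 'link_color': [color_test2], 'user_name': ['blabla'], 'gender': ['male']})
dataFrametest2.head()
# Note: the weights passed here are the raw accuracies at index 1 (SGD Classifier),
# whereas the external-dataset cell further down passes np.exp(10*acc) at index 2 (Logistic regression)
resultList = combine_features(modelListText[2], modelListPic[2], modelListColor[2], dataFrametest2, voc_text, voc_pic, voc_color, acc_text[1], acc_pic[1], acc_color[1])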


The predicted gender by using the text is male with probability 0.8725878462
The predicted gender by using the profile picture is male with probability 0.867629648221
The predicted gender by using the link color is male with probability 0.508777292989
Overall, the predicted gender of user blabla is male with a confidence of 0.790014006084
The average success rate for this test data is 1.0

We see that providing the model with a fake "cliché" example of a tweet, profile picture features and link_color gives us the expected result.

Getting a new dataset

We then wanted to test our model on an external dataset, which we built using the Twitter API and, again, the Clarifai image tagging API. We selected 48 Twitter profiles, 16 of each class (male, female, brand), including writers, comedians and public figures from various backgrounds, and extracted the profile picture features, the theme color and the text of 100 tweets for each. We then applied the model combining the features as explained above and obtained the following results.


In [13]:
# import configparser
# import pandas as pd
# import tweepy  # pip install tweepy

# consumer_key='<your consumer key>'
# consumer_secret='<your consumer secret>'
# access_token='<your access token>'
# access_token_secret='<your access token secret>'

# #we need to update our key and secret every time we predict
# auth = tweepy.OAuthHandler(consumer_key,consumer_secret)
# auth.set_access_token(access_token,access_token_secret)
# api = tweepy.API(auth)


# n=100 #number of tweets we want to get
# id_list=[('realDonaldTrump', 'male'),('augusten', 'male'),('Eminem', 'male'),('IrvineWelsh','male'),('alaindebotton','male'),
#         ('joedunthorne','male'),('paulocoelho','male'), ('tejucole','male'),('Shteyngart','male'),('KingJames','male'),
#          ('stephenathome','male'), ('jimmyfallon', 'male'), ('normaltweetguy', 'male'), ('beer4agoodtime', 'male'), ('BasedMelGibson', 'male'), ('BigMoneyRalph','male'), 
         
#         ('askanyone','female'),('jackiejcollins','female'),('JoyceCarolOates','female'),
#          ('rihanna','female'),('JLo','female'),('Beyonce','female'),('Sia','female'),('MileyCyrus','female'),('CharlizeAfrica','female'),('theashleygraham','female'),
#          ('missjerrikak', 'female'), ('iyana93', 'female'), ('ihateMORGZ', 'female'), ('nose_inthee_air','female'),('70sluvchild', 'female'),('lovepreciousway','female'),
         
#         ('McDonalds','brand'),('Nike','brand'),('UNDEFEATEDinc','brand'),('CocaCola','brand'),('CanadaGooseInc','brand'),
#          ('Hersheys','brand'),('Nestle','brand'),('Oracle','brand'),('LouisVuitton','brand'),('omegawatches','brand'),
#         ('DeloitteUS', 'brand'), ('Danone', 'brand'),('easyJet','brand'), ('Toyota', 'brand'), ('CNN','brand'), ('Infosys', 'brand')]

# def createDataFrame(id_list):
#     # build one row per account: profile fields plus the concatenated text of its n latest tweets
#     columns=['user_name', 'description', 'text', 'profile_image_url', 'profile_background_color', 'profile_sidebar_border_color','gender','pic_text']
#     rows=[]
#     for user,gender in id_list:
#         profile=api.get_user(user)
#
#         serie = dict(user_name=user, description=profile.description, profile_image_url=profile.profile_image_url,gender=gender)
#         serie.update(dict(profile_background_color=profile.profile_background_color,profile_sidebar_border_color=profile.profile_sidebar_border_color))
#
#         # concatenate the text of the n most recent tweets of this user
#         tw=''
#         for status in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
#             tw=tw + ' ' + status.text
#         serie.update(dict(text=tw))
#         rows.append(serie)
#     # DataFrame.append was removed in pandas 2.0, so the frame is built once from the list of rows
#     return pd.DataFrame(rows, columns=columns)


# results = createDataFrame(id_list)
# #run the function and get our results



# import requests
# from clarifai.rest import ClarifaiApp
# app = ClarifaiApp()


# for i in range(0,len(id_list)):
#
#     url=results['profile_image_url'][i]
#     # drop the 7 characters before the extension (presumably the '_normal' size suffix) to get the full-size picture
#     index=url.rfind('.')
#     url_new=url[:index-7]+url[index:]
#     if '.gif' not in url_new and 'pb.com' not in url_new:
#         response = requests.get(url_new)
#         if(response.status_code != 404):
#             result=app.tag_urls([url_new])
#             concepts=result['outputs'][0]['data']['concepts']
#             # keep the names of the first four concepts as one space-separated string;
#             # .loc avoids the chained assignment results['pic_text'][i]=...
#             results.loc[i,'pic_text']=' '.join(c['name'] for c in concepts[:4])
 
# results

# results.to_csv('test_data_utf.csv',encoding='utf-8')
#save as csv

We apply our feature-combining function to the newly created dataset, using the logistic regression models.
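As a rough illustration of what such a combination can look like, the sketch below averages the class probabilities of three models with weights exp(10 * accuracy), the same form as the weights passed to combine_features in the next cell. The averaging rule, the function name combine_probabilities, the class order and all numbers are assumptions made for this example; the actual combine_features is defined earlier in the notebook and may work differently.

```python
import numpy as np

CLASSES = ["brand", "female", "male"]  # assumed class order, for illustration only

def combine_probabilities(prob_vectors, accuracies, scale=10.0):
    """Weighted average of per-model class probabilities (hypothetical helper).

    Each model is weighted by exp(scale * accuracy), so a more accurate model
    dominates the vote.
    """
    probs = np.asarray(prob_vectors, dtype=float)   # shape (n_models, n_classes)
    weights = np.exp(scale * np.asarray(accuracies, dtype=float))
    combined = weights @ probs / weights.sum()      # still sums to 1
    best = int(np.argmax(combined))
    return CLASSES[best], float(combined[best])

# Made-up probabilities and accuracies for the text, picture and color models
label, confidence = combine_probabilities(
    [[0.50, 0.20, 0.30],    # text model
     [0.02, 0.03, 0.95],    # picture model
     [0.25, 0.44, 0.31]],   # color model
    accuracies=[0.60, 0.80, 0.40],
)
print(label, round(confidence, 3))
```

With exponential weights, a gap of 0.2 in accuracy translates into a factor of exp(2), roughly 7.4, in voting power, which is why a confident picture model can override a text model that disagrees.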


In [14]:
dataFrametest4 = pd.read_csv('test_data_utf.csv', encoding='utf-8')

# the collected data has no link color, so the profile background color is used in its place
dataFrametest4['link_color'] = dataFrametest4['profile_background_color']
# normalize the concatenation of each profile description and the user's tweets
dataFrametest4['all_text'] = [text_normalizer(s) for s in dataFrametest4['description'].str.cat(dataFrametest4['text'])]

dataFrametestfem = dataFrametest4[dataFrametest4['gender']=='female']
# each model is weighted by exp(10 * accuracy)
overallList = combine_features(modelListText[2], modelListPic[2], modelListColor[2], dataFrametest4, voc_text, voc_pic, voc_color, np.exp(10*acc_text[2]), np.exp(10*acc_pic[2]), np.exp(10*acc_color[2]))


The predicted gender by using the text is brand with probability 0.5
The predicted gender by using the profile picture is male with probability 0.950238520208
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user realDonaldTrump is male with a confidence of 0.864522703113
The predicted gender by using the text is brand with probability 0.999999999824
The predicted gender by using the profile picture is male with probability 0.950238520208
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user augusten is male with a confidence of 0.78523341191
The predicted gender by using the text is female with probability 0.586627884588
The predicted gender by using the profile picture is male with probability 0.721702107696
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user Eminem is male with a confidence of 0.657257600969
The predicted gender by using the text is brand with probability 0.500000029189
The predicted gender by using the profile picture is male with probability 0.598564729455
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user IrvineWelsh is male with a confidence of 0.57621851486
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is male with probability 0.950238520208
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user alaindebotton is male with a confidence of 0.785233411882
The predicted gender by using the text is brand with probability 0.999998285018
The predicted gender by using the profile picture is male with probability 0.509611959096
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user joedunthorne is male with a confidence of 0.424005510674
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is male with probability 0.926198680414
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user paulocoelho is male with a confidence of 0.765525413821
The predicted gender by using the text is brand with probability 0.500011167478
The predicted gender by using the profile picture is male with probability 0.950238520208
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user tejucole is male with a confidence of 0.864520932191
The predicted gender by using the text is brand with probability 0.99999966002
The predicted gender by using the profile picture is male with probability 0.907774591194
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user Shteyngart is male with a confidence of 0.750421294091
The predicted gender by using the text is brand with probability 0.507108574246
The predicted gender by using the profile picture is brand with probability 0.580903749204
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user KingJames is brand with a confidence of 0.562478290994
The predicted gender by using the text is brand with probability 0.900876553183
The predicted gender by using the profile picture is male with probability 0.836735282981
The predicted gender by using the link color is male with probability 0.616694654212
Overall, the predicted gender of user stephenathome is male with a confidence of 0.715009510911
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is male with probability 0.933073450208
The predicted gender by using the link color is male with probability 0.460591622076
Overall, the predicted gender of user jimmyfallon is male with a confidence of 0.774894958691
The predicted gender by using the text is male with probability 0.988840231302
The predicted gender by using the profile picture is male with probability 0.933276938178
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user normaltweetguy is male with a confidence of 0.928137092153
The predicted gender by using the text is male with probability 1.0
The predicted gender by using the profile picture is male with probability 0.938146447602
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user beer4agoodtime is male with a confidence of 0.93389884411
The predicted gender by using the text is brand with probability 0.508649866632
The predicted gender by using the profile picture is male with probability 0.798126653991
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user BasedMelGibson is male with a confidence of 0.738448842215
The predicted gender by using the text is brand with probability 0.500044557533
The predicted gender by using the profile picture is male with probability 0.948095842646
The predicted gender by using the link color is male with probability 0.460591622076
Overall, the predicted gender of user BigMoneyRalph is male with a confidence of 0.866492627232
The predicted gender by using the text is male with probability 0.999952720459
The predicted gender by using the profile picture is female with probability 0.842074990755
The predicted gender by using the link color is brand with probability 0.444372336803
Overall, the predicted gender of user askanyone is female with a confidence of 0.695130999252
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is female with probability 0.912666987969
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user jackiejcollins is female with a confidence of 0.757769354707
The predicted gender by using the text is brand with probability 0.500000000092
The predicted gender by using the profile picture is male with probability 0.721702107696
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user JoyceCarolOates is male with a confidence of 0.677167245939
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is male with probability 0.721702107696
The predicted gender by using the link color is male with probability 0.460591622076
Overall, the predicted gender of user rihanna is male with a confidence of 0.601611524022
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is female with probability 0.485846931458
The predicted gender by using the link color is female with probability 0.606452787873
Overall, the predicted gender of user JLo is female with a confidence of 0.411408994039
The predicted gender by using the text is female with probability 0.586627884588
The predicted gender by using the profile picture is male with probability 0.567332697814
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user Beyonce is male with a confidence of 0.53070467646
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is female with probability 0.68257469244
The predicted gender by using the link color is male with probability 0.418859250534
Overall, the predicted gender of user Sia is female with a confidence of 0.567248605607
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is female with probability 0.842074990755
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user MileyCyrus is female with a confidence of 0.699897631927
The predicted gender by using the text is female with probability 0.586627884588
The predicted gender by using the profile picture is female with probability 0.858185332364
The predicted gender by using the link color is female with probability 0.606452787873
Overall, the predicted gender of user CharlizeAfrica is female with a confidence of 0.809680761545
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is female with probability 0.689094813993
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user theashleygraham is female with a confidence of 0.574483608671
The predicted gender by using the text is brand with probability 0.999999960515
The predicted gender by using the profile picture is male with probability 0.657487267471
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user missjerrikak is male with a confidence of 0.545234264153
The predicted gender by using the text is brand with probability 0.999999999992
The predicted gender by using the profile picture is female with probability 0.541475241977
The predicted gender by using the link color is female with probability 0.851239060424
Overall, the predicted gender of user iyana93 is female with a confidence of 0.46230462809
The predicted gender by using the text is male with probability 0.999999998982
The predicted gender by using the profile picture is female with probability 0.96588361598
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user ihateMORGZ is female with a confidence of 0.801396650667
The predicted gender by using the text is brand with probability 0.999999727317
The predicted gender by using the profile picture is female with probability 0.966924625729
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user nose_inthee_air is female with a confidence of 0.802250076248
The predicted gender by using the text is brand with probability 0.999999998529
The predicted gender by using the profile picture is female with probability 0.96588361598
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user 70sluvchild is female with a confidence of 0.801396650505
The predicted gender by using the text is brand with probability 0.999999999145
The predicted gender by using the profile picture is female with probability 0.893567796271
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user lovepreciousway is female with a confidence of 0.742111728192
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.980686223329
The predicted gender by using the link color is male with probability 0.460591622076
Overall, the predicted gender of user McDonalds is brand with a confidence of 0.966824780322
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.941067594766
The predicted gender by using the link color is female with probability 0.606452787873
Overall, the predicted gender of user Nike is brand with a confidence of 0.934209704221
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.956882030793
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user UNDEFEATEDinc is brand with a confidence of 0.948869458721
The predicted gender by using the text is female with probability 0.586627884588
The predicted gender by using the profile picture is brand with probability 0.898455237153
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user CocaCola is brand with a confidence of 0.748564574743
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.950450953634
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user CanadaGooseInc is brand with a confidence of 0.94359722491
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.975297482134
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user Hersheys is brand with a confidence of 0.963966550982
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.982538866232
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user Nestle is brand with a confidence of 0.969903079053
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.972632323102
The predicted gender by using the link color is female with probability 0.606452787873
Overall, the predicted gender of user Oracle is brand with a confidence of 0.960086648917
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.870870509691
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user LouisVuitton is brand with a confidence of 0.878356722129
The predicted gender by using the text is female with probability 0.586627884588
The predicted gender by using the profile picture is brand with probability 0.973789714994
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user omegawatches is brand with a confidence of 0.810324210246
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.990834445326
The predicted gender by using the link color is female with probability 0.606452787873
Overall, the predicted gender of user DeloitteUS is brand with a confidence of 0.975008852749
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.972632323102
The predicted gender by using the link color is female with probability 0.442259133632
Overall, the predicted gender of user Danone is brand with a confidence of 0.961781638376
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.960624926343
The predicted gender by using the link color is male with probability 0.368163549178
Overall, the predicted gender of user easyJet is brand with a confidence of 0.952028852153
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.987117208595
The predicted gender by using the link color is female with probability 0.529627818254
Overall, the predicted gender of user Toyota is brand with a confidence of 0.971309182457
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.980686223329
The predicted gender by using the link color is brand with probability 0.402920059431
Overall, the predicted gender of user CNN is brand with a confidence of 0.971260153367
The predicted gender by using the text is brand with probability 1.0
The predicted gender by using the profile picture is brand with probability 0.897541097395
The predicted gender by using the link color is female with probability 0.542540086188
Overall, the predicted gender of user Infosys is brand with a confidence of 0.899188768802
The average success rate for this test data is 0.8958333333333334

In [15]:
resultList = overallList
display_resultList(resultList)


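We observe that all 16 brands are classified correctly, and that most errors concern the female accounts: 4 of the 16 (JoyceCarolOates, rihanna, Beyonce and missjerrikak) are predicted as male, while only one male account (KingJames) is misclassified, as a brand. The dataset is relatively small, but overall the model works well.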
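The per-class breakdown can be recomputed from the predictions printed above. The sketch below is only an illustration: the helper per_class_accuracy is a hypothetical name, and the tally is transcribed by hand from the printed output rather than taken from overallList.

```python
from collections import Counter

def per_class_accuracy(pairs):
    """Return {true_label: fraction predicted correctly} for (true, predicted) pairs."""
    total = Counter(true for true, _ in pairs)
    correct = Counter(true for true, pred in pairs if true == pred)
    return {label: correct[label] / total[label] for label in total}

# (true label, predicted label, number of accounts), read off the printed predictions
tally = [
    ("male", "male", 15), ("male", "brand", 1),
    ("female", "female", 12), ("female", "male", 4),
    ("brand", "brand", 16),
]
pairs = [(true, pred) for true, pred, count in tally for _ in range(count)]

accuracy = per_class_accuracy(pairs)
overall = sum(true == pred for true, pred in pairs) / len(pairs)
print(accuracy, overall)
```

The overall value is 43 correct predictions out of 48, which matches the reported average success rate.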

We also apply it to our cleaned original dataset.


In [16]:
# work on an explicit copy so that adding a column does not raise SettingWithCopyWarning
dataFrameTextWithPics = dataFrameText.copy()
dataFrameTextWithPics['pic_text'] = new_data['pic_text']
# keep only the rows for which picture tags are available
dataFrameTextWithPics = dataFrameTextWithPics[dataFrameTextWithPics['pic_text'] != ' ']

In [17]:
overallList = combine_features(modelListText[2], modelListPic[2], modelListColor[2], dataFrameTextWithPics, voc_text, voc_pic, voc_color, np.exp(10*acc_text[2]), np.exp(10*acc_pic[2]), np.exp(10*acc_color[2]),display = False)
display_resultList(overallList)


The average success rate for this test data is 0.8935685828116107

By combining the features, we increase the prediction accuracy on the original dataset to about 89.4%.

Conclusion

At the end of this project, we see that it is possible to predict the gender of a Twitter user from the profile picture, the text of the tweets and the theme color with good accuracy. However, as our dataset had more brands and males than females, the accuracy is better for those two classes. A natural next step would be to collect more data for each gender.
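A quick way to inspect such an imbalance is to count the labels per class. The sketch below uses a toy label column with made-up counts, not the real dataset. The weights follow the common "balanced" heuristic n_samples / (n_classes * class_count), shown only as one possible mitigation that was not used in this project.

```python
import pandas as pd

# Toy labels standing in for the 'gender' column (made-up counts)
labels = pd.Series(["brand"] * 5 + ["male"] * 4 + ["female"] * 3, name="gender")

counts = labels.value_counts()
shares = counts / counts.sum()

# "Balanced" weights: rarer classes get a larger weight
class_weights = len(labels) / (len(counts) * counts)

print(pd.DataFrame({"count": counts, "share": shares, "weight": class_weights}))
```

Weights of this kind can be passed to many classifiers so that errors on the rarer class cost more during training.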

The profile picture often contains the most valuable information, as it is usually very representative of its user, whereas the theme color, which users seldom personalize, is more difficult to exploit.

Overall, the logistic regression model gave the best results.

We trained only simple models here; combining the features more effectively or using more complex models might give better results.

