Author: Alessandro Giusti (web, email), Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI.
After publication, this notebook received quite a lot of insightful comments on /r/python (link to post) and /r/okcupid (link to post).
This document is an analysis of a public dataset of almost 60,000 online dating profiles. The dataset was published in the Journal of Statistics Education, Volume 23, Number 2 (2015) by Albert Y. Kim et al., and its collection and distribution were explicitly allowed by OkCupid president and co-founder Christian Rudder. Using these data is therefore ethically and legally acceptable; this is in contrast to another recent release of a different OkCupid profile dataset, which was collected without permission and without anonymizing the data (more on the ethical issues in this Wired article).
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='svg'
from IPython.display import display,HTML
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from prettypandas import PrettyPandas
sns.set_style("ticks")
sns.set_context(context="notebook",font_scale=1)
The data is available at this link. The codebook includes many details about the available fields. The dataset was collected by web scraping the OKCupid.com website on 2012/06/30, and includes almost 60k profiles of people within a 25 mile radius of San Francisco, who were online in the previous year (after 06/30/2011), with at least one profile picture.
The CSV contains a row (observation) for each profile. Let's have a look at the first 10 profiles, excluding the columns whose name contains the string "essay": they contain a lot of free text and are not practical to display at the moment.
In [2]:
d=pd.read_csv("profiles.csv/profiles.csv")
print("The dataset contains {} records".format(len(d)))
m=d[d["sex"]=="m"] # male users
f=d[d["sex"]=="f"] # female users
print("{} males ({:.1%}), {} females ({:.1%})".format(
len(m),len(m)/len(d),
len(f),len(f)/len(d)))
PrettyPandas(d # Prettyprints pandas dataframes
.head(10) # Sample the first 10 rows
[[c for c in d.columns if "essay" not in c]] # Ignore columns with "essay" in the name (they are long)
)
Out[2]:
The dataset contains a lot of interesting information about the individual users, but this is just a fraction of the data that OkCupid has collected. The OkCupid data team reports insightful (and entertaining) studies on its OkTrends blog, which make use of additional information that is not part of this dataset.
In [3]:
print("Age statistics:\n{}".format(d["age"].describe()))
print()
print("There are {} users older than 80".format((d["age"]>80).sum()))
In [4]:
PrettyPandas(d[d["age"]>80])
Out[4]:
Let's assume the 110-year-old lady and the athletic 109-year-old gentleman (who's working on a master's program) are outliers: we get rid of them so the following plots look better. They didn't say much else about themselves, anyway.
In [5]:
d=d[d["age"]<=80] # sorry guys
print("The dataset now contains {} records".format(len(d)))
m=d[d["sex"]=="m"] # male users
f=d[d["sex"]=="f"] # female users
print("{} males ({:.1%}), {} females ({:.1%})".format(
len(m),len(m)/len(d),
len(f),len(f)/len(d)))
print()
print("Age statistics:\n{}".format(d["age"].describe()))
In [6]:
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(10,3),sharey=True,sharex=True)
sns.distplot(m["age"], ax=ax1,
bins=range(d["age"].min(),d["age"].max()),
kde=False,
color="b")
ax1.set_title("Age distribution for males")
sns.distplot(f["age"], ax=ax2,
bins=range(d["age"].min(),d["age"].max()),
kde=False,
color="r")
ax2.set_title("Age distribution for females")
ax1.set_ylabel("Number of users in age group")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()
Note that both distributions are right-skewed. As is often (but not always!) the case for right-skewed distributions, the mean is larger than the median.
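We can quantify the asymmetry with the sample skewness; a quick check using pandas' skew() (positive values confirm the right skew):

# sample skewness of the age distributions (positive = right-skewed)
print("Skewness of male ages: {:.2f}".format(m["age"].skew()))
print("Skewness of female ages: {:.2f}".format(f["age"].skew()))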
In [7]:
print("Mean and median age for males: {:.2f}, {:.2f}".format(m["age"].mean(),m["age"].median()))
print("Mean and median age for females: {:.2f}, {:.2f}".format(f["age"].mean(),f["age"].median()))
Females seem to be, on average, slightly older than males. Let's compare the age distributions in a single plot.
In [8]:
fig,(ax1,ax2) = plt.subplots(nrows=2,figsize=(10,6),sharex=True)
# Plot the age distributions of males and females on the same axis
sns.distplot(m["age"], ax=ax1,
bins=range(d["age"].min(),d["age"].max()),
kde=False,
color="b",
label="males")
sns.distplot(f["age"], ax=ax1,
bins=range(d["age"].min(),d["age"].max()),
kde=False,
color="r",
label="females")
ax1.set_ylabel("Number of users in age group")
ax1.set_xlabel("")
ax1.legend()
# Compute the fraction of males for every age value
fraction_of_males=(m["age"].value_counts()/d["age"].value_counts())
# Ignore values computed from age groups in which we have less than 100 total users (else estimates are too unstable)
fraction_of_males[d["age"].value_counts()<100]=None
barlist=ax2.bar(left=fraction_of_males.index,
height=fraction_of_males*100-50,
bottom=50, width=1, color="gray")
for bar,frac in zip(barlist,fraction_of_males):
    bar.set_color("b" if frac>.5 else "r")
    bar.set_alpha(0.4)
ax2.set_xlim([18,70])
ax2.set_xlabel("age")
ax2.set_ylabel("percentage of males in age group")
ax2.axhline(y=50,color="k")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()
There are not many users over 60, but in this group females significantly outnumber males. This may be explained by the fact that, in this age group, there are more females than males in the general population.
On this topic, also see page 5 of the 2010 US Census Brief on Age and Sex Composition; the ratio between males and females in a given population is called the sex ratio.
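As a quick sketch of how this could be checked on our own data, we can compute the sex ratio (here expressed as males per 100 females, following the census convention) in coarse age brackets:

# sex ratio (males per 100 females) in coarse age brackets
age_groups=pd.cut(d["age"],bins=[18,30,40,50,60,81],right=False)
counts=d.groupby([age_groups,"sex"]).size().unstack("sex")
counts["males_per_100_females"]=100*counts["m"]/counts["f"]
print(counts)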
We first plot the height distribution for males and females in the whole dataset.
In [9]:
fig,(ax,ax2) = plt.subplots(nrows=2,sharex=True,figsize=(6,6),gridspec_kw={'height_ratios':[2,1]})
# Plot histograms of height
bins=range(55,80)
sns.distplot(m["height"].dropna(), ax=ax,
bins=bins,
kde=False,
color="b",
label="males")
sns.distplot(f["height"].dropna(), ax=ax,
bins=bins,
kde=False,
color="r",
label="females")
ax.legend(loc="upper left")
ax.set_xlabel("")
ax.set_ylabel("Number of users with given height")
ax.set_title("height distribution of male and female users");
# Make aligned boxplots
sns.boxplot(data=d,y="sex",x="height",orient="h",ax=ax2,palette={"m":"b","f":"r"})
plt.setp(ax2.artists, alpha=.5)
ax2.set_xlim([min(bins),max(bins)])
ax2.set_xlabel("Self-reported height [inches]")
sns.despine(ax=ax)
fig.tight_layout()
Males are obviously taller than females, and both distributions look plausible.
How does this compare with general population data? Are OkCupid users perhaps cheating and over-reporting their height?
The CDC publishes growth charts, which contain height data for the general US population. The dataset reports statistics (3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 97th percentiles) for stature for different ages from 2 to 20 years. This (and more) data is plotted by the CDC in these beautiful charts.
In [10]:
# Import a CSV file for growth chart data
cdc=pd.read_csv("https://www.cdc.gov/growthcharts/data/zscore/statage.csv")
print("First 10 rows of raw cdc growth chart data")
display(PrettyPandas(cdc.head(10)))
In [11]:
# Adjust the data to fit our format
cdc["Age"]=cdc["Agemos"]/12 # convert age in months to age in fractional years
cdc["Sex"]=cdc["Sex"].replace({1:"m",2:"f"}) # align to our convention
percentiles=[3,5,10,25,50,75,90,95,97]
percentile_columns=["P"+str(p) for p in percentiles] # names of percentile columns
cdc[percentile_columns]=cdc[percentile_columns]*0.393701 # convert percentile columns from centimeters to inches (ugh)
cdc20=cdc[cdc["Age"]==20].set_index("Sex") # Select the two rows corresponding to 20-year-olds (males and females)
print("Height Percentiles for 20-year-old US population [inches]")
display(PrettyPandas(cdc20[percentile_columns],precision=4))
Let's compare the stats for reported heights of our 20-year-olds to the CDC stats for 20-year-olds.
Note that OkCupid height data are integers, which causes all percentiles to be integers too. To fix this, we jitter the data by adding random uniformly distributed noise in the range $[-0.5,+0.5]$ inches (this won't affect the mean, but will smooth the percentiles). This makes sense if we assume that users reported their height rounded to the nearest inch.
In [12]:
mheights=m.loc[m["age"]==20,"height"] # heights of 20-year-old males
fheights=f.loc[f["age"]==20,"height"] # heights of 20-year-old females
# To smooth the computation of percentiles, jitter height data by adding
# uniformly distributed noise in the range [-0.5,+0.5]
mheightsj=mheights+np.random.uniform(low=-0.5,high=+0.5,size=(len(mheights),))
fheightsj=fheights+np.random.uniform(low=-0.5,high=+0.5,size=(len(fheights),))
# For each of the available percentiles in CDC data, compute the corresponding percentile from our 20-year-old users
stats=[]
for percentile,percentile_column in zip(percentiles,percentile_columns):
    stats.append({"sex":"m",
                  "percentile":percentile,
                  "CDC":cdc20.loc["m",percentile_column],
                  "users":mheightsj.quantile(percentile/100)})
    stats.append({"sex":"f",
                  "percentile":percentile,
                  "CDC":cdc20.loc["f",percentile_column],
                  "users":fheightsj.quantile(percentile/100)})
stats=pd.DataFrame(stats).set_index(["sex","percentile"]).sort_index()
# For each percentile, compute the gap between users and CDC
stats["gap"]=stats["users"]-stats["CDC"]
print("Height percentiles (in inches) for 20-year-old males")
display(PrettyPandas(stats.loc["m"],precision=4))
print("Height percentiles (in inches) for 20-year-old females")
display(PrettyPandas(stats.loc["f"],precision=4))
In [13]:
fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,4))
stats.loc["m"][["users","CDC"]].plot.bar(ax=ax1,color=[[0.5,0.5,1],"k"],alpha=1,width=0.8,rot=0)
stats.loc["f"][["users","CDC"]].plot.bar(ax=ax2,color=[[1,0.5,0.5],"k"],alpha=1,width=0.8,rot=0)
ax1.set_ylim([64,77])
ax2.set_ylim([58,71])
ax1.set_ylabel("Height [inches]")
ax2.set_ylabel("Height [inches]")
ax1.set_title("Height percentiles in 20y-old male users vs CDC data")
ax2.set_title("Height percentiles in 20y-old female users vs CDC data")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()
The height statistics of our user population match the CDC data remarkably well. It looks like our users, males and females alike, may be slightly over-reporting their height (still, much less than one could expect), but there could be many other reasonable explanations for this gap: for example, the San Francisco population may be taller than the general US population. This might be a starting point for further investigation.
Another explanation may be that short users are more likely not to report their height at all; in this case, even though every reported height is honest, our statistics (which ignore missing values) would be biased. This problem is well known to statisticians as data missing not at random (MNAR). Could we be victims of this?
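To see how such a mechanism could bias the percentiles, here is a minimal simulation with entirely synthetic numbers (for illustration only): we draw "true" heights from a normal distribution, then make short individuals much more likely to leave the field blank.

# illustrative MNAR simulation on synthetic data (not OkCupid data)
np.random.seed(0)
true_heights=np.random.normal(loc=70,scale=3,size=100000) # synthetic "true" heights [inches]
# the probability of reporting one's height grows with height:
# short individuals often leave the field blank (missing not at random)
p_report=1/(1+np.exp(-(true_heights-64)))
observed=true_heights[np.random.uniform(size=len(true_heights))<p_report]
for q in [3,25,50,75,97]:
    print("P{}: true {:.2f}, observed {:.2f}".format(
        q,np.percentile(true_heights,q),np.percentile(observed,q)))
# the observed percentiles (especially the low ones) are shifted upwards,
# even though every reported height is honest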
In [14]:
d["height"].isnull().sum()
Out[14]:
Apparently not: only 3 users are missing height data, which could never explain the gap. Height is probably a required field at sign-up; why a handful of users lack it is unknown (maybe they signed up when the field was not yet compulsory?).
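As a further quick check, the clearly optional fields have far more missing entries than height does:

# missing values: height vs some clearly-optional fields
print(d[["height","body_type","diet","religion","sign"]].isnull().sum())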
Note that this analysis on the whole population of OkCupid users found a larger tendency to exaggerate height.
All our users are over 18. As the CDC charts show, height does not increase much after age 18, but we can still observe the very last stage of growth, amounting to fractions of an inch.
In [15]:
#%% Investigate heights vs sex vs age
g=d.groupby(["sex","age"])["height"].mean()
fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,3))
ax1.plot(g["m"],color="b")
ax1.set_xlim(18,27)
ax1.set_ylim(69.5,71)
ax1.set(title="Average height vs age for males",
ylabel="height",
xlabel="age")
ax2.plot(g["f"],color="r")
ax2.set_xlim(18,27)
ax2.set_ylim(64,65.5)
ax2.set(title="Average height vs age for females",
ylabel="height",
xlabel="age");
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()
We can also overlay the CDC growth charts on the plots above, with minimal data wrangling.
We only have to massage the CDC data so that it is indexed by age in integer years: for a given integer age (say 19 years), the original data has 12 rows, corresponding to ages of $(19 \cdot 12 + 0), (19 \cdot 12 + 1), \cdots, (19 \cdot 12 + 11)$ months; for each integer age we average the corresponding percentiles of these 12 rows.
In [16]:
cdc_m=cdc[cdc["Sex"]=="m"].groupby(np.floor(cdc["Age"]))[percentile_columns].mean()
cdc_f=cdc[cdc["Sex"]=="f"].groupby(np.floor(cdc["Age"]))[percentile_columns].mean()
# Result for males
display(PrettyPandas(cdc_m,precision=4))
In [17]:
# Compute average height per sex and age
g=d.groupby(["sex","age"])["height"].mean()
fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,5))
ax1.plot(g["m"],color="b",label="Mean of male OkCupid users")
ax1.plot(cdc_m["P75"],color="k",linestyle='dotted',label="75th percentile of male US Population")
ax1.plot(cdc_m["P50"],color="k",label="Median of male US Population")
ax1.plot(cdc_m["P25"],color="k",linestyle='dotted',label="25th percentile of male US Population")
ax1.fill_between(cdc_m.index,cdc_m["P25"],cdc_m["P75"],color="k",alpha=0.1,linewidth=0)
#ax1.legend(loc="lower right")
# Use direct labeling instead of a legend
x=cdc_m["P50"].index[-1]
ax1.text(x, g["m"].loc[:26].max(), " Mean of male OkCupid users", color="b",
verticalalignment="bottom",fontsize="small")
ax1.text(x, cdc_m["P75"].iloc[-1]," 75th percentile of male US Population",
verticalalignment="center",fontsize="small")
ax1.text(x, cdc_m["P50"].iloc[-1]," Median of male US Population",
verticalalignment="center",fontsize="small")
ax1.text(x, cdc_m["P25"].iloc[-1]," 25th percentile of male US Population",
verticalalignment="center",fontsize="small")
ax1.set_xlim(16,27)
ax1.set_ylim(67,72)
ax1.set(title="height vs age for males",
ylabel="height [inches]",
xlabel="age (rounded down for CDC data) [years]");
ax2.plot(g["f"],color="r",label="Mean of female OkCupid users")
ax2.plot(cdc_f["P75"],color="k",linestyle='dotted',label="75th percentile of female US Population")
ax2.plot(cdc_f["P50"],color="k",label="Median of female US Population")
ax2.plot(cdc_f["P25"],color="k",linestyle='dotted',label="25th percentile of female US Population")
ax2.fill_between(cdc_f.index,cdc_f["P25"],cdc_f["P75"],color="k",alpha=0.1,linewidth=0)
#ax2.legend(loc="lower right")
# Use direct labeling instead of a legend
x=cdc_f["P50"].index[-1]
ax2.text(x, g["f"].loc[:26].max(), " Mean of female OkCupid users", color="r",
verticalalignment="bottom",fontsize="small")
ax2.text(x, cdc_f["P75"].iloc[-1]," 75th percentile of female US Population",
verticalalignment="center",fontsize="small")
ax2.text(x, cdc_f["P50"].iloc[-1]," Median of female US Population",
verticalalignment="center",fontsize="small")
ax2.text(x, cdc_f["P25"].iloc[-1]," 25th percentile of female US Population",
verticalalignment="center",fontsize="small")
ax2.set_xlim(16,27)
ax2.set_ylim(62,67)
ax2.set(title="height vs age for females",
ylabel="height [inches]",
xlabel="age (rounded down for CDC data) [years]");
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()
In [14]:
fig,ax=plt.subplots(figsize=(6,5))
sns.countplot(y="body_type",hue="sex",
order=d["body_type"].value_counts().sort_values(ascending=False).index,
data=d,palette={"m":"b","f":"r"},alpha=0.5,ax=ax);
ax.set_title("Number of female and male users self-reporting each body type")
sns.despine(ax=ax)
In the plot above, males and females are two sub-groups of the population, whereas body_type is a categorical attribute. It is interesting to compare how users in each of the two sub-groups are likely to use each of the available categorical values; this is normally done through contingency tables. For example, regardless of how frequent these body types are in the general population, it's interesting to note that the terms curvy and full figured are disproportionately used by females, whereas a little extra is mostly used by males.
The function implemented below aims to visualize this, for a generic categorical attribute (series) and a generic grouping of users. It accepts two boolean series g1 and g2 defining two subgroups of the population (in the example, whether each user is a male (True in g1) or a female (True in g2)). Note that, even though this is the case in the example, the two groups don't need to define a partition of the population: they may intersect (a user may be in both g1 and g2) and they may not cover the whole population (a user may be in neither g1 nor g2). We also expect that the two subgroups may have very different sizes.
For each unique value of series represented in the union of the two groups, the function visualizes how frequently the value is used by group g1 compared to how frequently it is used by group g2. We expect the output to be $0.5$ for values that are equally frequent in g1 and g2 (note that, since we deal with frequencies, this does not depend on the relative sizes of g1 and g2, nor on whether the frequency in both groups is very low or very high); we expect the output to be $0$ for values only seen in g2 and never in g1, and $1$ for values only seen in g1 and never in g2.
The function automatically discards values that appear fewer than 50 times in total in the two groups, since these would yield an unreliable estimate of the proportion.
In [15]:
import math
# Define visualization function
def compare_prevalence(series,g1,g2,g1name,g2name,g1color,g2color,ax):
    # for each categorical value represented in series, number of users in group g1 which have this value
    g1n=series.loc[g1].value_counts()
    # for each categorical value represented in series, number of users in group g2 which have this value
    g2n=series.loc[g2].value_counts()
    # join the two series in a single dataframe, filling 0 where indices don't match
    # (e.g. if a value represented in g1 never appeared in g2)
    df=pd.concat({"g1n":g1n,"g2n":g2n},axis=1).fillna(0)
    # df has one row for every distinct value of series in the union of g1 and g2
    # normalize the data
    df["g1f"]=df["g1n"]/(df["g1n"].sum()) # fraction of g1 users with each categorical value
    df["g2f"]=df["g2n"]/(df["g2n"].sum()) # fraction of g2 users with each categorical value
    assert(math.isclose(df["g1f"].sum(),1))
    assert(math.isclose(df["g2f"].sum(),1))
    # for each row of df, compute how frequent the value is in g1 compared to the frequency it has in g2
    df["frac12"]=df["g1f"]/(df["g1f"]+df["g2f"])
    # we expect df["frac12"] to be 0.5 for values equally frequent in g1 and g2 (this does not depend on the size of g1 and g2),
    # 0 for values only seen in g2 and never in g1, and 1 for values only seen in g1 and never in g2
    df=df[(df["g1n"]+df["g2n"])>=50] # exclude values which are too rare
    df=df.sort_values("frac12")
    # Draw the left bars
    ax.barh(bottom=range(len(df)),
            width=df["frac12"],
            left=0,
            height=1,
            align="center",
            color=g1color,alpha=1)
    # Draw the right bars
    ax.barh(bottom=range(len(df)),
            width=df["frac12"]-1,
            left=1,
            height=1,
            align="center",
            color=g2color,alpha=1)
    # Draw a faint vertical line for x=0.5
    ax.axvline(x=0.5,color="k",alpha=0.1,linewidth=5)
    ax.set(xlim=[0,1],
           ylim=[-1,len(df)-0.5],
           yticks=range(len(df)),
           yticklabels=df.index,
           xlabel="fraction of users",
           ylabel=series.name)
    ax.set_title("Relative prevalence of {} ($n={}$) vs {} ($n={}$)\nfor each value of {}".format(
        g1name,g1.sum(),g2name,g2.sum(),series.name),
        loc="left",fontdict={"fontsize":"medium"})
    ax.text(0.02,len(df)-1,g1name,verticalalignment="center",horizontalalignment="left",size="smaller",color="w")
    ax.text(0.98,0,g2name,verticalalignment="center",horizontalalignment="right",size="smaller",color="w")
    def color_for_frac(f):
        # Blend g1color and g2color according to f (convex linear combination):
        # f=1 returns g1color, f=0 returns g2color
        ret=np.array(g1color)*f+np.array(g2color)*(1-f)
        if np.linalg.norm(ret)>1:       # if the resulting rgb color is too bright for text,
            ret=ret/np.linalg.norm(ret) # rescale its brightness to dark (but keep hue)
        return ret
    for i,tl in enumerate(ax.get_yticklabels()): # use the given ax, not plt.gca()
        tl.set_color(color_for_frac(df["frac12"].iloc[i]))
    sns.despine(ax=ax,left=True)
# Apply visualization function
fig,ax = plt.subplots(figsize=(10,3))
compare_prevalence(
series=d["body_type"], # Which categorical attribute?
g1=d["sex"]=="m", g2=d["sex"]=="f", # Definition of the two groups
g1name="male users", g2name="female users", # Names of the two groups
g1color=[0.5,0.5,1.0], g2color=[1.0,0.5,0.5], # Colors for the two groups
ax=ax)
fig.tight_layout()
The plot above can be interpreted as follows: if we sample the same number of males and females from the population, and keep only those who describe themselves as "athletic", almost 73% of them will be male and the remaining 27% female.
Similarly, if we consider users who describe themselves as "curvy", more than 98% of them will be female.
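As a sanity check, we can recompute the value for "athletic" directly from the per-group frequencies (the same frac12 quantity the function above computes):

# frac12 for "athletic": frequency among males vs frequency among females
mf=m["body_type"].value_counts(normalize=True)["athletic"] # fraction of male users reporting "athletic"
ff=f["body_type"].value_counts(normalize=True)["athletic"] # fraction of female users reporting "athletic"
print("Fraction of males among 'athletic' users, after balancing the groups: {:.1%}".format(mf/(mf+ff)))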
Note that the fractions do not depend on the prevalence of the two groups in the overall population, nor on the prevalence of each categorical body type value in the general population. We can therefore apply the same function to other group pairs, even when the two groups have very different sizes.
This plot is a less expressive variant of a well-known representation of contingency tables called the mosaic plot, and closely resembles what is sometimes called a relative frequency segmented bar plot.
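For comparison, a mosaic plot of the same two variables can be drawn with statsmodels (a sketch; it assumes the statsmodels package is available):

from statsmodels.graphics.mosaicplot import mosaic
# mosaic plot of sex vs body_type: column widths encode group sizes,
# tile heights encode the within-group frequency of each body type
fig,ax=plt.subplots(figsize=(10,4))
mosaic(d.dropna(subset=["body_type"]),["sex","body_type"],ax=ax,gap=0.01,label_rotation=30)
fig.tight_layout()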
We now make many plots like the one above, for different attributes and different groupings.
First, we define the pairs of groups that we want to compare. For each group in each pair, we define a name, a color, and a boolean series on d which masks the users belonging to the group.
In [16]:
groupss=[ # (name, color RGB, boolean series to index observations in d)
    (("male users", [0.5,0.5,1.0], d["sex"]=="m"),
     ("female users", [1.0,0.5,0.5], d["sex"]=="f")),
    (("straight users", [0.5,0.5,0.5], d["orientation"]=="straight"),
     ("gay/bi users", [0.7,0.7,0.1], d["orientation"]!="straight")),
    (("young users", [0.4,0.8,0.4], d["age"]<25),
     ("older users", [0.4,0.4,0.5], d["age"]>35)),
    (("earns_little", [0.4,0.4,0.4], d["income"]==20000),
     ("earns_alot", [0.4,0.9,0.6], d["income"]>=50000)),
    # We purposefully exclude users not reporting their income (i.e. the majority). 20k is the lowest available tier.
    # We also purposefully exclude intermediate tiers.
    (("fit users", [0.4,0.6,0.8], d["body_type"].isin(["average","fit","athletic","thin","skinny","jacked"])),
     ("unfit users", [0.8,0.6,0.4], d["body_type"].isin(["a little extra","curvy","full figured","rather not say","overweight"]))),
    # We count users who chose "rather not say" for body_type as unfit
    (("monoglots", [0.7,0.7,0.7], d["speaks"].str.count(",")==0),
     ("polyglots", [0.5,0.5,0.5], d["speaks"].str.count(",")>=2)),
    (("witty_programmers", [0.4,0.7,0.4], d["speaks"].str.contains("\\b(?:c|lisp)\\b",na=False)),
     ("non witty_programmers", [0.5,0.5,0.6], ~(d["speaks"].str.contains("\\b(?:c|lisp)\\b",na=True)))),
    # a "witty programmer" is a nerd who, when asked which languages they speak, thinks that "C" or "lisp"
    # are reasonable answers. In their defence, these are valid options in the okcupid interface (see codebook).
    # We compare them with everyone else (i.e. people who are probably not as nerdy).
    (("it_lang_speakers", [0.5,0.7,0.5], d["speaks"].str.contains("\\b(?:italian)\\b",na=False)),
     ("non it_lang_speakers", [0.5,0.5,0.5], ~(d["speaks"].str.contains("\\b(?:italian)\\b",na=True)))),
    (("dead_lang_speakers", [0.5,0.7,0.5], d["speaks"].str.contains("\\b(?:latin|ancient greek)\\b",na=False)),
     ("non dead_lang_speakers", [0.5,0.5,0.5], ~(d["speaks"].str.contains("\\b(?:latin|ancient greek)\\b",na=True)))),
    # a "dead_lang_speaker" is a nerd who, when asked which languages they speak, thinks that "latin" or
    # "ancient greek" are reasonable answers. In their defence, these are valid options in the okcupid
    # interface (see codebook). We compare them with everyone else (i.e. people who are probably not as nerdy).
    (("nonsmokers", [0.6,0.6,1.0], d["smokes"]=="no"),
     ("smokers", [0.3,0.3,0.3], d["smokes"]=="yes")),
    (("omnivores", [0.8,0.4,0.4], d["diet"].str.split().str.get(-1).str.contains("anything",na=False)),
     ("veg(etari)ans", [0.3,0.8,0.3], d["diet"].str.split().str.get(-1).str.contains("veg",na=False))),
    (("doesnt_want_kids", [0.4,0.4,0.4], d["offspring"].isin(["doesn’t want kids","doesn’t have kids, and doesn’t want any"])),
     ("wants_kids", [0.4,0.8,0.8], d["offspring"].isin(["doesn’t have kids, but wants them","wants kids","has a kid, and wants more","has kids, and wants more"]))),
    # We sample the possible answers to the "offspring" question, identifying those who explicitly say they want (more) kids
    # vs those who explicitly say they don't, regardless of whether they already have kids or not.
    (("cat lovers", [1.0,0.6,0.6], d["pets"].str.contains("\\b(?:likes|has) cats",na=False)),
     ("cat haters", [0.4,0.4,0.4], d["pets"].str.contains("\\bdislikes cats",na=False))),
    # A cat lover either has or likes cats; a cat hater dislikes cats.
    (("dog lovers", [1.0,0.6,0.6], d["pets"].str.contains("\\b(?:likes|has) dogs",na=False)),
     ("dog haters", [0.4,0.4,0.4], d["pets"].str.contains("\\bdislikes dogs",na=False)))
    # A dog lover either has or likes dogs; a dog hater dislikes dogs.
]
# We finally add a couple of religion_based groupings
atheists = d["religion"].str.contains("\\b(?:atheism|agnosticism)\\b").fillna(False)
religious = ~(d["religion"].str.contains("\\b(?:atheism|agnosticism)\\b").fillna(True))
religion_not_important = religious & d["religion"].str.contains(
"\\b(?:but not too serious about it|and laughing about it)\\b").fillna(False)
religion_important = religious & d["religion"].str.contains(
"\\b(?:and somewhat serious about it|and very serious about it)\\b").fillna(False)
groupss+=[
    (("atheists/agnostics", [0.4,0.5,0.4], atheists),
     ("religious", [0.4,0.4,0.5], religious)),
    (("serious_religious", [0.4,0.3,0.3], religion_important),
     ("laughing_religious", [0.6,0.6,0.5], religion_not_important))]
Body type is not the only interesting categorical attribute. Below we add this and other attributes to a list, possibly after massaging the data to limit the number of categories in each attribute.
Note that the name of each series is used to label the plots.
In [17]:
attrs=[]
attrs.append(d["body_type"])
attrs.append(d["job"])
location=d["location"].copy()
users_per_location=location.value_counts()
location.loc[~(location.isin(list(users_per_location[users_per_location>1000].index)))]="elsewhere"
attrs.append(location)
# There are too many unique locations; we rewrite all locations that appear less than 1k times as "elsewhere".
attrs.append(d["orientation"])
attrs.append(d["drinks"])
# Has potentially interesting categories beyond yes and no, like "desperately".
attrs.append(d["education"].str.replace("working on ","").str.replace("graduated from ",""))
# For each type of school, one may have (see codebook) one optional prefix in the set:
# {"working on ", "graduated from ", "dropped out of "}. We remove the first two, so
# "working on high school", "graduated from high school", "high school" become the same category.
# "Dropped out of high school", however, will stay a separate category
cats_relationship=pd.Series("does not mention cats",index=d.index,name="cats_relationship")
for s in ["has cats","likes cats","dislikes cats"]:
cats_relationship[d["pets"].str.contains("\\b"+s,na=False)]=s
attrs.append(cats_relationship)
dogs_relationship=pd.Series("does not mention dogs",index=d.index,name="dogs_relationship")
for s in ["has dogs","likes dogs","dislikes dogs"]:
dogs_relationship[d["pets"].str.contains("\\b"+s,na=False)]=s
attrs.append(dogs_relationship)
attrs.append(d["religion"].str.split(" ").str.get(0).fillna("not entered"))
# We only keep the name of the religion (first token) and not the optional comment
# We consider as a separate value users that did not enter any religion (which is an interesting piece of information per se)
attrs.append(d["sign"].str.split(" ").str.get(0).fillna("not entered"))
# We only keep the name of the astrological sign (first token) and not the optional comment
# We consider as a separate value users that did not enter any sign (which is an interesting piece of information per se)
In [26]:
%matplotlib
# we temporarily switch to the default non-inline backend;
# the inline backend crashes in the cell below because it won't close the figures before the cell is done executing.
def makesafefilename(filename):
    keepcharacters = (' ','.','_')
    return "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
import pathlib
imgdir=pathlib.Path("plots")
imgdir.mkdir(exist_ok=True) # make sure the output directory exists before saving figures
for ((g1name,g1color,g1),(g2name,g2color,g2)) in groupss:
    html="{} (n={}) vs {} (n={})\n<ul>".format(g1name,g1.sum(),g2name,g2.sum())
    for attr in attrs:
        fig,ax = plt.subplots(figsize=(10,1.5+0.15*len(attr.unique())))
        compare_prevalence(attr,g1,g2,g1name,g2name,g1color,g2color,ax)
        fig.tight_layout()
        fn=imgdir/makesafefilename("plot_{}_{}_{}.svg".format(attr.name,g1name,g2name))
        fig.savefig(str(fn))
        html+='<li><a href="{}">For attribute {} ({} unique values)</a></li>'.format(
            str(fn),attr.name,len(attr.unique()))
        plt.close(fig)
    html+="</ul>"
    display(HTML(html))
In [18]:
%matplotlib inline
# back to inline plots
In [28]:
# This cell is only needed to generate the markdown text with image links,
# which is reported and edited manually (deleting uninteresting plots) in the next cell
md=""
for ((g1name,g1color,g1),(g2name,g2color,g2)) in groupss:
for attr in attrs:
fn=imgdir/makesafefilename("plot_{}_{}_{}.svg".format(attr.name,g1name,g2name))
md+='![{}]({})\n\n'.format(attr.name,str(fn))
We purposefully exclude users not reporting their income (i.e. the majority). Users who earn little are those who selected 20k\$ as their annual income (the lowest available tier); users who earn a lot selected 50k\$ or more. We exclude users reporting earnings in intermediate tiers.
A "witty programmer" is a nerd who, when asked which languages they speak, thinks that "C" or "lisp" are reasonable answers. In their defence, these are valid options in the okcupid interface (see codebook). We compare them with everyone else (i.e. people who are probably not as nerdy).
A "dead_lang_speaker" is a nerd who, when asked which languages they speak, thinks that "latin" or "ancient greek" are reasonable answers. In their defence, these are valid options in the okcupid interface (see codebook). We compare them with everyone else (i.e. people who are probably not as nerdy).
Users not reporting their relationship with dogs are excluded
We exclude atheists/agnostics. Users who take their religion seriously are those who entered a religion and added "and somewhat serious about it" or "and very serious about it". "Laughing_religious" users are those who entered a religion and added "but not too serious about it" or "and laughing about it".
The data contains free-text essays written by the users on ten different topics (columns essay0 through essay9; the codebook lists the prompts).
In the following, we concatenate all essays into a single string per user and ignore the different themes.
In [6]:
d["essays"]=""
for f in ["essay"+str(i) for i in range(10)]:
d.loc[d[f].isnull(),f]=""
d["essays"]=d["essays"]+" "+d[f]
Let's index and count all unique words in all essays.
In [7]:
import string
from collections import Counter
wordcounts=Counter()
for e in d["essays"].str.split():
wordcounts.update([w.strip(string.whitespace+string.punctuation) for w in e])
The 20 most common words:
In [21]:
for w,c in wordcounts.most_common(20):
    print(c,w)
Among the 10k most common words, we keep those with at least 4 characters. We then build a dataframe with a binary column for each such word and a row for each user: the value is True if the user's essays contain the word, False otherwise.
In [ ]:
import tqdm # a cool progress bar
# Let's consider the most common 10k words
words=[w for w,c in wordcounts.most_common(10000) if len(w)>=4 and w.isalpha()]
d_contains=pd.DataFrame(index=d.index)
# This operation takes a lot of time
for w in tqdm.tqdm(words):
    d_contains[w]=d["essays"].str.contains("\\b"+w+"\\b")
# These frequent words were part of the html and we should ignore them
d_contains=d_contains.drop("href",axis=1)
d_contains=d_contains.drop("ilink",axis=1)
d_contains.to_pickle("d_contains.pickle") # Save a cache
In [34]:
d_contains=pd.read_pickle("d_contains.pickle") # Read the cache, to avoid the hours of computation above
In [35]:
print("The dataset contains {} rows (users) and {} columns (words)".format(
len(d_contains.index),len(d_contains.columns)))
Let's visualize the resulting dataframe.
We show the transpose of the dataframe, i.e. one row per word and one column per user. We only display 400 users (less than one hundredth of all users in the dataset) and 100 words (about 1/80th of all the frequent words we found).
In [36]:
fig,(ax1,ax2)=plt.subplots(nrows=2,sharex=True,figsize=(10,15))
sns.heatmap(d_contains.iloc[0:400,0:50].transpose(),
ax=ax1,cbar=None)
sns.heatmap(d_contains.iloc[0:400,-50:].transpose(), # -50: (not -50:-1) so we really get 50 words
            ax=ax2,cbar=None)
ax1.set_title("Which of the first 400 users (columns) use which of the 50 most frequent words (rows)")
ax2.set_title("Which of the first 400 users (columns) use which of the 50 least frequent words (rows) among the "+str(d_contains.shape[1])+" most frequent ones")
for ax in (ax1,ax2):
    ax.set_xticks([])
    ax.set_xlabel("Users")
    ax.set_ylabel("User's essays contain word")
fig.tight_layout()
In [45]:
def compare_word_use(d_contains,g1,g2,g1name,g2name,g1color,g2color,ax):
    g1n=d_contains.loc[g1].sum(axis=0) # for each word, number of users in group g1 who use this word
    g2n=d_contains.loc[g2].sum(axis=0) # for each word, number of users in group g2 who use this word
    df=pd.concat({"g1n":g1n,"g2n":g2n},axis=1).fillna(0)
    df["g1f"]=df["g1n"]/(df["g1n"].sum()) # share of g1's total word usage accounted for by each word
    df["g2f"]=df["g2n"]/(df["g2n"].sum()) # share of g2's total word usage accounted for by each word
    df["frac12"]=df["g1f"]/(df["g1f"]+df["g2f"])
    df=df[(df["g1n"]+df["g2n"])>=50] # exclude words which are used by too few users
    df=df.sort_values("frac12")
    n=40
    gap=0.5
    bottom_df=df["frac12"].head(n)
    top_df=df["frac12"].tail(n)
    # Bottom bars
    bottom_bars_y=np.arange(n)
    ax.barh(bottom=bottom_bars_y,
            width=bottom_df,
            left=0,
            height=1,
            align="center",
            color=g1color,alpha=1)
    ax.barh(bottom=bottom_bars_y,
            width=bottom_df-1,
            left=1,
            height=1,
            align="center",
            color=g2color,alpha=1)
    # Top bars
    top_bars_y=np.arange(n)+gap+n
    ax.barh(bottom=top_bars_y,
            width=top_df,
            left=0,
            height=1,
            align="center",
            color=g1color,alpha=1)
    ax.barh(bottom=top_bars_y,
            width=top_df-1,
            left=1,
            height=1,
            align="center",
            color=g2color,alpha=1)
    ax.axvline(x=0.5,color="k",alpha=0.1,linewidth=5)
    ax.set(xlim=[0,1],
           ylim=[-1,n*2+gap-0.5],
           yticks=np.hstack((bottom_bars_y,top_bars_y)),
           yticklabels=list(bottom_df.index)+list(top_df.index),
           xlabel="fraction of users",
           ylabel="word contained in essays")
    ax.set_title("Relative prevalence of {} ($n={}$) vs {} ($n={}$)\namong users that wrote a given word in essays".format(
        g1name,g1.sum(),g2name,g2.sum()),
        loc="left",fontdict={"fontsize":"medium"})
    ax.text(0.02,top_bars_y[-1],g1name,verticalalignment="center",horizontalalignment="left",size="smaller",color="w")
    ax.text(0.98,0,g2name,verticalalignment="center",horizontalalignment="right",size="smaller",color="w")
    def color_for_frac(f):
        # Blend g1color and g2color according to f (convex linear combination):
        # f=1 returns g1color, f=0 returns g2color
        ret=np.array(g1color)*f+np.array(g2color)*(1-f)
        if np.linalg.norm(ret)>1:       # if the resulting rgb color is too bright for text,
            ret=ret/np.linalg.norm(ret) # rescale its brightness to dark (but keep hue)
        return ret
    for i,tl in enumerate(ax.get_yticklabels()): # use the given ax, not plt.gca()
        tl.set_color(color_for_frac(pd.concat((bottom_df,top_df),axis=0).iloc[i]))
    sns.despine(ax=ax,left=True)
#fig,ax = plt.subplots(figsize=(7,14))
#compare_word_use(d_contains,
# g1=d["sex"]=="m", g2=d["sex"]=="f",
# g1name="male users", g2name="female users",
# g1color=[0.5,0.5,1.0], g2color=[1.0,0.5,0.5],
# ax=ax)
The function defined above reports the 40 words most strongly associated with each of the two groups.
We now apply it to all the groupings defined previously, generating a plot for each grouping.
In [44]:
%matplotlib
# we temporarily switch to the default non-inline backend;
# the inline backend crashes in the cell below because it won't close the figures before the cell is done executing.
# Let's generate a plot for each of the groupings defined above
for (g1name,g1color,g1),(g2name,g2color,g2) in groupss:
    fig,ax = plt.subplots(figsize=(10,14))
    compare_word_use(d_contains,g1,g2,g1name,g2name,g1color,g2color,ax)
    fig.tight_layout()
    fn=imgdir/makesafefilename("words_{}_{}.svg".format(g1name,g2name))
    fig.savefig(str(fn))
    plt.close(fig)
    html='<a href="{}">Most used words for {} vs {}</a>'.format(
        str(fn),g1name,g2name)
    display(HTML(html))
In [ ]:
# This cell is only needed to generate the markdown text with image links,
# which is reported and edited manually (deleting uninteresting plots) in the next cell
md=""
for ((g1name,g1color,g1),(g2name,g2color,g2)) in groupss:
    fn=imgdir/makesafefilename("words_{}_{}.svg".format(g1name,g2name))
    md+='![{}_{}]({})\n\n'.format(g1name,g2name,str(fn))
print(md)
Below we report a subset of these plots.
Answering a question from a reddit user: what does "Isaac" refer to? Asimov (not the game), apparently.
In [51]:
print(d["essays"].str.contains("\\bbinding of isaac\\b").sum())
print(d["essays"].str.contains("\\bisaac asimov\\b").sum())
print(d["essays"].str.contains("\\basimov\\b").sum())
d["essays"].str.extract("(\\bisaac [a-z]*\\b)").dropna().value_counts()
Out[51]: