Data Bootcamp Final Project: ABC's The Bachelor

Laura Capucilli, Michael Trentham, and Carter Stone

Since it’s premiere in 2002, ABC’s The Bachelor franchise has invited audiences across the U.S. to watch as single, young men and women attempt to find true love among a pool of eligible bachelors and bachelorettes, all on national TV and within a timeframe of a few months. The show begins with a single bachelor / bachelorette and typically up to 30 contestants. Each week, the bachelor / bachelorette narrows the pool of potential future partners through an elimination round, the Rose Ceremony. Ultimately, the season is expected to end in a proposal, the recipient of which remains a toss up between two men / women until the finale.

Given the variance in the outcome of relationships that began on the show, our final project analyzes data from nearly 30 seasons of the show with several questions in mind: Are there commonalities or differences among contestants, bachelors, and bachelorettes? Are there patterns that determine the ultimate success or failure of a relationship?

I. Setting Up


In [3]:
#We will begin by importing several packages to use for our analysis:

import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
%matplotlib inline

In [4]:
# Check versions
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())


Python version:  3.6.0 |Anaconda 4.3.0 (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
Pandas version:  0.19.2
Matplotlib version:  2.0.0
Today:  2017-05-09

In [5]:
# Import Bachelor and Bachelorette datasets
url1 ='https://github.com/NYUDataBootcamp'
url2 = '/Materials/blob/master/Data/Bachelor_Data.xlsx?raw=true'
url = url1+url2

df = pd.read_excel(url, sheetname = 0) # Sheet 1: Bachelors/Bachelorettes
df2 = pd.read_excel(url, sheetname = 1) # Sheet 2: Female contestants
df3 = pd.read_excel(url, sheetname = 2) # Sheet 3: Male contestants

In [6]:
# Check import for each Sheet

df.shape


Out[6]:
(29, 12)

In [7]:
df2.shape


Out[7]:
(315, 7)

In [8]:
df3.shape


Out[8]:
(206, 7)

In [109]:
df.dtypes # 'Wiki Data' refers to whether or not there is information about the season available on Wikipedia.


Out[109]:
Show                    object
Season                   int64
Premiered       datetime64[ns]
Name                    object
Profile                 object
Winner                  object
Runner(s)-Up            object
Proposal                object
Success                 object
Wiki Data               object
Birthday                object
Height                  object
dtype: object

In [10]:
df2.dtypes


Out[10]:
Season          int64
Year            int64
Name           object
Age           float64
Hometown       object
Occupation     object
Eliminated     object
dtype: object

In [11]:
df3.dtypes


Out[11]:
Season         int64
Year           int64
Name          object
Age            int64
Hometown      object
Occupation    object
Eliminated    object
dtype: object

In [12]:
df.head(2)


Out[12]:
Show Season Premiered Name Profile Winner Runner(s)-Up Proposal Success Wiki Data Birthday Height
0 Bachelor 1 2002-03-25 Alex Michel Management consultant Amanda Marsh Trista Rehn No No No 1970-08-10 00:00:00 6'0"
1 Bachelor 2 2002-09-25 Aaron Buerge Vice President of a chain of family-owned banks Helene Eksterowicz Brooke Smith Yes No No 1974-04-22 00:00:00 6'0"

In [13]:
df2.head(2)


Out[13]:
Season Year Name Age Hometown Occupation Eliminated
0 5 2004 Jessica Bowlin 22.0 Huntington Beach, California Student Winner
1 5 2004 Tara Huckeby 23.0 Shawnee, Oklahoma General contractor Week 7

In [14]:
df3.head(2)


Out[14]:
Season Year Name Age Hometown Occupation Eliminated
0 2 2004 Ian McKee 29 New York, NY Equity Research Sales Winner
1 2 2004 Matthew Hickl 28 Friendswood, TX Pharmaceutical Sales Rep Episode 8

II. Contestants' Age

We began by looking at the average age of contestants at different stages of the competition:


In [15]:
# Clean up and shape Sheet 2 (Female Contestants dataframe)
df2_mean = df2[['Year','Age']] 
df2_mean = df2_mean.groupby('Year')
df2_mean = df2_mean.mean()
df2_mean.plot(linewidth=3, color='black', title= 'Average Age by Year (Women)')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b0b7b70>

The chart above shows the average age of female contestants for each year. While the average age of contestants is slightly older in recent years compared to early seasons of the show, the average age has remained fairly consistent (within 1 year) over the last decade.

Next, we'll explore how age relates to the week in which the contestants left the competition:


In [16]:
# Shaping the data further

df2_AgeWeekF = df2[['Age', 'Eliminated']]
df2_AgeWeekF = df2_AgeWeekF.groupby('Eliminated')
df2_AgeWeekF = df2_AgeWeekF.mean()

In [121]:
# Plotting our data
df2_AgeWeekF.plot(kind='bar', color = 'deeppink', title="Average Age by Elimination Week (Women)")


Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x125db92b0>

Looks like we have more cleaning to do. We'll use the following code to make our x-axis more consistent:


In [18]:
df2_clean = df2

In [19]:
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Withdrew in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Eliminated in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Quit in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Left in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Disqualified in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Removed in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Quit in epiosde', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Runner-up', 'Week Runner-Up')

In [72]:
AgeWeekF = df2_clean[['Age', 'Eliminated']]
AgeWeekF = AgeWeekF.groupby('Eliminated')
AgeWeekF = AgeWeekF.mean()

In [120]:
AgeWeekF.plot(kind='bar', ylim=[24,32], color = 'deeppink', title="Average Age by Elimination Week (Women)")


Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x125c93be0>

Here, we can see that ages 26 and 27 dominate the elimination age for the first few weeks. This is interesting because it might suggest that the bachelor is interested in a polarized age demographic at first. He seems to be ruling out much of the middle ground. That said, it could also be the case that the bachelor is keeping only the contestants in the polarized age demographic, but that their average ages are 26 and 27.

In the middle weeks, the younger contestants seem to be targeted more for elimination. We think a major takeaway here is that the Bachelor may be becoming more serious about the process as he is moving along, and thus is elimnating more immature contestants (contestants not yet ready for marriage).

We'll now look at the same information using the male contestant dataframe (Sheet 3):


In [22]:
df3_clean = df3.set_index('Season')

In [23]:
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Eliminated in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Removed in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Quit in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Withdrew in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Runner-up', 'Week Runner-up')

In [24]:
AgeWeekM = df3_clean[['Age', 'Eliminated']]
AgeWeekM = AgeWeekM.groupby('Eliminated')
AgeWeekM = AgeWeekM.mean()
AgeWeekM.plot(linewidth=3, color = 'black', title='Average Age by Year (Men)')


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e60d9e8>

In [25]:
AgeWeekM.plot(kind='bar', ylim=[24,32], color = 'dodgerblue', title='Average Age by Week Eliminated (Men)')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11e619ac8>

There's not as much rhyme or reason to this chart; however, we can see that the Bachelorette seems to keep around an older and younger male contestant towards the end of the show. Also, it appears that the winners tend to be on the younger side.

Next, we will compare the data for male and female contestants:


In [122]:
fig, ax = plt.subplots(nrows = 2, ncols = 1, sharex = True, sharey = True)

AgeWeekM.plot(kind='bar', ax=ax[1], ylim=(24,32), color = 'dodgerblue', title = 'Average Age by Week Eliminated (Men)')
AgeWeekF.plot(kind='bar', ax=ax[0], color ='deeppink', title='Average Age by Week Eliminated (Women)')


Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x12404fbe0>

In [123]:
fig, ax = plt.subplots()
AgeWeekM.plot(ax=ax, kind='bar', ylim=[24,32], color ='dodgerblue', title='Average Age by Week Eliminated (Men + Women)')
AgeWeekF.plot(ax=ax, kind='bar', ylim=[24,32], color ='deeppink')


Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x125c6f898>

The biggest takeaway here is that the ages of contestents are far younger for women than men in general. Other than the average age being significantly different, the two charts show surprisingly similar patterns. The fluctuations are oddly identical. There is most likely some sort of social psychology trend at play here.

III. Proposals and Marriages

We will now look at the end results of both the Bachelor and the Bachelorette. How many seasons result in a proposal? Are those engagements likely to result in marriage or in a break-up after the show ends? Is there any difference between bachelor and bachelorette seasons?


In [49]:
df_proposal = df[['Proposal', 'Show']]
df_proposal.head(2)


Out[49]:
Proposal Show
0 No Bachelor
1 Yes Bachelor

In [50]:
df_proposal = df_proposal.groupby(['Show', 'Proposal']).size()
df_proposal = df_proposal.unstack()

In [69]:
df_proposal.plot(kind='barh', color = ['silver', 'limegreen'], title='Did the season result in a proposal?')


Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x1218bd978>

The key point gathered here is that the men (or bachelors) are far less likely to propose at the end of a season. We think that much of this has to do with their overall intent when beginning the process of applying for the show. Perhaps the women genuinely want to get engaged and the men seem more keen on becoming a TV personality. There is much discussion during the show about whether or not contestents are there for the "right reasons". Given the chart above, we might assume that more women are there for the right reasons than men, though it of course could be the case that the bachelors did not make a connection with their contestents with greater frequency than the bachelorettes with their contestents.


In [52]:
df_marriage = df[['Success', 'Show']] 
df_marriage = df_marriage.groupby(['Show', 'Success']).size()
df_marriage = df_marriage.unstack()

In [70]:
df_marriage.plot(kind='barh', color=['silver', 'plum'], title='Did the season result in marriage?')


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fdbd898>

This chart is pretty illuminating. It illustrates that the Bachelorette women are far more succesful in choosing a partner. Of the 9 proposals on the Bachelorette, 3 resulted in marriage. Of the 10 proposals on the Bachelor, only 2 have resulted in marriage. Again, this could be related to the fact that the women who enter onto the show are far more dedicated to finding love and a life partner.

IV. Occupations

Finally, we'll take a look at the occupations of female and male contestants:


In [90]:
OccM = df3[['Occupation']]
OccM = OccM.groupby(['Occupation']).size()
OccM = pd.DataFrame(OccM)

In [131]:
OccM.columns = ['Count']
OccM = OccM[OccM.Count !=1]
OccM.plot(kind ='bar', color ='dodgerblue', title='Frequency of Occupations, Male Contestants (>1)')


Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x126486c18>

Next, we'll do the same for the female contestants


In [95]:
OccF = df2[['Occupation']]
OccF = OccF.groupby(['Occupation']).size()
OccF = pd.DataFrame(OccF)

In [133]:
OccF.columns = ['Count']
OccF = OccF[OccF.Count !=1]
OccF.plot(kind ='bar', color = 'deeppink', title='Frequency of Occupations, Female Constestants (>1)')


Out[133]:
<matplotlib.axes._subplots.AxesSubplot at 0x1268a4a58>

In [134]:
OccM.plot(figsize=(11,11), kind='pie', subplots=True, legend=False, title='Frequency of Occupations, Male Constestants (>1)')
OccF.plot(figsize=(11,11), kind='pie', subplots=True, legend=False, title='Frequency of Occupations, Female Constestants (>1)')


Out[134]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x126d0ff60>], dtype=object)

Based on the above charts, female contestants exhibit greater diversity in occupations than male contestants in total. Male contestants are more likely to have worked in finance and sales, whereas female contestants are more likely to have worked in design and education. Interestingly, for both bachelor and bachelorette contestants, Lawyer/Attorney is the mode and appears more frequently than all other occupations.

V. Conclusions

After analyzing the data and charts, we discovered a handful of trends in the age, occupation, and success of contestants on both The Bachelor and The Bachelorette:

  • Bachelorette seasons have historically had a greater likelihood of success in terms of the number of proposals and subsequent marriages
  • While the average age of female contestants is younger than the average age of male contestants in the seasons in total, the average age of both female and male contestants by week over the course of the season exhibit a similar pattern
  • Female contestants exhibit greater diversity in occupation and Lawyers are the most frequently occuring profession for both male and female contestants