Since its premiere in 2002, ABC’s The Bachelor franchise has invited audiences across the U.S. to watch as single, young men and women attempt to find true love among a pool of eligible bachelors and bachelorettes, all on national TV and within a timeframe of a few months. The show begins with a single bachelor / bachelorette and typically up to 30 contestants. Each week, the bachelor / bachelorette narrows the pool of potential future partners through an elimination round, the Rose Ceremony. Ultimately, the season is expected to end in a proposal, the recipient of which remains a toss-up between two men / women until the finale.
Given the variance in the outcome of relationships that began on the show, our final project analyzes data from nearly 30 seasons of the show with several questions in mind: Are there commonalities or differences among contestants, bachelors, and bachelorettes? Are there patterns that determine the ultimate success or failure of a relationship?
In [3]:
#We will begin by importing several packages to use for our analysis:
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
%matplotlib inline
In [4]:
# Check versions
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())
In [5]:
# Import Bachelor and Bachelorette datasets
url1 ='https://github.com/NYUDataBootcamp'
url2 = '/Materials/blob/master/Data/Bachelor_Data.xlsx?raw=true'
url = url1+url2
# Note: newer pandas versions use `sheet_name`; the older `sheetname` keyword was removed
df = pd.read_excel(url, sheet_name = 0) # Sheet 1: Bachelors/Bachelorettes
df2 = pd.read_excel(url, sheet_name = 1) # Sheet 2: Female contestants
df3 = pd.read_excel(url, sheet_name = 2) # Sheet 3: Male contestants
In [6]:
# Check import for each Sheet
df.shape
Out[6]:
In [7]:
df2.shape
Out[7]:
In [8]:
df3.shape
Out[8]:
In [109]:
df.dtypes # 'Wiki Data' refers to whether or not there is information about the season available on Wikipedia.
Out[109]:
In [10]:
df2.dtypes
Out[10]:
In [11]:
df3.dtypes
Out[11]:
In [12]:
df.head(2)
Out[12]:
In [13]:
df2.head(2)
Out[13]:
In [14]:
df3.head(2)
Out[14]:
We began by looking at the average age of contestants at different stages of the competition:
In [15]:
# Clean up and shape Sheet 2 (Female Contestants dataframe)
df2_mean = df2[['Year','Age']]
df2_mean = df2_mean.groupby('Year')
df2_mean = df2_mean.mean()
df2_mean.plot(linewidth=3, color='black', title= 'Average Age by Year (Women)')
Out[15]:
The chart above shows the average age of female contestants for each year. While the average age is slightly higher in recent years than in the early seasons of the show, it has remained fairly consistent (within 1 year) over the last decade.
Next, we'll explore how age relates to the week in which the contestants left the competition:
In [16]:
# Shaping the data further
df2_AgeWeekF = df2[['Age', 'Eliminated']]
df2_AgeWeekF = df2_AgeWeekF.groupby('Eliminated')
df2_AgeWeekF = df2_AgeWeekF.mean()
In [121]:
# Plotting our data
df2_AgeWeekF.plot(kind='bar', color = 'deeppink', title="Average Age by Elimination Week (Women)")
Out[121]:
Looks like we have more cleaning to do. We'll use the following code to make our x-axis more consistent:
In [18]:
# Work on a copy so the cleaning steps below do not modify the original df2
df2_clean = df2.copy()
In [19]:
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Withdrew in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Eliminated in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Quit in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Left in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Disqualified in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Removed in episode', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Quit in epiosde', 'Week')
df2_clean['Eliminated'] = df2_clean['Eliminated'].str.replace('Runner-up', 'Week Runner-Up')
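The chain of replacements above could also be collapsed into a single regular expression. The following is only a sketch under stated assumptions: the `sample` values are hypothetical stand-ins for the real `Eliminated` column, and the pattern covers only the prefixes (including the misspelled `epiosde`) that appear in the cell above.

```python
import pandas as pd

# Hypothetical values standing in for df2_clean['Eliminated']
sample = pd.DataFrame({'Eliminated': [
    'Eliminated in episode 1',
    'Quit in epiosde 3',        # misspelling handled in the cell above
    'Withdrew in episode 2',
    'Runner-up',
    'Winner',
]})

# One pattern for every "<verb> in episode" prefix used in the notebook
pattern = r'^(?:Withdrew|Eliminated|Quit|Left|Disqualified|Removed) in (?:episode|epiosde)'

cleaned = (
    sample['Eliminated']
    .str.replace(pattern, 'Week', regex=True)                  # "Eliminated in episode 1" -> "Week 1"
    .str.replace('Runner-up', 'Week Runner-Up', regex=False)   # same relabeling as the notebook
)
print(cleaned.tolist())
```

A single pattern keeps the list of prefixes in one place, which makes it easier to see which spellings are handled.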
In [72]:
AgeWeekF = df2_clean[['Age', 'Eliminated']]
AgeWeekF = AgeWeekF.groupby('Eliminated')
AgeWeekF = AgeWeekF.mean()
In [120]:
AgeWeekF.plot(kind='bar', ylim=[24,32], color = 'deeppink', title="Average Age by Elimination Week (Women)")
Out[120]:
Here, we can see that the average age of contestants eliminated in the first few weeks is about 26 to 27. This is interesting because it might suggest that the bachelor is interested in a polarized age demographic at first and is ruling out much of the middle ground. That said, the contestants he keeps could be the polarized group and still average 26 to 27 themselves, so the chart alone cannot distinguish the two cases.
In the middle weeks, the younger contestants seem to be targeted more for elimination. We think a major takeaway here is that the bachelor may become more serious about the process as it moves along, and thus eliminates contestants he sees as less mature (contestants not yet ready for marriage).
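Because each bar is an average, it helps to know how many contestants sit behind it before reading much into week-to-week differences. A minimal sketch, assuming hypothetical rows in place of `df2_clean[['Age', 'Eliminated']]`:

```python
import pandas as pd

# Hypothetical rows standing in for df2_clean[['Age', 'Eliminated']]
sample = pd.DataFrame({
    'Age': [24, 26, 28, 25, 27, 31],
    'Eliminated': ['Week 1', 'Week 1', 'Week 1', 'Week 2', 'Week 2', 'Week 3'],
})

# Mean age per elimination week, alongside the number of contestants in each group
summary = sample.groupby('Eliminated')['Age'].agg(['mean', 'count'])
print(summary)
```

Groups with only one or two contestants (typically the late weeks) produce averages that can swing widely from a single person's age.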
We'll now look at the same information using the male contestant dataframe (Sheet 3):
In [22]:
df3_clean = df3.set_index('Season')
In [23]:
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Eliminated in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Removed in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Quit in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Withdrew in episode', 'Week')
df3_clean['Eliminated']=df3_clean['Eliminated'].str.replace('Runner-up', 'Week Runner-up')
In [24]:
AgeWeekM = df3_clean[['Age', 'Eliminated']]
AgeWeekM = AgeWeekM.groupby('Eliminated')
AgeWeekM = AgeWeekM.mean()
# AgeWeekM is grouped by elimination week (not by year), so the title reflects that
AgeWeekM.plot(linewidth=3, color = 'black', title='Average Age by Week Eliminated (Men)')
Out[24]:
In [25]:
AgeWeekM.plot(kind='bar', ylim=[24,32], color = 'dodgerblue', title='Average Age by Week Eliminated (Men)')
Out[25]:
There's not as much rhyme or reason to this chart; however, the bachelorette seems to keep both an older and a younger male contestant around towards the end of the show. Also, it appears that the winners tend to be on the younger side.
Next, we will compare the data for male and female contestants:
In [122]:
fig, ax = plt.subplots(nrows = 2, ncols = 1, sharex = True, sharey = True)
AgeWeekM.plot(kind='bar', ax=ax[1], ylim=(24,32), color = 'dodgerblue', title = 'Average Age by Week Eliminated (Men)')
AgeWeekF.plot(kind='bar', ax=ax[0], color ='deeppink', title='Average Age by Week Eliminated (Women)')
Out[122]:
In [123]:
fig, ax = plt.subplots()
AgeWeekM.plot(ax=ax, kind='bar', ylim=[24,32], color ='dodgerblue', title='Average Age by Week Eliminated (Men + Women)')
AgeWeekF.plot(ax=ax, kind='bar', ylim=[24,32], color ='deeppink')
Out[123]:
The biggest takeaway here is that the female contestants are, on average, noticeably younger than the male contestants. Other than this difference in average age, the two charts show surprisingly similar patterns: the week-to-week fluctuations track each other closely. We suspect some social dynamic common to both shows is at play, though the charts alone cannot tell us what it is.
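One way to put numbers on the comparison is to align the two sets of averages by elimination label, then compute the age gap and the correlation between them. This is only a sketch: the values below are hypothetical, and in the notebook the inputs would be `AgeWeekM['Age']` and `AgeWeekF['Age']`. Alignment is by label, so it assumes the labels match exactly in both frames (note that the cleaning cells use `'Week Runner-Up'` for women and `'Week Runner-up'` for men).

```python
import pandas as pd

# Hypothetical average ages per elimination label
men = pd.Series({'Week 1': 28.5, 'Week 2': 29.0, 'Week 3': 28.0, 'Winner': 28.2}, name='Men')
women = pd.Series({'Week 1': 26.0, 'Week 2': 26.6, 'Week 3': 25.4, 'Winner': 25.9}, name='Women')

# Align on the shared index so each row compares the same elimination stage
both = pd.concat([men, women], axis=1)
both['Gap'] = both['Men'] - both['Women']

# Pearson correlation between the two series of averages
corr = both['Men'].corr(both['Women'])
print(both)
print('Correlation:', round(corr, 2))
```

With the data in one frame, `both[['Men', 'Women']].plot(kind='bar')` would draw the bars side by side rather than on top of each other.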
We will now look at the end results of both the Bachelor and the Bachelorette. How many seasons result in a proposal? Are those engagements likely to result in marriage or in a break-up after the show ends? Is there any difference between bachelor and bachelorette seasons?
In [49]:
df_proposal = df[['Proposal', 'Show']]
df_proposal.head(2)
Out[49]:
In [50]:
df_proposal = df_proposal.groupby(['Show', 'Proposal']).size()
df_proposal = df_proposal.unstack()
In [69]:
df_proposal.plot(kind='barh', color = ['silver', 'limegreen'], title='Did the season result in a proposal?')
Out[69]:
The key point gathered here is that the men (or bachelors) are far less likely to propose at the end of a season. We think that much of this has to do with their overall intent when they apply for the show. Perhaps the women genuinely want to get engaged, while the men are more keen on becoming TV personalities. There is much discussion during the show about whether or not contestants are there for the "right reasons". Given the chart above, we might assume that more women are there for the right reasons than men, though it could of course be the case that the bachelors simply failed to make a connection with their contestants more often than the bachelorettes did with theirs.
In [52]:
df_marriage = df[['Success', 'Show']]
df_marriage = df_marriage.groupby(['Show', 'Success']).size()
df_marriage = df_marriage.unstack()
In [70]:
df_marriage.plot(kind='barh', color=['silver', 'plum'], title='Did the season result in marriage?')
Out[70]:
This chart is pretty illuminating. It suggests that the bachelorettes are more successful in choosing a partner. Of the 9 proposals on The Bachelorette, 3 resulted in marriage (about 33%). Of the 10 proposals on The Bachelor, only 2 (20%) have resulted in marriage. Again, this could be related to the idea that the women who lead the show are more dedicated to finding love and a life partner, though with so few seasons the gap should be read cautiously.
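The proposal and marriage counts quoted above can be turned into rates, along with a rough check of how easily a gap this size could arise by chance. The check below is a one-sided Fisher-style calculation written by hand; it is our own illustrative addition, not part of the original analysis, and it treats each proposal as an independent trial.

```python
from math import comb

# Counts quoted in the text above
proposals = {'Bachelorette': 9, 'Bachelor': 10}
marriages = {'Bachelorette': 3, 'Bachelor': 2}

# Share of proposals that led to marriage, per show
rates = {show: marriages[show] / proposals[show] for show in proposals}
for show, rate in rates.items():
    print(f'{show}: {rate:.0%} of proposals led to marriage')

# If the 5 marriages were spread at random over the 19 proposals, how often would
# the Bachelorette's 9 proposals receive at least 3 of them? (hypergeometric tail)
total_proposals = sum(proposals.values())
total_marriages = sum(marriages.values())
p_value = sum(
    comb(proposals['Bachelorette'], k) * comb(proposals['Bachelor'], total_marriages - k)
    for k in range(marriages['Bachelorette'], total_marriages + 1)
) / comb(total_proposals, total_marriages)
print(f'One-sided p-value: {p_value:.2f}')
```

A p-value this far above conventional thresholds indicates the difference in marriage rates is well within what chance alone could produce at this sample size.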
Finally, we'll take a look at the occupations of female and male contestants:
In [90]:
OccM = df3[['Occupation']]
OccM = OccM.groupby(['Occupation']).size()
OccM = pd.DataFrame(OccM)
In [131]:
OccM.columns = ['Count']
OccM = OccM[OccM.Count !=1]
OccM.plot(kind ='bar', color ='dodgerblue', title='Frequency of Occupations, Male Contestants (>1)')
Out[131]:
Next, we'll do the same for the female contestants:
In [95]:
OccF = df2[['Occupation']]
OccF = OccF.groupby(['Occupation']).size()
OccF = pd.DataFrame(OccF)
In [133]:
OccF.columns = ['Count']
OccF = OccF[OccF.Count !=1]
OccF.plot(kind ='bar', color = 'deeppink', title='Frequency of Occupations, Female Contestants (>1)')
Out[133]:
In [134]:
OccM.plot(figsize=(11,11), kind='pie', subplots=True, legend=False, title='Frequency of Occupations, Male Contestants (>1)')
OccF.plot(figsize=(11,11), kind='pie', subplots=True, legend=False, title='Frequency of Occupations, Female Contestants (>1)')
Out[134]:
Based on the above charts, female contestants overall exhibit greater diversity in occupations than male contestants. Male contestants are more likely to have worked in finance and sales, whereas female contestants are more likely to have worked in design and education. Interestingly, for contestants on both The Bachelor and The Bachelorette, Lawyer/Attorney is the mode: it appears more frequently than any other occupation.
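Occupation counts are sensitive to how job titles are spelled, so a light normalization pass before counting may change which occupations appear more than once. A minimal sketch, assuming hypothetical titles and a hypothetical `aliases` mapping (neither comes from the dataset):

```python
import pandas as pd

# Hypothetical job titles standing in for df2['Occupation'] or df3['Occupation']
occupations = pd.Series(
    ['Lawyer', 'Attorney', 'Teacher', 'teacher ', 'Nurse', 'Sales Rep'],
    name='Occupation',
)

# Hypothetical mapping of equivalent titles onto one label
aliases = {'attorney': 'lawyer'}

# Trim whitespace, lower-case, then merge aliases before counting
normalized = occupations.str.strip().str.lower().replace(aliases)

# value_counts gives the same tallies as groupby(...).size(), sorted by frequency
counts = normalized.value_counts()
repeated = counts[counts > 1]   # keep occupations that appear more than once
print(repeated)
```

Filtering with `counts > 1` matches the notebook's `Count != 1` filter, since every occupation present has a count of at least one.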
After analyzing the data and charts, we discovered a handful of trends in the age, occupation, and success of contestants on both The Bachelor and The Bachelorette: