Thanksgiving Survey Analysis

Every year, Thanksgiving is celebrated all across the United States. Some people travel to their hometowns while others celebrate with friends. In this project, we are going to analyse a survey on Thanksgiving and try to draw some conclusions from the responses.

The dataset used in this analysis is courtesy of FiveThirtyEight; it has been released to the public and can be found on their GitHub page.

Structure of the Dataset

The dataset has many columns, one for each question (and each sub-option) asked in the survey. As explained on FiveThirtyEight's GitHub page (linked above), the columns are:

  • Do you celebrate Thanksgiving?
  • What is typically the main dish at your Thanksgiving dinner?
    • Other (please specify)
  • How is the main dish typically cooked?
    • Other (please specify)
  • What kind of stuffing/dressing do you typically have?
    • Other (please specify)
  • What type of cranberry sauce do you typically have?
    • Other (please specify)
  • Do you typically have gravy?
  • Which of these side dishes are typically served at your Thanksgiving dinner? Please select all that apply.
    • Brussel sprouts
    • Carrots
    • Cauliflower
    • Corn
    • Cornbread
    • Fruit salad
    • Green beans/green bean casserole
    • Macaroni and cheese
    • Mashed potatoes
    • Rolls/biscuits
    • Vegetable salad
    • Yams/sweet potato casserole
    • Other (please specify)
  • Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply.
    • Apple
    • Buttermilk
    • Cherry
    • Chocolate
    • Coconut cream
    • Key lime
    • Peach
    • Pecan
    • Pumpkin
    • Sweet Potato
    • None
    • Other (please specify)
  • Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.
    • Apple cobbler
    • Blondies
    • Brownies
    • Carrot cake
    • Cheesecake
    • Cookies
    • Fudge
    • Ice cream
    • Peach cobbler
    • None
    • Other (please specify)
  • Do you typically pray before or after the Thanksgiving meal?
  • How far will you travel for Thanksgiving?
  • Will you watch any of the following programs on Thanksgiving? Please select all that apply.
    • Macy's Parade
  • What's the age cutoff at your "kids' table" at Thanksgiving?
  • Have you ever tried to meet up with hometown friends on Thanksgiving night?
  • Have you ever attended a "Friendsgiving?"
  • Will you shop any Black Friday sales on Thanksgiving Day?
  • Do you work in retail?
  • Will you employer make you work on Black Friday?
  • How would you describe where you live?
  • Age
  • What is your gender?
  • How much total combined money did all members of your HOUSEHOLD earn last year?
  • US Region

Hypotheses

Before diving into the analysis, let's come up with some hypotheses. Considering the traditions of Thanksgiving, we can expect the following:

The most preferred food for Thanksgiving is turkey.

Younger people would travel (to their parents' house) for Thanksgiving more than older people.

People with higher income would travel more for Thanksgiving.

Setting up Data


In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline

# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
# read the data from file and print out first few rows and columns
thanksgiving = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
thanksgiving.iloc[0:10,0:3]


Out[2]:
RespondentID Do you celebrate Thanksgiving? What is typically the main dish at your Thanksgiving dinner?
0 4337954960 Yes Turkey
1 4337951949 Yes Turkey
2 4337935621 Yes Turkey
3 4337933040 Yes Turkey
4 4337931983 Yes Tofurkey
5 4337929779 Yes Turkey
6 4337924420 Yes Turkey
7 4337916002 Yes Turkey
8 4337914977 Yes Turkey
9 4337899817 Yes Other (please specify)

In [3]:
thanksgiving.columns[:10]


Out[3]:
Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)'],
      dtype='object')

Hypothesis 1 - "The most preferred food for Thanksgiving is turkey."

First, let's drop the rows where the respondent answered "No" when asked whether they celebrate Thanksgiving:


In [4]:
thanksgiving["Do you celebrate Thanksgiving?"].value_counts()


Out[4]:
Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [5]:
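# keep only respondents who celebrate; .copy() avoids a SettingWithCopyWarning when columns are added later
thanksgiving = thanksgiving[thanksgiving["Do you celebrate Thanksgiving?"] == "Yes"].copy()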

In [6]:
thanksgiving["Do you celebrate Thanksgiving?"].value_counts()


Out[6]:
Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64

Let's look at all the unique answers given for the main dish at Thanksgiving:


In [7]:
thanksgiving["What is typically the main dish at your Thanksgiving dinner?"].unique()


Out[7]:
array(['Turkey', 'Tofurkey', 'Other (please specify)', 'Ham/Pork',
       'Turducken', 'Roast beef', nan, 'Chicken', "I don't know"], dtype=object)

A short snippet shows which main dish is the most common:


In [8]:
thanksgiving["What is typically the main dish at your Thanksgiving dinner?"].value_counts()


Out[8]:
Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

So, as hypothesized, turkey is by far the most preferred main dish at Thanksgiving dinner: 859 of the 974 respondents who named a main dish chose it.
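
To put a number on "most preferred", the counts can be turned into shares. The sketch below re-enters the counts from Out[8] by hand so that it runs without the CSV file; the names `main_dish_counts` and `main_dish_share` are illustrative and not part of the original notebook. On the full DataFrame, calling `value_counts(normalize=True)` on the main-dish column should give the same proportions.

```python
import pandas as pd

# Counts copied from Out[8]; missing answers are already excluded there.
main_dish_counts = pd.Series({
    "Turkey": 859,
    "Other (please specify)": 35,
    "Ham/Pork": 29,
    "Tofurkey": 20,
    "Chicken": 12,
    "Roast beef": 11,
    "I don't know": 5,
    "Turducken": 3,
})

# Share of each answer among respondents who named a main dish, in percent.
main_dish_share = (main_dish_counts / main_dish_counts.sum() * 100).round(1)
print(main_dish_share)
```

Turkey accounts for roughly 88% of the answers, far ahead of every other option.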

Hypothesis 2 - "Younger people would travel for Thanksgiving more than older people."

Let's look at the first few rows of the age column.


In [9]:
thanksgiving["Age"][:10]


Out[9]:
0    18 - 29
1    18 - 29
2    18 - 29
3    30 - 44
4    30 - 44
5    18 - 29
6    18 - 29
7    18 - 29
8    30 - 44
9    30 - 44
Name: Age, dtype: object

As can be seen above, the age column has intervals instead of actual numbers. The unique answers are:


In [10]:
thanksgiving["Age"].unique()


Out[10]:
array(['18 - 29', '30 - 44', '60+', '45 - 59', nan], dtype=object)

Let's define a function and apply it to the "Age" column to convert each answer to a number. (We'll take the midpoint of each interval, and assume 70 for "60+".)


In [11]:
def age_to_num(string):
    # if nan, return None
    if pd.isnull(string):
        return None
    
    first_item = string.split(" ")[0]
    
    # if the answer is "60+" return 70
    if "+" in first_item:
        return 70.0
    
    last_item = string.split(" ")[2]
    
    #return average of the interval
    return (int(first_item)+int(last_item))/2

In [12]:
# apply age_to_num function to "Age" column and assign it to new column
thanksgiving["num_age"] = thanksgiving["Age"].apply(age_to_num)

thanksgiving["num_age"].unique()


Out[12]:
array([ 23.5,  37. ,  70. ,  52. ,   nan])
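
Because the column only takes the four interval labels shown in Out[10], the same conversion can also be sketched as a lookup table with `Series.map`. This is an alternative to `age_to_num`, not what the notebook uses; `AGE_MIDPOINTS` and `sample_ages` are hypothetical names, and 70 for "60+" remains an assumption.

```python
import numpy as np
import pandas as pd

# Midpoint of each interval; 70 for "60+" is an assumed stand-in value.
AGE_MIDPOINTS = {"18 - 29": 23.5, "30 - 44": 37.0, "45 - 59": 52.0, "60+": 70.0}

# The five distinct values from Out[10], used here as sample input.
sample_ages = pd.Series(["18 - 29", "30 - 44", "60+", "45 - 59", np.nan])

# Labels missing from the dictionary (including NaN) become NaN.
sample_num_age = sample_ages.map(AGE_MIDPOINTS)
print(sample_num_age.tolist())
```

One trade-off of the lookup table is that an unexpected label silently becomes NaN, whereas the string-splitting function would raise an error on it.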

We need to get rid of missing values:


In [13]:
thanksgiving = thanksgiving[thanksgiving["num_age"].notnull()]

In [14]:
thanksgiving["num_age"].describe()


Out[14]:
count    947.000000
mean      47.614044
std       16.847663
min       23.500000
25%       37.000000
50%       52.000000
75%       70.000000
max       70.000000
Name: num_age, dtype: float64

Now that we have numeric ages, let's look at the survey question about traveling for Thanksgiving.


In [15]:
thanksgiving["How far will you travel for Thanksgiving?"].unique()


Out[15]:
array(['Thanksgiving is local--it will take place in the town I live in',
       "Thanksgiving is out of town but not too far--it's a drive of a few hours or less",
       "Thanksgiving is happening at my home--I won't travel at all",
       'Thanksgiving is out of town and far away--I have to drive several hours or fly'], dtype=object)

Since there are only four unique answers, we can calculate the mean age for each of them.


In [16]:
# for each unique answer, select the rows and calculate the mean value of ages.

local_string = 'Thanksgiving is local--it will take place in the town I live in'
local_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == local_string]
local_age_mean = local_rows["num_age"].mean()

fewhours_string = "Thanksgiving is out of town but not too far--it's a drive of a few hours or less"
fewhours_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == fewhours_string]
fewhours_age_mean = fewhours_rows["num_age"].mean()

home_string = "Thanksgiving is happening at my home--I won't travel at all"
home_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == home_string]
home_age_mean = home_rows["num_age"].mean()

faraway_string = 'Thanksgiving is out of town and far away--I have to drive several hours or fly'
faraway_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == faraway_string]
faraway_age_mean = faraway_rows["num_age"].mean()

print("Local: " + str(local_age_mean))
print("Drive of few hours or less: " + str(fewhours_age_mean))
print("Home: " + str(home_age_mean))
print("Drive of several hours or have to fly: " + str(faraway_age_mean))


Local: 46.81272727272727
Drive of few hours or less: 44.5
Home: 49.9251269035533
Drive of several hours or have to fly: 46.640243902439025
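
The four filter-and-mean steps can be condensed with `groupby`. The sketch below uses a tiny made-up DataFrame with shortened answer labels so that it runs without the CSV file; the rows are illustrative, not survey data. On the real data the same call would be applied to `thanksgiving`.

```python
import pandas as pd

travel_col = "How far will you travel for Thanksgiving?"

# Made-up rows for illustration only; labels are shortened stand-ins.
sample = pd.DataFrame({
    travel_col: ["Home", "Home", "Local", "Far away", "Far away"],
    "num_age": [70.0, 52.0, 37.0, 23.5, 37.0],
})

# One mean per travel answer, computed in a single call.
mean_age_by_travel = sample.groupby(travel_col)["num_age"].mean()
print(mean_age_by_travel)
```

The result is a Series indexed by answer, which can be passed to a plotting call without naming each group by hand.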

Now, let's plot the results to get a better understanding.


In [17]:
x = np.arange(1, 5)
# align="center" puts each bar directly over its tick
plt.bar(x, [fewhours_age_mean, local_age_mean, faraway_age_mean, home_age_mean], width=0.5, align="center")
plt.xticks(x, ["Few hours", "Local", "Far away", "Home"])
plt.title("Average Age of People for Different Amounts of Travel on Thanksgiving")
plt.ylabel("Average Age")
plt.xlabel("Travel amount")
plt.show()


As we can see, the average age of people who stay home (about 49.9) is higher than that of the other groups (44.5 to 46.8). However, the means are within about five years of each other, and the ages are only interval midpoints, so we can't say there is a strong relationship between age and travel distance.
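
Since the numeric ages are only interval midpoints, a comparison of means hides the shape of each group. One way to look closer, sketched here on made-up rows, is a row-normalised cross-tabulation of age group against travel answer. The `sample` data, the `travel` column name, and the shortened labels are illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the survey.
sample = pd.DataFrame({
    "Age": ["18 - 29", "18 - 29", "60+", "60+", "60+", "30 - 44"],
    "travel": ["Far away", "Home", "Home", "Home", "Local", "Local"],
})

# Each row sums to 1: the share of an age group giving each travel answer.
travel_share_by_age = pd.crosstab(sample["Age"], sample["travel"], normalize="index")
print(travel_share_by_age.round(2))
```

Reading across a row shows how one age group splits between the travel answers, which is closer to what the hypothesis actually asks.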

Hypothesis 3 - "People with higher income would travel more for Thanksgiving."

First, let's read the data from the file again, since we removed some rows during the previous analysis.


In [18]:
thanksgiving2 = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
# keep only respondents who celebrate; .copy() avoids a SettingWithCopyWarning when columns are added later
thanksgiving2 = thanksgiving2[thanksgiving2["Do you celebrate Thanksgiving?"] == "Yes"].copy()

Let's look at how income data is stored in the dataset.


In [19]:
thanksgiving2["How much total combined money did all members of your HOUSEHOLD earn last year?"].unique()


Out[19]:
array(['$75,000 to $99,999', '$50,000 to $74,999', '$0 to $9,999',
       '$200,000 and up', '$100,000 to $124,999', '$25,000 to $49,999',
       'Prefer not to answer', '$10,000 to $24,999',
       '$175,000 to $199,999', '$150,000 to $174,999',
       '$125,000 to $149,999', nan], dtype=object)

Again, we have intervals instead of precise values. Let's define a function that returns the midpoint of each interval (we will assume 250000 for "$200,000 and up").


In [20]:
def income_to_num(string):
    if pd.isnull(string):
        return None
    
    first_item = string.split(" ")[0]
    
    # if the answer is "Prefer not to answer" return none
    if first_item == "Prefer":
        return None
    
    last_item = string.split(" ")[2]
    
    #if the answer is "$200,000 and up" return 250000
    if last_item == "up":
        return 250000.0
    
    #remove dollar signs and commas
    first_item = first_item.replace("$","")
    first_item = first_item.replace(",","")
    last_item = last_item.replace("$","")
    last_item = last_item.replace(",","")
    
    #return the average of the interval
    return (int(first_item)+int(last_item))/2
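
The interval parsing can also be sketched with a regular expression that pulls out the dollar amounts directly. This is an alternative formulation, not the function used in the rest of the notebook; `income_midpoint` is a hypothetical name, and 250000 for the open-ended top bracket is the same assumption as above.

```python
import re

import pandas as pd


def income_midpoint(answer):
    """Return the midpoint of an income bracket, or None if it is unknown."""
    if pd.isnull(answer) or answer == "Prefer not to answer":
        return None
    # Open-ended top bracket: 250000 is an assumed stand-in value.
    if answer == "$200,000 and up":
        return 250000.0
    # Extract every dollar amount, e.g. "$75,000 to $99,999" -> [75000, 99999].
    bounds = [int(n.replace(",", "")) for n in re.findall(r"\$([\d,]+)", answer)]
    return sum(bounds) / 2


print(income_midpoint("$75,000 to $99,999"))
```

Matching the amounts by pattern avoids relying on the exact position of the words between them.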

In [21]:
thanksgiving2["num_income"] = thanksgiving2["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(income_to_num)

We need to get rid of the rows with missing values.


In [22]:
thanksgiving2 = thanksgiving2[thanksgiving2["num_income"].notnull()]

In [23]:
thanksgiving2["num_income"].describe()


Out[23]:
count       829.000000
mean      91070.112786
std       67749.299874
min        4999.500000
25%       37499.500000
50%       87499.500000
75%      112499.500000
max      250000.000000
Name: num_income, dtype: float64

We will follow the same process as for Hypothesis 2.


In [24]:
# for each unique answer, select the rows and calculate the mean value of income.

local_string = 'Thanksgiving is local--it will take place in the town I live in'
local_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == local_string]
local_income_mean = local_rows["num_income"].mean()

fewhours_string = "Thanksgiving is out of town but not too far--it's a drive of a few hours or less"
fewhours_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == fewhours_string]
fewhours_income_mean = fewhours_rows["num_income"].mean()

home_string = "Thanksgiving is happening at my home--I won't travel at all"
home_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == home_string]
home_income_mean = home_rows["num_income"].mean()

faraway_string = 'Thanksgiving is out of town and far away--I have to drive several hours or fly'
faraway_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == faraway_string]
faraway_income_mean = faraway_rows["num_income"].mean()

print("Local: " + str(local_income_mean))
print("Drive of few hours or less: " + str(fewhours_income_mean))
print("Home: " + str(home_income_mean))
print("Drive of several hours or have to fly: " + str(faraway_income_mean))


Local: 82552.2805907173
Drive of few hours or less: 83128.10571428572
Home: 97852.58069164265
Drive of several hours or have to fly: 106142.41428571429

Let's plot the results:


In [25]:
x = np.arange(1, 5)
# align="center" puts each bar directly over its tick
plt.bar(x, [fewhours_income_mean, local_income_mean, faraway_income_mean, home_income_mean], width=0.5, align="center")
plt.xticks(x, ["Few hours", "Local", "Far away", "Home"])
plt.title("Average Income of People for Different Amounts of Travel on Thanksgiving")
plt.ylabel("Average Income")
plt.xlabel("Travel amount")
plt.show()


This pattern is clearer than the one we found for Hypothesis 2. People who travel far away for Thanksgiving have the highest mean income (about $106,000), while people who spend Thanksgiving in their local area or travel a few hours have the lowest (roughly $83,000 in both cases). People who spend Thanksgiving at home have a mean income (about $98,000) close to that of the long-distance travelers, but that could be explained if older people tend to spend Thanksgiving at home and have higher incomes than others (for example, students).
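
The mean is sensitive to the arbitrary 250000 assigned to "$200,000 and up" and to the size of each group, so it may be worth reporting the median and the number of respondents next to it. A minimal sketch on made-up rows follows; the values, the `travel` column name, and the shortened labels are illustrative assumptions, not survey results.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the survey.
sample = pd.DataFrame({
    "travel": ["Home", "Home", "Home", "Far away", "Far away", "Local"],
    "num_income": [37499.5, 87499.5, 250000.0, 112499.5, 250000.0, 62499.5],
})

# Mean, median and group size for each travel answer in one table.
income_summary = sample.groupby("travel")["num_income"].agg(["mean", "median", "count"])
print(income_summary)
```

If the ordering of the groups is the same for the median as for the mean, the conclusion depends less on the value chosen for the top bracket.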

