Thanksgiving is celebrated every year across the United States. Some people travel to their hometown while others celebrate with friends. In this project, we are going to analyse a survey on Thanksgiving and try to draw some conclusions from the information it contains.
The dataset used in this analysis is courtesy of FiveThirtyEight. It has been released to the public and can be found on their GitHub page.
The dataset has one column for each question asked in the survey. The columns are described on FiveThirtyEight's GitHub page.
Before diving into the analysis, let's come up with some hypotheses. Considering the traditions of Thanksgiving, we can say:
1. The most preferred food for Thanksgiving is turkey.
2. Younger people would travel (to their parents' house) for Thanksgiving more than older people.
3. People with higher income would travel more for Thanksgiving.
In [1]:
# this line is required to see visualizations inline for Jupyter notebook
%matplotlib inline
# importing modules that we need for analysis
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
In [2]:
# read the data from file and print out first few rows and columns
thanksgiving = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
thanksgiving.iloc[0:10,0:3]
Out[2]:
In [3]:
thanksgiving.columns[:10]
Out[3]:
In [4]:
thanksgiving["Do you celebrate Thanksgiving?"].value_counts()
Out[4]:
In [5]:
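# keep only respondents who celebrate Thanksgiving; .copy() gives an independent
# DataFrame so that adding columns later does not trigger SettingWithCopyWarning
thanksgiving = thanksgiving[thanksgiving["Do you celebrate Thanksgiving?"] == "Yes"].copy()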
In [6]:
thanksgiving["Do you celebrate Thanksgiving?"].value_counts()
Out[6]:
Let's look at all the unique answers given for the main dish at Thanksgiving:
In [7]:
thanksgiving["What is typically the main dish at your Thanksgiving dinner?"].unique()
Out[7]:
A short snippet shows which dish is the most common:
In [8]:
thanksgiving["What is typically the main dish at your Thanksgiving dinner?"].value_counts()
Out[8]:
So, as hypothesized, turkey is the most preferred main dish at Thanksgiving dinner.
Let's look at the first few rows of the age column.
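To put a number on "most preferred", the counts can be turned into shares. The following is a minimal sketch on a small made-up Series (the values in `dishes` are illustrative, not the survey results). On the real data, the same call would be applied to the main-dish column.

```python
import pandas as pd

# Hypothetical answers, for illustration only (not the survey data)
dishes = pd.Series(["Turkey", "Turkey", "Ham/Pork", "Turkey", "Tofurkey", None])

# normalize=True returns fractions instead of raw counts;
# missing answers are excluded by default (dropna=True)
shares = dishes.value_counts(normalize=True)
top_dish = shares.idxmax()

print(shares)
print("Most common:", top_dish)
```

Reporting a share rather than a raw count makes it easier to judge how dominant the top answer is.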
In [9]:
thanksgiving["Age"][:10]
Out[9]:
As can be seen above, the age column has intervals instead of actual numbers. The unique answers are:
In [10]:
thanksgiving["Age"].unique()
Out[10]:
Let's define a function and apply it to the "Age" column to convert each answer to a number. (We'll take the midpoint of each interval, and 70 for "60+".)
In [11]:
def age_to_num(string):
    # if nan, return None
    if pd.isnull(string):
        return None
    first_item = string.split(" ")[0]
    # if the answer is "60+" return 70
    if "+" in first_item:
        return 70.0
    last_item = string.split(" ")[2]
    # return average of the interval
    return (int(first_item) + int(last_item)) / 2
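Before applying the function to the whole column, it can help to check it on a few hand-picked answers. The sketch below restates the function so that it runs on its own. The sample strings assume the "low - high" format that the `split(" ")` logic implies, and the value 70 for "60+" is the notebook's own modelling choice.

```python
import pandas as pd

def age_to_num(string):
    # missing answers stay missing
    if pd.isnull(string):
        return None
    parts = string.split(" ")
    # the open-ended bracket "60+" is mapped to 70 (an assumption, not a measured value)
    if "+" in parts[0]:
        return 70.0
    # "18 - 29" splits into ["18", "-", "29"]; return the midpoint
    return (int(parts[0]) + int(parts[2])) / 2

# Assumed answer format, for illustration
samples = ["18 - 29", "30 - 44", "45 - 59", "60+", None]
converted = [age_to_num(s) for s in samples]
print(converted)  # [23.5, 37.0, 52.0, 70.0, None]
```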
In [12]:
# apply age_to_num function to "Age" column and assign it to new column
thanksgiving["num_age"] = thanksgiving["Age"].apply(age_to_num)
thanksgiving["num_age"].unique()
Out[12]:
We need to get rid of missing values:
In [13]:
# keep only the rows where the age could be converted to a number
thanksgiving = thanksgiving[thanksgiving["num_age"].notnull()]
In [14]:
thanksgiving["num_age"].describe()
Out[14]:
Now that we have numeric ages, let's look at another survey question about traveling for Thanksgiving.
In [15]:
thanksgiving["How far will you travel for Thanksgiving?"].unique()
Out[15]:
Since there are only 4 unique answers, we can calculate a mean age value for each of them.
In [16]:
# for each unique answer, select the rows and calculate the mean value of ages.
local_string = 'Thanksgiving is local--it will take place in the town I live in'
local_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == local_string]
local_age_mean = local_rows["num_age"].mean()
fewhours_string = "Thanksgiving is out of town but not too far--it's a drive of a few hours or less"
fewhours_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == fewhours_string]
fewhours_age_mean = fewhours_rows["num_age"].mean()
home_string = "Thanksgiving is happening at my home--I won't travel at all"
home_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == home_string]
home_age_mean = home_rows["num_age"].mean()
faraway_string = 'Thanksgiving is out of town and far away--I have to drive several hours or fly'
faraway_rows = thanksgiving[thanksgiving["How far will you travel for Thanksgiving?"] == faraway_string]
faraway_age_mean = faraway_rows["num_age"].mean()
print("Local: " + str(local_age_mean))
print("Drive of few hours or less: " + str(fewhours_age_mean))
print("Home: " + str(home_age_mean))
print("Drive of several hours or have to fly: " + str(faraway_age_mean))
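The four select-and-average steps above can also be written as a single `groupby`. This is a minimal sketch on toy data: the short labels "home", "local" and "far" and the ages are made up for illustration, and only the column names mirror the notebook.

```python
import pandas as pd

# Toy data for illustration only; not the survey responses
travel_col = "How far will you travel for Thanksgiving?"
toy = pd.DataFrame({
    travel_col: ["home", "home", "local", "far", "far", "local"],
    "num_age": [70.0, 52.0, 37.0, 23.5, 37.0, 52.0],
})

# One mean per distinct answer, without repeating the filter for each answer
mean_age_by_travel = toy.groupby(travel_col)["num_age"].mean()
print(mean_age_by_travel)
```

On the real DataFrame, the same expression would be `thanksgiving.groupby("How far will you travel for Thanksgiving?")["num_age"].mean()`, and it keeps working if the survey had more than four answers.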
Now, let's plot the results to get a better understanding.
In [17]:
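# bar() centers each bar on its x value by default (matplotlib 2.0 and later),
# so place the bars directly on the tick positions 1..4
x = np.arange(4) + 1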
plt.bar(x,[ fewhours_age_mean, local_age_mean, faraway_age_mean, home_age_mean], width=0.5)
plt.xticks([1,2,3,4], ["Few hours", "Local", "Far away", "Home"])
plt.title("Average Age of People for Different Amounts of Travel on Thanksgiving")
plt.ylabel("Average Age")
plt.xlabel("Travel amount")
plt.show()
As we can see, the average age of people who stay home is higher than that of people who travel. However, the mean ages are quite close to each other, so we can't say there is a strong relationship between age and travel distance.
In [18]:
thanksgiving2 = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
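# reload the data so rows dropped for missing ages are available again;
# .copy() avoids SettingWithCopyWarning when a column is added later
thanksgiving2 = thanksgiving2[thanksgiving2["Do you celebrate Thanksgiving?"] == "Yes"].copy()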
Let's look at how income data is stored in the dataset.
In [19]:
thanksgiving2["How much total combined money did all members of your HOUSEHOLD earn last year?"].unique()
Out[19]:
Again, we have intervals instead of precise values. Let's define a function that returns the midpoint of each interval (we will use 250000 for "$200,000 and up").
In [20]:
def income_to_num(string):
    if pd.isnull(string):
        return None
    first_item = string.split(" ")[0]
    # if the answer is "Prefer not to answer" return None
    if first_item == "Prefer":
        return None
    last_item = string.split(" ")[2]
    # if the answer is "$200,000 and up" return 250000
    if last_item == "up":
        return 250000.0
    # remove dollar signs and commas
    first_item = first_item.replace("$", "")
    first_item = first_item.replace(",", "")
    last_item = last_item.replace("$", "")
    last_item = last_item.replace(",", "")
    # return the average of the interval
    return (int(first_item) + int(last_item)) / 2
In [21]:
thanksgiving2["num_income"] = thanksgiving2["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(income_to_num)
We need to get rid of the rows with missing values.
In [22]:
# keep only the rows where the income could be converted to a number
thanksgiving2 = thanksgiving2[thanksgiving2["num_income"].notnull()]
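The convert-then-drop steps can be traced on a handful of answers. The sketch below restates the function compactly so it runs on its own. The bracket "$25,000 to $49,999" is an assumed example of the interval format, while the other two strings are the special cases the function handles explicitly.

```python
import pandas as pd

def income_to_num(string):
    # missing answers and "Prefer not to answer" both become missing values
    if pd.isnull(string) or string.startswith("Prefer"):
        return None
    parts = string.split(" ")
    # open-ended top bracket: 250000 is a stand-in, not a measured value
    if parts[2] == "up":
        return 250000.0
    # strip "$" and "," from both bounds, then take the midpoint
    low = int(parts[0].replace("$", "").replace(",", ""))
    high = int(parts[2].replace("$", "").replace(",", ""))
    return (low + high) / 2

# Illustrative answers, not rows from the survey file
answers = pd.Series(["$25,000 to $49,999", "$200,000 and up",
                     "Prefer not to answer", None])

num_income = answers.apply(income_to_num)
cleaned = num_income.dropna()
print(cleaned.tolist())  # [37499.5, 250000.0]
```

Note that dropping these rows also removes everyone who declined to state an income, so the income means describe only respondents who answered the question.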
In [23]:
thanksgiving2["num_income"].describe()
Out[23]:
We will follow the same process that we did for hypothesis 2.
In [24]:
# for each unique answer, select the rows and calculate the mean value of income.
local_string = 'Thanksgiving is local--it will take place in the town I live in'
local_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == local_string]
local_income_mean = local_rows["num_income"].mean()
fewhours_string = "Thanksgiving is out of town but not too far--it's a drive of a few hours or less"
fewhours_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == fewhours_string]
fewhours_income_mean = fewhours_rows["num_income"].mean()
home_string = "Thanksgiving is happening at my home--I won't travel at all"
home_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == home_string]
home_income_mean = home_rows["num_income"].mean()
faraway_string = 'Thanksgiving is out of town and far away--I have to drive several hours or fly'
faraway_rows = thanksgiving2[thanksgiving2["How far will you travel for Thanksgiving?"] == faraway_string]
faraway_income_mean = faraway_rows["num_income"].mean()
print("Local: " + str(local_income_mean))
print("Drive of few hours or less: " + str(fewhours_income_mean))
print("Home: " + str(home_income_mean))
print("Drive of several hours or have to fly: " + str(faraway_income_mean))
Let's plot the results:
In [25]:
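# bar() centers each bar on its x value by default (matplotlib 2.0 and later),
# so place the bars directly on the tick positions 1..4
x = np.arange(4) + 1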
plt.bar(x,[ fewhours_income_mean, local_income_mean, faraway_income_mean, home_income_mean], width=0.5)
plt.xticks([1,2,3,4], ["Few hours", "Local", "Far away", "Home"])
plt.title("Average Income of People for Different Amounts of Travel on Thanksgiving")
plt.ylabel("Average Income")
plt.xlabel("Travel amount")
plt.show()
This pattern is clearer than the one we found for hypothesis 2. People who travel far away for Thanksgiving have the highest mean income, while people who spend Thanksgiving in their local area or travel a few hours have lower mean incomes. People who spend Thanksgiving at home have a mean income close to that of people who travel far. One possible explanation is that older people tend to spend Thanksgiving at home, and they might have higher incomes than others (for example, students).