In [1]:
from operator import itemgetter
obesityData = open('ObesityByState.txt', 'r').readlines()
# Create a dictionary (hash) to store the data
stateObesity = dict()
# Iterate through obesity data, omit non-data lines, store data in dictionary
for line in obesityData:
    # split the contents on each tab
    # assign the first field to the variable 'state', the second to 'obesityPercentage'
    (state, obesityPercentage) = line.split('\t')[0:2]
    # skip the header
    if 'state' in state:
        continue
    # check if the state is DC, and change how we refer to it
    if state == 'DC':
        state = 'District of Columbia'
    # Make sure to accurately represent non-numeric values.
    # The string 'N/A' shouldn't be used in comparisons.
    # Anything that cannot be converted to a float is nulled;
    # everything else is stored as a float so that sorting is numeric
    try:
        obesityPercentage = float(obesityPercentage)
    except ValueError:
        obesityPercentage = None
    # store the value in the dictionary
    stateObesity[state] = obesityPercentage
    # print each entry to check it
    print "%s\t%s" % (state, stateObesity[state])
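The conversion step in the cell above can be isolated into a small helper, which makes the "non-numeric becomes None" rule easy to check on its own. This is only a sketch: the name `to_float_or_none` and the sample lines are made up for illustration and are not taken from ObesityByState.txt.

```python
def to_float_or_none(text):
    """Return text as a float, or None when it is not numeric (e.g. 'N/A')."""
    try:
        return float(text)
    except ValueError:
        return None

# Made-up sample lines in the same tab-separated layout (not real data)
sample_lines = ["State\tPercent\n", "Alpha\t25.5\n", "Beta\tN/A\n"]

parsed = {}
for line in sample_lines[1:]:  # skip the header row
    state, value = line.split('\t')[0:2]
    parsed[state] = to_float_or_none(value)

print(parsed)
```

Note that `float()` tolerates surrounding whitespace, so the trailing newline on the last field does not need to be stripped before the conversion.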
Next, open MedianIncomeByState.txt to retrieve every state's median household income.
In [2]:
incomeData = open('MedianIncomeByState.txt', 'r').readlines()
# Remove any trailing newline characters
# (strings are immutable, so the stripped result has to be kept)
incomeData = [line.rstrip('\r\n') for line in incomeData]
# Use a dictionary called medianIncome to associate state names with median income values
medianIncome = dict()
for line in incomeData:
    (statename, income) = line.split('\t')[0:2]
    # the income is kept as the text read from the file
    medianIncome[statename] = income
    print line
Sort the states from the most obese to least obese, and print the states, obesity rates and median incomes to obesityIncome.txt.
In [3]:
OUT = open('obesityIncome.txt', 'w')
OUT.write("State"+"\t"+"Obesity rate"+"\t"+"Median household income")
print "%s\t%s\t%s" % ("State", "Obesity rate", "Median household income")
for state, obesityPercentage in sorted(stateObesity.iteritems(), key=lambda (k,v): (v,k), reverse=True):
    print "%s\t%s\t%s" % (state, stateObesity[state], medianIncome[state])
    OUT.write("\n%s\t%s\t%s" % (state, stateObesity[state], medianIncome[state]))
OUT.close()
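The sort in the cell above relies on Python 2 ordering None below every other value, which is why states without a rate end up last when `reverse=True` is used. Python 3 refuses to compare None with a number, so a more portable variant is to spell the rule out in the sort key. The following is a minimal sketch with made-up placeholder values (not the real state data); `sort_key` is a hypothetical helper name.

```python
# Made-up placeholder rates, NOT the real state data
rates = {"Alpha": 25.5, "Beta": None, "Gamma": 30.1, "Delta": 25.5}

def sort_key(item):
    state, rate = item
    # Missing rates go last; otherwise highest rate first, ties broken by name
    return (rate is None, -(rate or 0.0), state)

ranked = sorted(rates.items(), key=sort_key)
for state, rate in ranked:
    print("%s\t%s" % (state, rate))
```

One difference from the cell above: ties are broken alphabetically here, whereas sorting on `(v,k)` with `reverse=True` lists tied states in reverse alphabetical order.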
Count the number of states that have a higher rate of obesity than Michigan:
In [4]:
count = 0
for state, obesityPercentage in sorted(stateObesity.iteritems(), key=lambda (k,v): (v,k), reverse=True):
    # stop at Michigan: every state seen so far ranks above it
    if state == "Michigan":
        break
    count = count + 1
print "The number of states that have a higher obesity rate than Michigan:", count
Count the number of states that have a lower median income than Michigan:
In [6]:
count1 = 0
for state, income in sorted(medianIncome.iteritems(), key=lambda (k,v): (v,k)):
    # stop at Michigan: every state seen so far ranks below it
    if state == "Michigan":
        break
    count1 = count1 + 1
print "The number of states that have a lower median income than Michigan:", count1
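Both counts above are obtained by walking a sorted list until Michigan is reached. With that approach, a state that ties with Michigan can land on either side of it depending on its name. A direct comparison avoids this. The sketch below assumes the values have already been converted to numbers (with None for missing entries); the function name `count_beyond`, the state names and the values are all made up for illustration.

```python
def count_beyond(values, reference_state, higher=True):
    """Count states strictly above (or below) the reference state's value."""
    reference = values[reference_state]
    count = 0
    for state, value in values.items():
        if value is None or state == reference_state:
            continue  # skip missing data and the reference state itself
        if (value > reference) if higher else (value < reference):
            count += 1
    return count

# Made-up placeholder values, NOT the real state data
sample = {"StateA": 26.0, "StateB": 27.5, "StateC": 26.0,
          "StateD": None, "StateE": 21.0}

n_higher = count_beyond(sample, "StateA", higher=True)
n_lower = count_beyond(sample, "StateA", higher=False)
print("higher: %d, lower: %d" % (n_higher, n_lower))
```

In this sample, StateC ties with the reference state and is counted in neither direction.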
In [7]:
from IPython.display import Image
Image(filename='states obesity rate 2005.png')
Out[7]:
In [9]:
Image(filename='states median income 2005.png')
Out[9]:
After completing the first part, you start to wonder whether there are other factors besides median household income that are linked to obesity. Beer has been blamed for beer guts, so why not obesity? You then also wonder about the French, and how they maintain a normal body weight while eating what they please. They claim it's the wine. So, you decide to investigate how beer and wine consumption are related to obesity. The source for the beer data (2003-2006) is the Beer Institute. The source for the 2004 wine consumption data is the Adams 2005 Wine Handbook.
Unfortunately, this data is in a weird format. Start with BeerAndWinePerCapita.txt and fix it up.
In [10]:
# Read the alcohol consumption data
# Change the state names to capitalize the first letter of each word
# Read the obesity data
# Output the data with the state name, obesity %, per capita beer consumption,
# and per capita wine consumption
alcoholConsumption = open('BeerAndWinePerCapita.txt', 'r').readlines()
stateBeer = dict()
stateWine = dict()
In [11]:
# Convert each state name to title case (first letter of each word capitalized)
for line in alcoholConsumption:
    # strip the trailing newline (the result has to be assigned back)
    line = line.strip()
    (statename, beer, wine) = line.split(',')
    state = str.title(statename)
    # store numeric values as floats; anything non-numeric is nulled
    try:
        beer = float(beer)
    except ValueError:
        beer = None
    stateBeer[state] = beer
    try:
        wine = float(wine)
    except ValueError:
        wine = None
    stateWine[state] = wine
    print state, stateBeer[state], stateWine[state]
In [12]:
# To read the obesity data
obesityData = open('ObesityByState.txt', 'r').readlines()
stateObesity = dict()
for line in obesityData:
    (state, obesityPercentage) = line.split('\t')[0:2]
    if 'state' in state:
        continue
    if state == 'DC':
        state = 'District Of Columbia'
    # store numeric values as floats; anything non-numeric is nulled
    try:
        obesityPercentage = float(obesityPercentage)
    except ValueError:
        obesityPercentage = None
    stateObesity[state] = obesityPercentage
    print state, stateObesity[state]
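The key 'District Of Columbia' (with a capital "Of") in the cell above is presumably spelled that way so that it matches the names produced by `str.title` in the alcohol data, since `title()` capitalizes every word regardless of the input's case. A quick check, using made-up input strings rather than lines from the data files:

```python
# title() capitalizes the first letter of every word, including "of"
samples = ["DISTRICT OF COLUMBIA", "district of columbia", "new york"]
titled = [name.title() for name in samples]
for before, after in zip(samples, titled):
    print("%s -> %s" % (before, after))
```

If the keys of the dictionaries differ even in capitalization, the lookups `stateBeer[state]` and `stateWine[state]` in the final output loop would raise a KeyError, so the spelling has to agree exactly.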
In [13]:
# To sort states by beer consumption, in increasing order
print "%s\t%s\t%s" % ("State", "per capita beer consumption", "per capita wine consumption")
for state, beer in sorted(stateBeer.iteritems(), key=lambda (k,v): (v,k)):
    print "%s\t%s\t%s" % (state, stateBeer[state], stateWine[state])
In [14]:
# To sort states by wine consumption, in increasing order
print "%s\t%s\t%s" % ("State", "per capita beer consumption", "per capita wine consumption")
for state, wine in sorted(stateWine.iteritems(), key=lambda (k,v): (v,k)):
    print "%s\t%s\t%s" % (state, stateBeer[state], stateWine[state])
In [15]:
# To sort the states from the most to the least obese
# Use the three dictionaries stateObesity, stateBeer and stateWine for the output
OUT = open('obesityAlcohol.txt', 'w')
OUT.write("state"+"\t"+"obesity %"+"\t"+"per capita beer consumption"+"\t"
          +"per capita wine consumption")
print "%s\t%s\t%s\t%s" % ("State", "Obesity rate",
                          "Per capita beer consumption",
                          "Per capita wine consumption")
for state, obesityPercentage in sorted(stateObesity.iteritems(),
                                       key=lambda (k,v): (v,k), reverse=True):
    print "%s\t%s\t%s\t%s" % (state, stateObesity[state], stateBeer[state], stateWine[state])
    OUT.write("\n%s\t%s\t%s\t%s" % (state, stateObesity[state], stateBeer[state], stateWine[state]))
OUT.close()
In [18]:
Image(filename='states per capita beer consumption 2005.png')
Out[18]:
In [19]:
Image(filename='states per capita wine consumption 2005.png')
Out[19]:
In [20]:
Image(filename='scatterplot beer capita.png')
Out[20]:
In [23]:
Image(filename='scatterplot wine capita.png')
Out[23]:
Based on the scatterplot of obesity rate vs. per capita beer consumption, obesity rate is fairly well correlated with beer consumption. In other words, a state with higher beer consumption tends to have a higher obesity rate. Beer consumption is clearly not the sole factor, however, because there are outliers such as Colorado, which has the lowest obesity rate among the states even though its per capita beer consumption is higher than that of 29 states.
Conversely, based on the scatterplot of obesity rate vs. per capita wine consumption, higher wine consumption is strongly correlated with a lower obesity rate. In other words, a state with higher wine consumption tends to have a lower obesity rate.
However, it is dangerous to conclude that wine consumption decreases obesity (correlation does not imply causation). Because wine is often perceived as expensive, states with higher wine consumption may also have higher median incomes. Since states with higher median incomes also tend to have lower obesity rates, median income may be a confounding factor, so it is important to analyze the association between state median income and per capita wine consumption further.
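The correlations discussed above are read off scatterplots by eye. One way to put a number on them, and to start the suggested follow-up on income vs. wine consumption, is the Pearson correlation coefficient. The sketch below is illustrative only: `pearson_r` and `paired_values` are hypothetical helper names, the values are made-up placeholders rather than the real state data, and it assumes the incomes have been converted to numbers first.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def paired_values(first, second):
    """Keep only states present in both dictionaries with non-missing values."""
    states = sorted(s for s in first
                    if s in second
                    and first[s] is not None
                    and second[s] is not None)
    return [first[s] for s in states], [second[s] for s in states]

# Made-up placeholder values, NOT the real state data
income = {"A": 40.0, "B": 50.0, "C": 60.0, "D": None}
wine = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 2.5}

xs, ys = paired_values(income, wine)
r = pearson_r(xs, ys)
print("r = %.3f" % r)
```

A coefficient near +1 or -1 indicates a strong linear association and one near 0 a weak one. As noted above, even a strong coefficient says nothing about which variable, if either, causes the other.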