title: "Homework 05: Analyzing the gapminder dataset using Python"
Due before class Wednesday November 2nd.
The basic goal of the assignment is to implement various computational methods (e.g. data frames, lists, filtering, conditional expressions, iteration, functions) in Python. Rather than using raw programming assignments, you will demonstrate these skills in the context of analyzing the gapminder dataset, something you have already explored in R.
hw05 repository: Go here to fork the repo for homework 05.
You are provided with a Jupyter Notebook similar to the one seen here. Fill in the chunks with the appropriate code needed to perform the requested analysis. I have already identified the questions and tasks you need to perform.
Your assignment should be submitted as a single Jupyter Notebook. Follow instructions on homework workflow. As part of the pull request, you're encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.
Check minus: Notebook cannot be run. Didn't answer all of the questions. Code is incomprehensible or difficult to follow.
Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.
Check plus: Innovative use of coding elements to solve the problems (e.g. functions, conditional expressions, iterations). Adds labels to graphs. Uses techniques beyond those from the example notebooks. Successfully attempts the advanced challenge.
In [1]:
# Import libraries
import pandas as pd
import numpy as np
# Turn off notebook package warnings
import warnings
warnings.filterwarnings('ignore')
# print graphs in the document
%matplotlib inline
In [8]:
import seaborn as sns
Here the goal is to write a basic function, "life_expectancy", that incorporates your work above.
By default, the function should return a scatterplot of life-expectancy versus years for a given country. [Hint: Subset the data for a specific country, similar to a problem above]
Once you subset the data, the function should do one of two things, corresponding to the two outputs listed below.
Thus, your function should have arguments and output as follows:
* Arguments:
  Country (required): The name of a specific country, such as "Australia"
  Model (optional): Whether to build and return the model results. Hint: set the default to False.
* Output:
  (1) Default: A scatterplot of the relationship with a best-fit line
  (2) Model: The above graph AND the model results
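The argument pattern described above (one required argument, one optional flag that defaults to False) can be outlined as follows. This is only a sketch of the control flow, not a solution: the name `life_expectancy_outline`, the extra `data` parameter, the `country` column name, and the toy values are all assumptions made for illustration, and the plotting and modelling steps are left as placeholders.

```python
import pandas as pd

# Toy stand-in for the gapminder data frame; these values are invented
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952, 1957, 1962, 1952, 1957, 1962],
    "lifeExp": [40.0, 42.0, 44.0, 60.0, 61.0, 62.0],
})


def life_expectancy_outline(country, model=False, data=toy):
    """Outline only: subset by country, then branch on the optional flag."""
    # Keep only the rows for the requested country (column name is assumed)
    subset = data[data["country"] == country]
    if subset.empty:
        raise ValueError(f"no rows found for country {country!r}")
    # Placeholder: draw the scatterplot with a best-fit line here
    if model:
        # Placeholder: fit the model on `subset` and return its results as well
        return subset, "model results go here"
    return subset


default_result = life_expectancy_outline("A")     # flag omitted, so model=False
with_model = life_expectancy_outline("B", True)   # flag passed positionally
```

Because `model` has a default value, callers can omit it for the plain plot and pass `True` only when they also want the model results.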
To run a linear model that predicts life expectancy by year, we can use the statsmodels library.
In [43]:
import statsmodels.formula.api as sm  # Import package
model = sm.ols(formula='lifeExp ~ year', data=gapminder).fit()  # Fit OLS model
results = model.summary()  # Get results
print(results)  # Print results
# Hint: Use this code in your function.
# You will need to replace gapminder in the data argument with the data subset for a specific country.
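The hint above amounts to filtering the rows for one country before fitting. Here is a minimal sketch of that step, assuming the country names live in a column called `country` and using a small invented data frame in place of gapminder. To keep the example light, the slope is computed with NumPy's `polyfit` rather than statsmodels. For a single predictor this is the same least-squares line, but it is a stand-in, not the code the assignment asks for.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the gapminder data frame; these values are invented
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952, 1957, 1962, 1952, 1957, 1962],
    "lifeExp": [40.0, 42.0, 44.0, 60.0, 61.0, 62.0],
})

# Boolean mask: True for rows belonging to the chosen country
subset = toy[toy["country"] == "B"]

# `subset` is what would be passed as the data argument of the OLS call.
# Stand-in fit: a degree-1 least-squares line of lifeExp on year.
slope, intercept = np.polyfit(
    subset["year"].to_numpy(), subset["lifeExp"].to_numpy(), deg=1
)
```

Here `slope` is the estimated change in life expectancy per year for the chosen country, which corresponds to the `year` coefficient in the model summary.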
In [48]:
# write your function here
In [49]:
# Result for a Country (No Model)
life_expectancy("Afghanistan")
In [50]:
# Result for a Country (Model = True)
life_expectancy("New Zealand", True)
As you know already, the general trend is that over time life expectancy increases, but the trend is different for each country. Some experience a greater increase than others, whereas some countries experience declines in life expectancy. You can use whatever method you wish to assess and explain this relationship using Python.
Use whichever method you think you can master before the assignment is due. Some of you may just stick to basic graphs and tables, while others might build a statistical model using statsmodels. Obviously, the more advanced the technique you use, the higher your ceiling will be for your evaluation. But don't spend 10 hours getting this to work! Go with what you can accomplish in a reasonable amount of time.
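One possible starting point, among many, is to compare the fitted yearly change in life expectancy across countries. The sketch below assumes a `country` column and uses invented values rather than the real gapminder data. The helper name `yearly_change` is hypothetical, and the least-squares slope from NumPy's `polyfit` is just one way to summarize a trend.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the gapminder data frame; these values are invented
toy = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "year": [1952, 1957, 1962] * 3,
    "lifeExp": [40.0, 42.0, 44.0, 60.0, 61.0, 62.0, 55.0, 53.0, 51.0],
})


def yearly_change(group):
    """Least-squares slope of lifeExp on year for one country's rows."""
    slope, _ = np.polyfit(
        group["year"].to_numpy(), group["lifeExp"].to_numpy(), deg=1
    )
    return slope


# Iterate over the per-country groups and collect one slope per country
slopes = {country: yearly_change(group) for country, group in toy.groupby("country")}

# Sort so that declining countries come first and the fastest gains come last
trend = pd.Series(slopes).sort_values()
```

A negative slope flags a country whose life expectancy declined over the period, which is the kind of difference the question asks you to assess and explain.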
In [ ]: