You are currently looking at version 1.0 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.
All questions are weighted the same in this assignment.
The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning. Use this dataset to answer the questions below.
In [3]:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
# Rename the numbered medal columns ('01 !', '02 !', '03 !') to readable names,
# keeping any pandas suffix such as '.1' or '.2'
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)
names_ids = df.index.str.split(r'\s\(') # split the index by '('
df.index = names_ids.str[0] # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)
df = df.drop('Totals')
df.head()
Out[3]:
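The index clean-up in the cell above can be tried in isolation. The following is a minimal sketch on a hypothetical two-row frame (the name `toy` and its values are made up for illustration, not taken from olympics.csv); it assumes the row labels have the shape "Name (CODE)".

```python
import pandas as pd

# Hypothetical stand-in for the olympics index: "Name (CODE)" labels
toy = pd.DataFrame({'Gold': [0, 5]},
                   index=['Afghanistan (AFG)', 'Algeria (ALG)'])

# Split each label on whitespace followed by '(' -> ['Afghanistan', 'AFG)']
parts = toy.index.str.split(r'\s\(')

toy.index = parts.str[0]          # country name becomes the new index
toy['ID'] = parts.str[1].str[:3]  # first three characters of the code part

print(toy)
```

Using a raw string for the pattern avoids Python's invalid-escape warning for `\s` and `\(`.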
In [14]:
# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]
# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero()
Out[14]:
In [15]:
def answer_one():
    # Country with the largest value in the first 'Gold' column
    return df['Gold'].idxmax()
answer_one()
Out[15]:
In [22]:
def answer_two():
    # Country with the largest difference between 'Gold' and 'Gold.1'
    return (df['Gold'] - df['Gold.1']).idxmax()
answer_two()
Out[22]:
In [153]:
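def answer_three():
    # Keep only countries with at least one gold in both 'Gold' and 'Gold.1'
    tmp_df = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    # Difference relative to the combined gold count
    return ((tmp_df['Gold'] - tmp_df['Gold.1']) / (tmp_df['Gold'] + tmp_df['Gold.1'])).idxmax()
answer_three()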
Out[153]:
Write a function to update the dataframe to include a new column called "Points", a weighted value where each gold medal counts for 3 points, each silver medal for 2 points, and each bronze medal for 1 point. The function should return only the column (a Series object) which you created.
This function should return a Series named Points of length 146.
In [165]:
def answer_four():
    # Add the weighted score to the dataframe so the returned Series is named 'Points'
    df['Points'] = 3*df['Gold.2'] + 2*df['Silver.2'] + 1*df['Bronze.2']
    return df['Points']
answer_four()
Out[165]:
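As a quick sanity check of the 3/2/1 weighting described above, here is a small sketch on a hypothetical two-row frame (`toy` and its medal counts are invented for illustration). It reuses the `Gold.2`, `Silver.2` and `Bronze.2` column names from the answer cell.

```python
import pandas as pd

# Hypothetical medal counts for two made-up countries
toy = pd.DataFrame({'Gold.2': [1, 0], 'Silver.2': [2, 1], 'Bronze.2': [3, 0]},
                   index=['A', 'B'])

# Weighted score: gold = 3 points, silver = 2 points, bronze = 1 point
toy['Points'] = 3 * toy['Gold.2'] + 2 * toy['Silver.2'] + toy['Bronze.2']

points = toy['Points']  # selecting the column yields a Series named 'Points'
print(points)
```

Selecting the column back out of the frame, rather than returning the bare arithmetic result, is what gives the Series its name.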
For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.
The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)
This function should return a single string value.
In [4]:
census_df = pd.read_csv('census.csv')
census_df.columns
Out[4]:
In [80]:
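def answer_five():
    # Keep county-level rows only (SUMLEV == 50) so state summary rows are not counted
    counties = census_df[census_df['SUMLEV'] == 50]
    return counties.groupby('STNAME').size().idxmax()
answer_five()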
Out[80]:
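The hint about the sumlevel key can be illustrated on a tiny made-up table. This sketch assumes, as in the census file, that `SUMLEV == 40` marks a state summary row and `SUMLEV == 50` marks a county row; the frame `toy` and its state and county names are hypothetical.

```python
import pandas as pd

# Hypothetical census-like rows: one summary row per state plus its counties
toy = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 40, 50],
    'STNAME': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'CTYNAME': ['Alpha', 'One County', 'Two County', 'Beta', 'Three County'],
})

# Drop the state summary rows before counting
counties = toy[toy['SUMLEV'] == 50]
per_state = counties.groupby('STNAME').size()  # number of counties per state

most = per_state.idxmax()
print(per_state)
print(most)
```

Without the filter every state would be counted once too often. That does not change which state has the most rows, but it does distort any sum or count taken over counties.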
In [51]:
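def answer_six():
    # County-level rows only
    t = census_df[census_df['SUMLEV'] == 50]
    # Three most populous counties in each state
    t = t.sort_values(by=['STNAME', 'CENSUS2010POP'], ascending=False).groupby('STNAME').head(3)
    # Sum only the population column, then take the three largest state totals
    totals = t.groupby('STNAME')['CENSUS2010POP'].sum()
    return list(totals.sort_values(ascending=False).head(3).index)
answer_six()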
Out[51]:
In [66]:
def answer_seven():
    # Work on a copy so that adding columns does not trigger SettingWithCopyWarning
    tmp_df = census_df[census_df['SUMLEV'] == 50].copy()
    # Year-over-year change in the population estimate
    tmp_df['2011'] = tmp_df['POPESTIMATE2011'] - tmp_df['POPESTIMATE2010']
    tmp_df['2012'] = tmp_df['POPESTIMATE2012'] - tmp_df['POPESTIMATE2011']
    tmp_df['2013'] = tmp_df['POPESTIMATE2013'] - tmp_df['POPESTIMATE2012']
    tmp_df['2014'] = tmp_df['POPESTIMATE2014'] - tmp_df['POPESTIMATE2013']
    tmp_df['2015'] = tmp_df['POPESTIMATE2015'] - tmp_df['POPESTIMATE2014']
    # Largest single-year increase for each county
    tmp_df['max'] = tmp_df[['2011', '2012', '2013', '2014', '2015']].max(axis=1)
    return tmp_df.sort_values(by='max', ascending=False).iloc[0].CTYNAME
answer_seven()
Out[66]:
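The year-over-year subtractions in `answer_seven` could also be written with `DataFrame.diff(axis=1)`. The sketch below is one possible alternative, not the notebook's own method; it uses a hypothetical two-county frame (`toy`) with only three estimate columns to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical population estimates for two made-up counties
toy = pd.DataFrame({
    'CTYNAME': ['A County', 'B County'],
    'POPESTIMATE2010': [100, 200],
    'POPESTIMATE2011': [120, 190],
    'POPESTIMATE2012': [80, 260],
})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

# Each column minus the column to its left; the first column becomes NaN
yearly = toy[cols].diff(axis=1)

# Largest single-year increase per county (NaN is skipped by max)
biggest = yearly.max(axis=1)

winner = toy.loc[biggest.idxmax(), 'CTYNAME']
print(yearly)
print(winner)
```

Like the cell above, this looks only at the largest increase between consecutive years; a decrease, however large, never wins.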
In this datafile, the United States is broken up into four regions using the "REGION" column.
Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014.
This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).
In [144]:
def answer_eight():
    # Combine the three conditions into one boolean mask
    mask = (census_df['REGION'].isin([1, 2])
            & census_df['CTYNAME'].str.startswith('Washington')
            & (census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']))
    # Boolean filtering keeps the original index and its ascending order
    return census_df.loc[mask, ['STNAME', 'CTYNAME']]
answer_eight()
Out[144]:
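The three-part query can be checked by hand on a small made-up table. In this sketch the frame `toy` and all of its values are hypothetical; only the column names come from the census data.

```python
import pandas as pd

# Hypothetical rows covering each way a county can fail the query
toy = pd.DataFrame({
    'REGION': [1, 2, 3, 1],
    'STNAME': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'CTYNAME': ['Washington County', 'Washington Parish',
                'Washington County', 'Lincoln County'],
    'POPESTIMATE2014': [100, 200, 300, 400],
    'POPESTIMATE2015': [110, 190, 310, 410],
})

# Each condition is a boolean Series; & combines them element-wise
mask = (toy['REGION'].isin([1, 2])
        & toy['CTYNAME'].str.startswith('Washington')
        & (toy['POPESTIMATE2015'] > toy['POPESTIMATE2014']))

result = toy.loc[mask, ['STNAME', 'CTYNAME']]
print(result)
```

Only row 0 passes all three tests: row 1 lost population, row 2 is in region 3, and row 3 has a different name. The surviving row keeps its original index label, which is what the question asks for.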