You are currently looking at version 1.1 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.


Assignment 2 - Pandas Introduction

All questions are weighted the same in this assignment.

Part 1

The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.


In [2]:
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')

Question 0 (Example)

What is the first country in df?

This function should return a Series.


In [3]:
# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero()


Out[3]:
# Summer           13
Gold                0
Silver              0
Bronze              2
Total               2
# Winter            0
Gold.1              0
Silver.1            0
Bronze.1            0
Total.1             0
# Games            13
Gold.2              0
Silver.2            0
Bronze.2            2
Combined total      2
ID                AFG
Name: Afghanistan, dtype: object

Question 1

Which country has won the most gold medals in summer games?

This function should return a single string value.


In [4]:
def answer_one():
    max_gold = df['Gold'].max()
    ret = df[df['Gold'] == max_gold]
    ans = ret.index.values
    return ans[0]

print(answer_one())


United States

Question 2

Which country had the biggest difference between their summer and winter gold medal counts?

This function should return a single string value.


In [5]:
def answer_two():
    df2 = df.copy()
    df2['Gold_diff'] = df['Gold'] - df['Gold.1']

    score = []

    for row in df2['Gold_diff']:
        if row < 0:
            row = row * -1
            score.append(row)
        else:
            score.append(row)

    df2['score'] = score

    max_score = df2['score'].max()
    name = df2[df2['score'] == max_score]
    country_name = name.index.values
    return country_name[0]

print(answer_two())


United States

Question 3

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?

$$\frac{Summer~Gold - Winter~Gold}{Total~Gold}$$

Only include countries that have won at least 1 gold in both summer and winter.

This function should return a single string value.


In [6]:
def answer_three():
    df2 = df.copy()
    df2 = df2[(df2['Gold'] > 0) & (df2['Gold.1'] > 0)]

    df2['Gold_diff'] = (df2['Gold'] - df2['Gold.1']) / df2['Gold.2']

    score = []

    for row in df2['Gold_diff']:
        if row < 0:
            row = row * -100
            score.append(row)
        else:
            row = row * 100
            score.append(row)

    df2['score'] = score

    df3 = df2[['Gold','Gold.1','Gold.2','score',]]

    max_score = df3['score'].max()
    name = df3[df3['score'] == max_score]
    country_name = name.index.values

    return country_name[0]

print(answer_three())


Bulgaria

Question 4

Write a function to update the dataframe to include a new column called "Points" which is a weighted value where each gold medal (Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2) for 1 point. The function should return only the column (a Series object) which you created.

This function should return a Series named Points of length 146


In [7]:
def answer_four():
    df2 = df.copy()
    df2['Points'] = df2['Gold.2']*3 + df2['Silver.2']*2 + df2['Bronze.2']*1
    df3 = df2[['Gold.2','Silver.2','Bronze.2','Points']]
    return df3['Points']

print(answer_four())


Afghanistan                            2
Algeria                               27
Argentina                            130
Armenia                               16
Australasia                           22
Australia                            923
Austria                              569
Azerbaijan                            43
Bahamas                               24
Bahrain                                1
Barbados                               1
Belarus                              154
Belgium                              276
Bermuda                                1
Bohemia                                5
Botswana                               2
Brazil                               184
British West Indies                    2
Bulgaria                             411
Burundi                                3
Cameroon                              12
Canada                               846
Chile                                 24
China                               1120
Colombia                              29
Costa Rica                             7
Ivory Coast                            2
Croatia                               67
Cuba                                 420
Cyprus                                 2
                                    ... 
Spain                                268
Sri Lanka                              4
Sudan                                  2
Suriname                               4
Sweden                              1217
Switzerland                          630
Syria                                  6
Chinese Taipei                        32
Tajikistan                             4
Tanzania                               4
Thailand                              44
Togo                                   1
Tonga                                  2
Trinidad and Tobago                   27
Tunisia                               19
Turkey                               191
Uganda                                14
Ukraine                              220
United Arab Emirates                   3
United States                       5684
Uruguay                               16
Uzbekistan                            38
Venezuela                             18
Vietnam                                4
Virgin Islands                         2
Yugoslavia                           171
Independent Olympic Participants       4
Zambia                                 3
Zimbabwe                              18
Mixed team                            38
Name: Points, dtype: int64

Part 2

For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

Question 5

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)

This function should return a single string value.


In [13]:
census_df = pd.read_csv('census.csv')
census_df


Out[13]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
0 40 3 6 1 0 Alabama Alabama 4779736 4780127 4785161 ... 0.002295 -0.193196 0.381066 0.582002 -0.467369 1.030015 0.826644 1.383282 1.724718 0.712594
1 50 3 6 1 1 Alabama Autauga County 54571 54571 54660 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
2 50 3 6 1 3 Alabama Baldwin County 182265 182265 183193 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
3 50 3 6 1 5 Alabama Barbour County 27457 27457 27341 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
4 50 3 6 1 7 Alabama Bibb County 22915 22919 22861 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861
5 50 3 6 1 9 Alabama Blount County 57322 57322 57373 ... 1.807375 -1.177622 -1.748766 -2.062535 -1.369970 1.859511 -0.848580 -1.402476 -1.577232 -0.884411
6 50 3 6 1 11 Alabama Bullock County 10914 10915 10887 ... -30.953709 -5.180127 -1.130263 14.354290 -16.167247 -29.001673 -2.825524 1.507017 17.243790 -13.193961
7 50 3 6 1 13 Alabama Butler County 20947 20946 20944 ... -14.032727 -11.684234 -5.655413 1.085428 -6.529805 -13.936612 -11.586865 -5.557058 1.184103 -6.430868
8 50 3 6 1 15 Alabama Calhoun County 118572 118586 118437 ... -6.155670 -4.611706 -5.524649 -4.463211 -3.376322 -5.791579 -4.092677 -5.062836 -3.912834 -2.806406
9 50 3 6 1 17 Alabama Chambers County 34215 34170 34098 ... -2.731639 3.849092 2.872721 -2.287222 1.349468 -1.821092 4.701181 3.781439 -1.290228 2.346901
10 50 3 6 1 19 Alabama Cherokee County 25989 25986 25976 ... 6.339327 1.113180 5.488706 -0.076806 -3.239866 6.416167 1.420264 5.757384 0.230419 -2.931307
11 50 3 6 1 21 Alabama Chilton County 43643 43631 43665 ... -1.372935 -2.653369 0.480044 0.456017 -2.253483 -0.823761 -2.447504 0.868651 0.957636 -1.752709
12 50 3 6 1 23 Alabama Choctaw County 13859 13858 13841 ... -15.455274 -0.737028 -8.766391 -1.274984 -5.291205 -15.528177 -0.737028 -8.766391 -1.274984 -5.291205
13 50 3 6 1 25 Alabama Clarke County 25833 25840 25767 ... -6.194363 -17.667705 -0.318345 -8.686428 -5.613667 -6.077488 -17.509958 -0.159172 -8.486280 -5.411736
14 50 3 6 1 27 Alabama Clay County 13932 13932 13880 ... -10.744102 -13.345130 4.902871 5.702648 3.912450 -10.816697 -13.345130 4.977157 5.776708 3.986270
15 50 3 6 1 29 Alabama Cleburne County 14972 14972 14973 ... -3.673524 -5.151880 7.345821 3.654485 -3.123961 -3.673524 -5.151880 7.345821 3.654485 -3.123961
16 50 3 6 1 31 Alabama Coffee County 49948 49948 50177 ... 0.377640 7.675579 -13.146535 -3.602859 2.214774 2.166460 11.513368 -10.438741 -0.767822 5.350738
17 50 3 6 1 33 Alabama Colbert County 54428 54428 54514 ... -0.073423 1.065051 1.762390 1.835688 -0.110260 0.513964 1.469035 2.276420 2.533249 0.588052
18 50 3 6 1 35 Alabama Conecuh County 13228 13228 13208 ... -4.861559 -7.504690 -6.107224 -14.645416 2.684140 -4.861559 -7.504690 -6.107224 -14.645416 2.684140
19 50 3 6 1 37 Alabama Coosa County 11539 11758 11758 ... -33.930581 -10.291443 -4.313831 -22.958017 -5.387581 -34.017138 -10.380162 -4.403703 -23.049483 -5.387581
20 50 3 6 1 39 Alabama Covington County 37765 37765 37796 ... 6.696899 -4.612668 0.740271 3.697932 -0.316945 6.881460 -4.559952 0.793147 3.750759 -0.264121
21 50 3 6 1 41 Alabama Crenshaw County 13906 13906 13853 ... 1.729792 3.950156 -1.864936 3.084648 3.439504 2.666763 5.099293 -0.502098 4.734577 5.087600
22 50 3 6 1 43 Alabama Cullman County 80406 80410 80473 ... -1.404233 -1.019628 4.071247 5.087142 7.915406 -1.031427 -0.634159 4.542916 5.593387 8.417777
23 50 3 6 1 45 Alabama Dale County 50251 50251 50358 ... -10.749798 -5.277150 -15.236079 -11.979785 -5.107706 -9.575283 -0.776637 -12.640155 -9.503292 -1.998668
24 50 3 6 1 47 Alabama Dallas County 43820 43820 43803 ... -15.635599 -11.308243 -16.745678 -9.344789 -14.687232 -15.727573 -11.378047 -16.792849 -9.368689 -14.711389
25 50 3 6 1 49 Alabama DeKalb County 71109 71115 71142 ... 0.294677 -9.302391 -1.748807 0.267830 0.028141 1.375159 -8.656001 -1.029539 1.198187 0.956790
26 50 3 6 1 51 Alabama Elmore County 79303 79296 79465 ... 3.235576 0.822717 1.760531 -1.507057 2.067820 3.674511 1.558176 2.306047 -0.951175 2.757093
27 50 3 6 1 53 Alabama Escambia County 38319 38319 38309 ... -3.449988 -3.855889 -4.822706 -1.189831 1.190902 -3.397716 -3.803428 -4.769999 -1.136950 1.243830
28 50 3 6 1 55 Alabama Etowah County 104430 104427 104442 ... -1.015919 2.062637 -1.931884 -1.726932 -2.082234 -0.632554 2.446383 -1.518596 -1.234901 -1.588308
29 50 3 6 1 57 Alabama Fayette County 17241 17241 17231 ... -5.015601 -0.646640 -3.725937 0.296745 -2.797536 -5.132243 -0.705426 -3.785079 0.237396 -2.857058
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3163 50 2 3 55 131 Wisconsin Washington County 131887 131885 131967 ... -0.794876 0.785279 -2.215465 1.601149 -0.434498 -0.431504 1.162817 -1.763330 2.104796 0.059931
3164 50 2 3 55 133 Wisconsin Waukesha County 389891 389938 390076 ... -0.765799 2.128860 0.038132 0.760109 -0.719858 0.102448 3.180527 1.189727 2.077633 0.593567
3165 50 2 3 55 135 Wisconsin Waupaca County 52410 52410 52422 ... 3.111756 -2.241873 6.292687 -0.441031 -0.480617 3.359933 -2.011937 6.561277 -0.134227 -0.173022
3166 50 2 3 55 137 Wisconsin Waushara County 24496 24496 24506 ... 4.930022 -2.404973 -4.097017 -4.906711 -4.397793 5.174486 -2.160399 -3.810226 -4.535615 -4.024395
3167 50 2 3 55 139 Wisconsin Winnebago County 166994 166994 167059 ... 0.316712 2.889873 0.833819 -2.406192 -4.557985 0.842573 3.502335 1.531624 -1.545153 -3.685304
3168 50 2 3 55 141 Wisconsin Wood County 74749 74749 74807 ... -4.081523 -5.019090 -6.901200 -5.596471 -3.958322 -3.733590 -4.562809 -6.442917 -5.040889 -3.414223
3169 40 4 8 56 0 Wyoming Wyoming 563626 563767 564516 ... -0.381530 9.636214 4.487115 -4.788275 -3.221091 0.289680 10.694870 5.440390 -3.727831 -2.091573
3170 50 4 8 56 1 Wyoming Albany County 36299 36299 36428 ... 3.708956 2.637812 -3.544634 -3.334877 -9.911169 6.736119 6.433032 0.719587 1.429233 -5.166460
3171 50 4 8 56 3 Wyoming Big Horn County 11668 11668 11672 ... 4.868258 2.804930 16.815908 -8.026420 5.095861 4.868258 3.144921 17.236306 -7.608378 5.513554
3172 50 4 8 56 5 Wyoming Campbell County 46133 46133 46244 ... -2.843479 15.601020 -5.895711 -8.550911 10.916963 -2.649606 15.558684 -5.916543 -8.509402 10.978525
3173 50 4 8 56 7 Wyoming Carbon County 15885 15885 15837 ... -7.581980 -13.081441 3.178134 -2.970641 -23.300971 -7.392431 -12.636926 3.623073 -2.338590 -22.600668
3174 50 4 8 56 9 Wyoming Converse County 13833 13833 13826 ... -12.847499 15.493820 19.035533 -20.550587 -0.070403 -12.774915 16.502720 20.093063 -19.358233 1.126443
3175 50 4 8 56 11 Wyoming Crook County 7083 7083 7114 ... -1.544618 -4.202564 1.397819 6.378258 18.629317 -0.982939 -3.642222 2.096729 7.071547 19.309219
3176 50 4 8 56 13 Wyoming Fremont County 40123 40123 40222 ... 2.747083 7.782673 -4.990688 -12.331633 -13.673610 3.093562 8.027411 -4.747240 -12.013555 -13.352750
3177 50 4 8 56 15 Wyoming Goshen County 13249 13247 13408 ... 14.293649 3.961413 -8.079028 -7.017803 -11.899450 14.886132 4.841727 -6.903896 -5.761986 -10.635133
3178 50 4 8 56 17 Wyoming Hot Springs County 4812 4812 4813 ... 3.322604 6.208609 3.095336 -6.017222 -5.454164 5.191569 6.001656 2.888981 -6.224712 -5.663940
3179 50 4 8 56 19 Wyoming Johnson County 8569 8569 8581 ... 4.995063 -4.058912 -0.812583 -10.715742 0.933652 5.227392 -4.058912 -0.812583 -10.715742 0.933652
3180 50 4 8 56 21 Wyoming Laramie County 91738 91881 92271 ... -1.200428 15.547274 4.787847 -1.226133 0.278940 -0.973320 17.914554 6.003143 -0.207819 1.673640
3181 50 4 8 56 23 Wyoming Lincoln County 18106 18106 18091 ... -9.802564 -11.566801 13.564556 6.125989 1.555544 -9.691801 -11.566801 13.619696 6.234414 1.662823
3182 50 4 8 56 25 Wyoming Natrona County 75450 75450 75472 ... 7.189319 23.066162 24.322042 -0.958472 -0.061057 7.689674 23.749508 25.085233 -0.110593 0.793743
3183 50 4 8 56 27 Wyoming Niobrara County 2484 2484 2492 ... -0.401849 0.806452 29.066295 -12.603387 7.492114 -0.401849 0.806452 29.066295 -12.603387 7.492114
3184 50 4 8 56 29 Wyoming Park County 28205 28205 28259 ... 4.582951 8.057765 7.641997 -9.252437 -2.878980 6.486639 11.127389 10.877797 -5.585731 0.856839
3185 50 4 8 56 31 Wyoming Platte County 8667 8667 8678 ... 4.373094 5.392073 2.634593 6.055759 4.662270 4.373094 4.933173 2.176403 5.598720 4.207414
3186 50 4 8 56 33 Wyoming Sheridan County 29116 29116 29146 ... 0.958559 8.425487 4.546373 3.678069 -3.298406 2.122524 9.342778 5.523001 4.781489 -2.198937
3187 50 4 8 56 35 Wyoming Sublette County 10247 10247 10244 ... -23.741784 15.272374 -40.870074 -16.596273 -22.870900 -21.092907 16.828794 -39.211861 -14.409938 -20.664059
3188 50 4 8 56 37 Wyoming Sweetwater County 43806 43806 43593 ... 1.072643 16.243199 -5.339774 -14.252889 -14.248864 1.255221 16.243199 -5.295460 -14.075283 -14.070195
3189 50 4 8 56 39 Wyoming Teton County 21294 21294 21297 ... -1.589565 0.972695 19.525929 14.143021 -0.564849 0.654527 2.408578 21.160658 16.308671 1.520747
3190 50 4 8 56 41 Wyoming Uinta County 21118 21118 21102 ... -17.755986 -4.916350 -6.902954 -14.215862 -12.127022 -18.136812 -5.536861 -7.521840 -14.740608 -12.606351
3191 50 4 8 56 43 Wyoming Washakie County 8533 8533 8545 ... -11.637475 -0.827815 -2.013502 -17.781491 1.682288 -11.990126 -1.182592 -2.250385 -18.020168 1.441961
3192 50 4 8 56 45 Wyoming Weston County 7208 7208 7181 ... -11.752361 -8.040059 12.372583 1.533635 6.935294 -12.032179 -8.040059 12.372583 1.533635 6.935294

3193 rows × 100 columns


In [9]:
def answer_five():
    maximum_country = census_df.groupby(["STNAME"]).size().max()
    g2 = census_df.groupby(["STNAME"]).size()
    df3 = g2.reset_index()
    name = df3[df3[0] == maximum_country]
    name = name.set_index('STNAME').index.values[0]

    return name

print(answer_five())


Texas

Question 6

Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.

This function should return a list of string values.


In [39]:
def answer_six():
    df = census_df.copy()
    df=df[df['SUMLEV'] == 50]
    df = df[['CTYNAME', 'CENSUS2010POP']]
    df = df.set_index('CTYNAME')
    idx = df.sum(axis=1).sort_values(ascending=False).head(3).index
    # df1 = df.ix[idx]
    df1 = list(idx.values)

    return idx

print(answer_six())


['Los Angeles County', 'Cook County', 'Harris County']

Question 7

Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.

This function should return a single string value.


In [43]:
def answer_seven():
    df = census_df.copy()
    df=df[df['SUMLEV'] == 50]
    df = df[['STNAME','CTYNAME','POPESTIMATE2015','POPESTIMATE2014','POPESTIMATE2013','POPESTIMATE2012','POPESTIMATE2011','POPESTIMATE2010']]
    df = df.set_index(['STNAME', 'CTYNAME'])
    df1 = df.apply(lambda x: x.max() - x.min(),axis=1)

    df2 = df1.reset_index()
    df2 = df2.sort_values([0],ascending=[0])
    df3 = df2.set_index('CTYNAME').index.values

    return df3[0]

print(answer_seven())


Harris County

Question 8

In this datafile, the United States is broken up into four regions using the "REGION" column.

Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.

This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).


In [34]:
def answer_eight():
    df = census_df.copy()
    df = df[(df['REGION'] == 1) | (df['REGION'] == 2)]
    df = df[df['CTYNAME'] == 'Washington County']
    df = df[df['POPESTIMATE2015'] > df['POPESTIMATE2014']]

    return df[['STNAME','CTYNAME']]

print(answer_eight())


            STNAME            CTYNAME
896           Iowa  Washington County
1419     Minnesota  Washington County
2345  Pennsylvania  Washington County
2355  Rhode Island  Washington County
3163     Wisconsin  Washington County

In [ ]: