Advanced Pandas: Combining data

Sometimes we need to combine data from two or more dataframes. That's known as a merge or a join. There are lots of ways to do this; we cover a couple and supply references to more at the end.

Along the way we take an extended detour to review methods for downloading and unzipping compressed files. The tools we use here have a broad range of other applications, including web scraping.

Note: requires internet access to run.

This IPython notebook was created by Dave Backus, Chase Coleman, Brian LeBlanc, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import packages, etc.


In [1]:
%matplotlib inline 

import pandas as pd             # data package
import matplotlib.pyplot as plt # graphics 
import datetime as dt           # date tools, used to note current date  

# these are new 
import os                       # operating system tools (check files)
import requests, io             # internet and input tools  
import zipfile as zf            # zip file tools 
import shutil                   # file management tools

MovieLens data

The data comes as a zip file that contains several csv's. We get the details from the README inside. (It's written in Markdown, so it's easier to read if we use a browser to format it. Or we could cut and paste into a Markdown cell in an IPython notebook.)

The file descriptions are:

  • ratings.csv: each line is an individual film rating with the rater and movie id's and the rating. Order: userId, movieId, rating, timestamp.
  • tags.csv: each line is a tag on a specific film. Order: userId, movieId, tag, timestamp.
  • movies.csv: each line is a movie name, its id, and its genre. Order: movieId, title, genres. Multiple genres are separated by "pipes" |.
  • links.csv: each line contains the movie id and corresponding id's at IMDb and TMDb.
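
Since genres arrive pipe-separated, they need splitting before any genre-level analysis. A minimal sketch with made-up rows (the column names follow the README above):

```python
import pandas as pd

# toy movies table with pipe-separated genres, mimicking movies.csv
movies = pd.DataFrame({'movieId': [1, 2],
                       'title': ['Toy Story (1995)', 'Jumanji (1995)'],
                       'genres': ['Adventure|Animation|Comedy', 'Adventure|Fantasy']})

# split the pipe-separated string into a list per row
genre_lists = movies['genres'].str.split('|')
print(genre_lists[0])   # ['Adventure', 'Animation', 'Comedy']
```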

The easy way to input this data is to download the zip file onto our computer, unzip it, and read the individual csv files with read_csv(). But anyone can do it the easy way. We want to automate this so we can redo it without any manual steps. This takes some effort, but once we have it down we can apply it to lots of other data sources.


In [ ]:

Automate file download

We're looking for an automated way, so that if we do this again, possibly with updated data, the whole process is in our code. Automated data entry involves these steps:

  • Get the file. We use the requests package, which handles internet files and comes pre-installed with Anaconda. This kind of thing was hidden behind the scenes in the Pandas read_csv function, but here we need to do it for ourselves. The package authors add:

    Recreational use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.

  • Convert to zip. Requests simply loads whatever's at the given url. The io module's io.BytesIO reconstructs it as a file object, here a zip file.
  • Unzip the file. We use the zipfile module, which is part of core Python, to extract the files inside.
  • Read in the csv's. Now that we've extracted the csv files, we use read_csv as usual.
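
The same four steps can be rehearsed offline: build a small zip in memory and then treat its bytes as if they had come from requests. A minimal sketch (the archive name and csv contents are made up):

```python
import io
import zipfile as zf
import pandas as pd

# fake the "downloaded" bytes: a zip archive containing one small csv
buffer = io.BytesIO()
with zf.ZipFile(buffer, 'w') as archive:
    archive.writestr('demo/ratings.csv', 'userId,movieId,rating\n1,31,2.5\n1,1029,3.0\n')
content = buffer.getvalue()          # bytes, standing in for r.content

# steps 2-4: bytes -> file object -> zip file -> dataframe
mlz = zf.ZipFile(io.BytesIO(content))
print(mlz.namelist())                # ['demo/ratings.csv']
df = pd.read_csv(mlz.open('demo/ratings.csv'))
print(df.shape)                      # (2, 3)
```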

We found this Stack Overflow exchange helpful.

Digression. This is probably more than you want to know, but it's a reminder of what goes on behind the scenes when we apply read_csv to a url. Here we grab whatever is at the url, get its contents, convert them to bytes, identify the result as a zip file, and read its components with read_csv. It's a lot easier when this happens automatically, but it's worth knowing what's involved if we ever have to look into the details.


In [ ]:
# get "response" from url 
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
r = requests.get(url) 

# describe response 
print('Response status code:', r.status_code)
print('Response type:', type(r))
print('Response .content:', type(r.content)) 
print('Response headers:\n', r.headers, sep='')

In [ ]:
# convert bytes to zip file  
mlz = zf.ZipFile(io.BytesIO(r.content)) 
print('Type of zipfile object:', type(mlz))

In [ ]:
# what's in the zip file?
mlz.namelist()

In [ ]:
mlz.open('ml-latest-small/links.csv')

In [ ]:
pd.read_csv(mlz.open('ml-latest-small/links.csv'))

In [ ]:
# extract and read csv's (use explicit paths rather than namelist() positions)
movies  = pd.read_csv(mlz.open('ml-latest-small/movies.csv'))
ratings = pd.read_csv(mlz.open('ml-latest-small/ratings.csv'))

In [ ]:
# what do we have? 
for df in [movies, ratings]:
    print('Type:', type(df))
    print('Dimensions:', df.shape, '\n')
    print('Variables:', df.columns.tolist(), '\n')
    print('First few rows:\n', df.head(3), '\n')

In [ ]:

Exercise. Something to do together. Suppose we wanted to save the files on our computer. How would we do it? Would we prefer individual csv's or a single zip?


In [ ]:
# writing csv (specify different location)
with open('test_01.csv', 'wb') as out_file:
    shutil.copyfileobj(mlz.open(mlz.namelist()[2]), out_file)

In [ ]:


In [ ]:
# experiment via http://stackoverflow.com/a/18043472/804513
with open('test.zip', 'wb') as out_file:
    shutil.copyfileobj(io.BytesIO(r.content), out_file)

Merging ratings and movie titles

The movie ratings in the dataframe ratings give us individual opinions about movies, but they don't include the name of the movie. Why not? Rather than include the name every time a movie is rated, the MovieLens data associates each rating with a movie code, then stores the names of movies associated with each code in the dataframe movies. We run across this a lot: some information is in one data table, other information is in another.

We therefore want to add the movie name to the ratings dataframe. We say we merge the two dataframes. There are lots of ways to merge; here we do one as an illustration.

Let's start by reminding ourselves what we have.


In [ ]:
ratings.head(3)

In [ ]:
movies.head(3)

Merging

Here's roughly what's involved. We take the movieId variable from ratings and look it up in movies. When we find it, we look up the title and add it as a column in ratings. The variable movieId is common to both dataframes, so we can use it to link them.
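
In miniature, with made-up ids and titles, the lookup works like this:

```python
import pandas as pd

# toy versions of the two tables
ratings = pd.DataFrame({'userId': [1, 1, 2],
                        'movieId': [31, 1029, 31],
                        'rating': [2.5, 3.0, 4.0]})
movies = pd.DataFrame({'movieId': [31, 1029],
                       'title': ['Dangerous Minds (1995)', 'Dumbo (1941)']})

# look up each movieId from ratings in movies and bring back the title
combo = pd.merge(ratings, movies, how='left', on='movieId')
print(combo['title'].tolist())
```

With how='left', every row of ratings survives and keeps its order; the title column is filled in wherever the movieId matches.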


In [ ]:
combo = pd.merge(ratings, movies,   # left and right df's
                 how='left',        # add to left 
                 on='movieId'       # link with this variable/column 
                ) 

print('Dimensions of ratings:', ratings.shape)
print('Dimensions of movies:', movies.shape)
print('Dimensions of new df:', combo.shape)

combo.head(20)

In [ ]:
combo_1 = ratings.merge(movies, how='left', on='movieId')

In [ ]:
combo_1.head()

In [ ]:
combo_2 = ratings.merge(movies, how='inner', on='movieId')
combo_2.shape

In [ ]:
combo_3 = movies.merge(ratings, how='right', on='movieId')
combo_3.shape

In [ ]:
# save as csv file for future use 
combo.to_csv('mlcombined.csv')

In [ ]:
count_2 = movies['movieId'].isin(ratings['movieId'])

In [ ]:
count_2.sum()

In [ ]:
print('Current directory:\n', os.getcwd(), sep='')
print('List of files:', os.listdir(), sep='\n')

In [ ]:

Exercise. Some of these we know how to do, the others we don't. For the ones we know, what is the answer? For the others, what (in loose terms) do we need to be able to do to come up with an answer?

  • What is the overall average rating?
  • What is the overall distribution of ratings?
  • What is the average rating of each movie?
  • How many ratings does each movie get?

In [ ]:
combo['rating'].mean()

In [ ]:
fig, ax = plt.subplots()
bins = [b/100 for b in range(25, 575, 50)]
print(bins)
combo['rating'].plot(kind='hist', ax=ax, bins=bins, color='blue', alpha=0.5)
ax.set_xlim(0,5.5)
ax.set_ylabel('Number')
ax.set_xlabel('Rating')
plt.show()

In [ ]:
from plotly.offline import iplot             # plotting functions
import plotly.graph_objs as go               # ditto
import plotly

In [ ]:
plotly.offline.init_notebook_mode(connected=True)

In [ ]:
trace = go.Histogram(
    x=combo['rating'],
    histnorm='count',
    name='control',
    autobinx=False,
    xbins=dict(
        start=.5,
        end=5.0,
        size=0.5
    ),
    marker=dict(
        color='Blue',
    ),
    opacity=0.75
)

layout = go.Layout(
    title='Distribution of ratings',
    xaxis=dict(
        title='Rating value'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.01,
    bargroupgap=0.1
)

iplot(go.Figure(data=[trace], layout=layout))

In [ ]:
combo[combo['movieId']==31]['rating'].mean()

In [ ]:
ave_mov = combo['rating'].groupby(combo['movieId']).mean()

In [ ]:
ave_mov = ave_mov.reset_index()

In [ ]:
ave_mov = ave_mov.rename(columns={"rating": "average rating"})

In [ ]:
combo2 = combo.merge(ave_mov, how='left', on='movieId')

In [ ]:
combo2.shape

In [ ]:
combo2.head(3)

In [ ]:
combo['ave'] = combo['rating'].groupby(combo['movieId']).transform('mean')
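
The transform('mean') call computes the mean within each movieId group and broadcasts it back to every row, so the result lines up with the original dataframe. A toy check:

```python
import pandas as pd

df = pd.DataFrame({'movieId': [1, 1, 2], 'rating': [2.0, 4.0, 5.0]})

# groupby().mean() collapses to one row per group ...
print(df['rating'].groupby(df['movieId']).mean().tolist())   # [3.0, 5.0]

# ... while transform('mean') keeps one value per original row
df['ave'] = df['rating'].groupby(df['movieId']).transform('mean')
print(df['ave'].tolist())                                    # [3.0, 3.0, 5.0]
```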

In [ ]:
combo.head()

In [ ]:
combo2[combo2['movieId']==1129]

In [ ]:
combo['count'] = combo['rating'].groupby(combo['movieId']).transform('count')

In [ ]:
combo.head()

Population "estimates" and "projections"

We look (again) at the UN's population data, specifically the age distribution of the population. The data comes in two sheets: estimates that cover the period 1950-2015 and projections that cover 2016-2100. Our mission is to combine them.

Load data

We start, as usual, by loading the data. This takes a minute or so.


In [2]:
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3 

cols = [2, 5] + list(range(6,28))
est = pd.read_excel(url, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
prj = pd.read_excel(url, sheet_name=1, skiprows=16, usecols=cols, na_values=['…'])

print('Dimensions and dtypes of estimates: ', est.shape, '\n', est.dtypes.head(), sep='')
print('\nDimensions and dtypes of projections: ', prj.shape, '\n', prj.dtypes.head(), sep='')


Dimensions and dtypes of estimates: (3374, 24)
Major area, region, country or area *     object
Reference date (as of 1 July)              int64
0-4                                      float64
5-9                                      float64
10-14                                    float64
dtype: object

Dimensions and dtypes of projections: (4338, 23)
Major area, region, country or area *     object
Reference date (as of 1 July)              int64
0-4                                      float64
5-9                                      float64
10-14                                    float64
dtype: object

In [3]:
est.to_csv('un_pop_est.csv')
prj.to_csv('un_pop_proj.csv')

Comment. Note that they have different numbers of columns. Let's see where that comes from.


In [4]:
list(est)[15:]


Out[4]:
['65-69', '70-74', '75-79', '80+', '80-84', '85-89', '90-94', '95-99', '100+']

In [5]:
list(prj)[15:]


Out[5]:
['65-69', '70-74', '75-79', '80-84', '85-89', '90-94', '95-99', '100+']

In [6]:
est.head()


Out[6]:
Major area, region, country or area * Reference date (as of 1 July) 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 60-64 65-69 70-74 75-79 80+ 80-84 85-89 90-94 95-99 100+
0 WORLD 1950 337431.893 269549.563 260285.973 238628.448 221780.909 194423.919 166936.550 162917.428 ... 73422.363 55106.429 37360.188 21997.317 14201.827 NaN NaN NaN NaN NaN
1 WORLD 1955 402845.030 315054.954 263265.738 254815.285 231891.794 214877.713 187940.577 160385.333 ... 76843.159 59321.778 40345.907 23755.192 16158.258 NaN NaN NaN NaN NaN
2 WORLD 1960 430565.118 380319.072 309276.447 257898.631 248412.663 225956.592 208746.975 181631.886 ... 85064.295 62665.125 44018.202 25985.789 18061.218 NaN NaN NaN NaN NaN
3 WORLD 1965 477798.025 409020.250 372816.602 303891.430 251896.962 242692.181 219978.079 202499.120 ... 95085.421 70255.639 47382.415 29457.425 21031.995 NaN NaN NaN NaN NaN
4 WORLD 1970 522640.874 458298.455 403910.553 367789.079 297557.428 246920.905 237656.753 214329.535 ... 107200.639 81022.911 55168.403 32875.945 25340.203 NaN NaN NaN NaN NaN

5 rows × 24 columns


In [7]:
prj.head()


Out[7]:
Major area, region, country or area * Reference date (as of 1 July) 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
0 WORLD 2015 670928.185 637448.895 607431.299 590069.337 603508.555 609793.207 551085.426 497755.634 ... 339691.467 292726.792 215047.151 153207.026 114652.312 71449.698 37062.196 13388.529 2921.852 450.895
1 WORLD 2020 677599.590 664282.610 634568.409 604322.921 585831.926 599278.263 604808.769 544641.774 ... 387339.470 320927.366 267874.842 187441.892 123355.620 81851.457 42700.991 16937.683 4242.605 605.465
2 WORLD 2025 673174.914 671929.973 661684.410 631509.113 600050.925 581681.875 594598.480 598538.429 ... 428166.199 367325.750 295011.725 234797.844 152989.596 89822.147 49847.989 20116.420 5538.923 905.182
3 WORLD 2030 672121.017 667926.121 669744.224 658942.513 627343.133 595871.013 577192.328 588723.101 ... 450091.908 407564.034 339528.512 260426.171 193066.277 113475.561 56221.915 24041.609 6836.461 1244.851
4 WORLD 2035 678271.589 667353.285 665875.094 667248.100 655034.545 623315.619 591598.185 571825.139 ... 456022.842 429343.580 378601.876 301893.804 216249.259 144393.943 72643.007 28074.222 8397.572 1617.797

5 rows × 23 columns

Clean data

Pick a usable subset and fix the extra column so that we can combine them. The problem here is that until 1990, the highest age category was '80+'. From 1990 on, we have a finer breakdown.

We fix this by reassigning '80+' to '80-84' and not worrying that some of these people are 85 or older. Note that df.fillna(0.0) replaces missing values with zeros.
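
The reassignment can be checked on a two-row toy frame: one "old" row with only 80+ filled in, one "new" row with only 80-84 (the column names match the UN sheets above, the values are made up to resemble them):

```python
import pandas as pd

df = pd.DataFrame({'Year': [1950, 2000],
                   '80+': [362.356, None],
                   '80-84': [None, 2540.718]})

# fold the 80+ totals into 80-84, treating missing values as zero
df['80-84'] = df['80-84'].fillna(0.0) + df['80+'].fillna(0.0)
df = df.drop(['80+'], axis=1)
print(df['80-84'].tolist())   # [362.356, 2540.718]
```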


In [8]:
def cleanpop(df, countries, years): 
    """
    take df as input and select countries and years 
    """
    # rename first two columns 
    names = list(df) 
    df = df.rename(columns={names[0]: 'Country', names[1]: 'Year'}) 

    # select countries and years
    newdf = df[df['Country'].isin(countries) & df['Year'].isin(years)] 
    
    return newdf

countries = ['Japan']
past      = [1950, 2000]
future    = [2050, 2100]

e = cleanpop(est, countries, past) 
p = cleanpop(prj, countries, future)

# make copies for later use 
ealt = e.copy()
palt = p.copy()

In [9]:
# fix top-coding in estimates
e['80-84'] = e['80-84'].fillna(0.0) + e['80+'].fillna(0.0)
e = e.drop(['80+'], axis=1)

# check dimensions again
print('Dimensions of cleaned estimates: ', e.shape)
print('Dimensions of cleaned projections: ', p.shape)


Dimensions of cleaned estimates:  (2, 23)
Dimensions of cleaned projections:  (2, 23)

In [10]:
# check to see if we have the same variables 
list(e) == list(p)


Out[10]:
True

In [11]:
ealt.head()


Out[11]:
Country Year 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 60-64 65-69 70-74 75-79 80+ 80-84 85-89 90-94 95-99 100+
1134 Japan 1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 ... 2277.718 1757.865 1271.333 674.869 362.356 NaN NaN NaN NaN NaN
1144 Japan 2000 5858.083 5992.304 6534.552 7481.678 8394.685 9782.756 8684.281 8046.208 ... 7673.926 7024.429 5825.320 4082.687 NaN 2540.718 1476.095 527.205 111.014 10.78

2 rows × 24 columns


In [11]:
e.head()


Out[11]:
Country Year 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
1134 Japan 1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 ... 2728.668 2277.718 1757.865 1271.333 674.869 362.356 NaN NaN NaN NaN
1144 Japan 2000 5858.083 5992.304 6534.552 7481.678 8394.685 9782.756 8684.281 8046.208 ... 8654.066 7673.926 7024.429 5825.320 4082.687 2540.718 1476.095 527.205 111.014 10.78

2 rows × 23 columns

Merge estimates and projections

If we have two blocks of data and just want to put them on top of each other, we use Pandas' concat function. Ditto for two blocks next to each other.

We have already fixed the difference in the columns of the two dataframes, so the estimates and projections now line up.
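
The axis argument controls the direction. A minimal sketch with two tiny frames:

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5], 'b': [6]})

# axis=0 stacks blocks on top of each other (more rows)
stacked = pd.concat([top, bottom], axis=0)
print(stacked.shape)    # (3, 2)

# axis=1 puts blocks next to each other (more columns)
side = pd.concat([top, top], axis=1)
print(side.shape)       # (2, 4)
```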


In [12]:
pop = pd.concat([e, p], axis=0).fillna(0.0)
pop


Out[12]:
Country Year 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
1134 Japan 1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 ... 2728.668 2277.718 1757.865 1271.333 674.869 362.356 0.000 0.000 0.000 0.000
1144 Japan 2000 5858.083 5992.304 6534.552 7481.678 8394.685 9782.756 8684.281 8046.208 ... 8654.066 7673.926 7024.429 5825.320 4082.687 2540.718 1476.095 527.205 111.014 10.780
1465 Japan 2050 4365.709 4454.049 4533.887 4644.123 4791.892 4986.758 5222.010 5419.444 ... 6065.917 6630.748 7010.576 7539.984 8225.455 6530.305 4766.516 2968.450 1523.374 441.117
1475 Japan 2100 3662.422 3723.514 3791.893 3872.946 3960.466 4046.149 4121.987 4203.284 ... 4561.356 4613.669 4655.102 4687.750 4715.118 4656.733 4296.603 3419.217 2105.037 904.728

4 rows × 23 columns

Exercise. What happens if we try to merge the original dataframes, including the one with the extra 80+ column? Run the code below and comment on what you get.


In [13]:
popalt = pd.concat([ealt, palt], axis=0)
popalt


Out[13]:
0-4 10-14 100+ 15-19 20-24 25-29 30-34 35-39 40-44 45-49 ... 65-69 70-74 75-79 80+ 80-84 85-89 90-94 95-99 Country Year
1134 10997.606 8634.342 NaN 8484.002 7636.338 6089.337 5104.356 4989.952 4429.172 3959.115 ... 1757.865 1271.333 674.869 362.356 NaN NaN NaN NaN Japan 1950
1144 5858.083 6534.552 10.780 7481.678 8394.685 9782.756 8684.281 8046.208 7741.920 8928.982 ... 7024.429 5825.320 4082.687 NaN 2540.718 1476.095 527.205 111.014 Japan 2000
1465 4365.709 4533.887 441.117 4644.123 4791.892 4986.758 5222.010 5419.444 5543.959 5727.320 ... 7010.576 7539.984 8225.455 NaN 6530.305 4766.516 2968.450 1523.374 Japan 2050
1475 3662.422 3791.893 904.728 3872.946 3960.466 4046.149 4121.987 4203.284 4294.733 4394.420 ... 4655.102 4687.750 4715.118 NaN 4656.733 4296.603 3419.217 2105.037 Japan 2100

4 rows × 24 columns

Shape data

We want age categories in the index (the default x axis in a plot) and the years in the columns. We don't care about the country because there's only one.


In [13]:
pop = pop.drop('Country', axis=1)
popi = pop.set_index('Year')
popi


Out[13]:
0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
Year
1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 4429.172 3959.115 ... 2728.668 2277.718 1757.865 1271.333 674.869 362.356 0.000 0.000 0.000 0.000
2000 5858.083 5992.304 6534.552 7481.678 8394.685 9782.756 8684.281 8046.208 7741.920 8928.982 ... 8654.066 7673.926 7024.429 5825.320 4082.687 2540.718 1476.095 527.205 111.014 10.780
2050 4365.709 4454.049 4533.887 4644.123 4791.892 4986.758 5222.010 5419.444 5543.959 5727.320 ... 6065.917 6630.748 7010.576 7539.984 8225.455 6530.305 4766.516 2968.450 1523.374 441.117
2100 3662.422 3723.514 3791.893 3872.946 3960.466 4046.149 4121.987 4203.284 4294.733 4394.420 ... 4561.356 4613.669 4655.102 4687.750 4715.118 4656.733 4296.603 3419.217 2105.037 904.728

4 rows × 21 columns


In [14]:
popi.columns.name = 'Age'
popt = popi.T
popt.head()


Out[14]:
Year 1950 2000 2050 2100
Age
0-4 10997.606 5858.083 4365.709 3662.422
5-9 9448.307 5992.304 4454.049 3723.514
10-14 8634.342 6534.552 4533.887 3791.893
15-19 8484.002 7481.678 4644.123 3872.946
20-24 7636.338 8394.685 4791.892 3960.466

In [15]:
ax = popt.plot(kind='bar', color='blue', alpha=0.5, 
               subplots=True, 
               sharey=True, 
               figsize=(8,12))



In [ ]:

Exercise. Use set_index, stack, and unstack to shape the dataframe popi into popt.


In [16]:
popi


Out[16]:
Age 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
Year
1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 4429.172 3959.115 ... 2728.668 2277.718 1757.865 1271.333 674.869 362.356 0.000 0.000 0.000 0.000
2000 5858.083 5992.304 6534.552 7481.678 8394.685 9782.756 8684.281 8046.208 7741.920 8928.982 ... 8654.066 7673.926 7024.429 5825.320 4082.687 2540.718 1476.095 527.205 111.014 10.780
2050 4365.709 4454.049 4533.887 4644.123 4791.892 4986.758 5222.010 5419.444 5543.959 5727.320 ... 6065.917 6630.748 7010.576 7539.984 8225.455 6530.305 4766.516 2968.450 1523.374 441.117
2100 3662.422 3723.514 3791.893 3872.946 3960.466 4046.149 4121.987 4203.284 4294.733 4394.420 ... 4561.356 4613.669 4655.102 4687.750 4715.118 4656.733 4296.603 3419.217 2105.037 904.728

4 rows × 21 columns


In [17]:
popi.stack().unstack(level='Year')


Out[17]:
Year 1950 2000 2050 2100
Age
0-4 10997.606 5858.083 4365.709 3662.422
5-9 9448.307 5992.304 4454.049 3723.514
10-14 8634.342 6534.552 4533.887 3791.893
15-19 8484.002 7481.678 4644.123 3872.946
20-24 7636.338 8394.685 4791.892 3960.466
25-29 6089.337 9782.756 4986.758 4046.149
30-34 5104.356 8684.281 5222.010 4121.987
35-39 4989.952 8046.208 5419.444 4203.284
40-44 4429.172 7741.920 5543.959 4294.733
45-49 3959.115 8928.982 5727.320 4394.420
50-54 3354.134 10342.985 6019.799 4487.817
55-59 2728.668 8654.066 6065.917 4561.356
60-64 2277.718 7673.926 6630.748 4613.669
65-69 1757.865 7024.429 7010.576 4655.102
70-74 1271.333 5825.320 7539.984 4687.750
75-79 674.869 4082.687 8225.455 4715.118
80-84 362.356 2540.718 6530.305 4656.733
85-89 0.000 1476.095 4766.516 4296.603
90-94 0.000 527.205 2968.450 3419.217
95-99 0.000 111.014 1523.374 2105.037
100+ 0.000 10.780 441.117 904.728

In [18]:
list(range(1950, 2016, 5))


Out[18]:
[1950,
 1955,
 1960,
 1965,
 1970,
 1975,
 1980,
 1985,
 1990,
 1995,
 2000,
 2005,
 2010,
 2015]

In [17]:
countries = ['United States of America', 'Japan']
past      = list(range(1950, 2016, 5))
future    = list(range(2015, 2101, 5))

e_US_J = cleanpop(est, countries, past) 
p_US_J = cleanpop(prj, countries, future)

# fix top-coding in estimates
e_US_J['80-84'] = e_US_J['80-84'].fillna(0.0) + e_US_J['80+'].fillna(0.0)
e_US_J = e_US_J.drop(['80+'], axis=1)

In [18]:
e_US_J.head()


Out[18]:
Country Year 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
1134 Japan 1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 ... 2728.668 2277.718 1757.865 1271.333 674.869 362.356 NaN NaN NaN NaN
1135 Japan 1955 9204.626 10947.970 9426.911 8565.922 8327.713 7519.489 6022.648 5018.096 ... 3173.061 2477.335 1943.396 1381.819 868.574 503.311 NaN NaN NaN NaN
1136 Japan 1960 7751.169 9227.342 10919.008 9226.360 8256.058 8122.805 7427.547 5941.973 ... 3598.817 2901.413 2142.817 1543.870 946.170 663.662 NaN NaN NaN NaN
1137 Japan 1965 8105.794 7784.161 9195.272 10745.365 8981.876 8300.051 8170.033 7407.366 ... 3952.628 3302.217 2532.457 1727.221 1079.470 768.415 NaN NaN NaN NaN
1138 Japan 1970 8902.041 8196.696 7917.089 9180.009 10620.425 9061.923 8376.841 8187.951 ... 4396.458 3701.489 2964.635 2123.400 1265.408 939.598 NaN NaN NaN NaN

5 rows × 23 columns


In [19]:
p_US_J[p_US_J['Country']=='United States of America'].head()


Out[19]:
Country Year 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
3996 United States of America 2015 19701.274 20633.332 20642.609 20721.296 23288.826 21990.484 22032.293 19725.338 ... 22129.289 18966.991 16028.634 11427.606 8020.824 5937.915 3877.561 1735.717 477.443 71.972
3997 United States of America 2020 20672.623 20095.221 21202.638 21287.870 21339.101 23830.250 22420.925 22298.257 ... 22405.536 21326.390 17960.822 14732.673 9977.856 6399.083 4025.248 1964.662 576.103 95.543
3998 United States of America 2025 21282.091 21067.227 20665.959 21849.693 21910.763 21898.651 24263.160 22696.822 ... 19949.374 21656.874 20266.623 16594.890 12955.986 8050.796 4400.099 2080.972 669.505 119.579
3999 United States of America 2030 21466.822 21677.522 21638.496 21315.887 22477.288 22477.228 22350.565 24541.701 ... 20124.704 19347.657 20659.611 18818.075 14702.378 10560.532 5624.507 2321.690 727.774 143.984
4000 United States of America 2035 21575.393 21863.314 22249.474 22289.550 21950.247 23050.112 22936.418 22649.523 ... 19296.169 19572.949 18530.664 19278.754 16788.244 12107.722 7482.957 3035.847 833.460 163.028

5 rows × 23 columns


In [21]:
pop_US_J = pd.concat([e_US_J, p_US_J], axis=0)#.fillna(0.0)
pop_US_J.shape


Out[21]:
(64, 23)

In [22]:
pop_US_J


Out[22]:
Country Year 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 ... 55-59 60-64 65-69 70-74 75-79 80-84 85-89 90-94 95-99 100+
1134 Japan 1950 10997.606 9448.307 8634.342 8484.002 7636.338 6089.337 5104.356 4989.952 ... 2728.668 2277.718 1757.865 1271.333 674.869 362.356 NaN NaN NaN NaN
1135 Japan 1955 9204.626 10947.970 9426.911 8565.922 8327.713 7519.489 6022.648 5018.096 ... 3173.061 2477.335 1943.396 1381.819 868.574 503.311 NaN NaN NaN NaN
1136 Japan 1960 7751.169 9227.342 10919.008 9226.360 8256.058 8122.805 7427.547 5941.973 ... 3598.817 2901.413 2142.817 1543.870 946.170 663.662 NaN NaN NaN NaN
1137 Japan 1965 8105.794 7784.161 9195.272 10745.365 8981.876 8300.051 8170.033 7407.366 ... 3952.628 3302.217 2532.457 1727.221 1079.470 768.415 NaN NaN NaN NaN
1138 Japan 1970 8902.041 8196.696 7917.089 9180.009 10620.425 9061.923 8376.841 8187.951 ... 4396.458 3701.489 2964.635 2123.400 1265.408 939.598 NaN NaN NaN NaN
1139 Japan 1975 9913.633 8840.587 8210.002 7887.167 9085.397 10694.082 9164.572 8359.107 ... 4581.554 4229.693 3401.963 2538.406 1617.769 1175.370 NaN NaN NaN NaN
1140 Japan 1980 8425.462 10023.593 8861.847 8199.056 7778.115 9054.658 10669.960 9119.977 ... 5523.713 4375.630 3912.158 2978.445 2003.147 1591.386 NaN NaN NaN NaN
1141 Japan 1985 7406.792 8504.840 10033.840 8884.920 8132.473 7766.015 9070.608 10638.236 ... 6915.436 5319.246 4108.403 3514.634 2453.968 2160.195 NaN NaN NaN NaN
1142 Japan 1990 6430.434 7460.626 8498.302 9995.297 8702.295 7997.355 7726.074 9015.470 ... 7641.491 6662.651 5021.324 3738.316 2975.016 1779.613 816.016 237.003 36.451 2.821
1143 Japan 1995 5970.705 6528.210 7471.766 8530.933 9884.930 8694.585 8056.326 7763.030 ... 7891.922 7392.964 6317.749 4619.441 3216.916 2251.198 1087.343 354.562 64.617 5.268
1144 Japan 2000 5858.083 5992.304 6534.552 7481.678 8394.685 9782.756 8684.281 8046.208 ... 8654.066 7673.926 7024.429 5825.320 4082.687 2540.718 1476.095 527.205 111.014 10.780
1145 Japan 2005 5639.483 5877.963 6010.086 6582.365 7549.657 8470.903 9789.363 8705.296 ... 10101.564 8453.822 7351.925 6540.869 5171.887 3343.808 1777.136 799.344 193.667 22.548
1146 Japan 2010 5393.341 5598.905 5928.416 6055.624 6795.214 7425.867 8338.007 9711.312 ... 8754.481 9842.167 8260.983 6968.981 5927.025 4332.446 2393.791 987.060 299.738 39.939
1147 Japan 2015 5269.038 5398.973 5603.638 5960.784 6111.768 6843.421 7455.687 8345.753 ... 7535.334 8530.749 9452.518 7770.410 6297.532 4940.325 3115.732 1338.774 366.082 60.630
3108 United States of America 1950 17236.888 13784.766 11574.070 11217.036 12249.727 12872.040 12143.469 11562.518 ... 7634.709 6685.258 5485.444 3540.719 2215.991 1800.625 NaN NaN NaN NaN
3109 United States of America 1955 19269.975 17206.424 13955.143 11400.351 10908.524 12257.490 12885.021 12111.959 ... 8090.882 7275.168 5968.272 4255.662 2646.757 2198.321 NaN NaN NaN NaN
3110 United States of America 1960 20744.261 19295.372 17308.246 13721.545 11539.773 11444.456 12427.768 12896.664 ... 8530.910 7605.518 6399.870 4736.921 3265.623 2625.181 NaN NaN NaN NaN
3111 United States of America 1965 19791.210 20884.622 19454.907 17345.691 13906.573 11851.783 11622.448 12421.932 ... 10032.254 7705.239 6733.457 5334.653 3672.199 3175.375 NaN NaN NaN NaN
3112 United States of America 1970 17667.442 20348.013 21053.385 19491.649 16833.698 14110.517 12087.959 11587.576 ... 10372.082 9075.167 6947.780 5583.059 4060.962 3819.403 NaN NaN NaN NaN
3113 United States of America 1975 16479.229 17837.717 20553.731 21207.870 19694.953 17449.937 14422.578 12038.360 ... 10866.342 9545.891 8266.448 5663.315 4372.948 4627.202 NaN NaN NaN NaN
3114 United States of America 1980 16644.732 16801.933 18679.477 21302.278 21613.544 19949.953 17737.268 13966.040 ... 11667.279 10250.369 8990.399 7061.010 4847.777 5173.612 NaN NaN NaN NaN
3115 United States of America 1985 17870.945 16891.274 17279.962 19169.899 21522.150 21936.011 20341.459 17649.541 ... 11327.519 10967.588 9482.496 7667.307 5722.555 5987.757 NaN NaN NaN NaN
3116 United States of America 1990 18933.119 18144.024 17536.018 18071.118 19613.498 21541.111 22306.806 20320.000 ... 10574.797 10747.041 10152.695 8161.049 6279.346 4056.324 2005.257 756.500 219.117 32.143
3117 United States of America 1995 19446.403 19291.488 19170.677 18570.164 18440.286 19965.336 22507.836 22568.279 ... 11306.811 10093.377 9951.941 8813.615 6716.088 4357.573 2488.427 921.706 238.527 43.766
3118 United States of America 2000 19247.173 20360.577 20531.685 20282.813 19387.432 19525.378 20883.895 22892.692 ... 13383.754 10991.812 9511.066 8836.743 7399.257 5000.234 2654.098 1127.040 273.762 45.588
3119 United States of America 2005 19897.945 19460.301 21137.730 21447.948 21022.391 19793.433 19969.282 21154.292 ... 17278.315 12958.931 10405.929 8592.351 7437.681 5399.245 3090.386 1188.947 324.782 47.705
3120 United States of America 2010 20177.861 20321.997 20711.385 21838.942 21643.953 21199.737 20294.368 20299.432 ... 19860.709 16763.463 12340.665 9442.827 7305.152 5817.563 3510.353 1476.753 368.354 60.754
3121 United States of America 2015 19701.274 20633.332 20642.609 20721.296 23288.826 21990.484 22032.293 19725.338 ... 22129.289 18966.991 16028.634 11427.606 8020.824 5937.915 3877.561 1735.717 477.443 71.972
1458 Japan 2015 5269.038 5398.973 5603.638 5960.784 6111.768 6843.421 7455.687 8345.753 ... 7535.334 8530.749 9452.518 7770.410 6297.532 4940.325 3115.732 1338.774 366.082 60.630
1459                     Japan  2020   5082.934   5272.586  ...    518.622    79.523
 ...                       ...   ...        ...        ...  ...        ...       ...
4013  United States of America  2100  24276.043  24483.767  ...   5085.079  1739.648

[output truncated — pop_US_J: population by 5-year age group, one row per Country–Year pair for Japan and the United States of America, 1950–2100 in 5-year steps]

64 rows × 23 columns


In [23]:
pop_i = pop_US_J.set_index(['Country', 'Year'])

In [24]:
pop_i.index


Out[24]:
MultiIndex(levels=[['Japan', 'United States of America'], [1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020, 2025, 2030, 2035, 2040, 2045, 2050, 2055, 2060, 2065, 2070, 2075, 2080, 2085, 2090, 2095, 2100]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]],
           names=['Country', 'Year'])
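To see what `set_index` with a list of columns does, here is a minimal sketch on a toy frame (the column names mirror the population data; the numbers are illustrative, taken loosely from the head of the table above):

```python
import pandas as pd

# toy frame mimicking pop_US_J: one row per (Country, Year) pair
df = pd.DataFrame({'Country': ['Japan', 'Japan', 'US', 'US'],
                   'Year':    [1950, 1955, 1950, 1955],
                   '0-4':     [10997.6, 9204.6, 16316.0, 18290.0]})

# two-level (hierarchical) row index, just like pop_i above
dfi = df.set_index(['Country', 'Year'])
print(list(dfi.index.names))              # ['Country', 'Year']
print(dfi.loc[('Japan', 1950), '0-4'])    # 10997.6
```

The payoff of the MultiIndex is the tuple-based lookup in the last line: one `.loc` call pins down both levels at once.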

In [27]:
pop_i.columns.name = 'Age'

In [28]:
pop_st = pop_i.stack()

In [29]:
pop_st.head()


Out[29]:
Country  Year  Age  
Japan    1950  0-4      10997.606
               5-9       9448.307
               10-14     8634.342
               15-19     8484.002
               20-24     7636.338
dtype: float64
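What just happened: `stack` rotated the column labels into a new inner index level, turning the frame into a Series. A minimal sketch on a toy frame (illustrative numbers) makes the mechanics visible:

```python
import pandas as pd

df = pd.DataFrame({'0-4': [10997.6, 16316.0], '5-9': [9448.3, 13627.0]},
                  index=pd.Index(['Japan', 'US'], name='Country'))
df.columns.name = 'Age'   # name the column index so the new level gets a name

s = df.stack()            # Series indexed by (Country, Age)
print(list(s.index.names))        # ['Country', 'Age']
print(s.loc[('Japan', '5-9')])    # 9448.3
```

Naming the column index first (as we did with `pop_i.columns.name = 'Age'`) is what makes the stacked level show up as `Age` rather than `None`.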

In [30]:
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
# put Age first so we can pick one age group, then one country, with plain []'s
pop_st.reorder_levels([2, 0, 1])['5-9']['United States of America'].plot(ax=ax[0], kind='line')
pop_st.reorder_levels([2, 0, 1])['5-9']['Japan'].plot(ax=ax[1], kind='line')
plt.show()
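The `reorder_levels` trick deserves a closer look: it permutes the index levels without touching the data, so the level we want to select on comes first. A minimal sketch with made-up data (sorting the index afterwards keeps partial lookups fast and warning-free):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['Japan', 'US'], [1950, 1955], ['0-4', '5-9']],
                                 names=['Country', 'Year', 'Age'])
s = pd.Series(range(8), index=idx)

# move Age to the front: (Country, Year, Age) -> (Age, Country, Year)
s2 = s.reorder_levels([2, 0, 1]).sort_index()
print(list(s2.index.names))     # ['Age', 'Country', 'Year']
print(list(s2['0-4']['Japan'])) # the '0-4' values for Japan, by Year
```

With `Age` outermost, `s2['0-4']` selects one age group across all countries and years, and chaining `['Japan']` narrows it to one country — exactly the pattern used in the plotting cell above.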



In [38]:
pop_st.loc[('Japan', 2100, slice(None))].plot(kind='bar')   # Japan's projected age distribution in 2100
plt.show()



In [35]:
pop_st.loc[('Japan', slice(None), '0-4')]


Out[35]:
Year
1950    10997.606
1955     9204.626
1960     7751.169
1965     8105.794
1970     8902.041
1975     9913.633
1980     8425.462
1985     7406.792
1990     6430.434
1995     5970.705
2000     5858.083
2005     5639.483
2010     5393.341
2015     5269.038
2015     5269.038
2020     5082.934
2025     4872.732
2030     4715.896
2035     4610.562
2040     4524.468
2045     4448.702
2050     4365.709
2055     4271.907
2060     4178.608
2065     4098.930
2070     4033.949
2075     3979.032
2080     3917.939
2085     3854.440
2090     3786.347
2095     3720.541
2100     3662.422
dtype: float64
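An alternative way to take this kind of cross-section — fixing some index levels and keeping the rest — is `xs`, which lets us name the levels we are pinning down and drops them from the result. A minimal sketch with illustrative numbers:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['Japan', 'US'], [1950, 1955], ['0-4', '5-9']],
                                 names=['Country', 'Year', 'Age'])
s = pd.Series([10997.6, 9448.3, 9204.6, 9361.0,
               16316.0, 13627.0, 18290.0, 16232.0], index=idx)

# fix Country and Age, keep all Years; the fixed levels are dropped
cs = s.xs(('Japan', '0-4'), level=('Country', 'Age'))
print(list(cs.index))   # [1950, 1955]
print(list(cs))         # [10997.6, 9204.6]
```

Because `xs` names its levels explicitly, it is often easier to read than a `.loc` call padded with `slice(None)` placeholders.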

Resources

The pandas docs are OK, but we prefer the Data Carpentry guide. For the hierarchical-indexing tools used here (set_index, stack, reorder_levels, xs), see the "MultiIndex / advanced indexing" section of the pandas documentation.


In [ ]: