Sometimes we need to combine data from two or more dataframes. That's colloquially known as a merge or a join. There are lots of ways to do this; we cover a couple here and supply references to more at the end.
Along the way we take an extended detour to review methods for downloading and unzipping compressed files. The tools we use here have a broad range of other applications, including web scraping.
Outline: we download and unzip the MovieLens movie-rating data, merge the ratings and movies dataframes with the merge function, and then combine the UN's population estimates and projections with concat. Note: requires internet access to run.
This IPython notebook was created by Dave Backus, Chase Coleman, Brian LeBlanc, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [1]:
%matplotlib inline
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics
import datetime as dt # date tools, used to note current date
# these are new
import os # operating system tools (check files)
import requests, io # internet and input tools
import zipfile as zf # zip file tools
import shutil # file management tools
The data comes as a zip file that contains several csv's. We get the details from the README inside. (It's written in Markdown, so it's easier to read if we use a browser to format it. Or we could cut and paste into a Markdown cell in an IPython notebook.)
The file descriptions are:
ratings.csv: each line is an individual film rating with the rater and movie id's and the rating. Order: userId, movieId, rating, timestamp.
tags.csv: each line is a tag on a specific film. Order: userId, movieId, tag, timestamp.
movies.csv: each line is a movie name, its id, and its genres. Order: movieId, title, genres. Multiple genres are separated by "pipes" (|).
links.csv: each line contains the movie id and the corresponding id's at IMDb and TMDb.
The easy way to input this data is to download the zip file onto our computer, unzip it, and read the individual csv files using read_csv(). But anyone can do it the easy way. We want to automate this, so we can redo it without any manual steps. This takes some effort, but once we have it down we can apply it to lots of other data sources.
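For concreteness, here is a minimal sketch of the easy way, assuming the zip file has already been downloaded and extracted into the working directory (the path matches the contents of the zip):
In [ ]:
# the "easy way": read one of the unzipped csv's from disk
# (assumes ml-latest-small/ has been extracted into the current directory)
ratings_easy = pd.read_csv('ml-latest-small/ratings.csv')
ratings_easy.head()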
We're looking for an automated way, so that if we do this again, possibly with updated data, the whole process is in our code. Automated data entry involves these steps:
Download the file. The requests package does this for us. It happens automatically when we apply read_csv to a url, but here we need to do it ourselves. The package authors add: Recreational use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.
Convert the contents to a file. requests gives us the contents as a string of bytes; io.BytesIO reconstructs it as a file, here a zip file, which we open with the zipfile package.
Read the individual csv's with read_csv as usual. We found this Stack Overflow exchange helpful.
Digression. This is probably more than you want to know, but it's a reminder of what goes on behind the scenes when we apply read_csv to a url. Here we grab whatever is at the url. Then we get its contents, convert it to bytes, identify it as a zip file, and read its components using read_csv. It's a lot easier when this happens automatically, but it's a reminder of what's involved if we ever have to look into the details.
In [ ]:
# get "response" from url
url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
r = requests.get(url)
# describe response
print('Response status code:', r.status_code)
print('Response type:', type(r))
print('Response .content:', type(r.content))
print('Response headers:\n', r.headers, sep='')
In [ ]:
# convert bytes to zip file
mlz = zf.ZipFile(io.BytesIO(r.content))
print('Type of zipfile object:', type(mlz))
In [ ]:
# what's in the zip file?
mlz.namelist()
In [ ]:
# open one of the csv's inside the zip (a file-like object, not yet a dataframe)
mlz.open('ml-latest-small/links.csv')
In [ ]:
# pass the file-like object to read_csv to get a dataframe
pd.read_csv(mlz.open('ml-latest-small/links.csv'))
In [ ]:
# extract and read csv's
# positions in namelist() chosen so we get movies.csv and ratings.csv (check the listing above)
movies = pd.read_csv(mlz.open(mlz.namelist()[2]))
ratings = pd.read_csv(mlz.open(mlz.namelist()[3]))
In [ ]:
# what do we have?
for df in [movies, ratings]:
    print('Type:', type(df))
    print('Dimensions:', df.shape, '\n')
    print('Variables:', df.columns.tolist(), '\n')
    print('First few rows', df.head(3), '\n')
In [ ]:
Exercise. Something to do together: suppose we wanted to save the files on our computer. How would we do it? Would we prefer individual csv's or a single zip?
In [ ]:
# writing csv (specify different location)
with open('test_01.csv', 'wb') as out_file:
    shutil.copyfileobj(mlz.open(mlz.namelist()[2]), out_file)
In [ ]:
# experiment via http://stackoverflow.com/a/18043472/804513
with open('test.zip', 'wb') as out_file:
    shutil.copyfileobj(io.BytesIO(r.content), out_file)
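Another option, not shown in the cells above, is the extractall method of ZipFile, which writes every member of the archive to disk in one call. A minimal sketch (the target directory name is our choice):
In [ ]:
# extract the full archive (including its ml-latest-small/ folder) into a local directory
# ('ml_data' is an arbitrary name we chose)
mlz.extractall('ml_data')
os.listdir('ml_data')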
The movie ratings in the dataframe ratings give us individual opinions about movies, but they don't include the name of the movie. Why not? Rather than include the name every time a movie is rated, the MovieLens data associates each rating with a movie code, then stores the names of movies associated with each movie code in the dataframe movies. We run across this a lot: some information is in one data table, other information is in another.
What we want, therefore, is to add the movie name to the ratings dataframe. We say we merge the two dataframes. There are lots of ways to merge. Here we do one as an illustration.
Let's start by reminding ourselves what we have.
In [ ]:
ratings.head(3)
In [ ]:
movies.head(3)
In [ ]:
combo = pd.merge(ratings, movies,   # left and right df's
                 how='left',        # add to left
                 on='movieId')      # link with this variable/column
print('Dimensions of ratings:', ratings.shape)
print('Dimensions of movies:', movies.shape)
print('Dimensions of new df:', combo.shape)
combo.head(20)
In [ ]:
# same merge using the DataFrame method rather than the pd.merge function
combo_1 = ratings.merge(movies, how='left', on='movieId')
In [ ]:
combo_1.head()
In [ ]:
# inner join: keep only movieId's that appear in both dataframes
combo_2 = ratings.merge(movies, how='inner', on='movieId')
combo_2.shape
In [ ]:
# right join with the dataframes switched: keeps every rating, like the left join above
combo_3 = movies.merge(ratings, how='right', on='movieId')
combo_3.shape
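To see how the joins differ, one option (not used in the original cells) is the indicator argument of merge, which adds a _merge column recording whether each row came from the left dataframe, the right one, or both:
In [ ]:
# indicator=True adds a '_merge' column ('left_only', 'right_only', or 'both')
check = ratings.merge(movies, how='left', on='movieId', indicator=True)
check['_merge'].value_counts()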
In [ ]:
# save as csv file for future use
combo.to_csv('mlcombined.csv')
In [ ]:
# does each movieId in movies appear in ratings? (boolean Series)
count_2 = movies['movieId'].isin(ratings['movieId'])
In [ ]:
count_2.sum()
In [ ]:
print('Current directory:\n', os.getcwd(), sep='')
print('List of files:', os.listdir(), sep='\n')
In [ ]:
Exercise. Some of these we know how to do, the others we don't. For the ones we know, what is the answer? For the others, what (in loose terms) do we need to be able to do to come up with an answer?
In [ ]:
combo['rating'].mean()
In [ ]:
fig, ax = plt.subplots()
bins = [bin/100 for bin in list(range(25, 575, 50))]
print(bins)
combo['rating'].plot(kind='hist', ax=ax, bins=bins, color='blue', alpha=0.5)
ax.set_xlim(0,5.5)
ax.set_ylabel('Number')
ax.set_xlabel('Rating')
plt.show()
In [ ]:
from plotly.offline import iplot # plotting functions
import plotly.graph_objs as go # ditto
import plotly
In [ ]:
plotly.offline.init_notebook_mode(connected=True)
In [ ]:
trace = go.Histogram(
x=combo['rating'],
histnorm='count',
name='control',
autobinx=False,
xbins=dict(
start=.5,
end=5.0,
size=0.5
),
marker=dict(
color='Blue',
),
opacity=0.75
)
layout = go.Layout(
title='Distribution of ratings',
xaxis=dict(
title='Rating value'
),
yaxis=dict(
title='Count'
),
bargap=0.01,
bargroupgap=0.1
)
iplot(go.Figure(data=[trace], layout=layout))
In [ ]:
combo[combo['movieId']==31]['rating'].mean()
In [ ]:
# average rating for each movieId
ave_mov = combo['rating'].groupby(combo['movieId']).mean()
In [ ]:
ave_mov = ave_mov.reset_index()
In [ ]:
ave_mov = ave_mov.rename(columns={"rating": "average rating"})
In [ ]:
combo2 = combo.merge(ave_mov, how='left', on='movieId')
In [ ]:
combo2.shape
In [ ]:
combo2.head(3)
In [ ]:
# transform broadcasts each movie's mean rating back to every row
combo['ave'] = combo['rating'].groupby(combo['movieId']).transform('mean')
In [ ]:
combo.head()
In [ ]:
combo2[combo2['movieId']==1129]
In [ ]:
# number of ratings for each movieId
combo['count'] = combo['rating'].groupby(combo['movieId']).transform('count')
In [ ]:
combo.head()
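The merge-based combo2 and the transform-based combo should agree; a quick hedged check comparing the two per-movie mean columns:
In [ ]:
# 'average rating' (from the merge) and 'ave' (from transform) are computed
# the same way, so their values should match row by row
(combo2['average rating'].values == combo['ave'].values).all()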
We look (again) at the UN's population data, specifically the age distribution of the population. The data comes in two sheets: estimates that cover the period 1950-2015 and projections that cover 2016-2100. Our mission is to combine them.
In [2]:
url1 = 'http://esa.un.org/unpd/wpp/DVD/Files/'
url2 = '1_Indicators%20(Standard)/EXCEL_FILES/1_Population/'
url3 = 'WPP2015_POP_F07_1_POPULATION_BY_AGE_BOTH_SEXES.XLS'
url = url1 + url2 + url3
cols = [2, 5] + list(range(6,28))
est = pd.read_excel(url, sheetname=0, skiprows=16, parse_cols=cols, na_values=['…'])
prj = pd.read_excel(url, sheetname=1, skiprows=16, parse_cols=cols, na_values=['…'])
print('Dimensions and dtypes of estimates: ', est.shape, '\n', est.dtypes.head(), sep='')
print('\nDimensions and dtypes of projections: ', prj.shape, '\n', prj.dtypes.head(), sep='')
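Note that the keyword arguments above reflect the pandas version current when this notebook was written. In more recent pandas the same call uses sheet_name and usecols; a hedged sketch of the equivalent (left commented so it doesn't re-download):
In [ ]:
# equivalent call for newer pandas versions: sheetname -> sheet_name, parse_cols -> usecols
# est = pd.read_excel(url, sheet_name=0, skiprows=16, usecols=cols, na_values=['…'])
# prj = pd.read_excel(url, sheet_name=1, skiprows=16, usecols=cols, na_values=['…'])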
In [3]:
est.to_csv('un_pop_est.csv')
prj.to_csv('un_pop_proj.csv')
Comment. Note that they have different numbers of columns. Let's see where that comes from.
In [4]:
list(est)[15:]
Out[4]:
In [5]:
list(prj)[15:]
Out[5]:
In [6]:
est.head()
Out[6]:
In [7]:
prj.head()
Out[7]:
Pick a usable subset and fix the extra column so that we can combine them. The problem here is that until 1990, the highest age category was '80+'. From 1990 on, we have a finer breakdown.
We fix this by reassigning '80+' to '80-84' and not worrying that some of these people are 85 or older. Note that df.fillna(0.0) replaces missing values with zeros.
In [8]:
def cleanpop(df, countries, years):
    """
    take df as input and select countries and years
    """
    # rename first two columns
    names = list(df)
    df = df.rename(columns={names[0]: 'Country', names[1]: 'Year'})
    # select countries and years
    newdf = df[df['Country'].isin(countries) & df['Year'].isin(years)]
    return newdf

countries = ['Japan']
past = [1950, 2000]
future = [2050, 2100]
e = cleanpop(est, countries, past)
p = cleanpop(prj, countries, future)
# make copies for later use
ealt = e.copy()
palt = p.copy()
In [9]:
# fix top-coding in estimates
e['80-84'] = e['80-84'].fillna(0.0) + e['80+'].fillna(0.0)
e = e.drop(['80+'], axis=1)
# check dimensions again
print('Dimensions of cleaned estimates: ', e.shape)
print('Dimensions of cleaned projections: ', p.shape)
In [10]:
# check to see if we have the same variables
list(e) == list(p)
Out[10]:
In [11]:
ealt.head()
Out[11]:
In [11]:
e.head()
Out[11]:
In [12]:
pop = pd.concat([e, p], axis=0).fillna(0.0)
pop
Out[12]:
Exercise. What happens if we try to merge the original dataframes, including the one with the extra 80+
column? Run the code below and comment on what you get.
In [13]:
popalt = pd.concat([ealt, palt], axis=0)
popalt
Out[13]:
In [13]:
pop = pop.drop('Country', axis=1)
popi = pop.set_index('Year')
popi
Out[13]:
In [14]:
popi.columns.name = 'Age'
popt = popi.T
popt.head()
Out[14]:
In [15]:
ax = popt.plot(kind='bar', color='blue', alpha=0.5,
subplots=True,
sharey=True,
figsize=(8,12))
In [ ]:
Exercise. Use set_index, stack, and unstack to shape the dataframe popi into popt.
In [16]:
popi
Out[16]:
In [17]:
popi.stack().unstack(level='Year')
Out[17]:
In [18]:
list(range(1950, 2016, 5))
Out[18]:
In [17]:
countries = ['United States of America', 'Japan']
past = list(range(1950, 2016, 5))
future = list(range(2015, 2101, 5))
e_US_J = cleanpop(est, countries, past)
p_US_J = cleanpop(prj, countries, future)
# fix top-coding in estimates
e_US_J['80-84'] = e_US_J['80-84'].fillna(0.0) + e_US_J['80+'].fillna(0.0)
e_US_J = e_US_J.drop(['80+'], axis=1)
In [18]:
e_US_J.head()
Out[18]:
In [19]:
p_US_J[p_US_J['Country']=='United States of America'].head()
Out[19]:
In [21]:
pop_US_J = pd.concat([e_US_J, p_US_J], axis=0)#.fillna(0.0)
pop_US_J.shape
Out[21]:
In [22]:
pop_US_J
Out[22]:
In [23]:
pop_i = pop_US_J.set_index(['Country', 'Year'])
In [24]:
pop_i.index
Out[24]:
In [27]:
pop_i.columns.name = 'Age'
In [28]:
pop_st = pop_i.stack()
In [29]:
pop_st.head()
Out[29]:
In [30]:
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
pop_st.reorder_levels([2, 0, 1])['5-9']['United States of America'].plot(ax=ax[0], kind='line')
pop_st.reorder_levels([2, 0, 1])['5-9']['Japan'].plot(ax=ax[1], kind='line')
plt.show()
In [31]:
pop_st.head()
Out[31]:
In [38]:
pop_st.loc[('Japan', 2100, slice(None))].plot(kind='bar')
plt.show()
In [35]:
# .ix is deprecated in newer pandas; .loc makes the same selection
pop_st.loc[('Japan', slice(None), '0-4')]
Out[35]:
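A possible alternative to tuple-based indexing here, not used above, is the xs method, which selects by label at a named index level:
In [ ]:
# select the '0-4' age group across all countries and years by level name
pop_st.xs('0-4', level='Age')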
The Pandas docs are ok, but we prefer the Data Carpentry guide.
In [ ]: