Probably the best thing about Pandas is its extensive toolset for managing data. Here we describe features of Pandas that allow us to clean data that, for reasons beyond our control, comes in a form that's not immediately amenable to analysis. This is the first of several such notebooks.
Outline:

- Boolean selection. Choose observations for which a comparison is True. The next two methods extend this capability.
- isin method. Choose observations whose values are in lists you specify.
- contains method. Flag observations that contain a specific piece of text. Another string method; operates through Booleans.

Note: requires internet access to run.
This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [1]:
import sys # system module
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
%matplotlib inline
# check versions (overkill, but why not?)
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())
We need to know what we're trying to do -- what we want the data to look like. We say we apply the want operator.
Some problems we've run across that ask to be solved:
What we want in each case is the opposite of what we have: we want nicely formatted numbers, clean row and column labels, and so on.
We'll solve the first four problems here, the last one in the next notebook.
This data comes from a New York Times story about the number of calories in a typical order at Chipotle. The topic doesn't particularly excite us, but the data raises a number of issues that come up repeatedly. We adapt some code written by Daniel Forsyth.
In [2]:
url = 'https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv'
chp = pd.read_csv(url, sep='\t') # tab (\t) delimited
print('Variable dtypes:\n', chp.dtypes, sep='')
chp.head()
Out[2]:
Comment. Note that the variable item_price has dtype object. The reason is evidently the dollar sign. We'd prefer to have it as a number, specifically a float.
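As a quick preview of the fix (a minimal sketch on made-up prices, not the actual Chipotle data; we cover string methods properly later in this notebook):

```python
import pandas as pd

# toy version of the item_price column (these prices are made up)
prices = pd.Series(['$2.39', '$3.39', '$10.98'])

# strip the dollar sign with the .str accessor, then fix the dtype;
# regex=False treats '$' as a literal character, not a regex anchor
clean = prices.str.replace('$', '', regex=False).astype(float)
print(clean.dtype)
```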
In [ ]:
This is the poll we did at the start of the course. Responses were collected in a Google spreadsheet, which we converted to a csv and uploaded to our website.
In [3]:
pd.set_option("display.width", 80)
In [4]:
import pandas as pd
url1 = 'http://pages.stern.nyu.edu/~dbackus/Data/'
url2 = 'Data-Bootcamp-entry-poll_s16.csv'
url = url1 + url2
ep = pd.read_csv(url, header=0)
print('Dimensions:', ep.shape)
print('\nData types:\n', ep.dtypes, sep='')
ep.head(2)
Out[4]:
Comments. This is mostly text data, which means it's assigned the dtype object. Which is fine. But there are two things that would make the data easier to work with:
In [5]:
# rename variables
newnames = ['time', 'program', 'career', 'programming', 'stats', 'media',
'other', 'major', 'data', 'why', 'topics']
newnames = [name.title() for name in newnames]
ep.columns = newnames
ep.head()
Out[5]:
In [6]:
# check multi-response question to see what we're dealing with
ep['Media'].head(20)
Out[6]:
Comment. Note the commas separating answers with more than one choice. We want to unpack them somehow.
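One possible way to unpack such multi-response strings is the str.get_dummies method; here is a hedged sketch on made-up responses (later we take a different route, via the contains method):

```python
import pandas as pd

# made-up responses in the style of the Media column
media = pd.Series(['None', 'Twitter, Facebook', 'Facebook, Blog'])

# str.get_dummies splits each entry on the separator and builds
# one 0/1 indicator column per distinct answer
unpacked = media.str.get_dummies(sep=', ')
print(unpacked)
```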
In [ ]:
In [ ]:
The OECD collects healthcare data on lots of (mostly rich) countries, which is helpful in producing comparisons. Here we use a spreadsheet linked in one of their documents.
In [7]:
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2015-Frequently-Requested-Data.xls'
docs = pd.read_excel(url1+url2,
                     skiprows=3,
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheet_name='Physicians',  # called sheetname in older Pandas
                     # na_values=['..'],
                     skipfooter=21)  # called skip_footer in older Pandas
print('Dimensions:', docs.shape)
print('\nIndex', docs.index.tolist(), sep='')
print('\nVariable dtypes:\n', docs.dtypes.tail(8), sep='')
docs.head()
Out[7]:
Comments. Here we have a couple of issues.

- The country names have stray numbers attached: Australia 1, Chile 3, etc. We care about this because when we plot the data across countries, the country labels are going to be country names, so we want them in a better form than this.
- The ..'s in the sheet lead us to label any column that includes them as dtype object. Here we want to label them as missing values.

One more thing before we proceed: change the name of the country variable.
In [8]:
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
docs.head(2)
Out[8]:
In [ ]:
The IMF's World Economic Outlook database contains a broad range of macroeconomic data for a large number of countries. It's updated twice a year and is a go-to source for things like current account balances (roughly, the trade balance) and government debt and deficits. It also has a few quirks, as we'll see.
Example. Run the following code as is, and then with the thousands and na_values parameters commented out. How do the dtypes differ?
In [9]:
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2
weo = pd.read_csv(url, sep='\t',
usecols=[1,2,3,4,6,40,41,42,43,44],
thousands=',',
na_values=['n/a', '--']
)
print('Variable dtypes:\n', weo.dtypes, sep='')
weo.head()
Out[9]:
Comment. This has several issues:
read_csv
statement, the data columns (1980, 1981, etc) have dtype object. A little work suggests that this is because they include commas marking thousands. n/a
need to be marked as missing values.We can solve the last two in the read_csv
function by deleting the hash -- which is what we see in the second read_csv
statement. The other one takes some work.
Question. Can we transpose the whole thing to get the data running down columns?
In [10]:
weo.T.head(10)
Out[10]:
In [ ]:
In [11]:
dollars = '$123.45'
print('Type of variable dollars:', type(dollars))
num = dollars.replace('$', '')
num = float(num)
print('Type of variable num:', type(num))
In [ ]:
Pandas string methods. We can do the same thing to all the observations of a variable with so-called string methods. We append .str to a variable in a dataframe and then apply the string method of our choice. If this is part of converting a number-like entry that has mistakenly been given dtype object, we then convert its dtype with the astype method.
Example. Let's use a string method to fix the item_price variable in the Chipotle dataframe. This has three parts:

- Use str to identify this as a string method.
- Apply the string method of our choice (here replace) to fix the string.
- Use the astype method to convert the fixed-up string to a float.

We start by making a copy of the chp dataframe that we can experiment with.
In [12]:
chp.head()
Out[12]:
In [13]:
# create a copy of the df to play with
chpnum = chp.copy()
print('Original dtype:', chpnum['item_price'].dtype)
In [14]:
# delete dollar signs (regex=False treats '$' as a literal character)
chpnum['item_price'].str.replace('$', '', regex=False).head()
Out[14]:
In [15]:
# delete dollar signs, convert to float, and assign back to chpnum
chpnum['item_price'] = chpnum['item_price'].str.replace('$', '', regex=False).astype(float)
print('New dtype:', chpnum['item_price'].dtype)
In [16]:
# assign back to chp for future use
chp = chpnum
Comment. We did everything here in one line: replaced the dollar sign with a string method, then converted to float using astype. If you think this is too dense, you might break it into two steps.
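Here is what the two-step version might look like, sketched on a toy series with made-up prices rather than the chp dataframe:

```python
import pandas as pd

# made-up prices, same shape as item_price before the fix
prices = pd.Series(['$2.39', '$10.98'])

# step one: the string method removes the dollar sign (dtype is still object)
stripped = prices.str.replace('$', '', regex=False)

# step two: astype converts the cleaned-up strings to floats
as_floats = stripped.astype(float)
print(stripped.dtype, as_floats.dtype)
```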
In [ ]:
Example. Here we strip off the numbers at the end of the indexes in the OECD docs dataframe. This involves some experimentation:

- Check the documentation of the rsplit method to see how it works.
- Apply rsplit to the example country = 'United States 1'.
- Apply it to the variable Country.
In [17]:
# try this with an example first
country = 'United States 1'
# get documentation for the rsplit method
#country.rsplit?
In [18]:
# an example
country.rsplit()
Out[18]:
Comment. Not quite: we only want to split once.
In [19]:
# what about this?
country.rsplit(maxsplit=1)
Out[19]:
In [20]:
# one more step, we want the first component of the list
country.rsplit(maxsplit=1)[0]
Out[20]:
In [21]:
docs["Country"].head()
Out[21]:
In [22]:
# now do this for the variable Country
#docs['Country'].str.rsplit(maxsplit=1).str[0].head() # explain why this doesn't work
docs['Country'].str.rsplit(n=1).str[0].head()
Out[22]:
In [23]:
# Spencer prefers the get method to slicing
docs['Country'].str.rsplit(n=1).str.get(0).head()
Out[23]:
In [24]:
# now assign it to newdocs and see what we have
newdocs = docs.copy()
newdocs['Country'] = newdocs['Country'].str.rsplit(n=1).str.get(0)
newdocs.head()
Out[24]:
In [25]:
# assign it back to docs for future use
docs = newdocs
Comments.

- We need two str's here: one to do the split, the other to extract the first element.
- The plain string method takes the parameter maxsplit=1, but the Pandas string method works with n=1.
In [ ]:
It's important to label missing values, so that Pandas doesn't interpret entries as strings. Pandas is also smart enough to ignore things labeled missing when it does calculations or graphs. If we compute, for example, the mean of a variable, the default is to ignore missing values.
We've seen that we can label certain entries as missing values in read statements: read_csv, read_excel, and so on. Here we do it directly, mostly to remind ourselves what's involved.
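A minimal illustration of the point about calculations, using a made-up series: the mean skips entries marked NaN.

```python
import pandas as pd
import numpy as np

# a made-up series with one missing value
x = pd.Series([1.0, np.nan, 3.0])

# mean ignores the NaN by default: (1 + 3) / 2
print(x.mean())
```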
Example. The docs dataframe contains a number of instances of .. (double period). How can we mark them as missing values?
In [26]:
docs = newdocs
docs.head()
Out[26]:
What to do. We use the replace method on the whole dataframe. To mark something as missing, we replace it with None, which Pandas interprets as missing and labels NaN.
In [27]:
docs.replace(to_replace=['..'], value=[None]).head()
Out[27]:
Comment. Replace automatically updates the dtypes. Here the double dots led us to label the variables as objects. After the replace, they're now floats, as they should be.
In [28]:
docs.dtypes.head()
Out[28]:
In [29]:
docsna = docs.replace(to_replace=['..'], value=[None])
docsna.dtypes.head()
Out[29]:
Comment. Some people prefer to use numpy's nan. Here's an example. The only advantage is that we avoid possible conflicts with other uses of the value None.
In [30]:
docs.replace(to_replace=['..'], value=[np.nan]).head()
Out[30]:
In [31]:
# assign back to docs
docs = docs.replace(to_replace=['..'], value=[np.nan])
Comment. Unlike the string methods we described earlier, this use of replace affects complete entries, not elements of string entries. For example, suppose we tried to replace the periods in decimal numbers with an asterisk. We could try the following, but it doesn't work: the decimal numbers don't change.
In [32]:
docs.replace(to_replace=['.'], value=['*']).head()
Out[32]:
In [ ]:
In [33]:
# grab a variable to play with
var = docsna[2013].head(10)
var
Out[33]:
In [34]:
# which ones are missing ("null")?
var.isnull()
Out[34]:
In [35]:
# which ones are not missing ("not null")?
var.notnull()
Out[35]:
In [36]:
# drop the missing
var.dropna()
Out[36]:
Comment. We usually don't have to worry about this; Pandas takes care of missing values automatically.
In [ ]:
Comment. Let's try a picture to give us a feeling of accomplishment. What else would you say we need? How would we get it?
In [37]:
docs[2013].plot.barh(figsize=(4, 12))
Out[37]:
In [ ]:
The word selection refers to choosing a subset of variables or observations using their labels or index. Similar methods are sometimes referred to as slicing, subsetting, indexing, or filtering. We'll treat the terms as synonymous.
There are lots of ways to do this. Mostly we do "Boolean" selection, which we address in the next section. We review more direct options here, mostly at high speed because they're not things we use much.
In the outline below, df is a dataframe, var and varn are variable names, vlist = ['var1', 'var2'] is a list of variable names, nlist = [0, 3, 4] is a list of numerical variable or observation indexes, n1 and n2 are integers, and bools is a list or pandas Series of booleans (True and False). Some of the basic selection/indexing/slicing methods have the form:

- df[var] extracts a variable -- a series, in other words.
- df[vlist] extracts a new dataframe consisting of the variables in vlist. df[nlist] does the same thing with numerical indexes.
- df[bools] extracts each row where the corresponding element in bools is True. len(bools) must equal df.shape[0].
- df[n1:n2] extracts observations n1 to n2-1, the traditional slicing syntax.

We find the last one confusing: it extracts rows, not columns. Pandas guru Wes McKinney notes: "This might seem inconsistent to some readers." Yup! We don't do it much, partly for that reason.
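A toy dataframe (with made-up contents) illustrates each form:

```python
import pandas as pd

# a made-up dataframe to illustrate each form of selection
df = pd.DataFrame({'var1': [1, 2, 3, 4],
                   'var2': ['a', 'b', 'c', 'd'],
                   'var3': [10.0, 20.0, 30.0, 40.0]})

print(df['var1'])                       # one variable: a series
print(df[['var1', 'var3']])             # list of variables: a smaller dataframe
print(df[1:3])                          # slicing extracts rows 1 and 2
print(df[[True, False, True, False]])   # booleans keep rows where True
```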
The Pandas docs push the loc and iloc methods. We'll ignore them -- we don't use them much -- but if you're interested, see the docs.
In [38]:
# we create a small dataframe to experiment with
small = weo.head()
small
Out[38]:
Example. Let's try each of these in a different cell and see what they do:
small[['ISO', 'Units']]
small[[0, 4]]
small['2011']
small[1:3]
Can you explain the results?
In [39]:
small[['ISO', 'Units']]
Out[39]:
In [40]:
small[[0, 4]]
Out[40]:
In [41]:
small['2011']
Out[41]:
In [42]:
small['2011'][3]
Out[42]:
In [43]:
small[1:3]
Out[43]:
In [44]:
small[[False, True, True, False, False]]
Out[44]:
Indexing a Series is a little different because we only have one column, so all indexing operations interact with rows. The rules here are a little subtle, so we'll show examples and add comments that explain what each example does.

In the list below, s is a Series, n is an integer, nlist = [0, 3] is a list of integers, i is a string, and is is a list of strings.

- s[n]: if the index has dtype int, this extracts the row with index n. Otherwise it extracts the nth row (starting at zero).
- s[nlist]: if the index has dtype int, this extracts rows with indices in nlist, returning NaN if they don't appear. Otherwise it extracts the rows at positions in nlist, filling with NaN for invalid positions.
- s[i]: if the index has dtype object, this extracts the row with index i; otherwise it is an error.
- s[is]: similar behavior for a list of strings, as the examples below show.
In [53]:
s1 = pd.Series([5, 6, 7, 8], index=["a", "b", "c", "d"])
s1
Out[53]:
In [66]:
s2 = pd.Series([50, 60, 70, 80], index=[0, 4, 2, 999])
s2
Out[66]:
In [65]:
# index has dtype object, so using an int returns the value in that row (starting at 0)
s1[1]
Out[65]:
In [68]:
# index has dtype int, so using an integer tries to find that int in the
# index and return the corresponding value; it throws an error if it can't
s2[1]
In [69]:
s2[0] # no error, 0 is in the index
Out[69]:
In [71]:
# index has dtype object, so a list of ints extracts those rows
s1[[0, 3]]
Out[71]:
In [73]:
# index has dtype int, so a list of ints tries to match each int to the index
# it returns NaN where it can't find the index. Notice it **did not** return
# `80` for 3
s2[[0, 3, 999]]
Out[73]:
In [74]:
# index has type object, so a string finds row with matching index
s1["c"]
Out[74]:
In [76]:
# index has dtype int, so using a string causes an error
s2["c"]
In [78]:
# similar behavior for lists of strings
s1[["a", "b", "penguin"]]
Out[78]:
In [80]:
# index has dtype int, so list of strings returns NaN's everywhere
s2[["a", "b"]]
Out[80]:
In [81]:
# lists of True/False work the same for any dtype of index
bools = [True, False, False, True]
s1[bools]
Out[81]:
In [82]:
s2[bools]
Out[82]:
This is mostly what we do: we choose observations that satisfy one or more conditions. Boolean selection consists of two steps that we typically combine in one statement:

- Construct a comparison that is True or False for each observation.
- Use df[comparison], where df is a dataframe and comparison is a comparison. This selects the observations (rows) for which comparison is True and throws away the others.

We work through this one step at a time: comparisons that return True or False, then the isin method. All of this is easier to describe with an example.
In [46]:
weo.head()
Out[46]:
Find variable and country codes. Which ones do we want? Let's start by seeing what's available. Here we create special dataframes that include all the variables and their definitions and all the countries.
Note the use of the drop_duplicates
method, which does what it sounds like.
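A minimal sketch of drop_duplicates on made-up rows:

```python
import pandas as pd

# a made-up country list with repeated rows
pairs = pd.DataFrame({'ISO': ['ARG', 'ARG', 'USA'],
                      'Country': ['Argentina', 'Argentina', 'United States']})

# drop_duplicates keeps the first occurrence of each distinct row
print(pairs.drop_duplicates())
```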
In [47]:
variable_list = weo[['WEO Subject Code', 'Subject Descriptor', 'Units']].drop_duplicates()
print('Number of variables: ', variable_list.shape[0])
variable_list.head()
Out[47]:
In [48]:
country_list = weo[['ISO', 'Country']].drop_duplicates()
print('Number of countries: ', country_list.shape[0])
country_list.head()
Out[48]:
Exercise.

- Construct the dataframe countries = weo[['ISO', 'Country']]; that is, without applying the drop_duplicates method. How large is it? How many duplicates have we dropped?
- What are the country codes (ISO) for Argentina and the United States?
- What are the variable codes (WEO Subject Code) for government debt (gross debt, percent of GDP) and net lending/borrowing (also percent of GDP)?
In [ ]:
In [ ]:
Comment. Now that we have the country and variable codes, we can be more explicit about what we want. We want observations with those country and variable codes.
We work up to the solution one step at a time.
In [ ]:
We can construct comparisons for series (dataframe columns) much as we did with simple variables. The difference is that we get a complete column of True/False responses, not just one.
Multiple comparisons have a different syntax than we saw earlier: and is replaced by &, and or is replaced by |. And when we have more than one comparison, we need to enclose each of them in parentheses.
Examples. Consider the comparisons:
small['Units'] == 'National currency'
small['2011'] >= 100
(small['Units'] == 'National currency') & (small['2011'] >= 100)
(small['Units'] == 'National currency') | (small['2011'] >= 100)
Remind yourself what the &
and |
do.
In [49]:
small['Units'] == 'National currency'
Out[49]:
In [50]:
small['2011'] >= 100
Out[50]:
In [51]:
(small['Units'] == 'National currency') & (small['2011'] >= 100)
Out[51]:
In [52]:
(small['Units'] == 'National currency') | (small['2011'] >= 100)
Out[52]:
In [ ]:
Example. We choose observations for which the units are 'National currency'. We do this first in two steps, then in one.
In [69]:
# remind ourselves what we're starting with
small
Out[69]:
In [68]:
# two steps: comparison, then selection
ncunits = small['Units'] == 'National currency' # comparison
print(ncunits)
small[ncunits] # selection
Out[68]:
In [70]:
# put the steps together in one line
small[small['Units'] == 'National currency']
Out[70]:
In [ ]:
Exercise. Construct dataframes for which

- small['Units'] does not equal 'National currency'.
- small['Units'] equals 'National currency' and small['2011'] is greater than 100.
In [ ]:
In [ ]:
The isin method. Pay attention now, this is really useful. Suppose we want to extract the data for which weo['ISO'] == 'ARG' (Argentina) or weo['ISO'] == 'GRC' (Greece). We could do that by combining the comparisons:

(weo['ISO'] == 'ARG') | (weo['ISO'] == 'GRC')

Remind yourself that | stands for "or." (What do we use for "and"?)

A simpler approach is to apply the isin method to a variable. This sets the comparison to True if the value of weo['ISO'] equals any element in a list. We could do the same thing using multiple comparisons, but this is a lot easier.

Let's see how this works.
Example. Let's apply the same logic to variable codes. If we want to extract the observations with codes
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
we would use
In [55]:
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
weo['WEO Subject Code'].isin(vlist).head(45)
Out[55]:
Comment. We're choosing 2 variables from 45, so there are lots of Falses.
In [ ]:
In [56]:
# this time let's use the result of isin for selection
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
weo[weo['WEO Subject Code'].isin(vlist)].head(6)
Out[56]:
In [57]:
# we've combined several things in one line
comparison = weo['WEO Subject Code'].isin(vlist)
selection = weo[comparison]
selection.head(6)
Out[57]:
In [ ]:
Comment. We can do the same thing with countries. If we want to choose two variables and three countries, the code looks like:
In [58]:
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
weo_sub = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]
weo_sub
Out[58]:
Comments.
In [ ]:
Exercise. Use the isin method to extract Gross domestic product in US dollars for China, India, and the United States. Assign the result to the dataframe gdp. Hint: You can adapt the code we just ran. The variable code is NGDPD. The country codes are CHN, IND, and USA.
In [ ]:
Exercise (challenging). Plot the variable gdp['2015'] as a bar chart. What would you say it needs?
In [ ]:
In [ ]:
The contains method. Another useful one. The contains string method for series identifies observations that contain a specific string: if yes, the observation is labelled True; if no, False. A little trick converts the True/False outcomes to ones and zeros.
We apply it to the Media variable of the entry poll dataframe ep. You may recall that this variable could have more than one response. We tease them apart with the contains method. Our want is to have a yes/no variable for each response.
In [59]:
# recall
ep['Media'].head(10)
Out[59]:
In [60]:
# the contains method
ep['Media'].str.contains('Twitter').head(10)
Out[60]:
Comment. That's pretty good: we now know which students mentioned Twitter and which did not. It's more useful, though, to convert this to zeros (False) and ones (True), which we do with this trick: we multiply by 1.
In [61]:
ep['Media'].str.contains('Twitter').head(10)*1
Out[61]:
Comment. Now let's do the same for some of the other entries and save them in new variables.
In [62]:
media = ['None', 'Twitter', 'Facebook', 'Blog']
oldep = ep.copy()
vnames = []
for x in media:
newname = 'Media' + ':' + x
vnames.append(newname)
ep[newname] = ep['Media'].str.contains(x)*1
vnames
Out[62]:
Comment. You might want to think about this a minute. Or two.
In [ ]:
In [64]:
# create new df of just these variables
media = ep[vnames]
media.head()
Out[64]:
In [ ]:
In [65]:
# count them with the sum method
media_counts = media.sum()
media_counts
Out[65]:
Comment. Just for fun, here's a bar graph of the result.
In [66]:
media_counts.plot.barh()
Out[66]:
Exercise. What would you change in this graph? How would you do it? (Words are enough.)
In [ ]:
In [ ]:
Exercise. We explore the Census's Business Dynamics Statistics, a huge collection of data about firms. We've extracted a small piece of one of their databases that includes these variables for 2013:
Run the code cell below to load the data.
In [67]:
data = {'Size': ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49', 'e) 50 to 99',
'f) 100 to 249', 'g) 250 to 499', 'h) 500 to 999', 'i) 1000 to 2499',
'j) 2500 to 4999', 'k) 5000 to 9999', 'l) 10000+'],
'Firms': [2846416, 1020772, 598153, 373345, 115544, 63845,
19389, 9588, 6088, 2287, 1250, 1357],
'Emp': [5998912, 6714924, 8151891, 11425545, 8055535, 9788341,
6611734, 6340775, 8321486, 6738218, 6559020, 32556671]}
bds = pd.DataFrame(data)
bds.head(3)
Out[67]:
Use the dataframe bds to:

- Compute the mean number of firms: nbar = bds['Firms'].mean().
- Generate a new variable bds['AboveMean'] that is True if the value of bds['Firms'] is above the mean, False otherwise.
- Strip off the prefixes a), b), etc. from the Size variable. That is, change a) 1 to 4 to 1 to 4, b) 5 to 9 to 5 to 9, and so on. Hint: Use the split method.
In [ ]:
In [ ]:
In [ ]:
In [ ]: