Advanced Pandas: Cleaning data

Probably the best thing about Pandas is its extensive toolset for managing data. Here we describe features of Pandas that allow us to clean data that, for reasons beyond our control, comes in a form that's not immediately amendable to analysis. This is the first of several such notebooks.

Outline:

Want operator. Start with what we want to end up, then figure out how to get there.
String methods. Fixing string variables, especially strings that should really be numbers.
Missing values. Marking, dropping, counting missing values.
Selecting variables and observations. Choose the variables and observations we want by their labels.
Boolean selection. This is mostly what we do: choose observations from conditions. We use comparisons to produce Boolean variables and then use the Boolean variables to select observations that are True. The next two methods extend this capability.
The isin method. Choose observations whose values are in lists you specify.
The contains method. Flag observations that contain a specific piece of text. Another string method, operates through Booleans.

Note: requires internet access to run.

This Jupyter notebook was created by Dave Backus, Chase Coleman, Spencer Lyon and Balint Szoke for the NYU Stern course Data Bootcamp.

Preliminaries



In [ ]:

    
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())

The want operator

We need to know what we're trying to do -- what we want the data to look like. To borrow a phrase from our friend Tom Sargent, we say that we apply the want operator.

Some problems we've run across that ask to be solved:

Numerical data is contaminated by commas (marking thousands) or dollar signs.
Row and column labels are contaminated.
Missing values are marked erratically.
We have too much data, would prefer to choose a subset.
Variables run across rows rather than down columns.

What we want in each case is the opposite of what we have: we want nicely formatted numbers, clean row and column labels, and so on.

We'll solve the first four problems here, the last one in the next notebook.

Example: Chipotle data

This data comes from a New York Times story about the number of calories in a typical order at Chipotle. The topic doesn't particularly excite us, but the data raises a number of issues that come up repeatedly. We adapt some code written by Daniel Forsyth.

Note: The file is a tsv (Tab Separated Values) file, so we need to set the separator accordingly when we call pandas' read_csv method. Remember that the default value of sep is sep=',' (see the docstring). We can change it to tabular by wrinting sep='\t'.



In [ ]:

    
url = 'https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv'
chipotle = pd.read_csv(url, sep='\t')               # tab (\t) separated values 
print('Variable dtypes:\n', chipotle.dtypes, sep='')
chipotle.head()

Comment. Note that the variable item_price has dtype object. The reason is evidently the dollar sign. We want to have it as a number, specifically a float.

Example: Data Bootcamp entry poll

This is the poll we did at the start of the course. Responses were collected in a Google spreadsheet, which we converted to a csv and uploaded to our website.



In [ ]:

    
url1 = "https://raw.githubusercontent.com/NYUDataBootcamp/"
url2 = "Materials/master/Data/entry_poll_spring17.csv"
url = url1 + url2 
entry_poll = pd.read_csv(url)



In [ ]:

    
entry_poll.head()



In [ ]:

    
print('Dimensions:', entry_poll.shape)



In [ ]:

    
print('Data types:\n\n', entry_poll.dtypes, sep='')

Comments. This is mostly text data, which means it's assigned the dtype object. There are two things that would make the data easier to work with:

First: The column names are excessively verbose. This one's easy: We replace them with single words. Which we do below.



In [ ]:

    
# (1) create list of strings with the new varnames
newnames = ['time', 'why', 'program', 'programming', 'prob_stats', 'major', 'career', 'data', 'topics']
newnames



In [ ]:

    
# (2) Use the str.title() string method to make the varnames prettier
newnames = [name.title() for name in newnames]

str.title() returns a copy of the string in which first characters of all the words are capitalized.



In [ ]:

    
newnames



In [ ]:

    
# (3) assign newnames to the variables
entry_poll.columns = newnames  
entry_poll.head(1)

Second: The second one is harder. The question about special topics of interest says "mark all that apply." In the spreadsheet, we have a list of every choice the person checked. Our want is to count the number of each type of response. For example, we might want a bar chart that gives us the number of each response. The question is how we get there.



In [ ]:

    
# check multi-response question to see what we're dealing with 
entry_poll['Topics'].head(20)

Comment. Note the commas separating answers with more than one choice. We want to unpack them somehow.

Example: OECD healthcare statistics

The OECD collects healthcare data on lots of (mostly rich) countries, which is helpful in producing comparisons. Here we use a spreadsheet that can be found under Frequently Requested Data.



In [ ]:

    
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2016-Frequently-Requested-Data.xls'
oecd = pd.read_excel(url1 + url2)
oecd.head()

This looks bad. But we can always use pd.read_excel?. Let's look into the excel file.

multiple sheets (want: Physicians)



In [ ]:

    
oecd = pd.read_excel(url1 + url2, sheetname='Physicians')
oecd.head()

The first three lines are empty. Skip those



In [ ]:

    
oecd = pd.read_excel(url1 + url2, sheetname='Physicians', skiprows=3)
oecd.head()

Would be nice to have the countries as indices



In [ ]:

    
oecd = pd.read_excel(url1 + url2, 
                     sheetname='Physicians', 
                     skiprows=3, 
                     index_col=0)
oecd.head()

The last two columns contain junk



In [ ]:

    
oecd.shape   # drop 57th and 58th columns



In [ ]:

    
# There is no skipcols argument, so let's google "read_excel skip columns" -> usecols
oecd = pd.read_excel(url1 + url2, 
                     sheetname='Physicians', 
                     skiprows=3, 
                     index_col=0, 
                     usecols=range(57))
oecd.head()

What about the bottom of the table?



In [ ]:

    
oecd.tail()    # we are downloading the footnotes too



In [ ]:

    
?pd.read_excel  # -> skip_footer



In [ ]:

    
# How many rows to skip??
oecd.tail(25)



In [ ]:

    
oecd = pd.read_excel(url1 + url2, 
                     sheetname='Physicians', 
                     skiprows=3, 
                     index_col=0, 
                     usecols=range(57), 
                     skip_footer=20)
oecd.tail()



In [ ]:

    
oecd.dtypes[:5]

We still have a couple issues.

The index includes a space and a number: Australia 1, Chile 3, etc. We care about this because when we plot the data across countries, the country labels are going to be country names, so we want them in a better form than this.
The ..'s in the sheet lead us to label any column that includes them as dtype object. Here we want to label them as missing values.
If we want to plot each country against time, then we'll need to switch the rows and columns somehow, so that the x axis in the plot (the year) is the index and not the column label.

Example: World Economic Outlook

The IMF's World Economic Outlook database contains a broad range of macroeconomic data for a large number of countries. It's updated twice a year and is a go-to source for things like current account balances (roughly, the trade balance) and government debt and deficits. It also has a few quirks, as we'll see.

Example. Run the following code as is, and with the thousands and na_values parameters commented out. How do the dtypes differ?



In [ ]:

    
url = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'

# Try
weo = pd.read_excel(url)         # NOT an excel file!



In [ ]:

    
# try to open the file with a plain text editor (it is a TSV)
weo = pd.read_csv(url, sep = '\t')
weo.head()

Useful columns:

1, 2, 3, 4, 6 (indices)
years, say from 1980 to 2016

Need a list that specifies these



In [ ]:

    
names = list(weo.columns)
names[:8]



In [ ]:

    
# for var details 
details_list = names[1:5] + [names[6]]
# for years
years_list = names[9:-6]



In [ ]:

    
details_list



In [ ]:

    
weo = pd.read_csv(url, 
                  sep = '\t',
                  index_col='ISO',
                  usecols=details_list + years_list)
weo.head()

Look at the bottom



In [ ]:

    
weo.tail(3)



In [ ]:

    
weo = pd.read_csv(url, 
                  sep = '\t',
                  index_col='ISO',
                  usecols=details_list + years_list,
                  skipfooter=1, engine='python')    # read_csv requires 'python' engine (otherwise warning)
weo.tail()

Missing values



In [ ]:

    
weo = pd.read_csv(url, 
                  sep = '\t',
                  index_col='ISO',
                  usecols=details_list + years_list,
                  skipfooter=1, engine='python',
                  na_values='n/a')
weo.head()



In [ ]:

    
weo.dtypes[:10]     # still not ok

Notice the , for thousands. As we saw before, there is an easy fix



In [ ]:

    
weo = pd.read_csv(url, 
                  sep = '\t',
                  index_col='ISO',
                  usecols=details_list + years_list,
                  skipfooter=1, engine='python',
                  na_values='n/a',
                  thousands =',')
weo.head()

Comment. This has several issues. Here's what we want:

We would like the variables to be in columns and have observations labeled 1980, 1981, etc. In other words, we want the years in the index.
We would like to make sure the data columns (1980, 1981, etc.) have dtype float64

Here's what we might to do achieve what we want:

We could try to transpose the dataframe to get the years in the index. This is close, but not quite what we want (see below). We'll come back to this in the next notebook.
To make the data columns have dtype float64 we need to warn pandas about the , thousand separators and n/a for missing data.

Question. Can we transpose the whole thing to get the data running down columns?



In [ ]:

    
weo.T.head(10)

String methods

We can treat variables as strings in Pandas in much the same way we dealt with strings in core Python. Run the code below to remind yourself how this works.



In [ ]:

    
dollars = '$123.45'
print('Type of variable dollars:', type(dollars))
num = dollars.replace('$', '')
num = float(num)
print('Type of variable num:', type(num))

Pandas string methods. We can do the same thing to all the observations of a variable with so-called string methods. We append .str to a variable in a dataframe and then apply the string method of our choice. If this is part of converting a number-like entry that has mistakenly been given dtype object, we then convert its dtype with the astype method.

Example. Let's use a string method to fix the item_price variable in the Chipotle dataframe. This has three parts:

Use the method str to identify this as a string method.
Apply the string method of our choice (here replace) to fix the string.
Use the astype method to convert the fixed-up string to a float.

We start by making a copy of the chp dataframe that we can experiment with.



In [ ]:

    
chipotle.head()



In [ ]:

    
# create a copy of the df to play with 

chipotle_num = chipotle.copy()
print('Original dtype:', chipotle_num['item_price'].dtype)



In [ ]:

    
# delete dollar signs (dtype does not change!)

chipotle_num['item_price'].str.replace('$', '').head()



In [ ]:

    
# delete dollar signs, convert to float, AND assign back to chipotle_num in one line

chipotle_num['item_price'] = chipotle_num['item_price'].str.replace('$', '').astype(float)
print('New dtype:', chipotle_num['item_price'].dtype)



In [ ]:

    
# assign back to chp for future use 
chipotle = chipotle_num



In [ ]:

    
print('Variable dtypes:\n', chipotle.dtypes, sep='')
chipotle.head()

Comment. We did everything here in one line: replace the dollar sign with a string method, then converted to float using astype. If you think this is too dense, you might break it into two steps.

Example. Here we use the astype method again to convert the dtypes of weo into float



In [ ]:

    
weo.head(1)



In [ ]:

    
weo.head(1).dtypes

Want to convert the year variables into float



In [ ]:

    
weo['1980'].astype(float)

This error indicates that somewhere in weo['1980'] there is a string value --. We want to convert that into NaN. Later we will see how we can do that directly. For now use read_csv() again



In [ ]:

    
weo = pd.read_csv(url, 
                  sep = '\t',
                  index_col='ISO',
                  usecols=details_list + years_list,
                  skipfooter=1, engine='python',
                  na_values=['n/a', '--'],
                  thousands =',')
weo.head(1)



In [ ]:

    
# With that out of our way, we can do the conversion for one variable 
weo['1980'].astype(float)



In [ ]:

    
# or for all numeric variables
years = [str(year) for year in range(1980, 2017)]
weo[years] = weo[years].astype(float)



In [ ]:

    
weo.dtypes

Example. Here we strip off the numbers at the end of the indexes in the OECD docs dataframe. This involves some experimentation:

Play with the rsplit method to see how it works.
Apply rsplit to the example country = 'United States 1'.
Use a string method to do this to all the entries of the variable Country.



In [ ]:

    
# try this with an example first 
country = 'United States 1'

# get documentation for the rsplit method
country.rsplit?



In [ ]:

    
# an example 
country.rsplit()

Comment. Not quite, we only want to split once.



In [ ]:

    
# what about this?
country.rsplit(maxsplit=1)



In [ ]:

    
# one more step, we want the first component of the list
country.rsplit(maxsplit=1)[0]



In [ ]:

    
oecd.index



In [ ]:

    
oecd.index.str.rsplit(maxsplit=1)[0]



In [ ]:

    
#try
oecd.index.str.rsplit?



In [ ]:

    
# Note the TWO str's
oecd.index.str.rsplit(n=1).str[0]



In [ ]:

    
#or use the str.get() method
oecd.index.str.rsplit(n=1).str.get(0)



In [ ]:

    
oecd.index = oecd.index.str.rsplit(n=1).str.get(0)



In [ ]:

    
oecd.head()

Comments.

Note that we need two str's here: one to do the split, the other to extract the first element.
For reasons that mystify us, we ran into problems when we used maxsplit=1, but it works with n=1.
This is probably more than you want to know, but file away the possibilities in case you need them.

Missing values

It's important to label missing values, so that Pandas doesn't interpret entries as strings. Pandas is also smart enough to ignore things labeled missing when it does calculations or graphs. If we compute, for example, the mean of a variable, the default is to ignore missing values.

We've seen that we can label certain entries as missing values in read statements: read_csv, read_excel, and so on. Here we do it directly, mostly to remind ourselves what's involved.

Marking missing values

Example. The oecd dataframe contains a number of instances of .. (double period). How can we mark them as missing values?



In [ ]:

    
docs = oecd
docs.head()

What to do. We use the replace method on the whole dataframe. To mark something as missing, we replace it as None, which Pandas interprets as missing and labels NaN.



In [ ]:

    
docs.replace(to_replace=['..'], value=[None]).head()

Comment. Replace automatically updates the dtypes. Here the double dots led us to label the variables as objects. After the replace, they're now floats, as they should be.



In [ ]:

    
docsna = docs.replace(to_replace=['..'], value=[None])
docsna.dtypes

Comment. Unlike the string methods we described earlier, this use of replace affects complete entries, not elements of string entries. For example, suppose we tried to replace the periods in decimal numbers with an asterisk. We could try the following, but it doesn't work: the decimal numbers don't change.



In [ ]:

    
docs.replace(to_replace=['.'], value=['*']).head()



In [ ]:

Working with missing values



In [ ]:

    
# grab a variable to play with
var = docsna[2013].head(10)
var



In [ ]:

    
# why not '2013'? check the type
docsna.columns



In [ ]:

    
# which ones are missing ("null")?
var.isnull()



In [ ]:

    
# which ones are not missing ("not null")?
var.notnull()



In [ ]:

    
# drop the missing 
var.dropna()

Comment. We usually don't have to worry about this, Pandas takes care of missing values automatically.

Comment. Let's try a picture to give us a feeling of accomplishment. What else would you say we need? How would we get it?



In [ ]:

    
docsna[2013].plot.barh(figsize=(4, 12))



In [ ]:

Selecting variables and observations

The word selection refers to choosing a subset of variables or observations using their labels or index. Similar methods are sometimes referred to as slicing, subsetting, indexing, querying, or filtering. We'll treat the terms as synonymous.

There are lots of ways to do this. Mostly we do "Boolean" selection, which we address in the next section. We review more direct options here, mostly at high speed because they're not things we use much.

In the outline below, df is a dataframe, var and varn are variable names, n1 and n2 are integers,

vlist = ['var1', 'var2'] is a list of variable names, and
nlist = [0, 3, 4] is a list of numerical variable or observation indexes and
bools is a list or pandas Series of booleans (True and False).

Some of the basic selection/indexing/slicing methods have the form:

df[var] extracts a variable -- a series, in other words.

df[vlist] extracts a new dataframe consisting of the variables in vlist.

df[nlist] does the same thing.

df[bools]: extracts each row where the corresponding element in bools is true. len(bools) must be equal to df.size[0]

df[n1:n2] extracts observations n1 to n2-1, the traditional slicing syntax.



In [ ]:

    
# we create a small dataframe to experiment with 
small = weo.head()
small

Example. Let's try each of these in a different cell and see what they do:

small[['Country', 'Units']]
small[[0, 4]]
small['2011']
small[1:3]

Can you explain the results?



In [ ]:

    
small[['Country', 'Units']]



In [ ]:

    
small[[0, 4]]



In [ ]:

    
small['2011']



In [ ]:

    
small[1:3]



In [ ]:

    
small[[False, True, True, False, False]]

Series indexing

Indexing a Series is a little different because we only have one column, so all indexing operations interact with rows.

The rules here are a little subtle, so we'll show examples and add comments that explain what each example does

In the list below s is a Series, n is an integer, nlist = [0, 3] is a list of integers, and i is a string, and is is a list of strings

s[n]: if the index has dtype int, this extracts the row with index n. Otherwise extracts the nth row (starting at zero)
s[nlist]: if the index has dtype int, this extracts rows with indices in nlist returning NaN if they don't appear. Otherwise extracts the rows at positions in nlist, filling with NaN for invalid positions
s[i]: if the index has dtype object, this extracts the row with index i, otherwise it is an error
s[is]:



In [ ]:

    
s1 = pd.Series([5, 6, 7, 8], index=["a", "b", "c", "d"])
s1



In [ ]:

    
s2 = pd.Series([50, 60, 70, 80], index=[0, 4, 2, 999])
s2



In [ ]:

    
# index has dtype object, so using an int returns the value in that row (starting at 0)
s1[1]



In [ ]:

    
# index has dtype int, so using an integer tries to find the that int in the 
# index and return the corresponding value and throws an error if it can't find it
s2[1]



In [ ]:

    
s2[0]  # no error, 0 is in the index



In [ ]:

    
# index has dtype object, so a list of ints extracts those rows
s1[[0, 3]]



In [ ]:

    
# index has dtype int, so a list of ints tries to match each int to the index
# it returns NaN where it can't find the index. Notice it **did not** return 
# `80` for 3
s2[[0, 3, 999]]



In [ ]:

    
# index has type object, so a string finds row with matching index
s1["c"]



In [ ]:

    
# index has dtype int, so using a string causes an error
s2["c"]



In [ ]:

    
# similar behavior for lists of strings
s1[["a", "b", "penguin"]]



In [ ]:

    
# index has dtype int, so list of strings returns NaN's everywhere
s2[["a", "b"]]



In [ ]:

    
# lists of True/False work the same for any dtype of index
bools = [True, False, False, True]
s1[bools]



In [ ]:

    
s2[bools]

Boolean selection

We choose observations that satisfy one or more conditions. Boolean selection consists of two steps that we typically combine in one statement:

Use a comparison to construct a Boolean variable consisting of True and False.
Compute df[comparison], where df is a dataframe and comparison is a comparison. This will select the observations (rows) for which comparison is true and throw away the others.

We work through this one step at a time:

Example: apply the want operator
Comparisons for dataframes
Boolean selection: select observations for which the comparison is True
The isin method

This is easier to describe with an example.

Example: Apply the want operator to WEO

Our want here is to take the weo dataframe and extract government debt and deficits for a given set of countries. Putting this to work involves several steps.

Here's the head of the dataframe to remind us what we're dealing with.



In [ ]:

    
weo.head(2)

Find variable and country codes. Which ones do we want? Let's start by seeing that's available. Here we create special dataframes that include all the variables and their definitions and all the countries.

Note the use of the drop_duplicates method, which does what it sounds like: remove duplicate rows (!)



In [ ]:

    
variable_list = weo[['Country', 'Subject Descriptor', 'Units']].drop_duplicates()
print('Number of variables: ', variable_list.shape[0])
variable_list.head()



In [ ]:

    
country_list = weo['Country'].drop_duplicates()
print('Number of countries: ', country_list.shape[0])
country_list

Exercise.

Construct a list of countries with countries = weo['Country']; that is, without applying the drop_duplicates method. How large is it? How many duplicates have we dropped?



In [ ]:

Comment. Now that we have the country and variable codes, we can be more explicit about what we want. We want observations with those country and variable codes.

We work up to the solution one step at a time.

Comparisons for series

We can construct comparisons for series (dataframe columns) much as we did with simple variables. The difference is that we get a complete column of True/False responses, not just one.

Mutiple comparisons have a different syntax than we saw earlier: and is replaced by &, and or is replaced by |. And when we have more than one comparison, we need to enclose them in parentheses.

Examples. Consider the comparisons:

small['Units'] == 'National currency'
small['2011'] >= 100
(small['Units'] == 'National currency') & (small['2011'] >= 100)
(small['Units'] == 'National currency') | (small['2011'] >= 100)

Remind yourself what the & and | do.



In [ ]:

    
small



In [ ]:

    
small['Units'] == 'National currency'



In [ ]:

    
small['2011'] >= 200



In [ ]:

    
(small['Units'] == 'National currency') & (small['2011'] >= 100)



In [ ]:

    
(small['Units'] == 'National currency') | (small['2011'] >= 100)



In [ ]:

Boolean selection

Boolean selection simply chooses those observations for which a condition is True. Some people refer to this as filtering. The syntax is

df[comparison]

The result is a new dataframe of observations in which comparison is true.

Example. We choose obervations for which the units are 'National currency'. We do this first in two steps, then in one.



In [ ]:

    
# remind ourslves what we're starting with 
small



In [ ]:

    
# two steps:  comparison, then selection 
ncunits = small['Units'] == 'National currency'   # comparison
print(ncunits)
small[ncunits]                                    # selection



In [ ]:

    
# put the steps together in one line 
small[small['Units'] == 'National currency']

Exercise. Construct dataframes for which

small['Units'] does not equal 'National currency'.
small['Units'] equals 'National currency' and small['2011'] is greater than 100.

The `isin` method

Pay attention now, this is really useful. Suppose we want to extract the data for which weo['Country'] == 'Argentina' or weo['Country'] == 'Greece' (Greece). We could do that by combining the comparisons:

(weo['Country'] == 'Aregentina') | (weo['Country'] == 'Greece')

Remind youself that | stands for "or." (What do we use for "and"?)

A simpler approach is to apply the isin method to a variable. This sets the comparison equal to True if the value of the observation is of weo['Country'] equals any element in a list. We could do the same thing using mulitple comparisons, but this is a lot easier.

Let's see how this works.

Example. Let's apply the same logic to variable codes. If we want to extract the observations with codes

vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']

we would use



In [ ]:

    
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
weo['WEO Subject Code'].isin(vlist)

Comment. We're choosing 2 variables from 45, so there are lots of Falses.



In [ ]:

    
weo.tail(4)



In [ ]:

    
# this time let's use the result of isin for selection 
vlist = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
weo[weo['WEO Subject Code'].isin(vlist)].head(6)



In [ ]:

    
# we've combined several things in one line
comparison = weo['WEO Subject Code'].isin(vlist) 
selection  = weo[comparison]
selection.head(6)



In [ ]:

Comment. We can do the same thing with countries. If we want to choose two variables and three countries, the code looks like:



In [ ]:

    
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['Argentina', 'Greece']
weo_sub = weo[weo['WEO Subject Code'].isin(variables) & weo['Country'].isin(countries)]
weo_sub

Comments.

We've now done what we described when we applied the want operator.
This is a go-to method. Circle it for later reference.
This is a go-to method. Circle it for later reference.

Exercise. Use the isin method to extract Gross domestic product in US dollars for China, India, and the United States. Assign the result to the dataframe gdp. Hint: You can adapt the code we just ran. The variable code is NGDPD. The country codes are CHN, IND, and USA.



In [ ]:

    
countries = ['China', 'India', 'United States']

gdp = weo[(weo['WEO Subject Code']=='NGDPD') & weo['Country'].isin(countries)]
gdp

Exercise (challenging). Plot the variable gdp['2015'] as a bar chart. What would you say it needs?



In [ ]:

    
gdp['2015'].plot(kind='bar')

The `contains` method

Another useful one. The contains string method for series identifies observations that contain a specific string. If yes, the observation is labelled True, if no, False. A little trick converts the True/False outcomes to ones and zeros.

We apply it to the Topics variable of the Entry Poll dataframe entry_poll. You may recall that this variable could have more than one response. We tease them apart with the contains method. Our want is to have a yes/no variable for each response.



In [ ]:

    
# recall
entry_poll['Topics'].head(10)



In [ ]:

    
# the contains method
entry_poll['Topics'].str.contains('Machine Learning')

Comment. That's pretty good, we now know which students mentioned Machine Learning and which did not. It's more useful, though, to convert this to zeros (False) and ones (True), which we do with this trick: we multiply by 1.



In [ ]:

    
entry_poll['Topics'].str.contains('Machine Learning').head(10)*1

Comment. Now let's do the same for some of the other entries and save them in new variables.



In [ ]:

    
topics = ['Web scraping', 'Machine Learning', 'regression'] 
old_ep = entry_poll.copy()

vnames = []
for x in topics:
    newname = 'Topics' + '_' + x 
    vnames.append(newname)
    entry_poll[newname] = entry_poll['Topics'].str.contains(x)*1
    
vnames

Comment. You might want to think about this a minute. Or two.



In [ ]:

    
# create new df of just these variables 
student_topics = entry_poll[vnames]
student_topics



In [ ]:

    
# count them with the sum method 
topics_counts = student_topics.sum()
topics_counts

Comment. Just for fun, here's a bar graph of the result.



In [ ]:

    
topics_counts.plot(kind='barh')

and a pie chart



In [ ]:

    
topics_counts.plot(kind='pie')



In [ ]:

Review

Let's remind ourselves what we've learned.

Exercise. We explore the Census's Business Dynamics Statistics, a huge collection of data about firms. We've extracted a small piece of one of their databases that includes these variables for 2013:

Size: size category of firms based on number of employees
Firms: number of firms in each size category
Emp: number of employees in each size category

Run the code cell below to load the data.



In [ ]:

    
data = {'Size': ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49', 'e) 50 to 99',
                 'f) 100 to 249', 'g) 250 to 499', 'h) 500 to 999', 'i) 1000 to 2499',
                 'j) 2500 to 4999', 'k) 5000 to 9999', 'l) 10000+'], 
        'Firms': [2846416, 1020772, 598153, 373345, 115544, 63845,
                  19389, 9588, 6088, 2287, 1250, 1357], 
        'Emp': [5998912, 6714924, 8151891, 11425545, 8055535, 9788341, 
                6611734, 6340775, 8321486, 6738218, 6559020, 32556671]}
bds = pd.DataFrame(data) 
bds .head(3)

Use the dataframe bds to:

Compute the mean number of firms with nbar = bds['Firms'].mean().
Generate the new variable bds['AboveMean'] that is True if the value of bds['Firms'] is above the mean, False otherwise.
What dtype is this new variable?
Select the observations for which the number of firms is above the mean.
Challenging. Fix the size categories. Specifically, use a string method to eliminate the prefixes a), b), etc. That is, change a) 1 to 4 to 1 to 4, b) 5 to 9 to 5 to 9, and so on. Hint: Use the split method.

Advanced Pandas: Cleaning data

Preliminaries

The want operator

Example: Chipotle data

Example: Data Bootcamp entry poll

Example: OECD healthcare statistics

Example: World Economic Outlook

String methods

Missing values

Marking missing values

Working with missing values

Selecting variables and observations

Series indexing

Boolean selection

Example: Apply the want operator to WEO

Comparisons for series

Boolean selection

The isin method

The contains method

Review

The `isin` method

The `contains` method