In the last tutorial we learned how to apply functions to every row in a pandas DataFrame and save the output to a new column. One way to do this is the .apply method. In some cases we also need a lambda function. In this notebook I will try to clarify these two techniques and when you need one versus the other.
First, read in our Children's Literature dataset and do some of the pre-processing steps we did in the tutorial. I'll do this in one chunk of code.
In [ ]:
#import necessary modules
import pandas
import nltk
import string
#read in our data
df = pandas.read_csv("../Data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
#drop missing texts
df = df.dropna(subset=['text'])
#split the text into a list
df['text_split']=df['text'].str.split()
If we want to do something to the entire value of a column's cells, we can use the .apply method.
For example, if we want the length of the list we just created (the 'text_split' column), we can apply the len function.
In [ ]:
df['word_count'] = df['text_split'].apply(len)
df['word_count']
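To see what .apply is doing in isolation, here is a minimal sketch on a toy Series (the contents are made up, not from our dataset): .apply calls the function once per cell and collects the results.

```python
import pandas

# a toy Series where each cell is a list, like our 'text_split' column
s = pandas.Series([["a", "short", "list"], ["one", "two"]])

# len is called once per cell, so we get the length of each list
lengths = s.apply(len)
print(lengths.tolist())  # [3, 2]
```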
We can also apply nltk functions, as long as they operate on the entire value of the column cells. So we can, for example, tokenize the title column.
In [ ]:
df['title_token'] = df['title'].apply(nltk.word_tokenize)
df['title_token']
A lambda function is an anonymous function that we can pass to .apply. It behaves like a named function but lets us do more when needed. We can re-do what we did above using a lambda function.
In [ ]:
#apply the len function using .apply
df['word_count'] = df['text_split'].apply(len)
#apply the len function using lambda. This line does the same thing as line 2 above
df['word_count_lambda'] = df['text_split'].apply(lambda x: len(x))
#apply the nltk.word_tokenize function using .apply
df['title_token'] = df['title'].apply(nltk.word_tokenize)
#do the same using lambda. The next line does the same as line 7 above.
df['title_token_lambda'] = df['title'].apply(lambda x: nltk.word_tokenize(x))
df[['word_count', 'word_count_lambda','title_token', 'title_token_lambda']]
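The payoff of lambda is that we can combine operations inline, which a bare function name like len cannot do. As a sketch on toy data (the column values here are made up), we can count the *unique* words in each list by chaining set and len inside one lambda:

```python
import pandas

toy = pandas.DataFrame({"text_split": [["the", "cat", "sat", "the"], ["a", "b"]]})

# set(x) drops duplicates, then len counts what remains
toy["unique_count"] = toy["text_split"].apply(lambda x: len(set(x)))
print(toy["unique_count"].tolist())  # [3, 2]
```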
Sometimes we can't use .apply with a bare function name; we must also use a lambda function. This is the case if the column contains a list and we want to loop through that list. For example, if we want to remove punctuation from our title tokens, we can do this using a list comprehension. If we try to do this using .apply alone, we get an error.
In [ ]:
df['title_token_clean'] = df['title_token'].apply([word for word in df['title_token'] if word not in list(string.punctuation)])
df['title_token_clean']
We got a TypeError: 'list' object is not callable.
If we instead try to refer to each element with a variable, for example 'x', we get another error: NameError: name 'x' is not defined.
In [ ]:
df['title_token_clean'] = df['title_token'].apply([word for word in x if word not in list(string.punctuation)])
df['title_token_clean']
To give .apply something callable, and to define the variable that refers to each row's list, we can write a lambda function.
In [ ]:
df['title_token_clean'] = df['title_token'].apply(lambda x: [word for word in x if word not in list(string.punctuation)])
df['title_token_clean']
A lambda function, or anonymous function, lets us name the value that .apply passes in. In the case above we refer to the title_token list as a whole by the variable 'x', and to each element of that list by the variable 'word'. We then apply this lambda function to every row in our dataframe using the .apply method. The combination of .apply and lambda allows us to do some really powerful things.
Pandas continues to add built-in methods, so we won't always need lambda, but in some cases we still do.
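As one illustration of such a built-in, pandas' .str accessor can replace a lambda for simple string operations. This sketch uses made-up titles, not our dataset:

```python
import pandas

toy = pandas.Series(["Alice in Wonderland", "Peter Pan"])

# the lambda version and the built-in .str version give the same result
via_lambda = toy.apply(lambda x: x.lower())
via_str = toy.str.lower()  # no lambda needed
print(via_str.equals(via_lambda))  # True
```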