Use the Python Pandas library in the Jupyter Notebook to explore, clean, summarize, and plot a data set.
A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is imported, it can be used or called to perform many tasks.
Python doesn't load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax import libraryName. If we want to give the library a nickname to shorten the command, we can add as nickNameHere. An example of importing the Pandas library using the common nickname pd is below.

import pandas as pd
matplotlib is the most widely used Python library for plotting. We can make its plots appear inside the notebook using the magic command %matplotlib inline. If you do not use %matplotlib inline, your plots will be generated outside of the notebook and may be difficult to find. See the IPython docs for other IPython magic commands.
In this lesson, we will only use matplotlib and Seaborn, another package that works in tandem with matplotlib to make nice graphics. There is a whole range of graphics packages in Python, ranging from basic visualizations to fancy, interactive graphics like Bokeh and Plotly.
We encourage you to explore on your own! Chances are, if you can imagine a plot you'd like to make, somebody else has written a package to do it.
Text can be added to Jupyter Notebooks using Markdown cells. Markdown is a popular, lightweight markup language (Markdown cells can also contain raw HTML). To learn more, see Jupyter's Markdown guide or revisit the Reproducible Research lesson on Markdown.
One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). The Pandas library is built on top of the NumPy package (another Python library). Pandas provides data structures, produces high quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays. Those familiar with spreadsheets should become comfortable with Pandas data structures.
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Each time we call a function that's in a library, we use the syntax LibraryName.FunctionName. Adding the library name with a . before the function name tells Python where to find the function. In the example above, we have imported Pandas as pd. This means we don't have to type out pandas each time we call a Pandas function.
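For example, after the import above, the following two calls are equivalent (a quick illustration using a tiny, made-up table):

In [ ]:
pd.DataFrame({'a': [1, 2]})    # call via the nickname
# pandas.DataFrame({'a': [1, 2]})  # the same call via the full name (works after a plain 'import pandas')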
See this free Pandas cheat sheet from DataCamp for the most common Pandas commands.
We will begin by locating and reading our data, which are in a table format as a tab-delimited file. We will use Pandas' read_table function to pull the file directly into a DataFrame.

What is a DataFrame? A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet, a SQL table, or a data.frame in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure.
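As a quick illustration (a made-up toy table, not the gapminder data), we can build a small DataFrame by hand and see both its mixed column types and its 0-based index:

In [ ]:
toy = pd.DataFrame({'country': ['norway', 'kenya'],  # strings
                    'year': [2007, 2007],            # integers
                    'lifeexp': [80.2, 54.1]})        # floats
toy  # displaying it shows the 0-based index down the left side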
Note that we use pd.read_table, not just read_table or pandas.read_table, because we imported Pandas as pd.

In our original file, the columns in the data set are separated by a TAB. We need to tell the read_table function in Pandas that this is the delimiter with sep='\t'.
In [ ]:
url = "https://raw.githubusercontent.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/master/gapminderDataFiveYear_superDirty.txt"
# You can also read the table from a local file path instead of a URL
gapminder = pd.read_table(url, sep = "\t")
The first thing to do when loading data into the notebook is to actually "look" at it. How many rows and columns are there? What types of variables are in it and what values can they take?
There are usually too many rows to print to the screen. By default, when you type the name of the DataFrame and run a cell, Pandas knows not to print the whole thing. Instead, you will see the first and last few rows with dots in between. A neater way to see a preview of the dataset is the head() method. Calling dataset.head() will display the first 5 rows of the data. You can specify how many rows you want to see as an argument, like dataset.head(10). The tail() method does the same with the last rows of the DataFrame.
In [ ]:
# head: preview the first 5 rows
gapminder.head()
In [ ]:
# tail: preview the last 5 rows
gapminder.tail()
In [ ]:
# Type the DataFrame's name to see a (truncated) view
gapminder
Sometimes the table has too many columns to print on screen. Calling df.columns.values will print all the column names in an array.
In [ ]:
# columns: list all of the column names
gapminder.columns.values
We often want to know how many rows and columns are in the data -- what is the "shape" of the DataFrame. Shape is an attribute of the DataFrame, and Pandas has a convenient way of getting that information: DataFrame.shape (using DataFrame here as a generic name for your DataFrame). This returns a tuple (immutable values separated by commas) representing the dimensions of the DataFrame (rows, columns).

To get the shape of the gapminder DataFrame:
In [ ]:
# shape: (number of rows, number of columns)
gapminder.shape
We can learn even more about our DataFrame. The info() method gives a few useful pieces of information, including the shape of the DataFrame, the variable type of each column, and the amount of memory stored.

The output from info() displayed below shows that the fields 'year' and 'pop' (population) are represented as 'float' (that is: numbers with a decimal point). This is not appropriate: year and population should be integers or whole numbers. We can change the data type with the astype() method. The code for astype() is shown below; however, we will change the data types later in this lesson.
In [ ]:
# info: column types, non-null counts, and memory usage
gapminder.info()
The describe() method will take the numeric columns and provide a summary of their values. This is useful for getting a sense of the ranges of values and seeing if there are any unusual or suspicious numbers.
In [ ]:
# describe: summary statistics for the numeric columns
gapminder.describe()
The describe() method just blindly looks at all numeric variables -- we wouldn't actually want to take the mean year. Additionally, we obtain 'NaN' values for our quartiles. This suggests we might have missing data, which we can (and will) deal with shortly when we begin to clean our data.

For now, let's pull out only the columns that are truly continuous numbers (i.e. ignore the description for 'year'). This is a preview of selecting columns from the data; we'll talk more about how to do it later in the lesson.
In [ ]:
# describe only the truly continuous columns (names here assume the raw file's headers)
gapminder[['pop', 'life exp', 'gdpPercap']].describe()
We can also extract one specific variable metric at a time if we wish:
In [ ]:
# min, max, mean, std, count -- one metric at a time
gapminder['pop'].mean()  # similarly: .min(), .max(), .std(), .count()
Next, let's say you want to see all the unique values for the region column. One way to do this is:
In [ ]:
# unique: all of the distinct values in the region column
gapminder['region'].unique()
This output is useful, but it looks like there may be some formatting issues causing the same region to be counted more than once. Let's take it a step further and find out to be sure.
As mentioned previously, the value_counts() method gives you a first global idea of your categorical data, such as strings. In this case that is the column region. Run the code below.
In [ ]:
# How many unique regions are in the data? Use len
len(gapminder['region'].unique())
# How many times does each unique region occur?
gapminder['region'].value_counts()
The table reveals some problems in our data set. The data set covers 12 distinct years, so each 'region' should appear 12 times, but some regions appear more than 12 times and others fewer. We also see inconsistencies in the region names (string variables are very susceptible to those), for instance:
Asia_china vs. Asia_China
Another type of problem we see is the various names of 'Congo'. In order to analyze this dataset appropriately we need to take care of these issues. We will fix them in the next section on data cleaning.
What is the outcome when you run value_counts()?

In [ ]:
Before we get started with cleaning our data, let's practice good data hygiene by first creating a copy of our original data set. Often, you want to leave the original data untouched. To protect your original, you can make a copy of your data (and save it to a new DataFrame variable) before operating on the data or a subset of the data. This will ensure that a new version of the original data is created and your original is preserved.
Suppose you take a subset of your DataFrame and store it in a new variable, like gapminder_early = gapminder[gapminder['year'] < 1970]. Doing this does not necessarily create a new object. Instead, you may have just given a name to that subset of the original data: gapminder_early. The subset can still point to the original rows of gapminder, so changes you make to the new DataFrame gapminder_early can appear in the corresponding rows of your original gapminder DataFrame too (Pandas typically flags this situation with a SettingWithCopyWarning).
In [ ]:
gapminder = pd.read_table(url, sep = "\t")
gapminder_copy = gapminder.copy()
gapminder_copy.head()
Missing data (often denoted as 'NaN' - not a number - in Pandas, or as 'null') is an important issue to handle because Pandas cannot compute on rows or columns with missing data. 'NaN' or 'null' does not mean the value at that position is zero; it means that there is no information at that position. Ignoring missing data doesn't make it go away. There are different ways of dealing with it, which include:

- removing (dropping) the rows or columns that contain missing values
- filling in (replacing) the missing values with a chosen value
For our purposes with the dirty gapminder data set, we know our missing data is excess (and unnecessary) and we are going to choose to analyze only the available data. To do this, we will simply remove rows with missing values.
This is incredibly easy to do because Pandas allows you to either remove all instances with null data or replace them with a particular value.
df = df.dropna() drops rows with any column having NA/null data. df = df.fillna(value) replaces all NA/null data with the argument value.

For more fine-grained control of which rows (or columns) to drop, you can use the how or thresh arguments. These are more advanced topics and are not covered in depth in this lesson; you are encouraged to explore them on your own.
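Here is a minimal sketch of those two arguments on a toy DataFrame (a hypothetical example, not our gapminder data):

In [ ]:
import numpy as np
demo = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [2, np.nan, 5]})
demo.dropna(how='all')   # drops only the middle row, where ALL values are NA
demo.dropna(thresh=2)    # keeps only rows with at least 2 non-null values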
In [ ]:
# drop na: remove every row that contains a missing value
gapminder_copy = gapminder_copy.dropna()
In [ ]:
# astype(): year and pop should be integers, not floats
gapminder_copy[['year', 'pop']] = gapminder_copy[['year', 'pop']].astype(int)
You can identify which observations are duplicates. The call df.duplicated() will return boolean values for each row in the DataFrame, telling you whether or not a row is repeated. In cases where you don't want repeated values (we don't -- we only want each country to be represented once for every relevant year), you can easily drop such duplicate rows with the call df.drop_duplicates().
In [ ]:
# duplicated() shows we have a repetition within the first __ rows
gapminder_copy.duplicated().head()
Let's look at the first five rows of our data set again (remember we removed the NaNs):
In [ ]:
# How do we look at the first 5 rows?
gapminder_copy.head()
Our statement from above is correct, rows 1 & 2 are duplicated. Let's fix that:
In [ ]:
# drop_duplicates(): keep only the first occurrence of each repeated row
gapminder_copy = gapminder_copy.drop_duplicates()
reset_index()
Now we have 1704 rows, but our indexes are off because we removed duplicate rows. We can reset our indices easily with the call reset_index(drop=True). Remember, Python is 0-indexed, so our indices will be valued 0-1703.
The concept of reindexing is important. When we removed some of the messier, unwanted data, we had "gaps" in our index values. By correcting this, we can improve our search functionality and our ability to perform iterative functions on our cleaned data set.
In [16]:
# reset_index(): renumber the rows from 0 to 1703
gapminder_copy = gapminder_copy.reset_index(drop=True)
The region column is a bit too messy for what we'd like to do. The value_counts() operation above revealed some issues that we can solve with several different techniques. Common problems with string variables are leading and trailing white space and upper case vs. lower case in the same data set. The following three commands remove all such lingering spaces (left and right) and put everything in lowercase. If you prefer, the three commands can be written in one single line (a concept called chaining).
In [ ]:
gapminder_copy['region'] = gapminder_copy['region'].str.lstrip() # Strip white space on left
gapminder_copy['region'] = gapminder_copy['region'].str.rstrip() # Strip white space on right
gapminder_copy['region'] = gapminder_copy['region'].str.lower() # Convert to lowercase
gapminder_copy['region'].value_counts() # How many times does each unique region occur?
# We could have done this in one line (chaining)!
# gapminder_copy['region'] = gapminder_copy['region'].str.lstrip().str.rstrip().str.lower()
replace()
A regular expression, a.k.a. regex, is a sequence of characters that define a search pattern. In a regular expression, the symbol “*” matches the preceding character 0 or more times, whereas “+” matches the preceding character 1 or more times. “.” matches any single character. Writing “x|y” means to match either ‘x’ or ‘y’.
For more regex shortcuts (cheatsheet): https://www.shortcutfoo.com/app/dojos/regex/cheatsheet
To play "regex golf," check out this tutorial by Peter Norvig (you may need an O'Reilly or social media account to play).
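As a small illustration of these operators (a toy example using Python's built-in re module, separate from our gapminder work):

In [ ]:
import re
re.findall("co*ngo", "cngo congo cooongo")  # '*': zero or more 'o's -> all three match
re.findall("co+ngo", "cngo congo cooongo")  # '+': one or more 'o's -> 'cngo' no longer matches
re.findall("c.ngo", "congo cungo")          # '.': any single character
re.findall("x|y", "xyz")                    # '|': match either 'x' or 'y'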
Pandas allows you to use regex in its replace() function -- when a regex term is found in an element, the element is then replaced with the specified replacement term. In order for it to appropriately correct elements, both the regex and inplace arguments need to be set to True (their defaults are False). This ensures that the initial input string is read as a regular expression and that the elements will be modified in place.
For more documentation on the replace method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
Here's an incorrect regex example: we create a temporary DataFrame in which a regex pulls all values that contain the term "congo". Unfortunately, this creates 24 instances of the Democratic Republic of the Congo -- this is an error in our cleaning! We can revert back to the non-temporary DataFrame and correctly modify our regex to isolate only the Democratic Republic instances (as opposed to including the Republic as well).
In [ ]:
# This gives a problem -- 24 values of the congo!
temp = gapminder_copy['region'].replace(".*congo.*", "africa_dem rep congo", regex=True)
temp.value_counts()
In [ ]:
# What happened? This shows all the rows that have congo in the name.
gapminder_copy[gapminder_copy["region"].str.contains('congo')]
As noted above, regular expressions (often simply "regex") provide a powerful tool for fixing errors that arise in strings. In order to correctly label the two different countries that include the word "congo", we need to design and use (via the replace() method) a regex that correctly differentiates between the two countries.
Recall that the "." is the wildcard (matching any single character); combining this with "*" allows us to match any number of single characters an unspecified number of times. By combining these characters with substrings corresponding to variations in the naming of the Democratic Republic of the Congo, we can correctly normalize the name.
If you feel that the use of regex is not particularly straightforward, you are correct -- appropriately using these tools takes a great deal of time to master. When designing regex for these sorts of tasks, you might find the following prototyper helpful: https://regex101.com/
In [ ]:
gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True)
gapminder_copy['region'].value_counts() # Now it's fixed.
Now that we've taken a close look at how to properly design and use regex to clean string entries in our data, let's try to normalize the naming of a few other countries. Using the Pandas code we constructed above as a template, construct similar code (using replace()) to set the naming of the Ivory Coast and Canada to "africa_cote d'ivoire" and "americas_canada", respectively.
In [ ]:
# Try this on your own
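One possible solution is sketched below; the exact patterns depend on how the names are misspelled in the file (here we assume the dirty names still contain 'ivo' or 'canada'), so run value_counts() afterwards to confirm the fix worked.

In [ ]:
# A sketch -- the patterns assume the messy names contain 'ivo' or 'canada'
gapminder_copy['region'].replace(".*ivo.*", "africa_cote d'ivoire", regex=True, inplace=True)
gapminder_copy['region'].replace(".*canada.*", "americas_canada", regex=True, inplace=True)
gapminder_copy['region'].value_counts()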
Having what is called a "Tidy data set" can make cleaning, analyzing, and visualizing your data much easier. You should aim for having Tidy data when cleaning and preparing your data set for analysis. Two of the important aspects of Tidy data are:

- each variable forms its own column
- each observation forms its own row

(There are other aspects of Tidy data; here is a good blog post about Tidy data in Python: http://www.jeannicholashould.com/tidy-data-in-python.html)
Currently the gapminder dataset has a single column for continent and country (the ‘region’ column). We can split that column into two, by using the underscore that separates continent from country.
We can create a new column in the DataFrame by naming it before the = sign: gapminder['country'] =

The following commands use the function split() to split the string at the underscore (the first argument), which results in a list of two elements: the parts before and after the _. The second argument tells split() that the split should take place only at the first occurrence of the underscore.
In [ ]:
gapminder_copy['country'] = gapminder_copy['region'].str.split('_', n=1).str[1]
gapminder_copy['continent'] = gapminder_copy['region'].str.split('_', n=1).str[0]
gapminder_copy.head()
We have now added the columns country and continent, but we still have the old region column as well. In order to remove that column we use the drop() command. The first argument of the drop() command is the name of the element to be dropped. The second argument is the axis number: 0 for row, 1 for column.
In [17]:
# drop(): remove the region column (axis=1 means column)
gapminder_copy = gapminder_copy.drop('region', axis=1)
Finally, it is a good idea to look critically at your column names. Use lowercase for all column names to avoid confusing gdppercap with gdpPercap or GDPpercap. Avoid spaces in column names to simplify manipulating your data -- look out for lingering white space at the beginning or end of your column names. The following code turns all column names to lowercase.
In [ ]:
# str.lower(): make all column names lowercase
gapminder_copy.columns = gapminder_copy.columns.str.lower()
We also want to remove the space from the life exp column name. We can do that with Pandas' rename method. It takes a dictionary as its argument, with the old column names as keys and new column names as values.
If you're unfamiliar with dictionaries, they are a very useful data structure in Python. You can read more about them here.
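If you haven't used one before, here is a minimal dictionary example (a toy illustration): keys map to values, which is exactly the old-name-to-new-name mapping that rename expects.

In [ ]:
name_map = {'old name': 'new_name'}  # key -> value
name_map['old name']                 # look up by key: returns 'new_name'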
In [18]:
# rename columns (assuming the old name is 'life exp' after lowercasing)
gapminder_copy = gapminder_copy.rename(columns={'life exp': 'lifeexp'})
Often we have more than one DataFrame that contains parts of our data set, and we want to put them together. This is known as merging the data.

Our advisor now wants us to add a new country called The People's Republic of Berkeley to the gapminder data set that we have cleaned up. Our goal is to get this new data into the same DataFrame in the same format as the gapminder data and, in this case, we want to concatenate (add) it onto the end of the gapminder data.

Concatenating is a simple form of merging; there are many useful (and more complicated) ways to merge data. If you are interested in more information, the Pandas documentation is useful.
In [ ]:
PRB = pd.read_table('https://raw.githubusercontent.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/master/PRB_data.txt', sep = "\t")
PRB.head()
In [ ]:
## bring in PRB data (no major problems) and make it conform to the gapminder at this point
# clean the data to look like the current gapminder
# (a sketch -- the exact steps depend on the file; e.g., lowercase the column names)
PRB.columns = PRB.columns.str.lower()
In [ ]:
# double check that the gapminder is the same
gapminder_copy.head()
In [ ]:
# combine the data sets with concat
gapminder_comb = pd.concat([gapminder_copy, PRB])
In [ ]:
# our code for fixing the index after concatenating
gapminder_comb = gapminder_comb.reset_index(drop=True)
There are many ways in which you can manipulate a Pandas DataFrame - here we will discuss two approaches: subsetting and sorting.

We can subset (or slice) by giving the numbers of the rows we want to see between square brackets.

REMINDER: Python uses 0-based indexing. This means that the first element in an object is located at position 0. This is different from other tools like R and Matlab, which index elements within objects starting at 1.
In [ ]:
# Select the first 15 rows
gapminder_comb[0:15]
In [ ]:
# Use a different way to select the first 15 rows
gapminder_comb.head(15)
In [ ]:
# Select the last 10 rows
gapminder_comb[-10:]
Answer:
What happens when you leave the space before or after the colon empty?
Answer: Leaving the space before the colon empty starts the slice at the first row; leaving the space after it empty runs the slice through the last row. Leaving both empty ([:]) selects every row.
Subsetting can also be done by selecting for a particular column, or for a particular value in a column; for instance, select the rows that have 'africa' in the column 'continent'. Note the double equal sign: single equal signs are used in Python to assign something to a variable. The double equal sign is a comparison: the variable on the left has to be exactly equal to the string on the right. See the sketch below.

There are other ways of subsetting that are worth knowing about. Do an independent reading on using .loc/.iloc with DataFrames.
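Here is a minimal sketch of that comparison-based subsetting (assuming the combined DataFrame is named gapminder_comb, as elsewhere in this lesson):

In [ ]:
# Boolean subsetting: keep only the rows where continent equals 'africa'
gapminder_comb[gapminder_comb['continent'] == 'africa'].head()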
In [19]:
# Select for a particular column
gapminder_comb['continent']
# this syntax, calling the column as an attribute, gives you the same output
gapminder_comb.continent
We can also create a new object that contains the data within the continent column:
In [ ]:
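# One way to do this (a sketch): store the column in a new variable
continents = gapminder_comb['continent']
continents.head()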
In [ ]:
In [ ]:
# sort_values(): sort the rows by a column, e.g. by population
gapminder_comb.sort_values(by='pop').head()
There are two main tools for exploring a data set:

- Summaries
- Plots

Exploring is often iterative - summarize, plot, summarize, plot, etc. - and sometimes it branches.
Remember that the info() method gives a few useful pieces of information, including the shape of the DataFrame, the variable type of each column, and the amount of memory stored. We can see many of our changes (continent and country columns instead of region, higher number of rows, etc.) reflected in the output of the info() method.
In [ ]:
gapminder_comb.info()
We also saw above that the describe() method will take the numeric columns and give a summary of their values. We have to remember that we changed the column names, and this time it shouldn't have NaNs.
In [ ]:
gapminder_comb[['pop', 'lifeexp', 'gdppercap']].describe()
In [ ]:
What if we want to know the mean population by continent? Then we need to use the Pandas groupby() method and tell it which column we want to group by.
In [ ]:
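# Group by continent, then take the mean population
gapminder_comb[['continent', 'pop']].groupby(by='continent').mean()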
What if we want to know the median population by continent?
In [ ]:
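# Same grouping, but with the median instead of the mean
gapminder_comb[['continent', 'pop']].groupby(by='continent').median()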
Or the number of entries (rows) per continent?
In [ ]:
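# count() returns the number of non-null entries per column in each group
gapminder_comb[['continent', 'country']].groupby(by='continent').count()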
Sometimes we don't want a whole DataFrame. Here is another way to do this that produces a Series telling us the number of entries (rows), as opposed to a DataFrame.
In [ ]:
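# size() on a grouped DataFrame returns a Series of row counts per group
# (one way to get a Series rather than a DataFrame)
gapminder_comb.groupby(by='continent').size()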
We can also look at the mean GDP per capita of each country:
In [ ]:
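# Mean GDP per capita for each country
# (column names assume the renaming we did earlier)
gapminder_comb[['country', 'gdppercap']].groupby(by='country').mean()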
What if we wanted a new DataFrame that just contained these summaries? This could be a table in a report, for example.
In [ ]:
continent_mean_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder_comb[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder_comb[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1)
gapminder_summs
matplotlib
Recall that matplotlib is Python's main visualization library. It provides a range of tools for constructing plots and numerous high-level plotting libraries (e.g., Seaborn) are built with matplotlib in mind. When we were in the early stages of setting up our analysis, we loaded these libraries like so:
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
Consider the above commands to be essential practice for plotting (as essential as import pandas as pd is for data munging).
Now, let's turn to data visualization. In order to get a feel for the properties of the data set we are working with, data visualization is key. While we will focus here only on the essentials of how to properly construct plots in univariate and bivariate settings, it's worth noting that both matplotlib and Seaborn support a diversity of plots: matplotlib gallery, Seaborn gallery.
In [ ]:
#import numpy as np
In [ ]:
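# A univariate view first (a sketch; column names assume the cleaning above):
# a histogram of life expectancy
plt.hist(gapminder_copy['lifeexp'], bins=30)
plt.xlabel('lifeexp')
plt.ylabel('count')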
In [ ]:
# scatter plot goes here
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')
In [ ]:
# let's try plotting the log of x
plt.scatter(gapminder_copy['gdppercap'], gapminder_copy['lifeexp'])
plt.xscale('log')
plt.xlabel('gdppercap')
plt.ylabel('lifeexp')
In [ ]:
# Try creating a plot on your own
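# One possibility (a sketch using Seaborn, which we imported above):
# a boxplot of life expectancy by continent
sns.boxplot(x='continent', y='lifeexp', data=gapminder_copy)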
In this lesson, we've taken the raw data and worked through steps to prepare it for analysis, but we have not yet done any "data analysis". This part of the data workflow can be thought of as "exploratory data analysis", or EDA. Many of the steps we've shown are aimed at uncovering interesting or problematic things in the dataset that are not immediately obvious. We want to stress that when you're doing EDA, it will not necessarily be a linear workflow like what we have shown. When you plot or summarize your data, you may uncover new issues: for example, we saw this when we made a mistake fixing the naming conventions for the Democratic Republic of Congo. You might discover outliers, unusually large values, or points that don't make sense in your plots. Clearly, the work here isn't done: you'll have to investigate these points, decide how to fix any potential problems, document the reasoning for your actions, and check that your fix actually worked.
On the other hand, plots and summaries might reveal interesting questions about your data. You may return to the cleaning and prepping steps in order to dig deeper into these questions. You should continuously refine your plots to give the clearest picture of your hypotheses.
What's interesting about this data set? What would you like to investigate further?
Answer: