(fill in your names here)
Facilitator: (fill in name)
Spokesperson: (fill in name)
Process Analyst: (fill in name)
Quality Control: (fill in name)
If there are only three people in your group, have one person serve as both spokesperson and process analyst for the rest of this activity.
At the end of this Lesson, you will be asked to record how long each Model required for your team. The Facilitator should keep track of time for your team.
After the preactivity, you are at least a little more familiar with two ideas:
In this activity we will use the pandas library of tools to analyze a DataFrame. This introduction to analyzing tabular data in Python will include an introduction to:
matplotlib
A note about the data in this activity:
The data originally comes from Gapminder, which describes itself as "a fact tank [that] fights devastating misconceptions about global development ... making the world understandable based on reliable statistics". It has been processed lightly by the Data Carpentry community for data analysis activities. It might seem a bit less sciency (unless you're into global development and public health), but the process of analyzing tabular data -- doing statistical or other analysis of numerical and categorical data -- is the same...
This activity is heavily based on the Data Exploration lesson developed by the Data Carpentry community (Brian was a participant/developer).
A library in Python contains a set of tools (called functions) that perform tasks on our data; instead of you having to write them all, someone else has written them for you. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is imported, it can be used or called to perform many tasks.
Python doesn’t load all of the libraries available to it by default -- that would be too inefficient. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax:
import libraryName
If we want to give the library a nickname to shorten the commands we need to type, we can do it like this:
import libraryName as nickNameHere
An example of importing the pandas library using the very common nickname pd is below (and also in the notebook):
import pandas as pd
Luckily for us, Anaconda installs a lot of commonly used libraries, like pandas, so we don't have to do anything but import them. If you are using less common libraries, you would need to also install them using Anaconda or the Python installer, pip.
matplotlib is the most widely used Python library for plotting. We can run it right in the notebook using the magic command
%matplotlib inline
See the IPython docs for other options to pass to the magic command.
In this lesson, we will only use matplotlib since it is the standard, basic plotting tool in Python. There are a whole range of graphics packages in Python, ranging from better basic visualizations, like ggplot, to packages like seaborn that beautify matplotlib graphs, and still others that make sweet interactive graphics, like bokeh and plotly. None of these are included in the default Anaconda install.
We encourage you to explore on your own! Chances are, if you can imagine a plot you'd like to make, somebody else has written a package to do it. (Although you have to install those packages first...)
One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. pandas). The pandas library provides data structures, produces high quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays (NumPy is another Python library, one that provides a lot of very useful numerical tools).
One of the best parts of the pandas library is that the developers have made it as easy to use as possible, written great documentation, and even made a cheat sheet that has a lot of useful info!
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
1. What is the output of the commands above?
Now each time we call a function that’s in a library, we use the syntax:
LibraryName.FunctionName
Adding the library name with a . before the function name tells Python where to find the function. In the example above, we imported pandas as pd. This means we don't have to type out pandas each time we call a pandas function.
We will begin by locating and reading our data, which are in a table format. We can use the pandas read_table function to pull the file directly into a DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet or an SQL table or the data.frame in R.
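To make that concrete, here is a minimal sketch (with made-up names and values, not our actual data) of building a small DataFrame where each column holds a different type:

import pandas as pd

# a tiny, made-up DataFrame with mixed column types
critters = pd.DataFrame({'species': ['newt', 'frog', 'toad'], 'count': [12, 30, 7], 'mass_g': [8.5, 22.1, 40.0]})
print(critters.dtypes) # each column keeps its own type (object, int64, float64)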
A DataFrame always has an index (0-based since it's Python). An index refers to the position of an element in the data structure and is super useful for some of the pandas methods.
pandas can also take apart DataFrames into columns by making a specialized data structure called a Series (a 1D structure that still has an index). This allows many pandas methods to work on one column at a time, or on the whole DataFrame column by column. pandas can also perform what are called vectorized operations, which means that the iteration is already built in: you don't have to loop through values in a list, you can just call a method on a Series (a quick sketch of this is below). But enough of that for now, let's get to some data...
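Here is that minimal sketch (with made-up numbers, not our data) of operating on a whole Series at once, no loop required:

import pandas as pd

pops = pd.Series([1200, 3400, 5600]) # hypothetical populations, in thousands
print(pops * 1000) # multiplies every value in the Series, no for loop needed
print(pops.mean()) # one method call summarizes the whole column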
The commands below load the data directly from the GitHub repository where the data is stored.
This particular data set is tab separated, since the columns in the data set are separated by a TAB. We need to tell the read_table function in pandas that this is the case with the sep='\t' argument. (This \t for tab is kind of like the \n character that we have seen for a new line in a text file.)
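If you want to convince yourself that \t really is a single tab character, here is a quick sketch:

print('a\tb') # prints a and b with an actual tab between them
print(len('\t')) # 1 -- the backslash-t pair is one character to Python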
The fact that this data is tab separated as opposed to the more common comma separated (.csv) data is just a quirk of how we came by the data, and databases and instruments often have complex or even proprietary data formats, so get used to it.
run the code below to bring in the data and assign it to the identifier gapminder.
In [ ]:
# url where data is stored
url = "https://raw.githubusercontent.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/master/gapminderDataFiveYear_superDirty.txt"
# assigns the identifier 'gapminder' to the entire dataset
gapminder = pd.read_table(url, sep = "\t")
2. In the command above, why did we use pd.read_table, not just read_table or pandas.read_table?
Now we have a DataFrame called gapminder with all of our data in it!
One of the first things we do after importing data is to just have a look and make sure it looks like what we think it should.
In this case, this is your first look, so type the name of the DataFrame and run the code cell below to see what happens...
In [ ]:
## DataFrame name here
There are usually too many rows to print to the screen, and you don't really want to find out by printing them all and regretting it. By default, when you type the name of the DataFrame and run a cell, pandas knows not to print the whole thing if there are a lot of rows. Instead, you will see the first and last few rows with dots in between.
A neater way to look at a preview of the dataset is by using the head() method. Calling DataFrame.head() will display the first 5 rows of the data (this is also an example of the "dot" notation: we want the "head" of gapminder, so we write gapminder.head()). You can specify how many rows you want to see as an argument, like DataFrame.head(10). The tail() method does the same with the last rows of the DataFrame.
Use these methods below to get an idea of what the gapminder DataFrame looks like.
In [ ]:
## head and tail methods
3. Make a few observations about what you see in the data so far (you don't have to go too deep since we are going to use pandas to look next).
We often want to know how many rows and columns are in the data -- we want to know what is called the "shape" attribute of the DataFrame. pandas has a convenient way of getting that information: the DataFrame.shape attribute (using DataFrame as a generic name for a, well, DataFrame). This returns a tuple (values separated by commas) representing the dimensions of the DataFrame: (rows, columns).
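As a quick illustration on a made-up DataFrame (not gapminder), shape comes back as a (rows, columns) tuple; also note there are no parentheses after shape, since it is an attribute, not a method:

import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(toy.shape) # (3, 2) -- 3 rows, 2 columns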
write code to use shape to get the shape of the gapminder DataFrame:
In [ ]:
## shape
4. How many rows and columns are there?
The info() method quickly gives a few useful pieces of information, including the shape of the DataFrame, the variable type of each column, and the amount of memory used.
run the code in the cell below
In [ ]:
# get info
gapminder.info()
There are several problems with this data set as it is. BTW, this is super common; data nerd slang for fundamentally useful data with issues is "dirty data", and the process for un-dirtying it is called "data cleaning". (Believe it or not, tidy data is yet again a different thing; we'll get to that later.)
The first step in data cleaning is identifying the problems. The info above has already revealed one major problem and one minor/annoying problem.
5. What types of variables are in the gapminder DataFrame and what values can they take? (region is a little tricky)
6. Some of your data types should seem odd to you (hint - party like it's 1999.9?). Which ones and why?
7. Look at the number of entries in the entire DataFrame at the top (RangeIndex) and compare that to each column. Explain what you think is happening.
There are other fast, easy, and informative ways to get a sense of what your data might look like and any issues that it has. The describe() method will take the numeric columns and give a summary of their values. This is useful for getting a sense of the ranges of values and seeing if there are any unusual or suspicious numbers.
Use the describe() method on the gapminder DataFrame in the cell below
In [ ]:
## describe
Uh-oh, what happened to the percentiles? Also notice that you get an error, but it still gives you some output.
The DataFrame method describe() just blindly looks at all numeric variables. First, we wouldn't actually want to take the mean year. Additionally, we obtain 'NaN' (not a number) values for our quartiles. This suggests we might have missing data, which we can (and will) deal with shortly when we begin to clean our data.
For now, let's pull out only the columns that are truly continuous numbers (i.e. ignore the description for 'year'). This is a preview of selecting columns of the data; we'll talk more about how to do it later in the lesson, but many methods that work on DataFrames work this way if you use DataFrame[['column_name']].method() -- the brackets should remind you of indexing a list or a string, and the 'column_name' needs to be in quotes since it is itself a string.
run the code in the cell below
In [ ]:
# use describe but only on columns with continuous values
gapminder[['pop', 'life Exp', 'gdpPercap']].describe()
8. We haven't really nailed this problem down yet. Reflect (or look) back on your answer to Question #7 and think about what might be happening (don't worry too much about the mathematical specifics).
Let's get after it.
The value_counts() method gives you a quick idea of what kinds of values are in your categorical data, such as strings. In this case our categorical data is in the region column and represents the names of the regions where the data came from.
Important info: The data set covers 12 years, so each region should appear 12 times.
use value_counts() on gapminder to see if all regions have 12 rows/entries as expected
In [ ]:
## use value_counts to find out how many times each unique region occurs
Uh-oh! The table reveals several problems.
9. Describe and Explain the (several) problems that you see.
Missing data is an important issue to handle. As we've seen, ignoring it doesn't make it go away; in our example, it has been giving us problems in our describe() results -- remember the NaN values?
There are different ways of dealing with missing data, including removing the rows with missing values, filling in (imputing) the missing values, or analyzing only the available data. All of them have advantages and disadvantages, so choose carefully and with good reasons!
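Here is a minimal sketch, on a made-up DataFrame rather than gapminder, of the two most mechanical options (dropping versus filling); note that neither call changes the original DataFrame, a point we will come back to shortly:

import pandas as pd
import numpy as np

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, np.nan]})
print(toy.dropna()) # keeps only the rows with no missing values
print(toy.fillna(0)) # replaces every missing value with 0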
For our purposes with the dirty gapminder data set, we know our missing data is excess (and unnecessary) and we are going to choose to analyze only the available data. To do this, we will simply remove rows with missing values.
10. Wait, why do we think it's extra data? Justify! Also include from what regions you expect to lose observations.
In large tabular data that is used for analysis, missing data is usually coded NA (it stands for Not Available and various other things), although there are other possibilities such as NaN (as we saw above) if a function or import method expects a number. NA can mean several things under the hood, but using NA for missing values in your data is your best bet, and most of the methods will expect that (and even have it in their name).
Let's find out how many NAs there are
We are going to chain 2 steps together to determine the number of NA/null values in the gapminder DataFrame.
isnull() returns a boolean for each cell in the DataFrame - True if the value in the cell is NA, False if it is not.
sum() adds up the values in each column.
True == 1 and False == 0 in Python, so if you sum() the results of isnull() you get the number of NA/null values in each column -- awesome!
In [ ]:
# isnull results in T/F for each cell,
# sum adds them up by column since F=0 and T=1 so #=#NAs
gapminder.isnull().sum()
Yikes! There are NA values in each column except region.
Removing NAs from a DataFrame is incredibly easy to do because pandas allows you to either remove all instances with NA/null data or replace them with a particular value. (Sometimes too easy, so make sure you are careful!)
The method df = df.dropna() (here df is short for DataFrame) drops rows with any column having NA/null data. df = df.fillna(value) replaces all NA/null data with the argument value. You have to use the assignment to the same (or a different) identifier since dropna() does not work "in place": if you don't assign the result to something, it just prints and nothing happens to the actual DataFrame.
use dropna() to remove the NAs from the gapminder DataFrame
In [ ]:
## before you rip into removing, it's a good idea to know how many rows you have
## you know a few ways to get this info, write some code here to get # of rows
In [ ]:
## now use dropna()
In [ ]:
## now get the number of rows again
In [ ]:
## use isnull() again to confirm
11. How many rows were removed? Make sure the code immediately above justifies this.
12. Which regions lost rows? Run code to see if there are still NAs and also to identify regions with too many observations. (hint: it won't be perfect yet)
In [ ]:
## your code
Are we done yet? Oh, no, we certainly are not...
One more nice trick. If you want to examine a subset (more later) of the rows, for example to examine the region with too many observations, you can use code like this:
gapminder[gapminder.column == 'value_in_column']
where column is the name of a column and 'value_in_column' is the value you're interested in, for example the region name.
write code in the cell below to look at all of the rows from regions that have a problematically high number of observations.
In [ ]:
## code to examine rows from regions with too many obs
Remember above where we had some strange/annoying variable types? Now we can fix them. The methods below would have failed if there were NAs in the data, since they look at the type of the data and try to change it. Inconsistencies in data types and NAs cause problems for a lot of methods, so we often deal with them first when cleaning data.
We can change (or type cast) the inappropriate data types with the astype() method, like so:
DataFrame.astype(dtype)
A few important things to consider!
If you call astype() on an entire DataFrame, it will try to cast the entire DataFrame to the same type, and you do not want to go there! So we need to specify which columns we want to cast; see below...
There are several ways to select only some columns of a DataFrame in pandas, but the easiest and most intuitive is usually to just use the name. It is very much like indexing a list or a string and looks like: DataFrame['column_name'], where column_name is the column name. So in the context of our astype() method call we would want:
DataFrame['column_name'].astype(dtype)
to only cast the type of a single column.
Run the code in the cell below to see an example:
In [ ]:
##type casts the year column data from float to int
gapminder['year'] = gapminder['year'].astype(int)
Now you use astype() to type cast the other problematic column, and use info() to make sure these two operations worked.
In [ ]:
## fix other column and make sure all is ok
Progress!
Your data types should make more sense now and since we removed the NA values above, the total number of rows or entries is the same as the non-null rows. Nice!
Sometimes observations can end up in the data set more than once, creating a duplicate row. Luckily, pandas has methods that allow us to identify which observations are duplicates. The method df.duplicated() will return a boolean value for each row in the DataFrame telling you whether or not that row is an exact repeat.
Run the code below to see how it works.
In [ ]:
## get T/F output for each row if it is a duplicate (only look at top 5)
gapminder.duplicated().head()
Wow, there is a repeat in the first 5 rows!
Write code below that allows you to confirm this by examining the DataFrame.
In [ ]:
## confirm duplicate in top 5 rows of df
In [ ]:
## you can use the .sum() to count the number of duplicated rows in the DataFrame
In cases where you don't want exactly repeated rows (we don't -- we only want each country to be represented once for every relevant year), you can easily drop such duplicate rows with the method df.drop_duplicates().
Note: drop_duplicates() is another method that does not work in place, so if you want the DataFrame to no longer have the duplicates, you need to assign it again: df = df.drop_duplicates()
Warning: you always want to be cautious about dropping rows from your data.
13. Justify why it's ok for us to drop these rows (a mechanical reason that has to do with the structure of the data itself, the "experimental" reason is above).
write code below to remove the duplicated rows and confirm that the duplicate in the first 5 rows is gone and/or that all of the duplicates are gone.
In [ ]:
## remove duplicates and confirm
More Progress!
We can reset our indices easily with the method reset_index(drop=True). Remember, Python is 0-indexed, so our indices will be valued 0-1703. The drop=True parameter drops the old index (as opposed to placing it in a new column, which is useful sometimes but not here).
The concept of reindexing is important. When we remove some of the messier, unwanted data, we end up with "gaps" in our index values. By correcting this, we can improve our search functionality and our ability to perform iterative functions on our cleaned data set.
Note: reset_index() is yet another method that does not work in place, so if you want the DataFrame to keep the new index, you need to assign it again...
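To see why this matters, here is a minimal sketch on a made-up DataFrame: dropping a row leaves a gap in the index, and reset_index(drop=True) closes it:

import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]})
toy = toy.drop(1) # drop the middle row; the index is now 0, 2
toy = toy.reset_index(drop=True) # the index is now 0, 1 again
print(toy)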
write code that resets the index of gapminder and shows you the top 5 rows so you can see it has changed
In [ ]:
## reset index and check top 5
In [ ]:
## get value counts for the region column
14. Describe and Explain...
14a. what is better now that we have done some data cleaning.
14b. what is still in need of being fixed.
Very common problems with string variables are inconsistent capitalization and extra white space at the beginning or end of the string.
These issues are problematic since upper and lower case characters are considered different characters by Python (consider that 'abc' == 'ABC' evaluates to False) and any extra character in a string makes it different to Python (consider that 'ABC' == ' ABC' evaluates to False).
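You can check these claims for yourself with plain Python string methods (the pandas versions that work on whole columns are below):

print('abc' == 'ABC') # False -- case matters
print('ABC' == ' ABC') # False -- the leading space counts as a character
print(' ABC '.strip().lower()) # 'abc' -- stripped and lowercased, now it would match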
The following three pandas string methods (hence the str) strip the white space from the left and right ends of each string and put everything in lowercase, respectively.
df['column_name'].str.lstrip() # Strip white space on left
df['column_name'].str.rstrip() # Strip white space on right
df['column_name'].str.lower() # Convert to lowercase
Note: none of these methods work in place, so if you want the DataFrame to keep the cleaned values, you need to assign the result again (df['column_name'] = df['column_name'].method())...
write code that strips the white space from both sides of the values in the region column, makes all of the values lower case, and shows that it has been accomplished (to make sure you've reassigned).
In [ ]:
## write code to strip white space on both left and right of region names
## convert region names to lowercase, and print out df to confirm
As a side note, one of the coolest things about pandas is an idea called chaining. If you prefer, the three commands can be written in one single line!
df['column_name'] = df['column_name'].str.lstrip().str.rstrip().str.lower()
You may be wondering about the order of operations: the chain starts closest to the data (the DataFrame name, df here) and moves away, performing the methods in order. This is a bit advanced, but can save a lot of typing once you get used to it. The downside is that it can be harder to troubleshoot -- it can leave you asking "which command in my long, slick-looking chain produced the weird error?"
Are we there yet???
15. What is still wrong with the region column?
A regular expression, aka regex, is a powerful search technique that uses a sequence of characters to define a search pattern. In a regular expression, the symbol "*" matches the preceding character 0 or more times, whereas "+" matches the preceding character 1 or more times. "." matches any single character. Writing "x|y" means to match either "x" or "y".
For more regex shortcuts (cheatsheet): https://www.shortcutfoo.com/app/dojos/regex/cheatsheet
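Before turning regex loose on a DataFrame, it can help to test patterns on plain strings with Python's built-in re module; here is a minimal sketch of the symbols described above (the words are made up just for illustration):

import re

print(bool(re.search('co*ngo', 'cngo'))) # True -- '*' allows zero 'o's
print(bool(re.search('co+ngo', 'cngo'))) # False -- '+' requires at least one 'o'
print(bool(re.search('c.ngo', 'cungo'))) # True -- '.' matches any single character
print(bool(re.search('cat|dog', 'hotdog'))) # True -- matches the 'dog' alternative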
pandas allows you to use regex in its replace() method -- when a regex term is found in an element, the element is then replaced with the specified replacement term. In order for it to appropriately correct elements, both the regex and inplace arguments need to be set to True (their defaults are False). This ensures that the initial input string is read as a regular expression and that the elements will be modified in place.
For more documentation on the replace method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
Using regex and replace() is easy and powerful and potentially dangerous
Here's an incorrect regex example: we create a temporary copy (notice the temp = assignment) in which a regex statement inside replace() identifies all values that contain the term "congo" and replaces them with "africa_dem rep congo". Unfortunately, this creates 24 instances of the Democratic Republic of the Congo (and there should only be 12) -- this is an error in our cleaning!
run the code below to go through the incorrect example (and let it be a warning to ye)
In [ ]:
# gives a problem -- 24 values of the congo!
temp = gapminder['region'].replace(".*congo.*", "africa_dem rep congo", regex=True)
temp.value_counts()
16. What happened?
In [ ]:
## this should help
# shows all the rows that have 'congo' in the name
gapminder[gapminder['region'].str.contains('congo')]
We can revert back to working on the non-temporary DataFrame and correctly modify our regex to isolate only the Democratic Republic of the Congo instances (as opposed to including the Republic of the Congo as well).
As noted above, regular expressions (regex) provide a powerful tool for fixing errors that arise in strings. In order to correctly label the two different countries that include the word "congo", we need to design and use (via df.replace()) a regex that correctly differentiates between the two countries.
Recall that the "." is the wildcard (matching any single character); combining this with "*" (as ".*") allows us to match any sequence of characters, of any length. By combining these symbols with substrings corresponding to variations in the naming of the Democratic Republic of the Congo, we can correctly normalize the name.
If you feel that the use of regex is not particularly straightforward, you are absolutely correct -- appropriately using these tools takes a great deal of time to master. When designing regex for these sorts of tasks, you might find the following prototyper helpful: https://regex101.com/
run the code below to fix the rows for the Democratic Republic of the Congo
In [ ]:
# fix the two different versions of the incorrect name to the same correct version
gapminder['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True)
gapminder['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True)
## you write code to check to make sure it's fixed
Now that we've taken a close look at how to properly design and use regex to clean string entries in our data, let's try to normalize the naming of a few other countries.
Using the pandas code we constructed above as a template, write similar code (using df.replace()) to fix the names of all of the rows/entries for the Ivory Coast and Canada to "africa_cote d'ivoire" and "americas_canada", respectively.
In [ ]:
## code to fix other incorrect names
Are we there yet??!??!
17. Looks like we successfully cleaned our data! Justify that we are done.
In [ ]:
Having what is called a Tidy data set can make cleaning and analyzing your data much easier.
Two of the important aspects of Tidy data are: each variable forms its own column, and each observation forms its own row.
There are other aspects of Tidy data; here is a good blog post about Tidy data in Python.
Currently the dataset has a single column for continent and country (the ‘region’ column).
18. Why is having a single column for continent and country not Tidy?
Making gapminder Tidy
We can split the region column into two by using the underscore that separates continent from country. We can create a new column in the DataFrame by naming it before the = sign:
gapminder['country'] =
The following commands use the string method split() to split the string at the underscore (the first argument), which results in a list of two elements: before and after the _. The second argument, n=1, tells split() that the split should take place only at the first occurrence of the underscore. Then the str[] specifies which item in the resulting Series to return.
In [ ]:
# split region into country and continent to tidy data
gapminder['country'] = gapminder['region'].str.split('_', n=1).str[1]
gapminder['continent'] = gapminder['region'].str.split('_', n=1).str[0]
gapminder.head()
19. Describe what happens for each of the lines of code that contains split(). Be sure to explain why the 2 columns end up with different values.
In [ ]:
## mess around with it if you need to, but make a 'temp' version of gapminder to play with!
We have now added the columns country and continent, but we still have the old region column as well. In order to remove that column we use the drop() method. The first argument of the drop() method is the name of the element to be dropped. The second argument is the axis number: 0 for rows, 1 for columns.
*Note: any time we are getting rid of stuff, we want to make sure that we are doing it for a good reason and that we know our data will be ok after. You might want to double check your new columns before you drop the old one*
In [ ]:
## check new columns
In [ ]:
# drop old region column
gapminder = gapminder.drop('region', axis=1)
gapminder.head()
Finally, it is a good idea to look critically at your column names themselves and to be as consistent as possible when naming columns. We often use all lowercase for all column names to avoid accidentally making names that can be confusing -- gdppercap and gdpPercap and GDPpercap are all the same name, but gdpercap is different.
Avoid spaces in column names to simplify manipulating your data. Also look out for lingering white space at the beginning or end of your column names.
Run the following code that turns all column names to lowercase.
In [ ]:
# turns all column names to lowercase
# yes you need the .columns on the left side too
gapminder.columns = gapminder.columns.str.lower()
gapminder.head()
We also want to remove the space from the life exp column name. We can do that with the pandas rename method. It takes a dictionary as its argument, with the old column names as keys and new column names as values.
If you're unfamiliar with dictionaries, they are a very useful data structure in Python. You can read more about them here.
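If a quick refresher helps, here is a minimal sketch of a dictionary (the keys and values are made up): curly braces around key : value pairs, which is exactly the old name : new name shape that rename expects:

ages = {'newt': 3, 'frog': 5} # keys map to values
print(ages['frog']) # 5 -- look up a value by its key
ages['toad'] = 2 # add a new key : value pair
print(ages) # {'newt': 3, 'frog': 5, 'toad': 2}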
In [ ]:
# rename column
gapminder = gapminder.rename(columns={'life exp' : 'lifeexp'})
gapminder.head()
20. Explain why the data set at this point is Tidy, or at least much tidier than it was before.
Now that we have a clean and tidy data set in a DataFrame, we want to export it and give it a new name so that we can refer to it and use it later.
df.to_csv('file name for export', index=False) # index=False keeps index out of columns
For more info on this method, check out the docs for to_csv()
In [ ]:
# exports gapminder_CandT.csv
gapminder.to_csv('gapminder_CandT.csv', index=False)
There are many ways in which you can manipulate a Pandas DataFrame - here we will discuss only two: subsetting and sorting.
Sometimes you only want part of a larger data set; then you would subset your DataFrame. Other times you want to sort the data into a particular order (year from most recent to oldest, GDP from lowest to highest, etc.).
In [ ]:
# first 15 rows
gapminder[0:15] # could also be gapminder[:15]
write code to take a different slice
In [ ]:
## your slice
21. Predict what the slice below will do.
gapminder[-10:]
In [ ]:
## run the code here to test prediction.
22a. What does the negative number (in the cell above) mean?
22b. What happens when you leave the space before or after the colon empty?
write code to take another different slice
In [ ]:
## your 'nother slice
Subsetting can also be done by selecting for a particular value in a column.
For instance, to select all of the rows that have 'africa' in the column 'continent':
gapminder_africa = gapminder[gapminder['continent']=='africa']
Note the double equal sign: Remember that single equal signs are used in Python to assign something to a variable. The double equal sign is a comparison: in this case, the value from the column on the left has to be exactly equal to the string to the right.
Also note that we made a new DataFrame to contain our subset of the data from Africa.
In [ ]:
# makes a new DataFrame with just data from Africa
gapminder_africa = gapminder[gapminder['continent']=='africa']
## write code to quickly check that this worked
In [ ]:
# sort by year, from most recent to oldest
gapminder.sort_values(by='year', ascending = False)
Note: the sort_values() method does not sort in place.
write code to prove it.
In [ ]:
## your code, sort_values() not in place
23. Make a new variable with a sorted version of the DataFrame organized by country, from 'Afghanistan' to 'Zimbabwe'. Also include code to show that it is sorted correctly.
In [ ]:
## alphabetical by country
Summaries and statistics are very useful for initial examination of data as well as in depth analysis. Here we will only scratch the surface.
Plots/graphs/charts are also great visual ways to examine data and illustrate patterns.
Exploring your data is often iterative - summarize, plot, summarize, plot, etc. - sometimes it branches, sometimes there are more cleaning steps to be discovered...
Let's try it!
Remember that the info() method gives a few useful pieces of information, including the shape of the DataFrame, the variable type of each column, and the amount of memory used. We can see many of our changes (continent and country columns instead of region, a different number of rows, etc.) reflected in the output of the info() method.
In [ ]:
# review info()
gapminder.info()
We also saw above that the describe() method will take the numeric columns and give a summary of their values. We have to remember that we changed the column names, and this time it shouldn't have NAs.
In [ ]:
# review describe
gapminder[['pop', 'lifeexp', 'gdppercap']].describe()
In [ ]:
# population mean
gapminder['pop'].mean()
What if we want to know the mean population by continent? Then we need to use the pandas groupby() method and tell it which column we want to group by.
In [ ]:
# population mean by continent
gapminder[['continent', 'pop']].groupby(by='continent').mean()
What if we want to know the median population by continent?
The method that gives you the median is called median().
write code below to get the median population by continent
In [ ]:
## population median by continent
Or the number of entries (rows) per continent?
In [ ]:
# count number of rows
gapminder[['continent', 'country']].groupby(by='continent').count()
Sometimes we don't want a whole DataFrame. Here is another way to do this that produces a Series (as opposed to a DataFrame) that tells us the number of entries (rows).
In [ ]:
# get size by continent
gapminder[['continent', 'country']].groupby(by='continent').size()
We can also look at the mean GDP per capita of each country:
write code below to get the mean GDP per capita of each country
In [ ]:
## mean GDP per capita by country
What if we wanted a new DataFrame that just contained these summaries? This could be a table in a report, for example.
In [ ]:
# pretty slick, right?!
continent_mean_pop = gapminder[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct,continent_mean_pop,continent_median_pop], axis=1)
gapminder_summs
matplotlib
Recall that matplotlib is Python's main visualization library. It provides a range of tools for constructing plots, and numerous high-level and add-on plotting libraries are built with matplotlib in mind. When we were in the early stages of setting up our analysis, we loaded these libraries like so:
In [ ]:
# import again just in case
import matplotlib.pyplot as plt
In [ ]:
## generate some toy data and plot it
# required to generate toy data
import numpy as np
# magic to plot straight into notebook, probably no longer needed.
# %matplotlib inline
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)
# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
In [ ]:
# example histogram
# generate some random numbers from a normal distribution
data = 100 + np.random.randn(500)
# make a histogram with 20 bins
plt.hist(data, 20)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
write code below to make a histogram of the distribution of life expectancies of the countries in the gapminder DataFrame.
In [ ]:
## histogram of lifeexp
In this section, we've taken the raw data and worked through steps to prepare it for analysis, but we have not yet done any "data analysis". This part of the data workflow can be thought of as "exploratory data analysis", or EDA. Many of the steps we've shown are aimed at uncovering interesting or problematic things in the dataset that are not immediately obvious. We want to stress that when you're doing EDA, it will not necessarily be a linear workflow like what we have shown. When you plot or summarize your data, you may uncover new issues: we saw this when we made a mistake fixing the naming conventions for the Democratic Republic of Congo. You might discover outliers, unusually large values, or points that don't make sense in your plots. Clearly, the work here isn't done: you'll have to investigate these points, decide how to fix any potential problems, document the reasoning for your actions, and check that your fix actually worked.
On the other hand, plots and summaries might reveal interesting questions about your data. You may return to the cleaning and prepping steps in order to dig deeper into these questions. You should continuously refine your plots to give the clearest picture of your hypotheses.
In [ ]: