Unit 2: Programming Design

Lesson 14: Packages and Data Analysis

Notebook Authors

(fill in your two names here)

Facilitator: (fill in name)
Spokesperson: (fill in name)
Process Analyst: (fill in name)
Quality Control: (fill in name)

If there are only three people in your group, have one person serve as both spokesperson and process analyst for the rest of this activity.

At the end of this Lesson, you will be asked to record how long each Model required for your team. The Facilitator should keep track of time for your team.

Science Context: (Reproducible) Data Analysis

After the preactivity, you are at least a little more familiar with two ideas:

  1. Very often scientific data is in a table/spreadsheet/DataFrame format (columns and rows)
  2. Reproducibility of scientific data analysis is challenging, important to consider, and more doable with the correct tools (like Jupyter)

Outline/Objectives

In this activity we will use the pandas library of tools to analyze a DataFrame. This introduction to analyzing tabular data in Python will include an introduction to:

  • loading data from a .csv or similar text file into a Jupyter Notebook environment
  • importing and using a library
  • basic data cleaning and wrangling
  • "tidy" data
  • pandas functions and methods for DataFrames
  • a brief intro to regex
  • basic data summary info
  • very quick intro to plotting with matplotlib

A note about the data in this activity:
The data originally comes from Gapminder, which describes itself as "a fact tank [that] fights devastating misconceptions about global development ... making the world understandable based on reliable statistics". It has been processed lightly by the Data Carpentry community for data analysis activities. It might seem a bit less sciency (unless you're into global development and public health), but the process of analyzing tabular data -- doing statistical or other analysis of numerical and categorical data -- is the same...

This activity is heavily based on the Data Exploration lesson developed by the Data Carpentry community (Brian was a participant/developer).

Setting up the notebook

About Libraries in Python

A library in Python contains a set of tools (called functions) that perform tasks on our data; instead of you having to write them all yourself, someone else has written them for you. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is imported, it can be used or called to perform many tasks.

Python doesn’t load all of the libraries available to it by default -- that would be too inefficient. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax:

import libraryName

If we want to give the library a nickname to shorten the commands we need to type, we can do it like this:

import libraryName as nickNameHere

An example of importing the pandas library using the very common nickname pd is below (and also in the notebook):

import pandas as pd

Luckily for us, Anaconda installs a lot of commonly used libraries, like pandas, so we don't have to do anything but import them. If you are using less common libraries, you would also need to install them using Anaconda or the Python package installer, pip.
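For example, if you ever need an extra package, it can be installed from inside a notebook with IPython's %pip magic (seaborn here is just a stand-in for whatever package you're missing):

In [ ]:
# install an extra package from inside the notebook
# (seaborn is just an example package name)
%pip install seaborn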

matplotlib and other plotting libraries

matplotlib is the most widely used Python library for plotting. We can run it right in the notebook using the magic command

%matplotlib inline

See the IPython docs for other options to pass to the magic command.

In this lesson, we will only use matplotlib since it is the standard, basic plotting tool in Python. There is a whole range of graphics packages in Python: some offer alternative approaches to basic visualization (like ggplot), some build nicer-looking graphs on top of matplotlib (like seaborn), and still others make sweet interactive graphics (like bokeh and plotly). Not all of these are included in the default Anaconda install.

We encourage you to explore on your own! Chances are, if you can imagine a plot you'd like to make, somebody else has written a package to do it. (Although you have to install those packages first...)

The Pandas Library

One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. pandas). The pandas library provides data structures, produces high quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays (NumPy is another Python library that provides a lot of very useful numerical tools).

Resources

One of the best parts of the pandas library is that the developers have made it as easy to use as possible: it has great documentation, and they even made a cheat sheet with a lot of useful info!

Let's Go!


In [ ]:
import pandas as pd
import matplotlib.pyplot as plt

1. What is the output of the commands above?

Now each time we call a function that’s in a library, we use the syntax:

LibraryName.FunctionName

Adding the library name with a . before the function name tells Python where to find the function. In the example above, we imported pandas as pd. This means we don't have to type out pandas each time we call a pandas function.

We will begin by locating and reading our data, which are in a table format. We can use pandas' read_table function to pull the file directly into a DataFrame.

Getting data into the notebook

What’s a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet or an SQL table or the data.frame in R.

A DataFrame always has an index (0-based since it's Python). An index refers to the position of an element in the data structure and is super useful for some of the pandas methods.

pandas can also take apart DataFrames into columns by making a specialized data structure called a Series (a 1D structure that still has an index). This allows many pandas methods to work on one column at a time, or on the whole DataFrame column by column. pandas can also perform what are called vectorized operations, which means the iteration is already built in: you don't have to loop through values in a list, you can just call a method on a Series. But enough of that for now, let's get to some data...
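For instance, here's a tiny sketch (toy numbers, not the gapminder data) of a Series and a vectorized operation:

In [ ]:
import pandas as pd

# a toy Series: a 1D, indexed column of values
pops = pd.Series([33.1, 4.5, 61.0], name='pop')

# a vectorized operation: every element is multiplied at once, no loop needed
pops_millions = pops * 1_000_000
pops_millions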

Load data

The commands below load the data directly from the GitHub repository where the data is stored.

This particular data set is tab separated since the columns in the data set are separated by a TAB. We need to tell the read_table function in pandas that this is the case with the sep='\t' argument. (this \t for tab is kind of like the \n character that we have seen for a new line in a text file)

The fact that this data is tab separated, as opposed to the more common comma separated (.csv) format, is just a quirk of how we came by the data; databases and instruments often have complex or even proprietary data formats, so get used to it.

run the code below to bring in the data and assign it to the identifier gapminder.


In [ ]:
# url where data is stored
url = "https://raw.githubusercontent.com/Reproducible-Science-Curriculum/data-exploration-RR-Jupyter/master/gapminderDataFiveYear_superDirty.txt"

# assigns the identifier 'gapminder' to the entire dataset
gapminder = pd.read_table(url, sep = "\t")

2. In the command above, why did we use pd.read_table, not just read_table or pandas.read_table?

Now we have a DataFrame called gapminder with all of our data in it!

One of the first things we do after importing data is to just have a look and make sure it looks like what we think it should.

In this case, this is your first look, so type the name of the DataFrame and run the code cell below to see what happens...


In [ ]:
## DataFrame name here

There are usually too many rows to print to the screen, and you don't want to find that out by printing them all and regretting it. By default, when you type the name of the DataFrame and run a cell, pandas knows not to print the whole thing if there are a lot of rows. Instead, you will see the first and last few rows with dots in between.

A neater way to look at a preview of the dataset is by using the head() method. Calling DataFrame.head() will display the first 5 rows of the data (this is also an example of the "dot" notation: we want the "head" of gapminder, so we write gapminder.head()). You can specify how many rows you want to see as an argument, like DataFrame.head(10). The tail() method does the same with the last rows of the DataFrame.

Use these methods below to get an idea of what the gapminder DataFrame looks like.


In [ ]:
## head and tail methods

3. Make a few observations about what you see in the data so far (you don't have to go too deep since we are going to use pandas to look next)

Assess structure and cleanliness

How many rows and columns are in the data?

We often want to know how many rows and columns are in the data -- we want what is called the "shape" attribute of the DataFrame. pandas has a convenient way of getting that information: DataFrame.shape (using DataFrame as a generic name for a, well, DataFrame). This returns a tuple (values separated by commas) representing the dimensions of the DataFrame (rows, columns).
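Here's a quick sketch of what that looks like on a toy DataFrame (made-up values, not the gapminder data):

In [ ]:
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
toy.shape  # (3, 2): 3 rows, 2 columns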

write code to use shape to get the shape of the gapminder DataFrame:


In [ ]:
## shape

4. How many rows and columns are there?

The info() method gives a few useful pieces of information quickly, including the shape of the DataFrame, the variable type of each column, and the amount of memory used.

run the code in the cell below


In [ ]:
# get info
gapminder.info()

There are several problems with this data set as it is. BTW, this is super common, and the data-nerd slang for fundamentally useful data with issues is "dirty data"; the process of un-dirtying it is called "data cleaning". (believe it or not, tidy data is yet another, different thing; we'll get to that later)

The first step in data cleaning is identifying the problems. The info above has already revealed one major problem and one minor/annoying problem.

Let's investigate!

5. What types of variables are in the gapminder DataFrame and what values can they take? (region is a little tricky)

6. Some of your data types should seem odd to you (hint - party like it's 1999.9?). Which ones and why?

7. Look at the number of entries in the entire DataFrame at the top (RangeIndex) and compare that to each column. Explain what you think is happening.

Let's keep looking

There are other fast, easy, and informative ways to get a sense of what your data might look like and any issues that it has. The describe() method will take the numeric columns and give a summary of their values. This is useful for getting a sense of the ranges of values and seeing if there are any unusual or suspicious numbers.

Use the describe() method on the gapminder DataFrame in the cell below


In [ ]:
## describe

Uh-oh, what happened to the percentiles? Also notice that you get an error, but it still gives you some output.

The DataFrame method describe() just blindly looks at all numeric variables. First, we wouldn't actually want to take the mean of year. Additionally, we obtain 'NaN' (not a number) values for our quartiles. This suggests we might have missing data, which we can (and will) deal with shortly when we begin to clean our data.

For now, let's pull out only the columns that are truly continuous numbers (i.e. ignore the description for 'year'). This is a preview of selecting columns of the data; we'll talk more about how to do it later in the lesson, but many DataFrame methods can be applied to selected columns this way: DataFrame[['column_name']].method() -- the brackets should remind you of indexing a list or a string, and the 'column_name' needs to be in quotes since it is itself a string.

run the code in the cell below


In [ ]:
# use describe but only on columns with continuous values
gapminder[['pop', 'life Exp', 'gdpPercap']].describe()

8. We haven't really nailed this problem down yet. Reflect (or look) back on your answer to Question #7 and think about what might be happening (don't worry too much about the mathematical specifics).

Let's get after it.

The value_counts() method gives you a quick idea of what values are in your categorical data, such as strings, by counting how many times each unique value occurs. In this case our categorical data is in the region column and represents the names of the regions where the data came from.
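A toy sketch of the output format (made-up values):

In [ ]:
import pandas as pd

# counts of each unique value, most common first
pd.Series(['africa', 'asia', 'africa']).value_counts()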

Important info: The data set covers 12 years, so each region should appear 12 times.

use value_counts() on gapminder to see if all regions have 12 rows/entries as expected


In [ ]:
## use value_counts to find out how many times each unique region occurs

Uh-oh! The table reveals several problems.

9. Describe and Explain the (several) problems that you see.

Data cleaning

Handling Missing Data

Missing data is an important issue to handle. As we've seen, ignoring it doesn't make it go away; in our example, it has been giving us problems in our describe() results -- remember the NaN values?

There are different ways of dealing with missing data. All of them have advantages and disadvantages, so choose carefully and with good reasons! The options include:

  • analyzing only the available data (i.e. ignore the missing data)
  • replace the missing data with replacement values and treat these as though they were observed (danger!)
  • replace the missing data and account for the fact that these values were imputed with uncertainty (e.g. create a new boolean variable as a flag so you know that these values were not actually observed)
  • use statistical models to allow for missing data -- make assumptions about their relationships with the available data as necessary

For our purposes with the dirty gapminder data set, we know our missing data is excess (and unnecessary) and we are going to choose to analyze only the available data. To do this, we will simply remove rows with missing values.

10. Wait, why do we think it's extra data? Justify! Also note which regions you expect to lose observations from.

In large tabular data sets used for analysis, missing data is usually coded NA (it stands for Not Available, among other things), although there are other possibilities such as NaN (as we saw above) when a function or import method expects a number. NA can mean several things under the hood, but using NA for missing values in your data is your best bet; most of the methods will expect it (and some even have it in their name).
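Here's what a missing value looks like in a toy pandas Series (np.nan is NumPy's NaN):

In [ ]:
import numpy as np
import pandas as pd

# NaN marks the missing number in the middle
pd.Series([1.0, np.nan, 3.0])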

Let's find out how many NAs there are

We are going to chain 2 steps together to determine the number of NA/null values in the gapminder DataFrame.

  • first isnull() returns a boolean for each cell in the DataFrame - True if the value in the cell is NA, False if it is not.
  • then sum() adds up the values in each column.
    "hold on", you say, "you can't add booleans!" Oh yes you can! True == 1 and False == 0 in Python so if you sum() the results of isnull() you get the number of NA/null values in each column -- awesome!

In [ ]:
# isnull results in T/F for each cell, 
# sum adds them up by column since F=0 and T=1 so #=#NAs
gapminder.isnull().sum()

Yikes! There are NA values in each column except region.

Removing NAs from a DataFrame is incredibly easy to do because pandas allows you to either remove all rows with NA/null data or replace the NAs with a particular value. (sometimes too easy, so make sure you are careful!)

The method df = df.dropna() (here df is short for DataFrame) drops rows with any column having NA/null data. df = df.fillna(value) replaces all NA/null data with the argument value. You have to assign the result back to the same (or a different) identifier since dropna() does not work "in place": if you don't assign the result to something, it just prints and nothing happens to the actual DataFrame.
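A minimal sketch of that gotcha on toy data:

In [ ]:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0]})
toy.dropna()        # returns a new DataFrame; toy itself is unchanged
toy = toy.dropna()  # reassigning is what actually keeps the result
toy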

use dropna() to remove the NAs from the gapminder DataFrame


In [ ]:
## before you rip into removing, it's a good idea to know how many rows you have
## you know a few ways to get this info, write some code here to get # of rows

In [ ]:
## now use dropna()

In [ ]:
## now get the number of rows again

In [ ]:
## use isnull() again to confirm

11. How many rows were removed? Make sure the code immediately above justifies this.

12. Which regions lost rows? Run code to see if there are still NAs and also to identify regions with too many observations. (hint: it won't be perfect yet)


In [ ]:
## your code

Are we done yet? Oh, no, we certainly are not...

One more nice trick. If you want to examine a subset (more later) of the rows, for example to examine the region with too many observations, you can use code like this:

gapminder[gapminder.column == 'value_in_column']

where column is the name of a column and 'value_in_column' is what you're interested in, for example the region.

write code in the cell below to look at all of the rows from regions that have a problematically high number of observations.


In [ ]:
## code to examine rows from regions with too many obs

Handling strange variable types

Remember above where we had some strange/annoying variable types? Now we can fix them. The methods below would have failed if there were still NAs in the data, since they look at the type of the data and try to change it (and a missing value can't be cast to an integer). Inconsistencies in data types and NAs cause problems for a lot of methods, so we often deal with them first when cleaning data.

We can change (or type cast) the inappropriate data types with the function astype(), like so:

DataFrame.astype(dtype)

A few important things to consider!

  • dtype is the data type that you want to cast to
  • if you just use DataFrame it will try to cast the entire DataFrame to the same type, and you do not want to go there! So we need to specify which columns we want to cast, see below...

There are several ways to select only some columns of a DataFrame in pandas but the easiest and most intuitive is usually to just use the name. It is very much like indexing a list or a string and looks like: DataFrame['column_name'], where column_name is the column name. So in context of our astype() method call we would want:

DataFrame['column_name'].astype(dtype)

to only cast the type of a single column.

Run the code in the cell below to see an example:


In [ ]:
##type casts the year column data from float to int
gapminder['year'] = gapminder['year'].astype(int)

Now use astype() to type cast the other problematic column, and use info() to make sure these two operations worked.


In [ ]:
## fix other column and make sure all is ok

Progress!
Your data types should make more sense now, and since we removed the NA values above, the total number of rows or entries is the same as the number of non-null rows. Nice!

Handling (Unwanted) Repetitive Data

Sometimes observations can end up in the data set more than once, creating a duplicate row. Luckily, pandas has methods that allow us to identify which observations are duplicates. The method df.duplicated() will return a boolean value for each row in the DataFrame telling you whether or not that row is an exact repeat.

Run the code below to see how it works.


In [ ]:
## get T/F output for each row if it is a duplicate (only look at top 5)
gapminder.duplicated().head()

Wow, there is a repeat in the first 5 rows!

Write code below that allows you to confirm this by examining the DataFrame.


In [ ]:
## confirm duplicate in top 5 rows of df

In [ ]:
## you can use the .sum() to count the number of duplicated rows in the DataFrame

In cases where you don’t want exactly repeated rows (we don’t -- we only want each country to be represented once for every relevant year), you can easily drop such duplicate rows with the method df.drop_duplicates().

Note: drop_duplicates() is another method that does not work in place, and if you want the DataFrame to no longer have the duplicates, you need to assign it again: df = df.drop_duplicates()

Warning: you always want to be cautious about dropping rows from your data.

13. Justify why it's ok for us to drop these rows (a mechanical reason that has to do with the structure of the data itself, the "experimental" reason is above).

write code below to remove the duplicated rows and confirm that the duplicate in the first 5 rows is gone and/or that all of the duplicates are gone.


In [ ]:
## remove duplicates and confirm

More Progress!

Reindexing with reset_index()

Now we have 1704 rows, but our index is off. Look at the index values in gapminder just above. One is missing!

We can reset our indices easily with the method reset_index(drop=True). Remember, Python is 0-indexed so our indices will be valued 0-1703. The drop=True parameter drops the old index (as opposed to placing it in a new column, which is useful sometimes but not here).

The concept of reindexing is important. When we remove some of the messier, unwanted data, we end up with "gaps" in our index values. By correcting this, we keep positional lookups and iterative operations on our cleaned data set behaving predictably.
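A toy sketch of the gap and the fix:

In [ ]:
import pandas as pd

toy = pd.DataFrame({'x': [10, 20, 30]})
toy = toy.drop(1)                  # drop the middle row: index is now 0, 2
toy = toy.reset_index(drop=True)   # index is 0, 1 again; old index discarded
toy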

Note: reset_index() is yet another method that does not work in place, and if you want the DataFrame to keep the new index, you need to assign it again...

write code that resets the index of gapminder and shows you the top 5 so you can see it has changed


In [ ]:
## reset index and check top 5

Handling Inconsistent Data

The region column still has issues that will affect our analysis. We used the value_counts() method above to examine some of these issues...

write code to look at the number of observations for each unique region


In [ ]:
## get value counts for the region column

14. Describe and Explain...
14a. what is better now that we have done some data cleaning.

14b. what is still in need of being fixed.

String manipulations

Very common problems with string variables are:

  • lingering white space
  • upper case vs. lower case

These issues are problematic since upper and lower case characters are considered different characters by Python (consider that 'abc' == 'ABC' evaluates to False) and any extra character in a string makes it different to Python (consider that 'ABC' == ' ABC' evaluates to False).

The following three pandas string methods (hence the str), applied to a column, remove white space on the left and right and put everything in lowercase, respectively.

df['column_name'].str.lstrip() # Strip white space on left
df['column_name'].str.rstrip() # Strip white space on right
df['column_name'].str.lower() # Convert to lowercase

Note: none of these methods work in place, so if you want the DataFrame to keep the changes, you need to assign the result again (df['column_name'] = df['column_name'].str.method())...

write code that strips the white space from both sides of the values in the region column, makes all of the values lower case, and shows that it has been accomplished (to make sure you've reassigned).


In [ ]:
## write code to strip white space on both left and right of region names
## convert region names to lowercase, and print out df to confirm

As a side note, one of the coolest things about pandas is an idea called chaining. If you prefer, the three commands can be written in a single line!

df['column_name'] = df['column_name'].str.lstrip().str.rstrip().str.lower()

You may be wondering about the order of operations: evaluation starts closest to the data (the DataFrame name, df here) and moves outward, applying each method in order to the result of the previous one. This is a bit advanced, but can save a lot of typing once you get used to it. The downside is that it can be harder to troubleshoot -- it can leave you asking "which command in my long, slick-looking chain produced the weird error?"

Are we there yet???
15. What is still wrong with the region column?

regex + replace()

A regular expression, aka regex, is a powerful search technique that uses a sequence of characters that define a search pattern. In a regular expression, the symbol "*" matches the preceding character 0 or more times, whereas "+" matches the preceding character 1 or more times. "." matches any single character. Writing "x|y" means to match either "x" or "y".

For more regex shortcuts (cheatsheet): https://www.shortcutfoo.com/app/dojos/regex/cheatsheet
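If you want to experiment with these metacharacters in plain Python before using them in pandas, the built-in re module works (a toy sketch):

In [ ]:
import re

# '.' matches any single character, '*' means zero or more of the
# preceding character, '+' means one or more, '|' means either/or
print(bool(re.search("co+ngo", "congo")))            # True: one or more 'o'
print(bool(re.search(".*congo.*", "africa_congo")))  # True: 'congo' anywhere
print(bool(re.search("cat|dog", "hotdog")))          # True: matches 'dog'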

pandas allows you to use regex in its replace() method -- when a regex term is found in an element, the element is then replaced with the specified replacement term. In order for it to appropriately correct elements, both the regex and inplace arguments need to be set to True (their defaults are False). This ensures that the initial input string is read as a regular expression and that the elements will be modified in place.

For more documentation on the replace method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html

Using regex and replace() is easy and powerful and potentially dangerous

Here's an incorrect regex example: we create a temporary Series (notice the temp = assignment) in which a regex statement inside replace() identifies all values that contain the term "congo" and replaces them with "africa_dem rep congo". Unfortunately, this creates 24 instances of the Democratic Republic of the Congo (and there should only be 12) -- this is an error in our cleaning!

run the code below to go through the incorrect example (and let it be a warning to ye)


In [ ]:
# gives a problem -- 24 values of the congo!
temp = gapminder['region'].replace(".*congo.*", "africa_dem rep congo", regex=True)
temp.value_counts()

16. What happened?


In [ ]:
## this should help
#  shows all the rows that have 'congo' in the name
gapminder[gapminder['region'].str.contains('congo')]

We can revert back to working on the non-temporary DataFrame and correctly modify our regex to isolate only the Democratic Republic of the Congo instances (as opposed to including the Republic of Congo as well).

Using regex to fix the Dem. Rep. Congo...

As noted above, regular expressions (regex) provide a powerful tool for fixing errors that arise in strings. In order to correctly label the two different countries that include the word "congo", we need to design and use (via df.replace()) a regex that correctly differentiates between the two countries.

Recall that the "." is the wildcard (matching any single character); combining this with "*" allows us to match any run of characters. By combining these metacharacters with substrings corresponding to variations in the naming of the Democratic Republic of the Congo, we can correctly normalize the name.

If you feel that the use of regex is not particularly straightforward, you are absolutely correct -- appropriately using these tools takes a great deal of time to master. When designing regex for these sorts of tasks, you might find the following prototyper helpful: https://regex101.com/

run the code below to fix the rows for the Democratic Republic of the Congo


In [ ]:
# fix the two different versions of the incorrect name to the same correct version
gapminder['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True, inplace=True)
gapminder['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True, inplace=True)

## you write code to check to make sure it's fixed

Exercise (regex):

Now that we've taken a close look at how to properly design and use regex to clean string entries in our data, let's try to normalize the naming of a few other countries.

Using the pandas code we constructed above as a template, write similar code (using df.replace()) to fix the names of all of the rows/entries for the Ivory Coast and Canada to "africa_cote d'ivoire" and "americas_canada", respectively.


In [ ]:
## code to fix other incorrect names

Are we there yet??!??!
17. Looks like we successfully cleaned our data! Justify that we are done.


In [ ]:

Tidy data

Having what is called a Tidy data set can make cleaning and analyzing your data much easier.
Two of the important aspects of Tidy data are:

  • every variable has its own column
  • every observation has its own row

There are other aspects of Tidy data; here is a good blog post about Tidy data in Python.

Currently the dataset has a single column for continent and country (the ‘region’ column).

18. Why is having a single column for continent and country not Tidy?

Let's make gapminder Tidy

We can split the region column into two by using the underscore that separates continent from country. We can create a new column in the DataFrame by naming it before the = sign:

gapminder['country'] =

The following commands use the string method split() to split each string at the underscore (the first argument), which results in a list of two elements: before and after the _. The second argument, n=1, tells split() that the split should take place only at the first occurrence of the underscore. Then str[0] or str[1] specifies which item from each resulting list to return.


In [ ]:
# split region into country and continent to tidy data
gapminder['country'] = gapminder['region'].str.split('_', n=1).str[1]
gapminder['continent'] = gapminder['region'].str.split('_', n=1).str[0]
gapminder.head()

19. Describe what happens for each of the lines of code that contains split(). Be sure to explain why the 2 columns end up with different values.


In [ ]:
## mess around with it if you need to, but make a 'temp' version of gapminder to play with!

Removing and renaming columns

We have now added the columns country and continent, but we still have the old region column as well. In order to remove that column we use the drop() method. The first argument of drop() is the name of the element to be dropped. The second argument is the axis: 0 for a row, 1 for a column (written axis=1).

*Note: any time we are getting rid of stuff, we want to make sure that we are doing it for a good reason and that we know our data will be ok after. You might want to double check your new columns before you drop the old one*


In [ ]:
## check new columns

In [ ]:
# drop old region column
gapminder = gapminder.drop('region', axis=1)
gapminder.head()

Finally, it is a good idea to look critically at your column names themselves and to be as consistent as possible when naming columns. We often use all lowercase for column names to avoid accidentally making names that can be confusing -- gdppercap, gdpPercap, and GDPpercap all read as the same name to a human, but to Python they are all different (and gdpercap is different yet again).

Avoid spaces in column names to simplify manipulating your data. Also look out for lingering white space at the beginning or end of your column names.

Run the following code that turns all column names to lowercase.


In [ ]:
# turns all column names to lowercase
# yes you need the .columns on the left side too
gapminder.columns = gapminder.columns.str.lower()
gapminder.head()

We also want to remove the space from the life exp column name. We can do that with the pandas rename() method. It takes a dictionary as its argument, with the old column names as keys and new column names as values.

If you're unfamiliar with dictionaries, they are a very useful data structure in Python. You can read more about them here.
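If dictionaries are new to you, here's a one-line taste (a toy sketch):

In [ ]:
# keys (old names) map to values (new names)
old_to_new = {'life exp': 'lifeexp'}
old_to_new['life exp']  # looking up a key returns its value: 'lifeexp'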


In [ ]:
# rename column
gapminder = gapminder.rename(columns={'life exp' : 'lifeexp'})
gapminder.head()

Tidy data wrap-up

20. Explain why the data set at this point is Tidy, or at least much tidier than it was before.

Export clean and tidy data file

Now that we have a clean and tidy data set in a DataFrame, we want to export it and give it a new name so that we can refer to it and use it later.

df.to_csv('file name for export', index=False) # index=False keeps index out of columns

For more info on this method, check out the docs for to_csv()


In [ ]:
# exports gapminder_CandT.csv
gapminder.to_csv('gapminder_CandT.csv', index=False)

Subsetting and sorting

There are many ways in which you can manipulate a Pandas DataFrame - here we will discuss only two: subsetting and sorting.

Sometimes you only want part of a larger data set; then you would subset your DataFrame. Other times you want to sort the data into a particular order (year from most recent to oldest, GDP from lowest to highest, etc.).

Subsetting

We can subset (or slice) by giving the numbers of the rows you want to see between square brackets.

Run the code below for an example:


In [ ]:
# first 15 rows
gapminder[0:15]  # could also be gapminder[:15]

write code to take a different slice


In [ ]:
## your slice

21. Predict what the slice below will do.

gapminder[-10:]

In [ ]:
## run the code here to test prediction.

22a. What does the negative number (in the cell above) mean?

22b. What happens when you leave the space before or after the colon empty?

write code to take another different slice


In [ ]:
## your 'nother slice

More Subsetting

Subsetting can also be done by selecting for a particular value in a column.
For instance, to select all of the rows that have 'africa' in the column 'continent':

gapminder_africa = gapminder[gapminder['continent']=='africa']

Note the double equal sign: Remember that single equal signs are used in Python to assign something to a variable. The double equal sign is a comparison: in this case, the value from the column on the left has to be exactly equal to the string to the right.

Also note that we made a new DataFrame to contain our subset of the data from Africa.


In [ ]:
# makes a new DataFrame with just data from Africa
gapminder_africa = gapminder[gapminder['continent']=='africa']

## write code to quickly check that this worked

Even more...

There are several other fancy ways to subset rows and columns from DataFrames; the .loc and .iloc methods are particularly useful. If you're interested, look them up on the cheatsheet or in the docs.
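As a quick taste (a sketch that assumes the cleaned gapminder DataFrame from above):

In [ ]:
# .loc selects by label: index labels for rows, names for columns
print(gapminder.loc[0:2, ['country', 'year']])

# .iloc selects purely by position: first 3 rows, first 2 columns
print(gapminder.iloc[0:3, 0:2])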

Sorting

Sorting may help to further organize and inspect your data. The method sort_values() takes a number of arguments; the most important ones are by and ascending. The following command will sort your DataFrame by year, beginning with the most recent.


In [ ]:
# sort by year, from most recent to oldest
gapminder.sort_values(by='year', ascending = False)

Note: the sort_values() method does not sort in place. write code to prove it.


In [ ]:
## your code, sort_values() not in place

23. Make a new variable with a sorted version of the DataFrame organized by country, from ‘Afghanistan’ to ‘Zimbabwe’. Also include code to show that it is sorted correctly.


In [ ]:
## alphabetical by country

Summarize and plot

Summaries and statistics are very useful for initial examination of data as well as in-depth analysis. Here we will only scratch the surface.

Plots/graphs/charts are also great visual ways to examine data and illustrate patterns.

Exploring your data is often iterative - summarize, plot, summarize, plot, etc. - sometimes it branches, sometimes there are more cleaning steps to be discovered...

Let's try it!

Summarizing data

Remember that the info() method gives a few useful pieces of information, including the shape of the DataFrame, the variable type of each column, and the amount of memory used. We can see many of our changes (continent and country columns instead of region, higher number of rows, etc.) reflected in the output of the info() method.


In [ ]:
# review info()
gapminder.info()

We also saw above that the describe() method will take the numeric columns and give a summary of their values. We have to remember that we changed the column names, and this time it shouldn't have NAs.


In [ ]:
# review describe
gapminder[['pop', 'lifeexp', 'gdppercap']].describe()

More summaries

What if we just want a single value, like the mean of the population column? We can call mean on a single column this way:


In [ ]:
# population mean
gapminder['pop'].mean()

What if we want to know the mean population by continent? Then we need to use the Pandas groupby() method and tell it which column we want to group by.


In [ ]:
# population mean by continent
gapminder[['continent', 'pop']].groupby(by='continent').mean()

What if we want to know the median population by continent?
The method that gives you the median is called median().

write code below to get the median population by continent


In [ ]:
## population median by continent

Or the number of entries (rows) per continent?


In [ ]:
# count number of rows
gapminder[['continent', 'country']].groupby(by='continent').count()

Sometimes we don't want a whole DataFrame. Here is another way to count the number of entries (rows) that produces a Series instead of a DataFrame.


In [ ]:
# get size by continent
gapminder[['continent', 'country']].groupby(by='continent').size()

We can also look at the mean GDP per capita of each country:
write code below to get the mean GDP per capita of each country


In [ ]:
## mean GDP per capita by country

What if we wanted a new DataFrame that just contained these summaries? This could be a table in a report, for example.


In [ ]:
# pretty slick, right?!
continent_mean_pop = gapminder[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct, continent_mean_pop, continent_median_pop], axis=1)
gapminder_summs

Visualization with matplotlib

Recall that matplotlib is Python's main visualization library. It provides a range of tools for constructing plots, and numerous high-level and add-on plotting libraries are built with matplotlib in mind. When we were in the early stages of setting up our analysis, we loaded these libraries like so:


In [ ]:
# import again just in case
import matplotlib.pyplot as plt

In [ ]:
## generate some toy data and plot it
# required to generate toy data
import numpy as np

# magic to plot straight into notebook, probably no longer needed.
# %matplotlib inline  

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

Single variable plots

Histograms provide a quick way of visualizing the distribution of numerical data, or the frequencies of observations for categorical variables.

Run the code below to generate a sample histogram.


In [ ]:
# example histogram
# generate some random numbers from a normal distribution
data = 100 + np.random.randn(500)

# make a histogram with 20 bins
plt.hist(data, 20)
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()

write code below to make a histogram of the distribution of life expectancies of the countries in the gapminder DataFrame.


In [ ]:
## histogram of lifeexp

Interpret plots and summaries

Exploration is an iterative process

In this section, we've taken the raw data and worked through steps to prepare it for analysis, but we have not yet done any "data analysis". This part of the data workflow can be thought of as "exploratory data analysis", or EDA. Many of the steps we've shown are aimed at uncovering interesting or problematic things in the dataset that are not immediately obvious. We want to stress that when you're doing EDA, it will not necessarily be a linear workflow like what we have shown. When you plot or summarize your data, you may uncover new issues: we saw this when we made a mistake fixing the naming conventions for the Democratic Republic of Congo. You might discover outliers, unusually large values, or points that don't make sense in your plots. Clearly, the work here isn't done: you'll have to investigate these points, decide how to fix any potential problems, document the reasoning for your actions, and check that your fix actually worked.

On the other hand, plots and summaries might reveal interesting questions about your data. You may return to the cleaning and prepping steps in order to dig deeper into these questions. You should continuously refine your plots to give the clearest picture of your hypotheses.

