Based on the Data Carpentry curriculum for Data Visualization in Python (© Data Carpentry, under a Creative Commons Attribution license)
As in the last lesson, we will be using NCD Risk Factor Collaboration (NCD-RisC) data, from Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128.9 million children, adolescents, and adults.
Lancet 2017, published online 11 October 2017
We will be using the adult BMI country-specific data and the height at age 18 country-specific data. The downloaded filename is NCD_RisC_Lancet_2017_BMI_age_standardised_country.csv, but we have saved a simplified version of the file here as NCD_RisC_bmi.csv.
The table of arable land (hectares per person) is taken from the World Bank DataBank.
The table of life expectancy is taken from Our World in Data.
In many "real world" situations, the data that we want to use come in multiple files. We often need to combine these files into a single DataFrame to analyze the data. The pandas package provides various methods for combining DataFrames, including merge and concat.
To work through the examples below, we first need to load the BMI and height files into pandas DataFrames. In IPython:
In [1]:
# import pandas library:
import pandas as pd
# set the plots to appear 'inline' in the notebook:
%matplotlib inline
In [2]:
df_bmi = pd.read_csv("data/NCD_RisC_bmi.csv")
df_bmi.head()
Out[2]:
In [3]:
df_height = pd.read_csv("data/NCD_RisC_height.csv")
df_height.head()
Out[3]:
Take note that the read_csv method we used can take some additional options. Many functions in Python have a set of options that can be set by the user if needed. You can read about all of the read_csv options in the pandas documentation.
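For example, two commonly used options are nrows (read only the first few rows of the file) and na_values (extra strings to treat as missing). A quick sketch, with values chosen just for illustration:
In [ ]:
# Read only the first 5 rows, and treat 'NA' and empty strings as missing values
pd.read_csv("data/NCD_RisC_bmi.csv", nrows=5, na_values=['NA', ''])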
In [4]:
# Select the first 10 rows of the bmi table
bmi_sub_first10 = df_bmi.head(10)
# Grab the last 10 rows
bmi_sub_last10 = df_bmi.tail(10)
In [7]:
# Reset the index values so the second DataFrame appends properly
bmi_sub_last10 = bmi_sub_last10.reset_index(drop=True)
# the drop=True option avoids adding a new index column with the old index values
In [11]:
bmi_sub_first10
Out[11]:
When we concatenate DataFrames, we need to specify the axis. axis=0 tells pandas to stack the second DataFrame under the first one. It will automatically detect whether the column names are the same and will stack accordingly. axis=1 will stack the columns in the second DataFrame to the RIGHT of the first DataFrame. To stack the data vertically, we need to make sure we have the same columns and associated column format in both datasets. When we stack horizontally, we want to make sure what we are doing makes sense (i.e., the data are related in some way).
In [8]:
# Stack the DataFrames on top of each other
vertical_stack = pd.concat([bmi_sub_first10, bmi_sub_last10], axis=0)
# Place the DataFrames side by side
horizontal_stack = pd.concat([bmi_sub_first10, bmi_sub_last10], axis=1)
In [13]:
vertical_stack.reset_index(drop=True)
Out[13]:
In [ ]:
In [ ]:
# use this cell to view your vertical_stack dataframe
In [ ]:
# use this cell to view your horizontal_stack dataframe
In [ ]:
# use this cell to re-index the vertical_stack dataframe
We can use the to_csv command to export a DataFrame in CSV format. Note that the code below will by default save the data into the current working directory. We can save it to a different folder by adding the folder name and a slash to the file name, e.g. vertical_stack.to_csv('foldername/out.csv'). We use index=False so that pandas doesn't include the index number for each line.
In [22]:
df_1 = pd.DataFrame({"A":[1,2,3,4], "B":[3,4,5,6]})
df_2 = pd.DataFrame({"C":[1,2,3,4], "B":[3,4,5,6]})
df_1
Out[22]:
In [23]:
df_2
Out[23]:
In [24]:
pd.concat([df_1, df_2], axis=0)
Out[24]:
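Notice that df_1 and df_2 share only column B. When the two DataFrames are stacked vertically, pandas keeps the union of the columns, so the rows that came from df_1 get NaN in column C, and the rows that came from df_2 get NaN in column A.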
In [ ]:
# Write DataFrame to CSV
vertical_stack.to_csv('data/vertical_stack.csv', index=False)
Check out your working directory to make sure the CSV wrote out properly, and that you can open it! If you want, try to bring it back into Python to make sure it imports properly.
Make sure that any directory you are specifying has already been created.
In [ ]:
# For kicks, read our output back into Python and make sure all looks good
new_output = pd.read_csv('data/vertical_stack.csv')
Challenge - Combine Data
In the data folder, there are two height data files: NCD_RisC_height_1896.csv and NCD_RisC_height_1996.csv. Read the data into Python and combine the files to make one new DataFrame. Create a plot of average height by year, grouped by sex. Export your results as a CSV and make sure it reads back into Python properly. (One possible approach is sketched after the work cells below.)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
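One possible approach is sketched below. The column names used here ('year', 'sex', 'mean_height') are assumptions; inspect the DataFrames with .head() or .columns to find the real ones.
In [ ]:
# Read both height files and stack them vertically, rebuilding a clean index
height_1896 = pd.read_csv('data/NCD_RisC_height_1896.csv')
height_1996 = pd.read_csv('data/NCD_RisC_height_1996.csv')
height_all = pd.concat([height_1896, height_1996], axis=0).reset_index(drop=True)

# Average height by year, grouped by sex ('mean_height' and 'sex' are assumed names)
height_all.groupby(['year', 'sex'])['mean_height'].mean().unstack('sex').plot()

# Export, then read the file back in to check the round trip
height_all.to_csv('data/height_combined.csv', index=False)
height_check = pd.read_csv('data/height_combined.csv')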
When we concatenated our DataFrames we simply added them to each other - stacking them either vertically or side by side. Another way to combine DataFrames is to use columns in each dataset that contain common values (a common unique id). Combining DataFrames using a common field is called "joining". The columns containing the common values are called "join key(s)". Joining DataFrames in this way is often useful when one DataFrame is a "lookup table" containing additional data that we want to include in the other.
NOTE: This process of joining tables is similar to what we do with tables in an SQL database.
For example, we might create a lookup table for our country data. This table could contain information such as economic indicators, population, and area for each of the countries. The country code would be unique for each line. Rather than adding multiple extra columns to each of the 16,800 lines of the bmi data table, we can maintain the shorter table with the country information. When we want to access that information, we can create a query that joins the additional columns of information to the bmi data.
Storing data in this way has many benefits: it keeps the main data table small, it avoids repeating the same country information on every row, and it makes it easier to update the country information in one place.
In [25]:
# Select the first 10 rows of the bmi table
bmi_sub = df_bmi.head(10)
# Import a small subset of the arable land data designed for this part of the lesson.
# It is stored in the data folder.
df_arable = pd.read_csv('data/arable_land.csv')
arable_sub = df_arable.head(70)
In this example, df_arable is the lookup table containing country code, country, and hectares of arable land per person, which we want to join with the data in bmi_sub to produce a new DataFrame that contains all of the columns from both df_bmi and df_arable.
To identify appropriate join keys we first need to know which field(s) are shared between the files (DataFrames). We might inspect both DataFrames to identify these columns. If we are lucky, both DataFrames will have columns with the same name that also contain the same data. If we are less lucky, we need to identify a (differently-named) column in each DataFrame that contains the same information.
In [26]:
arable_sub.columns
Out[26]:
In [27]:
bmi_sub.columns
Out[27]:
In our example, the join key is the column containing the three-letter country identifier, which is called iso.
Now that we know the fields with the common country ID attributes in each DataFrame, we are almost ready to join our data. However, since there are different types of joins, we also need to decide which type of join makes sense for our analysis.
The most common type of join is called an inner join. An inner join combines two DataFrames based on a join key and returns a new DataFrame that contains only those rows that have matching values in both of the original DataFrames.
Inner joins yield a DataFrame that contains only rows where the value being joined exists in BOTH tables.
The pandas function for performing joins is called merge, and an inner join is the default option:
In [32]:
merged_inner = pd.merge(left=bmi_sub,
right=arable_sub,
left_on='iso',
right_on='iso')
# In this case `iso` is the only column name in both DataFrames, so if we skipped the `left_on`
# and `right_on` arguments we would still get the same result
# What's the size of the output data?
merged_inner.shape
merged_inner
type(merged_inner)
Out[32]:
In [33]:
merged_inner.info()
The result of an inner join of bmi_sub and arable_sub is a new DataFrame that contains the combined set of columns from bmi_sub and arable_sub. It only contains rows that have three-letter country codes that are the same in both the bmi_sub and arable_sub DataFrames. In other words, if a row in bmi_sub has a value of iso that does not appear in the iso column of arable_sub, it will not be included in the DataFrame returned by an inner join. Similarly, if a row in arable_sub has a value of iso that does not appear in the iso column of bmi_sub, that row will not be included in the DataFrame returned by an inner join.
The two DataFrames that we want to join are passed to the merge function using the left and right arguments. The left_on='iso' argument tells merge to use the iso column as the join key from bmi_sub (the left DataFrame). Similarly, the right_on='iso' argument tells merge to use the iso column as the join key from arable_sub (the right DataFrame). For inner joins, the order of the left and right arguments does not matter.
The resulting merged_inner DataFrame contains all of the columns from bmi_sub (iso, year, bmi, etc.) as well as all the columns from arable_sub (iso, country, and arable land per person).
Notice that merged_inner has fewer rows than arable_sub. This is an indication that there were rows in arable_sub with value(s) for iso that do not exist as value(s) for iso in bmi_sub.
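One quick way to confirm which country codes did not match is to compare the two sets of iso values directly:
In [ ]:
# Country codes present in arable_sub but missing from bmi_sub
set(arable_sub['iso']) - set(bmi_sub['iso'])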
What if we want to add information from arable_sub to bmi_sub without losing any of the information from bmi_sub? In this case, we use a different type of join called a "left outer join", or a "left join".
Like an inner join, a left join uses join keys to combine two DataFrames. Unlike an inner join, a left join will return all of the rows from the left DataFrame, even those rows whose join key(s) do not have values in the right DataFrame. Rows in the left DataFrame that are missing values for the join key(s) in the right DataFrame will simply have null (i.e., NaN or None) values for those columns in the resulting joined DataFrame.
Note: a left join will still discard rows from the right DataFrame that do not have values for the join key(s) in the left DataFrame.
A left join is performed in pandas by calling the same merge function used for an inner join, but using the how='left' argument:
In [ ]:
merged_left = pd.merge(left=bmi_sub,
right=arable_sub,
how='left',
left_on='iso',
right_on='iso')
merged_left
The result DataFrame from a left join (merged_left) looks very much like the result DataFrame from an inner join (merged_inner) in terms of the columns it contains. However, unlike merged_inner, merged_left contains the same number of rows as the original bmi_sub DataFrame. When we inspect merged_left, we find there are rows where the information that should have come from arable_sub (i.e., country_name and arable_land_ha_pp) is missing (they contain NaN values):
In [ ]:
merged_left[pd.isnull(merged_left.country_name)]
These rows are the ones where the value of iso from bmi_sub (in this case, SVK, JPN, MKD, etc.) does not occur in arable_sub.
The pandas merge function supports two other join types:
- Right (outer) join: invoked by passing how='right' as an argument. Similar to a left join, except all rows from the right DataFrame are kept, while rows from the left DataFrame without matching join key(s) values are discarded.
- Full (outer) join: invoked by passing how='outer' as an argument. This join type returns all rows from both DataFrames, matched where possible; the result DataFrame will contain NaN where data is missing in one of the DataFrames. This join type is very rarely used.
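Both are called the same way as the left join above. A quick sketch using the same bmi_sub and arable_sub tables:
In [ ]:
# Right join: keep every row of arable_sub
merged_right = pd.merge(left=bmi_sub, right=arable_sub, how='right', on='iso')
# Full outer join: keep every row from both DataFrames
merged_outer = pd.merge(left=bmi_sub, right=arable_sub, how='outer', on='iso')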
Challenge - Weight by geographical region
- In the data folder, there is a CSV file that contains information about the countries assigned to different geographical regions. Use that data to summarize the number of countries by region.
- Calculate an expected mean weight for each region, using the bmi and height data.
(One possible approach is sketched after the work cells below.)
In [ ]:
df_regions = pd.read_csv('data/world_regions.csv')
df_bmi = pd.read_csv('data/NCD_RisC_bmi.csv')
df_height = pd.read_csv('data/NCD_RisC_height.csv')
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
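One possible approach is sketched below. The merge keys and the column names 'bmi' and 'height' are assumptions (check each DataFrame's columns first); the weight estimate uses the definition BMI = weight (kg) / height (m)^2 and assumes height is recorded in cm.
In [ ]:
# Number of countries assigned to each region
df_regions.groupby('region')['alpha-3'].nunique()

# Join the region lookup table onto the bmi data ('alpha-3' holds the country code)
bmi_regions = pd.merge(left=df_bmi, right=df_regions,
                       left_on='iso', right_on='alpha-3')

# Bring in the height data, then estimate weight = bmi * height(m)^2
# (the join keys and the 'bmi'/'height' column names are assumptions)
merged = pd.merge(bmi_regions, df_height, on=['iso', 'year', 'sex'])
merged['weight'] = merged['bmi'] * (merged['height'] / 100) ** 2
merged.groupby('region')['weight'].mean()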
We can select specific ranges of our data in both the row and column directions using either label or integer-based indexing.
- loc is primarily label based indexing. Integers may be used, but they are interpreted as labels.
- iloc is primarily integer based indexing.
To select a subset of rows and columns from our DataFrame, we can use the iloc method. For example, we can select the country code, region and sub-region (columns 2, 3 and 4 if we start counting at 1), like this:
In [ ]:
# iloc[row slicing, column slicing]
df_regions.iloc[0:3, 1:4]
which gives the output:
   alpha-3  region  sub-region
0  AFG      Asia    Southern Asia
1  ALA      Europe  Northern Europe
2  ALB      Europe  Southern Europe
Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you ask for 0:3, you are actually telling Python to start at index 0 and select rows 0, 1, 2 up to but not including 3.
Let's explore some other ways to index and select subsets of data:
In [ ]:
# Select all columns for rows of index values 0 and 10
df_regions.loc[[0, 10], :]
In [ ]:
# What does this do?
df_regions.loc[0, ['name', 'alpha-3']]
In [ ]:
# What happens when you type the code below?
df_regions.loc[[0, 10, 149], :]
NOTE: Labels must be found in the DataFrame or you will get a KeyError.
Indexing by labels (loc) differs from indexing by integers (iloc). With loc, both the start bound and the stop bound are inclusive. When using loc, integers can be used, but the integers refer to the index label and not the position. For example, using loc to select 1:4 will get a different result than using iloc to select rows 1:4.
We can also select a specific data value using a row and column location within the DataFrame and iloc indexing:
# Syntax for iloc indexing to find a specific data element
dat.iloc[row, column]
In [ ]:
df_regions.iloc[23, 3]
Remember that Python indexing begins at 0. So, the index location [23, 3] selects the element that is 24 rows down and 4 columns over in the DataFrame.
We can also use criteria to select subsets of rows. For example, we can select all rows where the region is Oceania:
In [ ]:
df_regions[df_regions.region == 'Oceania']
Or we can select all rows where the region is not Oceania:
In [ ]:
df_regions[df_regions.region != 'Oceania']
We can define sets of criteria too:
In [ ]:
df_regions[(df_regions.region == 'Asia') | (df_regions.region == 'Europe')]
Challenge - Queries
- Select a subset of rows in the df_bmi DataFrame that contain data from the year 1999 and that contain mean bmi values less than or equal to 18. How many rows did you end up with?
- You can use the isin command in Python to query a DataFrame based upon a list of values, as follows: df_bmi[df_bmi['iso'].isin([listGoesHere])]. Use the isin function to find all bmi records from a list of countries in the df_bmi DataFrame. How many records contain these values?
- Experiment with other queries. Create a query that finds all rows with a bmi between 20 and 23.
- The ~ symbol in Python can be used to return the OPPOSITE of the selection that you specify. It is equivalent to "is not in". Write a query that selects all rows with year NOT equal to 1996 or 2004 in the df_bmi data.
(Some possible queries are sketched below the work cell.)
In [ ]:
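Some possible queries are sketched below. The column names 'year' and 'bmi' are assumptions, and the list of country codes is just an example.
In [ ]:
# Rows from the year 1999 with mean bmi values <= 18
df_bmi[(df_bmi['year'] == 1999) & (df_bmi['bmi'] <= 18)]

# All records for an example list of countries
df_bmi[df_bmi['iso'].isin(['AFG', 'ALB', 'DZA'])]

# Rows with a bmi between 20 and 23
df_bmi[(df_bmi['bmi'] >= 20) & (df_bmi['bmi'] <= 23)]

# Rows with year NOT equal to 1996 or 2004
df_bmi[~df_bmi['year'].isin([1996, 2004])]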