In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[1]:
An initial step in statistical data analysis is the preparation of the data to be used in the analysis. In practice, the majority of the time spent on a statistical modeling project is typically devoted to importing, cleaning, validating and transforming the dataset.
This section will introduce pandas, an important third-party Python package for data analysis, as a tool for data preparation, and provide some general advice for what should or should not be done to data before it is analyzed.
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
pandas is well suited for many kinds of data: virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.
In [2]:
counts = pd.Series([632, 1638, 569, 115])
counts
Out[2]:
If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.
In [3]:
counts.values
Out[3]:
In [4]:
counts.index
Out[4]:
We can assign meaningful labels to the index, if they are available. These counts are of bacteria taxa constituting the microbiome of hospital patients, so using the taxon of each bacterium is a useful index.
In [5]:
bacteria = pd.Series([632, 1638, 569, 115],
index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
bacteria
Out[5]:
These labels can be used to refer to the values in the Series.
In [6]:
bacteria['Actinobacteria']
Out[6]:
In [7]:
bacteria[bacteria.index.str.endswith('bacteria')]
Out[7]:
In [8]:
'Bacteroidetes' in bacteria
Out[8]:
Notice that the indexing operation preserved the association between the values and the corresponding indices.
We can still use positional indexing if we wish.
In [9]:
bacteria[0]
Out[9]:
We can give both the array of values and the index meaningful labels themselves:
In [10]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria
Out[10]:
NumPy's math functions and other operations can be applied to Series without losing the data structure.
In [11]:
np.log(bacteria)
Out[11]:
We can also filter according to the values in the Series:
In [12]:
bacteria[bacteria>1000]
Out[12]:
A Series can be thought of as an ordered key-value store. In fact, we can create one from a dict:
In [13]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bact = pd.Series(bacteria_dict)
In [14]:
bact
Out[14]:
Notice that the Series is created in key-sorted order.
If we pass a custom index to Series, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. pandas uses the NaN (not a number) type for missing values.
In [15]:
bacteria2 = pd.Series(bacteria_dict,
index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Out[15]:
In [16]:
bacteria2.isnull()
Out[16]:
Critically, the labels are used to align data when used in operations with other Series objects:
In [17]:
bacteria + bacteria2
Out[17]:
Contrast this with NumPy arrays, where arrays of the same length will combine values element-wise; adding Series combines values with the same label in the resulting series. Notice also that the missing values were propagated by addition.
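To make the contrast concrete, here is a minimal sketch using small made-up arrays and Series (not part of the bacteria example):
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
a + b                              # combined strictly by position: array([11, 22, 33])
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2                            # aligned by label: 'a' and 'd' have no partner, so they become NaN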
Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.
A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.
In [18]:
bacteria_data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
'patient':[1, 1, 1, 1, 2, 2, 2, 2],
'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria',
'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
bacteria_data
Out[18]:
Notice the DataFrame is sorted by column name. We can change the order by indexing the columns in the order we desire:
In [19]:
bacteria_data[['phylum','value','patient']]
Out[19]:
A DataFrame has a second index, representing the columns:
In [20]:
bacteria_data.columns
Out[20]:
If we wish to access columns, we can do so either by dict-like indexing or by attribute:
In [21]:
bacteria_data['value']
Out[21]:
In [22]:
bacteria_data.value
Out[22]:
Using the standard indexing syntax for a single column of data from a DataFrame returns the column as a Series.
In [23]:
type(bacteria_data['value'])
Out[23]:
Passing the column name as a list returns the column as a DataFrame instead.
In [24]:
bacteria_data[['value']]
Out[24]:
Notice that indexing works differently with a DataFrame than with a Series, where in the latter, dict-like indexing retrieved a particular element (row). If we want access to a row in a DataFrame, we index its ix attribute.
In [25]:
bacteria_data.ix[3]
Out[25]:
Since a row potentially contains different data types, the returned Series of values is of the generic object type.
If we want to create a DataFrame row-wise rather than column-wise, we can do so with a dict of dicts:
In [26]:
bacteria_data = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})
In [27]:
bacteria_data
Out[27]:
However, we probably want this transposed:
In [28]:
bacteria_data = bacteria_data.T
bacteria_data
Out[28]:
In [29]:
vals = bacteria_data.value
vals
Out[29]:
Now, let's assign a new value to one of the elements of the Series.
In [30]:
vals[5] = 0
vals
Out[30]:
However, we may not anticipate that the value in the original DataFrame has also been changed!
In [31]:
bacteria_data
Out[31]:
We can avoid this by working with a copy when modifying subsets of the original data.
In [32]:
vals = bacteria_data.value.copy()
vals[5] = 1000
bacteria_data
Out[32]:
So, as we have seen, we can create or modify columns by assignment; let's put back the value we accidentally changed.
In [33]:
bacteria_data.value[5] = 1130
Or, we may wish to add a column representing the year the data were collected.
In [34]:
bacteria_data['year'] = 2013
bacteria_data
Out[34]:
But note, we cannot use the attribute indexing method to add a new column:
In [35]:
bacteria_data.treatment = 1
bacteria_data
Out[35]:
In [36]:
bacteria_data.treatment
Out[36]:
In [37]:
treatment = pd.Series([0]*4 + [1]*2)
treatment
Out[37]:
In [38]:
bacteria_data['treatment'] = treatment
bacteria_data
Out[38]:
Other Python data structures (ones without an index) need to be the same length as the DataFrame:
In [40]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
bacteria_data['month'] = month
In [41]:
bacteria_data['month'] = ['Jan']*len(bacteria_data)
bacteria_data
Out[41]:
We can use del to remove columns, in the same way dict entries can be removed:
In [42]:
del bacteria_data['month']
bacteria_data
Out[42]:
Or employ the drop method.
In [43]:
bacteria_data.drop('treatment', axis=1)
Out[43]:
We can extract the underlying data as a simple ndarray by accessing the values attribute:
In [44]:
bacteria_data.values
Out[44]:
Notice that because of the mix of string, integer and float (and NaN) values, the dtype of the array is object. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.
In [45]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})
df.values, df.values.dtype
Out[45]:
pandas uses a custom data structure to represent the indices of Series and DataFrames.
In [46]:
bacteria_data.index
Out[46]:
Index objects are immutable:
In [47]:
bacteria_data.index[0] = 15
This is so that Index objects can be shared between data structures without fear that they will be changed.
In [48]:
bacteria2.index = bacteria.index
bacteria2
Out[48]:
In [49]:
# Write your answer here
A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:
genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])
pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.
The file olympics.1996.txt in the data directory contains counts of medals awarded at the 1996 Summer Olympic Games by country, along with the countries' respective population sizes. This data is stored in a tab-separated format.
In [50]:
!head ../data/olympics.1996.txt
This table can be read into a DataFrame using read_table.
In [51]:
medals = pd.read_table('../data/olympics.1996.txt', sep='\t',
index_col=0,
header=None, names=['country', 'medals', 'population'])
medals.head()
Out[51]:
There is no header row in this dataset, so we specified this, and provided our own header names. If we did not specify header=None, the function would have assumed the first row contained column names.
The tab separator was passed to the sep argument as \t.
The sep argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to define a variable amount of whitespace, which is unfortunately common in some datasets:
sep='\s+'
We would like to add another variable to this dataset. Along with population, a country's economic development may be a useful predictor of Olympic success. A very simple indicator of this might be OECD membership status.
The OECD website contains a table listing OECD member nations, along with their year of membership. We would like to import this table and extract the countries that were members as of the 1996 games.
The read_html function accepts a URL argument, and will attempt to extract all the tables from that address, returning whatever it finds in a list of DataFrames.
In [52]:
oecd_site = 'http://www.oecd.org/about/membersandpartners/list-oecd-member-countries.htm'
pd.read_html(oecd_site)
Out[52]:
There is typically some cleanup that is required of the returned data, such as the assignment of column names or conversion of types.
The table of interest is at index 1, and we will extract two columns from the table. Otherwise, this table is pretty clean.
In [53]:
oecd = pd.read_html(oecd_site, header=0)[1][[1,2]]
oecd.head()
Out[53]:
In [54]:
oecd['year'] = pd.to_datetime(oecd.Date).apply(lambda x: x.year)
oecd_year = oecd.set_index(oecd.Country.str.title())['year'].dropna()
oecd_year
Out[54]:
We can create an indicator (binary) variable for OECD status by checking if each country is in the index of countries with membership year less than 1997.
The new DataFrame method assign is a convenient means for creating the new column from this operation.
In [55]:
medals_data = medals.assign(oecd=medals.index.isin((oecd_year[oecd_year<1997]).index).astype(int))
Since the distribution of populations spans several orders of magnitude, we may wish to use the logarithm of the population size, which may be created similarly.
In [56]:
medals_data = medals_data.assign(log_population=np.log(medals.population))
The NumPy log function will return a pandas Series (or DataFrame when applied to one) instead of an ndarray; all of NumPy's functions are compatible with pandas in this way.
In [57]:
medals_data.head()
Out[57]:
In [58]:
!cat ../data/microbiome/microbiome.csv
This table can be read into a DataFrame using read_csv:
In [59]:
mb = pd.read_csv("../data/microbiome/microbiome.csv")
mb.head()
Out[59]:
If we have sections of data that we do not wish to import (for example, known bad data), we can populate the skiprows argument:
In [60]:
pd.read_csv("../data/microbiome/microbiome.csv", skiprows=[3,4,6]).head()
Out[60]:
Conversely, if we only want to import a small number of rows from, say, a very large data file, we can use nrows:
In [61]:
few_recs = pd.read_csv("../data/microbiome/microbiome.csv", nrows=4)
few_recs
Out[61]:
Alternately, if we want to process our data in reasonable chunks, the chunksize argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:
In [62]:
data_chunks = pd.read_csv("../data/microbiome/microbiome.csv", chunksize=15)
data_chunks
Out[62]:
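As a minimal sketch of how such chunks might be consumed (the aggregation here is purely illustrative, and assumes each 15-row chunk corresponds to a single phylum, as described above):
mean_tissue = {}
for chunk in pd.read_csv("../data/microbiome/microbiome.csv", chunksize=15):
    # each chunk is itself a DataFrame of 15 patients for one phylum
    mean_tissue[chunk.Taxon.iloc[0]] = chunk.Tissue.mean()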
In [63]:
# Write your answer here
In [64]:
mb = pd.read_csv("../data/microbiome/microbiome.csv", index_col=['Taxon','Patient'])
mb.head()
Out[64]:
This is called a hierarchical index, which allows multiple dimensions of data to be represented in tabular form.
In [65]:
mb.index
Out[65]:
The corresponding index is a MultiIndex object that consists of a sequence of tuples, the elements of which are some combination of the two columns used to create the index. Where there are multiple repeated values, pandas does not print the repeats, making it easy to identify groups of values.
Rows can be indexed by passing the appropriate tuple.
In [66]:
mb.ix[('Firmicutes', 2)]
Out[66]:
With a hierarchical index, we can select subsets of the data based on a partial index:
In [67]:
mb.ix['Proteobacteria']
Out[67]:
To extract arbitrary levels from a hierarchical row index, the cross-section method xs can be used.
In [68]:
mb.xs(1, level='Patient')
Out[68]:
We may also reorder levels as we like.
In [69]:
mb.swaplevel('Patient', 'Taxon').head()
Out[69]:
In [70]:
mb.Stool / mb.Tissue
Out[70]:
Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), pandas' ability to directly import Excel spreadsheets is valuable. This support is contingent on having one or two dependencies (depending on what version of Excel file is being imported) installed: xlrd and openpyxl (these may be installed with either pip or easy_install).
Importing Excel data to pandas is a two-step process. First, we create an ExcelFile object using the path of the file:
In [71]:
mb_file = pd.ExcelFile('../data/microbiome/MID1.xls')
mb_file
Out[71]:
Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:
In [72]:
mb1 = mb_file.parse("Sheet 1", header=None)
mb1.columns = ["Taxon", "Count"]
mb1.head()
Out[72]:
There is now a read_excel convenience function in pandas that combines these steps into a single call:
In [73]:
mb2 = pd.read_excel('../data/microbiome/MID2.xls', sheetname='Sheet 1', header=None)
mb2.head()
Out[73]:
If you are fortunate, your data will be stored in a database (relational or non-relational) rather than in arbitrary text files or spreadsheets. Relational databases are particularly useful for storing large quantities of structured data, where fields are grouped together in tables according to their relationships with one another.
pandas' DataFrame interacts with relational (i.e. SQL) databases, and even provides facilities for using SQL syntax on the DataFrame itself, which we will get to later. For now, let's work with a ubiquitous embedded database called SQLite, which comes bundled with Python. A SQLite database can be queried with the standard library's sqlite3 module.
In [74]:
import sqlite3
query = '''
CREATE TABLE samples
(taxon VARCHAR(15), patient INTEGER, tissue INTEGER, stool INTEGER);
'''
This query string will create a table to hold some of our microbiome data, which we can execute after connecting to a database (which will be created, if it does not exist).
In [75]:
con = sqlite3.connect('microbiome.sqlite3')
con.execute(query)
con.commit()
In [76]:
few_recs.ix[0]
Out[76]:
In [77]:
con.execute('INSERT INTO samples VALUES(\'{}\',{},{},{})'.format(*few_recs.ix[0]))
Out[77]:
In [78]:
query = 'INSERT INTO samples VALUES(?, ?, ?, ?)'
con.executemany(query, few_recs.values[1:])
Out[78]:
In [79]:
con.commit()
Using SELECT queries, we can read from the database.
In [80]:
cursor = con.execute('SELECT * FROM samples')
rows = cursor.fetchall()
rows
Out[80]:
These results can be passed directly to a DataFrame:
In [81]:
pd.DataFrame(rows)
Out[81]:
To obtain the column names, we can retrieve the table information from the database via the special PRAGMA statement.
In [82]:
table_info = con.execute('PRAGMA table_info(samples);').fetchall()
table_info
Out[82]:
In [83]:
pd.DataFrame(rows, columns=np.transpose(table_info)[1])
Out[83]:
A more direct approach is to pass the query to the read_sql_query function, which returns a populated DataFrame.
In [84]:
pd.read_sql_query('SELECT * FROM samples', con)
Out[84]:
Correspondingly, we can append records into the database with to_sql.
In [85]:
more_recs = pd.read_csv("../data/microbiome/microbiome_missing.csv").head(20)
In [86]:
more_recs.to_sql('samples', con, if_exists='append', index=False)
In [87]:
cursor = con.execute('SELECT * FROM samples')
cursor.fetchall()
Out[87]:
There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These include JSON, XML, HDF5, non-relational databases, and various web APIs.
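For example, records stored as JSON can be read with read_json; the string below is fabricated purely for illustration:
json_string = '[{"Taxon": "Firmicutes", "Patient": 1, "Tissue": 632}, {"Taxon": "Proteobacteria", "Patient": 1, "Tissue": 1638}]'
pd.read_json(json_string)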
In [88]:
# Get rid of the database we created
!rm microbiome.sqlite3
The ../data/ebola folder contains summarized reports of Ebola cases from three countries during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.
From these data files, use pandas to import them and create a single data frame that includes the daily totals of new cases for each country.
We may use this compiled data for more advanced applications later in the course.
The data are taken from Caitlin Rivers' ebola GitHub repository, and are licensed for both commercial and non-commercial use. The tutorial repository contains a subset of this data from three countries (Sierra Leone, Liberia and Guinea) that we will use as an example. They reside in a nested subdirectory in the data directory.
In [89]:
ebola_dirs = !ls ../data/ebola/
ebola_dirs
Out[89]:
Within each country directory, there are CSV files containing daily information regarding the state of the outbreak for that country. The first step is to efficiently import all the relevant files.
Our approach will be to construct a dictionary containing a list of filenames to import. We can use the glob module to identify all the CSV files in each directory. This can all be placed within a dictionary comprehension.
In [90]:
import glob
filenames = {data_dir[:data_dir.find('_')]: glob.glob('../data/ebola/{0}/*.csv'.format(data_dir)) for data_dir in ebola_dirs[1:]}
We are now in a position to iterate over the dictionary and import the corresponding files. However, the data layout of the files across the dataset is partially inconsistent.
In [91]:
pd.read_csv('../data/ebola/sl_data/2014-08-12-v77.csv').head()
Out[91]:
In [92]:
pd.read_csv('../data/ebola/guinea_data/2014-09-02.csv').head()
Out[92]:
Clearly, we will need to develop row masks to extract the data we need across all files, without having to manually extract data from each file.
Let's hack at one file to develop the mask.
In [93]:
sample = pd.read_csv('../data/ebola/sl_data/2014-08-12-v77.csv')
To prevent issues with capitalization, we will simply convert all labels to lower case.
In [94]:
lower_vars = sample.variable.str.lower()
Since we are interested in extracting new cases only, we can use the string accessor attribute to look for key words that we would like to include or exclude.
In [95]:
case_mask = (lower_vars.str.contains('new')
             & (lower_vars.str.contains('case') | lower_vars.str.contains('suspect'))
             & ~lower_vars.str.contains('non')
             & ~lower_vars.str.contains('total'))
We could have instead used regular expressions to do the same thing.
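For instance, a rough regular-expression equivalent of the mask above (a sketch using lookaheads, not necessarily the pattern used in the original material) would be:
pattern = r'^(?=.*new)(?=.*(?:case|suspect))(?!.*non)(?!.*total)'
case_mask_regex = lower_vars.str.contains(pattern)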
Finally, we are only interested in three columns.
In [96]:
sample.loc[case_mask, ['date', 'variable', 'National']]
Out[96]:
We can now embed this operation in a loop over all the filenames in the dictionary.
In [97]:
datasets = []
for country in filenames:
    country_files = filenames[country]
    for f in country_files:
        data = pd.read_csv(f)
        # Convert to lower case to avoid capitalization issues
        data.columns = data.columns.str.lower()
        # Column naming is inconsistent. These procedures deal with that.
        keep_columns = ['date']
        if 'description' in data.columns:
            keep_columns.append('description')
        else:
            keep_columns.append('variable')
        if 'totals' in data.columns:
            keep_columns.append('totals')
        else:
            keep_columns.append('national')
        # Index out the columns we need, and rename them
        keep_data = data[keep_columns]
        keep_data.columns = 'date', 'variable', 'totals'
        # Extract the rows we might want
        lower_vars = keep_data.variable.str.lower()
        # Of course we can also use regex to do this
        case_mask = (lower_vars.str.contains('new')
                     & (lower_vars.str.contains('case') | lower_vars.str.contains('suspect')
                        | lower_vars.str.contains('confirm'))
                     & ~lower_vars.str.contains('non')
                     & ~lower_vars.str.contains('total'))
        keep_data = keep_data[case_mask].dropna()
        # Convert data types
        keep_data['date'] = pd.to_datetime(keep_data.date)
        keep_data['totals'] = keep_data.totals.astype(int)
        # Assign country label and append to datasets list
        datasets.append(keep_data.assign(country=country))
Now that we have a list populated with DataFrame objects for each day and country, we can call concat to concatenate them into a single DataFrame.
In [98]:
all_data = pd.concat(datasets)
all_data.head()
Out[98]:
This works because the structure of each table was identical. However, the resulting index is not unique, since each constituent DataFrame carried over its own row labels:
In [99]:
all_data.index.is_unique
Out[99]:
We can create a new unique index by calling the reset_index method on the new data frame after we import it, which will generate a new ordered, unique index.
In [100]:
all_data = pd.concat(datasets).reset_index(drop=True)
all_data.head()
Out[100]:
Reindexing allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, optionally filling in missing data if requested.
A simple use of reindex is to alter the order of the rows. For example, records are currently ordered first by country then by day, since this is the order in which they were iterated over and imported. We might arbitrarily want to reverse the order, which is performed by passing the appropriate index values to reindex.
In [101]:
all_data.reindex(all_data.index[::-1])
Out[101]:
Notice that the reindexing operation is not performed "in-place"; the original DataFrame remains as it was, and the method returns a copy of the DataFrame with the new index. This is a common trait for pandas, and is a Good Thing.
We may also wish to reorder the columns this way.
In [102]:
all_data.reindex(columns=['date', 'country', 'variable', 'totals']).head()
Out[102]:
One of pandas' most powerful features is the ability to perform operations on subgroups of a DataFrame. These so-called group by operations define subunits of the dataset according to the values of one or more variables in the DataFrame.
For this data, we want to sum the new case counts by day and country; so we pass these two column names to the groupby method, then sum the totals column across them.
In [103]:
all_data_grouped = all_data.groupby(['country', 'date'])
daily_cases = all_data_grouped['totals'].sum()
daily_cases.head(10)
Out[103]:
The resulting series retains a hierarchical index from the group by operation. Hence, we can index out the counts for a given country on a particular day by indexing with the appropriate tuple.
In [104]:
daily_cases[('liberia', '2014-09-02')]
Out[104]:
One issue with the data we have extracted is that there appear to be serious outliers in the Liberian counts. The values are much too large to be a daily count, even during a serious outbreak.
In [105]:
daily_cases.sort(ascending=False)
daily_cases.head(10)
Out[105]:
We can filter these outliers using an appropriate threshold.
In [106]:
daily_cases = daily_cases[daily_cases<200]
pandas data structures have high-level methods for creating a variety of plots, which tends to be easier than generating the corresponding plot using matplotlib.
For example, we may want to create a plot of the cumulative cases for each of the three countries. The easiest way to do this is to remove the hierarchical index, and create a DataFrame of three columns, which will result in three lines when plotted.
First, call unstack to remove the hierarchical index:
In [107]:
daily_cases.unstack().head()
Out[107]:
Next, transpose the resulting DataFrame to swap the rows and columns.
In [108]:
daily_cases.unstack().T.head()
Out[108]:
Since we have missing values for some dates, we will assume that the counts for those days were zero (the actual counts for that day may have been included in the next reporting day's data).
In [109]:
daily_cases.unstack().T.fillna(0).head()
Out[109]:
Finally, calculate the cumulative sum for all the columns, and generate a line plot, which we get by default.
In [110]:
daily_cases.unstack().T.fillna(0).cumsum().plot()
Out[110]:
In [111]:
weekly_cases = daily_cases.unstack().T.resample('W', how='sum')
weekly_cases
Out[111]:
In [112]:
weekly_cases.cumsum().plot()
Out[112]:
In [113]:
medals_data.to_csv("../data/medals.csv", index=False)
The to_csv method writes a DataFrame to a comma-separated values (csv) file. You can specify custom delimiters (via the sep argument), how missing values are written (via the na_rep argument), whether the index is written (via the index argument), and whether the header is included (via the header argument), among other options.
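As an illustrative sketch (the output filename and delimiter here are arbitrary choices, not part of the original example), several of these options can be combined in a single call:
medals_data.to_csv("../data/medals.tsv", sep='\t', na_rep='NA', index=True, header=True)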
The occurrence of missing data is so prevalent that it pays to use tools like pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.
Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used as such in other contexts (e.g. NumPy).
In [114]:
!head -n 20 ../data/microbiome/microbiome_missing.csv
In [115]:
pd.read_csv("../data/microbiome/microbiome_missing.csv").head(20)
Out[115]:
Above, pandas recognized NA and an empty field as missing data.
In [116]:
pd.isnull(pd.read_csv("../data/microbiome/microbiome_missing.csv")).head(20)
Out[116]:
Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the na_values argument:
In [117]:
missing_sample = pd.read_csv("../data/microbiome/microbiome_missing.csv",
na_values=['?', -99999], nrows=20)
missing_sample
Out[117]:
These can be specified on a column-wise basis using an appropriate dict as the argument for na_values.
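For example, a column-wise specification might look like the following sketch (the per-column sentinel choices are hypothetical):
pd.read_csv("../data/microbiome/microbiome_missing.csv",
            na_values={'Tissue': ['?', -99999], 'Stool': ['?', -99999]})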
By default, dropna drops entire rows in which one or more values are missing.
In [118]:
missing_sample.dropna()
Out[118]:
If we want to drop missing values column-wise instead of row-wise, we use axis=1.
In [119]:
missing_sample.dropna(axis=1)
Out[119]:
Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero), a sentinel value, or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in pandas with the fillna method.
In [120]:
missing_sample.fillna(-999)
Out[120]:
Sentinel values are useful in pandas because NaN is a floating point value, so it is impossible to use explicit missing values with integer columns; a column containing missing data is upcast to float. Using some large (positive or negative) integer as a sentinel value will allow the column to be integer typed.
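A minimal illustration of this upcasting behavior, with made-up values:
s = pd.Series([1, 2, None, 4])
s.dtype                        # float64, because NaN forces the column to float
s.fillna(-999).astype(int)     # a sentinel fill restores an integer dtype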
In [121]:
## Write your answer here