In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[1]:
An initial step in statistical data analysis is the preparation of the data to be used in the analysis. In practice, the majority of the time spent on a statistical modeling project is typically devoted to importing, cleaning, validating and transforming the dataset.
This section will introduce pandas, an important third-party Python package for data analysis, as a tool for data preparation, and provide some general advice for what should or should not be done to data before it is analyzed.
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
pandas is well suited for many kinds of data: virtually any statistical dataset, labeled or unlabeled, can be converted to a pandas data structure for cleaning, transformation, and analysis.
In [2]:
counts = pd.Series([632, 1638, 569, 115])
counts
Out[2]:
If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the Series, while the index is a pandas Index object.
In [3]:
counts.values
Out[3]:
In [4]:
counts.index
Out[4]:
We can assign meaningful labels to the index, if they are available. These counts are of bacteria taxa constituting the microbiome of hospital patients, so using the taxon of each bacterium is a useful index.
In [5]:
bacteria = pd.Series([632, 1638, 569, 115],
index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])
bacteria
Out[5]:
These labels can be used to refer to the values in the Series.
In [6]:
bacteria['Actinobacteria']
Out[6]:
In [7]:
bacteria[bacteria.index.str.endswith('bacteria')]
Out[7]:
In [8]:
'Bacteroidetes' in bacteria
Out[8]:
Notice that the indexing operation preserved the association between the values and the corresponding indices.
We can still use positional indexing if we wish.
In [9]:
bacteria[0]
Out[9]:
We can give both the array of values and the index meaningful labels themselves:
In [10]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria
Out[10]:
NumPy's math functions and other operations can be applied to Series without losing the data structure.
In [11]:
np.log(bacteria)
Out[11]:
We can also filter according to the values in the Series:
In [12]:
bacteria[bacteria>1000]
Out[12]:
A Series can be thought of as an ordered key-value store. In fact, we can create one from a dict:
In [13]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
bact = pd.Series(bacteria_dict)
In [14]:
bact
Out[14]:
Notice that the Series is created in key-sorted order.
If we pass a custom index to Series, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. pandas uses the NaN (not a number) type for missing values.
In [15]:
bacteria2 = pd.Series(bacteria_dict,
index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2
Out[15]:
In [16]:
bacteria2.isnull()
Out[16]:
Critically, the labels are used to align data when used in operations with other Series objects:
In [17]:
bacteria + bacteria2
Out[17]:
Contrast this with NumPy arrays, where arrays of the same length will combine values element-wise; adding Series combines values with the same label in the resulting series. Notice also that the missing values were propagated by addition.
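To make the contrast concrete, here is a minimal sketch using small made-up arrays and Series (not part of the bacteria example):
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
a + b                              # combined strictly by position: array([11, 22, 33])
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
s1 + s2                            # aligned by label: 'a' and 'd' have no partner, so they become NaN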
Inevitably, we want to be able to store, view and manipulate data that is multivariate, where for every index there are multiple fields or columns of data, often of varying data type.
A DataFrame is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the DataFrame allows us to represent and manipulate higher-dimensional data.
In [18]:
bacteria_data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
'patient':[1, 1, 1, 1, 2, 2, 2, 2],
'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria',
'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
bacteria_data
Out[18]:
Notice the DataFrame is sorted by column name. We can change the order by indexing the columns in the order we desire:
In [19]:
bacteria_data[['phylum','value','patient']]
Out[19]:
A DataFrame has a second index, representing the columns:
In [20]:
bacteria_data.columns
Out[20]:
If we wish to access columns, we can do so either by dict-like indexing or by attribute:
In [21]:
bacteria_data['value']
Out[21]:
In [22]:
bacteria_data.value
Out[22]:
Using the standard indexing syntax for a single column of data from a DataFrame returns the column as a Series.
In [23]:
type(bacteria_data['value'])
Out[23]:
Passing the column name as a list returns the column as a DataFrame instead.
In [24]:
bacteria_data[['value']]
Out[24]:
Notice that indexing works differently with a DataFrame than with a Series, where in the latter, dict-like indexing retrieved a particular element (row). If we want access to a row in a DataFrame, we index its ix attribute.
In [25]:
bacteria_data.ix[3]
Out[25]:
Since a row potentially contains different data types, the returned Series of values is of the generic object type.
If we want to create a DataFrame row-wise rather than column-wise, we can do so with a dict of dicts:
In [26]:
bacteria_data = pd.DataFrame({0: {'patient': 1, 'phylum': 'Firmicutes', 'value': 632},
1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})
In [27]:
bacteria_data
Out[27]:
However, we probably want this transposed:
In [28]:
bacteria_data = bacteria_data.T
bacteria_data
Out[28]:
In [29]:
vals = bacteria_data.value
vals
Out[29]:
Now, let's assign a new value to one of the elements of the Series.
In [30]:
vals[5] = 0
vals
Out[30]:
However, we may not anticipate that the value in the original DataFrame has also been changed!
In [31]:
bacteria_data
Out[31]:
We can avoid this by working with a copy when modifying subsets of the original data.
In [32]:
vals = bacteria_data.value.copy()
vals[5] = 1000
bacteria_data
Out[32]:
So, as we have seen, we can create or modify columns by assignment; let's put back the value we accidentally changed.
In [33]:
bacteria_data.value[5] = 1130
Or, we may wish to add a column representing the year the data were collected.
In [34]:
bacteria_data['year'] = 2013
bacteria_data
Out[34]:
But note, we cannot use the attribute indexing method to add a new column:
In [35]:
bacteria_data.treatment = 1
bacteria_data
Out[35]:
In [36]:
bacteria_data.treatment
Out[36]:
In [37]:
treatment = pd.Series([0]*4 + [1]*2)
treatment
Out[37]:
In [38]:
bacteria_data['treatment'] = treatment
bacteria_data
Out[38]:
Other Python data structures (ones without an index) need to be the same length as the DataFrame:
In [40]:
month = ['Jan', 'Feb', 'Mar', 'Apr']
bacteria_data['month'] = month
In [41]:
bacteria_data['month'] = ['Jan']*len(bacteria_data)
bacteria_data
Out[41]:
We can use del to remove columns, in the same way dict entries can be removed:
In [42]:
del bacteria_data['month']
bacteria_data
Out[42]:
Or employ the drop method.
In [43]:
bacteria_data.drop('treatment', axis=1)
Out[43]:
We can extract the underlying data as a simple ndarray by accessing the values attribute:
In [44]:
bacteria_data.values
Out[44]:
Notice that because of the mix of string, integer and float (and NaN) values, the dtype of the array is object. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.
In [45]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})
df.values, df.values.dtype
Out[45]:
pandas uses a custom data structure to represent the indices of Series and DataFrames.
In [46]:
bacteria_data.index
Out[46]:
Index objects are immutable:
In [47]:
bacteria_data.index[0] = 15
This is so that Index objects can be shared between data structures without fear that they will be changed.
In [48]:
bacteria2.index = bacteria.index
bacteria2
Out[48]:
In [49]:
# Write your answer here
A key, but often under-appreciated, step in data analysis is importing the data that we wish to analyze. Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure:
genes = np.loadtxt("genes.csv", delimiter=",", dtype=[('gene', '|S10'), ('value', '<f4')])
pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a DataFrame object. These functions include a slew of options to perform type inference, indexing, parsing, iterating and cleaning automatically as data are imported.
The file olympics.1996.txt in the data directory contains counts of medals awarded at the 1996 Summer Olympic Games by country, along with the countries' respective population sizes. This data is stored in a tab-separated format.
In [50]:
!head ../data/olympics.1996.txt
This table can be read into a DataFrame using read_table.
In [51]:
medals = pd.read_table('../data/olympics.1996.txt', sep='\t',
index_col=0,
header=None, names=['country', 'medals', 'population'])
medals.head()
Out[51]:
There is no header row in this dataset, so we specified this, and provided our own header names. If we did not specify header=None, the function would have assumed the first row contained column names.
The tab separator was passed to the sep argument as \t.
The sep argument can be customized as needed to accommodate arbitrary separators. For example, we can use a regular expression to define a variable amount of whitespace, which is unfortunately common in some datasets:
sep='\s+'
We would like to add another variable to this dataset. Along with population, a country's economic development may be a useful predictor of Olympic success. A very simple indicator of this might be OECD membership status.
The OECD website contains a table listing OECD member nations, along with their year of membership. We would like to import this table and extract the countries that were members as of the 1996 games.
The read_html function accepts a URL argument, and will attempt to extract all the tables from that address, returning whatever it finds in a list of DataFrames.
In [52]:
oecd_site = 'http://www.oecd.org/about/membersandpartners/list-oecd-member-countries.htm'
pd.read_html(oecd_site)
Out[52]:
There is typically some cleanup that is required of the returned data, such as the assignment of column names or conversion of types.
The table of interest is at index 1, and we will extract two columns from the table. Otherwise, this table is pretty clean.
In [53]:
oecd = pd.read_html(oecd_site, header=0)[1][[1,2]]
oecd.head()
Out[53]:
In [54]:
oecd['year'] = pd.to_datetime(oecd.Date).apply(lambda x: x.year)
oecd_year = oecd.set_index(oecd.Country.str.title())['year'].dropna()
oecd_year
Out[54]:
We can create an indicator (binary) variable for OECD status by checking if each country is in the index of countries with membership year less than 1997.
The new DataFrame method assign is a convenient means for creating the new column from this operation.
In [55]:
medals_data = medals.assign(oecd=medals.index.isin((oecd_year[oecd_year<1997]).index).astype(int))
Since the distribution of populations spans several orders of magnitude, we may wish to use the logarithm of the population size, which may be created similarly.
In [56]:
medals_data = medals_data.assign(log_population=np.log(medals.population))
The NumPy log function will return a pandas Series (or DataFrame when applied to one) instead of an ndarray; all of NumPy's functions are compatible with pandas in this way.
In [57]:
medals_data.head()
Out[57]:
In [58]:
!cat ../data/microbiome/microbiome.csv
This table can be read into a DataFrame using read_csv:
In [59]:
mb = pd.read_csv("../data/microbiome/microbiome.csv")
mb.head()
Out[59]:
If we have sections of data that we do not wish to import (for example, known bad data), we can populate the skiprows argument:
In [60]:
pd.read_csv("../data/microbiome/microbiome.csv", skiprows=[3,4,6]).head()
Out[60]:
Conversely, if we only want to import a small number of rows from, say, a very large data file, we can use nrows:
In [61]:
few_recs = pd.read_csv("../data/microbiome/microbiome.csv", nrows=4)
few_recs
Out[61]:
Alternately, if we want to process our data in reasonable chunks, the chunksize argument will return an iterable object that can be employed in a data processing loop. For example, our microbiome data are organized by bacterial phylum, with 15 patients represented in each:
In [62]:
data_chunks = pd.read_csv("../data/microbiome/microbiome.csv", chunksize=15)
data_chunks
Out[62]:
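As a minimal sketch of how such chunks might be consumed (the aggregation here is purely illustrative, and assumes each 15-row chunk corresponds to a single phylum, as described above):
mean_tissue = {}
for chunk in pd.read_csv("../data/microbiome/microbiome.csv", chunksize=15):
    # each chunk is itself a DataFrame of 15 patients for one phylum
    mean_tissue[chunk.Taxon.iloc[0]] = chunk.Tissue.mean()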
In [63]:
# Write your answer here
In [64]:
mb = pd.read_csv("../data/microbiome/microbiome.csv", index_col=['Taxon','Patient'])
mb.head()
Out[64]:
This is called a hierarchical index, which allows multiple dimensions of data to be represented in tabular form.
In [65]:
mb.index
Out[65]:
The corresponding index is a MultiIndex object that consists of a sequence of tuples, the elements of which are some combination of the two columns used to create the index. Where there are multiple repeated values, pandas does not print the repeats, making it easy to identify groups of values.
Rows can be indexed by passing the appropriate tuple.
In [66]:
mb.ix[('Firmicutes', 2)]
Out[66]:
With a hierarchical index, we can select subsets of the data based on a partial index:
In [67]:
mb.ix['Proteobacteria']
Out[67]:
To extract arbitrary levels from a hierarchical row index, the cross-section method xs can be used.
In [68]:
mb.xs(1, level='Patient')
Out[68]:
We may also reorder levels as we like.
In [69]:
mb.swaplevel('Patient', 'Taxon').head()
Out[69]:
In [70]:
mb.Stool / mb.Tissue
Out[70]:
Since so much financial and scientific data ends up in Excel spreadsheets (regrettably), pandas' ability to directly import Excel spreadsheets is valuable. This support is contingent on having one or two dependencies (depending on what version of Excel file is being imported) installed: xlrd and openpyxl (these may be installed with either pip or easy_install).
Importing Excel data to pandas is a two-step process. First, we create an ExcelFile object using the path of the file:
In [71]:
mb_file = pd.ExcelFile('../data/microbiome/MID1.xls')
mb_file
Out[71]:
Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:
In [72]:
mb1 = mb_file.parse("Sheet 1", header=None)
mb1.columns = ["Taxon", "Count"]
mb1.head()
Out[72]:
There is now a read_excel convenience function in pandas that combines these steps into a single call:
In [73]:
mb2 = pd.read_excel('../data/microbiome/MID2.xls', sheetname='Sheet 1', header=None)
mb2.head()
Out[73]:
If you are fortunate, your data will be stored in a database (relational or non-relational) rather than in arbitrary text files or spreadsheets. Relational databases are particularly useful for storing large quantities of structured data, where fields are grouped together in tables according to their relationships with one another.
pandas' DataFrame interacts with relational (i.e. SQL) databases, and even provides facilities for using SQL syntax on the DataFrame itself, which we will get to later. For now, let's work with a ubiquitous embedded database called SQLite, which comes bundled with Python. A SQLite database can be queried with the standard library's sqlite3 module.
In [74]:
import sqlite3
query = '''
CREATE TABLE samples
(taxon VARCHAR(15), patient INTEGER, tissue INTEGER, stool INTEGER);
'''
This query string will create a table to hold some of our microbiome data, which we can execute after connecting to a database (which will be created, if it does not exist).
In [75]:
con = sqlite3.connect('microbiome.sqlite3')
con.execute(query)
con.commit()
In [76]:
few_recs.ix[0]
Out[76]:
In [77]:
con.execute('INSERT INTO samples VALUES(\'{}\',{},{},{})'.format(*few_recs.ix[0]))
Out[77]:
In [78]:
query = 'INSERT INTO samples VALUES(?, ?, ?, ?)'
con.executemany(query, few_recs.values[1:])
Out[78]:
In [79]:
con.commit()
Using SELECT queries, we can read from the database.
In [80]:
cursor = con.execute('SELECT * FROM samples')
rows = cursor.fetchall()
rows
Out[80]:
These results can be passed directly to a DataFrame:
In [81]:
pd.DataFrame(rows)
Out[81]:
To obtain the column names, we can retrieve the table information from the database via the special PRAGMA statement.
In [82]:
table_info = con.execute('PRAGMA table_info(samples);').fetchall()
table_info
Out[82]:
In [83]:
pd.DataFrame(rows, columns=np.transpose(table_info)[1])
Out[83]:
A more direct approach is to pass the query to the read_sql_query function, which returns a populated DataFrame.
In [84]:
pd.read_sql_query('SELECT * FROM samples', con)
Out[84]:
Correspondingly, we can append records into the database with to_sql.
In [85]:
more_recs = pd.read_csv("../data/microbiome/microbiome_missing.csv").head(20)
In [86]:
more_recs.to_sql('samples', con, if_exists='append', index=False)
In [87]:
cursor = con.execute('SELECT * FROM samples')
cursor.fetchall()
Out[87]:
There are several other data formats that can be imported into Python and converted into DataFrames, with the help of built-in or third-party libraries. These include JSON, XML, HDF5, non-relational databases, and various web APIs.
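For example, records stored as JSON can be read with read_json; the string below is fabricated purely for illustration:
json_string = '[{"Taxon": "Firmicutes", "Patient": 1, "Tissue": 632}, {"Taxon": "Proteobacteria", "Patient": 1, "Tissue": 1638}]'
pd.read_json(json_string)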
In [88]:
# Get rid of the database we created
!rm microbiome.sqlite3
The ../data/ebola folder contains summarized reports of Ebola cases from three countries during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.
From these data files, use pandas to import them and create a single data frame that includes the daily totals of new cases for each country.
We may use this compiled data for more advanced applications later in the course.
The data are taken from Caitlin Rivers' ebola GitHub repository, and are licensed for both commercial and non-commercial use. The tutorial repository contains a subset of this data from three countries (Sierra Leone, Liberia and Guinea) that we will use as an example. They reside in a nested subdirectory in the data directory.
In [89]:
ebola_dirs = !ls ../data/ebola/
ebola_dirs
Out[89]:
Within each country directory, there are CSV files containing daily information regarding the state of the outbreak for that country. The first step is to efficiently import all the relevant files.
Our approach will be to construct a dictionary containing a list of filenames to import. We can use the glob module to identify all the CSV files in each directory. This can all be placed within a dictionary comprehension.
In [90]:
import glob
filenames = {data_dir[:data_dir.find('_')]: glob.glob('../data/ebola/{0}/*.csv'.format(data_dir)) for data_dir in ebola_dirs[1:]}
We are now in a position to iterate over the dictionary and import the corresponding files. However, the data layout of the files across the dataset is partially inconsistent.
In [91]:
pd.read_csv('../data/ebola/sl_data/2014-08-12-v77.csv').head()
Out[91]:
In [92]:
pd.read_csv('../data/ebola/guinea_data/2014-09-02.csv').head()
Out[92]:
Clearly, we will need to develop row masks to extract the data we need across all files, without having to manually extract data from each file.
Let's hack at one file to develop the mask.
In [93]:
sample = pd.read_csv('../data/ebola/sl_data/2014-08-12-v77.csv')
To prevent issues with capitalization, we will simply convert all labels to lower case.
In [94]:
lower_vars = sample.variable.str.lower()
Since we are interested in extracting new cases only, we can use the string accessor attribute to look for key words that we would like to include or exclude.
In [95]:
case_mask = (lower_vars.str.contains('new')
             & (lower_vars.str.contains('case') | lower_vars.str.contains('suspect'))
             & ~lower_vars.str.contains('non')
             & ~lower_vars.str.contains('total'))
We could have instead used regular expressions to do the same thing.
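For instance, a rough regular-expression equivalent of the mask above (a sketch using lookaheads, not necessarily the pattern used in the original material) would be:
pattern = r'^(?=.*new)(?=.*(?:case|suspect))(?!.*non)(?!.*total)'
case_mask_regex = lower_vars.str.contains(pattern)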
Finally, we are only interested in three columns.
In [96]:
sample.loc[case_mask, ['date', 'variable', 'National']]
Out[96]:
We can now embed this operation in a loop over all the filenames in the dictionary.
In [97]:
datasets = []
for country in filenames:
    country_files = filenames[country]
    for f in country_files:
        data = pd.read_csv(f)
        # Convert to lower case to avoid capitalization issues
        data.columns = data.columns.str.lower()
        # Column naming is inconsistent. These procedures deal with that.
        keep_columns = ['date']
        if 'description' in data.columns:
            keep_columns.append('description')
        else:
            keep_columns.append('variable')
        if 'totals' in data.columns:
            keep_columns.append('totals')
        else:
            keep_columns.append('national')
        # Index out the columns we need, and rename them
        keep_data = data[keep_columns]
        keep_data.columns = 'date', 'variable', 'totals'
        # Extract the rows we might want
        lower_vars = keep_data.variable.str.lower()
        # Of course we can also use regex to do this
        case_mask = (lower_vars.str.contains('new')
                     & (lower_vars.str.contains('case') | lower_vars.str.contains('suspect')
                        | lower_vars.str.contains('confirm'))
                     & ~lower_vars.str.contains('non')
                     & ~lower_vars.str.contains('total'))
        keep_data = keep_data[case_mask].dropna()
        # Convert data types
        keep_data['date'] = pd.to_datetime(keep_data.date)
        keep_data['totals'] = keep_data.totals.astype(int)
        # Assign country label and append to datasets list
        datasets.append(keep_data.assign(country=country))
Now that we have a list populated with DataFrame objects for each day and country, we can call concat to concatenate them into a single DataFrame.
In [98]:
all_data = pd.concat(datasets)
all_data.head()
Out[98]:
This works because the structure of each table was identical. However, the resulting index is not unique, since each constituent DataFrame carried over its own row labels:
In [99]:
all_data.index.is_unique
Out[99]:
We can create a new unique index by calling the reset_index method on the new data frame after we import it, which will generate a new ordered, unique index.
In [100]:
all_data = pd.concat(datasets).reset_index(drop=True)
all_data.head()
Out[100]:
Reindexing allows users to manipulate the data labels in a DataFrame. It forces a DataFrame to conform to the new index, optionally filling in missing data if requested.
A simple use of reindex is to alter the order of the rows. For example, records are currently ordered first by country then by day, since this is the order in which they were iterated over and imported. We might arbitrarily want to reverse the order, which is performed by passing the appropriate index values to reindex.
In [101]:
all_data.reindex(all_data.index[::-1])
Out[101]:
Notice that the reindexing operation is not performed "in-place"; the original DataFrame remains as it was, and the method returns a copy of the DataFrame with the new index. This is a common trait for pandas, and is a Good Thing.
We may also wish to reorder the columns this way.
In [102]:
all_data.reindex(columns=['date', 'country', 'variable', 'totals']).head()
Out[102]:
One of pandas' most powerful features is the ability to perform operations on subgroups of a DataFrame. These so-called group by operations define subunits of the dataset according to the values of one or more variables in the DataFrame.
For this data, we want to sum the new case counts by day and country; so we pass these two column names to the groupby method, then sum the totals column across them.
In [103]:
all_data_grouped = all_data.groupby(['country', 'date'])
daily_cases = all_data_grouped['totals'].sum()
daily_cases.head(10)
Out[103]:
The resulting series retains a hierarchical index from the group by operation. Hence, we can index out the counts for a given country on a particular day by indexing with the appropriate tuple.
In [104]:
daily_cases[('liberia', '2014-09-02')]
Out[104]:
One issue with the data we have extracted is that there appear to be serious outliers in the Liberian counts. The values are much too large to be a daily count, even during a serious outbreak.
In [105]:
daily_cases.sort(ascending=False)
daily_cases.head(10)
Out[105]:
We can filter these outliers using an appropriate threshold.
In [106]:
daily_cases = daily_cases[daily_cases<200]
pandas data structures have high-level methods for creating a variety of plots, which tends to be easier than generating the corresponding plot using matplotlib.
For example, we may want to create a plot of the cumulative cases for each of the three countries. The easiest way to do this is to remove the hierarchical index, and create a DataFrame of three columns, which will result in three lines when plotted.
First, call unstack to remove the hierarchical index:
In [107]:
daily_cases.unstack().head()
Out[107]:
Next, transpose the resulting DataFrame to swap the rows and columns.
In [108]:
daily_cases.unstack().T.head()
Out[108]:
Since we have missing values for some dates, we will assume that the counts for those days were zero (the actual counts for that day may have been included in the next reporting day's data).
In [109]:
daily_cases.unstack().T.fillna(0).head()
Out[109]:
Finally, calculate the cumulative sum for all the columns, and generate a line plot, which we get by default.
In [110]:
daily_cases.unstack().T.fillna(0).cumsum().plot()
Out[110]:
In [111]:
weekly_cases = daily_cases.unstack().T.resample('W', how='sum')
weekly_cases
Out[111]:
In [112]:
weekly_cases.cumsum().plot()
Out[112]:
In [113]:
medals_data.to_csv("../data/medals.csv", index=False)
The to_csv method writes a DataFrame to a comma-separated values (csv) file. You can specify custom delimiters (via the sep argument), how missing values are written (via the na_rep argument), whether the index is written (via the index argument), and whether the header is included (via the header argument), among other options.
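As an illustrative sketch (the output filename and delimiter here are arbitrary choices, not part of the original example), several of these options can be combined in a single call:
medals_data.to_csv("../data/medals.tsv", sep='\t', na_rep='NA', index=True, header=True)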
The occurrence of missing data is so prevalent that it pays to use tools like pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.
Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used as such in other contexts (e.g. NumPy).
In [114]:
!head -n 20 ../data/microbiome/microbiome_missing.csv
In [115]:
pd.read_csv("../data/microbiome/microbiome_missing.csv").head(20)
Out[115]:
Above, pandas recognized NA and an empty field as missing data.
In [116]:
pd.isnull(pd.read_csv("../data/microbiome/microbiome_missing.csv")).head(20)
Out[116]:
Unfortunately, there will sometimes be inconsistency with the conventions for missing data. In this example, there is a question mark "?" and a large negative number where there should have been a positive integer. We can specify additional symbols with the na_values argument:
In [117]:
missing_sample = pd.read_csv("../data/microbiome/microbiome_missing.csv",
na_values=['?', -99999], nrows=20)
missing_sample
Out[117]:
These can be specified on a column-wise basis using an appropriate dict as the argument for na_values.
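For example, a column-wise specification might look like the following sketch (the per-column sentinel choices are hypothetical):
pd.read_csv("../data/microbiome/microbiome_missing.csv",
            na_values={'Tissue': ['?', -99999], 'Stool': ['?', -99999]})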
By default, dropna drops entire rows in which one or more values are missing.
In [118]:
missing_sample.dropna()
Out[118]:
If we want to drop missing values column-wise instead of row-wise, we use axis=1.
In [119]:
missing_sample.dropna(axis=1)
Out[119]:
Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero), a sentinel value, or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in pandas with the fillna method.
In [120]:
missing_sample.fillna(-999)
Out[120]:
Sentinel values are useful in pandas because NaN is a floating point value, so it is impossible to use explicit missing values with integer columns; a column containing missing data is upcast to float. Using some large (positive or negative) integer as a sentinel value will allow the column to be integer typed.
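A minimal illustration of this upcasting behavior, with made-up values:
s = pd.Series([1, 2, None, 4])
s.dtype                        # float64, because NaN forces the column to float
s.fillna(-999).astype(int)     # a sentinel fill restores an integer dtype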
In [121]:
## Write your answer here