In [1]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from IPython.display import Image
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
In [2]:
%%javascript
$.getScript('misc/kmahelona_ipython_notebook_toc.js')
See Installation_Customization_Resources.txt for useful info and links.
IPython provides a rich toolkit to help you make the most of using Python.
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.
Check out the "command palette" for useful commands and shortcuts.
In [3]:
from IPython.core.display import HTML
HTML("<iframe src=https://jupyter.readthedocs.io/en/latest/_images/notebook_components.png width=800 height=400></iframe>")
Out[3]:
Esc and Enter toggle between "Command" and "Edit" mode --> see the colour indicator of the frame surrounding the cell.
When in "Command" mode:
a to insert a cell Above
b to insert a cell Below
dd to delete a cell
h for Help, or check out "Help" --> "Keyboard Shortcuts" in the menu
When in "Edit" mode:
Tab for tab completion of file names
Shift + Tab inside the parentheses of a function to display its docstring
Cmd + / to toggle a line comment
To run a cell of code, hit Shift + Enter. This executes the cell and puts the cursor in the next cell below, or makes a new one if you are at the end. Alternately, you can use:
Alt + Enter to force the creation of a new cell unconditionally (useful when inserting new content in the middle of an existing notebook).
Control + Enter to execute the cell and keep the cursor in the same cell, useful for quick experimentation with snippets that you don't need to keep permanently.
In [4]:
!pwd
In [5]:
!ls -la
In [6]:
var = !ls
len(var), var[:3]
Out[6]:
In [7]:
%%bash
echo "My shell is:" $SHELL
echo "My disk usage is:"
df -h
command | description |
---|---|
? | Introduction and overview of IPython’s features. |
%quickref | Quick reference. |
help | Python’s own help system. |
?object | Details about ‘object’, use ‘??object’ for extra details. |
In [8]:
# ?math.log
In [9]:
help(math.log)
Press Shift + Tab inside the brackets to get the docstring.
In [10]:
# math.log()
In [11]:
# %magic
In [12]:
values = range(1, 1001)
In [13]:
%%timeit
results = []
for val in values:
    new_val = math.log(val, 10)
    results.append(new_val)
In [14]:
%%timeit
results = [math.log(val, 10) for val in values]
In [15]:
%%timeit
results = np.log10(values)
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
pandas is well suited for many kinds of data: tabular data with heterogeneously-typed columns, ordered and unordered time series data, and arbitrary matrix data with row and column labels.
Key features include easy handling of missing data, automatic data alignment, flexible group-by functionality, intuitive merging and joining of data sets, and robust I/O tools.
Creating a Series by passing a list of values, letting pandas create a default integer index:
In [16]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Out[16]:
Creating a DataFrame by passing a numpy array, with labeled columns:
In [17]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[17]:
Creating a DataFrame by passing a dict of objects that can be converted to a Series.
In [18]:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : np.array([3] * 4, dtype='int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo',
                    'tinyRick': pd.Series([2000])})
df2
Out[18]:
Having specific dtypes (data types)
In [19]:
df2.dtypes
Out[19]:
change the dtype
In [20]:
df2["D"] = df2["D"].astype('float64')
See the top & bottom rows of the frame
In [21]:
df.head()
Out[21]:
In [22]:
df.tail(3)
Out[22]:
Display the index, columns, and the underlying numpy data
In [23]:
df.index
Out[23]:
In [24]:
df.columns
Out[24]:
In [25]:
df.values
Out[25]:
describe() shows a quick statistical summary of your data, excluding NaN (Not a Number) values.
In [26]:
df.describe()
Out[26]:
Concise summary of a DataFrame
In [27]:
df2.info()
Transposing your data
In [28]:
df.T
Out[28]:
Sorting by values
In [29]:
df.sort_values(by='B')
Out[29]:
Important ones: .loc (label-based selection) and .iloc (integer-position-based selection).
For completeness/comparison: [] indexing and the scalar accessors .at and .iat.
CAVEAT: label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index.
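A minimal sketch of this caveat (the toy DataFrame below is hypothetical): a .loc label slice includes both endpoints, whereas a .iloc position slice follows Python's usual half-open convention.
toy = pd.DataFrame(np.arange(12).reshape(4, 3), index=list('wxyz'), columns=list('ABC'))
toy.loc['w':'y']   # label-based slice: rows 'w', 'x' AND the stop label 'y'
toy.iloc[0:2]      # position-based slice: rows 0 and 1 only (stop excluded)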
Selecting a single column, which yields a Series.
In [30]:
df["A"]
Out[30]:
equivalent
In [31]:
df.A
Out[31]:
Selecting with a list of column names yields a DataFrame (even when the list contains only a single column name).
In [32]:
df[["A", "C"]]
Out[32]:
Selecting via [], which slices the rows.
In [33]:
df[0:3]
Out[33]:
df.loc[rows, columns]
.loc is primarily label based, but may also be used with a boolean array. .loc will raise a KeyError when the items are not found. Allowed inputs are: a single label, a list or array of labels, a slice object with labels (note that both the start and the stop of the slice are included), and a boolean array.
In [34]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[34]:
In [35]:
# Let's transpose the DataFrame and shuffle the order of the columns (to force us to use the string labels of the new row index).
# row-index: Strings, columns: Integers
dfx = df.T
import random
columns = dfx.columns.tolist()
random.shuffle(columns)
dfx.columns = columns
dfx
Out[35]:
In [36]:
dfx.loc["A":"C", ]
Out[36]:
In [37]:
dfx.loc[["C", "A", "D"], ]
Out[37]:
In [38]:
dfx.loc[["A", "D"], 0:2]
Out[38]:
df.iloc[rows, columns]
.iloc is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise an IndexError if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are: an integer, a list or array of integers, a slice object with integers, and a boolean array.
In [39]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[39]:
In [40]:
### transpose
# df = df.T
# df
In [41]:
df.iloc[0:3, :]
Out[41]:
Uncomment the transpose, execute the two cells above again, and notice that this doesn't raise an error, in contrast to label-based indexing.
In [42]:
df.iloc[[1, 3, 5], 1:3]
Out[42]:
In [43]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[43]:
Using a single column’s values to select data.
In [44]:
df[df["A"] > 0]
Out[44]:
A where operation for getting.
In [45]:
df[df > 0]
Out[45]:
This comparison yields a pandas Series of booleans.
In [46]:
cond = df["A"] > 0
print(type(cond))
cond
Out[46]:
Summing boolean arrays can come in handy (True=1, False=0)
In [47]:
sum(cond), len(cond) # or cond.sum(), cond.shape[0]
Out[47]:
Frequent use case: combining a Series of Bools with specific column names to select data.
In [48]:
df.loc[cond, ["A", "C"]]
Out[48]:
Let's add a column to the DataFrame
In [49]:
df['E'] = ['one', 'one','two','three','four','three']
In [50]:
df
Out[50]:
Using the isin() method for filtering:
In [51]:
df[df['E'].isin(['two','four'])]
Out[51]:
Setting a new column automatically aligns the data by the indexes
In [52]:
dates = pd.date_range('20130101', periods=6)
dfx = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
dfx
Out[52]:
In [53]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1 = s1.sort_values(ascending=False)
s1
Out[53]:
In [54]:
dfx["F"] = s1
dfx
Out[54]:
Setting values by label
In [55]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df['E'] = ['one', 'one','two','three','four','three']
df
Out[55]:
In [56]:
df.loc[5, "A"] = 0.815
Setting values by position
In [57]:
df.iloc[0, 4] = "zero"
delete (drop) some rows
In [58]:
df.drop([3, 5])
Out[58]:
delete (drop) a column
In [59]:
df = df.drop("E", axis=1)
df
Out[59]:
In [60]:
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
Out[60]:
A where operation with setting.
In [61]:
df[df > 0] = -df
df
Out[61]:
Multiply with a scalar.
In [62]:
df = df * -1
df
Out[62]:
Element-wise (row-by-row) division of one column by another
In [63]:
df["F"] = df["A"] / df["B"]
df
Out[63]:
pandas primarily uses the value np.nan to represent missing data.
In [64]:
df.loc[0, ] = np.nan
df.loc[2, ["A", "C"]] = np.nan
df
Out[64]:
Drop any rows that have missing data.
In [65]:
df.dropna()
Out[65]:
Drop only rows where all values are missing.
In [66]:
df.dropna(how='all')
Out[66]:
Fill missing values
In [67]:
df.fillna(value=5)
Out[67]:
See the replace method to replace values given in 'to_replace' with 'value'.
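A small sketch of replace on a hypothetical Series, e.g. when 0.0 encodes "not measured":
s = pd.Series([0.0, 1.5, 0.0, 3.2])      # hypothetical values
s.replace(to_replace=0.0, value=np.nan)  # every 0.0 becomes NaN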
To get the boolean mask where values are nan
In [68]:
df.isnull() # or pd.isnull(df)
Out[68]:
In [69]:
df.notnull() == ~df.isnull()
Out[69]:
In [70]:
dfx = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                    'C' : np.random.randn(8),
                    'D' : np.random.randn(8)})
dfx
dfx
Out[70]:
Drop duplicates except for the first occurrence, considering only a certain column.
In [71]:
dfx.drop_duplicates(subset="A")
Out[71]:
See also the duplicated method, which returns a boolean Series denoting duplicate rows, optionally only considering certain columns.
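A quick sketch using the dfx frame from above: duplicated marks every row whose value in column "A" has been seen before.
mask = dfx.duplicated(subset="A")  # boolean Series: True for repeated 'foo'/'bar' rows
dfx[~mask]                         # keeps first occurrences; equivalent to dfx.drop_duplicates(subset="A")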
In [72]:
df
Out[72]:
Performing a descriptive statistic
In [73]:
df.mean()
Out[73]:
On the other axis
In [74]:
df.median(axis=1)
Out[74]:
The full range of basic statistics that are quickly calculable and built into the base Pandas package are:
Function | Description |
---|---|
count | Number of non-null observations |
sum | Sum of values |
mean | Mean of values |
mad | Mean absolute deviation |
median | Arithmetic median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute Value |
prod | Product of values |
std | Bessel-corrected sample standard deviation |
var | Unbiased variance |
sem | Standard error of the mean |
skew | Sample skewness (3rd moment) |
kurt | Sample kurtosis (4th moment) |
quantile | Sample quantile (value at %) |
cumsum | Cumulative sum |
cumprod | Cumulative product |
cummax | Cumulative maximum |
cummin | Cumulative minimum |
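A quick illustration of a few entries from this table on a hypothetical toy DataFrame:
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 30, 40]})
print(toy.sum())          # sum of values per column
print(toy.quantile(0.5))  # 50% quantile per column, i.e. the median
print(toy.cumsum())       # cumulative sum down each column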
The need for custom functions is minimal unless you have very specific requirements; cf. http://pandas.pydata.org/pandas-docs/stable/basics.html
e.g. compute pairwise covariance of columns, excluding NA/null values
In [75]:
df.cov()
Out[75]:
Applying functions to the data
In [76]:
df = pd.DataFrame(np.random.randint(0,100,size=(6, 4)), columns=list('ABCD'))
df
Out[76]:
In [77]:
df.apply(np.cumsum, axis=0)
Out[77]:
In [78]:
df.apply(lambda x: x.max() - x.min())
Out[78]:
pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
In [79]:
df = pd.DataFrame(np.random.randn(10, 4))
df
Out[79]:
In [80]:
# break it into multiple pieces
pieces = [df[3:7], df[:3], df[7:]]
pd.concat(pieces)
Out[80]:
SQL style merges.
In [81]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['bar', 'foo'], 'rval': [4, 5]})
left
Out[81]:
In [82]:
right
Out[82]:
In [83]:
pd.merge(left, right, on='key')
Out[83]:
Append rows to a dataframe.
In [84]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df
Out[84]:
In [85]:
df.append(s, ignore_index=True)
Out[85]:
By “group by” we are referring to a process involving one or more of the following steps: splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.
In [86]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
df
Out[86]:
Grouping and then applying the sum function to the resulting groups.
In [87]:
df.groupby('A').sum()
Out[87]:
Grouping by multiple columns forms a hierarchical index, to which we then apply the function.
In [88]:
df.groupby(['A','B']).sum()
Out[88]:
Series.nlargest: Return the largest n elements. Series.nsmallest: Return the smallest n elements.
In [89]:
df.groupby('A')["C"].nlargest(2)
Out[89]:
Iterate over a groupby object, just to illustrate the split part of the groupby method.
In [90]:
grouped = df.groupby(['A','B'])
for name, group in grouped:
    print("Name:", name)
    print(group)
    print("#" * 50)
Aggregation: computing a summary statistic (or statistics) about each group. Some examples: computing group sums or means, computing group sizes/counts.
Transformation: perform some group-specific computations and return a like-indexed object. Some examples: standardizing data (computing a z-score) within a group, filling NAs within groups with a value derived from each group.
Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples: discarding data that belongs to groups with only a few members, filtering out data based on the group sum or mean.
Aggregation "reduces" the DataFrame, meaning df_original.shape > df_aggregated.shape.
In [91]:
grouped.aggregate(["sum", "count", "median", "mean"]) # see `Operations` above for more built-in methods
Out[91]:
A transformation returns an object that is indexed the same (same size) as the one being grouped. Thus, the passed transform function should return a result that is the same size as the group chunk.
In [92]:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})
df['e'] = df['a'] + df['b']
df
Out[92]:
In [93]:
df['f'] = (df.groupby(['c', 'd'])['e'].transform('sum'))
df
Out[93]:
In [94]:
assert df.loc[0, "f"] == df.loc[( (df["c"] == "q") & (df["d"] == "z") ), "e"].sum()
Filtration returns a subset of the original object.
In [95]:
df = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
df
Out[95]:
In [96]:
df.groupby('B').filter(lambda x: len(x) > 2)
Out[96]:
Another frequent operation is applying a function on 1D arrays to each column or row.
Apply a custom function to the entire DataFrame or a Series. The axis argument indicates row- or column-wise application (default: axis=0).
apply on a Series performs an element-wise operation.
In [97]:
def example_custom_function(number):
    if number >= 6:
        return number / 5.0
    elif number <= 3:
        return number * 88
    else:
        return np.nan
In [98]:
df["A"].apply(example_custom_function)
Out[98]:
apply on a DataFrame performs a Series-wise operation
In [99]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df
Out[99]:
In [100]:
def example_custom_function_on_series_1(series):
    return series / series.sum()
In [101]:
### notice how example_custom_function_on_series_1 returns a like-indexed Series (same size per column) --> transformation
df.apply(example_custom_function_on_series_1)
Out[101]:
In [102]:
def example_custom_function_on_series_2(series):
    return series.max() - series.min()
In [103]:
### notice how example_custom_function_on_series_2 returns a single value per Series --> aggregation
df.apply(example_custom_function_on_series_2)
Out[103]:
Element-wise Python functions can be used, too. You can do this with applymap:
In [104]:
formatter = lambda x: '%.2f' % x
df.applymap(formatter)
Out[104]:
The reason for the name applymap is that Series has a map method for applying an element-wise function:
In [105]:
df["A"].map(formater)
Out[105]:
The stack() method “compresses” a level in the DataFrame’s columns.
In [106]:
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df2 = df[:4]
df2
Out[106]:
In [107]:
stacked = df2.stack()
stacked
Out[107]:
the inverse operation of stack() is unstack(), which by default unstacks the last level
In [108]:
stacked.unstack() # in this particular case equivalent to stacked.unstack(2)
Out[108]:
In [109]:
stacked.unstack(0)
Out[109]:
In [110]:
stacked.unstack(1)
Out[110]:
We can produce pivot tables from this data very easily:
In [111]:
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['a', 'b', 'c'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df
Out[111]:
In [112]:
pd.pivot_table(df, values=['D', "E"], index=['A', 'B'], columns=['C'])
Out[112]:
“Unpivots” a DataFrame from wide format to long format, optionally leaving identifier variables set. This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
In [113]:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df
Out[113]:
In [114]:
pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
Out[114]:
On DataFrame, plot() is a convenience to plot all of the columns with labels. The %matplotlib inline magic enables plotting within the Jupyter notebook cells (instead of e.g. in a separate Qt window).
In [115]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()
Out[115]:
In [116]:
df.boxplot()
Out[116]:
In [117]:
df.plot(kind="scatter", x="A", y="B")
Out[117]:
In [118]:
df[["C", "D"]].sum().plot(kind="bar", x="C", y="D")
Out[118]:
In [119]:
axs = pd.scatter_matrix(df, figsize=(16,12), diagonal='kde')
tabular data such as CSV or tab-delimited txt files
In [120]:
fn = r"data/Saliva.txt"
df = pd.read_csv(fn, sep='\t')
df.head(3)
Out[120]:
from Excel
In [121]:
xls = pd.ExcelFile('data/microbiome/MID1.xls')
print(xls.sheet_names)  # see all sheet names
df = xls.parse("Sheet 1", header=None)
df.columns = ["Taxon", "Count"]
df.head()
Out[121]:
For a current list of I/O tools and more info, check out the documentation: http://pandas.pydata.org/pandas-docs/stable/
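A hedged sketch of a few common reader/writer pairs (the file names below are hypothetical; each pd.read_* reader has a matching DataFrame.to_* writer):
df_io = pd.read_csv("data/example.csv")             # delimited text
df_io.to_csv("data/example_out.csv", index=False)
df_io = pd.read_excel("data/example.xlsx")          # Excel
df_io.to_excel("data/example_out.xlsx", index=False)
df_io.to_json("data/example_out.json")              # JSON
df_io = pd.read_json("data/example_out.json")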
In [122]:
fn_out = r"data/Tidy_data.txt"
df.to_csv(fn_out, sep='\t', header=True, index=False)
In [123]:
xls = pd.ExcelFile('data/microbiome/MID1.xls')
df = xls.parse("Sheet 1", header=None)
df.columns = ["Taxon", "Count"]
df.head(3)
Out[123]:
In [124]:
df["superkingdom"], df["rest"] = zip(*df["Taxon"].apply(lambda x: x.split(" ", 1)))
df.head(3)
Out[124]:
In [125]:
df.superkingdom.unique()
Out[125]:
pandas provides a set of string processing methods for Series and Index that make it easy to operate on each element of the array. These methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods.
In [126]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s
Out[126]:
In [127]:
s.str.lower()
Out[127]:
pd.cut returns indices of half-open bins to which each value belongs. The cut function can be useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges.
In [128]:
df = pd.DataFrame(np.array([.2, 1.4, 2.5, 6.2, 9.7, np.nan, 2.1]), columns=["value"])
df["category"] = pd.cut(df["value"], 3, retbins=False, labels=["good", "medium", "bad"])
df
Out[128]:
call an R function from Python
In [129]:
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.numpy2ri import numpy2ri
robjects.conversion.py2ri = numpy2ri
In [130]:
dict_ = {"Intensity First" : [5.0, 2.0, 3.0, 4.0],
"Intensity Second" : [4.0, 1.0, 4.0, 2.0],
"Intensity Third" : [3.0, 4.0, 6.0, 8.0]}
df = pd.DataFrame(dict_, index=list("ABCD"))
df
Out[130]:
In [131]:
def R_function_normalizeQuantiles():
    rpackages.importr('limma')
    normalizeQuantiles = robjects.r['normalizeQuantiles']
    return normalizeQuantiles
In [132]:
def quantile_normalize(df, cols_2_norm):
    """
    :param df: DataFrame
    :param cols_2_norm: ListOfString (Columns to normalize)
    :return: DataFrame
    """
    normalizeQuantiles = R_function_normalizeQuantiles()
    # set Zero to NaN and transform to log-space
    df[cols_2_norm] = df[cols_2_norm].replace(to_replace=0.0, value=np.nan)
    df[cols_2_norm] = np.log10(df[cols_2_norm])
    # quantile normalize
    df[cols_2_norm] = np.array(normalizeQuantiles(df[cols_2_norm].values))
    # Transform back to non-log space and replace NaN with Zero
    df[cols_2_norm] = np.power(10, df[cols_2_norm])
    df[cols_2_norm] = df[cols_2_norm].replace(to_replace=np.nan, value=0.0)
    return df
In [133]:
df_norm = quantile_normalize(df, df.columns.tolist())
df_norm
Out[133]:
open Exercises_part_A.ipynb and/or Exercises_part_B.ipynb