Data analysis library that introduces the concepts of data-frames and series to Python. Powerful tool for time-series analysis and fast visualizations of data.
Excellent introduction by the author: https://vimeo.com/59324550
http://www.eusprig.org/horror-stories.htm
See pages 131-132 of the JP Morgan Task Force Report "...further errors were discovered in the Basel II.5 model, including, most significantly, an operational error in the calculation of the relative changes in hazard rates and correlation estimates. Specifically, after subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR"
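The arithmetic behind the "factor of two" in the quote above can be checked in a few lines. This is only an illustrative sketch: the two hazard rates are made-up numbers, not values from the report.

```python
# Hypothetical hazard rates, chosen only for illustration.
old_rate = 0.020
new_rate = 0.025

diff = new_rate - old_rate

# What the modeler intended: divide by the average of the two rates.
intended = diff / ((old_rate + new_rate) / 2)

# What the spreadsheet did: divide by the sum of the two rates.
as_implemented = diff / (old_rate + new_rate)

# The sum is always twice the average, so the result is always halved.
ratio = intended / as_implemented
print(intended, as_implemented, ratio)
```

Because the sum is exactly twice the average, the understatement does not depend on the rates chosen.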
As reported in "A tempest in a spreadsheet" http://ftalphaville.ft.com/2013/01/17/1342082/a-tempest-in-a-spreadsheet/? Lisa Pollack comments "On a number of occasions, he asked the trader to whom he reported for additional resources to support his work on the VaR model, but he did not receive any. Also it appears that he (had to?) cut a number of corners, which resulted increased operational risk and artificially low volatility numbers ... pressure was put on the reviewers to get on with approving the model"
In [3]:
import pandas as pd
import numpy as np
In [4]:
data = np.random.rand(5, 5)
data
Out[4]:
The matrix representation provided by numpy often isn't enough. This is where Pandas helps: it introduces the "data-frame", a tabular representation with labelled rows and columns.
In [5]:
pd.DataFrame(data)
Out[5]:
In [6]:
table = pd.DataFrame(data, columns=["a","b","c","d","e"])
table
Out[6]:
Note the automatic selection of an index (left hand column). This can be used to pull out rows of interest.
In [7]:
table.loc[0]
Out[7]:
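Row selection comes in two flavours: by index label and by integer position. The sketch below, using a hypothetical table called demo, shows that the two agree for the automatic index and differ in spelling once the index is relabelled.

```python
import numpy as np
import pandas as pd

# Hypothetical 5x5 table, seeded so the example is repeatable.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, 5)), columns=list("abcde"))

# With the automatic 0..4 index, label 0 and position 0 are the same row.
row_by_label = demo.loc[0]
row_by_position = demo.iloc[0]

# After relabelling the index, .loc needs the label, .iloc still takes a position.
relabelled = demo.copy()
relabelled.index = list("vwxyz")
first_row = relabelled.iloc[0]
same_row = relabelled.loc["v"]
```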
Similarly we can select data from columns.
In [8]:
table["a"]
Out[8]:
In [9]:
table[["a","e"]]
Out[9]:
A table behaves much like a numpy array (underneath, its values are stored in nd-arrays), so we can index it with a boolean mask. Entries that fail the condition are replaced by NaN:
In [10]:
missing_data = table[table > 0.6]
missing_data
Out[10]:
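To make the effect of the mask easier to see than with random numbers, here is a small sketch with hand-picked values (the table demo and its contents are illustrative only).

```python
import numpy as np
import pandas as pd

# Hand-picked values so the outcome is easy to follow.
demo = pd.DataFrame({"a": [0.1, 0.7, 0.9], "b": [0.8, 0.2, 0.65]})

# Comparing a table with a scalar gives a table of booleans.
mask = demo > 0.6

# Indexing with that mask keeps the shape and puts NaN where the mask is False.
masked = demo[mask]

# Number of missing entries per column.
n_missing = masked.isna().sum()
print(masked)
print(n_missing)
```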
Q. What are the summary statistics of the "missing_data" table?
In [11]:
missing_data.describe()
Out[11]:
In [12]:
missing_data.count()
Out[12]:
In [13]:
missing_data.min()
Out[13]:
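The summary functions above ignore missing values by default. A minimal sketch with a hypothetical table makes that explicit and shows the skipna switch for reductions such as min.

```python
import numpy as np
import pandas as pd

# Hypothetical table with some entries missing.
masked = pd.DataFrame({"a": [np.nan, 0.7, 0.9], "b": [0.8, np.nan, np.nan]})

# count() reports the number of non-missing entries per column.
counts = masked.count()

# min() skips NaN by default.
minimum = masked.min()

# With skipna=False any NaN in a column makes the result NaN.
strict_minimum = masked.min(skipna=False)
print(counts, minimum, strict_minimum, sep="\n")
```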
Visualization of data is made simple with pandas (for simple things).
In [14]:
%matplotlib inline
missing_data.describe().plot(kind='bar', figsize=(10,5))
Out[14]:
Pandas has an extensive set of functions for working with time series.
http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-series-date-functionality
Creating a time series of data requires a time based index.
In [15]:
dates = pd.date_range(start="2014-01-01", end="2014-12-31")  # ISO dates avoid day/month ambiguity
dates
Out[15]:
In [16]:
values = np.sin(np.linspace(0,2*np.pi,365)) * np.random.rand(365)
values[:10]
Out[16]:
In [17]:
time_series = pd.Series(index=dates, data=values)
time_series.head()
Out[17]:
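One practical benefit of a time based index is that rows can be selected with date strings. The sketch below assumes a daily series for 2014 filled with placeholder values; the names daily and january are illustrative.

```python
import numpy as np
import pandas as pd

# One entry per calendar day of 2014.
dates = pd.date_range(start="2014-01-01", end="2014-12-31", freq="D")

# Placeholder values: simply the day number within the year.
daily = pd.Series(np.arange(365), index=dates)

# A partial date string selects every entry in that month.
january = daily.loc["2014-01"]

# A slice of date strings includes both end points.
first_week = daily.loc["2014-01-01":"2014-01-07"]
```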
In [18]:
time_series.plot(figsize=(10,5))
Out[18]:
Q. What is the monthly variation of the time series? Smooth out the noise by down-sampling and note any change in the signal.
http://pandas.pydata.org/pandas-docs/dev/timeseries.html#up-and-downsampling
In [19]:
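# resample() only groups the data; an explicit aggregation such as mean() is needed.
# "M" bins by month end (pandas 2.2 and later spell this alias "ME").
time_series.resample("M").mean().plot(figsize=(10,5))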
Out[19]:
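Down-sampling groups the entries into time bins and reduces each bin to a single number. The sketch below uses a deterministic series (each day holds its month number) so the result is easy to predict. It uses the "MS" (month start) alias as an assumption, purely because that spelling is accepted across pandas versions; it labels each bin by the first day of the month rather than the last.

```python
import pandas as pd

dates = pd.date_range(start="2014-01-01", end="2014-12-31", freq="D")

# Each day holds the number of its month, so every monthly mean is known in advance.
daily = pd.Series(dates.month.to_numpy(), index=dates, dtype=float)

# Group into calendar months, then reduce each group with mean().
monthly = daily.resample("MS").mean()
print(monthly)
```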
In [20]:
dates = pd.date_range(start="2014-01-01", end="2014-12-31")  # ISO dates avoid day/month ambiguity
categories = ["a", "b", "c", "d"]
noisy_signal = np.c_[np.sin(np.linspace(0, 2*np.pi, 365)) * np.random.rand(365)]
data = np.hstack([0.5 * noisy_signal, 3.5 * noisy_signal, 0.01 * noisy_signal, noisy_signal])
table = pd.DataFrame(data=data, columns=categories)
table["date"] = dates
table["month"] = [x.strftime("%B") for x in dates]
table.head()
Out[20]:
In [21]:
table.set_index("date")[["a","b","c","d"]].plot(figsize=(10,5))
Out[21]:
Q. What are the summary statistics for December?
In [22]:
table[table["month"] == "December"].describe()
Out[22]:
In [23]:
groups = table.groupby("month").describe().loc["December"]
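The two approaches above, filtering first or grouping first, should give the same December numbers. A minimal sketch on a hypothetical single-column table checks this.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2014-01-01", end="2014-12-31", freq="D")

# Hypothetical table: one numeric column plus the month name of each day.
demo = pd.DataFrame({"a": np.arange(365, dtype=float),
                     "month": dates.strftime("%B")})

# Approach 1: filter the rows, then summarise.
by_filter = demo[demo["month"] == "December"]["a"].describe()

# Approach 2: summarise every month, then pick out December.
by_group = demo.groupby("month")["a"].describe().loc["December"]

print(by_filter)
print(by_group)
```

Grouping does more work up front, but the statistics for every other month are then available from the same result.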
Q. What are the summary statistics for all months starting with the letter "J"?
In [24]:
groups = table.groupby( [x[0] for x in table["month"]] ).describe().loc["J"]
groups
Out[24]:
Q. What is the 95th percentile (0.95 quantile) for each month?
In [25]:
table.groupby("month").quantile(.95)
Out[25]:
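To see what the grouped quantile computes, here is a small sketch with hand-picked values (the table demo is illustrative). By default pandas interpolates linearly between neighbouring values, which matches numpy's default.

```python
import numpy as np
import pandas as pd

# Five hand-picked values per month.
demo = pd.DataFrame({
    "month": ["January"] * 5 + ["February"] * 5,
    "a": [0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# 0.95 quantile of column "a" within each month.
q95 = demo.groupby("month")["a"].quantile(0.95)

# Cross-check one group against numpy.
january_check = np.quantile([0.0, 1.0, 2.0, 3.0, 4.0], 0.95)
print(q95)
```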
Q. Plot the monthly statistics for the "a" category.
In [26]:
data = table[["date", "a"]].set_index("date")
data.head()
Out[26]:
In [27]:
ax = data.plot(figsize=(10,5), color='g')
data.resample('W').mean().plot(color='m', ax=ax)
data.resample('W').min().plot(color='b', ax=ax)
data.resample('W').max().plot(color='r', ax=ax)
ax.legend(['a', 'mean(a)', 'min(a)', 'max(a)']);
In [28]:
ax = data.plot(figsize=(10,5), color='g')
data.resample("W").agg(["mean", "min", "max"]).plot(ax=ax, color=["m","b","r"])
ax.legend(['a', 'mean(a)', 'min(a)', 'max(a)']);
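The weekly statistics can also be inspected as a table before plotting. The sketch below assumes a noise-free sine wave in place of the noisy signal so the numbers are repeatable; the names demo and weekly are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2014-01-01", end="2014-12-31", freq="D")

# Noise-free stand-in for the signal used above.
demo = pd.DataFrame({"a": np.sin(np.linspace(0, 2 * np.pi, 365))}, index=dates)

# One row per week, one column per statistic.
weekly = demo["a"].resample("W").agg(["mean", "min", "max"])
print(weekly.head())
```

Selecting the single column "a" before resampling keeps the result's columns flat ("mean", "min", "max") rather than nested under "a".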