Dask.dataframe inherits its time series functionality from Pandas, so the interface should look familiar to existing Pandas users. This notebook demonstrates a few of those time series functions on artificial but moderately large data.
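As a quick illustration of that familiarity, here is the same style of resample call on a plain in-memory Pandas series (a minimal sketch on tiny made-up data, separate from the workflow below; this notebook uses the older how= resample API, which modern Pandas spells as .resample('5s').mean()):

import pandas as pd

# ten one-second observations
s = pd.Series(range(10), index=pd.date_range('2000-01-01', periods=10, freq='1s'))

# the same style of call we make on the dask series below
s.resample('5s', how='mean')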
In [1]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
progress_bar = ProgressBar()
progress_bar.register()  # turn on the progress bar globally
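(The matching call to turn the global progress bar off again later is progress_bar.unregister().)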
In [2]:
%matplotlib inline
In [3]:
df = dd.demo.make_timeseries(start='2000', end='2015', dtypes={'A': float, 'B': int},
                             freq='1s', partition_freq='3M', seed=1234)
df.head()
Out[3]:
Next we store the data in Castra, an experimental on-disk column store that pairs with dask.dataframe, and load it back so that later computations read from that efficient on-disk format.
In [4]:
c = df.to_castra()  # write to Castra (a temporary file, since no path is given)
df = c.to_dask()    # reload as a dask.dataframe backed by the Castra file
We compute the cumulative sum of the A column. This will look like typical Brownian motion / a ticker plot. However, one-second resolution over 15 years is far too many points to hand to matplotlib, so we first resample to weekly values. Finally we compute the result as a (small) in-memory Pandas series and use Pandas and matplotlib to plot it.
In [5]:
# cumsum and resample are lazy; .compute() returns an in-memory Pandas series
df.A.cumsum().resample('1w', how='mean').compute().plot()
Out[5]:
In [6]:
duration = progress_bar.last_duration  # seconds taken by the last computation
nbytes = df.A.nbytes.compute()         # total number of bytes in the A column
In [7]:
nbytes / 1e6 / duration # MB/s
Out[7]:
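The same figure can be printed with explicit units (a small sketch reusing the duration and nbytes variables from the cells above):

print('{:.0f} MB in {:.1f} s = {:.1f} MB/s'.format(
    nbytes / 1e6, duration, nbytes / 1e6 / duration))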
We use the dask Profiler to measure the durations of all of our tasks running in parallel. We'll use this to build intuition about our current bottlenecks.
Note for readers on GitHub: this plot won't show up in GitHub's notebook viewer. To see the results, render the notebook elsewhere, for example by running it yourself.
In [8]:
from bokeh.plotting import output_notebook
output_notebook()

from dask.diagnostics import Profiler
prof = Profiler()

with prof:
    df.A.cumsum().resample('1w', how='mean').compute()

prof.visualize()
Out[8]:
It looks like we're still dominated by data access: the load_partition calls take up the majority of the time.
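To check that reading numerically rather than visually, one can aggregate the profiler's raw records (a hedged sketch: dask's Profiler stores its results as namedtuples with key, task, start_time, end_time, and worker_id fields, and task names conventionally prefix the key):

from collections import defaultdict

totals = defaultdict(float)
for r in prof.results:
    # keys are usually tuples like ('load_partition-<token>', i);
    # strip the index and token to group durations by task name
    name = r.key[0] if isinstance(r.key, tuple) else r.key
    totals[name.rsplit('-', 1)[0]] += r.end_time - r.start_time

sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]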