T.N.Olsthoorn, April 18, 2017
Most scientists and engineers, including hydrologists, physicists, electronic engineers, social scientists and economists, are often faced with time series that bear information that is to be extracted or used in predictions. Pandas has virtually all the tools required to handle time series, while keeping dates and data strictly connected. Time series loaded into pandas then form the basis of further analysis.
Loading into pandas can be done with pd.read_csv, pd.read_table and pd.read_excel, as we used before, as well as with numerous other functions ready to be used in pandas. Just use tab-completion to see all the possibilities.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Show which reading functions pandas has on board.
We can use a comprehension to select what we want:
In [2]:
[d for d in dir(pd) if d.startswith("read")]
Out[2]:
Hence there's a large number of possibilities.
Move to the directory with the examples. Then print pwd to check that you're there.
Notice that the first part of the pwd output will be different on your computer.
In [ ]:
cd python/IHEcourse2017/exercises/Apr18/
In [ ]:
pwd
See if we have a csv data file, containing a long-year groundwater head series from the south of the Netherlands (chosen more or less at random for its length).
In [ ]:
ls
It's not a bad habit to use os to verify that the file exists.
In [ ]:
import os
os.path.isfile("B50E0133001_1.csv")
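An equivalent check can be done with the standard-library pathlib; this is just an alternative sketch using the same file name as above:

```python
from pathlib import Path

# Alternative existence check with pathlib; returns False when the file
# is not present in the current working directory.
print(Path("B50E0133001_1.csv").is_file())
```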
Ok, now we will naively try to read it using pd.read_csv. This may or may not fail. If it fails, we sharpen the knife by adding one or more of the options provided by pd.read_csv.
In [4]:
pb = pd.read_csv("B50E0133001_1.csv")
pb.head()
Obviously, the read_csv above failed. Upon inspection of the file in an editor, we see that the top is a mess. Not really, but at least we want to skip this part and get to the actual time series data of interest further down in the file.
So let's skip a few rows (too few, but we can correct step by step).
In [5]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=9)
pb.head()
Out[5]:
Ok, we got some top table in the file. See which line pd thought was the header.
Ok, skip a few more lines.
In [6]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=11)
pb.head()
Out[6]:
Now we really got the first table in the file, but this is not the one we need. On line 3 we see the desired header line. So skip a few more lines to get there.
In [7]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=15)
pb.head()
Out[7]:
This is fine. At least a good start. But we want "Peildatum" as our index. So:
In [8]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=15, index_col="Peildatum")
pb.head()
Out[8]:
Better, but the index still consists of strings, not dates. Therefore, tell read_csv to parse the dates:
In [9]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=15, index_col="Peildatum", parse_dates=True)
pb.head()
Out[9]:
The problem is that some dates will be messed up, as pandas will by default interpret dates as mm-dd-yyyy, while we have dd-mm-yyyy. For some dates this does not matter, but other dates are ambiguous unless it is specified that the dates start with the day instead of the month.
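The ambiguity is easy to demonstrate with pd.to_datetime on a made-up date string:

```python
import pandas as pd

# "03-04-2017" can be read as March 4 (month first) or April 3 (day first)
print(pd.to_datetime("03-04-2017"))                 # month first: 2017-03-04
print(pd.to_datetime("03-04-2017", dayfirst=True))  # day first:   2017-04-03
```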
In [10]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=15, index_col="Peildatum", parse_dates=True, dayfirst=True)
pb.head()
Out[10]:
So far so good. Now do some clean-up, as we only need the 6th column, with the head above national datum. We can tell read_csv which columns to use by specifying a list of headers. First trial:
In [12]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=15,
index_col="Peildatum", parse_dates=True, dayfirst=True,
usecols=["Stand (cm t.o.v. NAP)"])
pb.head()
This failed, because we now have to specify all columns we want to use. This should include the column "Peildatum". So add it to the list.
In [13]:
pb = pd.read_csv("B50E0133001_1.csv", skiprows=15,
index_col="Peildatum", parse_dates=True, dayfirst=True,
usecols=["Peildatum", "Stand (cm t.o.v. NAP)"])
pb.head()
Out[13]:
This is fine. We now have a one-column DataFrame with the proper index.
For English speakers, change the column header for better readability.
In [14]:
pb.columns = ["NAP"]
pb.head()
Out[14]:
Check that pb is still a DataFrame; only when we select one column from a DataFrame does it become a Series.
In [15]:
print(type(pb))
print(type(pb['NAP']))
So select this column to get a time series.
In [16]:
pb = pb['NAP']
print(type(pb))
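A minimal sketch of this difference on a made-up DataFrame: single brackets return a Series, double brackets keep a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"NAP": [1.0, 2.0, 3.0]})
print(type(df["NAP"]))    # pandas Series
print(type(df[["NAP"]]))  # still a DataFrame (note the double brackets)
```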
DataFrames and Series can be plotted directly. Of course, you may also put titles on the axes and above the plot, but for brevity I leave this out in this exercise.
In [17]:
pb.plot()
plt.show() # default color is blue, and default plot is line.
The next problem is to get the mean of the highest three measurements within each hydrological year, which starts on April 1 and ends on March 31.
This requires resampling the data per hydrological year, which can be done with aliases passed as the rule of the resample function of pandas Series and DataFrames.
Here are the options:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
Alias Description
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T minutely frequency
S secondly frequency
L milliseconds
U microseconds
N nanoseconds
But to sample at some arbitrary interval we need anchored offsets as the resample rule. Here are the options:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
For some frequencies you can specify an anchoring suffix:
Alias Description
W-SUN weekly frequency (Sundays), same as 'W'
W-MON weekly frequency (Mondays)
W-TUE weekly frequency (Tuesdays)
W-WED weekly frequency (Wednesdays)
W-THU weekly frequency (Thursdays)
W-FRI weekly frequency (Fridays)
W-SAT weekly frequency (Saturdays)
(B)Q(S)-DEC quarterly frequency, year ends in December, same as 'Q'
(B)Q(S)-JAN quarterly frequency, year ends in January
(B)Q(S)-FEB quarterly frequency, year ends in February
(B)Q(S)-MAR quarterly frequency, year ends in March
(B)Q(S)-APR quarterly frequency, year ends in April
(B)Q(S)-MAY quarterly frequency, year ends in May
(B)Q(S)-JUN quarterly frequency, year ends in June
(B)Q(S)-JUL quarterly frequency, year ends in July
(B)Q(S)-AUG quarterly frequency, year ends in August
(B)Q(S)-SEP quarterly frequency, year ends in September
(B)Q(S)-OCT quarterly frequency, year ends in October
(B)Q(S)-NOV quarterly frequency, year ends in November
(B)A(S)-DEC annual frequency, anchored end of December, same as 'A'
(B)A(S)-JAN annual frequency, anchored end of January
(B)A(S)-FEB annual frequency, anchored end of February
(B)A(S)-MAR annual frequency, anchored end of March
(B)A(S)-APR annual frequency, anchored end of April
(B)A(S)-MAY annual frequency, anchored end of May
(B)A(S)-JUN annual frequency, anchored end of June
(B)A(S)-JUL annual frequency, anchored end of July
(B)A(S)-AUG annual frequency, anchored end of August
(B)A(S)-SEP annual frequency, anchored end of September
(B)A(S)-OCT annual frequency, anchored end of October
(B)A(S)-NOV annual frequency, anchored end of November
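To illustrate the anchored offsets on a small made-up series (note that newer pandas versions spell the anchored year-start alias "YS-APR" instead of "AS-APR"):

```python
import numpy as np
import pandas as pd

# two years of made-up daily data
idx = pd.date_range("2015-01-01", "2016-12-31", freq="D")
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# hydrological years starting April 1 ("YS-APR" in newer pandas)
print(s.resample("AS-APR").mean())
```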
To see this at work, resample the time series by hydrological year and compute the mean head in every hydrological year. This can be done as follows:
In [18]:
pb.resample("AS").mean().head()
Out[18]:
In [19]:
pb.resample("AS-APR").mean().head()
Out[19]:
This uses groupby functionality, which we'll inspect next.
In fact, pb.resample(...) yields a DatetimeIndexResampler.
In [43]:
Z = pb.resample("AS-APR")
type(Z)
Out[43]:
This resampler has its own functionality that can be used, shown here:
In [21]:
[z for z in dir(Z) if not z.startswith("_")]
Out[21]:
It's now easy to plot the resampled data using several of these functions, like so.
Notice that Z.mean() is a pandas Series, so that Z.mean().plot() is the plot method of that Series.
In [41]:
Z.max().plot(label="max")
Z.mean().plot(label="mean")
Z.min().plot(label="min")
plt.title("The max, mean and min of the head in each hydrological year")
plt.legend(loc='best')
plt.show()
In [45]:
Z.max()
for z in Z:
    print(z)
Interesting is the agg function (short for aggregate). Here is its documentation:
In [128]:
print(Z.agg.__doc__)
So we can use any function and apply it to the data grouped by the resampler. These data are the time series consisting of the values that fall in the interval between the last resample moment and the current one. Let's try this out to get the three highest values of any hydrological year. For this we define our own function, called highest3.
It works by taking z, which should be a time series, one consisting of any of the hydrological years in your long-year time series. We use argsort to get the indices of the ordered values (we could also directly use the values themselves, but it's good to know argsort exists). The series is sorted from low to high, so we take the last 3 values, i.e. the highest 3. Notice that this also works in Python if the total number of values is less than three, so we don't need to check this. Then we return the mean of these highest three values. That's all.
In [129]:
def highest3(z):
    """Return the mean of the highest 3 values using np.argsort."""
    I = np.argsort(z.values)[-3:]  # positional indices, so select with iloc
    return z.iloc[I].mean()

def highest3a(z):
    """Return the mean of the highest 3 values using np.sort."""
    z = np.sort(z.values)
    return z[-3:].mean()
In [130]:
# Apply
print("Using np.argsort")
highest = pb.resample("AS-APR").agg(highest3)
highest.name = "mean_highest_value"  # the result is a Series, so set .name
print(highest.head())

print("\nUsing np.sort")
highesta = pb.resample("AS-APR").agg(highest3a)
highesta.name = "mean_highest_value"
print(highesta.head())
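As an aside not used in the original notebook, pandas also offers Series.nlargest, which makes the same computation a one-liner; a sketch on a made-up series:

```python
import pandas as pd

idx = pd.date_range("2015-01-01", periods=10, freq="D")
s = pd.Series([5, 1, 9, 3, 7, 2, 8, 4, 6, 0], index=idx, dtype=float)

# nlargest(3) returns the three largest values; mean() aggregates them
print(s.nlargest(3).mean())  # (9 + 8 + 7) / 3 = 8.0
```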
This, of course, solves the problem. Which means we could just as well compute the lowest 3 at the same time.
And why not also remember both the highest and the lowest 3:
In [131]:
def h_and_l_3(z):
    """Return the rounded means of the lowest 3 and highest 3 values."""
    z = np.sort(z.values)
    # rounding off for a nicer list, but it is not necessary
    return (np.round(z[ :3].mean()),
            np.round(z[-3:].mean()))

# Apply
h_and_l = pb.resample("AS-APR").agg(h_and_l_3)
h_and_l.name = "mean_lowest_and_highest_values"
h_and_l.head()
Out[131]:
The above functions all reduce, that is, they all aggregate the data held by the resampler for each sampling interval to a single value (or a tuple).
In [143]:
def h3(z):
    """Return a tuple of the three highest values within the sampling interval."""
    return (z.iloc[np.argsort(z.values)[-3:]],)

Z.agg(h3).head()
Out[143]:
This does indeed give a tuple of the three highest values within each sampling interval, but we can't plot these values easily on the graph of the time series.
Another piece of functionality of the resampler is its indices, i.e. Z.indices. This yields a dictionary with the indices into the overall time series that belong to each resampled timestamp. Therefore we can readily find the values that belong to each hydrological year.
In [159]:
Z.apply(h3).head()
Out[159]:
So apply() works the same as agg(), at least here.
If we want to plot the three highest points in each hydrological year, we could make a list of the sub time series that consist of the three highest points with their timestamps as index. Then each item in this list is a series consisting of three values, which we may plot one after the other.
In [178]:
dd = list()

def h33(z):
    """Append the three highest values within the sampling interval to dd."""
    dd.append(z.iloc[np.argsort(z.values)[-3:]])
    return

# the time series are put in the list dd
Z.apply(h33).head()
#Z.agg(h33).head() # alternative, works just as well

# for instance, show dd[3]
print(type(dd[3]))
dd[3]
Out[178]:
The next step is to plot them. But first we plot the entire data set as a line. Then we plot each sub time series as small circles. Adjacent hydrological years then get a different color.
In [177]:
pb.plot() # plot all data as a line
for d in dd:
    # plot the sub time series of the three highest points
    d.plot(marker='o')
plt.show()
If we want the data in the same hydrological year to have the same color, then we also make a list of all data in each sampling interval next to the list of the three highest values. Each item in dd then holds the complete time series of the interval; each item in dd3 holds a time series of the three highest values alone.
The append within the function is a way of using a side effect to get things done. It's a bit sneaky, not very elegant. But it works:
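An alternative to the side-effect trick, not used in the original notebook: a resampler can be iterated directly, yielding (interval start, sub series) pairs, so both lists can be built with plain comprehensions. A sketch on a made-up series:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", "2016-12-31", freq="D")
s = pd.Series(np.random.rand(len(idx)), index=idx)

Z = s.resample("AS-APR")  # "YS-APR" in newer pandas
dd  = [grp for _, grp in Z]              # full sub series per interval
dd3 = [grp.nlargest(3) for _, grp in Z]  # three highest values per interval
print(len(dd), len(dd3))
```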
In [333]:
dd = list()  # the entire time series in each sampling interval
dd3 = list() # only the three highest values in each sampling interval

def h33(z):
    """Store the sub series and its three highest values in dd and dd3.

    Notice that this function just uses append() to build the lists as a
    side effect. It effectively consists of two lines and returns nothing.
    """
    # z is what the sampler Z yields while resampling the original time
    # series: the sub time series that falls in the running interval.
    # With this append we get a list of the sub time series.
    dd.append(z.copy())  # copy, so later changes to z don't affect the list
    # An argsort on z's values yields the positional indices of the sorted
    # values; take the last three and select them with iloc.
    dd3.append(z.iloc[np.argsort(z.values)[-3:]])
    return

# Here we apply the function by calling the method .apply() of the sampler Z.
# The method receives the just created function as input and applies it
# on every iteration, that is, on every sub time series.
# Each time the function h33 is called it appends to the lists dd and dd3.
# You may be tempted to insert a print statement in the function to see that
# this is what actually happens.
Z.apply(h33)

# Then plot the sub time series in the lists dd and dd3.
# We make sure to use the same color for all points in the same
# hydrological year in both dd and dd3.
# The sub series in dd are plotted as dots, those in dd3 as small circles.
clr = 'brgkmcy'; i = 0  # colors to use
for d3, d in zip(dd3, dd):
    d.plot(marker='.', color=clr[i])  # all data in the hydrological year
    d3.plot(marker='o', color=clr[i]) # the highest three
    i += 1
    if i == len(clr): i = 0  # restart when the colors are exhausted
plt.title("measurements per hydr. yr with the 3 highest accentuated")
plt.xlabel('time')
plt.ylabel('cm above national datum NAP')
plt.show()
Show that argsort works to get the indices that sort the time series.
In [338]:
print("The sub time series for 1964\n")
print(dd[10])
print("\nThe indices that sort this sub series. It is itself a time series")
dd[10].argsort()
Out[338]:
Instead of appending to the lists dd and dd3 sneakily behind the scenes (hidden inside the function, that is, as a side effect of the function), we can also achieve the same thing head-on. This can be done using the indices of each sub time series, which is also a functionality of the resampler.
In [310]:
# Don't need this, but just to make sure refresh our sampler
Z = pb.resample("AS-APR")
The resampler object Z also has a method indices, which yields a dictionary with the indices of the values that fall in each sampling interval. These are absolute indices, i.e. they point into the large, original time series.
Let's see how this works.
First generate the dictionary.
In [186]:
Idict = Z.indices
type(Idict)
Out[186]:
A dict has keys. So let's show one item in this dict like so:
In [194]:
pb.iloc[3]  # .ix is deprecated; use .iloc for positional indexing
Out[194]:
In [196]:
# Show the indices for one of the keys of Idict
for k in Idict.keys():
    print(k)  # the key
    print()
    print(Idict[k])  # the indices
    print()
    print(pb.iloc[Idict[k]])  # the values belonging to these indices
    break
This implies that we can now plot each sub time series like so:
To plot them together with the boundaries of each hydrological year, we first plot the data as a colored line within each hydrological year. Then we plot the vertical lines that separate the hydrological years. The lines are colored light grey using color=[R, G, B] where R, G and B are all 0.8. ax.get_ylim() gets the extremes of the vertical axis, which are then used to draw the vertical lines.
In [309]:
# Plot each hydrological year in its own color, with its three highest points
fig, ax = plt.subplots()
clr = "brgkmcy"; i = 0
Idict = Z.indices
for k in Idict.keys():
    I = Idict[k]  # the indices belonging to this key k
    ax.plot(pb.iloc[I].index, pb.iloc[I].values, color=clr[i])
    # pb is a Series, so .values is already a 1-D array
    J = np.argsort(pb.iloc[I].values)[-3:]
    # Use a comprehension to pick the absolute indices, because
    # indexing like I[J] is not allowed for lists
    Idx = [I[j] for j in J]
    ax.plot(pb.index[Idx], pb.values[Idx], color=clr[i],
            marker='o', linestyle='none')
    i += 1
    if i == len(clr): i = 0
# plot the hydrological year boundaries as vertical grey lines
ylim = ax.get_ylim()
for k in Idict.keys():
    i = Idict[k][-1]
    ax.plot(pb.index[[i, i]], ylim, color=[0.8, 0.8, 0.8])
plt.show()
#pb.iloc[I].plot(ax=ax) # the values belonging to these indices (can't omit the legend)
That's it.
Here's a reference to a nice so-called cheat sheet in which someone at the University of Idaho figured out how pandas works and put it into a concise overview.
http://www.webpages.uidaho.edu/~stevel/504/Pandas%20DataFrame%20Notes.pdf
In [ ]: