*How do we measure the similarity between two time series?*

The Euclidean distance between two time series $Q$ and $C$ of length $n$ is defined as

$$d(Q,C) = \sqrt{\sum^n_{i=1}[Q(i)-C(i)]^2}$$

At first glance, it seems that simply calculating the Euclidean distance between two time series would give us a good idea of their similarity. After all, the Euclidean distance between identical time series is zero and the Euclidean distance between very different time series is large. However, before we settle on Euclidean distance as a similarity measure, we should clearly state our desired criteria for determining the similarity between two time series.
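As a quick numeric check of the formula (a standalone sketch, not one of the notebook's cells), the distance collapses to familiar geometry for short series:

```python
import numpy as np

def euclidean(q, c):
    """Euclidean distance between two equal-length series."""
    q, c = np.asarray(q, dtype=float), np.asarray(c, dtype=float)
    return np.sqrt(np.sum((q - c) ** 2))

print(euclidean([1, 2, 3], [1, 2, 3]))  # 0.0: identical series
print(euclidean([0, 0], [3, 4]))        # 5.0: sqrt(3^2 + 4^2)
```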

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pylab as plt

x = np.linspace(0, 50, 100)
ts1 = pd.Series(3.1 * np.sin(x / 1.5) + 3.5)
ts2 = pd.Series(2.2 * np.sin(x / 3.5 + 2.4) + 3.2)
ts3 = pd.Series(0.04 * x + 3.0)

ts1.plot()
ts2.plot()
ts3.plot()
plt.ylim(-2, 10)
plt.legend(['ts1', 'ts2', 'ts3'])
plt.show()
```
In [2]:

```
from math import sqrt

def euclid_dist(t1, t2):
    return sqrt(sum((t1 - t2) ** 2))
```

Let's now find the Euclidean distance between $ts1$ and $ts2$.

In [3]:

```
print(euclid_dist(ts1, ts2))
```

and the Euclidean distance between $ts1$ and $ts3$.

In [4]:

```
print(euclid_dist(ts1, ts3))
```

Dynamic time warping finds the optimal non-linear alignment between two time series by minimising the cumulative squared differences over all admissible warping paths, computed here with dynamic programming.

In [5]:

```
def DTWDistance(s1, s2):
    DTW = {}

    for i in range(len(s1)):
        DTW[(i, -1)] = float('inf')
    for i in range(len(s2)):
        DTW[(-1, i)] = float('inf')
    DTW[(-1, -1)] = 0

    for i in range(len(s1)):
        for j in range(len(s2)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)], DTW[(i, j-1)], DTW[(i-1, j-1)])

    return sqrt(DTW[len(s1)-1, len(s2)-1])
```

Now let's compute the distance between $ts1$ and $ts2$ using dynamic time warping.

In [6]:

```
print(DTWDistance(ts1, ts2))
```

and now the dynamic time warping distance between $ts1$ and $ts3$.

In [7]:

```
print(DTWDistance(ts1, ts3))
```
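The two results above illustrate a general point. As a self-contained sanity check (with its own tiny DTW, independent of the cells above), warping absorbs a phase shift that a rigid point-by-point comparison cannot:

```python
import numpy as np

def dtw(s1, s2):
    # unconstrained DTW via the standard dynamic programme
    n, m = len(s1), len(s2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s1[i - 1] - s2[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

t = np.linspace(0, 2 * np.pi, 50)
a = np.sin(t)
b = np.sin(t + 0.5)                # same shape, shifted in phase
euclid = np.sqrt(np.sum((a - b) ** 2))
print(dtw(a, b) < euclid)          # True: DTW aligns the shifted peaks
```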

This quadratic-time computation can be sped up by enforcing a locality constraint: warping is only allowed within a window of width $w$ around the diagonal.

In [8]:

```
def DTWDistance(s1, s2, w):
    DTW = {}
    w = max(w, abs(len(s1) - len(s2)))

    for i in range(-1, len(s1)):
        for j in range(-1, len(s2)):
            DTW[(i, j)] = float('inf')
    DTW[(-1, -1)] = 0

    for i in range(len(s1)):
        for j in range(max(0, i - w), min(len(s2), i + w)):
            dist = (s1[i] - s2[j]) ** 2
            DTW[(i, j)] = dist + min(DTW[(i-1, j)], DTW[(i, j-1)], DTW[(i-1, j-1)])

    return sqrt(DTW[len(s1)-1, len(s2)-1])
```

Let's test this faster version.

In [9]:

```
print(DTWDistance(ts1, ts2, 10))
```

In [10]:

```
print(DTWDistance(ts1, ts3, 10))
```

Another way to speed things up is the *LB Keogh* lower bound of dynamic time warping. It is defined as

$$LB_{Keogh}(Q,C)=\sum_{i=1}^n (c_i-U_i)^2I(c_i > U_i)+(c_i-L_i)^2I(c_i < L_i)$$

where $U_i$ and $L_i$ are upper and lower bounds for time series $Q$, defined as $U_i=\max(q_{i-r}:q_{i+r})$ and $L_i=\min(q_{i-r}:q_{i+r})$ for a reach $r$, and $I(\cdot)$ is the indicator function. It can be implemented with the following function.
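For concreteness, the envelope $U_i$, $L_i$ can be computed with plain windowed maxima and minima. This is an illustrative sketch of the definition above, not the notebook's own code:

```python
import numpy as np

def envelope(q, r):
    """U_i = max(q_{i-r} : q_{i+r}), L_i = min(q_{i-r} : q_{i+r}), clipped at the edges."""
    q = np.asarray(q, dtype=float)
    U = np.array([q[max(0, i - r): i + r + 1].max() for i in range(len(q))])
    L = np.array([q[max(0, i - r): i + r + 1].min() for i in range(len(q))])
    return U, L

U, L = envelope([0, 1, 2, 1, 0], r=1)
print(U.tolist())  # [1.0, 2.0, 2.0, 2.0, 1.0]
print(L.tolist())  # [0.0, 0.0, 1.0, 0.0, 0.0]
```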

In [11]:

```
def LB_Keogh(s1, s2, r):
    LB_sum = 0
    for ind, i in enumerate(s1):
        lower_bound = min(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        upper_bound = max(s2[(ind - r if ind - r >= 0 else 0):(ind + r)])
        if i > upper_bound:
            LB_sum = LB_sum + (i - upper_bound) ** 2
        elif i < lower_bound:
            LB_sum = LB_sum + (i - lower_bound) ** 2
    return sqrt(LB_sum)
```

Let's now test on $ts1$ and $ts2$.

In [12]:

```
print(LB_Keogh(ts1, ts2, 20))
```

and now $ts1$ and $ts3$.

In [13]:

```
print(LB_Keogh(ts1, ts3, 20))
```

The *LB Keogh* lower bound is linear in complexity, whereas dynamic time warping is quadratic, which makes it very advantageous for searching over large sets of time series.

The 1-NN classifier below is sped up with the *LB Keogh* lower bound. Computing *LB Keogh* is much less expensive than performing dynamic time warping, and since $LB_{Keogh}(Q,C) \leq DTW(Q,C)$, we can eliminate time series that cannot possibly be more similar than the current most similar time series. In this way we avoid many unnecessary dynamic time warping computations.
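That inequality is easy to check empirically. The sketch below bundles its own miniature windowed DTW and LB Keogh (mirroring the functions above, but self-contained) and confirms that, for the same reach $r$, the bound never exceeds the true distance:

```python
import numpy as np

def dtw_w(s1, s2, w):
    # DTW restricted to a window of width w around the diagonal
    n, m = len(s1), len(s2)
    w = max(w, abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (s1[i - 1] - s2[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def lb_keogh(q, c, r):
    total = 0.0
    for i, ci in enumerate(c):
        seg = q[max(0, i - r): i + r + 1]   # envelope window around index i
        U, Lo = seg.max(), seg.min()
        if ci > U:
            total += (ci - U) ** 2
        elif ci < Lo:
            total += (ci - Lo) ** 2
    return np.sqrt(total)

rng = np.random.default_rng(0)
q, c = rng.normal(size=50), rng.normal(size=50)
print(lb_keogh(q, c, 5) <= dtw_w(q, c, 5))  # True: the bound never overestimates
```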

In [14]:

```
from sklearn.metrics import classification_report
from math import sqrt

def knn(train, test, w):
    preds = []
    for ind, i in enumerate(test):
        min_dist = float('inf')
        closest_seq = []
        for j in train:
            if LB_Keogh(i[:-1], j[:-1], 5) < min_dist:
                dist = DTWDistance(i[:-1], j[:-1], w)
                if dist < min_dist:
                    min_dist = dist
                    closest_seq = j
        preds.append(closest_seq[-1])
    return classification_report(test[:, -1], preds)
```

Even with the *LB Keogh* lower bound and the dynamic time warping locality constraint, it may still take a few minutes to run.

In [15]:

```
train = np.genfromtxt('datasets/train.csv', delimiter='\t')
test = np.genfromtxt('datasets/test.csv', delimiter='\t')
print(knn(train, test, 4))
```

Time series can also be clustered with k-means: the number of clusters is set *a priori*, and similar time series end up clustered together.

In [16]:

```
import random

def k_means_clust(data, num_clust, num_iter, w=5):
    centroids = random.sample(list(data), num_clust)  # list() so random.sample accepts an ndarray
    for n in range(num_iter):
        assignments = {}
        # assign each series to its closest centroid,
        # using LB_Keogh to skip hopeless DTW computations
        for ind, i in enumerate(data):
            min_dist = float('inf')
            closest_clust = None
            for c_ind, j in enumerate(centroids):
                if LB_Keogh(i, j, 200) < min_dist:
                    cur_dist = DTWDistance(i, j, w)
                    if cur_dist < min_dist:
                        min_dist = cur_dist
                        closest_clust = c_ind
            if closest_clust in assignments:
                assignments[closest_clust].append(ind)
            else:
                assignments[closest_clust] = [ind]
        # recalculate the centroid of each cluster
        for key in assignments:
            clust_sum = 0
            for k in assignments[key]:
                clust_sum = clust_sum + data[k]
            centroids[key] = [m / float(len(assignments[key])) for m in clust_sum]
    return centroids
```

Let's test it on the entire data set (i.e. the training set and the test set stacked together).

In [17]:

```
train = np.genfromtxt('datasets/train.csv', delimiter='\t')
test = np.genfromtxt('datasets/test.csv', delimiter='\t')
data1 = np.vstack((train[:, :-1], test[:, :-1]))

# load the VM time series and keep the first 100 points of each
df = pd.read_csv("vmarrays.csv", index_col=0)
d = df.values.tolist()
y = np.array([np.array(di)[:100] for di in d])
y
```

In [18]:

```
centroids = k_means_clust(data1, 4, 10, 4)  # data, num_clust, num_iter, w
for i in centroids:
    plt.plot(i)
plt.show()
```

We can also apply hierarchical clustering: each series is first min-max normalised, and the set is stacked into a single DataFrame.

In [19]:

```
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt

df = pd.read_csv("ndarray.csv", index_col=0)
d = df.values.tolist()
for i in range(26):
    d[i][-1] = 0  # zero the last entry of each row (NaN in the raw export)
y = np.array([np.array(di)[:60] for di in d])

# min-max normalise each series
for i in range(26):
    ts = y[i]
    y[i] = (ts - ts.min()) / (ts.max() - ts.min())

timeSeries = pd.DataFrame()
for arr in y:
    timeSeries = pd.concat([timeSeries, pd.DataFrame(arr)])
```

First with Ward linkage:

In [20]:

```
Z = hac.linkage(timeSeries, 'ward')

import sys
sys.setrecursionlimit(15000)  # dendrogram recursion can exceed the default limit

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,   # rotate the x-axis labels
    leaf_font_size=14.,  # font size for the x-axis labels
)
plt.show()
print("method is ward")
```

The same with complete linkage:

In [21]:

```
Z = hac.linkage(timeSeries, 'complete')

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,   # rotate the x-axis labels
    leaf_font_size=14.,  # font size for the x-axis labels
)
plt.show()
print("method is complete")
```

and with centroid linkage:

In [22]:

```
Z = hac.linkage(timeSeries, 'centroid')

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    leaf_rotation=90.,   # rotate the x-axis labels
    leaf_font_size=14.,  # font size for the x-axis labels
)
plt.show()
print("method is centroid")
```

Finally, single linkage with dynamic time warping itself as the metric:

In [24]:

```
# wrap DTWDistance so the window parameter w is fixed
# (w=5 matches the default used in k_means_clust above)
Z = hac.linkage(timeSeries, 'single', lambda u, v: DTWDistance(u, v, 5))

plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
hac.dendrogram(
    Z,
    show_leaf_counts=False,
    leaf_rotation=90.,     # rotate the x-axis labels
    leaf_font_size=14.,    # font size for the x-axis labels
    show_contracted=True,  # contract dense branches into single markers
)
plt.show()
print("method is single")
```