Shape matching with time series data


Devini Senaratna and Chris Potts

TL;DR: This post reviews a variety of methods for representing and comparing sequential data, focusing in particular on how these methods interact with each other and with different tasks and domains. Our goal for this work is to find effective methods for matching empirical sequences to archetypal shapes that experts find informative about their domains. We close with some exploratory experiments on the CMS Open Payments dataset.

Overview

Reasoning effectively about time-series data is vital in many domains but can be challenging because of the sheer quantity and diversity of information. In this post, we look at some methods for taming this complexity by matching sequences to archetypal shapes.

Clustering is a common kind of shape matching. In this post, we focus instead on situations in which the archetypal shapes are specified by separate models summarizing patterns we expect to be meaningful, like a linear increase or decrease, or a U shape. With the same techniques, we could bring expert humans in the loop, by having them define by hand the archetypal shapes that they expect to be important.

Wherever the archetypal shapes come from, we compare the sequences in the data to them. From there, we can cluster sequences by shape, use the shapes to define features in other machine learning models, and so forth.

Motivations for shape matching

Shape-matching with sequential data yields insights in many domains. A few examples:

  1. Product launch: You have data on what doctors are prescribing during the period in which a new product is launched, and you want to cluster doctors based on the shape of their prescribing behavior in that period. An expert might draw different shapes in relation to the product launch (e.g., flat then rise after the launch, or rise then fall after the launch).

  2. Differential burden of illness: You have data on how different patients are responding to related treatments over time, and you want to cluster those patients to understand which characteristics associate with which shapes. Here, a doctor might draw on expert knowledge about disease progression and the nature of the treatments in specifying shapes.

  3. Beat the markets: You have data on stock prices over time in a particular market, and you know certain model-based shapes are important (e.g., exponential increase), or an expert says that particular shapes are harbingers of price crashes.

  4. Alien species: You have time-series data on thousands of endangered plants in a region in which conservationists introduced a new species, and you would like to figure out how different plants respond to the new species.

  5. Seismic warning signals: You are a geophysicist looking for anomalies or outliers in seismic data that might signal an earthquake. Past events inform the set of shapes that you choose to match to, and then you can study historical data to assess their value before using them to trigger early warning systems.

At Roam, our use-cases tend to revolve around examples like 1 and 2, in which the goal is to understand issues in healthcare – how we pay for it, what we pay for, which patients respond to which treatments, and other questions of that nature. Accurately tracking and summarizing trends over time is often key to achieving clarity in these domains.

There are two important, interacting factors to consider when shape matching: how will the shapes be represented, and how will those representations be compared with each other to determine matches? We look at the comparison question first, and use it to inform the representation question.

Shape comparison

There are many techniques for making the kinds of comparison that form the basis for shape matching. It's useful to break these techniques down into two parts: data pre-processing and calculating distance (or similarity).

Data pre-processing

We can leave the raw sequences as-is and go directly to distance calculation, but it's often helpful to first transform the time-series sequences, to highlight trend-related information and/or reduce noise. Here's a brief review of such pre-processing methods:

Smoothing, standardization, and normalization

To reduce noise, various kinds of smoothing can be applied. Common choices are exponential smoothing, Holt's linear smoothing, moving averages, differencing, splines, and LOESS.
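As a quick illustration, here's a minimal sketch of two of these smoothers plus differencing, using pandas on a made-up noisy series (the series and the parameter values are ours, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# A hypothetical noisy series: a linear trend plus Gaussian noise.
rng = np.random.RandomState(0)
y = pd.Series(np.linspace(0, 10, 100) + rng.normal(scale=1.5, size=100))

# Centered moving average over a 7-point window.
moving_avg = y.rolling(window=7, center=True).mean()

# Simple exponential smoothing via an exponentially weighted mean.
exp_smooth = y.ewm(alpha=0.3).mean()

# First differences highlight period-to-period changes rather than levels.
diffs = y.diff()
```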

Standardization and normalization are also commonly used in order to make the sequences more comparable. These two terms are sometimes used interchangeably, but we should make a distinction: standardization forces the data to have mean 0 and standard deviation 1, whereas normalization scales the data to be in a specific range (often [0, 1]). Such steps will tend to erase differences of overall magnitude. This loss of information can be problematic or helpful, depending on the domain.
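In code, the distinction is a one-liner each. A sketch with numpy (the function names are ours):

```python
import numpy as np

def standardize(y):
    # Center to mean 0 and rescale to standard deviation 1.
    return (y - y.mean()) / y.std()

def normalize(y):
    # Rescale linearly to the range [0, 1].
    return (y - y.min()) / (y.max() - y.min())
```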

Projecting the data onto a basis

Here we project the data $Y$ onto a basis $W$ in order to represent the data in a new data space $T$, which is calculated by $YW$. These methods can be used to compress the data into a smaller space than before by reducing the dimensionality of $W$. This process is known as dimensionality reduction. Prominent examples are Principal Component Analysis (PCA), Truncated Singular Value Decomposition (also known as Latent Semantic Analysis), the Fast Fourier Transform (FFT), the Discrete Fourier Transform (DFT), and wavelet transforms.
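For instance, here's a hedged sketch of PCA-based compression with scikit-learn, where each row of $Y$ is one sequence and n_components controls how aggressively we compress (the data and settings here are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

# Y: one sequence per row (here, 1600 hypothetical sequences of length 52).
Y = np.random.rand(1600, 52)

# Project (after centering) onto the top 5 principal components.
pca = PCA(n_components=5)
T = pca.fit_transform(Y)    # compressed representation, shape (1600, 5)
W = pca.components_.T       # the basis, shape (52, 5)
```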

Sometimes, basis methods can appear where you don't expect them. For example, the coefficients from linear models, such as linear regression or Auto-Regressive Integrated Moving Average (ARIMA) models, can take the form of a basis method. Let's take linear regression as an example. Suppose $y$ is the sequence of interest, and $X$ denotes some time-related features. The linear regression model for each sequence is as follows:

$$ \begin{aligned} \hat{y} = X\hat{\beta} = X (X^{\top} X)^{-1} X^{\top} y \end{aligned} $$

Here, the fitted coefficients $\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$ are obtained by applying the matrix $(X^{\top} X)^{-1} X^{\top}$ to $y$, so each sequence can be compactly represented by its coefficients.
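A minimal numpy sketch of this idea, fitting an intercept-and-slope model to a sequence and keeping the coefficients as its representation (the function name is ours):

```python
import numpy as np

def regression_coefficients(y):
    """Represent a sequence by its fitted intercept and slope."""
    t = np.arange(len(y))
    X = np.column_stack([np.ones_like(t), t])  # design matrix [1, t]
    # beta_hat = (X^T X)^{-1} X^T y, computed stably via least squares.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat  # array([intercept, slope])
```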

String representations with SAX

One of the most interesting and promising representation methods we have explored at Roam is Symbolic Aggregate approXimation (SAX), an extension of Piecewise Aggregate Approximation (PAA). SAX offers a symbolic approach to shape matching: given a time series of some length $n$, SAX reduces it to a string of arbitrary length $w$, where $w \leq n$. For making comparisons, this is extremely efficient, since it greatly reduces the dimensionality of the original sequence; one can view it as an aggressive form of smoothing. We find SAX useful when we need to store large numbers of sequences compactly. The downside is that one loses detailed information about each sequence; it is sometimes difficult to faithfully represent complex shape archetypes using SAX. To calculate the distance between two sequences, SAX uses its own method, known as MINDIST, which is explained at the end of the following section.

The SAXModel class of the pysax Python package implements the sequence-to-string conversion with its symbolize method, and the distance measure with its symbol_distance method. (Here's a helpful tutorial.) For R, the SAX, TSclust, and TSdist packages have implementations of SAX and MINDIST. For the examples below, TSdist was used via the rpy2 Python interface.
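To make the transformation concrete, here's a from-scratch sketch of the SAX conversion (standardize, average within $w$ segments, then bin the segment means using Gaussian quantile breakpoints). It is meant to illustrate the idea, not to replicate any particular library's API:

```python
import numpy as np
from scipy.stats import norm

def sax_symbolize(y, w=8, alphabet="abcd"):
    """Reduce a numeric sequence of length n to a string of length w."""
    y = (y - y.mean()) / y.std()        # standardize first
    segments = np.array_split(y, w)     # PAA: w roughly equal segments
    paa = np.array([seg.mean() for seg in segments])
    # Breakpoints dividing the standard normal into equal-probability bins.
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[i] for i in np.searchsorted(breakpoints, paa))
```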

Calculating distance

Once we have pre-processed the sequences, we need a numerical measure of the distance between two pre-processed sequences, in order to assess how similar the two time-series curves are.

Common distance measures

Arguably the most common distance measure is the $L_2$ norm, otherwise known as the Euclidean distance. Here, the distance between two sequences or vectors $\bar{x}$ and $\bar{y}$ of dimension $n$ is:

$$ \begin{aligned} ||\bar{x} − \bar{y}||_{2} = \sqrt{\sum_{i=1}^{n}|x_{i}-y_{i}|^{2}} \end{aligned} $$

Mahalanobis distance and cosine distance, though not identical, are both related to Euclidean distance and also commonly used.

Correlation measures can also be used as distance measures, since correlation is inversely related to the distance between the two sequences. Pearson's correlation coefficient is an example of a parametric correlation measure, while Spearman and Kendall rank-based correlations are non-parametric. Non-parametric methods can pick up non-linear (monotonic) trends and are robust against outliers.

If the data has been transformed into a string representation, edit distance measures such as Longest Common Subsequence can be used. Here is a good reference for some of these techniques.

Python's scipy.spatial.distance offers many distance methods and scipy.stats implements a wide range of correlation methods. R's cluster and stringdist implement the above measures as well.
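For example, here's a small sketch that turns these building blocks into distances between an archetype and a sequence (both vectors are made up):

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean
from scipy.stats import spearmanr

archetype = np.arange(10, dtype=float)  # a linear increase
sequence = np.array([0, 1, 1, 2, 4, 4, 5, 7, 8, 9], dtype=float)

d_euclidean = euclidean(archetype, sequence)
d_cosine = cosine(archetype, sequence)  # 1 minus cosine similarity
rho, _ = spearmanr(archetype, sequence)
d_spearman = 1 - rho                    # correlation recast as a distance
```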

Fréchet distance and Dynamic Time Warping (DTW)

Fréchet distance is also called "man walks dog" or "dog walks man" distance because of the colorful way it is often described. Imagine a person walking along one sequence and their dog walking in the same direction along another. Each can walk at any speed (even stop), but they cannot go backwards. The Fréchet distance between the two sequences is the length of the shortest leash that could connect them. For a thorough discussion of the math behind this, check out these helpful notes.

Fréchet distance does not require the sequences being compared to have the same length, and it doesn't assume that the $i$th point of one sequence should be compared with the $i$th point of the other. Rather, the method implicitly aligns the two sequences as part of the comparison.

Dynamic Time Warping (DTW) is similar to Fréchet distance. Here again, we seek an optimal alignment between the two shapes without assuming that the underlying processes happened at the same speed or for the same amount of time. In other words, the DTW algorithm stretches or compresses two non-linear sequences in making its comparison:

We have two sequences: a red curve A and a shorter blue curve B. The shapes of these two curves look similar. However, they are out of phase, and B is shorter than A. The DTW algorithm can warp these two curves by time-aligning the sequences (as depicted by the black arrows), making it possible to find similarities between sequences that have different lengths, are out of phase, and/or are warped representations of each other. To prevent the number of such warping functions from growing too large, they are typically constrained to be monotonic and continuous, with further restrictions placed on their slopes.

The Python dtw package implements Dynamic Time Warping.
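Interface details aside, the core of DTW is a short dynamic program. Here's a self-contained numpy sketch of the classic recurrence, with an absolute-difference local cost and no slope or window constraints:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic unconstrained DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```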

MINDIST for SAX

SAX, discussed above, uses its own distance method:

$$ \textrm{MINDIST}(\hat{Q}, \hat{C}) = \sqrt\frac{n}{w} \sqrt{\sum_{i=1}^w\left(\textbf{dist}(\hat q_i, \hat c_i)\right)^{2}} $$

where $Q$ and $C$ are the two sequences or series of interest, $\hat Q$ and $\hat C$ denote the transformed string versions of the original sequences, $\hat q_i$ is the $i^{th}$ character of the transformed sequence $\hat Q$ (and similarly $\hat c_i$), $n$ is the length of each original sequence, $w$ is the length of the string that the sequence is reduced to, and $\textbf{dist}$ is calculated using a look-up table based on the breakpoints that define the symbol alphabet.
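Here's a sketch of how the look-up table and the formula fit together, reusing the sax_symbolize sketch above. Per the SAX paper, dist is zero for identical or adjacent symbols, and otherwise the gap between the relevant breakpoints:

```python
import numpy as np
from scipy.stats import norm

def mindist(q_hat, c_hat, n, alphabet="abcd"):
    """MINDIST between two SAX strings of length w, from sequences of length n."""
    w = len(q_hat)
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])

    def dist(qc, cc):
        i, j = alphabet.index(qc), alphabet.index(cc)
        if abs(i - j) <= 1:  # identical or adjacent symbols: distance 0
            return 0.0
        return breakpoints[max(i, j) - 1] - breakpoints[min(i, j)]

    return np.sqrt(n / w) * np.sqrt(sum(dist(q, c) ** 2
                                        for q, c in zip(q_hat, c_hat)))
```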

Things to ponder when selecting a shape-matching method

  1. Speed: The run-time of the algorithm is important if the shape-matching task needs to be carried out in real time. In such situations, one might favor algorithms where the dimensionality of the sequences is greatly reduced, since they will be extremely fast at comparison time.

  2. Clustering: The choice of clustering algorithm interacts in complex ways with the choice of comparison method. For instance, geometric methods will tend to build in assumptions of linearity that can either be aligned with or at odds with the method for representation or comparison.

  3. Human intuition: What do the results look like to experts? This is especially important if the goal of shape-matching is to automate the more expensive task of getting a human to manually match shapes.

  4. Flexibility: Real-world time-series datasets rarely involve the same number of measurements taken at the same intervals, which is problematic for many comparison methods.

  5. Storage: Where a large number of sequences needs to be stored, SAX might be the optimal choice, since it turns each sequence of floats into a single word.

Finding archetypes in CMS Open Payments

To bring the above ideas together, we'll now walk through some basic experiments with the publicly available CMS Open Payments dataset, which tracks payments to healthcare providers by healthcare companies. The data are aggregated by manufacturer name and week number of the year, giving us a temporal payment series for each manufacturer. There are approximately 1,600 sequences in the resulting dataset.

CMS shape archetypes

We defined two model-based shape archetypes:

  1. Linear archetype: A simple linear increasing trend $y = x$.
  2. Curve archetype: A curve in the form of $y = x^{2}$.

CMS shape comparison

We assessed four different comparison methods:

  1. Cosine distance on standardized data
  2. Spearman's correlation on standardized data
  3. DTW
  4. MINDIST and SAX

This is meant to be a representative sample of the methods reviewed above: cosine is one of the simplest distance measures, Spearman is a conceptually similar non-parametric method, Dynamic Time Warping is an "optimal match" method, and the combination of MINDIST and SAX gives us a look at a symbolic approach.

Numerical vectors corresponding to our linear and curve archetypes were created and compared against the CMS Payments sequences using each of these methods.
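In outline, the ranking step looks like this (a hedged sketch, not our production code): build each archetype vector at the same length as the sequences, then score every sequence against it:

```python
import numpy as np
from scipy.spatial.distance import cosine

n = 52  # hypothetical sequence length (one value per week)
linear_archetype = np.linspace(0, 1, n)       # y = x
curve_archetype = np.linspace(0, 1, n) ** 2   # y = x^2

def rank_by_archetype(sequences, archetype):
    """Return sequence indices sorted from best to worst match."""
    distances = [cosine(seq, archetype) for seq in sequences]
    return np.argsort(distances)
```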

Qualitative comparisons

For qualitative comparisons, we identified the top trend lines that match the linear archetype and the curve archetype for each method.

For the linear archetype, the methods all deliver similar results, as the following plot shows:

It's noteworthy that DTW picked lines that were of about the same magnitude, whereas the others did not, with SAX appearing as the least sensitive to this dimension. A similar picture emerges with the curve archetype:

As before, the methods broadly agree, with DTW showing the most sensitivity to absolute differences along the y-axis.

Roam Humans

We thought it would be fun and enlightening to experiment on our colleagues a bit, so we recruited seven Roam Humans for a little experiment. For each of the two shape archetypes, we took 5 samples each consisting of 4, 16, and 25 random trend lines and asked our participants to select the best match with the archetype. A few clear results emerged:

  1. The overall level of agreement was high, especially when 4 images were displayed (see table below). This is reassuring about the overall coherence of the task.
  2. Agreement was higher for the linear archetype than for the curve archetype, which might be related to their relative complexity. However, it happens that two Roam Humans had 100% agreement for the curve archetype when 25 images were displayed! A striking mind-meld!
  3. As one would expect, as the number of images to choose from increased, agreement dropped. This could reflect true differences in opinion, increased uncertainty, or just plain fatigue.

Table: Comparison of percentage pairwise agreement between Roam Humans.

| Curve type | 4 images | 16 images | 25 images |
| --- | --- | --- | --- |
| Linear increasing | 63% | 56% | 34% |
| U-shaped | 58% | 22% | 21% |
| Overall | 60.5% | 39% | 27.5% |

Comparing all five methods

In pairwise comparisons of the four mathematical methods and the aggregate Roam Human, one can see clearly that Roam Human, Spearman, and cosine cluster together, whereas SAX and DTW do not match well with other methods. First, here's the summary for the linear archetype:

For the curve archetype, the similarities are lower overall, but Spearman and cosine are again closest to the Roam Human, with SAX and DTW appearing quite distinct from them:

In our experiment, the recommendations from Spearman correlation and cosine similarity were the most similar to those of the Roam Humans. However, our qualitative comparisons show that all four methods produced reasonable recommendations for these types of shapes.

We must also stress that Roam Humans may not be an ideal gold standard for shape matching. Other measures, such as within-group and between-group variation for the matched shapes, may give a better sense of how "good" a shape comparison measure is. In practice, Spearman correlation can be time-consuming to calculate in real time, neither it nor cosine is flexible with respect to varying sequence lengths, and, because neither method reduces the length of the sequences, both may require a lot of memory for storage.

Hence, we are still SAX (with MINDIST) and DTW supporters. SAX (with MINDIST), together with a human-in-the-loop method, was successfully used to tune hyperparameters in one of our pilot project experiments. However, doing so for this blog post would have required a lot of human resources. In the future, we plan to use expert annotations like those of our Roam Humans to tune the hyperparameters for SAX and DTW, which should give both methods an edge.

Clustering sequences

Beyond looking at the nearest-neighbor sequences for these archetypes, we might also want to group a large number of sequences into smaller bins of similar shapes, whether those bins are naturally occurring or defined in advance.

For this work, automatic clustering methods such as K-means or Hierarchical Agglomerative Clustering can be used, and experts can be brought in here as well to adjust the output of the clustering algorithm.

Studying the centroids of the resulting clusters can also suggest new shapes to experts, which can be used in further shape-matching studies.
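A minimal scikit-learn sketch of this step, with standardized sequences as rows and a cluster count that is, as always, a judgment call:

```python
import numpy as np
from sklearn.cluster import KMeans

# Y: one standardized sequence per row (hypothetical data).
Y = np.random.rand(1600, 52)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(Y)
labels = kmeans.labels_                # cluster assignment per sequence
centroids = kmeans.cluster_centers_    # candidate new shape archetypes
```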

Key take-aways

  • There are many methodologies for matching shape archetypes, each a combination of pre-processing steps and distance measures.

  • It is important to think about the speed, flexibility, and (to a lesser extent) storage requirements of each method when selecting a suitable shape-matching methodology.

  • Spearman correlation and cosine similarity were the most similar to the Roam Humans. However, Spearman correlation can be costly to compute, and neither method is flexible with respect to varying sequence lengths or well suited to real-time processing of large numbers of sequences.

  • The intermediate alphabet-based string output from SAX makes it easy and efficient to store. Hence, SAX is handy when real-time processing is required, since the pre-computed intermediate results can be used to do MINDIST calculations very quickly. However, SAX is not sensitive to more intricate patterns in a sequence, and for that purpose, methods such as cosine distance or DTW might be better suited.