Finding Sprints, Formally

We present an example use of Montre on a data set obtained by tracking positions of players in a real soccer match. In this example, we find all sprints performed by a single player where a sprint is formally specified by a timed regular expression over speed and acceleration behaviors. The data are obtained by a computer vision algorithm with a frame rate of 10 Hz so we have a xy-coordinate for each player on the field at every 100 milliseconds. Therefore we use milliseconds as our base time unit for behaviors and expressions.

Writing patterns

In order to specify a pattern for sprints, we need to address two issues in order: (1) how to categorize continuous speed and acceleration axes, and (2) which composition of these categories defines a sprinting effort best. Clearly, there are no universal answers for these questions so we rely on the study in the following. First, we partition speed and acceleration axes into four categories (near-zero, low, medium, and high), and we associate a letter for each category below. For example, a period of medium speed, denoted by r, means the speed value resides between 3.7 and 6 m/s during the period.


In [ ]:
# Speed (m/s) and acceleration (m/s^2) categories 

accel_desc = ['nhigh','nmedium', 'nlow','around_zero', 'low', 'medium', 'high']
accel_syms = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
accel_bins = [-100, -1.60, -1.17,-0.57, 0.57, 1.17, 1.60, 100]

speed_desc = ['low', 'medium', 'high', 'very_high']
speed_syms = ['p', 'q', 'r', 's']
speed_bins = [-1.0, 2, 3.7, 6, 20]

Often a sprint effort is characterized by any movement above a certain speed threshold for a limited time. This gives us our first sprint pattern such that a period of high speed between 1-10 seconds, formally written as follows:


In [ ]:
P1 = "\'(<:s:>)%(1000,10000)\'"

Above we use anchor operators from both sides on the proposition s to obtain only maximal periods that satisfy s; otherwise, any sub-period satisfies the pattern as well. The operator % specifies that the duration is restricted to be in 1000 and 10000 milliseconds. Alternatively we may want to find other efforts starting with high acceleration but not reaching top speeds necessarily. This gives us our second sprint pattern such that a period of high acceleration followed by a period of medium or high speed between 1-10 seconds, formally written as follows:


In [ ]:
P2 = "\'(<:g);(<:(r||s):>)%(1000,10000)\'"

Notice that we do not use the right-anchor on g. This allows a medium or high speed period to overlap with a high acceleration period as it is usually the case that they are concurrent. Writing an equivalent pattern using classical regular expressions over a product alphabet would be a very tedious task partly due to a requirement to handle such interleavings explicitly (and the lack of timing constraints). For timed regular expressions all propositions are considered to be concurrent by definition, which results in concise and intuitive expressions. Finally we give a third pattern to find rather short but intense sprints such that


In [ ]:
P3 = "\'(<:(f||g));((<:s:>)%(1000,2000))\'"

Then we visualize all sprints found by Montre for patterns P1-P3 in Figure 3 over the behavior of a single player during one half of the game (45 min.) containing 27K data points that reduces to timed behaviors of 5K segments after pre-processing.

Timed Pattern Matching, Pre- and Post-Processing


In [ ]:
%matplotlib inline

# In order to process data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In order to call montre
import subprocess, shlex

For this tutorial we use the data from a real soccer match, Galatasaray v Sivasspor, played at January 16th, 2016. My favorite team Galatasaray won the match so that's cool. More info about the match (like jersey numbers of players, we'll use next.) can be found here. These data are obtained by a sport analytics company, Sentio. Their soccer player tracking technology tracks players by using state-of-the-art computer vision and machine learning algorithms. But here we just use the output of these algorithms: Raw coordinates of players monitored with a rate of 10 samples per second for 90 minutes. We read the data from a csv file.


In [ ]:
# Sentio constants 
FRAME_PER_SECOND = 10

# Read Sentio data
df = pd.read_csv('data/match.csv', index_col=["MINUTE", "SECOND", 'TIMESTAMP'], header=0, 
                 names=["TIMESTAMP", "HALF", "MINUTE", "SECOND", "TEAM_ID", 
                        "JERSEY_NUMBER", "X_POS", "Y_POS", "DISTANCE", "SPEED"])

Now I choose a player to find his sprints during the match. My usual suspect is Olcan Adin (Jersey Number #29) from Galatasaray (Team #1). I choose him because he is a winger and usually sprints more than others. Also we specify whether we want to find sprints in the first half or the second half.


In [ ]:
# Usual Suspect
PLAYER_TEAM = 1     # Galatasaray
PLAYER_NUMBER = 29  # Olcan Adin
HALF = 1

# Select the data for the specific player
data = df[(df['TEAM_ID'] == PLAYER_TEAM) & 
          (df['JERSEY_NUMBER'] == PLAYER_NUMBER) & 
          (df['HALF'] == HALF)].copy(deep=True)

First we sort datapoints according to timestamps/seconds/minutes and then introduce a single time axis. Besides the tracking technology sometimes gets confused and misses some samples; therefore, we fill these values by interpolation in the following.


In [ ]:
data['TIME'] = pd.Series(range(len(data)), index=data.index)
data = data.set_index(['TIME'])
data = data.drop(['HALF', 'TEAM_ID', "JERSEY_NUMBER", "DISTANCE", "SPEED"], axis=1).copy(deep=True)

data.loc[data['X_POS'] < 0, 'X_POS'] = np.nan; data['X_POS'] = data['X_POS'].interpolate()
data.loc[data['Y_POS'] < 0, 'Y_POS'] = np.nan; data['Y_POS'] = data['Y_POS'].interpolate()

If you would like to see the entire movement of the player, it's here:


In [ ]:
fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')  

# Draw a simple soccer pitch (105x68m)
ax.set_ylim(0,68)
ax.set_xlim(0,105)
plt.plot([52.5,52.5], [0,68], color='black', linewidth=2)
plt.plot([0,16.5,16.5,0], [54,54, 14,14], color='black', linewidth=2)
plt.plot([105,105-16.5,105-16.5,105], [54,54,14,14], color='black', linewidth=2)
circle1=plt.Circle((52.5,34),10, color='black', linewidth=2, fill=False)
ax.add_artist(circle1)

plt.plot(data['X_POS'],data['Y_POS'])

We calculate displacement, speed and accelaration from xy-coordinates. Here some Euclidian geometry and Newtonian mechanics:


In [ ]:
data['DISPLACEMENT'] =  np.sqrt((data['X_POS'].diff())**2 + (data['Y_POS'].diff())**2)
data['DISPLACEMENT'][0] = 0

data['SPEED'] = data['DISPLACEMENT'] * FRAME_PER_SECOND

data['ACCEL'] = data['SPEED'].diff() * FRAME_PER_SECOND
data['ACCEL'][0] = 0
data['ACCEL'][1] = 0

# In case more smoothing is needed.
# data['SPEED'][2:] = data['SPEED'].rolling(center=False,window=3).mean()[2:]
# data['ACCEL'][2:] = data['ACCEL'].rolling(center=False,window=3).mean()[2:]

Finally we apply speed and acceleration categories and I obtain symbolic speed and acceleration behaviors.


In [ ]:
data['SPEEDSYM'] = pd.cut(data['SPEED'], bins=speed_bins, labels=speed_syms)
data['ACCELSYM'] = pd.cut(data['ACCEL'], bins=accel_bins, labels=accel_syms)

Then write symbolic behaviors into a file named as INPUT.txt.


In [ ]:
def collect(x, *ys):

    y = [''.join(str(i) for i in e) for e in zip(*ys)]
        
    xp = x[0]
    yp = y[0]
    for (xi, yi) in zip(x,y):
        if yi != yp:
            yield (xi-xp,yp)
            xp = xi
            yp = yi
    yield (xi-xp,yp)

with open('INPUT.txt', 'w') as f:
    for (xi, yi) in collect(data.index.values, list(data['ACCELSYM'].values), list(data['SPEEDSYM'].values)):
        f.write('{0} {1}\n'.format(xi*100, yi))

Now we call Montre to perform timed pattern matching for one of given patterns (P1, P2, P3) and the file INPUT.txt.


In [ ]:
command = shlex.split("montre" + " " + "-b" + " " + P1 + " " + "-f" + " " + "INPUT.txt" + " " + "-o" + " " + "OUTPUT.txt")
subprocess.call(command)

The output containing a list of zones is in the OUTPUT.txt. The zone from the first line is as follows.


In [8]:
%%latex
\begin{equation*}
144800 \leq t \leq 148800\\
144800 \leq t' \leq 148800\\
4000 \leq t' -t \leq 4000
\end{equation*}


\begin{equation*} 144800 \leq t \leq 148800\\ 144800 \leq t' \leq 148800\\ 4000 \leq t' -t \leq 4000 \end{equation*}

This zone contains only one time segment (144800, 148800) and it is when a sprint occurred for the pattern given. In other word there is a sprint of Olcan Adin starting at 144.8s and ending at 148.8s, which is 4-second long. In general, zones can contain uncountable number of occurrences however here remember that we have restricted start and end times of a match by using anchors in the pattern. That's why we ended up zone with a single point inside.

The last thing is to visualise all sprints of Olcan Adin performed in the first half with respect to pattern definitions of P1, P2, and P3, respectively. The dot shows the endpoint of a sprint.

Conclusion

We have shown how to perform timed pattern matching over speed and acceleration signals to find all instances of a pattern (e.g. sprints). Perhaps the most important thing in the process is to come up a good pattern definition. And clearly this is not easy and it requires some domain knowledge at the first hand. But, once you have a pattern, the problem can be solved easily by using Montre and some standard data processing.

Finally special thanks to Hande Alemdar and Serdar Alemdar for sharing their data for this demo.