aggregate-micro-paths (AMP)

Algorithm:

Description

This algorithm takes as input geo-spatial data that includes time stamps and an id. The points in space and time are used to create "micro paths" if certain conditions are met (such as time between points, and velocity of movement between points). Track segments are then binned based on a grid and aggregated to create a heat map.

The way it works is through a three step process. The first step reads from the data source and creates path segments from the individual points. Each segment is defined by start and end points in space and time. This step also determines distance between points (haversine formula) as well as time spent and velocity during the segment and uses these to filter out segments that do not fit the configuration.

The next step is to map each segment into bins. Potential bins are established by a grid of the configured resolution (e.g. 1km by 1km). For each segment, a record is created for all the potential bins it passes through. This record also contains velocity, direction, track id, and a timestamp that is truncated based on a configurable temporal binning.

The final step is to reduce the binned records into aggregate counts. Every bin with a matching lat/lon coordinate and timestamp gets combined into a table. By default, this is done to create heatmaps in three ways: combining counts for track activity, combining average velocity, and combining average direction/bearing. The results are a set of tables containing the aggregated data.

Data

The kind of data that can be used in this algorithm consists of any kind of records containing geo-spatial data (latitude and longitude), timestamps, and an identifier that distinguishes each user or track. You'll just need to point your configuration to where the data resides and which columns correspond to the necessary features.

Configurable options

There are also several configurable options availble to create meaningful output based on your specific input. These include a geo-spatial bounding box to only capture data in a specific area, a time filter to cut off track segments that have too much time between them, a distance filter to cut off track segments that are too far apart, lat/lon resolution to determine how big the bins should be, and a temporal split that will further bin data based on a chosen temporal factor (minutes, hours, days, months, years, or none).

Results

The final results are in the form of a table of records each of which contain it's latitude, longitude, timestamp, and aggregated value. It can be read from directly or the data can be exported for use elsewhere (as in our example).

Running Example:

Prerequisites

  • Hive
  • Hadoop (streaming)
  • Python + gmpy

Example code

To run the example, execute run_ais.sh found in /hive-streaming. This script will unpack the sample data, upload it to the hadoop filesystem, enter it into Hive, and run the aggregate micro pathing algorithm. When completed, it will also pull down the finished count data from Hive and place it locally into a .csv file located in the hive-streaming/output directory.

Viewing Results: