This algorithm is implemented using Hive with Hadoop Streaming. The Python file "AggregateMicroPath.py" generates and executes HQL to create the data tables and run the queries. The scripts folder contains additional Python files that are used with Hadoop Streaming to process the data.
Running AMP requires a configuration file (.ini) to be passed to the main Python file. This configuration has several parameters that define the input data and how to perform the aggregation. An example can be found in config/ais.ini, and an illustrative sketch appears after the parameter list below.
table_name - The name of the Hive table that contains your data.
table_schema_id - The column of your Hive table that contains the track or user id used to identify each track.
table_schema_dt - The column of your Hive table that contains the timestamp to be used (YYYY-mm-dd HH:MM:SS).
table_schema_lat - The column of your Hive table that contains latitude.
table_schema_lon - The column of your Hive table that contains longitude.
time_filter - The maximum number of seconds allowed between points on a track. Any segment with more time between points gets removed.
distance_filter - The maximum distance allowed between points on a track, in kilometers. Any segment with more distance between points gets removed.
lower_left_lat - Lower-left latitude of the bounding box that the data must fall within.
lower_left_lon - Lower-left longitude of the bounding box that the data must fall within.
upper_right_lat - Upper-right latitude of the bounding box that the data must fall within.
upper_right_lon - Upper-right longitude of the bounding box that the data must fall within.
trip_name - A label for the aggregated data. Used in naming Hive tables.
resolution_lat - The height of each bin, in decimal degrees of latitude. This must be a power of 10 (e.g. 1 ~= 100 KM, .1 ~= 10 KM, .01 ~= 1 KM).
resolution_lon - The width of each bin, in decimal degrees of longitude. This must be a power of 10 (e.g. 1 ~= 100 KM, .1 ~= 10 KM, .01 ~= 1 KM).
temporal_split - Used to further bin data by discrete temporal amounts. Valid values are "minute", "hour", "day", "month", "year", and "all", which ignores timestamps when binning.
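
For illustration, here is a minimal config.ini sketch. The section header and all values below are assumptions chosen for this example; consult config/ais.ini in the repository for an authoritative example of the exact format.

# config.ini -- hypothetical example; all values are illustrative.
# The section header is an assumption; check config/ais.ini for the exact name.
[AggregateMicroPath]
table_name: ais_small
table_schema_id: mmsi
table_schema_dt: dt
table_schema_lat: latitude
table_schema_lon: longitude
# drop any segment with a gap of more than 1 hour or 10 km between points
time_filter: 3600
distance_filter: 10
# bounding box covering the area of interest (illustrative coordinates)
lower_left_lat: 24.0
lower_left_lon: -98.0
upper_right_lat: 31.0
upper_right_lon: -78.0
trip_name: ais_small_final
# roughly 10 KM x 10 KM bins, no temporal binning
resolution_lat: 0.1
resolution_lon: 0.1
temporal_split: all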
To run AMP, pass your configuration file to the main script:

python AggregateMicroPath.py -c config.ini
This assumes you are currently in the hive-streaming directory and that config.ini is your configuration file. When the job completes, the count results will be found in a Hive table called "micro_path_intersect_counts_" with your trip_name appended (e.g. micro_path_intersect_counts_ais_small_final). Similar tables can be found for velocity and direction results.
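
As a quick sanity check, you can inspect the output from the Hive shell. The query below is a sketch that assumes the example trip_name used above and makes no assumptions about column names:

SELECT * FROM micro_path_intersect_counts_ais_small_final LIMIT 10;

The velocity and direction result tables can be inspected the same way once you know their names for your trip.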