Generating trajectories from massive movement datasets

To explore travel patterns like origin-destination relationships, we need to identify individual trips with their start/end locations and the trajectories between them. Extracting these trajectories from large datasets can be challenging, particularly if the records of individual moving objects no longer fit into memory and if the spatial and temporal extent varies widely (as is the case with ship data, where individual vessel journeys can take weeks and cross multiple oceans). 

This is part 2 of “Exploring massive movement datasets”.

Roughly speaking, trip trajectories can be generated by first connecting consecutive records into continuous tracks and then splitting them at stops. This general approach applies to many different movement datasets. However, the processing details (e.g. stop detection parameters) and preprocessing steps (e.g. removing outliers) vary depending on input dataset characteristics.
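As a minimal single-machine sketch (not the paper's Spark implementation), the splitting step could look like this for one chronologically sorted track. The record layout and the 6-hour gap threshold are illustrative assumptions, not values from the paper:

```python
from datetime import datetime, timedelta

def split_track(records, gap=timedelta(hours=6)):
    """Split one chronologically sorted track into trips wherever the
    time between consecutive records exceeds the gap threshold.
    Each record is a (timestamp, x, y) tuple; the 6 h default is an
    illustrative assumption, not a value from the paper."""
    trips, current = [], []
    for rec in records:
        if current and rec[0] - current[-1][0] > gap:
            trips.append(current)  # close the current trip at the gap
            current = []
        current.append(rec)
    if current:
        trips.append(current)
    return trips

t0 = datetime(2020, 1, 1)
track = [(t0, 0.0, 0.0),
         (t0 + timedelta(hours=1), 0.1, 0.1),
         (t0 + timedelta(hours=10), 1.0, 1.0)]  # 9 h gap before this record
print(len(split_track(track)))  # 2 trips
```

Splitting at stops works analogously, except that the decision requires looking at both distance moved and time elapsed rather than a single time difference.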

For example, in our paper [1], we extracted vessel journeys from AIS data which meant that we also had to account for observation gaps when ships leave the observable (usually coastal) areas. In the accompanying 10-minute talk, I went through a 4-step trajectory exploration workflow for assessing our dataset’s potential for travel time prediction:

Click to watch the recorded talk

Like the M³ prototype computation presented in part 1, our trajectory aggregation approach is implemented in Spark. The challenge is twofold: the amounts of trajectory data are massive, and the operations only produce correct results if they are applied to a complete and chronologically sorted set of location records. This is difficult because the Spark core libraries (version 2.4.5 at the time) are mostly geared towards dealing with unsorted data. When high-level Spark core functionality is used naively, an aggregator has to collect and sort an object's entire track in the main memory of a single processing node. Consequently, out-of-memory errors are frequent when dealing with large datasets.

To solve this challenge, our implementation builds on the Secondary Sort pattern and on Spark’s aggregator concept. Secondary Sort first groups records by a key (e.g. the moving object id) and ensures that, within each group, the records arrive sorted (e.g. chronologically) when they are iterated over. The resulting iterator can then feed an aggregator that implements the logic for building trajectories based on the gaps and stops detected in the dataset.
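The core idea can be illustrated in plain Python (a single-machine sketch, not the distributed implementation): sort once by a composite key of object id and timestamp, then stream each object's records in order. In Spark, the same effect is achieved during the shuffle (e.g. via `repartitionAndSortWithinPartitions`), so no per-object in-memory sort is needed afterwards:

```python
from itertools import groupby
from operator import itemgetter

def iter_sorted_tracks(records):
    """Secondary Sort in miniature: sort by the composite key
    (object_id, timestamp), then group by object_id. Each group's
    records are guaranteed to be in chronological order, so a
    downstream aggregator can process them as a stream."""
    ordered = sorted(records, key=itemgetter(0, 1))   # composite key sort
    for obj_id, group in groupby(ordered, key=itemgetter(0)):
        yield obj_id, list(group)

records = [("ship_b", 3, "pos_b3"), ("ship_a", 2, "pos_a2"),
           ("ship_a", 1, "pos_a1"), ("ship_b", 1, "pos_b1")]
tracks = dict(iter_sorted_tracks(records))
print([ts for _, ts, _ in tracks["ship_a"]])  # [1, 2]
```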

If you want to dive deeper, here’s the full paper:

[1] Graser, A., Dragaschnig, M., Widhalm, P., Koller, H., & Brändle, N. (2020). Exploratory Trajectory Analysis for Massive Historical AIS Datasets. In: 21st IEEE International Conference on Mobile Data Management (MDM) 2020. doi:10.1109/MDM48529.2020.00059

This post is part of a series. Read more about movement data in GIS.

  1. gxclark said:

    Hello underdark,

    This is well-done. The first step is to order by (or extract) the object_id, which should be the moving vehicle identifier. A while back I did this with automobiles, but instead of using Apache Spark, we used what is today AWS Redshift. I’d consider many principles the same when dealing with these data volumes. By using a SQL window function like “rank() over (partition by object_id order by movement_timestamp)”, we were able to do the same thing that you describe, logically at least. Our goal was to establish a trip table and give it a new ID (trip_id), which was logically the vehicle_id plus the start timestamp of the trip (i.e. when the driver pressed the trip button). By establishing this trip_id and adding it as a foreign key back into the big table of timestamps, latitudes and longitudes (and other “signals”), we were able to “re-distribute” the signal data to each trip and thereby enable parallel processing of things like paths, speeds, and other useful analyses.

    I’m interested in your stop logic, that sounds tricky, and probably has specific business rules for ships.

    Best Regards,

    • Dear gxclark,
      Thank you for sharing your experience with AWS Redshift for car data. The key difference between car and ship movement data is that ship trip duration varies a lot and can cover multiple weeks. Therefore, even individual trips may not fit into memory. The stop logic used so far is rather straightforward and logically similar to what I’ve implemented in the MovingPandas stop detector.

  2. Furqan Raza said:

    That’s awesome. Google keeps tracking everyone who enabled location services: you know where you go and what your most frequent trajectories are. But this article reveals the business side of the same data. Event planners, advertising companies, and many more would love to pay for this kind of analysis, because it gives you the intelligence to find the target market for advertising.

  3. Andrés said:

    Hello, Anita, greetings from Buenos Aires. I’m not sure my comment was submitted properly, so here it is again.

    I read your blog, watched your workshop on MovingPandas, and read some of your papers. I loved your work, thank you for sharing such useful stuff.

    I have a use case of splitting bus tracks based on terminals, with data stored on Delta tables and using mostly databricks to manipulate it. I was thinking about using UDFs for the task, but what’s your view on the best approach to scale this up? I have roughly 120 million monthly data points, and 3 years of data (4.3 B).

    Thanks again for all your work.


    • Hi Andrés, thank you for the kind feedback!

      There should be a couple of advantages to splitting at terminal locations: you can write an aggregator that receives chronologically sorted records, and every time the terminal location is visited, the aggregator can finish the current trajectory and start a new one when the bus leaves the terminal.

      However, UDFs can be simpler and may be sufficient for your use case. Since bus trajectories are relatively short (i.e. not multiple days long), you probably won’t run into memory issues while processing batches of daily bus tracks either. Therefore, a simple UDF could work.
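      To make the terminal-splitting idea concrete, here is a minimal sketch, assuming records are (timestamp, x, y) tuples and `is_at_terminal` is a hypothetical predicate (e.g. a point-in-polygon test against the terminal geometry):

```python
def split_at_terminals(records, is_at_terminal):
    """Split one chronologically sorted bus track into trips at terminal
    visits: arriving at the terminal finishes the current trip, and a new
    trip starts once the bus leaves the terminal again."""
    trips, current = [], []
    for rec in records:
        if is_at_terminal(rec):
            if current:               # arriving at the terminal ends the trip
                trips.append(current)
                current = []
        else:                         # leaving the terminal resumes recording
            current.append(rec)
    if current:
        trips.append(current)
    return trips

track = [(0, 0.0, 0.0), (1, 0.5, 0.2), (2, 0.9, 0.4),
         (3, 0.0, 0.0), (4, 0.4, 0.1)]
at_terminal = lambda rec: (rec[1], rec[2]) == (0.0, 0.0)
print(len(split_at_terminals(track, at_terminal)))  # 2 trips
```

      The same logic ports to an aggregator or UDF that receives one bus's records in chronological order.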
