
The last time I preprocessed the whole GeoLife dataset, I loaded it into PostGIS. Today, I want to share a new workflow that creates a (Geo)Parquet file and is much faster.

The dataset (GeoLife)

“This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.”

The GeoLife GPS Trajectories download contains 182 directories full of .plt files:

Basically, CSV files with a custom header:

Creating the (Geo)Parquet using DuckDB

DuckDB installation

Following the official instructions, installation is straightforward:

curl https://install.duckdb.org | sh

From there, I’ve been using the GUI which we can launch using:

duckdb -ui

The spatial extension is a DuckDB core extension, so it’s readily available. We can create a spatial db with:

ATTACH IF NOT EXISTS ':memory:' AS memory;
INSTALL spatial;
LOAD spatial;

Reading a spatial file is as simple as:

SELECT * 
FROM '/home/anita/Documents/Codeberg/trajectools/sample_data/geolife.gpkg'

thanks to the GDAL integration.

But today, we want to get a bit more involved …

DuckDB SQL magic

The issues we need to solve are:

  1. Read all CSV files from all subdirectories
  2. Parse the CSV, ignoring the first couple of lines, while assigning proper column names
  3. Assign the CSV file name as the trajectory ID (because there is no ID in the original files)
  4. Create point geometries that will work with our GeoParquet file
  5. Create proper datetimes from the separate date and time fields

Luckily, DuckDB’s read_csv function comes with the necessary features built-in. Putting it all together:

CREATE OR REPLACE TABLE geolife AS 
SELECT 
  parse_filename(filename, true) as vehicle_id, 
  strptime(date||' '||time, '%c') as t, 
  ST_Point(lon, lat) as geometry -- do NOT use ST_MakePoint
FROM read_csv('/home/anita/Documents/Geodata/Geolife/Geolife Trajectories 1.3/Data/*/*/*.plt',
    skip=6,
    filename = true, 
    columns = {
        'lat': 'DOUBLE', 
        'lon': 'DOUBLE', 
        'ignore': 'INT', 
        'alt': 'DOUBLE', 
        'epoch': 'DOUBLE', 
        'date': 'VARCHAR',
        'time': 'VARCHAR'
    });

It’s blazingly fast:

I haven’t tested reading directly from ZIP archives yet, but there seems to be a community extension (zipfs) for this exact purpose.
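
To actually write the (Geo)Parquet file, we can export the table with a COPY statement. Here’s a minimal sketch using DuckDB’s Python API instead of the UI (the output path is made up, and I’m assuming a recent DuckDB / spatial extension version that writes GeoParquet metadata for GEOMETRY columns):

import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial;")
con.sql("LOAD spatial;")

# ... create the geolife table as shown above, then export it:
con.sql("COPY geolife TO 'geolife.parquet' (FORMAT PARQUET);")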

Ready for QGIS

GeoParquet files can be drag-n-dropped into QGIS:

I’m running QGIS 3.42.1-Münster from conda-forge on Linux Mint.

Yes, it takes a while to render all 25 million points … But you know what? It gets really snappy once we zoom in closer, e.g. to the situation in Germany:

Let’s have a closer look at what’s going on here.

Trajectools time

Selecting the 9,438 points in this extent, let’s compute movement metrics (speed & direction) and create trajectory lines:

Looks like we have some high-speed sections in there (with those red > 100 km/h streaks):

When we zoom in to Darmstadt and enable the trajectories layer, we can see each individual trip. Looks like car trips on the highway and walks through the city:

That looks like quite the long round trip:

Let’s see where they might have stopped to have a break:

If I had to guess, I’d say they stayed at the Best Western:

Conclusion

DuckDB has been great for this ETL workflow. I didn’t use much of its geospatial capabilities here but I was pleasantly surprised how smooth the GeoParquet creation process has been. Geometries are handled without any special magic and are recognized by QGIS. Same with the timestamps. All ready for more heavy spatiotemporal analysis with Trajectools.

If you haven’t tried DuckDB or GeoParquet yet, give it a try, particularly if you’re collaborating with data scientists from other domains and want to exchange data.

The QGISUC2025 team has done an awesome job recording and editing the conference presentations. All “presentation” type talks whose presenters agreed to be published are now available in a dedicated playlist on the QGIS YouTube channel.

I also had the pleasure of presenting our Trajectools plugin and you can see this talk here:

Thank you to all the organizers, speakers, and participants for the great time!

The latest releases of MovingPandas and Trajectools come with many “under the hood” changes that aim to make your movement analytics faster:

  1. Instead of immediately creating a GeoPandas GeoDataFrame and populating the geometry column with Point objects, MovingPandas now has “lazy geometry column creation” that holds off on this operation until / if the geometries are actually needed. This way, for many operations, no geometry objects have to be generated at all (see the sketch after this list).
  2. MovingPandas TrajectorySplitters now support parallel processing and Trajectools uses parallel processing whenever available (e.g. for adding speed & direction metrics, detecting stops, splitting trajectories).
  3. When a minimum length is specified for trajectories, MovingPandas now avoids computing the total trajectory length and, instead, immediately stops once the threshold value has been reached (“early skip”).
  4. Trajectools now offers the option to skip computation of movement metrics (speed & direction). This way, we can skip unnecessary computations and leverage the lazy geometry column creation, wherever applicable.
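
To illustrate the first point: when a TrajectoryCollection is built from plain coordinate columns, no Point objects have to be created up front. A minimal sketch with toy data (I’m assuming the lazy code path applies when constructing from x/y columns):

import pandas as pd
import movingpandas as mpd

# Plain (non-geo) DataFrame with coordinate columns; no shapely Points are built here
df = pd.DataFrame({
    'traj_id': [1, 1, 1, 1],
    't': pd.to_datetime(['2025-01-01 10:00', '2025-01-01 10:01',
                         '2025-01-01 10:02', '2025-01-01 10:03']),
    'lon': [16.37, 16.38, 16.39, 16.40],
    'lat': [48.20, 48.21, 48.22, 48.23],
})

# Constructing from x/y columns lets MovingPandas defer geometry creation
# until an operation actually needs the geometries
tc = mpd.TrajectoryCollection(df, 'traj_id', t='t', x='lon', y='lat', crs='epsg:4326')
print(tc)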

Let’s have a look at some example performance measurements!

Example 1: MovingPandas ValueChangeSplitter

The ValueChangeSplitter splits trajectories when it detects a value change in the specified column. This is useful, for example, to split up public transport trajectories that contain a “next_stop” column.

The following graph shows ValueChangeSplitter runtimes for different minimum trajectory length settings (0, 1 km, 100 km, and 10,000 km):

We see that the new lazy geometry column initialization outperforms the original code in all cases (e.g. a 57% runtime reduction for 1 km), except for the worst-case scenario, in which the original implementation discards all trajectories as too short right from the start. (For most use cases, min_length will be set to rather small values to avoid the creation of undesired short trajectory fragments, similar to sliver polygons in classic geometry operations.)

Additionally, we can engage multiprocessing by setting the n_processes parameter, e.g. to the number of CPUs to achieve further speedup:
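
For reference, a call using these options might look like the following sketch (toy data; the column name and parameter values are made up):

import pandas as pd
import geopandas as gpd
import movingpandas as mpd
from shapely.geometry import Point

# Toy trajectory whose 'next_stop' value changes halfway through
gdf = gpd.GeoDataFrame(
    {
        'traj_id': 1,
        'next_stop': ['A', 'A', 'B', 'B'],
        'geometry': [Point(0, 0), Point(0, 0.001), Point(0, 0.002), Point(0, 0.003)],
    },
    index=pd.to_datetime(['2025-01-01 10:00', '2025-01-01 10:01',
                          '2025-01-01 10:02', '2025-01-01 10:03']),
    crs='epsg:4326',
)
tc = mpd.TrajectoryCollection(gdf, 'traj_id')

split = mpd.ValueChangeSplitter(tc).split(
    col_name='next_stop',  # split whenever the next_stop value changes
    min_length=10,         # drop fragments shorter than 10 m (the "early skip" applies here)
    n_processes=2,         # enable the new parallel processing code path
)
print(len(split))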

Example 2: Trajectools

By applying all above-mentioned speedup techniques, Trajectools is now considerably faster. For example, the following runtime reductions can be achieved by deactivating the “Add movement metrics (speed, direction)” option in the algorithm dialog:

  • Create trajectories: 62%
  • Spatiotemporal generalization (TDTR): 78%
  • Temporal generalization: 81%
  • Split trajectories at stops: 53%

I have also updated the default trajectory points output style. It now uses a graduated renderer to visualize the speed values (if they have been calculated) instead of the previously used data-defined override. This makes the style faster to customize and provides a user-friendly legend:

For more info, have a look at:

Enjoy the latest performance increases!

At the end of yesterday’s TimeGPT for mobility post, we concluded that TimeGPT’s training set probably included a copy of the popular BikeNYC timeseries dataset and that, therefore, we were not looking at a fair comparison.

Naturally, it’s hard to find mobility timeseries datasets online that haven’t been widely disseminated and therefore may have slipped past the scrapers of foundation model builders.

So I scoured the Austrian open government data portal and came up with a bike-share dataset from Vienna.

Dataset

The SharedMobility.ai dataset, published by Philipp Naderer-Puiu, covers 2019-05-05 to 2019-12-31.

Here are eight of the 120 stations in the dataset. I’ve resampled the number of available bicycles to the maximum hourly value and made a cutoff in mid-August (before a larger data collection gap and the less busy autumn and winter seasons):

Models

To benchmark TimeGPT, I computed different baseline predictions. I used statsforecast’s HistoricAverage, SeasonalNaive, and AutoARIMA models and computed predictions for horizons of 1 hour, 12 hours, and 24 hours.
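
For readers who want to reproduce the setup, the baseline computation looks roughly like this (a sketch with synthetic stand-in data for a single station; the real code is linked below):

import numpy as np
import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import HistoricAverage, SeasonalNaive, AutoARIMA

# Synthetic stand-in for one station's hourly bike availability
rng = pd.date_range('2019-05-05', periods=24 * 30, freq='H')
y = 10 + 5 * np.sin(2 * np.pi * rng.hour / 24) + np.random.default_rng(0).normal(0, 1, len(rng))
df = pd.DataFrame({'unique_id': 'station_1', 'ds': rng, 'y': y})

sf = StatsForecast(
    models=[HistoricAverage(), SeasonalNaive(season_length=24), AutoARIMA(season_length=24)],
    freq='H',
)
forecast = sf.forecast(df=df, h=12)  # 12-hour horizon; use h=1 or h=24 for the other runs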

Here are examples of the 12-hour predictions:

We can see that HistoricAverage is pretty much a flat line at the average of the past values. A little more sophisticated, SeasonalNaive assumes that the future will be a repeat of the past (i.e. the previous day), which results in the shifted curve we can see in the above examples. Finally, there’s AutoARIMA, which seems to do a better job than the first two models but also takes much longer to compute.

For comparison, here’s TimeGPT with 12 hours horizon:

You can find the full code at https://github.com/anitagraser/ST-ResNet/blob/570d8a1af4a10c7fb2230ccb2f203307703a9038/experiment.ipynb

Results

In the following table, the best overall result is achieved by AutoARIMA which, unsurprisingly, happens at the 1-hour horizon. For the 12 and 24 hour horizons, TimeGPT achieves the best results.

| Model           | Horizon (hours) | RMSE   |
|-----------------|-----------------|--------|
| HistoricAverage | 1               | 7.0229 |
| HistoricAverage | 12              | 7.0195 |
| HistoricAverage | 24              | 7.0426 |
| SeasonalNaive   | 1               | 7.8703 |
| SeasonalNaive   | 12              | 7.7317 |
| SeasonalNaive   | 24              | 7.8703 |
| AutoARIMA       | 1               | 2.2639 |
| AutoARIMA       | 12              | 5.1505 |
| AutoARIMA       | 24              | 6.3881 |
| TimeGPT         | 1               | 2.3193 |
| TimeGPT         | 12              | 4.8383 |
| TimeGPT         | 24              | 5.6671 |

AutoARIMA and TimeGPT are pretty closely tied. Interestingly, the SeasonalNaive model performs even worse than the very simple HistoricAverage, which is an indication of the irregular nature of the observed phenomenon (probably caused by irregular restocking of stations, depending on the system operator’s decisions).

Conclusion & next steps

Overall, TimeGPT struggles much more with the longer horizons than in the previous BikeNYC experiment. The error more than doubled between the 1-hour and the 12-hour predictions. TimeGPT’s prediction quality barely out-competes AutoARIMA’s for 12 and 24 hours.

I’m tempted to test AutoARIMA for the BikeNYC dataset to further complete this picture.

Of course, the SharedMobility.ai dataset has been online for a while, so I cannot be completely sure that we now have a fair comparison. For that, we would need a completely new / previously unpublished dataset.


For a more thorough write-up, head over to Graser, A. (2025). Timeseries Foundation Models for Mobility: A Benchmark Comparison with Traditional and Deep Learning Models. arXiv preprint arXiv:2504.03725.

tl;dr: Maybe. The preliminary results certainly are impressive.

Introduction

Crowd and flow predictions have been very popular topics in mobility data science. Traditional forecasting methods rely on classic machine learning models like ARIMA, later followed by deep learning approaches such as ST-ResNet.

More recently, foundation models for timeseries forecasting, such as TimeGPT, Chronos, and Lag-Llama, have been introduced. A key advantage of these models is their ability to generate zero-shot predictions — meaning that they can be applied directly to new tasks without requiring retraining for each scenario.

In this post, I want to compare TimeGPT’s performance against traditional approaches for predicting city-wide crowd flows.

Experiment setup

The experiment builds on the paper “Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction” by Zhang et al. (2017). The original repo referenced on the homepage does not exist anymore. Therefore, I forked https://github.com/topazape/ST-ResNet as a starting point.

The goals of this experiment are to:

  1. Get an impression of how TimeGPT predicts mobility timeseries.
  2. Compare TimeGPT to classic machine learning (ML) and deep learning (DL) models.
  3. Understand how different forecasting horizons impact predictive accuracy.

The paper presents results for two datasets (TaxiBJ and BikeNYC). The following experiment only covers BikeNYC.

You can find the full notebook at https://github.com/anitagraser/ST-ResNet/blob/079948bfbab2d512b71abc0b1aa4b09b9de94f35/experiment.ipynb

First attempt

In the first version, I applied TimeGPT’s historical forecast function to generate flow predictions. However, there was an issue: the built-in historical forecast function ignores the horizon parameter, making it impossible to control the horizon and thus to make a fair comparison.

Refinements

In the second version, I therefore added backtesting with customizable forecast horizon to evaluate TimeGPT’s forecasts over multiple time windows.
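
Under the hood, this backtesting boils down to repeatedly calling the regular forecast endpoint with a sliding cutoff. A simplified sketch (not the exact notebook code; synthetic stand-in data, and it assumes a Nixtla API key is set in the environment):

import numpy as np
import pandas as pd
from nixtla import NixtlaClient

# Synthetic stand-in for a single inflow timeseries in Nixtla's long format
rng = pd.date_range('2014-04-01', periods=24 * 21, freq='H')
y = 100 + 50 * np.sin(2 * np.pi * rng.hour / 24) + np.random.default_rng(0).normal(0, 5, len(rng))
df = pd.DataFrame({'unique_id': 'region_1', 'ds': rng, 'y': y})

client = NixtlaClient()  # assumes NIXTLA_API_KEY is set in the environment
HORIZON = 24             # hours; 1 and 12 were also tested

forecasts = []
cutoffs = df['ds'].iloc[24 * 14::HORIZON]  # keep two weeks of history, then step by HORIZON
for cutoff in cutoffs:
    history = df[df['ds'] <= cutoff]
    fcst = client.forecast(df=history, h=HORIZON, time_col='ds', target_col='y')
    forecasts.append(fcst)

backtest = pd.concat(forecasts, ignore_index=True)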

To reproduce the original experiments as truthfully as possible, both inflows and outflows were included in the experiments.

I ran TimeGPT for different forecasting horizons: 1 hour, 12 hours, and 24 hours. (In the original paper (Zhang et al. 2017), only one-step-ahead (1 hour) forecasting is performed but it is interesting to explore the effects of the additional challenge resulting from longer forecast horizons.) Here’s an example of the 24-hour forecast:

The predictions pick up on the overall daily patterns but the peaks are certainly hit-and-miss.

For comparison, here are some results for the easier 1-hour forecast:

Not bad. Let’s run the numbers! (And by that I mean: let’s measure the error.)

Results 

The original paper provides results (RMSE, i.e. smaller is better) for multiple traditional ML models and DL models. Adding our experiments to these results, we get:

| Model                | RMSE  |
|----------------------|-------|
| ARIMA                | 10.56 |
| SARIMA               | 10.07 |
| VAR                  | 9.92  |
| DeepST-C             | 8.39  |
| DeepST-CP            | 7.64  |
| DeepST-CPT           | 7.56  |
| DeepST-CPTM          | 7.43  |
| ST-ResNet            | 6.33  |
| TimeGPT (horizon=1)  | 5.70  |
| TimeGPT (horizon=12) | 7.62  |
| TimeGPT (horizon=24) | 8.93  |

Key takeaways

  • TimeGPT with a 1 hour horizon outperforms all ML and DL models.
  • For longer horizons, TimeGPT’s accuracy declines but remains competitive with DL approaches.
  • TimeGPT’s pre-trained nature means that we can immediately make predictions without any prior training. 

Conclusion & next steps

These preliminary results suggest that timeseries foundation models, such as TimeGPT, are a promising tool. However, a key limitation of the presented experiment remains: since the BikeNYC data has been public for a long time, it is quite possible that TimeGPT has seen this dataset during its training. This raises questions about how well it generalizes to truly unseen datasets. To address this, the logical next step would be to test TimeGPT and other foundation models on an entirely new dataset to better evaluate their robustness.

We also know that DL model performance can be improved by providing more training data. It is therefore reasonable to assume that specialized DL models will outperform foundation models once they are trained with enough data. But in the absence of large-enough training datasets, foundation models can be an option.

In recent literature, we also find more specific foundation models for spatiotemporal prediction, such as UrbanGPT https://arxiv.org/abs/2403.00813, UniST https://arxiv.org/abs/2402.11838, and UrbanDiT https://arxiv.org/pdf/2411.12164. However, as far as I can tell, none of them have published the model weights.

If you want to join forces, e.g. add more datasets or test other timeseries foundation models, don’t hesitate to reach out.


Part 2: The quest for a fair TimeGPT benchmark

In today’s post, we (that is, Gaspard Merten from Université Libre de Bruxelles and yours truly) are going to dive deep into how to analyze public transport data, using both schedule and real-time information. This collaboration has been made possible by the EMERALDS project.

Previously, I already shared news about the GTFS algorithms for Trajectools that add GTFS preprocessing tools (incl. route, segment, and stop layer extraction) to the QGIS Processing toolbox.

Today, we’ll discuss how to handle realtime GTFS data and how we approach analytics that combine both data sources.

About Realtime GTFS 

Many of us have come to rely on real-time public transport updates in apps like Google Maps. These apps are powered by standardized data formats that ensure different systems can communicate. In 2005, Google first introduced GTFS, a format designed to organize transit schedules, stop locations, and other static transit information. Then, in 2011, they introduced GTFS Realtime (GTFS-RT), which added the capability to include live updates on vehicle positions, delays, speeds, and much more.

However, as the name suggests, GTFS Realtime is all about live data. This means that while GTFS-RT APIs are useful for providing real-time insights,  they don’t hold historical data for analytics. Moreover, most transit agencies don’t keep past GTFS-RT records, and even fewer make them available to the public. This can be a significant challenge for anyone looking to analyze past trends and extract valuable insights from the data. For this reason, we had to implement our own solution to efficiently archive GTFS-RT files while making sure the files could be queried easily.

There are two main challenges in the implementation of such a solution:

  • Data Volume: While individual GTFS-RT files are relatively small (typically ranging from 50 KB to 500 KB, depending on the public transport network size), the challenge lies in ingestion frequency. With an average file size of 100 KB and updates every 5 seconds, a full day’s worth of data quickly scales up to 1.728 GB.
  • Data Usability: GTFS-RT is a deeply nested format based on Protobuf, making direct conversion into a more accessible structure like a DataFrame difficult. Efficiently unnesting the data without losing critical details would significantly improve usability and streamline analysis.

Parquet to the Rescue

Storing and analyzing real-time transit data efficiently isn’t just about saving space—it’s about making the data easy to work with. Luckily, modern data formats have come a long way, allowing us to store massive amounts of data while keeping retrieval and analytics processing fast. One of the best tools for the job is Apache Parquet, a columnar storage format originally designed for Hadoop but now widely adopted in data science. With built-in support in libraries like Polars and Pandas, it’s become a go-to choice for handling large datasets efficiently. Moreover, Parquet can be converted to GeoParquet for smoother integration with GIS such as GeoPandas.

What makes Parquet particularly well-suited for GTFS Realtime data is the way it compresses columnar data. It leverages multiple compression algorithms and encodings, significantly reducing file sizes while keeping access speeds high. However, to get the most out of Parquet’s compression, we need to be smart about how we structure our data. Simply converting each GTFS-RT file into its own Parquet file might give us around 60% compression, which is decent. But if we group all GTFS-RT records for an entire hour into a single file, we can push that number up to 95%. The reason? A lot of transit data—like trip IDs and stop locations—doesn’t change much within an hour, while other values, such as coordinates, often share common elements. By organizing data in larger batches, we allow Parquet’s compression algorithms to work their magic, drastically reducing storage needs. And with a smaller disk footprint, retrieval is faster, making the entire analytics pipeline more efficient.

One more challenge to tackle is the structure of the data itself. GTFS-RT files tend to be highly nested, which isn’t an issue for Parquet but can be problematic for most data science tools. While Parquet technically supports nested structures, many analytical frameworks don’t handle them well. To fix this, we apply a lightweight preprocessing step to “unnest” the data. In the original GTFS-RT format, the vehicle position feed is deeply nested, making it difficult to work with. But once unnesting is applied, the structure becomes flat, with clear column names derived from the original hierarchy. This makes it easy to convert the data into a table format, ensuring smooth integration with tools commonly used by data scientists.
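
To make this concrete, here is a minimal sketch of what such an unnesting step can look like with pandas.json_normalize (the entity below is a simplified, made-up vehicle position record, not the actual pipeline code):

import pandas as pd

# Simplified, hypothetical vehicle position entity after ProtoBuf -> dict conversion
entity = {
    "id": "vehicle_123",
    "vehicle": {
        "trip": {"trip_id": "AUTO3-18-1", "route_id": "3"},
        "position": {"latitude": 56.95, "longitude": 24.11, "speed": 8.3},
        "timestamp": 1714556400,
    },
}

# json_normalize flattens the nested structure into dotted column names,
# e.g. vehicle.trip.trip_id, vehicle.position.latitude, ...
flat = pd.json_normalize(entity, sep=".")
print(flat.columns.tolist())

# The flat table can then be written as Parquet (requires pyarrow or fastparquet)
flat.to_parquet("vehicle_positions.parquet")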

The GTFS-RT Pipelines

With this in mind, let’s walk through the two pipelines we built to store and retrieve GTFS-RT data efficiently.

The entire system relies on two key pipelines that work together. The first pipeline fetches GTFS-RT data from an API every five seconds, processes it, and stores it in an S3 bucket. The second pipeline runs hourly, gathering all the individual files from the past hour, merging them into a single Parquet file, and saving it back to the bucket in a structured format. We will now take a look at each pipeline in more detail.

Pipeline 1: Fetching and Storing Data

The first step in the process is retrieving GTFS-RT data. This is done via an API, which returns files in the Protocol Buffer (ProtoBuf) format. Fortunately, Google provides libraries (such as gtfs-realtime-bindings) that make it easy to parse ProtoBuf and convert it into a more accessible format like JSON. 

Once we have the data in JSON format, we need to split it based on entity type. GTFS-RT files contain different types of data, such as TripUpdate, which provides updated arrival times for stops, and VehiclePosition, which tracks real-time locations and speeds. Not all GTFS-RT feeds contain every entity type, but TripUpdate and VehiclePosition are the most commonly used. The full list of entity types can be found in the GTFS Realtime documentation.
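
A condensed sketch of this fetch, parse, and split step (the feed URL is a placeholder; the real pipeline also handles errors and uploads the results to S3):

import requests
from google.transit import gtfs_realtime_pb2        # from the gtfs-realtime-bindings package
from google.protobuf.json_format import MessageToDict

FEED_URL = "https://example.com/gtfs-rt/feed"        # placeholder endpoint

# Fetch and parse the ProtoBuf feed
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(requests.get(FEED_URL, timeout=10).content)

# Convert to plain dicts and split by entity type
as_dict = MessageToDict(feed)
vehicle_positions = [e for e in as_dict.get("entity", []) if "vehicle" in e]
trip_updates = [e for e in as_dict.get("entity", []) if "tripUpdate" in e]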

We separate entity types because they have different schemas, making it difficult to store them in a single Parquet file. Keeping each entity type separate not only improves organization but also enhances compression efficiency. Once split, we apply the same unnesting process as described earlier, ensuring the data is structured in a way that’s easy to analyze. After that, we convert the data into a data frame and store it as a Parquet file in memory before uploading it to an S3 bucket. The files follow a structured naming convention like this:

{feed_type}/YYYY-MM-DD/hour/individual_{date-isoformat}.parquet

This format makes it easy to navigate the storage bucket manually while also ensuring seamless integration with the second pipeline.

Pipeline 2: Merging and Optimizing Storage

The second pipeline’s job is to take all the small Parquet files generated by Pipeline 1 and merge them into a single, optimized file per hour. To do this, it scans the storage bucket for the earliest unprocessed “hour folder” and begins processing from there. This design ensures that if the pipeline is temporarily interrupted, it can easily resume without skipping any data.

Once it identifies the files to merge, the pipeline loads them, assigns a proper timestamp to each record, and concatenates them into a single Parquet table. The final file is then uploaded to the S3 bucket using the following naming convention:

{feed_type}/YYYY-MM-DD/hour/HH.parquet

If any files fail to merge, they are renamed with the prefix unmerged_{date-isoformat}.parquet for manual inspection. After successfully storing the merged file, the pipeline deletes the individual files to keep storage clean and avoid unnecessary clutter.
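
Sketched locally (the real pipeline reads from and writes to S3, and the paths, file names, and timestamp convention shown here are assumptions), the merge step boils down to:

from pathlib import Path

import pandas as pd

hour_dir = Path("vehicle_positions/2024-05-01/22")  # hypothetical local "hour folder"

frames = []
for f in sorted(hour_dir.glob("individual_*.parquet")):
    df = pd.read_parquet(f)
    # attach the capture time encoded in the file name as a proper timestamp
    df["capture_time"] = pd.to_datetime(f.stem.replace("individual_", ""))
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged.to_parquet(hour_dir / "22.parquet")  # one optimized Parquet file per hour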

One critical advantage of converting GTFS-RT data into Parquet early in the process is that it prevents memory overload. If we had to merge raw GTFS-RT files instead of pre-converted Parquet files, we would likely run into memory constraints, especially on standard servers with limited RAM. This makes Parquet not just a storage solution but an enabler of efficient large-scale processing.

Ready for Analytics

In this section, we will explore how to use the GTFS-RT data for public transport analytics. Specifically, we want to compute delays, that is, the difference between the scheduled travel time and the real travel time. 

The previously created Parquet files can be loaded into QGIS as tables without geometries. To turn them into point layers, we use the “Create points layer from table” algorithm from the Processing “Vector creation” toolbox. And once we convert the Unix timestamps to datetimes (using the datetime_from_epoch function), we have a point layer that is ready for use in Trajectools.

Let’s have a look at one bus route. Bus 3 is one of the busiest routes in Riga. We apply a filter to the point layer which reveals the location of the route. 

Computing segment travel times

Computing travel times on public transport segments, i.e. between two scheduled stops, comes with a couple of challenges:

  1. The GTFS-RT location updates are provided in a rather sparse fashion with irregular reporting intervals. It is not clear that we “see” every stop that happens. 
  2. We cannot rely solely on stop detection since, sometimes, a vehicle will not come to a halt at scheduled stop locations (if nobody wants to get off or on).
  3. The stop ID, representing the next stop the vehicle will visit, is not always exact. Updates are often delayed and happen some time after passing the stop. 

Here’s an example visualization of the stop ID information of a single trip of bus 3, overlaid on top of the GTFS route and stops (in red):

To compute the desired delays, we decided to compare GTFS-RT travel times based on stop ID info with the scheduled travel times. To get the GTFS-RT travel times, we use Trajectools and create trajectories by splitting at stop ID change using the Split by value change algorithm:

Computing delays

The final step is to compute travel time differences between schedule and real time. For this, we implemented a SQL join that matches GTFS-RT trajectories with the corresponding entry in the GTFS schedule using route information and temporal information: 

The temporal information is important since the schedule accounts for different travel times during peak and off-peak hours:

This information is extracted from the GTFS schedule using the Trajectools Extract segments algorithm, if we choose the “Add scheduled speeds” option:

This will add the time windows, speeds, and runtimes per segment to the resulting segment layer: 
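
Conceptually (sketched here in pandas rather than the actual SQL, and with made-up column names), the matching logic joins on the route and segment identifiers, keeps the schedule entry whose time window contains the observed departure, and compares the runtimes:

import pandas as pd

def join_with_schedule(rt_segments: pd.DataFrame, schedule_segments: pd.DataFrame) -> pd.DataFrame:
    """Match observed GTFS-RT segment runtimes with scheduled runtimes.

    Both inputs and all column names are hypothetical simplifications:
    rt_segments:       route_id, from_stop_id, to_stop_id, start_time, observed_runtime_s
    schedule_segments: route_id, from_stop_id, to_stop_id, window_start, window_end, scheduled_runtime_s
    """
    merged = rt_segments.merge(
        schedule_segments, on=["route_id", "from_stop_id", "to_stop_id"], how="left"
    )
    # keep only the schedule entry whose time window contains the observed departure
    in_window = (merged["start_time"] >= merged["window_start"]) & (
        merged["start_time"] < merged["window_end"]
    )
    merged = merged[in_window].copy()
    merged["delay_s"] = merged["observed_runtime_s"] - merged["scheduled_runtime_s"]
    return merged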

Joining the GTFS-RT trajectories with the scheduled segment information, we compute delays for every segment and trip. For example, here are the resulting delays for trip ‘AUTO3-18-1-240501-ab-2230’: 

Red lines mark segments where time is lost compared to the schedule, while blue lines indicate that the vehicle traversed the segment faster than the schedule suggested.

What’s next

When interpreting the results, it is important to acknowledge the effects caused by the timing of the next stop ID updates in the real-time GTFS feed. Sometimes, these updates come very late and thus introduce distortions where one segment’s travel time gets too long and the other too short. 

We will continue refining the analytics and related libraries, including the QGIS Trajectools plugin, to facilitate analytics of GTFS-RT & GTFS.

After successful testing of this analytics approach in Riga, we aim to transfer it to other cities. But for this to work, public transport companies need ways to efficiently store their data and, ideally, to release them openly to allow for analysis.

The pipelines we described help keep storage needs low, which allows us to drastically reduce costs (a full year of data amounts to only a few gigabytes, which is inexpensive to keep in S3 storage). Let us know if you would be interested in an online platform where one could register a GTFS-RT feed & GTFS, which would then automatically start being archived (in exchange, the provider would only need to agree to share the archives as open data, at no cost to them).

The Trajectools repository is migrating from GitHub to Codeberg. The new home for Trajectools is:

➡️ https://codeberg.org/movingpandas/trajectools

The GitHub repo remains a writable mirror for now, but issue tracking is only active on Codeberg.

Why the move?

I am working on moving my projects to European infrastructure that better aligns with my values. Codeberg is a nonprofit and libre-friendly platform based in Germany. This will ensure that the projects are hosted on infrastructure that prioritizes user privacy and open-source ideals.

What does this mean for users?

  • No impact on functionality – Trajectools remains the same great tool for trajectory analysis, available through the recently updated QGIS Plugin Repo.
  • Development continues – I’ll continue actively maintaining and improving the project. (If you want to file feature requests, please note that the issue tracker on the GitHub mirror has been deactivated and issues should be filed on Codeberg instead.)

What does this mean for contributors?

If you’re contributing to Trajectools, simply update your remotes to the new repository. The GitHub repo continues to accept PRs and the changes are synced between GitHub and Codeberg, but I’d encourage all contributors to use Codeberg.

How to update your local repository

If you’ve already cloned the GitHub repository, you can update your remote URL with the following commands:

cd trajectools
git remote set-url origin https://codeberg.org/movingpandas/trajectools.git
git pull origin main

Interested in testing Codeberg for your projects?

Here are the instructions I followed to perform the migration and to set up the mirroring: https://codeberg.org/Recommendations/Mirror_to_Codeberg

Thanks for your support, and see you on Codeberg!

In this new release, you will find new algorithms, default output styles, and other usability improvements, in particular for working with public transport schedules in GTFS format, including:

  • Added GTFS algorithms for extracting stops, fixes #43
  • Added default output styles for GTFS stops and segments c600060
  • Added Trajectory splitting at field value changes 286fdbd
  • Added option to add selected fields to output trajectories layer, fixes #53
  • Improved UI of the split by observation gap algorithm, fixes #36

Note: To use this new version of Trajectools, please upgrade your installation of MovingPandas to >= 0.21.2, e.g. using

import pip; pip.main(['install', '--upgrade', 'movingpandas'])

or

conda install movingpandas==0.21.2

Today, I want to point out a blog post over at

https://carto.com/blog/urban-mobility-insights-with-movingpandas-carto-in-snowflake

written together with my fellow co-authors, including EMERALDS project team member Argyrios Kyrgiazos.

For the technically inclined, the highlights are the Snowflake UDFs used to process and transform the trajectory data. For example, here’s a TemporalSplitter UDF:

CREATE OR REPLACE FUNCTION CARTO_DATABASE.CARTO.TemporalSplitter(geom ARRAY, t ARRAY, mode STRING)
RETURNS ARRAY
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
PACKAGES = ('numpy','pandas', 'geopandas','movingpandas', 'shapely')
HANDLER = 'udf'
AS $$
import numpy as np
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import shapely
from shapely.geometry import shape, mapping, Point, Polygon
from shapely.validation import make_valid
from datetime import datetime, timedelta

def udf(geom, t, mode):
    # Rebuild a GeoDataFrame from the arrays of WKT geometries and timestamps
    valid_df = pd.DataFrame(geom, columns=['geometry'])
    valid_df['t'] = pd.to_datetime(t)
    valid_df['geometry'] = valid_df['geometry'].apply(lambda x: shapely.wkt.loads(x))
    gdf = gpd.GeoDataFrame(valid_df, geometry='geometry', crs='epsg:4326')
    gdf = gdf.set_index('t')
    # Create a MovingPandas trajectory and split it at temporal boundaries (mode, e.g. by day)
    traj = mpd.Trajectory(gdf, 1)
    traj_sm = mpd.TemporalSplitter(traj).split(mode=mode)
    if len(traj_sm.trajectories) > 0:
        # Return the split trajectories as rows with WKT geometries
        res = traj_sm.to_point_gdf()
        res['geometry'] = res['geometry'].apply(lambda x: shapely.wkt.dumps(x))
        return res.reset_index().values
    else:
        return []
$$;

You can find the full code here: https://github.com/anitagraser/carto-research-public/tree/master/movingpandas_carto_in_snowflake

Today marks the release of Trajectools 2.3 which brings a new set of algorithms, including trajectory generalizing, cleaning, and smoothing.

To give you a quick impression of what some of these algorithms would be useful for, this post introduces a trajectory preprocessing workflow that is quite general-purpose and can be adapted to many different datasets.

We start out with the Geolife sample dataset which you can find in the Trajectools plugin directory’s sample_data subdirectory. This small dataset includes 5908 points forming 5 trajectories, based on the trajectory_id field:

We first split our trajectories at observation gaps to ensure that there are no large temporal gaps within the trajectories. Let’s make the cut at 15 minutes:

This splits the original 5 trajectories into 11 trajectories:

When we zoom, for example, to the two trajectories in the northwestern corner, we can see that the trajectories are pretty noisy and there’s even a spike / outlier at the western end:

If we label the points with the corresponding speeds, we can see how unrealistic they are: over 300 km/h!

Let’s remove outliers over 50 km/h:

Better but not perfect:

Let’s smooth the trajectories to get rid of more of the jittering.

(You’ll need to pip/mamba install the optional stonesoup library to get access to this algorithm.)
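
If you prefer scripting over the Processing toolbox, the same preprocessing chain can be reproduced with MovingPandas directly. Here’s a sketch (the file path and column names are assumptions based on the Trajectools sample data; the parameter values mirror the steps above):

import geopandas as gpd
import movingpandas as mpd
from datetime import timedelta

# Assumed path to the Geolife sample in the Trajectools plugin directory
gdf = gpd.read_file("sample_data/geolife.gpkg")
tc = mpd.TrajectoryCollection(gdf, "trajectory_id", t="t")

# 1. Split at observation gaps larger than 15 minutes
split = mpd.ObservationGapSplitter(tc).split(gap=timedelta(minutes=15))
# 2. Remove speed outliers above 50 km/h
cleaned = mpd.OutlierCleaner(split).clean(v_max=50, units=("km", "h"))
# 3. Smooth the trajectories (requires the optional stonesoup dependency)
smooth = mpd.KalmanSmootherCV(cleaned).smooth(
    process_noise_std=0.1, measurement_noise_std=10
)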

Depending on the noise values we choose, we get more or less smoothing:

Let’s zoom out to see the whole trajectory again:

Feel free to pan around and check how our preprocessing affected the other trajectories, for example: