Finally it’s here: Jupyter notebooks inside QGIS. I don’t know about you but I’ve been hoping for someone to get around to doing this for quite a while.

Qiusheng Wu published the first version of the Notebook plugin on 26 Dec 2025. Late Christmas present?!

For the setup, there’s a handy tutorial by Hans van der Kwast and, additionally, Qiusheng published an intro video:

Development is going fast (version 0.3.0 at the time of writing) so there will be new features when you install / update the plugin compared to both the tutorial and the video.

The user interface is pretty stripped down with just a few buttons to add new code or markdown cells and to run them. And there is a neat drop-down menu with all kinds of ready-made code snippets to get you started:

For other functionalities, for example, to delete cells, you need to right-click on the cell to access the function through the context menu. And, as far as I can tell, there is currently no way to rearrange cells (moving them up or down).

I also haven’t quite understood yet what kinds of outputs are displayed and which are not because – quite often – the cell output just stays empty, even though the same code generates output on the console:

Some of the plugin settings I would have liked to experiment with, such as adjusting the font size or enabling line numbers, don’t seem to work yet. So a little more patience seems to be necessary.

I’ll definitely keep an eye on this one :)

QGIS to (Geo)Pandas – part 3

By underdark

2025-12-03

GIS, QGIS


The journey continues: QgsArrowIterator is now merged! This makes it possible to iterate over QgsFeatures as Arrow batches.

This is where we are now, quoting Dewey Dunnington:

import geopandas
from nanoarrow.c_array import allocate_c_array
import qgis
from qgis.core import QgsVectorLayer

# Create a vector layer
layer = QgsVectorLayer("tests/testdata/zonalstatistics/polys.shp", "layer_name", "ogr")
schema = qgis.core.QgsArrowIterator.inferSchema(layer)

it = qgis.core.QgsArrowIterator(layer.getFeatures())
it.setSchema(schema, 1)

c_array = allocate_c_array()
schema.exportToAddress(c_array.schema._addr())
it.nextFeatures(5, c_array._addr())

print(geopandas.GeoDataFrame.from_arrow(c_array))
#> lev3_name                                           geometry
#> 0    poly_1  MULTIPOLYGON (((100.37934 -0.96049, 100.37934 ...
#> 1    poly_2  MULTIPOLYGON (((100.37944 -0.96044, 100.37955 ...
#> 2    poly_3  MULTIPOLYGON (((100.37938 -0.96049, 100.37949 ...

print(geopandas.read_file("tests/testdata/zonalstatistics/polys.shp"))
#> lev3_name                                           geometry
#> 0    poly_1  POLYGON ((100.37934 -0.96049, 100.37934 -0.960...
#> 1    poly_2  POLYGON ((100.37944 -0.96044, 100.37955 -0.960...
#> 2    poly_3  POLYGON ((100.37938 -0.96049, 100.37949 -0.960...

Further improvements are already being planned. To quote from the ticket:

“The final state after this improvement would be a compact way for Arrow Python consumers like GeoPandas to ergonomically consume a layer. Maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object)

Or maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object.getArrowStream())

Looking forward to seeing this develop further.

QGIS to (Geo)Pandas follow-up

By underdark

2025-10-31

QGIS


The conversation around Looking for better ways to convert between QGIS VectorLayer and (Geo)DataFrame is continuing over at https://fosstodon.org/@underdarkGIS/115442614331293320

What I’ve learned so far:

QgsVectorLayer.as_geopandas() landed in QGIS master on 13 Oct 2025.
There’s also QgsVectorLayer.field_to_numpy(), which will be useful for many applications and landed on 29 Oct 2025.
QgsArrowIterator is in the works right now.

Exciting times for spatial data science tooling 🤩

Looking for better ways to convert between QGIS VectorLayer and (Geo)DataFrame

By underdark

2025-10-26

QGIS


Plugin developers who want to use (Geo)Pandas-based functionality in their plugins regularly face the challenge of converting QGIS vector layers to (Geo)DataFrames. There is currently no built-in convenience function.

In Trajectools, so far, I have been performing the conversion manually, looping through all features and taking care of tricky column types, such as datetimes and geometries:

import pandas as pd
from qgis.PyQt.QtCore import QDateTime

def df_from_layer_trajectools(layer, time_field_name="t"):
    # Original Trajectools 2.7 version
    names = [field.name() for field in layer.fields()]
    data = []
    for feature in layer.getFeatures():
        my_dict = {}
        for i, a in enumerate(feature.attributes()):
            # Convert Qt datetimes in the time field to Python datetimes
            if names[i] == time_field_name and isinstance(a, QDateTime):
                a = a.toPyDateTime()
            my_dict[names[i]] = a
        # Store point coordinates as plain x/y columns
        pt = feature.geometry().asPoint()
        my_dict["geom_x"] = pt.x()
        my_dict["geom_y"] = pt.y()
        data.append(my_dict)
    df = pd.DataFrame(data)
    return df

It works (mostly), but it’s far from fast. For the 25 million Geolife points, it takes 4 minutes:

In an attempt to speed up the conversion (and make it more robust, e.g. regarding datetime/timezone conversion and null values), I spent some time at SDSL2025 with Joris Van den Bossche trying a workaround that writes the QGIS layer to an Arrow file and then reads that file with pyogrio:

import os
import tempfile

import geopandas as gpd
import pyogrio
from qgis.core import QgsProject, QgsVectorFileWriter

def gdf_from_layer_arrow(layer):
    # SDSL2025 version
    with tempfile.TemporaryDirectory() as tmpdirname:
        path = os.path.join(tmpdirname, "data.arrow")

        # Write the layer to a temporary Arrow file
        options = QgsVectorFileWriter.SaveVectorOptions()
        options.actionOnExistingFile = QgsVectorFileWriter.CreateOrOverwriteFile
        options.layerName = 'data'
        options.driverName = "arrow"

        QgsVectorFileWriter.writeAsVectorFormatV3(
            layer, path, QgsProject.instance().transformContext(), options
        )

        # Read the file back while the temporary directory still exists
        meta, table = pyogrio.read_arrow(path)
        gdf = gpd.GeoDataFrame.from_arrow(table)

    return gdf

Not only do we get a GeoDataFrame in return, this also runs in half the time, i.e. in 2 minutes instead of 4:

Switching to this approach will require adding pyogrio to the plugin dependencies. Looks like it could be worth it.

We also discussed another alternative: It would be faster to read the vector layer data source directly, in case it is a supported file format. However, this means we’d need separate handling for other input layers.

There’s also the issue of supporting the Processing feature that allows users to run the algorithm only on the selected features because selected features are only exposed through QgsProcessingParameterFeatureSource (and not through QgsProcessingParameterVectorLayer). Maybe the Export Selected Features algorithm can cover this case but it will export an empty layer if there is no selection.

Are you aware of any other / better ways to approach this issue? Any pointers are appreciated.

Wrangling hundreds of GPS files with DuckDB, QGIS & Trajectools

By underdark

2025-10-12

Big Data, Data Mining, GIS, Movement data in GIS, QGIS, spatio-temporal data, Trajectools


The last time I preprocessed the whole GeoLife dataset, I loaded it into PostGIS. Today, I want to share a new workflow that creates a (Geo)Parquet file and that is much faster.

The dataset (GeoLife)

“This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.”

The GeoLife GPS Trajectories download contains 182 directories full of .plt files:

Basically, CSV files with a custom header:

Creating the (Geo)Parquet using DuckDB

DuckDB installation

Following the official instructions, installation is straightforward:

curl https://install.duckdb.org | sh

From there, I’ve been using the GUI which we can launch using:

duckdb -ui

The spatial extension is a DuckDB core extension, so it’s readily available. We can create a spatial db with:

ATTACH IF NOT EXISTS ':memory:' AS memory;
INSTALL spatial;
LOAD spatial;

Reading a spatial file is as simple as:

SELECT * 
FROM '/home/anita/Documents/Codeberg/trajectools/sample_data/geolife.gpkg'

thanks to the GDAL integration.

But today, we want to get a bit more involved …

DuckDB SQL magic

The issues we need to solve are:

Read all CSV files from all subdirectories
Parse the CSV, ignoring the first couple of lines, while assigning proper column names
Assign the CSV file name as the trajectory ID (because there is no ID in the original files)
Create point geometries that will work with our GeoParquet file
Create proper datetimes from the separate date and time fields

Luckily, DuckDB’s read_csv function comes with the necessary features built-in. Putting it all together:

CREATE OR REPLACE TABLE geolife AS 
SELECT 
  parse_filename(filename, true) as vehicle_id, 
  strptime(date||' '||time, '%c') as t, 
  ST_Point(lon, lat) as geometry -- do NOT use ST_MakePoint
FROM read_csv('/home/anita/Documents/Geodata/Geolife/Geolife Trajectories 1.3/Data/*/*/*.plt',
    skip=6,
    filename = true, 
    columns = {
        'lat': 'DOUBLE', 
        'lon': 'DOUBLE', 
        'ignore': 'INT', 
        'alt': 'DOUBLE', 
        'epoch': 'DOUBLE', 
        'date': 'VARCHAR',
        'time': 'VARCHAR'
    });
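
To make the individual steps of the query easier to follow, here is a rough plain-Python equivalent for a single file. It is only a sketch: the helper name parse_plt is made up for illustration, and it assumes the .plt layout used in the query above (six header lines, then lat, lon, an unused flag, altitude, a numeric timestamp, date, and time).

```python
import csv
from datetime import datetime
from pathlib import PurePosixPath

def parse_plt(text, filename):
    """Parse the content of one .plt file into a list of point records."""
    # Use the file name without extension as trajectory ID
    trajectory_id = PurePosixPath(filename).stem
    # Skip the six header lines
    lines = text.splitlines()[6:]
    records = []
    for lat, lon, _ignore, _alt, _epoch, date, time in csv.reader(lines):
        records.append({
            "vehicle_id": trajectory_id,
            # Combine the separate date and time fields into one datetime
            "t": datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
            "lon": float(lon),
            "lat": float(lat),
        })
    return records

# Sample content following the assumed layout
sample = "\n".join(
    ["header"] * 6
    + [
        "39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04",
        "39.984683,116.31845,0,492,39744.1202546296,2008-10-23,02:53:10",
    ]
)
points = parse_plt(sample, "Data/000/Trajectory/20081023025304.plt")
```

DuckDB does all of this for every matching file in one statement, which is a big part of why the SQL version is so compact.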

It’s blazingly fast:

I haven’t tested reading directly from ZIP archives yet, but there seems to be a community extension (zipfs) for this exact purpose.

Ready to QGIS

GeoParquet files can be drag-n-dropped into QGIS:

I’m running QGIS 3.42.1-Münster from conda-forge on Linux Mint.

Yes, it takes a while to render all 25 million points … But you know what? It gets really snappy once we zoom in closer, e.g. to the situation in Germany:

Let’s have a closer look at what’s going on here.

Trajectools time

Selecting the 9,438 points in this extent, let’s compute movement metrics (speed & direction) and create trajectory lines:

Looks like we have some high-speed sections in there (with those red > 100 km/h streaks):

When we zoom in to Darmstadt and enable the trajectories layer, we can see each individual trip. Looks like car trips on the highway and walks through the city:

That looks like quite the long round trip:

Let’s see where they might have stopped to have a break:

If I had to guess, I’d say they stayed at the Best Western:

Conclusion

DuckDB has been great for this ETL workflow. I didn’t use much of its geospatial capabilities here but I was pleasantly surprised how smooth the GeoParquet creation process has been. Geometries are handled without any special magic and are recognized by QGIS. Same with the timestamps. All ready for more heavy spatiotemporal analysis with Trajectools.

If you haven’t tried DuckDB or GeoParquet yet, give it a try, particularly if you’re collaborating with data scientists from other domains and want to exchange data.

QGIS User Conf 2025 videos have landed!

By underdark

2025-06-25

Mobility Data Science, Movement data in GIS, QGIS, spatio-temporal data, Trajectools


The QGISUC2025 team has done an awesome job recording and editing the conference presentations. All “presentation” type talks where the presenter has agreed to publication are now available in a dedicated list on the QGIS Youtube channel.

I also had the pleasure of presenting our Trajectools plugin and you can see this talk here:

Thank you to all the organizers, speakers, and participants for the great time!

Speed up your analytics with the new MovingPandas 0.22 and Trajectools 2.6

By underdark

2025-05-17

GIS, Mobility Data Science, Movement data in GIS, MovingPandas, QGIS, Trajectools


The latest releases of MovingPandas and Trajectools come with many “under the hood” changes that aim to make your movement analytics faster:

Instead of immediately creating a GeoPandas GeoDataFrame and populating the geometry column with Point objects, MovingPandas now has “lazy geometry column creation” that holds off on this operation until / if the geometries are actually needed. This way, for many operations, no geometry objects have to be generated at all.
MovingPandas TrajectorySplitters now support parallel processing and Trajectools uses parallel processing whenever available (e.g. for adding speed & direction metrics, detecting stops, splitting trajectories).
When a minimum length is specified for trajectories, MovingPandas now avoids computing the total trajectory length and, instead, immediately stops once the threshold value has been reached (“early skip”).
Trajectools now offers the option to skip computation of movement metrics (speed & direction). This way, we can skip unnecessary computations and leverage the lazy geometry column creation, wherever applicable.
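
To illustrate the “early skip” idea, here is a minimal sketch. It is not the MovingPandas implementation: the function name is_longer_than is hypothetical, and it assumes planar coordinates where Euclidean distance is good enough.

```python
from math import hypot

def is_longer_than(points, min_length):
    """Return True as soon as the accumulated path length reaches min_length."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        total += hypot(x2 - x1, y2 - y1)
        if total >= min_length:
            return True  # early exit: remaining segments are never measured
    return False

track = [(0, 0), (3, 4), (6, 8), (9, 12)]  # three segments of length 5
long_enough = is_longer_than(track, 10)
```

For long trajectories and small thresholds, the loop stops after the first few segments instead of measuring the whole track.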

Let’s have a look at some example performance measurements!

Example 1: MovingPandas ValueChangeSplitter

The ValueChangeSplitter splits trajectories when it detects a value change in the specified column. This is useful, for example, to split up public transport trajectories that contain a “next_stop” column.
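
The basic idea of splitting at value changes can be sketched in a few lines of plain Python. This is a simplified illustration, not the MovingPandas code: the helper split_on_value_change and the toy records are made up, and details such as the minimum length check or how boundary points are handled are left out.

```python
from itertools import groupby

def split_on_value_change(records, column):
    """Group consecutive records that share the same value in `column`."""
    return [list(group) for _, group in groupby(records, key=lambda r: r[column])]

records = [
    {"t": 0, "next_stop": "A"},
    {"t": 1, "next_stop": "A"},
    {"t": 2, "next_stop": "B"},
    {"t": 3, "next_stop": "A"},
]
segments = split_on_value_change(records, "next_stop")
```

Note that a value that reappears later (here “A”) starts a new segment, because only consecutive records are grouped.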

The following graph shows ValueChangeSplitter runtimes for different minimum trajectory length settings (from 0 to 1km, 100km, and 10,000km):

We see that the new, lazy geometry column initialization outperforms the original code in all cases (e.g. 57% runtime reduction for 1km), except for the worst-case scenario, when the original implementation discards all trajectories as too short right from the start. (For most use cases, min_length will be set to rather small values to avoid creation of undesired short trajectory fragments, similar to sliver polygons in classic geometry operations.)

Additionally, we can engage multiprocessing by setting the n_processes parameter, e.g. to the number of CPUs to achieve further speedup:

Example 2: Trajectools

By applying all above-mentioned speedup techniques, Trajectools is now considerably faster. For example, the following runtime reductions can be achieved by deactivating the “Add movement metrics (speed, direction)” option in the algorithm dialog:

Create trajectories: 62%
Spatiotemporal generalization (TDTR): 78%
Temporal generalization: 81%
Split trajectories at stops: 53%

I have also updated the default trajectory points output style. It now uses a graduated renderer to visualize the speed values (if they have been calculated) instead of the previously used data-defined override. This makes the style faster to customize and provides a user-friendly legend:

For more info, have a look at:

Enjoy the latest performance increases!

The quest for a fair TimeGPT benchmark

By underdark

2025-03-29

AI, GIS, Mobility Data Science

At the end of yesterday’s TimeGPT for mobility post, we concluded that TimeGPT’s training set probably included a copy of the popular BikeNYC timeseries dataset and that, therefore, we were not looking at a fair comparison.

Naturally, it’s hard to find mobility timeseries datasets online that haven’t been widely disseminated and therefore may have slipped past the scrapers of foundation model builders.

So I scoured the Austrian open government data portal and came up with a bike-share dataset from Vienna.

Dataset

SharedMobility.ai dataset published by Philipp Naderer-Puiu, covering 2019-05-05 to 2019-12-31.

Here are eight of the 120 stations in the dataset. I’ve resampled the number of available bicycles to the maximum hourly value and made a cutoff mid August (before a larger data collection gap and the less busy autumn and winter seasons):

Models

To benchmark TimeGPT, I computed different baseline predictions. I used statsforecast’s HistoricAverage, SeasonalNaive, and AutoARIMA models and computed predictions for horizons of 1 hour, 12 hours, and 24 hours.

Here are examples of the 12-hour predictions:

We can see how Historic Average is pretty much a straight line of the average past value. A little more sophisticated, SeasonalNaive assumes that the future will be a repeat of the past (i.e. the previous day), which results in the shifted curve we can see in the above examples. Finally, there’s AutoARIMA which seems to do a better job than the first two models but also takes much longer to compute.
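
To make the two simplest baselines and the error measure concrete, here is a minimal plain-Python sketch. These are not the statsforecast implementations; the function names and the toy series are made up, and the series is assumed to cover at least one full season.

```python
from math import sqrt

def historic_average(history, horizon):
    """Forecast the mean of all past values for every future step."""
    mean = sum(history) / len(history)
    return [mean] * horizon

def seasonal_naive(history, horizon, season_length=24):
    """Repeat the last observed season (by default, the previous 24 hours)."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

def rmse(actual, predicted):
    """Root mean squared error; smaller is better."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

history = [2, 4, 6, 8] * 2  # toy series with a season length of 4
actual = [2, 4, 6, 8]       # what actually happens next
avg_forecast = historic_average(history, horizon=4)
naive_forecast = seasonal_naive(history, horizon=4, season_length=4)
```

On this perfectly periodic toy series, the seasonal naive forecast is exact while the historic average is a flat line through the middle.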

For comparison, here’s TimeGPT with 12 hours horizon:

You can find the full code in https://github.com/anitagraser/ST-ResNet/blob/570d8a1af4a10c7fb2230ccb2f203307703a9038/experiment.ipynb

Results

In the following table, the best model is AutoARIMA with a 1 hour horizon (RMSE 2.2639). Unsurprisingly, this best model is for the 1 hour horizon. For 12 and 24 hours, TimeGPT achieves the lowest RMSE.

Model	Horizon	RMSE
HistoricAverage	1	7.0229
HistoricAverage	12	7.0195
HistoricAverage	24	7.0426
SeasonalNaive	1	7.8703
SeasonalNaive	12	7.7317
SeasonalNaive	24	7.8703
AutoARIMA	1	2.2639
AutoARIMA	12	5.1505
AutoARIMA	24	6.3881
TimeGPT	1	2.3193
TimeGPT	12	4.8383
TimeGPT	24	5.6671

AutoARIMA and TimeGPT are pretty closely tied. Interestingly, the SeasonalNaive model performs even worse than the very simple HistoricAverage, which is an indication of the irregular nature of the observed phenomenon (probably caused by irregular restocking of stations, depending on the system operator’s decisions).

Conclusion & next steps

Overall, TimeGPT struggles much more with the longer horizons than in the previous BikeNYC experiment. The error more than doubled between the 1 hour and 12 hours prediction. TimeGPT’s prediction quality barely out-competes AutoARIMA’s for 12 and 24 hours.

I’m tempted to test AutoARIMA for the BikeNYC dataset to further complete this picture.

Of course, the SharedMobility.ai dataset has been online for a while, so I cannot be completely sure that we now have a fair comparison. For that, we would need a completely new / previously unpublished dataset.

For a more thorough write-up, head along to Graser, A. (2025). Timeseries Foundation Models for Mobility: A Benchmark Comparison with Traditional and Deep Learning Models. arXiv preprint arXiv:2504.03725.

TimeGPT for mobility: Can foundation models outperform classic machine learning models for mobility predictions?

By underdark

2025-03-28

AI, Mobility Data Science

tldr; Maybe. Preliminary results certainly are impressive.

Introduction

Crowd and flow predictions have been very popular topics in mobility data science. Traditional forecasting methods rely on classic machine learning models like ARIMA, later followed by deep learning approaches such as ST-ResNet.

More recently, foundation models for timeseries forecasting, such as TimeGPT, Chronos, and Lag-Llama, have been introduced. A key advantage of these models is their ability to generate zero-shot predictions, meaning that they can be applied directly to new tasks without requiring retraining for each scenario.

In this post, I want to compare TimeGPT’s performance against traditional approaches for predicting city-wide crowd flows.

Experiment setup

The experiment builds on the paper “Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction” by Zhang et al. (2017). The original repo referenced on the homepage does not exist anymore. Therefore, I forked: https://github.com/topazape/ST-ResNet as a starting point.

The goals of this experiment are to:

Get an impression of how TimeGPT predicts mobility timeseries.
Compare TimeGPT to classic machine learning (ML) and deep learning (DL) models.
Understand how different forecasting horizons impact predictive accuracy.

The paper presents results for two datasets (TaxiBJ and BikeNYC). The following experiment only covers BikeNYC.

You can find the full notebook at https://github.com/anitagraser/ST-ResNet/blob/079948bfbab2d512b71abc0b1aa4b09b9de94f35/experiment.ipynb

First attempt

In the first version, I applied TimeGPT’s historical forecast function to generate flow predictions. However, there was an issue: the built-in historic forecast function ignores the horizon parameter, thus making it impossible to control the horizon and make a fair comparison.

Refinements

In the second version, I therefore added backtesting with customizable forecast horizon to evaluate TimeGPT’s forecasts over multiple time windows.
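
As a rough illustration of what backtesting with a fixed horizon means, the following sketch generates evaluation windows at the end of a series. It is a generic rolling-origin scheme under my own assumptions (non-overlapping windows, hypothetical function name backtest_windows), not the code from the notebook.

```python
def backtest_windows(n_obs, horizon, n_windows):
    """Yield (train_end, test_end) index pairs for rolling-origin evaluation.

    Each window trains on observations [0, train_end) and is scored on
    [train_end, test_end). Windows do not overlap and end at the series end.
    """
    for i in range(n_windows, 0, -1):
        train_end = n_obs - i * horizon
        if train_end <= 0:
            continue  # not enough history for this window
        yield train_end, train_end + horizon

windows = list(backtest_windows(n_obs=100, horizon=12, n_windows=3))
```

Each forecast is made from the data before train_end only, so the error for every window reflects a true 12-step-ahead prediction.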

To reproduce the original experiments as truthfully as possible, both inflows and outflows were included in the experiments.

I ran TimeGPT for different forecasting horizons: 1 hour, 12 hours, and 24 hours. (In the original paper (Zhang et al. 2017), only one-step-ahead (1 hour) forecasting is performed but it is interesting to explore the effects of the additional challenge resulting from longer forecast horizons.) Here’s an example of the 24-hour forecast:

The predictions pick up on the overall daily patterns but the peaks are certainly hit-and-miss.

For comparison, here are some results for the easier 1-hour forecast:

Not bad. Let’s run the numbers! (And by that I mean: let’s measure the error.)

Results

The original paper provides results (RMSE, i.e. smaller is better) for multiple traditional ML models and DL models. Adding our experiments to these results, we get:

Model	RMSE
ARIMA	10.56
SARIMA	10.07
VAR	9.92
DeepST-C	8.39
DeepST-CP	7.64
DeepST-CPT	7.56
DeepST-CPTM	7.43
ST-ResNet	6.33
TimeGPT (horizon=1)	5.70
TimeGPT (horizon=12)	7.62
TimeGPT (horizon=24)	8.93

Key takeaways

TimeGPT with a 1 hour horizon outperforms all ML and DL models.
For longer horizons, TimeGPT’s accuracy declines but remains competitive with DL approaches.
TimeGPT’s pre-trained nature means that we can immediately make predictions without any prior training.

Conclusion & next steps

These preliminary results suggest that timeseries foundation models, such as TimeGPT, are a promising tool. However, a key limitation of the presented experiment remains: since BikeNYC data has been public for a long time, it is well possible that TimeGPT has seen this dataset during its training. This raises questions about how well it generalizes to truly unseen datasets. To address this, the logical next step would be to test TimeGPT and other foundation models on an entirely new dataset to better evaluate their robustness.

We also know that DL model performance can be improved by providing more training data. It is therefore reasonable to assume that specialized DL models will outperform foundation models once they are trained with enough data. But in the absence of large-enough training datasets, foundation models can be an option.

In recent literature, we also find more specific foundation models for spatiotemporal prediction, such as UrbanGPT https://arxiv.org/abs/2403.00813, UniST https://arxiv.org/abs/2402.11838, and UrbanDiT https://arxiv.org/pdf/2411.12164. However, as far as I can tell, none of them have published the model weights.

If you want to join forces, e.g. add more datasets or test other timeseries foundation models, don’t hesitate to reach out.

Part 2: The quest for a fair TimeGPT benchmark

Analyzing GTFS Realtime Data for Public Transport Insights

By underdark

2025-03-10

GIS, Movement data in GIS, Trajectools


In today’s post, we (that is, Gaspard Merten from Université libre de Bruxelles and yours truly) are going to dive deep into how to analyze public transport data, using both schedule and real time information. This collaboration has been made possible by the EMERALDS project.

Previously, I already shared news about GTFS algorithms for Trajectools that add GTFS preprocessing tools (incl. route, segment, and stop layer extraction) to the QGIS Processing toolbox.

Today, we’ll discuss the aspect of handling realtime GTFS data and how we approach analytics that combine both data sources.

About Realtime GTFS

Many of us have come to rely on real-time public transport updates in apps like Google Maps. These apps are powered by standardized data formats that ensure different systems can communicate. Google first introduced GTFS in 2005, a format designed to organize transit schedules, stop locations, and other static transit information. Then, in 2011, they introduced GTFS Realtime (GTFS-RT), which added the capability to include live updates on vehicle positions, delays, speeds, and much more.

However, as the name suggests, GTFS Realtime is all about live data. This means that while GTFS-RT APIs are useful for providing real-time insights, they don’t hold historical data for analytics. Moreover, most transit agencies don’t keep past GTFS-RT records, and even fewer make them available to the public. This can be a significant challenge for anyone looking to analyze past trends and extract valuable insights from the data. For this reason, we had to implement our own solution to efficiently archive GTFS-RT files while making sure the files could be queried easily.

There are two main challenges in the implementation of such a solution:

Data Volume: While individual GTFS-RT files are relatively small—typically ranging from 50KB to 500KB depending on the public transport network size—the challenge lies in ingestion frequency. With an average file size of 100KB and updates every 5 seconds, a full day’s worth of data quickly scales up to 1.728GB.
Data Usability: GTFS-RT is a deeply nested format based on Protobuf, making direct conversion into a more accessible structure like a DataFrame difficult. Efficiently unnesting the data without losing critical details would significantly improve usability and streamline analysis.
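
The data volume figure above is easy to double-check with a quick back-of-the-envelope calculation (using decimal units, i.e. 1 GB = 1,000,000 KB):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_volume_gb(file_size_kb, interval_s):
    """Raw daily volume in GB (decimal units: 1 GB = 1,000,000 KB)."""
    files_per_day = SECONDS_PER_DAY // interval_s
    return files_per_day * file_size_kb / 1_000_000

files_per_day = SECONDS_PER_DAY // 5  # one file every 5 seconds
volume = daily_volume_gb(file_size_kb=100, interval_s=5)
```

That is 17,280 files per day for a single feed, adding up to 1.728 GB.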

Parquet to the Rescue

Storing and analyzing real-time transit data efficiently isn’t just about saving space—it’s about making the data easy to work with. Luckily, modern data formats have come a long way, allowing us to store massive amounts of data while keeping retrieval and analytics processing fast. One of the best tools for the job is Apache Parquet, a columnar storage format originally designed for Hadoop but now widely adopted in data science. With built-in support in libraries like Polars and Pandas, it’s become a go-to choice for handling large datasets efficiently. Moreover, Parquet can be converted to GeoParquet for smoother integration with GIS such as GeoPandas.

What makes Parquet particularly well-suited for GTFS Realtime data is the way it compresses columnar data. It leverages multiple compression algorithms and encodings, significantly reducing file sizes while keeping access speeds high. However, to get the most out of Parquet’s compression, we need to be smart about how we structure our data. Simply converting each GTFS-RT file into its own Parquet file might give us around 60% compression, which is decent. But if we group all GTFS-RT records for an entire hour into a single file, we can push that number up to 95%. The reason? A lot of transit data—like trip IDs and stop locations—doesn’t change much within an hour, while other values, such as coordinates, often share common elements. By organizing data in larger batches, we allow Parquet’s compression algorithms to work their magic, drastically reducing storage needs. And with a smaller disk footprint, retrieval is faster, making the entire analytics pipeline more efficient.
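
The effect of batching can be illustrated without Parquet at all. The following sketch uses zlib as a stand-in for Parquet’s encodings and made-up vehicle position records, so the resulting numbers are not comparable to the 60% and 95% mentioned above. It only shows the principle: a compressor that sees many similar records at once can exploit the repetition between them.

```python
import json
import zlib

# Hypothetical vehicle position records: mostly repeated values, as in a real feed
records = [
    {"trip_id": "trip-42", "route_id": "bus-3", "lat": 56.95 + i * 1e-4,
     "lon": 24.10 + i * 1e-4, "timestamp": 1714586400 + 5 * i}
    for i in range(720)  # one hour of updates at a 5 second interval
]

raw = [json.dumps(r).encode() for r in records]
raw_size = sum(len(chunk) for chunk in raw)

# Compress every record on its own vs. all records of the hour together
individually = sum(len(zlib.compress(chunk)) for chunk in raw)
batched = len(zlib.compress(b"\n".join(raw)))
```

Comparing individually and batched against raw_size shows how much of the saving comes from grouping alone.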

One more challenge to tackle is the structure of the data itself. GTFS-RT files tend to be highly nested, which isn’t an issue for Parquet but can be problematic for most data science tools. While Parquet technically supports nested structures, many analytical frameworks don’t handle them well. To fix this, we apply a lightweight preprocessing step to “unnest” the data. In the original GTFS-RT format, the vehicle position feed is deeply nested, making it difficult to work with. But once unnesting is applied, the structure becomes flat, with clear column names derived from the original hierarchy. This makes it easy to convert the data into a table format, ensuring smooth integration with tools commonly used by data scientists.
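
A minimal sketch of such an unnesting step could look like this. It is an assumption about the general approach rather than our exact implementation: the function name unnest and the sample entity are made up, and repeated fields (lists) are not handled.

```python
def unnest(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(unnest(value, name, sep))
        else:
            flat[name] = value
    return flat

# Hypothetical entity, loosely modelled on a GTFS-RT vehicle position
entity = {
    "id": "1",
    "vehicle": {
        "trip": {"trip_id": "t1", "route_id": "3"},
        "position": {"latitude": 56.95, "longitude": 24.1},
        "timestamp": 1714586400,
    },
}
row = unnest(entity)
```

The resulting row has flat column names such as vehicle_trip_trip_id and vehicle_position_latitude, which map directly to DataFrame columns.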

The GTFS-RT Pipelines

With this in mind, let’s walk through the two pipelines we built to store and retrieve GTFS-RT data efficiently.

The entire system relies on two key pipelines that work together. The first pipeline fetches GTFS-RT data from an API every five seconds, processes it, and stores it in an S3 bucket. The second pipeline runs hourly, gathering all the individual files from the past hour, merging them into a single Parquet file, and saving it back to the bucket in a structured format. We will now take a look at each pipeline in more detail.

Pipeline 1: Fetching and Storing Data

The first step in the process is retrieving GTFS-RT data. This is done via an API, which returns files in the Protocol Buffer (ProtoBuf) format. Fortunately, Google provides libraries (such as gtfs-realtime-bindings) that make it easy to parse ProtoBuf and convert it into a more accessible format like JSON.

Once we have the data in JSON format, we need to split it based on entity type. GTFS-RT files contain different types of data, such as TripUpdate, which provides updated arrival times for stops, and VehiclePosition, which tracks real-time locations and speeds. Not all GTFS-RT feeds contain every entity type, but TripUpdate and VehiclePosition are the most commonly used. The full list of entity types can be found in the GTFS Realtime documentation.

We separate entity types because they have different schemas, making it difficult to store them in a single Parquet file. Keeping each entity type separate not only improves organization but also enhances compression efficiency. Once split, we apply the same unnesting process as described earlier, ensuring the data is structured in a way that’s easy to analyze. After that, we convert the data into a data frame and store it as a Parquet file in memory before uploading it to an S3 bucket. The files follow a structured naming convention like this:

{feed_type}/YYYY-MM-DD/hour/individual_{date-isoformat}.parquet

This format makes it easy to navigate the storage bucket manually while also ensuring seamless integration with the second pipeline.
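
As a sketch, a key following this convention could be built as shown below. The feed type name, the two-digit hour folder, and the use of UTC timestamps are assumptions made for this example.

```python
from datetime import datetime, timezone

def individual_key(feed_type, fetched_at):
    """Build the object key for one fetched file (hour folder assumed to be HH)."""
    return (
        f"{feed_type}/{fetched_at:%Y-%m-%d}/{fetched_at:%H}/"
        f"individual_{fetched_at.isoformat()}.parquet"
    )

key = individual_key(
    "vehicle_position", datetime(2024, 5, 1, 22, 30, 5, tzinfo=timezone.utc)
)
```

Because date and hour are zero-padded, sorting the keys alphabetically also sorts them chronologically.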

Pipeline 2: Merging and Optimizing Storage

The second pipeline’s job is to take all the small Parquet files generated by Pipeline 1 and merge them into a single, optimized file per hour. To do this, it scans the storage bucket for the earliest unprocessed “hour folder” and begins processing from there. This design ensures that if the pipeline is temporarily interrupted, it can easily resume without skipping any data.

Once it identifies the files to merge, the pipeline loads them, assigns a proper timestamp to each record, and concatenates them into a single Parquet table. The final file is then uploaded to the S3 bucket using the following naming convention:

{feed_type}/YYYY-MM-DD/hour/HH.parquet

If any files fail to merge, they are renamed with the prefix unmerged_{date-isoformat}.parquet for manual inspection. After successfully storing the merged file, the pipeline deletes the individual files to keep storage clean and avoid unnecessary clutter.
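
Finding the earliest unprocessed “hour folder” can be sketched as follows. This is a simplified illustration that works on a plain list of object keys for a single feed type. The function name and sample keys are hypothetical, and a real pipeline would list the keys from the bucket instead.

```python
def earliest_unprocessed_hour(keys):
    """Return the first 'feed/date/hour' prefix that still holds individual files."""
    pending = {
        key.rsplit("/", 1)[0]
        for key in keys
        if key.rsplit("/", 1)[-1].startswith("individual_")
    }
    # Zero-padded dates and hours sort chronologically as strings
    return min(pending) if pending else None

keys = [
    "vehicle_position/2024-05-01/21/21.parquet",  # already merged
    "vehicle_position/2024-05-01/23/individual_2024-05-01T23:00:05.parquet",
    "vehicle_position/2024-05-01/22/individual_2024-05-01T22:30:05.parquet",
    "vehicle_position/2024-05-01/22/individual_2024-05-01T22:30:10.parquet",
]
next_hour = earliest_unprocessed_hour(keys)
```

Since merged hours no longer contain individual files, the pipeline can simply call this again after every run and pick up where it left off.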

One critical advantage of converting GTFS-RT data into Parquet early in the process is that it prevents memory overload. If we had to merge raw GTFS-RT files instead of pre-converted Parquet files, we would likely run into memory constraints, especially on standard servers with limited RAM. This makes Parquet not just a storage solution but an enabler of efficient large-scale processing.

Ready for Analytics

In this section, we will explore how to use the GTFS-RT data for public transport analytics. Specifically, we want to compute delays, that is, the difference between the scheduled travel time and the real travel time.

The previously created Parquet files can be loaded into QGIS as tables without geometries. To turn them into point layers, we use the “Create points layer from table” algorithm from the Processing “Vector creation” toolbox. And once we convert the unixtimes to datetimes (using the datetime_from_epoch function), we have a point layer that is ready for use in Trajectools.

Let’s have a look at one bus route. Bus 3 is one of the busiest routes in Riga. We apply a filter to the point layer which reveals the location of the route.

Computing segment travel times

Computing travel times on public transport segments, i.e. between two scheduled stops, comes with a couple of challenges:

The GTFS-RT location updates are provided in a rather sparse fashion with irregular reporting intervals. It is not clear that we “see” every stop that happens.
We cannot rely solely on stop detection since, sometimes, a vehicle will not come to a halt at scheduled stop locations (if nobody wants to get off or on).
The stop ID, representing the next stop the vehicle will visit, is not always exact. Updates are often delayed and happen some time after passing the stop.

Here’s an example visualization of the stop ID information of a single trip of bus 3, overlaid on top of the GTFS route and stops (in red):

To compute the desired delays, we decided to compare GTFS-RT travel times based on stop ID info with the scheduled travel times. To get the GTFS-RT travel times, we use Trajectools and create trajectories by splitting at stop ID change using the Split by value change algorithm:

Computing delays

The final step is to compute travel time differences between schedule and real time. For this, we implemented a SQL join that matches GTFS-RT trajectories with the corresponding entry in the GTFS schedule using route information and temporal information:

The temporal information is important since the schedule accounts for different travel times during peak hours and off peak:

This information is extracted from the GTFS schedule using the Trajectools Extract segments algorithm, if we choose the “Add scheduled speeds” option:

This will add the time windows, speeds, and runtimes per segment to the resulting segment layer:

Joining the GTFS-RT trajectories with the scheduled segment information, we compute delays for every segment and trip. For example, here are the resulting delays for trip ‘AUTO3-18-1-240501-ab-2230’:

Red lines mark segments where time is lost compared to the schedule, while blue lines indicate that the vehicle traversed the segment faster than the schedule suggested.
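
The logic of the delay computation can be sketched in plain Python. We implemented it as a SQL join, so the following is only a stand-in to show the matching by segment and time window. All segment names, time windows, and runtimes are made up.

```python
from datetime import datetime

# Hypothetical observed trajectories, one per segment of a single trip
observed = [
    {"segment": "A-B", "start": datetime(2024, 5, 1, 8, 0, 0),
     "end": datetime(2024, 5, 1, 8, 3, 0)},
    {"segment": "B-C", "start": datetime(2024, 5, 1, 8, 3, 0),
     "end": datetime(2024, 5, 1, 8, 4, 30)},
]

# Hypothetical schedule: runtime in seconds per segment and time window (hours)
schedule = [
    {"segment": "A-B", "from_hour": 7, "to_hour": 9, "runtime_s": 150},
    {"segment": "A-B", "from_hour": 9, "to_hour": 16, "runtime_s": 120},
    {"segment": "B-C", "from_hour": 7, "to_hour": 9, "runtime_s": 120},
]

def scheduled_runtime(segment, start):
    """Look up the scheduled runtime valid for this segment at this time of day."""
    for row in schedule:
        if row["segment"] == segment and row["from_hour"] <= start.hour < row["to_hour"]:
            return row["runtime_s"]
    return None

delays = {}
for traj in observed:
    expected = scheduled_runtime(traj["segment"], traj["start"])
    if expected is not None:
        actual = (traj["end"] - traj["start"]).total_seconds()
        delays[traj["segment"]] = actual - expected  # positive = time lost
```

Positive values correspond to the red segments (time lost), negative values to the blue ones (faster than scheduled).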

What’s next

When interpreting the results, it is important to acknowledge the effects caused by the timing of the next stop ID updates in the real-time GTFS feed. Sometimes, these updates come very late and thus introduce distortions where one segment’s travel time gets too long and the other too short.

We will continue refining the analytics and related libraries, including the QGIS Trajectools plugin, to facilitate analytics of GTFS-RT & GTFS.

After successful testing of this analytics approach in Riga, we aim to transfer it to other cities. But for this to work, public transport companies need ways to efficiently store their data and, ideally, to release them openly to allow for analysis.

The pipelines we described help keep storage needs low, which allows us to drastically reduce costs (for a year we would only have a few gigabytes, which is inexpensive to store in S3 storage). Let us know if you would be interested in an online platform on which one could register a GTFS-RT feed & GTFS, which would then automatically start being archived (in exchange, the provider would only need to accept sharing the archives as open data, at no cost for them).