Archive

QGIS

The journey continues: QgsArrowIterator is now merged! This makes it possible to iterate over QgsFeatures as Arrow batches.

This is where we are now, quoting Dewey Dunnington:

import geopandas
from nanoarrow.c_array import allocate_c_array
import qgis
from qgis.core import QgsVectorLayer

# Create a vector layer
layer = QgsVectorLayer("tests/testdata/zonalstatistics/polys.shp", "layer_name", "ogr")
schema = qgis.core.QgsArrowIterator.inferSchema(layer)

it = qgis.core.QgsArrowIterator(layer.getFeatures())
it.setSchema(schema, 1)

c_array = allocate_c_array()
schema.exportToAddress(c_array.schema._addr())
it.nextFeatures(5, c_array._addr())

print(geopandas.GeoDataFrame.from_arrow(c_array))
#> lev3_name                                           geometry
#> 0    poly_1  MULTIPOLYGON (((100.37934 -0.96049, 100.37934 ...
#> 1    poly_2  MULTIPOLYGON (((100.37944 -0.96044, 100.37955 ...
#> 2    poly_3  MULTIPOLYGON (((100.37938 -0.96049, 100.37949 ...

print(geopandas.read_file("tests/testdata/zonalstatistics/polys.shp"))
#> lev3_name                                           geometry
#> 0    poly_1  POLYGON ((100.37934 -0.96049, 100.37934 -0.960...
#> 1    poly_2  POLYGON ((100.37944 -0.96044, 100.37955 -0.960...
#> 2    poly_3  POLYGON ((100.37938 -0.96049, 100.37949 -0.960...

Further improvements are already being planned. To quote from the ticket:

“The final state after this improvement would be a compact way for Arrow Python consumers like GeoPandas to ergonomically consume a layer. Maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object)

Or maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object.getArrowStream())”

Looking forward to seeing this develop further.

The conversation around “Looking for better ways to convert between QGIS VectorLayer and (Geo)DataFrame” is continuing over at https://fosstodon.org/@underdarkGIS/115442614331293320

What I’ve learned so far:

Exciting times for spatial data science tooling 🤩

Plugin developers who want to use (Geo)Pandas-based functionality in their plugins regularly face the challenge of converting QGIS vector layers to (Geo)DataFrames. There is currently no built-in convenience function.

In Trajectools, so far, I have been performing the conversion manually, looping through all features and taking care of tricky column types, such as datetimes and geometries:

import pandas as pd
from qgis.PyQt.QtCore import QDateTime

def df_from_layer_trajectools(layer, time_field_name="t"):
    # Original Trajectools 2.7 version
    names = [field.name() for field in layer.fields()]
    data = []
    for feature in layer.getFeatures():
        my_dict = {}
        for i, a in enumerate(feature.attributes()):
            # QDateTime values need to be converted to Python datetimes
            if names[i] == time_field_name and isinstance(a, QDateTime):
                a = a.toPyDateTime()
            my_dict[names[i]] = a
        # Point geometries are flattened into x/y columns
        pt = feature.geometry().asPoint()
        my_dict["geom_x"] = pt.x()
        my_dict["geom_y"] = pt.y()
        data.append(my_dict)
    df = pd.DataFrame(data)
    return df

It works (mostly), but it’s far from fast. For the 25 million Geolife points, it takes 4 minutes:

In an attempt to speed up the conversion (and make it more robust, e.g. regarding datetime/timezone conversion and null values), I've spent some time at SDSL2025 with Joris Van den Bossche trying a workaround that writes the QGIS layer to an Arrow file and then reads that file with pyogrio:

import os
import tempfile

import geopandas as gpd
import pyogrio
from qgis.core import QgsProject, QgsVectorFileWriter

def gdf_from_layer_arrow(layer):
    # SDSL2025 version
    with tempfile.TemporaryDirectory() as tmpdirname:
        path = os.path.join(tmpdirname, "data.arrow")

        # Write the QGIS layer to a temporary Arrow file ...
        options = QgsVectorFileWriter.SaveVectorOptions()
        options.actionOnExistingFile = QgsVectorFileWriter.CreateOrOverwriteFile
        options.layerName = "data"
        options.driverName = "arrow"

        QgsVectorFileWriter.writeAsVectorFormatV3(
            layer, path, QgsProject.instance().transformContext(), options
        )

        # ... and read it back as an Arrow table to build the GeoDataFrame
        meta, table = pyogrio.read_arrow(path)
        gdf = gpd.GeoDataFrame.from_arrow(table)

    return gdf

Not only do we get a GeoDataFrame in return, but this approach also runs in half the time, i.e. in 2 minutes instead of 4:

Switching to this approach will require adding pyogrio to the plugin dependencies. Looks like it could be worth it.

We also discussed another alternative: it would be faster to read the vector layer's data source directly if it is a supported file format. However, this means we'd need separate handling for other input layers.

There’s also the issue of supporting the Processing feature that allows users to run an algorithm only on the selected features: selected features are only exposed through QgsProcessingParameterFeatureSource (and not through QgsProcessingParameterVectorLayer). Maybe the Export Selected Features algorithm can cover this case, but it will export an empty layer if there is no selection.
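
A rough sketch of that workaround (untested; the algorithm ID is the one for Export Selected Features, and the output destination is just an example):

from qgis import processing

# Materialize the current selection before converting.
# Caveat: with an empty selection, this produces an empty layer.
result = processing.run(
    "native:saveselectedfeatures",
    {"INPUT": layer, "OUTPUT": "memory:selected"},
)
selected_layer = result["OUTPUT"]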

Are you aware of any other / better ways to approach this issue? Any pointers are appreciated.

The last time I preprocessed the whole Geolife dataset, I loaded it into PostGIS. Today, I want to share a new workflow that creates a (Geo)Parquet file and is much faster.

The dataset (GeoLife)

“This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.”

The GeoLife GPS Trajectories download contains 182 directories full of .plt files:

Basically, CSV files with a custom header:
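
For illustration, a .plt file looks roughly like this, with six lines of metadata followed by lat, lon, an unused field, altitude, a fractional day count, date, and time (the data values here are made up):

Geolife trajectory
WGS 84
Altitude is in Feet
Reserved 3
0,2,255,My Track,0,0,2,8421376
0
39.906631,116.385564,0,492,39745.1201851852,2008-10-24,02:53:04
39.906554,116.385625,0,492,39745.1202430556,2008-10-24,02:53:09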

Creating the (Geo)Parquet using DuckDB

DuckDB installation

Following the official instructions, installation is straightforward:

curl https://install.duckdb.org | sh

From there, I’ve been using the GUI, which we can launch with:

duckdb -ui

The spatial extension is a DuckDB core extension, so it’s readily available. We can set up a spatial in-memory database with:

ATTACH IF NOT EXISTS ':memory:' AS memory;
INSTALL spatial;
LOAD spatial;

Reading a spatial file is as simple as:

SELECT * 
FROM '/home/anita/Documents/Codeberg/trajectools/sample_data/geolife.gpkg'

thanks to the GDAL integration.

But today, we want to get a bit more involved …

DuckDB SQL magic

The issues we need to solve are:

  1. Read all CSV files from all subdirectories
  2. Parse the CSV, ignoring the first couple of lines, while assigning proper column names
  3. Assign the CSV file name as the trajectory ID (because there is no ID in the original files)
  4. Create point geometries that will work with our GeoParquet file
  5. Create proper datetimes from the separate date and time fields

Luckily, DuckDB’s read_csv function comes with the necessary features built-in. Putting it all together:

CREATE OR REPLACE TABLE geolife AS 
SELECT 
  parse_filename(filename, true) as vehicle_id, 
  strptime(date||' '||time, '%c') as t, 
  ST_Point(lon, lat) as geometry -- do NOT use ST_MakePoint
FROM read_csv('/home/anita/Documents/Geodata/Geolife/Geolife Trajectories 1.3/Data/*/*/*.plt',
    skip=6,
    filename = true, 
    columns = {
        'lat': 'DOUBLE', 
        'lon': 'DOUBLE', 
        'ignore': 'INT', 
        'alt': 'DOUBLE', 
        'epoch': 'DOUBLE', 
        'date': 'VARCHAR',
        'time': 'VARCHAR'
    });

It’s blazingly fast:
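
The step that actually writes the (Geo)Parquet file is a single COPY statement. A sketch (output path illustrative; as far as I know, recent versions of the spatial extension add the GeoParquet metadata for the geometry column automatically):

COPY geolife TO 'geolife.parquet' (FORMAT parquet);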

I haven’t tested reading directly from ZIP archives yet, but there seems to be a community extension (zipfs) for this exact purpose.
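
If it works as advertised, reading straight from the archive might look something like this (untested; the zip:// path syntax is an assumption on my part):

INSTALL zipfs FROM community;
LOAD zipfs;
-- then point read_csv at a path inside the archive, e.g.
-- 'zip://Geolife Trajectories 1.3.zip/Data/000/Trajectory/20081023025304.plt'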

Ready to QGIS

GeoParquet files can be drag-n-dropped into QGIS:

I’m running QGIS 3.42.1-Münster from conda-forge on Linux Mint.

Yes, it takes a while to render all 25 million points … But you know what? It gets really snappy once we zoom in closer, e.g. to the situation in Germany:

Let’s have a closer look at what’s going on here.

Trajectools time

Selecting the 9,438 points in this extent, let’s compute movement metrics (speed & direction) and create trajectory lines:

Looks like we have some high-speed sections in there (with those red > 100 km/h streaks):

When we zoom in to Darmstadt and enable the trajectories layer, we can see each individual trip. Looks like car trips on the highway and walks through the city:

That looks like quite the long round trip:

Let’s see where they might have stopped to have a break:

If I had to guess, I’d say they stayed at the Best Western:

Conclusion

DuckDB has been great for this ETL workflow. I didn’t use much of its geospatial capabilities here, but I was pleasantly surprised by how smooth the GeoParquet creation process has been. Geometries are handled without any special magic and are recognized by QGIS. The same goes for the timestamps. All ready for more heavy spatiotemporal analysis with Trajectools.

If you haven’t tried DuckDB or GeoParquet yet, give it a try, particularly if you’re collaborating with data scientists from other domains and want to exchange data.

The QGISUC2025 team has done an awesome job recording and editing the conference presentations. All “presentation” type talks whose presenters agreed to be published are now available in a dedicated list on the QGIS YouTube channel.

I also had the pleasure of presenting our Trajectools plugin, and you can watch the talk here:

Thank you to all the organizers, speakers, and participants for the great time!

The latest releases of MovingPandas and Trajectools come with many “under the hood” changes that aim to make your movement analytics faster:

  1. Instead of immediately creating a GeoPandas GeoDataFrame and populating the geometry column with Point objects, MovingPandas now features “lazy geometry column creation”, which holds off on this operation until (and unless) the geometries are actually needed. This way, for many operations, no geometry objects have to be generated at all.
  2. MovingPandas TrajectorySplitters now support parallel processing and Trajectools uses parallel processing whenever available (e.g. for adding speed & direction metrics, detecting stops, splitting trajectories).
  3. When a minimum length is specified for trajectories, MovingPandas now avoids computing the total trajectory length and, instead, immediately stops once the threshold value has been reached (“early skip”).
  4. Trajectools now offers the option to skip computation of movement metrics (speed & direction). This way, we can skip unnecessary computations and leverage the lazy geometry column creation, wherever applicable.

Let’s have a look at some example performance measurements!

Example 1: MovingPandas ValueChangeSplitter

The ValueChangeSplitter splits trajectories when it detects a value change in the specified column. This is useful, for example, to split up public transport trajectories that contain a “next_stop” column.

The following graph shows ValueChangeSplitter runtimes for different minimum trajectory length settings (0, 1 km, 100 km, and 10,000 km):

We see that the new lazy geometry column initialization outperforms the original code in all cases (e.g. a 57% runtime reduction for 1 km), except for the worst-case scenario, where the original implementation discards all trajectories as too short right from the start. (For most use cases, min_length will be set to rather small values to avoid the creation of undesired short trajectory fragments, similar to sliver polygons in classic geometry operations.)

Additionally, we can engage multiprocessing by setting the n_processes parameter, e.g. to the number of CPUs to achieve further speedup:
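
In MovingPandas terms, this looks roughly like the following (a sketch; tc is an existing TrajectoryCollection and the parameter values are illustrative):

import movingpandas as mpd

# Split wherever the value in the "next_stop" column changes,
# discard fragments shorter than min_length (in CRS units), and
# distribute the work across four worker processes
split = mpd.ValueChangeSplitter(tc).split(
    col_name="next_stop",
    min_length=100,
    n_processes=4,
)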

Example 2: Trajectools

By applying all above-mentioned speedup techniques, Trajectools is now considerably faster. For example, the following runtime reductions can be achieved by deactivating the “Add movement metrics (speed, direction)” option in the algorithm dialog:

  • Create trajectories: 62%
  • Spatiotemporal generalization (TDTR): 78%
  • Temporal generalization: 81%
  • Split trajectories at stops: 53%

I have also updated the default trajectory points output style. It now uses a graduated renderer to visualize the speed values (if they have been calculated) instead of the previously used data-defined override. This makes the style faster to customize and provides a user-friendly legend:

For more info, have a look at:

Enjoy the latest performance increases!

tl;dr: Tired of working with large CSV files? Give GeoParquet a try!

“Parquet is a powerful column-oriented data format, built from the ground up as a modern alternative to CSV files.” https://geoparquet.org/

(Geo)Parquet is both smaller and faster than CSV. Additionally, (Geo)Parquet columns are typed: text, numeric values, dates, and geometries retain their data types. GeoParquet also stores CRS information, and support in GIS solutions is growing.

I’ll be giving a quick overview using AIS data in GeoPandas 1.0.1 (with pyarrow) and QGIS 3.38 (with GDAL 3.9.2).

File size

The example AIS dataset for this demo contains ~10 million rows with 22 columns. I’ve converted the original zipped CSV into GeoPackage and GeoParquet using GeoPandas to illustrate the huge difference in file size: ~470 MB for GeoParquet and zipped CSV, 1.6 GB for CSV, and a whopping 2.6 GB for GeoPackage:

Reading performance

Pandas and GeoPandas both support selective reading of files, i.e. we can specify which columns to load. This speeds up reading, even from CSV files:

Format       Whole file     Selected columns
CSV          27.9 s         13.1 s
GeoPackage   2 min 12 s 😵  20.2 s
GeoParquet   7.2 s          4.1 s

Indeed, reading the whole GeoPackage is getting quite painful.

Here’s the code I used for timing the read times:
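
Roughly, the timing code follows this pattern (a sketch; file and column names are placeholders):

import time

import geopandas as gpd

cols = ["MMSI", "SOG"]  # placeholder attribute columns

# GeoPackage, selected columns only
start = time.time()
gdf = gpd.read_file("ais.gpkg", columns=cols)
print(f"GeoPackage: {time.time() - start:.1f} s")

# GeoParquet, selected columns only (geometry requested explicitly)
start = time.time()
gdf = gpd.read_parquet("ais.parquet", columns=cols + ["geometry"])
print(f"GeoParquet: {time.time() - start:.1f} s")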

As you can see, these times include the creation of the geopandas.GeoDataFrame.

If we don’t need a GeoDataFrame, we can read the files even faster:

Non-spatial DataFrames

GeoParquet files can be read by non-GIS tools, such as Pandas. This makes it easier to collaborate with people who may not be familiar with geospatial data stacks.

And reading plain DataFrames is much faster than creating GeoDataFrames:
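
For example (a sketch with placeholder names; if the geometry column is included, it simply comes through as WKB bytes):

import pandas as pd

# Read the GeoParquet file as a plain DataFrame - no geometry
# objects are created, which makes this considerably faster
df = pd.read_parquet("ais.parquet", columns=["MMSI", "SOG"])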

But back to GIS …

GeoParquet in QGIS

In QGIS, GeoParquet files can be loaded like any other vector layer, thanks to GDAL:

Loading the GeoParquet and GeoPackage files is pretty quick, especially if we zoom into a small region of interest (even though, unfortunately, it doesn’t seem possible to restrict the columns to further speed up loading). Loading the CSV, however, is pretty painful due to the lack of spatial indexing, which becomes apparent very quickly in the direct comparison:

(You can see how slowly the red CSV points are rendering. I didn’t have the patience to include the whole process in the GIF.)

As far as I can tell, my QGIS 3.38 ‘Grenoble’ does not support writing to or editing of GeoParquet files. So I’m limited to reading GeoParquet for now.

However, seeing how much smaller GeoParquet files are compared to GeoPackages (and also faster to write), I hope that we will soon get the option to export to GeoParquet.

For now, I’ll start by converting my large CSV files to GeoParquet using GeoPandas.
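
That conversion is a one-off along these lines (a sketch; file names, column names, and CRS are placeholders):

import pandas as pd
import geopandas as gpd

# Read the plain CSV, build point geometries, and write GeoParquet
df = pd.read_csv("ais.csv")
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["LON"], df["LAT"]),
    crs="EPSG:4326",
)
gdf.to_parquet("ais.parquet")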

More reading

If you’re into GeoJSON and/or PyGeoAPI, check out Joana Simoes’ post: “Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project”

And if you want to see a global dataset example, have a look at Matt Travis’ presentation using Overture data:

After the initial ChatGPT hype in 2023 (when we saw the first LLM-backed QGIS plugins, e.g. QChatGPT and QGPT Agent), there has been a notable slump in new development. As far as I can tell, none of the early plugins are actively maintained anymore. They were nice tech demos but with limited utility.

However, in the last month, I saw two new approaches for combining LLMs with QGIS that I want to share in this post:

IntelliGeo plugin: generating PyQGIS scripts or graphical models

At the QGIS User Conference in Bratislava, I had the pleasure of attending the “Large Language Models and GIS” workshop presented by Gustavo Garcia and Zehao Lu from the University of Twente. There, they presented the IntelliGeo plugin, which enables the automatic generation of PyQGIS scripts and graphical models.

The workshop was packed. After we installed all dependencies and the plugin, it was exciting to test the graphical model generation capabilities. During the workshop, we used OpenAI’s API but the readme also mentions support for Cohere.

I was surprised to learn that even simple graphical models are actually pretty large files. This makes it very challenging to generate and/or modify models because they take up a big part of the LLM’s context window. Therefore, I expect that the PyQGIS script generation will be easier to achieve. But, of course, model generation would be even more impressive and useful since models are easier to edit for most users than code.

Image source: https://github.com/MahdiFarnaghi/intelli_geo

ChatGeoAI: chat with PyQGIS

ChatGeoAI is an approach presented in Mansourian, A.; Oucheikh, R. (2024). ChatGeoAI: Enabling Geospatial Analysis for Public through Natural Language, with Large Language Models. ISPRS Int. J. Geo-Inf. 13, 348.

It uses a fine-tuned Llama 2 model in combination with spaCy for entity recognition and WorldKG ontology to write PyQGIS code that can perform a variety of different geospatial analysis tasks on OpenStreetMap data.

The paper is very interesting, describing the LLM fine-tuning, integration with QGIS, and evaluation of the generated code using different metrics. However, as far as I can tell, the tool is not publicly available and, therefore, cannot be tested.

Image source: https://www.mdpi.com/2220-9964/13/10/348

Are you aware of more examples that integrate QGIS with LLMs? Please share them in the comments below. I’d love to hear about them.

Today marks the release of Trajectools 2.3, which brings a new set of algorithms, including trajectory generalization, cleaning, and smoothing.

To give you a quick impression of what some of these algorithms would be useful for, this post introduces a trajectory preprocessing workflow that is quite general-purpose and can be adapted to many different datasets.

We start out with the Geolife sample dataset which you can find in the Trajectools plugin directory’s sample_data subdirectory. This small dataset includes 5908 points forming 5 trajectories, based on the trajectory_id field:

We first split our trajectories by observation gaps to ensure that there are no large gaps left in them. Let’s make a cut at 15 minutes:
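
(Under the hood, Trajectools uses MovingPandas. In plain MovingPandas, this step would look roughly like this, assuming tc holds the sample TrajectoryCollection:)

from datetime import timedelta

import movingpandas as mpd

# Split each trajectory wherever two consecutive observations
# are more than 15 minutes apart
split = mpd.ObservationGapSplitter(tc).split(gap=timedelta(minutes=15))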

This splits the original 5 trajectories into 11 trajectories:

When we zoom, for example, to the two trajectories in the north-western corner, we can see that they are pretty noisy and that there’s even a spike / outlier at the western end:

If we label the points with the corresponding speeds, we can see how unrealistic they are: over 300 km/h!

Let’s remove outliers over 50 km/h:

Better but not perfect:

Let’s smooth the trajectories to get rid of more of the jittering.

(You’ll need to pip/mamba install the optional stonesoup library to get access to this algorithm.)
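
The MovingPandas equivalent would look roughly like this (a sketch; cleaned stands for the trajectories after outlier removal, and the noise values are illustrative):

import movingpandas as mpd

# Kalman smoother with a constant-velocity model (requires stonesoup);
# lower process noise and higher measurement noise mean stronger smoothing
smooth = mpd.KalmanSmootherCV(cleaned).smooth(
    process_noise_std=0.1,
    measurement_noise_std=10,
)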

Depending on the noise values we choose, we get more or less smoothing:

Let’s zoom out to see the whole trajectory again:

Feel free to pan around and check how our preprocessing affected the other trajectories, for example:

If you downloaded Trajectools 2.1 and ran into trouble due to the introduced scikit-mobility and gtfs_functions dependencies, please update to Trajectools 2.2.

This new version makes it easier to set up Trajectools since MovingPandas is pip-installable on most systems nowadays and scikit-mobility and gtfs_functions are now truly optional dependencies. If you don’t install them, you simply will not see the extra algorithms they add:

If you encounter any other issues with Trajectools or have questions regarding its usage, please let me know in the Trajectools Discussions on GitHub.