Working with movement data analysis, I’ve banged my head against performance issues every once in a while. For example, PostgreSQL – and therefore PostGIS – run queries in a single thread of execution. This is now changing, with more and more functionality being parallelized. PostgreSQL version 9.6 (released on 2016-09-29) included important steps towards parallelization, including parallel execution of sequential scans, joins and aggregates. Still, there is no parallel processing in PostGIS so far (but it is under development as described by Paul Ramsey in his posts “Parallel PostGIS II” and “PostGIS Scaling” from late 2017).

At the FOSS4G2016 in Bonn, I had the pleasure to chat with Shoaib Burq who ran the “An intro to Apache PySpark for Big Data GeoAnalysis” workshop. Back home, I downloaded the workshop material and gave it a try but since I wanted a scalable system for storing, analyzing, and visualizing spatial data, it didn’t really seem to fit the bill.

Around one year ago, my search grew more serious since we needed a solution that would support our research group’s new projects where we expected to work with billions of location records (timestamped points and associated attributes). I was happy to find that the fine folks at LocationTech have some very promising open source projects focusing on big spatial data, most notably GeoMesa and GeoWave. Both tools take care of storing and querying big spatio-temporal datasets and integrate into GeoServer for publication and visualization. (A good – if already slightly outdated – comparison of the two has been published by Azavea.)

My understanding at the time was that GeoMesa had a stronger vector data focus while GeoWave was more focused on raster data. This lead me to try out GeoMesa. I published my first steps in “Getting started with GeoMesa using Geodocker” but things only really started to take off once I joined the developer chats and was pointed towards CCRI’s cloud-local “a collection of bash scripts to set up a single-node cloud on your desktop, laptop, or NUC”. This enabled me to skip most of the setup pains and go straight to testing GeoMesa’s functionality.

The learning curve is rather significant: numerous big data stack components (including HDFS, Accumulo, and GeoMesa), a most likely new language (Scala), as well as the Spark computing system require some getting used to. One thing that softened the blow is the fact that writing queries in SparkSQL + GeoMesa is pretty close to writing PostGIS queries. It’s also rather impressive to browse hundreds of millions of points by connecting QGIS TimeManager to a GeoServer WMS-T with GeoMesa backend.

Spatial big data stack with GeoMesa

One of the first big datasets I’ve tested are taxi floating car data (FCD). At one million records per day, the three years in the following example amount to a total of around one billion timestamped points. A query for travel times between arbitrary start and destination locations took a couple of seconds:

Travel time statistics with GeoMesa (left) compared to Google Maps predictions (right)

Besides travel time predictions, I’m also looking into the potential for predicting future movement. After all, it seems not unreasonable to assume that an object would move in a similar fashion as other similar objects did in the past.

Early results of a proof of concept for GeoMesa based movement prediction

Big spatial data – both vector and raster – are an exciting challenge bringing new tools and approaches to our ever expanding spatial toolset. Development of components in open source big data stacks is rapid – not unlike the development speed of QGIS. This can make it challenging to keep up but it also holds promises for continuous improvements and quick turn-around times.

If you are using GeoMesa to work with spatio-temporal data, I’d love to hear about your experiences.


In Movement data in GIS #2: visualization I mentioned that it should be possible to label trajectory segments without having to break the original trajectory feature. While it’s not a straightforward process, it is indeed possible to create timestamp labels at desired intervals:

The main point here is that we cannot use regular labels because there would be only one label for the whole trajectory feature. Instead, we are using a marker line with a font marker:

By default, font markers only display one character from a given font but by using expressions we can make it display longer text, including datetime strings:

If you want to have a label at every node of the trajectory, the expression looks like this:

         segments_to_lines( $geometry ),
      ))||' seconds'

You probably remember those parts of the expression that extract the m value from previous posts. Note that – compared to 2016 – it is now necessary to add the segments_to_lines() function.

The m value (which stores time as seconds since Unix epoch) is then converted to datetime and finally formatted to only show time. Of course you can edit the datetime format string to also include the date.

If we only want a label every 30 seconds, we can add a case statement around that:

   segments_to_lines( $geometry ),
)) % 30 = 0
         segments_to_lines( $geometry ),
      ))||' seconds'

This works well if the trajectory sampling interval is fairly regular. This is not always the case and that means that the above case statement wouldn’t find many nodes with a timestamp that ends in :30 or :00. In such a case, we could resort to labeling nodes based on their order in the linestring:

 @geometry_part_num  % 30 = 0

Thanks a lot to @JuergenEFischer for providing a solution for converting seconds since Unix epoch to datetime without a custom function!

Note that expressions using @geometry_part_num currently suffer from the following issue: Combination of segments_to_lines($geometry) and @geometry_part_num gives wrong segment numbers

Read more:

In short: both writing trajectory queries as well as executing them is considerably faster using PostGIS trajectories (as LinestringM) rather than the commonly used point-based approach.

Here are a couple of examples to give you an impression of the differences.

Spoiler alert! Trajectory queries are up to 500 times faster than comparable point-based queries.

A quick look at indexing

In both cases, we have indexed the tracker id, geometry, and time columns to speed up query processing.

The trajectory table has 3 indexes

  • gist (time_range)
  • gist (track gist_geometry_ops_nd)
  • btree (tracker)

The point-based table has 4 indexes

  • gist (pt)
  • btree (trajectory_id)
  • btree (tracker)
  • btree (t)


First, let’s see how to determine trajectory length for all observed moving objects (identified by a tracker id).

Using the point-based approach, we first need to ensure that the points are in the correct temporal order, create the lines, and finally sum up their length:

WITH ordered AS (
 SELECT trajectory_id, tracker, t, pt
 FROM geolife.trajectory_pt
), tmp AS (
 SELECT trajectory_id, tracker, st_makeline(pt) traj
 FROM ordered 
 GROUP BY trajectory_id, tracker
SELECT tracker, round(sum(ST_Length(traj::geography)))
FROM tmp
GROUP BY tracker 
ORDER BY tracker

With trajectories, we can go right to computing lengths:

SELECT tracker, round(sum(ST_Length(track::geography)))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

On my test system, the trajectory query run time is 22.7 sec instead of 43.0 sec for the point-based approach:


Compared to trajectory length, duration is less complicated in the point-based approach:

WITH tmp AS (
 SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
SELECT tracker, sum(end_time - start_time)
FROM tmp
GROUP BY tracker
ORDER BY tracker

Still, the trajectory query is less complex and much faster at 31 ms instead of 6.0 sec:

SELECT tracker, sum(upper(time_range) - lower(time_range))
FROM geolife.trajectory_ext
GROUP BY tracker
ORDER BY tracker

Temporal filter

Extracting trajectories that occurred during a certain time frame is another common use case:

WITH tmp AS (
 SELECT trajectory_id, tracker, min(t) start_time, max(t) end_time
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
SELECT trajectory_id, tracker, start_time, end_time
FROM tmp
WHERE end_time > '2008-11-26 11:00'
AND start_time < '2008-11-26 15:00'
ORDER BY tracker

This point-based query takes 6.0 sec while the shorter trajectory query finishes in 12 ms:

SELECT id, tracker, time_range
FROM geolife.trajectory_ext
WHERE time_range && '[2008-11-26 11:00+1,2008-11-26 15:00+01]'::tstzrange

or equally fast (12 ms) by making use of the n-dimensional index:

WHERE track &&&	ST_Collect(
 ST_MakePointM(-180, -90, extract(epoch from '2008-11-26 11:00'::timestamptz)),
 ST_MakePointM(180, 90, extract(epoch from '2008-11-26 15:00'::timestamptz))

Spatial filter

Finally, of course, let’s have a look at spatial filters, for example, trajectories that start in a certain area:

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894,39.97472),4326),0.0005) areaA
), tmp AS (
 SELECT trajectory_id, tracker, min(t) t 
 FROM geolife.trajectory_pt
 GROUP BY trajectory_id, tracker
SELECT distinct traj.tracker, traj.trajectory_id 
FROM tmp
JOIN geolife.trajectory_pt traj
ON tmp.trajectory_id = traj.trajectory_id AND traj.t = tmp.t
ON ST_Within(, my.areaA)

This point-based query takes 6.0 sec while the shorter trajectory query finishes in 488 ms:

WITH my AS ( 
 SELECT ST_Buffer(ST_SetSRID(ST_MakePoint(116.31894, 39.97472),4326),0.0005) areaA
SELECT id, tracker, ST_AsText(track)
FROM geolife.trajectory_ext
ON areaA && track
AND ST_Within(ST_StartPoint(track), areaA)

For more generic “does this trajectory intersect another geometry”, the points can also be aggregated to a linestring on the fly but that takes 21.9 sec:

I’ll be presenting more work on PostGIS trajectories at GI_Forum in Salzburg in July. In the talk, I’ll also have a look at the custom PG-Trajectory datatype.

Read more:

Remember the good old times when all parameters in Processing were mandatory?

Inputs and outputs are fixed, and optional parameters or outputs are not supported. [Graser & Olaya, 2015]

Since QGIS 2.14, this is no longer the case. Scripts, as well as models, can now have optional parameters. Here is how for QGIS 3:

When defining a Processing script parameter, the parameter’s constructor takes a boolean flag indicating whether the parameter should be optional. It’s false by default:

class qgis.core.QgsProcessingParameterNumber(
   name: str, description: str = '', 
   type: QgsProcessingParameterNumber.Type = QgsProcessingParameterNumber.Integer, 
   defaultValue: Any = None, 
   optional: bool = False,
   minValue: float = -DBL_MAX+1, maxValue: float = DBL_MAX)


One standard tool that uses optional parameters is Add autoincremental field:

From Python, this algorithm can be called with or without the optional parameters:

When building a model, an optional input can be assigned to the optional parameter. To create an optional input, make sure to deactivate the mandatory checkbox at the bottom of the input parameter definition:

Then this optional input can be used in an algorithm. For example, here the numerical input optional_value is passed to the Start values at parameter:

You can get access to all available inputs by clicking the … button next to the Start values at field. In this example, I have access to values of the input layer as well as  the optional value:

Once this is set up, this is how it looks when the model is run:

You can see that the optional value is indeed Not set.


Graser, A., & Olaya, V. (2015). Processing: A Python Framework for the Seamless Integration of Geoprocessing Tools in QGIS. ISPRS Int. J. Geo-Inf. 2015, 4, 2219-2245. doi:10.3390/ijgi4042219.

Processing has been overhauled significantly for QGIS 3.0. Besides speed-ups, one of the most obvious changes is the way to write Processing scripts. Instead of the old Processing-specific syntax, Processing scripts for QGIS3 are purely pythonic implementations of QgsProcessingAlgorithm.

Here’s a template that you can use to develop your own algorithms:

from qgis.PyQt.QtCore import QCoreApplication, QVariant
from qgis.core import (QgsField, QgsFeature, QgsFeatureSink, QgsFeatureRequest, QgsProcessing, QgsProcessingAlgorithm, QgsProcessingParameterFeatureSource, QgsProcessingParameterFeatureSink)
class ExAlgo(QgsProcessingAlgorithm):

    def __init__(self):

    def name(self):
        return "exalgo"
    def tr(self, text):
        return QCoreApplication.translate("exalgo", text)
    def displayName(self):
        return"Example script")

    def group(self):

    def groupId(self):
        return "examples"

    def shortHelpString(self):
        return"Example script without logic")

    def helpUrl(self):
        return ""
    def createInstance(self):
        return type(self)()
    def initAlgorithm(self, config=None):
  "Input layer"),
  "Output layer"),

    def processAlgorithm(self, parameters, context, feedback):
        source = self.parameterAsSource(parameters, self.INPUT, context)
        (sink, dest_id) = self.parameterAsSink(parameters, self.OUTPUT, context,
                                               source.fields(), source.wkbType(), source.sourceCrs())

        features = source.getFeatures(QgsFeatureRequest())
        for feat in features:
            out_feat = QgsFeature()
            sink.addFeature(out_feat, QgsFeatureSink.FastInsert)

        return {self.OUTPUT: dest_id}

This script just copies the features of the input layer to the output layer without any modifications. Add your logic to the processAlgorithm() function to get started.

Use Create New Script from the Toolbox toolbar:

Paste the example script:

Once saved, the script will show up in the Processing toolbox:

Joining polygon attributes to points based on their location is a very common GIS task. In QGIS 2, QGIS’ own implementation of “Join attributes by location” was much slower than SAGA’s “Add polygon attributes to points”. Thus, installations without SAGA were out of good options.

Luckily this issue (and many more) has been fixed by the rewrite of many geoprocessing algorithms for QGIS 3! Let’s revisit the comparison:

I’m using publicly available datasets from Naturalearth: The small scale populated places (243 points) and the large scale countries (255 polygons with many nodes). Turns out that QGIS 3’s built-in tool takes a little less than two seconds while the SAGA Processing tool requires a litte less than six seconds:

Like in the previous comparison, times were measured using the Python Console:

In both tools, only the countries’ SOVEREIGNT attribute is joined to the point attribute table:

import processing
t0 =
print("QGIS Join attributes by location ...")
t1 =
print("Runtime: "+str(t1-t0))
print("SAGA Add polygon attributers to points ...")
t2 =
print("Runtime: "+str(t2-t1))

It is worth noting that it takes longer if more attributes are to be joined to the point layer attribute table. For example, if the JOIN_FIELDS parameter is empty:


instead of


then the the Join attributes by location takes almost 16 seconds. (The country layer contains 71 attributes after all.)

(The SAGA tool currently allows only joining one attribute at a time.)

The release of 3.0 is really close now. If you want to know what’s new or are just looking for interesting ways to pass the time until the packages land, check out the following QGIS3 resources.

For users

For more recordings from the developer meeting in Madeira check my Youtube playlist.

For developers

If you have further reading recommendations, please post them in the comments below.


%d bloggers like this: