In the previous post, we explored how hvPlot and Datashader can help us to visualize large CSVs with point data in interactive map plots. Of course, the spatial distribution of points usually only shows us one part of the whole picture. Today, we’ll therefore look into how to explore other data attributes by linking other (non-spatial) plots to the map.

This functionality, referred to as “linked brushing” or “crossfiltering”, is under active development. The following experiment was prompted by a recent thread on Twitter launched by @plotlygraphs’ announcement of HoloViews 1.14:

Turns out these features are not limited to plotly but can also be used with Bokeh and hvPlot:

Like in the previous post, this demo uses a Pandas DataFrame with 12 million rows (and HoloViews 1.13.4).

In addition to the map plot, we also create a histogram from the same DataFrame:

map_plot = df.hvplot.scatter(x='x', y='y', datashade=True, height=300, width=400)
hist_plot = df.where((df.SOG>0) & (df.SOG<50)).hvplot.hist("SOG", bins=20, width=400, height=200)

To link the two plots, we use HoloViews’ link_selections function:

from holoviews.selection import link_selections
linked_plots = link_selections(map_plot + hist_plot)

That’s all! We can now perform spatial filters in the map and attribute value filters in the histogram and the filters are automatically applied to the linked plots:

Linked brushing demo using ship movement data (AIS): filtering records by speed (SOG) reveals spatial patterns of fast and slow movement.
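
To make the idea behind linked selections more concrete, here is a minimal, library-free sketch. It is not how HoloViews implements link_selections; it only illustrates the concept, assuming that each plot contributes one filter and that the filters are combined by intersection. The sample records and the helper names (box_filter, range_filter, apply_filters) are made up for illustration.

```python
# Conceptual sketch of linked brushing: every plot contributes one filter,
# and a record stays selected only if it passes all active filters.
records = [
    {"x": 10.0, "y": 55.0, "SOG": 0.5},
    {"x": 11.0, "y": 56.0, "SOG": 12.0},
    {"x": 12.0, "y": 57.0, "SOG": 25.0},
    {"x": 30.0, "y": 70.0, "SOG": 14.0},
]


def box_filter(x_range, y_range):
    """Spatial filter, as drawn with a box-select tool on the map."""
    def keep(rec):
        return (x_range[0] <= rec["x"] <= x_range[1]
                and y_range[0] <= rec["y"] <= y_range[1])
    return keep


def range_filter(column, low, high):
    """Attribute filter, as drawn by brushing a value range in the histogram."""
    def keep(rec):
        return low <= rec[column] <= high
    return keep


def apply_filters(recs, filters):
    """Return the records that pass every filter (assumed: intersection)."""
    return [r for r in recs if all(f(r) for f in filters)]


selected = apply_filters(records, [
    box_filter((9.0, 13.0), (54.0, 58.0)),   # selection made in the map
    range_filter("SOG", 10.0, 30.0),         # selection made in the histogram
])
print(selected)
```

Each linked plot would then redraw using only the selected records, which is why brushing a speed range in the histogram changes what is highlighted in the map.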

You’ve probably noticed that there is no background map in the above plot. I had to remove the background map tiles to get rid of an error in HoloViews 1.13.4. This error has been fixed in 1.14.0 but I ran into other issues with the datashaded scatter plot.

It’s worth noting that not all plot types support linked brushing. For the complete list, please refer to http://holoviews.org/user_guide/Linked_Brushing.html

Even with all their downsides, CSV files are still a common data exchange format – particularly between disciplines with different tech stacks. Indeed, “How to Specify Data Types of CSV Columns for Use in QGIS” (originally written in 2011) is still one of the most popular posts on this blog. QGIS continues to be quite handy for visualizing CSV file contents. However, there are times when it’s just not enough, particularly when the number of rows in the CSV is in the range of multiple million. The following example uses a 12 million point CSV:

To give you an idea of the waiting times in QGIS, I’ve run the following script which loads and renders the CSV:

from datetime import datetime

def get_time():
    t2 = datetime.now()
    print(t2)
    print(t2-t1)
    print('Done :)')

canvas = iface.mapCanvas()
canvas.mapCanvasRefreshed.connect(get_time)

print('Starting ...')

t0 = datetime.now()
print(t0)

print('Loading CSV ...')

uri = "file:///E:/Geodata/AISDK/raw_ais/aisdk_20170701.csv?type=csv&xField=Longitude&yField=Latitude&crs=EPSG:4326&"
vlayer = QgsVectorLayer(uri, "layer name you like", "delimitedtext")

t1 = datetime.now()
print(t1)
print(t1 - t0)

print('Rendering ...')

QgsProject.instance().addMapLayer(vlayer)

The script output shows that creating the vector layer takes 2:39 minutes and rendering it takes another 5:10 minutes:

Starting ...
2020-12-06 12:35:56.266002
Loading CSV ...
2020-12-06 12:38:35.565332
0:02:39.299330
Rendering ...
2020-12-06 12:43:45.637504
0:05:10.072172
Done :)

Rendered CSV file in QGIS

Panning and zooming around are no fun either since rendering takes so long. Changing from a single symbol renderer to, for example, a heatmap renderer does not improve the rendering times. So we need a different solution when we want to efficiently explore large point CSV files.

The Pandas data analysis library is well-known for being a convenient tool for handling CSVs. However, it’s less clear how to use it as a replacement for desktop GIS for exploring large CSVs with point coordinates. My favorite solution so far uses hvPlot + HoloViews + Datashader to provide interactive Bokeh plots in Jupyter notebooks.

hvPlot is a high-level plotting API built on HoloViews that offers a general and consistent interface for plotting data in (Geo)Pandas, xarray, NetworkX, dask, and others. (Image source: https://hvplot.holoviz.org)

But first things first! Loading the CSV as a Pandas DataFrame takes 10.7 seconds. Pandas’ default plotting function (based on Matplotlib), however, takes around 13 seconds and only produces a static scatter plot.

Loading and plotting the CSV with Pandas

hvPlot to the rescue!

We only need two more steps to get faster and interactive map plots (plus background maps!): First, we need to reproject the lat/lon values. (There’s a warning here, most likely since some of the input lat/lon values are invalid.) Then, we replace plot() with hvplot() and voilà:

Plotting the CSV with Datashader

As you can see from the above GIF, the whole process barely takes 2 seconds and the resulting map plot is interactive and very responsive.
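
Since the notebook code is only visible in the GIF, here is a small sketch of what the reprojection step does. It applies the standard spherical Web Mercator formula (EPSG:4326 degrees to EPSG:3857 meters), which is the coordinate system used by typical background map tiles. The function name is made up for illustration, and returning NaN for out-of-range input is an assumption about how one might handle invalid coordinates; it is not necessarily what the helper used in the notebook does.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def lonlat_to_web_mercator(lon, lat):
    """Convert lon/lat in degrees to Web Mercator easting/northing in meters.

    Out-of-range coordinates (assumed to be invalid input) yield (nan, nan)
    instead of raising, so a whole column can be converted in one pass.
    """
    if not (-180.0 <= lon <= 180.0 and -90.0 < lat < 90.0):
        return (math.nan, math.nan)
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4.0 + math.radians(lat) / 2.0)
    )
    return (x, y)


print(lonlat_to_web_mercator(10.0, 55.0))   # a valid position
print(lonlat_to_web_mercator(10.0, 91.0))   # latitude out of range
```

The logarithm is undefined at the poles and for latitudes beyond them, which is one plausible reason why invalid lat/lon values in the input trigger a warning during reprojection.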

12 million points are far from the limit. As long as the Pandas DataFrame fits into memory, we are good and when the datasets get bigger than that, there are Dask DataFrames. But that’s a story for another day.

If you are following QGIS topics on social media, you may have already seen this, but if you haven’t, I recommend having a look at Tim Sutton’s most recent adventures in building dashboards with QGIS:

The dashboard is built using labeling and geometry generator functionality. This means that it works in the QGIS application map window as well as in layouts. As hinted at in the screenshot above, the dashboard can show information about whole layers as well as interactive selections.

Here’s a full walk-through Tim published yesterday:

You can follow the further development via Tim’s tweets or the dedicated GitHub issue (where you can even find an example QGIS dashboard project in a GeoPackage for download).

Update July 2021: the style hub is now part of the official QGIS infrastructure at https://plugins.qgis.org/styles/


The 2016 post More icons & symbols for QGIS still regularly makes it to the top 10 list of posts by visitors. I wouldn’t attribute this popularity to the quality of this particular post, however. Instead, it’s a pretty clear sign that QGIS users are actively searching for more styling resources to add to their installations.

When it comes to styling resources, the person to follow right now is clearly Klas Karlsson who’s been keeping a steady stream of styling-related posts coming to Twitter:

Additionally, he’s the master-mind behind QGIS Hub, a – currently prototypical – platform for sharing styling resources and print layout templates:

If you are interested in sharing styling resources, head over there. Similarly, if you want to lend a hand developing QGIS Hub, get in touch!

The latest v0.5 release is now available from conda-forge.

New features include:

As always, all tutorials are available on MyBinder:

Detected stops (left) and trajectory split at stops (right)

On Thursday, I was awarded the 2020 Sol Katz award for Free and Open Source Software for Geospatial. I feel very honored to have been selected for this award and I’d like to take this opportunity to share a few words of thanks:

As people working in open source projects, we are constantly reminded that we are all standing on the shoulders of giants. However, particularly this year, we also see just how important small personal connections are. For me, my involvement with open source communities really took off when I joined the QGIS hackfest in Vienna in 2009 and I felt that my participation was really appreciated and welcome. I couldn’t imagine being without these connections anymore.

Thank you to the whole QGIS community, particularly my fellow PSC members both current and former: Tim, Andreas, Jürgen, Richard, Paolo, Otto, Marco Hugentobler, Alessandro, our new chair Marco Bernasocchi, and of course QGIS founder Gary Sherman for starting this awesome project and for still being around and actively promoting geospatial open source by publishing so many great books covering multiple different OSGeo projects.

I’d also like to thank my partner and my family for being incredibly understanding whenever I spend my time geeking out over a new programming project, data analysis, forum question, or conference talk.

Thank you also to my friends, colleagues and fellow members of the larger OSGeo community for sharing ideas, providing valuable feedback, and spreading the word about all the great work that’s going on all around us.

I’m constantly amazed by all the innovation happening to nourish and grow our community. And I’m looking forward to continuing to be a part of these efforts.

Thank you!

 

Rendering large sets of trajectory lines gets messy fast. Different aggregation approaches have been developed to address this issue. However, most approaches, such as mobility graphs or generalized flow maps, cannot handle large input datasets. Building on M³ prototypes, the following approach can be used in distributed computing environments to extract flows from large datasets.

This is part 3 of “Exploring massive movement datasets”.

This flow extraction is based on a two-step process, conceptually similar to Andrienko flow maps: first, we extract M³ prototypes from the movement data. In the second step, we determine flows between these prototypes, including information about: distribution of travel speeds and number of observed transitions. The resulting flows can be visualized, for example, to explore the popularity of different paths of movement:

After the prototypes have been computed, the flow algorithm computes transitions between pairs of prototypes. An object moving from prototype A to prototype B triggers an update of the corresponding flow. To allow for distributed processing, each node in the distributed computing environment needs a copy of the previously computed prototypes. Additionally, the raw movement data records need to be converted into trajectories. Afterwards, each trajectory is processed independently, going through its records in chronological order:

  1. Find the best matching prototype for the current record
  2. Ensure that the distance to the match is below the distance threshold and that the matched prototype is different from the previous prototype
  3. Get or create the flow between the two prototypes
  4. Ensure that the prototype and flow directions are a good match for the current record’s direction
  5. Update the flow properties: travel speed and number of transitions, as well as the previous prototype reference
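
The five steps above can be sketched in plain Python. This is a simplified illustration, not the implementation from the paper: prototypes are reduced to an id, a planar position, and a mean direction; matching uses the nearest prototype by planar distance; and a flow only stores a transition count and a speed sum instead of a full speed distribution. The thresholds (max_dist, max_angle) and all names are assumptions. The sketch also checks directions (step 4) before creating the flow (step 3) so that rejected records never leave an empty flow behind.

```python
import math


def heading_diff(a, b):
    """Smallest absolute difference between two headings in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def bearing(p, q):
    """Planar bearing from p to q in degrees (0 = +y axis, clockwise)."""
    return math.degrees(math.atan2(q["x"] - p["x"], q["y"] - p["y"])) % 360.0


def distance(proto, rec):
    return math.hypot(proto["x"] - rec["x"], proto["y"] - rec["y"])


def extract_flows(trajectory, prototypes, flows, max_dist=1.0, max_angle=45.0):
    """Update `flows` with the transitions of one chronologically sorted trajectory."""
    previous = None
    for rec in trajectory:
        # Step 1: best matching prototype (here simply the nearest one)
        proto = min(prototypes, key=lambda p: distance(p, rec))
        # Step 2: distance threshold, and the prototype must have changed
        if distance(proto, rec) > max_dist:
            continue
        if previous is not None and proto["id"] == previous["id"]:
            continue
        # Step 4 (prototype part): prototype direction must fit the record
        if heading_diff(proto["direction"], rec["direction"]) > max_angle:
            continue
        if previous is not None:
            # Step 4 (flow part): flow direction must fit the record
            if heading_diff(bearing(previous, proto), rec["direction"]) > max_angle:
                continue
            # Step 3: get or create the flow between the two prototypes
            key = (previous["id"], proto["id"])
            flow = flows.setdefault(key, {"count": 0, "speed_sum": 0.0})
            # Step 5: update the flow properties
            flow["count"] += 1
            flow["speed_sum"] += rec["speed"]
        # Step 5: remember the prototype for the next transition
        previous = proto
    return flows


prototypes = [
    {"id": "A", "x": 0.0, "y": 0.0, "direction": 90.0},
    {"id": "B", "x": 10.0, "y": 0.0, "direction": 90.0},
    {"id": "C", "x": 20.0, "y": 0.0, "direction": 90.0},
]
trajectory = [  # one object heading east
    {"x": 0.2, "y": 0.0, "direction": 90.0, "speed": 10.0},
    {"x": 4.0, "y": 0.0, "direction": 90.0, "speed": 12.0},
    {"x": 9.8, "y": 0.0, "direction": 90.0, "speed": 12.0},
    {"x": 10.1, "y": 0.0, "direction": 90.0, "speed": 14.0},
    {"x": 19.9, "y": 0.0, "direction": 90.0, "speed": 14.0},
]
flows = extract_flows(trajectory, prototypes, flows={})
print(flows)
```

Because the flows dictionary is passed in, several trajectories can be processed one after another (or on different nodes, with the partial results merged afterwards) while only the prototypes, the flows, and the current trajectory are held in memory.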

This approach scales to large datasets since only the prototypes, the (intermediate) flow results, and the trajectory currently being worked on have to be kept in memory for each iteration. However, this algorithm does not allow for continuous updates. Flows would have to be recomputed (at least locally) whenever prototypes changed. Therefore, the algorithm does not support exploration of continuous data streams. However, it can be used to explore large historical datasets:

Flow example: passenger vessel speed patterns showing mean flow speeds (line color: darker colors equal higher speeds) and speed variation (line width)

If you want to dive deeper, here’s the full paper:

[1] Graser, A., Widhalm, P., & Dragaschnig, M. (2020). Extracting Patterns from Large Movement Datasets. GI_Forum – Journal of Geographic Information Science, 1-2020, 153-163. doi:10.1553/giscience2020_01_s153.


This post is part of a series. Read more about movement data in GIS.

To explore travel patterns like origin-destination relationships, we need to identify individual trips with their start/end locations and trajectories between them. Extracting these trajectories from large datasets can be challenging, particularly if the records of individual moving objects don’t fit into memory anymore and if the spatial and temporal extent varies widely (as is the case with ship data, where individual vessel journeys can take weeks while crossing multiple oceans). 

This is part 2 of “Exploring massive movement datasets”.

Roughly speaking, trip trajectories can be generated by first connecting consecutive records into continuous tracks and then splitting them at stops. This general approach applies to many different movement datasets. However, the processing details (e.g. stop detection parameters) and preprocessing steps (e.g. removing outliers) vary depending on input dataset characteristics.
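
As a rough illustration of the “split at stops” idea, the following sketch cuts a chronologically sorted track wherever the object stays slow for long enough. It is deliberately simplified and not the method from the paper: stops are detected from reported speed only, whereas stop detection is often based on how long an object stays within a small area. The parameter names and default values are assumptions.

```python
def split_at_stops(track, max_stop_speed=0.5, min_stop_duration=600):
    """Split a chronologically sorted track into trips at detected stops.

    A stop is assumed to be a run of consecutive records with
    speed <= max_stop_speed lasting at least min_stop_duration seconds.
    Records that belong to a stop are dropped from the resulting trips.
    """
    trips, current, slow_run = [], [], []
    for rec in track:
        if rec["speed"] <= max_stop_speed:
            slow_run.append(rec)
            continue
        # The object is moving again: decide what the slow run was.
        if slow_run:
            duration = slow_run[-1]["t"] - slow_run[0]["t"]
            if duration >= min_stop_duration:
                if current:
                    trips.append(current)  # a real stop ends the trip
                current = []
            else:
                current.extend(slow_run)   # brief slowdown, keep it
            slow_run = []
        current.append(rec)
    # Handle a slow run at the very end of the track.
    if slow_run and slow_run[-1]["t"] - slow_run[0]["t"] < min_stop_duration:
        current.extend(slow_run)
    if current:
        trips.append(current)
    return trips


track = [
    {"t": 0, "speed": 10.0},
    {"t": 60, "speed": 10.0},
    {"t": 120, "speed": 0.0},    # stop begins
    {"t": 900, "speed": 0.0},    # still stopped after 13 minutes
    {"t": 960, "speed": 8.0},
    {"t": 1020, "speed": 9.0},
    {"t": 1080, "speed": 0.2},   # short slowdown, not a stop
    {"t": 1100, "speed": 7.0},
]
trips = split_at_stops(track)
print([[rec["t"] for rec in trip] for trip in trips])
```

Tuning values such as the speed threshold and minimum duration is exactly the kind of dataset-specific processing detail mentioned above.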

For example, in our paper [1], we extracted vessel journeys from AIS data which meant that we also had to account for observation gaps when ships leave the observable (usually coastal) areas. In the accompanying 10-minute talk, I went through a 4-step trajectory exploration workflow for assessing our dataset’s potential for travel time prediction:

Click to watch the recorded talk

Like the M³ prototype computation presented in part 1, our trajectory aggregation approach is implemented in Spark. The challenges are both the massive amounts of trajectory data and the fact that operations only produce correct results if applied to a complete and chronologically sorted set of location records. This is difficult because Spark core libraries (version 2.4.5 at the time) are mostly geared towards dealing with unsorted data. This means that, when high-level Spark core functionality is used naively, an aggregator needs to collect and sort the entire track in the main memory of a single processing node. Consequently, out-of-memory errors are frequent when dealing with large datasets.

To solve this challenge, our implementation is based on the Secondary Sort pattern and on Spark’s aggregator concept. Secondary Sort takes care to first group records by a key (e.g. the moving object id), and only in the second step, when iterating over the records of a group, the records are sorted (e.g. chronologically). The resulting iterator can be used by an aggregator that implements the logic required to build trajectories based on gaps and stops detected in the dataset.
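
The following single-machine sketch emulates the pattern in plain Python to show the data flow; it is not Spark code and not the implementation from the paper. Sorting by the composite key (id, timestamp) stands in for the sort that, in Spark, would happen during the shuffle (commonly done with a composite key and repartitionAndSortWithinPartitions, although the paper’s exact mechanism is not described here). The aggregator then consumes each group as a sorted iterator and only splits at observation gaps; the max_gap value and all names are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def build_trajectories(records, max_gap=3600):
    """Group records by object id, iterate them in time order, split at gaps."""
    trajectories = []
    # Emulated Secondary Sort: order by (id, t) so that each group's
    # iterator already yields its records chronologically.
    ordered = sorted(records, key=itemgetter("id", "t"))
    for obj_id, group in groupby(ordered, key=itemgetter("id")):
        current = []
        for rec in group:  # the aggregator sees one record at a time
            if current and rec["t"] - current[-1]["t"] > max_gap:
                trajectories.append((obj_id, current))  # observation gap
                current = []
            current.append(rec)
        if current:
            trajectories.append((obj_id, current))
    return trajectories


records = [  # deliberately unsorted input
    {"id": "b", "t": 30},
    {"id": "a", "t": 10000},
    {"id": "a", "t": 0},
    {"id": "b", "t": 5},
    {"id": "a", "t": 60},
]
result = build_trajectories(records)
print([(obj_id, [rec["t"] for rec in traj]) for obj_id, traj in result])
```

The important property is that the aggregator never has to sort a whole track itself: it only keeps the trajectory it is currently building, which is what avoids the out-of-memory problem described above.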

If you want to dive deeper, here’s the full paper:

[1] Graser, A., Dragaschnig, M., Widhalm, P., Koller, H., & Brändle, N. (2020). Exploratory Trajectory Analysis for Massive Historical AIS Datasets. In: 21st IEEE International Conference on Mobile Data Management (MDM) 2020. doi:10.1109/MDM48529.2020.00059


This post is part of a series. Read more about movement data in GIS.

Visualizations of raw movement data records, that is, simple point maps or point density (“heat”) maps provide very limited data exploration capabilities. Therefore, we need clever aggregation approaches that can actually reveal movement patterns. Many existing aggregation approaches, however, do not scale to large datasets. We therefore developed the M³ Massive Movement Model [1] which supports distributed computing environments and can be incrementally updated with new data.

This is part 1 of “Exploring massive movement datasets”.

Using state-of-the-art big geospatial tools, such as GeoMesa, it is quite straightforward to ingest, index and query large amounts of timestamped location records. Thanks to GeoMesa’s GeoServer integration, it is also possible to publish GeoMesa tables as WMS and WFS which can be visualized in QGIS and explored (for more about GeoMesa, see Scalable spatial vector data processing).

So far so good! But with this basic setup, we only get point maps and point density maps which don’t tell us much about important movement characteristics like speed and direction (particularly if the reporting interval between consecutive location records is irregular). Therefore, we developed an aggregation method, which we call M³, that models local record density as well as movement speed and direction.

For distributed computation, we need to split large datasets into chunks. To build models of local movement characteristics, it makes sense to create spatial or spatiotemporal chunks that can be processed independently. We therefore split the data along a regular grid but instead of computing one average value per grid cell, we create a flexible number of prototypes that describe the movement in the cell. Each prototype models a location, speed, and direction distribution (mean and sigma).
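
To illustrate the grid-plus-prototypes idea, here is a heavily simplified sketch. It is not the M³ algorithm from the paper: the rule for matching a record to a prototype (nearest prototype within max_dist, otherwise create a new one) is an assumption, direction is left out because circular statistics need extra care, and the cell size and threshold values are arbitrary. What it does show is how each cell can hold a flexible number of prototypes whose mean and sigma are updated incrementally.

```python
import math


class RunningStat:
    """Incrementally updated mean and sigma (Welford's algorithm)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def sigma(self):
        return math.sqrt(self.m2 / self.n) if self.n else 0.0


def cell_of(x, y, cell_size):
    """Index of the regular grid cell that contains (x, y)."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def add_record(grid, rec, cell_size=10.0, max_dist=2.0):
    """Assign a record to a prototype in its grid cell, creating one if needed."""
    prototypes = grid.setdefault(cell_of(rec["x"], rec["y"], cell_size), [])

    def dist(p):
        return math.hypot(p["x"].mean - rec["x"], p["y"].mean - rec["y"])

    best = min(prototypes, key=dist, default=None)
    if best is None or dist(best) > max_dist:
        # No sufficiently close prototype in this cell: start a new one.
        best = {"x": RunningStat(), "y": RunningStat(), "speed": RunningStat()}
        prototypes.append(best)
    for name in ("x", "y", "speed"):
        best[name].update(rec[name])


grid = {}
for rec in [
    {"x": 1.0, "y": 1.0, "speed": 10.0},
    {"x": 1.5, "y": 1.0, "speed": 12.0},   # joins the first prototype
    {"x": 8.0, "y": 8.0, "speed": 2.0},    # same cell, but far away
    {"x": 12.0, "y": 1.0, "speed": 11.0},  # neighboring cell
]:
    add_record(grid, rec)

for cell, prototypes in sorted(grid.items()):
    print(cell, [(p["speed"].n, p["speed"].mean, p["speed"].sigma) for p in prototypes])
```

Since a record only ever touches the prototypes of its own cell, cells can be processed independently, and new records can be added later without recomputing existing prototypes from scratch.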

In our paper, we used M³ to explore ship movement data. We turned roughly 4 billion AIS records into prototypes:

M³ for ship movement data during January to December 2017 (3.9 billion records turned into 3.4 million prototypes; computing time: 41 minutes)

The above plot really only gives a first impression of the spatial distribution of ship movement records. The real value of M³ becomes clearer when we zoom in and start exploring regional patterns. Then we can discover vessel routes, speeds, and movement directions:

The prototype details on the right side, in particular, show the strength of the prototype idea: even though the grid cells we use are rather large, the prototypes clearly form along vessel routes. We can see exactly where these routes are and at what speeds ships travel there, without having to increase the grid resolution to impractical values. Slow prototypes with high direction sigma (red+black markers) are clear indicators of ports. The marker size shows the number of records per prototype and thus helps distinguish heavily traveled routes from minor ones.

M³ is implemented in Spark. We read raw location records from GeoMesa and write prototypes to GeoMesa. All maps have been created in QGIS using prototype data published as GeoServer WFS.

If you want to dive deeper, here’s the full paper:

[1] Graser, A., Widhalm, P., & Dragaschnig, M. (2020). The M³ massive movement model: a distributed incrementally updatable solution for big movement data exploration. International Journal of Geographical Information Science. doi:10.1080/13658816.2020.1776293.


This post is part of a series. Read more about movement data in GIS.

We’ve done it again!

This time, Daniel O’Donohue and I talked about spatiotemporal data in GIS, including – of course – Time Manager, the new QGIS temporal support, and MovingPandas.

 

Since we need both data and tools to do spatiotemporal analysis, we also talked about file formats and data models. If you want to know more about data models for spatiotemporal (especially movement) data, have a look at the latest discussion paper I wrote together with Esteban Zimányi (MobilityDB) and Krishna Chaitanya Bommakanti (mobilitydb-sqlalchemy):

Data model of the Moving Features standard illustrated with two moving points A and B. Stars mark changes in attribute values. (Source: Graser et al. (2020))

For more details and all options for listening to this podcast, visit mapscaping.com.

 
