Archive

Visualization

After writing “Towards a template for exploring movement data” last year, I spent a lot of time thinking about how to develop a solid approach for movement data exploration that would help analysts and scientists to better understand their datasets. Finally, my search led me to the excellent paper “A protocol for data exploration to avoid common statistical problems” by Zuur et al. (2010). What they had done for the analysis of common ecological datasets was very close to what I was trying to achieve for movement data. I followed Zuur et al.’s approach of a exploratory data analysis (EDA) protocol and combined it with a typology of movement data quality problems building on Andrienko et al. (2016). Finally, I brought it all together in a Jupyter notebook implementation which you can now find on Github.

There are two options for running the notebook:

  1. The repo contains a Dockerfile you can use to spin up a container including all necessary datasets and a fitting Python environment.
  2. Alternatively, you can download the datasets manually and set up the Python environment using the provided environment.yml file.

The dataset contains over 10 million location records. Most visualizations are based on Holoviz Datashader with a sprinkling of MovingPandas for visualizing individual trajectories.

Point density map of 10 million location records, visualized using Datashader

Line density map for detecting gaps in tracks, visualized using Datashader

Example trajectory with strong jitter, visualized using MovingPandas & GeoViews

 

I hope this reference implementation will provide a starting point for many others who are working with movement data and who want to structure their data exploration workflow.

If you want to dive deeper, here’s the paper:

[1] Graser, A. (2021). An exploratory data analysis protocol for identifying problems in continuous movement data. Journal of Location Based Services. doi:10.1080/17489725.2021.1900612.

(If you don’t have institutional access to the journal, the publisher provides 50 free copies using this link. Once those are used up, just leave a comment below and I can email you a copy.)

References


This post is part of a series. Read more about movement data in GIS.

In the previous post, we explored how hvPlot and Datashader can help us to visualize large CSVs with point data in interactive map plots. Of course, the spatial distribution of points usually only shows us one part of the whole picture. Today, we’ll therefore look into how to explore other data attributes by linking other (non-spatial) plots to the map.

This functionality, referred to as “linked brushing” or “crossfiltering” is under active development and the following experiment was prompted by a recent thread on Twitter launched by @plotlygraphs announcement of HoloViews 1.14:

Turns out these features are not limited to plotly but can also be used with Bokeh and hvPlot:

Like in the previous post, this demo uses a Pandas DataFrame with 12 million rows (and HoloViews 1.13.4).

In addition to the map plot, we also create a histogram from the same DataFrame:

map_plot = df.hvplot.scatter(x='x', y='y', datashade=True, height=300, width=400)
hist_plot = df.where((df.SOG>0) & (df.SOG<50)).hvplot.hist("SOG",  bins=20, width=400, height=200) 

To link the two plots, we use HoloViews’ link_selections function:

from holoviews.selection import link_selections
linked_plots = link_selections(map_plot + hist_plot)

That’s all! We can now perform spatial filters in the map and attribute value filters in the histogram and the filters are automatically applied to the linked plots:

Linked brushing demo using ship movement data (AIS): filtering records by speed (SOG) reveals spatial patterns of fast and slow movement.

You’ve probably noticed that there is no background map in the above plot. I had to remove the background map tiles to get rid of an error in Holoviews 1.13.4. This error has been fixed in 1.14.0 but I ran into other issues with the datashaded Scatterplot.

It’s worth noting that not all plot types support linked brushing. For the complete list, please refer to http://holoviews.org/user_guide/Linked_Brushing.html

Even with all their downsides, CSV files are still a common data exchange format – particularly between disciplines with different tech stacks. Indeed, “How to Specify Data Types of CSV Columns for Use in QGIS” (originally written in 2011) is still one of the most popular posts on this blog. QGIS continues to be quite handy for visualizing CSV file contents. However, there are times when it’s just not enough, particularly when the number of rows in the CSV is in the range of multiple million. The following example uses a 12 million point CSV:

To give you an idea of the waiting times in QGIS, I’ve run the following script which loads and renders the CSV:

from datetime import datetime

def get_time():
    t2 = datetime.now()
    print(t2)
    print(t2-t1)
    print('Done :)')

canvas = iface.mapCanvas()
canvas.mapCanvasRefreshed.connect(get_time)

print('Starting ...')

t0 = datetime.now()
print(t0)

print('Loading CSV ...')

uri = "file:///E:/Geodata/AISDK/raw_ais/aisdk_20170701.csv?type=csv&amp;xField=Longitude&amp;yField=Latitude&amp;crs=EPSG:4326&amp;"
vlayer = QgsVectorLayer(uri, "layer name you like", "delimitedtext")

t1 = datetime.now()
print(t1)
print(t1 - t0)

print('Rendering ...')

QgsProject.instance().addMapLayer(vlayer)

The script output shows that creating the vector layer takes 02:39 minutes and rendering it takes over 05:10 minutes:

Starting ...
2020-12-06 12:35:56.266002
Loading CSV ...
2020-12-06 12:38:35.565332
0:02:39.299330
Rendering ...
2020-12-06 12:43:45.637504
0:05:10.072172
Done :)

Rendered CSV file in QGIS

Panning and zooming around are no fun either since rendering takes so long. Changing from a single symbol renderer to, for example, a heatmap renderer does not improve the rendering times. So we need a different solutions when we want to efficiently explore large point CSV files.

The Pandas data analysis library is well-know for being a convenient tool for handling CSVs. However, it’s less clear how to use it as a replacement for desktop GIS for exploring large CSVs with point coordinates. My favorite solution so far uses hvPlot + HoloViews + Datashader to provide interactive Bokeh plots in Jupyter notebooks.

hvPlot provides a high-level plotting API built on HoloViews that provides a general and consistent API for plotting data in (Geo)Pandas, xarray, NetworkX, dask, and others. (Image source: https://hvplot.holoviz.org)

But first things first! Loading the CSV as a Pandas Dataframe takes 10.7 seconds. Pandas’ default plotting function (based on Matplotlib), however, takes around 13 seconds and only produces a static scatter plot.

Loading and plotting the CSV with Pandas

hvPlot to the rescue!

We only need two more steps to get faster and interactive map plots (plus background maps!): First, we need to reproject the lat/lon values. (There’s a warning here, most likely since some of the input lat/lon values are invalid.) Then, we replace plot() with hvplot() and voilà:

Plotting the CSV with Datashader

As you can see from the above GIF, the whole process barely takes 2 seconds and the resulting map plot is interactive and very responsive.

12 million points are far from the limit. As long as the Pandas DataFrame fits into memory, we are good and when the datasets get bigger than that, there are Dask DataFrames. But that’s a story for another day.

We’ve done it again!

This time, Daniel O’Donohue and I talked about spatiotemporal data in GIS, including – of course – Time Manager, the new QGIS temporal support, and MovingPandas.

 

Since we need both data and tools to do spatiotemporal analysis, we also talked about file formats and data models. If you want to know more about data models for spatiotemporal (especially movement) data, have a look at the latest discussion paper I wrote together with Esteban Zimányi (MobilityDB) and Krishna Chaitanya Bommakanti (mobilitydb-sqlalchemy):

Data model of the Moving Features standard illustrated with two moving points A and B. Stars mark changes in attribute values. (Source: Graser et al. (2020))

For more details and all options for listening to this podcast, visit mapscaping.com.

 

TimeManager turns 10 this year. The code base has made the transition from QGIS 1.x to 2.x and now 3.x and it would be wrong to say that it doesn’t show ;-)

Now, it looks like the days of TimeManager are numbered. Four days ago, Nyall Dawson has added native temporal support for vector layers to QGIS. This is part of a larger effort of adding time support for rasters, meshes, and now also vectors.

The new Temporal Controller panel looks similar to TimeManager. Layers are configured through the new Temporal tab in Layer Properties. The temporal dimension can be used in expressions to create fancy time-dependent styles:

temporal1

TimeManager Geolife demo converted to Temporal Controller (click for full resolution)

Obviously, this feature is brand new and will require polishing. Known issues listed by Nyall include limitations of supported time fields (only fields with datetime type are supported right now, strings cannot be used) and worse performance than TimeManager since features are filtered in QGIS rather than in the backend.

If you want to give the new Temporal Controller a try, you need to install the current development version, e.g. qgis-dev in OSGeo4W.


Update from May 16:

Many of the limitations above have already been addressed.

Last night, Nyall has recorded a one hour tutorial on this new feature, enjoy:

This post introduces Holoviz Panel, a library that makes it possible to create really quick dashboards in notebook environments as well as more sophisticated custom interactive web apps and dashboards.

The following example shows how to use Panel to explore a dataset (a trajectory collection in this case) and different parameter settings (relating to trajectory generalization). All the Panel code we need is a dict that defines the parameters that we want to explore. Then we can use Panel’s interact function to automatically generate a dashboard for our custom plotting function:

import panel as pn

kw = dict(traj_id=(1, len(traj_collection)), 
          tolerance=(10, 100, 10), 
          generalizer=['douglas-peucker', 'min-distance'])
pn.interact(plot_generalized, **kw)

Click to view the resulting dashboard in full resolution:

The plotting function uses the parameters to generate a Holoviews plot. First it fetches a specific trajectory from the trajectory collection. Then it generalizes the trajectory using the specified parameter settings. As you can see, we can easily combine maps and other plots to visualize different aspects of the data:

def plot_generalized(traj_id=1, tolerance=10, generalizer='douglas-peucker'):
  my_traj = traj_collection.get_trajectory(traj_id).to_crs(CRS(4088))
  if generalizer=='douglas-peucker':
    generalized = mpd.DouglasPeuckerGeneralizer(my_traj).generalize(tolerance)
  else:
    generalized = mpd.MinDistanceGeneralizer(my_traj).generalize(tolerance)
  generalized.add_speed(overwrite=True)
  return ( 
    generalized.hvplot(
      title='Trajectory {} (tolerance={})'.format(my_traj.id, tolerance), 
      c='speed', cmap='Viridis', colorbar=True, clim=(0,20), 
      line_width=10, width=500, height=500) + 
    generalized.df['speed'].hvplot.hist(
      title='Speed histogram', width=300, height=500) 
    )

Trajectory collections and generalization functions used in this example are part of the MovingPandas library. If you are interested in movement data analysis, you should check it out! You can find this example notebook in the MovingPandas tutorial section.

We recently published a new paper on “Open Geospatial Tools for Movement Data Exploration” (open access). If you liked Movement data in GIS #26: towards a template for exploring movement data, you will find even more information about the context, challenges, and recent developments in this paper.

It also presents three open source stacks for movement data exploration:

  1. QGIS + PostGIS: a combination that will be familiar to most open source GIS users
  2. Jupyter + MovingPandas: less common so far, but Jupyter notebooks are quickly gaining popularity (even in the proprietary GIS world)
  3. GeoMesa + Spark: for when datasets become too big to handle using other means

and discusses their capabilities and limitations:


This post is part of a series. Read more about movement data in GIS.

In December, I wrote about GeoPandas on Databricks. Back then, I also tried to get MovingPandas working but without luck. (While GeoPandas can be installed using Databricks’ dbutils.library.installPyPI("geopandas") this PyPI install just didn’t want to work for MovingPandas.)

Now that MovingPandas is available from conda-forge, I gave it another try and … *spoiler alert* … it works!

First of all, conda support on Databricks is in beta. It’s not included in the default runtimes. At the time of writing this post, “6.0 Conda Beta” is the latest runtime with conda:

Once the cluster is up and connected to the notebook, a quick conda list shows the installed packages:

Time to install MovingPandas! I went with a 100% conda-forge installation. This takes a looong time (almost half an hour)!

When the installs are finally done, it get’s serious: time to test the imports!

Success!

Now we can put the MovingPandas data structures to good use. But first we need to load some movement data:

Or course, the points in this GeoDataFrame can be plotted. However, the plot isn’t automatically displayed once plot() is called on the GeoDataFrame. Instead, Databricks provides a display() function to display Matplotlib figures:

MovingPandas also uses Matplotlib. Therefore we can use the same approach to plot the TrajectoryCollection that can be created from the GeoDataFrame:

These Matplotlib plots are nice and quick but they lack interactivity and therefore are of limited use for data exploration.

MovingPandas provides interactive plotting (including base maps) using hvplot. hvplot is based on Bokeh and, luckily, the Databricks documentation tells us that bokeh plots can be exported to html and then displayed using  displayHTML():

Of course, we could achieve all this on MyBinder as well (and much more quickly). However, Databricks gets interesting once we can add (Py)Spark and distributed processing to the mix. For example, “Getting started with PySpark & GeoPandas on Databricks” shows a spatial join function that adds polygon information to a point GeoDataFrame.

A potential use case for MovingPandas would be to speed up flow map computations. The recently added aggregator functionality (currently in master only) first computes clusters of significant trajectory points and then aggregates the trajectories into flows between these clusters. Matching trajectory points to the closest cluster could be a potential use case for distributed computing. Each trajectory (or each point) can be handled independently, only the cluster locations have to be broadcast to all workers.

Flow map (screenshot from MovingPandas tutorial 4_generalization_and_aggregation.ipynb)

 

This post is a follow-up to the draft template for exploring movement data I wrote about in my previous post. Specifically, I want to address step 4: Exploring patterns in trajectory and event data.

The patterns I want to explore in this post are clusters of trip origins. The case study presented here is an extension of the MovingPandas ship data analysis notebook.

The analysis consists of 4 steps:

  1. Splitting continuous GPS tracks into individual trips
  2. Extracting trip origins (start locations)
  3. Clustering trip origins
  4. Exploring clusters

Since I have already removed AIS records with a speed over ground (SOG) value of zero from the dataset, we can use the split_by_observation_gap() function to split the continuous observations into individual trips. Trips that are shorter than 100 meters are automatically discarded as irrelevant clutter:

traj_collection.min_length = 100
trips = traj_collection.split_by_observation_gap(timedelta(minutes=5))

The split operation results in 302 individual trips:

Passenger vessel trajectories are blue, high-speed craft green, tankers red, and cargo vessels orange. Other vessel trajectories are gray.

To extract trip origins, we can use the get_start_locations() function. The list of column names defines which columns are carried over from the trajectory’s GeoDataFrame to the origins GeoDataFrame:

 
origins = trips.get_start_locations(['SOG', 'ShipType']) 

The following density-based clustering step is based on a blog post by Geoff Boeing and uses scikit-learn’s DBSCAN implementation:

from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

origins['lat'] = origins.geometry.y
origins['lon'] = origins.geometry.x
matrix = origins.as_matrix(columns=['lat', 'lon'])

kms_per_radian = 6371.0088
epsilon = 0.1 / kms_per_radian

db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(matrix))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.Series([matrix[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: {}'.format(num_clusters))

Resulting in 69 clusters.

Finally, we can add the cluster labels to the origins GeoDataFrame and plot the result:

origins['cluster'] = cluster_labels

To analyze the clusters, we can compute summary statistics of the trip origins assigned to each cluster. For example, we compute a representative (center-most) point, count the number of trips, and compute the mean speed (SOG) value:

 
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return Point(tuple(centermost_point)[1], tuple(centermost_point)[0])
centermost_points = clusters.map(get_centermost_point) 

The largest cluster with a low mean speed (indicating a docking or anchoring location) is cluster 29 which contains 43 trips from passenger vessels, high-speed craft, an an undefined vessel:

To explore the overall cluster pattern, we can plot the clusters colored by speed and scaled by the number of trips:

Besides cluster 29, this visualization reveals multiple smaller origin clusters with low speeds that indicate different docking locations in the analysis area.

Cluster locations with high speeds on the other hand indicate locations where vessels enter the analysis area. In a next step, it might be interesting to compute flows between clusters to gain insights about connections and travel times.

It’s worth noting that AIS data contains additional information, such as vessel status, that could be used to extract docking or anchoring locations. However, the workflow presented here is more generally applicable to any movement data tracks that can be split into meaningful trips.

For the full interactive ship data analysis tutorial visit https://mybinder.org/v2/gh/anitagraser/movingpandas/binder-tag


This post is part of a series. Read more about movement data in GIS.

Exploring new datasets can be challenging. Addressing this challenge, there is a whole field called exploratory data analysis that focuses on exploring datasets, often with visual methods.

Concerning movement data in particular, there’s a comprehensive book on the visual analysis of movement by Andrienko et al. (2013) and a host of papers, such as the recent state of the art summary by Andrienko et al. (2017).

However, while the literature does provide concepts, methods, and example applications, these have not yet translated into readily available tools for analysts to use in their daily work. To fill this gap, I’m working on a template for movement data exploration implemented in Python using MovingPandas. The proposed workflow consists of five main steps:

  1. Establishing an overview by visualizing raw input data records
  2. Putting records in context by exploring information from consecutive movement data records (such as: time between records, speed, and direction)
  3. Extracting trajectories & events by dividing the raw continuous tracks into individual trajectories and/or events
  4. Exploring patterns in trajectory and event data by looking at groups of the trajectories or events
  5. Analyzing outliers by looking at potential outliers and how they may challenge preconceived assumptions about the dataset characteristics

To ensure a reproducible workflow, I’m designing the template as a a Jupyter notebook. It combines spatial and non-spatial plots using the awesome hvPlot library:

This notebook is a work-in-progress and you can follow its development at http://exploration.movingpandas.org. Your feedback is most welcome!

 

References

  • Andrienko G, Andrienko N, Bak P, Keim D, Wrobel S (2013) Visual analytics of movement. Springer Science & Business Media.
  • Andrienko G, Andrienko N, Chen W, Maciejewski R, Zhao Y (2017) Visual Analytics of Mobility and Transportation: State of the Art and Further Research Directions. IEEE Transactions on Intelligent Transportation Systems 18(8):2232–2249, DOI 10.1109/TITS.2017.2683539
%d bloggers like this: