Archive

Author Archives: underdark

In December, I wrote about GeoPandas on Databricks. Back then, I also tried to get MovingPandas working but without luck. (While GeoPandas can be installed using Databricks’ dbutils.library.installPyPI("geopandas") this PyPI install just didn’t want to work for MovingPandas.)

Now that MovingPandas is available from conda-forge, I gave it another try and … *spoiler alert* … it works!

First of all, conda support on Databricks is in beta. It’s not included in the default runtimes. At the time of writing this post, “6.0 Conda Beta” is the latest runtime with conda:

Once the cluster is up and connected to the notebook, a quick conda list shows the installed packages:

Time to install MovingPandas! I went with a 100% conda-forge installation. This takes a looong time (almost half an hour)!

When the installs are finally done, it get’s serious: time to test the imports!

Success!

Now we can put the MovingPandas data structures to good use. But first we need to load some movement data:

Or course, the points in this GeoDataFrame can be plotted. However, the plot isn’t automatically displayed once plot() is called on the GeoDataFrame. Instead, Databricks provides a display() function to display Matplotlib figures:

MovingPandas also uses Matplotlib. Therefore we can use the same approach to plot the TrajectoryCollection that can be created from the GeoDataFrame:

These Matplotlib plots are nice and quick but they lack interactivity and therefore are of limited use for data exploration.

MovingPandas provides interactive plotting (including base maps) using hvplot. hvplot is based on Bokeh and, luckily, the Databricks documentation tells us that bokeh plots can be exported to html and then displayed using  displayHTML():

Of course, we could achieve all this on MyBinder as well (and much more quickly). However, Databricks gets interesting once we can add (Py)Spark and distributed processing to the mix. For example, “Getting started with PySpark & GeoPandas on Databricks” shows a spatial join function that adds polygon information to a point GeoDataFrame.

A potential use case for MovingPandas would be to speed up flow map computations. The recently added aggregator functionality (currently in master only) first computes clusters of significant trajectory points and then aggregates the trajectories into flows between these clusters. Matching trajectory points to the closest cluster could be a potential use case for distributed computing. Each trajectory (or each point) can be handled independently, only the cluster locations have to be broadcast to all workers.

Flow map (screenshot from MovingPandas tutorial 4_generalization_and_aggregation.ipynb)

 

This post is a follow-up to the draft template for exploring movement data I wrote about in my previous post. Specifically, I want to address step 4: Exploring patterns in trajectory and event data.

The patterns I want to explore in this post are clusters of trip origins. The case study presented here is an extension of the MovingPandas ship data analysis notebook.

The analysis consists of 4 steps:

  1. Splitting continuous GPS tracks into individual trips
  2. Extracting trip origins (start locations)
  3. Clustering trip origins
  4. Exploring clusters

Since I have already removed AIS records with a speed over ground (SOG) value of zero from the dataset, we can use the split_by_observation_gap() function to split the continuous observations into individual trips. Trips that are shorter than 100 meters are automatically discarded as irrelevant clutter:

traj_collection.min_length = 100
trips = traj_collection.split_by_observation_gap(timedelta(minutes=5))

The split operation results in 302 individual trips:

Passenger vessel trajectories are blue, high-speed craft green, tankers red, and cargo vessels orange. Other vessel trajectories are gray.

To extract trip origins, we can use the get_start_locations() function. The list of column names defines which columns are carried over from the trajectory’s GeoDataFrame to the origins GeoDataFrame:

 
origins = trips.get_start_locations(['SOG', 'ShipType']) 

The following density-based clustering step is based on a blog post by Geoff Boeing and uses scikit-learn’s DBSCAN implementation:

from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

origins['lat'] = origins.geometry.y
origins['lon'] = origins.geometry.x
matrix = origins.as_matrix(columns=['lat', 'lon'])

kms_per_radian = 6371.0088
epsilon = 0.1 / kms_per_radian

db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(matrix))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.Series([matrix[cluster_labels == n] for n in range(num_clusters)])
print('Number of clusters: {}'.format(num_clusters))

Resulting in 69 clusters.

Finally, we can add the cluster labels to the origins GeoDataFrame and plot the result:

origins['cluster'] = cluster_labels

To analyze the clusters, we can compute summary statistics of the trip origins assigned to each cluster. For example, we compute a representative (center-most) point, count the number of trips, and compute the mean speed (SOG) value:

 
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return Point(tuple(centermost_point)[1], tuple(centermost_point)[0])
centermost_points = clusters.map(get_centermost_point) 

The largest cluster with a low mean speed (indicating a docking or anchoring location) is cluster 29 which contains 43 trips from passenger vessels, high-speed craft, an an undefined vessel:

To explore the overall cluster pattern, we can plot the clusters colored by speed and scaled by the number of trips:

Besides cluster 29, this visualization reveals multiple smaller origin clusters with low speeds that indicate different docking locations in the analysis area.

Cluster locations with high speeds on the other hand indicate locations where vessels enter the analysis area. In a next step, it might be interesting to compute flows between clusters to gain insights about connections and travel times.

It’s worth noting that AIS data contains additional information, such as vessel status, that could be used to extract docking or anchoring locations. However, the workflow presented here is more generally applicable to any movement data tracks that can be split into meaningful trips.

For the full interactive ship data analysis tutorial visit https://mybinder.org/v2/gh/anitagraser/movingpandas/binder-tag


This post is part of a series. Read more about movement data in GIS.

Exploring new datasets can be challenging. Addressing this challenge, there is a whole field called exploratory data analysis that focuses on exploring datasets, often with visual methods.

Concerning movement data in particular, there’s a comprehensive book on the visual analysis of movement by Andrienko et al. (2013) and a host of papers, such as the recent state of the art summary by Andrienko et al. (2017).

However, while the literature does provide concepts, methods, and example applications, these have not yet translated into readily available tools for analysts to use in their daily work. To fill this gap, I’m working on a template for movement data exploration implemented in Python using MovingPandas. The proposed workflow consists of five main steps:

  1. Establishing an overview by visualizing raw input data records
  2. Putting records in context by exploring information from consecutive movement data records (such as: time between records, speed, and direction)
  3. Extracting trajectories & events by dividing the raw continuous tracks into individual trajectories and/or events
  4. Exploring patterns in trajectory and event data by looking at groups of the trajectories or events
  5. Analyzing outliers by looking at potential outliers and how they may challenge preconceived assumptions about the dataset characteristics

To ensure a reproducible workflow, I’m designing the template as a a Jupyter notebook. It combines spatial and non-spatial plots using the awesome hvPlot library:

This notebook is a work-in-progress and you can follow its development at http://exploration.movingpandas.org. Your feedback is most welcome!

 

References

  • Andrienko G, Andrienko N, Bak P, Keim D, Wrobel S (2013) Visual analytics of movement. Springer Science & Business Media.
  • Andrienko G, Andrienko N, Chen W, Maciejewski R, Zhao Y (2017) Visual Analytics of Mobility and Transportation: State of the Art and Further Research Directions. IEEE Transactions on Intelligent Transportation Systems 18(8):2232–2249, DOI 10.1109/TITS.2017.2683539

Over the last years, many data analysis platforms have added spatial support to their portfolio. Just two days ago, Databricks have published an extensive post on spatial analysis. I took their post as a sign that it is time to look into how PySpark and GeoPandas can work together to achieve scalable spatial analysis workflows.

If you sign up for Databricks Community Edition, you get access to a toy cluster for experimenting with (Py)Spark. This considerably lowers the entry barrier to Spark since you don’t need to bother with installing anything yourself. They also provide a notebook environment:

I’ve followed the official Databricks GeoPandas example notebook but expanded it to read from a real geodata format (GeoPackage) rather than from CSV.

I’m using test data from the MovingPandas repository: demodata_geolife.gpkg contains a hand full of trajectories from the Geolife dataset. Demodata_grid.gpkg contains a simple 3×4 grid that covers the same geographic extent as the geolife sample:

Once the files are downloaded, we can use GeoPandas to read the GeoPackages:

Note that the display() function is used to show the plot.

The same applies to the grid data:

When the GeoDataFrames are ready, we can start using them in PySpark. To do so, it is necessary to convert from GeoDataFrame to PySpark DataFrame. Therefore, I’ve implemented a simple function that performs the conversion and turn the Point geometries into lon and lat columns:

To compute new values for our DataFrame, we can use existing or user-defined functions (UDF). Here’s a simple hello world function and associated UDF:

A spatial UDF is a little more involved. For example, here’s an UDF that finds the first polygon that intersects the specified lat/lon and returns that polygon’s ID. Note how we first broadcast the grid DataFrame to ensure that it is available on all computation nodes:

It’s worth noting that PySpark has its peculiarities. Since it’s a Python wrapper of a strongly typed language, we need to pay close attention to types in our Python code. For example, when defining UDFs, if the specified return type (Integertype in the above example) does not match the actual value returned by the find_intersection() function, this will cause rather cryptic errors.

To plot the results, I’m converting the joined PySpark DataFrame back to GeoDataFrame:

I’ve published this notebook so you can give it a try. (Any notebook published on Databricks is supposed to stay online for six months, so if you’re trying to access it after June 2020, this link may be broken.)

Recently there has been some buzz on Twitter about a new moving object database (MOD) called MobilityDB that builds on PostgreSQL and PostGIS (Zimányi et al. 2019). The MobilityDB Github repo has been published in February 2019 but according to the following presentation at PgConf.Russia 2019 it has been under development for a few years:

Of course, moving object databases have been around for quite a while. The two most commonly cited MODs are HermesDB (Pelekis et al. 2008) which comes as an extension for either PostgreSQL or Oracle and is developed at the University of Piraeus and SECONDO (de Almeida et al. 2006) which is a stand-alone database system developed at the Fernuniversität Hagen. However, both MODs remain at the research prototype level and have not achieved broad adoption.

It will be interesting to see if MobilityDB will be able to achieve the goal they have set in the title of Zimányi et al. (2019) to become “a mainstream moving object database system”. It’s promising that they are building on PostGIS and using its mature spatial analysis functionality instead of reinventing the wheel. They also discuss why they decided that PostGIS trajectories (which I’ve written about in previous posts) are not the way to go:

However, the presentation does not go into detail whether there are any straightforward solutions to visualizing data stored in MobilityDB.

According to the Github readme, MobilityDB runs on Linux and needs PostGIS 2.5. They also provide an online demo as well as a Docker container with MobilityDB and all its dependencies. If you give it a try, I would love to hear about your experiences.

References

  • de Almeida, V. T., Guting, R. H., & Behr, T. (2006). Querying moving objects in secondo. In 7th International Conference on Mobile Data Management (MDM’06) (pp. 47-47). IEEE.
  • Pelekis, N., Frentzos, E., Giatrakos, N., & Theodoridis, Y. (2008). HERMES: aggregative LBS via a trajectory DB engine. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (pp. 1255-1258). ACM.
  • Zimányi, E., Sakr, M., Lesuisse, A., & Bakli, M. (2019). MobilityDB: A Mainstream Moving Object Database System. In Proceedings of the 16th International Symposium on Spatial and Temporal Databases (pp. 206-209). ACM.

This post is part of a series. Read more about movement data in GIS.

In the previous post, I showed how Folium can be used to create interactive maps of GeoPandas GeoDataFrames. Today’s post continues this theme. Specifically, it compares Folium to another dataviz library called hvplot. hvplot also recently added support for GeoDataFrames, so it’s interesting to see how these different solutions compare.

Minimum viable

The following snippets show the minimum code I found to put a GeoDataFrame of Points onto a map with either Folium or hvplot.

Folium does not automatically zoom to the data extent and I didn’t find a way to add the whole GeoDataFrame of Points without looping through the rows individually:

Hvplot on the other hand registers the hvplot function directly with the GeoDataFrame. This makes it as convenient to use as the original GeoPandas plot function. It also zooms to the data extent:

Standard interaction and zoom to area of interest

The following snippets ensure that the map is set to a useful extent and the map tools enable panning and zooming.

With Folium, we have to set the map center and the zoom. The map tools are Leaflet defaults, so panning and zooming work as expected:

Since hvplot does not come with mouse wheel zoom enabled by default, we need to set that:

Color by attribute

Finally, for many maps, we want to show the point location as well as an attribute value.

To create a continuous color ramp for a numeric value, we can use branca.colormap to define the marker fill color:

In hvplot, it is sufficient to specify the attribute of interest:

I’m pretty impressed with hvplot. The integration with GeoPandas is very smooth. Just don’t forget to set the geo=True parameter if you want to plot lat/lon geometries.

Folium seems less straightforward for this use case. Maybe I missed some option similar to the Choropleth function that I showed in the previous post.

GeoPandas makes it easy to create basic visualizations of GeoDataFrames:

However, if we want interactive plots, we need additional libraries. Folium (which is built on Leaflet) is a great option. However, all examples for plotting GeoDataFrames that I found focused on point or polygon data. So here is what I found to work for GeoDataFrames of LineStrings:

First, some imports:

import pandas as pd
import geopandas
import folium

Loading the data:

graph = geopandas.read_file('data/population_test-routes-geom.csv')
graph.crs = {'init' :'epsg:4326'}

Creating the map using folium.Choropleth:

m = folium.Map([48.2, 16.4], zoom_start=10)

folium.Choropleth(
    graph[graph.geometry.length>0.001],
    line_weight=3,
    line_color='blue'
).add_to(m)

m

I also tried using folium.PolyLine which seemed like the more obvious choice but does not seem to accept GeoDataFrames as input. Instead, it expects a list of coordinate pairs and of course it expects them to be in the opposite order that Shapely.LineString.coords provides … Oh the joys of geodata!

In any case, I had to limit the number of features that get plotted because Folium refuses to plot all 8778 features at once. I decided to filter by line length because drawing really short lines is pointless for my overview visualization anyway.

%d bloggers like this: