In December, I wrote about GeoPandas on Databricks. Back then, I also tried to get MovingPandas working, but without luck. (While GeoPandas can be installed using Databricks’ dbutils.library.installPyPI("geopandas"), this PyPI install just didn’t want to work for MovingPandas.)
Now that MovingPandas is available from conda-forge, I gave it another try and … *spoiler alert* … it works!
First of all, conda support on Databricks is in beta. It’s not included in the default runtimes. At the time of writing this post, “6.0 Conda Beta” is the latest runtime with conda:
When the installs are finally done, it gets serious: time to test the imports!
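As a quick sanity check before running the actual imports, a small helper can report which packages the notebook environment can find. This is only a minimal sketch: the helper name check_imports is made up for illustration, and it assumes the packages are importable under the names geopandas and movingpandas.

```python
import importlib.util


def check_imports(package_names):
    """Map each top-level package name to True if Python can locate it."""
    # find_spec() only looks the package up; it does not execute the import.
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }


if __name__ == "__main__":
    # Assumed package names; adjust to whatever was installed.
    for name, found in check_imports(["geopandas", "movingpandas"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Because find_spec() does not load the package, a "found" result does not rule out a broken binary dependency; a plain import statement in the next cell remains the real test.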
Now we can put the MovingPandas data structures to good use. But first we need to load some movement data:
Of course, the points in this GeoDataFrame can be plotted. However, the plot isn’t automatically displayed once plot() is called on the GeoDataFrame. Instead, Databricks provides a display() function to display Matplotlib figures:
MovingPandas also uses Matplotlib. Therefore, we can use the same approach to plot the TrajectoryCollection that can be created from the GeoDataFrame:
These Matplotlib plots are nice and quick but they lack interactivity and therefore are of limited use for data exploration.
MovingPandas provides interactive plotting (including base maps) using hvplot. hvplot is based on Bokeh and, luckily, the Databricks documentation tells us that Bokeh plots can be exported to HTML and then displayed using
Of course, we could achieve all this on MyBinder as well (and much more quickly). However, Databricks gets interesting once we can add (Py)Spark and distributed processing to the mix. For example, “Getting started with PySpark & GeoPandas on Databricks” shows a spatial join function that adds polygon information to a point GeoDataFrame.
A potential use case for MovingPandas would be to speed up flow map computations. The recently added aggregator functionality (currently in master only) first computes clusters of significant trajectory points and then aggregates the trajectories into flows between these clusters. Matching trajectory points to the closest cluster is a natural candidate for distributed computing: each trajectory (or each point) can be handled independently, and only the cluster locations have to be broadcast to all workers.
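To make the matching step concrete, here is a minimal sketch in plain Python. It is not the MovingPandas aggregator implementation; the function names, the (lon, lat) tuple layout, and the use of haversine distance are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def closest_cluster(point, centers):
    """Return the index of the cluster center nearest to a (lon, lat) point."""
    lon, lat = point
    distances = [haversine_m(lon, lat, c_lon, c_lat) for c_lon, c_lat in centers]
    return distances.index(min(distances))


def assign_points(points, centers):
    """Match every point independently; no point needs to know about any other."""
    return [closest_cluster(p, centers) for p in points]


if __name__ == "__main__":
    # Hypothetical cluster centers and trajectory points as (lon, lat) tuples.
    centers = [(16.37, 48.21), (13.40, 52.52)]
    points = [(16.40, 48.20), (13.50, 52.50)]
    print(assign_points(points, centers))
```

Since closest_cluster() only needs the point itself and the (small) list of centers, the same function could in principle be applied per record on each worker, with the centers shipped once as a broadcast variable.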