First working MovingPandas setup on Databricks
In December, I wrote about GeoPandas on Databricks. Back then, I also tried to get MovingPandas working, but without luck. (While GeoPandas can be installed using Databricks’ dbutils.library.installPyPI("geopandas"), this PyPI install just didn’t want to work for MovingPandas.)
Now that MovingPandas is available from conda-forge, I gave it another try and … *spoiler alert* … it works!
First of all, conda support on Databricks is in beta. It’s not included in the default runtimes. At the time of writing this post, “6.0 Conda Beta” is the latest runtime with conda:
Once the cluster is up and connected to the notebook, a quick conda list shows the installed packages:
Time to install MovingPandas! I went with a 100% conda-forge installation. This takes a looong time (almost half an hour)!
When the installs are finally done, it gets serious: time to test the imports!
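Before running the actual imports, it can help to check which packages the notebook's Python environment can locate at all. The following is only a minimal sketch using the standard library: the helper name missing_packages and the list of package names are my own choices for illustration, and a package that can be located may still fail on import (for example, because of a broken native dependency).

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    # find_spec() returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package list assumed for this post; adjust it to your own setup.
REQUIRED = ["geopandas", "movingpandas", "hvplot"]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the list comes back empty, a plain import of each package is the real test.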
Now we can put the MovingPandas data structures to good use. But first we need to load some movement data:
Of course, the points in this GeoDataFrame can be plotted. However, the plot isn’t automatically displayed once plot() is called on the GeoDataFrame. Instead, Databricks provides a display() function to display Matplotlib figures:
MovingPandas also uses Matplotlib. Therefore we can use the same approach to plot the TrajectoryCollection that can be created from the GeoDataFrame:
These Matplotlib plots are nice and quick but they lack interactivity and therefore are of limited use for data exploration.
MovingPandas provides interactive plotting (including base maps) using hvplot. hvplot is based on Bokeh and, luckily, the Databricks documentation tells us that bokeh plots can be exported to html and then displayed using
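The export step can be sketched roughly as follows. This is not necessarily the exact code from the notebook: the helper name hvplot_to_html and the variable traj_collection are hypothetical, and the sketch assumes that holoviews and bokeh are installed and that the notebook offers Databricks' displayHTML() built-in.

```python
def hvplot_to_html(plot, title="MovingPandas plot"):
    """Render a HoloViews/hvplot object to a standalone HTML string."""
    # Imported lazily so the helper can be defined even before the
    # plotting stack has been installed on the cluster.
    import holoviews as hv
    from bokeh.embed import file_html
    from bokeh.resources import CDN

    bokeh_figure = hv.render(plot, backend="bokeh")  # HoloViews object -> Bokeh model
    return file_html(bokeh_figure, CDN, title)       # Bokeh model -> HTML document


# In a Databricks notebook cell (traj_collection is a hypothetical
# TrajectoryCollection; displayHTML is provided by the notebook):
# displayHTML(hvplot_to_html(traj_collection.hvplot()))
```

The same helper should work for any HoloViews element, since hv.render() turns it into a Bokeh model first.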
Of course, we could achieve all this on MyBinder as well (and much more quickly). However, Databricks gets interesting once we can add (Py)Spark and distributed processing to the mix. For example, “Getting started with PySpark & GeoPandas on Databricks” shows a spatial join function that adds polygon information to a point GeoDataFrame.
A potential use case for MovingPandas would be to speed up flow map computations. The recently added aggregator functionality (currently in master only) first computes clusters of significant trajectory points and then aggregates the trajectories into flows between these clusters. Matching trajectory points to the closest cluster could be a good candidate for distributed computing: each trajectory (or each point) can be handled independently, and only the cluster locations have to be broadcast to all workers.
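To illustrate why this matching step parallelizes so easily, here is a minimal pure-Python sketch. It is not the MovingPandas aggregator implementation: the function names, the haversine distance, and the coordinates are all assumptions made for illustration.

```python
import math


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def closest_cluster(point, clusters):
    """Return the index of the cluster centre nearest to a (lon, lat) point."""
    lon, lat = point
    return min(range(len(clusters)), key=lambda i: haversine_m(lon, lat, *clusters[i]))


# Made-up cluster centres (lon, lat); these are the only shared data.
clusters = [(16.37, 48.21), (13.40, 52.52)]

# Made-up trajectory points; each one is matched independently of the others.
points = [(16.40, 48.20), (13.38, 52.50), (16.00, 48.00)]

assignments = [closest_cluster(p, clusters) for p in points]
print(assignments)
```

The list comprehension is the part that a distributed map could replace (for example, over a PySpark RDD or through a UDF), with clusters shipped to the workers as a broadcast variable. For many clusters, a spatial index would be preferable to this brute-force search.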
Really interesting, this looks very good. But…. how does this compare to PostGIS? I have a very large dataset – well, several – in a complex map involving geology and sediment types. Would Pandas have limitations in handling large datasets, and cartographic generalisation?
There will be issues if the datasets don’t fit into memory. Without knowing the specific query requirements, I would expect PostGIS to be faster in general. However, some queries, particularly those relating to time series, will be easier to implement in (Geo)Pandas.
I am facing the same trouble you described in Cmd7, but I am using holoviews. Also, I’ve installed the bokeh package, but I could not display a plot (in this case, a Sankey diagram). How did you fix this? I looked at Cmd8, but it’s a simple help command! I would appreciate it if you could share with us.
Hvplot worked for me as shown in Cmd15 https://underdark.files.wordpress.com/2020/02/databricks8-1.png