Tag Archives: GeoPandas

The journey continues: QgsArrowIterator is now merged! This makes it possible to iterate over QgsFeatures as Arrow batches.

This is where we are now, quoting Dewey Dunnington:

import geopandas
from nanoarrow.c_array import allocate_c_array
import qgis
from qgis.core import QgsVectorLayer

# Create a vector layer
layer = QgsVectorLayer("tests/testdata/zonalstatistics/polys.shp", "layer_name", "ogr")
schema = qgis.core.QgsArrowIterator.inferSchema(layer)

it = qgis.core.QgsArrowIterator(layer.getFeatures())
it.setSchema(schema, 1)

c_array = allocate_c_array()
schema.exportToAddress(c_array.schema._addr())
it.nextFeatures(5, c_array._addr())

print(geopandas.GeoDataFrame.from_arrow(c_array))
#> lev3_name                                           geometry
#> 0    poly_1  MULTIPOLYGON (((100.37934 -0.96049, 100.37934 ...
#> 1    poly_2  MULTIPOLYGON (((100.37944 -0.96044, 100.37955 ...
#> 2    poly_3  MULTIPOLYGON (((100.37938 -0.96049, 100.37949 ...

print(geopandas.read_file("tests/testdata/zonalstatistics/polys.shp"))
#> lev3_name                                           geometry
#> 0    poly_1  POLYGON ((100.37934 -0.96049, 100.37934 -0.960...
#> 1    poly_2  POLYGON ((100.37944 -0.96044, 100.37955 -0.960...
#> 2    poly_3  POLYGON ((100.37938 -0.96049, 100.37949 -0.960...

Further improvements are already being planned. To quote from the ticket:

“The final state after this improvement would be a compact way for Arrow Python consumers like GeoPandas to ergonomically consume a layer. Maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object)

Or maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object.getArrowStream())”

Looking forward to seeing this develop further.

The conversation around “Looking for better ways to convert between QGIS VectorLayer and (Geo)DataFrame” is continuing over at https://fosstodon.org/@underdarkGIS/115442614331293320

What I’ve learned so far: these are exciting times for spatial data science tooling 🤩

Plugin developers who want to use (Geo)Pandas-based functionality in their plugins regularly face the challenge of converting QGIS vector layers to (Geo)DataFrames. There is currently no built-in convenience function.

In Trajectools, so far, I have been performing the conversion manually, looping through all features and taking care of tricky column types, such as datetimes and geometries:

import pandas as pd
from qgis.PyQt.QtCore import QDateTime

def df_from_layer_trajectools(layer, time_field_name="t"):
    # Original Trajectools 2.7 version
    names = [field.name() for field in layer.fields()]
    data = []
    for feature in layer.getFeatures():
        my_dict = {}
        for i, a in enumerate(feature.attributes()):
            # QDateTime attributes need to be converted to Python datetimes
            if names[i] == time_field_name and isinstance(a, QDateTime):
                a = a.toPyDateTime()
            my_dict[names[i]] = a
        # Store the point coordinates in plain x/y columns
        pt = feature.geometry().asPoint()
        my_dict["geom_x"] = pt.x()
        my_dict["geom_y"] = pt.y()
        data.append(my_dict)
    return pd.DataFrame(data)

It works (mostly), but it’s far from fast. For the 25 million Geolife points, it takes 4 minutes:

In an attempt to speed things up (and make the conversion more robust, e.g. regarding datetime/timezone conversion and null values), I spent some time at SDSL2025 with Joris Van den Bossche trying a workaround that writes the QGIS layer to an Arrow file and then reads that file with pyogrio:

import os
import tempfile

import geopandas as gpd
import pyogrio
from qgis.core import QgsProject, QgsVectorFileWriter

def gdf_from_layer_arrow(layer):
    # SDSL2025 version
    with tempfile.TemporaryDirectory() as tmpdirname:
        path = os.path.join(tmpdirname, "data.arrow")

        # Write the layer to a temporary Arrow file ...
        options = QgsVectorFileWriter.SaveVectorOptions()
        options.actionOnExistingFile = QgsVectorFileWriter.CreateOrOverwriteFile
        options.layerName = "data"
        options.driverName = "arrow"

        QgsVectorFileWriter.writeAsVectorFormatV3(
            layer, path, QgsProject.instance().transformContext(), options
        )

        # ... and read it back with pyogrio
        meta, table = pyogrio.read_arrow(path)
        gdf = gpd.GeoDataFrame.from_arrow(table)

    return gdf

Not only do we get a GeoDataFrame in return, but this approach also runs in half the time, i.e. 2 minutes instead of 4:

Switching to this approach will require adding pyogrio to the plugin dependencies. Looks like it could be worth it.

We also discussed another alternative: it would be faster to read the vector layer’s data source directly, provided it is a file in a supported format. However, this would require separate handling for other types of input layers.
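For file-based layers, that direct read could look like this (a minimal sketch, assuming the layer source is a plain file path; some providers append “|layername=…” parts that need stripping):

import geopandas as gpd

def gdf_from_layer_direct(layer):
    # Strip provider options such as "|layername=..." from the source URI
    path = layer.source().split("|")[0]
    # Read the file directly with pyogrio/GDAL, bypassing the QGIS feature API
    return gpd.read_file(path)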

There’s also the issue of supporting the Processing feature that lets users run an algorithm on selected features only: selections are exposed through QgsProcessingParameterFeatureSource (but not through QgsProcessingParameterVectorLayer). Maybe the Export Selected Features algorithm can cover this case, but it exports an empty layer if there is no selection.

Are you aware of any other / better ways to approach this issue? Any pointers are appreciated.

Today, I’m super excited to share the news that our open-source textbook “Geocomputation with Python” has finally arrived in print and is now available for purchase from Routledge.com, Amazon.com, Amazon.co.uk, and other booksellers.

“Geocomputation with Python” (or geocompy for short) covers the entire range of standard GIS operations for both vector and raster data models. Each section and chapter builds on the previous. If you’re just starting out with Python to work with geographic data, we hope that the book will be an excellent place to start.

Of course, you can still find the online version of the book at py.geocompx.org.

The book is open-source and you can find the code on GitHub. This ensures that the content is reproducible, transparent, and accessible. It also lets you interact with the project by opening issues and submitting pull requests.

tl;dr: Tired of working with large CSV files? Give GeoParquet a try!

“Parquet is a powerful column-oriented data format, built from the ground up as a modern alternative to CSV files.” https://geoparquet.org/

(Geo)Parquet is both smaller and faster than CSV. Additionally, (Geo)Parquet columns are typed: text, numeric values, dates, and geometries all retain their data types. GeoParquet also stores CRS information, and support in GIS applications is growing.
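As a minimal sketch of what “typed” means in practice (hypothetical data): a datetime column written to GeoParquet comes back as a datetime, and the CRS survives the round trip, whereas a CSV round trip would hand back plain strings:

import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {"name": ["a"], "t": [pd.Timestamp("2024-01-01 10:00")]},
    geometry=[Point(16.37, 48.21)],
    crs=4326,
)
gdf.to_parquet("demo.parquet")

# dtypes and CRS survive the round trip
print(gpd.read_parquet("demo.parquet").dtypes)
print(gpd.read_parquet("demo.parquet").crs)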

I’ll be giving a quick overview using AIS data in GeoPandas 1.0.1 (with pyarrow) and QGIS 3.38 (with GDAL 3.9.2).

File size

The example AIS dataset for this demo contains ~10 million rows with 22 columns. I’ve converted the original zipped CSV into GeoPackage and GeoParquet using GeoPandas to illustrate the huge difference in file size: ~470 MB for GeoParquet and zipped CSV, 1.6 GB for CSV, and a whopping 2.6 GB for GeoPackage:
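For reference, the conversion itself is only a few lines with GeoPandas (a minimal sketch; file and coordinate column names are hypothetical):

import pandas as pd
import geopandas as gpd

# Pandas reads zipped CSVs directly
df = pd.read_csv("ais.zip")

gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df["Longitude"], df["Latitude"]), crs=4326
)

gdf.to_file("ais.gpkg")        # GeoPackage via pyogrio/GDAL
gdf.to_parquet("ais.parquet")  # GeoParquet via pyarrow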

Reading performance

Pandas and GeoPandas both support selective reading, i.e. we can specify which columns should be loaded. This speeds up reading, even from CSV files:

              Whole file     Selected columns
CSV           27.9 s         13.1 s
GeoPackage    2 min 12 s 😵   20.2 s
GeoParquet    7.2 s          4.1 s

Indeed, reading the whole GeoPackage is getting quite painful.

Here’s the kind of code I used for timing the read times (reconstructed below as a minimal sketch; file and column names are placeholders):
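import time
import pandas as pd
import geopandas as gpd

def timed(label, func):
    # Run the reader once and report wall-clock time
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.1f} s")
    return result

def gdf_from_csv(path, usecols=None):
    # CSV has no geometry column, so we build points from the
    # coordinate columns (hypothetical column names)
    df = pd.read_csv(path, usecols=usecols)
    return gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df["Longitude"], df["Latitude"]), crs=4326
    )

cols = ["Timestamp", "Latitude", "Longitude"]  # hypothetical column subset

timed("CSV, whole file", lambda: gdf_from_csv("ais.csv"))
timed("CSV, selected columns", lambda: gdf_from_csv("ais.csv", usecols=cols))
timed("GeoPackage, whole file", lambda: gpd.read_file("ais.gpkg"))
timed("GeoPackage, selected columns", lambda: gpd.read_file("ais.gpkg", columns=cols))
timed("GeoParquet, whole file", lambda: gpd.read_parquet("ais.parquet"))
timed("GeoParquet, selected columns", lambda: gpd.read_parquet("ais.parquet", columns=cols))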

As you can see, these times include the creation of the GeoPandas.GeoDataFrame.

If we don’t need a GeoDataFrame, we can read the files even faster:

Non-spatial DataFrames

GeoParquet files can be read by non-GIS tools, such as Pandas. This makes it easier to collaborate with people who may not be familiar with geospatial data stacks.

And reading plain DataFrames is much faster than creating GeoDataFrames:
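For example, Pandas can read the GeoParquet file directly; the geometry column simply comes back as WKB bytes (a minimal sketch, hypothetical file name):

import pandas as pd

# No geometry decoding happens here, which is why this is faster;
# the geometry column contains raw WKB bytes
df = pd.read_parquet("ais.parquet")
print(df.dtypes)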

But back to GIS …

GeoParquet in QGIS

In QGIS, GeoParquet files can be loaded like any other vector layer, thanks to GDAL:

Loading the GeoParquet and GeoPackage files is pretty quick, especially if we zoom into a small region of interest (even though, unfortunately, it doesn’t seem possible to restrict the columns to further speed up loading). Loading the CSV, however, is pretty painful due to the lack of spatial indexing, which becomes apparent very quickly in the direct comparison:

(You can see how slowly the red CSV points are rendering. I didn’t have the patience to include the whole process in the GIF.)

As far as I can tell, my QGIS 3.38 ‘Grenoble’ does not support writing to or editing of GeoParquet files. So I’m limited to reading GeoParquet for now.

However, seeing how much smaller GeoParquet files are compared to GeoPackages (and also how much faster they are to write), I hope that we will soon get the option to export to GeoParquet.

For now, I’ll start by converting my large CSV files to GeoParquet using GeoPandas.

More reading

If you’re into GeoJSON and/or PyGeoAPI, check out Joana Simoes’ post: “Navigating GeoParquet: Lessons Learned from the eMOTIONAL Cities Project”

And if you want to see a global dataset example, have a look at Matt Travis’ presentation using Overture data:

Today, I want to point out a blog post over at

https://geocompx.org/post/2023/geocompy-bp1/

In this post, Jakub Nowosad introduces our book “Geocomputation with Python”, also known as geocompy. It is an open-source book on geographic data analysis with Python, written by Michael Dorman, Jakub Nowosad, Robin Lovelace, and me, with contributions from others. You can find it online at https://py.geocompx.org/

Previously, we mapped Neo4j spatial nodes. This time, we want to take it one step further and map relationships.

A prime example is the relationship between GTFS StopTime and Trip nodes. For example, this is the Cypher query to get all StopTime nodes of Trip 17:

MATCH 
    (t:Trip {id: "17"})
    <-[:BELONGS_TO]-
    (st:StopTime) 
RETURN st

To get the stop locations, we also need to get the stop nodes:

MATCH 
    (t:Trip {id: "17"})
    <-[:BELONGS_TO]-
    (st:StopTime)
    -[:STOPS_AT]->
    (s:Stop)
RETURN st, s

Adapting our code from the previous post, we can plot the stops:

import geopandas as gpd
from shapely.geometry import Point

QUERY = """MATCH
    (t:Trip {id: "17"})
    <-[:BELONGS_TO]-
    (st:StopTime)
    -[:STOPS_AT]->
    (s:Stop)
RETURN st, s
ORDER BY st.stopSequence
"""

with driver.session(database="neo4j") as session:
    tx = session.begin_transaction()
    results = tx.run(QUERY)
    # expand=True flattens node properties into separate DataFrame columns
    df = results.to_df(expand=True)
    gdf = gpd.GeoDataFrame(
        df[['s().prop.name']], crs=4326,
        geometry=df["s().prop.location"].apply(Point)
    )

tx.close()
m = gdf.explore()
m

Ordering by stop sequence is actually completely optional for plotting the points. Technically, we could use the sorted GeoDataFrame and aggregate all the points into a LineString to plot the route. But I want to try something different: we’ll use the NEXT_STOP relationships to get a DataFrame of the start and end stops for each segment:

QUERY = """
MATCH (t:Trip {id: "17"})
   <-[:BELONGS_TO]-
   (st1:StopTime)
   -[:NEXT_STOP]->
   (st2:StopTime)
MATCH (st1)-[:STOPS_AT]->(s1:Stop)
MATCH (st2)-[:STOPS_AT]->(s2:Stop)
RETURN st1, st2, s1, s2
"""

from shapely.geometry import Point, LineString

def make_line(row):
    # Build a segment from the locations of the two consecutive stops
    s1 = Point(row["s1().prop.location"])
    s2 = Point(row["s2().prop.location"])
    return LineString([s1, s2])

with driver.session(database="neo4j") as session:
    tx = session.begin_transaction()
    results = tx.run(QUERY)
    df = results.to_df(expand=True)
    gdf = gpd.GeoDataFrame(
        df[['s1().prop.name']], crs=4326,
        geometry=df.apply(make_line, axis=1)
    )

tx.close() 
gdf.explore(m=m)

Finally, we can also use Cypher to calculate the travel time between two stops:

MATCH (t:Trip {id: "17"})
   <-[:BELONGS_TO]-
   (st1:StopTime)
   -[:NEXT_STOP]->
   (st2:StopTime)
MATCH (st1)-[:STOPS_AT]->(s1:Stop)
MATCH (st2)-[:STOPS_AT]->(s2:Stop)
RETURN st1.departureTime AS time1, 
   st2.arrivalTime AS time2, 
   s1.location AS geom1, 
   s2.location AS geom2, 
   duration.inSeconds(
      time(st1.departureTime), 
      time(st2.arrivalTime)
   ).seconds AS traveltime

As always, here’s the notebook: https://github.com/anitagraser/QGIS-resources/blob/master/qgis3/notebooks/neo4j.ipynb

In the recent post “Setting up a graph db using GTFS data & Neo4J”, we noted that, unfortunately, Neomap is no longer an option for visualizing spatial nodes.

GeoPandas to the rescue!

But first we need the neo4j Python driver:

pip install neo4j

Then we can connect to our database. The default user name is neo4j and you get to pick the password when creating the database:

from neo4j import GraphDatabase

URI = "neo4j://localhost"
AUTH = ("neo4j", "password")

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()

Once we have confirmed that the connection works as expected, we can run a query:

QUERY = "MATCH (p:Stop) RETURN p.name AS name, p.location AS geom"

records, summary, keys = driver.execute_query(
    QUERY, database_="neo4j",
)

for rec in records:
    print(rec)

Nice. There we have our GTFS stops, their names and their locations. But how to put them on a map?

Conveniently, there is a to_df() function in the Neo4j driver:

import geopandas as gpd

with driver.session(database="neo4j") as session:
    tx = session.begin_transaction()
    results = tx.run(QUERY)
    # expand=True flattens the point locations into geom[].0 / geom[].1 columns
    df = results.to_df(expand=True)
    # Quick and dirty: keep only rows that actually have coordinates
    df = df[df["geom[].0"] > 0]
    gdf = gpd.GeoDataFrame(
        df['name'], crs=4326,
        geometry=gpd.points_from_xy(df['geom[].0'], df['geom[].1']))
    print(gdf)

tx.close()

Since some of the nodes lack geometries, I added a quick and dirty hack to get rid of them, because otherwise gdf.explore() complains about None geometries.

You can find this notebook at: https://github.com/anitagraser/QGIS-resources/blob/1e4ea435c9b1795ba5b170ddb176aa83689112eb/qgis3/notebooks/neo4j.ipynb

The next step will have to be the relationships. Stay posted.

In the last few days, there’s been a sharp rise in interest in vessel movements and, particularly, in understanding where and why vessels stop. Following the grounding of Ever Given in the Suez Canal, satellite images and vessel tracking data (AIS) visualizations are everywhere:

Using movement data analytics tools, such as MovingPandas, we can dig deeper and explore patterns in the data.

The MovingPandas.TrajectoryStopDetector is particularly useful in this situation. We can provide it with a Trajectory or TrajectoryCollection and let it detect all stops, that is, instances where the moving object stayed within a certain area (with a diameter of 1000 m in this example) for an extended duration (at least 3 hours).

import movingpandas as mpd
from datetime import timedelta

stops = mpd.TrajectoryStopDetector(trajs).get_stop_segments(
    min_duration=timedelta(hours=3), max_diameter=1000)

The resulting stop segments include spatial and temporal information about the stop location and duration. To make this info more easily accessible, let’s turn the stop segment TrajectoryCollection into a point GeoDataFrame:

# Set up an empty point GeoDataFrame indexed by stop ID
stop_pts = gpd.GeoDataFrame(columns=['geometry']).set_geometry('geometry')
stop_pts['stop_id'] = [track.id for track in stops.trajectories]
stop_pts = stop_pts.set_index('stop_id')

# Fill in vessel ID, start time, duration, and start location for each stop
for stop in stops:
    stop_pts.at[stop.id, 'ID'] = stop.df['ID'][0]
    stop_pts.at[stop.id, 'datetime'] = stop.get_start_time()
    stop_pts.at[stop.id, 'duration_h'] = stop.get_duration().total_seconds()/3600
    stop_pts.at[stop.id, 'geometry'] = stop.get_start_location()

Indeed, I think the next version of MovingPandas should include a function that directly returns stops as points.

Now we can explore the stop information. For example, the map plot shows that stops are concentrated in three main areas: the northern and southern ends of the Canal, as well as the Great Bitter Lake in the middle. By looking at the timing of stops and their duration in a scatter plot, we can clearly see that the Ever Given stop (red) caused a chain reaction: the numerous points lining up on the diagonal of the scatter plot represent stops that are very likely results of the blockage:

Before the grounding, the stop distribution nicely illustrates the canal schedule. Vessels have to wait until it is their direction’s turn to go through:

You can see the full analysis workflow in the following video. Please turn on the captions for details.

Huge thanks to VesselsValue for supplying the data!

For another example of MovingPandas’ stop detection in action, have a look at Bryan R. Vallejo’s tutorial on detecting stops in bird tracking data, which includes some awesome visualizations using KeplerGL:

Kepler.GL visualization by Bryan R. Vallejo

This post is part of a series. Read more about movement data in GIS.

Data sourcing and preparation are among the most time-consuming tasks in many spatial analyses. Even though the Austrian data.gv.at platform already provides a central catalog, the individual datasets still vary considerably in their accessibility and readiness for use.

OGD.AT Lab is a new repository collecting Jupyter notebooks for working with Austrian Open Government Data and other auxiliary open data sources. The notebooks illustrate different use cases, including so far:

  1. Accessing geodata from the city of Vienna WFS
  2. Downloading environmental data (heat vulnerability and air quality)
  3. Geocoding addresses and getting elevation information
  4. Exploring urban movement data

Data processing and visualization are performed using Pandas, GeoPandas, and HoloViews. GeoPandas makes it straightforward to use data from WFS. Therefore, OGD.AT Lab can provide one universal gdf_from_wfs() function, which takes the desired WFS layer as an argument and returns a GeoPandas.GeoDataFrame that is ready for analysis:
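A minimal sketch of such a function, assuming the City of Vienna WFS endpoint and a hypothetical layer name:

import geopandas as gpd

def gdf_from_wfs(layer):
    # Build a WFS GetFeature request that returns GeoJSON,
    # which GeoPandas can read directly from the URL
    url = (
        "https://data.wien.gv.at/daten/geo"
        "?service=WFS&request=GetFeature&version=1.1.0"
        f"&typeName=ogdwien:{layer}"
        "&srsName=EPSG:4326&outputFormat=json"
    )
    return gpd.read_file(url)

gdf = gdf_from_wfs("BAUMKATOGD")  # hypothetical layer name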

Many other datasets are provided as CSV files, which need to be joined with spatial datasets before they can be used in spatial analysis. One example is the “Urban heat vulnerability index” dataset, which needs to be joined to statistical areas.
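The join itself is then a standard attribute merge; a sketch with hypothetical file, layer, and column names:

import pandas as pd

# Statistical areas from the WFS (using the function sketched above)
areas = gdf_from_wfs("ZAEHLBEZIRKOGD")

# Heat vulnerability index values per statistical area
heat = pd.read_csv("heat_vulnerability_index.csv", sep=";")

# Attribute join on a shared area identifier
joined = areas.merge(heat, on="AREA_ID")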


Another issue with many CSV files is that they use German number formatting, where commas are used as the decimal separator instead of dots:
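Pandas handles this transparently if we declare the separators; a minimal sketch with a hypothetical file name:

import pandas as pd

# German-style CSV: ";" separates fields, "," is the decimal separator
df = pd.read_csv("umweltdaten.csv", sep=";", decimal=",")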

Besides file access, there are also open services provided by other developers. For example, Manfred Egger developed an elevation service that provides elevation information for any point in Austria. In combination with geocoding services, such as Nominatim, this makes it possible to, for example, find the elevation for any address in Austria:
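Chained together, this might look as follows (a sketch: geocoding via geopy’s Nominatim wrapper; the elevation endpoint and its parameters are placeholders):

import requests
from geopy.geocoders import Nominatim

# Nominatim's usage policy requires an identifying user agent
geolocator = Nominatim(user_agent="ogd-at-lab-demo")
location = geolocator.geocode("Stephansplatz 1, 1010 Wien")

# Ask the elevation service for the height at that coordinate
# (hypothetical endpoint URL and parameter names)
response = requests.get(
    "https://example.at/elevation",
    params={"lat": location.latitude, "lon": location.longitude},
)
print(response.json())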

Last but not least, the first version of the mobility notebook showcases open travel time data provided by Uber Movement:

The utility functions for data access included in this repository will continue to grow as new data sources are added. Eventually, it may make sense to extract the data access functions into a dedicated library, similar to geofi (Finland) or geobr (Brazil).

If you’re aware of any interesting open datasets or services that should be included in OGD.AT Lab, feel free to reach out here or on GitHub, through the issue tracker or by opening a pull request.