Tag Archives: SQL

The last time I preprocessed the whole GeoLife dataset, I loaded it into PostGIS. Today, I want to share a new workflow that creates a (Geo)Parquet file instead and is much faster.

The dataset (GeoLife)

“This GPS trajectory dataset was collected in (Microsoft Research Asia) Geolife project by 182 users in a period of over three years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. This dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS-phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.”

The GeoLife GPS Trajectories download contains 182 directories full of .plt files:

Basically, CSV files with a custom header:
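Each .plt file starts with six header lines, followed by comma-separated records: latitude, longitude, a zero placeholder, altitude in feet, the timestamp as fractional days since 1899-12-30, and separate date and time strings. A representative snippet (header wording and values are illustrative, not copied from the dataset):

Geolife trajectory
WGS 84
Altitude is in Feet
Reserved 3
0,2,255,My Track,0,0,2,8421376
0
39.906631,116.385564,0,492,39744.1174768519,2008-10-23,02:49:10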

Creating the (Geo)Parquet using DuckDB

DuckDB installation

Following the official instructions, installation is straightforward:

curl https://install.duckdb.org | sh

From there, I’ve been using the GUI, which can be launched with:

duckdb -ui

The spatial extension is a DuckDB core extension, so it’s readily available. We can create a spatial db with:

ATTACH IF NOT EXISTS ':memory:' AS memory;
INSTALL spatial;
LOAD spatial;
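A quick sanity check that the extension works is to construct a point and print it as WKT (a minimal sketch; the coordinates are arbitrary):

SELECT ST_AsText(ST_Point(11.4, 48.2)) AS wkt;
-- returns the WKT string POINT (11.4 48.2)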

Reading a spatial file is as simple as:

SELECT * 
FROM '/home/anita/Documents/Codeberg/trajectools/sample_data/geolife.gpkg'

thanks to the GDAL integration.

But today, we want to get a bit more involved …

DuckDB SQL magic

The issues we need to solve are:

  1. Read all CSV files from all subdirectories
  2. Parse the CSV, skipping the six header lines, while assigning proper column names
  3. Assign the CSV file name as the trajectory ID (because there is no ID in the original files)
  4. Create point geometries that will work with our GeoParquet file
  5. Create proper datetimes from the separate date and time fields

Luckily, DuckDB’s read_csv function comes with the necessary features built-in. Putting it all together:

CREATE OR REPLACE TABLE geolife AS 
SELECT 
  parse_filename(filename, true) as vehicle_id, 
  strptime(date||' '||time, '%c') as t, 
  ST_Point(lon, lat) as geometry -- do NOT use ST_MakePoint
FROM read_csv('/home/anita/Documents/Geodata/Geolife/Geolife Trajectories 1.3/Data/*/*/*.plt',
    skip=6,
    filename = true, 
    columns = {
        'lat': 'DOUBLE', 
        'lon': 'DOUBLE', 
        'ignore': 'INT', 
        'alt': 'DOUBLE', 
        'epoch': 'DOUBLE', 
        'date': 'VARCHAR',
        'time': 'VARCHAR'
    });

It’s blazingly fast:

I haven’t tested reading directly from ZIP archives yet, but there seems to be a community extension (zipfs) for this exact purpose.
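To turn the table into the actual (Geo)Parquet file, DuckDB’s COPY statement does the job. In recent versions, with the spatial extension loaded, the geometry column should be written with GeoParquet metadata automatically (the output path below is just an example):

COPY geolife TO '/home/anita/Documents/Geodata/geolife.parquet' (FORMAT parquet);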

Ready to QGIS

GeoParquet files can be drag-n-dropped into QGIS:

I’m running QGIS 3.42.1-Münster from conda-forge on Linux Mint.

Yes, it takes a while to render all 25 million points … But you know what? It gets really snappy once we zoom in closer, e.g. to the situation in Germany:

Let’s have a closer look at what’s going on here.

Trajectools time

Selecting the 9,438 points in this extent, let’s compute movement metrics (speed & direction) and create trajectory lines:

Looks like we have some high-speed sections in there (with those red > 100 km/h streaks):

When we zoom in to Darmstadt and enable the trajectories layer, we can see each individual trip. Looks like car trips on the highway and walks through the city:

That looks like quite the long round trip:

Let’s see where they might have stopped to have a break:

If I had to guess, I’d say they stayed at the Best Western:

Conclusion

DuckDB has been great for this ETL workflow. I didn’t use much of its geospatial capabilities here but I was pleasantly surprised how smooth the GeoParquet creation process has been. Geometries are handled without any special magic and are recognized by QGIS. Same with the timestamps. All ready for more heavy spatiotemporal analysis with Trajectools.

If you haven’t tried DuckDB or GeoParquet yet, give it a try, particularly if you’re collaborating with data scientists from other domains and want to exchange data.

For everyone working with spatial databases in QGIS there comes a time when “Add PostGIS/SpatiaLite Layer” and “RT Sql Layer” start to be annoying. You always have to retype or copy-paste your SQL queries into the user interface if you need to change the tiniest thing in the layer’s definition.

This is where “Fast SQL Layer” can be a real time saver. Fast SQL Layer is a new plugin for QGIS by Pablo T. Carreira. It basically adds an SQL console for loading layers from PostGIS/SpatiaLite into QGIS. And it even comes with syntax highlighting!

Installation

Fast SQL Layer comes with one dependency: Pygments, which is used for syntax highlighting.

On Ubuntu, all you have to do is install it with apt-get:

sudo apt-get install python-pygments

For Windows with OSGeo4W, @Mike_Toews posted this on gis.stackexchange:

I downloaded and extracted Pygments-1.4.tar.gz, then in an OSGeo4W shell within the Pygments-1.4 directory, type python setup.py build then python setup.py install

Usage

When you activate the plugin in plugin manager, a dock widget will appear which contains the console and some fields for specifying the database connection that should be used. Then, you can simply write your SQL query and load the results with one click.
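For example, a query like the following (table and column names are hypothetical) loads a filtered subset of a table as a new layer:

SELECT gid, name, the_geom
FROM cities
WHERE population > 100000;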

Fast SQL plugin

In this example, I renamed “gid” to “id”, but you can actually edit the values in the drop down boxes to adjust the column names for id and geometry:

A second layer loaded using Fast SQL plugin

It certainly needs some polishing on the user interface side but I really like it.

Do you need a random sample of features in a Postgres table? Here is an example of how to select 1,000 random features from a table:

SELECT * FROM myTable
WHERE attribute = 'myValue'
ORDER BY random()
LIMIT 1000;
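Note that ORDER BY random() scans and sorts the whole table. On large tables, PostgreSQL’s TABLESAMPLE clause (available since 9.5) can be a faster alternative, though it samples pages rather than returning an exact number of rows:

SELECT * FROM myTable
TABLESAMPLE SYSTEM (1); -- roughly 1 percent of the table's pages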

Maybe this is a bit off-topic, but I just spent quite some time on this and I need to write it down so I can look it up again later :)

These are instructions for Ubuntu running Postgres 8.4. By default, Postgres ships without PL/Python so we need to get it first:

sudo apt-get install postgresql-plpython-8.4

Next, we need to create the language for our database. I’m using PgAdmin3. From there, I ran:

CREATE PROCEDURAL LANGUAGE 'plpython' HANDLER plpython_call_handler;

This should have been it. Let’s try with a simple function:

CREATE FUNCTION replace_e_to_a(text) RETURNS text AS
'
import re
# args[0] holds the first function argument in old-style PL/Python
Text1 = re.sub(''e'', ''a'', args[0])
return Text1
'
LANGUAGE 'plpython';

SELECT replace_e_to_a('Meee');

… should return ‘Maaa’.

Now for the juicy part: Let’s create an INSERT trigger function!

First, let’s have a look at the corresponding table structure. We have two tables, “user_data” and “user_keywords”. “user_data” is the table that’s being filled with information from external functions. “user_keywords” has to be kept up-to-date: it is supposed to count the appearance of keywords on a per-user basis.

user_data:
user_id, event_id, keywords
1,       1,        'music,rock'
1,       2,        'music,classic'

user_keywords:
user_id, keyword,   count
1,       'music',   2
1,       'rock',    1
1,       'classic', 1

First, the keyword list has to be split. Then a row has to be inserted for new keywords (see insert_plan) and the counter has to be increased for existing keywords (update_plan).

The values that are about to be inserted can be accessed via TD[“new”][“column_name”].

CREATE FUNCTION update_keyword_count() RETURNS trigger AS '

keywords = TD["new"]["keywords"]
user = TD["new"]["user_id"]

# prepare the statements once, outside the loop
insert_plan = plpy.prepare("INSERT INTO user_keywords (keyword, count, user_id) VALUES ($1, $2, $3)", ["text", "int", "int"])

update_plan = plpy.prepare("UPDATE user_keywords SET count = $3 WHERE user_id = $1 AND keyword = $2", ["int", "text", "int"])

select_plan = plpy.prepare("SELECT count AS cnt FROM user_keywords WHERE user_id = $1 AND keyword = $2", ["int", "text"])

for keyword in keywords.split(","):
  results = plpy.execute(select_plan, [user, keyword])

  if len(results) == 0:
    # new keyword for this user: insert it with count 1
    rv = plpy.execute(insert_plan, [keyword, 1, user])
  else:
    # existing keyword: increase its counter
    rv = plpy.execute(update_plan, [user, keyword, results[0]["cnt"]+1])

' LANGUAGE plpython;

Now, we need to wire it up by defining the trigger:

CREATE TRIGGER update_keywords
BEFORE INSERT ON user_data
FOR EACH ROW
EXECUTE PROCEDURE update_keyword_count();
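To check that the trigger works, we can insert a test row and inspect the keyword counts (values chosen to extend the example tables above):

INSERT INTO user_data (user_id, event_id, keywords) VALUES (1, 3, 'music,jazz');
SELECT * FROM user_keywords WHERE user_id = 1;
-- 'music' should now have count 3, and 'jazz' should appear with count 1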

… Wasn’t that bad ;)

GeoServer has always been good at simply publishing database tables. But anything more complex (e.g. pre-filtering data in a table, joining two tables together, or generating values on the fly) could be painful. With GeoServer 2.1, one can finally create a layer directly from an SQL query.

"Create New SQL View" interface

Even dynamic queries are possible, e.g.

select gid, state_name, the_geom from pgstates where persons between %low% and %high%

To select, for example, all states with 2 to 5 million inhabitants, the following parameters can be added to the normal GetMap request:

&viewparams=low:2000000;high:5000000
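Put together, a complete request could look like this (host, workspace, and layer names are hypothetical; the remaining parameters are the usual GetMap ones):

http://localhost:8080/geoserver/wms?service=WMS&version=1.1.0&request=GetMap&layers=topp:states_view&styles=&srs=EPSG:4326&bbox=-125,24,-66,50&width=768&height=330&format=image/png&viewparams=low:2000000;high:5000000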

Find more information on SQL layers in the GeoServer 2.1 documentation.

Sometimes, we just want to visualize the contents of a PostGIS table containing some x/y data but no actual geometries in QGIS. But that’s where the problems arise: We don’t have the right to add a geometry column, the table doesn’t have a suitable ID or OIDs (QGIS demands a unique integer ID), and we can’t or don’t want to mess with the database anyway. Loading the table with “Add PostGIS Layer” will result in a non-spatial layer (or fail if you use an older QGIS version).

RT Sql Layer Plugin to the rescue!

I presented this plugin in a previous post. It allows you to execute any SQL SELECT statement, even really complex ones. Luckily, this time we don’t need anything fancy, only the two functions row_number() and makepoint():

select  
  row_number() over (order by col1)::int AS my_id,
  col1, 
  col2,
  x, y, 
  makepoint(x,y) as the_geom
from my_table
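On PostGIS 2.0 and newer, where the old non-ST_ aliases such as makepoint() were removed, an equivalent query would use ST_MakePoint, ideally with an explicit SRID (4326 is assumed here; adjust it to your data):

select
  row_number() over (order by col1)::int AS my_id,
  col1,
  col2,
  x, y,
  ST_SetSRID(ST_MakePoint(x, y), 4326) as the_geom
from my_table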