The quest for a fair TimeGPT benchmark

At the end of yesterday’s TimeGPT for mobility post, we concluded that TimeGPT’s training set probably included a copy of the popular BikeNYC time series dataset and that we were therefore not looking at a fair comparison.

Naturally, it’s hard to find mobility time series datasets that haven’t been widely disseminated online and may therefore have slipped past the scrapers of foundation model builders.

So I scoured the Austrian open government data portal and came up with a bike-share dataset from Vienna.

Dataset

The SharedMobility.ai dataset, published by Philipp Naderer-Puiu, covers 2019-05-05 to 2019-12-31.

Here are eight of the 120 stations in the dataset. I’ve resampled the number of available bicycles to the maximum hourly value and cut the series off in mid-August (before a larger data collection gap and the less busy autumn and winter seasons):
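The resampling step can be sketched in pandas roughly as follows. Note that the column names (`timestamp`, `available_bikes`) and the exact cutoff date are assumptions for illustration; the actual SharedMobility.ai schema and the cutoff used in the notebook may differ.

```python
import pandas as pd

# Toy stand-in for one station's raw observations
# (hypothetical column names -- the real schema may differ)
df = pd.DataFrame({
    "timestamp": pd.date_range("2019-05-05", periods=6, freq="30min"),
    "available_bikes": [3, 5, 4, 2, 7, 6],
})

# Resample to the maximum number of available bicycles per hour
hourly = (
    df.set_index("timestamp")["available_bikes"]
      .resample("1h")
      .max()
)

# Keep only data before the mid-August cutoff (assumed date)
hourly = hourly[hourly.index < "2019-08-15"]
```

In a multi-station setting, the same resampling would be applied per station, e.g. via `groupby` on the station identifier before resampling.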

Models

To benchmark TimeGPT, I computed several baseline predictions using statsforecast’s HistoricAverage, SeasonalNaive, and AutoARIMA models, for horizons of 1 hour, 12 hours, and 24 hours.

Here are examples of the 12-hour predictions:

We can see that HistoricAverage is essentially a straight line at the average past value. SeasonalNaive is a little more sophisticated: it assumes that the future will be a repeat of the past (here: the previous day), which produces the shifted curves in the examples above. Finally, AutoARIMA seems to do a better job than the first two models, but it also takes much longer to compute.
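To make the two simpler baselines concrete, here is a minimal NumPy sketch (not the statsforecast implementations used in the benchmark): HistoricAverage predicts the mean of all past values for every future step, while SeasonalNaive repeats the values from one season ago (24 hours for hourly data).

```python
import numpy as np

def historic_average(y_train, h):
    """Predict the mean of all past values for every future step."""
    return np.full(h, y_train.mean())

def seasonal_naive(y_train, h, season_length=24):
    """Repeat the last full season (e.g. the previous day for hourly data)."""
    last_season = y_train[-season_length:]
    return np.array([last_season[i % season_length] for i in range(h)])

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy example: three days of a 24-hour pattern plus noise
rng = np.random.default_rng(42)
y = np.tile(np.sin(np.linspace(0, 2 * np.pi, 24)), 3) + rng.normal(0, 0.1, 72)
train, test = y[:48], y[48:60]

print(rmse(test, historic_average(train, 12)))
print(rmse(test, seasonal_naive(train, 12)))
```

On this strongly seasonal toy series SeasonalNaive wins easily; on the Vienna data, the table below shows the opposite, which is what makes the result interesting.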

For comparison, here’s TimeGPT with a 12-hour horizon:

You can find the full code at https://github.com/anitagraser/ST-ResNet/blob/570d8a1af4a10c7fb2230ccb2f203307703a9038/experiment.ipynb

Results

In the following table, the best model overall is highlighted in bold; unsurprisingly, it is for the 1-hour horizon. The best models for the 12-hour and 24-hour horizons are marked in italics.

| Model | Horizon (hours) | RMSE |
|---|---|---|
| HistoricAverage | 1 | 7.0229 |
| HistoricAverage | 12 | 7.0195 |
| HistoricAverage | 24 | 7.0426 |
| SeasonalNaive | 1 | 7.8703 |
| SeasonalNaive | 12 | 7.7317 |
| SeasonalNaive | 24 | 7.8703 |
| AutoARIMA | 1 | **2.2639** |
| AutoARIMA | 12 | 5.1505 |
| AutoARIMA | 24 | 6.3881 |
| TimeGPT | 1 | 2.3193 |
| TimeGPT | 12 | *4.8383* |
| TimeGPT | 24 | *5.6671* |
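For reference, the error metric in the table is the root mean squared error over the $h$-step forecast (I assume, as is common, that it is additionally averaged across the stations):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{h}\sum_{t=1}^{h}\left(y_t - \hat{y}_t\right)^2}
```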

AutoARIMA and TimeGPT are pretty closely tied. Interestingly, the SeasonalNaive model performs even worse than the very simple HistoricAverage, which is an indication of the irregular nature of the observed phenomenon (probably caused by irregular restocking of stations, depending on the system operator’s decisions).

Conclusion & next steps

Overall, TimeGPT struggles much more with the longer horizons than it did in the previous BikeNYC experiment. Its error more than doubled between the 1-hour and 12-hour predictions, and its prediction quality only barely out-competes AutoARIMA’s for 12 and 24 hours.

I’m tempted to test AutoARIMA for the BikeNYC dataset to further complete this picture.

Of course, the SharedMobility.ai dataset has been online for a while, so I cannot be completely sure that we now have a fair comparison. For that, we would need a completely new / previously unpublished dataset.


For a more thorough write-up, head along to Graser, A. (2025). Timeseries Foundation Models for Mobility: A Benchmark Comparison with Traditional and Deep Learning Models. arXiv preprint arXiv:2504.03725.
