GPU-Accelerated Presto C++ is Here: Nightly Images for NVIDIA GPUs

TL;DR: GPUs can run analytical SQL dramatically faster than CPUs. Published numbers show single-node Presto C++ with GPU-accelerated operators running a TPC-H-style benchmark in ~100 seconds versus ~1,246 seconds on a high-end CPU, roughly 12× faster, and a UCX/NVLink exchange running more than 6× faster than the HTTP exchange in a multi-GPU Presto deployment. Together with NVIDIA, IBM has brought those GPU-accelerated operators to the Velox execution engine that powers Presto C++, and built the new UCX-based exchange that exploits fast interconnects like NVLink. All of this work is open source, and you can try it today: we now publish nightly builds of an OSS GPU-accelerated Presto C++ image.

Why put a GPU under Presto?

Analytical SQL is a great fit for GPUs: scans and aggregations over columnar Parquet data are massively parallel and bandwidth-hungry, precisely the workload modern data-center GPUs are built for.

Velox, the engine underpinning Presto C++, was designed to be extensible at the operator level, which makes it the natural place to plug in a completely different kind of hardware: the GPU.

In a collaboration between IBM and NVIDIA, we have been integrating NVIDIA’s cuDF (part of the RAPIDS ecosystem) into Velox so that table scans, filters, projections, joins, and aggregations can execute on the GPU instead of the CPU. Because the integration lives in Velox, the same work benefits other Velox-based engines too, e.g. Spark.

Presto is part of a broader shift toward GPU acceleration in databases. Sirius, a GPU-native SQL engine built on cuDF that plugs into DuckDB, recently topped the ClickBench benchmark on a single GH200, with the fastest runtime and at least 7.2× better cost-efficiency than the CPU-based competition. The cuDF-Polars engine delivered significant speedups over CPU on the PDS benchmark (TPC-H- and TPC-DS-style), reaching more than 20× at 3 TB. Presto C++ brings that same cuDF-powered acceleration to distributed, large-scale SQL workloads.

The hard part: moving data between GPUs

Offloading individual operators to the GPU is only half the story. In a distributed query, workers have to shuffle large volumes of intermediate data between each other during the exchange phase. If that data has to take a slow path — copied off the GPU, pushed through a generic HTTP exchange, and copied back — the interconnect becomes the bottleneck and the GPU sits idle.

Fundamental to our work is a new GPU-aware exchange built on UCX. UCX is an open-source communication framework that can drive high-performance interconnects directly: NVLink for intra-node GPU-to-GPU transfers, and RDMA over InfiniBand / RoCE between nodes. We extended Presto with a new exchange protocol, UcxExchange, that uses UCX when both ends are GPU-aware workers, keeping the data in GPU memory and avoiding the bottleneck of copying it through CPU memory. You can read the full details in a paper to be published at VLDB’26.

What we open-sourced

All of this is upstream in the Presto and Velox projects. On the Presto side, some work has already been merged, some pull requests are still under review, and more improvements and features are in the pipeline.

The first group of pull requests packages a GPU-ready native image: it adds the CUDA runtime, the RDMA libraries, and the UCX libraries to the Prestissimo runtime image, and uses rapids-cmake to generate device code for the full set of NVIDIA GPU architectures the CUDA runtime supports, which keeps Presto and Velox in lockstep. (See PRs #27972 and #27975.)

The second group integrates the UcxExchange that leverages NVLink: it carries a per-node transport type (HTTP vs. UCX) from the Presto plan into the Velox PlanFragment so the engine can pick the right transport for each exchange, falls back to HTTP whenever the coordinator is one end of the exchange, and wires the UCX communicator into the native worker’s startup and shutdown lifecycle. (See PRs #27864, #27719, and #27937.)
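The transport choice described here can be sketched as a tiny decision function. This is only an illustration: the function name pick_transport and the endpoint labels gpu-worker, cpu-worker, and coordinator are hypothetical, and the real decision is made in the Presto plan and carried into Velox, not in a script. The sketch assumes that any exchange that does not connect two GPU-aware workers uses HTTP.

```shell
# Illustrative only: pick_transport and the endpoint labels are hypothetical,
# not identifiers from Presto or Velox.
# Usage: pick_transport <producer-kind> <consumer-kind>
pick_transport() {
  case "$1:$2" in
    gpu-worker:gpu-worker) echo "UCX" ;;   # both ends are GPU-aware workers
    *) echo "HTTP" ;;                      # coordinator or CPU worker on either end
  esac
}

pick_transport gpu-worker gpu-worker    # prints UCX
pick_transport gpu-worker coordinator   # prints HTTP
```

Defaulting to HTTP and opting in to UCX only for the GPU-to-GPU case mirrors the fallback behavior described for the coordinator.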

Together these contributions take Presto C++ from “CPU-only native engine” to “a native engine that can run operators on the GPU and shuffle data between GPUs over NVLink and RDMA.”

Try it today: nightly images and a ready-made prestorial

You no longer have to build any of this yourself. We publish nightly Docker images on Docker Hub:

Coordinator: prestodb/presto:coordinator-gpu-nightly
GPU worker: prestodb/presto-native:gpu-nightly

The easiest way to get started is the new GPU prestorial that was just merged into the prestorials repository. With a single docker compose command, it deploys a small GPU-accelerated Presto C++ cluster (1 coordinator and 1 GPU-enabled worker) backed by a file-based Hive metastore over the local filesystem.

Requirements

GPU acceleration needs real hardware and a little host setup:

An NVIDIA GPU supported by the CUDA 13.0 runtime with up-to-date drivers
The NVIDIA Container Toolkit so Docker can expose the GPU to the worker container
Docker configured to use the NVIDIA runtime
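
Before bringing the cluster up, a quick preflight check along these lines may save some debugging. It is a sketch, not part of the prestorial: the helper name missing_tools is ours, and the script only confirms that the nvidia-smi and docker commands are present before asking each for basic information. It does not verify the CUDA 13.0 requirement.

```shell
# Print every command from the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not part of the prestorial.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools nvidia-smi docker)"
if [ -n "$missing" ]; then
  echo "Missing prerequisites:" $missing
else
  nvidia-smi -L                               # list the GPUs the driver can see
  docker info --format '{{json .Runtimes}}'   # show the configured container runtimes
fi
```

If the NVIDIA Container Toolkit is configured for Docker, one would expect an nvidia entry among the runtimes that docker info reports.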

The prestorial currently ships an amd64 compose file. Bring the cluster up with:
docker compose -f docker-compose-amd64.yaml up

The coordinator listens on port 8080 (UI at http://localhost:8080) and the worker on 8082. Connect with the Presto CLI and start querying:
presto --server localhost:8080 --catalog hive

The configuration that matters

The interesting bits live in the worker’s etc_worker/config.properties:

cudf.enabled=true turns on GPU execution in the native worker. On a CPU-only build the flag is simply ignored, so the same configuration degrades gracefully.
cudf.exchange, cudf.exchange.server.port, and cudf.intra_node_exchange control the UCX-based GPU exchange. In this single-worker prestorial the exchange stays disabled — its real value appears in multi-GPU clusters where NVLink and RDMA do the heavy lifting.
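
As a rough sketch of how these worker settings might look: only cudf.enabled=true is taken from the description above. The value false for cudf.exchange is an assumption about how "disabled" is expressed, and the remaining two properties are left commented out because their values are not given in this post. The config.properties shipped with the prestorial is the authoritative version.

```properties
# etc_worker/config.properties (illustrative excerpt)
cudf.enabled=true

# Assumed boolean form of "exchange disabled" for the single-worker setup.
cudf.exchange=false

# Values not stated in this post; consult the prestorial before setting them.
# cudf.exchange.server.port=<port>
# cudf.intra_node_exchange=<value>
```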

The Hive catalog (etc_worker/catalog/hive.properties) also enables a few GPU-oriented Parquet reader options to feed the GPU efficiently. On the worker container itself, the GPU is reserved through Docker’s device reservations — adjust device_ids to pick a different GPU.
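
For orientation, a GPU device reservation in a Compose file generally has the shape below. This is a sketch using Docker Compose's standard reservation syntax. The service name worker is an assumption, and the prestorial's actual compose file may be organized differently.

```yaml
services:
  worker:                      # hypothetical service name
    image: prestodb/presto-native:gpu-nightly
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]      # change to pick a different GPU
              capabilities: [gpu]
```

The IDs in device_ids correspond to the GPU indices the host reports, for example via nvidia-smi -L.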

What kind of speedups are we talking about?

For a sense of scale, NVIDIA and IBM published reference numbers for GPU-native Velox and cuDF. On a Presto benchmark similar to TPC-H at scale factor 1,000 (single node, 21 of 22 queries), a GH200 Grace Hopper Superchip completed the set in ~100 seconds versus ~1,246 seconds on an AMD 7965WX CPU — roughly a 12× speedup — with an RTX PRO 6000 Blackwell Workstation close behind. On a multi-GPU DGX A100 (8× A100), enabling the UCX/NVLink exchange delivered more than a 6× speedup over the baseline HTTP exchange.
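
The 12× figure follows directly from the two runtimes. A one-line check, assuming awk is available (the helper name speedup is ours, and both inputs are the rounded figures quoted above, so the result is approximate too):

```shell
# speedup <baseline-seconds> <accelerated-seconds>, rounded to one decimal place
speedup() {
  awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", base / fast }'
}

speedup 1246 100   # prints 12.5
```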

Reference reading:

NVIDIA — Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF
Velox — Extending Velox: GPU Acceleration with cuDF

Note: Numbers depend heavily on hardware, data, and query mix — treat them as directional, not as a promise about your workload.

A big, exciting step — under heavy development

We want to be upfront: these are nightly images under heavy active development. Interfaces, configuration properties, and the images themselves will change; not every operator is GPU-accelerated yet; and there are rough edges. Do not run this in production. Treat it as a preview for researchers, enthusiasts, and contributors who want to explore where GPU-accelerated analytics is heading. A good pattern is to produce your tables and data with one of the CPU-based prestorials and use the GPU image purely for analysis.

This is a milestone we are really excited about: a fully open-source path to running Presto queries on NVIDIA GPUs, with an exchange that can saturate NVLink and RDMA. Pull the images, fire up the prestorial, run a few queries, and tell us what you find.

Get involved

Try the GPU prestorial
Follow the upstream work in the Presto repository
File issues and ideas in the prestorials repo

Happy (GPU-accelerated) querying!

Notes

Packaging the GPU-ready image: PR #27972 (CUDA runtime, RDMA and UCX libraries) and PR #27975 (rapids-cmake-driven CUDA architecture configuration).

Building the UCX exchange: PR #27864 (transport types on the Velox PlanFragment), PR #27719 (HTTP fallback when the coordinator is one end of the exchange), and PR #27937 (UCX communicator lifecycle in PrestoServer).