Building Connectors in Presto C++: Deep Dive into the TPCDS Connector (Lightning Talk) 

    At PrestoCon Day 2025, engineers from IBM presented a deep dive into how connectors in Presto C++ extend the engine’s modular capabilities, focusing on the newly implemented TPCDS benchmark connector. Connectors are central to Presto’s architecture, enabling the query engine to communicate seamlessly with external systems such as databases, file formats, or benchmark data generators. The TPCDS connector showcases how Presto C++ and Velox work together to dynamically generate benchmark data at different scale factors using dsdgen, providing a powerful tool for testing performance and scalability. In this article, we expand on that talk, breaking down the architecture of Presto connectors, the role of Velox in execution, and the implementation details of the TPCDS connector to help developers understand how to build and extend connectors for their own data sources. 

    Understanding Presto Plugins and Connectors

    What is a Plugin in Presto? 

    A plugin in Presto is an external code addition that hooks into the Presto engine at startup, enabling new capabilities such as: 

    • Connecting to new data sources 
    • Registering custom functions and types 
    • Extending metadata handling 

    Think of a plugin as a modular extension that enhances Presto’s core without modifying its engine.

    What is a Connector?

    A connector is a specialized plugin that acts as a bridge between Presto’s query engine and an external system such as a database, file system, or benchmark data generator. It manages: 

    • Table discovery 
    • Work splitting for parallel execution 
    • Reading rows into Presto’s engine 

    Connectors are loaded dynamically at runtime using Java’s Service Provider Interface (SPI), making Presto highly flexible.

    Anatomy of a Connector in Presto Java 

    To add a new connector in Presto Java, developers must implement several key interfaces: 

    • Connector Factory: Responsible for creating connector instances and passing configuration properties. 
    • Connector Class: The main entry point for communication between Presto runtime and the data source. 
    • Connector Handle Resolver: Maps engine-level opaque handles (e.g., table handles, column handles) to connector-specific implementations. 
    • Connector Metadata Interface: Manages table schema and column metadata. 
    • Connector Split Manager: Divides work into splits for parallel processing. 
    • Connector Transaction Handle: Represents connector transactions in queries. 

    Optional interfaces like Connector Page Source Provider help read data into Presto in its native page format.

    Introducing the TPCDS Connector in Presto C++ 

    What is TPCDS? 

    TPCDS is a standard benchmark provided by the Transaction Processing Performance Council (TPC) to evaluate query processing engines. It features: 

    • 7 fact tables (e.g., sales, returns) 
    • 17 dimension tables (e.g., customers) 
    • A scale factor that defines data size in gigabytes (e.g., scale factor 1 = 1 GB)

    TPCDS is widely used to test performance and scalability of SQL engines.

    Purpose of the TPCDS Connector 

    The TPCDS connector in Presto C++ generates data for TPCDS tables at various scale factors on the fly. It is designed for data generation only and does not support writing data to external storage. For persisting data, users can combine TPCDS with other connectors like Hive or Iceberg. 

    Connector Protocol and Serialization in Presto C++ 

    Presto C++ uses the Presto protocol layer to translate Java classes into C++ equivalents. This involves: 

    • Listing the Java classes to convert in a YAML file 
    • Using Mustache templates (a logic-less templating language) to generate JSON representations 
    • Serializing and deserializing connector classes to/from JSON 

    The connector protocol provides templates to handle serialization of connector handles, enabling smooth communication between Presto Java and C++ components. 
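
    To make the serialization step concrete, here is a minimal, hand-written sketch of a connector handle round-tripping through JSON. It is not the generated Presto protocol code: the struct TpcdsTableHandleSketch, its fields, and the helper functions are all hypothetical names chosen for illustration, and the parser only handles the flat object that toJson produces.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical, simplified handle. The field names are illustrative and are
// not taken from the real Presto protocol definitions.
struct TpcdsTableHandleSketch {
  std::string tableName;
  double scaleFactor;
};

// Serialize the handle to a flat JSON object.
std::string toJson(const TpcdsTableHandleSketch& handle) {
  // This sketch does not escape strings, so reject names containing quotes.
  assert(handle.tableName.find('"') == std::string::npos);
  std::ostringstream out;
  out << "{\"tableName\":\"" << handle.tableName
      << "\",\"scaleFactor\":" << handle.scaleFactor << "}";
  return out.str();
}

// Return the raw text after `"key":` up to the next ',' or '}'.
// Only suitable for the flat objects produced by toJson above.
std::string rawValue(const std::string& json, const std::string& key) {
  const std::string needle = "\"" + key + "\":";
  const std::size_t pos = json.find(needle);
  if (pos == std::string::npos) {
    throw std::invalid_argument("missing key: " + key);
  }
  const std::size_t start = pos + needle.size();
  const std::size_t end = json.find_first_of(",}", start);
  return json.substr(start, end - start);
}

// Rebuild the handle from the JSON text.
TpcdsTableHandleSketch fromJson(const std::string& json) {
  const std::string name = rawValue(json, "tableName");
  if (name.size() < 2 || name.front() != '"' || name.back() != '"') {
    throw std::invalid_argument("tableName must be a JSON string");
  }
  TpcdsTableHandleSketch handle;
  handle.tableName = name.substr(1, name.size() - 2);
  handle.scaleFactor = std::stod(rawValue(json, "scaleFactor"));
  return handle;
}
```

    In the real code base this boilerplate is generated from templates rather than written by hand, which is the point of the protocol layer: the Java and C++ sides agree on one JSON shape per class.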

    Presto to Velox Connector: Bridging Presto and Velox

    Velox is a C++ execution engine used by Presto C++ for query processing. The Presto to Velox connector:

    • Converts Presto connector handles and splits into Velox equivalents 
    • Maintains a registry of supported connectors for easy lookup 
    • Provides APIs for converting plan fragments, table write nodes, and splits 

    This conversion is essential for executing Presto queries efficiently in the Velox environment. 
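
    The registry idea can be sketched as a map from connector name to a conversion function. The types below (PrestoSplitSketch, VeloxSplitSketch, ConverterRegistrySketch) are hypothetical stand-ins, not the actual Presto or Velox classes, and the split fields are assumed for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, simplified split types. The real ones are protocol and
// Velox classes with more fields.
struct PrestoSplitSketch {
  std::string connectorName;
  int64_t startRow;
  int64_t rowCount;
};

struct VeloxSplitSketch {
  std::string connectorName;
  int64_t startRow;
  int64_t rowCount;
};

using SplitConverter =
    std::function<VeloxSplitSketch(const PrestoSplitSketch&)>;

// Keeps one converter per connector name and dispatches on lookup.
class ConverterRegistrySketch {
 public:
  void registerConverter(const std::string& name, SplitConverter converter) {
    assert(converter && "converter must be callable");
    const bool inserted =
        converters_.emplace(name, std::move(converter)).second;
    if (!inserted) {
      throw std::invalid_argument("converter already registered: " + name);
    }
  }

  // Convert a split using the converter registered for its connector.
  VeloxSplitSketch convert(const PrestoSplitSketch& split) const {
    const auto it = converters_.find(split.connectorName);
    if (it == converters_.end()) {
      throw std::out_of_range("no converter for: " + split.connectorName);
    }
    return it->second(split);
  }

 private:
  std::map<std::string, SplitConverter> converters_;
};
```

    Under this design, adding a connector means registering one more converter under its name; the lookup path stays unchanged.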

    Implementing the TPCDS Connector in Velox 

    Connectors in Velox are similar to Presto connectors. They support: 

    • Reading data via the table scan operator 
    • Writing data via the table write operator 

    Velox connectors have a connector factory that creates connector instances and registers them by name. 

    Two key APIs are typically required for a Velox connector:

    • createDataSource: Returns a data source that supports the addSplit and next APIs for processing data splits. 
    • createDataSink: Supports writing data to external locations. 

    Because the TPCDS connector is designed purely for data generation and does not support writing to external storage, it only implements createDataSource with a TPCDS data source that generates data split by split.
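
    The read-only shape described above can be sketched as follows. This is an illustration rather than the Velox API: every class and function name here carries a Sketch suffix to mark it as hypothetical, and the real Velox signatures take additional arguments such as output types and handles.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical minimal base types for the two sides of a connector.
struct DataSourceSketch {
  virtual ~DataSourceSketch() = default;
};
struct DataSinkSketch {
  virtual ~DataSinkSketch() = default;
};

class ConnectorSketch {
 public:
  explicit ConnectorSketch(std::string id) : id_(std::move(id)) {}
  virtual ~ConnectorSketch() = default;

  const std::string& id() const { return id_; }

  // Every connector in this sketch must be able to read.
  virtual std::unique_ptr<DataSourceSketch> createDataSource() = 0;

  // Writing is optional; the default rejects the request.
  virtual std::unique_ptr<DataSinkSketch> createDataSink() {
    throw std::logic_error("createDataSink is not supported by " + id_);
  }

 private:
  std::string id_;
};

struct TpcdsSourceStub : DataSourceSketch {};

// Read-only connector: overrides createDataSource only.
class TpcdsConnectorSketch : public ConnectorSketch {
 public:
  using ConnectorSketch::ConnectorSketch;

  std::unique_ptr<DataSourceSketch> createDataSource() override {
    return std::make_unique<TpcdsSourceStub>();
  }
};

// Factory that creates connector instances for a given id.
struct TpcdsConnectorFactorySketch {
  std::shared_ptr<ConnectorSketch> newConnector(const std::string& id) const {
    return std::make_shared<TpcdsConnectorSketch>(id);
  }
};

// Process-wide registry keyed by connector id.
std::map<std::string, std::shared_ptr<ConnectorSketch>>& connectorRegistry() {
  static std::map<std::string, std::shared_ptr<ConnectorSketch>> registry;
  return registry;
}

void registerConnector(std::shared_ptr<ConnectorSketch> connector) {
  assert(connector != nullptr);
  const std::string id = connector->id();
  connectorRegistry()[id] = std::move(connector);
}
```

    Leaving createDataSink at its rejecting default keeps the read-only intent explicit: a write attempt fails loudly instead of silently doing nothing.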

    Data Generation with dsdgen

    • The TPCDS benchmark provides dsdgen, a C program for generating benchmark data at various scale factors. dsdgen is the official tool recommended by TPC for data generation. 
    • It supports parallel data generation using configuration options. 
    • Velox borrows dsdgen source files (converted to C++) from DuckDB’s TPCDS extension. 
    • Modifications were made to ensure thread safety and parallel generation.
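
    One way to picture parallel generation is to divide a table's rows into disjoint ranges, each described by a starting row and a row count, so that independent workers can generate their chunks without overlap. The function below is a sketch of that arithmetic only; it is not dsdgen's own parallelization logic, and the names RowRange and partitionRows are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A contiguous chunk of rows: [startRow, startRow + rowCount).
struct RowRange {
  int64_t startRow;
  int64_t rowCount;
};

// Divide totalRows into at most numSplits contiguous, non-overlapping ranges.
// The first (totalRows % numSplits) ranges receive one extra row so that the
// sizes differ by at most one.
std::vector<RowRange> partitionRows(int64_t totalRows, int64_t numSplits) {
  assert(totalRows >= 0 && numSplits > 0);
  std::vector<RowRange> ranges;
  const int64_t base = totalRows / numSplits;
  const int64_t remainder = totalRows % numSplits;
  int64_t start = 0;
  for (int64_t i = 0; i < numSplits; ++i) {
    const int64_t count = base + (i < remainder ? 1 : 0);
    if (count == 0) {
      break;  // Fewer rows than splits: stop once all rows are assigned.
    }
    ranges.push_back({start, count});
    start += count;
  }
  return ranges;
}
```

    For example, 10 rows across 3 splits yields the ranges (0, 4), (4, 3), and (7, 3).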

    How the TPCDS Connector Works in Practice

    • The Velox table scan operator calls addSplit on the TPCDS data source, passing a TPCDS split that specifies a data chunk by starting row and row count. 
    • The operator repeatedly calls next to generate batches of rows for that split. 
    • Under the hood, next invokes dsdgen’s data generation methods wrapped in an API. 
    • This process continues until all rows in the split are generated. 
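
    The addSplit and next loop above can be sketched like this. The class TpcdsRowSourceSketch is hypothetical, and generateRows is a stand-in that returns consecutive row numbers where the real data source would call into the wrapped dsdgen routines and produce Velox vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical split: a chunk identified by starting row and row count.
struct TpcdsSplitSketch {
  int64_t startRow;
  int64_t rowCount;
};

// Stand-in for the dsdgen wrapper: returns row numbers instead of real data.
std::vector<int64_t> generateRows(int64_t startRow, int64_t count) {
  std::vector<int64_t> rows(static_cast<std::size_t>(count));
  std::iota(rows.begin(), rows.end(), startRow);
  return rows;
}

class TpcdsRowSourceSketch {
 public:
  // Called once per split by the scan operator.
  void addSplit(const TpcdsSplitSketch& split) {
    assert(split.startRow >= 0 && split.rowCount >= 0);
    nextRow_ = split.startRow;
    endRow_ = split.startRow + split.rowCount;
  }

  // Returns the next batch of up to maxBatchSize rows, or an empty batch
  // once every row of the current split has been generated.
  std::vector<int64_t> next(int64_t maxBatchSize) {
    assert(maxBatchSize > 0);
    if (nextRow_ >= endRow_) {
      return {};
    }
    const int64_t count = std::min(maxBatchSize, endRow_ - nextRow_);
    std::vector<int64_t> batch = generateRows(nextRow_, count);
    nextRow_ += count;
    return batch;
  }

 private:
  int64_t nextRow_ = 0;
  int64_t endRow_ = 0;
};
```

    With a split of 7 rows starting at row 5 and a batch size of 3, successive calls to next return batches of 3, 3, and 1 rows, followed by an empty batch that signals the split is finished.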

    Current Status and Future Work

    • A pull request for Velox changes implementing the TPCDS connector is approved and pending merge. 
    • Presto C++ changes for the TPCDS connector are in progress. 
    • Full support for the TPCDS connector in Presto C++ is expected soon. 
    • A detailed blog post on adding connectors in Presto C++ using TPCDS as an example will follow. 

    Summary

    • Presto’s modularity relies heavily on plugins, with connectors bridging the engine and external data. 
    • Connectors require implementing key interfaces for metadata, splits, and transactions. 
    • The TPCDS connector in Presto C++ generates benchmark data dynamically using dsdgen. 
    • Velox acts as the execution engine, converting Presto plans and connector handles. 
    • The TPCDS connector currently supports data generation only, with no external writes. 
    • This work enhances Presto C++’s capabilities for benchmarking and extensibility. 

    This deep dive into the TPCDS connector in Presto C++ provides developers with a technical understanding of how to build and extend connectors, leverage benchmark data generation, and integrate with Velox for efficient query execution. 

    Follow Presto on LinkedIn and YouTube, and join the Slack channel to interact with the community.