Diving into the Presto Native C++ Query Engine (Presto 2.0)
For the past three years, engineers from Meta, Ahana (now IBM), Intel, and ByteDance have been working to build Velox, a state-of-the-art execution engine designed to be composable across compute engines. From this work came the native C++ Presto worker, a full rewrite of the Presto query execution engine built on Velox (also referred to as Project Prestissimo and now called Presto 2.0).
Today we are happy to share that we’ve reached a point where companies like Meta and IBM have started using the Presto C++ engine in production. You can read more about IBM’s benchmarking results in our latest blog, and more about Meta’s Presto C++ use case for batch efficiency in their VeloxCon presentation.
There is a strong and active community effort rallied around this project. The number of active contributors has grown rapidly and today engineers from Uber, ByteDance, Pinterest, Intel, Neuroblade, and more regularly participate in discussions and forums. We’re moving fast and are seeing continuous improvements every month. The eventual goal of this project is to fully deprecate the Presto Java execution engine.
This blog provides an overview of the Presto native engine, including why we built it and its architecture and underlying concepts. We’ll also give more information about its production usage at Meta and IBM.
Motivation
Presto 2.0 is a full rewrite of the Presto query execution engine, aka the Presto worker process. The goal is to bring a 3-4x improvement in its performance and scalability by moving from the old Java implementation to a modern C++ one. This move towards native execution aligns with industry initiatives like Databricks Photon and Apache DataFusion, among others. We are very excited to bring this technology to Presto to make it the best Open Data Lakehouse engine in the market.
The new native engine leverages state-of-the-art developments in query processing, including vectorized and SIMD execution, runtime optimizations, and smart I/O prefetching and coalescing, to improve CPU and I/O efficiency. The built-in memory management gives the developer full control of memory allocations, eliminating the uncertainty and limitations of Java garbage collection, and includes sophisticated memory arbitration and spilling features. All of the CPU, memory, I/O, and network use is tuned to get the desired performance characteristics. A testament to this is the recent TPC-DS benchmarking numbers achieved at IBM, which we shared in the blog link above.
Native engine architecture
The Presto C++ architecture is a drop-in replacement: the Presto Java worker process in a cluster is swapped for the Native (C++) worker. As shown in the diagram below, a Presto cluster comprises a coordinator node, worker nodes, and a Hive metastore. Queries are submitted to the coordinator, where they are parsed, optimized, and scheduled for distributed execution across worker nodes. Aside from the differences in the Worker boxes, the Native Presto C++ cluster is an exact replica of the Java cluster.

To achieve this simplicity of a drop-in replacement, the Native worker is designed with the same data and control APIs as the Java worker. While this is the goal, it is not always achieved: enhancements or restrictions during development can introduce deviations in the API, for example changes in the exchange protocol, intermediate types of aggregation functions, or new operators that provide a desired performance boost.
This has led us to enhance the coordinator to be aware of the native execution mode and its capabilities. We are building lockdown features in the coordinator to improve the user experience for Presto C++ by throwing errors or warning messages when a query uses features that differ between the worker engines. Examples of such errors are missing types, missing functions, or unsupported plan node fields in the Native engine compared to Java. These features are a precursor of a Native engine SPI for Presto. We have begun work on the Connector SPI as well.
The ultimate goal for the Native engine is to fully deprecate the Java execution engine, and the team is moving rapidly to reach parity with it.
Velox Library
Presto Native Engine builds heavily on the Velox library (https://velox-lib.io/) which is another open-source project from Meta.
Meta’s Data Infrastructure team started the Velox project to consolidate knowledge from several data processing engines in use at the company to provide a single set of reusable primitives across them. This consolidation streamlines engineering efforts and facilitates a unified SQL experience across the various engines. A single dialect of functions and operator APIs can be used across engines seamlessly.
Modern data systems are increasingly built in a composable manner. Meta is a strong proponent of such architectures and published a manifesto on the subject at VLDB 2023: https://research.facebook.com/publications/the-composable-data-management-system-manifesto/. While part of the thinking here is futuristic, Velox is a definitive step in that direction. It can interoperate with other open standards like Substrait and Arrow to build a composable data system as described in this blog.
Beyond Presto, Velox is also used for a Spark runtime in Project Gluten and in TorchArrow.
Query Processing with Velox
Presto Native engine is a narrow shim that delegates most of its query processing to Velox. As described above, the user submits a SQL query to the coordinator, where it is parsed and planned for distributed execution. The result of planning is a graph of plan fragments that the coordinator schedules across all workers in parallel. Each plan fragment comprises a DAG of query processing plan nodes that execute over pipelined data, which originates from table scan nodes in the fragments and passes through the operators for each plan node.
The Presto worker receives a plan fragment task and creates a corresponding Velox task for it, first translating the Presto plan fragment into a Velox plan node tree. The Velox task then creates multiple drivers of operator pipelines for the plan nodes. Each operator in the pipeline is a unit of data processing that works with Velox vectors, where each vector is a column of the data. This columnar vectorization, with SIMD and processing in tight loops, lends itself well to modern pipelined processor architectures. The vectors can be encoded in different ways, like Flat, Constant, or Dictionary, to allow optimal layouts during processing.
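To make the encoding idea concrete, here is a minimal sketch of flat versus dictionary encoding for a string column. These types are illustrative only: the real Velox classes (FlatVector, ConstantVector, DictionaryVector) are far richer, and the names below are our own.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Flat encoding: one value per row, stored contiguously.
struct FlatStringVector {
  std::vector<std::string> values;
  const std::string& at(size_t row) const { return values[row]; }
};

// Dictionary encoding: rows store small integer indices into a (usually
// much smaller) dictionary of distinct values, saving memory and letting
// per-value work (e.g. hashing) be done once per distinct value.
struct DictionaryStringVector {
  std::vector<std::string> dictionary;  // distinct values
  std::vector<int32_t> indices;         // one index per row
  const std::string& at(size_t row) const { return dictionary[indices[row]]; }
};

// Build a dictionary-encoded vector from flat data (naive linear probe,
// purely for illustration).
inline DictionaryStringVector dictionaryEncode(const FlatStringVector& flat) {
  DictionaryStringVector dict;
  for (const auto& v : flat.values) {
    int32_t idx = -1;
    for (size_t i = 0; i < dict.dictionary.size(); ++i) {
      if (dict.dictionary[i] == v) {
        idx = static_cast<int32_t>(i);
        break;
      }
    }
    if (idx < 0) {
      idx = static_cast<int32_t>(dict.dictionary.size());
      dict.dictionary.push_back(v);
    }
    dict.indices.push_back(idx);
  }
  return dict;
}
```

A column of country codes with millions of rows but only a handful of distinct values, for example, collapses to a few dictionary entries plus an array of small integers.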
Each operator has its own algorithms for its processing, such as expression evaluation, filtering, joining, aggregation, and windowing. The operators use runtime optimizations to tune their behavior based on the characteristics of the data in the vectors being processed. For example, if there are few distinct values in the data, an aggregation can use array indexing for the group keys instead of an expensive hash function, and expression evaluation can reorder filter clauses based on their selectivity. These strategies greatly enhance efficiency and scalability.
Each Velox operator also has a memory pool associated with it. The memory pool accounts for the memory of any intermediate structures, like the RowContainer and HashTable used by the operator. The memory pools of the different operators, along with a system memory pool, are tracked by the Velox memory manager as the query proceeds. If an operator needs more memory, it requests it from its memory pool, which can trigger memory arbitration if no free memory is available. The arbitration can cause other operators to spill their current state and free up memory for the requesting operator. Doing so can slow down query processing, but lets queries make progress with a limited amount of memory. Arbitration can be adjusted using priorities, which give the user an element of control over workload management.
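The arbitration flow above can be sketched conceptually. This is a toy model, not Velox's real MemoryPool/MemoryArbitrator (which handle capacity growth, priorities, and concurrency); the names and the spill-the-largest-victim policy here are illustrative assumptions.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// A pool tracks an operator's reserved bytes and knows how to spill its
// state back to storage when the arbitrator asks.
struct Pool {
  int64_t reserved = 0;
  std::function<int64_t()> spill;  // returns bytes released
};

class Arbitrator {
 public:
  explicit Arbitrator(int64_t capacity) : free_(capacity) {}

  void addPool(Pool* pool) { pools_.push_back(pool); }

  // Try to grant `bytes` to `requester`. While free memory is short, ask
  // the pool with the largest reservation (other than the requester) to
  // spill; fail only when no victim can release anything.
  bool reserve(Pool* requester, int64_t bytes) {
    while (free_ < bytes) {
      Pool* victim = nullptr;
      for (Pool* p : pools_) {
        if (p != requester && p->reserved > 0 &&
            (!victim || p->reserved > victim->reserved)) {
          victim = p;
        }
      }
      if (!victim) return false;
      int64_t released = victim->spill();
      if (released <= 0) return false;
      victim->reserved -= released;
      free_ += released;
    }
    free_ -= bytes;
    requester->reserved += bytes;
    return true;
  }

 private:
  int64_t free_;
  std::vector<Pool*> pools_;
};
```

The key property the sketch captures is that a reservation request never simply fails while other operators hold spillable state; the query trades throughput (spilling and re-reading) for progress under a fixed memory budget.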
The diagram below shows the Velox components within a Velox task for a Presto query task.

The Presto Native engine is a loop that receives tasks with plan fragments from the Presto scheduler, creates a Velox task for them, and calls the Velox driver loop to progress with the query execution.
Presto C++ deployments
Over the last few years, the Velox and Presto communities have been hard at work to build the Native engine. At this point, it supports most of the Presto SQL dialect. The performance runs of the engine with customer production or benchmark workloads consistently show a 2-3x improvement over Presto Java.
Register for PrestoCon on June 25th, 2024 to hear more about the latest benchmark runs with Native engine.
In this next section, we will outline how the Native engine is being deployed at Meta and IBM.
Native engine at Meta
Meta runs one of the largest data warehouses in the world, with a wide variety of SQL processing from interactive dashboards to large scheduled ETL jobs. Presto supports Meta's interactive and short-running (under ~20 minutes) batch workloads.
Prestissimo and Velox were built by Meta Data Infrastructure teams to support not only faster query processing but also more efficient execution, enabling additional growth for customer use cases on our existing Presto fleet. Developing and deploying Prestissimo into this existing massive data warehouse required extensive collaboration between multiple teams over several years.
While Prestissimo and Velox are state-of-the-art query processing engines, we needed to go one step further and utilize this infrastructure to support our existing warehouse workloads. At Meta, Prestissimo was first deployed for a performance- and business-critical SQL use case: the experimentation platform. The Prestissimo optimizations and deployment were a great success for the experimentation platform workloads, reducing its hardware footprint by 3x and making queries 1.5-2x faster even with less hardware. We are glad to share that the Prestissimo platform has been running reliably and meeting customer expectations with minimal service interruption since launch. For a completely new query engine developed from scratch, this is a milestone we are proud to have achieved.
After the major success with the experimentation platform, Meta engineers began enhancing Prestissimo for more generalized batch and ad-hoc workloads. While the experimentation platform had a fixed set of query shapes, Prestissimo needed to support the entire suite of functionality that Presto offers. We've built a number of additional features, including table writers, memory and spilling management, functional gap reduction, secure communication, and task and driver management; almost all areas of Prestissimo went through a major revamp to support generalized workloads. Validating results and testing the reliability of a completely new query engine is critical, and has seen us make major investments in testing infrastructure such as fuzzers, Presto verification with functional substitution, and data writer verification, as well as extensive shadowing of live workloads. We are excited to share that this extensive revamp has resulted in a stable platform with ~1.5x wall time and ~2-3x CPU time improvements.
In addition to performance and reliability, it's important to ensure that Prestissimo query results are correct and match existing Presto behavior; one of the most challenging tasks given the scale of Meta's workloads. While known correctness bugs are fixed in Prestissimo, multiple issues have also been found in Presto Java which are being fixed to bring parity to both platforms. Another challenge is that Prestissimo uses different libraries than Presto Java (e.g. RE2, simdjson), resulting in behavior that differs from the existing Presto platform. This poses a unique challenge for workload migration, and we are actively working on innovative solutions to address these issues.
We are excited to see customers embracing this new technology and its benefits. For example, the experimentation platform has also begun transitioning its batch workloads from Spark to Prestissimo, with a large fraction in production already. Transitioning our warehouse batch workloads from the older Java stack onto Prestissimo is well underway, with a plan to productionize a vast majority of our existing workloads by Q1 2025.
Native engine at IBM
Unlike the big data vendors, IBM is a platform company. At IBM, we've been focused on building a comprehensive Open Data Lakehouse platform called watsonx.data (https://www.ibm.com/products/watsonx-data). Enterprises use watsonx.data to build open and governed data lakehouses. The data is stored in open formats like Parquet and Iceberg so that it can be queried by varied engines and multiple tools. watsonx.data comes with several data engines, such as Presto and Spark, as well as the Milvus vector database, which can be customized for optimal performance on different query workloads. It supports intelligent cataloging that can be used to efficiently search the lakehouse for various data artifacts, and it supports hybrid cloud deployments that can be configured in multi-cloud or on-premises configurations in minutes.
watsonx.data now also offers the Presto Native engine as a query engine alongside Presto Java, Spark, and others, with Presto C++ as the main query engine for open data lakehouse workloads. The ex-Ahana team at IBM has built several capabilities in Velox and Presto C++ to support this use. Beyond enhancing the SQL coverage in Velox, the team has also built a high-performance Parquet and Iceberg reader, support for the S3 storage system, JWT- and TLS-based authentication, and integration with the Watson Knowledge Catalog.
The team has been constantly improving the performance of the Native engine against the TPC-DS benchmark, which is used by different data infrastructure vendors as a fair comparison of their capabilities. TPC-DS tests the boundaries of the platform by executing multiple streams of intensive, high-complexity decision-support queries, with a data maintenance task in between. While we still have much more to do, the early results of the benchmark at scale factors of 1, 10 and 100 TB are very encouraging. Please refer to the blog article linked above for more results.
Conclusion
The Presto C++ Native engine is a significant advancement in the Presto world. By harnessing native execution capabilities and leveraging the powerful Velox library, it has pushed the envelope for Presto data processing. We are very excited to bring these contributions to the industry and are constantly amazed by the growth and impact we see in the work.
This project is supported by a strong community of engineers. Beyond Meta and IBM, engineers from Uber, ByteDance, Pinterest, Intel, Alibaba, and Neuroblade participate in its development. A special interest working group (https://lists.prestodb.io/g/native-worker-wg) meets every other Thursday at 11 am PST to discuss designs and issues, with senior engineers from the participating companies present. Problems are discussed and resolved across the community.
Please get involved with the community. If you have any questions or want to learn more, feel free to reach out to us through our community channels.
Stay tuned for more exciting updates from the Native Engine team!