An update on Presto C++ (and more about Presto sidecar)

Presto is transforming its evaluation engine to use Velox, a highly performant, modular execution engine. As anyone who’s worked on databases knows, they are large and complicated, and making two systems behave identically is nearly impossible. To address the subtle differences between how Velox and Presto Java execute queries, and to enable things such as C++ UDFs and optimizations such as constant folding, a new process called the Presto sidecar is being developed.

This post introduces the sidecar process, the problems it will solve, and how you will be able to use it in the future.

The Genesis of Presto C++

Presto’s evolution over the years has been marked by its flexibility and adaptability. Presto was initially built as a purely JVM-based project, an approach that enabled rapid development and swift adoption. Java’s garbage collection (GC) was both a blessing and a curse in this journey. While it gave developers a powerful tool for memory management, it also introduced performance bottlenecks, particularly in large-scale deployments.

Garbage collection, a core feature of Java, is designed to automatically manage memory. However, in a distributed system like Presto, where thousands of nodes need to work in perfect harmony, the unpredictable nature of garbage collection can lead to significant performance degradation. In large clusters, the slowest node often dictates the overall performance, and garbage collection can exacerbate this issue, leading to reliability and efficiency concerns.
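
To put a rough number on the slowest-node effect, here is a back-of-the-envelope sketch in Python. It assumes that each worker independently has some small chance of hitting a GC pause during a query stage, and that a stage finishes only when its slowest worker does. The 1% pause probability and the function names are my own illustrative assumptions, not measurements from a Presto cluster.

```python
def prob_any_worker_pauses(p_pause: float, num_workers: int) -> float:
    """Chance that at least one of num_workers independent workers pauses."""
    return 1.0 - (1.0 - p_pause) ** num_workers


def stage_latency(worker_latencies: list[float]) -> float:
    """A stage is only done when its slowest worker is done."""
    return max(worker_latencies)


if __name__ == "__main__":
    # Assumed 1% chance of a GC pause per worker per stage (illustrative only).
    for n in (1, 10, 100, 1000):
        print(n, round(prob_any_worker_pauses(0.01, n), 4))
```

Under these assumptions, a pause that is rare on any single worker becomes close to certain somewhere in the cluster: with 1,000 workers the chance that at least one pauses is above 99.9%, and that one worker sets the latency of the whole stage.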

The motivation to rewrite the evaluation engine in C++ stems from a need to address these limitations. By leveraging a language that offers more control over memory management and can better utilize modern CPU features like vectorized instructions, Presto C++ aims to achieve higher performance and greater reliability.

Why Presto C++?

Memory Management and Performance: One of the primary drivers behind the move to C++ is the need for more efficient memory management. Unlike Java, where memory management is largely handled by the garbage collector, C++ gives developers direct control. This allows for more precise and optimized memory usage, which is crucial in a high-performance query engine like Presto. By managing memory directly, Presto can minimize the overhead associated with garbage collection, leading to improved performance and reduced latency.

Hardware Utilization: Modern CPUs come equipped with vectorized (SIMD) instructions, which apply the same operation to multiple data points in parallel. This is particularly beneficial in data processing tasks, where operations like scans, joins, and aggregations can be significantly accelerated. Java gives developers little direct control over these instructions and relies largely on the JIT compiler to vectorize code automatically, which limits the potential for optimization. C++, on the other hand, can target these CPU features directly, providing a substantial performance boost.

Disaggregation and Future-Proofing: The trend in modern data architectures is moving towards disaggregation, where different components of the data stack are split into smaller, more manageable pieces. This approach has driven significant cost savings and innovation, particularly in cloud-based environments. By moving to a C++-based execution engine, Presto can further disaggregate its architecture, allowing for more flexible deployment models and better integration with other components of the data stack. This positions Presto at the forefront of the next wave of database innovation, ensuring it remains relevant and competitive in the evolving landscape.

The Current State of Presto’s C++ Engine

I’ll share a brief summary of the current state of Presto C++. If you’re interested in seeing more technical details, I recommend this blog from IBM and Meta.

The transition to C++ has been an ongoing effort since late 2020. While considerable progress has been made, there are still some gaps that need to be addressed before C++ can fully replace the Java-based engine as the default.

Those include:

More connectors and plugins

One of Presto’s strengths has always been its modular architecture, which allows users to customize and extend its functionality through connectors and plugins. Currently, the C++ engine does not support the full range of connectors and plugins available in the Java version. This includes both built-in and user-defined functions, as well as certain types of connectors. Addressing this gap is a priority, as it is essential for maintaining the flexibility and extensibility that Presto users expect.

Feature parity

Achieving complete feature parity between the Java and C++ engines is a complex task. Certain operations, such as array sorting with lambda expressions, are inherently difficult to vectorize and may require alternative implementations in C++. Additionally, differences in regular expression handling and other language-specific features present further challenges. While most core functionalities are already supported in the C++ engine, there remains a long tail of edge cases that need to be resolved.

Split-brain scenarios

As the C++ and Java engines evolve, there is a risk of “split-brain” scenarios, where the two engines have differing understandings of certain operations. This can lead to inconsistencies in query execution, particularly when migrating from a Java-based cluster to a C++-based one. Ensuring consistent behavior across both engines is critical for a smooth transition and requires ongoing effort to harmonize their capabilities.
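
One simple way to picture a split-brain is as two function catalogs that have drifted apart. The sketch below is a toy illustration in Python, not Presto code: the two catalogs and every function name in them are hypothetical, and the real engines track far more than a flat set of signatures.

```python
# Hypothetical catalogs: what the Java coordinator believes exists versus
# what the C++ workers can actually execute. All names are made up.
JAVA_COORDINATOR_FUNCTIONS = {"abs(bigint)", "lower(varchar)", "my_java_udf(varchar)"}
CPP_WORKER_FUNCTIONS = {"abs(bigint)", "lower(varchar)", "my_cpp_udf(varchar)"}


def catalog_mismatch(coordinator: set[str], worker: set[str]) -> dict[str, list[str]]:
    """Report functions that only one side of the cluster knows about."""
    return {
        # The coordinator would accept these, but the workers cannot run them.
        "plans_but_cannot_run": sorted(coordinator - worker),
        # The workers could run these, but the coordinator rejects the query.
        "runs_but_cannot_plan": sorted(worker - coordinator),
    }


if __name__ == "__main__":
    print(catalog_mismatch(JAVA_COORDINATOR_FUNCTIONS, CPP_WORKER_FUNCTIONS))
```

Both kinds of mismatch hurt: in the first case a query passes planning and then fails on the workers, and in the second a C++ function exists but can never be called.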

The Path Forward: Making C++ the Default

Despite the challenges, the Presto community is committed to making C++ the default execution engine as soon as possible. The benefits in terms of performance, scalability, and future-proofing far outweigh the difficulties associated with the transition. In fact, the shift to C++ is seen not just as an enhancement, but as a necessity for Presto to fulfill its potential as the go-to query engine for modern data infrastructures.

To facilitate this transition, we have several initiatives underway. We’re focused on expanding plugin and connector support, and efforts are being made to port existing Java-based connectors and plugins to C++. Additionally, work is being done to develop a robust plugin API for the C++ engine, allowing users to create and integrate their own extensions. This will restore the full range of customization options that Presto users are accustomed to and will be a key milestone in achieving feature parity with the Java engine.

To address the challenges of split-brain scenarios and ensure consistent behavior between the Java and C++ engines, a sidecar architecture is being introduced. This involves a native process that acts as an intermediary between the C++ execution engine and the Java-based coordinator. The sidecar will be responsible for communicating the capabilities of the C++ engine to the coordinator, ensuring that queries are correctly parsed and executed, and that custom functions, types, and connectors are accurately represented.
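
The idea can be sketched in a few lines of Python. This is a conceptual model only, under my own assumptions: `FakeSidecar`, `Coordinator`, `list_functions`, and `validate` are hypothetical names, and the actual sidecar is a separate native process rather than an in-process object. What the sketch shows is the direction of the information flow: the coordinator learns what the C++ engine supports from the sidecar instead of from its own Java catalog.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FakeSidecar:
    """Stand-in for the native sidecar process (hypothetical interface)."""
    worker_functions: frozenset

    def list_functions(self) -> frozenset:
        # The sidecar reports what the C++ engine can actually execute.
        return self.worker_functions


@dataclass
class Coordinator:
    """Toy coordinator that trusts the sidecar's view of the engine."""
    sidecar: FakeSidecar
    known_functions: frozenset = field(init=False)

    def __post_init__(self) -> None:
        # Load capabilities from the sidecar, not from a Java-side catalog.
        self.known_functions = self.sidecar.list_functions()

    def validate(self, called_functions: list[str]) -> None:
        """Reject a query at planning time if the engine cannot run it."""
        missing = sorted(set(called_functions) - self.known_functions)
        if missing:
            raise ValueError(f"not supported by the C++ engine: {missing}")


if __name__ == "__main__":
    coordinator = Coordinator(FakeSidecar(frozenset({"abs", "my_cpp_udf"})))
    coordinator.validate(["abs", "my_cpp_udf"])  # accepted
```

With a single source of truth, a query that calls an unsupported function fails early with a clear error instead of failing on the workers, and a custom C++ function becomes visible to the coordinator as soon as the sidecar reports it.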

The transition to C++ will be gradual, with users encouraged to test the new engine in their environments and provide feedback. Documentation is being developed to guide users through the process of enabling the C++ engine, with clear instructions on how to switch back to the Java engine if necessary. This phased approach will allow the Presto team to identify and address any remaining issues before making C++ the default.

While there are challenges to overcome, the benefits in terms of performance, scalability, and flexibility make this transition an exciting development for the Presto community. As the work on the C++ engine continues, we invite users to participate in the journey, test the new capabilities, and help shape the future of Presto.

I covered much of this in my last meetup talk, an intro to the Presto sidecar, which you can now watch on demand. If you would like to try out the new C++ execution engine, see the documentation.

In the coming months, we expect to see C++ take center stage in Presto’s architecture, paving the way for new possibilities and cementing Presto’s role as a leader in the world of distributed SQL query engines.