Elevating Presto Query Optimization: Leveraging State-of-the-Art Techniques for Improved Performance 

    Presto, a prominent open-source distributed SQL query engine, has been at the leading edge of high-performance data analytics for over a decade. In analytical data processing, the effectiveness of query optimization is paramount. Over the last half-century, optimizing SQL queries has been a hotbed of research and development, resulting in groundbreaking innovations. This blog post explores how the Ahana/IBM query optimization team is bringing state-of-the-art query optimization techniques into Presto. We delve into the search space and cost model and link to documents that provide in-depth blueprints for improvements that could shape the future of query optimization in Presto.

    The Hunt for the Optimal Execution Plan  

    Optimizing SQL queries is a problem that has garnered the attention of researchers and engineers for decades. The goal is to find the most efficient execution plan for a given query, considering the exponential number of possible ways data can be accessed and joined. While optimization techniques have evolved significantly, the challenge remains. Let’s explore the fundamental aspects of query optimization. 

    Search Space  

    The search space in query optimization refers to the universe of potential execution plans the optimizer explores to find the best one. An optimizer generates its search space by applying transformation rules that logically map a query to equivalent algebraic representations and implementation rules that map those algebraic representations to execution plans consisting of physical operators, executable by the query evaluation engine. A broader search space enables a better chance of finding the most efficient execution plan. Presto can benefit from expanding its search space to consider an increased range of query execution strategies.  

    Cost Model  

    A cost model assigns a cost, an estimate of query execution plan efficiency, to each potential query plan in the search space. Various factors determine the cost of an execution plan; however, cardinality estimation, the process for determining the size of intermediate results after applying predicates or aggregation, plays an outsized role in cost estimation. Advanced data distribution statistics and estimation techniques can dramatically improve the ability of the Presto optimizer to achieve more accurate cardinality estimates. Presto already can store and use statistics from historical runs to improve future query plans. We believe that this framework can be used in conjunction with better estimation techniques to greatly benefit the Presto optimizer. 

    Bringing State-of-the-Art to Presto 

    While the query optimizer of Presto is capable, there is always room for improvement. The challenge lies in elevating it by incorporating state-of-the-art techniques from the field of query optimization. This endeavor becomes increasingly important as more companies use Presto for demanding enterprise workloads. The Ahana/IBM query optimization team has started this effort by determining areas where the Presto optimizer can benefit from incorporating state-of-the-art, analyzing the current capabilities of Presto relative to that work, and providing detailed blueprints for evolving it accordingly. We initially focused on expanding the search space and improving the cardinality estimation aspect of the cost model. Following is a summary of our initial effort with links to documents that supply the technical details.  

    Opening Up The Search Space 

    • A relational query written in the SQL relational query language might have multiple query blocks due to view references, nested table expressions, and subqueries. A query optimizer must effectively merge multiple query blocks into equivalent single-block representations to open up the optimizer search space of join alternatives exhaustively. Multi-Query Block Merge Optimizations explores state-of-the-art for such transformations in comparison with Presto, and provides a blueprint for applying such transformations in Presto.  
    • An optimizer must decide the order in which to join tables referenced in a query. Inner joins are fully commutative and associative; hence, all join orders are valid. Outer joins, anti-joins, and semi-joins are asymmetric operators with limited commutative and associative properties; consequently, not all sequences of joins are correct when present. The Presto optimizer does not attempt to reorder joins when other than inner joins are involved. Comprehensive Join Enumeration distills the prior art for enumerating joins of all types and details how to harness that work to remove this deficiency. 
    • The Presto optimizer predominately performs transformations based on heuristics; however, a large class of transforms, such as eager aggregation or push down of join through union, are more speculative. These transformations could result in a sub-optimal execution plan if not made in a cost-based manner. Cost-based Logical Transformations surfaces the prior art for performing cost-based logical transformations and discusses incorporating the techniques into Presto.  

    Cardinality Estimation Enhancements 

    • Cardinality estimation, the process for determining the size of intermediate results after applying predicates or aggregation, plays a determinative role in cost estimation. The cardinality estimation process must exploit and combine all available individual and joint selectivity estimates in a principled, consistent, and unbiased manner. An Architecture for Consistent Cardinality Estimation describes theoretically sound state-of-the-art cardinality estimation techniques based on adjustments and maximum entropy and a strategy for evolving the Presto cardinality estimation process according to those techniques. 
    • The individual and joint predicate selectivity estimates that form the basis for cardinality estimation derive from data-driven selectivity estimators of various types. Presto today exploits simple single attribute data summaries that assume independence and uniform distributions.  
      • Histograms extends this with state-of-the-art histogram estimators for modeling non-uniform distributions.  
      • Stored Samples enhances the available estimators by providing an incrementally maintained backing sample for determining individual and joint selectivity estimates by querying the sample within the optimization process. 

    Search Space Controls 

    For those concerned that casting the net wider might lead the Presto optimizer to more frequently catch the occasional sub-optimal execution plan, Plan Constraints introduces state-of-the-art mechanisms to lock down an existing plan and to hint your way out of a jam. In both cases, the optimizer accepts a plan constraint specification, which determines critical aspects of an execution plan, such as access method, join method, and join order. The optimizer uses that specification as a filter on the search space it generates and selects the execution plan that best matches the constraint. Once the optimizer generates a plan you approve of, ask the optimizer for a constraint to lock it down so that future upgrades to the optimizer will not affect your application. In cases where the optimizer selects a sub-optimal plan, one may manually construct a constraint to nudge the optimizer to the optimal plan. 

    Conclusion and Call to Action 

    Optimization of relational queries has come a long way in the past 50 years, and Presto can take data analytics performance to new heights by embracing the decades of advances in the field. The Ahana query optimization team has initiated this effort by providing detailed blueprints for incorporating state-of-the-art techniques to expand the Presto search space and refine its cost model. We invite the Presto community to join us in designing and developing these proposed enhancements. The query optimization prior art is vast, and we are just scratching the surface with the work identified here. We encourage the community to collaborate and help identify other opportunities for advancing Presto query optimization capabilities by drawing from decades of advances. These and future enhancements will enable Presto to handle even more complex queries with continued exceptional performance. 

    For query optimization enthusiasts who would like to collaborate with our community in elevating the Presto optimizer, please join our optimizer working group meetings https://lists.prestodb.io/g/optimizer-wg to get started.  

    References 

    1. Multi-Query Block Merge Optimizations
    1. Comprehensive Join Enumeration 
    1. Cost-based Logical Transformations 
    1. An Architecture for Consistent Cardinality Estimation 
    1. Histograms 
    1. Stored Samples 
    1. Plan Constraints