Quick Stats – Runtime ANALYZE for Better Query Plans – Anant Aneja, Ahana

An optimizer’s plans are only as good as the estimates available for the tables it is querying. For queries over recently ingested data that has not yet been ANALYZE-d to update table or partition stats, the Presto optimizer flies blind: it cannot make good query plans and falls back to syntactic join orders. To solve this problem, we propose building ‘Quick Stats’. By using the file-level metadata available in open data lake formats such as Delta and Hudi, and by examining stats from Parquet and ORC footers, we can build a representative stats sample at a per-partition level. These stats can be cached for use by later queries, and can also be persisted back to the metastore. New strategies for tuning these stats, such as sampling, can be added to improve their precision.
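As a rough illustration of the idea, the sketch below merges per-file column statistics, of the kind a Parquet or ORC footer records, into a single partition-level summary. It is not Presto's implementation: the `ColumnStats` shape, the `merge_file_stats` helper, and the sample numbers are all hypothetical and assume an integer column.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ColumnStats:
    """Per-column statistics as a file footer might record them (hypothetical shape)."""
    row_count: int
    null_count: int
    min_value: Optional[int]  # None when every value in the file is null
    max_value: Optional[int]


def merge_file_stats(file_stats: Iterable[ColumnStats]) -> ColumnStats:
    """Combine per-file stats into one partition-level summary for a column."""
    rows = nulls = 0
    lo = hi = None
    for s in file_stats:
        # Row and null counts are additive across files.
        rows += s.row_count
        nulls += s.null_count
        # Min and max combine by taking the extreme over files that have one.
        if s.min_value is not None:
            lo = s.min_value if lo is None else min(lo, s.min_value)
        if s.max_value is not None:
            hi = s.max_value if hi is None else max(hi, s.max_value)
    return ColumnStats(rows, nulls, lo, hi)


# Made-up footer stats for three files in one partition.
footers = [
    ColumnStats(1000, 10, 5, 90),
    ColumnStats(500, 0, 1, 40),
    ColumnStats(200, 200, None, None),  # a file whose values are all null
]
partition = merge_file_stats(footers)
```

Row counts, null counts, and min/max bounds merge exactly this way. Distinct-value counts do not, because per-file counts cannot simply be added, which is one place where an extra strategy such as sampling could plausibly help.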

Customer-Facing Presto at Rippling – Andy Li, Rippling

Presto serves a variety of use cases, but it is most often used for large-scale analytical queries. We have been transitioning to Presto to power our data platform and our customer-facing scripting language, RQL (Rippling Query Language), which runs arbitrary customer queries behind core products. Presto enables diverse, federated querying at scale. In this talk, Andy will cover where Presto sits in Rippling’s ecosystem as a core query layer, our collaboration and contributions toward closer integration with Apache Pinot, and lessons learned from using Presto to handle a wide variety of query patterns.