Videos Archive - PrestoDB

Quick Stats – Runtime ANALYZE for Better Query Plans – Anant Aneja, Ahana

An optimizer’s plans are only as good as the estimates available for the tables it is querying. For queries over recently ingested data that has not yet been ANALYZE-d to update table or partition stats, the Presto optimizer flies blind: it is unable to make good query plans and resorts to syntactic join orders. To solve this problem, we propose building ‘Quick Stats’: by utilizing file-level metadata available in open data lake formats such as Delta & Hudi, and by examining stats from Parquet & ORC footers, we can build a representative stats sample at a per-partition level. These stats can be cached for use by newer queries, and can also be persisted back to the metastore. New strategies for tuning these stats, such as sampling, can be added to improve their precision.
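
The footer-merging idea in the abstract can be illustrated with a small sketch. This is not Presto's implementation or API: the class names, the `merge_file_stats` and `quick_stats` functions, and the in-memory cache are all hypothetical, and the per-file numbers are assumed to have been read already from Parquet or ORC footers. It only shows how per-file row counts, null counts, and min/max values could be combined into one partition-level estimate and cached.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class FileColumnStats:
    """Stats for one column in one file (hypothetical shape of footer data)."""
    row_count: int
    null_count: int
    min_value: Optional[int]  # None when the file holds no non-null values
    max_value: Optional[int]


@dataclass(frozen=True)
class PartitionColumnStats:
    """Merged, partition-level estimate for the same column."""
    row_count: int
    null_fraction: float
    min_value: Optional[int]
    max_value: Optional[int]


def merge_file_stats(files: Iterable[FileColumnStats]) -> PartitionColumnStats:
    """Combine per-file stats: counts add up, min/max take the extremes."""
    rows = 0
    nulls = 0
    lo: Optional[int] = None
    hi: Optional[int] = None
    for f in files:
        rows += f.row_count
        nulls += f.null_count
        if f.min_value is not None:
            lo = f.min_value if lo is None else min(lo, f.min_value)
        if f.max_value is not None:
            hi = f.max_value if hi is None else max(hi, f.max_value)
    null_fraction = nulls / rows if rows else 0.0
    return PartitionColumnStats(rows, null_fraction, lo, hi)


# Hypothetical in-memory cache keyed by partition name, so later queries
# can reuse the merged stats instead of re-reading file footers.
_stats_cache: Dict[str, PartitionColumnStats] = {}


def quick_stats(
    partition: str,
    load_footers: Callable[[str], Iterable[FileColumnStats]],
) -> PartitionColumnStats:
    """Return cached stats for a partition, building them on first use."""
    if partition not in _stats_cache:
        _stats_cache[partition] = merge_file_stats(load_footers(partition))
    return _stats_cache[partition]
```

Row counts, null counts, and min/max merge exactly across files, which is why they are used here. Distinct-value counts do not merge this way, and that is one place where the sampling strategies mentioned in the abstract could help.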

Query Performance Optimization at Alibaba Cloud Log Analytics Service – Bin Wang, Alibaba Cloud

Row-limited operators are fast: short-circuiting; CSE rule by combining FilterNode and ProjectNode, and complex function splitting; avoiding unnecessary intermediate data structures; optimization: merging similar TableScans; merging arbitrary() and inner complex scalar functions.

Building the Presto Open Source Community – Ahana Round Table

In this round table moderated by Eric Kavanagh of The Bloor Group, panelists from Uber, Facebook, Ahana, and Alibaba will discuss all aspects of building a thriving open source community around PrestoDB including why Presto is so popular & the problems it solves, the open source model the foundation follows, why governance and transparency are so important to an open source community, and what the community looks for in open source projects.

Petabyte Scale Log Analysis at Alibaba: Infrastructure, Challenge and Optimization – Yunlei Ma

Yunlei will share the infrastructure for petabyte-scale log data collection, storage, and analysis at Alibaba. Presto plays a key role in the infrastructure. Presto processes hundreds of billions of queries, covering more than 1 quadrillion rows, every day.