PrestoDB Blog

Elevating Presto Query Optimization: Leveraging State-of-the-Art Techniques for Improved Performance

By David Simmen, Anant Aneja, Vivek Bharathan, Zachary Blanco, Aditi Pandit & Ethan Zhang March 21, 2024

Presto, a prominent open-source distributed SQL query engine, has been at the leading edge of high-performance data analytics for over a decade. In analytical data processing, the effectiveness of query optimization is paramount. Over the last half-century, optimizing SQL queries has been a hotbed of research and development, resulting in groundbreaking innovations. This blog post…

Presto on AWS at Twilio – Lesson Learned and Optimization

By Ali LeClerc December 28, 2022 (updated September 21, 2023)

Earlier this month we hosted PrestoCon, a fantastic in-person event that showcased the innovation around the Presto project. In this blog we’ll detail Twilio’s presentation on their Presto use case, including their architecture, key optimizations, and lessons learned. You can also check out their full presentation here. In their session, Twilio engineers Aakash Pradeep and…

Avoid Data Silos in Presto in Meta: the journey from Raptor to RaptorX

By Rongrong Zhong, James Sun & Ke Wang January 28, 2022 (updated September 21, 2023)

Raptor is a Presto connector (presto-raptor) that powers some critical interactive query workloads at Meta (previously Facebook). Though it is mentioned in the ICDE 2019 paper Presto: SQL on Everything, it remains somewhat mysterious to many Presto users because no documentation is available for this feature. This article will shed some light…

Native Parquet Writer for Presto

By Lu Niu & Zhenxiao Luo June 29, 2021 (updated September 21, 2023)

Overview: As more companies deploy Presto, it is used not only for queries but also for data ingestion and ETL jobs. There is a need to improve Presto’s file writer performance, especially for popular columnar file formats such as Parquet and ORC. In this article, we introduce the brand…

RaptorX: Building a 10X Faster Presto

By James Sun, Ke Wang, Rohit Jain, Saksham Sachdev, Shixuan Fan, Bin Fan, Zhenxiao Luo & Lu Niu February 4, 2021 (updated September 21, 2023)

RaptorX is the internal name of a project that aims to cut query latency well below what vanilla Presto can achieve. This blog post introduces the hierarchical cache work, which is the key building block for RaptorX. With the support of the cache, we are able to boost query performance by 10X. This new architecture can beat…

Data Lake Analytics: Alibaba’s Federated Cloud Strategy

By George Wang June 30, 2020 (updated September 21, 2023)

Presto is known as a high-performance, distributed SQL query engine for Big Data. It offers large-scale data analytics with multiple connectors for accessing various data sources. This capability enables Presto users to extend some features further and build a large-scale data federation service in the cloud. Alibaba Data Lake Analytics embraces Presto’s federated query…

Table Scan: Doing The Right Thing With Structured Types

By Orri Erling September 26, 2019 (updated September 21, 2023)

In the previous article we saw what gains are possible when filtering early and in the right order. In this article we look at how we do this with nested and structured types. We use the 100G TPC-H dataset, but now we group top level columns into structs or maps. Maps, lists and structs are…

Complete Table Scan: A Quantitative Assessment

By Orri Erling July 29, 2019 (updated September 21, 2023)

In the previous article we looked at the abstract problem statement and possibilities inherent in scanning tables. In this piece we look at the quantitative upside with Presto. We look at a number of queries and explain the findings. The initial impulse motivating this work is the observation that table scan is by far the…

Everything You Always Wanted To Do in Table Scan

By Orri Erling, Maria Basmanova, Ying Su, Tim Meehan & Elon Azoulay June 29, 2019 (updated September 21, 2023)

Table scan, on the face of it, sounds trivial and boring. What’s there in just reading a long bunch of records from first to last? Aren’t indexing and other kinds of physical design more interesting? As data has gotten bigger, the columnar table scan has only gotten more prominent. The columnar scan is a fairly…