Optimizing Presto for Uber scale

    In this talk, we present some of the work streams we have underway at Uber to optimize Presto performance. In particular, we cover enabling aggregation pushdown so that queries can use the statistics stored in file headers and footers, our investigation of and attempts at efficiently executing approximate queries, and our experience with humongous object allocation in Presto.
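
As a rough illustration of the aggregation pushdown idea, the sketch below answers COUNT(*), MIN and MAX from per-file footer statistics without reading any rows. The `FooterStats` type and the `aggregate_from_footers` function are hypothetical names invented for this example; they are not Presto or Uber APIs, and a real reader would obtain these values from the file format's metadata.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FooterStats:
    """Hypothetical per-file statistics for one integer column."""
    row_count: int
    min_value: Optional[int]  # None when the writer recorded no statistics
    max_value: Optional[int]


def aggregate_from_footers(
    footers: List[FooterStats],
) -> Optional[Tuple[int, Optional[int], Optional[int]]]:
    """Answer COUNT(*), MIN(col), MAX(col) from footer statistics alone.

    Returns None when a non-empty file lacks statistics, signalling that
    the engine has to fall back to scanning the data.
    """
    non_empty = [f for f in footers if f.row_count > 0]
    if any(f.min_value is None or f.max_value is None for f in non_empty):
        return None
    if not non_empty:
        return 0, None, None
    count = sum(f.row_count for f in non_empty)
    lowest = min(f.min_value for f in non_empty)
    highest = max(f.max_value for f in non_empty)
    return count, lowest, highest
```

The fallback path matters: statistics are optional in most columnar formats, so a pushdown of this kind is only safe when every contributing file actually carries them.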

    Presto on Spark – Facebook – Virtual Meetup

    At Facebook, we have spent the past several years independently building and scaling both Presto and Spark for Facebook-scale batch workloads. It is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine. In this talk, we’ll take a deep dive into the Presto and Spark architectures, with a focus on the key differentiators (e.g., disaggregated shuffle) that are required to scale Presto further.

    Building the Presto Open Source Community – Ahana Round Table

    In this round table moderated by Eric Kavanagh of The Bloor Group, panelists from Uber, Facebook, Ahana, and Alibaba will discuss all aspects of building a thriving open source community around PrestoDB, including why Presto is so popular and the problems it solves, the open source model the foundation follows, why governance and transparency are so important to an open source community, and what the community looks for in open source projects.

    Presto SQL Functions – Facebook

    In this talk, we will show how to use the recently introduced SQL function feature, explain how it works, and describe the ongoing work to support invoking arbitrary functions remotely through a remote UDF server.

    Common Sub Expression Optimization at Facebook

    In complex analytics queries, we often see repeated expressions: for example, parsing the same JSON column but extracting different fields, or elaborate CASE statements that share some predicates and differ in others. Previously, Presto would compute the same expression as many times as it appeared in the query. With common sub-expression optimization, the same expression is evaluated only once within a given project or filter operator. In our workload, we’ve seen 3x improvements on certain queries with expensive common sub-expressions like JSON_PARSE. Microbenchmarks also show a consistent ~10% performance improvement for simple common sub-expressions like x + y. In this talk, we will explain how this is implemented.
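
The effect described above can be sketched with a tiny expression evaluator that caches results per row, so a shared sub-expression such as a JSON parse runs once no matter how many projections use it. This is only an illustration of the idea: the tuple-based expression encoding and the `RowEvaluator` class are assumptions made for this example, and the abstract does not say how Presto itself implements the optimization.

```python
import json
from typing import Any, Dict, Tuple

# Expressions are nested tuples (an assumption made for this sketch):
#   ("col", name) | ("json_parse", expr) | ("get", expr, key)
Expr = Tuple[Any, ...]


class RowEvaluator:
    """Evaluates expressions against one row, sharing common sub-expressions."""

    def __init__(self, row: Dict[str, Any]) -> None:
        self.row = row
        self.cache: Dict[Expr, Any] = {}  # expression -> already computed value
        self.json_parse_calls = 0         # counts how often the costly step runs

    def evaluate(self, expr: Expr) -> Any:
        # Structurally identical expressions hash equally, so a repeated
        # sub-expression is found in the cache and not recomputed.
        if expr in self.cache:
            return self.cache[expr]
        op = expr[0]
        if op == "col":
            value = self.row[expr[1]]
        elif op == "json_parse":
            self.json_parse_calls += 1
            value = json.loads(self.evaluate(expr[1]))
        elif op == "get":
            value = self.evaluate(expr[1])[expr[2]]
        else:
            raise ValueError(f"unknown operator: {op}")
        self.cache[expr] = value
        return value
```

Projecting `("get", parsed, "a")` and `("get", parsed, "b")` with `parsed = ("json_parse", ("col", "payload"))` through one `RowEvaluator` parses the payload a single time; a fresh evaluator is needed for each row because the cache is row-specific.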

    Extending Presto at LinkedIn with a Smart Catalog Layer – LinkedIn

    In this talk, Walaa describes how LinkedIn extended its Presto Hive Catalog with a smart logical abstraction layer that is capable of reasoning about logical views with UDFs by using two core components, Coral and Transport UDFs. Coral is a view virtualization library, powered by Apache Calcite, that represents views using their logical query plans. Walaa shows how LinkedIn leverages Coral abstractions to decouple view expression language from the execution engine, and hence execute non-Presto-SQL views inside Presto, and achieve on-the-fly query rewrite for data governance and query optimization.

    Presto for Real Time Analytics at Uber – Ankit Sultana, Uber

    The Real Time Analytics Platform at Uber serves 100M+ queries daily and is used for several critical features: from end-user app features to radius selection for Uber Eats. All these queries are proxied via a custom internal fork of Presto (named Neutrino) that is optimized for low-latency/high-throughput (50ms latency at 1000s of RPS). In this talk, we share what we have learned over the last 6 months and how we run Presto reliably at this scale for real-time analytics.

    Free-Forever Managed Service for Presto for your Cloud-Native Open SQL Lakehouse – Wen Phan, Ahana

    Getting started with a do-it-yourself approach to standing up an open SQL Lakehouse can be challenging and cumbersome. Ahana Cloud Community Edition dramatically simplifies it and gives you the ability to learn and validate Presto for your open SQL Lakehouse—for free. In this session, we’ll show you how easy it is to register for, stand up, and use the Ahana Cloud Community Edition to query on top of your Lakehouse.

    Building a Modern Data Platform with Presto – Denis Krivenko, Platform24

    The Hadoop era is gone. Cloud computing is today’s reality. But… What if you cannot use public clouds? What if your cloud does not provide data platform capabilities? What if you want your solution to be cloud agnostic? In that case, you create your own cloud native data platform on Kubernetes. In this session, Denis will talk about the reasons for building an analytics data platform at Platform24, the architecture principles of a cloud native data platform, the data stack they use, and why Presto plays one of the key roles in it.

    How Blinkit is Building an Open Data Lakehouse with Presto on AWS – Satyam Krishna & Akshay Agarwal

    Blinkit, India’s leading instant delivery service, uses Presto on AWS to help them deliver on their promise of “everything delivered in 10 minutes”. In this session, Satyam and Akshay will discuss why they moved to Presto on S3 from their cloud data warehouse for more flexibility and better price performance. They’ll also share more on their open data lakehouse architecture which includes Presto as their SQL engine for ad hoc reporting, Ahana as SaaS for Presto, Apache Hudi and Iceberg to help manage transactions, and AWS S3 as their data lake.

    Query Execution Optimization for Broadcast Join using Replicated-Reads Strategy – George Wang, Ahana

    Today Presto supports broadcast joins by having one worker fetch data from a small data source to build a hash table, and then send the entire data set over the network to all other workers, where it is probed with rows from the large data source. This can be optimized with a new query execution strategy in which the source data from small tables is pulled directly by all workers, known as replicated reads from dimension tables. This feature comes with a nice caching property: because all N worker nodes now participate in scanning the data from remote sources, the table scan operation for dimension tables is cacheable on every worker node. In addition, resource utilization improves because the Presto scheduler can reduce the number of plan fragments to execute, as the same workers run tasks in parallel within a single stage, which reduces data shuffles.
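
A minimal simulation of the replicated-reads strategy, assuming the dimension table is exposed through a reader callable: every worker reads the small table itself, builds its own hash table, and probes it with its local partition of the large table, so no build-side data travels between workers. The function names and row shapes are invented for this sketch and do not correspond to Presto internals.

```python
from typing import Callable, Dict, Iterable, List, Tuple

FactRow = Tuple[int, str]       # (join key, payload) from the large table
DimRow = Tuple[int, str]        # (join key, name) from the small table
Joined = Tuple[int, str, str]   # (join key, payload, name)


def worker_join(
    fact_partition: Iterable[FactRow],
    read_dimension: Callable[[], Iterable[DimRow]],
) -> List[Joined]:
    """One worker: read the dimension table directly, build, then probe."""
    build: Dict[int, str] = {key: name for key, name in read_dimension()}
    return [
        (key, payload, build[key])
        for key, payload in fact_partition
        if key in build
    ]


def replicated_reads_join(
    fact_partitions: List[List[FactRow]],
    read_dimension: Callable[[], Iterable[DimRow]],
) -> List[Joined]:
    """Simulate N workers, each handling one partition of the fact table.

    The loop stands in for workers running in parallel; each iteration
    performs its own read of the dimension table.
    """
    results: List[Joined] = []
    for partition in fact_partitions:
        results.extend(worker_join(partition, read_dimension))
    return results
```

With N partitions the dimension table is read N times, which is the trade-off the abstract points to: more reads of a small table in exchange for no broadcast step, and each of those reads is a candidate for worker-local caching.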

    Dynamic UDF Framework and its Applications – Rongrong Zhong, Alluxio & Yanbing Zhang, Bytedance

    Presto has supported dynamically registered User Defined Functions (UDFs) since 2020. Over the years, we have used this framework to add support for SQL UDFs and remote / external UDFs. One common community request in the UDF domain is support for Hive UDFs. Many companies have legacy Hive pipelines, and engineers who are familiar with HQL and Hive UDFs. With remote UDFs, one can implement Hive UDF support as UDFs running on a remote cluster. But since Hive UDFs are written in Java, we can also run them inside the engine. We extended the dynamic UDF framework to support Java UDFs, and used this new extension to add Hive UDF support in Presto. With this feature, users can directly use their familiar Hive UDFs and UDAFs in their Presto queries.

    PrestoDB and Apache Hudi for the Lakehouse – Sagar Sumit & Bhavani Sudha Saktheeswaran

    Apache Hudi is a rich platform for building self-managing, exabyte-scale data lakes, optimized for incremental as well as regular batch processing. Hudi tables can be seamlessly synced to the Hive metastore, which unlocks the powerful capabilities of the Presto engine via the Hive connector. The Presto-Hudi integration is over five years old. What started as simply fetching splits using a custom input format for a Hudi Copy-On-Write table has evolved into snapshot querying of Merge-On-Read tables and using Hudi’s internal metadata table to boost query performance. In this session, we trace that journey and discuss in detail the recent developments that have made this integration stronger in terms of both usability and performance. We discuss the additional features that come with the brand new presto-hudi connector, such as the multi-modal index and data skipping for better query performance.
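
Data skipping in general can be pictured as pruning files by their column value ranges before any data is read. The sketch below shows that general technique only; `FileStats` and `files_to_scan` are hypothetical names, and nothing here reflects the actual layout of Hudi's metadata table or the presto-hudi connector's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical index entry: value range of one column in one file."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(index: List[FileStats], lower: int, upper: int) -> List[str]:
    """Return files whose range can overlap `lower <= col <= upper`.

    A file is skipped only when its [min_value, max_value] range lies
    entirely outside the predicate range, so no matching row is lost.
    """
    return [
        f.path
        for f in index
        if f.max_value >= lower and f.min_value <= upper
    ]
```

For an index covering ranges 0 to 9, 10 to 19 and 20 to 29, a predicate of 12 to 21 keeps only the second and third files. Pruning is most effective when data is clustered on the filtered column, since overlapping ranges leave little to skip.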