Learn more about the new open-source Redis-based Historical Statistics Provider for Presto from Jay Narale, software engineer at Uber who built it. Redis is an open-source in-memory database that integrates with Presto through a dedicated connector. Now with a Redis history-based optimizer, you can enhance the efficiency and speed of query execution for Presto by using historical stats to generate optimized plans for your queries. Jay will cover how the Redis HBO utilizes the in-memory capabilities of Redis to store & analyze historical query execution data, which helps the optimizer make informed decisions about query planning and resource allocation based on the historical patterns of queries, leading to improved execution times and resource utilization.
The Presto Foundation is the organization that oversees the development of the Presto open source project. Hosted at the Linux Foundation, the Presto Foundation operates under a community governance model with representation from all its members. In this fireside chat, we’ll hear more from Girish Baliga, Chair of the Presto Foundation, on what it actually means to be a Presto Foundation member and why this governance model is so important for open source projects. We’ll also talk with Vikram Murali of IBM, the newest member of the Presto Foundation. He’ll share more about IBM’s journey to Presto, how they’re using it in IBM’s new watsonx.data lakehouse, and why the Presto Foundation played an important role in IBM’s decision to choose Presto.
Over the last decade Presto has become one of the most widely adopted open source SQL query engines. In use at companies large and small, Presto’s performance, reliability, and efficiency at scale have become critical to many companies’ data infrastructures. In this panel we’ll hear from three of the largest companies running Presto at scale – Meta, Uber, and Intuit. They’ll share more about their learnings, some of their impressive performance metrics with Presto, and what they envision going forward for Presto at their respective companies.
In this talk, we present some of the work streams we have underway at Uber to optimize Presto performance. In particular, we will cover enabling aggregation pushdown in queries in order to use statistics in the file headers/footers, our investigations into and attempts to efficiently executing approximate queries, and our experience with humongous object allocation in Presto.
In this round table moderated by Eric Kavanagh of The Bloor Group, panelists from Uber, Facebook, Ahana, and Alibaba will discuss all aspects of building a thriving open source community around PrestoDB including why Presto is so popular & the problems it solves, the open source model the foundation follows, why governance and transparency are so important to an open source community, and what the community looks for in open source projects.
The Real Time Analytics Platform at Uber serves 100M+ queries daily and is used for several critical features: from end-user app features to radius selection for Uber Eats. All these queries are proxied via a custom internal fork of Presto (named Neutrino) that is optimized for low-latency/high-throughput (50ms latency at 1000s of RPS). With this talk we plan to share our learnings over the last 6 months and how we run Presto reliably at this scale for real-time analytics.
At Uber, Presto is heavily used as one of the primary data analytics tools, and Presto’s query performance has profound production impact at Uber. As part of the Presto optimization effort, we turned to explore Alluxio as a caching solution. Alluxio is an open source data orchestration platform often used by many compute frameworks as the caching layer. Alluxio caching is currently enabled on ~2000 nodes across 6 clusters at Uber. In this presentation, we will talk about our journey at Uber of integrating Alluxio cache into Presto. We will discuss the Uber specific challenges we encountered and how we addressed them. We will also present the performance improvements we have seen. Besides, we will also discuss our plan and next steps, and potential future collaboration opportunities with the community.
Today’s digital-native companies need a modern data infra that can handle data wrangling and data-driven analytics for the ever-increasing amount of data needed to drive business. Specifically, they need to address challenges like complexity, cost, and lock-in. An Open SQL Data Lakehouse approach enables flexibility and better cost performance by leveraging open technologies and formats. Join us for this panel where leading technologists from the Presto open source project will share their vision of the SQL Data Lakehouse and why Presto is a critical component.
In this talk, I will be talking about a microservice that we have built at Uber to be able to analyze Presto queries. The Presto Query Engine does not provide endpoints for query analysis purposes. One has to either execute the query or gather insights from the query explain plan. In this talk, I will talk about 1. The work that we had to do to do the query analysis in a microservice using Presto as a library. 2. Doing predicate analysis on the queries to come up with data formatting recommendations in order to improve query performance. 3. Using the analysis service for query result cache invalidation. The analysis figures out whether the results from a previous run of the query are still valid and can be reused.
Data analytic tables in the big data ecosystem are usually large and some of them can reach petabytes in size. Presto as a fast query engine needs to be intelligent to skip reading unnecessary data based on filters. In addition to the existing filtering to skip partitions, files, and row groups, Apache Parquet Column Index provides further filtering to pages, which is the I/O unit for the Parquet data source. In this presentation, we will show the work that we integrated Parquet Column Index to Presto code base, the performance gains, etc. We will also talk about our effort to open-source this project to PrestoDB and look forward to collaborating with the community to merge!
Presto is a popular distributed SQL query engine for running interactive analytic queries. Presto provides a Connector API that allows plugins to dozens of data sources, and thus positions itself as a single point of access to a wide variety of data. At Uber, we significantly improved Presto’s Kafka connector to meet Uber’s scale. For example, the new connector allows dynamic Kafka cluster and topic discovery so users can directly query existing Kafka topics without any registration and onboarding process; dynamic schema discovery allows fetching the latest schema without any Presto restart or deployment; smart time range suggestions to users based on Kafka metadata analysis to avoid large-range scans and thus keep the query interactive.
Prism is a gateway service for all Presto queries at Uber. It addresses Uber specific needs in four main areas – resource management, query gating, monitoring, and security. It is responsible for proxying over three million weekly queries from 6000+ weekly active users across all of Uber. Presto has variable execution times due to high multi-tenancy at Uber. Prism helps in overcoming those challenges using features like query routing, load balancing, query gating, session parameter checks, failover clusters which helps in maintaining a 99.9% availability and reliability SLA for Presto at Uber. Functionality – Query Execution: 1. Async execution API returns data stream 2. Async execution API returns File Descriptor – Routing – Prism can route queries to different clusters based on client sources. Other functionalities: Load Balancing, Query Gating, Failover, Session Properties, Security
Apache Hudi is a data lake platform that supercharges data lakes. Originally created at Uber, Hudi provides various ways to strike trade-offs between ingestion speed and query performance by supporting user defined partitioners, automatic file sizing which are favorable to query performance. Hudi integrates with PrestoDB to make this data available for queries. During ingestion, data is typically co-located based on arrival time. However, query engines perform better when the data frequently queried is co-located together, which may be different from arrival time order. We will discuss a new framework called “data clustering” to make data lakes adaptable to query patterns, thereby improving query latencies. Finally, we will discuss future work to support improving data locality using custom bucketing of data during ingestion, avoiding some of the rewrite costs.