Videos

On-Demand Recordings from PrestoCon, Webinars, Meetups, and more

    • How Blinkit is Building an Open Data Lakehouse with Presto on AWS – Satyam Krishna & Akshay Agarwal

      Blinkit, India’s leading instant delivery service, uses Presto on AWS to help them deliver on their promise of “everything delivered in 10 minutes”. In this session, Satyam and Akshay will discuss why they moved to Presto on S3 from their cloud data warehouse for more flexibility and better price performance. They’ll also share more on their open data lakehouse architecture which includes Presto as their SQL engine for ad hoc reporting, Ahana as SaaS for Presto, Apache Hudi and Iceberg to help manage transactions, and AWS S3 as their data lake.

    • Query Execution Optimization for Broadcast Join using Replicated-Reads Strategy – George Wang, Ahana

      Today, Presto performs a broadcast join by having one worker fetch data from a small data source, build a hash table, and then send the entire data set over the network to all other workers, where the hash table is probed with rows from the large data source. This can be optimized with a new query execution strategy in which all workers pull the source data from small tables directly, known as replicated reads from dimension tables. This feature comes with a nice caching property, given that all N worker nodes now participate in scanning the data from remote sources: the table scan operation for dimension tables is cacheable on every worker node. In addition, resource utilization improves because the Presto scheduler can reduce the number of plan fragments to execute, as the same workers run tasks in parallel within a single stage, which reduces data shuffles.
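
      The replicated-reads idea in this abstract can be illustrated with a minimal sketch. This is not Presto code: the function names, the row format (plain dictionaries), and the loop that stands in for parallel workers are all assumptions made for illustration. Each simulated worker scans the dimension table itself and builds its own hash table, so no build-side data is shipped between workers.

```python
from collections import defaultdict


def build_hash_table(rows, key):
    """Group dimension rows by join key (the build side of a hash join)."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def probe(fact_split, table, key):
    """Probe the hash table with fact rows and emit the joined rows."""
    return [{**fact, **dim} for fact in fact_split for dim in table.get(fact[key], [])]


def replicated_reads_join(fact_splits, scan_dimension, key):
    """Hypothetical sketch: every worker reads the small table directly.

    `scan_dimension` is an assumed callable that returns the dimension rows;
    each loop iteration stands in for one worker processing one fact split.
    """
    results = []
    for split in fact_splits:
        # Each worker performs its own (cacheable) scan of the dimension table.
        local_table = build_hash_table(scan_dimension(), key)
        results.extend(probe(split, local_table, key))
    return results
```

      In this sketch the dimension scan runs once per worker, which is the trade-off the abstract describes: more reads of the small table in exchange for no network broadcast of the build side.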

    • Dynamic UDF Framework and its Applications – Rongrong Zhong, Alluxio & Yanbing Zhang, Bytedance

      Presto has supported dynamically registered User Defined Functions (UDFs) since 2020. Over the years, we have used this framework to add support for SQL UDFs and remote / external UDFs. One common community request in the UDF domain is support for Hive UDFs. Many companies have legacy Hive pipelines and engineers who are familiar with HQL and Hive UDFs. With remote UDFs, one can implement Hive UDF support as UDFs running on a remote cluster. But since Hive UDFs are written in Java, we can also run them inside the engine. We extended the dynamic UDF framework to support Java UDFs, and used this new extension to add Hive UDF support in Presto. With this feature, users can directly use their familiar Hive UDFs and UDAFs in their Presto queries.

    • PrestoDB and Apache Hudi for the Lakehouse – Sagar Sumit & Bhavani Sudha Saktheeswaran

      Apache Hudi is a rich platform for building self-managing, exabyte-scale data lakes, optimized for incremental as well as regular batch processing. Hudi tables can be seamlessly synced to the Hive metastore, which unlocks the powerful capabilities of the Presto engine via the Hive connector. The Presto-Hudi integration is over five years old. What started as simply fetching splits using a custom input format for a Hudi Copy-On-Write table has evolved into snapshot querying of Merge-On-Read tables and using Hudi’s internal metadata table to boost query performance. In this session, we trace that journey and discuss in detail the recent developments that have made this integration stronger in terms of both usability and performance. We discuss the additional features that come with the brand new presto-hudi connector, such as the multi-modal index and data skipping for better query performance.

    • Speed Up Presto at Uber with Alluxio Caching – Chen Liang, Uber & Beinan Wang, Alluxio

      At Uber, Presto is heavily used as one of the primary data analytics tools, and Presto’s query performance has a profound impact on production. As part of the Presto optimization effort, we explored Alluxio as a caching solution. Alluxio is an open source data orchestration platform used by many compute frameworks as a caching layer. Alluxio caching is currently enabled on ~2000 nodes across 6 clusters at Uber. In this presentation, we will talk about our journey of integrating the Alluxio cache into Presto at Uber. We will discuss the Uber-specific challenges we encountered and how we addressed them, and present the performance improvements we have seen. We will also discuss our plans and next steps, and potential future collaboration opportunities with the community.

    • Connect to PrestoDB from Anywhere – Jerod Johnson, CData

      Leveraging the benefits of PrestoDB with 3rd-party BI, reporting, ETL, and custom applications can present unique challenges. The CData Connectivity Solutions allow you to connect, integrate, and automate your PrestoDB data in the tools and applications you already use. In this video, you’ll learn about the different connectivity solutions CData offers and see how to connect to PrestoDB through CData’s technology.

    • Presto & the Foundations of Open Lake House: Trends & Opportunities – Biswapesh Chattopadhyay, Meta

      Building open and shared foundational tech to build a lake house architecture can provide the best-of-breed user experience across the Analytics and ML domains and potentially beyond. In this talk, Biswa will share examples drawn from the evolution of the data stack at Meta over the last few years including efforts towards dialect unification (Sapphire aka Presto-on-Spark and Xstream-IE streaming engine efforts), eval unification (using Velox as the base), eliminating the need for data duplication for interactive analytics by building smart caching (RaptorX), building a best-of-breed file format that works across Analytics and ML (Alpha), and building an open source ML data pre-proc engine (TorchArrow) which shares the core dialect and eval components with Presto.

    • Executing Any External Code in Any Language with Presto – A Universal Connector – Ravishankar Nair

      Connector-based architecture is one of Presto’s most powerful features for extensibility. While Presto has a solid set of connectors, the ability to reuse an existing external snippet to fetch data and access it through Presto would be enormously helpful. For example, consider accessing mainframe code through Presto using simple SQL, which is quite cumbersome to handle with the conventional connector paradigm. Ravishankar explores how he implemented this feature using a protocol server and a protocol connector, which eventually helped him obtain a patent on the concept.

    • Panel Discussion: Presto for the Open Data Lakehouse

      Today’s digital-native companies need a modern data infra that can handle data wrangling and data-driven analytics for the ever-increasing amount of data needed to drive business. Specifically, they need to address challenges like complexity, cost, and lock-in. An Open SQL Data Lakehouse approach enables flexibility and better cost performance by leveraging open technologies and formats. Join us for this panel where leading technologists from the Presto open source project will share their vision of the SQL Data Lakehouse and why Presto is a critical component.

    • Presto Query Analysis for Data Layout Formatting and Query Result Caching – Gurmeet Singh, Uber

      In this talk, I will discuss a microservice that we built at Uber to analyze Presto queries. The Presto query engine does not provide endpoints for query analysis; one has to either execute the query or gather insights from the query’s explain plan. I will cover: 1. The work required to perform query analysis in a microservice using Presto as a library. 2. Predicate analysis on queries to produce data formatting recommendations that improve query performance. 3. Use of the analysis service for query result cache invalidation: the analysis determines whether the results from a previous run of the query are still valid and can be reused.

    • HermesDB – Integrated Presto with a lucene-based Query Engine – Yue Long, Tencent

      HermesDB is the next-generation OLAP engine at Tencent, with an architecture featuring separation of storage and compute. HermesDB builds efficient index files over stored data and uses a customized Presto as its core query engine. With the help of a Presto connector, HermesDB not only supports full ANSI syntax but also utilizes Apache Lucene as its underlying compute core. We are also in the process of improving end-to-end performance with the newly released Java Vector API, accelerating different kinds of complex computations with SIMD instructions. According to our benchmark (SSB), HermesDB outperforms other mainstream C++-based MPP engines.

    • Speed Up Presto Reading with Parquet Column Indexes – Xinli Shang & Chen Liang, Uber

      Data analytics tables in the big data ecosystem are usually large, and some of them can reach petabytes in size. As a fast query engine, Presto needs to be intelligent enough to skip reading unnecessary data based on filters. In addition to the existing filtering that skips partitions, files, and row groups, the Apache Parquet Column Index provides further filtering at the page level, the I/O unit for the Parquet data source. In this presentation, we will show our work integrating the Parquet Column Index into the Presto code base and the resulting performance gains. We will also talk about our effort to open-source this project to PrestoDB, and we look forward to collaborating with the community to merge it!
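
      Page-level skipping of the kind this abstract describes can be sketched in a few lines. This is an illustration only, not the Parquet or Presto implementation: the `PageStats` class and `pages_to_read` function are hypothetical, and the sketch assumes each page carries a minimum and maximum value for one integer column.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page statistics for one column."""
    first_row: int
    row_count: int
    min_value: int
    max_value: int


def pages_to_read(column_index, lower, upper):
    """Keep only pages whose [min, max] range overlaps the filter [lower, upper]."""
    return [
        page for page in column_index
        if page.max_value >= lower and page.min_value <= upper
    ]
```

      For a filter such as `value BETWEEN 60 AND 100`, a page whose values span 1 to 50 cannot contain a match and is skipped without being read.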