Speed Up Presto at Uber with Alluxio Caching – Chen Liang, Uber & Beinan Wang, Alluxio

    At Uber, Presto is heavily used as one of the primary data analytics tools, and Presto’s query performance has a profound production impact. As part of our Presto optimization effort, we explored Alluxio as a caching solution. Alluxio is an open source data orchestration platform often used by compute frameworks as a caching layer. Alluxio caching is currently enabled on ~2,000 nodes across 6 clusters at Uber. In this presentation, we will talk about our journey of integrating the Alluxio cache into Presto at Uber. We will discuss the Uber-specific challenges we encountered and how we addressed them, and present the performance improvements we have seen. Finally, we will discuss our plans, next steps, and potential future collaboration opportunities with the community.
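
    As a pointer for readers who want to try this, PrestoDB’s Hive connector exposes an Alluxio-backed local cache through catalog properties. A minimal sketch (paths and sizes are placeholders, not Uber’s settings; confirm the property names against the PrestoDB Hive connector docs):

        # etc/catalog/hive.properties (illustrative values only)
        connector.name=hive-hadoop2
        hive.metastore.uri=thrift://metastore:9083

        # Enable the Alluxio local cache on each worker's local SSD
        cache.enabled=true
        cache.type=ALLUXIO
        cache.base-directory=file:///mnt/flash/cache
        cache.alluxio.max-cache-size=500GB

        # Route splits for the same file to the same worker so cached data is reused
        hive.node-selection-strategy=SOFT_AFFINITY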

    Connect to PrestoDB from Anywhere – Jerod Johnson, CData

    Leveraging the benefits of PrestoDB with 3rd-party BI, reporting, ETL, and custom applications can present unique challenges. The CData Connectivity Solutions allow you to connect, integrate, and automate your PrestoDB data in the tools and applications you already use. In this video, you’ll learn about the different connectivity solutions CData offers and see how to connect to PrestoDB through CData’s technology.
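
    For a flavor of what this looks like in practice, here is a minimal Java sketch using CData’s JDBC driver for Presto; the driver class name and URL format below follow CData’s usual conventions but are assumptions, so confirm them against the driver’s documentation:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class CDataPrestoExample {
            public static void main(String[] args) throws Exception {
                // Assumed driver class and URL format; check the CData docs.
                Class.forName("cdata.jdbc.presto.PrestoDriver");
                String url = "jdbc:presto:Server=localhost;Port=8080;";
                try (Connection conn = DriverManager.getConnection(url);
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(
                             "SELECT * FROM sample_catalog.sample_schema.orders LIMIT 10")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }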

    Presto & the Foundations of Open Lake House: Trends & Opportunities – Biswapesh Chattopadhyay, Meta

    Building open, shared foundational tech for a lake house architecture can provide a best-of-breed user experience across the Analytics and ML domains, and potentially beyond. In this talk, Biswa will share examples drawn from the evolution of the data stack at Meta over the last few years, including efforts toward dialect unification (Sapphire, aka Presto-on-Spark, and the Xstream-IE streaming engine), eval unification (using Velox as the base), eliminating the need for data duplication for interactive analytics by building smart caching (RaptorX), building a best-of-breed file format that works across Analytics and ML (Alpha), and building an open source ML data preprocessing engine (TorchArrow) that shares its core dialect and eval components with Presto.

    Executing Any External Code in Any Language with Presto – A Universal Connector – Ravishankar Nair

    Connector-based architecture is one of Presto’s most powerful features for extensibility. While Presto already has a solid set of connectors, the ability to reuse an existing external code snippet to fetch data and access it through Presto would be enormously helpful. Consider, for example, accessing mainframe code through Presto using simple SQL, which is quite cumbersome to handle by writing a dedicated connector. Ravishankar explores how he implemented this feature using a protocol server and a protocol connector, an approach that eventually earned him a patent on the concept.
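
    The abstract does not disclose the patented design, so the following is only a hypothetical Java sketch of the “protocol server” half of the idea: a small HTTP endpoint that runs an external program and streams its output back, which a custom Presto connector could then expose as rows. All names here are made up for illustration:

        import com.sun.net.httpserver.HttpServer;
        import java.io.OutputStream;
        import java.net.InetSocketAddress;

        // Hypothetical "protocol server": executes an external command and
        // returns its stdout over HTTP for a connector to consume.
        public class ProtocolServer {
            public static void main(String[] args) throws Exception {
                HttpServer server = HttpServer.create(new InetSocketAddress(8089), 0);
                server.createContext("/exec", exchange -> {
                    // A real system would look up and sandbox the registered
                    // snippet; a fixed command is run here purely for demo.
                    Process proc = new ProcessBuilder("echo", "row1,row2").start();
                    byte[] out = proc.getInputStream().readAllBytes();
                    exchange.sendResponseHeaders(200, out.length);
                    try (OutputStream os = exchange.getResponseBody()) {
                        os.write(out);
                    }
                });
                server.start();
            }
        }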

    Panel Discussion: Presto for the Open Data Lakehouse

    Today’s digital-native companies need modern data infrastructure that can handle data wrangling and data-driven analytics for the ever-increasing amount of data needed to drive business. Specifically, they need to address challenges like complexity, cost, and lock-in. An Open SQL Data Lakehouse approach enables flexibility and better cost performance by leveraging open technologies and formats. Join us for this panel, where leading technologists from the Presto open source project will share their vision of the SQL Data Lakehouse and why Presto is a critical component.

    Presto Query Analysis for Data Layout Formatting and Query Result Caching – Gurmeet Singh, Uber

    In this talk, I will present a microservice we built at Uber to analyze Presto queries. The Presto query engine does not provide endpoints for query analysis; one has to either execute the query or gather insights from its explain plan. I will cover: (1) the work required to perform query analysis in a microservice using Presto as a library; (2) predicate analysis on queries to produce data-layout formatting recommendations that improve query performance; and (3) using the analysis service for query result cache invalidation, where the analysis determines whether the results from a previous run of a query are still valid and can be reused.
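
    As an illustration of “Presto as a library”, the presto-parser module can be embedded to parse a query and walk its AST. A minimal sketch of collecting comparison predicates, the raw material for layout recommendations (the production service is far more involved):

        import com.facebook.presto.sql.parser.ParsingOptions;
        import com.facebook.presto.sql.parser.SqlParser;
        import com.facebook.presto.sql.tree.ComparisonExpression;
        import com.facebook.presto.sql.tree.DefaultTraversalVisitor;
        import com.facebook.presto.sql.tree.Statement;
        import java.util.ArrayList;
        import java.util.List;

        public class PredicateCollector {
            public static void main(String[] args) {
                String sql = "SELECT * FROM trips WHERE city_id = 42 AND ds > '2022-01-01'";
                Statement stmt = new SqlParser().createStatement(sql, new ParsingOptions());

                // Walk the AST and record every comparison predicate we see.
                List<String> predicates = new ArrayList<>();
                new DefaultTraversalVisitor<Void, Void>() {
                    @Override
                    protected Void visitComparisonExpression(ComparisonExpression node, Void ctx) {
                        predicates.add(node.toString());
                        return super.visitComparisonExpression(node, ctx);
                    }
                }.process(stmt, null);

                predicates.forEach(System.out::println);
            }
        }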

HermesDB – Integrating Presto with a Lucene-based Query Engine – Yue Long, Tencent

    HermesDB is the next-generation OLAP engine at Tencent, with an architecture that separates storage and compute. HermesDB maintains efficient index files in its storage layer and uses a customized Presto as its core query engine. With the help of a Presto connector, HermesDB not only supports full ANSI SQL syntax but also utilizes Apache Lucene as its underlying compute core. We are also improving end-to-end performance with the newly released Java Vector API, accelerating various kinds of complex computations with SIMD instructions. In our Star Schema Benchmark (SSB) results, HermesDB outperforms other mainstream C++-based MPP engines.
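
    For readers unfamiliar with the Java Vector API mentioned above, here is a small, self-contained sketch of the kind of SIMD kernel it enables: a vectorized sum over a float column. This is an illustration, not HermesDB code; it needs a JDK with the incubator module (run with --add-modules jdk.incubator.vector):

        import jdk.incubator.vector.FloatVector;
        import jdk.incubator.vector.VectorOperators;
        import jdk.incubator.vector.VectorSpecies;

        public class SimdSum {
            static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

            // Sum a float column using SIMD lanes, with a scalar tail loop.
            static float sum(float[] col) {
                FloatVector acc = FloatVector.zero(SPECIES);
                int i = 0;
                int upper = SPECIES.loopBound(col.length);
                for (; i < upper; i += SPECIES.length()) {
                    acc = acc.add(FloatVector.fromArray(SPECIES, col, i));
                }
                float total = acc.reduceLanes(VectorOperators.ADD);
                for (; i < col.length; i++) {
                    total += col[i];
                }
                return total;
            }

            public static void main(String[] args) {
                System.out.println(sum(new float[]{1f, 2f, 3f, 4f, 5f}));
            }
        }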

Speed Up Presto Reading with Parquet Column Indexes – Xinli Shang & Chen Liang, Uber

    Data analytics tables in the big data ecosystem are usually large, and some can reach petabytes in size. As a fast query engine, Presto needs to intelligently skip reading unnecessary data based on filters. In addition to the existing filtering that skips partitions, files, and row groups, the Apache Parquet Column Index enables further filtering down to pages, the I/O unit of the Parquet data source. In this presentation, we will show how we integrated the Parquet Column Index into the Presto codebase and the performance gains we achieved. We will also talk about our effort to open-source this work to PrestoDB, and we look forward to collaborating with the community to merge it!
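
    For context, the page-level skipping described here is also visible in plain parquet-mr, which the Presto integration builds on: when a file carries column indexes and the reader is given a filter, pages whose min/max ranges cannot match are never read. A hedged Java sketch (file path and column name are made up; assumes parquet-mr 1.11+, where column-index filtering is on by default):

        import org.apache.avro.generic.GenericRecord;
        import org.apache.hadoop.fs.Path;
        import org.apache.parquet.avro.AvroParquetReader;
        import org.apache.parquet.filter2.compat.FilterCompat;
        import org.apache.parquet.hadoop.ParquetReader;

        import static org.apache.parquet.filter2.predicate.FilterApi.gt;
        import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;

        public class ColumnIndexReadExample {
            public static void main(String[] args) throws Exception {
                // Pages of fare_cents whose min/max cannot satisfy the
                // predicate are skipped instead of being read from storage.
                try (ParquetReader<GenericRecord> reader =
                        AvroParquetReader.<GenericRecord>builder(new Path("/tmp/trips.parquet"))
                                .withFilter(FilterCompat.get(gt(intColumn("fare_cents"), 10000)))
                                .build()) {
                    GenericRecord record;
                    while ((record = reader.read()) != null) {
                        System.out.println(record);
                    }
                }
            }
        }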

    Presto On Spark: Scaling not Failing with Spark – Ariel Weisberg, Meta & Shradha Ambekar, Intuit

    Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler and evaluation engine as a library and Spark’s large-scale processing capabilities. It enables a unified SQL experience across interactive and batch use cases. A unified option for batch and ad hoc data processing is key to an experience where queries scale instead of fail, without requiring rewrites between different SQL dialects. In this session, we’ll talk about the Presto on Spark architecture, why it matters, and its implementation and usage at Intuit.
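
    For context, the open source Presto on Spark distribution runs as a regular Spark application: a launcher jar submitted via spark-submit together with a Presto package. A sketch along the lines of the PrestoDB documentation (version numbers and paths are placeholders):

        spark-submit \
          --master yarn \
          --executor-cores 4 \
          --conf spark.task.cpus=4 \
          --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
          presto-spark-launcher-0.272.jar \
          --package presto-spark-package-0.272.tar.gz \
          --config ./config.properties \
          --catalogs ./catalogs \
          --catalog hive \
          --schema default \
          --file query.sql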

    Presto on Kafka at Scale – Yang Yang & Yupeng Fu, Uber

    Presto is a popular distributed SQL query engine for running interactive analytic queries. Presto provides a Connector API that allows plugging in dozens of data sources, positioning it as a single point of access to a wide variety of data. At Uber, we significantly improved Presto’s Kafka connector to meet Uber’s scale. For example, the new connector supports dynamic Kafka cluster and topic discovery, so users can directly query existing Kafka topics without any registration or onboarding process; dynamic schema discovery fetches the latest schema without any Presto restart or redeployment; and smart time-range suggestions, based on Kafka metadata analysis, help users avoid large-range scans and keep queries interactive.
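
    To make the time-range point concrete, here is a hypothetical query shape, assuming the connector exposes Kafka’s record timestamp as a hidden _timestamp column (catalog, topic, and column names are illustrative):

        -- Scan one hour of the topic instead of its full retention; a
        -- connector can translate this into Kafka offset ranges (e.g. via
        -- KafkaConsumer.offsetsForTimes) to keep the query interactive.
        SELECT order_id, status
        FROM kafka.default.orders_topic
        WHERE _timestamp BETWEEN TIMESTAMP '2022-03-01 10:00:00'
                             AND TIMESTAMP '2022-03-01 11:00:00'
        LIMIT 100;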