PrestoCon 2024 Agenda

Tuesday, December 3rd

8:00AM – 4:00PM

9:00AM – 12:00PM

In this workshop you’ll learn the basics of Presto, the open-source SQL query engine. We’ll be focused on Presto C++, the next-gen Presto engine built on Velox, an open-source C++ native acceleration library. Presto C++ brings a state-of-the-art query processing engine to Presto that replaces Java workers in Presto clusters.

You’ll get a Presto cluster running locally on your machine, connect data sources, and run some queries. We’ll also do a quick Presto C++ vs. Presto Java benchmark.

This is a beginner-level workshop for software developers and engineers who are new to Presto.

Course outline:

What is Presto and why you’d use it
Presto C++ architecture
How to write a Presto query
How to create and deploy a Presto cluster on your machine using Docker
Use pbench to run a Presto C++ vs Presto Java benchmark

Instructors

Yi-hong Wang
Open Source Software Developer at IBM

Pablo Alvarez
Global VP Product Management at Denodo

You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse. This is a beginner-level workshop for software developers and engineers who are building data platforms. We’ll use Presto for the open-source SQL query engine, Apache Iceberg to enable ACID transactions, and MinIO S3-compatible object storage for the data lake.

You’ll get hands-on with Presto and Iceberg. We’ll show you how to set up and connect these technologies, how to run queries on your data, and how to access and interpret Iceberg metadata. By the end, you should be well-versed in Presto and Iceberg and have the building blocks to create your own Open Data Lakehouse.

Course outline:

Introduction to the Open Data Lakehouse and the Presto query engine
Introduction to Apache Iceberg and common use cases
Querying S3 data with Presto
Integrating Iceberg with Presto
Working with Iceberg data and metadata tables

Instructor

Kiersten Stokes
Open Source Software Developer at IBM

12:00PM – 1:00PM

1:00PM – 4:00PM

The running time of Presto queries depends on query structure, data layout, parameter tuning, and optimizations. We will dive into the important aspects of each, talk about anti-patterns, and run some queries hands-on using joins, aggregations, and window functions.

Course outline:

Presto preliminaries: how to run Presto queries from the CLI, how to read query plans, and a few important session parameters that are generally useful
Presto SQL best practices and anti-patterns
Optimizations for common patterns using some built-in session parameters and manual query rewriting

Instructor

Sreeni Viswanadha
Software Engineer at Meta

* Note that this is a repeat of the morning workshop *

You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse. This is a beginner-level workshop for software developers and engineers who are building data platforms. We’ll use Presto for the open-source SQL query engine, Apache Iceberg to enable ACID transactions, and MinIO S3-compatible object storage for the data lake.

Course outline:

Introduction to the Open Data Lakehouse and the Presto query engine
Introduction to Apache Iceberg and common use cases
Querying S3 data with Presto
Integrating Iceberg with Presto
Working with Iceberg data and metadata tables

Instructor

Kiersten Stokes
Open Source Software Developer at IBM

Wednesday, December 4th

8:00AM – 4:00PM

9:00AM – 9:10AM

Join us as we officially welcome attendees to PrestoCon 2024 and kick off the conference, led by Presto Foundation Chairs Curt and Ali.

Speakers

Curt Hu
Chair, Presto Foundation Governing Board & Sr. Engineering Manager at Uber

Ali LeClerc
PrestoCon Chair & Head of Open Source Strategy at IBM

9:10AM – 9:30AM

A state of the union address on the Presto open-source project from TSC Chair Tim Meehan.

Speakers

Tim Meehan
Chair, Presto Foundation TSC & Software Engineer at IBM

9:30AM – 9:55AM

In this session, we will share how Presto powers one of the largest data warehouses in the world at Meta. Vivek will highlight Meta’s Interactive Analytics business, which makes most of the company productive, as well as Batch Processing, which powers AI and Analytics systems.

Speaker

Vivek Guar
Engineering Manager for Presto at Meta

9:55AM – 10:30AM

Open data platforms are revolutionizing how organizations harness the power of their data, driving innovation, flexibility, and competitive advantage. In this panel discussion, leaders from Meta, Denodo, and Bolt will share why they chose Presto as a cornerstone of their data strategies and how it aligns with the broader shift toward open, collaborative ecosystems. The panelists will dive into why open data platforms like Presto are game-changers for modern tech stacks. They’ll share how these platforms empower teams to stay agile, avoid vendor lock-in, and drive innovation at scale. You’ll hear real-world insights on how Presto’s flexibility and vibrant open-source community help solve evolving challenges and shape the future of data infrastructure. Whether you’re writing queries, designing systems, or driving strategy, this session will give you practical ideas for making the most of open platforms like Presto.

Moderator

Radhika Rangarajan
Co-Founder and Executive Director, Women in Big Data

Panelists

Pedro Pedreira
Software Engineer, Meta

Kostya Tsykulenko
Staff Software Engineer & Tech Lead, Bolt

Pablo Alvarez-Yanez
Global VP of Product Management at Denodo

10:30AM – 11:00AM

11:00AM – 11:30AM

Bolt (https://bolt.eu/) is one of Europe’s largest mobility providers (ride-hailing, shared cars, food delivery, and more). In this session, learn about their journey with Presto and how they leverage IBM watsonx.data with Presto for their data platform.

Speakers

Przemysław Gumuła
Sr. Software Engineer at Bolt

Kostya Tsykulenko
Staff Software Engineer and Tech Lead, Bolt

This session will present Presto C++ progress at Meta since the 2023 presentation. Sergey will cover:

How we handled the challenges of migrating 50% of the batch workload to Presto C++.
Handling correctness issues, differences in behavior, choosing which queries/workloads to migrate first, admission control, pipeline certification, and SEVs.
How all these challenges were exacerbated by new HW migration and warehouse namespaces going global (multi-region).
What’s next: batch 50% -> 100%, interactive

Speaker

Sergey Pershin
Software Engineer at Meta

11:35AM – 12:05PM

In today’s data-driven world, organizations are constantly seeking ways to optimize their data management and analytics processes to achieve cost and speed efficiencies. This presentation explores the use of Denodo’s Massively Parallel Processing (MPP) with Presto, an open-source distributed SQL query engine, to address these challenges.

Denodo MPP with Presto is designed to handle large-scale data processing by distributing workloads across multiple nodes using Presto, which excels in executing fast, interactive queries on large datasets. This technology can significantly reduce query times and operational costs.

This session will delve into the technical architecture and benefits of leveraging Denodo MPP with Presto, including:

Enhanced Performance: How the Denodo MPP solution accelerates query execution and data retrieval.
Cost Efficiency: Strategies for reducing infrastructure and operational costs through optimized resource utilization.
Scalability: Techniques for scaling data processing capabilities to meet growing business demands.
Use Cases: Real-world examples demonstrating the successful implementation and outcomes of this integration.

Attendees will gain insights into the effectiveness of deploying Denodo MPP with Presto, along with practical tips for maximizing the return on investment in their data infrastructure. Whether you are a data architect, engineer, or business analyst, this presentation will provide valuable knowledge to help you harness the full potential of your data assets.

Speakers

Mark Thorogood
Director, Data Operations & Software Engineering, Perkins Coie

Jim Naufel
Data Virtualization Architect, Perkins Coie

The Presto community has been working to improve the integration of Velox into Presto, focusing on simplifying the native evaluation engine’s integration. This talk introduces the Presto Sidecar, a new component designed to improve the Presto C++ user experience and add missing features found in Java. We’ll discuss the current progress, the next steps for native integration, and how these changes support a more seamless transition toward a fully native runtime.

Speaker

Tim Meehan
Software Engineer at IBM

12:05PM – 1:00PM

1:00PM – 1:30PM

In this talk, we will share Uber’s journey of migrating our large-scale Presto infrastructure to a cloud environment. As of Nov 2024, over half of Uber’s Presto workloads operate in the cloud, marking a significant milestone in our data analytics evolution. We’ll delve into the technical strategies employed to adapt Presto for the cloud, including the redesign of query routing mechanisms, integration with cloud storage and security, and insights and learnings regarding performance.

Additionally, we’ll discuss the challenges we encountered during this migration, such as adopting new data security services in the cloud environment, handling data consistency between on-prem and cloud, and managing the operational complexities of transitioning critical workloads to the cloud.

Speakers

Chen Liang
Staff Software Engineer at Uber

Vineeth Karayil Sekharan
Sr. Software Engineer at Uber

Preparing Presto for commercial distribution involves many decisions that are often overlooked in common deployments, such as supporting multiple deployment platforms, performance tuning, and handling high-security deployments (FIPS, Kerberos) with tighter scrutiny in vulnerability management. It also means paving the way for the adoption of Presto C++ next year.

Speaker

Pablo Alvarez-Yanez
Global VP of Product Management at Denodo

1:35PM – 2:05PM

Today Meta uses Presto to power its data lakehouse, which is one of the largest in the world. Learn more about the Presto use case at Meta including various improvements and work the team has done to make it the best solution for the interactive data lakehouse.

Speaker

Nikhil Collooru
Software Developer at Meta

In this session, NeuroBlade will showcase its collaboration with the open-source community to enhance the performance of the Presto-Velox engine using specialized hardware acceleration by tackling key challenges in query execution. The session will explore how bottlenecks shift when traditionally dominant operations in typical workloads, such as initial scan and filter tasks, are hardware accelerated, demonstrating the impact using the TPC-H benchmark for measuring execution time. Attendees will also gain insights into the integration highlights, performance profiling through open-source tools like PBench for performance analysis, and NeuroBlade’s path to contributing these advancements back to the open-source community.

Speaker

Hillel Sreter
System Engineer at NeuroBlade

2:10PM – 2:40PM

As in many other businesses, insurance companies crunch millions of data points to build reports that support leaders in their decision-making process. However, because of technical debt and a tendency toward bad habits, these reports come with a long time to market and little room for changes.

Discover how cutting-edge technology supported by strong governance foundations is used by traditional corporations to fuel their Data Strategies as they transition from legacy systems to the cloud, and how the power of tools like Tableau, Denodo and Presto is helping to reduce the time to market from hours to mere minutes.

Speaker

Álvaro Rabadán González
Data Product Manager at Admiral Group

Presto is widely used for data processing and analysis at Meta. While optimizing Presto queries for Meta workloads, we observed multiple query optimization opportunities and contributed some of them as new optimizations in Presto. In this talk, we will discuss some of the optimizations we have applied to our workloads.

Speaker

Feilong Liu
Research Scientist at Meta

Zac will give an update on the JDK upgrade for open-source Presto.

Speaker

Zac Blanco
Software Engineer at IBM

Join us for an in-depth exploration of the expression evaluation engine within Velox, the high-performance vectorized database acceleration library powering Presto C++. This session will delve into the specific optimizations tailored for vectorized execution, including techniques such as dictionary peeling, common sub-expression reuse, dictionary memoization, null propagation, and the use of lazy vectors, among others. We will also discuss the complexities involved in their implementation and how their intricate interactions can lead to challenging debugging scenarios. Additionally, we will share insights into the development and evolution of a cutting-edge fuzz testing framework designed to rigorously test expression evaluation.

Speaker

Bikramjeet Vig
Software Engineer at Meta

2:45PM – 2:55PM

The migration of data-intensive analytics workloads to the cloud promises enhanced scalability and flexibility but introduces complex cost models that pose new challenges to traditional optimization strategies. While on-premises setups focused on speed, cloud deployments require a more nuanced approach that accounts for factors like cloud storage API costs, which can escalate rapidly in real-world scenarios.

In this presentation, we will analyze these challenges through a case study of Uber’s large-scale Presto analytics platform on HDFS and GCS. We will present findings on the unexpected cost implications of standard I/O optimizations, such as table scans, filters, and broadcast joins, when implemented in cloud environments. We will also highlight the need for a paradigm shift in optimizing data-intensive applications for the cloud and advocate for developing new I/O strategies that balance performance and cost and are tailored to the unique demands of cloud ecosystems.

Speakers

Tom Luckenbach
Engineering Manager at Alluxio

In this session, we delve into the integration of Python support within Presto, designed to revolutionize data processing workflows. Discover how the introduction of a new UDF empowers developers to embed Python scripts within SQL queries, unlocking advanced data manipulation and analysis capabilities.

We’ll discuss the driving motivations behind this integration and provide insights into the high-level system design, deployment strategies, and security considerations that ensure efficient and secure execution. Please note that this project is currently in the prototype stage at Meta.

Speakers

Sreeni Viswanadha
Software Engineer at Meta

Feilong Liu
Research Scientist at Meta

3:00PM – 3:30PM

3:30PM – 4:00PM

Enterprises in the GenAI sector have long sought a seamless solution for managing and extracting value from vast amounts of documents, images, video and audio data. With the Presto-Lance integration, that solution is now within reach.

Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large blobs, simplifying multimodal data analytics to the level of SQL. Its advanced indexing for semantic and full-text search, combined with rapid random access and Presto query pushdown, enables high-performance AI-driven analytics.

Vector data lakes built with the Lance format are accessible via simple SQL queries through Presto, a mature, high-performance distributed analytical engine with a rich set of compute kernels. Lance delivers a 10x performance improvement on real-time search queries and is compatible with Presto to support fast distributed OLAP queries. This unified approach simplifies data management, boosts performance, and significantly reduces infrastructure costs.

Speakers

Lu Qiu
Database Engineer at LanceDB

Beinan Wang
Presto Committer

As data volumes grow and analytical demands intensify, efficient data management and optimized retrieval systems are key. This session will provide an in-depth look at the evolution and technical innovations behind the Presto-Hudi connector, tracing its lineage from the earlier Hive connector to its current, advanced capabilities.

We will explore the features that set the Hudi connector apart from conventional approaches like file listing and static partition pruning for query optimization. Central to this discussion is Hudi’s multi-modal indexing framework, which brings in powerful support for Column Statistics and Record Indexing. These unique indexing capabilities significantly enhance query performance, allowing for accelerated point and range lookups that streamline data retrieval.

The talk will also cover upcoming developments for the Presto Hudi connector, focusing on the expansion of multi-modal indexing and the integration of full DDL/DML support. These forward-looking improvements aim to elevate the Presto Hudi connector’s capabilities, delivering greater flexibility and efficiency for large-scale data operations and solidifying its role as a critical tool in data lakehouse environments.

Speakers

Sudha Saktheeswaran
Head of Open Source at Onehouse.ai

Jon Vexler
Database Engineer at Onehouse.ai

4:05PM – 4:35PM

When current storage mechanisms fail to meet your requirements, sometimes it’s worth building your own. While in the past this may have been daunting, Presto’s connector framework allows for experimentation with minimal risk while taking advantage of storage disaggregation at scale. As a case study, let’s look at how IBM’s SOC Platform team decreased costs and modernized their storage platform, all while keeping the end-user experience the same, via Presto’s Thrift connector.

Speaker

Hunter Madison
Office of the CISO at IBM

Ravishankar will demonstrate two approaches to querying vector data lakes using Presto: through an Arrow Flight connector and by building a vector data lake connector powered by open-source LanceDB. This will show how universal multimodal AI apps can be built using Presto, making it ready for the future.

Speaker

Ravishankar Nair
CEO at Passionbytes

4:40PM – 5:00PM

Gurmeet will share the pivotal role Presto plays in Uber’s data ecosystem, as well as the latest metrics and workloads running on Presto. Learn more about the current data infra and plans for the future.

Speaker

Gurmeet Singh
Staff Software Engineer, Uber

5:00PM – 6:30PM

Join us for drinks, light appetizers and Presto fun!