PrestoCon 2024 Agenda

    Tuesday, December 3rd

    8:00AM – 4:00PM

    9:00AM – 12:00PM

    In this workshop you’ll learn the basics of Presto, the open-source SQL query engine. We’ll be focused on Presto C++, the next-gen Presto engine built on Velox, an open-source C++ native acceleration library. Presto C++ brings a state-of-the-art query processing engine to Presto that replaces Java workers in Presto clusters.

    You’ll get a Presto cluster running locally on your machine, connect data sources, and run some queries. We’ll also do a quick Presto C++ vs. Presto Java benchmark.

    This is a beginner-level workshop for software developers and engineers who are new to Presto.

    Course outline:

    • What is Presto and why you’d use it
    • Presto C++ architecture
    • How to write a Presto query
    • How to create and deploy a Presto cluster on your machine using Docker
    • How to use pbench to run a Presto C++ vs. Presto Java benchmark

    Instructors

    Yi-hong Wang
    Open Source Software Developer at IBM

    Pablo Alvarez
    Global VP Product Management at Denodo

    You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse. This is a beginner-level workshop for software developers and engineers who are building data platforms. We’ll use Presto for the open-source SQL query engine, Apache Iceberg to enable ACID transactions, and MinIO S3-compatible Object Storage for the data lake.

    You’ll get hands-on with Presto and Iceberg. We’ll show you how to set up and connect these technologies, how to run queries on your data, and how to access and interpret Iceberg metadata. By the end, you should be well-versed in Presto and Iceberg and have the building blocks to create your own Open Data Lakehouse.

    Course outline:

    • Introduction to the Open Data Lakehouse and the Presto query engine
    • Introduction to Apache Iceberg and common use cases
    • Querying S3 data with Presto
    • Integrating Iceberg with Presto
    • Working with Iceberg data and metadata tables

    Instructor

    Kiersten Stokes
    Open Source Software Developer at IBM

    12:00PM – 1:00PM

    1:00PM – 4:00PM

    The running time of queries on Presto depends on query structure, data layout, parameter tuning, and optimizations. We will dive into the important aspects of each, talk about anti-patterns, and run some queries hands-on using joins, aggregations, and window functions.

    Course outline:

    • Presto preliminaries: how to run Presto queries from the CLI, understand query plans, and set a few important session parameters that are generally useful
    • Presto SQL best practices and anti-patterns
    • Optimizations for common patterns using some built-in session parameters and manual query rewriting

    Instructor

    Sreeni Viswanadha
    Software Engineer at Meta

    * Note that this is a repeat of the morning workshop *

    You may be familiar with the Data Lakehouse, an emerging architecture that brings the flexibility, scale and cost management benefits of the data lake together with the data management capabilities of the data warehouse. In this workshop, we’ll get hands-on building an Open Data Lakehouse – an approach that brings open technologies and formats to your lakehouse. This is a beginner-level workshop for software developers and engineers who are building data platforms. We’ll use Presto for the open-source SQL query engine, Apache Iceberg to enable ACID transactions, and MinIO S3-compatible Object Storage for the data lake.

    You’ll get hands-on with Presto and Iceberg. We’ll show you how to set up and connect these technologies, how to run queries on your data, and how to access and interpret Iceberg metadata. By the end, you should be well-versed in Presto and Iceberg and have the building blocks to create your own Open Data Lakehouse.

    Course outline:

    • Introduction to the Open Data Lakehouse and the Presto query engine
    • Introduction to Apache Iceberg and common use cases
    • Querying S3 data with Presto
    • Integrating Iceberg with Presto
    • Working with Iceberg data and metadata tables

    Instructor

    Kiersten Stokes
    Open Source Software Developer at IBM

    Wednesday, December 4th

    8:00AM – 4:00PM

    9:00AM – 9:10AM

    Join us as we officially welcome attendees to PrestoCon 2024 and kick off the conference, led by Presto Foundation Chairs Curt and Ali.

    Speakers

    Curt Hu
    Chair, Presto Foundation Governing Board & Sr. Engineering Manager at Uber

    Ali LeClerc
    PrestoCon Chair & Head of Open Source Strategy at IBM

    9:10AM – 9:30AM

    A state of the union address on the Presto open-source project from TSC Chair Tim Meehan.

    Speakers

    Tim Meehan
    Chair, Presto Foundation TSC & Software Engineer at IBM

    9:30AM – 9:55AM

    In this session, we will share how Presto powers one of the largest warehouses in the world at Meta. Vivek will highlight Meta’s Interactive Analytics business, which makes most of the company productive, as well as Batch Processing, which powers AI and Analytics systems.

    Speaker

    Vivek Guar
    Engineering Manager for Presto at Meta

    9:55AM – 10:30AM

    Open data platforms are revolutionizing how organizations harness the power of their data, driving innovation, flexibility, and competitive advantage. In this panel discussion, leaders from Meta, Denodo, and Bolt will share why they chose Presto as a cornerstone of their data strategies and how it aligns with the broader shift toward open, collaborative ecosystems. The panelists will dive into why open data platforms like Presto are game-changers for modern tech stacks. They’ll share how these platforms empower teams to stay agile, avoid vendor lock-in, and drive innovation at scale. You’ll hear real-world insights on how Presto’s flexibility and vibrant open-source community help solve evolving challenges and shape the future of data infrastructure. Whether you’re writing queries, designing systems, or driving strategy, this session will give you practical ideas for making the most of open platforms like Presto.

    Moderator

    Radhika Rangarajan
    Co-Founder and Executive Director, Women in Big Data

    Panelists

    Pedro Pedreira
    Software Engineer, Meta

    Kostya Tsykulenko
    Staff Software Engineer & Tech Lead, Bolt

    Pablo Alvarez-Yanez
    Global VP of Product Management at Denodo

    10:30AM – 11:00AM

    11:00AM – 11:30AM

    Bolt (https://bolt.eu/) is one of Europe’s largest mobility providers (ride-hailing, shared cars, food delivery, and more). In this session, learn more about their journey with Presto and how they leverage IBM watsonx.data with Presto for their data platform.

    Speakers

    Przemysław Gumuła
    Sr. Software Engineer at Bolt

    Kostya Tsykulenko
    Staff Software Engineer and Tech Lead, Bolt

    This session will present Presto C++ progress at Meta since the 2023 presentation. Sergey will cover:

    • How we handled the challenges of migrating 50% of the batch workload to Presto C++
    • Handling correctness issues, differences in behavior, choosing which queries and workloads to migrate first, admission control, pipeline certification, and SEVs
    • How all these challenges were exacerbated by a new hardware migration and warehouse namespaces going global (multi-region)
    • What’s next: batch 50% -> 100%, interactive

    Speaker

    Sergey Pershin
    Software Engineer at Meta

    11:35AM – 12:05PM

    In today’s data-driven world, organizations are constantly seeking ways to optimize their data management and analytics processes to achieve cost and speed efficiencies. This presentation explores the use of Denodo’s Massively Parallel Processing (MPP) with Presto, an open-source distributed SQL query engine, to address these challenges.

    Denodo MPP with Presto is designed to handle large-scale data processing by distributing workloads across multiple nodes using Presto, which excels in executing fast, interactive queries on large datasets. This technology can significantly reduce query times and operational costs.

    This session will delve into the technical architecture and benefits of leveraging Denodo MPP with Presto, including:

    • Enhanced Performance: How the Denodo MPP solution accelerates query execution and data retrieval.
    • Cost Efficiency: Strategies for reducing infrastructure and operational costs through optimized resource utilization.
    • Scalability: Techniques for scaling data processing capabilities to meet growing business demands.
    • Use Cases: Real-world examples demonstrating the successful implementation and outcomes of this integration.

    Attendees will gain insights into the effectiveness of deploying Denodo MPP with Presto, along with practical tips for maximizing the return on investment in their data infrastructure. Whether you are a data architect, engineer, or business analyst, this presentation will provide valuable knowledge to help you harness the full potential of your data assets.

    Speakers

    Mark Thorogood
    Director, Data Operations & Software Engineering, Perkins Coie

    Jim Naufel
    Data Virtualization Architect, Perkins Coie

    The Presto community has been working to improve the integration of Velox into Presto, focusing on simplifying the native evaluation engine’s integration. This talk introduces the Presto Sidecar, a new component designed to improve the Presto C++ user experience and add missing features found in Java. We’ll discuss the current progress, the next steps for native integration, and how these changes support a more seamless transition toward a fully native runtime.

    Speaker

    Tim Meehan
    Software Engineer at IBM

    12:05PM – 1:00PM

    1:00PM – 1:30PM

    In this talk, we will share Uber’s journey of migrating our large-scale Presto infrastructure to a cloud environment. As of Nov 2024, over half of Uber’s Presto workloads operate in the cloud, marking a significant milestone in our data analytics evolution. We’ll delve into the technical strategies employed to adapt Presto for cloud, including the redesign of query routing mechanisms, integration with cloud storage and security, and insights and learnings regarding performance.

    Additionally, we’ll discuss the challenges we encountered during this migration, such as adapting to new data security services in the cloud environment, handling data consistency between on-prem and cloud, and managing the operational complexities of transitioning critical workloads to the cloud.

    Speakers

    Chen Liang
    Staff Software Engineer at Uber

    Vineeth Karayil Sekharan
    Sr. Software Engineer at Uber

    Preparing Presto for commercial distribution involves many decisions that are often overlooked in common deployments, such as supporting multiple deployment platforms, performance tuning, handling high-security deployments (FIPS, Kerberos), and facing tighter scrutiny in vulnerability management. It also means paving the way for the adoption of Presto C++ next year.

    Speaker

    Pablo Alvarez-Yanez
    Global VP of Product Management at Denodo

    1:35PM – 2:05PM

    Today Meta uses Presto to power its data lakehouse, which is one of the largest in the world. Learn more about the Presto use case at Meta including various improvements and work the team has done to make it the best solution for the interactive data lakehouse.

    Speaker

    Nikhil Collooru
    Software Developer at Meta

    In this session, NeuroBlade will showcase its collaboration with the open-source community to enhance the performance of the Presto-Velox engine using specialized hardware acceleration by tackling key challenges in query execution. The session will explore how bottlenecks shift when traditionally dominant operations in typical workloads, such as initial scan and filter tasks, are hardware accelerated, demonstrating the impact using the TPC-H benchmark for measuring execution time. Attendees will also gain insights into the integration highlights, performance profiling with open-source tools like PBench, and NeuroBlade’s path to contributing these advancements back to the open-source community.

    Speaker

    Hillel Sreter
    System Engineer at NeuroBlade

    2:10PM – 2:40PM

    As in many other businesses, insurance companies crunch millions of data points to build reports that support leaders in their decision-making process. However, because of technical debt and a tendency toward bad habits, these reports come with a long time to market and little room for change.

    Discover how traditional corporations use cutting-edge technology, supported by strong governance foundations, to fuel their data strategies and transition from legacy systems to the cloud, and how tools like Tableau, Denodo and Presto are helping to reduce the time to market from hours to mere minutes.

    Speaker

    Álvaro Rabadán González
    Data Product Manager at Admiral Group

    Presto is widely used for data processing and analysis at Meta. While optimizing Presto queries for Meta workloads, we observed multiple query optimization opportunities and contributed some of them as new optimizations in Presto. In this talk, we will discuss some of the optimizations we have applied to our workloads.

    Speaker

    Feilong Liu
    Research Scientist at Meta

    Zac will give an update on the JDK upgrade for open-source Presto.

    Speaker

    Zac Blanco
    Software Engineer at IBM

    Join us for an in-depth exploration of the expression evaluation engine within Velox, the high-performance vectorized database acceleration library powering Presto C++. This session will delve into the specific optimizations tailored for vectorized execution, including techniques such as dictionary peeling, common sub-expression reuse, dictionary memoization, null propagation, and the use of lazy vectors, among others. We will also discuss the complexities involved in their implementation and how their intricate interactions can lead to challenging debugging scenarios. Additionally, we will share insights into the development and evolution of a cutting-edge fuzz testing framework designed to rigorously test expression evaluation.
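
As a rough illustration of one technique named above, dictionary peeling, the following is a minimal conceptual sketch in Python. It is not Velox code: the `DictionaryVector` class and the `evaluate_with_peeling` function are hypothetical names invented for this example, and the sketch assumes the expression is deterministic. The idea is that when the input is dictionary-encoded, the expression can be evaluated once per distinct dictionary entry and the original indices reused, instead of evaluating once per row.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DictionaryVector:
    """Hypothetical dictionary-encoded column: row i holds dictionary[indices[i]]."""
    indices: List[int]
    dictionary: List[str]

    def decode(self) -> List[str]:
        # Materialize the flat, row-by-row values.
        return [self.dictionary[i] for i in self.indices]


def evaluate_with_peeling(
    vec: DictionaryVector, fn: Callable[[str], str]
) -> DictionaryVector:
    """Apply a deterministic fn to each dictionary entry once, then re-wrap.

    The indices are reused unchanged, so the cost is proportional to the
    dictionary size rather than the number of rows.
    """
    new_dictionary = [fn(value) for value in vec.dictionary]
    return DictionaryVector(indices=vec.indices, dictionary=new_dictionary)


# Record every value the function is called with, to show how often it runs.
seen: List[str] = []


def recording_upper(value: str) -> str:
    seen.append(value)
    return value.upper()


vec = DictionaryVector(indices=[0, 1, 0, 0, 1, 2], dictionary=["us", "eu", "apac"])
result = evaluate_with_peeling(vec, recording_upper)
```

Here the vector has six rows but only three distinct values, so the function runs three times. A real engine also has to handle cases this sketch ignores, such as non-deterministic functions, nulls, and rows that are not selected.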

    Speaker

    Bikramjeet Vig
    Software Engineer at Meta

    2:45PM – 2:55PM

    The migration of data-intensive analytics workloads to the cloud promises enhanced scalability and flexibility but introduces complex cost models that pose new challenges to traditional optimization strategies. While on-premises setups focused on speed, cloud deployments require a more nuanced approach that accounts for factors like cloud storage API costs, which can escalate rapidly in real-world scenarios.

    In this presentation, we will analyze these challenges through a case study of Uber’s large-scale Presto analytics deployment on HDFS and GCS. We will show findings of unexpected cost implications when standard I/O optimizations for table scans, filters, and broadcast joins are implemented in cloud environments. We will also highlight the need for a paradigm shift in optimizing data-intensive applications for the cloud and advocate for developing new I/O strategies that balance performance and cost while being tailored to the unique demands of cloud ecosystems.

    Speakers

    Tom Luckenbach
    Engineering Manager at Alluxio

    In this session, we delve into the integration of Python support within Presto, designed to revolutionize data processing workflows. Discover how the introduction of a new UDF empowers developers to effortlessly embed Python scripts within SQL queries, unlocking advanced data manipulation and analysis capabilities.

    We’ll discuss the driving motivations behind this integration and provide insights into the high-level system design, deployment strategies, and security considerations that ensure efficient and secure execution. Please note that this project is currently in the prototype stage at Meta.

    Speakers

    Sreeni Viswanadha
    Software Engineer at Meta

    Feilong Liu
    Research Scientist at Meta

    3:00PM – 3:30PM

    3:30PM – 4:00PM

    Enterprises in the GenAI sector have long sought a seamless solution for managing and extracting value from vast amounts of documents, images, video and audio data. With the Presto-Lance integration, that solution is now within reach.

    Lance delivers unparalleled value with its zero-copy schema evolution capability and robust support for large blobs, simplifying multimodal data analytics to the level of SQL. Its advanced indexing for semantic and full-text search, combined with rapid random access and Presto query pushdown, enables high-performance AI-driven analytics.

    Vector data lakes built with the Lance format are accessible via simple SQL queries through high-performance Presto, a mature distributed analytical engine with a rich set of compute kernels. Lance delivers a 10x performance improvement on real-time search queries, and is compatible with Presto to support fast distributed OLAP queries. This unified approach simplifies data management, boosts performance, and significantly reduces infrastructure costs.

    Speakers

    Lu Qiu
    Database Engineer at LanceDB

    Beinan Wang
    Presto Committer

    As data volumes grow and analytical demands intensify, efficient data management and optimized retrieval systems are key. This session will provide an in-depth look at the evolution and technical innovations behind the Presto-Hudi connector, tracing its lineage from the earlier Hive connector to its current, advanced capabilities.

    We will explore the features that set the Hudi connector apart from conventional approaches like file listing and static partition pruning for query optimization. Central to this discussion is Hudi’s multi-modal indexing framework, which brings in powerful support for Column Statistics and Record Indexing. These unique indexing capabilities significantly enhance query performance, allowing for accelerated point and range lookups that streamline data retrieval.
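
To make the column-statistics idea concrete, here is a minimal sketch in Python of how per-file minimum and maximum values can be used to skip files for a range lookup. It is an illustration under stated assumptions, not the Hudi or Presto implementation: `FileStats`, `prune_files`, and the file names are all hypothetical, and a single integer column is assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one integer column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files: List[FileStats], lower: int, upper: int) -> List[FileStats]:
    """Keep only files whose [min_value, max_value] range overlaps [lower, upper].

    Files that cannot contain a matching row are skipped without being read.
    """
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


files = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# Range lookup: WHERE col BETWEEN 150 AND 220
selected = prune_files(files, 150, 220)
```

In this example only the second and third files overlap the requested range, so the first file is never scanned. Pruning of this kind is most effective when the data is laid out so that value ranges in different files overlap as little as possible.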

    The talk will also cover upcoming developments for the Presto Hudi connector, focusing on the expansion of multi-modal indexing and the integration of full DDL/DML support. These forward-looking improvements aim to elevate the Presto Hudi connector’s capabilities, delivering greater flexibility and efficiency for large-scale data operations and solidifying its role as a critical tool in data lakehouse environments.

    Speakers

    Sudha Saktheeswaran
    Head of Open Source at Onehouse.ai

    Jon Vexler
    Database Engineer at Onehouse.ai

    4:05PM – 4:35PM

    When current storage mechanisms fail to meet your requirements, sometimes it’s worth building your own. While this may have been daunting in the past, Presto’s connector framework allows for experimentation with minimal risk while taking advantage of storage disaggregation at scale. As a case study, let’s look at how IBM’s SOC Platform team decreased costs and modernized their storage platform, all while keeping the end-user experience the same via Presto’s Thrift connector.

    Speaker

    Hunter Madison
    Office of the CISO at IBM

    Ravishankar will demonstrate two approaches to querying vector data lakes using Presto: through an Arrow Flight connector and by building a vector data lake connector powered by open-source LanceDB. This will show how universal multimodal AI apps can be built using Presto, making it ready for the future.

    Speaker

    Ravishankar Nair
    CEO at Passionbytes

    4:40PM – 5:00PM

    Gurmeet will share the pivotal role Presto plays in Uber’s data ecosystem, as well as the latest metrics and workloads running on Presto. Learn more about the current data infra and plans for the future.

    Speaker

    Gurmeet Singh
    Staff Software Engineer, Uber

    5:00PM – 6:30PM

    Join us for drinks, light appetizers and Presto fun!