Agenda

    8:30am – 8:45am PDT: Welcome Remarks
    Curt Hu, Chair, Presto Foundation | Senior Engineering Manager at Uber
    Ali LeClerc, Chair, Presto Foundation Outreach Committee | Open Source & Product at IBM

    8:45am – 9:15am PDT: TSC Keynote
    Tim Meehan, Chair, Presto Foundation TSC | Software Engineer at IBM

    9:15am – 9:35am PDT: Optimizing data analytics at Etisalat Egypt: Presto at the edge
    Mohamed Taha, Big Data Engineer at Etisalat Egypt

    9:35am – 9:55am PDT: Enabling analytics with Presto at Apna
    Dhvanit Trivedi, Data Engineer at Apna
    Piyush Mujavadiya, Lead Data Engineer at Apna
    Subham Todi, Lead Data Engineer at Apna

    9:55am – 10:00am PDT: Sponsor Session: Presto C++ and IBM watsonx.data for the Open Data Lakehouse
    Kevin Shen, Product Manager at IBM

    10:00am – 10:30am PDT: Break

    10:30am – 11:00am PDT: Unraveling the Non-Deterministic Query Conundrum for Prestissimo Verification
    Ge Gao, Software Engineer at Meta
    Krishna Pai, Software Engineer at Meta
    Wei He, Software Engineer at Meta

    11:00am – 11:20am PDT: Presto Native Iceberg Support
    Ying Su, Software Engineer at IBM

    11:20am – 11:40am PDT: Presto C++ TPC-DS updates & Pbench
    Aditi Pandit, Software Engineer at IBM
    Ethan Zhang, Engineering Manager at IBM

    11:40am – 12:00pm PDT: Detecting and Resolving Presto Performance Hurdles
    Goutam Verma, Software Engineer at WSO2

    12:00pm – 12:10pm PDT: Leveraging TTL in Presto’s Local Cache for Data Privacy and Performance
    Chunxu Tang, Staff Research Scientist at Alluxio
    Jianjian Xie, Staff Software Engineer at Alluxio

    12:10pm – 12:30pm PDT: Diving into Presto 2.0 benchmark internals at IBM – Presto C++ and Query Optimizer results
    Berthold Reinwald, Principal Research Staff Member at IBM Research
    Ashok Kumar, Program Director, Data and AI at IBM

    12:30pm – 1:00pm PDT: Break

    1:00pm – 1:20pm PDT: Exploring Cloud Intelligence: Leveraging Presto for Data Analytics on AWS Cloud
    Henry Clavo, Data Professional at Government Agency

    1:20pm – 1:30pm PDT: Presto OpenAPI/HTTP Connector
    Andrei Savu, Software Engineer at Rippling

    1:30pm – 1:40pm PDT: How we accelerated our Iceberg queries for CDC with MoR and Equality Deletes
    Roy Hasson, VP Product at Upsolver

    1:40pm – 2:00pm PDT: Presto Pinot DataLake Segment Reader
    Mingjia Hang, Sr. Software Engineer at Uber

    2:00pm – 2:20pm PDT: Enhancing Presto’s query performance and data management with Hudi: innovations and future
    Ethan Guo, Data Infrastructure Engineer at Onehouse

    2:20pm – 2:40pm PDT: Streamlining Data Analytics with NeuroBlade’s SPU HW Acceleration
    Deepak Narain, VP Product at NeuroBlade

    2:40pm – 3:00pm PDT: How can Presto better support ML users?
    Pedro Pedreira, Software Engineer at Meta

    3:00pm – 3:30pm PDT: Break

    3:30pm – 3:50pm PDT: Bridging the Divide: Running Presto SQL on a Vector Data Lake powered by Lance
    Lei Xu, CTO/Co-Founder at LanceDB
    Beinan Wang, Software Engineer & Presto TSC member

    3:50pm – 4:10pm PDT: Unlocking Language Insights: Building a Presto Connector for Large Language Models
    Satej Sahu, Sr. Software Data Architect at The Boeing Company

    4:10pm – 4:30pm PDT: Nimble, a new file format for large datasets
    Jialiang Tan, Software Engineer at Meta
    Jimmy Lu, Software Engineer at Meta

    Welcome Remarks

    Welcome to PrestoCon Day! Join us for a day of all things open-source Presto. You’ll hear more from Presto Foundation Chairs Curt and Ali as they share the latest updates from the community and what to expect for the day.

    Curt Hu
    Chair, Presto Foundation | Senior Engineering Manager at Uber

    Ali LeClerc
    Chair, Presto Foundation Outreach Committee | Open Source & Product at IBM

    TSC Keynote

    Tim Meehan
    Chair, Presto Foundation TSC | Software Engineer at IBM

    Optimizing data analytics at Etisalat Egypt: Presto at the edge

    Etisalat Egypt is one of the leading mobile operators in Egypt. In this session, learn more about some of the data challenges at Etisalat and how the data team harnesses the power of Presto to handle the challenges of fragmented data.

    Mohamed Taha
    Big Data Engineer at Etisalat Egypt

    Enabling analytics with Presto at Apna

    Apna is the largest and fastest-growing professional opportunity platform in India. In this session, we will explore Apna’s journey with Presto, including its deployment on Kubernetes and the optimizations implemented to significantly reduce query times. Discover the strategies that have helped Apna achieve efficient and scalable data analytics.

    Dhvanit Trivedi
    Data Engineer at Apna

    Piyush Mujavadiya
    Lead Data Engineer at Apna


    Subham Todi
    Lead Data Engineer at Apna

    Sponsor Session: Presto C++ and IBM watsonx.data for the Open Data Lakehouse

    Learn more about IBM watsonx.data, the Open Data Lakehouse and the first platform to offer Presto C++ for better price-performance. In this session, Kevin will dive into the watsonx.data components, including Presto C++, Apache Spark, Milvus, and more. Learn how companies are using the watsonx.data platform to power all of their workloads at scale.

    Kevin Shen
    Product Manager at IBM

    Unraveling the Non-Deterministic Query Conundrum for Prestissimo Verification

    We will present our work on enabling the correctness verification of Prestissimo on non-deterministic queries for Meta’s Presto production release. Non-deterministic queries constitute a large portion of production traffic, yet their results are not comparable between engines or between engine versions, posing a significant challenge to correctness verification for Prestissimo. In this talk, we will share how we divide the problem and leverage Presto Verifier and Velox Fuzzer to rewrite non-deterministic queries and verify correctness at both the query level and the expression level.

    Ge Gao
    Software Engineer at Meta

    Krishna Pai
    Software Engineer at Meta


    Wei He
    Software Engineer at Meta

    Presto Native Iceberg Support

    Ying will share a brief introduction to Apache Iceberg and the latest work that has gone into Iceberg support in the Presto native C++ engine, which includes support for reads, time travel, caching, and more. She will also share design and implementation details.

    Ying Su
    Software Engineer at IBM

    Presto C++ TPC-DS updates & Pbench

    A big motivation for the Presto native C++ project is the price-performance win afforded by the new architecture. Vectorization, built-in memory management and caching, and runtime optimizations make for a state-of-the-art data engine built for efficiency.

    At IBM, we are constantly improving Presto C++ against the TPC-DS benchmark. This industry benchmark measures complex decision-support capabilities and is a key factor customers consider when purchasing SQL engine products.

    In this talk, we will present the latest open-source Presto C++ numbers for TPC-DS 1K, 10K, and 100K runs. We will delve into the roadblocks, the issues fixed, and the next round of proposed improvements. We will also share more about the results we’re seeing with Pbench, a benchmark runner intended as a replacement for Benchto.

    Aditi Pandit
    Software Engineer at IBM


    Ethan Zhang
    Engineering Manager at IBM

    Detecting and Resolving Presto Performance Hurdles

    In this session, Goutam will explore advanced monitoring strategies for detecting and resolving performance issues in Presto clusters. We will delve into specific metrics and tools that can help identify issues such as query latency spikes, resource contention, and node failures. Through real-world examples and case studies, attendees will learn how to optimize their monitoring setup to proactively detect and resolve issues, ensuring smooth operation and high performance of their Presto deployments.

    The session will begin with an overview of Presto clusters and the critical role of monitoring in optimizing performance. We will then discuss common performance hurdles, including query latency spikes, resource contention, and node failures, highlighting the need for proactive monitoring. Next, Goutam will delve into key metrics that should be monitored, such as query execution times, resource utilization, and network latency, and how these metrics can help in identifying and addressing performance issues. Goutam will also provide a brief overview of monitoring tools like Prometheus, Grafana, and Presto’s built-in metrics, showcasing their capabilities in collecting and analyzing monitoring data. Before the end of the session, attendees will explore real-world examples demonstrating the effectiveness of these monitoring strategies in detecting and resolving performance issues in Presto clusters.

    Goutam Verma
    Software Engineer at WSO2

    Leveraging TTL in Presto’s Local Cache for Data Privacy and Performance

    The automatic eviction of cached data beyond a certain period of time is a very useful feature for Presto users who have to comply with data privacy regulations like GDPR and CCPA. In this session, Chunxu and Jianjian will share the implementation of caching time-to-live (TTL) for data cached on the local disk. This feature not only helps Presto users with regulatory compliance but also can keep Presto’s local cache populated with the freshest, most relevant data.

    You will learn:
    – The implementation of TTL in Presto local cache
    – Configurations and strategies for picking optimal TTL values
    – Examples of using TTL to meet data privacy requirements while maximizing local cache performance gains

    Chunxu Tang
    Staff Research Scientist at Alluxio

    Jianjian Xie
    Staff Software Engineer at Alluxio

    Diving into Presto 2.0 benchmark internals at IBM – Presto C++ and Query Optimizer results

    At IBM, we’ve recently released our latest benchmarking results for Presto C++ v0.286 and the query optimizer on IBM Storage Fusion HCI. In this session, we’ll discuss benchmark internals and share a more detailed analysis of the results of all our runs.

    Berthold Reinwald
    Principal Research Staff Member at IBM Research

    Ashok Kumar
    Program Director, Data and AI at IBM

    Exploring Cloud Intelligence: Leveraging Presto for Data Analytics on AWS Cloud

    This session explores how to seamlessly migrate data from on-premises to AWS and leverage Presto for advanced SQL queries. Gain practical insights into accelerating analytical workflows and making data-driven decisions.

    Henry Clavo
    Data Professional at Government Agency

    Presto OpenAPI/HTTP Connector

    An OpenAPI-based HTTP/JSON alternative to the Presto Thrift connector: fewer features, but no less useful.

    Andrei Savu
    Software Engineer at Rippling

    How we accelerated our Iceberg queries for CDC with MoR and Equality Deletes

    Ingesting and maintaining a stream of Change Data Capture (CDC) from transactional databases to an Iceberg lakehouse is not easy. More specifically, as the frequency and volume of changes increase, query performance quickly degrades, forcing users to make hard choices between CoW and MoR, small and large files, and even whether to delay refreshing the table. In this lightning talk, you’ll learn how Apache Iceberg manages deleted rows, the difference between position and equality delete files, and how recent enhancements to Presto optimize MoR with equality deletes using joins to improve query performance by 400x.

    Roy Hasson
    VP Product at Upsolver

    Presto Pinot DataLake Segment Reader

    Currently, the existing Presto Pinot Connector primarily supports hot data, which can strain Pinot servers. To address user demands for extended data retention and advanced join queries, we are introducing the new Presto Pinot Datalake connector. This connector allows direct access to Pinot segments stored in deep store, eliminating redundant data ingestion and optimizing our data handling capabilities.

    Mingjia Hang
    Sr. Software Engineer at Uber

    Enhancing Presto’s query performance and data management with Hudi: innovations and future

    In the continually changing world of big data and analytics, effective data management and retrieval systems are crucial. In this presentation, we will set forth on an insightful exploration of the development and innovation of the Presto Hudi connector, tracing its origins through the earlier Hive connector.

    We will delve into the features that set the Hudi connector apart from traditional file listing and partition pruning approaches to query optimization in systems like Presto. We will learn about the unique features of Hudi, including its multi-modal indexing framework, which integrates support for Column Statistics and a Record Index, and see how these attributes enhance query efficiency for both point and range lookups.

    The talk will present the forward-looking agenda for the Presto Hudi connector, including the expansion of the multi-modal indexing framework and the addition of DDL/DML support. These enhancements aim to further improve data management with the Presto Hudi connector, providing increased flexibility and efficiency in large-scale data operations.

    Ethan Guo
    Data Infrastructure Engineer at Onehouse.ai

    Streamlining Data Analytics with NeuroBlade’s SPU HW Acceleration

    This presentation will discuss NeuroBlade’s collaboration with the open-source community to enhance the Velox analytics engine through specialized hardware acceleration. We will delve into the technical enhancements and performance improvements enabled by the NeuroBlade SQL Processing Unit (SPU). Utilizing the Data Analytics Acceleration (DAXL) framework, this approach abstracts the underlying hardware complexities, thus streamlining the integration with data analytics platforms. Krishna Maheshwari will explain the seamless integration of the SPU with Presto-Velox, focusing on its compatibility with major data formats including Iceberg, Parquet, and ClickHouse. We will also present benchmark results that demonstrate the SPU’s pipelined processing capabilities, showcasing significant improvements in efficiency and processing speed.

    Deepak Narain
    VP Product at NeuroBlade

    How can Presto better support ML users?

    In this talk, I’ll discuss some of the challenges faced by ML users as they leverage Presto to prepare large scale training datasets. Based on the experience supporting these workloads at Meta, I’ll present how they are different from traditional analytic workloads, and discuss the opportunities such new requirements offer to the design of modern compute engines. I’ll present our findings in three different dimensions:

    1. More efficient storage and in-memory data layout.
    2. Compressed execution and its impact on operator design.
    3. (Extremely) late materialization.

    I’ll also share recent progress the Meta team has made in supporting these workloads, initial results, and existing and new open source projects that support this stack, and present areas where more research, development, and collaboration are needed.

    Pedro Pedreira
    Software Engineer at Meta

    Bridging the Divide: Running Presto SQL on a Vector Data Lake powered by Lance

    In recent years, advancements in GenAI, LLMs, computer vision, and robotics have sparked a significant increase in the demand for massive computational power and innovative data practices. These demands were previously unseen in traditional big data infrastructure, leading to AI data being stored in separate silos and queried using separate systems, increasing cost and complexity.

    Instead, what if you could use Presto to run large-scale OLAP queries and data transforms on the same datasets used for search and retrieval, or even training? This saves AI teams from wasting time and effort on converting between different formats, and it allows them to write SQL rather than complex and expensive Python scripts for data transformation.

    To make this possible, we propose a vector data lake based on the Lance format, accessed via simple SQL queries through Presto, a mature distributed analytical engine with a rich set of compute kernels. Lance delivers a 10x performance improvement on real-time search queries and is compatible with Presto to support fast distributed OLAP queries. This unified approach simplifies data management, boosts performance, and significantly reduces infrastructure costs.

    Lei Xu
    CTO/Co-Founder at LanceDB

    Beinan Wang
    Software Engineer & Presto TSC member

    Unlocking Language Insights: Building a Presto Connector for Large Language Models

    Dive into the realm of natural language understanding and data analytics as we embark on a groundbreaking journey to harness the power of Large Language Models (LLMs) with Presto. In this captivating session, I’ll unveil a visionary approach to seamlessly integrate LLMs into your data ecosystem using a custom Presto connector.

    Large Language Models have revolutionized the way we interact with and analyze textual data, offering unparalleled capabilities in natural language processing and understanding. However, unlocking the full potential of LLMs within traditional data analytics pipelines can be challenging. That’s where Presto comes in.

    Join me as we explore the innovative fusion of LLMs and Presto, enabling direct access to vast troves of textual data for real-time analysis and insights extraction. Through this session, you’ll gain invaluable insights into designing and implementing a custom Presto connector tailored specifically for LLM integration.

    Key highlights include:
    – Understanding the transformative potential of integrating LLMs into data analytics workflows
    – Designing an architecture for the Presto connector to seamlessly interface with LLMs, ensuring efficient data retrieval and processing
    – Leveraging Presto’s extensibility to develop custom connectors optimized for handling large volumes of textual data
    – Overcoming challenges and optimizing performance for real-time analysis and insights extraction
    – Real-world case studies and use cases demonstrating the transformative impact of LLM-Presto integration on various industries and applications

    Satej Sahu
    Sr. Software Data Architect at The Boeing Company

    Nimble, a new file format for large datasets

    In this talk we will present Nimble, a novel file format for large datasets, recently open-sourced by Meta. Nimble was designed to enhance the efficiency, flexibility, and extensibility of existing file formats. It outperforms existing formats such as Apache ORC and Parquet by offering better support for very wide tables, which are common in data preparation workloads for ML training. Nimble also provides more flexibility and extensibility in the encodings it supports and is better suited for parallel decoding using SIMD and GPUs. Our goal is to eventually migrate Meta’s data warehouse to Nimble.

    The session will include an overview of:

    1. Meta’s training data preparation workloads, why they are not well suited to existing file formats like ORC and Parquet, and the role Presto plays in them.
    2. Presto Native’s new integration with the Nimble file format.
    3. Nimble’s current status at Meta.
    4. Ongoing development and future work, with the aim of creating new collaboration opportunities in file formats for analytics.

    Jialiang Tan
    Software Engineer at Meta

    Jimmy Lu
    Software Engineer at Meta