Rajmani Arya, Varun Senthilnathan & Manoj Kumar Dhakad, Adobe Advertising: We are from the Product Engineering team in Adobe Advertising (https://business.adobe.com/in/product…. Adobe Advertising is a digital advertisement platform. We take care of accumulating all data, providing platform intelligence, building and maintaining machine leaning capabilities, building and maintaining internal pipelines that form derived data to be used by other teams. The volume of total incoming raw data ranges between 8 to 10 tb/ day spread across 7 regions. The total data in the system currently is about 7pb. This data is largely stored in Hive tables with a central metastore. We use Presto in three ways: 1. Data studio – an internal tool to enable data analysts, sales, marketing and other teams to do adhoc querying. This is also used by data engineers to do adhoc querying for engineering tasks. 2. Custom Reports – We create reports for customers to get performance insights on their campaigns. We have 100s of reports that are run on a daily basis. 3. Internal Pipelines – Presto is used to retrieve data to power 100s of pipelines run daily to generate derived data.
In this demo we’ll go through two key pieces of watsonx.data, IBM’s new Data Lakehouse offering. Multiple analytics engines working on the same data: – Demo: Multiple engines working on the same data set so you can use the analytics tools you love without having to deal with the ugly plumbing Semantic Automation: Leverage AI to simplify data discovery and manipulation, allowing your data to work for you – Demo: Using a chat interface to find tables of relevance and how AI can enrich data sets with semantic information
Shengxuan Liu from ByteDance and Beinan Wang from Alluxio will present the practical problems and interesting findings during the launch of Presto Router and Alluxio Local Cache. Their talk covers how ByteDance’s Presto team implements the cache invalidation and dashboard for Alluxio’s Local Cache. Shengxuan will also share his experience using a customized cache strategy to improve the cache efficiency and system reliability.
Twilio as a leader in cloud communication platforms is very heavy on data and data-based decision making. Most data related use cases are currently powered by the Presto engine. Two years back we started the Journey with Presto in Twilio and today the system has scaled to a multi-PB data lakehouse and supports more than 75k queries per day. In this journey, we learned a lot about how to effectively operationalize Presto on AWS and some of the tricks to have better query reliability, query performance, guard-railing the clusters and save cost. With this talk, we want to share this experience with the community.
Presto is used for a variety of cases, but tends to be used for larger scale analytical queries. We have been transitioning to using Presto to power our data platform and customer-facing scripting language, RQL (Rippling Query Language) to run arbitrary customer queries to power core products. Presto helps enable diverse, federated querying at scale. In this talk, Andy will cover where Presto sits in Rippling’s ecosystem as a core query layer, our collaboration and contributions for closer integration with Apache Pinot, and learnings on using Presto to handle a large variety of query patterns.
Data processing systems have evolved significantly over the last decade, driven by various factors such as the advent of cloud computing, increasingly complexity of applications such as ML, HTAP, Streaming, Observability and Graph processing. However, historically, these frameworks have evolved independently, leading to significant fragmentation of the stack. In this talk, I will talk about how this has evolved in the open source and at Meta, and how we are solving this problem through the Shared Foundations effort, leading to composable systems. This has resulted in significantly better performance, more features, higher engineering velocity and a more consistent user experience.
In this talk, Aditi Pandit, Principal Software Engineer at Ahana and Presto/Velox contributor, will throw the covers back on some of the most interesting portions of working in Prestissimo and Velox. The talk will be based on the experience of implementing the windowing functions in Velox. It will cover the nitty gritty on the vectorized operator, memory management and spilling. This talk is perfect for anyone who is using Presto in production and wants to understand more about the internals, or someone who is new to Presto and is looking for a deep technical understanding of the architecture.
In this talk, you will hear from the query optimizer OG himself, Bill McKenna (Principal software engineer at Ahana, Architect for the query optimizer that became the code base of the Amazon Redshift query optimizer, and co-author of The Volcano Optimizer Generator: Extensibility and Efficient Search) go into detail about the state of modern query optimizers, and how Presto stacks up against them and where it will go in the near future. If database theory is your jam, you won’t want to miss this deeply technical presentation from one of the pioneers in the field.
Sivabalan Narayanan of Onehouse shares more about how Apache Hudi brought transactions, incremental processing on top of data lakes, which are deemed as the foundational pillars for Lakehouse architecture. In this session, we will discuss Apache Hudi and how it fills the key technology gaps in the modern data architecture. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds realized by incremental processing model. We will take a look at the capabilities of native Hudi connector in Presto. We will dive deep into this connector, covering the key optimizations and features it unblocks. Presto users could now leverage the metadata table for optimized file listing and avoid large number of list operations in cloud storages. We will look at how we can improve the query latency in Presto using advanced data skipping methodogies employed with multi-modal sub-system with Hudi.
The modern data lake is distributed, unstructured and demands performance and scale – or better stated, performance at scale. Modern object stores are the ideal platform to pair with MPP query engines like Presto – particularly as the scale reaches tens or hundreds of petabytes with tens to hundreds of concurrent queries. In this talk, Satish Ramakrishnan will outline the better together attributes of the two technologies with a focus on the most sophisticated modern object storage features – from throughput optimizations, multi-cloud capabilities, cross-cloud active active replication and lifecycle management. Participants will come away with a reference architecture suited to query processing at object scale.
PrestoDB is a fast and efficient software that can change the way teams access their data. OpenMetadata is a metadata management software that is here to change the data pipeline space. Together they can take your data, metadata and the entire company data culture to the next level. In this talk, Suresh and Sriharsha will show you how to connect your Presto database, ingest its metadata and get your data right. Use your company’s data to its full potential with OpenMetadata’s unified solution with data cataloging, discoverability, quality, lineage, operations, and collaboration all in one central UI. OpenMetadata’s philosophies: 1. Building Metadata Standards with a Schema and API first approach 2. Automation with bots abstraction for automating mundane tasks – quality, data deletion, tag propagation. 3. Top-Notch UX to enable everybody in the company to unlock the value of data.
Women in Open Source & Presto – Getting Started in the Presto Open Source Ecosystem – Neha Pawar, Startree; Rebecca Schlussel, Meta; RongRong Zhong, Celonis & Moderated By Dipti Borkar, Microsoft Among GitHub users with at least ten contributions, a mere 6% were women. This is way less than the ratio of women in tech that various research shows at 26%. Given the amount of investment going into and the growth / success of companies based on open source as well as the enormous demand for developers in open source, it is a ratio we need to strive to improve for women. In this panel, we will discuss a few areas: – The journey of each panelist into open source projects – The benefits they have seen by participating in open source projects particularly Presto – The challenges women face in male-dominated open source communities – Ideas, suggestions and guidance to budding engineers on getting started with open source including Presto.
Over the last two decades, we’ve seen the birth and emergence of the data lake systems–from the internal walls of Google to modern Lakehouses at Meta/Facebook, which promise the best of both data lake and data warehouse worlds. Equally important is the role open source–and more broadly, openness–has played and will play in this journey. In this talk, Steven will draw his experience with open source distributed systems (Couchbase, Mesosphere, Alluxio, Linux Foundation Presto) to explore the significance of the “5 shades of openness” with respect to the composable open data lakehouse ecosystem.
While using the Presto Iceberg connector, the in-heap cache in Presto is likely overloaded. In this talk, Beinan and Chunxu will share the design, implementation, and optimization of the off-heap cache to address the scalability challenges. You will learn how to cache Iceberg data and metadata for the Presto Iceberg connector, followed by future work on improving table scans using Apache Arrow.