Shengxuan Liu from ByteDance and Beinan Wang from Alluxio will present the practical problems and interesting findings they encountered while launching Presto Router and Alluxio Local Cache. Their talk covers how ByteDance’s Presto team implements cache invalidation and a dashboard for Alluxio’s Local Cache. Shengxuan will also share his experience using a customized cache strategy to improve cache efficiency and system reliability.
Sivabalan Narayanan of Onehouse shares how Apache Hudi brought transactions and incremental processing to data lakes, both of which are regarded as foundational pillars of the Lakehouse architecture. In this session, we will discuss Apache Hudi and how it fills the key technology gaps in the modern data architecture. Viewed through a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds through its incremental processing model. We will take a look at the capabilities of the native Hudi connector in Presto and dive deep into it, covering the key optimizations and features it unlocks. Presto users can now leverage the metadata table for optimized file listing and avoid a large number of list operations in cloud storage. We will also look at how to improve query latency in Presto using advanced data skipping methodologies built on Hudi’s multi-modal subsystem.
Analytic tables in the big data ecosystem are usually large, and some of them reach petabytes in size. As a fast query engine, Presto needs to be intelligent enough to skip reading unnecessary data based on filters. In addition to the existing filtering that skips partitions, files, and row groups, Apache Parquet Column Index provides further filtering down to pages, the I/O unit for the Parquet data source. In this presentation, we will show our work integrating Parquet Column Index into the Presto code base and the resulting performance gains. We will also talk about our effort to open-source this project to PrestoDB, and we look forward to collaborating with the community to merge it!
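To illustrate the idea of page-level skipping, here is a minimal sketch in Python. It is not Presto's or Parquet's actual API: the `PageStats` class, the `pages_to_read` function, and the sample values are all hypothetical, and the sketch assumes only that each page carries a minimum and maximum value for the filtered column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page statistics, loosely modeled on a column index entry."""
    first_row: int
    row_count: int
    min_value: int
    max_value: int


def pages_to_read(pages: List[PageStats], lower: int, upper: int) -> List[int]:
    """Return the indexes of pages whose [min, max] range overlaps [lower, upper].

    Pages outside the range cannot contain matching rows, so a reader
    could skip the I/O for them entirely.
    """
    return [
        i for i, page in enumerate(pages)
        if page.max_value >= lower and page.min_value <= upper
    ]


# Three made-up pages of 1000 rows each, with values sorted across pages.
pages = [
    PageStats(first_row=0, row_count=1000, min_value=1, max_value=120),
    PageStats(first_row=1000, row_count=1000, min_value=121, max_value=260),
    PageStats(first_row=2000, row_count=1000, min_value=261, max_value=400),
]

# A filter such as "value BETWEEN 130 AND 200" only needs the second page.
selected = pages_to_read(pages, 130, 200)
```

In this sketch the filter touches one page out of three; the benefit depends on how well the data is sorted or clustered on the filtered column.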
Presto has been adopted at Tencent at scale to serve ad-hoc and interactive query scenarios for different business units. In this talk, we’d like to share our practice of running Presto in production. In detail, we’ll talk about our work to further improve the stability, extend the usability, and optimize the performance of Presto. Together, this work makes Presto a better fit for our production environment, and we think it will also benefit the community.
RaptorX is the internal name of a project that aims to reduce query latency significantly beyond what vanilla Presto is capable of. In this session, we introduce the hierarchical cache work, including the Alluxio data cache, the fragment result cache, etc. Cache is the key building block for RaptorX. With the support of the cache, we are able to boost query performance by 10X. This new architecture can beat performance-oriented connectors like Raptor, with the added benefit of continuing to work with disaggregated storage.
Apache Hudi is a data lake platform that supercharges data lakes. Originally created at Uber, Hudi provides various ways to strike trade-offs between ingestion speed and query performance by supporting user-defined partitioners and automatic file sizing, both of which favor query performance. Hudi integrates with PrestoDB to make this data available for queries. During ingestion, data is typically co-located based on arrival time. However, query engines perform better when frequently queried data is co-located, and that layout may differ from arrival-time order. We will discuss a new framework called “data clustering” that makes data lakes adaptable to query patterns, thereby improving query latencies. Finally, we will discuss future work to improve data locality through custom bucketing of data during ingestion, avoiding some of the rewrite costs.
The Google BigQuery connector gives users the ability to query tables in the BigQuery service, Google Cloud’s fully managed data warehouse. In this presentation, we’ll discuss the BigQuery Connector plugin for Presto, which uses the BigQuery Storage API to stream data in parallel, allowing users to query BigQuery tables via gRPC for better read performance. We’ll also discuss how the connector enables interactive ad-hoc queries that join data across distributed systems for data lake analytics.