Parquet Column Level Access Control with Presto

Apache Parquet is the major columnar file storage format used by Presto and several other query engines in many big data analytic frameworks today. In many use cases, a portion of the column data is highly sensitive and must be protected. The Parquet community supports column encryption at the file format level. Because Presto has its own rewritten Parquet code, Parquet column encryption must be ported to the Presto code base with modifications. Integration with a Key Management Service (KMS) and with other query engines such as Hive and Spark is a further challenge. In this talk, we will show the work we have done to enable Parquet column decryption in Presto, including the challenges, the solutions, and the integration with Hive/Spark Parquet column encryption, and we will look ahead to the next steps of the encryption work.

HermesDB – Integrated Presto with a Lucene-based Query Engine – Yue Long, Tencent

HermesDB is the next-generation OLAP engine at Tencent, with an architecture that separates storage and compute. HermesDB features efficient index files within its stored data and uses a customized Presto as its core query engine. Through a Presto connector, HermesDB not only supports full ANSI SQL syntax but also utilizes Apache Lucene as its underlying compute core. In addition, we are in the process of improving end-to-end performance with the newly released Java Vector API, accelerating various kinds of complex computations with SIMD instructions. According to our benchmark (SSB), HermesDB outperforms other mainstream C++-based MPP engines.

Speed Up Presto Reading with Parquet Column Indexes – Xinli Shang & Chen Liang, Uber

Analytic tables in the big data ecosystem are usually large, and some reach petabytes in size. As a fast query engine, Presto needs to be intelligent about skipping unnecessary data based on filters. In addition to the existing filtering that skips partitions, files, and row groups, the Apache Parquet Column Index provides further filtering down to pages, the I/O unit for the Parquet data source. In this presentation, we will show our work integrating the Parquet Column Index into the Presto code base and the resulting performance gains. We will also talk about our effort to open-source this project to PrestoDB, and we look forward to collaborating with the community to merge it!
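
As a rough illustration of page-level filtering, the sketch below keeps only the pages whose min/max statistics can overlap a range filter. It is a simplified model, not the Presto or Parquet implementation: the `PageStats` structure, the `pages_to_read` function, and the sample values are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Min/max statistics for one data page of a column (hypothetical structure)."""
    page_id: int
    min_value: int
    max_value: int


def pages_to_read(pages: List[PageStats], lower: int, upper: int) -> List[int]:
    """Return ids of pages whose [min, max] range overlaps lower <= x <= upper.

    Pages whose statistics cannot overlap the filter range are skipped,
    so a reader would never need to fetch or decode them.
    """
    return [
        p.page_id
        for p in pages
        if p.max_value >= lower and p.min_value <= upper
    ]


# Illustrative statistics for three pages of one column (made-up values).
pages = [
    PageStats(page_id=0, min_value=1, max_value=100),
    PageStats(page_id=1, min_value=101, max_value=200),
    PageStats(page_id=2, min_value=201, max_value=300),
]

# Filter: 150 <= x <= 180. Only page 1 can contain matching rows.
selected = pages_to_read(pages, 150, 180)
print(selected)
```

In this toy model, the benefit grows when the column is sorted or clustered, because page ranges then overlap less and more pages can be skipped.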

Presto at Tencent at Scale: Usability Extension, Stability Improvement and Performance Optimization – Junyi Huang & Pan Liu

Presto has been adopted at Tencent at scale to serve ad-hoc and interactive query scenarios for different business units. In this talk, we’d like to share our practice of running Presto in production. In detail, we’ll talk about our work to further improve the stability, extend the usability, and optimize the performance of Presto. Together, this work makes Presto a better fit for our production environment, and we think it will also benefit the community.