Presto Parquet Column Encryption

Introduction Apache Parquet modular encryption provides encryption at-rest and in-transit at finer-grained. In big data world, data analytic tables are usually very wide with hundreds of columns, while only a small number of columns need to be protected. So the finer-grained access control is a better fit than coarse-grained one like table level access control….

Faster Presto Queries with Parquet Page Index

Introduction Today’s data is growing very fast, which creates challenges for query engines like Presto. Presto is a popular interactive query engine, because of its scalability, high performance, and smooth integration with Hadoop. As the volume of data grows, Presto needs to read larger chunks of data and load them into memory, which causes higher…

Avoid Data Silos in Presto in Meta: the journey from Raptor to RaptorX

Raptor is a Presto connector (presto-raptor) that is used to power some critical interactive query workloads in Meta (previously Facebook). Though referred to in the ICDE 2019 paper Presto: SQL on Everything, it remains somewhat mysterious to many Presto users because there is no available documentation for this feature. This article will shed some light…

RaptorX: Building a 10X Faster Presto

RaptorX is an internal project name aiming to boost query latency significantly beyond what vanilla Presto is capable of. This blog post introduces the hierarchical cache work, which is the key building block for RaptorX. With the support of the cache, we are able to boost query performance by 10X. This new architecture can beat…

Improving Presto Latencies with Alluxio Data Caching

The Facebook Presto team has been collaborating with Alluxio on an open source data caching solution for Presto. This is required for multiple Facebook use-cases to improve query latency for queries that scan data from remote sources such as HDFS. We have observed significant improvements in query latencies and IO scans in our experiments. We…