Avoid Data Silos in Presto in Meta: the journey from Raptor to RaptorX

Raptor is a Presto connector (presto-raptor) that is used to power some critical interactive query workloads in Meta (previously Facebook). Though referred to in the ICDE 2019 paper Presto: SQL on Everything, it remains somewhat mysterious to many Presto users because there is no available documentation for this feature. This article will shed some light…

Common Sub-Expression optimization

The problem One common pattern we see in some analytical workloads is the repeated use of the same, often times expensive expression. Look at the following query plan for example: The expression JSON_PARSE(features) is used 6 times, and casted to different ROW structures for further processing. Traditionally, Presto would just execute the expression 6 times,…

RaptorX: Building a 10X Faster Presto

RaptorX is an internal project name aiming to boost query latency significantly beyond what vanilla Presto is capable of. This blog post introduces the hierarchical cache work, which is the key building block for RaptorX. With the support of the cache, we are able to boost query performance by 10X. This new architecture can beat…

Using OptimizedTypedSet to Improve Map and Array Functions

Function evaluation is a big part of projection CPU cost. Recently we optimized a set of functions that use TypedSet, e.g. map_concat, array_union, array_intersect, and array_except. By introducing a new OptimizedTypeSet, the above functions saw improvements in several dimensions: Furthermore, OptimizedTypeSet resolves the long standing issue of throwing EXCEEDED_FUNCTION_MEMORY_LIMIT for large incoming blocks: “The input…

Even Faster Unnest

Unnest is a common operation in Facebook’s daily Presto workload. It converts an ARRAY, MAP, or ROW into a flat relation. Its original implementation used deep copy all the time and was very inefficient. In Unnest Operator Performance Enhancement with Dictionary Blocks, the author improved the Unnest operator by up to 10x in CPU and…

Getting Started with PrestoDB and Aria Scan Optimizations

This article was originally published by Adam on June 15th, 2020 over at his blog at datacatessen.com. PrestoDB recently released a set of experimental features under their Aria project in order to increase table scan performance of data stored in ORC files via the Hive Connector. In this post, we’ll check out these new features…

PrestoDB and Apache Hudi

Apache Hudi is a fast growing data lake storage system that helps organizations build and manage petabyte-scale data lakes. Hudi brings stream style processing to batch-like big data by introducing primitives such as upserts, deletes and incremental queries. These features help surface faster, fresher data on a unified serving layer. Hudi tables can be stored…

Improving Presto Latencies with Alluxio Data Caching

The Facebook Presto team has been collaborating with Alluxio on an open source data caching solution for Presto. This is required for multiple Facebook use-cases to improve query latency for queries that scan data from remote sources such as HDFS. We have observed significant improvements in query latencies and IO scans in our experiments. We…

Spatial Joins 1: Local Spatial Joins

A common type of spatial query involves relating one table of geometric objects (e.g., a table population_centers with columns population, latitude, longitude) with another such table (e.g., a table counties with columns county_name, boundary_wkt), such as calculating for each county the population sum of all population centers contained within it. These kinds of calculations are…

Engineering SQL Support on Apache Pinot at Uber

The article, Engineering SQL Support on Apache Pinot at Uber, was originally published by Uber on the Uber Engineering Blog on January 15, 2020. Check out eng.uber.com for more articles about Uber’s engineering work and follow Uber Engineering at @UberEng and Uber Open Source at @UberOpenSouce on Twitter for updates from our teams. Uber leverages…

Improving the Presto planner for better push down and data federation

Presto defines a connector API that allows Presto to query any data source that has a connector implementation. The existing connector API provides basic predicate pushdown functionality allowing connectors to perform filtering at the underlying data source. However, there are certain limitations with the existing predicate pushdown functionality that limits what connectors can do. The…

5 design choices—and 1 weird trick — to get 2x efficiency gains in Presto repartitioning

We like Presto. We like it a lot — so much we want to make it better in every way. Here’s an example: we just optimized the PartitionedOutputOperator. It’s now 2-3x more CPU efficient, which, when measured against Facebook’s production workload, translates to 6% gains overall. That’s huge. The optimized repartitioning is in use on…