Unlocking Petabyte-Scale Performance: Uber’s Journey to a Distributed Cache for Presto with Alluxio
At PrestoCon Day 2025, Uber presented how they optimized petabyte-scale data analytics by deploying an Alluxio-based distributed cache for Presto. Their journey was driven by significant challenges during a massive cloud migration: read slowness and overwhelmed HDFS clusters on-premises, and later high GCS egress costs and file access charges in the cloud.
With a data lake storing petabytes of physical data in HDFS and a peak read throughput hitting three terabytes per second, managing this scale efficiently is paramount. Presto, the dominant query engine, contributes over 70% of this read traffic, consuming more than 300 PB every week. This immense scale, coupled with a massive migration from on-premises infrastructure to the cloud, presented significant challenges. Uber’s solution? A robust, petabyte-scale distributed cache built on Alluxio.
The Motivation: Overcoming On-Premises & Cloud Hurdles
Uber’s journey began with a critical need to alleviate pressure on their on-premises HDFS clusters. Because of non-negotiable requirements, storage and compute migrated to the cloud at very different paces. In 2025, for instance, Uber had decommissioned over 50% of its HDFS storage nodes while a large portion of Presto queries still ran on-premises. This overwhelmed the remaining HDFS clusters and caused severe read slowness. Building a distributed cache became essential to reduce reads from HDFS and enable traffic shaping for heavy users.
As Uber moved to the cloud and became an “elephant” tenant in new GCS regions, new challenges emerged: GCS egress charges and file access costs, both of which an effective cache could significantly reduce.

Introducing Alluxio: The Backbone of Uber’s Cache
Alluxio is a popular open-source project that provided the foundational technology for Uber’s distributed cache. Key features of Alluxio that made it ideal for Uber’s needs include its decentralized architecture, support for page-level partial file caching, and linear scalability.
Architecture Overview: Uber’s integration of Alluxio with Presto keeps the majority of the Presto architecture, especially the coordinator, the same. The core integration happens at the Presto worker side, where Uber updated the storage client to read hot datasets directly from the Alluxio cluster. For non-hot data, reads go directly to remote storage like HDFS or GCS. If Alluxio doesn’t have a requested file, it pulls it from the remote storage.
Alluxio Worker and Scalability: The Alluxio worker is a decentralized cache, registering itself with ETCD and maintaining heartbeats for node liveness checks. The Alluxio client on each Presto node pulls the list of cache nodes from ETCD and builds a hash ring to route traffic by file ID, thereby supporting linear scalability. Crucially, the Alluxio master node is not on the critical path, eliminating a single point of failure.
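To illustrate how clients can route traffic by file ID across the cache nodes pulled from ETCD, here is a minimal consistent-hash ring sketch in Java. It is not Alluxio’s actual client code; the virtual-node count and the MD5-based hashing are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Minimal consistent-hash ring sketch: every Presto-side client maps a file ID
 * to a cache node so that all clients route the same file to the same worker.
 */
public class CacheNodeRing {
    private static final int VIRTUAL_NODES = 100; // illustrative

    private final TreeMap<Long, String> ring = new TreeMap<>();

    public CacheNodeRing(List<String> cacheNodes) {
        // Place each node at many points on the ring to spread load evenly.
        for (String node : cacheNodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    /** Route a file ID to the owning cache node (clockwise walk on the ring). */
    public String nodeFor(String fileId) {
        long h = hash(fileId);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because every Presto node builds the same ring from the same membership list, a given file always lands on the same Alluxio worker, and adding or removing a node only remaps a small fraction of files, which is what enables the linear scalability described above.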
Within the Alluxio worker, there are two key endpoints: one for metadata (to get file info) and another for data reads, served by a Netty data server. Internally, each worker has a file metadata store, a page file cache in the local file system, and an in-memory page meta index to track pages per file ID. Alluxio workers manage capacity by evicting files using an LRU (Least Recently Used) policy.
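As a rough illustration of the page cache and LRU eviction described above, here is a minimal page-level LRU sketch in Java. It is not Alluxio’s implementation; the PageKey type and the byte-based capacity accounting are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch of a page-level LRU cache keyed by (fileId, pageIndex). */
public class PageLruCache {
    /** Hypothetical composite key for a single cached page. */
    record PageKey(String fileId, long pageIndex) {}

    private final long capacityBytes;
    private long usedBytes = 0;
    // accessOrder=true makes iteration go from least- to most-recently used.
    private final LinkedHashMap<PageKey, byte[]> pages = new LinkedHashMap<>(16, 0.75f, true);

    public PageLruCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    public synchronized byte[] get(PageKey key) {
        return pages.get(key); // lookup also marks the page as recently used
    }

    public synchronized void put(PageKey key, byte[] page) {
        byte[] old = pages.put(key, page);
        usedBytes += page.length - (old == null ? 0 : old.length);
        // Evict least-recently-used pages until we are back under capacity.
        var it = pages.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Map.Entry<PageKey, byte[]> eldest = it.next();
            usedBytes -= eldest.getValue().length;
            it.remove();
        }
    }
}
```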

Phase 1: Building a Robust On-Premises Cache
Uber’s initial focus was on reducing data reads from HDFS and integrating Alluxio with Presto to build their first distributed cache cluster. They designed a “shared file system” where Presto first attempts to read from its local cache, then from the Alluxio remote cache, and finally falls back to HDFS for remote storage.
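A minimal sketch of that read order, assuming a hypothetical CacheTier interface rather than the real Presto or Alluxio APIs: try the local cache, then the Alluxio remote cache, then HDFS.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;

/** Minimal sketch of the shared-cache read order: local -> Alluxio -> HDFS. */
public class TieredReadPath {
    public interface CacheTier {
        /** Returns a stream for the file if this tier has it, empty otherwise. */
        Optional<InputStream> open(String fileId) throws IOException;
    }

    private final CacheTier localCache;
    private final CacheTier alluxioCache;
    private final CacheTier hdfs;

    public TieredReadPath(CacheTier localCache, CacheTier alluxioCache, CacheTier hdfs) {
        this.localCache = localCache;
        this.alluxioCache = alluxioCache;
        this.hdfs = hdfs;
    }

    public InputStream open(String fileId) throws IOException {
        // Try the cheapest tier first and fall through on a miss.
        Optional<InputStream> s = localCache.open(fileId);
        if (s.isEmpty()) {
            s = alluxioCache.open(fileId);
        }
        if (s.isEmpty()) {
            s = hdfs.open(fileId);
        }
        return s.orElseThrow(() -> new IOException("File not found: " + fileId));
    }
}
```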
To ensure flexibility, they built features like an in-flight configurable cache allow list, allowing them to turn caching on or off and update hot table/partition lists on demand. They also implemented security features to protect traffic between the Alluxio client and worker.
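As a rough sketch of what an in-flight configurable allow list can look like (the class and method names are hypothetical, not Uber’s actual config plumbing), a config watcher swaps in the new hot list atomically while reads keep checking it:

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of an in-flight configurable cache allow list. */
public class CacheAllowList {
    private final AtomicReference<Set<String>> hotPartitions =
            new AtomicReference<>(Set.of());
    private volatile boolean cachingEnabled = true;

    /** Called by a config watcher whenever the hot table/partition list changes. */
    public void refresh(Set<String> newHotPartitions, boolean enabled) {
        hotPartitions.set(Set.copyOf(newHotPartitions));
        cachingEnabled = enabled;
    }

    /** Decide at read time whether a split should go through the Alluxio cache. */
    public boolean shouldCache(String tablePartition) {
        return cachingEnabled && hotPartitions.get().contains(tablePartition);
    }
}
```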
Overcoming On-Premises Challenges:
- Unreliable Reads & Network Issues: Node replacements, security patches, or hardware issues could make Alluxio reads unreliable. The solution was a fallbackable input stream: if anything goes wrong at the Alluxio worker or network layer, the read falls back to the HDFS or GCS input stream, significantly improving reliability (see the sketch after this list).
- Nondeterministic Query Termination: Presto queries can be interrupted or terminated, leading to side effects like thread leaks on the Alluxio worker. To address this, they implemented a mechanism at the worker side to proactively release state machines and resources, preventing thread leaks even if the client doesn’t send a clear cancellation signal.
- Small File Read Overwhelm: Uber handles many small Parquet files and uses column-level encryption, resulting in numerous small reads that initially overwhelmed the Alluxio workers. Uber’s solution was to add a prefetch buffer in the Alluxio client: for any read smaller than one megabyte, the client reads a full megabyte and holds it in the buffer, as sketched below. This reduced the number of requests sent to the Alluxio workers by more than 1,000 times, drastically improving performance.
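Here is a minimal sketch of the fallbackable input stream idea, assuming a hypothetical StreamOpener interface rather than the real Alluxio/HDFS client APIs: reads are served from the cache stream until an error occurs, after which the stream transparently reopens against remote storage at the current offset.

```java
import java.io.IOException;
import java.io.InputStream;

/** Sketch of a "fallbackable" input stream: cache reads with remote fallback on error. */
public class FallbackableInputStream extends InputStream {
    public interface StreamOpener {
        InputStream open(long offset) throws IOException;
    }

    private final StreamOpener remoteOpener;
    private InputStream current;
    private boolean usingCache = true;
    private long position = 0;

    public FallbackableInputStream(StreamOpener cacheOpener, StreamOpener remoteOpener)
            throws IOException {
        this.remoteOpener = remoteOpener;
        this.current = cacheOpener.open(0);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        try {
            int n = current.read(buf, off, len);
            if (n > 0) {
                position += n;
            }
            return n;
        } catch (IOException e) {
            if (!usingCache) {
                throw e; // remote storage also failed; nothing left to fall back to
            }
            // Cache node or network problem: reopen against remote storage and retry.
            usingCache = false;
            try {
                current.close();
            } catch (IOException ignored) {
            }
            current = remoteOpener.open(position);
            return read(buf, off, len);
        }
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = read(one, 0, 1);
        return n == -1 ? -1 : one[0] & 0xff;
    }

    @Override
    public void close() throws IOException {
        current.close();
    }
}
```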
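And a minimal sketch of the client-side prefetch buffer, again with a hypothetical positioned-read interface: any read smaller than 1 MB is widened to a 1 MB fetch that is buffered and used to serve subsequent small reads.

```java
import java.io.IOException;
import java.util.Arrays;

/** Sketch of a client-side prefetch buffer that widens small reads to 1 MB. */
public class PrefetchingReader {
    public interface PositionedReader {
        /** Read up to len bytes at the given offset; returns bytes read or -1 at EOF. */
        int readAt(long offset, byte[] buf, int off, int len) throws IOException;
    }

    private static final int PREFETCH_SIZE = 1 << 20; // 1 MB

    private final PositionedReader alluxioReader;
    private byte[] buffer = new byte[0];
    private long bufferStart = -1;

    public PrefetchingReader(PositionedReader alluxioReader) {
        this.alluxioReader = alluxioReader;
    }

    public int read(long offset, byte[] dst, int dstOff, int len) throws IOException {
        // Large reads bypass the buffer and go straight to the worker.
        if (len >= PREFETCH_SIZE) {
            return alluxioReader.readAt(offset, dst, dstOff, len);
        }
        // Serve from the buffer when the requested range is already prefetched.
        if (bufferStart >= 0 && offset >= bufferStart
                && offset + len <= bufferStart + buffer.length) {
            System.arraycopy(buffer, (int) (offset - bufferStart), dst, dstOff, len);
            return len;
        }
        // Otherwise fetch a full 1 MB starting at the requested offset and keep it.
        byte[] fetched = new byte[PREFETCH_SIZE];
        int n = alluxioReader.readAt(offset, fetched, 0, PREFETCH_SIZE);
        if (n <= 0) {
            return n;
        }
        buffer = Arrays.copyOf(fetched, n);
        bufferStart = offset;
        int toCopy = Math.min(len, n);
        System.arraycopy(buffer, 0, dst, dstOff, toCopy);
        return toCopy;
    }
}
```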
On-Premises Impact: Uber’s initial production cluster comprised 112 cache nodes with approximately 1 PB of capacity, serving two Presto clusters. At peak, the cache served about 200 GB/s of cache-hit traffic, with the majority of hot traffic served directly from Alluxio. This significantly reduced HDFS throughput, relieving IO saturation and providing more flexibility for decoupling the storage and compute migrations.

Phase 2: Advancing to the Cloud with Alluxio
Migrating to the cloud brought new challenges, including egress limits and GCS Class B operation costs. To tackle these, Uber introduced several key enhancements:
- Cost Reduction & Scalability through Collocation: The most significant cost-saving measure was collocating Alluxio with the Presto nodes. Since Presto typically doesn’t use its local disk for data spills, Uber could use the local disk of every Presto node for Alluxio caching. Alluxio proved efficient, needing only 6-8 CPUs per node to handle the traffic well. This strategy greatly reduced cost and allowed Uber to deploy 10 times more cache nodes in the cloud.
- GCS Connector Enhancement: Uber enhanced their file system to support hundreds of physical bucket instances, crucial for working with GCS.
- Scalable Membership Management (Addressing ETCD Load): With 10 times more Alluxio servers and clients, ETCD (used for membership management) faced much higher traffic, sometimes leading to failures, and the open-source Alluxio’s ETCD handling was on the critical path. Uber mitigated this with asynchronous, lazy membership fetching and a fail-open design: even if a fetch fails, the client keeps using the old hash ring to fetch data, making traffic to ETCD much smoother with no spikes (see the membership sketch after this list).
- Enforcing Data Freshness with Versioning: On-premises, Uber relied on metadata validation, which had potential race conditions. In the cloud, Uber enforces versioning for each cache entry by embedding the file’s last modified time (LMT) into the cache key. The Presto coordinator provides this LMT when listing files and pushes it down to the workers. If the LMT changes, a new cache key is created and the old data is passively evicted. This approach makes cache entries immutable, completely avoiding race conditions and corrupt data reads, and it significantly improves performance by skipping metadata validation (a minimal sketch of such a key follows this list).
- Boosting Cache Hit Rates with Time-Bucket Based Filters: To achieve much higher cache hit rates, Uber implemented a time-bucket based cache filter using a chain of sliding windows. The filter records read counts for each partition over specific periods (e.g., 2 or 7 days). Based on these statistics, Uber intelligently decides which partitions are most valuable to cache, since Presto reads entire partitions. This filtering mechanism boosted Uber’s cache hit rate from 40% to over 80%, and sometimes even 90% (see the sliding-window sketch below).
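A minimal sketch of asynchronous, fail-open membership fetching (the refresh interval and the ETCD fetcher are placeholders, not Alluxio’s actual implementation): membership is refreshed off the read path, and a failed fetch simply leaves the previous node list, and therefore the previous hash ring, in place.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Sketch of asynchronous, fail-open cache-node membership fetching. */
public class FailOpenMembership {
    private final AtomicReference<List<String>> cacheNodes;
    private final Supplier<List<String>> etcdFetcher;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public FailOpenMembership(Supplier<List<String>> etcdFetcher, List<String> initialNodes) {
        this.etcdFetcher = etcdFetcher;
        this.cacheNodes = new AtomicReference<>(initialNodes);
        // Refresh asynchronously so ETCD is never on the read critical path.
        scheduler.scheduleWithFixedDelay(this::refresh, 30, 30, TimeUnit.SECONDS);
    }

    private void refresh() {
        try {
            cacheNodes.set(etcdFetcher.get());
        } catch (RuntimeException e) {
            // Fail open: keep routing with the last known membership.
        }
    }

    /** Reads always use the latest successfully fetched membership. */
    public List<String> currentNodes() {
        return cacheNodes.get();
    }
}
```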
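A minimal sketch of an LMT-versioned cache key, with illustrative field names: because the last modified time is part of the key, a rewritten file produces a brand-new key, so the stale entry is never read again and simply ages out under LRU.

```java
/** Sketch of a versioned cache key that embeds the file's last modified time. */
public record VersionedCacheKey(String filePath, long lastModifiedTimeMs) {
    /** Stable string form used to address cached pages for this exact file version. */
    public String asCacheId() {
        return filePath + "@" + lastModifiedTimeMs;
    }
}
```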
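And a minimal sketch of a time-bucket based cache filter, with illustrative bucket size, window length, and admission threshold: read counts per partition are kept in a chain of daily buckets, and a partition is admitted to the cache only if it has been read often enough across the whole window.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a time-bucket based cache admission filter over sliding windows. */
public class TimeBucketCacheFilter {
    private static final long BUCKET_MILLIS = 24L * 60 * 60 * 1000; // 1-day buckets
    private static final int WINDOW_BUCKETS = 7;                    // last 7 days
    private static final long ADMIT_THRESHOLD = 100;                // reads per window

    /** One bucket of per-partition read counts. */
    private record Bucket(long startMillis, Map<String, Long> counts) {}

    private final Deque<Bucket> buckets = new ArrayDeque<>();

    /** Record a read of a partition; buckets roll forward as new reads arrive. */
    public synchronized void recordRead(String partition, long nowMillis) {
        long bucketStart = (nowMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
        if (buckets.isEmpty() || buckets.peekLast().startMillis() != bucketStart) {
            buckets.addLast(new Bucket(bucketStart, new HashMap<>()));
            while (buckets.size() > WINDOW_BUCKETS) {
                buckets.removeFirst(); // slide the window
            }
        }
        buckets.peekLast().counts().merge(partition, 1L, Long::sum);
    }

    /** Admit a partition to the cache only if it is hot over the whole window. */
    public synchronized boolean shouldCache(String partition) {
        long total = 0;
        for (Bucket b : buckets) {
            total += b.counts().getOrDefault(partition, 0L);
        }
        return total >= ADMIT_THRESHOLD;
    }
}
```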

Cloud Impact: Uber’s cloud deployment features 1,200 cache nodes with a capacity of more than 2 petabytes, serving five Presto clusters. The shared cache design contributed to higher cache hit rates. At peak, Uber’s total traffic reached 8 TB/s, with the cache handling 6 TB/s and egress traffic to GCS held to just 2 TB/s. This not only prevents egress challenges during traffic peaks but also saves significant costs by reducing both data read and metadata access charges to GCS.

The Road Ahead: Future Innovations
Uber’s work with Alluxio is continuous. They are focused on:
- Integrating with Velox: Developing a native client to ensure Velox has the same features as their Java cluster.
- Resource Usage Optimization: Further improving performance and reducing resource consumption.
- Native Server Development: Exploring a C++ or Rust server implementation to significantly reduce server-side resource usage, especially critical for their collocation strategy.
Conclusion: Uber’s implementation of Alluxio has been a game-changer, enabling them to handle petabyte-scale data with high performance and cost efficiency across both on-premises and cloud environments.