Unleashing Interactivity: Inside Meta’s Presto-Powered Data Warehouse Innovation
At this year’s PrestoCon Day, Meta presented a session sharing the latest on what they’re doing with Presto. As you probably know, Meta runs one of the largest data lakehouses in the world, and Presto is a critical piece of that data platform. It plays a central role in serving vast and diverse data analytics needs, underpinning both hyperscale batch processing and crucial interactive use cases.
As the data landscape evolves with advancements in AI, Meta is pushing the boundaries of its interactive warehouse, developing cutting-edge solutions to achieve consistent low latencies and support emerging paradigms like agentic exploration and real-time analytics.
Presto’s Pivotal Role in Meta’s Data Infrastructure
Presto is central to Meta’s data infrastructure, particularly for interactive use cases, where a human often waits for query results. This makes Presto vital for employee productivity and everyday workflows. While Presto assists Spark in batch processing, it is exclusively responsible for all interactive needs at Meta.
The scale at which Presto operates is immense:
- Tens of thousands of analytics users monthly.
- Tens of thousands of dashboards and dashboard users.
- Millions of queries daily.
- Hundreds of petabytes of data processed daily.
This massive scale presents significant challenges in maintaining interactivity.

Critical Interactive Use Cases Powered by Presto
Presto fuels a wide array of critical use cases across Meta:
- Traditional Analytics Users: Data engineers, data scientists, and software engineers leverage various products built on Presto for analytics.
- Data Applications: This includes traditional dashboarding, explorative analytics applications, and CRM solutions powering sales organizations, leading to significant savings in sales hours.
- Machine Learning Engineers: Utilize notebooks for exploratory analytics, heavily relying on Presto.
- Experimentation Platform: As a highly data-driven company, nearly every engineer at Meta uses the experimentation platform for development, which is primarily powered by Presto computations at massive scale.
- AI Recommendation Systems: Collections of products and use cases focused on ranking and recommendation systems use Presto for data preparation and analysis stages.
- Privacy Compliance Systems: Many privacy products leverage Presto to enhance queries with privacy-compliant logic, making Presto a critical dependency for ensuring data access remains private and compliant.

Innovations in Meta’s Interactive Warehouse Stack

Presto-Specific Optimizations
Significant work has been done within Presto itself to provide amazing latencies:
- Presto with Velox: Achieved 2x to 10x improvements in query latencies.
- Distributed Caching: For both metadata and data, effectively avoiding numerous I/O calls, which has been highly impactful for better latencies.
- Improved Task Scheduling: Transitioned from thread-based to event loops, removing contentions with locks and reducing CPU usage in the Presto scheduling component by almost 90%.
- Leaner and Faster Communication: Switched from JSON to Thrift for communication between the coordinator and workers.
- Multiplexing with HTTP/2: Helped reduce SSL/TLS handshake costs.
- I/O Optimization Service: Produces better data layouts. A key example is file compacting, which reduces the number of files and splits for partitions with many small files, significantly improving performance.
- Effective Result Caching: Duplicate queries are served directly from the result cache, enhancing user experience and significantly reducing computational load on Presto.
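The result-caching idea in the last bullet can be sketched in a few lines. This is a minimal illustration of serving duplicate queries from a cache keyed by normalized query text, not Meta’s actual implementation (a production cache would also account for data freshness and invalidation); the class and method names here are hypothetical:

```python
import hashlib

class ResultCache:
    """Toy result cache: duplicate queries are served without re-execution."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(sql: str) -> str:
        # Normalize whitespace and case so trivially different duplicates collide.
        normalized = " ".join(sql.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, sql):
        # Returns the cached result, or None on a miss.
        return self._store.get(self._key(sql))

    def put(self, sql, result):
        self._store[self._key(sql)] = result

cache = ResultCache()
cache.put("SELECT  count(*) FROM ads", 42)
print(cache.get("select count(*) from ads"))  # → 42, despite formatting differences
```

The key point is that a cache hit costs a hash lookup instead of a distributed query, which is why result caching both improves user experience and sheds real computational load from the cluster.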

Tackling Tail Latencies: The P99+ Challenge
Despite these extensive improvements, tail latencies (P99 and beyond) remain challenging. The distributed nature of Presto, coupled with its complex web of dependencies (Metastore, warm storage, which themselves have dependencies), leads to volatile and inconsistent tail latencies. To achieve high percentiles like P99 or P99.9 for Presto, even higher percentile guarantees are needed from dependencies that are interacted with hundreds or thousands of times per query.
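The compounding effect described above is easy to quantify. Assuming dependency calls fail (i.e., land in the slow tail) independently, a query that makes n calls stays fast only if every call does; a short calculation shows the per-dependency percentile this implies (the function name is ours, for illustration):

```python
def required_dependency_percentile(target_query_pct: float, calls_per_query: int) -> float:
    """Per-call success percentile needed so that target_query_pct of queries
    avoid any slow dependency call, assuming independent calls."""
    # If each call is fast with probability p, all n calls are fast with p**n,
    # so we need p = target ** (1/n).
    return target_query_pct ** (1.0 / calls_per_query)

p = required_dependency_percentile(0.99, 1000)
print(f"{p:.6f}")  # → 0.999990, i.e. roughly P99.999 per dependency
```

So a P99 latency goal for a query touching a dependency a thousand times demands roughly P99.999 behavior from that dependency, which is why tail latencies remain so volatile in a deeply distributed system.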
Hardware Evolution and Foundational Principles
The industry trend, especially in hardware, offers a new avenue for innovation:
- Over the last decade, hardware capabilities have dramatically shifted: network speeds increased well beyond 10 Gb/s, single-socket thread counts grew from 16–32 into the hundreds, and memory and SSDs became significantly cheaper.
- A single-socket server now boasts more CPU, network, and memory than a 10-20 node cluster did a decade ago.
Based on these advancements, Meta leverages key principles for consistent low latencies:

The Single Node Offering:
To overcome the tail latency challenge, Meta is innovating with a single node offering:
- No RPC: Eliminates network overhead by operating within a single node.
- Local Data Store: Ensures critical data locality.
- Affordable: Requires minimal capacity due to its single-node nature.
This offering is in its early stages but shows promising results for specific use cases. Key to its success are significant storage improvements:
- Custom Encodings:
- Bit Packing: Uses fewer bits to store integers, optimizing space.
- ALP (Adaptive Lossless floating-Point compression): Highly effective for floating-point data.
- Shared Dictionary: Leveraged for good compression results.
- Dramatic Storage Reduction and Decompression Speed: Benchmarks (e.g., on TPC-H SF100 with Velox) show significant reductions in storage size and substantial improvements in decompression speed. For example, Meta’s custom storage improvements brought data that general-purpose compression left at almost 8x the size of ORC-format storage down to less than 2x, while achieving excellent decompression and query performance.
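To make the bit-packing bullet above concrete, here is a toy illustration of the idea: storing small non-negative integers in just the bits they need rather than full 32-bit words. This is a simplified sketch of the general technique, not the production codec used at Meta:

```python
def bit_pack(values, bits):
    """Pack non-negative integers into bytes using `bits` bits each."""
    out, acc, nbits = bytearray(), 0, 0
    for v in values:
        acc |= v << nbits      # append value at the current bit offset
        nbits += bits
        while nbits >= 8:      # flush full bytes
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                  # flush any remaining partial byte
        out.append(acc & 0xFF)
    return bytes(out)

def bit_unpack(data, bits, count):
    """Inverse of bit_pack: recover `count` integers of `bits` bits each."""
    values, acc, nbits = [], 0, 0
    it = iter(data)
    for _ in range(count):
        while nbits < bits:    # refill the accumulator a byte at a time
            acc |= next(it) << nbits
            nbits += 8
        values.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nbits -= bits
    return values

vals = [3, 7, 1, 5, 6, 2, 0, 4]           # each fits in 3 bits
packed = bit_pack(vals, 3)
print(len(packed), "bytes instead of", len(vals) * 4)  # → 3 bytes instead of 32
assert bit_unpack(packed, 3, len(vals)) == vals
```

Eight values that would occupy 32 bytes as 32-bit integers fit in 3 bytes, which is the kind of space saving that, combined with cheap decoding, makes these custom encodings attractive for a latency-sensitive local data store.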

Real-time Analytics: The Next Frontier
Presto is also powering critical real-time ads analytics use cases at Meta. This domain presents unique challenges:
- Fast Data Ingestion: Data is generated by real-time events and must be ingested rapidly.
- Extremely High QPS: These use cases experience thousands of queries per second (QPS), with each query performing significant processing.
The strategy to address these challenges remains consistent: continuously making query execution lean and efficient from various angles.
Conclusion
Presto has made remarkable strides in enhancing interactive experiences at Meta, powering numerous critical use cases and enabling the company’s data-driven culture. The ongoing challenges presented by agentic exploration and real-time analytics continue to fuel innovation within Presto and the interactive warehouse, promising further advancements in performance and capability.