Prestissimo Extension for AI Training Data Normalization at Meta: A Deep Dive for Developers (Lightning Talk)
At PrestoCon Day 2025, Meta’s Presto team unveiled the Prestissimo extension, a powerful enhancement designed to optimize AI training data normalization. This article explores the technical underpinnings and developer-centric features of this extension, providing a comprehensive understanding of how it supports large-scale AI workloads at Meta.
Understanding AI Training Data Storage at Meta
At Meta, AI training data is modelled as relational tables within a data warehouse. Each table contains training samples, where features and labels are stored as separate columns in a denormalized format. During model training, samples are sequentially read and fed to GPUs.
However, recent shifts in machine learning have introduced new challenges:
- User sequence modelling requires handling much larger feature sets.
- Multimodal models incorporate expensive data artifacts like text, images, videos, and embeddings.
- There is a growing need for larger and more diverse datasets to improve model performance.
These trends demand a significant increase in training data scale, which exposes limitations in the denormalized storage approach, primarily due to:
- High storage costs caused by feature data duplication across samples.
- Slower ML experimentation, as adding new features row-by-row is time-consuming and storage-intensive.
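To make the duplication cost concrete, here is a small back-of-the-envelope sketch. All table layouts, counts, and sizes are invented for illustration and are not Meta's actual numbers:

```python
# Illustrative comparison of denormalized vs. normalized storage cost.
# The feature is a single 256-dim float32 embedding; all counts are made up.

EMBEDDING_BYTES = 4 * 256       # one 256-dim float32 embedding feature
NUM_SAMPLES = 1_000_000
NUM_DISTINCT_USERS = 50_000     # many samples share the same user's features

# Denormalized: the embedding is copied into every sample row.
denormalized = NUM_SAMPLES * EMBEDDING_BYTES

# Normalized: one copy per distinct user, plus an 8-byte key per sample row.
normalized = NUM_DISTINCT_USERS * EMBEDDING_BYTES + NUM_SAMPLES * 8

print(f"denormalized: {denormalized / 1e9:.2f} GB")
print(f"normalized:   {normalized / 1e9:.2f} GB")
```

Even in this toy setup, with one embedding feature, the denormalized layout stores over 15x more feature bytes; the gap widens as more samples share the same feature values.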
AI Data Storage: The Shift to Normalization
To address these challenges, Meta introduced AI data storage, which normalizes training data by moving features into their own tables, separate from the training samples. This approach offers several advantages:
- Reduced storage costs by eliminating feature duplication.
- Efficient retrieval of feature data via sequential and random reads during training.
- Support for SQL operations through integration with compute engines like Presto and Spark.
- Robust data versioning to ensure consistency between training and prediction, preventing feature leakage.
- Optimized ETL workflows and query performance enhancements tailored to AI data patterns.
This normalized storage system is foundational to Meta’s AI training infrastructure, enabling scalable and efficient data management.
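The versioning guarantee mentioned above can be illustrated with a toy sketch. In the code below, every key name, timestamp, and function is hypothetical (this is not Meta's API); the point is only that a training sample logged at time T must resolve each feature to the version that was live at T, never a later one, which is what prevents feature leakage:

```python
import bisect

# Hypothetical versioned feature store: each feature key maps to a list of
# (version_timestamp, value) entries, sorted by timestamp.
feature_versions = {
    "user:42/clicks_7d": [(100, 3), (200, 9), (300, 12)],
}

def feature_as_of(key, sample_ts):
    """Return the latest feature value whose version ts <= sample_ts."""
    versions = feature_versions[key]
    # bisect_right finds the first version strictly after sample_ts.
    idx = bisect.bisect_right(versions, (sample_ts, float("inf"))) - 1
    if idx < 0:
        return None  # no version of this feature existed yet at sample_ts
    return versions[idx][1]

print(feature_as_of("user:42/clicks_7d", 250))  # version live at ts=250 -> 9
```

A sample logged at timestamp 250 sees the value written at 200, not the one written at 300; using the 300 value at training time would leak information unavailable at prediction time.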

Prestissimo Extension: Enabling AI Data Storage in Presto
The Prestissimo extension is Meta’s solution to bridge Presto with the AI data storage system, supporting the normalized format and enabling ML engineers to efficiently access and manipulate training data.
Key Features of Prestissimo
- SST File Format Integration: AI feature data is stored in a new SST file format, with copies kept in both Hive and a key-value store. Prestissimo adds an SST reader and writer so that Presto can perform sequential scans on these tables for feature data exploration.
- Index Lookup Join for Efficient Data Reconstruction: Since training data is split into separate feature and sample datasets, Prestissimo supports joining them to reconstruct full training samples. It introduces an index lookup join mechanism that performs random reads for fast lookups:
  - Supports inner and left outer join types.
  - Handles equal join conditions as well as non-equal conditions such as between and contains.
- AI Data Storage Connector: This connector manages metadata retrieval, index resolution, and connector field pushdown, optimizing query execution.
- Index Join Optimizer: Converts eligible hash joins into index joins during query planning, replacing join nodes with index join nodes for performance gains.
- Plan Conversion Enhancements: Modifications to Presto’s plan conversion protocol enable support for new plan nodes (index join node, index source node), ensuring compatibility with the execution engine (Velox).
- Prefetch Support: To improve throughput, the index lookup join operator can prefetch results for multiple lookup requests before upstream operators request output, reducing latency.
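The index lookup join and prefetch ideas above can be sketched in miniature. In the following Python toy, IndexedFeatureStore, multi_get, and the batching scheme are all invented for illustration and bear no relation to the actual Velox/C++ operator; the sketch only shows how this join differs from a hash join: it streams probe-side sample rows and issues batched random reads against a key-indexed store, keeping unmatched rows only under a left outer join:

```python
# Hypothetical sketch of an index lookup join with batched prefetching.

class IndexedFeatureStore:
    """Stands in for the key-value copy of an SST-backed feature table."""
    def __init__(self, rows):
        self._index = dict(rows)  # feature_key -> feature payload

    def multi_get(self, keys):
        # One batched random read; a real store would issue parallel I/O.
        return {k: self._index.get(k) for k in keys}

def index_lookup_join(samples, store, join_type="inner", batch_size=2):
    """Join (sample_id, feature_key) probe rows against the store."""
    batch = []

    def flush():
        # Prefetch the whole batch of keys in one round trip.
        found = store.multi_get([key for _, key in batch])
        for sample_id, key in batch:
            value = found[key]
            if value is not None:
                yield (sample_id, key, value)
            elif join_type == "left":
                yield (sample_id, key, None)  # left outer keeps the sample
        batch.clear()

    for row in samples:
        batch.append(row)
        if len(batch) >= batch_size:
            yield from flush()
    yield from flush()

store = IndexedFeatureStore([("f1", 0.5), ("f2", 1.5)])
samples = [(1, "f1"), (2, "f3"), (3, "f2")]
print(list(index_lookup_join(samples, store, join_type="left")))
```

Because the probe side drives the join, no hash table is built over the (potentially huge) feature table, and batching the lookups amortizes random-read latency, which is the same motivation behind the operator's prefetch support.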
Why Developers Should Care?
- Enhanced Data Exploration: Machine learning engineers can now query normalized feature data seamlessly using Presto.
- Improved Storage Efficiency: Normalization drastically reduces redundant data storage, lowering costs.
- Optimized Query Performance: Index lookup joins and plan optimizations enable faster data retrieval, crucial for large-scale AI training.
- Extensibility: The connector and join interfaces allow developers to customize and extend the system for evolving AI data needs.
Summary
Meta’s Prestissimo extension represents a significant advancement in AI training data management by enabling normalized storage and efficient querying within Presto. It addresses the challenges of scaling AI training datasets while balancing storage costs and compute efficiency. For developers working on large-scale AI systems, understanding and leveraging Prestissimo’s capabilities is essential for building performant and scalable ML pipelines.