Nimble, a new file format for large datasets
In this talk we present Nimble, a new file format for large datasets, recently open-sourced by Meta. Nimble was designed to improve on the efficiency, flexibility, and extensibility of existing file formats. It outperforms existing formats such as Apache ORC and Parquet by offering better support for very wide tables, which are common in data preparation workloads for ML training. Nimble also offers more flexible and extensible encodings, and is better suited to parallel decoding with SIMD and GPUs. Our long-term goal is to migrate Meta’s data warehouse to Nimble. The session will include an overview of:
- Meta’s training data preparation workloads, why existing file formats like ORC and Parquet are not well suited to them, and the role Presto plays in them.
- Presto Native’s new integration with the Nimble file format.
- Nimble’s current status at Meta.
- Ongoing development and future work, with the aim of opening up new opportunities for collaboration on file formats for analytics.
