Bridging the Divide: Running Presto SQL on a Vector Data Lake powered by Lance


    In recent years, advances in GenAI, LLMs, computer vision, and robotics have sharply increased the demand for massive computational power and new data practices. Traditional big data infrastructure had not faced these demands before, so AI data ends up stored in separate silos and queried with separate systems, which increases cost and complexity.

    Instead, what if you could use Presto to run large-scale OLAP queries and data transforms on the same datasets used for search and retrieval, or even training? AI teams would no longer waste time and effort converting between formats, and they could write SQL rather than complex and expensive Python scripts for data transformation.

    To make this possible, we propose a vector data lake based on the Lance format and queried through Presto, a mature, high-performance distributed analytical engine that exposes a rich set of compute kernels via simple SQL queries. Lance delivers a 10x performance improvement on real-time search queries and works with Presto to support fast distributed OLAP queries. This unified approach simplifies data management, boosts performance, and significantly reduces infrastructure costs.