Data processing systems have evolved significantly over the last decade, driven by factors such as the advent of cloud computing and the increasing complexity of applications, including ML, HTAP, streaming, observability, and graph processing. Historically, however, these frameworks have evolved independently, leading to significant fragmentation of the stack. In this talk, I will discuss how the data stack has evolved in open source and at Meta, and how we are addressing this fragmentation through the Shared Foundations effort, which leads to composable systems. This effort has resulted in significantly better performance, more features, higher engineering velocity, and a more consistent user experience.
Building a lakehouse architecture on open, shared foundational technology can provide a best-of-breed user experience across the Analytics and ML domains, and potentially beyond. In this talk, Biswa will share examples drawn from the evolution of the data stack at Meta over the last few years, including: dialect unification (the Sapphire, aka Presto-on-Spark, and Xstream-IE streaming engine efforts); eval unification (using Velox as the base); eliminating the need for data duplication in interactive analytics by building smart caching (RaptorX); building a best-of-breed file format that works across Analytics and ML (Alpha); and building an open source ML data preprocessing engine (TorchArrow) that shares its core dialect and eval components with Presto.
Open source data analytics is undergoing an interesting transformation as the industry rapidly evolves around it. The accelerating migration to the cloud, the rise of immensely well-funded proprietary vendors, and the fast-evolving needs of users all contribute to this. This talk goes into detail about the trends and opportunities in the OSS data analytics space and issues a call to action on how the space can stay relevant.