Embracing Presto Open Source for Breakthrough Performance 

    At IBM, we are passionate about pushing the boundaries of data processing capabilities. To that end, we’re excited to share our latest work toward Presto 2.0, the next-generation version of Presto that includes a robust native C++ worker. This initiative not only strengthens the project’s technological foundations but also reaffirms our dedication to the open-source community, particularly within the Linux Foundation’s Presto Foundation.

    Introducing Presto 2.0 with Native C++ Worker 
    Presto, traditionally known for its high performance and flexibility as a distributed SQL query engine, is evolving. The Presto 2.0 initiative, a native C++ Presto worker introduced last December, has rallied the development community. Although the worker is already built into the current 0.286 release, the move to C++ is not just an upgrade; it’s a reimagining of Presto through the integration of Velox, an open-source C++ native acceleration library. Beyond being written in C++, Velox has a state-of-the-art vectorized execution engine as a main component. It is designed to be composable across compute engines, so it enhances Presto’s core capabilities as well as those of other engines such as Apache Spark.

    Our work on the native C++ worker is a testament to our commitment to performance and efficiency. By incorporating C++ into Presto’s architecture, we leverage the language’s efficiency and speed, which are crucial for processing large-scale data workloads. This strategic move is aligned with our goal to continuously improve the performance and scalability of Presto. 

    Performance Benchmarking the Presto Native C++ Worker 
    Our team’s relentless focus on optimization and performance enhancement has led to significant achievements in benchmarking tests. Specifically, in the TPC-DS 100TB benchmarks, an industry standard for testing the performance of data processing systems, we are extremely encouraged by the results we’re seeing today for IBM watsonx.data with Presto C++ v0.286.

    To share more details, we benchmarked IBM watsonx.data with Presto C++ v0.286 and our query optimizer. This configuration delivered better price performance than Databricks’ Photon engine, with equal query runtime at less than 60% of the cost, based on public 100TB TPC-DS query benchmarks. You can read more about the infrastructure setup in IBM’s blog on the benchmark, which was published this week.
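
    For readers who want to see how a claim like "equal runtime at less than 60% of the cost" is computed, here is a minimal sketch. It assumes the cost of a run is simply runtime multiplied by an hourly infrastructure price. The function names and every number below are hypothetical placeholders chosen for illustration; they are not our benchmark figures.

```python
def run_cost(runtime_hours: float, price_per_hour: float) -> float:
    """Cost of one benchmark run, assuming a flat hourly infrastructure price."""
    return runtime_hours * price_per_hour


def cost_ratio(cost_a: float, cost_b: float) -> float:
    """Cost of system A expressed as a fraction of the cost of system B."""
    return cost_a / cost_b


# Hypothetical inputs: both systems finish the query set in the same time,
# but system A runs on cheaper infrastructure. These are NOT measured values.
RUNTIME_HOURS = 10.0
PRICE_A_PER_HOUR = 55.0   # illustrative hourly price for system A
PRICE_B_PER_HOUR = 100.0  # illustrative hourly price for system B

cost_a = run_cost(RUNTIME_HOURS, PRICE_A_PER_HOUR)
cost_b = run_cost(RUNTIME_HOURS, PRICE_B_PER_HOUR)
ratio = cost_ratio(cost_a, cost_b)

# With equal runtime, the cost ratio reduces to the ratio of hourly prices.
print(f"System A costs {ratio:.0%} of system B for the same runtime")
```

    When runtimes are equal, as in this sketch, the comparison reduces to the ratio of the hourly prices. When they differ, runtime and price both feed into the ratio.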
     
    By combining the enhanced capabilities of the native C++ worker with a sophisticated query optimizer, Presto C++ v0.286 has demonstrated exceptional price performance. This achievement highlights not only Presto’s cost-efficiency but also its ability to handle enormous datasets.

    Contributions to the Velox Project and Community Engagement 
    IBM is not just a user of open-source technologies; we are a key contributor. Our engineers are committers to the Presto project and key contributors to the Velox project. Our contributions span significant features and improvements, such as the development of Parquet and Iceberg readers, enhanced support for various filesystems, window functions, and more. These contributions are integral to both the Velox project and Presto 2.0. You can learn more about our work with Velox in our VeloxCon 2024 session.

    In achieving these results, we included IBM enterprise-proven query compilation technology with advanced query rewrite and cost-based optimization techniques. We are exploring open sourcing that technology and advancing development of the Presto optimizer in the future. Check out our latest blog on the Presto optimizer if you want to learn more.

    Commitment to Open Source and Community Collaboration 
    Our commitment to open source is deep-rooted. We believe in the power of collaboration and the collective improvement of technologies through shared knowledge and resources. By participating in the Linux Foundation Presto project and actively contributing to the Velox and Presto communities, we aim to help shape the future of open-source data processing technologies. 

    Furthermore, our open-source initiatives extend beyond just code contributions. We engage with the community through discussions, workshops, and conferences to share knowledge, gather feedback, and continuously refine our approaches. This collaborative environment not only fuels our technological advancements but also ensures they are aligned with the community’s needs and the evolving landscape of data processing. 

    We hope to see you at our PrestoCon Day talk on Presto C++ TPC-DS updates where we will share the latest numbers for TPC-DS 1K, 10K and 100K runs. (Registration is free for this virtual community conference!) 

    Looking Forward 
    As we continue to develop and enhance Presto 2.0, our focus remains on delivering high-performance, cost-effective solutions that cater to the demanding needs of large-scale data processing. The advancements we have achieved with the native C++ worker and our significant contributions to the Velox project are just the beginning.  

    Thank you to everyone from IBM who had a hand in contributing to this benchmark, including Aditi Pandit, Deepak Majeti, Ying Su, Christian Zentgraf, Michael Osaka, Karteek Murthy, Pramod Satyanarayana, Anant Aneja, and Zac Blanco. 

    Thank you as well to everyone from the community who has helped contribute to Presto C++, including the teams from Meta, Intel, and ByteDance.

    We are excited about the future possibilities that these technologies hold, not just for IBM but for the entire open-source community. Stay tuned for what’s coming next!