Presto on Spark – Facebook – Virtual Meetup

Presto on Spark – Facebook – Virtual Meetup

At Facebook, we have spent the past several years in independently building and scaling both Presto and Spark to Facebook scale batch workloads. It is now increasingly evident that there is significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine. In this talk, we’ll take a deep dive in Presto and Spark architecture with a focus on key differentiators (e.g., disaggregated shuffle) that are required to further scale Presto.

Presto On Spark: Scaling not Failing with Spark – Ariel Weisberg, Meta & Shradha Ambekar, Intuit

Presto On Spark: Scaling not Failing with Spark – Ariel Weisberg, Meta & Shradha Ambekar, Intuit

Presto on Spark is an integration between Presto and Spark that leverages Presto’s compiler/evaluation as a library and Spark’s large scale processing capabilities. It enables a unified SQL experience between interactive and batch use cases. A unified option for batch data processing and ad hoc is very important for creating the experience of queries that scale instead of fail without requiring rewrites between different SQL dialects. In this session, we’ll talk about Presto On Spark architecture, why it matters and its implementation/usage at Intuit.

Presto on Kafka at Scale – Yang Yang & Yupeng Fu, Uber

Presto on Kafka at Scale – Yang Yang & Yupeng Fu, Uber

Presto is a popular distributed SQL query engine for running interactive analytic queries. Presto provides a Connector API that allows plugins to dozens of data sources, and thus positions itself as a single point of access to a wide variety of data. At Uber, we significantly improved Presto’s Kafka connector to meet Uber’s scale. For example, the new connector allows dynamic Kafka cluster and topic discovery so users can directly query existing Kafka topics without any registration and onboarding process; dynamic schema discovery allows fetching the latest schema without any Presto restart or deployment; smart time range suggestions to users based on Kafka metadata analysis to avoid large-range scans and thus keep the query interactive.