How Twilio Scales Presto with Odin: A New Query Gateway
One of my favorite parts of working with the Presto community is seeing how different companies push the project forward in creative ways. Recently, Aakash Pradeep from Twilio shared a great example of this with their development of Odin, a new modular query gateway they built to help scale Presto usage across their organization.
I wanted to highlight some of the key ideas from their talk because I think it’s a fantastic case study not just for running Presto at scale, but for thinking more broadly about building flexible, future-proof data platforms.
Twilio’s Scale with Presto
First, some quick context: Twilio operates at serious scale. Their Presto infrastructure today supports:
- 2.5 million queries per month, growing 37% year-over-year
- 86+ petabytes of data scanned monthly
- 5,000+ dashboards powered by Looker and Tableau
- 400-500 nodes at peak across multiple Presto clusters
- A hybrid setup including Presto for business-critical workloads and Athena for others

They also manage diverse formats like Iceberg, Delta Lake, Hudi, and Parquet, across a 12+ PB data lake.
As usage grew, Twilio started facing some challenges that many Presto users can probably relate to.
Why They Built a New Gateway
Originally, users connected directly to Presto clusters using DNS endpoints. This caused several issues:
- Tight coupling between users and clusters: any maintenance caused disruptions.
- Imbalanced load: users tended to stick with the same cluster even as others sat underutilized.
- Authentication and authorization complexity: different engines (Presto, Athena, Spark) had different auth setups.
- Compliance and governance gaps: Twilio needed finer-grained access control at the column and row level.
- Multiple user interfaces: switching between Presto, Athena, and Spark wasn’t smooth.
They considered modifying Presto directly but decided against it to avoid maintaining a large custom fork. Instead, they chose to build a modular gateway in front of their query engines: something flexible enough to evolve independently without entangling them in deep engine modifications.

Thus, Odin was born.
What Odin Does
Odin acts as a layer between users and the query engines. Users simply change the DNS endpoint they connect to; no application rewrites are needed. Odin handles:
- Authentication (initially LDAP, extensible to Okta)
- Authorization (using AWS Lake Formation today, but pluggable to other systems like Ranger)
- Load balancing across clusters
- Rate limiting to protect clusters from user refresh overloads
- Query routing based on hints and smart fallback logic
- Caching to speed up metadata fetching and authorization checks
- Query history collection across engines (Presto, Athena, and Spark soon)
It essentially abstracts away the underlying engines and makes user access safer, faster, and more governed.
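The request flow implied by that list can be pictured as a simple pipeline: authenticate, authorize, rate-limit, then route. Here is a minimal sketch of that idea in Python. This is an illustration only, not Odin's actual implementation (which is Java, built on Netflix Zuul); every class, function, and parameter name below is hypothetical.

```python
# Hypothetical sketch of a gateway-style request pipeline, loosely modeled
# on the responsibilities listed above (authn -> authz -> rate limit -> route).
import time
from collections import defaultdict, deque


class GatewayError(Exception):
    pass


class QueryGateway:
    def __init__(self, authenticate, authorize, clusters, rate_limit_per_min=60):
        self.authenticate = authenticate   # stand-in for an LDAP/Okta check
        self.authorize = authorize         # stand-in for a Lake Formation/Ranger check
        self.clusters = clusters           # cluster name -> current load (0.0-1.0)
        self.rate_limit = rate_limit_per_min
        self.requests = defaultdict(deque) # user -> timestamps in the last minute

    def _check_rate(self, user):
        # Sliding-window rate limit to protect clusters from refresh storms.
        now = time.monotonic()
        window = self.requests[user]
        while window and now - window[0] > 60:
            window.popleft()
        if len(window) >= self.rate_limit:
            raise GatewayError(f"rate limit exceeded for {user}")
        window.append(now)

    def _route(self, hint=None):
        # Honor a routing hint when the target exists; otherwise pick the
        # least-loaded cluster (a crude stand-in for smarter routing).
        if hint and hint in self.clusters:
            return hint
        return min(self.clusters, key=self.clusters.get)

    def submit(self, user, credentials, sql, hint=None):
        if not self.authenticate(user, credentials):
            raise GatewayError("authentication failed")
        if not self.authorize(user, sql):
            raise GatewayError("not authorized")
        self._check_rate(user)
        return self._route(hint)
```

Because each concern is its own small step, swapping LDAP for Okta or Lake Formation for Ranger only changes the callable passed in, which mirrors the pluggability Twilio describes.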
Some Smart Design Choices
Twilio made a few interesting technical decisions that are worth calling out:
- Stateless architecture: All routing and query metadata are persisted in DynamoDB, making Odin instances fully stateless and easy to scale horizontally.
- Two ALB Target Groups: One for POST (query submission) calls, another for GET (fetch results) calls, optimizing caching and scaling behavior separately.
- Built on Netflix Zuul: Leveraging an existing Java gateway framework made it easier to extend and scale.
- Glue Catalog adoption: They moved metadata management to AWS Glue for easier multi-engine integration and better scaling.
- Pluggable authorization and caching: Each component (authn, authz, routing, caching) is modular, so Twilio can swap pieces out as their needs evolve.
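The stateless point is worth dwelling on: because per-query state lives in DynamoDB rather than in gateway memory, a GET for results can land on a different Odin instance than the POST that submitted the query. A rough sketch of that pattern, with a plain dict standing in for DynamoDB (all names here are hypothetical, not Twilio's schema):

```python
# Sketch of externalized query state, the core of a stateless gateway tier.
import uuid


class QueryStore:
    """Stand-in for an external store like DynamoDB. Any gateway instance
    can serve the follow-up GET for a query submitted on another instance,
    because nothing about the query is held in process memory."""

    def __init__(self):
        self._table = {}  # query_id -> metadata record

    def put(self, record):
        self._table[record["query_id"]] = record

    def get(self, query_id):
        return self._table.get(query_id)


def handle_post(store, user, sql, cluster):
    # POST path: register the query and hand back an ID the client polls with.
    query_id = str(uuid.uuid4())
    store.put({"query_id": query_id, "user": user, "sql": sql,
               "cluster": cluster, "state": "QUEUED"})
    return query_id


def handle_get(store, query_id):
    # GET path: any instance can look up where the query is running.
    record = store.get(query_id)
    if record is None:
        raise KeyError(query_id)
    return record
```

This separation is also what makes the two-ALB-target-group split workable: POST and GET traffic can scale independently because neither path pins a client to a particular instance.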
They even developed a smart fallback routing mechanism: if a cluster is overloaded, Odin dynamically routes queries to alternate clusters or engines based on use-case priorities.
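One way to picture that fallback logic is priority-ordered selection with a load cutoff. The sketch below is a guess at the shape of such a policy, not Twilio's actual algorithm; the threshold and names are invented for illustration.

```python
# Hypothetical fallback routing: try targets in priority order for the use
# case, skipping any whose reported load is too high.
def pick_cluster(clusters, priorities, load_threshold=0.8):
    """`clusters` maps cluster name -> load (0.0-1.0); `priorities` is the
    ordered list of preferred targets for this use case."""
    for name in priorities:
        load = clusters.get(name)
        if load is not None and load < load_threshold:
            return name
    # Every preferred target is overloaded: degrade gracefully to the
    # least-loaded cluster overall rather than failing the query.
    return min(clusters, key=clusters.get)
```

For example, if the primary cluster for a dashboard workload is saturated, the query silently lands on the secondary, and the user never has to know which cluster served it.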

Challenges They Ran Into
It wasn’t all easy. Some of the challenges Twilio encountered:
- Query parsing complexity across Presto, Athena, and Spark
- Athena protocol differences requiring special translation
- Older ODBC connectors (especially Tableau) being rigid and slower
- Cross-region metadata fetches adding unexpected latency
- Fine-grained authorization at scale needing heavy caching to keep query latency low
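The last point, keeping fine-grained authorization off the hot path, usually comes down to caching decisions for a short window instead of calling the policy service per query. A minimal TTL-cache sketch of that idea (the check function, key shape, and TTL are hypothetical stand-ins, not Twilio's design):

```python
# Sketch of a TTL cache for authorization decisions, so repeated queries by
# the same user against the same resource skip the remote policy lookup.
import time


class AuthzCache:
    def __init__(self, check_fn, ttl_seconds=300):
        self.check_fn = check_fn  # stand-in for a Lake Formation/Ranger call
        self.ttl = ttl_seconds
        self._cache = {}          # (user, resource) -> (decision, expiry)

    def is_allowed(self, user, resource):
        key = (user, resource)
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]         # cached decision, no remote call
        decision = self.check_fn(user, resource)
        self._cache[key] = (decision, now + self.ttl)
        return decision
```

The trade-off is standard: a shorter TTL means permission revocations take effect faster, while a longer TTL shaves more latency off each query.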
Despite those hurdles, the system today handles over 100,000 jobs per day, with Odin adding just 6-10 ms of overhead per query, which is pretty incredible.
What’s Next for Odin
Twilio shared a few areas they’re excited to keep improving:
- Adding Spark SQL support natively through Odin
- Smarter, AI-driven query routing based on query patterns
- Moving toward column-level and row-level governance enforcement
- Exploring open-sourcing Odin for the broader community
(And yes, they are seriously considering open-sourcing it. If you’d be interested, they encouraged the community to speak up.)
I love seeing stories like this because they highlight not just how powerful Presto is, but how flexible it can be as part of larger, evolving data platforms.
Big thanks to Aakash from the Twilio team for sharing this deep dive and for pushing the Presto ecosystem forward in such an inspiring way. You can check out the full session recording on our YouTube channel.
If you want to hear more stories like this, or share your own, join us at PrestoCon Day on June 11. It’s virtual and free to attend!
Register here for PrestoCon Day → https://prestodb.io/prestocon-day/