PrestoCon Day 2024: Leveraging Presto for Scalable Data Analytics at Apna
At PrestoCon Day 2024 we had a session from Apna, the largest and fastest-growing professional opportunity platform in India. Apna connects job seekers with recruiters, enabling them to discover and apply for relevant jobs and to access community groups.
This is the second blog in our PrestoCon Day 2024 recap series, where I’ll do my best to share the key highlights from this session!
Building a Robust Data Platform
Apna’s data platform supports 90+ internal users who run 60K+ queries per day over 30TB+ of data. They chose Presto as the engine for this, with Apache Hudi as the table format. Together, a cluster of 85+ Presto nodes queries 500+ Hudi tables.

The Apna team needs to make data from application databases available in their data platform. To do that, they use Change Data Capture (CDC) Kafka connectors to publish change events into Kafka. They also publish stream events into Kafka for analyzing user behavior. This architecture ensures real-time data availability and supports robust data analysis workflows.
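For a concrete picture of the CDC piece, here is a minimal sketch of a Kafka Connect source connector definition. The session did not name the specific connector or source database, so Debezium with PostgreSQL, along with every hostname, table, and topic below, is purely an assumption for illustration:

```properties
# Hypothetical Debezium PostgreSQL source connector (Kafka Connect
# standalone .properties format). None of these names come from the
# session; they only illustrate the CDC-into-Kafka pattern.
name=apna-app-db-cdc
connector.class=io.debezium.connector.postgresql.PostgresConnector
plugin.name=pgoutput
database.hostname=app-db.internal
database.port=5432
database.user=cdc_user
database.password=<redacted>
database.dbname=app
# Prefix for the Kafka topics that receive change events
topic.prefix=appdb
# Capture only the tables the analytics platform needs
table.include.list=public.jobs,public.applications
```

Each committed row change then lands on a topic such as appdb.public.jobs, ready to be consumed downstream.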
Looking at the actual deployment of their data platform, Apna moves data between stages using Apache Hudi as the table format (leveraging Onehouse to do so). This approach helps manage and optimize the lifecycle of data, from raw ingestion to refined insights. They have deployed the Apache Hive Metastore as their catalog and Presto as their query engine, giving them efficient data management and querying capabilities.
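As a rough sketch of how these pieces connect on the Presto side, a Hudi catalog is typically just a properties file pointing the connector at the Hive Metastore; the metastore URI below is a placeholder, not Apna’s:

```properties
# Minimal Presto catalog file (e.g. etc/catalog/hudi.properties)
# wiring the Hudi connector to the Hive Metastore. The metastore
# URI is an assumed placeholder.
connector.name=hudi
hive.metastore.uri=thrift://hive-metastore.data.svc:9083
```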

Deploying on Kubernetes
Apna believes in a container-native architecture, making Kubernetes the obvious choice for deployment. They use the standard Presto Helm chart, customized to connect all BI and ETL tools within their Virtual Private Cloud (VPC), with Grafana and Prometheus for monitoring. The infrastructure runs Presto workers on spot nodes, which are approximately four times cheaper than regular instances. After benchmarking multiple machine types, they settled on n2d-highcpu instances with 64 GB of RAM for their workload.
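As an illustration of the spot-node setup, on GKE (a reasonable guess given the n2d machine type) spot capacity is selected with a node selector and matching toleration; the exact Helm values Apna uses were not shared in the talk:

```yaml
# Illustrative scheduling fragment for Presto worker pods on GKE
# spot VMs. The label and taint are GKE's standard ones; whether
# Apna's Helm chart exposes them exactly this way is an assumption.
nodeSelector:
  cloud.google.com/gke-spot: "true"
tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
```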
For autoscaling, Apna uses custom metrics based on memory and CPU utilization, leveraging Kubernetes operators to scale workers gracefully. They also use scheduled batch scaling with Jenkins to pre-warm the cluster at specific times of day.
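A plain HorizontalPodAutoscaler conveys the shape of the metric-driven part, though Apna’s actual setup uses custom metrics and operators; the resource names, replica bounds, and thresholds here are all assumptions:

```yaml
# Minimal HPA sketch scaling Presto workers on CPU and memory
# utilization; the numbers are placeholders, not Apna's tuning.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: presto-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: presto-worker
  minReplicas: 10
  maxReplicas: 85
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```

The scheduled pre-warming can then be as simple as a Jenkins job that runs kubectl scale on the worker deployment shortly before known peak hours.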
To handle memory-intensive queries, Apna enabled Presto’s spill-to-disk feature, utilizing SSDs attached to each worker (which also serve as cache storage). This capability helps manage large data volumes efficiently while maintaining a balance between cost and performance.
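In Presto, spilling is switched on through the workers’ config.properties; the property names below are the documented spill options, while the path and size limits are placeholders rather than Apna’s values:

```properties
# Illustrative spill configuration for Presto workers.
experimental.spill-enabled=true
# Spill onto the SSD attached to each worker (path is an assumption)
experimental.spiller-spill-path=/mnt/ssd/presto-spill
# Cap total and per-query spill so a runaway query can't fill the disk
experimental.max-spill-per-node=100GB
experimental.query-max-spill-per-node=20GB
```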

Key Learnings Along the Way: Query Performance Optimization
Tweaking Parameters
To improve query performance, Apna adjusted several parameters. They increased the Hive split loader concurrency from its default of 4 to 64, significantly reducing query time. By tweaking the sink max buffer size, exchange max buffer size, and exchange client threads, they resolved delays in data exchange between Presto worker nodes and underutilization of CPU resources.
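For reference, these knobs map to the following Presto properties; split loader concurrency lives in the Hive catalog properties, the rest in config.properties. Only the 4-to-64 change was stated in the talk, so the other values are illustrative:

```properties
# Hive catalog properties (e.g. etc/catalog/hive.properties):
# raise split loading parallelism from the default of 4
hive.split-loader-concurrency=64

# config.properties: widen worker-to-worker data exchange
# (values below are placeholders, not Apna's settings)
sink.max-buffer-size=128MB
exchange.max-buffer-size=128MB
exchange.client-threads=50
```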
Leveraging Cache and Spilling
For queries with multiple joins and aggregations, Apna uses Alluxio SDK caching to avoid fetching the same data multiple times. For memory-intensive queries, enabling the experimental spilling feature and switching spill storage from HDDs to SSDs helped maintain performance while managing resources effectively.
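Here is a sketch of what enabling the Alluxio SDK (worker-local) cache looks like in the Hive catalog properties, following Presto’s documented cache settings; the directory and size are placeholders:

```properties
# Illustrative Alluxio SDK caching in etc/catalog/hive.properties.
cache.enabled=true
cache.type=ALLUXIO
# Cache onto the worker-local SSD (path and size are assumptions)
cache.base-directory=file:///mnt/ssd/alluxio-cache
cache.alluxio.max-cache-size=400GB
# Route splits back to the worker that likely cached them already
hive.node-selection-strategy=SOFT_AFFINITY
```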
Addressing Metadata Size Issues
One significant challenge was the large metadata size of Hudi tables, which hurt query performance. By deleting old savepoints and reducing the number of recent commits retained during cleaning, they decreased metadata size, lowering both planning time and overall query execution time.
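The levers involved look roughly like the following Hudi writer settings plus a savepoint cleanup from hudi-cli; the retention numbers are illustrative, not the values Apna landed on:

```properties
# Hypothetical Hudi cleaner/archival settings that bound how much
# commit history (and thus metadata) a query engine must read.
hoodie.cleaner.commits.retained=6
hoodie.keep.min.commits=10
hoodie.keep.max.commits=15
```

Old savepoints pin the file versions they reference, so dropping them (for example with hudi-cli’s savepoint delete command) lets the cleaner reclaim that history as well.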
Common Table Expressions (CTE) Materialization
Enabling CTE materialization, a recently released feature, has significantly optimized their queries. It ensures that a CTE referenced multiple times in a query is materialized only once, reducing overall query cost and improving performance.
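In recent Presto releases this is exposed as a session property; below is a minimal sketch, noting that the strategy value shown is one of several options, full setup also requires some cluster-side configuration, and the table and columns are invented for illustration:

```sql
-- Ask the planner to materialize repeated CTEs (illustrative value).
SET SESSION cte_materialization_strategy = 'ALL';

-- The CTE below is referenced twice but computed only once:
WITH applies AS (
    SELECT user_id, city FROM job_events WHERE action = 'apply'
)
SELECT city, count(*) AS applies_per_city FROM applies GROUP BY city
UNION ALL
SELECT 'all cities', count(*) FROM applies;
```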
Future Roadmap
Integration with Velox
Apna is excited about integrating Velox, the open-source unified execution engine developed by Meta, into their stack. Velox promises significant performance gains, and they look forward to the faster query processing it can unlock.
Disaggregated Coordinator in Kubernetes
One challenge they faced was implementing a disaggregated coordinator in Kubernetes. Despite attempts at high-availability configurations, they encountered improper communication between coordinators and instability in the user interface. Their goal is to deploy multiple coordinators to handle peak load periods efficiently, and they are committed to finding a more robust solution in the future.
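For context, the disaggregated-coordinator topology Presto documents pairs a resource manager with multiple coordinators, roughly as sketched below; hosts and ports are placeholders, and this is the shape Apna was attempting rather than a configuration they have working on Kubernetes:

```properties
# Resource manager node (tracks cluster-wide state for coordinators)
resource-manager=true
resource-manager-enabled=true
coordinator=false

# Each coordinator node (separate config.properties file)
coordinator=true
resource-manager-enabled=true
discovery.uri=http://resource-manager.presto.svc:8080
```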
Open-Source Contributions
Apna plans to contribute several enhancements to the Presto open-source community (woohoo!). These include:
- Google Sheets Connector: Extended functionality to include write support, along with optimized data retrieval from BigQuery by applying filter conditions directly within the Storage API.
- Hudi Connector Enhancements: Leveraging advancements in the Hudi connector, such as multi-modal indexing, to improve query performance and efficiency.
- Monitoring and Analytics: Enhancing monitoring capabilities to provide a comprehensive overview of the entire Presto cluster, identifying the most queried and underutilized tables, and optimizing resource utilization.
Final Thoughts
The session provided deep insights into Apna’s journey with Presto, showcasing their technical strategies, challenges, and solutions for building a robust and scalable data architecture. By leveraging advanced features and optimizing performance, Apna has successfully created an efficient data analytics platform, paving the way for future innovations and contributions to the open-source community.
Here’s the full session from PrestoCon Day 2024: