Optimizing Our Data Lakehouse with Presto: A Strategic Transformation Project

This is a guest post from Yahya Elhag Elemam, Director of Big Data and Analytics at Zain Sudan.

Our data journey – from Cloudera to Data Lakehouse

Zain Sudan is a mobile phone operator in Sudan and part of the Zain Group. When it started in 1996, it was the first mobile phone operator in the country. Today we serve over 11M subscribers, and our data platform supports a number of internal analysts in their everyday business use cases.

For several years, we operated our data lakehouse using Cloudera, which provided a stable foundation but presented challenges in data operations and lacked support for real-time analytics.

Our technology stack included HDFS, Hive, HBase, Impala, Oozie, and Hue, with Tableau serving as our primary visualization tool. Initially, we connected Tableau directly to Impala, but this setup caused significant data synchronization issues: it required manual metadata refreshes and delivered suboptimal performance.

To address these challenges, we introduced an RDBMS (PostgreSQL) to store summary data, which improved data consistency and performance. However, maintaining synchronization between PostgreSQL and the data lake added operational complexity.

Strategic Vision: EDT (Easy Operation, Data Accuracy and TCO Reduction)

Following an extensive evaluation of our technology stack and total platform cost, we conducted thorough research and testing of alternative solutions. This led us to define a strategic vision for our analytics platform, centered on three key objectives:

Easy Operation: Streamlining data management and reducing operational overhead.
Data Accuracy: Ensuring data integrity and synchronization across all systems.
TCO Reduction: Optimizing costs while maintaining high performance and scalability.

Re-Architecting Our Data Lakehouse

After conducting multiple proof-of-concept (PoC) evaluations, we adopted the following architecture, which is built on open-source components:

ETL Processing: Apache NiFi
Data Storage: Apache Hadoop
Table Format: Apache Iceberg
Data Transformation: Apache Spark
Orchestration & Scheduling: Apache Airflow
Analytics Query Engine: Presto, serving all visualization and querying tools
BI dashboards & BI self-service: Apache Superset
Data exploration: Hue
Central analytics notebook: JupyterLab
Interactive data science workflows: Apache Zeppelin
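
To make the query-engine layer in the list above more concrete, here is a minimal sketch of how a tool or notebook might query an Iceberg table through Presto from Python. It is an illustration only, not our production code: the host name, user, catalog, schema, table, and column names are all hypothetical placeholders, and the connection step assumes the open-source presto-python-client package (imported as prestodb) is installed.

```python
"""Sketch: a daily-revenue query against an Iceberg table through Presto.

Every name below (host, user, catalog, schema, table, columns) is a
hypothetical placeholder chosen for illustration.
"""
from datetime import date


def build_daily_revenue_query(table: str, day: date) -> str:
    """Return Presto SQL that sums revenue per product for a single day."""
    # Identifiers cannot be bound as query parameters, so check the
    # catalog.schema.table name ourselves before placing it in the SQL text.
    parts = table.split(".")
    if len(parts) != 3 or not all(p.replace("_", "a").isalnum() for p in parts):
        raise ValueError(f"expected catalog.schema.table, got {table!r}")
    return (
        "SELECT product, SUM(amount) AS revenue "
        f"FROM {table} "
        f"WHERE event_date = DATE '{day.isoformat()}' "
        "GROUP BY product "
        "ORDER BY revenue DESC"
    )


def run_query(sql: str, host: str = "presto.example.internal", port: int = 8080):
    """Run sql on a Presto coordinator and return all rows."""
    # Imported lazily so the query builder above works without the client.
    import prestodb

    conn = prestodb.dbapi.connect(
        host=host,
        port=port,
        user="analyst",       # hypothetical user
        catalog="iceberg",    # hypothetical catalog name
        schema="analytics",   # hypothetical schema name
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()
```

Calling build_daily_revenue_query("iceberg.analytics.revenue_events", date(2024, 1, 31)) returns the SQL text, which can then be passed to run_query against a reachable coordinator. Because BI tools, notebooks, and ad hoc scripts would all go through the same engine in this way, they would see the same tables without a separate summary database to keep in sync.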

Key Achievements with This Architecture

By implementing this architecture, we successfully:

Established a unified data lakehouse solution
Enabled BI self-service analytics through Presto
Delivered real-time use cases, such as real-time revenue reporting and real-time product monitoring
Moved to a platform with zero license costs

Presto’s Impact on Our Analytics Platform

Presto was deployed with seven worker nodes and two coordinators, providing a stable and high-performance environment.

Most queries now complete within one second, a significant performance improvement.

Without Presto, direct connectivity between our visualization tools and the data lakehouse would have been impossible. Presto has been instrumental in elevating our analytics platform to the next level, providing a scalable and efficient solution for our growing data needs.