Optimizing Our Data Lakehouse with Presto: A Strategic Transformation Project

    This is a guest post from Yahya Elhag Elemam, Director of Big Data and Analytics at Zain Sudan

    Our data journey – from Cloudera to Data Lakehouse

    Zain Sudan is a mobile phone operator in Sudan and part of the Zain Group. It was the first mobile phone operator in the country when it started in 1996. Today, we serve over 11M subscribers and support a number of internal analysts in their everyday business use cases.

    For several years, we operated our data lakehouse using Cloudera, which provided a stable foundation but presented challenges in data operations and lacked support for real-time analytics.

    Our technology stack included HDFS, Hive, HBase, Impala, Oozie, and Hue, with Tableau serving as our primary visualization tool. Initially, we connected Tableau to Impala, but this setup led to significant issues related to data synchronization, requiring manual metadata refreshes and delivering suboptimal performance.

    To address these challenges, we introduced an RDBMS (PostgreSQL) to store summary data, which improved data consistency and performance. However, maintaining synchronization between PostgreSQL and the data lake added operational complexity.

    Strategic Vision: EDT

    (Easy Operation, Data Accuracy and TCO Reduction)

    Following an extensive evaluation of our technology stack and total platform cost, we conducted thorough research and testing of alternative solutions. This led us to define a strategic vision for our analytics platform, centered on three key objectives:

    • Easy Operation: Streamlining data management and reducing operational overhead.
    • Data Accuracy: Ensuring data integrity and synchronization across all systems.
    • TCO Reduction: Optimizing costs while maintaining high performance and scalability.

    Re-Architecting Our Data Lakehouse

    After conducting multiple proof-of-concept (PoC) evaluations, we adopted the following architecture, which is built on open-source components:

    • ETL Processing: Apache NiFi
    • Data Storage: Apache Hadoop
    • Table Format: Apache Iceberg
    • Data Transformation: Apache Spark
    • Orchestration & Scheduling: Apache Airflow
    • Analytics Query Engine: Presto to support all Visualization & Querying Tools
    • BI dashboards & BI self-service: Apache Superset
    • Data exploration: Hue
    • Central analytics notebook: JupyterLab
    • Interactive data science workflows: Apache Zeppelin
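To make the link between the query engine and the table format concrete, here is a minimal sketch of how a Presto catalog file for Iceberg tables could be generated. It assumes a Hive Metastore-backed Iceberg catalog, which the post does not confirm; the metastore hostname and the helper name `render_iceberg_catalog` are hypothetical.

```python
# Sketch: render the text of a Presto catalog file for Iceberg tables.
# Assumption (not stated in the post): Iceberg metadata is tracked in a
# Hive Metastore. The URI used below is a made-up placeholder.

def render_iceberg_catalog(metastore_uri: str) -> str:
    """Return the contents of an etc/catalog/iceberg.properties file."""
    props = {
        "connector.name": "iceberg",         # use Presto's Iceberg connector
        "iceberg.catalog.type": "hive",      # assumed catalog type
        "hive.metastore.uri": metastore_uri, # where table metadata lives
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print(render_iceberg_catalog("thrift://metastore.example.internal:9083"))
```

With a catalog like this in place, every tool that speaks to Presto would see the same Iceberg tables under one catalog name, which is what lets a single engine serve all visualization and querying tools.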

    Key Achievements with This Architecture

    By implementing this architecture, we successfully:

    • Established a unified data lakehouse solution
    • Enabled BI self-service analytics through Presto
    • Delivered real-time use cases, such as real-time revenue reporting and real-time product monitoring
    • Eliminated license costs, resulting in a zero-license platform
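As an illustration of what a revenue-reporting query through Presto might look like, the sketch below builds an hourly revenue query and runs it with the `presto-python-client` package (`prestodb`). The table name `iceberg.sales.recharge_events`, its columns `event_time` and `amount`, and the connection defaults are all hypothetical; our actual schema is not shown here.

```python
from datetime import date, datetime, time, timedelta


def hourly_revenue_sql(day: date, table: str = "iceberg.sales.recharge_events") -> str:
    """Build a Presto SQL query that sums revenue per hour for one day.

    The table and column names are illustrative assumptions.
    """
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    return (
        "SELECT date_trunc('hour', event_time) AS hour, "
        "sum(amount) AS revenue "
        f"FROM {table} "
        f"WHERE event_time >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND event_time < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}' "
        "GROUP BY 1 ORDER BY 1"
    )


def run_query(sql: str, host: str, port: int = 8080, user: str = "analyst"):
    """Execute a query on a Presto coordinator and return all rows."""
    import prestodb  # provided by the presto-python-client package

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user, catalog="iceberg", schema="sales"
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


if __name__ == "__main__":
    print(hourly_revenue_sql(date.today()))
```

The half-open time range (`>=` start, `<` end) keeps each event in exactly one day, so a dashboard that refreshes this query repeatedly never double counts rows at midnight.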

    Presto’s Impact on Our Analytics Platform

    Presto was deployed with seven worker nodes and two coordinators, providing a stable and high-performance environment.
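For readers unfamiliar with Presto's node roles, the following sketch renders a minimal `config.properties` for each node in a cluster of this shape (two coordinators, seven workers). Hostnames, the port, and the discovery URI are hypothetical, and the sketch deliberately leaves out whatever settings govern how the two coordinators work together, since that depends on the deployment mode.

```python
# Sketch: generate minimal Presto config.properties text per node role.
# Hostnames and the discovery URI are placeholders, not real endpoints.

def render_node_config(role: str, discovery_uri: str, http_port: int = 8080) -> str:
    """Return config.properties content for a coordinator or a worker."""
    if role not in ("coordinator", "worker"):
        raise ValueError(f"unknown role: {role}")
    is_coordinator = role == "coordinator"
    lines = [f"coordinator={'true' if is_coordinator else 'false'}"]
    if is_coordinator:
        # Keep query execution on the workers only.
        lines.append("node-scheduler.include-coordinator=false")
    lines.append(f"http-server.http.port={http_port}")
    lines.append(f"discovery.uri={discovery_uri}")
    return "\n".join(lines) + "\n"


def plan_cluster(
    n_coordinators: int = 2,
    n_workers: int = 7,
    discovery_uri: str = "http://presto-discovery.example.internal:8080",
) -> dict:
    """Map a hypothetical node name to its rendered configuration."""
    plan = {}
    for i in range(1, n_coordinators + 1):
        plan[f"coordinator-{i}"] = render_node_config("coordinator", discovery_uri)
    for i in range(1, n_workers + 1):
        plan[f"worker-{i}"] = render_node_config("worker", discovery_uri)
    return plan


if __name__ == "__main__":
    for name, config in plan_cluster().items():
        print(f"# {name}\n{config}")
```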

    The majority of queries now execute within one second, a significant performance improvement.

    Without Presto, direct connectivity between our visualization tools and the data lakehouse would have been impossible. Presto has been instrumental in elevating our analytics platform to the next level, providing a scalable and efficient solution for our growing data needs.