Build Your Open Data Lakehouse: A Step-by-Step ETL Guide with MySQL, OLake, and PrestoDB

    This tutorial is a comprehensive guide to building an Open Data Lakehouse, a modern and flexible data architecture, from scratch. Open Data Lakehouses offer a powerful, scalable way to store, manage, and query both structured and semi-structured data, leveraging a suite of robust open-source tools for greater control and flexibility.

    Pre-Requisites:

    Before commencing the setup of your Open Data Lakehouse, ensure the following prerequisites are met:

    • Docker Installation: Docker must be installed and running, as all system components are deployed as containers.
    • Familiarity with SQL and Command-Line: A foundational understanding of SQL and command-line operations is necessary for executing the setup steps and interacting with the database components.
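    Before proceeding, you can confirm that Docker and the Compose v2 plugin are available; a quick check might look like this:

```shell
# Verify Docker and the Compose v2 plugin are installed
docker --version
docker compose version

# Confirm the Docker daemon is actually running
docker info > /dev/null && echo "Docker daemon is up"
```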

    Key Components of the Open Data Lakehouse Architecture

    The Open Data Lakehouse architecture described herein integrates several key open-source components, each serving a specific role:

    MySQL: This functions as the source transactional database for data ingestion.

    OLake: An open-source, lightweight data ingestion tool that replicates data from source databases into the Apache Iceberg table format.

    MinIO: Provides S3-compatible object storage, which is utilized for housing both raw and processed data within the lakehouse.

    REST Catalog: Serves as a REST-based manager for metadata and table schemas.

    PrestoDB (Presto): An open-source distributed SQL query engine that enables efficient data querying across the lakehouse.

    Step-by-Step Implementation Guide

    Step 1: Setting Up the OLake Infrastructure

    To establish the OLake infrastructure, proceed with the following steps:

    1. Directory Creation: Create a primary directory, designated Olake, on your Desktop or at another suitable location.

    2. Subdirectory Creation: Within the Olake directory, create a subdirectory named Olake-data. This subdirectory is crucial for local storage of OLake data.

    3. Repository Cloning: From within the Olake directory, execute the following commands to clone the OLake UI repository:

    git clone https://github.com/datazip-inc/olake-ui.git
    cd olake-ui

    4. Upon completion, the directory structure should reflect the specified organization.
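    The directory setup in steps 1 through 3 can be sketched as follows (the Desktop path is an assumption; substitute your chosen location):

```shell
# Create the main OLake directory and the local data subdirectory
mkdir -p ~/Desktop/Olake/Olake-data

# Clone the OLake UI repository alongside it
cd ~/Desktop/Olake
git clone https://github.com/datazip-inc/olake-ui.git
cd olake-ui
```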

    5. docker-compose.yml Configuration: Modify the docker-compose.yml file to include the path to the Olake-data directory. Optionally, default username and password credentials can be updated within this file.

    6. Export the current directory from your terminal so Docker Compose can resolve the host persistence path:

    export PWD=$(pwd)

    7. Container Deployment: Start the OLake containers by executing the following command:

    docker compose up -d

    8. Upon successful deployment, the OLake UI will be accessible via a web browser at localhost:8000. The default login credentials are Username: admin and Password: password.
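    One way to confirm the deployment, assuming the compose project exposes the UI on port 8000 as described above:

```shell
# List the containers started by this compose project
docker compose ps

# The UI should answer on port 8000 (HTTP 200 once startup has finished)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000
```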

    Step 2: Infrastructure Setup for MySQL, MinIO, REST Catalog, and PrestoDB

    This step involves configuring the core data lakehouse components:

    1. Main Directory Creation: Create a directory named my-lakehouse on your Desktop or an alternative desired location.

    2. docker-compose.yml Configuration: Within the my-lakehouse directory, create a docker-compose.yml file with the provided content. This file orchestrates the simultaneous setup of all services including MySQL, MinIO, Iceberg REST Catalog, and PrestoDB.

    version: '3.8'
    
    services:
      # --- 1. MySQL (Source Database) ---
      mysql:
        image: mysql:8.0
        container_name: mysql-source
        ports:
          - "3306:3306"
        environment:
          MYSQL_ROOT_PASSWORD: root_password
          MYSQL_DATABASE: retail_db
          MYSQL_USER: user
          MYSQL_PASSWORD: password
        volumes:
          - ./mysql_data:/var/lib/mysql
          - ./init_mysql.sql:/docker-entrypoint-initdb.d/init_mysql.sql
        networks:
          - lakehouse-net
    
      # --- 2. MinIO (Object Storage - S3 Compatible) ---
      minio:
        image: minio/minio
        container_name: minio-storage
        ports:
          - "9000:9000" # MinIO API port
          - "9001:9001" # MinIO Console port
        environment:
          MINIO_ROOT_USER: minioadmin
          MINIO_ROOT_PASSWORD: minioadminpassword
          MINIO_DOMAIN: minio
        command: server /data --console-address ":9001"
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
          interval: 30s
          timeout: 20s
          retries: 3
        volumes:
          - ./minio_data:/data
        networks:
          - lakehouse-net
    
      # --- 3. Iceberg REST Catalog (tabulario/iceberg-rest) ---
      iceberg-rest:
        image: tabulario/iceberg-rest:0.6.0
        container_name: iceberg-rest
        depends_on:
          - minio
        ports:
          - "8181:8181"
        environment:
          AWS_REGION: us-east-1
          AWS_DEFAULT_REGION: us-east-1
          CATALOG_WAREHOUSE: s3://mylakehousedata
          CATALOG_IO__IMPL: org.apache.iceberg.aws.s3.S3FileIO
          CATALOG_S3_ENDPOINT: http://minio:9000
          AWS_ACCESS_KEY_ID: minioadmin
          AWS_SECRET_ACCESS_KEY: minioadminpassword
          CATALOG_S3_PATH__STYLE__ACCESS: "true" # double underscores map to s3.path-style-access
        networks:
          - lakehouse-net
    
      # --- 4. PrestoDB Query Engine ---
      presto:
        image: prestodb/presto:latest
        container_name: presto-query-engine
        ports:
          - "8089:8080"
        volumes:
          - ./presto_config/catalog/iceberg.properties:/opt/presto-server/etc/catalog/iceberg.properties
        environment:
          # MinIO/S3 configuration for Presto
          AWS_ACCESS_KEY_ID: minioadmin
          AWS_SECRET_ACCESS_KEY: minioadminpassword
          AWS_ENDPOINT: http://minio:9000
          AWS_FORCE_PATH_STYLE: "true"
        depends_on:
          - minio
          - iceberg-rest
        networks:
          - lakehouse-net
    
    networks:
      lakehouse-net:
        driver: bridge
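    Before bringing the stack up, the compose file can be checked for syntax and interpolation errors from the my-lakehouse directory:

```shell
# Validate docker-compose.yml without starting anything
docker compose config --quiet && echo "docker-compose.yml is valid"
```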

    3. MySQL Initialization File: Create an init_mysql.sql file within the my-lakehouse directory. This script will automatically create the retail_db database and a products table, populating it with sample data. Alternatively, the table can be created manually in the MySQL database.

    CREATE DATABASE IF NOT EXISTS retail_db;
    USE retail_db;
    
    CREATE TABLE IF NOT EXISTS products (
        product_id INT AUTO_INCREMENT PRIMARY KEY,
        product_name VARCHAR(255) NOT NULL,
        category VARCHAR(100),
        price DECIMAL(10, 2),
        stock_quantity INT,
        last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    );
    
    INSERT INTO products (product_name, category, price, stock_quantity) VALUES
    ('Laptop Pro', 'Electronics', 1200.00, 50),
    ('Mechanical Keyboard', 'Electronics', 95.50, 150),
    ('Wireless Mouse', 'Electronics', 25.00, 200),
    ('Ergonomic Chair', 'Office Furniture', 350.00, 30),
    ('Desk Lamp LED', 'Office Furniture', 45.00, 80),
    ('Coffee Maker 12-cup', 'Home Appliances', 75.00, 100),
    ('Blender High-Speed', 'Home Appliances', 150.00, 70),
    ('Smartwatch X', 'Wearables', 299.99, 120),
    ('Fitness Tracker Pro', 'Wearables', 89.99, 180),
    ('Noise Cancelling Headphones', 'Audio', 199.00, 90);
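    Once the stack is running, the seeded table can be checked from the host, using the container name and credentials defined in the compose file above:

```shell
# Count the sample rows inside the mysql-source container (10 rows are inserted above)
docker exec mysql-source \
  mysql -uuser -ppassword retail_db -e "SELECT COUNT(*) AS row_count FROM products;"
```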

    4. PrestoDB Configuration:

    • Create the nested directory presto_config/catalog inside my-lakehouse.
    • Within the catalog directory, create a file named iceberg.properties with the following configuration, which allows PrestoDB to connect to the Iceberg REST Catalog and MinIO:
    connector.name=iceberg
    iceberg.catalog.type=rest
    iceberg.rest.uri=http://iceberg-rest:8181
    hive.s3.endpoint=http://minio:9000
    hive.s3.path-style-access=true
    hive.s3.aws-access-key=minioadmin
    hive.s3.aws-secret-key=minioadminpassword

    5. The final directory structure should align with the specified layout.

    6. Service Deployment and Verification:

    • Deploy all configured services by executing the Docker Compose command from the my-lakehouse directory:

    docker compose up -d

    • Verify the operational status of all containers by running docker ps.
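    The individual services can also be probed directly; the endpoints below follow the port mappings in the compose file (MinIO's health endpoint, the Iceberg REST Catalog's config endpoint, and the Presto coordinator's info endpoint):

```shell
# MinIO liveness probe
curl -f http://localhost:9000/minio/health/live && echo "MinIO is up"

# Iceberg REST Catalog configuration endpoint
curl -s http://localhost:8181/v1/config

# Presto coordinator info
curl -s http://localhost:8089/v1/info
```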

    Step 3: MinIO S3 Bucket Creation

    To create the necessary S3 storage bucket in MinIO for your data lakehouse:

    1. Access MinIO Console: Navigate to localhost:9001 (the MinIO Console port) in a web browser.

    2. Authenticate: Log in using the default MinIO credentials: Username: minioadmin and Password: minioadminpassword.

    3. Bucket Creation: Proceed to create a new bucket, naming it mylakehousedata.
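    If you prefer the command line over the console, the bucket can also be created with MinIO's mc client (this assumes mc is installed on your host; the alias name locallake is arbitrary):

```shell
# Point mc at the local MinIO instance and create the warehouse bucket
mc alias set locallake http://localhost:9000 minioadmin minioadminpassword
mc mb locallake/mylakehousedata
mc ls locallake
```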

    Step 4: End-to-End ETL Pipeline: Connecting MySQL with OLake and Querying via PrestoDB

    This step details the establishment of the ETL pipeline for data flow from MySQL to Iceberg via OLake, enabling querying with PrestoDB:

    1. Access OLake: Open the OLake home page at localhost:8000 in your browser.

    2. Create Job – Source Configuration:

    • Click on ‘Create Job’.
    • Configure the source details as follows:

            ▪ Connector: MySQL

            ▪ Name of your source: mysql (or a descriptive name of your choosing)

            ▪ OLake Version: latest

            ▪ MySQL Host: host.docker.internal (or your Orbstack domain)

            ▪ Port: 3306

            ▪ Database: retail_db (or your specific database name)

            ▪ Username: user (as defined in the compose file, or your own MySQL username)

            ▪ Password: Your MySQL password

    3. Create Job – Destination Configuration:

    • Click ‘Next’ to proceed to destination configuration.
    • Configure the destination details for Apache Iceberg and the REST Catalog as follows:

            ▪ Connector: Apache Iceberg

            ▪ Catalog: REST Catalog

            ▪ Name of your destination: temp_destination (or a descriptive name of your choosing)

            ▪ Version: latest

            ▪ REST Catalog URI: http://host.docker.internal:8181

            ▪ Iceberg S3 Path (Warehouse): s3://mylakehousedata/

            ▪ Iceberg Database: warehouse

            ▪ S3 Endpoint: http://host.docker.internal:9000

            ▪ AWS Region: us-east-1

            ▪ AWS Access Key: minioadmin

            ▪ AWS Secret Key: minioadminpassword

    4. Schema Confirmation and Job Finalization:

    • Click ‘Next’ to confirm the source and destination configurations.
    • Click ‘Next’ again, assign a suitable Job name, and create the job to start the data ingestion.

    5. Accessing PrestoDB and Querying Data:

    • Finally, navigate to localhost:8089 to access the PrestoDB web UI and monitor query activity.
    • Execute a SQL query against the iceberg catalog to retrieve data from the Iceberg table stored in MinIO.
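    Queries can be run from the Presto CLI inside the container (assuming the prestodb/presto image bundles presto-cli, as recent images do). The schema name warehouse matches the Iceberg Database configured in OLake above; adjust the table name if your ingested table differs:

```shell
# Open an interactive session against the Iceberg catalog
docker exec -it presto-query-engine \
  presto-cli --catalog iceberg --schema warehouse

# Or run a single query non-interactively
docker exec presto-query-engine \
  presto-cli --catalog iceberg --schema warehouse \
  --execute "SELECT product_name, price FROM products LIMIT 5;"
```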

    Congratulations! You have successfully deployed and configured your Open Data Lakehouse.

    Follow Presto on LinkedIn and YouTube, and join the Slack channel to interact with the community.