Build Your Open Data Lakehouse: A Step-by-Step ETL Guide with MySQL, OLake, and PrestoDB
This tutorial is a comprehensive, step-by-step guide to building an Open Data Lakehouse, a modern and flexible data architecture, from scratch. An Open Data Lakehouse offers a powerful, scalable way to store, manage, and query both structured and semi-structured data, while a suite of robust open-source tools keeps you in full control of the stack.
Prerequisites
Before commencing the setup of your Open Data Lakehouse, ensure the following prerequisites are met:
- Docker Application Installation: Docker is essential for deploying the system components.
- Familiarity with SQL and Command-Line: A foundational understanding of SQL and command-line operations is necessary for executing the setup steps and interacting with the database components.
Key Components of the Open Data Lakehouse Architecture
The Open Data Lakehouse architecture described herein integrates several key open-source components, each serving a specific role:
• MySQL: This functions as the source transactional database for data ingestion.
• OLake: An open-source, lightweight data ingestion tool that replicates data from source databases into the Apache Iceberg table format.
• MinIO: Provides S3-compatible object storage, which is utilized for housing both raw and processed data within the lakehouse.
• REST Catalog: A REST-based service that manages Iceberg table metadata and schemas.
• PrestoDB (Presto): An open-source distributed SQL query engine that enables efficient data querying across the lakehouse.

Step-by-Step Implementation Guide
Step 1: Setting Up the OLake Infrastructure
To establish the OLake infrastructure, proceed with the following steps:
1. Directory Creation: Create a primary directory, designated Olake, on your Desktop or at another suitable location.

2. Subdirectory Creation: Within the Olake directory, create a subdirectory named Olake-data. This subdirectory is crucial for local storage of OLake data.
3. Repository Cloning: Navigate back to the Olake directory and execute the following commands to clone the OLake UI repository:
git clone https://github.com/datazip-inc/olake-ui.git
cd olake-ui
4. Upon completion, the directory structure should reflect the specified organization.

5. docker-compose.yml Configuration: Modify the docker-compose.yml file to include the path to the Olake-data directory. Optionally, default username and password credentials can be updated within this file.
6. Environment Variable: Run this command in your terminal so that Docker Compose can use your current file location as the host persistence path:
export PWD=$(pwd)
7. Container Deployment: Start the OLake containers by executing:
docker compose up -d
8. Upon successful deployment, OLake will be accessible via a web browser at localhost:8000. The default login credentials are Username: admin and Password: password.
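For reference, the directory scaffold from steps 1–2 can also be created from the terminal. The base path here is an assumption (the guide suggests the Desktop), so adjust it to wherever you want the project to live:

```shell
# Create the top-level OLake directory and its local data subdirectory.
# Run this from the parent location of your choice (e.g. ~/Desktop).
mkdir -p Olake/Olake-data

# The olake-ui repository is then cloned inside Olake (see step 3).
ls Olake
```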

Step 2: Infrastructure Setup for MySQL, MinIO, REST Catalog, and PrestoDB
This step involves configuring the core data lakehouse components:
1. Main Directory Creation: Create a directory named my-lakehouse on your Desktop or an alternative desired location.
2. docker-compose.yml Configuration: Within the my-lakehouse directory, create a docker-compose.yml file with the provided content. This file orchestrates the simultaneous setup of all services including MySQL, MinIO, Iceberg REST Catalog, and PrestoDB.
version: '3.8'

services:
  # --- 1. MySQL (Source Database) ---
  mysql:
    image: mysql:8.0
    container_name: mysql-source
    ports:
      - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: root_password
      MYSQL_DATABASE: retail_db
      MYSQL_USER: user
      MYSQL_PASSWORD: password
    volumes:
      - ./mysql_data:/var/lib/mysql
      - ./init_mysql.sql:/docker-entrypoint-initdb.d/init_mysql.sql
    networks:
      - lakehouse-net

  # --- 2. MinIO (Object Storage - S3 Compatible) ---
  minio:
    image: minio/minio
    container_name: minio-storage
    ports:
      - "9000:9000" # MinIO API port
      - "9001:9001" # MinIO Console port
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadminpassword
      MINIO_DOMAIN: minio
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    volumes:
      - ./minio_data:/data
    networks:
      - lakehouse-net

  # --- 3. Iceberg REST Catalog (tabulario/iceberg-rest) ---
  iceberg-rest:
    image: tabulario/iceberg-rest:0.6.0
    container_name: iceberg-rest
    depends_on:
      - minio
    ports:
      - "8181:8181"
    environment:
      AWS_REGION: us-east-1
      AWS_DEFAULT_REGION: us-east-1
      CATALOG_WAREHOUSE: s3://mylakehousedata
      CATALOG_IO__IMPL: org.apache.iceberg.aws.s3.S3FileIO
      CATALOG_S3_ENDPOINT: http://minio:9000
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadminpassword
      CATALOG_S3_PATH_STYLE_ACCESS: "true"
      CATALOG_S3_PATH__STYLE__ACCESS: "true"
    networks:
      - lakehouse-net

  # --- 4. PrestoDB Query Engine ---
  presto:
    image: prestodb/presto:latest
    container_name: presto-query-engine
    ports:
      - "8089:8080"
    volumes:
      - ./presto_config/catalog/iceberg.properties:/opt/presto-server/etc/catalog/iceberg.properties
    environment:
      # MinIO/S3 configuration for Presto
      AWS_ACCESS_KEY_ID: minioadmin
      AWS_SECRET_ACCESS_KEY: minioadminpassword
      AWS_ENDPOINT: http://minio:9000
      AWS_FORCE_PATH_STYLE: "true"
    depends_on:
      - minio
      - iceberg-rest
    networks:
      - lakehouse-net

networks:
  lakehouse-net:
    driver: bridge

3. MySQL Initialization File: Create an init_mysql.sql file within the my-lakehouse directory. This script will automatically create the retail_db database and a products table, populating it with sample data. Alternatively, the table can be created manually in the MySQL database.
CREATE DATABASE IF NOT EXISTS retail_db;
USE retail_db;
CREATE TABLE IF NOT EXISTS products (
product_id INT AUTO_INCREMENT PRIMARY KEY,
product_name VARCHAR(255) NOT NULL,
category VARCHAR(100),
price DECIMAL(10, 2),
stock_quantity INT,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
INSERT INTO products (product_name, category, price, stock_quantity) VALUES
('Laptop Pro', 'Electronics', 1200.00, 50),
('Mechanical Keyboard', 'Electronics', 95.50, 150),
('Wireless Mouse', 'Electronics', 25.00, 200),
('Ergonomic Chair', 'Office Furniture', 350.00, 30),
('Desk Lamp LED', 'Office Furniture', 45.00, 80),
('Coffee Maker 12-cup', 'Home Appliances', 75.00, 100),
('Blender High-Speed', 'Home Appliances', 150.00, 70),
('Smartwatch X', 'Wearables', 299.99, 120),
('Fitness Tracker Pro', 'Wearables', 89.99, 180),
('Noise Cancelling Headphones', 'Audio', 199.00, 90);

4. PrestoDB Configuration:
- Create two nested directories inside my-lakehouse: presto_config and presto_config/catalog.
- Within the catalog directory, create a file named iceberg.properties with the following configuration details, which allow PrestoDB to connect to the Iceberg REST Catalog and MinIO:
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest.uri=http://iceberg-rest:8181
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadminpassword

5. The final directory structure should align with the specified layout.
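If you prefer the terminal, the two configuration directories and the iceberg.properties file described in step 4 can be created in one go. This assumes you run it from inside the my-lakehouse directory:

```shell
# Create the nested Presto config directories (run from inside my-lakehouse).
mkdir -p presto_config/catalog

# Write the Presto Iceberg connector configuration shown above.
cat > presto_config/catalog/iceberg.properties <<'EOF'
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest.uri=http://iceberg-rest:8181
hive.s3.endpoint=http://minio:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadminpassword
EOF
```

The quoted 'EOF' delimiter prevents shell variable expansion, so the file contents are written exactly as shown.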

6. Service Deployment and Verification:
- Deploy all configured services by executing the Docker Compose command:
docker compose up -d
- Verify the operational status of all containers by running docker ps.
Step 3: MinIO S3 Bucket Creation
To create the necessary S3 storage bucket in MinIO for your data lakehouse:
1. Access MinIO Console: Navigate to localhost:9001 (the console port mapped in docker-compose.yml) in a web browser.
2. Authenticate: Log in using the default MinIO credentials: Username: minioadmin and Password: minioadminpassword.
3. Bucket Creation: Proceed to create a new bucket, naming it mylakehousedata.
Step 4: End-to-End ETL Pipeline: Connecting MySQL with OLake and Querying via PrestoDB
This step details the establishment of the ETL pipeline for data flow from MySQL to Iceberg via OLake, enabling querying with PrestoDB:
1. Access OLake: Open the OLake home page at localhost:8000 in your browser.
2. Create Job – Source Configuration:
▪ Connector: MySQL
▪ Name of your source: mysql (or a descriptive name of your choosing)
▪ OLake Version: latest
▪ MySQL Host: host.docker.internal (or your OrbStack domain)
▪ Port: 3306
▪ Database: retail_db (or your specific database name)
▪ Username: user (the MYSQL_USER defined in docker-compose.yml)
▪ Password: password (the MYSQL_PASSWORD defined in docker-compose.yml)
3. Create Job – Destination Configuration:
▪ Connector: Apache Iceberg
▪ Catalog: REST Catalog
▪ Name of your destination: temp_destination (or a descriptive name of your choosing)
▪ Version: latest
▪ REST Catalog URI: http://host.docker.internal:8181
▪ Iceberg S3 Path (Warehouse): s3://mylakehousedata/
▪ Iceberg Database: warehouse
▪ S3 Endpoint: http://host.docker.internal:9000
▪ AWS Region: us-east-1
▪ AWS Access Key: minioadmin
▪ AWS Secret Key: minioadminpassword
4. Schema Confirmation and Job Finalization:
- Click ‘Next’ to confirm the source and destination configurations.
- Click ‘Next’ again to finalize the data ingestion process. Assign a suitable Job name.
5. Accessing PrestoDB and Querying Data:
- Finally, navigate to localhost:8089 to access the PrestoDB web UI.
- Execute the appropriate SQL query to retrieve data from the Iceberg table stored in MinIO.
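For example, a query along the following lines should return the ingested rows. The catalog name iceberg comes from the iceberg.properties file name, and the schema name warehouse is an assumption based on the Iceberg Database set in the OLake destination; adjust both if you used different names:

```sql
-- List the schemas the REST catalog exposes to verify connectivity.
SHOW SCHEMAS FROM iceberg;

-- Read the replicated products table from the Iceberg warehouse.
SELECT * FROM iceberg.warehouse.products LIMIT 10;
```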

Congratulations! You have successfully deployed and configured your Open Data Lakehouse.