Deploy Presto on Kubernetes using Helm: Query S3 Data with Hive Metastore

Presto is a distributed SQL query engine. Running it on Kubernetes gives you a service whose failed pods are restarted automatically and whose worker pool can be resized by changing a replica count. Packaging the deployment as a Helm chart keeps it standardized, version-controlled, and reproducible across environments.

This guide walks you through deploying a baseline Presto cluster on Kubernetes using the official Presto Helm chart: installing the chart, scaling workers, configuring a Hive Metastore, and querying data in S3. It also introduces the chart's high-availability mode and closes with fixes for two common deployment errors.

Prerequisites: Setting Up Your Environment

Before diving into deployment, ensure your environment meets these requirements:

Required Software Versions

Kubernetes Cluster: Version 1.30+ (EKS, GKE, AKS, a local machine, or on-premises)
Helm 3: Latest stable version (3.14+ recommended)
kubectl: Configured to communicate with target cluster
Container Runtime: Docker Desktop / OrbStack or any container management tool

Infrastructure Requirements

Minimum Cluster Resources: 4 CPU cores, 16GB RAM for basic cluster
Production Cluster: 8+ CPU cores, 32GB+ RAM for workloads
Network: Pod-to-Pod communication enabled
Storage: Persistent storage available (optional but recommended)

Knowledge Prerequisites

Basic understanding of Kubernetes concepts (Pods, Services, ConfigMaps, Secrets)
Familiarity with Presto architecture (Coordinator, Worker, Catalogs)
Experience with YAML configuration and Helm basics

Presto Kubernetes Architecture: Components and Deployment Modes

Before deployment, let’s understand the key components:

Core Architecture Components

Coordinator Pod: The brain that parses SQL statements, plans queries, and manages worker nodes
Worker Pods: The compute engines that execute tasks and process data
Discovery Service: Headless service enabling worker-to-coordinator communication
ConfigMaps: Store configuration files (config.properties, jvm.config, log.properties)
Catalog ConfigMaps: Define data source connectors (Hive, MySQL, PostgreSQL, etc.)
Secrets: Securely store credentials and sensitive configuration
Services: Expose Presto endpoints internally and externally

Deployment Modes

The Presto Helm chart supports three deployment modes:

Single Mode: One pod acting as both coordinator and worker (ideal for development)
Cluster Mode: Separate coordinator and worker pods (standard production setup)
HA Cluster Mode: Multiple active coordinator pods managed by a shared resource manager (high availability)

Step-by-Step Deployment Guide

Step 1: Setting up the Presto Helm Chart on Kubernetes (Local)

Create an organized workspace for your Presto deployments:

mkdir -p ~/presto-k8s
cd ~/presto-k8s

Verify that kubectl and Helm are installed, then create a dedicated namespace to isolate the Presto resources:

# Verify the installation first
kubectl version --client
helm version

# Create a dedicated namespace
kubectl create namespace sql-query-engine

# Verify namespace
kubectl get namespaces

Add the official Presto Helm repository:

# Add Presto Helm repository
helm repo add presto https://prestodb.github.io/presto-helm-charts

# Update repository to fetch latest charts
helm repo update

# Verify repository
helm repo list

# Pull the chart to inspect the default configuration
helm pull presto/presto --untar
tree presto

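Understanding the Helm Chart Structure:

The tree below shows the layout of the chart's source repository. The presto directory created by helm pull corresponds to charts/presto in this layout.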

presto-helm-charts/
├── charts/
│   └── presto/
│       ├── Chart.yaml              # Chart metadata
│       ├── values.yaml             # Default configuration
│       ├── README.md               # Chart documentation
│       ├── .helmignore             # Files to ignore
│       └── templates/              # Kubernetes manifests
│           ├── configmap-catalog.yaml
│           ├── configmap-coordinator.yaml
│           ├── configmap-resource-manager.yaml
│           ├── configmap-worker.yaml
│           ├── deployment-coordinator.yaml
│           ├── deployment-resource-manager.yaml
│           ├── deployment-worker.yaml
│           ├── ingress.yaml
│           ├── NOTES.txt
│           ├── service-discovery.yaml
│           ├── service.yaml
│           └── serviceaccount.yaml
├── DEVELOPMENT.md
├── README.md
└── LICENSE

Create a my-presto-config.yaml file in the workspace directory (~/presto-k8s) with a minimal custom configuration for a quick deployment:

# my-presto-config.yaml

mode: cluster

coordinator:
  replicas: 1
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  jvm: |-
    -server
    -Xmx2G
    -XX:+ExitOnOutOfMemoryError
    -Djdk.attach.allowAttachSelf=true
    --add-opens=java.base/java.lang=ALL-UNNAMED
    --add-opens=java.base/java.io=ALL-UNNAMED
    --add-opens=java.base/java.util=ALL-UNNAMED
    --add-opens=java.base/java.net=ALL-UNNAMED

worker:
  replicas: 2
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "4Gi"
      cpu: "2"
  jvm: |-
    -server
    -Xmx2G
    -XX:+ExitOnOutOfMemoryError
    -Djdk.attach.allowAttachSelf=true
    --add-opens=java.base/java.lang=ALL-UNNAMED
    --add-opens=java.base/java.io=ALL-UNNAMED
    --add-opens=java.base/java.util=ALL-UNNAMED
    --add-opens=java.base/java.net=ALL-UNNAMED

# To connect data sources
catalog:
  tpch: |-
    connector.name=tpch
  hive: |-
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://hive-metastore:9083
    hive.s3.aws-access-key=<YOUR_ACCESS_KEY>
    hive.s3.aws-secret-key=<YOUR_SECRET_KEY>
    hive.s3.endpoint=https://<YOUR_ENDPOINT>
    hive.s3.path-style-access=true
    hive.s3.ssl.enabled=true

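Now deploy Presto to the local cluster with the custom configuration. Pulling the images and starting the containers takes about 2-3 minutes. The hive catalog points at a metastore service that is not deployed until Step 3, so queries against the hive catalog will fail until then.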

helm install my-presto presto/presto -f my-presto-config.yaml --namespace sql-query-engine

To apply any future configuration changes, you simply run the Helm upgrade command:

helm upgrade my-presto presto/presto -f my-presto-config.yaml -n sql-query-engine

Check the status of the pods.

kubectl get pods --namespace sql-query-engine

Expected Output:

NAME                                    READY   STATUS    RESTARTS   AGE
my-presto-coordinator-886bc4b5f-wsj6t   1/1     Running   0          59m
my-presto-worker-7b5698569b-bg664       1/1     Running   0          59m
my-presto-worker-7b5698569b-wdfcn       1/1     Running   0          59m

A STATUS of Running with READY 1/1 for every pod indicates that the Presto coordinator and both workers have started.
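To check the pod status without reading the table by eye, a small filter can be piped behind kubectl get pods. This is a minimal sketch: the helper name not_ready_pods is hypothetical, and it assumes the default table layout shown above (NAME, READY, STATUS as the first three columns).

```shell
# Reads `kubectl get pods` table output on stdin and prints the name of every
# pod whose READY column is not n/n or whose STATUS is not Running.
not_ready_pods() {
  awk 'NR > 1 && NF >= 3 {
         split($2, ready, "/")
         if (ready[1] != ready[2] || $3 != "Running") print $1
       }'
}

# Usage (namespace taken from this guide):
#   kubectl get pods -n sql-query-engine | not_ready_pods
# Empty output means every pod is ready.
```

kubectl also ships a blocking alternative, kubectl wait --for=condition=Ready pod --all -n sql-query-engine --timeout=300s, which may be more convenient in scripts.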

Since the cluster runs inside Kubernetes on an isolated network, create a port-forward tunnel to reach the Web UI from your browser:

kubectl port-forward svc/my-presto 8080:8080 -n sql-query-engine

Leave this terminal window running and open http://localhost:8080 in your browser. Run the remaining commands in a second terminal.
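While the port-forward is running, the tunnel can also be checked from the command line. The sketch below assumes that the coordinator serves its status at /v1/info and that the JSON response carries a boolean "starting" field; the helper name presto_ready is hypothetical.

```shell
# Reads the JSON body of /v1/info on stdin and succeeds once the server
# reports "starting": false.
presto_ready() {
  grep -q '"starting"[[:space:]]*:[[:space:]]*false'
}

# Usage (in a second terminal, with the port-forward still running):
#   curl -s http://localhost:8080/v1/info | presto_ready && echo "Presto is ready"
```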

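Run the command below to open the Presto CLI inside the coordinator pod. Pod names carry a generated suffix, so replace the name with the coordinator pod from your own kubectl get pods output: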

kubectl exec -it my-presto-coordinator-67565444dd-rvrt7 --namespace sql-query-engine -- presto-cli
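Because the pod suffix changes with every rollout, it can help to look the name up instead of copying it. The following is a minimal sketch; the helper name coordinator_pod is hypothetical, and it assumes the coordinator pod name contains "-coordinator-", as in the listings in this guide.

```shell
# Reads `kubectl get pods -o name` output on stdin (lines like "pod/<name>")
# and prints the first pod name that contains "-coordinator-".
coordinator_pod() {
  sed -n 's|^pod/\(.*-coordinator-.*\)$|\1|p' | head -n 1
}

# Usage:
#   pod=$(kubectl get pods -n sql-query-engine -o name | coordinator_pod)
#   kubectl exec -it "$pod" -n sql-query-engine -- presto-cli
```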

Step 2: Scaling Your Deployment (Adding More Workers)

To add more workers (scale out), increase replicas under the worker section of my-presto-config.yaml:

worker:
  replicas: 5  # <-- Update this
  # ... rest of config stays the same

# Apply the update
helm upgrade my-presto presto/presto -f my-presto-config.yaml -n sql-query-engine

# Check the pod status
kubectl get pods -n sql-query-engine

Expected Output:

NAME                                     READY   STATUS    RESTARTS   AGE
my-presto-coordinator-67565444dd-rvrt7   1/1     Running   0          91m
my-presto-worker-56f9fd8b84-4rm6v        1/1     Running   0          103s
my-presto-worker-56f9fd8b84-7l2lr        1/1     Running   0          91m
my-presto-worker-56f9fd8b84-89zd7        1/1     Running   0          91m
my-presto-worker-56f9fd8b84-8srz2        1/1     Running   0          103s
my-presto-worker-56f9fd8b84-ts4x5        1/1     Running   0          103s
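For a quick experiment, the worker count can also be overridden on the command line with Helm's --set flag instead of editing the file. This is a sketch, not part of the chart: the wrapper name scale_workers is hypothetical, and it reuses the release, chart, file, and namespace names from this guide.

```shell
# Runs `helm upgrade` with worker.replicas overridden from the first argument.
# Rejects anything that is not a non-negative integer.
scale_workers() {
  replicas=$1
  case $replicas in
    ''|*[!0-9]*)
      echo "usage: scale_workers <non-negative integer>" >&2
      return 1
      ;;
  esac
  helm upgrade my-presto presto/presto \
    -f my-presto-config.yaml \
    --set worker.replicas="$replicas" \
    -n sql-query-engine
}

# Usage:
#   scale_workers 5
```

A value passed with --set is not written back to my-presto-config.yaml, so the next plain helm upgrade returns to the replica count in the file. Editing the file remains the better choice for lasting changes.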

Step 3: Configuring the Apache Hive Metastore for S3

You can use any S3-compatible object store (such as AWS S3, Wasabi, or MinIO). This layer is where your raw data files (CSV, JSON, Parquet) physically reside.

Create an S3 bucket in your preferred region, and set up a dedicated directory inside it to serve as the storage location for your data files.

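Create a file named hive-metastore.yaml with the manifests below. It deploys a PostgreSQL database that backs the metastore, a ConfigMap holding hive-site.xml, and the Hive Metastore service that acts as the centralized metadata catalog for the Presto cluster.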

# 1. POSTGRES DEPLOYMENT (Holds the Metastore Metadata)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore-db
  namespace: sql-query-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hive-metastore-db
  template:
    metadata:
      labels:
        app: hive-metastore-db
    spec:
      containers:
      - name: postgres
        image: postgres:13
        env:
        - name: POSTGRES_DB
          value: metastore
        - name: POSTGRES_USER
          value: hive
        - name: POSTGRES_PASSWORD
          value: hivepassword
        ports:
        - containerPort: 5432
---
# 2. POSTGRES SERVICE
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore-db
  namespace: sql-query-engine
spec:
  ports:
  - port: 5432
  selector:
    app: hive-metastore-db
---
# 3. HIVE CONFIGURATION (With S3 Keys)
apiVersion: v1
kind: ConfigMap
metadata:
  name: hive-config
  namespace: sql-query-engine
data:
  hive-site.xml: |
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://hive-metastore-db:5432/metastore</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hivepassword</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_KEY</value>
      </property>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>YOUR_ENDPOINT</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
      </property>
    </configuration>
---
# 4. HIVE METASTORE DEPLOYMENT
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: sql-query-engine
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hive-metastore
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      # Shared Volume for Driver Jar
      volumes:
      - name: hive-config
        configMap:
          name: hive-config
      - name: lib-share
        emptyDir: {}

      initContainers:
      # A. Download Postgres Driver + AWS/S3 JARs
      - name: download-driver
        image: python:3.9-slim
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        command: ["python3", "-c"]
        args:
          - |
            import urllib.request
            files = [
              ('https://jdbc.postgresql.org/download/postgresql-42.2.18.jar', '/lib-share/postgresql-42.2.18.jar'),
              ('https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar', '/lib-share/hadoop-aws-3.3.4.jar'),
              ('https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar', '/lib-share/aws-java-sdk-bundle-1.12.367.jar'),
            ]
            for url, dest in files:
              print(f'Downloading {url}...')
              urllib.request.urlretrieve(url, dest)
              print('Done.')
        volumeMounts:
        - name: lib-share
          mountPath: /lib-share

      # B. Initialize Schema (Run as ROOT to allow copying to /opt/hive/lib)
      - name: init-schema
        image: apache/hive:4.0.0
        securityContext:
          runAsUser: 0
        command: ["sh", "-c"]
        args:
          - "cp /lib-share/*.jar /opt/hive/lib/ && (/opt/hive/bin/schematool -dbType postgres -info || /opt/hive/bin/schematool -dbType postgres -initSchema)"
        volumeMounts:
        - name: hive-config
          mountPath: /opt/hive/conf/hive-site.xml
          subPath: hive-site.xml
        - name: lib-share
          mountPath: /lib-share

      containers:
      # C. Run Metastore Service (Run as ROOT to allow copying to /opt/hive/lib)
      - name: metastore
        image: apache/hive:4.0.0
        securityContext:
          runAsUser: 0
        command: ["sh", "-c"]
        args:
          - "cp /lib-share/*.jar /opt/hive/lib/ && /opt/hive/bin/hive --service metastore"
        ports:
        - containerPort: 9083
        volumeMounts:
        - name: hive-config
          mountPath: /opt/hive/conf/hive-site.xml
          subPath: hive-site.xml
        - name: lib-share
          mountPath: /lib-share
---
# 5. HIVE METASTORE SERVICE
apiVersion: v1
kind: Service
metadata:
  name: hive-metastore
  namespace: sql-query-engine
spec:
  ports:
  - port: 9083
    targetPort: 9083
  selector:
    app: hive-metastore

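The manifest above embeds the database password and the S3 keys in plain text to keep the example short. In production, keep them in Kubernetes Secrets, or avoid static keys altogether with a cloud-native identity mechanism such as IAM Roles for Service Accounts (IRSA) on EKS.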
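As a starting point for moving the keys out of the manifest, the sketch below creates a Secret with kubectl create secret generic. The Secret name hive-s3-credentials, the key names, and the wrapper function create_s3_secret are assumptions for illustration. Wiring the Secret into the metastore container (for example through envFrom) is not shown and would need to be adapted to your setup.

```shell
# Creates a generic Secret holding the S3 access key and secret key.
# $1 = access key, $2 = secret key
create_s3_secret() {
  if [ "$#" -ne 2 ]; then
    echo "usage: create_s3_secret <access-key> <secret-key>" >&2
    return 1
  fi
  kubectl create secret generic hive-s3-credentials \
    --namespace sql-query-engine \
    --from-literal=AWS_ACCESS_KEY_ID="$1" \
    --from-literal=AWS_SECRET_ACCESS_KEY="$2"
}

# Usage (values read from the environment rather than typed into a file):
#   create_s3_secret "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY"
```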

# Apply the configuration above
kubectl apply -f hive-metastore.yaml

# Check the status of the pods
kubectl get pods -n sql-query-engine

Expected Output:

NAME                                       READY   STATUS    RESTARTS   AGE
hive-metastore-5b964db94-dg2wm             1/1     Running   0          102m
hive-metastore-db-658f9d4546-89n8l         1/1     Running   0          151m
my-presto-coordinator-68b85f9566-rt5jw     1/1     Running   0          138m
my-presto-worker-66bb49fb6b-bs6f7          1/1     Running   0          138m
my-presto-worker-66bb49fb6b-gvtwd          1/1     Running   0          138m

Confirm that hive is registered as a catalog with Presto by running the SHOW CATALOGS statement in the Presto CLI. The output should include hive and tpch.
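One way to run this check non-interactively is to pass a statement to presto-cli with its --execute option. The sketch below assumes the coordinator Deployment is named my-presto-coordinator (inferred from the pod names in this guide); the helper names presto_sql and has_catalog are hypothetical.

```shell
# Runs a single SQL statement through presto-cli inside the coordinator.
presto_sql() {
  kubectl exec deploy/my-presto-coordinator -n sql-query-engine -- \
    presto-cli --execute "$1"
}

# Reads catalog names on stdin (one per line, optionally double-quoted)
# and succeeds if the catalog named in $1 is among them.
has_catalog() {
  tr -d '"' | grep -qx "$1"
}

# Usage:
#   presto_sql "SHOW CATALOGS;" | has_catalog hive && echo "hive catalog found"
```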

Next, create the schema if it does not already exist:

CREATE SCHEMA IF NOT EXISTS hive.default;

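Now connect Presto to the actual S3 data using an external table. Defining an external table only records metadata in the Hive Metastore: the column layout and the S3 location of the files. The data itself stays in S3, and dropping the table does not delete the files.

Run the following SQL statement through the Presto UI or CLI, replacing the bucket and path in external_location with your own. Presto's Hive connector requires every column of a CSV table to be VARCHAR: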

CREATE TABLE IF NOT EXISTS hive.default.subscriptions (
  user_id VARCHAR, 
  phone_number VARCHAR, 
  subscription_type VARCHAR,
  region VARCHAR, 
  subscription_date VARCHAR, 
  source VARCHAR
) WITH (
  format = 'CSV',
  external_location = 's3a://presto-cluster/my_data/'
);

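This deployment runs PostgreSQL without persistent storage, so the metastore's metadata lives only as long as the database pod. If that pod restarts, the database comes back empty: restart the hive-metastore pod so that its init container re-creates the schema, then re-run your CREATE TABLE scripts to map your S3 data back into Presto. The data files in S3 are unaffected. For permanent metadata retention, attach a PersistentVolumeClaim (PVC) to the Postgres container.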
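A claim for the database could look like the sketch below, which prints a PersistentVolumeClaim manifest that can be piped into kubectl apply. The claim name hive-metastore-db-data, the default size, and the function name postgres_pvc_manifest are assumptions. The hive-metastore-db Deployment would additionally need a volume referencing the claim and a volumeMount at the Postgres data directory (/var/lib/postgresql/data in the official image), which is not shown here.

```shell
# Prints a PersistentVolumeClaim manifest for the metastore database.
# $1 = requested storage size (defaults to 5Gi)
postgres_pvc_manifest() {
  size=${1:-5Gi}
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hive-metastore-db-data
  namespace: sql-query-engine
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
}

# Usage:
#   postgres_pvc_manifest 10Gi | kubectl apply -f -
```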

Step 4: Querying S3 Data

Run the following query against the table:

SELECT subscription_type, COUNT(*) AS total_users
FROM hive.default.subscriptions
GROUP BY subscription_type
ORDER BY total_users DESC
LIMIT 100;

Presto reads the files under the external location, aggregates them across the workers, and returns one row per subscription type.

Troubleshooting Common Issues

Memory Thrashing (OOMKilled), CrashLoopBackOff

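Reason: The Presto pod is requesting more RAM than your Kubernetes node has available, or the Java heap size (-Xmx) is too close to the container memory limit.
Fix: Check resources.limits.memory against the -Xmx value in the jvm block. The JVM also uses memory outside the heap (metaspace, thread stacks, direct buffers), so a 2Gi limit cannot accommodate -Xmx2G. The configuration in Step 1 pairs a 4Gi limit with -Xmx2G.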
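The headroom rule can be turned into simple arithmetic. The sketch below uses 75% of the container limit as the heap size, which is a common rule of thumb and an assumption here, not a figure from the Presto documentation; the function name suggest_xmx_mb is hypothetical.

```shell
# Prints a suggested -Xmx value in megabytes for a container memory limit
# given in whole Gi. $2 optionally overrides the heap percentage (default 75).
suggest_xmx_mb() {
  limit_gi=$1
  percent=${2:-75}
  echo $(( limit_gi * 1024 * percent / 100 ))
}

# Usage:
#   suggest_xmx_mb 4      # prints 3072, i.e. -Xmx3072M for a 4Gi limit
#   suggest_xmx_mb 2 50   # prints 1024, i.e. -Xmx1024M for a 2Gi limit
```

The more conservative 50% ratio matches the values used in Step 1 (4Gi limit, -Xmx2G).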

Error: Connector hive not found

Reason: This error usually comes from mixing Trino configuration syntax into a PrestoDB deployment. Trino names the connector hive, while PrestoDB expects hive-hadoop2.
Fix: Set the connector name in the hive catalog to exactly: connector.name=hive-hadoop2
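To see which connector names the release was actually deployed with, the values stored by Helm can be filtered for connector.name lines. This is a minimal sketch; the helper name connector_names is hypothetical, and it assumes the catalog properties appear in the values as indented key=value lines, as in my-presto-config.yaml.

```shell
# Reads Helm values (or any properties text) on stdin and prints the value of
# every connector.name entry, one per line.
connector_names() {
  sed -n 's/^[[:space:]]*connector\.name=//p'
}

# Usage:
#   helm get values my-presto -n sql-query-engine | connector_names
# For the configuration in this guide the output should list tpch and
# hive-hadoop2.
```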

Summary

By deploying Presto on Kubernetes alongside the Apache Hive Metastore, you can query external data stored in S3-compatible storage with standard SQL. External tables keep the data in S3 and only the table definitions in the metastore, so the cluster itself can be rebuilt or rescaled without moving data. Before relying on this baseline in production, give the metastore database persistent storage and move the credentials into Kubernetes Secrets.

Refer to Presto Documentation on Presto Helm Deployment and Hive Connector for more information.