PBench 1.2.1: End-to-End Benchmarking and Performance Testing for Presto
Benchmarking a distributed SQL engine like Presto involves much more than running a few queries and recording wall-clock times. Real-world performance evaluation demands multi-phase test execution, concurrent workloads, production traffic replay, and deep offline analysis. PBench is a purpose-built benchmarking toolkit for Presto that handles all of this through a declarative, composable stage system. With the 1.2.1 release, PBench becomes significantly more robust and flexible, adding parallel stream execution, dynamic query generation, richer script integration, and a fully data-driven cluster configuration generator, while continuing to provide a unified interface for standard benchmarks, A/B testing, and production workload analysis. In this post, we'll walk through implementing a spec-compliant TPC-DS benchmark with PBench, performing deep offline analysis with pbench loadjson, and extending the same building blocks to real-world workflows like schema capture, A/B testing, and traffic replay.
What’s New in PBench 1.2.1
The key new capabilities that enable the workflows described in this post:
- Parallel stream execution — the stream_count parameter runs N parallel instances of a stage, each with a deterministically derived random seed, mapping directly to the TPC-DS throughput test model
- Directory expansion in query_files — entries can now point to directories, which are expanded to the contained SQL files at execution time (after pre_stage_scripts), enabling dynamic query generation workflows like dsqgen
- no_random_duplicates — shuffled random execution that cycles through all queries before repeating, useful for throughput tests requiring full query coverage
- Shell script environment variables — PBENCH_STAGE_ID, PBENCH_OUTPUT_DIR, PBENCH_QUERY_FILE, PBENCH_QUERY_ID, etc. are injected into all script hooks (see the sketch after this list)
- Data-driven genconfig — generalized to use generic maps and templates with arithmetic/string functions, so adding new cluster configuration fields requires only JSON and template changes
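For instance, a stage hook can use the injected variables to stash artifacts next to PBench's own results. A minimal sketch (the archived directory and layout are illustrative, not a PBench convention):

#!/bin/bash
# Hypothetical post-stage hook: archive this stage's generated queries
# into the PBench output directory, tagged by stage id.
echo "Hook running for stage ${PBENCH_STAGE_ID}"
ARCHIVE_DIR="${PBENCH_OUTPUT_DIR}/${PBENCH_STAGE_ID}_queries"
mkdir -p "$ARCHIVE_DIR"
cp ./generated_queries/power/*.sql "$ARCHIVE_DIR"/ 2>/dev/null || true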
The full release notes are available for 1.2 and 1.2.1.
Implementing a Full TPC-DS Benchmark with PBench
The TPC-DS specification defines a multi-phase benchmark: data loading, a sequential power test, a concurrent throughput test, data maintenance operations, and a second throughput test. PBench’s DAG-based stage system maps naturally to this structure. Let’s walk through how to implement the complete TPC-DS benchmark lifecycle.
The Stage DAG
PBench benchmarks are defined as JSON stage files. Each stage specifies queries to run, session settings, and optionally a next field that points to child stages. Child stages execute in parallel after the parent completes, forming a DAG. Settings like catalog, schema, and session_params are inherited by child stages unless overridden.
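As a minimal illustration, a parent stage that sets the catalog and schema and fans out to two children might look like this (stage and file names are illustrative); both children inherit the catalog and schema unless they override them:

{
  "id": "parent",
  "catalog": "iceberg",
  "schema": "tpcds_sf1000_parquet",
  "queries": ["SHOW TABLES"],
  "next": ["child_a.json", "child_b.json"]
}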
The overall DAG looks like this:
data_load (CREATE TABLE, CTAS or INSERT SELECT, ANALYZE)
└→ power_test (99 queries, sequential)
└→ throughput_test_1 (N parallel streams)
└→ data_maintenance (INSERT/DELETE operations)
└→ throughput_test_2 (N parallel streams)

Phase 1: Data Loading
In TPC-DS, raw data is pre-generated as CSV flat files by the dsdgen tool — this happens outside PBench. The loading phase creates tables in your target format and populates them. A typical approach is to create external tables on the CSV files, then use CREATE TABLE AS SELECT (CTAS) or INSERT SELECT to load the data into Iceberg or Hive Parquet tables with proper type casting, partitioning schemes, and compression:
{
"id": "data_load",
"description": "Create tables, load from CSV source, and gather statistics",
"catalog": "iceberg",
"schema": "tpcds_sf1000_parquet",
"query_files": [
"./ddl/",
"./data_loading/",
"./analyze/"
],
"abort_on_error": true,
"next": ["power_test.json"]
}

Since query_files supports directories as of PBench 1.2.1, each directory is expanded to its contained SQL files in sorted order. Here ddl/ holds the CREATE TABLE statements, data_loading/ holds the INSERT SELECT or CTAS statements that populate the tables from the CSV-backed source, and analyze/ holds the ANALYZE statements for gathering column statistics. The directories are processed in the order listed, so the natural separation also gives you the correct execution sequence.
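For instance, a script in data_loading/ might use CTAS like the following. This is a sketch that assumes a CSV-backed source schema named hive.tpcds_text and the Iceberg connector's partitioning table property; names, columns, and properties are illustrative:

-- Load store_sales from the CSV-backed external table into a
-- partitioned Iceberg Parquet table with explicit type casts.
CREATE TABLE iceberg.tpcds_sf1000_parquet.store_sales
WITH (
  format = 'PARQUET',
  partitioning = ARRAY['ss_sold_date_sk']
)
AS
SELECT
  CAST(ss_sold_date_sk AS bigint)   AS ss_sold_date_sk,
  CAST(ss_item_sk AS bigint)        AS ss_item_sk,
  CAST(ss_customer_sk AS bigint)    AS ss_customer_sk,
  CAST(ss_quantity AS integer)      AS ss_quantity,
  CAST(ss_net_paid AS decimal(7,2)) AS ss_net_paid
  -- ... remaining columns follow the same pattern
FROM hive.tpcds_text.store_sales;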
When benchmarking across different table formats (Iceberg vs. Hive), partitioning schemes, or compression methods, the number of DDL scripts and configurations can quickly become unwieldy. PBench includes genddl and genconfig helper commands that generate these scripts and cluster configurations from templates, reducing the chance of human error when managing many variations.
Phase 2: Power Test
The power test runs all 99 TPC-DS queries sequentially on a single stream. There are two approaches depending on your use case.
For development and regression testing, you can use a fixed set of pre-generated queries with known expected row counts. This is useful during iterative development where you want fast feedback on correctness:
{
"id": "power_test",
"description": "TPC-DS Power Test: 99 pre-generated queries with row count validation",
"query_files": [
"queries/query_01.sql",
"queries/query_02.sql",
"...",
"queries/query_99.sql"
],
"cold_runs": 1,
"save_json": true,
"expected_row_counts": {
"tpcds_sf1000": [100, 2520, 9, "..."]
},
"next": ["throughput_test_1.json"]
}

The expected_row_counts field validates that each query returns the correct number of rows for the given scale factor, catching silent correctness regressions. For row-by-row correctness checking, set save_output: true to write the full query result to disk — you can then use pbench cmp to diff outputs between runs. Setting save_json: true captures the full Presto query JSON for each query, which we can later load into a database for deep analysis of the detailed metrics (more on this below).
For a spec-compliant TPC-DS run, queries are generated on the fly by dsqgen with a specific random seed, producing a unique query set each time. We use a pre-stage script to invoke dsqgen and point query_files at the output directory:
{
"id": "power_test",
"description": "TPC-DS Power Test: queries generated by dsqgen",
"pre_stage_scripts": ["./scripts/generate_power_queries.sh"],
"query_files": ["./generated_queries/power/"],
"cold_runs": 1,
"save_json": true,
"next": ["throughput_test_1.json"]
}

The generate_power_queries.sh script invokes dsqgen to produce the 99 queries into the directory that PBench will discover:
#!/bin/bash
SCALE_FACTOR=1000
SEED=12345
OUTPUT_DIR="./generated_queries/power"

mkdir -p "$OUTPUT_DIR"

dsqgen \
  -DIRECTORY ../query_templates \
  -INPUT ../query_templates/templates.lst \
  -SCALE "$SCALE_FACTOR" \
  -RNGSEED "$SEED" \
  -DIALECT presto \
  -OUTPUT_DIR "$OUTPUT_DIR"
Since query_files supports directories as of PBench 1.2.1, the generated SQL files are automatically discovered and executed in sorted order after the pre-stage script completes. In this mode there are no expected row counts — the queries are fresh from the generator and the focus is on performance measurement rather than regression checking.
Phase 3: Throughput Test
The TPC-DS throughput test runs N concurrent query streams, each executing all 99 queries in a different permutation order (defined in the spec’s Appendix D). There are two ways to model this in PBench.
Option A: Explicit streams via DAG. Define each stream as a separate stage file with its own query ordering, and fan them out from a parent stage:
{
  "id": "throughput_test_1",
  "description": "TPC-DS Throughput Test: 4 concurrent streams",
  "next": [
    "streams/stream_01.json",
    "streams/stream_02.json",
    "streams/stream_03.json",
    "streams/stream_04.json"
  ]
}

Each stream file (e.g., stream_01.json) lists the 99 queries in the spec-defined order for that stream and sets start_on_new_client: true so each stream gets its own Presto session. Because stages form a DAG, each stream also names data_maintenance.json as its next stage, handing off to the maintenance phase once the streams complete:
{
  "start_on_new_client": true,
  "query_files": [
    "../queries/query_96.sql",
    "../queries/query_07.sql",
    "../queries/query_75.sql",
    "..."
  ],
  "next": ["data_maintenance.json"]
}

PBench ships with 21 pre-built stream orderings matching the TPC-DS Appendix D specification.
Option B: stream_count for randomized throughput. New in PBench 1.2.1, if you don’t need the exact spec-defined orderings, you can use stream_count to spin up N parallel instances of a single stage:
{
"id": "throughput_test_1",
"description": "4 concurrent random streams, no duplicates within each stream",
"stream_count": 4,
"random_execution": true,
"randomly_execute_until": "99",
"no_random_duplicates": true,
"query_files": ["queries/query_01.sql", "...", "queries/query_99.sql"],
"next": ["data_maintenance.json"]
}

Each stream gets a deterministic seed derived from the base seed (seed + stream_index * 1000), so the entire run is reproducible from a single --seed value. For example, with --seed 42 and zero-based stream indexes, the four streams run with seeds 42, 1042, 2042, and 3042. The no_random_duplicates flag ensures each stream cycles through all 99 queries before repeating.
Phase 4: Data Maintenance
The TPC-DS spec includes data maintenance operations (INSERTs and DELETEs) between throughput tests. The refresh data is generated by dsdgen, so we use a pre_stage_scripts hook to generate the refresh flat files and create external staging tables on them, then execute the maintenance SQL:
#!/bin/bash
# scripts/generate_refresh_data.sh
SCALE_FACTOR=1000
UPDATE_SET=1
REFRESH_DIR="./refresh_data"

mkdir -p "$REFRESH_DIR"

dsdgen -SCALE "$SCALE_FACTOR" -UPDATE "$UPDATE_SET" -DIR "$REFRESH_DIR"
This produces new-row flat files and delete-key files for each affected fact table. The script can also create external staging tables pointing to these files, or that SQL can be part of the stage's queries.
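For example, a staging table over the delete-key files might be declared like this (a sketch assuming the Hive connector and an S3 location; table properties such as the field delimiter vary by Presto version, and dsdgen emits pipe-delimited text):

-- External staging table over the dsdgen delete-key flat files.
-- Only the key columns used by the DELETE statements are needed.
CREATE TABLE hive.tpcds_staging.catalog_sales_delete (
  cs_item_sk bigint,
  cs_order_number bigint
)
WITH (
  format = 'TEXTFILE',
  external_location = 's3://my-bucket/refresh_data/catalog_sales_delete/'
)

The maintenance queries then reference the staging tables: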
{
"id": "data_maintenance",
"description": "TPC-DS Data Maintenance: generate refresh data, then INSERT/DELETE",
"pre_stage_scripts": ["./scripts/generate_refresh_data.sh"],
"queries": [
"INSERT INTO catalog_sales SELECT * FROM catalog_sales_staging",
"INSERT INTO catalog_returns SELECT * FROM catalog_returns_staging",
"DELETE FROM catalog_sales WHERE cs_item_sk || cs_order_number IN (SELECT cs_item_sk || cs_order_number FROM catalog_sales_delete)",
"DELETE FROM catalog_returns WHERE cr_item_sk || cr_order_number IN (SELECT cr_item_sk || cr_order_number FROM catalog_returns_delete)"
],
"next": ["throughput_test_2.json"]
}

Phase 5: Second Throughput Test
The second throughput test is identical in structure to the first, running after data maintenance to measure performance on the modified dataset. Simply define another throughput stage with the same query set:
{
"id": "throughput_test_2",
"description": "Post-maintenance throughput test",
"stream_count": 4,
"random_execution": true,
"randomly_execute_until": "99",
"no_random_duplicates": true,
"query_files": ["queries/query_01.sql", "...", "queries/query_99.sql"]
}

Running the Full Benchmark
With the stage DAG defined, the entire multi-phase benchmark is launched with a single command:
# --mysql records per-query metrics for analysis and scoring
pbench run \
  --server-url http://presto-coordinator:8080 \
  --name "tpcds_sf1000_%t" \
  --output-path ./results \
  --mysql mysql_config.json \
  benchmarks/tpc-ds/sf1k.json benchmarks/tpc-ds/tpcds_full.json
The sf1k.json file sets the scale-factor-specific schema, and tpcds_full.json defines the DAG. PBench merges these, inheriting the schema across all stages. The --mysql flag records per-query metrics (duration, row count, success/failure, etc.) into a MySQL database, enabling you to analyze slow queries and compute the TPC-DS performance score from the recorded timings. Results are also written to local CSV files.
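For example, once the timings are in MySQL, a scoring-style aggregate is a short query away. A hedged sketch using the pbench_runs and pbench_queries tables that also appear in the comparative analysis below (the run name is whatever %t expanded to; the TPC-DS score proper combines all phases per the spec, while this just computes a power-test geometric mean):

-- Geometric mean of per-query durations for one run: exp(avg(ln(x))).
SELECT EXP(AVG(LN(q.duration_ms))) AS geomean_duration_ms
FROM pbench_queries q
JOIN pbench_runs r ON q.run_id = r.id
WHERE r.run_name = 'tpcds_sf1000_20240115'
  AND q.duration_ms > 0;  -- guard against LN(0)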
Offline Analysis with pbench loadjson
Running benchmarks is only half the story — understanding why performance differs between runs requires deep query-level analysis. This is where pbench loadjson and the Presto event listener come in.
Capturing Query Details
When PBench runs with save_json: true, it captures the full Presto query JSON (from the /v1/query/{id} API) for every query execution. These JSON files contain the complete query plan, operator statistics, stage-level metrics, and timing breakdowns.
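You can fetch the same JSON yourself for any query by hitting the coordinator API directly (the query ID here is hypothetical):

curl http://presto-coordinator:8080/v1/query/20240115_123456_00042_abcde > query.json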
Independently, Presto’s event listener can be configured to write query completion events to disk as JSON files. This captures all queries on a cluster — not just those from PBench — providing a complete picture of cluster activity during benchmarks.
Loading into a Database
pbench loadjson processes these JSON files and loads them into MySQL tables for structured analysis:
pbench loadjson \
  --mysql mysql_config.json \
  --name "tpcds_sf1000_analysis" \
  ./results/tpcds_sf1000/
When --mysql is provided to pbench run, run metadata (run name, timing, query durations, success/failure) is automatically recorded to the database. The detailed per-query metrics (operator stats, stage stats, query plans) come from a separate path — either via save_json: true in PBench or by enabling a Presto event listener plugin that writes query JSON to disk. pbench loadjson then parses these JSON files; note that it also requires --mysql in order to write to the database.
This populates five tables (DDL is available in event_listener_ddl.sql):
- presto_query_creation_info — query text, catalog, schema, session properties, user, and resource group for each query
- presto_query_statistics — top-level execution metrics: wall time, CPU time, peak memory, input/output rows and bytes, queued time, and failure info
- presto_query_stage_stats — per-stage breakdown (each Presto execution stage, not PBench stages): CPU time, I/O, memory, and GC statistics
- presto_query_operator_stats — per-operator metrics: CPU and wall time, memory reservations, input/output rows for each scan, join, aggregation, etc.
- presto_query_plans — query plan in both text and JSON format for plan comparison across runs
Together these tables let you drill down from a slow query all the way to the specific operator and stage responsible.
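For example, to see where a single slow query spent its time, rank its operators by wall time. A sketch reusing the operator-stats columns from the comparison queries below (the query ID is hypothetical):

-- Rank one query's operators by total wall time.
SELECT operator_type,
       SUM(wall_nanos) / 1e9 AS wall_time_sec,
       SUM(input_rows)       AS total_input_rows
FROM presto_query_operator_stats
WHERE query_id = '20240115_123456_00042_abcde'
GROUP BY operator_type
ORDER BY wall_time_sec DESC
LIMIT 10;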
Comparative Analysis
With results from multiple PBench runs loaded into the same database, you can perform comparative analysis using SQL:
-- Compare query durations between two runs
SELECT a.query_file,
       a.duration_ms AS baseline_ms,
       b.duration_ms AS candidate_ms,
       ROUND((b.duration_ms - a.duration_ms) / a.duration_ms * 100, 1) AS pct_change
FROM pbench_queries a
JOIN pbench_queries b
  ON a.query_file = b.query_file
 AND a.sequence_no = b.sequence_no
WHERE a.run_id = (SELECT id FROM pbench_runs WHERE run_name = 'baseline_run')
  AND b.run_id = (SELECT id FROM pbench_runs WHERE run_name = 'candidate_run')
ORDER BY pct_change DESC;
-- Find operators with the largest wall time increase in the candidate build,
-- comparing per-operator totals between the two runs.
SELECT a.operator_type,
       b.wall_sec - a.wall_sec     AS wall_time_increase_sec,
       b.input_rows - a.input_rows AS input_rows_increase
FROM (SELECT operator_type,
             SUM(wall_nanos) / 1e9 AS wall_sec,
             SUM(input_rows)       AS input_rows
      FROM presto_query_operator_stats
      WHERE query_id IN (SELECT query_id FROM pbench_queries WHERE run_id = 1)
      GROUP BY operator_type) a
JOIN (SELECT operator_type,
             SUM(wall_nanos) / 1e9 AS wall_sec,
             SUM(input_rows)       AS input_rows
      FROM presto_query_operator_stats
      WHERE query_id IN (SELECT query_id FROM pbench_queries WHERE run_id = 2)
      GROUP BY operator_type) b
  ON a.operator_type = b.operator_type
ORDER BY wall_time_increase_sec DESC;
This turns PBench + event listener data into a queryable performance analysis platform. You can identify exactly which operators regressed, which stages are bottlenecked, and how resource consumption changed between Presto versions.
Beyond Standard Benchmarks: Real-World Performance Testing
Standard benchmarks like TPC-DS are valuable, but production workloads often behave very differently. PBench provides three tools for testing with real-world query patterns.
Capturing Production Schemas with pbench save
Before you can replay production queries against a test cluster, you need to reproduce the schema. pbench save exports table metadata, column statistics, and partition information from a live cluster:
pbench save \
  --server-url http://production-presto:8080 \
  --catalog hive \
  --schema production_db \
  --output-path ./saved_schemas \
  --parallel 8 \
  customers orders transactions
This generates one JSON file per table, named {catalog}_{schema}_{table}.json (e.g., hive_production_db_orders.json). Each file captures the full DDL and column-level statistics:
{
"name": "orders",
"catalog": "hive",
"schema": "production_db",
"ddl": "CREATE TABLE hive.production_db.orders (\n \"order_id\" bigint,\n \"customer_id\" bigint,\n \"order_date\" date,\n \"total_amount\" decimal(12,2),\n \"status\" varchar,\n \"region\" varchar\n)\nWITH (\n format = 'PARQUET',\n partitioned_by = ARRAY['region']\n)",
"columnStats": [
{
"column_name": "order_id",
"distinct_values_count": 5000000,
"nulls_fraction": 0,
"low_value": "1",
"high_value": "5000000",
"data_type": "bigint"
},
{
"column_name": "total_amount",
"distinct_values_count": 48923,
"nulls_fraction": 0.002,
"low_value": "0.99",
"high_value": "9999.99",
"data_type": "decimal(12,2)"
},
{
"column_name": "region",
"data_size": 42000000,
"distinct_values_count": 12,
"nulls_fraction": 0,
"data_type": "varchar",
"extra": "partition key"
},
{"row_count": 5000000}
],
"rowCount": 5000000
}

This gives you the complete DDL to recreate the table on a test cluster, along with column statistics (cardinality, null fractions, min/max values, data sizes) that inform the query optimizer. For bulk capture, you can pass a CSV file listing all tables:
pbench save \
  --server-url http://production-presto:8080 \
  -f tables.csv \
  --output-path ./saved_schemas
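Once schemas are captured, one way to recreate the tables on a test cluster is to extract the ddl field from each saved file and feed it to the Presto CLI. A minimal sketch, assuming jq and the presto CLI are installed (the coordinator URL is illustrative):

#!/bin/bash
# Re-apply every saved DDL against the test coordinator.
for f in ./saved_schemas/*.json; do
  ddl=$(jq -r '.ddl' "$f")
  presto --server http://test-presto:8080 --execute "$ddl"
done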
A/B Testing with pbench forward
pbench forward monitors a source Presto cluster and mirrors every incoming query to one or more target clusters in real time. This enables transparent A/B testing between Presto versions or configurations without modifying any client applications:
pbench forward \
  --server-url http://current-presto:8080 \
  --server-url http://candidate-presto:8080 \
  --poll-interval 5s \
  --exclude "^(EXPLAIN|DESCRIBE|SHOW)" \
  --schema-mapping prod_schema,test_schema \
  --output-path ./forward_results \
  --name "v0.286_vs_v0.287"
This forwards all queries from the current cluster to the candidate, excluding DDL/metadata queries. The --schema-mapping flag handles cases where the test cluster uses a different schema name. You can add multiple --server-url targets to test more than two configurations simultaneously.
pbench forward also supports query rewriting via regex patterns, useful when table names or function names differ between versions:
pbench forward \
  --server-url http://source:8080 \
  --server-url http://target:8080 \
  --replace "old_udf\((.*?)\)" "new_udf(\1)" \
  --replace "legacy_table" "migrated_table"
Traffic Replay with pbench replay
While forward mirrors live traffic, replay executes previously recorded traffic from a CSV file. This is useful for reproducible testing — record a production workload once, then replay it against multiple cluster configurations:
pbench replay \
  --server-url http://test-presto:8080 \
  --parallel 150 \
  --name "peak_hour_replay" \
  --output-path ./replay_results \
  workload_capture.csv
The CSV file contains query metadata captured from production (query text, original timing, catalog, schema, session properties). PBench replays queries with the original inter-query timing to simulate realistic load patterns, and respects the parallelism limit to avoid overwhelming the test cluster.
An End-to-End Real-World Testing Workflow
Here’s how these tools come together for a complete Presto version upgrade validation:
Step 1: Capture the baseline. Save production schemas and run standard benchmarks against the current version:
# Export schemas
pbench save --server-url http://current:8080 \
  --catalog hive --schema prod -f all_tables.csv \
  --output-path ./schemas

# Run TPC-DS baseline
pbench run --server-url http://current:8080 \
  --name "baseline_tpcds_%t" --mysql mysql.json \
  --output-path ./results \
  benchmarks/tpc-ds/sf1k.json benchmarks/tpc-ds/ds_full.json
Step 2: Live A/B testing. Forward production traffic to the candidate cluster and collect real-world comparison data:
pbench forward \
  --server-url http://current:8080 \
  --server-url http://candidate:8080 \
  --exclude "^(EXPLAIN|DESCRIBE|SHOW)" \
  --output-path ./forward_results \
  --name "upgrade_ab_test"
Step 3: Replay and compare. Take a recorded peak-hour workload and replay it against both clusters:
pbench replay --server-url http://current:8080 \
  --name "peak_replay_baseline" workload.csv

pbench replay --server-url http://candidate:8080 \
  --name "peak_replay_candidate" workload.csv
Step 4: Run TPC-DS on the candidate. Run the same standard benchmark on the new version:
pbench run --server-url http://candidate:8080 \
  --name "candidate_tpcds_%t" --mysql mysql.json \
  --output-path ./results \
  benchmarks/tpc-ds/sf1k.json benchmarks/tpc-ds/ds_full.json
Step 5: Deep analysis. Load all query JSON files into the database and compare:
# Load event listener output from both clusters
pbench loadjson --mysql mysql.json \
  --name "baseline" ./results/baseline_tpcds/

pbench loadjson --mysql mysql.json \
  --name "candidate" ./results/candidate_tpcds/
Now you can query the MySQL database to compare operator-level metrics, identify regressions, and validate that the candidate version meets your performance bar — all without writing a single line of custom tooling.
Step 6: Validate correctness. Use pbench cmp to diff query outputs between the two runs:
pbench cmp \
  ./results/baseline_tpcds/ \
  ./results/candidate_tpcds/ \
  --output-path ./diffs
This generates diffs for any queries that returned different results, catching correctness regressions that performance metrics alone would miss.
Getting Started
Check out the PBench 1.2.1 release for pre-built binaries for macOS and Linux, or build from source:
git clone https://github.com/prestodb/pbench.git
cd pbench
make install
The PBench Wiki has comprehensive documentation on configuring stages, setting up result databases, and writing benchmark suites.
Whether you’re running standard TPC-DS benchmarks, validating a Presto version upgrade with production traffic, or building a continuous performance regression pipeline, PBench 1.2.1 provides the building blocks to do it declaratively and reproducibly. We welcome contributions — check out the GitHub repository to get involved.