Executing Presto on Spark#
Presto on Spark makes it possible to leverage Spark as an execution framework for Presto queries. This is useful for queries that we want to run on thousands of nodes, requires 10s or 100s of terabytes of memory, and consume many CPU years.
Spark provides several values adds like resource isolation, fine grained resource management, and Spark’s scalable materialized exchange mechanism.
Download the Presto Spark package tarball, presto-spark-package-0.285.1.tar.gz and the Presto Spark launcher, presto-spark-launcher-0.285.1.jar. Keep both the files at, say, example directory. We assume here a two node Spark cluster with four cores each, thus giving us eight total cores.
The following is an example
The details about properties are available at Properties Reference.
task.max-worker-threads are set to 4 each, since we have four cores per executor
and want to synchronize with the relevant Spark submit arguments below.
These values should be adjusted to keep all executor cores busy and
synchronize with spark-submit parameters.
To execute Presto on Spark, first start your Spark cluster, which we will assume have the URL spark://spark-master:7077. Keep your time consuming query in a file called, say, query.sql. Run spark-submit command from the example directory created earlier:
/spark/bin/spark-submit \ --master spark://spark-master:7077 \ --executor-cores 4 \ --conf spark.task.cpus=4 \ --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \ presto-spark-launcher-0.285.1.jar \ --package presto-spark-package-0.285.1.tar.gz \ --config /presto/etc/config.properties \ --catalogs /presto/etc/catalogs \ --catalog hive \ --schema default \ --file query.sql
The details about configuring catalogs are at Catalog Properties. In
Spark submit arguments, note the values of executor-cores (number of cores per
executor in Spark) and spark.task.cpus (number of cores to allocate to each task
in Spark). These are also equal to the number of cores (4 in this case) and are
same as some of the
config.properties settings discussed above. This is to ensure that
a single Presto on Spark task is run in a single Spark executor (This limitation may be
temporary and is introduced to avoid duplicating broadcasted hash tables for every