Alluxio Cache Service

Overview

A common optimization to improve Presto query latency is to cache the working set to avoid unnecessary I/O from remote data sources or through a slow network. This section describes how to leverage Alluxio as a caching layer for Presto.

Alluxio File System serves the Presto Hive connector as an independent distributed caching file system on top of HDFS or object stores such as AWS S3, GCP, and Azure Blob Storage. Users can inspect cache usage and control the cache explicitly through a file system interface. For example, one can preload all files in an Alluxio directory to warm the cache for Presto queries, and set a TTL (time-to-live) on cached data to reclaim cache capacity.

The Presto Hive connector can connect to AlluxioFileSystem as a Hadoop-compatible file system, on top of other persistent storage systems.

Setup

First, configure ${PRESTO_HOME}/etc/catalog/hive.properties to use the Hive connector.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083

Second, ensure the Alluxio client jar is already in ${PRESTO_HOME}/plugin/hive-hadoop2/ on all Presto servers. If it is not, download the Alluxio binary, extract the tarball to ${ALLUXIO_HOME}, and copy the Alluxio client jar ${ALLUXIO_HOME}/client/alluxio-<VERSION>-client.jar into this directory. Then restart the Presto service:

$ ${PRESTO_HOME}/bin/launcher restart
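
A quick check on each Presto server can confirm that the client jar is where the Hive plugin expects it. The following is a minimal sketch, not part of Presto or Alluxio: find_alluxio_client_jar is a hypothetical helper name, and it assumes the jar follows the alluxio-<VERSION>-client.jar naming used above.

```shell
# Hypothetical helper: print every Alluxio client jar found in a directory.
# Returns 0 if at least one jar matches, 1 otherwise.
find_alluxio_client_jar() {
  jar_dir="$1"
  jar_status=1
  for jar in "$jar_dir"/alluxio-*-client.jar; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -f "$jar" ]; then
      echo "$jar"
      jar_status=0
    fi
  done
  return "$jar_status"
}

# Example usage (uncomment on a Presto server):
# find_alluxio_client_jar "${PRESTO_HOME}/plugin/hive-hadoop2" \
#   || echo "Alluxio client jar not found; copy it before restarting Presto"
```

Running the check on every server before the restart helps catch a node where the jar was never copied.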

Third, configure Hive Metastore to connect to Alluxio File System when serving Presto. Edit ${HIVE_HOME}/conf/hive-env.sh to include the Alluxio client jar on the Hive classpath:

export HIVE_AUX_JARS_PATH=${ALLUXIO_HOME}/client/alluxio-<VERSION>-client.jar

Then restart Hive Metastore:

$ ${HIVE_HOME}/hcatalog/sbin/hcat_server.sh start

Query

After completing the basic configuration, Presto should be able to access Alluxio File System through tables whose location is an alluxio:// address. Refer to the Hive Connector documentation to learn how to configure Alluxio file system in Presto. Here is a simple example that starts Alluxio locally and mounts an S3 directory read-only at /example:

$ cd ${ALLUXIO_HOME}
$ bin/alluxio-start.sh local -f
$ bin/alluxio fs mount --readonly /example \
   s3://apc999/presto-tutorial/example-reason/

Start a Presto CLI and connect it to the Presto server configured in the Setup section.

Download presto-cli-0.288.1-executable.jar, rename it to presto, make it executable with chmod +x, then run it:

$ ./presto --server localhost:8080 --catalog hive --debug
presto> use default;
USE

Create a new table based on the file mounted in Alluxio:

presto:default> DROP TABLE IF EXISTS reason;
DROP TABLE
presto:default> CREATE TABLE reason (
  r_reason_sk integer,
  r_reason_id varchar,
  r_reason_desc varchar
) WITH (
  external_location = 'alluxio://localhost:19998/example',
  format = 'PARQUET'
);
CREATE TABLE

Scan the newly created table on Alluxio:

presto:default> SELECT * FROM reason LIMIT 3;
 r_reason_sk |   r_reason_id    |                r_reason_desc
-------------+------------------+---------------------------------------------
           1 | AAAAAAAABAAAAAAA | Package was damaged
           4 | AAAAAAAAEAAAAAAA | Not the product that was ordred
           5 | AAAAAAAAFAAAAAAA | Parts missing

Basic Operations

With Alluxio File System as the caching layer, this approach supports the following features:

  • Preloading: Users can proactively load the working set into Alluxio with commands such as alluxio fs distributedLoad, in addition to caching data transparently based on the data access pattern.

  • Read/write Types and Data Policies: Users can customize the read and write modes Presto uses when reading from and writing to Alluxio. For example, they can tell Presto to skip caching data read from certain locations to avoid cache thrashing, or set TTLs on files in given locations using alluxio fs setTtl.

  • Check Working Set: Users can verify which files are cached to understand and optimize Presto performance. For example, users can check the output of the Alluxio command alluxio fs ls, or browse the corresponding files in the Alluxio web UI.

  • Check Resource Utilization: System admins can monitor how much of the cache capacity on each node is in use with alluxio fsadmin report, and plan resources accordingly.
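
The four operations above can be sketched as a short script. This is an illustration under assumptions rather than a recommended procedure: it assumes an Alluxio 2.x command line where these subcommands are available and where setTtl reads a bare number as milliseconds. The run and alluxio_cache_ops names, the /example path, and the one-day TTL are illustrative choices. By default the script only prints the commands; set DRY_RUN=0 to execute them against a running cluster.

```shell
# Dry-run wrapper: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply the four basic operations to one Alluxio path.
# $1: Alluxio path, $2: TTL (assumed to be in milliseconds).
alluxio_cache_ops() {
  cache_path="$1"
  ttl_ms="$2"
  run alluxio fs distributedLoad "$cache_path"   # preload the working set
  run alluxio fs setTtl "$cache_path" "$ttl_ms"  # set a TTL on the path
  run alluxio fs ls "$cache_path"                # list files under the path
  run alluxio fsadmin report capacity            # report cache capacity usage
}

# Illustrative values: the /example mount and a TTL of one day.
alluxio_cache_ops /example 86400000
```

What happens to a file when its TTL expires depends on the Alluxio version and configuration, so check the setTtl help output for your release before relying on it.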