Delta Lake Connector
To configure the Delta Lake connector, create a catalog properties file
etc/catalog/delta.properties with the following contents,
replacing the properties as appropriate:
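A minimal catalog file might look like the following sketch. The connector name delta and the metastore host and port are assumptions to adapt for your environment:

```properties
connector.name=delta
hive.metastore.uri=thrift://example.net:9083
```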
The following configuration properties are available:

- The URI(s) of the Hive metastore where Delta Lake tables are registered. The connector relies on the Hive metastore to find the location of a Delta Lake table; the table's schema and data file list are then read from the Delta log at that location.
- Enable pushing nested column dereferences into the table scan, so that only the required fields selected in a nested structure are read.
The Delta Lake connector reuses many of the modules existing in the Hive connector, such as the connectivity and security modules for S3, Azure Data Lake, and the Glue metastore. The configuration for these modules is therefore the same as described in the Hive connector documentation.
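For example, when the table data lives in S3, the S3 credential properties documented for the Hive connector can be set in the Delta catalog properties file. A sketch with placeholder credentials:

```properties
# S3 credentials, reusing the Hive connector's S3 module
# (replace the placeholder values with real credentials)
hive.s3.aws-access-key=your-access-key
hive.s3.aws-secret-key=your-secret-key
```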
Querying Delta Lake Tables
SELECT * FROM sales.apac.sales_data LIMIT 200;
In the above query:

- sales refers to the Delta Lake catalog.
- apac refers to the database in the Hive metastore.
- sales_data refers to the Delta Lake table registered in the Hive metastore.
If the table is not registered in the Hive metastore, it can be registered using the following DDL command.
When registering a table in the Hive metastore, the full schema of the table is not required in the DDL, because the Delta Lake connector reads the schema from the metadata at the Delta Lake table location. To avoid a "no columns" error from the Hive metastore, provide a single dummy column as the schema of the Delta table being registered.
Create a new Delta table named sales_data_new in the apac schema, whose Delta Lake table location is in an S3 bucket named db-sa-datasets, using the Delta Lake connector:
CREATE TABLE sales.apac.sales_data_new (dummyColumn INT) WITH (external_location = 's3://db-sa-datasets/presto/sales_data_new');
To register a partitioned Delta table in the Hive metastore, use the same CREATE TABLE statement as above. external_location is still required in the table properties; there is no need to specify the partition columns, as the partitioning information is read from the Delta log.
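As a sketch, registering a partitioned table takes the same shape of statement as the unpartitioned case; the table name and location below are illustrative:

```sql
-- Same DDL shape as for an unpartitioned table: a dummy column plus
-- external_location; partition columns come from the Delta log.
CREATE TABLE sales.apac.sales_data_partitioned (dummyColumn INT)
WITH (external_location = 's3://db-sa-datasets/presto/sales_data_partitioned');
```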
Another option is to query the table directly, using the table location as the table name.
SELECT * FROM sales."$path$"."s3://db-sa-datasets/presto/sales_data" LIMIT 200;
In the above query, the schema name $path$ indicates that the table name is a path. The table name s3://db-sa-datasets/presto/sales_data is the path where the Delta Lake table is located. The path-based option allows users to query a Delta table without registering it in the Hive metastore.
To query a specific snapshot of a Delta Lake table, append the snapshot identifier as a suffix to the table name:
SELECT * FROM sales.apac."sales_data@v4" LIMIT 200;
The above query reads data from snapshot version 4 of the sales_data table.
To query the snapshot of a Delta Lake table as of a particular time, append the timestamp as a suffix to the table name:
SELECT * FROM sales.apac."sales_data@t2021-11-18 09:45" LIMIT 200;
The above query reads data from the latest snapshot of the sales_data table as of the timestamp 2021-11-18 09:45.
DROP TABLE sales.apac.sales_data_new;
The above query drops the external table sales.apac.sales_data_new. This only removes the metadata for the table; the referenced data directory is not deleted.