Time Travel for Iceberg Tables in Presto
Introduction
Presto, an open-source distributed SQL query engine, excels at querying large data sets distributed across diverse data sources. It has maintained its position as a high-performance data analytics tool for over a decade.
As data collection capabilities expand, businesses increasingly recognize the importance of historical data alongside current data. With the recent release of Presto version 0.286, a powerful capability has emerged: time travel queries on iceberg tables.
Time Travel queries, utilizing constructs like “AS OF” predicates, allow you to delve into historical data, analyze trends, retrieve specific information, reconstruct datasets, and even audit the evolution of data — all without requiring any modifications to table definitions.
With the introduction of Presto release 0.286, users now can execute time travel queries on Iceberg tables using either the snapshot ID or timestamp of the records. This release specifically supports time travel queries based on SYSTEM_VERSION and SYSTEM_TIME.
Time Travel Use Case
In an actual airline reservation system database, various tables exist for flight schedules, customer reservations, mileage rewards, and mappings between airport codes and city names. However, for this article, we will focus on a simplified version of the flight table that specifically tracks flight schedules. Whenever a flight is delayed or cancelled, the original scheduling data for that flight is updated with the current flight time information and status.
Iceberg Table Creation
To accurately follow the history of flight delays and cancellations, the table that tracks flight schedules must keep all the deleted and inserted rows, not just information about the current state of each flight. This example uses Version 2 of the Apache Iceberg spec as version_format and copy-on-write mode.
create table FLIGHT (
FlightNo VARCHAR,
DepartFrom VARCHAR,
ArriveTo VARCHAR,
DepartDate DATE,
DepartTime TIME,
Status VARCHAR,
SeatsAvailable INTEGER,
SeatsReserved INTEGER,
Overbooked BOOLEAN,
ChangeReason VARCHAR
) WITH (format_version = '2', delete_mode = 'copy-on-write', partitioning = Array[‘FlightNo]) ;
Note that the DepartDate and DepartTime columns of the table represent “application times”. These are distinct from “system time or version”, which reflects when flight data was inserted or deleted in Presto. Time travel queries in Presto retrieve current and historical data based on “snapshot id” or “committed time”. The application may also query the table using standard Presto SQL on application time columns, e.g., to find flights departing on a particular day or in a particular week, or may query historical data based on both system time (or version) and application time.
Flight Table Data
Let’s say the following records are added to the flight table over some time.
insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time '09:30:00', 'ON-TIME', 300, 200, false, NULL);
insert into FLIGHT values ('ABC456', 'SFO', 'DCA', date '2024-04-01’ time '16:45:00', 'ON-TIME', 300, 250, false, NULL);
insert into FLIGHT values ('ABC789', 'JFK', 'DEV', date '2024-04-02’, time '19:50:00', 'ON-TIME', 350, 270, false, NULL);
insert into FLIGHT values ('XYZ111', 'SFO', 'DEN', date '2024-04-02’, time '15:10:00', 'ON-TIME', 300, 260, false, NULL);
insert into FLIGHT values ('XYZ222', 'DCA', 'SFO', date '2024-04-03’, time '08:15:00', 'ON-TIME', 350, 290, false, NULL);
Flight rows are deleted and inserted again when a flight is cancelled or delayed. Presto has associated records in snapshots table with snapshot id and committed timestamp.
Time Travel Queries
Let’s see how the analytics team can use Presto time travel to obtain information about changes that were made to a specific flight.
Flight record ABC123 was added to the table on ‘2024–03–28’, say.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT
where FlightNo = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+---------
ABC123 | JFK | SFO | 2024-04-01 | 09:30:00.000 | ON-TIME
Later, the flight time (DepartTime) and status are updated to reflect multiple changes such as delay and cancellation.
delete from flight where flightno = 'ABC123';
insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time '11:00:00', 'DELAYED', 300, 200, false, 'CREW');
delete from flight where flightno = 'ABC123';
insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time '16:00:00', 'DELAYED', 300, 200, false, 'ENGINE');
delete from flight where flightno = 'ABC123';
insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time ’19:00:00', ‘CANCELLED’, 300, 200, false, 'ENGINE');
Now you have the latest flight status, which is cancelled in the flight table:
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT
where FlightNo = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+-----------
ABC123 | JFK | SFO | 2024-04-01 | 19:00:00.000 | CANCELLED
You can refer to the snapshots table for the changes made to the flight table:
select committed_at, snapshot_id, parent_id, operation
from "flight$snapshots";
committed_at | snapshot_id | parent_id | operation
---------------------------------------------+---------------------+---------------------+-----------
2024-03-28 15:27:42.528 America/Los_Angeles | 4708704013741143824 | NULL | append
2024-03-28 15:31:30.899 America/Los_Angeles | 8807630723385960024 | 4708704013741143824 | append
2024-03-28 15:36:03.864 America/Los_Angeles | 3692326879904985834 | 8807630723385960024 | append
2024-03-28 15:37:54.295 America/Los_Angeles | 4447539382147915857 | 3692326879904985834 | append
2024-03-28 15:40:08.127 America/Los_Angeles | 9082795235713066824 | 4447539382147915857 | append
2024-03-28 15:44:30.125 America/Los_Angeles | 4125791514565872106 | 9082795235713066824 | delete
2024-03-28 15:45:32.332 America/Los_Angeles | 2984842478748584283 | 4125791514565872106 | append
2024-03-28 15:46:51.795 America/Los_Angeles | 8335379082715663714 | 2984842478748584283 | delete
2024-03-28 15:47:40.018 America/Los_Angeles | 8730097130590891085 | 8335379082715663714 | append
2024-03-28 15:50:52.065 America/Los_Angeles | 7190230469857312220 | 8730097130590891085 | delete
2024-03-28 15:52:14.170 America/Los_Angeles | 8096230320602953325 | 7190230469857312220 | append
By utilizing the snapshot ID (9082795235713066824) and the “AS OF” time travel clause, you can retrieve the original flight record for flight ‘ABC123’. When querying with Presto, the state of the flight table will be returned based on an exact match of the snapshot ID.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for system_version as of 9082795235713066824
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+---------
ABC123 | JFK | SFO | 2024-04-01 | 09:30:00.000 | ON-TIME
In the above example, VERSION can be used as an alias of SYSTEM_VERSION.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for version as of 9082795235713066824
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+---------
ABC123 | JFK | SFO | 2024-04-01 | 09:30:00.000 | ON-TIME
Presto can perform time travel queries based on timestamp. In this case, the query returns the table’s state using the table snapshot id that is closest to the specified timestamp. In this example, timestamp for committed_at column is used to retrieve the original flight record.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for system_time as of timestamp '2024-03-28 15:40:08.127 America/Los_Angeles'
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+---------
ABC123 | JFK | SFO | 2024-03-28 | 09:30:00.000 | ON-TIME
In the above example, TIMESTAMP can be used as an alias of SYSTEM_TIME.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for timestamp as of timestamp '2024-03-28 15:40:08.127 America/Los_Angeles'
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+---------
ABC123 | JFK | SFO | 2024-03-28 | 09:30:00.000 | ON-TIME
The airline can use CURRENT_TIMESTAMP in the time travel query to get the latest record.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for system_time as of current_timestamp
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+-----------
ABC123 | JFK | SFO | 2024-04-01 | 19:00:00.000 | CANCELLED
The airline can find out the first status change from the timestamp ‘2024–03–28 15:45:32.332 America/Los_Angeles’.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for system_time as of timestamp '2024-03-28 15:45:32.332 America/Los_Angeles'
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+---------
ABC123 | JFK | SFO | 2024-04-01 | 11:00:00.000 | DELAYED
When you use version 7190230469857312220 which is a snapshot of the deleted record, the time travel query will not return any rows.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for system_version as of 7190230469857312220
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+------------+--------
(0 rows)
Similar case for timestamp ‘2024–03–28 15:50:52.065 America/Los_Angeles’ which is a timestamp of a deleted record.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status f
from FLIGHT for system_time as of timestamp '2024-03-28 15:50:52.065 America/Los_Angeles'
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+------------+--------
(0 rows)
When performing a time travel query based on a timestamp, Presto will return the state of the table data at the specified timestamp. In this scenario, airlines have the flexibility to use any timestamp within the range between two snapshots to retrieve the closest state.
In this case, you can retrieve the closest state for timestamp “2024–03–28 15:52:14.170 America/Los_Angeles”, by using a lower timestamp value.
select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status
from FLIGHT for system_time as of timestamp '2024-03-28 15:52:14.000 America/Los_Angeles'
where flightno = 'ABC123';
FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status
----------+------------+----------+------------+--------------+-----------
ABC123 | JFK | SFO | 2024-03-28 | 16:00:00.000 | DELAYED
Time Travel Analytics Cases
Since “AS OF” clause is attached to the table, it can be used in various analytics. Airlines can build reports based on two different snapshot ids.
select * from flight for system_version as of 9082795235713066824 where flightno = 'ABC123'
union all
select * from flight for system_version as of 8096230320602953325 where flightno = 'ABC123';
flightno | departfrom | arriveto | departdate | departtime | status | seatsavailable | seatsreserved | overbooked | changereason
----------+------------+----------+------------+--------------+-----------+----------------+---------------+------------+--------------
ABC123 | JFK | SFO | 2024-03-28 | 19:00:00.000 | CANCELLED | 300 | 200 | false | ENGINE
ABC123 | JFK | SFO | 2024-03-28 | 09:30:00.000 | ON-TIME | 300 | 200 | false | NULL
The following example shows airline can save flight status changes as a history table.
create table flight_v1 as select * from flight for system_version as of 9082795235713066824;
select * from flight_v1;
flightno | departfrom | arriveto | departdate | departtime | status | seatsavailable | seatsreserved | overbooked | changereason
----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
XYZ222 | DCA | SFO | 2024-03-30 | 08:15:00.000 | ON-TIME | 350 | 290 | false | NULL
ABC456 | SFO | DCA | 2024-03-28 | 16:45:00.000 | ON-TIME | 300 | 250 | false | NULL
ABC789 | JFK | DEV | 2024-03-29 | 19:50:00.000 | ON-TIME | 350 | 270 | false | NULL
XYZ111 | SFO | DEN | 2024-03-29 | 15:10:00.000 | ON-TIME | 300 | 260 | false | NULL
ABC123 | JFK | SFO | 2024-03-28 | 09:30:00.000 | ON-TIME | 300 | 200 | false | NULL
Create a new flight state flight_v2 based on a specific snapshot id.
create table flight_v2 as select * from flight for system_version as of 2984842478748584283;
select * from flight_v2;
flightno | departfrom | arriveto | departdate | departtime | status | seatsavailable | seatsreserved | overbooked | changereason
----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
ABC123 | JFK | SFO | 2024-03-28 | 11:00:00.000 | DELAYED | 300 | 200 | false | CREW
ABC456 | SFO | DCA | 2024-03-28 | 16:45:00.000 | ON-TIME | 300 | 250 | false | NULL
ABC789 | JFK | DEV | 2024-03-29 | 19:50:00.000 | ON-TIME | 350 | 270 | false | NULL
XYZ111 | SFO | DEN | 2024-03-29 | 15:10:00.000 | ON-TIME | 300 | 260 | false | NULL
XYZ222 | DCA | SFO | 2024-03-30 | 08:15:00.000 | ON-TIME | 350 | 290 | false | NULL
Airlines can perform analytics based on both flight history tables.
select * from flight_v1 union select * from flight_v2 order by flightno;
flightno | departfrom | arriveto | departdate | departtime | status | seatsavailable | seatsreserved | overbooked | changereason
----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
ABC123 | JFK | SFO | 2024-03-28 | 09:30:00.000 | ON-TIME | 300 | 200 | false | NULL
ABC123 | JFK | SFO | 2024-03-28 | 11:00:00.000 | DELAYED | 300 | 200 | false | CREW
ABC456 | SFO | DCA | 2024-03-28 | 16:45:00.000 | ON-TIME | 300 | 250 | false | NULL
ABC789 | JFK | DEV | 2024-03-29 | 19:50:00.000 | ON-TIME | 350 | 270 | false | NULL
XYZ111 | SFO | DEN | 2024-03-29 | 15:10:00.000 | ON-TIME | 300 | 260 | false | NULL
XYZ222 | DCA | SFO | 2024-03-30 | 08:15:00.000 | ON-TIME | 350 | 290 | false | NULL
Another example is to build a flight status change report.
select * from flight_v1 where flightno = 'ABC123'
union all
select * from flight_v2 where flightno = 'ABC123';
flightno | departfrom | arriveto | departdate | departtime | status | seatsavailable | seatsreserved | overbooked | changereason
----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
ABC123 | JFK | SFO | 2024-03-28 | 11:00:00.000 | DELAYED | 300 | 200 | false | CREW
ABC123 | JFK | SFO | 2024-03-28 | 09:30:00.000 | ON-TIME | 300 | 200 | false | NULL
Time Travel Queries Across Historical Periods
With the recent release of Presto version 0.286, users can now utilize a “changelog” table to view row-level changes that have occurred to the table in a specific order over time. In this context, the “changelog” table enables time-travel queries across a historical period based on ranges.
In the following example, an airline can build a change log since the first snapshot id.
select operation, ordinal, snapshotid , rowdata.flightno
from "flight$changelog" order by ordinal;
operation | ordinal | snapshotid | flightno
-----------+---------+---------------------+----------
INSERT | 0 | 8807630723385960024 | ABC456
INSERT | 1 | 3692326879904985834 | ABC789
INSERT | 2 | 4447539382147915857 | XYZ111
INSERT | 3 | 9082795235713066824 | XYZ222
DELETE | 4 | 4125791514565872106 | ABC123
INSERT | 5 | 2984842478748584283 | ABC123
DELETE | 6 | 8335379082715663714 | ABC123
INSERT | 7 | 8730097130590891085 | ABC123
DELETE | 8 | 7190230469857312220 | ABC123
INSERT | 9 | 8096230320602953325 | ABC123
The airline creates a new history table for flight ABC123. In this case, duplicate records are generated for deleted entries.
create table flight_abc123_change as select rowdata.flightno, rowdata.departdate, rowdata.departtime , rowdata.status, rowdata.changereason from "flight$changelog@8096230320602953325" where rowdata.flightno = 'ABC123';
select * from flight_abc123_change order by departtime;
flightno | departdate | departtime | status | changereason
----------+------------+--------------+-----------+--------------
ABC123 | 2024-03-28 | 09:30:00.000 | ON-TIME | NULL
ABC123 | 2024-03-28 | 11:00:00.000 | DELAYED | CREW
ABC123 | 2024-03-28 | 11:00:00.000 | DELAYED | CREW
ABC123 | 2024-03-28 | 16:00:00.000 | DELAYED | ENGINE
ABC123 | 2024-03-28 | 16:00:00.000 | DELAYED | ENGINE
ABC123 | 2024-03-28 | 19:00:00.000 | CANCELLED | ENGINE
You can eliminate duplicate records based on delete and insert operation.
Now, the airline can build a change report starting from the original flight record for ABC123.
select distinct FlightNo, Departdate , DepartTime, Status, changereason
from flight_abc123_change order by departtime;
FlightNo | Departdate | DepartTime | Status | changereason
----------+------------+--------------+-----------+--------------
ABC123 | 2024-03-28 | 09:30:00.000 | ON-TIME | NULL
ABC123 | 2024-03-28 | 11:00:00.000 | DELAYED | CREW
ABC123 | 2024-03-28 | 16:00:00.000 | DELAYED | ENGINE
ABC123 | 2024-03-28 | 19:00:00.000 | CANCELLED | ENGINE
Conclusion
The capabilities of Presto time travel queries allow for convenient retrieval of historical data. In this case, these queries facilitate the production of audit reports and trend analysis across flights, airports, and specific time frames.
Note: In time travel query examples, you can use TIMESTAMP as an alias for SYSTEM_TIME, and VERSION as an alias for SYSTEM_VERSION.
References