Time Travel for Iceberg Tables in Presto

    Introduction

    Presto, an open-source distributed SQL query engine, excels at querying large data sets distributed across diverse data sources. It has maintained its position as a high-performance data analytics tool for over a decade.

    As data collection capabilities expand, businesses increasingly recognize the importance of historical data alongside current data. With the recent release of Presto version 0.286, a powerful capability has emerged: time travel queries on iceberg tables.

    Time Travel queries, utilizing constructs like “AS OF” predicates, allow you to delve into historical data, analyze trends, retrieve specific information, reconstruct datasets, and even audit the evolution of data — all without requiring any modifications to table definitions.

    With the introduction of Presto release 0.286, users now can execute time travel queries on Iceberg tables using either the snapshot ID or timestamp of the records. This release specifically supports time travel queries based on SYSTEM_VERSION and SYSTEM_TIME.

    Time Travel Use Case

    In an actual airline reservation system database, various tables exist for flight schedules, customer reservations, mileage rewards, and mappings between airport codes and city names. However, for this article, we will focus on a simplified version of the flight table that specifically tracks flight schedules. Whenever a flight is delayed or cancelled, the original scheduling data for that flight is updated with the current flight time information and status.

    Iceberg Table Creation

    To accurately follow the history of flight delays and cancellations, the table that tracks flight schedules must keep all the deleted and inserted rows, not just information about the current state of each flight. This example uses Version 2 of the Apache Iceberg spec as version_format and copy-on-write mode.

    create table FLIGHT (
      FlightNo         VARCHAR,
      DepartFrom       VARCHAR,
      ArriveTo         VARCHAR,
      DepartDate       DATE,
      DepartTime       TIME,
      Status           VARCHAR,
      SeatsAvailable   INTEGER,
      SeatsReserved    INTEGER,
      Overbooked       BOOLEAN,
      ChangeReason     VARCHAR
    ) WITH (format_version = '2', delete_mode = 'copy-on-write', partitioning = Array[‘FlightNo]) ;

    Note that the DepartDate and DepartTime columns of the table represent “application times”. These are distinct from “system time or version”, which reflects when flight data was inserted or deleted in Presto. Time travel queries in Presto retrieve current and historical data based on “snapshot id” or “committed time”. The application may also query the table using standard Presto SQL on application time columns, e.g., to find flights departing on a particular day or in a particular week, or may query historical data based on both system time (or version) and application time.

    Flight Table Data

    Let’s say the following records are added to the flight table over some time.

    insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time '09:30:00', 'ON-TIME', 300, 200, false, NULL);
    insert into FLIGHT values ('ABC456', 'SFO', 'DCA', date '2024-04-01’ time '16:45:00', 'ON-TIME', 300, 250, false, NULL);
    insert into FLIGHT values ('ABC789', 'JFK', 'DEV', date '2024-04-02’, time '19:50:00', 'ON-TIME', 350, 270, false, NULL);
    insert into FLIGHT values ('XYZ111', 'SFO', 'DEN', date '2024-04-02’, time '15:10:00', 'ON-TIME', 300, 260, false, NULL);
    insert into FLIGHT values ('XYZ222', 'DCA', 'SFO', date '2024-04-03’, time '08:15:00', 'ON-TIME', 350, 290, false, NULL);

    Flight rows are deleted and inserted again when a flight is cancelled or delayed. Presto has associated records in snapshots table with snapshot id and committed timestamp.

    Time Travel Queries

    Let’s see how the analytics team can use Presto time travel to obtain information about changes that were made to a specific flight.

    Flight record ABC123 was added to the table on ‘2024–03–28’, say.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT 
     where FlightNo = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  | Status  
    ----------+------------+----------+------------+--------------+---------
     ABC123   | JFK        | SFO      | 2024-04-01 | 09:30:00.000 | ON-TIME 

    Later, the flight time (DepartTime) and status are updated to reflect multiple changes such as delay and cancellation.

    delete from flight where flightno = 'ABC123';
    insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time '11:00:00', 'DELAYED', 300, 200, false, 'CREW');
    
    delete from flight where flightno = 'ABC123';
    insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time '16:00:00', 'DELAYED', 300, 200, false, 'ENGINE'); 
    
    delete from flight where flightno = 'ABC123';
    insert into FLIGHT values ('ABC123', 'JFK', 'SFO', date '2024-04-01', time ’19:00:00', ‘CANCELLED’, 300, 200, false, 'ENGINE');
    

    Now you have the latest flight status, which is cancelled in the flight table:

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT 
     where FlightNo = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  |  Status   
    ----------+------------+----------+------------+--------------+-----------
     ABC123   | JFK        | SFO      | 2024-04-01 | 19:00:00.000 | CANCELLED

    You can refer to the snapshots table for the changes made to the flight table:

    select committed_at, snapshot_id, parent_id, operation 
     from "flight$snapshots";
    
                    committed_at                 |     snapshot_id     |      parent_id      | operation 
    ---------------------------------------------+---------------------+---------------------+-----------
     2024-03-28 15:27:42.528 America/Los_Angeles | 4708704013741143824 | NULL                | append    
     2024-03-28 15:31:30.899 America/Los_Angeles | 8807630723385960024 | 4708704013741143824 | append    
     2024-03-28 15:36:03.864 America/Los_Angeles | 3692326879904985834 | 8807630723385960024 | append    
     2024-03-28 15:37:54.295 America/Los_Angeles | 4447539382147915857 | 3692326879904985834 | append    
     2024-03-28 15:40:08.127 America/Los_Angeles | 9082795235713066824 | 4447539382147915857 | append    
     2024-03-28 15:44:30.125 America/Los_Angeles | 4125791514565872106 | 9082795235713066824 | delete    
     2024-03-28 15:45:32.332 America/Los_Angeles | 2984842478748584283 | 4125791514565872106 | append    
     2024-03-28 15:46:51.795 America/Los_Angeles | 8335379082715663714 | 2984842478748584283 | delete    
     2024-03-28 15:47:40.018 America/Los_Angeles | 8730097130590891085 | 8335379082715663714 | append    
     2024-03-28 15:50:52.065 America/Los_Angeles | 7190230469857312220 | 8730097130590891085 | delete    
     2024-03-28 15:52:14.170 America/Los_Angeles | 8096230320602953325 | 7190230469857312220 | append    

    By utilizing the snapshot ID (9082795235713066824) and the “AS OF” time travel clause, you can retrieve the original flight record for flight ‘ABC123’. When querying with Presto, the state of the flight table will be returned based on an exact match of the snapshot ID.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for system_version as of 9082795235713066824 
     where flightno = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  | Status  
    ----------+------------+----------+------------+--------------+---------
     ABC123   | JFK        | SFO      | 2024-04-01 | 09:30:00.000 | ON-TIME 

    In the above example, VERSION can be used as an alias of SYSTEM_VERSION.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for version as of 9082795235713066824 
     where flightno = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  | Status  
    ----------+------------+----------+------------+--------------+---------
     ABC123   | JFK        | SFO      | 2024-04-01 | 09:30:00.000 | ON-TIME 

    Presto can perform time travel queries based on timestamp. In this case, the query returns the table’s state using the table snapshot id that is closest to the specified timestamp. In this example, timestamp for committed_at column is used to retrieve the original flight record.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for system_time as of timestamp '2024-03-28 15:40:08.127 America/Los_Angeles' 
     where flightno = 'ABC123';
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  | Status  
    ----------+------------+----------+------------+--------------+---------
     ABC123   | JFK        | SFO      | 2024-03-28 | 09:30:00.000 | ON-TIME 

    In the above example, TIMESTAMP can be used as an alias of SYSTEM_TIME.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for timestamp as of timestamp '2024-03-28 15:40:08.127 America/Los_Angeles' 
     where flightno = 'ABC123';
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  | Status  
    ----------+------------+----------+------------+--------------+---------
     ABC123   | JFK        | SFO      | 2024-03-28 | 09:30:00.000 | ON-TIME

    The airline can use CURRENT_TIMESTAMP in the time travel query to get the latest record.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for system_time as of current_timestamp 
     where flightno = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  |  Status   
    ----------+------------+----------+------------+--------------+-----------
     ABC123   | JFK        | SFO      | 2024-04-01 | 19:00:00.000 | CANCELLED 

    The airline can find out the first status change from the timestamp ‘2024–03–28 15:45:32.332 America/Los_Angeles’.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for system_time as of timestamp '2024-03-28 15:45:32.332 America/Los_Angeles' 
     where flightno = 'ABC123';
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  | Status  
    ----------+------------+----------+------------+--------------+---------
     ABC123   | JFK        | SFO      | 2024-04-01 | 11:00:00.000 | DELAYED

    When you use version 7190230469857312220 which is a snapshot of the deleted record, the time travel query will not return any rows.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for system_version as of 7190230469857312220 
     where flightno = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status 
    ----------+------------+----------+------------+------------+--------
    (0 rows)

    Similar case for timestamp ‘2024–03–28 15:50:52.065 America/Los_Angeles’ which is a timestamp of a deleted record.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status f
     from FLIGHT for system_time as of timestamp '2024-03-28 15:50:52.065 America/Los_Angeles' 
     where flightno = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate | DepartTime | Status 
    ----------+------------+----------+------------+------------+--------
    (0 rows)

    When performing a time travel query based on a timestamp, Presto will return the state of the table data at the specified timestamp. In this scenario, airlines have the flexibility to use any timestamp within the range between two snapshots to retrieve the closest state.

    In this case, you can retrieve the closest state for timestamp “2024–03–28 15:52:14.170 America/Los_Angeles”, by using a lower timestamp value.

    select FlightNo, DepartFrom, ArriveTo, DepartDate, DepartTime, Status 
     from FLIGHT for system_time as of timestamp '2024-03-28 15:52:14.000 America/Los_Angeles' 
     where flightno = 'ABC123';
    
     FlightNo | DepartFrom | ArriveTo | DepartDate |  DepartTime  |  Status   
    ----------+------------+----------+------------+--------------+-----------
     ABC123   | JFK        | SFO      | 2024-03-28 | 16:00:00.000 | DELAYED

    Time Travel Analytics Cases

    Since “AS OF” clause is attached to the table, it can be used in various analytics. Airlines can build reports based on two different snapshot ids.

    select * from flight for system_version as of 9082795235713066824 where flightno = 'ABC123' 
    union all 
    select * from flight for system_version as of 8096230320602953325 where flightno = 'ABC123';
    
     flightno | departfrom | arriveto | departdate |  departtime  |  status   | seatsavailable | seatsreserved | overbooked | changereason 
    ----------+------------+----------+------------+--------------+-----------+----------------+---------------+------------+--------------
     ABC123   | JFK        | SFO      | 2024-03-28 | 19:00:00.000 | CANCELLED |            300 |           200 | false      | ENGINE       
     ABC123   | JFK        | SFO      | 2024-03-28 | 09:30:00.000 | ON-TIME   |            300 |           200 | false      | NULL  

    The following example shows airline can save flight status changes as a history table.

    create table flight_v1 as select * from flight for system_version as of 9082795235713066824;
    
    select * from flight_v1;
    
     flightno | departfrom | arriveto | departdate |  departtime  | status  | seatsavailable | seatsreserved | overbooked | changereason 
    ----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
     XYZ222   | DCA        | SFO      | 2024-03-30 | 08:15:00.000 | ON-TIME |            350 |           290 | false      | NULL         
     ABC456   | SFO        | DCA      | 2024-03-28 | 16:45:00.000 | ON-TIME |            300 |           250 | false      | NULL         
     ABC789   | JFK        | DEV      | 2024-03-29 | 19:50:00.000 | ON-TIME |            350 |           270 | false      | NULL         
     XYZ111   | SFO        | DEN      | 2024-03-29 | 15:10:00.000 | ON-TIME |            300 |           260 | false      | NULL         
     ABC123   | JFK        | SFO      | 2024-03-28 | 09:30:00.000 | ON-TIME |            300 |           200 | false      | NULL    

    Create a new flight state flight_v2 based on a specific snapshot id.

    create table flight_v2 as select * from flight for system_version as of 2984842478748584283;
    
    select * from flight_v2;
    
     flightno | departfrom | arriveto | departdate |  departtime  | status  | seatsavailable | seatsreserved | overbooked | changereason 
    ----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
     ABC123   | JFK        | SFO      | 2024-03-28 | 11:00:00.000 | DELAYED |            300 |           200 | false      | CREW         
     ABC456   | SFO        | DCA      | 2024-03-28 | 16:45:00.000 | ON-TIME |            300 |           250 | false      | NULL         
     ABC789   | JFK        | DEV      | 2024-03-29 | 19:50:00.000 | ON-TIME |            350 |           270 | false      | NULL         
     XYZ111   | SFO        | DEN      | 2024-03-29 | 15:10:00.000 | ON-TIME |            300 |           260 | false      | NULL         
     XYZ222   | DCA        | SFO      | 2024-03-30 | 08:15:00.000 | ON-TIME |            350 |           290 | false      | NULL

    Airlines can perform analytics based on both flight history tables.

    select * from flight_v1 union select * from flight_v2 order by flightno;
    
    flightno | departfrom | arriveto | departdate |  departtime  | status  | seatsavailable | seatsreserved | overbooked | changereason 
    ----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
     ABC123   | JFK        | SFO      | 2024-03-28 | 09:30:00.000 | ON-TIME |            300 |           200 | false      | NULL         
     ABC123   | JFK        | SFO      | 2024-03-28 | 11:00:00.000 | DELAYED |            300 |           200 | false      | CREW         
     ABC456   | SFO        | DCA      | 2024-03-28 | 16:45:00.000 | ON-TIME |            300 |           250 | false      | NULL         
     ABC789   | JFK        | DEV      | 2024-03-29 | 19:50:00.000 | ON-TIME |            350 |           270 | false      | NULL         
     XYZ111   | SFO        | DEN      | 2024-03-29 | 15:10:00.000 | ON-TIME |            300 |           260 | false      | NULL         
     XYZ222   | DCA        | SFO      | 2024-03-30 | 08:15:00.000 | ON-TIME |            350 |           290 | false      | NULL    

    Another example is to build a flight status change report.

    select * from flight_v1 where flightno = 'ABC123' 
    union all 
    select * from flight_v2 where flightno = 'ABC123';
    
     flightno | departfrom | arriveto | departdate |  departtime  | status  | seatsavailable | seatsreserved | overbooked | changereason 
    ----------+------------+----------+------------+--------------+---------+----------------+---------------+------------+--------------
     ABC123   | JFK        | SFO      | 2024-03-28 | 11:00:00.000 | DELAYED |            300 |           200 | false      | CREW         
     ABC123   | JFK        | SFO      | 2024-03-28 | 09:30:00.000 | ON-TIME |            300 |           200 | false      | NULL        

    Time Travel Queries Across Historical Periods

    With the recent release of Presto version 0.286, users can now utilize a “changelog” table to view row-level changes that have occurred to the table in a specific order over time. In this context, the “changelog” table enables time-travel queries across a historical period based on ranges.

    In the following example, an airline can build a change log since the first snapshot id.

    select operation, ordinal, snapshotid , rowdata.flightno 
    from "flight$changelog" order by ordinal;
    
     operation | ordinal |     snapshotid      | flightno 
    -----------+---------+---------------------+----------
     INSERT    |       0 | 8807630723385960024 | ABC456   
     INSERT    |       1 | 3692326879904985834 | ABC789   
     INSERT    |       2 | 4447539382147915857 | XYZ111   
     INSERT    |       3 | 9082795235713066824 | XYZ222   
     DELETE    |       4 | 4125791514565872106 | ABC123   
     INSERT    |       5 | 2984842478748584283 | ABC123   
     DELETE    |       6 | 8335379082715663714 | ABC123   
     INSERT    |       7 | 8730097130590891085 | ABC123   
     DELETE    |       8 | 7190230469857312220 | ABC123   
     INSERT    |       9 | 8096230320602953325 | ABC123   

    The airline creates a new history table for flight ABC123. In this case, duplicate records are generated for deleted entries.

    create table flight_abc123_change as select rowdata.flightno, rowdata.departdate, rowdata.departtime , rowdata.status, rowdata.changereason from "flight$changelog@8096230320602953325" where rowdata.flightno = 'ABC123';
    
    select * from flight_abc123_change order by departtime;
    
     flightno | departdate |  departtime  |  status   | changereason 
    ----------+------------+--------------+-----------+--------------
     ABC123   | 2024-03-28 | 09:30:00.000 | ON-TIME   | NULL
     ABC123   | 2024-03-28 | 11:00:00.000 | DELAYED   | CREW
     ABC123   | 2024-03-28 | 11:00:00.000 | DELAYED   | CREW        
     ABC123   | 2024-03-28 | 16:00:00.000 | DELAYED   | ENGINE       
     ABC123   | 2024-03-28 | 16:00:00.000 | DELAYED   | ENGINE       
     ABC123   | 2024-03-28 | 19:00:00.000 | CANCELLED | ENGINE

    You can eliminate duplicate records based on delete and insert operation.
    Now, the airline can build a change report starting from the original flight record for ABC123.

    select distinct FlightNo, Departdate , DepartTime, Status, changereason 
     from flight_abc123_change order by departtime;
    
     FlightNo | Departdate |  DepartTime  |  Status   | changereason 
    ----------+------------+--------------+-----------+--------------
     ABC123   | 2024-03-28 | 09:30:00.000 | ON-TIME   | NULL         
     ABC123   | 2024-03-28 | 11:00:00.000 | DELAYED   | CREW         
     ABC123   | 2024-03-28 | 16:00:00.000 | DELAYED   | ENGINE       
     ABC123   | 2024-03-28 | 19:00:00.000 | CANCELLED | ENGINE

    Conclusion

    The capabilities of Presto time travel queries allow for convenient retrieval of historical data. In this case, these queries facilitate the production of audit reports and trend analysis across flights, airports, and specific time frames.

    Note: In time travel query examples, you can use TIMESTAMP as an alias for SYSTEM_TIME, and VERSION as an alias for SYSTEM_VERSION.

    References