Even Faster Unnest

    Unnest is a common operation in Facebook’s daily Presto workload. It converts an ARRAY, MAP, or ROW into a flat relation. Its original implementation always deep-copied data and was very inefficient. In Unnest Operator Performance Enhancement with Dictionary Blocks, the author improved the Unnest operator by up to 10x in CPU and elapsed time by using DictionaryBlock when possible. We went one step further and improved it by another 5-10x.

    In this post, we will refer to the PrestoSQL implementation as “baseline”. The JMH Benchmark results for different cases are shown below (“before” is the baseline implementation, and “after” is our new implementation):

    The optimized Unnest implementation is available in Presto 0.235 and onward and is enabled by default. The JMH benchmark was also expanded to have better coverage and can be found in BenchmarkUnnestOperator.java.

    In Facebook’s production workload, we observed over a 6x CPU reduction on this operator. The following chart shows the pairwise comparison for all unnest operators in one of our test runs. Every dot below the diagonal is a win. About 25% of operators show over a 5x CPU reduction, and some show over a 20x reduction.

    The histogram of the after vs. before CPU ratio shows that most operators have a ratio less than 1, meaning most of them became more efficient.

    The following chart shows that this operator’s share of total operator CPU dropped from about 2% to 0.1% after the roll-out on May 28. The amount of CPU, though small, still matters at Facebook’s scale.

    What is the Unnest Operator?

    When users have data structured as an ARRAY, MAP, or ROW, they sometimes need to flatten it so that the nested elements become top-level rows that can be sent to downstream operators for easier arithmetic or aggregation processing. An example query is as follows:

    SELECT
        zoo, animal
    FROM (
        VALUES
            ('OaklandZoo', ARRAY['dog', 'cat', 'tiger']),
            ('SanFranciscoZoo', ARRAY['dog', 'cat'])
    ) AS x (zoo, animals)
    CROSS JOIN UNNEST(animals) AS t (animal);

    In this example, the zoo column, which is a VARCHAR column, is replicated across the unnested rows and is thus called the “replicated column”; the animals column, which is an ARRAY(VARCHAR) column, is unnested into a VARCHAR column named animal and is an “unnest column”:

    | zoo             | animal |
    | ----------------| ------ |
    | OaklandZoo      | dog    |
    | OaklandZoo      | cat    |
    | OaklandZoo      | tiger  |
    | SanFranciscoZoo | dog    |
    | SanFranciscoZoo | cat    |

    In Presto, relational data is represented internally as a series of Pages. Pages are composed of Blocks, one for each column. For replicated columns, the Unnest operator needs to create new Blocks from the original Blocks, with the data replicated within each original row. The baseline implementation achieved this by creating a DictionaryBlock on top of the original Block, thus avoiding expensive deep copies. The same was done for unnest columns when there was no need to insert nulls: DictionaryBlocks were created, and the array offsets were translated into indices of the underlying block.
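
    To make this concrete, here is a minimal sketch (our own illustration, not the actual operator code) of how a replicated column can be expressed as a DictionaryBlock: each output row’s id simply points back at its original top-level row, so no values are copied.

    static Block replicate(Block source, int[] cardinality)
    {
        // cardinality[row] is the number of unnested rows produced by each
        // top-level row; every output position reuses a source value by index
        int totalEntries = 0;
        for (int count : cardinality) {
            totalEntries += count;
        }
        int[] ids = new int[totalEntries];
        int out = 0;
        for (int row = 0; row < cardinality.length; row++) {
            for (int j = 0; j < cardinality[row]; j++) {
                ids[out++] = row;
            }
        }
        return new DictionaryBlock(totalEntries, source, ids);
    }

    For the zoo example above, cardinality = {3, 2} yields ids = {0, 0, 0, 1, 1}, which reproduces the replication pattern in the result table without copying a single zoo string.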

    What happens when there are multiple unnest columns and their cardinalities do not match? Take a look at this example:

    SELECT
        zoo,
        animal,
        employee
    FROM (
        VALUES
            ('OaklandZoo', ARRAY['dog', 'cat', 'tiger'], ARRAY['Alice', 'Bob']),
            ('SanFranciscoZoo', ARRAY['dog', 'cat'], ARRAY['Clara', 'Danny'])
    ) AS x (zoo, animals, employees)
    CROSS JOIN UNNEST(animals, employees) AS t (animal, employee);

    The result is:

    | zoo             | animal | employee |
    | --------------- | ------ | -------- |
    | OaklandZoo      | dog    | Alice    |
    | OaklandZoo      | cat    | Bob      |
    | OaklandZoo      | tiger  | NULL     |
    | SanFranciscoZoo | dog    | Clara    |
    | SanFranciscoZoo | cat    | Danny    |

    Note that a NULL is inserted in the last column of the third row. In this case, can we still create a DictionaryBlock on top of the original block? The original nested block does not contain a NULL, so there is no index pointing to a NULL value from which to build a DictionaryBlock easily. The baseline implementation simply falls back to deep copying for this case. In the sections below, we explain how we tackled this problem.

    Opportunities for Improvements? Yes!

    The baseline implementation did a number of things well. It avoided deep copies by using DictionaryBlock, which improved on the original implementation by an order of magnitude for cases without NULL insertions. Can we improve it further? The answer is yes.

    Process data column-by-column, not row-by-row

    If you read our blog post 5 design choices—and 1 weird trick — to get 2x efficiency gains in Presto repartitioning, you may remember the design choice “Process data column by column, not row by row”. The same applies to UnnestOperator. The baseline implementation constructs the output blocks in a row-by-row manner:

    for each incoming row
        for each replicated column
            append the repeated values by adding ids to the DictionaryBlock
        for each unnest column
            append the unnested values by adding ids to the DictionaryBlock if possible;
            otherwise append the values to the BlockBuilder
        if there is an ordinality column
            append the ordinality to the BlockBuilder

    A key principle of vectorized execution is to use tight loops, which allow the JIT compiler to vectorize the generated machine code and allow the CPU to parallelize the work as much as possible. Processing the data in a column-by-column manner makes this possible, and an example of such a loop, for building a replicated block, looks like this:

    public Block buildOutputBlock(int[] maxEntries, int offset, int length, int totalEntries)
    {
        // source and sourcePosition are fields of the surrounding class:
        // the original Block for this column and the next row to consume
        int[] ids = new int[totalEntries];
        int fromPosition = 0;
        for (int i = 0; i < length; i++) {
            // repeat the current source row once per unnested output row
            int toPosition = fromPosition + maxEntries[offset + i];
            Arrays.fill(ids, fromPosition, toPosition, sourcePosition++);
            fromPosition = toPosition;
        }
    
        return new DictionaryBlock(totalEntries, source, ids);
    }

    This simple change improved the performance of the NO-NULL case by at least 3x as shown by the JMH benchmarks.

    Computation of maxEntries array and batch size

    To make the above loop possible, we pre-compute the per-row cardinalities of all unnest columns and use them to derive the maximum cardinality for each row (we call it the maxEntries array) in the operator. This way the memory can be accessed sequentially and the computation can be reused. These arrays require very little memory overhead, since they only cover top-level rows, which should always number fewer than 1024 per block.
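
    A sketch of this pre-computation (the names are ours, not the actual implementation’s), where cardinalities[c][row] holds the cardinality of unnest column c at a given top-level row:

    static int[] computeMaxEntries(int[][] cardinalities, int positionCount)
    {
        // maxEntries[row] is the largest cardinality across all unnest
        // columns for that row, i.e. how many output rows it expands into;
        // the scan goes column by column so memory is accessed sequentially
        int[] maxEntries = new int[positionCount];
        for (int[] columnCardinalities : cardinalities) {
            for (int row = 0; row < positionCount; row++) {
                maxEntries[row] = Math.max(maxEntries[row], columnCardinalities[row]);
            }
        }
        return maxEntries;
    }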

    We then compute the batch size, i.e., the number of top-level rows that can fit into the next output page. Right now we cap the nested row count for each batch at 1000, with a minimum batch size of 1. This means the unnested row count for a batch might exceed 1000: for example, if a single row contains 10000 array entries, we still have to output that whole top-level row, which translates into 10000 unnested rows.
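
    A sketch of the batch-size computation under these rules (the constant name is ours):

    private static final int MAX_NESTED_ROWS_PER_BATCH = 1000;

    static int computeBatchSize(int[] maxEntries, int offset, int positionCount)
    {
        // greedily take top-level rows until their unnested row count would
        // exceed the cap, but always take at least one row, which is why a
        // single 10000-entry array still produces a 10000-row batch
        int nestedRows = 0;
        int batchSize = 0;
        for (int i = offset; i < positionCount; i++) {
            if (batchSize > 0 && nestedRows + maxEntries[i] > MAX_NESTED_ROWS_PER_BATCH) {
                break;
            }
            nestedRows += maxEntries[i];
            batchSize++;
        }
        return batchSize;
    }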

    Bonus: No need to create DictionaryBlock at all for some cases!

    There are several cases in which we can tell that no NULLs need to be inserted (a sketch of the check for case 2 follows the list):

    1. When there is only one unnest column (except the ARRAY of ROWs case).
    2. When there are multiple unnest columns, but their cardinalities are the same for every row in the current batch.
    3. When there is a single ARRAY of ROWs unnest column and the row block doesn’t contain any nulls.
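
    Case 2 can be detected, for instance, with a scan over the pre-computed cardinalities from the previous section (a sketch with our own names, not the actual implementation):

    static boolean cardinalitiesMatch(int[][] cardinalities, int[] maxEntries, int offset, int batchSize)
    {
        // every unnest column must match maxEntries on every row of the
        // batch; otherwise some column needs NULL padding
        for (int[] columnCardinalities : cardinalities) {
            for (int row = offset; row < offset + batchSize; row++) {
                if (columnCardinalities[row] != maxEntries[row]) {
                    return false;
                }
            }
        }
        return true;
    }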

    For these cases, the result can simply be acquired by using getRegion() to get a view on top of the original leaf block:

    public Block buildOutputBlockWithoutNulls(int outputPositionCount)
    {
        // getRegion() returns a zero-copy view of the underlying block;
        // source and sourcePosition are fields, as in buildOutputBlock above
        Block output = source.getRegion(sourcePosition, outputPositionCount);
        sourcePosition += outputPositionCount;
        return output;
    }

    This doesn’t copy the data, nor does it construct the ids mapping and DictionaryBlock. Nice!

    A better way to copy Blocks

    How do we handle the cases where NULLs need to be inserted? We can formulate the problem as a simple question: given a Block without NULLs, how do we insert a NULL efficiently?

    There are two ways of doing this:

    1. Use a special id, e.g. -1, to identify NULL in the DictionaryBlock.
    2. Copy the blocks and insert a NULL.

    Currently, DictionaryBlock doesn’t support the logic in 1). To make it happen, we would need to change every getId() and getIds() call site to handle the special value. While the number of call sites is manageable (about 50), the performance implications would need to be studied carefully.
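
    For illustration, here is roughly what option 1 would demand of a single call site (hypothetical code; DictionaryBlock has no such sentinel support today):

    static boolean isNullAt(DictionaryBlock block, int position)
    {
        // with -1 reserved as a NULL marker in the ids array, every one of
        // the ~50 getId()/getIds() call sites would need a branch like this
        // before dereferencing the dictionary
        int id = block.getId(position);
        return id == -1 || block.getDictionary().isNull(id);
    }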

    The baseline implementation took the second approach, using PageBuilder and BlockBuilder to copy the values. However, JMH benchmarks show that they are not very efficient for this purpose. To compare the cost of copying blocks using PageBuilder and BlockBuilder versus doing a bulk copy using tight loops, we measured the following two implementations:

    1. Using PageBuilder and BlockBuilder:

    @Benchmark
    public void copyBlockByAppend(BenchmarkData data, Blackhole blackhole)
    {
        for (int i = 0; i < BLOCK_COUNT; i++) {
            Block block = data.blocks.get(i);
            int positionCount = block.getPositionCount();
    
            // copy position by position through the type's appendTo()
            LongArrayBlockBuilder longArrayBlockBuilder = new LongArrayBlockBuilder(null, POSITIONS_PER_PAGE);
            for (int j = 0; j < positionCount; j++) {
                BIGINT.appendTo(block, j, longArrayBlockBuilder);
            }
    
            // consume the result so the JIT cannot eliminate the work
            blackhole.consume(longArrayBlockBuilder.build());
        }
    }
    2. Using bulk copy:

    @Benchmark
    public void copyBlockByLoop(BenchmarkData data, Blackhole blackhole)
    {
        for (int i = 0; i < BLOCK_COUNT; i++) {
            Block block = data.blocks.get(i);
            int positionCount = block.getPositionCount();
    
            // bulk-copy the values through the Block interface in a tight loop
            long[] values = new long[positionCount];
            for (int j = 0; j < positionCount; j++) {
                values[j] = block.getLong(j);
            }
    
            // copyIsNulls() is a benchmark helper that copies the null flags
            boolean[] valueIsNull = copyIsNulls(block);
    
            // consume the result so the JIT cannot eliminate the work
            blackhole.consume(new LongArrayBlock(positionCount, Optional.of(valueIsNull), values));
        }
    }

    Even though the second code snippet copies through the Block interface, it was over 4x faster than the first one:

    We thus concluded that copying Blocks this way could give us acceptable performance. To keep the code clean and well encapsulated, we added a new method appendNull() to the Block interface. For all primitive types like BOOLEAN, SMALLINT, INTEGER, BIGINT, etc., we create a new Block with all positions copied and a NULL appended at the end of the block. For VariableWidthBlock, ArrayBlock, MapBlock, and RowBlock, we can get hold of the underlying elements block or the raw slice and then construct the offsets array, so there is no need to deep-copy the nested structure. To do this, we need to convert the ArrayBlock, MapBlock, and RowBlock into the corresponding ColumnarXXX objects, but the cost of the conversion is well worth it.
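
    As an illustration, a simplified appendNull() for a fixed-width long block might look like the following (the real implementations live in the individual Block classes and handle details like internal array offsets):

    @Override
    public Block appendNull()
    {
        // copy the values and null flags into arrays one entry longer and
        // mark the last position NULL; only these top-level arrays are
        // copied, so this stays a bulk copy in a tight loop
        long[] newValues = Arrays.copyOf(values, positionCount + 1);
    
        boolean[] newValueIsNull = new boolean[positionCount + 1];
        if (valueIsNull != null) {
            System.arraycopy(valueIsNull, 0, newValueIsNull, 0, positionCount);
        }
        newValueIsNull[positionCount] = true;
    
        return new LongArrayBlock(positionCount + 1, Optional.of(newValueIsNull), newValues);
    }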

    Further reading