Set Digest functions

MinHash, or the min-wise independent permutations locality-sensitive hashing scheme, is a technique used in computer science to quickly estimate how similar two sets are. MinHash serves as a probabilistic data structure that estimates the Jaccard similarity coefficient, which measures the overlap between two sets as a fraction of the total number of unique elements in both sets. Presto offers several functions that deal with the MinHash technique.

MinHash is commonly used in data mining to detect near-duplicate web pages at scale. With this information, search engines can avoid showing two nearly identical pages in the same set of search results.
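For reference, the exact Jaccard similarity coefficient that MinHash approximates can be computed directly when the sets are small. The query below is a minimal sketch that uses the Presto array functions array_intersect and array_union on two literal arrays. The arrays and the column names a and b are illustrative and do not come from any real table.

```sql
-- Exact Jaccard similarity: |A intersect B| / |A union B|.
-- A = {1, 2}, B = {1, 2, 3, 4}: the intersection has 2 elements, the union has 4.
SELECT CAST(cardinality(array_intersect(a, b)) AS double)
           / cardinality(array_union(a, b)) AS exact_jaccard
FROM (VALUES (ARRAY[1, 2], ARRAY[1, 2, 3, 4])) AS t(a, b);
-- 0.5
```

On large data sets, building and comparing full arrays like this becomes expensive, which is the case the sketch-based functions on this page are meant for.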

Data structures

Presto implements Set Digest data sketches by encapsulating two components, HyperLogLog and MinHash. Both are sketching techniques for handling large data sets.

HyperLogLog (HLL): HyperLogLog is an algorithm used to estimate the cardinality of a set — that is, the number of distinct elements in a large data set. Presto uses it to provide the function approx_distinct which can be used to estimate the number of distinct entries in a column.

Examples:

SELECT approx_distinct(column_name) FROM table_name;
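To see how the estimate relates to the exact answer, the two can be computed side by side. This is a small sketch over an inline VALUES table; the table alias t and the column name value are illustrative.

```sql
-- Compare the exact distinct count with the HyperLogLog-based estimate.
SELECT count(DISTINCT value) AS exact_count,
       approx_distinct(value) AS approx_count
FROM (VALUES 1, 2, 2, 3, 3, 3) AS t(value);
```

On an input this small the two columns are expected to agree. The estimate only becomes approximate as the number of distinct values grows.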

MinHash: MinHash is used to estimate the similarity between two or more sets, commonly known as Jaccard similarity. It is particularly effective when dealing with large data sets and is generally used in data clustering and near-duplicate detection.

Examples:

WITH sd1 AS (SELECT make_set_digest(value) AS digest FROM table1),
     sd2 AS (SELECT make_set_digest(value) AS digest FROM table2)
SELECT jaccard_index(sd1.digest, sd2.digest) AS similarity FROM sd1, sd2;

The Presto type for this data structure is called setdigest. Presto offers the ability to merge multiple Set Digest data sketches.

Serialization

Data sketches such as those created with MinHash or HyperLogLog can be serialized into the varbinary data type. Serializing these data structures allows them to be stored efficiently and, if needed, transferred between systems or sessions. Once stored, they can be deserialized back into their original state when they are needed again. In Presto, this is typically done with CAST, converting a sketch to varbinary and back. String functions such as to_utf8() and from_utf8() are not suitable for this, because they convert between varchar and varbinary rather than between a sketch and varbinary.
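As a sketch of the round trip, the query below casts a setdigest to varbinary and back, then reads its cardinality. It assumes that casts between setdigest and varbinary are available in the Presto version in use. The inline VALUES table is illustrative.

```sql
-- Serialize a setdigest to varbinary, deserialize it, and query the result.
SELECT cardinality(
           CAST(CAST(make_set_digest(value) AS varbinary) AS setdigest)
       ) AS cardinality_after_round_trip
FROM (VALUES 1, 2, 3) AS t(value);
```

In practice the inner CAST would usually write the varbinary value to a table, and the outer CAST would be applied later when the stored sketch is read back.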

Functions

make_set_digest(x) setdigest

Composes all input values of x into a setdigest.

Examples:

Create a ``setdigest`` corresponding to a ``bigint`` array::

SELECT make_set_digest(value)
FROM (VALUES 1, 2, 3) T(value);

Create a ``setdigest`` corresponding to a ``varchar`` array::

SELECT make_set_digest(value)
FROM (VALUES 'Presto', 'SQL', 'on', 'everything') T(value);

merge_set_digest(setdigest) setdigest

Returns the setdigest of the aggregate union of the individual setdigest structures.

Examples:

SELECT merge_set_digest(a)
FROM (SELECT make_set_digest(value) AS a FROM (VALUES 4, 3, 2, 1) T(value));
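A more typical use of merging is to build one setdigest per group and then combine the groups. The following is a minimal sketch under assumed names: the inline table, the visit_day and user_id columns, and the daily alias are all hypothetical.

```sql
-- Build one setdigest per day, then merge them to estimate
-- the number of distinct users across all days.
WITH daily AS (
    SELECT visit_day, make_set_digest(user_id) AS digest
    FROM (VALUES ('mon', 1), ('mon', 2), ('tue', 2), ('tue', 3)) AS t(visit_day, user_id)
    GROUP BY visit_day
)
SELECT cardinality(merge_set_digest(digest)) AS distinct_users
FROM daily;
```

Because user 2 appears on both days, the merged digest counts that user once, which a plain sum of per-day counts would not do.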

cardinality(setdigest) bigint

Returns the cardinality of the set digest from its internal HyperLogLog component.

Examples:

SELECT cardinality(make_set_digest(value))
FROM (VALUES 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5) T(value);
-- 5

intersection_cardinality(x, y) bigint

Returns the estimation for the cardinality of the intersection of the two set digests.

x and y must be of type setdigest.

Examples:

SELECT intersection_cardinality(make_set_digest(v1), make_set_digest(v2))
FROM (VALUES (1, 1), (NULL, 2), (2, 3), (3, 4)) T(v1, v2);
-- 3

jaccard_index(x, y) double

Returns the estimation of Jaccard index for the two set digests.

x and y must be of type setdigest.

Examples:

SELECT jaccard_index(make_set_digest(v1), make_set_digest(v2))
FROM (VALUES (1, 1), (NULL,2), (2, 3), (NULL, 4)) T(v1, v2);
-- 0.5

hash_counts(x) map(bigint, smallint)

Returns a map containing the Murmur3Hash128 hashed values and the count of their occurrences within the internal MinHash structure belonging to x.

x must be of type setdigest.

Examples:

SELECT hash_counts(make_set_digest(value))
FROM (VALUES 1, 1, 1, 2, 2) T(value);
-- {19144387141682250=3, -2447670524089286488=2}