Presto implements the KHyperLogLog algorithm and data structure. A KHyperLogLog data structure can be created with khyperloglog_agg().

KHyperLogLog is a data sketch that compactly represents the association of two columns. It is implemented in Presto as a two-level data structure composed of a MinHash structure whose entries map to HyperLogLog sketches.

KHyperLogLog sketches can be cast to and from varbinary. This allows them to be stored for later use.
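As a rough illustration of storing a sketch and reading it back, the query below is a minimal sketch under stated assumptions: the tables visits and visit_sketches and the columns ip_address, user_id and visit_date are hypothetical, and the column types are assumed to be ones khyperloglog_agg accepts.

```sql
-- Hypothetical source table: visits(ip_address varchar, user_id bigint, visit_date date).
-- Build one sketch per day and store it as varbinary.
CREATE TABLE visit_sketches AS
SELECT
    visit_date,
    CAST(khyperloglog_agg(ip_address, user_id) AS varbinary) AS sketch
FROM visits
GROUP BY visit_date;

-- Later: cast the stored bytes back to KHyperLogLog before calling sketch functions.
SELECT
    visit_date,
    cardinality(CAST(sketch AS KHyperLogLog)) AS approx_distinct_ips
FROM visit_sketches;
```

Storing the sketch rather than the raw rows means later queries can work from the compact summary without rescanning the source table.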
- khyperloglog_agg(x, y) KHyperLogLog
Returns the KHyperLogLog sketch that represents the relationship between columns x and y. The MinHash structure summarizes x, and the HyperLogLog sketches represent the y values linked to each x value.
- cardinality(khll) bigint
Returns the cardinality of the MinHash sketch, i.e. the number of distinct x values.
- intersection_cardinality(khll1, khll2) bigint
Returns the set intersection cardinality of the data represented by the MinHash structures of khll1 and khll2.
- jaccard_index(khll1, khll2) double
Returns the Jaccard index of the data represented by the MinHash structures of khll1 and khll2.
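The three MinHash-level functions above can be combined in a single query. The following is a sketch only; the visits table, its columns and the dates are hypothetical and chosen for illustration.

```sql
-- Hypothetical table: visits(ip_address varchar, user_id bigint, visit_date date).
WITH day1 AS (
    SELECT khyperloglog_agg(ip_address, user_id) AS khll
    FROM visits
    WHERE visit_date = DATE '2024-01-01'
),
day2 AS (
    SELECT khyperloglog_agg(ip_address, user_id) AS khll
    FROM visits
    WHERE visit_date = DATE '2024-01-02'
)
SELECT
    cardinality(day1.khll)                         AS ips_day1,       -- distinct x values, day 1
    cardinality(day2.khll)                         AS ips_day2,       -- distinct x values, day 2
    intersection_cardinality(day1.khll, day2.khll) AS ips_both_days,  -- x values seen on both days
    jaccard_index(day1.khll, day2.khll)            AS ip_similarity   -- overlap relative to the union
FROM day1
CROSS JOIN day2;
```

All four results are estimates derived from the sketches, so they should be read as approximations rather than exact counts.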
- uniqueness_distribution(khll) map<bigint,double>
For a certain value x', uniqueness is understood as how many y' values are associated with it in the source dataset. This is obtained from the cardinality of the HyperLogLog that is mapped from the MinHash bucket that corresponds to x'. This function returns a histogram that represents the uniqueness distribution, the X-axis being the uniqueness and the Y-axis being the relative frequency of x values.
- uniqueness_distribution(khll, histogramSize) map<bigint,double>
Returns the uniqueness histogram with the given number of buckets. If histogramSize is omitted, it defaults to 256. All uniqueness values greater than histogramSize are accumulated in the last bucket.
- reidentification_potential(khll, threshold) double
The reidentification potential is the ratio of x values that have a uniqueness under the given threshold.
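To make the uniqueness functions concrete, here is a hedged example. The visits table and its columns are hypothetical, and the histogram size of 10 and threshold of 3 are arbitrary values picked for illustration.

```sql
-- Hypothetical table: visits(ip_address varchar, user_id bigint, visit_date date).
-- Here x = ip_address and y = user_id, so the uniqueness of an IP address
-- is the estimated number of distinct users seen from it.
WITH s AS (
    SELECT khyperloglog_agg(ip_address, user_id) AS khll
    FROM visits
)
SELECT
    uniqueness_distribution(khll)      AS histogram_default,  -- default histogram size (256)
    uniqueness_distribution(khll, 10)  AS histogram_10,       -- 10 buckets; larger values fall in the last one
    reidentification_potential(khll, 3) AS reid_potential     -- ratio of x values with uniqueness under 3
FROM s;
```

In the returned maps, each key is a uniqueness value and each value is the relative frequency of x values with that uniqueness.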
- merge(khll) KHyperLogLog
Returns the KHyperLogLog of the aggregate union of the individual khll KHyperLogLog structures.
- merge_khll(array(khll)) KHyperLogLog
Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
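The two merge functions differ in how they receive their input: merge is an aggregate over rows, while merge_khll takes an array in a single row. The queries below are a minimal sketch assuming a hypothetical visits table. Building the array with the ARRAY constructor is one possible approach, not the only one.

```sql
-- Hypothetical table: visits(ip_address varchar, user_id bigint, visit_date date).

-- merge: aggregate many per-day sketches (one per row) into a single sketch.
WITH daily AS (
    SELECT visit_date, khyperloglog_agg(ip_address, user_id) AS khll
    FROM visits
    GROUP BY visit_date
)
SELECT cardinality(merge(khll)) AS approx_distinct_ips
FROM daily;

-- merge_khll: union sketches that are already collected in an array.
WITH day1 AS (
    SELECT khyperloglog_agg(ip_address, user_id) AS khll
    FROM visits
    WHERE visit_date = DATE '2024-01-01'
),
day2 AS (
    SELECT khyperloglog_agg(ip_address, user_id) AS khll
    FROM visits
    WHERE visit_date = DATE '2024-01-02'
)
SELECT cardinality(merge_khll(ARRAY[day1.khll, day2.khll])) AS approx_distinct_ips
FROM day1
CROSS JOIN day2;
```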