Understanding the APPROXIMATE_COUNT_DISTINCT Functions

Posted February 28, 2018 by Soniya Shah, Information Developer

This blog post was authored by Curtis Bennett.

The exact computation of the number of distinct values of an expression X on a multi-node architecture requires bringing all distinct values of X (within the specified group if a GROUP BY was specified) to the same node, and then counting the number of distinct values on each node in parallel. This can be expensive for several reasons:

• If the records are not segmented on X and if there is no unsegmented projection on X, then in a multi-node system, a DISTINCT is first performed on the node where the data resides, and then the data must be redistributed to bring the partially aggregated data together to perform a second stage of aggregation. The results of the second stage of aggregation are then counted, and the counts are summed to produce the final count distinct. All of this must be done within a group, if a GROUP BY was specified.
• If the data is not sorted in a way that delivers sufficient performance for the COUNT (DISTINCT X) then hashed aggregation must be used. If the number of distinct values is large, then the hash table must be large.
• When multiple distinct aggregates are requested in a single statement, the data must be re-segmented for each distinct aggregate, which is expensive. This is mitigated by having the query optimizer perform each distinct aggregate in a separate query block, but the need to do many expensive data re-distributions remains.
• COUNT (DISTINCT X) does not roll up. That is, if you have “COUNT (DISTINCT X) … GROUP BY a, b, c”, you cannot derive “COUNT (DISTINCT X) … GROUP BY a, b” from it. The fact that COUNT (DISTINCT…) doesn’t roll up means that ROLAP-type tools have to do COUNT (DISTINCT X) the hard way and this is the source of the COUNT DISTINCT pain some users have.

In many cases, an exact answer is not required, and for those situations, APPROXIMATE_COUNT_DISTINCT is a much faster alternative when an exact count is unnecessary.

APPROXIMATE_COUNT_DISTINCT is an implementation of the LogLogBeta algorithm developed by Jason Qin of Cornell University, as published in 2016. Vertica starts with a binary search tree sorting algorithm, and switches to the LogLogBeta algorithm to handle larger data sets. This avoids the overhead that LogLogBeta incurs for smaller data sets. More importantly, the LogLogBeta algorithm is implemented in a distributed fashion, which leverages the Vertica Massively Parallel Processing (MPP) architecture. This makes the supporting functions very fast. => SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact; COUNT ------- 19982 (1 row) Time: First fetch (1 row): 863.218 ms. All rows formatted: 863.280 ms => SELECT APPROXIMATE_COUNT_DISTINCT(product_key) FROM store.store_sales_fact; ApproxCountDistinct --------------------- 19991 (1 row) Time: First fetch (1 row): 479.946 ms. All rows formatted: 480.014 ms APPROXIMATE_COUNT_DISTINCT is faster than a traditional COUNT DISTINCT, with an error tolerance within 2.17 standard deviations, which corresponds to a 97% confidence interval. Furthermore, the error tolerance default of 1.75% can be adjusted with an optional parameter, which provides more accurate results with a slight penalty in performance. => SELECT APPROXIMATE_COUNT_DISTINCT(product_key,5) AS five_pct_accuracy, APPROXIMATE_COUNT_DISTINCT(product_key,1) AS one_pct_accuracy, APPROXIMATE_COUNT_DISTINCT(product_key,.88) AS point_eighteight_pct_accuracy FROM store.store_sales_fact; five_pct_accuracy | one_pct_accuracy | point_eighteight_pct_accuracy -------------------+------------------+------------------------------- 19431 | 19921 | 19921 (1 row) Time: First fetch (1 row): 772.212 ms. All rows formatted: 772.288 ms APPROXIMATE_COUNT_DISTINCT_SYNOPSIS summarizes the information of distinct, non-null values and materializes the result set in a synopsis object. This object is saved to a table and is queried by the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function. => SELECT date_key, store_key, COUNT(DISTINCT product_key) FROM store.store_sales_fact; Output not shown Time: First fetch (1000 rows): 10627.813 ms. All rows formatted: 11510.467 ms => CREATE TABLE my_summary AS SELECT date_key, store_key, APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(product_key) syn from store.store_sales_fact group by 1, 2 ; Time: First fetch (0 rows): 19540.632 ms. All rows formatted: 19540.653 ms => SELECT date_key, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) from my_summary group by 1; Output not shown Time: First fetch (1000 rows): 1642.266 ms. All rows formatted: 2662.296 ms => SELECT date_key, store_key, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) from my_summary group by 1, 2; Output not shown Time: First fetch (1000 rows): 31.080 ms. All rows formatted: 3164.499 ms The selection criteria supports all combinations of columns and doesn’t have to be top-down, based on the original GROUP BY condition.