Understanding the APPROXIMATE_COUNT_DISTINCT Functions

This blog post was authored by Curtis Bennett.

The exact computation of the number of distinct values of an expression X on a multi-node architecture requires bringing all distinct values of X (within the specified group if a GROUP BY was specified) to the same node, and then counting the number of distinct values on each node in parallel. This can be expensive for several reasons:

• If the records are not segmented on X and if there is no unsegmented projection on X, then in a multi-node system, a DISTINCT is first performed on the node where the data resides, and then the data must be redistributed to bring the partially aggregated data together to perform a second stage of aggregation. The results of the second stage of aggregation are then counted, and the counts are summed to produce the final count distinct. All of this must be done within a group, if a GROUP BY was specified.
• If the data is not sorted in a way that delivers sufficient performance for the COUNT (DISTINCT X) then hashed aggregation must be used. If the number of distinct values is large, then the hash table must be large.
• When multiple distinct aggregates are requested in a single statement, the data must be re-segmented for each distinct aggregate, which is expensive. This is mitigated by having the query optimizer perform each distinct aggregate in a separate query block, but the need to do many expensive data re-distributions remains.
• COUNT (DISTINCT X) does not roll up. That is, if you have “COUNT (DISTINCT X) … GROUP BY a, b, c”, you cannot derive “COUNT (DISTINCT X) … GROUP BY a, b” from it. The fact that COUNT (DISTINCT…) doesn’t roll up means that ROLAP-type tools have to do COUNT (DISTINCT X) the hard way and this is the source of the COUNT DISTINCT pain some users have.

In many cases, an exact answer is not required, and for those situations, APPROXIMATE_COUNT_DISTINCT is a much faster alternative when an exact count is unnecessary.

APPROXIMATE_COUNT_DISTINCT is an implementation of the LogLogBeta algorithm developed by Jason Qin of Cornell University, as published in 2016. Vertica starts with a binary search tree sorting algorithm, and switches to the LogLogBeta algorithm to handle larger data sets. This avoids the overhead that LogLogBeta incurs for smaller data sets. More importantly, the LogLogBeta algorithm is implemented in a distributed fashion, which leverages the Vertica Massively Parallel Processing (MPP) architecture. This makes the supporting functions very fast.

=> SELECT COUNT(DISTINCT product_key) FROM store.store_sales_fact;
COUNT
-------
19982
(1 row)
Time: First fetch (1 row): 863.218 ms. All rows formatted: 863.280 ms
=> SELECT APPROXIMATE_COUNT_DISTINCT(product_key) FROM
store.store_sales_fact;
ApproxCountDistinct
---------------------
19991
(1 row)
Time: First fetch (1 row): 479.946 ms. All rows formatted: 480.014 ms

APPROXIMATE_COUNT_DISTINCT is faster than a traditional COUNT DISTINCT, with an error tolerance within 2.17 standard deviations, which corresponds to a 97% confidence interval. Furthermore, the error tolerance default of 1.75% can be adjusted with an optional parameter, which provides more accurate results with a slight penalty in performance.

=> SELECT APPROXIMATE_COUNT_DISTINCT(product_key,5) AS five_pct_accuracy,
APPROXIMATE_COUNT_DISTINCT(product_key,1) AS one_pct_accuracy,
APPROXIMATE_COUNT_DISTINCT(product_key,.88) AS
point_eighteight_pct_accuracy
FROM store.store_sales_fact;
five_pct_accuracy | one_pct_accuracy | point_eighteight_pct_accuracy
-------------------+------------------+-------------------------------
19431 | 19921 | 19921
(1 row)
Time: First fetch (1 row): 772.212 ms. All rows formatted: 772.288 ms

APPROXIMATE_COUNT_DISTINCT_SYNOPSIS summarizes the information of distinct, non-null values and materializes the result set in a synopsis object. This object is saved to a table and is queried by the APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS function.

=> SELECT date_key, store_key, COUNT(DISTINCT product_key) FROM
store.store_sales_fact;
Output not shown
Time: First fetch (1000 rows): 10627.813 ms. All rows formatted: 11510.467
ms
=> CREATE TABLE my_summary AS SELECT date_key, store_key,
APPROXIMATE_COUNT_DISTINCT_SYNOPSIS(product_key) syn from
store.store_sales_fact group by 1, 2 ;
Time: First fetch (0 rows): 19540.632 ms. All rows formatted: 19540.653 ms
=> SELECT date_key, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn) from
my_summary group by 1;
Output not shown
Time: First fetch (1000 rows): 1642.266 ms. All rows formatted: 2662.296
ms
=> SELECT date_key, store_key, APPROXIMATE_COUNT_DISTINCT_OF_SYNOPSIS(syn)
from my_summary group by 1, 2;
Output not shown
Time: First fetch (1000 rows): 31.080 ms. All rows formatted: 3164.499 ms

The selection criteria supports all combinations of columns and doesn’t have to be top-down, based on the original GROUP BY condition.

About the Author

Soniya Shah
Information Developer

Currently, a first year law student with a background in science and technology. Experienced technical writer, with specializations in software documentation, big data, blog development, and website development. I build user-centered content to communicate complex and technical information more easily.

I used to work for Vertica full time for about 3 years. I still work at Vertica part time while going to law school.

Update: Soniya is now doing her law internship, and no longer working at Vertica. Good luck, Soniya!

Product Overview

Vertica Announces Vertica 12 for Future-Proof Analytics

Harness the Internet of Things (IoT)

Support & Services

Partners

Vertica Inside – Embedded Analytics at Scale

Resources

About Vertica

Stay Informed

Understanding the APPROXIMATE_COUNT_DISTINCT Functions

About the Author

Search The Blog

Explore Popular Topics

Subscribe For Email Updates

Product Overview

Vertica Announces Vertica 12 for Future-Proof Analytics

Harness the Internet of Things (IoT)

Support & Services

Partners

Vertica Inside – Embedded Analytics at Scale

Resources

About Vertica

Stay Informed

Understanding the APPROXIMATE_COUNT_DISTINCT Functions

About the Author

Search The Blog

Explore Popular Topics

Subscribe For Email Updates

See More Vertica Blog Posts