How to make ROLLUP fly in Vertica?

Posted May 22, 2014 by Po Hong

ROLLUP is a very common Online Analytical Processing (OLAP) function and is part of ANSI SQL. Many customers use ROLLUP to write reports that automatically compute sub-totals across multiple dimensions, at different levels, in a single SQL query. The following is an example:
Select DEPT_CODE, STORE_CODE, SUM(SALE_TOT_QTY) As TOTAL_SALES
From AGG_CUST_MONTHLY A INNER JOIN DIM_DATE B
ON ( A.DATE_KEY = B.DATE_KEY )
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331
GROUP BY ROLLUP(DEPT_CODE, STORE_CODE);
Note: ROLLUP syntax is not supported in the current version of Vertica (7.0). The
query above is shown only as an illustration. See the text below for more details.

The business question the above query intends to answer is: For the month of March, 2014, show the total sales across both departments (dept_code) and stores (store_code) to generate sub-total/total sales figures at three levels:

    1. At the detailed level of each pair of (dept_code, store_code)
    2. At the level of only (dept_code) but aggregating across all stores
    3. Finally a grand sales total aggregating across all departments and stores

Setting performance aside for the moment, ROLLUP is clearly a powerful and useful function: it does a lot of data crunching in the background and delivers a fairly complex report from a single query.

Unlike the more familiar GROUP BY SQL function in which the column order is not material, ROLLUP works by creating subtotals that “roll up” from the most detailed level to a grand total and requires an ordered list of grouping expressions to be supplied as arguments. First ROLLUP calculates the standard aggregate values specified in the standard GROUP BY (without ROLLUP) and then ROLLUP moves from right to left through the list of grouping columns, starting with the lowest level of aggregation to create progressively higher-level subtotals. This process continues until all grouping columns have been consumed and aggregations computed. The operator essentially “rolls up” its grouping operations, culminating in a grand total in the end.

If N is the number of grouping columns, the ROLLUP function creates (N+1) levels of subtotals (including a final grand total). The column order in a ROLLUP operator is critical since it works by removing the rightmost remaining column at each step. For example:

ROLLUP(A, B, C) creates 4 groups: (A, B, C), (A, B), (A), (). Note there are NO groups such as (A, C) and (B, C).
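
The grouping rule can be sketched in a few lines of Python. This is only an illustration of the semantics described above; the helper name rollup_grouping_sets is hypothetical and is not part of Vertica or any SQL API.

```python
def rollup_grouping_sets(columns):
    """Return the grouping sets that ROLLUP(columns) produces, most detailed first."""
    cols = tuple(columns)
    # Drop one column from the right at each step: N columns give N + 1 sets,
    # ending with the empty set (the grand total).
    return [cols[:n] for n in range(len(cols), -1, -1)]


groups = rollup_grouping_sets(["A", "B", "C"])
print(groups)  # [('A', 'B', 'C'), ('A', 'B'), ('A',), ()]
```

Every grouping set is a prefix of the ordered column list, which is why (A, C) and (B, C) never appear.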

Unfortunately, the current version of Vertica (7.0) does not support ROLLUP. So is there an alternative or workaround for existing Vertica customers? And will the workaround perform on a large data set? Fortunately, the answer to both questions is yes. However, it does take some ingenuity in physical design and SQL rewriting to make ROLLUP work and perform in the currently supported Vertica releases (version 5.1 and later).

The key considerations are to create and pre-load a local temp table which is sorted on ALL the ROLLUP columns in the same exact order, rewrite SQL at each level using GROUP BY into a sub-query and finally UNION ALL of these sub-queries together. By applying this strategy to the ROLLUP SQL posted at the beginning of this blog, we arrive at the following alternative/workaround:

DROP TABLE IF EXISTS AGG_TEMP CASCADE;

CREATE LOCAL TEMPORARY TABLE AGG_TEMP
ON COMMIT PRESERVE ROWS
AS
/*+ DIRECT */
( Select DEPT_CODE, STORE_CODE, SUM(SALE_TOT_QTY) As TOTAL_SALES
From AGG_CUST_MONTHLY A INNER JOIN DIM_DATE B
ON ( A.DATE_KEY = B.DATE_KEY )
WHERE B.DATE_KEY BETWEEN 20140301 AND 20140331
GROUP BY DEPT_CODE, STORE_CODE )
ORDER BY DEPT_CODE,
STORE_CODE
SEGMENTED BY HASH (DEPT_CODE, STORE_CODE) ALL NODES;
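
-- Refresh optimizer statistics on the freshly loaded temp table
SELECT ANALYZE_STATISTICS('AGG_TEMP');

-- One GROUP BY sub-query per ROLLUP level, combined with UNION ALL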
SELECT DEPT_CODE, STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE, STORE_CODE
UNION ALL
SELECT DEPT_CODE, NULL As STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP
GROUP BY DEPT_CODE
UNION ALL
SELECT NULL As DEPT_CODE, NULL As STORE_CODE, SUM(TOTAL_SALES)
FROM AGG_TEMP;
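
The same pattern extends to any number of ROLLUP columns. As a rough sketch, the UNION ALL text can be generated from an ordered column list. The Python helper below (rollup_union_sql is a hypothetical name, not a Vertica feature) only builds the SQL string; it assumes the temp table already exists and is sorted on the same columns in the same order.

```python
def rollup_union_sql(table, columns, measure):
    """Build one GROUP BY sub-query per ROLLUP level and join them with UNION ALL."""
    cols = list(columns)
    parts = []
    for n in range(len(cols), -1, -1):
        kept, dropped = cols[:n], cols[n:]
        # Rolled-up columns are returned as NULL so every sub-query has the same shape.
        select_list = kept + [f"NULL AS {c}" for c in dropped] + [f"SUM({measure})"]
        sql = f"SELECT {', '.join(select_list)}\nFROM {table}"
        if kept:  # the grand-total level has no GROUP BY clause
            sql += f"\nGROUP BY {', '.join(kept)}"
        parts.append(sql)
    return "\nUNION ALL\n".join(parts) + ";"


print(rollup_union_sql("AGG_TEMP", ["DEPT_CODE", "STORE_CODE"], "TOTAL_SALES"))
```

For the two columns used in this post, the generated text has the same three sub-queries as the hand-written workaround above.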

The fact that temp table AGG_TEMP is sorted on the ROLLUP columns (dept_code, store_code) is important for performance. A table sorted on (dept_code, store_code) is also sorted on the leading column (dept_code) alone, so all the GROUP BY sub-queries at individual levels are able to exploit the highly efficient GROUPBY PIPELINED operator in Vertica (it appears explicitly in the query's explain plan) to get the job done quickly and accurately.

A nice property of the GROUPBY PIPELINED operator (in contrast to the more resource-intensive GROUPBY HASH) is that its performance is largely independent of the actual table size. As a result, this workaround can easily scale up to tables with 10+ billion or even 100+ billion rows without any noticeable drop in query performance (as measured in first fetch time).
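
To see why sorted input helps, here is a toy Python sketch of the two strategies. It illustrates the general idea of pipelined versus hash aggregation only; it is not Vertica's implementation, and the function names and sample rows are made up for the example.

```python
from itertools import groupby


def groupby_pipelined(sorted_rows, key_len):
    """Stream over rows already sorted on the key, holding one group at a time."""
    for key, rows in groupby(sorted_rows, key=lambda row: row[:key_len]):
        # A group is complete as soon as the key changes, so it can be emitted at once.
        yield key + (sum(row[-1] for row in rows),)


def groupby_hash(rows, key_len):
    """Accumulate every group in a hash table before returning anything."""
    totals = {}
    for row in rows:
        key = row[:key_len]
        totals[key] = totals.get(key, 0) + row[-1]
    return [key + (total,) for key, total in totals.items()]


# Hypothetical (dept_code, store_code, total_sales) rows, sorted on (dept, store).
rows = [("D1", "S1", 10), ("D1", "S2", 5), ("D2", "S1", 7)]

print(list(groupby_pipelined(rows, 2)))  # detail level
print(list(groupby_pipelined(rows, 1)))  # per-department subtotals
print(list(groupby_pipelined(rows, 0)))  # grand total
```

The pipelined version keeps only the current group in memory and can return its first result before reading the rest of the input, whereas the hash version must keep every group until the scan finishes.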

To conclude, I would like to mention that serious work is under way in Vertica Engineering to implement ROLLUP (and many other OLAP functions) in upcoming Vertica releases. Stay tuned!

Po Hong is a senior pre-sales engineer in Vertica's Corporate Systems Engineering (CSE) group with a broad range of experience in various relational databases such as Vertica, Neoview, Teradata and Oracle.