Getting Statistics

ANALYZE_STATISTICS collects and aggregates data samples and storage information from all nodes that store projections associated with the specified table. You call this function as follows:

ANALYZE_STATISTICS ('[schema.]table' 
                     [, 'column[,...]'] 
                     [, percent ] )

If you call this function with a single empty string argument, Vertica collects statistics for all database tables and their projections:

=> SELECT ANALYZE_STATISTICS ('');

Operations

ANALYZE_STATISTICS performs the following operations:

Performs fast data sampling, which expedites analysis of relatively small tables with a large number of columns.
Includes data from WOS.
Recognizes deleted data instead of ignoring delete markers.
Records in system table PROJECTION_COLUMNS the last time statistics were run for a table.

Enables table analysis tables on a per-column basis for improved performance.

Sampling Size

ANALYZE_STATISTICS constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always creates a statistical sample that contains up to (approximately) the smaller of:

2¹⁷ (131,072) rows
Number of rows that fit in 1 GB of memory

If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS reads all rows from disk and analyzes the entire column.

Note: The data collected in a sample range does not indicate how data should be distributed.

The following table shows how ANALYZE_STATISTICS, when set to different percentages, obtains a statistical sample from a given column:

Number of column rows	%	Number of rows read	Number of sampled rows
`<=`max‑sample‑size	20	All	All
400K	10	max‑sample‑size	max‑sample‑size
4000K	10	400K	max‑sample‑size

Note: When a column specified for ANALYZE_STATISTICS is first in a projection's sort order, the function reads all data from disk to avoid a biased sample.

Data Collection Percentage

If ANALYZE_STATISTICS omits specifying a percentage , Vertica collects a fixed 10-percent sample of statistical data from disk. Specifying a percentage of data to read from disk gives you more control over deciding between sample accuracy and speed.

The percentage of data you collect affects collection time and accuracy:

A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.
A larger percentage reads more data off disk. Data collection is slower, but a larger data sample enables greater histogram accuracy.

Evaluating Results

After you collect the desired statistics, run the Workload Analyzer to retrieve hints about under-performing queries and their root causes, and obtain tuning recommendations. For more information, see Analyzing Workloads.

Examples

Compute statistics on all projections in the VMart database:

=> SELECT ANALYZE_STATISTICS (''); 
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Compute statistics on a single table:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension'); 
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Compute statistics on a single column across all projections for a table:

=> SELECT ANALYZE_STATISTICS('shipping_dimension','shipping_key'); 
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Collect data from the entire disk by setting the percent parameter to 100:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 'shipping_key', 100);
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Collect data on all projections for shipping_dimension from 20 percent of the disk:

=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 20); 
 ANALYZE_STATISTICS 
-------------------
                 0
(1 row)