Getting Statistics
ANALYZE_STATISTICS
collects and aggregates data samples and storage information from all nodes that store projections associated with the specified table. You call this function as follows:
ANALYZE_STATISTICS ('[schema.]table' [, 'column[,...]'] [, percent ] )
If you call this function with a single empty string argument, Vertica collects statistics for all database tables and their projections:
=> SELECT ANALYZE_STATISTICS ('');
Operations
ANALYZE_STATISTICS
performs the following operations:
- Performs fast data sampling, which expedites analysis of relatively small tables with a large number of columns.
- Includes data from WOS.
- Recognizes deleted data instead of ignoring delete markers.
- Records in system table
PROJECTION_COLUMNS
the last time statistics were run for a table.
- Enables table analysis tables on a per-column basis for improved performance.
Sampling Size
ANALYZE_STATISTICS
constructs a column histogram from a set of rows that it randomly selects from all collected data. Regardless of the percentage setting, the function always creates a statistical sample that contains up to (approximately) the smaller of:
- 217 (131,072) rows
- Number of rows that fit in 1 GB of memory
If a column has fewer rows than the maximum sample size, ANALYZE_STATISTICS
reads all rows from disk and analyzes the entire column.
Note: The data collected in a sample range does not indicate how data should be distributed.
The following table shows how ANALYZE_STATISTICS
, when set to different percentages, obtains a statistical sample from a given column:
Number of column rows | % | Number of rows read | Number of sampled rows |
---|---|---|---|
|
20 |
All |
All |
400K |
10 |
|
|
4000K |
10 |
400K |
|
Note: When a column specified for ANALYZE_STATISTICS
is first in a projection's sort order, the function reads all data from disk to avoid a biased sample.
Data Collection Percentage
If ANALYZE_STATISTICS
omits specifying a percentage , Vertica collects a fixed 10-percent sample of statistical data from disk. Specifying a percentage of data to read from disk gives you more control over deciding between sample accuracy and speed.
The percentage of data you collect affects collection time and accuracy:
- A smaller percentage is faster but returns a smaller data sample, which might compromise histogram accuracy.
- A larger percentage reads more data off disk. Data collection is slower, but a larger data sample enables greater histogram accuracy.
Evaluating Results
After you collect the desired statistics, run the Workload Analyzer to retrieve hints about under-performing queries and their root causes, and obtain tuning recommendations. For more information, see Analyzing Workloads.
Examples
Compute statistics on all projections in the VMart database:
=> SELECT ANALYZE_STATISTICS (''); ANALYZE_STATISTICS -------------------- 0 (1 row)
Compute statistics on a single table:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension'); ANALYZE_STATISTICS -------------------- 0 (1 row)
Compute statistics on a single column across all projections for a table:
=> SELECT ANALYZE_STATISTICS('shipping_dimension','shipping_key'); ANALYZE_STATISTICS -------------------- 0 (1 row)
Collect data from the entire disk by setting the percent
parameter to 100:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 'shipping_key', 100); ANALYZE_STATISTICS -------------------- 0 (1 row)
Collect data on all projections for shipping_dimension
from 20 percent of the disk:
=> SELECT ANALYZE_STATISTICS ('shipping_dimension', 20); ANALYZE_STATISTICS ------------------- 0 (1 row)