Best Practices for Statistics Collection

When you query a table, the query optimizer requires representative statistics on its data in order to choose the best query plan. For many applications, statistics do not need to be accurate to the latest minute. Vertica regularly calls ANALYZE ROW COUNT, which collects partial statistics and supplies enough data to satisfy many optimizer requirements.

Requesting Statistics

You can explicitly request Vertica to gather detailed information on the data of individual tables and their columns, by calling ANALYZE_STATISTICS. This function analyzes distribution of column data and storage usage across all projections. Without this data, the query optimizer assumes uniform distribution of data values and equal storage usage for all projections, and creates query plans accordingly.

You can specify the size of the data sample—by default, 10 percent. You can also narrow the scope of the collection by specifying individual columns.

You should call ANALYZE_STATISTICS on a table or individual columns when one or more of following conditions are true:

Overhead Considerations

Running ANALYZE_STATISTICS is an efficient but potentially long-running operation. You can run it concurrently with queries and loads in a production environment. However, the function can incur considerable overhead on system resources (CPU and memory), at the expense of queries and load operations. To minimize overhead, run ANALYZE_STATISTICS on individual tables rather than the entire database, or on individual table columns.

Related Tools

You can diagnose and resolve many statistics-related issues by calling ANALYZE_WORKLOAD, which returns tuning recommendations.

If you update statistics and find that the query still performs poorly, run your query through the Database Designer and choose incremental as the design type. See Incremental Design.