Best Practices for Statistics Collection
When you query a table, the query optimizer requires representative statistics on its data in order to choose the best query plan. For many applications, statistics do not need to be accurate to the latest minute. Vertica regularly calls ANALYZE ROW COUNT
, which collects partial statistics and supplies enough data to satisfy many optimizer requirements.
Requesting Statistics
You can explicitly request Vertica to gather detailed information on the data of individual tables and their columns, by calling ANALYZE_STATISTICS
. This function analyzes distribution of column data and storage usage across all projections. Without this data, the query optimizer assumes uniform distribution of data values and equal storage usage for all projections, and creates query plans accordingly.
You can specify the size of the data sample—by default, 10 percent. You can also narrow the scope of the collection by specifying individual columns.
You should call ANALYZE_STATISTICS
on a table or individual columns when one or more of following conditions are true:
- Data is bulk loaded for the first time.
- A new projection is refreshed.
- The number of rows changes significantly.
- A new column is added to the table.
- Column minimum/maximum values change significantly.
- New primary key values with referential integrity constraints are added . The primary key and foreign key tables should be re-analyzed.
- Table size notably changes relative to other tables it is joined to—for example, a table that was 50 times larger than another table is now only five times larger.
- A notable deviation in data distribution necessitates recalculating histograms—for example, an event causes abnormally high levels of trading for a particular stock.
- The database is inactive for an extended period of time.
Overhead Considerations
Running ANALYZE_STATISTICS
is an efficient but potentially long-running operation. You can run it concurrently with queries and loads in a production environment. However, the function can incur considerable overhead on system resources (CPU and memory), at the expense of queries and load operations. To minimize overhead, run ANALYZE_STATISTICS
on individual tables rather than the entire database, or on individual table columns.
Related Tools
You can diagnose and resolve many statistics-related issues by calling ANALYZE_WORKLOAD
, which returns tuning recommendations.
If you update statistics and find that the query still performs poorly, run your query through the Database Designer and choose incremental as the design type. See Incremental Design.