«

»

Feb 21

Collect Statistics in Teradata

Statistics are data demographics (or hints) used by the Teradata optimizer.


There are many ways to generate a query plan for a given SQL, and collecting statistics ensures that the optimizer will have the most accurate information to create the best access and join plans.

The optimizing phase of Teradata, makes decisions on how to access table data. These decisions can be very important when table joins (especially those involving multiple joins) are required by a query. By default, the Optimizer uses approximations of the number of rows in each table (known as the cardinality of the table) and of the number of unique values in indexes in making its decisions. To build such estimates, the Optimizer picks a random AMP and builds the information and it is possible for the estimates to be significantly off. This can lead to poor choices of join plans, and associated increases in the response times of the queries involved.

One way to help the Optimizer make better decisions is to give it more accurate information as to the content of the table. This can be done using the COLLECT STATISTICS statement. When the Optimizer finds that there are statistics available for a referenced table, it will use those statistics instead of using estimated table cardinality or estimated unique index value counts.

Stats should be collected mainly under the below circumstances:


1. A thumb rule is to collect statistics when they’ve changed by 10%. (That would be 10% more rows inserted, or 10% of the rows deleted, or 10% of the rows changed, or some combination.)

2. The range of values for an index or column of a table for which statistics have been collected has changed significantly. Sometimes one can infer this from the date and time the statistics were last collected, or by the very nature of the column (for instance, if the column in question holds a transaction date, and statistics on that column were last gathered a year ago, it is almost certain that the statistics for that column are stale).

How stats are built over the table?


TD builds the uniqueness count for each identified column / set of columns for the completed table/partition data and stores the information in the DBC tables.

Whenever the stats are collected later, the previously collected information is lost and fresh stats are updated in the DBC tables.

The time taken to collect stats doesn’t depend on how frequently the stats have been collected or how recently the stats have been collected.

Stats should be collected on all dimensions, history, transactional, reference and aggregate tables based on the below approach:


1. If the table is loaded under DELETE INSERT mode, then STATS should be collected during each load.

2. If the table is built under INSERT UPDATE mode, then STATS should be collected if the data demographics change by more than 10%.

3. If the target is a transactional table loaded in APPEND mode, then STATS should be collected if the data demographics change by more than 10%.

4. If the table is built under INSERT mode; (aggregate tables where data is built for a particular duration and queried upon this duration) tables where partitions are built over each aggregation period, STATS should be collected on the new partition, even if the data demographics for the entire table changes less than 10%, because user queries or extractions might be built over data for current period of aggregation.


↓↓↓↓↓↓↓↓ SHARE IT WITH YOUR FRIENDS  ↓↓↓↓↓↓↓↓

12 pings

Skip to comment form

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

This site uses Akismet to reduce spam. Learn how your comment data is processed.