Each table in Teradata is required to have a primary index.Even if you did not define any primary index in CREATE table statement, the Teradata system will automatically create the primary index based on the setting of DBScontrol setting field . The primary index defines where data will reside and which AMP receives the row.
The three most important roles the primary index does is the following –
- Data Distribution
- Fastest way to retrieve Data
- Incredibly important for Joins
In short primary index provides the fastest physical path to retrieving data.
Two types of Primary Indexes are given below –
Unique Primary Index (UPI)
A unique primary index means that the value for the selected column must be unique. In the example below the EMP_ID is the Unique Primary Index.
A unique primary index (UPI) will always spread the rows of the table evenly amongst the AMPs.
Non Unique Primary Index (NUPI)
A Non unique primary index means that the value for the selected column can be non unique. In the example below the LAST_NAME is the Non Unique Primary Index.
Key Point – Don’t assume that UPI is always the BEST CHOICE; sometimes the use of NUPI is very critical. If you need to perform lot of query with LAST_NAME in the WHERE clause then NUPI on last name is essential.
Primary Index and the ROW HASH
The primary index is the key to determine where the ROW of the table will reside on which AMP. When a new row arrive for insert in Teradata the following steps occur –
- Teradata Parsing Engine (PE) examines the primary index of the row.
- Teradata takes the primary index of the rows and run it through HASHING ALGORITHM.
- The output of the Hashing Algorithm is the 32 bit Row – Hash value.
New row coming with following attributes –
Hash the PI value -> 55 / HASHING ALGORITHM = 11110000111100001111000011110000 (Row – Hash)
This value in Binary format is called as the Row – Hash Value. And it will always be unique for the given PI. Means it’s not possible to get two Row – Hash Value for the same EMP_ID (55).
This 32 bit Row – Hash value determines in which AMP the row will reside and it always attached along with the ROW to make it a UNIQUE identification for that ROW.
Now in Teradata we have HASH MAP which contains the different bucket called as Hash Map Buckets. These buckets contain only the different AMP number which is attached with the Teradata system. Suppose that Teradata system contains 4 AMP then this Hash Map contains numbers from 1 to 4 in different buckets. see image below –
Four Amp Hash Map
The Row – hash value which we determine above, is used to point to certain bucket in the Hash Map. This value points to only one bucket in the Hash Map and that bucket number will determine the AMP number where this new row will reside.
11110000111100001111000011110000 tells that it correspond to first row and fourth column of Hash Map i.e. AMP no = 4
So finally we have the AMP number (4) where the new row will reside.
After doing all this calculation in PE , the PE instruct BYNET to send the row along with its unique Row – Hash Value in the AMP number 4.
The new row along with its Row hash Value is send to AMP number 4 for storing. In this way we achieve the data distribution for each row of the given table.
When the AMP receives the row it will place the row into the proper table, and the AMP itself check that is there any other row with the same Row- Hash Value. If this is the first Row with the particular Row- Hash value then it will assign a uniqueness value of 1 along with the row. But if this is the second row with the same Row-Hash value for the same table then it will assign a uniqueness value of 2. Similarly it assign a value of 3 if it finds third occurrence of the same Row – Hash, in this fashion it keeps on adding uniqueness value for the duplicates rows.
If you don’t get this concept at the first go don’t worry, it is explained again with the suitable examples for more clarity.
An example of UPI table, where EMP_ID is the UPI.
Note – as the table has UPI defined on EMP_ID it will always has the unique row in the given table. So each time AMP will assign the uniqueness value of 1 against each row.
An example of NUPI table, where LAST_NAME is the NUPI.
Note – as the table has NUPI defined on LAST_NAME, we have three rows for the same LAST_NAME = ‘Kumar’. Hence the AMP will assign different uniqueness value from 1 t 3 for each row.
Together with Row – Hash and the uniqueness value the Teradata make as 64 bit ROW – ID to uniquely identify each row in the given AMP.
This is the complete reference for Primary Index in Teradata. I tried to make it as simple as possible for the newbies, in case of any correction or clarification feel free to reach me 🙂
To read about Secondary index in teradata please go to – Secondary Index in Teradata