Content:
- When B Tree Index should be used. We will try to understand some of the situations when the optimizer will use B Tree Index.
- Index Clustering Factor.
- Influence of parameter OPTIMIZER_MODE and OPTIMIZER_INDEX_COST_ADJ on using B Tree Index.
When to use B-Tree Index:
Although it can differ from situation to situation that when Oracle Optimizer will decide to use index or not, based on table size, available statistics, index clustering factor, several db parameters like OPTIMIZER_MODE and OPTIMIZER_INDEX_COST_ADJ, OPTIMIZER_INDEX_CACHING etc and other things.
But mainly B*Tree index are used, when:
1. You are fetching very small fraction of the rows from the table using index. B*Tree index should be created on the columns that are frequently used in predicates, join condition of query.
2. You are fetching many/all rows of a table and the index can be used instead of the table (you are only selecting the columns that you have indexed).
When we access the table using index, Oracle will scan the index and from the index leaf nodes it will get the ROWID, then using the rowid it will read the data block from the table segment. This is typically known as "TABLE ACCES BY ROWID". This is very efficient method when you are accessing a small percentage of rows but however its not that efficient while you are reading a large amount of rows. Now the statement "small percentage" is very much relative. In a thin table it might be 2-3% or 5% but in a fat table it may be upto 20-25%. And Index Clustering Factor has significant impact on this small percentage value.
Table Access by Index Rowid
An index is stored sorted by index key. The index might be ascending order or it might be descending order in case of Descending index. The index will be accessed in sorted order by key, it will be sequential access and from the leaf block it will get the rowid to access the data blocks of the table segment and data block are stored randomly, scattered in heap. Therefore when Oracle does a table access by index rowid, it will not access the table in sorted order, it will not go to block 1, block 2, block 3, rather it will search for blocks that are scattered in the heap. For example it might go to block 10, then block 533, then block 777, then again block 10. It will do n number of "table access by index Rowid" to get each and every row and to do it will have to read and reread blocks.
There will be lots of scattered, single block read from here and there in table segment. Typically in the thin table a single block will hold more number of rows, while in a fat table a block will hold less number of rows.
Suppose DB_BLOCK_SIZE = 8k. A thin table has 1 Million rows. And if rows are almost about 80 bytes in size, so logically there will be about 8k/80b = 100 rows per block. That means the table has approzimately 100*100 = 10,000 blocks.
If read 2,00,000 rows via the index; there will be 2,00,000 "TABLE ACCESS BY ROWID" operations. Oracle will read the blocks 2,00,000 times to execute this query but there are only about 10,000 block in the entire table. If the index column data that you are fetching is scattered in all the 10000 blocks then it will read and reread 1 single block in the table on average 20 times. So in this case optimizer will prefer Full Table Scan (FTS) than using index.
Now if a fat table has 1 million rows, and if avg size of row is about 1600k then per block there will have about 5 rows, so there will be about 2,00,000 blocks (20 times more than thin table), so if you are accessing 2,00,000 "TABLE ACCESS BY ROWID" then in avg you will read the block once, so there will be lesser number of rereading blocks. So optimizer might think about using the index. But this is just a rough measure, but this is actually how it works. Although index clustering factor have a great impact on this.
CREATE TABLE FAT_EMP(
EMPLOYEE_ID NUMBER,
FIRST_NAME VARCHAR2(20),
LAST_NAME VARCHAR2(25) NOT NULL,
EMAIL VARCHAR2(232) NOT NULL,
PHONE_NUMBER VARCHAR2(20),
HIRE_DATE DATE NOT NULL,
JOB_ID VARCHAR2(10) NOT NULL,
SALARY NUMBER(8,2)
COMMISSION_PCT NUMBER(2,2)
MANAGER_ID NUMBER(6)
DEPARTMENT_ID NUMBER(4),
EMP_STS_FLAG VARCHAR2(10),
CTC NUMBER,
MANAGER_NAME VARCHAR2(60),
FIRST_NAME2 VARCHAR2(20),
LAST_NAME2 VARCHAR2(25) NOT NULL,
EMAIL2 VARCHAR2(232) NOT NULL,
PHONE_NUMBER2 VARCHAR2(20),
HIRE_DATE2 DATE NOT NULL,
JOB_ID2 VARCHAR2(10) NOT NULL,
SALARY2 NUMBER(8,2)
COMMISSION_PCT2 NUMBER(2,2)
MANAGER_ID2 NUMBER(6)
DEPARTMENT_ID2 NUMBER(4),
EMP_STS_FLAG2 VARCHAR2(10),
CTC2 NUMBER,
MANAGER_NAME2 VARCHAR2(60),
FIRST_NAME3 VARCHAR2(20),
LAST_NAME3 VARCHAR2(25) NOT NULL,
EMAIL3 VARCHAR2(232) NOT NULL,
PHONE_NUMBER3 VARCHAR2(20),
HIRE_DATE3 DATE NOT NULL,
JOB_ID3 VARCHAR2(10) NOT NULL,
SALARY3 NUMBER(8,2)
.
.
.
.
MANAGER_ID7 NUMBER(6)
DEPARTMENT_ID7 NUMBER(4),
EMP_STS_FLAG7 VARCHAR2(10),
CTC7 NUMBER,
MANAGER_NAME7 VARCHAR2(60));
CREATE TABLE THIN_EMP
AS
SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, EMAIL, JOB_ID, SALARY, COMMISSION_PCT, CTC
FROM FAT_EMP;
ALTER TABLE FAT_EMP ADD CONSTRAINT FAT_EMP_PK PRIMARY_KEY (EMPLOYEE_ID);
ALTER TABLE THIN_EMP ADD CONSTRAINT THIN_EMP_PK PRIMARY KEY (EMPLOYEE_ID);
BEGIN
SYS.DBMS_STATS.GATHER_TABLE_STATS (
OwnName => 'SYS',
TabName => 'FAT_EMP',
Method_Opt => 'FOR ALL COLUMNS SIZE AUTO',
Cascade => 'TRUE');
END;
/
BEGIN
SYS.DBMS_STATS.GATHER_TABLE_STATS (
OwnName => 'SYS',
TabName => 'THIN_EMP',
Method_Opt => 'FOR ALL COLUMNS SIZE AUTO',
Cascade => 'TRUE');
END;
/
Both tables will now have around 1 million records. We have created primary key on these tables and by default indexes have been created on these tables.
SELECT TABLE_NAME, BLOCKS, NUM_ROWS, NUM_ROWS/BLOCKS "Rows Per Block"
FROM DBA_TABLES
WHERE TABLE_NAME IN ('FAT_EMP', 'THIN_EMP');
TABLE_NAME BLOCKS NUM_ROWS "Rows Per Block"
THIN_EMP 9197 1000000 108.731107966999
FAT_EMP 203536 1000000 4.91313575976731
Now selecting around 2% from each table:
SELECT * FROM THIN_EMP WHERE EMPLOYEE_ID < 20000;
SELECT * FROM FAT_EMP WHERE EMPLOYEE_ID < 20000;
In both the cased there is TABLE ACCESS BY ROWID and index is being used for two percent.
Now Let's check what happen when we fetch 5% of the record:
SELECT * FROM THIN_EMP WHERE EMPLOYEE_ID < 50000;
SELECT * FROM FAT_EMP WHERE EMPLOYEE_ID < 50000;
In case of THIN_EMP table, it is not using the Index anymore. It is doing TABLE ACCESS FULL. But in case of FAT_EMP table it is still using INDEX RANGE SCAN and then TABLE ACCESS BY ROWID
Now fetching 20% records from FAT_EMP table:
SELECT * FROM FAT_EMP WHERE EMPLOYEE_ID < 200000;
We were fetching around 20% records and it is still using the index for the fact table but for THIN_EMP table it was not using index even when we were fetching 5% of rows.
With this example we can clear our concept that FTS will be performed if oracle optimizer sees huge reread of blocks to fetch data from table ( as in case of THIN_EMP). In FAT_EMP table data was less in each block so number of reread of block was less.
Index Clustering Factor is important factor on which Optimizer will decide whether to use index or not when we select 5% rows.
Index Clustering Factor
The clustering factor is a measure that indicates how many adjacent index key do not refer to the same data block in the table. It compares the order of the index with the degree of disorder in the table. It is typically the number of Block changes while ypu are reading the table using the index.
If you look conceptually, we have one index leaf block and four data block, each block contaiing five rows. Suppose scan begin from 100 key value. This Rowid pointing to BLOCK 1 (2), counter will become 1. Now 101 it is going to Block 2 (1). There is a block change, it is reading from a different block so the counter will become 2. Now for Employee_ID 102, it is again referring to Block 1, there is again a block change, reading from Block1 (5). Counter value become 3. For Employee_ID 103, it again reading from different block (BLOCK 2), counter is then set to 4. For 104 it is reading from same block (BLOCK 2) so counter will remain same 4. 105 referring to different block - BLOCK 1, counter changes to 5. 106 pointing to BLOCK 3 counter will change to 6. 107 and 108 are also pointing to BLOCK 3, counter will remain 6. Read of Employee_ID 109 and 110 mapping to BLOCK 1, again a block change and counter will become 7 and so on.
Finally for Key Values 115 to 119, all data will be in BLOCK 4 and Clustering Factor will be 10.
This is how Index Clustering Factor can be defined.
- If the clustering factor his high, then Oracle Database performs a relatively high number of I/O during index range scan. The index entries points to random table blocks, so the database may have to read and reread the same blocks over and over again to retrieve the data pointed to by the index.
- If the clustering factor is low, then Oracle Database performs a relatively low number of I/O during a large index range scan. The index keys in a range tend to point to the same data block, so the database does not have to read and reread the same block over and over. For example the read of Key Value 115 to 119 in the diagram above, all these values referring to block 4. So Low clustering factor is a good indicator of good index.
- The smallest possible value of clustering factor will be same as the number of table blocks, and the largest possible value will be the same as the number of rows in the table.
We will create two table. One table will have Organized data in particular order of Primary Key
CREATE TABLE ORGANIZED (COL1 INT, COL2 VARCHAR2(200), COL3 VARCHAR2(200));
BEGIN
FOR I IN 1..100000
LOOP
INSERT INTO ORGANIZED VALUES(I, DBMS_RANDOM.STRING('x', 10), DBMS_RANDOM.STRING('Y', 10));
END LOOP;
COMMIT;
END;
/
ALTER TABLE ORGANIZED ADD CONSTRAINT ORGANIZED_PK PRIMARY KEY(COL1);
In case of sequence populated values using such FOR LOOP, the particular index will hold index key in the same sequence. Typically clustering factor will be low.
Now Creating table DISORGANIZED.
CREATE TABLE DISORGANIZED
AS SELECT COL1, COL2, COL3 FROM ORGANIZED ORDER BY COL2;
By making it Order By COL2, values in DISORGANIZED table will not be stored in in order by COL1. It will be disorganized.
ALTER TABLE DISORGANIZED ADD CONSTRAINT DISORGANIZED_PK PRIMARY KEY(COL1);
EXEC DBMS_STATS.GATHER_TABLE_STATS ('SCOTT', 'ORGANIZED');
EXEC DBMS_STATS.GATHER_TABLE_STATS ('SCOTT', 'DISORGANIZED');
SELECT IND.INDEX_NAME, IND.TABLE_NAME, IND.CLUSTERING_FACTOR, TAB.NUM_ROWS, TAB.BLOCKS
FROM ALL_INDEXES IND, ALL_TABLES TAB
WHERE IND.TABLE_NAME = TAB.TABLE_NAME
AND IND.TABLE_NAME IN ('ORGANIZED','DISORGANIZED');
SQL> SELECT IND.INDEX_NAME, IND.TABLE_NAME, IND.CLUSTERING_FACTOR, TAB.NUM_ROWS, TAB.BLOCKS
2 FROM ALL_INDEXES IND, ALL_TABLES TAB
3 WHERE IND.TABLE_NAME = TAB.TABLE_NAME
4 AND IND.TABLE_NAME IN ('ORGANIZED','DISORGANIZED');
INDEX_NAME TABLE_NAME CLUSTERING_FACTOR NUM_ROWS BLOCKS
------------------------------ ------------------------------ ----------------- ---------- ----------
ORGANIZED_PK ORGANIZED 440 100000 496
DISORGANIZED_PK DISORGANIZED 99766 100000 458
In Organized table Clustering Factor is close to number of Blocks. This is a very good indicator. In the Disorganized table data is not organized by particular key value, clustering factor is very high, it is close to number of rows 100000.
SELECT * /* 0.1 percent */ FROM DISORGANIZED WHERE COL1 < 100;
SELECT * /* 0.1 percent */ FROM ORGANIZED WHERE COL1 < 100;
We are fetching 0.1 percent of records and in both cases it is using the index to access the table in both cases.
Now fetching one percent records:
SELECT * /* 1 percent */ FROM DISORGANIZED WHERE COL1 < 1000;
SELECT * /* 1 percent */ FROM ORGANIZED WHERE COL1 < 1000;
In DISORGANIZED table we are accessing just one percent rows but it is not using the Index any more. It is doing TABLE ACCESS FULL. In case of ORGANIZED table it is using Index.
SELECT * /* 10 percent */ FROM ORGANIZED WHERE COL1 < 10000;
It is still using index for 10,000 rows.
SELECT * /* 15 percent */ FROM ORGANIZED WHERE COL1 < 15000;
Even with 15 percent data fetch it is using index because the clustering factor is too low.
Optimizer_Mode Effect
Apart from clusting factor there are other various thing that has crucial affect on whether Optimizer will use the index or not. Among them most crucial is Optimizer_Mode parameter.
ALL_ROWS:
ALL_ROWS is the default mode. If the value of the Optimizer_Mode parameter is set to "ALL_ROWS" then the optimizer will attempt to find an execution plan that completes the statement (typically meaning "returning all rows") in the shortest possible time. ALL_ROWS mode is desinged to minimize computing resource & provide best throughput. The Default value of optimizer_mode in oracle 11g is set to ALL_ROWS.
If the Optimizer_Mode is set to ALL_ROWS then the CBO will favor FULL scan compared to index scan as the index scan is an additional IO.
FIRST_ROWS:
If the value of the Optimizer_Mode parameter is set to "FIRST_ROWS" the optimizer will attempt to find an execution plan to return the first row of a result set as fast as possible. This mode always prefer Index scan over FTS even if FTS is better option.
It blindly rely on INDEX SCAN.
It was deprecated in Oracle 9i.
FIRST_ROWS_N:
It was introduced in Oracle 9i. The number N can be 1,10,100 or 1000 (using first_rows(n) hint, the number n can be any positive whole number). The FIRST_ROWS_N mode instructs the optimizer to choose a query execution plan that minimizes the response time to produce the first N rows of query result & it also favors index scan to access table row. It is useful in case of interactive front end /web app where you want to view first N number of rows as soon as possible. You don't care about the whole result set. At a time you want to see first 100 rows.
Checking default Optimizer Mode:
SQL> show parameter optimizer_mode
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
optimizer_mode string ALL_ROWS
Creating table EMP which has around 0.2 million records, we have also created Primary Key on Employee_ID, be dafault an Index is created and I have gathered the statistics.
ALTER TABLE HR.T_EMP
ADD CONSTRAINT T_EMP_PK PRIMARY KEY (EMPLOYEE_ID);
BEGIN
SYS.DBMS_STATS.GATHER_TABLE_STATS(
OwnName => 'HR',
TabNmae => 'T_EMP',
Cascade => 'TRUE');
END;
/
SELECT * FROM HR.T_EMP
WHERE EMPLOYEE_ID < 100000;
It is doing TABLE ACCESS FULL. Now we will change the parameter to FIRST_ROWS.
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS;
It is now using Index to access the 50% of the rows. Here now even if we fetch 90% rows, it will still use index no matter there are more IOs. FIRST_ROWS mode will always try to use an index if present on table.
This is the problem with FIRST_ROWS mode. It blindly uses the Index.
Now setting OPTIMIZER_MODE to FIRST_ROWS_100, the oracle will try to return first 100 rows.
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS_100;
SELECT * FROM HR.T_EMP WHERE EMPLOYEE_ID < 100000;
We are trying to access 50% of the data and using FIRST_ROWS_100 as optimizer mode. In execution plan it is still using Index in execution plan to give the data. It uses the index to give you the first 100 rows as fast as possible. As can be seen from execution plan ROWS is 100. All the information of ROWS, Bytes, Cost and Time is of first 100 rows.
So even if you are fetching large number of rows, if your optimizer mode is FIRST_ROWS or FIRST_ROWS_100 oracle might use the index.
Impact of OPTIMZER_INDEX_COST_ADJ:
- Using OPTIMZER_INDEX_COST_ADJ parameter you can change the cost of the table access through index scans.
- Valid Values goes from 1 to 10,000. The default is 100.
- Values greater than 100 make index scans more expensive and favor full table scans. Values less than 100 make index scans less expensive & make index scan more favourable.
- By default it is 100.
Setting Optimizer_Mode agian to ALL_ROWS.
ALTER SESSION SET OPTIMIZER_MODE = ALL_ROWS;
SQL> show parameter OPTIMIZER_INDEX_COST_ADJ
NAME TYPE VALUE
------------------------------------ ----------- ---------------------------
optimizer_index_cost_adj integer 100
ALTER SESSION SET OPTIMIZER_INDEX_COST_ADJ=20;
SELECT * FROM HR.T_EMP WHERE EMPLOYEE_ID < 70000;
We tried to fetch thirty percent data and it is using the Index.
ALTER SESSION SET OPTIMIZER_INDEX_COST_ADJ = 100;
SELECT * FROM HR.T_EMP WHERE EMPLOYEE_ID < 70000;
By setting OPTIMIZER_INDEX_COST_ADJ to 100 ( which is default), it is using INDEX as we are fetching 30% of the data. However it was using index when we set it to 20.
Now fetching EMPLOYEE_ID less than 15,000.
SELECT * FROM HR.T_EMP WHERE EMPLOYEE_ID < 15000;
We are fetching around 7.5 percent data and it is using index, with OPTIMIZER_INDEX_COST_ADJ parameter set to 100.
Setting the OPTIMIZER_INDEX_COST_ADJ parameter to 400. According to theory by setting parameter to higher value, it will prefer FULL TABLE SCAN .
ALTER SESSION SET OPTIMIZER_INDEX_COST_ADJ = 400;
SELECT * FROM HR.T_EMP WHERE EMPLOYEE_ID < 15000;
As we can see it is doing TABLE ACCESS FULL, it is not using the index any more.
This is how different values of different parameter OPTIMIZER_MODE and INDEX_OPTIMIZER_MODE_ADJ can affect optimizer to use the index or not.
Source: https://www.youtube.com/user/Anindya007das