Histograms
Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. Therefore, the Optimizer calculates the cardinality (number of rows returned) for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate.
If the data distribution in that column is not uniform (i.e., a data skew exists) then the cardinality estimate will be incorrect.
In order to accurately reflect a non-uniform data distribution, a histogram is required on the column. The presence of a histogram changes the formula used by the Optimizer to estimate the cardinality, and allows it to generate a more accurate execution plan.
A histogram sorts values into “buckets,” as you might sort coins into buckets.
The database has to scan the table and read the values in the column to build the histogram.
The real purpose of adding a histogram to a table is to give the optimizer more information, so that it can come up with a better plan.
If no histograms are present, the CBO assumes an even distribution of data. Histograms tell the CBO when data is not evenly distributed, and can help the CBO estimate the number of rows returned from a table or a join (the "cardinality").
Data skew: Skew is an uneven distribution of data within a column, i.e., large variations in the number of duplicate values.
For example, a California-based book store ships 95% of its books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300,000/3 (the Number of Distinct Values (NDV) is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses a full table scan. With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an index scan.
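As a back-of-the-envelope sketch (the BOOK_ORDERS table, its SHIP_STATE column, and an index on that column are hypothetical names assumed for illustration), the comparison could be reproduced like this:

-- Without a histogram: estimate = 300,000 rows / 3 distinct states = 100,000 rows,
-- so a full table scan looks cheapest for any state.
-- With a frequency histogram: estimate for Oregon = 4% of 300,000 = 12,000 rows,
-- low enough for the index on SHIP_STATE to win.
exec dbms_stats.gather_table_stats(user, 'BOOK_ORDERS', method_opt => 'for columns ship_state size 254');
explain plan for select count(*) from book_orders where ship_state = 'OR';
select * from table(dbms_xplan.display);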
Types of Histograms
Based on the NDV and the distribution of the data, the database chooses the type of histogram to create. (In some cases, when creating a histogram, the database samples an internally predetermined number of rows.) The histogram type for each column is exposed in the HISTOGRAM column of the DBA_TAB_COL_STATISTICS and USER_TAB_COL_STATISTICS views. The types of histograms are as follows:
– HEIGHT BALANCED – Height based: each bucket contains roughly the same number of rows, so popular (skewed) values span several buckets
– FREQUENCY – Value based: each distinct value gets its own bucket, recording the exact number of rows with that value
– TOP FREQUENCY – A 12c variant of the frequency histogram that covers only the most popular values, when those account for the vast majority of rows
– HYBRID – A 12c histogram combining frequency and height-balanced characteristics, used when the NDV exceeds the number of buckets
– NONE – No histogram present for the column
Note: Frequency histograms are more precise than Height Balanced histograms.
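For example, to list the histogram type (if any) for every column of a table:

select column_name, num_distinct, histogram
from user_tab_col_statistics
where table_name = 'T'; -- 'T' is the test table created below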
Let’s create a quick test so we can see this better.
First I will create a table with 11,110 rows and 4 different values (1, 2, 3 and 4).
create table t
as select case when (rownum between 1 and 10) then 1
when (rownum between 11 and 110) then 2
when (rownum between 111 and 1110) then 3
when (rownum between 1111 and 11110) then 4
end x
from dual
connect by rownum <= 11110;
select x,count(1)
from t
group by x;
X COUNT(1)
----- ----------
1 10
2 100
3 1000
4 10000
Then I'm going to collect statistics and check the view USER_TAB_COL_STATISTICS for the statistics on the X column (print_table is Tom Kyte's well-known helper procedure that prints one column per line; any equivalent query works):
exec dbms_stats.gather_table_stats(ownname => user,tabname => 'T');
or
exec dbms_stats.gather_table_stats('TEST','T', METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 4
LOW_VALUE : C102
HIGH_VALUE : C105
DENSITY : .25
NUM_NULLS : 0
NUM_BUCKETS : 1
LAST_ANALYZED : 02-dic-2009 15:28:59
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
AVG_COL_LEN : 3
HISTOGRAM : NONE
From these results, we can see that Oracle detected 4 distinct values, there are no nulls, the density is 0.25 (1/NDV = 1/4), there is only 1 bucket, and there is no histogram (NONE).
Why no histogram? Because I didn't specify the "method_opt" parameter, Oracle used the default ("for all columns size auto"). With SIZE AUTO, Oracle only creates a histogram when it has recorded that the column is used in WHERE-clause predicates and the data is skewed; this freshly created table has no column usage recorded yet.
The only information about the distribution is the following (with a single bucket, USER_TAB_HISTOGRAMS records just the column's low and high values):
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T X 0 1
T X 1 4
If we tell Oracle to use 4 buckets we get:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', method_opt=>'for all columns size 4');
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 4
LOW_VALUE : C102
HIGH_VALUE : C105
DENSITY : .000045004500450045
NUM_NULLS : 0
NUM_BUCKETS : 4
LAST_ANALYZED : 02-dic-2009 16:12:21
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
AVG_COL_LEN : 3
HISTOGRAM : FREQUENCY
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T X 10 1
T X 110 2
T X 1110 3
T X 11110 4
Now we can see that we have a FREQUENCY histogram. To read USER_TAB_HISTOGRAMS we need to know that the column ENDPOINT_VALUE shows each distinct value, and the column ENDPOINT_NUMBER shows the cumulative number of rows up to and including that value. For 1 we have 10 rows, for 2 we have 110-10=100, for 3 we have 1110-110=1000, and so on. Notice also that DENSITY dropped from 0.25 to about 0.000045: for a frequency histogram Oracle sets it to 1/(2*num_rows) = 1/22220, instead of the uniform 1/NDV.
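Rather than doing that subtraction by hand, the per-value row counts can be derived directly from the cumulative endpoint numbers; a small sketch using the LAG analytic function:

-- Each value's count = its cumulative endpoint number minus the previous one
select endpoint_value,
       endpoint_number
         - lag(endpoint_number, 1, 0) over (order by endpoint_number) as num_rows
from user_tab_histograms
where table_name = 'T' and column_name = 'X'
order by endpoint_value;
-- Returns: 1 -> 10, 2 -> 100, 3 -> 1000, 4 -> 10000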
How does Oracle determine when to create a histogram?
Oracle automatically determines the columns that need histograms based on the column usage information recorded in SYS.COL_USAGE$ (columns used in WHERE clause predicates), and the presence of a data skew.
For example, Oracle will not automatically create a histogram on a unique column if it is only seen in equality predicates.
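In recent releases (11.2.0.2 onwards), the supported way to inspect this recorded column usage, instead of querying SYS.COL_USAGE$ directly, is DBMS_STATS.REPORT_COL_USAGE:

-- Reports which kinds of predicates (equality, range, join, ...) have been
-- seen against each column of table T for the current user
select dbms_stats.report_col_usage(user, 'T') from dual;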
What's the downside of stopping Oracle from creating any histograms?
The Optimizer will assume an even distribution of values in all columns.
Another Example Where Oracle Gets Incorrect Estimates (Data Skew)
Imagine we own two hardware stores. Most of our customers live in the same zipcode as one of our stores, with a couple of exceptions.
SELECT zipcode, COUNT(*)
FROM customers
GROUP BY zipcode;
ZIPCODE COUNT(*)
---------- ----------
91140 20000
94303 20000
94070 2
94065 2
Let's now gather statistics on the CUSTOMERS table following the recommendation to disable histograms (SIZE 1 means a single bucket, i.e., no histogram):
exec dbms_stats.gather_table_stats('SH','CUSTOMERS', METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
SELECT column_name, num_distinct, histogram
FROM user_tab_col_Statistics
WHERE TABLE_NAME='CUSTOMERS';
COLUMN_NAME NUM_DISTINCT HISTOGRAM
----------------------- ------------ ---------------
CUST_FIRST_NAME 451 NONE
CUST_LAST_NAME 401 NONE
CUST_GENDER 2 NONE
ZIPCODE 4 NONE
CUST_EMAIL 401 NONE
No histograms were created, so when a query comes in looking for the genders of our customers in a given zipcode, the Optimizer calculates the cardinality estimate by dividing the total number of rows in the table (40,004) by the number of distinct values (4). Therefore the Optimizer will assume 10,001 rows will be returned from the customers table regardless of the zipcode provided, which will always result in a FULL TABLE SCAN. This works great for two of our zipcodes (91140 and 94303) but not so great for the other two (94070 and 94065), which would be better off using an INDEX RANGE SCAN.
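A sketch of the fix (assuming an index exists on ZIPCODE, that ZIPCODE is numeric, and that the workload above has already run so the column usage is recorded): re-gather with the default AUTO setting, verify that a frequency histogram appeared on ZIPCODE, and re-check the plan for a rare zipcode:

exec dbms_stats.gather_table_stats('SH','CUSTOMERS', METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO');
select column_name, histogram
from user_tab_col_statistics
where table_name = 'CUSTOMERS' and column_name = 'ZIPCODE';
explain plan for select count(*) from customers where zipcode = 94070;
select * from table(dbms_xplan.display);
-- With the histogram, the estimate for 94070 drops from 10,001 to ~2 rows,
-- and an INDEX RANGE SCAN becomes the cheaper plan.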
Now I’m going to create a different distribution to get a different histogram:
drop table t;
create table t as select rownum x
from dual
connect by rownum <= 11110;
update t set x = 1 where rownum <= 5000;
update t set x = 2 where rownum <= 3000 and x != 1;
create index t_idx on t(x);
So now I have a table similar to the one before but with a different distribution: for x=1 we have 5000 rows, for x=2 we have 3000 rows, and every remaining value appears just once.
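A quick check of the new distribution (only the repeated values are listed; every other value appears exactly once):

select x, count(1)
from t
group by x
having count(1) > 1;
-- X=1 -> 5000 rows, X=2 -> 3000 rows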
We will collect stats, but without histograms. (Note: 'for columns size 1' without a column list actually gathers no new column statistics at all, which is why the output below still shows the statistics computed when the table was created (NOTES : STATS_ON_LOAD), before the updates; hence NUM_DISTINCT still reads 11110. Either way, there is no histogram.)
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', cascade => true, method_opt => 'for columns size 1');
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 11110
LOW_VALUE : C102
HIGH_VALUE : C3020C0B
DENSITY : .00009000900090009
NUM_NULLS : 0
NUM_BUCKETS : 1
LAST_ANALYZED : 17-AUG-2018 08:38:37
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
NOTES : STATS_ON_LOAD
AVG_COL_LEN : 5
HISTOGRAM : NONE
SCOPE : SHARED
-----------------------------------------------------
explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3482591947
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | INDEX RANGE SCAN| T_IDX | 111 | 333 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("X"=1)
You can see that the optimizer estimated (ROWS column) 111 rows, which is not correct because we have 5000 rows with X=1.
Now we will collect stats with the default method:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', cascade => true);
SELECT column_name, num_distinct, histogram FROM dba_tab_col_statistics WHERE table_name= '&TABLE_NAME' and owner='&OWNER';
or
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 3112
LOW_VALUE : C102
HIGH_VALUE : C3020C0B
DENSITY : .000141
NUM_NULLS : 0
NUM_BUCKETS : 254
LAST_ANALYZED : 17-AUG-2018 08:42:24
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
NOTES :
AVG_COL_LEN : 4
HISTOGRAM : HYBRID
SCOPE : SHARED
-----------------------------------------------------
The histogram type is HYBRID (from 12c onwards Oracle creates a HYBRID histogram where older releases would have created a HEIGHT BALANCED one; the equal-height bucket arithmetic below applies to both), the number of distinct values is 3112, and the number of buckets is 254 (the default maximum).
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T X 113 1
T X 181 2
T X 182 8008
T X 183 8052
T X 184 8096
T X 185 8140
T X 186 8184
T X 187 8228
T X 188 8272
T X 189 8315
T X 190 8358
T X 191 8401
T X 192 8444
T X 193 8487
T X 194 8530
T X 195 8573
T X 196 8616
T X 197 8659
T X 198 8702
T X 199 8745
T X 200 8788
T X 201 8831
T X 202 8874
T X 203 8917
T X 204 8960
T X 205 9003
T X 206 9046
T X 207 9089
T X 208 9132
T X 209 9175
T X 210 9218
T X 211 9261
T X 212 9304
T X 213 9347
T X 214 9390
T X 215 9433
T X 216 9476
T X 217 9519
T X 218 9562
T X 219 9605
T X 220 9648
T X 221 9691
T X 222 9734
T X 223 9777
T X 224 9820
T X 225 9863
T X 226 9906
T X 227 9949
T X 228 9992
T X 229 10035
T X 230 10078
T X 231 10121
T X 232 10164
T X 233 10207
T X 234 10250
T X 235 10293
T X 236 10336
T X 237 10379
T X 238 10422
T X 239 10465
T X 240 10508
T X 241 10551
T X 242 10594
T X 243 10637
T X 244 10680
T X 245 10723
T X 246 10766
T X 247 10809
T X 248 10852
T X 249 10895
T X 250 10938
T X 251 10981
T X 252 11024
T X 253 11067
T X 254 11110
As we can see in the previous output, the distribution is represented differently from a FREQUENCY histogram because there are too many distinct values (3112) to give each one its own bucket. Checking the histogram we can see 254 buckets, and ENDPOINT_NUMBER now holds the bucket number rather than a cumulative row count. The first row:
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T X 113 1
has 113 buckets assigned. Where does this value come from? We have 11110 rows, 5000 of which have the value 1. Let's do some math:
11110 rows are represented in 254 buckets, and 113 of those buckets were assigned to the value "1":
select (11110/254)*113 from dual;
(11110/254)*113
---------------
4942.6378
Close to 5000, correct? Now let's check for value = 2:
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T X 113 1
T X 181 2
To estimate the row count from the number of buckets assigned to ENDPOINT_VALUE = 2 (buckets 114 through 181, i.e., 181-113 = 68 buckets):
select (11110/254)*(181-113) from dual;
(11110/254)*(181-113)
---------------------
2974.33071
As you can see, this is very close to 3000. The small difference arises because 11110 rows are spread across 254 buckets, i.e., 11110/254 ≈ 43.74 rows per bucket, so estimates are only accurate to bucket granularity.
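The same bucket arithmetic can be expressed as a single query against USER_TAB_HISTOGRAMS; a sketch (valid here because the values 1 and 2 occupy the first buckets):

select endpoint_value,
       round((endpoint_number
              - lag(endpoint_number, 1, 0) over (order by endpoint_number))
             * 11110 / 254) as estimated_rows
from user_tab_histograms
where table_name = 'T' and column_name = 'X'
and endpoint_value in (1, 2);
-- ENDPOINT_VALUE 1 -> ~4943 rows, 2 -> ~2974 rows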
Once we get the histogram we can check the new plans:
explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 1842905362
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL| T | 4943 | 14829 | 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("X"=1)
Now the optimizer selects a FULL TABLE SCAN as the best plan; note that the 4943-row estimate matches our bucket math above.
If we query a value that matches few or no rows:
explain plan for select count(1) from t where x = 999999;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3482591947
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | INDEX RANGE SCAN| T_IDX | 1 | 3 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("X"=999999)
We can see that an index was used and this is thanks to the histogram.
How to Use Histograms
The default value of METHOD_OPT from 10g onwards is ‘FOR ALL COLUMNS SIZE AUTO’.
METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO'
METHOD_OPT=>'FOR ALL COLUMNS SIZE REPEAT'
METHOD_OPT=>'FOR ALL COLUMNS SIZE SKEWONLY'
AUTO:
Lets Oracle determine which columns need histograms, based on the data distribution and the workload (recorded usage) of the columns.
exec dbms_stats.gather_table_stats('HR','EMPLOYEES', method_opt => 'for all columns size auto');
REPEAT:
This re-creates existing histograms without creating any new ones.
exec dbms_stats.gather_table_stats('HR','EMPLOYEES', method_opt => 'for all columns size repeat');
SKEWONLY:
When you specify SKEWONLY, the database will look at the data distribution for each column to determine if the data is skewed enough to warrant the creation of a histogram.
Oracle then gathers histograms only on those skewed columns:
exec dbms_stats.gather_table_stats('HR','EMPLOYEES', method_opt => 'for all columns size skewonly');
When to Use/Not Use Histograms
When to Use Histograms
Create histograms on columns that are frequently used in the WHERE clauses of queries and have a highly skewed data distribution. It is recommended to let Oracle determine this automatically using the METHOD_OPT AUTO setting, so that Oracle can choose the columns to collect histograms on based on the data distribution and the workload of the columns.
When to Not Use Histograms
Be aware that histograms, like all optimizer statistics, are static: if the data distribution of a column changes frequently, the histogram for that column must be regathered regularly to stay accurate.
Histograms are not useful for columns with the following characteristics:
– The column data is uniformly distributed
– The column is not used in the WHERE clauses of queries
– The column is unique and is used only with equality predicates
Also, do not create histograms on every column of every table: doing so lengthens statistics gathering, increases parse time, and can result in poor plans being generated by the optimizer.
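If a histogram already exists on such a column, one way to drop just the histogram while keeping the base column statistics is to re-gather that column with SIZE 1 (shown here against the test table T; substitute your own table and column):

exec dbms_stats.gather_table_stats(user, 'T', method_opt => 'for columns x size 1');
-- From 12c onwards DBMS_STATS.DELETE_COLUMN_STATS also accepts
-- col_stat_type => 'HISTOGRAM' to remove only the histogram:
-- exec dbms_stats.delete_column_stats(user, 'T', 'X', col_stat_type => 'HISTOGRAM');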