Histograms
Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. Therefore, the Optimizer calculates the cardinality (number of rows returned) for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate.
If the data distribution in that column is not uniform (i.e., a data skew exists) then the cardinality estimate will be incorrect.
In order to accurately reflect a non-uniform data distribution, a histogram is required on the column. The presence of a histogram changes the formula used by the Optimizer to estimate the cardinality, and allows it to generate a more accurate execution plan.
A histogram sorts values into “buckets,” as you might sort coins into buckets.
The database has to scan the table and read the values in the column to build the histogram.
The real purpose of adding a histogram to a table is to give the optimizer more information, so that it can come up with a better plan.
If no histograms are present, the CBO assumes an even distribution of data. Histograms tell the CBO when data is not evenly distributed, and can help the CBO estimate the number of rows returned from a table or a join (the "cardinality").
Data skew: Skew is an uneven distribution of data within a column, i.e., large variations in the number of duplicate values.
For example, a California-based book store ships 95% of its books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300,000/3 (the Number of Distinct Values (NDV) is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses a full table scan. With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an index scan.
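As a back-of-the-envelope sketch (the BOOK_ORDERS table, its SHIP_STATE column, and an index on that column are hypothetical names assumed for illustration), the comparison could be reproduced like this:

-- Without a histogram: estimate = 300,000 rows / 3 distinct states = 100,000 rows,
-- so a full table scan looks cheapest for any state.
-- With a frequency histogram: estimate for Oregon = 4% of 300,000 = 12,000 rows,
-- low enough for the index on SHIP_STATE to win.
exec dbms_stats.gather_table_stats(user, 'BOOK_ORDERS', method_opt => 'for columns ship_state size 254');
explain plan for select count(*) from book_orders where ship_state = 'OR';
select * from table(dbms_xplan.display);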
Types of Histograms
Based on the NDV and the distribution of the data, the database chooses the type of histogram to create. (In some cases, when creating a histogram, the database samples an internally predetermined number of rows.) The histogram type for each column is exposed in the HISTOGRAM column of the DBA_TAB_COL_STATISTICS and USER_TAB_COL_STATISTICS views. The types of histograms are as follows:
– HEIGHT BALANCED – Height based: each bucket contains roughly the same number of rows, so popular (skewed) values span several buckets
– FREQUENCY – Value based: each distinct value gets its own bucket, recording the exact number of rows with that value
– TOP FREQUENCY – A 12c variant of the frequency histogram that covers only the most popular values, when those account for the vast majority of rows
– HYBRID – A 12c histogram combining frequency and height-balanced characteristics, used when the NDV exceeds the number of buckets
– NONE – No histogram present for the column
Note: Frequency histograms are more precise than Height Balanced histograms.
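For example, to list the histogram type (if any) for every column of a table:

select column_name, num_distinct, histogram
from user_tab_col_statistics
where table_name = 'T'; -- 'T' is the test table created below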
Let’s create a quick test so we can see this better.
First I will create a table with 11,110 rows and 4 different values (1, 2, 3 and 4).
create table t
as select case when (rownum between 1 and 10) then 1
when (rownum between 11 and 110) then 2
when (rownum between 111 and 1110) then 3
when (rownum between 1111 and 11110) then 4
end x
from dual
connect by rownum <= 11110;
select x,count(1)
from t
group by x;
X COUNT(1)
----- ----------
1 10
2 100
3 1000
4 10000
Then I'm going to collect statistics and check the view USER_TAB_COL_STATISTICS for the statistics on the X column (print_table is Tom Kyte's well-known helper procedure that prints one column per line; any equivalent query works):
exec dbms_stats.gather_table_stats(ownname => user,tabname => 'T');
or
exec dbms_stats.gather_table_stats('TEST','T', METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 4
LOW_VALUE : C102
HIGH_VALUE : C105
DENSITY : .25
NUM_NULLS : 0
NUM_BUCKETS : 1
LAST_ANALYZED : 02-dic-2009 15:28:59
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
AVG_COL_LEN : 3
HISTOGRAM : NONE
From these results, we can see that Oracle detected 4 distinct values, there are no nulls, the density is 0.25 (1/NDV = 1/4), there is only 1 bucket, and there is no histogram (NONE).
Why no histogram? Because I didn't specify the "method_opt" parameter, Oracle used the default ("for all columns size auto"). With SIZE AUTO, Oracle only creates a histogram when it has recorded that the column is used in WHERE-clause predicates and the data is skewed; this freshly created table has no column usage recorded yet.
The only information about the distribution is the following (with a single bucket, USER_TAB_HISTOGRAMS records just the column's low and high values):
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T X 0 1
T X 1 4
If we tell Oracle to use 4 buckets we get:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', method_opt=>'for all columns size 4');
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 4
LOW_VALUE : C102
HIGH_VALUE : C105
DENSITY : .000045004500450045
NUM_NULLS : 0
NUM_BUCKETS : 4
LAST_ANALYZED : 02-dic-2009 16:12:21
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
AVG_COL_LEN : 3
HISTOGRAM : FREQUENCY
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T X 10 1
T X 110 2
T X 1110 3
T X 11110 4
Now we can see that we have a FREQUENCY histogram. To read USER_TAB_HISTOGRAMS we need to know that the column ENDPOINT_VALUE shows each distinct value, and the column ENDPOINT_NUMBER shows the cumulative number of rows up to and including that value. For 1 we have 10 rows, for 2 we have 110-10=100, for 3 we have 1110-110=1000, and so on. Notice also that DENSITY dropped from 0.25 to about 0.000045: for a frequency histogram Oracle sets it to 1/(2*num_rows) = 1/22220, instead of the uniform 1/NDV.
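Rather than doing that subtraction by hand, the per-value row counts can be derived directly from the cumulative endpoint numbers; a small sketch using the LAG analytic function:

-- Each value's count = its cumulative endpoint number minus the previous one
select endpoint_value,
       endpoint_number
         - lag(endpoint_number, 1, 0) over (order by endpoint_number) as num_rows
from user_tab_histograms
where table_name = 'T' and column_name = 'X'
order by endpoint_value;
-- Returns: 1 -> 10, 2 -> 100, 3 -> 1000, 4 -> 10000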
How does Oracle determine when to create a histogram?
Oracle automatically determines the columns that need histograms based on the column usage information recorded in SYS.COL_USAGE$ (columns used in WHERE clause predicates), and the presence of a data skew.
For example, Oracle will not automatically create a histogram on a unique column if it is only seen in equality predicates.
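In recent releases (11.2.0.2 onwards), the supported way to inspect this recorded column usage, instead of querying SYS.COL_USAGE$ directly, is DBMS_STATS.REPORT_COL_USAGE:

-- Reports which kinds of predicates (equality, range, join, ...) have been
-- seen against each column of table T for the current user
select dbms_stats.report_col_usage(user, 'T') from dual;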
What's the downside of stopping Oracle from creating any histograms?
The Optimizer will assume an even distribution of values in all columns.
Another Example Where Oracle Gets Incorrect Estimates (Data Skew)
Imagine we own two hardware stores. Most of our customers live in the same zipcode as one of our stores, with a couple of exceptions.
SELECT zipcode, COUNT(*)
FROM customers
GROUP BY zipcode;
ZIPCODE COUNT(*)
---------- ----------
91140 20000
94303 20000
94070 2
94065 2
Let's now gather statistics on the CUSTOMERS table following the recommendation to disable histograms (SIZE 1 means a single bucket, i.e., no histogram):
exec dbms_stats.gather_table_stats('SH','CUSTOMERS', METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
SELECT column_name, num_distinct, histogram
FROM user_tab_col_Statistics
WHERE TABLE_NAME='CUSTOMERS';
COLUMN_NAME NUM_DISTINCT HISTOGRAM
----------------------- ------------ ---------------
CUST_FIRST_NAME 451 NONE
CUST_LAST_NAME 401 NONE
CUST_GENDER 2 NONE
ZIPCODE 4 NONE
CUST_EMAIL 401 NONE
No histograms were created, so when a query comes in looking for the genders of our customers in a given zipcode, the Optimizer calculates the cardinality estimate by dividing the total number of rows in the table (40,004) by the number of distinct values (4). Therefore the Optimizer will assume 10,001 rows will be returned from the customers table regardless of the zipcode provided, which will always result in a FULL TABLE SCAN. This works great for two of our zipcodes (91140 and 94303) but not so great for the other two (94070 and 94065), which would be better off using an INDEX RANGE SCAN.
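A sketch of the fix (assuming an index exists on ZIPCODE, that ZIPCODE is numeric, and that the workload above has already run so the column usage is recorded): re-gather with the default AUTO setting, verify that a frequency histogram appeared on ZIPCODE, and re-check the plan for a rare zipcode:

exec dbms_stats.gather_table_stats('SH','CUSTOMERS', METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO');
select column_name, histogram
from user_tab_col_statistics
where table_name = 'CUSTOMERS' and column_name = 'ZIPCODE';
explain plan for select count(*) from customers where zipcode = 94070;
select * from table(dbms_xplan.display);
-- With the histogram, the estimate for 94070 drops from 10,001 to ~2 rows,
-- and an INDEX RANGE SCAN becomes the cheaper plan.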
Now I’m going to create a different distribution to get a different histogram:
drop table t;
create table t as select rownum x
from dual
connect by rownum <= 11110;
update t set x = 1 where rownum <= 5000;
update t set x = 2 where rownum <= 3000 and x != 1;
create index t_idx on t(x);
So now I have a table similar to the one before but with a different distribution: for x=1 we have 5000 rows, for x=2 we have 3000 rows, and every remaining value appears just once.
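A quick check of the new distribution (only the repeated values are listed; every other value appears exactly once):

select x, count(1)
from t
group by x
having count(1) > 1;
-- X=1 -> 5000 rows, X=2 -> 3000 rows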
We will collect stats, but without histograms. (Note: 'for columns size 1' without a column list actually gathers no new column statistics at all, which is why the output below still shows the statistics computed when the table was created (NOTES : STATS_ON_LOAD), before the updates; hence NUM_DISTINCT still reads 11110. Either way, there is no histogram.)
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', cascade => true, method_opt => 'for columns size 1');
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 11110
LOW_VALUE : C102
HIGH_VALUE : C3020C0B
DENSITY : .00009000900090009
NUM_NULLS : 0
NUM_BUCKETS : 1
LAST_ANALYZED : 17-AUG-2018 08:38:37
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
NOTES : STATS_ON_LOAD
AVG_COL_LEN : 5
HISTOGRAM : NONE
SCOPE : SHARED
-----------------------------------------------------
explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3482591947
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | INDEX RANGE SCAN| T_IDX | 111 | 333 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("X"=1)
You can see that the optimizer estimated (ROWS column) 111 rows, which is not correct because we have 5000 rows with X=1.
Now we will collect stats with the default method:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', cascade => true);
SELECT column_name, num_distinct, histogram FROM dba_tab_col_statistics WHERE table_name= '&TABLE_NAME' and owner='&OWNER';
or
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME : T
COLUMN_NAME : X
NUM_DISTINCT : 3112
LOW_VALUE : C102
HIGH_VALUE : C3020C0B
DENSITY : .000141
NUM_NULLS : 0
NUM_BUCKETS : 254
LAST_ANALYZED : 17-AUG-2018 08:42:24
SAMPLE_SIZE : 11110
GLOBAL_STATS : YES
USER_STATS : NO
NOTES :
AVG_COL_LEN : 4
HISTOGRAM : HYBRID
SCOPE : SHARED
-----------------------------------------------------
The histogram type is HYBRID (from 12c onwards Oracle creates a HYBRID histogram where older releases would have created a HEIGHT BALANCED one; the equal-height bucket arithmetic below applies to both), the number of distinct values is 3112, and the number of buckets is 254 (the default maximum).
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T X 113 1
T X 181 2
T X 182 8008
T X 183 8052
T X 184 8096
T X 185 8140
T X 186 8184
T X 187 8228
T X 188 8272
T X 189 8315
T X 190 8358
T X 191 8401
T X 192 8444
T X 193 8487
T X 194 8530
T X 195 8573
T X 196 8616
T X 197 8659
T X 198 8702
T X 199 8745
T X 200 8788
T X 201 8831
T X 202 8874
T X 203 8917
T X 204 8960
T X 205 9003
T X 206 9046
T X 207 9089
T X 208 9132
T X 209 9175
T X 210 9218
T X 211 9261
T X 212 9304
T X 213 9347
T X 214 9390
T X 215 9433
T X 216 9476
T X 217 9519
T X 218 9562
T X 219 9605
T X 220 9648
T X 221 9691
T X 222 9734
T X 223 9777
T X 224 9820
T X 225 9863
T X 226 9906
T X 227 9949
T X 228 9992
T X 229 10035
T X 230 10078
T X 231 10121
T X 232 10164
T X 233 10207
T X 234 10250
T X 235 10293
T X 236 10336
T X 237 10379
T X 238 10422
T X 239 10465
T X 240 10508
T X 241 10551
T X 242 10594
T X 243 10637
T X 244 10680
T X 245 10723
T X 246 10766
T X 247 10809
T X 248 10852
T X 249 10895
T X 250 10938
T X 251 10981
T X 252 11024
T X 253 11067
T X 254 11110
As we can see in the previous output, the distribution is represented differently from a FREQUENCY histogram because there are too many distinct values (3112) to give each one its own bucket. Checking the histogram we can see 254 buckets, and ENDPOINT_NUMBER now holds the bucket number rather than a cumulative row count. The first row:
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T X 113 1
has 113 buckets assigned. Where does this value come from? We have 11110 rows, 5000 of which have the value 1. Let's do some math:
11110 rows are represented in 254 buckets, and 113 of those buckets were assigned to the value "1":
select (11110/254)*113 from dual;
(11110/254)*113
---------------
4942.6378
Close to 5000, correct? Now let's check for value = 2:
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T X 113 1
T X 181 2
To estimate the row count from the number of buckets assigned to ENDPOINT_VALUE = 2 (buckets 114 through 181, i.e., 181-113 = 68 buckets):
select (11110/254)*(181-113) from dual;
(11110/254)*(181-113)
---------------------
2974.33071
As you can see, this is very close to 3000. The small difference arises because 11110 rows are spread across 254 buckets, i.e., 11110/254 ≈ 43.74 rows per bucket, so estimates are only accurate to bucket granularity.
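The same bucket arithmetic can be expressed as a single query against USER_TAB_HISTOGRAMS; a sketch (valid here because the values 1 and 2 occupy the first buckets):

select endpoint_value,
       round((endpoint_number
              - lag(endpoint_number, 1, 0) over (order by endpoint_number))
             * 11110 / 254) as estimated_rows
from user_tab_histograms
where table_name = 'T' and column_name = 'X'
and endpoint_value in (1, 2);
-- ENDPOINT_VALUE 1 -> ~4943 rows, 2 -> ~2974 rows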
Once we get the histogram we can check the new plans:
explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 1842905362
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL| T | 4943 | 14829 | 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("X"=1)
Now the optimizer selects a FULL TABLE SCAN as the best plan; note that the 4943-row estimate matches our bucket math above.
If we query a value that matches few or no rows:
explain plan for select count(1) from t where x = 999999;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3482591947
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | INDEX RANGE SCAN| T_IDX | 1 | 3 | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("X"=999999)
We can see that an index was used and this is thanks to the histogram.
How to Use Histograms
The default value of METHOD_OPT from 10g onwards is ‘FOR ALL COLUMNS SIZE AUTO’.
METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO'
METHOD_OPT=>'FOR ALL COLUMNS SIZE REPEAT'
METHOD_OPT=>'FOR ALL COLUMNS SIZE SKEWONLY'
AUTO:
Lets Oracle determine which columns need histograms, based on the data distribution and the workload (recorded usage) of the columns.
exec dbms_stats.gather_table_stats('HR','EMPLOYEES', method_opt => 'for all columns size auto');
REPEAT:
This re-creates existing histograms without creating any new ones.
exec dbms_stats.gather_table_stats('HR','EMPLOYEES', method_opt => 'for all columns size repeat');
SKEWONLY:
When you specify SKEWONLY, the database will look at the data distribution for each column to determine if the data is skewed enough to warrant the creation of a histogram.
Oracle then gathers histograms only on those skewed columns:
exec dbms_stats.gather_table_stats('HR','EMPLOYEES', method_opt => 'for all columns size skewonly');
When to Use/Not Use Histograms
When to Use Histograms
Create histograms on columns that are frequently used in the WHERE clauses of queries and have a highly skewed data distribution. It is recommended to let Oracle determine this automatically using the METHOD_OPT AUTO setting, so that Oracle can choose the columns to collect histograms on based on the data distribution and the workload of the columns.
When to Not Use Histograms
Be aware that histograms, like all optimizer statistics, are static: if the data distribution of a column changes frequently, the histogram for that column must be regathered regularly to stay accurate.
Histograms are not useful for columns with the following characteristics:
– The column data is uniformly distributed
– The column is not used in the WHERE clauses of queries
– The column is unique and is used only with equality predicates
Also, do not create histograms on every column of every table: doing so lengthens statistics gathering, increases parse time, and can result in poor plans being generated by the optimizer.
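If a histogram already exists on such a column, one way to drop just the histogram while keeping the base column statistics is to re-gather that column with SIZE 1 (shown here against the test table T; substitute your own table and column):

exec dbms_stats.gather_table_stats(user, 'T', method_opt => 'for columns x size 1');
-- From 12c onwards DBMS_STATS.DELETE_COLUMN_STATS also accepts
-- col_stat_type => 'HISTOGRAM' to remove only the histogram:
-- exec dbms_stats.delete_column_stats(user, 'T', 'X', col_stat_type => 'HISTOGRAM');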