Histograms

Histograms

 

Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. Therefore, the Optimizer calculates the cardinality (number of rows returned) for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate.

If the data distribution in that column is not uniform (i.e., a data skew exists) then the cardinality estimate will be incorrect.
In order to accurately reflect a non-uniform data distribution, a histogram is required on the column. The presence of a histogram changes the formula used by the Optimizer to estimate the cardinality, and allows it to generate a more accurate execution plan.
A histogram sorts values into “buckets,” as you might sort coins into buckets.

The database has to scan the table and read the values in the column to build the histogram.
The real purpose of adding a histogram to table is, to give the more information to optimizer to come up with a better plan.
If no histograms are present, the CBO assumes an even distribution of data. Histograms tell the CBO when data  is not evenly distributed, and can help the CBO to estimate the number of rows returned from a table join (called “cardinality”).
Data skew: Skew is defined like uneven data distribution in column. Large variations in the number of duplicate values in a column.

For example, a California-based book store ships 95% of the books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300000/3 (Number of Distinct Vaues (NDV) is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses a full table scan. With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an index scan.

Type of Histograms
Based on the NDV and the distribution of the data, the database chooses the type of histogram to create. (In some cases, when creating a histogram, the database samples an internally predetermined number of rows.) The types of histograms are as follows:

The database view DBA_TAB_COL_STATISTICS or USER_TAB_COL_STATISTICS has a column called HISTOGRAM.

–    HEIGHT BALANCED – Height Based : Each bucket has same number of values, skewed values occupy more buckets
–    FREQUENCY – Value Based:  Each key has its own number of records
–   TOP FREQUENCY HISTOGRAMS
–   HYBRID HISTOGRAMS
–    NONE  – No histogram present for the column

Note: Frequency histograms are more precise than Height Balanced histograms.

Let’s create a quick test so we can see this better.
First I will create a table with 11,110 rows , with 4 different values: (1,2,3 y 4) .

create table t
as select case when (rownum between 1 and 10) then 1
               when (rownum between 11 and 110) then 2
               when (rownum between 111 and 1110) then 3
               when (rownum between 1111 and 11110) then 4
          end x
from dual
connect by rownum <= 11110;


select x,count(1)
from t
group by x;

    X   COUNT(1)
----- ----------
    1         10
    2        100
    3       1000
    4      10000

Then I’m going to collect statistics and check the table USER_TAB_COL_STATISTICS for statistics values over the X Column:

exec dbms_stats.gather_table_stats(ownname => user,tabname => 'T');
or
exec dbms_stats.gather_table_stats('TEST','T', METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');

exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME                    : T
COLUMN_NAME                   : X
NUM_DISTINCT                  : 4
LOW_VALUE                     : C102
HIGH_VALUE                    : C105
DENSITY                       : .25
NUM_NULLS                     : 0
NUM_BUCKETS                   : 1
LAST_ANALYZED                 : 02-dic-2009 15:28:59
SAMPLE_SIZE                   : 11110
GLOBAL_STATS                  : YES
USER_STATS                    : NO
AVG_COL_LEN                   : 3
HISTOGRAM                     : NONE

From these results, we can see that Oracle detected 4 different values, there are no nulls and the density is 0.25 (1/”different values”=1/4), there is only 1 bucket and there is no histograms (NONE).
Why no histograms? because I didn’t specify the “method_opt” parameters, hence, Oracle used the default (“for all columns size auto”).

The only information about the distribution is the following:
select * from user_tab_histograms where table_name = 'T';

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T          X                       0              1
T          X                       1              4

If we tell Oracle to use 4 buckets we get:
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', method_opt=>'for all columns size 4');

exec print_table('select * from user_tab_col_statistics where table_name = ''T''');
TABLE_NAME                    : T
COLUMN_NAME                   : X
NUM_DISTINCT                  : 4
LOW_VALUE                     : C102
HIGH_VALUE                    : C105
DENSITY                       : .000045004500450045
NUM_NULLS                     : 0
NUM_BUCKETS                   : 4
LAST_ANALYZED                 : 02-dic-2009 16:12:21
SAMPLE_SIZE                   : 11110
GLOBAL_STATS                  : YES
USER_STATS                    : NO
AVG_COL_LEN                   : 3
HISTOGRAM                     : FREQUENCY


select * from user_tab_histograms where table_name = 'T';

TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T          X                       10              1
T          X                      110              2
T          X                     1110              3
T          X                    11110              4

Now we can see that we have a HISTOGRAM (FREQUENCY). To analyze the table USER_TAB_HISTOGRAMS we need to know that the column ENDPOINT_VALUE shows all the different values, and the column ENDPOINT_NUMBER shows the number of records with that value. For 1 we have 10, for 2 we have 110-10=100, for 3 we have 1110-110=1000 and so on….

How does Oracle determine when to create a histogram?
Oracle automatically determines the columns that need histograms based on the column usage information recorded in SYS.COL_USAGE$ (columns used in WHERE clause predicates), and the presence of a data skew.
For example, Oracle will not automatically create a histogram on a unique column if it is only seen in equality predicates.

What’s the down side to stopping Oracle from creating any histograms?
The Optimizer will assume an even distribution of values in all columns.

Another Example were Oracle got incorrect stats (data skewed)
Imagine we owned two hardware stores. Most of customers live in the same zipcode as our store, with a couple of exceptions.

SELECT   zipcode, COUNT(*)
FROM     customers
GROUP BY zipcode;
 
  ZIPCODE   COUNT(*)
---------- ----------
     91140    20000
     94303    20000
     94070        2
     94065        2

Let’s now gather statistics on the CUSTOMERS table using the recommendation to disable histograms.
exec dbms_stats.gather_table_stats('SH','CUSTOMERS', METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');

SELECT column_name, num_distinct, histogram
FROM   user_tab_col_Statistics
WHERE  TABLE_NAME='CUSTOMERS';
 
COLUMN_NAME                NUM_DISTINCT HISTOGRAM
----------------------- ------------ ---------------
CUST_FIRST_NAME                  451 NONE
CUST_LAST_NAME                   401 NONE
CUST_GENDER                        2 NONE
ZIPCODE                            4 NONE
CUST_EMAIL                       401 NONE

No histograms were created, so when a query comes in looking to find the different genders of our customers in a given zipcode, the Optimizer will calculate the cardinality estimates by dividing the total number of rows in the table (40,004) by the number of distinct values (4). Therefore the Optimizer will assume 10,001 rows will be returned from the customers table regardless of the zipcode provided, which will always result in a FULL TABLE SCAN. This works great for two of our zipcodes (91140 and 94303) but not so great for the other two (94070 and 94065), who would be better off using an INDEX RANGE SCAN.

Now I’m going to create a different distribution to get a different histogram:

drop table t;

create table t as select rownum x
  from dual
  connect by rownum <= 11110;

update t set x = 1 where rownum <= 5000;
update t set x = 2 where rownum <= 3000 and x != 1;
create index t_idx on t(x);

So now I have a table similar to what I had before but with a different distribution. For x=1 we have 5000 values, for x=2 300 values and for the rest just 1.
We will collect stats but not for histograms.

exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', cascade => true, method_opt => 'for columns size 1');

exec print_table('select * from user_tab_col_statistics where table_name = ''T''');

TABLE_NAME                    : T
COLUMN_NAME                   : X
NUM_DISTINCT                  : 11110
LOW_VALUE                     : C102
HIGH_VALUE                    : C3020C0B
DENSITY                       : .00009000900090009
NUM_NULLS                     : 0
NUM_BUCKETS                   : 1
LAST_ANALYZED                 : 17-AUG-2018 08:38:37
SAMPLE_SIZE                   : 11110
GLOBAL_STATS                  : YES
USER_STATS                    : NO
NOTES                         : STATS_ON_LOAD
AVG_COL_LEN                   : 5
HISTOGRAM                     : NONE
SCOPE                         : SHARED
-----------------------------------------------------

explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3482591947

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |     3 |            |          |
|*  2 |   INDEX RANGE SCAN| T_IDX |   111 |   333 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("X"=1)

You can see that the optimiz estimated (ROWS column) 111 rows, which is not correct because we have 5000 rows that are  X=1.
Now we will collect stats with the default method:

exec dbms_stats.gather_table_stats(ownname => user, tabname => 'T', cascade => true);

SELECT column_name, num_distinct, histogram FROM   dba_tab_col_statistics WHERE  table_name= '&TABLE_NAME' and owner='&OWNER';
or
exec print_table('select * from user_tab_col_statistics where table_name = ''T''');

TABLE_NAME                    : T
COLUMN_NAME                   : X
NUM_DISTINCT                  : 3112
LOW_VALUE                     : C102
HIGH_VALUE                    : C3020C0B
DENSITY                       : .000141
NUM_NULLS                     : 0
NUM_BUCKETS                   : 254
LAST_ANALYZED                 : 17-AUG-2018 08:42:24
SAMPLE_SIZE                   : 11110
GLOBAL_STATS                  : YES
USER_STATS                    : NO
NOTES                         :
AVG_COL_LEN                   : 4
HISTOGRAM                     : HYBRID
SCOPE                         : HEIGHT BALANCED
-----------------------------------------------------

The histogram Tipe is “HEIGHT BALANCED”, the number of different rows is 3312 and the number of buckets is 254 (the maximum allowed)

select * from user_tab_histograms where table_name = 'T';

TABLE_NAME                     COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T                              X                      113              1
T                              X                      181              2
T                              X                      182           8008
T                              X                      183           8052
T                              X                      184           8096
T                              X                      185           8140
T                              X                      186           8184
T                              X                      187           8228
T                              X                      188           8272
T                              X                      189           8315
T                              X                      190           8358
T                              X                      191           8401
T                              X                      192           8444
T                              X                      193           8487
T                              X                      194           8530
T                              X                      195           8573
T                              X                      196           8616
T                              X                      197           8659
T                              X                      198           8702
T                              X                      199           8745
T                              X                      200           8788
T                              X                      201           8831
T                              X                      202           8874
T                              X                      203           8917
T                              X                      204           8960
T                              X                      205           9003
T                              X                      206           9046
T                              X                      207           9089
T                              X                      208           9132
T                              X                      209           9175
T                              X                      210           9218
T                              X                      211           9261
T                              X                      212           9304
T                              X                      213           9347
T                              X                      214           9390
T                              X                      215           9433
T                              X                      216           9476
T                              X                      217           9519
T                              X                      218           9562
T                              X                      219           9605
T                              X                      220           9648
T                              X                      221           9691
T                              X                      222           9734
T                              X                      223           9777
T                              X                      224           9820
T                              X                      225           9863
T                              X                      226           9906
T                              X                      227           9949
T                              X                      228           9992
T                              X                      229          10035
T                              X                      230          10078
T                              X                      231          10121
T                              X                      232          10164
T                              X                      233          10207
T                              X                      234          10250
T                              X                      235          10293
T                              X                      236          10336
T                              X                      237          10379
T                              X                      238          10422
T                              X                      239          10465
T                              X                      240          10508
T                              X                      241          10551
T                              X                      242          10594
T                              X                      243          10637
T                              X                      244          10680
T                              X                      245          10723
T                              X                      246          10766
T                              X                      247          10809
T                              X                      248          10852
T                              X                      249          10895
T                              X                      250          10938
T                              X                      251          10981
T                              X                      252          11024
T                              X                      253          11067
T                              X                      254          11110

As we can see in the previous table, the distribution is different that the histogram “FRECUENCY” because there are many values to represent. Checking the histogram we can see that the column ENDPOINT_VALUE has 254 values (buckets). The first row:

TABLE_NAME                     COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T                              X                      113              1

Has 113 buckets assigned. Where is this value coming from?. We have 11110 rows, and 5000 of those with a value of 1. Let’s do some math:
11100 rows represented in 254 buckets and for the value “1” 113 were assigned:

 select (11110/254)*113 from dual;

(11110/254)*113
---------------
      4942.6378

Close to 5000, correct?. Now let’s check for value = 2:

TABLE_NAME                     COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
------------------------------ ---------- --------------- -------------- --------
T                              X                      113              1
T                              X                      181              2

to check the amount of buckets for ENDPOINT_VALUE :

select (11110/254)*(181-113) from dual;

(11110/254)*(181-113)
---------------------
           2974.33071

As you can see, this is very close to 3000.
The difference is because we have 11110 rows represented in 254 buckets so we have 11110/254= 43.74 rows per bucket.
Once we get the histogram we can check the new plans:

explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 1842905362

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     6   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |  4943 | 14829 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("X"=1)

Now the optimizer select a FULL_SCAN as the best plan.

If we look for fewer rows:

explain plan for select count(1) from t where x = 999999;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3482591947

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |     3 |            |          |
|*  2 |   INDEX RANGE SCAN| T_IDX |     1 |     3 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("X"=999999)

We can see that an index was used and this is thanks to the histogram.

How to Use Histogram
The default value of METHOD_OPT from 10g onwards is ‘FOR ALL COLUMNS SIZE AUTO’.

METHOD_OPT=>’FOR ALL COLUMNS SIZE AUTO’
METHOD_OPT=>’FOR ALL COLUMNS SIZE REPEAT’
METHOD_OPT=>’FOR ALL COLUMNS SIZE SKEWONLY’

AUTO:
Let Oracle determines the which columns to collect histograms based on data distributiion and the workload of the columns.
exec dbms_stats.gather_table_stats(‘HR’,’EMPLOYEES’,method_opt=> ‘for all columns size AUTO);

REPEAT:
This will re-create existing histograms without creating new histograms.
exec dbms_stats.gather_table_stats(‘HR’,’EMPLOYEES’,METHOD_OPT=> ‘FOR ALL COLUMNS SIZE REPEAT’);

SKEWONLY:
When you specify SKEWONLY, the database will look at the data distribution for each column to determine if the data is skewed enough to warrant the creation of a histogram.
We can do this by gathering histograms on the skewed columns.
exec dbms_stats.gather_table_stats(‘HR’,’EMPLOYEES’,method_opt=> ‘for all columns size SKEWONLY’);

When to Use/Not Use Histograms
When to Use Histograms
Create histograms on columns that are frequently used in WHERE clauses of queries and have a highly-skewed data distribution. It is recommended to let Oracle determine this automatically using the METHOD_OPT AUTO setting so that Oracle can determine the columns to collect histograms upon based on data distribution and the workload of the columns.

When to Not Use Histograms
Also, be aware that histograms, as well as all optimizer statistics, are static. If the data distribution of a column changes frequently, it is necessary to recompute the histogram for a given column.
Histograms are not useful for columns with the following characteristics:
–    The column data is uniformly distributed
–    The column is not used in WHERE clauses of queries
–    The column is unique and is used only with equality predicates
–    Do not create them on every column of every table. This requires more time when the table is analyzed. Increases parse time. And can result in poor plans being generated by the optimizer.