Histograms tell the Optimizer about the distribution of data within a column. By default (without a histogram), the Optimizer assumes a uniform distribution of rows across the distinct values in a column. Therefore, the Optimizer calculates the cardinality (number of rows returned) for an equality predicate by dividing the total number of rows in the table by the number of distinct values in the column used in the equality predicate.
If the data distribution in that column is not uniform
(i.e., a data skew exists) then the cardinality estimate will be
incorrect.
In order to accurately reflect a non-uniform data distribution, a
histogram is required on the column. The presence of a histogram
changes the formula used by the Optimizer to estimate the
cardinality, and allows it to generate a more accurate execution
plan.
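For example, the default estimate for an equality predicate can be reproduced directly from the dictionary statistics; a quick sketch (the table T and column X are the test objects used later in this article):

-- Default (no-histogram) cardinality estimate:
--   estimated rows = NUM_ROWS / NUM_DISTINCT
SELECT t.num_rows,
       c.num_distinct,
       ROUND(t.num_rows / c.num_distinct) AS estimated_cardinality
FROM   user_tables             t
JOIN   user_tab_col_statistics c ON c.table_name = t.table_name
WHERE  t.table_name  = 'T'
AND    c.column_name = 'X';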
A histogram sorts values into "buckets," much as you might sort coins
into buckets. To build the histogram, the database has to scan the
table and read the values in the column.
The real purpose of adding a histogram to a table is to give the
optimizer more information so it can come up with a better plan.
If no histograms are present, the CBO assumes an even distribution
of data. Histograms tell the CBO when data is not evenly
distributed, and can help the CBO estimate the number of rows
returned from a table or join (its "cardinality").
Data skew: skew is an uneven distribution of data within a column,
i.e. large variations in the number of duplicate values in that
column.
For example, a California-based book store ships 95% of its books
to California, 4% to Oregon, and 1% to Nevada. The book orders
table has 300,000 rows, and a column stores the state to which each
order is shipped. A user queries the number of books shipped to
Oregon. Without a histogram, the optimizer assumes an even
distribution of 300,000/3 (the Number of Distinct Values (NDV) is 3),
estimating the cardinality at 100,000 rows. With this estimate, the
optimizer chooses a full table scan. With a histogram, the
optimizer calculates that 4% of the books are shipped to Oregon,
and chooses an index scan.
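If we were maintaining such a table ourselves, a histogram on the shipping-state column could be gathered with something like the following sketch (BOOK_ORDERS and SHIP_STATE are made-up names for illustration):

-- Gather a histogram only on the skewed shipping-state column
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'BOOK_ORDERS',
    method_opt => 'FOR COLUMNS SHIP_STATE SIZE 3');  -- one bucket per state
END;
/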
Types of Histograms
Based on the NDV and the distribution of the data, the database
chooses the type of histogram to create. (In some cases, when
creating a histogram, the database samples an internally
predetermined number of rows.) The histogram type is exposed in the
HISTOGRAM column of DBA_TAB_COL_STATISTICS /
USER_TAB_COL_STATISTICS, and can be one of the following:
- HEIGHT BALANCED (height based): each bucket holds roughly the
same number of rows, so popular (skewed) values occupy more buckets.
- FREQUENCY (value based): each distinct value gets its own bucket,
which records the number of rows with that value.
- TOP FREQUENCY: a frequency histogram built only on the most
popular values (available in 12c and later).
- HYBRID: a combination of the frequency and height-balanced
approaches (available in 12c and later).
- NONE: no histogram is present for the column.
Note: frequency histograms are more precise than height-balanced
histograms.
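To see which histogram (if any) a column ended up with, we can query that view directly; for example, for the test table T used below:

-- Check histogram type and bucket count for each column of T
SELECT column_name, num_distinct, num_buckets, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'T';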
Let's create a quick test so we can see this better.
First I will create a table with 11,110 rows and 4 different
values (1, 2, 3 and 4).
create table t
as select case when (rownum between 1 and 10) then 1
when (rownum between 11 and 110) then 2
when (rownum between 111 and 1110) then 3
when (rownum between 1111 and 11110) then 4
end x
from dual
connect by rownum <= 11110;
select x,count(1)
from t
group by x;
    X   COUNT(1)
----- ----------
    1         10
    2        100
    3       1000
    4      10000
Then I'm going to collect statistics and check the
USER_TAB_COL_STATISTICS view for the statistics on the X column:
exec dbms_stats.gather_table_stats(ownname => user,tabname
=> 'T');
or
exec dbms_stats.gather_table_stats('TEST','T',
METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
exec print_table('select * from user_tab_col_statistics where
table_name = ''T''');
TABLE_NAME    : T
COLUMN_NAME   : X
NUM_DISTINCT  : 4
LOW_VALUE     : C102
HIGH_VALUE    : C105
DENSITY       : .25
NUM_NULLS     : 0
NUM_BUCKETS   : 1
LAST_ANALYZED : 02-dic-2009 15:28:59
SAMPLE_SIZE   : 11110
GLOBAL_STATS  : YES
USER_STATS    : NO
AVG_COL_LEN   : 3
HISTOGRAM     : NONE
From these results, we can see that Oracle detected 4 different
values, there are no nulls, the density is 0.25 (1/number of
distinct values = 1/4), there is only 1 bucket and there is no
histogram (NONE).
Why no histogram? Because I didn't specify the method_opt
parameter, Oracle used the default ("for all columns size auto"),
and with SIZE AUTO a histogram is only created when Oracle has
already seen the column used in WHERE-clause predicates and the
data is skewed.
The only information about the distribution is the following:
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T          X                        0              1
T          X                        1              4
or
SELECT endpoint_value,
       endpoint_number,
       endpoint_number -
         LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value) AS frequency
FROM   user_tab_histograms
WHERE  table_name = 'T'
--AND  column_name = 'X'
ORDER  BY endpoint_value;
If we tell Oracle to use 4 buckets we get:
exec dbms_stats.gather_table_stats(ownname => user, tabname
=> 'T', method_opt=>'for all columns size 4');
exec print_table('select * from user_tab_col_statistics
where table_name = ''T''');
TABLE_NAME    : T
COLUMN_NAME   : X
NUM_DISTINCT  : 4
LOW_VALUE     : C102
HIGH_VALUE    : C105
DENSITY       : .000045004500450045
NUM_NULLS     : 0
NUM_BUCKETS   : 4
LAST_ANALYZED : 02-dic-2009 16:12:21
SAMPLE_SIZE   : 11110
GLOBAL_STATS  : YES
USER_STATS    : NO
AVG_COL_LEN   : 3
HISTOGRAM     : FREQUENCY
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_A
---------- ---------- --------------- -------------- ----------
T          X                       10              1
T          X                      110              2
T          X                     1110              3
T          X                    11110              4
or, to compute the per-value frequencies directly:
SELECT endpoint_value,
       endpoint_number,
       endpoint_number -
         LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value) AS frequency
FROM   user_tab_histograms
WHERE  table_name = 'T'
--AND  column_name = 'X'
ORDER  BY endpoint_value;
Now we can see that we have a FREQUENCY histogram. To read
USER_TAB_HISTOGRAMS we need to know that, for a frequency
histogram, the column ENDPOINT_VALUE shows each distinct value and
the column ENDPOINT_NUMBER shows the cumulative number of rows up
to and including that value. So for 1 we have 10 rows, for 2 we
have 110-10=100, for 3 we have 1110-110=1000, and so on.
How does Oracle determine when to create a histogram?
Oracle automatically determines the columns that need histograms
based on the column usage information recorded in SYS.COL_USAGE$
(columns used in WHERE clause predicates), and the presence of a
data skew.
For example, Oracle will not automatically create a histogram on a
unique column if it is only seen in equality predicates.
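On 11.2 and later you can see the column usage information that Oracle has recorded (and that SIZE AUTO bases its decision on) with DBMS_STATS.REPORT_COL_USAGE; a quick sketch for our test table:

SET LONG 100000
-- Report the recorded predicate usage for the columns of T
SELECT dbms_stats.report_col_usage(ownname => user, tabname => 'T')
FROM   dual;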
What's the downside to stopping Oracle from creating any
histograms?
The Optimizer will assume an even distribution of values in all
columns.
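For reference, "stopping" histogram creation is usually done by fixing the METHOD_OPT preference at SIZE 1, for example per table; a sketch using the SH.CUSTOMERS table from the next example:

-- Make every future stats gather on this table skip histograms
BEGIN
  dbms_stats.set_table_prefs(
    ownname => 'SH',
    tabname => 'CUSTOMERS',
    pname   => 'METHOD_OPT',
    pvalue  => 'FOR ALL COLUMNS SIZE 1');
END;
/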
Another example where Oracle gets incorrect estimates (skewed
data)
Imagine we own two hardware stores. Most of our customers live in
the same zipcode as our stores, with a couple of exceptions.
SELECT zipcode, COUNT(*)
FROM customers
GROUP BY zipcode;
   ZIPCODE   COUNT(*)
---------- ----------
     91140      20000
     94303      20000
     94070          2
     94065          2
Let’s now gather statistics on the CUSTOMERS table using the
recommendation to disable histograms.
exec dbms_stats.gather_table_stats('SH','CUSTOMERS',
METHOD_OPT=>'FOR ALL COLUMNS SIZE 1');
SELECT column_name, num_distinct, histogram
FROM user_tab_col_Statistics
WHERE TABLE_NAME='CUSTOMERS';
COLUMN_NAME             NUM_DISTINCT HISTOGRAM
----------------------- ------------ ---------------
CUST_FIRST_NAME                  451 NONE
CUST_LAST_NAME                   401 NONE
CUST_GENDER                        2 NONE
ZIPCODE                            4 NONE
CUST_EMAIL                       401 NONE
No histograms were created, so when a query comes in looking to
find the different genders of our customers in a given zipcode,
the Optimizer will calculate the cardinality estimates by dividing
the total number of rows in the table (40,004) by the number of
distinct values (4). Therefore the Optimizer will assume 10,001
rows will be returned from the customers table regardless of the
zipcode provided, which will always result in a FULL TABLE SCAN.
This works great for two of our zipcodes (91140 and 94303) but not
so great for the other two (94070 and 94065), which would be better
off using an INDEX RANGE SCAN.
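With skew like this, the fix is to let a histogram be created on ZIPCODE; a sketch (SIZE 4 gives one bucket per distinct zipcode, producing a frequency histogram):

-- Gather a histogram on the skewed ZIPCODE column only
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => 'SH',
    tabname    => 'CUSTOMERS',
    method_opt => 'FOR COLUMNS ZIPCODE SIZE 4');
END;
/

-- Verify that the histogram is now there
SELECT column_name, num_distinct, histogram
FROM   dba_tab_col_statistics
WHERE  owner = 'SH' AND table_name = 'CUSTOMERS'
AND    column_name = 'ZIPCODE';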
Now I'm going to create a different distribution to get a
different histogram:
drop table t;
create table t as select rownum x
from dual
connect by rownum <= 11110;
update t set x = 1 where rownum <= 5000;
update t set x = 2 where rownum <= 3000 and x != 1;
create index t_idx on t(x);
So now I have a table similar to what I had before but with a
different distribution: 5,000 rows with x=1, 3,000 rows with x=2,
and a single row for each of the remaining values.
We will collect stats but not for histograms.
exec dbms_stats.gather_table_stats(ownname => user, tabname
=> 'T', cascade => true, method_opt => 'for columns
size 1');
exec print_table('select * from user_tab_col_statistics
where table_name = ''T''');
TABLE_NAME    : T
COLUMN_NAME   : X
NUM_DISTINCT  : 11110
LOW_VALUE     : C102
HIGH_VALUE    : C3020C0B
DENSITY       : .00009000900090009
NUM_NULLS     : 0
NUM_BUCKETS   : 1
LAST_ANALYZED : 17-AUG-2018 08:38:37
SAMPLE_SIZE   : 11110
GLOBAL_STATS  : YES
USER_STATS    : NO
NOTES         : STATS_ON_LOAD
AVG_COL_LEN   : 5
HISTOGRAM     : NONE
SCOPE         : SHARED
-----------------------------------------------------
explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 3482591947

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |     3 |            |          |
|*  2 |   INDEX RANGE SCAN| T_IDX |   111 |   333 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("X"=1)
You can see that the optimizer estimated 111 rows (the Rows
column), which is not correct because we have 5,000 rows with X=1.
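A quick check shows the size of the error (this simply counts the rows set by the UPDATE above):

-- Actual number of rows with X = 1 (5000 in this data set)
SELECT COUNT(*) FROM t WHERE x = 1;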
Now we will collect stats with the default method:
exec dbms_stats.gather_table_stats(ownname => user, tabname
=> 'T', cascade => true);
SELECT column_name, num_distinct, histogram
FROM dba_tab_col_statistics WHERE table_name=
'&TABLE_NAME' and owner='&OWNER';
or
exec print_table('select * from user_tab_col_statistics
where table_name = ''T''');
TABLE_NAME    : T
COLUMN_NAME   : X
NUM_DISTINCT  : 3112
LOW_VALUE     : C102
HIGH_VALUE    : C3020C0B
DENSITY       : .000141
NUM_NULLS     : 0
NUM_BUCKETS   : 254
LAST_ANALYZED : 17-AUG-2018 08:42:24
SAMPLE_SIZE   : 11110
GLOBAL_STATS  : YES
USER_STATS    : NO
NOTES         :
AVG_COL_LEN   : 4
HISTOGRAM     : HEIGHT BALANCED
SCOPE         : SHARED
-----------------------------------------------------
The histogram type is HEIGHT BALANCED, the number of distinct
values is 3112, and the number of buckets is 254 (the maximum
allowed for a height-balanced histogram). (On 12c and later, the
default gather would typically create a HYBRID histogram here
instead; the bucket arithmetic below is for the height-balanced
case.)
select * from user_tab_histograms where table_name = 'T';
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT
---------- ---------- --------------- -------------- --------
T          X                      113              1
T          X                      181              2
T          X                      182           8008
T          X                      183           8052
T          X                      184           8096
T          X                      185           8140
T          X                      186           8184
T          X                      187           8228
T          X                      188           8272
T          X                      189           8315
T          X                      190           8358
T          X                      191           8401
T          X                      192           8444
T          X                      193           8487
T          X                      194           8530
T          X                      195           8573
T          X                      196           8616
T          X                      197           8659
T          X                      198           8702
T          X                      199           8745
T          X                      200           8788
T          X                      201           8831
T          X                      202           8874
T          X                      203           8917
T          X                      204           8960
T          X                      205           9003
T          X                      206           9046
T          X                      207           9089
T          X                      208           9132
T          X                      209           9175
T          X                      210           9218
T          X                      211           9261
T          X                      212           9304
T          X                      213           9347
T          X                      214           9390
T          X                      215           9433
T          X                      216           9476
T          X                      217           9519
T          X                      218           9562
T          X                      219           9605
T          X                      220           9648
T          X                      221           9691
T          X                      222           9734
T          X                      223           9777
T          X                      224           9820
T          X                      225           9863
T          X                      226           9906
T          X                      227           9949
T          X                      228           9992
T          X                      229          10035
T          X                      230          10078
T          X                      231          10121
T          X                      232          10164
T          X                      233          10207
T          X                      234          10250
T          X                      235          10293
T          X                      236          10336
T          X                      237          10379
T          X                      238          10422
T          X                      239          10465
T          X                      240          10508
T          X                      241          10551
T          X                      242          10594
T          X                      243          10637
T          X                      244          10680
T          X                      245          10723
T          X                      246          10766
T          X                      247          10809
T          X                      248          10852
T          X                      249          10895
T          X                      250          10938
T          X                      251          10981
T          X                      252          11024
T          X                      253          11067
T          X                      254          11110
or, to compute the per-value bucket counts directly:
SELECT endpoint_value,
       endpoint_number,
       endpoint_number -
         LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_value) AS frequency
FROM   user_tab_histograms
WHERE  table_name = 'T'
--AND  column_name = 'X'
ORDER  BY endpoint_value;
As we can see, this distribution is represented differently from
the FREQUENCY histogram because there are too many distinct values
to give each one its own bucket. Checking the histogram, we can see
that ENDPOINT_NUMBER is now a cumulative bucket number running up
to 254 (adjacent buckets that end on the same value are compressed
into a single row). The first row:
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
T          X                      113              1
This endpoint has 113 buckets assigned to it. Where does this value
come from? We have 11,110 rows, 5,000 of which have the value 1.
Let's do some math: 11,110 rows are represented in 254 buckets, and
113 of those buckets were assigned to the value 1:
select (11110/254)*113 from dual;
(11110/254)*113
---------------
      4942.6378
Close to 5000, correct? Now let's check value = 2:
TABLE_NAME COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE
---------- ---------- --------------- --------------
T          X                      113              1
T          X                      181              2
To estimate the number of rows for ENDPOINT_VALUE = 2, we take the
number of buckets between the two endpoints:
select (11110/254)*(181-113) from dual;
(11110/254)*(181-113)
---------------------
           2974.33071
As you can see, this is very close to 3000. The difference comes
from the bucket granularity: 11,110 rows spread over 254 buckets
means each bucket represents 11110/254 = 43.74 rows.
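The same arithmetic can be applied to every value in one pass; a small sketch (it hard-codes the 11110 rows and 254 buckets of this example):

-- Estimated rows per endpoint value of the height-balanced histogram on T.X
-- rows-per-bucket = total rows / number of buckets = 11110 / 254
SELECT endpoint_value,
       endpoint_number,
       ROUND((endpoint_number -
              LAG(endpoint_number, 1, 0) OVER (ORDER BY endpoint_number))
             * (11110 / 254)) AS estimated_rows
FROM   user_tab_histograms
WHERE  table_name  = 'T'
AND    column_name = 'X'
ORDER  BY endpoint_number;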
Once we get the histogram we can check the new plans:
explain plan for select count(1) from t where x = 1;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 1842905362

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     6   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T    |  4943 | 14829 |     6   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("X"=1)
Now the optimizer selects a FULL TABLE SCAN as the best plan.
If we look for a value with far fewer rows:
explain plan for select count(1) from t where x = 999999;
select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------
Plan hash value: 3482591947

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |     3 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |     3 |            |          |
|*  2 |   INDEX RANGE SCAN| T_IDX |     1 |     3 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("X"=999999)
We can see that an index was used and this is thanks to the
histogram.
How to Use Histograms
The default value of METHOD_OPT from 10g onwards is 'FOR ALL
COLUMNS SIZE AUTO'.
METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO'
METHOD_OPT=>'FOR ALL COLUMNS SIZE REPEAT'
METHOD_OPT=>'FOR ALL COLUMNS SIZE SKEWONLY'
AUTO:
Let Oracle determine which columns to collect histograms on, based
on the data distribution and the workload of the columns.
exec dbms_stats.gather_table_stats('HR','EMPLOYEES',method_opt=>
'for all columns size AUTO');
REPEAT:
This will re-create existing histograms without creating new
histograms.
exec
dbms_stats.gather_table_stats('HR','EMPLOYEES',METHOD_OPT=>
'FOR ALL COLUMNS SIZE REPEAT');
SKEWONLY:
When you specify SKEWONLY, the database looks at the data
distribution of each column to determine whether the data is skewed
enough to warrant the creation of a histogram, and gathers
histograms only on those skewed columns.
exec
dbms_stats.gather_table_stats('HR','EMPLOYEES',method_opt=>
'for all columns size SKEWONLY');
When to Use/Not Use Histograms
When to Use Histograms
Create histograms on columns that are frequently used in WHERE
clauses of queries and have a highly-skewed data distribution. It
is recommended to let Oracle determine this automatically using
the METHOD_OPT AUTO setting so that Oracle can determine the
columns to collect histograms upon based on data distribution and
the workload of the columns.
When Not to Use Histograms
Be aware that histograms, like all optimizer statistics, are
static. If the data distribution of a column changes frequently,
the histogram for that column must be regathered regularly.
Histograms are not useful for columns with the following
characteristics:
- The column data is uniformly distributed
- The column is not used in WHERE clauses of
queries
- The column is unique and is used only with
equality predicates
- Do not create them on every column of every
table: this makes analyzing the table take longer, increases parse
time, and can result in poor plans being generated by the
optimizer. (Removing an unwanted histogram is shown in the sketch
below.)
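If a histogram turns out to be harmful, it can be removed by regathering just that column with a single bucket; a sketch for our test column T.X:

-- Remove the histogram on T.X by regathering that column with SIZE 1
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'T',
    method_opt => 'FOR COLUMNS X SIZE 1');
END;
/

Note that a later gather with SIZE AUTO may recreate the histogram, unless the table's METHOD_OPT preference is changed (see the set_table_prefs sketch earlier).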