CHECKPOINT TUNING AND ERROR HANDLING (ML note 147468.1)
This bulletin provides the Database Administrator a better understanding of checkpoint processing and a description of four key initialization parameters used for checkpoint tuning:
           -  CHECKPOINT_PROCESS
           -  LOG_CHECKPOINT_INTERVAL
           -  LOG_CHECKPOINT_TIMEOUT
           -  LOG_CHECKPOINTS_TO_ALERT

It also explains how to interpret and handle checkpoint errors: 'Checkpoint not Complete' and 'Cannot Allocate New Log' reported in the ALERT<sid>.LOG file.

Contents:
1.  What is a Checkpoint?
2. The checkpoint process
3. Checkpoints and Performance
4. Redo logs and Checkpoint
5. Instance parameters related with the checkpoint process
6. Understanding Checkpoint Error messages    ("Cannot allocate new log" and "Checkpoint not complete")
7. Oracle Release Information
8. Using Statspack to determine Checkpointing problems
9. Sizing Redo Log Files


1.  What is a Checkpoint?
A Checkpoint is a database event which synchronizes the modified data blocks in memory with the datafiles on disk.

A checkpoint has two purposes: (1) to establish data consistency, and (2) enable faster database recovery.   How is recovery faster?  Because all database changes up to the checkpoint have been recorded in the datafiles, making it unnecessary to apply redo log entries prior to the checkpoint. The checkpoint must ensure that all the modified buffers in the cache are really written to the corresponding datafiles to avoid the loss of data which may occur with a crash (instance or disk failure).

Oracle writes the dirty buffers to disk only on certain conditions:
  - A shadow process must scan more than one-quarter of the db_block_buffer parameter.
  - Every three seconds.
  - When a checkpoint is produced.

A checkpoint is realized on five types of events:
  - At each switch of the redo log files.
  - When the delay for LOG_CHECKPOINT_TIMEOUT is reached.
  - When the size in bytes corresponding to :
     (LOG_CHECKPOINT_INTERVAL* size of IO OS blocks) is written on the current redo log file.
  -  Directly by the ALTER SYSTEM SWITCH LOGFILE command.
  - Directly with the ALTER SYSTEM CHECKPOINT command.

During a checkpoint the following occurs:
 -  The database writer (DBWR) writes all modified database  blocks in the buffer cache back to datafiles,
 -  Log writer (LGWR) updates both the controlfile and  the datafiles to indicate when the last checkpoint  occurred (SCN)


2. The checkpoint process

The CHECKPOINT_PROCESS init.ora parameter determines whether or not the optional CKPT background process will be started to perform LGWRs tasks during checkpoint operations of updating the datafile headers.  LGWR is then free to perform its' primary function flushing the redo log buffer to the online redo logs.

The CKPT process can improve performance significantly and decrease the amount of time users have to wait for a checkpoint operation to complete.  The overhead associated with starting another background process is not significant when compared to the performance benefit to be gained by enabling CKPT, therefore, Oracle recommends always enabling the checkpoint process (CKPT).

Please note that this parameter is obsolete starting with Oracle8 and now the CKPT process is always started as part of the background processes.


3. Checkpoints and Performance

Checkpoints present a tuning dilemma for the Database Administrator.  Frequent checkpoints will enable faster recovery, but can cause performance
degradation. How then should the DBA address this?

This bulletin assumes that performance is your number one priority and so recommendations are made accordingly. Therefore, your goal is to minimize the frequency of checkpoints through tuning. Tuning checkpoints involves four key initialization parameters
           -  CHECKPOINT_PROCESS
           -  LOG_CHECKPOINT_INTERVAL
           -  LOG_CHECKPOINT_TIMEOUT
           -  LOG_CHECKPOINTS_TO_ALERT
These parameters are discussed in detail below.
Recommendations are also given for handling "checkpoint not complete" messages
found in the alert log, which indicate a need to tune redo logs and checkpoints.


4. Redo logs and Checkpoint

A checkpoint occurs at every log switch.  If a previous checkpoint is already in progress, the checkpoint forced by the log switch will override the current checkpoint.

This necessitates well-sized redo logs to avoid unnecessary checkpoints as a result of frequent log switches.  The alert log is a valuable tool for
monitoring the rate that log switches occur, and subsequently, checkpoints occur.  Oracle recommends sizing the online redo logs such that switches occur no more than once per hour. The following is an example of quick log switches from the alert log:

Fri May 16 17:15:43 1997
Thread 1 advanced to log sequence 1272
  Current log# 3 seq# 1272 mem# 0: /prod1/oradata/logs/redologs03.log
Thread 1 advanced to log sequence 1273
  Current log# 1 seq# 1273 mem# 0: /prod1/oradata/logs/redologs01.log
Fri May 16 17:17:25 1997
Thread 1 advanced to log sequence 1274
  Current log# 2 seq# 1274 mem# 0: /prod1/oradata/logs/redologs02.log
Thread 1 advanced to log sequence 1275
  Current log# 3 seq# 1275 mem# 0: /prod1/oradata/logs/redologs03.log
Fri May 16 17:20:51 1997
Thread 1 advanced to log sequence 1276
  Current log# 1 seq# 1276 mem# 0: /prod1/oradata/logs/redologs01.log

If redo logs switch every 3 minutes, you will see performance degradation. This indicates the redo logs are not sized large enough to efficiently handle the transaction load. See here for further detail on how to estimate an adequate size of the redolog files.

5. Instance parameters related with the checkpoint process


6. Understanding Checkpoint Error messages  (“Cannot allocate new log” and “Checkpoint not complete”)

Sometimes, you can see in your alert.log file, the following corresponding messages:

  Thread 1 advanced to log sequence 248
    Current log# 2 seq# 248 mem# 0: /prod1/oradata/logs/redologs02.log
  Thread 1 cannot allocate new log, sequence 249
  Checkpoint not complete

This message indicates that Oracle wants to reuse a redo log file, but the corresponding checkpoint associated is not terminated.  In this case, Oracle
must wait until the checkpoint is completely realized. This situation may be encountered particularly when the transactional activity is important.

Keep in mind the goal of only one log switch per hour, which ideally should equate to one checkpoint per hour as well.

The way to resolve incomplete checkpoints is through tuning checkpoints and logs:

1) Give the checkpoint process more time to cycle through the logs
       -  add more redo log groups
       -  increase the size of the redo logs
2) Reduce the frequency of checkpoints
        - increase LOG_CHECKPOINT_INTERVAL
        - increase size of online redo logs
3) Improve the efficiency of checkpoints enabling the CKPT process with CHECKPOINT_PROCESS=TRUE
4) Set LOG_CHECKPOINT_TIMEOUT = 0.  This disables the checkpointing based on
     time interval.


7. Oracle Release Information

Starting from Oracle8i, Oracle Corporation recommends that Enterprise Edition users who were using incremental checkpoints in an earlier release
to use fast-start checkpointing in Oracle8i. In fast-start checkpointing, the FAST_START_IO_TARGET parameter replaces DB_FILE_MAX_DIRTY_TARGET. FAST_START_IO_TARGET specifies the number of I/Os that should be needed during crash or instance recovery.When you set this parameter, DBWR writes dirty buffers out more aggressively, so that the number of blocks that must be processed during recovery stays below the value specified in the parameter.

So in Oracle8i The incremental checkpoint position should not lag the tail of the log by more than LOG_CHECKPOINT_INTERVAL operating system blocks.The LOG_CHECKPOINT_INTERVAL and FAST_START_IO_TARGET is taken into account to determine how far behind the end of the redo stream the checkpoint position can really be.

In Oracle9i FAST_START_MTTR_TARGET is the preferred method of specifying how far the checkpoint position should be behind the tail of the redo stream. However, LOG_CHECKPOINT_INTERVAL is still supported if needed. It functions as per the Oracle8i behaviour above.

FAST_START_MTTR_TARGET enables you to specify the number of seconds the database takes to perform crash recovery of a single instance. Whenever you set FAST_START_MTTR_TARGET  to a nonzero value, then set the following parameters to 0.

LOG_CHECKPOINT_TIMEOUT = 0
LOG_CHECKPOINT_INTERVAL = 0
FAST_START_IO_TARGET = 0

Because these initialization parameters either override FAST_START_MTTR_TARGET or potentially drive checkpoints more aggressively than FAST_START_MTTR_TARGET does, they can interfere with the simulation.
 

8. Using Statspack to determine Checkpointing problems
Statspack snapshots can be taken every 15 minutes or so, these reports gather useful information about number of checkpoints started and checkpoints completed and number of database buffers written during checkpointing for that window of time . It also contains statistics about redo activity. Gathering and comparing these snapshot reports gives you a complete idea about checkpointing performance at different periods of time.

Another important thing to watch in statspack report is the following wait events, they could be a good indication about problems with the redo log throughput and checkpointing:

log file switch (checkpoint incomplete)
log file switch (archiving needed)
log file switch/archive
log file switch (clearing log file)
log file switch completion
log switch/archive
log file sync

In the case when one or more of the above wait events is repeated frequently with considerable values then you need to take an action like adding More online redo log files or increasing their sizes and/or modifying checkpointing parameters.
 

9.Sizing Redo Log Files
The way to determine how effectively redo logs are performing is to look in the alert log for messages about switching logs and checkpointing.  When you examine your alert log for switches, do so when there is the most activity and the heaviest load on the database.
 
Every time Oracle does a switch it is recorded in the alert log.  Your alert log will have entries similar to the following:
 
   Mon May 5 13:53:31 1997
   Thread 1 advanced to log sequence 68
   Current log#2 seq# 68 mem# 0: /u05/dbs/log2ween.dbf
 
Redo logs should switch approximately every 30 minutes.  If they switch sooner, look at the time between switches.
 
For example:
   a.) If your time between switches is 10 minutes, you need to increase the size of redo log files three times.
   b.) If your time between switches is 15 minutes, you need to increase the size twice.
   c.) If the switch time is more than 30 minutes, it means there is not much activity on the database and it's OK.
 
Recording of checkpoints can also impact redo log files.  The LGWR is responsible for recording checkpoints in the datafile headers.  If checkpoints
are occurring frequently and you have many datafiles, system performance can be degraded as the LGWR can become overloaded.  Oracle suggests that you enable the checkpoint process in 'init.ora', CKPT to separate the work of performing a checkpoint from other work performed by LGWR.
    CHECKPOINT_PROCESS = true
You need to shutdown and restart your database after you change the "init.ora" for the change to go into effect.
 
If you have a very stable environment where the computer very rarely crashes, increasing the size of redo logs will make your checkpoints occur less often (remember, a checkpoint causes your SGA to be flushed).  After the redo logs are recreated with a larger size, set CHECK_POINT_INTERVAL greater than your redo log size.