Tuesday, July 28, 2015

Checkpoint not Complete

This is one of the most common messages a DBA could see in alert log file of a database. Exact message should be similar to the following
Wed Mar 04 12:02:01 2015
Thread 1 cannot allocate new log, sequence 201
Checkpoint not complete

Whenever a redo log switch occurs i.e. current redo log group (suppose log group 1) is full and LGWR starts writing to the next available redo log group (suppose log group 2), a CHECKPOINT also occurs which means that all database block changes which are in database buffer cache; and have not yet been written to the database files; should be written to the disk (datafiles) by the database writer process (DBWR). Now DBWR starts writing changed blocks to the datafiles and this redo log group (group 1) will no longer be available for the LGWR to reuse until DBWR completes writing all changed blocks from the buffer cache (buffer cache blocks which are related to the data changes recorded in this redo log group). Now if redo log group 2 is also full and LGWR tries to allocate redo log group 1 again for writing; and DBWR has still not finished writing changed blocks to the data files, LGWR would need to wait, and message “Checkpoint no complete” would be written the alert log file. All transactions would hang here as these can’t continue until LGWR is allocated next redo log group for writing.

Most of the time the solution of “checkpoint not complete” is to add more redo log groups in the database so that DBWR has enough time to write changed blocks from database buffer cache to the datafiles before LGWR tries to acquire same redo log group again. For example, if you have 2 redo log groups (group 1 and group 2) and you face checkpoint not complete warning, adding 2 more groups (group 3 and group 4) would mean that turn of a redo log group for writing by LGWR would come after longer time as compared to having only 2 log groups, and this would ensure that LGWR have enough free redo log groups available to be allocated and write rather than waiting for a redo log group to become available before it could be reused.

How Many redo log groups, and redo log group size
This is a very tricky question. As per my experience and knowledge, for a new database, I would recommend 5 redo log groups with 100 MB size for each redo log group Start monitoring the log switch frequency during peak hours and if log switch frequency is very less (suppose less than 5 minutes), increase the size of redo log groups to make log switch frequency to not less than 10 minutes. If checkpoint not complete message appears in alert log file, start adding more redo log groups until you no longer see this message appearing in alert log file.

Other things to consider
As I mentioned above in “Solution” section, most of the time solution is to add more redo log groups, but this may not be a true solution for some scenarios (could be a work around though). If your disk drives where you have placed your redo log groups are very slow, you should consider putting your redo log groups on faster disks. For example this is not recommended to put your redo log groups on RAID 5, as this RAID type is very slow for sequential writing because of computing parity during writing to the disks.
There is another misconception of putting redo logs on SSD (I also had this misconception, but thanks to an Oracle community discussion which cleared this confusion). SSDs are also not good for sequential writes (also a couple of other reasons) and redo logs should not be placed on these (unless you are using some appliance and it is recommended by the vendor, for example Exadata machine) 

No comments: