Tuesday, July 28, 2015

Checkpoint not Complete

This is one of the most common messages a DBA could see in alert log file of a database. Exact message should be similar to the following
Wed Mar 04 12:02:01 2015
Thread 1 cannot allocate new log, sequence 201
Checkpoint not complete

Reason
Whenever a redo log switch occurs i.e. current redo log group (suppose log group 1) is full and LGWR starts writing to the next available redo log group (suppose log group 2), a CHECKPOINT also occurs which means that all database block changes which are in database buffer cache; and have not yet been written to the database files; should be written to the disk (datafiles) by the database writer process (DBWR). Now DBWR starts writing changed blocks to the datafiles and this redo log group (group 1) will no longer be available for the LGWR to reuse until DBWR completes writing all changed blocks from the buffer cache (buffer cache blocks which are related to the data changes recorded in this redo log group). Now if redo log group 2 is also full and LGWR tries to allocate redo log group 1 again for writing; and DBWR has still not finished writing changed blocks to the data files, LGWR would need to wait, and message “Checkpoint no complete” would be written the alert log file. All transactions would hang here as these can’t continue until LGWR is allocated next redo log group for writing.

Solution
Most of the time the solution of “checkpoint not complete” is to add more redo log groups in the database so that DBWR has enough time to write changed blocks from database buffer cache to the datafiles before LGWR tries to acquire same redo log group again. For example, if you have 2 redo log groups (group 1 and group 2) and you face checkpoint not complete warning, adding 2 more groups (group 3 and group 4) would mean that turn of a redo log group for writing by LGWR would come after longer time as compared to having only 2 log groups, and this would ensure that LGWR have enough free redo log groups available to be allocated and write rather than waiting for a redo log group to become available before it could be reused.

How Many redo log groups, and redo log group size
This is a very tricky question. As per my experience and knowledge, for a new database, I would recommend 5 redo log groups with 100 MB size for each redo log group Start monitoring the log switch frequency during peak hours and if log switch frequency is very less (suppose less than 5 minutes), increase the size of redo log groups to make log switch frequency to not less than 10 minutes. If checkpoint not complete message appears in alert log file, start adding more redo log groups until you no longer see this message appearing in alert log file.

Other things to consider
As I mentioned above in “Solution” section, most of the time solution is to add more redo log groups, but this may not be a true solution for some scenarios (could be a work around though). If your disk drives where you have placed your redo log groups are very slow, you should consider putting your redo log groups on faster disks. For example this is not recommended to put your redo log groups on RAID 5, as this RAID type is very slow for sequential writing because of computing parity during writing to the disks.
There is another misconception of putting redo logs on SSD (I also had this misconception, but thanks to an Oracle community discussion which cleared this confusion). SSDs are also not good for sequential writes (also a couple of other reasons) and redo logs should not be placed on these (unless you are using some appliance and it is recommended by the vendor, for example Exadata machine) 

Thursday, July 23, 2015

Finding Sessions with High CPU Usage

If database server CPU usage is showing 100%, or high 90%, DBA needs to find out which session is hogging the CPU(s) and take appropriate action. I will explain in the following that how can we find out the session(s) which are excessively using CPU.
v$statname and v$sesstat dictionary view contains different metrics values, and in this example we will be checking values of 2 metrics highlighted in red bellow. These metrics would help to find out CPU usage by the sessions.
SQL> col name format a70
SQL> select statistic#,name from v$statname where upper(name) like '%CPU%';

STATISTIC# NAME
---------- ----------------------------------------------------------------------
         0 OS CPU Qt wait time
        10 recursive cpu usage
        16 CPU used when call started
        17 CPU used by this session
        61 IPC CPU used by this session
        64 global enqueue CPU used by this session
       229 gc CPU used by this session
       248 cell physical IO bytes sent directly to DB node to balance CPU
       581 parse time cpu

9 rows selected.

If we want to check which session is top consumer of the CPU currently, we can use following script to find it out.
Script for 11g and bellow
alter session set nls_date_format='Dd-MON-YY HH24:MI:SS';
set lines 250
set pages 2000
col name format a26
col username format a15
col program format a40
col SESS_CPU_SECS wra format 999,999,999.99
col LAST_CPU_SECS wra format 999,999,999.99
col logon_secs  wra format 999,999,999
col Percent  wra format 999.99

select sess_cpu.sid, NVL(sess_cpu.username, 'Oracle Process') username, sess_cpu.status, sess_cpu.logon_time,  round((sysdate - sess_cpu.logon_time)*1440*60) logon_SECS, sess_cpu.value/100 SESS_CPU_SECS, (sess_cpu.value - call_cpu.value)/100 LAST_CPU_SECS, round ((sess_cpu.value/100)/round((sysdate - sess_cpu.logon_time)*1440*60)*100,2) Percent, sess_cpu.sql_id          
from
(select se.sql_id,ss.statistic#,se.sid, se.username, se.status, se.program, se.logon_time, sn.name, ss.value from v$session se, v$sesstat ss,
v$statname sn
where se.sid=ss.sid
and sn.statistic#=ss.statistic#
and sn.name in ('CPU used by this session') ) sess_cpu,
(select ss.statistic#,se.sid, ss.value, value/100 seconds from v$session se, v$sesstat ss, v$statname sn
where se.sid=ss.sid
and sn.statistic#=ss.statistic#
and sn.name in ('CPU used when call started') ) call_cpu
where sess_cpu.sid=call_cpu.sid
order by SESS_CPU_SECS ;


Script for 12c and above
Column CON_ID is added in the script to list the container id where sessions are connected. If CON_ID is 0, it would mean that this is an internal process/session. If value is 1, it would mean that this session is from container database. Any value other than 0 or 1 would represent a pluggable database sessions. Check v$containers to match container ID with your pluggable database.
alter session set nls_date_format='Dd-MON-YY HH24:MI:SS';
set lines 250
set pages 2000
col name format a26
col username format a15
col program format a40
col SESS_CPU_SECS wra format 999,999,999.99
col LAST_CPU_SECS wra format 999,999,999.99
col logon_secs  wra format 999,999,999
col Percent  wra format 999.99

select sess_cpu.con_id, sess_cpu.sid, NVL(sess_cpu.username, 'Oracle Process') username, sess_cpu.status, sess_cpu.logon_time,  round ((sysdate-sess_cpu.logon_time)*1440*60) logon_SECS, sess_cpu.value/100 SESS_CPU_SECS, (sess_cpu.value - call_cpu.value)/100 LAST_CPU_SECS,
round((sess_cpu.value/100)/round((sysdate - sess_cpu.logon_time)*1440*60)*100,2) Percent, sess_cpu.sql_id          
from
(select se.con_id,se.sql_id,ss.statistic#,se.sid, se.username, se.status, se.program, se.logon_time, sn.name, ss.value from v$session se, v$sesstat ss, v$statname sn
where se.sid=ss.sid
and sn.statistic#=ss.statistic#
and sn.name in ('CPU used by this session') ) sess_cpu,
(select se.con_id, ss.statistic#,se.sid, ss.value, value/100 seconds from v$session se, v$sesstat ss, v$statname sn
where se.sid=ss.sid
and sn.statistic#=ss.statistic#
and sn.name in ('CPU used when call started') ) call_cpu
where sess_cpu.sid=call_cpu.sid
and sess_cpu.con_id=call_cpu.con_id
order by SESS_CPU_SECS ;

LOGON_SECS
column shows total time in seconds this session is connected to the database.
SESS_CPU_SECS columns shows number of seconds this session has spent using CPU.
LAST_CPU_SECS column shows the CPU taken by last execution call – a single SQL may call CPU several times during its execution.
PERCENT column shows the percent of time this session has spent using CPU since this session has connect to the database.
  
Points to note
While on CPU, LAST_CPU_SECS would show zero (session status would be ACTIVE). During execution of SQL(s), session may be switching between CPU and IO - and hence we can see value changing under SESS_CPU_SECS and LAST_CPU_SECS columns as soon as session switches from CPU to IO.
If a session is on CPU, status would be ACTIVE and SESS_CPU_SECS would be showing last value captured for this session, and LAST_CPU_DECS would be 0.
If a session is on IO, status would be ACTIVE and SESS_CPU_SECS would be showing total seconds of CPU taken by the session until now.
LAST_CPU_SECS would show CPU seconds taken during last call to the CPU. I have also seen value of zero under this column even if session is not ACTIVE. I can't figure it out why, but most important thing is total time of CPU taken by the session which is under SESS_CPU_SECS.
See the following example where I have used ORDER BY SESS_CPU_SECS to display sessions which are at the top CPU consumption since log in.

If you see session bellow highlighted in red, it is connected to the database for 799 seconds and out of which it has spent 260 seconds on CPU which accounts to 32.55 percent of total session time.
SID
USERNAME
STATUS
LOGON_TIME
LOGON_SECS
SESS_CPU_SECS
LAST_CPU_SECS
PERCENT
SQL_ID
262
Oracle Process
ACTIVE
23-JUL-15 09:44:48
22,804
.08
.08
.00
 
2
Oracle Process
ACTIVE
23-JUL-15 09:44:14
22,838
.10
.10
.00
 
504
Oracle Process
ACTIVE
23-JUL-15 09:44:18
22,834
.22
.22
.00
 
508
Oracle Process
ACTIVE
23-JUL-15 09:44:27
22,825
.26
.26
.00
 
11
SYS
ACTIVE
23-JUL-15 15:47:58
1,014
.48
.00
.05
0r86pjqdx9r5m
18
SYS
INACTIVE
23-JUL-15 14:58:11
4,001
1.90
.02
.05
 
509
Oracle Process
ACTIVE
23-JUL-15 09:44:41
22,811
12.07
12.07
.05
 
256
Oracle Process
ACTIVE
23-JUL-15 09:44:19
22,833
16.94
16.94
.07
 
5
Oracle Process
ACTIVE
23-JUL-15 09:44:19
22,833
30.57
30.57
.13
 
764
C##SALMAN
INACTIVE
23-JUL-15 15:51:33
799
260.07
.00
32.55
c0yavgxt1nxqc

After some time, percentage of CPU usage increases to 35.15 percent. 
SID
USERNAME
STATUS
LOGON_TIME
LOGON_SECS
SESS_CPU_SECS
LAST_CPU_SECS
PERCENT
SQL_ID
262
Oracle Process
ACTIVE
23-JUL-15 09:44:48
23,341
.08
.08
.00
 
2
Oracle Process
ACTIVE
23-JUL-15 09:44:14
23,375
.10
.10
.00
 
504
Oracle Process
ACTIVE
23-JUL-15 09:44:18
23,371
.22
.22
.00
 
508
Oracle Process
ACTIVE
23-JUL-15 09:44:27
23,362
.26
.26
.00
 
11
SYS
ACTIVE
23-JUL-15 15:47:58
1,551
.67
.00
.04
0r86pjqdx9r5m
18
SYS
INACTIVE
23-JUL-15 14:58:11
4,538
2.04
.00
.04
 
509
Oracle Process
ACTIVE
23-JUL-15 09:44:41
23,348
12.13
12.13
.05
 
256
Oracle Process
ACTIVE
23-JUL-15 09:44:19
23,370
17.12
17.12
.07
 
5
Oracle Process
ACTIVE
23-JUL-15 09:44:19
23,370
30.91
30.91
.13
 
764
C##SALMAN
INACTIVE
23-JUL-15 15:51:33
1,336
469.58
.00
35.15
c0yavgxt1nxqc

In real time scenario, if you see a session spending huge amount time on CPU, have a look at the SQL(s) it is executing (last column showing SQL_ID) and investigate further.
You can modify this script to “ORDER BY PERCENT” to see which session has spent most of its connected time on CPU.

Monday, July 20, 2015

ORA-03135: connection lost contact

This error may be accompanied by ORA-03113: end-of-file on communication channel. This actually means that your session has been dropped and there could be different reasons which might have caused this. Most commonly; a firewall between client and server could have caused this session/connection drop, if firewall senses some session to be idle for a longer period of time. This session drop timeout setting is done in the firewall and to deny its effect, SQLNET profile parameter “expire_time” could be used to keep the database sessions alive to avoid any connection drop by the firewall.

SQLNET.EXPIRE_TIME
To avoid the problem of "unwanted" disconnection, sqlnet.expire_time can be added in the sqlnet.ora file of database server. This parameter has a value in minutes and it would cause database server to send a packed to each client connect to it at an interval equal to the value of this parameter. This small packet sending would actually deceive firewall so that it considers it an active connection and does not drop it.

We can initially set it to a value of 10 minutes which can be alerted with coordination of network admin who can help to find out firewall settings. Add the following line in $ORACLE_HOME/network/admin/sqlnet.ora

############

SQLNET.EXPIRE_TIME = 10

############

Actual work of this parameter is to check the client sessions which might have exited but server process related to the session is still intact; so that this server process could also be terminated and resources could be released. For details about this parameter, please 
Click Here

Wednesday, July 15, 2015

Testing RMAN Restore

Starting 10g, we can test our backups whether backups are valid and can be used for a successful restore/recovery of the database or not. We can do so by using “RESTORE VALIDATE” command from RMAN. RESTORE VALIDATE command would simulate a restore of the backups by reading backup pieces. If it completes successfully, it would mean that backup is good and can be used for restoration and recovery of database in case of media failure or a disaster.

This can be used on the production server itself where backup was taken, but it is better not to test in on production. In the following example I will show you how to perform this. First I am taking a backup on my 12.1.0.2 database on windows platform and then validating the backup.

Taking backup of the database

RMAN> backup format 'c:\dbbackup\DB_BACKUP_%U.bak' database plus archivelog delete input;

Starting backup at 10-JUL-15
current log archived
using channel ORA_DISK_1
channel ORA_DISK_1: starting archived log backup set
channel ORA_DISK_1: specifying archived log(s) in backup set
input archived log thread=1 sequence=45 RECID=9 STAMP=884684979
input archived log thread=1 sequence=46 RECID=10 STAMP=884684980
input archived log thread=1 sequence=47 RECID=11 STAMP=884684984
input archived log thread=1 sequence=48 RECID=12 STAMP=884684984
input archived log thread=1 sequence=49 RECID=13 STAMP=884684984
input archived log thread=1 sequence=50 RECID=14 STAMP=884684997
channel ORA_DISK_1: starting piece 1 at 10-JUL-15
channel ORA_DISK_1: finished piece 1 at 10-JUL-15
piece handle=C:\DBBACKUP\DB_BACKUP_0BQBME66_1_1.BAK tag=TAG20150710T094957 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
channel ORA_DISK_1: deleting archived log(s)
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000045_0870715164.0001 RECID=9 STAMP=884684979
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000046_0870715164.0001 RECID=10 STAMP=884684980
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000047_0870715164.0001 RECID=11 STAMP=884684984
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000048_0870715164.0001 RECID=12 STAMP=884684984
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000049_0870715164.0001 RECID=13 STAMP=884684984
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000050_0870715164.0001 RECID=14 STAMP=884684997
Finished backup at 10-JUL-15

Starting backup at 10-JUL-15
using channel ORA_DISK_1
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00001 name=C:\APP\SALMQURE\ORADATA\SALMAN12\SYSTEM01.DBF
input datafile file number=00003 name=C:\APP\SALMQURE\ORADATA\SALMAN12\SYSAUX01.DBF
input datafile file number=00005 name=C:\APP\SALMQURE\ORADATA\SALMAN12\UNDOTBS01.DBF
input datafile file number=00006 name=C:\APP\SALMQURE\ORADATA\SALMAN12\USERS01.DBF
input datafile file number=00010 name=C:\APP\SALMQURE\ORADATA\SALMAN12\TEST01.DBF
channel ORA_DISK_1: starting piece 1 at 10-JUL-15
channel ORA_DISK_1: finished piece 1 at 10-JUL-15
piece handle=C:\DBBACKUP\DB_BACKUP_0CQBME67_1_1.BAK tag=TAG20150710T094959 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:35
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00015 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDB1\SYSAUX01.DBF
input datafile file number=00014 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDB1\SYSTEM01.DBF
input datafile file number=00016 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDB1\PDB1_USERS01.DBF
channel ORA_DISK_1: starting piece 1 at 10-JUL-15
channel ORA_DISK_1: finished piece 1 at 10-JUL-15
piece handle=C:\DBBACKUP\DB_BACKUP_0DQBME7B_1_1.BAK tag=TAG20150710T094959 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:25
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00025 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDB2\SYSAUX01.DBF
input datafile file number=00024 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDB2\SYSTEM01.DBF
channel ORA_DISK_1: starting piece 1 at 10-JUL-15
channel ORA_DISK_1: finished piece 1 at 10-JUL-15
piece handle=C:\DBBACKUP\DB_BACKUP_0EQBME84_1_1.BAK tag=TAG20150710T094959 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00004 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDBSEED\SYSAUX01.DBF
input datafile file number=00002 name=C:\APP\SALMQURE\ORADATA\SALMAN12\PDBSEED\SYSTEM01.DBF
channel ORA_DISK_1: starting piece 1 at 10-JUL-15
channel ORA_DISK_1: finished piece 1 at 10-JUL-15
piece handle=C:\DBBACKUP\DB_BACKUP_0FQBME8J_1_1.BAK tag=TAG20150710T094959 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
Finished backup at 10-JUL-15

Starting backup at 10-JUL-15
current log archived
using channel ORA_DISK_1
channel ORA_DISK_1: starting archived log backup set
channel ORA_DISK_1: specifying archived log(s) in backup set
input archived log thread=1 sequence=51 RECID=15 STAMP=884685091
channel ORA_DISK_1: starting piece 1 at 10-JUL-15
channel ORA_DISK_1: finished piece 1 at 10-JUL-15
piece handle=C:\DBBACKUP\DB_BACKUP_0GQBME93_1_1.BAK tag=TAG20150710T095131 comment=NONE
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:01
channel ORA_DISK_1: deleting archived log(s)
archived log file name=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\RDBMS\ARC0000000051_0870715164.0001 RECID=15 STAMP=884685091
Finished backup at 10-JUL-15

Starting Control File and SPFILE Autobackup at 10-JUL-15
piece handle=C:\APP\SALMQURE\PRODUCT\12.1.0\DBHOME_1\DATABASE\C-3945497372-20150710-01 comment=NONE
Finished Control File and SPFILE Autobackup at 10-JUL-15

Performing RESTORE VALIDATE
For archivelog files restore validation, I am providing start log sequence number which is part of the backup and needed for the recovery using this backup.
RMAN> restore database validate;

Starting restore at 10-JUL-15
using channel ORA_DISK_1

channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_1: reading from backup piece C:\DBBACKUP\DB_BACKUP_0CQBME67_1_1.BAK
channel ORA_DISK_1: piece handle=C:\DBBACKUP\DB_BACKUP_0CQBME67_1_1.BAK tag=TAG20150710T094959
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: validation complete, elapsed time: 00:00:15
channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_1: reading from backup piece C:\DBBACKUP\DB_BACKUP_0DQBME7B_1_1.BAK
channel ORA_DISK_1: piece handle=C:\DBBACKUP\DB_BACKUP_0DQBME7B_1_1.BAK tag=TAG20150710T094959
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: validation complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_1: reading from backup piece C:\DBBACKUP\DB_BACKUP_0EQBME84_1_1.BAK
channel ORA_DISK_1: piece handle=C:\DBBACKUP\DB_BACKUP_0EQBME84_1_1.BAK tag=TAG20150710T094959
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: validation complete, elapsed time: 00:00:07
channel ORA_DISK_1: starting validation of datafile backup set
channel ORA_DISK_1: reading from backup piece C:\DBBACKUP\DB_BACKUP_0FQBME8J_1_1.BAK
channel ORA_DISK_1: piece handle=C:\DBBACKUP\DB_BACKUP_0FQBME8J_1_1.BAK tag=TAG20150710T094959
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: validation complete, elapsed time: 00:00:07
Finished restore at 10-JUL-15

RMAN> restore archivelog from logseq 45 validate;

Starting restore at 10-JUL-15
using channel ORA_DISK_1

channel ORA_DISK_1: starting validation of archived log backup set
channel ORA_DISK_1: reading from backup piece C:\DBBACKUP\DB_BACKUP_0BQBME66_1_1.BAK
channel ORA_DISK_1: piece handle=C:\DBBACKUP\DB_BACKUP_0BQBME66_1_1.BAK tag=TAG20150710T094957
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
channel ORA_DISK_1: starting validation of archived log backup set
channel ORA_DISK_1: reading from backup piece C:\DBBACKUP\DB_BACKUP_0GQBME93_1_1.BAK
channel ORA_DISK_1: piece handle=C:\DBBACKUP\DB_BACKUP_0GQBME93_1_1.BAK tag=TAG20150710T095131
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: validation complete, elapsed time: 00:00:01
Finished restore at 10-JUL-15