Thursday, 23 June 2011

Troubleshooting RAC Public Network Failure

Here are some steps I used to troubleshoot the failure of a public network used for SCAN in a 2-node RAC cluster.
Note: I used an aliased crsstat for the command: crsctl stat res –t
Check the status of the Clusterware resources. You can see that there are several resources that are offline below. The ones highlighted in red are the ones we are interested in. The resource for the local listener is now offline while the VIP has been failed over. You will also notice that all of the SCAN Listeners have been failed over to the surviving node.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >crsstat
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DG_DATA.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.DG_FLASH.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.LISTENER.lsnr
ONLINE OFFLINE tibora30
ONLINE ONLINE tibora31
ora.asm
ONLINE ONLINE tibora30 Started
ONLINE ONLINE tibora31 Started
ora.gsd
OFFLINE OFFLINE tibora30
OFFLINE OFFLINE tibora31
ora.net1.network
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.ons
ONLINE ONLINE tibora30
ONLINE OFFLINE tibora31
ora.registry.acfs
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE tibora31
ora.cvu
1 ONLINE OFFLINE
ora.oc4j
1 ONLINE ONLINE tibora31
ora.scan1.vip
1 ONLINE ONLINE tibora30
ora.scan2.vip
1 ONLINE ONLINE tibora31
ora.scan3.vip
1 ONLINE ONLINE tibora31
ora.tibora30.vip
1 ONLINE INTERMEDIATE tibora31 FAILED OVER
ora.tibora31.vip
1 ONLINE ONLINE tibora31
ora.tibprd.db
1 ONLINE ONLINE tibora30 Open
2 ONLINE ONLINE tibora31 Open
ora.tibprd.tibprd_applog.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_basic.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_smap.svc
1 ONLINE ONLINE tibora31
Then look at the CRS logs under $GI_HOME/log/hostname/alerthostname.log for entries similar to the ones below:
2011-06-21 09:43:57.844
[/u01/11.2.0/grid/bin/orarootagent.bin(21168162)]CRS-5818:Aborted command 'check for resource: ora.net1.network tibora30 1' for resource 'ora.net1.network'. Details at (:CRSAGF00113:) {0:9:2} in /u01/11.2.0/grid/log/tibora30/agent/crsd/orarootagent_root/orarootagent_root.log.
2011-06-21 09:44:00.459
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-06-21 09:44:01.112
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "
check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-06-21 09:44:01.180
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "
check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
Check the status of the VIP on the node
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status vip -n tibora30
VIP tibora30-vip is enabled
VIP tibora30-vip is not running
Also check the status of the SCAN resources.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node tibora30
SCAN VIP scan2 is enabled
SCAN VIP scan2 is running on node tibora31
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node tibora31
In this particular case the SCAN VIPs were running on both nodes. It turns out that on another cluster which also experienced a network failure the SCAN VIPs were all running on node.
First we need to start the local listener:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start listener
Now check the status of the resources.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >crsstat
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DG_DATA.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.DG_FLASH.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.LISTENER.lsnr
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.asm
ONLINE ONLINE tibora30 Started
ONLINE ONLINE tibora31 Started
ora.gsd
OFFLINE OFFLINE tibora30
OFFLINE OFFLINE tibora31
ora.net1.network
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.ons
ONLINE ONLINE tibora30
ONLINE OFFLINE tibora31
ora.registry.acfs
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE tibora31
ora.cvu
1 ONLINE OFFLINE
ora.oc4j
1 ONLINE ONLINE tibora31
ora.scan1.vip
1 ONLINE ONLINE tibora30
ora.scan2.vip
1 ONLINE ONLINE tibora31
ora.scan3.vip
1 ONLINE ONLINE tibora31
ora.tibora30.vip
1 ONLINE ONLINE tibora30
ora.tibora31.vip
1 ONLINE ONLINE tibora31
ora.tibprd.db
1 ONLINE ONLINE tibora30 Open
2 ONLINE ONLINE tibora31 Open
ora.tibprd.tibprd_applog.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_basic.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_smap.svc
1 ONLINE ONLINE tibora31

Starting the local listener also caused the VIP to relocate to the previous node.
In another situation I had to manually relocate the VIP to the original node.
Next we need to check our nodeapps including ONS.


grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status nodeapps

VIP tibora30-vip is enabled

VIP tibora30-vip is running on node: tibora30

VIP tibora31-vip is enabled

VIP tibora31-vip is running on node: tibora31

Network is enabled

Network is running on node: tibora30

Network is running on node: tibora31

GSD is disabled

GSD is not running on node: tibora30

GSD is not running on node: tibora31

ONS is enabled

ONS daemon is running on node: tibora30

ONS daemon is not running on node: tibora31


Here you can see that the ONS daemon is not running on the tibora31.
To start the ONS daemon issue the following command:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start nodeapps -n tibora31

PRKO-2421 : Network resource is already started on node(s): tibora31

PRKO-2420 : VIP is already started on node(s): tibora31
Check the status of the nodeapps again
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status nodeapps
VIP tibora30-vip is enabled
VIP tibora30-vip is running on node: tibora30
VIP tibora31-vip is enabled
VIP tibora31-vip is running on node: tibora31
Network is enabled
Network is running on node: tibora30
Network is running on node: tibora31
GSD is disabled
GSD is not running on node: tibora30
GSD is not running on node: tibora31
ONS is enabled
ONS daemon is running on node: tibora30
ONS daemon is running on node: tibora31
The CVU resource can be started as follows:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start cvu -n tibora30
You can now verify connectivity to the database/services. I prefer to use SQL Developer to test connectivity to my databases with one connection for each service name.

All the SCAN listeners were running on a single node. At least one needed to be relocated to service the requests coming from the SCAN on that node.
grid@tibora31[+ASM2]-/home/grid >srvctl relocate scan_listener -i 1 -n tibora30
grid@tibora31[+ASM2]-/home/grid >srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node tibora30
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node tibora31
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node tibora31
Once the SCAN was relocated the application connected successfully.

Thursday, 17 March 2011

Stopping the Oracle RAC 10g Environment

The first step is to stop the Oracle instance. When the instance (and related services) is down, then bring down the ASM instance. Finally, shut down the node applications (Virtual IP, GSD, TNS Listener, and ONS).

$ export ORACLE_SID=orcl1 $ emctl stop dbconsole $ srvctl stop instance -d orcl -i orcl1 $ srvctl stop asm -n linux1 $ srvctl stop nodeapps -n linux1 

Starting the Oracle RAC 10g Environment

The first step is to start the node applications (Virtual IP, GSD, TNS Listener, and ONS). When the node applications are successfully started, then bring up the ASM instance. Finally, bring up the Oracle instance (and related services) and the Enterprise Manager Database console.

$ export ORACLE_SID=orcl1 $ srvctl start nodeapps -n linux1 $ srvctl start asm -n linux1 $ srvctl start instance -d orcl -i orcl1 $ emctl start dbconsole 

Start/Stop All Instances with SRVCTL

Start/stop all the instances and their enabled services. I have included this step just for fun as a way to bring down all instances!

$ srvctl start database -d orcl  $ srvctl stop database -d orcl

**************************************************************************************
A brief Summary of Steps to Start / Shutdown RAC/ASM Setup (Oracle 11g):

Shutdown

1. shutdown database
srvctl stop database -d "that will shutdown ALL running instances of "
srvctl stop instance -d -i "that will shutdown only specified instance of , other running instances will continue to run"

2. shutdown ASM
srvctl stop asm -n "run it for every rac node one-by-one"

3. shutdown nodeapps
srvctl stop nodeapps -n "run it for every rac node one-by-one"

4. Shutdown crs processes
crsctl stop crs

Startup

1. crsctl start crs -> that will start crs processes (HA-Engine), NodeApps, then ASM then Database's Instances then HA-Services (if any)

Thursday, 10 March 2011

Migrate OCR and Voting Disk from External to High Redundancy disk group in Oracle ASM

Multiple OCRs and vote disks on ASM in Oracle 11gR2 « Guenadi N Jilevski's Oracle BLOG

Voting Disk and OCR in 11gR2 On ASM Storage

Having just delivered an Oracle Database 11gR2 RAC Admin course, I’d like to point out some remarkable changes in the way we handle now the important Clusterware components Voting Disk and Oracle Cluster Registry (OCR): Amazingly, we can now store the two inside of an Automatic Storage Management (ASM) Disk Group, which was not possible in 10g.

Then you cannot create more than 1 voting disk in the same or on another/different diskgroup disk when using External Redundancy in 11.2.

The rules are as follows:
External = 1 voting disk
Normal= 3 voting disk
High= 5 voting disk

External redundancy diskgroup depends on the third party hardware vendor to handle the redundancy.

If you want to have the voting disk multiplexed at ASM level then it is recommended you change your redundancy to Normal so that ASM provides this redundancy for you.

The OCR is striped and mirrored (if we have a redundancy other than external), similar as ordinary Database Files are. So we can now leverage the mirroring capabilities of ASM to mirror the OCR also, without having to use multiple RAW devices for that purpose only. The Voting Disk (or Voting File, as it is now also referred to) is not striped but put as a whole on ASM Disks – if we use a redundancy of normal on the Diskgroup, 3 Voting Files are placed, each on one ASM Disk. This is a concern, if our ASM Diskgroups consist of only 2 ASM Disks! Therefore, the new quorum failgroup clause was introduced:

create diskgroup data normal redundancy

failgroup fg1 disk 'ORCL:ASMDISK1'

failgroup fg2 disk 'ORCL:ASMDISK2'

quorum failgroup fg3 disk 'ORCL:ASMDISK3'

attribute 'compatible.asm' = '11.2.0.0.0';

The failgroup fg3 above needs only one small Disk (300 MB should be on the safe side here, since the Voting File is only about 280 MB in size) to keep one Mirror of the Voting File. fg1 and fg2 will contain each one Voting File and all the other stripes of the Database Area as well, but fg3 will only get that one Voting File.

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/crsctl query css votedisk

## STATE File Universal Id File Name Disk group

-- ----- ----------------- --------- ---------

1. ONLINE 511de6e64e354f9bbf4be318fc928c28 (ORCL:ASMDISK1) [DATA]

2. ONLINE 2f1973ed4be84f50bffc2475949b428f (ORCL:ASMDISK2) [DATA]

3. ONLINE 5ed44fb7e79c4f79bfaf09b402ba70df (ORCL:ASMDISK3) [DATA]

Another important change regarding the Voting File is that it is no longer supported to take a manual backup of it with dd. Instead, the Voting File gets backed up automatically into the OCR. As a New Feature, you can now do a manual backup of the OCR any time you like, without having to wait until that is done automatically – which is also still done:

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -showbackup

uhesse1 2010/10/06 09:37:30 /u01/app/11.2.0/grid/cdata/cluhesse/backup00.ocr

uhesse1 2010/10/06 05:37:29 /u01/app/11.2.0/grid/cdata/cluhesse/backup01.ocr

uhesse1 2010/10/06 01:37:27 /u01/app/11.2.0/grid/cdata/cluhesse/backup02.ocr

uhesse1 2010/10/05 01:37:21 /u01/app/11.2.0/grid/cdata/cluhesse/day.ocr

uhesse1 2010/10/04 13:37:19 /u01/app/11.2.0/grid/cdata/cluhesse/week.ocr

Above are the automatic backups of the OCR as in earlier versions. Now the manual backup:

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -manualbackup

uhesse1 2010/10/06 13:07:03 /u01/app/11.2.0/grid/cdata/cluhesse/backup_20101006_130703.ocr

I got a manual backup on the default location on my master node. We can define another backup location for the automatic backups as well as for the manual backups – preferrable on a Shared Device that is accessible by all the nodes (which is not the case with /home/oracle, unfortunately :-) ):

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -backuploc /home/oracle

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -manualbackup

uhesse1 2010/10/06 13:10:50 /home/oracle/backup_20101006_131050.ocr

uhesse1 2010/10/06 13:07:03 /u01/app/11.2.0/grid/cdata/cluhesse/backup_20101006_130703.ocr

[root@uhesse1 ~]# /u01/app/11.2.0/grid/bin/ocrconfig -showbackup

uhesse1 2010/10/06 09:37:30 /u01/app/11.2.0/grid/cdata/cluhesse/backup00.ocr

uhesse1 2010/10/06 05:37:29 /u01/app/11.2.0/grid/cdata/cluhesse/backup01.ocr

uhesse1 2010/10/06 01:37:27 /u01/app/11.2.0/grid/cdata/cluhesse/backup02.ocr

uhesse1 2010/10/05 01:37:21 /u01/app/11.2.0/grid/cdata/cluhesse/day.ocr

uhesse1 2010/10/04 13:37:19 /u01/app/11.2.0/grid/cdata/cluhesse/week.ocr

uhesse1 2010/10/06 13:10:50 /home/oracle/backup_20101006_131050.ocr

uhesse1 2010/10/06 13:07:03 /u01/app/11.2.0/grid/cdata/cluhesse/backup_20101006_130703.ocr

Conclusion: The way to handle Voting Disk and OCR has changed significantly – they can be kept inside of an ASM Diskgroup especially.

ORA-29770 LMHB Terminates Instance as LMON Waited for Control File IO for too Long

Applies to: [ID 1197674.1]

Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1 - Release: 11.1 to 11.2
Information in this document applies to any platform.

Symptoms

Instance crashes with messages like the following:
Wed Sep 09 03:24:14 2009
LMON (ospid: 31216) waits for event 'control file sequential read' for 88 secs.
Wed Sep 09 03:24:29 2009
Errors in file /oracle/base/diag/rdbms/prod/prod3/trace/prod3_lmhb_31304.trc (incident=2329):
ORA-29770: global enqueue process LMON (OSID 31216) is hung for more than 70 seconds
Incident details in: /oracle/base/diag/rdbms/prod/prod3/incident/incdir_2329/prod3_lmhb_31304_i2329.trc
Wed Sep 09 03:24:39 2009
ERROR: Some process(s) is not making progress.
LMHB (ospid: 31304) is terminating the instance.

OR:

Mon Jan 10 14:23:00 2011
LMON (ospid: 8594) waits for event 'control file sequential read' for 87 secs.
Mon Jan 10 14:23:31 2011
LMON (ospid: 8594) waits for event 'control file sequential read' for 118 secs.
ERROR: LMON is not healthy and has no heartbeat.
ERROR: LM** (ospid: 8614) is terminating the instance.


Cause

RAC critical background process not participating heart beat for longer than default threshold of 70 seconds as it's waiting for control file IO

Solution

Control file IO could take longer than the default threshold under some circumstances, bug 8888434 which is affecting 11.1 and 11.2.0.1 but has been fixed in 11.2.0.2 will prevent instance from being terminated.

Troubleshoot the ORA-29740 error in a Real Application

Applies to:

Oracle Server - Enterprise Edition
Oracle Server - Enterprise Edition - Version: 9.2.0.1 to 11.1.0.8 - Release: 9.2 to 11.1
Information in this document applies to any platform.

Purpose

This note was created to troubleshoot the ORA-29740 error in a Real Application
Clusters environment.

Last Review Date

January 22, 2010

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Troubleshooting ORA-29740 in a RAC Environment
==============================================

An ORA-29740 error occurs when a member was evicted from the group by another
member of the cluster database for one of several reasons, which may include
a communications error in the cluster, failure to issue a heartbeat to the
control file, and other reasons. This mechanism is in place to prevent
problems from occurring that would affect the entire database. For example,
instead of allowing a cluster-wide hang to occur, Oracle will evict the
problematic instance(s) from the cluster. When an ORA-29740 error occurs, a
surviving instance will remove the problem instance(s) from the cluster.
When the problem is detected the instances 'race' to get a lock on the
control file (Results Record lock) for updating. The instance that obtains
the lock tallies the votes of the instances to decide membership. A member
is evicted if:

a) A communications link is down
b) There is a split-brain (more than 1 subgroup) and the member is
not in the largest subgroup
c) The member is perceived to be inactive

Sample message in Alert log of the evicted instance:

Fri Sep 28 17:11:51 2001
Errors in file /oracle/export/TICK_BIG/lmon_26410_tick2.trc:
ORA-29740: evicted by member %d, group incarnation %d
Fri Sep 28 17:11:53 2001
Trace dumping is performing id=[cdmp_20010928171153]
Fri Sep 28 17:11:57 2001
Instance terminated by LMON, pid = 26410

The key to resolving the ORA-29740 error is to review the LMON trace files
from each of the instances. On the evicted instance we will see something
like:

*** 2002-11-20 18:49:51.369
kjxgrdtrt: Evicted by 0, seq (3, 2)
^
|
This indicates which instance initiated the eviction.

On the evicting instance we will see something like:

kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2002-11-20 18:49:29.559
kjxgmrcfg: Reconfiguration started, reason 3

...
*** 2002-11-20 18:49:29.727
Obtained RR update lock for sequence 2, RR seq 2
*** 2002-11-20 18:49:31.284
Voting results, upd 0, seq 3, bitmap: 0
Evicting mem 1, stat 0x0047 err 0x0002

You can see above that the instance initiated a reconfiguration for reason 3
(see Note 139435.1 for more information on reconfigurations). The
reconfiguration is then started and this instance obtained the RR lock
(Results Record lock) which means this instance will tally the votes of the
instances to decide membership. The last lines show the voting results then
this instance evicts instance 1.

For troubleshooting ORA-29740 errors, the 'reason' will be very important.
In the above example, the first section indicates the reason for the
initiated reconfiguration. The reasons are as follows:



Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend

For ORA-29740 errors, you will most likely see reasons 1, 2, or 3.

-----------------------------------------------------------------------------

Reason 1: The Node Monitor generated the reconfiguration. This can happen if:

a) An instance joins the cluster
b) An instance leaves the cluster
c) A node is halted

It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances. If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.

ORA-29740 evictions with reason 1 are usually expected when the cluster
membership changes. Very rarely are these types of evictions a real problem.

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for:

ORA-29740 'reason 1'

Important files to review are:

a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) Each node's syslog or messages file
e) iostat output before, after, and during evictions
f) vmstat output before, after, and during evictions
g) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".


-----------------------------------------------------------------------------

Reason 2: An instance death was detected. This can happen if:

a) An instance fails to issue a heartbeat to the control file.

When the heartbeat is missing, LMON will issue a network ping to the instance
not issuing the heartbeat. As long as the instance responds to the ping,
LMON will consider the instance alive. If, however, the heartbeat is not
issued for the length of time of the control file enqueue timeout, the
instance is considered to be problematic and will be evicted.


Common causes for an ORA-29740 eviction (Reason 2):

a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX
b) Network Problems (SAN).
c) Resource Starvation (CPU, I/O, etc..)
d) An Oracle bug.



Important files to review are:

a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) The CKPT process trace file of the evicted instance
e) Other bdump or udump files...
f) Each node's syslog or messages file
g) iostat output before, after, and during evictions
h) vmstat output before, after, and during evictions
i) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".

-----------------------------------------------------------------------------

Reason 3: Communications Failure. This can happen if:

a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of
another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
and is not responding to remote requests.

In this case the ORA-29740 error is recorded when there are communication
issues between the instances. It is an indication that an instance has been
evicted from the configuration as a result of IPC send timeout. A
communications failure between processes across instances will also generate a
ORA-29740 with reason 3. When this occurs, the trace file of the process
experiencing the error will print a message:

Reporting Communication error with instance:

If communication is lost at the cluster layer (for example, network cables
are pulled), the cluster software may also perform node evictions in the
event of a cluster split-brain. Oracle will detect a possible split-brain
and wait for cluster software to resolve the split-brain. If cluster
software does not resolve the split-brain within a specified interval,
Oracle proceeds with evictions.


Oracle Support has seen cases where resource starvation (CPU, I/O, etc...) can
cause an instance to be evicted with this reason code. The LMON or LMD process
could be blocked waiting for resources and not respond to polling by the remote
instance(s). This could cause that instance to be evicted. If you have
a statspack report available from the time just prior to the eviction on the
evicted instance, check for poor I/O times and high CPU utilization. Poor I/O
times would be an average read time of > 20ms.

Common causes for an ORA-29740 eviction (Reason 3):

a) Network Problems.
b) Resource Starvation (CPU, I/O, etc..)
c) Severe Contention in Database.
d) An Oracle bug.



Tips for tuning inter-instance performance can be found in the following note:

Note 181489.1
Tuning Inter-Instance Performance in RAC and OPS

Important files to review are:

a) Each instance's alert log
b) Each instance's LMON trace file
c) each instance's LMD and LMS trace files
d) Statspack reports from all nodes leading up to the eviction
e) Other bdump or udump files...
f) Each node's syslog or messages file
g) iostat output before, after, and during evictions
h) vmstat output before, after, and during evictions
i) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".


Tuesday, 8 March 2011

Troubleshooting 10g and 11.1 Clusterware Reboots


If there is a ocssd.bin problem/failure, the oprocd daemon detected a scheduling
problem, or some other fatal problem, a node will reboot in a RAC cluster. This
functionality is used for I/O fencing to ensure that writes from I/O capable clients
can be cleared avoiding potential corruption scenarios in the event of a network
split, node hang, or some other fatal event.

1.0 - PROCESS ROLES FOR REBOOTS


OCSSD (aka CSS daemon) - This process is spawned in init.cssd. It runs in both
vendor clusterware and non-vendor clusterware environments and is armed with a
node kill via the init script. OCSSD's primary job is internode health monitoring
and RDBMS instance endpoint discovery. It runs as the Oracle user.

PS Output:
oracle 686 0.0 0.23207216608 ? S 11:42:42 0:12 /oracle/10g/crs/bin/ocssd.bin

INIT.CSSD - In a normal environment, init spawns init.cssd, which in turn spawns
OCSSD as a child. If ocssd dies or is killed, the node kill functionality of the
init script will kill the node. If the script is killed, its ocssd survives and
continues operating. However init has been instructed to respawn init.cssd via
inittab. When it does so, the second init.cssd will attempt to start its own ocssd.
That ocssd starts up, finds that its endpoint is owned by the first ocssd, fails,
and then the 2nd init.cssd kills the node.

PS Output:
root 635 0.0 0.0 1120 840 ? S 11:41:41 0:00 /bin/sh /etc/init.d/init.cssd fatal

OPROCD - This process is spawned in any non-vendor clusterware environment, except
on Windows where Oracle uses a kernel driver to perform the same actions and Linux
prior to version 10.2.0.4. If oprocd detects problems, it will kill a node via C
code. It is spawned in init.cssd and runs as root. This daemon is used to detect
hardware and driver freezes on the machine. If a machine were frozen for long enough
that the other nodes evicted it from the cluster, it needs to kill itself to prevent
any IO from getting reissued to the disk after the rest of the cluster has remastered
locks."

PS Output:
root 684 0.0 0.0 2240 968 ? S 11:42:42 0:00 /oracle/10g/crs/bin/oprocd start -t 1000 -m 50

OCLSOMON (10.2.0.2 and above) - This process monitors the CSS daemon for hangs or
scheduling issues and can reboot a node if there is a perceived hang.

2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT


* Messages file locations:
Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

** CSS log locations:
11.1 and 10.2: <CRS_HOME>/log/<node name>/cssd
10.1: <CRS_HOME>/css/log

*** Oprocd log locations:
In /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.

Note that oprocd only runs when no vendor clusterware is running or on Linux > 10.2.0.4

3.0 - TROUBLESHOOTING OCSSD REBOOTS


If you have encountered an OCSSD reboot, review common causes in section 3.1 below.
If the problem cannot be determined by reviewing the common causes, review and
collect the data from section 3.3.

3.1 - COMMON CAUSES OF OCSSD REBOOTS


- Network failure or latency between nodes. It would take at least 30 consecutive
missed checkins to cause a reboot, where heartbeats are issued once per second.

Example of missed checkins in the CSS log:

WARNING: clssnmPollingThread: node <node> (1) at 50% heartbeat fatal, eviction in 29.100 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 14.960 seconds
WARNING: clssnmPollingThread: node <node> (1) at 75% heartbeat fatal, eviction in 13.950 seconds

The first thing to do is find out if the missed checkins ARE the problem or are a
result of the node going down due to other reasons. Check the messages file to see
what exact time the node went down and compare it to the time of the missed checkins.

- If the messages file reboot time < missed checkin time then the node eviction was
likely not due to these missed checkins.

- If the messages file reboot time > missed checkin time then the node eviction was
likely a result of the missed checkins.


- Problems writing to or reading from the CSS voting disk.

Example of a voting disk problem in the CSS log:

ERROR: clssnmDiskPingMonitorThread: voting device access hanging (160008 miliseconds)

- Lack of CPU resources. There are some situations which will appear to be missed
heartbeat issues, however turn out to be caused by a user running a high
sustained load average. When a machine gets too heavily loaded, the scheduling
reliability can be bad. This could cause CSS to not get scheduled in time and
thus CSS cannot get its work done. If this happens, the node is declared
not-viable for cluster work and is evicted.

- A problem with the executables (for example, removing CRS Home files)

- Misconfiguration of CRS. Possible misconfigurations:

- Wrong network selected as the private network for CRS (confirm with CSS log,
/etc/hosts, and ifconfig output). Make sure it is not the public or VIP
address. Look in the CSS log for strings like...
clsc_listen: (*) Listening on
(ADDRESS=(PROTOCOL=tcp)(HOST=dlsun2046)(PORT=61196))

- Putting the CSS vote file on a Netapp that's shared over some kind of public
network or otherwise excessively loaded/unreliable network. If this is the
case, you are likely to see the following message in the CSS logfile:

ERROR: clssnmDiskPingThread(): Large disk IO timeout * seconds.

If you ever see this error, then it's important to investigate why the disk
subsystem is unresponsive.

See section 3.2 for information on how to correct common misconfiguration
problems.

- Killing the "init.cssd fatal" process or "ocssd" process.

- An unexpected failure of the OCSSD process, this can be caused by any of the
above issues.

- An Oracle bug. Known bugs that can cause CSS reboots:



Note 264699.1 - CSS Fails to Flush Writes After Installing 10.1.0.2 CRS on Linux with OCFS
Workaround: Put OCR and CSS Voting files on raw devices
Fixed in OCFS 1.0.11 and above.

Bug 3942568 - A deadlock can occur between 2 threads of the CSS daemon process.
Fixed in 10.1.0.4 and above.




SOLARIS ONLY: See these bugids that fixed the problem (in Solaris 9; the fixes were
backported to Solaris 8 Update 6):
Bug 4308370 cond_timedwait(), sigtimedwait(), poll() and /proc time out too soon
Bug 4391799 Fix for BugID 4308370 causes timeout failures when system time is reset


3.2 - FILES TO REVIEW AND GATHER FOR OCSSD REBOOTS


If logging a service request, please provide ALL of the following files to Oracle
Support if possible:

- All the files in the following directories from all nodes.

For 10.2 and above, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the
diagcollection.pl script.

For 10.1:

<CRS_HOME>/crs/log
<CRS_HOME>/crs/init
<CRS_HOME>/css/log
<CRS_HOME>/css/init
<CRS_HOME>/evm/log
<CRS_HOME>/evm/init
<CRS_HOME>/srvm/log

Recommended method for gathering these for each node:

cd <CRS_HOME>
tar cf crs.tar crs/init crs/log css/init css/log evm/init evm/log srvm/log

- Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

- If a core files was written it would be useful to obtain a stack trace of the
core file using Note 1812.1 "TECH Getting a Stack Trace from a CORE file".
Core files are usually writtin in one of the following directories:

<CRS_HOME>/crs/init
<CRS_HOME>/css/init
<CRS_HOME>/evm/init

You should also check all threads of the core file and get a stack trace for each.
Note 118252.1 has information on gathering multiple threads.

- OCR dump file - To get this cd to <CRS_HOME>/bin as the root user and issue
"ocrdump <unique filename>". This will generate two files (ocrdump.log and the
a dump file with the name given for it).

- 'opatch lsinventory -detail' output for the CRS home

- Back up the scls_scr directory, inittab, and hosts file for analysis with:

Sun, HP-UX, HP Tru64:

cd /
tar cf /var/backup/oraclecrs.tar var/opt/oracle/scls_scr etc/hosts etc/inittab

Linux, IBM-AIX:

cd /
tar cf /var/backup/oraclecrs.tar etc/oracle/scls_scr etc/hosts etc/inittab


- Ifconfig output from each node (ifconfig -a on unix platforms).
@ These can be useful in determining whether the user has set up the private
@ interconnect names to resolve to the same IP on all nodes. They can be useful
@ in determining whether someone is using their public IP as their private
@ interconnect address. In some cases it may also be useful to get the output
@ of nslookup for each private interconnect name. If some layer is attempting to
@ connect across the private interconnect and there is a possibility of getting
@ different results from /etc/hosts and from the nameserver, it could cause quite
@ a bit of confusion.

@ If it looks like there are network issues, asking for a detailed map of
@ their network wiring is pretty appropriate, and can be handy.

- It would also be useful to get the following from each node leading up to the time
of the reboot:

- netstat -is (or equivelant)
- iostat -x (or equivelant)
- vmstat (or equivelant)
- ping -s (or equivelant) output of the private network

There is a tool called "OS Watcher" that helps gather this information. This tool
will dump netstat, vmstat, iostat, and other output at an inverval and save x number
of hours of archived data. For more information about this tool see Note 301137.1.

4.0 - TROUBLESHOOTING OPROCD REBOOTS


If you have encountered an OPROCD reboot, review common causes in section 4.1 below.
If the problem cannot be determined by reviewing the common causes, review and
collect the data from section 4.2.

4.1 - COMMON CAUSES OF OPROCD REBOOTS


- A problem detected by the OPROCD process. This can be caused by 4 things:

1) An OS scheduler problem.
2) The OS is getting locked up in a driver or hardware.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.

OPROCD Bugs Known to Cause Reboots:


Bug 5015469 - OPROCD may reboot the node whenever the system date is moved
backwards.
Fixed in 10.2.0.3+

Bug 4206159 - Oprocd is prone to time regression due to current API used (AIX only)
Fixed in 10.1.0.3 + One off patch for Bug 4206159.


Diagnostic Fixes (VERY NECESSARY IN MOST CASES):

Bug 5137401 - Oprocd logfile is cleared after a reboot
Fixed in 10.2.0.4+

Bug 5037858 - Increase the warning levels if a reboot is approaching
Fixed in 10.2.0.3+

4.2 - FILES TO REVIEW AND GATHER FOR OPROCD REBOOTS


If logging a service request, please provide ALL of the following files to Oracle
Support if possible:

- Oprocd logs in /etc/oracle/oprocd or /var/opt/oracle/oprocd depending on version/platform.

- All the files in the following directories from all nodes.

For 10.2 and above, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the
diagcollection.pl script.

For 10.1:

<CRS_HOME>/crs/log
<CRS_HOME>/crs/init
<CRS_HOME>/css/log
<CRS_HOME>/css/init
<CRS_HOME>/evm/log
<CRS_HOME>/evm/init
<CRS_HOME>/srvm/log

Recommended method for gathering these for each node:

cd <CRS_HOME>
tar cf crs.tar crs/init crs/log css/init css/log evm/init evm/log srvm/log

- Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

- 'opatch lsinventory -detail' output for the CRS home

- It would also be useful to get the following from each node leading up to the time
of the reboot:

- netstat -is (or equivelant)
- iostat -x (or equivelant)
- vmstat (or equivelant)

There is a tool called "OS Watcher" that helps gather this information. This tool
will dump netstat, vmstat, iostat, and other output at an inverval and save x number
of hours of archived data. For more information about this tool see Note 301137.1.


5.0 - TROUBLESHOOTING OCLSOMON REBOOTS


If you have encountered an OCLSOMON reboot, review common causes in section 5.1 below.
If the problem cannot be determined by reviewing the common causes, review and
collect the data from section 5.2.

5.1 - COMMON CAUSES OF OCLSOMON REBOOTS


- A problem detected by the OCLSOMON process. This can be caused by 4 things:

1) A thread(s) within the CSS daemon hung.
2) An OS scheduler problem.
3) Excessive amounts of load on the machine, thus preventing the scheduler from
behaving reasonably.
4) An Oracle bug.

5.2 - FILES TO REVIEW AND GATHER FOR OCLSOMON REBOOTS


If logging a service request, please provide ALL of the following files to Oracle
Support if possible:

- All the files in the following directories from all nodes. For a description of
these directories, see Note 259301.1 :

For 10.2, all files under:

<CRS_HOME>/log

Recommended method for gathering these for each node would be to run the
diagcollection.pl script.

- Messages or Syslog from all nodes from the time of the problem:

Sun: /var/adm/messages
HP-UX: /var/adm/syslog/syslog.log
Tru64: /var/adm/messages
Linux: /var/log/messages
IBM: /bin/errpt -a > messages.out

- 'opatch lsinventory -detail' output for the CRS home

- It would also be useful to get the following from each node leading up to the time
of the reboot:

- netstat -is (or equivelant)
- iostat -x (or equivelant)
- vmstat (or equivelant)

There is a tool called "OS Watcher" that helps gather this information. This tool
will dump netstat, vmstat, iostat, and other output at an inverval and save x number
of hours of archived data. For more information about this tool see Note 301137.1.

Troubleshooting 10g and 11.1 Clusterware Reboots [ID 265769.1]


______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email
______________________________________________________________________