Thursday 23 June 2011

Troubleshooting RAC Public Network Failure

Here are the steps I used to troubleshoot a failure of the public network (used by the SCAN and the VIPs) in a two-node RAC cluster.
Note: I used crsstat as an alias for the command crsctl stat res -t.
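If you want the same shortcut, a minimal version of the alias looks like this (assuming the grid home shown in the prompts below):
# Minimal crsstat alias; adjust the grid home path for your environment
alias crsstat='/u01/11.2.0/grid/bin/crsctl stat res -t'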
Check the status of the Clusterware resources. Several resources in the output below are offline; the ones we are interested in are the local listener (ora.LISTENER.lsnr), which is now OFFLINE on tibora30, and the node VIP (ora.tibora30.vip), which has failed over to tibora31. Notice also that all of the SCAN listeners are now running on the surviving node.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >crsstat
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DG_DATA.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.DG_FLASH.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.LISTENER.lsnr
ONLINE OFFLINE tibora30
ONLINE ONLINE tibora31
ora.asm
ONLINE ONLINE tibora30 Started
ONLINE ONLINE tibora31 Started
ora.gsd
OFFLINE OFFLINE tibora30
OFFLINE OFFLINE tibora31
ora.net1.network
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.ons
ONLINE ONLINE tibora30
ONLINE OFFLINE tibora31
ora.registry.acfs
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE tibora31
ora.cvu
1 ONLINE OFFLINE
ora.oc4j
1 ONLINE ONLINE tibora31
ora.scan1.vip
1 ONLINE ONLINE tibora30
ora.scan2.vip
1 ONLINE ONLINE tibora31
ora.scan3.vip
1 ONLINE ONLINE tibora31
ora.tibora30.vip
1 ONLINE INTERMEDIATE tibora31 FAILED OVER
ora.tibora31.vip
1 ONLINE ONLINE tibora31
ora.tibprd.db
1 ONLINE ONLINE tibora30 Open
2 ONLINE ONLINE tibora31 Open
ora.tibprd.tibprd_applog.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_basic.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_smap.svc
1 ONLINE ONLINE tibora31
Then look at the Clusterware alert log under $GI_HOME/log/<hostname>/alert<hostname>.log for entries similar to the ones below:
2011-06-21 09:43:57.844
[/u01/11.2.0/grid/bin/orarootagent.bin(21168162)]CRS-5818:Aborted command 'check for resource: ora.net1.network tibora30 1' for resource 'ora.net1.network'. Details at (:CRSAGF00113:) {0:9:2} in /u01/11.2.0/grid/log/tibora30/agent/crsd/orarootagent_root/orarootagent_root.log.
2011-06-21 09:44:00.459
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-06-21 09:44:01.112
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
2011-06-21 09:44:01.180
[/u01/11.2.0/grid/bin/oraagent.bin(22413372)]CRS-5016:Process "/u01/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/11.2.0/grid/log/tibora30/agent/crsd/oraagent_grid/oraagent_grid.log"
Check the status of the VIP on the affected node:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status vip -n tibora30
VIP tibora30-vip is enabled
VIP tibora30-vip is not running
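Since the underlying problem was the public network itself, it is also worth confirming which interface Clusterware has registered as the public network and how the network resource and VIP are configured. These are standard commands; the output will obviously be platform and site specific.
# Interfaces registered with Clusterware (public and cluster_interconnect)
/u01/11.2.0/grid/bin/oifcfg getif
# Public network (net1) and VIP configuration for the affected node
srvctl config network
srvctl config vip -n tibora30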
Also check the status of the SCAN resources.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node tibora30
SCAN VIP scan2 is enabled
SCAN VIP scan2 is running on node tibora31
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node tibora31
In this particular case the SCAN VIPs were still running on both nodes. On another cluster that experienced a similar network failure, however, the SCAN VIPs had all ended up on one node.
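Before restarting anything, it can be useful to dump the SCAN configuration to see the SCAN name, its VIP addresses and the SCAN listener ports:
srvctl config scan
srvctl config scan_listener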
First we need to start the local listener:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start listener
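srvctl start listener with no node specified starts the listener on every node where it is down. If you prefer to touch only the affected node, the node can be named explicitly:
srvctl start listener -n tibora30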
Now check the status of the resources.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >crsstat
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DG_DATA.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.DG_FLASH.dg
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.LISTENER.lsnr
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.asm
ONLINE ONLINE tibora30 Started
ONLINE ONLINE tibora31 Started
ora.gsd
OFFLINE OFFLINE tibora30
OFFLINE OFFLINE tibora31
ora.net1.network
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
ora.ons
ONLINE ONLINE tibora30
ONLINE OFFLINE tibora31
ora.registry.acfs
ONLINE ONLINE tibora30
ONLINE ONLINE tibora31
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN2.lsnr
1 ONLINE ONLINE tibora31
ora.LISTENER_SCAN3.lsnr
1 ONLINE ONLINE tibora31
ora.cvu
1 ONLINE OFFLINE
ora.oc4j
1 ONLINE ONLINE tibora31
ora.scan1.vip
1 ONLINE ONLINE tibora30
ora.scan2.vip
1 ONLINE ONLINE tibora31
ora.scan3.vip
1 ONLINE ONLINE tibora31
ora.tibora30.vip
1 ONLINE ONLINE tibora30
ora.tibora31.vip
1 ONLINE ONLINE tibora31
ora.tibprd.db
1 ONLINE ONLINE tibora30 Open
2 ONLINE ONLINE tibora31 Open
ora.tibprd.tibprd_applog.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_basic.svc
1 ONLINE ONLINE tibora31
ora.tibprd.tibprd_smap.svc
1 ONLINE ONLINE tibora31

Starting the local listener also caused the VIP to fail back to its home node. In another situation I had to relocate the VIP back to the original node manually, as sketched below.
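For that manual relocation, one option is to move the VIP resource with crsctl. This is only a sketch: it needs to be run with sufficient privileges (the VIP is a root-owned resource), and the -f flag may be required if dependent resources have to be moved along with it.
# Move the node VIP back to its home node
crsctl relocate resource ora.tibora30.vip -n tibora30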
Next we need to check our nodeapps including ONS.
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status nodeapps
VIP tibora30-vip is enabled
VIP tibora30-vip is running on node: tibora30
VIP tibora31-vip is enabled
VIP tibora31-vip is running on node: tibora31
Network is enabled
Network is running on node: tibora30
Network is running on node: tibora31
GSD is disabled
GSD is not running on node: tibora30
GSD is not running on node: tibora31
ONS is enabled
ONS daemon is running on node: tibora30
ONS daemon is not running on node: tibora31
Here you can see that the ONS daemon is not running on tibora31. To start it, issue the following command:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start nodeapps -n tibora31
PRKO-2421 : Network resource is already started on node(s): tibora31
PRKO-2420 : VIP is already started on node(s): tibora31
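The PRKO messages simply confirm that the network and VIP were already up on tibora31, so ONS was the only piece that actually needed starting. If you want to look at the ONS daemon directly rather than through srvctl, onsctl from the grid home can be used; this is a sketch, run locally on the node in question.
# Report whether the ONS daemon is up on this node
/u01/11.2.0/grid/opmn/bin/onsctl ping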
Check the status of the nodeapps again:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl status nodeapps
VIP tibora30-vip is enabled
VIP tibora30-vip is running on node: tibora30
VIP tibora31-vip is enabled
VIP tibora31-vip is running on node: tibora31
Network is enabled
Network is running on node: tibora30
Network is running on node: tibora31
GSD is disabled
GSD is not running on node: tibora30
GSD is not running on node: tibora31
ONS is enabled
ONS daemon is running on node: tibora30
ONS daemon is running on node: tibora31
The CVU resource can be started as follows:
grid@tibora30[+ASM1]-/u01/11.2.0/grid/log/tibora30 >srvctl start cvu -n tibora30
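Because the root cause was a network failure, a cluvfy node connectivity check is a quick way to confirm that the public network is healthy again across the whole cluster. A sketch; -verbose is optional.
cluvfy comp nodecon -n all -verbose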
You can now verify connectivity to the database and its services. I prefer to use SQL Developer, with one connection defined for each service name.
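If SQL Developer is not to hand, a quick sqlplus test per service works just as well. This is only a sketch: the SCAN name, port and credentials below are placeholders, while the service name is one of those shown in the resource list above.
# EZConnect test through the SCAN; tib-scan, 1521 and the credentials are placeholders
sqlplus test_user/test_password@//tib-scan:1521/tibprd_basic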

All the SCAN listeners were still running on a single node (tibora31) even though the scan1 VIP was on tibora30, so LISTENER_SCAN1 needed to be relocated to tibora30 to service the connection requests arriving on that SCAN VIP.
grid@tibora31[+ASM2]-/home/grid >srvctl relocate scan_listener -i 1 -n tibora30
grid@tibora31[+ASM2]-/home/grid >srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node tibora30
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node tibora31
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node tibora31
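As a final sanity check you can confirm that the relocated SCAN listener has the database services registered with it. A sketch, run on tibora30 with the grid environment set:
lsnrctl services LISTENER_SCAN1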
Once the SCAN listener was relocated, the application connected successfully.