Brief intro
Our last article was about the Extended RAC, a way to spread your RAC nodes across the city. Whether your RAC lives in one building or is stretched across sites, a smart and prudent DBA always monitors it.
As the application is taken into production, it grows and flourishes. That growth can become a big burden for the DBA who is not proactively monitoring his database: disk usage goes up, network bandwidth becomes a bottleneck, transactions start taking too long to commit, and more users arrive with more aggressive usage patterns. All of this may be good for the business, but the Service Level Agreements (SLAs) still need to be met. Proactively monitoring your Oracle RAC, or even a typical single-node Oracle database, will keep you ahead of problems. That means you need to know which tools to use, and before picking tools you should ask yourself the following questions (a couple of them can be answered straight from SQL*Plus, as sketched after the list):
1. Are we meeting our SLA (Service Level Agreements)?
2. Are the High Availability (HA) objectives being met?
3. Are all instances sharing the load evenly?
4. What is the interconnect load (latency/overhead)?
5. CPU: Are Oracle processes getting enough resources?
6. Memory: Is there enough memory for the System Global Area (SGA) and the other memory structures?
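Here is that sketch, which is an illustration rather than part of the original article: it checks instance load balance (question 3) and interconnect traffic (question 4), assuming a 10g RAC database, OS authentication as SYSDBA, and the standard GV$ views.

#!/bin/sh
# Minimal sketch: answer two of the questions above from the GV$ views.
sqlplus -s "/ as sysdba" <<'EOF'
SET LINESIZE 120
SET PAGESIZE 50
-- Question 3: are all instances sharing the load evenly?
SELECT inst_id, COUNT(*) AS sessions
FROM   gv$session
GROUP  BY inst_id
ORDER  BY inst_id;
-- Question 4: how busy is the interconnect? Global cache blocks received.
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('gc cr blocks received', 'gc current blocks received')
ORDER  BY inst_id, name;
EOF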
At the OS level, you need to monitor CPU, memory, disk performance, and network traffic (a small sampling sketch follows this list):
- CPU (%idle time, etc.)
- I/O (queue length)
- Shared storage
- network (both public and private network)
- memory (paging, swapping, etc.)
- logs (/var/log/messages, etc.)
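Here is that sampling sketch, a rough illustration rather than production code; the log path and the 60-second interval are arbitrary assumptions.

#!/bin/sh
# Minimal sketch: append one sample of each OS metric to a log file.
# The /tmp log path and 60-second interval are illustrative assumptions.
LOG=/tmp/osmon_`hostname -s`.log
while true
do
  date                      >> $LOG
  vmstat 1 2 | tail -1      >> $LOG  # CPU %idle, paging and swapping (current 1s sample)
  iostat -d                 >> $LOG  # per-device I/O averages since boot
  netstat -i | tail -n +3   >> $LOG  # packet and error counts per interface
  tail -3 /var/log/messages >> $LOG  # recent system log entries (needs read access)
  sleep 60
done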
At the application level, we need to carefully sprinkle monitoring code through the stack (for instance, if the application server is Tomcat or JBoss, you will be interested in the Catalina logs, java.util.logging output, log4j logging, and so on). There are also more professional tools, such as BIRT, that can be employed to report on your application's performance.
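For example, assuming a Tomcat server whose logs live under /opt/tomcat/logs (the path is a placeholder, not from the original setup), a crude check could simply count error entries:

#!/bin/sh
# Minimal sketch: count SEVERE (java.util.logging) and ERROR (log4j)
# entries in the Catalina log. The log path is an assumed placeholder.
CATALINA_LOG=/opt/tomcat/logs/catalina.out
echo "SEVERE entries: `grep -c SEVERE $CATALINA_LOG`"
echo "ERROR entries:  `grep -c ERROR $CATALINA_LOG`"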
For the OS you will need the following tools (the script after this list ties a few of them together into a quick connectivity check):
- top: Top Processes
- ps: status of the processes
- iostat: I/O stats
- vmstat: Virtual Memory stats
- netstat: network stats
- ifconfig/ipconfig: check the IP configuration locally on the nodes (ifconfig on Linux, ipconfig on Windows)
- ping: utility to ping across nodes
- traceroute/tracert: useful for troubleshooting large networks where several paths can lead to the same point, or where many intermediate systems (routers or bridges) are involved
- nslookup: query DNS to look up the nodes' names and IP addresses
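As promised, here is a minimal sketch, not part of the original setup, combining nslookup and ping into a reachability check across the cluster addresses; the -priv and -vip host names follow common RAC naming conventions and are assumptions.

#!/bin/sh
# Minimal sketch: verify that every cluster address resolves and responds.
# vm01/vm02 are the article's test nodes; the -priv/-vip names are assumed.
for host in vm01 vm02 vm01-priv vm02-priv vm01-vip vm02-vip
do
  echo "== $host =="
  nslookup $host > /dev/null 2>&1 && echo " DNS: ok"  || echo " DNS: FAILED"
  ping -c 2 $host > /dev/null 2>&1 && echo " ping: ok" || echo " ping: FAILED"
done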
Let's quickly check if our cluster is online. In our test/development scenario, I changed the disk configuration from RAID 5 to RAID 0 for better performance of my RAC nodes. If you have an enterprise version of ESX, you can put all of the VMs on a SAN and not bother with the local disk configuration, as you will not need DAS (Direct Attached Storage).
Upon restarting, check the cluster with crs_stat:
[oracle@vm01 bin]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.esxrac.db application ONLINE ONLINE vm02
ora....c1.inst application ONLINE ONLINE vm01
ora....c2.inst application ONLINE ONLINE vm02
ora....serv.cs application ONLINE ONLINE vm02
ora....ac1.srv application ONLINE ONLINE vm01
ora....ac2.srv application ONLINE ONLINE vm02
ora....SM1.asm application ONLINE ONLINE vm01
ora....01.lsnr application ONLINE ONLINE vm01
ora.vm01.gsd application ONLINE ONLINE vm01
ora.vm01.ons application ONLINE ONLINE vm01
ora.vm01.vip application ONLINE ONLINE vm01
ora....SM2.asm application ONLINE ONLINE vm02
ora....02.lsnr application ONLINE ONLINE vm02
ora.vm02.gsd application ONLINE ONLINE vm02
ora.vm02.ons application ONLINE ONLINE vm02
ora.vm02.vip application ONLINE ONLINE vm02
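Everything is ONLINE. To keep it that way without staring at the terminal, a small watchdog can run from cron. The sketch below is an illustration, assuming crs_stat is on the PATH and mailx is configured; the address is a placeholder.

#!/bin/sh
# Minimal sketch: mail the full resource listing if anything is not ONLINE.
# dba@example.com is a placeholder address.
if crs_stat -t | grep -q OFFLINE
then
  crs_stat -t | mailx -s "CRS resource OFFLINE on `hostname`" dba@example.com
fi

With the cluster confirmed online, we can turn to the OS-level I/O picture with iostat.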
[oracle@vm01 udump]$ iostat --help
Usage: iostat [ options... ] [ <interval> [ <count> ] ]
[ -c | -d ] [ -k ] [ -t ] [ -V ] [ -x ]
[ { <device> [ ... ] | ALL } ] [ -p [ { <device> | ALL } ] ]
[oracle@vm01 udump]$ iostat
Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl) 05/03/2007
avg-cpu: %user %nice %sys %iowait %idle
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.18 2.60 76.77 2220396 65637664
sda1 9.65 2.60 76.76 2219427 65637656
sdc 2.75 44.07 33.41 37681732 28563403
sdc1 2.76 44.07 33.41 37678487 28563403
sdd 1.71 10.44 33.41 8926315 28563403
sdd1 1.73 10.44 33.41 8923070 28563403
sde 1.12 1.26 22.43 1079253 19182446
sde1 1.13 1.26 22.43 1076008 19182446
sdf 4.31 502.78 4.18 429902055 3570958
sdf1 4.31 502.78 4.18 429900486 3570958
sdg 7.24 1004.57 5.15 858957930 4407357
sdg1 7.24 1004.57 5.15 858956361 4407357
sdh 1.00 1.00 0.50 858776 428077
sdh1 1.00 1.00 0.50 857207 428077
[oracle@vm01 udump]$ iostat -t 3
Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl) 05/03/2007
avg-cpu: %user %nice %sys %iowait %idle
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.18 2.60 76.77 2220396 65645200
sda1 9.65 2.60 76.77 2219427 65645192
sdc 2.75 44.07 33.40 37685412 28565854
sdc1 2.76 44.07 33.40 37682167 28565854
sdd 1.71 10.44 33.40 8926763 28565854
sdd1 1.73 10.44 33.40 8923518 28565854
sde 1.12 1.26 22.43 1079381 19183793
sde1 1.13 1.26 22.43 1076136 19183793
sdf 4.31 502.78 4.18 429949905 3571674
sdf1 4.31 502.78 4.18 429948336 3571674
sdg 7.24 1004.57 5.15 859052465 4408164
sdg1 7.24 1004.57 5.15 859050896 4408164
sdh 1.00 1.00 0.50 858870 428124
sdh1 1.00 1.00 0.50 857301 428124
... (further 3-second interval samples omitted)
Let's run it with the -x parameter to get extended per-device statistics:
[oracle@vm01 udump]$ iostat -x
Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl) 05/04/2007
avg-cpu: %user %nice %sys %iowait %idle
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 8.02 13.87 13.55 7.73 1130.44 172.78 565.22 86.39 61.23 0.34 16.09 4.89 10.41
sdb 2.51 0.00 0.11 0.00 6.22 0.00 3.11 0.00 57.28 0.00 2.89 2.89 0.03
sdc 2.56 0.01 8.72 1.40 276.30 26.56 138.15 13.28 29.94 0.26 25.62 5.41 5.47
sdd 2.59 0.01 7.05 1.40 242.61 26.56 121.30 13.28 31.85 0.23 27.22 6.19 5.23
sde 2.56 0.12 0.55 1.00 19.11 75.21 9.55 37.60 60.91 0.02 15.33 13.81 2.14
sdf 10.18 0.00 16.07 3.45 472.50 21.01 236.25 10.51 25.29 0.06 3.20 3.06 5.97
sdg 10.18 0.00 17.19 3.98 748.88 21.55 374.44 10.77 36.40 0.07 3.30 3.02 6.39
sdh 10.18 0.00 1.01 0.42 87.56 0.69 43.78 0.35 61.40 0.01 7.00 6.97 1.00
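To turn this from a report into an alert, you can let awk watch the extended statistics. The sketch below flags busy devices; the 80% utilization threshold is an arbitrary assumption to tune for your own storage.

#!/bin/sh
# Minimal sketch: take a 10-second iostat -x sample and print any sd*
# device whose %util (last column) exceeds 80 percent.
iostat -x 10 2 | awk '
  /^Device:/ { report++ }            # the 2nd report reflects current load
  report == 2 && $1 ~ /^sd/ && $NF+0 > 80 {
    print $1 " is " $NF "% utilized"
  }'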