Thursday, 7 October 2010

Oracle RAC Monitoring

Brief intro

Our last article was about the Extended RAC--a way to spread out our RAC nodes across the city. Whether your RAC is in the building or outside, a smart and prudent DBA always monitors RAC.

As the application is taken into production, it grows and flourishes. That can be a big burden for the DBA if he is not proactively monitoring his database. There could be several causes for this growth: disk usage might go up, network bandwidth might bottleneck, transactions may start taking too long too commit, more users and more aggressive usage. This may be good for the business but the Service Level Agreements (SLAs) still need to be met. Proactive monitoring your Oracle RAC or even a typical single node Oracle database will keep you upwind of problems. However, that means you need to know which tools you should be using to do just that.

What are we monitoring?

The questions below can assist a DBA to optimize his routine administration tasks as well as help management make timely decisions.

1.      Are we meeting our SLA (Service Level Agreements)?

2.      Are the High Availability (HA) objectives being met?

3.      Are all instances sharing the load evenly?

4.      What is the interconnect load (latency/overhead)?

5.      CPU: Are Oracle processes getting enough resources?

6.      Memory: How is the memory for System Global Area (SGA) etc?

Questions similar to these, broken up to monitor at all levels (Application, Database, OS, and HW), help a DBA to monitor the RAC environment efficiently and effectively.

At hardware level: Disks, HBAs, NICs, cabling, backup devices etc. need to function properly. All these devices need to be configured properly.

At OS level: You need to monitor the CPU, Memory, Disk performance and Network traffic.

At Database level: You have to monitor all the cluster logs, event logs, asm logs and rdbms logs at your database level.

Application Level:

At Application level, we need to carefully sprinkle the monitoring code (for instance, if the application server were a Tomcat Server or JBoss, then you would be interested in all of the Catalina logs, java.util logs or log4J logging, etc). There are tools, which are more professional, like BIRT, which can be employed to monitor your application's performance.

The Toolkit

For the OS you will need the following tools:

Checking I/O stat

Let's quickly check if our cluster is online. In our test/development scenario, I changed the disk configuration from RAID5 to RAID 0 for optimal performance of my RAC nodes. If you have an enterprise version of ESX, you can put all of the VMs on SAN and not have to bother with the disk configurations, as you will not need DAS (Direct Attached Storage).

Upon restarting, do the cluster check.

[oracle@vm01 bin]$ crs_stat -t

Name Type Target State Host

------------------------------------------------------------

ora.esxrac.db application ONLINE ONLINE vm02

ora....c1.inst application ONLINE ONLINE vm01

ora....c2.inst application ONLINE ONLINE vm02

ora....serv.cs application ONLINE ONLINE vm02

ora....ac1.srv application ONLINE ONLINE vm01

ora....ac2.srv application ONLINE ONLINE vm02

ora....SM1.asm application ONLINE ONLINE vm01

ora....01.lsnr application ONLINE ONLINE vm01

ora.vm01.gsd application ONLINE ONLINE vm01

ora.vm01.ons application ONLINE ONLINE vm01

ora.vm01.vip application ONLINE ONLINE vm01

ora....SM2.asm application ONLINE ONLINE vm02

ora....02.lsnr application ONLINE ONLINE vm02

ora.vm02.gsd application ONLINE ONLINE vm02

ora.vm02.ons application ONLINE ONLINE vm02

ora.vm02.vip application ONLINE ONLINE vm02

Then do the iostat at intervals of 3 seconds. To see the full manual for this excellent tool, type man iostat.

[oracle@vm01 udump]$ iostat --help

sysstat version 5.0.5

(C) Sebastien Godard

Usage: iostat [ options... ] [ <interval> [ <count> ] ]

Options are:

[ -c | -d ] [ -k ] [ -t ] [ -V ] [ -x ]

[ { <device> [ ... ] | ALL } ] [ -p [ { <device> | ALL } ] ]

[oracle@vm01 udump]$ iostat

Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl) 05/03/2007

avg-cpu: %user %nice %sys %iowait %idle

1.49 0.07 5.81 2.32 90.30

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 2.18 2.60 76.77 2220396 65637664

sda1 9.65 2.60 76.76 2219427 65637656

sdb 0.00 0.00 0.00 2635 0

sdb1 0.00 0.00 0.00 1186 0

sdc 2.75 44.07 33.41 37681732 28563403

sdc1 2.76 44.07 33.41 37678487 28563403

sdd 1.71 10.44 33.41 8926315 28563403

sdd1 1.73 10.44 33.41 8923070 28563403

sde 1.12 1.26 22.43 1079253 19182446

sde1 1.13 1.26 22.43 1076008 19182446

sdf 4.31 502.78 4.18 429902055 3570958

sdf1 4.31 502.78 4.18 429900486 3570958

sdg 7.24 1004.57 5.15 858957930 4407357

sdg1 7.24 1004.57 5.15 858956361 4407357

sdh 1.00 1.00 0.50 858776 428077

sdh1 1.00 1.00 0.50 857207 428077

[oracle@vm01 udump]$ iostat -t 3

Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl) 05/03/2007

Time: 02:37:19 PM

avg-cpu: %user %nice %sys %iowait %idle

1.49 0.07 5.81 2.32 90.30

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 2.18 2.60 76.77 2220396 65645200

sda1 9.65 2.60 76.77 2219427 65645192

sdb 0.00 0.00 0.00 2635 0

sdb1 0.00 0.00 0.00 1186 0

sdc 2.75 44.07 33.40 37685412 28565854

sdc1 2.76 44.07 33.40 37682167 28565854

sdd 1.71 10.44 33.40 8926763 28565854

sdd1 1.73 10.44 33.40 8923518 28565854

sde 1.12 1.26 22.43 1079381 19183793

sde1 1.13 1.26 22.43 1076136 19183793

sdf 4.31 502.78 4.18 429949905 3571674

sdf1 4.31 502.78 4.18 429948336 3571674

sdg 7.24 1004.57 5.15 859052465 4408164

sdg1 7.24 1004.57 5.15 859050896 4408164

sdh 1.00 1.00 0.50 858870 428124

sdh1 1.00 1.00 0.50 857301 428124

Time: 02:37:22 PM

avg-cpu: %user %nice %sys %iowait %idle

1.00 0.08 6.91 1.00 91.01

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 0.00 0.00 0.00 0 0

sda1 0.00 0.00 0.00 0 0

sdb 0.00 0.00 0.00 0 0

sdb1 0.00 0.00 0.00 0 0

sdc 2.00 42.67 13.33 128 40

sdc1 2.00 42.67 13.33 128 40

sdd 0.67 0.00 13.33 0 40

sdd1 0.67 0.00 13.33 0 40

sde 0.67 0.00 13.33 0 40

sde1 0.67 0.00 13.33 0 40

sdf 3.33 343.00 1.00 1029 3

sdf1 3.00 343.00 1.00 1029 3

sdg 6.33 856.33 2.67 2569 8

sdg1 6.67 856.33 2.67 2569 8

sdh 1.33 1.33 0.67 4 2

sdh1 1.33 1.33 0.67 4 2

Time: 02:37:25 PM

avg-cpu: %user %nice %sys %iowait %idle

0.83 0.08 5.67 1.75 91.67

Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

sda 0.66 0.00 74.42 0 224

sda1 9.30 0.00 74.42 0 224

sdb 0.00 0.00 0.00 0 0

sdb1 0.00 0.00 0.00 0 0

sdc 1.33 21.26 13.29 64 40

sdc1 1.33 21.26 13.29 64 40

sdd 0.66 0.00 13.29 0 40

sdd1 0.66 0.00 13.29 0 40

sde 0.66 0.00 13.29 0 40

sde1 0.66 0.00 13.29 0 40

sdf 4.32 512.62 1.66 1543 5

sdf1 4.32 512.62 1.66 1543 5

sdg 6.64 1023.26 1.99 3080 6

sdg1 6.31 1023.26 1.99 3080 6

sdh 0.66 0.66 0.33 2 1

sdh1 0.66 0.66 0.33 2 1

Time: 02:37:28 PM

Here sda, sdb are the files for OS installation and swap. Sdc, sdd and sde are the files used for OCR, Votingdisk and Spfileasm respectively. Sdf and Sdg are the files we chose for oradata (where all of our Oracle data files are residing) and sdh is for the flashback recovery. You can clearly see that iowait is considerably low, which is a good thing--had it been higher you would be looking at an I/O bottleneck. On the disks (devices section) you can clearly see that our cluster is doing fine but the oradata disks are working hard (and rightly so!). That is why I explained earlier that I have optimized my test scenario on the DAS to have more advantage on the spindle speed, seek time and throughput. The oradata files are on a separate disk. After getting your disks and data all evenly spread out, you can use the –x parameter to get additional useful information, such as average request size, average wait time for requests and average service time for requests.

Let's run it with –x parameter.

[oracle@vm01 ~]$ iostat -x

Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl) 05/04/2007

avg-cpu: %user %nice %sys %iowait %idle

2.86 0.00 14.76 5.79 76.59

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util

sda 8.02 13.87 13.55 7.73 1130.44 172.78 565.22 86.39 61.23 0.34 16.09 4.89 10.41

sdb 2.51 0.00 0.11 0.00 6.22 0.00 3.11 0.00 57.28 0.00 2.89 2.89 0.03

sdc 2.56 0.01 8.72 1.40 276.30 26.56 138.15 13.28 29.94 0.26 25.62 5.41 5.47

sdd 2.59 0.01 7.05 1.40 242.61 26.56 121.30 13.28 31.85 0.23 27.22 6.19 5.23

sde 2.56 0.12 0.55 1.00 19.11 75.21 9.55 37.60 60.91 0.02 15.33 13.81 2.14

sdf 10.18 0.00 16.07 3.45 472.50 21.01 236.25 10.51 25.29 0.06 3.20 3.06 5.97

sdg 10.18 0.00 17.19 3.98 748.88 21.55 374.44 10.77 36.40 0.07 3.30 3.02 6.39

sdh 10.18 0.00 1.01 0.42 87.56 0.69 43.78 0.35 61.40 0.01 7.00 6.97 1.00

Conclusion

As you can see, all you need is a good toolkit to monitor your system. We took a brief look at the iostat tool and in the upcoming articles we will take a more detailed look at other tools. There is also a utility called OSwatcher (downloadable at Metalink), which lumps all of the tools together so you can run them together.