Thursday, February 14, 2013

Simple Way to Calculate OS CPU Utilization

Another (Simpler) Way to Calculate CPU Utilization


Back in April of 2011, I blogged about how to calculate the operating system CPU utilization from data in a Statspack or AWR report. All that is necessary is the report snapshot interval and a couple of columns from the Operating System Statistics view, v$osstat.

Utilization Made Easy


It's pretty simple actually. Utilization is simply requirements divided by capacity. If you have one cup that contains 1/2 cup of water, the cup is 50% full/busy/utilization/etc. Here is the basic utilization formula:

U = R / C

Where;

U is utilization
R is requirements
C is capacity

Requirements


We can get the seconds of CPU consumed directly from v$osstat which is shown in an AWR and Statspack report as, Operating System Statistics. The v$osstat view was introduced in Oracle 10g. It contains a dizzying array of sometimes confusing statistics that seem to change from platform to platform and from release to release.

The statistic BUSY_TIME is the total CPU consumed by all operating system processes in hundreds of a seconds, that is, centi-seconds. For example, if the busy time is 123456 then since the operating system (not the Oracle instance) has started, all operating system processes (Oracle and everything else) have consumed 1234.56 seconds of CPU. We are not making a statement about the speed of a CPU, but simply the processes consumed 1234.56 seconds of CPU since the server was last rebooted.

I'm sorry, but I just lied a little. I implied that if there are multiple databases on the same box, they will ALL have the same v$osstat values for BUSY_TIME. In one of my Oracle Performance Firefighting classes (where we discuss this kind of thing), we decided to check this out. It turns out the BUSY_TIME statistic values were different. The good news is the BUSY_TIME shown in an AWR/Statspack report is correct and can be used to calculate the CPU utilization because it shows the delta; the ending BUSY_TIME minus the beginning BUSY_TIME. Sorry about that little lie.

Now that we have discussed the requirements, we need to tackle the other part of the utilization equation, that is, capacity.

Capacity


Over a 1 minute interval a single core box can provide a maximum of 60 seconds of CPU. Over a 2 minute interval a single core box can provide a maximum of 120 seconds of CPU. Over a 2 minute interval a dual core box can provide a maximum of 240 seconds of CPU. This, "can provide a maximum" is capacity.

I think you get the pattern, which is C = cores X interval. Using a more realistic example, over a one hour interval, a 16 core database server can provide up to 57600 seconds of CPU power; 16 cores X 1 hour X 60 min/hour X 60 sec/min = 57600 core-seconds.

The Bad News


However, there can be a significant barrier in any host with virtual machine activity and especially with AIX. Add to this, in v$osstat the naming of CPU power units could be CPU_SOCKETS, NUM_CPUS, VCPU, LCPU, CPU_THREADS, and probably a variety of other names. This makes creating a script to calculate a host's CPU utilization difficult and always suspect. This is especially true if you want one script to run on multiple platforms; AIX, Linux, Windows, and HP.

A Simpler Utilization Calculation


The solution comes with an understanding that capacity always equals the busy time plus the idle time. That is, the requirements plus the idle time… what was used plus what was left over. With this in mind, another just as correct utilization formula is:

U = R / C

R = requirements = v$osstat.BUSY_TIME
C = capacity = v$ostat.BUSY_TIME + v$ostat.IDLE_TIME

Another plus is we can use the values straight from v$osstat view without remembering their unit of time! They will simply cancel each other out.

Here is a real example from an AIX box. The AWR report's Operating System Statistics area (which is based on v$osstat) shows the BUSY_TIME and IDLE_TIME to be 346028 and 11250450 respectively.

Therefore, regardless of the units of time, the number of CPUs, CPU cores, CPU sockets, threads or hyper-threads, virtual CPUs, logical CPUs, and even the AWR report's snapshot interval (breath) we can easily calculate the average CPU utilization over a snapshot interval. Like this (ref: bob awrrpt_1_40757_40758 ):

U = R / C

R = BUSY_TIME = 3042449
C = BUSY_TIME + IDLE_TIME = 3042449 + 2644832 = 5687281
U = R / C = 3042449 / 5687281 = 0.535 = 54%

But was the utilization really 54%? Yes. And did this match with standard operating system monitoring tools? Yes.

Checking Our Work With the CPU Core Formula


I can check the calculation because, in this box, I know there are 16 physical CPU cores and I know the AWR snapshot interval is about 60 minutes. Therefore the CPU capacity is 57600 seconds (57600=16*60*60). Using the core based utilization formula:

U = R / C = ( 3042449/100) / (16*60*60) = 30424 / 57600 = 0.528 = 53%

Not bad, eh? I have checked this many times and it works wonderfully and reliably…except on AIX! I have looked at a number of AIX examples, and sometimes our calculated utilization does not match what the operating system shows… but sometimes it does.

Eye-Balling the Utilization


This is powerful. Usually when calculating the CPU utilization, you have to well… calculate it. Which can be a pain, takes a minute, and it's easy to make a mistake. But when using the busy and idle time, you can instantly get a general idea of the utilization. Or least know if it is likely to be a factor in performance. I'll show you what I mean.

Consider this Actual Situation


(ref: awr_prod13.htm)

Over a 60 minute interval, the AWR's Operating System Statistics show show a BUSY_IIME of 1,913,617 an IDLE_TIME of 7,159,367 and the NUM_CPUS of 24. What makes this even easier is the AWR report displays the BUSY_TIME directly above the IDLE_TIME. Visually I can easily reduce the BUSY_TIME to 2 and the IDLE_TIME to 7.

Immediately, I know the average utilization is much less than 50% and therefore a CPU bottleneck is extremely unlikely. Want more precision, no problem: divide 2 by 7, which is about 30%. Again, the likelihood of a CPU bottleneck would be extremely unlikely. Very cool!

As a side note, just in case your wondering doing the math using the busy and idle time method the average utilization is 21% : (1913617/(1913617+7159367)) and using the CPU based formula, the average utilization is 22% : (19136.17/(24*60*60)).

The Take-Aways


  1. For me, "eye balling" the utilization situation is priceless and with just a little more effort (typing 2/7 ) I can get a quick quantitative answer.
  2. Getting the number of CPU cores is not dependable on many systems and requires a quick check with the OS and the OS administrator. The busy and idle time method completely negates the CPU core requirements and converting all time to the same units.
  3. What continues to be frustrating is the AIX situation. Perhaps one day I'll be able to reliably calculate CPU utilization, but I'm not holding my breath. However, I am working with two colleagues on a solution!

Thanks for reading!

Craig.

3 comments:

  1. How do you know that IDLE_TIME in V$OSSTAT is not measuring "idle thread time" and thus your utilization is really "thread utilization" and not "core utilization" and therefore not a very good indicator of CPU-bound state?

    ReplyDelete
  2. As long as both the IDLE_TIME and BUSY_TIME are the same units of time we'll still be ok because the unit of time will cancel out leaving simply business, that is, utilization. But regardless, I always check my calculations with an OS tool and admin; especially if it is key to the performance analysis.

    ReplyDelete
  3. Should you add IOWAIT_TIME to BUSY_TIME to accommodate for I/O level activity?

    ReplyDelete