Thursday, February 28, 2013

How many CPU cores do I really have?


The operating system statistics view v$osstat can be misleading with regard to CPU cores. Not that the information is incorrect; it's, well... let's say troubling. If I ask ten people to email a sample AWR report, I'm likely to see CPU core-like statistics such as CPU_SOCKETS, NUM_CPUS, VCPU, LCPU, CPU_THREADS, and probably a variety of other names. Wow… what a mess!

But I'd still like to know because it's important for my work. For two reasons:

First, it helps me to understand how high the CPU utilization can go before performance starts to degrade. Based on queuing theory, the more processes a system can service simultaneously, the higher the average utilization it can sustain before performance begins to degrade. I write about this in the Operating System section in my Oracle Performance Firefighting book.

Second, I always check what the math and my observations indicate against what the OS administrator and OS commands (such as vmstat, sar, top) tell me. Paranoid perhaps, but doing Oracle work for 20-plus years has taught me a few things...

Call It a Server, Not a Core, Lcpu, Thread, etc.


To avoid the entire discussion about which provides the processing power, core or thread, let's simply call the unit of processing power a "server." Why? Two reasons. First, because it provides CPU service to processes, so it truly is a "server." Second, that's what capacity planners call something that services transactions: a server. In fact, its symbol is M (capital "m").

By the way, it is very easy to determine, on your system, what provides the true CPU processing power (cores, threads, or something else). I blogged about this in June of 2011.

So the question is, how many "servers" does your database host contain? That's what this posting is all about.

If you recall from my previous posting, I demonstrated two ways to calculate CPU utilization. Both follow the classic formula: requirements divided by capacity. But the capacity is where the two approaches differ.

Capacity Calculation Using "servers"


With "servers," the capacity is simply the number of servers multiplied by the snapshot interval. So a 2-server (think two cores) host over a 60-minute period can provide a maximum of 120 minutes, or 7200 seconds, of CPU power.

Here's the utilization formula using the capacity approach:

U = R / C

where:

R = CPU consumption over the interval (seconds)
C = CPU "servers" X interval (seconds)

For example, looking at a real AWR report, over a 60 minute interval, the AWR's Operating System Statistics show a BUSY_TIME of 1913617, an IDLE_TIME of 7159367, and NUM_CPUS of 24. BUSY_TIME and IDLE_TIME are reported in centiseconds, so the busy time is 19136.17 seconds.

Therefore, the average CPU utilization over the interval is:

U = 19136.17 / ( 24 * 60 * 60 ) = 0.221 = 22%
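The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name and its arguments are made up for this example and do not come from any Oracle tool. It assumes BUSY_TIME is the AWR delta in centiseconds.

```python
def utilization_from_servers(busy_time_cs: float, servers: int, interval_s: float) -> float:
    """Average CPU utilization, U = R / C, with C = servers * interval."""
    busy_s = busy_time_cs / 100.0      # centiseconds -> seconds
    capacity_s = servers * interval_s  # CPU seconds the host could have supplied
    return busy_s / capacity_s


# Values from the AWR example above: 24 "servers" over a 60 minute interval.
u = utilization_from_servers(1913617, servers=24, interval_s=60 * 60)
print(f"{u:.3f}")  # 0.221
```

The only place the number of "servers" enters is the capacity line, which is why a wrong CPU count skews the result directly.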

Capacity Calculation Using Busy and Idle Time


In my previous posting I introduced using only v$osstat's BUSY_TIME and IDLE_TIME values to calculate the average CPU utilization over the snapshot interval. Here's the formula:

U = R / C = BUSY_TIME / ( BUSY_TIME + IDLE_TIME )

Using the numbers from the example above:

U = 1913617 / ( 1913617 + 7159367 ) = 0.211 = 21%

Yes, the two utilization results don't match perfectly, but they are very close… close enough.
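Here is the busy and idle time version as a small Python sketch (the function name is hypothetical). It also shows that the unit of time does not matter, as long as both values use the same one.

```python
import math


def utilization_from_busy_idle(busy_time: float, idle_time: float) -> float:
    """Average CPU utilization, U = BUSY_TIME / (BUSY_TIME + IDLE_TIME)."""
    return busy_time / (busy_time + idle_time)


# Raw centisecond values from the AWR example above.
u_cs = utilization_from_busy_idle(1913617, 7159367)
print(f"{u_cs:.3f}")  # 0.211

# Same values converted to seconds: the units cancel, so the answer is the same.
u_s = utilization_from_busy_idle(19136.17, 71593.67)
print(math.isclose(u_cs, u_s))  # True
```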

Calculating the Number of "Servers"


Notice that in the busy and idle time capacity calculation there is no reference to the number of servers. Suppose you don't trust the v$osstat CPU core-like statistics or are simply not sure which one is important. In other words, you want to understand the effective number of CPU "servers." Using the two utilization formulas and some algebra we can figure this out!

Making sure to use the same unit of time, here are two capacity calculations:

C = servers * interval
C = busy_time + idle_time

Let's put them together and solve for "servers".

servers * interval = busy_time + idle_time

servers = ( busy_time + idle_time ) / interval

OK… but does this really work? Let's give it a try! (I'm going to use seconds as my unit of time.)

effective servers = ( 19136.17 + 71593.67 ) / ( 60 * 60 ) = 25.2

The math tells us that based on the collected data, on average the system is operating with effectively 25 "servers." I know in this situation there are physically 24 CPU cores, so we're pretty close.
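A minimal Python sketch of the "effective servers" formula follows. The function name is made up for illustration, and it assumes the busy and idle times are AWR deltas in centiseconds and the interval is in seconds.

```python
def effective_servers(busy_time_cs: float, idle_time_cs: float, interval_s: float) -> float:
    """servers = (busy_time + idle_time) / interval, all in seconds."""
    capacity_s = (busy_time_cs + idle_time_cs) / 100.0  # centiseconds -> seconds
    return capacity_s / interval_s


# Values from the AWR example above, over a 60 minute interval.
m = effective_servers(1913617, 7159367, interval_s=60 * 60)
print(f"{m:.1f}")  # 25.2
```

The result is only as good as the interval you plug in, so use the exact elapsed time from the report when you have it rather than a rounded "about an hour."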

What to Do With AIX


While this "effective servers" formula has proven its worth on many systems, I still find that it often does not work well in an AIX environment. Sometimes it does, but not always. So do the math and compare it with vmstat or some other AIX-based tool.

The Take-Aways


The big one:

servers = ( busy_time + idle_time ) / interval

Personally, I never initially trust the CPU-count statistics in v$osstat. I always run a simple OS command like top or sar, or do a "cat /proc/stat", and I casually check with the OS administrator. You don't want to be thinking and working with 12 "servers" when the administrator is thinking 24 "servers."

For me, knowing the number of CPU "servers" is important. And since I never blindly trust the v$osstat CPU statistics, this is a very fast and reliable way (so far at least) to check my work.

Thanks for reading!

Craig.


If you enjoy my blog, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting,  Advanced Oracle Performance Analysis, and my one-day Oracle Performance Research Seminar. I teach these classes around the world multiple times each year. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond to a comment or you have a question, please feel free to email me directly at craig@orapub .com. Another option is to send an email to OraPub's general email address, which is currently orapub.general@comcast .net. 





Thursday, February 14, 2013

Simple Way to Calculate OS CPU Utilization

Another (Simpler) Way to Calculate CPU Utilization


Back in April of 2011, I blogged about how to calculate the operating system CPU utilization from data in a Statspack or AWR report. All that is necessary is the report snapshot interval and a couple of columns from the Operating System Statistics view, v$osstat.

Utilization Made Easy


It's pretty simple actually. Utilization is simply requirements divided by capacity. If you have a one-cup container holding 1/2 cup of water, the container is 50% full (or busy, or utilized). Here is the basic utilization formula:

U = R / C

Where:

U is utilization
R is requirements
C is capacity

Requirements


We can get the seconds of CPU consumed directly from v$osstat, which is shown in AWR and Statspack reports as Operating System Statistics. The v$osstat view was introduced in Oracle 10g. It contains a dizzying array of sometimes confusing statistics that seem to change from platform to platform and from release to release.

The statistic BUSY_TIME is the total CPU consumed by all operating system processes in hundredths of a second, that is, centiseconds. For example, if the busy time is 123456, then since the operating system (not the Oracle instance) was started, all operating system processes (Oracle and everything else) have consumed 1234.56 seconds of CPU. We are not making a statement about the speed of a CPU, but simply that the processes consumed 1234.56 seconds of CPU since the server was last rebooted.

I'm sorry, but I just lied a little. I implied that if there are multiple databases on the same box, they will ALL have the same v$osstat values for BUSY_TIME. In one of my Oracle Performance Firefighting classes (where we discuss this kind of thing), we decided to check this out. It turns out the BUSY_TIME statistic values were different. The good news is the BUSY_TIME shown in an AWR/Statspack report is correct and can be used to calculate the CPU utilization because it shows the delta; the ending BUSY_TIME minus the beginning BUSY_TIME. Sorry about that little lie.
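To make the delta idea concrete, here is a small Python sketch. The function name and the two snapshot values are hypothetical; the values were picked so the delta matches the 123456 example above. The guard against a counter that goes backwards is my own assumption about what a reboot between snapshots would look like.

```python
def busy_seconds_between(begin_busy_cs: int, end_busy_cs: int) -> float:
    """CPU seconds consumed between two readings of the cumulative BUSY_TIME."""
    if end_busy_cs < begin_busy_cs:
        # Assumption: a smaller ending value means the counter was reset.
        raise ValueError("BUSY_TIME went backwards between snapshots")
    return (end_busy_cs - begin_busy_cs) / 100.0  # centiseconds -> seconds


# Hypothetical beginning and ending snapshot values.
begin_cs, end_cs = 1_000_000, 1_123_456
print(busy_seconds_between(begin_cs, end_cs))  # 1234.56
```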

Now that we have discussed the requirements, we need to tackle the other part of the utilization equation, that is, capacity.

Capacity


Over a 1 minute interval a single core box can provide a maximum of 60 seconds of CPU. Over a 2 minute interval a single core box can provide a maximum of 120 seconds of CPU. Over a 2 minute interval a dual core box can provide a maximum of 240 seconds of CPU. This "can provide a maximum" is capacity.

I think you get the pattern, which is C = cores X interval. Using a more realistic example, over a one hour interval, a 16 core database server can provide up to 57600 seconds of CPU power: 16 cores X 1 hour X 60 min/hour X 60 sec/min = 57600 core-seconds.
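The pattern is easy to check with a few lines of Python. This is just a sketch of C = cores X interval, with a made-up function name, run against the four examples above.

```python
def capacity_seconds(cores: int, interval_s: int) -> int:
    """Maximum CPU seconds a box can provide: C = cores * interval."""
    return cores * interval_s


# (cores, interval in minutes) for the four examples in the text.
examples = [(1, 1), (1, 2), (2, 2), (16, 60)]
capacities = [capacity_seconds(cores, minutes * 60) for cores, minutes in examples]
print(capacities)  # [60, 120, 240, 57600]
```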

The Bad News


However, there can be a significant barrier on any host with virtual machine activity, and especially on AIX. Add to this that in v$osstat the CPU power units could be named CPU_SOCKETS, NUM_CPUS, VCPU, LCPU, CPU_THREADS, and probably a variety of other names. This makes creating a script to calculate a host's CPU utilization difficult and always suspect. This is especially true if you want one script to run on multiple platforms: AIX, Linux, Windows, and HP.

A Simpler Utilization Calculation


The solution comes with an understanding that capacity always equals the busy time plus the idle time. That is, the requirements plus the idle time… what was used plus what was left over. With this in mind, another just as correct utilization formula is:

U = R / C

R = requirements = v$osstat.BUSY_TIME
C = capacity = v$osstat.BUSY_TIME + v$osstat.IDLE_TIME

Another plus is we can use the values straight from v$osstat view without remembering their unit of time! They will simply cancel each other out.

Here is a real example from an AIX box. The AWR report's Operating System Statistics area (which is based on v$osstat) shows the BUSY_TIME and IDLE_TIME to be 346028 and 11250450 respectively.

Therefore, regardless of the units of time, the number of CPUs, CPU cores, CPU sockets, threads or hyper-threads, virtual CPUs, logical CPUs, and even the AWR report's snapshot interval (breathe), we can easily calculate the average CPU utilization over a snapshot interval. Like this (ref: bob awrrpt_1_40757_40758 ):

U = R / C

R = BUSY_TIME = 3042449
C = BUSY_TIME + IDLE_TIME = 3042449 + 2644832 = 5687281
U = R / C = 3042449 / 5687281 = 0.535 = 54%

But was the utilization really 54%? Yes. And did this match with standard operating system monitoring tools? Yes.

Checking Our Work With the CPU Core Formula


I can check the calculation because, in this box, I know there are 16 physical CPU cores and I know the AWR snapshot interval is about 60 minutes. Therefore the CPU capacity is 57600 seconds (57600=16*60*60). Using the core based utilization formula:

U = R / C = ( 3042449 / 100 ) / ( 16 * 60 * 60 ) = 30424.49 / 57600 = 0.528 = 53%

Not bad, eh? I have checked this many times and it works wonderfully and reliably…except on AIX! I have looked at a number of AIX examples, and sometimes our calculated utilization does not match what the operating system shows… but sometimes it does.
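If you want to automate this cross-check, a sketch like the following would do. The function name is hypothetical, and the 2-point tolerance is an arbitrary assumption of mine, not a published threshold. A gap larger than the tolerance is the kind of mismatch I'd go look at with OS tools.

```python
def cross_check(busy_cs, idle_cs, cores, interval_s, tolerance=0.02):
    """Compare the busy/idle utilization with the core based utilization."""
    u_busy_idle = busy_cs / (busy_cs + idle_cs)
    u_cores = (busy_cs / 100.0) / (cores * interval_s)  # centiseconds -> seconds
    agree = abs(u_busy_idle - u_cores) <= tolerance     # tolerance is an assumption
    return u_busy_idle, u_cores, agree


# Values from the 16 core, 60 minute example above.
u1, u2, ok = cross_check(3042449, 2644832, cores=16, interval_s=60 * 60)
print(f"busy/idle: {u1:.3f}  cores: {u2:.3f}  agree: {ok}")
# busy/idle: 0.535  cores: 0.528  agree: True
```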

Eye-Balling the Utilization


This is powerful. Usually when calculating the CPU utilization, you have to, well… calculate it. That can be a pain, takes a minute, and is easy to get wrong. But when using the busy and idle time, you can instantly get a general idea of the utilization, or at least know if it is likely to be a factor in performance. I'll show you what I mean.

Consider this Actual Situation


(ref: awr_prod13.htm)

Over a 60 minute interval, the AWR's Operating System Statistics show a BUSY_TIME of 1,913,617, an IDLE_TIME of 7,159,367, and NUM_CPUS of 24. What makes this even easier is that the AWR report displays the BUSY_TIME directly above the IDLE_TIME. Visually I can easily reduce the BUSY_TIME to 2 and the IDLE_TIME to 7.

Immediately, I know the average utilization is much less than 50% and therefore a CPU bottleneck is extremely unlikely. Want more precision? No problem: divide 2 by 2 + 7, that is, 2/9, which is about 22%. Again, a CPU bottleneck is extremely unlikely. Very cool!

As a side note, just in case you're wondering: doing the math using the busy and idle time method, the average utilization is 21% (1913617/(1913617+7159367)), and using the CPU based formula, the average utilization is 22% (19136.17/(24*60*60)).
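For fun, here is the eyeballing trick as a Python sketch. It rounds both values to single digits on the scale of the larger one and then applies the same busy / (busy + idle) formula to those digits. The function name is made up, and it assumes at least one of the two values is positive.

```python
import math


def eyeball_utilization(busy: int, idle: int) -> float:
    """Rough utilization from the leading digits of BUSY_TIME and IDLE_TIME."""
    largest = max(busy, idle)
    if largest <= 0:
        raise ValueError("need at least one positive value")
    scale = 10 ** math.floor(math.log10(largest))  # e.g. 1,000,000 for 7,159,367
    b = round(busy / scale)                        # 1,913,617 -> 2
    i = round(idle / scale)                        # 7,159,367 -> 7
    return b / (b + i)


rough = eyeball_utilization(1913617, 7159367)
exact = 1913617 / (1913617 + 7159367)
print(f"rough: {rough:.2f}  exact: {exact:.2f}")  # rough: 0.22  exact: 0.21
```

The rough answer lands within a point or so of the exact one here, which is plenty to decide whether a CPU bottleneck is even worth considering.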

The Take-Aways


  1. For me, "eyeballing" the utilization situation is priceless, and with just a little more effort (typing 2/(2+7)) I can get a quick quantitative answer.
  2. Getting the number of CPU cores is not dependable on many systems and requires a quick check with the OS and the OS administrator. The busy and idle time method removes the need to know the CPU core count and to convert all times to the same units.
  3. What continues to be frustrating is the AIX situation. Perhaps one day I'll be able to reliably calculate CPU utilization, but I'm not holding my breath. However, I am working with two colleagues on a solution!

Thanks for reading!

Craig.

