Friday, April 22, 2011

Core vs Thread...CPU Utilization - Part 1

What's the Big Deal?

This blog series was motivated for two reasons. First, I want to demonstrate how one can calculate a database server's CPU utilization straight from Oracle v$ views. Second, over the past few years I've seen a couple examples where database servers have exhibited more power then the number of CPU cores could provide, which implies threads are indeed providing some extra power...so some investigation seems warranted.

This is a two part blog entry. This first part will detail how you can calculate the OS CPU utilization straight from v$ views along with the fundamentals of utilization. (Which can be widely applied to performance analysis.) In part two, I will present the results of my experiments contrasting CPU cores versus threads...and of course you'll be able to download and run the experimental scripts yourself!

Threads versus Cores

When you get down to what actually services work, it is a CPU core. Threads and hyper-threads work must be ultimately processed by a core. An exception would be if threads are truly simultaneously processed (and this can be tested). Capacity planners will tell you that threads don't scale. They will tell you that once each thread is occupied, their advantages decline. Unless you have experimental evidence leading you differently, when predicting the CPU power of a database server, the conservative (some say honest and ethical) way to determine the CPU capacity is to use cores, not threads, or hyper-threads. As a simple internet search will demonstrate, there are many published references. I found a great reference that any DBA will understand on Dell's web-site here.

Calculating OS CPU Utilization

On most Oracle systems it is a simple matter to compute the OS CPU utilization based on the 10g view v$osstat. The key to understanding utilization is it is simply requirements divided by capacity...end of story. Requirements can take on many forms such as CPU time consumed, IOPS performed, memory currently held, etc. Capacity can also take on many forms such as available CPU time, sustained IOPS available, physical memory, etc. Over an interval of time, if I know the CPU time consumed and I know the number of CPU cores on the database server, I can easily calculate the average database server CPU utilization over the interval. Here's a basic utilization formula:

U = R / C

Where;

U is utilization
R is requirements
C is capacity

I need to mention that when, referring to a CPU subsystem the capacity is the number of cores multiplied by the interval. For example, over a 1 minute interval a single core box can provide a maximum of 60 seconds of CPU. Over a 2 minute interval a single core box can provide a maximum of 120 seconds of CPU. Over a 2 minute interval a dual core box can provide a maximum of 240 seconds of CPU. I think you get the pattern, which is C = cores X interval. Using a more realistic example, over a one hour interval, a 16 core database server can provide up to 57600 seconds of CPU power; 16 cores X 1 hour X 60 min/hour X 60 sec/min = 57600 core-seconds.

Let's put this together resulting in the CPU utilization. Suppose over a one-hour interval an 8 core database server was busy 10080 seconds. In other words, the CPU subsystem was required to work 10080 seconds, so R = 10080 sec. Over a one-hour interval 8 cores can provide a maximum of 28800 seconds of CPU power; C = 8 cores X  1 hour X 60 min/hour X 60 sec/min = 28800 core-seconds. Therefore, the average CPU utilization over the interval is 0.35 or 35%; 10080 seconds / 28800 seconds.

But where did I get the 10080 seconds figure?

We can get the seconds of CPU consumed directly from v$osstat. The v$osstat view was introduced in Oracle 10g. It contains a dizzying array of confusing statistics that seem to change from platform to platform and from release to release. The statistic busy_time is the total CPU consumed by all operating system processes in hundreds of a seconds, that is, centi-seconds. For example, if the busy time is 123456 then since the operating system (not the Oracle instance) has started, all operating system processes (Oracle and everything else) have consumed 1234.56 seconds of CPU. We are not making a statement about the speed of a CPU, but simply the processes consumed 1234.56 seconds of CPU since the server was last rebooted.

Where do we get the number of CPU cores or threads?

For our purposes today, we want to know the number of CPU cores and sometimes threads. While not perfect I'm sure, if you reference any core related v$osstat statistic that does not contain the word thread, and take the maximum value, that will be the number of CPU cores. I realize this is not real scientific, but it seems to work. (And a quick check using sar, top, or asking the sysadmin also helps build confidence!)

Let's do it!

The pl/sql block below will determine the average OS CPU utilization over a 60 second interval. You can easily change the interval time.

set tab on
set serveroutput on
declare
  interval_sec      integer;
  busy_sec_t0       integer;
  busy_sec_t1       integer;
  cpu_cores_t0      integer;
  cpu_cores_t1      integer;
  cpu_cores_avg     number;
  requirements_sec  number;
  capacity_sec      number;
  utilization       number;
begin
  interval_sec := 60;

  select (a.value/100), b.value
  into   busy_sec_t0, cpu_cores_t0
  from   ( select value from v$osstat where stat_name='BUSY_TIME' ) a,
         ( select value from v$osstat where stat_name='NUM_CPUS' ) b ;

  dbms_lock.sleep(interval_sec);

  select (a.value/100), b.value
  into   busy_sec_t1, cpu_cores_t1
  from   ( select value from v$osstat where stat_name='BUSY_TIME' ) a,
         ( select value from v$osstat where stat_name='NUM_CPUS' ) b ;

  requirements_sec := busy_sec_t1 - busy_sec_t0 ;
  cpu_cores_avg    := (cpu_cores_t1 + cpu_cores_t0) / 2 ;
  capacity_sec     := interval_sec * cpu_cores_avg ;
  utilization      := requirements_sec / capacity_sec ;

  dbms_output.put_line('OS CPU Utilization Calculation...');
  dbms_output.put_line('REQUIREMENTS----------');
  dbms_output.put_line('..busy_sec_t0 (sec)    :'||busy_sec_t0);
  dbms_output.put_line('..busy_sec_t1 (sec)    :'||busy_sec_t1);
  dbms_output.put_line('..requirements (sec)   :'||requirements_sec);
  dbms_output.put_line('CAPACITY----------');
  dbms_output.put_line('..interval (sec)       :'||interval_sec);
  dbms_output.put_line('..CPU cores            :'||cpu_cores_avg);
  dbms_output.put_line('..capacity (sec)       :'||capacity_sec);
  dbms_output.put_line('-------');
  dbms_output.put_line('Utilization (avg)      :'||round(utilization,2));
end;
/
Note: You will need to check your v$osstat CPU core statistic name as there is a good chance it will be different. Also, if you see the statistic avg_busy_time do NOT use that. (How Oracle determines the average time of something over an interval is beyond me.) Instead, always use the busy_time statistic.

On my Linux system the results where:

SQL> /
OS CPU Utilization Calculation...
REQUIREMENTS----------
..busy_sec_t0 (sec)    :307954
..busy_sec_t1 (sec)    :308015
..requirements (sec)   :61
CAPACITY----------
..interval (sec)       :60
..CPU cores            :4
..capacity (sec)       :240
-------
Utilization (avg)      :.25

PL/SQL procedure successfully completed.
This should match the average CPU utilization from an OS command, such as sar, vmstat, or top. Why? Because the v$osstat view pulls its data directly from the virtual filesystem /proc...and so does sar, vmstat, and top. For details, refer to the Operating System Contention chapter's Monitoring CPU Activity section (starting on page 110) in my Oracle Performance Firefighting book.

What's Next?

Most Oracle systems, using CPU cores to determine the OS CPU capacity works perfectly. But over the years, I have seen a few systems where it appears the CPU capacity is actually greater than the number of CPU cores could physically provide (in this universe). This implies threads are providing more power and a better CPU capacity calculation will include a mix of CPU cores and threads. This calls for an experiment...one that we all can run on our Oracle systems. So in part two of this series I will present the tool to gather the data, share some experimental results, and draw some conclusions.

Thanks for reading!

Craig.


P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub .com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. Another option is to send an email to OraPub's general email address, which is currently orapub@comcast.net.

5 comments:

  1. Hi Craig,
    according to yur sql statement it looks like 'NUM_CPUS' is likely to change over time. Isn't the number of cpus supposed to remain a fixed value ? Could you please clarify what i missed here ?
    Thanks
    Olivier

    ReplyDelete
  2. Thank you for sharing this with us.
    Damir Vadas
    www.vadas.hr

    ReplyDelete
  3. Olivier,

    On virtualized servers, the number of CPUs can be increased (or decreased) dynamically. The values in v$osstat will get updated when that happens.

    Ken

    ReplyDelete
  4. In my firm we are using IBM LPAR technology.
    The LPAR is set to shared-CPU, and is configured with entitled capasity 1,00 and maximum capasity 5,00. This means that my virtual node could use more CPU than entitiled capasity if it needs. BUT how do I know what the capasity really is? My virtual node competes with other virtual nodes, and is not garanteed to get what it requests. I guess the whole point with shared CPU is to allocated a maximum capasity for all the virtual nodes that cumulates to more than 100 % of the physical capasity. Such shared configuration really makes it difficult to calculate the utilization. Craig - Do you have any comment on this?

    ReplyDelete
  5. Sorry ... ment to write "shared-SMT".

    ReplyDelete