Tuesday, February 10, 2015

What Is That Light-Green Oracle Database CPU Wait Time?

This page has been permenately moved. Please CLICK HERE to be redirected.

Thanks, Craig.


What Really Is That Light-Green Oracle Database CPU Wait Time?



Have you ever wondered what that light-green "cpu wait time" really means in Oracle Enterprise Manager? It's what I call, the "gap" time. The "gap" time is the "missing" or the "leftover" time when DB Time does not equal the DB CPU (foreground process CPU consumption) plus the non-idle wait time. And, it happens more often than you might think.

If you have ever noticed that the database time seems too large, then you need to read this article. And, if you really want to know what the light-green "cpu wait time" in your OEM charts is, then you need to read this article. It's that good.

If you're serious about Oracle performance tuning and analysis, you'll want to know I just posted my complete 2015 public training schedule. It's on the main OraPub.com page HERE. Remember, alumni receive a 50% discount...crazy I know.

My Experiment Shows...


My experiment shows a strong relationship between the "gap" time and operating system CPU utilization. This means that a significant portion of the "gap" time is Oracle foreground processes sitting in the CPU run queue ready to consume CPU. This CPU run queue time is not part of DB CPU but it part of DB Time. So, when the CPU run queue time increases, so does DB Time and so does the "gap" time. And I have the data to show it! And you can run the same experiment yourself.

Let me put this another way. Most of the DB Time "gap" is Oracle foreground processes waiting in the operating system CPU run queue so they can eventually and truly consume CPU.

This is really important: When an Oracle foreground process is not consuming CPU but is sitting in the CPU run queue, Oracle Active Session History (ASH) facility records the session sample state as "CPU" and if the Oracle process is a foreground process (not a background process) Oracle's time model records this time as DB Time but not DB CPU. So in both the ASH and time model cases, someone must do some math to calculate this "cpu wait time".

But that name... "cpu wait"!

CPU Wait Time Is A Lousy Name


"CPU wait time" is a lousy name. Why? Mainly because it has caused lots of confusion and speculation. The name would be more appropriately called something like, "cpu queue time." Three reasons come to mind.

First, wait time means something special to Oracle DBAs. To an Oracle DBA anything associate with a "wait" should have a wait event name, a wait occurance, the time should be instrumented (i.e., measured) and should be recorded in the many wait interface related views, such as v$system_event or v$session.

Second, from an Oracle perspective the process is truly "on cpu" because the process is not "waiting." Remember, an Oracle session is either in one of two states; CPU or WAIT. There is no third choice. So the words "CPU Wait" are really confusing.

Third, from an OS perspective or simply a non-Oracle perspective, the Oracle process is sitting in the CPU run queue.

I'm sure in some Oracle Corporation meeting the words "cpu wait" were considered a great idea, but it has caused lots of confusion. And I'm sure it's here to stay.

What Does This "CPU WAIT" Look Like In OEM?


In OEM, the "cpu wait" is a light green color. I grabbed a publically available screenshot off the internet and posted it below. Look familiar? 


OK, so it's really easy to spot in OEM. And if you've seen it before you know EXACTLY what I'm referring to.

What Is CPU Wait Time?


First, let's review what we do know.

1. DB CPU is true Oracle foreground process CPU consumption as reported by the OS through a system call, such as getrusage.

2. CPU Wait time is derived, that is, somebody at Oracle wrote code to calculate the "cpu wait" time.

3. CPU Wait time is a lousy name because it causes lots of confusion.

4. CPU Wait time is shown in OEM as a light green color. DB CPU is shown as a dark/normal green color.

Second, I need to define what I'll call the DB Time "gap." This is not error and I am not implying something is wrong with database time, that it's not useful or anything like that. All I am saying is that sometimes DB Time does not equal DB CPU plus the non-idle wait time. Let's put that in a formula:

DB Time = DB CPU + non Idle Wait Time + gap

Really, What Is CPU Wait Time?


Now I'm ready to answer the question, "What is CPU WAIT time?" Here is the answer stated multiple ways.

"CPU Wait" time is Oracle foreground process OS CPU run queue time.

I ran an experiment (detailed below) and as the OS CPU utilization increased, so did the DB Time "gap" implying that the gap is CPU run queue time or at least a significant part of it.

I ran an experiment and there was a strong correlation between OS CPU utilization and the DB Time "gap" implying that the gap is CPU run queue time.

I ran an experiment and using queuing theory I was able to predict the "gap" time implying that the gap is CPU run queue time. (Whoops... sorry. That's what I'll present in my next post!)

So I'm very comfortable stating that when DB Time is greater than Oracle process CPU consumption plus the non-idle wait time, it's probably the result of Oracle foreground process CPU run queue time.

Yes, there could be some math problems on Oracle's side, there could be uninstrumented time (for sure it's happened before), the operating system could be reporting bogus values or a host of other potential issues. But unless there is an obvious wrong value, I'm sticking with the experimental evidence.

Now I'm going to show the experimental "evidence" that is, that the DB Time "gap" time correlates with the OS CPU utilization.

Let The Data Drive Our Understanding


You can download all the data collection scripts, raw experimental data, Mathematica notepad files, graphic files, etc HERE in a single zip file.

You should be able to run the experiment on any Linux Oracle test system. All you need is a logical IO load and for that I used my free opload tool which, you can download HERE.

The experiment placed an increasing logical IO load on an Linux Oracle 12c system until the operating system CPU utilization exceeded 90%. The load was increased 18 times. During each of the 18 loads, I gathered 31 three minute samples. Each sample contains busy time (v$osstat), idle time (v$osstat), logical IO (v$sysstat "session logical reads"), non-idle wait time (v$system_event where wait_class != 'Idle'), DB CPU (v$sys_time_model), background cpu time (v$sys_time_model), database time (v$sys_time_model DB time) and the sample time (dual table current_timestamp).

The CPU utilization was calculated using the "busy idle" method that I blog about HERE. This method is detailed in my Utilization On Steroids online video seminar.

The workload is defined as the logical IOs per second, lio/s.

Below is a table summarizing the experimental data. The times shown are the averages. If you look at the actual raw experimental data contained in the analysis pack, you'll notice the data is very consistent. This is not suprising since the load I placed should produce a very consistent workload.

Do you see the gaps? Look closely at load 18. The DB Time is 8891.4 seconds. But the sum of DB CPU (996.8 seconds) and the non-idle wait time (2719.2) seconds only equals 3716.0. Yet DB Time is 8891.4. So the "gap" is 5175.3 which is DB Time (8891.3) minus DB CPU (996.8) minus the non-idle wait time (2719.2).

Note: Load 11 and 12 where excluded because of a problem with my data collection. Sorry.

While we can numberically see the DB Time "gap" increase as the CPU utilization increases, check out the graphic in the next section!

The Correlation Between CPU Utilization And DB Time Gap


We can numerically and visually see that as the CPU utilization increases, so does the DB Time "gap." But is there a strong mathematical correlation? To determine this, I used all the experimental samples (except load 11 and 12). Because there was 17 different workloads and with each workload I gathered 31 samples, the correlation comprises of something like 527 samples. Pretty good sample set I'd say.

The correlation coefficient is a strong 0.891. The strongest is 1.0 and the weakest is 0.

Graphically, here is the scatterplot showing the relationship between the CPU utilization and the workload.

Don't expect the DB Time "gap" and OS CPU utilization correlation to be perfect. Remember that DB Time does not include Oracle background process CPU consumption, yet it is obviously part of the OS CPU utilization.

Summary


My experiment indicated the light-green "CPU wait time" is primarily Oracle foreground process operating system CPU run queue time. This is DB Time "gap" time.

My experiment also showed the "gap" time is highly correlated with CPU utilization. Which means, as the CPU utilization increases, so does the "gap" time.

If there are Oracle Database instrumentation bugs or a host of other potential problems, that will also affect the "gap" time.

If you want a more complete and detailed DB Time formula is would be this:

DB Time = DB CPU + Non Idle Wait Time + gap time

In my next post, I'll show you how to calculate the gap time based on queuing theory!

Thanks for reading!

Craig.