Monday, December 22, 2014

Is Oracle Database Time Correct? Something Is Not Quite Right.

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Is Oracle Database Time Correct? Something Is Not Quite Right.


Oracle Database tuning and performance analysis is usually based on time. As I blogged HERE, the Oracle "database time" statistic is more interesting than simply "time spent in the database." It is the sum of CPU consumption and non-idle wait time. And Elapsed Time is the sum of all the database time related to perhaps a session or a SQL statement execution. However...

If you do the fundamental math, you'll notice the numbers don't always add up. In fact, they rarely match. In this posting, I want to demonstrate this mismatch and I want you to see this on your systems!

I'll include experimental data from a number of different systems, the statistical analysis (including pictures) and provide a tool you can download for free from OraPub.com to check out the situation on your systems.

Checking DB Time Math


DB Time is defined as "time spent in the database." This is the sum of Oracle process CPU consumption plus non-idle wait time. Usually we don't derive DB Time. The Time Model Statistics view v$sys_time_mode contains the DB Time statistic. But this view also contains the DB CPU statistic. Since there is no sum of non-idle wait time, most people just assume everything is fine.

However, if you run the simple query below on your system, it could look something this:
SQL> l
  1  select db_time_s, db_cpu_s, tot_ni_wt_s
  2  from (select value/1000000 db_time_s from v$sys_time_model where stat_name = 'DB time' ),
  3   (select value/1000000 db_cpu_s from v$sys_time_model where stat_name = 'DB CPU' ),
  4*  (select sum(TIME_WAITED_MICRO_FG)/1000000 tot_ni_wt_s from v$system_event where wait_class != 'Idle' )
SQL> /

 DB_TIME_S   DB_CPU_S TOT_NI_WT_S
---------- ---------- -----------
330165.527 231403.925  119942.952

1 row selected.
If you add up the DB CPU and the total non-idle wait time, the value is 351,346.877. Woops! 351K does not equal 330K. What happened on my Oracle Database 12c (12.1.0.2.0)? As I have demonstrated in this POSTING (which contains videos of this) and in my online seminar training HERE, many times DB Time does nearly equal DB CPU plus the non-idle wait time. But clearly in the above situation something does not seem quite right.

Checking DB Time On Your Systems


To demonstrate the possibility of a DB Time mismatch, I created a simple plsql tool. You can download this free tool or do an OraPub.com search for "db time tool". The tool, which is easily configurable, takes a number of samples over a period of time and displays the output.


Here is an example of the output.

OraPub DB Time Test v1a 26-Sep-2014. Enjoy but use at your own risk.
.
Starting to collect 11 180 second samples now...
All displayed times are in seconds.
.
anonymer Block abgeschlossen
..........................................................................
... RAW OUTPUT (keep the output for your records and analysis)
..........................................................................
.
sample#,  db_time_delta_v ,  db_cpu_delta_v,  tot_ni_wait_delta_v, derived_db_time_delta_v, diff_v, diff_pct_v
.
1, 128,4, 128,254, ,103, 128,357266, ,043, 0
2, 22,014, 3,883, 17,731, 21,614215, ,399, 1,8
3, 1,625, 1,251, ,003, 1,253703, ,371, 22,8
4, 13,967, 12,719, 1,476, 14,194999, -,228, -1,6
5, 41,086, 41,259, ,228, 41,486482, -,4, -1
6, 36,872, 36,466, ,127, 36,593884, ,278, ,8
7, 38,545, 38,71, ,137, 38,847459, -,303, -,8
8, 37,264, 37,341, ,122, 37,463525, -,199, -,5
9, 22,818, 22,866, ,102, 22,967141, -,149, -,7
10, 30,985, 30,614, ,109, 30,723831, ,261, ,8
11, 5,795, 5,445, ,513, 5,958586, -,164, -2,8
.
The test is complete.
.
All displayed times are in seconds.

The output is formatted to make it easy to statistically analyze. The far right column is percent difference between the reported DB Time and the calculated DB Time. In the above example, they are pretty close. Get the tool and try it out on your systems.

Some Actual Examples


I want to quickly show you four examples from a variety of systems. You can download all the data in the "analysis pack" HERE. The data, for each of the four systems, contains the raw DB Time Test output (like in the section above), the statistical numeric analysis output from the statistical package "R", the actual "R" script and the visual analysis using "smooth histograms" also created using "R."

Below is the statistical numeric summary:


About the columns: Only the "craig" system is mine and other are real production or DEV/QA systems. The statistical columns all reference the far right column of the DB Time Test Tool's output, which is the percent difference between the reported DB Time and the calculated DB Time. Each sample set consists of eleven 180 second samples. The P-Value greater than 0.05 means the reported and calculated DB Time differences are normally distributed. This is not important in this analysis, but gives me clues if there is a problem with the data collection.

As you can easily see, two of the system's "DB Times" difference is greater than 10% and one of them was over 20%. The data collected shows that something is not quite right... but that's about it.

What Does This Mean In Our Work?


Clearly something is not quite right. There are a number of possible reasons and this will be focus of my next few articles.

However, I want to say that even though the numbers don't match perfectly and sometimes they are way off, this does not negate the value of a time based analysis. Remember, we not trying to land a man on the moon. We try diagnosing performance to derive solutions that (usually) aim to reduce the database time. I suspect that in all four cases I show, we would not be misled.

But this does highlight the requirement to also analysis performance from a non-Oracle database centric perspective. I always look at the performance situation from an operating system perspective, an Oracle centric perspective and an application (think: SQL, processes, user experience, etc.) perspective. This "3 Circle" analysis will reduce the likelihood of making a tuning diagnosis mistake. So in case DB Time is completely messed up, by diagnosing performance from the other two "circles" you will know something is not right.

If you want to learn more about my "3-Circle" analysis, here are two resources:
  1. Paper. Total Performance Management. Do an OraPub search for "3 circle" and you'll find it.
  2. Online Seminar: Tuning Oracle Using An AWR Report. I go DEEP into an Oracle Time Based Analysis but keeping it day-to-day production system practical.
In my next few articles I will drill down into why there can be a "DB Time mismatch," what to do about it and how to use this knowledge to our advantage.

Enjoy your work! There is nothing quite like analyzing performance and tuning Oracle database systems!!

Craig.





3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. It's important to realise how the wait interface is implemented. The global formula for response time analysis is: R=S+W, which translates to Oracle: Wallclock time≈CPU time + (non-idle) wait (event) time.

    When doing time based analysis, the wall clock time is taken from a begin and end wall clock time. There is no confusion about that.

    The CPU time is taken from the getrusage() system call, which is the per process CPU usage as collected by the kernel (on linux, this means it's collected per process per 1ms (1000 hz; egrep CONFIG_HZ.*=y /boot/config-$(uname -r)*). This means (despite being more precise than the 10ms precision in the older days) that it can be a bit off.
    * this is what I could find to the best of my research, I am quite confident it still works this way.

    During execution, Oracle can enter a situation where it needs to wait, which means there's no CPU usage. Oracle corp. implemented instrumentation in Oracle 7 via the wait interface, which is getting the wall clock time before an operation which is likely to wait, and getting the wall clock time after it.

    The decision to include a wait is completely up to Oracle corp. This means not all wait locations are instrumented, despite getting more and more events in recent Oracle versions. The reason for this is for some operations setting up a wait (which means registering stuff in various memory locations, and executing additional code) is likely taking more time than the actual processing. Good examples of these are latch gets and mutex gets. Getting these is not instrumented, because it ought to be a very quick operation.

    This means that outside of the wait events the Oracle code is supposed to spend fully running until finished. If it needs to wait outside of the wait event, there's loss of time. Also, it means that inside the wait the process is supposed to be waiting, otherwise you'll get too much time (this however is a situation I haven't come across yet).

    However, think about this: when doing IO, all everything is setup (memory, etc) outside of the wait, then the IO is done inside of the wait. In order to do IO, you need CPU. So there is and will be CPU time (mostly kernel/system mode) spend when doing the pread()/pwrite()/io_getevents().

    As you've described, and proven, wall clock time isn't always CPU (processing) time and wait time. My experience is that when measuring the time at a lesser granular level, let's say at query level, it's fairly okay, unless you hit a specific situation.

    Frits Hoogland

    ReplyDelete