Tuesday, December 10, 2013

Tool Options: Detailing Oracle Process CPU Consumption

Detailing an Oracle Database process's CPU consumption is amazing, a lot of fun, and can lead to some “ah ha” moments. To make this posting daily digestible, I’m breaking it up into a multi-part series. Today I’m posting the second part entitle, Tools Options. Here we go…

Tool Options

As I mentioned in my previous post entitled, The Situation, there is bad news and there is good news. The bad news, to my knowledge, Oracle does not collect process CPU consumption at the Oracle kernel level. For sure, Oracle collects Oracle process CPU consumption, but it is the total CPU consumption, with the key word being total. What I want is process CPU consumption at the Oracle kernel function level.

When I started my quest I knew there had to be some good news! With all the OS tools out there, there must be one that will meet my requirements. There were four tools that caught my attention: strace/truss, dtrace, gdb, and perf. Here is what I learned in my quest.

System Call Tracing Using STRACE/TRUSS

I started with strace because I am very familiar with it. For many years I have traced Oracle processes at the operating system level using strace. (If you are on Solaris, the command is truss.) It's a fantastic way to learn how Oracle really works. And it's a great teaching tool (which I'm always looking for).

Here is an example of what you can do with strace. Suppose I wanted to know how long it takes an Oracle server process to read a single block from disk. Through Oracle wait interface we would see this as a db file sequential read. To do an Operating system trace on the Oracle server process 2518, I could issue this command:
$ strace –rp 2518
     0.000324 clock_gettime(CLOCK_MONOTONIC, {504, 52586559}) = 0
     0.000040 clock_gettime(CLOCK_MONOTONIC, {504, 52625324}) = 0
     0.000040 pread(257, "\6\242\0\f\0"..., 8192, 427270144) = 8192
     0.000047 clock_gettime(CLOCK_MONOTONIC, {504, 52712996}) = 0
     0.000044 clock_gettime(CLOCK_MONOTONIC, {504, 52757393}) = 0
     0.000329 clock_gettime(CLOCK_MONOTONIC, {504, 53086771}) = 0
     0.000040 clock_gettime(CLOCK_MONOTONIC, {504, 53125505}) = 0
     0.000040 pread(257, "\6\76 [y\f\0"..., 8192, 427278336) = 8192
     0.000047 clock_gettime(CLOCK_MONOTONIC, {504, 53213583}) = 0
     0.000040 clock_gettime(CLOCK_MONOTONIC, {504, 53253021}) = 0
     0.000327 clock_gettime(CLOCK_MONOTONIC, {504, 53580561}) = 0
     0.000040 clock_gettime(CLOCK_MONOTONIC, {504, 53619199}) = 0
     0.000040 pread(257, "\6\273\f\0"..., 8192, 427286528) = 8192
     0.000047 clock_gettime(CLOCK_MONOTONIC, {504, 53706779}) = 0
     0.000040 clock_gettime(CLOCK_MONOTONIC, {504, 53752611}) = 0
The beauty of strace is all the system calls are shown in the OS manual pages. For example, if I want to learn about the clock_gettime call, I simply do a "man clock_gettime". I actually did that a few months ago when exploring Oracle Database 12c. My past research has been on 11g and when Oracle asks the OS for the time, I only observed Oracle making the gettimeofday call. What happened to the gettimeofday call? James Morle has a very nice posting about this.

What did I learn from the above? I observed that Oracle is instrumenting a single block read using the clock_gettime call, the server processes is requesting a single 8K block using the pread call, and the first pread shown above took about 0.047 ms (not 0.040 ms).

While that's great, notice there is no reference to CPU consumption and there is no mention of Oracle kernel functions either. In fact, only the "system calls" are shown. Also, based on what the contents of the Linux /proc/sys/kernel/vsyscall64 file is set to, I may not see all the system calls!

So clearly, strace is not going to work.

Detailed Tracing Using DTRACE

If you are a developer on Solaris, you probably have used dtrace.  DTrace is a framework for troubleshooting processes as they are running...and in production! DTrace can be used to capture CPU consumption, IO read times, IO write times, network activity, and many other things. It is amazing!

DTrace programs resemble awk programs in structure; they consist of a list of one or more probes (instrumentation points), and each probe is associated with an action. These probes are comparable to a pointcut in aspect-oriented programming. Whenever the condition for the probe is met, the associated action is executed (the probe "fires"). A typical probe might fire when a certain file is opened, or a process is started, or a certain line of code is executed. The probe may result in some information being displayed.

The good news is, DTrace is very powerful. I think any DBA could see the value in learning how Oracle works by using DTrace. Have you ever wondered how Tanel Poder finds out so much about the Oracle kernel? Tanel is a super big fan of DTrace and you quickly see this when reading his work.

There is a down side though. There is a considerable learning curve. True this rarely stops the performance researcher... but it sure can slows them down. While DTrace is standard for Solaris, it's a little trickier on Linux. The DTrace package and the Unbreakable Enterprise Kernel (UEK) package are available on the Unbreakable Linux Network (ULN), but not on the public yum server. You must register your system with ULN before you can download the required packages. The Oracle Linux Administrator's Solutions Guide for Release 6 lists the packages you need to download from ULN and shows how to install them. ...bummer.

Hmm... So dtrace is not installed and not working by default. That's a problem for me and especially in this quest. Whenever I do my research, I want others to be able to duplicate it. I also usually don't want someone to have to install, configure, and learn about another tool in the process. This combined with the learning curve, put the kibosh on DTrace for this endeavor.

Using the GNU Debugger: GDB

The  purpose  of  a debugger such as GDB is to allow you to see what is going on ‘‘inside’’ another program while it executes—or  what  another program was doing at the moment it crashed.

I have spoken to Frits Hoogland about gdb before and was really impressed with what one can do with it. Frits uses gdb quite a bit, so his blog postings have many great examples. Here is just one good example. At this years IOUG/Collaborate conference in Las Vegas, Frits is going to present a 30 minute Quick Tip on using how DBAs can use gdb. I hope the room is full!

The word "debugger" is a clue that gdb is not going to meet my objective. GDB is great for stepping through Oracle's kernel code (and even interrupting or stopping it!), but it wasn't clear how I could capture an Oracle process's CPU consumption by Oracle kernel function.

As with strace and dtrace, gdb isn't what I'm looking for.

Using the Linux Profiler: Perf

First of all, Perf is a free Linux process profiling tool. When I checked on my pretty-much default OEL 6.2 and Oracle Database 12c box, the perf tools where already there!

On your Linux box, just do:
$ which perf
In many ways Perf is like Oracle's ASH facility in that, it samples data. Unlike Oracle's wait interface it is about instrumentation providing timing information. For example, we can ask Perf to sample a process for 5 seconds at the default rate of 1K/sec and to keep a count whenever it sees the process "on CPU". Now suppose over the 5 second interval, the process was "on CPU" for 3K samples. Therefore, we know that that 3/5 of the 5 second interval, that is, 60% of the time it was "on CPU."

Oh yeah... one other thing. Perf keeps track of the "count" at the function level. This means I can determine how much CPU time (more correctly the counts) each Oracle kernel function "consumes." Therefore, it appears Perf will do exactly what I need; keep track of an Oracle's process's CPU consumption down to the Oracle kernel function. Perf-ect! (sorry about that)

To help make this crystal clear, now I can get something like this:
Count    Kernel Function
-------- ---------------
  1500   abc124
   900   td8tr
   200   kcbte
    12   kbce8
  2612 counts over a 5 second interval
There is a lot more to Perf than I have mentioned. There are actually suite of Perf tools, that all begin with the word "Perf". For example:

perf stat is used to obtain, that is collect, the event counts.
perf record is used to record the event counts for later reporting.
perf report is used to break down and report on the events process, function, etc.
perf top is like the standard "top" command but provides function level detail for a specific process.

Here is a quick example of "perf stat" collecting the default events for process 28497 for five seconds. Notice the "sleep 5"? The execution duration of the command "sleep 5" is the collection period duration. I could have used another command like, "ls -1" but "sleep 5" is an obvious way to sleep for five seconds.
$ ps -eaf | grep oracleprod35
oracle   28497 28496  8 09:58 ?        00:31:36 oracleprod35 (DESCRIPTION=(LOCAL=YES)...
oracle   42265 28425  0 15:51 pts/1    00:00:00 grep oracleprod35

$ perf stat -p 28497 sleep 5

 Performance counter stats for process id '28497':

      560.686866 task-clock               #    0.112 CPUs utilized          
              32 context-switches         #    0.000 M/sec                  
               1 CPU-migrations           #    0.000 M/sec                  
               0 page-faults              #    0.000 M/sec                  
   1,522,513,670 cycles                   #    2.715 GHz                    [83.11%]
     952,309,947 stalled-cycles-frontend  #   62.55% frontend cycles idle   [83.65%]
     568,827,423 stalled-cycles-backend   #   37.36% backend  cycles idle   [66.93%]
   1,333,413,297 instructions             #    0.88  insns per cycle        
                                          #    0.71  stalled cycles per insn[83.54%]
     249,397,610 branches                 #  444.807 M/sec                  [82.99%]
       4,478,117 branch-misses            #    1.80% of all branches        [83.40%]

      5.000696361 seconds time elapsed
Here is an example of "perf top" for process id 28497 at a 30 second refresh rate.
$ perf top -p 28497 -d 30

PerfTop: 89 irqs/sec  kernel: 5.6%  exact: 0.0% [1000Hz cycles],(target_pid: 28497)

             samples  pcnt function                   DSO
             _______ _____ __________________________ __________________________

              205.00  7.6% kcbchg1_main                product/12.1.0/bin/oracle
              148.00  5.5% kcbget                      product/12.1.0/bin/oracle
              124.00  4.6% kcrfw_redo_gen_ext          product/12.1.0/bin/oracle
              114.00  4.2% kdkcmp1                     product/12.1.0/bin/oracle
              100.00  3.7% __intel_ssse3_rep_memcpy    product/12.1.0/bin/oracle
               99.00  3.7% kcrfw_copy_cv               product/12.1.0/bin/oracle
               89.00  3.3% kcoapl                      product/12.1.0/bin/oracle
               84.00  3.1% kcbgcur                     product/12.1.0/bin/oracle
               77.00  2.9% kdiins1                     product/12.1.0/bin/oracle
               62.00  2.3% kduovw                      product/12.1.0/bin/oracle
               56.00  2.1% ktuchg2                     product/12.1.0/bin/oracle
               52.00  1.9% ktugur                      product/12.1.0/bin/oracle
               51.00  1.9% kdimod0                     product/12.1.0/bin/oracle
               50.00  1.9% updrow                      product/12.1.0/bin/oracle
               46.00  1.7% kauupd                      product/12.1.0/bin/oracle
               42.00  1.6% qertbFetch                  product/12.1.0/bin/oracle
               38.00  1.4% kdudcp                      product/12.1.0/bin/oracle
Here is were it really get interesting. I'm going to record just CPU "cycles" for process 28497 for 30 seconds. It's simple:
$ perf record -e cycles -p 28497 sleep 30

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.093 MB perf.data (~4057 samples) ]
At this point the data has been stored in the default perf.data file. Now I'm going to create the default report based on the perf.data file, like this:
$ perf report

# Events: 2K cycles
# Overhead          Command      Shared Object                       Symbol
# ........  ...............  .................  ...........................
     7.91%  oracle_28497_pr  oracle             [.] kcbchg1_main
     4.87%  oracle_28497_pr  oracle             [.] kcbget
     4.07%  oracle_28497_pr  oracle             [.] kcrfw_redo_gen_ext
     3.86%  oracle_28497_pr  oracle             [.] kdkcmp1
     3.42%  oracle_28497_pr  oracle             [.] __intel_ssse3_rep_memcpy
     3.37%  oracle_28497_pr  oracle             [.] kcoapl
     3.23%  oracle_28497_pr  oracle             [.] kcrfw_copy_cv
     3.08%  oracle_28497_pr  oracle             [.] kcbgcur
     2.56%  oracle_28497_pr  oracle             [.] kdiins1
     2.22%  oracle_28497_pr  oracle             [.] ktuchg2
     2.03%  oracle_28497_pr  oracle             [.] ktbgwl
     2.00%  oracle_28497_pr  oracle             [.] kauupd
     1.88%  oracle_28497_pr  oracle             [.] kduovw
     1.78%  oracle_28497_pr  oracle             [.] kdimod0
     1.56%  oracle_28497_pr  oracle             [.] kdudcp
Oh yeah... I forgot to mention that the "perf report" command has the capability to output the results in comma delimitated format... can anyone say, "external table"? I think you see where I'm going with this. However, we need to get past a potential deal killer first.

Can We Go From Count To Time?

This is actually a very serious issue. Most DBAs will and should pause and consider the situation when jumping from sample counts to time.

Here is an example situation. Suppose over a 30 second interval the Oracle view v$sess_time_model (stat: db cpu) shows a server process consumed 10 seconds of CPU. Also, according to Perf, the Oracle kernel function kcb1 accounts for 60% of the CPU counts and kcb2 accounts for the remaining 40%.

Now the question is, can we attribute 60% of the 10 seconds of CPU time to the Oracle kernel function kcb1 and 40% of the 10 seconds of CPU to kcb2? If so, then kcb1 would be assigned 6 seconds and kcb2 4 seconds.

Using the count-to-seconds math above combined with Perf's default sample rate of 1000/sec, (i.e., 1/ms), then each CPU count is equivalent to 1 ms of CPU consumption. I suspect it is likely either less than 1ms or greater than 1ms. Assuming either direction is just as likely to occur (and I can't prove this to you), then I suspect assessing one count to 1ms of CPU is OK.

Based on the above, I am comfortable jumping from Perf counts to time.

So Where Are We In This Quest?

At this point, in terms of CPU consumption I'd say we are here:
Time Component                                                secs       %
------------------------------------------------------- ---------- -------
cpu : [.] kcbgtcr                                           29.714   66.87
cpu : [.] kdstf000010100001km                                3.716    8.36
cpu : [.] lnxsum                                             3.541    7.97
cpu : [?] sum of funcs consuming less than 2% of CPU ti      2.393    5.38
cpu : [.] kaf4reasrp0km                                      2.180    4.91
As DBAs we already know how to get Oracle wait time. This puts us in the position to combine Oracle wait time and Oracle process CPU consumption by kernel function. Something like this:
Time Component                                                secs       %
------------------------------------------------------- ---------- -------
cpu : [.] kcbgtcr                                           29.714   66.87
cpu : [.] kdstf000010100001km                                3.716    8.36
cpu : [.] lnxsum                                             3.541    7.97
cpu : [?] sum of funcs consuming less than 2% of CPU ti      2.393    5.38
cpu : [.] kaf4reasrp0km                                      2.180    4.91
wait: latch: cache buffers chains                            2.100    4.73
Obviously we are not finished. First, we need to create a tool to collect, merge, and report the above details. And second, we need to gain some understanding about Oracle function names. For example, just what is "kcbgtcr" anyways?! In my next posting I'll cover both of these issues.

But until then, log into your non-production yet active Linux box and some a few "perf top" and a few "perf record" and then "perf report" commands. If it's not production, profile the Oracle log writer or database writer background process or perhaps an active server process!

Feeling dangerous yet?

If you can’t wait until the next posting (or you have lots of time), here are some resources.

1. Presentation: Detailing Oracle CPU Consumption. Download here.
2. Tool: Oracle wait time and kernel function CPU monitor. Download the fulltime.sh script.

Thanks for reading,


1 comment:

  1. Hello meen, I've been following your blog for almost 6 months now without leaving any comment so I decided to quickly say hi today, just so you know you've got a fan somewhere. LOL. Your blog makes sense to me just like http://danieluyi.com, another similarly interesting blog that I'm also a fan of.

    Keep it up.