Friday, November 18, 2011

CBC Latch Diagnosis & Acquisition Patterns


Cache buffer chain (CBC) latch contention is a common top Oracle wait event. There are a number of interrelated causes but also a number of solutions. The trick is to properly diagnose the problem which results in a short list of solutions.

The CBCs are created as a hashing structure and are primarily used to determine if a block currently resides in the buffer cache. (More->>) As you can image, even the smallest Oracle systems ask, "Is a block in the buffer cache?" a ga-zillion times each day. If CBC access continues to intensify, at some point the time to acquire the desired CBC latch will be a performance problem.

Is there a CBC issue?

When CBC latch contention is raging, you're system is likely to have a crippling CPU bottleneck because the application SQL is concurrently and repeatedly checking if specific blocks reside in the buffer cache. If the answer to, "Is the block in the buffer cache?" is usually, "Yes" then IO reads are minimized and memory structure access is maximized...hence the crippling CPU bottleneck and CBC latch contention.

The OraPub System Monitor (OSM and OSM) script I use to interactively determine overall time situation is rtpctx.sql. Here's an example of a 707 second interval.
Figure 1. Typical CBC latch contention result based on rtsysx.sql, response time report.

In Figure 1, the rtsysx.sql output the "% WT" column shows the percentage of wait time by wait event over the report interval. The "% RT" column shows the percentage of the total response time (CPU time and Wait time). The values in Figure 1 are typical when there is a very serious CBC latch contention issue. An AWR or Statspack report will tell a similar story; the top wait event being "latch: cache buffers chains" and most of the system's CPU resources being consumed by Oracle.

While there are a number of causes for CBC latch contention, I tend to see two CBC acquisition patterns. The first pattern is when many CBC latches are very active, that is, the access pattern is very disperse. The second pattern when a single CBC latch is very active. So once you know there is a significant CBC issue, the next step is to determine the acquisition pattern characteristic. Read on!

Determining CBC Wait Pattern

To determine the CBC wait pattern, you can run a very simple script like this:
select p1 latch_addr,
       p2 latch_#
from   v$session
where  status    = 'ACTIVE'
  and  wait_time = 0
  and  event     = 'latch: cache buffers chains'
/

LATCH_ADDR    LATCH_#
---------- ----------
1993834384        150
1993834384        150
1993834384        150
1993834384        150

4 rows selected.
Notice that all four sessions are sleeping (i.e., they are posting the wait event) while trying to acquire the same CBC latch (note latch address is identical). While the above snippet and the result are interesting, you could easily misled from this single sample. A more statically sound method is to gather multiple samples. Using my OSM script, latchchild.sql we can a sample each second to glean from statistical analysis. The latchchild.sql script essentially does this:
def latch=150
def sleep=300
drop table op_interim
/
create table op_interim as
select addr,gets,sleeps
from   v$latch_children
where  latch# = &latch
/
exec dbms_lock.sleep(&sleep);
select t1.addr,
       t1.gets-t0.gets delta_gets,
       t1.sleeps-t0.sleeps delta_sleeps
from   op_interim t0,
       (
         select addr,gets,sleeps
         from   v$latch_children
         where  latch# = &latch
       ) t1
where  t0.addr = t1.addr
order by 3,2
/
Below is some actual output. The "delta" columns are simply the difference between the beginning and ending values for gets and sleeps. Notice there is not a massive gap between the delta_gets and delta_sleeps  and there is not a single (or a few) latch that is significantly more active then the others. This would be classified as dispersed CBC latch contention
ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76D1ABC4            30,2356                5
76D23210            30,9631                5
76C8413C            32,4284                5
76C75428            23,2780                6
76CFCFAC            24,7324                6
76DB79B0            24,7332                6
76BFE7A8            25,3808                6
76DB9C98            28,1330                6
76C534AC            32,3395                6
76D0BCC0            33,7938                6
76C17DF0            20,3694                7
76C04DF4            24,0050                7
76DE64CC            29,5872                7
76BF5EF0            23,2782                8
76D05864            27,4886                8

1024 rows selected.
To really grasp the situation, a visual histogram based on the sleeps is very useful.

Figure 2. Histogram of the number of CBC latch address and their respective sleep activity.
Figure 2 is a standard histogram I copied from the statistical analysis Mathematica notepad (you can download below). Just over 600 CBC latches have zero sleeps while only two CBC latches had eight sleeps. Notice that while there are differences in the number of sleeps, we don't see a pattern with a massive jump like; 0,0,0,1,2,3,5,6,7,1021. Again, this is an example of dispersed CBC latch contention. But sometimes the situation is not very dispersed, but singular towards just a couple or perhaps even one single CBC latch!
Figure 3. Histogram of the number of CBC latch address and their respective sleep activity.
Figure 3 is the result, as we'll see, of three very popular buffers which are each related to three different three buffer chains. While nearly 1000 CBC latches have zero sleeps (far left vertical bar in histogram), just as important is there is an obvious large delta_sleeps gap near the most active delta_sleeps CBC latches. For some people, the histogram tells a better story, but other others the below numeric snippet better captures the situation. (I personally like to use both.)
ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76C33818            19,8038                3
76C644A8             8,1535                4
76C7759C             9,6993                4
76C2D1CC            14,5096                4
76E03FEC            14,9355                4
76CDD394            16,0718                4
76C68904            18,2300                4
76D69374             6,5250                5
76D7800C            13,3134                5
76DA2650            15,6578                5
76C02B88            15,8293                5
76CE59E0            14,5169                7
76C7B9F8            15,8243                7
76CFF120             6,5228                9
76CFCF30            14,8187               11
76D38668            96,6345               62
76CDF508            98,1384               96
76DAF26C          1,96,2752              187

1024 rows selected.
What is important to us is that there is a substantial delta_sleeps gap separating the top three CBC latches from the rest of the pack.

I'm hoping you can see the differences in these access patterns. I would suggest running the latchchild.sql script on one of your systems to see this for yourself.

If you want to see the latchchild.sql output (latchchild.txt) and the statistical details for above CBC activity, here are the links:
  1. Disperse CBC latch contention (latchchild.txt, PDF, Mathematica Notepad)
  2. Singular CBC latch contention (latchchild.txt, PDF, Mathematica Notepad)
Solutions for Disperse CBC Latch Contention

Figure 2 and the code snippet directly above Figure 2 show a typical example of what you'll likely see when many CBC latches are active enough to cause a significant performance problem. Here's the likely situation: There is an intense CBC latch situation along with a raging CPU bottleneck, and you can probably easily see the heavy logical IO (sysstat.session logical IO) SQL as well. There are a number of solutions, with some of them listed below.

An Oracle focused solution is to increase the number of CBC latches by increasing the hidden instance parameter, _db_block_hash_latches.

An application focused solution is to find the most logical IO intensive SQL and reduce the LIO's by executing it less often or tuning it. Either way, your objective is to reduce the LIOs generated during times of critical performance.

An operating system focused solution is to increase CPU resources by removing CPU consuming process if possible, adding more CPU cores, or increasing CPU speed.

There are of course other solutions, but I think you get the idea.

Solutions for Singular CBC Latch Contention

Figure 3 and the output snippet directly below it are typical when there is intense CBC contention focused on one or perhaps a few CBC latches. When this is the situation, additional diagnosis is needed to determine specifically why the intense singular activity is occuring. This is the topic of my next posting...

Summary

The Cache buffer chain (CBC) structure is used to answer the question, "Is this block in the buffer cache?" At some point, this question can get asked enough to cause significant performance problems known as CBC latch contention. While there are a number of causes for CBC latch contention, I tend to see two CBC acquisition patterns: The first pattern is when many CBC latches are very active, that is, the access pattern is very disperse. The second pattern when a single CBC latch is very active. So once you know there is a significant CBC issue, the next step is to determine the acquisition pattern characteristic.

In this posting I focused on how to determine the CBC latch contention acquisition pattern; disperse or singular. I then presented some disperse CBC latch contention solutions. In my next posting I'll focus on additional steps to diagnose singular CBC latch contention, two common situations, and some possible solutions.

Thanks for reading!

Craig.

No comments:

Post a Comment