Cache buffer chain (CBC) latch contention is a common top Oracle wait event. There are a number of interrelated causes, but also a number of solutions. The trick is to properly diagnose the problem, which then leaves you with a short list of appropriate solutions.
The CBCs are created as a hashing structure and are primarily used to determine if a block currently resides in the buffer cache. As you can imagine, even the smallest Oracle systems ask, "Is this block in the buffer cache?" a ga-zillion times each day. If CBC access continues to intensify, at some point the time to acquire the desired CBC latch will become a performance problem.
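To get a feel for the structure on your own system, here's a minimal sketch (my own queries, not an OSM script) showing the parent CBC latch activity and how many child latches protect the chains:
-- Parent-level activity for the cache buffers chains latch.
select latch#, name, gets, misses, sleeps
from v$latch
where name = 'cache buffers chains'
/
-- How many child CBC latches does this instance have?
select count(*) child_latches
from v$latch_children
where name = 'cache buffers chains'
/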
Is there a CBC issue?
When CBC latch contention is raging, your system is likely to have a crippling CPU bottleneck because the application SQL is concurrently and repeatedly checking whether specific blocks reside in the buffer cache. If the answer to "Is the block in the buffer cache?" is usually "Yes," then IO reads are minimized and memory structure access is maximized...hence the crippling CPU bottleneck and CBC latch contention.
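If you want a quick sanity check that the host CPUs really are saturated, a rough sketch like this can help (my own query, not an OSM script; v$osstat values are cumulative since instance startup, so difference two samples for an interval-based picture):
-- Rough host CPU utilization (centiseconds, cumulative since startup).
select round(100 * busy.value / (busy.value + idle.value), 1) pct_cpu_busy
from (select value from v$osstat where stat_name = 'BUSY_TIME') busy,
     (select value from v$osstat where stat_name = 'IDLE_TIME') idle
/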
The OraPub System Monitor (OSM) script I use to interactively determine the overall response time situation is rtpctx.sql. Here's an example from a 707 second interval.
Figure 1. Typical CBC latch contention result based on rtsysx.sql, response time report.
In Figure 1, the "% WT" column of the rtsysx.sql output shows the percentage of total wait time attributed to each wait event over the report interval. The "% RT" column shows each component's percentage of the total response time (CPU time plus wait time). The values in Figure 1 are typical when there is a very serious CBC latch contention issue. An AWR or Statspack report will tell a similar story: the top wait event is "latch: cache buffers chains" and most of the system's CPU resources are being consumed by Oracle.
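I can't reproduce the rtsysx.sql output here, but a rough approximation of the same idea (my own sketch; these views are cumulative since instance startup rather than interval-based) is to compare non-idle wait time with Oracle CPU consumption:
-- Non-idle wait time by event (seconds), cumulative since startup.
select event, round(time_waited_micro/1000000) wait_secs
from v$system_event
where wait_class != 'Idle'
order by time_waited_micro desc
/
-- Oracle CPU consumption (seconds), cumulative since startup.
select stat_name, round(value/1000000) cpu_secs
from v$sys_time_model
where stat_name in ('DB CPU', 'background cpu time')
/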
While there are a number of causes for CBC latch contention, I tend to see two CBC acquisition patterns. The first pattern is when many CBC latches are very active, that is, the access pattern is very disperse. The second pattern is when a single CBC latch is very active. So once you know there is a significant CBC issue, the next step is to determine the acquisition pattern characteristic. Read on!
Determining CBC Wait Pattern
To determine the CBC wait pattern, you can run a very simple script like this:
select p1 latch_addr,
p2 latch_#
from v$session
where status = 'ACTIVE'
and wait_time = 0
and event = 'latch: cache buffers chains'
/
LATCH_ADDR LATCH_#
---------- ----------
1993834384 150
1993834384 150
1993834384 150
1993834384 150
4 rows selected.
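The p1 value each waiting session posts is the address of the latch it is trying to acquire, so you can map it straight back to the specific child latch. Here's a minimal sketch of that mapping (my own query, not an OSM script), joining on the raw address:
-- Map each waiting session's latch address (p1raw) to its CBC child latch.
select s.sid, lc.name, lc.child#, lc.gets, lc.misses, lc.sleeps
from v$session s,
     v$latch_children lc
where s.event = 'latch: cache buffers chains'
  and s.wait_time = 0
  and lc.addr = s.p1raw
/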
Notice that all four sessions in the sample above are sleeping (i.e., they are posting the wait event) while trying to acquire the same CBC latch (the latch address is identical). While the above snippet and its result are interesting, you could easily be misled by a single sample. A more statistically sound method is to gather multiple samples. Using my OSM script latchchild.sql, we can gather CBC child latch activity over a longer interval and then apply some statistical analysis. The latchchild.sql script essentially does this:
def latch=150
def sleep=300
drop table op_interim
/
create table op_interim as
select addr,gets,sleeps
from v$latch_children
where latch# = &latch
/
exec dbms_lock.sleep(&sleep);
select t1.addr,
t1.gets-t0.gets delta_gets,
t1.sleeps-t0.sleeps delta_sleeps
from op_interim t0,
(
select addr,gets,sleeps
from v$latch_children
where latch# = &latch
) t1
where t0.addr = t1.addr
order by 3,2
/
Below is some actual output. The "delta" columns are simply the difference between the beginning and ending values for gets and sleeps. Notice there is not a massive delta_sleeps gap between the most active latches and the rest, and there is not a single latch (or a few latches) significantly more active than the others. This would be classified as disperse CBC latch contention.
ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76D1ABC4            302,356                5
76D23210            309,631                5
76C8413C            324,284                5
76C75428            232,780                6
76CFCFAC            247,324                6
76DB79B0            247,332                6
76BFE7A8            253,808                6
76DB9C98            281,330                6
76C534AC            323,395                6
76D0BCC0            337,938                6
76C17DF0            203,694                7
76C04DF4            240,050                7
76DE64CC            295,872                7
76BF5EF0            232,782                8
76D05864            274,886                8
1024 rows selected.
To really grasp the situation, a visual histogram based on the sleeps is very useful.
Figure 3. Histogram of CBC latch addresses and their respective sleep activity.
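If you don't have a graphics package handy, a crude frequency listing of the same data tells a similar story. This sketch is mine, not part of latchchild.sql, and assumes the op_interim table created by the script above still exists:
-- Distribution of child latches by their delta sleeps over the interval.
select t1.sleeps - t0.sleeps delta_sleeps,
       count(*) latch_count
from op_interim t0,
     v$latch_children t1
where t0.addr = t1.addr
group by t1.sleeps - t0.sleeps
order by 1
/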
Figure 3 is the result, as we'll see, of three very popular buffers, each related to a different buffer chain. While nearly 1000 CBC latches have zero sleeps (the far left vertical bar in the histogram), just as important is the obvious, large delta_sleeps gap near the most active CBC latches. For some people the histogram tells a better story, but for others the numeric snippet below better captures the situation. (I personally like to use both.)
ADDR             DELTA_GETS     DELTA_SLEEPS
-------- ------------------ ----------------
...
76C33818            198,038                3
76C644A8             81,535                4
76C7759C             96,993                4
76C2D1CC            145,096                4
76E03FEC            149,355                4
76CDD394            160,718                4
76C68904            182,300                4
76D69374             65,250                5
76D7800C            133,134                5
76DA2650            156,578                5
76C02B88            158,293                5
76CE59E0            145,169                7
76C7B9F8            158,243                7
76CFF120             65,228                9
76CFCF30            148,187               11
76D38668            966,345               62
76CDF508            981,384               96
76DAF26C          1,962,752              187
1024 rows selected.
What is important to us is that there is a substantial delta_sleeps gap separating the top three CBC latches from the rest of the pack. I'm hoping you can see the difference between these two access patterns. I would suggest running the latchchild.sql script on one of your systems to see this for yourself.
If you want to see the latchchild.sql output (latchchild.txt) and the statistical details for above CBC activity, here are the links:
- Disperse CBC latch contention (latchchild.txt, PDF, Mathematica Notebook)
- Singular CBC latch contention (latchchild.txt, PDF, Mathematica Notebook)
Solutions for Disperse CBC Latch Contention
Figure 2 and the code snippet directly above it show a typical example of what you'll likely see when many CBC latches are active enough to cause a significant performance problem. Here's the likely situation: there is an intense CBC latch situation along with a raging CPU bottleneck, and you can probably easily spot the heavy logical IO (v$sysstat "session logical reads") SQL as well. There are a number of solutions, some of which are listed below.
An Oracle focused solution is to increase the number of CBC latches by increasing the hidden instance parameter, _db_block_hash_latches.
An application focused solution is to find the most logical IO intensive SQL and reduce its LIOs by executing it less often or by tuning it (a sample query is sketched below, after this list). Either way, your objective is to reduce the LIOs generated during times of critical performance.
An operating system focused solution is to increase CPU resources by removing CPU consuming processes if possible, adding more CPU cores, or increasing CPU speed.
There are of course other solutions, but I think you get the idea.
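As a rough illustration of the application focused approach, here is a sketch (my own query, not an OSM script) listing the SQL statements generating the most logical IO; the top-10 cutoff is arbitrary:
-- Top logical IO SQL: candidates for tuning or less frequent execution.
select *
from (select sql_id,
             executions,
             buffer_gets,
             round(buffer_gets / nullif(executions, 0)) gets_per_exec
      from v$sql
      order by buffer_gets desc)
where rownum <= 10
/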
Solutions for Singular CBC Latch Contention
Figure 3 and the output snippet directly below it are typical when there is intense CBC contention focused on one or perhaps a few CBC latches. When this is the situation, additional diagnosis is needed to determine specifically why the intense singular activity is occurring. This is the topic of my next posting...
Summary
The cache buffer chain (CBC) structure is used to answer the question, "Is this block in the buffer cache?" At some point, this question can get asked often enough to cause significant performance problems, known as CBC latch contention. While there are a number of causes for CBC latch contention, I tend to see two CBC acquisition patterns: the first is when many CBC latches are very active, that is, the access pattern is very disperse; the second is when a single CBC latch is very active. So once you know there is a significant CBC issue, the next step is to determine the acquisition pattern characteristic.
In this posting I focused on how to determine the CBC latch contention acquisition pattern; disperse or singular. I then presented some disperse CBC latch contention solutions. In my next posting I'll focus on additional steps to diagnose singular CBC latch contention, two common situations, and some possible solutions.
Thanks for reading!
Craig.