Friday, February 5, 2010

Gotta love latch classes

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Serialization is a performance killer. This is why IT is so consumed by it. While there may be massive parallelism in IT, when Oracle is involved it eventually all boils down to  memory access. At this point, Oracle must insure serial access to prevent a memory structure from being inappropriately changed. Oracle uses latches and mutexes to enforce memory access serialization.

A latch and mutex is essentially a token. If you have the token, you have access to the memory structure. The key to Oracle serialization performance optimization is to decrease both token acquisition time and token hold time.

This post is directed toward latch (not mutex) acquisition time.

Oracle's latch acquisition scheme is not proprietary, yet works remarkably well, in that a processes attempts to acquire a specific latch by repeatedly checking to see if it is available. This is called spinning on the latch, consumes CPU resources, and is not included in the latch's wait event time. If the process does not acquire the latch after spinning a specified number of times, Oracle's latch acquisition algorithm will have the process back-off a specific period of time. The back-off time is commonly called sleep time, does not consume CPU resources, and is the latch's wait event time. So total latch acquisition time will include CPU time (while spinning) and on a real production OLTP system, will most likely also contain some wait time (while sleeping).

If you want to optimize performance you must consider both the CPU time and the wait time, that is, response time. And also the work performed. This is important to understand because latch related optimization typically change both CPU consumption and latch wait time. So you must watch both to ensure you're actually improving the situation by reducing response time. And of course if the workload decreases, then we would expect the associated computing resources to decrease as well.

The instance parameter _spin_count determines the number of spin attempts before the process back-off. There are some significant challenges that must be overcome before changing _spin_count makes any sense at all.

1. The _spin_count parameter is not latch specific. This means any _spin_count change will effect each and every latch acquisition. For example, if the problem is related to the shared pool latch acquisition and _spin_count is changed, then not only will this affect the shared pool latch acquisition, but also the cache buffer chain latch acquisition, etc. Good latching solutions are latch specific, so you can attack the problem where it hurts/helps the most while minimizing collateral damage.

2. Increasing spinning is effectively asking the CPU subsystem for more CPU resources. That's fine, but 99.9% of the time when there are latching issues, the CPU subsystem is saturated. So asking the CPU subsystem for more CPU when there is none, just makes a bad situation worse. (In those 0.1% situations, then altering _spin_count could make sense.) What is so insidious about this is the latch related wait time may decrease, while the CPU time may increase, and if the DBA is not looking at the response time, they will think the situation improved... ouch! If any change to _spin_count must be made, it probably makes more sense to decrease (not increase) it, to releive the strain on the CPU subsystem.

What we performance analysts need is the ability to alter the spin count for specific latches. Well, starting in 10gR2 Oracle gave us this capability. You can create latch classes, assign latch(es) to them, and then alter the spin count for a specific latch class. This gives us the ability to surgically make spin related changes!

Here's a classic situation where the top wait event is latch: cache buffer chains (CBC latch) and Oracle server and background processes are consuming all the available server CPU. The response time report (from OraPub's OSM tool kit) below represents a 60 sec interval on a 4 core Linux 5.4 running Oracle 11.2. The OS CPU run queue varied between 4 and 8.


SQL: @rtpctx

Remember: This report must be run twice so both the initial and
final values are available. If no output, press ENTER about nine times.

Database: prod18                                          21-JAN-10 01:46pm
Report:   rtpctx.sql        OSM by OraPub, Inc.           Page        1
                   Response Time Activity (60 sec interval)

                                       Time       Avg     Time        Wait
RT Component                           (sec)      % RT    Waited (ms) Count(k)
-------------------------------------- ---------- ------- ----------- --------
CPU consumption: Oracle SP + BG procs   180.097     79.12       0.000       0
latch: cache buffers chains              47.020     20.66      12.603       4
control file parallel write               0.100      0.04       5.000       0
os thread startup                         0.060      0.03      20.000       0
latch free                                0.040      0.02       2.000       0
log file parallel write                   0.030      0.01       0.429       0
db file async I/O submit                  0.020      0.01       0.667       0
control file sequential read              0.010      0.00       0.093       0
enq: CR - block range reuse ckpt          0.010      0.00      10.000       0

Instance Workload Statistic              Interval Work Interval Work/Sec
---------------------------------------- ------------- -----------------
user calls                                          46            0.767
session logical reads                          6589570       109826.167
physical read IO requests                            4            0.067
execute count                                      665           11.083


Notice that while the top wait event is the CBC latch (which is latch acquisition sleep time), from a response time and opportunity perspective, the issue is CPU consumption. Because the CBC latch is the top wait event we know that processes are spinning a great deal (since they have to spin before they sleep). Therefore, if we can reduce CBC latch related spin time, we may be able to also reduce the net response time.

While we could just decrease the _spin_count, we want to avoid any unknown collateral damage. After all, if this was a production system we honestly don't know what will happen to response time situation if we where to increase or decrease _spin_count.

But with spin classes, we can effectively decrease the spin count just for the CBC latches. Here's how to do this followed by the result.

Based on the OSM latchclass.sql report below, we can see there currently exists a single latch class (#2) containing the process allocation latch.


SQL: @latchclass.sql

Database: prod18                                       21-JAN-10 01:19pm
Report:   latchclass.sql       OSM by OraPub, Inc.     Page       1
                             Latch Contention Report

LATCH_NAME                               LATCH_CLASS CLASS_SPIN_COUNT
---------------------------------------- ----------- ----------------
process allocation                                 2            20000

SQL> l
  1  select kslltnam latch_name,
  2         class_ksllt latch_class,
  3         c.spin class_spin_count
  4  from   x$kslltr r,
  5         x$ksllclass c
  6  where  c.indx = r.class_ksllt
  7*   and  r.class_ksllt 


Let's determine the CBC latches number, create a new latch class, and set its spin count.


SQL: select latch# from v$latch where name='cache buffers chains';

LATCH#
----------
150

1 row selected.


In 11gR2 I was unable to dynamically alter the _latch_classes and also the _latch_class_1 parameter. Here how to do this using instance parameters.

_latch_classes = "150:1" # cbc latch assigned to latch class 1

I changed the instance parameter, recycled the instance, and then reconnected as internal and...


SQL: @ipx %latch%classes
old   8: and i.ksppinm like '&name%'
new   8: and i.ksppinm like '%latch%classes%'

Database: prod18                                    21-JAN-10 01:31pm
Report:   ipx.sql       OSM by OraPub, Inc.         Page       1
                  Display ALL Instance Parameters

Instance Parameter and Value                       Description          Dflt?
-------------------------------------------------- -------------------- -----
_latch_classes = 150:1                             latch classes        FALSE
                                                   override
9 rows selected.

SQL: @latchclass

Database: prod18                                    21-JAN-10 01:31pm
Report:   latchclass.sql       OSM by OraPub, Inc.  Page       1
                            Latch Contention Report

LATCH_NAME                               LATCH_CLASS CLASS_SPIN_COUNT
---------------------------------------- ----------- ----------------
process allocation                                 2            20000
cache buffers chains                               1            20000

2 rows selected.


At this point, I have assigned the CBC latches to class #1, but notice their class spin count is set to the default of 2000. I am now going to reduce the spin count to 250 but just for latch class #1, recycle the instance, and put the CBC load back on the system and see if the situation has changed.

Here is the instance parameter file entry:

_latch_class_1 = 250

After I recycled the instance, I double checked just to make sure Oracle did not override my entry (which seems to happen more often in 11g).


SQL: @ipx %latch%class%1

Database: prod18                                     21-JAN-10 01:51pm
Report:   ipx.sql       OSM by OraPub, Inc.          Page       1
                   Display ALL Instance Parameters

Instance Parameter and Value                       Description          Dflt?
-------------------------------------------------- -------------------- -----
_latch_class_1    = 250                            latch class 1        FALSE


After the load was reapplied and stabilized, I ran another response time report. Here it is.


SQL: @rtpctx
Remember: This report must be run twice so both the initial and
final values are available. If no output, press ENTER about nine times.

Database: prod18                                        21-JAN-10 02:01pm
Report:   rtpctx.sql         OSM by OraPub, Inc.        Page       1
                  Response Time Activity (61 sec interval)

                                       Time               Avg Time    Wait
RT Component                           (sec)      % RT    Waited (ms) Count(k)
-------------------------------------- ---------- ------- ----------- --------
CPU consumption: Oracle SP + BG procs     181.747   81.10       0.000     0
latch: cache buffers chains                42.140   18.80      12.113     3
control file parallel write                 0.120    0.05       6.000     0
latch free                                  0.070    0.03       3.889     0
db file sequential read                     0.060    0.03       0.571     0
os thread startup                           0.050    0.02      25.000     0
log file parallel write                     0.020    0.01       0.357     0
log file sync                               0.010    0.00       3.333     0

Instance Workload Statistic              Interval Work Interval Work/Sec
---------------------------------------- ------------- -----------------
user calls                                          43             0.705
session logical reads                          7168604        117518.098
physical read IO requests                          122             2.000
execute count                                      592             9.705


Let look closely at what just happened. Because the CBC latch is stressed when there is a logical IO load, when the CBC latching is being optimized logical IO is a good way to see if Oracle is performing more work. Here's a recap:

spin count  cpu time %    CBC sleep time %    LIO/s  CPU RQ
----------  -------- ---  -------------- ---  -----  ------
      2000       180 79%              47 21%  109.8  4 to 8
       250       182 81%              42 19%  117.5  2 to 4

The results are actually very encouraging. But you have to look further than just the time! Given nearly the same resource consumption and the same response time breakdown, Oracle performed 7% more work and the operating system CPU run queue is looking great! What we have done is made Oracle more efficient. That is, it is taking the same resources to process more work. If you're a response time and queuing theory buff, what we effectively done is reduced the service time (when represented as CPU time consumed to process a logical IO). This drops and shifts the response time curve to the right allowing more work to be performed without increasing the database server hardware power.

To recap: Usually systems with severe latching issues are CPU bound. One way to reduce the spinning and therefore reduced CPU consumption or/and increased work performed is to reduce the spin count. But reducing the instance-wide _spin_count parameter is not a precision change and the risk of collateral damage is very real. Using latch classes we are able to surgically alter how processes spin for a specific latch. To put in real simple terms, with latch classes Oracle has given us a scalpel as opposed to an axe.

Thanks for reading and I look to hearing from you.

Craig.

3 comments:

  1. Hello Craig,

    couple of thoughts

    1) could you please add more info on what your gains were after applying non-standard CBC latch spin count, in more convenient units of end-user response time & transactions per second. I think it would be much more representative than some Oracle numbers.

    2) I'm slightly concerned on the overall message of this post. I hope it won't be considered as a "way to go" approach - since far too often CBC latch contention is caused by inappropriate design and/or code (i.e. by application itself) rather than Oracle's configuration, I bet changing spin count for CBC latches is like "flogging a dead horse" for 99% sites.

    ReplyDelete
  2. Timur, Thanks for your comments. Great stuff!

    1) I chose to measure the change in Oracle response time (cpu consumption and wait time) and the workload (logical reads) in very Oracle specific terms because it is pure and can be quantified. If I had added a "transaction" or "end user response time" into the mix, the situation and impact becomes much less clear AND I could have created an experiment to prove whatever point I wanted to make. The trick is to take what I demonstrated and understand it and relate it to your real life situation. For example, if what the user is experiencing is highly correlated with acquiring and holding a CBC latch time, changes in CPU consumption, and logical read activity, then altering spin count is something to consider. Otherwise, it is not relevant.

    2) I shared your concern about altering spin count. The overall message of this post is that changing spin count can make a difference (I demonstrated this), but ALSO that it is also highly unlikely making a change will ever be justified in a real production environment (demonstrated by the text in point #1 and #2 near the top of the post).

    Thanks again for your comments.

    ReplyDelete
  3. Hello!

    Thank you for this great post!
    It was the starting point of my further investigation of latch classes.
    Look at:

    http://andreynikolaev.wordpress.com/2011/01/18/spin-tales-part-3-non-standard-latch-classes-in-oracle-9-2-11g/

    I would like to add that:
    1) Assignment of latch to the non-standard class completely changes the way Oracle uses to wait for the latch.

    2) My experiments showed that _spin_count tuning may be very effective for some performance problems.

    With Best Regards
    Andrey Nikolaev
    http://andreynikolaev.wordpress.com

    ReplyDelete