Thursday, February 25, 2010

How mutexes sleep

As with Oracle locks and latches, mutexes are used to ensure serial access to Oracle structures. They both follow a shockingly simple yet efficient algorithm of spinning to acquire the control structure (thereby consuming CPU) and sleeping when spinning does not quickly acquire the control structure (posting a wait event).

But there are significant differences. Locks are defined to work on relational structures, like tables and rows, are orderly (no randomness and very boring), and consume more resources (per acquisition) than latches and mutexes. Whereas latches and mutexes were specifically designed to work with memory structures. When there are locking issues, enqueue related wait events become significant but when memory serialization issues become an issue latch and mutex related wait events become significant. So there are differences not only in name, but also in usage and resource requirements.

Mutexes were original created for high concurrency thread control and are part of the standard C programming library of functions. You can even do a man -k mutex on U/Linux systems to get a glimpse of the various mutex related functions. Try that with a latch and nothing is returned. Oracle uses mutexes to control serialization access to certain internal memory structures. As of 11g, only library cache related mutexes exist, but this will undoubtedly change.

Mutexes offer the kernel developer flexibility and offer significant performance benefits over latches. While I have much more detail in my Oracle Performance Firefighting book's chapter on Serialization Control, I'll quickly summarize mutex benefits. First mutexes are a library function available to Oracle kernel developers, so as you would expect they will be more efficient. My Oracle contacts tell me that on average, acquiring a mutex requires 30 to 35 instructions whereas latches require 150-200. This translates into reduced CPU consumption and therefore faster acquisition times or the the ability to process more work (more activity per second). The memory footprint is also supposed to be smaller by around 100 bytes, but honestly I don't think Oracle or anyone really cares about hundreds, thousands, or even hundreds of thousands of bytes of memory. Mutexes can pin cursors (so they are not deallocated) much quicker than latches. One of my experiments showed mutexes pinned and unpinned cursors 11% faster than traditional latches. But the big benefit is they provide more granular memory structure control, which reduces what's called false contention.

False contention results when say, a latch is used protect a complex memory structure where two processes need to access different parts of the memory structures yet because there is only a single control structure, one process will get access and the other will not. Even though they will not access the same piece of memory! Because mutexes can be more flexibly defined to be associated with a memory structure, Oracle can more naturally associate a mutex with small pieces of memory. For example, within the library cache, each cursor, child cursor, and table has an associated mutex. (I have dumped the library cache and demonstrated this.) When latches are used entire library cache chains could be protected by a single latch (I have not dumped a library cache in quite a while using only latches to 110% confirm this). With increased granularity comes increased concurrency, which means decreased CPU consumption and Oracle wait time to get the same amount of work done.

When a server or background process attempts to acquire a latch or a mutex in willing-to-wait mode they repeatedly ask for the control structure until either they acquire the control structure or they reach their spin limit. Latches spin up to the instance parameter _spin_count (unless latch classes are used) and for mutexes Oracle has a hard limit of 255. If the control structure is not acquired through spinning, the process sleeps.

This is where it gets really interesting! For latches, the system call select is used to pause the spinning for a defined period of time. For mutexes, Oracle has options, but has chosen to have the sleeping process yield the CPU. Essentially, the sleeping process put itself at the end of the run queue. This is in a way a self adjusting sleep. If the CPU subsystem is not all that busy, then the sleep time will be near zero and the server or background process will quickly resume spinning. But if the CPU subsystem is very busy, the process will patiently wait for its turn to consume CPU once again. (more about this below)

OK... let's take a step back and I'll show you how we can see this take place. Here's what an Oracle system can look like when severe mutex contention exists. The cursor: pin S is the result of Oracle processes sleeping (and the relatively high CPU consumption) is due in large part to spinning on the mutex trying to acquire it. But CPU is also used (and lots of it) to actually manage the library cache.


SQL: @rtpctx.sql
Remember: This report must be run twice so both the initial and
final values are available. If no output, press ENTER about nine times.

Database: prod18                                             21-JAN-10 12:57pm
Report:   rtpctx.sql           OSM by OraPub, Inc.           Page        1
                     Response Time Activity (61 sec interval)

                                       Time       Avg     Time        Wait
RT Component                           (sec)      % RT    Waited (ms) Count(k)
-------------------------------------- ---------- ------- ----------- --------
CPU consumption: Oracle SP + BG procs     248.585   97.92       0.000        0
cursor: pin S                               5.100    2.01      42.500        0
os thread startup                           0.030    0.01      15.000        0
control file parallel write                 0.020    0.01       1.000        0
db file async I/O submit                    0.010    0.00       0.769        0


If you're wondering how I forced the cursor: pin S wait event, just have a few processes do this:

declare
  i number;
  j number;
  cursor abc is select count(c1) into j from test_table where rownum=1 ;
begin
  loop
    for i in 1..1000000
    loop
        open  abc;
        close abc;
    end loop;
    sys.dbms_lock.sleep(1);
  end loop;
end;
/

Opening a cursor requires it be pinned in shared mode and part of that process is acquiring the cursor's mutex in shared mode, hence the wait cursor: pin S.

The way to actually see the difference with how a mutex sleeps compared to a latch is to operating system trace a process that is spending lots of time acquiring mutexes. While the above script was being run by a few processes (only takes a few), I traced it using the Linux command strace. Here are the results for a 60 second sample and then followed by a 180 second sample.

[oracle@fourcore mutex_calls]$ ps -eaf|grep oracleprod18
oracle   12647 12646 59 14:32 ?        00:09:11 oracleprod18

[oracle@fourcore mutex_calls]$ strace -rcp 12647
Process 12647 attached - interrupt to quit
Process 12647 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.40    0.015586           0   1538041           getrusage
 33.49    0.011013           0   1230577           gettimeofday
 19.11    0.006284           0    615192           times
  0.00    0.000000           0        57           sched_yield
  0.00    0.000000           0         1         1 semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00    0.032883               3383868         1 total

[oracle@fourcore mutex_calls]$ strace -rcp 12647
Process 12647 attached - interrupt to quit
Process 12647 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.09    0.039050           0   4194768           getrusage
 35.27    0.029884           0   3356216           gettimeofday
 18.63    0.015787           0   1677838           times
  0.01    0.000007           0       157           sched_yield
  0.00    0.000000           0         1         1 semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00    0.084728               9228980         1 total

On average this struggling server process yielded CPU control around once every 0.9 seconds. The CPU run queue was not that long however because the process was able to "wake up" and start spinning once again in nearly no time at all (see the seconds column).

The sched_yield call simply relinquishes the processor voluntarily without blocking. The process will then be moved to the end of the queue for its static priority and a new process gets to run. What is so cool is if the relinquishing process is the only process in the highest priority list at that time, this process  will continue to run after a call to sched_yield().

This is significant because instead of a process waiting a defined period of time, say 10ms, it will simply move itself to the end of the queue. This avoids the unfortunate situation where there is idle CPU resources while the process is sleeping! It's kind of like the lead goose in a large geese V formation. The lead goose is taking all the heat up front. After a while of this, it backs off and let's another goose take some heat. Eventually the goose will work its way back up to the front.

While this post is about how mutexes sleep, a little more practically speaking (but not much), once you determine there is significant mutex contention, you need to understand specifically what the particular mutex is telling you and what the application is trying to do that is stressing Oracle so much. So far every time I have seen mutex contention and talked with DBAs with systems experiencing mutex contention, we have been able to understand how and why their application is stressing the mutexes so much. In each case, by creatively altering the application, the significant mutex contention was eliminated.

Thanks for reading and I look forward to meeting in you person one day.

Craig.

Friday, February 5, 2010

Gotta love latch classes

Serialization is a performance killer. This is why IT is so consumed by it. While there may be massive parallelism in IT, when Oracle is involved it eventually all boils down to  memory access. At this point, Oracle must insure serial access to prevent a memory structure from being inappropriately changed. Oracle uses latches and mutexes to enforce memory access serialization.

A latch and mutex is essentially a token. If you have the token, you have access to the memory structure. The key to Oracle serialization performance optimization is to decrease both token acquisition time and token hold time.

This post is directed toward latch (not mutex) acquisition time.

Oracle's latch acquisition scheme is not proprietary, yet works remarkably well, in that a processes attempts to acquire a specific latch by repeatedly checking to see if it is available. This is called spinning on the latch, consumes CPU resources, and is not included in the latch's wait event time. If the process does not acquire the latch after spinning a specified number of times, Oracle's latch acquisition algorithm will have the process back-off a specific period of time. The back-off time is commonly called sleep time, does not consume CPU resources, and is the latch's wait event time. So total latch acquisition time will include CPU time (while spinning) and on a real production OLTP system, will most likely also contain some wait time (while sleeping).

If you want to optimize performance you must consider both the CPU time and the wait time, that is, response time. And also the work performed. This is important to understand because latch related optimization typically change both CPU consumption and latch wait time. So you must watch both to ensure you're actually improving the situation by reducing response time. And of course if the workload decreases, then we would expect the associated computing resources to decrease as well.

The instance parameter _spin_count determines the number of spin attempts before the process back-off. There are some significant challenges that must be overcome before changing _spin_count makes any sense at all.

1. The _spin_count parameter is not latch specific. This means any _spin_count change will effect each and every latch acquisition. For example, if the problem is related to the shared pool latch acquisition and _spin_count is changed, then not only will this affect the shared pool latch acquisition, but also the cache buffer chain latch acquisition, etc. Good latching solutions are latch specific, so you can attack the problem where it hurts/helps the most while minimizing collateral damage.

2. Increasing spinning is effectively asking the CPU subsystem for more CPU resources. That's fine, but 99.9% of the time when there are latching issues, the CPU subsystem is saturated. So asking the CPU subsystem for more CPU when there is none, just makes a bad situation worse. (In those 0.1% situations, then altering _spin_count could make sense.) What is so insidious about this is the latch related wait time may decrease, while the CPU time may increase, and if the DBA is not looking at the response time, they will think the situation improved... ouch! If any change to _spin_count must be made, it probably makes more sense to decrease (not increase) it, to releive the strain on the CPU subsystem.

What we performance analysts need is the ability to alter the spin count for specific latches. Well, starting in 10gR2 Oracle gave us this capability. You can create latch classes, assign latch(es) to them, and then alter the spin count for a specific latch class. This gives us the ability to surgically make spin related changes!

Here's a classic situation where the top wait event is latch: cache buffer chains (CBC latch) and Oracle server and background processes are consuming all the available server CPU. The response time report (from OraPub's OSM tool kit) below represents a 60 sec interval on a 4 core Linux 5.4 running Oracle 11.2. The OS CPU run queue varied between 4 and 8.


SQL: @rtpctx

Remember: This report must be run twice so both the initial and
final values are available. If no output, press ENTER about nine times.

Database: prod18                                          21-JAN-10 01:46pm
Report:   rtpctx.sql        OSM by OraPub, Inc.           Page        1
                   Response Time Activity (60 sec interval)

                                       Time       Avg     Time        Wait
RT Component                           (sec)      % RT    Waited (ms) Count(k)
-------------------------------------- ---------- ------- ----------- --------
CPU consumption: Oracle SP + BG procs   180.097     79.12       0.000       0
latch: cache buffers chains              47.020     20.66      12.603       4
control file parallel write               0.100      0.04       5.000       0
os thread startup                         0.060      0.03      20.000       0
latch free                                0.040      0.02       2.000       0
log file parallel write                   0.030      0.01       0.429       0
db file async I/O submit                  0.020      0.01       0.667       0
control file sequential read              0.010      0.00       0.093       0
enq: CR - block range reuse ckpt          0.010      0.00      10.000       0

Instance Workload Statistic              Interval Work Interval Work/Sec
---------------------------------------- ------------- -----------------
user calls                                          46            0.767
session logical reads                          6589570       109826.167
physical read IO requests                            4            0.067
execute count                                      665           11.083


Notice that while the top wait event is the CBC latch (which is latch acquisition sleep time), from a response time and opportunity perspective, the issue is CPU consumption. Because the CBC latch is the top wait event we know that processes are spinning a great deal (since they have to spin before they sleep). Therefore, if we can reduce CBC latch related spin time, we may be able to also reduce the net response time.

While we could just decrease the _spin_count, we want to avoid any unknown collateral damage. After all, if this was a production system we honestly don't know what will happen to response time situation if we where to increase or decrease _spin_count.

But with spin classes, we can effectively decrease the spin count just for the CBC latches. Here's how to do this followed by the result.

Based on the OSM latchclass.sql report below, we can see there currently exists a single latch class (#2) containing the process allocation latch.


SQL: @latchclass.sql

Database: prod18                                       21-JAN-10 01:19pm
Report:   latchclass.sql       OSM by OraPub, Inc.     Page       1
                             Latch Contention Report

LATCH_NAME                               LATCH_CLASS CLASS_SPIN_COUNT
---------------------------------------- ----------- ----------------
process allocation                                 2            20000

SQL> l
  1  select kslltnam latch_name,
  2         class_ksllt latch_class,
  3         c.spin class_spin_count
  4  from   x$kslltr r,
  5         x$ksllclass c
  6  where  c.indx = r.class_ksllt
  7*   and  r.class_ksllt 


Let's determine the CBC latches number, create a new latch class, and set its spin count.


SQL: select latch# from v$latch where name='cache buffers chains';

LATCH#
----------
150

1 row selected.


In 11gR2 I was unable to dynamically alter the _latch_classes and also the _latch_class_1 parameter. Here how to do this using instance parameters.

_latch_classes = "150:1" # cbc latch assigned to latch class 1

I changed the instance parameter, recycled the instance, and then reconnected as internal and...


SQL: @ipx %latch%classes
old   8: and i.ksppinm like '&name%'
new   8: and i.ksppinm like '%latch%classes%'

Database: prod18                                    21-JAN-10 01:31pm
Report:   ipx.sql       OSM by OraPub, Inc.         Page       1
                  Display ALL Instance Parameters

Instance Parameter and Value                       Description          Dflt?
-------------------------------------------------- -------------------- -----
_latch_classes = 150:1                             latch classes        FALSE
                                                   override
9 rows selected.

SQL: @latchclass

Database: prod18                                    21-JAN-10 01:31pm
Report:   latchclass.sql       OSM by OraPub, Inc.  Page       1
                            Latch Contention Report

LATCH_NAME                               LATCH_CLASS CLASS_SPIN_COUNT
---------------------------------------- ----------- ----------------
process allocation                                 2            20000
cache buffers chains                               1            20000

2 rows selected.


At this point, I have assigned the CBC latches to class #1, but notice their class spin count is set to the default of 2000. I am now going to reduce the spin count to 250 but just for latch class #1, recycle the instance, and put the CBC load back on the system and see if the situation has changed.

Here is the instance parameter file entry:

_latch_class_1 = 250

After I recycled the instance, I double checked just to make sure Oracle did not override my entry (which seems to happen more often in 11g).


SQL: @ipx %latch%class%1

Database: prod18                                     21-JAN-10 01:51pm
Report:   ipx.sql       OSM by OraPub, Inc.          Page       1
                   Display ALL Instance Parameters

Instance Parameter and Value                       Description          Dflt?
-------------------------------------------------- -------------------- -----
_latch_class_1    = 250                            latch class 1        FALSE


After the load was reapplied and stabilized, I ran another response time report. Here it is.


SQL: @rtpctx
Remember: This report must be run twice so both the initial and
final values are available. If no output, press ENTER about nine times.

Database: prod18                                        21-JAN-10 02:01pm
Report:   rtpctx.sql         OSM by OraPub, Inc.        Page       1
                  Response Time Activity (61 sec interval)

                                       Time               Avg Time    Wait
RT Component                           (sec)      % RT    Waited (ms) Count(k)
-------------------------------------- ---------- ------- ----------- --------
CPU consumption: Oracle SP + BG procs     181.747   81.10       0.000     0
latch: cache buffers chains                42.140   18.80      12.113     3
control file parallel write                 0.120    0.05       6.000     0
latch free                                  0.070    0.03       3.889     0
db file sequential read                     0.060    0.03       0.571     0
os thread startup                           0.050    0.02      25.000     0
log file parallel write                     0.020    0.01       0.357     0
log file sync                               0.010    0.00       3.333     0

Instance Workload Statistic              Interval Work Interval Work/Sec
---------------------------------------- ------------- -----------------
user calls                                          43             0.705
session logical reads                          7168604        117518.098
physical read IO requests                          122             2.000
execute count                                      592             9.705


Let look closely at what just happened. Because the CBC latch is stressed when there is a logical IO load, when the CBC latching is being optimized logical IO is a good way to see if Oracle is performing more work. Here's a recap:

spin count  cpu time %    CBC sleep time %    LIO/s  CPU RQ
----------  -------- ---  -------------- ---  -----  ------
      2000       180 79%              47 21%  109.8  4 to 8
       250       182 81%              42 19%  117.5  2 to 4

The results are actually very encouraging. But you have to look further than just the time! Given nearly the same resource consumption and the same response time breakdown, Oracle performed 7% more work and the operating system CPU run queue is looking great! What we have done is made Oracle more efficient. That is, it is taking the same resources to process more work. If you're a response time and queuing theory buff, what we effectively done is reduced the service time (when represented as CPU time consumed to process a logical IO). This drops and shifts the response time curve to the right allowing more work to be performed without increasing the database server hardware power.

To recap: Usually systems with severe latching issues are CPU bound. One way to reduce the spinning and therefore reduced CPU consumption or/and increased work performed is to reduce the spin count. But reducing the instance-wide _spin_count parameter is not a precision change and the risk of collateral damage is very real. Using latch classes we are able to surgically alter how processes spin for a specific latch. To put in real simple terms, with latch classes Oracle has given us a scalpel as opposed to an axe.

Thanks for reading and I look to hearing from you.

Craig.