Wednesday, June 2, 2010

Altering Oracle DML batch sizes (log file sync) - Part 2

If you have encountered the Oracle Database wait event log file sync before, you know it can be very difficult to resolve. And even when a solution technically makes sense it can be very challenging to explain to others and to anticipate the impact of your proposed solution. In part one of this series I introduced the log file sync wait event and presented an experiment that demonstrated the insert rate increases as the number of inserts per commit increases. I call the number of inserts per commit, the batch size. In this blog entry I'll be exploring the components of time during each insert. So let's get started!

Breaking Down Time

As the experiment in Part 1 demonstrated, increasing the number of inserts per commit allows the system to absorb inserts at a faster rate. But to really grasp why a performance change improves the situation, we need to understand how the components of time changed. A fantastic way begin this journey is to break down time situation into three core elements: What is known as the arrival rate, the service time, and the queue time.

If you are new to this type of analysis, pay close attention because this is the aroma of how we can numerically anticipate the impact of our proposed change, which in this case is increasing the inserts per commit.

So here we go... The arrival rate is some unit of work over some meaningful unit of time, such as inserts/ms. You can think of the arrival rate as the workload, such as transactions per second, user calls per second, inserts per second, or logical IOs per second. The service time is some unit of time over some unit of work, and for Oracle performance analysis, it works wonderfully when the service time is CPU consumed over some unit of work, such as ms/insert. The queue time can be represented by the non-idle wait time per unit of work, such as ms/insert. When we sum the CPU time per insert and the non-idle wait time per insert, we have what is called the response time, in terms of time per insert (e.g., ms/insert).

Our goal as a performance analyst is to reduce the response time, not only the service time or only the non-idle wait time (that is, the queue time). The response time is directly related to how long a batch job runs, what the system is experiencing, and many times what the user is experiencing. Another added benefit of enumerating these characteristics is we can plot the situation graphically. I'll save the plotting for the next blog entry in this series.

So let's look at the actual experimental results in terms of the arrival rate, the service time, and the queue time. Here they are below.

 Batch Arrival Rate       % Service Time       %   Queue Time       % Respone Time       %
  Size (inserts/ms)  Change (ms/inserts)  Change (ms/inserts)  Change (ms/inserts)  Change
------ ------------ ------- ------------ ------- ------------ ------- ------------ -------
     1      10.8445    0.00       0.0914    0.00       0.1356   -0.00       0.2270    0.00
     2      13.8570   27.78       0.0722  -21.01       0.1071  -20.96       0.1794  -20.98
     4      16.2470   49.82       0.0613  -32.95       0.0914  -32.61       0.1527  -32.75
     8      17.7815   63.97       0.0564  -38.34       0.0855  -36.91       0.1419  -37.49
    16      18.7664   73.05       0.0537  -41.23       0.0785  -42.11       0.1322  -41.76
    32      19.9793   84.23       0.0523  -42.76       0.0260  -80.81       0.0784  -65.48
    64      20.2438   86.67       0.0515  -43.64       0.0177  -86.92       0.0693  -69.49
   128      20.2650   86.87       0.0511  -44.10       0.0140  -89.69       0.0651  -71.33
   256      20.4564   88.63       0.0507  -44.57       0.0115  -91.54       0.0621  -72.62
   512      20.5789   89.76       0.0499  -45.46       0.0098  -92.77       0.0597  -73.71
  1024      20.6183   90.13       0.0501  -45.27       0.0172  -87.33       0.0672  -70.39
  2048      20.6536   90.45       0.0500  -45.35       0.0152  -88.75       0.0652  -71.27
  4096      20.5930   89.89       0.0498  -45.53       0.0145  -89.31       0.0643  -71.67
  8192      20.6755   90.65       0.0495  -45.83       0.0119  -91.22       0.0614  -72.93
 16384      20.7953   91.76       0.0517  -43.43       0.0177  -86.98       0.0694  -69.44
 32768      20.8777   92.52       0.0494  -46.03       0.0112  -91.72       0.0606  -73.31

Looking at the above numbers, you'll notice that by simply increasing the batch size from 1 to 4 the arrival rate (measured in inserts/ms) increases by 50%. This means the system was able to absorb 50% more work simply because the inserts per commit was increased from 1 to 4. Second, the associated CPU consumption (i.e., service time), non-idle wait time (i.e., queue time), and the response time per insert was reduced by 30%. By increasing the inserts per commit from 1 to 32, the work absorbed increased by 84% while at the same the time response time decreased by 65%. These are significant numbers and is why we witnessed such a dramatic drop in elapsed time when inserting 900000 rows.

But more than just big time changes, the breakdown of time in a response time-like format allows us to further understand what occurred and then explain that to others. So why did the arrival rate, service time, and queue time numbers change? If you have a good understanding of Oracle internals you'll probably be able to figure this out.

The service time decreased because, in part, Oracle was able to better batch redo entries achieving increased efficiency and thereby reduce the commit overhead per insert. This reduces the CPU consumption per insert, so the service time decreased. The queue time decreased because, with fewer commits per second the total commit time per insert was less which means there is less log file sync time. And let's not forget that there is also less physical IO being written by the log writer as well. Having less log file sync time means there is a decreased likelihood that log file sync will be the top wait event. The really amazing part is with the reduction in both service time and queue time, the system was able to process more inserts per second.

Caution: By the way, if you're used to plotting response time graphs, you'll be tempted to plot the above table with the arrival rate on the horizontal axis and the response time on the vertical axis. But be careful, because the arrival rate is changing because of the batch size increase not because we are pushing more work into the system. If you want to plot the response time graph based on experimental data, you'll need to set the batch size and increase the workload while sample the response time components.

So in this entry we broke down insert time into CPU time (service time), non-idle wait time (queue time), and the associated workload (arrival rate).  Plus we looked into why increasing the batch size alters each component.

In the next blog entry of this series, I'll introduce how to plot a graph representing the situation, which will set us up for anticipating the performance change before actually altering the batch size.

Thanks for reading!

Craig.



P.S. If you want me to respond with a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

No comments:

Post a Comment