Wednesday, September 7, 2011

Anticipating SQL Elapsed Times

Getting More Practical

Many of my postings can be considered quite theoretical. But in this posting I'm going to make a clear application of quantitatively modeling response time and understanding why Oracle systems behave as Operations Research queuing theory state (which was the topic of my previous posting).

In this posting I'm building from two recent posts. In July I demonstrated that tuning Oracle can reduce the CPU and the total time it takes to process a single LIO. In August I demonstrated that a standard Operations Research queuing theory quantitative model can be used to anticipate the time it takes to process a single LIO. Now it's time to go from micro to macro!

In this posting I will develop a simple SQL statement elapsed time model that takes two input parameters; the time to process a single piece of work (e.g., 1 ms per logical IO) and also the number of pieces of work to process (e.g., 1000 logical IOs). As you will see, what we know from our personal experience will play out in this quantitative model; adding confidence that Oracle systems do in fact behave as Operations Research queuing theory would have us believe.

Very Simple SQL Elapsed Time Model

In my July posting, I introduced a very simple SQL statement elapsed time model. The model shows that the time to run a SQL statement can be represented as the amount of work to be processed multiplied by the time it takes to process each piece of work. (This is a very simplistic and limited model.) For example, if a SQL statement must process 1000 LIOs and each LIO takes 1.0 ms, then the elapsed time will be 1,000 ms.

To make this a little more elegant, let's create an equation.

E = Work X Time per work

where:

E is the elapsed time (e.g., 1 second)
Work is the amount of work to be completed (e.g., 1000 LIOs)
Time per Work is the time it takes to process a single piece of work (e.g., 1.0 ms/LIO)

This simple (and limited) model indicates we can reduce a statement's elapsed time by reducing either or both the amount of work to be processed or the time to process a single piece of work. Before we get into the actual experiment, let's first review each of these terms.

Important: This is key: I define response time as the time to process a single piece of work. Elapsed time is the time to process many pieces of work.

Representing and quantifying work.

The amount of work to be processed is very straightforward. We can measure this by gathering statistics from v$sesstat, SQL tracing, and perhaps other ways. How we decide to represent work, is typically based on what constrains the SQL statement elapsed time. For example, if a SQL statement must process 1M LIOs and 1K PIOs (block reads) and there is a raging CPU bottleneck, choosing LIOs to represent the work is probably a good choice. But if there is a raging IO read bottleneck, then choosing PIOs is probably a better choice. This topic is known as choosing the unit of work and was presented in my previous posting's section Unit of Work Time. I have discussed this in many of blog postings and cover it extensively in my Advanced Oracle Performance Analysis class.

While I'm not going to detail my experimental results in this posting, as you might expect, if you reduce the amount of work to be processed by 50% you are likely to see a similar reduction in elapsed time. However, I do present experimental evidence of this occurring in my conference presentation, SQL Elapsed Time Analysis.

Quantifying Time per Work.

Our models work very nicely if we classify SQL elapsed time into either CPU consumed or non-idle wait time. As you may know, it is rather simple to gather this time from either the instance statistic (v$sesstat) or the time system model (v$ses_time_model) views. To determine the time to process a single piece of work, we simply divide the total time (i.e., cpu + non-idle wait time) by the total amount of work (e.g., 1000 LIOs). This can be done for a specific SQL statement, a process, or over an interval of time (think: Statspack/AWR).

In my previous posting, I detailed how tuning Oracle reduced the response time for a LIO and how that response time reduction occurred as Operations Research queuing theory states.

In this posting, I'm going to focus on the impact of response time on a SQL staetment's elapsed time. Specifically, I want to experimentally and demonstrably see if reducing response time by 50% will reduce the elapsed of a SQL statement also by 50%... just as the above simple elapsed time model indicates.

Experimental Setup

For this posting, I am going to use the exact same data set I used in my previous blog entry. While I didn't present this in the previous posting, in addition to gathering the response time related metrics, I also gathered the elapsed time for a very LIO dependent SQL statement. All the experimental data is included in the Mathematica analysis notebook and can be viewed and download from my previous posting. The data collection script can also be viewed on-line from the previous posting. Link to previous posting.

But to summarize the situation: I created a massive CPU bottleneck by having a number of Oracle sessions run SQL where all their blocks reside in the buffer cache, that is the SQL is logical IO (v$sysstat: session logical reads) dependent. I gathered 30 ten minute samples with the CBC latches set to 256 and then to 32768. During these ten minute collection periods, I sampled the elapsed time of a specific LIO dependent SQL statement.  With 256 CBC latches, around 23 elapsed times where collected. With 32768 CBC latches, around 71 elapsed times where collected. The difference in the number of elapsed time samples was the result of the SQL completing sooner when there was 32768 CBC latches. The SQL elapsed times where gathered using the OraPub SQL Elapsed Time Sampler (beta 3c).

Experimental Results Analyzed
Figure 1.
Figure 1 contains a summary of the experimental results. The Instance Rt (ms/lio) is simply the total time (CPU time plus non-idle wait time) divided by the total number of LIOs processed during the sample interval. The results are stunningly straightforward: By adding additional CBC latches to combat the massive CBC latch contention, the instance LIO response time decreased by 85% and the elapsed time for the specific LIO dependent SQL statement being monitored decreased by 86%. That is truly amazing!

Will this always occur? For this particular SQL statement probably YES, but for other SQL statements, it depends. The key is understanding what resource is in short supply and understanding if the SQL statement is dependent on this resource. Because there was massive CBC latch contention caused by an intense LIO workload and limited CPU power, I knew that SQL statements who performance was massively affected by LIO would therefore also have their elapsed time highly impacted by LIO response time. If I had selected a statement that was all about physical IO or inserts or updates, and not LIO, their performance (i.e., elapsed time) would likely not have been effected.

This may sound very theoretical but you probably can relate this type of situation to your personal Oracle performance experiences. Relating the OS bottleneck to both the Oracle instance situation and the application SQL is extremely important to more fully understanding the performance situation and also to derive multiple appropriate solutions. For example, if there is a massive IO read bottleneck, you can expect SQL that is heavily IO read centric to be affected while LIO dependent SQL will likely not be affected. I discuss this in the Methods & Madness chapter (1) in my book entitled, Oracle Performance Firefighting. This posting helps build a quantitative framework for what we have experienced in real production Oracle systems.

In Summary

Understanding the experimental results should at a minimum add confidence to what you have likely already experienced but hopefully also opens the door towards anticipating SQL statement elapsed times when tuning either Oracle (reducing response time) or tuning a SQL statement (reducing work to process).

In this short posting, I've demonstrated a heavily LIO dependent SQL statement's elapsed was reduced nearly the same percentage as the LIO response time reduction. The LIO response time reduction was due to adding additional CBC latches.

A broder application of this posting is we can increase SQL statement performance by tuning it (i.e., reduce the pieces of work it must process) and also by tuning Oracle (i.e., reduce LIO response time). While not mentioned in this entry, it should make sense that an OS centric solution would be use faster CPUs as this would also reduce LIO response time! Notice, we have a solution from an Oracle, an application, and an OS perspective. I call this approach the OraPub 3-Circle Method.

Thanks for reading!

Craig.

If you enjoyed this blog entry, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond with a comment or you have a question, please feel free to email me directly at craig@orapub .com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. Another option is to send an email to OraPub's general email address, which is currently orapub.general@gmail .com.