Friday, May 28, 2010

Altering insert commit batch size - Part 1

If you have encountered the log file sync wait event before, you know it can be very difficult to resolve. And even when a solution technically makes sense it can be very challenging to explain to others and to anticipate the impact of your proposed solution. Today I want to start exploring one of the typical log file sync solutions, including diving deep into one of the most common causes and solutions, breaking down the time components to better understand how to derive spot-on solutions, how to graphically show the situation, and how to anticipate your solution benefits before you deploy your solution in a production environment. This is a fairly large and complex topic, so I'm going to break it up into a multiple blog entry series. Thanks for joining me in this journey and let's get started!

Getting Started

But first, just what is a log file sync? It is a wait event that a server process posts when it is waiting for its commit to complete. So it's transaction commit time from an Oracle server process perspective. If you want to track commit times, then track log file sync wait times.

Every commit contains some overhead, stuff that must occur regardless of how much redo is to be written. Besides the log writer having to immediately write, various memory structures must be updated, and process communication must occur. The log writer likes to pack the log buffer blocks (sized in operating system sized blocks) full of redo entries before it writes. This makes writing very efficient. In other words, if commits are rapidly occurring the log writer is unable to fully pack each block. So if you want to increase the overhead per dml statement, just do commits very, very rapidly.

In fact, the most common root cause for log file sync being the top wait event is rapid application commits. For example, you might find the code causing the problem is essentially doing something like this:

define
  i number
begin
  for i 1..10000
    insert into t1 values (c1,c2,c3);
    commit;
  end loop;
end;

The overly careful and most likely undertrained developer wanted to ensure Oracle committed after each insert. After all, they are thinking, "If I don't ensure the commit, then it may never happen!" By simply moving the commit outside of the loop, log file sync could easily no longer be the top wait event.

So one logical solution is to reduce the number of commits while still completing the same amount of work and perhaps even more! But does this really improve performance? And how? And can we anticipate the performance improvement before we actually change the production code? I decided to create an experiment.

The Experimental Results

The actual experimental code is too long to include in this blog, but can be downloaded here. You can't simply run the file. Take a close look at it. You'll notice a few grants must occur, a few objects must be created, and then near the bottom is some SQL to run the experiment, reports to view the results, and finally the raw experimental data. It's all there!

Here's the experiment design. To summarize, the experiment simply collects the amount of a single server process and instance CPU time, instance non-idle wait time, and the single server process elapsed time to insert 900000 rows at batch sizes from 1 to 16384. The batch size is the number of inserts per commit. The batch size  was altered while ensuring a total of 900000 rows was inserted for each test. I ran each batch size test 35 times. Each experimental figure I refer to in this series is based on a 35 sample average. The experiment was run on a single quad core Intel chip in a Dell computer running Oracle 11gR2 on Linux.

What can we expect? Well for starters, we hope that a larger batch size (that is, rows per commit) will help increase the efficiency of each log writer write, that is, reduce the overhead per insert. Plus other commit related overhead will be reduced. So our hope is the elapsed time to complete the 900000 inserts will decrease as the batch size increases. Here are the results in table format followed by a partial data plot.

Batch      Total             Elapsed       %
  Size       Rows Samples  Time (sec)  Change
------ ---------- ------- ----------- -------
     1     900000      35     82.9945   -0.00
     2     900000      35     64.9540  -21.74
     4     900000      35     55.4025  -33.25
     8     900000      35     50.6216  -39.01
    16     900000      35     47.9620  -42.21
    32     900000      35     45.0470  -45.72
    64     900000      35     44.4585  -46.43
   128     900000      35     44.4191  -46.48
   256     900000      35     43.9992  -46.99
   512     900000      35     43.7342  -47.30
  1024     900000      35     43.6555  -47.40
  2048     900000      35     43.5763  -47.49
  4096     900000      35     43.7083  -47.34
  8192     900000      35     43.5302  -47.55
 16384     900000      35     43.2844  -47.85
 32768     900000      35     43.1086  -48.06



It's pretty obvious from this experiment that increasing the batch size (inserts per commit) made a massive elapsed time difference. In fact, increasing the rows per commit from 1 to only 32 resulted in a 47% elapsed time decrease! However after this, the batch size increase benefit begins to dramatically decrease to where once we hit around 33000 rows per commit only an additional 3% elapsed time decrease was seen. So creating batches in the hundreds or thousands did not make a significant difference in this experiment.

So this experiment demonstrated that (in this case) increasing the number of inserts per commit increased the number of inserts that can occur over a fixed period of time.

In part two I will break down the components of the insert time. This will help us to understand how and why batching the inserts improves performance and also sets us up to anticipate the improvement before any changes are actually made.

Here's a link or two to get the juices flowing...

Thanks for reading!

Craig.



P.S. If you want me to respond with a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Tuesday, May 25, 2010

Posted remaining 2010 training schedule

I know many of you don't receive my email updates, so I figured I had better post the details about my upcoming training on my blog.

I just posted my training course schedule for the remainder of 2010. As you can see, I'm going be traveling quite a bit! If you're interested in attending one of my training courses, the main schedule (w/registration links) can be found at either:

http://training.orapub.com
http://resources.orapub.com

If you have taken one of my courses within the past three years, you can receive a 50% discount for a re-take. I'm always updating my courses and most people are surprised by the number of re-takes. Email me for the discount code.

Here's a look at where I'll be teaching by course:

Oracle Performance Firefighting 2010. October in Philadelphia and November on the West Coast of North America. I'm still not decided on a West Coast location. It will probably be Irvine, Santa Clara, or Vancouver (BC). Email me if you have a preference.

Advanced Oracle Performance Analysis. October in Philadelphia and November on the West Coast of North America.

Oracle Forecasting & Predictive Analysis. August in Washington, DC. This is a very special offering. The recession has all but killed this course (which I absolutely love to teach), but I've been getting enough interest lately that I decided to offer it in a typically central location; Washington, DC. I purposely selected a date that is before a major US holiday and before the craziness of the Fall begins.

2010 One-Day Performance Seminar. September in Tempe, AZ and Tampa, FL, plus possibly two other locations in December.

I frequently get asked about the differences in my courses, so here's a summary. Both the Oracle Performance Firefighting and the Advanced Oracle Performance analysis courses are focused on thriving in an existing intense performance battle. The firefighting course weaves together systematic and quantitative methods, diagnosis, Oracle internals, and deriving multiple spot-on solutions that make perfect sense. The advanced analysis course is focused on objectively and quantitatively ranking these solutions and being able to effectively communicate them to management. In contrast, my Oracle Forecasting & Predictive Analysis course is squarely focused on understanding how much your existing or proposed system can take before the wall is hit and the firefighting begins. In other words, you're working to avoid getting into a firefight in the first place. The one-day seminar is a combination of topics pulled from my courses and latest research packed together in a very intense format.

I know this seems way out into the future, but I'm already considering locations for 2011. If you would like me to teach in your area (or on-site at your company), just send me an email. And I always try and partner with local Oracle user groups because it's a good thing for everyone involved.

Hope to see you personally later this year!

Craig.


P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. 

Friday, May 7, 2010

Understanding user calls

The instance statistic user calls is frequently tossed around in casual DBA conversations. But it's like one of those big vocabulary words, corpuscle that's in sophisticated books that we hated to read as kids. And as a kid we were too afraid to ask and too lazy to look up the definition. But once we got a little older we figured out that most people are just as stupid and as lazy as we were. Oracle user calls is like one of those words, except that as DBAs it's very important to understand.

The challenge for DBAs is user calls shows up many reports, is frequently used in conversations, and is used during performance and predictive analysis. So unless you really, really know what the words mean you're back in school feeling kind of strange. And as you progress in your Oracle work you'll discover that user calls is frequently a good general metric for database activity as well as a predictor for CPU and IO activity.

But because I'm not always lazy though sometimes stupid, I actually looked up the word in the dictionary, which for us is Oracle documentation. And as you might have expected what I found is pretty much worthless (seriously, don't bother to look it up).

So I figured it was time to go beyond looking up the definition and do some actual research.

To start, consider our first clue, the word user. This is referring to an Oracle user process, which is more commonly called a client process. Client processes are like SQL*Plus or for that matter any process that gets Oracle related work performed by communicating and working in collaboration with an Oracle server process. So this posting is not about server processes or background processes, but only client, that is, user processes.

The second word is obviously call which is also very telling. When the user process makes a request to its server process for some work to be performed in the database, it is literally making a call. But since this call is originating from a user process it's a user call. (Don't mean to insult anyone here.)

So a user call is referring to a user process making a call to a server process to have some work done in the database on its behalf. It gets a little more interesting when we dig into the types of calls a user process makes and even more interesting when we start counting the calls.

While user processes can issue a connect to the database call, the calls that we as performance analysts care about are the classic parse, execute, and fetch calls. These three calls are what forces the server and background processes to perform a potentially tremendous amount of work while consuming operating system resources.

Let's first focus on the parse call. Whenever a SQL statement is run (a word I use to casually describe what happens when I press RETURN in SQL*Plus) it must first be parsed. I'm not going to get into the parsing details, but to say that regardless of the resulting hard, soft, or softer parse, a single call to the server process is performed. Therefore, a parse call results in only one user call.

How do I know this? I created two Oracle SQL*Plus sessions. In window #1 I connect as an application user and retrieve my session identifier (sid) by running something like select sid from v$mystat. Let's say the result is 12. I also know that the user call statistic number is 7 because from a DBA account I ran, select statistic# from v$statname where name='user calls'. From the second window I connect as a DBA and run select value from v$sesstat where sid=12 and statistic#=7. In my test case the application user had so far made 60 user calls. Now as the application user in window #1, I ran the never-before-run SQL statement select count(*) from customers where status='abaci' and then as the DBA account I re-ran select value from v$sesstat where sid=12 and statistic#=7 and observed the returned value was 63, which means 3 user calls had been performed. I then re-ran the application query and then re-ran the v$sesstat query, and once again the user process SQL*Plus made 3 users calls. Just to be sure, I did this a number of times and observed the exact same result.

This does NOT prove a select statement always makes 3 user calls. Because a select statement requires a parse, execution, and fetch call, this proves parsing (hard, soft, or softer), requires a single call.

To understand the number of calls related to executing and fetching, in the table below are my test results. This is important: The Rows column below is about the number of rows returned, not the number of rows processed.


If you take a few minutes to study the table you'll notice a few things:

1. There is always a single parse and single execute call. Even a commit and rollback call require a parse call and an execute call.

2. A select requires a parse and an execute call, but can make multiple fetch calls. The key to understanding this is realizing the number of fetches is from the user process perspective and not the server process perspective. While a server process may make many operating system requests for blocks outside of the Oracle buffer cache or crunch through millions of rows and buffers in the buffer cache, if the user process only requires two fetches to retrieve the results, then the number of user calls will be two!

As a side note, this is why when the SQL*Plus array size is increased from the default of 50 to the max of 250 (set arraysize 250) you will observe both the number of user calls and the number of SQL*Net roundtrips (statistic #557) proportionally decrease. The drop in user calls is because a single fetch call can now handle many more rows at once. The corresponding drop in SQL*Net activity is because each fetch call invokes a single SQL*Net round trip (send and receive).

However, if you look at the numbers closely, you'll notice the number of SQL*Net round trips is one less then you would expect. This is because SQL*Plus bundles the parse and execute calls into a single SQL*Net action. If you write a bogus SQL statement you'll notice that the parse and execute occur (2 calls) but there is no fetch... and there is only a single SQL*Net roundtrip, which means the parse and execute calls were bundled into a single SQL*Net roundtrip.

3. An update, deletecommit, and rollback statement will only make two user calls because the user process will never ever need to fetch rows from the server process. As with a select statement, the server process may be very busy, but from a user process perspective, when an update/delete/commit/rollback statement is run there is no reason to fetch any rows and therefore the number of fetches is zero!

I think the key to understanding a user call is to remember we're referring to an Oracle client/user process making a request to its server process. Period.

So now when you're standing around talking DBA-smack and someone throws out the word user call, you can look them straight in the eye and not be thinking, "Should I ask what a user call is?"

Oh... a corpuscle is a minute particle of matter.

Thanks for reading!

Craig.


P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.