Friday, April 30, 2010

Parallelism introduces limits - Part 3

We all know in our gut that increasing the number of batch process parallel streams has its limits. In fact, we have a hunch that if we get carried away and spawn, say, 100 parallel streams, the batch process could actually take longer. Why this "diminishing returns" effect occurs is what I want to explore.

In Part 2 of this series I introduced a plot showing batch process duration versus the number of parallel streams. This fairyland best-case-scenario is shown as the blue line in the below plot.


The general relationship between streams, workload, and process duration is:

duration = total workload / parallel streams

But reality can be much different, as shown by the red line in the above plot. What occurs in reality is that with each additional parallel stream, the system effectively receives only a fraction of a full stream's benefit. So when two parallel streams exist, the system may effectively receive only 1.75 streams, not 2. And when three parallel streams exist, the system may effectively receive only 2.31 streams, not 3.
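The effective-stream figures above happen to be consistent with a simple geometric scalability model, in which each added stream contributes only a fixed fraction of the previous stream's effective power. This is just one way to model the loss (the factor 0.75 is an assumed illustration value, not something the text specifies):

```python
def effective_streams(n, f=0.75):
    """Effective number of streams for n physical streams under a
    geometric scalability model: stream i contributes f**(i-1) of a
    full stream's power. f = 0.75 is an assumed illustration value."""
    return sum(f ** i for i in range(n))

print(effective_streams(2))  # 1.75
print(effective_streams(3))  # 2.3125, i.e. roughly 2.31
```

With f = 1.0 the model collapses back to the fairyland blue line, where every stream delivers its full benefit.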

Why this loss in effective processing power? This is a vast topic centered around scalability. I want to limit our exploration by focusing only on increasing the number of batch processing parallel streams.

But if you want to dig into scalability and explore the benefits of, say, adding additional CPU cores, then I encourage you to read the Scalability chapter in my book, Forecasting Oracle Performance. There is also an interesting spreadsheet anyone can download for free, full of fascinating scalability-related math, which also includes the basis for determining the CPU core scalability of your system. But this is not the focus of today.

There are a surprising number of reasons why we can't just keep adding additional batch process parallel streams and expect the duration to continually improve.

First, remember that our best case scenario is the blue line in the above plot. Even in our best case scenario adding more than ten streams is not likely to produce a significant improvement.

Second, when we parallelize, a serial process must be split into multiple parallel executing processes. There is overhead, that is, time is consumed in splitting the serial process. This becomes especially challenging (and extremely frustrating) when the process was initially designed to run serially. I'm sure there are books written on this subject alone.

Third, when we parallelize, the likelihood of concurrency issues increases. The parallel processes may be referencing rows in the same blocks or tables (sure, Oracle has row-level locking, but buffers can still be busy...), they may be requesting duplicate full table scans (though Oracle, we hope, will determine the optimal execution plan), and there may be control structures like sequence numbers or status codes that must be maintained. My point here is that when the process is run serially, none of this is an issue; there is no chance of a concurrency issue. If we start increasing the number of parallel streams, we could push the unique combination of Oracle's configuration, the power of the operating system, and the design of the application into manifesting various enqueue, buffer busy, read by other session, or latch free waits. When altering a process that was initially designed to run serially, we have to be very careful about introducing concurrency issues.

Fourth, when we parallelize we could introduce processing dependencies that did not exist before. Perhaps steps A, B, and C must complete before step D can begin. In this scenario, the parallel streams processing steps A, B, and C must somehow know to stop and wait (ouch!) before step D begins. A dependency-laden batch process will severely limit the potential advantages gained by parallelization. It also severely limits the number of parallel streams that will make a real difference in the batch duration.
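As a sketch of how such a dependency forces a wait, here is a hypothetical barrier in Python: steps A, B, and C run in parallel streams, and step D cannot begin until all three have completed. The run_step function is a stand-in for real batch work:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_step(name):
    # Placeholder for a real batch step (hypothetical work)
    return f"step {name} complete"

with ThreadPoolExecutor(max_workers=3) as pool:
    abc = [pool.submit(run_step, s) for s in ("A", "B", "C")]
    wait(abc)                 # the barrier: streams sit idle here (ouch!)
    d_result = run_step("D")  # D may only begin after A, B, and C finish

print(d_result)  # step D complete
```

Every minute a stream spends parked at that wait() call is a minute of parallel capacity thrown away.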

Fifth, it follows that if we split a process into multiple parallel streams there may be the need to join/merge the results together once again. This is highly application dependent, but it could represent a significant amount of time that simply did not exist when running the process serially. If this join/merge operation must occur, it is also highly likely dependency issues must be controlled.

I'm sure there are other performance inhibitors not listed above, but I think you get the idea: adding additional parallel streams may be technically challenging (if not impossible) and will likely force additional code to be run.

Let's interject some numbers. Suppose a serial batch process takes 2 hours to complete. Some initial parallelization strategy designs indicate that running six parallel streams will introduce about 5 minutes of process splitting time, 5 minutes of concurrency contention, 10 minutes of process dependency contention, and 10 minutes to merge the final results. So while we are technically able to run the process in six parallel streams, we have introduced 30 minutes of additional time as a direct result of our parallelization efforts. In our six-stream parallelization dream-world, a 2 hour serial process will complete in 20 minutes (120 min / 6 streams). Injecting that estimated 30 minutes of reality, the actual six-stream process duration will be closer to 50 minutes (20 min + 30 min). So instead of the serial process taking 120 minutes, when running as six parallel streams we expect it to run in around 50 minutes, which is less than half the time. Perhaps this is a fantastic result and just what is needed.

But the decisions are just beginning. How confident are we in our parallelization estimates? What is the impact if we are wrong? How much time will it take the developers and DBAs to parallelize the process? And of course there is the nagging question, "Will the system be able to handle the increased workload when we let loose six parallel streams?" The increased workload caused by increasing parallelization will be the topic of exploration in my next blog entry.

So while increasing parallelization can produce amazing results, it does introduce and intensify factors that we may not have had to deal with when the batch process ran serially.

Thanks for walking with me through this parallelism journey,

Craig.


P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at craig@orapub.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.

Tuesday, April 13, 2010

Parallelization vs duration - Part 2

Suppose there is a batch process running in a single serial processing stream. For some good reason, the serial process is re-architected to run as two parallel processes, that is in two processing streams. We would expect the modified batch process to finish in about half the time because we have doubled the number of parallel streams allowing the work to flow into the system at twice the rate.

Let's quantify this so it becomes a little more practical. Suppose a serial batch process runs in 50 minutes and processes 100 transactions. There are actually three variables in this scenario. The first is the number of parallel streams, which is 1. The second is the duration, which is 50 minutes. And the third variable is the total workload, which is 100 transactions. The general relationship between these three variables is:

duration = total workload  / parallel streams

Let's think about this for a bit. If the total workload increases while the number of parallel streams remains constant, the duration will increase. And if the total workload remains the same yet we increase the number of parallel streams, the duration will decrease. So it follows that if it takes 50 minutes to process 100 transactions with a single stream, then it will take 25 minutes to process those same 100 transactions with two streams. If three streams are used, it will take around 17 minutes; four streams, 12.5 minutes; and five streams, 10 minutes. I created a simple plot based on this situation as shown below.
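The plot's data points follow directly from the formula; here is a quick sketch to reproduce them:

```python
serial_duration = 50.0  # minutes to process 100 transactions, one stream

# Best-case duration for one through five parallel streams
durations = {streams: round(serial_duration / streams, 1)
             for streams in range(1, 6)}

print(durations)  # {1: 50.0, 2: 25.0, 3: 16.7, 4: 12.5, 5: 10.0}
```

Plotting these points produces the hyperbolic (not linear) curve discussed below.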


There are a few things I want to highlight. First, the resulting plot of duration versus parallel streams is not linear. This can be easily seen in the above plot. This is typically surprising because there are no variables raised to a power greater than one. But if you think about it, it does make sense. When going from two streams to three, we are not doubling the work flow; we are increasing it by only half. To double the work flow, we would need to double the parallel streams, from two to four, or from three to six, and so on.

The second thing to notice is that the duration benefit per parallel stream decreases as we add additional streams. Looking closely at the above plot, we can see that once you get beyond, let's say, ten parallel streams, it will take quite a few more streams to significantly reduce the process duration.
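We can quantify this shrinking payoff by computing the minutes saved by each additional stream under the best-case formula:

```python
serial_duration = 50.0  # minutes with a single stream

def saved_by_next_stream(n):
    """Minutes saved by going from n to n+1 streams (best case)."""
    return serial_duration / n - serial_duration / (n + 1)

print(round(saved_by_next_stream(1), 2))   # 25.0 minutes: a huge payoff
print(round(saved_by_next_stream(10), 2))  # 0.45 minutes: barely noticeable
```

The second stream saves 25 minutes, while the eleventh saves under half a minute, and that is before any parallelization overhead is charged against it.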

The third interesting aspect of this situation is that the computing system must be able to handle the increased workload rate. Keep in mind that if we double the number of pipes, the workload will come rushing into the computing system at twice the rate. The system must be able to handle this workload increase or we will lose the benefit of the multiple streams. This is the classic rainy season or springtime snowmelt river overflow situation, where the increased water flow cannot be "processed" quickly enough. This topic will be explored in a future blog entry.

The fourth interesting aspect of this situation is that it is a best case scenario. We are getting the same amount of work processed per parallel stream. That is, there is no overhead or loss of power as we keep adding parallel streams. But we know in our gut this will not occur... and for a surprisingly large number of reasons. This will be the topic of my next blog entry.

So to summarize this entry, increasing parallelism will decrease process duration because we are allowing the same amount of work to enter the system within a shorter time span. So instead of 100 transactions entering the system in 50 minutes, we are allowing those same 100 transactions to enter the system in 25 minutes. I also mentioned this is a best case scenario and that there are a number of reasons which limit the continuing benefit of increasing parallelism.

Thanks for walking with me through this parallelism journey,

Craig.



Friday, April 2, 2010

The parallelization sweet spot - Part 1

Everyone in the world knows that if you want a long process to complete sooner, you should consider parallelization. For example, businesses run their systems on multiple divisional computers, DBAs run imports user by user, soda drinkers use multiple straws, multiple iPhone application updates are downloaded at the same time, and Oracle has a variety of parallelization schemes such as multiple background processes, parallel query, and even RAC. So parallelization is not something foreign to us. In fact, it seems rather natural.

But everyone also knows in their gut that diminishing returns plays a part. That is, parallelization benefits slowly decrease as the parallelization increases. And in fact, at some point the task may take longer to complete because we got carried away with parallelization. So the lurking question is, what is the optimal number of parallel streams that will minimize the task duration? This is what I want to ponder over the next few weeks.

My goal is for you to successfully select a batch process that is taking too long, consider how you might parallelize it, gather classic performance statistics, and determine the optimal number of parallel streams, that is, the parallelization sweet spot.

My plan is to take this step by step, digging deeper and deeper as we go. The topics I'm sure I'll touch on will be scalability limitations of various kinds, process and transaction response time and elapsed time, utilization, gathering performance data, developing a mathematical model to help us understand and anticipate the value of our strategies, and combining activity from both the CPU and IO subsystems. That's a lot to cover, but we'll work through this together step-by-step.

Thanks for joining me in this quest!

Craig.

