Let's quantify this so it becomes a little more practical. Suppose a serial batch process runs in 50 minutes and processes 100 transactions. There are actually three variables in this scenario. The first is the number of parallel streams, which is 1. The second is the duration, which is 50 minutes. And the third variable is the total workload, which is 100 transactions. The general relationship between these three variables is:
duration = total workload / parallel streams
Let's think about this for a bit. If the total workload increases while the number of parallel streams remains constant, the duration will increase. And if the total workload remains the same yet we increase the number of parallel streams, the duration will decrease. So it follows that if it takes 50 minutes to process 100 transactions with a single stream, then it will take 25 minutes to process those same 100 transactions with two streams. If three streams are used, it will take around 17 minutes; four streams, 12.5 minutes; and five streams, 10 minutes. I created a simple plot based on this situation as shown below.
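These durations are easy to check with a few lines of code. This is a minimal sketch of the ideal relationship, scaled by the measured single-stream baseline of 50 minutes for the full 100-transaction workload (the function name and structure are my own, just for illustration):

```python
# Ideal duration model: doubling the parallel streams halves the duration.
baseline_minutes = 50.0  # measured time for ONE stream to process all 100 transactions

def duration(parallel_streams, baseline=baseline_minutes):
    """Ideal duration when the workload is split evenly across streams."""
    return baseline / parallel_streams

for streams in range(1, 6):
    print(f"{streams} stream(s): {duration(streams):.1f} minutes")
```

Running this reproduces the numbers above: 50, 25, about 17, 12.5, and 10 minutes.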
There are a few things I want to highlight. First, the resulting plot of duration versus parallel streams is not linear. This can be easily seen in the above plot. This is typically surprising because there are no variables raised to a power greater than one. But if you think about it, it does make sense. When going from two streams to three, we are not doubling the work flow; we are increasing it by half, since we are adding one stream on top of two. To double the work flow, we would need to double the parallel streams: from two to four, or from three to six, and so on.
The second thing to notice is that the duration benefit per additional parallel stream decreases as we add streams. Looking closely at the above plot, we can see that once you get beyond, let's say, ten parallel streams, it takes quite a few more streams to significantly reduce the process duration.
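We can put a number on this diminishing benefit. The sketch below (my own illustrative function names, using the same 50-minute baseline) computes how many minutes are saved by adding one more stream at various points on the curve:

```python
baseline_minutes = 50.0  # single-stream duration for the full workload

def marginal_benefit(n, baseline=baseline_minutes):
    """Minutes saved by going from n parallel streams to n + 1."""
    return baseline / n - baseline / (n + 1)

for n in (1, 2, 5, 10, 20):
    print(f"{n} -> {n + 1} streams saves {marginal_benefit(n):.2f} minutes")
```

Going from one stream to two saves 25 minutes, but going from ten to eleven saves less than half a minute, which is exactly the flattening we see in the plot.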
The third interesting aspect of this situation is that the computing system must be able to handle the increased workload rate. Keep in mind that if we double the number of pipes, the workload will come rushing into the computing system at twice the rate. The system must be able to handle this workload increase or we will lose the benefit of the multiple streams. This is the classic rainy-season or springtime-snowmelt river-overflow situation, where the increased water flow cannot be "processed" quickly enough. This topic will be explored in a future blog entry.
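The "twice the rate" claim follows directly from the duration formula, since rate is just workload divided by duration. A quick sketch, again using the 50-minute baseline and illustrative names of my own:

```python
workload = 100           # transactions
baseline_minutes = 50.0  # single-stream duration

def arrival_rate(parallel_streams):
    """Transactions per minute entering the system: workload / duration."""
    duration = baseline_minutes / parallel_streams
    return workload / duration

print(arrival_rate(1))  # one stream: 2.0 transactions per minute
print(arrival_rate(2))  # two streams: 4.0 transactions per minute
```

Doubling the streams doubles the arrival rate, and the downstream system must absorb it.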
The fourth interesting aspect of this situation is that it is a best-case scenario. We are getting the same amount of work processed per parallel stream; that is, there is no overhead or loss of power as we keep adding parallel streams. But we know in our gut this will not occur... and for a surprisingly large number of reasons. This will be the topic of my next blog entry.
So to summarize this entry: increasing parallelism decreases process duration because we are allowing the same amount of work to enter the system within a shorter time span. So instead of 100 transactions entering the system in 50 minutes, we are allowing those same 100 transactions to enter the system in 25 minutes. I also mentioned this is a best-case scenario, and there are a number of reasons that limit the continuing benefit of increasing parallelism.
Thanks for walking with me through this parallelism journey,
P.S. If you want me to respond to a comment or have a question, please feel free to email me directly at email@example.com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email.