Friday, June 10, 2011

Cores vs Threads...Part 3

Why This Is Very Important

As CPU subsystems become more complex and employ various methods to utilize multiple CPU cores and multiple threads per core, determining the CPU consumption requirements, the capacity, and the resulting utilization is not straightforward. But it is important we understand what the utilization tells us about our system. As an Oracle DBA, I'm OK with not knowing the specific OS utilization calculations, but I'm not OK with blindly stating a utilization figure without understanding what that means in relation to performance. Hence my quest...

This is the third and final (I hope) posting about differences in the operating system CPU utilization when determined by vmstat or by using the v$osstat CPU core approach. If you've been following this blog series, you know there can be a statistically significant utilization difference and, if there is, it can increase as the CPU subsystem gets busier. Also, seven out of the eight samples from production Oracle systems that I analyzed (some of them extremely busy systems) showed no real difference between the utilization methods. However, the one AIX sample (AG1) clearly showed a difference as the CPU utilization increased over 40%. Figure 1 below is the AG1 scatter plot of the utilizations versus the sample interval.
Figure 1.

Background

For your reference, the initial blog posting (April 22, 2011) presented general utilization, how to gather CPU utilization purely from v$ views, and then I stated that on a few occasions I have seen the utilizations from vmstat and v$osstat differ significantly. The second blog posting (May 13, 2011) presented the experimental results and the subsequent analysis from seven production Oracle systems. As mentioned above, all but one of the samples (AG1) showed no statistical difference between the utilizations (Oracle core based vs vmstat based). But I mentioned a concern I had: none of the samples were running with the utilization over 65%. Based on my comments, a reader contacted me and ended up gathering data on his system that was running between 90% and 100% CPU utilization. The analysis of this data set (AB1) was posted in the second "B" posting on May 26, 2011. Like most of the other samples, this very busy CPU subsystem showed no difference between the utilizations.

This final posting (at this point I think it is, anyway) focuses on how vmstat and Oracle core based CPU utilizations could result in different values. And if they do, is this something to be concerned about? If this is something you're interested in...Read on!

Deeper Into Requirements and Capacity

There is a very good reason why there can be a difference. But for my explanation to make any sense, we need to understand how utilization is calculated and especially so when threads are involved. Utilization is simply requirements divided by capacity. Mathematically this can be represented as:

U = R / C

It's important to understand that we are typically looking at a slice, interval, or snapshot of time. For example, 15 minutes or 1 hour. This requires two samples from our data source. Don't mean to insult anyone here, but the value we need is calculated by subtracting the initial value from the final value resulting in the delta or difference. This delta is what we typically (but not always) use in the calculation.
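To make the delta idea concrete, here is a minimal sketch of U = R / C computed from two samples. The counter values, the 15 minute interval, and the 4 core box are hypothetical; the only requirement is that busy time and capacity are expressed in the same time unit.

```python
def utilization(busy_start, busy_end, capacity):
    """Return U = R / C, where R is the busy-time delta between two samples."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    requirements = busy_end - busy_start  # final value minus initial value
    return requirements / capacity


# Hypothetical cumulative busy-time counters (seconds), sampled 15 minutes apart
# on a 4 core box: capacity = 900 seconds x 4 cores = 3600 core seconds.
u = utilization(busy_start=120_000, busy_end=121_800, capacity=900 * 4)
print(f"{u:.0%}")  # 1800 / 3600 -> 50%
```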

Requirements.

Requirements is simply the amount of time CPU resources are being consumed. It does not matter if the consumer is a thread or a process. If it is consuming CPU resources, then this time counts as "busy time." We can see this "busy time" via v$osstat. It is also encapsulated in vmstat. On all Linux/Unix systems (exception: HPUX) we can see the busy time in the /proc filesystem. I present this in my book, Oracle Performance Firefighting, but here is a nice link to an on-line source. Mathematically, we can represent the requirements as:

R = busy time  = interval time X number of CPU power consumers

If we differentiate between cores and threads, requirements can be something like:

R (thread based) = busy time = interval time X number of threads consuming CPU resources
R (core based)  = busy time = interval time X number of cores consuming CPU resources

As DBAs it is not our decision whether the operating system reports busy time a specific way. Someone else made that decision for us. ...it is what it is. But as we'll see below, the decision makes a profound difference in the final calculated utilization---for both vmstat and v$osstat.

Capacity.

Capacity is the power available; 100 CPU seconds, 64 cores, 128 threads, etc. As with requirements, we typically need the power available over a period of time, that is, a time slice, interval, or snapshot of time. This is actually very easy to calculate and is simply the number of whatever is supplying the power (e.g., cores) multiplied by the snapshot interval time.

Suppose over a one hour interval the CPU subsystem contains 8 cores and each core has 2 threads, for a total of 16 threads. As with the requirements, we can represent the CPU power capacity from either a core or thread perspective.

C = interval time X power supplying units (cores) = 60 minutes X 8 cores = 480 core minutes
C = interval time X power supplying units (threads) = 60 minutes X 8 cores X 2 threads/core = 960 thread minutes.
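The same capacity arithmetic as a short sketch, using the 8 core, 2 threads per core, one hour example above. The function and variable names are illustrative only.

```python
def capacity(interval, power_units):
    """Capacity = interval time x number of units supplying CPU power."""
    return interval * power_units


interval_min = 60
cores = 8
threads_per_core = 2

core_capacity = capacity(interval_min, cores)                       # core minutes
thread_capacity = capacity(interval_min, cores * threads_per_core)  # thread minutes

print(core_capacity, thread_capacity)  # 480 960
```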

As with the requirements' busy time, as DBAs it is not our decision whether the operating system counts capacity in cores or in threads. Someone else made the decision for us. But as we'll see below, the decision makes a profound difference in the final calculated utilization---for both vmstat and v$osstat.

Dark Matter: Cores vs Threads

Before we combine requirements and capacity to derive utilization, it is important to understand there are differences in how CPU subsystems process at the core and thread level. It's even more important to understand how this occurs on your production systems.

Suppose a process, when run by itself on an idle system, takes 30 seconds to complete. This is the "wall time" or "elapsed time." Here is an example Bourne shell command sequence that places an incredibly intense CPU load on a single CPU core or thread:
echo "scale=12;for(x=0;x<39999999;++x){1+1.243;}" | bc >/dev/null
On my Linux system, the above command takes about 78 seconds to complete.

Now... Consider the elapsed time of the above command when a bunch of the commands are launched at the same time!

If cores are what is providing the true CPU power, then with a 2 core CPU subsystem we will observe this type of relationship between the number of concurrently launched processes and elapsed time (seconds): (1,30), (2,30), (3,60), (4,60), (5,90), (6,90), (7,120), etc. Essentially, when a process completes, a core becomes available and the next process begins. This perfect elapsed time sequencing assumes the OS makes no optimizations.

If threads are what is providing the true CPU power, then with a 2 core but 2 threads/core (4 total threads) CPU subsystem we will observe this type of relationship between the number of concurrently launched processes and elapsed time (seconds): (1,30), (2,30), (3,30), (4,30), (5,60), (6,60), (7,60), (8,60), (9,90), etc. Essentially, when a process completes, a thread becomes available and the next process begins. Again, this perfect elapsed time sequencing assumes the OS makes no optimizations.
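Both idealized sequences follow one rule: processes run in batches the size of the number of power supplying units, and each batch takes the serial run time. A small sketch of that rule, assuming the 30 second serial time and no OS optimizations, as in the text:

```python
import math


def ideal_elapsed(processes, power_units, serial_seconds=30):
    """Idealized wall time: processes run in batches of `power_units`."""
    return math.ceil(processes / power_units) * serial_seconds


# 2 cores supplying the power
core_powered = [(n, ideal_elapsed(n, 2)) for n in range(1, 8)]
# 2 cores x 2 threads/core = 4 threads supplying the power
thread_powered = [(n, ideal_elapsed(n, 4)) for n in range(1, 10)]

print(core_powered)
print(thread_powered)
```

Real systems will not step this cleanly, which is exactly why measuring is worthwhile.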

When threads get involved, the resulting elapsed times are not so straightforward and can be much more complicated to anticipate and very, very operating system specific.

One of my colleagues did some testing (my ref: JB 9-May-2011). He used an 8 core box with each core having 2 threads. His results showed the CPU subsystem was operating more core-based than thread-based. Why? Because when 8 processes that each run serially in 30 seconds were simultaneously launched, they finished in about 30 seconds, yet when 9 processes were simultaneously launched the elapsed time jumped to 60 seconds...meaning the 16 threads were unable to truly process more than 8 processes simultaneously. The only way he could know this was to perform an actual test (more about this below).

However, an IBM employee who specializes in Oracle emailed me (my ref: DM 7-June-2011) and wrote, "SMT on Power Systems allows for true simultaneous execution of up to 4 SMT threads (on Power7) in the same clock cycle." An AWR report added some support to his well presented and thought out claim. The v$osstat busy_time statistic was clearly thread based because the busy_time (285888 secs) was greater than the core based capacity (131942 core secs = 34.36 min X 60 sec/min X 64 cores). From a core-based perspective and on his system, there is no way 64 cores can provide more than 131942 seconds of CPU power over the 34.36 minute interval. Threads must be involved.

Complicated...but seriously practical and necessary.

It can get even more complicated with virtual machines, vpars, lpars, and on and on. With all this complication it is easy to lose sight of our goal: gathering reliable and understandable values for OS CPU requirements, capacity, and utilization. If I can't do this, then I cannot make a simple statement such as, "From an OS perspective, CPU utilization indicates the CPU subsystem is the bottleneck." So while this may seem pretty academic, it has serious practical performance management implications.

The best way to tell what is occurring on your systems is to gather some performance data. Here's how...

Gathering The Data

I became so frustrated with all the complications and possibilities of complications that it was obvious the only way to get a firm grasp on reality was to gather some data on a real system. So I created a basic shell script that tracked the relationship between the number of simultaneously launched processes and their final elapsed time. I also gathered and displayed the OS CPU utilization and CPU run queue (both based on vmstat).

Please.... I need to write this: Do not run this script on your production system if you care about production system performance. The script is designed to suck every bit of CPU power out of your database server. Running this on a test box with the same OS and CPU architecture as your production systems should produce the results you are looking for.
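For readers who want to see the shape of such a test, here is a rough sketch of the idea. It is not the shell script described in this posting: it is a stand-in that assumes a Python interpreter is available, uses an arbitrary busy loop as the CPU load, and leaves out the vmstat utilization and run queue capture. The function name and iteration count are made up for illustration.

```python
import subprocess
import sys
import time

# CPU-bound busy loop executed by each child process.
# The arithmetic and the iteration count are arbitrary assumptions.
BURN = "x = 0\nfor i in range({n}):\n    x += i * 1.243\n"


def time_concurrent(processes, iterations=5_000_000):
    """Launch `processes` CPU burners at once; return wall time until all finish."""
    code = BURN.format(n=iterations)
    start = time.perf_counter()
    children = [
        subprocess.Popen([sys.executable, "-c", code]) for _ in range(processes)
    ]
    for child in children:
        child.wait()  # wait for every child, so we measure the all-process time
    return time.perf_counter() - start
```

Calling time_concurrent(n) for n = 1, 2, 3, ... and recording the results gives the process count versus elapsed time relationship. The same warning applies: it is built to consume CPU, so keep it off production.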

While I wrote the script on Linux and you can view it on-line here, it would be a simple matter to make the vmstat column parsing adjustments and potentially a few other things for another OS. The system I gathered data from has a single 4 core CPU with no threads. Here are the Linux details:
[oracle@fourcore ~]$ cat /proc/version
Linux version 2.6.18-164.el5PAE (mockbuild@ca-build10.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 02:28:20 EDT 2009

Figure 2.
Figure 2 above shows the elapsed time (or wall time) to complete X number of processes when they are all launched simultaneously in the background. It is obvious that up to four processes complete at pretty much the same time (and Figure 3 below details the numeric results). However, once 5 processes are launched simultaneously, the elapsed time takes a significant jump. This indicates that the system I gathered data from has 4 CPU cores and no threads, and is "core powered." ...and it does have a single CPU with 4 cores and no threads.

This is important to understand. Let's generalize this a bit:
  • Where C is the total number of CPU cores, a core power focused system will complete C number of processes launched simultaneously at pretty much the same time. Once the number of processes is greater than C, the all-process completion time will increase. (This is what we see occurring in Figure 2 above, where C is 4 and the jump occurs at 5 processes.)
  • Where T is the total number of CPU threads, a thread power focused system will complete T number of processes launched simultaneously at pretty much the same time. Once the number of processes is greater than T, the all-process completion time will increase.
  • My observations have shown that CPU subsystems without threads tend to be completely core powered. (As you would expect and what we see in Figure 2.) However, CPU subsystems with threads can be either more core or more thread focused. This is when understanding what a utilization value means becomes more complicated.
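The generalization above can be turned into a crude classifier: find the process count just before the first big jump in elapsed time and compare it with the core and thread counts. This is only a sketch. The 1.5x jump threshold is an assumption and a blunt substitute for a proper significance test, and the sample data is an idealized version of the 8 core, 16 thread test described earlier.

```python
def knee(elapsed_by_count, jump=1.5):
    """Largest process count before the first big jump in elapsed time, else None."""
    counts = sorted(elapsed_by_count)
    for prev, cur in zip(counts, counts[1:]):
        if elapsed_by_count[cur] > jump * elapsed_by_count[prev]:
            return prev
    return None


def classify(elapsed_by_count, cores, threads):
    """Label the system by where the elapsed time first jumps."""
    k = knee(elapsed_by_count)
    if k is None:
        return "no jump observed"
    if k == cores:
        return "core powered"
    if k == threads:
        return "thread powered"
    return "somewhere in between"


# Idealized data: 1 to 8 processes finish in about 30 seconds, 9 take about 60.
measurements = {n: 30 for n in range(1, 9)}
measurements[9] = 60
print(classify(measurements, cores=8, threads=16))  # core powered
```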
Figure 3.
Figure 3 above shows the numerical experimental results. I performed a statistical significance test between the elapsed time sample sets when there were 4 processes launched simultaneously and when there were 5 processes launched simultaneously. Statistically they are indeed different. The significance test was a little tricky because the elapsed time sample sets are not normally distributed. While I won't get into the details in this blog entry, if you are interested you can view the Mathematica notebook PDF output here...with all the details. You can also download the actual Mathematica notebook here. The raw experimental results can be downloaded here.

Why the Possible Utilization Difference

If you are still tracking with me (thanks for reading by the way), this next step should be simple. If requirements can be represented as either core-seconds or thread-seconds, and if capacity can be represented as core-seconds or thread-seconds, we have a simple two by two matrix. As long as both the requirements and capacity are core based or both are thread based, we should be OK (if the OS can evenly distribute all the work). But if they don't match, the utilization is going to be either under or over reported (at least from a DBA perspective).
Figure 4.
Figure 4 above is based on a hypothetical system: Over a one hour interval, the 4 core system (each core has 2 threads) was busy from a core perspective at 1000 seconds but from a thread perspective 2000 seconds.

The point of the Figure 4 matrix is not the busy time or the core or the threads. Rather, the point is if the requirements and capacity are not the same units, that is core or threads, the resulting utilization will be either under reported (e.g., 21%) or over reported (e.g., 83%), whereas the "true" utilization in Figure 4 is 42%.
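The two by two matrix is easy to build in a few lines. The inputs in this sketch are assumptions, not values read from Figure 4: a 600 second interval combined with 1000 core seconds and 2000 thread seconds of busy time is one combination that lands on the 21%, 42%, and 83% cells. What matters is the pattern: with 2 threads per core, mismatched units either double or halve the matched result.

```python
interval_s = 600            # assumed interval length, in seconds
cores = 4
threads_per_core = 2

busy = {"core": 1000, "thread": 2000}  # busy seconds, by unit
capacity = {
    "core": interval_s * cores,
    "thread": interval_s * cores * threads_per_core,
}

# Every combination of requirements unit and capacity unit.
matrix = {(r, c): busy[r] / capacity[c] for r in busy for c in capacity}

for (r, c), u in matrix.items():
    note = "matched" if r == c else "MISMATCHED"
    print(f"busy={r:<6} capacity={c:<6} {u:5.0%}  {note}")
```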

More practically, if I assume the busy_time in v$osstat is core based (when it is actually thread based) and calculate the capacity based on the number of cores, the resulting utilization will be higher than reality. In Figure 4 above, that would be the 83% cell. This is what could have happened on the AIX system as shown in Figure 1. Very experienced AIX DBAs are likely to have observed this utilization difference.

Sometimes it is easy to spot the problem. For example, one colleague I mentioned above (my ref: DM 7-June-2011) emailed me a partial AWR report from his AIX system. Over the 2062 second interval, the busy_time was reported to be 285888 seconds and the CPU subsystem consisted of 64 cores with a total of 256 threads. The core based capacity is then 131968 core seconds (131968 = 2062 X 64)... but the busy time was 285888 seconds, so the utilization is 217%. Woops! The thread based capacity is 527872 thread seconds (527872 = 2062 X 256). This means the thread based CPU utilization is only 54%. This is much more in line with Operations Research reality and how the system was actually performing.
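The same arithmetic as a short sketch, using the interval, busy time, core count, and thread count quoted from the AWR report. A utilization above 100% is the giveaway that the busy time and the capacity are not in the same units.

```python
interval_s = 2062     # AWR snapshot interval, in seconds
busy_s = 285888       # v$osstat busy_time delta, in seconds
cores = 64
threads = 256

core_util = busy_s / (interval_s * cores)
thread_util = busy_s / (interval_s * threads)

print(f"core based:   {core_util:.0%}")    # 217%
print(f"thread based: {thread_util:.0%}")  # 54%

if core_util > 1:
    print("busy time exceeds core capacity, so it cannot be core based")
```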

Conclusion

When we see data as in Figure 1, where there is clearly a difference between v$osstat core based utilization and some other tool (e.g., vmstat) it is very likely threads are being used in the calculation. How threads are being used in the calculation could easily be different than how I presented, but my point is NOT to determine exactly how a tool like vmstat or sar calculates CPU utilization. My point is:
  1. There can be a real difference in the reported utilizations.
  2. We need to understand what the OS reported CPU utilization means from a performance perspective.
  3. We need to gather data from our real systems to understand the true CPU requirements, capacity, and utilization. If we do not do this, then stating a utilization figure becomes arbitrary and not all that useful.

Next Steps To Make This Practical

The first step is to gather v$osstat data and do the core based utilization math. If you see any real difference in CPU utilizations, then compute the utilization using v$osstat data but using the thread statistics. If this still doesn't match with vmstat, then you will need to run a test (like I did and showed in this blog entry above) which launches processes simultaneously in the background and measures their wall time.
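These steps can be sketched as a small decision function. The 5 percentage point tolerance and the vmstat figure in the example are assumptions for illustration; pick a tolerance that fits how noisy your own samples are.

```python
def diagnose(busy, interval, cores, threads, vmstat_util, tolerance=0.05):
    """Compare core based, then thread based, utilization against vmstat."""
    core_util = busy / (interval * cores)
    if abs(core_util - vmstat_util) <= tolerance:
        return "core based utilization matches vmstat"
    thread_util = busy / (interval * threads)
    if abs(thread_util - vmstat_util) <= tolerance:
        return "thread based utilization matches vmstat"
    return "neither matches: run the simultaneous launch test"


# AWR figures from above, with a hypothetical vmstat utilization of 54%.
print(diagnose(285888, 2062, cores=64, threads=256, vmstat_util=0.54))
```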

Thanks for reading!

Craig.

2 comments:

  1. Craig,

    All this assumes that the CPU clock frequency stays constant (Capacity = cores*time). This is becoming increasingly not true for two reasons (I think):
    1. Green IT, obviously
    2. Virtualization. How do you dynamically reallocate available physical CPUs to competing virtual machines? As most OSs are ill-prepared to deal with dynamically adding or removing a CPU, I suspect this is achieved by varying the CPU frequency reported to the guest OS.

    So, I think the capacity and utilization formulae need to account for that to be of any use in an LPAR. "CPU seconds" seems a bad unit of measurement and needs to be replaced by e.g. "CPU cycles." When operating in fixed-frequency mode, "CPU cycles" = "CPU seconds" * frequency (cycles/second). When the frequency is allowed to change we need a sum over all periods of constant frequency operation that fall within our reporting period, or the OS could just provide a counter of the available CPU cycles since startup.

    What do you think?

    Cheers,
    Flado

  2. My company is also using LPAR and shared-SMT, and I must say I dislike this a lot. The whole point of such a configuration must be to set the maximum capacity for each virtual node so the cumulative sum is above the physical CPU available. I must say I really don't like this. Even if you find a way to make the utilization calculations right, you're still facing the problem: your calculation might not be valid the next hour. You'll never know how much CPU you have available.

    Any comments to this?
