Thursday, June 23, 2011

Impact of Consistent Reads

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

The Purpose

For sure there is a cost for read consistency. But is that cost significant? Can users feel the difference? Does it require significant computing resources? Like many things in Oracle, perhaps this entire discussion is more academic and just makes for interesting presentations and internals discussions... Delving in these issues is what this blog entry is all about!

Some Internals Background

If you've studied Oracle internals you will eventually learn that by default, query results reflect the committed state of the database system when the query started. (Oracle can also provide this capability at the transaction level.) This implies that if a needed buffer has a "change date" (think: SCN) AFTER the query started or  BEFORE the query started yet contains uncommitted data, Oracle must rebuild an image of the buffer containing committed data consistent with the query start time.

But it can get more complicated. In addition to the above, Oracle can cache multiple images of blocks, representing the block at various times (think: SCN) in both the buffer cache and a server process's PGA. Oracle can decide to use any appropriate buffer to fulfill a read consistent query. This is important to understand, because it is common to see consistent read activity when the data you are querying has been committed well before the query begin or even more surprising after the instance has been cycled! This is why you can see cr (consistent read) activity when running a tkprof even though you know there is no dml occurring related to the table. If you really want to dig into this, start by looking at a handful of instance statistics starting with "consistent".

I need to interject a couple definitions. The rebuilt or copied buffers are commonly called cloned buffers because they are a clone of the buffer at a point in time. When a cloned buffer is created, at a minimum, the instance statistic consistent change and consistent get are incremented. If an already buffered clone is accessed the consistent get statistic is increased and probably other consistent related instance statistics.

Even with multiple Oracle instances, every block that is buffered has a single changeable latest and greatest version in the entire database system, called the current buffer or the cu buffer. If you have done a tkprof and see values in the cu column you know your SQL has accessed cu buffers.

The performance implications of clones are massive. For example and to summarize, to make a clone a free buffer must be secured, the cu buffer is copied into cloned (cr) buffer, the most recent buffer chance undo address is retrieved from the clones' itl, and the server process checks is to see if the undo block is in the buffer cache. If the undo block is not in the buffer cache, the server process must find a free buffer for it, make an IO request (and probably wait a bit), place the block into the buffer cache, update various internal structures, and finally apply the undo to the cloned buffer. Now suppose after applying the undo to the cloned buffer it is discovered that additional undo is required! Then this process cycles again and potentially again and again. This "again and again" is following what is called a "chain of undo."

So you can see that creating a lot of cr buffers will increase both CPU and IO requirements. There is a cost for read consistency. But is that cost significant? Or perhaps its more academic and just makes for interesting presentations and internals discussions... Let's find out!

The Experimental Design

You can download the entire script that is somewhat documented HERE.

The experiment is actually quite simple. I created and loaded a table with 18M rows and committed. (The actual experimental 18M table did have DML and commits against it before the experiment began.) To buffer the data, I ran a full table scan query on the table five times. I then took my initial sample and called this my baseline sample, since updates have not yet occurred. The baseline query required no cr buffers as a result of uncommitted data or because of buffers changing after the query started and then clones had to be created. However, as I mentioned in the internals discussion above, Oracle can keep older versions of the buffer and when they are accessed they count as consistent reads. This indeed occurred because after an update occurred, the session committed (and even the instance cycled), Oracle recorded consistent read activity when the table was queried.

Before the second, third, and fourth loops were run, in a different Oracle session I updated 1M rows twice, but did not commit. It's important to notice the same 1M are being updated twice. This will likely force a "chain of undo" which will intensify read consistent activity.

As you can see below, when I "run the query" I actually run the full stats collection and actual query 60 times, providing me with 60 samples for each of the four loops.

I am expecting to see an increasingly larger elapsed time gap between the baseline query and each subsequent query.

So the statistics collection and query looping looks something like this:

load data...(18M rows)
cache data: select sum(update_seq) from rc_test; 5X



for i in 1..4 loop
  for sample in 1..60
    gather time t0 performance data 
    select sum(update_seq) into bogus from rc_test;
    gather time t1 performance data
    calculate deltas and store
  end loop
  pause this script and do the update in another session
end loop

The performance data gathered was:
  • time stamp based on the timestamp data type for maximum precision.
  • instance CPU consumption. I want to know the impact on the entire instance, not just my session. The data source is the v$sys_time_model, statistics "DB CPU" and "background cpu time".
  • instance non-idle wait time. Again, I wanted to know the impact on the entire instance, not just my session. The data source is v$system_event.
  • consistent gets. The number of cr buffers accessed in the entire instance. My sessions where the only sessions connected as I was in a very controlled environment. The data source is v$sysstat and the statistics is "consistent gets".
The update, which is performed in a different Oracle session, looks like this:

declare
  cntr number;
begin
for cntr in 1..2
loop
  update rc_test
  set    update_stamp=sysdate
  where  rownum<1000000;
end loop;
end;
/

You may notice I updated the same row set twice. As I stated above, this can cause a "chain of undo" to occur which increases consistent ready intensity.

Experimental Results (Summary)

The results are very clear, although a little more statistically tricky than I anticipated. First the results and then I'll provide some numeric and visual details.

There was both a visual and significant statistical difference between the initial baseline query (sample set 1) and the each query run after the three update (sample sets 2, 3, 4). This means consistent read statements take longer to run. The non-baseline samples were at least 1.5X slower than our baseline! Consistent read activity clearly made an elapsed time impact that users will notice.

There is also another key take away, that is not so apparent. There is a real (statistically significant) difference between each of the samples. Not just between the baseline and the others but between each of the four sample sets. This means that each time I ran the 1M update twice, it caused a real slowdown in the query.
Figure 1.
While you can download all the raw experimental data here, Figure 1 shows some of the numeric results. The columns should be self explanatory with the exception of the "P Value". This the result of the statistical significance test (details in the following section). If the P Value was less than 0.05 then we can say there is a real difference between the baseline sample set and the other sample set. It's important to understand that the P Values are between the baseline sample and the other sample set. For example, between set 1 and set 2, set 1 and set 3, and set 1 and set 4.
Figure 2.
Figure 2 above is a smooth histogram of each of the four sample set's elapsed times. Our baseline line sample set appears on the far left, then to the right set 2, set 3, and finally set 4. Visually it appear there is a clear and real difference between these sample sets...and the statistics agree!
Figure 3.
Figure 3 above is a consistent read histogram containing sample sets 2, 3, and 4. If you look closely at sample set 1's (our baseline) average consistent read value of 308390 it dwarfs in comparison to the other samples. So much so, when the histogram includes all four sample sets, because of the required scaling, the histogram is visually worthless. This implies that visually there is a real difference between the baseline sample set and the others. And as Figure 3 shows, clearly there is also a visual difference between each of the other sample sets as well...and the statistics agree!

The actual statistical analysis was more difficult than it may appear. If you are interested, read the next section. Otherwise, just skip to the Conclusion.

Experimental Results (Detailed)

By looking at the numerical and visual experimental results, it appears that indeed when consistent buffers are involved, query performance (our query to be exact) was indeed negatively impacted. But visually comparing values and histograms is not enough to make a strong statement. In fact, it's rather weak. So I needed to perform a statistical hypothesis test.

My plan was perform a simple t-test comparing sample set 1 (our baseline) to sample set 2, set 1 to set 3, and set 1 to set 4. But the t-test has a key requirement which, our data clearly fails to demonstrate.
Figure 4.                                  Figure 5.
As you'll recall, each sample set consists of 60 samples. Figure 4 above is the histogram for our baseline consistent reads and Figure 5 is the histogram for our baseline elapsed times. 

A key t-test requirement is our sample set must be normally distributed. Figure 4 looks extremely exponential and Figure 5 is some inverse normal distribution looking monstrosity-thing! Oh boy... talk about disappointment! After doing some research I found a non-parametric significance test needed to be performed. While I summarize this in the actual Mathematica notepad (notepad file, pdf file) there is a very nice explanation of non-parametric significance testing here.

Thankfully Mathematica has built in functions for this type of significance testing. In fact, I can tell it to choose what it thinks to be the most appropriate statistical test and do the test. It choose the Kruskal-Wallis test, which was one of the tests mentioned in the URL above.

To show a real difference exists between the sample sets, the P-Value needed to be less than 0.05. All the significance tests I performed resulted in a P-Value rounded to 0.0000. (I'm not joking.) The tests where between set 1 and set 2, set 1 and set 3, set 1 and set 4, and also between set 2 and set 3, and set 3 and set 4. For every test, statistically there is a real difference between the sample sets.

Just in case you're wondering, I too thought a consistent 0.0000 value was perhaps too good to be true. So I did a significance test between set 1 and set 1, which would have resulted in a P-Value of 1. And it did! So I'm comfortable with the results.

I must say though, I am not a statistician but based on my research and Mathematica, I believe I did the correct statistical significance test and represented the results correctly. If you think I make a mistake, please email me at OraPub's general email (orapub@comcast.net) and I'll be more then happy to make any required changes.

Conclusion

Many DBAs don't realize how resource intensive building and accessing consistent read buffers are. But once the Oracle internals is understood it becomes clear that intensive consistent read activity can significantly increase both CPU and IO consumption, which results in an elapsed time increase. But is this really true and is there is an increase make a real performance difference?
This experiment demonstrated that:
  1. Consistent read buffer access significantly increases query elapsed time.
  2. An increase in the "chain of undo" also significantly increases consistent read access and elapsed time.
  3. The elapsed time difference could definitely be felt by a user as our simple test showed a 1.5X elapsed time increase.
One important and final comment: My experiment did not include the creation of the consistent read buffers (statistics consistent changes), but only the access of already created cloned buffers. I ran a few other tests just to be sure the consistent read statistics did not include consistent changes...in fact, it was usually zero. So if my experiments shows a significant elapsed time increase when the consistent read buffers have already been created, we know that creating and accessing consistent read buffers will result in even longer elapsed times.

To minimize the impact of consistent reads, keep your SQL tuned, commit often (but often enough to cause frequent commit issues!), and try not to query blocks that another transaction is changing.

Thanks for reading.

Craig.



If you enjoy my blog, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond to a comment or you have a question, please feel free to email me directly at craig@orapub .com. I use a challenge-response spam blocker, so you'll need to open the challenge email and click on the link or I will not receive your email. Another option is to send an email to OraPub's general email address, which is currently orapub@comcast .net.

Friday, June 10, 2011

Cores vs Threads...Part 3

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Why This Is Very Important

As CPU subsystems become more complex and employ various methods to utilize multiple CPU cores and multiple threads per core, determining the CPU consumption requirements, capacity, and the resulting utilization is not entirely clear. But it is important we understand what the utilization tells us about our system. As an Oracle DBA, Iʼm OK with not knowing the specific OS utilization calculations, but Iʼm not OK with blindly stating a utilization figure without understanding what that means in relation to performance. Hence my quest...

This is the third and final (I hope) posting about differences in the operating system CPU utilization when determined by vmstat or by using the v$osstat CPU core approach. If you've been following this blog series you know there can be a statistically significant utilization difference and if there is, it can increase as the CPU subsystem gets busier. Also, seven out of the eight samples from production Oracle systems that I analyzed (some of them extremely busy systems) showed no real difference between the utilization method. However, the one AIX sample (AG1) clearly showed a difference as the CPU utilization increased over 40%. Figure 1 below is the AG1 scatter plot of the utilizations versus the sample interval.
Figure 1.

Background

For your reference, the initial blog posting (April 22, 2011) presented general utilization, how to gather CPU utilization purely from v$ views, and then I stated that on a few occasions I have seen the utilizations from vmstat and v$osstat differ significantly. The second blog posting (May 13, 2011) presented the experimental results and the subsequent analysis from seven production Oracle systems. As mentioned above, all but one of the samples (AG1) showed no statistical difference between the utilizations (Oracle core based based vs vmstat based). But I mentioned a concern I had; none of the samples were running with the utilization over 65%. Based on my comments a reader contacted me and ended up gathering data on his system that was running between 90% to 100% CPU utilization. The analysis of this data set (AB1) was posted in the second "B" posting on May 26, 2011. Like most of the other samples, this very busy CPU subsystem showed no difference between the utilizations.

This final posting (at this point I think it is anyways) is focused clearly on how could vmstat and Oracle core based CPU utilizations result in a different value. And if so, is this something to be concerned about? If this is something you're interested in...Read on!

Deeper Into Requirements and Capacity

There is a very good reason why there can be a difference. But for my explanation to make any sense, we need to understand how utilization is calculated and especially so when threads are involved. Utilization is simply requirements divided by capacity. Mathematically this can be represented as:

U = R / C

It's important to understand that we are typically looking at a slice, interval, or snapshot of time. For example, 15 minutes or 1 hour. This requires two samples from our data source. Don't mean to insult anyone here, but the value we need is calculated by subtracting the initial value from the final value resulting in the delta or difference. This delta is what we typically (but not always) use in the calculation.

Requirements.

Requirements is simply how much time CPU resources are being expended, used, and consumed. It does not matter if the consumer is a thread or a process. If it is consuming CPU resources then this time counts as "busy time." We can see this "busy time" via v$osstat. It is also encapsulated in vmstat. On all L/Unix systems (exception: HPUX) we can see the busy time in the /proc filesystem. I present this in my book, Oracle Performance Firefighting but here is a nice link to an on-line source. Mathematically, we can represent the requirements as:

R = busy time  = interval time X number of CPU power consumers

If we differentiate between cores and threads, requirements can be something like:

R (thread based) = busy time = interval time X number of threads consuming CPU resources
R (core based)  = busy time = interval time X number of cores consuming CPU resources

As DBAs it is not our decision whether the operating system reports busy time a specific way. Someone else made that decision for us. ...it is what it is. But as we'll see below, the decision makes a profound difference in the final calculated utilization---for both vmstat and v$osstat.

Capacity.

Capacity is the power available; 100 CPU seconds, 64 cores, 128 threads, etc. As with requirements, we typically need the power available over a period of time, that is, a time slice, interval, or snapshot of time. This is actually very easy to calculate and is simply the number of whatever is supplying the power (e.g., cores) multiplied by the snapshot interval time.

Suppose over a one hour interval the CPU subsystem contains 8 cores and each core has 2 threads, for a total of 16 threads. As with the requirements, we can represent the CPU power capacity from either a core or thread perspective.

C = interval time X power supplying units (cores) = 60 minutes X 8 cores = 480 core minutes
C = interval time X power supplying units (threads) = 60 minutes X 8 cores X 2 threads/core = 960 thread minutes.

As with the requirement's busy time, as DBAs it is not our decision whether the operating system reports busy time a specific way. Someone else made the decision for us. But as we'll see below, the decision makes a profound difference in the final calculated utilization---for both vmstat and v$osstat.

Dark Matter: Cores vs Threads

Before we combine requirements and capacity to derive utilization, it is important to understand there are differences in how CPU subsystems process at the core and thread level. It's even more important to understand how this occurs on your production systems.

Suppose a process, when run by itself on an idle system takes 30 seconds to complete. This is the "wall time" or "elapsed time." Here is an example bourne shell command sequence that places an incredibly intense CPU load on a single CPU core or thread.
echo "scale=12;for(x=0;x<39999999;++x){1+1.243;}" | bc >/dev/null
On my Linux system, the above command takes about 78 seconds to complete.

Now... Consider the elapsed time of the above command when a bunch of the commands are launched at the same time!

If cores are what is providing the true CPU power, then with a 2 core CPU subsystem we will observe this type relationship between the number of concurrently launched processes and elapsed time (seconds): (1,30), (2,30), (3,60), (4,60), (5,90), (6,90), (7,120), etc. Essentially, when a process completes, a core becomes available and the next process begins. This perfect elapsed time sequencing assumes the OS makes no optimizations.

If threads are what is providing the true CPU power, then with a 2 core but 2 threads/core (4 total threads) CPU subsystem we will observe this type of relationship between the number of concurrently launched processes and elapsed time (seconds): (1,30), (2,30), (3,30), (4,30), (5,60), (6,60), (7,60), (8,60), (9,90), etc. Essentially, when a processing thread completes a thread becomes available and the next process thread begins. Again, this perfect elapsed time sequencing assumes the OS makes no optimizations.

When threads get involved, the resulting elapsed times are not so straightforward and can be much more complicated to anticipate and very, very operating system specific.

One of my colleagues did some testing. (my ref: JB 9-May-2011) He used an 8 core box with each core having 2 threads. His results showed the CPU subsystem was operating more core-based than thread-based. Why? Because when 8 processes that run serially in 30 seconds were simultaneously launched, they finished in about 30 seconds yet when 9 processes where simultaneously launched the elapsed time jump to 60 seconds...meaning all 16 threads were unable to truly process the more than 8 processes simultaneously. The only way he would know this, is to perform an actual test (more about this below.)

However, an IBM employee that specializes in Oracle emailed me (my ref: DM 7-June-2011) and wrote, "SMT on Power Systems allows for true simultaneous execution of up to 4 SMT threads (on Power7) in the same clock cycle." An AWR report added some support up his well presented and thought out claim. The v$osstat busy_time statistic was clearly thread based because the busy_time (285888 secs) was greater then the core based capacity (13194 core secs = 34.36 min X 60 sec/min X 64 cores). From a core-based perspective and on his system, there is no way 64 cores can provide more than 13193 seconds of CPU power over the 34.36 minute interval. Threads must be involved.

Complicated...but seriously practical and necessary.

It can get even more complicated with virtual machines, vpars, lpars, and on and on. With all this complication it is easy to loose sight of our goal: gathering reliable and understandable values for OS CPU requirements, capacity, and utilization. If I can't do this, then I can not make a simple statement such as, "From an OS perspective, CPU utilization indicates the CPU subsystem is the bottleneck." So while this may seem pretty academic, it has serious practical performance management implications.

The best way to tell what is occurring on your systems is to gather some performance data. Here's how...

Gathering The Data

I became so frustrated with all complications and possibilities of complications, it was obvious the only real way to get a firm grasp on the reality on a real system was to gather some data on a real system. So I created a basic shell script that tracked the relationship between the number of simultaneously launched processes and their final elapsed time. I also gathered and displayed the OS CPU utilization and CPU run queue (both based on vmstat).

Please.... I need to write this: Do not run this script on your production system if you care about production system performance. The script is designed to suck every bit of CPU power out of your database server. Running this on a test box with the same OS and CPU architecture as your production systems should produce the results you are looking for.

While I wrote the script in Linux and you can view it on-line here, it would be a simple matter to make the vmstat column parsing adjustments and potentially a few other things. The system I gathered data from a single 4 core CPU with no threads. Here's the Linux details:
[oracle@fourcore ~]$ cat /proc/version
Linux version 2.6.18-164.el5PAE (mockbuild@ca-build10.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Thu Sep 3 02:28:20 EDT 2009

Figure 2.
Figure 2 above shows the elapsed time (or wall time) to complete X number processes when they are all launch simultaneously in the background. It is obvious that up to four processes complete at pretty much the same time (and Figure 3 below details the numeric results). However, once 5 processes are launched simultaneously, the elapsed time takes a significant jump. This indicates that the system I gathered has 4 CPU cores and no threads, and is "core powered." ...and it does have a single CPU with 4 cores and no threads.

This is important to understand. Let's generalize this a bit:
  • Where C is the total number of CPU cores, a core power focused system will complete C number of processes launched simultaneously at pretty much the same time. Once the number of processes is greater than C, the all-process completion time will increase. (This is what we see occurring in Figure 2 above, where C is 4 and 5.)
  • Where T is the total number of CPU threads, a thread powered focused system will complete T number of processes launched simultaneously at pretty much the same time. Once the number of processes is greater than T, the all-process completion time will increase.
  • My observations have shown that CPU subsystems without threads tend to be completely core powered. (As you would expect and what we see in Figure 2.) However, CPU subsystems with threads can be either more core or more thread focused. This is when understanding what a utilization value means becomes more complicated.
Figure 3.
Figure 3 above shows the numerical experimental results. I performed a statistical significance test between the elapsed time sample sets when there were 4 processes launched simultaneously and when there were 5 processes launched simultaneously. Statistically they are indeed different. The significance tested was a little tricky because the elapsed time sample sets are not normally distributed. While I won't get into the details in this blog entry, if you are interested you can view the Mathematica notebook PDF output here...with all the details. You can also download the actual Mathematica notebook here. The raw experimental results can be downloaded here.

Why the Possible Utilization Difference

If you are still tracking with me (thanks for reading by the way), this next step should be simple. If requirements can be represented as either core-seconds or thread-seconds, and if capacity can be represented as core-seconds or thread-seconds, we have a simple two by two matrix. As long as both the requirements and capacity are core or thread based, we should be OK (if the OS can evenly distribute all the work). But if they don't match the utilization is going to be either under or over reported (at least from a DBA perspective).
Figure 4.
Figure 4 above is based on a hypothetical system: Over a one hour interval, the 4 core system (each core has 2 threads) was busy from a core perspective at 1000 seconds but from a thread perspective 2000 seconds.

The point of the Figure 4 matrix is not the busy time or the core or the threads. Rather, the point is if the requirements and capacity are not the same units, that is core or threads, the resulting utilization will be either under reported (e.g., 21%) or over reported (e.g., 83%), whereas the "true" utilization in Figure 4 is 42%.

More practically, if I assume the busy_time in v$osstat is core based (but it is not) and calculate the capacity based on the number of cores, the resulting utilization will be higher then reality. In Figure 4 above, that would be 83% cell. This is what could have happened on the AIX system as shown in Figure 1. Very experienced AIX DBAs are likely to have observed this utilization difference.

Sometimes it is easy to spot the problem. For example, one colleague I mentioned above (my ref: DM 7-June-2011) emailed me a partial AWR report from his AIX system. Over the 2062 second interval, the busy_time was reported to be 285888 seconds and the CPU subsystem consisted of 64 cores with a total of 256 threads. The core base capacity is then 131968 core seconds (131968 = 2062 X 64)... but the busy time was only 285888 seconds, so the utilization is 217%. Woops! The thread based capacity is 528384 thread seconds (528384=2064 X 256). This means the thread based CPU utilization is only 54%. This is much more in line with Operations Research reality and how the system was actually performing.

Conclusion

When we see data as in Figure 1, where there is clearly a difference between v$osstat core based utilization and some other tool (e.g., vmstat) it is very likely threads are being used in the calculation. How threads are being used in the calculation could easily be different than how I presented, but my point is NOT to determine exactly how a tool like vmstat or sar calculates CPU utilization. My point is:
  1. There can be a real difference in the reported utilizations.
  2. We need to understand what the OS reported CPU utilization means from a performance perspective.
  3. We need to gather data from our real systems to understand the true CPU requirements, capacity, and utilization. If we do not do this, then stating a utilization figure becomes arbitrary and not all that useful.
Next Steps To Make This Practical

The first step is to gather v$osstat data and do the core based utilization math. If you see any real difference in CPU utilizations, then compute the utilization using v$osstat data but using the thread statistics. This this still doesn't match with vmstat, then you will need run a test (like I did and showed in this blog entry above) which launch processes simultaneously in the background and measuring their wall time.

Thanks for reading!

Craig.