Thursday, April 19, 2012

When are DBA_HIST_SYSSTAT values correct?

Eventually you'll want to really know...

When performance data is periodically collected (i.e., snapshotted) and stored, the question undoubtedly will cross our minds, "Is the stored value correct at the beginning or the ending snapshot time?" In relation to Oracle's AWR system and specifically the dba_hist_sysstat table, this is what this posting is all about.

If you have ever written a data collection program or pulled data from either Statspack or AWR, you've had to figure this out.... or guessed and know you were kinda close.

Let me demonstrate this problem. Let's say this is the data:
Snap #  Start Time  End Time  Value 
------  ----------  --------  ----- 
  0           0700      0800     98
1 0800 0900 100 2 0900 1000 110 3 1000 1100 115 4 1100 1200 118
There are two ways we could interpret this data. The question we want answered is, "What happened between 1000 and 1100? Based solely on the data above, there are two answers! If the statistic value was captured at the start time, then the delta (i.e., the activity between 1000 and 1100) for snap 3 would be 3; 118-115. However, if the statistic value was captured at the end time, then the delta for snap 3 would be 5; 115-110

If the above two sentences are not crystal clear, then please STOP and re-read them a few times.

My Motivation

In just a few days, I'm going to teach a one day seminar at Collaborate 2012 entitled, Daily Forecasting and Predictive Analysis. In only five hours, I'm supposed to provide my students with enough information so they can do some responsible forecasting. And it's lecture only except for the demos I'll be giving. This meant I need to quickly empower my students with collecting good performance data.

When developing predictive models, data collection and characterizing the data is a really big deal: After all, how can you say, "What happens if we double the workload?" if we don't know what the workload is?

To help my students quickly get going, I wanted to use an existing data collection facility. So I decided to use AWR. This meant I had to quickly create some very flexible reports. (Actually I created a few functions that I use in the reporting SQL statements, which gives me lots of flexibility and allows for really easy yet advanced reporting.) All this to say, I really, really, really needed to know when raw performance data is stored in a particular AWR table: Is the value collected at the beginning or the ending snapshot time.

To figure this out, I created an experiment...

If you just want to know and don't want to read the rest of this posting, scroll down to the Conclusion section.

The Experiment

For this experiment I wanted to completely control AWR data collection and get some data as quickly as possible. I create a variable light load on the database server and using the snippet below, collected data every 10 seconds; yes every 10 seconds, not minutes... I was in a hurry.
-- AWR collection and firing time and delay
--
select to_char(sysdate,'HH24:MI:SS') now from dual;
EXEC dbms_workload_repository.create_snapshot;
exec dbms_lock.sleep(10);
select to_char(sysdate,'HH24:MI:SS') now from dual;
EXEC dbms_workload_repository.create_snapshot;
exec dbms_lock.sleep(10);
select to_char(sysdate,'HH24:MI:SS') now from dual;
EXEC dbms_workload_repository.create_snapshot;
...
I then created and executed the looper.sql script once a second to show the details about the true v$sysstat statistic value and what was stored in the associated AWR table, dba_hist_sysstat. Below is the report output.
                                                 AWR
NOW       SYSSTAT_LIO      AWR_LIO  DIFF_LIO SNAP_ID AWR_BEGIN       AWR_END
-------- ------------ ------------ --------- ------- --------------- ---------------
10:51:26  11284510295  11263576910  20933385    9474 10.47.10.806 AM 10.47.21.336 AM
10:51:27  11284590227  11263576910  21013317    9474 10.47.10.806 AM 10.47.21.336 AM
10:51:28  11284678032  11263576910  21101122    9474 10.47.10.806 AM 10.47.21.336 AM
10:51:29  11284759623  11263576910  21182713    9474 10.47.10.806 AM 10.47.21.336 AM
10:51:30  11284845769  11263576910  21268859    9474 10.47.10.806 AM 10.47.21.336 AM 
10:51:31  11284935284  11263576910  21358374    9474 10.47.10.806 AM 10.47.21.336 AM
10:51:32  11285013563  11263576910  21436653    9474 10.47.10.806 AM 10.47.21.336 AM 
10:51:33  11285099853  11285060579     39274    9475 10.47.21.336 AM 10.51.30.390 AM 
10:51:34  11285178132  11285060579    117553    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:35  11285259385  11285060579    198806    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:36  11285336246  11285060579    275667    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:37  11285444373  11285060579    383794    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:40  11285652730  11285060579    592151    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:41  11285749950  11285060579    689371    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:42  11285837157  11285060579    776578    9475 10.47.21.336 AM 10.51.30.390 AM
10:51:43  11285920825  11285060579    860246    9475 10.47.21.336 AM 10.51.30.390 AM 
10:51:44  11286003933  11285942610     61323    9476 10.51.30.390 AM 10.51.43.236 AM 
10:51:45  11286097986  11285942610    155376    9476 10.51.30.390 AM 10.51.43.236 AM
10:51:46  11286182852  11285942610    240242    9476 10.51.30.390 AM 10.51.43.236 AM
10:51:47  11286279336  11285942610    336726    9476 10.51.30.390 AM 10.51.43.236 AM
10:51:48  11286394216  11285942610    451606    9476 10.51.30.390 AM 10.51.43.236 AM
10:51:51  11286619230  11285942610    676620    9476 10.51.30.390 AM 10.51.43.236 AM
10:51:52  11286708688  11285942610    766078    9476 10.51.30.390 AM 10.51.43.236 AM
10:51:53  11286798146  11285942610    855536    9476 10.51.30.390 AM 10.51.43.236 AM 
10:51:54  11286890285  11285942610    947675    9476 10.51.30.390 AM 10.51.43.236 AM 
10:51:55  11286984898  11286876687    108211    9477 10.51.43.236 AM 10.51.53.794 AM
10:51:56  11287096753  11286876687    220066    9477 10.51.43.236 AM 10.51.53.794 AM
10:51:57  11287175032  11286876687    298345    9477 10.51.43.236 AM 10.51.53.794 AM
10:51:58  11287252596  11286876687    375909    9477 10.51.43.236 AM 10.51.53.794 AM
10:51:59  11287337325  11286876687    460638    9477 10.51.43.236 AM 10.51.53.794 AM
10:52:00  11287421048  11286876687    544361    9477 10.51.43.236 AM 10.51.53.794 AM
10:52:01  11287499333  11286876687    622646    9477 10.51.43.236 AM 10.51.53.794 AM

First, let me explain each column.

  • NOW is the clock time. The first sample was reported at 10:51:26, that is, about 10:51am.
  • SYSSTAT_LIO is the actual v$sysstat statistic session logical reads value at the "NOW" time. This value is "the truth" at the NOW time.
  • AWR_LIO is the stored session logical reads most recent value from the dba_hist_sysstat table.
  • DIFF_LIO is the difference between SYSSTAT_LIO and AWR_LIO. The smaller the number the closer to the truth the AWR data is.
  • AWR SNAP_ID is the maximum, that is, most recent AWR snapshot ID.
  • AWR_BEGIN is the beginning snapshot time for its associated AWR_SNAP ID.
  • AWR_END is the ending snapshot time for its associated AWR_SNAP ID. Notice there is about a 10 second difference between the AWR_BEGIN and AWR_END time; hence the 10 second AWR collection period.
Results analyzed

If you haven't had your morning cup of coffee, now would be a good time! The below four points reference the looper.sql script output shown above.

First, notice AWR SNAP_ID does not appear until after the ending collection time has occurred. That is, once NOW surpasses the AWR_END time we see the snapshot data. If AWR is capturing data at the beginning, it is not allowing us to see it yet.

Second, notice the minimum DIFF_LIO values occur when a new AWR SNAP_ID appears. This minimum time is when the AWR data is the closest to the truth.

Third, since the AWR data is closest to the truth at the snapshot end time (AWR_END), we know the statistic value stored was gathered at the end of the snapshot period, not the beginning.

Fourth, therefore for a specific AWR SNAP_ID, the ending statistic value is the value stored for the AWR SNAP_ID and the starting value is the value stored with the previous AWR SNAP_ID.

Now let's apply the AWR strategy of collecting and storing raw performance statistic values to the example I provided at the top of this posting. You'll immediately notice I added the "Delta" column, which is what we see on a typical AWR report or one of my AWR data collection reports.
Snap #  Start Time  End Time  Value  Delta
------  ----------  --------  -----  -----
  0           0700      0800     98      -
  1           0800      0900    100      2
  2           0900      1000    110     10
  3           1000      1100    115      5
  4           1100      1200    118      3
So if I had to guess when users were upset during the morning, it would be between 0900 and 1000.

Conclusion

The dba_hist_sysstat statistic values (column value) are pulled from the raw v$sysstat statistic values (column value) at the end of the collection period.

My guess is this is what you thought. At least that was my guess... and it was a guess. But now I really know.

Thanks for reading!

Craig.

If you enjoyed this blog entry, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. I also occasionally teach my Oracle Forecasting and Predictive Analysis class and also a one day Performance Research Seminar. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond with a comment or you have a question, please feel free to email me directly at orapub.craig@gmail .com.







Wednesday, April 4, 2012

When is v$sesstat really updated?

When Performance Statistics Are Useful

Statistics are meaningful when we understand them and potentially dangerous when we don't. In the Oracle performance analysis arena, the instance statistics views (v$[ses,sys]stat) are used quite a bit. While there are many sources describing what a statistic means, what remains elusive is when a statistic is updated. For example, is the execute count updated when a SQL statements starts, ends, or something else?

This posting will provide the answer to when some of the key instance statistics are updated based on experimental evidence. Plus you can perform the tests yourself. So when does v$sesstat get updated? Interestingly, the answer is, "It depends."

It's a Big Deal When the Statistics Are Updated

While you will hear DBAs talking about when statistics get updated, when this truly becomes something we care about is during long running processes. Especially when there are only a few long running processes. For example, suppose you have long running query that typically takes around 1.75 hours to complete and consumes 2B physical Oracle blocks from the IO subsystem. AND suppose your statistics collection interval is 30 minutes. The table below shows just a few of the possible statistics update scenarios over the six 30 minute periods.


Here are some common beliefs as to when the instance statistics views are updated:
End of each user call. There are three possible user calls; parse, execute, and fetch. (More? Ref #1, Ref #2) A query will have a single parse and execute, and potentially many fetches. In this specific example, I'm stating now this query returns only a few rows which requires only a single fetch, resulting in a total of only three user calls. If v$sesstat is updated after each user call, this means nearly all of the physical IO reads will be recorded in period four (P4). Another way of looking at this is if you are monitoring this SQL statement's block reads, you'll see no activity until period four, and then BAM! Surprise! The 2B physical reads will be record in period P4! For quick OLTP activity this wouldn't typically be a problem, but for long running SQL it can be misleading.

End of the statement. No activity will be recorded until period P4 and then BAM! the 2B physical reads will be recorded! Good luck trying to match that with the IO subsystem activity. For quick OLTP statements, this wouldn't typically be a problem.

Every 3 seconds. This is what we would like, as we would see the physical read activity occur pretty much when it actually did occur.

Start of next statement. For OLTP activity, this wouldn't be such a big deal. But suppose the next statement doesn't start for quite a while? And then BAM! all the physical reads will be shown not when the query finished but perhaps a long time after. Not a good reflection of reality.

So what's the real answer? To answer the question I needed to create an experiment...

Let's See What Really Happens

Experimental Design.

The experimental design is actually quite simple. There are three parts. You can view all three parts in this single text file here.

Part 1 is the application user. That is, this is the code which the "monitoring user" will be monitoring. As you can see here, the code consists of setup (load the table), getting the user's SID (so I know which SID to monitor), and finally the test SQL that I will be monitoring. I run this is a few times so, on my test system, I generally know how long it takes to complete all the SQL. On my system this is about 90 seconds.

Part 2 is the DBA or the monitoring user. After the Part 1 test SQL begins, I will start the monitoring SQL. This SQL is very straightforward: drop the experimental data table, create the experimental table, and then every second sample all the stats from v$sesstat for this application user's SID. Because on my system the application user's SQL takes around 90 seconds to complete, I sampled for 100 seconds, which at a one second interval means 100 samples are collected and stored.

Part 3 is the reporting SQL where I simply select the experimental data that I collected in Part 2. I have various reporting SQL. Some is more visually pleasing and some is better for direct input into my Mathematica notepad used for the actual analysis.

The experiment was run a few times just to ensure the results were consistent... and they were. I detected no differences. In fact, a few months back I performed a similar experiment and the results where the same.

Statistics Analyzed.

Since data for all v$sesstat statistics were collected I can compare when the various statistics are updated. I selected six statistics. I selected them because I use them often in my performance analysis work. Here is the list (statistic #, statistic name, show description):

11. session logical reads. That is, a buffer get (logical IO).

14. CPU used by this session. This is the CPU consumed by the session in centi-seconds.

44. physical read total IO requests. This is the total IO read requests. This is NOT the number of Oracle blocks read but actual IO calls. One Oracle IO request can contain multiple blocks, that is, a multi block read.

72. physical reads. This is the number of Oracle blocks read. This is NOT the number of IO requests, but the number of actual Oracle blocks read as the result of the IO requests.

549. parse count (total). This is the number of parses; hard and soft. I don't know if the "softer" parse is included, which occurs when there is a hit in a session's cursor cache.

554. execute count. This is the number of SQL executions; selects, inserts, updates, delete, and DDL. I have not verified this and will be checked in a later experiment.

Statistics Reporting and initial observation.

For each of the 100 samples, all v$sesstat were collected. But as mentioned above, I only analyzed six statistics.

What I am specifically looking for is when a statistic value changes and relating that back to what the application user's Oracle session was doing (or not doing!). So I want to know what the application user (i.e., the running script) is doing when the statistic value changes. By comparing the statistic value change time stamp with the application SQL timestamp, I can understand when and hopefully why the statistic value changed.

You can look at all the raw experimental results here in this fileIf you would like to perform some additional analysis yourself, you can download the exported the table that contains all the v$sesstat data here.

When looking at the two data sets below, it is important to know the first query begins at time (STAMP) 85.

Here is an example of the results for statistic physical reads (#72):

STATISTIC#      STAMP      VALUE
---------- ---------- ----------
        72         80     112270
. . .
        72         84     112270
        72         85     115034
        72         86     137321
        72         87     138070
        72         88     138756
        72         89     139646
        72         90     140681
        72         91     141596
. . .

Notice that the above statistic value is always changing. Below is an example for the statistic session logical reads (#11):

STATISTIC#      STAMP      VALUE
---------- ---------- ----------
...
        11         83      75155
        11         84      75155
        11         85      77917
        11         86     100142
        11         87     100142
...
        11         97     100142
        11         98     100142
        11         99     100142
        11        100     118547
        11        101     125129
        11        102     125129
        11        103     125129
...
        11        117     125129
        11        118     125129
        11        119     129823
        11        120     150116
        11        121     150116
...
        11        133     150116
        11        134     174371
        11        135     175103
        11        136     175103
...
        11        156     175103
        11        157     175103
        11        158     175106
        11        159     175106
        11        160     175106
        11        161     175106
        11        162     175106
        11        163     175109
        11        164     175109
        11        165     175109
. . . 

Notice above how for session logical reads (#11) the statistic values don't constantly change. There must be some event that triggers the value change. Even a quick review of the above data demonstrates instance statistics are updated for different reasons!

Results Analysis.

To understand when and hopefully why a specific statistic is change, we must know what the application "user" is doing at various points in time. Below is a table containing a time identifier (ID), the activity time stamp (Time), and the related activity (Activity 1, Activity 2). Take a moment and scan through this table.
For example, at time 85 (ID A) the big query starts and at time 100 (ID B) it completed and then was immediately run again. After it completed the second time at time 114 (ID C) there was a five second sleep period (dbms_lock.sleep(5)) and when the sleep completed (ID D), the big query started once again...

Now let's look at what occurred from a time, application user, and instance statistic perspective. Looking at the plot below, you can see the horizontal axis is time and the IDs along with their associated time stamps are clearly labeled. The vertical axis contains either the application activity event (Appl) or a specific instance statistic (e.g., LIO, CPU) change. Each blue dot represents a change occurred; either an application activity or a statistic value change.
Even a quick glance at the above plot clearly shows that various instance statistics are updated at different times. For example, physical IO block reads (PIO rds) are updated very often in contrast to executions (exec) which are updated infrequently. To really understand what is occurring we need to compare when a statistic change occurred in conjunction with what the application is doing. Also, it may be helpful to reference the raw experimental data. As I mentioned above, you can look at all the raw experimental results here in this file

Specific Statistics Analysis.

Let's look closely at each of the instance statistics analyzed.

Logical IOs.

Summary: This experiment indicates the LIOs statistic is updated when the next statement begins. I know, this seems really strange, but this experiment and others I have performed clearly show this to be true.

Basis: Because the big query begins at time 85 (A) and time 100 (B), it initially appears that LIOs are updated when a statement begins and ends. However, if you look closely at when the big query ended at time 114 (C), LIOs are not updated! Rather, five seconds later at time 119 (D) is when the LIOs are updated! It's important to notice that time 119 (D) is after the five second sleep and the next query begins!!! This is not an anomaly because this occurs again at time 158 (G/H): the query ends at time 148 (F). So it's very clear the LIOs are not updated when the query completed or during the query, but when the next statement begins!

So if you are gathering the LIOs statistic for a specific session, make sure to start another statement so the statistic value is updated! Even a "select * from dual" will trigger the LIO statistic update.

CPU Consumption.

Summary: This experiment indicates session CPU consumption is updated when the statement completes.

Basis: For the first few queries (big queries) starting at times 85 (A), 100 (B), 119 (D), and 134 (E), the CPU consumption is updated when the query completes at time 100 (B), 114 (C), 134 (E), and 148 (F) respectively. However, when the small sub-second queries complete (time 158 (G/H), 162 (I/J), 173 (K)) CPU consumption is not updated. Why? My guess is because the small query consumes so little CPU the results rounds to the current number displayed.

So if you are gathering session CPU consumption be aware that the CPU consumption will not be updated until the statement completes.

Physical Reads (PIO).

Summary: The PIO statistic is frequently updated and when a statement completes.

Basis: My experiment showed PIOs where updated during each experimental sample, which occurred every second. And for a sub-second statement (time 158 (G/H)) a PIO change occurred when the query completed; also at time 158 (G/H) which was then followed by a five second sleep.

My experimental queries only required one fetch. Therefore, this experiment also demonstrates a starting or completed fetch is not required for the PIO statistic is updated.

Physical Read Requests.

Summary: The data clearly shows the same behavior as with physical reads (see above).

Parse Count.

Summary: The parse count statistic is updated when the parse occurs.

Basis: It appears that when a parse occurs the statistic is incremented by one and recorded, which is at the beginning of a statement. In my experiment, every application event and activity time contained a parse (even the sleep call requires a parse) and at every activity time the parse count was incremented. Good news if you are counting parses.

Execute Count.

Summary: The execute count statistic is updated as with the parse count (see above).

Important Conclusions.

Even if I have made mistakes in my analysis or my experiment is improved, two points are clear:

1. Instance statistics are not updated at the same time, that is, there are different triggers.

2. My experiments demonstrated four possible statistic value change triggers: when a statement begins (parse and execute), throughout processing (physical IO reads and requests), statement completion (CPU consumption), and when the next statement begins (LIO).

So when do the v$sesstat statistics get updated? As a highly paid consultant would say, "It depends."

Thanks for reading!


Craig.

If you enjoyed this blog entry, I suspect you'll get a lot out of my courses; Oracle Performance Firefighting and Advanced Oracle Performance Analysis. I teach these classes around the world multiple times each year. I also occasionally teach my Oracle Forecasting and Predictive Analysis class and also a one day Performance Research Seminar. For the latest schedule, click here. I also offer on-site training and consulting services.

P.S. If you want me to respond with a comment or you have a question, please feel free to email me directly at orapub.craig@gmail .com.