Sunday, September 7, 2014

Watch Oracle DB Elapsed Time and Wall Time With Parallel Query

This page has been permanently moved. Please CLICK HERE to be redirected.

Thanks, Craig.

Watch Oracle Elapsed Time and Wall Time With Parallel Query


In my recent postings I wrote that when using the Oracle Database parallel query a SQL statement's wall time should be equal to its elapsed time divided by the number of parallel query slaves plus some overhead.

That may seem correct, but is it really true? To check I ran an experiment and posted the results here. The results are both obvious and illuminating.

If you don't want to read but just sit on the couch, have a beer and watch TV you're in luck! I took a clip from my Tuning Oracle Using An AWR Report online video seminar put it on youtube.  You can watch the video clip on YouTube HERE or simply click on the movie below.


The Math, For Review Purposes


In my previous recent postings I detailed the key time parameters; DB Time, DB CPU, non-idle wait time, elapsed time, parallelism and effective parallelism. To save you some clicking, the key parameters and their relationships are shown below.

DB Time = DB CPU + NIWT

Elapsed Time = Sum of DB Time

Wall Time = ( Elapsed Time / Parallelism ) + Parallelism Overhead

Wall Time = Elapsed Time / Effective Parallelism


Test Results: When Oracle Parallel Query was NOT involved.


If you want to see my notes, snippets, etc. they can be found in this text file HERE.

Here is the non-parallel SQL statement.

select /*+  FULL(big2) NOPARALLEL (big2) */ count(*)
into   i_var
from   big2 
where  rownum < 9000000

When the SQL statement was running, I was monitoring the session using my Realtime Session Sampler OSM tool, rss.sql. Since I knew the server process session ID and wanted to sample every second and wanted to see everything just for this session, this is the rss.sql syntax:
SQL>@rss.sql 16 16 827 827 % % 1
For details on any OSM tool syntax, run the OSM menu script, osmi.sql. You can download my OSM Toolkit HERE.

The rss.sql tool output is written to a text file, which I was doing a "tail -f" on. Here is a very small snippet of the output. The columns are sample number, sample time, session SID, session serial#, Oracle username, CPU or WAIT, SQL_ID, OraPub wait category, wait event, [p1,p2,p3].


We can see the session is consuming CPU and waiting. When waiting, the wait event is "direct path read", which is asynchronous (we hope) block read requests to the IO subsystem that will NOT be buffered in the Oracle buffer cache.

Now for the timing results, which are shown in the below table. I took five samples.  It's VERY important to know that the wait time (WAIT_TIME_S), DB CPU (DB_CPU_S), and DB Time (DB_TIME_S) values are related to ONLY server process SID 16. In blazing contrast, the wall time (WALL_S), elapsed time (EL_VSQL_S), and SQL statement CPU consumption (CPU_VSQL_S) is related the entire SQL_ID statement execution.

Here are the "no parallel" experimental results.
SQL> select * from op_results;

 SAMPLE_NO     WALL_S  EL_VSQL_S CPU_VSQL_S WAIT_TIME_S   DB_CPU_S  DB_TIME_S
---------- ---------- ---------- ---------- ----------- ---------- ----------
         1  35.480252  35.470015   9.764407       24.97   9.428506  34.152294
         2  35.670021  35.659748   9.778554       25.15   9.774984  35.541861
         3  35.749926  35.739473   9.774375       25.12    9.31266  34.126285
         4  35.868076  35.857752   9.772321       25.32   9.345398  34.273479
         5  36.193062   36.18378   9.712962       25.46   9.548465  35.499693
Let's check the math. For simplicity and clarity, please allow me to round and use only sample 5.
DB_TIME_S = DB_CPU_S + WAIT_TIME_S
     35.5 =   9.5    +    25.5  = 35.0
The DB Time is pretty close (35.5 vs 35.0). Close enough to demonstrate the time statistic relationships.
Elapsed Time (EL_VSQL_S) = DB_TIME_S
                    35.5 =    34.2
The Elapsed Time is off by around 4% (35.5 vs 34.2), but still closely to demonstrate the time statistic relationships.
Wall Time (WALL_S) = Elapsed Time (EL_VSQL_S) / Effective Parallelism
             35.5  =       35.5               /         1
Nice! The Wall Time results matched perfectly. (35.5 vs 35.5)

To summarize in a non parallel query (i.e., single server process) situation, the time math results are what we expected! (and hoped for)

Test Results: When Oracle Parallel Query WAS involved.


The only difference in the "non parallel" SQL statement above and the SQL statement below is the parallel hint. Below is the "parallel" SQL statement.
select /*+  FULL(big2) PARALLEL(big2,3)  */ count(*) into i_var from big2 where rownum<9000000>
When the "parallel" SQL statement was running, because Oracle parallel query was involved resulting in multiple related Oracle sessions, when monitoring using my rss.sql tool, I need to open the session ID (and serial#) to include all sessions. I still sampled every second. Here is the rss.sql syntax:
SQL>@rss.sql 0 9999 0 9999 % % 1
The tool output is written to a text file, which I was doing a "tail -f" on. Here is a very small snippet of the output. I manually inserted the blank lines to make it easier to see the different sample periods.


There is only one SQL statement being run on this idle test system. And because there is no DML involved, we don't see much background process activity. If you look closely above, sessions 168 (see third column) must be a log write process because the wait event is "log file parallel write". I checked and session 6 is a background process as well.

It's no surprise to typically see only four session involved. One session is the parallel query coordinator and the three parallel query slaves! Interestingly, the main server process session that I executed the query from is session number 16. It never showed up in any of my samples! I suspect it was "waiting" on an idle wait event and I'm only showing processes consuming CPU or waiting on a non-idle wait event. Very cool.

Now for the timing results. I took five samples.  Again, it's VERY important to know that the wait time (WAIT_TIME_S), DB CPU (DB_CPU_S), and DB Time (DB_TIME_S) values are related to ONLY calling server process, which in this case is session 16. In blazing contrast, the wall time (WALL_S), elapsed time (EL_VSQL_S), and SQL statement CPU consumption (CPU_VSQL_S) is related the entire SQL statement execution.

Here are the "parallel" experimental results.
 SQL>  select * from op_results;

 SAMPLE_NO     WALL_S  EL_VSQL_S CPU_VSQL_S WAIT_TIME_S   DB_CPU_S  DB_TIME_S
---------- ---------- ---------- ---------- ----------- ---------- ----------
         1  46.305951 132.174453   19.53818         .01   4.069579   4.664083
         2  46.982111 132.797536  19.371063         .02   3.809439   4.959602
         3   47.79761 134.338069  19.739735         .02   4.170921   4.555491
         4   45.97324 131.809249  19.397557         .01   3.790226   4.159572
         5  46.053922 131.765983  19.754143         .01   4.062703   4.461175
Let's check the math. So simplicity and clarity, please allow me to round and use sample 5.
DB_TIME_S = DB_CPU_S + WAIT_TIME_S
     4.5  =    4.1   +     0   
The DB Time shown above is kind of close... 10% off. (4.5 vs 4.1) But there is for sure timing error in my collection sript. I take the position, this is close enough to demonstrate the time statistic relationships. Now look below.
Elapsed Time (EL_VSQL_S)  = DB_TIME_S
                   131.7 !=    4.5
Woah! What happened here? (131.7 vs 4.5) Actually, everything is OK (so far aways) because the DB Time is related to the session (Session ID 16), whereas the elapsed time is ALL the DB Time for ALL the processes involved in the SQL statement. Since parallel query is involved, resulting in four additional sessions (1 coordinator, 3 slaves) we would expect the elapsed time to be greater than the DB Time. Now let's look at the wall time.
Wall Time (WALL_S) = ( Elapsed Time (EL_VSQL_S) / Parallelism ) + overhead
             46.1  = (          131.8           /     3       ) + 2.2
Nice! Clearly the effective parallelism is greater than 3 because there is some overhead (2.2). But the numbers makes sense because:

1. The wall time is less than the elapsed time because parallel query is involved.

2. The wall time is close to the elapsed time divided by the parallelism. And we can even see the parallelism overhead.

So it looks like our time math is correct!


Reality And The AWR Or Statspack Report


This is really important. In the SQL Statement section of any AWR or Statspack Report, you will see the total elapsed time over the snapshot interval and perhaps the average SQL ID elapsed time per execution. So what is the wall time? What are users experiencing? The short answer is, we do NOT have enough information.

To know the wall time, we need to know the parallelism situation. If you are NOT using parallel query, than based on the time math demonstrated above, the elapsed time per execution will be close to what the user experiencing (unless there is an issue outside of Oracle). However, if parallelism is involved, you can expect the wall time (i.e, user's experience) to be much less than the elapsed time per execution shown in the AWR or Statspack report.

Another way of looking at this is: If a user is reporting a query is taking 10 seconds, but the average elapsed time is showing as as 60 seconds, parallel query is probably involved. Also, as I mentioned above, never forget the average value is not always the typical value. (More? Check out my video seminar entitled, Using Skewed Data To Your Advantage HERE.)

Thanks for reading!

Craig.

1 comment:

  1. Hi Craig,

    Very cool...
    Just a doubt, in the: "Here are the "no parallel" experimental results". You used the sample number 5 as an example but in the moment of calculation:

    Elapsed Time (EL_VSQL_S) = DB_TIME_S
    35.5 = 34.2

    and

    Wall Time (WALL_S) = Elapsed Time (EL_VSQL_S) / Effective Parallelism
    35.5 = 35.5 / 1

    This numbers are more likely to sample number 1.
    I was racking my brain to understand from where these numbers come based on sample 5, but a can't.... :)

    Tks and waiting for the next post...


    ReplyDelete