SQL Tuning | Mauro Pagano's Blog

August 11, 2015
by Mauro Pagano

Concurrent INSERT APPEND into the same table

The main reason I decided to have a blog was to share all the small things I learn every day while experimenting, but I have to admit I fell short of my initial intention. Hopefully I’ll do better in the coming months, starting with this super-quick post.

I always knew that concurrent (aka multiple sessions at the same time) INSERT /*+ APPEND */ into the same target table was not possible, even if the target table was partitioned. Here is an example of such a SQL:

create table test_target (n1 number, n2 number)
partition by range(n1)
(partition p1 values less than (5),
 partition p2 values less than (10));

insert /*+ append */ into test_target select mod(rownum,5), rownum from dual connect by rownum <= 10;

The reason concurrency isn’t feasible is that the first session acquires a TM enqueue in X mode on the whole table (so every other session has to wait), even if the inserted data only goes into a specific partition.
In the following extract SID 306 (my session) is holding the TM enqueue in X mode (LMODE 6) on object id 138542, which is the table, even though only partition P1 has been populated.

SQL> select * from v$lock where sid = 306;

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK     CON_ID
---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ----------
       306 TX     131102      11102          6          0         29          0          0
       306 TM     138542          0          6          0         29          0          3
       306 AE        133          0          4          0         86          0          3

SQL> select object_id, object_name, object_type from user_objects where object_id = 138542;

 OBJECT_ID OBJECT_NAME     OBJECT_TYPE
---------- --------------- -----------------------
    138542 TEST_TARGET     TABLE

If in another session (SID 186) I try to run an INSERT /*+ APPEND */ that populates only P2, the execution will sit and wait for SID 306 to release the enqueue on object id 138542, as shown in the first row from V$LOCK, where SID 186 wants to grab a TM enqueue in X mode (REQUEST 6) for object id 138542:

SQL> insert /*+ append */ into test_target select mod(rownum,5)+5, rownum from dual connect by rownum <= 10;

SQL> select * from v$lock where sid in (306,186) order by sid;

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK     CON_ID
---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ----------
       186 TM     138542          0          0          6         94          0          3
       186 AE        133          0          4          0        104          0          3
       306 AE        133          0          4          0        640          0          3
       306 TM     138542          0          6          0        583          1          3
       306 TX     131102      11102          6          0        583          0          0

Nothing new so far, it confirms what I expected.
What I didn’t know is that if “extended partition syntax” is used to specify the target partition (and if you specify the wrong one you correctly get an ORA-14401), then the X lock is held on the partition and not on the table (where the lock is held in SX mode instead).
Here is the example:

SID 306
insert /*+ append */ into test_target partition (p1) select mod(rownum,5), rownum from dual connect by rownum <= 10;

SID 186
insert /*+ append */ into test_target partition (p2) select mod(rownum,5)+5, rownum from dual connect by rownum <= 10;

SQL> select * from v$lock where sid in (306,186) and type = 'TM' order by sid;

       SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK     CON_ID
---------- -- ---------- ---------- ---------- ---------- ---------- ---------- ----------
       186 TM     138542          0          3          0         11          0          3
       186 TM     138544          0          6          0         11          0          3
       306 TM     138543          0          6          0         35          0          3
       306 TM     138542          0          3          0         35          0          3

SQL> select object_id, object_name, object_type from user_objects where object_id in (138542, 138543, 138544);

 OBJECT_ID OBJECT_NAME    OBJECT_TYPE
---------- -------------- -----------------------
    138544 TEST_TARGET    TABLE PARTITION
    138543 TEST_TARGET    TABLE PARTITION
    138542 TEST_TARGET    TABLE
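
The two V$LOCK listings can be read with a small compatibility check. The following is only a sketch in Python (not something from the original tests): it assumes the standard TM lock mode numbering (2=SS, 3=SX, 4=S, 5=SSX, 6=X) and the usual compatibility rules between those modes. The helper name `is_blocked` is made up for illustration.

```python
# Lock mode numbers as they show up in V$LOCK.LMODE / V$LOCK.REQUEST.
MODE_NAMES = {0: "none", 1: "null", 2: "SS", 3: "SX", 4: "S", 5: "SSX", 6: "X"}

# For each requested mode, the set of already-held modes it can coexist with
# on the same object (assumed standard TM lock compatibility).
COMPATIBLE = {
    2: {2, 3, 4, 5},
    3: {2, 3},
    4: {2, 4},
    5: {2},
    6: set(),
}

def is_blocked(held, requested):
    """held / requested are lists of (object_id, mode) pairs for TM enqueues.

    Returns True if any requested lock conflicts with a lock already held
    by the other session on the same object.
    """
    for req_obj, req_mode in requested:
        if req_mode < 2:          # none / null never conflict
            continue
        for held_obj, held_mode in held:
            if held_obj != req_obj or held_mode < 2:
                continue
            if held_mode not in COMPATIBLE[req_mode]:
                return True
    return False

# Plain INSERT /*+ APPEND */: both sessions want X (6) on the table 138542.
plain = is_blocked(held=[(138542, 6)], requested=[(138542, 6)])

# Extended partition syntax: SX (3) on the table, X (6) on different partitions.
by_partition = is_blocked(
    held=[(138542, 3), (138543, 6)],
    requested=[(138542, 3), (138544, 6)],
)

print("plain append blocked:", plain)
print("partition-extended append blocked:", by_partition)
```

With the lock lists taken from the extracts above, the first case conflicts on the table while the second one does not, because SX on the table is compatible with SX and the X locks sit on two different partition objects.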

So in case you know in advance where the data is going (which is generally true, especially for range/interval partitioning), you can have multiple concurrent loads into the same target table.
I’m not recommending it, just saying it’s possible 😀

August 6, 2015
by Mauro Pagano

How to quickly identify if a SQL running in parallel has been downgraded using SQLd360

This is just a quick note on how to leverage a couple of reports to identify whether a parallel (PX) execution has been downgraded. Those reports have been around for quite some time, so this isn’t about any new feature but rather about how to use what’s already there 😉

An execution is downgraded when the number of PX slaves used by the SQL at runtime is lower than the requested number. There are a few reasons why it can happen, the most common being the lack of available PX slaves at the time the SQL starts. Such a downgrade can cause anything from small to catastrophic performance degradation, depending on the severity of the downgrade (measured in %), since the CBO generates an execution plan expecting to have some specific horsepower (aka number of PX slaves) at its disposal but then that horsepower is reduced. It’s like planning how long it will take to drive from A to B with a Ferrari and then going to the garage and finding out you only have a Jetta 😀

SQLd360 aims at making it easier (and quicker) to identify if an execution was downgraded, let’s see how.

The requested DoP for a SQL is stored in the OTHER_XML column. For each PHV/source, SQLd360 provides in the “SQL Performance Summary” the MIN and MAX DoP requested, which is likely to stay stable over time (except with Adaptive Degree Policy or Adaptive Multi User, which aren’t that common anyway).
Here is an example of a SQL with a requested DoP of 4:

Starting from 11gR2 the DoP the SQL executed with can be extracted from column PX_FLAGS in ASH (TRUNC(px_flags / 2097152) as already reported by Randolf here) so for each execution that made it into ASH SQLd360 reports the “execution DoP”.
Here is the output from report “Elapsed Time for Recent Execs”:

Column MAX_PX_DEGREE is the MAX(DoP) this specific execution used (see note below about why MAX).
Column NUM_PROCESSES is the distinct number of ASH-sampled sessions that took part in this execution (up to 2*DoP).

From the report above it’s easy to spot how the execution from July 30 was indeed executed at the requested DoP (4) while the one from August 6 was downgraded (requested DoP 4, execution DoP 2).

Note: ASH reports the DoP by DFO tree, so if the plan has multiple DFO trees with different DoPs (uncommon but possible), just keep in mind the number you see in the report is the MAX.
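
The PX_FLAGS decoding mentioned above is simple enough to sketch outside the database. This is a minimal Python illustration, not SQLd360 code: `execution_dop` mirrors TRUNC(px_flags / 2097152), while `downgrade_pct` is my own assumption of how a downgrade could be expressed in %, and the sample PX_FLAGS value is made up.

```python
PX_DEGREE_DIVISOR = 2097152  # 2**21, the divisor used in TRUNC(px_flags / 2097152)

def execution_dop(px_flags):
    """DoP the execution actually ran with, decoded from ASH PX_FLAGS."""
    return px_flags // PX_DEGREE_DIVISOR  # integer division == TRUNC for positive values

def downgrade_pct(requested_dop, actual_dop):
    """Assumed definition: percentage of the requested DoP that was lost."""
    return 100.0 * (requested_dop - actual_dop) / requested_dop

# Hypothetical PX_FLAGS value: DoP 2 in the high bits plus some arbitrary low bits.
sample_px_flags = 2 * PX_DEGREE_DIVISOR + 5

actual = execution_dop(sample_px_flags)
print("execution DoP:", actual)
print("downgrade % vs requested DoP 4:", downgrade_pct(4, actual))
```

With a requested DoP of 4 and an execution DoP of 2, as in the August 6 execution described above, this definition gives a 50% downgrade.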

July 29, 2015
by Mauro Pagano

AutoDOP in 12c, what’s new and what’s old

The New Features guide for 12c reports enhancements to AutoDOP but unfortunately there aren’t many details, so let’s try to dig a little to find out what (some of) those enhancements are.

The focus of this blog post is on how the Degree of Parallelism (DoP) is computed for three SQLs in 11.2.0.4 vs 12.1.0.2 (so not considering the ADAPTIVE degree policy), each reading a single table. Joins will come in a separate post.

The first requirement to play with AutoDOP is to have IO Calibration statistics in place (UPDATE: the requirement applies to 11.2.0.x; in 12c default values are used when IO Calibration statistics are missing, and 200MB/s is the default for MAX_PMBPS).
Since all my tests are on VMs on my personal machine, the IO response time is very unstable, so for the sake of consistent results I’ll manually set the IO Calibration stats (AFAIK this isn’t recommended; also, the INSERT requires an instance restart to take effect).
We’ll use 10 as the value for MAX_PMBPS (max MBPS during parallel scan).

SQL> desc dba_rsrc_io_calibrate
 Name                  Null?    Type
 --------------------- -------- --------------
 START_TIME                     TIMESTAMP(6)
 END_TIME                       TIMESTAMP(6)
 MAX_IOPS                       NUMBER
 MAX_MBPS                       NUMBER
 MAX_PMBPS                      NUMBER
 LATENCY                        NUMBER
 NUM_PHYSICAL_DISKS             NUMBER

SQL> insert into dba_rsrc_io_calibrate values (systimestamp, systimestamp, 10000, 5, 10, 5, 10);
SQL> commit;

And here are the three tables with 10k, 50k and 16M rows used in the tests.
In the first two tables each row is built to take up pretty much one block (in an 8k block_size database), so those tables grow large even with just a few rows. The third table is designed to have very small rows.

SQL> drop table t10k purge;
SQL> drop table t50k purge;
SQL> drop table t16m purge;
SQL> create table t10k (n1 number, c1 char(2000), c2 char(2000), c3 char(2000));
SQL> create table t50k (n1 number, c1 char(2000), c2 char(2000), c3 char(2000));
SQL> create table t16m (n1 number, c1 char(1), c2 char(1), c3 char(1));
SQL> insert into t10k select rownum, 'a','b','c' from dual connect by rownum <= 10000;
SQL> insert into t50k select rownum, 'a','b','c' from dual connect by rownum <= 50000;
SQL> insert into t16m select rownum, 'a', 'b', 'c' from dual connect by rownum <= 250000; 
SQL> insert into t16m select * from t16m;  
SQL> /  
SQL> /  
SQL> /
SQL> /
SQL> /  
SQL> exec dbms_stats.gather_table_stats(user,'T10K');
SQL> exec dbms_stats.gather_table_stats(user,'T50K');
SQL>  exec dbms_stats.gather_table_stats(user,'T16M');
SQL> select table_name, num_rows, blocks from user_tables where table_name like 'T%' order by 2;
TABLE_NAME                       NUM_ROWS     BLOCKS
------------------------------ ---------- ----------
T10K                                10000      10097
T50K                                50000      50897
T16M                             16000000      35472
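
As a quick sanity check on the setup, the row count of T16M and the segment sizes can be derived from the numbers above. This is just a back-of-the-envelope Python sketch, assuming the 8k block size mentioned earlier and counting one INSERT ... SELECT plus five `/` re-executions as six doublings.

```python
BLOCK_SIZE = 8192  # bytes, 8k block_size database

# (num_rows, blocks) as reported by USER_TABLES above
tables = {
    "T10K": (10000, 10097),
    "T50K": (50000, 50897),
    "T16M": (16000000, 35472),
}

def segment_bytes(blocks):
    """Approximate segment size: number of blocks times the block size."""
    return blocks * BLOCK_SIZE

# 250000 seed rows, then the self-insert is executed 6 times (1 statement + 5 '/')
t16m_rows = 250000 * 2 ** 6

for name, (rows, blocks) in tables.items():
    size = segment_bytes(blocks)
    print(name, "bytes:", size, "MB:", round(size / 1048576, 1),
          "bytes per row:", round(size / rows))

print("expected T16M rows:", t16m_rows)
```

The bytes-per-row figure shows the design choice: roughly one block per row for T10K and T50K, and only a handful of bytes per row for T16M.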

We’ll use three SQLs to check different scenarios and how they show up in the 10053 trace.
EXPLAIN PLAN has been used just to force a hard parse each time (it won’t have any “negative” side-effect here).

explain plan for select /* Q1 */ * from t10k;   
explain plan for select /* Q2 */ * from t50k;   
explain plan for select /* Q3 */ count(distinct n1) from t16m;

As we already know (nice post here), in 11.2.0.x the DoP under AutoDOP is based on the “IO cost” of the operation, and the calculation is performed for each table individually at the time the CBO calculates cardinality and best access method for the table (aka you’ll find it under the SINGLE TABLE ACCESS PATH section of the 10053).

##### Q1 #####

The CBO decides to execute Q1 serially in both versions, let’s look into the details.
In the 11.2.0.4 trace, the four kkeCostToTime lines (green in the original post) are the input values used to perform the math, the next two lines (“Consider caching”) are about In-Memory Parallel Execution (nothing to do with the In-Memory Option), and the last three AutoDOP lines are the most interesting ones.

SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for T10K[T10K] 
  Table: T10K  Alias: T10K
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  Access Path: TableScan
    Cost:  2737.52  Resp: 2737.52  Degree: 0
      Cost_io: 2736.00  Cost_cpu: 74005180
      Resp_io: 2736.00  Resp_cpu: 74005180
kkeCostToTime: using io calibrate stats 
 maxmbps=5(MB/s) maxpmbps=10(MB/s) 
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes) 
 tot_io_size=79(MB) time=7897(ms)
AutoDOP: Consider caching for T10K[T10K](obj#70600) 
cost:2737.52 blkSize:8192 objSize:10097.00 marObjSize:9592.15 bufSize:37240.00 affPercent:80 smallTab:NO affinitized:YES
AutoDOP:parallel operation is set to NO
AutoDOP:parallel operation is set to NO
AutoDOP: Table/Index(#70600) access (scan) cost=2737.52 estTime=7896.70 unit=10000.00 dop=2 -> maxdop=2
  Best:: AccessPath: TableScan
         Cost: 2737.52  Degree: 1  Resp: 2737.52  Card: 10000.00  Bytes: 0

The following is based on my understanding and tests so it could easily be wrong.

The first “AutoDOP: parallel operation is set to NO” refers to the lack of a PARALLEL hint in the SQL. If you add the hint, the message becomes YES, preceded by another line with the degree specified in the hint (if any).

The second “AutoDOP: parallel operation is set to NO” is based on the fact this SQL is not “expensive enough” to qualify for parallel execution, basically the E-Time is lower than PARALLEL_MIN_TIME_THRESHOLD.

The last line shows the details of the math. In 11.2.0.4 the DoP is indeed computed based on how large the segment is (tot_io_size=79MB) and how fast we can read from disk (maxpmbps=10MB/s), so 79 (MB) / 10 (MB/s) = 7.9s (time=7897ms). The value is then divided by 10k (unit=, not sure where the value comes from since it doesn’t seem related to PARALLEL_MIN_TIME_THRESHOLD, even though it does represent 10s) and the result (about 0.79) is considered as the DoP.
A DoP below 1 means this execution will be serial.
The value of dop=2 -> maxdop=2 is a little misleading here (maybe it’s a default in the code): PX is rejected and the SQL goes serial.
The trace confirms it just a few lines below:

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 1.
  Cost_io:  2736.00  Cost_cpu: 74005179.68
  Card:     10000.00  Bytes:    60070000.00
  Cost:     2737.52  Est_time:  7897ms
kkopqSetDopReason: Reason why we chose this DOP is: no expensive parallel operation.
No parallel scan operations
kkopqCombineDop: Dop:1 Hint:no 
Query: compute:yes forced:no  computedDop:2 forceDop:0
kkopqSetDopReason: Reason why we chose this DOP is: parallel threshold.
Serial phase is inexpensive (2738), stick to it.
Costing completed. DOP chosen: 1.

Here we can see the DoP chosen was 1, both because no operation was expensive enough to warrant PX and because the E-Time for the SQL (7.9s) is below PARALLEL_MIN_TIME_THRESHOLD (10s).
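
The arithmetic described above for the 11.2.0.4 trace can be written down as a few lines of Python. This is only a sketch of my reading of the trace, not the optimizer’s actual code: the function names are made up, the 10s threshold is the PARALLEL_MIN_TIME_THRESHOLD value quoted above, and the input numbers are copied from the Q1 extract.

```python
def io_scan_time_ms(tot_io_size_mb, maxpmbps):
    """Time to scan the segment: size (MB) / parallel scan rate (MB/s), in ms."""
    return tot_io_size_mb * 1000.0 / maxpmbps

def io_dop(est_time_ms, unit_ms=10000.0):
    """Fractional IO DoP as read from the trace: estTime / unit."""
    return est_time_ms / unit_ms

def is_px_candidate(est_time_ms, min_time_threshold_s=10.0):
    """Assumed rule: the SQL qualifies for PX only above the time threshold."""
    return est_time_ms > min_time_threshold_s * 1000.0

# Q1 on T10K: tot_io_size=79(MB), maxpmbps=10(MB/s), estTime=7896.70, unit=10000.00
approx_time = io_scan_time_ms(79, 10)   # rounded inputs, so close to the trace value
q1_dop = io_dop(7896.70)
q1_candidate = is_px_candidate(7896.70)

print("approx scan time (ms):", approx_time)
print("fractional IO DoP:", round(q1_dop, 2))
print("candidate for PX:", q1_candidate)
```

The fractional DoP stays below 1 and the estimated time stays below the threshold, which matches the “Costing completed. DOP chosen: 1.” line in the trace.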

Nothing new so far. Let’s look into the 10053 from 12.1.0.2 for the same SQL, Q1, and see how it differs.

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T10K[T10K]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Table: T10K  Alias: T10K
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.000000  Non Adjusted: 10000.000000
  Scan IO  Cost (Disk) =   2736.000000
  Scan CPU Cost (Disk) =   74005179.680000
  Total Scan IO  Cost  =   2736.000000 (scan (Disk))
                       =   2736.000000
  Total Scan CPU  Cost =   74005179.680000 (scan (Disk))
                       =   74005179.680000
  Access Path: TableScan
    Cost:  2737.397801  Resp: 2737.397801  Degree: 0
      Cost_io: 2736.000000  Cost_cpu: 74005180
      Resp_io: 2736.000000  Resp_cpu: 74005180
  Best:: AccessPath: TableScan
         Cost: 2737.397801  Degree: 1  Resp: 2737.397801  Card: 10000.000000  Bytes: 0.000000

The first difference is that the AutoDOP computation does not happen during the access path decision, which is why no AutoDOP tag is present in the previous extract.

The calculations are performed at the end of the join selection for the query block (no join in this case and just one query block), and here is where the major changes show up (in the original post the steps were marked with different colors to make them easier to reference):

kkecComputeAPDopCS: CPUtabBytes: 82714624.000000 CPUtabRows: 10000.000000  rowSize: 6007
kkecComputeAPDopCS: Basic Stats Level
kkecComputeDOP: Proc Rate: 1048576000.000000 - Cost: 82714624.000000 - Time: 78.882812
kkecComputeDOP: Proc Rate: 1000000.000000 - Cost: 10000.000000 - Time: 10.000000
kkecComputeAPDopCS: Before Filter Preds - CPU Dop: 0.007888
kkecComputeAPDopCS: Final Filter Predicate Bytes: 0.000000 Rows: 0.000000
kkecComputeAPDop: Final -  CPU Dop: 0.007888 Rate: 1000.000000
AutoDOP: Consider caching for T10K[T10K](obj#138272)
cost:2737.397801 blkSize:8192 objSize:10097.00 marObjSize:9592.15 bufSize:41160.00 affPercent:80 smallTab:NO affinitized:YES
kkeCostToTime: using io calibrate stats maxpmbps=10(MB/s)
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes)
 tot_io_size=79(MB) time=7896(ms)
AutoDOP:parallel operation is set to NO
AutoDOP:parallel operation is set to NO
AutoDOP: Table/Index(#138272) access (scan) cost=2737.397801 estTime=7896.34 unit=10000.00 dop=2.000 -> maxdop=2
kkecComputeAPDop: IO Dop: 2.000000 - CPU Dop: 0.007888

The first big difference is that AutoDOP in 12c also includes the CPU DoP for the operations in the query block, covering not only those operations that have an IO cost (e.g. a FTS) but also those that have just a CPU cost (e.g. a join).
The operations are normalized via “processing rates”; in 12c a new view, V$OPTIMIZER_PROCESSING_RATE, provides the CBO with those numbers.

SQL> select operation_name, to_number(manual_value) manual, to_number(calibration_value) calibration, to_number(default_value) default_ from v$optimizer_processing_rate order by 1;

OPERATION_NAME                MANUAL CALIBRATION   DEFAULT_
------------------------- ---------- ----------- ----------
AGGR                             560                   1000
ALL                                                     200
CPU                                                     200
CPU_ACCESS                                              200
CPU_AGGR                                                200
CPU_BYTES_PER_SEC                560                   1000
CPU_FILTER                                              200
CPU_GBY                                                 200
CPU_HASH_JOIN                                           200
CPU_IMC_BYTES_PER_SEC                                  2000
CPU_IMC_ROWS_PER_SEC                                2000000
CPU_JOIN                                                200
CPU_NL_JOIN                                             200
CPU_RANDOM_ACCESS                                       200
CPU_ROWS_PER_SEC                                    1000000
CPU_SEQUENTIAL_ACCESS                                   200
CPU_SM_JOIN                                             200
CPU_SORT                                                200
HASH                                                    200
IO                               560                     10
IO_ACCESS                        560                     10
IO_BYTES_PER_SEC                                         10
IO_IMC_ACCESS                                          1000
IO_RANDOM_ACCESS                3000                     10
IO_ROWS_PER_SEC                                     1000000
IO_SEQUENTIAL_ACCESS             560                     10
MEMCMP                                                  500
MEMCPY                           560                   1000

Values can be set manually using DBMS_STATS.SET_PROCESSING_RATE, but they are silently ignored unless the hidden parameter _OPTIMIZER_PROC_RATE_SOURCE is set to MANUAL. They can also be gathered using DBMS_STATS.GATHER_PROCESSING_RATE.
When the value for an operation is missing, it is inherited from its parent (the hierarchy is visible in X$OPTIM_CALIB_STATS, where each STATID_KKECSTATS has its parent in PSTATID_KKECSTATS).

The first kkecComputeAPDopCS line of the 12.1.0.2 extract above (green in the original post) provides info about the table we want to scan: size in bytes, number of rows and average row length.
The kkecComputeDOP lines (orange) are where the CPU DoP is computed. My understanding is that the CPU time is the greater (expressed in ms) of “size in bytes / CPU_BYTES_PER_SEC (expressed in MB)” and “number of rows / CPU_ROWS_PER_SEC”, so in this case it’s max(78.8, 10) = 78.8, which is then divided by 10k (unit=), giving a CPU DoP of 0.0078. The values used for the computation are also reported in the OTHER_XML column.
The kkeCostToTime and AutoDOP lines (blue) are the same as in 11.2.0.4 and are where the IO DoP is computed (again with the misleading value of 2).
The final kkecComputeAPDop line (red) is where the IO DoP (misleading here) and the CPU DoP are combined.
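
The CPU DoP reading described above can be reproduced with a small Python sketch. It reflects my understanding of the trace rather than a documented formula: the two processing rates are the “Proc Rate” values printed by kkecComputeDOP, the 10k unit is the same one used for the IO DoP, and the function name is made up.

```python
CPU_BYTES_PER_SEC = 1048576000.0  # "Proc Rate" for bytes in the trace (1000 MB/s)
CPU_ROWS_PER_SEC = 1000000.0      # "Proc Rate" for rows in the trace
UNIT_MS = 10000.0                 # unit= value from the AutoDOP line

def cpu_dop(tab_bytes, tab_rows):
    """Assumed rule: greater of the byte-based and row-based times, over the unit."""
    bytes_time_ms = tab_bytes / CPU_BYTES_PER_SEC * 1000.0
    rows_time_ms = tab_rows / CPU_ROWS_PER_SEC * 1000.0
    return max(bytes_time_ms, rows_time_ms) / UNIT_MS

# Q1 on T10K: CPUtabBytes: 82714624 CPUtabRows: 10000
q1_cpu_dop = cpu_dop(82714624, 10000)
print("Q1 CPU DoP:", round(q1_cpu_dop, 6))
```

For Q1 the byte-based time (about 78.9ms) wins over the row-based one (10ms), and the result lines up with the “CPU Dop: 0.007888” reported by kkecComputeAPDopCS.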

As in the 11.2.0.4 case, this SQL doesn’t qualify for parallel execution because it’s too fast (the second “parallel operation is set to NO” line).

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 1.
  Cost_io:  2736.000000  Cost_cpu: 74005180
  Card:     10000.000000  Bytes:    60070000.000000
  Cost:     2737.397801  Est_time:  7896ms
kkopqSetDopReason: Reason why we chose this DOP is: no expensive parallel operation.
No parallel scan operations
kkopqCombineDop: Dop:1 Hint:no
Query: compute:yes forced:no  scanDop:2 cpuDop:1 forceDop:0
kkopqSetDopReason: Reason why we chose this DOP is: parallel threshold.
Serial phase is inexpensive (2737.397801), stick to it.
Costing completed. DOP chosen: 1.

##### Q2 #####

Let’s now focus on Q2, which runs on the larger table T50K.
From 11.2.0.4

SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for T50K[T50K] 
  Table: T50K  Alias: T50K
    Card: Original: 50000.000000  Rounded: 50000  Computed: 50000.00  Non Adjusted: 50000.00
  Access Path: TableScan
    Cost:  13793.68  Resp: 13793.68  Degree: 0
      Cost_io: 13786.00  Cost_cpu: 372959932
      Resp_io: 13786.00  Resp_cpu: 372959932
kkeCostToTime: using io calibrate stats 
 maxmbps=5(MB/s) maxpmbps=10(MB/s) 
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes) 
 tot_io_size=398(MB) time=39789(ms)
AutoDOP: Consider caching for T50K[T50K](obj#70601) 
cost:13793.68 blkSize:8192 objSize:50897.00 marObjSize:48352.15 bufSize:37240.00 affPercent:80 smallTab:NO affinitized:NO
AutoDOP:parallel operation is set to NO
AutoDOP:parallel operation is set to YES
AutoDOP: Table/Index(#70601) access (scan) cost=13793.68 estTime=39789.45 unit=10000.00 dop=3 -> maxdop=3
  Best:: AccessPath: TableScan
         Cost: 13793.68  Degree: 1  Resp: 13793.68  Card: 50000.00  Bytes: 0

The table this time is around 398MB, so 398 (MB) / 10 (MB/s) = 39.8s (time=39789ms), which is larger than 10 seconds (PARALLEL_MIN_TIME_THRESHOLD), hence the SQL is a candidate to run using PX, as reported in the second “AutoDOP: parallel operation is set to YES” (again, this is a guess). The DoP is computed as usual as time/unit, that is 3.9 truncated to 3.

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 1.
  Cost_io:  13786.00  Cost_cpu: 372959931.68
  Card:     50000.00  Bytes:    300400000.00
  Cost:     13793.68  Est_time:  39789ms
kkopqCombineDop: Dop:3 Hint:no 
Query: compute:yes forced:no  computedDop:3 forceDop:0
Serial plan is expensive enough to be a candidate for parallelism (13794)
Signal reparse with DOP 3.

The SQL is then re-parsed and the plan with DoP 3 is found to be cheaper than the serial one, hence accepted.

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 3.
  Cost_io:  5105.93  Cost_cpu: 3888888.89
  Card:     50000.00  Bytes:    300400000.00
  Cost:     5106.01  Est_time:  14729ms
Comparing plan with dop=3 and plan with dop=1:
  dop=3 io=5105.93 cpu=3888889 cost=5106.01 card=50000 bytes=300400000 -> est=14729ms, scaled cost=7659.01
  dop=1 io=13786.00 cpu=372959932 cost=13793.68 card=50000 bytes=300400000 -> est=39789ms, scaled cost=13793.68
Plan with dop 3 is better. Scalability: 50
Current plan with dop=3 is better than best plan with dop=1
Costing completed. DOP chosen: 3.

Again, nothing new so far. Let’s now take a look at 12.1.0.2 for Q2:

kkecComputeAPDopCS: CPUtabBytes: 416948224.000000 CPUtabRows: 50000.000000  rowSize: 6008
kkecComputeAPDopCS: Basic Stats Level
kkecComputeDOP: Proc Rate: 1048576000.000000 - Cost: 416948224.000000 - Time: 397.632812
kkecComputeDOP: Proc Rate: 1000000.000000 - Cost: 50000.000000 - Time: 50.000000
kkecComputeAPDopCS: Before Filter Preds - CPU Dop: 0.039763
kkecComputeAPDopCS: Final Filter Predicate Bytes: 0.000000 Rows: 0.000000
kkecComputeAPDop: Final -  CPU Dop: 0.039763 Rate: 1000.000000
AutoDOP: Consider caching for T50K[T50K](obj#138273) 
cost:13793.044423 blkSize:8192 objSize:50897.00 marObjSize:48352.15 bufSize:41160.00 affPercent:80 smallTab:NO affinitized:NO
kkeCostToTime: using io calibrate stats maxpmbps=10(MB/s) 
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes) 
 tot_io_size=398(MB) time=39788(ms)
AutoDOP:parallel operation is set to NO
AutoDOP:parallel operation is set to YES
AutoDOP: Table/Index(#138273) access (scan) cost=13793.044423 estTime=39787.63 unit=10000.00 dop=3.000 -> maxdop=3
kkecComputeAPDop: IO Dop: 3.000000 - CPU Dop: 0.039763

Little higher CPU DoP but still extremely low (0.039) and same IO DoP as 11.2.0.4, hence same DoP computed here as well.

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 1.
  Cost_io:  13786.000000  Cost_cpu: 372959932
  Card:     50000.000000  Bytes:    300400000.000000
  Cost:     13793.044423  Est_time:  39788ms
kkopqCombineDop: Dop:3 Hint:no 
Query: compute:yes forced:no  scanDop:3 cpuDop:1 forceDop:0
Serial plan is expensive enough to be a candidate for parallelism (13793)
Signal reparse with DOP 3.
.....
AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 3.
  Cost_io:  5105.925926  Cost_cpu: 3888889
  Card:     50000.000000  Bytes:    300400000.000000
  Cost:     5105.999379  Est_time:  14729ms
kkopqCombineDop: Dop:3 Hint:no 
Query: compute:yes forced:no  scanDop:3 cpuDop:1 forceDop:0
Adopt new plan with dop=3
Costing completed. DOP chosen: 3.

So the new algorithm leads to the same DoP (which is good, no surprises) because the CPU DoP was too small to make a dent.

##### Q3 #####

Let’s now look into Q3, which is where things start to change and get a little more tricky (to make things easier to report I turned off _optimizer_distinct_agg_transform).

From 11.2.0.4

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T16M[T16M]
  Table: T16M  Alias: T16M
    Card: Original: 16000000.000000  Rounded: 16000000  Computed: 16000000.00  Non Adjusted: 16000000.00
  Access Path: TableScan
    Cost:  9697.62  Resp: 9697.62  Degree: 0
      Cost_io: 9643.00  Cost_cpu: 2653501900
      Resp_io: 9643.00  Resp_cpu: 2653501900
kkeCostToTime: using io calibrate stats
 maxmbps=5(MB/s) maxpmbps=10(MB/s)
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes)
 tot_io_size=280(MB) time=27974(ms)
AutoDOP: Consider caching for T16M[T16M](obj#70603)
cost:9697.62 blkSize:8192 objSize:35597.00 marObjSize:33817.15 bufSize:37240.00 affPercent:80 smallTab:NO affinitized:NO
AutoDOP:parallel operation is set to NO
AutoDOP:parallel operation is set to YES
AutoDOP: Table/Index(#70603) access (scan) cost=9697.62 estTime=27973.89 unit=10000.00 dop=2 -> maxdop=2
  Best:: AccessPath: TableScan
         Cost: 9697.62  Degree: 1  Resp: 9697.62  Card: 16000000.00  Bytes: 0

So the IO DoP is 2 and the E-Time is larger than 10 secs, so we decide to go PX:

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 1.
  Cost_io:  9643.00  Cost_cpu: 2653501899.68
  Card:     1.00  Bytes:    5.00
  Cost:     9697.62  Est_time:  27974ms
kkopqCombineDop: Dop:2 Hint:no
Query: compute:yes forced:no  computedDop:2 forceDop:0
Serial plan is expensive enough to be a candidate for parallelism (9698)
Signal reparse with DOP 2.
....
AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 2.
  Cost_io:  5357.22  Cost_cpu: 1333333333.33
  Card:     1.00  Bytes:    5.00
  Cost:     5384.67  Est_time:  15533ms
Comparing plan with dop=2 and plan with dop=1:
  dop=2 io=5357.22 cpu=1333333333 cost=5384.67 card=1 bytes=5 -> est=15533ms, scaled cost=5384.67
  dop=1 io=9643.00 cpu=2653501900 cost=9697.62 card=1 bytes=5 -> est=27974ms, scaled cost=9697.62
Plan with dop 2 is better. Scalability: 50
Current plan with dop=2 is better than best plan with dop=1
Costing completed. DOP chosen: 2.

So 11.2.0.4 re-parsed the SQL and decided to go with DoP = 2 because of the IO DoP.

In 12.1.0.2 this time the CPU DoP makes a difference. From the 10053:

kkecComputeAPDopCS: CPUtabBytes: 290586624.000000 CPUtabRows: 16000000.000000  rowSize: 5
kkecComputeAPDopCS: Basic Stats Level
kkecComputeDOP: Proc Rate: 1048576000.000000 - Cost: 290586624.000000 - Time: 277.125000
kkecComputeDOP: Proc Rate: 1000000.000000 - Cost: 16000000.000000 - Time: 16000.000000
kkecComputeAPDopCS: Before Filter Preds - CPU Dop: 1.600000
kkecComputeAPDopCS: Final Filter Predicate Bytes: 0.000000 Rows: 0.000000
kkecComputeAPDop: Final -  CPU Dop: 1.600000 Rate: 1000.000000
AutoDOP: Consider caching for T16M[T16M](obj#138277) 
cost:9658.102216 blkSize:8192 objSize:35472.00 marObjSize:33698.40 bufSize:41160.00 affPercent:80 smallTab:NO affinitized:NO
kkeCostToTime: using io calibrate stats maxpmbps=10(MB/s) 
 block_size=8192 mb_io_count=4 mb_io_size=30247 (bytes) 
 tot_io_size=279(MB) time=27860(ms)
AutoDOP:parallel operation is set to NO
AutoDOP:parallel operation is set to YES
AutoDOP: Table/Index(#138277) access (scan) cost=9658.102216 estTime=27859.91 unit=10000.00 dop=2.000 -> maxdop=2
kkecComputeAPDop: IO Dop: 2.000000 - CPU Dop: 1.600000
Transfer optimizer annotations for T16M[T16M]
kkecComputeGbyObyAggrDOP: rowSize: 5.000000
kkecComputeGbyObyAggrDOP: gbyRows: 16000000.000000 gbyBytes: 80000000.000000
kkecComputeGbyObyAggrDOP: obyRows: 16000000.000000 obyBytes: 80000000.000000
kkecComputeGbyObyAggrDOP: before aggr Gby - IO Dop: 0.000000  CPU Dop: 0.000000
kkecComputeDOP: Proc Rate: 1048576000.000000 - Cost: 80000000.000000 - Time: 76.293945
kkecComputeDOP: Proc Rate: 1000000.000000 - Cost: 16000000.000000 - Time: 16000.000000
kkecComputeGbyObyAggrDOP: Basic level Aggr DOP 1.600000
kkecComputeGbyObyAggrDOP: Gby - IO Dop: 0.000000 CPU Dop: 1.600000 GbyRate: 1000.000000
kkecComputeGbyObyAggrDOP: Oby - IO Dop: 0.000000 CPU Dop: 0.000000  ObyRate: 0.000000

The important parts of the extract above are a CPU DoP of 1.6 and an IO DoP of 2 because of the access to table T16M, plus an Aggr DoP of 1.6 because of the count(distinct).
Since the E-Time is > 10 secs, we decide to go PX and trigger a reparse:

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 1.
  Cost_io:  9608.000000  Cost_cpu: 2652611720
  Card:     1.000000  Bytes:    5.000000
  Cost:     9658.102216  Est_time:  27860ms
kkopqCombineDop: Dop:2 Hint:no 
Query: compute:yes forced:no  scanDop:2 cpuDop:1 forceDop:0
Serial plan is expensive enough to be a candidate for parallelism (9658)
Signal reparse with DOP 2.

Now this is where things get a little tricky and my understanding could easily be even more wrong than before 😀

AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 2.
  Cost_io:  5337.777778  Cost_cpu: 1333333333
  Card:     1.000000  Bytes:    5.000000
  Cost:     5362.961620  Est_time:  15470ms
kkopqCombineDop: Dop:3 Hint:no 
Query: compute:yes forced:no  scanDop:2 cpuDop:3 forceDop:0
Adopt new plan with dop=2
Signal reparse with DOP 3.
....
AUTO DOP PLANS EVALUATION
***************************************
Compilation completed with Dop: 3.
  Cost_io:  3558.518519  Cost_cpu: 888888889
  Card:     1.000000  Bytes:    5.000000
  Cost:     3575.307747  Est_time:  10313ms
kkopqCombineDop: Dop:3 Hint:no 
Query: compute:yes forced:no  scanDop:2 cpuDop:3 forceDop:0
Adopt new plan with dop=3
Costing completed. DOP chosen: 3.

I omitted the details of the DoP computation, trying to keep this extremely long post just a little shorter.
The three “Query: compute:yes” lines in the previous two extracts (green in the original post) show the IO DoP (scanDop) and the CPU DoP (cpuDop): the IO DoP is consistently 2 but the CPU DoP starts as 1 and then becomes 3.
Starting from the first re-parse the IO DoP is not computed anymore, as it would provide no benefit since the size of table T16M hasn’t changed. Instead, the CBO carries over annotations on the pre-computed DoPs (some details of what an annotation is are explained here, but it has nothing to do with that bug):

Transfer optimizer annotations for T16M[T16M]
AutoDOP: dop(io):2.000
AutoDOP: rate(cpu):1000.000000 dop(cpu):1.600

and I think at this point the Aggr DoP (1.6) is added to the CPU DoP from T16M to come up with a CPU DoP of 3.
I’ve to admit I’m not sure why this isn’t done at the first iteration, maybe just to save time since a large IO DoP is enough “to get started” with a reparse.
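
To make the guess above concrete, here is a tiny sketch of the arithmetic. Both the rounding down and the "take the larger of scan and CPU DoP" rule are assumptions I picked because they reproduce the scanDop/cpuDop/Dop values printed in the trace; the function names are made up for illustration and are not Oracle internals.

```python
import math


def cpu_dop(table_cpu_dop, aggr_dop=0.0):
    """Assumed rule: add the aggregation DoP to the table CPU DoP and round down."""
    return max(1, math.floor(table_cpu_dop + aggr_dop))


def combine_dop(scan_dop, cpu):
    """Assumed rule: the combined DoP is the larger of the scan and CPU DoPs."""
    return max(scan_dop, cpu)


# First iteration: the Aggr DoP is not added yet (trace shows cpuDop:1, Dop:2).
first_pass = combine_dop(2, cpu_dop(1.6))
# After the reparse: 1.6 + 1.6 = 3.2 (trace shows cpuDop:3, Dop:3).
later_pass = combine_dop(2, cpu_dop(1.6, 1.6))

if __name__ == "__main__":
    print("first pass DoP:", first_pass)
    print("later pass DoP:", later_pass)
```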

This is where the new AutoDOP starts to behave differently and return a different DoP for a SQL (3 instead of 2).

I haven’t seen enough real life cases on AutoDOP in 12c to say if it works better than before, but for sure it seems more aware of the whole plan and it factors in additional steps (i.e. those that are CPU only), trying to come up with a more accurate DoP.
I suspect most SQLs won’t get a different DoP, just because most large parallel executions are constrained by IO scans rather than by large CPU operations, but it’s exactly in those environments with large sorts or aggregations that the new AutoDOP can probably surprise you, for good or for bad 😉

May 4, 2015
by Mauro Pagano 8 Comments

Historical SQL Monitor reports in 12c!

I think SQL Monitoring is one of the greatest additions to the Oracle performance diagnostics world in a long time (since 10046, maybe?) because it allows us to collect in a single shot a long list of crucial pieces of information that are otherwise painful to extract and put together. The report provides a complete picture of the execution across processes (PX), nodes (RAC), etc.
On the other hand, one of the major limitations (the biggest one, imho) is that SQL Monitor info is quickly aged out of memory and not stored in AWR (there is no DBA_HIST_SQL_MONITOR or similar), so it is unlikely to be around for a post-mortem investigation.

Good news is in 12c we can pull a report for a historical execution, provided the execution was expensive enough to be a candidate for collection. I didn’t investigate the details of the automatic collection yet but there are two modes, details from DBA_HIST_REPORTS_CONTROL:

REGULAR – per-minute report capture subject to DBTIME budget
FULL_CAPTURE – capture will be run per minute without the DBTIME budget constraints

The FULL_CAPTURE can be enabled using DBMS_AUTO_REPORT.START_REPORT_CAPTURE/FINISH_REPORT_CAPTURE.

Info about each automatically collected report is stored in DBA_HIST_REPORTS and the report itself (in XML format) is stored in DBA_HIST_REPORTS_DETAILS.

The whole framework is also used for Real-Time ADDM so the DBA_HIST_REPORTS* views are not organized in a SQL Monitor-friendly way (ie. SQL ID/SQL Exec ID/SQL Exec Start) but rather in a report-oriented way, the key is REPORT_ID.
Column COMPONENT_NAME helps track down the source of the report, “sqlmonitor” in this case.
A summary of the report is stored in REPORT_SUMMARY in XML format, so a simple SQL like the following pulls the list of REPORT_ID/SQL_ID (plus anything else you may want to extract from the summary, ie. SQL Exec ID and SQL Exec Start):

SELECT report_id,
       EXTRACTVALUE(XMLType(report_summary),'/report_repository_summary/sql/@sql_id') sql_id,
       EXTRACTVALUE(XMLType(report_summary),'/report_repository_summary/sql/@sql_exec_id') sql_exec_id,
       EXTRACTVALUE(XMLType(report_summary),'/report_repository_summary/sql/@sql_exec_start') sql_exec_start
  FROM dba_hist_reports
 WHERE component_name = 'sqlmonitor'

From my system I have

 REPORT_ID SQL_ID          SQL_EXEC_I SQL_EXEC_START
---------- --------------- ---------- ------------------------------
      1022 fx439nus0rtcz     16777216 04/29/2015 13:34:15
      1024 fjvsmy2yujbqd     16777216 04/29/2015 13:40:00
      1025 9qn59dh1w8352     16777216 04/29/2015 13:41:12
      1026 1uqrk6t8gfny8     16777216 04/29/2015 13:41:14
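
If you prefer to post-process REPORT_SUMMARY outside the database, the same XPath can be walked with any XML parser. The sketch below uses Python's standard library; the sample XML is a minimal reconstruction I made up from the XPath in the query above (a real summary carries more elements and attributes), and summary_keys is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down summary holding only the attributes used above.
SAMPLE_SUMMARY = (
    '<report_repository_summary>'
    '<sql sql_id="fx439nus0rtcz" sql_exec_id="16777216" '
    'sql_exec_start="04/29/2015 13:34:15"/>'
    '</report_repository_summary>'
)


def summary_keys(report_summary):
    """Return (sql_id, sql_exec_id, sql_exec_start) from a summary, or None."""
    root = ET.fromstring(report_summary)
    sql = root.find("sql")  # same node as /report_repository_summary/sql
    if sql is None:
        return None
    return (sql.get("sql_id"), sql.get("sql_exec_id"), sql.get("sql_exec_start"))


if __name__ == "__main__":
    print(summary_keys(SAMPLE_SUMMARY))
```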

Using the REPORT_ID we can now extract the report in different formats (HTML, TEXT, ACTIVE, XML) using DBMS_AUTO_REPORT.REPORT_REPOSITORY_DETAIL

SQL> set long 10000000 longchunksize 10000000 pages 0
SQL> SELECT DBMS_AUTO_REPORT.REPORT_REPOSITORY_DETAIL(RID => 1022, TYPE => 'text')
       FROM dual;

SQL Monitoring Report

SQL Text
------------------------------
select count(*) from test_inmemory

Global Information
------------------------------
 Status             : DONE (ALL ROWS)
 Instance ID        : 1
 Session            : MPAGANO (12:35570)
 SQL ID             : fx439nus0rtcz
 SQL Execution ID   : 16777216
 Execution Started  : 04/29/2015 13:34:15
 First Refresh Time : 04/29/2015 13:34:15
 Last Refresh Time  : 04/29/2015 13:34:15
 Duration           : .064582s
 Module/Action      : SQL*Plus/-
 Service            : orcl
 Program            : sqlplus@Mauros-iMac.local (TNS V1-V3)
 Fetch Calls        : 1
.....

or if we want it in flashy ACTIVE format

SQL> set trimspool on
SQL> set trim on
SQL> set pages 0
SQL> set linesize 1000
SQL> set long 1000000
SQL> set longchunksize 1000000
SQL> spool historical_sqlmon.html
SQL> SELECT DBMS_AUTO_REPORT.REPORT_REPOSITORY_DETAIL(RID => 1022, TYPE => 'active')
       FROM dual;
SQL> spool off

Several details are missing from this post (and from my knowledge of how the whole framework works) but I’ll make sure to pass them along as soon as I find out more.

Hopefully this blog post will be useful next time you are trying to figure out why a SQL ran so slow last weekend 🙂

UPDATE: DBA_HIST_REPORTS also exposes SQL_ID, SQL_EXEC_ID, SQL_EXEC_START and several other pieces of information (concatenated together into a single column and separated by the ‘3#’ sign) as the basic columns KEY1, KEY2, KEY3 and KEY4. The following SQL provides the same information as the original SQL posted in this blog post:

SELECT report_id, key1 sql_id, key2 sql_exec_id, key3 sql_exec_start
  FROM dba_hist_reports
 WHERE component_name = 'sqlmonitor'

April 29, 2015
by Mauro Pagano 5 Comments

TEMP I/O and Linux TMPFS

There are already several articles on the web about how to speed up TEMP I/O in an Oracle database (I really like this one from Alex Fatkulin), especially since even the cool and fancy Engineered Systems still behave pretty much the same when it comes to spilling to disk. So this blog post is probably not going to show anything new, but at least it can work as a reminder for me 🙂

I have an artificial SQL (pretty ugly to be honest) that spills to disk, around 665M

select count(distinct c1)
  from (select c1 
          from (select rownum||lpad('x',999,'x') c1 from dual connect by rownum <= 5*1e4), 
               (select rownum n1 from dual connect by rownum <= 1e2))

this is the result from SQL Monitor on 12.1.0.2 (I’m using the Oracle DB Developer VM)

Considering TEMP data doesn’t require any kind of persistency, redundancy or recoverability, we can consider creating the TEMP tablespace on a file under tmpfs/ramfs. I like tmpfs a little better just because its size can be capped (even though it might swap, hence reducing the benefit of going to memory).
From Linux

mount -o size=1G -t tmpfs none /mnt/mytmpfs

and then from SQL*Plus

SQL> create temporary tablespace TEMP2 tempfile '/mnt/mytmpfs/temp2.f' size 1G;
SQL> alter user mpagano temporary tablespace temp2;

and this is the result for the same SQL when using the new shiny TEMP2 tablespace

Notice how the same amount of data has been spilled to disk but the elapsed time is half and the “direct path write temp” is basically gone.
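
Since the tmpfs mount is capped (1G here) and the tempfile is sized at 1G as well, it can be worth checking how much room the mount really has before pointing a tempfile at it. This is a small sketch using only the Python standard library; has_room is a made-up helper name, and I have not verified how Oracle behaves when the mount fills up, so treat it as a sanity check only.

```python
import shutil


def has_room(path, needed_bytes):
    """Return True if the filesystem holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    # "/mnt/mytmpfs" is the mount point used above; "." keeps the sketch runnable anywhere.
    for candidate in ("/mnt/mytmpfs", "."):
        try:
            print(candidate, "has room for 1G:", has_room(candidate, one_gib))
        except FileNotFoundError:
            print(candidate, "is not mounted on this machine")
```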

DISCLAIMER: There are a few “oddities” in both reports: the 0 buffer gets and no “direct path read temp” in the first one, and the absence of “direct path write temp” in the second one. My educated guess is that the low number of samples (the SQL is fast in both cases) combined with the low wait time for the events (the VM writes to a file that could benefit from caching) could explain why not all the info is captured.

April 27, 2015
by Mauro Pagano 2 Comments

Poor man SQL Monitor with SQLd360…kind of!

One of the most interesting sections of a SQL Monitor report is IMHO the Metrics page, where we can see on a timeline the impact / requirement of an execution in terms of CPU, IOPS, MBPS, PGA and TEMP.
Starting with 11gR2, we can get something similar (not exactly the same) from ASH, this is from DBA_HIST_ACTIVE_SESS_HISTORY

For each TM_DELTA_TIME we know how much CPU and DB Time has been spent, as well as how many RW IOPS and RW and Interconnect (M)BPS per DELTA_TIME. The metrics are accumulated and reported at the time of the ASH sample “over the Delta Time” that roughly matches with the ASH sampling one, so ie. from V$ACTIVE_SESSION_HISTORY we are able to see how many RW IOPS per second (since the DELTA_TIME is roughly going to be the second) are accounted by the session we focus on. Also each time the ASH sample is taken the PGA and TEMP consumed by the session at that time is tracked.
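
As a rough illustration of the math, this is how one ASH row could be turned into per-second rates. The dictionary keys are the ASH column names (DELTA_TIME is documented in microseconds); the sample values and the ash_rates helper are made up for the example.

```python
def ash_rates(sample):
    """Turn the DELTA_* counters of one ASH row into IOPS and MB/s, or None."""
    seconds = sample["DELTA_TIME"] / 1_000_000  # DELTA_TIME is in microseconds
    if seconds <= 0:
        return None
    io_requests = sample["DELTA_READ_IO_REQUESTS"] + sample["DELTA_WRITE_IO_REQUESTS"]
    io_bytes = sample["DELTA_READ_IO_BYTES"] + sample["DELTA_WRITE_IO_BYTES"]
    return {
        "iops": io_requests / seconds,
        "mbps": io_bytes / seconds / (1024 * 1024),
    }


# Hypothetical sample: one second of activity for a single session.
SAMPLE_ROW = {
    "DELTA_TIME": 1_000_000,
    "DELTA_READ_IO_REQUESTS": 150,
    "DELTA_WRITE_IO_REQUESTS": 50,
    "DELTA_READ_IO_BYTES": 3 * 1024 * 1024,
    "DELTA_WRITE_IO_BYTES": 1 * 1024 * 1024,
}

if __name__ == "__main__":
    print(ash_rates(SAMPLE_ROW))
```

Rows from DBA_HIST_ACTIVE_SESS_HISTORY work the same way, only the delta interval is roughly ten times longer since only one sample in ten is kept.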

Starting with SQLd360 v1511 those columns are collected and aggregated/charted over time, so we can look at the consumption of a single execution via SQL Monitor (or from raw ASH data) and at the same time evaluate the historical trend from SQLd360. The PlansAnalysis page in SQLd360 now looks something like this

Other new additions since the last post (v1507) are:

SQLd360 now uses a configuration file to determine how many days to collect info for and which “major” external APIs to collect / ignore
Several reports have been rewritten to better represent info from Parallel Executions
New sections provide info on
- Fix Controls (default and non-default) settings
- NLS settings
- Adaptive Cursor Sharing info
- SQL Monitoring info
- Partition Statistics history info
- SQLs with the same Force Matching Signature

plus as usual bug fixes here and there.

Thanks to everybody who reported issues and provided feedback for new features! 🙂

March 26, 2015
by Mauro Pagano Leave a comment

SQLd360 and ORA-7445 [evaopn3()]

A few kind people reported to me that SQLd360 disconnected in the middle of the execution. Looking into the details, the cause was an ORA-7445 [evaopn3()], but such errors are a little tricky to track down because they only show up when a faulty plan (that’s the bug) meets specific data. That’s why the error never reproduced in my labs or on several other folks’ systems.

Galo Balda was kind and patient enough to report the error and test my suspicions (Thanks Galo!!!) so we were able to identify a way to avoid the error.

The ORA-7445 was caused by bug 12672969 (I suggest you check the details in MOS, the bug is pretty “common”) and starting today SQLd360 works around it so that your execution shouldn’t disconnect anymore!

If you tried SQLd360 in the past but gave up on it because of the unexpected disconnect then I suggest you give it one more shot now that the error should be taken care of!

March 25, 2015
by Mauro Pagano Leave a comment

Quick update on SQLd360 recent changes

The last blog post was about SQLd360 v1504 and last night I released v1507, so a couple of releases (v1505, v1506) were silent. This post is just a quick update on what’s changed in the last few weeks.

ASH SQL reports have been removed, the API is very slow (was taking up to 90% of the overall SQLd360 elapsed time) and the report itself is usually not that useful compared to raw data.
ASH Raw Data report has been enhanced with several new columns (mostly wait events related), this section is useful for quick searches on the most common ASH columns
SQLd360 now leverages eAdam to export GV$ACTIVE_SESSION_HISTORY and DBA_HIST_ACTIVE_SESS_HISTORY for the SQL ID of interest, this is useful for detailed ASH mining operations
Two new reports have been included, number of execs by PHV and total elapsed time by PHV, they can help in understanding how impactful each PHV is (ie. one single exec of the bad plan took as much as 20 execs of the good plan)
CBO environment and IO Calibration statistics are now collected

Plus a few minor improvements as well as several bug fixes 🙂
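
The idea behind the two PHV reports mentioned above is a plain group-by. The sketch below shows it on made-up data (the plan hash values and timings are hypothetical, and this is not the SQL that SQLd360 actually runs):

```python
from collections import defaultdict


def summarize_by_phv(executions):
    """executions: iterable of (plan_hash_value, elapsed_secs) pairs."""
    totals = defaultdict(lambda: {"execs": 0, "elapsed_secs": 0.0})
    for phv, elapsed_secs in executions:
        totals[phv]["execs"] += 1
        totals[phv]["elapsed_secs"] += elapsed_secs
    return dict(totals)


# Hypothetical data: 20 fast execs of a good plan, 1 slow exec of a bad plan.
SAMPLE_EXECS = [(111, 1.0)] * 20 + [(222, 20.0)]

if __name__ == "__main__":
    for phv, stats in summarize_by_phv(SAMPLE_EXECS).items():
        print(phv, stats)
```

Looking at execs alone the bad plan seems negligible (1 out of 21), while the elapsed time view shows it accounts for half of the total.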

As usual please reach out to me if you have any question/comment/concern/suggestion!!

March 10, 2015
by Mauro Pagano 1 Comment

Trouble reading an execution plan? Maybe SQLd360 can help!

Last week I was in Dallas, TX for Hotsos 2015 and I had a few questions on how to read an execution plan, so I thought why not include in SQLd360 an alternative representation of the execution plan, maybe one that is easier to understand and closer to the academic explanation of the execution plan as a tree.

The main improvements for version v1504 (released yesterday, March 8th) are around execution plans analysis, there is a new link “Plans analysis” in the Plans section with a few new reports, all organized by PHV.
Basically this new page will show N columns where N is the number of distinct plans for the SQL we focus on and for each PHV we have

Plan Tree
Average elapsed time per execution (recent and historical) from ASH
Top 15 wait events
Top 15 objects accessed
Top 15 execution plan step id / operation

The plan tree is an organizational chart where the execution plan is represented as an upside-down tree: the first operation executed is the first leaf on the extreme left, the next operation is the next leaf (walking from left to right), and each pair (or more) of nodes is combined by its parent node.

Ie. in the following execution plan the first step is #5, but somebody might be tricked into thinking it’s #13 (or #7) since those are more indented

while looking at the new chart it is way easier to understand that step #5 is indeed the first one executed
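
The "leftmost leaf first" reading can be sketched as a post-order walk of the tree. The plan below is a small made-up one (not the plan from the screenshot), and the walk is the usual rule of thumb, which some operations do not follow strictly:

```python
def execution_order(plan):
    """plan: list of (id, parent_id) rows; returns the step ids in post-order."""
    children = {}
    root = None
    for step_id, parent_id in sorted(plan):
        if parent_id is None:
            root = step_id
        else:
            children.setdefault(parent_id, []).append(step_id)

    order = []

    def visit(step_id):
        # Children first (left to right), then the node that combines them.
        for child in children.get(step_id, []):
            visit(child)
        order.append(step_id)

    visit(root)
    return order


# Hypothetical plan: step 2 is the leftmost leaf, step 5 is the most indented one.
SAMPLE_PLAN = [(0, None), (1, 0), (2, 1), (3, 1), (4, 3), (5, 4), (6, 3)]

if __name__ == "__main__":
    print(execution_order(SAMPLE_PLAN))
```

In this toy plan step 2 comes out first even though step 5 sits two levels deeper, which is the same trap as #5 versus #13 above.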

Hopefully you will like it too!!

Other small additions for v1504 are Optimizer system statistics, Segments/Objects information and fixes here and there that (hopefully) reduce the number of bugs in the code 🙂

February 27, 2015
by Mauro Pagano 5 Comments

SQLd360 v3 now available, new features and (hopefully) less bugs!

A couple of weeks ago SQLd360 v1 came out and now v3, so what happened to v2? It was a “silent” release to fix a bug and sneak in a couple of features I didn’t complete in time for v1, but most of the new features are in v3.
The main page now looks something like this

so the list of content grew a bit and it will keep growing release after release.
List of bugs (hopefully) shrank as well but no promise on that 🙂

The major additions are

Plan Control section reporting info about SQL Profiles, SQL Plan Baselines and SQL Patches
Execution plans details from SQL Plan Baselines
Time series report/chart for average and median elapsed for the SQL (regardless of the plan), this can help answer the question “how is my SQL doing over time?”
Time series report/chart for average elapsed time per plan hash value for the SQL, this can help identify if a change in performance is caused by a plan change
A standalone script to execute the SQL from SQL*Plus, the script includes bind variables definition/assignment in case the SQL has binds
Cursor Sharing section reporting info from GV$SQL_SHARED_CURSOR
Bind datatype mismatch report, this can help identify those cases where apps define binds for the same SQL using different datatypes

plus some other minor reports/tweaks.

Bug fixes here and there, especially in the histograms section where now every endpoint value is (AFAIK, if you see issues let me know so I can fix them!) properly converted and charted so we can leverage charts to analyze histograms 😀
Ie the following one is for SH.CUSTOMERS.CUST_YEAR_OF_BIRTH

From the chart above we can see how the data is distributed and for each bar(/bucket) the (approximate) value, how many rows belong to each bucket and the selectivity for an equality predicate for that specific value.
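
For a frequency histogram the numbers behind each bar are simple to derive: the endpoint numbers are cumulative, so the rows per value are the difference between consecutive endpoints and the equality selectivity is that count over the total. The sketch below assumes a frequency histogram and uses made-up endpoints, not the real SH.CUSTOMERS data:

```python
def frequency_histogram(endpoints):
    """endpoints: (endpoint_number, endpoint_value) pairs with cumulative numbers.

    Returns (value, rows, selectivity) tuples, one per bucket.
    """
    ordered = sorted(endpoints)
    total = ordered[-1][0]  # the last cumulative endpoint covers all rows
    buckets = []
    previous = 0
    for endpoint_number, endpoint_value in ordered:
        rows = endpoint_number - previous
        buckets.append((endpoint_value, rows, rows / total))
        previous = endpoint_number
    return buckets


# Hypothetical endpoints for three birth years.
SAMPLE_ENDPOINTS = [(10, 1950), (40, 1960), (100, 1970)]

if __name__ == "__main__":
    for value, rows, selectivity in frequency_histogram(SAMPLE_ENDPOINTS):
        print(value, rows, selectivity)
```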

More things are in the works and as usual feedback/comments/reported issues are very much welcome!!!!