Troubleshooting tips and tricks for Oracle Database Oct 2020

VP AIOps for the Autonomous Database
Sandesh Rao
15 Troubleshooting tips and tricks for the
Oracle Database
@sandeshr
https://www.linkedin.com/in/raosandesh/
https://www.slideshare.net/SandeshRao4

Systemstate dumps what they are
and how to read them

A systemstate is made up of the process state of each process found in the instance
at the time the systemstate was taken.
Each process state is made up of SOs (State Objects), which hold details of the
state of the objects currently owned by that PROCESS.
To navigate a systemstate:
1. Find what process most sessions are waiting for
2. Recursively navigate what each process is waiting for
3. When you find a process that is on the CPU rather than waiting, get an error stack
to understand what it is doing and why it is blocking the others
Systemstate Dumps

These are waits for locks held upon a particular object. In the example below, the process is waiting for
a TX enqueue as indicated by the "waiting for 'enq: TX - row lock contention'" message:
Enqueues
PROCESS 41
...
waiting for 'enq: TX - row lock contention' blocking sess=0x39b3a5c90 seq=152 wait_time=0 seconds since wait
started=796
name|mode=54580006, usn
* 54580006 is hexadecimal and can be split up as follows to reveal the meaning:
* 0x54 (ASCII 'T') + 0x58 (ASCII 'X') => TX, followed by mode 0006 (6 = eXclusive) ...
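The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_name_mode` is hypothetical, and the mode table assumes the usual enqueue mode numbering (0 to 6), which is not spelled out in the dump itself.

```python
# Assumption: name|mode packs two ASCII characters (the lock type) into the
# high 16 bits and the requested mode into the low 16 bits, as in the
# 54580006 example.
LOCK_MODES = {
    0: "none", 1: "null (N)", 2: "row share (SS)", 3: "row exclusive (SX)",
    4: "share (S)", 5: "share row exclusive (SSX)", 6: "exclusive (X)",
}

def decode_name_mode(hex_value):
    """Split a hexadecimal name|mode value into (lock type, mode, mode name)."""
    value = int(hex_value, 16)
    lock_type = chr((value >> 24) & 0xFF) + chr((value >> 16) & 0xFF)
    mode = value & 0xFFFF
    return lock_type, mode, LOCK_MODES.get(mode, "unknown")

print(decode_name_mode("54580006"))
```

For the value in the example this yields the TX lock type with mode 6, matching the "req: X" entry found in the process dump.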

To find more details on the enqueue, search DOWN within the process for the string 'req:'.
In this case we find a section with a "req: X" request.
"req:" refers to the request for the TX lock that the 'enq: TX - row lock
contention' wait is waiting on. The request is for an eXclusive TX lock.
This section also reveals the enqueue name as a string (TX-00020009-0001FA04), which can be used to
search for the HOLDER. The holder of the resource is shown with the string "mode:" followed by the mode
the lock is held in, in this case eXclusive.
The holder has the enqueue in mode X, which is incompatible with the req: X request, so the requester has to wait...
SO: 39ad80d60, type: 5, owner: 393cb85e0, flag: INIT/-/-/0x00
(enqueue) TX-00020009-0001FA04 DID: 0001-0029-00000090
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 flag: 0x6
res: 39aef20c8, req: X, prv: 39aef20e8, own: 39b383aa8, sess: 39b383aa8, proc: 39b7384f0
(enqueue) TX-00020009-0001FA04 DID: 0001-002E-00000014
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 flag: 0x6
res: 39aef20c8, mode: X, prv: 39aef20d8, own: 39b3a5c90, sess: 39b3a5c90, proc: 39b73ac78

Row cache waits are waits against the Row Cache (or Dictionary Cache). Processes will show "waiting
for 'row cache lock'".
• mode=0 shows the lock is not currently held
• request=3 shows we are requesting the lock in Shared mode (mode 3)
• object=7000000eedc13a0 shows the object we are requesting the lock on
• request=S shows the requested lock is Shared (S)
• cid=7(dc_users) shows the cache type is dc_users, with a cache ID of 7
• mode=X shows the lock is held in eXclusive mode
Rowcache locks
PROCESS 19:
...
waiting for 'row cache lock' blocking sess=0x0 seq=2174 wait_time=0
cache id=7, mode=0, request=3
--------------------------------------------------------------------------------
SO: 7000000c6de7678, type: 48, owner: 7000000a6c97cf8, flag: INIT/-/-/0x00
row cache enqueue: count=1 session=7000000a660b8b0 object=7000000eedc13a0, request=S
savepoint=2148
row cache parent object: address=7000000eedc13a0 cid=7(dc_users) hash=2a057ebe typ=9 transaction=7000000c42297a0
flags=00000002
own=7000000eedc1480[7000000c6de8518,7000000c6de8518] wat=7000000eedc1490[7000000c6de7568,7000000c6deed98] mode=X
status=VALID/-/-/-/-/-/-/-/-
request=N release=TRUE flags=0

This process is waiting for 'row cache lock'. The waiter is waiting for "object=7000000eedc13a0" and is requesting a
Share mode lock ("request=S"). To find the HOLDER, search for the same object address followed by the string "mode=" instead of "request=", which indicates a holder.
SO: 7000000c6de84e8, type: 48, owner: 7000000c42297a0, flag: INIT/-/-/0x00
row cache enqueue: count=1 session=7000000a6702710 object=7000000eedc13a0, mode=X
savepoint=109
row cache parent object: address=7000000eedc13a0 cid=7(dc_users)
hash=2a057ebe typ=9 transaction=7000000c42297a0 flags=00000002
own=7000000eedc1480[7000000c6de8518,7000000c6de8518] wat=7000000eedc1490[7000000c6de7568,7000000c6df1b08] mode=X
status=VALID/-/-/-/-/-/-/-/-
instance lock id=QH 00000440 00000000
set=0, complete=FALSE
data=
In this case the holder has the lock in eXclusive mode
(i.e. object=7000000eedc13a0, mode=X). Search back up to the top
of that process state to find which process is holding the resource.
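The waiter/holder search described above can be scripted. The following is a minimal sketch that assumes the "row cache enqueue:" line layout shown in this example; the function name `row_cache_enqueues` is hypothetical and other Oracle versions may format the line differently.

```python
import re

# Assumption: each enqueue line looks like the sample dump, with either
# "request=<mode>" (a waiter) or "mode=<mode>" (a holder) after the object.
ENQUEUE_RE = re.compile(
    r"row cache enqueue:.*?session=(?P<session>[0-9a-f]+)\s+"
    r"object=(?P<object>[0-9a-f]+),\s*(?P<kind>request|mode)=(?P<lock>\w+)"
)

def row_cache_enqueues(dump_text):
    """Return (waiters, holders) as lists of (object, session, lock mode)."""
    waiters, holders = [], []
    for m in ENQUEUE_RE.finditer(dump_text):
        entry = (m["object"], m["session"], m["lock"])
        (holders if m["kind"] == "mode" else waiters).append(entry)
    return waiters, holders

sample = """\
row cache enqueue: count=1 session=7000000a660b8b0 object=7000000eedc13a0, request=S
row cache enqueue: count=1 session=7000000a6702710 object=7000000eedc13a0, mode=X
"""
waiters, holders = row_cache_enqueues(sample)
print("waiters:", waiters)
print("holders:", holders)
```

Matching waiters and holders on the object address then points at the session that holds the row cache object everyone else is queued behind.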

Waits for library cache pins are of the form "waiting for 'cursor: pin S wait on X'".
To find more details, use the idn=XXXXXX value to search down in the systemstate (here idn=535d1a6c)
• SID 3094 holds the Mutex (3094, 0)
• The request is for Shared (GET_SHRD) mode
Library Cache Pins
PROCESS 16:
waiting for 'cursor: pin S wait on X' blocking sess=0x0 seq=58849 wait_time=0 seconds since wait started=0
idn=535d1a6c, value=c1600000000, where|sleeps=5003f2428
KGX Atomic Operation Log 7000002e5b9d160
Mutex 7000002b8e92268(3094, 0) idn 535d1a6c oper GET_SHRD
Cursor Pin uid 2489 efd 0 whr 5 slp 58733
opr=2 pso=70000028c47def0 flg=0
pcs=7000002b8e92268 nxt=0 flg=34 cld=3 hd=70000030d6c6eb0 par=7000002eefe64d0
ct=31 hsh=0 unp=0 unn=0 hvl=b825a4d0 nhv=1 ses=700000309b42600
hep=7000002b8e922e8 flg=80 ld=1 ob=7000002de49f8a0 ptr=70000022cf39db8 fex=70000022cf390c8

To find the HOLDER, search for "idn XXXXXXX oper" until you find an entry which is held (i.e. the oper is not GET_XXX).
Here the search string is "idn 535d1a6c oper":
• SID 3094 holds the Mutex in Exclusive (EXCL) mode
KGX Atomic Operation Log 7000002cd934270
Mutex 7000002b8e92268(3094, 0) idn 535d1a6c oper EXCL
Cursor Pin uid 3094 efd 0 whr 7 slp 0
opr=3 pso=7000002a71c4180 flg=0
pcs=7000002b8e92268 nxt=0 flg=34 cld=3 hd=70000030d6c6eb0 par=7000002eefe64d0
ct=31 hsh=0 unp=0 unn=0 hvl=b825a4d0 nhv=1 ses=700000309b42600
hep=7000002b8e922e8 flg=80 ld=1 ob=7000002de49f8a0 ptr=70000022cf39db8 fex=70000022cf390c8
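The same idn search can be sketched in Python. This assumes the "Mutex <address>(<sid>, <n>) idn <idn> oper <operation>" layout shown in the two dumps above; the function name `mutex_holders` is hypothetical.

```python
import re

# Assumption: mutex lines follow the layout seen in the sample dumps.
MUTEX_RE = re.compile(
    r"Mutex\s+(?P<addr>[0-9a-f]+)\((?P<sid>\d+),\s*(?P<ref>\d+)\)\s+"
    r"idn\s+(?P<idn>[0-9a-f]+)\s+oper\s+(?P<oper>\w+)"
)

def mutex_holders(dump_text, idn):
    """Return (holding SID, operation) for entries on idn that are not GET_* requests."""
    return [
        (int(m["sid"]), m["oper"])
        for m in MUTEX_RE.finditer(dump_text)
        if m["idn"] == idn and not m["oper"].startswith("GET_")
    ]

sample = """\
Mutex 7000002b8e92268(3094, 0) idn 535d1a6c oper GET_SHRD
Mutex 7000002b8e92268(3094, 0) idn 535d1a6c oper EXCL
"""
print(mutex_holders(sample, "535d1a6c"))
```

Entries whose operation starts with GET_ are requests still in flight, so filtering them out leaves only the entries that actually hold the mutex.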

To find more details, use the handle address in the form handle=address to search down in the
systemstate (i.e. handle=70000030de975a8)
• Exclusive (X) is requested
• <USER_NAME>.<OBJECT_NAME> is the object we are trying to lock
Library Cache Lock
PROCESS 35:
waiting for 'library cache lock' blocking sess=0x0 seq=35844 wait_time=0 seconds since wait started=14615
handle address=70000030de975a8, lock address=70000026947e190, 100*mode+namespace=12d
SO: 70000026947e190, type: 53, owner: 700000308d726f0, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=70000026947e190 handle=70000030de975a8 request=X
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=70000026947e210[7000002b333ffe8,7000002b333ffe8] htb=7000002b333ffe8 ssga=7000002b333f2a0
user=700000307a7ca68 session=700000307a7ca68 count=0 flags=[0000] savepoint=0x23e411
LIBRARY OBJECT HANDLE: handle=70000030de975a8 mtx=70000030de976d8(0) cdp=0
name=<USER_NAME>.<OBJECT_NAME>
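The third wait parameter, 100*mode+namespace=12d, is also hexadecimal and can be split arithmetically. A small sketch follows; the function name is hypothetical, and the mode names in the table are an assumption based on the commonly documented library cache lock modes rather than something shown in this dump.

```python
# Assumption: library cache lock modes are 1 = null, 2 = share, 3 = exclusive.
LIBRARY_LOCK_MODES = {0: "none", 1: "null (N)", 2: "share (S)", 3: "exclusive (X)"}

def decode_mode_namespace(hex_value):
    """Split a hexadecimal 100*mode+namespace value into (mode, namespace, mode name)."""
    mode, namespace = divmod(int(hex_value, 16), 100)
    return mode, namespace, LIBRARY_LOCK_MODES.get(mode, "unknown")

print(decode_mode_namespace("12d"))
```

Here 0x12d is 301 in decimal, which splits into mode 3 and namespace 1, consistent with the request=X shown on the LIBRARY OBJECT LOCK line.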

To find the HOLDER, search for 'handle=XXXXXXXXXX mode=' until you find a lock which is held in a mode
other than NULL (here: handle=70000030de975a8 mode=)
• Held in Shared (S) mode
• name=<USER_NAME>.<OBJECT_NAME> confirms the object name
SO: 700000288b03ae0, type: 53, owner: 7000002cc697468, flag: INIT/-/-/0x00
LIBRARY OBJECT LOCK: lock=700000288b03ae0 handle=70000030de975a8 mode=S
call pin=0 session pin=0 hpc=0000 hlc=0000
htl=700000288b03b60[7000002a179a1a8,7000002b3800878] htb=7000002b3800878 ssga=7000002b37ffb30
user=70000030fafab00 session=70000030fafab00 count=1 flags=[0000] savepoint=0x417
LIBRARY OBJECT HANDLE: handle=70000030de975a8 mtx=70000030de976d8(0) cdp=0
name=<USER_NAME>.<OBJECT_NAME>

• 9d is the latch# in hexadecimal (157 in decimal), as listed in v$latchname
Towards the top of the PROCESS dump you will see the exact latch we are waiting for and even who holds it:
• PROCESS 127 (ospid:23086) holds the latch, as the "holding" entry in PROCESS 127 shows:
Latch free
PROCESS 8:
waiting for 'latch free' blocking sess=0x0 seq=4577 wait_time=0
address=99ff60018, number=9d, tries=0
waiting for 99ff60018 Child library cache level=5 child#=3
Location from where latch is held: kglic: child
Context saved from call: 26
state=busy
possible holder pid = 127 ospid=23086
wtr=99ff60018, next waiter 9993858b8
holding 99ff60018 Child library cache level=5 child#=3
Location from where latch is held: kglic: child
Context saved from call: 26
state=busy
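Converting the hexadecimal latch number and building the lookup query can be sketched as follows. The helper name `latch_lookup_sql` is hypothetical; the query only assumes the LATCH# and NAME columns of v$latchname.

```python
def latch_lookup_sql(hex_number):
    """Convert the hex 'number=' wait parameter to decimal and build a v$latchname query."""
    latch_no = int(hex_number, 16)
    return latch_no, f"SELECT name FROM v$latchname WHERE latch# = {latch_no};"

latch_no, query = latch_lookup_sql("9d")
print(latch_no)
print(query)
```

Latch numbers differ between releases, so the query should be run on the same database version that produced the dump.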

If you want to find which object a handle refers to, search for handle=XXXXXXXXXX until you come across
the LIBRARY OBJECT HANDLE, e.g. handle=c00000006c0f8490:
• name shows the name of the object the handle refers to
• namespace=CRSR shows that it is of type CURSOR
Other useful information
LIBRARY OBJECT HANDLE: handle=c00000006c0f8490
name=SELECT USER FROM DUAL
hash=cd1ceca0 timestamp=03-23-2007 09:00:00
namespace=CRSR flags=RON/TIM/PN0/SML/[12010000]

ADDM in a multitenant environment

Starting with Oracle Database 12c, ADDM is enabled by default in the root
container of a multitenant container database (CDB)
You can also use ADDM in a pluggable database (PDB)
• In a CDB, ADDM works in the same way as it works in a non-CDB
• ADDM analysis is performed each time an AWR snapshot is taken on a CDB root or a
PDB
• ADDM does not work in a PDB by default, because automatic AWR snapshots are
disabled

To enable ADDM in a PDB:
Set the AWR_PDB_AUTOFLUSH_ENABLED initialization parameter to TRUE in the
PDB using the following command:
SQL> ALTER SYSTEM SET AWR_PDB_AUTOFLUSH_ENABLED=TRUE;
Set the AWR snapshot interval greater than 0 in the PDB, as shown in the
following example:
SQL> EXEC dbms_workload_repository.modify_snapshot_settings(interval=>60);
Results on a PDB provide only PDB-specific findings and recommendations

Analyze logs and look for errors

Investigate logs and look for errors
tfactl analyze -since 1d
INFO: analyzing all (Alert and Unix System Logs) logs for the last 1440
minutes...
...
Unique error messages for last ~1 day(s)
Occurrences percent server name error
----------- ------- -------------------- -----
1 100.0% myserver1 Errors in file /u01/oracle/diag/rdbms/orcl2/orcl2/trace/orcl2_ora_12272.trc (incident=10151):
ORA-00600: internal error code, arguments: [600], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/diag/rdbms/orcl2/orcl2/incident/incdir_10151/orcl2_ora_12272_i10151.trc
...

tfactl analyze -search "ORA-04031" -last 1d
INFO: analyzing all (Alert and Unix System Logs) logs for the last 1440
minutes...
...
Matching regex: ORA-04031
Case sensitive: false
Match count: 1
[Source: /u01/oracle/diag/rdbms/orcl2/orcl2/trace/alert_orcl2.log, Line: 1941]
Oct 01 12:09:05 2020
Errors in file /u01/oracle/diag/rdbms/orcl2/orcl2/trace/orcl2_ora_6982.trc (incident=7665):
ORA-04031: unable to allocate bytes of shared memory ("","","","")
Incident details in: /u01/app/oracle/diag/rdbms/orcl2/orcl2/incident/incdir_7665/orcl2_ora_6982_i7665.trc
...

Examples
tfactl analyze -since 5h
#Show summary of events from alert logs, system messages in last 5 hours
tfactl analyze -comp os -since 1d
#Show summary of events from system messages in last 1 day
tfactl analyze -search "ORA-" -since 2d
#Search string ORA- in alert and system logs in past 2 days
tfactl analyze -search "/Starting/c" -since 2d
#Search case sensitive string "Starting" in past 2 days
tfactl analyze -comp os -for "Oct/01/2020 11" -search "."
#Show all system log messages at time Oct/01/2020 11
tfactl analyze -comp osw -since 6h
#Show OSWatcher Top summary in last 6 hours
tfactl analyze -comp oswslabinfo -from "Oct/01/2020 05:00:01" -to "Oct/01/2020 06:00:01"
#Show OSWatcher slabinfo summary for specified time period
tfactl analyze -since 1h -type generic
#Analyze all generic messages in last one hour

$ ./tfactl analyze -type generic -since 7d
INFO: analyzing all (Alert and Unix System Logs) logs for the last 10080 minutes...
...
Total message count: 54,807, from 01-Oct-2020 02:41:34 PM PST to
08-Oct-2020 02:41:34
Messages matching last ~7 day(s): 3,139, from 02-Oct-2020 02:46:23 PM PST to
08-Oct-2020 02:41:34
last ~7 day(s) generic count: 3,139, from 06-Oct-2020 02:46:23 PM PST to
08-Oct-2020 02:41:34
last ~7 day(s) unique generic count: 94
Message types for last ~7 day(s)
Occurrences percent server name type
----------- ------- -------------------- -----
3,139 100.0% myhost1 generic
...

Unique generic messages for last ~7 day(s)
Occurrences percent server name generic
----------- ------- -------------------- -----
1,504 47.9% myhost1 : [crflogd(13931)]CRS-9520:The storage of Grid
Infrastructure Managem...
487 15.5% myhost1 : [crflogd(13931)]CRS-9520:The storage of Grid
Infrastructure Managem...
336 10.7% myhost1 myhost1 smartd[13812]: Device: /dev/sdv, SMART
Failure: FAILURE...
336 10.7% myhost1 myhost1 smartd[13812]: Device: /dev/sdag, SMART
Failure: FAILURE ...
103 3.3% myhost1 myhost1 last message repeated 9 times
103 3.3% myhost1 myhost1 kernel: oracle: sending ioctl 2285 to a
partition!
...snipping for brevity...

Pattern match search output
tfactl analyze -search "ORA-" -since 7d
...
[Source: /u01/app/oracle/diag/rdbms/ratoda/RATODA1/trace/alert_RATODA1.log, Line:
9494]
Feb 25 22:00:02 2014
Errors in file
/u01/app/oracle/diag/rdbms/ratoda/RATODA1/trace/RATODA1_j003_10948.trc:
ORA-12012: error on auto execute of job "ORACLE_OCM"."MGMT_CONFIG_JOB_2_1"
ORA-29280: invalid directory path
ORA-06512: at "ORACLE_OCM.MGMT_DB_LL_METRICS", line 2436
ORA-06512: at line 1
End automatic SQL Tuning Advisor run for special tuning task "SYS_AUTO_SQL_TUNING_TASK"
...

OS Watcher top data
tfactl analyze -comp osw -since 6h
...
statistic: t first highest (time) lowest (time) average non zero 3rd last 2nd last last trend
top.cpu.util.id: % 98.0 99.7 @10:35AM 72.8 @03:11PM 97.3 2,059 95.2 96.8 96.0 -2%
top.cpu.util.st: % 0.1 0.1 @09:14AM 0.0 @09:14AM 0.0 889 0.0 0.0 0.0 -100%
top.cpu.util.us: % 0.1 8.8 @11:31AM 0.0 @09:14AM 0.6 1,966 4.3 0.8 3.4 3300%
top.cpu.util.wa: % 1.7 18.7 @03:11PM 0.1 @10:35AM 1.1 2,059 0.3 0.4 0.4 -76%
top.loadavg.last01min: 1.17 3.12 @09:44AM 0.07 @12:45PM 0.93 1,823 0.31 0.26 0.22 -81%
top.loadavg.last05min: 0.94 2.26 @09:44AM 0.27 @12:45PM 0.93 1,823 0.82 0.79 0.77 -18%
top.loadavg.last15min: 0.79 1.60 @09:46AM 0.44 @01:18PM 0.92 1,823 0.96 0.95 0.94 18%
top.mem.buffers: k 808232 808388 @09:41AM 785608 @02:57PM 796511 2,093 785744 785744 785744 -2%
top.mem.free: k 1130332 1291344 @10:02AM 927576 @09:43AM 1188576 2,093 1244020 1265248 1265188 11%
top.swap.used: k 47556 48088 @03:00PM 47556 @09:14AM 47828 2,097 48088 48088 48088 1%
top.tasks.running: 1 4 @12:04PM 1 @09:14AM 1 1,996 1 2 2 100%
top.tasks.total: 514 527 @02:57PM 509 @09:18AM 514 1,996 518 521 520 1%
top.tasks.zombie: 0 5 @11:04AM 0 @09:14AM 0 62 0 0 0 n/a
top.users: 5 6 @03:00PM 5 @09:14AM 5 1,823 6 6 6 20%
...
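The trend column is not explained in the output itself. Judging only from the sample rows above, it appears to be the percentage change from the first to the last sample, truncated toward zero, with n/a when the first value is 0. The sketch below reproduces that inferred rule; it is an assumption drawn from this sample, not tfactl's documented formula, and `trend_percent` is a hypothetical name.

```python
from decimal import Decimal

def trend_percent(first, last):
    """Percentage change from first to last, truncated toward zero (inferred rule)."""
    first, last = Decimal(first), Decimal(last)
    if first == 0:
        return "n/a"
    # Decimal avoids binary floating-point error, e.g. (3.4 - 0.1) / 0.1.
    return f"{int((last - first) / first * 100)}%"

print(trend_percent("98.0", "96.0"))   # top.cpu.util.id
print(trend_percent("0.1", "3.4"))     # top.cpu.util.us
print(trend_percent("1.7", "0.4"))     # top.cpu.util.wa
print(trend_percent("0", "0"))         # top.tasks.zombie
```

Values are passed as strings so that they are parsed exactly as printed in the report.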

OS Watcher slabinfo data
tfactl analyze -comp oswslabinfo -from "Oct/01/2020 05:00:01" -to "Oct/01/2020 06:00:01"
...
statistic: t first highest (time) lowest (time) average non zero 3rd last 2nd last last trend
slabinfo.acfs_ccb_cache.active_objs: 4 38 @05:52AM 0 @05:01AM 10 294 3 1 8 100%
slabinfo.inet_peer_cache.active_objs: 23 39 @05:59AM 23 @05:00AM 23 351 23 23 39 69%
slabinfo.sigqueue.active_objs: 385 768 @05:28AM 285 @05:27AM 554 351 712 621 577 49%
slabinfo.skbuff_fclone_cache.active_objs: 55 133 @05:51AM 11 @05:20AM 69 351 56 77 70 27%
slabinfo.names_cache.active_objs: 126 180 @05:00AM 110 @05:23AM 146 351 171 166 156 23%
slabinfo.sgpool-8.active_objs: 135 228 @05:31AM 59 @05:11AM 152 351 180 165 157 16%
slabinfo.UDP.active_objs: 568 675 @05:28AM 492 @05:17AM 597 351 630 596 626 10%
slabinfo.size-8192.active_objs: 174 209 @05:36AM 160 @05:14AM 181 351 205 187 188 8%
slabinfo.task_delay_info.active_objs: 1477 1856 @05:28AM 1334 @05:57AM 1574 351 1529 1411 1579 6%
slabinfo.pid.active_objs: 1608 1980 @05:29AM 1452 @05:21AM 1678 351 1564 1487 1689 5%
slabinfo.blkdev_requests.active_objs: 720 880 @05:04AM 651 @05:54AM 745 351 707 736 761 5%
slabinfo.ip_dst_cache.active_objs: 1497 1800 @05:28AM 1279 @05:36AM 1517 351 1594 1466 1560 4%
slabinfo.sock_inode_cache.active_objs: 2168 2329 @05:11AM 2106 @05:56AM 2225 351 2322 2278 2232 2%
...

How to connect to a hung
database for diagnostics

How do you connect to a database when connections are hanging?
• A sqlplus preliminary connection can still connect to the database, since no session is
created
• You will have limited access to the SGA
• This will help in capturing diagnostic information like a systemstate dump
• Two ways to connect to sqlplus using a preliminary connection:
sqlplus -prelim / as sysdba
or
sqlplus -prelim
SQL> set _prelim on
SQL> connect / as sysdba
Prelim connection established

Always on - Enabled by default
Reliably detects database hangs and deadlocks
Autonomously resolves them
Logs all detections and resolutions
New SQL interface to configure sensitivity (Normal/High)
and trace file sizes
Oracle Hang Manager
[Figure: Hang Manager workflow. Labels: Session, DIA0, EVALUATE, DETECT, ANALYZE, Hung?, VERIFY, Victim, Policy]

Monitors Session snapshots for progress
Evaluates potential hangs over time,
based upon Wait Graphs
Analyzes hang chain of sessions to
identify blocker/victim
Discovers blocker is located in ASM
instance
Requests ASM terminate session or
instance relying on Flex ASM for recovery
Detection and resolution is bi-directional
Database Hang Management - Infrastructure
Database
ASM

Full Resolution Dump Trace File and DB Alert Log Audit Reports
Oracle 12c Hang Manager
Dump file …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
Oracle Database 12c Enterprise Edition Release 18/19c.0.0.0 - 64bit Beta
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics
and Real Application Testing options
Build label: RDBMS_MAIN_LINUX.X64_151013
ORACLE_HOME: …/3775268204/oracle
System name: Linux
Node name: slc05kyr
Release: 2.6.39-400.211.1.el6uek.x86_64
Version: #1 SMP Fri Nov 15 13:39:16 PST 2013
Machine: x86_64
VM name: Xen Version: 3.4 (PVM)
Instance name: hm62
Redo thread mounted by this instance: 2
Oracle process number: 19
Unix process pid: 12656, image: oracle@slc05kyr (DIA0)
*** 2020-10-01T16:47:59.541509+17:00
*** SESSION ID:(96.41299) 2020-10-01T16:47:59.541519+17:00
*** CLIENT ID:() 2020-10-01T16:47:59.541529+17:00
*** SERVICE NAME:(SYS$BACKGROUND) 2020-10-01T16:47:59.541538+17:00
*** MODULE NAME:() 2020-10-01T16:47:59.541547+17:00
*** ACTION NAME:() 2020-10-01T16:47:59.541556+17:00
*** CLIENT DRIVER:() 2020-10-01T16:47:59.541565+17:00

2020-10-01T16:47:59.435039+17:00
Errors in file /oracle/log/diag/rdbms/hm6/hm6/trace/hm6_dia0_12433.trc (incident=7353):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm6/incident/incdir_7353/hm6_dia0_12433_i7353.trc
2020-10-01T16:47:59.506775+17:00
DIA0 requesting termination of session sid:40 with serial # 43179 (ospid:13031) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=1.
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=1.
In the alert log on the instance local to the session (instance 2 in this case),
we see the following:
2020-10-01T16:47:59.538673+17:00
Errors in file …/diag/rdbms/hm6/hm62/trace/hm62_dia0_12656.trc (incident=5753):
ORA-32701: Possible hangs up to hang ID=1 detected
Incident details in: …/diag/rdbms/hm6/hm62/incident/incdir_5753/hm62_dia0_12656_i5753.trc
2020-10-01T16:48:04.222661+17:00
DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1
requested by master DIA0 process on instance 1
Hang Resolution Reason: Automatic hang resolution was performed to free a
significant number of affected sessions.
by terminating session sid:40 with serial # 43179 (ospid:13031)
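When many such entries accumulate, the terminated blockers can be pulled out of the alert log with a regular expression. This is a minimal sketch that assumes the "DIA0 terminating blocker" message layout shown above; the function name `terminated_blockers` is hypothetical.

```python
import re

# Assumption: the message format matches the alert log excerpt above.
TERMINATE_RE = re.compile(
    r"DIA0 terminating blocker \(ospid:\s*(?P<ospid>\d+)\s+sid:\s*(?P<sid>\d+)\s+"
    r"ser#:\s*(?P<serial>\d+)\) of hang with ID = (?P<hang_id>\d+)"
)

def terminated_blockers(alert_text):
    """Return one dict per blocker terminated by Hang Manager."""
    return [
        {key: int(value) for key, value in m.groupdict().items()}
        for m in TERMINATE_RE.finditer(alert_text)
    ]

sample = "DIA0 terminating blocker (ospid: 13031 sid: 40 ser#: 43179) of hang with ID = 1"
print(terminated_blockers(sample))
```

The sid and serial# returned here are the same values reported by the master DIA0 process on the other instance, which makes it easy to correlate the two alert logs.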

Guided resolution with Oracle Support

Oracle Database ORA-00060 Errors on Single Instance (Non-RAC) Diagnosing
Using Deadlock Graphs in ORA-00060 Trace Files (Doc ID 1550091.2)
Troubleshooting Assistant
https://support.oracle.com/epmos/faces/DocContentDisplay?id=1550091.2

Understand and Troubleshoot Startup/Shutdown Issues (Doc ID 1591095.2)

Oracle Undo Management (ORA-01555, ORA-30036, ORA-01628,
ORA-01552, etc.) (Doc ID 1575667.2)

Handling Block Corruptions in Oracle7 / 8 / 8i / 9i / 10g / 11g (Doc ID 1598103.2)

SQL Health Check

1. Log in to the database server and set the environment used by the Database Instance
2. Download the "sqlhc.zip" archive file and extract the contents to a suitable directory/folder
3. Connect to SQL*Plus as SYS, a DBA account, or a user with access to Data Dictionary views
and execute the "sqlhc.sql" script. It will prompt for two parameters:
i. Oracle Pack License (Tuning, Diagnostics or None) [T|D|N] (required)
ii. A valid SQL_ID for the SQL to be analyzed.
If the site has both Tuning and Diagnostics licenses then specify T
(Oracle Tuning pack includes Oracle Diagnostics)
For example:
# sqlplus / as sysdba
SQL> START sqlhc.sql T djkbyr8vkc64h

SQL> describe V$DIAG_TRACE_FILE
Name Null? Type
----------------------------------------- -------- ----------------------------
ADR_HOME VARCHAR2(444)
TRACE_FILENAME VARCHAR2(68)
CHANGE_TIME TIMESTAMP(3) WITH TIME ZONE
MODIFY_TIME TIMESTAMP(3) WITH TIME ZONE
CON_ID NUMBER
V$DIAG_TRACE_FILE and V$DIAG_TRACE_FILE_CONTENTS

SQL> describe V$DIAG_TRACE_FILE_CONTENTS
Name Null? Type
----------------------------------------- -------- ----------------------------
RECORD_LEVEL NUMBER
PARENT_LEVEL NUMBER
RECORD_TYPE NUMBER
TIMESTAMP TIMESTAMP(3) WITH TIME ZONE
PAYLOAD VARCHAR2(4000)
SECTION_ID NUMBER
SECTION_NAME VARCHAR2(64)
COMPONENT_NAME VARCHAR2(64)
OPERATION_NAME VARCHAR2(64)
FILE_NAME VARCHAR2(64)
FUNCTION_NAME VARCHAR2(64)
LINE_NUMBER NUMBER
THREAD_ID VARCHAR2(64)
SESSION_ID NUMBER
SERIAL# NUMBER
CON_UID NUMBER
CONTAINER_NAME VARCHAR2(64)
CON_ID NUMBER

SQL> select trace_filename from v$diag_trace_file;
TRACE_FILENAME
--------------------------------------------------------------------
ORCL1_mz00_21108.trc
ORCL1_gcr2_16504.trc
ORCL1_ora_19005.trc

SQL> select payload from v$diag_trace_file_contents where trace_filename ='ORCL1_ora_19005.trc';
PAYLOAD
--------------------------------------------------------------------------------
Trace file /u01/app/oracle/diag/rdbms/orcl_unq/ORCL1/trace/ORCL1_ora_19005.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.2.0.0.0
Build label: RDBMS_19.2.0.0.0_LINUX.X64_190121
ORACLE_HOME: /u01/app/oracle/product/19c/dbhome_1
System name: Linux
Node name: myserver65
Release: 4.14.35-1844.1.3.el7uek.x86_64
Version: #2 SMP Wed Jan 2 21:18:29 PST 2019
Machine: x86_64
VM name: Xen Version: 4.1 (HVM)
...

...
PAYLOAD
--------------------------------------------------------------------------------
Instance name: ORCL1
Redo thread mounted by this instance: 1
Oracle process number: 12
Unix process pid: 19005, image: oracle@myserver65 (TNS V1-V3)
*** 2020-10-01T01:22:10.770960+00:00
*** SESSION ID:(106.17196) 2020-10-01T01:22:10.771014+00:00
*** CLIENT ID:() 2020-10-01T01:22:10.771027+00:00
*** SERVICE NAME:(SYS$USERS) 2020-10-01T01:22:10.771039+00:00
...

SQL> describe V$DIAG_SESS_SQL_TRACE_RECORDS
Name Null? Type
----------------------------------------- -------- ----------------------------
RECORD_LEVEL NUMBER
PARENT_LEVEL NUMBER
RECORD_TYPE NUMBER
TIMESTAMP TIMESTAMP(3) WITH TIME ZONE
PAYLOAD VARCHAR2(4000)
SECTION_ID NUMBER
SECTION_NAME VARCHAR2(64)
COMPONENT_NAME VARCHAR2(64)
OPERATION_NAME VARCHAR2(64)
FILE_NAME VARCHAR2(64)
FUNCTION_NAME VARCHAR2(64)
LINE_NUMBER NUMBER
THREAD_ID VARCHAR2(64)
SESSION_ID NUMBER
SERIAL# NUMBER
CON_UID NUMBER
CONTAINER_NAME VARCHAR2(64)
CON_ID NUMBER
V$DIAG_SESS_SQL_TRACE_RECORDS

SQL> SELECT sid,serial# FROM v$session WHERE username = 'SYS';
SID SERIAL#
---------- ----------
33 45888
129 6051
SQL> EXECUTE DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION(129,6051,TRUE);
PL/SQL procedure successfully completed.
Enable session tracing

SQL> select unique trace_filename from V$DIAG_SESS_SQL_TRACE_RECORDS;
TRACE_FILENAME
--------------------------------------------------------------------
ORCL1_ora_14151.trc
SQL> select payload from V$DIAG_SESS_SQL_TRACE_RECORDS where trace_filename = 'ORCL1_ora_14151.trc';
PAYLOAD
--------------------------------------------------------------------------------
CLOSE #140506358472544:c=19,e=18,dep=0,type=1,tim=7769230586778
=====================
PARSING IN CURSOR #140506358494608 len=97 dep=1 uid=0 oct=3 lid=0 tim=7769230600
163 hv=791757000 ad='7fa0c290' sqlid='87gaftwrm2h68'
select o.owner#,o.name,o.namespace,o.remoteowner,o.linkname,o.subname from obj$
o where o.obj#=:1
END OF STMT
EXEC #140506358494608:c=65,e=65,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=107238262
4,tim=7769230600159
...

...
PAYLOAD
--------------------------------------------------------------------------------
FETCH #140506358494608:c=38,e=37,p=0,cr=2,cu=0,mis=0,r=0,dep=1,og=4,plh=10723826
24,tim=7769230600324
CLOSE #140506358494608:c=5,e=4,dep=1,type=3,tim=7769230600381
EXEC #140506358494608:c=23,e=23,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,plh=107238262
4,tim=7769230600500
FETCH #140506358494608:c=11,e=12,p=0,cr=2,cu=0,mis=0,r=0,dep=1,og=4,plh=10723826
24,tim=7769230600547
...
SQL> EXECUTE DBMS_SYSTEM.SET_SQL_TRACE_IN_SESSION(129,6051,FALSE);
PL/SQL procedure successfully completed.

Keep track of the attributes of
important files pre and post patching

Start tracking using -fileattr start
Automatically discovers Grid Infrastructure and Database directories and files
• Prevent discovery using -excludediscovery
Further configure the list of monitored directories using -includedir
Track attribute changes on important files
tfactl <orachk|exachk> -fileattr start -includedir "/root/myapp/config"
...
List of directories(recursive) for checking file attributes:
/u01/app/oradb/product/11.2.0/dbhome_11203
/root/myapp/config
orachk has taken snapshot of file attributes for above directories at:
/orahome/oradb/orachk/orachk_mysrv21_20201001_041214

Compare current attributes against first snapshot using -fileattr check
When checking, use the same include/exclude arguments you started with
tfactl <orachk|exachk> -fileattr check -includedir "/root/myapp/config"
...
List of directories(recursive) for checking file attributes:
/root/myapp/config
Checking file attribute changes...
"/root/myapp/config/myappconfig.xml" is different:
Baseline : 0644 oracle root
/root/myapp/config/myappconfig.xml
Current : 0644 root root
/root/myapp/config/myappconfig.xml
...
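The idea behind the check, a baseline snapshot of file attributes compared with the current state, can be illustrated with a short Python sketch. This is not how ORAchk or EXAchk implement it; the function names `snapshot` and `changed` are hypothetical and only permission bits, owner, and group are recorded.

```python
import os
import stat

def snapshot(root):
    """Record (permission bits, uid, gid) for every file below root."""
    attrs = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            attrs[path] = (stat.S_IMODE(st.st_mode), st.st_uid, st.st_gid)
    return attrs

def changed(baseline, current):
    """Return {path: (baseline attrs, current attrs)} for files whose attributes differ."""
    return {
        path: (baseline[path], current[path])
        for path in baseline.keys() & current.keys()
        if baseline[path] != current[path]
    }
```

A typical use would be to call `snapshot()` before patching, call it again afterwards, and pass both results to `changed()` to list files such as myappconfig.xml whose owner or mode was altered.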

Automatically proceeds to run compliance checks after file attribute checks
• Only run attribute checks by using -fileattronly
File Attribute Changes are shown in HTML report output

Critical checks run automatically every two hours and full checks once a day at 2am
• You only need to configure your email for notification
ORAchk | EXAchk email notification
tfactl <orachk|exachk> -set "NOTIFICATION_EMAIL=SOME.BODY@COMPANY.COM"

Critical event notification
TFA can send email notifications when faults are detected
• Notification for all problems:
tfactl set notificationAddress=some.body@example.com
• Notification for all problems on databases owned by the oracle user:
tfactl set notificationAddress=oracle:another.person@example.com
• Optionally configure an SMTP server:
tfactl set smtp
• Confirm email notification works:
tfactl sendmail <email_address>

Event: ORA-29770
Event time: Thu Oct 01 07:13:09 PDT 2020
File containing event: /u01/app/oracle/diag/rdbms/orcl/orcl/trace/alert_orcl.log
Logs will be collected at: /opt/oracle.ahf/data/repository/auto_srdc_ORA-29770_2020_10_01:09_myserver1.zip

Symptom
LCK0 (ospid:NNNN) has not called a wait for <n_secs> secs.
Call stack:
ksedsts <- kjzdssdmp <- kjzduptcctx <- kjzdicrshnfy <- ksuitm <- kjgcr_KillInstance <- kjgcr_Main <- kjfmlmhb_Main <- ksbrdp

Cause
Instance crash due to ORA-29770 LCK0 hung

Evidence
Orcl_lmhb_23242.trc (15):
ksedsts()+465<-kjzdssdmp()+267<-kjzduptcctx()+232<-kjzdicrshnfy()+63<-ksuitm()+5570<-kjgcr_KillInstance()+125
alert_orcl.log(140):
ORA-29770: global enqueue process LMS0 (OSID 11912) is hung for more than 70 seconds

Action
Apply the one-off patch 18795105 to resolve this issue
For further information see Doc :1998445.1 and Doc :18795105.8

Self analysis in MOS using TFA collections

tfactl diagcollect -srdc <srdc_type>
• Scans system to identify recent events
• Once the relevant event is chosen, proceeds with diagnostic collection
One command SRDC
tfactl diagcollect -srdc ORA-00600
Enter the time of the ORA-00600 [YYYY-MM-DD HH24:MI:SS,<RETURN>=ALL] :
Enter the Database Name [<RETURN>=ALL] :
1. Oct/01/2020 05:29:58 : [orcl2] ORA-00600: internal error code,
arguments: [600], [], [], [], [], [], [], [], [], [], [], []
2. Oct/01/2020 06:55:08 : [orcl2] ORA-00600: internal error code,
arguments: [600], [], [], [], [], [], [], [], [], [], [], []
Please choose the event : 1-2 [1]
Selected value is : 1 (Oct/01/2020 05:29:58 )

All required files are identified
• Trimmed where applicable
• Package in a zip ready to provide to support
One command SRDC
...
2020/10/01 06:14:24 EST : Getting List of Files to Collect
2020/10/01 06:14:27 EST : Trimming file :
myserver1/rdbms/orcl2/orcl2/trace/orcl2_lmhb_3542.trc with original file
size : 163MB
...
2020/10/01 06:14:58 EST : Total time taken : 39s
2020/10/01 06:14:58 EST : Completed collection of zip files.
...
/opt/oracle.ahf/data/repository/srdc_ora600_collection_Tue_Sep_7_06_14_17_EST_2020_node_local/myserver1.tfa_srdc_ora600_Thu_Oct_1_06_14_17_EST_2020.zip

Collects, processes, and maintains performance statistics for problem detection and self-tuning purposes
Gathered data is stored both in memory and in the database, and is displayed in both reports and views
Automatic Workload Repository (AWR)
The statistics collected and processed by AWR include:
• Object statistics that determine both access and usage
statistics of database segments
• Time model statistics based on time usage for activities,
displayed in the V$SYS_TIME_MODEL and
V$SESS_TIME_MODEL views
• Some of the system and session statistics collected in
the V$SYSSTAT and V$SESSTAT views
• SQL statements that are producing the highest load on
the system, based on criteria such as elapsed time and
CPU time
• Active Session History (ASH) statistics, representing the
history of recent session activity

Generating an AWR Report
Create an AWR snapshot
SQL> EXECUTE DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT()
Run your workload
Create an AWR snapshot
SQL> EXECUTE DBMS_WORKLOAD_REPOSITORY.CREATE_SNAPSHOT()
Generate report for the time period
SQL> @$ORACLE_HOME/rdbms/admin/awrrpt.sql

AWR Scripts
To generate an AWR Compare Periods report for the local database
SQL> @$ORACLE_HOME/rdbms/admin/awrddrpt.sql
To generate an AWR Compare Periods report for a specific database
SQL> @$ORACLE_HOME/rdbms/admin/awrddrpi.sql
To generate an AWR Compare Periods report for Oracle RAC on the local database instance
SQL> @$ORACLE_HOME/rdbms/admin/awrgdrpt.sql
To generate an AWR Compare Periods report for Oracle RAC on a specific database
SQL> @$ORACLE_HOME/rdbms/admin/awrgdrpi.sql
To generate a Global AWR report for RAC
SQL> @$ORACLE_HOME/rdbms/admin/awrgrpt.sql
To generate a SQL Statement report
SQL> @$ORACLE_HOME/rdbms/admin/awrsqrpt.sql
Information on the AWR Repository
SQL> @$ORACLE_HOME/rdbms/admin/awrinfo.sql

Sanitize sensitive information

Sensitive information can be hidden from diagnostics
Machine learning algorithms determine sensitive data like:
• Host names
• IP addresses
• MAC addresses
• Oracle Database names
• Tablespace names
• Service names
• Ports
• Operating system user names
Sanitize or mask sensitive information

Add -sanitize or -mask to any command
• -sanitize replaces a sensitive value with random characters
• myhost123 >>>> JnsF3km9
• -mask replaces a sensitive value with a series of 'X' characters
• myhost123 >>>> XXXXXXXX
Sanitize or mask sensitive information
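The difference between the two options is that a sanitized value can be mapped back to the original, while a masked value cannot. The sketch below illustrates that idea only; it is not how AHF implements it. The class and function names are hypothetical, and keeping the token the same length as the input is an assumption (the slide's own example shows a shorter token).

```python
import random
import string


def mask(value: str) -> str:
    """Masking: every character becomes 'X', so the original is unrecoverable."""
    return "X" * len(value)


class Sanitizer:
    """Sanitizing: replace a value with a random token and remember the mapping."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._forward = {}  # original -> token
        self._reverse = {}  # token -> original

    def sanitize(self, value: str) -> str:
        if not value:
            return value
        if value not in self._forward:
            alphabet = string.ascii_letters + string.digits
            while True:
                # Assumption: token has the same length as the input.
                token = "".join(self._rng.choice(alphabet) for _ in value)
                if token != value and token not in self._reverse:
                    break
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reverse_map(self, token: str) -> str:
        """Look up the original value behind a sanitized token."""
        return self._reverse[token]
```

Because the same input always yields the same token, sanitized output stays internally consistent, which is what makes a later reverse lookup possible.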

tfactl orachk -preupgrade -sanitize
[Screenshot callout: Sanitized hostname]

tfactl orachk -rmap qzh024703246tsa1
TFA using ORAchk : /opt/oracle.ahf/orachk/orachk
___________________________________________________________________________
| Entity Type | Substituted Entity Name | Original Entity Name |
___________________________________________________________________________
| hostname | qzh024703246tsa1 | myserver1 |
___________________________________________________________________________
Reverse map the sanitization

[Screenshot callouts: Sanitized hostname, Check ID, Repair command]

Understand what the repair command does
Understand what the repair command will do with:
tfactl orachk -showrepair 8300E0A2FFE48253E053D298EB0A76CC
TFA using ORAchk : /opt/oracle.ahf/orachk/orachk
Repair Command:
currentUserName=$(whoami)
if [ "$currentUserName" = "root" ]
then
repair_report=$(rpm -e stix-fonts 2>&1)
else
repair_report="$currentUserName does not have priviedges to run
$CRS_HOME/bin/crsctl set resource use 1"
fi
echo -e "$repair_report"

Run the repair command
Run the checks again and repair everything that fails:
tfactl orachk -repaircheck all
Run the checks again and repair only the specified checks:
tfactl orachk -repaircheck <check_id_1>,<check_id_2>
Run the checks again and repair all checks listed in the file:
tfactl orachk -repaircheck <file>

tfactl changes
Output from host : myserver69
------------------------------
[Oct/01/2020 04:54:15.397]: Parameter: fs.aio-nr: Value: 95488 => 97024
[Oct/01/2020 04:54:15.397]: Parameter: fs.inode-nr: Value: 764974 131561 => 740744 131259
[Oct/01/2020 04:54:15.397]: Parameter: kernel.pty.nr: Value: 2 => 1
[Oct/01/2020 04:54:15.397]: Parameter: kernel.random.entropy_avail: Value: 189 => 158
[Oct/01/2020 04:54:15.397]: Parameter: kernel.random.uuid: Value: 36269877-9bc9-40a3-82e0-1619865096f2 => 7551c5e7-c59f-40fa-b55f-5bd170e8b1ab
[Oct/01/2020 05:46:15.397]: Parameter: fs.inode-nr: Value: 1580316 810036 => 1562320 768555
[Oct/01/2020 05:46:15.397]: Parameter: kernel.random.uuid: Value: 37cc31aa-ee31-459e-8f2a-0766b34b1b64 => f5176cdc-6390-415d-882e-02c4cff2ae4e
...
Has anything changed recently?
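Much of the tfactl changes output is kernel counters that move constantly, so it can help to parse the lines and filter them before reading. The sketch below assumes each change sits on a single line in the format shown in the output; the Change type, the function names, and the choice of which parameters count as noise are all assumptions for illustration.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional


class Change(NamedTuple):
    when: datetime
    parameter: str
    old: str
    new: str


_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]: Parameter: (?P<name>\S+): Value: (?P<old>.*?) => (?P<new>.*)$"
)

# Assumption: these values change on every sample and are rarely interesting.
NOISY = {
    "kernel.random.uuid",
    "kernel.random.entropy_avail",
    "fs.inode-nr",
    "kernel.pty.nr",
}


def parse_change(line: str) -> Optional[Change]:
    """Parse one 'Parameter: ... Value: old => new' line, or return None."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%b/%d/%Y %H:%M:%S.%f")
    return Change(when, m["name"], m["old"].strip(), m["new"].strip())


def significant(changes: List[Change]) -> List[Change]:
    """Drop parameters listed in NOISY."""
    return [c for c in changes if c.parameter not in NOISY]
```

Applied to the first sample above, this leaves only the fs.aio-nr change, which is the kind of line worth a closer look.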

Pre and post upgrade compliance checking

ORAchk/EXAchk provides a single source for all upgrade checks
[Diagram: ORAchk checks, EXAchk checks, Database AutoUpgrade checks, and Cluster Verification Utility (CVU) checks are compared, contrasted, combined, and consolidated into the resulting ORAchk / EXAchk checks]

ORAchk/EXAchk provides a single source for all upgrade checks
To check an environment before upgrading, run:
tfactl <orachk|exachk> -preupgrade
To check an environment after upgrading, run:
tfactl <orachk|exachk> -postupgrade

Some problem areas covered in SRDCs
Around 100 problem types covered (full list in documentation)
Database areas:
• Errors / Corruption
• Performance
• Install / patching / upgrade
• RAC / Grid Infrastructure
• Import / Export
• RMAN
• Transparent Data Encryption
• Storage / partitioning
• Undo / auditing
• Listener / naming services
• Spatial / XDB
Other Server Technology:
• Enterprise Manager
• Data Guard
• GoldenGate
• Exalogic
tfactl diagcollect -srdc <srdc_type> [-sr <sr_number>]

Manual collection vs TFA SRDC for database performance
Manual method:
1. Generate ADDM, reviewing Document 1680075.1 (multiple steps)
2. Identify “good” and “problem” periods and gather AWR, reviewing Document 1903158.1 (multiple steps)
3. Generate an AWR compare report (awrddrpt.sql) using the “good” and “problem” periods
4. Generate ASH reports for the “good” and “problem” periods, reviewing Document 1903145.1 (multiple steps)
5. Collect OSWatcher data, reviewing Document 301137.1 (multiple steps)
6. Collect Hang Analyze output at Level 4
7. Generate a SQL Healthcheck for the problem SQL ID using Document 1366133.1 (multiple steps)
8. Run Support-provided SQL scripts for Log File Sync diagnostic output using Document 1064487.1 (multiple steps)
9. Check alert.log for errors during the “problem” period
10. Find any trace files generated during the “problem” period
11. Collate and upload all the above files/outputs to the SR
TFA SRDC method:
1. Run tfactl diagcollect -srdc dbperf [-sr <sr_number>]

Automatic Database Log Purge
TFA can automatically purge database logs:
tfactl set manageLogsAutoPurge=ON
Purging automatically removes logs older than 30 days
• Configurable with:
tfactl set manageLogsAutoPurgePolicyAge=<n><d|h>
Purging runs every 60 minutes
• Configurable with:
tfactl set manageLogsAutoPurgeInterval=<minutes>
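The policy age uses the <n><d|h> form, for example 30d or 12h. The sketch below shows how such a value can be read and applied as a cutoff; it is an illustration of the format only, not TFA code, and both function names are hypothetical.

```python
import re
from datetime import datetime, timedelta


def parse_policy_age(value: str) -> timedelta:
    """Convert '<n><d|h>' (for example '30d' or '12h') to a timedelta."""
    m = re.fullmatch(r"(\d+)([dh])", value.strip())
    if m is None:
        raise ValueError(f"expected <n><d|h>, got {value!r}")
    n = int(m.group(1))
    return timedelta(days=n) if m.group(2) == "d" else timedelta(hours=n)


def is_older_than(modified: datetime, now: datetime, age: timedelta) -> bool:
    """True when a file last modified at 'modified' falls outside the policy age."""
    return now - modified > age
```

With the default policy of 30 days, a file last modified 31 days ago would qualify for purging and one modified 29 days ago would not.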

Manual Database Log Purge
TFA can manage ADR log and trace files
tfactl managelogs <options>
-show usage #Show disk space usage per diagnostic directory for both GI and database logs
-show variation -older <n><m|h|d> #Show disk space growth for specified period
-purge -older <n><m|h|d> #Remove ADR files older than the time specified
-gi #Restrict command to only files under the GI_BASE
-database [all | dbname] #Restrict command to only files under the database directory
-dryrun #Use with -purge to estimate how many files will be affected and how much disk space will be freed by a potential purge command

tfactl managelogs -show usage
...
.---------------------------------------------------------------------------------.
| Grid Infrastructure Usage |
+---------------------------------------------------------------------+-----------+
| Location | Size |
+---------------------------------------------------------------------+-----------+
| /u01/app/crsusr/diag/afdboot/user_root/host_309243680_94/alert | 28.00 KB |
| /u01/app/crsusr/diag/afdboot/user_root/host_309243680_94/incident | 4.00 KB |
| /u01/app/crsusr/diag/afdboot/user_root/host_309243680_94/trace | 8.00 KB |
...
+---------------------------------------------------------------------+-----------+
| Total | 739.06 MB |
'---------------------------------------------------------------------+-----------'
...
Understand Database log disk space usage
Use -gi to only show grid infrastructure

...
.---------------------------------------------------------------.
| Database Homes Usage |
+---------------------------------------------------+-----------+
| Location | Size |
+---------------------------------------------------+-----------+
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/alert | 1.06 MB |
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/incident | 4.00 KB |
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/trace | 146.19 MB |
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/cdump | 4.00 KB |
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/hm | 4.00 KB |
+---------------------------------------------------+-----------+
| Total | 147.26 MB |
'---------------------------------------------------+-----------'
Understand Database log disk space usage
Use -database to only show database

Understand Database log disk space usage variations
tfactl managelogs -show variation -older 30d
------------------------------
2020-10-01 12:30:42: INFO Checking space variation for 30 days
.---------------------------------------------------------------------------------------------.
| Grid Infrastructure Variation |
+---------------------------------------------------------------------+-----------+-----------+
| Directory | Old Size | New Size |
+---------------------------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/asm/user_root/host_309243680_96/alert | 22.00 KB | 28.00 KB |
+---------------------------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/clients/user_crsusr/host_309243680_96/cdump | 4.00 KB | 4.00 KB |
+---------------------------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/tnslsnr/myserver74/listener/alert | 15.06 MB | 244.10 MB |
+---------------------------------------------------------------------+-----------+-----------+
...

Understand Database log disk space usage variations
...
.---------------------------------------------------------------------------.
| Database Homes Variation |
+---------------------------------------------------+-----------+-----------+
| Directory | Old Size | New Size |
+---------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/hm | 4.00 KB | 4.00 KB |
+---------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/trace | 16.63 MB | 146.19 MB |
+---------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/cdump | 4.00 KB | 4.00 KB |
+---------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/incident | 4.00 KB | 4.00 KB |
+---------------------------------------------------+-----------+-----------+
| /u01/app/crsusr/diag/rdbms/cdb674/CDB674/alert | 1.06 MB | 1.06 MB |
'---------------------------------------------------+-----------+-----------'
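The variation tables are easiest to act on when sorted by growth. The sketch below parses the table rows shown above and ranks directories by how much they grew; it assumes rows have exactly the three columns shown and that KB and MB are binary units, and the function names are hypothetical.

```python
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def to_bytes(text: str) -> float:
    """Convert a size such as '15.06 MB' to bytes (binary units assumed)."""
    number, unit = text.split()
    return float(number) * UNITS[unit.upper()]


def parse_row(line: str):
    """Return (directory, old_bytes, new_bytes) for a data row, else None."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != 3 or not cells[0].startswith("/"):
        return None  # border, title, header, or '...' line
    path, old, new = cells
    return path, to_bytes(old), to_bytes(new)


def growth_report(lines):
    """List (growth_bytes, directory) pairs, largest growth first."""
    rows = [r for r in map(parse_row, lines) if r is not None]
    return sorted(((new - old, path) for path, old, new in rows), reverse=True)
```

On the Grid Infrastructure sample, the listener alert directory comes out on top with roughly 229 MB of growth in 30 days.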

Run a database log purge dryrun
tfactl managelogs -purge -older 30d -dryrun
------------------------------
Estimating files older than 30 days
Estimating purge for diagnostic destination "diag/afdboot/user_root/host_309243680_94" for files ~ 2 files deleted , 22.58 KB freed ]
Estimating purge for diagnostic destination "diag/afdboot/user_crsusr/host_309243680_94" for files ~ 2 files deleted , 11.72 KB freed ]
Estimating purge for diagnostic destination "diag/asmtool/user_root/host_309243680_96" for files ~ 2 files deleted , 21.36 KB freed ]
Estimating purge for diagnostic destination "diag/asmtool/user_crsusr/host_309243680_96" for files ~ 3 files deleted , 23.22 KB freed ]
Estimating purge for diagnostic destination "diag/tnslsnr/myserver74/listener" for files ~ 23 files deleted , 225.33 MB freed ]
Estimating purge for diagnostic destination "diag/diagtool/user_root/adrci_309243680_96" for files ~ 73 files deleted , 517.69 KB freed ]
Estimating purge for diagnostic destination "diag/clients/user_crsusr/host_309243680_96" for files ~ 38 files deleted , 17.15 KB freed ]
Estimating purge for diagnostic destination "diag/asm/+asm/+ASM" for files ~ 0 files deleted , 0 bytes freed ]
Estimating purge for diagnostic destination "diag/asm/user_root/host_309243680_96" for files ~ 1 files deleted , 19.52 KB freed ]
Estimating purge for diagnostic destination "diag/asm/user_crsusr/host_309243680_96" for files ~ 1 files deleted , 20.25 KB freed ]
Estimating purge for diagnostic destination "diag/crs/myserver74/crs" for files ~ 40 files deleted , 219.39 MB freed ]
Estimation for Grid Infrastructure [ Files to delete : ~ 185 files | Space to be freed : ~ 445.36 MB ]
Estimating purge for diagnostic destination "diag/rdbms/cdb674/CDB674" for files ~ 27760 files deleted , 66.57 MB freed ]
Estimation for Database Home [ Files to delete : ~ 27760 files | Space to be freed : ~ 66.57 MB ]
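The per-destination estimates can be totalled to cross-check the summary lines before running a real purge. The sketch below parses the "files deleted , size freed" portion of each line in the dry-run output above; the regular expression is written against that sample only, binary units are assumed, and the function name is hypothetical.

```python
import re

_ESTIMATE = re.compile(
    r'destination "(?P<dest>[^"]+)" for files ~ '
    r"(?P<files>\d+) files deleted , "
    r"(?P<size>[\d.]+) (?P<unit>bytes|KB|MB|GB) freed"
)

_FACTOR = {"bytes": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def total_estimate(lines):
    """Return (total_files, total_bytes) over all per-destination estimate lines."""
    files = 0
    size = 0.0
    for line in lines:
        m = _ESTIMATE.search(line)
        if m is None:
            continue  # summary or header line
        files += int(m["files"])
        size += float(m["size"]) * _FACTOR[m["unit"]]
    return files, size
```

For the eleven Grid Infrastructure destinations in the sample, the totals come to 185 files and about 445.36 MB, matching the reported estimate.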

Run a database log purge
tfactl managelogs -purge -older 30d
------------------------------
Purging files older than 30 days
Cleaning Grid Infrastructure destinations
Purging diagnostic destination "diag/afdboot/user_root/host_309243680_94" for files - 0 files deleted , 0 bytes freed
Purging diagnostic destination "diag/afdboot/user_crsusr/host_309243680_94" for files - 1 files deleted , 10.16 KB freed
Purging diagnostic destination "diag/asmtool/user_root/host_309243680_96" for files - 1 files deleted , 10.16 KB freed
Purging diagnostic destination "diag/asmtool/user_crsusr/host_309243680_96" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/tnslsnr/myserver74/listener" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/diagtool/user_root/adrci_309243680_96" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/clients/user_crsusr/host_309243680_96" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/asm/+asm/+ASM" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/asm/user_root/host_309243680_96" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/asm/user_crsusr/host_309243680_96" for files - 2 files deleted , 29.18 KB freed
Purging diagnostic destination "diag/crs/myserver74/crs" for files - 2 files deleted , 29.18 KB freed
...

Run a database log purge
...
Grid Infrastructure [ Files deleted : 18 files | Space Freed : 253.75 KB ]
.-----------------------------------------------------------------------------------------------.
| File System Variation : /u01/app/crsusr/12.2.0/grid2 |
+--------+-----------------------------------+----------+----------+---------+----------+-------+
| State | Name | Size | Used | Free | Capacity | Mount |
+--------+-----------------------------------+----------+----------+---------+----------+-------+
| Before | /dev/mapper/vg_rws1270665-lv_root | 51475068 | 46597152 | 2256476 | 96% | / |
| After | /dev/mapper/vg_rws1270665-lv_root | 51475068 | 46597152 | 2256476 | 96% | / |
'--------+-----------------------------------+----------+----------+---------+----------+-------'

tail files
tfactl tail alert
------------------------------
/scratch/app/11.2.0.4/grid/log/myserver69/alertmyserver69.log
2020-10-01 23:28:22.532:
[ctssd(5630)]CRS-2409:The clock on host myserver69 is not synchronous with
the mean cluster time. No action has been taken as the Cluster Time
Synchronization Service is running in observer mode.
2020-10-01 23:58:22.964:
[ctssd(5630)]CRS-2409:The clock on host myserver69 is not synchronous with
the mean cluster time. No action has been taken as the Cluster Time
Synchronization Service is running in observer mode.
...

tail files
...
/scratch/app/oradb/diag/rdbms/apxcmupg/apxcmupg_2/trace/alert_apxcmupg_2.log
Thu Oct 01 06:00:00 2020 VKRM started with pid=82, OS id=4903
Thu Oct 01 06:00:02 2020 Begin automatic SQL Tuning Advisor run for special
tuning task "SYS_AUTO_SQL_TUNING_TASK"
Thu Oct 01 06:00:37 2020 End automatic SQL Tuning Advisor run for special
tuning task "SYS_AUTO_SQL_TUNING_TASK"
Thu Oct 01 23:00:28 2020 Thread 2 advanced to log sequence 759 (LGWR switch)
Current log# 3 seq# 759 mem# 0:
+DATA/apxcmupg/onlinelog/group_3.289.917164707
Current log# 3 seq# 759 mem# 1:
+FRA/apxcmupg/onlinelog/group_3.289.917164707
...

tail files
...
/scratch/app/oradb/diag/rdbms/ogg11204/ogg112041/trace/alert_ogg112041.log
Clearing Resource Manager plan via parameter
Thu Oct 01 05:59:59 2020
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Thu Oct 01 05:59:59 2020
Starting background process VKRM
Thu Oct 01 05:59:59 2020
VKRM started with pid=36, OS id=4901
Thu Oct 01 22:00:31 2020
Thread 1 advanced to log sequence 305 (LGWR switch)
Current log# 1 seq# 305 mem# 0: +DATA/ogg11204/redo01.log
...

tail files
...
/scratch/app/oragrid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log <==
Thu Oct 01 04:42:22 2020
NOTE: [ocrcheck.bin@myserver69 (TNS V1-V3) 2323] opening OCR file
Thu Oct 01 01:05:39 2020
Thu Oct 01 01:05:41 2020
Thu Oct 01 01:21:12 2020
Thu Oct 01 01:21:12 2020
...

Near real-time Database monitoring
• Single instance & RAC
• Monitoring current database activities
• Database performance
• Identifying contention and bottlenecks
• Process & SQL Monitoring
• Real time wait events
• Active Data Guard support
• Multitenant Database (CDB) support
oratop (Support Tools Bundle)

Monitor Database performance
tfactl run oratop -database ogg19c

Monitor Database performance
Section 1 DATABASE: Global database information
Section 2 INSTANCE: Database instance activity
Section 3 EVENT: AWR-like “Top 5 Timed Events”
Section 4 PROCESS | SQL: Processes or SQL mode information
More info: Document 1500864.1

Collect & Archive OS Metrics
Executes standard UNIX utilities (e.g. vmstat, iostat, ps) at regular intervals
Built-in Analyzer functionality to summarize, graph, and report on collected metrics
Output is required for node reboot and performance issues
Simple to install, extremely lightweight
Runs on all platforms (except Windows)
OS Watcher (Support Tools Bundle)

Analyse OS Metrics
tfactl run oswbb
Starting OSW Analyzer V8.4.0
OSWatcher Analyzer Written by Oracle Center of Expertise
Copyright (c) 2020 by Oracle Corporation
Parsing Data. Please Wait...
Scanning file headers for version and platform info...
Parsing file rws1270069_iostat_18.11.24.0900.dat ...
Parsing file rws1270069_iostat_18.11.24.1000.dat ...
...

Analyse OS Metrics
...
Enter 1 to Display CPU Process Queue Graphs
Enter 2 to Display CPU Utilization Graphs
Enter 3 to Display CPU Other Graphs
Enter 4 to Display Memory Graphs
Enter 5 to Display Disk IO Graphs
Enter GC to Generate All CPU Gif Files
Enter GM to Generate All Memory Gif Files
Enter GD to Generate All Disk Gif Files
Enter GN to Generate All Network Gif Files
Enter L to Specify Alternate Location of Gif Directory
Enter Z to Zoom Graph Time Scale (Does not change analysis dataset)
...

Analyse OS Metrics
...
Enter B to Returns to Baseline Graph Time Scale (Does not change
analysis dataset)
Enter R to Remove Currently Displayed Graphs
Enter X to Export Parsed Data to Flat File
Enter S to Analyze Subset of Data(Changes analysis dataset including
graph time scale)
Enter A to Analyze Data
Enter D to Generate DashBoard
Enter Q to Quit Program
Please Select an Option:1

Analyse OS Metrics
[Graph for host myserver69]
More info: Document 301137.1

Generates a view of cluster and database diagnostic metrics
• Always on - Enabled by default
• Provides detailed OS resource metrics
• Assists node eviction analysis
• Locally logs all process data
• User can define pinned processes
• Listens to CSS and GIPC events
• Categorizes processes by type
• Supports plug-in collectors (e.g. traceroute, netstat, ping)
• New CSV output for ease of analysis
Cluster Health Monitor (CHM)
[Diagram: osysmond on each node sends OS data to ologgerd (master), which stores it in the 12c Grid Infrastructure Management Repository (GIMR)]

Cluster Health Monitor (CHM)
Oclumon CLI or full integration with EM Cloud Control

Always on - Enabled by default
Detects node and database performance problems
Provides early-warning alerts and corrective action
Supports on-site calibration to improve sensitivity
Integrated into EMCC Incident Manager and notifications
Standalone Interactive GUI Tool
Cluster Health Advisor (CHA)*
[Diagram: ochad, connected to the GIMR, takes OS data from CHM and DB data into the Node Health Prognostics Engine and the Database Health Prognostics Engine]
* Requires, and is included with, a RAC or RAC One Node (R1N) license

Choosing a Data Set for Calibration – Defining “normal”
Calibrating CHA to your RAC deployment
chactl query calibration -cluster -timeranges 'start=2020-10-01 07:00:00,end=2020-10-01 13:00:00'
Cluster name : mycluster
Start time : 2020-10-01 07:00:00
End time : 2020-10-01 13:00:00
Total Samples : 11524
Percentage of filtered data : 100%
1) Disk read (ASM) (Mbyte/sec)
MEAN MEDIAN STDDEV MIN MAX
0.11 0.00 2.62 0.00 114.66
<25 <50 <75 <100 >=100
99.87% 0.08% 0.00% 0.02% 0.03%
...

...
2) Disk write (ASM) (Mbyte/sec)
0.01 0.00 0.15 0.00 6.77
<50 <100 <150 <200 >=200
100.00% 0.00% 0.00% 0.00% 0.00%
...

...
3) Disk throughput (ASM) (IO/sec)
2.20 0.00 31.17 0.00 1100.00
<5000 <10000 <15000 <20000 >=20000
100.00% 0.00% 0.00% 0.00% 0.00%
4) CPU utilization (total) (%)
9.62 9.30 7.95 1.80 77.90
<20 <40 <60 <80 >=80
92.67% 6.17% 1.11% 0.05% 0.00%
...
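Each metric in the calibration output is reported as a summary row (MEAN, MEDIAN, STDDEV, MIN, MAX) followed by the percentage of samples in each threshold bucket. The sketch below shows how such a summary can be computed from raw samples, as an aid to reading the table. How CHA itself computes these figures is not stated here, so the use of population standard deviation and the function name are assumptions.

```python
import bisect
import statistics


def summarize(samples, edges):
    """Summarize samples and bucket them as <e0, <e1, ..., >=e_last.

    'edges' must be sorted ascending, e.g. [25, 50, 75, 100].
    """
    counts = [0] * (len(edges) + 1)
    for value in samples:
        # bisect_right puts a value equal to an edge into the next bucket up.
        counts[bisect.bisect_right(edges, value)] += 1
    total = len(samples)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stddev": statistics.pstdev(samples),  # assumption: population stddev
        "min": min(samples),
        "max": max(samples),
        "buckets": [100.0 * c / total for c in counts],
    }
```

A data set is a reasonable candidate for "normal" when almost all samples fall in the lowest buckets, as in the disk and CPU metrics above.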

Create and store a new model:
chactl calibrate cluster -model daytime -timeranges 'start=2020-10-01 07:00:00,end=2020-10-01 13:00:00'
Begin using the new model:
chactl monitor cluster -model daytime
Confirm the new model is working:
chactl status -verbose
monitoring nodes svr01, svr02 using model daytime
monitoring database oltpacdb, instances oltpacdb_1, oltpacdb_2 using model DEFAULT_DB

Command line operations
Enable CHA monitoring on RAC database with optional model:
chactl monitor database -db oltpacdb [-model model_name]
Check CHA monitoring status with optional verbose output:
chactl status -verbose
monitoring nodes svr01, svr02 using model DEFAULT_CLUSTER
monitoring database oltpacdb, instances oltpacdb_1, oltpacdb_2 using model DEFAULT_DB

Check for Health Issues and Corrective Actions with CHACTL QUERY DIAGNOSIS
chactl query diagnosis -db oltpacdb -start "2020-10-01 01:42:50" -end "2020-10-01 03:19:15"
2020-10-01 01:47:10.0 Database oltpacdb DB Control File IO Performance (oltpacdb_1) [detected]
2020-10-01 02:59:35.0 Database oltpacdb DB Log File Switch (oltpacdb_1) [detected]
Problem: DB Control File IO Performance
Description: CHA has detected that reads or writes to the control files are slower than expected.
Cause: The Cluster Health Advisor (CHA) detected that reads or writes to the control files were slow because of an increase in disk IO. The slow control file reads and writes may have an impact on checkpoint and Log Writer (LGWR) performance.
Action: Separate the control files from other database files and move them to faster disks or Solid State Devices.
Problem: DB Log File Switch
Description: CHA detected that database sessions are waiting longer than expected for log switch completions.
Cause: The Cluster Health Advisor (CHA) detected high contention during log switches because the redo log files were small and the redo logs switched frequently.
Action: Increase the size of the redo logs.

HTML diagnostic health output available (-html <file_name>)

Diagnose cluster health
chactl query diagnosis -db oltpacdb -start "2020-10-01 01:42:50.0" -end "2020-10-01 03:19:15.0"
2020-10-01 02:52:15.0 Database oltpacdb DB CPU Utilization (oltpacdb_2) [detected]
2020-10-01 02:52:50.0 Database oltpacdb DB CPU Utilization (oltpacdb_1) [detected]
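When a diagnosis query returns many event lines, grouping them by problem shows quickly whether an issue is confined to one instance or affects several. The sketch below parses the event-line format shown above; the regular expression is written against these samples only, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

_EVENT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"Database\s+(?P<db>\S+)\s+"
    r"(?P<problem>.+?)\s+"
    r"\((?P<instance>[^)]+)\)\s+"
    r"\[(?P<state>\w+)\]$"
)


def parse_event(line: str):
    """Parse one diagnosis event line into a dict, or return None."""
    m = _EVENT.match(line.strip())
    if m is None:
        return None
    return {
        "when": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f"),
        "database": m["db"],
        "problem": m["problem"],
        "instance": m["instance"],
        "state": m["state"],
    }


def instances_by_problem(lines):
    """Map each problem name to the instances it was reported on, in order."""
    grouped = defaultdict(list)
    for line in lines:
        event = parse_event(line)
        if event is not None:
            grouped[event["problem"]].append(event["instance"])
    return dict(grouped)
```

In the sample above, DB CPU Utilization is reported on both oltpacdb_2 and oltpacdb_1 within a minute, which suggests a database-wide cause and not one bad node.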
