I faced Oracle Database block corruption yesterday and posting the recovery steps taken to survive the crash.
The following filesystems need to be mounted with options (rw,_netdev,noatime,datavolume,nointr,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata39 on /oracle/RP1/sapdata39 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata40 on /oracle/RP1/sapdata40 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata41 on /oracle/RP1/sapdata41 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata42 on /oracle/RP1/sapdata42 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata43 on /oracle/RP1/sapdata43 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata44 on /oracle/RP1/sapdata44 type ocfs2 (rw,_netdev,heartbeat=local)
Because of which high I/O and contention caused all five of the Oracle RAC nodes to crash. Also resulting in datafile recovery.
But, recover datafile was failing because of inconsistent redolog with datablock.
So, the following steps are taking in order to get the Production system out of danger.
General recommendation for mounting the ocfs2 filesystems in Oracle RAC environment is with the following options.
rw,_netdev,noatime,datavolume,nointr,heartbeat=local.
Remounted the problematic filesystems with the above options. And now, going ahead with the analysis and recovery steps.
SQL> startup pfile=initRP11.ora
ORA-00445: background process "PSP0" did not start after 120 seconds
Probably due to high number of db writer processes and large sga.
SQL> select distinct status from v$backup;
STATUS
------------------
NOT ACTIVE
SQL> set NUMWIDTH 13
SQL> select distinct checkpoint_change# from v$datafile;
CHECKPOINT_CHANGE#
------------------
100242078361
100243048128
SQL> select checkpoint_change#,count(*) from v$datafile group by checkpoint_change#;
CHECKPOINT_CHANGE# COUNT(*)
------------------ -------------
100242078361 1
100243048128 629
SQL> select tablespace_name , status from dba_tablespaces ;
TABLESPACE_NAME STATUS
------------------------------ ---------
SYSTEM ONLINE
PSAPUNDO_5 ONLINE
SYSAUX ONLINE
PSAPTEMP ONLINE
PERFMAN ONLINE
PSAPR3E620 ONLINE
PSAPR3EUSR ONLINE
PSAPR3E ONLINE
PSAPUNDO_1 ONLINE
PSAPUNDO_2 ONLINE
PSAPUNDO_3 ONLINE
PSAPUNDO_4 ONLINE
SQL> select status,count(*) from dba_data_files group by status;
STATUS COUNT(*)
--------- -------------
AVAILABLE 630
SQL> select file# from v$datafile where checkpoint_change#=100242078361;
FILE#
-------------
628
SQL> select * from dba_data_files where file_id=628;
FILE_NAME
--------------------------------------------------------------------------------
FILE_ID TABLESPACE_NAME BYTES BLOCKS
------------- ------------------------------ ------------- -------------
STATUS RELATIVE_FNO AUT MAXBYTES MAXBLOCKS INCREMENT_BY
--------- ------------- --- ------------- ------------- -------------
USER_BYTES USER_BLOCKS ONLINE_
------------- ------------- -------
/oracle/RP1/sapdata40/r3e_554/r3e.data554
628 PSAPR3E
AVAILABLE 628
RECOVER
SQL> select count(*) from dba_extents where file_id=628;
COUNT(*)
----------
16
SQL> select name from v$datafile where status ='RECOVER';
NAME
--------------------------------------------------------------------------------
/oracle/RP1/sapdata40/r3e_554/r3e.data554
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' ;
ORA-00283: recovery session canceled due to errors
ORA-12801: error signaled in parallel query server P003, instance
DBHOST:RP11 (1)
ORA-00600: internal error code, arguments: [3020], [628], [128460],
[2634151372], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Alert_RP11.log
Fri Jan 7 17:15:19 2011
Recovery of Online Redo Log: Thread 2 Group 23 Seq 6211 Reading mem 0
Mem# 0: /oracle/RP1/origlogA/LOG_G23M1.DBF
Mem# 1: /oracle/RP1/mirrlogA/LOG_G23M2.DBF
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p003_6215.trc:
ORA-00600: internal error code, arguments: [3020], [628], [128460], [2634151372] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p007_6229.trc:
ORA-00600: internal error code, arguments: [3020], [628], [126171], [2634149083] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p007_6229.trc:
ORA-00600: internal error code, arguments: [3020], [628], [126171], [2634149083] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p003_6215.trc:
ORA-00600: internal error code, arguments: [3020], [628], [128460], [2634151372] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:35 2011
Trace dumping is performing id=[cdmp_20110107171635]
Fri Jan 7 17:16:42 2011
Media Recovery failed with error 12801
ORA-283 signalled during: ALTER DATABASE RECOVER datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' ...
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 128460 between block_id AND block_id + blocks - 1;
Run DB Verify dbv file=r3e.data554 FEEDBACK=100 but no guarantee that DB Verify will be 100% dependable.
DBVERIFY - Verification complete
Total Pages Examined : 3840000
Total Pages Processed (Data) : 48992
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 78245
Total Pages Failing (Index): 0
Total Pages Processed (Other): 144
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 3712619
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 1458725156 (23.1458725156)
Find the object in the corrupt block
SQL> desc obj$;
Name Null? Type
----------------------------------------- -------- ----------------------------
OBJ# NOT NULL NUMBER
DATAOBJ# NUMBER
OWNER# NOT NULL NUMBER
NAME NOT NULL VARCHAR2(30)
NAMESPACE NOT NULL NUMBER
SUBNAME VARCHAR2(30)
TYPE# NOT NULL NUMBER
CTIME NOT NULL DATE
MTIME NOT NULL DATE
STIME NOT NULL DATE
STATUS NOT NULL NUMBER
REMOTEOWNER VARCHAR2(30)
LINKNAME VARCHAR2(128)
FLAGS NUMBER
OID$ RAW(16)
SPARE1 NUMBER
SPARE2 NUMBER
SPARE3 NUMBER
SPARE4 VARCHAR2(1000)
SPARE5 VARCHAR2(1000)
SPARE6 DATE
SQL> elect name from obj$ where obj#=99304 ;
NAME
------------------------------
GLPCA~0
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 128460 between block_id AND block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER
------------------------------ ------------------ ------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
PSAPR3E INDEX SAPR3E
GLPCA~0
SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 126171 between block_id AND block_id + blocks - 1;
So it's an index block. I think what needs to be done is to run the recover clause with option "test" and then check in the syslog the number of corruptions
Run a test recovery
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' test;
ORA-10589: Test recovery had to corrupt 2 data blocks in order to proceed
ORA-10573: Test recovery tested redo from change 100242078361 to 100242999965
ORA-10572: Test recovery canceled due to errors
ORA-00600: internal error code, arguments: [3019], [2], [6211], [1232097],
[23], [1458752157], [23], [1458752153]
Alert_RP11.log
Fri Jan 7 17:57:47 2011
CORRUPTING BLOCK 126171 OF FILE 628 AND CONTINUING RECOVERY
Fri Jan 7 17:57:47 2011
Errors in file /oracle/RP1/saptrace/usertrace/rp11_ora_31850.trc:
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
RA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:57:47 2011
Errors in file /oracle/RP1/saptrace/usertrace/rp11_ora_31850.trc:
ORA-00600: internal error code, arguments: [3019], [2], [6211], [1232097], [23], [1458752157], [23], [1458752153]
Fri Jan 7 17:57:49 2011
Test recovery canceled due to errors
Test recovery tested redo from change 100242078361 to 100242999965
Test recovery found 2 problems
ORA-10589 signalled during: ALTER DATABASE RECOVER datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' test ..
Fri Jan 7 17:57:49 2011
Trace dumping is performing id=[cdmp_20110107175749]
SQL> select segment_name, segment_type from dba_extents where file_id=628 order by segment_name;
SEGMENT_NAME
--------------------------------------------------------------------------------
SEGMENT_TYPE
------------------
BSIS~0
INDEX
CE11000
TABLE
GLPCA
TABLE
SEGMENT_NAME
--------------------------------------------------------------------------------
SEGMENT_TYPE
------------------
GLPCA~0
INDEX
GLPCA~1
INDEX
GLPCA~2
INDEX
SEGMENT_NAME
--------------------------------------------------------------------------------
SEGMENT_TYPE
------------------
IBINVALUES~SEL
INDEX
IBINVALUES~SWO
INDEX
SOFFCONT1
TABLE
SEGMENT_NAME
--------------------------------------------------------------------------------
SEGMENT_TYPE
------------------
SWWLOGHIST
TABLE
SWWLOGHIST
TABLE
SWWLOGHIST
TABLE
SEGMENT_NAME
--------------------------------------------------------------------------------
SEGMENT_TYPE
------------------
SWWLOGHIST~0
INDEX
SWWLOGHIST~0
INDEX
SWWLOGHIST~1
INDEX
SEGMENT_NAME
--------------------------------------------------------------------------------
SEGMENT_TYPE
------------------
SWWLOGHIST~Z01
INDEX
16 rows selected.
The datafile 628 contains only few extents. If we are lucky maybe these are only indexes?
We could ignore this if only indexes were affected.
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 126171 between block_id AND block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER
------------------------------ ------------------ ------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
PSAPR3E INDEX SAPR3E
GLPCA~0
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' allow 1 corruption;
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' allow 1 corruption;
Rebuild the corrupted index.
SQL> alter index sapr3e."GLPCA~0" rebuild online parallel 8;
After the rebuild the datafile will continue to contain the two corrupt blocks. But they will not be addressed.
They will sooner or later be gone, when they are overwritten. After the rebuild is finished there is no further impact.
Except of the fact that a db verify will show these two blocks as corrupt. It could be that you see it *now*, i.e. after the recovery.
No comments:
Post a Comment