I faced an Oracle Database block corruption yesterday and am posting the recovery steps taken to survive the crash.
The following filesystems should have been mounted with the options (rw,_netdev,noatime,datavolume,nointr,heartbeat=local), but were found mounted only as:
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata39 on /oracle/RP1/sapdata39 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata40 on /oracle/RP1/sapdata40 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata41 on /oracle/RP1/sapdata41 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata42 on /oracle/RP1/sapdata42 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata43 on /oracle/RP1/sapdata43 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata44 on /oracle/RP1/sapdata44 type ocfs2 (rw,_netdev,heartbeat=local)
Because of the missing mount options, high I/O and contention caused all five Oracle RAC nodes to crash, which in turn left a datafile in need of recovery.
However, RECOVER DATAFILE was failing because the redo log was inconsistent with a data block.
So, the following steps were taken to get the production system out of danger.
The general recommendation for mounting OCFS2 filesystems in an Oracle RAC environment is to use the following options:
rw,_netdev,noatime,datavolume,nointr,heartbeat=local
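For reference, a corresponding /etc/fstab entry with the recommended options would look roughly like the line below (device and mount point taken from the listing above; changing the datavolume/nointr options typically requires unmounting and mounting the filesystem again, with the database stopped, rather than a simple remount):
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata40  /oracle/RP1/sapdata40  ocfs2  rw,_netdev,noatime,datavolume,nointr,heartbeat=local  0 0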
The problematic filesystems were remounted with these options, and then the analysis and recovery steps below were carried out.
SQL> startup pfile=initRP11.ora
ORA-00445: background process "PSP0" did not start after 120 seconds
This was probably due to the high number of DB writer processes and the large SGA.
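As a quick sanity check, the settings in question can be inspected once the instance is up (or looked up directly in the pfile); these are standard Oracle parameters, the values are of course system specific:
SQL> show parameter db_writer_processes
SQL> show parameter sga_max_size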
SQL> select distinct status from v$backup;
STATUS
------------------
NOT ACTIVE
SQL> set NUMWIDTH 13
SQL> select distinct checkpoint_change# from v$datafile;
CHECKPOINT_CHANGE#
------------------
100242078361
100243048128
SQL> select checkpoint_change#,count(*) from v$datafile group by checkpoint_change#;
CHECKPOINT_CHANGE# COUNT(*)
------------------ -------------
100242078361 1
100243048128 629
SQL> select tablespace_name , status from dba_tablespaces ;
TABLESPACE_NAME STATUS
------------------------------ ---------
SYSTEM ONLINE
PSAPUNDO_5 ONLINE
SYSAUX ONLINE
PSAPTEMP ONLINE
PERFMAN ONLINE
PSAPR3E620 ONLINE
PSAPR3EUSR ONLINE
PSAPR3E ONLINE
PSAPUNDO_1 ONLINE
PSAPUNDO_2 ONLINE
PSAPUNDO_3 ONLINE
PSAPUNDO_4 ONLINE
SQL> select status,count(*) from dba_data_files group by status;
STATUS COUNT(*)
--------- -------------
AVAILABLE 630
SQL> select file# from v$datafile where checkpoint_change#=100242078361;
FILE#
-------------
628
SQL> select * from dba_data_files where file_id=628;
FILE_NAME
--------------------------------------------------------------------------------
FILE_ID TABLESPACE_NAME BYTES BLOCKS
------------- ------------------------------ ------------- -------------
STATUS RELATIVE_FNO AUT MAXBYTES MAXBLOCKS INCREMENT_BY
--------- ------------- --- ------------- ------------- -------------
USER_BYTES USER_BLOCKS ONLINE_
------------- ------------- -------
/oracle/RP1/sapdata40/r3e_554/r3e.data554
628 PSAPR3E
AVAILABLE 628
RECOVER
SQL> select count(*) from dba_extents where file_id=628;
COUNT(*)
----------
16
SQL> select name from v$datafile where status ='RECOVER';
NAME
--------------------------------------------------------------------------------
/oracle/RP1/sapdata40/r3e_554/r3e.data554
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' ;
ORA-00283: recovery session canceled due to errors
ORA-12801: error signaled in parallel query server P003, instance
DBHOST:RP11 (1)
ORA-00600: internal error code, arguments: [3020], [628], [128460],
[2634151372], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Alert_RP11.log
Fri Jan 7 17:15:19 2011
Recovery of Online Redo Log: Thread 2 Group 23 Seq 6211 Reading mem 0
Mem# 0: /oracle/RP1/origlogA/LOG_G23M1.DBF
Mem# 1: /oracle/RP1/mirrlogA/LOG_G23M2.DBF
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p003_6215.trc:
ORA-00600: internal error code, arguments: [3020], [628], [128460], [2634151372] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p007_6229.trc:
ORA-00600: internal error code, arguments: [3020], [628], [126171], [2634149083] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p007_6229.trc:
ORA-00600: internal error code, arguments: [3020], [628], [126171], [2634149083] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p003_6215.trc:
ORA-00600: internal error code, arguments: [3020], [628], [128460], [2634151372] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:35 2011
Trace dumping is performing id=[cdmp_20110107171635]
Fri Jan 7 17:16:42 2011
Media Recovery failed with error 12801
ORA-283 signalled during: ALTER DATABASE RECOVER datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' ...
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 128460 between block_id AND block_id + blocks - 1;
Next, run DB Verify against the datafile (dbv file=r3e.data554 FEEDBACK=100), keeping in mind that there is no guarantee that DB Verify catches every kind of problem.
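Spelled out with the full path, the check might look like the following; the block size of 8192 is an assumption (the usual SAP default) and must match db_block_size:
dbv file=/oracle/RP1/sapdata40/r3e_554/r3e.data554 blocksize=8192 feedback=100 logfile=dbv_r3e554.log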
DBVERIFY - Verification complete
Total Pages Examined : 3840000
Total Pages Processed (Data) : 48992
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 78245
Total Pages Failing (Index): 0
Total Pages Processed (Other): 144
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 3712619
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 1458725156 (23.1458725156)
Find the object that owns the corrupt block
SQL> desc obj$;
Name Null? Type
----------------------------------------- -------- ----------------------------
OBJ# NOT NULL NUMBER
DATAOBJ# NUMBER
OWNER# NOT NULL NUMBER
NAME NOT NULL VARCHAR2(30)
NAMESPACE NOT NULL NUMBER
SUBNAME VARCHAR2(30)
TYPE# NOT NULL NUMBER
CTIME NOT NULL DATE
MTIME NOT NULL DATE
STIME NOT NULL DATE
STATUS NOT NULL NUMBER
REMOTEOWNER VARCHAR2(30)
LINKNAME VARCHAR2(128)
FLAGS NUMBER
OID$ RAW(16)
SPARE1 NUMBER
SPARE2 NUMBER
SPARE3 NUMBER
SPARE4 VARCHAR2(1000)
SPARE5 VARCHAR2(1000)
SPARE6 DATE
SQL> select name from obj$ where obj#=99304 ;
NAME
------------------------------
GLPCA~0
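The same lookup works without touching the dictionary base table, because the data object# reported in ORA-10561 corresponds to DATA_OBJECT_ID in DBA_OBJECTS:
SQL> select owner, object_name, object_type from dba_objects where data_object_id = 99304;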
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 128460 between block_id AND block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER
------------------------------ ------------------ ------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
PSAPR3E INDEX SAPR3E
GLPCA~0
SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 126171 between block_id AND block_id + blocks - 1;
So it's an index block. What needs to be done now is to run RECOVER with the TEST option and then check the alert log for the number of corruptions.
Run a test recovery
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' test;
ORA-10589: Test recovery had to corrupt 2 data blocks in order to proceed
ORA-10573: Test recovery tested redo from change 100242078361 to 100242999965
ORA-10572: Test recovery canceled due to errors
ORA-00600: internal error code, arguments: [3019], [2], [6211], [1232097],
[23], [1458752157], [23], [1458752153]
Alert_RP11.log
Fri Jan 7 17:57:47 2011
CORRUPTING BLOCK 126171 OF FILE 628 AND CONTINUING RECOVERY
Fri Jan 7 17:57:47 2011
Errors in file /oracle/RP1/saptrace/usertrace/rp11_ora_31850.trc:
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:57:47 2011
Errors in file /oracle/RP1/saptrace/usertrace/rp11_ora_31850.trc:
ORA-00600: internal error code, arguments: [3019], [2], [6211], [1232097], [23], [1458752157], [23], [1458752153]
Fri Jan 7 17:57:49 2011
Test recovery canceled due to errors
Test recovery tested redo from change 100242078361 to 100242999965
Test recovery found 2 problems
ORA-10589 signalled during: ALTER DATABASE RECOVER datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' test ..
Fri Jan 7 17:57:49 2011
Trace dumping is performing id=[cdmp_20110107175749]
SQL> select segment_name, segment_type from dba_extents where file_id=628 order by segment_name;
SEGMENT_NAME                   SEGMENT_TYPE
------------------------------ ------------------
BSIS~0                         INDEX
CE11000                        TABLE
GLPCA                          TABLE
GLPCA~0                        INDEX
GLPCA~1                        INDEX
GLPCA~2                        INDEX
IBINVALUES~SEL                 INDEX
IBINVALUES~SWO                 INDEX
SOFFCONT1                      TABLE
SWWLOGHIST                     TABLE
SWWLOGHIST                     TABLE
SWWLOGHIST                     TABLE
SWWLOGHIST~0                   INDEX
SWWLOGHIST~0                   INDEX
SWWLOGHIST~1                   INDEX
SWWLOGHIST~Z01                 INDEX
16 rows selected.
Datafile 628 contains only a few extents. If we are lucky, the corrupt blocks belong only to indexes.
Corruption in an index block can be tolerated, because the index can simply be rebuilt.
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 126171 between block_id AND block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER
------------------------------ ------------------ ------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
PSAPR3E INDEX SAPR3E
GLPCA~0
Since the test recovery reported two corrupt blocks, the RECOVER ... ALLOW 1 CORRUPTION command had to be issued twice:
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' allow 1 corruption;
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' allow 1 corruption;
Rebuild the corrupted index.
SQL> alter index sapr3e."GLPCA~0" rebuild online parallel 8;
After the rebuild the datafile still contains the two corrupt blocks, but they are no longer referenced by any segment.
They will disappear sooner or later, once the blocks are reused and overwritten. After the rebuild is finished there is no further impact,
except that a DB Verify run may still report these two blocks as corrupt; you may well see that now, i.e. after the recovery.
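As a final cross-check after the rebuild, the extent query from above can be repeated for the two block numbers; it should either return no rows (the old index segment is gone and the blocks are free space) or show the freshly rebuilt index if Oracle happened to reuse the blocks:
SQL> SELECT owner, segment_name, segment_type FROM dba_extents WHERE file_id = 628 AND 128460 BETWEEN block_id AND block_id + blocks - 1;
SQL> SELECT owner, segment_name, segment_type FROM dba_extents WHERE file_id = 628 AND 126171 BETWEEN block_id AND block_id + blocks - 1;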
Sunday, January 2, 2011
Firefox Tip: Access SAP Notes Easily
Step 1: Set up SSO to service.sap.com
The first thing to do is to set up Single Sign-On to service.sap.com. This can be done by going to http://service.sap.com and clicking the "Benefit from SSO" link at the bottom of the page. There you can download an X.509 certificate, which is used as the SSO method.
Step 2: Custom keyword bookmark
The last step is to add a new bookmark in Firefox by right clicking in the Bookmarks menu and selecting New Bookmark...
For the URL use http://service.sap.com/~form/handler?_APP=01100107900000000342&_EVENT=REDIR&_NNUM=%s and in the keyword enter note (see image below).
Finished: Access the notes
Now, all you have to do in order to access an SAP note is to type note followed by the note number in the address bar of Firefox, as shown in the image below.
Wednesday, December 22, 2010
Stage Copy Solution for Faster Dev and Sandbox system copies
CAUTION: PLEASE DO NOT COPY AND PASTE ANY OF THE COMMANDS.
Environment: RHEL (2.6 kernel) on HP ProLiant blades, Oracle 10.2.0.4, SAP R/3
The idea is to restore the online backup of DEV onto SND.
The systems in question are source DEV and target SND.
Because of time constraints we needed an efficient backup/restore method, so we opted for the stage_copy method.
The following procedure describes the stage copy method.
Online backup profile - initDEV_online.sap
Create a new profile for stage_copy based on the online backup profile and name it initDEV_online_stage.sap
vi initDEV_online_stage.sap and set/modify the following parameters
backup_mode = all
restore_mode = all
backup_type = online
backup_dev_type = stage
backup_root_dir = $SAPDATA_HOME/sapbackup
stage_root_dir = /oracle/SND/sapbackup
compress = no
compress_dir = $SAPDATA_HOME/sapreorg
archive_function = save
archive_copy_dir = $SAPDATA_HOME/sapbackup
archive_stage_dir = /oracle/SND/sapbackup
new_db_home = /oracle/SND
stage_db_home = /oracle/SND
remote_host = fctfrmSND
remote_user = 'oraSND test123' (this entry can be commented out with a leading # when ssh is used)
stage_copy_cmd = scp
exec_parallel = 2
Preparation of filesystems in SND (target)
1. Validate the mount points and their permissions.
2. Clean up all sapdata filesystems and compare the sizes between DEV and SND
3. Clean up all origlog* and mirrlog* filesystems.
4. Clean up saparch, oraarch, sapreorg, sapbackup filesystems.
5. Disable the archive log and online backup scripts in the cron on both DEV and SND.
6. Stop SAP and Oracle on SND (target)
7. Run cleanipc <instance number> remove and also remove any remaining shared memory segments using ipcrm (see the sketch after this list).
8. Kill all processes still running under the SNDadm and oraSND users.
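A rough sketch of step 7, assuming instance number 00 (illustrative only; use the real instance number and verify the owner of every IPC segment before removing it):
cleanipc 00 remove
ipcs -m | grep -i orasnd     # list shared memory segments still owned by the Oracle user
ipcrm -m <shmid>             # remove a leftover segment by its id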
Initiate online restore from DEV to SND using stage_copy profile
- Ensure ssh is set up on SND (target) for the oraSND and oraDEV users.
- Log in to DEV as oraDEV and issue the following command:
- brbackup -p initDEV_online_stage.sap -t online -c -u / &
Upon successful completion of the brbackup, the online backup of DEV is restored into the directories of the SND system. Now we need to transfer the archive log files from DEV to SND, depending upon the point-in-time recovery (PITR) that was requested. Use sftp from DEV to SND to copy the required archive logs to /oracle/SND/oraarch, and rename them from DEV* to SND* (one possible rename loop is sketched below).
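A possible way to do the rename in a single pass on SND, assuming the usual BR*Tools naming DEVarch1_<sequence>_<resetlogs-id>.dbf (adjust the pattern to the sequence range you actually copied):
cd /oracle/SND/oraarch
for f in DEVarch1_*.dbf; do mv "$f" "SND${f#DEV}"; done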
Restore archive log files from tape
- If the required archive log files have already been moved to tape and are no longer available in oraarch, issue the following command:
- brrestore -a 9921-9929 -p initDEV_arch.sap
Enable archive log and online backup scripts on DEV server.
Control file creation
- Login to SND and rename the existing control files:
  mv /oracle/SND/origlogA/cntrl/cntrlSND.dbf /oracle/SND/origlogA/cntrl/cntrlSND.dbf.old
  mv /oracle/SND/saparch/cntrl/cntrlSND.dbf /oracle/SND/saparch/cntrl/cntrlSND.dbf.old
  mv /oracle/SND/sapdata1/system_1/cntrl/cntrlSND.dbf /oracle/SND/sapdata1/system_1/cntrl/cntrlSND.dbf.old
- Login to DEV and create the control file trace:
  alter database backup controlfile to trace;
- Edit the generated trace file, replacing every occurrence of DEV with SND; the statement should then begin with the following line (a fuller sketch of the edited script appears after this section):
  CREATE CONTROLFILE SET DATABASE "SND" RESETLOGS NOARCHIVELOG
- Run the edited script on SND:
  SQL> @controlSND.sql
  ORACLE instance started.
  Total System Global Area 2600468480 bytes
  Fixed Size 2086192 bytes
  Variable Size 1308625616 bytes
  Database Buffers 1275068416 bytes
  Redo Buffers 14688256 bytes
  Control file created.
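For orientation only, here is a skeleton of what the edited controlSND.sql typically looks like after the DEV-to-SND replacement. The MAX* limits, log file groups, datafile list and character set are placeholders and must be taken from the trace file generated on DEV:
STARTUP NOMOUNT
CREATE CONTROLFILE SET DATABASE "SND" RESETLOGS NOARCHIVELOG
    MAXLOGFILES 255
    MAXLOGMEMBERS 3
    MAXDATAFILES 1000
    MAXINSTANCES 8
LOGFILE
  GROUP 1 ('/oracle/SND/origlogA/log_g11m1.dbf',
           '/oracle/SND/mirrlogA/log_g11m2.dbf') SIZE 50M,
  ...
DATAFILE
  '/oracle/SND/sapdata1/system_1/system.data1',
  ...
CHARACTER SET <character set from the trace>
;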
Recovery Steps
- Now the database is in mount state. Issue the following command to start the recovery:
  SQL> recover database using backup controlfile until time '2010-11-17:09:00:00';
  ORA-00279: change 4326224613 generated at 11/17/2010 08:18:44 needed for thread 1
  ORA-00289: suggestion : /oracle/SND/oraarch/SNDarch1_9930_721386457.dbf
  ORA-00280: change 4326224613 for thread 1 is in sequence #9930
  Specify log: {=suggested | filename | AUTO | CANCEL}
  AUTO
  Log applied.
  Media recovery complete.
- Open the database and start the listener:
  alter database open resetlogs;
  lsnrctl start LISTENER_SND
Post Recovery Steps
- Download and execute ORADBUSR.SQL (SAP Note 50088) with the following command:
  sqlplus /nolog @ORADBUSR.SQL SAPR3E UNIX SND X
- Add the temp datafiles:
  ALTER TABLESPACE PSAPTEMP ADD TEMPFILE '/oracle/SND/sapdata1/temp_1/temp.data1' SIZE 5000M REUSE AUTOEXTEND OFF;
  ALTER TABLESPACE PSAPTEMP ADD TEMPFILE '/oracle/SND/sapdata3/temp_2/temp.data2' SIZE 5000M REUSE AUTOEXTEND OFF;
- Run the statistics:
  brconnect -u / -c -f stats -t oradict_stats
  brconnect -u / -c -f stats -t system_stats
  brconnect -c -u / -f stats -t all -f collect -p 4
- Enable archive logging for the SND Oracle database.
Ensure TSM backups are running fine and enable the archive log and online backup scripts.
Friday, August 20, 2010
Oracle Block Corruptions
Block corruptions can occur at any of the following layers:
1. OS layer
2. DB layer
3. SAP Pool / Cluster table layer
4. Application layer
The best way to begin the investigation is with a file system check at the OS level.
SAP pool or cluster table layer check can be done with the tool R3check.
DB layer corruptions
A corrupted block can occur in datafiles or in redo log files.
Consistency checks can be performed with the following tools:
1. dbverify - checks if the database block fulfills the predefined rules
2. export - reads the data stored in table blocks.
3. analyze - reads table and index data and performs cross checks.
Note: If RMAN is used for backing up the database, an implicit consistency check is performed for the blocks saved, so an explicit dbverify check is not needed.
The above 3 methods can only be used for datafiles.
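If RMAN is available, the implicit check mentioned in the note above can also be triggered explicitly without producing a backup; any findings are recorded in V$DATABASE_BLOCK_CORRUPTION. This is a generic illustration rather than something run during the incident above:
RMAN> backup validate check logical database;
SQL> select * from v$database_block_corruption;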
Corruptions in redo log files can only be found when applying the redo logs during recovery. If you dump a redo log file and do not get an error, this is not sufficient to prove that the redo log is OK.
We basically dump the contents of a redo log file to a trace file and then read the trace file to understand the content. Below are some useful commands.
The following ways of dumping a redo log file are covered
1. To dump records based on DBA (Data Block Address)
2. To dump records based on RBA (Redo Block Address)
3. To dump records based on SCN
4. To dump records based on time
5. Dump the file header information
6. Dump an entire log file
1. To dump records based on DBA (Data Block Address)
Connect to database using sysdba and execute the below command
ALTER SYSTEM DUMP LOGFILE 'filename' DBA MIN (fileno) (blockno) DBA MAX (fileno) (blockno);
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' DBA MIN 5 . 31125 DBA MAX 5 . 31150;
This will cause all the changes to the specified range of data blocks to be dumped to the trace file. In the example given, all redo records for file #5, blocks 31125 thru 31150 are dumped.
2. To dump records based on RBA (Redo Block Address)
This will dump all redo records for the range of redo addresses specified for the given sequence number and block number.
Syntax:
ALTER SYSTEM DUMP LOGFILE 'filename' RBA MIN seqno blockno RBA MAX seqno blockno;
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' RBA MIN 2050 13255 RBA MAX 2255 15555;
3. To dump records based on SCN
Using this option will cause redo records owning changes within the SCN range
specified to be dumped to the trace file.
ALTER SYSTEM DUMP LOGFILE 'filename' SCN MIN minscn SCN MAX maxscn;
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' SCN MIN 103243 SCN MAX 103294;
4. To dump records based on time
Using this option will cause redo records created within the time range specified to be dumped to the trace file.
ALTER SYSTEM DUMP LOGFILE 'filename' TIME MIN value TIME MAX value;
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' TIME MIN 299425687 TIME MAX 299458800;
Please Note: the time value is given in REDO DUMP TIME
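The REDO DUMP TIME value is not an ordinary date. The conversion commonly cited in Oracle support material is (((((year - 1988) * 12 + month - 1) * 31 + day - 1) * 24 + hours) * 60 + minutes) * 60 + seconds; treat this as an assumption and cross-check it against the dump output. As a worked example, 07.01.2011 17:00:00 would be:
SQL> select (((((2011 - 1988)*12 + 1 - 1)*31 + 7 - 1)*24 + 17)*60 + 0)*60 + 0 as redo_dump_time from dual;
which yields 739818000.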
5. Dump the file header information
This will dump file header information for every online redo log file.
alter session set events 'immediate trace name redohdr level 10';
6. Dump an entire log file:
ALTER SYSTEM DUMP LOGFILE 'filename';
Please note: Fully qualify the filename, and include the single quotes.
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf';
Note: Be aware that brconnect -f check is not a consistency check. It only checks remaining free space, parameter settings, the last successful backup, and so on.
A proper consistency check can prevent data loss because of the corruption. If you check sufficiently often for corruptions you can restore the file containing the corrupted blocks from a backup that does not contain the corruption and recover up to the current point in time. It is quite unlikely that the same corruption is in the archive logs too.
Recommendation: Check the database for consistency at least once a week.
Analyze and export read the blocks to be checked into the SGA, which is why the buffer quality of the DB block buffer is adversely affected for a short while.
Analyze - checks both tables and indexes and may run for a long time; it is better to run it when the system workload is low. Buffer quality is adversely affected.
Export – Only checks tables. Performance loss as a result of export processes. Certain kinds of corruptions are exported without an error.
If you later try to reorganize this type of table (export/import), you will have problems during import. Buffer quality is adversely affected.
Dbverify - can be run during normal operation; it checks tables, indexes and also empty DB blocks, but does not check cross references. It reads the blocks without loading them into the SGA, so buffer quality is not affected. Dbverify can also be run on data files restored from a backup (without these files having to belong to a database) and checks the data in LOB columns.
To be continued...
Cluster table diagnosis using R3check utility
Use this utility to check whether a cluster table contains incorrect cluster records.
If a termination occurs while processing a cluster table in production operation, the corresponding system log entries and short dumps identify the cluster table in which the problem occurred.
Execute the program R3check for the diagnosis; it generates an output line with the key of the affected record for every defective cluster record. Create a control file for this at operating system level with the following contents:
export file='/dev/null' dumping=yes client=CLIENTID
select * from log.clustertable
and start the check as the <sid>adm user:
R3check controlfile > outputfile
Each line in the output contains the key of one defective cluster record.
To check a physical cluster table, call R3check for a logical cluster table which belongs to the corresponding physical cluster table.
If, after consulting the SAP Hotline on how to proceed with the error, you need to export the data belonging to a certain cluster key, generate an additional control file with the following contents:
export file='datafile' client=nnn
select * from <physical cluster table> where KEY1 = key1 and KEY2 = key2
and ...
and start the export as
R3trans controlfile
Substitute datafile with the name of the data file into which the exported data is to be written.
Monday, August 16, 2010
JMS Message Monitoring in CE 7.1 system
In this article we review the available Telnet commands related to the SAP JMS Provider and how you can use these commands in combination to investigate a particular problem or simply to monitor the runtime status of the JMS Provider.
Telnet Commands Related to the SAP JMS Provider
The SAP JMS Provider service provides Telnet commands for administering and monitoring the JMS
resources in a server. To use these commands, you have to connect to the AS Java using Telnet and enable
the JMS commands group with add jms. To show the help information about the available commands
under the jms group, you can type one of the following at the Telnet command prompt:
jms -h
jms -?
man jms
There are several subgroups of JMS Telnet commands:
jms_list – this command provides details about the JMS runtime environment
list_temp_destinations – this one displays all currently active temporary destinations.
Example Scenario
All JMS-related Telnet commands give you the possibility to monitor the runtime state of the SAP JMS
Provider. To illustrate the usage of the described Telnet commands, let us consider one simple scenario.
Imagine we just discovered that some persistent messages are not delivered to our application and we want
to investigate what might have happened with them. For the purposes of this example, we will use the
sapDemoQueue destination and the default JMS Virtual Provider.
The following procedure describes one possible path of investigation and the respective sequence of
commands.
Telnet to the Java server:
telnet localhost 50008
Administrator / password
command >> add jms
jms_list msg sapDemoQueue default
This command lists all messages sent to the sapDemoQueue destination that are present in the database. If there are no messages in this list, we know that there are currently no messages pending for delivery: either no messages have been produced, or all that have been produced have already been consumed and acknowledged. We can try to determine which producer was supposed to send them.
jms_list producers default
This command lists all producers registered to destinations belonging to the default JMS virtual provider. Note that this is the same virtual provider to which our destination belongs. From this list we can determine the producer ID, the destination to which the producer sends messages, its session ID and client ID. By the client ID, we can later on find the consumer that is supposed to receive the messages. In this case, we look for producers registered to the sapDemoQueue destination. This is a way to determine whether there is a currently active producer registered to our destination.
If there are messages pending to be delivered, then we have to continue our investigation with consumers that are expected to receive them. We can check the status of the JMS connection - how many bytes have been sent and received through it and when it was last accessed.
jms_list connections default
We use the client ID to check if there are any active connections and when a particular connection was last accessed. The JMS virtual provider again has to be the same.
Note: If you want to find the corresponding connection to your consumer, you need the connection with client ID that is equal to the one of the already found consumer.
We can also check the status of the consumers registered to the sapDemoQueue destination.
jms_list consumers default
This command lists all currently active consumers registered to destinations belonging to the default JMS virtual provider. From this list we can determine the consumer ID, the destination to which the consumer is registered, the session ID and the client ID. If there is no consumer registered to sapDemoQueue, then we know that our application does not receive messages because it failed for some reason to create the respective consumer, and we can continue the investigation in this direction, for example by checking the server traces for relevant exceptions.
If there is an active consumer but it still does not receive any of the pending messages, it is possible that there is an issue in the application message processing logic which causes the messages to be redelivered again and again. By default, message delivery attempts are limited and once they are exhausted for a particular message, it is considered undeliverable (dead), skipped by the consumer, and moved to the configured error destination of the original destination. To determine the error destination of the sapDemoQueue destination, we have to use the configuration editor: in the Display configuration tab, expand Configurations -> jms_provider -> default -> queues -> sapDemoQueue -> Propertysheet data. In the Property Sheet you can find the error destination of a particular destination. In our case, the error destination of sapDemoQueue is sapDefaultErrorQueue.
Then, we can check if there are any messages in the error destination.
jms_list msg sapDefaultErrorQueue default
With this command we can check if the missing messages are present in the error destination.
If our application is unable to consume some of the messages, we have to check why, and then we may want to do something with the undelivered messages. Since error destinations are just ordinary JMS destinations, you can access dead messages using the standard JMS API - for example, your application (or a dedicated tool) can consume and process the messages from the error destination; it can even return them back to the original destination, if that is the error handling logic of the application.
Note that we can configure the following properties on the jms-resources.xml related to the dead messages functionality.
a. deliveryAttemptsLimited - a Boolean property that indicates whether the message delivery
attempts are limited. The default value is "true".
b. maxDeliveryAttempts - an Integer property that indicates the maximum number of delivery
attempts before the message is considered undeliverable (dead). The default value is 5.
c. deliveryDelayInterval - the delay in milliseconds between two consecutive message delivery
attempts. The default value of this property is 2000 milliseconds.
d. errorDestination - the name of a JMS Queue where dead messages will be forwarded. If you
leave this property blank (""), this means that you want dead messages to be discarded.
These four properties are configurable per JMS destination.
Note: The default error destination has an empty string for the errorDestination property, otherwise, when a message becomes dead in its original destination and then it also becomes dead in the error destination, this may lead to several transfers of this message through error destinations and potentially this may even lead to a message delivery endless loop.
Note: The value of the errorDestination property must be the name of an already existing Queue.