I faced an Oracle Database block corruption yesterday and am posting the recovery steps taken to survive the crash.
The following filesystems should have been mounted with the options (rw,_netdev,noatime,datavolume,nointr,heartbeat=local), but were found mounted only as:
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata39 on /oracle/RP1/sapdata39 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata40 on /oracle/RP1/sapdata40 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata41 on /oracle/RP1/sapdata41 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata42 on /oracle/RP1/sapdata42 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata43 on /oracle/RP1/sapdata43 type ocfs2 (rw,_netdev,heartbeat=local)
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata44 on /oracle/RP1/sapdata44 type ocfs2 (rw,_netdev,heartbeat=local)
Because of the missing mount options, high I/O and contention caused all five Oracle RAC nodes to crash, which in turn left a datafile in need of recovery.
However, RECOVER DATAFILE was failing because the redo log was inconsistent with a data block.
So, the following steps were taken to get the production system out of danger.
The general recommendation for mounting OCFS2 filesystems in an Oracle RAC environment is to use the following options:
rw,_netdev,noatime,datavolume,nointr,heartbeat=local
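For reference, a corresponding /etc/fstab entry with the recommended options would look roughly like the line below (device and mount point taken from the listing above; changing the datavolume/nointr options typically requires unmounting and mounting the filesystem again, with the database stopped, rather than a simple remount):
/dev/mapper/vg_RP1_dataHDS-oracle_RP1_sapdata40  /oracle/RP1/sapdata40  ocfs2  rw,_netdev,noatime,datavolume,nointr,heartbeat=local  0 0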
The problematic filesystems were remounted with these options, and then the analysis and recovery steps below were carried out.
SQL> startup pfile=initRP11.ora
ORA-00445: background process "PSP0" did not start after 120 seconds
This was probably due to the high number of DB writer processes and the large SGA.
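As a quick sanity check, the settings in question can be inspected once the instance is up (or looked up directly in the pfile); these are standard Oracle parameters, the values are of course system specific:
SQL> show parameter db_writer_processes
SQL> show parameter sga_max_size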
SQL> select distinct status from v$backup;
STATUS
------------------
NOT ACTIVE
SQL> set NUMWIDTH 13
SQL> select distinct checkpoint_change# from v$datafile;
CHECKPOINT_CHANGE#
------------------
100242078361
100243048128
SQL> select checkpoint_change#,count(*) from v$datafile group by checkpoint_change#;
CHECKPOINT_CHANGE# COUNT(*)
------------------ -------------
100242078361 1
100243048128 629
SQL> select tablespace_name , status from dba_tablespaces ;
TABLESPACE_NAME STATUS
------------------------------ ---------
SYSTEM ONLINE
PSAPUNDO_5 ONLINE
SYSAUX ONLINE
PSAPTEMP ONLINE
PERFMAN ONLINE
PSAPR3E620 ONLINE
PSAPR3EUSR ONLINE
PSAPR3E ONLINE
PSAPUNDO_1 ONLINE
PSAPUNDO_2 ONLINE
PSAPUNDO_3 ONLINE
PSAPUNDO_4 ONLINE
SQL> select status,count(*) from dba_data_files group by status;
STATUS COUNT(*)
--------- -------------
AVAILABLE 630
SQL> select file# from v$datafile where checkpoint_change#=100242078361;
FILE#
-------------
628
SQL> select * from dba_data_files where file_id=628;
FILE_NAME
--------------------------------------------------------------------------------
FILE_ID TABLESPACE_NAME BYTES BLOCKS
------------- ------------------------------ ------------- -------------
STATUS RELATIVE_FNO AUT MAXBYTES MAXBLOCKS INCREMENT_BY
--------- ------------- --- ------------- ------------- -------------
USER_BYTES USER_BLOCKS ONLINE_
------------- ------------- -------
/oracle/RP1/sapdata40/r3e_554/r3e.data554
628 PSAPR3E
AVAILABLE 628
RECOVER
SQL> select count(*) from dba_extents where file_id=628;
COUNT(*)
----------
16
SQL> select name from v$datafile where status ='RECOVER';
NAME
--------------------------------------------------------------------------------
/oracle/RP1/sapdata40/r3e_554/r3e.data554
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' ;
ORA-00283: recovery session canceled due to errors
ORA-12801: error signaled in parallel query server P003, instance
DBHOST:RP11 (1)
ORA-00600: internal error code, arguments: [3020], [628], [128460],
[2634151372], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Alert_RP11.log
Fri Jan 7 17:15:19 2011
Recovery of Online Redo Log: Thread 2 Group 23 Seq 6211 Reading mem 0
Mem# 0: /oracle/RP1/origlogA/LOG_G23M1.DBF
Mem# 1: /oracle/RP1/mirrlogA/LOG_G23M2.DBF
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p003_6215.trc:
ORA-00600: internal error code, arguments: [3020], [628], [128460], [2634151372] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p007_6229.trc:
ORA-00600: internal error code, arguments: [3020], [628], [126171], [2634149083] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p007_6229.trc:
ORA-00600: internal error code, arguments: [3020], [628], [126171], [2634149083] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:34 2011
Errors in file /oracle/RP1/saptrace/background/rp11_p003_6215.trc:
ORA-00600: internal error code, arguments: [3020], [628], [128460], [2634151372] , [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 628, block# 128460)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:16:35 2011
Trace dumping is performing id=[cdmp_20110107171635]
Fri Jan 7 17:16:42 2011
Media Recovery failed with error 12801
ORA-283 signalled during: ALTER DATABASE RECOVER datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' ...
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 128460 between block_id AND block_id + blocks - 1;
Next, run DB Verify against the datafile (dbv file=r3e.data554 FEEDBACK=100), keeping in mind that there is no guarantee that DB Verify catches every kind of problem.
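Spelled out with the full path, the check might look like the following; the block size of 8192 is an assumption (the usual SAP default) and must match db_block_size:
dbv file=/oracle/RP1/sapdata40/r3e_554/r3e.data554 blocksize=8192 feedback=100 logfile=dbv_r3e554.log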
DBVERIFY - Verification complete
Total Pages Examined : 3840000
Total Pages Processed (Data) : 48992
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 78245
Total Pages Failing (Index): 0
Total Pages Processed (Other): 144
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 3712619
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 1458725156 (23.1458725156)
Find the object that owns the corrupt block
SQL> desc obj$;
Name Null? Type
----------------------------------------- -------- ----------------------------
OBJ# NOT NULL NUMBER
DATAOBJ# NUMBER
OWNER# NOT NULL NUMBER
NAME NOT NULL VARCHAR2(30)
NAMESPACE NOT NULL NUMBER
SUBNAME VARCHAR2(30)
TYPE# NOT NULL NUMBER
CTIME NOT NULL DATE
MTIME NOT NULL DATE
STIME NOT NULL DATE
STATUS NOT NULL NUMBER
REMOTEOWNER VARCHAR2(30)
LINKNAME VARCHAR2(128)
FLAGS NUMBER
OID$ RAW(16)
SPARE1 NUMBER
SPARE2 NUMBER
SPARE3 NUMBER
SPARE4 VARCHAR2(1000)
SPARE5 VARCHAR2(1000)
SPARE6 DATE
SQL> select name from obj$ where obj#=99304 ;
NAME
------------------------------
GLPCA~0
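The same lookup works without touching the dictionary base table, because the data object# reported in ORA-10561 corresponds to DATA_OBJECT_ID in DBA_OBJECTS:
SQL> select owner, object_name, object_type from dba_objects where data_object_id = 99304;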
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 128460 between block_id AND block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER
------------------------------ ------------------ ------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
PSAPR3E INDEX SAPR3E
GLPCA~0
SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 126171 between block_id AND block_id + blocks - 1;
So it's an index block. What needs to be done now is to run RECOVER with the TEST option and then check the alert log for the number of corruptions.
Run a test recovery
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' test;
ORA-10589: Test recovery had to corrupt 2 data blocks in order to proceed
ORA-10573: Test recovery tested redo from change 100242078361 to 100242999965
ORA-10572: Test recovery canceled due to errors
ORA-00600: internal error code, arguments: [3019], [2], [6211], [1232097],
[23], [1458752157], [23], [1458752153]
Alert_RP11.log
Fri Jan 7 17:57:47 2011
CORRUPTING BLOCK 126171 OF FILE 628 AND CONTINUING RECOVERY
Fri Jan 7 17:57:47 2011
Errors in file /oracle/RP1/saptrace/usertrace/rp11_ora_31850.trc:
ORA-10567: Redo is inconsistent with data block (file# 628, block# 126171)
ORA-10564: tablespace PSAPR3E
ORA-01110: data file 628: '/oracle/RP1/sapdata40/r3e_554/r3e.data554'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object# 99304
Fri Jan 7 17:57:47 2011
Errors in file /oracle/RP1/saptrace/usertrace/rp11_ora_31850.trc:
ORA-00600: internal error code, arguments: [3019], [2], [6211], [1232097], [23], [1458752157], [23], [1458752153]
Fri Jan 7 17:57:49 2011
Test recovery canceled due to errors
Test recovery tested redo from change 100242078361 to 100242999965
Test recovery found 2 problems
ORA-10589 signalled during: ALTER DATABASE RECOVER datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' test ..
Fri Jan 7 17:57:49 2011
Trace dumping is performing id=[cdmp_20110107175749]
SQL> select segment_name, segment_type from dba_extents where file_id=628 order by segment_name;
SEGMENT_NAME                   SEGMENT_TYPE
------------------------------ ------------------
BSIS~0                         INDEX
CE11000                        TABLE
GLPCA                          TABLE
GLPCA~0                        INDEX
GLPCA~1                        INDEX
GLPCA~2                        INDEX
IBINVALUES~SEL                 INDEX
IBINVALUES~SWO                 INDEX
SOFFCONT1                      TABLE
SWWLOGHIST                     TABLE
SWWLOGHIST                     TABLE
SWWLOGHIST                     TABLE
SWWLOGHIST~0                   INDEX
SWWLOGHIST~0                   INDEX
SWWLOGHIST~1                   INDEX
SWWLOGHIST~Z01                 INDEX
16 rows selected.
Datafile 628 contains only a few extents. If we are lucky, the corrupt blocks belong only to indexes.
Corruption in an index block can be tolerated, because the index can simply be rebuilt.
SQL> SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 628 and 126171 between block_id AND block_id + blocks - 1;
TABLESPACE_NAME SEGMENT_TYPE OWNER
------------------------------ ------------------ ------------------------------
SEGMENT_NAME
--------------------------------------------------------------------------------
PSAPR3E INDEX SAPR3E
GLPCA~0
Since the test recovery reported two corrupt blocks, the RECOVER ... ALLOW 1 CORRUPTION command had to be issued twice:
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' allow 1 corruption;
SQL> recover datafile '/oracle/RP1/sapdata40/r3e_554/r3e.data554' allow 1 corruption;
Rebuild the corrupted index.
SQL> alter index sapr3e."GLPCA~0" rebuild online parallel 8;
After the rebuild the datafile still contains the two corrupt blocks, but they are no longer referenced by any segment.
They will disappear sooner or later, once the blocks are reused and overwritten. After the rebuild is finished there is no further impact,
except that a DB Verify run may still report these two blocks as corrupt; you may well see that now, i.e. after the recovery.
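As a final cross-check after the rebuild, the extent query from above can be repeated for the two block numbers; it should either return no rows (the old index segment is gone and the blocks are free space) or show the freshly rebuilt index if Oracle happened to reuse the blocks:
SQL> SELECT owner, segment_name, segment_type FROM dba_extents WHERE file_id = 628 AND 128460 BETWEEN block_id AND block_id + blocks - 1;
SQL> SELECT owner, segment_name, segment_type FROM dba_extents WHERE file_id = 628 AND 126171 BETWEEN block_id AND block_id + blocks - 1;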
Sunday, January 2, 2011
Firefox Tip: Access SAP Notes Easily
Step 1: Set up SSO to service.sap.com
The first thing to do is to set up Single Sign-On to service.sap.com. This can be done by going to http://service.sap.com and clicking the "Benefit from SSO" link at the bottom of the page. There you can download an X.509 certificate, which is used as the SSO method.
Step 2: Custom keyword bookmark
The last step is to add a new bookmark in Firefox by right clicking in the Bookmarks menu and selecting New Bookmark...
For the URL use http://service.sap.com/~form/handler?_APP=01100107900000000342&_EVENT=REDIR&_NNUM=%s and in the keyword enter note (see image below).
Finished: Access the notes
Now, all you have to do in order to access an SAP note is to type note followed by the note number in the address bar of Firefox, as shown in the image below.
Wednesday, December 22, 2010
Stage Copy Solution for Faster Dev and Sandbox system copies
CAUTION: PLEASE DO NOT COPY AND PASTE ANY OF THE COMMANDS.
Environment: RHEL (2.6 kernel) on HP ProLiant blades, Oracle 10.2.0.4, SAP R/3
The idea is to restore the online backup of DEV onto SND.
The systems in question are source DEV and target SND.
Because of time constraints we needed an efficient backup/restore method, so we opted for the stage_copy method.
The following procedure describes the stage copy method.
Online backup profile - initDEV_online.sap
Create a new profile for stage_copy based on the online backup profile and name it initDEV_online_stage.sap
vi initDEV_online_stage.sap and set/modify the following parameters
backup_mode = all
restore_mode = all
backup_type = online
backup_dev_type = stage
backup_root_dir = $SAPDATA_HOME/sapbackup
stage_root_dir = /oracle/SND/sapbackup
compress = no
compress_dir = $SAPDATA_HOME/sapreorg
archive_function = save
archive_copy_dir = $SAPDATA_HOME/sapbackup
archive_stage_dir = /oracle/SND/sapbackup
new_db_home = /oracle/SND
stage_db_home = /oracle/SND
remote_host = fctfrmSND
remote_user = 'oraSND test123' (this entry can be commented out with a leading # when ssh is used)
stage_copy_cmd = scp
exec_parallel = 2
Preparation of filesystems in SND (target)
1. Validate the mount points and their permissions.
2. Clean up all sapdata filesystems and compare the sizes between DEV and SND
3. Clean up all origlog* and mirrlog* filesystems.
4. Clean up saparch, oraarch, sapreorg, sapbackup filesystems.
5. Disable the archive log and online backup scripts in the cron on both DEV and SND.
6. Stop SAP and Oracle on SND (target)
7. Run cleanipc <instance number> remove and also remove any remaining shared memory segments using ipcrm (see the sketch after this list).
8. Kill all processes still running under the SNDadm and oraSND users.
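A rough sketch of step 7, assuming instance number 00 (illustrative only; use the real instance number and verify the owner of every IPC segment before removing it):
cleanipc 00 remove
ipcs -m | grep -i orasnd     # list shared memory segments still owned by the Oracle user
ipcrm -m <shmid>             # remove a leftover segment by its id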
Initiate online restore from DEV to SND using stage_copy profile
- Ensure ssh is set up on SND (target) for the oraSND and oraDEV users.
- Log in to DEV as oraDEV and issue the following command:
- brbackup -p initDEV_online_stage.sap -t online -c -u / &
Upon successful completion of the brbackup, the online backup of DEV is restored into the directories of the SND system. Now we need to transfer the archive log files from DEV to SND, depending upon the point-in-time recovery (PITR) that was requested. Use sftp from DEV to SND to copy the required archive logs to /oracle/SND/oraarch, and rename them from DEV* to SND* (one possible rename loop is sketched below).
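A possible way to do the rename in a single pass on SND, assuming the usual BR*Tools naming DEVarch1_<sequence>_<resetlogs-id>.dbf (adjust the pattern to the sequence range you actually copied):
cd /oracle/SND/oraarch
for f in DEVarch1_*.dbf; do mv "$f" "SND${f#DEV}"; done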
Restore archive log files from tape
- If the required archive log files have already been moved to tape and are no longer available in oraarch, issue the following command:
- brrestore -a 9921-9929 -p initDEV_arch.sap
Enable archive log and online backup scripts on DEV server.
Control file creation
- Login to SND and rename the existing control files:
  mv /oracle/SND/origlogA/cntrl/cntrlSND.dbf /oracle/SND/origlogA/cntrl/cntrlSND.dbf.old
  mv /oracle/SND/saparch/cntrl/cntrlSND.dbf /oracle/SND/saparch/cntrl/cntrlSND.dbf.old
  mv /oracle/SND/sapdata1/system_1/cntrl/cntrlSND.dbf /oracle/SND/sapdata1/system_1/cntrl/cntrlSND.dbf.old
- Login to DEV and create the control file trace:
  alter database backup controlfile to trace;
- Edit the generated trace file, replacing every occurrence of DEV with SND; the statement should then begin with the following line (a fuller sketch of the edited script appears after this section):
  CREATE CONTROLFILE SET DATABASE "SND" RESETLOGS NOARCHIVELOG
- Run the edited script on SND:
  SQL> @controlSND.sql
  ORACLE instance started.
  Total System Global Area 2600468480 bytes
  Fixed Size 2086192 bytes
  Variable Size 1308625616 bytes
  Database Buffers 1275068416 bytes
  Redo Buffers 14688256 bytes
  Control file created.
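For orientation only, here is a skeleton of what the edited controlSND.sql typically looks like after the DEV-to-SND replacement. The MAX* limits, log file groups, datafile list and character set are placeholders and must be taken from the trace file generated on DEV:
STARTUP NOMOUNT
CREATE CONTROLFILE SET DATABASE "SND" RESETLOGS NOARCHIVELOG
    MAXLOGFILES 255
    MAXLOGMEMBERS 3
    MAXDATAFILES 1000
    MAXINSTANCES 8
LOGFILE
  GROUP 1 ('/oracle/SND/origlogA/log_g11m1.dbf',
           '/oracle/SND/mirrlogA/log_g11m2.dbf') SIZE 50M,
  ...
DATAFILE
  '/oracle/SND/sapdata1/system_1/system.data1',
  ...
CHARACTER SET <character set from the trace>
;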
Recovery Steps
- Now the database is in mount state. Issue the following command to start the recovery:
  SQL> recover database using backup controlfile until time '2010-11-17:09:00:00';
  ORA-00279: change 4326224613 generated at 11/17/2010 08:18:44 needed for thread 1
  ORA-00289: suggestion : /oracle/SND/oraarch/SNDarch1_9930_721386457.dbf
  ORA-00280: change 4326224613 for thread 1 is in sequence #9930
  Specify log: {=suggested | filename | AUTO | CANCEL}
  AUTO
  Log applied.
  Media recovery complete.
- Open the database and start the listener:
  alter database open resetlogs;
  lsnrctl start LISTENER_SND
Post Recovery Steps
- Download and execute ORADBUSR.SQL (SAP Note 50088) with the following command:
  sqlplus /nolog @ORADBUSR.SQL SAPR3E UNIX SND X
- Add the temp datafiles:
  ALTER TABLESPACE PSAPTEMP ADD TEMPFILE '/oracle/SND/sapdata1/temp_1/temp.data1' SIZE 5000M REUSE AUTOEXTEND OFF;
  ALTER TABLESPACE PSAPTEMP ADD TEMPFILE '/oracle/SND/sapdata3/temp_2/temp.data2' SIZE 5000M REUSE AUTOEXTEND OFF;
- Run the statistics:
  brconnect -u / -c -f stats -t oradict_stats
  brconnect -u / -c -f stats -t system_stats
  brconnect -c -u / -f stats -t all -f collect -p 4
- Enable archive logging for the SND Oracle database.
Ensure TSM backups are running fine and enable the archive log and online backup scripts.
Friday, August 20, 2010
Oracle Block Corruptions
Block corruptions can occur at any of the following layers:
1. OS layer
2. DB layer
3. SAP Pool / Cluster table layer
4. Application layer
The best way to begin the investigation is with a file system check at the OS level.
SAP pool or cluster table layer check can be done with the tool R3check.
DB layer corruptions
A corrupted block can occur in datafiles or in redo log files.
Consistency checks can be performed with the following tools:
1. dbverify - checks if the database block fulfills the predefined rules
2. export - reads the data stored in table blocks.
3. analyze - reads table and index data and performs cross checks.
Note: If RMAN is used for backing up the database, an implicit consistency check is performed for the blocks saved, so an explicit dbverify check is not needed.
The above 3 methods can only be used for datafiles.
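If RMAN is available, the implicit check mentioned in the note above can also be triggered explicitly without producing a backup; any findings are recorded in V$DATABASE_BLOCK_CORRUPTION. This is a generic illustration rather than something run during the incident above:
RMAN> backup validate check logical database;
SQL> select * from v$database_block_corruption;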
Corruptions in redo log files can only be found when applying the redo logs during recovery. If you dump a redo log file and do not get an error, this is not sufficient to prove that the redo log is OK.
We basically dump the contents of a redo log file to a trace file and then read the trace file to understand the content. Below are some useful commands.
The following ways of dumping a redo log file are covered
1. To dump records based on DBA (Data Block Address)
2. To dump records based on RBA (Redo Block Address)
3. To dump records based on SCN
4. To dump records based on time
5. Dump the file header information
6. Dump an entire log file
1. To dump records based on DBA (Data Block Address)
Connect to database using sysdba and execute the below command
ALTER SYSTEM DUMP LOGFILE 'filename' DBA MIN (fileno) (blockno) DBA MAX (fileno) (blockno);
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' DBA MIN 5 . 31125 DBA MAX 5 . 31150;
This will cause all the changes to the specified range of data blocks to be dumped to the trace file. In the example given, all redo records for file #5, blocks 31125 thru 31150 are dumped.
2. To dump records based on RBA (Redo Block Address)
This will dump all redo records for the range of redo addresses specified for the given sequence number and block number.
Syntax:
ALTER SYSTEM DUMP LOGFILE 'filename' RBA MIN seqno blockno RBA MAX seqno blockno;
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' RBA MIN 2050 13255 RBA MAX 2255 15555;
3. To dump records based on SCN
Using this option will cause redo records owning changes within the SCN range
specified to be dumped to the trace file.
ALTER SYSTEM DUMP LOGFILE 'filename' SCN MIN minscn SCN MAX maxscn;
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' SCN MIN 103243 SCN MAX 103294;
4. To dump records based on time
Using this option will cause redo records created within the time range specified to be dumped to the trace file.
ALTER SYSTEM DUMP LOGFILE 'filename' TIME MIN value TIME MAX value;
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf' TIME MIN 299425687 TIME MAX 299458800;
Please Note: the time value is given in REDO DUMP TIME
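The REDO DUMP TIME value is not an ordinary date. The conversion commonly cited in Oracle support material is (((((year - 1988) * 12 + month - 1) * 31 + day - 1) * 24 + hours) * 60 + minutes) * 60 + seconds; treat this as an assumption and cross-check it against the dump output. As a worked example, 07.01.2011 17:00:00 would be:
SQL> select (((((2011 - 1988)*12 + 1 - 1)*31 + 7 - 1)*24 + 17)*60 + 0)*60 + 0 as redo_dump_time from dual;
which yields 739818000.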
5. Dump the file header information
This will dump file header information for every online redo log file.
alter session set events 'immediate trace name redohdr level 10';
6. Dump an entire log file:
ALTER SYSTEM DUMP LOGFILE 'filename';
Please note: Fully qualify the filename, and include the single quotes.
Example:
ALTER SYSTEM DUMP LOGFILE 'u01/oracle/V7323/dbs/arch1_76.dbf';
Note: Be aware that brconnect -f check is not a consistency check. It only checks remaining free space, parameter settings, the last successful backup, and so on.
A proper consistency check can prevent data loss because of the corruption. If you check sufficiently often for corruptions you can restore the file containing the corrupted blocks from a backup that does not contain the corruption and recover up to the current point in time. It is quite unlikely that the same corruption is in the archive logs too.
Recommendation: Check the database for consistency at least once a week.
Analyze and export read the blocks to be checked into the SGA, which is why the buffer quality of the DB block buffer is adversely affected for a short while.
Analyze - checks both tables and indexes and may run for a long time; it is better to run it when the system workload is low. Buffer quality is adversely affected.
Export – Only checks tables. Performance loss as a result of export processes. Certain kinds of corruptions are exported without an error.
If you later try to reorganize this type of table (export/import), you will have problems during import. Buffer quality is adversely affected.
Dbverify - can be run during normal operation; it checks tables, indexes and also empty DB blocks, but does not check cross references. It reads the blocks without loading them into the SGA, so buffer quality is not affected. Dbverify can also be run on data files restored from a backup (without these files having to belong to a database) and checks the data in LOB columns.
To be continued...
Cluster table diagnosis using R3check utility
Use this utility to check whether a cluster table contains incorrect cluster records.
If a termination occurs while processing a cluster table in production operation, the corresponding system log entries and short dumps identify the cluster table in which the problem occurred.
Execute the program R3check for the diagnosis; it generates an output line with the key of the affected record for every defective cluster record. Create a control file for this at operating system level with the following contents:
export file='/dev/null' dumping=yes client=CLIENTID
select * from log.clustertable
and start the check as the <sid>adm user:
R3check controlfile > outputfile
Each line in the output contains the key of one defective cluster record.
To check a physical cluster table, call R3check for a logical cluster table which belongs to the corresponding physical cluster table.
If, after consulting the SAP Hotline on how to proceed with the error, you need to export the data belonging to a certain cluster key, generate an additional control file with the following contents:
export file='datafile' client=nnn
select * from <physical cluster table> where KEY1 = key1 and KEY2 = key2
and ...
and start the export as
R3trans controlfile
Substitute datafile with the name of the data file into which the exported data is to be written.
Monday, August 16, 2010
JMS Message Monitoring in CE 7.1 system
In this article we review the available Telnet commands related to the SAP JMS Provider and how you can use these commands in combination to investigate a particular problem or simply to monitor the runtime status of the JMS Provider.
Telnet Commands Related to the SAP JMS Provider
The SAP JMS Provider service provides Telnet commands for administering and monitoring the JMS
resources in a server. To use these commands, you have to connect to the AS Java using Telnet and enable
the JMS commands group with add jms. To show the help information about the available commands
under the jms group, you can type one of the following at the Telnet command prompt:
jms -h
jms -?
man jms
There are several subgroups of JMS Telnet commands:
jms_list – this command provides details about the JMS runtime environment
list_temp_destinations – this one displays all currently active temporary destinations.
Example Scenario
All JMS-related Telnet commands give you the possibility to monitor the runtime state of the SAP JMS
Provider. To illustrate the usage of the described Telnet commands, let us consider one simple scenario.
Imagine we just discovered that some persistent messages are not delivered to our application and we want
to investigate what might have happened with them. For the purposes of this example, we will use the
sapDemoQueue destination and the default JMS Virtual Provider.
The following procedure describes one possible path of investigation and the respective sequence of
commands.
Telnet to the Java server:
telnet localhost 50008
Administrator / password
command >> add jms
jms_list msg sapDemoQueue default
This command lists all messages sent to the sapDemoQueue destination that are present in the database. If there are no messages in this list, we know that there are currently no messages pending for delivery: either no messages have been produced, or all that have been produced have already been consumed and acknowledged. We can try to determine which producer was supposed to send them.
jms_list producers default
This command lists all producers registered to destinations belonging to the default JMS virtual provider. Note that this is the same virtual provider to which our destination belongs. From this list we can determine the producer ID, the destination to which the producer sends messages, its session ID and client ID. By the client ID, we can later on find the consumer that is supposed to receive the messages. In this case, we look for producers registered to the sapDemoQueue destination. This is a way to determine whether there is a currently active producer registered to our destination.
If there are messages pending to be delivered, then we have to continue our investigation with consumers that are expected to receive them. We can check the status of the JMS connection - how many bytes have been sent and received through it and when it was last accessed.
jms_list connections default
We use the client ID to check if there are any active connections and when a particular connection was last accessed. The JMS virtual provider again has to be the same.
Note: If you want to find the corresponding connection to your consumer, you need the connection with client ID that is equal to the one of the already found consumer.
We can also check the status of the consumers registered to the sapDemoQueue destination.
jms_list consumers default
This command lists all currently active consumers registered to destinations belonging to the default JMS virtual provider. From this list we can determine the consumer ID, the destination to which the consumer is registered, the session ID and the client ID. If there is no consumer registered to sapDemoQueue, then we know that our application does not receive messages because it failed for some reason to create the respective consumer, and we can continue the investigation in this direction, for example by checking the server traces for relevant exceptions.
If there is an active consumer but it still does not receive any of the pending messages, it is possible that there is an issue in the application message processing logic which causes the messages to be redelivered again and again. By default, message delivery attempts are limited and once they are exhausted for a particular message, it is considered undeliverable (dead), skipped by the consumer, and moved to the configured error destination of the original destination. To determine the error destination of the sapDemoQueue destination, we have to use the configuration editor: in the Display configuration tab, expand Configurations -> jms_provider -> default -> queues -> sapDemoQueue -> Propertysheet data. In the Property Sheet you can find the error destination of a particular destination. In our case, the error destination of sapDemoQueue is sapDefaultErrorQueue.
Then, we can check if there are any messages in the error destination.
jms_list msg sapDefaultErrorQueue default
With this command we can check if the missing messages are present in the error destination.
If our application is unable to consume some of the messages, we have to check why, and then we may want to do something with the undelivered messages. Since error destinations are just ordinary JMS destinations, you can access dead messages using the standard JMS API - for example, your application (or a dedicated tool) can consume and process the messages from the error destination; it can even return them back to the original destination, if that is the error handling logic of the application.
Note that we can configure the following properties on the jms-resources.xml related to the dead messages functionality.
a. deliveryAttemptsLimited - a Boolean property that indicates whether the message delivery
attempts are limited. The default value is "true".
b. maxDeliveryAttempts - an Integer property that indicates the maximum number of delivery
attempts before the message is considered undeliverable (dead). The default value is 5.
c. deliveryDelayInterval - the delay in milliseconds between two consecutive message delivery
attempts. The default value of this property is 2000 milliseconds.
d. errorDestination - the name of a JMS Queue where dead messages will be forwarded. If you
leave this property blank (""), this means that you want dead messages to be discarded.
These four properties are configurable per JMS destination.
Note: The default error destination has an empty string for the errorDestination property, otherwise, when a message becomes dead in its original destination and then it also becomes dead in the error destination, this may lead to several transfers of this message through error destinations and potentially this may even lead to a message delivery endless loop.
Note: The value of the errorDestination property must be the name of an already existing Queue.