Category Archives: High-Availability

The Day Before SQLskills Immersion Event 3 – High Availability and Disaster Recovery

Tomorrow starts my third SQLskills Immersion Event.  I took the first two in Chicago (May & Oct 2011), and they were both phenomenal!  They are led by my friends, Paul Randal (Blog | Twitter), Kimberly Tripp (Blog | Twitter), Jonathan Kehayias (Blog | Twitter), and Joe Sack (Blog | Twitter).  So, we have all gathered here in Tampa, FL to improve our knowledge.

Each class has been chocked full of great knowledge with tons of very deep dives and little-known facts.  I am expecting IE3 to be no different.  HA and DR are a huge part of my job and I am always trying to improve on these aspects of my databases.  I am hoping to walk away from this class with new tips, tricks, and ideas to not only improve my current environment, but grow my understanding of SQL and improve my “craft” in general.  One thing is certain:  if you are at a SQLskills Immersion Event, you are serious about deepening your knowledge of SQL.

One extra-curricular note about SQLskills Immersion Events: when they say “Immersion” they mean it!  You have the option to spend a lot of time with your fellow students as well as the teachers.  I love connecting with others in the community, visiting with those I’ve met before, and meeting new people as well.   It is very refreshing to spend time with others who not only love SQL, but also love talking about a wide range of other topics; sometimes we even get into a little philosophy.  :-D

I plan to post an update later in the week or at the end of the event.  Until then, I’ll be fully “immersed” in SQL knowledge!

Simple Walk-Through for Setting Up Oracle Data Guard

In most Enterprise-Class database environments, there is the very real need for maintaining a near-line, transactionally-consistent “mirror” of your production databases.  This sort of technology goes beyond the benefits of a cluster, in that it can be part of a Disaster Recovery or Business Continuity plan, in which the entire primary data center is unavailable.  Each RDBMS has a different set of tools to help the DBA provide this sort of solution.  In Oracle, the tool I use is Data Guard.

Oracle Data Guard has been a feature in Oracle since 8i, but it has evolved over the years.  I will be using the syntax and features available in Oracle 11g.  As with almost every feature in Oracle, there are multiple ways to configure Data Guard and there are countless combinations of options that can be utilized at every step along the way.  I do not intend to discuss exhaustively the whole of Data Guard.  I will leave the idiosyncrasies for the online documentation or other blogs to do that.  I simply want to present a simple way to get an 11g, +ASM-based, stand-alone, instance covered by a Data Guard Physical Standby on another server.

My initial environment consists of two Windows servers, one at my primary site, which I will name “Oracle”, and one at my secondary site, which I will name “OracleDR”.  On each server, I have installed the grid components and a single instance of the database engine.  I have configured ASM on each server, and two Disk Groups on each ASM instance, in my grid home on each server.  On “Oracle”, the Disk Groups are called “PROD_DATA” and “PROD_RECOVERY”.  On “OracleDR”, the Disk Groups are called “DR_DATA” and “DR_RECOVERY”.  I also have configured each of my servers with a single Oracle Listener, in the grid home.  My production Database Instance is named “TEXAS”, has a db_unique_name of “TEXASP”, is open, is in archivelog mode, and is serving users.  “OracleDR” does not have any database instances at this point.

Please read through the entire walk-through before beginning your own Data Guard setup.

Preparing for the New Instance

First, take a fresh backup of the old database instance, using an RMAN script, similiar to this (starting from a command prompt) (11g users, see special note below, regarding skipping the backup):

CD O:\Oracle\11.2.0.2\TEXAS\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXAS
SET ORACLE_SID=TEXAS
RMAN TARGET /
RUN {
ALLOCATE CHANNEL D1 TYPE DISK;
ALLOCATE CHANNEL D2 TYPE DISK;
ALLOCATE CHANNEL D3 TYPE DISK;
BACKUP AS COMPRESSED BACKUPSET FORMAT 'W:\TEXAS\df_t%t_s%s_p%p' DATABASE;
RELEASE CHANNEL D1;
RELEASE CHANNEL D2;
RELEASE CHANNEL D3;
}
EXIT

Next, create a standby controlfile, which will contain our fresh backup (assuming I am using my controlfile for cataloging RMAN):

SQLPLUS / AS SYSDBA
ALTER DATABASE CREATE STANDBY CONTROLFILE AS 'W:\TEXAS\StandbyCTL.CTL';
EXIT

We are also going to need a copy of the production pfile, to create our new instance at the DR site:

SQLPLUS / AS SYSDBA
CREATE PFILE='W:\TEXAS\initTEXASDR.ora' FROM SPFILE;
EXIT

Now, I logon to “OracleDR”, make sure the same file structure (W:\TEXAS\) exists, and copy these files over to the DR server:

robocopy \\Oracle\W$\TEXAS W:\TEXAS
Creating the New Instance and Duplicating Production to DR

Now, we need to move that pfile to the TEXASDR home’s “database” directory and modify it to serve our DR instance.  Change the following parameters:

  • control_files=’W:\TEXAS\StandbyCTL.CTL’ (temporarily, while we setup the standby)
  • db_create_file_dest=’+DR_DATA’
  • db_create_onlinge_log_dest_1=’+DR_DATA’
  • db_recovery_file_dest=’+DR_RECOVERY’
  • log_archive_dest_1=’LOCATION=+DR_RECOVERY’
  • db_unique_name=’TEXASDR’
  • change diagnostic_dest (if different from the production server)

Next, it is time to create our DR database instance.  use ORADIM on “OracleDR” to accomplish this:

CD O:\Oracle\11.2.0.2\TEXASDR\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXASDR
SET ORACLE_SID=TEXASDR
ORADIM -NEW -SID TEXASDR -INTPWD ***** -STARTMODE AUTO

We have to make sure we have connectivity both directions between the two instances. So, setup tnsnames.ora and listener.ora on both sides so they can see each other. You can verify this using TNSPING.

We should also copy the Password File from production to DR. The passwords must match on both instances.

We then startup the new instance in nomount mode and initiate a duplicate for standby:

SQLPLUS / as sysdba
STARTUP NOMOUNT;
EXIT
RMAN TARGET /@TEXASP AUXILIARY /@TEXASDR
DUPLICATE TARGET DATABASE FOR STANDBY;

If you are using 11g, take special note:
There is a new feature in Oracle 11g which allows you to skip the backup and control file portions of this and skip straight to creating the instance and doing a duplicate from the live production instance.  The RMAN command used here would be:

DUPLICATE TARGET DATABASE FOR STANDBY FROM ACTIVE DATABASE;
Cleanup and Standby Configuration

Once the duplicate has completed, we now have a physical standby instance at the DR site.  We just need to do some final configuration on this new instance before we can turn on Data Guard.

We need to move that Control File to a more permanent location, and make a few copies of it.  So, let’s get back into SQLPLUS and bounce the new Oracle instance and bring it back up in nomount mode:

CD O:\Oracle\11.2.0.2\TEXASDR\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXASDR
SET ORACLE_SID=TEXASDR
SQLPLUS / as sysdba
SHUTDOWN IMMEDIATE:
STARTUP NOMOUNT;
EXIT;

RMAN TARGET /
RESTORE CONTROLFILE TO '+DR_DATA' FROM 'W:\TEXAS\StandbyCTL.CTL';
/   (the / repeats the last command in the buffer, I'm doing it twice, for a total of three copies of the Control File)
/
EXIT

Now, head over to the grid home and use ASMCMD to check what names the RMAN RESTORE gave the Control Files:

CD O:\Oracle\11.2.0.2\grid\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\grid
SET ORACLE_SID=+ASM
ASMCMD
CD +DR_DATA/TEXASDR/CONTROLFILE
LS
EXIT

We take the three Control Files listed from the ASMCMD “LS” command and change the pfile control_files parameter to list all three of these locations (with the full ASM path).

Now we need to apply these changes, so get back into SQLPLUS, shutdown the instance, create an spfile, and startup the instance, this time in mount mode:

CD O:\Oracle\11.2.0.2\TEXASDR\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXASDR
SET ORACLE_SID=TEXASDR
SQLPLUS / as sysdba
SHUTDOWN IMMEDIATE:
CREATE SPFILE FROM PFILE;
STARTUP MOUNT;
EXIT

At this point, we technically have an instance ready for Data Guard, but there is one last issue we need to resolve so that our instance is ready for a switchover, should we ever need it in the future (which is the primary point of having the DR site in the first place). The outstanding issue is our standby logfiles. Each instance needs to have a set of standby logfiles that can be used to apply the incoming logs from the primary instance. Please note: when you switchover to your DR site, at some point, you will most likely want to apply the logs from the DR site back to the Primary site, before initiating a switch back to the Primary (to keep the data in sync), so, you will need to create these standby logfiles on the Primary instance as well. You must always have at least the same number of standby log files as the other side has online logs, plus one, and the standby log files must be at least as large as the online log files. For instance, if your Primary site has three online log files, you will need four standby log files on the DR instance (as well as four standby log files on the Primary instance, to cover the DR online log files).

So, let’s execute this sort of code on both sides, to create the necessary standby logs:

SQLPLUS / as sysdba
ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT='MANUAL'; (you will most likely have to execute this command to create the new files, I recommend changing it to 'AUTO' when you are done)
SELECT * FROM V$LOG; (to verify how many online logs we have - let's assume 3, each 52428800 bytes)
ALTER DATABASE ADD STANDBY LOGFILE SIZE 52428800;
/
/
/   (again, we are creating four standby log files to cover the online log files, plus one)
SELECT * FROM V$STANDBY_LOG; (to verify they were created as you expected)
EXIT

Now, we are ready to turn on managed standby (this is activating Data Guard):

SQLPLUS / as sysdba
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
EXIT
Sync-Up the DR Instance with Production

Depending on several factors (how long the DUPLICATE took, how often you take archive log backups, so forth), there is a good chance that your Production system has moved on and your Standby is several sequences behind.  To proceed with the last step, it is not necessary that all of your archive logs be applied to your standby, but the standby must be at least to the point of your last archive log backup.  This is because the Primary can send over any sequences that are missing at the Standby, but only up to the point that it has readily on hand.

So, we are going to have to move your most recent archive log backups to the DR site, catalog them into the Standby instance, and restore them to the Standby so that Data Guard can apply them.  We’ll go through this step-by-step, but first, let’s see what sequence your Standby instance is waiting on:

SQLPLUS / as sysdba
SELECT process, status, thread#, sequence# FROM V$MANAGED_STANDBY;
EXIT

This is a good glance at what your Data Guard processes are doing. If one of the ARCH processes has a STATUS of “WAITING”, look at the SEQUENCE#. Let’s say the SEQUENCE# is 12420. That means that the DUPLICATE we did earlier, only took us up through SEQUENCE# 12419, and the Primary instance has moved on since then. So, we need to restore from SEQUENCE# 12420 all the way up to whatever your last archive log backup holds. So, we make a note of SEQUENCE# 12420 and move on.
Using whatever method you like best, robocopy for instance, copy all of your archive log backups that have been taken since the DUPLICATE began to a location on your DR Site server, “OracleDR” (W:\Backups\TEXASDR, for instance). Next, we are going to catalog those into our Standby instance, using RMAN. We will then list all of the archive log backups that have been cataloged, to find which sequences they contain:

RMAN TARGET /
CATALOG START WITH 'W:\Backups\TEXASDR\' NOPROMPT;
LIST BACKUP OF ARCHIVELOG ALL;
EXIT

Once this completes, you will have seen all of your archive log backups scrolled on the screen, near the end you should find the SEQUENCE# that is the greatest. The output will look something like this:

BS Key  Size       Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
12095   45.50K     DISK        00:00:00     31-JAN-12
        BP Key: 12125   Status: AVAILABLE  Compressed: YES  Tag: TAG20120131T000
414
        Piece Name: W:\BACKUPS\TEXASDR\AL_T12345466_S85246_P1

  List of Archived Logs in backup set 12095
  Thrd Seq     Low SCN    Low Time  Next SCN   Next Time
  ---- ------- ---------- --------- ---------- ---------
  1    12516   1346096477 31-JAN-12 1121631224 31-JAN-12

Notice the Seq field, this is the SEQUENCE# we are looking for, 12516. So, referencing the SEQUENCE# we noted above (the one that ARCH is currently waiting on), we see that we need to restore all sequences from 12420 to 12516. So, we perform this RMAN command:

RMAN TARGET /
RESTORE ARCHIVELOG SEQUENCE BETWEEN 12420 AND 12516;
EXIT

Restoring and applying all of those logs may take a considerable amount of time, but we can check on the status by running this command:

SQLPLUS / as sysdba
SELECT THREAD#, SEQUENCE#, FIRST_TIME, APPLIED, COMPLETION_TIME FROM V$ARCHIVED_LOG WHERE APPLIED = 'NO' ORDER BY SEQUENCE#;
EXIT

This will show you how many sequences still need to be applied to the Standby instance. When the query returns no rows, we have applied all available (restored) sequences. At this point, depending on how long the application of all of those logs took, it may be necessary to copy any new backups and run through this whole process again. One way to avoid this would be to temporarily suspend archive log backups on your Primary, but that may or may not be acceptable in your environment.

Setup the Data Guard Broker

The concept of what is going on up to this point should be fairly clear.  We are taking redo logs from our Primary and replaying them on our Standby.  In this way, we are manually maintaining the Standby and keeping our two sites in sync (only up to the last archive log backup).  This is certainly not how you want to run your Standby, you want an automated way to do this all the time, without human intervention, and, if possible, keeping up to the last completed transaction, not just the last backup.

This is where Data Guard Broker comes in.  The broker is a simple process in Oracle that manages this entire process for you.  It even makes switching over to the DR site and switching back easier.  So, first, we need to turn the process on at both sites:

SQLPLUS / as sysdba
ALTER SYSTEM SET DG_BROKER_START=TRUE SCOPE=BOTH;
EXIT

Verify that the DMON process is running:

SQLPLUS / as sysdba
SELECT PROGRAM FROM V$SESSION WHERE USERNAME IS NULL;
EXIT

Now, the configuration (as well as all commands, when possible) should be ran from the Primary instance. So, connect to the Primary instance and issue the following to get into the Data Guard Broker Manager and verify that no configuration already exists:

CD O:\Oracle\11.2.0.2\TEXAS\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXAS
SET ORACLE_SID=TEXAS
DGMGRL
CONNECT /
SHOW CONFIGURATION

The Broker makes you name your configuration, even though, at this time, you can only have one configuration. Let’s create a configuration, referencing the database by its unique name and the connect identifier by whatever you setup in TNSNames (in my case TEXAS):

CREATE CONFIGURATION 'DATAGRD' AS PRIMARY DATABASE IS 'TEXASP' CONNECT IDENTIFIER IS TEXAS;

We now add our Standby to the configuration:

ADD DATABASE 'TEXASDR' AS CONNECT IDENTIFIER IS TEXASDR MAINTAINED AS PHYSICAL;

We can check the broker configuration of our databases by issuing:

SHOW DATABASE VERBOSE 'TEXASP';
SHOW DATABASE VERBOSE 'TEXASDR';

Now, we can enable our configuration to allow the Broker to start managing our Data Guard:

ENABLE CONFIGURATION;

Lastly, we need to configure how/when our logs are transferred and applied to the Standby. We have two settings we need to configure, LogXPTMODE and Protection mode.
LogXPTMODE has three options, ARCH, ASYNC, or SYNC.

  • ARCH is similiar to SQL Log Shipping, it only sends a completed archive log, so, depending on how fast your instance runs through online logs, will be a large determining factor in how often your DR site has that data.  Your DR site will almost always be behind by some margin.
  • ASYNC is only used for Logical Standby instances
  • SYNC is going to send each transaction to your Standby as soon as possible, after it has written it to the Primary online log.  Your DR site will almost always be caught up to the last committed transaction.

In my opinion, if you have sufficient bandwidth, you should always choose SYNC.

Protection mode also has three options, MAXPROTECTION, MAXAVAILABILITY, or MAXPERFORMANCE.

  • MAXPROTECTION will not close a transaction (commit) on the Primary side until the transaction has ALSO been written to the DR site.  This will ensure that no transactions that are committed are lost on the Standby, even when an emergency failover occurs.  This would most often be used in a financial application, where not a single transaction can be lost.  The downside to this Protection mode is that any network blip or short-term disconnect will halt transactions from committing on the Primary instance, until the network connectivity is restored.
  • MAXAVAILABILITY will return to the user as being committed as soon as the Primary instance has written the transaction to the online log and will then make a best-effort to get that transaction  to the Standby as quickly as possible.  This allows the Primary instance to be more resilient to short-term outages of connectivity to the Standby.  The downside, however, is that if an emergency failover occurs before that transaction is sent to the Standby, a user could think that a transaction is hardened to the database, but see that it is actually missing at the DR site.
  • MAXPERFORMANCE waits for the archive writer process to send the last archive log over to the Standby, this is very much like SQL Log Shipping.  The upside is that no matter the state of the Standby instance, the Primary database continues along happily.  The downside is that if a failover occurs, only the transactions contained in the last archive log that has been sent to the Standby will be preserved.

In my opinion, either MAXPROTECTION or MAXAVAILABILITY make sense for this setting.  Due to the nature of my databases, I choose MAXAVAILABILITY.

So, we now configure these settings:

EDIT DATABASE 'TEXASP' SET PROPERTY LogXPTMODE='SYNC';
EDIT DATABASE 'TEXASDR' SET PROPERTY LogXPTMODE='SYNC';
EDIT CONFIGURATION SET PROTECTION MODE AS MAXAVAILABILITY;

And finally, we check our configuration to make sure there are no errors or warnings:

SHOW CONFIGURATION

The configuration should show “SUCCESS”, which means that logs are applying and the Data Guard is properly configured for switchover. If you encounter any errors, they will most likely be due to the Standby not having the most recent archive logs applied yet. Try syncing the Data Guard manually one last time. After you have it synced up, the broker should take over automatically.

Some Code You Should Know

In addition to the syntax stated above to let you check on how the Data Guard is working, there are two more important commands that you will use in DGMGRL.  They are “SWITCHOVER” and “FAILOVER”.  It would be worth while to read up on the entire DGMGRL process, but these two are critical, in that they handle switching your Production instance from running over at your Primary site to your DR site.

SWITCHOVER – this is a graceful switch from your Primary instance to your DR.  It makes sure that you are all synced up from your Primary to your DR, kills open connections on your Primary instance, and switches the FAL roles so that your DR site is now the live Production and your Primary site is now the Standby.  It will also switch you back, when the time comes.

FAILOVER – this is a much more drastic command.  You would only use this when you are in a dire situation, such as, your Primary Site is gone, unavailable, or completely down in some way.  This is going to ignore trying to reconcile any last transactions and open your DR site instance with a RESETLOGS.  You will not be able to simply switch back over to primary.  Your best course of action after your Primary site is available again, will most likely be to blow away your old Primary instance and setup a new Data Guard from your DR site back to your Primary site and THEN perform a switchover to get the instance back to the primary.  But, it will be a new incarnation of your instance.

I hope this has helped to get you running with Data Guard.  I do not guarantee that my scripts are perfect, but I have tried to triple-check them as much as possible.