Category Archives: High-Availability

Resolving Disconnected SQL Server Mirrors

On more than one occasion I’ve had a SQL Server mirror disconnect and reconnect repeatedly over a short amount of time, due to an unstable connection between the Principal and Mirror databases.  Usually, this resolves quite quickly and the Mirror will begin re-synchronizing on its own.

Occasionally, however, the mirror would get into a state where it believed it was disconnected, even though the connection between the two instances was up and stable.  When this occurs, you can usually resolve the issue by simply restarting the mirroring endpoints.

First, find the name of the endpoint, if you do not already know it, by issuing this command from the Principal side:

SELECT [name] FROM [sys].[database_mirroring_endpoints]

Then, using the name returned from that simple query, issue the following two commands back-to-back (replacing the endpoint name with your own):

ALTER ENDPOINT YourEndPointName STATE=STOPPED
ALTER ENDPOINT YourEndPointName STATE=STARTED
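
To verify that the mirror has recovered after restarting the endpoint, you can check the session state from either partner with a query similar to this (the state should move back to SYNCHRONIZING and then SYNCHRONIZED):

SELECT DB_NAME([database_id]) AS [database_name], [mirroring_role_desc], [mirroring_state_desc]
FROM [sys].[database_mirroring]
WHERE [mirroring_guid] IS NOT NULL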

In each of my cases, this has immediately resolved this issue and caused SQL Server to resume synchronizing the mirror.  I hope this helps.

Sneaker-Net Saves a 20 TB SQL Mirror

It started a little more than an hour after the ball dropped on New Year’s Day 2016.  We had had a few friends over to celebrate and we were just saying goodbye to the last of our company when I started getting hit with a wave of texts right at 1:35AM.  I glanced down at my watch:

SEV 20 – The client was unable to reuse a session…

I had seen the error before.  It always indicated some momentary saturation of the network.  I would usually get a cluster of them, then the moment would pass and everything would be fine again.  The condition would only last long enough for the client to have to retry the connection, due to connection pooling failing to maintain the session.  The users wouldn’t even see a blip.  It was one of those errors that I knew I should get around to solving, and it was on my backlog, but let’s be honest: with everything else I have to do every day, and the fact that there was zero user impact, this one had slipped pretty far down the list.

Normally, I would get a cluster of these errors, and everything would subside.  This time, however, they just kept coming: hundreds of them, my phone chirping and buzzing with each new arrival.  Then my custom SQL Mirroring job spat out a hundred or so e-mails telling me that all of my critical database mirrors had just disconnected!  Uh-oh!  This wasn’t just some blip…

I quickly explained to everyone that something was going on and rushed upstairs to log in and see what was causing this.  Everything seemed fine on the Primary site’s servers: no SAN failure: good; no memory pressure: good; no crazy CPU: good.  I checked a few databases’ mirroring statuses: DISCONNECTED, DISCONNECTED, DISCONNECTED.  I RDPed to the DR servers and was able to connect in, but as I started to inspect more, *blip*, my RDP session dropped.  I reconnected, started looking around again and, *blip*, disconnected again.

So, it starts to dawn on me that something is going on between the two sites.  I disconnect my VPN session from the Primary site and reconnect to our DR VPN.  Now my RDP session to the DR servers is nice and stable.  I try to connect to a Primary server, just to test my hypothesis: after a few seconds, *blip*, disconnected.  This is when I jump on the phone and call our infrastructure team.  I tell one of the guys that I think something is going on with our microwave connection between the sites and he starts digging in to see what might be the cause.

Our Beloved Microwave

For a little bit of background, we’ve had our DR site for about 15 years.  Around ten years ago, we started to outgrow the 3 Mbps connection we had between our primary and DR sites.  We decided to go for a microwave connection because we could operate it on unlicensed frequencies with no annual licensing cost and get about 100 Mbps full-duplex for just the initial equipment purchase, installation, and an annual service warranty.  While our use of this connection has grown dramatically over the past ten years, we have rarely bumped up against that upper limit for more than short bursts.  This has been a great solution for us.

Tonight, however, it seemed it was starting to show its age.  We had fought to get a replacement pair put into the budget each year for the past several years.  We knew it was getting to be past the “best by” date and our projections were showing that before long the 100 Mbps wasn’t going to cut it any longer.  Upper management, however, had been reluctant to replace it proactively.  No new story there.

About an hour later, the microwave started to calm down and become more stable once again.  We were still seeing enormous latency, but we were at least maintaining connectivity.  My mirrors were getting back in sync and it looked like we might be ok through the weekend, at least.  We had already put in a call to our warranty vendor, but since the connectivity had been partially restored, they didn’t want to send anyone out and our infrastructure team didn’t really want to make it too much of an issue either.  We would handle it on Monday.

Down for the Count

Well, it didn’t last that long.  Just 36 hours later my phone starts blowing up again with alerts.  I contact infrastructure again and they start scrambling to see what’s going on: they confirm, it’s the microwave.  This time it seems it has gone down for good.  I go ahead and disable my alerts so I can stop getting buzzed every few seconds and I start talking to the rest of my team about how we’re going to handle what appears to be an extended outage.

We quickly assess our status: the microwave connection is down, our database servers are still being actively hit by production users and working just fine, but our transaction logs are growing and our data at the DR site is getting more stale by the minute.  Also, it’s the weekend, so we are running some of our most aggressive index maintenance.  We decide to stop the index maintenance for now and see how long it will be before infrastructure is going to have the link back up.
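
To keep an eye on that log growth while the partner is unreachable (unsent log cannot be truncated while the mirror is disconnected), a quick check across all databases is something like:

DBCC SQLPERF(LOGSPACE)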

Then comes the bad news: our warranty vendor is not able to fix the problem and we need new hardware, BUT, they are having trouble sourcing a replacement.  Now we see that this is going to be more than a 4-hour or even 24-hour turn-around.  We realize that we are going to have to do something, or our DR site is going to become so stale that it will be essentially useless.

Maintain…

So, first, I mention to infrastructure that we have redundant internet connections at both sites.  There’s a whole 10 Mbps Internet connection at the DR site and a 30 Mbps Internet connection at the Primary that aren’t usually being used; they’re just there as backup if the primary connections fail.  I suggest that we take a couple of routers and set up a VPN between the two sites over the Internet so that we will at least have SOME connection between them.  They set it up and route just the database VLAN over the new VPN tunnel.  Now we have connectivity for our databases and they start to slowly attempt to get back in sync.

This is fine for our smaller, less-active databases.  But we have several very large databases, and one in particular is first on my mind.  It is a critical 20 TB database and it’s generating transaction log at a rate of about 12,000 KB / sec, which works out to roughly 96 Mbps of log, nearly ten times what the 10 Mbps link can carry, so there is no way that connection is going to be able to catch up.  The DR site is already 8 hours stale by this point, and the roughly 240 GB of unsent transaction log would take over two days to send, and that’s if we stopped transacting right now, which, of course, is not going to happen.  We do ask a few of our heaviest users if we can delay some processes, but this only helps a bit, they cannot hold off their processes for very long, and we’re not really sure how long it will be before our 100 Mbps link is back up.
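
For reference, the size of the unsent log (the send queue) for each mirrored database is exposed through the database mirroring performance counters, so a query similar to this shows how far behind each mirror is:

SELECT [instance_name] AS [database_name], [cntr_value] AS [log_send_queue_kb]
FROM [sys].[dm_os_performance_counters]
WHERE [object_name] LIKE '%Database Mirroring%'
AND [counter_name] = 'Log Send Queue KB'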

Sneaker-Net Rides Again

With Monday coming up quick and business getting ready to swing into “full-on” mode, we have to have a solution.  I grab a 2TB USB hard drive and plug it into one of the nodes of our cluster.  I know that if I can get all of the transaction logs from the point of our last sent transaction up until our most recent transaction log backup, I can apply them at the DR site and keep this party going.  Sneaker-Net has taken many forms over my career:  it used to be floppy disks, when there either WAS NO network, or the 10Base2 connections had become unplugged for some reason; it then evolved into Zip Disks, packing a 100 MB punch of data at a time; and eventually burnable CDs helped get me through tough times; now, it was 2 TB I could fit in my pocket.  I almost got nostalgic as I grabbed the DR site keys and jumped in my car.  The robocopy command hadn’t taken very long and I was on my way with a fist full of data and a plan.

Now, I must admit, breaking the mirror on my 20 TB database gave me pause.  Sure, it SHOULD all work as planned, but what if I messed something up?  I didn’t have time to consider it for long: I broke the mirror and started applying transaction log backups en masse.  When the script finally finished, I re-established the mirror and after just a few moments, we were “Synchronizing” again.  The mirror was still stale by about an hour and a half, but that was, of course, due to the time it took to copy the transaction log backups from the Primary site to the external hard drive, drive them over to the DR site, copy the backups to the DR site server, and apply the transaction log backups there.  It wasn’t going to be fully in sync; that just wasn’t possible.  But it is way easier to tell the president of the company that we are a couple hours stale, rather than the alternative.

A Very Long Week

Our Infrastructure team ran into roadblock after roadblock getting the microwave replaced.  I won’t go into the details at this time, but suffice it to say that you REALLY need to make sure your warranty vendor can provide the level of service they have signed up to provide; especially on specialty equipment.  Arbitration and compensation after the fact is NOT going to save your data during the fire.  My team and I took turns running data between the Primary and DR sites a couple times a day for the next several days while we waited for a resolution.  We all breathed a sigh of relief when at 10:25 PM on Tuesday, January 12th, I sent out the e-mail to our Business Continuity group that “All Production databases are fully synchronized with the DR site.”  Short and simple, but that one sentence represented one crazy outage for us.

Steps Taken, For Reference

I’m going to enumerate the steps here, just in case anyone needs the specifics (a rough script sketch follows the list):

  1. Temporarily pause your transaction log backups at the Primary site so that new transaction log backups are not taken while you are couriering your data from one site to the other.
  2. Copy all transaction logs from your Primary site to your media of choice, I prefer using robocopy for speed and simplicity. In command prompt issue something similar to: robocopy C:\TransactionLogBackups\ D:\SneakerNet\
  3. Physically transport the backups to the DR site and copy them to the DR server.
  4. Break the mirror from the DR site, by issuing ALTER DATABASE dbname SET PARTNER OFF; (in some cases, I had to issue this command twice, otherwise the first transaction log backup restoration would complain that the database was busy)
  5. Start applying the transaction log backups WITH NORECOVERY, by issuing RESTORE LOG dbname FROM DISK = 'D:\LocationOnDisk\backup051.bak' WITH NORECOVERY;
  6. After all transaction logs have been restored WITH NORECOVERY, re-connect the mirror from the Primary site.
  7. Don’t forget to Re-Enable your transaction log backups at the Primary site.
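
As a rough sketch of steps 4 through 6, using a hypothetical database named YourDB, a hypothetical backup folder, and a mirroring endpoint on port 5022 (substitute your own names, paths, and the full list of backup files, applied oldest to newest):

-- On the DR (mirror) server: remove the mirroring session
ALTER DATABASE YourDB SET PARTNER OFF;
-- Still on the DR server: apply each couriered log backup in sequence
RESTORE LOG YourDB FROM DISK = 'D:\SneakerNet\backup051.bak' WITH NORECOVERY;
RESTORE LOG YourDB FROM DISK = 'D:\SneakerNet\backup052.bak' WITH NORECOVERY;
-- Re-establish mirroring: on the DR (mirror) server, point it at the Principal first...
ALTER DATABASE YourDB SET PARTNER = 'TCP://PrimaryServer.yourdomain.local:5022';
-- ...then, on the Primary (principal) server, point it at the mirror
ALTER DATABASE YourDB SET PARTNER = 'TCP://DRServer.yourdomain.local:5022';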

I hope you’ve enjoyed reading about this adventure I had.  It is way easier to look back on it now that it is over than when we were right in the thick of it.  If you’ve never had to do any sort of emergency recovery, I would recommend you practice doing so.  Bring up a test instance and then break it and try to fix it.  Going through the stress of figuring out how to work through the situation when there isn’t company data on the line is way better than figuring it out once you’re already in the situation.

The Day Before SQLskills Immersion Event 3 – High Availability and Disaster Recovery

Tomorrow starts my third SQLskills Immersion Event.  I took the first two in Chicago (May & Oct 2011), and they were both phenomenal!  They are led by my friends, Paul Randal (Blog | Twitter), Kimberly Tripp (Blog | Twitter), Jonathan Kehayias (Blog | Twitter), and Joe Sack (Blog | Twitter).  So, we have all gathered here in Tampa, FL to improve our knowledge.

Each class has been chock-full of great knowledge with tons of very deep dives and little-known facts.  I am expecting IE3 to be no different.  HA and DR are a huge part of my job and I am always trying to improve on these aspects of my databases.  I am hoping to walk away from this class with new tips, tricks, and ideas to not only improve my current environment, but also grow my understanding of SQL and improve my “craft” in general.  One thing is certain:  if you are at a SQLskills Immersion Event, you are serious about deepening your knowledge of SQL.

One extra-curricular note about SQLskills Immersion Events: when they say “Immersion” they mean it!  You have the option to spend a lot of time with your fellow students as well as the teachers.  I love connecting with others in the community, visiting with those I’ve met before, and meeting new people as well.   It is very refreshing to spend time with others who not only love SQL, but also love talking about a wide range of other topics; sometimes we even get into a little philosophy.  😀

I plan to post an update later in the week or at the end of the event.  Until then, I’ll be fully “immersed” in SQL knowledge!

Simple Walk-Through for Setting Up Oracle Data Guard

In most Enterprise-Class database environments, there is the very real need for maintaining a near-line, transactionally-consistent “mirror” of your production databases.  This sort of technology goes beyond the benefits of a cluster, in that it can be part of a Disaster Recovery or Business Continuity plan, in which the entire primary data center is unavailable.  Each RDBMS has a different set of tools to help the DBA provide this sort of solution.  In Oracle, the tool I use is Data Guard.

Oracle Data Guard has been a feature in Oracle since 8i, but it has evolved over the years.  I will be using the syntax and features available in Oracle 11g.  As with almost every feature in Oracle, there are multiple ways to configure Data Guard and there are countless combinations of options that can be utilized at every step along the way.  I do not intend to discuss the whole of Data Guard exhaustively; I will leave the idiosyncrasies to the online documentation and other blogs.  I simply want to present a simple way to get an 11g, +ASM-based, stand-alone instance covered by a Data Guard Physical Standby on another server.

My initial environment consists of two Windows servers, one at my primary site, which I will name “Oracle”, and one at my secondary site, which I will name “OracleDR”.  On each server, I have installed the grid components and a single instance of the database engine.  I have configured ASM in the grid home on each server, with two Disk Groups on each ASM instance.  On “Oracle”, the Disk Groups are called “PROD_DATA” and “PROD_RECOVERY”.  On “OracleDR”, the Disk Groups are called “DR_DATA” and “DR_RECOVERY”.  I also have configured each of my servers with a single Oracle Listener, in the grid home.  My production Database Instance is named “TEXAS”, has a db_unique_name of “TEXASP”, is open, is in archivelog mode, and is serving users.  “OracleDR” does not have any database instances at this point.

Please read through the entire walk-through before beginning your own Data Guard setup.

Preparing for the New Instance

First, take a fresh backup of the old database instance, using an RMAN script similar to this, starting from a command prompt (11g users, see the special note below about skipping the backup):

CD O:\Oracle\11.2.0.2\TEXAS\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXAS
SET ORACLE_SID=TEXAS
RMAN TARGET /
RUN {
ALLOCATE CHANNEL D1 TYPE DISK;
ALLOCATE CHANNEL D2 TYPE DISK;
ALLOCATE CHANNEL D3 TYPE DISK;
BACKUP AS COMPRESSED BACKUPSET FORMAT 'W:\TEXAS\df_t%t_s%s_p%p' DATABASE;
RELEASE CHANNEL D1;
RELEASE CHANNEL D2;
RELEASE CHANNEL D3;
}
EXIT

Next, create a standby controlfile, which will also contain the catalog records for our fresh backup (since I am using my controlfile as the RMAN repository):

SQLPLUS / AS SYSDBA
ALTER DATABASE CREATE STANDBY CONTROLFILE AS 'W:\TEXAS\StandbyCTL.CTL';
EXIT

We are also going to need a copy of the production pfile, to create our new instance at the DR site:

SQLPLUS / AS SYSDBA
CREATE PFILE='W:\TEXAS\initTEXASDR.ora' FROM SPFILE;
EXIT

Now, I logon to “OracleDR”, make sure the same file structure (W:\TEXAS\) exists, and copy these files over to the DR server:

robocopy \\Oracle\W$\TEXAS W:\TEXAS

Creating the New Instance and Duplicating Production to DR

Now, we need to move that pfile to the TEXASDR home’s “database” directory and modify it to serve our DR instance.  Change the following parameters:

  • control_files='W:\TEXAS\StandbyCTL.CTL' (temporarily, while we set up the standby)
  • db_create_file_dest='+DR_DATA'
  • db_create_online_log_dest_1='+DR_DATA'
  • db_recovery_file_dest='+DR_RECOVERY'
  • log_archive_dest_1='LOCATION=+DR_RECOVERY'
  • db_unique_name='TEXASDR'
  • change diagnostic_dest (if different from the production server)

Next, it is time to create our DR database instance.  Use ORADIM on “OracleDR” to accomplish this:

CD O:\Oracle\11.2.0.2\TEXASDR\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXASDR
SET ORACLE_SID=TEXASDR
ORADIM -NEW -SID TEXASDR -INTPWD ***** -STARTMODE AUTO

We have to make sure we have connectivity in both directions between the two instances, so set up tnsnames.ora and listener.ora on both sides so they can see each other. You can verify this using TNSPING.
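
For illustration only (your host names, ports, and service names will differ), a minimal tnsnames.ora entry on the Primary side pointing at the DR instance might look something like the following, with a matching TEXASP entry on the DR side pointing back at the Primary:

TEXASDR =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = OracleDR)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = TEXASDR))
  )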

We should also copy the Password File from production to DR (on Windows it lives in the Oracle home’s database directory, named PWD<SID>.ora, so rename the copy to match the DR SID). The SYS password must match on both instances.

We then start up the new instance in NOMOUNT mode and initiate a duplicate for standby:

SQLPLUS / as sysdba
STARTUP NOMOUNT;
EXIT
RMAN TARGET /@TEXASP AUXILIARY /@TEXASDR
DUPLICATE TARGET DATABASE FOR STANDBY;

If you are using 11g, take special note:
There is a new feature in Oracle 11g which allows you to skip the backup and control file portions above and go straight to creating the instance and duplicating from the live production instance.  The RMAN command used here would be:

DUPLICATE TARGET DATABASE FOR STANDBY FROM ACTIVE DATABASE;

Cleanup and Standby Configuration

Once the duplicate has completed, we now have a physical standby instance at the DR site.  We just need to do some final configuration on this new instance before we can turn on Data Guard.

We need to move that Control File to a more permanent location, and make a few copies of it.  So, let’s get back into SQLPLUS and bounce the new Oracle instance and bring it back up in nomount mode:

CD O:\Oracle\11.2.0.2\TEXASDR\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXASDR
SET ORACLE_SID=TEXASDR
SQLPLUS / as sysdba
SHUTDOWN IMMEDIATE;
STARTUP NOMOUNT;
EXIT;

RMAN TARGET /
RESTORE CONTROLFILE TO '+DR_DATA' FROM 'W:\TEXAS\StandbyCTL.CTL';
/   (the / repeats the last command in the buffer, I'm doing it twice, for a total of three copies of the Control File)
/
EXIT

Now, head over to the grid home and use ASMCMD to check what names the RMAN RESTORE gave the Control Files:

CD O:\Oracle\11.2.0.2\grid\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\grid
SET ORACLE_SID=+ASM
ASMCMD
CD +DR_DATA/TEXASDR/CONTROLFILE
LS
EXIT

We take the three Control Files listed from the ASMCMD “LS” command and change the pfile control_files parameter to list all three of these locations (with the full ASM path).
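
For illustration only (the OMF file names below are invented; use the exact names that ASMCMD lists for you), the control_files entry in the pfile would end up looking something like this:

control_files='+DR_DATA/TEXASDR/CONTROLFILE/Current.261.803295401','+DR_DATA/TEXASDR/CONTROLFILE/Current.262.803295403','+DR_DATA/TEXASDR/CONTROLFILE/Current.263.803295405'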

Now we need to apply these changes, so get back into SQLPLUS, shut down the instance, create an spfile, and start the instance back up, this time in mount mode:

CD O:\Oracle\11.2.0.2\TEXASDR\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXASDR
SET ORACLE_SID=TEXASDR
SQLPLUS / as sysdba
SHUTDOWN IMMEDIATE;
CREATE SPFILE FROM PFILE;
STARTUP MOUNT;
EXIT

At this point, we technically have an instance ready for Data Guard, but there is one last issue we need to resolve so that our instance is ready for a switchover, should we ever need it in the future (which is the primary point of having the DR site in the first place).

The outstanding issue is our standby logfiles.  Each instance needs a set of standby logfiles that can receive the incoming redo from the primary instance.  Please note: when you switch over to your DR site, at some point you will most likely want to apply the logs from the DR site back to the Primary site before initiating a switch back to the Primary (to keep the data in sync), so you will need to create these standby logfiles on the Primary instance as well.  You must always have at least one more standby log file than the other side has online logs, and the standby log files must be at least as large as the online log files.  For instance, if your Primary site has three online log files, you will need four standby log files on the DR instance (as well as four standby log files on the Primary instance, to cover the DR online log files).

So, let’s execute this sort of code on both sides, to create the necessary standby logs:

SQLPLUS / as sysdba
ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT='MANUAL'; (you will most likely have to execute this command to create the new files, I recommend changing it to 'AUTO' when you are done)
SELECT * FROM V$LOG; (to verify how many online logs we have - let's assume 3, each 52428800 bytes)
ALTER DATABASE ADD STANDBY LOGFILE SIZE 52428800;
/
/
/   (again, we are creating four standby log files to cover the online log files, plus one)
SELECT * FROM V$STANDBY_LOG; (to verify they were created as you expected)
EXIT

Now, we are ready to turn on managed standby (this is activating Data Guard):

SQLPLUS / as sysdba
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT FROM SESSION;
EXIT

Sync-Up the DR Instance with Production

Depending on several factors (how long the DUPLICATE took, how often you take archive log backups, and so forth), there is a good chance that your Production system has moved on and your Standby is several sequences behind.  To proceed with the last step, it is not necessary that all of your archive logs be applied to your standby, but the standby must be caught up at least to the point of your last archive log backup.  This is because the Primary can send over any sequences that are missing at the Standby, but only those it still has readily on hand.

So, we are going to have to move your most recent archive log backups to the DR site, catalog them into the Standby instance, and restore them to the Standby so that Data Guard can apply them.  We’ll go through this step-by-step, but first, let’s see what sequence your Standby instance is waiting on:

SQLPLUS / as sysdba
SELECT process, status, thread#, sequence# FROM V$MANAGED_STANDBY;
EXIT

This is a good glance at what your Data Guard processes are doing. If one of the ARCH processes has a STATUS of “WAITING”, look at the SEQUENCE#. Let’s say the SEQUENCE# is 12420. That means that the DUPLICATE we did earlier only took us up through SEQUENCE# 12419, and the Primary instance has moved on since then. So, we need to restore from SEQUENCE# 12420 all the way up to whatever your last archive log backup holds. So, we make a note of SEQUENCE# 12420 and move on.

Using whatever method you like best, robocopy for instance, copy all of your archive log backups that have been taken since the DUPLICATE began to a location on your DR Site server, “OracleDR” (W:\Backups\TEXASDR, for instance). Next, we are going to catalog those into our Standby instance, using RMAN. We will then list all of the archive log backups that have been cataloged, to find which sequences they contain:

RMAN TARGET /
CATALOG START WITH 'W:\Backups\TEXASDR\' NOPROMPT;
LIST BACKUP OF ARCHIVELOG ALL;
EXIT

Once this completes, you will have seen all of your archive log backups scroll by on the screen; near the end you should find the greatest SEQUENCE#. The output will look something like this:

BS Key  Size       Device Type Elapsed Time Completion Time
------- ---------- ----------- ------------ ---------------
12095   45.50K     DISK        00:00:00     31-JAN-12
        BP Key: 12125   Status: AVAILABLE  Compressed: YES  Tag: TAG20120131T000414
        Piece Name: W:\BACKUPS\TEXASDR\AL_T12345466_S85246_P1

  List of Archived Logs in backup set 12095
  Thrd Seq     Low SCN    Low Time  Next SCN   Next Time
  ---- ------- ---------- --------- ---------- ---------
  1    12516   1346096477 31-JAN-12 1121631224 31-JAN-12

Notice the Seq field: this is the SEQUENCE# we are looking for, 12516. So, referencing the SEQUENCE# we noted above (the one that ARCH is currently waiting on), we see that we need to restore all sequences from 12420 to 12516. So, we perform this RMAN command:

RMAN TARGET /
RESTORE ARCHIVELOG SEQUENCE BETWEEN 12420 AND 12516;
EXIT

Restoring and applying all of those logs may take a considerable amount of time, but we can check on the status by running this command:

SQLPLUS / as sysdba
SELECT THREAD#, SEQUENCE#, FIRST_TIME, APPLIED, COMPLETION_TIME FROM V$ARCHIVED_LOG WHERE APPLIED = 'NO' ORDER BY SEQUENCE#;
EXIT

This will show you how many sequences still need to be applied to the Standby instance. When the query returns no rows, we have applied all available (restored) sequences. At this point, depending on how long the application of all of those logs took, it may be necessary to copy any new backups and run through this whole process again. One way to avoid this would be to temporarily suspend archive log backups on your Primary, but that may or may not be acceptable in your environment.

Setup the Data Guard Broker

The concept of what is going on up to this point should be fairly clear.  We are taking redo logs from our Primary and replaying them on our Standby.  In this way, we are manually maintaining the Standby and keeping our two sites in sync (only up to the last archive log backup).  This is certainly not how you want to run your Standby; you want an automated way to do this all the time, without human intervention, and, if possible, keeping up to the last completed transaction, not just the last backup.

This is where the Data Guard Broker comes in.  The broker is a background process in Oracle that manages all of this for you.  It even makes switching over to the DR site and switching back easier.  So, first, we need to turn it on at both sites:

SQLPLUS / as sysdba
ALTER SYSTEM SET DG_BROKER_START=TRUE SCOPE=BOTH;
EXIT

Verify that the DMON process is running:

SQLPLUS / as sysdba
SELECT PROGRAM FROM V$SESSION WHERE USERNAME IS NULL;
EXIT

Now, the configuration (as well as all commands, when possible) should be run from the Primary instance. So, connect to the Primary instance and issue the following to get into the Data Guard Broker command-line interface (DGMGRL) and verify that no configuration already exists:

CD O:\Oracle\11.2.0.2\TEXAS\BIN
SET ORACLE_HOME=O:\Oracle\11.2.0.2\TEXAS
SET ORACLE_SID=TEXAS
DGMGRL
CONNECT /
SHOW CONFIGURATION

The Broker makes you name your configuration, even though, at this time, you can only have one configuration. Let’s create a configuration, referencing the database by its unique name and the connect identifier by whatever you set up in tnsnames.ora (in my case TEXAS):

CREATE CONFIGURATION 'DATAGRD' AS PRIMARY DATABASE IS 'TEXASP' CONNECT IDENTIFIER IS TEXAS;

We now add our Standby to the configuration:

ADD DATABASE 'TEXASDR' AS CONNECT IDENTIFIER IS TEXASDR MAINTAINED AS PHYSICAL;

We can check the broker configuration of our databases by issuing:

SHOW DATABASE VERBOSE 'TEXASP';
SHOW DATABASE VERBOSE 'TEXASDR';

Now, we can enable our configuration to allow the Broker to start managing our Data Guard:

ENABLE CONFIGURATION;

Lastly, we need to configure how and when our redo is transferred to and applied at the Standby.  We have two settings to configure: LogXptMode and the Protection mode.

LogXptMode has three options: ARCH, ASYNC, or SYNC.

  • ARCH is similar to SQL Server Log Shipping: it only ships completed archive logs, so how quickly your instance cycles through its online logs largely determines how current your DR site is.  Your DR site will almost always be behind by some margin.
  • ASYNC ships redo as it is generated, but the Primary does not wait for the Standby to acknowledge it, so the most recent transactions can be lost in a failover.
  • SYNC ships each transaction’s redo to the Standby as it is written to the Primary’s online log and waits for acknowledgement.  Your DR site will almost always be caught up to the last committed transaction.

In my opinion, if you have sufficient bandwidth, you should always choose SYNC.

Protection mode also has three options: MAXPROTECTION, MAXAVAILABILITY, or MAXPERFORMANCE.

  • MAXPROTECTION will not close a transaction (commit) on the Primary side until the transaction has ALSO been written to the DR site.  This will ensure that no transactions that are committed are lost on the Standby, even when an emergency failover occurs.  This would most often be used in a financial application, where not a single transaction can be lost.  The downside to this Protection mode is that any network blip or short-term disconnect will halt transactions from committing on the Primary instance, until the network connectivity is restored.
  • MAXAVAILABILITY normally waits for the Standby to acknowledge the redo before the commit returns, just like MAXPROTECTION, but if connectivity to the Standby is lost it allows the Primary to keep committing and catches the Standby up once the connection returns.  This makes the Primary instance much more resilient to short-term outages of connectivity to the Standby.  The downside, however, is that if an emergency failover occurs while the Standby is behind, a user could think that a transaction is hardened to the database, but find that it is actually missing at the DR site.
  • MAXPERFORMANCE does not make the Primary wait on the Standby at all; with ARCH transport it behaves very much like SQL Log Shipping, only shipping the last completed archive log.  The upside is that no matter the state of the Standby instance, the Primary database continues along happily.  The downside is that if a failover occurs, only the transactions that have already reached the Standby will be preserved.

In my opinion, either MAXPROTECTION or MAXAVAILABILITY make sense for this setting.  Due to the nature of my databases, I choose MAXAVAILABILITY.

So, we now configure these settings:

EDIT DATABASE 'TEXASP' SET PROPERTY LogXPTMODE='SYNC';
EDIT DATABASE 'TEXASDR' SET PROPERTY LogXPTMODE='SYNC';
EDIT CONFIGURATION SET PROTECTION MODE AS MAXAVAILABILITY;

And finally, we check our configuration to make sure there are no errors or warnings:

SHOW CONFIGURATION

The configuration should show “SUCCESS”, which means that logs are applying and the Data Guard is properly configured for switchover. If you encounter any errors, they will most likely be due to the Standby not having the most recent archive logs applied yet. Try syncing the Data Guard manually one last time. After you have it synced up, the broker should take over automatically.

Some Code You Should Know

In addition to the syntax stated above to let you check on how the Data Guard is working, there are two more important commands that you will use in DGMGRL.  They are “SWITCHOVER” and “FAILOVER”.  It would be worthwhile to read up on the entire DGMGRL process, but these two are critical, in that they handle switching your Production instance from running over at your Primary site to your DR site.

SWITCHOVER – this is a graceful switch from your Primary instance to your DR.  It makes sure that you are all synced up from your Primary to your DR, kills open connections on your Primary instance, and switches the database roles so that your DR site is now the live Production and your Primary site is now the Standby.  It will also switch you back, when the time comes.

FAILOVER – this is a much more drastic command.  You would only use this when you are in a dire situation, such as your Primary Site being gone, unavailable, or completely down in some way.  This is going to ignore trying to reconcile any last transactions and open your DR site instance with a RESETLOGS.  You will not be able to simply switch back over to the Primary.  Your best course of action after your Primary site is available again will most likely be to blow away your old Primary instance, set up a new Data Guard from your DR site back to your Primary site, and THEN perform a switchover to get the instance back to the Primary.  But it will be a new incarnation of your instance.
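
For reference, both commands are issued from DGMGRL while connected to the configuration; a minimal sketch using the db_unique_names from this walk-through would look something like this (run only the one your situation calls for):

DGMGRL
CONNECT /
SWITCHOVER TO 'TEXASDR';   (graceful role change; a later SWITCHOVER TO 'TEXASP'; switches back)
FAILOVER TO 'TEXASDR';   (emergency only, when the Primary site is truly gone)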

I hope this has helped to get you running with Data Guard.  I do not guarantee that my scripts are perfect, but I have tried to triple-check them as much as possible.