{"id":140,"date":"2016-01-17T03:31:21","date_gmt":"2016-01-17T09:31:21","guid":{"rendered":"https:\/\/www.sqlphilosopher.com\/wp\/?p=140"},"modified":"2016-01-18T06:05:27","modified_gmt":"2016-01-18T12:05:27","slug":"sneaker-net-saves-a-20-tb-sql-mirror","status":"publish","type":"post","link":"https:\/\/www.sqlphilosopher.com\/wp\/2016\/01\/sneaker-net-saves-a-20-tb-sql-mirror\/","title":{"rendered":"Sneaker-Net Saves a 20 TB SQL Mirror"},"content":{"rendered":"<p>It started a little more than an hour after the ball dropped on New Year\u2019s Day 2016.\u00a0 We had had a few friends over to celebrate and we were just saying goodbye to the last of our company when I started getting hit with a wave of texts right at 1:35AM.\u00a0 I glanced down at my watch:<\/p>\n<p style=\"text-align: center;\"><span style=\"color: #339966;\"><strong>SEV 20 \u2013 The client was unable to reuse a session\u2026<\/strong><\/span><\/p>\n<p>I had seen the error before.\u00a0 It always indicated some momentary saturation of the network.\u00a0 I would usually get a cluster of them, then the moment would pass and everything would be fine again.\u00a0 The condition would only last long enough for the client to have to retry the connection, due to connection pooling failing to maintain the session.\u00a0 The users wouldn\u2019t even see a blip.\u00a0 It was one of those errors that I knew I should get around to solving, and it was on my backlog, but let\u2019s be honest: with everything else I have to do every day, and the fact that there was zero user impact, this one had slipped pretty far down the list.<\/p>\n<p>Normally, I would get a cluster of these errors, and everything would subside.\u00a0 This time, however, they just kept coming; hundreds of them; my phone chirping and buzzing with each new arrival.\u00a0 Then my custom SQL Mirroring job spat out a hundred or so e-mails telling me that all of my critical database mirrors had just disconnected!\u00a0 Uh-oh!\u00a0 This wasn\u2019t 
just some blip\u2026<\/p>\n<p>I quickly explained to everyone that something was going on and rushed upstairs to log in and see what was causing this.\u00a0 Everything seemed fine on the Primary site\u2019s servers, no SAN failure: good; no memory pressure: good; no crazy CPU: good.\u00a0 I checked a few databases\u2019 mirroring statuses: DISCONNECTED, DISCONNECTED, DISCONNECTED.\u00a0 I RDPed to the DR servers and was able to connect in, but as I started to inspect more, *blip*, my RDP session dropped.\u00a0 I reconnected, started looking around again and, *blip*, disconnected again.<\/p>\n<p>So, it starts to dawn on me that something is going on between the two sites.\u00a0 I disconnect my VPN session from the Primary site and reconnect to our DR VPN.\u00a0 Now my RDP session to the DR servers is nice and stable.\u00a0 I try to connect to a Primary server, just to test my hypothesis: after a few seconds, *blip*, disconnected.\u00a0 This is when I jump on the phone and call our infrastructure team.\u00a0 I tell one of the guys that I think something is going on with our microwave connection between the sites and he starts digging in to see what might be the cause.<\/p>\n<p><strong>Our Beloved Microwave<\/strong><\/p>\n<p>For a little bit of background, we\u2019ve had our DR site for about 15 years.\u00a0 Around ten years ago, we started to outgrow the 3 Mbps connection we had between our primary and DR sites.\u00a0 We decided to go for a microwave connection because we could operate it in unlicensed frequencies for no annual cost and get about 100 Mbps Full-Duplex for just the initial equipment purchase, installation, and then annual service warranty.\u00a0 While our use of this connection has grown dramatically over the past ten years, we have rarely bumped up against that upper limit for more than short bursts.\u00a0 This has been a great solution for us.<\/p>\n<p>Tonight, however, it seemed it was starting to show its age.\u00a0 We had fought to get a replacement 
pair put into the budget each year for the past several years.\u00a0 We knew it was getting to be past the \u201cbest by\u201d date and our projections were showing that before long the 100 Mbps wasn\u2019t going to cut it any longer.\u00a0 Upper management, however, had been reluctant to replace it proactively.\u00a0 No new story there.<\/p>\n<p>About an hour later, the microwave started to calm down and become more stable once again.\u00a0 We were still seeing enormous latency, but we were at least maintaining connectivity.\u00a0 My mirrors were getting back in sync and it looked like we might be OK through the weekend, at least.\u00a0 We had already put in a call to our warranty vendor, but since the connectivity had been partially restored, they didn\u2019t want to send anyone out and our infrastructure team didn\u2019t really want to make it too much of an issue either.\u00a0 We would handle it on Monday.<\/p>\n<p><strong>Down for the Count<\/strong><\/p>\n<p>Well, it didn\u2019t last that long.\u00a0 Just 36 hours later my phone starts blowing up again with alerts.\u00a0 I contact infrastructure again and they start scrambling to see what\u2019s going on: they confirm, it\u2019s the microwave.\u00a0 This time it seems it has gone down for good.\u00a0 I go ahead and disable my alerts so I can stop getting buzzed every few seconds and I start talking to the rest of my team about how we\u2019re going to handle what appears to be an extended outage.<\/p>\n<p>We quickly assess our status: the microwave connection is down, our database servers are still being actively hit by production users and working just fine, but our transaction logs are growing and our data at the DR site is getting more stale by the minute.\u00a0 Also, it\u2019s the weekend, so we are running some of our most aggressive index maintenance.\u00a0 We decide to stop the index maintenance for now and see how long it will be before infrastructure is going to have the link back up.<\/p>\n<p>Then 
comes the bad news: our warranty vendor is not able to fix the problem and we need new hardware, BUT, they are having trouble sourcing a replacement.\u00a0 Now we see that this is going to be more than a 4-hour or even 24-hour turn-around.\u00a0 We realize that we are going to have to do something, or our DR site is going to become so stale that it will be essentially useless.<\/p>\n<p><strong>Maintain\u2026<\/strong><\/p>\n<p>So, first, I mention to infrastructure that we have redundant internet connections at both sites.\u00a0 There\u2019s a whole 10 Mbps Internet connection at the DR site and a 30 Mbps Internet connection at the Primary that aren\u2019t usually being used; they\u2019re just for backup if the primary connections fail.\u00a0 I suggest that we take a couple of routers and set up a VPN between the two sites over the Internet so that we will at least have SOME connection between the two sites.\u00a0 They set it up and route just the database VLAN over the new VPN tunnel.\u00a0 Now we have connectivity for our databases and they start to slowly attempt to get back in sync.<\/p>\n<p>This is fine for our smaller, less-active databases.\u00a0 But we have several very large databases and one in particular is first on my mind.\u00a0 It is a 20 TB critical database and it\u2019s creating transaction log records at a rate of about 12,000 KB \/ sec.\u00a0 There is no way our 10 Mbps connection is going to be able to catch up.\u00a0 The DR site is already 8 hours stale by this point and the roughly 240 GB of unsent transaction log is projected to take over two days to sync up, even if we were to stop transacting right now, which, of course, is not going to happen.\u00a0 We do ask a few of our heaviest users if we can delay some processes, but this only helps a bit, and they cannot hold off their processes for very long, and we\u2019re not really sure how long it will be before our 100 Mbps link is back up.<\/p>\n<p><strong>Sneaker-Net 
Rides Again<\/strong><\/p>\n<p>With Monday coming up quick and business getting ready to swing into \u201cfull-on\u201d mode, we have to have a solution.\u00a0 I grab a 2 TB USB hard drive and plug it into one of the nodes of our cluster.\u00a0 I know that if I can get all of the transaction logs from the point of our last sent transaction up until our most recent transaction log backup, I can apply them at the DR site and keep this party going.\u00a0 Sneaker-Net has taken many forms over my career:\u00a0 it used to be floppy disks, when there either WAS NO network, or the 10Base2 connections had become unplugged for some reason; it then evolved into Zip Disks, packing a 100 MB punch of data at a time; and eventually burnable CDs helped get me through tough times; now, it was 2 TB I could fit in my pocket.\u00a0 I almost got nostalgic as I grabbed the DR site keys and jumped in my car.\u00a0 The robocopy command hadn\u2019t taken very long and I was on my way with a fistful of data and a plan.<\/p>\n<p>Now, I must admit, breaking the mirror on my 20 TB database gave me pause.\u00a0 Sure, it SHOULD all work as planned, but what if I messed something up?\u00a0 I didn\u2019t have time to consider it for long: I broke the mirror and started applying transaction log backups en masse.\u00a0 When the script finally finished, I re-established the mirror and after just a few moments, we were \u201cSynchronizing\u201d again.\u00a0 The mirror was still stale by about an hour and a half, but that was, of course, due to the time it took to copy the transaction log backups from the Primary site to the external hard drive, drive them over to the DR site, copy the backups to the DR site server, and apply the transaction log backups there.\u00a0 It wasn\u2019t going to be in sync, that just wasn\u2019t possible.\u00a0 But it is way easier to tell the president of the company that we are a couple hours stale, rather than the alternative.<\/p>\n<p><strong>A Very Long 
Week<\/strong><\/p>\n<p>Our Infrastructure team ran into roadblock after roadblock getting the microwave replaced.\u00a0 I won\u2019t go into the details at this time, but suffice it to say that you REALLY need to make sure your warranty vendor can provide the level of service they have committed to, especially on specialty equipment.\u00a0 Arbitration and compensation after the fact are NOT going to save your data during the fire.\u00a0 My team and I took turns running data between the Primary and DR sites a couple times a day for the next several days while we waited for a resolution.\u00a0 We all breathed a sigh of relief when at 10:25 PM on Tuesday, January 12<sup>th<\/sup>, I sent out the e-mail to our Business Continuity group that, \u201cAll Production databases are fully synchronized with the DR site.\u201d\u00a0 Short and simple, but that one sentence represented one crazy outage for us.<\/p>\n<p><strong>Steps Taken, For Reference<\/strong><\/p>\n<p>I\u2019m going to enumerate the steps here, just in case anyone needs the specifics:<\/p>\n<ol>\n<li>Temporarily pause your transaction log backups at the Primary site so that new transaction log backups are not taken while you are couriering your data from one site to the other.<\/li>\n<li>Copy all transaction log backups from your Primary site to your media of choice; I prefer using robocopy for speed and simplicity. 
At a command prompt, issue something similar to: robocopy C:\\TransactionLogBackups\\ D:\\SneakerNet\\<\/li>\n<li>Physically transport the backups to the DR site and copy them to the DR server.<\/li>\n<li>Break the mirror from the DR site, by issuing ALTER DATABASE dbname SET PARTNER OFF; (in some cases, I had to issue this command twice, otherwise the first transaction log backup restoration would complain that the database was busy)<\/li>\n<li>Start applying the transaction log backups WITH NORECOVERY, in backup order, by issuing RESTORE LOG dbname FROM DISK = 'D:\\LocationOnDisk\\backup051.bak' WITH NORECOVERY; for each backup file<\/li>\n<li>After all transaction logs have been restored WITH NORECOVERY, re-connect the mirror from the Primary site.<\/li>\n<li>Don&#8217;t forget to re-enable your transaction log backups at the Primary site.<\/li>\n<\/ol>\n<p>I hope you\u2019ve enjoyed reading about this adventure I had.\u00a0 It is way easier to look back on it now that it is over than when we were right in the thick of it.\u00a0 If you\u2019ve never had to do any sort of emergency recovery, I would recommend you practice doing so.\u00a0 Bring up a test instance and then break it and try to fix it.\u00a0 Going through the stress of figuring out how to work through the situation when there isn\u2019t company data on the line is way better than figuring it out once you\u2019re already in the situation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It started a little more than an hour after the ball dropped on New Year\u2019s Day 2016.\u00a0 We had had a few friends over to celebrate and we were just saying goodbye to the last of our company when I &hellip;<\/p>\n<p class=\"read-more\"><a href=\"https:\/\/www.sqlphilosopher.com\/wp\/2016\/01\/sneaker-net-saves-a-20-tb-sql-mirror\/\">Read more 
&raquo;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[20,29],"tags":[],"class_list":["post-140","post","type-post","status-publish","format-standard","hentry","category-high-availability","category-vldb"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1vS3B-2g","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/posts\/140","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/comments?post=140"}],"version-history":[{"count":0,"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/posts\/140\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/media?parent=140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/categories?post=140"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sqlphilosopher.com\/wp\/wp-json\/wp\/v2\/tags?post=140"}],
"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}