In March, I had just received three new servers (Dell R910s) for a new cluster I was building for our flagship application’s database. I had just gotten the cluster set up and had moved a copy of our Production database onto it for testing before we officially migrated.
The next week I was sitting in the SQLskills Immersion Event in Tampa when I got a page and an e-mail with a Severity 24, Error 832. This error can mean something truly bad has happened: a page that should not have changed was altered while it sat in the buffer cache, which usually points to a memory or other hardware problem.
I kicked off a CheckDB to see if the corruption was anywhere else in the database. Since I was sitting in the classroom with Paul Randal, the corruption guru himself, I also shot him an e-mail letting him know what I was seeing and asking if he had any suggestions. He suggested that the corruption probably wasn’t on disk, since an 832 specifically means that the corruption happened after the page was read into memory, so that was something of a relief. Since we suspected the box had bad memory, I failed the instance over to another node in the cluster and restarted the CheckDB.
Since this cluster was still in test, there was no real concern from a business perspective, but I wanted my infrastructure guys to test the box as soon as possible in case a hardware issue would delay the upcoming roll-out to production. So they started running memory tests along with all the other Dell diagnostics. We checked BIOS versions, Windows patches, and the memory over and over again, and nothing looked wrong at all.
Finally, after working with Dell Support for several hours, they found that the problem was that two BIOS settings, which would normally be disabled on our servers, were enabled. C-State (CPU States) and C1E (Processor Core Speed and Voltage) were both turned on, which was causing problems with the stability of the RAM. The server in question had been bought from the Dell Outlet, and the previous owner must have enabled them, because they are not on by default.
Since they disabled those two settings in the BIOS, the server has been stable. It certainly gave me a scare, but I learned a lesson from it: double-check your BIOS settings, and make sure you know what those settings really mean.