fubarnorthwest: Ops: the A in Confidentiality, Integrity and Availability

Edit: there is too much command-line and storage stuff in here. If you can wade through it, great. If not, my bad. I'm not certain how I should segregate very technical information, and I welcome your feedback.

Got a call to recover a developer's drive on Fedora 18. Apparently there were problems on /home, which are easy to fix, but also /. Yes, I get a lot of weird little jobs. Apparently this person had Googled around, and independently discovered that the Web needs an editor.

First off, Fedora Live media is inherently a rescue disk for problems like these. Boot into it, look around in /dev/, and you will see the devices related to your system. They will not be mounted, so you can run fsck. fdisk -l will show you what is where.

In this case, /dev/sda1 was a generic Linux filesystem, and sda2 contained logical volumes named lv_root, lv_home, and lv_swap.
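For the curious, the survey described above might look something like this from a live session. This is a minimal sketch, not a transcript of what I typed: the `run` and `survey` helpers and the `DRY_RUN` switch are my own hypothetical additions, so that by default the script only prints what it would do. Set `DRY_RUN=0` and run as root to execute the commands for real.

```shell
#!/bin/sh
# Sketch: survey disks and LVM from a live/rescue session.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

survey() {
    run fdisk -l        # partition tables for every disk
    run vgscan          # scan for LVM volume groups
    run vgchange -ay    # activate the logical volumes that were found
    run lvs             # list logical volumes (lv_root, lv_home, lv_swap here)
}

survey
```

Activated logical volumes show up under /dev/mapper/ and under /dev/ in a directory named after the volume group.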

There are various things you can do with badblocks(8), including a non-destructive read-write test, but it is horribly slow. 'e2fsck -fc' will force an fsck (a journaled filesystem can be marked as clean even when there are problems), give you a progress indicator, and cordon off any bad blocks it finds.

Run this command against every non-LVM partition, and against each logical volume within an LVM partition. The exception is swap, whether it is a logical volume or not. Swap is raw disk; there is no filesystem to check, and fsck will simply abort if you attempt it.
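A rough sketch of that loop follows. The volume group name `vg_example` is a placeholder (I have not given the real one), `check_all` is a hypothetical helper, and skipping anything with "swap" in its name is a shortcut that only works because the volumes here are named that way. As written it only prints the commands; set `DRY_RUN=0` to run them as root against unmounted devices.

```shell
#!/bin/sh
# Sketch: run 'e2fsck -fc' on every volume except swap.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

check_all() {
    for dev in "$@"; do
        case "$dev" in
            *swap*)
                # Name-based heuristic: swap has no filesystem to check.
                echo "skip: $dev (swap, no filesystem)"
                continue
                ;;
        esac
        if [ "$DRY_RUN" = "1" ]; then
            echo "would run: e2fsck -fc $dev"
        else
            # -f forces the check, -c scans for bad blocks with badblocks(8)
            e2fsck -fc "$dev"
        fi
    done
}

# /dev/sda1 is from this system; the vg_example paths are placeholders.
check_all /dev/sda1 \
    /dev/vg_example/lv_root \
    /dev/vg_example/lv_home \
    /dev/vg_example/lv_swap
```

A more careful version would ask blkid(8) for the filesystem type instead of trusting the name.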

Several problems were revealed. fsck was allowed to fix all of them except for a short read on a critical userland file. A backup existed, so fsck was allowed to simply delete the file (fsck will prompt you if only the -fc options have been supplied).

Root cause was Who Knows. It could be down to a cosmic ray, and the system not having ECC memory. But that's not a reason to blow it off. I could recommend that the owner of the system pay more attention to smartd reports, except that evidence is beginning to show that smartd has little predictive power.

This is not yet certain, so we still have to gather data. But it looks like it may be an Oops.

Worse is that all of this is a waste of developer time. fsck -fc will take something like two and a quarter hours per TB of disk on SATA-3, it is not an exhaustive test, and now you have a developer with another ongoing distraction.

This almost perfectly resembles a poor outcome. Better to replace the drive, reprovision the OS (if that takes more than five minutes, You Are Doing It Wrong), and restore from backup. Distracting developers, when there is an easy means of avoiding it, is never a good plan.

Be sure that replaced drives are physically destroyed or securely wiped (which takes long enough that destruction is cheaper). I once requested additional storage for a Linux workstation, loaded an NTFS driver, and was astonished at what I found. This should be in your ops policies.
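If you do go the wiping route, a minimal sketch with shred(1) from coreutils might look like this. The `wipe_drive` helper, the `DRY_RUN` switch, and the `/dev/sdX` device name are all placeholders of mine, and one random pass plus a zero pass is an assumption rather than a policy recommendation. Note that overwriting cannot reach sectors the drive has already remapped, which is one more argument for physical destruction.

```shell
#!/bin/sh
# Sketch: overwrite a retired drive before it leaves the building.
# DRY_RUN=1 (the default here) only prints the command.
DRY_RUN="${DRY_RUN:-1}"

wipe_drive() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: wipe_drive /dev/DEVICE" >&2
        return 2
    fi
    # Refuse if the device, or any partition on it, is currently mounted.
    if grep -q "^$dev" /proc/mounts 2>/dev/null; then
        echo "refusing: $dev is mounted" >&2
        return 1
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: shred -v -n 1 -z $dev"
    else
        # -n 1: one pass of random data, -z: final pass of zeros, -v: progress
        shred -v -n 1 -z "$dev"
    fi
}

# /dev/sdX is a placeholder; triple-check the device name before a real run.
wipe_drive /dev/sdX
```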

Sunday, December 8, 2013
