Rather expected it, actually. There are high winds blowing in from the Oregon coast, the weather news is full of it, etc.
So there I was, minding my own business, coincidentally thinking about data QA and reaction times for an entirely unrelated reason. Which makes for a very sweet coincidence, as now I've pulled data from a couple of scripts I wrote to check the APC UPS. There is frequently a PostgreSQL db server running on this machine, and the combination of databases and unreliable power always ends badly.
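For the curious, the checking scripts mostly boil down to parsing `apcaccess status` output from apcupsd. A minimal sketch, with the caveat that the field set varies by UPS model, so `parse_apcaccess` and the sample below are illustrative rather than my exact code:

```python
def parse_apcaccess(text):
    """Parse `apcaccess status` output (KEY : value lines) into a dict."""
    status = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status

# Sample output in the apcupsd key/value format; real field sets
# differ by model, so treat these fields as an example only.
sample = """\
STATUS   : ONLINE
BCHARGE  : 100.0 Percent
TIMELEFT : 45.0 Minutes
"""

ups = parse_apcaccess(sample)
on_battery = ups.get("STATUS") != "ONLINE"
```

In practice you would feed this the output of `subprocess.run(["apcaccess", "status"], ...)` on a timer, and log the parsed values with timestamps.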
That usually happens sooner rather than later, but like most people I tend to put off characterizing what things actually look like as systems fall down. I advocate doing this all the time, to the extent of periodically killing test systems at the power distribution panel. "Really. Now and then, just replicate into a test environment, and flip the breakers."
That can be a huge pain in the ass, but there is really no other way to be absolutely certain. Cloud is not the answer to this issue. Or, at best, it can only be part of the solution. There are many examples of cloud failure.
Today, I got some great data on UPS drain and recovery, and found a problem with the time-stamping of notifications. Discovering that bug in my code is a win, as is being jogged into posting about a bug in the Linux APC UPS monitor daemon. That one I (obviously) have no control over, and it serves as an example of why more care should be taken before turning on SELinux enforcement.
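I won't dissect the bug here, but a common cause of notification time-stamp trouble is recording naive local times, which sort incorrectly across hosts and DST changes. A hedged sketch of the safer habit (this is general practice, not necessarily the specific fix involved):

```python
from datetime import datetime, timezone

def notification_timestamp():
    # Record notification times in UTC with an explicit offset, so
    # events from multiple hosts order correctly even across DST shifts.
    return datetime.now(timezone.utc).isoformat()

stamp = notification_timestamp()
```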
Things Are Going to Go Wrong
As long as Murphy is alive and well, and Murphy seems to be immortal, things will continue to go all pear-shaped, at the worst possible times. I almost wrote 'periodically pear-shaped', but we don't always have the benefit of periodicity. Aside from the Big Three of periodic FUBAR announcements (Microsoft, Adobe, and Oracle), anyway. I might justifiably add OpenSSL and other Open Source projects, but the data to back that up is a whole new post. That is not going to happen today. Which is just as well, because the ongoing incompetence of Sony beggars belief. I don't even want to think about it, beyond being very happy that I am not on their security team.
Today, As An Example
From a security perspective, we are most concerned with the CIA triad of Confidentiality, Integrity, and Availability of data. Power problems on database systems will cause issues with integrity and availability, as mentioned above. Confidentiality only becomes a factor if systems responsible for authentication or authorization fail open when a remote system is unavailable. That is rare, these days. Possibly because it is an easy test. So run it, just to be certain. Really. It's just a temporary firewall rule. And, as always, make it a test, so that pass/fail is always recorded.
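The fail-open check lends itself to automation. A minimal sketch, where the host, port, and `authorize` logic are hypothetical stand-ins; the only point being demonstrated is that an unreachable auth host must produce a deny, which your test harness then records as pass/fail:

```python
import socket

def remote_auth_reachable(host, port, timeout=0.5):
    """Return True if the remote auth service accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def authorize(user, host, port):
    # Fail closed: if the auth host is unreachable, deny access.
    if not remote_auth_reachable(host, port):
        return False
    # ... real credential check against the remote service goes here ...
    return user == "alice"  # hypothetical placeholder
```

During the actual test you'd simulate the outage with a temporary firewall rule blocking the auth host, then assert that `authorize` denies everyone.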
But, we can never miss an opportunity to get better. Particularly under circumstances so benign as a power outage. Which, for people focused on security rather than pure availability, really is benign.
So, We Are Back to Logs
I first mentioned logs in We Still Fail at Log Analysis back in July of 2013. Nothing much has happened since then to change my opinion. 18 months, and little or no progress on an operational problem that has been with us for time out of mind. That is a bit discouraging, so I feel the need to visit this issue again, and probably not for the last time.
Please look at log policy again. Logging takes many forms, of course. System and application state and performance data are both vital. Were those recorded? Was it possible for an adversary, possibly internal, to avoid detection by shutting down a remote log host, or a network path to that host? In a virtualized environment, do you have records of what machines were spun up or migrated, and the security posture of those systems? If so, are those records amenable to analysis, or are they just data for the sake of data?
That last question is not meant to imply that you are doing anything obviously wrong, BTW. Effective means, ones that will stand the test of time, have yet to evolve. I regard this as an open research question. Which is a bit sad, considering how bad the failures have been in legacy environments.
Possibly, for some environments, the future lies solely in data exfiltration detection.