Quote:
Originally Posted by photon
It could have been a change, just one with a really bad payload. Then when the BGP update starts knocking all the systems offline, you no longer have the access you need to roll back. And maybe the people with physical access didn't have the necessary level of access to roll back the changes, or something else in their design prevented an easy rollback in that specific failure mode.
I've had to write RCAs for failures that were a perfect storm of unusual circumstances before, though usually isolated to a single system. I agree, I would love to know the details; I assume that we'll get some level of explanation at some point.
|
Good on you for writing what sound like thorough RCAs. I've seen some laughably terrible RCAs for Sev 1s that are one or two sentences long, but nobody cares anymore because the issue has already been resolved and closed.