Your Data Is Safe… Not!!!
Poor attempt at a “Not” joke? I had a rough day, but I want to document it promptly for everyone’s benefit. Our story begins late at night, somewhere deep in the tough Ulduar instance in World of Warcraft.
My computer suddenly hung, though the mouse pointer was still moving, which is highly unusual! After simple troubleshooting failed to recover it, I proceeded to a full power cycle. World of Warcraft promptly crashed again, much quicker this time, with a CRC error. Concerned, I scheduled a full hard-drive check-disk with a surface scan, and left it running overnight.
In the morning, I found the situation to be the same or worse. Software was crashing left and right, seemingly at random. World of Warcraft’s handy Repair utility told me my installation was “Too Corrupt to Repair”. I tried many different things, at first focusing on hard-drive integrity, and kept finding that pretty much all data written to disk came back with CRC errors.
Long story short, I recalled the handy memtest utility right on the Ubuntu boot menu. I ran it, and voila, my screen quickly filled with lots of RED errors, thousands in fact. I proceeded to remove all the memory modules (I had four DDR2 sticks) and put them to the test one by one, and as Murphy would have it, the last module was the faulty one. All the others passed a long battery of tests with flying colors. I reinstalled the three good modules, and the computer is back to normal, as if nothing had been wrong!
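For the curious, here is a rough sketch of the idea behind a pattern-based memory test: write a known bit pattern everywhere, read it back, and report every cell that disagrees. This is only an illustration. A real tester such as memtest runs without an operating system in the way and uses far more elaborate patterns; Python cannot poke physical RAM. The `Memory` and `StuckBitMemory` classes and the `pattern_test` function are hypothetical names I made up to simulate a module with one bit stuck at 1.

```python
class Memory:
    """Hypothetical stand-in for a RAM module: a flat array of bytes."""

    def __init__(self, size):
        self.cells = bytearray(size)

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]


class StuckBitMemory(Memory):
    """Simulated fault (an assumption for the demo): one bit at one
    address always reads back as 1, whatever was written."""

    def __init__(self, size, bad_addr, bad_bit):
        super().__init__(size)
        self.bad_addr = bad_addr
        self.bad_mask = 1 << bad_bit

    def read(self, addr):
        value = super().read(addr)
        if addr == self.bad_addr:
            value |= self.bad_mask
        return value


# All zeros, all ones, and two alternating-bit patterns.
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)


def pattern_test(mem, size, patterns=PATTERNS):
    """Fill memory with each pattern, read it back, and return a list of
    (address, expected, actual) tuples for every mismatch."""
    errors = []
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            actual = mem.read(addr)
            if actual != pattern:
                errors.append((addr, pattern, actual))
    return errors


if __name__ == "__main__":
    print(pattern_test(Memory(1024), 1024))
    for addr, expected, actual in pattern_test(StuckBitMemory(1024, 100, 3), 1024):
        print(f"addr {addr}: wrote {expected:#04x}, read {actual:#04x}")
```

Note that the stuck bit only shows up with patterns that try to store a 0 in it, which is one reason real testers cycle through several patterns instead of just one.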
Moral of the story? I see several here. For example, why didn’t a self-respecting OS like Vista (ahem, funny!) include some sanity checks to alert me to memory failures, instead of just crashing with blue screens and failing “Host Processes”? (Yeah, I know there is ECC memory, but I am not sure it would have helped in this strange scenario.)
But even more so, what about the cloud? I had Mesh and Live Sync going! Something may have synced into the cloud with corruption in it! With any cloud backup, and pretty much anything else that copies files automatically, this could turn into a disaster! Amazon’s entire S3 cluster went down not too long ago, due to corruption in status data being passed around the cloud.
I’ve long been excited about ZFS technology, and while I am sure it would have alerted me to problems sooner, I am not certain it could have prevented real data corruption in this case. A checksum only protects data from the moment it is computed; if a bit has already flipped in RAM by then, the checksum will happily match the corrupted data. With a faulty memory module, everything written from memory to disk will most likely become “corrupt for life”. Even an attempt to rescue such data will likely not end well, unless the hard drive is moved to another computer for the rescue.
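To make that worry concrete, here is a small sketch of the two cases. It is not how ZFS is implemented, just the general checksum-then-verify idea using SHA-256 from Python’s standard library. The helper names (`checksum`, `flip_bit`, `corruption_on_disk_detected`, `corruption_in_ram_detected`) are made up for this illustration, and the “bad RAM” is simulated by flipping a single bit by hand.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit flipped (simulated corruption)."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


def corruption_on_disk_detected(original: bytes) -> bool:
    """Case 1: data is checksummed while still good, then a bit flips
    in storage. Verifying on read exposes the mismatch."""
    stored_sum = checksum(original)
    read_back = flip_bit(original, 0, 0)
    return checksum(read_back) != stored_sum


def corruption_in_ram_detected(original: bytes) -> bool:
    """Case 2: a bit flips in memory BEFORE the checksum is computed.
    The checksum then describes the corrupted data, so verification passes."""
    in_memory = flip_bit(original, 0, 0)
    stored_sum = checksum(in_memory)
    read_back = in_memory
    return checksum(read_back) != stored_sum


if __name__ == "__main__":
    doc = b"important document contents"
    print("corruption after checksum detected: ", corruption_on_disk_detected(doc))
    print("corruption before checksum detected:", corruption_in_ram_detected(doc))
```

The first case is the kind of damage a checksumming filesystem is good at catching. The second is the one that bit me: nothing downstream has any way to know the data was already wrong when it arrived.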
It seems that, as computer scientists, we are missing this very fundamental issue. We can’t trust our latest operating systems to alert us when the underlying hardware is misbehaving?! Or to tell us that our files are becoming corrupt every time we touch them?!
Back to the drawing board!