In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover.
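To get a feel for what those numbers add up to, here is a rough back-of-envelope sketch that converts the quoted events into machine-hours lost over a cluster's first year. The cluster size and the repair durations that Dean's figures don't specify are assumptions chosen purely for illustration.

```python
# Rough estimate of machine-hours lost in a cluster's first year,
# based on the figures quoted above. Cluster size and per-event repair
# times not given in the quote are assumptions, not reported numbers.

CLUSTER_SIZE = 2000          # assumed machines per cluster
HOURS_PER_YEAR = 24 * 365

events = {
    # name: (machines affected, hours unavailable per event, events per year)
    "individual machine failures": (1, 24, 1000),                  # ~1 day to repair/replace (assumed)
    "PDU failure":                 (750, 6, 1),                    # 500-1000 machines, ~6 hours
    "rack failures":               (60, 6, 20),                    # 40-80 machines each; duration assumed
    "cluster rewiring":            (CLUSTER_SIZE * 0.05, 48, 1),   # 5% of machines over a 2-day span
    "overheating":                 (CLUSTER_SIZE * 0.9, 36, 0.5),  # ~50% chance, 1-2 days to recover
}

lost_hours = sum(machines * hours * rate for machines, hours, rate in events.values())
total_hours = CLUSTER_SIZE * HOURS_PER_YEAR

print(f"Estimated machine-hours lost: {lost_hours:,.0f}")
print(f"Implied fleet availability:   {1 - lost_hours / total_hours:.3%}")
```

Even with generous assumptions the arithmetic lands in the same place as Dean's point: at this scale, some part of the cluster is always broken, so the software on top has to be built with that as a given.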
This is an example of designing networks to handle failure, but the same thinking applies to almost anything one cares to build. Sadly, too often the driving parameter is cost: the lowest bidder does the building, and the losses incurred over the project's lifetime are simply absorbed. Most of those costs are hidden in the form of employee overtime, disgruntled customers, and 'upgrades.'
This requires a mindset change. Projects must not merely be designed and built to work; they must be designed and built not to fail.
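As a small, simplified illustration of that mindset, the sketch below treats the unit of work as a request that any of several replicas can serve, and assumes failure is the normal case: it retries across replicas with exponential backoff instead of trusting a single call to succeed. The replica callables, retry counts, and exception types are placeholders for whatever a real system would use.

```python
import random
import time

def call_with_failover(replicas, request, attempts_per_replica=3, base_delay=0.1):
    """Try each replica with retries and exponential backoff.

    `replicas` is a list of callables (hypothetical stand-ins for real RPC
    stubs); the function assumes any one of them may be down at any moment.
    """
    last_error = None
    for attempt in range(attempts_per_replica):
        # Shuffle so repeated failures don't hammer the same replica first.
        for replica in random.sample(replicas, len(replicas)):
            try:
                return replica(request)
            except (TimeoutError, ConnectionError) as err:
                last_error = err  # expected at scale: note it and move on
        # Every replica failed this round; back off (with jitter) and retry.
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError("all replicas failed") from last_error
```

The point is not the specific retry policy but the default assumption behind it: any individual machine, rack, or link may be missing at the moment you need it, and the design has to keep working anyway.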