"Why do computers stop and what can be done about it?"

Note

Availability

Mean Time Between Failures (MTBF)
Mean Time to Repair (MTTR)
Availability: percentage of time the system is operational
- \(99.37\%\) availability over 10 days translates to about 1.5 hours of outage every 10 days on average (i.e. \((1 - 0.9937) \times 10 \times 24 \approx 1.51\) hours)
- Availability = MTBF / (MTBF + MTTR) = \(\frac{10 \times 24}{10 \times 24 + 1.5} \approx 0.9937\)
- If \(90\%\) of servers are available \(90\%\) of the time, overall availability could be \(81\%\) (could be higher when using certain techniques)
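The arithmetic above can be reproduced with a short Python sketch. The function names are illustrative and not from the original notes. The `series_availability` helper assumes one reading of the 81% figure: two independent 90% factors that must both hold, so they multiply.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours(avail: float, period_hours: float) -> float:
    """Average outage time over a period at a given availability."""
    return (1.0 - avail) * period_hours


def series_availability(*parts: float) -> float:
    """Combined availability when every part must be up (assumes independence)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


if __name__ == "__main__":
    ten_days = 10 * 24  # hours
    print(availability(mtbf_hours=ten_days, mttr_hours=1.5))
    print(expected_downtime_hours(0.9937, ten_days))
    print(series_availability(0.90, 0.90))
```

Setting `mttr_hours` to 0 makes `availability` return exactly 1, which is the motivation for making repair look instantaneous.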

If MTTR is zero, then Availability = MTBF / (MTBF + 0) = 1
We need to give the illusion of instantaneous repair
Key idea: Modularize the system so that modules can be repaired “instantly”
How to provide instant repair? Have a “hot” spare that can take over instantly
We can analyze schemes to increase availability along several dimensions:
- CAPEX (one time capital expense)
- OPEX (on-going operating expenses)
- Increase in latency?
- Reduction in throughput?

Key ideas: modularity and redundancy
Modularity: a failure within a module affects only that module
- von Neumann’s system required 20K replicas to achieve an MTBF of 100 years
- Why? No modularity
- Large combinations of modules were replicated
Jim Gray’s algorithm (the system can then have an MTBF of decades or centuries)
- Hierarchically decompose the system into modules
- Design each module to have MTBF > 1 year
- Make each module fail-fast
- Have a heart-beat message for each module so you know when it fails
- Have spare modules that pick up the job of a failed module. Failover to the spare module should be quick.
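The heartbeat and failover steps can be sketched in a few lines of Python. This is a minimal illustration and not Gray's implementation. The `Module` and `Supervisor` classes and the 2-second `timeout` are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    last_heartbeat: float = 0.0  # time of the most recent heartbeat, in seconds

    def heartbeat(self, now: float) -> None:
        self.last_heartbeat = now


@dataclass
class Supervisor:
    primary: Module
    spares: list
    timeout: float = 2.0  # hypothetical: longer silence counts as a failure

    def check(self, now: float) -> Module:
        """Return the module that should serve, failing over if the primary is silent."""
        silent = now - self.primary.last_heartbeat > self.timeout
        if silent and self.spares:
            failed = self.primary
            self.primary = self.spares.pop(0)  # promote the first spare
            self.primary.heartbeat(now)        # the spare reports as soon as it takes over
            print(f"{failed.name} missed heartbeats; failing over to {self.primary.name}")
        return self.primary


if __name__ == "__main__":
    sup = Supervisor(primary=Module("A"), spares=[Module("B")])
    sup.primary.heartbeat(1.0)
    print(sup.check(2.0).name)  # A is still within the timeout
    print(sup.check(5.0).name)  # A has been silent too long, so B takes over
```

The detection delay is bounded by the timeout, so the timeout plus the promotion time is effectively the MTTR seen by clients.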

Analyzed the causes of failures over 7 months
Study covers 2000 systems, 10M system hours
166 failures reported in this period
59 of these failures are “infant” failures - faulty hardware or new
42% of failures caused by system administration
- Includes software and hardware maintenance: 25%
- Operations: 9%, configuration: 8%
25% software failures, 18% hardware failures
14% of failures caused by environmental failures
- 9% power failures, 5% communication and facilities
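As a consistency check on the breakdown above, the following Python sketch sums the sub-categories and derives a crude failure interval. It uses only the figures listed in these notes. The years-per-failure value is a back-of-the-envelope estimate from rounded inputs, not a number stated in the notes.

```python
# Percentages as listed in the notes above (already rounded in the source).
causes = {
    "administration": {"maintenance": 25, "operations": 9, "configuration": 8},
    "software": {"software": 25},
    "hardware": {"hardware": 18},
    "environment": {"power": 9, "communication and facilities": 5},
}

# Total per top-level cause, and the grand total across all causes.
totals = {cause: sum(parts.values()) for cause, parts in causes.items()}
overall = sum(totals.values())

# Crude aggregate interval between reported failures, per system.
system_hours = 10_000_000
failures = 166
hours_per_failure = system_hours / failures
years_per_failure = hours_per_failure / (24 * 365)

if __name__ == "__main__":
    for cause, pct in totals.items():
        print(f"{cause}: {pct}%")
    print(f"sum of categories: {overall}%")
    print(f"years per failure (rough): {years_per_failure:.1f}")
```

The categories sum to 99% rather than 100%, which is consistent with rounding in the source figures.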

Key to high availability: tolerating human errors and operations failures
Need to design systems to have:
- Minimal configuration
- Minimal maintenance
- Simple, consistent interfaces
New systems often have a higher failure rate
- Need time to work out these bugs
- Do not deploy systems until they become stable
Jim Gray suggests:
- Do regular hardware maintenance
- Delay software upgrades as long as possible; allow them time to mature
- Only patch a bug if it is causing outages

Applying lessons from before:
- Software modularity through processes and messages
- Fail-fast software modules
- Process-pairs to handle transient faults
- Transactions
Underlying assumption: software faults are transient
- Why? The hard software faults would have been removed in testing and quality assurance checks
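The transient-fault assumption is what makes a process pair useful: if the fault does not recur, handing the same request to the backup succeeds. The Python sketch below illustrates this idea only. `TransientFault`, `run_with_backup`, and `make_flaky` are hypothetical names, and a real process pair would also checkpoint state between primary and backup.

```python
class TransientFault(Exception):
    """Hypothetical error type standing in for a fail-fast module crash."""


def run_with_backup(request, primary, backup):
    """Try the primary; if it fails fast, hand the same request to the backup."""
    try:
        return primary(request)
    except TransientFault:
        return backup(request)


def make_flaky(fail_times):
    """Build a handler that fails its first `fail_times` calls, then succeeds."""
    state = {"calls": 0}

    def handler(request):
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TransientFault(f"transient fault on call {state['calls']}")
        return f"done: {request}"

    return handler


if __name__ == "__main__":
    primary = make_flaky(fail_times=1)  # fails once, as a transient fault would
    backup = make_flaky(fail_times=0)   # healthy backup
    print(run_with_backup("debit account", primary, backup))
```

If the fault were hard rather than transient, the backup would hit it too and the exception would propagate, which is why the assumption matters.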

In today’s terms, this means lots of assert conditions in the code
- The Linux kernel is filled with panic() calls: if something goes wrong, print the stack trace and halt the kernel.
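A fail-fast module checks its inputs and invariants and stops at the first inconsistency instead of continuing with bad state. The Python sketch below shows the style. The `withdraw` function and its rules are made up for illustration.

```python
def withdraw(balance: int, amount: int) -> int:
    """Return the new balance, stopping immediately on any inconsistent state."""
    # Check inputs before doing any work.
    if amount <= 0:
        raise ValueError(f"amount must be positive, got {amount}")
    if amount > balance:
        raise ValueError(f"insufficient funds: balance={balance}, amount={amount}")

    new_balance = balance - amount

    # Check the result too: a violated invariant signals a bug, so stop here.
    assert new_balance >= 0, "invariant violated: negative balance"
    return new_balance


if __name__ == "__main__":
    print(withdraw(100, 30))
    try:
        withdraw(10, 20)
    except ValueError as err:
        print(f"failed fast: {err}")
```

Python skips `assert` statements when run with `-O`, so checks that must always run are written here as explicit `raise` statements.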

Provide the ACID properties: atomicity, consistency, isolation, durability
Jim Gray argues for persistent process pairs combined with transactions
- Implemented in the Encompass system
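Atomicity can be illustrated with Python's standard `sqlite3` module. This is unrelated to Encompass and is only a sketch of the idea. The `accounts` table and the `transfer` helper are hypothetical. The in-memory database shows atomicity and consistency but not durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically; return True on commit, False on rollback."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


if __name__ == "__main__":
    print(transfer(conn, "alice", "bob", 30), balances(conn))
    print(transfer(conn, "alice", "bob", 500), balances(conn))
```

In the failed transfer the credit to `bob` runs first and is then undone by the rollback, so no partial update is ever visible after the transaction ends.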