"Why do computers stop and what can be done about it?"
Note
Gray, J. Why do computers stop and what can be done about it?, 1985.
Availability
Terminology
- Mean Time Between Failures (MTBF)
- Mean Time to Repair (MTTR)
-
Availability: percentage of time the system is operational
- \(99.37\%\) percentage availability over 10 days translates to 1.5 hours outage every 10 days on average (i.e. \((1 - 99.37\%) \times 10 \times 24 = 1.51\))
- Availability = MTBF / (MTBF + MTTR) = \(\frac{10*24}{(10*24 + 1.5)} = 0.9937\)
- If \(90\%\) of servers are available \(90\%\) of the time, overall availability could be \(81\%\) (could be higher when using certain techniques)
Key to Availability
- If MTTR is zero, then Availability = MTTF/ (MTTF + 0) = 1
- We need to give the illusion of instantaneous repair
- Key idea: Modularize the system so that modules can be repaired “instantly”
- How to provide instant repair? Have a “hot” spare that can take over instantly
-
We can analyze schemes to increase availability along several dimensions:
- CAPEX (one time capital expense)
- OPEX (on-going operating expenses)
- Increase in latency?
- Reduction in throughput?
Achieving High Availability
- Key ideas: modularity and redundancy
-
Modularity: a failure within a module affects only that module
- von Neuman’s system required 20K replicas to achieve a MTBF of 100 years
- Why? No modularity
- Large combinations of modules were replicated
-
Jim Gray’s algorithm (can have the system has MTBF in decades or centuries)
- Hierarchically decompose the system into modules
- Design each module to have MTBF > 1 year
- Make each module fail-fast
- Have a heart-beat message for each module so you know when it fails
- Have spare modules which pick up job of failed module. Failover to spare module should be quick.
Study of Failures
- Analyzed cause of failures over 7 months
- Study covers 2000 systems, 10M system hours
- 166 failures reported in this period
- 59 of these failures are “infant” failures - faulty hardware or new
-
42% of failures caused by system administration
- Includes software and hardware maintenance: 25%
- Operations: 9%, configuration: 8%
-
25% software failures, 18% hardware failures
-
14% of failures caused by environmental failures
- 9% power failures, 5% communication and facilities
Lessons from Tandem Study
- Key to high availability: tolerating human errors and operations failures
-
Need to design systems to have:
- Minimal configuration
- Minimal maintenance
- Simple, consistent interfaces
-
New systems often have higher failure rate
- Need time to work out these bugs
- Do not deploy systems until they become stable
-
Jim Gray suggests:
- Do regular hardware maintenance
- Delay software upgrades as long as possible, allow them time to become mature
- Only patch a bug if it is causing outages
Software Fault Tolerance
-
Applying lessons from before:
- Software modularity through processes and messages
- Fail-fast software modules
- Process-pairs to handle transient faults
- Transactions
-
Underlying assumption: software faults are transient
- Why? The hard software faults would have been removed in testing and quality assurance checks
Containing Software Faults
-
Two main approaches:
-
Static checking checks the code before it is even run
- Conservative checking
- May throw up lots of false positives
-
Dynamic checking checks code that is executed
- Has lower false positives
- Might not catch all bugs, especially in rarely run code paths
-
Fail Fast Software
-
In today’s terms, lots of assert conditions in the code
- Linux kernel is filled with PANIC calls. If something goes wrong, print the stack trace and kill the kernel.
Process Pairs
- When one process fails, the other process takes over
-
Types of process pairs:
- Lockstep: both execute every instruction
-
Checkpointing: primary occasionally checkpoints its state, which is copied over to backup
- Variants: Delta Checkpointing, Kernel Checkpointing
-
Persistence: backup gets all its knowledge from persistent storage
- Need to ensure persistent storage is not inconsistent
Transactions
- Provide the ACID property: atomicity, consistency, isolation, durability
-
Jim Gray argues for persistent process pairs combined with transactions
- Implemented in the Encompass system
Fault-Tolerant Communication
- Key idea: sessions and sequence numbers
- Same idea used in TCP
- Sequence numbers used to identify duplicate and lost messages