"Why do computers stop and what can be done about it?"

Availability

Terminology

• Mean Time Between Failures (MTBF)
• Mean Time to Repair (MTTR)
• Availability: percentage of time the system is operational

• $$99.37\%$$ availability over 10 days translates to about 1.5 hours of outage every 10 days on average (i.e. $$(1 - 99.37\%) \times 10 \times 24 \approx 1.51$$)
• Availability = MTBF / (MTBF + MTTR) = $$\frac{10 \times 24}{10 \times 24 + 1.5} \approx 0.9937$$
• If $$90\%$$ of servers are available $$90\%$$ of the time, overall availability could be $$81\%$$ (could be higher when using certain techniques)
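
The arithmetic above can be checked with a short Python sketch. The function names are illustrative, and the `series` and `parallel` helpers assume that components fail independently, which the notes do not state explicitly.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours(avail: float, period_hours: float) -> float:
    """Average outage time over a period, given an availability fraction."""
    return (1.0 - avail) * period_hours


def series(*avails: float) -> float:
    """All components are needed at once (assumes independent failures)."""
    result = 1.0
    for a in avails:
        result *= a
    return result


def parallel(*avails: float) -> float:
    """Any one component is enough (assumes independent failures)."""
    all_down = 1.0
    for a in avails:
        all_down *= 1.0 - a
    return 1.0 - all_down


if __name__ == "__main__":
    print(availability(10 * 24, 1.5))            # MTBF of 10 days, MTTR of 1.5 hours
    print(expected_downtime_hours(0.9937, 240))  # outage hours per 10 days
    print(series(0.9, 0.9))                      # two 90% components, both required
    print(parallel(0.9, 0.9))                    # two 90% components, one is a spare
```

Under the independence assumption, requiring both of two 90% components gives the 81% figure, while treating one as a spare raises availability to about 99%, which is one reading of "could be higher when using certain techniques".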

Key to Availability

• If MTTR is zero, then Availability = MTBF / (MTBF + 0) = 1
• We need to give the illusion of instantaneous repair
• Key idea: Modularize the system so that modules can be repaired “instantly”
• How to provide instant repair? Have a “hot” spare that can take over instantly
• We can analyze schemes to increase availability along several dimensions:

• CAPEX (one time capital expense)
• OPEX (on-going operating expenses)
• Increase in latency?
• Reduction in throughput?

Achieving High Availability

• Key ideas: modularity and redundancy
• Modularity: a failure within a module affects only that module

• von Neumann’s system required 20K replicas to achieve an MTBF of 100 years
• Why? No modularity
• Large combinations of modules were replicated
• Jim Gray’s algorithm (can give the system an MTBF of decades or centuries)

• Hierarchically decompose the system into modules
• Design each module to have MTBF > 1 year
• Make each module fail-fast
• Have a heart-beat message for each module so you know when it fails
• Have spare modules which pick up the job of a failed module. Failover to the spare module should be quick.
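
The heart-beat and failover steps above can be sketched as follows. This is a minimal illustration, not Gray's or Tandem's implementation: the class names, the `timeout` parameter, and the module names are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heart-beat time of each module (hypothetical helper)."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def beat(self, module: str, now: float) -> None:
        self.last_seen[module] = now

    def failed(self, now: float) -> list:
        # A module is presumed failed once it has been silent for longer than timeout.
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


@dataclass
class Supervisor:
    """Replaces a failed active module with a spare for the same role."""
    monitor: HeartbeatMonitor
    active: dict   # role -> name of the module currently doing the job
    spares: dict   # role -> list of spare module names

    def check(self, now: float) -> list:
        failed = set(self.monitor.failed(now))
        events = []
        for role, module in list(self.active.items()):
            if module in failed and self.spares.get(role):
                spare = self.spares[role].pop(0)
                self.active[role] = spare
                self.monitor.last_seen.pop(module, None)
                self.monitor.beat(spare, now)  # the spare starts with a fresh heart-beat
                events.append((role, module, spare))
        return events


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=2.0)
    sup = Supervisor(monitor, active={"disk": "disk-a"}, spares={"disk": ["disk-b"]})
    monitor.beat("disk-a", now=0.0)
    print(sup.check(now=1.0))   # still within the timeout
    print(sup.check(now=5.0))   # disk-a has been silent too long
    print(sup.active)
```

The shorter the timeout, the closer failover gets to "instant" repair, at the cost of more false alarms when a heart-beat is merely delayed.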

Study of Failures

• Analyzed cause of failures over 7 months
• Study covers 2000 systems, 10M system hours
• 166 failures reported in this period
• 59 of these failures are “infant” failures: faulty hardware or new products whose bugs are still being worked out
• 42% of failures caused by system administration

• Includes software and hardware maintenance: 25%
• Operations: 9%, configuration: 8%
• 25% software failures, 18% hardware failures

• 14% of failures caused by environmental failures

• 9% power failures, 5% communication and facilities
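
As a back-of-the-envelope check, the study figures above imply a per-system MTBF measured in years. The sketch below only divides the reported system hours by the reported failure counts; excluding the 59 infant failures is an assumption made here for comparison, and the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def mtbf_years(system_hours: float, failures: int) -> float:
    """Naive MTBF estimate: total observed system hours per failure, in years."""
    return system_hours / failures / HOURS_PER_YEAR


if __name__ == "__main__":
    total_hours = 10_000_000
    all_failures = 166
    infant_failures = 59

    print(f"all failures:       {mtbf_years(total_hours, all_failures):.1f} years")
    print(f"excluding 'infant': {mtbf_years(total_hours, all_failures - infant_failures):.1f} years")
```

With these inputs the estimate is roughly 6.9 years counting every failure and roughly 10.7 years once infant failures are set aside.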

Lessons from Tandem Study

• Key to high availability: tolerating human errors and operations failures
• Need to design systems to have:

• Minimal configuration
• Minimal maintenance
• Simple, consistent interfaces
• New systems often have higher failure rate

• Need time to work out these bugs
• Do not deploy systems until they become stable
• Jim Gray suggests:

• Do regular hardware maintenance
• Delay software upgrades as long as possible to allow them time to mature
• Only patch a bug if it is causing outages

Software Fault Tolerance

• Applying lessons from before:

• Software modularity through processes and messages
• Fail-fast software modules
• Process-pairs to handle transient faults
• Transactions
• Underlying assumption: software faults are transient

• Why? The hard software faults would have been removed in testing and quality assurance checks

Containing Software Faults

• Two main approaches:

• Static checking checks the code before it is even run

• Conservative checking
• May throw up lots of false positives
• Dynamic checking checks code that is executed

• Has lower false positives
• Might not catch all bugs, especially in rarely run code paths

Fail Fast Software

• In today’s terms, lots of assert conditions in the code

• The Linux kernel is filled with panic() calls. If something goes wrong, print the stack trace and halt the kernel.
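
A minimal sketch of the fail-fast style in Python: the module checks its inputs and invariants and stops immediately instead of continuing with bad state. The function and its account model are hypothetical examples, not taken from the paper.

```python
def apply_withdrawal(balance_cents: int, amount_cents: int) -> int:
    """Return the new balance, or stop at once if anything looks wrong."""
    # Reject bad input at the boundary rather than letting it propagate.
    if not isinstance(amount_cents, int) or amount_cents <= 0:
        raise ValueError(f"invalid withdrawal amount: {amount_cents!r}")
    if amount_cents > balance_cents:
        raise ValueError("withdrawal exceeds balance")

    new_balance = balance_cents - amount_cents

    # Internal invariant: if this ever fails, the bug is in this module.
    assert new_balance >= 0, "balance invariant violated"
    return new_balance


if __name__ == "__main__":
    print(apply_withdrawal(100, 30))
    try:
        apply_withdrawal(100, -5)
    except ValueError as err:
        print("stopped early:", err)
```

Stopping at the first sign of trouble keeps the fault inside the module, so a supervisor or backup process can take over from a known state.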

Process Pairs

• When one process fails, the other process takes over
• Types of process pairs:

• Lockstep: both execute every instruction
• Checkpointing: primary occasionally checkpoints its state, which is copied over to backup

• Variants: Delta Checkpointing, Kernel Checkpointing
• Persistence: backup gets all its knowledge from persistent storage

• Need to ensure persistent storage remains consistent
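
The checkpointing variant can be sketched as below. This is a toy, single-process illustration under simplifying assumptions: both "processes" are plain objects, the state is a running total, and `checkpoint_every` is a hypothetical knob for how often the primary copies its state to the backup.

```python
import copy


class Worker:
    """One member of the pair; holds the application state."""

    def __init__(self):
        self.state = {"total": 0}

    def restore(self, checkpoint: dict) -> None:
        self.state = copy.deepcopy(checkpoint)


class ProcessPair:
    def __init__(self, checkpoint_every: int = 2):
        self.primary = Worker()
        self.backup = Worker()
        self.checkpoint_every = checkpoint_every
        self.ops_since_checkpoint = 0

    def handle(self, amount: int) -> int:
        """The primary does the work and occasionally checkpoints to the backup."""
        self.primary.state["total"] += amount
        self.ops_since_checkpoint += 1
        if self.ops_since_checkpoint >= self.checkpoint_every:
            self.backup.restore(self.primary.state)
            self.ops_since_checkpoint = 0
        return self.primary.state["total"]

    def fail_over(self) -> None:
        """The primary has failed: the backup takes over from its last checkpoint."""
        self.primary = self.backup
        self.backup = Worker()
        self.backup.restore(self.primary.state)
        self.ops_since_checkpoint = 0


if __name__ == "__main__":
    pair = ProcessPair(checkpoint_every=2)
    pair.handle(1)
    pair.handle(2)      # checkpoint taken here, total is 3
    pair.handle(4)      # not yet checkpointed
    pair.fail_over()
    print(pair.primary.state)
```

In this sketch, work done since the last checkpoint is lost on failover, which is the trade-off that checkpoint frequency controls.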

Transactions

• Provide the ACID properties: atomicity, consistency, isolation, durability
• Jim Gray argues for persistent process pairs combined with transactions

• Implemented in the Encompass system
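
Atomicity can be illustrated with Python's standard `sqlite3` module, used here as a stand-in and not as the Encompass system. The `accounts` table and the `transfer` function are hypothetical; the point is that a failure partway through leaves no partial update behind.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move money between accounts; either both updates happen or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    db = make_db()
    transfer(db, "alice", "bob", 30)
    print(balances(db))
    try:
        transfer(db, "alice", "bob", 500)
    except ValueError as err:
        print("rolled back:", err)
    print(balances(db))
```

The failed transfer leaves the balances exactly as they were before it started, so a backup that restarts from persistent storage sees a consistent state.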

Fault-Tolerant Communication

• Key idea: sessions and sequence numbers
• Same idea used in TCP
• Sequence numbers used to identify duplicate and lost messages
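
A minimal sketch of how a receiver can use sequence numbers within a session to spot duplicates and gaps. The class and its return values are illustrative; real protocols such as TCP add acknowledgements, retransmission timers, and windowing on top of this idea.

```python
class Receiver:
    """Delivers messages in order and classifies everything else."""

    def __init__(self):
        self.expected = 0      # next sequence number we are waiting for
        self.delivered = []

    def receive(self, seq: int, payload: str) -> str:
        if seq < self.expected:
            return "duplicate"  # already delivered; safe to drop
        if seq > self.expected:
            return "gap"        # an earlier message was lost; sender should retransmit
        self.delivered.append(payload)
        self.expected += 1
        return "delivered"


if __name__ == "__main__":
    rx = Receiver()
    print(rx.receive(0, "a"))
    print(rx.receive(0, "a"))   # retransmitted copy of message 0
    print(rx.receive(2, "c"))   # message 1 has not arrived yet
    print(rx.receive(1, "b"))
    print(rx.delivered)
```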