Zeyuan Hu's page

"Why do computers stop and what can be done about it?"



  • Mean Time Between Failures (MTBF)
  • Mean Time to Repair (MTTR)
  • Availability: percentage of time the system is operational

    • \(99.37\%\) availability over 10 days translates to about 1.5 hours of outage every 10 days on average (i.e. \((1 - 99.37\%) \times 10 \times 24 \approx 1.51\) hours)
    • Availability = MTBF / (MTBF + MTTR) = \(\frac{10 \times 24}{10 \times 24 + 1.5} \approx 0.9937\)
    • If \(90\%\) of servers are available \(90\%\) of the time, overall availability could be as low as \(81\%\) (\(0.9 \times 0.9 = 0.81\)); it can be higher with techniques such as redundancy
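
The arithmetic in the bullets above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and `series_availability` assumes the parts fail independently and that all of them must be up, which is only one possible reading of the \(81\%\) figure.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours(avail: float, period_hours: float) -> float:
    """Average outage time over a period at the given availability."""
    return (1.0 - avail) * period_hours


def series_availability(*parts: float) -> float:
    """Availability when every part must be up (assumes independent failures)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


if __name__ == "__main__":
    ten_days = 10 * 24  # hours
    print(f"availability         = {availability(ten_days, 1.5):.4f}")
    print(f"downtime per 10 days = {expected_downtime_hours(0.9937, ten_days):.2f} h")
    print(f"two 90% parts        = {series_availability(0.9, 0.9):.2f}")
```

Driving MTTR toward zero in `availability` pushes the result toward 1, which is the point the next section builds on.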

Key to Availability

  • If MTTR is zero, then Availability = MTBF / (MTBF + 0) = 1
  • We need to give the illusion of instantaneous repair
  • Key idea: Modularize the system so that modules can be repaired “instantly”
  • How to provide instant repair? Have a “hot” spare that can take over instantly
  • We can analyze schemes to increase availability along several dimensions:

    • CAPEX (one time capital expense)
    • OPEX (on-going operating expenses)
    • Increase in latency?
    • Reduction in throughput?

Achieving High Availability

  • Key ideas: modularity and redundancy
  • Modularity: a failure within a module affects only that module

    • von Neumann’s system required 20K replicas to achieve an MTBF of 100 years
    • Why? No modularity
    • Large combinations of modules were replicated
  • Jim Gray’s algorithm (can give the system an MTBF of decades or centuries)

    • Hierarchically decompose the system into modules
    • Design each module to have MTBF > 1 year
    • Make each module fail-fast
    • Have a heart-beat message for each module so you know when it fails
    • Have spare modules which pick up the job of a failed module. Failover to a spare module should be quick.
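
The heartbeat and spare-module steps can be sketched as follows. This is an illustrative sketch, not Tandem's mechanism: `HeartbeatMonitor`, `Supervisor`, the module names, and the timeout are all hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Treats a module as failed if no heartbeat arrived within `timeout` seconds."""
    timeout: float
    last_seen: dict = field(default_factory=dict)  # module name -> last heartbeat time

    def beat(self, module: str, now: float) -> None:
        self.last_seen[module] = now

    def failed(self, now: float) -> list:
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


@dataclass
class Supervisor:
    """Reassigns the jobs of failed modules to spare modules."""
    monitor: HeartbeatMonitor
    spares: list
    active: dict = field(default_factory=dict)  # job name -> module name

    def check(self, now: float) -> list:
        """Fail over jobs on silent modules; return (job, old_module, new_module) moves."""
        moves = []
        dead = set(self.monitor.failed(now))
        for job, module in sorted(self.active.items()):
            if module in dead and self.spares:
                spare = self.spares.pop(0)
                self.active[job] = spare
                self.monitor.last_seen.pop(module, None)  # stop tracking the dead module
                self.monitor.beat(spare, now)             # the spare is alive as of now
                moves.append((job, module, spare))
        return moves


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=3.0)
    supervisor = Supervisor(monitor, spares=["spare-1"],
                            active={"billing": "node-a", "search": "node-b"})
    monitor.beat("node-a", now=0.0)
    monitor.beat("node-b", now=0.0)
    monitor.beat("node-b", now=4.0)   # node-a has gone silent
    print(supervisor.check(now=5.0))
```

The shorter the timeout, the smaller the effective MTTR, at the cost of more false alarms when a healthy module is merely slow.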

Study of Failures

  • Analyzed cause of failures over 7 months
  • Study covers 2000 systems, 10M system hours
  • 166 failures reported in this period
  • 59 of these failures are “infant” failures - faulty hardware or new
  • 42% of failures caused by system administration

    • Includes software and hardware maintenance: 25%
    • Operations: 9%, configuration: 8%
  • 25% software failures, 18% hardware failures

  • 14% of failures caused by environmental failures

    • 9% power failures, 5% communication and facilities
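
As a back-of-the-envelope check, the rounded figures above imply a per-system MTBF of roughly 60,000 hours, a little under 7 years. The sketch below uses only the numbers listed in these notes; because they are rounded, the result is a ballpark and need not match the figure reported in the paper itself.

```python
SYSTEM_HOURS = 10_000_000   # total observed system hours (rounded, from the notes)
FAILURES = 166              # failures reported in the period
HOURS_PER_YEAR = 24 * 365

# MTBF per system = observed operating time / number of failures
mtbf_hours = SYSTEM_HOURS / FAILURES
mtbf_years = mtbf_hours / HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"implied MTBF: {mtbf_hours:,.0f} hours ({mtbf_years:.1f} years)")
```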

Lessons from Tandem Study

  • Key to high availability: tolerating human errors and operations failures
  • Need to design systems to have:

    • Minimal configuration
    • Minimal maintenance
    • Simple, consistent interfaces
  • New systems often have a higher failure rate

    • Need time to work out these bugs
    • Do not deploy systems until they become stable
  • Jim Gray suggests:

    • Do regular hardware maintenance
    • Delay software upgrades as long as possible to give them time to mature
    • Only patch a bug if it is causing outages

Software Fault Tolerance

  • Applying lessons from before:

    • Software modularity through processes and messages
    • Fail-fast software modules
    • Process-pairs to handle transient faults
    • Transactions
  • Underlying assumption: software faults are transient

    • Why? The hard software faults would have been removed in testing and quality assurance checks

Containing Software Faults

  • Two main approaches:

    • Static checking checks the code before it is even run

      • Conservative checking
      • May throw up lots of false positives
    • Dynamic checking checks code that is executed

      • Has fewer false positives
      • Might not catch all bugs, especially in rarely run code paths
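
The trade-off can be illustrated with a toy example. The checker below is a deliberately crude sketch written for these notes, not a real static analysis tool: it flags calls to names it cannot see defined, so it finds the bug without running the code but would also raise false positives for names injected at run time. The dynamic check only finds the bug when an input reaches the rarely taken branch.

```python
import ast
import builtins

SOURCE = '''
def handle(request_size):
    if request_size > 1_000_000:       # rarely taken path
        return compress(request_size)  # bug: compress is never defined
    return request_size
'''


def static_undefined_calls(source: str) -> list:
    """Flag calls to bare names that are neither builtins nor defined in the module."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Assign):
            defined.update(t.id for t in node.targets if isinstance(t, ast.Name))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            defined.update((a.asname or a.name).split(".")[0] for a in node.names)
    return sorted({
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id not in defined
    })


def dynamic_check(source: str, inputs) -> list:
    """Run handle() on each input and report (input, exception name) for failures."""
    namespace = {}
    exec(source, namespace)
    failures = []
    for value in inputs:
        try:
            namespace["handle"](value)
        except Exception as exc:
            failures.append((value, type(exc).__name__))
    return failures


if __name__ == "__main__":
    print("static :", static_undefined_calls(SOURCE))
    print("dynamic, common inputs:", dynamic_check(SOURCE, [10, 500]))
    print("dynamic, rare input   :", dynamic_check(SOURCE, [2_000_000]))
```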

Fail Fast Software

  • In today’s terms, lots of assert conditions in the code

    • The Linux kernel is filled with panic() calls: if something goes wrong, it prints a stack trace and halts the kernel.
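
In application code the same idea looks roughly like the sketch below: check inputs and invariants at every step and stop immediately when one is violated, rather than continuing with corrupt state. The `Account` class and `InvariantViolation` exception are hypothetical; an explicit exception is used instead of Python's `assert` statement because assertions are skipped when Python runs with `-O`.

```python
class InvariantViolation(RuntimeError):
    """Raised when a module detects that its input or its own state is invalid."""


def check(condition: bool, message: str) -> None:
    """Fail fast: stop at the first violated condition."""
    if not condition:
        raise InvariantViolation(message)


class Account:
    def __init__(self, balance_cents: int) -> None:
        check(isinstance(balance_cents, int), "balance must be a whole number of cents")
        check(balance_cents >= 0, "balance must not be negative")
        self.balance_cents = balance_cents

    def withdraw(self, amount_cents: int) -> None:
        # Validate before touching any state, so a failure leaves the account unchanged.
        check(isinstance(amount_cents, int) and amount_cents > 0,
              "amount must be a positive whole number of cents")
        check(amount_cents <= self.balance_cents, "insufficient funds")
        self.balance_cents -= amount_cents
        check(self.balance_cents >= 0, "post-condition: balance went negative")


if __name__ == "__main__":
    account = Account(100)
    account.withdraw(30)
    try:
        account.withdraw(500)
    except InvariantViolation as exc:
        print("stopped early:", exc)
```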

Process Pairs

  • When one process fails, the other process takes over
  • Types of process pairs:

    • Lockstep: both execute every instruction
    • Checkpointing: primary occasionally checkpoints its state, which is copied over to backup

      • Variants: Delta Checkpointing, Kernel Checkpointing
    • Persistence: backup gets all its knowledge from persistent storage

      • Need to ensure persistent storage is never left in an inconsistent state
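
A checkpointing pair can be sketched as below. This is a simplified illustration under stated assumptions: the `Primary`, `Backup`, and `fail_over` names are hypothetical, the "state" is just a running total, and the sketch assumes the original requests are still available so that work done after the last checkpoint can be replayed.

```python
import copy


class Backup:
    """Holds the most recent checkpoint received from the primary."""

    def __init__(self) -> None:
        self.state = {"applied": 0, "total": 0}

    def receive_checkpoint(self, state: dict) -> None:
        self.state = copy.deepcopy(state)


class Primary:
    """Applies requests and copies its state to the backup every few requests."""

    def __init__(self, backup: Backup, checkpoint_every: int = 2) -> None:
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self.state = {"applied": 0, "total": 0}

    def apply(self, amount: int) -> None:
        self.state["total"] += amount
        self.state["applied"] += 1
        if self.state["applied"] % self.checkpoint_every == 0:
            self.backup.receive_checkpoint(self.state)


def fail_over(backup: Backup, requests: list) -> dict:
    """Resume from the last checkpoint and replay the requests it does not cover."""
    state = copy.deepcopy(backup.state)
    for amount in requests[state["applied"]:]:
        state["total"] += amount
        state["applied"] += 1
    return state


if __name__ == "__main__":
    requests = [5, 7, 11]
    backup = Backup()
    primary = Primary(backup, checkpoint_every=2)
    for amount in requests:
        primary.apply(amount)
    # The primary fails here; the third request was never checkpointed.
    print("last checkpoint:", backup.state)
    print("after failover :", fail_over(backup, requests))
```

Checkpointing less often reduces the copying overhead on the primary but leaves more work to redo at failover.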


Transactions

  • Provide the ACID properties: atomicity, consistency, isolation, durability
  • Jim Gray argues for persistent process pairs combined with transactions

    • Implemented in the Encompass system
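
Atomicity is the property that matters most for recovery: a failed operation leaves no partial update behind. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in to show the behaviour; it says nothing about how Encompass was implemented, and the `accounts` table and `transfer` function are made up for illustration.

```python
import sqlite3


def open_bank() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    with conn:
        conn.execute(
            "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
        conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                         [("alice", 100), ("bob", 20)])
    return conn


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` from src to dst: both updates commit, or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")


if __name__ == "__main__":
    conn = open_bank()
    transfer(conn, "alice", "bob", 30)
    print(balances(conn))
    try:
        transfer(conn, "alice", "bob", 500)   # fails midway and is rolled back
    except ValueError as exc:
        print("aborted:", exc)
    print(balances(conn))
```

Because an aborted transaction leaves storage as it was, a backup that rebuilds its knowledge from persistent storage never sees half-finished work.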

Fault-Tolerant Communication

  • Key idea: sessions and sequence numbers
  • Same idea used in TCP
  • Sequence numbers are used to detect duplicate and lost messages
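
A receiver that uses per-session sequence numbers can be sketched as follows. This is a simplification for illustration (one session, one message per sequence number, hypothetical `Receiver` class), not a description of TCP's actual byte-oriented scheme.

```python
class Receiver:
    """Delivers each message of one session exactly once and in order."""

    def __init__(self) -> None:
        self.next_expected = 0
        self.delivered = []

    def on_message(self, seq: int, payload: str) -> str:
        if seq < self.next_expected:
            return "duplicate"   # already delivered: a retransmission, safe to drop
        if seq > self.next_expected:
            return "gap"         # an earlier message was lost: ask the sender to resend
        self.delivered.append(payload)
        self.next_expected += 1
        return "delivered"


if __name__ == "__main__":
    receiver = Receiver()
    for seq, payload in [(0, "a"), (0, "a"), (2, "c"), (1, "b"), (2, "c")]:
        print(seq, payload, "->", receiver.on_message(seq, payload))
    print(receiver.delivered)
```

The sender can safely retransmit after a timeout or a failover, because the receiver recognizes and discards anything it has already delivered.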