Zeyuan Hu's page

"Threads and Input/Output in the Synthesis Kernel"

Problem

Design an OS for a parallel and distributed computational environment and achieve the following three goals:

  • High performance
  • Self-tuning capability to dynamic load and configuration changes
  • A simple, uniform and intuitive model of computation with a high-level interface

System Designs

Dataflow: Synthesis Model of Computation

  • The threads of execution form a directed graph, in which the nodes are threads and the arcs are data flow channels
  • Observation: Data follows a pipeline connects many OS-managed devices/resources (e.g., capture | xform | filter | detect &)

Dataflow example from Synthesis: FS

  • Above shows a dataflow example from Synthesis: file system server. It differs from a traditional design:

    • Boxes are threads/servers
    • Boxes are connected directly with jmp instructions to implement scheduling
    • Arrows (and boxes) are specialized dynamically to the application

Fast context switch: procedure chaining

procedure chaining

  • Picture above shows that the ready-to-run threads are chained in an executable circular queue. A jmp instruction in each context-switch-out procedure of the preceding thread points to the context-switch-in procedure of the follow thread.

  • "Executable data structures": embed code in data structures to avoid data structure traversals and to specialize code for each object (e.g., put context switch code inside of thread control block)

  • Context switch steps: Timer interrup vectored directly to current thread's sw_out; sw_out calls (directly) next thread's sw_in or sw_in_mmu:

    • interrupt vectored to sw_out
    • sw_out saves registers
    • sw_out jumps to next sw_in_mmu
    • sw_in_mmu updates MMU
    • sw_in_mmu updates CPU interrupt vector base
    • sw_in_mmu restores CPU registers (including putting user-PC into the user-PC register)
    • sw_in_mmu does return from exception (replacing PC with user-PC and changing mode back to user mode)

Mechanism to reduce syncrhonization overhead

Lots of techniques used to reduce synchronization overhead

  • Code Isolation: reduce false sharing (i.e., eliminate false sharing within a single C struct) (e.g., thread table entries (TTEs) are not shared. Similar to privatization)

  • Procedure Chaining: use continuations (implemented by changing the return address on the stack) to allow certain services to complete atomically (e.g., defer signal to end of interrupt handling)

  • Optimistic synchronization: it is easier to break the rule and ask forgiveness than get permission. Try the operation, but before commit, check to see if no one else interfered.

  • Lock-free queues that use the compare-and-swap instruction 1. This is not wait-free (some operations do not have bounded waiting time), it is obstruction-free (a thread, executed in isolation for a bounded number of operations will complete).

Remarks

  • Even I cannot fully understand every bit details of the paper, I think it is the best written paper I have read so far in the semester (the flow is great).
  • Many details in the paper are omit: scheduling, interrupt handling, details of lock-free queues (SP-SC, MP-MC,etc)
  • Lots of cool techniques that are worth investigation in its own rights and they are "field openers"

    • Code synthesis (JIT compiler, super-optimizers)
    • Code isolation (Privatization)
    • Procedure chaining (Continuations (lambda, events))
    • Optimistic synchronization (lock-free data structures)
    • Synthesis I/O (Dataflow: Scout, Click Router, SEDA, StageServer, IXP, PTask, etc)

  1. See OSTEP::locks for details on test-and-set and compare-and-swap instruction and their usage in lock implementations. 

comments powered by Disqus