Zeyuan Hu's page

"Dandelion: a Compiler and Runtime for Heterogeneous Systems"

Problem

How to design a system that provides programmability for heterogeneous distributed systems?

  • Challenges:

    • Heterogeneity: different programming models and the need for architecture-specific expertise
    • Distributed resources: data movement, scheduling
    • Concurrency: synchronization, consistency

System Design

Overview

  • Goals: make it simple for programmers to write high-performance applications for heterogeneous systems on a small cluster with GPUs, while leveraging its available resources

    • Single programming interface for clusters with CPUs, GPUs, FPGAs, etc.
    • "Single machine" programming model: programmer writes sequential code
    • Runtime: take the sequential program and do the following whenever possible

      • Parallelize the computation
      • Partition the data
      • Run on all available resources
      • Map computation to the best-suited architecture
  • Workflow

    1. Take the user program and partitioned data files as input
    2. Compile the program to a mix of CPU and GPU code and run it on the cluster (Dandelion's role)
    3. Output the result as partitioned data files
  • Challenges

    • Simple programming model
    • Integrate multiple runtimes efficiently to enable high performance
  • Dandelion innovation by levels

    • Programming languages

      • Adopts the language-integration approach (LINQ): extends LINQ with Dandelion-specific operators (see the query sketch after this list)
      • Constraints: UDFs must be side-effect free; .NET functions that need dynamic memory allocation execute on CPUs only
    • Compilers

      • Automatically compiles a data-parallel program to run on distributed heterogeneous systems
      • Translation of .NET byte-code to multiple backends (e.g., GPU, FPGA)
    • Distributed and parallel runtime

      • Treats the heterogeneous system as the composition of a collection of dataflow engines
      • Three dataflow engines: cluster, multi-core CPU, and GPU (e.g., PTask)
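
To make the programming model concrete, below is a minimal sketch (not from the paper) of what a Dandelion-style query might look like: one k-means assignment step written as ordinary sequential C#. AsDandelion() is the paper's entry-point operator; the stub here is an identity stand-in so the sketch compiles without the real library, and NearestCenter is a hypothetical side-effect-free UDF of the kind the compiler can translate into a GPU kernel.

    using System.Linq;

    static class DandelionSketch
    {
        // Hypothetical stand-in for the paper's AsDandelion() operator so the
        // sketch is self-contained; the real operator hands the query to the
        // Dandelion compiler/runtime instead of returning it unchanged.
        public static IQueryable<T> AsDandelion<T>(this IQueryable<T> source) => source;

        // Side-effect-free UDF with no dynamic memory allocation: the kind of
        // method Dandelion can cross-compile into a CUDA kernel function.
        public static int NearestCenter(float[] v, float[][] centers)
        {
            int best = 0;
            float bestDist = float.MaxValue;
            for (int c = 0; c < centers.Length; c++)
            {
                float d = 0f;
                for (int i = 0; i < v.Length; i++)
                {
                    float diff = v[i] - centers[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }

        // One k-means step written as "single machine" sequential code; the
        // runtime decides how to partition the data and parallelize the query.
        public static IQueryable<IGrouping<int, float[]>> Assign(
            IQueryable<float[]> vectors, float[][] centers) =>
            vectors.AsDandelion()
                   .GroupBy(v => NearestCenter(v, centers));
    }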

Architecture

Dandelion Architecture

  • Two main components

    • Dandelion compiler generates the execution plans and the worker code to be run on the CPUs and GPUs of cluster machines
    • Dandelion runtime uses the execution plans to manage the computation on the cluster (e.g., scheduling, distribution)

Dandelion Compilers

  • Dandelion compiler generates CUDA code and three levels of dataflow graphs to orchestrate the execution

  • Relies on a library of generic primitives (GPU primitive library) to construct execution plans

    • GPU primitive library: for GPU dataflow graph

      • Primitives include low-level building blocks (e.g., parallel scan and hash tables) and high-level primitives for relational operators (e.g., GroupBy and Join)
  • Compiling C# to GPU code

    • Translation performed at .NET byte-code level

      • Map C# types to CUDA structs
      • Translate C# methods into CUDA kernel functions
      • Generate C# code for CPU-GPU serialization/transfer
    • Main constraint: dynamic memory allocation

      • Convert to stack allocation if the object size can be inferred
      • Otherwise, parallelization fails and execution falls back to the host (see the sketch below)
    • Uses the Common Compiler Infrastructure (CCI) framework for the intermediate AST
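
To illustrate the translation rules and the allocation constraint above, here is a hedged sketch (hypothetical names, not the paper's code) of the kinds of C# input the compiler can and cannot handle:

    // A C# value type like this maps directly onto a CUDA struct with the
    // same field layout (the type-mapping step).
    struct Point3
    {
        public float X, Y, Z;
    }

    static class KernelCandidates
    {
        // Translatable: side-effect free, no dynamic allocation. The compiler
        // works from the .NET byte-code of this method to emit a CUDA kernel
        // function, plus C# glue code for CPU-GPU serialization and transfer.
        public static float Dot(Point3 a, Point3 b) =>
            a.X * b.X + a.Y * b.Y + a.Z * b.Z;

        // Dynamic allocation whose size the compiler can infer (a constant 3
        // here), so it can be converted into a stack allocation in the
        // generated kernel.
        public static float NormSq(Point3 p)
        {
            var c = new float[] { p.X, p.Y, p.Z };
            float s = 0f;
            for (int i = 0; i < c.Length; i++) s += c[i] * c[i];
            return s;
        }

        // Not translatable: the allocation size depends on a runtime value,
        // so parallelization fails and this function falls back to the host.
        public static float[] Repeat(Point3 p, int n)
        {
            var r = new float[n];
            for (int i = 0; i < n; i++) r[i] = p.X;
            return r;
        }
    }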

Dandelion Runtime

Dandelion dataflow graph

  • Dataflow graphs

    • Vertex: a fragment of the computation
    • Edge: a communication channel
    • Three levels:

      • cluster level (which machine computes what): the cluster execution engine assigns vertices to available machines and distributes code and graphs, orchestrating the computation
      • machine level (how the computation is done on each machine): each machine executes its own dataflow graph, managing input/output and execution threads
      • GPU level: uses PTask as the GPU dataflow engine
  • The three dataflow graphs shown above form the Dandelion runtime, and their composition forms the global dataflow graph for the entire computation

  • Cluster dataflow engine ("Moxie")

    • Allows the entire computation to stay in memory when the aggregate cluster memory is sufficient (Dandelion assumes a small cluster of powerful machines with GPUs)
    • Holds intermediate data in memory and can checkpoint them to disk (like Spark)
    • Aggressively caches in-memory datasets (including intermediate data)
    • Uses asynchronous checkpoints to support coarse-grained fault tolerance
  • Machine dataflow engine

    • Vertex: a unit of computation that can be executed on a CPU or GPU
    • For CPU: parallelize the computation on multi-core
    • For GPU: dispatch computation to GPU dataflow engine
    • Asynchronous channels are created to transfer data between the CPU and GPU memory spaces (see the channel sketch after this list)
  • GPU dataflow engine

    • Uses the PTask execution engine

      • a token-model dataflow system
      • tasks: the units of computation; nodes in the graph
      • ports: the inputs and outputs of tasks
      • channels: connect ports
      • datablocks: chunks into which data is deserialized; they are moved through channels
      • Computation proceeds by pushing and pulling datablocks to and from channels (akin to futures)
      • A task is ready for execution when all of its input ports have available datablocks and all of its output ports have capacity (see the firing-rule sketch below)
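
For the machine-level engine's asynchronous CPU-GPU channels, a minimal sketch (all names hypothetical), assuming a bounded producer/consumer queue stands in for the channel:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // Minimal sketch of an asynchronous channel between CPU-side producers
    // and a GPU dispatcher thread: the bounded buffer provides back-pressure,
    // and device transfer/execution overlaps with CPU-side production.
    class CpuGpuChannel<T>
    {
        readonly BlockingCollection<T> buffer;

        public CpuGpuChannel(int capacity) =>
            buffer = new BlockingCollection<T>(capacity);

        // CPU side: blocks when the channel is full.
        public void Push(T block) => buffer.Add(block);
        public void CompleteAdding() => buffer.CompleteAdding();

        // GPU side: drains blocks and hands each one to the device (the
        // callback would wrap the actual copy-to-GPU and kernel dispatch).
        public Task StartGpuDispatcher(Action<T> copyToDeviceAndRun) =>
            Task.Run(() =>
            {
                foreach (var block in buffer.GetConsumingEnumerable())
                    copyToDeviceAndRun(block);
            });
    }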
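
And the PTask firing rule in the last bullet can be sketched as follows (hypothetical names; PTask itself is a C/C++ system, this only models the token semantics):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch of PTask's token-model firing rule: ports are bounded queues of
    // datablocks, and a task may fire only when every input port can supply
    // a datablock and every output port has room for one more.
    class Port
    {
        readonly Queue<byte[]> datablocks = new Queue<byte[]>();
        public int Capacity { get; }
        public Port(int capacity) => Capacity = capacity;

        public bool HasDatablock => datablocks.Count > 0;
        public bool HasCapacity => datablocks.Count < Capacity;
        public void Push(byte[] block) => datablocks.Enqueue(block);
        public byte[] Pull() => datablocks.Dequeue();
    }

    class PTaskNode
    {
        public List<Port> Inputs { get; } = new List<Port>();
        public List<Port> Outputs { get; } = new List<Port>();
        public Func<byte[][], byte[][]> Kernel;  // the GPU computation this node wraps

        // The readiness condition from the notes above.
        public bool IsReady =>
            Inputs.All(p => p.HasDatablock) && Outputs.All(p => p.HasCapacity);

        public void Fire()
        {
            byte[][] inputs = Inputs.Select(p => p.Pull()).ToArray();
            byte[][] results = Kernel(inputs);
            for (int i = 0; i < Outputs.Count; i++)
                Outputs[i].Push(results[i]);  // downstream tasks pull from here
        }
    }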