X-ARCC
From Modelado Foundation
X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation | |
---|---|
Team Members | UC Berkeley, LBNL |
PI | Steven Hofmeyr (LBNL) |
Co-PIs | John Kubiatowicz (UCB) |
Website | http://tessellation.cs.berkeley.edu |
Download |
Overview
We are exploring new approaches to Operating System (OS) design for exascale using Adaptive Resource-Centric Computing (ARCC). The fundamental basis of ARCC is dynamic resource allocation for adaptive assignment of resources to applications, combined with Quality-of-Service (QoS) enforcement to prevent interference between components. We have embodied ARCC in Tessellation, an OS designed for multicore nodes. In this project, our goal is to explore the potential for ARCC to address issues in exascale systems. This will require extending Tessellation with new features for multiple nodes, such as multi-node cell synchronization, distributed resource accounting, and topology-aware resource control. Rather than emphasizing component development for an exascale OS, we are focusing our efforts on high-risk, high-reward topics related to novel OS mechanisms and designs.
There are several aspects we are exploring in the context of a multinode Tessellation:
- What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC and Habanero to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for these runtimes on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.
- How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?
- How do we extend hierarchical adaptive resource allocation and control across multiple nodes, including heterogeneous nodes such as the Intel Mic?
- How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?
- What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?
Tessellation
The Tessellation kernel is a lightweight, hypervisor-like layer that provides support for ARCC. It implements cells, along with interfaces for user-level scheduling, resource adaptation and cell composition. Since the software in cells runs entirely in user space, the kernel can enforce resource allocations, e.g. CPU cores, memory pages, without specialized virtualization hardware, but enforcing resource allocations to some resources, such as processor cache slices and memory bandwidth, requires additional hardware support. Tessellation is written from scratch and the prototype runs on both Intel x86 and RAMP architectures.
The Cell Model
Cells provide the basic unit of computation and protection in Tessellation. Cells are performance-isolated resource containers that export their resources to user level. The software running within each cell has full user-level control of the resources assigned to the cell, free from interference from other cells. Application programmers can customize the cell's runtime for their application domains with, for instance, a particular CPU core-scheduling algorithm, or a novel page replacement policy.
The performance isolation of cells is achieved through space-time partitioning, a multiplexing technique that divides the hardware into a sequence of simultaneously resident spatial partitions, as shown in Figure 1. Cells can either have temporally dedicated access to their resources, or be time-multiplexed with other cells, and, depending on the spatial partitioning, both time-multiplexed and non-multiplexed cells may be active simultaneously.
Time multiplexing is implemented using gang-scheduling, to ensure that cells provide their hosted applications an environment that is similar to a dedicated machine. The kernel implements gang-scheduling in a decentralized manner through a set of kernel multiplexer threads muxers, one per hardware thread in the system. The muxers implement the same scheduling algorithm and rely on a high-precision global time-base to simultaneously activate a cell on multiple hardware threads with minimum skew. In the common case the muxers do not need to communicate since they replicate the scheduling decisions of all relevant other muxers; however, the muxers will communicate via IPI multicast in certain cases, e.g., when cells are created or terminated, when resource allocations change, etc.
Applications in Tessellation that span multiple cells communicate via efficient and secure channels. Channels provide fast, user-level asynchronous message-passing between cells. Applications use channels to access standard OS services (e.g. network and file services) hosted in other cells. New composite services are constructed from OS services by wrapping a cell around existing resources and exporting a service interface. Tessellation can support QoS in this service-oriented architecture because the stable environment of the cell is easily combined with a custom user-level scheduler to provide QoS-aware access to the resulting service. With such QoS-aware access providing reproducible service times, applications in cells experience better performance predictability for autotuning, but without sacrificing flexibility in job placement for optimized system usage.
Customized Scheduling: OS Support for Runtime Systems
A major benefit of two-level scheduling is the ability to support different resource-management policies simultaneously. In Tessellation, cells provide their own, possibly highly-customized, user-level runtime systems for processor (thread) scheduling and memory management. Furthermore, each cell's runtime can control the delivery of events, such as timer and device interrupts, inter-cell message notifications, exceptions, and memory faults. Consequently, Tessellation can support a diverse set of runtimes without requiring kernel modifications or users to have root access. For example, for a complex scientific application with in-situ visualization, one cell can run a rendering or decoding component with a real time scheduler, while another runs a throughput oriented compute job with simple run-to-completion scheduling and one thread per core. The ability to support multiple different schedulers simultaneously bypasses the issues that arise when trying to develop a "one size fits all" complex global scheduler (e.g., see the CFS vs BFS debate in the Linux community).
The current Tessellation prototype includes two user-level thread scheduling frameworks: a preemptive one called Pulse (Preemptive User-Level SchEduling), and a cooperative one based on Lithe (LIquid THrEads). With either framework, a cell starts when a single entry point, enter(), is executed simultaneously on each core. After that, the kernel interferes with the cell's runtime only when: 1) the runtime receives events (e.g. interrupts) it has registered for, 2) the cell is suspended and reactivated according to its time-multiplexing policy (which can be time-triggered, event-triggered, or dedicated), or 3) the resources (e.g. hardware threads) assigned to the cell change. For instance, when an interrupt occurs during user-level code execution, the kernel saves the thread context and calls a registered interrupt handler, passing the saved context to the cell's user-level runtime. The cell's runtime can then choose whether to restore the previously running context or swap to a new one.
It is relatively easy to build new runtime scheduling frameworks for Tessellation. The Pulse framework, for example, comprises less than 800 lines of code (LOC); it can serve as a starting point for creating new user-level preemptive schedulers. A new scheduler based on Pulse needs to implement only four callbacks: enter(), mentioned earlier; tick(context), which is called whenever a timer tick occurs and receives the context of the interrupted thread; yield(), called when a thread yields; and done(), called when a thread terminates. The Pulse framework provides APIs to save and restore contexts, and other relevant operations. Since Tessellation's schedulers run at user-level, kernel patches are not required.
Pulse and Tessellation make it easy to implement custom schedulers. For example, we implemented a global round-robin scheduler with mutex and conditional-variable support in about 850 LOC. We also wrote a global earliest deadline first (EDF) scheduler with mutex support and priority-inversion control via dynamic deadline modification in less than 1000 LOC. By contrast, support for EDF in Linux requires kernel modifications and substantially more code: the best-known EDF kernel patch for Linux, SCHED_DEADLINE, has over 3500 LOC in over 50 files. Furthermore, any bugs in SCHED_DEADLINE could cause the kernel to crash and bring down the whole system, whereas bugs in a userspace scheduler in Tessellation will only crash the user-space application using that scheduler.