Actions

X-ARCC: Difference between revisions

From Modelado Foundation

imported>Sonia
(Created page with "{{Infobox project | title = X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation | image = Image goes here | imagecaption = | website = | team-members = | ...")
 
imported>Shofmeyr
No edit summary
 
(47 intermediate revisions by the same user not shown)
Line 1: Line 1:
{{Infobox project
{{Infobox project
| title = X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation
| title = X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation
| image = Image goes here
| image = [[File:xarcc.png|320x300px]]
| imagecaption =  
| website = http://tessellation.cs.berkeley.edu
| website =  
| imagecaption =
| team-members =  
| download = https://bitbucket.org/berkeleylab/crd-x-arcc
| pi =  
| team-members = [http://www.berkeley.edu/ UC Berkeley], [http://www.lbl.gov/ LBNL]
| co-pi = }}
| pi = Steven Hofmeyr (LBNL)
| co-pi = John Kubiatowicz (UCB)}}


Description goes here
== Overview ==
 
We are exploring new approaches to Operating System (OS) design for exascale using ''Adaptive Resource-Centric Computing'' (ARCC). The fundamental basis of ARCC is dynamic resource allocation for adaptive assignment of resources to applications, combined with Quality-of-Service (QoS) enforcement to prevent interference between components. We have embodied ARCC in [http://tessellation.cs.berkeley.edu/publications/pdf/tess-paper-DAC-2013.pdf Tessellation], an OS designed for multicore nodes. In this project, our goal is to explore the potential for ARCC to address issues in exascale systems. This will require extending Tessellation with new features for multiple nodes, such as multi-node cell synchronization, distributed resource accounting, and topology-aware resource control. Rather than emphasizing component development for an exascale OS, we are focusing our efforts on high-risk, high-reward topics related to novel OS mechanisms and designs.
 
There are several aspects we are exploring in the context of a multinode
Tessellation:
 
* What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for this runtime on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.
* How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?
* How do we extend hierarchical adaptive resource allocation and control across multiple nodes?
* How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?
* What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?
 
== Tessellation ==
 
The Tessellation kernel is a lightweight, hypervisor-like layer that provides
support for ARCC. It implements cells, along with interfaces for user-level
scheduling, resource adaptation and cell composition. Since the software in
cells runs entirely in user space, the kernel can enforce resource allocations,
e.g. CPU cores, memory pages, without specialized virtualization hardware, but
enforcing resource allocations to some resources, such as processor cache slices
and memory bandwidth, requires additional hardware support.  Tessellation is written from
scratch and the prototype runs on both Intel x86 and RAMP architectures.
 
=== The Cell Model ===
 
Cells provide the basic unit of computation and protection in Tessellation.
Cells are performance-isolated resource containers that export their resources
to user level. The software running within each cell has full user-level control
of the resources assigned to the cell, free from interference from other cells.
Application programmers can customize the cell's runtime for their application
domains with, for instance, a particular CPU core-scheduling algorithm, or a
novel page replacement policy.
 
The performance isolation of cells is achieved through ''space-time partitioning'', a multiplexing technique that
divides the hardware into a sequence of simultaneously resident spatial
partitions. Cells can either have
temporally dedicated access to their resources, or be time-multiplexed with
other cells, and, depending on the spatial partitioning, both time-multiplexed
and non-multiplexed cells may be active simultaneously.
 
Time multiplexing is implemented using ''gang-scheduling'', to ensure that
cells provide their hosted applications an environment that is similar to a
dedicated machine. The kernel implements gang-scheduling in a decentralized
manner through a set of kernel multiplexer threads ''muxers'', one per
hardware thread in the system. The muxers implement the same scheduling
algorithm and rely on a high-precision global time-base to simultaneously
activate a cell on multiple hardware threads with minimum skew. In the common
case the muxers do not need to communicate since they replicate the scheduling
decisions of all relevant other muxers; however, the muxers will communicate via
IPI multicast in certain cases, e.g., when cells are created or terminated, when
resource allocations change, etc.
 
Applications in Tessellation that span multiple cells communicate via efficient
and secure ''channels''.  Channels provide fast, user-level asynchronous
message-passing between cells. Applications use channels to access standard OS
services (e.g. network and file services) hosted in other cells.  New
''composite'' services are constructed from OS services by wrapping a cell
around existing resources and exporting a service interface.  Tessellation
can support QoS in this ''service-oriented architecture''
because the stable environment of the cell is easily combined with a custom
user-level scheduler to provide QoS-aware access to the resulting service.
With such QoS-aware access providing reproducible service times, applications
in cells experience better performance predictability for autotuning, but
without sacrificing flexibility in job placement for optimized system usage.
 
=== Customized Scheduling: OS Support for Runtime Systems ===
 
A major benefit of two-level scheduling is the ability to support different
resource-management policies simultaneously.  In Tessellation, cells provide
their own, possibly highly-customized, user-level runtime systems for processor
(thread) scheduling and memory management.  Furthermore, each cell's runtime can
control the delivery of events, such as timer and device interrupts, inter-cell
message notifications, exceptions, and memory faults. Consequently, Tessellation
can support a diverse set of runtimes without requiring kernel modifications or
users to have root access. For example, for a complex scientific application
with in-situ visualization, one cell can run a rendering or decoding component
with a real time scheduler, while another runs a throughput oriented compute job
with simple run-to-completion scheduling and one thread per core. The ability to
support multiple different schedulers simultaneously bypasses the issues that
arise when trying to develop a "one size fits all" complex global scheduler
(e.g., see the CFS vs BFS debate in the Linux community).
 
The current Tessellation prototype includes two user-level thread scheduling
frameworks: a ''preemptive'' one called Pulse (Preemptive User-Level
SchEduling), and a ''cooperative'' one based on Lithe (LIquid
THrEads).  With either framework, a cell starts when a single
entry point, <tt>enter()</tt>, is executed simultaneously on each core.  After
that, the kernel interferes with the cell's runtime only when: 1) the runtime
receives events (e.g. interrupts) it has registered for, 2) the cell is suspended
and reactivated according to its time-multiplexing policy (which can be time-triggered, event-triggered, or dedicated), or 3) the resources (e.g. hardware
threads) assigned to the cell change. For instance, when an interrupt occurs
during user-level code execution, the kernel saves the thread context and calls
a registered interrupt handler, passing the saved context to the cell's
user-level runtime. The cell's runtime can then choose whether to restore the
previously running context or swap to a new one.
 
It is relatively easy to build new runtime scheduling frameworks for
Tessellation.  The Pulse framework, for example, comprises less than 800 lines
of code (LOC); it can serve as a starting point for creating new user-level
preemptive schedulers. A new scheduler based on Pulse needs to implement only
four callbacks: <tt>enter()</tt>, mentioned earlier; <tt>tick(context)</tt>,
which is called whenever a timer tick occurs and receives the context of the
interrupted thread; <tt>yield()</tt>, called when a thread yields; and
<tt>done()</tt>, called when a thread terminates.  The Pulse framework provides
APIs to save and restore contexts, and other relevant operations.  Since
Tessellation's schedulers run at user-level, kernel patches are not required.
 
Pulse and Tessellation make it easy to implement custom schedulers.  For
example, we implemented a global round-robin scheduler with mutex and
conditional-variable support in about 850 LOC.  We also wrote a global
''earliest deadline first'' (EDF) scheduler with mutex support and
priority-inversion control via dynamic deadline modification in
less than 1000 LOC. By contrast, support for EDF in Linux requires kernel
modifications and substantially more code: the best-known EDF kernel patch for
Linux, SCHED_DEADLINE, has over 3500 LOC in over 50
files.  Furthermore, any bugs in SCHED_DEADLINE could cause the kernel to crash
and bring down the whole system, whereas bugs in a userspace scheduler in
Tessellation will only crash the user-space application using that scheduler.
 
=== Adaptive Resource Allocation ===
 
Tessellation ''adapts'' resource allocations to provide applications
with QoS while maximizing efficiency in the system. Resources
are allocated and reallocated using a feedback control loop, as shown in
the figure. The allocation decisions are made by the ''Resource Allocation Broker'' (RAB), a broker service which runs in user-space. The RAB uses system-wide goals, resource
constraints, performance targets and performance measurements as inputs to an
optimizer that attempts to simultaneously satisfy multiple application
requirements and competing system-wide goals such as energy efficiency and
throughput. The allocation decisions are then communicated to the kernel and
system services for enforcement.
 
The current implementation of the RAB provides a resource-allocation framework
that supports rapid development and testing of new optimizers. Two different optimizers that we have experimented with are POTA and Pacora.
 
=== QoS-Aware Services ===
 
Cells provide a convenient abstraction for building OS services with QoS
guarantees.  Such services can reside in dedicated cells with exclusive control
of a device to encapsulate user-level device drivers.  We call these cells ''service cells''.  In keeping
with ARCC, we treat services offered by service cells as additional resources
managed by the adaptive resource allocation architecture.  A service cell
arbitrates access to its devices, leveraging the cell's performance isolation
and customizable schedulers to offer QoS guarantees to other cells.  By
encapsulating system services we not only improve performance
predictability, but also enhance tolerance of failures in device drivers and
mitigate the impact of OS noise.
 
Each service in Tessellation comes with a library to facilitate the
development of
client applications.  The client libraries offer friendly, high-level
application programming interfaces (APIs) to manage connections and interact
with the services (i.e. they hide most of the details of inter-cell channel
communication). Those libraries also allow applications to request the QoS
guarantees they need from services.
 
Currently, there are two services implemented for Tessellation that offer QoS
guarantees: a ''Network Service'', which provides access to network
adapters and guarantees that the data flows are processed with the agreed levels
of throughput, and a ''GUI Service'', which provides a windowing system
with response-time guarantees for visual applications.
 
== Research Plan ==
 
Our main focus is using Tessellation to investigate OS support for new
programming models and new use cases and application structures for
supercomputers. We believe that Tessellation is an ideal platform to experiment
with the mechanisms required to support the ''sophisticated applications'' described in the DOE Exascale OS/R report:
 
<blockquote>
  As machines grow larger, applications are not just adding more grid points but
  are adding new modules for improving fidelity with new physics and
  chemistry. Rather than a monolithic source code tree, these sophisticated
  applications are built from multiple libraries, programming models, and
  execution models. Future operating and runtime systems must support the
  composition of several types of software components, from legacy math
  libraries written in MPI to new dynamic task-graph execution engines running
  simultaneously with a data reduction component.
</blockquote>
 
In Tessellation, we envisage these sophisticated applications as spanning
multiple cells, with different cells for different library components, each with
their own scheduling models. Furthermore, support for multiple active cells per
node enables the colocation of components to minimize data movement and avoid
communication. However, Tessellation is currently a single node OS and these
applications will need to scale to large numbers of nodes. Hence, in this
research, our main goal is to determine how adaptive resource control
with QoS guarantees can be scaled across nodes.
 
=== Multinode Synchronized Cells ===
 
Cells in Tessellation are currently a purely local construct, contained within a
single node. Although there are many ways of extending the cell model across
nodes, we will keep cells as fundamentally a local construct, and provide a way
to synchronize cells across nodes. In particular, for temporal multiplexing, we
need a way to be able to synchronize activations to enable inter-node
gang-scheduling. The muxer
algorithm is decentralized, and communication-free in the common case, and thus
naturally lends itself to a multinode environment. The muxer algorithm relies on
synchronized global clocks, which should be achievable on a relatively large
scale, e.g., Jones and Koenig [2013] describe an algorithm that
takes only 2.3 microseconds to synchronize 16,000 12-core nodes. Our goal is to extend
the muxer algorithm to function across nodes; we will determine what mechanisms
need to be implemented to enable this, and how OS components should communicate
and interface across a network.
 
It is frequently possible to provide the same portion of CPU node capacity to a
cell through either temporal multiplexing or spatial partitioning, so it is not
clear a priori that multinode temporal multiplexing is really needed. However,
we wish to add support for multinode temporal multiplexing because we believe
that the temporal dimension enhances flexibility. This could be especially
useful when solving complex problems of multidimensional resource allocation. It could also be useful for improving
performance in certain types of applications, e.g., colocating modeling and
analysis threads on the same cores so that cached data is shared, while still
having separate, customized runtimes for each component. This kind of high-risk,
high-reward research has the potential to open up new, more productive ways of
using large-scale computers; we will not see fundamental advances with only
incremental improvements to existing approaches to OSes.
 
=== Support for New Programming Models and Multicomponent Applications ===
 
Multinode synchronized cells will provide the environment necessary for
exploring the OS support needed for advanced programming models and
sophisticated, multicomponent applications. To enable multinode, large scale
applications, we propose first porting GASNet as a communication layer, and then porting UPC for PGAS
models. We will focus on these frameworks because of time and budget
constraints, and because of our close relationship with the UPC and DEGAS
projects. . Of particular interest is what OS-specific support and
mechanisms may be needed for frameworks such as GASNet and UPC. This aspect of our research will allow us to explore one of the issues raised in the DOE Exascale OS/R
report, namely:
 
<blockquote>
  Operating systems traditionally have mediated and controlled access to all
  resources in the system. However, a variety of new programming models and
  runtimes will have significantly different requirements for resource
  management, which will likely result in inefficiencies if the OS has to
  directly manage all resources. In the future, the application and runtime will
  need increased control over resources such as cores, memory, and power and
  user-space handling of control requests and events.
</blockquote>
 
We will use the newly ported runtimes to explore application designs relevant to
exascale, through the Co-design ''proxy apps''. Several of
these will be easy to port to Tessellation, since we already have UPC versions
of them, e.g., LULESH from ExMatEx, Multigrid from Exact and XSBench from Cesar.  However, to really explore
the potential benefits of the ARCC approach, we will experiment with OS
support for more sophisticated applications, in addition to application kernels.
For this purpose, we will port two of the more complex Co-design Proxy Apps:
CIAN (from the Cesar project), which has multiple components,
including compute-intensive modeling, data-intensive analytics, visualization
and storage; and CoMD (from the ExMatEx project), which includes
''in-situ'' visualization. Not only are we interested in how we can support
the programming models used in these applications, but how we can tie the
visualization, data analytics and storage components into the Tessellation
service framework, including the GUI and Network services, and our proposed
storage services.
 
=== Advanced Memory Management ===
 
Advanced memory management is a key requirement for an exascale OS, as described
in the DOE Exascale OS/R report:
 
<blockquote>
  The OS will need to provide more support for runtime and application
  management of memory (possibly moving traditional OS services, like updating
  translation tables to runtime systems).
</blockquote>
 
In Tessellation, our goal is to give cells as much control over the memory they
own as possible. Currently, Tessellation does not support advanced memory
management features, such as paging or locality-aware allocation. We intend to
develop a user-level paging facility that a cell's runtime can manage with
customized page replacement and locality-aware allocation policies. Customized
memory management can deal with subtle problems that arise in monolithic memory
management, e.g., performance of the LULESH proxy app suffers on Linux because of
the way the kernel zero fills pages before allocating
them. Apart from customization, a key goal
of advanced memory management is to minimize uncontrolled interference between
cells and enable better performance predictability and stronger QoS guarantees.
 
For user-level memory management, the approach we propose to adopt is similar to
self-paging in the Nemesis OS, where all the kernel does is
dispatch fault notifications, and all paging operations are performed at the
application (cell) level. In general, the approach can be implemented entirely
in software, e.g., through para-virtualized interfaces to the page tables, as
used in Xen, but, where available, we will make use of
hardware support such as ''nested page tables'' (AMD) and ''extended page tables'' (Intel).
 
=== Adaptive Resource Allocation on Multinode Systems ===
 
The framework for adaptive resource control that is currently implemented in the
kernel and the Resource Allocation Broker (RAB) is flexible, supporting
pluggable policies. We intend to retain this flexibility for the
multinode version of Tessellation, and intend to explore various methods to
extend the adaptation framework across nodes. In particular, we will to
develop hierarchical resource allocation, with at least two levels: an
intra-node level, and an an inter-node level that is constructed through the
composition of intra-node resource policies. Although exascale systems will
likely need more than two levels to account for additional layers of physical
clustering of resources, we believe that two levels is already a challenge, and
will highlight many of the relevant issues.
 
It is beyond the scope of this project to investigate the various adaptation
policies. Rather, our focus is on providing the OS mechanisms and interfaces
needed for a flexible framework that can be used to explore a variety of
policies in future, e.g., determining how best to allocate multinode resources
to minimize data movement and avoid communication. Included in those mechanisms
will be communication and coordination channels between the inter- and intra-node
levels of the adaptation framework, support for inter-node resource accounting,
and support for topology-awareness.
 
=== Power Awareness ===
 
Power-awareness and control of power as a resource will be another dimension for
the RAB optimizer to incorporate when making resource allocation decisions. We
do not intend to design new power-balancing policies or experiment extensively
with power-aware resource allocation algorithms, but rather to ensure that
Tessellation provides the necessary mechanisms. We will develop the
mechanisms needed to enable Tessellation to accurately attribute power
consumption to each cell, and control the allocation of the "power" resource
by using DVFS controls or processor sleep states.
 
Although control mechanisms are likely straightforward to implement, accurate
resource attribution will be tricky.  We will begin by investigating systems
such as the Intel Sandybridge, which provides support for measuring energy-usage
per socket and DRAM energy usage, via the RAPL
interface. Accessing the RAPL interface is done via
model-specific registers that have to be accessed in ring-0. We propose to
export these interfaces to userspace, specifically, to the RAB so that it can
use power to make allocation decisions. Unfortunately, this power information is
not sufficiently fine-grained to attribute it to cells that span sockets. To
correct this, we will explore various algorithms for deriving the power usage at
the cell level. This can be as simple as proportionally dividing up power
according to CPU usage, or involve more sophisticated approaches based on
performance counter measurements. Which approach is
the best is an open research question that we intend to investigate.
 
=== System Services for Resilience ===
 
A complete exascale OS will need many components to support resilience. Since
many of these components, fault detection for example, may be implemented in the
runtime, we propose to focus our investigation on those performance-critical
components necessary for effective fault recovery that also benefit from
OS-level optimization.  Specifically, we intend to investigate OS abstractions
for durable QoS-guaranteed storage, which is an essential requirement for a
variety of resilience techniques such as traditional checkpoint/restart, faster
log-based recovery, and more recent approaches using Containment
Domains. Although our focus is on storage for resilience,
according to the DOE Exascale OS/R report, QoS-aware I/O services will also be
necessary for exascale tools:
 
<blockquote>
  Another potential overlap in the requirements of tools and applications
  involves quality of service concerns for shared resources. Tools need QoS
  guarantees to ensure that they do not too severely impact application
  execution. A particular area of concern is that tools can have extensive I/O
  requirements.
</blockquote>
 
To date we have developed two QoS-aware system services: a network
service and a GUI service. We intend to implement a framework that will make it
easy to develop QoS-aware services, which we will use for developing two
additional services: a Block Service and an Object Store Service:
 
* The ''Block Service'' will provide block-level operations with additional QoS guarantees to applications or other services. It is an essential basis for advanced memory management, providing swap pager support to  individual cells.
 
* The ''Object Store Service'' (OSS) will provide persistent storage through a simple interface that will enable applications to put (and retrieve) variable sized objects in a flat namespace.  QoS guarantees from the OSS will include such parameters as guaranteed number of cached pages or read and/or commit bandwidth.  This service will extend the traditional on-disk inode structure into a generic persistent local store available to clients (applications or services). Although a POSIX-like Filesystem API will be available for the Object Store Service, clients will also be able to manipulate stored objects outside of the POSIX APIs, to reduce the overhead associated with maintaining POSIX-required metadata.
 
The implementations of these services are sufficient to investigate a wide
variety of optimized storage abstractions for resilience.  Although we hope to
look at more general storage and filesystem issues, we want to concentrate our
efforts on well-defined resilience requirements.  We feel this line of research
is important, since resilient application execution, whether checkpointing,
logging, or replay-based, requires access to durable storage during the
application's critical path; providing QoS to applications with resilience
means not just providing guaranteed access to CPU, cache, memory and network,
but also to the I/O subsystem.
 
<!--
== Timetable ==
 
In this section we present a rough timetable giving details about what work will be performed and
when. All the software developed by this project will be released under an
open-source license and made available to the exascale research community. Any
software that relates to pre-existing code (e.g. device drivers) will maintain
the original licensing.
 
'''Year 1''': In the first year, we will focus on developing the
mechanisms needed for multinode Tessellation, including inter-node communication
channels and protocols (by Q2). Based on this infrastructure, we will extend the
decentralized muxing algorithm across multiple nodes by Q4, implementing global
clock synchronization algorithms as needed. Also in the first year we will begin
the development of the QoS-guaranteed services, with a QoS-aware service
framework by Q4. Finally, we will begin to port
GASNet: specifically, we will aim to get GASNet functional with the SMP conduit,
by Q4.
 
'''Year 2''': During the second year of the project, we will continue
work on the services, with the implementation of the Block Service by Q2 and the Object Store Service by Q4. Furthermore, we will use the Block Store service to develop advanced memory management features,
primarily user-level paging by Q4. Another
objective of year two is to have the UPC versions of the Co-design proxy apps
running by the end of the year (Q4). This will require porting the UPC
runtime (by Q2/Q3), which will depend on the previous port of GASNet. To enable
multinode support, we intend to port the GASNet RDMA conduit by Q4.
 
'''Year 3''': In the final year, we will work on the framework for
multinode adaptation, in particular, by Q2 we hope to develop a global,
multinode RAB service that composes local, node-level RABs. Furthermore, for
support of multinode adaptation, we will develop mechanisms for distributed
resource accounting and attribution, and topology-awareness by Q3. Another
thrust will be to incorporate power metrics into our framework (by Q1) and
mechanisms for power control (e.g. DVFS and sleep states) by Q2. Once these are
done, we will explore models for power attribution to cells during Q3 and
Q4. We will conclude the final year with a software release to the exascale
community.
 
-->


== Team Members ==
== Team Members ==


== Objectives ==
[https://crd.lbl.gov/about/staff/cds/ftg/steven-hofmeyr/ Steven Hofmeyr (LBNL)]
 
== Roadmap ==


== Software Stack ==
[http://www.cs.berkeley.edu/~kubitron/ John Kubiatowicz (UCB)]


== Deliverables ==
[http://crd.lbl.gov/about/staff/cds/ftg/eric-roman/ Eric Roman (LBNL)]

Latest revision as of 22:33, May 13, 2016

X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation
Xarcc.png
Team Members UC Berkeley, LBNL
PI Steven Hofmeyr (LBNL)
Co-PIs John Kubiatowicz (UCB)
Website http://tessellation.cs.berkeley.edu
Download https://bitbucket.org/berkeleylab/crd-x-arcc

Overview

We are exploring new approaches to Operating System (OS) design for exascale using Adaptive Resource-Centric Computing (ARCC). The fundamental basis of ARCC is dynamic resource allocation for adaptive assignment of resources to applications, combined with Quality-of-Service (QoS) enforcement to prevent interference between components. We have embodied ARCC in Tessellation, an OS designed for multicore nodes. In this project, our goal is to explore the potential for ARCC to address issues in exascale systems. This will require extending Tessellation with new features for multiple nodes, such as multi-node cell synchronization, distributed resource accounting, and topology-aware resource control. Rather than emphasizing component development for an exascale OS, we are focusing our efforts on high-risk, high-reward topics related to novel OS mechanisms and designs.

There are several aspects we are exploring in the context of a multinode Tessellation:

  • What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for this runtime on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.
  • How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?
  • How do we extend hierarchical adaptive resource allocation and control across multiple nodes?
  • How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?
  • What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?

Tessellation

The Tessellation kernel is a lightweight, hypervisor-like layer that provides support for ARCC. It implements cells, along with interfaces for user-level scheduling, resource adaptation and cell composition. Since the software in cells runs entirely in user space, the kernel can enforce resource allocations, e.g. CPU cores, memory pages, without specialized virtualization hardware, but enforcing resource allocations to some resources, such as processor cache slices and memory bandwidth, requires additional hardware support. Tessellation is written from scratch and the prototype runs on both Intel x86 and RAMP architectures.

The Cell Model

Cells provide the basic unit of computation and protection in Tessellation. Cells are performance-isolated resource containers that export their resources to user level. The software running within each cell has full user-level control of the resources assigned to the cell, free from interference from other cells. Application programmers can customize the cell's runtime for their application domains with, for instance, a particular CPU core-scheduling algorithm, or a novel page replacement policy.

The performance isolation of cells is achieved through space-time partitioning, a multiplexing technique that divides the hardware into a sequence of simultaneously resident spatial partitions. Cells can either have temporally dedicated access to their resources, or be time-multiplexed with other cells, and, depending on the spatial partitioning, both time-multiplexed and non-multiplexed cells may be active simultaneously.

Time multiplexing is implemented using gang-scheduling, to ensure that cells provide their hosted applications an environment that is similar to a dedicated machine. The kernel implements gang-scheduling in a decentralized manner through a set of kernel multiplexer threads muxers, one per hardware thread in the system. The muxers implement the same scheduling algorithm and rely on a high-precision global time-base to simultaneously activate a cell on multiple hardware threads with minimum skew. In the common case the muxers do not need to communicate since they replicate the scheduling decisions of all relevant other muxers; however, the muxers will communicate via IPI multicast in certain cases, e.g., when cells are created or terminated, when resource allocations change, etc.

Applications in Tessellation that span multiple cells communicate via efficient and secure channels. Channels provide fast, user-level asynchronous message-passing between cells. Applications use channels to access standard OS services (e.g. network and file services) hosted in other cells. New composite services are constructed from OS services by wrapping a cell around existing resources and exporting a service interface. Tessellation can support QoS in this service-oriented architecture because the stable environment of the cell is easily combined with a custom user-level scheduler to provide QoS-aware access to the resulting service. With such QoS-aware access providing reproducible service times, applications in cells experience better performance predictability for autotuning, but without sacrificing flexibility in job placement for optimized system usage.

Customized Scheduling: OS Support for Runtime Systems

A major benefit of two-level scheduling is the ability to support different resource-management policies simultaneously. In Tessellation, cells provide their own, possibly highly-customized, user-level runtime systems for processor (thread) scheduling and memory management. Furthermore, each cell's runtime can control the delivery of events, such as timer and device interrupts, inter-cell message notifications, exceptions, and memory faults. Consequently, Tessellation can support a diverse set of runtimes without requiring kernel modifications or users to have root access. For example, for a complex scientific application with in-situ visualization, one cell can run a rendering or decoding component with a real time scheduler, while another runs a throughput oriented compute job with simple run-to-completion scheduling and one thread per core. The ability to support multiple different schedulers simultaneously bypasses the issues that arise when trying to develop a "one size fits all" complex global scheduler (e.g., see the CFS vs BFS debate in the Linux community).

The current Tessellation prototype includes two user-level thread scheduling frameworks: a preemptive one called Pulse (Preemptive User-Level SchEduling), and a cooperative one based on Lithe (LIquid THrEads). With either framework, a cell starts when a single entry point, enter(), is executed simultaneously on each core. After that, the kernel interferes with the cell's runtime only when: 1) the runtime receives events (e.g. interrupts) it has registered for, 2) the cell is suspended and reactivated according to its time-multiplexing policy (which can be time-triggered, event-triggered, or dedicated), or 3) the resources (e.g. hardware threads) assigned to the cell change. For instance, when an interrupt occurs during user-level code execution, the kernel saves the thread context and calls a registered interrupt handler, passing the saved context to the cell's user-level runtime. The cell's runtime can then choose whether to restore the previously running context or swap to a new one.

It is relatively easy to build new runtime scheduling frameworks for Tessellation. The Pulse framework, for example, comprises less than 800 lines of code (LOC); it can serve as a starting point for creating new user-level preemptive schedulers. A new scheduler based on Pulse needs to implement only four callbacks: enter(), mentioned earlier; tick(context), which is called whenever a timer tick occurs and receives the context of the interrupted thread; yield(), called when a thread yields; and done(), called when a thread terminates. The Pulse framework provides APIs to save and restore contexts, and other relevant operations. Since Tessellation's schedulers run at user-level, kernel patches are not required.

Pulse and Tessellation make it easy to implement custom schedulers. For example, we implemented a global round-robin scheduler with mutex and conditional-variable support in about 850 LOC. We also wrote a global earliest deadline first (EDF) scheduler with mutex support and priority-inversion control via dynamic deadline modification in less than 1000 LOC. By contrast, support for EDF in Linux requires kernel modifications and substantially more code: the best-known EDF kernel patch for Linux, SCHED_DEADLINE, has over 3500 LOC in over 50 files. Furthermore, any bugs in SCHED_DEADLINE could cause the kernel to crash and bring down the whole system, whereas bugs in a userspace scheduler in Tessellation will only crash the user-space application using that scheduler.

Adaptive Resource Allocation

Tessellation adapts resource allocations to provide applications with QoS while maximizing efficiency in the system. Resources are allocated and reallocated using a feedback control loop, as shown in the figure. The allocation decisions are made by the Resource Allocation Broker (RAB), a broker service which runs in user-space. The RAB uses system-wide goals, resource constraints, performance targets and performance measurements as inputs to an optimizer that attempts to simultaneously satisfy multiple application requirements and competing system-wide goals such as energy efficiency and throughput. The allocation decisions are then communicated to the kernel and system services for enforcement.

The current implementation of the RAB provides a resource-allocation framework that supports rapid development and testing of new optimizers. Two different optimizers that we have experimented with are POTA and Pacora.

QoS-Aware Services

Cells provide a convenient abstraction for building OS services with QoS guarantees. Such services can reside in dedicated cells with exclusive control of a device to encapsulate user-level device drivers. We call these cells service cells. In keeping with ARCC, we treat services offered by service cells as additional resources managed by the adaptive resource allocation architecture. A service cell arbitrates access to its devices, leveraging the cell's performance isolation and customizable schedulers to offer QoS guarantees to other cells. By encapsulating system services we not only improve performance predictability, but also enhance tolerance of failures in device drivers and mitigate the impact of OS noise.

Each service in Tessellation comes with a library to facilitate the development of client applications. The client libraries offer friendly, high-level application programming interfaces (APIs) to manage connections and interact with the services (i.e. they hide most of the details of inter-cell channel communication). Those libraries also allow applications to request the QoS guarantees they need from services.

Currently, there are two services implemented for Tessellation that offer QoS guarantees: a Network Service, which provides access to network adapters and guarantees that the data flows are processed with the agreed levels of throughput, and a GUI Service, which provides a windowing system with response-time guarantees for visual applications.

Research Plan

Our main focus is using Tessellation to investigate OS support for new programming models and new use cases and application structures for supercomputers. We believe that Tessellation is an ideal platform to experiment with the mechanisms required to support the sophisticated applications described in the DOE Exascale OS/R report:

As machines grow larger, applications are not just adding more grid points but are adding new modules for improving fidelity with new physics and chemistry. Rather than a monolithic source code tree, these sophisticated applications are built from multiple libraries, programming models, and execution models. Future operating and runtime systems must support the composition of several types of software components, from legacy math libraries written in MPI to new dynamic task-graph execution engines running simultaneously with a data reduction component.

In Tessellation, we envisage these sophisticated applications as spanning multiple cells, with different cells for different library components, each with their own scheduling models. Furthermore, support for multiple active cells per node enables the colocation of components to minimize data movement and avoid communication. However, Tessellation is currently a single node OS and these applications will need to scale to large numbers of nodes. Hence, in this research, our main goal is to determine how adaptive resource control with QoS guarantees can be scaled across nodes.

Multinode Synchronized Cells

Cells in Tessellation are currently a purely local construct, contained within a single node. Although there are many ways of extending the cell model across nodes, we will keep cells as fundamentally a local construct, and provide a way to synchronize cells across nodes. In particular, for temporal multiplexing, we need a way to be able to synchronize activations to enable inter-node gang-scheduling. The muxer algorithm is decentralized, and communication-free in the common case, and thus naturally lends itself to a multinode environment. The muxer algorithm relies on synchronized global clocks, which should be achievable on a relatively large scale, e.g., Jones and Koenig [2013] describe an algorithm that takes only 2.3 microseconds to synchronize 16,000 12-core nodes. Our goal is to extend the muxer algorithm to function across nodes; we will determine what mechanisms need to be implemented to enable this, and how OS components should communicate and interface across a network.

It is frequently possible to provide the same portion of CPU node capacity to a cell through either temporal multiplexing or spatial partitioning, so it is not clear a priori that multinode temporal multiplexing is really needed. However, we wish to add support for multinode temporal multiplexing because we believe that the temporal dimension enhances flexibility. This could be especially useful when solving complex problems of multidimensional resource allocation. It could also be useful for improving performance in certain types of applications, e.g., colocating modeling and analysis threads on the same cores so that cached data is shared, while still having separate, customized runtimes for each component. This kind of high-risk, high-reward research has the potential to open up new, more productive ways of using large-scale computers; we will not see fundamental advances with only incremental improvements to existing approaches to OSes.

Support for New Programming Models and Multicomponent Applications

Multinode synchronized cells will provide the environment necessary for exploring the OS support needed for advanced programming models and sophisticated, multicomponent applications. To enable multinode, large scale applications, we propose first porting GASNet as a communication layer, and then porting UPC for PGAS models. We will focus on these frameworks because of time and budget constraints, and because of our close relationship with the UPC and DEGAS projects. . Of particular interest is what OS-specific support and mechanisms may be needed for frameworks such as GASNet and UPC. This aspect of our research will allow us to explore one of the issues raised in the DOE Exascale OS/R report, namely:

Operating systems traditionally have mediated and controlled access to all resources in the system. However, a variety of new programming models and runtimes will have significantly different requirements for resource management, which will likely result in inefficiencies if the OS has to directly manage all resources. In the future, the application and runtime will need increased control over resources such as cores, memory, and power and user-space handling of control requests and events.

We will use the newly ported runtimes to explore application designs relevant to exascale, through the Co-design proxy apps. Several of these will be easy to port to Tessellation, since we already have UPC versions of them, e.g., LULESH from ExMatEx, Multigrid from Exact and XSBench from Cesar. However, to really explore the potential benefits of the ARCC approach, we will experiment with OS support for more sophisticated applications, in addition to application kernels. For this purpose, we will port two of the more complex Co-design Proxy Apps: CIAN (from the Cesar project), which has multiple components, including compute-intensive modeling, data-intensive analytics, visualization and storage; and CoMD (from the ExMatEx project), which includes in-situ visualization. Not only are we interested in how we can support the programming models used in these applications, but how we can tie the visualization, data analytics and storage components into the Tessellation service framework, including the GUI and Network services, and our proposed storage services.

Advanced Memory Management

Advanced memory management is a key requirement for an exascale OS, as described in the DOE Exascale OS/R report:

The OS will need to provide more support for runtime and application management of memory (possibly moving traditional OS services, like updating translation tables to runtime systems).

In Tessellation, our goal is to give cells as much control over the memory they own as possible. Currently, Tessellation does not support advanced memory management features, such as paging or locality-aware allocation. We intend to develop a user-level paging facility that a cell's runtime can manage with customized page replacement and locality-aware allocation policies. Customized memory management can deal with subtle problems that arise in monolithic memory management, e.g., performance of the LULESH proxy app suffers on Linux because of the way the kernel zero fills pages before allocating them. Apart from customization, a key goal of advanced memory management is to minimize uncontrolled interference between cells and enable better performance predictability and stronger QoS guarantees.

For user-level memory management, the approach we propose to adopt is similar to self-paging in the Nemesis OS, where all the kernel does is dispatch fault notifications, and all paging operations are performed at the application (cell) level. In general, the approach can be implemented entirely in software, e.g., through para-virtualized interfaces to the page tables, as used in Xen, but, where available, we will make use of hardware support such as nested page tables (AMD) and extended page tables (Intel).

Adaptive Resource Allocation on Multinode Systems

The framework for adaptive resource control that is currently implemented in the kernel and the Resource Allocation Broker (RAB) is flexible, supporting pluggable policies. We intend to retain this flexibility for the multinode version of Tessellation, and intend to explore various methods to extend the adaptation framework across nodes. In particular, we will to develop hierarchical resource allocation, with at least two levels: an intra-node level, and an an inter-node level that is constructed through the composition of intra-node resource policies. Although exascale systems will likely need more than two levels to account for additional layers of physical clustering of resources, we believe that two levels is already a challenge, and will highlight many of the relevant issues.

It is beyond the scope of this project to investigate the various adaptation policies. Rather, our focus is on providing the OS mechanisms and interfaces needed for a flexible framework that can be used to explore a variety of policies in future, e.g., determining how best to allocate multinode resources to minimize data movement and avoid communication. Included in those mechanisms will be communication and coordination channels between the inter- and intra-node levels of the adaptation framework, support for inter-node resource accounting, and support for topology-awareness.

Power Awareness

Power-awareness and control of power as a resource will be another dimension for the RAB optimizer to incorporate when making resource allocation decisions. We do not intend to design new power-balancing policies or experiment extensively with power-aware resource allocation algorithms, but rather to ensure that Tessellation provides the necessary mechanisms. We will develop the mechanisms needed to enable Tessellation to accurately attribute power consumption to each cell, and control the allocation of the "power" resource by using DVFS controls or processor sleep states.

Although control mechanisms are likely straightforward to implement, accurate resource attribution will be tricky. We will begin by investigating systems such as the Intel Sandybridge, which provides support for measuring energy-usage per socket and DRAM energy usage, via the RAPL interface. Accessing the RAPL interface is done via model-specific registers that have to be accessed in ring-0. We propose to export these interfaces to userspace, specifically, to the RAB so that it can use power to make allocation decisions. Unfortunately, this power information is not sufficiently fine-grained to attribute it to cells that span sockets. To correct this, we will explore various algorithms for deriving the power usage at the cell level. This can be as simple as proportionally dividing up power according to CPU usage, or involve more sophisticated approaches based on performance counter measurements. Which approach is the best is an open research question that we intend to investigate.

System Services for Resilience

A complete exascale OS will need many components to support resilience. Since many of these components, fault detection for example, may be implemented in the runtime, we propose to focus our investigation on those performance-critical components necessary for effective fault recovery that also benefit from OS-level optimization. Specifically, we intend to investigate OS abstractions for durable QoS-guaranteed storage, which is an essential requirement for a variety of resilience techniques such as traditional checkpoint/restart, faster log-based recovery, and more recent approaches using Containment Domains. Although our focus is on storage for resilience, according to the DOE Exascale OS/R report, QoS-aware I/O services will also be necessary for exascale tools:

Another potential overlap in the requirements of tools and applications involves quality of service concerns for shared resources. Tools need QoS guarantees to ensure that they do not too severely impact application execution. A particular area of concern is that tools can have extensive I/O requirements.

To date we have developed two QoS-aware system services: a network service and a GUI service. We intend to implement a framework that will make it easy to develop QoS-aware services, which we will use for developing two additional services: a Block Service and an Object Store Service:

  • The Block Service will provide block-level operations with additional QoS guarantees to applications or other services. It is an essential basis for advanced memory management, providing swap pager support to individual cells.
  • The Object Store Service (OSS) will provide persistent storage through a simple interface that will enable applications to put (and retrieve) variable sized objects in a flat namespace. QoS guarantees from the OSS will include such parameters as guaranteed number of cached pages or read and/or commit bandwidth. This service will extend the traditional on-disk inode structure into a generic persistent local store available to clients (applications or services). Although a POSIX-like Filesystem API will be available for the Object Store Service, clients will also be able to manipulate stored objects outside of the POSIX APIs, to reduce the overhead associated with maintaining POSIX-required metadata.

The implementations of these services are sufficient to investigate a wide variety of optimized storage abstractions for resilience. Although we hope to look at more general storage and filesystem issues, we want to concentrate our efforts on well-defined resilience requirements. We feel this line of research is important, since resilient application execution, whether checkpointing, logging, or replay-based, requires access to durable storage during the application's critical path; providing QoS to applications with resilience means not just providing guaranteed access to CPU, cache, memory and network, but also to the I/O subsystem.


Team Members

Steven Hofmeyr (LBNL)

John Kubiatowicz (UCB)

Eric Roman (LBNL)