Actions

X-ARCC: Difference between revisions

From Modelado Foundation

imported>Shofmeyr
imported>Shofmeyr
No edit summary
 
(23 intermediate revisions by the same user not shown)
Line 4: Line 4:
| website = http://tessellation.cs.berkeley.edu
| website = http://tessellation.cs.berkeley.edu
| imagecaption =
| imagecaption =
| download =
| download = https://bitbucket.org/berkeleylab/crd-x-arcc
| team-members = [http://www.berkeley.edu/ UC Berkeley], [http://www.lbl.gov/ LBNL]
| team-members = [http://www.berkeley.edu/ UC Berkeley], [http://www.lbl.gov/ LBNL]
| pi = Steven Hofmeyr (LBNL)
| pi = Steven Hofmeyr (LBNL)
Line 16: Line 16:
Tessellation:
Tessellation:


* What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC and Habanero to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for these runtimes on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.  
* What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for this runtime on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.  
* How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?  
* How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?  
* How do we extend hierarchical adaptive resource allocation and control across multiple nodes, including heterogeneous nodes such as the Intel Mic?
* How do we extend hierarchical adaptive resource allocation and control across multiple nodes?
* How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?  
* How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?  
* What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?
* What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?
Line 45: Line 45:
The performance isolation of cells is achieved through ''space-time partitioning'', a multiplexing technique that
The performance isolation of cells is achieved through ''space-time partitioning'', a multiplexing technique that
divides the hardware into a sequence of simultaneously resident spatial
divides the hardware into a sequence of simultaneously resident spatial
partitions, as shown in Figure 1. Cells can either have
partitions. Cells can either have
temporally dedicated access to their resources, or be time-multiplexed with
temporally dedicated access to their resources, or be time-multiplexed with
other cells, and, depending on the spatial partitioning, both time-multiplexed
other cells, and, depending on the spatial partitioning, both time-multiplexed
Line 134: Line 134:
with QoS while maximizing efficiency in the system. Resources
with QoS while maximizing efficiency in the system. Resources
are allocated and reallocated using a feedback control loop, as shown in
are allocated and reallocated using a feedback control loop, as shown in
Figure 2. The allocation decisions are made by the ''Resource Allocation Broker'' (RAB), a broker service which runs in user-space (see
the figure. The allocation decisions are made by the ''Resource Allocation Broker'' (RAB), a broker service which runs in user-space. The RAB uses system-wide goals, resource
Figure 1). The RAB uses system-wide goals, resource
constraints, performance targets and performance measurements as inputs to an
constraints, performance targets and performance measurements as inputs to an
optimizer that attempts to simultaneously satisfy multiple application
optimizer that attempts to simultaneously satisfy multiple application
Line 149: Line 148:
Cells provide a convenient abstraction for building OS services with QoS
Cells provide a convenient abstraction for building OS services with QoS
guarantees.  Such services can reside in dedicated cells with exclusive control
guarantees.  Such services can reside in dedicated cells with exclusive control
of a device to encapsulate user-level device drivers (see
of a device to encapsulate user-level device drivers.  We call these cells ''service cells''.  In keeping
Figure 1).  We call these cells ''service cells''.  In keeping
with ARCC, we treat services offered by service cells as additional resources
with ARCC, we treat services offered by service cells as additional resources
managed by the adaptive resource allocation architecture.  A service cell
managed by the adaptive resource allocation architecture.  A service cell
Line 212: Line 210:
synchronized global clocks, which should be achievable on a relatively large
synchronized global clocks, which should be achievable on a relatively large
scale, e.g., Jones and Koenig [2013] describe an algorithm that
scale, e.g., Jones and Koenig [2013] describe an algorithm that
takes only 2.3 microseconds to synchronize 16,000 12-core nodes. We will extend
takes only 2.3 microseconds to synchronize 16,000 12-core nodes. Our goal is to extend
the muxer algorithm to function across nodes; we will determine what mechanisms
the muxer algorithm to function across nodes; we will determine what mechanisms
need to be implemented to enable this, and how OS components should communicate
need to be implemented to enable this, and how OS components should communicate
Line 235: Line 233:
exploring the OS support needed for advanced programming models and
exploring the OS support needed for advanced programming models and
sophisticated, multicomponent applications. To enable multinode, large scale
sophisticated, multicomponent applications. To enable multinode, large scale
applications, we propose first porting GASNet~\cite{gasnet}\footnote{GASNet
applications, we propose first porting GASNet as a communication layer, and then porting UPC for PGAS
  takes care of a lot of details related to communication, such as RDMA
  support.} as a communication layer, and then porting UPC~\cite{upc} for PGAS
models and Habanero~\cite{habanero} for dynamic tasking and work-stealing
models. We will focus on these frameworks because of time and budget
models. We will focus on these frameworks because of time and budget
constraints, and because of our close relationship with the UPC and DEGAS
constraints, and because of our close relationship with the UPC and DEGAS
projects. Currently, the DEGAS team is moving towards an integrated UPC/Habanero
projects. . Of particular interest is what OS-specific support and
runtime; depending on the timing, we may port the combined runtime, rather than
mechanisms may be needed for frameworks such as GASNet and UPC. This aspect of our research will allow us to explore one of the issues raised in the DOE Exascale OS/R
separate components. Of particular interest is what OS-specific support and
report, namely:
mechanisms may be needed for frameworks such as GASNet and UPC. This proposed
research allows us to explore one of the issues raised in the DOE Exascale OS/R
report~\cite{osr-report}:


\begin{quote}
<blockquote>
   Operating systems traditionally have mediated and controlled access to all
   Operating systems traditionally have mediated and controlled access to all
   resources in the system. However, a variety of new programming models and
   resources in the system. However, a variety of new programming models and
Line 256: Line 248:
   need increased control over resources such as cores, memory, and power and
   need increased control over resources such as cores, memory, and power and
   user-space handling of control requests and events.
   user-space handling of control requests and events.
\end{quote}
</blockquote>


We will use the newly ported runtimes to explore application designs relevant to
We will use the newly ported runtimes to explore application designs relevant to
exascale, through the Co-design \emph{proxy apps}~\cite{codesign}. Several of
exascale, through the Co-design ''proxy apps''. Several of
these will be easy to port to Tessellation, since we already have UPC versions
these will be easy to port to Tessellation, since we already have UPC versions
of them, e.g., LULESH~\cite{lulesh} from ExMatEx, Multigrid~\cite{multigrid}
of them, e.g., LULESH from ExMatEx, Multigrid from Exact and XSBench from Cesar.  However, to really explore
from Exact and XSBench~\cite{xsbench} from Cesar.  However, to really explore
the potential benefits of the ARCC approach, we will experiment with OS
the potential benefits of the ARCC approach, we propose to experiment with OS
support for more sophisticated applications, in addition to application kernels.
support for more sophisticated applications, in addition to application kernels.
For this purpose, we will port two of the more complex Co-design Proxy Apps:
For this purpose, we will port two of the more complex Co-design Proxy Apps:
CIAN~\cite{cian} (from the Cesar project), which has multiple components,
CIAN (from the Cesar project), which has multiple components,
including compute-intensive modeling, data-intensive analytics, visualization
including compute-intensive modeling, data-intensive analytics, visualization
and storage; and CoMD~\cite{comd} (from the ExMatEx project), which includes
and storage; and CoMD (from the ExMatEx project), which includes
\emph{in-situ} visualization. Not only are we interested in how we can support
''in-situ'' visualization. Not only are we interested in how we can support
the programming models used in these applications, but how we can tie the
the programming models used in these applications, but how we can tie the
visualization, data analytics and storage components into the Tessellation
visualization, data analytics and storage components into the Tessellation
service framework, including the GUI and Network services, and our proposed
service framework, including the GUI and Network services, and our proposed
storage services (see Section~\ref{sec:proposed:resilience}).
storage services.


\subsection{Advanced Memory Management}
=== Advanced Memory Management ===


Advanced memory management is a key requirement for an exascale OS, as described
Advanced memory management is a key requirement for an exascale OS, as described
in the DOE Exascale OS/R report~\cite{osr-report}:  
in the DOE Exascale OS/R report:  


\begin{quote}
<blockquote>
   The OS will need to provide more support for runtime and application
   The OS will need to provide more support for runtime and application
   management of memory (possibly moving traditional OS services, like updating
   management of memory (possibly moving traditional OS services, like updating
   translation tables to runtime systems).
   translation tables to runtime systems).
\end{quote}
</blockquote>


In Tessellation, our goal is to give cells as much control over the memory they
In Tessellation, our goal is to give cells as much control over the memory they
Line 294: Line 285:
management, e.g., performance of the LULESH proxy app suffers on Linux because of
management, e.g., performance of the LULESH proxy app suffers on Linux because of
the way the kernel zero fills pages before allocating
the way the kernel zero fills pages before allocating
them~\cite{Liu:2013:NAP:2464996.2465433}. Apart from customization, a key goal
them. Apart from customization, a key goal
of advanced memory management is to minimize uncontrolled interference between
of advanced memory management is to minimize uncontrolled interference between
cells and enable better performance predictability and stronger QoS guarantees.
cells and enable better performance predictability and stronger QoS guarantees.


For user-level memory management, the approach we propose to adopt is similar to
For user-level memory management, the approach we propose to adopt is similar to
self-paging in the Nemesis OS~\cite{selfpaging99}, where all the kernel does is
self-paging in the Nemesis OS, where all the kernel does is
dispatch fault notifications, and all paging operations are performed at the
dispatch fault notifications, and all paging operations are performed at the
application (cell) level. In general, the approach can be implemented entirely
application (cell) level. In general, the approach can be implemented entirely
in software, e.g., through para-virtualized interfaces to the page tables, as
in software, e.g., through para-virtualized interfaces to the page tables, as
used in Xen~\cite{barham03xen}, but, where available, we will make use of
used in Xen, but, where available, we will make use of
hardware support such as \emph{nested page tables} (AMD) and \emph{extended page
hardware support such as ''nested page tables'' (AMD) and ''extended page tables'' (Intel).
  tables} (Intel).


\subsection{Adaptive Resource Allocation on Multinode, Heterogeneous Systems}
=== Adaptive Resource Allocation on Multinode Systems ===


The framework for adaptive resource control that is currently implemented in the
The framework for adaptive resource control that is currently implemented in the
kernel and the Resource Allocation Broker (RAB) is flexible, supporting
kernel and the Resource Allocation Broker (RAB) is flexible, supporting
pluggable policies (we demonstrated two examples of those in
pluggable policies. We intend to retain this flexibility for the
Section~\ref{sec:adaptation}). We intend to retain this flexibility for the
multinode version of Tessellation, and intend to explore various methods to
multinode version of Tessellation, and propose to explore various methods to
extend the adaptation framework across nodes. In particular, we will to
extend the adaptation framework across nodes. In particular, we propose to
develop hierarchical resource allocation, with at least two levels: an
develop hierarchical resource allocation, with at least two levels: an
intra-node level, and an an inter-node level that is constructed through the
intra-node level, and an an inter-node level that is constructed through the
Line 322: Line 311:
will highlight many of the relevant issues.
will highlight many of the relevant issues.


It is beyond the scope of this proposal to investigate the various adaptation
It is beyond the scope of this project to investigate the various adaptation
policies. Rather, our focus is on providing the OS mechanisms and interfaces
policies. Rather, our focus is on providing the OS mechanisms and interfaces
needed for a flexible framework that can be used to explore a variety of
needed for a flexible framework that can be used to explore a variety of
Line 331: Line 320:
and support for topology-awareness.
and support for topology-awareness.


Another important aspect of multinode adaptation will be support for heterogeneous
=== Power Awareness ===
systems. To investigate the challenges involved in this aspect, we intend to
incorporate support for the Intel Mic in Tessellation. This will involve
developing a device driver for the Intel Mic that enables the coprocessor to
boot up and load the coprocessor environment. The device driver we intend to
develop will be similar in functionality to the Linux X100 device
driver~\cite{x100-device-driver}.
 
 
\subsection{Power Awareness}
 
%\green{this should address some of the reviewer comments}


Power-awareness and control of power as a resource will be another dimension for
Power-awareness and control of power as a resource will be another dimension for
Line 348: Line 326:
do not intend to design new power-balancing policies or experiment extensively
do not intend to design new power-balancing policies or experiment extensively
with power-aware resource allocation algorithms, but rather to ensure that
with power-aware resource allocation algorithms, but rather to ensure that
Tessellation provides the necessary mechanisms. We propose to develop the
Tessellation provides the necessary mechanisms. We will develop the
mechanisms needed to enable Tessellation to accurately attribute power
mechanisms needed to enable Tessellation to accurately attribute power
consumption to each cell, and control the allocation of the ``power'' resource
consumption to each cell, and control the allocation of the "power" resource
by using DVFS controls or processor sleep states.  
by using DVFS controls or processor sleep states.  


Line 357: Line 335:
such as the Intel Sandybridge, which provides support for measuring energy-usage
such as the Intel Sandybridge, which provides support for measuring energy-usage
per socket and DRAM energy usage, via the RAPL
per socket and DRAM energy usage, via the RAPL
interface~\cite{rapl}. Accessing the RAPL interface is done via
interface. Accessing the RAPL interface is done via
model-specific registers that have to be accessed in ring-0. We propose to
model-specific registers that have to be accessed in ring-0. We propose to
export these interfaces to userspace, specifically, to the RAB so that it can
export these interfaces to userspace, specifically, to the RAB so that it can
Line 365: Line 343:
the cell level. This can be as simple as proportionally dividing up power
the cell level. This can be as simple as proportionally dividing up power
according to CPU usage, or involve more sophisticated approaches based on
according to CPU usage, or involve more sophisticated approaches based on
performance counter measurements~\cite{power-modeling-1,power-modeling-2}. Which approach is
performance counter measurements. Which approach is
the best is an open research question that we propose to investigate.
the best is an open research question that we intend to investigate.


\subsection{System Services for Resilience}\label{sec:proposed:resilience}
=== System Services for Resilience ===


A complete exascale OS will need many components to support resilience. Since
A complete exascale OS will need many components to support resilience. Since
Line 378: Line 356:
variety of resilience techniques such as traditional checkpoint/restart, faster
variety of resilience techniques such as traditional checkpoint/restart, faster
log-based recovery, and more recent approaches using Containment
log-based recovery, and more recent approaches using Containment
Domains~\cite{sullivan2011cdf}. Although our focus is on storage for resilience,
Domains. Although our focus is on storage for resilience,
according to the DOE Exascale OS/R report, QoS-aware I/O services will also be
according to the DOE Exascale OS/R report, QoS-aware I/O services will also be
necessary for exascale tools:
necessary for exascale tools:


 
<blockquote>
\begin{quote}
   Another potential overlap in the requirements of tools and applications
   Another potential overlap in the requirements of tools and applications
   involves quality of service concerns for shared resources. Tools need QoS
   involves quality of service concerns for shared resources. Tools need QoS
Line 389: Line 366:
   execution. A particular area of concern is that tools can have extensive I/O
   execution. A particular area of concern is that tools can have extensive I/O
   requirements.
   requirements.
\end{quote}
</blockquote>


To date we have developed two QoS-aware system services: a network
To date we have developed two QoS-aware system services: a network
service and a GUI service. We intend to implement a framework that will make it
service and a GUI service. We intend to implement a framework that will make it
easy to develop QoS-aware services, which we will use for developing three
easy to develop QoS-aware services, which we will use for developing two
additional services: a Block Service, an Object
additional services: a Block Service and an Object Store Service:
Store Service, and a Logging Service:


\begin{itemize}
* The ''Block Service'' will provide block-level operations with additional QoS guarantees to applications or other services. It is an essential basis for advanced memory management, providing swap pager support to  individual cells.


\item The \emph{Block Service} will provide block-level operations with
* The ''Object Store Service'' (OSS) will provide persistent storage through a simple interface that will enable applications to put (and retrieve) variable sized objects in a flat namespace.  QoS guarantees from the OSS will include such parameters as guaranteed number of cached pages or read and/or commit bandwidth.  This service will extend the traditional on-disk inode structure into a generic persistent local store available to clients (applications or services). Although a POSIX-like Filesystem API will be available for the Object Store Service, clients will also be able to manipulate stored objects outside of the POSIX APIs, to reduce the overhead associated with maintaining POSIX-required metadata.
  additional QoS guarantees to applications or other services. It is an essential
  basis for advanced memory management, providing swap pager support to
  individual cells.
 
\item The \emph{Object Store Service} (OSS) will provide persistent storage
  through a simple interface that will enable applications to put (and retrieve)
  variable sized objects in a flat namespace.  QoS guarantees from the OSS will
  include such parameters as guaranteed number of cached pages or read and/or
  commit bandwidth.  This service will extend the traditional on-disk inode
  structure~\cite{McKusick96} into a generic persistent local store available to
  clients (applications or services). Although a POSIX-like Filesystem API will
  be available for the ObjectStore Service, clients will also be able to
  manipulate stored objects outside of the POSIX APIs, to reduce the overhead
  associated with maintaining POSIX-required metadata.
 
\item The \emph{Logging Service} will provide durability for journalled metadata
  operations on behalf of filesystem-like operations, much like the journaling
  block device (JBD) used by the ext4 filesystem \cite{Sovani06}. However, the
  Logging Service will differ from JBD in several key respects: It will perform
  both forwards (redo) and backwards (undo) recovery, and it will not be limited
  to physical full-page updates.  Also, the logging service will be open;
  multiple clients, not just the filesystem, will be able to use log services,
  i.e. the logging service can be used directly by applications to implement
  application-specific resilience.
 
\end{itemize}


The implementations of these services are sufficient to investigate a wide
The implementations of these services are sufficient to investigate a wide
Line 436: Line 386:
means not just providing guaranteed access to CPU, cache, memory and network,
means not just providing guaranteed access to CPU, cache, memory and network,
but also to the I/O subsystem.
but also to the I/O subsystem.
<!--
== Timetable ==
In this section we present a rough timetable giving details about what work will be performed and
when. All the software developed by this project will be released under an
open-source license and made available to the exascale research community. Any
software that relates to pre-existing code (e.g. device drivers) will maintain
the original licensing.
'''Year 1''': In the first year, we will focus on developing the
mechanisms needed for multinode Tessellation, including inter-node communication
channels and protocols (by Q2). Based on this infrastructure, we will extend the
decentralized muxing algorithm across multiple nodes by Q4, implementing global
clock synchronization algorithms as needed. Also in the first year we will begin
the development of the QoS-guaranteed services, with a QoS-aware service
framework by Q4. Finally, we will begin to port
GASNet: specifically, we will aim to get GASNet functional with the SMP conduit,
by Q4.
'''Year 2''': During the second year of the project, we will continue
work on the services, with the implementation of the Block Service by Q2 and the Object Store Service by Q4. Furthermore, we will use the Block Store service to develop advanced memory management features,
primarily user-level paging by Q4. Another
objective of year two is to have the UPC versions of the Co-design proxy apps
running by the end of the year (Q4). This will require porting the UPC
runtime (by Q2/Q3), which will depend on the previous port of GASNet. To enable
multinode support, we intend to port the GASNet RDMA conduit by Q4.
'''Year 3''': In the final year, we will work on the framework for
multinode adaptation, in particular, by Q2 we hope to develop a global,
multinode RAB service that composes local, node-level RABs. Furthermore, for
support of multinode adaptation, we will develop mechanisms for distributed
resource accounting and attribution, and topology-awareness by Q3. Another
thrust will be to incorporate power metrics into our framework (by Q1) and
mechanisms for power control (e.g. DVFS and sleep states) by Q2. Once these are
done, we will explore models for power attribution to cells during Q3 and
Q4. We will conclude the final year with a software release to the exascale
community.
-->


== Team Members ==
== Team Members ==

Latest revision as of 22:33, May 13, 2016

X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation
Xarcc.png
Team Members UC Berkeley, LBNL
PI Steven Hofmeyr (LBNL)
Co-PIs John Kubiatowicz (UCB)
Website http://tessellation.cs.berkeley.edu
Download https://bitbucket.org/berkeleylab/crd-x-arcc

Overview

We are exploring new approaches to Operating System (OS) design for exascale using Adaptive Resource-Centric Computing (ARCC). The fundamental basis of ARCC is dynamic resource allocation for adaptive assignment of resources to applications, combined with Quality-of-Service (QoS) enforcement to prevent interference between components. We have embodied ARCC in Tessellation, an OS designed for multicore nodes. In this project, our goal is to explore the potential for ARCC to address issues in exascale systems. This will require extending Tessellation with new features for multiple nodes, such as multi-node cell synchronization, distributed resource accounting, and topology-aware resource control. Rather than emphasizing component development for an exascale OS, we are focusing our efforts on high-risk, high-reward topics related to novel OS mechanisms and designs.

There are several aspects we are exploring in the context of a multinode Tessellation:

  • What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for this runtime on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.
  • How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?
  • How do we extend hierarchical adaptive resource allocation and control across multiple nodes?
  • How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?
  • What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?

Tessellation

The Tessellation kernel is a lightweight, hypervisor-like layer that provides support for ARCC. It implements cells, along with interfaces for user-level scheduling, resource adaptation and cell composition. Since the software in cells runs entirely in user space, the kernel can enforce resource allocations, e.g. CPU cores, memory pages, without specialized virtualization hardware, but enforcing resource allocations to some resources, such as processor cache slices and memory bandwidth, requires additional hardware support. Tessellation is written from scratch and the prototype runs on both Intel x86 and RAMP architectures.

The Cell Model

Cells provide the basic unit of computation and protection in Tessellation. Cells are performance-isolated resource containers that export their resources to user level. The software running within each cell has full user-level control of the resources assigned to the cell, free from interference from other cells. Application programmers can customize the cell's runtime for their application domains with, for instance, a particular CPU core-scheduling algorithm, or a novel page replacement policy.

The performance isolation of cells is achieved through space-time partitioning, a multiplexing technique that divides the hardware into a sequence of simultaneously resident spatial partitions. Cells can either have temporally dedicated access to their resources, or be time-multiplexed with other cells, and, depending on the spatial partitioning, both time-multiplexed and non-multiplexed cells may be active simultaneously.

Time multiplexing is implemented using gang-scheduling, to ensure that cells provide their hosted applications an environment that is similar to a dedicated machine. The kernel implements gang-scheduling in a decentralized manner through a set of kernel multiplexer threads muxers, one per hardware thread in the system. The muxers implement the same scheduling algorithm and rely on a high-precision global time-base to simultaneously activate a cell on multiple hardware threads with minimum skew. In the common case the muxers do not need to communicate since they replicate the scheduling decisions of all relevant other muxers; however, the muxers will communicate via IPI multicast in certain cases, e.g., when cells are created or terminated, when resource allocations change, etc.

Applications in Tessellation that span multiple cells communicate via efficient and secure channels. Channels provide fast, user-level asynchronous message-passing between cells. Applications use channels to access standard OS services (e.g. network and file services) hosted in other cells. New composite services are constructed from OS services by wrapping a cell around existing resources and exporting a service interface. Tessellation can support QoS in this service-oriented architecture because the stable environment of the cell is easily combined with a custom user-level scheduler to provide QoS-aware access to the resulting service. With such QoS-aware access providing reproducible service times, applications in cells experience better performance predictability for autotuning, but without sacrificing flexibility in job placement for optimized system usage.

Customized Scheduling: OS Support for Runtime Systems

A major benefit of two-level scheduling is the ability to support different resource-management policies simultaneously. In Tessellation, cells provide their own, possibly highly-customized, user-level runtime systems for processor (thread) scheduling and memory management. Furthermore, each cell's runtime can control the delivery of events, such as timer and device interrupts, inter-cell message notifications, exceptions, and memory faults. Consequently, Tessellation can support a diverse set of runtimes without requiring kernel modifications or users to have root access. For example, for a complex scientific application with in-situ visualization, one cell can run a rendering or decoding component with a real time scheduler, while another runs a throughput oriented compute job with simple run-to-completion scheduling and one thread per core. The ability to support multiple different schedulers simultaneously bypasses the issues that arise when trying to develop a "one size fits all" complex global scheduler (e.g., see the CFS vs BFS debate in the Linux community).

The current Tessellation prototype includes two user-level thread scheduling frameworks: a preemptive one called Pulse (Preemptive User-Level SchEduling), and a cooperative one based on Lithe (LIquid THrEads). With either framework, a cell starts when a single entry point, enter(), is executed simultaneously on each core. After that, the kernel interferes with the cell's runtime only when: 1) the runtime receives events (e.g. interrupts) it has registered for, 2) the cell is suspended and reactivated according to its time-multiplexing policy (which can be time-triggered, event-triggered, or dedicated), or 3) the resources (e.g. hardware threads) assigned to the cell change. For instance, when an interrupt occurs during user-level code execution, the kernel saves the thread context and calls a registered interrupt handler, passing the saved context to the cell's user-level runtime. The cell's runtime can then choose whether to restore the previously running context or swap to a new one.

It is relatively easy to build new runtime scheduling frameworks for Tessellation. The Pulse framework, for example, comprises less than 800 lines of code (LOC); it can serve as a starting point for creating new user-level preemptive schedulers. A new scheduler based on Pulse needs to implement only four callbacks: enter(), mentioned earlier; tick(context), which is called whenever a timer tick occurs and receives the context of the interrupted thread; yield(), called when a thread yields; and done(), called when a thread terminates. The Pulse framework provides APIs to save and restore contexts, and other relevant operations. Since Tessellation's schedulers run at user-level, kernel patches are not required.

Pulse and Tessellation make it easy to implement custom schedulers. For example, we implemented a global round-robin scheduler with mutex and conditional-variable support in about 850 LOC. We also wrote a global earliest deadline first (EDF) scheduler with mutex support and priority-inversion control via dynamic deadline modification in less than 1000 LOC. By contrast, support for EDF in Linux requires kernel modifications and substantially more code: the best-known EDF kernel patch for Linux, SCHED_DEADLINE, has over 3500 LOC in over 50 files. Furthermore, any bugs in SCHED_DEADLINE could cause the kernel to crash and bring down the whole system, whereas bugs in a userspace scheduler in Tessellation will only crash the user-space application using that scheduler.

Adaptive Resource Allocation

Tessellation adapts resource allocations to provide applications with QoS while maximizing efficiency in the system. Resources are allocated and reallocated using a feedback control loop, as shown in the figure. The allocation decisions are made by the Resource Allocation Broker (RAB), a broker service which runs in user-space. The RAB uses system-wide goals, resource constraints, performance targets and performance measurements as inputs to an optimizer that attempts to simultaneously satisfy multiple application requirements and competing system-wide goals such as energy efficiency and throughput. The allocation decisions are then communicated to the kernel and system services for enforcement.

The current implementation of the RAB provides a resource-allocation framework that supports rapid development and testing of new optimizers. Two different optimizers that we have experimented with are POTA and Pacora.

QoS-Aware Services

Cells provide a convenient abstraction for building OS services with QoS guarantees. Such services can reside in dedicated cells with exclusive control of a device to encapsulate user-level device drivers. We call these cells service cells. In keeping with ARCC, we treat services offered by service cells as additional resources managed by the adaptive resource allocation architecture. A service cell arbitrates access to its devices, leveraging the cell's performance isolation and customizable schedulers to offer QoS guarantees to other cells. By encapsulating system services we not only improve performance predictability, but also enhance tolerance of failures in device drivers and mitigate the impact of OS noise.

Each service in Tessellation comes with a library to facilitate the development of client applications. The client libraries offer friendly, high-level application programming interfaces (APIs) to manage connections and interact with the services (i.e. they hide most of the details of inter-cell channel communication). Those libraries also allow applications to request the QoS guarantees they need from services.

Currently, there are two services implemented for Tessellation that offer QoS guarantees: a Network Service, which provides access to network adapters and guarantees that the data flows are processed with the agreed levels of throughput, and a GUI Service, which provides a windowing system with response-time guarantees for visual applications.

Research Plan

Our main focus is using Tessellation to investigate OS support for new programming models and new use cases and application structures for supercomputers. We believe that Tessellation is an ideal platform to experiment with the mechanisms required to support the sophisticated applications described in the DOE Exascale OS/R report:

As machines grow larger, applications are not just adding more grid points but are adding new modules for improving fidelity with new physics and chemistry. Rather than a monolithic source code tree, these sophisticated applications are built from multiple libraries, programming models, and execution models. Future operating and runtime systems must support the composition of several types of software components, from legacy math libraries written in MPI to new dynamic task-graph execution engines running simultaneously with a data reduction component.

In Tessellation, we envisage these sophisticated applications as spanning multiple cells, with different cells for different library components, each with their own scheduling models. Furthermore, support for multiple active cells per node enables the colocation of components to minimize data movement and avoid communication. However, Tessellation is currently a single node OS and these applications will need to scale to large numbers of nodes. Hence, in this research, our main goal is to determine how adaptive resource control with QoS guarantees can be scaled across nodes.

Multinode Synchronized Cells

Cells in Tessellation are currently a purely local construct, contained within a single node. Although there are many ways of extending the cell model across nodes, we will keep cells as fundamentally a local construct, and provide a way to synchronize cells across nodes. In particular, for temporal multiplexing, we need a way to be able to synchronize activations to enable inter-node gang-scheduling. The muxer algorithm is decentralized, and communication-free in the common case, and thus naturally lends itself to a multinode environment. The muxer algorithm relies on synchronized global clocks, which should be achievable on a relatively large scale, e.g., Jones and Koenig [2013] describe an algorithm that takes only 2.3 microseconds to synchronize 16,000 12-core nodes. Our goal is to extend the muxer algorithm to function across nodes; we will determine what mechanisms need to be implemented to enable this, and how OS components should communicate and interface across a network.

It is frequently possible to provide the same portion of CPU node capacity to a cell through either temporal multiplexing or spatial partitioning, so it is not clear a priori that multinode temporal multiplexing is really needed. However, we wish to add support for multinode temporal multiplexing because we believe that the temporal dimension enhances flexibility. This could be especially useful when solving complex problems of multidimensional resource allocation. It could also be useful for improving performance in certain types of applications, e.g., colocating modeling and analysis threads on the same cores so that cached data is shared, while still having separate, customized runtimes for each component. This kind of high-risk, high-reward research has the potential to open up new, more productive ways of using large-scale computers; we will not see fundamental advances with only incremental improvements to existing approaches to OSes.

Support for New Programming Models and Multicomponent Applications

Multinode synchronized cells will provide the environment necessary for exploring the OS support needed for advanced programming models and sophisticated, multicomponent applications. To enable multinode, large scale applications, we propose first porting GASNet as a communication layer, and then porting UPC for PGAS models. We will focus on these frameworks because of time and budget constraints, and because of our close relationship with the UPC and DEGAS projects. . Of particular interest is what OS-specific support and mechanisms may be needed for frameworks such as GASNet and UPC. This aspect of our research will allow us to explore one of the issues raised in the DOE Exascale OS/R report, namely:

Operating systems traditionally have mediated and controlled access to all resources in the system. However, a variety of new programming models and runtimes will have significantly different requirements for resource management, which will likely result in inefficiencies if the OS has to directly manage all resources. In the future, the application and runtime will need increased control over resources such as cores, memory, and power and user-space handling of control requests and events.

We will use the newly ported runtimes to explore application designs relevant to exascale, through the Co-design proxy apps. Several of these will be easy to port to Tessellation, since we already have UPC versions of them, e.g., LULESH from ExMatEx, Multigrid from Exact and XSBench from Cesar. However, to really explore the potential benefits of the ARCC approach, we will experiment with OS support for more sophisticated applications, in addition to application kernels. For this purpose, we will port two of the more complex Co-design Proxy Apps: CIAN (from the Cesar project), which has multiple components, including compute-intensive modeling, data-intensive analytics, visualization and storage; and CoMD (from the ExMatEx project), which includes in-situ visualization. Not only are we interested in how we can support the programming models used in these applications, but how we can tie the visualization, data analytics and storage components into the Tessellation service framework, including the GUI and Network services, and our proposed storage services.

Advanced Memory Management

Advanced memory management is a key requirement for an exascale OS, as described in the DOE Exascale OS/R report:

The OS will need to provide more support for runtime and application management of memory (possibly moving traditional OS services, like updating translation tables to runtime systems).

In Tessellation, our goal is to give cells as much control over the memory they own as possible. Currently, Tessellation does not support advanced memory management features, such as paging or locality-aware allocation. We intend to develop a user-level paging facility that a cell's runtime can manage with customized page replacement and locality-aware allocation policies. Customized memory management can deal with subtle problems that arise in monolithic memory management, e.g., performance of the LULESH proxy app suffers on Linux because of the way the kernel zero fills pages before allocating them. Apart from customization, a key goal of advanced memory management is to minimize uncontrolled interference between cells and enable better performance predictability and stronger QoS guarantees.

For user-level memory management, the approach we propose to adopt is similar to self-paging in the Nemesis OS, where all the kernel does is dispatch fault notifications, and all paging operations are performed at the application (cell) level. In general, the approach can be implemented entirely in software, e.g., through para-virtualized interfaces to the page tables, as used in Xen, but, where available, we will make use of hardware support such as nested page tables (AMD) and extended page tables (Intel).

Adaptive Resource Allocation on Multinode Systems

The framework for adaptive resource control that is currently implemented in the kernel and the Resource Allocation Broker (RAB) is flexible, supporting pluggable policies. We intend to retain this flexibility for the multinode version of Tessellation, and intend to explore various methods to extend the adaptation framework across nodes. In particular, we will to develop hierarchical resource allocation, with at least two levels: an intra-node level, and an an inter-node level that is constructed through the composition of intra-node resource policies. Although exascale systems will likely need more than two levels to account for additional layers of physical clustering of resources, we believe that two levels is already a challenge, and will highlight many of the relevant issues.

It is beyond the scope of this project to investigate the various adaptation policies. Rather, our focus is on providing the OS mechanisms and interfaces needed for a flexible framework that can be used to explore a variety of policies in future, e.g., determining how best to allocate multinode resources to minimize data movement and avoid communication. Included in those mechanisms will be communication and coordination channels between the inter- and intra-node levels of the adaptation framework, support for inter-node resource accounting, and support for topology-awareness.

Power Awareness

Power-awareness and control of power as a resource will be another dimension for the RAB optimizer to incorporate when making resource allocation decisions. We do not intend to design new power-balancing policies or experiment extensively with power-aware resource allocation algorithms, but rather to ensure that Tessellation provides the necessary mechanisms. We will develop the mechanisms needed to enable Tessellation to accurately attribute power consumption to each cell, and control the allocation of the "power" resource by using DVFS controls or processor sleep states.

Although control mechanisms are likely straightforward to implement, accurate resource attribution will be tricky. We will begin by investigating systems such as the Intel Sandybridge, which provides support for measuring energy-usage per socket and DRAM energy usage, via the RAPL interface. Accessing the RAPL interface is done via model-specific registers that have to be accessed in ring-0. We propose to export these interfaces to userspace, specifically, to the RAB so that it can use power to make allocation decisions. Unfortunately, this power information is not sufficiently fine-grained to attribute it to cells that span sockets. To correct this, we will explore various algorithms for deriving the power usage at the cell level. This can be as simple as proportionally dividing up power according to CPU usage, or involve more sophisticated approaches based on performance counter measurements. Which approach is the best is an open research question that we intend to investigate.

System Services for Resilience

A complete exascale OS will need many components to support resilience. Since many of these components, fault detection for example, may be implemented in the runtime, we propose to focus our investigation on those performance-critical components necessary for effective fault recovery that also benefit from OS-level optimization. Specifically, we intend to investigate OS abstractions for durable QoS-guaranteed storage, which is an essential requirement for a variety of resilience techniques such as traditional checkpoint/restart, faster log-based recovery, and more recent approaches using Containment Domains. Although our focus is on storage for resilience, according to the DOE Exascale OS/R report, QoS-aware I/O services will also be necessary for exascale tools:

Another potential overlap in the requirements of tools and applications involves quality of service concerns for shared resources. Tools need QoS guarantees to ensure that they do not too severely impact application execution. A particular area of concern is that tools can have extensive I/O requirements.

To date we have developed two QoS-aware system services: a network service and a GUI service. We intend to implement a framework that will make it easy to develop QoS-aware services, which we will use for developing two additional services: a Block Service and an Object Store Service:

  • The Block Service will provide block-level operations with additional QoS guarantees to applications or other services. It is an essential basis for advanced memory management, providing swap pager support to individual cells.
  • The Object Store Service (OSS) will provide persistent storage through a simple interface that will enable applications to put (and retrieve) variable sized objects in a flat namespace. QoS guarantees from the OSS will include such parameters as guaranteed number of cached pages or read and/or commit bandwidth. This service will extend the traditional on-disk inode structure into a generic persistent local store available to clients (applications or services). Although a POSIX-like Filesystem API will be available for the Object Store Service, clients will also be able to manipulate stored objects outside of the POSIX APIs, to reduce the overhead associated with maintaining POSIX-required metadata.

The implementations of these services are sufficient to investigate a wide variety of optimized storage abstractions for resilience. Although we hope to look at more general storage and filesystem issues, we want to concentrate our efforts on well-defined resilience requirements. We feel this line of research is important, since resilient application execution, whether checkpointing, logging, or replay-based, requires access to durable storage during the application's critical path; providing QoS to applications with resilience means not just providing guaranteed access to CPU, cache, memory and network, but also to the I/O subsystem.


Team Members

Steven Hofmeyr (LBNL)

John Kubiatowicz (UCB)

Eric Roman (LBNL)