Resilience

From Modelado Foundation

Sonia requested that Andrew Chien initiate this page. For comments, please contact Andrew Chien.

PI
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

Describe your approach to resilience and its dependence on other programming, runtime, or resilience technologies. (i.e., does it use lower-level mechanisms from hardware or lower-level software, depend on higher-level management, or create new mechanisms?)
XPRESS XPRESS will employ micro-checkpointing, which uses a Compute-Validate-Commit cycle bounded by Error-Zones to give localized support for error detection and isolation, as well as diagnosis, correction, and recovery.
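
As a rough illustration of the Compute-Validate-Commit cycle mentioned in the XPRESS answer, the Python sketch below computes on a private copy of the state, validates the result, and commits only if validation passes. The names `micro_checkpoint_step`, `compute`, `validate`, and `max_retries` are hypothetical and are not XPRESS interfaces; the retry policy is an assumption made for the example.

```python
import copy


def micro_checkpoint_step(state, compute, validate, max_retries=3):
    """Run one compute-validate-commit cycle on `state` (hypothetical sketch)."""
    for _ in range(max_retries):
        candidate = compute(copy.deepcopy(state))  # compute on a private copy
        if validate(candidate):                    # a detected error stays local
            return candidate                       # commit: candidate is the new state
    raise RuntimeError("step failed validation; escalate beyond this error zone")


# Usage: a step whose first attempt is corrupted, then succeeds on retry.
calls = {"n": 0}


def flaky_increment(s):
    calls["n"] += 1
    s["x"] += 1
    if calls["n"] == 1:
        s["x"] = -999  # simulated transient corruption
    return s


result = micro_checkpoint_step({"x": 0}, flaky_increment, lambda s: s["x"] >= 0)
```

Because the original state is never modified until a candidate passes validation, a failed attempt costs only the re-execution of that one step.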
TG Not in TG scope.
DEGAS Our approach to resilience comprises three principal technologies. First, Containment Domains (CDs) are an application-facing resilience technology. Second, we introduce error handling and recovery routines into the communications runtime (GASNet-EX) and clients, e.g. UPC or CAF, to handle errors, and return the system to a consistent global state after an error. Third, we provide low-level state preservation, via node-level checkpoints for the application and runtime to complement their native error-handling routines.

Effective use of CDs requires application and library writers to write application-level error detection, state preservation, and fault recovery schemes. For GASNet-EX, we would like mechanisms to identify failed nodes. Although we expect to implement timeout-based failure detection, we would like system-specific RAS functions to provide explicit notification of failures and consensus on such failures, as we expect vendor-provided RAS mechanisms to be more efficient, more accurate, and more responsive than a generic timeout-based mechanism. Finally, our resilience schemes depend on fast durable storage for lightweight state preservation. We are designing schemes that use local or peer storage (disk or memory) for high-volume (size) state preservation and in-memory storage for high-traffic (IOs) state preservation; both require the corresponding hardware to be present. Since our IO bottleneck is largely sequential writes with little or no reuse, almost any form of persistent storage technology is suitable.

D-TEC Dealing with errors in Exascale systems requires a deviation from the traditional approach to fault tolerance models that take an egalitarian view of the importance of errors (all errors are equal) to a utilitarian approach where errors are ranked by their impact on the accuracy of computation.

To deal with this problem, we propose a novel fault tolerance approach in which we view computations as composed of tasks or choices that are accompanied by specifications that define acceptable computational error margins. Tasks critical to the computation indicate low error margins (or potentially no margin for error) and less critical tasks have a wider error margin. Using uncertainty quantification (UQ) techniques, we can propagate these error margins throughout the program, enabling programmers to reason about the effects of failures in terms of an expected error in the resulting computation and selectively apply fault tolerance techniques to components critical to the computation. We are the first to develop a sensitivity analysis framework that can identify critical software components through targeted fault injection. We have demonstrated this functionality by developing a tool that finds critical program regions, for which developers can then produce selective detection and recovery techniques. We are also the first to explore the use of a sensitivity analysis framework to quantify computational uncertainty (i.e., the relationship between input and output) by modeling errors as input uncertainty. Specifically, we have explored how to apply uncertainty quantification techniques to critical program regions so as to enable selective recovery techniques (e.g., result interpolation). We have demonstrated this approach by developing a new programming language, Rely, that enables developers to reason about the quantitative reliability of applications.

DynAX The DynAX project will focus on Resilience in year three. The general approach will be to integrate Containment Domains into the runtime and adapt proxy applications to use them. This will depend on the application developer (or a smart compiler) identifying key points in the control flow where resumption can occur after failure, and identifying the set of data which must be saved and restored in order to do so. It also relies on some mechanism to determine when a failure has occurred and which tasks were affected. This will vary from one system and type of failure to the next; examples include a hardware ECC failure notification, a software data verification step that detects invalid output, or even a simple timeout. Finally, it will require a way for the user to provide some criteria for how often to save resilience information; the runtime may choose to snapshot at every opportunity it is given, or it might only do a subset of those, to meet some power/performance goal.
X-TUNE
GVR GVR depends on MPI3 and lower level storage (memory, nvram, filesystem) services. It is intended as a flexible portable library, so these dependences are intentionally minimal, and likely well below the requirements of any programming system or library or application that it might be embedded into. So in short, it effectively adds no dependences. GVR provides a portable, versioned distributed array abstraction... which is reliable. An application can use one or many of these, and version them at different cadences. This can be used by libraries (e.g. demonstrated with Trilinos), programming systems, or applications to create resilient applications. Because versioning can be controlled, these systems can manage both the overheads and the resilience coverage as needed. Because error checking and recovery can be controlled by applications, GVR allows applications to become increasingly reliable based on application semantics (and minimal code change), and portably (the investment is preserved over different platforms).
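
To make the idea of an application-controlled versioning cadence concrete, here is a toy, single-process Python sketch. It is not the GVR API: `VersionedArray`, `snapshot`, `restore`, and `run` are hypothetical names, the array lives in local memory rather than being distributed, and re-execution of work lost since the last version is omitted for brevity.

```python
class VersionedArray:
    """Toy in-memory stand-in for a versioned array (not the GVR API)."""

    def __init__(self, data):
        self.current = list(data)
        self.versions = []  # committed snapshots, oldest first

    def snapshot(self):
        """Record the current contents as a new version; return its index."""
        self.versions.append(list(self.current))
        return len(self.versions) - 1

    def restore(self, version=-1):
        """Roll the current contents back to a version (default: the latest)."""
        self.current = list(self.versions[version])


def run(array, steps, version_every, is_valid):
    """Apply steps, versioning every `version_every` steps; roll back on a failed check."""
    rollbacks = 0
    array.snapshot()
    for i, step in enumerate(steps, start=1):
        step(array.current)
        if not is_valid(array.current):
            array.restore()       # application-chosen recovery: latest good version
            rollbacks += 1
        elif i % version_every == 0:
            array.snapshot()      # application-chosen versioning cadence
    return rollbacks


def add_one(a):
    for i in range(len(a)):
        a[i] += 1


def corrupt(a):
    a[0] = float("nan")  # simulated silent data corruption


arr = VersionedArray([0, 0])
rollbacks = run(arr, [add_one, add_one, corrupt, add_one],
                version_every=2, is_valid=lambda a: all(x == x for x in a))
```

In this sketch `version_every` is the single knob that trades overhead against the amount of work exposed to loss, which mirrors the kind of control the answer describes.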
CORVETTE
SLEEC We do not create any new mechanisms. However, we might expect some information from lower level mechanisms (e.g., whether a particular library method might be re-executed for resilience purposes) to inform our cost models and optimizations.
PIPER Resilience is important for the data gathering and aggregation mechanism. When layered on top of a new RTS, we require proper notification and controllable fault isolation (e.g., for process failures in aggregation trees). Infrastructures like MRNet already provide basic support for this.
Charm++ The Charm++ resilience support in production use includes in-remote-memory checkpoints and buddy-based automatic failure detection for fail-stop failures. Proactive schemes evacuate a processor when notified of impending failures. The protocols depend on the runtime for monitoring and object migration. A scalable message-logging scheme is also supported where appropriate. Soft error recovery is a work in progress: experimental versions of replication-based strategies are available, while other methods are planned. These depend on over-decomposition provided by the programming models and supported by the RTS.
Early Career-SriramK We focus on selective localized recovery from faults. This involves tracking ongoing execution progress and identifying the state to be recovered and the tasks to be re-executed to tolerate faults. Dynamic load balancing adapts the execution around faults. We assume the presence of a failure detector, either in hardware or in software. While the recovery techniques do not rely on frequent collective checkpoints, checkpoints act as a backstop when the information tracked is insufficient to effect a localized recovery.
One challenging problem for Exascale systems is that projections of soft error rates and hardware lifetimes (wearout) span a wide range from a modest increase over current systems to as much as a 100-fold increase. How does your system scale in resilience to ensure effective exascale capabilities on both the varied systems that are likely to exist and varied operating points (power, error rate)?
XPRESS Both hardware and software errors are treated in the same manner for error detection, isolation, and diagnosis. Recovery differs if the error is transient, a software bug, or a hardware error. Because the method is localized as in (1), it is believed to be scalable into the billion-way parallelism era. However, this has to be demonstrated through experience. The methods assume a Poisson error distribution.
TG Not in TG scope.
DEGAS Our resilience technologies provide tremendous flexibility in handling faults. In our hybrid user-level and system-level resilience scheme, CDs provide lightweight error recovery that enables the application to isolate the effects of faults to specific portions of the code, thus localizing error recovery. With nested CDs, an inner CD can decide how to handle an error, or propagate the error to a parent CD. If no CD handles the error locally, we use a global rollback to hide the fault. With this approach, the use of local CDs for isolated recovery limits the global restart rate.
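
The nesting and escalation behavior described above can be sketched in a few lines of Python. This is an illustrative toy, not the Containment Domains API: the class `ContainmentDomain`, its `run` method, and the `max_retries` parameter are hypothetical, and an exception leaving the outermost domain stands in for the global rollback.

```python
import copy


class ContainmentDomain:
    """Toy containment domain: preserve state on entry, retry locally, else escalate."""

    def __init__(self, name, max_retries=1, log=None):
        self.name = name
        self.max_retries = max_retries
        self.log = log if log is not None else []

    def run(self, state, body):
        preserved = copy.deepcopy(state)           # state preservation on entry
        for attempt in range(self.max_retries + 1):
            try:
                return body(copy.deepcopy(preserved))
            except RuntimeError as err:            # error detected inside this domain
                self.log.append((self.name, attempt, str(err)))
        raise RuntimeError(f"{self.name}: unrecovered")  # propagate to the parent CD


# Usage: a leaf computation that fails three times before succeeding.
log = []
failures_left = {"n": 3}


def leaf(state):
    if failures_left["n"] > 0:
        failures_left["n"] -= 1
        raise RuntimeError("simulated fault")
    state["done"] = True
    return state


inner = ContainmentDomain("inner", max_retries=1, log=log)
outer = ContainmentDomain("outer", max_retries=1, log=log)
result = outer.run({"done": False}, lambda s: inner.run(s, leaf))
```

Here the inner domain exhausts its local retries and escalates once; the outer domain then re-executes the inner one from its own preserved state, so the fault never reaches the top level.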
D-TEC Our ability to reason about the importance of bugs enables us to develop selective detection and recovery techniques, which allow our system to focus only on critical faults. For example, if a task fails, then depending on where this task occurs within the overall computation, it may be possible to simply tolerate the fault. If the task's result is an input to a large reduction node, and the node can, for instance, tolerate losing 10% of its input tasks and still provide acceptable accuracy, then the task does not need recovery.

If an assembly of tasks leads to accuracy degradation, then the system can provide alternative implementations that use fault tolerance techniques, such as replication, to the optimization management framework. This, in combination with the runtime system, will ensure the overall system goals are met despite failures.
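
The reduction example in this answer can be illustrated with a minimal Python sketch. The function name `tolerant_mean`, the use of `None` to mark a failed task, and the choice of a mean as the reduction are assumptions for illustration only; the 10% threshold comes from the example in the text.

```python
def tolerant_mean(results, max_loss_fraction=0.10):
    """Average task results, tolerating up to `max_loss_fraction` failed tasks.

    A failed task is represented by None (an assumption of this sketch).
    """
    survivors = [r for r in results if r is not None]
    lost = len(results) - len(survivors)
    if not survivors or lost > max_loss_fraction * len(results):
        # Too many inputs lost: accuracy can no longer be assumed acceptable.
        raise RuntimeError(f"{lost} of {len(results)} inputs lost; recovery required")
    return sum(survivors) / len(survivors)


# Usage: 1 of 20 inputs lost (5%) is tolerated without recovery.
value = tolerant_mean([1.0] * 19 + [None])

# 3 of 20 inputs lost (15%) exceeds the margin and triggers recovery.
try:
    tolerant_mean([1.0] * 17 + [None] * 3)
    needs_recovery = False
except RuntimeError:
    needs_recovery = True
```

Only when the loss exceeds the declared margin does the reduction ask for recovery, which is where an alternative, more fault-tolerant implementation such as replication would come in.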

DynAX In a recursive tree hierarchy of containment domains, each level will have successively finer grained resilience, with fewer affected tasks, and less cost of redundant computation if the domain restarts due to failure. Containment domains are isolated from each other, so there is no extra communication / synchronization necessary between them. The 3 dependencies mentioned above (identification of containment domains, identification of failures, and the cost function) should be equally applicable to any exascale system. The only hardware dependence we anticipate is the hardware's ability to identify failures as they occur, and the details of how each failure is reported. The system can be tuned to act more conservatively, or less conservatively, by ignoring some containment domains and enforcing others. The most conservative approach is to back up the necessary data for, and check results of, every containment domain in the application. The least conservative case is to only act on the outer-most containment domain, i.e. simply retain the program image and the parameters it was launched with, and if a failure occurs anywhere in the application, the entire thing is thrown out and starts again from the beginning. Reality is usually somewhere between those two extremes... the runtime can pick and choose which containment domains to act on, based on cost/benefit analysis (which I will discuss further in the question below on application semantics information).
X-TUNE
GVR GVR uses versioning as the primary basis for error checking and recovery. Applications and programming systems can control the frequency of versioning, error checking, and recovery approach to adapt to the underlying error rate. As this is under application control, the application programmer is armed both with the ability to control/manage overhead as well as error coverage and to do so portably. The ideal outcome is a portable application that runs effectively over a 100x or larger dynamic range of errors with no more than one or a few parameter changes.
CORVETTE
SLEEC N/A
PIPER N/A
Charm++ The control system observes the error rates and tunes its behavior (parameters for now, and in future, choice of strategies) accordingly. The simplest example is checkpoint periodicity, controlled by observations of fail-stop failures. Message-logging can be used if failures are frequent, because it avoids rolling back all the processors. Object based checksums are used to contain corruption while the object is passive.
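
One way to picture checksums that guard an object while it is passive is the Python sketch below. It is not Charm++ code: `GuardedObject`, `deactivate`, and `activate` are hypothetical names, and CRC32 over a pickled copy is simply a convenient stand-in for whatever checksum a real runtime would use.

```python
import pickle
import zlib


class GuardedObject:
    """Wraps an object's state with a checksum taken whenever it goes passive."""

    def __init__(self, state):
        self.state = state
        self._crc = None

    def deactivate(self):
        """Record a checksum of the state when the object becomes passive."""
        self._crc = zlib.crc32(pickle.dumps(self.state))

    def activate(self):
        """Verify the checksum before the object is used again."""
        if self._crc is not None and zlib.crc32(pickle.dumps(self.state)) != self._crc:
            raise RuntimeError("state changed while passive: corruption detected")
        self._crc = None
        return self.state


# Usage: an untouched object passes; a silently modified one is caught.
clean = GuardedObject([1, 2, 3])
clean.deactivate()
clean_state = clean.activate()

corrupted = GuardedObject([1, 2, 3])
corrupted.deactivate()
corrupted.state[1] = 99  # simulated memory corruption while passive
try:
    corrupted.activate()
    detected = False
except RuntimeError:
    detected = True
```

Detecting the corruption at activation time keeps it contained to one object, which can then be restored from a checkpoint instead of letting the bad value propagate.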
Early Career-SriramK The schemes can be tuned to meet the resilience needs. In particular, the overhead during normal execution can be decreased at the expense of increased penalty incurred on a failure. The approach can also reduce to checkpointing when the task-parallel phases are too small compared to node/system MTBF.
What opportunities are there to improve resilience or efficiency of resilience by exporting/exploiting runtime or application semantics information in your system?
XPRESS Application semantics can provide myriad simple tests of correctness that will detect errors early and limit their propagation throughout the computation through better isolation.
TG Not in TG scope.
DEGAS In DEGAS, applications express semantic information required for resilience through the CD API; that's the point of CDs. The CD hierarchy, i.e. boundary and nesting structure, describes the communication dependencies of the application in a way that is not visible to the underlying runtime or communication systems. In contrast, a transparent checkpoint scheme must discover the dependency structure of an application by tracking (or preventing) receive events to construct a valid recovery line. The CD structure makes such schemes unnecessary, as the recovery line is described by the CD hierarchy.

For applications that do not use CDs, we fall back to transparent checkpoints. Here, we rely on the runtime to discover communication dependencies, advance the recovery line, and perform rollback propagation as required. One clear opportunity to improve efficiency for resilience is by exploiting the one-sided semantics of UPC to track rollback dependencies. We are starting by calculating message (receive) dependencies inside the runtime, but we are interested in tracking memory (read/write) dependencies as a possible optimization. A second area for improved efficiency is in dynamically adjusting the checkpoint interval (recovery line advance) according to job sizes at runtime. An application running on 100k nodes needs to checkpoint only half as often as one running on 400k nodes; likewise for error rates, tolerating 3/4 of the errors halves the checkpoint overhead. Such considerations suggest that static (compile-time) resilience policies should be supplemented with runtime policies that take into account the number of nodes, failure (restart) rates, checkpoint times, memory size, etc. to ensure good efficiency.
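
The two ratios quoted above (half as often, half the overhead) are what one obtains from Young's first-order approximation of the optimal checkpoint interval, tau = sqrt(2 * delta * M), where delta is the checkpoint time and M is the system MTBF, assumed here to scale as node MTBF divided by node count. The text does not say which model the authors used, so treat the Python sketch below as one consistent reading; the node MTBF and checkpoint time are illustrative assumptions and do not affect the ratios.

```python
import math


def optimal_interval(checkpoint_time, system_mtbf):
    """Young's first-order optimal checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_time * system_mtbf)


def overhead_fraction(checkpoint_time, system_mtbf):
    """Approximate time lost to checkpoints plus expected rework, at the optimal interval."""
    tau = optimal_interval(checkpoint_time, system_mtbf)
    return checkpoint_time / tau + tau / (2.0 * system_mtbf)


NODE_MTBF_S = 5.0 * 365 * 24 * 3600  # assumed: 5 years per node (illustrative)
CHECKPOINT_S = 5.0                   # assumed: 5 s per checkpoint (illustrative)

mtbf_100k = NODE_MTBF_S / 100_000    # system MTBF shrinks with node count
mtbf_400k = NODE_MTBF_S / 400_000

# How much longer the 100k-node job can run between checkpoints.
interval_ratio = (optimal_interval(CHECKPOINT_S, mtbf_100k)
                  / optimal_interval(CHECKPOINT_S, mtbf_400k))

# Tolerating 3/4 of the errors is modeled as a 4x longer effective MTBF.
overhead_ratio = (overhead_fraction(CHECKPOINT_S, 4 * mtbf_400k)
                  / overhead_fraction(CHECKPOINT_S, mtbf_400k))
```

Under this model `interval_ratio` is 2 (the 100k-node job checkpoints half as often) and `overhead_ratio` is 0.5, because both the interval and the overhead depend on the square root of the MTBF.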

D-TEC Application semantic information can be used to implement efficient, targeted ABFT. When combined with our reliability analysis, these techniques can be used to selectively detect and recover from critical faults. Information from the runtime system can enable our system to dynamically adapt its resiliency posture in order to meet accuracy/performance requirements.
DynAX The runtime decides whether or not to act on each individual containment domain. It chooses which containment domains to act upon, based on tuning parameters from the user and based on the data size and the estimated amount of parallel computation involved in completing the work associated with a containment domain. The data size is easy to determine at runtime, and the execution information can be provided by runtime analysis of previous tasks, or by hints from the application developer or smart compiler. Using that information, and resilience tuning parameters provided by the user, the runtime can do simple cost/benefit analysis to determine which containment domains are the best ones to act upon to achieve the user's performance/resilience goal.
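
A minimal Python sketch of such a cost/benefit decision follows. The cost model is an assumption made for illustration, not the DynAX runtime's actual policy: preservation cost is taken as data size divided by write bandwidth, and the expected saving as failure probability times the work that would be re-executed, scaled by a user-supplied `weight`. All names (`Domain`, `select_domains`, `failure_rate`, `write_bandwidth`, `weight`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    data_bytes: int         # size of the data to preserve on entry
    compute_seconds: float  # estimated work lost if this domain must re-execute


def select_domains(domains, failure_rate, write_bandwidth, weight=1.0):
    """Return names of containment domains worth acting on (assumed cost model).

    failure_rate is failures per second; write_bandwidth is bytes per second;
    weight > 1 biases the choice toward resilience, weight < 1 toward performance.
    """
    chosen = []
    for d in domains:
        preserve_cost = d.data_bytes / write_bandwidth
        failure_prob = min(1.0, failure_rate * d.compute_seconds)
        expected_saving = failure_prob * d.compute_seconds * weight
        if expected_saving > preserve_cost:
            chosen.append(d.name)
    return chosen


domains = [
    Domain("outer", data_bytes=8_000_000_000, compute_seconds=3600.0),
    Domain("inner", data_bytes=1_000_000, compute_seconds=1.0),
]

default_choice = select_domains(domains, failure_rate=1e-4, write_bandwidth=1e9)
cautious_choice = select_domains(domains, failure_rate=1e-4, write_bandwidth=1e9,
                                 weight=100.0)
```

With the default weight only the long-running outer domain is worth snapshotting; raising the weight makes the runtime act on the small inner domain as well, which is the kind of user tuning the answer describes.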
X-TUNE
GVR GVR allows applications and systems software to control error checking and handling (recovery), thus a full range of application semantics (algorithmic, data structure, physics, etc.) and system correctness semantics can be exploited. We have implemented full ABFT checkers and recovery using GVR.
CORVETTE
SLEEC We could integrate resilience or accuracy information into our cost models to drive transformations (e.g., to tune library parameters to hit an overall accuracy target).
PIPER The aggregation runtime will likely be highly structured (e.g., use tree overlay networks), which could be exploited for resilience.
Charm++ Our system relies on over-decomposition provided by the application, and instrumentation provided by the runtime, to optimize its resiliency protocols. In addition, work is under way to separate application object memory into different compartments requiring different degrees of resilience. This requires some support from the application and/or compiler.
Early Career-SriramK The resilience techniques are closely related to the runtime and application semantics exposed by the user. In particular, the task-parallel program specification and properties of the task-parallel abstraction allow us to constrain the inter-task and task-data relationships to be tracked and recovered.
What capabilities provided by resilience researchers (software or hardware) could have a significant impact on the capabilities or efficiency of resilience? Where does resilience fit into the X-stack runtime abstract architecture?
XPRESS Numerical correctness of calculations in a non-deterministic scheduling context would benefit from knowledge about floating-point corruption due to order of actions.
TG Not in TG scope.
DEGAS Fast, durable storage is a key technology for increasing the efficiency of resilience. In DEGAS we are interested in both bulk storage and logging storage. The requirements for these differ slightly, in that we use logging for small high-frequency updates, possibly as much as one log entry per message, but we use the bulk storage for large infrequent updates, as these are used for checkpoints.

We are exploring non-inhibitory consistency algorithms, i.e. algorithms that allow messages to remain outstanding while a recovery line is advanced. We face a challenge right now in that it is difficult to order RDMA operations with respect to other message operations on a given channel. In the worst case, we are required to completely drain network channels and globally terminate communications in order to establish global consistency, i.e. agreement on which messages have been sent and received. A hardware capability that may prove valuable here is a network-level point-to-point fence operation. A similar issue on InfiniBand networks is that the current InfiniBand driver APIs require us to fully tear down network connections in order to put the hardware into a known physical state with respect to the running application process, i.e. we need to shut down the NIC to ensure that it doesn't modify process memory while we determine local process state. A lightweight method of disabling NIC transfers (mainly RDMA) would eliminate this teardown requirement.

D-TEC Our programming language, Rely, defines a static quantitative reliability analysis that verifies quantitative requirements on the reliability of Rely programs, enabling a developer to perform sound and verified reliability engineering. The analysis takes a Rely program with a reliability specification and a hardware specification (which characterizes the reliability of the underlying hardware components) and verifies that the program satisfies its reliability specification when executed on the underlying unreliable hardware platform. Naturally, better models of the hardware, and hardware support for soft-error detection (e.g., TMR), can help improve the accuracy of our system.
DynAX If the hardware can reliably detect and report soft failures (such as errors in floating point instructions), this would avoid running extra application code to check the results, which is sometimes rather costly. Reliable, slower memory for storing snapshot data would also be beneficial. This memory would be written with snapshot data every time a containment domain was entered, but only read while recovering from a failure, so the cost of the write is much more important than the cost of the read.
X-TUNE
GVR Cheap non-volatile storage, inexpensive error checking, exposure of "partially correct" states to higher levels, ... would all increase the scope of errors that GVR and applications could handle.
CORVETTE
SLEEC N/A
PIPER N/A
Charm++ The first question is unclear, for software. For hardware, some useful capabilities are: quick detection and flexible notification of errors, more robust memory regions for control variables, sensors for early notification of impending component failures.

The need for (and the complexity of) resilience will depend on the direction taken by the exascale hardware (viz. how much reliability gets sacrificed for power-performance). It is desirable, but challenging, to separate the resilience protocols modularly from the rest of the runtime.

Early Career-SriramK Efficient low-overhead error detection can enable recovery to be performed at various layers of the software stack. Runtime support for resilience is a non-trivial challenge and, if desired, needs to be planned from the beginning.