Resilience
From Modelado Foundation
Sonia requested that Andrew Chien initiate this page. For comments, please contact Andrew Chien.
QUESTIONS | XPRESS | TG X-Stack | DEGAS | D-TEC | DynAX | X-TUNE | GVR | CORVETTE | SLEEC | PIPER |
---|---|---|---|---|---|---|---|---|---|---|
PI | Ron Brightwell | Shekhar Borkar | Katherine Yelick | Daniel Quinlan | Guang Gao | Mary Hall | Andrew Chien | Koushik Sen | Milind Kulkarni | Martin Schulz |
Describe how your approach to resilience and its dependence on other programming, runtime, or resilience technologies? (i.e. uses lower-level mechanisms from hardware or lower level software, depends on higher level management, creates new mechanisms) | (EXPRESS) | (TG) | (DEGAS) | (D-TEC) | The DynAX project will focus on Resilience in year three. The general approach will be to integrate Containment Domains into the runtime and adapt proxy applications to use them. This will depend on the application developer (or a smart compiler) identifying key points in the control flow where resumption can occur after failure, and identifying the set of data which must be saved and restored in order to do so. It also relies on some mechanism to determine when a failure has occurred and which tasks were effected. This will vary from one system and type of failure to the next, examples include a hardware ECC failure notification, or a software data verification step detects invalid output, or maybe even a simple timeout occurs. Finally, it will require a way for the user to provide some criteria for how often to save resilience information off; the runtime may choose to snapshot at every opportunity it is given, or it might only do a subset of those, to meet some power/performance goal. | (X-TUNE) | (GVR) | (CORVETTE) | SLEEC | (PIPER) |
'One challenging problem for Exascale systems is that projections of soft error rates and hardware lifetimes (wearout) span a wide range from a modest increase over current systems to as much as a 100-fold increase. How does your system scale in resilience to ensure effective exascale capabilities on both the varied systems that are likely to exist and varied operating points (power, error rate)? | (EXPRESS) | (TG) | (DEGAS) | (D-TEC) | In a recursive tree hierarchy of containment domains, each level will have successively finer grained resilience, with fewer affected tasks, and less cost of redundant computation if the domain restarts due to failure. Containment domains are isolated from each other, so there is no extra communication / synchronization necessary between them. The 3 dependencies mentioned above (identification of containment domains, identification of failures, and the cost function) should be equally applicable to any exascale system. The only hardware dependence we anticipate is the hardware's ability to identify failures as it occurs, and the details of how that failure is reported. The system can be tuned to act more conservatively, or less conservatively, by ignoring some containment domains and enforcing others. The most conservative approach is to back up the necessary data for, and check results of, every containment domain in the application. The least conservative case is to only act on the outer-most containment domain, i.e. simply retain the program image and the parameters it was launched with, and if a failure occurs anywhere in the application, the entire thing is thrown out and starts again from the beginning. Reality is usually somewhere between those two extremes... the runtime can pick and choose which containment domains to act on, based on cost/benefit analysis (which I will discuss further in the question below on application semantics information). | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |
What opportunities are there to improve resilience or efficiency of resilience by exporting/exploiting runtime or application semantics information in your system?" | (EXPRESS) | (TG) | (DEGAS) | (D-TEC) | The runtime decides whether or not to act on each individual containment domain. It chooses which containment domains to act upon, based on tuning parameters from the user and based on the data size and the estimated amount of parallel computation involved in completing the work associated with a containment domain. The data size is easy to determine at runtime, and the execution information can be provided by runtime analysis of previous tasks, or by hints from the application developer or smart compiler. Using that information, and resilience tuning parameters provided by the user, the runtime can do simple cost/benefit analysis to determine which containment domains are the best ones to act upon to achieve the user's performance/resilience goal. | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |
What capabilities of provided by resilience researchers (software or hardware) could have a significant impact on the capabilities or efficiency of resilience?" | (EXPRESS) | (TG) | (DEGAS) | (D-TEC) | If the hardware can reliably detect and report soft failures (such as errors in floating point instructions), this would avoid running extra application code to check the results, which are sometimes rather costly. Reliable, slower memory for storing snapshot data would also be beneficial. This memory would be written with snapshot data every time a containment domain was entered, but only read while recovering from a failure, so the cost of the write is much more important than the cost of the read. | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |