GVR

GVR: Exploiting Global-view for Resilience


Team Members	U. of Chicago, ANL, HP Labs
PI	Andrew Chien
Co-PIs	Pavan Balaji (ANL)
Website	http://gvr.cs.uchicago.edu/
Download	https://sites.google.com/site/uchicagolssg/lssg/research/gvr/downloads

Exploiting Global View for Resilience or GVR

Team Members

University of Chicago: Andrew A. Chien (PI), Hajime Fujita, Zachary Rubenstein, Ziming Zheng, Nan Dun, Aiman Fang, Yan Liu
Argonne National Laboratory (ANL): Pavan Balaji (co-PI), Pete Beckman, Kamil Iskra, Wes Bland
HP Labs: Robert Schreiber

Application Partnerships

Advanced Nuclear Reactor Simulation (Andrew Siegel, CESAR)
Computational Chemistry (Jeff Hammond, ALCF)
Rich Computational Frameworks (Trilinos, Mike Heroux, Sandia)
Particle codes (ddcMD) (David Richards, Ignacio Laguna, LLNL)
Adaptive Mesh Refinement (Chombo) (Brian van Straalen, Anshu Dubey, LBNL)
Combustion (S3D) (Jackie Chen, Sandia)

Global View Resilience (GVR) is a new programming approach that exploits a global view data model (global naming of data, consistency, and distributed layout), adding reliability to globally visible distributed arrays. The globally-visible distributed array abstraction is "multi-version", providing redundancy in time, and a convenient location for application annotations for reliability needs. Because the distributed array abstraction is portable, GVR enables application programmers to manage reliability (and its overhead) in a flexible, portable fashion, tapping their deep scientific and application code insights. Further, GVR will provide a flexible, efficient, cross-layer error management architecture called “open reliability” that allows applications to describe error detection (checking) and recovery routines and inject them into the GVR stack for efficient implementation. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.

Resilience Challenges

Can we achieve a smooth transition to system resilience? (a la Flash memory, Internet)
What’s an application to do?

Resilience Co-design

Co‑design without co‑dependence

Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible)
Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap)

Project Impact

GVR Project Impact

Challenges

Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need
“Outside in”, as needed, incremental, ...

GVR Approach 1

Application-System Partnership
- Expose and exploit algorithm and application domain knowledge
- Enable “End to end” resilience model

Foundation in Data-oriented resilience
- Internet services, map-reduce, internet, ...
- Achieve with high performance and massive parallelism...
- Global view data Foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...)

Data-oriented Resilience

Parallel applications and global-view data
Natural parallel structure version-to-version
- Example: shock hydro simulation at t=10ms to 100ms
- Example: iterative solver at iteration 1 to 20
- Example: Monte Carlo at 10M to 20M points

Temporal redundancy enables rollback and resume
- User-controlled, convenient

Resilience Partnership

Proportional Resilience
- Application specifies “Resilience priorities”
- Mapped into data-redundancy in space
- Mapped into redundancy in time (multi-version)
- Complements computation/task redundancy efforts
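
One way to picture proportional resilience is a policy table that maps an application-chosen priority to a number of replicas (space) and a number of retained versions (time). The sketch below is an assumption made for illustration: the priority names, the `Policy` values, and the `ProtectedArray` class are hypothetical and do not come from the GVR design.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    replicas: int       # redundancy in space
    versions_kept: int  # redundancy in time


# Hypothetical mapping from application priority to redundancy levels.
POLICIES = {
    "low": Policy(replicas=1, versions_kept=1),
    "medium": Policy(replicas=1, versions_kept=4),
    "high": Policy(replicas=2, versions_kept=8),
}


class ProtectedArray:
    def __init__(self, values, priority):
        self.policy = POLICIES[priority]
        self.copies = [list(values) for _ in range(self.policy.replicas)]
        # A bounded deque discards the oldest version once the limit is reached.
        self.history = deque(maxlen=self.policy.versions_kept)

    def write(self, index, value):
        for copy in self.copies:  # keep all replicas in step
            copy[index] = value

    def snapshot(self):
        self.history.append(list(self.copies[0]))

    def footprint(self):
        """Stored elements: live replicas plus retained versions."""
        n = len(self.copies[0])
        return n * (len(self.copies) + len(self.history))
```

With 100 elements and ten snapshots each, a "low" array stores 200 elements while a "high" array stores 1000, so the overhead is paid only where the application asks for it.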

Deep error detection: invariants, assertions, checks ... and recovery

Applications add further checks based on algorithm and domain semantics
- Applications add flexible, adaptive recovery mechanisms (and exploit multi-version)
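
As a small illustration of a domain-semantic check combined with multi-version recovery, the sketch below assumes a conservation invariant (the sum of the state must stay constant) and scans stored versions from newest to oldest for one that still satisfies it. The function names `conserved` and `newest_valid` are hypothetical; the point is only that a latent error may have contaminated the most recent versions, so recovery may need to reach further back.

```python
def conserved(state, expected_total, tol=1e-9):
    """Application-level invariant: the total must be conserved."""
    return abs(sum(state) - expected_total) <= tol


def newest_valid(versions, check):
    """Return (index, copy) of the newest version passing `check`, else None."""
    for i in range(len(versions) - 1, -1, -1):
        if check(versions[i]):
            return i, list(versions[i])
    return None


# Four stored versions; a latent error corrupted the state before version 2
# was taken, so versions 2 and 3 both violate the invariant.
versions = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 5.0],
    [3.0, 2.0, 4.0],
]
result = newest_valid(versions, lambda s: conserved(s, expected_total=6.0))
```

Here `result` selects version 1, the newest version whose total is still 6.0.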

“End-to-end” resilience

GVR Approach 2

x-layer approach for efficient execution (and better resilience)
- Spatial redundancy – coding at multiple levels, system level checking
- Temporal redundancy – multi-version memory, integrated memory and NVRAM management

Push checks to most efficient level (find early, contain, reduce overhead)
Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment

Multi-version Memory

Common parallel paradigm, basis for programmer engagement
Frames invariant checks, more complex checks based on high-level semantics
Frames sophisticated recovery

Research Challenges

Understand application resilience needs and opportunities for proportional resilience and deep error detection/end-to-end resilience
Explore multi-version memory as opportunity for framing richer resilience and parallelism
Design an API that embodies these ideas and offers a gentle slope of incremental application effort
Create efficient x-layer implementations - many questions
Explore architecture opportunities to increase resilience and reduce overhead

Global‑view Data Program

GVR Resilience Program

Global View & Consistent Snapshots

How to safely, efficiently identify consistent snapshots?
- Application control: Global Synch; Array-level synch; explicit snapshot
- Application flagged (optional)
- Implicit (runtime decides)
Snapshots = natural points to express and implement assertions, checks, recovery
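
The "global synch" option can be sketched with threads standing in for processes. In the example below, which is an illustration only and not how GVR is implemented, each worker writes its own slot and then waits at a `threading.Barrier`; the barrier's `action` callback runs in exactly one thread once all workers have arrived and before any is released, so the copy it takes is consistent. The function name `run_steps` and its parameters are hypothetical.

```python
import threading


def run_steps(n_workers=4, n_steps=3):
    shared = [0] * n_workers
    snapshots = []

    def take_snapshot():
        # Called by one thread when every worker is at the barrier,
        # so no worker is writing while the copy is made.
        snapshots.append(list(shared))

    barrier = threading.Barrier(n_workers, action=take_snapshot)

    def worker(rank):
        for step in range(1, n_steps + 1):
            shared[rank] = step  # each worker updates only its own slot
            barrier.wait()       # global synchronization point

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshots
```

Every snapshot taken this way reflects a single step across all workers, which also makes the synchronization point a natural place to run assertions or checks.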

Implementing Multi-version

How to implement multi-version efficiently?
- Time, Space, Label => representation, protocol
Which to take?
- Versions are logical, snapshots require resources
Intelligent storage:
- Representation, compression, architecture support
- Older versions recede into storage [SILT]
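
The idea that older versions recede into storage can be sketched as a two-tier store: the newest few versions stay in memory and older ones are spilled to files. This is a hypothetical illustration (the class `TieredVersions`, the `in_memory` limit, and the pickle file layout are all assumptions), not the representation GVR uses.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredVersions:
    def __init__(self, spill_dir, in_memory=2):
        self.spill_dir = spill_dir
        self.in_memory = in_memory
        self.hot = OrderedDict()  # version -> data, oldest first
        self.cold = {}            # version -> path of spilled file

    def add(self, version, data):
        self.hot[version] = list(data)
        while len(self.hot) > self.in_memory:
            # Evict the oldest in-memory version to the slower tier.
            old_version, old_data = self.hot.popitem(last=False)
            path = os.path.join(self.spill_dir, f"v{old_version}.pkl")
            with open(path, "wb") as f:
                pickle.dump(old_data, f)
            self.cold[old_version] = path

    def get(self, version):
        if version in self.hot:
            return list(self.hot[version])
        with open(self.cold[version], "rb") as f:
            return pickle.load(f)


def demo():
    with tempfile.TemporaryDirectory() as d:
        store = TieredVersions(d, in_memory=2)
        for v in range(5):
            store.add(v, [v] * 3)
        return sorted(store.hot), sorted(store.cold), store.get(1), store.get(4)
```

All versions remain logically addressable through `get`, while only the most recent ones consume memory.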

Intelligent Memory and Storage

How to exploit intelligence at memory and storage? (at controller)
Intelligent stacked DRAM and storage-class Memory [HMC,PIM]
Fine-grained state tracking; compression, intelligent copying, etc.
Efficient version capture; differenced checkpoints (Plank95, Svard11)
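
A differenced checkpoint stores only what changed since the previous version. The sketch below assumes a fixed-length array split into fixed-size blocks and keeps, for each committed version, only the blocks that differ from the one before it. The names `DiffStore`, `commit`, `restore`, and the block size are hypothetical, and this is not the scheme from the cited papers.

```python
BLOCK = 4  # elements per block (illustrative choice)


def to_blocks(data):
    return [tuple(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


class DiffStore:
    """Keeps a full base version plus per-version block deltas."""

    def __init__(self, data):
        self.base = to_blocks(data)
        self.deltas = []            # one {block_index: block} dict per commit
        self._last = list(self.base)

    def commit(self, data):
        """Record a new version; return how many blocks had to be stored."""
        new = to_blocks(data)
        delta = {i: b for i, b in enumerate(new) if b != self._last[i]}
        self.deltas.append(delta)
        self._last = new
        return len(delta)

    def restore(self, version):
        """Rebuild a version (0 is the base) by replaying deltas in order."""
        current = list(self.base)
        for delta in self.deltas[:version]:
            for i, block in delta.items():
                current[i] = block
        return [x for block in current for x in block]
```

For a 16-element array where one element changes between versions, `commit` stores one block of four elements instead of a full copy; the cost is that restoring a version replays every delta up to it.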

Opportunities

Multi-version and increased concurrency
Multi-version and debugging
Architecture support and fine-grained synchronization, application checks, compressed memory, etc.
...more?

Expected Outcomes

Use cases – Application skeleton design and classifications which form foundation of the design
Design of GVR API for flexible resilience and multi-version global data
Research prototype software developed as a library; target for programmers, compiler backends
Experiments with mini-apps and application partners (w/ co-design postdocs)
Assessment of architecture support opportunities and quantitative benefits

GVR X-Stack Synergies

Direct Application Programming Interface
Co-existence with other runtimes, or even serving as a target for them
Rich Solver Library Building Block
Programming System Target

Research Products

Full report: gvr-research-products.pdf

Demonstrated easy application integration, with <2% of lines of code changed in large (10K-100K line) applications
Demonstrated controllable and low performance overhead (application scaling to 16,384 nodes and <2% overhead)
Released on multiple platforms, including Cray (Edison, Cori), IBM BG/Q (Mira, JuQueen), Linux clusters
Demonstrated flexible, portable application-semantics based forward-error correction in multiple applications (OpenMC, ddcMD, etc.)
Software release available from http://gvr.cs.uchicago.edu/ and deployed at multiple supercomputing centers, including NERSC.

Publications

(See the project website for the full, up-to-date list.)

Hajime Fujita, Kamil Iskra, Pavan Balaji, and Andrew A. Chien, "Versioning Architectures for Local and Global Memory", in Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS), December 2015, Melbourne, Australia.
Aiman Fang, Hajime Fujita and Andrew A. Chien, "Towards Understanding Post-Recovery Efficiency for Shrinking and Non-Shrinking Recovery", in Proceedings of the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, at Euro-Par 2015, Vienna, Austria, August 24, 2015
Anshu Dubey, Hajime Fujita, Zachary Rubenstein, Brian Van Straalen and Andrew Chien. "A Case Study Of Application Structure Aware Resilience Through Differentiated State Saving And Recovery", in Proceedings of the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, at Euro-Par 2015, Vienna, Austria, August 24, 2015
Hajime Fujita, Kamil Iskra, Pavan Balaji, and Andrew A. Chien, "Empirical Characterization of Versioning Architectures", in Proceedings of IEEE Cluster, September 8-10, 2015, Chicago.
A. Chien, P. Balaji, N. Dun, A. Fang, H. Fujita, K. Iskra, Z. Rubenstein, Z. Zheng, J. Hammond, I. Laguna, D. Richards, A. Dubey, B. van Straalen, M. Hoemmen, M. Heroux, K. Teranishi, A. Siegel. Exploring Versioning for Resilience in Scientific Applications: Global-view Resilience, submitted for publication, March 2015. (Best overall project summary)
Aiman Fang and Andrew A. Chien, "How Much SSD Is Useful for Resilience in Supercomputers", in ACM Symposium on Fault-tolerance at Extreme-Scale (FTXS) associated with HPDC 2015, Portland, Oregon, June 15, 2015
Aiman Fang, "How Much SSD Is Useful for Resilience in Supercomputers", Master's Thesis, Department of Computer Science, University of Chicago, April 2015.
Nan Dun, Hajime Fujita, John R. Tramm, Andrew A. Chien, Andrew R. Siegel, Data Decomposition in Monte Carlo Neutron Transport Simulations using Global View Arrays, International Journal of High Performance Computing Applications, March 2015.
A. Chien, P. Balaji, P. Beckman, N. Dun, A. Fang, H. Fujita, K. Iskra, Z. Rubenstein, Z. Zheng, R. Schreiber, J. Hammond, J. Dinan, I. Laguna, D. Richards, A. Dubey, B. van Straalen, M. Hoemmen, M. Heroux, K. Teranishi, A. Siegel, and J. Tramm, "Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience", in International Conference on Computational Science (ICCS 2015), Reykjavik, Iceland, June 2015.
Hajime Fujita, Nan Dun, Zachary Rubenstein, and Andrew A. Chien. Log-Structured Global Array for Efficient Multi-Version Snapshots, IEEE CCGrid 2015, May 2015. Also UChicago CS Tech Report 2014-16, Nov 2014.
Hajime Fujita, Nan Dun, Aiman Fang, Zachary A. Rubenstein, Ziming Zheng, Kamil Iskra, Jeff Hammond, Anshu Dubey, Pavan Balaji, Andrew A. Chien: Using Global View Resilience (GVR) to add Resilience to Exascale Applications, SC14, Nov 2014 (Best Poster Finalist!)
The GVR Team, Global View Resilience (GVR) Documentation, Release 1.0, University of Chicago, Computer Science Technical Report 2014-10.
Nan Dun, Hajime Fujita, John Tramm, Andrew A. Chien, and Andrew R. Siegel. Data Decomposition in Monte Carlo Particle Transport Simulations using Global View Arrays, UChicago CS Tech Report 2014-09 May 2014.
The GVR Team, How Applications Use GVR: Use Cases, University of Chicago, Computer Science Technical Report 2014-06.
The GVR Team, Global View Resilience, API Documentation R0.8.1-rc0, University of Chicago, Computer Science Technical Report 2014-05.
Aiman Fang and Andrew A. Chien, "Applying GVR to Molecular Dynamics: Enabling Resilience for Scientific Computations", Tech Report, University of Chicago, Dept of Computer Science, CS-TR-2014-04, April 2014.
Ziming Zheng, Andrew A. Chien, Keita Teranishi, "Fault Tolerance in an Inner-Outer Solver: a GVR-enabled Case Study", in Proceedings of VECPAR 2014, July 2014, Eugene, Oregon. Proceedings available from Springer-Verlag Lecture Notes in Computer Science.
Z. Rubenstein, "Error Checking and Snapshot-based Recovery in Preconditioned Conjugate Gradient Solver", Master's Thesis, University of Chicago, Department of Computer Science, March 2014.
Z. Rubenstein, J. Dinan, H. Fujita, Z. Zheng, A. Chien, "Error Checking and Snapshot-Based Recovery in a Preconditioned Conjugate Gradient Solver", University of Chicago, Department of Computer Science Technical Report 2013-11, December 2013
Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, and Jack J. Dongarra. An evaluation of User-Level Failure Mitigation support in MPI. Computing, 95(12):1171–1184, 2013.
Ziming Zheng, Zachary Rubenstein, and Andrew A. Chien, GVR-Enabled Trilinos: An Outside-In Approach for Resilient Computing, in the SIAM Conference on Parallel Processing, February 2014, Portland, Oregon.
Ziming Zheng, Andrew A. Chien, Mark Hoemmen, Keita Teranishi, "Fault Tolerance in an Inner-Outer Solver: a GVR-enabled Case Study", available as Technical Report from University of Chicago Department of Computer Science, CS-TR-2014-01, January 2014.
Guoming Lu, Ziming Zheng, and Andrew A. Chien, When are Multiple Checkpoints Needed?, in 3rd Workshop for Fault-tolerance at Extreme Scale (FTXS), at IEEE Conference on High Performance Distributed Computing, June 2013, New York, New York.
Hajime Fujita, Robert Schreiber, Andrew A. Chien, It's Time for New Programming Models for Unreliable Hardware, to appear in ASPLOS 2013 Provocative Ideas session, March 18, 2013.
Sean Hogan, Jeff Hammond, and Andrew A. Chien, An Evaluation of Difference and Threshold Techniques for Efficient Checkpointing, 2nd Workshop on Fault-Tolerance at Extreme Scale FTXS 2012 at DSN 2012, June 2012, Boston, Massachusetts.
