GVR
GVR: Exploiting Global-view for Resilience | |
---|---|
Team Members | U. of Chicago, ANL, HP Labs |
PI | Andrew A. Chien (U. of Chicago) |
Co-PIs | Pavan Balaji (ANL) |
Website | http://gvr.cs.uchicago.edu/ |
Exploiting Global View for Resilience or GVR
Team Members
- University of Chicago: Andrew A. Chien (PI), Hajime Fujita, Zachary Rubenstein, Ziming Zheng, Nan Dun, Aiman Fang
- Argonne National Laboratory (ANL): Pavan Balaji (co-PI), Pete Beckman, Kamil Iskra
- HP Labs: Robert Schreiber
Application Partnerships
- Advanced Nuclear Reactor Simulation (Andrew Siegel, CESAR)
- Computational Chemistry (Jeff Hammond, ALCF)
- Rich Computational Frameworks (Trilinos, Mike Heroux, Sandia)
- Particle codes (ddcMD) (David Richards, Ignacio Laguna, LLNL)
- Adaptive Mesh Refinement (Chombo) (Brian van Straalen, Anshu Dubey, LBNL)
Global View Resilience (GVR) is a new programming approach that exploits a global-view data model (global naming of data, consistency, and distributed layout) to add reliability to globally visible distributed arrays. The globally visible distributed array abstraction is "multi-version": it provides redundancy in time and a convenient place for applications to annotate their reliability needs. Because the distributed array abstraction is portable, GVR lets application programmers manage reliability (and its overhead) in a flexible, portable fashion, drawing on their deep insight into the science and the application code. Further, GVR will provide a flexible, efficient, cross-layer error management architecture called “open reliability” that allows applications to describe error detection (checking) and recovery routines and inject them into the GVR stack for efficient implementation. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.
Resilience Challenges
- Can we achieve a smooth transition to system resilience? (à la Flash memory, the Internet)
- What’s an application to do?
Resilience Co-design
Co‑design without co‑dependence
- Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible)
- Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap)
Challenges
- Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need
- “Outside in”, as needed, incremental, ...
GVR Approach 1
- Application-System Partnership
- Expose and exploit algorithm and application domain knowledge
- Enable “End to end” resilience model
- Foundation in Data-oriented resilience
- Internet services, map-reduce, internet, ...
- Achieve with high performance and massive parallelism...
- Global view data Foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...)
Data-oriented Resilience
- Parallel applications and global-view data
- Natural parallel structure version-to-version
- Example: shock hydro simulation at t=10ms to 100ms
- Example: iterative solver at iteration 1 to 20
- Example: Monte Carlo at 10M to 20M points
- Temporal redundancy enables rollback and resume
- User-controlled, convenient
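The version-to-version structure and rollback-and-resume idea above can be sketched in a few lines. This is a toy, single-process model: `VersionedArray`, `version_inc`, `rollback`, `run_solver`, and the injected NaN fault are illustrative names and assumptions, not the GVR API.

```python
import math


class VersionedArray:
    """Toy stand-in for a multi-version array (hypothetical, not the GVR API)."""

    def __init__(self, data):
        self.data = list(data)
        self._versions = []  # saved snapshots, oldest first

    def version_inc(self):
        """Save the current contents as a new version and return its index."""
        self._versions.append(list(self.data))
        return len(self._versions) - 1

    def rollback(self, version=-1):
        """Restore the contents of a saved version (default: the most recent)."""
        self.data = list(self._versions[version])


def run_solver(iterations=20, snapshot_every=5, fault_at=7):
    """Iterate x <- 0.5*x + 1 (fixed point 2.0) with one injected fault."""
    state = VersionedArray([0.0, 0.0, 0.0, 0.0])
    state.version_inc()              # version 0 holds the initial state
    last_good_iter = 0
    rollbacks = 0
    fault_pending = True
    i = 0
    while i < iterations:
        state.data = [0.5 * v + 1.0 for v in state.data]
        i += 1
        if fault_pending and i == fault_at:
            state.data[2] = float("nan")   # simulated silent data corruption
            fault_pending = False
        if any(math.isnan(v) for v in state.data):
            state.rollback()               # return to the latest saved version
            i = last_good_iter             # resume from that iteration
            rollbacks += 1
            continue
        if i % snapshot_every == 0:        # the application decides when to version
            state.version_inc()
            last_good_iter = i
    return state.data, rollbacks
```

In this sketch the application chooses both the versioning interval and the error check, so only the two iterations since the last version are recomputed after the fault.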
Resilience Partnership
- Proportional Resilience
- Application specifies “Resilience priorities”
- Mapped into data-redundancy in space
- Mapped into redundancy in time (multi-version)
- Complements computation/task redundancy efforts
- Deep error detection: invariants, assertions, checks ... and recovery
- Applications add further checks based on algorithm and domain semantics
- Applications add flexible, adaptive recovery mechanisms (and exploit multi-version)
- “End-to-end” resilience
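One way to picture the partnership above is an array that carries an application-stated priority and an application-supplied check. The sketch below is only an illustration under assumed names: the `POLICY` table, `ProtectedArray`, `end_of_step`, and the mass-conservation invariant are hypothetical and are not taken from GVR.

```python
# Hypothetical mapping from a stated priority to redundancy in space
# (replica copies per version) and in time (how often a version is saved).
POLICY = {
    "high": {"replicas": 2, "snapshot_every": 1},
    "medium": {"replicas": 1, "snapshot_every": 5},
    "low": {"replicas": 1, "snapshot_every": 20},
}


class ProtectedArray:
    """Toy array with proportional redundancy and an application check."""

    def __init__(self, data, priority, check):
        self.data = list(data)
        self.policy = POLICY[priority]
        self.check = check            # application-defined invariant
        self.versions = []            # each entry is a list of replica copies
        self._snapshot()

    def _snapshot(self):
        copies = self.policy["replicas"]
        self.versions.append([list(self.data) for _ in range(copies)])

    def end_of_step(self, step):
        """Run the check; recover from the latest version if it fails."""
        if not self.check(self.data):
            for replica in self.versions[-1]:
                if self.check(replica):       # skip replicas that are also bad
                    self.data = list(replica)
                    return "recovered"
            raise RuntimeError("no valid replica of the latest version")
        if step % self.policy["snapshot_every"] == 0:
            self._snapshot()
            return "snapshot"
        return "ok"


def mass_conserved(data, total=10.0, tol=1e-9):
    """Domain-level invariant: the total mass must not change."""
    return abs(sum(data) - total) < tol


density = ProtectedArray([2.5, 2.5, 2.5, 2.5], "high", mass_conserved)
scratch = ProtectedArray([0.0, 0.0, 0.0, 0.0], "low", lambda d: True)
```

Here `density` pays for two replicas and a version every step, while `scratch` keeps only its initial version until step 20, which is the sense in which cost follows the stated priority.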
GVR Approach 2
- Cross-layer approach for efficient execution (and better resilience)
- Spatial redundancy: coding at multiple levels, system-level checking
- Temporal redundancy: multi-version memory, integrated memory and NVRAM management
- Push checks to most efficient level (find early, contain, reduce overhead)
- Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
- Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment
Multi-version Memory
- Common parallel paradigm, basis for programmer engagement
- Frames invariant checks, more complex checks based on high-level semantics
- Frames sophisticated recovery
Research Challenges
- Understand application resilience needs and opportunities for proportional resilience and deep error detection/end-to-end resilience
- Explore multi-version memory as opportunity for framing richer resilience and parallelism
- Design an API that embodies these ideas and offers a gentle slope of incremental application effort
- Create efficient cross-layer implementations (many open questions)
- Explore architecture opportunities to increase resilience and reduce overhead
Global‑view Data Program
GVR Resilience Program
Global View & Consistent Snapshots
- How to safely, efficiently identify consistent snapshots?
- Application control: Global Synch; Array-level synch; explicit snapshot
- Application flagged (optional)
- Implicit (runtime decides)
- Snapshots = natural points to express and implement assertions, checks, recovery
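The "global synch" option can be illustrated with threads standing in for processes: once every worker has reached a barrier, all writes for the step are complete, so a copy taken at that point is consistent. This is a minimal sketch using Python's standard `threading.Barrier`; `run_with_snapshots` and the per-worker update rule are assumptions made for the example, not GVR behavior.

```python
import threading


def run_with_snapshots(num_workers=4, steps=3):
    """Each worker owns one array element; snapshot at a global barrier."""
    array = [0] * num_workers
    snapshots = []
    barrier = threading.Barrier(num_workers)

    def worker(rank):
        for step in range(1, steps + 1):
            array[rank] = step * 10 + rank      # this worker's update
            # First barrier: every write for this step has finished.
            # wait() hands each thread a distinct index; index 0 snapshots.
            if barrier.wait() == 0:
                snapshots.append(list(array))
            # Second barrier: nobody starts the next step mid-snapshot.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshots
```

The same two synchronization points are also where an application could run assertions or checks before the snapshot is accepted.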
Implementing Multi-version
- How to implement multi-version efficiently?
- Time, Space, Label => representation, protocol
- Which to take?
- Versions are logical, snapshots require resources
- Intelligent storage:
- Representation, compression, architecture support
- Older versions recede into storage [SILT]
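The idea that versions are logical while their storage can vary may be sketched as a two-tier store: recent versions stay uncompressed, and older ones are compressed as a stand-in for receding into slower storage. `TieredVersionStore`, its `hot_capacity` parameter, and the use of `pickle` plus `zlib` are assumptions for illustration only.

```python
import pickle
import zlib
from collections import OrderedDict


class TieredVersionStore:
    """Keep the newest versions "hot"; compress older ones (toy model)."""

    def __init__(self, hot_capacity=2):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()   # version label -> list, fast tier
        self.cold = {}             # version label -> compressed bytes

    def put(self, version, data):
        """Store a new version and demote the oldest hot versions if needed."""
        self.hot[version] = list(data)
        while len(self.hot) > self.hot_capacity:
            old_version, old_data = self.hot.popitem(last=False)
            self.cold[old_version] = zlib.compress(pickle.dumps(old_data))

    def get(self, version):
        """Return a copy of the requested version from whichever tier holds it."""
        if version in self.hot:
            return list(self.hot[version])
        return pickle.loads(zlib.decompress(self.cold[version]))
```

Callers address every version by the same label regardless of tier, so the representation can change without affecting application code.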
Intelligent Memory and Storage
- How to exploit intelligence at memory and storage? (at controller)
- Intelligent stacked DRAM and storage-class Memory [HMC,PIM]
- Fine-grained state tracking; compression, intelligent copying, etc.
- Efficient version capture; differenced checkpoints (Plank95, Svard11)
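Differenced version capture can be illustrated by storing, for each new version, only the blocks that changed since the previous one. The sketch below is a simplified software model with a fixed block size; `diff_blocks`, `apply_delta`, and `DeltaStore` are hypothetical names and do not reproduce the cited schemes or any GVR interface.

```python
BLOCK = 4  # elements per tracked block (assumed granularity)


def diff_blocks(old, new, block=BLOCK):
    """Return {block_index: new_block} for every block that differs."""
    if len(old) != len(new):
        raise ValueError("versions must have the same length")
    delta = {}
    for start in range(0, len(new), block):
        if old[start:start + block] != new[start:start + block]:
            delta[start // block] = list(new[start:start + block])
    return delta


def apply_delta(base, delta, block=BLOCK):
    """Rebuild a version by overwriting the changed blocks of its predecessor."""
    out = list(base)
    for index, values in delta.items():
        start = index * block
        out[start:start + len(values)] = values
    return out


class DeltaStore:
    """Full copy of version 0, then one block-level delta per commit."""

    def __init__(self, initial):
        self.base = list(initial)
        self.deltas = []
        self._latest = list(initial)

    def commit(self, data):
        """Record a new version as a delta against the previous one."""
        self.deltas.append(diff_blocks(self._latest, data))
        self._latest = list(data)

    def restore(self, version):
        """Version 0 is the base; version k is the state after k commits."""
        data = list(self.base)
        for delta in self.deltas[:version]:
            data = apply_delta(data, delta)
        return data
```

Restoring an old version replays deltas from the base, so this layout trades restore time for capture cost and space; keeping periodic full copies would bound the replay length.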
Opportunities
- Multi-version and increased concurrency
- Multi-version and debugging
- Architecture support and fine-grained synchronization, application checks, compressed memory, etc.
- ...more?
Expected Outcomes
- Use cases – Application skeleton design and classifications which form foundation of the design
- Design of GVR API for flexible resilience and multi-version global data
- Research prototype software developed as a library; target for programmers, compiler backends
- Experiments with mini-apps and application partners (w/ co-design postdocs)
- Assessment of architecture support opportunities and quantitative benefits
GVR X-Stack Synergies
- Direct Application Programming Interface
- Co-existence with other runtimes, or even serving as a target for them
- Rich Solver Library Building Block
- Programming System Target
Research Products
- Demonstrated easy application integration, with <2% of lines of code changed in large (10K-100K line) applications
- Demonstrated controllable, low performance overhead (applications scaling to 16,384 nodes with <2% overhead)
- Demonstrated flexible, portable application-semantics based forward-error correction in multiple applications (OpenMC, ddcMD, etc.)
- Software release available from http://gvr.cs.uchicago.edu/ and deployed at multiple supercomputing centers, including NERSC.