From Modelado Foundation

Revision as of 18:22, July 17, 2013

GVR
GVR-Logos.png
Team Members U. of Chicago, ANL, HP Labs
PI Andrew A. Chien (U. of Chicago)
Co-PIs Pavan Balaji (ANL)
Website http://gvr.cs.uchicago.edu/

Global View for Resilience or GVR

Application Partnerships

  • Advanced Nuclear Reactor Simulation (Andrew Siegel, CESAR)
  • Computational Chemistry (Jeff Hammond, ALCF)
  • Rich Computational Frameworks (Mike Heroux, Sandia)
  • Particle codes (ddcMD) (David Richards, Ignacio Laguna, LLNL)
  • ... and more!...


Global View Resilience (GVR) is a new programming approach that exploits a global-view data model (global naming of data, consistency, and distributed layout) to add reliability to globally visible distributed arrays. The distributed array abstraction is multi-version: it provides redundancy in time and a convenient place for applications to annotate their reliability needs. Because the abstraction is portable, GVR lets application programmers manage reliability (and its overhead) in a flexible, portable fashion, drawing on their deep insight into the science and the application code. Further, GVR will provide a flexible, efficient, cross-layer error management architecture called “open reliability” that allows applications to describe error detection (checking) and recovery routines and inject them into the GVR stack for efficient implementation. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.

Resilience Challenges

  • Can we achieve a smooth transition to system resilience? (à la flash memory or the Internet)
  • What’s an application to do?

GVR-Resilience-Challenges.png


Resilience Co-design

Co‑design without co‑dependence

  • Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible)
  • Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap)


GVR-Resilience-Co-design.png


Challenges

  • Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need
  • “Outside in”, as needed, incremental, ...


GVR Approach 1

GVR-Approach-1.png


  • Application-System Partnership
    • Expose and exploit algorithm and application domain knowledge
    • Enable “End to end” resilience model
  • Foundation in Data-oriented resilience
    • Internet services, MapReduce, ...
    • Achieve with high performance and massive parallelism...
    • Global-view data foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...)

Data-oriented Resilience

GVR-Data-Oriented.png


  • Parallel applications and global-view data
  • Natural parallel structure version-to-version
    • Example: shock hydro simulation at t=10ms to 100ms
    • Example: iterative solver at iteration 1 to 20
    • Example: Monte Carlo at 10M to 20M points
  • Temporal redundancy enables rollback and resume
    • User-controlled, convenient
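
The rollback-and-resume idea above can be illustrated with a toy iterative solver. This is only a sketch under stated assumptions: `VersionedArray`, `snapshot`, `rollback`, and `solve` are hypothetical names invented for illustration and are not the GVR API; the array lives in a single process rather than being distributed; and the "error" is a NaN injected by hand.

```python
class VersionedArray:
    """Toy multi-version array (hypothetical, single-process stand-in)."""

    def __init__(self, size, fill=0.0):
        self.data = [fill] * size
        self._versions = []              # frozen copies, oldest first

    def snapshot(self):
        """Freeze the current contents and return the new version number."""
        self._versions.append(tuple(self.data))
        return len(self._versions) - 1

    def rollback(self, version):
        """Restore the working copy from a previously captured version."""
        self.data = list(self._versions[version])


def relax(values):
    """One Jacobi-style sweep: interior points become the mean of their neighbours."""
    out = values[:]
    for i in range(1, len(values) - 1):
        out[i] = 0.5 * (values[i - 1] + values[i + 1])
    return out


def solve(iterations, corrupt_at=None):
    """Iterate, snapshotting each good step; roll back if a NaN is detected."""
    arr = VersionedArray(5)
    arr.data[-1] = 4.0                   # boundary values are 0.0 and 4.0
    good_version, good_step = arr.snapshot(), 0
    step = 0
    rollbacks = 0
    while step < iterations:
        arr.data = relax(arr.data)
        step += 1
        if step == corrupt_at:
            arr.data[2] = float("nan")   # simulated corruption, injected once
            corrupt_at = None
        if any(x != x for x in arr.data):    # NaN is the only value != itself
            arr.rollback(good_version)       # return to the last good version
            step = good_step                 # and resume from that iteration
            rollbacks += 1
            continue
        good_version, good_step = arr.snapshot(), step
    return arr.data, rollbacks
```

Calling `solve(60)` and `solve(60, corrupt_at=7)` should yield the same final array, with the second call reporting one rollback. Snapshotting every iteration is a simplification; a real system would let the application choose how often to create versions.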

Resilience Partnership

  • Proportional Resilience
    • Application specifies “Resilience priorities”
    • Mapped into data-redundancy in space
    • Mapped into redundancy in time (multi-version)
    • Complements computation/task redundancy efforts
  • Deep error detection: invariants, assertions, checks ... and recovery
    • Applications add further checks based on algorithm and domain semantics
    • Applications add flexible, adaptive recovery mechanisms (and exploit multi-version)
  • “End-to-end” resilience
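
One way to picture this partnership is an array object that accepts an application-chosen priority plus application-supplied check and recovery routines. The sketch below is an assumption-laden illustration, not the GVR interface: `ProtectedArray`, `commit`, and the `REDUNDANCY` table (including its numbers) are all hypothetical, and only the redundancy-in-time budget is actually enforced here.

```python
# Hypothetical mapping from an application-chosen priority to redundancy
# budgets. The numbers are illustrative only.
REDUNDANCY = {
    "low":    {"replicas": 1, "versions": 1},
    "medium": {"replicas": 1, "versions": 4},
    "high":   {"replicas": 2, "versions": 8},
}


class ProtectedArray:
    """Array with an application-supplied check and recovery routine."""

    def __init__(self, data, priority, check, recover):
        self.data = list(data)
        self.priority = priority
        self.check = check        # callable(data) -> bool
        self.recover = recover    # callable(bad_data, history) -> repaired data
        self.history = []         # retained versions, oldest first

    def commit(self):
        """Validate the current contents.

        Returns True and records a new version if the check passes;
        otherwise hands the bad data and the retained versions to the
        application's recovery routine and returns False.
        """
        if not self.check(self.data):
            if not self.history:
                raise RuntimeError("check failed and no version to recover from")
            self.data = list(self.recover(self.data, self.history))
            return False
        self.history.append(list(self.data))
        limit = REDUNDANCY[self.priority]["versions"]
        del self.history[:-limit]     # keep only the newest `limit` versions
        return True
```

A domain-specific check might be a conservation law. For example, with `check=lambda d: abs(sum(d) - 10.0) < 1e-9` and `recover=lambda bad, hist: hist[-1]`, a committed array `[2.0, 3.0, 5.0]` that is later corrupted fails its next `commit()` and is restored from the most recent retained version.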


GVR Approach 2

GVR-Approach-2.png


  • Cross-layer (x-layer) approach for efficient execution (and better resilience)
    • Spatial redundancy – coding at multiple levels, system-level checking
    • Temporal redundancy – multi-version memory, integrated memory and NVRAM management
  • Push checks to most efficient level (find early, contain, reduce overhead)
  • Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
  • Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment

Multi-version Memory

GVR-Memory.png


  • Common parallel paradigm, basis for programmer engagement
  • Frames invariant checks, more complex checks based on high-level semantics
  • Frames sophisticated recovery


Research Challenges

  • Understand application resilience needs and opportunities for proportional resilience and deep error detection/end-to-end resilience
  • Explore multi-version memory as opportunity for framing richer resilience and parallelism
  • Design an API that embodies these ideas and offers a gentle slope of incremental application effort
  • Create efficient cross-layer implementations (many open questions)
  • Explore architecture opportunities to increase resilience and reduce overhead


Global‑view Data Program

GVR-Program-1.png


GVR Resilience Program

GVR-Program-2.png


Global View & Consistent Snapshots

GVR-Snapshots.png


  • How to safely, efficiently identify consistent snapshots?
    • Application control: Global Synch; Array-level synch; explicit snapshot
    • Application flagged (optional)
    • Implicit (runtime decides)
  • Snapshots = natural points to express and implement assertions, checks, recovery
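
The "global synch" option can be sketched with threads standing in for processes: if the snapshot is taken while every worker is parked at a barrier, no write is in flight and the captured state is consistent. This is a minimal illustration, not GVR code; `run_with_snapshots` and the shared list are hypothetical, and it relies on the standard-library `threading.Barrier`, whose `action` callable is run by one thread before the waiting threads are released.

```python
import threading


def run_with_snapshots(num_workers=4, steps=3):
    """Each worker owns one slot of a shared array and updates it every step.

    The barrier action copies the whole array while all workers are still
    blocked at the barrier, so each snapshot mixes no values from
    different steps.
    """
    shared = [0] * num_workers
    snapshots = []

    def take_snapshot():
        snapshots.append(tuple(shared))

    barrier = threading.Barrier(num_workers, action=take_snapshot)

    def worker(rank):
        for step in range(1, steps + 1):
            shared[rank] = step * 10 + rank   # write only to our own slot
            barrier.wait()                    # global synchronization point

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snapshots
```

The same synchronization point is also a natural place to run assertions or checks on the captured state before execution continues.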


Implementing Multi-version

GVR-Implementing.png


  • How to implement multi-version efficiently?
    • Time, Space, Label => representation, protocol
  • Which to take?
    • Versions are logical, snapshots require resources
  • Intelligent storage:
    • Representation, compression, architecture support
    • Older versions recede into storage [SILT]


Intelligent Memory and Storage

GVR-Memory-Storage.png


  • How to exploit intelligence at memory and storage? (at controller)
  • Intelligent stacked DRAM and storage-class Memory [HMC,PIM]
  • Fine-grained state tracking; compression, intelligent copying, etc.
  • Efficient version capture; differenced checkpoints [Plank95, Svard11]
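
A software analogue of fine-grained state tracking with differenced version capture is sketched below: writes mark fixed-size blocks dirty, and each captured version stores only the blocks that changed since the previous one. This is a generic block-delta scheme written for illustration, not the method of the cited papers or of GVR; `DeltaVersionStore` and its methods are hypothetical names, and the block size is arbitrary.

```python
class DeltaVersionStore:
    """Array whose versions are stored as per-block deltas."""

    def __init__(self, size, block=4):
        self.block = block
        self.data = [0] * size
        num_blocks = (size + block - 1) // block
        self.dirty = set(range(num_blocks))   # first capture is a full copy
        self.deltas = []                      # one {block_index: values} per version

    def write(self, index, value):
        """Update one element and remember which block it touched."""
        self.data[index] = value
        self.dirty.add(index // self.block)

    def capture(self):
        """Store only the dirty blocks as a new version; return its number."""
        delta = {
            b: tuple(self.data[b * self.block:(b + 1) * self.block])
            for b in sorted(self.dirty)
        }
        self.deltas.append(delta)
        self.dirty.clear()
        return len(self.deltas) - 1

    def read_version(self, version):
        """Rebuild a version by replaying deltas from the oldest forward."""
        out = [0] * len(self.data)
        for delta in self.deltas[:version + 1]:
            for b, values in delta.items():
                start = b * self.block
                out[start:start + len(values)] = values
        return out

    def stored_values(self):
        """Total number of element values held across all deltas."""
        return sum(len(v) for d in self.deltas for v in d.values())
```

For a 16-element array with 4-element blocks, a full first version followed by a single-element write costs 16 + 4 stored values instead of 32. The trade-off is that reading an old version replays every delta up to it, which is one reason representation and compression choices matter.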


Opportunities

  • Multi-version and increased concurrency
  • Multi-version and debugging
  • Architecture support and fine-grained synchronization, application checks, compressed memory, etc.
  • ...more?


Expected Outcomes

  • Use cases – application skeleton designs and classifications that form the foundation of the design
  • Design of GVR API for flexible resilience and multi-version global data
  • Research prototype software developed as a library; target for programmers, compiler backends
  • Experiments with mini-apps and application partners (w/ co-design postdocs)
  • Assessment of architecture support opportunities and quantitative benefits


GVR X-Stack Synergies

GVR-Synergies.png


  • Direct Application Programming Interface
  • Co-existence with other runtimes, or even a target for them
  • Rich Solver Library Building Block
  • Programming System Target