GVR

From Modelado Foundation

{{Infobox project
| title = GVR
| image = [[File:GVR-Logos.png|400px]]
| imagecaption = 
| team-members = [http://www.uchicago.edu/ U. of Chicago], [http://www.anl.gov/ ANL], [http://www.hpl.hp.com/ HP Labs]
| pi = Andrew A. Chien (U. of Chicago)
| co-pi = Pavan Balaji (ANL)
| website = team website
}}

== Team Members ==
* [http://www.uchicago.edu/ University of Chicago]
* [http://www.anl.gov/ Argonne National Laboratory (ANL)]
* [http://www.hpl.hp.com/ HP Labs]


== Resilience Challenges ==
* Can we achieve a smooth transition to system resilience (à la flash memory, the Internet)?
* What’s an application to do?


[[File:GVR-Resilience-Challenges.png|600px]]


== Resilience Co-design ==


'''Co‑design without co‑dependence'''


[[File:GVR-Resilience-Co-design.png|400px]]
* Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible)
* Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap)




'''Challenges'''
* Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need
* “Outside in”, as needed, incremental, ...
 
 
== GVR Approach ==
 
[[File:GVR-Approach-1.png|600px]]
 
* Application-System Partnership
** Expose and exploit algorithm and application domain knowledge
** Enable “End to end” resilience model
 
* Foundation in data-oriented resilience
** Internet services, MapReduce, the Internet, ...
** Achieve it with high performance and massive parallelism
** Global-view data foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...)
 
=== Data-oriented Resilience ===
 
[[File:GVR-Data-Oriented.png|600px]]
 
* Parallel applications and global-view data
* Natural parallel structure version-to-version
** Example: shock hydro simulation at t=10ms to 100ms
** Example: iterative solver at iteration 1 to 20
** Example: Monte Carlo at 10M to 20M points
 
* Temporal redundancy enables rollback and resume
** User-controlled, convenient
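
The rollback-and-resume idea can be illustrated with a toy iterative computation. This is only a sketch under stated assumptions: <code>VersionedArray</code>, <code>snapshot</code>, <code>rollback</code>, and <code>run</code> are hypothetical names chosen for illustration, not the GVR API, and the "fault" is a NaN injected by hand.

```python
class VersionedArray:
    """Toy single-process stand-in for a multi-version global array (hypothetical)."""

    def __init__(self, size):
        self.data = [0.0] * size
        self.versions = {}  # label -> saved copy of the data

    def snapshot(self, label):
        # Temporal redundancy: keep a copy of the data under a version label.
        self.versions[label] = list(self.data)

    def rollback(self, label):
        # Restore the data to a previously saved version.
        self.data = list(self.versions[label])


def run(steps, fault_at=None):
    """Add 1.0 to every element per step; optionally inject one fault."""
    arr = VersionedArray(4)
    arr.snapshot(0)
    step = 0
    fault_pending = fault_at is not None
    while step < steps:
        arr.data = [x + 1.0 for x in arr.data]
        if fault_pending and step == fault_at:
            arr.data[2] = float("nan")  # simulated silent corruption
            fault_pending = False
        if any(x != x for x in arr.data):  # NaN is the only value unequal to itself
            arr.rollback(step)  # version taken before this step began
            continue            # resume by redoing the step
        step += 1
        arr.snapshot(step)
    return arr.data
```

In this sketch the application decides when versions are taken and which version to return to, which is the "user-controlled" aspect of the bullet above. A faulty run and a fault-free run end in the same state.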
 
=== Resilience Partnership ===
* Proportional Resilience
** Application specifies “Resilience priorities”
** Mapped into data-redundancy in space
** Mapped into redundancy in time (multi-version)
** Complements computation/task redundancy efforts
 
* Deep error detection: invariants, assertions, checks ... and recovery
 
* Applications add further checks based on algorithm and domain semantics
** Applications add flexible, adaptive recovery mechanisms (and exploit multi-version)
 
* “End-to-end” resilience
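
One way to picture proportional resilience is a table that maps an application-supplied priority onto redundancy in space and in time. The sketch below is purely illustrative: the priority levels, the numbers in <code>POLICY</code>, and the cost formula are assumptions, not values from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyPlan:
    replicas: int       # redundancy in space: copies of the current data
    versions_kept: int  # redundancy in time: past versions retained
    version_every: int  # take a new version every N iterations


# Hypothetical mapping from resilience priority to redundancy.
POLICY = {
    "low": RedundancyPlan(replicas=1, versions_kept=1, version_every=100),
    "medium": RedundancyPlan(replicas=2, versions_kept=4, version_every=10),
    "high": RedundancyPlan(replicas=3, versions_kept=16, version_every=1),
}


def plan_for(priorities):
    """Map {array_name: priority} to {array_name: RedundancyPlan}."""
    return {name: POLICY[level] for name, level in priorities.items()}


def storage_cost(plan, nbytes):
    """Assumed cost model: every replica holds the current data plus all kept versions."""
    return nbytes * plan.replicas * (1 + plan.versions_kept)
```

Under these assumptions an array marked "low" costs twice its size, while one marked "high" costs 51 times its size, so the application pays for resilience only where it asks for it.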
 
== GVR Approach ==
 
[[File:GVR-Approach-2.png|600px]]
 
* Cross-layer (x-layer) approach for efficient execution (and better resilience)
** Spatial redundancy: coding at multiple levels, system-level checking
** Temporal redundancy: multi-version memory, integrated memory and NVRAM management
 
* Push checks to most efficient level (find early, contain, reduce overhead)
* Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
* Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment
 
=== Multi-version Memory ===
 
[[File:GVR-Memory.png|600px]]
 
* Common parallel paradigm, basis for programmer engagement
* Frames invariant checks, more complex checks based on high-level semantics
* Frames sophisticated recovery
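
A version boundary is a natural place to run application-level invariant checks and to trigger recovery. The following sketch assumes a hypothetical <code>CheckedStore</code> class in which a new version is published only if every registered check passes; otherwise the state falls back to the last good version. The checks themselves (conservation of a total, non-negativity) are examples of domain knowledge an application might supply.

```python
class CheckedStore:
    """Publishes a new version only when all invariant checks pass (hypothetical)."""

    def __init__(self, state, checks):
        self.state = dict(state)
        self.checks = checks       # list of (name, predicate(state) -> bool)
        self.good = [dict(state)]  # versions that passed every check

    def commit(self):
        """Check the current state; return the names of the checks that failed."""
        failed = [name for name, pred in self.checks if not pred(self.state)]
        if failed:
            # Recovery: discard the bad state and return to the last good version.
            self.state = dict(self.good[-1])
        else:
            self.good.append(dict(self.state))
        return failed


# Example invariants drawn from (assumed) domain semantics.
CHECKS = [
    ("mass_conserved", lambda s: abs(s["a"] + s["b"] - 10.0) < 1e-9),
    ("non_negative", lambda s: all(v >= 0 for v in s.values())),
]
```

A more sophisticated recovery handler could repair only the damaged field or recompute from an older version; restoring the last good version is simply the smallest policy that shows the pattern.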
 
== Research Challenges ==
* Understand application resilience needs and opportunities for ''proportional resilience'' and ''deep error detection''/''end-to-end resilience''
* Explore multi-version memory as opportunity for framing richer resilience and parallelism
* Design an API that embodies these ideas and supports ''gentle slope'', incremental application effort
* Create efficient cross-layer implementations (many open questions)
* Explore architecture opportunities to increase resilience and reduce overhead
 
== Global‑view data Program ==
 
[[File:GVR-Program-1.png|600px]]
 
== GVR Resilience Program ==
 
[[File:GVR-Program-2.png|600px]]
 
== Global View & Consistent Snapshots ==
 
[[File:GVR-Snapshots.png|600px]]
 
* How to safely, efficiently identify consistent snapshots?
** Application control: global synchronization; array-level synchronization; explicit snapshot
** Application flagged (optional)
** Implicit (runtime decides)
* Snapshots = natural points to express and implement assertions, checks, recovery
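
The three ways of identifying snapshot points listed above can be contrasted in a small sketch. Everything here is hypothetical: <code>SnapshotManager</code>, its method names, and the <code>min_interval</code> throttle are assumptions made for illustration, and a real implementation would also have to establish that the chosen point is globally consistent.

```python
from enum import Enum


class Policy(Enum):
    EXPLICIT = "explicit"  # application requests each snapshot itself
    FLAGGED = "flagged"    # application marks safe points; runtime picks among them
    IMPLICIT = "implicit"  # runtime decides at synchronization points it observes


class SnapshotManager:
    def __init__(self, policy, min_interval=1):
        self.policy = policy
        self.min_interval = min_interval  # minimum steps between runtime-chosen snapshots
        self.taken = []                   # steps at which a snapshot was taken

    def _due(self, step):
        return not self.taken or step - self.taken[-1] >= self.min_interval

    def snapshot(self, step):
        """Explicit application request: always honoured."""
        self.taken.append(step)

    def safe_point(self, step):
        """Application flags a point where a snapshot would be consistent."""
        if self.policy is Policy.FLAGGED and self._due(step):
            self.taken.append(step)

    def barrier(self, step):
        """Runtime observes a global synchronization."""
        if self.policy is Policy.IMPLICIT and self._due(step):
            self.taken.append(step)
```

The design choice shown is that flagged and implicit policies leave the final decision to the runtime, which can skip candidate points to limit overhead, while an explicit request is never skipped.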
 
== Implementing Multi-version ==
 
[[File:GVR-Implementing.png|600px]]
 
* How to implement multi-version efficiently?
** Time, Space, Label => representation, protocol
* Which snapshots to take?
** Versions are logical, snapshots require resources
* Intelligent storage:
** Representation, compression, architecture support
** Older versions recede into storage [SILT]
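
The idea that older versions recede into storage can be sketched as a two-tier store: the newest versions stay in a fast tier, and older ones are compressed and demoted. This is a minimal sketch, not the design referenced by [SILT]; <code>TieredVersionStore</code> is a hypothetical name, a Python dictionary stands in for NVRAM or disk, and zlib stands in for whatever representation a real system would use.

```python
import zlib
from collections import OrderedDict


class TieredVersionStore:
    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # label -> raw bytes (stand-in for DRAM)
        self.slow = {}             # label -> compressed bytes (stand-in for storage)

    def put(self, label, payload):
        """Record a new version; demote the oldest fast-tier versions if over capacity."""
        self.fast[label] = payload
        while len(self.fast) > self.fast_capacity:
            old_label, old_payload = self.fast.popitem(last=False)  # oldest first
            self.slow[old_label] = zlib.compress(old_payload)

    def get(self, label):
        """Return the bytes of a version, wherever it currently lives."""
        if label in self.fast:
            return self.fast[label]
        return zlib.decompress(self.slow[label])

    def tier_of(self, label):
        return "fast" if label in self.fast else "slow"
```

Every version remains addressable by its label, which mirrors the point that versions are logical while the resources behind them vary.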
 
== Intelligent Memory and Storage ==
 
[[File:GVR-Memory-Storage.png|600px]]
 
* How to exploit intelligence at memory and storage? (at controller)
* Intelligent stacked DRAM and storage-class memory [HMC, PIM]
* Fine-grained state tracking; compression, intelligent copying, etc.
* Efficient version capture; differenced checkpoints (Plank95, Svard11)
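
A differenced checkpoint stores only what changed since the previous version. The sketch below shows a generic block-level difference; it is not the specific method of the cited works, and the function names and the 4-byte block size are assumptions chosen to keep the example small.

```python
BLOCK = 4  # bytes per block; deliberately tiny for illustration


def diff_checkpoint(prev, curr, block=BLOCK):
    """Return {block_index: new_bytes} for every block that differs.

    Assumes both versions have the same length.
    """
    if len(prev) != len(curr):
        raise ValueError("versions must have equal length")
    delta = {}
    for off in range(0, len(curr), block):
        chunk = curr[off:off + block]
        if prev[off:off + block] != chunk:
            delta[off // block] = chunk
    return delta


def apply_checkpoint(prev, delta, block=BLOCK):
    """Rebuild the newer version from the older one plus the stored difference."""
    out = bytearray(prev)
    for idx, chunk in delta.items():
        start = idx * block
        out[start:start + len(chunk)] = chunk
    return bytes(out)
```

When few blocks change between versions, the stored difference is much smaller than a full copy, which is what makes frequent version capture affordable. Hardware-assisted state tracking at the memory controller could supply the set of changed blocks directly instead of comparing bytes as this sketch does.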

Revision as of 23:31, February 11, 2013
