GVR: Difference between revisions
From Modelado Foundation
imported>Cdenny (Created page with "{{Infobox project | title = GVR | image = 180px | imagecaption = | team-members = List of team members | pi = Lead PI (Institute) | co-pi = Co-PIs...") |
imported>Cdenny No edit summary |
Line 1: | Line 1: | ||
{{Infobox project | {{Infobox project | ||
| title = GVR | | title = GVR | ||
| image = [[File: | | image = [[File:GVR-Logos.png|400px]] | ||
| imagecaption = | | imagecaption = | ||
| team-members = | | team-members = [http://www.uchicago.edu/ U. of Chicago], [http://www.anl.gov/ ANL], [http://www.hpl.hp.com/ HP Labs] | ||
| pi = | | pi = Andrew A. Chien (U. of Chicago) | ||
| co-pi = | | co-pi = Pavan Balaji (ANL) | ||
| website = team website | | website = team website | ||
}} | }} | ||
Line 12: | Line 12: | ||
== Team Members == | == Team Members == | ||
* [http://www.uchicago.edu/ University of Chicago] | |||
* [http://www.anl.gov/ Argonne National Laboratory (ANL)] | |||
* [http://www.hpl.hp.com/ HP Labs] | |||
== Resilience Challenges == | |||
* Can we achieve a smooth transition to system resilience? (a la Flash memory, Internet) | |||
* What’s an application to do? | |||
[[File:GVR-Resilience-Challenges.png|600px]] | |||
== | == Resilience Co-design == | ||
'''Co‑design without co‑dependence''' | |||
[[File:GVR-Resilience-Co-design.png|400px]] | |||
* Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible) | |||
* Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap) | |||
== | '''Challenges''' | ||
* Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need | |||
* “Outside in”, as needed, incremental, ... | |||
== GVR Approach == | |||
[[File:GVR-Approach-1.png|600px]] | |||
* Application-System Partnership | |||
** Expose and exploit algorithm and application domain knowledge | |||
** Enable “End to end” resilience model | |||
* Foundation in Data-oriented resilience | |||
** Internet services, map-reduce, internet, ... | |||
** Achieve with high performance and massive parallelism... | |||
** Global view data Foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...) | |||
=== Data-oriented Resilience === | |||
[[File:GVR-Data-Oriented.png|600px]] | |||
* Parallel applications and global-view data | |||
* Natural parallel structure version-to-version | |||
** Example: shock hydro simulation at t=10ms to 100ms | |||
** Example: iterative solver at iteration 1 to 20 | |||
** Example: Monte Carlo at 10M to 20M points | |||
* Temporal redundancy enables rollback and resume | |||
** User-controlled, convenient | |||
=== Resilience Partnership === | |||
* Proportional Resilience | |||
** Application specifies “Resilience priorities” | |||
** Mapped into data-redundancy in space | |||
** Mapped into redundancy in time (multi-version) | |||
** Complements computation/task redundancy efforts | |||
* Deep error detection: invariants, assertions, checks ... and recovery | |||
* Applications add further checks based on algorithm and domain semantics | |||
** Applications add flexible, adaptive recovery mechanisms (and exploit multi-version) | |||
* “End-to-end” resilience | |||
== GVR Approach == | |||
[[File:GVR-Approach-2.png|600px]] | |||
* x-layer approach for efficient execution (and better resilience) | |||
** Spatial redundancy – coding at multiple levels, system level checking | |||
** Temporal redundancy – Multi-version memory, integrated memory and NVRAM management | |||
* Push checks to most efficient level (find early, contain, reduce overhead) | |||
* Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead) | |||
* Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment | |||
=== Multi-version Memory === | |||
[[File:GVR-Memory.png|600px]] | |||
* Common parallel paradigm, basis for programmer engagement | |||
* Frames invariant checks, more complex checks based on high-level semantics | |||
* Frames sophisticated recovery | |||
== Research Challenges == | |||
* Understand application resilience needs and opportunities for ''proportional resilience'' and ''deep error detection''/''end-to-end resilience'' | |||
* Explore multi-version memory as opportunity for framing richer resilience and parallelism | |||
* Design an API that embodies these ideas and supports a ''gentle slope'' of incremental application effort | |||
* Create efficient x-layer implementations - many questions | |||
* Explore architecture opportunities to increase resilience and reduce overhead | |||
== Global‑view data Program == | |||
[[File:GVR-Program-1.png|600px]] | |||
== GVR Resilience Program == | |||
[[File:GVR-Program-2.png|600px]] | |||
== Global View & Consistent Snapshots == | |||
[[File:GVR-Snapshots.png|600px]] | |||
* How to safely, efficiently identify consistent snapshots? | |||
** Application control: Global Synch; Array-level synch; explicit snapshot | |||
** Application flagged (optional) | |||
** Implicit (runtime decides) | |||
* Snapshots = natural points to express and implement assertions, checks, recovery | |||
== Implementing Multi-version == | |||
[[File:GVR-Implementing.png|600px]] | |||
* How to implement multi-version efficiently? | |||
** Time, Space, Label => representation, protocol | |||
* Which to take? | |||
** Versions are logical, snapshots require resources | |||
* Intelligent storage: | |||
** Representation, compression, architecture support | |||
** Older versions recede into storage [SILT] | |||
== Intelligent Memory and Storage == | |||
[[File:GVR-Memory-Storage.png|600px]] | |||
* How to exploit intelligence at memory and storage? (at controller) | |||
* Intelligent stacked DRAM and storage-class Memory [HMC,PIM] | |||
* Fine-grained state tracking; compression, intelligent copying, etc. | |||
* Efficient version capture; differenced checkpoints (Plank95, Svard11) |
Revision as of 23:31, February 11, 2013
GVR | |
---|---|
Team Members | U. of Chicago, ANL, HP Labs |
PI | Andrew A. Chien (U. of Chicago) |
Co-PIs | Pavan Balaji (ANL) |
Website | team website |
Team Members
- University of Chicago
- Argonne National Laboratory (ANL)
- HP Labs
Resilience Challenges
- Can we achieve a smooth transition to system resilience? (a la Flash memory, Internet)
- What’s an application to do?
Resilience Co-design
Co‑design without co‑dependence
- Software: Information and Algorithms to enhance resilience (REQ: Portable, flexible)
- Runtime, OS, and Architecture Mechanisms to enhance resilience (REQ: leverage beyond HPC, cheap)
Challenges
- Enable an application to incorporate resilience incrementally, expressing resilience proportionally to the application need
- “Outside in”, as needed, incremental, ...
GVR Approach
- Application-System Partnership
- Expose and exploit algorithm and application domain knowledge
- Enable “End to end” resilience model
- Foundation in Data-oriented resilience
- Internet services, map-reduce, internet, ...
- Achieve with high performance and massive parallelism...
- Global view data Foundation (PGAS..., GA, SWARM, ParalleX, CnC, ...)
Data-oriented Resilience
- Parallel applications and global-view data
- Natural parallel structure version-to-version
- Example: shock hydro simulation at t=10ms to 100ms
- Example: iterative solver at iteration 1 to 20
- Example: Monte Carlo at 10M to 20M points
- Temporal redundancy enables rollback and resume
- User-controlled, convenient
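Temporal redundancy with rollback and resume can be illustrated with a small single-process sketch. This is not the GVR interface: the `VersionedArray` class, its `commit`, `rollback`, and `version_count` methods, and the `run_solver` driver are hypothetical names chosen for illustration, and a real global-view array would be distributed across nodes. The simulated fault is a NaN written into the array at one step.

```python
from typing import List


class VersionedArray:
    """Toy multi-version array (hypothetical; not the GVR interface)."""

    def __init__(self, size: int) -> None:
        self.data: List[float] = [0.0] * size   # current, mutable contents
        self._versions: List[List[float]] = []  # committed copies, oldest first

    def commit(self) -> int:
        """Freeze the current contents as a new version and return its index."""
        self._versions.append(list(self.data))
        return len(self._versions) - 1

    def rollback(self, version: int) -> None:
        """Restore the current contents from a previously committed version."""
        self.data = list(self._versions[version])

    def version_count(self) -> int:
        return len(self._versions)


def run_solver(steps: int, corrupt_at: int) -> VersionedArray:
    """Commit one version per step; roll back and redo a step if a NaN appears."""
    arr = VersionedArray(4)
    arr.commit()                      # version 0: the initial state
    step = 0
    injected = False
    while step < steps:
        for i in range(len(arr.data)):
            arr.data[i] += 1.0        # stand-in for one iteration of real work
        if step == corrupt_at and not injected:
            arr.data[0] = float("nan")  # simulated silent corruption (once)
            injected = True
        if any(x != x for x in arr.data):          # NaN is the only value != itself
            arr.rollback(arr.version_count() - 1)  # resume from the last good version
            continue                               # redo this step
        arr.commit()
        step += 1
    return arr


result = run_solver(steps=5, corrupt_at=2)
```

In this sketch the corrupted step is simply repeated from the previous version, so the final contents match a fault-free run. The application decides when to commit and how far to roll back.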
Resilience Partnership
- Proportional Resilience
- Application specifies “Resilience priorities”
- Mapped into data-redundancy in space
- Mapped into redundancy in time (multi-version)
- Complements computation/task redundancy efforts
- Deep error detection: invariants, assertions, checks ... and recovery
- Applications add further checks based on algorithm and domain semantics
- Applications add flexible, adaptive recovery mechanisms (and exploit multi-version)
- “End-to-end” resilience
GVR Approach
- x-layer approach for efficient execution (and better resilience)
- Spatial redundancy – coding at multiple levels, system level checking
- Temporal redundancy – Multi-version memory, integrated memory and NVRAM management
- Push checks to most efficient level (find early, contain, reduce overhead)
- Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
- Efficient implementation support in runtime, OS, architecture ... increase efficiency and containment
Multi-version Memory
- Common parallel paradigm, basis for programmer engagement
- Frames invariant checks, more complex checks based on high-level semantics
- Frames sophisticated recovery
Research Challenges
- Understand application resilience needs and opportunities for proportional resilience and deep error detection/end-to-end resilience
- Explore multi-version memory as opportunity for framing richer resilience and parallelism
- Design an API that embodies these ideas and supports a gentle slope of incremental application effort
- Create efficient x-layer implementations - many questions
- Explore architecture opportunities to increase resilience and reduce overhead
Global‑view data Program
GVR Resilience Program
Global View & Consistent Snapshots
- How to safely, efficiently identify consistent snapshots?
- Application control: Global Synch; Array-level synch; explicit snapshot
- Application flagged (optional)
- Implicit (runtime decides)
- Snapshots = natural points to express and implement assertions, checks, recovery
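One way to read "snapshots as natural points for checks and recovery" is sketched below, assuming explicit, application-controlled snapshots and an application-supplied invariant. All names here (`take_snapshot`, `latest_valid`, `conserves_mass`) are hypothetical and do not come from the GVR project. The example invariant is a conserved sum, standing in for a domain-specific check.

```python
from typing import Callable, List, Optional

State = List[float]
Check = Callable[[State], bool]


def take_snapshot(history: List[State], state: State) -> None:
    """Explicit, application-controlled snapshot: store a copy of the state."""
    history.append(list(state))


def latest_valid(history: List[State], check: Check) -> Optional[State]:
    """Walk back from the newest snapshot to the first one that passes the check."""
    for snap in reversed(history):
        if check(snap):
            return list(snap)
    return None  # no snapshot satisfies the invariant


def conserves_mass(total: float, tol: float = 1e-9) -> Check:
    """Build an invariant: the sum of all cells must stay equal to `total`."""
    return lambda s: abs(sum(s) - total) <= tol


history: List[State] = []
check = conserves_mass(10.0)

state = [2.5, 2.5, 2.5, 2.5]
take_snapshot(history, state)   # snapshot 0: valid

state = [4.0, 1.0, 2.5, 2.5]    # mass moved between cells, sum unchanged
take_snapshot(history, state)   # snapshot 1: valid

state = [4.0, 1.0, 2.5, 7.0]    # latent error: the sum is no longer 10
take_snapshot(history, state)   # snapshot 2: captured before the error is detected

recovered = latest_valid(history, check)  # skips snapshot 2, returns snapshot 1
```

Because the check runs against stored versions, an error that was captured in the newest snapshot can still be recovered from by falling back to an older one.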
Implementing Multi-version
- How to implement multi-version efficiently?
- Time, Space, Label => representation, protocol
- Which to take?
- Versions are logical, snapshots require resources
- Intelligent storage:
- Representation, compression, architecture support
- Older versions recede into storage [SILT]
Intelligent Memory and Storage
- How to exploit intelligence at memory and storage? (at controller)
- Intelligent stacked DRAM and storage-class Memory [HMC,PIM]
- Fine-grained state tracking; compression, intelligent copying, etc.
- Efficient version capture; differenced checkpoints (Plank95, Svard11)
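The general idea behind differenced version capture can be sketched as follows: keep one full base copy and, for each later version, store only the elements that changed. This is an illustration of the concept only, not the method of the cited papers or of GVR. The `DeltaStore` class and its methods are hypothetical, and the sketch assumes a fixed-length array of integers.

```python
from typing import Dict, List


class DeltaStore:
    """Full base copy plus one sparse delta per later version (hypothetical sketch)."""

    def __init__(self, base: List[int]) -> None:
        self._base = list(base)                  # version 0, stored in full
        self._latest = list(base)                # contents of the newest version
        self._deltas: List[Dict[int, int]] = []  # one {index: new_value} map per version

    def capture(self, current: List[int]) -> int:
        """Record only the elements that differ from the previous version."""
        delta = {i: v for i, v in enumerate(current) if v != self._latest[i]}
        self._deltas.append(delta)
        self._latest = list(current)
        return len(self._deltas)                 # new version number (base is 0)

    def restore(self, version: int) -> List[int]:
        """Rebuild a version by replaying deltas 1..version on top of the base."""
        out = list(self._base)
        for delta in self._deltas[:version]:
            for i, v in delta.items():
                out[i] = v
        return out

    def stored_elements(self) -> int:
        """Number of array elements held, counting the base and all deltas."""
        return len(self._base) + sum(len(d) for d in self._deltas)


store = DeltaStore([0] * 8)
v1 = store.capture([0, 5, 0, 0, 0, 0, 0, 0])
v2 = store.capture([0, 5, 0, 0, 9, 0, 0, 0])
```

Here three versions of an 8-element array occupy 10 stored elements instead of 24 for three full copies. The trade-off is that restoring an older version requires replaying deltas.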