DEGAS: Difference between revisions

Revision as of 21:56, February 6, 2013

DEGAS

Team Members	LBNL, Rice U., UC Berkeley, UT Austin, LLNL, NCSU
PI	Katherine Yelick (LBNL)
Co-PIs	Vivek Sarkar (Rice U.), James Demmel (UC Berkeley), Mattan Erez (UT Austin), Dan Quinlan (LLNL)
Website	team website


Team Members

Lawrence Berkeley National Laboratory (LBNL)
Rice University
University of California, Berkeley
University of Texas at Austin
Lawrence Livermore National Laboratory (LLNL)
North Carolina State University (NCSU)

Mission

Mission Statement: To ensure the broad success of Exascale systems through a unified programming model that is productive, scalable, portable, and interoperable, and meets the unique Exascale demands of energy efficiency and resilience.

Goals & Objectives

Scalability: Billion‐way concurrency, thousand‐way on chip with new architectures
Programmability: Convenient programming through a global address space and high‐level abstractions for parallelism, data movement and resilience
Performance Portability: Ensure applications can be moved across diverse machines using implicit (automatic) compiler optimizations and runtime adaptation
Resilience: Integrated language support for capturing state and recovering from faults
Energy Efficiency: Avoid communication, which will dominate energy costs, and adapt to performance heterogeneity due to system‐level energy management
Interoperability: Encourage use of languages and features through incremental adoption

Programming Models

Two Distinct Parallel Programming Questions

What is the parallel control model?

What is the model for sharing/communication?

Applications Drive New Programming Models

Message Passing Programming
- Divide up domain in pieces
- Compute one piece and exchange
- MPI and many libraries

Global Address Space Programming
- Each process starts computing
- Grab whatever/whenever
- UPC, CAF, X10, Chapel, Fortress, Titanium, GlobalArrays
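The contrast between the two models can be sketched in a few lines of Python. This is a sequential toy, not code from any of the languages listed: the ranks are simulated in a loop, and the names `split`, `message_passing_sum`, and `global_address_sum` are made up for illustration. Both functions compute out[i] = x[i] + x[i+1] (with 0 past the end) and assume the data length is divisible by the number of ranks.

```python
from collections import deque

def split(data, nranks):
    """Block-partition data into nranks contiguous pieces."""
    size = len(data) // nranks
    return [data[r * size:(r + 1) * size] for r in range(nranks)]

def message_passing_sum(data, nranks):
    """Each rank owns one piece; neighbour edge values arrive as messages."""
    pieces = split(data, nranks)
    inbox = [deque() for _ in range(nranks)]
    # Exchange phase: every rank sends its first element to its left neighbour.
    for r in range(1, nranks):
        inbox[r - 1].append(pieces[r][0])
    # Compute phase: each rank works on its piece plus the received halo value.
    out = []
    for r in range(nranks):
        halo = inbox[r].popleft() if inbox[r] else 0
        local = pieces[r] + [halo]
        out.extend(local[i] + local[i + 1] for i in range(len(pieces[r])))
    return out

def global_address_sum(data, nranks):
    """Each rank starts computing and reads remote elements when it needs them."""
    size = len(data) // nranks

    def get(i):
        # One-sided read from the global array; 0 past the end.
        return data[i] if i < len(data) else 0

    out = []
    for r in range(nranks):
        for i in range(r * size, (r + 1) * size):
            out.append(get(i) + get(i + 1))
    return out
```

For the input 1..8 on four ranks both functions return [3, 5, 7, 9, 11, 13, 15, 8]. The difference is in who initiates data movement: the message passing version needs an explicit exchange step before computing, while the global address space version simply indexes across the partition boundary.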

Hierarchical Programming Model

Goal: Programmability of exascale applications while providing scalability, locality, energy efficiency, resilience, and portability
- Implicit constructs: parallel multidimensional loops, global distributed data structures, adaptation for performance heterogeneity
- Explicit constructs: asynchronous tasks, phaser synchronization, locality

Built on scalability, performance, and asynchrony of PGAS models
- Language experience from UPC, Habanero‐C, Co‐Array Fortran, Titanium

Both intra‐ and inter‐node; focus is on node model

Languages demonstrate DEGAS programming model
- Habanero‐UPC: Habanero’s intra‐node model with UPC’s inter‐node model
- Hierarchical Co‐Array Fortran (CAF): CAF for on‐chip scaling and more
- Exploration of high level languages: E.g., Python extended with H‐PGAS

Language‐independent H‐PGAS Features:
- Hierarchical distributed arrays, asynchronous tasks, and compiler specialization for hybrid (task/loop) parallelism and heterogeneity
- Semantic guarantees for deadlock avoidance, determinism, etc.
- Asynchronous collectives, function shipping, and hierarchical places
- End‐to‐end support for asynchrony (messaging, tasking, bandwidth utilization through concurrency)
- Early concept exploration for applications and benchmarks
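As a rough illustration of two of the constructs named above, hierarchical places and phased synchronization between asynchronous tasks, here is a minimal Python sketch. It is not Habanero‐UPC or H‐CAF code: thread pools stand in for places, `threading.Barrier` stands in for a phaser, and `hierarchical_sum` and `phased_ring` are hypothetical names. `hierarchical_sum` assumes the data length is divisible by the number of places.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def hierarchical_sum(data, nplaces, tasks_per_place):
    """Two-level reduction: async tasks inside each place, then across places."""
    chunk = len(data) // nplaces          # assumes len(data) % nplaces == 0

    def place_sum(p):
        local = data[p * chunk:(p + 1) * chunk]        # data owned by place p
        step = max(1, len(local) // tasks_per_place)
        parts = [local[i:i + step] for i in range(0, len(local), step)]
        with ThreadPoolExecutor(max_workers=tasks_per_place) as pool:
            return sum(pool.map(sum, parts))           # intra-place tasks

    with ThreadPoolExecutor(max_workers=nplaces) as pool:
        return sum(pool.map(place_sum, range(nplaces)))  # inter-place level

def phased_ring(values, nphases):
    """Each task adds its left neighbour's value once per phase."""
    n = len(values)
    bufs = [list(values), [0] * n]        # double buffer, roles swap each phase
    barrier = threading.Barrier(n)        # stands in for a phaser

    def task(i):
        for k in range(nphases):
            src, dst = bufs[k % 2], bufs[(k + 1) % 2]
            dst[i] = src[i] + src[(i - 1) % n]
            barrier.wait()                # nobody starts phase k+1 early

    threads = [threading.Thread(target=task, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs[nphases % 2]
```

The barrier at the end of each phase is what makes the double buffer safe: a task only overwrites a buffer after every task has finished reading it.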

Communication-Avoiding Compilers

Goal: massive parallelism, deep memory and network hierarchies, plus functional and performance heterogeneity
- Fine‐grained task and data parallelism: enable performance portability
- Heterogeneity: guided by functional, energy and performance characteristics
- Energy efficiency: minimize data movement and provide hooks for runtime adaptation
- Programmability: manage details of memory, heterogeneity, and containment
- Scalability: communication and synchronization hiding through asynchrony

H-PGAS into the Node
- Communication is all data movement

Build on code‐generation infrastructure
- ROSE for H‐CAF and Communication‐Avoidance optimizations
- BUPC and Habanero‐C; Zoltan
- Additional theory of CA code generation

Exascale Programming: Support for Future Algorithms

Approach: “Rethink” algorithms to optimize for data movement
- New class of communication‐optimal algorithms
- Most codes are not bandwidth limited, but many should be

Challenges: How general are these algorithms?
- Can they be automated and for what types of loops?
- How much benefit is there in practice?
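One way to make "optimize for data movement" concrete is to count words moved between slow and fast memory for a tiled matrix multiply. The sketch below is a textbook-style model, not a DEGAS compiler output: it assumes a fast memory that holds exactly three b-by-b tiles and charges one word per element loaded or stored. The function name `blocked_matmul` and the counting rules are assumptions made for this example.

```python
def blocked_matmul(A, B, b):
    """Multiply square matrices with b-by-b tiles; return (C, words_moved)."""
    n = len(A)
    assert n % b == 0, "tile size must divide the matrix size"
    C = [[0] * n for _ in range(n)]
    words = 0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            words += b * b              # load the C tile once
            for k0 in range(0, n, b):
                words += 2 * b * b      # load one A tile and one B tile
                for i in range(i0, i0 + b):
                    for j in range(j0, j0 + b):
                        s = 0
                        for k in range(k0, k0 + b):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
            words += b * b              # store the finished C tile
    return C, words
```

Under this model the traffic is 2n^3/b + 2n^2 words, so for n = 4 the count drops from 160 words with b = 1 to 96 with b = 2 and 64 with b = 4, while the arithmetic stays the same. Larger tiles mean more reuse per word moved, which is the effect communication‐avoiding algorithms aim to maximize.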

Adaptive Runtime Systems (ARTS)

Goal: Adaptive runtime for manycore systems that are hierarchical, heterogeneous and provide asymmetric performance
- Reactive and proactive control: for utilization and energy efficiency
- Integrated tasking and communication: for hybrid programming
- Sharing of hardware threads: required for library interoperability

Novelty: Scalable control; integrated tasking with communication
- Adaptation: Runtime annotated with performance history/intentions
- Performance models: Guide runtime optimizations, specialization
- Hierarchical: Resource/energy
- Tunable control: Locality/load balance

Leverages: Existing runtimes
- Lithe scheduler composition; Juggle
- BUPC and Habanero‐C runtimes
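The idea of a runtime that adapts to asymmetric performance using its own history can be sketched as a small simulation. This is not the ARTS design or the Lithe, Juggle, BUPC, or Habanero‐C runtimes: `adaptive_schedule` is a hypothetical greedy scheduler that starts with no knowledge of worker speeds, observes how long each task actually took, and keeps an exponentially weighted estimate per worker (the `alpha` parameter is an assumption of this sketch).

```python
def adaptive_schedule(task_costs, worker_speeds, alpha=0.5):
    """Greedy assignment using a running estimate of each worker's speed."""
    nworkers = len(worker_speeds)
    est_speed = [1.0] * nworkers       # the runtime starts with no history
    busy_until = [0.0] * nworkers
    assignment = []
    for cost in task_costs:
        # Pick the worker with the earliest predicted finish time.
        w = min(range(nworkers),
                key=lambda i: busy_until[i] + cost / est_speed[i])
        elapsed = cost / worker_speeds[w]      # what actually happens
        busy_until[w] += elapsed
        observed = cost / elapsed              # speed seen for this task
        est_speed[w] = (1 - alpha) * est_speed[w] + alpha * observed
        assignment.append(w)
    return assignment, max(busy_until)
```

With eight tasks of cost 4 and true speeds of 1 and 4, the scheduler sends the first task to worker 0, learns that worker 1 is faster, and routes the rest there, finishing at time 7. A static round‐robin split of the same tasks would finish at time 16.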

Synchronization Avoidance vs Resource Management

Management of critical resources will be more important:
- Memory and network bandwidth limited by cost and energy
- Capacity limited at many levels: network buffers at interfaces, internal network congestion are real and growing problems

Can runtimes manage these or do users need to help?
- Adaptation based on history and (user‐supplied) intent?
- Where will bottlenecks be for a given architecture and application?

Lithe Scheduling Abstraction: "Harts" (Hardware Threads)

Lightweight Communication (GASNet-EX)

Goal: Maximize bandwidth use with lightweight communication
- One‐sided communication: to avoid over‐synchronization
- Active‐Messages: for productivity and portability
- Interoperability: with MPI and threading layers

Novelty:
- Congestion management: for 1‐sided communication with ARTS
- Hierarchical: communication management for H‐PGAS
- Resilience: globally consistent states and fine‐grained fault recovery
- Progress: new models for scalability and interoperability

Leverage GASNet (redesigned):
- Major changes for on‐chip interconnects
- Each network has unique opportunities
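One‐sided communication and active messages can be illustrated with a toy in‐process model. None of the names below are GASNet or GASNet‐EX calls: `Endpoint`, `put`, `get`, and `am_request` are invented for this sketch, and everything runs in a single Python process with no real network.

```python
class Endpoint:
    """Toy model of one process: a memory segment plus a table of AM handlers."""

    def __init__(self, rank, segment_size):
        self.rank = rank
        self.segment = [0] * segment_size
        self.handlers = {}

    def register(self, name, fn):
        """Make fn callable by name from remote active messages."""
        self.handlers[name] = fn

def put(target, offset, values):
    """One-sided write: the target performs no matching receive."""
    target.segment[offset:offset + len(values)] = values

def get(target, offset, count):
    """One-sided read of count words from the target's segment."""
    return target.segment[offset:offset + count]

def am_request(target, name, *args):
    """Active message: run a registered handler at the target with a payload."""
    return target.handlers[name](target, *args)

def accumulate(endpoint, index, value):
    """Example handler: add value into the target's segment."""
    endpoint.segment[index] += value
    return endpoint.segment[index]
```

A short usage example: create `node = Endpoint(1, 4)`, call `node.register("acc", accumulate)`, then `put(node, 0, [10, 20])` followed by `am_request(node, "acc", 1, 5)` leaves the segment as [10, 25, 0, 0]. The put moves data without involving the target's code at all, while the active message ships a small computation to where the data lives.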

Resilience through Containment Domains

Goal: Provide a resilient runtime for PGAS applications
- Applications should be able to customize resilience to their needs
- Resilient runtime that provides easy‐to‐use mechanisms

Novelty: Single analyzable abstraction for resilience
- PGAS Resilience consistency model
- Directed and hierarchical preservation
- Global or localized recovery
- Algorithm and system‐specific detection, elision, and recovery

Leverage: Combined superset of prior approaches
- Fast checkpoints for large bulk updates
- Journal for small frequent updates
- Hierarchical checkpoint‐restart
- OS‐level save and restore
- Distributed recovery
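The preserve, detect, and recover cycle behind containment domains can be sketched as follows. This is a simplified illustration, not the Containment Domains API: `containment_domain`, `Escalate`, and the dictionary‐based state are assumptions of this example. A domain preserves its state on entry, runs its body, checks the result, restores and retries locally on failure, and escalates to its parent once local retries are exhausted.

```python
import copy

class Escalate(Exception):
    """Raised when a domain cannot recover locally."""

def containment_domain(state, body, check, max_retries=2):
    """Run body(state) with preservation, detection, and local recovery.

    Returns the number of retries that were needed.
    """
    preserved = copy.deepcopy(state)            # preserve inputs on entry
    for attempt in range(max_retries + 1):
        try:
            body(state)                         # may itself open child domains
            if check(state):                    # detection
                return attempt
        except Exception:
            pass                                # treat any error as a fault
        state.clear()                           # recovery: roll back in place
        state.update(copy.deepcopy(preserved))
    raise Escalate("local recovery failed; parent domain must recover")
```

Because a child that raises `Escalate` is seen by its parent as an ordinary fault, nesting calls to `containment_domain` gives hierarchical recovery: the parent restores its own preserved state and re‐executes, which corresponds to the global or localized recovery choice listed above.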

Resilience: Research Questions

1. How to define consistent (i.e. allowable) states in the PGAS model?

Theory well understood for fail‐stop message‐passing, but not PGAS.

2. How do we discover consistent states once we've defined them?

Containment domains offer a new approach, beyond conventional sync‐and‐stop algorithms.

3. How do we reconstruct consistent states after a failure?

Explore low overhead techniques that minimize effort required by applications programmers.
Leverage BLCR, GASNet, Berkeley UPC for development, and use Containment Domains as prototype API for requirements discovery.

Energy and Performance Feedback

Goal: Monitoring and feedback of performance and energy for online and offline optimization
- Collect and distill: performance/energy/timing data
- Identify and report bottlenecks: through summarization/visualization
- Provide mechanisms: for autonomous runtime adaptation

Novelty: Automated runtime introspection
- Provide monitoring: power/network utilization
- Machine Learning: identify common characteristics
- Resource management: including dark silicon

Leverage: Performance/energy counters
- Integrated Performance Monitoring (IPM)
- Roofline formalism
- Performance/energy counters
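The Roofline formalism mentioned above bounds attainable performance by the smaller of peak compute rate and arithmetic intensity times memory bandwidth. A minimal sketch, with a hypothetical function name and made‐up machine numbers in the usage note:

```python
def roofline(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Return (arithmetic intensity, attainable GFLOP/s, limiting resource)."""
    intensity = flops / bytes_moved                  # flops per byte
    memory_roof = intensity * bandwidth_gbs          # bandwidth-limited rate
    attainable = min(peak_gflops, memory_roof)
    bound = "memory" if memory_roof < peak_gflops else "compute"
    return intensity, attainable, bound
```

On an assumed machine with 100 GFLOP/s peak and 10 GB/s bandwidth, a kernel doing 2 flops per 16 bytes has intensity 0.125 and is memory bound at 1.25 GFLOP/s, while a kernel doing 160 flops per 8 bytes has intensity 20 and is compute bound at 100 GFLOP/s. Counter data of the kind collected by the monitoring layer supplies the flops and bytes inputs.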

Software Stack

DEGAS Pieces of the Puzzle

@@ Line 23: / Line 23: @@
 == Mission ==
 '''Mission Statement:''' To ensure the broad success of Exascale systems through a unified programming model that is productive, scalable, portable, and interoperable, and meets the unique Exascale demands of energy efficiency and resilience.
+[[File:DEGAS-Mission.png]]
 == Goals & Objectives ==
@@ Line 31: / Line 33: @@
 * '''Energy Efficiency:''' Avoid communication, which will dominate energy costs, and adapt to performance heterogeneity due to system‐level energy management
 * '''Interoperability:''' Encourage use of languages and features through incremental adoption
-== Roadmap ==
-'''''Any Roadmap to be included?'''''
-== Impact ==
-'''''Any Impact to be included?'''''
@@ Line 43: / Line 38: @@
 === Two Distinct Parallel Programming Questions ===
 * What is the parallel control model?
-[[File:Example.jpg]]
+[[File:DEGAS-Parallel-Control-Model.png]]
 * What is the model for sharing/communication?
-[[File:Example.jpg]]
+[[File:DEGAS-Sharing-Model.png]]
-=== Applications Drive New Programming Models
+=== Applications Drive New Programming Models ===
 * Message Passing Programming
 ** Divide up domain in pieces
 ** Compute one piece and exchange
 ** '''MPI and many libraries'''
+[[File:DEGAS-Message-Passing.png]]
 * Global Address Space Programming
@@ Line 58: / Line 54: @@
 ** Grab whatever/whenever
 ** '''UPC, CAF, X10, Chapel, Fortress, Titanium, GlobalArrays'''
+[[File:DEGAS-Global-Address-Space.png]]
 === Hierarchical Programming Model ===
-[[File:DEGAS-Heirarchical-PM.png|right]]
+[[File:DEGAS-Hierarchical-PM.png|right]]
-* Goal: Programmability of exascale applications while providing scalability, locality, energy efficiency, resilience, and portability
+* '''Goal:''' Programmability of exascale applications while providing scalability, locality, energy efficiency, resilience, and portability
 ** ''Implicit constructs:'' parallel multidimensional loops, global distributed data structures, adaptation for performance heterogeneity
 ** ''Explicit constructs:'' asynchronous tasks, phaser synchronization, locality
@@ Line 83: / Line 80: @@
 === Communication-Avoiding Compilers ===
+[[File:DEGAS-Communication-Node.png|right]]
+* '''Goal:''' massive parallelism, deep memory and network hierarchies, plus functional and performance heterogeneity
+** '''Fine‐grained task and data parallelism:''' enable performance portability
+** '''Heterogeneity:''' guided by functional, energy and performance characteristics
+** '''Energy efficiency:''' minimize data movement and provide hooks for runtime adaptation
+** '''Programmability:''' manage details of memory, heterogeneity, and containment
+** '''Scalability:''' communication and synchronization hiding through asynchrony
+* H-PGAS into the Node
+** Communication is all data movement
+* Build on code‐generation infrastructure
+** ROSE for H‐CAF and Communication‐Avoidance optimizations
+** BUPC and Habanero‐C; Zoltan
+** Additional theory of CA code generation
+=== Exascale Programming: Support for Future Algorithms ===
+[[File:DEGAS-Algorithm.png]]
+* '''Approach:''' “Rethink” algorithms to optimize for data movement
+** New class of communication‐optimal algorithms
+** Most codes are not bandwidth limited, but many should be
+* '''Challenges:''' How general are these algorithms?
+** Can they be automated and for what types of loops?
+** How much benefit is there in practice?
+=== Adaptive Runtime Systems (ARTS) ===
+[[File:DEGAS-Infiniband-Throughput.png|right|400px]]
+* '''Goal:''' Adaptive runtime for manycore systems that are hierarchical, heterogeneous and provide asymmetric performance
+** '''Reactive and proactive control:''' for utilization and energy efficiency
+** '''Integrated tasking and communication:''' for hybrid programming
+** '''Sharing of hardware threads:''' required for library interoperability
+* '''Novelty:''' Scalable control; integrated tasking with communication
+** '''Adaptation:''' Runtime annotated with performance history/intentions
+** '''Performance models:''' Guide runtime optimizations, specialization
+** '''Hierarchical:''' Resource/energy
+** '''Tunable control:''' Locality/load balance
+* '''Leverages:''' Existing runtimes
+** '''Lithe''' scheduler composition; '''Juggle'''
+** '''BUPC and Habanero‐C''' runtimes
+=== Synchronization Avoidance vs Resource Management ===
+[[File:DEGAS-Resource-Mgmt.png|700px]]
+* Management of critical resources will be more important:
+** ''Memory and network bandwidth limited'' by cost and energy
+** ''Capacity limited at many levels:'' network buffers at interfaces, internal network congestion are real and growing problems
+* Can runtimes manage these or do users need to help?
+** Adaptation based on history and (user‐supplied) intent?
+** Where will bottlenecks be for a given architecture and application?
+=== Lithe Scheduling Abstraction: "Harts" (Hardware Threads) ===
+[[File:DEGAS-Harts.png|700px]]
+=== Lightweight Communication (GASNet-EX) ===
+[[File:DEGAS-GASNet.png|right]]
+* '''Goal:''' Maximize bandwidth use with lightweight communication
+** '''One‐sided communication:''' to avoid over‐synchronization
+** '''Active‐Messages:''' for productivity and portability
+** '''Interoperability:''' with MPI and threading layers
+* '''Novelty:'''
+** Congestion management: for 1‐sided communication with ARTS
+** Hierarchical: communication management for H‐PGAS
+** Resilience: globally consistent states and fine‐grained fault recovery
+** Progress: new models for scalability and interoperability
+* '''Leverage GASNet''' (redesigned):
+** Major changes for on‐chip interconnects
+** Each network has unique opportunities
+=== Resilience through Containment Domains ===
+[[File:DEGAS-Resilience.png|right]]
+* '''Goal:''' Provide a resilient runtime for PGAS applications
+** Applications should be able to customize resilience to their needs
+** Resilient runtime that provides easy‐to‐use mechanisms
+* '''Novelty:''' Single analyzable abstraction for resilience
+** PGAS Resilience consistency model
+** Directed and hierarchical preservation
+** Global or localized recovery
+** Algorithm and system‐specific detection, elision, and recovery
+* '''Leverage:''' Combined superset of prior approaches
+** Fast checkpoints for large bulk updates
+** Journal for small frequent updates
+** Hierarchical checkpoint‐restart
+** OS‐level save and restore
+** Distributed recovery
+'''Resilience: Research Questions'''
+1. How to define consistent (i.e. allowable) states in the PGAS model?
+* Theory well understood for fail‐stop message‐passing, but not PGAS.
+2. How do we discover consistent states once we've defined them?
+* Containment domains offer a new approach, beyond conventional sync‐and‐stop algorithms.
+3. How do we reconstruct consistent states after a failure?
+* Explore low overhead techniques that minimize effort required by applications programmers.
+* Leverage BLCR, GASNet, Berkeley UPC for development, and use Containment Domains as prototype API for requirements discovery.
+[[File:DEGAS-Resilience-Research-Area.png|300px]]
+=== Energy and Performance Feedback ===
+[[File:DEGAS-Nvidia-graph.png|right|300px]]
+* '''Goal:''' Monitoring and feedback of performance and energy for online and offline optimization
+** Collect and distill: performance/energy/timing data
+** Identify and report bottlenecks: through summarization/visualization
+** Provide mechanisms: for autonomous runtime adaptation
+* '''Novelty:''' Automated runtime introspection
+** Provide monitoring: power/network utilization
+** Machine Learning: identify common characteristics
+** Resource management: including dark silicon
+* '''Leverage:''' Performance/energy counters
+** Integrated Performance Monitoring (IPM)
+** Roofline formalism
+** Performance/energy counters
 == Software Stack ==
+[[File:DEGAS-Software-Stack.png|500px]]
+== DEGAS Pieces of the Puzzle ==
+[[File:DEGAS-Puzzle.png|500px]]