Actions

Communications: Difference between revisions

From Modelado Foundation

imported>Groved
No edit summary
imported>ERoman
No edit summary
Line 20: Line 20:
|(TG)
|(TG)
|
|
The GASNet-EX communications library will provide Active Messages, one-sided data-movement and collectives as its primary communications primitives, with atomic operations being secondary (over AM initially, and natively as time permits)These are used in various ways by the multiple programming models under study by the DEGAS project.  The Active Message primitives in particular support the asynchronous/remote invocation operations present in both the Habanero and UPC++ efforts, while atomics will provide efficient point-to-point synchronization.
The GASNet-EX communications library provides Active Messages (AM), one-sided data-movement and collectives as its primary communications primitives.  Secondary primitives, such as atomics, may be emulated via AM or implemented through native hardware when availableThe programming models in the DEGAS project use these primitives for many purposes.  The Active Message primitives support the asynchronous remote invocation operations present in the Habanero and UPC++ efforts, while atomics will provide efficient point-to-point synchronization.
|The primary focus is computation via asynchronous tasks. The primary communication primitive is (reliably delivered) fire-and-forget active messages. Higher level behavior (finish, at) is synthesized by the APGAS runtime ontop of the active message primitive.  However, for performance at scale we recognize the importance of additional primitives: both non-blocking collectives and one-sided asynchronous RDMA's for point-to-point bulk data transfer.  
|The primary focus is computation via asynchronous tasks. The primary communication primitive is (reliably delivered) fire-and-forget active messages. Higher level behavior (finish, at) is synthesized by the APGAS runtime ontop of the active message primitive.  However, for performance at scale we recognize the importance of additional primitives: both non-blocking collectives and one-sided asynchronous RDMA's for point-to-point bulk data transfer.  
|(DynAX)
|(DynAX)
Line 32: Line 32:
|''It is important for performance portability that most communication related codes comprise invariants that will always be true. Aggregation, routing, order sensitivity, time to arrival, and error management should be transparent to the user. Destinations, in most cases, should be relative to placement of first class objects for adaptive placement and routing. Scheduling should tolerate asynchrony uncertainty of message delivery without forfeit of performance assuming sufficient parallelism.''
|''It is important for performance portability that most communication related codes comprise invariants that will always be true. Aggregation, routing, order sensitivity, time to arrival, and error management should be transparent to the user. Destinations, in most cases, should be relative to placement of first class objects for adaptive placement and routing. Scheduling should tolerate asynchrony uncertainty of message delivery without forfeit of performance assuming sufficient parallelism.''
|(TG)
|(TG)
|DEGAS is extending the GASNet APIs to produce GASNet-EX.  GASNet-EX will continue to take the position that its role is to '''support''' the chosen computation model rather than dictate it.  One of the major changes relative to the current GASNet will the the treatment of threads as first-class entities (as in the MPI endpoints proposal if that means anything to the reader).  This will allow efficient mapping of non-SPMD execution models (such as Habanero and other similar models) that are impractical or inefficient today.
|DEGAS is extending the GASNet APIs to produce GASNet-EX.  GASNet-EX is designed to '''support''' the computation model rather than dictate it.  Unlike the current GASNet, GASNet-EX treats threads as first-class entities (as in the MPI endpoints proposal), allowing efficient mapping of non-SPMD execution models, e.g. Habanero, that are impractical or inefficient today.
| It is not clear to us that tight integration of communication libraries and the computation model is needed to support non SPMD execution.  The X10/APGAS runtime supports non-SPMD execution at scale while maintaining a fairly strict separation between the communication layer (X10RT) and the computational model.  X10RT provides basic active message facilities, but all higher-level computational model concerns are handled above the X10RT layer of the runtime.  However, there are certainly opportunities to optimize some of the small "control" messages sent over the X10RT transport by the APGAS runtime layer by off-loading pieces of runtime logic into message handlers that could run directly within the network software/hardware.  Pushing this function down simply requires the network layer to allow execution of user-provided handlers, not a true integration of the computation model into the communication library.
| It is not clear to us that tight integration of communication libraries and the computation model is needed to support non SPMD execution.  The X10/APGAS runtime supports non-SPMD execution at scale while maintaining a fairly strict separation between the communication layer (X10RT) and the computational model.  X10RT provides basic active message facilities, but all higher-level computational model concerns are handled above the X10RT layer of the runtime.  However, there are certainly opportunities to optimize some of the small "control" messages sent over the X10RT transport by the APGAS runtime layer by off-loading pieces of runtime logic into message handlers that could run directly within the network software/hardware.  Pushing this function down simply requires the network layer to allow execution of user-provided handlers, not a true integration of the computation model into the communication library.
|(DynAX)
|(DynAX)
Line 45: Line 45:
|''Time to solution of application workload, with minimum energy cost within that scope.''  
|''Time to solution of application workload, with minimum energy cost within that scope.''  
|(TG)
|(TG)
| Communications libraries, or layers of runtime immediately above them, can/should be responsible only for "dynamic" optimization of communication such as: aggregation of messages with same destination, scheduling multiple links, injection control for congestion avoidance.  Compilers or application developers should be responsible for "static" optimizations such as communication avoidance, hot-spot elimination, etc.
| Communications libraries and neighboring runtime layers should be responsible only for '''dynamic''' optimization of communication.  Examples of such optimizations include: aggregation of messages with the same destination, scheduling multiple links, and injection control for congestion avoidance.  Compilers or application developers should be responsible for '''static''' optimizations such as communication avoidance, hot-spot elimination, etc.


Primary metrics for communications runtime include latency of short messages, bandwidth of large messages, and communications/computation overlap opportunity during long-latency operations.  Reduction of energy is a metric for the stack as a whole and may be more dependent on '''avoiding''' communication than on optimizing it (see also energy-related question).
Primary metrics for communications runtime include latency of short messages, bandwidth of large messages, and communications/computation overlap opportunity during long-latency operations.  Reduction of energy is a metric for the stack as a whole and may be more dependent on '''avoiding''' communication than on optimizing it (see also energy-related question).
Line 59: Line 59:
|''To first order runtime system assumes correct operation of communication libraries as being pursued by Portals-4 and the experimental Photon communication fabric. Under NNSA PSAAP-2 the Micro-checkpoint Compute-Validate-Commit cycle will detect errors including those due to communication failures.''
|''To first order runtime system assumes correct operation of communication libraries as being pursued by Portals-4 and the experimental Photon communication fabric. Under NNSA PSAAP-2 the Micro-checkpoint Compute-Validate-Commit cycle will detect errors including those due to communication failures.''
|(TG)
|(TG)
|DEGAS is pursuing a two-pronged approach to resilience which consists of both backward recovery (rollback-recovery of state via checkpoint/restart) and forward recovery (recompute or re-communicate faulty state via Containment Domains), working together in the same run.  The ability of Containment Domains to isolate faults and to perform most recovery ''locally'' is ideal for most "soft errors", while the use of rollback-recovery is appropriate to hard node crashes.  The combination of the two greatly reduces the frequency of checkpoints required to provide effective protection, and may allow the application of rollback-recovery to be limited to subsets of the nodes (ideally only the crashed one).
|DEGAS is pursuing a hybrid approach to resilience which consists of both backward recovery (rollback-recovery of state via checkpoint/restart) and forward recovery (recompute or re-communicate faulty state via Containment Domains), working together in the same run.  The ability of Containment Domains to isolate faults and to perform most recovery ''locally'' is ideal for most "soft errors", while the use of rollback-recovery is appropriate to hard node crashes.  The combination of the two not only reduces the frequency of checkpoints required to provide effective protection, but also limits the
type of errors that an application programmer must tolerate.  Further, our approach allows the scope of rollback-recovery to be limited to subsets of the nodes and, in some cases, only the faulty nodes need to perform recovery.


The roll of the communications library is slightly different with respect to these two resilience mechanisms.  For rollback-recovery GASNet-EX must include a mechanism to capture a consistent state, which is significantly more "interesting" a problem with one-sided communication than in a message-passing system especially if one does not wish to quiesce the entire application in order to take a checkpoint.  For Containment Domains, the roll of GASNet-EX is to provide an implementation that runs-through communications failures by reacting '''not''' by aborting, but instead by providing notifications to other runtime components, providing them with interfaces to take appropriate actions, and avoiding internal resource leaks associated with (for instance) now-unreachable peers.
The communications library supports each resilience mechanism in appropriate ways.  For rollback-recovery, GASNet-EX must include a mechanism to capture a consistent state, a significantly more challenging a problem with one-sided communication than in a message-passing system, especially if one does not wish to quiesce all application communications for a consistent checkpoint.  For Containment Domains, GASNet-EX must run through communications failures by reacting (not aborting), by notifying other runtime components, by allowing these compenents to take appropriate actions, and by preventing resource leaks associated with (for instance) now-unreachable peers.
| This is not an area of research for D-TEC.  We are assuming the low-level communication libraries will (at least as visible by our layer) operate correctly or report faults when it does not.  Any faults reported by the underlying communication library will be reflected up to higher-levels of the runtime stack.
| This is not an area of research for D-TEC.  We are assuming the low-level communication libraries will (at least as visible by our layer) operate correctly or report faults when it does not.  Any faults reported by the underlying communication library will be reflected up to higher-levels of the runtime stack.
|(DynAX)
|(DynAX)
Line 85: Line 86:
|''Vendor systems can help with redundant paths and dynamic routing. Runtime system data and task placement can attempt to maximize locality for reduced message traffic contention.''
|''Vendor systems can help with redundant paths and dynamic routing. Runtime system data and task placement can attempt to maximize locality for reduced message traffic contention.''
|(TG)
|(TG)
|As others have already observed, the first line of defense against congestion is intelligent relative placement of tasks and their data.  This is the domain of the tasking runtime and/or the application author.
|As others have observed, the first line of defense against congestion is intelligent placement of tasks and their data.  This is the domain of the tasking runtime and the application author.


Ideally the vendor systems can be relied upon to provide some degree of congestion management that utilizes information not necessarily available to the communications runtime (e.g. static information about their network and dynamic information about traffic, including traffic from ''other'' jobs).  However, that is necessarily "reactive" and thus assumes that the recent past is predictive of the future.  However, compilers and runtime components with "macro" communications behaviors (such as collectives and libraries with well-structured comms) have the potential to provide the communications layer with information about near-future communications.  Such information can then be used to build communications schedules that avoid (or at least reduce) congestionSuch scheduling approaches would be greatly enhanced if the vendor can provide information about current conditions, especially for networks in which multiple jobs utilize the same links.
Ideally, the vendor systems would provide some degree of congestion management.  This would use information not necessarily available to the communications runtime, e.g. static information about their network, dynamic information about application traffic, and traffic from ''other'' jobs.  However, compilers and runtime components with "macro" communications behaviors, i.e. collectives or other structured communications, could potentially inform the communications layer about near-future communications, where this information can be used to build congestion-avoiding communications schedules.  These scheduling approaches can be greatly enhanced if the vendors provide information about current network conditions, particularly for networks where multiple jobs share the same links.
| Higher levels should focus on task & data placement to increase locality and reduce redundant communication. Placement should also be aware of network topology and optimize towards keeping frequently communicating tasks in the same "neighborhood" when possible. Micro-optimization of routing, congestion management, and flow control are probably most effectively handled by system/vendor mechanisms since it may require detailed understanding of the internals of network software/hardware and the available dynamic information.
| Higher levels should focus on task & data placement to increase locality and reduce redundant communication. Placement should also be aware of network topology and optimize towards keeping frequently communicating tasks in the same "neighborhood" when possible. Micro-optimization of routing, congestion management, and flow control are probably most effectively handled by system/vendor mechanisms since it may require detailed understanding of the internals of network software/hardware and the available dynamic information.
|(DynAX)
|(DynAX)

Revision as of 23:51, May 9, 2014

QUESTIONS XPRESS TG X-Stack DEGAS D-TEC DynAX X-TUNE GVR CORVETTE SLEEC PIPER
PI Ron Brightwell Shekhar Borkar Katherine Yelick Daniel Quinlan Guang Gao Mary Hall Andrew Chien Koushik Sen Milind Kulkarni Martin Schulz
What are the communication "primitives" that you expect to emphasize within your project? (e.g. two-sided vs one-sided, collectives, topologies, groups) Do we need to define extensions to the traditional application level interfaces which now emphasize only data transfers and collective operations? Do we need atomics, remote invocation interfaces, or should these be provided ad-hoc by clients? The communication primitive is based on the “parcel” protocol that is an expanded form of active-message and that operates within a global address space distributed across “localities” (approx. nodes). Logical destinations are hierarchical global names, actions include instantiations of threads and ParalleX processes (spanning multiple localities), data movement, compound atomic operations, and OS calls. Continuations determine follow-on actions. Payload conveys data operands and block data for moves. Parcels are an integral component of the semantics of the ParalleX execution model providing symmetric semantics in the domain of asynchronous distributed processing to local synchronous processing on localities (nodes). (TG)

The GASNet-EX communications library provides Active Messages (AM), one-sided data-movement and collectives as its primary communications primitives. Secondary primitives, such as atomics, may be emulated via AM or implemented through native hardware when available. The programming models in the DEGAS project use these primitives for many purposes. The Active Message primitives support the asynchronous remote invocation operations present in the Habanero and UPC++ efforts, while atomics will provide efficient point-to-point synchronization.

The primary focus is computation via asynchronous tasks. The primary communication primitive is (reliably delivered) fire-and-forget active messages. Higher level behavior (finish, at) is synthesized by the APGAS runtime ontop of the active message primitive. However, for performance at scale we recognize the importance of additional primitives: both non-blocking collectives and one-sided asynchronous RDMA's for point-to-point bulk data transfer. (DynAX) (X-TUNE) (GVR) (CORVETTE) SLEEC Communication will be out of band and need to be isolated, emphasis on streaming communication
Traditional communication libraries (e.g. MPI or GASNet) have been developed without tight integration with the computation "model". What is your strategy for integrating communication and computation to address the needs of non SPMD execution? It is important for performance portability that most communication related codes comprise invariants that will always be true. Aggregation, routing, order sensitivity, time to arrival, and error management should be transparent to the user. Destinations, in most cases, should be relative to placement of first class objects for adaptive placement and routing. Scheduling should tolerate asynchrony uncertainty of message delivery without forfeit of performance assuming sufficient parallelism. (TG) DEGAS is extending the GASNet APIs to produce GASNet-EX. GASNet-EX is designed to support the computation model rather than dictate it. Unlike the current GASNet, GASNet-EX treats threads as first-class entities (as in the MPI endpoints proposal), allowing efficient mapping of non-SPMD execution models, e.g. Habanero, that are impractical or inefficient today. It is not clear to us that tight integration of communication libraries and the computation model is needed to support non SPMD execution. The X10/APGAS runtime supports non-SPMD execution at scale while maintaining a fairly strict separation between the communication layer (X10RT) and the computational model. X10RT provides basic active message facilities, but all higher-level computational model concerns are handled above the X10RT layer of the runtime. However, there are certainly opportunities to optimize some of the small "control" messages sent over the X10RT transport by the APGAS runtime layer by off-loading pieces of runtime logic into message handlers that could run directly within the network software/hardware. Pushing this function down simply requires the network layer to allow execution of user-provided handlers, not a true integration of the computation model into the communication library. (DynAX) (X-TUNE) (GVR) (CORVETTE) N/A N/A
What type of optimizations should be transparently provided by a communication layer and what should be delegated to compilers or application developers?

What is the primary performance metric for your runtime?

Time to solution of application workload, with minimum energy cost within that scope. (TG) Communications libraries and neighboring runtime layers should be responsible only for dynamic optimization of communication. Examples of such optimizations include: aggregation of messages with the same destination, scheduling multiple links, and injection control for congestion avoidance. Compilers or application developers should be responsible for static optimizations such as communication avoidance, hot-spot elimination, etc.

Primary metrics for communications runtime include latency of short messages, bandwidth of large messages, and communications/computation overlap opportunity during long-latency operations. Reduction of energy is a metric for the stack as a whole and may be more dependent on avoiding communication than on optimizing it (see also energy-related question).

Under the assumption that the communication layer is not tightly integrated with the computation model, the scope of transparent optimization seems limited to optimizing the flow of traffic within the network. Perhaps also providing performance diagnostics and control points to higher-levels of the runtime to enable them to optimize communication behavior. Optimizations need to be planned/managed at a level of the stack that has sufficient scope to make good decisions. (DynAX) (X-TUNE) (GVR) (CORVETTE) N/A unclear
What is your strategy towards resilient communication libraries? To first order runtime system assumes correct operation of communication libraries as being pursued by Portals-4 and the experimental Photon communication fabric. Under NNSA PSAAP-2 the Micro-checkpoint Compute-Validate-Commit cycle will detect errors including those due to communication failures. (TG) DEGAS is pursuing a hybrid approach to resilience which consists of both backward recovery (rollback-recovery of state via checkpoint/restart) and forward recovery (recompute or re-communicate faulty state via Containment Domains), working together in the same run. The ability of Containment Domains to isolate faults and to perform most recovery locally is ideal for most "soft errors", while the use of rollback-recovery is appropriate to hard node crashes. The combination of the two not only reduces the frequency of checkpoints required to provide effective protection, but also limits the

type of errors that an application programmer must tolerate. Further, our approach allows the scope of rollback-recovery to be limited to subsets of the nodes and, in some cases, only the faulty nodes need to perform recovery.

The communications library supports each resilience mechanism in appropriate ways. For rollback-recovery, GASNet-EX must include a mechanism to capture a consistent state, a significantly more challenging a problem with one-sided communication than in a message-passing system, especially if one does not wish to quiesce all application communications for a consistent checkpoint. For Containment Domains, GASNet-EX must run through communications failures by reacting (not aborting), by notifying other runtime components, by allowing these compenents to take appropriate actions, and by preventing resource leaks associated with (for instance) now-unreachable peers.

This is not an area of research for D-TEC. We are assuming the low-level communication libraries will (at least as visible by our layer) operate correctly or report faults when it does not. Any faults reported by the underlying communication library will be reflected up to higher-levels of the runtime stack. (DynAX) (X-TUNE) (GVR) (CORVETTE) N/A ability to drop and reroute around failed processes
What and how can a communication layer help in power and energy optimizations? Energy waste on unused channels needs to be prevented. Delays due to contention for hotspots need to be mitigated through dynamic routing. Information on message traffic, granularity, and power needs to be provided to OSR. (TG) If/when applications become blocked waiting for communications to complete, one should consider energy-aware mechanisms for blocking. Other than that, most mechanisms for energy reduction with respect to communication are also effective for reducing time-to-solution and are likely to be studied in that context (where the metric is easier to measure). Not a topic of research for D-TEC. (DynAX) (X-TUNE) (GVR) (CORVETTE) N/A N/A
Congestion management and flow control mechanisms are of particular concern at very large scale. How much can we rely on "vendor" mechanisms and how much do we need to address in higher level layers? Vendor systems can help with redundant paths and dynamic routing. Runtime system data and task placement can attempt to maximize locality for reduced message traffic contention. (TG) As others have observed, the first line of defense against congestion is intelligent placement of tasks and their data. This is the domain of the tasking runtime and the application author.

Ideally, the vendor systems would provide some degree of congestion management. This would use information not necessarily available to the communications runtime, e.g. static information about their network, dynamic information about application traffic, and traffic from other jobs. However, compilers and runtime components with "macro" communications behaviors, i.e. collectives or other structured communications, could potentially inform the communications layer about near-future communications, where this information can be used to build congestion-avoiding communications schedules. These scheduling approaches can be greatly enhanced if the vendors provide information about current network conditions, particularly for networks where multiple jobs share the same links.

Higher levels should focus on task & data placement to increase locality and reduce redundant communication. Placement should also be aware of network topology and optimize towards keeping frequently communicating tasks in the same "neighborhood" when possible. Micro-optimization of routing, congestion management, and flow control are probably most effectively handled by system/vendor mechanisms since it may require detailed understanding of the internals of network software/hardware and the available dynamic information. (DynAX) (X-TUNE) (GVR) (CORVETTE) N/A N/A