
Runtimes (OS/hardware-facing)

PIs:
XPRESS: Ron Brightwell
TG: Shekhar Borkar
DEGAS: Katherine Yelick
D-TEC: Daniel Quinlan
DynAX: Guang Gao
X-TUNE: Mary Hall
GVR: Andrew Chien
CORVETTE: Koushik Sen
SLEEC: Milind Kulkarni
PIPER: Martin Schulz

Questions:

What system calls does your RTS currently use?
XPRESS HPX requires basic calls for memory allocation and deallocation, virtual address translation and management, thread execution resource allocation and deallocation, parcel communication transmit and receive, error detection, and others.
TG Our RTS is platform-independent; we have been building a hardware and system abstraction layer that wraps all "system calls" we may need. On x86, we rely on system calls for printing, exiting, and memory and thread management. The same functionality is provided differently on other platforms.
DEGAS
D-TEC Typical POSIX calls for memory allocation/deallocation, threads and synchronization operations, support needed for core libc operations.
DynAX SWARM requires access to hardware threads, memory, and network interconnect(s), whether by system call or direct access. On today's x86 clusters, SWARM additionally needs access to I/O facilities such as the Linux select, read, and write calls (see the sketch after this question's answers).
X-TUNE Presumably, autotuning would only be applied when the software is ready to run in production mode. I suspect correctness software would only be used if the tuning process had some error, in which case some overhead would be tolerable.
GVR
CORVETTE
SLEEC
PIPER The PIPER runtime will be used to collect performance information - it will be out of band, potentially running on external (non-compute node) resources. As such, we require additional communication mechanisms, which is currently mostly done through sockets. Additionally, tools typically use ptrace, signals, and shared memory segments, as well as the dynamic linker for their implementation.
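To make the system-call usage named above concrete, here is a minimal sketch (in C, for Linux) of the select/read/write pattern the DynAX answer mentions. The function name poll_and_echo and the 1 ms timeout are illustrative, not any project's actual code; a runtime would embed such a poll in its scheduler loop rather than blocking indefinitely.

 /* Hedged sketch of the Linux select/read/write pattern named in the
    DynAX answer; names and timeout are illustrative. */
 #include <sys/select.h>
 #include <unistd.h>
 
 int poll_and_echo(int fd_in)
 {
     fd_set rfds;
     struct timeval tv = { 0, 1000 };        /* 1 ms timeout */
     char buf[4096];
 
     FD_ZERO(&rfds);
     FD_SET(fd_in, &rfds);
 
     /* Wait until fd_in is readable or the timeout expires. */
     if (select(fd_in + 1, &rfds, NULL, NULL, &tv) > 0 &&
         FD_ISSET(fd_in, &rfds)) {
         ssize_t got = read(fd_in, buf, sizeof buf);
         if (got > 0)
             write(STDOUT_FILENO, buf, (size_t)got);  /* echo for demo */
         return (int)got;
     }
     return 0;    /* nothing ready; the runtime returns to useful work */
 }
 
 int main(void) { return poll_and_echo(STDIN_FILENO) < 0; }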
Does your RTS span the system? If so, what network interface capability does your RTS need?
XPRESS The HPX RTS spans the system. It requires a global address space and a parcel-based message-driven interface.
TG Yes, it can span the entire system depending on the platform. We have defined very simple communication interfaces (which we will almost certainly extend) that currently allow the RTS to send and receive one-way messages between nodes.
DEGAS
D-TEC We run different instances of the X10/APGAS runtime across different OS instances on the system. They coordinate via active messages (see the sketch after this question's answers). We developed an active-message-based transport API, implemented on top of TCP/IP and MPI.
DynAX Yes, SWARM operates on all available/configured threads of all available/configured nodes. SWARM can operate over stream-, message-, or DMA-based interconnects.
X-TUNE
GVR
CORVETTE
SLEEC
PIPER Tools will have a global "runtime" to collect and aggregate data - this network will be out of band. This will span the whole job, in some cases the whole machine. A high performance communication mechanism would be preferable - currently mostly sockets are used.
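Since several answers above lean on active messages, a hedged sketch of the core dispatch mechanism may help: each message carries the index of a handler to run on arrival. The names below (am_register, am_deliver, the 256-entry table) are illustrative, not the X10/APGAS or SWARM API, and the transport itself (TCP/IP, MPI, etc.) is elided.

 #include <stdint.h>
 #include <stdio.h>
 
 typedef void (*am_handler)(const void *payload, uint32_t len);
 
 static am_handler handler_table[256];        /* registered at startup */
 
 /* An active message names the handler to run at the receiver. */
 struct am_msg { uint8_t handler_id; uint32_t len; char payload[256]; };
 
 static void am_register(uint8_t id, am_handler h) { handler_table[id] = h; }
 
 /* Called by the receive loop for every incoming message. */
 static void am_deliver(const struct am_msg *m)
 {
     if (handler_table[m->handler_id])
         handler_table[m->handler_id](m->payload, m->len);
 }
 
 static void hello_handler(const void *p, uint32_t len)
 {
     printf("AM received: %.*s\n", (int)len, (const char *)p);
 }
 
 int main(void)
 {
     am_register(7, hello_handler);
     struct am_msg m = { 7, 5, "hello" };
     am_deliver(&m);      /* in a real RTS this runs in the recv loop */
     return 0;
 }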
How does your RTS map user-level and OS-level scheduling?
XPRESS The LXK OS allocates a share of its execution resources (e.g., Pthreads) to each relative-root ParalleX process allocated to the locality. The HPX runtime system uses lightweight scheduling policies to assign user threads to the allocated OS threads.
TG Our RTS is built on the assumption that there is almost nothing below it; in other words, we try to rely as little as possible on the operating system. For scheduling, for example, on a traditional x86 Linux system we create a number of pinned worker threads and then manage work on these workers ourselves (see the sketch after this question's answers).
DEGAS
D-TEC We allocate a pool of OS-level execution resources (e.g., pthreads). Our scheduler then uses these resources as workers on which to schedule the APGAS-level tasks using a work-stealing scheduler (also sketched below).
DynAX SWARM uses codelets as an intermediary between threads and function/method calls. Threads are set up and bound at runtime startup; codelets are bound to particular threads only when they are dispatched, unless a more specific binding is arranged before the codelet is readied. The runtime can dynamically balance load by shifting readied codelets and/or context data from one location to another. When the OS is in charge of power management, blocking is used to relinquish a hardware thread to the OS so that it can be used for other work or its core can be powered down.
X-TUNE The most interesting tool would be one that could compare two different versions of the code to see where changes to variable values are observed.
GVR
CORVETTE
SLEEC N/A
PIPER N/A
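The TG and D-TEC answers describe complementary mechanisms: pinned OS-level workers, and a work-stealing scheduler that runs user-level tasks on them. The Linux/pthreads sketch below combines the two under loudly stated assumptions: lock-based deques stand in for the lock-free structures a real runtime would use, thieves probe only one neighbor, and every name is hypothetical.

 #define _GNU_SOURCE              /* for pthread_setaffinity_np */
 #include <pthread.h>
 #include <sched.h>
 #include <stdio.h>
 #include <unistd.h>
 
 #define NWORKERS 4
 #define QCAP 1024
 
 typedef void (*task_fn)(void *);
 struct task { task_fn fn; void *arg; };
 
 static struct deque {            /* one deque per worker */
     struct task buf[QCAP];
     int head, tail;              /* owner pops tail; thieves take head */
     pthread_mutex_t lock;
 } deques[NWORKERS];
 
 static volatile int done;
 
 static void push_task(int w, struct task t)   /* bounds check elided */
 {
     struct deque *d = &deques[w];
     pthread_mutex_lock(&d->lock);
     d->buf[d->tail++] = t;
     pthread_mutex_unlock(&d->lock);
 }
 
 static int take(struct deque *d, int from_head, struct task *t)
 {
     pthread_mutex_lock(&d->lock);
     int ok = d->tail > d->head;
     if (ok) *t = from_head ? d->buf[d->head++] : d->buf[--d->tail];
     pthread_mutex_unlock(&d->lock);
     return ok;
 }
 
 static void *worker(void *arg)
 {
     int w = (int)(long)arg;
 
     /* Pin this worker to core w, as in the TG answer. */
     cpu_set_t set;
     CPU_ZERO(&set);
     CPU_SET(w, &set);
     pthread_setaffinity_np(pthread_self(), sizeof set, &set);
 
     struct task t;
     while (!done) {
         if (take(&deques[w], 0, &t) ||                   /* own, LIFO */
             take(&deques[(w + 1) % NWORKERS], 1, &t))    /* steal, FIFO */
             t.fn(t.arg);
         else
             sched_yield();       /* idle: give the core back briefly */
     }
     return NULL;
 }
 
 static void hello(void *arg) { printf("task %ld ran\n", (long)arg); }
 
 int main(void)
 {
     pthread_t tid[NWORKERS];
     for (int i = 0; i < NWORKERS; i++)
         pthread_mutex_init(&deques[i].lock, NULL);
     for (long w = 0; w < NWORKERS; w++) {
         push_task((int)w, (struct task){ hello, (void *)w });
         pthread_create(&tid[w], NULL, worker, (void *)w);
     }
     sleep(1);                    /* demo only: let the tasks drain */
     done = 1;
     for (int w = 0; w < NWORKERS; w++) pthread_join(tid[w], NULL);
     return 0;
 }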
What does your RTS use for locality information?
XPRESS A “locality” is defined as a synchronous domain that guarantees bounded response time and compound atomic sequences of operations. Compute complexes (thread instances) are performed on a single locality at a time and can assume its properties. ParalleX processes are contexts that define relative logical locality, although a process may span multiple localities. Parcels permit asynchronous non-blocking operation and move work to data to minimize latency effects.
TG We expect this information to come from (a) user (or higher-level tool/compiler) hints, (b) introspection of the physical layout based on configuration files, and (c) potentially, introspection into machine behavior.
DEGAS
D-TEC The X10/APGAS runtime system spans multiple shared-memory domains called places. An application specifies the place of each data object and computational task.
DynAX It uses a tree of locale descriptors to associate threads, cores, nodes, etc. with each other, typically in a fashion that correlates with the hardware memory hierarchy (see the sketch after this question's answers).
X-TUNE The key issue will be understanding when differences in output are acceptable, and when they represent an error.
GVR
CORVETTE
SLEEC
PIPER Locality/topology information should be exposed by the application facing runtime and will be used for proper attribution of performance data.
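A hedged sketch of the locale-descriptor tree the DynAX answer mentions: each level mirrors one tier of the hardware memory hierarchy, and lookups walk toward the root. The types and field names are illustrative, not SWARM's actual structures.

 #include <stdio.h>
 
 enum locale_kind { NODE, SOCKET, CORE, HWTHREAD };
 
 struct locale {
     enum locale_kind kind;
     int id;
     struct locale *parent;       /* toward larger memory domains */
     struct locale *children;     /* first child */
     struct locale *sibling;      /* next locale at the same level */
 };
 
 /* Walk upward to the enclosing locale of a given kind, e.g. the
    node that owns a particular core. */
 static struct locale *enclosing(struct locale *l, enum locale_kind k)
 {
     while (l && l->kind != k)
         l = l->parent;
     return l;
 }
 
 int main(void)
 {
     struct locale node = { NODE, 0, NULL, NULL, NULL };
     struct locale core = { CORE, 3, &node, NULL, NULL };
     node.children = &core;
     printf("core %d is on node %d\n", core.id, enclosing(&core, NODE)->id);
     return 0;
 }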
What OS or hardware information does your RTS need to monitor and adapt?
XPRESS Availability of execution resources, energy consumption, detected errors, delays due to contention.
TG Performance monitoring units and fault detection.
DEGAS
D-TEC The X10/APGAS RTS monitors the connections between nodes (hosts) to detect node and network failures.
DynAX Reliable notification of hardware failures, and a local or global cycle-based or real-time clock. Performance counters would help with load modeling and balancing.
X-TUNE
GVR
CORVETTE
SLEEC
PIPER In short: anything and everything. In particular, hardware counters (for both profiling and sampling) and any kind of system-adaptation information (i.e., where the system configuration changes) are required (see the counter sketch after this question's answers).
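Several answers above ask for hardware performance counters. As one concrete Linux-specific illustration, the sketch below counts CPU cycles for a stretch of code via perf_event_open; error handling is minimal, and this is not any project's actual monitoring layer.

 /* Count CPU cycles for this thread using Linux perf events.
    perf_event_open has no libc wrapper, hence syscall(). */
 #include <linux/perf_event.h>
 #include <stdint.h>
 #include <stdio.h>
 #include <string.h>
 #include <sys/ioctl.h>
 #include <sys/syscall.h>
 #include <unistd.h>
 
 int main(void)
 {
     struct perf_event_attr attr;
     memset(&attr, 0, sizeof attr);
     attr.type = PERF_TYPE_HARDWARE;
     attr.size = sizeof attr;
     attr.config = PERF_COUNT_HW_CPU_CYCLES;
     attr.disabled = 1;
     attr.exclude_kernel = 1;
 
     /* pid = 0 (this process), cpu = -1 (any CPU). */
     int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
     if (fd < 0) { perror("perf_event_open"); return 1; }
 
     ioctl(fd, PERF_EVENT_IOC_RESET, 0);
     ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
 
     volatile double x = 0;                 /* some work to measure */
     for (int i = 1; i < 1000000; i++) x += 1.0 / i;
 
     ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
     uint64_t cycles;
     read(fd, &cycles, sizeof cycles);
     printf("cycles: %llu\n", (unsigned long long)cycles);
     close(fd);
     return 0;
 }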
Does your RTS require support for global namespace or global address space?
XPRESS Yes.
TG No, but it will use one if available.
DEGAS
D-TEC Currently the APGAS runtime provides a global address space entirely in software (see the sketch after this question's answers). If the lower-level system software provided full or partial support for a global address space, the APGAS runtime could exploit it. However, we do not require global address support from the underlying system.
DynAX SWARM can take advantage of a global name/address space, but provides a global namespace entirely in software. OS or hardware involvement is only needed for data storage and communication.
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER N/A
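The D-TEC and DynAX answers both describe providing a global address space purely in software. Below is a minimal sketch of the usual representation: a global pointer is a (place, local address) pair, and a remote access degenerates to a message. All names are illustrative, and the remote path is elided.

 #include <stdint.h>
 #include <stdio.h>
 
 typedef struct { uint32_t place; uint64_t addr; } gptr;
 
 static uint32_t my_place = 0;    /* this process's place id */
 
 /* Local accesses dereference directly; remote ones would be turned
    into get/put messages by the transport layer (elided here). */
 static uint64_t gload(gptr p)
 {
     if (p.place == my_place)
         return *(uint64_t *)(uintptr_t)p.addr;
     /* else: send a get request to p.place and wait for the reply */
     return 0;
 }
 
 int main(void)
 {
     uint64_t x = 42;
     gptr p = { my_place, (uint64_t)(uintptr_t)&x };
     printf("gload -> %llu\n", (unsigned long long)gload(p));
     return 0;
 }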
What local memory management capability does your RTS require?
XPRESS It requires support for allocation and deallocation of physical memory blocks and for protected virtual memory addresses at the local level, and it must receive error information during memory accesses.
TG Our RTS self-manages fine-grained allocations. It simply needs to acquire ranges of addresses it can use (see the sketch after this question's answers).
DEGAS
D-TEC Garbage collection.
DynAX SWARM requires the ability to allocate physical memory.
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER Individual parts of the runtime will require dynamic memory management. Additionally, shared-memory communication with a target process would be highly beneficial.
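The TG pattern above (acquire address ranges from the OS, self-manage fine-grained allocations inside them) is sketched below with mmap and a bump pointer. Real runtimes layer a proper allocator with freeing on top; the names here are hypothetical.

 #include <stdio.h>
 #include <sys/mman.h>
 
 #define ARENA_SIZE (1u << 20)   /* one 1 MiB range from the OS */
 
 static char  *arena;
 static size_t used;
 
 /* Fine-grained allocation managed entirely by the runtime. */
 static void *rts_alloc(size_t n)
 {
     n = (n + 15) & ~(size_t)15;          /* 16-byte alignment */
     if (used + n > ARENA_SIZE) return NULL;
     void *p = arena + used;
     used += n;
     return p;
 }
 
 int main(void)
 {
     arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (arena == MAP_FAILED) return 1;
 
     int *v = rts_alloc(100 * sizeof *v);  /* fine-grained allocation */
     v[0] = 7;
     printf("allocated at %p, used %zu bytes\n", (void *)v, used);
     return 0;
 }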
Does your RTS address external I/O capability?
XPRESS Yes.
TG Yes (partial).
DEGAS
D-TEC No
DynAX Yes.
X-TUNE
GVR
CORVETTE
SLEEC
PIPER N/A
What interface and/or mechanism is used for the OS to request RTS services?
XPRESS The OS (e.g., LXK) may make requests of the runtime to coordinate actions, resources, and services across multiple localities or the entire system, and to provide high-level functionality such as POSIX calls.
TG N/A
DEGAS
D-TEC The X10/APGAS RTS is linked with the application binary.
DynAX Current versions of SWARM do not require the OS to request services from the runtime. In the event this becomes necessary, it is expected that either a signal-/interrupt-based or a polling-based interface will be provided, and either can be integrated easily (see the sketch after this question's answers).
X-TUNE N/A -- We use standard languages and run-time support.
GVR
CORVETTE
SLEEC N/A
PIPER N/A
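A hedged sketch of the signal-based variant the DynAX answer anticipates: the handler only records the request, and the runtime services it at its next polling point. A purely polling-based interface would replace the handler with a check of a shared flag or queue. Names are illustrative.

 #include <signal.h>
 #include <stdio.h>
 #include <string.h>
 
 static volatile sig_atomic_t os_request;   /* set asynchronously */
 
 static void on_signal(int sig) { (void)sig; os_request = 1; }
 
 /* Checked between tasks by the runtime's scheduler loop. */
 static void rts_poll_point(void)
 {
     if (os_request) {
         os_request = 0;
         printf("servicing OS request\n");
     }
 }
 
 int main(void)
 {
     struct sigaction sa;
     memset(&sa, 0, sizeof sa);
     sa.sa_handler = on_signal;
     sigaction(SIGUSR1, &sa, NULL);
 
     raise(SIGUSR1);                        /* simulate an OS request */
     rts_poll_point();                      /* next scheduler poll */
     return 0;
 }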
How does your RTS support legacy application or legacy RTS capability?
XPRESS Both MPI and OpenMP software interfaces are being provided to XPI as a target interface to HPX. LXK can also support both in native form.
TG Not in TG scope.
DEGAS
D-TEC N/A
DynAX Legacy applications can be converted piecewise or in their entirety. While the application may block normally during single-threaded regions, parallelized regions require blocking calls to use stack-switching or (equivalently) extra software threads, or else blocking operations must be broken apart into separate initiation and callback sections (see the sketch after this question's answers). Where possible, SWARM provides predefined exports that allow asynchronous use of common legacy runtime/API functionality.
X-TUNE Scalability and determining what is an error seem like the biggest challenges.
GVR
CORVETTE
SLEEC
PIPER Yes: PIPER components intend to support tools for MPI+X codes as well as new RTS and DSL approaches
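The DynAX answer's conversion of a blocking operation into separate initiation and callback sections can be illustrated with POSIX AIO standing in for SWARM's own asynchronous machinery (on older glibc, link with -lrt). The file path and function names are illustrative.

 #include <aio.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>
 
 static char buf[256];
 
 static void on_read_done(struct aiocb *cb)     /* the "callback" half */
 {
     ssize_t n = aio_return(cb);
     printf("read %zd bytes asynchronously\n", n);
 }
 
 int main(void)
 {
     int fd = open("/etc/hostname", O_RDONLY); /* any readable file */
     if (fd < 0) return 1;
 
     struct aiocb cb;
     memset(&cb, 0, sizeof cb);
     cb.aio_fildes = fd;
     cb.aio_buf = buf;
     cb.aio_nbytes = sizeof buf;
 
     if (aio_read(&cb) != 0) return 1;      /* the "initiation" half */
 
     /* A real runtime would run other codelets here, not spin. */
     while (aio_error(&cb) == EINPROGRESS)
         ;
 
     on_read_done(&cb);
     close(fd);
     return 0;
 }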
Does your RTS depend on any specific hardware-specific capability?
XPRESS HPX at a minimum requires standard hardware functionality of conventional systems but would benefit from new capabilities for efficiency and scalability.
TG No, but it can take advantage of some if available.
DEGAS
D-TEC No, but the X10/APGAS RTS can take advantage of hardware-specific networking capabilities and CUDA GPUs.
DynAX SWARM can operate perfectly well on commodity systems, but benefits from access to performance-counting and power-monitoring/-control facilities.
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER Full (and well-documented) access to performance counters, profiling, and sampling.




