Runtimes (os/hardware-facing)

Project | PI
XPRESS | Ron Brightwell
TG | Shekhar Borkar
DEGAS | Katherine Yelick
D-TEC | Daniel Quinlan
DynAX | Guang Gao
X-TUNE | Mary Hall
GVR | Andrew Chien
CORVETTE | Koushik Sen
SLEEC | Milind Kulkarni
PIPER | Martin Schulz
Questions:
- What system calls does your RTS currently use?
- Does your RTS span the system? If so, what network interface capability does your RTS need?
- How does your RTS map user-level and OS-level scheduling?
- What does your RTS use for locality information?
- What OS or hardware information does your RTS need to monitor and adapt?
- Does your RTS require support for global namespace or global address space?
- What local memory management capability does your RTS require?
- Does your RTS address external I/O capability?
- What interface and/or mechanism is used for the OS to request RTS services?
- How does your RTS support legacy application or legacy RTS capability?
- Does your RTS depend on any hardware-specific capability?

What system calls does your RTS currently use?
XPRESS | HPX requires basic calls for memory allocation and deallocation, virtual address translation and management, thread execution resource allocation and deallocation, parcel communication transmit and receive, error detection, and others.
TG | Our RTS is platform independent, and we have been building a hardware and system abstraction layer that wraps all "system calls" that we may need. On x86, we rely on system calls for printing, exiting, and memory and thread management. These same functionalities are provided differently on other platforms.
DEGAS |
D-TEC | Typical POSIX calls for memory allocation/deallocation, thread and synchronization operations, and the support needed for core libc operations.
DynAX | SWARM requires access to hardware threads, memory, and network interconnect(s), whether by system call or direct access. On today's x86 clusters, SWARM additionally needs access to I/O facilities, such as the Linux select, read, and write calls.
X-TUNE | Presumably, autotuning would only be applied when software is ready to run in production mode. I suspect correctness software would only be used if the tuning process had some error, in which case some overhead would be tolerable.
GVR |
CORVETTE |
SLEEC |
PIPER | The PIPER runtime will be used to collect performance information; it will be out of band, potentially running on external (non-compute-node) resources. As such, we require additional communication mechanisms, which is currently mostly done through sockets. Additionally, tools typically use ptrace, signals, and shared memory segments, as well as the dynamic linker, for their implementation.
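
The PIPER answer above mentions ptrace, signals, and shared memory as typical tool mechanisms. As a hedged illustration (not PIPER's actual implementation), the C sketch below shows the bare minimum of attaching to a target process with ptrace on x86-64 Linux and reading its instruction pointer:

```c
/* Minimal sketch of an out-of-band tool attaching to a target process
   with ptrace. Illustrative only; real tools layer much more on top. */
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int attach_and_peek(pid_t target) {
    if (ptrace(PTRACE_ATTACH, target, NULL, NULL) == -1) {
        perror("PTRACE_ATTACH");
        return -1;
    }
    waitpid(target, NULL, 0);                     /* wait for the stop */

    struct user_regs_struct regs;                 /* x86-64 Linux only */
    ptrace(PTRACE_GETREGS, target, NULL, &regs);
    printf("target %d stopped at rip=%llx\n", (int)target, regs.rip);

    ptrace(PTRACE_DETACH, target, NULL, NULL);    /* resume the target */
    return 0;
}
```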

Does your RTS span the system? If so, what network interface capability does your RTS need?
XPRESS | The HPX RTS spans the system. It requires a global address space and the parcel message-driven interface.
TG | Yes, it can span the entire system, depending on the platform. We have defined very simple communication interfaces (which we will almost certainly extend) that currently allow the RTS to send and receive one-way messages between nodes.
DEGAS |
D-TEC | We run different instances of the X10/APGAS runtime across different OS instances on the system. They coordinate via active messages. We developed an active-message-based transport API, which we implemented on top of TCP/IP and MPI.
DynAX | Yes, SWARM operates on all available/configured threads of all available/configured nodes. SWARM can operate over stream-, message-, or DMA-based interconnects.
X-TUNE |
GVR |
CORVETTE |
SLEEC |
PIPER | Tools will have a global "runtime" to collect and aggregate data; this network will be out of band. It will span the whole job, and in some cases the whole machine. A high-performance communication mechanism would be preferable; currently, mostly sockets are used.
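
For readers unfamiliar with parcels (the message-driven interface XPRESS refers to), here is a hedged C sketch of the kind of information a parcel typically carries. The field names are illustrative assumptions, not HPX's actual API:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative parcel layout: a parcel names a target in the global
   address space, the action to run there, an optional continuation,
   and marshaled argument data. */
typedef struct parcel {
    uint64_t dest;          /* global address of the target object      */
    uint32_t action;        /* identifier of the handler to run at dest */
    uint64_t continuation;  /* global address to send any result to     */
    uint32_t payload_len;   /* length of the marshaled arguments        */
    uint8_t  payload[];     /* argument data, filled in by the sender   */
} parcel_t;
```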

How does your RTS map user-level and OS-level scheduling?
XPRESS | The LXK OS allocates a share of its execution resources (e.g., Pthreads) to each relative-root ParalleX Process allocated to the locality. The HPX runtime system uses lightweight scheduling policies to assign user threads to the allocated OS threads.
TG | Our RTS is built on the assumption that there is almost nothing below it. In other words, we try to rely as little as possible on the operating system. For scheduling, for example, on a traditional x86 Linux system, we create a certain number of pinned worker threads, and we then manage work on these workers ourselves.
DEGAS |
D-TEC | We allocate a pool of OS-level execution resources (e.g., pthreads). Our scheduler then uses these resources as workers on which to schedule the APGAS-level tasks, using a work-stealing scheduler.
DynAX | SWARM uses codelets to intermediate between threads and function/method calls. Threads are set up and bound at runtime startup; codelets are bound to particular threads only when they're dispatched, unless some more specific binding is arranged before the codelet is readied. The runtime can dynamically balance load by shifting readied codelets and/or context data from one location to another. When the OS is in charge of power management, blocking is used to relinquish a hardware thread to the OS so that it can be used for other work or its core can be powered down.
X-TUNE | The most interesting tool would be one that could compare two different versions of the code to see where changes to variable values are observed.
GVR |
CORVETTE |
SLEEC | N/A
PIPER | N/A
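
The TG and D-TEC answers both describe creating a pool of pinned OS threads and scheduling runtime-level work on top of them. Below is a minimal sketch of that pattern in C, assuming Linux and glibc's pthread_setaffinity_np; the worker loop body is a hypothetical placeholder:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* One OS thread per core, pinned with an affinity mask; the runtime
   schedules its own tasks inside worker_main. */
static void *worker_main(void *arg) {
    (void)arg;
    /* runtime scheduler loop: pull tasks / steal work, never return */
    return NULL;
}

int spawn_pinned_workers(pthread_t *tids, int nworkers) {
    for (int i = 0; i < nworkers; i++) {
        if (pthread_create(&tids[i], NULL, worker_main, NULL) != 0)
            return -1;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);                      /* pin worker i to core i */
        pthread_setaffinity_np(tids[i], sizeof(set), &set);
    }
    return 0;
}
```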

What does your RTS use for locality information?
XPRESS | The “locality” is defined as a synchronous domain that guarantees bounded response time and compound atomic sequences of operations. Compute complexes (thread instances) are to be performed on a single locality at a time and can assume its properties. ParalleX Processes are contexts that define relative logical locality, although a process may span multiple localities. Parcels permit asynchronous non-blocking operation and move work to data to minimize latency effects.
TG | We expect this information to come from: (a) user (or higher-level tools/compilers) hints, (b) introspection of the physical layout based on configuration files, and (c) (potentially) introspection into machine behavior.
DEGAS |
D-TEC | The X10/APGAS runtime system spans multiple shared-memory domains called places. An application specifies the place of each data object and computational task.
DynAX | It uses a tree of locale descriptors to associate threads, cores, nodes, etc. with each other, typically in a fashion correlating with the hardware memory hierarchy.
X-TUNE | The key issue will be understanding when differences in output are acceptable and when they represent an error.
GVR |
CORVETTE |
SLEEC |
PIPER | Locality/topology information should be exposed by the application-facing runtime and will be used for proper attribution of performance data.
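
The DynAX answer describes a tree of locale descriptors mirroring the memory hierarchy. A hypothetical C sketch of such a structure (these are not SWARM's actual definitions):

```c
/* Each descriptor mirrors one level of the hardware memory hierarchy,
   so walking toward the root gives the containment chain of any thread. */
typedef enum { LOC_SYSTEM, LOC_NODE, LOC_SOCKET, LOC_CORE, LOC_THREAD } locale_kind_t;

typedef struct locale {
    locale_kind_t   kind;
    int             id;         /* index among siblings        */
    struct locale  *parent;     /* NULL at the system root     */
    struct locale **children;   /* array of nchildren pointers */
    int             nchildren;
} locale_t;

/* e.g., two codelets are "close" if their threads' descriptors share a
   socket-level ancestor, and "far" if they only share the system root. */
```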

What OS or hardware information does your RTS need to monitor and adapt?
XPRESS | Availability of execution resources, energy consumption, detected errors, and delays due to contention.
TG | Performance monitoring units and fault detection.
DEGAS |
D-T EC | The X10/APGAS RTS monitors the connections between nodes (hosts) to detect node and network failures.
DynAX | Reliable notification of hardware failures, and a local or global cycle-based or real-time clock. Performance counters would help with load modeling and balancing.
X-TUNE |
GVR |
CORVETTE |
SLEEC |
PIPER | In short: anything and everything. In particular, hardware counters (for profiling and sampling) and any kind of system-adaptation information (i.e., where the system configuration changes) are required.
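
On Linux, the hardware counters that PIPER and DynAX mention are typically reached through the perf_event_open system call. A minimal sketch, assuming x86-64 Linux (error handling omitted):

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a CPU-cycle counter for one process on any CPU; the returned fd
   can be read() to sample the count. Real tools multiplex many events
   and map ring buffers for sampling. */
static long open_cycle_counter(pid_t pid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size     = sizeof(attr);
    attr.type     = PERF_TYPE_HARDWARE;
    attr.config   = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                     /* start stopped */
    /* args: attr, pid, cpu (-1 = any), group_fd (-1 = none), flags */
    long fd = syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
    if (fd >= 0)
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    return fd;
}
```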

Does your RTS require support for global namespace or global address space?
XPRESS | Yes.
TG | No, but it will use one if available.
DEGAS |
D-TEC | Currently, the APGAS runtime provides a global address space entirely in software. If the lower-level system software provided full or partial support for a global address space, the APGAS runtime could exploit it. However, we do not require global address support from the underlying system.
DynAX | SWARM can take advantage of a global name/address space, but provides a global namespace entirely in software. OS or hardware involvement is only needed for data storage and communication.
X-TUNE |
GVR |
CORVETTE |
SLEEC | N/A
PIPER | N/A
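
A common way to provide a global address space entirely in software, as D-TEC and DynAX describe, is to pack a place (node) identifier and a local offset into one 64-bit word. A hedged C sketch; the 16/48-bit split is an illustrative assumption:

```c
#include <stdint.h>

/* Software global address: place id in the high 16 bits, local offset
   in the low 48. */
typedef uint64_t gaddr_t;

#define PLACE_BITS  16
#define OFFSET_BITS (64 - PLACE_BITS)
#define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

static inline gaddr_t gaddr_make(uint16_t place, uint64_t offset) {
    return ((uint64_t)place << OFFSET_BITS) | (offset & OFFSET_MASK);
}
static inline uint16_t gaddr_place(gaddr_t g)  { return (uint16_t)(g >> OFFSET_BITS); }
static inline uint64_t gaddr_offset(gaddr_t g) { return g & OFFSET_MASK; }

/* A dereference on place p is local if gaddr_place(g) == p; otherwise
   the runtime ships the access (or an active message) to the owner. */
```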

What local memory management capability does your RTS require?
XPRESS | It must have support for allocation and deallocation of physical memory blocks. It must have support for protected virtual memory addresses at the local level. It must receive error information during memory accesses.
TG | Our RTS self-manages fine-grained allocations. It simply needs to acquire range(s) of addresses it can use.
DEGAS |
D-TEC | Garbage collection.
DynAX | SWARM requires the ability to allocate physical memory.
X-TUNE |
GVR |
CORVETTE |
SLEEC | N/A
PIPER | Individual parts of the runtime will require dynamic memory management. Additionally, shared-memory communication with a target process would be highly beneficial.
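
TG's answer (acquire address ranges from the OS, then self-manage fine-grained allocations) is essentially the arena pattern. A minimal bump-allocator sketch in C, under the assumption that a range has already been obtained (e.g., via mmap); freeing and thread safety are omitted:

```c
#include <stddef.h>
#include <stdint.h>

/* Bump allocator over one OS-provided address range. */
typedef struct arena {
    uintptr_t top;   /* next free byte     */
    uintptr_t end;   /* one past the range */
} arena_t;

void arena_init(arena_t *a, void *range, size_t len) {
    a->top = (uintptr_t)range;
    a->end = a->top + len;
}

void *arena_alloc(arena_t *a, size_t n) {
    uintptr_t p = (a->top + 15) & ~(uintptr_t)15;   /* 16-byte align */
    if (p > a->end || n > a->end - p) return NULL;  /* range exhausted */
    a->top = p + n;
    return (void *)p;
}
```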

Does your RTS address external I/O capability?
XPRESS | Yes.
TG | Yes (partial).
DEGAS |
D-TEC | No.
DynAX | Yes.
X-TUNE |
GVR |
CORVETTE |
SLEEC | N/A
PIPER | N/A

What interface and/or mechanism is used for the OS to request RTS services?
XPRESS | The OS (e.g., LXK) may make requests of the runtime system to coordinate actions, resources, and services across multiple localities or the entire system, and to provide high-level functionality such as POSIX calls.
TG | N/A
DEGAS |
D-TEC | The X10/APGAS RTS is linked with the application binary.
DynAX | Current versions of SWARM do not require the OS to request services from the runtime. In the event this is necessary, it's expected that either a signal-/interrupt-based or polling-based interface will be provided, and either of these can be integrated easily.
X-TUNE | N/A; we use standard languages and run-time support.
GVR |
CORVETTE |
SLEEC | N/A
PIPER | N/A
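
The signal-based variant DynAX mentions could look like the following hypothetical C sketch: the OS raises a real-time signal, the handler only sets a flag, and the runtime's scheduler loop services the request later:

```c
#include <signal.h>

/* Hypothetical OS-to-RTS request interface via a real-time signal.
   The handler stays async-signal-safe by only setting a flag. */
static volatile sig_atomic_t rts_request_pending = 0;

static void rts_request_handler(int sig) {
    (void)sig;
    rts_request_pending = 1;   /* scheduler loop polls and services it */
}

void rts_install_os_interface(void) {
    struct sigaction sa;
    sa.sa_handler = rts_request_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGRTMIN, &sa, NULL);
}
```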

How does your RTS support legacy application or legacy RTS capability?
XPRESS | Both MPI and OpenMP software interfaces are being provided on top of XPI, a target interface to HPX. LXK can also support both in native form.
TG | Not in TG scope.
DEGAS |
D-TEC | N/A
DynAX | Legacy applications can be converted piecewise or in their entirety. While the application may block normally during single-threaded regions, parallelized regions require blocking calls to use stack switching or (equivalently) extra software threads, or else to break blocking operations apart into separate initiation and callback sections. Where possible, SWARM provides predefined exports that allow asynchronous use of common legacy runtime/API functionality.
X-TUNE | Scalability and determining what is an error seem like the biggest challenges.
GVR |
CORVETTE |
SLEEC |
PIPER | Yes: PIPER components intend to support tools for MPI+X codes as well as new RTS and DSL approaches.
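
The "separate initiation and callback sections" DynAX describes can be illustrated with POSIX asynchronous I/O: the read is initiated without blocking, and a completion routine runs later, much like a readied codelet. A sketch assuming POSIX aio (link with -lrt on older glibc); this is an analogy, not SWARM's actual mechanism:

```c
#include <aio.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Initiation/callback split for a legacy blocking read: instead of
   n = read(fd, buf, len); process(buf, n);  the read is started
   asynchronously and the callback half runs on completion. */
static char buf[4096];
static struct aiocb cb;

static void on_read_done(union sigval sv) {     /* callback half */
    (void)sv;
    ssize_t n = aio_return(&cb);
    printf("read completed: %zd bytes\n", n);   /* process(buf, n) here */
}

void start_read(int fd) {                       /* initiation half */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb.aio_sigevent.sigev_notify_function = on_read_done;
    aio_read(&cb);                              /* returns immediately */
}
```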

Does your RTS depend on any hardware-specific capability?
XPRESS | HPX at a minimum requires standard hardware functionality of conventional systems, but would benefit from new capabilities for efficiency and scalability.
TG | No, but it can take advantage of some if available.
DEGAS |
D-TEC | No, but the X10/APGAS RTS can take advantage of hardware-specific networking capabilities and CUDA GPUs.
DynAX | SWARM can operate perfectly well on commodity systems, but benefits from access to performance-counting and power-monitoring/control facilities.
X-TUNE |
GVR |
CORVETTE |
SLEEC | N/A
PIPER | Full (and well-documented) access to performance counters, profiling, and sampling.