Runtimes (os/hardware-facing)

From Modelado Foundation
{| class="wikitable"
|- style="vertical-align:top;"
! QUESTIONS !! XPRESS !! TG X-Stack !! DEGAS !! D-TEC !! DynAX !! X-TUNE !! GVR !! CORVETTE !! SLEEC !! PIPER
|- style="vertical-align:top;"
| '''PI''' || Ron Brightwell || Shekhar Borkar || Katherine Yelick || Daniel Quinlan || Guang Gao || Mary Hall || Andrew Chien || Koushik Sen || Milind Kulkarni || Martin Schulz
|- style="vertical-align:top;"
|'''What system calls does your RTS currently use?'''
|HPX requires basic calls for memory allocation and deallocation, virtual address translation and management, thread execution resource allocation and deallocation, parcel communication transmit and receive, error detection, and others.
|Our RTS is platform-independent, and we have been building a hardware and system abstraction layer that wraps all "system calls" we may need. On x86, we rely on calls to print, to exit, and to manage memory and threads. The same functionality is provided differently on other platforms.
|(DEGAS)
|Typical POSIX calls for memory allocation/deallocation, threads and synchronization operations, support needed for core libc operations.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|SLEEC
|The PIPER runtime will be used to collect performance information. It will be out of band, potentially running on external (non-compute-node) resources; as such, we require additional communication mechanisms, currently mostly implemented through sockets. Additionally, tools typically use ptrace, signals, and shared memory segments, as well as the dynamic linker, for their implementation.
|- style="vertical-align:top;"
|'''Does your RTS span the system? If so, what network interface capability does your RTS need?'''
|The HPX RTS spans the system. It requires a global address space and a parcel-based message-driven interface.
|Yes, it can span the entire system depending on the platform. We have defined very simple communication interfaces (which we will almost certainly extend) that currently allow the RTS to send and receive one-way messages between nodes.
|(DEGAS)
|We run different instances of the X10/APGAS runtime across different OS instances on the system. They coordinate via active messages (a minimal sketch of active-message dispatch appears after the table). We developed an active-message-based transport API, which we implemented on top of TCP/IP and MPI.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|Tools will have a global "runtime" to collect and aggregate data; this network will be out of band. It will span the whole job, and in some cases the whole machine. A high-performance communication mechanism would be preferable; currently, mostly sockets are used.
|- style="vertical-align:top;"
|'''How does your RTS map user-level and OS-level scheduling?'''
|The LXK OS allocates a share of its execution resources (e.g., Pthreads) to each relative-root ParalleX Process allocated to the locality. The HPX runtime system uses lightweight scheduling policies to assign user threads to the allocated OS threads.
|Our RTS is built on the assumption that there is almost nothing below it; in other words, we rely as little as possible on the operating system. For scheduling, for example, on a traditional x86 Linux system we create a number of pinned worker threads and then manage work on these workers ourselves (see the worker-pool sketch after the table).
|(DEGAS)
|We allocate a pool of OS-level execution resources (e.g., pthreads). Our scheduler then uses these resources as workers on which to schedule the APGAS-level tasks using a work-stealing scheduler.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|N/A
|- style="vertical-align:top;"
|'''What does your RTS use for locality information?'''
|The “locality” is defined as a synchronous domain that guarantees bounded response time and compound atomic sequences of operations. Compute complexes (thread instances) are performed on a single locality at a time and can assume its properties. ParalleX Processes are contexts that define relative logical locality, although a process may span multiple localities. Parcels permit asynchronous non-blocking operation and move work to data to minimize latency effects.
|We expect this information to come from: (a) user (or higher-level tool/compiler) hints, (b) introspection of the physical layout based on configuration files, and (c) (potentially) introspection into machine behavior.
|(DEGAS)
|The X10/APGAS runtime system spans multiple shared-memory domains called places. An application specifies the place of each data object and computational task.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|Locality/topology information should be exposed by the application-facing runtime and will be used for proper attribution of performance data.
|- style="vertical-align:top;"
|'''What OS or hardware information does your RTS need to monitor and adapt?'''
|Availability of execution resources, energy consumption, detected errors, delays due to contention.
|Performance monitoring units and fault detection.
|(DEGAS)
|The X10/APGAS RTS monitors the connections between nodes (hosts) to detect node and network failures.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|In short: anything and everything. In particular, hardware counters (for profiling and sampling) and any kind of system-adaptation information (i.e., where and when the system configuration changes) are required (see the counter-reading sketch after the table).
|- style="vertical-align:top;"
|'''Does your RTS require support for global namespace or global address space?'''
|Yes.
|No, but it will use one if available.
|(DEGAS)
|Currently, the APGAS runtime provides a global address space entirely in software (see the address-encoding sketch after the table). If the lower-level system software provided full or partial support for a global address space, the APGAS runtime could exploit it. However, we do not require global address support from the underlying system.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|N/A
|- style="vertical-align:top;"
|'''What local memory management capability does your RTS require?'''
|It must have support for allocation and deallocation of physical memory blocks, and for protected virtual memory addresses at the local level. It must receive error information during memory accesses.
|Our RTS self-manages fine-grained allocations. It simply needs to acquire range(s) of addresses it can use (see the address-range sketch after the table).
|(DEGAS)
|Garbage collection.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|Individual parts of the runtime will require dynamic memory management; additionally, shared-memory communication with a target process would be highly beneficial.
|- style="vertical-align:top;"
|'''Does your RTS address external I/O capability?'''
|Yes.
|Yes (partial).
|(DEGAS)
|No.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|N/A
|- style="vertical-align:top;"
|'''What interface and/or mechanism is used for the OS to request RTS services?'''
|The OS (e.g., LXK) may make requests of the runtime system to coordinate actions, resources, and services across multiple localities or the entire system, and to provide high-level functionality like POSIX calls.
|N/A
|(DEGAS)
|The X10/APGAS RTS is linked with the application binary.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|N/A
|- style="vertical-align:top;"
|'''How does your RTS support legacy application or legacy RTS capability?'''
|Both MPI and OpenMP software interfaces are being provided via XPI as a target interface to HPX. LXK can also support both in native form.
|Not in TG scope.
|(DEGAS)
|N/A
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|Yes: PIPER components intend to support tools for MPI+X codes as well as new RTS and DSL approaches.
|- style="vertical-align:top;"
|'''Does your RTS depend on any hardware-specific capability?'''
|HPX at a minimum requires the standard hardware functionality of conventional systems, but would benefit from new capabilities for efficiency and scalability.
|No, but it can take advantage of some if available.
|(DEGAS)
|No, but the X10/APGAS RTS can take advantage of hardware-specific networking capabilities and CUDA GPUs.
|(DynAX)
|(X-TUNE)
|(GVR)
|(CORVETTE)
|N/A
|Full (and well-documented) access to performance counters, profiling, and sampling.
|}
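The DEGAS and D-TEC answers above describe coordinating runtime instances through one-way or active messages. The following minimal C sketch shows the dispatch pattern an active-message layer implies: a message carries a handler index plus a payload, and the receiver invokes the indexed handler on it. The in-memory <code>wire</code> buffer stands in for a real network transport, and all names (<code>am_send</code>, <code>am_poll</code>, <code>hello_handler</code>) are hypothetical, not any project's API.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Active message: a handler index plus a payload. On a real system
   the bytes would cross the network; here a local buffer stands in
   for the wire (simplification for brevity). */
typedef struct { uint32_t handler; uint32_t len; uint8_t payload[64]; } am_t;

typedef void (*am_handler_t)(const uint8_t *payload, uint32_t len);

static void hello_handler(const uint8_t *p, uint32_t n) {
    printf("remote says: %.*s\n", (int)n, (const char *)p);
}

/* Handler table, agreed on by sender and receiver. */
static am_handler_t handlers[] = { hello_handler };

/* "Send": serialize into the wire buffer. */
static am_t wire;
static void am_send(uint32_t h, const void *p, uint32_t n) {
    wire.handler = h;
    wire.len = n;
    memcpy(wire.payload, p, n);
}

/* "Receive": look up the handler and invoke it on the payload. */
static void am_poll(void) {
    handlers[wire.handler](wire.payload, wire.len);
}

int main(void) {
    const char msg[] = "hi";
    am_send(0, msg, sizeof msg - 1);
    am_poll();
    return 0;
}
</syntaxhighlight>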
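The TG and D-TEC scheduling answers describe the same basic structure: allocate a pool of OS-level pthreads, pin them, and schedule user-level tasks onto them in the runtime itself. Below is a minimal Linux sketch of that structure, assuming a trivial locked queue instead of a real work-stealing deque; <code>NWORKERS</code>, <code>demo_task</code>, and the queue layout are illustrative. Build with <code>cc -pthread</code>.

<syntaxhighlight lang="c">
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NWORKERS 4
#define NTASKS   16

/* Trivial locked queue; real runtimes use per-worker
   work-stealing deques. */
typedef struct { void (*fn)(int); int arg; } task_t;
static task_t queue[NTASKS];
static int head = 0, tail = 0;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

static void demo_task(int i) { printf("task %d\n", i); }

static task_t *pop(void) {
    task_t *t = NULL;
    pthread_mutex_lock(&qlock);
    if (head < tail) t = &queue[head++];
    pthread_mutex_unlock(&qlock);
    return t;
}

/* Worker: pin itself to a core, then drain tasks. */
static void *worker(void *arg) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)(long)arg, &set);
    /* Pinning may fail on machines with fewer cores; ignored here. */
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    for (task_t *t; (t = pop()) != NULL; )
        t->fn(t->arg);
    return NULL;
}

int main(void) {
    for (int i = 0; i < NTASKS; i++)
        queue[tail++] = (task_t){ demo_task, i };
    pthread_t w[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(w[i], NULL);
    return 0;
}
</syntaxhighlight>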
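The PIPER answer on monitoring asks for hardware counters usable in profiling and sampling. As one concrete, Linux-specific illustration of counter access (generic <code>perf_event_open</code> usage, not PIPER's implementation), the sketch below counts retired instructions around a measured region:

<syntaxhighlight lang="c">
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* perf_event_open has no libc wrapper; invoke it via syscall(2). */
static int perf_open(struct perf_event_attr *a) {
    return (int)syscall(__NR_perf_event_open, a, 0, -1, -1, 0);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_open(&attr);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    volatile long sum = 0;
    for (long i = 0; i < 1000000; i++) sum += i;   /* measured region */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof count) == sizeof count)
        printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
</syntaxhighlight>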
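The D-TEC answer on global address space notes that the APGAS runtime provides one entirely in software. A common software encoding packs a place (node) id and a node-local offset into one 64-bit global address, as in the sketch below; the field widths are illustrative, not any project's ABI.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

/* A software global address: place (node) id in the high bits,
   node-local offset in the low bits. Widths are illustrative. */
typedef uint64_t gaddr_t;

#define PLACE_BITS  16
#define OFFSET_MASK ((1ull << (64 - PLACE_BITS)) - 1)

static gaddr_t gaddr_make(uint16_t place, uint64_t offset) {
    return ((uint64_t)place << (64 - PLACE_BITS)) | (offset & OFFSET_MASK);
}
static uint16_t gaddr_place(gaddr_t g)  { return (uint16_t)(g >> (64 - PLACE_BITS)); }
static uint64_t gaddr_offset(gaddr_t g) { return g & OFFSET_MASK; }

int main(void) {
    gaddr_t g = gaddr_make(7, 0x1000);
    /* The runtime routes accesses: local place -> load/store,
       remote place -> active message to the owner. */
    printf("place=%u offset=0x%llx\n", gaddr_place(g),
           (unsigned long long)gaddr_offset(g));
    return 0;
}
</syntaxhighlight>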
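The TG memory-management answer needs only to acquire ranges of addresses that the runtime then manages itself. On a POSIX system, one way to do that is to reserve address space with <code>mmap</code> and <code>PROT_NONE</code>, committing pieces on demand with <code>mprotect</code>; the sketch below is a generic illustration of that approach, not TG's allocator.

<syntaxhighlight lang="c">
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = (size_t)1 << 30;   /* reserve 1 GiB of address space */
    /* PROT_NONE + MAP_NORESERVE: claim the range without committing
       physical memory; a runtime would parcel it out with its own
       fine-grained allocator, enabling pieces on demand. */
    void *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Commit the first 4 KiB so it can actually be used. */
    if (mprotect(base, 4096, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }
    *(int *)base = 42;
    printf("range at %p, first word = %d\n", base, *(int *)base);
    munmap(base, len);
    return 0;
}
</syntaxhighlight>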