Runtimes (OS/hardware-facing)

Projects and PIs:
XPRESS Ron Brightwell
TG Shekhar Borkar
DEGAS Katherine Yelick
D-TEC Daniel Quinlan
DynAX Guang Gao
X-TUNE Mary Hall
GVR Andrew Chien
CORVETTE Koushik Sen
SLEEC Milind Kulkarni
PIPER Martin Schulz

Questions:

What system calls does your RTS currently use?
XPRESS HPX requires basic calls for memory allocation and deallocation, virtual address translation and management, thread execution resource allocation and deallocation, parcel communication transmit and receive, error detection, and others.
TG Our RTS is platform independent and we have been building a hardware and system abstraction layer that wraps all "system calls" that we may need. On x86, we rely on calls for printing, exiting, and memory and thread management. These same functionalities are provided differently on other platforms. (A minimal sketch of such an abstraction layer appears after this question's answers.)
DEGAS
D-TEC Typical POSIX calls for memory allocation/deallocation, threads and synchronization operations, support needed for core libc operations.
DynAX SWARM requires access to hardware threads, memory, and network interconnect(s), whether by system call or direct access. On today's x86 clusters, SWARM additionally needs access to I/O facilities, such as the Linux select, read, and write calls.
X-TUNE Presumably, autotuning would only be applied when ready to run software in production mode. I suspect correctness software would only be used if the tuning process had some error, in which case some overhead would be tolerable.
GVR
CORVETTE
SLEEC
PIPER The PIPER runtime will be used to collect performance information - it will be out of band, potentially running on external (non-compute-node) resources. As such, we require additional communication mechanisms, which are currently implemented mostly through sockets. Additionally, tools typically use ptrace, signals, and shared memory segments, as well as the dynamic linker, for their implementation.
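As a rough, illustrative sketch of the thin system-abstraction layer TG describes (and of the basic calls XPRESS and DynAX list), the C fragment below wraps memory acquisition, thread creation, printing, and exit behind a runtime-owned interface. The sal_* names and the choice of mmap/pthreads are assumptions for illustration, not any project's actual code; on a lightweight kernel or simulator only the bodies of these wrappers would change.

```c
/* Hypothetical system-abstraction layer: the runtime wraps the few OS
 * facilities it needs (memory, threads, print, exit) behind its own
 * interface.  Names are illustrative, not taken from any project. */
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Memory: the RTS asks the OS for large regions and sub-allocates itself. */
static void *sal_mem_acquire(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

static void sal_mem_release(void *p, size_t bytes) {
    munmap(p, bytes);
}

/* Threads: execution resources are created once and handed to the RTS. */
static int sal_thread_create(pthread_t *t, void *(*fn)(void *), void *arg) {
    return pthread_create(t, NULL, fn, arg);
}

/* Diagnostics and shutdown. */
static void sal_print(const char *msg) { fputs(msg, stderr); }
static void sal_exit(int code) { exit(code); }

static void *worker(void *arg) {
    sal_print("worker running\n");
    return arg;
}

int main(void) {
    void *heap = sal_mem_acquire(1 << 20);      /* 1 MiB region */
    pthread_t t;
    sal_thread_create(&t, worker, heap);
    pthread_join(t, NULL);
    sal_mem_release(heap, 1 << 20);
    sal_exit(0);
}
```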
Does your RTS span the system? If so, what network interface capability does your RTS need?
XPRESS The HPX RTS spans the system. It requires a global address space and a parcel-based message-driven interface.
TG Yes, it can span the entire system depending on the platform. We have defined very simple communication interfaces (which we will almost certainly extend) that currently allow the RTS to send and receive one-way messages between nodes.
DEGAS
D-TEC We run different instances of the X10/APGAS runtime across different OS instances on the system. They coordinate via active messages. We developed an active-message-based transport layer, which we implemented on top of TCP/IP and MPI. (A minimal sketch of such a one-way/active-message interface appears after this question's answers.)
DynAX Yes, SWARM operates on all available/configured threads of all available/configured nodes. SWARM can operate over stream-, message-, or DMA-based interconnects.
X-TUNE
GVR
CORVETTE
SLEEC
PIPER Tools will have a global "runtime" to collect and aggregate data - this network will be out of band. This will span the whole job, and in some cases the whole machine. A high-performance communication mechanism would be preferable - currently, sockets are mostly used.
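The following is a minimal, self-contained sketch of the one-way/active-message style interface TG and D-TEC describe: a handler is registered per message type and sends are fire-and-forget. The "network" here is an in-process loopback so the example stays runnable; a real transport would sit on sockets, MPI, or a vendor interconnect API. All names are illustrative assumptions.

```c
/* Minimal one-way / active-message interface: register a handler per
 * message type, send fire-and-forget.  The "network" is an in-process
 * loopback so the sketch stays self-contained and runnable. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_MSG_TYPES 16

typedef void (*am_handler_t)(int src_rank, const void *payload, size_t len);

static am_handler_t handler_table[MAX_MSG_TYPES];

/* Register the handler invoked on the receiver for a given message type. */
static void am_register(int msg_type, am_handler_t h) {
    handler_table[msg_type] = h;
}

/* One-way send: a real runtime would enqueue the bytes on the interconnect
 * (sockets, MPI, vendor API); here the message is delivered locally. */
static void am_send(int dst_rank, int msg_type, const void *payload, size_t len) {
    (void)dst_rank;                         /* loopback: ignore destination */
    if (handler_table[msg_type])
        handler_table[msg_type](0 /* src rank */, payload, len);
}

static void hello_handler(int src_rank, const void *payload, size_t len) {
    printf("rank %d says: %.*s\n", src_rank, (int)len, (const char *)payload);
}

int main(void) {
    am_register(0, hello_handler);
    const char msg[] = "ping";
    am_send(1, 0, msg, strlen(msg));        /* fire-and-forget */
    return 0;
}
```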
How does your RTS map user-level and OS-level scheduling?
XPRESS The LXK OS allocates a share of its execution resources (e.g., Pthreads) to each relative-root ParalleX process allocated to the locality. The HPX runtime system uses lightweight scheduling policies to assign user threads to the allocated OS threads.
TG Our RTS is built on the assumption that there is almost nothing below it; in other words, we try to rely as little as possible on the operating system. For scheduling, for example, on a traditional x86 Linux system we create a number of pinned worker threads and then manage work on these workers ourselves. (See the pinning sketch after this question's answers.)
DEGAS
D-TEC We allocate a pool of OS-level execution resources (e.g., pthreads). Our scheduler then uses these resources as workers on which to schedule the APGAS-level tasks using a work-stealing scheduler.
DynAX SWARM uses codelets to mediate between threads and function/method calls. Threads are set up and bound at runtime startup; codelets are bound to particular threads only when they are dispatched, unless some more specific binding is arranged before readying the codelet. The runtime can dynamically balance load by shifting readied codelets and/or context data from one location to another. When the OS is in charge of power management, blocking is used to relinquish a hardware thread to the OS so that it can be used for other work or its core can be powered down.
X-TUNE The most interesting tool would be one that could compare two different versions of the code to see where changes to variable values are observed.
GVR
CORVETTE
SLEEC N/A
PIPER N/A
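As an illustration of the pinned-worker pattern TG describes (and of the pthread worker pools D-TEC uses), the sketch below creates a few OS threads and pins each to a core with the Linux-specific pthread_setaffinity_np; the runtime would then schedule its own user-level tasks onto these workers. The worker count and names are assumptions, not any project's code.

```c
/* Pinned-worker pattern: one OS thread per core, pinned at startup; the
 * runtime then schedules its own user-level tasks onto these workers. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NUM_WORKERS 4            /* assumption: at least 4 hardware threads */

static void *worker_loop(void *arg) {
    long id = (long)arg;
    /* A real runtime would pop tasks from its own queues here. */
    printf("worker %ld pinned and running\n", id);
    return NULL;
}

int main(void) {
    pthread_t workers[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++) {
        pthread_create(&workers[i], NULL, worker_loop, (void *)i);

        /* Pin each worker to one core so the OS never migrates it. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);
        pthread_setaffinity_np(workers[i], sizeof(set), &set);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```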
What does your RTS use for locality information?
XPRESS The “locality” is defined as a synchronous domain that guarantees bounded response time and compound atomic sequences of operations. Compute complexes (thread instances) are to be performed on a single locality at a time and can assume its properties. ParalleX Processes are contexts that define relative logical locality although this may span multiple localities. Parcels permit asynchronous non-blocking operation and move work to data to minimize latency effects.
TG We expect this information to come from: (a) user (or higher-level tools/compilers) hints, (b) introspection of the physical layout based on configuration files, and (c) (potentially) introspection into machine behavior.
DEGAS
D-TEC The X10/APGAS runtime system spans over multiple shared-memory domains called places. An application specifies the place of each data object and computational task.
DynAX It uses a tree of locale descriptors to associate threads, cores, nodes, etc. with each other, typically mirroring the hardware memory hierarchy. (See the sketch after this question's answers.)
X-TUNE The key issue will be understanding when differences in output are acceptable, and when they represent an error.
GVR
CORVETTE
SLEEC
PIPER Locality/topology information should be exposed by the application-facing runtime and will be used for proper attribution of performance data.
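A hypothetical sketch of the locale-descriptor tree DynAX mentions: locales for the machine, nodes, and cores are linked parent/child so the runtime can reason about distance in the memory hierarchy. The structure, field names, and distance heuristic are assumptions chosen for illustration only.

```c
/* Hypothetical locale-descriptor tree: machine -> node -> core, linked
 * parent/child so distance in the memory hierarchy can be estimated. */
#include <stdio.h>
#include <stdlib.h>

typedef enum { LOC_MACHINE, LOC_NODE, LOC_CORE, LOC_HWTHREAD } locale_kind;

typedef struct locale_desc {
    locale_kind         kind;
    int                 id;
    struct locale_desc *parent;
    struct locale_desc *children;   /* first child */
    struct locale_desc *sibling;    /* next child of the same parent */
} locale_desc;

static locale_desc *locale_new(locale_kind kind, int id, locale_desc *parent) {
    locale_desc *l = calloc(1, sizeof(*l));
    l->kind = kind;
    l->id = id;
    l->parent = parent;
    if (parent) {                   /* prepend to the parent's child list */
        l->sibling = parent->children;
        parent->children = l;
    }
    return l;
}

/* Hops up the tree until the two locales meet (assumes equal depth). */
static int locale_distance(const locale_desc *a, const locale_desc *b) {
    int d = 0;
    while (a != b) {
        a = a->parent;
        b = b->parent;
        d++;
    }
    return d;
}

int main(void) {
    locale_desc *machine = locale_new(LOC_MACHINE, 0, NULL);
    locale_desc *node0   = locale_new(LOC_NODE, 0, machine);
    locale_desc *core0   = locale_new(LOC_CORE, 0, node0);
    locale_desc *core1   = locale_new(LOC_CORE, 1, node0);
    printf("distance(core0, core1) = %d\n", locale_distance(core0, core1));
    return 0;
}
```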
What OS or hardware information does your RTS need to monitor and adapt?
XPRESS Availability of execution resources, energy consumption, detected errors, delays due to contention.
TG Performance monitoring units and fault detection.
DEGAS
D-TEC The X10/APGAS RTS monitors the connections between nodes (hosts) to detect node and network failures.
DynAX Reliable notification of hardware failures, and a local or global cycle-based or real-time clock. Performance counters would help with load modeling and balancing.
X-TUNE
GVR
CORVETTE
SLEEC
PIPER In short: anything and everything. In particular, hardware counters (for both profiling and sampling) and any kind of system adaptation information (i.e., where the system configuration changes) are required. (See the counter-reading sketch after this question's answers.)
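For concreteness, the sketch below reads one hardware counter through the standard Linux perf_event_open interface, the kind of OS facility PIPER and TG refer to for monitoring. This is generic Linux API usage, not any project's implementation; the measured loop is just a stand-in for runtime work.

```c
/* Reading one hardware counter via Linux perf_event_open.  Generic Linux
 * API usage; the measured loop is just a stand-in for runtime work. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t x = 0;                 /* work to be measured */
    for (int i = 0; i < 1000000; i++) x += (uint64_t)i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
        printf("instructions retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```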
Does your RTS require support for global namespace or global address space?
XPRESS Yes.
TG No, will use if available.
DEGAS
D-TEC Currently the APGAS runtime provides a global address space entirely in software. If the lower-level system software provided full or partial support for a global address space, the APGAS runtime could exploit it. However, we do not require global address support from the underlying system.
DynAX SWARM can take advantage of a global name/address space, but provides a global namespace entirely in software. OS or hardware involvement is only needed for data storage and communication. (A sketch of such a software global-address encoding appears after this question's answers.)
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER N/A
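A minimal sketch of a purely software global address space of the kind D-TEC and DynAX describe: a 64-bit global address packs the owning node's rank into the high bits and a node-local offset into the low bits, so any runtime instance can route a reference to its owner. The 16/48-bit split and all names are assumptions for illustration.

```c
/* Software global address: high bits identify the owning node (rank),
 * low bits give a node-local offset.  Split widths are an assumption. */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 48u
#define OFFSET_MASK ((UINT64_C(1) << OFFSET_BITS) - 1)

typedef uint64_t gaddr_t;

static gaddr_t gaddr_make(uint16_t rank, uint64_t offset) {
    return ((gaddr_t)rank << OFFSET_BITS) | (offset & OFFSET_MASK);
}

static uint16_t gaddr_rank(gaddr_t g)   { return (uint16_t)(g >> OFFSET_BITS); }
static uint64_t gaddr_offset(gaddr_t g) { return g & OFFSET_MASK; }

int main(void) {
    gaddr_t g = gaddr_make(7, 0xDEADBEEF);    /* object owned by rank 7 */
    printf("rank=%u offset=0x%llx\n",
           (unsigned)gaddr_rank(g), (unsigned long long)gaddr_offset(g));
    return 0;
}
```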
What local memory management capability does your RTS require?
XPRESS It must have support for allocation and deallocation of physical memory blocks and for protected virtual memory addresses at the local level, and it must receive error information during memory accesses.
TG Our RTS self-manages fine-grained allocations. It simply needs to acquire range(s) of addresses it can use. (See the reservation sketch after this question's answers.)
DEGAS
D-TEC Garbage collection.
DynAX SWARM requires the ability to allocate physical memory.
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER Individual parts of the runtime will require dynamic memory management; additionally, shared-memory communication with a target process would be highly beneficial.
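The sketch below illustrates the "acquire a range of addresses and sub-allocate internally" pattern TG describes (and the physical-allocation needs XPRESS and DynAX mention): a large region is reserved with PROT_NONE and pages are committed as the runtime's own allocator needs them. This is plain POSIX mmap/mprotect usage; the sizes and names are illustrative assumptions.

```c
/* Reserve a large address range up front, commit pages as the runtime's
 * own allocator needs them.  Plain POSIX mmap/mprotect; sizes illustrative. */
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define REGION_SIZE (256UL * 1024 * 1024)    /* 256 MiB of address space */
#define CHUNK_SIZE  (1UL * 1024 * 1024)      /* commit 1 MiB at a time   */

int main(void) {
    /* Reserve address space without committing backing memory. */
    void *region = mmap(NULL, REGION_SIZE, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* Commit the first chunk when the RTS allocator first needs it. */
    if (mprotect(region, CHUNK_SIZE, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    ((char *)region)[0] = 42;                /* the chunk is now usable */

    printf("reserved %lu MiB at %p, committed %lu MiB\n",
           REGION_SIZE >> 20, region, CHUNK_SIZE >> 20);
    munmap(region, REGION_SIZE);
    return 0;
}
```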
Does your RTS address external I/O capability?
XPRESS Yes.
TG Yes (partial).
DEGAS
D-TEC No
DynAX Yes.
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER N/A
What interface and/or mechanism is used for the OS to request RTS services?
XPRESS The OS (e.g., LXK) may make requests of the runtime system for information needed to coordinate actions, resources, and services across multiple localities or the entire system, and to provide high-level functionality such as POSIX calls.
TG n/a
DEGAS
D-TEC The X10/APGAS RTS is linked with the application binary.
DynAX Current versions of SWARM do not require the OS to request services from the runtime. Should this become necessary, it is expected that either a signal-/interrupt-based or a polling-based interface will be provided, and either of these can be integrated easily. (See the signal-based sketch after this question's answers.)
X-TUNE N/A -- We use standard languages and run-time support.
GVR
CORVETTE
SLEEC N/A
PIPER N/A
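As a sketch of the signal-based interface DynAX says it could provide if the OS ever needs to request runtime services: the handler only sets an atomic flag, and the runtime's scheduler loop polls it and services the request at a safe point. The signal number and function names are illustrative assumptions.

```c
/* OS-to-runtime request via a signal: the handler only sets a flag; the
 * scheduler loop polls and services the request at a safe point. */
#include <signal.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

static atomic_int os_request_pending;

static void os_request_handler(int sig) {
    (void)sig;
    atomic_store(&os_request_pending, 1);    /* async-signal-safe */
}

static void scheduler_poll(void) {
    if (atomic_exchange(&os_request_pending, 0))
        printf("runtime servicing OS request at a safe point\n");
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = os_request_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    raise(SIGUSR1);             /* stand-in for the OS raising a request */
    scheduler_poll();           /* the scheduler loop notices and reacts */
    return 0;
}
```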
How does your RTS support legacy application or legacy RTS capability?
XPRESS Both MPI and OpenMP software interfaces are being provided to XPI as a target interface to HPX. LXK can also support both in native form.
TG Not in TG scope.
DEGAS
D-TEC N/A
DynAX Legacy applications can be converted piecewise or in their entirety. While the application may block normally during single-threaded regions, parallelized regions require blocking calls to use stack-switching or (equivalently) extra software threads, or else to break blocking operations apart into separate initiation and callback sections (see the sketch after this question's answers). Where possible, SWARM provides predefined exports that allow asynchronous use of common legacy runtime/API functionality.
X-TUNE Scalability and determining what is an error seem like the biggest challenges.
GVR
CORVETTE
SLEEC
PIPER Yes: PIPER components intend to support tools for MPI+X codes as well as new RTS and DSL approaches
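The fragment below illustrates the conversion DynAX describes for legacy blocking operations: a blocking call is split into an initiation and a completion callback (a continuation codelet). The asynchronous layer here completes immediately so the example stays self-contained; all names and structure are assumptions for illustration.

```c
/* Splitting a blocking legacy call into initiation + completion callback.
 * The asynchronous layer completes immediately to keep the sketch runnable. */
#include <stdio.h>
#include <string.h>

typedef void (*completion_cb_t)(const char *data, void *ctx);

/* Legacy style: the caller blocks until the data is available. */
static void read_config_blocking(char *buf, size_t len) {
    strncpy(buf, "threads=4", len - 1);
    buf[len - 1] = '\0';
}

/* Codelet style: initiation returns right away; the callback (continuation
 * codelet) is scheduled when the data arrives (here, immediately). */
static void read_config_initiate(completion_cb_t cb, void *ctx) {
    static const char data[] = "threads=4";
    cb(data, ctx);
}

static void on_config_ready(const char *data, void *ctx) {
    (void)ctx;
    printf("continuation codelet got: %s\n", data);
}

int main(void) {
    char buf[32];
    read_config_blocking(buf, sizeof(buf));       /* legacy path    */
    printf("blocking call got: %s\n", buf);

    read_config_initiate(on_config_ready, NULL);  /* converted path */
    return 0;
}
```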
Does your RTS depend on any hardware-specific capability?
XPRESS HPX at a minimum requires standard hardware functionality of conventional systems but would benefit from new capabilities for efficiency and scalability.
TG No, but it can take advantage of some if available.
DEGAS
D-TEC No. But the X10/APGAS RTS can take advantage of hardware-specific networking capabilities and CUDA GPUs.
DynAX SWARM can operate perfectly well on commodity systems, but benefits from access to performance-counting and power-monitoring/-control facilities.
X-TUNE
GVR
CORVETTE
SLEEC N/A
PIPER Full (and well-documented) access to performance counters, profiling, and sampling.