What policies and/or mechanisms will your runtime use to schedule code and place data for 100M objects (executing code, data elements, etc.) in a scalable fashion?
|
XPRESS |
A hierarchical representation of logical contexts and tasks (processes and compute complexes) provides semantic information about relative locality, which guides the placement of data objects and of the tasks performed on them. Widely distributed data can be organized into separate processes spread across multiple nodes, with methods that allow the actual work to be performed near the data. Ongoing research is exploring the allocation of resources by the LXK OS to the HPX runtime system and the policies to be implemented, including programming-interface semantics.
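For illustration only, the following C++ sketch (hypothetical types, not the HPX or LXK interface) shows the idea of a locality hierarchy that routes work to the leaf owning a given data object:

    // Illustrative sketch: a hierarchy of logical localities, each owning a
    // slice of the global object space, with a helper that runs work on the
    // locality that owns a given object.
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <memory>
    #include <vector>

    struct Locality {
        int id;
        std::uint64_t first_object, last_object;       // objects owned by this subtree
        std::vector<std::unique_ptr<Locality>> children;

        // Descend the hierarchy to the leaf that owns `obj` and run `work` there.
        void run_near(std::uint64_t obj, const std::function<void(int)>& work) {
            for (auto& c : children)
                if (obj >= c->first_object && obj <= c->last_object)
                    return c->run_near(obj, work);
            work(id);                                   // leaf: execute "near" the data
        }
    };

    int main() {
        // Two nodes, each owning half of 100 objects (stand-in for 100M).
        Locality root{0, 0, 99, {}};
        for (int n = 0; n < 2; ++n) {
            auto leaf = std::make_unique<Locality>();
            leaf->id = n + 1;
            leaf->first_object = n * 50;
            leaf->last_object = n * 50 + 49;
            root.children.push_back(std::move(leaf));
        }
        root.run_near(73, [](int loc) { std::cout << "ran on locality " << loc << "\n"; });
    }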
|
TG |
The Open Community Runtime (OCR) will optimize for data-movement scalability. Our programming model divides an application into event-driven tasks with explicit data dependences. The runtime uses this information to schedule code close to its data or to move the data close to the code. Scalability will be achieved through hierarchical task stealing that favors locality.
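A minimal C++ sketch of that policy, using hypothetical types rather than the actual OCR API: tasks carry explicit data dependences, and a simple placement rule runs each task on the node holding most of its input bytes.

    // Illustrative only: event-driven tasks declare the data blocks they read,
    // and the policy moves code to data rather than data to code.
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <vector>

    struct DataBlock { int home_node; std::size_t bytes; };

    struct Task {
        std::vector<const DataBlock*> deps;   // explicit data dependences
        std::function<void(int)> body;        // runs once all deps are satisfied
    };

    int pick_node(const Task& t) {
        std::map<int, std::size_t> bytes_per_node;
        for (const DataBlock* db : t.deps) bytes_per_node[db->home_node] += db->bytes;
        int best = 0; std::size_t most = 0;
        for (auto& [node, b] : bytes_per_node)
            if (b > most) { most = b; best = node; }
        return best;
    }

    int main() {
        DataBlock a{0, 4096}, b{1, 64 * 1024};
        Task t{{&a, &b}, [](int node) { std::cout << "task placed on node " << node << "\n"; }};
        t.body(pick_node(t));                 // prints node 1: that is where most input lives
    }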
|
DEGAS |
The DEGAS runtime uses one-sided communication (put, get, active messages, atomics, and remote enqueue of tasks) to place data and work across a large-scale machine. Within a node, two scheduling approaches are currently being pursued. The first (under HCLib/Habanero-C) is built on OCR and uses a dynamic task scheduler; it is being evaluated to determine the need for locality control within the node. The second is derived from the UPC runtime and uses a fixed set of locality-aware threads tied to cores (or hardware threads or NUMA domains; the abstraction applies at various machine levels), augmented with voluntary task scheduling for both locally and remotely generated dynamic tasks. A global task-stealing scheduler is also part of the DEGAS plan and exists in prototype form; like dynamic tasking, it is intended for on-demand use by applications that are not naturally load balanced (e.g., divide-and-conquer problems with irregular trees).
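The node-level scheme can be illustrated with a small C++ sketch (hypothetical code, not the UPC or Habanero-C runtime): a fixed set of worker threads, each with its own queue of voluntarily scheduled tasks, stealing from other queues when idle.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Worker {
        std::deque<std::function<void()>> q;  // this worker's voluntary task queue
        std::mutex m;
    };

    int main() {
        const int nworkers = 4;
        std::vector<Worker> workers(nworkers);
        std::atomic<int> remaining{32};

        // Seed all tasks on worker 0 to force the others to steal.
        for (int i = 0; i < 32; ++i)
            workers[0].q.push_back([i, &remaining] {
                std::cout << "task " << i << " on thread "
                          << std::this_thread::get_id() << "\n";
                --remaining;
            });

        std::vector<std::thread> pool;
        for (int w = 0; w < nworkers; ++w)
            pool.emplace_back([w, &workers, &remaining, nworkers] {
                while (remaining.load() > 0) {
                    std::function<void()> task;
                    // Prefer local work; otherwise scan the other queues (steal).
                    for (int k = 0; k < nworkers && !task; ++k) {
                        Worker& v = workers[(w + k) % nworkers];
                        std::lock_guard<std::mutex> g(v.m);
                        if (!v.q.empty()) { task = std::move(v.q.front()); v.q.pop_front(); }
                    }
                    if (task) task(); else std::this_thread::yield();
                }
            });
        for (auto& t : pool) t.join();
    }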
|
D-TEC |
The APGAS (Asynchronous Partitioned Global Address Space) runtime uses a work-stealing scheduler to dynamically schedule tasks within a node. We are introducing Areas to enable finer-grained locality and scheduling control within a node (a Place). By design, the runtime does not directly address automatic cross-node data placement; however, it does provide primitive mechanisms (Places and Areas; at/async/finish) that allow application frameworks to productively implement data placement and cross-node scheduling on top of the runtime.
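The async/finish semantics can be sketched in plain C++ (illustrative only; the Finish type below is hypothetical, not the APGAS/X10 API): async spawns a child task, and leaving the finish scope waits for all spawned children.

    #include <future>
    #include <iostream>
    #include <utility>
    #include <vector>

    struct Finish {
        std::vector<std::future<void>> children;
        template <class F> void async(F&& f) {           // spawn a child task
            children.emplace_back(std::async(std::launch::async, std::forward<F>(f)));
        }
        ~Finish() { for (auto& c : children) c.wait(); } // wait for all children
    };

    int main() {
        {
            Finish f;                      // like: finish { ... }
            for (int i = 0; i < 4; ++i)
                f.async([i] { std::cout << "child task " << i << "\n"; });
        }                                  // all four children are done here
        std::cout << "after finish\n";
    }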
|
DynAX |
The SWift Adaptive Runtime Machine (SWARM) has a "locale" hierarchy that roughly mirrors the hardware architecture hierarchy. Each locale has its own set of local scheduler queues, allowing distributed and scalable scheduling. Data allocation and task/data migration are expressed in terms of this hierarchy, so that parallelism is preserved where tasks and data come together. SWARM will rely on a single-assignment policy to avoid the need for globally coordinated checkout or write-back operations.
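The single-assignment policy can be illustrated with a short C++ sketch (the SingleAssignment type is hypothetical, not the SWARM API): the value is written exactly once, so readers simply wait for the one producer and no coordinated checkout or write-back is needed.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <stdexcept>
    #include <thread>

    template <class T>
    class SingleAssignment {
        std::mutex m;
        std::condition_variable cv;
        bool written = false;
        T value{};
    public:
        void put(T v) {                         // must be called exactly once
            std::lock_guard<std::mutex> g(m);
            if (written) throw std::logic_error("double assignment");
            value = std::move(v);
            written = true;
            cv.notify_all();
        }
        const T& get() {                        // blocks until the producer writes
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [this] { return written; });
            return value;
        }
    };

    int main() {
        SingleAssignment<int> cell;
        std::thread consumer([&] { std::cout << "got " << cell.get() << "\n"; });
        std::thread producer([&] { cell.put(42); });
        producer.join();
        consumer.join();
    }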
|
X-TUNE |
The X-TUNE compiler must generate code with hierarchical threading and will rely on the runtime to manage that threading efficiently. Point-to-point synchronization between threads may be more efficient than barriers because it allows more dynamic behavior of the threads.
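A small C++ sketch of the point-to-point idea (illustrative only): each thread waits only on its left neighbor's per-step counter instead of joining a full barrier, so it is not held back by unrelated slow threads.

    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const int nthreads = 4, steps = 3;
        std::vector<std::atomic<int>> done(nthreads);   // steps completed per thread
        for (auto& d : done) d.store(0);

        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t)
            pool.emplace_back([t, &done, steps] {
                for (int s = 1; s <= steps; ++s) {
                    // Wait only for the left neighbor to finish step s (not everyone).
                    if (t > 0)
                        while (done[t - 1].load(std::memory_order_acquire) < s)
                            std::this_thread::yield();
                    // ... do step s of this thread's tile here ...
                    done[t].store(s, std::memory_order_release);
                }
            });
        for (auto& th : pool) th.join();
        std::cout << "wavefront of " << steps << " steps completed\n";
    }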
|
GVR |
GVR will use performance information for varied memory and storage types (DRAM, NVRAM, SSD, disk), resource failure rates and failure prediction, redundancy in data encoding, the existence and location of versioned data copies, and communication costs to decide where to place data. GVR does not include code-scheduling mechanisms.
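For illustration, a hypothetical C++ sketch of such a placement policy (the tiers, weights, and numbers below are invented, not GVR's actual model): each tier is scored by expected access cost plus expected loss from failures, and the cheapest tier wins.

    #include <iostream>
    #include <string>
    #include <vector>

    struct Tier {
        std::string name;
        double access_cost_us;   // average access latency in microseconds
        double fail_rate;        // expected failures per hour
    };

    const Tier& choose_tier(const std::vector<Tier>& tiers,
                            double accesses_per_hour, double reload_cost_us) {
        const Tier* best = &tiers.front();
        double best_score = 1e300;
        for (const Tier& t : tiers) {
            // Expected cost = time spent on accesses + time lost to failures.
            double score = accesses_per_hour * t.access_cost_us
                         + t.fail_rate * reload_cost_us;
            if (score < best_score) { best_score = score; best = &t; }
        }
        return *best;
    }

    int main() {
        std::vector<Tier> tiers = {
            {"DRAM", 0.1, 1e-3}, {"NVRAM", 1.0, 1e-4}, {"SSD", 50.0, 1e-5}};
        std::cout << "hot copy -> "
                  << choose_tier(tiers, 1e6, 1e7).name << "\n"    // frequently read
                  << "cold version -> "
                  << choose_tier(tiers, 10, 1e7).name << "\n";    // rarely read
    }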
|
CORVETTE |
|
SLEEC |
SLEEC does not have a true runtime component, except insofar as we are developing single-node runtimes to, e.g., manage data movement between cores and accelerators. We also perform small-scale inspector/executor-style scheduling for applications. However, we expect to rely on other systems for our large-scale runtime needs.
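The inspector/executor style mentioned above can be illustrated with a small C++ sketch (hypothetical code, not SLEEC's implementation): the inspector examines the indirection array once and buckets iterations by the core owning the data they touch; the executor then replays that precomputed schedule.

    #include <iostream>
    #include <vector>

    // Block ownership: which core owns element `index` of an array of size n.
    int owner_of(int index, int ncores, int n) { return index * ncores / n; }

    int main() {
        const int ncores = 2, n = 8;
        std::vector<int> indirection = {7, 0, 3, 5, 1, 6, 2, 4};   // irregular accesses
        std::vector<double> data(n, 1.0);

        // Inspector: bucket iteration numbers by the owner of the element they read.
        std::vector<std::vector<int>> schedule(ncores);
        for (int i = 0; i < (int)indirection.size(); ++i)
            schedule[owner_of(indirection[i], ncores, n)].push_back(i);

        // Executor: each core's bucket can now run with only local accesses.
        for (int core = 0; core < ncores; ++core)
            for (int i : schedule[core])
                std::cout << "core " << core << " runs iteration " << i
                          << " touching data[" << indirection[i] << "]\n";
    }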
|
PIPER |
N/A
|
Charm++ |
Overdecomposition: A Charm++ program consists of a large number of objects that the RTS assigns to processors. Initial placement of objects is controlled by map functions, which are either system-defined or user-defined static functions. The RTS dynamically migrates objects across processors as needed. Message-driven scheduling is used on each processor: the scheduler selects a message containing a method invocation for an object, and the corresponding object's execution is triggered.
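A plain C++ sketch of message-driven scheduling and map-based placement (illustrative only, not the Charm++ API): each simulated processor's scheduler pops a message and triggers the corresponding object's method, with initial placement chosen by a simple map function.

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Chare {
        int id;
        void compute(int step) { std::cout << "object " << id << " runs step " << step << "\n"; }
    };

    int default_map(int object_id, int npes) { return object_id % npes; }  // placement rule

    int main() {
        const int npes = 2, nobjects = 6;
        std::vector<Chare> objects;
        for (int i = 0; i < nobjects; ++i) objects.push_back({i});

        // One message queue per (simulated) processor; messages carry an invocation.
        std::vector<std::queue<std::function<void()>>> pe_queue(npes);
        for (int i = 0; i < nobjects; ++i)
            pe_queue[default_map(i, npes)].push([&objects, i] { objects[i].compute(0); });

        // Each processor's scheduler: pop a message and trigger the object's method.
        for (int pe = 0; pe < npes; ++pe)
            while (!pe_queue[pe].empty()) { pe_queue[pe].front()(); pe_queue[pe].pop(); }
    }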
|
Early Career-SriramK |
Work is assumed to be decomposed into fine-grained tasks. The specification of inter-task dependences and task-data relationships is used to automate aspects of locality management, load balancing, and resilience. We are investigating dynamic load-balancing algorithms for various classes of inter-task and task-data relationships (strict computations, data-flow graphs, etc.), for both intra-node and inter-node scheduling.
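A minimal C++ sketch of dependence-driven scheduling under this model (illustrative only): each task tracks its unmet inter-task dependences and becomes eligible for load-balanced execution when the count reaches zero.

    #include <iostream>
    #include <queue>
    #include <vector>

    int main() {
        // Task graph: 0 -> 2, 1 -> 2, 2 -> 3  (edges are inter-task dependences).
        const int ntasks = 4;
        std::vector<std::vector<int>> successors = {{2}, {2}, {3}, {}};
        std::vector<int> unmet(ntasks, 0);
        for (int t = 0; t < ntasks; ++t)
            for (int s : successors[t]) ++unmet[s];

        std::queue<int> ready;                       // tasks with no unmet dependences
        for (int t = 0; t < ntasks; ++t) if (unmet[t] == 0) ready.push(t);

        int next_worker = 0, nworkers = 2;
        while (!ready.empty()) {
            int t = ready.front(); ready.pop();
            // A real scheduler would pick the least-loaded worker; round-robin here.
            std::cout << "task " << t << " -> worker " << (next_worker++ % nworkers) << "\n";
            for (int s : successors[t])
                if (--unmet[s] == 0) ready.push(s);  // successor becomes ready
        }
    }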
|