Runtimes (application-facing)
From Modelado Foundation
Sonia requested the Traleika Glacier X-Stack team to initiate this page. For comments, please contact Shekhar Borkar.
QUESTIONS | XPRESS | TG X-Stack | DEGAS | D-TEC | DynAX | X-TUNE | GVR | CORVETTE | SLEEC | PIPER |
---|---|---|---|---|---|---|---|---|---|---|
PI | Ron Brightwell | Shekhar Borkar | Katherine Yelick | Daniel Quinlan | Guang Gao | Mary Hall | Andrew Chien | Koushik Sen | Milind Kulkarni | Martin Schulz |
What policies and/or mechanisms will your runtime use to schedule code and place data for 100M objects (executing code, data elements, etc.) in a scalable fashion? | A hierarchical representation of logical contexts and tasks (processes and compute complexes) provides semantic representations of relative locality for placement of data objects and the tasks that are performed on them. Where data is widely distributed, they can be organized on separate processes distributed across multiple nodes with methods that allow actual work to be performed near the data. Research is exploring the allocation of resources by the LXK OS to the HPX runtime system and the policies to be implemented including programming interface semantics. | Open Community Runtime (OCR) will optimize for data-movement scalability. Our programming model divides an application into event-driven tasks with explicit data-dependences. Our runtime uses of this to schedule code close to its data or move the data close to the code. Scalability will be achieved through hierarchical task-stealing favoring locality. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) a | (CORVETTE) | SLEEC does not have a true runtime component, except insofar as we are developing single-node runtimes to, e.g., manage data movement between cores and accelerators. We also perform small-scale inspector/executor-style scheduling for applications. However, we expect to rely on other systems for our large-scale runtime needs. | (PIPER) |
What features will allow your runtime to dynamically adapt the schedule and placement for 100K sockets to improve the metrics of code-data affinity, power consumption, migration cost and resiliency? | The HPX/LXK System software architecture (also known as the “OpenX Architecture” integrates a closed-loop introspection component comprising the APEX and RCR components within the runtime and OS respectively. Code-data affinity is supported by multiple mechanisms. Intra-compute complex (thread) function keeps all private or local data in the same locality. Parcels move work to the data when preferred although supports data access and gathers as appropriate. Processes keep shared data organized within a single logical context that can be spread across multiple localities. The effective reduction of latency effects also reduces data movement energy. For resiliency reconfiguration and recovery data migration is enabled by logical active global address space. Research is being performed to address these issues; some under other funding. | If the hardware supports it, OCR will monitor performance and power counters to adapt its scheduling and data-placement to better utilize the machine. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |
How will the runtime manage resources (compute, memory, power, bandwidth) for 100K sockets to meet a power, energy and performance objective? | The HPX runtime system maintains an abstraction of global data and compute complexes (threads) within the context of the ParalleX process hierarchy and engages in a bi-directional protocol with the LXK lightweight kernel to acquire and employ memory blocks and OS thread executables. As the OS manages multiple job program resource conflicts and the HPX runtime manages the intra-job task requirements and priorities, the two work together in dialog to balance the complex tradeoffs. Power imposes upper constraints at the node (locality) and socket level limited by the OS. Energy usage is governed by the ParalleX Side-Path Energy Suppression methodology that (attempts) to determine critical path of execution to which highest power is applied and reduces energy usage to the non-critical (side-path) work to the degree that the critical path does not change thus minimizing total energy with shortest time to completion. This strategy addresses scaling of both energy and performance objectives. | OCR will manage resources based on the application's needs and the power budget and turn off or scale back unneeded resources. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |
How does the runtime software itself scale to 100K sockets? Specifically, how does it distribute, monitor and balance itself and how is it resilient to failures? | Individual instances of runtime system functions and responsibilities are created on a per node basis and per user program basis to spread the work uniformly as a system scales in workload (number of user jobs) and scales to larger number of hardware localities (ensembles of sockets). Introspection at the hardware support layer and software application layer detects and manages load balance through the RIOS control interface, the APEX runtime instrumentation and control layer, and the RCR black-boarding at the OS layer. Resiliency will be supported through the ParalleX execution model micro-checkpointing cross-cutting Compute-Validate-Commit cycle that employs hierarchical fault zones. This dynamic methodology engages all component layers of the hardware-software system for fault detection, isolation, diagnosis, reconfiguration, recovery, and restart. | OCR functionality is hierarchically distributed along the hardware’s natural computation hierarchy (if it has one) or imposing an arbitrary one. OCR divides cores into "runtime" and "user". For efficiency, "user" cores run a small layer of the runtime and manage that specific core. The other "runtime" cores manage the user cores in a hierarchical fashion where the "runtime" cores "closest" to the "user" cores will perform low-latency simple scheduling decisions whereas higher level cores will perform longer-term optimization operations. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | SLEEC's runtimes are intended to operate within the scope of a single node, or at a small scale. We rely on other runtimes to provide higher levels of the hierarchy. | (PIPER) |
What is the efficiency of the runtime? Specifically, how much impact does the runtime have on a) the total execution time of the application and b) resources taken from algorithmic computations? What are your plans to maximize efficiency? How will runtime overhead scale to 100K sockets? | The HPX runtime is event driven and stays out of the way of the user codes executing intra-thread for purposes of efficiency. However, inter-thread there are a number of overhead actions that impact efficiency and impose a lower bound on thread granularity, which limits scalability for fixed size workloads. OS overhead (LXK) is fixed on a per node basis and therefore scalable. HPX process calls across nodes (conceptually millions) employ symmetric semantics (synchronous versus asynchronous) for portability, parcels for message-driven computing in combination with local control objects to manage asynchrony including mitigation of latency effects, and active global address space to handle remote data load and stores. Research will determine the scaling factors for these as well as the time and energy efficiencies that may be achieved. | OCR code runs on cores that are physically separate from those for user code. Our goal is to have enough “runtime” cores that runtime overhead is completely masked by the application code. As machine size increases, more runtime cores will be needed to handle higher-level functions and global optimizations but this will increase very slowly. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | Because SLEEC focuses on small-scale runtimes that are directly integrated with application code, we expect our runtime overheads to be negligible and, essentially, independent of scale (because scaling will be provided by other runtime systems. | (PIPER) |
Do you support isolation of the runtime code from the user code to avoid violations and contamination? | The ParalleX process construct and hierarchy with capabilities addressing separates runtime functions from user functions. The global addressing permits runtime system instances to manipulate user “compute complexes” (e.g., threads) as first class objects. Independent runtime instances isolates multiple user applications sharing any particular localities (nodes). Research is exploring the costs and completeness of these protection mechanisms. | The majority of the runtime code runs on cores that are physically separate from the ones on which user code is running. Although we are currently considering a model where all cores can touch data everywhere else, our model will support possible hardware restriction (user cores cannot touch data in runtime cores). | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | SLEEC's runtimes are application/domain-specific and hence intended to closely couple with the application code. | (PIPER) |
What specific hardware features do you require for proper or efficient runtime operation (atomics, DMA, F/E bits, etc.)? | There are no absolute requirements for proper operation of the HPX runtime system beyond those found on conventional parallel and distributed systems. These include compound atomic operations, message exchange between nodes, scheduling of threads and their precise interrupts, and local virtual address translation. However, there are additional features that may be incorporated in the future that would dramatically reduce overheads, mitigate latencies, increase parallelism, and circumvent hotspots. Among such mechanisms for efficient runtime operation are hardware support for 1) user lightweight thread creation, termination, and context switching (including preemption), 2) global virtual address translation, 3) ‘struct’ processing for simultaneous multi-word processing (for local control objects among others), message driven computation, and combined DMA plus synchronization. Research will ascertain, evaluate, and analyze to degree of operational improvements that may be derived from such hardware support. | OCR requires hardware to support some form of atomic locking. Additional HW features identified for increased efficiency: 1) Remote atomics for cheaper manipulation of far-away memory; 2) Heterogeneity to taylor "user" cores for user code and "runtime" cores for runtime code (no FP for example) 3) Fast runtime core-to-core communication to allow the runtime to communicate efficiently without impacting user code 4) Asynchronous data movement (DMA engines) 5) HW monitoring to allow introspection and adaptation; 6) knowledge of HW structure (memory costs, network links available, etc) enabling more efficient scheduling and placement. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |
What is your model for execution of multiple different programs (ie: a single mention would be doing more than one thing) in terms of division, isolation, containment and protection? | The HPX runtime system supports the ParalleX processes, which serve as logical contexts and are referenced through a hierarchical namespace. The global root process of the entire system provides global naming. Each program has a program root process that contains the instances of the dedicated runtime kernel and the “main” process of the user applications. The process boundaries incorporate a form of capabilities based addressing for protections. Programs are logically separate and isolated although can interact through the upper hierarchy of the process stack. Nonetheless, programs may share physical resources (localities). The underlying OS manages the protections of the virtual address space. | Our programming model splits user code into small event-driven tasks (EDTs). Multiple non-related EDT sub-graphs can coexist at the same time with the same runtime. While not isolating applications, it does automatically globally balance all the applications at once. The locality aware scheduling will also naturally migrate related data and code closer together thereby physically partitioning the different applications. If a more secure model is required, different runtimes can run on a subset of the machine thereby statically partitioning the machine for the various applications; it is more secure but less flexible. | (DEGAS) | (D-TEC) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | N/A | (PIPER) |