Runtimes (application-facing)
From Modelado Foundation
Sonia requested the Traleika Glacier X-Stack team to initiate this page. For comments, please contact Shekhar Borkar.
QUESTIONS | TG X-Stack | DEGAS | D-TEC | XPRESS | DynAX | X-TUNE | GVR | CORVETTE | SLEEC | PIPER |
---|---|---|---|---|---|---|---|---|---|---|
PI | Shekhar Borkar | Katherine Yelick | Daniel Quinlan | Ron Brightwell | Guang Gao | Mary Hall | Andrew Chien | Koushik Sen | Milind Kulkarni | Martin Schulz |
What policies and/or mechanisms will your runtime use to schedule code and place data for 100M objects (executing code, data elements, etc.) in a scalable fashion? | The Open Community Runtime (OCR) will optimize for data-movement scalability. Our programming model divides an application into event-driven tasks with explicit data dependences. The runtime uses this information to schedule code close to its data or to move data close to the code. Scalability will be achieved through hierarchical task stealing that favors locality. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
What features will allow your runtime to dynamically adapt the schedule and placement for 100K sockets to improve the metrics of code-data affinity, power consumption, migration cost and resiliency? | If the hardware supports it, OCR will monitor performance and power counters and adapt its scheduling and data placement to better utilize the machine. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
How will the runtime manage resources (compute, memory, power, bandwidth) for 100K sockets to meet a power, energy and performance objective? | OCR will manage resources based on the application's needs and the power budget, turning off or scaling back unneeded resources. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
How does the runtime software itself scale to 100K sockets? Specifically, how does it distribute, monitor and balance itself and how is it resilient to failures? | OCR functionality is distributed hierarchically, following the hardware's natural computation hierarchy if it has one, or imposing an arbitrary hierarchy otherwise. OCR divides cores into "runtime" and "user" cores. For efficiency, "user" cores run only a thin layer of the runtime that manages that specific core. The "runtime" cores manage the "user" cores hierarchically: the "runtime" cores closest to the "user" cores make simple, low-latency scheduling decisions, whereas higher-level cores perform longer-term optimization operations. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
What is the efficiency of the runtime? Specifically, how much impact does the runtime have on a) the total execution time of the application and b) resources taken from algorithmic computations? What are your plans to maximize efficiency? How will runtime overhead scale to 100K sockets? | OCR code runs on cores that are physically separate from those running user code. Our goal is to have enough "runtime" cores that runtime overhead is completely masked by the application code. As machine size increases, more runtime cores will be needed to handle higher-level functions and global optimizations, but their number will grow very slowly. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
Do you support isolation of the runtime code from the user code to avoid violations and contamination? | The majority of the runtime code runs on cores that are physically separate from those running user code. Although we are currently considering a model where all cores can touch data anywhere on the machine, our model will support possible hardware restrictions (for example, user cores being unable to touch data in runtime cores). | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
What specific hardware features do you require for proper or efficient runtime operation (atomics, DMA, F/E bits, etc.)? | OCR requires hardware support for some form of atomic locking. Additional hardware features identified for increased efficiency: 1) remote atomics for cheaper manipulation of far-away memory; 2) heterogeneity, to tailor "user" cores for user code and "runtime" cores for runtime code (no floating-point units, for example); 3) fast runtime core-to-core communication, allowing the runtime to communicate efficiently without impacting user code; 4) asynchronous data movement (DMA engines); 5) hardware monitoring to allow introspection and adaptation; 6) knowledge of the hardware structure (memory costs, available network links, etc.), enabling more efficient scheduling and placement. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |
What is your model for execution of multiple different programs (i.e., a single machine would be doing more than one thing) in terms of division, isolation, containment and protection? | Our programming model splits user code into small event-driven tasks (EDTs). Multiple unrelated EDT sub-graphs can coexist at the same time under the same runtime. While this does not isolate applications, it automatically balances all of them globally. Locality-aware scheduling will also naturally migrate related data and code closer together, thereby physically partitioning the different applications. If a more secure model is required, separate runtimes can each run on a subset of the machine, statically partitioning it among the applications; this is more secure but less flexible. | (DEGAS) | (D-TEC) | (XPRESS) | (DynAX) | (X-TUNE) | (GVR) | (CORVETTE) | (SLEEC) | (PIPER) |