OS/R Program Description: Difference between revisions
From Modelado Foundation
imported>Rbbrigh (Created page with "== OS/R Program Motivations == The operating system and runtime system are critical components of the software stack for extreme-scale systems and fundamental advancements ar...") |
imported>Rbbrigh No edit summary |
Revision as of 21:22, February 18, 2014
OS/R Program Motivations
The operating system and runtime system are critical components of the software stack for extreme-scale systems, and fundamental advancements are needed within these components to address several challenges facing applications:
- Lightweight message and thread management: To hide latency and support dynamic programming environments, low-level message handling and lightweight thread activation must be co-optimized. New techniques are needed to handle lightweight and resilient message layers; scalable message-driven thread activation and fine-grained active messages; global address spaces; extremely large thread counts; buffer management, collective operations, and fast parallel reductions; thread scheduling and placement; and improved quality-of-service (QoS) and prioritization.
- Holistic power management: Extreme-scale systems will manage power and energy as a first-class resource and as a crosscutting concern across all layers of software. Novel techniques are needed for whole-system monitoring and dynamic optimization; trading of energy for resilience or time to solution; power-aware scheduling and usage forecasts; goal-based feedback and control strategies; coscheduling; and adaptive power management of storage, computing, and bandwidth.
- Resilience: Extreme-scale OS/Rs must support scalable mechanisms to predict, detect, inform, and isolate faults at all levels in the system. Therefore, resilience is a crosscutting concern. The OS/R must be resilient and support an array of low-level services to enable resilience in other software components, from the HPC application to the storage system. Innovative concepts to support multilevel, pluggable collection and response services are needed.
- OS/R architecture: Extreme-scale systems need agile and dynamic node OS/Rs. New designs are needed for the node OS to support heterogeneous multicore, processor-in-memory, and HPC-customized hardware; I/O forwarding; autonomic fault response mechanisms; dynamic goal-oriented performance tuning; QoS management across thread groups, I/O, and messaging; support for fine-grained work tasks; and efficient mechanisms to support coexecution of compute and in situ analysis.
- Memory: Deep hierarchies, fixed power budgets, in situ analysis, and several levels of solid-state memory will dramatically change memory management, data movement, and caching in extreme-scale OS/Rs. Novel designs are needed for lightweight structures to support allocating, managing, moving, and placing objects in memory; methods to dynamically adapt thread affinity; techniques to manage memory reliability; and mechanisms for sharing and protecting data across colocated, coscheduled processes.
- Global OS/R: Extreme-scale platforms must be run as whole systems, managing dynamic resources with a global view. New concepts and implementations are needed to enable collective tuning of dynamic groups of interacting resources; scalable infrastructure for collecting, analyzing, and responding to whole-system data such as fault events, power consumption, and performance; reusable and scalable publish/subscribe infrastructures; distributed and resilient RAS (reliability, availability, and serviceability) subsystems; feedback loops for tuning and optimization; and dynamic power management.