
OS/R Program Description

OS/R Program Motivations

The operating system and runtime system (OS/R) are critical components of the software stack for extreme-scale systems, and fundamental advancements are needed within these components to address several challenges facing applications:

  • Lightweight message and thread management: In order to hide latency and support dynamic programming environments, low-level message handling and lightweight thread activation must be co-optimized. New techniques are needed to handle lightweight and resilient message layers; scalable message-driven thread activation and fine-grained active messages; global address spaces; extremely large thread counts; buffer management, collective operations, and fast parallel reductions; thread scheduling and placement; and improved quality-of-service (QoS) and prioritization.
  • Holistic power management: Extreme-scale systems will manage power and energy as a first-class resource and a crosscutting concern across all layers of software. Novel techniques are needed for whole-system monitoring and dynamic optimization; trading of energy for resilience or time to solution; power-aware scheduling and usage forecasts; goal-based feedback and control strategies; coscheduling; and adaptive power management of storage, computing, and bandwidth.
  • Resilience: Extreme-scale OS/Rs must support scalable mechanisms to predict, detect, inform, and isolate faults at all levels in the system, which makes resilience a crosscutting concern. The OS/R must itself be resilient and must support an array of low-level services to enable resilience in other software components, from the HPC application to the storage system. Innovative concepts to support multilevel, pluggable collection and response services are needed.
  • OS/R architecture: Extreme-scale systems need agile and dynamic node OS/Rs. New designs are needed for the node OS to support heterogeneous multicore, processor-in-memory, and HPC-customized hardware; I/O forwarding; autonomic fault response mechanisms; dynamic goal-oriented performance tuning; QoS management across thread groups, I/O, and messaging; support for fine-grained work tasks; and efficient mechanisms to support coexecution of compute and in situ analysis.
  • Memory: Deep hierarchies, fixed power budgets, in situ analysis, and several levels of solid-state memory will dramatically change memory management, data movement, and caching in extreme-scale OS/Rs. Novel designs are needed for lightweight structures to support allocating, managing, moving, and placing objects in memory; methods to dynamically adapt thread affinity; techniques to manage memory reliability; and mechanisms for sharing and protecting data across colocated, coscheduled processes.
  • Global OS/R: Extreme-scale platforms must be run as whole systems, managing dynamic resources with a global view. New concepts and implementations are needed to enable collective tuning of dynamic groups of interacting resources; scalable infrastructure for collecting, analyzing, and responding to whole-system data such as fault events, power consumption, and performance; reusable and scalable publish/subscribe infrastructures; distributed and resilient RAS (reliability, availability, and serviceability) subsystems; feedback loops for tuning and optimization; and dynamic power management.

OS/R Program Goals