OCR Module General organization
From Modelado Foundation
OCR General Philosophy
OCR is a loaded term that refers to either:
- The user-level API that a ninja programmer or higher-level tools use to create tasks, data-blocks, events, etc.
- The runtime framework itself
In this document, we describe the philosophy behind the runtime framework itself. The user-level API is described in OCR User Documentation.
Goals of the OCR Runtime Framework
The goal of OCR is to provide a framework for research into runtimes implementing the fine-grained event-driven task paradigm. OCR should be able to integrate research into:
- User-level APIs
- Runtime heuristics: the "brains" of the runtime involving scheduling (code placement), data placement, allocation policies, etc.
- Runtime organization: how the responsibilities of the runtime are distributed across the system
- Hardware: what the underlying hardware looks like at a high level
Therefore, OCR is really a runtime framework which can be configured to instantiate a specific runtime implementation. Each implementation corresponds to a specific data-point in the space defined above (heuristics, runtime organization, hardware). We also want to balance this need for exploration with good performance: the runtime implementations using this framework should be competitive with other runtimes today.
TODO: ADD LOTS OF FIGURES
Basic concepts of the OCR framework
This section describes a few of the key concepts of the OCR runtime framework.
General organization
The runtime's functionality is split into modules. Each module defines an API exposed to the other modules, and the specific implementation of a given module can be chosen either at compile time or at launch time. For example, the runtime defines an allocator module that implements memory allocation; at launch time, the specific implementation of the allocator can be chosen.
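As a minimal sketch of this idea (not OCR's actual allocator interface), a module API can be expressed in C as a table of function pointers, with the implementation picked by name at launch time. The names allocator_t, allocator_select, "malloc" and "zeroed" are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical module API: every allocator implementation fills in this table. */
typedef struct allocator {
    const char *name;
    void *(*allocate)(struct allocator *self, size_t size);
    void  (*release)(struct allocator *self, void *ptr);
    size_t live_blocks;   /* number of blocks handed out and not yet released */
} allocator_t;

/* Implementation 1: thin wrapper around malloc. */
static void *malloc_allocate(allocator_t *self, size_t size) {
    void *p = malloc(size);
    if (p != NULL) self->live_blocks++;
    return p;
}

/* Implementation 2: same API, but returns zero-filled memory. */
static void *zeroed_allocate(allocator_t *self, size_t size) {
    void *p = calloc(1, size);
    if (p != NULL) self->live_blocks++;
    return p;
}

/* Both sample implementations share the same release routine. */
static void counted_release(allocator_t *self, void *ptr) {
    if (ptr != NULL) {
        free(ptr);
        self->live_blocks--;
    }
}

/* Launch-time selection by name (e.g. read from a configuration file).
 * Returns 0 on success, -1 if the name is unknown (out is left untouched). */
static int allocator_select(allocator_t *out, const char *impl_name) {
    assert(out != NULL && impl_name != NULL);
    if (strcmp(impl_name, "malloc") == 0) {
        out->allocate = malloc_allocate;
    } else if (strcmp(impl_name, "zeroed") == 0) {
        out->allocate = zeroed_allocate;
    } else {
        return -1;
    }
    out->name = impl_name;
    out->release = counted_release;
    out->live_blocks = 0;
    return 0;
}
```

The rest of the runtime would only ever call a.allocate(&a, size) and a.release(&a, ptr), so swapping implementations requires no change in the callers.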
The modules can be broken up into a few categories:
- Modules representing user-level constructs such as EDTs, data-blocks and events. These modules implement how EDTs, DBs and events are handled by the rest of the runtime. For example, a DB module would specify how to acquire and release data-blocks.
- Modules representing the algorithmic part of the runtime. These include modules representing allocators, schedulers, workers, introspection frameworks, etc.
- Modules representing the machine that OCR is pretending to operate on; we refer to this as the target layer. At this level, we represent computation resources and memory resources.
- Modules representing the physical machine that OCR is actually running on; we refer to this as the platform layer. This layer is distinct from the target layer to allow us to, for example, emulate a SPAD and CE/XE architecture on a regular x86 machine by representing SPAD and CE/XE resources at the target level but using malloc and pthread functionality at the platform level.
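The split between the target layer and the platform layer can be sketched as follows. This is an illustrative assumption, not OCR code: a hypothetical mem_target_t plays the role of a small scratchpad with a fixed capacity, while a hypothetical mem_platform_t supplies the backing memory from the host (here, malloc and free).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Platform layer: what the host machine really provides (here, the C heap). */
typedef struct mem_platform {
    void *(*get_memory)(size_t size);
    void  (*put_memory)(void *base);
} mem_platform_t;

static void *host_get_memory(size_t size) { return malloc(size); }
static void  host_put_memory(void *base)  { free(base); }

static const mem_platform_t HOST_MALLOC_PLATFORM = {
    host_get_memory, host_put_memory
};

/* Target layer: the memory the runtime pretends to have, e.g. a scratchpad. */
typedef struct mem_target {
    const mem_platform_t *platform;
    unsigned char *base;
    size_t capacity;   /* emulated size of the scratchpad, in bytes */
    size_t used;
} mem_target_t;

/* Returns 0 on success, -1 if the platform could not supply the memory. */
static int mem_target_init(mem_target_t *t, const mem_platform_t *p,
                           size_t capacity) {
    assert(t != NULL && p != NULL && capacity > 0);
    t->platform = p;
    t->base = p->get_memory(capacity);
    t->capacity = capacity;
    t->used = 0;
    return t->base != NULL ? 0 : -1;
}

/* Bump allocation inside the emulated scratchpad. It fails once the emulated
 * capacity is exhausted, even though the host has plenty of memory left.
 * Alignment is ignored to keep the sketch short. */
static void *mem_target_carve(mem_target_t *t, size_t size) {
    if (size > t->capacity - t->used) return NULL;
    void *p = t->base + t->used;
    t->used += size;
    return p;
}

static void mem_target_destroy(mem_target_t *t) {
    t->platform->put_memory(t->base);
    t->base = NULL;
    t->capacity = 0;
    t->used = 0;
}
```

Code written against mem_target_t sees the same limits whether the platform underneath is the host heap or real scratchpad hardware; only the mem_platform_t table changes.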
A special module called the policy domain orchestrates the connections between the various modules in the following manner:
- When the runtime needs to service a request (either one made directly by user code, such as the creation of a data-block, or one internal to the runtime, such as a request for additional work to execute), the worker executing the code sends a policy message to its policy domain requesting that the service be performed. Each worker is associated with a single policy domain.
- The local policy domain (the initial recipient of the message) will determine if it can process the message locally or not. This involves:
  - Determining if it has access to the object(s) needed to process the service
  - Determining whether it has permissions/rights to process the service
- If the local policy domain can process the message, it will do so and send back any response to the requester. The processing of the message frequently consists of calling an API on one of the other modules (for example invoking the 'allocate' method on an allocator object).
- If the local policy domain cannot process the message, it will send it to another policy domain for processing.
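The steps above can be sketched in C as follows. All names (policy_msg_t, policy_domain_t, pd_process_message, the may_create_db flag) are hypothetical stand-ins rather than OCR's real types, and the sketch assumes that neighboring policy domains form a simple acyclic chain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MSG_DB_CREATE, MSG_WORK_REQUEST } msg_type_t;

/* Hypothetical policy message. */
typedef struct policy_msg {
    msg_type_t type;
    int object_home;  /* id of the policy domain holding the needed object */
    int handled_by;   /* id of the domain that processed it; -1 until then */
} policy_msg_t;

/* Hypothetical policy domain, reduced to what the decision needs. */
typedef struct policy_domain {
    int id;
    bool may_create_db;              /* stand-in for permissions/rights */
    struct policy_domain *neighbor;  /* where to forward, NULL if none */
} policy_domain_t;

/* Check 1: does this domain have access to the object(s) needed? */
static bool pd_has_access(const policy_domain_t *pd, const policy_msg_t *msg) {
    return msg->object_home == pd->id;
}

/* Check 2: does this domain have the right to perform the service? */
static bool pd_has_rights(const policy_domain_t *pd, const policy_msg_t *msg) {
    return msg->type != MSG_DB_CREATE || pd->may_create_db;
}

/* Process locally when possible, otherwise pass the message along the chain.
 * Returns 0 if some domain processed the message, -1 if none could. */
static int pd_process_message(policy_domain_t *local, policy_msg_t *msg) {
    assert(local != NULL && msg != NULL);
    for (policy_domain_t *pd = local; pd != NULL; pd = pd->neighbor) {
        if (pd_has_access(pd, msg) && pd_has_rights(pd, msg)) {
            /* A real runtime would call another module's API here,
             * for example the 'allocate' method of an allocator. */
            msg->handled_by = pd->id;
            return 0;
        }
    }
    return -1;
}
```

In this sketch forwarding is a direct function call; in a real configuration the hand-off to another policy domain may cross an address-space boundary.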
Types of modules
The runtime is composed of two types of modules (excluding the policy domain):
- Modules whose instances are created once at launch time and are permanently associated with a given policy domain. All categories of modules presented above are of this type except for the user-level construct modules.
- Modules whose instances are created dynamically at runtime and can "move" around. The user-level construct modules are of this type.
Policy domain
The policy domain is the main arbiter in OCR and is the mechanism by which the functionality of the runtime is split across resources. Conceptually, a policy domain is composed of:
- A set of resources (compute and memory) that it "manages". Specifically, a policy domain will refer to:
  - workers which in turn refer to compute targets
  - allocators which in turn refer to memory targets
- A set of "smarts" such as schedulers. These modules are responsible for implementing the portion of the runtime that is to run on the resources contained within the policy domain
- A set of neighboring policy domains with which it can communicate
We assume that all resources managed by a policy domain are in a single address space. No other assumption is made about a policy domain, which makes it a very flexible mechanism for representing a hierarchical runtime. For example, on FSim-like architectures, we have one policy domain per XE and CE. Each of these policy domains contains a single worker representing the work loop of the XE or CE. On the other hand, on x86, a single policy domain can include multiple workers (each running on a separate core, for example), and our current x86 implementation has only a single policy domain.
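The composition described above can be pictured with a small C data structure. This is a hypothetical layout for illustration only: allocators and schedulers are omitted for brevity, and the fixed array sizes, type names and helper functions are assumptions, not OCR definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS   8
#define MAX_NEIGHBORS 4

/* Hypothetical compute target (for example a core, an XE or a CE). */
typedef struct compute_target { int hw_id; } compute_target_t;

/* A worker refers to the compute target it runs on. */
typedef struct worker { const compute_target_t *target; } worker_t;

/* Hypothetical policy domain: managed workers plus neighboring domains. */
typedef struct policy_domain {
    worker_t workers[MAX_WORKERS];
    size_t   worker_count;
    struct policy_domain *neighbors[MAX_NEIGHBORS];
    size_t   neighbor_count;
} policy_domain_t;

/* Returns 0 on success, -1 if the domain cannot hold more workers. */
static int pd_add_worker(policy_domain_t *pd, const compute_target_t *target) {
    assert(pd != NULL && target != NULL);
    if (pd->worker_count == MAX_WORKERS) return -1;
    pd->workers[pd->worker_count++].target = target;
    return 0;
}

/* Makes two domains neighbors of each other. Returns 0 on success, -1 if
 * either neighbor table is full. */
static int pd_connect(policy_domain_t *a, policy_domain_t *b) {
    assert(a != NULL && b != NULL && a != b);
    if (a->neighbor_count == MAX_NEIGHBORS ||
        b->neighbor_count == MAX_NEIGHBORS) return -1;
    a->neighbors[a->neighbor_count++] = b;
    b->neighbors[b->neighbor_count++] = a;
    return 0;
}
```

With these helpers, an x86-style configuration is one policy domain with several workers and no neighbors, while an FSim-style configuration is many single-worker policy domains connected to each other.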
Example runtime call
The figure on the right explains how a runtime call is executed by the runtime. Two scenarios are shown:
- The scenario along the path 1, 2, 3 shows how a runtime call can be handled by the local policy-domain;
- The scenario along the path 1, 4, 5, 6 shows how a runtime call can be handled by offloading its handling to another policy-domain.
Context
As previously noted, policy domains are OCR's way to segregate the responsibilities of the runtime. This may be done for two main reasons:
- The execution target has a non-global address space. Since policy domains cannot span multiple address spaces, the runtime has to be divided, with a separate sub-runtime managing each address space;
- Large-scale machines cannot be managed without some form of hierarchy; policy domains are a way to split the responsibilities of the runtime among various components. One can imagine policy domains "close" to the computing resources being concerned only with what to execute next on those computing resources, while higher-level policy domains could concern themselves with more global load-balancing.
The figure shows two policy domains (to present a simple case). Each policy domain manages a single compute resource.
In OCR, we adopted an object-oriented paradigm where each object is represented by a bit of metadata which includes:
- Methods to operate on the object (basically the API of the object);
- Shared or private data for the object. In this context, shared data means that the data can be concurrently accessed by multiple workers/threads/processes, whereas private data can be accessed in a thread-unsafe manner because it is guaranteed that no concurrent access will occur. Another way to view private data is as "owned" data, where a single worker/thread/process owns the data.
Furthermore, there are two types of objects in OCR: those that are created when the runtime is brought up (represented in green) and those that are created as the program is running (represented in red). The green objects are associated with a single policy domain (they basically "form" the policy domain); these include allocators, schedulers, factories that will be used to create EDTs, data-blocks and events, etc. The red objects are the user-level objects: primarily EDTs, data-blocks and events. These objects will move around as the application progresses and may be either "owned" or shared depending on the implementation of their methods. That is, the runtime programmer chooses between implementing thread-safe methods, and therefore paying the cost of synchronization across different policy domains, and assigning an "owner" PD (for example) and delegating all operations on that object to that owner.
Explanation
In the figure, the code running in the compute resource on the left side makes a runtime call (for example, creating an EDT). Depending on the policy, there are two ways to handle this. In both cases, the call goes to the local policy-domain (the policy domain that "owns" the computing resource). The call can then be handled either locally or remotely.
Local handling
When the call is handled locally, the computing resource that made the call executes the entire handling. This is, in effect, a regular C function call. The policy domain looks up the appropriate function (method) and calls it. This function can then touch either owned data, without the use of any additional lock, or "shared"/"non-owned" data, using a thread-safe algorithm (locks, atomics, etc.). This is entirely up to the implementer of the module called.
Once the request is handled, the runtime call returns.
Remote handling
In some cases, the local policy domain cannot handle the request. This can either be because it does not physically have access to the metadata needed to handle the request (different address space) or because the target module indicates that the code needs to be executed by the "owner" of the metadata and the local policy domain is not the owner. In this case, the runtime call becomes akin to a remote procedure call where the call will be sent to another policy domain to be executed by its computing resource. In the figure, the second computing resource (on the right) will, at some point, execute the request of the left computing resource.
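The difference between the two paths can be sketched for the "owner" case as follows. This is a single-threaded illustration under stated assumptions: the counter_object_t object, the fixed-size inbox and the function names are hypothetical, and a real runtime would move the request through a communication resource or a synchronized queue instead of writing directly into the owner's inbox.

```c
#include <assert.h>
#include <stddef.h>

#define INBOX_CAPACITY 4

/* Hypothetical runtime object with "owned" data. */
typedef struct counter_object {
    int owner_pd;  /* id of the policy domain that owns the metadata */
    int value;     /* only the owner touches this, so no lock is needed */
} counter_object_t;

typedef struct request {
    counter_object_t *target;
    int amount;
} request_t;

typedef struct policy_domain {
    int id;
    request_t inbox[INBOX_CAPACITY];  /* requests waiting for this domain */
    size_t inbox_len;
} policy_domain_t;

typedef enum { HANDLED_LOCALLY, SENT_TO_OWNER, REJECTED } dispatch_result_t;

/* Runtime call made by a compute resource of 'local'. 'owner' must be the
 * policy domain that owns 'obj' (it may be the same domain as 'local'). */
static dispatch_result_t pd_increment(policy_domain_t *local,
                                      policy_domain_t *owner,
                                      counter_object_t *obj, int amount) {
    assert(local != NULL && owner != NULL && obj != NULL);
    assert(owner->id == obj->owner_pd);
    if (obj->owner_pd == local->id) {
        obj->value += amount;          /* local path: a plain function call */
        return HANDLED_LOCALLY;
    }
    if (owner->inbox_len == INBOX_CAPACITY) return REJECTED;
    owner->inbox[owner->inbox_len++] = (request_t){ obj, amount };
    return SENT_TO_OWNER;              /* remote path: akin to an RPC */
}

/* Run by the owner's compute resource "at some point".
 * Returns the number of requests it executed. */
static size_t pd_drain_inbox(policy_domain_t *pd) {
    size_t handled = pd->inbox_len;
    for (size_t i = 0; i < handled; i++) {
        pd->inbox[i].target->value += pd->inbox[i].amount;
    }
    pd->inbox_len = 0;
    return handled;
}
```

Note that after a remote dispatch the object is unchanged until the owner drains its inbox, which mirrors the figure: the right-hand computing resource executes the left-hand resource's request later.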
Hardware abstraction
At a very high level, OCR has three types of resources:
- Compute resources
- Memory resources
- Communication resources (these are slightly different from the first two and are used to communicate "remotely", for some definition of remotely; this is described in more detail in the Policy domain section)
For the first two types of resources, however, OCR provides the ability to masquerade as a different type of resource. In other words, OCR can act as a basic emulator, pretending, for example, to be running on the Traleika Glacier architecture while actually running on an x86 machine. The structure of OCR allows, for example, a scheduling algorithm targeting an FSim-like architecture to be developed on an x86 platform and then run, with no modifications, on the actual FSim platform. This enables rapid prototyping of algorithmic features as well as hardware features (since we can change things like the number of cores, the sizes of memories, etc.).