|X-ARCC: Exascale Adaptive Resource Centric Computing with Tesselation|
|Team Members||UC Berkeley, LBNL|
|PI||Steven Hofmeyr (LBNL)|
|Co-PIs||John Kubiatowicz (UCB)|
- 1 Overview
- 2 Tessellation
- 3 Research Plan
- 4 Team Members
We are exploring new approaches to Operating System (OS) design for exascale using Adaptive Resource-Centric Computing (ARCC). The fundamental basis of ARCC is dynamic resource allocation for adaptive assignment of resources to applications, combined with Quality-of-Service (QoS) enforcement to prevent interference between components. We have embodied ARCC in Tessellation, an OS designed for multicore nodes. In this project, our goal is to explore the potential for ARCC to address issues in exascale systems. This will require extending Tessellation with new features for multiple nodes, such as multi-node cell synchronization, distributed resource accounting, and topology-aware resource control. Rather than emphasizing component development for an exascale OS, we are focusing our efforts on high-risk, high-reward topics related to novel OS mechanisms and designs.
There are several aspects we are exploring in the context of a multinode Tessellation:
- What OS support is needed for new global address space programming models and task-based parallel programming models? To explore this, we are porting UPC to run on multinode Tessellation, using GASNet as the underlying communication layer. Our test-cases for this runtime on Tessellation are a subset of the Co-design proxy apps, which we use as representatives of potential exascale applications.
- How should the OS support advanced memory management, including mechanisms for user-level paging, locality-aware memory allocation and multicell shared memory?
- How do we extend hierarchical adaptive resource allocation and control across multiple nodes?
- How should the OS manage the trade-off between power and performance optimizations? Will the Tessellation approach of treating both power and other resources (cores, memory) as first class citizens within the adaptive loop be adequate?
- What OS abstractions are needed for the durable QoS-guaranteed storage that is essential to resilience?
The Tessellation kernel is a lightweight, hypervisor-like layer that provides support for ARCC. It implements cells, along with interfaces for user-level scheduling, resource adaptation and cell composition. Since the software in cells runs entirely in user space, the kernel can enforce resource allocations, e.g. CPU cores, memory pages, without specialized virtualization hardware, but enforcing resource allocations to some resources, such as processor cache slices and memory bandwidth, requires additional hardware support. Tessellation is written from scratch and the prototype runs on both Intel x86 and RAMP architectures.
The Cell Model
Cells provide the basic unit of computation and protection in Tessellation. Cells are performance-isolated resource containers that export their resources to user level. The software running within each cell has full user-level control of the resources assigned to the cell, free from interference from other cells. Application programmers can customize the cell's runtime for their application domains with, for instance, a particular CPU core-scheduling algorithm, or a novel page replacement policy.
The performance isolation of cells is achieved through space-time partitioning, a multiplexing technique that divides the hardware into a sequence of simultaneously resident spatial partitions. Cells can either have temporally dedicated access to their resources, or be time-multiplexed with other cells, and, depending on the spatial partitioning, both time-multiplexed and non-multiplexed cells may be active simultaneously.
Time multiplexing is implemented using gang-scheduling, to ensure that cells provide their hosted applications an environment that is similar to a dedicated machine. The kernel implements gang-scheduling in a decentralized manner through a set of kernel multiplexer threads muxers, one per hardware thread in the system. The muxers implement the same scheduling algorithm and rely on a high-precision global time-base to simultaneously activate a cell on multiple hardware threads with minimum skew. In the common case the muxers do not need to communicate since they replicate the scheduling decisions of all relevant other muxers; however, the muxers will communicate via IPI multicast in certain cases, e.g., when cells are created or terminated, when resource allocations change, etc.
Applications in Tessellation that span multiple cells communicate via efficient and secure channels. Channels provide fast, user-level asynchronous message-passing between cells. Applications use channels to access standard OS services (e.g. network and file services) hosted in other cells. New composite services are constructed from OS services by wrapping a cell around existing resources and exporting a service interface. Tessellation can support QoS in this service-oriented architecture because the stable environment of the cell is easily combined with a custom user-level scheduler to provide QoS-aware access to the resulting service. With such QoS-aware access providing reproducible service times, applications in cells experience better performance predictability for autotuning, but without sacrificing flexibility in job placement for optimized system usage.
Customized Scheduling: OS Support for Runtime Systems
A major benefit of two-level scheduling is the ability to support different resource-management policies simultaneously. In Tessellation, cells provide their own, possibly highly-customized, user-level runtime systems for processor (thread) scheduling and memory management. Furthermore, each cell's runtime can control the delivery of events, such as timer and device interrupts, inter-cell message notifications, exceptions, and memory faults. Consequently, Tessellation can support a diverse set of runtimes without requiring kernel modifications or users to have root access. For example, for a complex scientific application with in-situ visualization, one cell can run a rendering or decoding component with a real time scheduler, while another runs a throughput oriented compute job with simple run-to-completion scheduling and one thread per core. The ability to support multiple different schedulers simultaneously bypasses the issues that arise when trying to develop a "one size fits all" complex global scheduler (e.g., see the CFS vs BFS debate in the Linux community).
The current Tessellation prototype includes two user-level thread scheduling frameworks: a preemptive one called Pulse (Preemptive User-Level SchEduling), and a cooperative one based on Lithe (LIquid THrEads). With either framework, a cell starts when a single entry point, enter(), is executed simultaneously on each core. After that, the kernel interferes with the cell's runtime only when: 1) the runtime receives events (e.g. interrupts) it has registered for, 2) the cell is suspended and reactivated according to its time-multiplexing policy (which can be time-triggered, event-triggered, or dedicated), or 3) the resources (e.g. hardware threads) assigned to the cell change. For instance, when an interrupt occurs during user-level code execution, the kernel saves the thread context and calls a registered interrupt handler, passing the saved context to the cell's user-level runtime. The cell's runtime can then choose whether to restore the previously running context or swap to a new one.
It is relatively easy to build new runtime scheduling frameworks for Tessellation. The Pulse framework, for example, comprises less than 800 lines of code (LOC); it can serve as a starting point for creating new user-level preemptive schedulers. A new scheduler based on Pulse needs to implement only four callbacks: enter(), mentioned earlier; tick(context), which is called whenever a timer tick occurs and receives the context of the interrupted thread; yield(), called when a thread yields; and done(), called when a thread terminates. The Pulse framework provides APIs to save and restore contexts, and other relevant operations. Since Tessellation's schedulers run at user-level, kernel patches are not required.
Pulse and Tessellation make it easy to implement custom schedulers. For example, we implemented a global round-robin scheduler with mutex and conditional-variable support in about 850 LOC. We also wrote a global earliest deadline first (EDF) scheduler with mutex support and priority-inversion control via dynamic deadline modification in less than 1000 LOC. By contrast, support for EDF in Linux requires kernel modifications and substantially more code: the best-known EDF kernel patch for Linux, SCHED_DEADLINE, has over 3500 LOC in over 50 files. Furthermore, any bugs in SCHED_DEADLINE could cause the kernel to crash and bring down the whole system, whereas bugs in a userspace scheduler in Tessellation will only crash the user-space application using that scheduler.
Adaptive Resource Allocation
Tessellation adapts resource allocations to provide applications with QoS while maximizing efficiency in the system. Resources are allocated and reallocated using a feedback control loop, as shown in the figure. The allocation decisions are made by the Resource Allocation Broker (RAB), a broker service which runs in user-space. The RAB uses system-wide goals, resource constraints, performance targets and performance measurements as inputs to an optimizer that attempts to simultaneously satisfy multiple application requirements and competing system-wide goals such as energy efficiency and throughput. The allocation decisions are then communicated to the kernel and system services for enforcement.
The current implementation of the RAB provides a resource-allocation framework that supports rapid development and testing of new optimizers. Two different optimizers that we have experimented with are POTA and Pacora.
Cells provide a convenient abstraction for building OS services with QoS guarantees. Such services can reside in dedicated cells with exclusive control of a device to encapsulate user-level device drivers. We call these cells service cells. In keeping with ARCC, we treat services offered by service cells as additional resources managed by the adaptive resource allocation architecture. A service cell arbitrates access to its devices, leveraging the cell's performance isolation and customizable schedulers to offer QoS guarantees to other cells. By encapsulating system services we not only improve performance predictability, but also enhance tolerance of failures in device drivers and mitigate the impact of OS noise.
Each service in Tessellation comes with a library to facilitate the development of client applications. The client libraries offer friendly, high-level application programming interfaces (APIs) to manage connections and interact with the services (i.e. they hide most of the details of inter-cell channel communication). Those libraries also allow applications to request the QoS guarantees they need from services.
Currently, there are two services implemented for Tessellation that offer QoS guarantees: a Network Service, which provides access to network adapters and guarantees that the data flows are processed with the agreed levels of throughput, and a GUI Service, which provides a windowing system with response-time guarantees for visual applications.
Our main focus is using Tessellation to investigate OS support for new programming models and new use cases and application structures for supercomputers. We believe that Tessellation is an ideal platform to experiment with the mechanisms required to support the sophisticated applications described in the DOE Exascale OS/R report:
As machines grow larger, applications are not just adding more grid points but are adding new modules for improving fidelity with new physics and chemistry. Rather than a monolithic source code tree, these sophisticated applications are built from multiple libraries, programming models, and execution models. Future operating and runtime systems must support the composition of several types of software components, from legacy math libraries written in MPI to new dynamic task-graph execution engines running simultaneously with a data reduction component.
In Tessellation, we envisage these sophisticated applications as spanning multiple cells, with different cells for different library components, each with their own scheduling models. Furthermore, support for multiple active cells per node enables the colocation of components to minimize data movement and avoid communication. However, Tessellation is currently a single node OS and these applications will need to scale to large numbers of nodes. Hence, in this research, our main goal is to determine how adaptive resource control with QoS guarantees can be scaled across nodes.
Multinode Synchronized Cells
Cells in Tessellation are currently a purely local construct, contained within a single node. Although there are many ways of extending the cell model across nodes, we will keep cells as fundamentally a local construct, and provide a way to synchronize cells across nodes. In particular, for temporal multiplexing, we need a way to be able to synchronize activations to enable inter-node gang-scheduling. The muxer algorithm is decentralized, and communication-free in the common case, and thus naturally lends itself to a multinode environment. The muxer algorithm relies on synchronized global clocks, which should be achievable on a relatively large scale, e.g., Jones and Koenig  describe an algorithm that takes only 2.3 microseconds to synchronize 16,000 12-core nodes. Our goal is to extend the muxer algorithm to function across nodes; we will determine what mechanisms need to be implemented to enable this, and how OS components should communicate and interface across a network.
It is frequently possible to provide the same portion of CPU node capacity to a cell through either temporal multiplexing or spatial partitioning, so it is not clear a priori that multinode temporal multiplexing is really needed. However, we wish to add support for multinode temporal multiplexing because we believe that the temporal dimension enhances flexibility. This could be especially useful when solving complex problems of multidimensional resource allocation. It could also be useful for improving performance in certain types of applications, e.g., colocating modeling and analysis threads on the same cores so that cached data is shared, while still having separate, customized runtimes for each component. This kind of high-risk, high-reward research has the potential to open up new, more productive ways of using large-scale computers; we will not see fundamental advances with only incremental improvements to existing approaches to OSes.
Support for New Programming Models and Multicomponent Applications
Multinode synchronized cells will provide the environment necessary for exploring the OS support needed for advanced programming models and sophisticated, multicomponent applications. To enable multinode, large scale applications, we propose first porting GASNet as a communication layer, and then porting UPC for PGAS models. We will focus on these frameworks because of time and budget constraints, and because of our close relationship with the UPC and DEGAS projects. . Of particular interest is what OS-specific support and mechanisms may be needed for frameworks such as GASNet and UPC. This aspect of our research will allow us to explore one of the issues raised in the DOE Exascale OS/R report, namely:
Operating systems traditionally have mediated and controlled access to all resources in the system. However, a variety of new programming models and runtimes will have significantly different requirements for resource management, which will likely result in inefficiencies if the OS has to directly manage all resources. In the future, the application and runtime will need increased control over resources such as cores, memory, and power and user-space handling of control requests and events.
We will use the newly ported runtimes to explore application designs relevant to exascale, through the Co-design proxy apps. Several of these will be easy to port to Tessellation, since we already have UPC versions of them, e.g., LULESH from ExMatEx, Multigrid from Exact and XSBench from Cesar. However, to really explore the potential benefits of the ARCC approach, we will experiment with OS support for more sophisticated applications, in addition to application kernels. For this purpose, we will port two of the more complex Co-design Proxy Apps: CIAN (from the Cesar project), which has multiple components, including compute-intensive modeling, data-intensive analytics, visualization and storage; and CoMD (from the ExMatEx project), which includes in-situ visualization. Not only are we interested in how we can support the programming models used in these applications, but how we can tie the visualization, data analytics and storage components into the Tessellation service framework, including the GUI and Network services, and our proposed storage services.
Advanced Memory Management
Advanced memory management is a key requirement for an exascale OS, as described in the DOE Exascale OS/R report:
The OS will need to provide more support for runtime and application management of memory (possibly moving traditional OS services, like updating translation tables to runtime systems).
In Tessellation, our goal is to give cells as much control over the memory they own as possible. Currently, Tessellation does not support advanced memory management features, such as paging or locality-aware allocation. We intend to develop a user-level paging facility that a cell's runtime can manage with customized page replacement and locality-aware allocation policies. Customized memory management can deal with subtle problems that arise in monolithic memory management, e.g., performance of the LULESH proxy app suffers on Linux because of the way the kernel zero fills pages before allocating them. Apart from customization, a key goal of advanced memory management is to minimize uncontrolled interference between cells and enable better performance predictability and stronger QoS guarantees.
For user-level memory management, the approach we propose to adopt is similar to self-paging in the Nemesis OS, where all the kernel does is dispatch fault notifications, and all paging operations are performed at the application (cell) level. In general, the approach can be implemented entirely in software, e.g., through para-virtualized interfaces to the page tables, as used in Xen, but, where available, we will make use of hardware support such as nested page tables (AMD) and extended page tables (Intel).
Adaptive Resource Allocation on Multinode Systems
The framework for adaptive resource control that is currently implemented in the kernel and the Resource Allocation Broker (RAB) is flexible, supporting pluggable policies. We intend to retain this flexibility for the multinode version of Tessellation, and intend to explore various methods to extend the adaptation framework across nodes. In particular, we will to develop hierarchical resource allocation, with at least two levels: an intra-node level, and an an inter-node level that is constructed through the composition of intra-node resource policies. Although exascale systems will likely need more than two levels to account for additional layers of physical clustering of resources, we believe that two levels is already a challenge, and will highlight many of the relevant issues.
It is beyond the scope of this project to investigate the various adaptation policies. Rather, our focus is on providing the OS mechanisms and interfaces needed for a flexible framework that can be used to explore a variety of policies in future, e.g., determining how best to allocate multinode resources to minimize data movement and avoid communication. Included in those mechanisms will be communication and coordination channels between the inter- and intra-node levels of the adaptation framework, support for inter-node resource accounting, and support for topology-awareness.
Power-awareness and control of power as a resource will be another dimension for the RAB optimizer to incorporate when making resource allocation decisions. We do not intend to design new power-balancing policies or experiment extensively with power-aware resource allocation algorithms, but rather to ensure that Tessellation provides the necessary mechanisms. We will develop the mechanisms needed to enable Tessellation to accurately attribute power consumption to each cell, and control the allocation of the "power" resource by using DVFS controls or processor sleep states.
Although control mechanisms are likely straightforward to implement, accurate resource attribution will be tricky. We will begin by investigating systems such as the Intel Sandybridge, which provides support for measuring energy-usage per socket and DRAM energy usage, via the RAPL interface. Accessing the RAPL interface is done via model-specific registers that have to be accessed in ring-0. We propose to export these interfaces to userspace, specifically, to the RAB so that it can use power to make allocation decisions. Unfortunately, this power information is not sufficiently fine-grained to attribute it to cells that span sockets. To correct this, we will explore various algorithms for deriving the power usage at the cell level. This can be as simple as proportionally dividing up power according to CPU usage, or involve more sophisticated approaches based on performance counter measurements. Which approach is the best is an open research question that we intend to investigate.
System Services for Resilience
A complete exascale OS will need many components to support resilience. Since many of these components, fault detection for example, may be implemented in the runtime, we propose to focus our investigation on those performance-critical components necessary for effective fault recovery that also benefit from OS-level optimization. Specifically, we intend to investigate OS abstractions for durable QoS-guaranteed storage, which is an essential requirement for a variety of resilience techniques such as traditional checkpoint/restart, faster log-based recovery, and more recent approaches using Containment Domains. Although our focus is on storage for resilience, according to the DOE Exascale OS/R report, QoS-aware I/O services will also be necessary for exascale tools:
Another potential overlap in the requirements of tools and applications involves quality of service concerns for shared resources. Tools need QoS guarantees to ensure that they do not too severely impact application execution. A particular area of concern is that tools can have extensive I/O requirements.
To date we have developed two QoS-aware system services: a network service and a GUI service. We intend to implement a framework that will make it easy to develop QoS-aware services, which we will use for developing two additional services: a Block Service and an Object Store Service:
- The Block Service will provide block-level operations with additional QoS guarantees to applications or other services. It is an essential basis for advanced memory management, providing swap pager support to individual cells.
- The Object Store Service (OSS) will provide persistent storage through a simple interface that will enable applications to put (and retrieve) variable sized objects in a flat namespace. QoS guarantees from the OSS will include such parameters as guaranteed number of cached pages or read and/or commit bandwidth. This service will extend the traditional on-disk inode structure into a generic persistent local store available to clients (applications or services). Although a POSIX-like Filesystem API will be available for the Object Store Service, clients will also be able to manipulate stored objects outside of the POSIX APIs, to reduce the overhead associated with maintaining POSIX-required metadata.
The implementations of these services are sufficient to investigate a wide variety of optimized storage abstractions for resilience. Although we hope to look at more general storage and filesystem issues, we want to concentrate our efforts on well-defined resilience requirements. We feel this line of research is important, since resilient application execution, whether checkpointing, logging, or replay-based, requires access to durable storage during the application's critical path; providing QoS to applications with resilience means not just providing guaranteed access to CPU, cache, memory and network, but also to the I/O subsystem.