Concrete Ingredients for Flexible Programming Abstractions on Exascale Systems
Pacific Northwest National Laboratory
Exascale systems are expected to incur variations in their execution environment due to architectural heterogeneity, variable data access latencies, system noise, and selective fault recovery. Difficulties in programming such systems have led to a renewed interest in abstractions for finer-grained concurrency. These abstractions allow the programmer to express the essential characteristics of an application (concurrency, synchronization, locality, etc.), which the software stack then uses to map the computation to the target platform and react to events without onerous programmer involvement. While these abstractions are elegant and inherently adaptive because they require fewer synchronizations, realizing them effectively requires fundamental advances in techniques for automated management of concurrency, data movement, and resilience. We propose to transform exascale programming models and runtime systems for scientific applications through the design and characterization of algorithms that automate concurrency, data movement, and resilience management. These algorithms will target the key features shared among several candidate exascale programming abstractions: finer-grained concurrency, irregular data structures, and flexible data movement and synchronization semantics. This concerted effort exploits the shared characteristics of distinct abstractions to develop an interwoven suite of algorithms that build on each other. The suite will support the design of flexible programming abstractions for exascale systems by helping determine the mechanisms needed to effectively execute applications written using specific programming model constructs, the computation characteristics required to support these constructs, and the behavior of these constructs on future systems.
Efficient soft-error detectors. The choice of task scheduling strategy depends on the quality of the error detection and recovery scheme. In particular, the cost of error recovery can be significantly influenced by the error detection latency: the time between the occurrence and the detection of an error. We therefore sought to design an efficient detector for soft errors in memory subsystems. The probability of bit flips in hardware memory systems is projected to increase significantly as memory systems continue to scale in size and complexity. Effective hardware-based error detection and correction require that the complete data path, involving all parts of the memory system, be protected with sufficient redundancy. First, this may be costly to employ on commodity computing platforms, and second, even on high-end systems, protection against multi-bit errors may be lacking. Augmenting hardware error detection schemes with software techniques is therefore of considerable interest. We consider software-level mechanisms to comprehensively detect transient memory faults. We develop novel compile-time algorithms that instrument application programs with checksum computation code to detect memory errors. Unlike prior approaches that employ checksums on computational and architectural states, our scheme verifies every data access and works by tracking variables as they are produced and consumed. Experimental evaluation demonstrates that the proposed comprehensive error detection solution is viable as a completely software-only scheme. We also demonstrate that, with limited hardware support, the overheads of error detection can be further reduced. This work has been accepted for publication at PLDI’14.
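The idea of tracking variables as they are produced and consumed can be illustrated with a small hand-written sketch. This is not the compiler pass from the publication; it only assumes one plausible realization, in which every write adds the stored value to a "definition" checksum once per anticipated read, every read adds the value actually loaded to a "use" checksum, and a mismatch between the two signals that memory changed in between. The names `ChecksumTracker`, `define`, `use`, `verify`, and `dot_product` are hypothetical, and the bit flip is injected by hand.

```python
# Hypothetical illustration of def/use checksum tracking.
# All names here are assumptions made for this sketch, not the published tool's API.

MASK = (1 << 64) - 1  # keep checksums in a fixed 64-bit width


class ChecksumTracker:
    def __init__(self):
        self.def_sum = 0  # accumulated when values are produced (written)
        self.use_sum = 0  # accumulated when values are consumed (read)

    def define(self, value, expected_uses):
        # At a write: account for the value once per read expected later.
        self.def_sum = (self.def_sum + value * expected_uses) & MASK
        return value

    def use(self, value):
        # At a read: account for the value actually loaded from memory.
        self.use_sum = (self.use_sum + value) & MASK
        return value

    def verify(self):
        # Checksums agree only if every read saw the value that was written
        # (up to checksum collisions).
        return self.def_sum == self.use_sum


def dot_product(xs, ys, corrupt_index=None):
    tracker = ChecksumTracker()
    # Each element is written once and read exactly once below.
    mem_x = [tracker.define(x, 1) for x in xs]
    mem_y = [tracker.define(y, 1) for y in ys]

    if corrupt_index is not None:
        # Simulated transient fault: flip one bit between write and read.
        mem_x[corrupt_index] ^= 1 << 3

    total = 0
    for i in range(len(mem_x)):
        total += tracker.use(mem_x[i]) * tracker.use(mem_y[i])
    return total, tracker.verify()
```

Calling `dot_product([1, 2, 3], [4, 5, 6])` returns a result together with a passing check, while passing `corrupt_index=1` makes the check fail. In a compiler-based scheme, the use counts and the instrumentation calls would be derived and inserted automatically rather than written by hand as they are here.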
1. S. Tavarageri, S. Krishnamoorthy, and P. Sadayappan. “Compiler-assisted detection of transient memory errors.” ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2014.