|PI|Pete Beckman (ANL)|
|Chief Scientist|Marc Snir (ANL)|
Disruptive new computing technologies, such as 3D memory, ultra-low-power cores, and embedded network controllers, are changing the scientific computing landscape. For the next few years, novel designs will flourish as new technologies are explored. Furthermore, changing workflows and programming environments are making new demands on low-level system software. As noted by DOE workshops and reports, today’s operating system and runtime (OS/R) software cannot be incrementally extended and grown into an exascale solution. A new approach is required.
Argo is a project to develop a new exascale operating system and runtime (OS/R) specifically designed to support extreme-scale scientific computation. Argo is built on an agile, modular architecture that supports both global optimization and local control. It aims to efficiently leverage new chip and interconnect technologies while addressing the new modalities, programming environments, and workflows expected at exascale. It is designed from the ground up to run future HPC applications at extreme scales.
Argo will be developed over the course of three years and will result in an open-source prototype system that is vendor neutral and runs on several architectures. Four key innovations form the foundation of this project: a new node OS/R that supports OS specialization, a lightweight runtime system for massive concurrency, a global view that supports cross-cutting verticals of power and fault management, and a backplane that allows resource managers and optimizers to communicate and control the platform.
An OS/R with Multiple Views: Our design supports hierarchical views of the entire exascale system. The global view enables Argo to combine live performance data, active control interfaces, and machine-learning techniques to dynamically manage power across the entire system, respond to faults, or tune application performance. Only with a whole-system perspective can power budget goals be reached and cascading failures halted to avoid a system crash. At the other end of the spectrum is the local view. For scalability, compute nodes must have a measure of autonomy to manage and optimize massive intranode parallelism, schedule low-latency messages on embedded network adapters, and adapt to new memory technologies. Bringing together these multiple perspectives, and the corresponding software components operating within our hierarchical view, is our strategy for addressing the four key exascale challenges: power, parallelism, memory hierarchy, and resilience.
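As a rough illustration of how a global view and a local view might cooperate on power, the following minimal Python sketch splits a system-wide power budget across nodes in proportion to their demand, while each node decides locally how much of its share to accept. It is not Argo code: the names `Node`, `apply_share`, and `distribute_budget`, and the proportional policy itself, are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Local view: one compute node (hypothetical model, not an Argo type)."""
    name: str
    demand_w: float      # power the node would like to draw, in watts
    cap_w: float = 0.0   # power cap currently in effect on the node

    def apply_share(self, share_w: float) -> None:
        # Local autonomy: the node accepts at most what it actually needs.
        self.cap_w = min(share_w, self.demand_w)


def distribute_budget(nodes: List[Node], budget_w: float) -> None:
    """Global view: share a system-wide budget in proportion to demand.

    Proportional sharing is an assumed policy chosen for simplicity.
    """
    total_demand = sum(n.demand_w for n in nodes)
    if total_demand <= budget_w:
        scale = 1.0                       # enough power for everyone
    else:
        scale = budget_w / total_demand   # shrink every request equally
    for n in nodes:
        n.apply_share(n.demand_w * scale)


if __name__ == "__main__":
    nodes = [Node("n0", 300.0), Node("n1", 100.0)]
    distribute_budget(nodes, budget_w=200.0)
    for n in nodes:
        print(n.name, n.cap_w)
```

In this sketch the sum of the node caps never exceeds the global budget, because each node takes at most its proportional share. A real system would also have to handle faults, changing demand, and deeper hierarchies, none of which are modeled here.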
- Argonne National Laboratory: Pete Beckman, Marc Snir, Pavan Balaji, Rinku Gupta, Kamil Iskra, Franck Cappello, Rajeev Thakur, Kazutomo Yoshii
- Boston University: Jonathan Appavoo, Orran Krieger
- Lawrence Livermore National Laboratory: Maya Gokhale, Edgar Leon, Barry Rountree, Martin Schulz, Brian Van Essen
- Pacific Northwest National Laboratory: Sriram Krishnamoorthy, Roberto Gioiosa
- University of Chicago: Henry Hoffmann
- University of Illinois at Urbana-Champaign: Laxmikant Kale, Eric Bohm, Ramprasad Venkataraman
- University of Oregon: Allen Malony, Sameer Shende, Kevin Huck
- University of Tennessee Knoxville: Jack Dongarra, George Bosilca, Thomas Herault