Researchers ‘overclocking’ world’s fastest supercomputers to process big data faster

“Approximate computing” tricks use controlled errors to achieve speed increases and reduce power consumption
March 2, 2015

High-performance computing (HPC) systems (credit: Queen's University Belfast)

Researchers at Queen’s University Belfast, the University of Manchester, and the STFC Daresbury Laboratory are developing new software to let supercomputers process big data faster while minimizing any increase in power consumption.

To do that, computer scientists in the Scalable, Energy-Efficient, Resilient and Transparent Software Adaptation (SERT) project are using “approximate computing” (also known as “significance-based computing”), which, much as overclocking trades reliability for speed, deliberately gives up some hardware reliability in exchange for reduced energy consumption.

The idea is to operate hardware at a supply voltage only slightly above the transistor threshold voltage (so-called near-threshold voltage, or NTV). Components then run in a deliberately unreliable state, and the software and its parallelism are expected to absorb the resulting timing errors, for example by running extra iterations until an algorithm converges.
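
That trade-off can be shown with a minimal sketch in C (purely illustrative; the fault model and numbers are assumptions, not the SERT software): a fixed-point iteration still settles on the same answer when occasional updates are corrupted, with the injected errors standing in for NTV timing faults and costing only a few extra iterations.

```c
/* Illustrative sketch only: a fixed-point iteration x = cos(x) in which
 * occasional corrupted updates mimic timing errors on near-threshold hardware.
 * The iteration still converges to the same fixed point; the faults merely
 * increase the number of iterations needed. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static int solve(double fault_rate) {
    double x = 0.0;
    int iters = 0;
    for (;;) {
        double next = cos(x);                   /* the intended update       */
        if ((double)rand() / RAND_MAX < fault_rate)
            next += 0.1;                        /* injected "hardware" error */
        iters++;
        if (fabs(next - x) < 1e-8 || iters > 100000)
            return iters;                       /* converged (or gave up)    */
        x = next;
    }
}

int main(void) {
    printf("reliable run:   %d iterations\n", solve(0.0));   /* nominal voltage */
    printf("unreliable run: %d iterations\n", solve(0.01));  /* simulated NTV   */
    return 0;
}
```

Built with a standard C compiler (e.g., cc demo.c -lm), the faulty run typically needs only a modest number of extra iterations compared with the clean one.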

“We also investigate scenarios where we distinguish between significant and insignificant parts [of programs] and execute them selectively on reliable or unreliable hardware, respectively,” the authors write in a paper in the journal Computer Science – Research and Development. “We consider parts of the algorithm that are more resilient to errors as ‘insignificant,’ whereas parts in which errors increase the execution time substantially are marked as ‘significant.’”
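
How such a split might look in code, as a hypothetical sketch (the task structure, the significance flag, and the error model below are invented for illustration and are not taken from the paper): each task carries a flag, and a tiny dispatcher routes significant work to a reliable path and insignificant work to a faster, error-prone one.

```c
/* Hypothetical sketch of significance tagging; names and the error model are
 * illustrative, not from the SERT project. Significant tasks take the reliable
 * path; insignificant tasks may run on simulated "unreliable" hardware. */
#include <stdio.h>
#include <stdlib.h>

typedef double (*kernel_fn)(double);

typedef struct {
    kernel_fn fn;
    double    arg;
    int       significant;  /* 1: errors would hurt badly; 0: errors tolerable */
} task_t;

static double reliable_exec(kernel_fn fn, double arg) {
    return fn(arg);                         /* nominal-voltage execution */
}

static double unreliable_exec(kernel_fn fn, double arg) {
    double r = fn(arg);
    if ((double)rand() / RAND_MAX < 0.05)   /* occasional timing error   */
        r *= 1.001;                         /* small silent corruption   */
    return r;
}

static double square(double x) { return x * x; }

int main(void) {
    task_t tasks[] = {
        { square, 3.0, 1 },   /* significant: must run reliably              */
        { square, 4.0, 0 },   /* insignificant: near-threshold is acceptable */
    };
    for (size_t i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        double r = tasks[i].significant
                       ? reliable_exec(tasks[i].fn, tasks[i].arg)
                       : unreliable_exec(tasks[i].fn, tasks[i].arg);
        printf("task %zu -> %f\n", i, r);
    }
    return 0;
}
```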

Software methods for improving error resilience include checkpointing for failed tasks and replication to identify silent data corruption.
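
A compact sketch of how those two safety nets fit together (the helper names and the tiny kernel are hypothetical, and real HPC checkpointing would write state to stable storage rather than a local buffer): state is saved before an unreliable step, the step is executed twice and the results compared to flag silent data corruption, and on disagreement the saved state is restored and the step retried.

```c
/* Illustrative sketch of checkpointing plus replication; in this deterministic
 * toy no corruption actually occurs, so it only shows the control structure. */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define N 4

/* Checkpointing: save solver state so a failed step can be rolled back. */
static void checkpoint(double *saved, const double *state) {
    memcpy(saved, state, N * sizeof *state);
}
static void restore(double *state, const double *saved) {
    memcpy(state, saved, N * sizeof *state);
}

/* Replication: run the kernel twice and compare results to detect silent
 * data corruption (SDC). Returns 0 on agreement, -1 on mismatch. */
static double kernel(const double *state) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += state[i] * state[i];
    return s;
}
static int replicated_kernel(const double *state, double *out) {
    double a = kernel(state), b = kernel(state);
    if (fabs(a - b) > 1e-12) return -1;      /* replicas disagree: SDC flagged */
    *out = a;
    return 0;
}

int main(void) {
    double state[N] = { 1, 2, 3, 4 }, saved[N], result = 0.0;

    checkpoint(saved, state);                /* save before the unreliable step */
    if (replicated_kernel(state, &result) != 0) {
        restore(state, saved);               /* roll back and retry on failure  */
        replicated_kernel(state, &result);
    }
    printf("sum of squares = %f\n", result);
    return 0;
}
```
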
“This new software … [means] complex computing simulations which would take thousands of years on a desktop computer will be completed in a matter of hours,” according to the project’s Principal Investigator, Professor Dimitrios Nikolopoulos from Queen’s University Belfast.

The SERT project, due to start this month, has just been awarded almost £1 million from the U.K. Engineering and Physical Sciences Research Council.

The researchers are running detailed simulations of natural phenomena such as ocean currents, blood flow in the human body, and global weather patterns to help address some of the big global challenges, including sustainable energy, rising global temperatures, and worldwide epidemics.


Abstract of On the potential of significance-driven execution for energy-aware HPC

Dynamic voltage and frequency scaling (DVFS) exhibits fundamental limitations as a method to reduce energy consumption in computing systems. In the HPC domain, where performance is of highest priority and codes are heavily optimized to minimize idle time, DVFS has limited opportunity to achieve substantial energy savings. This paper explores if operating processors near the transistor threshold voltage (NTV) is a better alternative to DVFS for breaking the power wall in HPC. NTV presents challenges, since it compromises both performance and reliability to reduce power consumption. We present a first-of-its-kind study of a significance-driven execution paradigm that selectively uses NTV and algorithmic error tolerance to reduce energy consumption in performance-constrained HPC environments. Using an iterative algorithm as a use case, we present an adaptive execution scheme that switches between near-threshold execution on many cores and above-threshold execution on one core, as the computational significance of iterations in the algorithm evolves over time. Using this scheme on state-of-the-art hardware, we demonstrate energy savings ranging between 35% and 67%, while compromising neither correctness nor performance.
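
To make the adaptive scheme in the abstract concrete, here is a toy rendering (the switching rule, tolerances, and noise model are assumptions for illustration, not the paper's implementation): early, error-tolerant iterations run in a simulated near-threshold mode, and once a coarse tolerance is met the remaining, significant iterations run exactly.

```c
/* Toy sketch of significance-driven phase switching: a Babylonian square-root
 * iteration runs first in a noisy "near-threshold" mode, then switches to exact
 * "above-threshold" execution for the final refinement. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double noisy(double v, double rate) {
    if ((double)rand() / RAND_MAX < rate)
        v *= 1.0 + 1e-3;                      /* small silent error under NTV */
    return v;
}

int main(void) {
    const double a = 2.0;                     /* compute sqrt(2)              */
    double x = a, step;
    int iters = 0, ntv_phase = 1;

    do {
        double next = 0.5 * (x + a / x);      /* Babylonian update            */
        if (ntv_phase)
            next = noisy(next, 0.05);         /* unreliable, low-power phase  */
        step = fabs(next - x);
        x = next;
        iters++;
        if (ntv_phase && step < 1e-3)         /* coarse tolerance reached:    */
            ntv_phase = 0;                    /* switch to reliable execution */
    } while (step > 1e-12 && iters < 10000);

    printf("sqrt(%.1f) ~= %.12f after %d iterations\n", a, x, iters);
    return 0;
}
```

In the paper's actual experiments the two phases map to many cores at near-threshold voltage versus one core above threshold; the sketch only mirrors the control logic.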