Sandia’s Exascale Computing Effort Expected to Curtail Effects of System Faults


Computing power today is more potent than ever before. Or is it? In many applications, yes, but when it comes to sophisticated, detailed modeling of the Earth’s climate and other pressing global challenges, an apt analogy to describe our computational resources might be the use of an abacus to track the national debt.

That might be an exaggeration, but to hear CRF researcher Bert Debusschere explain the situation, the comparison is not too far off base.

“To accurately predict the Earth’s climate over the next 200–300 years, one needs to simulate the atmosphere, the oceans, and the Earth’s land, and one would need to do it all at the same time,” Bert says. Current supercomputers, as powerful as they may be, would take several years to spit out accurate predictions—and that’s only if they could be dedicated for that sole purpose.

“Each component of climate modeling and simulation [atmosphere, oceans, and land] is, by itself, challenging the most powerful computing resources known today,” Bert says. “The problem is, to make sound predictions about the future, we need to have the computers run simulation programs, not just once, but hundreds of times for slightly different conditions, for each model or component. Predictive power requires that kind of computational muscle and capacity.”

To put the problem even more succinctly, DOE’s Office of Advanced Scientific Computing Research (ASCR) predicts that the DOE’s mission—which includes not only climate modeling, but also genomics, high-energy physics, light sources, and other program areas—will necessitate 1,000 times the capability of today’s computers within a similar size and power footprint. That will require major advances in computing technology—“exascale computing,” to be precise.

Flops, petas, and quintillions

Kathye Chavez inspects a component board in one of the many cabinets that make up Sandia’s Red Sky.

The vision for exascale computing is to achieve at least one “exaflop” of performance, a thousandfold increase over the first petascale computer, which came into operation in 2008. An exaflop represents a thousand petaflops, or a quintillion floating-point operations (“flops”) per second, or about 4,000 times the computing power of Sandia’s Red Sky machine. That level of computing power would be considered a significant achievement in computer science, as it would approach the processing power of the human brain. Even more important, Bert says, is that climate research and other important applications simply can’t be done effectively without exascale computing capabilities.
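As a rough back-of-the-envelope check of those figures (a sketch only; the Red Sky number below is inferred from the article’s “about 4,000 times” comparison rather than from a published specification):

```python
# Back-of-the-envelope scale check (illustrative; the Red Sky figure is
# inferred from the "about 4,000 times" comparison, not a published spec).

EXAFLOP = 1e18    # floating-point operations per second
PETAFLOP = 1e15

print(f"1 exaflop = {EXAFLOP / PETAFLOP:,.0f} petaflops")   # 1,000 petaflops

implied_red_sky = EXAFLOP / 4000                            # ~2.5e14 flop/s
print(f"Implied Red Sky throughput: {implied_red_sky:.1e} flop/s "
      f"(~{implied_red_sky / PETAFLOP:.2f} petaflops)")
```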

These needs motivate DOE’s ASCR program, which funds the deployment of advanced computational facilities such as the Leadership Computing Facilities at Oak Ridge and Argonne national laboratories and the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, as well as research to develop the algorithms, codes, and software needed to make effective use of those facilities.

Advancing these efforts is complex, but doing so is important because these centers house many of the world’s top supercomputers and provide researchers with the most robust computing resources available. Sandia computer scientists are working on ways to expand and improve upon current capabilities.

In one of the many exascale-related research efforts at Sandia, Bert’s group is focusing on fault tolerance.

“We’re looking at fundamental mathematical and algorithmic aspects of keeping calculations meaningful, which is particularly important when you’re dealing with the need for extreme scalability and the soft and hard errors that are inevitable with massive, powerful machines,” Bert says.

In computational terms, a “soft” error has occurred when a number has been stored digitally but—unbeknownst to anyone—is retrieved later as a vastly different number, resulting in computational error. “Hard” errors occur when machines go down or crash and stop whatever programs are running.
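A toy sketch of the distinction in Python (purely illustrative, not Sandia code): a soft error silently corrupts a stored value, while a hard error simply ends the run.

```python
import struct

# Illustrative only: simulate a "soft" error by flipping one bit in a stored
# double, and a "hard" error by terminating the computation outright.

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 representation flipped."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    return struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))[0]

stored = 3.141592653589793
corrupted = flip_bit(stored, 62)       # flip a high exponent bit

print(f"stored value:     {stored}")
print(f"after soft error: {corrupted}")   # silently, vastly different

# A "hard" error is easier to notice: the node or process simply dies.
raise SystemExit("hard error: node crashed, program terminated")
```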

While both kinds of errors are important, Bert says, the traditional solution to hard errors—saving periodic “checkpoints” and restarting the full machine from the last one—will soon become obsolete as computing moves toward exascale, because performing the checkpoint/restart will take longer than the machine stays up between failures. For soft errors, today’s workarounds (such as error-correcting RAM chips) may not be viable in 5 to 10 years because of their extra cost in power consumption and execution time.
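The arithmetic behind that concern can be sketched with made-up but plausible numbers (the checkpoint times and failure rates below are assumptions for illustration, not measured values):

```python
# Illustrative arithmetic: why global checkpoint/restart breaks down as machines
# grow. If the time to write a checkpoint approaches the machine's mean time
# between failures (MTBF), almost no useful work gets done.

def useful_fraction(checkpoint_minutes: float, mtbf_minutes: float) -> float:
    """Crude estimate of the fraction of wall-clock time spent on real computation."""
    if checkpoint_minutes >= mtbf_minutes:
        return 0.0                        # the machine fails before a checkpoint completes
    return 1.0 - checkpoint_minutes / mtbf_minutes

# Today-ish (assumed): 30-minute checkpoints, failures roughly once a day.
print(useful_fraction(30, 24 * 60))       # ~0.98

# Exascale-ish (assumed): same checkpoint cost, failures every 30 minutes.
print(useful_fraction(30, 30))            # 0.0 -- no forward progress at all
```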

Bert says the work Sandia is doing, if successful, will lead to software that programmers can use to solve these problems and allow exascale computing to positively impact climate research and other important applications.

As accurate as it needs to be

Traditionally, Bert says, computer scientists have treated calculations as deterministic: a mathematical problem must be solved, numbers are crunched, the machine returns an answer, and the researchers trust that the answer will be the same no matter how many times the calculation is repeated.

However, predictions like those needed for climate research are, in practice, never deterministic, he says. Even if a deterministic equation has been accepted, uncertainty in input parameters will always exist, which can lead to an incorrect prediction.
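A minimal illustration of that point, using a made-up toy model rather than an actual climate code: even when the model itself is deterministic, uncertain input parameters produce a spread of predictions.

```python
import random

# Toy illustration (not a climate model): a deterministic model still yields a
# range of predictions when its input parameter is only known approximately.

def model(sensitivity: float) -> float:
    """Deterministic toy model: response = sensitivity * forcing."""
    forcing = 3.7            # arbitrary fixed input for the sketch
    return sensitivity * forcing

random.seed(0)
# The input parameter is uncertain: assume it is only known to within ~25%.
predictions = [model(random.gauss(0.8, 0.2)) for _ in range(10_000)]

mean = sum(predictions) / len(predictions)
spread = (sum((p - mean) ** 2 for p in predictions) / len(predictions)) ** 0.5
print(f"prediction: {mean:.2f} +/- {spread:.2f}")   # the uncertainty is part of the answer
```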

“So we’re accepting that there will be uncertainty in the predictions and are looking at the effects of soft and hard faults on the system as simply an additional source of uncertainty,” Bert says. “We are treating it as just another factor that affects how much we can trust computer results.”

Instead of seeking a single deterministic number, he says, the approach is to treat the computer as a measuring instrument, much as a thermometer offers a close estimate of temperature but is understood to be possibly a degree off. “We look at the computer as something that gives us a noisy measurement of the problem we want to solve,” Bert says.

Using a tool known as a probability density function, Bert says, the team captures the full body of knowledge about the computational problem and uses computer simulations to refine that description into a data set that is as accurate as possible. The idea, he says, is to come up with formulations that quantify the existing uncertainty, then keep computing until there is confidence that the predictions are close enough for the application’s needs.
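A minimal sketch of that idea, assuming a standard Gaussian (conjugate) Bayesian update rather than the team’s actual algorithms: start with a broad probability density over the quantity of interest, treat each fault-prone computation as a noisy measurement of it, and watch the uncertainty shrink with every result.

```python
# Minimal sketch (standard Gaussian/Bayesian updating, not Sandia's algorithms):
# each noisy "measurement" from an unreliable computer narrows the probability
# density describing what we know about the answer.

def bayes_update(prior_mean, prior_var, measurement, noise_var):
    """Conjugate Gaussian update of a prior given one noisy measurement."""
    gain = prior_var / (prior_var + noise_var)
    post_mean = prior_mean + gain * (measurement - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

mean, var = 0.0, 100.0                 # broad prior: we know little at the start
noisy_results = [4.9, 5.3, 4.7, 5.1]   # results from a fault-prone computation
for r in noisy_results:
    mean, var = bayes_update(mean, var, r, noise_var=0.5)
    print(f"estimate = {mean:.2f}, variance = {var:.3f}")   # uncertainty shrinks
```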

“This doesn’t necessarily solve all of the problems, but it gives us more handles to pull on,” Bert explains. “At least by quantifying the uncertainty, we can start to understand it and work with it instead of against it.”

Reducing uncertainties in these computations, Bert says, is a matter of splitting the problem into smaller “subdomains.” This “domain decomposition,” a common technique in numerical simulation, means that Bert and his team can assess all of the uncertainty through the solutions found at the subdomain level, leading to algorithms in which many tasks execute in parallel. Relying on Bayesian methods, which provide a way to update a prior belief with the information contained in noisy data, the results of all of these tasks are used to refine knowledge about the problem’s solution.
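The flavor of that approach can be sketched with a toy one-dimensional problem (illustrative only, not the team’s solvers): the domain splits into independent subdomains whose results are computed in parallel and then combined, and a faulty contribution simply shows up as extra noise in the combination step.

```python
import random
from concurrent.futures import ProcessPoolExecutor

# Toy domain decomposition (illustrative, not Sandia's solvers): integrate
# f(x) = x**2 over [0, 1] by splitting the domain into independent subdomains.
# Each subdomain task could run on a different node; a fault just makes its
# contribution "noisy", which the combination step treats as extra uncertainty.

def subdomain_task(bounds, n=10_000, noise=0.0):
    a, b = bounds
    h = (b - a) / n
    total = sum((a + (i + 0.5) * h) ** 2 for i in range(n)) * h   # midpoint rule
    return total + random.gauss(0.0, noise)                       # optional soft-error noise

if __name__ == "__main__":
    subdomains = [(i / 8, (i + 1) / 8) for i in range(8)]
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(subdomain_task, subdomains))        # independent, parallel tasks
    print(f"integral of x^2 on [0,1] ~= {sum(parts):.6f}  (exact: 0.333333)")
```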

If you have a “noisy” instrument, Bert says—in this case, a computer with faults—you can still use the noisy data it returns to inform your prediction. That reduces the overall uncertainty.

Work requires diverse teams

Bert Debusschere speaks with Olivier Le Maître (French visiting professor at Duke University), Paul Mycek (postdoc at Duke University, not shown in photo), and Omar Knio (professor at Duke University, not shown in photo) about their collaboration on exascale computing.

Bert says Sandia brings a variety of disciplines to the table for tackling exascale computing challenges: mechanical engineers, applied mathematicians, computer scientists, software engineers, and others with expertise in application codes, operating systems, and uncertainty quantification. These experts conduct research and development at Sandia across a wide range of technologies to help enable exascale computing.

Bert’s project includes a three-year collaboration with Duke University that began last summer, following a year-long pilot study funded by ASCR.

Bert’s team delivered a presentation on this new approach at the Society for Industrial and Applied Mathematics (SIAM) Conference on Parallel Processing for Scientific Computing in Portland, Ore., in February 2014. This conference also included presentations from more than a dozen other Sandia National Laboratories researchers working to push the frontiers of computing to the exascale.