Tutorial 1

Mitigation of soft errors: from adding selective redundancy to changing
the abstraction stack

Tutorial Organizers: Prof. Dr. Luigi Carro and Dr. Álvaro Moreira, Universidade Federal do Rio Grande do Sul, Brazil

Abstract

Soft errors caused by ionizing radiation are already an issue for current technologies, and with transistors estimated to scale to 5.9 nm by 2026, computing devices will be forced to employ some reliability mechanism to ensure proper computation at a reasonable cost. Previously a major concern only in aerospace and avionics applications, soft errors have recently also been reported at ground level, in applications ranging from high performance computing to critical embedded systems such as automotive electronics.

We believe that knowledge of the causes of soft errors and of the pros and cons of different approaches to mitigating their effects is valuable not only for those working on microprocessor reliability, but also for those concerned with the design of software systems, since some error mitigation techniques might require the redesign of the computational stack. Such a redesign can avoid the huge cost in terms of area, performance or energy incurred by traditional techniques. In this half‐day tutorial we will focus on ionizing radiation as the source of soft errors and explain how experiments with real radiation are performed in order to evaluate the susceptibility of digital circuits to soft errors. We will then present and analyze the pros and cons of some approaches in the literature to mitigating faults defined according to a fault model. We conclude by exploring the challenges of re‐designing the computational stack when taking soft errors into account, in order to achieve high reliability, high performance and low energy designs in different application domains.

Tutorial scope and objectives

The goal of this tutorial is to present advanced techniques to cope with soft errors in several layers of the abstraction stack. We start with a characterization of the problem and its causes, developing an overview of the mechanisms by which ionizing radiation and other sources create soft errors. We then present some approaches that can be used to mitigate soft error effects. The occurrence of soft errors tends to increase as circuits get smaller, and their effects are magnified as advanced technologies are embedded in commonly used systems. These advances in technology, while reducing the overall reliability of the system, have also opened up the opportunity for re‐designing the abstraction stack and associated programming models. We believe that this half‐day tutorial can help disseminate knowledge about soft errors and how they can be taken into account when proposing new architectures and programming models.

Target audience

Researchers and graduate students interested in fault tolerance and reliability, working at different abstraction levels, and also those interested in program transformations and the re‐design of the abstraction stack.

Topics to be covered

The tutorial is organized as follows:

  • PART 1: Characterizing soft errors, their causes and their effects, and traditional strategies for error mitigation and detection.
  • PART 2: Analyzing current approaches for mitigating the effects of soft errors, and challenges for redesigning the computational stack taking into account soft error detection and correction.

The topics to be covered in each part are the following:

PART 1 – Characterizing soft errors, their causes and their effects, and traditional strategies for error mitigation and detection:

  • causes and consequences of soft errors;
  • concrete examples/reports of disruptions caused by soft errors in critical applications (avionics, space applications, automotive, medicine, oil exploration, nuclear plants, etc.);
  • how to estimate or predict a circuit's sensitivity to soft errors, including reports on experiments with real radiation;
  • current approaches: triple modular redundancy, invariant checkers, block signature checking, processor watchdogs;
  • pros and cons of current approaches regarding fault coverage, area, performance and energy consumption.
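As a rough illustration of the first approach listed above, triple modular redundancy can be sketched in software as three executions of the same computation followed by a bitwise majority vote. This is a minimal sketch under stated assumptions, not a technique taken from the tutorial material: the names `majority_vote`, `flip_bit`, `tmr` and `square` are hypothetical, and the soft error is emulated by inverting one bit of one replica's result.

```python
from typing import Callable, Optional


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(value: int, bit: int) -> int:
    """Emulate a soft error by inverting a single bit (assumed fault model)."""
    return value ^ (1 << bit)


def tmr(compute: Callable[[int], int], x: int,
        faulty_replica: Optional[int] = None, bit: int = 0) -> int:
    """Run `compute` three times, optionally corrupt one result, and vote."""
    results = [compute(x) for _ in range(3)]
    if faulty_replica is not None:
        results[faulty_replica] = flip_bit(results[faulty_replica], bit)
    return majority_vote(*results)


def square(x: int) -> int:
    """Hypothetical workload used only for this illustration."""
    return x * x


golden = square(12)
# A single corrupted replica is outvoted by the two correct ones.
assert tmr(square, 12, faulty_replica=1, bit=3) == golden
# The same bit flipped in two replicas defeats the voter.
assert majority_vote(flip_bit(golden, 3), flip_bit(golden, 3), golden) != golden
```

The sketch also hints at the cost side of the trade‐off: the computation runs three times, which is the kind of area, performance and energy overhead that the last item in the list refers to.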

PART 2 – Analyzing approaches for mitigating effects of soft errors and challenges for the redesign of the computational stack:

  • fault model as an abstraction of the real phenomena and as a reference for evaluating mitigating approaches;
  • new challenges that soft errors bring to the area of fault tolerance and to microprocessor resilience;
  • how soft errors can be taken into account when designing a new computational stack;
  • the effect of soft errors in massively parallel machines, and how to cope with them using only software‐related techniques and programming guidelines.
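To make the role of a fault model as a reference for evaluation more concrete, the following sketch enumerates every fault allowed by a single‐bit‐flip model on an 8‐bit input and classifies each outcome against a fault‐free (golden) run. It is only an illustration under stated assumptions: the single‐bit‐flip model, the 8‐bit word width, and the names `single_bit_flips`, `workload` and `campaign` are hypothetical choices, not taken from the tutorial material.

```python
from collections import Counter

WORD_BITS = 8  # assumed word width for this illustration


def single_bit_flips(value: int, width: int = WORD_BITS):
    """Fault model: exactly one bit of the word is inverted."""
    for bit in range(width):
        yield bit, value ^ (1 << bit)


def workload(x: int) -> int:
    """Hypothetical workload: only the upper four bits affect the output."""
    return x >> 4


def campaign(x: int) -> Counter:
    """Inject every fault in the model and compare against the golden run."""
    golden = workload(x)
    outcomes = Counter()
    for _bit, faulty_input in single_bit_flips(x):
        if workload(faulty_input) == golden:
            outcomes["masked"] += 1      # the fault has no visible effect
        else:
            outcomes["corrupted"] += 1   # the fault changes the output
    return outcomes


result = campaign(0b10110101)
print(dict(result))
```

Because the model fixes the set of faults considered, two mitigation approaches can be compared by running the same campaign against each of them and counting how many corrupted outcomes remain.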

References

RECH, P.; AGUIAR, C.; FROST, C.; CARRO, L. An Efficient and Experimentally Tuned Software‐Based Hardening Strategy for Matrix Multiplication on GPUs. IEEE Transactions on Nuclear Science, v. 60‐4, p. 2797‐2804, 2013.

NAZAR, G.; RECH, P.; FROST, C.; CARRO, L. Radiation and Fault Injection Testing of a Fine‐Grained Error Detection Technique for FPGAs. IEEE Transactions on Nuclear Science, v. 60‐4, p. 2742‐2749, 2013.

RECH, P.; CARRO, L. Experimental Evaluation of Neutron‐Induced Effects in Graphic Processing Units. In: 9th Workshop on Silicon Errors in Logic ‐ System Effects, 2013, Stanford.

AITKEN, R.; FEY, G.; KALBARCZYK, Z.; REICHENBACH, F.; REORDA, M. Reliability analysis reloaded: how will we survive? In: DATE 2013, p. 358‐367.

CHO, H.; MIRKHANI, S.; CHER, C.; ABRAHAM, J. A.; MITRA, S. Quantitative evaluation of soft error injection techniques for robust system design. In: DAC 2013.

HWANG, A.; STEFANOVICI, I.; SCHROEDER, B. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design. In: ASPLOS 2012, London, UK, p. 111‐122.

CAMPAGNA, S.; VIOLANTE, M. An hybrid architecture to detect transient faults in microprocessors: An experimental validation. DATE 2012, p. 1433‐1438.

HARI, S.; ADVE, S.; NAEIMI, H.; RAMACHANDRAN, P. Relyzer: exploiting application‐level fault equivalence to analyze application resiliency to transient faults. In: ASPLOS 2012, London, UK, p. 123‐134.

LISBÔA, C.; GRANDO, C.; MOREIRA, A.; CARRO, L. Invariant Checkers: an Efficient Low Cost Technique for Run‐time Transient Errors Detection. In: IOLTS 2009, v. 1, p. 35‐40.

ITTURRIET, F.; NAZAR, G.; FERREIRA, R.; MOREIRA, A.; CARRO, L. Adaptive parallelism exploitation under physical and real‐time constraints for resilient systems. In: ReCoSoC 2012, York. p. 1‐8.

ITTURRIET, F.; FERREIRA, R.; GIRÃO, G.; NAZAR, G.; MOREIRA, A.; CARRO, L. Resilient Adaptive Algebraic Architecture for Parallel Detection and Correction of Soft‐Errors. In: 15th Euromicro Conference on Digital System Design, 2012, Izmir, Turkey. p. 1‐8.

FERREIRA, R.; MOREIRA, A.; CARRO, L. Matrix control‐flow algorithm‐based fault tolerance. In: IEEE International On‐Line Testing Symposium, 2011, Athens. p. 37‐42.

FERREIRA, R.; AZAMBUJA, J.; MOREIRA, A.; CARRO, L. Correction of Soft Errors in Control and Data Flow Program Segments. In: Workshop on Design for Reliability (DFR), 2011, Crete.

RHOD, E.; LISBÔA, C.; CARRO, L.; REORDA, M.; VIOLANTE, M. Hardware and Software Transparency in the Protection of Programs Against SEUs and SETs. J. Electronic Testing 24(1‐3): 45‐56 (2008).

VEMU, R.; GURUMURTHY, S.; ABRAHAM, J. A. ACCE: Automatic correction of control‐flow errors. In: IEEE International Test Conference, 2007, p. 1‐10.

ZIEGLER, F.; CURTIS, H.; MUHLFELD, H.; et al. IBM experiments in soft fails in computer electronics (1978–1994). IBM J. Res. Dev. 40, 1 (January 1996), 3‐18.

Biography of the Speakers

Luigi Carro received his Electrical Engineering degree and his M.Sc. and Ph.D. degrees in Computer Science from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil. He is a full professor at the Institute of Informatics at UFRGS. He has considerable experience in computer engineering, with emphasis on hardware and software design for embedded systems, focusing on embedded electronic systems, processor architecture, dedicated test, fault tolerance, and multiplatform software development. He has advised more than 20 graduate students and has published more than 150 technical papers on those topics. He authored the book Digital Systems Design and Prototyping (2001, in Portuguese) and is a co‐author of Fault‐Tolerance Techniques for SRAM‐based FPGAs (Springer, 2006), Dynamic Reconfigurable Architectures and Transparent Optimization Techniques (Springer, 2010) and Adaptive Systems (Springer, 2012).

Álvaro Moreira has a B.Sc. and an M.Sc. in Computer Science from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil, and a Ph.D. in Computer Science from the University of Edinburgh, Scotland. He is an associate professor at the Institute of Informatics at UFRGS. He is interested in software‐based approaches for the mitigation of soft errors, in the formal definition of fault models, and in the formal semantics of new ISAs that take soft errors into account.