Neutron Beam Testing Methodology and Results for a Complex Programmable Multiprocessor SoC


Brigham Young University
BYU ScholarsArchive
All Theses and Dissertations

Neutron Beam Testing Methodology and Results for a Complex Programmable Multiprocessor SoC
Jordan Daniel Anderson, Brigham Young University

Part of the Electrical and Computer Engineering Commons

BYU ScholarsArchive Citation
Anderson, Jordan Daniel, "Neutron Beam Testing Methodology and Results for a Complex Programmable Multiprocessor SoC" (2019). All Theses and Dissertations.

This Thesis is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in All Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact scholarsarchive@byu.edu, ellen_amatangelo@byu.edu.

Neutron Beam Testing Methodology and Results for a Complex Programmable Multiprocessor SoC

Jordan Daniel Anderson

A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

Michael J. Wirthlin, Chair
James K. Archibald
Jeffrey Brant Goeders

Department of Electrical and Computer Engineering
Brigham Young University

Copyright 2019 Jordan Daniel Anderson
All Rights Reserved

ABSTRACT

Neutron Beam Testing Methodology and Results for a Complex Programmable Multiprocessor SoC

Jordan Daniel Anderson
Department of Electrical and Computer Engineering, BYU
Master of Science

The Xilinx Multiprocessor System-on-Chip (MPSoC) is a complex device that uses 16nm FinFET technology to combine multiple processors, a large amount of FPGA resources, and many I/O interfaces on a single chip die. These features make the MPSoC a high-performance and architecturally flexible device. The potential computing power makes the MPSoC ideal for many embedded applications, including terrestrial and space applications. The MPSoC, however, does not have as extensive a radiation test history as many other devices. The extent of the effect that ionized particles may have on the MPSoC is not well established. To address this, neutron radiation testing can be used to determine the device's susceptibility to single-event upsets (SEUs). Though this thesis is not intended to qualify the MPSoC for space, this work does provide useful neutron radiation test data that helps to characterize the device's susceptibility.

This thesis summarizes the SEU results obtained from neutron testing on the UltraScale+ MPSoC ZU9EG device. A series of three neutron beam tests were performed on the MPSoC ZU9EG at Los Alamos National Laboratory (LANL). Testing was performed using a novel testing methodology to collect SEU counts on the programmable logic and the processing system simultaneously. These results show a 10.1× improvement of the programmable logic CRAM over the previous Xilinx UltraScale device series.

Keywords: MPSoC, ZU9EG, ZCU102, FPGA, Scrubbing, Neutron Cross Section, PCAP, SEM IP, SEU, CRAM

ACKNOWLEDGMENTS

First and foremost, I want to thank Dr. Wirthlin for his help and encouragement over the past several years. Dr. Wirthlin has mentored me in principles of research, school, work, and life, teaching me innumerable lessons that will influence me forever. I also want to thank my graduate committee members for their feedback and efforts with this work.

I deeply appreciate all of the help that Heather Quinn from Los Alamos National Laboratory provided, assisting in many aspects of my experiments. Heather provided benchmarks, a ZCU102 evaluation board for development, office space, thermal images, and much more in her time and effort. I would also like to thank NASA Goddard, who provided a second ZCU102 evaluation board to pursue additional research into the high current events.

I am grateful for my wife, for supporting me in my research and studies. I was often gone working, studying, performing radiation tests, or attending conferences. She supported me in all my efforts and always asked what more she could do to help. She has been an inspiring light to my research.

I appreciate the help and effort of the BYU CCL students both in my studies and research. Particularly, I would like to recognize Jennings Leavitt and Alex West for their efforts on the MPSoC software benchmarks.

This work was supported by the NSF Center for High Performance Reconfigurable Computing (CHREC), supported by the I/UCRC Program of the National Science Foundation under Grant No ; and the NSF Center for Space, High-Performance, and Resilient Computing (SHREC), supported by the I/UCRC Program of the National Science Foundation under Grant No This work was also supported by Los Alamos Neutron Science Center (LANSCE) which provided neutron beam time under proposals NS F, NS A, & NS A. The opinions, findings, and conclusions or recommendations expressed are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
  Thesis Contributions
  Thesis Organization

Chapter 2 Radiation Effects
  Sources of Radiation
  Single Event Effects
  SEEs on Processors
  SDC Prevention
  SEEs on FPGAs
  Radiation Metrics
  Radiation Testing

Chapter 3 MPSoC Architecture
  MPSoC Device Overview
  Processing System
  Application Processing Unit (APU)
  Real-Time Processing Unit (RPU)
  Programmable Logic
  Other Chip Resources
  On-Chip Memory (OCM)
  Configuration and Security Unit (CSU)
  Platform Management Unit (PMU)

Chapter 4 Testing Methodology
  Methodology
  Power Modifications
  Beam Counting Methodology
  Test Overview

Chapter 5 Programmable Logic Testing and Results
  FPGA Readback
  FPGA Scrubbing
  Programmable Logic (PL) Methodology
  BYU JCM - JTAG Scrubber
  SEM IP - ICAP Scrubber
  Baremetal PCAP Scrubber
  Block RAM Readback
  5.4 Programmable Logic SEU Results
  SEM IP Sensitivity

Chapter 6 Processing System Testing and Results
  Processing System (PS) Methodology
  Software Processor Benchmarks
  Advanced Encryption Standard (AES)
  Matrix Multiply
  Linux and Dhrystone
  Processor Results
  Memory Benchmarks
  Caches
  On-Chip Memory (OCM)
  Memory Test Results

Chapter 7 High Current Events
  High Current Event Monitoring Methodology
  External Power Supply Monitoring
  On-Board Power Monitoring Test
  Captured High Current Event

Chapter 8 Conclusion
  Future Work

Chapter 9 Acronyms

REFERENCES

Appendix A ZCU102 Power Modifications

Appendix B Radiation Testing Logs
  B.1 Beam Count Log
  B.2 PL Logs
  B.2.1 JCM PL Log
  B.2.2 SEM-IP Log
  B.2.3 PCAP Log
  B.3 PS Logs
  B.3.1 Matrix Multiply Log
  B.3.2 Linux Log
  B.3.3 OCM Log
  B.4 Power Logs
  B.4.1 Power Supply Log
  B.4.2 On-Board Power Monitoring Log
  B.5 Cross Section Calculations Using Log Files

Appendix C PCAP Code
  C.1 PCAP Macros
  C.2 PCAP Interface
  C.3 PCAP Scrubber

Appendix D Interrupt and Testing Code
  D.1 Interrupts
  D.1.1 APU Cache ECC
  D.1.2 DDR ECC
  D.1.3 OCM ECC
  D.1.4 RPU ECC
  D.2 Test Code
  D.2.1 Matrix Multiply Test
  D.2.2 Matrix Multiply Main
  D.2.3 Watchdog Timer Linux App
  D.2.4 OCM Interface
  D.2.5 OCM Test

Appendix E Power Monitoring Code
  E.1 Keysight Power Supply Script
  E.2 On-Board Power Detection Software
  E.2.1 Power Monitoring PMBus Interface
  E.2.2 Power Monitoring Code

LIST OF TABLES

  3.1 Available ZU9EG PL resources
  FIT rates for the MPSoC FPGA
  Per-bit cross section of MPSoC FPGA CRAM
  Per-bit cross section of MPSoC FPGA BRAM
  Cross section of the SEM IP
  Software cross section on the APU
  Per-bit cross section data from the processor memories
  Power supply current values used in these tests
  Power event cross section results
  On-board power monitoring detection results
  A.1 Color coding of wires
  A.2 Locations and labels for bypassing regulators

LIST OF FIGURES

  2.1 Radiation effects on a transistor
  Example of an SEE on a processor cache
  SEEs on FPGA CRAM
  LANSCE ICE House neutron flux
  MPSoC EG block diagram
  CSU block diagram
  Simple block diagram of PS/PL separation for testing
  Wiring to bypass the power regulators on the ZCU102 evaluation board
  Dosimeter in beam path at LANSCE
  Beam counter at LANSCE facility
  ZCU102 evaluation board in flight path at LANSCE, November
  Block diagram of physical setup
  Flow diagram for test execution
  Simple diagram of scrubbing
  JCM with the MPSoC
  The JCM with its daughter card
  Flow diagram for initializing the PCAP on the MPSoC
  Per-bit cross section comparison of Xilinx FPGA CRAM
  Flow diagram for Dhrystone benchmark running in Linux
  Memory with and without ECC interleaving
  Wiring to bypass the power regulators on the ZCU102 evaluation board
  Keysight power supply powering and monitoring evaluation board
  Flow diagram of power supply monitoring script
  Current plot with a current spike
  Current plot with increased current
  Histogram categorizing events on VCCAUX power rail
  Software flow of RPU baremetal power monitoring code
  Block diagram of on-board power detection test setup
  Raspberry Pi connected to MPSoC through GPIO
  Picture of the UPS at LANSCE
  Thermal image of ZCU
  Thermal image of ZU9EG chip
  Thermal image of ZCU102 power regulators
  A.1 Metal plate with banana connectors connected to ZCU
  A.2 Bottom of ZCU102 showing the bypass wires
  A.3 ZCU102 board with test points highlighted for bypassing

CHAPTER 1. INTRODUCTION

SRAM-based field programmable gate arrays (FPGAs) are increasingly being used in space and terrestrial large-data computing applications due to the processing power and flexibility that the architecture provides [1, 2]. FPGAs can meet the data and signal processing demands of many data center, military, space, and other applications. To increase the computing power of these FPGA devices, many are being coupled with one or more processors to create a single powerful device. One such device is the Xilinx UltraScale+ Multiprocessor System-on-Chip (MPSoC). Using 16nm FinFET technology, this large device combines the flexibility of many programmable logic resources with multiple powerful ARM processors [3]. Due to the large amount of computational power that the MPSoC provides, the device is very attractive for a wide variety of embedded and high-performance computing applications. Examples include motion control, high-security mobile radios, data center processing, and camera-based advanced driver assistance [3].

Terrestrial applications are increasingly experiencing the effects of ionizing radiation. This increase in observed events is due in part to the physical size of the device, the location of the application, and how the system applications are designed [4]. The effects of ionizing radiation can negatively affect components on electronic devices. In particular, many components of a device's processor are vulnerable to ionized particles. Ionized particles can interact with the sensitive areas of the processing device and cause an observable event called a single-event effect (SEE) [5]. SEE interactions with an electronic device can cause unintended behavior, most commonly observed as data corruption or hardware module malfunction.

Though most of these errors are temporary in nature, the consequences of failure can be substantial depending on the application of the system. For example, imagine the possible consequences of incorrect data being generated by a self-driving car identification system or a flight control system for a commercial airplane. These systems need to be carefully designed and tested before being deployed in the field.

One way to obtain a measurement of system vulnerability is through radiation testing. Radiation testing is performed by exposing a device to an accelerated particle beam. It allows for the collection of SEE data on a particular device or application. The SEE data can then be used to generate metrics that express the vulnerability of the device.

This work focuses on the results of neutron radiation tests performed on both the programmable logic and the processing system of the MPSoC, in addition to a novel method for testing both simultaneously. Neutron SEE data is collected from the programmable logic using three different detection and correction methods. These methods include both internal and external configuration readback methodologies. Processing system components, including the memories and software reliability, are also examined.

Prior to this thesis, little work had been published on the radiation effects on the MPSoC. The purpose of this thesis is to provide a neutron radiation testing baseline for the MPSoC that will give insight into the device's response to neutrons for future testing. A ZCU102 ZU9EG evaluation board was used for a series of three neutron radiation tests performed at Los Alamos National Laboratory (LANL). Each test used a novel testing model that performed two tests simultaneously, one on the programmable logic and one on the processing system. Through these radiation tests, neutron cross sections for the FPGA CRAM, FPGA BRAM, processor software, and processor caches were estimated.

1.1 Thesis Contributions

The main contributions of this thesis are:

1. A description of the methodology for testing the MPSoC ZU9EG with programmable logic and processing system separation.
2. Presentation of the neutron radiation results for both programmable logic and processing system device regions.
3. Discussion of power results obtained from current monitoring of the ZCU102 evaluation board.

These results provide an understanding of the neutron radiation interactions with the MPSoC ZU9EG device and can be used for terrestrial rate estimations.

1.2 Thesis Organization

Chapter 2 discusses in detail what a single-event effect is and its general effects on FPGAs and processing systems. Chapter 3 provides background on the architecture of the MPSoC device. The chapter describes the general architecture of the programmable logic and processing system, including the application and real-time processing units and relevant subsystems. Chapter 4 describes the purpose of performing radiation testing and then presents the methodology used for testing the MPSoC. This methodology includes a method for separating the processing system and the programmable logic to test both simultaneously but independently.

Chapter 5 provides the specific details of the tests performed on the programmable logic and the results obtained from configuration RAM and block RAM. Chapter 6 discusses the specifics of the processing system tests, including the testing methodology and the main software tests run during the experiments. Estimated cross sections are provided for the processing system running a software application, the processing system caches, and the on-chip memory. Chapter 7 presents the details of the current monitoring tests that were performed on the ZCU102 evaluation board. This chapter provides several graphs showing the current on the board during various upsets. Additionally, an experiment with sustaining a high current event on the board and the results from that experiment are provided. This work concludes in Chapter 8.

CHAPTER 2. RADIATION EFFECTS

This chapter provides background information about the effects of radiation on electronic components. It discusses several of the main sources of terrestrial radiation, how ionized particles affect electronic devices, how these effects influence processors and FPGAs specifically, and several metrics that are used within this thesis. The purpose of this chapter is to motivate the need for understanding radiation effects on the MPSoC and to provide important terminology that will be used throughout the thesis.

2.1 Sources of Radiation

Electronic devices are exposed to radiation in their operating environments, which include both terrestrial and space applications. There are three main sources of radiation for terrestrial applications: alpha particles, high-energy cosmic rays, and thermal neutrons [6]. Though there are several additional sources of radiation for space applications, such as heavy ions, they will not be discussed in this work.

Alpha particles are a significant source of radiation effects in electronic devices. These alpha particles are emitted from the natural decay of radioactive impurities within the device materials. These materials can include, but are not limited to, lead-based isotopes in solder, gold, and aluminum [7]. Uranium and thorium have the highest activity and are therefore the most dominant source of alpha particles in device materials [6]. However, manufacturing techniques have improved since this discovery and most of these impurities have been removed from the device materials.

High-energy cosmic rays are the second main source of radiation effects. The cosmic rays hit the Earth's atmosphere and break apart into a shower of secondary particles, which cascade down through the atmosphere. Most particles remain caught in the atmosphere due to interactions with other particles. The particles that are not caught in the atmosphere are severely attenuated. This means that the lower the elevation, the lower the chance that a particle will survive through the atmosphere. High-energy neutrons, however, are the most likely particles to reach sea level because they interact the least with the particles in the atmosphere [7]. Neutrons indirectly affect electronic devices by striking the nuclei of other atoms, typically silicon, which generates charged secondary particles [6].

The last main source of terrestrial radiation is the interaction of thermal neutrons from cosmic rays with unstable isotopes of boron. Boron is typically used as a dopant for the junctions in integrated circuit packages. Low-energy neutrons interact with the boron, causing a release of gamma photons and alpha particles. These emitted gamma photons and alpha particles are the source of radiation effects on the electronic devices. Modern processors are built using highly purified materials, greatly reducing this source of radiation [7].

2.2 Single Event Effects

A single event effect (SEE) occurs when a single energized particle creates an observable effect in a device [8]. SEEs are the result of energetic particles interacting with the silicon atoms in electronic devices. These particles create electron-hole pairs as they pass through a semiconductor device, creating an electron current and a hole current, as illustrated by Figure 2.1 [9]. These currents cause a shift in charge on the device. If sufficient charge accumulates on a particular component, such as a transistor, it may invert the state of that component [4]. These faults are considered transient faults because they do not represent an architectural flaw in the hardware and can be corrected.

One of the main categories of SEEs is non-destructive errors, or soft errors. A soft error is a failure in a device that is not a permanent hardware failure. The other main category of SEEs is destructive errors, or hard errors, which cause permanent hardware failure. Soft errors can result in data corruption or incorrect operation of a circuit, which may result in a crash of the entire system. A firm error is a type of soft error that can only be recovered from by reconfiguring or reinitializing the device. Several classifications of soft error SEEs are single event transients (SETs), single event upsets (SEUs), single event functional interrupts (SEFIs), and single event latchup (SEL) [11].


Figure 2.1: Effect of an ionized particle on an N-type MOSFET transistor. Used with permission from [10].

SETs occur when charge accumulates due to the shift in charge from a particle strike. This can cause a jump in the voltage levels on the device, known as a glitch. This voltage glitch on the output of the device (typically a transistor) is the SET. SETs can be measured by the incident voltage and the duration of the glitch [9].

SEUs are a direct result of SETs in memory cells. When the voltage generated by the SET is sufficiently large, it can force the feedback loop to change its value and propagate to the memory cells [9]. The term SEU is often used to mean soft errors in general, but can also be used to refer to hard errors. In this thesis, the term SEU is used to describe a bit-flip or multiple bit-flips due to a single particle strike.

SEFIs are soft errors that affect the elements of the device that control the device functionality. SEFIs can cause the entire device to behave incorrectly and even cause a complete system failure. Examples of control elements include control registers, clock signals, reset signals, and the program counter (pc) [9, 11].

SELs are errors that cause an increase in the overall current of the device. The event creates an abnormally high current by causing a parasitic dual bipolar circuit. This event cannot be recovered from while the device is active. The only way to clear this error is by performing a complete power reset of the device. SELs are dangerous for the device because if the error is not remedied in time, or if the current is exceptionally high, the SEL may cause permanent damage to or destruction of the device. The amount of time before permanent damage occurs depends on the event and the device [9, 11].

2.3 SEEs on Processors

This section describes how the soft errors described in Section 2.2 specifically affect processing systems. The main concern with radiation effects on processors is the effect on the volatile memory cells of the processor. Among these cells are the processor registers, which include control flags and the program counter (pc), the processor caches, and main memory. SEUs can cause these memory elements to change their value, which may affect the outcome of a program running on the processor. An SEU in memory can have one of several outcomes: the event can have no effect, be detected, or result in a silent data corruption (SDC) [4]. Each of these outcomes is addressed in turn in the following paragraphs.

First, the SEU may have no effect on the processor output. This occurs when a particle strike affects a part of the processor that is not in use or will not be used again. In other words, an upset is only a problem if the affected part of the processor is used [4]. For example, imagine a program that calculates a value and prints that value to a screen but then never uses that value again. If that value were corrupted by an SEU after it was printed, the event would have no effect on the output of the program. This can be considered a benign fault [4].

The second outcome is that the corruption is detected. This detection could be done by checking the results of an application against a correct version. The fact that the error is detected provides an opportunity to try to mitigate the situation. Mitigation could include performing a particular operation a second time, reverting the program to a correct program location, or resetting processor execution. At a minimum, the detection lets the user know that something has gone wrong in the system.

The final outcome, the SDC, can be very problematic for the execution of the program. An SDC occurs when an upset corrupts the system, the error is not detected, and the system continues as if the entire state were correct [4]. Figure 2.2 provides a simple example of a processor data cache being corrupted. This example has no detection logic and the corrupted memory location is used, but the program continues as normal, silently corrupting the other parts of the program.

[Figure 2.2: Example of an SEE on a processor cache. The figure shows the code "my_variable = 10; while(1){ foo = 5*my_variable; print(foo) }", the cache location of my_variable at some time during operation, and the resulting program output.]

2.3.1 SDC Prevention

Though this thesis is not about fixing or preventing SDCs, error detection and correction (EDAC) was used throughout this work on the caches and main memory. EDAC was used to detect memory corruptions in order to count them. This section provides a brief background on two EDAC methods: error correction codes and parity bits.

One commonly used method is error correction codes (ECC). ECC allows corruptions caused by particle strikes to be detected and even fixed. Different kinds of ECC provide varying capability to detect and correct corruptions. For example, single error correct/double error detect (SECDED) ECC is a popular type of ECC commonly used in SRAM cells and caches [12]. As its name states, SECDED ECC allows a single-bit corruption to be corrected and a double-bit corruption, in which two bits of the same protected word are upset (for example, two adjacent bits), to be detected.

Another method commonly used for error detection is parity bits. Parity indicates whether there is an even or odd number of 1 bits in a segment of data. This method can detect that a bit in a data message has been flipped, but it cannot identify which bit was flipped.
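The scenario of Figure 2.2 and the parity check described above can be sketched together in a few lines of Python. This is only an illustrative sketch: the upset bit position, the helper names `parity_bit` and `flip_bit`, and the idea of storing one parity bit next to the cached value are assumptions made for the example and do not describe the MPSoC cache hardware.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if `word` holds an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of `word`."""
    return word ^ (1 << bit)


# Value and loop body follow Figure 2.2; the parity bit is computed at write time.
my_variable = 10                          # 0b1010
stored_parity = parity_bit(my_variable)

# Hypothetical upset: bit 2 of the cached copy flips (0b1010 -> 0b1110).
cached_value = flip_bit(my_variable, 2)

foo_expected = 5 * my_variable            # intended result
foo_observed = 5 * cached_value           # silently wrong if nothing checks the cache

# With parity, the mismatch exposes the corruption before the value is used.
upset_detected = parity_bit(cached_value) != stored_parity

print(foo_expected, foo_observed, upset_detected)  # prints: 50 70 True
```

Without the final comparison the program would simply keep printing 70, which is the silent data corruption case. With it, the upset is flagged, although the check alone does not reveal which bit changed.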

Parity is also ineffective if an even number of bits (for example, two bits) have been flipped, because the parity bit will indicate that the message is correct even though multiple errors have occurred [13].

2.4 SEEs on FPGAs

This section describes how the soft errors described in Section 2.2 specifically affect the programmable logic of SRAM FPGAs. FPGAs have a large amount of resources that are dedicated to defining the current state of the device. All of the elements that make up this state are vulnerable to SEEs. This includes logic elements, like flip-flops and DSPs, and routing resources, such as multiplexers and programmable interconnect points (PIPs) [9]. The configuration of these logic and routing elements is held in the configuration RAM (CRAM) of the FPGA. An SEE can change the state of the FPGA, which can then corrupt logic functionality or routing, as shown in Figure 2.3.

Figure 2.3: Single-event effects on FPGA CRAM from [14]

A change in logic functionality is shown on the left side of Figure 2.3. A configuration bit that defines a logic gate of the FPGA can be changed, creating a different gate than the intended one. This can cause the wrong signal value to propagate throughout the circuit. The inputs to the gate can also be connected to the wrong input signal, causing the gate to evaluate the wrong set of signals [9, 15, 16].
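The logic corruption on the left side of Figure 2.3 can be illustrated with a toy model of a look-up table. The sketch below assumes a 2-input LUT whose four truth-table bits are packed into an integer; this encoding and the names `lut_eval`, `truth_table`, and `AND_INIT` are hypothetical simplifications and do not reflect the actual CRAM layout of the device, whose LUTs are larger.

```python
def lut_eval(init: int, a: int, b: int) -> int:
    """Evaluate a 2-input LUT whose truth table is stored in the bits of `init`."""
    index = (b << 1) | a          # inputs select one of four configuration bits
    return (init >> index) & 1


def truth_table(init: int) -> list:
    """Outputs for (a, b) = (0,0), (1,0), (0,1), (1,1), in that order."""
    return [lut_eval(init, a, b) for b in (0, 1) for a in (0, 1)]


AND_INIT = 0b1000                 # only the (1, 1) entry is set: an AND gate

# Hypothetical upset: configuration bit 0 flips, setting the (0, 0) entry.
upset_init = AND_INIT ^ 0b0001

print(truth_table(AND_INIT))      # prints: [0, 0, 0, 1]
print(truth_table(upset_init))    # prints: [1, 0, 0, 1]
```

A single flipped configuration bit turns the intended AND gate into an XNOR gate. Unlike an upset in a data register, the wrong function persists until the configuration memory is rewritten.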

SEUs can also change the FPGA routing or wire connections. These issues are shown on the right side of Figure 2.3. An SEU can cause a signal to be routed to the wrong connection, causing the wrong signals to propagate through the design. A wire can also be dropped completely from the design, creating a missing signal [9, 15, 16].

Another consideration is the memory elements used by the FPGA design. Memory elements such as block RAMs (BRAMs) can cause failures within an FPGA design similar to memory corruption in processors. An SEU that changes one of the BRAM bits makes that memory location invalid. These upsets could go unnoticed if the specific value does not matter or is not used in the design. If the BRAM bits are essential to the design, however, an upset could cause an SDC in the hardware design. To prevent SDCs from occurring, BRAMs typically use some form of ECC [17].

2.5 Radiation Metrics

In order to understand the significance of the results in this work, an understanding of several radiation metrics is needed. Though there are many metrics for radiation, two in particular are discussed in this thesis: the device cross section and failures in time (FIT). These metrics are used to describe the rate of SEUs on the device.

Before understanding the cross section and FIT of a device, we must understand how radiation is measured. The rate at which ionized particles pass through a given area is known as the flux and is generally expressed as particles/(cm²·hr) [18]. For terrestrial applications, we are typically concerned with the neutron flux at sea level, meaning the rate at which neutrons pass through a given area at sea level. A related metric is the fluence, which is the total number of particles that have passed through a given area after a specified time [18]. The fluence is generally expressed as particles/cm². The fluence that a particular device has received can be calculated by multiplying the flux at the device's location by the time the device has been exposed. The fluence is a key component in calculating the cross section of a device.

The cross section expresses, on average, how likely an incident particle is to cause an SEU. More specifically, it is a relation between the number of SEUs observed on the device and the amount of radiation fluence the device has been exposed to. This metric provides an idea of a device's sensitivity to ionized particles. The relation is defined as

$$\sigma = \frac{\text{number of SEUs}}{\text{total fluence}},$$

and is expressed in terms of cm² [19]. The larger the calculated cross section, the more sensitive the device is to SEUs. Most of the results presented in this work are expressed in terms of a cross section.

The FIT rate is defined as the number of failures per one billion hours of device operation [18]. Like the cross section, the FIT rate provides an idea of a device's sensitivity. The FIT can be calculated directly from the device cross section by multiplying the cross section by the flux and by one billion hours. For example, the FIT rate is typically generated for a device located at sea level. The neutron flux at sea level is specified as 13 neutrons/(cm²·hr), and a FIT can be generated for a specific cross section ($\sigma$) by

$$\text{FIT} = \sigma \times 13 \times 10^{9}.$$

The metrics presented in this work are approximations based on experimental observations, meaning that they are not exact. Therefore, these metrics are generally presented with an upper and lower bound within which the exact result may lie, called a confidence interval. The numbers within this work are presented using a 95% confidence interval. Generally, this confidence interval can be generated using

$$95\%\ \text{C.I.} = \frac{2\sqrt{\text{events}}}{\text{fluence}}.$$

More details regarding the confidence interval and calculations can be found in [19]. An example cross section and confidence interval calculation can be found in Appendix B.

2.6 Radiation Testing

The probability of detecting an SEE under normal terrestrial radiation conditions is low, so collecting data in the field would require many years. Radiation testing accelerates the collection of component SEE data in order to help ensure correct execution in the application field. To perform radiation testing, a beam of high-energy particles is created and the test device is placed in the flight path of the beam. These particles can induce SEUs on the test device so that the effects of SEUs can be studied in a controlled environment [19, 20].
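The metrics of Section 2.5 can be tied together with a short worked sketch. The event count, beam flux, and exposure time below are invented purely for illustration and are not results from this work. The function names are hypothetical, and treating the 95% confidence term as a symmetric half-width around the cross section is an assumption of this sketch.

```python
import math
from typing import Tuple

SEA_LEVEL_FLUX = 13.0  # neutrons/(cm^2*hr), the sea-level value quoted in Section 2.5


def fluence(flux: float, hours: float) -> float:
    """Total particles per cm^2 received after `hours` at a constant `flux`."""
    return flux * hours


def cross_section(events: int, total_fluence: float) -> float:
    """Cross section in cm^2: observed SEUs divided by total fluence."""
    return events / total_fluence


def confidence_95(events: int, total_fluence: float) -> Tuple[float, float]:
    """Lower and upper bounds using the 2*sqrt(events)/fluence half-width."""
    sigma = cross_section(events, total_fluence)
    half_width = 2.0 * math.sqrt(events) / total_fluence
    return sigma - half_width, sigma + half_width


def fit_rate(sigma: float, flux: float = SEA_LEVEL_FLUX) -> float:
    """Failures per one billion device hours at the given flux."""
    return sigma * flux * 1e9


# Hypothetical run: 100 upsets after 5 hours at 2e9 neutrons/(cm^2*hr).
total = fluence(flux=2.0e9, hours=5.0)   # 1e10 neutrons/cm^2
sigma = cross_section(100, total)        # about 1e-8 cm^2
lower, upper = confidence_95(100, total)

print(f"sigma = {sigma:.2e} cm^2 (95% C.I. {lower:.2e} to {upper:.2e})")
print(f"FIT   = {fit_rate(sigma):.0f}")
```

For these made-up inputs the cross section is about 1e-8 cm², the interval spans roughly 0.8e-8 to 1.2e-8 cm², and the sea-level FIT is about 130. Because the half-width scales with the square root of the event count, collecting four times as many events halves the relative uncertainty.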

Figure 2.4: Neutron flux rates at LANSCE ICE House from [24]

For this work, three high-energy neutron radiation tests were performed on the ZCU102 MPSoC at the Los Alamos Neutron Science Center (LANSCE) at Los Alamos National Laboratory (LANL) in New Mexico. LANSCE uses a linearly accelerated proton beam striking a tungsten spallation target to generate neutrons for neutron radiation testing [21, 22]. All experiments were performed on the 30° flight paths, which have a neutron spectrum similar to the neutron spectrum in the atmosphere caused by cosmic rays, matching the JEDEC standard for spallation neutron beams [18, 20, 23]. The neutron flux provided by the beam is roughly one million times greater than that of cosmic rays, depending on the altitude, as seen in Figure 2.4. This provides accelerated rates of exposure; one hour in the beam is equivalent to roughly one hundred years at aircraft altitudes [23, 24]. The results obtained from these neutron radiation test experiments help estimate device sensitivity for terrestrial applications.
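As a rough check on the acceleration quoted above, the relationship between beam time and equivalent field exposure can be sketched as follows. The sketch assumes that the factor of one million applies uniformly over the exposure, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365.25


def equivalent_field_years(beam_hours: float, acceleration: float) -> float:
    """Years of natural exposure matched by `beam_hours` at a given acceleration factor."""
    return beam_hours * acceleration / HOURS_PER_YEAR


# One hour in a beam that is one million times more intense than the natural flux.
years = equivalent_field_years(beam_hours=1.0, acceleration=1.0e6)
print(f"{years:.0f} years")  # prints: 114 years
```

The result is on the order of one hundred years, consistent with the equivalence stated in the text.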

CHAPTER 3. MPSOC ARCHITECTURE

The Xilinx Multiprocessor System-on-Chip (MPSoC) is a complex device that uses 16nm FinFET technology to combine multiple processors, a large amount of FPGA resources, and many I/O interfaces on a single chip die. These features combine to create a powerful computing machine with the advantage of architectural flexibility. This makes the MPSoC very attractive for a wide variety of embedded and high-performance computing applications such as motion control, high-security mobile radios, data center processing, and camera-based advanced driver assistance [3]. This chapter describes the different parts of the MPSoC architecture, with the majority of the information being taken from Xilinx's Technical Reference Manual [3].

3.1 MPSoC Device Overview

The MPSoC device is available in a variety of configurations with different processing resources and different FPGA sizes. The resources available to a specific part depend on which of three MPSoC device families it belongs to: CG, EG, or EV. The specific MPSoC chip used in this work is the ZU9EG. The ZU9EG can be considered as being in the middle of the family spectrum in terms of resources available and FPGA size.

The MPSoC device is separated into two major system regions, the processing system (PS) and the programmable logic resources (PL), shown in Figure 3.1. The PS contains a number of processors that execute software from main memory and an external DDR memory. The different processors in the MPSoC include a quad-core ARM Cortex-A53 application processing unit (APU), a dual-core ARM Cortex-R5 real-time processing unit (RPU), and, in the EG and EV MPSoC device series, an ARM Mali-400 MP2 graphics processing unit (GPU). Each processing system can operate independently of the other processors and can be organized to share overlapping DDR memory space. The PS also contains many fixed I/O interfaces and board peripherals such as USB and Ethernet.

Figure 3.1: Block diagram of the MPSoC EG series from [25].

The PL is part of the Xilinx UltraScale family and is composed of static random-access memory (SRAM) cells that provide a large amount of user-definable resources. These resources include block RAMs (BRAMs), configurable logic blocks (CLBs), and look-up tables (LUTs). The PL also provides a number of silicon-hardened components used for high-performance I/O, high-speed transceivers, and PL system monitoring. The PL and the PS can also communicate with each other through high-performance AXI interfaces, general I/O, and the processor configuration access port (PCAP).

The MPSoC has many other components that the chip uses for board security and communication between the many subsystems. Not all of these systems can be described in detail in this work; however, several board components will be discussed in this chapter. In particular, the on-chip memory, the configuration and security unit (CSU), and the platform management unit (PMU) will be described in more detail.

3.2 Processing System

3.2.1 Application Processing Unit (APU)

The ZU9EG's application processor is a quad-core ARM Cortex-A53 (see footnote 1). The Cortex-A53 processor is a mid-range, low-power, 64-bit processing unit. The A53 is built using the ARMv8-A architecture and supports both AArch32 and AArch64 execution states, which allow the processor to be run in 32-bit and 64-bit modes respectively [26]. The A53 is also backwards compatible with ARMv7 due to similar architecture and 32-bit support. The A53 can be clocked up to a maximum of 1200 MHz [27]. The processor has four different exception levels in which a program can be run and a two-stage memory management unit (MMU) which allows for hypervisor and guest modes. The different exception levels are used to add additional security.

Each Cortex-A53 core has its own separate 32KB L1 instruction cache and data cache. There is a shared 1MB L2 cache with a snooping unit monitoring the entries between the separate L1 caches. The caches are protected using two different EDAC methods. The L1 instruction cache is protected using parity bits. The L1 data cache and the shared L2 cache are protected by SECDED ECC.

Footnote 1: The CG MPSoC devices have only a dual-core ARM Cortex-A53 processor.

3.2.2 Real-Time Processing Unit (RPU)

The real-time processor is a dual-core ARM Cortex-R5 processor. The Cortex-R5 processor is a mid-range processor created for embedded, real-time applications. The processor is implemented using the ARMv7-R architecture with a maximum clock frequency of 500 MHz [27]. Each processing core has its own separate 32KB L1 instruction cache and data cache. Both L1 caches are protected by SECDED ECC. Unlike the APU, the RPU does not have an L2 cache.

The two cores of the Cortex-R5 processor can be run individually or in lock-step mode. Running the processors in lock-step allows one of the processors (RPU1) to become a redundant copy of the other processor (RPU0). This is used for increased reliability and for detection of errors if the data of one of the processors is corrupted. The RPU has dedicated hardware for lock-step as opposed to using a virtual lock-step mode.

Each RPU core also has 128KB of user-accessible tightly-coupled memory (TCM), which provides low-latency memory with predictable read and write times. The TCM is divided into two banks of 64KB, allowing for concurrent accesses to both of the banks. When the RPU is run in lock-step, the TCMs of both cores can be used as a single 256KB TCM with two banks of 128KB. The TCM is protected by single-bit correction ECC.

3.3 Programmable Logic

The FPGA portion of the ZU9EG chip uses a state-of-the-art 16nm FinFET technology node. FinFET technology nodes allow for higher configuration RAM (CRAM) density, faster switching speeds, and less leakage current. These improvements result in higher performance and lower power consumption when compared to planar technologies [28].

Many resources are available for user programmability. The available user-programmable PL resources for the ZU9EG are shown in Table 3.1. Other resources available in the programmable logic include many hardened blocks such as 328 high-speed I/O, 24 GTH 16.3 Gb/s high-speed transceivers, and a system monitor (SYSMON) IP. The SYSMON monitors the temperature, voltage, current, and other factors of the FPGA fabric.

Table 3.1: Available ZU9EG PL resources from [25].

Resource        Amount Available
Logic Cells     600,000
CLB FFs         548,000
CLB LUTs        274,000
DSP slices      2,520
BRAM (36Kb)     912 (32.1Mb)

The CLBs and LUTs of the ZU9EG follow the same architectural style as the other Xilinx UltraScale FPGA devices. One of the biggest differences between the UltraScale CLBs and those of other Xilinx FPGA series is that two independent CLB slices have been combined into a single slice. This combination of slices allows for greater logic and more efficient routing. Another significant improvement is the expansion of the CLB carry chain from 4 bits to 8 bits for faster arithmetic [29].

The PL can interface with the PS through many PS-PL interfaces. These interfaces allow for greater connectivity and versatility of the device. The main features of these interfaces are AXI interfaces, 32 general-purpose input and 32 output bits (GPIO), 16 shared interrupts, and several dedicated direct memory access (DMA) modules. The AXI interfaces also provide cache-coherent interconnects, FIFOs, and a system memory management unit. These interconnection features provide for effective integration of user-created hardware with the processors.

Table 3.2 shows the estimated FIT rate for the CRAM and BRAM cells of the MPSoC based on [30]. The table shows the number of anticipated errors per megabit (Mb). The flux used in this thesis follows the JEDEC standard [18], which is based on the flux in New York, USA (i.e., sea level).

Table 3.2: FIT rates for the MPSoC FPGA.

Device Component    Cross Section (σ)    FIT/Mb (see footnote 2)
CRAM
BRAM

Footnote 2: Failure in time. Based on the information provided in the Xilinx Reliability Report [30]. Calculated using the JEDEC standard [18] and the flux rate at New York, USA (sea level). Mb = 10^6 memory bits.

3.4 Other Chip Resources

3.4.1 On-Chip Memory (OCM)

On-chip memory (OCM) refers to memory that is on the same silicon as the processor itself. The term OCM typically includes L1 data and instruction caches. Some devices, such as the MPSoC, have an additional separate OCM often termed scratch-pad memory. This memory, unlike the caches, is attached to the same buses as off-chip memory (main memory). OCM provides high-speed access compared to off-chip memory [31]. This OCM is essential for the MPSoC's RPU as it ensures low memory access latency for real-time processing and applications [3].

The MPSoC's OCM contains 256KB of RAM that is divided into 4 banks. It supports high-bandwidth transfer using a 128-bit AXI-slave interface, double-width memory, and eight

separate exclusive access monitors. Proximity to the other components on chip allows for increased clock frequencies, up to 600 MHz. Transfers of 256 bits allow for the maximum bandwidth for the device. The OCM is protected by 64-bit SECDED ECC.

3.4.2 Configuration and Security Unit (CSU)

The configuration and security unit (CSU) is a processing block that manages system-level configuration. A simple block diagram of the CSU is shown in Figure 3.2. The CSU manages programming the PL, determining the boot mode, checking authentication certificates on software binaries, and programming the processor's first-stage boot loader (FSBL). The CSU also manages the tamper monitoring and response for the system.

Figure 3.2: Block diagram of the configuration and security unit from [3].

The CSU has a triple-redundant MicroBlaze processor with a dedicated 128KB ROM and a small 32KB private RAM that are protected by ECC [3]. The triple-redundant processor is designed to be a fault-tolerant processor. The CSU utilizes the SEE mitigation technique of triple modular redundancy (TMR), which creates three identical copies that run the same code and uses voters to vote on the correct output [32].

The CSU contains and manages several other sub-blocks, which include a direct memory access (DMA) module and the processor's side of the PCAP. The DMA can be set up to configure the processing system or the programmable logic. The DMA is routed to different locations by configuring the secure stream switch (SSS), which allows for different operations such as DMA loopback, PL configuration, secure PL configuration, secure PS configuration, and PS image load with simultaneous SHA3 calculation.

The most important feature of the CSU for the purposes of this work is programming the PL. Through the CSU, the PS can program the PL through the processor configuration access port (PCAP). This allows the PS to dynamically read and write the PL's configuration memory. In order for the PS to communicate with the PCAP, the CSU's SSS must be configured so that the PCAP receives information from the DMA and the DMA receives information from the PCAP.

3.4.3 Platform Management Unit (PMU)

The Platform Management Unit (PMU) is a dedicated user-programmable processor. The PMU measures power usage on the board, manages errors, and performs system initialization prior to boot. This subsystem utilizes a battery power mode to maintain security configuration and a real-time clock even when the system is powered off. The PMU was used in this work to enable several system error signals, including the system watchdog reboot signals and OCM ECC errors.

The PMU's triple-redundant processor is designed to be a fault-tolerant processor utilizing TMR, just like the CSU. The processor does not have any caches but has an off-chip 128KB SECDED ECC protected RAM. The PMU has a read-only memory (ROM) which contains the PMU startup sequence, interrupts, and power-up/power-down requests. After a power-on-reset (POR), the PMU checks the power and other peripherals for proper operation. The PMU then runs the CSU ROM code through a SHA-3/384 security check. Once the CSU is booted, the PMU manages the power-up and restart for the APU and RPU. The PMU also maintains the system power state at all times and includes PS-level error capture and propagation logic [3].

Each of the components discussed in this chapter plays a role in the experiments designed for the research of this thesis.
These tests will be described in detail in the following chapters. Understanding the purpose and features of each component will provide a framework for design decisions and testing processes.
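The triple modular redundancy used by the CSU and PMU processors relies on voters that select the value on which at least two of the three copies agree. The following minimal sketch illustrates that voting principle in software. It is not a model of the actual hardware voters in the MPSoC, and the function names and example values are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three redundant words."""
    return (a & b) | (a & c) | (b & c)


def vote_with_status(a, b, c):
    """Return the voted word and the indices of replicas that disagree with it."""
    voted = majority_vote(a, b, c)
    faulty = [i for i, word in enumerate((a, b, c)) if word != voted]
    return voted, faulty


# Hypothetical example: replica 1 suffers a single-bit upset.
golden = 0b1011_0010
upset = golden ^ 0b0000_1000
voted, faulty = vote_with_status(golden, upset, golden)
print(f"voted = {voted:#010b}, faulty replicas = {faulty}")
```

A single upset copy is outvoted by the other two, so the output stays correct. With only two copies, as in the RPU lock-step mode, a mismatch can be detected but there is no majority to indicate which copy is correct.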

CHAPTER 4. TESTING METHODOLOGY

Radiation testing is one method to obtain radiation effects data. A device can be placed in an accelerated radiation beam and monitored to collect SEE data. This chapter describes the testing methodology that was used during radiation tests on the MPSoC. These tests were performed over three different visits to LANSCE: December 2016, August 2017, and November 2017. This chapter also discusses in detail the setup used during testing and the devices that were used for gathering data and monitoring device state. A complete overview of the experiment is given and the physical setup at LANSCE is described. This chapter also introduces beam counting as a method for obtaining the neutron fluence and describes how beam counting is performed at LANSCE.

4.1 Methodology

Understanding the radiation effects within all of the components of a complex device like the MPSoC is very difficult. Unlike programmable processors and FPGAs tested in the past, the MPSoC contains many different processors with multiple cores, peripherals, sub-systems, and a large amount of FPGA resources all within the same device. In the past these sub-components were tested individually, but now they must be tested together. Tests should be designed to target individual parts and isolate the test component from other systems of the device [19]. However, it is very difficult to completely isolate any one part from the others on the board. Organizing specialized tests that target individual parts of the device and isolate them from other parts is challenging.

To complicate the testing further, there has been little previous radiation testing of complex systems like the MPSoC. No generally accepted methods for performing radiation tests have been developed yet for such a complex device. One of the major challenges in this work was preparing


tests that exercised individual components of the device without the results being influenced by other board components. An example where one component influences another is an FPGA test that uses the processor to pass data to the FPGA design. If there is a failure in the FPGA test, was the failure because the FPGA failed, or because the data was corrupted in the processing system and passed into the FPGA?

Tests were carefully prepared to collect radiation effects data on as many components of the MPSoC as possible. The MPSoC was placed in the neutron beam at LANSCE on three separate visits. At each visit to LANSCE, various tests were performed on both the programmable logic (PL) and the processing system (PS). In particular, these experiments collected data from the PL and the PS at the same time. The primary goal of these tests was to collect single-event effect radiation data on several key components of this complex device. One of the major challenges of these tests was instrumenting the device so that data from both device regions could be collected at the same time, as well as isolating these components from each other during the test. Each test was performed with the PS tests and PL tests running simultaneously but separately, as shown in Figure 4.1.

Figure 4.1: Simple block diagram of PS/PL separation for testing.

The PS part of the tests was performed by loading a software binary onto the processors of the MPSoC from an SD card. Execution output was generated in each binary and sent through the UART to a monitoring host computer. The host computer would record the execution

output from the binaries, add timestamps, and store the information. Most of these test programs included self-checking code to verify that the program was executing correctly and would output an error message if the program result was incorrect. All tests enabled processor errors and interrupts for monitoring processor state. These software and processor errors were counted during and after the tests to generate processor neutron cross section estimates.

The PL part of the tests was performed by programming a binary bitstream to the FPGA for configuration. This configuration was performed through various methods which will be discussed in Chapter 5. The main method was configuring over the JTAG port of the MPSoC using the Zynq-based JTAG configuration manager (JCM) created by the BYU Configurable Computing Lab (CCL) [33]. The JCM would configure the FPGA portion of the MPSoC independently of the processors. The JCM also provided information regarding SEUs in the FPGA CRAM and BRAMs and logged the data. These log files were retrieved through an Ethernet connection and used for counting FPGA CRAM and BRAM SEUs during and after the tests.

4.1.1 Power Modifications

During the first radiation test in December 2016, the ZCU102 evaluation board stopped functioning. Afterward, we determined that one of the power regulators on the board had failed. To prevent the regulators on other boards from failing during radiation testing, a method for powering the board

Figure 4.2: Wiring to bypass the power regulators on the ZCU102 evaluation board.

externally was designed. The schematics for the ZCU102 board showed several power regulators that powered the board, in addition to several voltage test points. The power regulators on a second board were bypassed by soldering wires to the test points (Figure 4.2) in order to supply specific power rails. These wires were brought out to standard female banana jack connectors so that the power could be supplied by an external bench power supply. The board was powered by supplying four different power rails externally from a Keysight N6705B power supply. The power supply settings were customized to carefully regulate the voltage and current for each of the four powered lines. This power setup was used for both the August and November 2017 tests. Details regarding the modifications can be found in Appendix A. This setup also provided the resources to monitor the voltage and current of the board, which allowed us to identify anomalous power events such as abrupt increases in current (discussed in Chapter 7).

4.1.2 Beam Counting Methodology

Figure 4.3: Dosimeter in beam path at LANSCE.

In order to measure the cross section of SEEs on the MPSoC, the experiment must measure the fluence of neutrons applied to the device under test (DUT). The fluence can be measured by beam counting, in other words, counting the number of ionized particles that are passing through the DUT. The number of ionized particles is typically detected by a special apparatus located at

the front of the beam flight path in the testing room. At the LANSCE facility, a dosimeter is used to count the number of neutrons, as shown in Figure 4.3.

To make the beam count easier to record, another Zynq-based board developed by the BYU CCL was used to keep track of the beam count. This board was designed specifically for LANSCE and is shown in Figure 4.4. The custom counter is attached to LANSCE's beam counter and keeps track of the neutron count updates. The output of the device is logged on the Zynq board and periodically retrieved by a remote desktop computer. This log is used to determine the total number of neutrons that passed through a device during a test period. The beam count log can also be used to determine when the neutron beam was down by looking for entries in the log that do not increase the neutron count, meaning that no neutrons were passing through the DUT. These logs allow for a better understanding of what occurred throughout each beam test and allow for cross section estimates of the DUT (see Appendix B).

Figure 4.4: Beam counter at LANSCE facility, showing the custom Zynq counter and the LANSCE beam counter.

4.2 Test Overview

The ZCU102 evaluation board was placed in the neutron flight path using stands and clamps as shown in Figure 4.5. The board was placed at a 90° angle with the beam's trajectory flight

Figure 4.5: ZCU102 evaluation board in flight path at LANSCE, November 2017.

path. The device was prepared by removing the heat sink and fan to avoid irradiating the fan, reducing the flux, and activating the heat sink. The beam was prepared by adjusting the beam hole diameter to a two-inch opening. The Keysight power supply, the BYU Zynq-based JCM, and all other peripheral devices were placed outside of the beam flight path to prevent radiation from upsetting these devices.

The Keysight power supply carefully monitored the voltage and current of the ZCU102 throughout the experiments. A Python script retrieved these values through an Ethernet connection multiple times a second and recorded the values on the host computer. The host computer could also use the Python script to send control commands to the power supply remotely through the Ethernet connection. This control allowed for real-time adjustments to be made, for example, powering off the board in response to a voltage or current reading.

The BYU JCM was used for monitoring and collecting SEU data from the PL through the MPSoC's JTAG port. The SEU data was counted by the JCM and logged by the host computer through an Ethernet connection. The JCM was also used for managing the configuration of the PL. All FPGA configuration was done by the JCM through Ethernet commands issued by the host computer.

The complete experiment setup is shown in Figure 4.6. The BYU JCM, power supply, and remote host computer were connected through an Ethernet switch. The BYU JCM was connected

to the ZCU102 evaluation board by a JTAG cable connector, which allowed for configuration readback through JTAG. The evaluation board was connected to the host computer through a UART cable for logging of the software test output. The host computer monitored the output from the BYU JCM, software tests, and the power supply.

Figure 4.6: Block diagram of physical setup. Blocks shown: power supply, power cables, host computer, Ethernet switch, ZCU102 (DUT) with SD and UART, BYU JCM, and JTAG cable.

Figure 4.7: Flow diagram for test execution. Steps shown: power-up through Keysight power supply; load software binary from SD card; JCM configures PL; reset beam counter; open beam shutter; testing; close beam shutter.

The execution of a typical test is shown in Figure 4.7. The ZCU102 evaluation board would be powered up by the Keysight power supply. The MPSoC would load the software binary from the SD card and begin execution of the program. Logging of the PS UART output began on the host computer as soon as execution of the processors began. The JCM would configure the PL

bitstream through the JTAG port. The beam counter would be reset to collect the neutron beam count for the current test. After the initial preparations, neutrons were allowed to flow through the DUT by opening the flight path's shutter. Once the test finished, the shutter was closed and the device would be prepared for another test.
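The beam count log described in the beam counting methodology can be processed to find both the total count for a test and the periods when the beam was down. The following is a minimal sketch of that idea, not the script used in this work. The log format, a time-ordered list of (timestamp in seconds, cumulative count) pairs, and the function name are assumptions. Converting counts to fluence in n/cm² requires a facility calibration factor that is not given here, so the sketch reports raw counts only.

```python
def summarize_beam_log(entries):
    """Summarize a beam count log.

    entries: time-ordered list of (timestamp_seconds, cumulative_count).
    Returns (total_counts, down_intervals), where down_intervals lists the
    (start, end) timestamps over which the count did not increase.
    """
    if not entries:
        return 0, []

    total = entries[-1][1] - entries[0][1]
    down = []
    start = None
    for (t0, c0), (t1, c1) in zip(entries, entries[1:]):
        if c1 <= c0:
            # No new counts between these entries: treat the beam as down.
            if start is None:
                start = t0
        elif start is not None:
            # Counts resumed: close the current down interval.
            down.append((start, t0))
            start = None
    if start is not None:
        down.append((start, entries[-1][0]))
    return total, down


# Hypothetical log: the beam is down from t=10 to t=30 and again from t=40 on.
log = [(0, 0), (10, 100), (20, 100), (30, 100), (40, 250), (50, 250)]
total, down = summarize_beam_log(log)
print(total, down)
```

Upsets logged during a down interval can then be excluded or examined separately, since no fluence was accumulated during those periods.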

CHAPTER 5. PROGRAMMABLE LOGIC TESTING AND RESULTS

The first major component of the tests was understanding the effects of neutron radiation on the programmable logic resources. The primary goal of the PL tests was to estimate the neutron cross section of the configuration memory (CRAM) as well as the internal user-accessible block memory (BRAM). The cross section was estimated by counting the number of SEUs through configuration readback methods. Another part of these tests was experimenting with several different readback and scrubbing methods. These methods include JTAG through the BYU JCM, ICAP through the Xilinx SEM IP, and baremetal PCAP.

5.1 FPGA Readback

In order to understand how the PL tests were performed, the idea of FPGA configuration readback must be described. This section discusses the basic principles and methods behind readback and the different ways that the configuration memory can be read.

In a readback, the configuration memory of an FPGA is read and stored. A readback can be performed on an FPGA device while the FPGA is running, without interrupting operation [34]. Performing a full readback of all configuration memory can be used to count the number of CRAM upsets. This is accomplished by identifying differences between a readback and a correct copy of the readback, known as the golden readback. This error count can then be used to generate a cross section for the FPGA device.

There are several ways that readback can be performed on an FPGA using different hardware access points. These access points include JTAG, the internal configuration access port (ICAP), and the processor configuration access port (PCAP) [35]. Each of these ports will be discussed separately in the following paragraphs.

The JTAG interface is a common and easy-to-use serial interface. On most FPGA boards, the JTAG port is brought out to a JTAG connector on the board [36]. This makes access to the JTAG

simple and convenient. JTAG allows for external configuration through a JTAG programmer. The serial interface is limited to a maximum of about 66MHz.

The ICAP is an internal module that allows the FPGA to be configured from within the device [35]. The ICAP provides opportunities for readback and scrubbing, but only from within the programmable logic itself. The ICAP supports a 32-bit interface with high configuration speeds of up to 200MHz on the ZU9EG device [37].

The PCAP is another internal interface that provides the processing system with access to the PL. Readback and scrubbing can be performed through the PCAP. Access to the PCAP is typically done through high-bandwidth DMA channels. The PCAP interface is only available on the Xilinx SoC and MPSoC devices [35].

5.2 FPGA Scrubbing

Scrubbing is a method of correcting upsets in an FPGA memory that builds upon memory readback. The objective of scrubbing is to identify upsets in the configuration memory (CRAM) cells of the FPGA through a readback and then to correct them [38]. A buildup of many upsets in the CRAM may cause a design to fail. Scrubbing is intended to correct the upsets so that they cannot build up in the CRAM, thus preventing the design from failing due to a buildup of upsets [34]. Scrubbing becomes an important mechanism in FPGA designs when the design is critical to the success of a mission.

Figure 5.1: Simple diagram of scrubbing: (a) FPGA readback, (b) golden readback, (c) write back correct value.

Generally, a scrubber corrects upsets in the CRAM by comparing a golden copy of the CRAM with a current copy. A simple example of scrubbing is shown in Figure 5.1. To begin

scrubbing, first a golden readback file is generated from the CRAM before a test begins, creating a perfect (i.e., golden) copy of the original configuration memory (part b of Figure 5.1). This golden readback is used throughout the test to identify any upsets that have occurred in the memory. Continuously during the test, new readback files are generated (part a) and then compared against the golden file (part b). Differences between the readback file and the golden file are identified as SEUs and corrected (part c). Corrections to the CRAM are performed by determining the bits that were upset and writing the correct bits back into the configuration memory frame, thereby repairing the contents of the CRAM [38, 39].

5.3 Programmable Logic (PL) Methodology

CRAM and BRAM upset data was collected independently from the PL using three different readback methods. Each method was used to count CRAM upsets, and the count was used to generate an estimate of the CRAM neutron cross section. The three readback mechanisms used in this study are external JTAG scrubbing [33], internal scrubbing with the Xilinx SEM IP [37, 40], and scrubbing utilizing the PCAP [35]. All three scrubbers were run independently during separate tests, as running all of the scrubbers at once would cause conflicts on the CRAM. Example log files for each can be found in Appendix B.

5.3.1 BYU JCM - JTAG Scrubber

The custom Zynq-based JCM, developed at the Configurable Computing Lab (CCL) of BYU [33], was used as the external scrubber of the MPSoC's programmable logic. The JCM was connected to the MPSoC by a JTAG cable to the on-board JTAG port as shown in Figure 5.2. The JCM consists of an Avnet MicroZed board which contains an XC7Z020 chip. The XC7Z020 is a system-on-chip with a dual-core ARM Cortex-A9 processor and programmable resources. The MicroZed is coupled with a custom breakout board to allow for high-speed external JTAG access, shown in Figure 5.3.
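The readback, compare, and repair cycle described in Section 5.2 can be sketched in software as follows. This is an illustrative in-memory model only: a bytearray stands in for the CRAM, whereas the scrubbers in this work read and write configuration frames through JTAG, the ICAP, or the PCAP. All function names and data values are hypothetical.

```python
def find_upsets(readback, golden):
    """Return (byte_offset, bit_index) for every bit that differs from golden."""
    if len(readback) != len(golden):
        raise ValueError("readback and golden must be the same length")
    upsets = []
    for offset, (r, g) in enumerate(zip(readback, golden)):
        diff = r ^ g  # set bits mark positions where the two copies disagree
        for bit in range(8):
            if diff & (1 << bit):
                upsets.append((offset, bit))
    return upsets


def scrub_pass(memory, golden):
    """Run one scrub pass over a bytearray standing in for the CRAM.

    Returns the number of upset bits found and repaired.
    """
    readback = bytes(memory)                # (a) read back the current contents
    upsets = find_upsets(readback, golden)  # (b) compare against the golden copy
    for offset, _bit in upsets:
        memory[offset] = golden[offset]     # (c) write back the correct value
    return len(upsets)


# Hypothetical example: inject two bit flips, then scrub.
golden = bytes([0x3C, 0xA5, 0x00, 0xFF])
cram = bytearray(golden)
cram[1] ^= 0x10
cram[3] ^= 0x01
repaired = scrub_pass(cram, golden)
print(repaired, bytes(cram) == golden)
```

Summing the value returned by each pass over a test gives the upset count that feeds the cross section estimate.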
The JCM runs an embedded Linux OS and custom software to manage the configuration and scrubbing of the MPSoC [41].

Figure 5.2: JCM connected to the MPSoC through JTAG.

Figure 5.3: The JCM with its daughter card.

The JCM was placed outside of the beam's flight path to avoid influence by neutron SEUs. An eight-foot JTAG cable is used to connect the JCM to the MPSoC. An initial readback is performed to generate the golden file, which is stored in the MicroZed's DDR memory. The custom software running on the JCM performs continuous readback operations, compares each readback file to the golden readback, identifies upsets, and scrubs (repairs) any detected upsets in the PL.

5.3.2 SEM IP - ICAP Scrubber

The second scrubber is Xilinx's soft error mitigation (SEM) IP [37]. The SEM IP is placed into the PL of the MPSoC and performs SEU detection and correction through the ICAP. The SEM IP can detect single-bit errors, and double-bit errors if the two errors are adjacent to each other. This also allows correction of multi-bit errors as long as there are only one or two-bit adjacent errors

per memory frame [37]. The SEM IP uses one of the MPSoC's PL UART connections to report detected and fixed errors.

5.3.3 Baremetal PCAP Scrubber

The third scrubber used in these tests was a software PCAP scrubber. The PCAP scrubber was built as a C program to run on the MPSoC's RPU dual-core Cortex-R5 processor in lockstep mode without an operating system (i.e., as a baremetal program). In order to configure the system for PCAP scrubbing, access to the CSU and its components is required. Commands are issued to the CSU to send data to the FPGA through the PCAP. The code for the PCAP scrubber is shown in Appendix C.

The process to set up the PCAP is shown as a flow diagram in Figure 5.4. Initialization of the PCAP begins by releasing the PCAP from reset, setting the PCAP in write mode, and configuring the CSU to program the PL from the PCAP. Additionally, the secure stream switch (SSS) of the CSU block must be configured to transfer data from the dedicated CSU direct memory access (DMA) to the system's PCAP and to pull data from the PCAP back to the DMA (see Section 3.4.2).

Figure 5.4: Flow diagram for initializing the PCAP on the MPSoC. Steps shown: PCAP released from reset; PCAP configured to write mode; CSU configured to program PL; SSS configured to send DMA to PCAP; SSS configured to send PCAP to DMA; PCAP ready.

Communication with the PCAP is similar to JTAG, requiring packets of information and the same commands. The Xilinx UltraScale Configuration Guide [42] was used extensively in

developing the PCAP scrubber, in addition to the command sequences used by the hybrid scrubber created by Aaron Stoddard in [43].

In normal operation, the PCAP scrubber initializes the PCAP, configures the PL, and performs an initial readback to create a golden file, which is stored in the DDR of the MPSoC. The scrubber then performs continuous readback of the PL, comparing each readback to the golden file. When errors are detected, corrected frames are written back through the PCAP.

Block RAM Readback

Although not a scrubber, similar scrubbing principles were used to count upsets within the BRAMs. BRAMs cannot be repaired by writing the correct bits back to them. Once a BRAM has been initialized during configuration, it cannot be reconfigured and corrected. This prevents BRAMs from being scrubbed.

The method for counting SEUs in BRAMs consists of creating a golden readback file of the BRAM configuration data and comparing each subsequent readback against it. Differences between a readback and the golden readback are identified as upsets and counted. Instead of repairing the BRAMs, the golden file is updated with the changes observed in the BRAMs, so that future readbacks identify each error only once. As a consequence, upsets build up and corrupt the original configuration of the BRAM.

5.4 Programmable Logic SEU Results

Using the number of upsets counted by each scrubber, estimates of the neutron radiation cross sections were generated using the method in [19]. The estimates determined from the BYU JCM, Xilinx SEM IP, and the baremetal PCAP scrubber are shown in Table 5.1. The results obtained from performing readback on the BRAMs are shown in Table 5.2. Sample logs from each scrubber and an example cross section calculation can be found in Appendix B.

Table 5.1: Per-bit cross section of MPSoC FPGA CRAM.
Columns: Number of Errors | Fluence (n/cm²) | Cross Section (cm²/bit) | +95% Confidence | -95% Confidence
Rows: JCM, SEM IP, PCAP, Xilinx

Table 5.2: Per-bit cross section of MPSoC FPGA BRAM.
Columns: Number of Errors | Fluence (n/cm²) | Cross Section (cm²/bit) | +95% Confidence | -95% Confidence
Rows: BRAM, Xilinx

1 From [30]. Xilinx uses a 90% confidence interval instead of the 95% used in the table.

The estimated cross sections vary slightly between the three scrubbers. This variability may be due to the functionality and nature of the different scrubbers. Though there is variability, the error bounds of some of the cross sections overlap. The overlapping regions are not very large, however, and the JCM and PCAP error bounds do not overlap at all. Regardless, all three of the scrubbers are within the error bounds of the cross section generated by Xilinx.

One factor that could contribute to the varying cross sections is the location of the scrubber, which can influence its effectiveness and reliability. For example, the SEM IP is placed within the fabric of the PL, meaning that it can operate at higher speeds than the external JCM. On the other hand, since the SEM IP is in the FPGA, it is also in the path of the beam, making the SEM IP susceptible to radiation-induced upsets. The BYU JCM is placed outside of the beam's flight path, reducing the amount of radiation that might influence the device.

Another possible influencing factor is the speed of the scrubber. For example, during these experiments the PCAP scrubber took about 12 seconds to perform a full scrub of the entire PL, while the BYU JCM took 7 seconds. It is possible that the PCAP scrubber missed several SEU counts because of its slower scrub rate. The speed of the PCAP scrubber has since been improved so that it can perform a full scrub in 5 seconds, but no tests have yet been performed with the faster scrubber.

The neutron cross section estimate using the JCM scrubber is compared to other Xilinx device families in Figure 5.5. The cross section of the PL for the MPSoC is 27.7x smaller than that of the 28nm devices (7-Series family) and 10.1x smaller than that of the 20nm devices (UltraScale family).
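The per-bit cross section estimates discussed above can be illustrated with a short Python sketch. It assumes the usual definition (errors divided by the product of fluence and number of bits) and approximates the 95% bounds with a normal approximation to Poisson counting statistics; the exact interval method of [19] may differ. The function names and the numeric inputs are hypothetical and are not taken from Table 5.1.

```python
import math

def cross_section(num_errors, fluence, num_bits=1, z=1.96):
    """Estimate a cross section (cm^2 per bit) with approximate bounds.

    Assumption: upsets are Poisson distributed and the interval is the
    normal approximation N +/- z*sqrt(N). The method in [19] may differ.
    """
    if num_errors < 0 or fluence <= 0 or num_bits <= 0:
        raise ValueError("num_errors must be >= 0; fluence and num_bits must be > 0")
    per_error = 1.0 / (fluence * num_bits)
    half_width = z * math.sqrt(num_errors)
    lower = max(num_errors - half_width, 0.0) * per_error
    upper = (num_errors + half_width) * per_error
    return num_errors * per_error, lower, upper

def intervals_overlap(a, b):
    """True if two (estimate, lower, upper) results have overlapping bounds."""
    return a[1] <= b[2] and b[1] <= a[2]

# Hypothetical inputs for illustration only (not measured values).
jcm = cross_section(num_errors=400, fluence=1e11, num_bits=1e8)
pcap = cross_section(num_errors=300, fluence=1e11, num_bits=1e8)
print("JCM estimate and bounds:", jcm)
print("PCAP estimate and bounds:", pcap)
print("Bounds overlap:", intervals_overlap(jcm, pcap))
```

With a larger error count the relative width of the interval shrinks, which is the same effect that narrows the error bars when more fluence is collected.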

Figure 5.5: Per-bit cross section comparison of Xilinx FPGA CRAM based on [30].

The error bars on the 16nm MPSoC shown in Figure 5.5 are much smaller than the error bars of previous generations because the fluence was much larger and therefore the number of errors counted was much higher.

As shown in Table 5.2, the estimated BRAM cross section is not within the error bars provided by Xilinx. There are a couple of possible reasons for this, and the uncertainty is mostly due to not knowing how Xilinx performed their BRAM test. For this thesis, the BRAMs were initialized to a pattern of zeros and ones. Xilinx may have initialized their BRAMs to all zeros or all ones. This difference in pattern could account for some of the difference in cross section, since the probability of an SEU causing a bit to change from a zero to a one is not the same as that of a one changing to a zero [11, 44].

Another reason the BRAM cross section could be different is that there may be more bits associated with the BRAMs than are taken into account in this calculation. The number of BRAM bits was calculated by multiplying the size of a single BRAM (36Kb) by the total number of BRAMs in the device (912). There may be additional bits associated with the BRAMs that Xilinx is aware of, but we are not. In summary, it is hard to know why this cross section is higher without knowing the methodology that Xilinx used to test the BRAMs and calculate their cross section.
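The BRAM counting method (compare each readback with the golden copy, count the differing bits, then fold the differences into the golden copy so each upset is counted once) and the bit-count arithmetic can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the readback data is a toy byte string, and the total assumes that 36Kb means 36 × 1024 bits.

```python
def count_and_absorb_upsets(golden, readback):
    """Count bits that differ from the golden copy, then update the golden.

    golden is a bytearray modified in place, so an upset that persists in
    later readbacks is reported only the first time it is seen.
    """
    if len(golden) != len(readback):
        raise ValueError("readback and golden must be the same length")
    upsets = 0
    for i, (g, r) in enumerate(zip(golden, readback)):
        diff = g ^ r
        if diff:
            upsets += bin(diff).count("1")  # number of flipped bits in this byte
            golden[i] = r                   # absorb the upset into the golden
    return upsets

# BRAM bit count used for a per-bit cross section (assumes 36Kb = 36 * 1024 bits).
BRAM_BITS = 36 * 1024 * 912

golden = bytearray(b"\x00\xff\x0f")
first = count_and_absorb_upsets(golden, bytes([0x01, 0xFF, 0x0F]))   # one new upset
second = count_and_absorb_upsets(golden, bytes([0x01, 0xFF, 0x0F]))  # same data again
third = count_and_absorb_upsets(golden, bytes([0x01, 0xFD, 0x0F]))   # one more upset
print(first, second, third, BRAM_BITS)
```

Because the golden copy tracks the upset state rather than the original contents, the total count is the sum over all readbacks, while the BRAM contents themselves drift away from their initial pattern.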

SEM IP Sensitivity

The SEM IP experienced several failures due to neutron SEUs. The SEM IP would get stuck in its idle mode and be unable to return to its observation mode (readback and scrubbing mode). This is most likely due to parts of the SEM IP being corrupted by upsets in the CRAM where the SEM IP is located. The SEM IP also uses a number of BRAMs for its operation, and upsets within these BRAMs could affect its operation.

During the tests, we tried to reconfigure the PL through the JCM to fix these failures of the SEM IP. A reconfiguration of the PL worked in most cases; however, on several occasions a simple reconfiguration of the SEM IP resulted in another failure soon after. A complete reboot of the MPSoC was the most effective method to remedy this issue. A neutron cross section for the SEM IP is estimated in Table 5.3. Only the first failure was counted if a reconfiguration quickly resulted in a subsequent failure of the SEM IP.

Table 5.3: Cross section of the SEM IP.
Columns: Number of Errors | Fluence (n/cm²) | Cross Section (cm²) | +95% Confidence | -95% Confidence
Rows: SEM IP

CHAPTER 6. PROCESSING SYSTEM TESTING AND RESULTS

The next component of these tests was understanding the effects of neutron radiation on the processing system resources. These tests were performed on the MPSoC's ARM Cortex-A53 (APU) and ARM Cortex-R5 (RPU). There were two goals for these tests: first, to study and estimate a functional cross section of the entire processing system executing a variety of software benchmarks; second, to estimate a cross section for the associated memories of the MPSoC, including the OCM and processor caches. Sample log files for many of the benchmarks can be found in Appendix B, and code samples in Appendix D.

6.1 Processing System (PS) Methodology

The PS tests were performed simultaneously with the PL tests. The main idea of the PS tests was to program the processors with a benchmark that utilized a predetermined subset of device resources, for example the caches but not main memory (DDR). Each benchmark sent periodic output messages, known as heartbeats, through a UART connection to the host computer, in addition to error detection messages. The purpose of the heartbeat message is to show that the board is still functioning and performing the benchmark. The error detection messages are the main focus of the PS tests, as they were used to detect SEU events and to test for processor hang events on the MPSoC processing system.

Data was collected from the PS through the execution of binaries, each stored on a separate SD card. The software binary was automatically loaded from the SD card to the processor on power-up by a first-stage bootloader (FSBL). Each of the software binaries was created with a specific purpose, targeting specific board components like the cache memories or the on-chip memory (OCM). A variety of different software benchmarks were used and are described in the sections below.

An essential component of the processor tests is the detection of upsets in the form of output errors (SDCs) or hang events. SDCs, in the context of these software tests, refer to SEUs that cause the program data to become corrupted. This corruption is most often identified as incorrect calculations performed by the program. Hang events occur when the normal execution flow of the processor stops, which results in the termination of the software. SDCs and hang events were identified by using self-checking code and a system watchdog timer (WDT) respectively.

Each benchmark contained self-checking code. SDCs in the processor are determined by comparing the output of the test with a correct version of the answer called the golden. The golden output is calculated before the test begins and stored in a data structure in the off-chip DDR memory. Each iteration of the benchmark is compared against the golden for errors. If any SDC occurs in the benchmark operations, or the golden itself gets corrupted, specific error output messages are sent through the UART. If the same upset appears in multiple sequential iterations, then the golden is assumed to have been corrupted and is recalculated. The SDCs are determined by parsing the output of the benchmarks for these specific messages.

A built-in WDT was enabled to detect processor hang/crash events. A WDT is a simple counter that counts from a set value down to zero. For these tests the counter was set to count for 20 seconds, the exact value being device specific. If the counter reaches zero, then the WDT sends an interrupt to the processor telling the system to reboot, causing the FSBL to reload the software benchmark. It is the responsibility of the program to reset the WDT periodically so that the WDT never reaches zero. If the program fails to reset the WDT because the control flow of the program was upset (i.e., not an SDC), then the WDT will cause the system to reboot. Reboot messages are also easily identified by parsing the program output after the tests have finished.

All of the benchmarks generated output that was transmitted through the UART and logged by the remote host computer. Output included benchmark progress, system interrupts captured, and errors detected. SDCs and hang events can be identified and counted by post-processing the log files. The number of SDCs and hang events was then used to generate a functional neutron cross section for an application running on the MPSoC processors.
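The post-processing step can be sketched in Python as shown below. The message strings `SDC ERROR` and `BENCHMARK START` are hypothetical placeholders, since the actual log formats are given in Appendix B and are not reproduced here. The sketch also assumes that every start header after the first one was caused by a WDT reboot, and it computes a functional cross section as events divided by fluence.

```python
import re

# Hypothetical message formats; the real benchmark logs may differ.
SDC_PATTERN = re.compile(r"\bSDC ERROR\b")
START_PATTERN = re.compile(r"\bBENCHMARK START\b")

def count_events(log_lines):
    """Count SDC messages and reboots (hang events) in a UART log."""
    sdcs = 0
    starts = 0
    for line in log_lines:
        if SDC_PATTERN.search(line):
            sdcs += 1
        elif START_PATTERN.search(line):
            starts += 1
    # Assumption: the first start header is the initial power-up,
    # and each additional header follows a watchdog reboot.
    hangs = max(starts - 1, 0)
    return {"sdc": sdcs, "hang": hangs}

def functional_cross_section(events, fluence):
    """Functional cross section in cm^2 for a whole application."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return events / fluence

sample_log = [
    "BENCHMARK START",
    "heartbeat 1",
    "SDC ERROR at iteration 12",
    "heartbeat 2",
    "BENCHMARK START",
    "heartbeat 1",
]
counts = count_events(sample_log)
print(counts)
```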

6.2 Software Processor Benchmarks

A variety of benchmarks were created to test the MPSoC, modeled on the benchmarks in [45]. In particular, a 128-bit advanced encryption standard (AES) benchmark, a matrix multiply benchmark, and a Dhrystone benchmark were developed. Most of these benchmarks were designed to be small enough to fit in the system's caches, but large enough to fill the L1 and L2 caches. The DDR would not influence these tests much, as it is protected by ECC and is outside of the direct path of the neutron beam; therefore, most of the data was placed in the caches.

The majority of the benchmarks were designed as baremetal applications. Baremetal means that the software runs directly on the processing unit without the aid of an operating system. The AES-128 and matrix multiply benchmarks were baremetal applications, where the software binary was programmed to the processing unit. The Dhrystone benchmark, on the other hand, was started by an embedded Linux operating system in order to test multiple cores and an operating system.

Advanced Encryption Standard (AES)

The AES encryption test uses a 128-bit cipher (AES-128) to encrypt and decrypt data using the same key for both operations, which is known as a symmetric cipher. This requires that both the sender and the receiver know the key. For 128-bit keys, there are ten rounds or iterations of the encryption. Each round consists of several processing steps, which include substitution, transposition, and mixing of the columns [46]. The first step of the process is to put the data into an array so that the processing steps can be performed.

A typical operation of the AES-128 encryption test is as follows: the binary starts by initializing the WDT, printing a beginning header, and starting repeated iterations of AES encryptions and decryptions. Each run of the AES performs four tests using different input, key, and cipher arrays. The first set of input and golden files are checked for correctness. The ten rounds of encryption follow, and the input and the encrypted golden are checked. Finally, the input is decrypted and checked for correctness. SDCs are detected, and error messages printed, if the input (encrypted or decrypted) differs from the corresponding golden array during any of the checks. The WDT is reset periodically throughout the test.

6.2.2 Matrix Multiply

The matrix multiply benchmark performs a standard integer matrix multiply operation. Matrix multiplication takes two matrices, A of size m × n and B of size n × p, and creates a result matrix C of size m × p. The values in the rows of matrix A are multiplied by the values in the columns of matrix B and summed together to create each value in matrix C [47]. This matrix multiply benchmark uses input matrices of size which creates a results matrix. Each time the matrix multiply operation is complete, the benchmark checks the results matrix against a precomputed golden results matrix. The self-checking part of the program ran significantly faster than the matrix multiplication. Each iteration took about 8 seconds to run.

A typical operation of the matrix multiply test is as follows: the binary starts by initializing the WDT, printing a beginning header, initializing the matrices, and starting repeated iterations of the matrix multiply. The matrix multiply is performed and the result matrix is checked against the golden matrix. If there are any differences between the result matrix and the golden, errors are sent to the UART output. Additionally, the input matrices are recreated and a new golden is generated. This resets the matrices in case the golden matrix was corrupted. The WDT is reset periodically throughout the test.

Linux and Dhrystone

This test comprises a Linux operating system (OS) running several instances of the Dhrystone benchmark, one instance on each core. This Linux OS is an embedded version of the operating system built by Xilinx to run on embedded devices [48]. The Linux OS was run on the ARM Cortex-A53 and built using symmetric multiprocessing (SMP), which allows Linux to use all four cores of the A53's APU.

The purpose of using Linux is that we anticipated Linux having a larger cross section than the baremetal tests due to its large footprint in memory. Additionally, Linux has much more control flow that can be corrupted. To increase the likelihood of observing errors, Linux was run more than the other benchmarks. This increased likelihood of errors is because of the overhead of running an operating system, using virtual memory, context switching, and system interrupts.
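The self-checking pattern used by the matrix multiply benchmark in Section 6.2.2 can be sketched in Python as follows. This is a minimal illustration, not the baremetal C benchmark: the matrix sizes, the fill pattern, and the function names are all hypothetical.

```python
def matmul(a, b):
    """Standard integer matrix multiply: (m x n) times (n x p) gives (m x p)."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def make_inputs(m, n, p):
    """Build deterministic input matrices (the fill pattern is hypothetical)."""
    a = [[(i * n + j) % 7 for j in range(n)] for i in range(m)]
    b = [[(i * p + j) % 5 for j in range(p)] for i in range(n)]
    return a, b

def run_iteration(a, b, golden):
    """Multiply, then return the (row, col) positions that differ from the golden."""
    result = matmul(a, b)
    return [(i, j)
            for i, row in enumerate(result)
            for j, value in enumerate(row)
            if value != golden[i][j]]

a, b = make_inputs(3, 4, 2)
golden = matmul(a, b)          # computed once before the test loop starts
clean = run_iteration(a, b, golden)

a[0][0] += 1                   # emulate an upset in an input matrix
mismatches = run_iteration(a, b, golden)
print(clean, mismatches)

a, b = make_inputs(3, 4, 2)    # recreate the inputs and golden after an error
golden = matmul(a, b)
```

Recreating the inputs and the golden after every iteration, as the benchmark does, means a corrupted golden cannot keep producing the same mismatch indefinitely.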

Dhrystone is an integer benchmark created by Dr. Reinhold Weicker with the intention of measuring the performance of computer systems. The benchmark was designed by analyzing several popular benchmarks of the time and combining good practices from those benchmarks [49]. Dhrystone was previously a more popular benchmark. It has since been replaced by other benchmarks, such as CoreMark, because of its many weaknesses, such as the possibility of cheating the benchmark to make the processor appear better than it really is. The benchmark is also small, so it fits easily into the caches and does not stress the other memory systems of the processor [50]. However, these weaknesses are not important for the purpose of this test. This test is not designed to generate an accurate benchmark result for processor comparison. The purpose of the benchmark is to keep the processor busy and utilize the processor caches. This benchmark outputs multiple lines of computed strings and numbers. These are the same after each iteration, so the output can be easily post-processed to check for SDCs.

A typical operation of this test is as follows (shown in Figure 6.1): Linux begins booting on power-up through the first-stage and second-stage bootloaders. A WDT is started during the first-stage bootloader. Once the Linux OS has booted completely (about 10 seconds), a start-up script begins four instances of the Dhrystone benchmark, each running on a different core. The start-up script also begins a WDT application that resets the WDT and checks to make sure that each instance of Dhrystone is still running. If one of the instances has terminated, the WDT application will try to restart the instance. If restarting the Dhrystone instance fails multiple times, a check in the application fails and the application no longer resets the WDT, causing a system reboot. After the tests, the log files were parsed for SDCs and hang events (reboots).

Figure 6.1: Flow diagram for Dhrystone benchmark running in Linux. (In the diagram, the WDT application waits for a WDT reboot after three failed attempts to restart a Dhrystone instance.)

Processor Results

Surprisingly, no SDCs or crashes were seen in any of the software tests. With over n/cm² neutron fluence, there were no positively identified processor SDCs or processor hangs. Though no errors were observed, the cross section for the software processor benchmarks was generated assuming that there was one error, to give a worst-case cross section [19]. The estimated cross section for the software processor benchmarks is shown in Table 6.1. With more testing, it is likely that the estimated neutron cross section would be much lower than reported here.

Table 6.1: Software executable cross section operating on the Cortex-A53 APU.
Columns: Number of Errors 1 | Fluence (n/cm²) | Cross Section (cm²) | +95% Confidence | -95% Confidence
Rows: AES, MxM, Lnx/Dhr (each row lists 1* error and reports the cross section as an upper bound, "<")

1 Though no software errors were observed, a value of one was used in the calculation of these results. This is used to show a worst-case cross section estimate [19].

This MPSoC result is quite different from what we have previously observed on the Zynq SoCs. In a similar test on a Zedboard, with embedded Linux running a single Dhrystone instance, the Zynq-7000 ARM Cortex-A9 processor crashed frequently under neutron radiation. With the processor caches enabled, the software test failed 71 times in about 3 hours, meaning the processor crashed about every 2.5 minutes. With the caches disabled, the software test failed 53 times in about 151 hours, meaning the processor crashed about every 3 hours. In contrast, the MPSoC ran Linux with caches and four instances of Dhrystone for about 58 hours without a single crash or SDC. This comparison shows how robust the MPSoC PS is compared to previous SoCs.

There are probably two main reasons why the neutron cross section for upsetting the processor is so low. First, the 16nm process results in a very low cross section, making it hard for neutrons to upset processor registers and control logic. Second, the proper use of ECC and interleaving of the memory cells is likely preventing multi-bit upsets from breaking a word in memory. As shown in Figure 6.2, interleaving prevents multiple bits that are upset near each other from breaking ECC. Without interleaving, a three-bit upset would break SECDED ECC and the error would not be detected. ECC and memory interleaving help prevent upsets from propagating through the processor memories to the output of the benchmark as SDCs. When an upset bit is detected in one of the memories, the ECC fixes the error and transmits the correct value to the processor for use.

Figure 6.2: Memory without (A) and with (B) ECC interleaving. Panel A: memory without interleaving. Panel B: memory with 3-bit interleaving.

6.3 Memory Benchmarks

The second objective of these tests was to estimate a neutron cross section of the processor memories, including the caches and on-chip memory (OCM). Though no SDCs or processor crash/hang events were detected, upsets occurring within these memories were observed by capturing processor interrupts and reading processor performance registers.
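The effect of interleaving on a multi-bit upset can be illustrated with a small Python model. It assumes a simple layout in which, with k-way interleaving, physically adjacent cells belong to k different logical words in round-robin order; the actual physical layout of the MPSoC memories is not described in this thesis, so the mapping and function names here are hypothetical.

```python
from collections import Counter

def upsets_per_word(start, burst_len, word_bits, interleave):
    """Map a burst of adjacent physical upsets to logical words.

    Returns {word_index: number_of_upset_bits}. interleave=1 means the
    bits of one word are stored next to each other.
    """
    counts = Counter()
    for pos in range(start, start + burst_len):
        if interleave <= 1:
            word = pos // word_bits
        else:
            group, offset = divmod(pos, word_bits * interleave)
            word = group * interleave + offset % interleave
        counts[word] += 1
    return dict(counts)

def secded_outcome(bits_upset):
    """Classify what SECDED can do with a given number of upset bits in a word."""
    if bits_upset == 0:
        return "clean"
    if bits_upset == 1:
        return "corrected"
    if bits_upset == 2:
        return "detected"
    return "may be missed"

plain = upsets_per_word(start=2, burst_len=3, word_bits=8, interleave=1)
mixed = upsets_per_word(start=2, burst_len=3, word_bits=8, interleave=3)
print(plain, [secded_outcome(n) for n in plain.values()])
print(mixed, [secded_outcome(n) for n in mixed.values()])
```

In this model the same three-bit burst lands entirely in one word without interleaving, but is spread as one bit in each of three words with 3-way interleaving, so every affected word remains correctable.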

6.3.1 Caches

The benchmarks that were previously used for the software processor cross section were modified to target the cache memories by filling the caches without spilling into main memory. As data is read from the caches, the ECC finds cache memory upsets. Specialized registers for tracking ECC errors within the ARM Cortex-A53 caches, the processor memory error syndrome registers (CPUMERRSR), were used to count ECC cache errors. These registers report which memory location in the caches was upset, how many times it has been upset, and whether there were any other upsets at other memory locations [26]. These registers are reset by any write operation.

These registers are not interrupt driven, so code was added to check them regularly (see Appendix D). This code read the error syndrome registers for both the L1 data and L2 caches, updated global software error counters, and cleared the registers by performing a write. These operations were performed repeatedly during the cache benchmark tests. The total count of ECC errors was added to the heartbeat messages of the benchmarks as an additional printout.

These error registers were also used in benchmarks run with Linux to count the memory upsets. The error detection and correction (EDAC) kernel module reads the error registers and buffers print statements until the OS has an opportunity to print them out. The ECC errors can easily be counted through post-processing of the log files.

On-Chip Memory (OCM)

To test the OCM and estimate the terrestrial neutron cross section, a specialized baremetal software application was created for the R5. The program initializes the 256KB OCM to a known set of values, enables ECC, and enables all system interrupts and exceptions. The program then continuously reads from all OCM memory locations. As the contents of the OCM are read out, the internal ECC catches any single-bit upsets within the memory, corrects the error in the read data (i.e., the data is still incorrect in the OCM), and sends an interrupt to the processor signaling that an ECC error was detected and corrected. Since the OCM is protected by SECDED, the interrupts also indicate detected double-bit errors. Any memory location that triggered a system interrupt was then rewritten with the correct data to fix the upset. A dedicated software interrupt handler was used to catch the system interrupt and count each ECC error.
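The OCM read loop can be modeled in Python to show the bookkeeping involved. This is only a software model: the hardware ECC and its interrupt are emulated by comparing each word with the known initialization value, and the names `scan_ocm` and `INIT_WORD` as well as the 32-bit word value are hypothetical.

```python
INIT_WORD = 0xA5A5A5A5  # hypothetical known initialization value

def scan_ocm(memory, expected=INIT_WORD):
    """Read every word once, count ECC-style events, and repair the memory.

    Emulates the hardware: one flipped bit is a corrected error, two flipped
    bits is a detected (uncorrectable) error. Any location that raised an
    event is rewritten with the correct data.
    """
    corrected = 0
    detected_double = 0
    for addr, word in enumerate(memory):
        flipped = bin(word ^ expected).count("1")
        if flipped == 0:
            continue
        if flipped == 1:
            corrected += 1          # models the "corrected" ECC interrupt
        elif flipped == 2:
            detected_double += 1    # models the "detected" ECC interrupt
        memory[addr] = expected     # write back the known-good value
    return corrected, detected_double

ocm = [INIT_WORD] * 16
ocm[3] ^= 0x00000010               # inject a single-bit upset
ocm[9] ^= 0x00000003               # inject a double-bit upset
first_pass = scan_ocm(ocm)
second_pass = scan_ocm(ocm)        # memory was repaired, so nothing is found
print(first_pass, second_pass)
```

Writing the correct value back after each event is what keeps a single upset from being counted again on the next pass through the memory.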

Memory Test Results

The estimated neutron cross sections for the caches and the OCM are shown in Table 6.2. All of the detected errors recorded in the table were errors corrected by ECC. None of the cache or OCM memory errors propagated through the logic to become SDCs in the program. There were no ECC errors that could not be corrected, and no double-bit errors were detected. As previously described, the low cross section can most likely be attributed to the 16nm process and effective ECC on the memories.

Table 6.2: Per-bit cross section data from the processor memories.
Columns: Number of Errors | Fluence (n/cm²) | Cross Section (cm²/bit) | +95% Confidence | -95% Confidence
Rows: Caches, OCM

CHAPTER 7. HIGH CURRENT EVENTS

This chapter describes the tests and results that were obtained from monitoring abnormally high current on the MPSoC ZU9EG component. These tests emerged after observing a power anomaly during the first neutron test in December. On one of the evenings during that first neutron test, a problem occurred that prevented any more data from being collected from that test. After the test, serious effort was devoted to determining the cause of the problem. We hypothesized that an abnormally high current caused the VCCAUX regulator to fail, because afterwards the regulator did not respond correctly and the inductor connected to it was visibly damaged.

In order to detect these power anomalies, or current events, another MPSoC ZCU102 evaluation board was modified. These modifications include bypassing the on-board power regulators and powering the board from an external power supply. The power supply was continuously monitored in order to detect current events. These modifications were used during the August and November 2017 neutron tests. The purpose of these tests was to detect current events on the power supply and count the number of these events that occurred, not to characterize the shape or magnitude of the events. The event count provides an idea of the rate of the power anomalies on the ZU9EG.

7.1 High Current Event Monitoring Methodology

To protect the ZCU102 from any high current events, the board's power regulators were bypassed and the board was externally powered as previously described. In particular, the regulators for the 3.3V, 1.8V, 1.2V, and 0.85V power lines were bypassed and powered externally (Figure 7.1). The bypassed regulators include Util 3v3 and VCC3v3 (3.3V), VCCAUX and VCCOPS (1.8V), DDR (1.2V), and VCCINT, VCCBRAM, PSINTFP, and PSINTLP (0.85V).

Figure 7.1: Wiring to bypass the power regulators on the ZCU102 evaluation board. Figure 7.2: Keysight power supply powering and monitoring evaluation board.

Util 3v3 and VCC3v3 power most of the board's integrated circuit devices, which serve many different purposes including power regulation, attenuation, and bus transceivers, as well as the LEDs. The VCCAUX is the PL's auxiliary power rail and VCCOPS is the PS I/O supply voltage. The DDR is the main supply voltage for the DDR4 memory. VCCINT is the internal supply voltage for the PL and VCCBRAM is the supply voltage for the FPGA BRAMs. PSINTFP and PSINTLP are the supply voltages for the PS full power and low power domains respectively [27].

With this power setup, all power rails on the ZCU102 are powered except for the 5V power line. The 5V power line powers many board peripherals, and these peripherals are not necessary for the MPSoC to function properly. This means that all components of the processor, FPGA, DDR, UART, QSPI, etc. are powered. Without the 5V line, the Ethernet, HDMI, and USB are left unpowered. Other components such as the FMC connector and the MGTs were likewise not powered. Specific details about how the bypassing of the regulators was accomplished can be found in Appendix A.

The board was powered from an external Keysight N6705B power supply that allowed for careful monitoring of system voltage and current. This power setup is shown in Figure 7.2. The external Keysight power supply was monitored through a host computer running Python scripts (see Appendix E). These scripts continuously monitored the power supply's current and voltage for each individual power line. Threshold values were determined for each power line by measuring the nominal current under normal operation and adding about 1A, since no rail should draw an additional 1A under normal operation. Each current and voltage value was compared against its predetermined threshold value, and if any power line exceeded its threshold, the script caused the power supply to reboot the board by turning all power lines off and back on. This is what is considered a current event for the purposes of these tests. A flow diagram of this process is shown in Figure 7.3. The power supply was current limited to about 2A above nominal current in order to protect the board from any current events missed by the scripts, so the actual maximum current of the high current events was not measured. The nominal current, threshold, and limits are shown in Table 7.1.

Figure 7.3: Flow diagram of power supply monitoring script.

Table 7.1: Power supply current values used in these tests.
Columns: Power Line | Nominal Current (A) | Threshold Current (A) | Power Supply Current Limit (A)
Rows: 3.3V, 1.8V, 1.2V, 0.85V

7.2 External Power Supply Monitoring

Numerous anomalous current events were observed during the tests of August and November 2017. An analysis of these events showed a number of different types of events, including voltage spikes and current spikes. Figure 7.4 shows an example of several current events. Specifically, an event on the VCCINT line can be seen at about time 110,000 seconds. With a nominal current of 2.85A, the high current event caused an anomalous increase in current to 3.5A. The downward spikes represent events that caused the power supply script to power cycle the board. The current draw decreases in VCCINT, VCC3V3, and VCCAUX as the board powers up, while current increases on DDR during power up. It appears that power cycling the board following the VCCINT event did not allow the board to return to normal operation. It is possible that the short three-second power cycle time did not allow the capacitors to discharge completely. This kept the board in a semi-upset state, which can be seen in the irregular current right after the current spike.

Figure 7.4: Current plot of one test segment showing a large spike in current. Plot annotations: Current Event, "Semi-upset", Power Cycle.

Figure 7.5 shows a 14% jump in VCC3V3 current and a 6% jump in VCCINT current that remained stable for a time until a reboot was triggered. The upset was not high enough to trigger a reboot based on current alone, nor did it affect operation of the board during these events. Eventually, these persistent events did trigger a power cycle of the board. However, it is unclear whether the power cycle occurred because of the original persistent current jump or because a second upset caused a current event on VCCAUX.

Figure 7.5: Current plot of one test segment showing several increased current events. Plot annotations: Increased but stable, Power Cycle, VCCAUX Events.
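The threshold logic of the power supply monitoring scripts described in Section 7.1 can be sketched in Python as follows. This is not the Appendix E script: reading the Keysight supply is replaced by a list of sample dictionaries, the `power_cycle` callback is a stand-in for turning the rails off and on, and all rail values except the 2.85A VCCINT nominal current quoted in the text are hypothetical. Voltage checks would follow the same pattern and are omitted.

```python
def make_thresholds(nominal_amps, margin_amps=1.0):
    """Threshold for each rail is its nominal current plus about 1 A."""
    return {rail: amps + margin_amps for rail, amps in nominal_amps.items()}

def find_current_events(sample, thresholds):
    """Return the rails whose measured current exceeds the threshold."""
    return sorted(rail for rail, amps in sample.items() if amps > thresholds[rail])

def monitor(samples, thresholds, power_cycle):
    """Check each sample; power cycle the board and count an event when a rail trips."""
    events = 0
    for sample in samples:
        tripped = find_current_events(sample, thresholds)
        if tripped:
            events += 1
            power_cycle(tripped)
    return events

# VCCINT nominal is from the text; the VCCAUX value is a placeholder.
nominal = {"VCCINT": 2.85, "VCCAUX": 0.50}
thresholds = make_thresholds(nominal)

cycles = []
samples = [
    {"VCCINT": 2.90, "VCCAUX": 0.52},   # normal operation
    {"VCCINT": 3.50, "VCCAUX": 0.52},   # elevated, but below nominal + 1 A
    {"VCCINT": 2.90, "VCCAUX": 1.90},   # exceeds the VCCAUX threshold
]
event_count = monitor(samples, thresholds, power_cycle=cycles.append)
print(event_count, cycles)
```

The second sample shows why a persistent but moderate increase, like the one in Figure 7.5, does not trigger a power cycle on current alone.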

The only power lines for which current events were detected were the VCCAUX 1.8V and the VCCINT 0.85V power lines. A neutron cross section of the affected power lines is estimated in Table 7.2 using the number of power cycles caused by each particular power line. These results provide an idea of the rate at which one of these current events may occur on the ZU9EG device.

Table 7.2: Power event cross section results.
Columns: Number of Events | Fluence (n/cm²) | Cross Section (cm²/line) | +95% Confidence | -95% Confidence
Rows: VCCAUX, VCCINT, Total

A histogram showing the events that occurred on the VCCAUX line is shown in Figure 7.6. This histogram shows that all of the current events exhibited a jump in current at least 1.3A higher than nominal current. The majority of the current events were protected by current limiting of 2.0A, meaning that these events had a current jump of about 1.8A. Three of the VCCAUX errors were over-voltage errors. The voltage protection was set to 2.5V, which means that the voltage jump on these events was at least 0.70V.

Figure 7.6: Histogram categorizing events on VCCAUX power rail. Plot annotation: Current Limited.

Figure 7.7: Software flow of RPU baremetal power monitoring code. (Flowchart steps: initialize PMBus through I2C; initialize SYSMON in PS and PL; check currents from regulators; decision "Current below threshold?"; check temperature from SYSMON; decision "Temperature flatlined?"; send reboot signal through GPIO.)

7.3 On-Board Power Monitoring Test

After successfully using an external power supply to protect the evaluation board from high current events, attempts were made to create an on-board power monitor. On-board power monitoring means using the resources already available on the device to monitor the current and voltage of the device through the on-board processors. If the current and voltage can be monitored, it may be possible to address these current events without the need for external power monitoring. A success for this test would be to detect all current events and respond to each event using the on-board power monitor.

The general process for power monitoring was drawn from ideas in previous work. In [51], the on-board voltage regulators are used to monitor current and voltage. On-board temperature sensors were also used to measure increases in temperature, as increased current was always associated with increased junction temperature. The voltage regulators used on the board offer many features, including voltage and current limiting, voltage and current monitoring, temperature monitoring, and many other status and configuration options [52]. These features were used by the processors to monitor the power generated by the regulators through the system's PMBus interface (see Appendix E).

Figure 7.8: Block diagram of on-board power detection test setup. (Diagram blocks: host computer, Ethernet switch, Ethernet Netbooter power switches, Raspi, SD card, ZCU102 (DUT); connections: UART, GPIO, Ethernet.)

A baremetal program was designed to run on the ARM Cortex-R5 processor to monitor the power of the device. On-board power monitoring has two parts: using the on-board, off-chip power regulators and using the on-chip system monitor (SYSMON) [53]. A flow diagram of the software process is shown in Figure 7.7. The software initializes the PMBus through the I2C interface and initializes the system monitors in the PS and PL. The power regulators are read through the PMBus to monitor the current passing through the regulators for VCCAUX and VCCPSAUX. The SYSMONs are used to monitor the temperature of the PS and PL, respectively. The software continuously checks for jumps in current in the monitored regulators. If the current has jumped by more than 100mA, an event has probably occurred. The software also checks, through the SYSMON in both the PS and the PL, for jumps in temperature and for temperature flatlining (i.e., the device temperature is no longer fluctuating), either of which signifies that an event has probably occurred. This process is inexact and there will be false positives, meaning the software will report a current event even though no event occurred. The number of false positives compared to actual events is therefore also a metric for how effective the on-board power monitoring software is at detecting current events.
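The checks described above can be sketched in a few lines. The Python below is illustrative only: the thesis software is a baremetal program on the Cortex-R5, and the function names, the five-sample flatline window, and the assumption that a jump is measured between consecutive readings are all hypothetical. Only the 100mA threshold comes from the text.

```python
CURRENT_JUMP_THRESHOLD_A = 0.100  # 100 mA jump threshold, taken from the text
FLATLINE_WINDOW = 5               # hypothetical: identical readings needed to call a flatline


def current_jumped(previous_a, present_a, threshold_a=CURRENT_JUMP_THRESHOLD_A):
    """True when the current rose by more than the threshold between two readings."""
    return (present_a - previous_a) > threshold_a


def temperature_flatlined(readings_c, window=FLATLINE_WINDOW):
    """True when the last `window` temperature readings are all identical."""
    if len(readings_c) < window:
        return False
    recent = readings_c[-window:]
    return all(r == recent[0] for r in recent)


def event_detected(previous_a, present_a, temperatures_c):
    """Either condition is treated as a probable event that warrants a reboot request."""
    return current_jumped(previous_a, present_a) or temperature_flatlined(temperatures_c)


# A large rise is flagged; so is a recovery from a dip, which is one way
# a consecutive-sample check can produce false positives.
print(current_jumped(0.27, 3.92))
print(current_jumped(4.3, 5.4))
```

Because the check compares only consecutive readings, it cannot distinguish a rise above the nominal level from a return to it; comparing against a nominal baseline would be one way to tighten it.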

Figure 7.9: Raspberry Pi connected to MPSoC through GPIO.

A block diagram of the setup for this test is shown in Figure 7.8. A Raspberry Pi (Raspi) was connected to the ZCU102 board through a custom-made general purpose I/O (GPIO) cable (Figure 7.9). An Ethernet Netbooter power strip was used to power the devices. A Netbooter can remotely power cycle a power socket over Ethernet. A host computer was used to log output from the ZCU102 and the Raspi (see Appendix B.4). This setup was used to determine whether the power monitoring software could successfully detect a current event and signal that an event had occurred and that the system needed rebooting. The Raspi was used to detect the signal coming from the MPSoC and signal the Netbooter to power cycle the MPSoC's power socket, thereby preventing damage to the board.

At the end of this test, the test logs showed that the mitigation software was successful in detecting current events and could successfully send an external signal to request a reboot; however, the logs also contained many false-positive requests in addition to the actual requests. A total of 92 current events were detected: five were actual current events observed on the board and eighty-seven were false positives. Only 5.4% of the detected events were actual current events. These results are shown in Table 7.3.

The first actual current event was detected on the current from the VCCAUX power regulator, where the current jumped from 0.27A to 3.92A. In the previous external power supply test, this event would have been current limited to 2.0A, which provides some insight into the extent of

these VCCAUX high current events. The second event was from the VCCPSAUX regulator, where the current jumped from 5.2A to 6.36A. After this event, the current did not return to nominal levels, but remained around 6.2A. The remaining three events were cases in which the SYSMON temperature readings flatlined.

Table 7.3: On-board power monitoring detection results. (Columns: Total Detected, Actual Events, False Positives, Fluence (n/cm²), Ratio of Actual/Detected (%).)

Most of the false positives occurred because of a drop in the VCCPSAUX current followed by a return to the nominal current level, which the software detected as a current spike. The nominal current level for VCCPSAUX was about 5.2A, but the current at any one time would vary from 4.3A to 5.6A. When the current dropped to a lower value such as 4.3A and then returned to a higher value such as 5.4A, this was detected as an event. Though this type of event is in fact a current spike, it is not a high current event, and it could be remedied by taking nominal current levels into account in the current monitoring. Through these changes and more testing, a reliable on-board detection method is likely achievable.

7.4 Captured High Current Event

During the test of November 2017, a separate test was implemented with the objective of capturing a high current event on a ZCU102 evaluation board. Once the event is captured, the device is pulled out of the beam while maintaining power and moved elsewhere to be monitored. The device is then monitored over time to determine any lasting effects of the current event [54]. The lasting effects could include an inability to communicate with parts of the device (e.g., being able to communicate with the FPGA but not the processor), permanently increased current draw, or general changes in normal device behavior. A second ZCU102 was provided by NASA Goddard to perform the experiment. The board was placed in the LANSCE neutron beam and connected to an uninterruptible power supply (UPS) so that the board could be moved without loss of power (see Figure 7.10). Unlike the other ZCU102, no power modifications were made, meaning no regulators were bypassed.

Figure 7.10: Picture of the UPS at LANSCE. The board was running the PCAP scrubber and the on-board current monitoring code (see Section 5.3.3 and 7.3 respectively).

When the on-board monitoring code detected a current event, the event was verified by reading the power regulators through an external Maxim PowerTool PMBus reader [55]. While waiting for the experiment run to end, a second event was detected and verified. After the second event occurred, UART communication with the board was lost. Once the board was pulled out of the beam, several methods to mitigate the upset were attempted: a power-on reset requested through one of the external buttons, a processor reset requested through JTAG, processor programming through JTAG, and a few other similar attempts. None of the mitigation attempts worked, though some communication through JTAG was possible.

The board was kept powered on at Los Alamos National Laboratories (LANL) for 213 days (from November 18, 2017 to June 18, 2018). On June 18, Heather Quinn, a collaborator from LANL, performed a number of experiments on the board to learn more about the long-term effects of the high current event. She took a number of thermal images of the board, shown in Figures 7.11 and 7.12, looking for hotspots or points in the thermal image that appeared problematic. We had anticipated hotspots on the chip due to the current event, but none were present in the thermal images. The hottest points on the board were the regulators and their associated inductors, as seen in Figure 7.13.

Figure 7.11: Thermal image of ZCU102.
Figure 7.12: Thermal image of ZU9EG chip.
Figure 7.13: Thermal image of ZCU102 power regulators.

Attempts were made to clear the sustained current event on the board without powering the board off. The output voltage of the affected power regulators was slowly lowered until the current event no longer appeared. The output voltage was then raised back to nominal levels and communication with the board was again attempted, with no success. Communication over JTAG seemed to work better than before, but still did not work properly. The processor could not be programmed, but the PL was supposedly programmed successfully (the MPSoC reported that it had been programmed), though we were unable to verify the PL programming. The board

appeared to return to normal operation after a full power cycle of the system, showing no signs of permanent damage due to the sustained power event.

Having performed this test of capturing a high current event, there are several things that would be done differently if the test were performed again. The most significant change would be the type of event that is captured on the board. During the test, two events were captured on the board. The second event caused most of the communication with the board to be lost. If the test were performed again, a current event that causes the board to lose communication would be cleared by resetting the board. Only a current event that does not disrupt communication with the device would be maintained for long-term monitoring. The lack of communication meant that some mitigation tests, such as lowering the output voltage of the regulators, could not be analyzed appropriately.

The high current events observed on the ZCU102 MPSoC are a concern for the overall reliability of the device in radiation environments. These current events could possibly damage the device during operation and cause a loss of quality and/or service. Techniques for on-board mitigation, such as the one presented in this chapter, are an option for detecting and responding to these high current events and protecting the device from permanent damage.

CHAPTER 8. CONCLUSION

The Xilinx UltraScale+ MPSoC uses 16nm FinFET technology to combine the flexibility of an FPGA with multiple ARM processors. This makes the MPSoC a candidate for many high-performance computing applications. Estimating the neutron SEU sensitivity of the MPSoC is an important step in understanding how the device will behave in neutron and other radiation environments. Device testing in a neutron radiation beam is one method to obtain radiation effects data on a device.

A methodology was developed for testing the ZCU102 in a neutron beam that separates the PL and PS to collect data from both device regions separately and simultaneously. In general, this consists of a system performing configuration readback on the PL in parallel with a software benchmark running on the processors. Three different readback/scrubbing methods performed on the MPSoC were presented: external JTAG scrubbing through the BYU JCM, internal ICAP scrubbing with the Xilinx SEM IP, and PCAP scrubbing. Neutron cross section estimates for the configuration logic were obtained using each of the scrubbing methods. These estimates show a neutron cross section improvement of 10.1× over the previous 20nm UltraScale device series. The BYU JCM was also used to measure the sensitivity of the BRAMs and generate an estimated neutron cross section.

Several baremetal software tests were created to test the processor and several processor subsystems, including the caches and the OCM. These tests provided error detection methods in order to count the number of upsets and generate a software cross section. The processor tests did not result in any SDCs. All errors were corrected by EDAC methods, in particular ECC. The numbers of ECC errors detected in the caches and the OCM were used to calculate an estimated cross section for each.

High current events were experienced on the board, and a methodology was developed for carefully monitoring and protecting the ZCU102 board.
This methodology included bypassing the power regulators and powering the board through an external power supply. Most of the high current events occurred on the VCCAUX power line, resulting in current jumps of 1.25A or more. A methodology was presented for on-board detection of these current events, and this methodology was tested during one of the neutron test visits. The on-board detection software was able to successfully detect current events on the board, although many false positives were also reported. Several ideas for reducing these false-positive detections were presented.

8.1 Future Work

Though this work has provided a good starting point for the MPSoC device, more work is needed to fully understand the behavior of the device and provide better neutron cross section estimates. The processor software cross section would greatly benefit from more neutron testing. The cross section in its current state was calculated with no observed errors, so it is only an upper bound. Additional neutron testing data would lower this bound and provide a better estimate of the actual processor software cross section. Ideally, testing would be performed until software SDCs or hang events were detected.

There are still several more tests that could be conducted with the high current events on the device. More work in characterizing the shape and type of high current events that occur on the device would provide more information about mitigation techniques and limitations of the device. Additional development of on-board mitigation techniques could prove very successful in detecting and appropriately responding to these events. Several ideas for improving the on-board detection software were presented in this work.

The long-term effects of the high current events could also be studied in more detail. Additional studies could provide a greater understanding of the types of behavior that the MPSoC might exhibit under these high current conditions. These studies could also show whether certain types of events drastically change the behavior of the device, making it non-viable for certain space or terrestrial applications.

The methodologies and results in this thesis may be helpful in providing a starting point for future neutron testing and MPSoC device characterization. Parts of this work have been previously submitted to and presented at the 2018 Military and Aerospace Programmable Logic Devices (MAPLD) Workshop and the 2018 IEEE Nuclear and Space Radiation Effects Conference (NSREC) Radiation Effects Data Workshop (REDW) [56,57].

CHAPTER 9. ACRONYMS

AES Advanced Encryption Standard.
APU Application Processing Unit.
ARM Advanced RISC Machine.
AXI Advanced Extensible Interface Protocol.
BRAM Block Random Access Memory.
BYU Brigham Young University.
CCL Configurable Computing Lab.
CHREC Center for High Performance Reconfigurable Computing.
CI Confidence Interval.
CLB Configurable Logic Block.
CRAM Configuration Random Access Memory.
CSU Configuration and Security Unit.
DDR Double Data Rate Dynamic RAM.
DMA Direct Memory Access.
DSP Digital Signal Processing.
DUT Device Under Test.
ECC Error Correction Codes.
EDAC Error Detection And Correction.
FIFO First In, First Out.
FIT Failures In Time.
FPGA Field Programmable Gate Array.
FSBL First-Stage Boot Loader.
Gb Gigabit.
GPIO General Purpose Input/Output.
GPU Graphics Processing Unit.
ICAP Internal Configuration Access Port.
ICE Irradiation of Chips and Electronics.
JCM JTAG Configuration Manager.
JEDEC Joint Electron Device Engineering Council.
JTAG Joint Test Action Group.
KB Kilobyte.
LANL Los Alamos National Laboratories.
LANSCE Los Alamos Neutron Science Center.
LUT Look-up Table.
Mb Megabit.
MHz Megahertz.
MMU Memory Management Unit.
MPSoC Multiprocessor System-on-Chip.
NSF National Science Foundation.
OCM On-chip Memory.
OS Operating System.
PCAP Processor Configuration Access Port.
PL Programmable Logic.
PMU Platform Management Unit.
POR Power-On-Reset.
PS Processing System.
ROM Read-Only Memory.
RPU Real-time Processing Unit.
SDC Silent Data Corruption.
SECDED Single Error Correction/Double Error Detection.
SEE Single Event Effect.
SEFI Single Event Functional Interrupt.
SEL Single Event Latch-up.
SEM IP Soft Error Mitigation Intellectual Property.
SET Single Event Transient.
SEU Single Event Upset.
SHREC Center for Space, High-Performance, and Resilient Computing.
SMP Symmetric Multiprocessing.
SoC System-on-Chip.
SSS Secure Stream Switch.
SRAM Static Random Access Memory.
SYSMON System Monitor.
TCM Tightly-Coupled Memory.
UART Universal Asynchronous Receiver-Transmitter.
WDT Watchdog Timer.

REFERENCES

[1] B. Fawcett, FPGAs as reconfigurable processing elements, IEEE Circuits and Devices Magazine, vol. 12, no. 2, pp. 8-10.
[2] A. Caulfield et al., A cloud-scale acceleration architecture, in Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Taipei, Taiwan: IEEE, Oct.
[3] Zynq UltraScale+ Device Technical Reference Manual, Xilinx.
[4] S. Mukherjee, J. Emer, and S. Reinhardt, The soft error problem: an architectural perspective, in 11th International Symposium on High-Performance Computer Architecture. San Francisco, CA, USA: IEEE, Feb.
[5] R. Schrimpf, Radiation effects in microelectronics, in Radiation Effects on Embedded Systems, R. Velazco, P. Fouillat, and R. Reis, Eds. Springer, Dordrecht, 2007, ch. 2.
[6] R. C. Baumann, Soft errors in advanced semiconductor devices-part I: the three radiation sources, IEEE Transactions on Device and Materials Reliability, vol. 1, no. 1, Mar.
[7] F. Wang and V. D. Agrawal, Single event upset: An embedded tutorial, in 21st International Conference on VLSI Design (VLSID 2008). Hyderabad, India: IEEE, Jan.
[8] L. D. Edmonds, C. E. Barnes, and L. Z. Scheick, An Introduction to Space Radiation Effects on Microelectronics, Jet Propulsion Laboratory (JPL), California Institute of Technology.
[9] N. Battezzati, L. Sterpone, and M. Violante, Reconfigurable Field Programmable Gate Arrays: Failure Modes and Analysis, in Reconfigurable Field Programmable Gate Arrays for Mission-Critical Applications. New York, NY: Springer.
[10] A. Keller, Using on-chip error detection to estimate FPGA design sensitivity to configuration upsets, Master Thesis, Brigham Young University, April.
[11] T. Heijmen, Soft Errors from Space to Ground: Historical Overview, Empirical Evidence, and Future Trends. Boston, MA: Springer US, 2011.
[12] L. D. Hung, H. Irie, M. Goshima, and S. Sakai, Utilization of SECDED for soft error and variation-induced defect tolerance in caches, in 2007 Design, Automation and Test in Europe Conference and Exhibition. Nice, France: IEEE, May.
[13] J. Singh and J. Singh, A comparative study of error detection and correction coding techniques, in 2012 Second International Conference on Advanced Computing and Communication Technologies. Rohtak, Haryana, India: IEEE, March.
[14] Microsemi, Single event effects (SEE), reliability/see, 2018, accessed: July.
[15] M. Bellato et al., Evaluating the effects of SEUs affecting the configuration memory of an SRAM-based FPGA, in Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1, ser. DATE '04. Washington, DC, USA: IEEE Computer Society, 2004.
[16] P. Graham, M. Caffrey, J. Zimmerman, D. Eric Johnson, P. Sundararajan, and C. Patterson, Consequences and categories of SRAM FPGA configuration SEUs, Proc. 5th Annu. Int. Conf. Military Aerosp. Program. Logic Devices.
[17] G. R. Allen, L. Edmonds, C. W. Tseng, G. Swift, and C. Carmichael, Single-event upset (SEU) results of embedded error detect and correct enabled block random access memory (Block RAM) within the Xilinx XQR5VFX130, IEEE Transactions on Nuclear Science, vol. 57, no. 6, Dec.
[18] JEDEC Standard JESD89A, Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices, JEDEC Solid State Technology Association.
[19] H. Quinn, Challenges in testing complex systems, IEEE Transactions on Nuclear Science, vol. 61, no. 2, Apr.
[20] R. Velazco, G. Foucard, and P. Peronnard, Integrated Circuit Qualification for Space and Ground-Level Applications: Accelerated Tests and Error-Rate Predictions. Boston, MA: Springer US, 2011.
[21] LANSCE. Weapons neutron research facility at LANSCE. [Online].
[22] C. Slayman, JEDEC Standards on Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray Induced Soft Errors. Boston, MA: Springer US, 2011.
[23] LANSCE. Weapons neutron research flight paths. [Online]. Available: lanl.gov/facilities/wnr/flight-paths/index.php
[24] B. E. Takala, The ICE House: Neutron testing leads to more-reliable electronics, Los Alamos Science, no. 30, Nov.
[25] Zynq UltraScale+ MPSoC Product Tables and Product Selection Guide, Xilinx.
[26] ARM Cortex-A53 MPCore Processor, ARM, Accessible: topic/com.arm.doc.ddi0500j/ddi0500j cortex a53 trm.pdf.
[27] Zynq UltraScale+ MPSoC Data Sheet: DC and AC Switching Characteristics, Xilinx.
[28] L. Hansen, Unleash the Unparalleled Power and Flexibility of Zynq UltraScale+ MPSoCs, Xilinx.
[29] UltraScale Architecture Configurable Logic Block User Guide, Xilinx.
[30] Device Reliability Report - Second Half 2017, Xilinx.
[31] P. R. Panda, N. D. Dutt, and A. Nicolau, On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems, ACM Trans. Des. Autom. Electron. Syst., vol. 5, no. 3, Jul.
[32] C. Carmichael, Triple Module Redundancy Design Techniques for Virtex FPGAs, Xilinx.
[33] A. Gruwell, P. Zabriskie, and M. Wirthlin, High-speed programmable FPGA configuration through JTAG, in International Conference on Field Programmable Logic and Applications (FPL). Lausanne, Switzerland: IEEE, Sept.
[34] N. Battezzati, L. Sterpone, and M. Violante, Reconfigurable Field Programmable Gate Arrays: Hardening Solutions. New York, NY: Springer New York, 2011.
[35] A. Stoddard, A. Gruwell, P. Zabriskie, and M. Wirthlin, High-speed PCAP configuration scrubbing on Zynq-7000 All Programmable SoCs, in International Conference on Field Programmable Logic and Applications (FPL). Lausanne, Switzerland: IEEE, September.
[36] J. Jeppesen et al., JTAG interface system for communicating with compliant and non-compliant JTAG devices, Jan., US Patent 5,708,773.
[37] Soft Error Mitigation Controller v4.1, Xilinx.
[38] A. Stoddard et al., A hybrid approach to FPGA configuration scrubbing, IEEE Transactions on Nuclear Science, vol. 64, no. 1, Jan.
[39] I. Herrera-Alzu and M. Lopez-Vallejo, Design techniques for Xilinx Virtex FPGA configuration memory scrubbers, IEEE Transactions on Nuclear Science, vol. 60, no. 1, Feb.
[40] T. Bates and C. Bridges, Single event mitigation for Xilinx 7-series FPGAs, in 2018 IEEE Aerospace Conference. Big Sky, MT, USA: IEEE, March.
[41] A. Gruwell, High-speed programmable FPGA configuration memory access using JTAG, Master Thesis, Brigham Young University, April.
[42] UltraScale Architecture Configuration User Guide, Xilinx.
[43] A. Stoddard, Configuration scrubbing architectures for high-reliability FPGA systems, Master Thesis, Brigham Young University, December.
[44] P. Hazucha et al., Neutron soft error rate measurements in a 90-nm CMOS process and scaling trends in SRAM from 0.25-μm to 90-nm generation, in IEEE International Electron Devices Meeting 2003, Dec 2003.
[45] H. Quinn et al., Using benchmarks for radiation testing of microprocessors and FPGAs, IEEE Transactions on Nuclear Science, vol. 62, no. 6, Dec.
[46] National Institute of Standards and Technology (NIST). Announcing the Advanced Encryption Standard (AES). [Online]. Available: NIST.FIPS.197.pdf
[47] S. Skiena, The Algorithm Design Manual. London, England: Springer.
[48] Xilinx, Xilinx/linux-xlnx github, 2018 [Online], Accessible: linux-xlnx.
[49] R. P. Weicker, Dhrystone: a synthetic systems programming benchmark, Communications of the ACM, vol. 27, no. 10, Oct.
[50] A. Weiss, Dhrystone benchmark; history, analysis, scores, and recommendations, ECL, LLC, El Dorado Hills, CA, USA, Tech. Rep., 2002 [Online], Accessible: org/techlit/datasheets/dhrystone wp.pdf.
[51] J. Karp, M. Hart, P. Maillard, G. Hellings, and D. Linten, Single-event latch-up: Increased sensitivity from planar to FinFET, IEEE Transactions on Nuclear Science, vol. 65, no. 1, Jan.
[52] MAX A Digital PoL DC-DC Converter with InTune Automatic Compensation, Maxim.
[53] UltraScale Architecture System Monitor - User Guide, Xilinx, Accessible: xilinx.com/support/documentation/user guides/ug580-ultrascale-sysmon.pdf.
[54] W. Rudge, C. Dinkins, W. Boesch, D. Vail, J. Bruckmeyer, and G. Swift, SEL site localization using masking and PEM imaging techniques: A case study on Xilinx 28nm 7-Series FPGAs, in 2016 IEEE Radiation Effects Data Workshop (REDW), July 2016.
[55] PowerTool MAXPOWERTOOL002 Quick Start Guide, Maxim Integrated, Accessible: https://pdfserv.maximintegrated.com/en/an/ug5981.pdf.
[56] J. D. Anderson, J. C. Leavitt, and M. J. Wirthlin, Neutron radiation beam results for the Xilinx UltraScale+ MPSoC, in 2018 IEEE Nuclear Space Radiation Effects Conference (NSREC 2018), July 2018.
[57] D. S. Lee et al., Single-event characterization of 16 nm FinFET Xilinx UltraScale+ devices with heavy ion and neutron irradiation, in 2018 IEEE Nuclear Space Radiation Effects Conference (NSREC 2018), July 2018.

APPENDIX A. ZCU102 POWER MODIFICATIONS

This appendix gives a description of the modifications that were performed on the ZCU102 evaluation board in order to power the board externally.

This section describes which power regulators were bypassed and how the bypass wires were exposed for access.

Figure A.1: Metal plate with banana connectors connected to ZCU102.

A banana connector plate was attached on the bottom of the long side of the board, where there were no board components that the plate could interfere with (shown in Figure A.1). The plate was made from a piece of metal elbow brace. The elbow brace was cut to allow room for seven banana connector jacks and the intersecting board feet. Seven holes were drilled on one side of the brace to allow the banana jacks to be connected to the plate, and two holes were drilled on the other side for the screws for the feet. The banana jacks were fed through the drilled holes and attached using metal nuts. A thin piece of plastic for insulating the metal brace from the board was cut to the same length as the elbow brace. Identical holes for the board's feet were drilled into the plastic, which was placed on top of the elbow brace. The feet on the long side of the board were removed, the brace and plastic were inserted, and the feet were replaced. The plate is grounded and connected to the board by the screws for the feet.

Figure A.2: Bottom of ZCU102 showing the bypass wires.

Table A.1: Color coding of wires.

Wire Color    Voltage
Green         3.3V
White         1.2V
Red           0.85V
Yellow        1.8V
Black         Ground

Table A.2: Locations and labels for the board test points used for bypassing regulators (see also Figure A.3).

Label Num    Jumper Num    Power Name        Voltage
1            J47           Util 3v3          3.30V
2            J35           VCCINT            0.85V
3            J37           VCCBRAM           0.85V
4            J46           DDR4 DIMM VDDQ    1.20V
5            J36           VCCAUX            1.80V
6            J25           PSINTFP           0.85V
7            J26           PSINTLP           0.85V
8            J53           VCC3v3            3.30V
9            J44           VCCOPS            1.80V
N/A          J30 & J31     Ground            GND

Four different power rails were bypassed in this setup: the 3.3V, 1.8V, 1.2V, and 0.85V power rails. Wires were soldered to the banana jacks on the back side of the board. The wires were routed around the MPSoC chip so as to avoid direct influence from the radiation beam, as shown in the setup in Figure A.2. The wires were placed through the test point jumper pins on the bottom of the board and soldered on the top of the board to the test point junction. The wires were color coded to match the banana connectors; the color coding is described in Table A.1.

Several power rails need to be powered in order to completely power the necessary board components. Table A.2 lists the test point jumper locations that are bypassed and the power rail that each test point bypasses. The corresponding voltage for each rail is also shown. Each test point jumper location is labeled and shown in Figure A.3.
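As a cross-check of Table A.2, the nine bypassed test points can be grouped by voltage to confirm that they reduce to the four externally supplied rails. The short Python sketch below does this; the data are copied from Table A.2, while the variable and function names are illustrative.

```python
from collections import defaultdict

# (label, jumper, power name, voltage in volts), copied from Table A.2
TEST_POINTS = [
    (1, "J47", "Util 3v3", 3.30),
    (2, "J35", "VCCINT", 0.85),
    (3, "J37", "VCCBRAM", 0.85),
    (4, "J46", "DDR4 DIMM VDDQ", 1.20),
    (5, "J36", "VCCAUX", 1.80),
    (6, "J25", "PSINTFP", 0.85),
    (7, "J26", "PSINTLP", 0.85),
    (8, "J53", "VCC3v3", 3.30),
    (9, "J44", "VCCOPS", 1.80),
]


def group_by_voltage(points):
    """Map each supply voltage to the list of (jumper, power name) it feeds."""
    groups = defaultdict(list)
    for _label, jumper, name, volts in points:
        groups[volts].append((jumper, name))
    return dict(groups)


rails = group_by_voltage(TEST_POINTS)
for volts in sorted(rails, reverse=True):
    jumpers = ", ".join(jumper for jumper, _name in rails[volts])
    print(f"{volts:.2f} V: {jumpers}")
```

Grouping this way makes it easy to see which test points share an external supply, for example when deciding which events on a supply line could originate from more than one on-board rail.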

Figure A.3: ZCU102 board with test points highlighted for bypassing.
