Advanced electronic prognostics through system telemetry and pattern recognition methods


Microelectronics Reliability 47 (2007) 1865–1873

Leon Lopez
RAS Computer Analysis Laboratory, Sun Microsystems, San Diego, CA, United States

Received 14 January 2007
Available online 1 May 2007

Abstract

Electronic Prognostics (EP) is a technique used in high-reliability and high-availability systems to actively and proactively detect faults, allowing the reduction of system downtime and unplanned repairs. The approach of Sun Microsystems to EP consists of a Continuous System Telemetry Harness (CSTH) that is coupled with Sequential Probability Ratio Test (SPRT) and Multivariate State Estimation Technique (MSET) algorithms. This approach provides a unique and complete EP solution, harnessing the rich information from sensors and system variables, and providing means for their storage and analysis. The background theory behind SPRT and MSET techniques as well as their implementation for advanced EP in enterprise servers is presented in this paper.

© 2007 Elsevier Ltd. All rights reserved.

1. Introduction

Enterprise servers are complex systems that are utilized in mission-critical or safety-critical applications requiring 24×7 availability. For these kinds of systems failure is not an option, since it invariably results in heavy financial loss, customer dissatisfaction, unplanned maintenance cycles, and loss of customer loyalty.

A typical enterprise server, such as the Sun Microsystems F15K, consists of thousands of individual components and subsystems that are physically, electrically, thermally, and/or mechanically interconnected. Each component or subsystem provides or receives signals and services that allow the proper operation of the server. If components or subsystems do not operate according to specifications (degraded or failure states), they do not provide the signals or services needed by the system, which may result in system-level faults or failures.
E-mail address: leon.lopez@sun.com

In the context of this paper a fault is defined as operation outside of specifications, while a failure is defined as the lack of operation. Very often faults or failures of enterprise servers cannot be readily root-caused, because it is difficult to identify the individual component or operating condition that initiated the faulty or failed state. Since the information needed to identify the true failure mechanism (failure process) is often not available, systems or components are simply labeled as No Trouble Found (NTF). NTF is a very expensive problem in the electronics industry that impacts customers and manufacturers. It requires a great commitment of human resources and a great investment in spare parts, and it results in the loss of materials in the form of scrap piles.

In the electronics industry, Health Monitoring (HM) and Electronic Prognostics (EP) methods have been implemented to allow the identification of faults and failures in systems during normal operating conditions. HM and EP methods consist of the continuous assessment of a product's operating environment and performance to determine deviations from expected normal operating conditions. Information gleaned through this continuous surveillance makes it possible to obtain estimates of the

product reliability, execute proactive maintenance activities, increase product availability, increase customer satisfaction, and reduce NTF events.

In this paper methods of electronic prognostics that were developed by Sun Microsystems for enterprise servers are described. Principles of the Continuous System Telemetry Harness (CSTH), Sequential Probability Ratio Test (SPRT), and Multivariate State Estimation Technique (MSET) as they pertain to EP applications are provided.

2. Health monitoring and electronic prognostics

Health Monitoring of electronic systems can be performed by means of diagnostics, prognostics, or life consumption monitors. Diagnostic monitors determine the current state of health of a system and identify potential problems. Prognostic monitors identify faults and, by studying the fault behavior, provide an estimate of the time to failure for the product. Life consumption monitors measure the operating conditions, assessing accumulated damage and providing estimates of the remaining life of the product [1,2]. Vichare and Pecht categorize the main approaches for the implementation of prognostics as built-in-test (BIT), fuses and/or canary devices, parameter monitors, and accumulated damage monitors [3].

These approaches for HM and EP provide the means to address some of the concerns of enterprise server users. Unfortunately, there are some significant issues associated with the current EP methods for the surveillance of electronic signals (voltage, resistance, temperature, humidity, vibration, or other system variables). Most HM and EP implementations perform simple tests on monitored signals. These tests (such as threshold tests and mean value tests) are only sensitive to gross changes in the monitored signals and produce a high number of false alarms and missed alarms [4]. Threshold values can be set close to the monitored signal to achieve high fault discrimination.
Unfortunately, the noise that is an integral part of that signal produces false alarms. The closer the threshold is to the signal, the higher the number of false alarms. The opposite situation is also of concern: when threshold values are set too far from the monitored signal to reduce the rate of false alarms, missed alarms are produced. The farther the threshold is from the signal, the higher the number of missed alarms. This see-saw effect is illustrated in Fig. 1.

Fig. 1. See-saw effect for fault discrimination and alarm rate seen in threshold limit approach.

Other significant issues with current EP approaches are that they:

- cannot separately specify the probabilities of false alarms and missed alarms;
- assume noise levels are insignificant;
- cannot detect drift in signals;
- do not provide the earliest possible annunciation of failure;
- cannot handle sensor degradation or failure;
- cannot identify correlations between signals;
- require expertise from the operator to interpret results;
- lack insight into root causes of failure to reduce NTF instances.

3. Advanced electronic prognostics

A method for EP in computer systems was introduced in 2001 by Dr. Kenny Gross and his team of researchers at Sun Microsystems. This approach circumvents many of the deficiencies of traditional HM and EP methods that were previously discussed, providing the tools needed to capture relevant system data and identify signal patterns, correlations between signals, and root causes of failures, effectively reducing NTF instances and increasing Reliability, Availability and Serviceability (RAS). The EP approach consists of a Continuous System Telemetry Harness coupled with real-time pattern recognition algorithms [5].

The Continuous System Telemetry Harness (CSTH) was developed by Sun Microsystems to enable the capture, conditioning, synchronizing, and storage of computer system telemetry signals, which allows the subsequent statistical analysis of data.
The CSTH enables EP in Sun computer systems [7]. An overall illustration of the CSTH is provided in Fig. 2.

The CSTH categorizes the information provided by the server into three different kinds: soft variables, canary variables, and physical variables. Soft variables (internal variables) are values generated by the operating system which provide information on the performance of the hardware. Canary variables are values generated by software programs (other than the operating system) which provide information on the quality of the service, such as number of transactions per minute, service availability, and user wait times. Physical variables are direct measurements made in the system by means of sensors, such as temperature, voltage, current, vibration, fan speed, and relative humidity.

These variables originate from multiple locations and arrive with different formats, time stamps, sampling frequencies, and signal resolutions. The analysis of information by EP tools is not possible until the variables are synchronized and conditioned into a digitized time series, removing signal resolution and sampling frequency differences. Once the data is in the proper

format and time stamped, it is stored in a double circular file structure that acts as a black box recorder. In the Sun Microsystems F15K, which is a 72-microprocessor server with over 1000 sensors, the information is stored at high resolution for 72 h (1st circular file structure) and then at low resolution for 30 days (2nd circular file structure). Analysis of the data, either by humans or by algorithms, can be performed in real time or later by retrieving stored information.

Fig. 2. Continuous system telemetry harness used in Sun Microsystems enterprise servers.

The analysis of signals obtained by the CSTH is done by means of pattern recognition algorithms. The first of these is the Sequential Probability Ratio Test (SPRT), and the second is the Multivariate State Estimation Technique (MSET) [6,7]. The methods are not new in statistical analysis, but their implementation with the CSTH for the monitoring of signals in computer systems is very innovative. The application of these algorithms to enable EP in enterprise servers will be presented in the remaining sections of this paper.

4. Testing of a statistical hypothesis

Before SPRT and MSET implementations in EP are presented it is necessary to cover some basic theory regarding the testing of statistical hypotheses. This section describes test procedures, critical regions, and error types.

4.1. Test procedures

A statistical hypothesis test is a quantitative evaluation of a belief in light of available data. The current belief is denoted by $H_0$ and is called the null hypothesis. The alternative belief is denoted by $H_1$ and is called the alternative hypothesis. The acceptance of the null hypothesis indicates that the test results (the data) are not sufficient evidence to reject the belief. The rejection of the null hypothesis, on the other hand, indicates that there is overwhelming evidence to reject the belief [8].
To perform a test of a statistical hypothesis it is necessary to define a test procedure that will classify all potential observations into mutually exclusive sets, allowing the acceptance or rejection of the hypothesis [9]. As an example of this concept consider the following. During an experiment a random number of observations $n$ are performed to decide the outcome of a hypothesis. Assuming that each observation in the experiment is represented by the random variable $x_i$, and that the total number of observations made is $n$, then the set $\{x_1, x_2, \ldots, x_n\}$ will represent one combination of outcomes for the experiment. A test procedure is defined when all sets of potential outcomes are categorized as either supporting or rejecting a null hypothesis.

In a similar manner, consider an experiment where the following hypotheses are presented for consideration:

- $H_0$: The product has no defects (null hypothesis);
- $H_1$: The product has defects (alternative hypothesis).

As part of the experiment a sample of the product is analyzed two times, with each observation having a binary outcome representing a defect (1) or no defect (0). One combination of random observations of the variable $x_i$ in the experiment will be the set $\{x_1, x_2\}$. All possible combinations of random test observations will be given by the sets {0, 0}, {0, 1}, {1, 0} and {1, 1}. Assuming that no defects are tolerated on the product (the criterion used to categorize outcomes), any of the sequential observations {0, 1}, {1, 0} or {1, 1} will result in the rejection of the null hypothesis. The sequential observations {0, 0} will result in the acceptance of the null hypothesis. This is a test procedure of a statistical hypothesis that allows the acceptance or rejection of the null hypothesis.

4.2. Critical region and error types

In order to quantify test procedures for statistical hypotheses it is necessary to define the probabilities of committing errors.
When accepting or rejecting a null hypothesis $H_0$, it is possible to commit two kinds of errors: rejecting the null hypothesis $H_0$ when it is true (Type I error), or accepting the null hypothesis $H_0$ when the alternative hypothesis $H_1$ is true (Type II error). The probability of a Type I error is represented by $\alpha$ and is known as the size of the critical region, or false alarm probability. The critical region is simply the area that represents the probability of rejection. The probability of a Type II error is represented by $\beta$ and is known as the missed alarm probability. $1 - \beta$ is called the power of the critical region. Based on these definitions the probabilities of acceptance and rejection for a hypothesis test are defined.

When the null hypothesis $H_0$ is rejected:

- $\alpha$ = probability of rejecting $H_0$ (if $H_0$ is true);
- $1 - \beta$ = probability of rejecting $H_0$ (if $H_1$ is true).

When the null hypothesis $H_0$ is accepted:

- $\beta$ = probability of accepting $H_0$ (if $H_1$ is true);
- $1 - \alpha$ = probability of accepting $H_0$ (if $H_0$ is true).

If the size of a critical region is chosen, it is possible to select multiple probabilities of $\alpha$ and $\beta$ that would satisfy the required size. For practical purposes it is desirable to have a small probability of Type I error and a small probability of Type II error. To simplify the selection of a critical region and to provide the smallest possible probabilities for $\alpha$ and $\beta$, the Neyman–Pearson theory is used. See Refs. [10,11] for more details.

4.3. Neyman–Pearson method

The theory presented up to this point has only addressed a random variable $x_i$ without further consideration of statistical distributions and distribution parameters. The Neyman–Pearson method allows the incorporation of statistical distributions and distribution parameters in the estimation of a critical region. Assume that the random variable $x_i$ is described by a statistical distribution that has a single parameter $\mu$, which represents the mean. A hypothesis test can be formulated in relation to the mean value $\mu$ of the random variable $x_i$, with null hypothesis $H_0$: $\mu = \mu_0$, and alternative hypothesis $H_1$: $\mu = \mu_1$. If the unknown statistical distributions of the random variable $x_i$ are represented by $g(x)$ (for the null hypothesis) and by $f(x)$ (for the alternative hypothesis), then the Neyman–Pearson method defines the critical region as

\[ \frac{f(x_1) f(x_2) \cdots f(x_n)}{g(x_1) g(x_2) \cdots g(x_n)} \ge k, \tag{1} \]

where $k$ is a constant that ensures a critical region of size $\alpha$.

4.4. Example of a statistical hypothesis test

The following example was provided by Wald [9] to demonstrate statistical hypothesis testing of experiments with a fixed sample size.
It is replicated here because it underpins the principles of sequential testing that define SPRT, which will simplify the explanations that follow. Assume that a null hypothesis, $H_0$, is formulated that a random variable $x_i$ is normally distributed, having a mean $\mu = \mu_0$ and a standard deviation $\sigma = 1$. The alternative hypothesis, $H_1$, is that the random variable $x_i$ is normally distributed, having a mean $\mu = \mu_1$ and a standard deviation $\sigma = 1$ (assume $\mu_1 > \mu_0$). The probability density function (PDF) for the random variable $x_i$ with mean $\mu_1$ will be given by

\[ f(x_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}[x_i - \mu_1]^2} \tag{2} \]

and the PDF for the random variable $x_i$ with mean $\mu_0$ will be given by

\[ g(x_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}[x_i - \mu_0]^2}. \tag{3} \]

Applying the Neyman–Pearson principle

\[ f(x_1) f(x_2) \cdots f(x_n) = \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_1)^2}, \tag{4} \]

\[ g(x_1) g(x_2) \cdots g(x_n) = \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_0)^2}, \tag{5} \]

\[ \frac{f(x_1) f(x_2) \cdots f(x_n)}{g(x_1) g(x_2) \cdots g(x_n)} = \frac{\frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_1)^2}}{\frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_0)^2}} \ge k. \tag{6} \]

Finally, taking the natural log of both sides and simplifying

\[ (\mu_1 - \mu_0)\sum_{i=1}^{n} x_i + \frac{n}{2}(\mu_0^2 - \mu_1^2) \ge \ln(k), \tag{7} \]

\[ \sum_{i=1}^{n} x_i \ge \frac{\ln(k) - \frac{n}{2}(\mu_0^2 - \mu_1^2)}{\mu_1 - \mu_0}. \tag{8} \]

From these inequalities it is possible to define the limits of a critical region of size $\alpha$ that would have the smallest possible value of $\beta$ (Type II error), for the case where the number of observations is fixed. See Ref. [9] for a full description of this topic.

5. Sequential testing of statistical hypothesis

This section describes sequential analysis, the Sequential Probability Ratio Test (SPRT), error probabilities, and the estimation of test constants.

5.1. Sequential analysis

Sequential analysis is a method specially designed for the testing of hypotheses where there is a random number of observations during the experiment.
In other words, the decision to stop or continue an experiment is determined by the latest outcome of the test and not by a specified test time or failure count. A sequential test of a statistical hypothesis requires a test procedure that will define the criteria to:

- stop the test and accept $H_0$;
- stop the test and reject $H_0$;
- continue the test (not enough information).

The implementation of sequential testing as well as the definition of the criteria to decide the outcome of a test will be provided next.
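For contrast with the sequential procedure, the fixed-sample test of inequality (8) can be sketched numerically. The following is a minimal illustration, assuming the unit-variance normal model of Section 4.4 with $\mu_1 > \mu_0$; the function names and the way the critical value is obtained from $\alpha$ (via the standard normal quantile instead of an explicit constant $k$) are choices made here for illustration and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def fixed_sample_test(mu0, mu1, n, alpha):
    """Return (critical value for sum(x_i), missed alarm probability beta).

    Assumes x_i ~ N(mu, 1) with mu1 > mu0, as in the Section 4.4 example.
    """
    std_normal = NormalDist()
    # Under H0 the sum of n observations is N(n*mu0, n), so the critical
    # value that gives a false alarm probability alpha is:
    critical = n * mu0 + std_normal.inv_cdf(1 - alpha) * sqrt(n)
    # Under H1 the sum is N(n*mu1, n); beta = P(sum < critical | H1).
    beta = std_normal.cdf((critical - n * mu1) / sqrt(n))
    return critical, beta


def reject_h0(samples, critical):
    """Reject H0 when the sum of observations reaches the critical value."""
    return sum(samples) >= critical
```

With a fixed $n$, choosing $\alpha$ determines $\beta$; the only way to lower both is to collect more observations, which is the limitation the sequential approach addresses.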

5.2. Sequential probability ratio test

SPRT is a method of statistical inference that was developed by Abraham Wald in the 1940s. It incorporates the concepts of test procedures, critical regions, the Neyman–Pearson method, and sequential analysis. In SPRT the criteria to stop-accept, stop-reject, or continue a test (as required by sequential testing) are defined with the Neyman–Pearson method using upper and lower limits

\[ B < \frac{f(x_1) f(x_2) \cdots f(x_n)}{g(x_1) g(x_2) \cdots g(x_n)} < A, \tag{9} \]

where $A$ and $B$ are constants, chosen to ensure that the critical region size defined by $\alpha$ is obtained. In Eq. (9) $f(x_i)$ and $g(x_i)$ represent the PDFs for the alternative ($H_1$) and null ($H_0$) hypotheses, respectively, of a random variable $x_i$. The term $f(x_1) f(x_2) \cdots f(x_n)$ represents a series of observations of $x_i$ whose product constitutes the probability of occurrence of the alternative hypothesis. The term $g(x_1) g(x_2) \cdots g(x_n)$ represents a series of observations of $x_i$ whose product constitutes the probability of occurrence of the null hypothesis. The ratio of these probabilities is compared to the upper and lower limits to provide the stop/continue criteria for the test. If the probability ratio equals or exceeds $A$, the test is stopped, rejecting the null hypothesis $H_0$. If the probability ratio is less than or equal to $B$, the test is stopped, accepting the null hypothesis $H_0$. If the probability ratio is between $A$ and $B$, the test continues.

If $f(x_1) f(x_2) \cdots f(x_n)$ is represented by $F(x)$ and $g(x_1) g(x_2) \cdots g(x_n)$ is represented by $G(x)$, then inequality (9) can be represented by

\[ B < \frac{F(x)}{G(x)} < A. \tag{10} \]

When the functions in the probability ratio $F(x)/G(x)$ are substituted by statistical distributions, the probability ratio becomes complex and more difficult to evaluate. This complexity is avoided by taking the natural log of each term, as shown by inequality (7), except that there would be lower and upper limits given by $\ln(B)$ and $\ln(A)$.
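As a worked instance, and assuming the unit-variance normal model of Section 4.4 (means $\mu_0$ and $\mu_1$, $\sigma = 1$), taking the natural log of inequality (9) and reusing the simplification that led to inequality (7) would give the following two-sided form. This is a sketch derived here from Eqs. (2)–(7), not an equation numbered in the paper.

```latex
\ln(B) \;<\; (\mu_1 - \mu_0)\sum_{i=1}^{n} x_i \;+\; \frac{n}{2}\left(\mu_0^2 - \mu_1^2\right) \;<\; \ln(A)
```

Each new observation adds the term $(\mu_1 - \mu_0)\,x_i + \tfrac{1}{2}(\mu_0^2 - \mu_1^2)$ to the middle expression, so the test can be updated incrementally without recomputing the full product of densities.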
The left-hand side of inequality (7) is called the SPRT index.

5.3. Selection of error probabilities and constants

In inequality (10) $F(x)$ was shown to represent the probability of having sequential observations that have a mean $\mu = \mu_1$ (alternative hypothesis), while $G(x)$ was the probability of having sequential observations that have a mean $\mu = \mu_0$. Using the definitions for error probabilities that were provided in Section 4.2, the probability ratio for the case where the null hypothesis $H_0$ is rejected is represented by [12]

\[ \frac{F(x)}{G(x)} = \frac{1 - \beta}{\alpha} \ge A. \tag{11} \]

For the case where the null hypothesis $H_0$ is accepted

\[ \frac{F(x)}{G(x)} = \frac{\beta}{1 - \alpha} \le B. \tag{12} \]

These two inequalities provide the criteria needed to stop the test and accept the null hypothesis, stop the test and reject the null hypothesis, or continue the test, all with user-defined error probabilities.

5.4. Advantages of SPRT

The SPRT approach provides many advantages for the analysis of surveillance data in enterprise servers. Some of these are that it:

- is quantitative;
- allows the user to independently define false alarm and missed alarm probabilities;
- is ideal for the analysis of steady state processes, which are stationary;
- has high sensitivity to subtle changes in signals;
- provides the shortest mathematically possible time for the detection of subtle changes in noisy process variables;
- does not require an expert to interpret results;
- can detect a change in mean and variance for signals with noise that is normally distributed;
- allows the definition of the signal disturbance magnitude that will be tolerated;
- can be set for non-normal probability distributions.

6. Advanced electronic prognostics using SPRT

The implementation of SPRT that was shown in the last section provided an example of how a variable could be monitored, testing a null and an alternative hypothesis about an expected mean value.
An implementation of SPRT for the analysis of time series data (as found in signals of enterprise servers) is presented next.

6.1. Definition of the test procedure

Assume that a null hypothesis is tested, but this time against multiple alternative hypotheses. Let $H_0$ be the null hypothesis that a monitored signal from an enterprise server, $x_i$, follows a normal distribution with mean $\mu = 0$ and variance $\sigma^2$. The following alternative hypotheses can be made about the distribution of the server signal. $H_1$ is the hypothesis that the mean is $\mu_1 > \mu$, called the positive mean test. $H_2$ is the hypothesis that the mean is $\mu_2 < \mu$, called the negative mean test. $H_3$ is the hypothesis that the variance is $V\sigma^2$, called the nominal variance test. Finally, $H_4$ is the hypothesis that the variance is $\sigma^2/V$, called the inverse variance test. These hypotheses provide the test procedure needed to analyze positive and negative changes in the mean and variance of a signal with normally distributed noise, as illustrated in Figs. 3 and 4.
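To make the four alternative hypotheses concrete, the sketch below computes the per-observation log-likelihood ratio $\ln[f_j(x)/g(x)]$ of each alternative against $H_0$ for normal densities. It is an illustration under stated assumptions: the mean shifts are taken to be symmetric ($\mu_1 = +\text{shift}$, $\mu_2 = -\text{shift}$), and the function and key names are hypothetical rather than taken from the paper.

```python
from math import log


def llr_increments(x, sigma2, shift, v):
    """Per-observation ln[f_j(x)/g(x)] for H1..H4 against H0: N(0, sigma2).

    Assumption for illustration: mu1 = +shift and mu2 = -shift.
    """
    scaled_sq = x * x / sigma2
    return {
        # H1: N(+shift, sigma2), positive mean test
        "positive_mean": (shift / sigma2) * (x - shift / 2),
        # H2: N(-shift, sigma2), negative mean test
        "negative_mean": (-shift / sigma2) * (x + shift / 2),
        # H3: N(0, v * sigma2), nominal variance test
        "nominal_variance": 0.5 * scaled_sq * (1 - 1 / v) - 0.5 * log(v),
        # H4: N(0, sigma2 / v), inverse variance test
        "inverse_variance": 0.5 * scaled_sq * (1 - v) + 0.5 * log(v),
    }
```

Summing each of these increments over successive observations yields one running index per hypothesis, and each index can be compared against the same pair of limits.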

In summary, the hypotheses are:

- $H_0$: mean $\mu = 0$ and variance $\sigma^2$;
- $H_1$: mean $\mu_1 > \mu$ and variance $\sigma^2$;
- $H_2$: mean $\mu_2 < \mu$ and variance $\sigma^2$;
- $H_3$: mean $\mu = 0$ and variance $V\sigma^2$;
- $H_4$: mean $\mu = 0$ and variance $\sigma^2/V$;

where $V$ is a variance factor. To evaluate the four hypotheses just presented, the Neyman–Pearson principle, as described by inequalities (9) and (10), is applied, creating a total of four inequalities (each against the null hypothesis):

\[ B < \frac{F_1(x)}{G(x)} < A; \quad B < \frac{F_2(x)}{G(x)} < A; \quad B < \frac{F_3(x)}{G(x)} < A; \quad B < \frac{F_4(x)}{G(x)} < A. \tag{13} \]

Fig. 3. Change in mean detected with a SPRT test.

Fig. 4. Shift in variance detected with a SPRT test.

In each inequality, the parameters of each statistical distribution are replaced by the ones assumed by the hypothesis (0 and $\sigma^2$ for the null hypothesis, $\mu_1$ and $\sigma^2$ for $H_1$, $\mu_2$ and $\sigma^2$ for $H_2$, 0 and $V\sigma^2$ for $H_3$, 0 and $\sigma^2/V$ for $H_4$). After the inequalities are transformed, as done in (7), the probability ratio for each case can be evaluated against the natural log of $A$ and $B$ (see Section 5.3).

6.2. Definition of parameters

Before monitored time series signals from an enterprise system can be tested against the four hypotheses, it is necessary to define the variables $\mu_1$ (mean value above $\mu$), $\mu_2$ (mean value below $\mu$), $V$ (variance factor), $\alpha$ (false alarm probability), and $\beta$ (missed alarm probability). The parameters provide the user with the ability to set alarm levels and to choose the risk that can be tolerated for false and missed alarms. While this example assumes a normally distributed signal, it is possible to have implementations with other statistical distributions. For a SPRT test using a Weibull distribution see Ref. [12].

6.3. Continuous monitoring of system signals with SPRT

The process used for the continuous monitoring of enterprise server signals with SPRT is illustrated in Fig. 5.
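A monitoring cycle of this kind can be sketched for a single hypothesis pair. The following is a minimal sketch, not the Sun Microsystems implementation: it covers only the positive mean test on a signal already normalized to zero mean, takes the limits from inequalities (11) and (12) as equalities, and resets the index after either decision (resetting after acceptance is an assumption made here). The class name and its parameters are hypothetical.

```python
from math import log


class SprtMeanMonitor:
    """Positive-mean SPRT: H0 is N(0, sigma2), H1 is N(mu1, sigma2)."""

    def __init__(self, mu1, sigma2, alpha, beta):
        self.mu1 = mu1
        self.sigma2 = sigma2
        self.upper = log((1 - beta) / alpha)  # ln(A), from inequality (11)
        self.lower = log(beta / (1 - alpha))  # ln(B), from inequality (12)
        self.index = 0.0                      # running SPRT index

    def update(self, x):
        """Add one normalized observation and return the decision."""
        self.index += (self.mu1 / self.sigma2) * (x - self.mu1 / 2)
        if self.index >= self.upper:
            self.index = 0.0   # reset and keep monitoring
            return "alarm"     # H0 rejected
        if self.index <= self.lower:
            self.index = 0.0   # assumption: also reset on acceptance
            return "healthy"   # H0 accepted
        return "continue"      # not enough information yet
```

For example, `SprtMeanMonitor(mu1=0.04, sigma2=0.02 ** 2, alpha=0.001, beta=0.005)` would watch for a hypothetical 0.04 V upward shift on a signal with a 0.02 V standard deviation; the 0.04 V figure is illustrative only.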
Initially, as described in Sections 6.1 and 6.2, the test procedure (pass/fail criteria) is defined by means of the inequalities (13) and the values selected for the parameters and probabilities. Observed signals provided by the CSTH are used for a period of time to train the SPRT program, allowing the identification of the signal mean and variance and determining if there is a non-zero mean. Signals that do not have a mean value of zero are normalized at this point to allow the monitoring. Signals that are not static in nature (observed mean is not zero) can be analyzed by other methods, such as MSET.

Fig. 5. SPRT setup and monitor process.

After the setup activities are completed the system starts a cycle of continuous monitoring, testing the null hypothesis against the four alternative hypotheses. In the event that any probability ratio shown in (13) is outside

of the boundaries defined by the parameters $A$ and $B$, the null hypothesis is rejected and one or more of the alternative hypotheses are accepted. The SPRT index is reset to zero, an alarm is raised, and the monitoring of the signals continues. If all of the probability ratios shown in (13) are within the defined boundaries, no action is taken, and the sampling of signals continues.

As demonstrated, the CSTH and SPRT can be used for single or multiple hypothesis testing of signals, and can be implemented with normally and non-normally distributed random variables. Since SPRT is based on sequential testing, it can provide test results with a smaller number of observations than is possible with other methods.

7. Advanced electronic prognostics using MSET

The methods provided for SPRT are only applicable to independent signals that are stationary in nature. However, many of the signals found in advanced systems are not stationary or are not defined by a normal distribution. In addition, CSTH–SPRT analysis looks at the monitored signals individually, lacking the ability to identify correlations between them. The multivariate state estimation technique (MSET) is a statistical analysis method developed by Argonne National Laboratory (ANL) for the detection of faults in complex, safety-critical systems (Sun Microsystems utilizes a commercial implementation of MSET in a Matlab toolkit called ecm™ from SmartSignal Corp.).
When this tool is combined with CSTH and SPRT it has multiple benefits over other EP methods.

7.1. Advantages of MSET

The CSTH–MSET–SPRT approach to EP has the following capabilities for the analysis of signals in enterprise servers:

- all capabilities provided by SPRT;
- very low probability of false alarms;
- analysis of dynamic (non-stationary) data;
- can perform training of systems with data from known good components or systems;
- identifies correlations between signals;
- monitors and compares multiple signals at a given time, generating a model for each;
- creates a dynamic band around each input value;
- can detect small variations in input signals;
- can identify faulty sensors using learned correlations between signals;
- can be used to detect software aging issues and perform proactive software rejuvenation [13,14];
- with CSTH, allows proactive fault monitoring capabilities providing the earliest possible warning for incipient failures [15].

7.2. Continuous monitoring of system signals with MSET

The overall process used by MSET to analyze signals in enterprise server applications is provided in Fig. 6. Similar to the process presented for SPRT analysis, MSET utilizes signals that have been conditioned by the CSTH. The first phase of MSET analysis consists of the monitoring of soft, canary, and physical variables that are representative of the normal operation of the server. Typically the server is set to perform normal operating transactions, providing representative measurements of temperature, humidity, voltage, and other parameters of interest. This is called the training phase.

At any given moment in time during the training phase multiple signals are monitored, and their respective values are stored as vectors. Over time these vectors become a matrix, called the state matrix. The state matrix is manipulated algebraically to provide estimates of signal correlations as well as models that can represent signal behavior over time.
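The idea of estimating each signal from a matrix of known-good states can be illustrated with a deliberately simple stand-in. The sketch below is not MSET, whose operators are not described in this paper; it substitutes a Gaussian similarity-weighted average of the training vectors (a basic kernel regression), and the function names and the `bandwidth` parameter are hypothetical.

```python
from math import exp


def estimate_state(state_matrix, observation, bandwidth=1.0):
    """Estimate the expected signal vector from known-good training vectors.

    Stand-in technique: each row of the state matrix is weighted by its
    Gaussian similarity to the current observation.
    """
    weights = []
    for row in state_matrix:
        dist2 = sum((a - b) ** 2 for a, b in zip(row, observation))
        weights.append(exp(-dist2 / (2 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("observation is too far from every training vector")
    return [
        sum(w * row[j] for w, row in zip(weights, state_matrix)) / total
        for j in range(len(observation))
    ]


def residuals(state_matrix, observation, bandwidth=1.0):
    """Measured minus estimated value for each signal."""
    estimate = estimate_state(state_matrix, observation, bandwidth)
    return [obs - est for obs, est in zip(observation, estimate)]
```

When an observation is consistent with the learned relationship between signals, the residuals stay near zero; when one signal departs from that relationship, the residuals grow, which is the kind of input a test such as SPRT can then evaluate.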
The models and correlations are completed when enough data has been obtained.

Upon completion of the training phase the MSET program is ready for the continuous monitoring of signals. For every signal observation, the MSET model generates an estimate of that signal for that particular time. The difference between the measured and estimated signals is calculated and fed into the SPRT analysis module. If the difference between the observed and estimated signals is zero, then the SPRT module will determine that there has been no measurable change at the time of the observation. This is an indicator of normal operation. If the difference between the signals is not zero, and the difference is determined to be statistically significant, then the SPRT module will trigger an alarm. This is an indicator that the system is not functioning within normal operating conditions.

7.3. Other significant features of MSET

A very important feature of MSET is its ability to generate a dynamic band around the monitored signal. The band is defined with the upper and lower limits of SPRT and updated for every single observation. Since the MSET model considers operating conditions and correlations between signals, this dynamic band provides an unparalleled tool to detect even the most subtle anomalies with very low probabilities of error.

Another significant feature of MSET is the capability to validate sensors. Sensors used in computer systems are well known to have a shorter life span than that of the system they are monitoring. When a sensor fails by providing degraded or constant signals, false alarms are triggered by the monitoring system. When a sensor fails by providing no signal (or a constant signal stuck below the alarm limits), alarms are missed by the monitoring system. Since enterprise servers have hundreds of sensors, the propagation of false alarms and missed alarms becomes an

1872 L. Lopez / Microelectronics Reliability 47 (2007) 1865 1873 Fig. 6. MSET training and surveillance phases. important concern.

8 1872 L. Lopez / Microelectronics Reliability 47 (2007) Fig. 6. MSET training and surveillance phases. important concern. MSET uses the state matrix defined during the training phase to derive correlations between sensors. These correlations are used during the surveillance of signals to identify sensors whose behavior is not consistent with the known normal operation. In this manner sensor failures and degradation can be detected proactively, allowing the MSET module to mask defective sensors and replace the signal with an estimated signal that is generated by the MSET model of the system. The reader is referred to the publications listed for specific examples on SPRT and MSET surveillance of enterprise server signals [13 15]. 8. Example of implementation using SPRT A practical application of electronic prognostics for the monitoring of signals in enterprise servers will now be presented. Assume that a voltage signal from a server is monitored over time to ensure that levels are maintained within predetermined specifications, with the results provided in Table 1. In addition, assume that previous measurements of the power supply indicate that the signal is normally distributed with a mean of 1.5 V and a standard deviation of 0.02 V. With this information a null and alternative hypotheses are defined to evaluate the data: H 0 : The signal is normally distributed with mean l =0 and variance r 2 (signal is normalized to zero). H 1 : The signal is normally distributed with mean l 1 > l and variance r 2. Table 1 Sample measurements for a power supply Using the procedure shown in Eqs. 
(2)–(7), with variance $\sigma^2$, results in the following inequality and SPRT index for $H_1$ (here $\mu$ stands for the shifted mean under $H_1$, the null mean being zero):

$$\frac{f(x_1)\, f(x_2) \cdots f(x_n)}{g(x_1)\, g(x_2) \cdots g(x_n)} = \exp\left[\frac{1}{2\sigma^2} \sum_{i=1}^{n} \mu\,(2x_i - \mu)\right] \geq k, \tag{14}$$

$$\ln \frac{f(x_1)\, f(x_2) \cdots f(x_n)}{g(x_1)\, g(x_2) \cdots g(x_n)} = \frac{\mu}{\sigma^2} \sum_{i=1}^{n} \left(x_i - \frac{\mu}{2}\right). \tag{15}$$

If the false alarm probability $\alpha = 0.001$ and the missed alarm probability $\beta = 0.005$, then per inequalities (11) and (12):

$$\frac{F(x)}{G(x)} = \frac{1 - 0.005}{0.001} \geq A, \tag{16}$$

$$\frac{F(x)}{G(x)} = \frac{0.005}{1 - 0.001} \leq B. \tag{17}$$

The natural log of each term provides the accept/reject criteria for the hypothesis test in relation to the SPRT index:

$$-5.2973 < \frac{\mu}{\sigma^2} \sum_{i=1}^{n} \left(x_i - \frac{\mu}{2}\right) < 6.9027. \tag{18}$$

Table 2. Signal analysis with SPRT
Columns: Observation, SPRT index, Results.
Results, in row order:
Index set to zero
Signal is healthy
Signal is healthy
Signal is healthy
Degraded signal
Degraded signal
Degraded signal
Degraded signal
Degraded signal
Signal is healthy

Every voltage measurement from the power supply is now tested against the SPRT index. If the index is less than or equal to the lower limit $\ln(B) = -5.2973$, then the signal is declared healthy,
having no evidence to say otherwise. If the index is greater than or equal to the upper limit $\ln(A) = 6.9027$, then the signal is declared degraded, having strong statistical evidence to support it. These results are shown in Table 2 for some representative data points.

9. Conclusions

This paper has provided the fundamental theory that enables the implementation of SPRT- and MSET-based EP in enterprise servers. The continuous system telemetry harness (CSTH) was introduced as the tool that enables Electronic Prognostics in Sun Microsystems servers, providing the means to retrieve, condition, and store information that can be used by experts and/or pattern recognition algorithms. CSTH features such as the black box and circular file structure were presented. The CSTH in conjunction with pattern recognition techniques was introduced as a method for EP, allowing the detection of incipient faults and the prediction of failures in enterprise servers. These features allow proactive and reactive actions such as estimation of remaining life, preventive repair planning, sensor characterization, corrective actions in software, and an overall reduction of NTF issues.

Acknowledgments

The author wishes to express his thanks to Kenny Gross, Keith Whisnant, David McElfresh, Dan Vacar and Bob Melanson from Sun Microsystems for their support in the writing of this paper. In addition, many thanks to Michael Pecht, Diganta Das, and Peter Sandborn from the CALCE center at the University of Maryland for their valuable suggestions.

References

[1] Pecht M et al. Health and life consumption monitoring of electronic products. CALCE, University of Maryland.
[2] Mishra S, Pecht M, Goodman D. In-situ sensors for product reliability monitoring. In: Proc SPIE, vol. 4755.
[3] Vichare N, Pecht M. Prognostics and health management of electronics. IEEE transactions on components and packaging technologies, vol. 29, no. 1; March.
[4] Gross K, Lu W. Early detection of signal and process anomalies in enterprise computing systems. IEEE international conference on machine learning and applications; June.
[5] Gross K, Mishra K. Improved methods for early fault detection in enterprise computing servers using SAS tools. In: SAS users group international symposium; May.
[6] Gross K et al. Proactive fault monitoring in enterprise servers. IEEE international multiconference in computer science & computer engineering; June.
[7] Gross K et al. Electronic prognostics through continuous system telemetry. 60th meeting of the society for machinery failure prevention technology (MFPT06); April.
[8] Tobias P, Trindade D. Applied reliability. Chapman & Hall/CRC.
[9] Wald A. Sequential analysis. Wiley.
[10] Barlett MS. Sequential methods in statistics. Wiley.
[11] Wald A. Selected papers in statistics and probability. In: Anderson TW, editor. McGraw-Hill; p. 13. p. 154 and 548.
[12] Johnson L. The statistical treatment of fatigue experiments. Elsevier.
[13] Vaidyanathan K, Gross K. Proactive detection of software anomalies through MSET. IEEE workshop on predictive software models (PSM); September.
[14] Vaidyanathan K, Gross K. MSET performance optimization for detection of software aging. In: IEEE international symposium on software reliability engineering (ISSRE); November.
[15] Gross K, Bhardwaj V, Bickford R. Proactive detection of aging mechanisms in performance-critical computers. In: Annual IEEE/NASA software engineering symposium; December.

Leon Lopez is a reliability physics researcher in the RAS Computer Analysis Laboratory of Sun Microsystems, San Diego, California, where he applies the principles of physics of failure in the research and analysis of materials, components, and assemblies used in computer systems. He also develops reliability qualification procedures for the evaluation of computer components. He has over 10 years of experience in the qualification of ASICs, microprocessors, memory devices and IC sockets. Leon received his B.S. in electronics engineering from Brigham Young University in 1996 and has completed the academic requirements for his M.S. degree in Reliability Engineering from the University of Maryland, College Park. He is currently a Ph.D. candidate in Reliability Engineering at the University of Maryland, College Park.

EARLY DETECTION OF AVALANCHE BREAKDOWN IN EMBEDDED CAPACITORS USING SPRT Mohammed A. Alam 1, Michael H. Azarian 2, Michael Osterman and Michael Pecht Center for Advanced Life Cycle Engineering (CALCE)