Advanced electronic prognostics through system telemetry and pattern recognition methods


Microelectronics Reliability 47 (2007)

Leon Lopez
RAS Computer Analysis Laboratory, Sun Microsystems, San Diego, CA, United States
Received 14 January 2007. Available online 1 May 2007.

Abstract

Electronic Prognostics (EP) is a technique used in high-reliability and high-availability systems to detect faults actively and proactively, reducing system downtime and unplanned repairs. The approach of Sun Microsystems to EP consists of a Continuous System Telemetry Harness (CSTH) that is coupled with Sequential Probability Ratio Test (SPRT) and Multivariate State Estimation Technique (MSET) algorithms. This approach provides a unique and complete EP solution, harnessing the rich information from sensors and system variables, and providing means for their storage and analysis. The background theory behind the SPRT and MSET techniques, as well as their implementation for advanced EP in enterprise servers, is presented in this paper. © 2007 Elsevier Ltd. All rights reserved.

1. Introduction

Enterprise servers are complex systems that are used in mission-critical or safety-critical applications requiring 24×7 availability. For these kinds of systems failure is not an option, since it invariably results in heavy financial loss, customer dissatisfaction, unplanned maintenance cycles, and loss of customer loyalty. A typical enterprise server, such as the Sun Microsystems F15K, consists of thousands of individual components and subsystems that are physically, electrically, thermally, and/or mechanically interconnected. Each component or subsystem provides or receives signals and services that allow the proper operation of the server. If components or subsystems do not operate according to specifications (degraded or failure states), they do not provide the signals or services needed by the system, which may result in system-level faults or failures.
In the context of this paper, a fault is defined as operation outside of specifications, while a failure is defined as the lack of operation. Very often faults or failures of enterprise servers cannot be readily root-caused, because it is difficult to identify the individual component or operating condition that initiated the faulty or failed state. Since the information needed to identify the true failure mechanism (failure process) is often unavailable, systems or components are simply labeled as No Trouble Found (NTF). NTF is a very expensive problem in the electronics industry that impacts customers and manufacturers. It requires a great commitment of human resources and a great investment in spare parts, and it results in the loss of materials in the form of scrap piles.

Author e-mail address: leon.lopez@sun.com

In the electronics industry, Health Monitoring (HM) and Electronic Prognostics (EP) methods have been implemented to allow the identification of faults and failures in systems during normal operating conditions. HM and EP methods consist of the continuous assessment of a product's operating environment and performance to determine deviations from expected normal operating conditions. Information gleaned through this continuous surveillance makes it possible to obtain estimates of the

product reliability, execute proactive maintenance activities, increase product availability, increase customer satisfaction, and reduce NTF events.

In this paper, methods of electronic prognostics that were developed by Sun Microsystems for enterprise servers are described. The principles of the Continuous System Telemetry Harness (CSTH), the Sequential Probability Ratio Test (SPRT), and the Multivariate State Estimation Technique (MSET), as they pertain to EP applications, are provided.

2. Health monitoring and electronic prognostics

Health monitoring of electronic systems can be performed by means of diagnostics, prognostics, or life consumption monitors. Diagnostic monitors determine the current state of health of a system and identify potential problems. Prognostic monitors identify faults and, by studying the fault behavior, provide an estimate of the time to failure for the product. Life consumption monitors measure the operating conditions, assessing accumulated damage and providing estimates of the remaining life of the product [1,2]. Vichare and Pecht categorize the main approaches for the implementation of prognostics as built-in-test (BIT), fuses and/or canary devices, parameter monitors, and accumulated damage monitors [3]. These approaches for HM and EP provide the means to address some of the concerns of enterprise server users.

Unfortunately, there are some significant issues associated with the current EP methods for the surveillance of electronic signals (voltage, resistance, temperature, humidity, vibration, or other system variables). Most HM and EP implementations perform simple tests on monitored signals. These tests (such as threshold tests and mean value tests) are only sensitive to gross changes in the monitored signals and produce a high number of false alarms and missed alarms [4]. Threshold values can be set close to the monitored signal to achieve high fault discrimination.
Unfortunately, the noise that is an integral part of that signal produces false alarms. The closer the threshold is to the signal, the higher the number of false alarms. The opposite situation is also of concern: when threshold values are set too far from the monitored signal to reduce the rate of false alarms, missed alarms are produced. The farther the threshold is from the signal, the higher the number of missed alarms. This see-saw effect is illustrated in Fig. 1.

Fig. 1. See-saw effect for fault discrimination and alarm rate seen in threshold limit approach.

Other significant issues with current EP approaches are that they:
- cannot separately specify the probabilities of false alarms and missed alarms;
- assume noise levels are insignificant;
- cannot detect drift in signals;
- do not provide the earliest possible annunciation of failure;
- cannot handle sensor degradation or failure;
- cannot identify correlations between signals;
- require expertise from the operator to interpret results;
- lack insight into root causes of failure to reduce NTF instances.

3. Advanced electronic prognostics

A method for EP in computer systems was introduced in 2001 by Dr. Kenny Gross and his team of researchers at Sun Microsystems. This approach circumvents many of the deficiencies of traditional HM and EP methods discussed above. It provides the tools needed to capture relevant system data and to identify signal patterns, correlations between signals, and root causes of failures, effectively reducing NTF instances and increasing Reliability, Availability and Serviceability (RAS). The EP approach consists of a Continuous System Telemetry Harness coupled with real-time pattern recognition algorithms [5].

The Continuous System Telemetry Harness (CSTH) was developed by Sun Microsystems to enable the capture, conditioning, synchronizing, and storage of computer system telemetry signals, which allows the subsequent statistical analysis of data.
The CSTH enables EP in Sun computer systems [7]. An overall illustration of the CSTH is provided in Fig. 2. The CSTH categorizes the information provided by the server into three kinds: soft variables, canary variables, and physical variables. Soft variables (internal variables) are values generated by the operating system which provide information on the performance of the hardware. Canary variables are values generated by software programs (other than the operating system) which provide information on the quality of the service, such as number of transactions per minute, service availability, and user wait times. Physical variables are direct measurements made in the system by means of sensors, such as temperature, voltage, current, vibration, fan speed, and relative humidity.

All of these variables originate from multiple locations and arrive with different formats, time stamps, sampling frequencies, and signal resolutions. The analysis of information by EP tools is not possible until the variables are synchronized and conditioned into a digitized time series, removing signal resolution and sampling frequency differences. Once the data is in the proper

format and time stamped, it is stored in a double circular file structure that acts as a black box recorder. In the Sun Microsystems F15K, which is a 72-microprocessor server with over 1000 sensors, the information is stored at high resolution for 72 h (first circular file structure) and then at low resolution for 30 days (second circular file structure). Analysis of the data, either by humans or by algorithms, can be performed in real time or later by retrieving stored information.

Fig. 2. Continuous system telemetry harness used in Sun Microsystems enterprise servers.

The analysis of signals obtained by the CSTH is done by means of pattern recognition algorithms. The first of these is the Sequential Probability Ratio Test (SPRT), and the second is the Multivariate State Estimation Technique (MSET) [6,7]. The methods are not new in statistical analysis, but their implementation with the CSTH for the monitoring of signals in computer systems is innovative. The application of these algorithms to enable EP in enterprise servers is presented in the remaining sections of this paper.

4. Testing of a statistical hypothesis

Before the SPRT and MSET implementations in EP are presented, it is necessary to cover some basic theory regarding the testing of statistical hypotheses. This section describes test procedures, critical regions, and error types.

4.1. Test procedures

A statistical hypothesis test is a quantitative evaluation of a belief in light of available data. The current belief is denoted by $H_0$ and is called the null hypothesis. The alternative belief is denoted by $H_1$ and is called the alternative hypothesis. The acceptance of the null hypothesis indicates that the test results (the data) are not sufficient evidence to reject the belief. The rejection of the null hypothesis, on the other hand, indicates that there is overwhelming evidence to reject the belief [8].
To perform a test of a statistical hypothesis it is necessary to define a test procedure that classifies all potential observations into mutually exclusive sets, allowing the acceptance or rejection of the hypothesis [9]. As an example of this concept, consider the following. During an experiment a random number of observations $n$ are performed to decide the outcome of a hypothesis. Assuming that each observation in the experiment is represented by the random variable $x_i$, and that the total number of observations made is $n$, the set $\{x_1, x_2, \ldots, x_n\}$ represents one combination of outcomes for the experiment. A test procedure is defined when all sets of potential outcomes are categorized as either supporting or rejecting the null hypothesis.

In a similar manner, consider an experiment where the following hypotheses are presented for consideration:
- $H_0$: The product has no defects (null hypothesis);
- $H_1$: The product has defects (alternative hypothesis).

As part of the experiment a sample of the product is analyzed two times, with each observation having a binary outcome representing a defect (1) or no defect (0). One combination of random observations of the variable $x_i$ in the experiment is the set $\{x_1, x_2\}$. All possible combinations of random test observations are given by the sets {0, 0}, {0, 1}, {1, 0} and {1, 1}. Assuming that no defects are tolerated on the product (the criterion used to categorize outcomes), any of the sequential observations {0, 1}, {1, 0} or {1, 1} results in the rejection of the null hypothesis. The sequential observations {0, 0} result in the acceptance of the null hypothesis. This is a test procedure for a statistical hypothesis that allows the acceptance or rejection of the null hypothesis.

4.2. Critical region and error types

In order to quantify test procedures for statistical hypotheses it is necessary to define the probabilities of committing errors.
When accepting or rejecting a null hypothesis $H_0$, it is possible to commit two kinds of errors: rejecting the null hypothesis $H_0$ when it is true (Type I error), or accepting the null hypothesis $H_0$ when the alternative hypothesis $H_1$ is true (Type II error). The probability of a Type I error is represented by $\alpha$ and is known as the size of the critical region, or the false alarm probability. The critical region is simply the area that represents the probability of rejection. The probability of a Type II error is represented by $\beta$ and is known as the missed alarm probability. The quantity $1 - \beta$ is called the power of the critical region. Based on these definitions the probabilities of acceptance and rejection for a hypothesis test are defined.

When the null hypothesis $H_0$ is rejected:
- $\alpha$ = probability of rejecting $H_0$ (if $H_0$ is true);
- $1 - \beta$ = probability of rejecting $H_0$ (if $H_1$ is true).

When the null hypothesis $H_0$ is accepted:
- $\beta$ = probability of accepting $H_0$ (if $H_1$ is true);
- $1 - \alpha$ = probability of accepting $H_0$ (if $H_0$ is true).

If the size of a critical region is chosen, multiple critical regions, each with its own value of $\beta$, can satisfy the required size. For practical purposes it is desirable to have a small probability of Type I error and a small probability of Type II error. To simplify the selection of a critical region and to provide the smallest possible probabilities for $\alpha$ and $\beta$, the Neyman-Pearson theory is used. See Refs. [10,11] for more details.

4.3. Neyman-Pearson method

The theory presented up to this point has only addressed a random variable $x_i$ without further consideration of statistical distributions and distribution parameters. The Neyman-Pearson method allows the incorporation of statistical distributions and distribution parameters in the estimation of a critical region. Assume that the random variable $x_i$ is described by a statistical distribution that has a single parameter $\mu$, which represents the mean. A hypothesis test can be formulated in relation to the mean value $\mu$ of the random variable $x_i$, with null hypothesis $H_0$: $\mu = \mu_0$, and alternative hypothesis $H_1$: $\mu = \mu_1$. If the unknown statistical distributions of the random variable $x_i$ are represented by $g(x)$ (for the null hypothesis) and by $f(x)$ (for the alternative hypothesis), then the Neyman-Pearson method defines the critical region as

$$\frac{f(x_1) f(x_2) \cdots f(x_n)}{g(x_1) g(x_2) \cdots g(x_n)} \geq k, \tag{1}$$

where $k$ is a constant that ensures a critical region of size $\alpha$.

4.4. Example of a statistical hypothesis test

The following example was provided by Wald [9] to demonstrate statistical hypothesis testing of experiments with a fixed sample size.
It is replicated in this work because it underpins the principles of sequential testing that define SPRT, which will simplify the explanations that follow. Assume that a null hypothesis, $H_0$, is formulated that a random variable $x_i$ is normally distributed, having a mean $\mu = \mu_0$ and a standard deviation $\sigma = 1$. The alternative hypothesis, $H_1$, is that the random variable $x_i$ is normally distributed, having a mean $\mu = \mu_1$ and a standard deviation $\sigma = 1$ (assume $\mu_1 > \mu_0$). The probability density function (PDF) for the random variable $x_i$ with mean $\mu_1$ is given by

$$f(x_i) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_i - \mu_1)^2} \tag{2}$$

and the PDF for the random variable $x_i$ with mean $\mu_0$ is given by

$$g(x_i) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_i - \mu_0)^2}. \tag{3}$$

Applying the Neyman-Pearson principle,

$$f(x_1) f(x_2) \cdots f(x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_1)^2}, \tag{4}$$

$$g(x_1) g(x_2) \cdots g(x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_0)^2}, \tag{5}$$

$$\frac{f(x_1) f(x_2) \cdots f(x_n)}{g(x_1) g(x_2) \cdots g(x_n)} = \frac{\frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_1)^2}}{\frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu_0)^2}} \geq k. \tag{6}$$

Finally, taking the natural log of both sides and simplifying,

$$(\mu_1 - \mu_0)\sum_{i=1}^{n} x_i + \frac{n}{2}\left(\mu_0^2 - \mu_1^2\right) \geq \ln(k), \tag{7}$$

$$\sum_{i=1}^{n} x_i \geq \frac{\ln(k) - \frac{n}{2}\left(\mu_0^2 - \mu_1^2\right)}{\mu_1 - \mu_0}. \tag{8}$$

From these inequalities it is possible to define the limits of a critical region of size $\alpha$ that has the smallest possible value of $\beta$ (Type II error), for the case where the number of observations is fixed. See Ref. [9] for a full description of this topic.

5. Sequential testing of statistical hypothesis

This section describes sequential analysis, the Sequential Probability Ratio Test (SPRT), error probabilities, and the estimation of test constants.

5.1. Sequential analysis

Sequential analysis is a method specially designed for the testing of hypotheses where there is a random number of observations during the experiment.
In other words, the decision to stop or continue an experiment is determined by the latest outcome of the test and not by a specified test time or failure count. A sequential test of a statistical hypothesis requires a test procedure that defines the criteria to:
- stop the test and accept $H_0$;
- stop the test and reject $H_0$;
- continue the test (not enough information).

The implementation of sequential testing, as well as the definition of the criteria to decide the outcome of a test, is provided next.
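For contrast with the sequential procedure, the fixed-sample test of Section 4.4 can be made concrete with a short Python sketch. It assumes unit-variance normal observations, as in that example, and uses the fact that the sum of $n$ such observations is normal with mean $n\mu$ and variance $n$. The function name `fixed_sample_test` and the numeric values in the usage note are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def fixed_sample_test(mu0, mu1, n, alpha):
    """Critical value for the sum of n unit-variance normal observations.

    H0 (mean mu0) is rejected in favour of H1 (mean mu1 > mu0) when
    sum(x) >= c. Returns (c, beta, ln_k), where ln_k is the constant
    that appears in inequality (8).
    """
    std_normal = NormalDist()
    # Under H0 the sum is N(n*mu0, n); pick c so that P(sum >= c | H0) = alpha.
    c = n * mu0 + sqrt(n) * std_normal.inv_cdf(1.0 - alpha)
    # Under H1 the sum is N(n*mu1, n); beta = P(sum < c | H1).
    beta = std_normal.cdf((c - n * mu1) / sqrt(n))
    # Inequality (8) rearranged for ln(k).
    ln_k = c * (mu1 - mu0) + 0.5 * n * (mu0 ** 2 - mu1 ** 2)
    return c, beta, ln_k
```

With the hypothetical values $\mu_0 = 0$, $\mu_1 = 1$, $n = 9$ and $\alpha = 0.05$, the sketch gives a critical value of about 4.93 and a missed alarm probability of about 0.09. The sample size is fixed in advance here, which is the constraint that sequential testing removes.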

5.2. Sequential probability ratio test

SPRT is a method of statistical inference that was developed by Abraham Wald in the 1940s. It incorporates the concepts of test procedures, critical regions, the Neyman-Pearson method, and sequential analysis. In SPRT the criteria to stop-accept, stop-reject, or continue a test (as required by sequential testing) are defined with the Neyman-Pearson method using upper and lower limits

$$B < \frac{f(x_1) f(x_2) \cdots f(x_n)}{g(x_1) g(x_2) \cdots g(x_n)} < A, \tag{9}$$

where $A$ and $B$ are constants, chosen to ensure that the critical region size defined by $\alpha$ is obtained. In Eq. (9), $f(x_i)$ and $g(x_i)$ represent the PDFs for the alternative ($H_1$) and null ($H_0$) hypotheses respectively, of a random variable $x_i$. The term $f(x_1) f(x_2) \cdots f(x_n)$ is the joint probability (likelihood) of the observed series of $x_i$ under the alternative hypothesis. The term $g(x_1) g(x_2) \cdots g(x_n)$ is the joint probability (likelihood) of the same series under the null hypothesis. The ratio of these probabilities is compared to the upper and lower limits to provide the stop/continue criteria for the test. If the probability ratio equals or exceeds $A$, the test is stopped, rejecting the null hypothesis $H_0$. If the probability ratio is less than or equal to $B$, the test is stopped, accepting the null hypothesis $H_0$. If the probability ratio is between $A$ and $B$, the test continues.

If $f(x_1) f(x_2) \cdots f(x_n)$ is represented by $F(x)$ and $g(x_1) g(x_2) \cdots g(x_n)$ is represented by $G(x)$, then inequality (9) can be written as

$$B < \frac{F(x)}{G(x)} < A. \tag{10}$$

When the functions in the probability ratio $F(x)/G(x)$ are replaced by statistical distributions, the probability ratio becomes complex and more difficult to evaluate. This complexity is avoided by taking the natural log of each term, as shown by inequality (7), except that there are now upper and lower limits given by $\ln(A)$ and $\ln(B)$.
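The stop/continue logic can be sketched in a few lines of Python for the unit-variance normal case of Section 4.4, using the left-hand side of inequality (7) as the running quantity that is compared against $\ln(B)$ and $\ln(A)$. This is a minimal illustration only: the function name `sprt_mean_test`, its arguments, and the threshold values in the usage note are hypothetical.

```python
def sprt_mean_test(samples, mu0, mu1, ln_b, ln_a):
    """Sequential test of H0 (mean mu0) against H1 (mean mu1), unit variance.

    Returns (decision, n), where n is the number of observations consumed.
    """
    total = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        total += x
        # Left-hand side of inequality (7) after n observations.
        index = (mu1 - mu0) * total + 0.5 * n * (mu0 ** 2 - mu1 ** 2)
        if index >= ln_a:
            return "reject H0", n   # log probability ratio reached ln(A)
        if index <= ln_b:
            return "accept H0", n   # log probability ratio reached ln(B)
    return "continue", n            # data exhausted without a decision
```

For example, with $\mu_0 = 0$, $\mu_1 = 1$ and hypothetical limits of $\pm 3$, a run of observations equal to 1.0 adds 0.5 per observation and stops with a rejection at the sixth observation, while observations equal to 0.5 add nothing and the test keeps sampling.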
The left-hand side of inequality (7) is called the SPRT index.

5.3. Selection of error probabilities and constants

In inequality (10), $F(x)$ was shown to represent the probability of having sequential observations that have a mean $\mu = \mu_1$ (alternative hypothesis), while $G(x)$ was the probability of having sequential observations that have a mean $\mu = \mu_0$. Using the definitions for error probabilities that were provided in Section 4.2, the probability ratio for the case where the null hypothesis $H_0$ is rejected is represented by [12]

$$\frac{F(x)}{G(x)} = \frac{1 - \beta}{\alpha} \geq A. \tag{11}$$

For the case where the null hypothesis $H_0$ is accepted,

$$\frac{F(x)}{G(x)} = \frac{\beta}{1 - \alpha} \leq B. \tag{12}$$

These two inequalities provide the criteria needed to stop the test and accept the null hypothesis, stop the test and reject the null hypothesis, or continue the test, all with user-defined error probabilities.

5.4. Advantages of SPRT

The SPRT approach provides many advantages for the analysis of surveillance data in enterprise servers. Some of these are that it:
- is quantitative;
- allows the user to independently define false alarm and missed alarm probabilities;
- is ideal for the analysis of steady-state processes, which are stationary;
- has high sensitivity to subtle changes in signals;
- provides the shortest mathematically possible time for the detection of subtle changes in noisy process variables;
- does not require an expert to interpret results;
- can detect a change in mean and variance for signals with noise that is normally distributed;
- allows the definition of the signal disturbance magnitude that will be tolerated;
- can be set for non-normal probability distributions.

6. Advanced electronic prognostics using SPRT

The implementation of SPRT shown in the last section provided an example of how a variable could be monitored by testing a null and an alternative hypothesis about an expected mean value.
An implementation of SPRT for the analysis of time series data (as found in signals of enterprise servers) is presented next.

6.1. Definition of the test procedure

Assume that a null hypothesis is tested, but this time against multiple alternative hypotheses. Let $H_0$ be the null hypothesis that a monitored signal from an enterprise server, $x_i$, follows a normal distribution with mean $\mu = 0$ and variance $\sigma^2$. The following alternative hypotheses can be made about the distribution of the server signal. $H_1$ is the hypothesis that the mean is $\mu_1 > \mu$, called the positive mean test. $H_2$ is the hypothesis that the mean is $\mu_2 < \mu$, called the negative mean test. $H_3$ is the hypothesis that the variance is $V\sigma^2$, called the nominal variance test. Finally, $H_4$ is the hypothesis that the variance is $\sigma^2/V$, called the inverse variance test. These hypotheses provide the test procedure needed to analyze positive and negative changes in the mean and variance of a signal with normally distributed noise, as illustrated in Figs. 3 and 4.
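A hedged Python sketch of these four tests is given below. The per-observation log-likelihood ratios follow directly from the normal PDF for each pair of hypotheses; the limits are obtained by treating inequalities (11) and (12) of Section 5.3 as equalities. Several choices are assumptions made only for illustration: the names `wald_limits`, `llr_increments` and `monitor`, the symmetric choice $\mu_2 = -\mu_1$ (passed as `m`), and the decision to reset an index to zero whenever it reaches either limit.

```python
import math


def wald_limits(alpha, beta):
    """Return (ln B, ln A), treating inequalities (11) and (12) as equalities."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def llr_increments(x, sigma2, m, v):
    """Log-likelihood ratio of each alternative against H0 ~ N(0, sigma2)."""
    return {
        "H1": (m / sigma2) * (x - m / 2.0),    # mean shifted to +m
        "H2": (-m / sigma2) * (x + m / 2.0),   # mean shifted to -m (assumed symmetric)
        "H3": -0.5 * math.log(v) + x * x * (1.0 - 1.0 / v) / (2.0 * sigma2),  # variance v*sigma2
        "H4": 0.5 * math.log(v) - x * x * (v - 1.0) / (2.0 * sigma2),         # variance sigma2/v
    }


def monitor(samples, sigma2, m, v, alpha, beta):
    """Run the four tests side by side; return a list of (hypothesis, sample index) alarms."""
    ln_b, ln_a = wald_limits(alpha, beta)
    index = {name: 0.0 for name in ("H1", "H2", "H3", "H4")}
    alarms = []
    for t, x in enumerate(samples):
        for name, increment in llr_increments(x, sigma2, m, v).items():
            index[name] += increment
            if index[name] >= ln_a:      # alternative accepted: raise an alarm
                alarms.append((name, t))
                index[name] = 0.0
            elif index[name] <= ln_b:    # H0 accepted: restart this test
                index[name] = 0.0
    return alarms
```

As a usage example with hypothetical settings ($\sigma^2 = 0.0004$, `m = 0.02`, `v = 2`, $\alpha = 0.001$, $\beta = 0.005$), a normalized signal that sits at 0.04 makes the positive mean test alarm on the fifth observation, while the other three indices stay below $\ln(A)$.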

In summary, the hypotheses are:
- $H_0$: mean $\mu = 0$ and variance $\sigma^2$;
- $H_1$: mean $\mu_1 > \mu$ and variance $\sigma^2$;
- $H_2$: mean $\mu_2 < \mu$ and variance $\sigma^2$;
- $H_3$: mean $\mu = 0$ and variance $V\sigma^2$;
- $H_4$: mean $\mu = 0$ and variance $\sigma^2/V$;

where $V$ is a variance factor.

Fig. 3. Change in mean detected with a SPRT test.

Fig. 4. Shift in variance detected with a SPRT test.

To evaluate the four hypotheses just presented, the Neyman-Pearson principle, as described by inequalities (9) and (10), is applied, creating a total of four inequalities (each against the null hypothesis):

$$B < \frac{F_1(x)}{G(x)} < A, \quad B < \frac{F_2(x)}{G(x)} < A, \quad B < \frac{F_3(x)}{G(x)} < A, \quad B < \frac{F_4(x)}{G(x)} < A. \tag{13}$$

In each inequality, the parameters of each statistical distribution are replaced by the ones assumed by the hypothesis ($0$ and $\sigma^2$ for the null hypothesis, $\mu_1$ and $\sigma^2$ for $H_1$, $\mu_2$ and $\sigma^2$ for $H_2$, $0$ and $V\sigma^2$ for $H_3$, $0$ and $\sigma^2/V$ for $H_4$). After the inequalities are transformed, as done in (7), the probability ratio for each case can be evaluated against the natural log of $A$ and $B$ (see Section 5.3).

6.2. Definition of parameters

Before monitored time series signals from an enterprise system can be tested against the four hypotheses, it is necessary to define the variables $\mu_1$ (mean value above $\mu$), $\mu_2$ (mean value below $\mu$), $V$ (variance factor), $\alpha$ (false alarm probability), and $\beta$ (missed alarm probability). These parameters give the user the ability to set alarm levels and to choose the risk that can be tolerated for false and missed alarms. While this example assumes a normally distributed signal, it is possible to have implementations with other statistical distributions. For a SPRT test using a Weibull distribution see Ref. [12].

6.3. Continuous monitoring of system signals with SPRT

The process used for the continuous monitoring of enterprise server signals with SPRT is illustrated in Fig. 5.
Initially, as described in Sections 6.1 and 6.2, the test procedure (pass/fail criteria) is defined by means of the inequalities (13) and the values selected for the parameters and probabilities. Observed signals provided by the CSTH are used for a period of time to train the SPRT program, allowing the identification of the signal mean and variance and determining whether there is a non-zero mean. Signals that do not have a mean value of zero are normalized at this point to allow the monitoring. Signals that are not stationary in nature (the observed mean does not remain constant) can be analyzed by other methods, such as MSET.

Fig. 5. SPRT setup and monitor process.

After the setup activities are completed, the system starts a cycle of continuous monitoring, testing the null hypothesis against the four alternative hypotheses. In the event that any probability ratio shown in (13) is outside

of the boundaries defined by the parameters $A$ and $B$, the null hypothesis is rejected and one or more of the alternative hypotheses are accepted. The SPRT index is reset to zero, an alarm is raised, and the monitoring of the signals continues. If all of the probability ratios shown in (13) are within the defined boundaries, no action is taken, and the sampling of signals continues.

As demonstrated, the CSTH and SPRT can be used for single or multiple hypothesis testing of signals, and can be implemented with normally and non-normally distributed random variables. Since SPRT is based on sequential testing, it can provide test results with a smaller number of observations than is possible with other methods.

7. Advanced electronic prognostics using MSET

The methods provided for SPRT are only applicable to independent signals that are stationary in nature. However, many of the signals found in advanced systems are not stationary or are not defined by a normal distribution. In addition, CSTH-SPRT analysis looks at the monitored signals individually, lacking the ability to identify correlations between them. The Multivariate State Estimation Technique (MSET) is a statistical analysis method developed by Argonne National Laboratory (ANL) for the detection of faults in complex, safety-critical systems (Sun Microsystems utilizes a commercial implementation of MSET in a Matlab toolkit called eCM™ from SmartSignal Corp.).
When this tool is combined with CSTH and SPRT it has multiple benefits over other EP methods.

7.1. Advantages of MSET

The CSTH-MSET-SPRT approach to EP has the following capabilities for the analysis of signals in enterprise servers:
- all capabilities provided by SPRT;
- very low probability of false alarms;
- analysis of dynamic (non-stationary) data;
- training of systems with data from known good components or systems;
- identification of correlations between signals;
- monitoring and comparison of multiple signals at a given time, generating a model for each;
- creation of a dynamic band around each input value;
- detection of small variations in input signals;
- identification of faulty sensors using learned correlations between signals;
- detection of software aging issues and proactive software rejuvenation [13,14];
- with CSTH, proactive fault monitoring that provides the earliest possible warning of incipient failures [15].

7.2. Continuous monitoring of system signals with MSET

The overall process used by MSET to analyze signals in enterprise server applications is provided in Fig. 6. Similar to the process presented for SPRT analysis, MSET utilizes signals that have been conditioned by the CSTH. The first phase of MSET analysis consists of the monitoring of soft, canary, and physical variables that are representative of the normal operation of the server. Typically the server is set to perform normal operating transactions, providing representative measurements of temperature, humidity, voltage, and other parameters of interest. This is called the training phase. At any given moment in time during the training phase multiple signals are monitored, and their respective values are stored as vectors. Over time these vectors become a matrix, called the state matrix. The state matrix is manipulated algebraically to provide estimates of signal correlations as well as models that can represent signal behavior over time.
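The idea of a state matrix that yields an estimate for each new observation can be illustrated with a small Python sketch. The paper does not specify the algebra that MSET applies to the state matrix, so the sketch swaps in a much simpler technique: a Gaussian-kernel weighted average of the training vectors (a Nadaraya-Watson style estimator). This is not the MSET operator or the commercial implementation; the names `estimate` and `residuals`, the `bandwidth` parameter, and the normalized example values are all hypothetical. The residual (measured minus estimated) is the quantity that would be handed to an SPRT module.

```python
import math


def estimate(state_matrix, observed, bandwidth):
    """Similarity-weighted average of training vectors (stand-in for MSET).

    state_matrix: list of training vectors, one per time stamp.
    observed:     current vector of signal values (same length as each row).
    """
    dist2 = [sum((a - b) ** 2 for a, b in zip(row, observed))
             for row in state_matrix]
    nearest = min(dist2)
    # Subtracting the smallest distance keeps the largest weight at 1.0,
    # so the weight sum can never underflow to zero.
    weights = [math.exp(-(d - nearest) / (2.0 * bandwidth ** 2)) for d in dist2]
    total = sum(weights)
    return [sum(w * row[j] for w, row in zip(weights, state_matrix)) / total
            for j in range(len(observed))]


def residuals(state_matrix, observed, bandwidth):
    """Measured minus estimated value for each signal."""
    est = estimate(state_matrix, observed, bandwidth)
    return [o - e for o, e in zip(observed, est)]
```

As a usage example, take two correlated, normalized signals trained on the vectors [0.0, 0.0], [0.5, 0.5] and [1.0, 1.0] with a bandwidth of 0.25. An observation of [0.5, 0.5] gives residuals close to zero, whereas [0.5, 0.9] gives a clearly positive residual on the second signal, which is the kind of inconsistency that learned correlations expose.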
The models and correlations are completed when enough data has been obtained. Upon completion of the training phase the MSET program is ready for the continuous monitoring of signals. For every signal observation, the MSET model generates an estimate of that signal for that particular time. The difference between the measured and estimated signals is calculated and fed into the SPRT analysis module. If the difference between the observed and estimated signals is zero, then the SPRT module determines that there has been no measurable change at the time of the observation. This is an indicator of normal operation. If the difference between the signals is not zero, and the difference is determined to be statistically significant, then the SPRT module triggers an alarm. This is an indicator that the system is not functioning within normal operating conditions.

7.3. Other significant features of MSET

A very important feature of MSET is its ability to generate a dynamic band around the monitored signal. The band is defined with the upper and lower limits of SPRT and is updated for every single observation. Since the MSET model considers operating conditions and correlations between signals, this dynamic band provides a tool to detect even subtle anomalies with very low probabilities of error.

Another significant feature of MSET is the capability to validate sensors. Sensors used in computer systems are well known to have a shorter life span than that of the system they are monitoring. When a sensor fails by providing degraded or constant signals, false alarms are triggered by the monitoring system. When a sensor fails by providing no signal (or a constant signal stuck below the alarm limits), alarms are missed by the monitoring system. Since enterprise servers have hundreds of sensors, the propagation of false alarms and missed alarms becomes an

important concern. MSET uses the state matrix defined during the training phase to derive correlations between sensors. These correlations are used during the surveillance of signals to identify sensors whose behavior is not consistent with the known normal operation. In this manner sensor failures and degradation can be detected proactively, allowing the MSET module to mask defective sensors and replace the signal with an estimated signal that is generated by the MSET model of the system. The reader is referred to the publications listed for specific examples of SPRT and MSET surveillance of enterprise server signals [13-15].

Fig. 6. MSET training and surveillance phases.

8. Example of implementation using SPRT

A practical application of electronic prognostics for the monitoring of signals in enterprise servers is now presented. Assume that a voltage signal from a server is monitored over time to ensure that levels are maintained within predetermined specifications, with the results provided in Table 1. In addition, assume that previous measurements of the power supply indicate that the signal is normally distributed with a mean of 1.5 V and a standard deviation of 0.02 V. With this information, null and alternative hypotheses are defined to evaluate the data:
- $H_0$: The signal is normally distributed with mean $\mu = 0$ and variance $\sigma^2$ (the signal is normalized to zero).
- $H_1$: The signal is normally distributed with mean $\mu_1 > \mu$ and variance $\sigma^2$.

Table 1. Sample measurements for a power supply.

Using the procedure shown in Eqs.
(2)–(7), with variance $\sigma^2$, results in the following inequality and SPRT index for $H_1$. In Eqs. (14), (15) and (18), $\mu$ stands for the shifted mean postulated under $H_1$, since the $H_0$ mean is zero:

$$\frac{f(x_1)\,f(x_2)\cdots f(x_n)}{g(x_1)\,g(x_2)\cdots g(x_n)} = \exp\left[\frac{1}{2\sigma^2}\sum_{i=1}^{n}\mu\,(2x_i-\mu)\right] \ge k, \tag{14}$$

$$\ln\left[\frac{f(x_1)\,f(x_2)\cdots f(x_n)}{g(x_1)\,g(x_2)\cdots g(x_n)}\right] = \frac{\mu}{\sigma^2}\sum_{i=1}^{n}\left(x_i-\frac{\mu}{2}\right). \tag{15}$$

If the false alarm probability $\alpha = 0.001$ and the missed alarm probability $\beta = 0.005$, then per inequalities (11) and (12):

$$\frac{F(x)}{G(x)} = \frac{1-0.005}{0.001} \ge A, \tag{16}$$

$$\frac{F(x)}{G(x)} = \frac{0.005}{1-0.001} \le B. \tag{17}$$

The natural log of each term provides the accept/reject criteria for the hypothesis test in relation to the SPRT index:

$$-5.2973 < \frac{\mu}{\sigma^2}\sum_{i=1}^{n}\left(x_i-\frac{\mu}{2}\right) < 6.9027. \tag{18}$$

Every voltage measurement from the power supply is now tested against the SPRT index. If the index is less than or equal to $\ln(B) = -5.2973$, the signal is declared healthy, there being no evidence to say otherwise. If the index is greater than or equal to $\ln(A) = 6.9027$, the signal is declared degraded, there being strong statistical evidence to support it. These results are shown in Table 2 for some representative data points.

Table 2. Signal analysis with SPRT. Columns: Observation, SPRT index, Results. Results in row order: Index set to zero; Signal is healthy (3 rows); Degraded signal (5 rows); Signal is healthy.

9. Conclusions

This paper has provided the fundamental theory that enables the implementation of SPRT and MSET EP in enterprise servers. The continuous system telemetry harness (CSTH) was introduced as the tool that enables Electronic Prognostics in Sun Microsystems servers, providing the means to retrieve, condition, and store information that can be used by experts and/or pattern recognition algorithms. CSTH features such as the black box and circular file structure were presented. The CSTH in conjunction with pattern recognition techniques was introduced as a method for EP, allowing the detection of incipient faults and the prediction of failures in enterprise servers. These features allow proactive and reactive actions such as estimation of remaining life, preventive repair planning, sensor characterization, corrective actions in software, and an overall reduction of NTF issues.

Acknowledgments

The author wishes to express his thanks to Kenny Gross, Keith Whisnant, David McElfresh, Dan Vacar and Bob Melanson from Sun Microsystems for their support in the writing of this paper. In addition, many thanks to Michael Pecht, Diganta Das, and Peter Sandborn from the CALCE center at the University of Maryland for their valuable suggestions.

References

[1] Pecht M et al. Health and life consumption monitoring of electronic products. CALCE, University of Maryland.
[2] Mishra S, Pecht M, Goodman D. In-situ sensors for product reliability monitoring. In: Proc SPIE, vol. 4755.
[3] Vichare N, Pecht M. Prognostics and health management of electronics. IEEE transactions on components and packaging technologies, vol. 29, no. 1; March.
[4] Gross K, Lu W. Early detection of signal and process anomalies in enterprise computing systems. IEEE international conference on machine learning and applications; June.
[5] Gross K, Mishra K. Improved methods for early fault detection in enterprise computing servers using SAS tools. In: SAS users group international symposium; May.
[6] Gross K et al. Proactive fault monitoring in enterprise servers. IEEE international multiconference in computer science & computer engineering; June.
[7] Gross K et al. Electronic prognostics through continuous system telemetry. 60th meeting of the society for machinery failure prevention technology (MFPT06); April.
[8] Tobias P, Trindade D. Applied reliability. Chapman & Hall/CRC.
[9] Wald A. Sequential analysis. Wiley.
[10] Barlett MS. Sequential methods in statistics. Wiley.
[11] Wald A. Selected papers in statistics and probability. In: Anderson TW, editor. McGraw-Hill; p. 13. p. 154 and 548.
[12] Johnson L. The statistical treatment of fatigue experiments. Elsevier.
[13] Vaidyanathan K, Gross K. Proactive detection of software anomalies through MSET. IEEE workshop on predictive software models (PSM); September.
[14] Vaidyanathan K, Gross K. MSET performance optimization for detection of software aging. In: IEEE international symposium on software reliability engineering (ISSRE); November.
[15] Gross K, Bhardwaj V, Bickford R. Proactive detection of aging mechanisms in performance-critical computers. In: Annual IEEE/NASA software engineering symposium; December.

Leon Lopez is a reliability physics researcher in the RAS Computer Analysis Laboratory of Sun Microsystems, San Diego, California, where he applies the principles of physics of failure in the research and analysis of materials, components, and assemblies used in computer systems. He also develops reliability qualification procedures for the evaluation of computer components. He has over 10 years of experience in the qualification of ASICs, microprocessors, memory devices and IC sockets. Leon received his B.S. in electronics engineering from Brigham Young University in 1996 and has completed the academic requirements for his M.S. degree in Reliability Engineering from the University of Maryland, College Park. He is currently a Ph.D. candidate in Reliability Engineering at the University of Maryland, College Park.
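As a supplement to the worked example in Section 8, the SPRT decision rule can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the `shift` argument (the mean postulated under H1, whose value is not given in the text), and the choice to reset the index to zero after each decision are all hypothetical. Only the thresholds (false alarm probability 0.001, missed alarm probability 0.005) and the standard deviation of 0.02 V come from the example.

```python
import math


def sprt_thresholds(alpha, beta):
    """Return (ln B, ln A), Wald's approximate decision limits."""
    ln_a = math.log((1.0 - beta) / alpha)  # upper limit: declare "degraded"
    ln_b = math.log(beta / (1.0 - alpha))  # lower limit: declare "healthy"
    return ln_b, ln_a


def sprt_monitor(residuals, shift, sigma, alpha=0.001, beta=0.005):
    """Run the mean-shift SPRT over residuals normalized to zero mean.

    `shift` is the assumed H1 mean (hypothetical parameter); `sigma` is the
    known standard deviation. Returns a list of (index, decision) pairs.
    """
    ln_b, ln_a = sprt_thresholds(alpha, beta)
    index = 0.0
    results = []
    for x in residuals:
        # Accumulate one term of the log-likelihood ratio, as in Eq. (15).
        index += (shift / sigma**2) * (x - shift / 2.0)
        if index >= ln_a:
            results.append((index, "degraded"))
            index = 0.0  # assumption: restart the test after a decision
        elif index <= ln_b:
            results.append((index, "healthy"))
            index = 0.0  # assumption: restart the test after a decision
        else:
            results.append((index, "continue"))
    return results
```

With the probabilities from the example, `sprt_thresholds(0.001, 0.005)` reproduces the limits of Eq. (18) to four decimal places. A call such as `sprt_monitor(samples, shift=0.02, sigma=0.02)` then labels each observation, where `samples` would hold measured voltages minus the 1.5 V nominal mean and the one-sigma shift is an illustrative choice only.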


More information

Risk Analysis of Highly-integrated Systems

Risk Analysis of Highly-integrated Systems Risk Analysis of Highly-integrated Systems RA II: Methods (FTA, ETA) Fault Tree Analysis (FTA) Problem description It is not possible to analyse complicated, highly-reliable or novel systems as black box

More information

COMPARISON OF STATISTICAL ALGORITHMS FOR POWER SYSTEM LINE OUTAGE DETECTION

COMPARISON OF STATISTICAL ALGORITHMS FOR POWER SYSTEM LINE OUTAGE DETECTION COMPARISON OF STATISTICAL ALGORITHMS FOR POWER SYSTEM LINE OUTAGE DETECTION Georgios Rovatsos*, Xichen Jiang*, Alejandro D. Domínguez-García, and Venugopal V. Veeravalli Department of Electrical and Computer

More information

Chapter 6. a. Open Circuit. Only if both resistors fail open-circuit, i.e. they are in parallel.

Chapter 6. a. Open Circuit. Only if both resistors fail open-circuit, i.e. they are in parallel. Chapter 6 1. a. Section 6.1. b. Section 6.3, see also Section 6.2. c. Predictions based on most published sources of reliability data tend to underestimate the reliability that is achievable, given that

More information

Department of Electrical and Computer Engineering University of Wisconsin Madison. Fall Midterm Examination CLOSED BOOK

Department of Electrical and Computer Engineering University of Wisconsin Madison. Fall Midterm Examination CLOSED BOOK Department of Electrical and Computer Engineering University of Wisconsin Madison ECE 553: Testing and Testable Design of Digital Systems Fall 2014-2015 Midterm Examination CLOSED BOOK Kewal K. Saluja

More information

APPENDIX 1 NEYMAN PEARSON CRITERIA

APPENDIX 1 NEYMAN PEARSON CRITERIA 54 APPENDIX NEYMAN PEARSON CRITERIA The design approaches for detectors directly follow the theory of hypothesis testing. The primary approaches to hypothesis testing problem are the classical approach

More information

Session-Based Queueing Systems

Session-Based Queueing Systems Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the

More information

VLSI Design I. Defect Mechanisms and Fault Models

VLSI Design I. Defect Mechanisms and Fault Models VLSI Design I Defect Mechanisms and Fault Models He s dead Jim... Overview Defects Fault models Goal: You know the difference between design and fabrication defects. You know sources of defects and you

More information

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries

Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Automatic Differentiation Equipped Variable Elimination for Sensitivity Analysis on Probabilistic Inference Queries Anonymous Author(s) Affiliation Address email Abstract 1 2 3 4 5 6 7 8 9 10 11 12 Probabilistic

More information

Test Strategies for Experiments with a Binary Response and Single Stress Factor Best Practice

Test Strategies for Experiments with a Binary Response and Single Stress Factor Best Practice Test Strategies for Experiments with a Binary Response and Single Stress Factor Best Practice Authored by: Sarah Burke, PhD Lenny Truett, PhD 15 June 2017 The goal of the STAT COE is to assist in developing

More information

All-in-one or BOX industrial PC for autonomous or distributed applications

All-in-one or BOX industrial PC for autonomous or distributed applications M a g e l i s i P C All-in-one or BOX industrial PC for autonomous or distributed applications Intel Core Duo TM Windows XP TM HDD / Flash disk M a g e l i s i P C You are looking for an open, powerful

More information

CS7267 MACHINE LEARNING

CS7267 MACHINE LEARNING CS7267 MACHINE LEARNING ENSEMBLE LEARNING Ref: Dr. Ricardo Gutierrez-Osuna at TAMU, and Aarti Singh at CMU Mingon Kang, Ph.D. Computer Science, Kennesaw State University Definition of Ensemble Learning

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information

Advanced/Advanced Subsidiary. You must have: Mathematical Formulae and Statistical Tables (Blue)

Advanced/Advanced Subsidiary. You must have: Mathematical Formulae and Statistical Tables (Blue) Write your name here Surname Other names Pearson Edexcel International Advanced Level Centre Number Statistics S2 Advanced/Advanced Subsidiary Candidate Number Monday 26 June 2017 Afternoon Time: 1 hour

More information

Automated Statistical Recognition of Partial Discharges in Insulation Systems.

Automated Statistical Recognition of Partial Discharges in Insulation Systems. Automated Statistical Recognition of Partial Discharges in Insulation Systems. Massih-Reza AMINI, Patrick GALLINARI, Florence d ALCHE-BUC LIP6, Université Paris 6, 4 Place Jussieu, F-75252 Paris cedex

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

Duration of online examination will be of 1 Hour 20 minutes (80 minutes).

Duration of online examination will be of 1 Hour 20 minutes (80 minutes). Program Name: SC Subject: Production and Operations Management Assessment Name: POM - Exam Weightage: 70 Total Marks: 70 Duration: 80 mins Online Examination: Online examination is a Computer based examination.

More information

Gear Health Monitoring and Prognosis

Gear Health Monitoring and Prognosis Gear Health Monitoring and Prognosis Matej Gas perin, Pavle Bos koski, -Dani Juiric ic Department of Systems and Control Joz ef Stefan Institute Ljubljana, Slovenia matej.gasperin@ijs.si Abstract Many

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

Maintenance free operating period an alternative measure to MTBF and failure rate for specifying reliability?

Maintenance free operating period an alternative measure to MTBF and failure rate for specifying reliability? Reliability Engineering and System Safety 64 (1999) 127 131 Technical note Maintenance free operating period an alternative measure to MTBF and failure rate for specifying reliability? U. Dinesh Kumar

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information