Early Fault Detection and Optimal Maintenance Control for Partially Observable Systems Subject to Vibration Monitoring


by Chen Lin

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Department of Mechanical and Industrial Engineering
University of Toronto

Copyright by Chen Lin (2014)

Abstract

Due to advancements in data measurement and computer technology, automated data collection from multiple sensors has become common in recent years. However, very few published papers have dealt with cost-optimal early fault detection of gearboxes, condition-based maintenance policy, and remaining useful life prediction when multiple sensors are used for data collection under both fixed and varying load. This research focuses on several new developments. First, an early fault detection scheme has been developed using real multivariate vibration data under varying load. A multivariate Hotelling's T-squared control chart has been applied to detect the early fault occurrence of a gearbox. Furthermore, to investigate maintenance policies triggered by early fault detection, we have combined continuous-time hidden Markov modeling (HMM) with the optimal Bayesian control technique, using vector time series model residuals. Model parameters have been obtained using the Expectation-Maximization (EM) algorithm. An optimal Bayesian maintenance policy, represented by a Bayesian control chart, has been developed by minimizing the long-run expected average cost per unit time.

Prediction of the mean residual life using a posterior probability has also been developed in this research. For vibration systems without historical failure data, a hidden semi-Markov model (HSMM) has been developed for maintenance planning. The evolution of the unknown state process is described by a hidden, two-state semi-Markov process with a generally distributed sojourn time in the healthy state. A model parameter estimation procedure based on the EM algorithm has been developed, and the optimal maintenance scheme in the two-state HSMM framework has been constructed. We have developed this model for three different sojourn time distributions, using the same set of real multivariate data from helical gearboxes: the exponential distribution, the phase-type Erlang distribution, and the Weibull distribution. The parameter estimates and optimal control limits obtained for these three distributions have been compared numerically. Lastly, we apply the proposed HSMMs to monitor the health condition of a deteriorating planetary gearbox system using vibration data from multiple sensors.

Acknowledgments

I wish to express my sincere thanks to my dearest family: Mam and Dad, Gina, Xu, Rui, Yang, Lihuang, Jiming, Lerong. You have given me great support throughout the development of my thesis.

I would like to express my sincere gratitude to the following people. The completion of this thesis has been a significant challenge in itself, apart from the research work. Without their support, patience and guidance, this task would not have been completed successfully.

First and foremost, I wish to thank my supervisor, Professor Viliam Makis. His thorough command of the fundamentals and clear insights into advanced topics have motivated me throughout my Ph.D. study. I deeply appreciate his strict training in all aspects of research and teaching assistant activities, which are definitely life-long treasures for me.

I would like to thank my thesis committee, Dr. Jean Zu, Dr. Roy H. Kwon, Dr. Daniel M. Frances and Dr. Chi-Guhn Lee, for their expertise and valuable suggestions.

I would also like to express my appreciation to all the members of the Quality, Reliability and Maintenance Laboratory and the Vibration Monitoring, Signal Processing and CBM Laboratory, especially Michael J. Kim, Rui Jiang, Catharine Hancharek, Konstantin Shestopaloff, Lawrence Yip, Jing Yu, Akram Khaleghei G., F. Naderkhani Z.G., B. Leila Jafari, Yantao Liu, Shuaizi Li, Wenyuan Lv, Jianjun Wu, Diyin Tang, Xiongzi Chen, Yin Tian and Dr. Jian Liu, for their help and friendship.

I thank the administrative staff in our department, Professor Markus Bussmann, Professor Chi-Guhn Lee and Brenda Fung. Their generous help with my course of study and teaching assistant allocation benefited me greatly.

Also, I would like to acknowledge NSERC, MITACS, Ministry of Transportation Ontario, Bombardier Inc., SGS, and the University of Toronto scholarship and fellowship for providing financial support for my research.

Table of Contents

Acknowledgments
List of Tables
List of Figures
List of Appendices
List of Symbols

Chapter 1 Introduction
  Background
  Research Contributions
  Organization of Dissertation

Chapter 2 Application of VAR Modelling and a Multivariate Statistical Process Control Technique to Detect Early Gearbox Deterioration
  Introduction
  Experimental Data Collection
  Time Synchronous Averaging Algorithm and VAR Modeling Approach
  Time Synchronous Averaging
  Modeling and Computation of Residuals
  T-squared Control Chart Development
  Numerical Results
  VAR Modeling for Piecewise Stationary Vibration Data
  Hypothesis Testing and the Application of Statistical Process Control Method

Chapter 3 Optimal Bayesian Maintenance Policy and Early Fault Detection for a Gearbox Subject to Vibration Monitoring
  Introduction
  Computation of Residuals Using TSA and VAR Modeling under Varying Load
  Hidden Markov Model Formulation
  Multivariate Bayesian Control Chart for Cost-Optimal Early Fault Detection and CBM
  Residual Life Prediction

Chapter 4 Bayesian Estimation and Optimal Maintenance Control with a Two-state Hidden Semi-Markov Model Approach
  Introduction
  Model Description
  Parameter Estimation for a General HSMM with Two Hidden States
  Formula for the Likelihood Function
  Formula for the Pseudo Log-Likelihood Function
  Parameter Estimation using Weibull Distribution
  Parameter Estimation using Exponential Distribution
  Parameter Estimation using Phase-Type Distribution
  A Practical Illustration of the Parameter Estimation Procedure
  Optimal Bayesian Control using Weibull distribution
  A Case Study using Multivariate Bayesian Control Chart
  Bayesian Control Chart using Erlang Sojourn Time Distribution

Chapter 5 A Comparison of Hidden Markov and Semi-Markov Models for Monitoring Planetary Gearbox Systems
  Introduction
  Vibration Separation
  Vector Autoregressive Modeling
  Parameter Estimation
  The Bayesian Control Chart Approach for Cost-Optimal Early Fault Detection

Chapter 6 Summary and Future Research
  Conclusions
  Future Research

Appendix A The Explicit Formulae for the Three-State HMM Parameters
Appendix B The Algorithm for Non-Central Chi-Square Computation

List of Tables

Table 2.1 Vector autoregressive models selected with AIC and BIC minimization
Table 3.1 p-values of the independence and normality tests
Table 3.2 Iterations of the EM algorithm
Table 3.3 Optimal early fault detection limit and the long run average costs
Table 4.1 Iterations of the EM algorithm using Weibull sojourn time distribution
Table 4.2 Iterations of the EM algorithm using exponential sojourn time distribution
Table 4.3 Iterations of the EM algorithm using phase-type sojourn time distribution
Table 4.4 Average cost and optimal control limits
Table 4.5 Expected average costs and optimal control limit
Table 5.1 Gearboxes information
Table 5.2 Planetary gearbox II meshing sequence
Table 5.3 p-values of the independence and normality tests
Table 5.4 Iterations of the EM algorithm using Weibull distribution
Table 5.5 Iterations of the EM algorithm using exponential distribution
Table 5.6 Expected average costs and optimal control limits

List of Figures

Figure 2.1 Testing bed and selected triaxial sensors
Figure 2.2 Location of triaxial accelerometer
Figure 2.3 Original time wave at triaxial sensors
Figure 2.4 Broken output gear in test run #
Figure 2.5 Output torque V05 and mean value of the drive motor speed V
Figure 2.6 Four time series models
Figure 2.7 Time wave residuals for test files
Figure 2.8 Four VAR models order selection using AIC criterion
Figure 2.9 ACF of Model 1, 300% torque
Figure 2.10 ACF of Model 2, 250% and 200% torque
Figure 2.11 ACF of Model 3, 150% and 100% torque
Figure 2.12 ACF of Model 4, 50% torque
Figure 2.13 RMS analysis comparison: (a) Before VAR modeling and (b) After VAR modeling
Figure 2.14 Hotelling's T-squared control chart
Figure 3.1 Testing bed and selected triaxial sensors
Figure 3.2 Time wave residuals for test files
Figure 3.3 RMS residuals for test files
Figure 3.4 Fault detection by using multivariate Bayesian control chart
Figure 3.5 Deterioration sample path of one tooth
Figure 3.6 Remaining useful life prediction
Figure 3.7 Conditional reliability function
Figure 4.1 RMS values of one tooth deterioration process
Figure 4.2 Time wave residuals for test files
Figure 4.3 The cost-optimal early fault detection scheme
Figure 4.4 RMS residuals for test files
Figure 4.5 Posterior probabilities and the control limit using Weibull distribution
Figure 4.6 Posterior probabilities with Erlang sojourn time distribution
Figure 4.7 Optimal control limit using phase-type distribution
Figure 5.1 Transmission diagram of the planetary gearbox systems
Figure 5.2 Sensor placements
Figure 5.3 Spalled planet gear tooth
Figure 5.4 A Tukey window function
Figure 5.5 Raw vibration data from planetary 2-v
Figure 5.6 Raw vibration data from Hs planetary 2-v
Figure 5.7 Tachometer pulses data
Figure 5.8 Data in the 5th run
Figure 5.9 Windowed data at the carrier cycle
Figure 5.10 Windowed data at the carrier cycle
Figure 5.11 Planet gear tooth vibration separation in the 5th run from sensor Planetary2-v
Figure 5.12 Angular position of TSA data to develop a VAR model
Figure 5.13 RMS and variance from runs (1st - 4th)
Figure 5.14 VAR model order selection using AIC criterion and BIC criterion
Figure 5.15 Residuals for the remaining test runs (6th to 12th)
Figure 5.16 Auto and cross correlation plots for VAR model residuals
Figure 5.17 RMS values of residual data for runs (6th - 12th)
Figure 5.18 Comparison of control limits for Weibull distribution and exponential distribution

List of Appendices

Appendix A. The Explicit Formulae for the Three-State HMM Parameters
Appendix B. The Algorithm for the Computation using Non-Central Chi Square Variables

List of Symbols

Symbol                Description
f_s                   Sampling frequency
f_m                   Fundamental meshing frequency of the gear of interest
N_t                   The number of teeth of the given gear
M                     The number of cycles to be averaged
⌈·⌉                   The ceil function returning the closest higher integer
⌊·⌋                   The floor function returning the closest lower integer
a(t)                  The amplitude modulation function
Θ_k                   Initial phase of harmonic k
f_shaft               Shaft rotation frequency
N_d(μ, Σ)             Multivariate normal distribution
y_i                   The i-th observation history
τ                     Sojourn time in the healthy state
f_S(t)                Density function of the sojourn time distribution
C_OM                  Maintenance cost rate in state 1 when the system is operational
C_I                   Inspection cost
C_PM                  Preventive maintenance cost rate
C_LP                  Lost production cost rate
C_s                   Sampling cost incurred for each observation
X                     Continuous-time state process
h                     Sampling interval
Y                     Discrete-time observation process
S                     Set of state parameters determining the distribution of τ
O                     Set of observation parameters {(μ_0, Σ_0), (μ_1, Σ_1)}
y_i                   Individual i-th observation history
L(O, S | y_i)         Partial likelihood for history i
L(O, S | y_i, t_i)    Full likelihood for history i
g(y | x)              Density function of an observation vector given the system state
Π                     Posterior probability statistic
Y_c                   The complete observation data set
R(j, n)               The reliability function at time nh considering parameter estimates S_j
N_r                   The number of teeth on the ring
N_p                   The number of teeth on the planet
N_s                   The number of teeth on the sun gears
ω[k + 1]              Tukey window function
n_reset,p             The number of rotations of the gear under consideration that occur before the gear returns to its initial state relative to the position of the carrier
P_n,p                 The sequence of aligned teeth for a particular planet gear

Chapter 1 Introduction

1.1 Background

Condition-based maintenance (CBM) is a maintenance program that recommends maintenance actions based on one or more indicators showing that equipment is going to fail or that its performance is deteriorating. Actions are taken only when there is evidence of abnormal behavior of a physical asset. This is very useful for systems such as aircraft, power plants and military equipment, where high reliability of performance must be guaranteed.

For years, vibration condition monitoring has proven to be an effective approach to identifying the deterioration of machines in CBM. It is an important part of the non-destructive detection and diagnostic techniques used to detect damage and degradation without stopping the machine and performing an expensive full inspection. Vibration measurement and analysis provide information that helps industrial users evaluate machinery condition, avoid machine breakdown and plan maintenance effectively, and helps manufacturers identify failure mechanisms and maintain the desired quality of production. The deterioration revealed by vibration monitoring is useful for maintenance planning as well as for failure diagnostics and prognostics. An abnormal operation is indicated by vibration characteristics that correspond to internal physical changes. Prognostic methods are also used to predict the remaining life of the machine.

Transmission systems are widely used in manufacturing. Fixed-axis gearboxes and planetary gearboxes are two common gearbox systems used in such machinery. Due to advancements in data measurement and computer technology, automated data collection from multiple sensors has become common in recent years. For rotating machines, the fault diagnostic system usually has many channels that sample signals simultaneously at high speed, which produces massive amounts of data. Moreover, the sampled data are heavily contaminated with noise. When multiple sensors are used, vibration data collected from different sensors may contain different partial information about the same machine system. As the complexity of machinery increases, single-sensor vibration monitoring techniques may exhibit low sensitivity in detecting the growth of an incipient fault. A more detailed review can be found in e.g. [1], [2]. Thus, multiple sensors can be selected to provide complementary information for fault detection and diagnosis.

Once incipient faults are identified, an appropriate maintenance action must be selected to prevent a more severe situation. In practice, maintenance actions are chosen according to criteria such as risk, cost, reliability and availability. Since the cost criterion applies to most situations, cost-based maintenance models dominate the CBM literature. However, very few papers have dealt with cost-optimal early fault diagnosis and maintenance decision making using multiple sensor data. Early detection of an incipient fault leads to lower repair cost, reduced maintenance time, and increased machine availability.

In our research, the observed condition monitoring process is represented by signals obtained from vibration monitoring. Under constant speed, vibration data recorded from a healthy machine should behave as a stationary process. A high level of vibration is usually caused by a hidden defect. To represent system deterioration, we consider a stochastic model with the Markov property.
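The premise above, that a healthy machine at constant speed yields a stationary signal while a hidden defect raises the vibration level, can be illustrated with a minimal sketch. The function name `windowed_rms`, the window length and the synthetic Gaussian signal are illustrative assumptions and are not taken from the thesis; the sketch only computes the root mean square over consecutive non-overlapping windows.

```python
import math
import random


def windowed_rms(signal, window):
    """Return the RMS of each consecutive, non-overlapping window of `signal`."""
    if window <= 0:
        raise ValueError("window must be positive")
    values = []
    for start in range(0, len(signal) - window + 1, window):
        segment = signal[start:start + window]
        values.append(math.sqrt(sum(x * x for x in segment) / window))
    return values


# Synthetic data (an assumption, not thesis data): unit-variance noise for the
# healthy part, then noise with three times the amplitude to mimic a defect.
rng = random.Random(0)
healthy = [rng.gauss(0.0, 1.0) for _ in range(2000)]
faulty = [rng.gauss(0.0, 3.0) for _ in range(2000)]
rms = windowed_rms(healthy + faulty, window=500)
```

On this toy signal the first four window values stay near a common level and the last four sit clearly above it, which is the kind of level shift a monitoring scheme looks for. Real gearbox data would first need the pre-processing discussed next.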
Vibration data pre-processing and modeling are required before applying the stochastic modeling methodology. A variety of techniques for signal processing and fault recognition have been proposed in many publications. Frequency domain analysis using e.g. the FFT, and time-frequency domain analysis using e.g. wavelets or wavelet packets, have been introduced into fault diagnosis ([1]). However, special methods must be used to extract the signals corresponding to the individual teeth used in this study. Furthermore, the time series model-based approaches used in this study account for both cross-correlation and autocorrelation in the vibration data histories. Thus, in this research, time-domain signal pre-processing and modeling approaches are considered.

In the literature, time synchronous averaging (TSA) is an essential technique for time-domain analysis of gearbox data. It allows the periodic vibration components to be separated from other noise sources in the signal. McFadden introduced TSA to vibration monitoring in 1987 and further applied the method to bearing monitoring (see e.g. [3], [4]). A review of TSA algorithms can be found in [5].

Time series models have been studied extensively for decades. These models account for the fact that data points taken over time have an internal structure, such as correlation, autocorrelation, trend or seasonal variation. An overview of time series methodology can be found in various sources (see e.g. [6]). When statistical time series methods are used for gearbox damage detection, the assumption is that the vibration caused by a healthy pair of gears can be modeled as a stationary time series. The non-stationary transients generated by localized faults in gear teeth are then detected by modeling the stationary process and applying a reference model approach. For example, Sohn and Farrar [7] diagnosed structural damage using an autoregressive (AR) model with exogenous inputs (ARX). However, very few researchers have considered the application of time series models under varying load with vector observations, which are common in practice. Hines et al. [8] developed a Stressor-based Univariate Monitoring Method (SUMM) for fault identification, which could be used in a changing load operation. Bartelmus et al. [9] experimentally confirmed that when the load changes frequently, the diagnostic feature for a gearbox becomes load dependent. Yang and Makis [10] investigated AR and ARX models to diagnose gearbox deterioration under sinusoidally varying load using vibration data from one sensor. Zhan and Makis [11] proposed an adaptive Kalman-filter-based AR model to handle the situation with several load levels.

In this research, we therefore fit a multivariate time series model that accounts for both cross-correlation and autocorrelation to the vibration data histories recorded while the system is in the healthy state. The residuals are then computed from the fitted model for the complete data histories. Yang and Makis [12] investigated the residual behavior of autocorrelated processes subject to a change from a healthy to an unhealthy state and proved that, for a wide class of time series models, the residuals are conditionally independent and normally distributed in both the healthy and unhealthy states. Once the fitted model residuals are available, features such as kurtosis, root mean square (RMS) and crest factor may be good indicators of incipient gearbox failure.

Traditional control charts such as Hotelling's T-squared, EWMA and CUSUM charts have been applied to control industrial processes characterized by several measurable variables. Statistical process control (SPC) is a method of monitoring, controlling, and improving a process through data collection and statistical analysis. An important SPC tool is the control chart, which can be used to detect changes in production processes. Readers can refer to Evans and Lindsay [13] for more details. Some papers have focused on applying statistical process control techniques to the fault detection problem. For example, Wang and Zhang [14] identified an early defect by using a moving average chart and an adaptive moving chart-based AR model. In the first project, we study a fixed-axis gearbox system using triaxial accelerometer data under varying load. A multivariate Hotelling's T-squared control chart is applied to the three-dimensional vibration data residuals to detect the early fault occurrence of a gearbox.

It is well known that these traditional, non-Bayesian process control techniques are not optimal. For example, Taylor [15] showed that non-Bayesian techniques are not optimal and suggested that, in the general case, the action decision should be based on the probability that the process is in the out-of-control state. In many industries, it is still common practice to apply a univariate control chart to each measurable variable. However, when the measurable variables are correlated, a multivariate control chart should be used instead. Most multivariate control charts have been developed for the case where the vector of process observations follows a multivariate normal distribution and monitoring the process mean is the primary objective. To develop an effective process control chart for detecting small and moderate-sized sustained shifts in the process mean, Makis [16] formulated the multivariate Bayesian process control problem in the optimal stopping framework and proved that a control limit policy is optimal, achieving the minimum average cost.

Thus, in order to obtain a cost-optimal early fault detection scheme and develop a condition-based maintenance model for predicting failures of deteriorating gearbox systems, we consider a Markov model combined with multivariate Bayesian control chart techniques for the purpose of optimal decision making. The gearbox is treated as a partially observable deteriorating system subject to vibration monitoring, where the degradation state is not directly observable and only incomplete information is available through vibration measurements.

Firstly, we consider a three-state hidden Markov model with observable failure information. Hidden Markov models (HMMs) have been studied extensively in the literature and have been successfully applied in many areas of research such as speech and handwriting recognition, econometrics, and more recently, condition-based maintenance. Applications of HMMs in condition monitoring can be found in several papers (see e.g. Miao and Makis [17], Liu et al. [18]). Lin and Makis [19] proposed a multistate model of system deterioration in which all states are hidden except the observable failure state, and the hidden state process evolves in continuous time. Recently, Kim et al. [20] considered a partially observable deteriorating system described by a three-state hidden Markov model with observable failure information. The condition of the system is modeled as a three-state continuous-time Markov chain, where state 0 represents a good operational state, state 1 represents a warning state and state 2 represents the observable failure state. Since the full likelihood function cannot be obtained for an HMM, a statistical approach known as the expectation-maximization (EM) algorithm is usually applied for parameter estimation.

If the hidden Markov model is used to describe the deterioration process for the development of maintenance policies, it is necessary to consider a suitable optimization criterion such as average cost minimization or availability maximization. The long-run average cost criterion has been considered in optimal maintenance control problems for years. To develop a cost-effective early fault detection procedure, we apply a Bayesian process control technique that can be used for both maintenance control and statistical process control (SPC). The Bayesian control chart has been studied in the control literature for partially observable systems (see e.g. Calabrese [21], Makis [16], Makis [22]). By using multivariate control chart techniques, the control problem for a hidden Markov process can be formulated as an optimal stopping problem under partial observations. It is well known from the theory of partially observable Markov decision processes that the posterior probability that the system is in the warning state is sufficient for decision making. The multivariate data residuals can be used to calculate a univariate posterior probability, which represents sufficient information for maintenance decision making. Based on this work, Kim et al. [23] and Makis et al. [24] developed new computational methods to estimate the three-state HMM parameters and proposed a cost-optimal Bayesian fault prediction scheme using oil data [20]. The maintenance decision was optimized over the long-run horizon by minimizing the expected average cost per unit time.

The successful development and application of this methodology to oil data motivates us to extend the research and investigate a cost-effective early fault detection scheme for a partially observable vibration process using multivariate vibration data. A semi-Markov decision process (SMDP) framework has been considered to obtain the control limit for optimal preventive maintenance triggered by early fault detection. We validate this new technique by analyzing the vibration data collected from an experimental gearbox using several sensors under varying load.

In condition-based maintenance, the development of the optimal maintenance policy as well as diagnostics and prognostics are the main objectives. Mean residual life is defined as the mean length of time from the current time to the end of the useful life.
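The definition above corresponds to the standard conditional expectation E[T - t | T > t] = (1/R(t)) ∫ R(s) ds over s from t to infinity, where T denotes the failure time and R its reliability (survival) function. The following is a minimal numerical sketch of that formula only. The names `mean_residual_life`, `exponential_survival` and `weibull_survival`, the finite upper integration limit and the trapezoidal rule are assumptions made for illustration; this is not the posterior-probability-based predictor developed in the thesis.

```python
import math


def mean_residual_life(survival, t, upper, steps=20000):
    """Approximate (1 / R(t)) * integral of R(s) ds from t to `upper`.

    `survival` is the reliability function R; `upper` must be large enough
    that R(upper) is negligible (truncation of the infinite integral).
    """
    r_t = survival(t)
    if r_t <= 0.0:
        raise ValueError("survival probability at t must be positive")
    width = (upper - t) / steps
    # Composite trapezoidal rule on [t, upper].
    total = 0.5 * (survival(t) + survival(upper))
    for k in range(1, steps):
        total += survival(t + k * width)
    return total * width / r_t


def exponential_survival(rate):
    """R(s) = exp(-rate * s)."""
    return lambda s: math.exp(-rate * s)


def weibull_survival(shape, scale):
    """R(s) = exp(-(s / scale) ** shape)."""
    return lambda s: math.exp(-((s / scale) ** shape))


# Hypothetical parameter values, chosen only to exercise the functions.
mrl_exp = mean_residual_life(exponential_survival(0.5), t=3.0, upper=83.0)
mrl_wbl_0 = mean_residual_life(weibull_survival(2.0, 1.0), t=0.0, upper=10.0)
mrl_wbl_1 = mean_residual_life(weibull_survival(2.0, 1.0), t=1.0, upper=10.0)
```

For the exponential case the result stays at 1/rate regardless of t, reflecting the memoryless property, whereas for a Weibull distribution with shape greater than one the mean residual life shrinks as the system ages. This contrast is one reason the choice of sojourn time distribution matters for life prediction.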
It is an essential reliability characteristic in health management of a gearbox [25]. Si et al. [1] commented on 7

23 the methods for obtaining the remaining useful life (RUL) for two categories of models, namely, models based on directly observable processes and partially observable processes. The conclusion was that RUL based on a partially observable process model is harder to achieve. Karandikar et al. [26] predicted the remaining useful life using Bayesian inference of an aircraft fuselage panel considering a random work framework. We present a new approach of using a posterior probability statistic in a hidden Markov framework to predict remaining useful life for a gearbox subject to vibration monitoring. In many industries, vibration data with failure information are scarce in real applications or non-existent at all, mainly because such systems are preventively maintained before failure occurs. In practice, machines are rarely allowed to run to failure and data are commonly suspended (see e.g. Si et al. [1], Heng et al. [27]). For such critical systems without historical failure data, it is suitable to model the degradation process as a two-state HSMM. Hidden semi-markov chain possesses the flexibility of the hidden Markov chain without the restriction of exponential or geometric distribution of the sojourn times in its hidden states. A detailed review of the HSMM can be found in Barbu and Limnios [28]. In the condition based maintenance literature, researchers have considered semi-markov structures mainly in two types of models. The first one is the covariate-based hazard model (see e.g. Makis and Jardine [29], Moghaddass and Zuo [30]). The second one is a hidden semi-markov model. Dong and He [32] presented an N-state HSMM-based diagnostics and prognostics method to monitor the health condition of a hydraulic pump system by considering a univariate observation process. Su and Shen [33] identified degradation of a cylinder system using a multi-state hidden semi-markov model also with a univariate 8

24 observation process. No optimal maintenance policies have been considered in the aforementioned papers using multivariate observations in an HSMM framework. In this research, we continually focus on developing an estimation and optimal maintenance control scheme in a two-state hidden semi-markov model framework. In the HSMM, the state duration distribution and observation distribution are essential. The state duration distribution can be non-parametric or parametric. In this research, a general parametric density function of the unobservable sojourn time distribution has been considered. We derive explicit formulae for the parameter re-estimation in the EM algorithm, which leads to a fast estimation procedure. Using the EM algorithm, both the state and observation model parameters can be estimated. Once the parameters of the HSMM are estimated, a procedure for determining an optimal maintenance policy based on a multivariate Bayesian control chart will be developed. To our knowledge, there are limited references that take into account the multivariate Bayesian control chart in an HSMM model framework subject to vibration monitoring. An efficient computational algorithm based on a semi-markov decision process framework has been developed to optimally design a Bayesian control chart to monitor the deterioration process. The method is illustrated using real vibration data. We consider three sojourn time distributions separately to develop a fault prediction scheme in the semi-markov decision process framework. Planetary gearboxes are useful for machinery because they can provide a large transmission ratio and a high output power. They are widely used in aerospace, automotive and heavy industry applications, where most of them operate in tough working environment. Condition monitoring and early fault diagnosis of planetary gearboxes aim to prevent 9

25 shutdown, reduce major economic losses and even human casualties. Vibration monitoring, playing a critical part in condition monitoring, is also extensively used for fault detection, diagnosis and prognosis in planetary gearboxes. During normal condition, the vibration monitoring system collects data from a number of sensors including sensors mounted on the transmission housing. A detailed review can be seen in [34] and Samuel [35]. However, studies on cost-optimal early fault detection and condition based maintenance of planetary gearboxes using multiple sensors are quite limited in the literature as well. Dong et al. [36] considered an HSMM model to identify the faults for a planetary gearbox system without considering the maintenance policies. Liu et al. [37] presented an HSMM-based model with aging factors into the equipment health management problem and solved the problem using a non-bayesian approach. They evaluated the model with a univariate vibration data obtained from a hydraulic pump. Geramifard et al. [38] considered an HSMM approach for continuous health condition monitoring in machinery systems. The advantage this research has over traditional early fault diagnosis and prognosis is that the optimal maintenance decision is made considering long-run expected average maintenance cost using multivariate observations. The advantage of this approach compared with the HMM model, in which an exponential sojourn time is taken into account in the hidden state, is also illustrated. 1.2 Research Contributions The following contributions have been made in this research: 1. A new development is presented to detect the early fault of gearbox systems using real multivariate vibration data under varying load. To achieve this goal, we first pre-process the vibration data using TSA method. Then, piece-wise vector auto-regressive (VAR) time series models are considered and fitted to the healthy portion of multivariate vibration data. 10

Multivariate residuals are obtained from the integrated fitted models. We use a multivariate Hotelling's T-squared control chart to detect the early fault of the gearbox system.

2. We have investigated an optimal Bayesian process control technique to detect the early fault of gearbox systems, including the maintenance policies. We have considered a gearbox as a partially observable deteriorating system subject to random failure. The gearbox deterioration process is described by a hidden, three-state homogeneous continuous-time Markov process. The EM algorithm has been applied to obtain the model parameter estimates, and a multivariate Bayesian control problem has been formulated and solved in a semi-Markov decision process (SMDP) framework to find the optimal control limit and the optimal average cost. A value of the posterior probability above this limit indicates an early fault occurrence. A remaining useful life prediction formula has also been developed. We have compared the predicted values with the actual remaining life using real vibration data and obtained very good agreement. This is a new development that applies a Bayesian control method to multivariate vibration data under varying load conditions.

3. In practice, machines are rarely allowed to run to failure and data are commonly suspended. We have developed a two-state continuous-time hidden semi-Markov model with a general sojourn time distribution for such a deteriorating gearbox system. The evolution of the unknown state process is described by a hidden, two-state semi-Markov process with a generally distributed sojourn time in the healthy state. The objective is to develop an estimation and condition-based maintenance procedure using the partial information obtained from the vibration data. The re-estimation formulas for the model parameters are derived. The unknown model parameters are estimated using the EM algorithm. The optimal maintenance policy is obtained by applying a multivariate Bayesian control chart.
The method is

illustrated using real multivariate vibration data obtained from a deteriorating gearbox system. Numerical results are compared with the results obtained using hidden Markov modeling and the same vibration data.

4. A two-state continuous-time hidden semi-Markov model with an Erlang sojourn time distribution has been developed. A closed-form analytical procedure has been developed for parameter estimation using the EM algorithm. We have illustrated the optimal maintenance policy by applying Bayesian control techniques.

5. A cost-optimal fault prediction scheme based on Bayesian estimation and control methodologies for a deteriorating planetary gearbox system has been developed. The state process has been modeled as a two-state continuous-time semi-Markov process. A Tukey window has been chosen to separate the vibration data when applying the time synchronous averaging method. This is a new development that applies multivariate Bayesian control techniques combined with an HSMM to monitor the health condition of a deteriorating planetary gearbox system using multivariate vibration data.

1.3 Organization of Dissertation

In Chapter 2, we develop a scheme for early fault detection of a gearbox under varying load conditions by considering multi-sensor vibration data. The TSA method is used to reduce the noise and then remove the periodic signals from the data. To handle the varying load situation, several VAR time series models are fitted to the historical data obtained in the healthy gearbox condition, considering different load levels. After testing the independence and multivariate normality of the residual data, a multivariate Hotelling's T-squared control chart is applied using

the three-dimensional gearbox vibration data residuals to detect the early fault occurrence of a gearbox.

In Chapter 3, we present a multivariate Bayesian approach to detect the early fault of gearbox systems with minimum average cost, and to predict the mean residual life of machine systems using real vector vibration data. The deterioration process is treated as a partially observable hidden-state stochastic process with observable failure information. State 0 and state 1 are unobservable and represent the healthy and warning system states, respectively. State 2 represents the observable failure state. The residuals of the models fitted in Chapter 2 are used as the observation process in the hidden Markov framework. We apply the EM algorithm to estimate the model parameters. A univariate posterior statistic is generated to represent the multi-dimensional residuals. We determine the early fault of systems considering minimum average cost in a semi-Markov decision process framework. A prediction of the mean residual life using a univariate posterior probability is also developed in this chapter. The proposed methodologies are validated using real vector vibration data.

In Chapter 4, a two-state hidden semi-Markov process is developed to describe the deterioration process of a partially observable system, where state 0 represents a healthy state and state 1 represents an unhealthy state. The explicit formulae for parameter re-estimation using the EM algorithm are derived and used for the joint estimation of the state and observation parameters. Three different state sojourn time distributions are compared in this research: the exponential, Erlang, and Weibull distributions. We demonstrate the effectiveness of the estimation procedures for these three sojourn time distributions using real vector vibration data.

After obtaining the model parameters, we present an SMDP methodology in an HSMM framework to determine a cost-optimal Bayesian control limit for preventive maintenance in vibration systems. We assume that the vector observation process is stochastically related to the system state. A posterior probability that the system is in the out-of-control state has been developed for three different sojourn time distributions: the exponential, Weibull, and Erlang distributions. The control limit problem for this HSMM is formulated as a semi-Markov decision problem. Taking the vector data from the real deteriorating gearbox system, we determine the cost-optimal maintenance level using the Weibull and Erlang sojourn time distributions. The computational results are also compared with the results obtained using a two-state hidden Markov model framework.

Using the methods developed in Chapter 4, we investigate in Chapter 5 a cost-optimal early fault detection scheme and CBM using real multivariate vibration data from a planetary gearbox system. The vibration data are pre-processed using the Tukey window separation technique and the TSA method. A vector autoregressive model is fitted to the TSA filtered data. The residuals are obtained for both the healthy and unhealthy portions of the vibration data histories. The independence and normality of the residual data are tested, and the residual data are used as the observation process for a two-state hidden semi-Markov model. The model parameters are estimated by the EM algorithm. Once the parameters of the HSMM are estimated, an optimal Bayesian maintenance policy represented by a Bayesian control chart is developed by minimizing the long-run expected average cost per unit time. This optimization problem is formulated as a discrete-time optimal stopping problem with partial information in a hidden semi-Markov model framework. A comparison of parameter

estimation and multivariate Bayesian control limits using an exponential sojourn time distribution is also given.

In Chapter 6, we summarize the conclusions of this research and discuss future research.

Chapter 2 Application of VAR Modelling and a Multivariate Statistical Process Control Technique to Detect Early Gearbox Deterioration

2.1 Introduction

The time synchronous averaging (TSA) method is an important technique for gearbox data analysis. It allows the periodic vibration components to be separated from other noise sources in the signal. Many applications of the TSA method can be found in McFadden's work [3], [5], [6]. However, the TSA signal can be evaluated only when the vibration signal is periodic and stationary, and it is less sensitive to early failures under varying load conditions. To overcome this limitation, some papers have applied parametric time series models for gearbox fault detection [11], [39]. An overview of time series methodology can be found in [7]. Fitting a time series model to the TSA data requires identifying the healthy portion of the data histories. The assumption is that the vibration caused by a healthy pair of gears can be modeled as a stationary time series. The non-stationary transients generated by localized faults in gear teeth are then detected by modeling the stationary process and applying a reference model approach [8], [9], [10]. However, very few researchers have considered the application of time series models under varying load with vector observations, which is common in practice.

The purpose of this chapter is to apply multivariate statistical process control methodology to diagnose early fault occurrence in a gearbox operating under varying load. Because the load is approximately constant during the short sampling interval, this chapter uses several vector autoregressive models to approximate the vibration signal under varying load, assuming a piece-wise constant load condition. The multivariate control

chart [11], [13] is applied to the residuals obtained from multiple VAR models for early fault diagnosis. This approach has not been used before.

The selected data represent test run 14 obtained from Pennsylvania State University. Within each recorded file, the load is constant over the data recording time of 10 seconds. If the load level is known from the output torque, there is a stationary vector time series model associated with that load level. This makes it possible to separate the recorded data under different load conditions and fit several VAR models for particular load ranges, the union of which represents the total load range.

The remainder of the chapter is organized as follows. Section 2.2 briefly describes the experimental gear test rig and the data acquisition system used in this research. In Section 2.3, several algorithms are introduced to obtain the residuals from the vibration data and to test the conditions for applying the statistical process control methods. Section 2.4 presents the results using real vibration data.

2.2 Experimental Data Collection

The studied vibration data was obtained from the Mechanical Diagnostic Test Bed (MDTB) [40]. The MDTB gearbox contained a 70-tooth driven helical gear and a 21-tooth pinion gear. It was driven at a set input speed using a kw, 1750 rpm drive motor, and the torque was applied by a kw absorption motor. Test run #14 (see Fig. 2.1 and Fig. 2.2 for the location of the sensors) was selected for this study, where sensor A10 is in the axial direction, sensor A11 is parallel to the floor, and sensor A12 is perpendicular to the floor.
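As a quick check on the gear kinematics described above, the meshing frequency can be worked out from the tooth counts. This sketch assumes that the 1750 rpm drive speed is the rotation speed of the 21-tooth pinion shaft; the text states only the set input speed, so that identification is an assumption.

```latex
\begin{align*}
f_{\text{pinion}} &= \frac{1750\ \text{rpm}}{60\ \text{s/min}} \approx 29.17\ \text{Hz}, \\
f_m &= 21 \times f_{\text{pinion}} = 612.5\ \text{Hz}, \\
f_{\text{gear}} &= \frac{f_m}{70} = 8.75\ \text{Hz}.
\end{align*}
```

Under that assumption, the value $f_m = 612.5$ Hz agrees with the meshing frequency used in Section 2.4.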


Figure 2.1 Testing bed and selected triaxial sensors

Figure 2.2 Location of triaxial accelerometer

The gearbox was run at 100% output torque for 95 hours at constant load (input speed 1750 rpm, output torque 555 in-lbs). The torque was then increased to 300% and the gearbox was run until failure for another 19.3 hours under varying load (input speed 1750 rpm, output torque 1665 in-lbs). The whole test (95 hours plus 19.3 hours) produced 338 files of vibration data in total. Each file contains 10 seconds of data sampled at 20 kHz. The sampling interval was 8 minutes. For example, Figure 2.3 shows the original vibration data in file 194 and in files 194 to 338.


Figure 2.3 Original time wave at triaxial sensors: (a) File 194, (b) Files 194-338

The gearbox was run under varying load from file 194 to 338 and shut down with eight broken teeth in the output gear (Figure 2.4).

Figure 2.4 Broken Output Gear in Test Run #14

Figure 2.5 Output torque V05 and mean value of the drive motor speed V01

Figure 2.6 Four time series models

The triaxial accelerometer data [A10 (axial), A11 (parallel to floor) and A12 (perpendicular to floor)] were collected simultaneously from three directions on the gearbox and used to detect the fault. The data was sampled as the load dropped to 250%, 200%, 150%, 100% and 50% (Figure 2.5, Figure 2.6). This study considers machine health monitoring only under varying load, i.e., it considers files 194 to 338 from the 19.3-hour data collection period. File 212 is excluded since it was reported as unreliable in MDTB due to accelerometer problems [40]. Files 194 to 246 were used to represent the healthy condition of the gearbox, and the remaining files were used for the gearbox deterioration diagnostics.

When the output load is varying, it is difficult to determine whether changes in the vibration signals are caused by the varying load or indicate early gear tooth deterioration. The output torque V05 and the mean value of the drive motor speed V01 are shown in Figure 2.5. The signal in Figure 2.6 is a zoomed-in view of the torque V05. The loads are stable at each sampling epoch. The speed sensor signal in Figure 2.5 shows that the motor velocity fluctuated by less than 0.06%, so the speed can be considered constant.

The data was sampled at six load levels: 300%, 250%, 200%, 150%, 100% and 50%. According to the changing load conditions (Figure 2.6), we divided the training files 194 to 246 into four groups: 300%, (200%, 250%), (100%, 150%) and 50%. We then considered four stationary time series processes using these four groups, which cover the six load levels mentioned above, and studied four time series models fitted to these training files to represent the vibration data under varying load. The remaining files were used to detect the early failure of the gearbox.
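The grouping of six load levels into four models can be sketched as a simple lookup. This is a minimal illustration, not the procedure used in the study: the function name `model_for_load` and the `tolerance` guard are hypothetical, and the model numbering follows the four load groups listed above.

```python
# Hypothetical helper: map a file's nominal load level (percent of rated
# torque) to one of the four VAR models described in the text.
LOAD_GROUPS = {
    1: (300,),
    2: (250, 200),
    3: (150, 100),
    4: (50,),
}


def model_for_load(load_percent, tolerance=10.0):
    """Return the index of the model whose load group is nearest to load_percent.

    `tolerance` (in percentage points) is an assumed guard against loads that
    do not belong to any of the six nominal levels.
    """
    best_model, best_gap = None, float("inf")
    for model, levels in LOAD_GROUPS.items():
        for level in levels:
            gap = abs(load_percent - level)
            if gap < best_gap:
                best_model, best_gap = model, gap
    if best_gap > tolerance:
        raise ValueError(f"load {load_percent}% is not near a nominal level")
    return best_model


# A file recorded at roughly 250% torque is handled by model 2.
selected = model_for_load(248.0)
```

Because the load is constant within each 10-second file, one lookup per file is enough in this sketch.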

2.3 Time Synchronous Averaging Algorithm and VAR Modeling Approach

Time Synchronous Averaging

TSA [5] is a successful noise reduction technique useful for the development of schemes for detection of gear-related faults. To extract repetitive signals from additive noise, this process requires accurate knowledge of the repetition frequency of the desired signal, or a signal that is synchronous with the desired signal. The vibration data is then divided into segments of equal length associated with the synchronous signal and averaged.

Suppose there are $N$ data points in each vibration data file collected from a gearbox operating at constant speed. The number of sampling points corresponding to one complete revolution of the gear of interest is

$$K = \left\lceil \frac{f_s}{f_m} N_t \right\rceil, \tag{2.1}$$

where $f_s$ is the sampling frequency, $f_m$ is the fundamental meshing frequency of the gear of interest, $N_t$ is the number of teeth of the given gear, and $\lceil \cdot \rceil$ is the ceiling function returning the closest higher integer. The round-up error can be ignored when $K \gg 1$. The number of cycles to be averaged is $M_0 = \lfloor N / K \rfloor$, where $N$ is the total number of sampling points in each data file and $\lfloor \cdot \rfloor$ is the floor function returning the closest lower integer. In the time domain, the conventional TSA signal obtained from the vibration data $V$ is given by

$$V_{TSA}(k) = \frac{1}{M_0} \sum_{i=0}^{M_0 - 1} V(k + iK), \quad k = 1, 2, \ldots, K. \tag{2.2}$$
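Equations (2.1) and (2.2) can be sketched in a few lines of Python. This is a minimal illustration on a toy signal, not the processing chain used in the study; the function names are hypothetical and the code uses zero-based indexing, so `j` corresponds to $k - 1$.

```python
import math


def samples_per_revolution(fs, fm, n_teeth):
    """Eq. (2.1): K = ceil(fs * Nt / fm)."""
    return math.ceil(fs * n_teeth / fm)


def time_synchronous_average(signal, k):
    """Eq. (2.2): average M0 = floor(N / K) consecutive segments of length K."""
    m0 = len(signal) // k
    if m0 == 0:
        raise ValueError("signal is shorter than one revolution")
    return [sum(signal[j + i * k] for i in range(m0)) / m0 for j in range(k)]


# Toy data (hypothetical): a 4-sample periodic pattern plus an offset that
# alternates in sign from one revolution to the next.
pattern = [1.0, 2.0, 3.0, 4.0]
signal = []
for rev in range(6):
    offset = 0.3 if rev % 2 == 0 else -0.3
    signal.extend(x + offset for x in pattern)
signal.append(9.9)  # an incomplete trailing revolution is ignored

tsa = time_synchronous_average(signal, k=4)
```

Averaging over the six complete revolutions cancels the alternating offset, so `tsa` is close to the underlying pattern, which is the noise-reduction effect the section describes.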

Generally, TSA is one of the most powerful techniques for extracting the desired periodic signals from the original signal. Consider the following model

$$x(t) = \sum_{k=1}^{K} X_k \bigl(1 + a_k(t)\bigr) \cos\bigl(2\pi k f_m t + \Theta_k + b_k(t)\bigr), \tag{2.3}$$

where $X_k$ is the amplitude of the $k$th meshing harmonic, $a_k(t)$ is the amplitude modulation function of the $k$th meshing harmonic, $f_m$ is the average meshing frequency, $\Theta_k$ is the initial phase of harmonic $k$, and $b_k(t)$ is the phase modulation function of the $k$th harmonic. If $f_m$ is the meshing frequency, then $f_m = N_t f_{shaft}$, where $N_t$ is the number of gear teeth and $f_{shaft}$ is the shaft rotation frequency. This vibration model assumes that $f_{shaft}$ is a constant. In most systems, there is some variation in the shaft speed due to changes in load.

In order to accurately detect gear tooth damage, several methods need to be used in combination. Here we consider the TSA data with the meshing frequency and its harmonic frequencies removed. After removing the mesh and the components associated with its harmonic frequencies [41], [42], the techniques are designed to test the early fault detection performance of two parameters for the tri-axial sensors. These two parameters are the RMS value of the TSA filtered data and the RMS value of the VAR model residuals.

Modeling and Computation of Residuals

The three-dimensional observed healthy TSA filtered data histories $Z_t = (Z_{1t}, Z_{2t}, Z_{3t})'$, $t = 0, 1, \ldots, T$, have the following representation

$$Z_t = \mu + \sum_{r=1}^{p} \varphi_r Z_{t-r} + \varepsilon_t, \tag{2.4}$$

where $\varepsilon_t$ are independent identically distributed (i.i.d.) $N_3(0, \Sigma)$, $p$ is the lag which determines the model order, $\varphi_r \in \mathbb{R}^{3 \times 3}$ is the coefficient matrix, and $\mu \in \mathbb{R}^3$ and $\Sigma \in \mathbb{R}^{3 \times 3}$ are the mean and covariance model parameters. We can rewrite Eq. (2.4) in the general form $W = \varphi A + E$, where $W = [Z_p, Z_{p+1}, \ldots, Z_T]$, $\varphi = [\mu, \varphi_1, \varphi_2, \ldots, \varphi_p]$, $E = [\varepsilon_p, \varepsilon_{p+1}, \ldots, \varepsilon_T]$, and

$$A = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ Z_{p-1} & Z_p & \cdots & Z_{T-1} \\ Z_{p-2} & Z_{p-1} & \cdots & Z_{T-2} \\ \vdots & \vdots & & \vdots \\ Z_0 & Z_1 & \cdots & Z_{T-p} \end{pmatrix}. \tag{2.5}$$

Lütkepohl H. [6] showed that the least squares estimates for $\varphi$ and the covariance matrix $\Sigma = \operatorname{cov}(\varepsilon_t)$ are given by

$$\hat{\varphi} = W A' (A A')^{-1}, \tag{2.6}$$

$$\hat{\Sigma} = (T - 3p - 1)^{-1} (W - \hat{\varphi} A)(W - \hat{\varphi} A)'. \tag{2.7}$$

There are several information criteria available to determine the order $p$ of a VAR model. Akaike [43] suggested measuring the goodness of fit of the model by balancing the error of the fit against the number of parameters in the model. For a VAR($p$) model,

$$AIC = \ln\lvert\hat{\sigma}_p^2\rvert + \frac{2 p D^2}{T}, \tag{2.8}$$

where $\hat{\sigma}_p^2$ is the maximum likelihood estimate of $\sigma_\varepsilon^2$, the covariance matrix of $\varepsilon_t$, $T$ is the sample size, and $D$ is the dimension of the time series. The Bayesian information criterion (BIC) [44] is defined as follows:

$$BIC = \ln\lvert\hat{\sigma}_p^2\rvert + \frac{\ln T}{T}\, k, \tag{2.9}$$

where $\hat{\sigma}_p^2$ is the error covariance matrix and $k$ is the number of estimated parameters in the model.

It is important to note that the VAR model is fitted only to the healthy portion of the data histories and that, using this stationary VAR model, the residuals are computed for both the healthy and unhealthy portions of the data histories. The advantage of this reference model approach is that it is not necessary to accurately determine the exact change point from the healthy to the unhealthy state. We only need to conservatively select a sufficient amount of healthy data to build a stationary VAR model. The idea is that when the system does in fact move to an unhealthy state, the characteristics of the residuals will change, signalling system deterioration. Using the parameter estimates $\hat{\varphi}, \hat{\Sigma}, \hat{p}$, we define the residual process

$$Y_{n\Delta} := Z_n - E_{\hat{\varphi}, \hat{\Sigma}, \hat{p}}\bigl(Z_n \mid \vec{Z}_{n-1}\bigr),$$

where $\vec{Z}_{n-1} = (Z_1, \ldots, Z_{n-1})$. The residuals are then computed for the complete gearbox running history.

Next, the multivariate Ljung-Box portmanteau test (Q-test) is performed to determine whether the model residuals are independent and identically distributed random variables. The null hypothesis of the Ljung-Box test is that all noise terms at different lags up to lag $s$ are uncorrelated. Given a 3-dimensional VAR($p$) model, the Q-test statistic, in the form introduced in [45], is given by

$$Q_m = n^2 \sum_{l=1}^{m} (n - l)^{-1} \operatorname{tr}\bigl\{\hat{\rho}_\varepsilon(l)' \hat{\Sigma}^{-1} \hat{\rho}_\varepsilon(l) \hat{\Sigma}^{-1}\bigr\}, \tag{2.10}$$

where $\operatorname{tr}(A)$ is the trace of the matrix $A$, $\hat{\rho}_\varepsilon(l) = V_\varepsilon^{-1/2} C_\varepsilon(l) V_\varepsilon^{-1/2}$ is the residual cross-correlation matrix, $C_\varepsilon(l) = n^{-1} \sum_{t=1}^{n-l} \varepsilon_t \varepsilon_{t+l}'$ is the residual cross-covariance matrix, $V_\varepsilon = \operatorname{Diag}(\sigma_{11}^2, \ldots, \sigma_{kk}^2)$, $\sigma_{ii}$ are the diagonal elements of $\Sigma$, $n$ is the number of observations and $m$ is the lag order. The asymptotic distribution of $Q_m$ is chi-squared with $k^2 (m - p)$ degrees of freedom.
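The reference model approach, fitting by least squares as in Eq. (2.6) and then computing one-step-ahead residuals, can be sketched as follows. This is a minimal pure-Python illustration on a noise-free two-dimensional toy series, not the implementation used in the study (which relied on R). The names `solve`, `fit_var` and `one_step_residuals`, and the toy coefficients, are hypothetical. The normal equations $(AA')\hat{\varphi}' = AW'$ are solved directly instead of forming the inverse.

```python
def solve(a, b):
    """Solve a x = b (b has several columns) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def fit_var(series, p):
    """Least squares VAR(p) fit; returns (mu, [phi_1, ..., phi_p])."""
    dim = len(series[0])
    # Regressor a_t = [1, Z_{t-1}, ..., Z_{t-p}] and target Z_t for t = p..T.
    rows = [[1.0] + [x for r in range(1, p + 1) for x in series[t - r]]
            for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    m = 1 + dim * p
    gram = [[sum(row[i] * row[j] for row in rows) for j in range(m)]
            for i in range(m)]
    rhs = [[sum(row[i] * z[d] for row, z in zip(rows, targets))
            for d in range(dim)] for i in range(m)]
    coef = solve(gram, rhs)  # m x dim, the transpose of [mu, phi_1, ..., phi_p]
    mu = [coef[0][d] for d in range(dim)]
    phis = [[[coef[1 + r * dim + j][i] for j in range(dim)]
             for i in range(dim)] for r in range(p)]
    return mu, phis


def one_step_residuals(series, mu, phis):
    """Residuals Z_t minus the one-step prediction, for t = p..T."""
    p, dim = len(phis), len(mu)
    out = []
    for t in range(p, len(series)):
        pred = list(mu)
        for r in range(1, p + 1):
            for i in range(dim):
                pred[i] += sum(phis[r - 1][i][j] * series[t - r][j]
                               for j in range(dim))
        out.append([series[t][i] - pred[i] for i in range(dim)])
    return out


# Toy data (hypothetical): a noise-free 2-dimensional VAR(1) trajectory.
true_mu = [1.0, -1.0]
true_phi = [[0.5, 0.2], [-0.3, 0.4]]
series = [[2.0, 3.0]]
for _ in range(12):
    prev = series[-1]
    series.append([true_mu[i] + sum(true_phi[i][j] * prev[j] for j in range(2))
                   for i in range(2)])

mu_hat, phi_hat = fit_var(series, p=1)
residuals = one_step_residuals(series, mu_hat, phi_hat)
```

On noise-free data the fit recovers the generating coefficients and the residuals vanish. With real data, the model would be fitted to the healthy files only and `one_step_residuals` applied to the whole history, as the text describes.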

To test multivariate normality, the Henze-Zirkler multivariate normality test is applied. The test is based on a nonnegative function $D(\cdot, \cdot)$ that measures the distance between two distribution functions and has the property that

$$D\bigl(N_d(0, I_d), Q\bigr) = 0 \tag{2.11}$$

if and only if $Q = N_d(0, I_d)$, where $N_d(\mu, \Sigma_d)$ is a $d$-dimensional normal distribution. The test statistic can be found in [46]. Figure 2.7 shows the residual signals obtained from the fitted time series model for the testing files.

Figure 2.7 Time wave residuals for test files

T-squared Control Chart Development

A statistical process control technique, such as the one based on Hotelling's T-squared multivariate control chart, is used to localize irregularities caused by faults. This statistical control method is an analog of the univariate Shewhart control chart for monitoring the mean

vector of the process. Let $N_p(\mu, \Sigma)$ denote a $p$-variate normal distribution. Assume the process is monitored by observing vectors $X = (x_1, \ldots, x_n)$, $x_i \sim N_p(\mu, \Sigma)$. It can be shown that

$$(x_i - \hat{\mu})' \hat{\Sigma}^{-1} (x_i - \hat{\mu}) \sim T^2, \tag{2.12}$$

where the common estimates of $\mu$ and $\Sigma$ are

$$\hat{\mu}_{p \times 1} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad \hat{\Sigma}_{p \times p} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})'. \tag{2.13}$$

The $T^2$ control chart is applied in two phases: the control limit establishment phase and the monitoring phase. The first phase focuses on obtaining an in-control set of model residuals so that the calculated control limit can be used in phase two for monitoring the residual process of future vibrations. The phase one control limit for the $T^2$ control chart is given by

$$UCL = \frac{p (n + 1)(n - 1)}{n (n - p)} F_{\alpha, p, n-p}, \tag{2.14}$$

where $F_{\alpha, p, n-p}$ denotes the critical value at significance level $\alpha$ of the F distribution with $p$ and $n - p$ degrees of freedom. If $T^2 > UCL$, then stop and investigate. The estimates $\hat{\mu}$ and $\hat{\Sigma}$ obtained at the end of phase one are used to calculate the $T^2$ statistic using Eq. (2.12) for each new observation.

2.4 Numerical Results

VAR Modeling for Piecewise Stationary Vibration Data

To apply the TSA method, $f_s$, $f_m$ and $N_t$ must be known to determine the length of the averaging signal. With the meshing frequency $f_m = 612.5$ Hz, the round-off value $K$ equals

2285. After removing the components associated with the gear meshing frequency and its harmonics, each revolution of the TSA filtered signal of the data file is used to build the VAR model. Four VAR models have been built for four load conditions: 300%, (200%, 250%), (100%, 150%) and 50% (see Figure 2.4b). For example, model 4 contains six residual signal files (files 204, 205, 218, 219, 232 and 233) with three-dimensional data values. The orders of the VAR models are determined using both the AIC and BIC information criteria (see Table 2.1).

Table 2.1 Vector autoregressive models selected with AIC and BIC minimization (columns: Model 1, Model 2, Model 3, Model 4; rows: AIC, BIC)

This computation was conducted using the statistical software R. Figures 2.9 to 2.12 show the autocorrelation (ACF) and cross-correlation values of the vector autoregressive model residuals. All auto- and cross-correlations are within the bounds and show no correlation pattern. Thus, the residual checking indicates that the models are appropriate. The multivariate Ljung-Box statistics of the model residuals show that the selected lags for the VAR models are adequate for describing the data. Multivariate Ljung-Box tests applied to the AIC-selected VAR models give better results than for the models with BIC-selected order. The dot marks the preferred vector autoregressive model lag number with the minimum AIC value (Figure 2.8). Therefore, the fitted models are selected by AIC with lags 291, 152, 131, and 210. A vector autoregressive model with exogenous variables (VARX) was also tried in this research to model the multivariate data, where the varying load was enveloped and considered as the exogenous input. However, the AIC and the BIC values

were large in both Matlab and R when estimating the best model order.

Figure 2.8 Four VAR model order selection using AIC criterion

Figure 2.9 ACF of model 1, 300% torque

Figure 2.10 ACF of model 2, 250% and 200% torque

Figure 2.11 ACF of model 3, 150% and 100% torque

Figure 2.12 ACF of model 4, 50% torque

Root mean square (RMS) and kurtosis statistics are two common tools for diagnosing the gearbox condition. Here we first use the RMS indicator to evaluate the performance of the VAR model. Figure 2.13a shows that the RMS values of the TSA filtered signal without VAR modeling do not indicate any early fault before file 290. In Figure 2.13b, after applying VAR modeling, the RMS parameter gives a better indication of the fault. The RMS

values of all three sensors show the first increasing peak on file 284 (Figure 2.13b) as well as evidence of damage on file 296. McClintic's paper [47] showed boroscope images of the tooth damage. The first broken tooth was actually observed at 109 hours, which was on file 296. The second broken tooth was observed on sampling file 323. Finally, 8 broken teeth were found on data file 338, which was collected at the end of the running time. Figure 2.13b shows peaks in each case mentioned in McClintic's paper. This comparison means that the RMS parameter after VAR modeling performs very well in detecting the significant fault in this case. In terms of early damage detection, McClintic et al.'s work was able to detect the early fault only on file 290 at 108 hours, while our method indicates the incipient fault occurrence on file 284.

Figure 2.13 RMS analysis comparison: (a) Before VAR modeling and (b) After VAR modeling

Hypothesis Testing and the Application of Statistical Process Control Method

We consider the residuals from the time series models selected using the AIC criterion as the steady-state process, where the training data file residuals (files 194 to 246, excluding file 212) were used to develop the process control limit. The testing data file residuals were used to detect the failure. The parameters $(\hat{\mu}, \hat{\Sigma})$ in Eq. (2.12) were estimated using the residual RMS values of

training files. Here, $x$ represents the vector of RMS values of each sampling file, which is the individual vector observation, with $p = 3$ and $n = 52$. After estimating $\hat{\mu}$ and $\hat{\Sigma}$ from these files, we computed the UCL to monitor the RMS variables of the incoming testing files. Furthermore, in order to check the independence and normality assumptions for the sampling variables $x_i$, the independence test [46] and a normality test were conducted for the RMS variables up to file 284. The independence test for the RMS variables was computed using the R 14.1 function Hosking at significance level 0.05, and the Henze-Zirkler multivariate normality test was applied to the residual RMS data. The control chart shows that the vector data on file 284 was an outlier (Figure 2.14).

Figure 2.14 Hotelling's T-squared control chart

From Figure 2.14, the gear is in an unhealthy condition starting with vibration file 284. Our result matches Yang's [10] conclusion for incipient failure detection, where the accelerometer A02 was carefully selected in his study. Our results confirm that the combination of TSA and multiple VAR models can be used to obtain stable residuals from multivariate vibration data under changing load. The early fault of the gearbox can then be detected using the multivariate Hotelling's $T^2$ control chart.
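The two-phase $T^2$ procedure of Eqs. (2.12) to (2.14) can be sketched as below. This is a minimal pure-Python illustration, not the code used to produce Figure 2.14. The function names and the toy training vectors are hypothetical, and the F critical value is deliberately left as an input (`f_crit`) that the user would take from tables or a statistics package.

```python
def solve_linear(a, b):
    """Solve a x = b for a small square system by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def mean_and_covariance(xs):
    """Eq. (2.13): sample mean vector and covariance with divisor n - 1."""
    n, p = len(xs), len(xs[0])
    mu = [sum(x[j] for x in xs) / n for j in range(p)]
    cov = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / (n - 1)
            for j in range(p)] for i in range(p)]
    return mu, cov


def t_squared(x, mu, cov):
    """Eq. (2.12): (x - mu)' cov^{-1} (x - mu)."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    return sum(d * v for d, v in zip(diff, solve_linear(cov, diff)))


def upper_control_limit(p, n, f_crit):
    """Eq. (2.14); f_crit is the F critical value supplied by the caller."""
    return p * (n + 1) * (n - 1) / (n * (n - p)) * f_crit


# Phase one on toy 2-dimensional training vectors (hypothetical data).
training = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
mu_hat, cov_hat = mean_and_covariance(training)

# Phase two: score a new observation against the phase-one estimates.
score = t_squared([2.0, 0.0], mu_hat, cov_hat)

# Multiplier of the F critical value for p = 3, n = 52 as in this chapter.
factor = upper_control_limit(3, 52, f_crit=1.0)
```

For the setting of this chapter ($p = 3$, $n = 52$), the multiplier of the F critical value in Eq. (2.14) is about 3.18; a new file would be flagged when its $T^2$ score exceeds that multiplier times the chosen critical value.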

Chapter 3 Optimal Bayesian Maintenance Policy and Early Fault Detection for a Gearbox Subject to Vibration Monitoring

3.1 Introduction

In Chapter 2, we used a traditional process control chart to monitor the deterioration process of a gearbox system. In condition-based maintenance (CBM), vibration monitoring is widely used to identify the deterioration of machines, which is very useful for failure diagnostics and prognostics as well as for maintenance planning. Properly established maintenance planning can significantly reduce maintenance cost by reducing the number of unnecessary scheduled preventive maintenance operations. Generally, abnormal operation is indicated by vibration characteristics which correspond to internal physical changes. Prognostic methods are used to predict the remaining life of the machine. A more detailed review can be found in, e.g., [1], [2].

To avoid unnecessary extra maintenance cost, it is useful to investigate a cost-optimal early fault detection scheme that combines a lower-cost maintenance solution with the machine condition assessment. In the literature, very few references have dealt with early fault diagnosis and maintenance decision making using multiple-sensor data. The early detection of an incipient fault leads to lower repair cost, reduced maintenance time, and increased machine availability. The focus of this chapter is on cost-optimal early fault detection using vibration data from a gearbox, which is an important part of rotating machinery. Effective implementation of the developed procedure will reduce maintenance cost and improve the quality and reliability of mechanical products and manufacturing processes.


3.2 Computation of Residuals Using TSA and VAR Modeling under Varying Load

We reconsider the multivariate vibration data obtained from the Mechanical Diagnostic Test Bed (MDTB) built by the Pennsylvania State University Applied Research Laboratory in the Condition-Based Maintenance Department. The triaxial accelerometer data [A10 (axial), A11 (parallel to floor) and A12 (perpendicular to floor)] were measured simultaneously from three directions on the gearbox and used to detect the fault (Figure 3.1). Following the development in Chapter 2, the testing data file residuals are obtained from the fitted VAR models, and they are shown in Figure 3.2. Because each file contains residual signals for 70 teeth per revolution, we obtain 70 tooth histories for the target gear. Statistical features of the residual signals, which include the root mean square (RMS) and kurtosis, are frequently used to compress the huge amount of vibration data and provide a convenient measure of gear condition. In order to develop the cost-optimal early fault detection scheme, we selected the RMS indicator for further development (Figure 3.3).

Figure 3.1 Testing bed and selected triaxial sensors
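The two features named above can be computed per channel as in the following sketch. It is illustrative only: the function names and the toy residual values are hypothetical, and the kurtosis shown is the plain fourth standardized moment (not the excess kurtosis), which is an assumption about the definition intended.

```python
import math


def rms(values):
    """Root mean square of a residual sequence."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def kurtosis(values):
    """Fourth standardized moment (assumed definition, not excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2)


def rms_vector(channels):
    """One RMS value per sensor channel, giving one observation vector per file."""
    return [rms(channel) for channel in channels]


# Toy residuals (hypothetical) for the three accelerometer directions.
file_residuals = [
    [3.0, -3.0, 3.0, -3.0],   # A10
    [1.0, -1.0, 1.0, -1.0],   # A11
    [0.5, -0.5, 0.5, -0.5],   # A12
]
observation = rms_vector(file_residuals)
```

Each file is thereby compressed to a three-dimensional RMS vector, which is the kind of observation used in the remainder of the chapter.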

Figure 3.2 Time wave residuals for test files

Figure 3.3 RMS residuals for test files

Hidden Markov Model Formulation

In this section, the residual RMS data is considered as the partial observation data for fitting the hidden Markov model. Since machine deterioration behaves monotonically over time, the state of the system can be well represented by a non-decreasing continuous-time homogeneous Markov chain $\{X_t : t \in \mathbb{R}_+\}$ with state space $\{0, 1, 2\}$. State 0 and state 1 are unobservable states and represent the healthy and warning system states, respectively. State 2 represents the observable failure state. Let $Y_h, Y_{2h}, \ldots, Y_{kh} \in \mathbb{R}^d$ denote the sequence of $d$-dimensional vectors representing the VAR model residuals. When the system is in state $x$, $Y_{kh}$ follows the multivariate normal distribution $N_d(\mu_x, \Sigma_x)$ with density

$$g(y_k \mid \mu_x, \Sigma_x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_x)}} \exp\left(-\frac{1}{2} (y_k - \mu_x)' \Sigma_x^{-1} (y_k - \mu_x)\right). \tag{3.1}$$

We divided the residual RMS values of files 247 to 295 into two groups. In Figure 3.3, we assume that from file 247 to file 283 the system is in state 0. Files 284 to 295 represent state 1. All the files after 295 represent the failure state. The independence test and normality test are conducted for both the healthy state and warning state residual data (Table 3.1).
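The state-conditional density of Eq. (3.1) can be evaluated numerically as sketched below. This is a minimal pure-Python illustration that works on the log scale through a Cholesky factorization; the function names and the example parameters are hypothetical and are not the estimates obtained in this chapter.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def mvn_log_density(y, mu, sigma):
    """Logarithm of Eq. (3.1) for observation y given mean mu and covariance sigma."""
    d = len(y)
    low = cholesky(sigma)
    # Forward substitution: solve L z = (y - mu), so that z'z is the quadratic form.
    z = []
    for i in range(d):
        s = (y[i] - mu[i]) - sum(low[i][k] * z[k] for k in range(i))
        z.append(s / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)


# Example with hypothetical parameters for one state.
value = math.exp(mvn_log_density([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))
```

Comparing `mvn_log_density` under the state 0 and state 1 parameters for the same observation is the basic ingredient of the posterior probability used later. Working with logarithms avoids underflow when many observations are combined.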

Table 3.1 p-values of the Independence and Normality Tests (rows: healthy portion, files 247 to 283, and unhealthy portion, files 284 to 295; columns: Independence (Portmanteau) and Multivariate Normality (Henze-Zirkler) tests on the RMS of residuals)

Once the conditional independence and normality properties have been checked, the parameter estimation for the hidden Markov model and the subsequent optimal system control problem are simplified considerably [48]. The machine is assumed to start in the healthy state 0. The instantaneous transition rate matrix (Q-matrix) of the continuous time Markov chain is given by

$$Q = \begin{pmatrix} -(q_{01} + q_{02}) & q_{01} & q_{02} \\ 0 & -q_{12} & q_{12} \\ 0 & 0 & 0 \end{pmatrix}. \tag{3.2}$$

In this case, the interval between two sampling epochs is hrs. The three-dimensional vector data $Y_h, Y_{2h}, \dots, Y_{kh}$ represent the RMS values of the residual process. Let $O$ represent all vibration data histories and $L(\Lambda, \Theta \mid O)$ be the associated likelihood function, where $\Theta = (q_{01}, q_{02}, q_{12})$ and $\Lambda = \{\mu_0, \mu_1, \Sigma_0, \Sigma_1\}$ are the sets of unknown state and observation parameters. The expectation-maximization (EM) algorithm is well suited to this continuous time hidden Markov model parameter estimation problem.

E-step. Compute the pseudo likelihood function defined by

$$Q(\Lambda, \Theta \mid \Lambda_k, \Theta_k) := E_{\Lambda_k, \Theta_k}\left(\ln L(\Lambda, \Theta \mid \bar{O}) \mid O\right), \tag{3.3}$$

where $\bar{O}$ represents the complete observation data set, in which each observation history of the historical data set $O$ has been augmented with the hidden state process $\{X_t : t \in \mathbb{R}^+\}$ information.

M-step. Choose $\Lambda_{k+1}, \Theta_{k+1}$ such that

$$(\Lambda_{k+1}, \Theta_{k+1}) \in \arg\max_{\Lambda, \Theta} Q(\Lambda, \Theta \mid \Lambda_k, \Theta_k). \tag{3.4}$$

The E and M steps are repeated until the Euclidean norm satisfies $\|(\Lambda_{k+1}, \Theta_{k+1}) - (\Lambda_k, \Theta_k)\| < \varepsilon$. Detailed formulae for the complete-data likelihood function and the pseudo likelihood function $Q(\Lambda, \Theta \mid \Lambda_k, \Theta_k)$ can be found in [48] (see Appendix A). We refine the computational procedures and obtain the estimated parameters using the residual RMS data. The algorithm converged in only 7 iterations and in under 15 seconds, which is fast enough for offline computation and attractive for real applications.

Table 3.2 Iterations of the EM algorithm (columns: $q_{01}$, $q_{02}$, $q_{12}$, $\mu_0$, $\mu_1$, $\Sigma_0$, $\Sigma_1$, $Q$, Time (sec); rows: initial value and successive iterations)
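To make the state model of this section concrete, the sketch below simulates one sample path of a three-state chain whose only transitions are 0 to 1, 0 to 2 and 1 to 2, using exponential holding times and competing rates. The numerical rates are hypothetical and are not the estimates of Table 3.2; the function name `simulate_path` is ours.

```python
import random

def simulate_path(q01, q02, q12, rng):
    """Simulate one path of the 3-state chain started in state 0.

    Returns a list of (jump_time, new_state) pairs ending in failure state 2.
    """
    t, state = 0.0, 0
    path = [(0.0, 0)]
    while state != 2:
        if state == 0:
            total_rate = q01 + q02                      # rate of leaving state 0
            t += rng.expovariate(total_rate)            # exponential holding time
            state = 1 if rng.random() < q01 / total_rate else 2
        else:                                           # warning state 1
            t += rng.expovariate(q12)
            state = 2
        path.append((t, state))
    return path

rng = random.Random(0)
# Hypothetical rates (per hour), chosen only for illustration.
path = simulate_path(q01=0.05, q02=0.005, q12=0.2, rng=rng)
failure_time = path[-1][0]
```

Averaging `failure_time` over many simulated paths should approach $1/(q_{01}+q_{02}) + q_{01}/((q_{01}+q_{02})\,q_{12})$, the mean time to failure of a system that starts in the healthy state.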

3.4 Multivariate Bayesian Control Chart for Cost-Optimal Early Fault Detection and CBM

In this section, we consider the cost-optimal stopping time problem of detecting early fault occurrence. A stopping threshold is used to decide when to stop the machine, with the objective of minimizing the long-run expected average cost per unit time. In the partially observable Markov decision process framework, it is well known that the posterior probability that the system is in the warning state is sufficient for optimal decision making [16]. Let $\Pi_k$ be the probability that the process is in the warning state at time $kh$ given the observations up to time $kh$. Given one sample at each sampling epoch, the posterior probability can be computed recursively as

$$\Pi_{k+1} = P(X_{(k+1)h} = 1 \mid Y_1, \dots, Y_k, \Pi_k, \xi > (k+1)h) = \frac{D_1}{\dfrac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)} D_0 + D_1}, \tag{3.5}$$

where $y$ denotes the observation collected at time $(k+1)h$, and $D_0$ and $D_1$ are defined as

$$D_0 = P_{00}(h)(1 - \Pi_k) + P_{10}(h)\Pi_k, \tag{3.6}$$

$$D_1 = P_{01}(h)(1 - \Pi_k) + P_{11}(h)\Pi_k. \tag{3.7}$$

The initial value is $\Pi_0 = 0$. The transition probabilities in equations (3.6) and (3.7) can be obtained from equation (3.2) by solving the Kolmogorov backward differential equations [49]:

$$P(t) = \big(P_{ij}(t)\big) = \begin{pmatrix} e^{-(q_{01}+q_{02})t} & \dfrac{q_{01}\left(e^{-q_{12}t} - e^{-(q_{01}+q_{02})t}\right)}{q_{01}+q_{02}-q_{12}} & 1 - e^{-(q_{01}+q_{02})t} - \dfrac{q_{01}\left(e^{-q_{12}t} - e^{-(q_{01}+q_{02})t}\right)}{q_{01}+q_{02}-q_{12}} \\ 0 & e^{-q_{12}t} & 1 - e^{-q_{12}t} \\ 0 & 0 & 1 \end{pmatrix}. \tag{3.8}$$

By renewal theory, the cost minimization problem is equivalent to finding an optimal control limit $\Pi^* \in [0,1]$ such that

$$g(\Pi^*) = \inf_{\Pi \in [0,1]} \frac{E_{\Pi}(CC)}{E_{\Pi}(CL)}, \tag{3.9}$$

where $CC$ and $CL$ denote the cycle cost and the cycle length, respectively. When applying the Bayesian control technique, the posterior probability $\Pi_k \in [0,1]$ is used to monitor the deterioration process and is recomputed after each observation. When $\Pi \le \Pi_k$, the system is stopped and a full inspection is performed at inspection cost rate $C_I$. At that point the system is in healthy state 0 with probability $1 - \Pi_k$ and in warning state 1 with probability $\Pi_k$. If the system is in warning state 1, preventive maintenance is conducted at cost rate $C_{PM}$ during time $T_{PM}$, and the associated lost production cost is incurred at rate $C_{LP}$. If the system fails before the chart signals, failure replacement is triggered, which incurs cost rate $C_F$ for a duration $T_F$, together with the lost production cost at rate $C_{LP}$. After inspection or replacement, the system is renewed and a new cycle begins.

The optimal average cost $g(\Pi^*)$ can be obtained by defining and solving a semi-Markov decision process (SMDP) problem. The state of the SMDP is defined by the value of $\Pi_k$ between 0 and 1, which is plotted on the Bayesian control chart. For a sufficiently large integer $I$, if the current value $\Pi_k$ lies in the interval $\left[\frac{i-1}{I}, \frac{i}{I}\right)$, we assume that $\Pi_k = \frac{i - 0.5}{I}$. If $\Pi > \Pi_k$ we continue, and if $\Pi \le \Pi_k$, a full inspection is performed. When the inspection finds the system in warning state 1, the SMDP is defined to be in the preventive maintenance state (PM) and preventive maintenance is performed. When the system fails, the SMDP is defined to be in state F, and the machine is stopped and repaired. The complete state space of the SMDP is $S = \{0\} \cup \{i : 1 \le i \le I\} \cup \{PM\} \cup \{F\}$. For a given value of $\Pi$, the long-run expected average cost $g(\Pi)$ can then be obtained by solving the following system of linear equations:

$$v_i = c_i - g(\Pi)\tau_i + \sum_{j \in S} P_{ij} v_j, \quad \text{for } i \in S, \qquad v_0 = 0. \tag{3.10}$$

The SMDP model is determined by the following characteristics:

$c_i$ = the expected cost incurred until the next decision epoch given the present state $i$, $i \in S$,
$\tau_i$ = the expected time until the next decision epoch given the present state $i$, $i \in S$,
$P_{ij}$ = the probability that at the next decision epoch the system will be in state $j$ given the present state $i$, $i \in S$,
$g(\Pi) = \dfrac{E(CC)}{E(CL)}$.

Consider the following reformulation of the density ratio in equation (3.5):

$$\frac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)} = \frac{(|\Sigma_0|)^{-1/2}}{(|\Sigma_1|)^{-1/2}} \exp\left\{\frac{1}{2}\left[(y - B)^{\top} A (y - B) + E\right]\right\} = \frac{(|\Sigma_0|)^{-1/2}}{(|\Sigma_1|)^{-1/2}} \exp\left\{\frac{1}{2}\left[V(y) + E\right]\right\}, \tag{3.11}$$

where $A$, $B$, $E$ and $V(y)$ are given by

$$\begin{aligned} A &= \Sigma_1^{-1} - \Sigma_0^{-1}, \\ B &= (\Sigma_1^{-1} - \Sigma_0^{-1})^{-1}(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0), \\ E &= (\mu_1^{\top}\Sigma_1^{-1}\mu_1 - \mu_0^{\top}\Sigma_0^{-1}\mu_0) - B^{\top}(\Sigma_1^{-1}\mu_1 - \Sigma_0^{-1}\mu_0), \\ V(y) &= (y - B)^{\top} A (y - B). \end{aligned} \tag{3.12}$$

Here, $y - B$ follows the normal distribution with mean $\mu_0 - B$ in healthy state 0 and with mean $\mu_1 - B$ in warning state 1. The distribution function $F(j)$ of $\Pi_{k+1}$ can be computed using the equations of Provost [50] (Appendix B), who provided a closed-form expression for the distribution of the indefinite quadratic form $V(y)$. Suppose that at time $kh$ the system has not failed, i.e. $\xi > kh$. Using equation (3.8), the conditional reliability function is given by

$$R(t \mid \Pi_k) = P(\xi > kh + t \mid \xi > kh, Y_1, \dots, Y_k, \Pi_k),$$

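$$= (P_{00}(t) + P_{01}(t))(1 - \Pi_k) + (P_{10}(t) + P_{11}(t))\Pi_k. \tag{3.13}$$

The distribution function $F(j)$ of $\Pi_{k+1}$ has the following form:

$$\begin{aligned} F(j) &= \Pr(\Pi_{k+1} \le j \mid \xi > kh, Y_1, \dots, Y_k, \Pi_k) \\ &= \Pr\left(V(y) > k(j) \mid X_{(k+1)h} = 0\right)\frac{D_0}{D_0 + D_1} + \Pr\left(V(y) > k(j) \mid X_{(k+1)h} = 1\right)\frac{D_1}{D_0 + D_1}, \end{aligned} \tag{3.14}$$

where

$$k(j) = 2\ln\left[\frac{(|\Sigma_1|)^{-1/2}}{(|\Sigma_0|)^{-1/2}}\left(\frac{1}{j} - 1\right)\frac{D_1}{D_0}\right] - E, \quad j \in [0,1].$$

For a large $I$, if the SMDP is in state $i \in S$, then $\Pi_k = \frac{i - 0.5}{I}$ and the SMDP transition probabilities can be computed as:

$$\begin{aligned} P_{ij} &= \left[F(j^+) - F(j^-)\right] R(t \mid \Pi_k), && \text{if } \Pi_k < \Pi, \\ P_{iF} &= 1 - R(t \mid \Pi_k), \\ P_{i,PM} &= \Pi_k, && \Pi_k > \Pi, \\ P_{i,0} &= 1 - \Pi_k, && \Pi_k > \Pi, \\ P_{PM,0} &= P_{F,0} = 1. \end{aligned} \tag{3.15}$$

The expected costs are given by:

$$\begin{aligned} c_i &= E\left(\int_0^h C_{p0}\, I_{\{X_{kh+s} = 0\}}\, ds \,\Big|\, \xi > kh, \Pi_k\right) + E\left(\int_0^h C_{p1}\, I_{\{X_{kh+s} = 1\}}\, ds \,\Big|\, \xi > kh, \Pi_k\right) + C_s \\ &= C_{p0}(1 - \Pi_k)\frac{1 - e^{-(q_{01}+q_{02})h}}{q_{01}+q_{02}} + C_{p1}\left[\frac{q_{01}\left(\dfrac{1 - e^{-q_{12}h}}{q_{12}} - \dfrac{1 - e^{-(q_{01}+q_{02})h}}{q_{01}+q_{02}}\right)}{q_{01}+q_{02}-q_{12}}(1 - \Pi_k) + \frac{1 - e^{-q_{12}h}}{q_{12}}\Pi_k\right] + C_s, && \Pi_k < \Pi, \\ c_i &= C_I T_I + C_{LP} T_I, && \text{if } \Pi_k > \Pi, \\ c_{PM} &= C_{PM} T_{PM} + C_{LP} T_{PM}, \\ c_F &= C_F T_F + C_{LP} T_F. \end{aligned} \tag{3.16}$$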
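The closed-form operating cost over one sampling interval can be checked numerically. The sketch below is an illustration under stated assumptions rather than the thesis implementation: it evaluates the expected operating cost for a state below the control limit as the time integrals of $P_{00}$, $P_{01}$ and $P_{11}$ over $[0, h]$, weighted by the cost rates, plus the sampling cost. It assumes $q_{01} + q_{02} \ne q_{12}$; the function name and all numerical values in the usage line (including the sampling cost) are hypothetical.

```python
import math

def expected_operating_cost(pi_k, h, q01, q02, q12, c_p0, c_p1, c_s):
    """Expected cost until the next sampling epoch when the chart does not signal.

    Assumes q01 + q02 != q12 so that the closed form is well defined.
    """
    v0, v1 = q01 + q02, q12
    int_p00 = (1.0 - math.exp(-v0 * h)) / v0            # integral of P00 over [0, h]
    int_p11 = (1.0 - math.exp(-v1 * h)) / v1            # integral of P11 over [0, h]
    int_p01 = q01 * (int_p11 - int_p00) / (v0 - v1)     # integral of P01 over [0, h]
    cost_state0 = c_p0 * (1.0 - pi_k) * int_p00
    cost_state1 = c_p1 * (int_p01 * (1.0 - pi_k) + int_p11 * pi_k)
    return cost_state0 + cost_state1 + c_s

# Hypothetical inputs: posterior 0.2, one-hour sampling interval.
cost = expected_operating_cost(0.2, 1.0, 0.05, 0.005, 0.2, 5.0, 15.0, 1.0)
```

For a very short sampling interval and $\Pi_k = 0$, the value reduces to approximately $C_{p0} h + C_s$, which is a quick sanity check on the formula.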

The sojourn times are given by:

$$\begin{aligned} \tau_i &= \int_0^h R(t \mid \Pi_k)\, dt \\ &= \left[\frac{1 - e^{-(q_{01}+q_{02})h}}{q_{01}+q_{02}} + \frac{q_{01}\left(\dfrac{1 - e^{-q_{12}h}}{q_{12}} - \dfrac{1 - e^{-(q_{01}+q_{02})h}}{q_{01}+q_{02}}\right)}{q_{01}+q_{02}-q_{12}}\right](1 - \Pi_k) + \frac{1 - e^{-q_{12}h}}{q_{12}}\Pi_k, && \text{if } \Pi_k < \Pi, \\ \tau_i &= T_I, && \text{if } \Pi_k > \Pi, \\ \tau_{PM} &= T_{PM}, && \text{if the system is in the preventive maintenance state}, \\ \tau_F &= T_F, && \text{the repair time}. \end{aligned} \tag{3.17}$$

Setting the times $T_I = 1$ hour, $T_{PM} = 3$ hours and $T_F = 20$ hours, the cost rates $C_{p0} = 5$, $C_{p1} = 15$, $C_I = 50$, $C_{PM} = 100$, $C_F = 200$ and $C_{LP} = 200$, and choosing the partition with $I = 30$, we computed the optimal control limit $\Pi^* = 0.3$ in Matlab on a Dell PC with an Intel Core i5 processor and 4 GB of RAM (see Table 3.3). The control chart indicates that the machine should be stopped at point 36, corresponding to residual data file 283, as seen in Figure 3.4. A sensitivity analysis of the process parameters is not the main focus of this study; the reader is referred to [16, 53].

Figure 3.4 Fault detection by using multivariate Bayesian control chart
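The recursion behind the Bayesian control chart can be sketched in a few lines. The code below implements one posterior update of the form (3.5) to (3.7), using the closed-form transition probabilities (3.8), and then applies a stopping rule with a given control limit. It is a simplified illustration: the covariance matrices are taken to be diagonal so that no matrix inversion is needed, and the rates, means, variances, observations and control limit are all hypothetical rather than the values estimated from the MDTB data.

```python
import math

def transition_probs(q01, q02, q12, t):
    """P00(t), P01(t), P11(t) from the closed form; assumes q01 + q02 != q12."""
    v0 = q01 + q02
    p00 = math.exp(-v0 * t)
    p11 = math.exp(-q12 * t)
    p01 = q01 * (p11 - p00) / (v0 - q12)
    return p00, p01, p11

def normal_density_diag(y, mean, var):
    """Multivariate normal density with diagonal covariance (simplifying assumption)."""
    dens = 1.0
    for yi, mi, vi in zip(y, mean, var):
        dens *= math.exp(-0.5 * (yi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return dens

def update_posterior(pi_k, y, h, rates, params0, params1):
    """One step of the recursion: posterior probability of the warning state."""
    p00, p01, p11 = transition_probs(*rates, h)
    d0 = p00 * (1.0 - pi_k)                  # P10(h) = 0 for this chain
    d1 = p01 * (1.0 - pi_k) + p11 * pi_k
    f0 = normal_density_diag(y, *params0)
    f1 = normal_density_diag(y, *params1)
    # Algebraically equal to D1 / ((f0 / f1) * D0 + D1).
    return d1 * f1 / (d0 * f0 + d1 * f1)

rates = (0.05, 0.005, 0.2)                        # hypothetical q01, q02, q12
healthy = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])      # hypothetical mean and variances, state 0
warning = ([2.0, 2.0, 2.0], [1.0, 1.0, 1.0])      # hypothetical mean and variances, state 1
observations = [[0.1, -0.2, 0.0], [0.3, 0.1, -0.1], [2.1, 1.8, 2.2], [1.9, 2.3, 2.0]]

limit, pi_k, chart = 0.5, 0.0, []                 # hypothetical control limit
for y in observations:
    pi_k = update_posterior(pi_k, y, 1.0, rates, healthy, warning)
    chart.append(pi_k)

# First sampling epoch (1-based) at which the chart signals, or None.
stop_index = next((k for k, p in enumerate(chart, start=1) if p >= limit), None)
```

With these made-up inputs the posterior stays near zero while the observations resemble the healthy state and rises sharply once they shift toward the warning-state mean, which is the behaviour the control chart exploits.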

Table 3.3 The optimal early fault detection limit and the long run average costs (rows: control limit $\Pi$ and average cost)

3.5 Residual Life Prediction

We can assess the remaining useful life of the system by computing the mean residual life from the posterior probability $\Pi_k$. We assume that the system deterioration follows a continuous-time homogeneous Markov chain $(X_t : t \in \mathbb{R}^+)$ with state space $S = \{0, 1, 2\}$. States 0 and 1 are unobservable operating states, and state 2 is an observable failure state. Using equation (3.13), the conditional reliability function is given by

$$\begin{aligned} R(t \mid \Pi_k) &= P(\xi > kh + t \mid \xi > kh, Y_1, \dots, Y_k, \Pi_k) \\ &= \left[e^{-(q_{01}+q_{02})t} + \frac{q_{01}\left(e^{-q_{12}t} - e^{-(q_{01}+q_{02})t}\right)}{q_{01}+q_{02}-q_{12}}\right](1 - \Pi_k) + e^{-q_{12}t}\,\Pi_k. \end{aligned} \tag{3.18}$$

Thus, the mean residual life function for this model can be computed as follows:

$$\begin{aligned} \mu_{kh} &= E(\xi - kh \mid \xi > kh, Y_1, \dots, Y_k, \Pi_k) = \int_0^{\infty} R(t \mid \Pi_k)\, dt \\ &= \left[\frac{1}{q_{01}+q_{02}} + \frac{q_{01}\left(\dfrac{1}{q_{12}} - \dfrac{1}{q_{01}+q_{02}}\right)}{q_{01}+q_{02}-q_{12}}\right](1 - \Pi_k) + \frac{1}{q_{12}}\Pi_k. \end{aligned} \tag{3.19}$$
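The two formulas above translate directly into code. The following sketch evaluates the conditional reliability function and the mean residual life for a given posterior probability; it assumes $q_{01} + q_{02} \ne q_{12}$, and the rates used in the usage lines are hypothetical, not the estimates obtained from the gearbox data.

```python
import math

def reliability(t, pi_k, q01, q02, q12):
    """Conditional reliability R(t | pi_k) for the three-state model."""
    v0 = q01 + q02
    from_healthy = math.exp(-v0 * t) + q01 * (math.exp(-q12 * t) - math.exp(-v0 * t)) / (v0 - q12)
    from_warning = math.exp(-q12 * t)
    return from_healthy * (1.0 - pi_k) + from_warning * pi_k

def mean_residual_life(pi_k, q01, q02, q12):
    """Mean residual life: the integral of R(t | pi_k) over t from 0 to infinity."""
    v0 = q01 + q02
    from_healthy = 1.0 / v0 + q01 * (1.0 / q12 - 1.0 / v0) / (v0 - q12)
    from_warning = 1.0 / q12
    return from_healthy * (1.0 - pi_k) + from_warning * pi_k

# Hypothetical rates (per hour), for illustration only.
rates = (0.05, 0.005, 0.2)
mrl_healthy = mean_residual_life(0.0, *rates)   # posterior says "healthy"
mrl_warning = mean_residual_life(1.0, *rates)   # posterior says "warning"
```

Because the mean residual life is linear in $\Pi_k$, the predicted remaining life moves from the healthy-state value to $1/q_{12}$ as the posterior probability rises from 0 to 1.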

Figure 3.5 Deterioration sample path of one tooth

Given the parameter estimates $\Theta = (q_{01}, q_{02}, q_{12})$, the mean residual life function (equation 3.19) is fully determined by the posterior probability $\Pi_k$. We use this function to determine the mean residual life given a gear tooth failure history (RMS values) up to file 295, as shown in Figure 3.5. In Figure 3.4, the posterior probability $\Pi_k$ changes little and remains very low while the system is running from data files 247 to 280. This means that the system is almost certainly in good condition during this period, and the mean time to failure remains constant. From data file 279, the posterior probability $\Pi_k$ starts to increase. It jumps to one between files 282 and 285 and stays there after file 285, which indicates that the system is in the warning state while still operational. The expected remaining lives from files 278 to 295 are shown in Figure 3.6. The remaining life is predicted to be hours at file 247. It remains constant up to file 279, while the system is still in a healthy state. The expected value decreases to 7.57 hours after file 285, when there is a strong indication that the system is in a warning state. The corresponding conditional reliabilities, starting at the times of files 282, 283 and 284, respectively, are shown in Figure 3.7. We

note that the actual remaining life time from file 246 is 12.4 hours. The actual remaining life time from file 284 is hours. Thus, the proposed model is suitable for predicting the remaining useful life of deteriorating gearbox systems.

Figure 3.6 Remaining useful life prediction

Figure 3.7 Conditional reliability function

Chapter 4 Bayesian Estimation and Optimal Maintenance Control with a Two-state Hidden Semi-Markov Model Approach

4.1 Introduction

Observed vibration data and the hidden system deterioration can be modeled using a hidden Markov model (HMM) or a hidden semi-Markov model (HSMM). In the previous chapter, we considered a three-state hidden Markov model to monitor the health condition of a system from partially observable vibration data with observable failure information. However, vibration data with failure information are scarce or nonexistent in real applications, mainly because such systems are preventively maintained before failure occurs. In practice, machines are rarely allowed to run to failure and data are commonly suspended (see e.g. [1], [27]). For such critical systems without historical failure data, it is suitable to model the degradation process as a two-state HSMM. In the literature, very few research papers have been devoted to semi-Markov modeling using multivariate observation processes. Hidden semi-Markov chains retain the flexibility of hidden Markov chains without restricting the sojourn times in the hidden states to exponential or geometric distributions.

In this chapter, we focus on developing an estimation and optimal maintenance control scheme in a two-state hidden semi-Markov model framework. Estimating the model parameters of the HSMM requires pre-processing and modeling of the collected vibration data. We present a parameter estimation procedure for this model that allows a general sojourn time distribution in the healthy state. Using the EM algorithm, both the state and the observation model parameters can be estimated. Explicit formulae for the parameter updates under the Weibull distribution are developed as a special case. We also present a novel application of a computational algorithm in the SMDP framework which can be

used to obtain the optimal maintenance policy minimizing the long-run average cost per unit time. Furthermore, we consider both the Bayesian estimation and the control problems using a phase-type sojourn time distribution. A case study using real vibration data is developed to demonstrate the whole procedure. The numerical results are compared with those obtained from hidden Markov modelling of the same vibration data set.

4.2 Model Description

We assume that the deteriorating system follows a continuous time semi-Markov process $\{X_t : t \in \mathbb{R}^+\}$ with state space $\{0, 1\}$. States 0 and 1 are unobservable and represent the healthy and unhealthy states, respectively. The system starts in the healthy state, and the sojourn time $\tau$ in this state is generally distributed with density function $f_S(t)$. The unhealthy state is absorbing. The system is monitored at equidistant sampling epochs $(h, \dots, Nh)$ and the vibration data $y_n$ are obtained at epoch $nh$. For each data history $i$, $i = 1, \dots, M$, $y^i = (y_1^i, \dots, y_{N_i}^i)$ represents the collection of all $d$-dimensional vector observations up to time $N_i h$, which is the last sampling point for history $i$. We assume that the observations $y_n^i$ are conditionally independent and have a multivariate normal distribution given the state of the system. This assumption is valid once an appropriate data pre-processing method has been applied. In particular, one should first fit a model that accounts for cross-correlation and autocorrelation in the data, using only the healthy portion of the vibration data histories. It was proven by Yang and Makis [12] that for a general class of time series models, the residuals are independent and normally distributed. For the remainder of the chapter, the residuals are therefore chosen as the "observation process" satisfying the assumptions of multivariate normality and conditional independence. Thus, $y_n^i$, conditional on $X_{nh} = x$, $x \in \{0, 1\}$, has a $d$-variate

normal distribution $N_d(\mu_x, \Sigma_x)$ with density function:

$$g(y \mid x) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_x)}} \exp\left(-\frac{1}{2}(y - \mu_x)^{\top} \Sigma_x^{-1} (y - \mu_x)\right). \tag{4.1}$$

After collecting a new observation vector, a decision is made either to run the system until the next sampling point, or to stop the system and, after an inspection, possibly carry out full preventive maintenance, which brings the system back to the healthy state. The following cost structure is considered:

$C_{OM}$: Maintenance cost rate in state 1 when the system is operational.
$C_I$: Inspection cost rate.
$C_{PM}$: Preventive maintenance cost rate.
$C_{LP}$: Lost production cost rate.
$C_s$: Sampling cost incurred for each observation.

The objective is to determine the optimal maintenance policy minimizing the long-run expected average cost per unit time. From renewal theory, the expected average cost of the system is equal to the expected cycle cost divided by the expected cycle length:

$$\frac{E(CC)}{E(CL)} = \inf_{\Pi \in [0,1]} \frac{E_{\Pi}(CC)}{E_{\Pi}(CL)}. \tag{4.2}$$

Let $\Pi_n = P(X_{nh} = 1 \mid y_1, \dots, y_n)$ denote the probability that the process is in the unhealthy state given the observations up to time $nh$. $E_{\Pi}$ is the conditional expectation given $\Pi$, and $CC$ is the total cost over one complete cycle of length $CL$. Recently, the Bayesian control chart has received a lot of attention ([51]), and it was proven to be an optimal tool for decision making in the area of quality control [16]. In this chapter, we consider the multivariate Bayesian control approach to find the optimal maintenance policy for a system described by

a hidden semi-Markov process with partial observations. The following notation will be used throughout this chapter:

$X$: Continuous-time state process.
$h$: Sampling interval.
$Y$: Discrete-time observation process.
$\tau$: Sojourn time in the healthy state.
$S$: Set of state parameters determining the distribution of $\tau$.
$O$: Set of observation parameters $\{(\mu_0, \Sigma_0), (\mu_1, \Sigma_1)\}$.
$y^i = (y_1^i, \dots, y_{N_i}^i)$: Individual $i$th observation history.
$L(O, S \mid y^i)$: Partial likelihood for history $i$.
$L(O, S \mid y^i, t^i)$: Full likelihood for history $i$.
$g(y \mid x)$: Density function of an observation vector given the system state.
$\Pi_n$: Posterior probability statistic.

4.3 Parameter Estimation for a General HSMM with Two Hidden States

Suppose we have obtained $M$ observation histories of the form $y^i = (y_1^i, \dots, y_{N_i}^i)$, where $i \in \{1, \dots, M\}$ and $N_i h$ is the length of the $i$th history. The unobservable deterioration process is described as a continuous time semi-Markov process $(X_t : t \in \mathbb{R}^+)$ with state space $\{0, 1\}$. The deterioration process begins in the healthy state 0. The sojourn time $\tau$ in state 0 is a random variable with a general distribution function:

$$F_S(t) = P(\tau \le t), \quad t \in \mathbb{R}^+, \tag{4.3}$$

where $S$ is the parameter set of the probability distribution function $F_S(\cdot)$. The actual values of the parameters are unknown and must be estimated from the observed vibration data. The probability density function of the sojourn time $\tau$ is denoted by $f_S(t)$. Let $Y = \{y^1, \dots, y^M\}$

represent all observable data and $L(O, S \mid Y)$ be the associated likelihood function, where $\{O, S\}$ are the sets of unknown observation parameters $O = (\mu_0, \mu_1, \Sigma_0, \Sigma_1)$ and state parameters $S = \{s_1, s_2, \dots\}$. Because the sample path $(X_t : t \in \mathbb{R}^+)$ of the deterioration process is not observable, maximizing $L(O, S \mid Y)$ analytically is not possible. The EM algorithm resolves this difficulty by iteratively maximizing the so-called pseudo-likelihood function. A detailed review can be found in McLachlan and Krishnan [52]. Let $O_0, S_0$ be some initial values of the unknown parameters. The EM algorithm works as follows:

E-step. For $j \ge 0$, compute the pseudo log-likelihood function defined by:

$$\Omega(O, S \mid O_j, S_j) := E_{O_j, S_j}\left(\ln L(O, S \mid Y^c) \mid Y\right), \tag{4.4}$$

where $Y^c$ represents the complete observation data set, consisting of the observation histories augmented with the sample path information of the state process $X$.

M-step. Maximize the expectation computed in the first step:

$$(O_{j+1}, S_{j+1}) \in \arg\max_{O, S} \Omega(O, S \mid O_j, S_j). \tag{4.5}$$

$O_{j+1}, S_{j+1}$ are chosen as the updated parameter estimates for the next iteration. The E and M steps are repeated until the Euclidean norm satisfies $\|(O_{j+1}, S_{j+1}) - (O_j, S_j)\| < \varepsilon$ for a selected small $\varepsilon > 0$. $O_{j+1}, S_{j+1}$ are then chosen as the optimal parameters $(O^*, S^*)$.

Formula for the Likelihood Function

Note that the entire sample path $\{X_t : t \in \mathbb{R}^+\}$ of the system state is fully determined by the random sojourn time $\tau$. That is, if the value of $\tau$ is known, the sample path of the state process $\{X_t : t \in \mathbb{R}^+\}$ is fully defined. Before we derive the formula for the full likelihood $L(O, S \mid Y^c)$ for $M$ observation histories, we first consider the case of a single deterioration history $y$. Because $\tau$ is sufficient for characterizing the sample paths of the hidden state process, the complete likelihood function $L(O, S \mid y, t)$ is given by:

$$L(O, S \mid y, t) = g_O(y \mid \tau = t)\, f_S(t), \tag{4.6}$$

where $f_S(t)$ is the density function of the sojourn time $\tau$ and $g_O(y \mid \tau = t)$ is the conditional density function of the observation history $\{y_1, \dots, y_N\}$. If $\tau = t$ is in the interval $((k-1)h, kh]$, this conditional density can be written as:

$$g_O(y \mid k-1) = \frac{\exp\left(-\dfrac{1}{2}\displaystyle\sum_{n=1}^{k-1}(y_n - \mu_0)^{\top}\Sigma_0^{-1}(y_n - \mu_0) - \dfrac{1}{2}\displaystyle\sum_{n=k}^{N}(y_n - \mu_1)^{\top}\Sigma_1^{-1}(y_n - \mu_1)\right)}{\sqrt{(2\pi)^{Nd}\det^{k-1}(\Sigma_0)\det^{N-k+1}(\Sigma_1)}}. \tag{4.7}$$

For $t > Nh$, we write:

$$g_O(y \mid N) = \frac{\exp\left(-\dfrac{1}{2}\displaystyle\sum_{n=1}^{N}(y_n - \mu_0)^{\top}\Sigma_0^{-1}(y_n - \mu_0)\right)}{\sqrt{(2\pi)^{Nd}\det^{N}(\Sigma_0)}}. \tag{4.8}$$

Thus, for $M$ observation histories, the likelihood function is given by:

$$L(O, S \mid Y^c) = \prod_{i=1}^{M} L(O, S \mid y^i, t^i). \tag{4.9}$$

Formula for the Pseudo Log-Likelihood Function

In this subsection, we describe the E-step of the EM algorithm. We derive the pseudo log-likelihood by taking the expectation of the logarithm of the likelihood function given by equation (4.9). Considering the general case of $M$ observation histories, $y^i = (y_1^i, \dots, y_{N_i}^i)$, $i \in \{1, \dots, M\}$ and $N_i \in \mathbb{N}$, the pseudo log-likelihood function is obtained as follows:

$$\begin{aligned} \Omega(O, S \mid O_j, S_j) &= E_{O_j, S_j}\left(\ln L(O, S \mid Y^c) \mid Y\right) \\ &= E_{O_j, S_j}\left(\sum_{i=1}^{M} \ln L(O, S \mid y^i, t^i) \,\Big|\, Y\right) \\ &= \sum_{i=1}^{M} E_{O_j, S_j}\left(\ln L(O, S \mid y^i, t^i) \mid y^i\right) \\ &= \sum_{i=1}^{M} \Omega_i(O, S \mid O_j, S_j). \end{aligned} \tag{4.10}$$

To simplify notation, for any vector $v = (v_0, \dots, v_n)$, we denote $\ln v = (\ln v_0, \dots, \ln v_n)$. The inner product is $\langle v, w \rangle := v^{\top} w$.

Theorem 4.1 Given $M$ observation histories, the pseudo log-likelihood function has the following decomposition:

$$\sum_{i=1}^{M} \Omega_i(O, S \mid O_j, S_j) = \sum_{i=1}^{M} \Omega_i^{obs}(O \mid O_j, S_j) + \sum_{i=1}^{M} \Omega_i^{state}(S \mid O_j, S_j), \tag{4.11}$$

where

$$\Omega_i^{obs}(O \mid O_j, S_j) = \langle b^i, \ln g^i \rangle, \tag{4.12}$$

$$\Omega_i^{state}(S \mid O_j, S_j) = \langle a^i, c^i \rangle. \tag{4.13}$$

Vectors $a^i$, $b^i$, $c^i$ and $g^i$ depend only on the fixed estimates $O_j, S_j$ and are given in the proof.

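Proof. The pseudo log-likelihood can be written as follows:

$$\begin{aligned} \sum_{i=1}^{M} \Omega_i(O, S \mid O_j, S_j) &= \sum_{i=1}^{M} E_{O_j, S_j}\left(\ln L(O, S \mid y^i, t^i) \mid y^i\right) \\ &= \sum_{i=1}^{M} \frac{\int_0^{\infty} \ln\left(g_O(y^i \mid t) f_S(t)\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} \\ &= \sum_{i=1}^{M} \left[\frac{\int_0^{\infty} \ln g_O(y^i \mid t)\, g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} + \frac{\int_0^{\infty} \ln f_S(t)\, g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}\right] \\ &= \sum_{i=1}^{M} \Omega_i^{obs}(O \mid O_j, S_j) + \sum_{i=1}^{M} \Omega_i^{state}(S \mid O_j, S_j). \end{aligned} \tag{4.14}$$

The first term in equation (4.14) depends only on the observation parameter set $O$, and the second term depends only on the state parameter set $S$. The observation term in equation (4.14) can be written as follows:

$$\Omega_i^{obs}(O \mid O_j, S_j) = \sum_{n=0}^{N_i - 1} b_n^i \ln g_O(y^i \mid n) + b_{N_i}^i \ln g_O(y^i \mid N_i), \tag{4.15}$$

where

$$\begin{aligned} b_n^i &= \frac{g_{O_j}(y^i \mid n)\left[R(j, n) - R(j, n+1)\right]}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}, \quad n = 0, \dots, N_i - 1, \\ b_{N_i}^i &= \frac{g_{O_j}(y^i \mid N_i)\, R(j, N_i)}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}, \end{aligned} \tag{4.16}$$

and

$$\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt = \sum_{n=0}^{N_i - 1} g_{O_j}(y^i \mid n)\left[R(j, n) - R(j, n+1)\right] + g_{O_j}(y^i \mid N_i)\, R(j, N_i). \tag{4.17}$$

The reliability function $R(j, n)$ represents the probability that the time to failure will be greater than a specific time $nh$ under the parameter estimates $S_j$. The state term in equation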

(4.14) can be written as follows:

$$\Omega_i^{state}(S \mid O_j, S_j) = \sum_{n=0}^{N_i - 1} a_n^i c_n^i + a_{N_i}^i c_{N_i}^i, \tag{4.18}$$

where

$$\begin{aligned} a_n^i &= \frac{g_{O_j}(y^i \mid n)}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}, \quad n = 0, \dots, N_i - 1, \\ a_{N_i}^i &= \frac{g_{O_j}(y^i \mid N_i)}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}, \end{aligned} \tag{4.19}$$

$$\begin{aligned} c_n^i &= \int_{nh}^{(n+1)h} \ln f_S(t)\, f_{S_j}(t)\, dt, \quad n = 0, \dots, N_i - 1, \\ c_{N_i}^i &= \int_{N_i h}^{\infty} \ln f_S(t)\, f_{S_j}(t)\, dt. \end{aligned} \tag{4.20}$$

We denote the vectors $a^i = (a_0^i, \dots, a_{N_i}^i)$, $b^i = (b_0^i, \dots, b_{N_i}^i)$, $c^i = (c_0^i, \dots, c_{N_i}^i)$ and $g^i = (g_0^i, g_1^i, \dots, g_{N_i}^i)$, where $g_n^i = g_{O_j}(y^i \mid t)$ with $nh < t \le (n+1)h$, $n = 0, \dots, N_i - 1$. When $t > N_i h$, we have $g_{N_i}^i = g_{O_j}(y^i \mid N_i)$. This completes the proof.

Theorem 4.1 implies that the maximization step of the pseudo log-likelihood function in equation (4.11) can be carried out separately for the observation term $\Omega_i^{obs}$ and for the state term $\Omega_i^{state}$, which considerably increases the speed of computation. We are therefore interested in finding the maximizers of the pseudo log-likelihood function defined in Theorem 4.1. Using Theorem 4.1, we solve for the stationary points of the observation parameters $O$ and obtain the following results.

Theorem 4.2 The observation parameter maximizers $O_{j+1} = (\mu_0^{j+1}, \Sigma_0^{j+1}, \mu_1^{j+1}, \Sigma_1^{j+1})$ of the pseudo likelihood function are given explicitly by:

$$\begin{aligned} \mu_0^{j+1} &= \frac{\sum_{i=1}^{M} \langle n_1^i, b^i \rangle}{\sum_{i=1}^{M} \langle b^i, d_1^i \rangle}, & \mu_1^{j+1} &= \frac{\sum_{i=1}^{M} \langle n_2^i, b^i \rangle}{\sum_{i=1}^{M} \langle b^i, d_2^i \rangle}, \\ \Sigma_0^{j+1} &= \frac{\sum_{i=1}^{M} \langle n_3^i, b^i \rangle}{\sum_{i=1}^{M} \langle b^i, d_1^i \rangle}, & \Sigma_1^{j+1} &= \frac{\sum_{i=1}^{M} \langle n_4^i, b^i \rangle}{\sum_{i=1}^{M} \langle b^i, d_2^i \rangle}, \end{aligned} \tag{4.21}$$

where

$$\begin{aligned} n_1^i &= \left(0,\ \sum_{n \le 1} y_n^i,\ \dots,\ \sum_{n \le N_i} y_n^i\right), \\ n_2^i &= \left(\sum_{n \ge 1} y_n^i,\ \sum_{n \ge 2} y_n^i,\ \dots,\ y_{N_i}^i,\ 0\right), \\ n_3^i &= \left(0,\ \sum_{n \le 1} (y_n^i - \mu_0)(y_n^i - \mu_0)^{\top},\ \dots,\ \sum_{n \le N_i} (y_n^i - \mu_0)(y_n^i - \mu_0)^{\top}\right), \\ n_4^i &= \left(\sum_{n \ge 1} (y_n^i - \mu_1)(y_n^i - \mu_1)^{\top},\ \dots,\ (y_{N_i}^i - \mu_1)(y_{N_i}^i - \mu_1)^{\top},\ 0\right), \\ d_1^i &= (0, 1, \dots, N_i), \qquad d_2^i = (N_i, \dots, 1, 0), \qquad b^i = (b_0^i, \dots, b_{N_i}^i). \end{aligned} \tag{4.22}$$

At this point, we have not yet specified the structure of the density function $f_S(t)$ with respect to the state parameter set $S$. Using equations (4.11) to (4.22), we have obtained general formulas for estimating the parameters with the EM algorithm. Next, we give an example to illustrate how to use Theorem 4.1 to estimate the state parameters $S$ in a special case.

Parameter Estimation using Weibull Distribution

The estimation procedure can be used with different sojourn time distribution functions. For example, the hidden system state can follow a continuous time homogeneous semi-Markov chain with a phase-type distribution. In this chapter, we first consider a continuous time nonhomogeneous unobservable stochastic process with a Weibull sojourn time distribution, which is a commonly used life distribution in reliability engineering. The density function $f(\alpha, \beta)$

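is given by:

$$f_S(t) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\, t^{\alpha - 1} \exp\left(-\left(\dfrac{t}{\beta}\right)^{\alpha}\right), & t \ge 0, \\ 0, & t < 0, \end{cases} \tag{4.23}$$

where $\beta$ is the scale parameter, also called the characteristic life, and $\alpha$ is the shape parameter. The state parameter set is $S = \{\alpha, \beta\}$. Then, we have the following explicit expression:

$$\begin{aligned} \Omega_i^{state}(S \mid O_j, S_j) ={}& \sum_{n=0}^{N_i - 1} \frac{\int_{nh}^{(n+1)h} \left(\ln\alpha - \alpha\ln\beta + (\alpha - 1)\ln t - \dfrac{t^{\alpha}}{\beta^{\alpha}}\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} \\ &+ \frac{\int_{N_i h}^{\infty} \left(\ln\alpha - \alpha\ln\beta + (\alpha - 1)\ln t - \dfrac{t^{\alpha}}{\beta^{\alpha}}\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}. \end{aligned} \tag{4.24}$$

The formulas for the updated parameters under the Weibull distribution are given by the following lemma.

Lemma 4.1 The updated parameters for $M$ observation histories can be obtained using equations (4.7) and (4.8), and have the following form:

$$\beta_{j+1} = \beta_j \left(\frac{\sum_{i=1}^{M} \langle e_1^i, g^i \rangle}{\sum_{i=1}^{M} \langle e_2^i, g^i \rangle}\right)^{1/\alpha}, \tag{4.25}$$

$$\frac{\alpha_j}{\alpha} + \frac{\sum_{i=1}^{M} \langle e_3^i, g^i \rangle}{\sum_{i=1}^{M} \langle e_2^i, g^i \rangle} - \frac{\sum_{i=1}^{M} \langle e_4^i, g^i \rangle}{\sum_{i=1}^{M} \langle e_1^i, g^i \rangle} = 0. \tag{4.26}$$

The vectors $e_1^i$, $e_2^i$, $e_3^i$ and $e_4^i$ are given in the proof, where $\langle e_1^i, g^i \rangle$ and $\langle e_4^i, g^i \rangle$ are functions of the variable $\alpha$. Equation (4.26) has only one unknown variable, $\alpha$, which can be found using the Matlab fminsearch function.

Proof. The maximum likelihood estimates of the two state parameters $\alpha$ and $\beta$ can be found from: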

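$$\begin{aligned} \frac{\partial \Omega_i^{state}}{\partial \alpha} ={}& \sum_{n=0}^{N_i - 1} \frac{\int_{nh}^{(n+1)h} \left(\dfrac{1}{\alpha} - \ln\beta + \ln t - \left(\dfrac{t}{\beta}\right)^{\alpha} \ln\dfrac{t}{\beta}\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} \\ &+ \frac{\int_{N_i h}^{\infty} \left(\dfrac{1}{\alpha} - \ln\beta + \ln t - \left(\dfrac{t}{\beta}\right)^{\alpha} \ln\dfrac{t}{\beta}\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} = 0, \\ \frac{\partial \Omega_i^{state}}{\partial \beta} ={}& \sum_{n=0}^{N_i - 1} \frac{\int_{nh}^{(n+1)h} \left(-\dfrac{\alpha}{\beta} - t^{\alpha}(-\alpha)\beta^{(-\alpha - 1)}\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} \\ &+ \frac{\int_{N_i h}^{\infty} \left(-\dfrac{\alpha}{\beta} - t^{\alpha}(-\alpha)\beta^{(-\alpha - 1)}\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} = 0. \end{aligned} \tag{4.27}$$

The incomplete gamma functions are $\gamma(\alpha, x) = \int_0^{x} t^{\alpha - 1} e^{-t}\, dt$ and $\Gamma(\alpha, x) = \int_x^{\infty} t^{\alpha - 1} e^{-t}\, dt$. We define the vectors $e_1^i = (e_1^0, e_1^1, e_1^2, \dots, e_1^{N_i})$, $e_2^i = (e_2^0, e_2^1, e_2^2, \dots, e_2^{N_i})$, $e_3^i = (e_3^0, e_3^1, e_3^2, \dots, e_3^{N_i})$ and $e_4^i = (e_4^0, e_4^1, e_4^2, \dots, e_4^{N_i})$, where

$$\begin{aligned} e_1^n &= \int_{z_L}^{z_H} z^{\alpha/\alpha_j} e^{-z}\, dz = \gamma\left(\frac{\alpha}{\alpha_j} + 1, z_H\right) - \gamma\left(\frac{\alpha}{\alpha_j} + 1, z_L\right), \quad n = 0, \dots, N_i - 1, \\ e_1^{N_i} &= \Gamma\left(\frac{\alpha}{\alpha_j} + 1, z_{N_i h}\right), \\ e_2^n &= \int_{z_L}^{z_H} e^{-z}\, dz = \exp(-z_L) - \exp(-z_H), \quad n = 0, \dots, N_i - 1, \\ e_2^{N_i} &= \exp(-z_{N_i h}), \\ e_3^n &= \int_{z_L}^{z_H} \ln z\, e^{-z}\, dz = \ln z_L\, e^{-z_L} - \ln z_H\, e^{-z_H} + \left(\mathrm{Ei}(-z_H) - \mathrm{Ei}(-z_L)\right), \quad n = 0, \dots, N_i - 1, \\ e_3^{N_i} &= \ln z_{N_i h}\, e^{-z_{N_i h}} - \mathrm{Ei}(-z_{N_i h}), \\ e_4^n &= \int_{z_L}^{z_H} \ln z \; z^{\alpha/\alpha_j} e^{-z}\, dz, \quad n = 0, \dots, N_i - 1, \\ e_4^{N_i} &= \int_{z_{N_i h}}^{\infty} \ln z \; z^{\alpha/\alpha_j} e^{-z}\, dz, \end{aligned} \tag{4.28}$$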

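and $\mathrm{Ei}(x) = -\int_{-x}^{\infty} \frac{e^{-t}}{t}\, dt$ is the exponential integral. The quantities $z_H$, $z_L$ and $z_{N_i h}$ are defined by the following equations:

$$z_H = \left(\frac{(n+1)h}{\beta_j}\right)^{\alpha_j}, \qquad z_L = \left(\frac{nh}{\beta_j}\right)^{\alpha_j}, \qquad z_{N_i h} = \left(\frac{N_i h}{\beta_j}\right)^{\alpha_j}. \tag{4.29}$$

This completes the proof.

Parameter Estimation using Exponential Distribution

If we consider an exponentially distributed sojourn time, the probability density function is

$$f_S(t) = \begin{cases} \theta \exp(-\theta t), & \text{if } t \ge 0, \\ 0, & \text{if } t < 0, \end{cases} \tag{4.30}$$

where the state parameter set is $S = \{\theta\}$. The process is then described by a hidden Markov chain $X_t$ with state space $\{0, 1\}$, and we have the following closed-form expressions.

Lemma 4.2 For $M$ observation histories, the unique maximizer of the state parameter is given by:

$$\theta_{j+1} = \frac{\sum_{i=1}^{M} \langle e_1^i, g^i \rangle}{\sum_{i=1}^{M} \langle e_2^i, g^i \rangle}, \tag{4.31}$$

where the vectors $e_1^i$ and $e_2^i$ are defined in the proof.

Proof. We can rewrite equation (4.13) using the exponential sojourn time distribution:

$$\Omega_i^{state}(S \mid O_j, S_j) = \sum_{n=0}^{N_i - 1} \frac{\int_{nh}^{(n+1)h} \ln\left(\theta \exp(-\theta t)\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} + \frac{\int_{N_i h}^{\infty} \ln\left(\theta \exp(-\theta t)\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt},$$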

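$$\begin{aligned} &= \sum_{n=0}^{N_i - 1} \frac{\int_{nh}^{(n+1)h} \left(\ln(\theta) - \theta t\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt} + \frac{\int_{N_i h}^{\infty} \left(\ln(\theta) - \theta t\right) g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}{\int_0^{\infty} g_{O_j}(y^i \mid t) f_{S_j}(t)\, dt}, \\ &= -\alpha_i \theta + \ln(\theta), \end{aligned} \tag{4.32}$$

where the vectors $e_1^i = (e_1^0, e_1^1, \dots, e_1^{N_i})$ and $e_2^i = (e_2^0, e_2^1, \dots, e_2^{N_i})$ in equation (4.31) are defined as follows:

$$\begin{aligned} e_1^n &= \int_{nh}^{(n+1)h} e^{-\theta_j u}\, du = \frac{e^{-\theta_j nh} - e^{-\theta_j (n+1)h}}{\theta_j}, \quad n = 0, \dots, N_i - 1, \\ e_1^{N_i} &= \int_{N_i h}^{\infty} e^{-\theta_j u}\, du = \frac{e^{-\theta_j N_i h}}{\theta_j}, \\ e_2^n &= \int_{nh}^{(n+1)h} u\, e^{-\theta_j u}\, du = \frac{e_1^n - (n+1)h\, e^{-\theta_j (n+1)h} + nh\, e^{-\theta_j nh}}{\theta_j}, \quad n = 0, \dots, N_i - 1, \\ e_2^{N_i} &= \int_{N_i h}^{\infty} u\, e^{-\theta_j u}\, du = \frac{N_i h\, e^{-\theta_j N_i h} + e_1^{N_i}}{\theta_j}, \end{aligned} \tag{4.33}$$

with $\alpha_i = \dfrac{\langle e_2^i, g^i \rangle}{\langle e_1^i, g^i \rangle}$. The stationary point of the state parameter $\theta$ is obtained by solving $\dfrac{\partial \Omega_i^{state}}{\partial \theta} = 0$. This completes the proof.

Parameter Estimation using Phase-Type Distribution

In this subsection, we take the sojourn time distribution to be an Erlang phase-type distribution. A phase-type distribution is a probability distribution constructed by a convolution of exponential distributions. Most of the original applications were in the area of queuing theory. A phase-type density function is given by: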

$$f_S(t) = \begin{cases} \dfrac{r^m t^{m-1} e^{-rt}}{(m-1)!}, & \text{if } t \ge 0, \\ 0, & \text{if } t < 0, \end{cases} \tag{4.34}$$

where the integer $m > 0$ is the shape parameter and $r > 0$ is the rate parameter. The state parameters are $S = \{m, r\}$. Under the Erlang distribution assumption, the hidden stochastic process becomes $X_t = j$, $j \in \{1, \dots, m, m+1\}$, $t > 0$. While the system is in the healthy state, it goes through the phases $j \in \{1, \dots, m\}$. The system is in phase $m+1$ if it is in the unhealthy state. This process is equivalent to a model in which the sojourn time is represented by a series of $m$ independent exponential phases that run one after another until the end of the sojourn time, after which the process reaches state $m+1$. The state equation (4.18) is given by:

$$\Omega_i^{state}(S \mid O_j, S_j) = \sum_{n=0}^{N_i - 1} a_n^i \int_{nh}^{(n+1)h} \ln\left(\frac{r^m t^{m-1} e^{-rt}}{(m-1)!}\right) f_{S_j}(t)\, dt + a_{N_i}^i \int_{N_i h}^{\infty} \ln\left(\frac{r^m t^{m-1} e^{-rt}}{(m-1)!}\right) f_{S_j}(t)\, dt. \tag{4.35}$$

Lemma 4.3 For $M$ observation histories, the estimate of the rate parameter is:

$$r_{j+1} = m\, \frac{\sum_{i=1}^{M} \gamma_i}{\sum_{i=1}^{M} \alpha_i}, \tag{4.36}$$

and the equation for obtaining the estimate of the shape parameter $m_{j+1}$ is:

$$\sum_{i=1}^{M} \gamma_i \ln\left(m\, \frac{\sum_{i=1}^{M} \gamma_i}{\sum_{i=1}^{M} \alpha_i}\right) + \sum_{i=1}^{M} c_i - \sum_{i=1}^{M} \gamma_i\, \frac{\partial\left(\ln(\Gamma(m))\right)}{\partial m} = 0. \tag{4.37}$$

The quantities $\alpha_i$, $\gamma_i$ and $c_i$ depend only on the fixed estimates $O_j, S_j$.

Proof. Let

$$\ln\left(\frac{r^m t^{m-1} e^{-rt}}{(m-1)!}\right) = m\ln(r) + (m-1)\ln(t) - rt - \ln\left((m-1)!\right).$$

Equation (4.13) can be rewritten as follows.

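$$\begin{aligned} \Omega_i^{state}(S \mid O_j, S_j) ={}& \sum_{n=0}^{N_i - 1} a_n^i \int_{nh}^{(n+1)h} \left(m\ln r + (m-1)\ln t - rt - \ln\left((m-1)!\right)\right) \frac{r_j^{m_j} t^{m_j - 1} e^{-r_j t}}{(m_j - 1)!}\, dt \\ &+ a_{N_i}^i \int_{N_i h}^{\infty} \left(m\ln r + (m-1)\ln t - rt - \ln\left((m-1)!\right)\right) \frac{r_j^{m_j} t^{m_j - 1} e^{-r_j t}}{(m_j - 1)!}\, dt. \end{aligned} \tag{4.38}$$

Taking partial derivatives with respect to $m$ and $r$, after some algebraic manipulation, we obtain the updating equations for $m$ and $r$ as follows:

$$\frac{\partial \Omega^{state}}{\partial m} = \sum_{i=1}^{M} \gamma_i \ln r - \sum_{i=1}^{M} \langle e_2^i, g^i \rangle\, \frac{\partial\left(\ln(\Gamma(m))\right)}{\partial m} + \sum_{i=1}^{M} \langle e_3^i, g^i \rangle = 0,$$

where the vectors are $e_1^i = (e_1^0, e_1^1, \dots, e_1^{N_i})$, $e_2^i = (e_2^0, e_2^1, \dots, e_2^{N_i})$ and $e_3^i = (e_3^0, e_3^1, \dots, e_3^{N_i})$. We define

$$\begin{aligned} \alpha_i &= \langle e_1^i, g^i \rangle, \qquad \gamma_i = \langle e_2^i, g^i \rangle, \qquad c_i = \langle e_3^i, g^i \rangle, \\ e_1^n &= \int_{nh}^{(n+1)h} t^{m_j} e^{-r_j t}\, dt, \quad n = 0, \dots, N_i - 1, & e_1^{N_i} &= \int_{N_i h}^{\infty} t^{m_j} e^{-r_j t}\, dt, \\ e_2^n &= \int_{nh}^{(n+1)h} t^{m_j - 1} e^{-r_j t}\, dt, \quad n = 0, \dots, N_i - 1, & e_2^{N_i} &= \int_{N_i h}^{\infty} t^{m_j - 1} e^{-r_j t}\, dt, \\ e_3^n &= \int_{nh}^{(n+1)h} \ln t \; t^{m_j - 1} e^{-r_j t}\, dt, \quad n = 0, \dots, N_i - 1, & e_3^{N_i} &= \int_{N_i h}^{\infty} \ln t \; t^{m_j - 1} e^{-r_j t}\, dt. \end{aligned} \tag{4.39}$$

Starting from $\int_{nh}^{(n+1)h} \ln t\; e^{-r_j t}\, dt$, the quantities $e_3^n$ and $e_3^{N_i}$ can be obtained using integration by parts:

$$\int_{nh}^{(n+1)h} \ln t\; e^{-r_j t}\, dt = \frac{1}{r_j} \int_{nh\, r_j}^{(n+1)h\, r_j} (\ln z - \ln r_j)\, e^{-z}\, dz$$

$$= \frac{-\ln\left((n+1)h\, r_j\right) e^{-(n+1)h r_j} + \ln\left(nh\, r_j\right) e^{-nh r_j}}{r_j}$$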

$$\quad + \frac{1}{r_j} \Bigl[ \mathrm{Ei}(nh\, r_j) - \mathrm{Ei}\bigl((n+1)h\, r_j\bigr) \Bigr] - \frac{\ln r_j}{r_j} \int_{nh r_j}^{(n+1)h r_j} e^{-z}\,dz, \tag{4.40}$$

where $z = r_j t$ and $\mathrm{Ei}(u) = \int_u^{\infty} \frac{e^{-t}}{t}\,dt$ is the exponential integral. Denoting $\varphi(m) = \partial \ln \Gamma(m) / \partial m$, $m \ge 1$, we obtain $\varphi(n+1) = \varphi(1) + \sum_{k=1}^{n} \frac{1}{k}$, $n = 0, 1, 2, \ldots$. The shape parameter $m$ is the only unknown variable in Lemma 4.3 and can be found using the Matlab Optimization Toolbox. This completes the proof.

A Practical Illustration of the Parameter Estimation Procedure

Figure 4.1 RMS values of one tooth deterioration process

In this section, we consider a parameter estimation problem for a partially observable gearbox system. The system's condition can be categorized into two states: a healthy state 0 and a warning state 1. Figure 4.1 shows the RMS values of one tooth during a deterioration process. Seventy gear-tooth deterioration histories are obtained from the model residuals. We choose feasible initial values and run the EM algorithm with the convergence criterion $\lVert (O_{j+1}, S_{j+1}) - (O_j, S_j) \rVert < 10^{-7}$. In the tables, $T_{\text{sec}}$ denotes the computational time in seconds. All computations were coded in Matlab (2009) on an Intel Core i5, 2.6 GHz, with 8 GB RAM. The parameter estimation results using the Weibull distribution are shown in Table 4.1. The results using the exponential distribution are shown in Table 4.2. The results using Erlang

distribution are shown in Table 4.3. Because of the larger number of parameters, computations in an HSMM using the Weibull distribution take longer than in an HMM. From Tables 4.1 to 4.3, it can be seen that the mean time in a healthy state using Weibull distribution is hours. The mean time in a healthy state using exponential distribution is hours. The mean time in a healthy state using phase-type distribution is hours. The Weibull sojourn time assumption yields the longest mean time in a healthy state among the three sojourn time distributions.

Table 4.1 Iterations of the EM algorithm using Weibull sojourn time distribution
Columns: Initial value, Update 1, Update 2, Update 7. Rows: $\beta$, $\alpha$, $\mu_0$, $\mu_1$, $\Sigma_0$*, $\Sigma_1$*, $\Omega$, $T_{\text{sec}}$.

Table 4.2 Iterations of the EM algorithm using exponential sojourn time distribution
Columns: Initial value, Update 1, Update 2, Update 8. Rows: $\theta$, $\mu_0$,

$\mu_1$, $\Sigma_0$*, $\Sigma_1$*, $\Omega$, $T_{\text{sec}}$.

Table 4.3 Iterations of the EM algorithm using phase-type sojourn time distribution
Columns: Initial value, Update 1, Update 2, Update 14. Rows: $r$, $m$, $\mu_0$, $\mu_1$, $\Sigma_0$*, $\Sigma_1$*, $\Omega$, $T_{\text{sec}}$.

Optimal Bayesian Control using Weibull distribution

Recently, Kim et al. [53] and Jiang et al. [54] considered an effective control method that monitors the posterior probability that the system is in the out-of-control state in an HMM. In realistic cases, the sojourn time distribution can be non-exponential. In this section, we develop a Bayesian control model assuming a hidden semi-Markov process, which is suitable for modeling systems subject to vibration monitoring. We consider a general sojourn time distribution, so that the transition probabilities of the related HSMM depend on both the current state and the time spent in that state. The objective is to determine the optimal value of the control limit $\Pi$ that minimizes the long-run expected average cost per unit time.

We now summarize the procedure for applying the multivariate Bayesian control chart to maintenance optimization over a long-run horizon. In step 1, we present a formula for the posterior probability based on the given sojourn time distribution. In step 2, we define the SMDP state space. In step 3, we formulate the semi-Markov decision process in a cost-optimal stopping framework. Examples using the Weibull distribution illustrate the whole procedure. More examples can be developed by considering other sojourn time distributions.

In the first step, we define the continuous-time hidden state process $\{X_t : t \in \mathbb{R}^+\}$ with the state space $S = \{(n, 0), (n, 1) : n = 1, \ldots, N\}$. State $(n, 0)$ represents the system in a healthy condition at age $nh$. State $(n, 1)$ represents the system in an unhealthy condition at age $nh$. Given the sojourn time distribution $f_S(t)$, the reliability function $R_S(t)$ is the probability that the time to failure is greater than a specific time $t$. The posterior probability that the system is in an unhealthy state at age $(n+1)h$ is denoted as:

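$$\begin{aligned} \Pi_{n+1} &= P(X_{n+1} = (n+1, 1) \mid Y_1, \ldots, Y_n, Y_{n+1}, \Pi_n) \\ &= \frac{P(X_{n+1} = (n+1, 1),\, Y_{n+1} = y \mid Y_1, \ldots, Y_n, \Pi_n)}{P(Y_{n+1} = y \mid Y_1, \ldots, Y_n, \Pi_n)} \\ &= \frac{P(X_{n+1} = (n+1, 1),\, Y_{n+1} = y \mid Y_1, \ldots, Y_n, \Pi_n)}{P(X_{n+1} = (n+1, 1),\, Y_{n+1} = y \mid Y_1, \ldots, Y_n, \Pi_n) + P(X_{n+1} = (n+1, 0),\, Y_{n+1} = y \mid Y_1, \ldots, Y_n, \Pi_n)} \\ &= \frac{f(y \mid \mu_1, \Sigma_1)\, D_1}{f(y \mid \mu_1, \Sigma_1)\, D_1 + f(y \mid \mu_0, \Sigma_0)\, D_0}, \end{aligned} \tag{4.41}$$

with

$$\begin{aligned} f(y \mid \mu_0, \Sigma_0) &= P(Y_{n+1} = y \mid X_{n+1} = (n+1, 0), Y_1, \ldots, Y_n, \Pi_n), \\ f(y \mid \mu_1, \Sigma_1) &= P(Y_{n+1} = y \mid X_{n+1} = (n+1, 1), Y_1, \ldots, Y_n, \Pi_n), \end{aligned} \tag{4.42}$$

and

$$\begin{aligned} D_1 &= P(X_{n+1} = (n+1, 1) \mid Y_1, \ldots, Y_n, \Pi_n) \\ &= P_{(n,0),(n+1,1)} (1 - \Pi_n) + P_{(n,1),(n+1,1)} \Pi_n \\ &= P(t < (n+1)h \mid t > nh)\,(1 - \Pi_n) + \Pi_n \\ &= \frac{P(nh < t < (n+1)h)}{P(t > nh)} (1 - \Pi_n) + \Pi_n \\ &= \frac{R_S(nh) - R_S((n+1)h)}{R_S(nh)} (1 - \Pi_n) + \Pi_n, \\ D_0 &= P(X_{n+1} = (n+1, 0) \mid Y_1, \ldots, Y_n, \Pi_n) \\ &= P_{(n,0),(n+1,0)} (1 - \Pi_n) + P_{(n,1),(n+1,0)} \Pi_n \\ &= P(t > (n+1)h \mid t > nh)\,(1 - \Pi_n) + P(t > (n+1)h \mid t < nh)\, \Pi_n \\ &= \frac{P(t > (n+1)h)}{P(t > nh)} (1 - \Pi_n) \\ &= \frac{R_S((n+1)h)}{R_S(nh)} (1 - \Pi_n), \end{aligned} \tag{4.43}$$

where $\Pi_0 = P(X_0 = (0, 1)) = 0$. In $D_1$, the coefficient of $\Pi_n$ equals 1 because a system that is already unhealthy remains unhealthy until it is stopped. The optimal policy for the specific SMDP considered here can be found by iteratively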

solving a system of linear equations ([55], [56]). Generally, the computation of the long-run average cost requires discretization of $[0, 1]$, the state space of the posterior probability process. Two fundamental assumptions must still be satisfied in an SMDP. First, the state space and the action set $A$ must be finite. Second, the Markov property must hold. In order to have an SMDP with a finite state space, we define the set of SMDP states as $S = \{0\} \cup \{(n, i) : n = 1, \ldots, N\} \cup \{PM\}$, where $i = 1, \ldots, I$ represents the coded posterior probability value. We have found that $I \ge 25$ subintervals provide a sufficiently high degree of precision, so that the cardinality of the state space does not have to be chosen very large. State 0 represents the new condition. State $PM$ represents preventive maintenance. We consider a finite age limit $Nh$. When the system reaches the age $Nh$, preventive maintenance is a required action.

When applying the Bayesian control technique, the posterior probability $\Pi_n \in [0, 1]$ is used to monitor the deterioration process. The posterior probability is updated after each observation. When $\Pi \le \Pi_n$, the system is stopped and a full inspection is performed with inspection cost $C_I$. The system is in the healthy state 0 with probability $1 - \Pi_n$ and in the warning state 1 with probability $\Pi_n$. If the system is in the unhealthy state 1, preventive maintenance is conducted at cost rate $C_{PM}$ and takes $T_{PM}$ time units. The associated lost production cost is incurred at a rate of $C_{LP}$. After inspection or preventive maintenance, the system is renewed and a new cycle begins. The optimal average cost $g(\Pi^*)$ can be obtained by defining and solving a semi-Markov decision problem.

The posterior probability $\Pi_n$ is plotted on the Bayesian control chart. If the current value $\Pi_n$ lies in the interval $\left[\frac{i-1}{I}, \frac{i}{I}\right)$, we assume that $\Pi_n = \frac{i - 0.5}{I}$, which is denoted as $\Pi_n = i$. If $\Pi > \Pi_n$ and $n \in \{1, \ldots, N-1\}$,

we continue. The long-run expected average cost $g(\Pi)$ can then be obtained by solving the following system of linear equations:

$$\begin{aligned} v_{n,i} &= c_{n,i} - g(\Pi)\, \tau_{n,i} + \sum_{j \in S} P_{(n,i),(n+1,j)}\, v_{n+1,j}, \\ v_{PM} &= c_{PM} - g(\Pi)\, \tau_{PM}, \\ v_0 &= 0, \end{aligned} \tag{4.44}$$

where

$c_{n,i}$ = the expected cost incurred until the next decision epoch given the present state $(n, i) \in S$, $n \in \{1, \ldots, N-1\}$;
$\tau_{n,i}$ = the expected time until the next decision epoch given the present state $(n, i) \in S$, $n \in \{1, \ldots, N\}$;
$P_{(n,i),(n+1,j)}$ = the probability that at the next decision epoch the system will be in state $(n+1, j)$ given that the present state is $(n, i) \in S$;

$$g(\Pi) = \frac{E(CC)}{E(CL)}. \tag{4.45}$$

The sojourn times, costs and transition probabilities can be obtained as follows. The expected costs $c_{n,i}$ are given by:

$$\begin{aligned} c_{n,i} &= E\!\left[ C_{OM} \int_0^h I_{\{X_{nh+t} = 1\}}\,dt \;\Big|\; nh, \Pi_n \right] + C_s \\ &= C_{OM} \int_0^h P(X_{nh+t} = 1 \mid X_{nh} = 0)\,dt\,(1 - \Pi_n) + C_{OM} \int_0^h P(X_{nh+t} = 1 \mid X_{nh} = 1)\,dt\;\Pi_n + C_s \\ &= C_{OM} (1 - \Pi_n) \int_0^h \left(1 - \frac{R_S(nh+t)}{R_S(nh)}\right) dt + C_{OM}\, \Pi_n\, h + C_s, \quad \text{if } \Pi_n < \Pi \text{ and } n \in \{1, \ldots, N-1\}, \end{aligned}$$

$$c_{n,i} = C_I T_I + C_{LP} T_I, \quad \text{if } \Pi_n > \Pi \text{ and } n \in \{1, \ldots, N-1\},$$

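$$c_{N,i} = C_I T_I + C_{LP} T_I, \qquad c_{PM} = C_{PM} T_{PM} + C_{LP} T_{PM}. \tag{4.46}$$

The sojourn times $\tau_{n,i}$ are given by:

$$\begin{aligned} \tau_{n,i} &= h, \quad \text{if } \Pi_n < \Pi \text{ and } n \in \{1, \ldots, N-1\}, \\ \tau_{n,i} &= T_I, \quad \text{if } \Pi_n > \Pi \text{ and } n \in \{1, \ldots, N-1\}, \\ \tau_{N,i} &= T_I, \\ \tau_{PM} &= T_{PM}, \quad \text{if the system is in preventive maintenance}, \\ \tau_0 &= h, \end{aligned} \tag{4.47}$$

where $T_I$ is the inspection time and $T_{PM}$ is the preventive maintenance time. The transition probabilities $P_{(n,i),(n+1,j)}$ follow from the likelihood ratio

$$\begin{aligned} \frac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)} &= \frac{(2\pi)^{-Nd/2} |\Sigma_0|^{-1/2} \exp\!\left(-\tfrac{1}{2} (y - \mu_0)^T \Sigma_0^{-1} (y - \mu_0)\right)}{(2\pi)^{-Nd/2} |\Sigma_1|^{-1/2} \exp\!\left(-\tfrac{1}{2} (y - \mu_1)^T \Sigma_1^{-1} (y - \mu_1)\right)} \\ &= \frac{|\Sigma_0|^{-1/2}}{|\Sigma_1|^{-1/2}} \exp\!\left(\tfrac{1}{2} \bigl[(y - B)^T A (y - B) + E\bigr]\right) \\ &= \frac{|\Sigma_0|^{-1/2}}{|\Sigma_1|^{-1/2}} \exp\!\left(\tfrac{1}{2} \bigl[V(y) + E\bigr]\right), \end{aligned} \tag{4.48}$$

where $A$, $B$, $E$ and $V(y)$ are given by:

$$\begin{aligned} A &= \Sigma_1^{-1} - \Sigma_0^{-1}, \\ B &= (\Sigma_1^{-1} - \Sigma_0^{-1})^{-1} (\Sigma_1^{-1} \mu_1 - \Sigma_0^{-1} \mu_0), \\ E &= (\mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0) - B^T (\Sigma_1^{-1} \mu_1 - \Sigma_0^{-1} \mu_0), \\ V(y) &= (y - B)^T A (y - B). \end{aligned} \tag{4.49}$$

Here $y - B$ follows the normal distribution with mean $\mu_0 - B$ in the healthy state 0, and with mean $\mu_1 - B$ in the unhealthy state 1. Based on the equations above, the transition probabilities can be obtained as follows: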

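$$P_{(n,i),(n+1,j)} = F(j^+) - F(j^-), \quad \text{for } \Pi_n < \Pi, \; n \in \{0, \ldots, N-1\}, \tag{4.50}$$

where $j^- = \frac{j-1}{I}$ and $j^+ = \frac{j}{I}$. The distribution function $F(x)$ of $\Pi_{n+1}$ has the following form:

$$\begin{aligned} F(x) &= \Pr(\Pi_{n+1} \le x \mid Y_1, \ldots, Y_n, \Pi_n) \\ &= \Pr\!\left( \frac{D_1}{D_1 + \frac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)} D_0} \le x \;\Big|\; Y_1, \ldots, Y_n, \Pi_n \right) \\ &= \Pr\!\left( \Bigl[1 + \tfrac{D_0}{D_1} \tfrac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)}\Bigr]^{-1} \le x \;\Big|\; Y_1, \ldots, Y_n, \Pi_n, X_{(n+1)h} = 0 \right) P(X_{(n+1)h} = 0 \mid Y_1, \ldots, Y_n, \Pi_n) \\ &\quad + \Pr\!\left( \Bigl[1 + \tfrac{D_0}{D_1} \tfrac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)}\Bigr]^{-1} \le x \;\Big|\; Y_1, \ldots, Y_n, \Pi_n, X_{(n+1)h} = 1 \right) P(X_{(n+1)h} = 1 \mid Y_1, \ldots, Y_n, \Pi_n) \\ &= \Pr\bigl(V(y) > k(x) \mid X_{(n+1)h} = 0\bigr) \frac{D_0}{D_0 + D_1} + \Pr\bigl(V(y) > k(x) \mid X_{(n+1)h} = 1\bigr) \frac{D_1}{D_0 + D_1}, \end{aligned} \tag{4.51}$$

where

$$k(x) = 2 \ln\!\left( \frac{|\Sigma_1|^{-1/2}}{|\Sigma_0|^{-1/2}} \Bigl(\frac{1}{x} - 1\Bigr) \frac{D_1}{D_0} \right) - E.$$

$F(x)$ can be computed using the equations of Provost and Rudiuk (1996) (Appendix B), who provide a closed-form expression for the cumulative distribution function of $V(y)$. The remaining transition probabilities are as follows:

$$\begin{aligned} P_{(n,i),0} &= 1 - \Pi_n, \quad \Pi_n > \Pi \text{ and } n \in \{1, \ldots, N-1\}, \\ P_{(n,i),PM} &= \Pi_n, \quad \Pi_n > \Pi \text{ and } n \in \{1, \ldots, N-1\}, \\ P_{(N,i),0} &= 1 - \Pi_N, \\ P_{(N,i),PM} &= \Pi_N, \\ P_{PM,0} &= 1. \end{aligned} \tag{4.52}$$

4.5 A Case Study using Multivariate Bayesian Control Chart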

The vibration data studied here were obtained from the Mechanical Diagnostic Test Bed (MDTB) [40], built by the Condition-Based Maintenance Department of the Applied Research Laboratory at Pennsylvania State University. This study only considers the monitoring of the gearbox health condition under varying load before deterioration occurs, i.e. it considers files 194 to 295 of the data collection period.

Figure 4.2 VAR model residuals for test files

Figure 4.3 shows an HSMM fault prediction scheme for partially observable failing systems subject to vibration monitoring. Once the fitted model residuals are obtained, statistical indicators such as the root mean square (RMS), kurtosis and crest factor can be selected for further fault diagnosis. We investigate the RMS values of the VAR model residuals. Figure 4.4 shows the RMS values of one deterioration history.
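The indicators named above have simple closed forms, and a minimal sketch may make them concrete. The function names below are illustrative and not taken from the thesis, and the input is assumed to be one window of residuals from a single channel.

```python
import math

def rms(x):
    """Root mean square of a residual window."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def kurtosis(x):
    """Fourth central moment divided by the squared variance (non-excess form)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2)

def crest_factor(x):
    """Peak absolute amplitude divided by the RMS value."""
    return max(abs(v) for v in x) / rms(x)

# Hypothetical residual window, used only to exercise the functions.
window = [1.0, -1.0, 1.0, -1.0]
indicators = (rms(window), kurtosis(window), crest_factor(window))
```

Computing one such value per data file gives a deterioration history of the kind plotted in Figure 4.4.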

Figure 4.3 The cost-optimal early fault detection scheme

Figure 4.4 RMS residuals for test files

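We compute the optimal control limit $\Pi^*$ using both the Weibull distribution and the exponential distribution by setting the times $T_{PM} = 10$ hours and $T_I = 1$ hour, and the costs $C_{OM} = 10$, $C_I = 50$, $C_{PM} = 50$, $C_{LP} = 20$ and $C_s = 0$. For this vibration data set, the eigenvalues in both the healthy state and the unhealthy state obtained with Provost's method are all negative [50] (see Appendix B). Given the Weibull sojourn time distribution, we have the following results:

$$\begin{aligned} \int_0^h \frac{R_S(nh) - R_S(nh+t)}{R_S(nh)}\,dt &= \int_0^h \frac{\exp\!\left(-\left(\frac{nh}{\beta}\right)^{\alpha}\right) - \exp\!\left(-\left(\frac{nh+t}{\beta}\right)^{\alpha}\right)}{\exp\!\left(-\left(\frac{nh}{\beta}\right)^{\alpha}\right)}\,dt \\ &= \int_0^h \left[ 1 - \frac{\exp\!\left(-\left(\frac{nh+t}{\beta}\right)^{\alpha}\right)}{\exp\!\left(-\left(\frac{nh}{\beta}\right)^{\alpha}\right)} \right] dt, \\ \int_0^h \frac{R_S(nh+t)}{R_S(nh)}\,dt &= \int_0^h \frac{\exp\!\left(-\left(\frac{nh+t}{\beta}\right)^{\alpha}\right)}{\exp\!\left(-\left(\frac{nh}{\beta}\right)^{\alpha}\right)}\,dt \\ &= \frac{\beta}{\alpha} \exp\!\left(\left(\frac{nh}{\beta}\right)^{\alpha}\right) \int_{Z_l}^{Z_h} \exp(-z)\, z^{\frac{1}{\alpha} - 1}\,dz, \end{aligned} \tag{4.53}$$

where $z = \left(\frac{nh+t}{\beta}\right)^{\alpha}$, $Z_h = \left(\frac{nh+h}{\beta}\right)^{\alpha}$ and $Z_l = \left(\frac{nh}{\beta}\right)^{\alpha}$. The control limits and optimal costs in comparison with an HMM are given in Table 4.4.

Table 4.4 Average cost and optimal control limits
Columns: $\Pi$; Average cost using exponential distribution; Average cost using Weibull ($N = 40$) distribution. An asterisk marks the optimal entry in each cost column.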
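As a numerical companion to the posterior update (4.41) and the predictive probabilities (4.43) under a Weibull sojourn time, the following is a minimal sketch. It assumes the reliability function $R_S(t) = \exp(-(t/\beta)^{\alpha})$ with $\alpha$ as shape and $\beta$ as scale, which is how equation (4.53) is read here. The function names are illustrative, and the two likelihood values `f0` and `f1` are assumed to be supplied by the caller from the fitted observation model.

```python
import math

def weibull_reliability(t, shape, scale):
    """R_S(t) = exp(-(t / scale) ** shape); assumed parametrization."""
    return math.exp(-((t / scale) ** shape))

def predictive_probs(pi_n, n, h, shape, scale):
    """Return (D0, D1) of (4.43): healthy and unhealthy probabilities at age (n+1)h."""
    r_now = weibull_reliability(n * h, shape, scale)
    r_next = weibull_reliability((n + 1) * h, shape, scale)
    d0 = r_next / r_now * (1.0 - pi_n)
    d1 = (r_now - r_next) / r_now * (1.0 - pi_n) + pi_n
    return d0, d1

def posterior_update(pi_n, n, h, shape, scale, f0, f1):
    """Posterior probability of the unhealthy state after observing y, as in (4.41).

    f0 and f1 are the observation densities evaluated at y under the healthy
    and unhealthy states respectively.
    """
    d0, d1 = predictive_probs(pi_n, n, h, shape, scale)
    return f1 * d1 / (f1 * d1 + f0 * d0)

# Hypothetical values for illustration only.
pi_next = posterior_update(pi_n=0.1, n=5, h=1.0, shape=2.0, scale=30.0, f0=0.2, f1=0.6)
```

A useful sanity check is that $D_0 + D_1 = 1$ for any $\Pi_n$, so an uninformative observation ($f_0 = f_1$) leaves the posterior equal to $D_1$.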

Figure 4.5 Posterior probabilities and the control limit using Weibull distribution

It can be seen that the optimal limit equals 0.3 with $g(\Pi^*)$ = using the exponential distribution. The optimal limit equals 0.5 with $g(\Pi^*)$ = using the Weibull distribution (Figure 4.5). Once the posterior probability that the system is in the unhealthy state exceeds $\Pi^*$, a full system inspection is initiated. The expected average costs obtained using the Weibull sojourn time distribution are lower than the costs obtained using the exponential distribution. The results also indicate that inspection is triggered earlier in the HMM than in the HSMM. For a typical suspension history in Figure 4.4, full system inspection occurs at the 36th sampling epoch (corresponding to data file 283) using the exponential sojourn time distribution, compared with the 37th sampling epoch (corresponding to data file 284) using the Weibull sojourn time distribution.

4.6 Bayesian Control Chart using Erlang Sojourn Time Distribution

When modeling process sojourn times, the exponential distribution is often inappropriate

because the standard deviation is as large as the mean. For the Erlang distribution, the standard deviation relative to the mean decreases as the shape parameter $m$ increases, so that sojourn times with a small standard deviation can often be approximated by an Erlang random variable. Next, we develop the multivariate Bayesian control procedure considering the Erlang sojourn time distribution in a 2-state HSMM framework.

Previously, we defined the posterior probability that the system is in the unhealthy state in equation (4.41). Note that the Erlang distribution in the healthy state represents the sojourn time as a sum of $m$ independent exponential phases, where $m$ is a positive integer. Each phase has a common exponential distribution with mean $\frac{1}{r}$. The Erlang variable can be thought of as the length of time required to go through a sequence of $m$ exponentially distributed phases or steps.

To derive the posterior probability that the system is in the unhealthy state at time $(n+1)h$, let $\{X_t, t \ge 0\}$ be a Markov process with state space $\{1, \ldots, m, m+1\}$, where the states $\{1, \ldots, m\}$ represent the healthy state and $\{m+1\}$ represents the unhealthy state 1. The posterior probability that the system is in the unhealthy state becomes the probability that the system is in phase $m+1$, denoted $\Pi_{n+1,m+1}$. We also define $\Pi_{n+1,j}$, $j \in \{1, \ldots, m\}$, as the posterior probability that the system is in phase $j$ at time $(n+1)h$, given all observations up to that sampling epoch. Denoting $\Pi_n = (\Pi_{n,1}, \ldots, \Pi_{n,m+1})$, the vector $\Pi_n$ becomes a multidimensional statistic, where the component $\Pi_{n,m+1}$ is used for optimal decision making. We calculate the posterior probabilities as follows:

$$\Pi_{n+1,m+1} = P(X_{n+1} = m+1 \mid Y_1, \ldots, Y_n, Y_{n+1} = y, \Pi_n),$$

$$\Pi_{n+1,j} = P(X_{n+1} = j \mid Y_1, \ldots, Y_n, Y_{n+1} = y, \Pi_n), \quad j \in \{1, \ldots, m\},$$

where

$$\begin{aligned} \Pi_{n+1,m+1} &= \frac{f(y \mid \mu_1, \Sigma_1)\, P(X_{n+1} = m+1 \mid Y_1, \ldots, Y_n, \Pi_n)}{f(y \mid \mu_0, \Sigma_0) \bigl(1 - P(X_{n+1} = m+1 \mid Y_1, \ldots, Y_n, \Pi_n)\bigr) + f(y \mid \mu_1, \Sigma_1)\, P(X_{n+1} = m+1 \mid Y_1, \ldots, Y_n, \Pi_n)} \\ &= \frac{f(y \mid \mu_1, \Sigma_1) \sum_{1 \le k \le m+1} P(X_{n+1} = m+1 \mid X_n = k)\, \Pi_{n,k}}{f(y \mid \mu_0, \Sigma_0) \Bigl(1 - \sum_{1 \le k \le m+1} P(X_{n+1} = m+1 \mid X_n = k)\, \Pi_{n,k}\Bigr) + f(y \mid \mu_1, \Sigma_1) \sum_{1 \le k \le m+1} P(X_{n+1} = m+1 \mid X_n = k)\, \Pi_{n,k}} \\ &= \frac{f(y \mid \mu_1, \Sigma_1)\, D_{m+1}}{f(y \mid \mu_0, \Sigma_0) \sum_{l=1}^{m} D_l + f(y \mid \mu_1, \Sigma_1)\, D_{m+1}}, \end{aligned} \tag{4.54}$$

$$\begin{aligned} \Pi_{n+1,j} &= \frac{f(y \mid \mu_0, \Sigma_0)\, P(X_{n+1} = j \mid Y_1, \ldots, Y_n, \Pi_n)}{f(y \mid \mu_0, \Sigma_0) \bigl(1 - P(X_{n+1} = m+1 \mid Y_1, \ldots, Y_n, \Pi_n)\bigr) + f(y \mid \mu_1, \Sigma_1)\, P(X_{n+1} = m+1 \mid Y_1, \ldots, Y_n, \Pi_n)} \\ &= \frac{f(y \mid \mu_0, \Sigma_0) \sum_{1 \le k \le j} P(X_{n+1} = j \mid X_n = k)\, \Pi_{n,k}}{f(y \mid \mu_0, \Sigma_0) \Bigl(1 - \sum_{1 \le k \le m+1} P(X_{n+1} = m+1 \mid X_n = k)\, \Pi_{n,k}\Bigr) + f(y \mid \mu_1, \Sigma_1) \sum_{1 \le k \le m+1} P(X_{n+1} = m+1 \mid X_n = k)\, \Pi_{n,k}} \\ &= \frac{f(y \mid \mu_0, \Sigma_0)\, D_j}{f(y \mid \mu_0, \Sigma_0) \sum_{l=1}^{m} D_l + f(y \mid \mu_1, \Sigma_1)\, D_{m+1}}, \quad j \in \{1, \ldots, m\}, \end{aligned} \tag{4.55}$$

where

$$\begin{aligned} D_j &= \sum_{1 \le k \le j} P(X_{n+1} = j \mid X_n = k)\, \Pi_{n,k}, \quad j \in \{1, \ldots, m\}, \\ D_{m+1} &= \sum_{1 \le k \le m+1} P(X_{n+1} = m+1 \mid X_n = k)\, \Pi_{n,k}, \end{aligned} \tag{4.56}$$

with $\sum_{1 \le k \le m+1} \Pi_{n,k} = 1$, $n \in \{0, 1, 2, \ldots\}$. The cumulative distribution function of Erlang $(m, r)$ is given by:

$$F_S(t) = \begin{cases} 1 - \sum_{j=0}^{m-1} \dfrac{(rt)^j e^{-rt}}{j!}, & t \ge 0, \\ 0, & t < 0. \end{cases} \tag{4.57}$$

The probabilities in equations (4.54) - (4.55) are as follows:

$$\begin{aligned} & P(X_{n+1} = m+1 \mid X_n = m+1) = 1, \\ & P(X_{n+1} = j \mid X_n = k) = \frac{(rh)^{j-k} e^{-rh}}{(j-k)!}, \quad m \ge j > k \ge 1, \\ & P(X_{n+1} = j \mid X_n = k) = e^{-rh}, \quad m \ge j = k \ge 1, \\ & P(X_{n+1} = m+1 \mid X_n = k) = 1 - \sum_{m \ge j > k \ge 1} P(X_{n+1} = j \mid X_n = k) - P(X_{n+1} = k \mid X_n = k), \quad m \ge k \ge 1, \\ & P(X_{n+1} = j \mid X_n = k) = 0, \quad j < k. \end{aligned} \tag{4.58}$$

Because a new system starts in the first healthy phase, the initial posterior probabilities are as follows:

$$\Pi_{0,1} = 1, \qquad \Pi_{0,j} = 0, \quad m+1 \ge j > 1. \tag{4.59}$$

Generally, the computation of the long-run average cost requires discretization of $[0, 1]$, the state space of the posterior probability process. We define the posterior probability $\Pi_{n,m+1}$ in the semi-Markov decision process framework to find the cost-optimal control limit for the vibration system. The set of SMDP states is defined as $S = \{0, \Pi_{n,m+1} \in [0, 1], PM\}$, where $\Pi^* \in [0, 1]$ is the optimal control limit that minimizes the long-run expected average cost per unit time.

When applying the Bayesian control technique, the posterior probability $\Pi_{n,m+1} \in [0, 1]$ is used to monitor the deterioration process. The posterior probability is updated after each observation. When $\Pi \le \Pi_{n,m+1}$, the system is stopped and a full inspection is performed with inspection cost $C_I$. The system is in the healthy state 0 with probability $1 - \Pi_{n,m+1}$ and in the warning state 1 with probability $\Pi_{n,m+1}$. If the system is in the unhealthy state 1, preventive maintenance is conducted at cost rate $C_{PM}$ and takes $T_{PM}$ time units. The

associated lost production cost is incurred at a rate of $C_{LP}$. After inspection or preventive maintenance, the system is renewed and a new cycle begins. The optimal average cost $g(\Pi^*)$ can be obtained by defining and solving a semi-Markov decision problem. Denoting $\Pi_n = (\Pi_{n,1}, \ldots, \Pi_{n,m+1})$ and $i = (i_1, \ldots, i_{m+1})$, the posterior probability $\Pi_{n,m+1}$ is plotted on the Bayesian control chart. If the current value $\Pi_{n,m+1}$ lies in the interval $\left[\frac{i_{m+1}-1}{I}, \frac{i_{m+1}}{I}\right)$, we assume that $\Pi_{n,m+1} = \frac{i_{m+1} - 0.5}{I}$, which is denoted as $\Pi_{n,m+1} = i_{m+1}$. If $\Pi > \Pi_{n,m+1}$, we continue. The long-run expected average cost $g(\Pi)$ can then be obtained by solving the following system of linear equations [55]:

$$v_i = c_i - g(\Pi)\, \tau_i + \sum_{j \in S} P_{ij}\, v_j, \qquad v_0 = 0. \tag{4.60}$$

The costs, sojourn times and transition probabilities can be obtained as follows:

$$\begin{aligned} c_i &= E\!\left[ C_{OM} \int_0^h I_{\{X_{nh+t} = m+1\}}\,dt \;\Big|\; \Pi_n \right] + C_s \\ &= C_{OM} \int_0^h P(X_{nh+t} = m+1 \mid \Pi_n)\,dt + C_s, \quad \text{if } \Pi_{n,m+1} = \tfrac{i_{m+1} - 0.5}{I} < \Pi, \\ c_i &= C_I T_I + C_{LP} T_I, \quad \text{if } \Pi_{n,m+1} = \tfrac{i_{m+1} - 0.5}{I} > \Pi, \\ c_{PM} &= C_{PM} T_{PM} + C_{LP} T_{PM}, \end{aligned} \tag{4.61}$$

where

$$P(X_{nh+t} = m+1 \mid \Pi_n) = \sum_{k=1}^{m+1} P(X_{nh+t} = m+1 \mid X_{nh} = k, \Pi_n)\, \Pi_{n,k}. \tag{4.62}$$

The associated time structures $\tau_i$ are given by:

$$\tau_i = h, \quad \Pi_{n,m+1} < \Pi, \qquad \tau_i = T_I, \quad \Pi_{n,m+1} > \Pi,$$

$$\tau_{PM} = T_{PM}, \quad \text{if the system is in preventive maintenance}, \qquad \tau_0 = h. \tag{4.63}$$

The associated transition probabilities $P_{ij}$ are given by:

$$\begin{aligned} P_{ij} &= P(j^- \le \Pi_{n+1,m+1} < j^+ \mid Y_1, \ldots, Y_n, Y_{n+1}, \Pi_n) \\ &= P\!\left( j^- \le \frac{f(y \mid \mu_1, \Sigma_1)\, D_{m+1}}{f(y \mid \mu_0, \Sigma_0) \sum_{l=1}^{m} D_l + f(y \mid \mu_1, \Sigma_1)\, D_{m+1}} < j^+ \;\Big|\; Y_1, \ldots, Y_n, Y_{n+1}, \Pi_n \right) \\ &= F(j^+) - F(j^-), \quad \text{for } \Pi_{n,m+1} < \Pi, \end{aligned}$$

where $j^- = \frac{j-1}{I}$ and $j^+ = \frac{j}{I}$. The distribution function $F(x)$ of $\Pi_{n+1,m+1}$ has the following form:

$$\begin{aligned} F(x) &= \Pr(\Pi_{n+1,m+1} \le x \mid Y_1, \ldots, Y_n, \Pi_n) \\ &= \Pr\!\left( \frac{D_{m+1}}{D_{m+1} + \frac{f(y \mid \mu_0, \Sigma_0)}{f(y \mid \mu_1, \Sigma_1)} \sum_{l=1}^{m} D_l} \le x \;\Big|\; Y_1, \ldots, Y_n, \Pi_n \right) \\ &= \Pr\bigl(V(y) > k(x) \mid X_{(n+1)h} = 0\bigr) \frac{\sum_{l=1}^{m} D_l}{\sum_{l=1}^{m} D_l + D_{m+1}} + \Pr\bigl(V(y) > k(x) \mid X_{(n+1)h} = 1\bigr) \frac{D_{m+1}}{\sum_{l=1}^{m} D_l + D_{m+1}}, \quad \text{for } \Pi_{n,m+1} < \Pi, \end{aligned}$$

$$P_{i,0} = 1 - \Pi_{n,m+1}, \quad \Pi_{n,m+1} > \Pi, \qquad P_{i,PM} = \Pi_{n,m+1}, \quad \Pi_{n,m+1} > \Pi, \qquad P_{PM,0} = 1. \tag{4.64}$$

$F(x)$ can be computed using the equations of Provost given in Appendix B, which provide a closed-form expression for the cumulative distribution function of $V(y)$. As shown in Figure 4.6, we plot six posterior probabilities based on the parameter estimates for this vibration data set. Each line represents the posterior probability that the system is in one particular phase.

The red line $\Pi_6$ represents the posterior probability that the system is in the unhealthy state. At each observation epoch, the posterior probabilities sum to one.

Figure 4.6 Posterior probabilities with Erlang sojourn time distribution

Finding the optimal control limit with multiple phases in the healthy state involves solving a high-dimensional optimal control problem, which is difficult to demonstrate. We show the application procedure by considering an Erlang-2 sojourn time distribution in the healthy state. The updated cost structures are as follows:

$$\begin{aligned} c_i &= C_{OM} \int_0^h P(X_{nh+t} = 3 \mid \Pi_n)\,dt + C_s, \quad \text{if } \Pi_{n,3} < \Pi, \\ &= C_{OM} \int_0^h \bigl[ P(X_{nh+t} = 3 \mid \Pi_n, X_{nh} = 1)\, P(X_{nh} = 1) + P(X_{nh+t} = 3 \mid \Pi_n, X_{nh} = 2)\, P(X_{nh} = 2) \\ &\qquad\qquad + P(X_{nh+t} = 3 \mid \Pi_n, X_{nh} = 3)\, P(X_{nh} = 3) \bigr]\,dt + C_s \\ &= C_{OM} \int_0^h \bigl[ P(X_{nh+t} = 3 \mid \Pi_n, X_{nh} = 1)\, \Pi_{n,1} + P(X_{nh+t} = 3 \mid \Pi_n, X_{nh} = 2)\,(1 - \Pi_{n,1} - \Pi_{n,3}) \\ &\qquad\qquad + P(X_{nh+t} = 3 \mid \Pi_n, X_{nh} = 3)\, \Pi_{n,3} \bigr]\,dt + C_s \\ &= C_{OM} \left[ \Bigl( h + h e^{-rh} - \frac{2}{r} (1 - e^{-rh}) \Bigr) \Pi_{n,1} + \Bigl( h - \frac{1 - e^{-rh}}{r} \Bigr) (1 - \Pi_{n,1} - \Pi_{n,3}) + h\, \Pi_{n,3} \right] + C_s, \end{aligned} \tag{4.65}$$

where the posterior probabilities of each phase are given by:

$$\begin{aligned} \Pi_{n+1,1} &= \frac{f(y \mid \mu_0, \Sigma_0)\, D_1}{f(y \mid \mu_0, \Sigma_0)\,(D_1 + D_2) + f(y \mid \mu_1, \Sigma_1)\, D_3}, \\ \Pi_{n+1,2} &= \frac{f(y \mid \mu_0, \Sigma_0)\, D_2}{f(y \mid \mu_0, \Sigma_0)\,(D_1 + D_2) + f(y \mid \mu_1, \Sigma_1)\, D_3}, \\ \Pi_{n+1,3} &= \frac{f(y \mid \mu_1, \Sigma_1)\, D_3}{f(y \mid \mu_0, \Sigma_0)\,(D_1 + D_2) + f(y \mid \mu_1, \Sigma_1)\, D_3}, \end{aligned} \tag{4.66}$$

where

$$\begin{aligned} D_1 &= P(X_{n+1} = 1 \mid X_n = 1)\, \Pi_{n,1} = e^{-rh}\, \Pi_{n,1}, \\ D_2 &= P(X_{n+1} = 2 \mid X_n = 1)\, \Pi_{n,1} + P(X_{n+1} = 2 \mid X_n = 2)\, \Pi_{n,2} \\ &= rh\, e^{-rh}\, \Pi_{n,1} + e^{-rh}\,(1 - \Pi_{n,1} - \Pi_{n,3}), \\ D_3 &= (1 - e^{-rh} - rh\, e^{-rh})\, \Pi_{n,1} + (1 - e^{-rh})\,(1 - \Pi_{n,1} - \Pi_{n,3}) + \Pi_{n,3}. \end{aligned} \tag{4.67}$$

Using the same cost rates and time parameters as in Section 4.5, we compute the optimal control limit for the Erlang-2 sojourn time distribution. The results are shown in Table 4.5. The control limit was computed as $\Pi^* = 0.7$ (see Figure 4.7). For a higher-dimensional optimal Bayesian control chart problem, similar techniques can be used to obtain the optimal control limit. Comparing the optimal maintenance costs in Table 4.4 and Table 4.5, we find that the long-run average cost under the Weibull sojourn time distribution is lower than the costs obtained with the other two sojourn time distributions.
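The phase-wise update in equations (4.54) to (4.56), with the transition probabilities of (4.58), can be sketched for a general number of phases $m$. This is an illustrative sketch rather than the implementation used in the thesis: the function names are hypothetical, the sampling interval is taken to be $h$, and the likelihood values `f0` and `f1` are assumed to be computed elsewhere from the fitted observation model.

```python
import math

def phase_transition(j, k, m, r, h):
    """P(X_{n+1} = j | X_n = k) for phases 1..m (healthy) and m+1 (unhealthy)."""
    if k == m + 1:                      # the unhealthy phase is absorbing
        return 1.0 if j == m + 1 else 0.0
    if j < k:                           # phases cannot move backwards
        return 0.0
    if j <= m:                          # j - k phase completions within h
        d = j - k
        return (r * h) ** d * math.exp(-r * h) / math.factorial(d)
    # j == m + 1: the remaining probability mass
    return 1.0 - sum(phase_transition(l, k, m, r, h) for l in range(k, m + 1))

def erlang_posterior_update(pi, f0, f1, m, r, h):
    """Map the vector (Pi_{n,1}, ..., Pi_{n,m+1}) to the next posterior vector."""
    d = [sum(phase_transition(j, k, m, r, h) * pi[k - 1] for k in range(1, m + 2))
         for j in range(1, m + 2)]
    denom = f0 * sum(d[:m]) + f1 * d[m]
    return [f0 * dj / denom for dj in d[:m]] + [f1 * d[m] / denom]

# Hypothetical Erlang-2 example: three posterior components.
posterior = erlang_posterior_update([0.6, 0.3, 0.1], f0=0.4, f1=0.9, m=2, r=0.5, h=1.0)
```

With $m = 2$ and $f_0 = f_1$, the output reduces to $(D_1, D_2, D_3)$ of equation (4.67), which gives a quick consistency check on the transition probabilities.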

Table 4.5 Expected average costs and optimal control limit

Figure 4.7 Optimal control limit using phase-type distribution

In this chapter, a hidden, two-state, continuous-time semi-Markov model with a general sojourn time distribution in the healthy state has been proposed to model the machine deterioration process. We have formulated the condition-based maintenance model and presented a parameter estimation procedure for this model under a general sojourn time distribution in the healthy state. We have derived explicit formulae to estimate both the state and observation model parameters using the EM algorithm. Three sojourn time distributions have been studied in this HSMM framework: the Weibull distribution, the Erlang distribution and the exponential distribution. We have presented a computational

algorithm in the SMDP framework which can be used to obtain the optimal maintenance policy minimizing the long-run expected average cost per unit time for this HSMM. Finally, a case study using real vibration data has been developed to demonstrate the whole procedure. We have compared the results obtained from these three distributions using the same vibration data set. It is interesting to note that the long-run average hourly savings obtained with the HSMM approach are small compared with the HMM approach. However, the savings become significant when the costs are accumulated over a year. Furthermore, when the values of the cost parameters are difficult to obtain in practice, an alternative approach for developing the optimal Bayesian fault detection scheme is to maximize the long-run expected average system availability per unit time. The "cost" used in the SMDP framework is then replaced by the system downtimes (see e.g. Jiang et al. [64]).

Chapter 5 A Comparison of Hidden Markov and Semi-Markov Models for Monitoring Planetary Gearbox Systems

5.1 Introduction

Planetary gearboxes are useful for machinery because they can convert a high rotating speed to a high output torque. They are widely used in aerospace, automotive and heavy industry applications, where most of them operate in tough working environments. Condition monitoring and early fault diagnosis of planetary gearboxes aim to prevent shutdowns, major economic losses and even human casualties. Vibration monitoring plays a critical part in condition monitoring and is widely used for fault detection, diagnosis and prognosis. During normal conditions, the vibration monitoring system collects data from a number of sensors, including sensors mounted on the transmission housing. A detailed review can be found in [34] and [35]. However, studies on early fault detection and condition-based maintenance of planetary gearboxes using multiple sensors are quite limited in the literature.

Figure 5.1 Transmission diagram of the planetary gearbox systems


Figure 5.2 Sensor placements

Figure 5.3 Spalled planet gear tooth

The vibration data considered in this chapter were obtained from accelerometers placed on the housing of a test rig provided by Syncrude Canada. The transmission diagram of the gearbox is shown in Figure 5.1. The gearbox contains three stages. All the gears in the gearbox are straight-tooth gears. The tooth numbers, speeds and reduction ratios of each stage are summarized in Table 5.1; see [57] and [58]. For data collection, five accelerometers were used to capture the vibration data. As shown in Figure 5.2, Input-h was placed on the housing of stage 1, the bevel gearbox. Planetary1-v and HS planetary1-v were placed on the housing of stage 2, planetary gearbox I. Planetary2-v and HS planetary2-v were placed on the housing of stage 3, planetary gearbox II. HS denotes high sensitivity and v denotes vertical mounting.

This particular planetary gearbox was tested for thirteen days of operation, i.e. the total number of runs was 13, and the gearbox was run until it reached a poor condition (see Figure 5.3). Each run consisted of continuous operation followed by inspection. The data were collected for five minutes (300 seconds) once every hour. During testing, the data were collected using two different sampling frequencies. For convenience, we divide the data into two groups. Group 1 includes files from the 1st to the 4th run with a sampling frequency of 10 kHz. Six sampling files are selected from each run, and each file covers 3e6 sampling points. Group 2 includes files from the 5th to the 13th run with

a sampling frequency of 5 kHz. Sixteen sampling files are selected from each run, and each file covers 1.5e6 sampling points. The damaged files are excluded. We are interested in the early fault detection of a planetary gearbox. The data in the 13th run are believed to belong to a cycle caused by machine failure. Thus, the 13th run is excluded. Stage 3, planetary gearbox II, bears the most load and would likely fail first. Thus, we selected the accelerometers planetary 2-v and HS planetary 2-v for analysis in this chapter (Figure 5.2).

Given the geometry of the planetary transmission, specific properties of the planetary gearbox in each stage can be computed. Detailed calculations are readily available in the literature, e.g., [34], [35] and [59]. The tooth meshing frequency is given by:

$$f_m = N_r f_c = N_p (f_p + f_c) = N_s (f_s - f_c), \tag{5.1}$$

where $N_r$, $N_p$ and $N_s$ are the numbers of teeth on the ring, planet and sun gears, respectively. If we fix the planet carrier and let the ring gear rotate at a frequency equal to the carrier rotation frequency, we can obtain the rotation frequency of the planet gear relative to the ring gear, $f_p + f_c$, and of the sun gear relative to the ring gear, $f_s - f_c$, which are given by:

$$f_p + f_c = f_c \frac{N_r}{N_p}, \tag{5.2}$$

$$f_s - f_c = f_c \frac{N_r}{N_s}, \tag{5.3}$$

respectively. More generally, the rotation frequency of the gear of interest is given by:

$$f_g = f_c \frac{N_r}{N_g}, \tag{5.4}$$

where the subscript $g$ refers to the gear under consideration, either the planet or the sun.
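Equations (5.1) to (5.4) translate directly into a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and the tooth counts and carrier frequency in the example are placeholder values, since the entries of Table 5.1 are not reproduced here.

```python
def planetary_frequencies(f_carrier, n_ring, n_planet, n_sun):
    """Return the mesh frequency and the planet and sun frequencies relative to the ring."""
    f_mesh = n_ring * f_carrier                      # (5.1)
    f_planet_rel = f_carrier * n_ring / n_planet     # (5.2): f_p + f_c
    f_sun_rel = f_carrier * n_ring / n_sun           # (5.3): f_s - f_c
    return {"mesh": f_mesh, "planet_rel": f_planet_rel, "sun_rel": f_sun_rel}

# Placeholder geometry (hypothetical): carrier at 1 Hz, 81 ring, 31 planet, 19 sun teeth.
freqs = planetary_frequencies(f_carrier=1.0, n_ring=81, n_planet=31, n_sun=19)
```

The three expressions for $f_m$ in (5.1) agree by construction, which can be checked by multiplying each relative frequency by its tooth count.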

Table 5.1 Gearboxes information
Columns: Input speed (rpm); Reduction ratio; Carrier speed (rpm); No. of teeth of sun gear; No. of teeth of planet/bevel; No. of teeth of ring; Gear meshing frequency (Hz). Rows: 1st stage (N/A, N/A, 360), 2nd stage, 3rd stage.

Vibration Separation

Time synchronous averaging (TSA) methods have been broadly adopted in condition monitoring and fault diagnosis of planetary gearboxes. TSA is a signal processing technique that extracts the periodic component from vibration data and averages out both noise and unwanted waveforms. Many applications of the TSA method can be found in the literature [4], [60], and [5]. The implementation of a TSA algorithm for a planetary transmission requires the appropriate separation of the vibration data. Yip et al. [57] and Yu [61] applied a cosine window function ("Australian Patent") for the extraction of the time-averaged vibration signals. The residual signals were obtained based on the TSA. This approach was combined with the wavelet transform to investigate the health condition of the planetary gearbox. The window function serves to attenuate vibrations not associated with the meshing of the gear of interest. The use of different window functions was discussed in detail in Samuel ([35], [62]). Samuel et al. [62] compared the accuracy of various window functions: the rectangular window, the triangular window, the Hanning window and the Tukey window. They

concluded that the Tukey window function provides the most accurate waveform. A Tukey window function is a flat-top Hanning window function. For example, an $N$-point Tukey window function is given by:

$$\omega[k+1] = \begin{cases} 1.0, & 0 \le |k| \le \frac{N}{2}(1 + \alpha), \\ \cos\!\left( \pi\, \dfrac{k - \frac{N}{2}(1 + \alpha)}{N (1 - \alpha)} \right), & \frac{N}{2}(1 + \alpha) \le |k| \le N, \end{cases} \tag{5.5}$$

where $\alpha$ is the taper ratio, defining the fraction of the total window width encompassed by the shoulders. For instance, a five-tooth-wide Tukey window function with $\alpha = 0.8$ is shown in Figure 5.4. Samuel et al. [62] show that a Tukey window with a five-tooth width and $\alpha = 0.8$ leaves the primary tooth mesh waveform of interest unaffected while providing smooth shoulders. This Tukey window is used with the vibration separation algorithm in this study. Only the signal segment corresponding to the one tooth at the center of the window is retained.

Figure 5.4 A Tukey window function

Figure 5.5 Raw vibration data from planetary 2-v

Figure 5.6 Raw vibration data from HS planetary 2-v

As an example, the original vibration data in the 6th run from the two sensors are shown in Figure 5.5 and Figure 5.6. Prior to the application of the TSA method, the data must be re-sampled using interpolation. The tachometer signals are required as phase markers for re-sampling the vibration signals. The signal cycles triggered by the tachometer impulses are re-sampled (using cubic spline interpolation) with respect to each signal cycle. Interpolation brings the number of points sampled during one cycle period of the gear to a predetermined value and enables averaging, so that every cycle contains the same number of points. Figure 5.7 shows 68 successive sections corresponding to one pulse per revolution of the output voltage signals. These sections can then be interpolated to a predetermined number of points prior to the application of time synchronous averaging.
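The re-sampling and averaging steps described above can be sketched as follows. To stay dependency-free, this sketch substitutes linear interpolation for the cubic spline interpolation used in the study, so it is a simplified stand-in rather than the same procedure. The function names are hypothetical, and `tach_indices` is assumed to hold the sample indices at which successive tachometer pulses occur.

```python
def resample_cycle(signal, start, end, n_points):
    """Interpolate one cycle signal[start:end] onto n_points equally spaced angles.

    Linear interpolation is used here as a simplification of cubic splines.
    """
    length = end - start
    out = []
    for j in range(n_points):
        pos = start + j * length / n_points      # fractional sample index
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, len(signal) - 1)        # guard the last sample
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out

def time_synchronous_average(signal, tach_indices, n_points):
    """Resample every tachometer-delimited cycle to n_points and average point-wise."""
    cycles = [resample_cycle(signal, a, b, n_points)
              for a, b in zip(tach_indices, tach_indices[1:])]
    return [sum(col) / len(cycles) for col in zip(*cycles)]

# Two cycles of different length (4 and 8 samples) carrying the same waveform.
sig = [0.0, 1.0, 2.0, 3.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 0.0]
tsa = time_synchronous_average(sig, tach_indices=[0, 4, 12], n_points=4)
```

Because both cycles are brought to the same number of points before averaging, variations in shaft speed between revolutions do not smear the averaged waveform.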

Figure 5.7 Tachometer pulses data

Unlike the TSA techniques used for a fixed-axis gearbox, TSA analysis of a planetary gearbox requires the calculation of the planet gear meshing sequence [35]. Essentially, a given tooth of a planet gear will be aligned with a given tooth of the ring gear for a given carrier rotation only once every $n_{reset,p}$ rotations. Here $n_{reset,p}$ represents the number of rotations of the gear under consideration that occur before the gear returns to its initial state relative to the position of the carrier. This number of rotations is given by

$$n_{reset,p} = \frac{\mathrm{LCM}(N_p, N_r)}{N_r}, \tag{5.6}$$

where LCM refers to the least common multiple, and $N_p$ and $N_r$ are the numbers of teeth of the planet and ring gears, respectively. Given the geometry of the planetary transmission, we compute $n_{reset,p} = 31$ for the third stage planetary gearset, which indicates that the initial state is repeated every 31 carrier rotations. The sequence of states has an associated sequence of aligned teeth, which can be found using:

$$P_{n,p} = \mathrm{mod}(nN_r, N_p) + 1. \tag{5.7}$$
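Equations (5.6) and (5.7) can be checked numerically with a short sketch. The ring gear tooth count $N_r = 81$ is quoted in the text; the planet tooth count $N_p = 31$ is an assumption, chosen because it is consistent with $n_{reset,p} = 31$. Carrier rotations are counted from $n = 1$, and the function names are illustrative.

```python
from math import gcd


def reset_rotations(n_planet, n_ring):
    """Eq. (5.6): carrier rotations before the planet returns to its initial state."""
    lcm = n_planet * n_ring // gcd(n_planet, n_ring)
    return lcm // n_ring


def aligned_tooth(n, n_planet, n_ring):
    """Eq. (5.7): planet tooth aligned with the reference position at rotation n."""
    return (n * n_ring) % n_planet + 1


def meshing_sequence(n_planet, n_ring):
    """Aligned planet tooth for carrier rotations n = 1 .. n_reset."""
    n_reset = reset_rotations(n_planet, n_ring)
    return [aligned_tooth(n, n_planet, n_ring) for n in range(1, n_reset + 1)]


# N_r = 81 is taken from the text; N_p = 31 is an assumed value.
sequence = meshing_sequence(31, 81)
print(sequence[:2])  # [20, 8]
```

Under these assumptions the first two carrier rotations align teeth 20 and 8, and all 31 planet teeth are visited exactly once in 31 carrier rotations.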

The mod function returns the remainder after division of $nN_r$ by $N_p$. Thus, the sequence of aligned teeth for a particular planet gear in planetary gearbox II is listed in Table 5.2.

Table 5.2 Planetary gearbox II meshing sequence
Columns: Gear tooth; Carrier revolution; Gear tooth; Carrier revolution

The TSA cycle length is determined by the carrier frequency and the sampling frequency. For the files from the 1st run to the 4th run, a total of 67,716 points (81 ring gear teeth × 836 points per tooth) were used for each carrier cycle. For the files from the 5th run to the 13th run, a total of 33,858 points (81 ring gear teeth × 418 points per tooth) were used for each carrier cycle. Next, we use the window function to attenuate vibrations not associated with the meshing of the gear of interest. In the windowing process, each tooth vector is separated from each carrier cycle by multiplying the interpolated data by the five-tooth-wide Tukey window function. The center tooth vibration points are then selected and aligned with the corresponding meshing sequence determined in Table 5.2. Figures 5.8 to 5.11 illustrate the meshing procedure by generating the 20th tooth meshing data from the 1st carrier cycle and the 8th tooth meshing data from the 2nd carrier cycle, respectively. The remaining tooth meshing data were generated following the meshing sequence shown in Table 5.2. Figure 5.11 shows the final meshing data generated in a single run of data for one planet. We obtain the TSA data for this planet at each run. This process is also performed for the sensor HS planetary 2-v.
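The windowing and alignment steps described above can be sketched as follows. This is a simplified illustration under several assumptions: each carrier cycle is represented by a five-tooth-long segment already centred on the planet passing the sensor, carrier cycles are counted from 1, repeated visits of the same tooth are averaged, and the function and argument names are hypothetical.

```python
def separate_planet_teeth(cycle_segments, window, n_planet, n_ring, pts_per_tooth):
    """Window each five-tooth segment, keep its centre tooth, and place that
    tooth at the position given by the meshing sequence of Eq. (5.7).

    cycle_segments[i] holds the samples of carrier cycle i + 1 and must have
    the same length as window (5 * pts_per_tooth points).
    """
    centre = slice(2 * pts_per_tooth, 3 * pts_per_tooth)
    sums, counts = {}, {}
    for i, segment in enumerate(cycle_segments):
        if len(segment) != len(window):
            raise ValueError("segment and window lengths differ")
        n = i + 1  # carrier cycles are counted from 1
        windowed = [s * w for s, w in zip(segment, window)]
        tooth = (n * n_ring) % n_planet + 1
        kept = windowed[centre]
        if tooth in sums:
            sums[tooth] = [a + b for a, b in zip(sums[tooth], kept)]
            counts[tooth] += 1
        else:
            sums[tooth] = kept
            counts[tooth] = 1
    missing = [t for t in range(1, n_planet + 1) if t not in sums]
    if missing:
        raise ValueError("no data for planet teeth: %s" % missing)
    assembled = []
    for tooth in range(1, n_planet + 1):
        # average repeated visits, then append teeth in order 1 .. n_planet
        assembled.extend(v / counts[tooth] for v in sums[tooth])
    return assembled
```

The output is one record covering a full revolution of the planet, with the teeth in their natural order rather than in the order in which they passed the sensor.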

Figure 5.8 Data in the 5th run

Figure 5.9 Windowed data at the beginning of the 1st carrier cycle (seconds)

Figure 5.10 Windowed data at the beginning of the 2nd carrier cycle (seconds)

Figure 5.11 Planet gear tooth vibration separation in the 5th run from sensor Planetary2-v

5.3 Vector Autoregressive Modeling

The two-dimensional observed healthy data histories $Z_t = (Z_{1t}, Z_{2t})$, $t = 0, 1, \ldots, T$, have the following representation:

$$Z_t = \mu + \sum_{r=1}^{p} \phi_r Z_{t-r} + \varepsilon_t, \tag{5.8}$$

where $\varepsilon_t$ are i.i.d. $N_2(0, \Sigma)$, $p$ is the lag which determines the model order, $\phi_r \in \mathbb{R}^{2 \times 2}$ is the coefficient matrix, and $\mu \in \mathbb{R}^2$ and $\Sigma \in \mathbb{R}^{2 \times 2}$ are the mean and covariance model parameters, respectively. We can rewrite Eq. (5.8) in the general form $W = \phi A + E$, where $W = [Z_p, Z_{p+1}, \ldots, Z_T]$, $\phi = [\mu, \phi_1, \phi_2, \ldots, \phi_p]$, $E = [\varepsilon_p, \varepsilon_{p+1}, \ldots, \varepsilon_T]$, and

$$A = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ Z_{p-1} & Z_p & \cdots & Z_{T-1} \\ Z_{p-2} & Z_{p-1} & \cdots & Z_{T-2} \\ \vdots & \vdots & & \vdots \\ Z_0 & Z_1 & \cdots & Z_{T-p} \end{bmatrix}. \tag{5.9}$$

Lütkepohl [6] showed that the least squares estimates of $\phi$ and of the covariance matrix $\Sigma = \mathrm{cov}(\varepsilon_t)$ are given by:

$$\hat{\phi} = W A' (A A')^{-1}, \tag{5.10}$$

$$\hat{\Sigma} = (T - 2p - 1)^{-1} (W - \hat{\phi} A)(W - \hat{\phi} A)'. \tag{5.11}$$

There are several information criteria available to determine the order $p$ of a VAR model. Akaike [43] suggested measuring the goodness of fit of the model by balancing the error of the fit against the number of parameters in the model. For a VAR($p$) model,

$$\mathrm{AIC} = \ln|\hat{\sigma}_p^2| + \frac{2pD^2}{T}. \tag{5.12}$$
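The least squares estimates of Eqs. (5.10) and (5.11) can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the code used in the study: the normal equations $(AA')\hat{\phi}' = AW'$ are solved by Gauss-Jordan elimination, the divisor $T - 2p - 1$ is written as $T - Dp - 1$ (the two coincide for the bivariate case $D = 2$), and the function names are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ X = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + list(r) for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_var(series, p):
    """Least squares fit of a VAR(p) model to series = [Z_0, ..., Z_T].

    Returns (phi, sigma, residuals), where phi = [mu, phi_1, ..., phi_p]
    is stored as a D x (1 + D*p) nested list.
    """
    dim = len(series[0])
    t_max = len(series) - 1
    dof = t_max - dim * p - 1  # equals T - 2p - 1 when D = 2, as in Eq. (5.11)
    if dof <= 0:
        raise ValueError("series too short for this model order")
    # Each regressor is one column of A in Eq. (5.9): [1, Z_{t-1}, ..., Z_{t-p}].
    regressors = []
    for t in range(p, len(series)):
        col = [1.0]
        for r in range(1, p + 1):
            col.extend(series[t - r])
        regressors.append(col)
    targets = series[p:]
    k = len(regressors[0])
    aat = [[sum(a[i] * a[j] for a in regressors) for j in range(k)] for i in range(k)]
    awt = [[sum(a[i] * w[d] for a, w in zip(regressors, targets)) for d in range(dim)]
           for i in range(k)]
    phi_t = solve_linear(aat, awt)  # (A A') phi' = A W'
    phi = [[phi_t[i][d] for i in range(k)] for d in range(dim)]
    residuals = [[w[d] - sum(phi[d][i] * a[i] for i in range(k)) for d in range(dim)]
                 for a, w in zip(regressors, targets)]
    sigma = [[sum(e[i] * e[j] for e in residuals) / dof for j in range(dim)]
             for i in range(dim)]
    return phi, sigma, residuals
```

The returned residuals are the quantities $W - \hat{\phi}A$; applying the fitted `phi` to data from later runs in the same way yields the residual series used for monitoring.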

Here $\hat{\sigma}_p^2$ is the maximum likelihood estimate of $\sigma_\varepsilon^2$, the covariance matrix of $\varepsilon_t$, $T$ is the sample size, and $D$ is the dimension of the time series. The Bayesian information criterion (BIC) [63] is defined as follows:

$$\mathrm{BIC} = \ln|\hat{\sigma}_p^2| + \frac{\ln T}{T} k, \tag{5.13}$$

where $\hat{\sigma}_p^2$ is the error covariance matrix and $k$ is the number of estimated parameters in the model.

We fit a VAR model to the healthy portion of the TSA filtered data, and the residuals are computed for both the healthy and unhealthy portions of the data histories. Because the files from group 2 were collected at the same sampling frequency, we build a VAR model using the TSA filtered data obtained from the 5th run (Figure 5.12), which is selected to represent the healthy portion of the data histories. It is necessary to check whether the signals in the runs from group 1 represent a stable process. Two statistical indicators, root mean square (RMS) and variance, are chosen to observe the gearbox health condition in group 1. Figure 5.13 shows no trend in either statistical indicator. Thus, we believe the gearbox is in a healthy condition up to the 4th run.

Figure 5.12 Angular position of TSA data to develop a VAR model (360 degrees)
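The two indicators used to check the stability of the runs in group 1 can be computed per run as in the sketch below. The function names are illustrative, and the use of the unbiased sample variance (divisor $n - 1$) is an assumption, since the text does not state which variance estimator was used.

```python
import math


def rms(samples):
    """Root mean square of one run of vibration samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def variance(samples):
    """Unbiased sample variance (divisor n - 1) of one run."""
    mean = sum(samples) / len(samples)
    return sum((v - mean) ** 2 for v in samples) / (len(samples) - 1)


def indicators_per_run(runs):
    """Return a list of (rms, variance) pairs, one per run, for trend inspection."""
    return [(rms(run), variance(run)) for run in runs]
```

Plotting the returned pairs against the run index gives the kind of trend check described above: roughly constant values across runs suggest a stable process.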

Figure 5.13 RMS and variance from runs (1st - 4th)

We select the TSA filtered data in the 5th run as a stationary process to fit a time series model, which accounts for both cross-correlation and autocorrelation in the vibration data histories. The model orders are determined by both the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are shown in Figure 5.14. The points marked with a star indicate the best-fitting order of the VAR model. We select the order that minimizes the AIC, which gives a model order of 156. The files from the 6th run to the 12th run were used for gearbox deterioration diagnosis. The residuals are then obtained using the fitted model for the complete data histories in Group 2.
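The order selection described above can be sketched with Eqs. (5.12) and (5.13) as follows. The sketch is restricted to the bivariate case, so the determinant is computed for a 2 × 2 matrix; the parameter count $k = D + pD^2$ (intercept plus coefficient matrices) is an assumption, and the function names are illustrative.

```python
import math


def log_det_2x2(cov):
    """Natural log of the determinant of a 2 x 2 covariance matrix."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    return math.log(det)


def aic(cov, order, dim, n_obs):
    """Eq. (5.12): ln|cov| + 2 p D^2 / T."""
    return log_det_2x2(cov) + 2.0 * order * dim ** 2 / n_obs


def bic(cov, n_params, n_obs):
    """Eq. (5.13): ln|cov| + (ln T / T) k."""
    return log_det_2x2(cov) + math.log(n_obs) * n_params / n_obs


def select_order(cov_by_order, dim, n_obs):
    """Return (AIC-optimal order, BIC-optimal order).

    cov_by_order maps each candidate order p to the residual covariance of
    the fitted VAR(p) model. The count k = D + p * D^2 is an assumption.
    """
    aic_scores = {p: aic(c, p, dim, n_obs) for p, c in cov_by_order.items()}
    bic_scores = {p: bic(c, dim + p * dim ** 2, n_obs) for p, c in cov_by_order.items()}
    return min(aic_scores, key=aic_scores.get), min(bic_scores, key=bic_scores.get)
```

Because the BIC penalty grows with $\ln T$, it tends to choose an order no larger than the one chosen by the AIC, which is why the two criteria can mark different points in a plot such as Figure 5.14.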

Figure 5.14 VAR model order selection using AIC criterion and BIC criterion

Figure 5.15 Residuals for the remaining test runs (runs from the 6th to 12th)

The residual signals from the 6th run to the 12th run are shown in Figure 5.15. Each run contains one complete revolution of TSA signals for a particular planet gear, which includes the vibration signals of 31 gear teeth. Figure 5.16 shows the autocorrelations and cross-correlations of the vector autoregressive model residuals using the data in Figure 5.15. It is used to examine whether the VAR model is adequate. Figure 5.16 shows that all auto and cross

Application of Vector Time Series Modeling and T-squared Control Chart to Detect Early Gearbox Deterioration

International Journal of Performability Engineering, Vol.10, No. 1, January 2014, pp.105-114. RAMS Consultants Printed in India Application of Vector Time Series Modeling and T-squared Control Chart to