Technical Condition Indexes and Remaining Useful Life of Aggregated Systems


by Bent Helge Nystad
Faculty of Engineering Science and Technology
Department of Marine Technology
Norwegian University of Science and Technology
Trondheim, June 2008

Abstract

It is a challenge to determine the remaining useful life of items that have several failure modes dependent on age and on several condition monitoring sources. Remaining useful life has always been a wanted quantity in maintenance planning: if the exact time before a failure occurs were known, maintenance planning would be simpler. The main objective of the thesis is to provide quantitative estimates of remaining useful life, including error bounds.

Based on information from process control systems, condition monitoring systems, and inspection reports, the technical condition of the items is quantified by technical condition indexes (TCIs). The TCIs are normalized quantities that can be aggregated upwards through an established hierarchy to the desired level. Both an expert judgement principle and a proportional hazards model approach are used to establish the aggregated TCIs. In the latter approach, statistical significance tests are used to identify the significant covariates. Problems with partial or incomplete data, where segments of recorded required maintenance actions and TCIs are missing, are solved by left-truncation in the statistical model.

The aging effect, quantified by the Weibull proportional hazards model, is used to estimate the remaining useful life. Because the literature within the prognostic area tends to put more focus on age-independent models, the remaining useful life is also modelled as a degradation-threshold model, in which a gamma process models the deterioration trend of an aggregated TCI. The results from the two approaches are compared and discussed, and uncertainty and predictability are discussed as well. This information is important in decision making for further maintenance.

The models have been applied to a case study of a natural gas export compressor system. The case study gives a detailed description of failure modes, condition monitoring, and inspection data, and presents the case-specific hierarchical aggregation of the technical condition index to compressor level. Finally, the thesis summarizes the work performed and proposes further research related to the next generation of prognostic systems.

Preface

This thesis is the result of my PhD study at the Norwegian University of Science and Technology (NTNU), Department of Marine Technology. Professor Magnus Rasmussen, Department of Marine Technology, NTNU, has been my supervisor during the work. I am thankful to Magnus for his guidance and support. StatoilHydro has been helpful in finding a proper case study, and has offered access to internal projects and data. I also want to thank my colleagues at the Institute for Energy Technology (IFE) in Halden, which has contributed additional funding and given me the opportunity to conduct and finalize this work. I would especially like to thank Aimar Sørenssen and Jan Erik Farbrot for their valuable corrections and comments on the writing of this thesis. Last, but not least, I wish to express my gratitude to my wife Lise, my daughter Oda, and my son Magnus for their support and patience during this work.

Table of Contents

1. Introduction
   Background
   The challenge of technical condition and remaining useful life modelling
   Objective and solving approach
   Structure of report
   Limitations
2. Maintenance strategy including a description of IO
   Introduction
   Integrated Operations (IO)
3. The Technical Condition Index (TCI)
   Determination of an item's technical condition from available information
   Application of condition monitoring
   Application of inspections (notifications)
   Application of process data
   Hierarchical aggregation of the technical condition
4. Case Study - Natural gas export compressors
   System for sales gas compression and sales gas metering
   Export gas compressors
   Failure modes
   Condition monitoring and process data
   Notifications
   Efficiency monitoring
   Hierarchical aggregation of the technical condition
   First order regression analysis of aggregated TCIs
5. Aging-based model for remaining useful life
   Failure in nonrepairable items
   Failure in repairable items
   Model selection framework
   Regression models
   Cox regression models
   Weibull regression models
   Parameter estimation by the method of maximum likelihood
   Case assumptions and model building
   Remaining useful life estimation
6. Condition-based model for remaining useful life
   Accelerated monotone TCI development modelled by a gamma process
   Parameter estimation for the gamma process
   Remaining useful life estimation
7. Modelling and results
   Lifetime distribution based on aging and condition monitoring
   Lifetime distribution based on condition monitoring only
   Discussion of uncertainty and predictability
8. Conclusions and further work
   Conclusions and discussion
   Future work
Appendix A Maintenance policies basics
Appendix B Available transfer functions in TeCoMan
Appendix C Export compressor hierarchy in TeCoMan
Appendix D Time dependent TCIs aggregated at the gas compressor level
Appendix E Results of polynomial fitting of aggregated TCIs
Appendix F Programming and Results
   F.1 S-plus-Programming and Results
      F.1.1 Script program to estimate parameters of the Weibull PHM
      F.1.2 Script program to plot the log-likelihood function near optimum
   F.2 Maple-Programming and Results
      F.2.1 Expected remaining useful life
      F.2.2 Variance of the expected mean residual life
Appendix G Graphs of the hazard rate (risk) of aggregated and low-level TCIs
Appendix H Reliability theory
   H.1 Complete and censored data sets
   H.2 Statistical Tests of Trend
Appendix I Low-level signals for compressor KA
   I.1 Bearing temperature
   I.2 Seal leakage
   I.3 Vibration
References

Nomenclature

Notation

$\beta$: Shape parameter in the Weibull and the Weibull PHM model
$c_i$: Censoring indicator at running period $i$; $c_i = 0$ in case of suspension and $c_i = 1$ in case of required maintenance action
$C(\theta)$: Fisher matrix, defined as $C(\theta) = \sum_{i=1}^{n} E\left(U_i(\theta)\,U_i(\theta)^T\right)$
$CV(Y(t))$: Coefficient of variation of a gamma process, defined as $CV(Y(t)) = \sqrt{\mathrm{Var}(Y(t))}\,/\,E(Y(t))$
$\chi^2$: Chi-square distribution
$\delta_i$: Observed transformed TCI increment between two succeeding TCI calculations, defined by $\delta_i = y_i - y_{i-1}$
DMRL: Decreasing Mean Residual Life
$\eta$: Location parameter in the Weibull and the Weibull PHM model
ERUL: Expected Remaining Useful Life, defined by $ERUL(t) = \frac{\int_t^{\infty} R(x)\,dx}{R(t)}$
$f(t)$: Probability density function (pdf) of time to failure
$F(t)$: Cumulative failure distribution function (cdf), $F(t) = \Pr(T \le t)$
$Ga(y \mid v(t), u)$: A gamma distributed random variable $y$ with shape function $v(t)$ and scale parameter $u$
$\Gamma(a)$: The gamma function, defined by $\Gamma(a) = \int_{z=0}^{\infty} z^{a-1} e^{-z}\,dz,\ a > 0$
$\Gamma(a, x)$: The incomplete gamma function, defined by $\Gamma(a, x) = \int_{z=x}^{\infty} z^{a-1} e^{-z}\,dz,\ x \ge 0,\ a > 0$
$\gamma$: Regression coefficient in the Weibull PHM model
$H_0$: Null hypothesis
$I(\theta)$: Observed information, defined by $I(\theta)_{i,j} = -\frac{\partial^2 \ln L(\theta)}{\partial \theta_i\,\partial \theta_j}$
$\kappa_{\eta,\gamma}$: Estimated correlation coefficient, defined by $\kappa_{\eta,\gamma} = \frac{\mathrm{Cov}(\eta, \gamma)}{\sqrt{\mathrm{Var}(\eta)\,\mathrm{Var}(\gamma)}}$
$l(\theta)$: Log-likelihood function, $l(\theta) = \ln(L(\theta))$
$L(\theta)$: Likelihood function
$L()$: Loss function
Median life: A measure of the center of a life distribution. The median life $t_m$ is defined by $R(t_m) = F(t_m) = 0.5$
MTBF: Mean Time Between Failure
MTTF: Mean Time To Failure. The mean time to failure is the expected value of $T$, given by $MTTF = \int_0^{\infty} R(t)\,dt$
$N(t)$: Integer number of required maintenance actions in the time interval $(0, t]$
$p$: Predictability measure, defined to be in the range $p \in [0, 1]$
p-value: Probability of obtaining a result at least as extreme as a given data point under the null hypothesis
PMRL: Proportional Mean Residual Life
$\mathrm{P}$ (capital rho): Randomized design margin (deterioration threshold)
$R(t)$: Reliability function or survivor function, $R(t) = \Pr(T > t)$
$\rho$: Design margin (deterioration threshold)
$s$: Stress
$S_i$: Maintenance action times, $S_i = T_1 + T_2 + \dots + T_i$
$t$: A specific point in operational time
$T$: Time to first failure of an item as a random variable
$T_i$: Interoccurrence time, i.e. the time between maintenance action $i$ and maintenance action $i-1$ for $i = 1, 2, \dots$
TCI: Estimated Technical Condition Index trend
$u$: Scale parameter of a gamma process
$U_i(\theta)$: Score vector, defined as $U_i(\theta) = \frac{d}{d\theta} \ln L_i(\theta)$
$v(t)$: Shape function of a gamma process
$w_i$: Transformed time between two succeeding TCI calculations, defined by $w_i = t_i^{b} - t_{i-1}^{b}$
$w(t)$: ROCOF, Rate of OCcurrence Of Failures, defined as $w(t) = \frac{d}{dt} E(N(t))$
$W(t)$: Expected number of failures in the time interval $(0, t]$, $W(t) = E(N(t))$
$\Psi(t)$: A state variable associated with an item such that $\Psi(t) = 1$ if the system is functioning at time $t$ and $\Psi(t) = 0$ if the system is in a failed state at time $t$
$y_i(t)$: Polynomial trend of TCIs at running period $i$
$y_i$: TCI value calculated at operational time (day) $i$
$Y(t)$: Transformed TCI value at time $t$
$z(t)$: Hazard rate function, i.e. the rate at which failures occur as a function of time, $z(t) = \frac{f(t)}{R(t)}$

Abbreviations

BPP: Branching Poisson Process
CBM: Condition-Based Maintenance
CM: Corrective Maintenance. The actions performed, as a result of failure, to restore an item to a specified condition (MIL-STD-2173(AS)) [1]
Compass: Name of a vibration monitoring system developed by Brüel & Kjær Vibro and Thomassen Compression Systems
CORD: Coordinated Operation and maintenance Research and Development project
Covariate: Also called covariable. An independent variable, or predictor, in a regression equation. Also, a secondary variable that can affect the relationship between the dependent variable and independent variables of primary interest in a regression equation
DE: Drive End
EMS: EMergency shutdown System
ERP: Enterprise Resource Planning
EUREKA: A pan-European network for market-oriented, industrial research and development
Failure: Termination of the ability of an item to perform a required function (BS 4778) [49]
Failure mode: The effect by which a failure is observed on the failed item (EuReDatA, 1983)
Failure symptom: An identifiable physical condition by which a potential failure can be recognized
FFA: Functional Failure Analysis
FMECA: Failure Modes, Effects & Criticality Analysis. A procedure by which each potential failure mode in a system is analyzed to determine the results or effects thereof on the system and to classify each potential failure mode according to its severity (MIL-STD-1629A) [2]
FOM: Force Of Mortality. See $z(t)$
FSI: Functional Significant Items
HPP: Homogeneous Poisson Process
ICA: Independent Component Analysis
ICT: Information and Communication Technology
IID: Independent and Identically Distributed
IMS: Information Management System
Inspection: Activities such as measuring, examining, testing, gauging one or more characteristics of a product or service, and comparing these with specified requirements to determine conformity (ISO 8402) [3]
IO: Integrated Operations
MA: Moving average
Maple: Mathematical software developed by Waterloo Maple Inc.
MCSI: Maintenance Cost Significant Items
MIMOSA: Machinery Information Management Open Systems Alliance
MLE: Maximum Likelihood Estimator
MSI: Maintenance Significant Items
MTO: Man, Technology and Organization
NA: Not Available
NDE: Non Drive End
NHPP: Non-Homogeneous Poisson Process
NPP: Nuclear Power Plant
OLF: Norwegian Oil Industry Association (Oljeindustriens Landsforening)
PCA: Principal Component Analysis
PHM: Proportional Hazards Model
PM: Preventive Maintenance. The maintenance carried out at predetermined intervals or corresponding to prescribed criteria and intended to reduce the probability of failure or the performance degradation of an item (BS 4778) [49]
PROSMAT 2000: Norwegian Research Council funded programme within industry and energy
PSS: Process Shutdown System
P&ID: Process and Instrumentation Diagram
p-value: Significance probability for variables
RCM: Reliability-Centered Maintenance. A disciplined logic or methodology used to identify preventive maintenance tasks to realize the inherent reliability of equipment at least expenditure of resources (MIL-STD-2173(AS)) [1]
Reaction time: The time interval between observation of a potential required maintenance action and the actual moment (timestamp) of the required maintenance action
Required maintenance action: A replacement or an overhaul that cannot be delayed any further due to safety and functional requirements
RP: Renewal Process
SAP: A German company that delivers enterprise resource planning systems. SAP is also the name of their enterprise resource planning system
S-plus: Statistical software developed by Insightful Corp.
TeCoMan: A software toolbox developed by StatoilHydro for technical condition management
TCI: Technical Condition Index
Weibull PHM: A PHM model with a Weibull distributed baseline
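To make a few of the nomenclature entries concrete, the following minimal Python sketch evaluates the reliability function R(t), the hazard rate z(t) and the expected remaining useful life ERUL(t) for a two-parameter Weibull lifetime. It is an illustration only: the function names, the numerical integration scheme and the parameter values are assumptions made here, and the `phm_hazard` function uses the common textbook form of a Weibull proportional hazards model (baseline hazard multiplied by the exponential of a linear covariate term), which may differ in detail from the model fitted in the thesis.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)^beta) for a two-parameter Weibull lifetime."""
    return math.exp(-((t / eta) ** beta))


def weibull_hazard(t, beta, eta):
    """z(t) = f(t)/R(t) = (beta/eta) * (t/eta)^(beta-1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def phm_hazard(t, beta, eta, gammas, covariates):
    """Weibull baseline hazard scaled by exp(sum_j gamma_j * x_j) (assumed form)."""
    linear_term = sum(g * x for g, x in zip(gammas, covariates))
    return weibull_hazard(t, beta, eta) * math.exp(linear_term)


def expected_rul(t, beta, eta, span_factor=20.0, steps=20000):
    """ERUL(t) = (integral of R(x) from t to infinity) / R(t).

    The infinite upper limit is truncated at t + span_factor * eta and the
    integral is approximated with the trapezoidal rule.
    """
    upper = t + span_factor * eta
    h = (upper - t) / steps
    total = 0.5 * (weibull_reliability(t, beta, eta)
                   + weibull_reliability(upper, beta, eta))
    for k in range(1, steps):
        total += weibull_reliability(t + k * h, beta, eta)
    return total * h / weibull_reliability(t, beta, eta)


if __name__ == "__main__":
    beta, eta = 2.0, 100.0  # hypothetical values, not estimates from the thesis
    for age in (0.0, 50.0, 100.0):
        print(age, round(expected_rul(age, beta, eta), 2))
```

With a shape parameter above one, the printed ERUL values decrease with age, which is the decreasing mean residual life (DMRL) behaviour named in the notation list. With a shape parameter of exactly one the lifetime is exponential and ERUL is constant.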

1. Introduction

1.1. Background

It is more important than ever to exploit the resources on the Norwegian Continental Shelf fully, for maximum production and efficiency. This can be done in several ways: oil recovery can be increased and production can be accelerated. Development and implementation of new technology have made it possible to gain access to significant petroleum resources that were previously deemed unprofitable or impossible to produce, but new technology and new work processes are also needed to reduce the ongoing operation costs. Many fields on the Norwegian Continental Shelf will shortly enter the tail-end production phase, and unit costs will increase rapidly. This implies that the take-up of new concepts and technologies has to be accelerated.

To meet the requirement for improving the value creation potential on the Norwegian Continental Shelf, the Norwegian Oil Industry Association (OLF, Oljeindustriens Landsforening) initiated the project Integrated Operations (IO) in autumn 2004 [4]. IO is a two-stage project towards remote operation of offshore oil and gas fields, managed by people located in virtual operation centres, i.e. geographically dispersed centres that interact digitally. In the report Integrated Work Processes (2005) [4], OLF stated that one key function in cutting costs is optimization of maintenance efforts. The future of maintenance management will be to turn away from offshore-stationed personnel carrying out periodic, predetermined work schedules, or unplanned maintenance tasks when equipment fails, towards heavy use of condition-based maintenance (CBM).

In the first stage, all planning and preparatory work will be carried out in onshore operation centres that integrate onshore and offshore maintenance and logistics functions. The preventive maintenance process will be closely integrated and coordinated with other processes such as production and drilling. The CBM technique will be extended to types of equipment other than heavy rotating machinery, and the use of external specialists will increase. In the second stage, it is believed that new instrumentation will be cheaper to install, and that all systems benefiting from CBM will use this technique. To manage the vastly increased amount of data sent to shore, the onshore staff will use smart decision support software packages that reduce the large amounts of available data into aggregated features for decision making. Orders will then be sent to multidisciplinary roving teams or offshore operators as automatically generated maintenance work orders.

To be able to cope with the future described in IO, with increasing unit costs, it is important to accelerate the development of CBM techniques. This thesis deals with a sub-problem of CBM, focusing on data processing, or more specifically on remaining useful life estimation. Obviously, it is very important in the IO scenario to predict how much useful time is left before failures occur, given the current machine condition and the past and future operation profile. Information about the past and future environment may also be important.

Condition-based maintenance is a preventive maintenance program that is executed after a condition verification of the equipment indicates that maintenance has become necessary. The analysis consists of three main tasks: data acquisition, data processing and maintenance decision-making. Diagnostics and prognostics are two important aspects of a CBM program. Roughly speaking, diagnostics deals with fault processing, i.e. post-event analysis, whereas prognostics deals with fault prediction, i.e. prior-event analysis. Prognostics is therefore much more effective than diagnostics for achieving zero-downtime performance. However, diagnostics is required when fault prediction fails and root-cause analysis is needed for detection, isolation and identification of the fault.

The literature on prognostics and maintenance is rather limited. Often the modelling is based on a non-repairable item with one single dominant failure mode, where the physical parameters are easy to identify and model, and where the failure mode is independent of operational mode and stressors (Andersen [5]). Two main ways of thinking are described in the literature.

The first way of thinking assumes that the remaining useful life is independent of age and can be determined by condition monitoring. The time to a preventive maintenance action is the time until the condition monitoring measurement reaches a predefined threshold value (Pulkkinen [6]). Setting the correct threshold value is non-trivial. If the value is too low, too many preventive maintenance actions might be the result, whereas a limit that is too high might result in too many unexpected failures. A trial and error procedure is often expensive or unacceptable for safety-critical items.
Recommendations from manufacturers on acceptable thresholds for condition monitoring equipment might be useful here.

The second way of thinking is based on aging. The main assumption in a renewal process (RP) is perfect repair, meaning that the system is as good as new after the maintenance action is completed. When the non-homogeneous Poisson process (NHPP) is used, the maintenance action is assumed to be minimal, meaning that the reliability of the system immediately after the maintenance action is the same as it was just before the failure occurred. The renewal process and the NHPP may thus be considered as two extremes with rather unrealistic underlying assumptions. Systems subject to normal repair will be somewhere between these two extremes. A large number of models have been suggested for modelling imperfect repair processes; a survey of available models is provided by e.g. Hokstad [7]. These models are based on reliability data only, which makes them rather static, i.e. they do not take into account the item-specific lifestyle (technical condition) since new. However, a promising statistical model that combines aging with condition monitoring information in a regression model is the proportional hazards model with time-dependent covariates (Cox [8]).

Determining the remaining useful life of items that have several failure modes dependent on age and on several condition monitoring sources is a challenge. Ideally, each item should have an estimate of its remaining useful life for each failure mode, taking into account its age and its technical condition history. However, even if the data are sampled from several identical items, the number of documented mode-specific faults is often limited. A possible solution is therefore to look at several different failure modes at an aggregated system level. It is then important that the data merged from several identical items are homogeneous, meaning that the items and their failure modes are of the same type and that the operational and environmental stresses are comparable.

The work in this thesis does not contribute to the development of new theory of remaining useful life prediction, but to the research on when a highly aggregated repairable system subject to several failure modes should be repaired or overhauled. A suggestion of how to deal with partial or missing data is also proposed. The main objective of this thesis is to use an aggregation methodology for technical condition measures to calculate indicators of the time to the first major upcoming maintenance action.

1.2. The challenge of technical condition and remaining useful life modelling

Often the technical condition of the equipment cannot be fully observed or identified. What can be observed is only somehow related to the real equipment state. Exact knowledge of the technical condition with respect to all relevant degradation mechanisms is often too expensive to obtain in terms of labour hours and costs, and in many cases it is impossible to gather all information about the technical condition. Multiple sources of information may contain different partial information about the same equipment condition. The key is how to combine all the partial information in the model in the best way. The number of possible system states grows exponentially with the number of condition variables. To handle this dimensionality problem, which leads to unaffordable costs, transformation of the condition variables into a lower dimension is a possible solution. High correlation between condition variables also complicates the estimation. Multivariate analysis techniques such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA) have been used to handle high-dimensional data with a complicated correlation structure (Lin et al. [9], Duda et al. [10]).

The different failure modes might be associated with hierarchical structures such as high-level assembly failure modes, module-specific failure modes, part-specific failure modes and subpart-specific failure modes (Jardine et al. [11]). A tool for identifying the different failure modes in a hierarchical structure is Failure Modes, Effects & Criticality Analysis (FMECA). FMECA, as used in reliability-centered maintenance (RCM), is a qualitative tool for linking the hierarchical structure of equipment or a system to its failure modes. The compressors used as a case in this thesis have several failure modes and are critical for natural gas export. This thesis will therefore focus on compressors as the repairable system.

The challenge of technical condition modelling is to associate, for example, the symptoms related to production availability (which can be partly measured through inspections, condition monitoring, process information system parameters etc.) with the failure modes in a high-level assembly hierarchical structure. Few references exist on this topic in the literature. Desforges et al. [12] prototype a hierarchical process condition monitoring system in which control and monitoring systems utilize condition information from self-validating sensors (sensor level) to ensure continued process operation (process level). At the high-level assembly, the target is often to narrow down the list of plausible condition variables to a few significant ones. In Proportional Hazards Model (PHM) analysis, an integral part is the systematic and scientific discrimination between significant covariates and non-influential data.

To obtain a quantitative measure of the term Technical Condition for each level in the hierarchy, a measure named the Technical Condition Index (TCI) has been developed in the EUREKA project Aging Management [13] as part of the Norwegian Research Council funded programme PROSMAT 2000. The following definition is used:

The Technical Condition Index, denoted TCI, is defined as the degree of degradation relative to the design condition. It may take values between a maximum and a minimum value, where the maximum value describes the design condition and the minimum value describes the state of total degradation.

The evaluation of the technical condition of an item must be related to a particular property such as production availability, environment, costs etc. An engine, for example, might deliver sufficient horsepower but be completely degraded if the evaluation is done in an environmental context, because the NOx or CO values are too high. The calculation of the TCI values is based on an aggregation method. Since the technical condition values of smaller building blocks at the bottom of the hierarchy are aggregated to a higher level, the high-level assemblies will suffer from several failure modes. Details on how to calculate TCIs based on alarm limits of condition monitoring and process data, and how to calculate the damage code for notifications, are described in chapter 3, The Technical Condition Index (TCI).

Remaining useful life has always been a wanted quantity in maintenance planning. If the exact time before a failure occurs were known, maintenance planning would be simpler. The ideal maintenance decision would then be to plan a maintenance action on the equipment just before the timestamp of the failure, so that no remaining useful life is wasted.
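As an aside on the TCI definition and the aggregation idea described in this section, the following minimal Python sketch maps measurement readings to a normalized index and aggregates the results up a small hierarchy. Everything specific in it is an assumption made for illustration: the 0 to 100 scale, the linear transfer function between design value and alarm limit, the weighted-average aggregation rule, the weights and the readings are all hypothetical. The transfer functions and aggregation methods actually available in TeCoMan are not reproduced here.

```python
def linear_tci(reading, design_value, alarm_limit, tci_min=0.0, tci_max=100.0):
    """Map a reading to a TCI with a linear transfer function (assumed form).

    A reading at the design value gives tci_max (design condition); a reading
    at or beyond the alarm limit gives tci_min (total degradation).
    """
    fraction = (reading - design_value) / (alarm_limit - design_value)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp to the valid range
    return tci_max - fraction * (tci_max - tci_min)


def aggregate(node):
    """Aggregate TCIs bottom-up as a weighted average (assumed rule).

    A leaf node is {"tci": value}; an inner node is
    {"children": [(weight, child_node), ...]}.
    """
    if "tci" in node:
        return node["tci"]
    weighted = [(weight, aggregate(child)) for weight, child in node["children"]]
    total_weight = sum(weight for weight, _ in weighted)
    return sum(weight * value for weight, value in weighted) / total_weight


# Hypothetical readings, limits and weights for one compressor.
compressor = {
    "children": [
        (2.0, {"tci": linear_tci(80.0, 60.0, 110.0)}),  # bearing temperature
        (1.0, {"tci": linear_tci(30.0, 10.0, 50.0)}),   # vibration
        (1.0, {"tci": linear_tci(2.0, 0.0, 10.0)}),     # seal leakage
    ]
}

if __name__ == "__main__":
    print(aggregate(compressor))
```

Because inner nodes can contain other inner nodes, the same function aggregates from sensor level to any higher level of the hierarchy. A weighted average is only one possible choice; a worst-case (minimum) rule would be an equally simple alternative.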
Unfortunately, such a quantity is stochastic by nature, and a scientific approach is needed that incorporates maintenance event information and knowledge about the item's past, current and future lifestyle. In this thesis this lifestyle is quantified in terms of TCIs.

A proper definition of failure is crucial to establish a correct interpretation of remaining useful life. A failure can be defined as the condition in which the equipment is operating at an unsatisfactory level, or it can be a functional failure when the equipment cannot perform its intended function at all, or it can simply be a breakdown when the equipment stops operating, etc. A formal definition of a failure can be found in many reliability textbooks. In this work the remaining useful life is the time from today until the occurrence of the next upcoming event. The event is either a corrective maintenance action (CM), where the high-level assembly is unable to perform a required function and the system is restored to operating condition, or a preventive maintenance action (PM), where it is only known that the system survived up to the age of the preventive maintenance action (suspension). Missing data and wrongly documented or calculated data are also discussed and modelled.

Figure 1.1 and Figure 1.2 illustrate the connection between the remaining useful life (RUL), the Technical Condition Index (TCI) calculation and the historical data available. The TCIs are calculated at different aggregation levels in the TeCoMan (Technical Condition Manager) software. TeCoMan is a toolbox developed in the above-mentioned EUREKA project to give easy access to different information sources. It supports a range of different aggregation methods and functions for transferring measurement readings to TCI values. Information sources for the TCIs are operational conditions, condition monitoring data, inspection data in terms of notifications, and Information Management System (IMS) parameters. RUL quantification methods also require different documented historical information, depending on the basic modelling approach discussed in section 1.1, Background. Models based on both aging and condition monitoring (Figure 1.1) require additional reliability data in terms of operational time stamps of the historical events, while models based on condition monitoring only (Figure 1.2) require condition monitoring values at the time stamps of the historical events to quantify the threshold value. RUL estimation might be an extension of the existing version of the TeCoMan software, where trends of the remaining useful life of the items may be displayed together with existing trends of TCIs and indicated events.

The modelling in the case of both age and condition is based on proportional hazards models with time-dependent covariates, where the underlying process is a renewal process (RP). The proportional intensity model (Jiang et al. [14], Andersen et al. [15]), where the underlying process is an NHPP, has also been used. In the case of modelling based on condition monitoring only, the gamma process approach is used (Abdel-Hameed [16], van Noortwijk [17]).

Figure 1.1: Remaining useful life modelling based on both aging and condition. (Diagram: historical data sources SAP, IMS, Compass, oil analysis and other sources supply process data, condition monitoring data, notifications and maintenance actions (PM or CM) to the TeCoMan software, which calculates TCIs. The TCIs, together with observed failure times, censored failure times and unobserved information (truncation), feed proportional hazards modelling and remaining useful life estimation. The legend distinguishes the existing version of TeCoMan from its extension.)
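The condition-monitoring-only approach mentioned above, in which a gamma process models deterioration until a threshold is reached, can be illustrated with a small Monte Carlo sketch in Python. This is a sketch under stated assumptions rather than the estimation procedure of the thesis: the power-law shape function v(t) = a·t^b, the parameter values, the fixed (non-random) threshold and all function names are hypothetical, and the gamma parameter u is treated here as a true scale parameter (the second argument of Python's `random.gammavariate`).

```python
import random


def shape(t, a, b):
    """Assumed power-law shape function v(t) = a * t**b."""
    return a * t ** b


def first_passage_time(threshold, a, b, scale, rng, dt=1.0, max_steps=100_000):
    """Simulate one monotone deterioration path and return the first time
    (on a grid of step dt) at which it reaches the threshold.

    The increment over [t, t + dt] is gamma distributed with shape
    v(t + dt) - v(t) and the given scale, so the path never decreases.
    """
    level, t = 0.0, 0.0
    for _ in range(max_steps):
        increment_shape = shape(t + dt, a, b) - shape(t, a, b)
        level += rng.gammavariate(increment_shape, scale)
        t += dt
        if level >= threshold:
            return t
    raise RuntimeError("threshold not reached within max_steps")


def simulate_rul(threshold, a, b, scale, n_paths=2000, seed=0):
    """Return the mean and an empirical 5 %-95 % band of the first-passage time."""
    rng = random.Random(seed)
    times = sorted(first_passage_time(threshold, a, b, scale, rng)
                   for _ in range(n_paths))
    mean = sum(times) / n_paths
    lower = times[int(0.05 * n_paths)]
    upper = times[int(0.95 * n_paths) - 1]
    return mean, lower, upper


if __name__ == "__main__":
    # Hypothetical parameters: linear shape function, unit scale, threshold 50.
    print(simulate_rul(threshold=50.0, a=1.0, b=1.0, scale=1.0))
```

The spread between the lower and upper values gives a rough feel for the uncertainty in the time to reach the threshold. To express remaining life from a current condition rather than from new, the simulation would start from the current deterioration level and age instead of from zero.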

Models of the remaining useful life for failures with a single failure mode, including condition monitoring information, have been published. Wang [18] uses the conditional residual life concept, and Gebraeel et al. [19] use a Bayesian updating method to compute a residual-life distribution. Compared to the extensive literature on the application of Cox-based models in the biostatistics field, there is only a small but growing body of literature for maintenance applications. Banjevic et al. [20] discussed remaining useful life estimation for a Markov failure time process that includes a joint model of PHM and the Markov property for the covariate evolution as a special case. Vlok et al. [21] applied a proportional intensity model with covariate extrapolation to estimate bearing residual life.

Figure 1.2: Remaining useful life modelling based on condition monitoring only. (Diagram: historical data sources SAP, IMS, Compass, oil analysis and other sources supply process data, condition monitoring data, notifications and maintenance actions (PM or CM) to the TeCoMan software, which calculates TCIs. The TCIs, together with the observed failure TCI value, feed gamma process modelling, failure threshold value estimation and remaining useful life estimation. The legend distinguishes the existing version of TeCoMan from its extension.)

In RUL modelling based on condition monitoring only, finding the correct thresholds for corrective and preventive maintenance is important. Park [22] determined the preventive maintenance level for which the long-term expected average cost per unit time is minimal. Failure is detected immediately, and the failure level is a random variable. The equipment is replaced (perfect repair) instantaneously when its deterioration is beyond the preventive maintenance level or beyond the failure level, whichever occurs first. Bérenguer et al. [23] built models of continuous monitoring and perfect repair to find the preventive maintenance level for which the asymptotic unavailability is minimal.

Up to now, remaining useful life models have mainly been applied to a single failure mode of single components. Less developed are multi-failure-mode models of aggregated systems such as the natural gas export compressors in this thesis. Remaining useful life models often suffer from a lack of quality data due to incorrect data collection approaches and too few observed failures. Efficient communication between management, maintenance personnel and analysts is required to achieve a successful result. Since future approaches in the prognostic area also tend to be more complex, both in volume and substance, they need to be more user friendly. It is important that the users do not develop deep mistrust in the monitoring system whenever an unexpected fault occurs. If the estimate of the time to the next upcoming event is wrong and the fault has occurred unexpectedly, the diagnostic part should help the user in the root-cause analysis for detection, isolation and identification of the fault. In general, the developed modelling technologies should be precise, adaptive, comprehensible, and configurable (by the user).

1.3. Objective and solving approach

The main objective of this thesis is remaining useful life estimation of natural gas export compressors. In order to fulfil this objective, a detailed study of the technical condition determination methods used in the TeCoMan software is necessary. The TCIs are based on historical information used for control of the technical condition of critical items, obtained by methods such as vibration analysis (Compass), process parameters, and inspections (notifications). In the future, other new sources of information, such as performance monitoring and monitoring of particles in lubrication oil, might be included.
TCIs at an aggregated level (compressor level) will be a fusion of multiple data sources and depend on several failure modes. A literature survey of advanced regression models in survival analysis has to be performed to e.g. understand the significance of time-dependent lifestyle data in this analysis.

In order to fulfil the main objective, 4 sub-objectives are deduced:

1. Conduct a literature survey of research already performed related to maintenance prognostics with special focus on the main objective. A developed framework for the maintenance cycle in an opportunistic environment has to be discussed to show how a quantitative measure like remaining useful life fits into it. It is important to explain how the inputs and outputs of this quantity are dependent on the other steps in the cycle, and how the algorithm for remaining useful life is dependent on its input, which again affects decision making and data storage.
2. Utilization of information obtained at an actual plant. Technical condition monitoring and aggregation (multiple sensor data fusion).
3. Suggest how to estimate the remaining useful life based on:
   i. Both age and condition
   ii. Condition monitoring only, with utilization of aggregated or low-level TCI paths.
4. Use information and data gathered to estimate the remaining useful life. New information in terms of condition monitoring data, process data, inspections and
maintenance actions might improve the estimate.

Extracted data comes from a natural gas export compressor system at Kollsnes gas treatment plant west of Bergen. The plant was started up in The huge gas export compressors are chosen because they have a relatively long documented operational history. A brief introduction is given below. A further description of the system is given in chapter 4, Case Study - Natural gas export compressors.

Case: Natural Gas Export Compressors
Plant: Onshore Natural Gas Treatment Plant

6 3 The plant is expected to deliver Sm gas per day through export pipelines to the European continent and is designed for an operational lifetime of 5 years. The gas received from the production platforms in the North Sea is sent to the plant to be treated, compressed, cooled and metered before it is exported. Off-spec. gas is recycled until the required specifications are met. Cleaning pigs are sent through the pipelines on demand.

This study is limited to the export gas compressors in the compressor system. The original compressor system consists of 5 identical compressor trains including inlet separator, export gas compressor, gear, lubrication oil system, compressor motor (electrical) and outlet cooling fans, together with 2 sales gas metering stations and 2 pig launchers. The redundancy of the natural gas export compressor trains offers flexibility with respect to gas export and maintenance. It is possible to maintain one or more compressors without a complete plant shutdown. However, in recent years the export profile has changed dramatically and the redundancy opportunity windows have become much smaller, even though a sixth compressor was installed in 2005. Only the 5 original export gas compressors, which have a long maintenance and condition monitoring history, are studied.

Figure 1.3: A natural gas export compressor

1.4. Structure of report

The structure of the thesis is illustrated in Figure 1.4. It is structured in 4 major parts comprising 8 chapters. A literature survey has been undertaken for each chapter.

Part 1: A generic framework (Chapter 2): Here a general introduction is given to the maintenance cycle, which combines a qualitative and a quantitative approach. It is discussed how to find the maintenance significant items and how to optimise the time schedule for performing the maintenance tasks. In the latter case, the remaining useful life (RUL) is a wanted quantity. RUL is also a feedback measure of the quality of the performed maintenance tasks, since one important objective of maintenance is to extend the RUL. A discussion of the relation between the maintenance cycle and the needs in the Integrated Operations (IO) concept is presented in this chapter.

Part 2: The Technical Condition Index (TCI) methodology & Case study (Chapter 3 & 4): Based on the operational conditions of the natural gas export compressors, an overview of existing internal condition measurements and failure modes is given. From this overview, the TCI methodology, including technical details, expert judgement, and data details, is presented.

Part 3: Remaining Useful Life (RUL) estimation (Chapter 5 & 6): This part presents how to develop lifetime distributions and RUL estimates based on aggregated or low-level TCIs in the cases of:
   i. Both age and condition monitoring
   ii. Condition monitoring only
The expert judgement based approach used to select and weight contributing condition measurements is compared with the proportional hazards model approach. Parameter estimation is based on the method of maximum likelihood and the method of moments. Handling of occurrences where the exact failure time/value is not known, or where documented/calculated data is wrong, is also discussed.

Part 4: Modelling, results and discussion (Chapter 7 & 8): The last chapters utilize results and historical information from the previous chapters. The RUL estimates based on the different modelling approaches are implemented for the natural gas export compressors. The estimates depend on the item-specific lifestyle in terms of time-varying changes in the TCIs.
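
As a rough illustration of the condition-monitoring-only case named in Part 3, the sketch below simulates a stationary gamma process for a degradation measure (for instance the loss of an aggregated TCI) and estimates the RUL as the time until a simulated path first reaches a failure threshold. It is a minimal sketch under stated assumptions, not the model developed in the thesis: the function name `simulate_rul`, the parameter values, the threshold and the time step are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_rul(current_level, threshold, shape_rate, scale,
                 dt=1.0, n_paths=2000, max_steps=100_000, seed=1):
    """Monte Carlo RUL estimate for a stationary gamma process.

    Increments over a step dt are Gamma(shape_rate * dt, scale), so every
    simulated degradation path is non-decreasing. The RUL of a path is the
    first time it reaches `threshold`. All values are hypothetical.
    """
    rng = random.Random(seed)
    ruls = []
    for _ in range(n_paths):
        level, t = current_level, 0.0
        for _ in range(max_steps):
            if level >= threshold:
                break
            # gammavariate(alpha, beta): alpha is the shape, beta the scale.
            level += rng.gammavariate(shape_rate * dt, scale)
            t += dt
        ruls.append(t)
    ruls.sort()
    return {
        "mean": statistics.fmean(ruls),
        "p05": ruls[int(0.05 * (n_paths - 1))],  # lower error bound
        "p95": ruls[int(0.95 * (n_paths - 1))],  # upper error bound
    }


# Assumed example: degradation currently at 40, failure threshold at 100,
# mean degradation rate shape_rate * scale = 1.0 per time unit.
estimate = simulate_rul(current_level=40.0, threshold=100.0,
                        shape_rate=0.5, scale=2.0)
print(estimate)
```

The percentiles give a simple error band around the mean RUL. In practice the process parameters would have to be estimated from observed TCI paths, for example by the method of moments or maximum likelihood mentioned above.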

[Figure 1.4: Thesis structure. Diagram labels: Literature survey; General framework (Chapter 2); TCI methodology (Chapter 3); Case: Natural Gas Export Compressors (Chapter 4); Aging-based model for the remaining useful life (Chapter 5); Condition-based model for the remaining useful life (Chapter 6); Modelling and results (Chapter 7); Data basis (Appendixes); Conclusions and further work (Chapter 8).]

1.5. Limitations

This thesis does not cover the use of remaining useful life estimates in decision making. Maintenance decisions might be based on the RUL estimate only, but since the estimate cannot be 100% sure, other diagnostic information is often needed. In an opportunistic environment the RUL estimate will be an important input for solving the grouping of maintenance tasks. Utilization of maintenance personnel, maintenance tools, logistics support and spares is a comprehensive subject and is not a topic in this thesis.

2. Maintenance strategy including a description of IO

2.1. Introduction

Maintenance planning has traditionally been based on requirements in legislation (e.g. requirements for personnel safety and environmental protection), company standards, recommendations from manufacturers and vendors of the parts/systems (these recommendations are not necessarily designed to minimize downtime for the user), and in-house maintenance experience. There is a need for a generic framework for the maintenance cycle and for an evaluation of how a quantitative measure like remaining useful life fits into it. It is important to explain how the inputs and outputs of this quantity are dependent on the other steps in the cycle, and how the algorithm for remaining useful life is dependent on its input, which again affects decision making and data storage.

Maintenance theory and maintenance models should be used to develop system-specific maintenance strategies. The maintenance cycle in Figure 2.1 suggests using a combination of the qualitative reliability-centered maintenance (RCM) approach and a quantitative remaining useful life approach to develop a numerical/analytical method to predict the remaining useful life of parts/systems. As mentioned in chapter 1.2, The challenge of technical condition and remaining useful life modelling, knowledge of the remaining useful life is a very important element in maintenance planning.

In Figure 2.1 there are three typical main maintenance optimization tasks, which frequently need to be updated, analyzed and solved. The solutions to these problems, including constraints in the maintenance concept, will define the basis for the operational behaviour of the total system. The solutions to problem 2 and problem 3 build on the solution of the preceding problem 1.
Similarly, the optimal opportunity in problem 3 depends on the opportunities found in problem 2 and on other upcoming opportunities arising from system breakdowns.

Problem 1. Optimize the maintenance policy for each maintenance significant item in the total system. Here the maintenance policies must be analyzed, and life cycle costs of the maintenance significant items need to be calculated.

Problem 2. Optimize the maintenance program. In an opportunistic approach the main problem is to solve the grouping of maintenance tasks, which can utilize simultaneous downtime.

Problem 3. Optimize the decision whether a particular item needs opportunistic maintenance, and if so, how cost effective the opportunistic maintenance is compared to a later shutdown.

A fourth optimization problem is in the Maintenance Concept box, which includes utilization of maintenance resources such as maintenance personnel, maintenance tools, logistic support, spares etc. It also includes the maintenance organization, such as the administrative structure and maintenance decision makers. This problem justifies a PhD by itself and will not be discussed here.

[Figure 2.1: The maintenance cycle. Diagram labels: Prior knowledge (technical condition history, maintenance events, trend models, recommendations from manufacturers, etc.); Qualitative analysis: RCM (FMECA); Problem 1: Optimal maintenance policy; Direct maintenance costs; Indirect maintenance costs; Quantitative analysis: Remaining useful life estimation; Problem 2: Maintenance program, clustering of maintenance tasks; Problem 3: Finding best opportunity of component of interest; Maintenance concept (logistic support, maintenance decision, maintenance organization, etc.); Operation; Online/offline condition monitoring; Online/offline calibration monitoring; Data cleaning; Data assessment; Registration of maintenance events.]

The whole maintenance cycle has, at least to some degree, to be treated as an entity. Since RCM is a well established technique for developing an effective PM program that ensures that the inherent reliability of the system is maintained, it is a natural starting point in the framework for the maintenance cycle. The PM program is based on the assumption that the
inherent reliability of the part/system is a function of the design and quality. An effective preventive maintenance program will ensure that the inherent reliability is maintained. It is not true that the more a system is routinely maintained, the more reliable it will be. Often the opposite is the case, due to e.g. maintenance-induced failures.

Reliability-centered maintenance (RCM) is a systematic consideration of system functions, the way functions can fail, and a priority-based consideration of safety and economy that identifies applicable preventive maintenance tasks (Electric Power Research Institute [24]). The main steps of an RCM analysis are shown in Figure 2.2 (Rasmussen [25], Rausand & Høyland [26]).

1. Study preparation
2. System selection
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure Modes, Effects, and Criticality Analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating

Figure 2.2: The main steps of the reliability-centered maintenance (RCM) analysis

Starting with the Prior knowledge box in Figure 2.1, and focusing on the RCM approach, Prior knowledge corresponds to step 5 in the RCM analysis. Prior data like design data, recommendations from manufacturers, expert judgement, operational history data, and
reliability history data are needed for the functional failure analysis (step 3) in RCM. Reliability data has traditionally been required to decide the criticality, to mathematically describe the failure process, and to optimize the time between PM tasks. In addition, concomitant information in terms of technical condition history may be used. In some situations there is a lack of reliability data, concomitant information, or both, e.g. when the system is new. Helpful sources of information may then be experience data from similar equipment, recommendations from manufacturers, and expert data.

To find the maintenance significant items (MSI), that is, items which are regarded as functional significant items (FSI) and/or maintenance cost significant items (MCSI), a critical item selection is performed (step 4 of Figure 2.2). These maintenance significant items will be analyzed for assignment to an effective preventive maintenance policy. For non-maintenance significant items, or maintenance safe items, the corrective maintenance policy is the most effective choice. In the Failure Modes, Effects, and Criticality Analysis (FMECA) in step 6 of Figure 2.2, each of the maintenance significant items will be analyzed to identify potential failure modes and effects. Risk may also be used as a criterion for finding the maintenance significant items.

Since the main aim of equipment prognostics is to provide decision support for maintenance actions, it is natural to include maintenance policies in the consideration of the equipment prognostic process. This makes the situation more complicated, since extra effort is needed to describe the nature of maintenance policies. In the Optimal maintenance policy box of Figure 2.1, the optimal maintenance policy decision can be based on the decision logic in step 7 of the RCM approach.
The decision logic is first used to decide, for each dominant failure mode, whether a preventive maintenance policy is applicable and effective or whether it will be best to let the part/system deliberately run to failure. A preventive maintenance policy should prevent a failure, detect the onset of a failure, or discover a hidden failure. In some cases condition-based maintenance is a good option, provided the condition can be determined. The development of condition-monitoring methods and information technology, as well as the renewal of automation and information systems, provides new opportunities to implement better condition monitoring solutions. At the same time, with the rapid development of modern technology, products have become more and more complex while better quality and higher reliability are required. This makes the cost of preventive maintenance higher and higher. In some cases condition-based maintenance might handle the situation, provided the condition can be determined. Effective condition monitoring options are first identified, meaning that the diagnostic and prognostic capability has to be sufficient for the option to be considered as potentially useful.

To support the identification of effective preventive as well as predictive maintenance policies, a logical tree has been made (MSG-3 [27], Rasmussen [28], Rasmussen & Moen [29], Rasmussen & Rysst [30]). This logical tree might be an extension of the decision logic in RCM. The use of such a logical tree is a qualitative assessment method that screens out those maintenance tasks that do not qualify for a more detailed analysis involving quantitative performance assessments (Rosquist & Laakso [31]). A quantitative approach to finding the optimal maintenance policy could also be a life-cycle cost optimization problem, given the direct and indirect costs for each policy and the risk of the parts or systems.
The optimization of various maintenance policies (preventive, corrective and condition-based) is mainly based on:
- Direct maintenance (primary) costs
- Indirect maintenance dependent (secondary) costs
- Risk, which is equal to the criticality multiplied by the probability of failure (the criticality index is quantified with respect to cost, safety, environment etc.)
- Reliability
- Availability

The main idea of prognostics incorporating maintenance policies is to optimize the maintenance policies according to certain criteria such as risk, cost, reliability, and availability. Risk is defined as the product of probability and criticality. The idea behind risk is that a frequent occurrence of an unwanted event with low criticality can be accepted more readily than a frequent occurrence of a problem with high criticality. In the literature the criticality index is often quantified with respect to cost, and optimization is dominated by cost-based optimization.

The maintenance regime of complex systems most often consists of a variety of maintenance policies, like preventive maintenance, corrective maintenance and condition-based maintenance. A general procedure for selection of an effective maintenance policy for a piece of equipment (Problem 1) requires information about risks, costs, failure modes, failure effects, predictability of failures and reaction time. Predictability is a characteristic of a failure mode and gives information about the accuracy of failure time prediction (e.g. in case of a Weibull distribution the predictability is none if the shape parameter β ≤ 1, since the failure rate then does not increase with age). Reaction time is the time interval between the observation of a potential failure and the actual moment of functional failure (e.g., remaining useful life based on early fault detection covariates). An effective maintenance policy will e.g. optimize costs and at the same time ensure that the safety risk is acceptable. This is typically the criterion for critical equipment in a nuclear power plant. Both quantitative and qualitative information are needed (see Figure 2.3).
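
The risk definition above (probability of failure multiplied by criticality) can be made concrete with a small ranking example. This is a minimal sketch, assuming a cost-based criticality index; the failure mode names, probabilities and criticality values are invented for illustration and do not come from the case study.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    probability: float  # assumed probability of failure in the planning period
    criticality: float  # assumed cost-based criticality index

    @property
    def risk(self) -> float:
        # Risk as defined in the text: probability times criticality.
        return self.probability * self.criticality


# Hypothetical failure modes, for illustration only.
modes = [
    FailureMode("bearing wear", 0.10, 50.0),
    FailureMode("seal leakage", 0.30, 10.0),
    FailureMode("rotor imbalance", 0.02, 400.0),
]

# Rank the failure modes from highest to lowest risk.
ranked = sorted(modes, key=lambda m: m.risk, reverse=True)
for m in ranked:
    print(f"{m.name}: risk = {m.risk:.1f}")
```

In this invented example the least likely failure mode ranks highest because of its high criticality, while the most likely one ranks lowest, which is the trade-off the risk measure is meant to capture.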
A part or system for which corrective maintenance has been chosen has insignificant failure consequences for operation and safety, and is best left untouched unless failed. It is repaired when failed, but the system does not have to be down because of such failures. This part or system is maintenance safe. When the safety risk is acceptable and the operational risk (e.g. production availability) is undesirable, corrective maintenance will be considered as an option.
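
One possible reading of the selection procedure described here and drawn in Figure 2.3 is sketched below. The wiring of the decision questions is an assumption reconstructed from the figure labels and the surrounding text, and the function name `select_policies` and its boolean inputs are hypothetical. The final choice among the returned options (e.g. by cost optimization with constraints) is not modelled.

```python
def select_policies(risk_acceptable,
                    operational_risk_undesirable,
                    predictable_from_reliability_data,
                    deterioration_measurable,
                    reaction_time_acceptable):
    """Return candidate maintenance policies for one failure mode.

    Assumed reading of the flowchart: an acceptable risk means the item is
    maintenance safe; otherwise each question that is answered with yes
    adds one policy to the set of options, and an empty set means that the
    part or system has to be modified.
    """
    if risk_acceptable:
        # The item is maintenance safe: select corrective maintenance.
        return ["corrective"]

    options = []
    if operational_risk_undesirable:
        options.append("corrective")
    if predictable_from_reliability_data:
        options.append("time-based")
    if deterioration_measurable and reaction_time_acceptable:
        options.append("condition-based")

    # No effective policy exists: modification is necessary.
    return options if options else ["modification"]


# Hypothetical item: risk not acceptable, failures not predictable from
# reliability data, but deterioration is measurable with enough warning.
print(select_policies(False, False, False, True, True))
```

Keeping the set of options separate from the final choice mirrors the two stages in the text: a qualitative screening first, then a quantitative comparison of the remaining policies.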

[Figure 2.3: Maintenance policy selection (Woud et al. [32]). Flowchart labels: Maintenance significant item; Failure mode + failure effect + properties; Risk acceptable? (YES: the item is maintenance safe, select corrective maintenance); Operational risk undesirable? (YES: add corrective maintenance to set of options and continue); Predictable from reliability data? (YES: add time-based maintenance to set of options and continue); Deterioration measurable and reaction time acceptable? (YES: add condition-based maintenance to set of options and continue); Set of effective policies; Is set of effective policies empty? (YES: modification; NO: best task identification by e.g. cost optimization with constraints).]

For those parts or systems whose failures may result in economic or safety hazards, properties like predictability (diagnostic and prognostic capability) and reaction time will be used to decide whether the preventive maintenance activities should take place at predetermined time intervals (time-based or age-based policy), or on a prognostic scheduled basis in case of condition-deterioration monitoring (remaining useful life estimation based on e.g. the technical condition indicators and historical events). An effective maintenance policy is one that can bring the risks into an acceptable area. If there are no effective policies, modification of the part or system is necessary. Cost optimization with constraints might be used to find the most cost-efficient maintenance policy.

In the Maintenance program box in Figure 2.1, the qualitative analysis is then used to develop a preventive maintenance program. One limitation of RCM is that it does not give support for e.g. optimizing maintenance intervals, which in turn would improve availability. The Quantitative analysis box in Figure 2.1, which is remaining useful life estimation, might be a helpful tool for this as an extension of RCM.
For some of the preventive maintenance tasks, the remaining useful life analysis together with direct and indirect maintenance costs could be used for e.g. determination of optimal maintenance intervals. In practice the various maintenance tasks have to be grouped (clustered) into maintenance packages that are carried
out at the same time, or in a specific sequence. The stationary grouping of maintenance packages is based on a long-term stable situation with an infinite horizon (Problem 2).

In the Finding the best opportunity for component of interest box, the optimizing question could be whether to perform maintenance at the first upcoming opportunity, to defer it to the next, or to create a new opportunity. Dynamic grouping uses short-term information over a short-term horizon and is a difficult and complex task, for which remaining useful life knowledge will be helpful.

In the Maintenance concept box, requirements for logistic support (tools, spares, documentation etc.) are included. It also includes utilization of maintenance resources such as maintenance personnel, and the maintenance organization such as the administrative structure and maintenance decision makers.

In step 12 of the RCM analysis (Figure 2.2) the updating process should be concentrated on the three major time perspectives:

1. Short-term internal adjustments
2. Medium-term task evaluation
3. Long-term revision of the initial strategy

This corresponds to the feedback loops in Figure 2.1. The long-term revision should consider all steps in the RCM analysis. It is not sufficient to consider only the system being analyzed. It is required to consider the entire plant with its regulations from the outside world.

The medium-term update should carefully review the basis for the selection of maintenance tasks in step 7 of Figure 2.2. Analysis of maintenance experience may identify significant failure causes not considered in the initial analysis (documentation of new required maintenance actions), requiring an updated FMECA (step 6 of Figure 2.2) and an updated remaining useful life estimate. This update is not performed as frequently as the short-term internal adjustments and is shown as the slow outer feedback loop in Figure 2.1.
The short-term update may be considered as a revision of previous analysis results. The new input to the remaining useful life analysis is updated failure information like technical condition history. This is shown as the fast inner feedback loop in Figure 2.1. Only steps 5 to 8 in the RCM process will be affected by short-term updates. Special focus has to be set on step 8 (determination of maintenance intervals).

High quality of data is needed for analysis in the maintenance cycle. Wrongly documented data from inspections, work orders, and online condition monitoring will lead to sub-optimal models and, in turn, to sub-optimal decisions. To avoid unnecessary manipulation of imperfect data, guidelines for data cleaning routines and online or offline calibration monitoring are often needed before data is stored.

2.2. Integrated Operations (IO)

The relation between the maintenance cycle and the wider scope of the IO concept, with respect to maintenance and work processes, has to be explained in more detail. As the mature fields on the Norwegian Continental Shelf will shortly enter the tail-end production phase, the unit costs will increase rapidly. This is critical on a competitive level, where we see that oil-production costs outside the Norwegian Continental Shelf are lower. By
reducing the operational activities offshore and developing more efficient work processes focusing on core activities to maintain production and safety, the economic lifetime of the field can be extended during tail-end production. A report published by OLF shows that IO could increase value creation by at least 250 × 10^9 NOK by 2015. To wait three years before implementation, would mean that industry and society would manage to extract only NOK in value creation [33].

Smart use of real-time monitoring, real-time control, visualization and new work processes based on Information and Communication Technology (ICT) will constitute an important driver for increased efficiency. A continuous stream of real-time data between offshore and onshore installations through optical-fibre cables will open for close collaboration. The real challenge is to utilize the best available real-time technology in integrated teams independent of location. Today, there are already remotely monitored platforms in use, like Hod (which is operated from Valhall) and Tambar (which is operated from Ula). Successful implementations of integrated work processes have integrated suppliers' condition monitoring centres with operators' onshore support centres and offshore control rooms. Continuous monitoring of load, temperature, and vibration has for instance resulted in a substantial lengthening of maintenance intervals and lifetime of equipment such as turbines and valves, and in a corresponding reduction of maintenance costs.

It is expected that the integrated work processes will most likely be implemented in two stages, i.e. first by Generation 1 (G1), then by Generation 2 (G2). Both generations will change existing work processes profoundly. In the traditional practices most operative decisions are made offshore, in isolation or with limited support from experts onshore.
Plans are relatively rigid and primarily changed at fixed intervals. Personnel onshore and offshore belong to several different units, often with different goals. Maintenance tasks are currently carried out periodically based on expected degradation models (preventive maintenance), or unplanned when equipment fails (corrective maintenance). To manage and monitor the work, work orders are issued, followed up and closed. The process is supported by modern enterprise resource planning (ERP) solutions and technical documentation systems, but resources are still used inefficiently. The activity plans are only loosely integrated with other work processes like drilling, production, and injection. Condition-based maintenance is mostly used for heavy rotary equipment and is often based on manual gathering of data.

[Figure 2.4: Integrated work processes implemented in 2 steps. Axes: value against time. Stages shown: Traditional practices (limited integration; self-sustainable fields, specialized onshore units, periodic onshore support); Generation 1 (integration across on- and offshore; integrated onshore and offshore processes and centres, continuous onshore support); Generation 2 (integration across companies; integrated operator and vendor centres, automated processes, digital services and operations 24 hours a day, 7 days a week).]

Today some oil companies are running pilots, whilst others are well into the implementation phase of G1 processes, to investigate the improvement potential associated with IO. In G1 all planning and preparatory work will be carried out onshore, and the preventive maintenance process will be closely integrated and coordinated with other work processes. It is expected that the condition-based maintenance techniques developed in the 1990s and early 2000s will be extended to other types of equipment than heavy rotary equipment, like valves, separators and so on. Tools for online monitoring of technical condition are needed. Decision-making will be left to onshore operation centres and supporting specialists with real-time access to information about vibrations, other plant parameters, lubrication oil, ultrasound, thermography, performance, strain and corrosion data etc., and with access to real-time estimates of the remaining useful life, root cause analysis of the data and online collaboration with the operators' operation centres.

In the G2 processes fields will be managed by people located in virtual operation centres, i.e., in geographically dispersed centres that interact digitally. The centres will be operational 24 hours a day, 7 days a week and will, to avoid information overload, make extensive use of tools for automatic filtering of information and automation of processes and decisions. To ensure that the team members collaborate well, goals will be aligned.
Preparations for maintenance, actual maintenance, modification, and repairs will be performed by multidisciplinary roving teams. They will do all planning onshore and freeze the plans for the final weeks before they go offshore, so that they know exactly what will be expected of them and can check that they have spare parts and equipment in place in good time before they arrive. Plans will be integrated with other work processes. The onshore planning process will be supported by offshore staff using portable video conference equipment and up-to-date 3D models of the platform to collaborate with onshore. During offshore visits the roving team will communicate with experts globally through mobile systems that provide the team members with exactly the data, information, and tools they need to carry out the plans. A key
factor will be the ability to plan ahead and predict failures before they occur, and to plan for timely intervention before a disruptive unplanned stop occurs.

There are several very complicated challenges in IO. Inventory mapping and gap analysis have identified the following topics to be of particular relevance for further co-ordinated research and development (Brekke, 2005) [34]:
- Information management, visualization and communication.
- Man, technology and organization (MTO) challenges.
- Condition monitoring and diagnosis.
- Technical integrity and safety.

The CORD projects (Coordinated Operation and maintenance offshore Research and Development) address the above challenges and constitute a close co-ordination and cooperation between the operators on the Norwegian Continental Shelf and Norwegian research institutes. The work in this thesis is based on needs recognized in CORD and on the fact that further research on the prognostic part of the condition-based maintenance program is needed.

In integrated operations (IO), reliability data and technical condition data should be shared across disciplines, different organization units and different geographical places in real time. Standardization and storage of these building blocks are highly needed (OLF [35]). The data standard ISO Integration of life-cycle data for process plants including oil and gas production facilities is constructed for integration of data across data systems and disciplines. This standard is supposed to ensure that the data model is generic enough to cope with all applications through a very generic set of entities.

Interesting alliances of organisations are the Machinery Information Management Open Systems Alliance (MIMOSA) and the Open System Architecture for Condition-Based Maintenance (OSA-CBM). The initial efforts of OSA-CBM were aimed at developing open and interchangeable systems that cater to the maintenance arena (specifically CBM systems).
OSA-CBM has developed a seven-layered architecture that encompasses the typical stages in the development, deployment, and integration of maintenance solutions under the CBM framework (Figure 2.5). The primary objective of MIMOSA is to project maintenance management as a business function that operates on business objects with well defined properties, methods and information interfaces [36].

Figure 2.5: OSA-CBM architecture (Mitchell et al. [36]). The layers, listed from top to bottom, are connected through a common communication network:
- Presentation: the man/machine interface. It may query all other layers.
- Decision support: utilizes spares, logistics, manning, etc. to assemble maintenance options.
- Prognostics: considers health assessment, employment schedule, and models/reasoners that are able to predict future health with certainty levels and error bounds.
- Health assessment: the lowest level of goal-directed behaviour. Uses historical data and condition monitor values to determine current health. Multi-site condition monitor inputs.
- Condition monitoring: gathers signal processing data and compares it to specific features. Highest physical site-specific application.
- Signal processing: provides low-level computation on sensor data.
- Sensor module: data acquisition converts and formats the analog output from the transducer to a digital word and may incorporate metadata. The transducer converts some stimuli to electrical signals for entry into the system.

3. The Technical Condition Index (TCI)

To be able to cope with the increasing amount of condition information and to have sufficient control over the technical condition of the system and its sub-systems, a measure named Technical Condition Index (TCI) has been developed in the EUREKA project Ageing Management ( ) [37]. The project's hypothesis was: by developing a new, reliable variable, named TCI, which is only affected by changes in the system's technical integrity, the organization will be alerted to developing problems much earlier than by using traditional indicators like regularity, budgets and accounts, accident and incident statistics, and environmental emissions. Thereby management is given the ability to take the necessary actions. The traditional indicators are burdened with a low sensitivity with regard to technical condition, since there may be a long period between the point in time where the technical integrity of the system is substantially reduced and the point where these indicators alert the organization. This is illustrated in Figure 3.1.

Figure 3.1: Traditional key indicators used at management level (with TCI included). [Curves for regularity, safety, technical condition, and costs plotted against operational time, showing the delay of detection.]

The information acquired to obtain relevant technical condition indicators in general comes from three sources (see Figure 1.1): process data available from the IMS, subjective malfunction reports from inspections, and condition monitoring data (on-line and off-line). In some cases the relation between load and degradation mechanisms is known and can be modelled. In such cases process data may be used analytically to calculate a technical condition (Samdal [38]). However, degradation models address relations at a very low level. It is realized that inspections of the equipment constitute a significant source of information. This information needs to be quantified and (automatically) stored. Condition monitoring data are versatile. They may consist of a single value, a time series, or a more complicated data set. Work orders are part of the CBM program but they are not a part of the TCI information. The TCIs are supposed to quantify the technical condition of the physical asset and are part of the data acquisition step and the data processing step of CBM.

In the EUREKA project [37] the maximum value (design condition) of the TCI was set to 100, and the minimum value (state of total degradation) was set to 0. The design condition was taken as a reference in order to make the technical condition independent of the demands of the system in question. The design is fixed, whereas usage might change over time, making comparisons of TCIs difficult. As easy as it is to define the perfect condition, as difficult it is to define an item's state of total degradation. The evaluation of technical conditions depends very much on the applied context. The technical condition should therefore be related to a certain context rather than express an absolute technical condition. In the project five principal contexts were identified:
1. Safety
2. Environment
3. Availability
4. Man-hours (man-hours for maintenance, i.e. the required maintenance organizational load)
5. Costs (investment for renewal etc.)
Due to the large production volume in the system used as case in this thesis, small deviations in capacity and downtime can have substantial economic impacts. It was therefore decided to start out with the TCI parameter describing the ability to keep the contractual availability. The TCI with respect to production availability was selected to be a proper measure. To be able to give sufficient and efficient decision support (the last step of the CBM programme), aggregation of TCIs from smaller building blocks in a hierarchy which represents the actual industrial system might be a solution. The aggregation method used in the case comprises the following steps:
- Establish a hierarchy of objects which represents the actual industrial system.
- Assign a weight to each of the objects according to their context criticality.
- Assign relevant input variables, which characterize the technical condition of objects (mainly at the bottom level).
- Based on values of the input variables (inspection data in terms of notifications, condition monitoring data, and process data), aggregate the TCI values upwards in the hierarchy.
The establishment of the hierarchy is an important task, which should be done in co-operation with operational and maintenance personnel. Two principal approaches may be used:
- Split of system functions into sub functions on an increasing level of detail.
- Break down of systems into subsystems and sub subsystems, and so on.

The approach of splitting system functions into sub functions is used, for example, in criticality analysis in the context of RCM (Figure 2.2). An advantage is that influences of degradation at lower levels on the functional availability at a higher level are seen more easily. The hierarchy will, however, contain many levels, which in turn makes it difficult to follow the branches. Another disadvantage of splitting system functions into sub functions is the fact that equipment often provides more than one function, i.e. functions are not defined as a strict hierarchy, but as a functional net. If a net is converted to a hierarchy, one item may occur several times in that hierarchy. A break down of systems into subsystems is more easily established down to any desired level by answering the question "What parts does this system consist of?" A disadvantage is the lack of link to the functions provided by the system. Therefore neither a strict functional split into sub functions nor a strict break down of systems into subsystems was used in the analysis performed in the EUREKA project [37], but both forms were applied where appropriate. Another constraint was to allow for an existing tag system to be reflected in the hierarchy used to calculate the technical condition index (Andersen & Rasmussen [13]). Prognostics is a technique for maintenance decisions in a CBM programme, and TCIs aggregated to different levels in an abstraction hierarchy might be the foundation for more high-level aggregates like remaining useful life of multi-failure mode critical systems.

3.1. Determination of an item's technical condition from available information

The TCI value is not a 100% exact technical condition parameter. The fuzziness of the subject may be compared to knowledge-based diagnostic systems. Although those systems rarely conclude with 100% certainty, they have proven to be useful. 
Another important argument against exact determination of technical condition is the enormous amount of work needed to study all relevant degradation mechanisms. In the case studied in chapter 4, Case Study - Natural gas export compressors, a combination of expert knowledge (malfunction reports) and available data from existing condition monitoring and maintenance management systems is used to determine an item's technical condition. The five compressors KA31, KA41, KA51, KA61, and KA71 are identical, also in terms of condition monitoring and process monitoring instrumentation. The only difference is the tag number. In the following chapters the tag numbers of compressor KA41 are used.

Application of condition monitoring

Condition monitoring data constitute an important and obvious source of information. They are usually stored in specialized condition monitoring systems. All measured values are transferred into TCI values by transfer functions utilizing their corresponding alarm limit values defined in IMS, P&ID, or set by experts from their operational experience. In the compressor case vibration data are collected from the Compass monitoring system. Both online and offline values are collected and transferred into TCI values. For the online part only the total vibration level (root mean square amplitude) is calculated and stored once a day. If the vibration amplitude is below a specific value, e.g. the hi alarm limit, a bearing is considered to be in a good-as-new condition, whereas if the vibration amplitude is above some higher value, e.g. the hihi alarm limit, the bearing is considered to be worn out and has to be replaced. The corresponding TCI may then be found by e.g. a linear relation between the two states as shown in Figure 3.2.

Figure 3.2: Linear relation between vibration amplitude and the technical condition index. [TCI plotted against the measured value, falling linearly from good condition to poor condition.]

It is possible to use different transfer functions to describe the relation between measured values and TCI values, e.g. if the condition monitoring parameter is equipped with both hi, hihi and lo, lolo alarm limits. A list of available transfer functions is given in Appendix B, Available transfer functions in TeCoMan. Calculation of the transfer function of the axial distance probe measure of tag GT7171A of compressor KA41, utilizing its corresponding alarm limit values, is shown in Table 3.1 and Figure 3.3. The value is calculated in µm (10^-6 m).

Table 3.1: Axial probe limit values and corresponding TCI. [Columns: LoLo limit, Lo limit, Hi limit, HiHi limit. Rows: axial distance in µm (10^-6 m) and TCI.]

Figure 3.3: The relation between the axial distance probe measure (unit 10^-6 m) and the TCI (Technical Condition Index). Between the axial distance probe measures given in Table 3.1, a linear interpolation is used. For distances outside the given interval, extrapolation to the relevant end-point is used. The transfer function is called a 4 point saddle transfer function. A description of this transfer function is given in Appendix B.

In this case the transfer function is a 4 point saddle transfer function. The TCI is therefore 0 at or below -45 µm and at or above 45 µm. Calculation of the transfer function of the radial accelerometer measure of tag YT7171A of compressor KA41, utilizing its corresponding alarm limit values, is shown in Table 3.2 and Figure 3.4. The peak to peak (p-p) vibration amplitude is calculated in µm (10^-6 m).

Table 3.2: Radial probe limit values and corresponding TCI. [Columns: normal working value, Hi limit, HiHi limit. Rows: vibration amplitude in µm p-p and TCI.]

Figure 3.4: The relation between the radial accelerometer vibration amplitude (unit 10^-6 m) and the TCI (Technical Condition Index). Between the radial accelerometer vibration amplitudes given in Table 3.2, a linear interpolation is used. For vibration amplitudes outside the given interval, extrapolation to the relevant end-point is used. The transfer function is called a Down slope transfer function. A description of this transfer function is given in Appendix B.

In this case the transfer function is a Down slope transfer function. The TCI is therefore 100 at or below 2 µm p-p and 0 at or above 11 µm p-p. A typical radial accelerometer measure for a perfectly balanced system consisting of compressor, gear, and electrical motor is 25 µm p-p. If the system is not perfectly balanced after a maintenance action, the vibration level is a little higher ( < 4 µm p-p).

Application of inspections (notifications)

Significant knowledge of the plant's technical condition is often gained during daily observations and periodic inspections. In order to also include subjective knowledge in the TCI value, a notification procedure had to be developed. To exploit the knowledge gained at e.g. inspections, a special reporting form also had to be developed. The basic idea is that the technical condition is considered to be as good as new as long as nothing is reported. However, if a malfunction, deviation, or damage is found, this is reported as a notification, which is entered manually in SAP (1). In this notification the tag number, the code for the deviation or damage to be reported, and the code for the failure mode have to be reported to reduce the TCI according to defined rules. A unique form and guideline for allocation of deviation or damage codes has been developed for each major equipment class like valve, pipeline, fan, compressor, etc. The deviation or damage code is classified in three levels. The level-code labels U, L, and M/H are just internal codes in the SAP system.
- Minor (U) will give an initial transition to TCI = 80
- Major (L) will give an initial transition to TCI = 60
- Unacceptable (M/H) will give an initial transition to TCI = 40

1 SAP is a German company that delivers enterprise resource planning systems. SAP is also the name of their enterprise resource planning system, which StatoilHydro is using.

The failure modes are more sophisticated and are meant to adjust the severity of the initial deviation or damage code. Each failure mode is assigned a weight that describes the importance of the deviation or damage code with respect to functional availability. Figure 3.5 presents examples of weights of failure modes.

Figure 3.5: Example of weights of failure modes (SAP notification, malfunction report, damage codes: TCI weights for gas compression system). [For each failure mode code the table gives a weight for the general failure mode and for the equipment classes pipeline, valve, fan, compressor, pump, and drum. The codes are: A Erratic output; B Abnormal instrument signal; C External leakage; D Internal leakage; E Physical damage; F Fail to operate on demand; G Damaging impact from process environment; H Operational failure; I Packing, fouling; J Minor failure, defect or deviation; K Vibration noise; L Loss of safety protection; M Reduced safety protection; Y No defects, deviations, faults, damages; Z Other.]

A more detailed description of each SAP code has also been made to help operators in categorizing the notification (Figure 3.6). 
Code | Description | Usage
A | Erratic output | Erratic capacity, performance, quantity, flow
B | Abnormal instrument signal | False alarms, deviation in measured value/signal/response time/effect/frequency
C | External leakage | Leakage from equipment to environment or from environment to equipment
D | Internal leakage | Internal leakage (no environmental leakage) in valves/compressors/pumps/etc.
E | Physical damage | Rupture/crack/deformation/corrosion/erosion/wear
F | Fail to operate on demand | Equipment doesn't start/close. Erratic process control. Equipment is stuck. Safety equipment fails to operate on demand.
G | Damaging load that may cause failure or damage | Load in main media: pressure shock, temperature, rotational speed, pH, unwanted fluid in gas, unwanted gas in fluid, unwanted corrosion products, unwanted water/sand/salts/CO2/H2S/etc.
H | Operational failure | Unintended stop/trip/closure/power failure
I | Packing, fouling | External/internal scale/layer/foreign object
J | Minor failure, defect or deviation | First aid maintenance: minor defect/loosening equipment/malfunction/discoloration/adjustment/calibration/lubrication
K | Vibration noise | Observation of vibration or noise
L | Loss of ex protection | Loss of/reduced overpressure
M | Reduced safety protection | Cover/isolation/jacketing/corrosion protection/HMS protection
Y | No defects, deviations, faults, damages | When there is no deviation/defect/fault/damage to notify from an inspection
Z | Other |
Figure 3.6: Description of SAP codes

The final TCI value is then calculated using the failure mode and the deviation or damage code. For instance, the initial TCI value for a notification with deviation or damage code (U) will be 80. If the same notification has the failure mode Abnormal instrument signal (B) for a compressor, the importance weight is 25, i.e. a factor of 0.25. The final TCI for the notification is calculated to be:

$$\mathrm{TCI} = 100 - (100 - 80) \cdot 0.25 = 95$$

It is important to note that the technical condition is considered to be as good as new as long as nothing is reported or when the notification is set to status Notification Complete. If several notifications based on different deviation or damage codes and different failure modes are reported over time, the TCI decreases will be added to each other, as shown in Figure 3.7.

Figure 3.7: Calculation of TCI values when several different notifications are reported over time. [TCI plotted against time for notifications with the failure modes Abnormal instrument signal and Physical damage; levels U -> TCI = 80, L -> TCI = 60, M/H -> TCI = 40.]

Notifications will be completed by maintenance actions. The effect of maintenance actions is then shown as an increased TCI value. A major work order will often have several notifications belonging to it. A simultaneous completion of several notifications will be seen as a major increase in the TCI.
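The notification rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TeCoMan or SAP implementation: the function names are hypothetical, the TCI scale is assumed to run from 0 to 100 with initial values 80, 60, and 40 for the three levels, the failure mode weight is passed as a fraction (0.25 for a weight of 25), and flooring the combined TCI at 0 is an added assumption. The second notification in the usage example (level L with weight 0.5) is invented for illustration.

```python
# Initial TCI assigned by the deviation/damage level (assumed 0-100 scale).
INITIAL_TCI = {"U": 80.0, "L": 60.0, "M/H": 40.0}


def notification_decrease(level: str, weight: float) -> float:
    """TCI decrease caused by one open notification.

    `weight` is the failure mode importance weight as a fraction
    (0.25 for a weight of 25).
    """
    return (100.0 - INITIAL_TCI[level]) * weight


def notification_tci(level: str, weight: float) -> float:
    """Final TCI of a single notification: 100 - (100 - initial) * weight."""
    return 100.0 - notification_decrease(level, weight)


def combined_tci(open_notifications: list[tuple[str, float]]) -> float:
    """TCI of an item with several open notifications.

    The individual decreases are added. Completed notifications are simply
    left out of the list. Flooring at 0 is an assumption of this sketch.
    """
    total = sum(notification_decrease(lvl, w) for lvl, w in open_notifications)
    return max(0.0, 100.0 - total)


if __name__ == "__main__":
    # Worked example from the text: code U, failure mode B on a compressor.
    print(notification_tci("U", 0.25))               # 95.0
    # Hypothetical second open notification: level L with weight 0.5.
    print(combined_tci([("U", 0.25), ("L", 0.5)]))   # 75.0
    # No open notifications: the item counts as good as new.
    print(combined_tci([]))                          # 100.0
```

Removing a notification from the list when it reaches status Notification Complete reproduces the jump back towards 100 that follows a maintenance action.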

Application of process data

The information management system (IMS) holds a huge amount of data. The information from this system is handled in a similar manner to the condition monitoring data. All measured values are transferred into TCI values by transfer functions utilizing their corresponding alarm limit values defined in IMS, P&ID, or set by experts from their operational experience. An example of calculation of the transfer function of the temperature measurements from tag TE719A of compressor KA41, utilizing its corresponding alarm limit values, is shown in Table 3.3 and Figure 3.8.

Table 3.3: Bearing temperature limit values and corresponding TCI. [Columns: normal working value, Hi limit, HiHi limit. Rows: temperature in °C and TCI.]

Figure 3.8: Transfer function for compressor bearing temperature, e.g. tag TE719A of compressor KA41. Between the temperatures given in Table 3.3, a linear interpolation is used. For temperatures outside the given interval, extrapolation to the relevant end-point is used. The transfer function is called a Down slope transfer function. A further description of this transfer function is given in Appendix B.

In this case the transfer function is a Down slope transfer function. The TCI is then 100 at or below 75 °C and 0 at or above 121 °C. Each compressor is equipped with double, dry gas seals to prevent process gas from reaching bearings or the atmosphere.
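The transfer functions used in this section are all piecewise linear, with the TCI held at the end-point value outside the outermost limits. A minimal Python sketch of that idea follows. It is not the TeCoMan implementation: the function name `transfer_tci` and the breakpoint-list representation are hypothetical, and the TCI scale is assumed to run from 0 to 100. The end-points (75 °C, 121 °C, -45 µm, 45 µm) are the values quoted in the text, while the intermediate breakpoints marked in the comments are illustrative placeholders, because the table values are not available here.

```python
from bisect import bisect_right


def transfer_tci(value: float, breakpoints: list[tuple[float, float]]) -> float:
    """Map a measured value to a TCI by piecewise linear interpolation.

    `breakpoints` is a list of (measured value, TCI) pairs sorted by measured
    value. Outside the outermost breakpoints the TCI of the nearest
    end-point is returned.
    """
    xs = [x for x, _ in breakpoints]
    if value <= xs[0]:
        return breakpoints[0][1]
    if value >= xs[-1]:
        return breakpoints[-1][1]
    upper = bisect_right(xs, value)  # index of the first breakpoint above value
    (x0, y0), (x1, y1) = breakpoints[upper - 1], breakpoints[upper]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)


# Down slope for bearing temperature in deg C. The middle breakpoint
# (100 deg C -> TCI 50) is a hypothetical Hi limit, not a value from Table 3.3.
BEARING_TEMP = [(75.0, 100.0), (100.0, 50.0), (121.0, 0.0)]

# 4 point saddle for the axial probe in micrometres. The inner limits
# (-20 and 20) are hypothetical; only the outer limits are quoted in the text.
AXIAL_PROBE = [(-45.0, 0.0), (-20.0, 100.0), (20.0, 100.0), (45.0, 0.0)]

if __name__ == "__main__":
    for temperature in (70.0, 80.0, 95.0, 130.0):
        print(temperature, transfer_tci(temperature, BEARING_TEMP))
    for distance in (-50.0, 0.0, 32.5):
        print(distance, transfer_tci(distance, AXIAL_PROBE))
```

Representing every transfer function as a sorted list of breakpoints keeps the down slope and saddle shapes in one code path; only the breakpoint list changes per tag.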

Figure 3.9: Seal gas system (very simplified) [4]. [Schematic showing the nitrogen (N2) supply, differential pressure control valve, filter, bearings, flow transmitters FT7155 and FT7156, balance pipe, outer labyrinth seals, dry gas seals, inner labyrinth seals, process gas, and EV valve, at the non-drive end (NDE) and the drive end (DE).]

The inner labyrinth seals next to the process are fed with process gas (buffer gas), while the outer labyrinth seals are supplied with nitrogen to ensure safety. The controller of the differential pressure control valve in Figure 3.9 has a setpoint of 0.7 bar, which implies that the buffer gas pressure towards the inner labyrinth seals is 0.7 bar higher than the process gas pressure on the process side. Most of the buffer gas is supposed to return to the process. The small leakage through the inner labyrinth seals is directed to the low-pressure flare. More details of the seal gas system are given in the internal StatoilHydro technical document [4]. To measure the flow of buffer gas towards the seals, the seals are equipped with flow transmitters (FT7155 at the low pressure side and FT7156 at the high pressure side). Leakage of buffer gas to the low-pressure flare is monitored by FT7161 (low pressure) and FT7162 (high pressure). The calculation of the transfer functions of FT7155 or FT7156, utilizing their corresponding alarm limit values, is shown in Table 3.4 and Figure 3.10.

Table 3.4: Flow towards the seals (FT7155 and FT7156) and corresponding TCI. [Columns: normal working value, Hi limit, HiHi limit. Rows: flow in Sm³/h and TCI.]

Figure 3.10: Transfer function for flow transmitters to measure flow towards the seals (FT7155 at the low pressure side and FT7156 at the high pressure side). Between the flows given in Table 3.4, a linear interpolation is used. For flows outside the given interval, extrapolation to the relevant end-point is used. The transfer function is called a Down slope transfer function. A further description of this transfer function is given in Appendix B.

In this case the transfer function is a Down slope transfer function. The TCI is then 100 at or below 8 Sm³/h and 0 at or above 226 Sm³/h. The calculation of the transfer functions of FT7161 and FT7162, utilizing their corresponding alarm limit values, is shown in Table 3.5 and Figure 3.11.

Table 3.5: Leakage to low pressure flare measured by tag FT7161 (low pressure) or FT7162 (high pressure) and corresponding TCI. [Columns: normal working value, Hi limit, HiHi limit. Rows: flow in Sm³/h and TCI, with TCI values 100, 50, and 0.]

In this case the transfer function is a Down slope transfer function. Between the flows given in Table 3.5, a linear interpolation is used. For flows outside the given interval, extrapolation to the relevant end-point is used. The TCI is then 100 at or below 15 Sm³/h and 0 at or above 2 Sm³/h.

Figure 3.11: Transfer function of leakage (flow) to low-pressure flare measured by tag FT7161 (low pressure) or FT7162 (high pressure). Between the flows given in Table 3.5, a linear interpolation is used. For flows outside the given interval, extrapolation to the relevant end-point is used. The transfer function is called a Down slope transfer function. A further description of this transfer function is given in Appendix B.

3.2. Hierarchical aggregation of the technical condition

The challenge of technical condition modelling is to associate e.g. symptoms related to production availability (that can be partly measured through inspections, condition monitoring, parameters from IMS, etc.) with the failure modes in the high-level assembly hierarchical structure. FMECA analysis using reliability-centered maintenance (RCM) shall link the high-level assembly hierarchical structure with its failure modes. The idea of aggregating TCIs from smaller building blocks in the established hierarchical tree structure of objects is to associate the failure modes with (mainly) the leaf node TCIs described above. The leaf node (2) TCIs should affect the level of deterioration or progression of every failure mode in the hierarchy. A challenge for a high-level assembly failure mode is to determine proper aggregation methods and aggregation weights. Establishment of the significance (weights) of the leaf node TCIs in aggregation methods can be made in two different ways:
1. The weights are based on expert judgement, e.g. based on the criticality index identified in the FMECA analysis.

2 A leaf node is a node of a tree structure that has zero child nodes. In this context the leaf node is the lowest level in the compressor hierarchy.

2. The weights are based on regression parameters in a regression analysis. The objective of e.g. proportional hazards model (PHM) analysis is to estimate the covariate coefficients, thus providing a quantitative measure of the importance of each covariate (e.g. low-level time dependent TCI) and their impact on the propensity for failure.
In general it is an advantage to have few independent leaf node TCIs that together in the best possible way reflect the level of technical condition of a failure mode. In the TeCoMan software several aggregation methods have been defined. The 3 most frequently used aggregation methods are described below:
1. The weighted sum aggregation method is expressed as

$$\mathrm{TCI} = 100 - \sum_{i=1}^{n} (100 - \mathrm{TCI}_i)\, w_i, \qquad \sum_{i=1}^{n} w_i = 1$$

where $\mathrm{TCI}_i$ is the technical condition of child node $i$, $w_i$ is the weight of child node $i$, and $n$ is the number of child nodes.
2. The penalty aggregation method is similar to the weighted sum aggregation, except that the sum of the weights is permitted to be different from one. We then put

$$\mathrm{TCI} = 100 - \sum_{i=1}^{n} (100 - \mathrm{TCI}_i)\, w_i, \qquad \sum_{i=1}^{n} w_i \neq 1$$

If the TCI of the parent node as calculated by this formula turns out to be less than zero, it is set equal to zero.
3. The worst case aggregation method gives the parent node a TCI equal to the value of the child node with the smallest TCI:

$$\mathrm{TCI} = \min_{i=1,\dots,n} \mathrm{TCI}_i$$

An example of how the technical condition may be aggregated using the concepts of worst case and penalty is presented in Figure 3.12. The condition of the bearings is calculated to be 60, because we have selected to apply the worst case aggregation method to the bearing temperature TCIs. The bearing temperature is transferred to a TCI value by the transfer function in Figure 3.8. To calculate the TCI value for the export compressor, a penalty aggregation method is applied.
The establishment of the hierarchy is an important task, which has been done in close cooperation with operational and maintenance personnel. The choice of aggregation methods and weights is based on the expert judgement principle.
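The three aggregation methods can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the TeCoMan implementation: the function names are hypothetical and the TCI scale is assumed to run from 0 to 100. The child TCIs and weights in the usage example are the values read from Figure 3.12 on that scale; the weighted sum example uses invented numbers.

```python
import math
from typing import Sequence


def _weighted_loss(tcis: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of (100 - TCI_i) * w_i over all child nodes."""
    if len(tcis) != len(weights):
        raise ValueError("one weight per child node is required")
    return sum((100.0 - t) * w for t, w in zip(tcis, weights))


def weighted_sum(tcis: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum aggregation: the weights must sum to one."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1 for weighted sum aggregation")
    return 100.0 - _weighted_loss(tcis, weights)


def penalty(tcis: Sequence[float], weights: Sequence[float]) -> float:
    """Penalty aggregation: weights may sum to anything, result floored at 0."""
    return max(0.0, 100.0 - _weighted_loss(tcis, weights))


def worst_case(tcis: Sequence[float]) -> float:
    """Worst case aggregation: the smallest child TCI."""
    return min(tcis)


if __name__ == "__main__":
    # Bearing temperature node: worst case over the four sensor TCIs.
    bearings = worst_case([60.0, 100.0, 90.0, 80.0])
    # Export compressor node: penalty aggregation over the five children
    # (SAP notification, seals, vibration, bearing temperature, efficiency).
    compressor = penalty(
        [90.0, 75.0, 85.0, bearings, 80.0],
        [0.25, 0.3, 0.7, 0.3, 0.5],
    )
    print(bearings, round(compressor, 1))  # 60.0 57.5
    # Hypothetical weighted sum example with two equally weighted children.
    print(weighted_sum([90.0, 60.0], [0.5, 0.5]))  # 75.0
```

Because the weights of the compressor node sum to 2.05, the same child TCIs are penalized more heavily than under a weighted sum, which is why the parent TCI (57.5) ends up below the worst child (60).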


Export compressor 27 - KA31: TCI = 57.5 (penalty aggregation)
    SAP Notification: w_i = 0.25, TCI_i = 90
    Seals: w_i = 0.3, TCI_i = 75
    Vibration: w_i = 0.7, TCI_i = 85
    Bearing Temperature: w_i = 0.3, TCI_i = 60 (worst case aggregation)
        27TE79A = 95 °C, TCI = 60
        27TE791A = 75 °C, TCI = 100
        27TE792A = 80 °C, TCI = 90
        27TE793A = 85 °C, TCI = 80
    Efficiency: w_i = 0.5, TCI_i = 80
Figure 3.12: Example of hierarchy to determine technical condition with respect to production availability. The figure is partly taken from the TeCoMan software.

4. Case Study - Natural gas export compressors

4.1. System for sales gas compression and sales gas metering

The system for sales gas compression and sales gas metering consists of the following main systems (the 6th compressor, installed in 2005, is not included in this thesis):
- 5 identical compressor trains including inlet separator, export gas compressor, gear, lubrication-oil system, compressor motor (electrical), and outlet cooling fans.
- 2 sales gas metering stations.
- 2 pig launchers.
The purpose of the system for sales gas compression and sales gas metering is to:
- Compress treated gas from the prior gas treatment system.
- Meter, record and export gas to export pipelines.
- Recirculate off-spec gas until the required specifications are met.
- Send cleaning pigs through the pipelines on demand.

Figure 4.1: Gas flow through a 7-stage back to back centrifugal compressor

The compressor trains receive lean gas from the exhaust manifold of the gas treatment system (system 25). The pressure here is 55-8 bar g and the temperature is between -5 °C and 5 °C. Entrained glycol and condensates are treated in an inlet separator. In normal working conditions there is no glycol or condensates in the gas. From the top of the inlet separator, dry gas is supplied to the non-drive end (NDE) at the 1st section intake of the corresponding compressor (see Figure 4.1).

Figure 4.2: System sales gas compression and sales gas metering [4]

The gas is directed through three impellers towards a crossflow and further to the drive end (DE) of the compressor. At the drive end of the compressor the gas is directed through four impellers, but in the opposite direction of the gas from the impellers at the NDE. The axial forces are then working against each other in this 7-stage back to back centrifugal compressor. Each compressor increases the gas pressure to 191 bar g (max). The gas from one compressor is then cooled down to 4-5 °C by on/off switching of 1-8 fans before it is directed towards an exhaust manifold and further to one out of two metering stations. There is one metering station for each export pipeline, which measures and records the amount of gas delivered for export (see Figure 4.2). A computer is part of the metering station and calculates:

- Fuel-value
- Density
- Dew point
- Flow
- Pressure
- Temperature
- Gas composition
The parameters above have to satisfy customer requirements to ensure correct gas quality and quantity. During start-up of the first compressor train the system might need to recycle off-spec gas to the inlet of system 25 until the required sales-gas specifications are met. Each export pipeline has its own pig launcher. Cleaning pigs driven by the gas are sent through the pipelines on demand. More details of the system for sales gas compression and sales gas metering, including its control description, are given in the internal StatoilHydro technical document [4].

4.2. Export gas compressors

The focus in this thesis is on the five 7-stage back to back centrifugal compressors, excluding the inlet separator, the gear, the electrical motor, the lubrication oil system, the cooling fans, the sales gas metering stations and the cleaning pig launchers. The 5 export gas compressors are identical and have been operating under the same working conditions since . The compressors are subject to different failure modes as documented in the work-order list (event data) observed over a 7 year period from January January 7. The work-order list had to be filtered to yield all the relevant events. Some expensive work orders labelled Cleaning of bundle, recorded during the first years in the work-order list, are regarded as modifications. Salt caught on the rotor reduced the compressor efficiency and had to be removed by manual cleaning of the bundle. After a modification the salt is now automatically removed and the problem is solved. Performance test events and minor work orders performed on the compressors, like replacement of sensors, are also not considered relevant work orders. However, notifications belonging to those compressor work orders are part of the notification procedure to calculate the aggregated compressor technical condition index. 
The aggregated TCI will then decrease when these notifications are reported in the SAP system, and increase when they are completed by their associated work orders.

Failure modes

From the end of December 1996 until the end of January 2007, there were few required maintenance actions to fix a failure [39] that caused loss of availability until some part of the export compressor was replaced or repaired. This might be because of the high inherent reliability of the export gas compressors. The failure modes documented on the export compressors as work orders in the SAP system are:
- Bearing failure (thrust bearing or radial bearing, see Figure 4.3).
- Seal failure (see leakage in the seal gas system, Figure 3.9).

For both failure modes the export compressor is overhauled by replacement of a bearing or a seal. It is important to note that the work orders recorded in the SAP system are not only complete breakdowns of the compressor. They are expert judgements that the compressor is operating at an unsatisfactory level, based on the condition measurements and the important daily inspections at the site. Complete breakdowns of the compressors are highly unwanted in terms of costs and safety. The implemented Process Shutdown Systems (PSS), level 2 and 3, and the Emergency Shutdown System (EMS) protect the compressor efficiently against complete breakdowns.

Figure 4.3: Thrust bearings and radial bearings of a compressor

Table 4.1: Simplified split of the observation interval for compressors KA31, KA41, KA51, KA61, and KA71 into running periods. In this table there is one row for each running period. A running period ends with a required maintenance action or a suspension. For those running periods ending with a suspension the exact time of a required maintenance action is not known. The repair times are negligible. The failure mode column presents the mode of the required maintenance action causing the end of the running period. A suspension is noted in the failure mode column.

Compressor | Running period | Start of running period (date) | End of running period (date) | Length of running period (days) | Failure mode
KA Seal Seal Bearing also maintained Suspension of observation period KA Bearing Seal

KA Suspension of observation period KA Bearing Seal also maintained Bearing Seal KA Bearing Seal also maintained Suspension of observation period

Since the gas export demand to the European continent is high, the export gas compressors are running close to 24 hours a day, 7 days a week throughout the year. Each export gas compressor was put into operation on December 31st 1996 and has been partly observed from the time it was put into operation until January 19th 2007. The length of this interval is calculated to be 3671 days. In this interval the repair times are considered negligible. This is an acceptable simplification of the interval. A closer look at the interval is taken later in this chapter.

Table 4.1 is the result of filtering the relevant maintenance work orders out of all the documented work orders in the SAP system. The Compressor column denotes the unique tag number of each export gas compressor. Each compressor is further segmented into running periods ending with a failure mode observation or a suspension. For those running periods which end with a failure mode observation, the cause of the maintenance action and the exact time stamp of the maintenance action are known. For a running period ending with a suspension we only have partial information about the running period, and the exact time stamp of the required maintenance action is not known. For example, the first row in Table 4.1 is the running period of the export gas compressor with tag KA31 from the time it was put into operation (new) until the first maintenance action. It was then brought back into a functioning state. The second row is the running period of the same export gas compressor from immediately after its first maintenance action (since the repair time is considered negligible) until its second maintenance action. The failure mode of a maintenance action is documented in the Failure mode column, e.g.
in the work order report of the first running period of KA31 it was documented that an overhaul of two bearings was required due to mechanical damage and corrosion. On this occasion other maintenance actions, such as replacement of a seal ring and a temperature element, were also performed. The Start of running period and End of running period columns give the first and the last date of each running period. The Length of running period column gives the length of the interval in operational days between the maintenance action in running period i and the maintenance action in running period i - 1. In the first running period of a compressor, working age represents the length of the interval in operational days between its first maintenance action and time 0 (new).

Missing data are discovered in the documentation. Since December 31st 1996, different operators have been responsible at the natural gas treatment plant. The documentation quality of work orders performed on the gas compressors in the first operational years is regarded as insufficient. Automated collection of inspection, condition monitoring and process data in the TeCoMan software began only several years after the compressors were put into operation. A recovery of maintenance

actions documented as work orders in SAP and data exported to TeCoMan before January 1st 2000 is very time consuming and in some cases impossible. The existence of maintenance actions is therefore treated as unknown for all compressors before January 1st 2000 (the first 1095 operational days). For compressor KA51 the first 1265 days and for KA61 the first 132 days are treated as unknown with regard to the existence of maintenance actions and documented TeCoMan data.

A further problem concerns when a work order is given the time stamp "complete". The complete status of a work order should be set immediately after a performed maintenance action. A large delay is discovered in the SAP system because SAP requires other time-demanding administrative repair tasks to be completed before the complete status is set. For some days after the maintenance action this leads to a physically repaired compressor running in good condition while SAP still shows it as being under repair and in bad technical condition. Correcting this error in SAP and TeCoMan is outside the scope of this thesis. To solve the problem, the existence of required maintenance actions and documented TeCoMan data for a certain time span immediately after a performed maintenance action is treated as unknown.

The missing data problems are solved by left truncation of the data, since it is known that the compressors have been running in the first unknown operational time interval of a running period, but information on the existence of maintenance actions and documented TeCoMan data is missing. Truncated operational time intervals are included in the column named Trunc. time and added to Table 4.1 as shown in Table 4.2. A more detailed description of truncated information, or more specifically left-truncated information, is given in Appendix H Reliability theory.
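The interval lengths quoted in this section can be cross-checked with plain calendar arithmetic. The sketch below is not part of the thesis; it assumes the start-up date 31 December 1996, the data cut-off 1 January 2000 and the end of observation 19 January 2007 as read from the text, and it assumes that operational days are counted from the first full day after start-up. All names are illustrative.

```python
from datetime import date

# Dates as read from the text (assumption: the years are 1996, 2000 and 2007).
START_UP = date(1996, 12, 31)           # compressors put into operation
DATA_CUT_OFF = date(2000, 1, 1)         # earlier SAP/TeCoMan records treated as unknown
END_OF_OBSERVATION = date(2007, 1, 19)  # end of the observation interval


def days_between(start: date, end: date) -> int:
    """Calendar days from start to end (repair times are neglected)."""
    return (end - start).days


# Length of the whole observation interval.
observation_days = days_between(START_UP, END_OF_OBSERVATION)

# Left-truncated part common to all compressors, counted from the
# first full operational day (assumption: 1 January 1997).
truncated_days = days_between(date(1997, 1, 1), DATA_CUT_OFF)

# Remaining days with usable maintenance and TeCoMan records.
observed_days = observation_days - truncated_days

print(observation_days, truncated_days, observed_days)
```

Under these assumptions the two computed interval lengths agree with the 3671 days and the truncation time quoted in the text, which supports reading the dates as given above.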

Table 4.2: Detailed overview of observation intervals for compressors KA31, KA41, KA51, KA61, and KA71. In this table there is one row for each running period. A running period ends with a required maintenance action or a suspension. For those running periods ending with a suspension the exact time of a required maintenance action is not known. The repair times are negligible. The failure mode column presents the mode of the required maintenance action causing the end of the running period. A suspension is noted in the failure mode column. The truncation time column gives the number of operational days of the left-truncated time.

Compressor | Running period | Start of running period (date) | End of running period (date) | Length of running period (days) | Failure mode | Trunc. time (days)
KA Seal Seal Bearing also maintained Suspension of 1 observation period KA Bearing Seal 271 KA Suspension of 1265 observation period KA Bearing 195 Seal also maintained Bearing Seal 14 KA Bearing 195 Seal also maintained Suspension of observation period 1

An illustration of the maintenance actions in Table 4.2 is presented in Figure 4.4. The solid horizontal lines illustrate the observation intervals for each compressor. In total there are 8 required maintenance actions for the 5 export gas compressors, which is very few for the relatively long observation period shown in Figure 4.4. There are also 3 running periods which are classified as suspensions, meaning that the observation is stopped before a required maintenance action occurs. This gives a total of 11 running periods. In addition, all running periods are truncated, so a reliability approach based solely on lifetime distributions is at the limit of what can be justified. It appears from Figure 4.4 that the required maintenance actions occur at irregular operational time intervals.
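To make the role of the truncation times concrete, the following sketch shows how left-truncated running periods that end in either a maintenance action or a suspension could enter a likelihood. It assumes a plain two-parameter Weibull lifetime model without covariates, chosen here for brevity (the thesis itself uses a Weibull proportional hazards model). The class, the function names and the example values are hypothetical and are not the thesis data.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningPeriod:
    length: float      # working age at the end of the running period (days)
    truncation: float  # left-truncation time (days), 0 if fully observed
    failed: bool       # True: required maintenance action, False: suspension


def weibull_log_survival(t: float, shape: float, scale: float) -> float:
    """log R(t) for a two-parameter Weibull distribution."""
    return -((t / scale) ** shape)


def weibull_log_density(t: float, shape: float, scale: float) -> float:
    """log f(t) = log z(t) + log R(t), where z(t) is the Weibull hazard rate."""
    log_hazard = math.log(shape / scale) + (shape - 1.0) * math.log(t / scale)
    return log_hazard + weibull_log_survival(t, shape, scale)


def log_likelihood(periods, shape: float, scale: float) -> float:
    """Log-likelihood with right censoring (suspensions) and left truncation."""
    total = 0.0
    for p in periods:
        if p.failed:
            total += weibull_log_density(p.length, shape, scale)
        else:
            total += weibull_log_survival(p.length, shape, scale)
        # Left truncation: condition on survival beyond the truncation time.
        total -= weibull_log_survival(p.truncation, shape, scale)
    return total


# Hypothetical running periods, for illustration only.
example = [
    RunningPeriod(length=50.0, truncation=20.0, failed=True),
    RunningPeriod(length=80.0, truncation=0.0, failed=False),
]
print(log_likelihood(example, shape=1.0, scale=100.0))
```

Each running period contributes its density (maintenance action) or survival probability (suspension), divided by the probability of surviving the unobserved truncation interval, so nothing has to be assumed about events in that interval.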

Figure 4.4: Trend of maintenance actions as a function of operational time (panel title: Events for all compressors; horizontal axis: operational time in days). The solid horizontal lines illustrate the observation intervals for each compressor. The required maintenance actions are indicated by x and the suspensions by a separate marker. Truncated operational time intervals are shown as gaps in the solid horizontal lines.

Condition monitoring and process data

The rotating equipment, such as the export compressors, gears, export compressor electrical motors, and cooling fans, is equipped with condition monitoring for early fault detection. The intention was to use a preventive condition-based maintenance (CBM) policy for planning of maintenance actions based on the technical condition. The condition-monitoring equipment is also supposed to protect the equipment against too high temperatures and vibrations by alerting the operators with Hi and Lo alarms. For some of the HiHi and LoLo alarms the process shutdown system is activated automatically. The alarm limits are set by experts and are used for calculation of the technical condition indexes (TCIs) as described in chapter 3.1 Determination of an item's technical condition from available information. The compressors KA31, KA41, KA51, KA61, and KA71 are identical, also in terms of instrumentation for condition monitoring and process monitoring. The only difference is their tag numbers.

Vibration monitoring

The Compass system is used for online and offline vibration monitoring. The data from the Compass system are exported to the TeCoMan system once a day. The total vibration level is recorded at a specific point in time each day by a set of probes (see also the P&ID for compressor KA41 in Figure 4.6):

Type of vibration recorded | Position
Axial | Non-drive end (NDE), tag GT 717A
Axial | Non-drive end (NDE), tag GT 717B
Radial in horizontal direction | Non-drive end (NDE), tag YT 7171A
Radial in vertical direction | Non-drive end (NDE), tag YT 7171B
Radial in horizontal direction | Drive end (DE), tag YT 7172A
Radial in vertical direction | Drive end (DE), tag YT 7172B

The time-dependent vibration signal of tag YT 7171A and the calculated time-dependent TCI are shown in Figure 4.5. The time-dependent vibration signals for all vibration tags are shown in Appendix I Low-level signals for compressor KA41.

Figure 4.5: The operational time series of radial vibration measurements located at the non-drive end from tag YT 7171A is shown in the upper trend diagram. The calculated TCI values from YT 7171A are shown in the lower trend diagram. The calculation of TCI values from radial vibration measurements is performed by the down slope transfer function in Figure 3.4. In this transfer function a vibration level less than 2 µm peak to peak implies TCI = 1, see Table 3.2. The figure is taken from the TeCoMan software.
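The mapping from a raw measurement to a TCI can be illustrated with a small sketch. Figure 3.4 is not reproduced in this section, so the shape used below is an assumption: TCI = 1 up to a lower limit, a linear decrease between the lower and an upper limit, and TCI = 0 above the upper limit. The function name and the limit values in the usage lines are hypothetical and may differ from the transfer function actually implemented in TeCoMan.

```python
def down_slope_tci(value: float, lower_limit: float, upper_limit: float) -> float:
    """Map a measurement to a TCI in [0, 1] with an assumed linear down slope.

    Assumption: the TCI is 1 at or below lower_limit, 0 at or above
    upper_limit, and decreases linearly in between.
    """
    if upper_limit <= lower_limit:
        raise ValueError("upper_limit must exceed lower_limit")
    if value <= lower_limit:
        return 1.0
    if value >= upper_limit:
        return 0.0
    return (upper_limit - value) / (upper_limit - lower_limit)


# Hypothetical limits, for illustration only.
for measurement in (1.0, 6.0, 12.0):
    print(measurement, down_slope_tci(measurement, lower_limit=2.0, upper_limit=10.0))
```

The same function shape would serve for the bearing temperature and seal leakage signals, each with its own limits.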

Bearing temperature monitoring

Bearing temperatures are recorded by the information management system (IMS) and exported to the TeCoMan system once a day. The temperatures are recorded by a set of temperature elements, located as follows (see the P&ID for compressor KA41 in Figure 4.6):

Type of bearing temperature element | Position
Outer bearing temperature element (simplex) | Non-drive end (NDE), tag TE 719A
Inner bearing temperature element (simplex) | Non-drive end (NDE), tag TE 7191A
Inner bearing temperature element (duplex) | Non-drive end (NDE), tag TE 7192A
Bearing temperature element (duplex) | Drive end (DE), tag TE 7193A

Figure 4.6: Vibration monitoring and bearing temperature monitoring of compressor KA41. This figure is a simplification of an internal StatoilHydro P&ID drawing. Tags shown in the drawing for compressor 27-KA41 and its gear: TE 719A, TE 7191A, TE 7192A, TE 7193A; GE 717A, GE 717B, YE 7171A, YE 7171B, YE 7172A, YE 7172B; GT 717A, GT 717B, YT 7171A, YT 7171B, YT 7172A, YT 7172B.

The time-dependent bearing temperature signal of tag TE 719A and the calculated time-dependent TCI are shown in Figure 4.7. The time-dependent bearing temperature signals of all bearing temperature tags are shown in Appendix I Low-level signals for compressor KA41.

Figure 4.7: The operational time series of the outer bearing temperature measurements located at the non-drive end from tag TE 719A is shown in the upper trend diagram. The calculated TCI values from TE 719A are shown in the lower trend diagram. The calculation is performed by the down slope transfer function in Figure 3.8. In this transfer function a temperature less than 75 °C implies TCI = 1, see Table 3.3. The figure is taken from the TeCoMan software.

Seal leakage monitoring

Each compressor is equipped with double dry gas seals to prevent production gas from reaching the bearings or the atmosphere. The seal gas filters are monitored by a differential pressure indicator. On both sides of the seals the flow rate is monitored by flow transmitters:
- Flow transmitter located at the low-pressure side, tag FT7155
- Flow transmitter located at the high-pressure side, tag FT7156

Both transmitters activate a Hi-alarm at a measured flow rate of 255 Sm³/h. This is a strong indicator of a seal failure [40]. Leakage to the low-pressure flare is monitored by flow transmitters:
- Flow transmitter located at the low-pressure side, tag FT7161

- Flow transmitter located at the high-pressure side, tag FT7162

If a large leakage takes place (flow rate 28.5 Sm³/h) a Hi-alarm is activated. The time-dependent seal leakage signal of tag FT 7162 and the calculated time-dependent TCI are shown in Figure 4.8. The time-dependent seal leakage signals for all flow tags are shown in Appendix I Low-level signals for compressor KA41.

Figure 4.8: The operational time series of the seal leakage to the low-pressure flare from tag FT 7162 is shown in the upper trend diagram. The calculated TCI values from FT 7162 are shown in the lower trend diagram. The calculation is performed by the down slope transfer function, in which a leakage less than 15 Sm³/h implies TCI = 1, see Table 3.5. The figure is taken from the TeCoMan software.

Oil analysis monitoring

The Compr. Lubrication System is implemented as a separate system at a higher level in the TeCoMan software hierarchy (Figure 3.12). The TCIs calculated here are not a result of e.g. particle analysis of the lubrication oil, but are based on indicators for the condition of the lubrication oil system itself. Each compressor train has its own lube oil tank for supply of lubrication oil to:

- Compressor bearings
- Gear
- Electrically driven compressor engine
- Engine shaft (jacking oil at start-up)

The Compr. Lubrication System has equipment for filtering and cooling of the lubrication oil. An online particle analysis of the lubrication oil is not implemented for the gas compressors, so online values are not available.

Notifications

Notifications are manually entered in SAP. In a notification the unique tag number, the deviation or damage code and the failure mode code have to be reported in order to reduce the TCI according to defined rules, as described in the chapter Application of inspections (notifications). The notifications for the export gas compressors should ideally be reported as soon as possible in the SAP database as an early fault indicator. In the project work of Dyrnes [41] the number of notifications reported for each failure mode in the years 2000-2005 was counted. The result is presented in Figure 4.9. See Figure 3.6 for a description of the failure modes.

Figure 4.9: Number of recorded notifications for each failure mode [41] (vertical axis: number of notifications; horizontal axis: failure modes A, B, C, D, F, E, H, I, J, K, M, Y, Z).

The most dominant failure mode is Z. This is a non-specific failure mode and its weight in SAP is 0, i.e. these reported notifications will not reduce the technical condition (TCI). It is therefore decided to remove failure mode alternatives Y and Z from the SAP system to force a specific failure mode registration of the notifications. Another weakness of the notification system is the time stamp of completing the notification, i.e. the time at which the notification TCI is set equal to 1 again. The complete status for the

notifications should be set immediately after a performed maintenance action. The notifications have the same delay problem as the work orders (see the chapter Failure modes).

Efficiency monitoring

The compressor efficiency is a compound measure that depends on several measurements (several tags) such as temperature, pressure and flow rate. This measure may be a good indicator that includes concomitant information on the technical condition. Today an efficiency measure for each export compressor is calculated on an annual basis. According to the maintenance manager the underlying measurements oscillate strongly and are not useful for more frequent calculations of the efficiency. However, an upcoming system upgrade may include better underlying measurements for continuous efficiency monitoring. To be able to calculate a TCI value from the efficiency, knowledge of the load is also needed. The efficiency measure is included as a future measure in the TeCoMan software application [42].

4.3 Hierarchical aggregation of the technical condition

An export compressor has 2 axial and 4 radial probes for vibration monitoring. According to the theory of vibration analysis, surveillance of the total vibration level might be useful for early detection of failures in e.g. bearings, impeller, and shaft. The total vibration level and the corresponding time stamp are recorded during the operational life. For each probe the output variable after normalization is a TCI. To aggregate the technical condition from local point level to vibration level, a worst case aggregation method is chosen, i.e. the probe with the lowest TCI value of all probes gives the TCI at the vibration level. A worst case aggregation method is also chosen for the individual measurements of bearing temperature and seal leakage. At the bearing temperature level the temperature element with the lowest TCI of all bearing elements is chosen.
At the seal leakage level the flow transmitter with the lowest TCI is chosen. For the notifications a penalty aggregation method is used to aggregate to the SAP notification level, as described in the chapter Application of inspections (notifications). To aggregate to the export compressor top level, a penalty aggregation method including weights is chosen. In the model the following expert judgement weights (severity/damage codes) are chosen for all compressors:
- SAP notification: .25
- Bearing temperature: .3
- Seal leakage: .3
- Vibration: .7
- Efficiency: .5

The performance monitoring (efficiency) has not yet been implemented and is therefore not in use in the export compressor penalty aggregation method. An overview of the total aggregation hierarchy is shown in Figure 4.10. Hierarchical details are presented in Appendix C Export compressor hierarchy in TeCoMan.
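The worst case aggregation used at the vibration, bearing temperature and seal leakage levels amounts to taking the minimum TCI over the tags in a group. A minimal sketch follows; the function name and the TCI values are hypothetical, and the penalty aggregation used at the SAP notification level and at the compressor top level is defined in chapter 3 and is deliberately not reproduced here.

```python
from typing import Dict, Tuple


def worst_case_aggregation(tcis: Dict[str, float]) -> Tuple[str, float]:
    """Return the tag with the lowest TCI in a group together with that TCI."""
    if not tcis:
        raise ValueError("at least one TCI is required")
    worst_tag = min(tcis, key=tcis.get)
    return worst_tag, tcis[worst_tag]


# Hypothetical daily TCI values for three vibration probes of one compressor.
vibration_tcis = {"GT717A": 0.90, "YT7171A": 0.80, "YT7171B": 0.60}
tag, level_tci = worst_case_aggregation(vibration_tcis)
print(tag, level_tci)
```

Returning the tag as well as the value keeps the aggregated index traceable to the probe that determines it.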

Since the important SAP notifications are obviously wrong in a time span immediately after the maintenance actions, for reasons discussed in the chapter Failure modes, the time-dependent technical condition index (TCI) aggregated at the natural gas export compressor level is not calculated in these time spans (see the truncated time ranges in Table 4.2). An alternative would have been to recover the correct notification time stamps. This would have been the best solution to the problem, but it was too labour intensive for this thesis.

Figure 4.10: Overview of the aggregation-method hierarchy of a gas compressor. Top level: export compressor 27 KA41 (penalty aggregation with weights w_i). Second level: Efficiency (not in use), SAP Notification (penalty aggregation over failure modes A, B, C, ... with individual weights), Seals (worst case aggregation), Vibration (worst case aggregation), Bearing temperature (worst case aggregation). Tags shown at the lowest level: FT7155, FT7156, FT7161, FT7162; GT717A, GT717B, YT7171A; TE719A, TE7191A, TE7192A.

Figure 4.11 shows the resulting calculated time-dependent TCIs aggregated at the natural gas export compressor level as a function of operational time for KA41.

Figure 4.11: Running periods of the time-dependent TCIs aggregated at the gas compressor level (vertical axis: TCI; horizontal axis: working age in days).

To reduce the sampling variations and show the trend in the data more clearly, a moving average of the time-dependent TCI is shown in Figure 4.12. The moving average takes into account the signal values within a specified working age distance. In this case the TCI value at a given working age t is calculated as the average of all TCI records from a running period within the interval [t - 15, t + 15]. At the beginning and at the end of each running period the interval shrinks in such a way that its size at the first and last value is equal to 1.

A graphical examination of each aggregated running period showed that every running period exhibits a long-term decreasing trend of the TCIs as a function of operational time, although a small number of running periods were close to constant, like the third running period of KA41 in Figure 4.12. Another observation is that the technical condition after the maintenance actions is in general not equal to 1. Hence the natural gas export compressors are not judged to be in an as-good-as-new condition after the maintenance actions. The time-dependent TCIs aggregated at the natural gas export compressor level are shown for all compressors in Appendix D Time dependent TCIs aggregated at the gas compressor level.
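The moving average described above can be sketched as follows. The sketch assumes one TCI record per day within a running period, so that list positions correspond to working age in days; the function name and the sample values are illustrative.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], half_width: int = 15) -> List[float]:
    """Centred moving average whose window shrinks towards both ends.

    The window at position i covers [i - h, i + h], where h is the largest
    value not exceeding half_width that keeps the window inside the running
    period. The first and last positions therefore use a window of size 1.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        h = min(half_width, i, n - 1 - i)
        window = values[i - h : i + h + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical daily TCI records from one short running period.
daily_tci = [1.0, 0.9, 1.0, 0.8, 0.9, 0.7, 0.8]
print(moving_average(daily_tci, half_width=2))
```

Shrinking the window symmetrically keeps the average centred on the working age in question, at the price of less smoothing close to the maintenance actions.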

Figure 4.12: Weighted average of TCIs aggregated at gas compressor level for KA41 (vertical axis: TCI; horizontal axis: working age in days; series: moving average (+-15)).

4.4 First order regression analysis of aggregated TCIs

At this point the validated and corrected data are considered to be clean enough for modelling to take place. In the remainder of this chapter polynomial curve fitting is used to express the trend of each history of the aggregated export gas compressor technical condition index [43]. In the literature the expected deterioration of items over time is often proportional to a power law, $t^b$. Some examples of expected deterioration according to a power law are the rate of pit growth of massive anticorrosive steel (b = .5 [38]), expected degradation of concrete due to sulphate attack (parabolic: b = 2 [44]), and volume of worn material due to cavitation wear in fluid machinery (linear: b = 1 [45] if the cavitation index is time independent).

The TCIs are related to degradation development, but they are relative degradation measures that are scaled (weighted). A classical problem is the trade-off between perfect matching of training patterns, i.e. of recorded TCIs, and generalization, i.e. the assumed real development of the TCIs [1]. The main focus of the TeCoMan solution has not been to determine the exact expected technical condition but to capture its trends. Since the TCIs in TeCoMan are based on stressors rather than degradation models, the lower frequency responses (low order polynomial fitting) seem to describe the real relative degradation best. At the same time the calculation/estimation of the remaining useful life should be able to react promptly to changes in the technical condition, usage and environmental influences. The optimum

weighting between the slow and stable dynamic properties that are easy to predict and the high energy properties depends on the case. Another important argument against exact determination of technical condition is the enormous amount of work needed to study all the relevant degradation mechanisms for the natural gas export compressors. A combination of expert knowledge (notifications) and available data from existing condition monitoring and maintenance management systems is judged to be sufficient in the TeCoMan software to determine an item's technical condition. However, improved early fault detection techniques and instrumentation will hopefully improve quantification of the technical condition development.

The polynomial model used for trending the development in aggregated TCIs is implemented in the S-plus software from Insightful Corp using the least squares method. Modelling can be carried out for the TCIs on the whole data set from the last maintenance action until the next upcoming maintenance action. In this section the whole available data set is used. Different orders of the polynomial fitting were investigated considering the p-values. Emphasis in the modelling phase is also placed on the latest data values before the next upcoming required maintenance action. If a renewal process is valid, the operational time is set to 0 after each maintenance action. In that case the results of the polynomial fitting will be as described below.

The behaviour of the TCIs could also be modelled stochastically. An important argument is the ability to predict the TCI behaviour within certain confidence limits instead of an exact prediction as in the polynomial curve fitting case. The Markov Process and the Gamma Process are two models which have received a lot of interest in the literature (Dhillon et al. [46], Tsang et al. [47], van Noortwijk et al. [48]).
For the Markov Process an estimate of the transition rates is easily computed, since the interval between each stored TCI in TeCoMan is fixed (one day). In this thesis the Gamma Process, as a special case of the Markov Process, will be discussed and used to model the lifetime distribution for the last part of each history before the maintenance action (see chapter 6 Condition-based model for remaining useful life).

A deterministic model of the expected deterioration development might be accepted as a realistic model, at least as a first order approximation. In the S-plus software, the first running period of export compressor KA41 was modelled from the development of the moving average trend in Figure 4.12 as a third order polynomial:

$\widehat{TCI}_{41.1} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$ (4.1)

where $t$ is the local operational time (local to each running period), $\widehat{TCI}_{41.1}$ is the fitted value for the TCI, and the $\beta_k$ are the estimated regression coefficients.

Table 4.3: Summary of regression coefficients KA41, running period 1 (columns: Predictor, Coef, Pr(>|t|)).

Since there is a definite downturn of the trend at the end of this TCI running period, a third order polynomial is fitted. The p-value in the column Pr(>|t|) for the third order coefficient is equal to zero. The p-values for the coefficients are the significance probabilities for the variables given the presence of all other variables in the model. Thus, according to the software, the third order term is significant and should be kept in the model. The order of the polynomial could be increased, but it is believed that this would not improve the model of the true TCI development (degradation development).

For the second running period of export compressor KA41, which also has a definite downturn of the trend at the end of the running period, a fourth order polynomial is tried first:

$\widehat{TCI}_{41.2} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3 + \beta_4 t^4$

Table 4.4: Temporary summary of regression coefficients KA41, running period 2 (columns: Predictor, Coef, Pr(>|t|)).

Here the p-value in the column Pr(>|t|) for the fourth order coefficient is .9929, indicating that the model can omit the fourth order term. A third order polynomial may then be a good candidate for trending $TCI_{41.2}$:

$\widehat{TCI}_{41.2} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$ (4.2)

Table 4.5: Summary of regression coefficients KA41, running period 2 (columns: Predictor, Coef, Pr(>|t|)).

The p-value in the column Pr(>|t|) for the third order coefficient is .9. The probability that the model does not include a third order term is then very small. The third running period of export compressor KA41 is hard to fit with a polynomial model since it seems to be oscillating. A third order polynomial fit of $TCI_{41.3}$ is:

$\widehat{TCI}_{41.3} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3$ (4.3)

Table 4.6: Summary of regression coefficients KA41, running period 3 (columns: Predictor, Coef, Pr(>|t|)).

The p-value in the column Pr(>|t|) for the third order coefficient is equal to .3. Thus, the third order term is needed. A fourth order polynomial model for $TCI_{41.3}$ gives Pr(>|t|) equal to .236 for the fourth order term. This indicates that the fourth order term is not strongly needed, and it is therefore omitted.

Figure 4.13 shows the regression lines for running periods 1, 2 and 3 through the aggregated TCIs calculated for the natural gas export compressor KA41. Note that the operational time is 0 at the beginning of each running period in the regression calculation. To be able to show the original TCI data, the moving average and the calculated first order approximation trends simultaneously in Figure 4.13, the global operational time since new is used. In the S-plus software, the running periods of export compressors KA31, KA51, KA61 and KA71 were also modelled from the development of their moving average trends. The results for all running periods are shown in Appendix E Results of polynomial fitting of aggregated TCIs. A summary of the resulting polynomial fitting at the aggregated level is shown in Table 4.7.
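The least squares polynomial trend can be illustrated without statistical software. The sketch below is a plain Python stand-in for the S-plus fit, not the code used in the thesis: it solves the normal equations by Gaussian elimination and does not compute the p-values discussed above. Time is assumed to be rescaled to [0, 1] within a running period to keep the normal equations well conditioned, and the data are synthetic.

```python
from typing import List, Sequence


def fit_polynomial(t: Sequence[float], y: Sequence[float], order: int) -> List[float]:
    """Least squares coefficients [c0, c1, ..., c_order], lowest order first."""
    size = order + 1
    # Normal equations A c = b with A[r][c] = sum(t^(r+c)), b[r] = sum(y * t^r).
    a = [[sum(ti ** (r + c) for ti in t) for c in range(size)] for r in range(size)]
    b = [sum(yi * ti ** r for ti, yi in zip(t, y)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few distinct time points")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, size):
            factor = a[r][col] / a[col][col]
            for c in range(col, size):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(a[r][c] * coeffs[c] for c in range(r + 1, size))
        coeffs[r] = (b[r] - tail) / a[r][r]
    return coeffs


def evaluate(coeffs: Sequence[float], t: float) -> float:
    """Evaluate the fitted polynomial at time t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


# Synthetic TCI trend with a downturn at the end of the running period.
times = [i / 10 for i in range(11)]
tci = [1.0 - 0.1 * ti - 0.2 * ti ** 3 for ti in times]
cubic = fit_polynomial(times, tci, order=3)
print([round(c, 6) for c in cubic])
```

With real TCI records the fit will not be exact, and the choice between a third and a fourth order polynomial would still rest on significance tests such as those reported by S-plus.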

Figure 4.13: Aggregated TCI for compressor KA41, moving average and regression lines vs. operational time (vertical axis: TCI; horizontal axis: working age in days; series: TCI value, moving average (+-15), trend TCI).

Table 4.7: Summary of polynomial curve fitting. The column labelled Compr R.P. describes the running period of a compressor. Columns: Compr R.P., t^4, t^3, t^2, t^1, t^0. Rows: KA31 running periods 1 and 2; KA41 running periods 1, 2 and 3; KA51 running period 1; KA61 running periods 1 and 2; KA71 running periods 1, 2 and 3.


5. Aging-based model for remaining useful life

So far, the process of constructing a model for the development of the technical condition of a high-level assembly in a hierarchical aggregation model has been outlined. In reliability theory the operational time stamps of required maintenance actions (failures) are of main interest for studying the aging characteristics. In British Standard (BS 4778) [49] a failure of an item is defined as "Termination of its ability to perform a required function". The time stamp of a failure in the compressors used as the case in this thesis will not be the time stamp of total breakdown, but the time stamp where replacement or overhaul cannot be delayed any further due to safety and functional requirements. At that time stamp the compressor is stopped, opened up and judged to be in a faulty state. All its remaining useful life is spent.

The state variables of components and systems are often modelled as binary and independent, which are rather strong underlying assumptions. In the lifetime models, quantitative measures like the mean residual life of nonrepairable items have been developed. When we classify an item as nonrepairable, we are only interested in studying the item until the first failure occurs. In counting process models the reliability of repairable items as a function of time is studied. These items suffer several failures and repairs and may be studied by using a stochastic process. In maintenance theory the bathtub curve is often used as a basis for maintenance activities. There exist at least two such bathtub curves: one for nonrepairable parts and one for repairable items (see Ascher & Feingold [50]). A brief description of bathtub curves and two fundamental measures, the hazard rate and the ROCOF, follows in the next chapter.

Failure in nonrepairable items

A nonrepairable item can be anything from a small component to a large system.
In some cases the item may be literally nonrepairable, meaning that it will be discarded at the first failure. In other cases the item may be repaired, but we are not interested in what happens to the item after the first failure. To estimate statistical properties of a population of nonrepairable items, many items must be considered. Furthermore, the information must come from identical items operated independently under the same conditions.

Statistics on times to failure of nonrepairable items may be gathered in several ways. For the best results, the items should be mounted on a test stand and operated in the way mentioned above. However, this is seldom possible, due to the cost of the items, experimental difficulties and the like. When this is the case, data can be gathered from operational experience. Such data may be significantly influenced by operating conditions, loads, changes in maintenance and operating procedures, etc., which degrades the quality of the data gathered.

When analysing these statistics, an important assumption is that the lifetimes of the nonrepairable items are independent and identically distributed (IID). This assumption should be tested. For instance, a tendency for successive failure times to become larger implies that they are not independent samples from the same distribution function.

To determine the form of the hazard rate $z(t)$ for a given type of nonrepairable item, the following experiment may be carried out [26]: Split the observation interval $(0, t)$ into disjoint intervals of equal length $\Delta t$. Then put $n$ identical items into operation at time $t = 0$. When an item fails, note the time and discard that item. For each interval record:

1. The number of items $n(i)$ that fail in interval $i$.
2. The functioning times for the individual items $(T_{1i}, T_{2i}, \ldots, T_{ni})$ in interval $i$.

Hence $T_{ji}$ is the time that item $j$ has been functioning in interval $i$, and $T_{ji}$ is equal to 0 if item $j$ has failed before interval $i$, where $j = 1, 2, \ldots, n$. Thus $\sum_{j=1}^{n} T_{ji}$ is the total functioning time for the items in interval $i$. Now

\[ z(i) = \frac{n(i)}{\sum_{j=1}^{n} T_{ji}} \]

which is the number of failures per unit functioning time in interval $i$, is a natural estimate of the hazard rate in interval $i$ for the nonrepairable items that are functioning at the start of this interval. Let $m(i)$ denote the number of items that are functioning at the start of interval $i$; then

\[ z(i) \approx \frac{n(i)}{m(i)\,\Delta t} \]

If $n$ is very large, we may use very small time intervals. If we let $\Delta t \to 0$, it is expected that the step function $z(i)$ will tend towards a smooth curve, the bathtub curve, as illustrated in Figure 5.1.
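The interval experiment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `empirical_hazard` and the failure times are hypothetical and are not taken from the case study. The sketch computes both the estimate $n(i)/\sum_j T_{ji}$ and the approximation $n(i)/(m(i)\,\Delta t)$ for each interval.

```python
from typing import List, Tuple


def empirical_hazard(failure_times: List[float], dt: float,
                     n_intervals: int) -> List[Tuple[float, float]]:
    """Return (z_exact, z_approx) per interval for items all started at t = 0.

    z_exact  = n(i) / sum_j T_ji    (failures per unit functioning time)
    z_approx = n(i) / (m(i) * dt)   (uses the number functioning at interval start)
    """
    rates = []
    for i in range(n_intervals):
        start, end = i * dt, (i + 1) * dt
        # m(i): items still functioning at the start of interval i
        m_i = sum(1 for t in failure_times if t > start)
        # n(i): items that fail inside interval i
        n_i = sum(1 for t in failure_times if start < t <= end)
        # sum_j T_ji: functioning time inside interval i (0 for items failed earlier)
        total_time = sum(min(t, end) - start for t in failure_times if t > start)
        z_exact = n_i / total_time if total_time > 0 else float("nan")
        z_approx = n_i / (m_i * dt) if m_i > 0 else float("nan")
        rates.append((z_exact, z_approx))
    return rates


# Hypothetical failure times (hours) for n = 8 items, for illustration only.
times = [50.0, 120.0, 400.0, 650.0, 900.0, 1100.0, 1150.0, 1190.0]
for i, (z1, z2) in enumerate(empirical_hazard(times, dt=300.0, n_intervals=4)):
    print(f"interval {i}: z = {z1:.5f}, approximation = {z2:.5f}")
```

With few items the two estimates differ noticeably; they approach each other as $n$ grows and $\Delta t$ shrinks, which is the limit described in the text.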

[Figure 5.1: The bathtub curve for nonrepairable items. Hazard rate z(t) against time t, with burn-in period, useful life period and wear out period.]

The hazard rate is often high at the beginning of the burn-in period. This can be explained by the fact that there may be undiscovered defects in the items which soon show up when the items are activated. Often the items are tested in the factory before they are distributed to the users, to remove most of the infant defects. In the useful life period the failure rate stabilizes at a level for a certain amount of time until the item reaches the wear out period.

A gas compressor is, from a literal point of view, a repairable item. To decide on the best model for residual life estimation of the gas compressor from a reliability theory point of view, some tests must be performed. The applicability of nonrepairable statistical methods is discussed later in chapter 5.3 Model selection framework. For example, if a renewal process is applicable for modelling the system, the underlying distribution of the interoccurrence times could be modelled with a lifetime distribution. A short description of the most important measures for the reliability of a nonrepairable item is given below.

The state variable $\Psi(t)$ is binary, and the state at time $t$ is given by:

\[ \Psi(t) = \begin{cases} 1 & \text{if the item is functioning at time } t \\ 0 & \text{if the item is in a failed state at time } t \end{cases} \]

By the time to maintenance action of an item we mean the time elapsing from when the item is put into operation (good as new) until a maintenance action is required for the first time. We set $t = 0$ as the starting point. The time to required maintenance action (failure) is considered a random variable $T$. The connection between the state variable $\Psi(t)$ and the time to required maintenance action $T$ is illustrated in Figure 5.2.

[Figure 5.2: The state variable and the time to failure of an item. $\Psi(t)$ against time t, dropping from 1 to 0 at the failure time T.]

Here, the time to a required maintenance action is assumed to be continuously distributed with probability density function $f(t)$. The cumulative distribution function, which denotes the probability of a required maintenance action in the time interval $(0, t]$, is defined by:

\[ F(t) = \Pr(T \le t) = \int_0^t f(u)\,du \tag{5.1} \]

The reliability function, which denotes the probability that the item does not require a maintenance action in the time interval $(0, t]$, is defined by:

\[ R(t) = 1 - F(t) = \Pr(T > t) = \int_t^\infty f(u)\,du \tag{5.2} \]

From this it is possible to define the force of mortality (FOM), or hazard rate. The probability that an item will require a maintenance action in an interval $(t, t + \Delta t]$, when we know it has survived up to time $t$, is given by:

\[ \Pr(t < T \le t + \Delta t \mid T > t) \]

The hazard rate is found by dividing this probability by the length of the time interval, $\Delta t$, and letting $\Delta t \to 0$:

\[ z(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{R(t)} \tag{5.3} \]

The mean time to failure (MTTF) is a measure of the center of the life distribution of an item and is defined by:

\[ \mathrm{MTTF} = E(T) = \int_0^\infty u f(u)\,du \tag{5.4} \]

Since $f(t) = -\frac{d}{dt} R(t)$ we have:

\[ \mathrm{MTTF} = -\int_0^\infty t\,\frac{d}{dt}R(t)\,dt = -\big[t R(t)\big]_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt \tag{5.5} \]

(If $\mathrm{MTTF} < \infty$, it can be shown that $\big[t R(t)\big]_0^\infty = 0$.)

Another measure of the center of a life distribution is the median life $t_m$, defined by

\[ R(t_m) = F(t_m) = 0.5 \tag{5.6} \]

If the probability density function of the lifetime distribution is symmetric about the MTTF, then $\mathrm{MTTF} = t_m$. This is the case when, for example, $f(t)$ is a normal density.

The mean residual life $\mu(t)$ is also an important measure. Consider an item with time to a required maintenance action $T$ that is put into operation at time $t = 0$ and is still functioning at time $t$. The probability that the item of age $t$ survives an additional interval of length $u$ is:

\[ R(u \mid t) = \Pr(T > u + t \mid T > t) = \frac{\Pr(T > u + t)}{\Pr(T > t)} = \frac{R(u + t)}{R(t)} \]

$R(u \mid t)$ is also called the conditional survivor function of the item at age $t$. The mean residual life $\mu(t)$ of the item at age $t$ is:

\[ \mu(t) = \int_0^\infty R(u \mid t)\,du = \frac{1}{R(t)} \int_t^\infty R(u)\,du \tag{5.7} \]

When $t = 0$ the item is new, and we have $\mu(0) = \mathrm{MTTF}$.

5.2. Failure in repairable items

Times between successive required maintenance actions (failures) of a repairable item could be modelled by a sequence of distribution functions. If an adequate number of interoccurrence times of a single repairable item are observed, statistical analysis can be based on data from that one repairable item.

Consider a repairable item that is put into operation at time $t = 0$. The first required maintenance action of the repairable item will occur at time $S_1$. When the repairable item has failed, it will be restored to a functioning state. The repair time is neglected. The second required maintenance action will occur at time $S_2$, and so on. We thus get a sequence of required maintenance action times $S_1, S_2, \ldots$. Let $T_i$ be the time between maintenance action $i - 1$ and maintenance action $i$ for $i = 1, 2, \ldots$. $T_i$ is often called the interoccurrence time. Of primary interest is the random variable $N(t)$, the integer number of maintenance actions in the time interval $(0, t]$.

This particular stochastic process $\{N(t), t \ge 0\}$ is called a counting process. A counting process $\{N(t), t \ge 0\}$ may alternatively be represented by the sequence of maintenance action times $S_1, S_2, \ldots$, or by the sequence of interoccurrence times $T_1, T_2, \ldots$, since $S_i = T_1 + T_2 + \cdots + T_i$. The three representations contain the same information about the counting process.
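The equivalence of the three representations can be made concrete with a small sketch. The helper names `action_times`, `count_actions` and `interoccurrence_times` and the numbers are hypothetical, chosen only to show how $S_i$, $T_i$ and $N(t)$ are obtained from one another.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


def action_times(interoccurrence: List[float]) -> List[float]:
    """S_i = T_1 + T_2 + ... + T_i."""
    return list(accumulate(interoccurrence))


def count_actions(s: List[float], t: float) -> int:
    """N(t): number of maintenance actions in (0, t], given sorted times S_i."""
    return bisect_right(s, t)


def interoccurrence_times(s: List[float]) -> List[float]:
    """Recover T_i = S_i - S_(i-1), with S_0 = 0."""
    return [b - a for a, b in zip([0.0] + s[:-1], s)]


# Hypothetical interoccurrence times (months), for illustration only.
T = [30.0, 50.0, 20.0]
S = action_times(T)
print(S, count_actions(S, 85.0), interoccurrence_times(S))
```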

In general, counting processes are used to model sequences of maintenance actions. These maintenance actions will later in this thesis be classified as preventive maintenance actions (right censored) and corrective maintenance actions (uncensored). Truncated maintenance actions will also be described. In this thesis $t$ denotes a specific point in time, irrespective of whether $t$ is global time (a realization of $S_i$) or local time (a realization of an interoccurrence time $T_i$). The time concepts are illustrated in Figure 5.3.

[Figure 5.3: Relation between the number of required maintenance actions N(t), the interoccurrence times ($T_i$) and the calendar/operational times ($S_i$).]

The sequence of interoccurrence times $T_1, T_2, \ldots$ will generally be neither independent nor identically distributed. Hence, $\mathrm{MTBF}_i$ will in general be a function of $i$ and $T_1, T_2, \ldots, T_{i-1}$. The remaining useful life is the time to the next maintenance action from an arbitrary point in time $t$. The expected number of maintenance actions in the interval $(0, t]$ is given by

\[ W(t) = E[N(t)] \]

The time derivative of $W(t)$, $w(t) = \frac{d}{dt} W(t)$, is called the ROCOF. The ROCOF is an absolute rate and has the following interpretation: $w(t)\,\Delta t$ is approximately the probability that a required maintenance action, not necessarily the first, will occur in the interval $(t, t + \Delta t]$.

Since a repairable item can experience several required maintenance actions, Figure 5.4 shows the absolute rate at which required maintenance actions of a single repairable item occur. Dividing the curve into three sections, we can give the following interpretation:

1. Initially, in the running in region, times between successive required maintenance actions tend to increase. This leads to a reduction in the ROCOF.
2. In the middle region, times between successive required maintenance actions tend neither to increase nor to decrease. The ROCOF is close to constant. This region is also called useful life, and the maintenance focus is on life extension of the item.
3. When the system ages (the wear out section), the times between successive required maintenance actions tend to decrease, hence the ROCOF increases. In the beginning of this section the maintenance focus is to postpone a complete renewal. A steeply increasing ROCOF will finally lead to a complete renewal, since the times between required maintenance actions become very short and the effect of the maintenance actions becomes very small.

[Figure 5.4: Bathtub curve for a repairable item. ROCOF w(t) against calendar time t, with running in, constant ROCOF and wear out regions.]

Several stochastic point processes have been used for modelling repairable systems. The most well known are [26]:

The Homogeneous Poisson Process (HPP). In the HPP model all interoccurrence times are independent and exponentially distributed with the same constant hazard rate $z(t) = \lambda$.

The Renewal Process (RP). The RP is a generalization of the HPP in which all interoccurrence times are independent and identically distributed (IID). Upon failure the repairable item is replaced or restored to a good as new condition.

The Superimposed Renewal Process (SRP). The SRP considers a series of items where each item produces a renewal process. The process formed by the union of all failures is called an SRP.

The Non-Homogeneous Poisson Process (NHPP). The NHPP is also a generalization of the HPP. In the NHPP the ROCOF varies with time rather than being constant. This implies that the interoccurrence times are neither independent nor identically distributed. In the NHPP the failed item is restored just back to a functioning state and the repair time is negligible. The likelihood of item failure is the same immediately before and after a failure; the item is restored to a bad as old condition. The NHPP model might be suitable for a repairable item in the running in and the wear out parts of Figure 5.4, while the HPP and RP might be suitable for the constant ROCOF part of Figure 5.4.

Imperfect repair processes. The RP and the NHPP represent two extreme types of repair: good as new and bad as old. Most maintenance actions fall somewhere between these two extremes.

Branching Poisson Process (BPP). This counting process can be used where a primary failure causes one or more secondary failures which are not detected until after the system is put back into operation (since otherwise, all failures occur at the same instant). Another possibility is imperfect repair procedures, so that the system is put back into operation only to fail again because of the same undetected problem. Both primary and secondary failures will keep the system from meeting its operational requirements. The model can thus be used to model situations with imperfect repair procedures.

5.3. Model selection framework

To select the correct model type for a particular reliability data set, the data must be kept in chronological order (Ascher & Feingold [5]). It is important that the data that are merged are homogeneous, meaning that the items are of the same type and that the operational and environmental conditions are comparable. If they are homogeneous and the interarrival times $T_i$ are either increasing or decreasing, an NHPP approach might be suitable. A simple graph of the number of events $N(t)$ as a function of the operating time since observation of the system began can indicate an increasing or decreasing trend. Laplace's trend test or the Military Handbook test can also be used if the number of required maintenance actions for each item is large enough (see Rausand & Høyland [26]). If no trend is present (the interoccurrence times are identically distributed), it is required to test for dependencies between interarrival times. This step also requires many maintenance action observations. If the interarrival times are independent and identically distributed, a lifetime approach is most suited. A model selection framework is presented in Rausand & Høyland [26] and is shown in Figure 5.5.
In the export gas compressor case the chronologically ordered required maintenance actions (failures) are shown in Figure 4.4. Because of the limited number of maintenance actions for each export gas compressor (1, 2 or 3 failures), statistical tests on trend by use of the Laplace test [5] or the Military Handbook test [51] cannot be performed. The Military Handbook test supports suspension as the last running period (as is the case for export gas compressors KA31, KA51 and KA71), but neither test supports truncated distributions. The tests are shown in detail in Appendix H Reliability theory. However, from Figure 4.4 it can be judged that there is no trend towards shorter and shorter interoccurrence times (increasing rate of failures) or longer and longer interoccurrence times (decreasing rate of failures). We can then conclude that the intervals between failures are identically distributed, but not necessarily independent. Several plotting methods and formal tests are available [52] to check whether or not the data may be considered independent. Again, because of the limited amount of reliability data, it is not possible to test for independence, and the running periods are therefore treated as independent and identically distributed (IID).
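For data sets that are large enough, the Laplace trend test mentioned above can be sketched as follows. This is an illustration only: it uses the standard time-truncated form of the statistic for a single system observed on $(0, t_{end}]$, the function name `laplace_trend_test` and the data are hypothetical, and the normal approximation is not meaningful for the one to three failures per compressor available in the case study.

```python
import math
from typing import List, Tuple


def laplace_trend_test(action_times: List[float], t_end: float) -> Tuple[float, float]:
    """Laplace statistic U and two-sided p-value for one system observed on (0, t_end].

    Time-truncated form: U = (mean(S_i) - t_end / 2) / (t_end * sqrt(1 / (12 n))).
    Under an HPP (no trend) U is approximately standard normal.
    U > 0 suggests shorter and shorter interoccurrence times, U < 0 the opposite.
    """
    n = len(action_times)
    if n == 0:
        raise ValueError("at least one maintenance action is needed")
    mean_time = sum(action_times) / n
    u = (mean_time - t_end / 2.0) / (t_end * math.sqrt(1.0 / (12.0 * n)))
    p_value = math.erfc(abs(u) / math.sqrt(2.0))  # two-sided normal tail probability
    return u, p_value


# Hypothetical maintenance action times (months) clustered late in the observation window.
u, p = laplace_trend_test([40.0, 70.0, 85.0, 92.0, 97.0], t_end=100.0)
print(f"U = {u:.2f}, p = {p:.3f}")
```

A small p-value would point towards a repairable system model such as the NHPP in the framework of Figure 5.5, while a large one leaves the renewal process and lifetime distribution branch open.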

[Figure 5.5: Model selection framework [26]. Flow chart: chronologically ordered $T_i$'s; homogeneous samples? (no: split into homogeneous samples); $T_i$'s in chronological order from a homogeneous sample; trend? (yes: repairable system models like NHPP); $T_i$'s identically distributed but not necessarily independent; dependence? (yes: branching Poisson process; no: renewal processes, lifetime distribution like e.g. Weibull).]

5.4. Regression models

Regression analysis is a valuable statistical procedure for estimating the risk of equipment failing when it is subject to condition monitoring. One approach for estimating the hazard (conditional probability of failure) that combines the age of the equipment with condition monitoring data is the proportional hazards model (PHM). A PHM can take various forms, all of which combine a baseline hazard function with a component that takes into account covariates used to improve the prediction of failure. A PHM could be:

\[ z(t; y(t)) = z_0(t)\, e^{\sum_{j=1}^{m} \gamma_j y_j(t)} \tag{5.8} \]

where $z(t; y(t))$ is the instantaneous conditional probability of failure at time $t$, given the values of $y_1(t), y_2(t), \ldots, y_m(t)$. Each $y_j(t)$ in equation 5-8 represents a monitored condition data variable at time $t$. In this thesis $y_j(t)$ will be the development of a time dependent TCI value. In the statistical literature it is called a covariate. The value $\gamma_j$ is the covariate parameter to be estimated which, along with the value of $y_j(t)$, indicates the degree of influence the covariate has on the hazard function. The model consists of two parts:

1. The first part, $z_0(t)$, is the baseline hazard function that takes into account the age of the equipment as a function of time $t$.
2. The second part, $e^{\sum_{j=1}^{m} \gamma_j y_j(t)}$, takes into account the variables that may be thought of as the key risk factors used to monitor the technical condition of the equipment, and their associated weights.

The literature study shows that there are several regression model classes within survival analysis, which examines the time it takes for events to occur. The prototypical event is death, which accounts for the name given to these methods. But survival analysis is also appropriate for many other kinds of events; therefore, terms such as survival are to be understood generically. The wheels of survival analysis have been reinvented several times in different disciplines, and the terminology varies from discipline to discipline:

Survival analysis in biostatistics, which has the richest tradition in this area.
Event-history analysis in sociology.
Failure-time analysis in maintenance and reliability engineering.

Some aspects of the regression models which are important when modelling residual life are discussed in the following sections.

Cox regression models

In Cox regression models (also called relative risk or semi-parametric regression models) the baseline hazard function $z_0(t)$ is left unspecified.
A Cox regression model for observation $a$ with time independent covariates may be written as:

\[ z_a(t; y) = z_0(t)\, e^{\sum_{j=1}^{m} \gamma_j y_{aj}} = z_0(t)\, e^{(\gamma_1 y_{a1} + \gamma_2 y_{a2} + \cdots + \gamma_m y_{am})} \]

This model is semi-parametric because the baseline hazard can take any form, while the covariates enter the model linearly. Consider two observations $a$ and $b$ that differ in their $y$-values, with the corresponding linear predictors:

\[ \eta_a = \gamma_1 y_{a1} + \gamma_2 y_{a2} + \cdots + \gamma_m y_{am} \quad \text{and} \quad \eta_b = \gamma_1 y_{b1} + \gamma_2 y_{b2} + \cdots + \gamma_m y_{bm} \]

The hazard ratio for these two observations,

\[ \frac{z_a(t)}{z_b(t)} = \frac{z_0(t)\, e^{\eta_a}}{z_0(t)\, e^{\eta_b}} = e^{\eta_a - \eta_b} \]

is independent of time $t$. This makes the Cox model a proportional hazards model. Remarkably, even though the baseline hazard is unspecified, the Cox model can still be estimated by the method of partial likelihood, developed by Cox in the same paper in which he introduced the Cox model [8]. The $\gamma$ vector is estimated by maximizing the partial likelihood (Kalbfleisch & Prentice [53]). The derivation of an estimator of the baseline cumulative hazard function is analogous to the Nelson-Aalen estimator [53].

The relative risk model is attractive because there is no need to make any assumptions about the form of the underlying baseline. However, to be able to calculate remaining useful life, the baseline function needs to be parametric. This means that the total PHM model needs to be parameterized. In this thesis the covariates (TCI values) are time dependent. A natural extension of the PHM model is to allow the covariates to vary over time.

Weibull regression models

In the parametric PHM model the form of the baseline is known, and its parameters need to be estimated. A number of probability distributions may be used to model the lifetime, assuming IID interoccurrence times. A list of distributions is given in the book of Rausand & Høyland [26]. Since the Weibull distribution is a flexible and well-known candidate in maintenance and reliability theory, it will be used further in this thesis. A Weibull PHM [54], which is a PHM with a Weibull baseline, is given by:

\[ z(t; y(t)) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1} e^{\sum_{j=1}^{m} \gamma_j y_j(t)} \tag{5.9} \]

The shape parameter $\beta$ describes how the age of the equipment influences the hazard rate, and $\eta$ is the scale parameter of the Weibull PHM. It is important to note that $\eta$ in the Weibull PHM does not have the interpretation that 63.2% of failures occur before this time ($F(\eta) = 1 - e^{-1}$), which would be the case if the hazard were not influenced by covariates. The PHM function $z(t; y(t))$ does not have a clear hazard function interpretation. However, in the book of Jardine & Tsang [54] it is given a risk interpretation.

Parameter estimation by the method of maximum likelihood

The general idea of likelihood inference is to fit models to data by entertaining model-parameter combinations for which the probability of the data is large. Model-parameter combinations with relatively high probabilities are more plausible than combinations with low probability. Likelihood methods provide general tools for fitting models to data and can be applied to a wide variety of censored and truncated data. It is also possible to fit models with explanatory variables (i.e., regression analysis).
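The idea of comparing model-parameter combinations by the probability they assign to the data can be illustrated with a minimal sketch. It assumes complete (uncensored, untruncated) observations and a plain Weibull distribution without covariates; the function name `weibull_log_likelihood`, the data and the candidate parameter values are hypothetical.

```python
import math
from typing import List


def weibull_log_likelihood(times: List[float], beta: float, eta: float) -> float:
    """Log-likelihood of uncensored times under a Weibull(shape=beta, scale=eta) model."""
    return sum(
        math.log(beta / eta) + (beta - 1.0) * math.log(t / eta) - (t / eta) ** beta
        for t in times
    )


# Hypothetical interoccurrence times (hours), for illustration only.
times = [900.0, 1100.0, 1300.0, 1500.0]
for beta, eta in [(1.0, 1200.0), (4.0, 1300.0), (4.0, 300.0)]:
    print(f"beta = {beta}, eta = {eta}: l = {weibull_log_likelihood(times, beta, eta):.2f}")
```

The combination with the largest log-likelihood is the most plausible of the three candidates; maximum likelihood estimation searches over all admissible combinations instead of a short list.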

To fix the idea, let $T_1, T_2, \ldots, T_n$ denote $n$ independent, identically distributed random variables with probability density function (pdf) $f(t; \theta_1, \theta_2, \ldots, \theta_p)$, where $f$ is of known form and $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ takes values in a subset of a $p$-dimensional space. Here, the observations $T_1, T_2, \ldots, T_n$ represent the interoccurrence times. In the case of no censoring, the joint probability density of $T_1, T_2, \ldots, T_n$ is given by $\prod_{i=1}^{n} f(t_i; \theta)$. Now consider this expression as a function of $\theta$ for fixed $t_1, t_2, \ldots, t_n$ and denote the function:

\[ L(\theta; t_1, t_2, \ldots, t_n) = L(\theta; t) = \prod_{i=1}^{n} L_i(\theta; t_i) \]

$L(\theta; t)$ is then called the likelihood function.

From chapter 5.3 Model selection framework it is concluded that the interoccurrence times can be modelled as independent and identically distributed. If the NHPP had been applicable, the independence of the increments could have been utilized in the construction of the likelihood function. With censored observations or truncated data, the likelihood function is somewhat modified. The specific case data set for the natural gas export compressors in Table 4.2 includes left-truncated data, and some of the running periods are right-censored. A short description of such contributions to the likelihood is therefore given below.

Right-censored observations

Right censoring is common in reliability data analysis. For example, the last running periods of gas compressors KA31, KA51 and KA71 are right-censored, because all that we know about the required maintenance action times is that they were greater than a specific time. If there is a lower bound $t_i$ for the $i$th required maintenance action time, the time is somewhere in the interval $(t_i, \infty)$.
Then the probability and likelihood construction for this right-censored observation is

\[ L_i(\theta) = \int_{t_i}^{\infty} f(t; \theta)\,dt = R(t_i; \theta) \tag{5.10} \]

More details on censoring and truncation are given in Appendix H Reliability theory.

Left-truncated data

In some reliability studies, observations may be truncated. Truncation, which is similar to but different from censoring, arises when observations are observed only when they take on values in a particular range. For observations that fall outside this range, not even their existence is known (and that is what distinguishes truncation from censoring). An example of left-truncation in AIDS research is reported in a medical paper by Kim [55]. If the interoccurrence time $T_i$ is truncated to the left at time $x_i$, then the probability of an observation in the interval $(t_i^L, t_i]$ is the conditional probability

\[ L_i(\theta) = \Pr(t_i^L < T_i \le t_i \mid T_i > x_i) = \frac{F(t_i; \theta) - F(t_i^L; \theta)}{R(x_i; \theta)}, \quad t_i > t_i^L \ge x_i \]

For an observation reported as a required maintenance action at exact time $t_i$, the corresponding density approximation form of the likelihood is

\[ L_i(\theta) = \frac{f(t_i; \theta)}{R(x_i; \theta)}, \quad t_i > x_i \tag{5.11} \]

For a right-censored observation at time $t_i$, the corresponding form of the likelihood is

\[ L_i(\theta) = \frac{R(t_i; \theta)}{R(x_i; \theta)}, \quad t_i > x_i \tag{5.12} \]

More details on censoring and truncation are given in Appendix H Reliability theory.

Case assumptions and model building

It is assumed that a renewal process is applicable for modelling the maintenance actions. The following assumptions of the model are then made:

The natural gas compressors are identical. They have the same kind of technical condition information and are working under the same environmental conditions.
The interoccurrence times are IID and behave according to a renewal process.
The gas compressors are repaired back to a good as new condition. Repair times are neglected.
The only covariate is the highest aggregated TCI value of a gas compressor, which by definition is positive.
The Weibull distribution is used to parameterize the baseline function in the PHM model.

An imagined trend of the TCI path with events for a single gas compressor is shown in Figure 5.6.

[Figure 5.6: Renewal process with covariates (aggregated TCI values) [56]. TCI Parent against time for three running periods $T_1$, $T_2$, $T_3$ ending at $t_1$, $t_2$, $t_3$ with a preventive, a corrective and a preventive maintenance action.]

Assume that the sudden increase in TCI value is due to the maintenance performed. The sudden decrease is due to a Minor (U) or Major (L) notification or, for example, a sudden increase in one or several vibration measurements.

To estimate the parameters of a parametric hazard regression model, e.g. the Weibull PHM model, by maximum likelihood, we have to write down the probability (or probability density) of the data as a function of the model parameters.

\[ z(t; y(t), \beta, \eta, \gamma_1, \ldots, \gamma_m) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1} e^{\sum_{j=1}^{m} \gamma_j y_j(t)} \tag{5.13} \]

where $z(t; y(t), \beta, \eta, \gamma_1, \ldots, \gamma_m)$ is the instantaneous conditional probability of failure at time $t$, given the values of $y_1(t), y_2(t), \ldots, y_m(t)$. In the gas compressor case the only covariate $y_1(t)$ is TCI Parent. It is assumed that the only form of censoring is right-censoring and the only form of truncation is left-truncation. There are no tied maintenance action times in the data. $\theta$ is a parameter vector including $(\beta, \eta, \gamma_1, \ldots, \gamma_m)$ in the parametric Weibull PHM case. When $z(t; y(t), \theta)$ is known, it is possible to calculate the probability density function $f(t; y(t), \theta)$ of required maintenance actions for the export compressor by:

\[ f(t; y(t), \theta) = z(t; y(t), \theta)\, e^{-\int_0^t z(u; y(u), \theta)\,du} \tag{5.14} \]

For the calculation of equation 5-14 the behaviour of the covariates needs to be extrapolated. Techniques for extrapolation are described by Makis and Jardine [57] and Vlok [58]. It should be noted that $S(t; y) = S(t; y(u), 0 \le u \le t) = e^{-\int_0^t z(u; y(u))\,du}$ is frequently confused with the conditional reliability function $R(t; y) = \Pr(T > t \mid y(u), 0 \le u \le t)$ (see Banjevic et al. [59]).

Let $f(t; y(t), \theta)$ represent the probability density for a required maintenance action at time $t$ given values of the covariates $y$ and regression parameters $\theta$. The required maintenance action observed at time $t_1$ and truncated at time $x_1$ will then contribute $\frac{f(t_1; y(t_1), \theta)}{S(x_1; y(x_1), \theta)}$ to the likelihood. For a suspension that is censored at time $t_2$ and truncated at time $x_2$, the contribution to the likelihood is $\frac{S(t_2; y(t_2), \theta)}{S(x_2; y(x_2), \theta)}$. It is convenient to introduce the censoring indicator variable $c_i$, which is set to 1 if running period $i$ of a gas compressor ends in a required maintenance action and to 0 if it is a suspension. Then the likelihood function for the independent running periods can be written as:

\[ L(\theta) = \prod_{i=1}^{n} \left[ \frac{f(t_i; y_i(t_i), \theta)}{S(x_i; y_i(x_i), \theta)} \right]^{c_i} \left[ \frac{S(t_i; y_i(t_i), \theta)}{S(x_i; y_i(x_i), \theta)} \right]^{1 - c_i} = \prod_{i=1}^{n} \big( z(t_i; y_i(t_i), \theta) \big)^{c_i}\, \frac{S(t_i; y_i(t_i), \theta)}{S(x_i; y_i(x_i), \theta)} \tag{5.15} \]

According to the PHM model,

\[ z(t_i; y_i(t_i), \theta) = z_0(t_i)\, e^{y_i^T(t_i)\gamma} \tag{5.16} \]

Substituting these values from the PHM model into the formula for the likelihood produces:

\[ L(\theta) = \prod_{i=1}^{n} \left( z_0(t_i)\, e^{y_i^T(t_i)\gamma} \right)^{c_i} \frac{e^{-\int_0^{t_i} z_0(u)\, e^{y_i^T(u)\gamma}\,du}}{e^{-\int_0^{x_i} z_0(u)\, e^{y_i^T(u)\gamma}\,du}} = \prod_{i=1}^{n} \left( z_0(t_i)\, e^{y_i^T(t_i)\gamma} \right)^{c_i} e^{-\int_{x_i}^{t_i} z_0(u)\, e^{y_i^T(u)\gamma}\,du} \tag{5.17} \]

Full maximum-likelihood estimates are the values of the parameters $\gamma$ and of the baseline hazard parameters that maximize this likelihood. The likelihood for a parametric PHM, such as the Weibull PHM, must be maximized numerically. It is more convenient to maximize $l(\theta) = \ln L(\theta)$:

\[ l(\theta) = \ln L(\theta) = \sum_{i=1}^{n} c_i \left[ \ln z_0(t_i) + y_i^T(t_i)\gamma \right] - \sum_{i=1}^{n} \int_{x_i}^{t_i} z_0(u)\, e^{y_i^T(u)\gamma}\,du \tag{5.18} \]
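Equation 5.18 can be evaluated numerically as in the sketch below. It is an illustration under stated assumptions, not the S-plus script used in the thesis: it uses a Weibull baseline, a single covariate supplied as a function of time, and a simple midpoint rule for the cumulative hazard. The names `RunningPeriod`, `weibull_phm_hazard` and `log_likelihood` and the running periods are hypothetical.

```python
import math
from typing import Callable, List, Tuple

# One running period: (x_i, t_i, c_i, y_i) with left-truncation time x_i, end time t_i,
# indicator c_i (1 = required maintenance action, 0 = suspension) and covariate path y_i(u).
RunningPeriod = Tuple[float, float, int, Callable[[float], float]]


def weibull_phm_hazard(u: float, beta: float, eta: float, gamma: float,
                       y: Callable[[float], float]) -> float:
    """z(u; y(u)) = (beta / eta) * (u / eta)**(beta - 1) * exp(gamma * y(u))."""
    return (beta / eta) * (u / eta) ** (beta - 1.0) * math.exp(gamma * y(u))


def log_likelihood(periods: List[RunningPeriod], beta: float, eta: float,
                   gamma: float, steps: int = 2000) -> float:
    """Log-likelihood of equation 5.18 for a Weibull baseline and one covariate."""
    total = 0.0
    for x, t, c, y in periods:
        if c == 1:
            # Event term: ln z_0(t_i) + gamma * y_i(t_i)
            total += math.log(weibull_phm_hazard(t, beta, eta, gamma, y))
        # Cumulative hazard over (x_i, t_i], midpoint rule (avoids evaluating at u = 0).
        h = (t - x) / steps
        total -= h * sum(
            weibull_phm_hazard(x + (k + 0.5) * h, beta, eta, gamma, y)
            for k in range(steps)
        )
    return total


# Hypothetical running periods (hours) with a hypothetical linearly drifting covariate.
periods = [
    (100.0, 500.0, 1, lambda u: 0.9 - 0.001 * u),  # left-truncated, ends in an action
    (0.0, 300.0, 0, lambda u: 0.9 - 0.0005 * u),   # right-censored suspension
]
print(log_likelihood(periods, beta=2.0, eta=400.0, gamma=-1.5))
```

A numerical optimizer would then be applied to the negative of this function over $(\beta, \eta, \gamma)$, with $\beta$ and $\eta$ constrained to be positive.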

Since the logarithmic function is monotonically increasing, the $\theta$ that maximizes $l(\theta)$ is the same $\theta$ that maximizes $L(\theta)$. In the case that a Weibull distribution is used in the baseline (see equation 5-13), the log-likelihood is

\[ l(\beta, \eta, \gamma_1, \ldots, \gamma_m) = \sum_{i=1}^{n} c_i \left[ \ln\beta + (\beta - 1)\ln t_i - \beta \ln\eta + y_i^T(t_i)\gamma \right] - \sum_{i=1}^{n} \int_{x_i}^{t_i} \frac{\beta}{\eta} \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma}\,du \]

The maximum likelihood estimator $\hat{\theta}$ is given as the solution to the equations

\[ \frac{\partial l}{\partial \beta} = 0, \quad \frac{\partial l}{\partial \eta} = 0, \quad \frac{\partial l}{\partial \gamma_1} = 0, \quad \ldots, \quad \frac{\partial l}{\partial \gamma_m} = 0 \tag{5.19} \]

where $l = l(\beta, \eta, \gamma_1, \ldots, \gamma_m)$ is the Weibull log-likelihood above. With $y_{ij}$ denoting the $j$th component of $y_i$, the log-likelihood equations then take the form

\[
\begin{aligned}
& \sum_{i=1}^{n} c_i \left[ \frac{1}{\beta} + \ln t_i - \ln\eta \right] - \sum_{i=1}^{n} \int_{x_i}^{t_i} \left[ \frac{1}{\eta} \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma} + \frac{\beta}{\eta} \ln\!\left( \frac{u}{\eta} \right) \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma} \right] du = 0 \\
& \sum_{i=1}^{n} c_i \frac{\beta}{\eta} = \sum_{i=1}^{n} \int_{x_i}^{t_i} \left[ \frac{\beta}{\eta^2} \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma} + \frac{\beta(\beta - 1)}{\eta^2} \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma} \right] du \\
& \sum_{i=1}^{n} c_i\, y_{i1}(t_i) - \sum_{i=1}^{n} \int_{x_i}^{t_i} \frac{\beta}{\eta}\, y_{i1}(u) \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma}\,du = 0 \\
& \qquad \vdots \\
& \sum_{i=1}^{n} c_i\, y_{im}(t_i) - \sum_{i=1}^{n} \int_{x_i}^{t_i} \frac{\beta}{\eta}\, y_{im}(u) \left( \frac{u}{\eta} \right)^{\beta - 1} e^{y_i^T(u)\gamma}\,du = 0
\end{aligned}
\tag{5.20}
\]

Since there is no closed form solution to these equations, they have to be solved by an iterative procedure, implemented in e.g. Maple software from Waterloo Maple Inc. In this thesis the method suggested in Kim [55] is used. $l(\theta)$ was coded in an S-plus script which uses the function nlminb. This function is a quasi-Newton procedure for minimizing an objective function subject to constraints. Since nlminb minimizes, the negative log-likelihood $-l(\theta)$ serves as the objective function, and the set of parameter values that minimizes it is taken as the maximum likelihood estimates. The script is attached in Appendix F Programming and Results.

Remaining useful life estimation

As far as the author knows, the mean residual life was first applied by Deevey [6] in the biostatistics discipline. During the last four decades failure-time analysts in maintenance and safety engineering, and event-history analysts in sociology, have shown interest in the mean residual life and developed many useful results. The expected value of the random residual life is defined in equation 5-7. In reliability analysis a decreasing mean residual life (DMRL) function is natural [61]. A detailed summary of the theory and application of mean residual life can be found in [62]. The proportional hazards model (PHM) and the proportional intensity model (PIM) have been used for remaining useful life estimation in combination with a trending model for the covariates.
A joint model of the PHM and a Markov property for covariate evolution was developed by Banjevic and Jardine [2], and a PIM with covariate extrapolation for estimating bearing residual life was developed by Vlok [21]. In this thesis the covariate extrapolation technique is used for the technical condition indexes (TCIs). Proportional mean residual life (PMRL) models have also been introduced in the literature to study the association between survival times and covariates [63], [64]. The PHM rests on underlying assumptions that influence the residual/remaining life calculations. In the PHM the interoccurrence times $T_1, T_2, \dots, T_n$

are independent and identically distributed. This implies that the trend of the covariates (TCIs) and the interoccurrence time in running period $i$ are independent of the trend of the covariates and the interoccurrence time of running period $i-1$. Model prediction of future covariate values in the latest running period can therefore only be based on values recorded since the last required maintenance action. The covariate paths of former running periods contribute only through the already estimated $\gamma$ parameter. Despite this limitation, the positive argument is that if the technical condition measurements (TCIs) are highly correlated with the real technical condition (degradation) development and the number of running periods is reasonably high, the ability to predict TCIs in the latest running period increases with the quantity of TCIs available. That is, the closer the current time (now) in the latest running period is to the next required maintenance action, the more quality data is available. This is illustrated in Figure 5.7.

[Figure 5.7: The remaining useful life. Labels in the figure: y(x), new, now, history information, predicting future covariate behaviour (stochastic or deterministic), t, u, time, failure.]

If the TCIs are based on high-frequency data with large amplitudes, the compressors might experience unwanted false alarms in the form of short remaining useful life estimates when the TCIs suddenly decrease (noise). TCIs should capture the lower and stable frequency responses (degradation). The remaining useful life might then be based on TCIs, but in addition it should be able to react promptly to sudden unwanted changes in e.g. usage and environmental influences. Measurements of such influential covariates could then be included in the model for remaining useful life.
Using the local time scale in Figure 5.7, if the system is still functioning at local time $t$, the probability that the system of age $t$ survives an additional interval of length $u$ is

$$S\big([u;y(u),\theta]\mid[t;y(t),\theta]\big)=\frac{S(t+u;\,y(t+u),\theta)}{S(t;\,y(t),\theta)} \tag{5.21}$$

The expected remaining useful life, $\mathrm{ERUL}(t;y(t),\theta)$, of the system at age $t$ is

$$\mathrm{ERUL}(t;y(t),\theta)=\int_{0}^{\infty}\frac{S(t+u;\,y(t+u),\theta)}{S(t;\,y(t),\theta)}\,du=\frac{1}{S(t;\,y(t),\theta)}\int_{t}^{\infty}S(u;\,y(u),\theta)\,du \tag{5.22}$$
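Equation 5-22 can be evaluated numerically once a survival function is available. The following Python sketch is an illustration only, not the Maple implementation used in the thesis: it assumes a Weibull PHM whose covariate is frozen at a constant value, so that the survival function has a closed form, and it truncates the infinite integral at a finite horizon. The function names `survival` and `erul`, the `tail` cut-off, and all parameter values are hypothetical.

```python
import math

def survival(t, beta, eta, gamma, y):
    """S(t; y) for a Weibull PHM with the covariate frozen at y (assumption)."""
    return math.exp(-((t / eta) ** beta) * math.exp(gamma * y))

def erul(t, beta, eta, gamma, y, tail=40.0, steps=100_000):
    """ERUL(t) = (1 / S(t)) * integral from t to infinity of S(u) du.

    The infinite upper limit is cut off `tail` effective scale lengths
    beyond t, and the integral is taken with the trapezoidal rule.
    The cut-off is only adequate for moderate shape values (beta >= 1).
    """
    scale = eta * math.exp(-gamma * y / beta)  # effective characteristic life
    upper = t + tail * scale
    h = (upper - t) / steps
    total = 0.5 * (survival(t, beta, eta, gamma, y)
                   + survival(upper, beta, eta, gamma, y))
    for k in range(1, steps):
        total += survival(t + k * h, beta, eta, gamma, y)
    return h * total / survival(t, beta, eta, gamma, y)
```

With $\beta = 1$ and $\gamma = 0$ the model is exponential and the sketch returns $\eta$ for any age $t$, while for $\beta > 1$ the returned value decreases with $t$, in line with the decreasing mean residual life mentioned above.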

In equation 5-22 the domain of integration over which the function $S(t;y)=e^{-\int_{0}^{t}z(u;\,y(u))\,du}$ is integrated has been divided into three sub domains. By definition the TCIs take values in the range $[0,1]$. In the first, truncated sub domain just after a required maintenance action, the TCIs are not known. Since a renewal process is used as the underlying model assumption, they are simply judged to be constant and equal to 1 (for some running periods the maintenance actions are judged to be worse than new, implying a TCI value less than 1). In the second sub domain the TCIs follow the polynomial fitting described in chapter 4.4 First order regression analysis of aggregated TCIs. The last sub domain is the important prediction part of future TCI values. Here a simple extrapolation [21] of the polynomial is used until it reaches the first threshold of valid TCI values, e.g. 0. Since the upper limit of the third sub domain is $\infty$, the TCIs are judged to be constant and equal to this threshold of valid TCI values thereafter. This procedure has been implemented in Maple code and is attached in Appendix F Programming and Results.

For calculation of upper and lower error bounds of the expected remaining useful life, the Fisher matrix $C(\theta)$ has to be known in order to include the uncertainty in the parameters estimated by maximum likelihood. Central to asymptotic likelihood arguments are the score vectors:

$$U_i(\theta)=\frac{d\ln L_i(\theta)}{d\theta}=\left[\frac{\partial\ln L_i(\theta)}{\partial\theta_j}\right]_{p\times 1} \tag{5.23}$$

where $i=1,\dots,n$ and $p$ is the dimension of $\theta$. If the operations of expectation and differentiation with respect to $\theta$ can be interchanged, it can be shown that $U_i(\theta)$ has expectation 0 and covariance matrix

$$C_i(\theta)=E\left[U_i(\theta)\,U_i(\theta)^{T}\right]=-E\left[\frac{\partial^{2}\ln L_i(\theta)}{\partial\theta_j\,\partial\theta_k}\right]_{p\times p} \tag{5.24}$$

With random censorship, $U_1(\theta),\dots,U_n(\theta)$ are independent.
From the central limit theorem the total score statistic (Kalbfleisch & Prentice [53]) $U(\theta)=\sum_{i=1}^{n}U_i(\theta)$ is typically asymptotically normally distributed with mean 0 and covariance matrix (Fisher matrix):

$$C(\theta)=\sum_{i=1}^{n}C_i(\theta)=\sum_{i=1}^{n}E\left[U_i(\theta)\,U_i(\theta)^{T}\right] \tag{5.25}$$

The maximum likelihood estimator (MLE) $\hat\theta$ is the unique solution to $U(\hat\theta)=0$, and the asymptotic distribution of $\hat\theta$ for sufficiently large $n$ is multivariate normal with mean $\theta$ and covariance matrix $C(\theta)^{-1}$. We can write:

$$\hat\theta\sim N\left(\theta,\,C(\theta)^{-1}\right) \tag{5.26}$$

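Confidence interval estimation can be based on this result. Under regularity conditions [53], $C(\theta)^{-1}C(\hat\theta)$ converges in probability to an identity matrix. Thus, the Fisher information can be replaced by an estimator. A simple estimator of $C(\theta)$ is the observed information:

$$I(\theta)=\left[-\frac{\partial^{2}\ln L(\theta)}{\partial\theta_i\,\partial\theta_j}\right] \tag{5.27}$$

The Fisher matrix in equation 5-25 can be replaced with $I(\theta)$ or with $I(\hat\theta)$ without affecting the asymptotic results. The local estimate of the covariance matrix of the parameters estimated by maximum likelihood is then:

$$\begin{bmatrix}\operatorname{Var}(\hat\theta_1)&\cdots&\operatorname{Cov}(\hat\theta_1,\hat\theta_p)\\ \vdots&\ddots&\vdots\\ \operatorname{Cov}(\hat\theta_p,\hat\theta_1)&\cdots&\operatorname{Var}(\hat\theta_p)\end{bmatrix}=\begin{bmatrix}-\dfrac{\partial^{2}\ln L(\theta)}{\partial\theta_1^{2}}&\cdots&-\dfrac{\partial^{2}\ln L(\theta)}{\partial\theta_1\,\partial\theta_p}\\ \vdots&\ddots&\vdots\\ -\dfrac{\partial^{2}\ln L(\theta)}{\partial\theta_p\,\partial\theta_1}&\cdots&-\dfrac{\partial^{2}\ln L(\theta)}{\partial\theta_p^{2}}\end{bmatrix}^{-1}_{\theta_i=\hat\theta_i,\ i=1,\dots,p} \tag{5.28}$$

The uncertainty in the estimated parameters in equation 5-28 has to be reflected in the variance of the expected remaining useful life function. It is generally true for the expected remaining useful life function $\mathrm{ERUL}(t;y(t),\hat\theta)$, which is a function of the maximum likelihood estimators $\hat\theta$, that

$$E\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)=\mathrm{ERUL}(t;y(t),\theta)+O\!\left(\frac{1}{n}\right) \tag{5.29}$$

The term $O(1/n)$ is a function of the sample size $n$ and tends to zero as fast as $1/n$ as $n\to\infty$. The variance of the expected remaining useful life function, $\operatorname{Var}\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)$, can then be calculated by the delta method, see appendix B.2 in Meeker & Escobar [65]:

$$\operatorname{Var}\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)=\begin{bmatrix}\dfrac{\partial\,\mathrm{ERUL}(t;y(t),\theta)}{\partial\theta_1}\\ \vdots\\ \dfrac{\partial\,\mathrm{ERUL}(t;y(t),\theta)}{\partial\theta_p}\end{bmatrix}^{T}_{\theta=\hat\theta}\begin{bmatrix}\operatorname{Var}(\hat\theta_1)&\cdots&\operatorname{Cov}(\hat\theta_1,\hat\theta_p)\\ \vdots&\ddots&\vdots\\ \operatorname{Cov}(\hat\theta_p,\hat\theta_1)&\cdots&\operatorname{Var}(\hat\theta_p)\end{bmatrix}\begin{bmatrix}\dfrac{\partial\,\mathrm{ERUL}(t;y(t),\theta)}{\partial\theta_1}\\ \vdots\\ \dfrac{\partial\,\mathrm{ERUL}(t;y(t),\theta)}{\partial\theta_p}\end{bmatrix}_{\theta=\hat\theta} \tag{5.30}$$

Once the vector and the matrix in equation 5-30 have been obtained, the approximate confidence bounds on the expected remaining useful life function can be calculated, see Nelson [66] page 362. These calculations are not trivial, and it is important to note that equation 5-30 is only asymptotically true.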

Following the general theory in appendix B.6.7 in Meeker & Escobar [65], a normal-approximation confidence interval for the estimated expected remaining useful life function $\mathrm{ERUL}(t;y(t),\hat\theta)$ can be based on the approximate $N(0,1)$ distribution of

$$Z=\frac{\mathrm{ERUL}(t;y(t),\hat\theta)-\mathrm{ERUL}(t;y(t),\theta)}{\sqrt{\operatorname{Var}\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)}}$$

Then an approximate $100(1-\alpha)\%$ confidence interval for the true value $\mathrm{ERUL}(t;y(t),\theta)$ is

$$\left[\mathrm{ERUL}_{lower},\ \mathrm{ERUL}_{upper}\right]=\mathrm{ERUL}(t;y(t),\hat\theta)\pm z_{(1-\alpha/2)}\sqrt{\operatorname{Var}\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)}$$

The upper 95% limit for the true value $\mathrm{ERUL}(t;y(t),\theta)$ of the expected remaining useful life is given by

$$\mathrm{ERUL}_{upper}(t;y(t),\theta)=\mathrm{ERUL}(t;y(t),\hat\theta)+1.96\sqrt{\operatorname{Var}\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)} \tag{5.31}$$

and the lower 95% limit for the true value $\mathrm{ERUL}(t;y(t),\theta)$ of the expected remaining useful life is given by

$$\mathrm{ERUL}_{lower}(t;y(t),\theta)=\mathrm{ERUL}(t;y(t),\hat\theta)-1.96\sqrt{\operatorname{Var}\left(\mathrm{ERUL}(t;y(t),\hat\theta)\right)} \tag{5.32}$$

The main shortcoming of normal-approximation confidence intervals is the requirement on the number of maintenance actions. In Meeker & Escobar [65] on page 177, it is claimed that the confidence region can be calibrated accurately even in moderately small samples, e.g. 15-20 maintenance actions. This is not the case for the gas compressors, which have only experienced 8 required maintenance actions. However, the error bounds on the expected remaining useful life have been calculated by use of this technique in chapter 7.1 Lifetime distribution based on aging.
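The delta-method calculation in equations 5-30 to 5-32 can be sketched in a few lines of Python. This is a hedged illustration, not the thesis implementation: the gradient is approximated by central differences instead of being derived analytically, `g` stands for any smooth function of the parameters (for instance the ERUL at a fixed age), and the names `numerical_gradient` and `delta_method_interval` as well as the example covariance matrix are hypothetical.

```python
import math

def numerical_gradient(g, theta, rel_step=1e-6):
    """Central-difference approximation of the gradient of g at theta."""
    grad = []
    for j in range(len(theta)):
        h = rel_step * max(1.0, abs(theta[j]))
        up = list(theta)
        dn = list(theta)
        up[j] += h
        dn[j] -= h
        grad.append((g(up) - g(dn)) / (2.0 * h))
    return grad

def delta_method_interval(g, theta_hat, cov, z=1.96):
    """Return (lower, estimate, upper) for g(theta) by the delta method.

    cov is the estimated covariance matrix of theta_hat (equation 5-28),
    and z = 1.96 gives the approximate 95% limits of equations 5-31, 5-32.
    """
    grad = numerical_gradient(g, theta_hat)
    p = len(theta_hat)
    var = sum(grad[j] * cov[j][k] * grad[k] for j in range(p) for k in range(p))
    estimate = g(theta_hat)
    half_width = z * math.sqrt(var)
    return estimate - half_width, estimate, estimate + half_width

# Hypothetical example: a linear function of two parameters.
example_cov = [[4.0, 1.0], [1.0, 9.0]]
lower, estimate, upper = delta_method_interval(
    lambda th: 2.0 * th[0] + 3.0 * th[1], [1.0, 2.0], example_cov)
```

For a linear function the delta method is exact, which makes the example easy to check by hand; for the ERUL function the result is only asymptotically valid, as noted above.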

6. Condition-based model for remaining useful life

Until recently, most mathematical models have described the uncertainty in aging using a lifetime distribution. Estimation of the hazard rate is based on breakdowns (exact failure times) and/or censored events from a limited population of running periods. In Reinertsen [67] it is stated that the probability distribution function should be decided based on knowledge of the underlying failure mechanisms. The primary function of the life data should therefore only be to determine the parameters. The bottleneck is that more often than not there are simply not enough relevant data to perform rigorous statistical analysis for determining the distribution parameters. The existence of only 8 required maintenance actions in total for the 5 gas compressors observed over more than 7 years is a good example. Modelling by use of degradation physics (Samdal [38]) and the characteristics of the usage and the operating environment might then be more attractive, since the population of technical condition, usage, and environmental data is often much larger than the population of breakdowns. In Pulkkinen [68] the remaining useful life is calculated as the time until the wear reaches a threshold value for the first time. For the best modelling of the stress-strength model it is recommended to model the deterioration as a time-dependent stochastic process. A random-variable or deterministic approach might be a good approximation, but one should be careful as soon as uncertainty associated with the temporal evolution or progression of deterioration over time is involved. If the deterioration rate is randomized from several samples or deterministically adapted, the deterioration path of a specific sample remains fixed over its entire lifetime as soon as the progression of sample-specific deterioration is measured.
Of course, it is very important that the measured, calculated, or aggregated data are highly correlated with the real technical condition process and do not merely contain partial information with much noise. For a complex system like a gas compressor, a single sensor is incapable of collecting enough data for accurate condition monitoring and prognosis. Multiple information sources are needed. The problem of how to combine all partial information obtained from different information sources for more accurate technical condition monitoring and prognosis is known as multi-sensor data fusion. It is a complex problem which varies from case to case. The TCI approach is one typical proposed example. In the modelling of deterioration, two types of uncertainty are encountered, namely sampling and temporal uncertainty. Sampling uncertainty means that the parameters of the deterioration process vary from sample to sample, while the uncertainty associated with the evolution or progression of deterioration over time is referred to as temporal variability (Pandey et al. [69]). In order to model the temporal variability of deterioration, stochastic

processes like Markov processes might be used. Markov processes include stochastic processes with independent increments like Brownian motion with drift (also called the Wiener process) [7] and the gamma process. Since the TCIs are alternately increasing and decreasing, a deterioration process like the Brownian motion with drift is adequate. For the accelerated deterioration in terms of monotonic TCI development immediately before a required maintenance action, the gamma process is most appropriate. In this thesis the accelerated part will be modelled by the gamma process, and the results will be compared against the previous approach described in chapter 5.5 Remaining useful life estimation.

6.1 Accelerated monotone TCI development modelled by a gamma process

A difficulty in modelling time-dependent reliability is that the process of TCIs is uncertain over the life of the item. In structural engineering, a distinction is made between a structure's resistance and its applied stress. In a probabilistic stress-strength model a failure (required maintenance action) may then be defined as the event in which a transformed TCI drops below the stress. In the gamma process model, the cumulative transformed time-dependent TCI development at time $t$ follows a gamma distribution with shape parameter $v(t)>0$ and constant scale parameter $u$. The shape function $v(t)$ must be an increasing function to reflect the monotonic nature of accelerated transformed TCI development. Further requirements are that it must be right-continuous, real-valued, and $v(0)=0$. Throughout this chapter $t$ denotes local time measured from a specified operational time $t_0$, i.e. $t=0$ at operational time $t_0$. The point $t=0$ on the local time scale is judged by inspection of the TCI graphs. Let $Y(t)$ denote a transformed TCI value at time $t$.
A formal definition of a gamma process with shape function $v(t)$ and scale parameter $u>0$ is a continuous-time stochastic process $\{Y(t),\ t\ge 0\}$ with properties
1. $Y(0)=0$ with probability one
2. $Y(t_2)-Y(t_1)\sim Ga\big(v(t_2)-v(t_1),\,u\big)$ for all $t_2>t_1\ge 0$
3. $Y(t)$ has independent increments

The probability density function (pdf) is then given by

$$f_{Y(t)}(y)=\frac{u^{v(t)}}{\Gamma(v(t))}\,y^{v(t)-1}e^{-uy}=Ga\big(y;\,v(t),u\big) \tag{6.1}$$

where $\Gamma(a)=\int_{z=0}^{\infty}z^{a-1}e^{-z}\,dz$ is the gamma function for $a>0$. To fit the TCI values to the gamma process properties they are transformed by applying the linear equation

$$Y(t)=TCI_{trans}(t)=TCI(0)-TCI(t) \tag{6.2}$$

where $TCI(0)$ is the value of the original TCI at the local time $t=0$. The expectation and the variance of the transformed TCI value $Y(t)$ at time $t$ are, for the gamma process,

$$E\big(Y(t)\big)=\frac{v(t)}{u} \tag{6.3}$$

$$\operatorname{Var}\big(Y(t)\big)=\frac{v(t)}{u^{2}} \tag{6.4}$$

The coefficient of variation is defined by the ratio of the standard deviation and the mean

$$CV\big(Y(t)\big)=\frac{\sqrt{\operatorname{Var}\big(Y(t)\big)}}{E\big(Y(t)\big)}=\frac{1}{\sqrt{v(t)}} \tag{6.5}$$

which decreases as time increases.

[Figure 6.1: Gamma process model of accelerated TCI development [17]. Labels in the figure: TCI(0), TCI, Y(t), ρ, s, TCI interval, failure, t 1, t 2, t, time.]

The failure (required maintenance action) is defined as the down-crossing of $TCI(0)-Y(t)=TCI(t)$ below the stress $s$. We assume $TCI(0)$ and the stress $s$ to be known with probability equal to one. Define $\rho=TCI(0)-s$ as the design margin, or TCI deterioration threshold, and let the time at which failure occurs be denoted by the lifetime $T_\rho$. From equation 6.1, the cumulative probability distribution function of the lifetime $T_\rho$ can be simply written as

$$F_{T_\rho}(t)=\Pr(T_\rho\le t)=\Pr\big(Y(t)\ge\rho\big)=\Pr\big(TCI(0)-TCI(t)\ge TCI(0)-s\big)=\Pr\big(TCI(t)\le s\big)=\int_{y=\rho}^{\infty}f_{Y(t)}(y)\,dy=\frac{\Gamma\big(v(t),\rho u\big)}{\Gamma\big(v(t)\big)} \tag{6.6}$$

where $\Gamma(a,x)=\int_{z=x}^{\infty}z^{a-1}e^{-z}\,dz$ is the incomplete gamma function for $x\ge 0$ and $a>0$. The stress-strength model is extended by regarding the deterioration threshold $\rho=TCI(0)-s$ as a random quantity $\mathrm{P}>0$ as well (Abdel-Hameed [71]). For each $t\ge 0$, the probability of failure (required maintenance action) in the time interval $(0,t]$ can then be written as a convolution integral

$$F_{T_{\mathrm{P}}}(t)=\Pr\big(Y(t)\ge\mathrm{P}\big)=\int_{y=0}^{\infty}f_{Y(t)}(y)\Pr(\mathrm{P}\le y)\,dy=\int_{y=0}^{\infty}\int_{\rho=0}^{y}f_{Y(t)}(y)\,f_{\mathrm{P}}(\rho)\,d\rho\,dy \tag{6.7}$$

where $Y(t)$ has a gamma distribution with shape parameter $v(t)$ and scale parameter $u$, and $\mathrm{P}$ has a probability distribution $f_{\mathrm{P}}(\rho)$. The choice of a proper distribution candidate for $f_{\mathrm{P}}(\rho)$ depends on the case. In chapter 7.2 Lifetime distribution based on condition monitoring a limited number of distributions are proposed. The parameters in these distributions are simply point estimated from the documented TCI values of former running periods at the time stamps of required maintenance actions.

6.2 Parameter estimation for the gamma process

In order to apply the gamma process to the accelerated TCI paths, statistical methods for the parameter estimation of gamma processes are required. The data sets consist of daily calculations in TeCoMan at times (days) $t_i$, $i=1,2,\dots,n$. If the TCIs in the accelerated area are to be modelled by a gamma process, the observations of the transformed TCIs must be increasing, i.e. $0=t_0<t_1<t_2<\dots<t_n$ and $0=y_0\le y_1\le y_2\le\dots\le y_n$. The functional form of the expected TCI path in the accelerated area has to be assumed. This is just as difficult as assuming the polynomial order of the covariates in the model described in chapter 4.4 First order regression analysis of aggregated TCIs.
Again the power law assumption is used:

$$E\big(Y(t)\big)=\frac{v(t)}{u}=\frac{ct^{b}}{u}\propto t^{b} \tag{6.8}$$

Two of the most common methods of parameter estimation, Maximum Likelihood and the Method of Moments, are discussed in van Noortwijk et al. [48]. Both methods for deriving the

estimators of $c$ and $u$ in equation 6-8 were initially presented by Cinlar et al. [72]. The resulting estimator equations from the Method of Moments are shown below and used further in this thesis. The formulas can easily be computed, e.g. by use of a Microsoft Excel spreadsheet. The formulas are:

$$\frac{\hat c}{\hat u}=\frac{y_n}{t_n^{b}}=\bar\delta \tag{6.9}$$

$$\frac{y_n}{\hat u}\left(1-\frac{\sum_{i=1}^{n}w_i^{2}}{\left(\sum_{i=1}^{n}w_i\right)^{2}}\right)=\sum_{i=1}^{n}\left(\delta_i-\bar\delta\,w_i\right)^{2} \tag{6.10}$$

where $w_i=t_i^{b}-t_{i-1}^{b}$ is the transformed time between two succeeding TCI calculations and $\delta_i=y_i-y_{i-1}$ is the corresponding observed transformed TCI increment. Because cumulative amounts of transformed TCIs are received, the last transformed TCI value contains the most information. This is a wanted feature, as shown in the graphs presented in chapter 7.2 Lifetime distribution based on condition monitoring.

Remaining useful life estimation

Assume that the system is of age $t=0$ and that the probability of failure in the time interval $(0,t]$ is $F_{T_\rho}(t)$, calculated from the convolution integral in equation 6-7. The expression for the survival function is simply:

$$R_{T_\rho}(t)=1-F_{T_\rho}(t)$$

Let us further assume that the item is functioning at time $t$. If no additional information is available other than the cumulative distribution function $F_{T_\rho}(t)$, the probability that the system of age $t$ survives an additional interval of length $x$ is:

$$R_{T_\rho}(x\mid t)=\frac{R_{T_\rho}(x+t)}{R_{T_\rho}(t)}$$

Then the density function of the remaining useful life equals (Andersen [5])

$$f_{RUL}(x\mid t)=-\frac{1}{R_{T_\rho}(t)}\,\frac{\partial}{\partial x}R_{T_\rho}(t+x) \tag{6.11}$$

The derivative in equation 6-11 has to be evaluated numerically. Ideas for a solution in the case where the deterioration threshold $\rho$ is fixed (known) can be found in van Noortwijk [17].
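For the fixed-threshold case, the chain from equation 6-6 to equation 6-11 can be sketched numerically in Python. This is a rough illustration under stated assumptions, not the procedure used in the thesis: the threshold $\rho$ is treated as known, the shape function is the power law $v(t)=ct^{b}$, the regularised incomplete gamma function is computed with a textbook series and continued-fraction scheme, and the derivative is taken by finite differences. All function names and parameter values are hypothetical.

```python
import math

def upper_gamma_ratio(a, x):
    """Gamma(a, x) / Gamma(a) for a > 0 and x >= 0."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of the lower incomplete gamma function.
        term = 1.0 / a
        total = term
        n = a
        for _ in range(10_000):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper function.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10_000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)

def failure_cdf(t, c, b, u, rho):
    """Equation 6-6 with v(t) = c * t**b and a fixed threshold rho."""
    if t <= 0.0:
        return 0.0
    return upper_gamma_ratio(c * t ** b, rho * u)

def conditional_survival(x, t, c, b, u, rho):
    """Probability that an item of age t survives a further interval x."""
    return (1.0 - failure_cdf(t + x, c, b, u, rho)) / (1.0 - failure_cdf(t, c, b, u, rho))

def rul_density(x, t, c, b, u, rho, h=1e-3):
    """Finite-difference version of equation 6-11."""
    lo = max(x - h, 0.0)
    hi = x + h
    return -(conditional_survival(hi, t, c, b, u, rho)
             - conditional_survival(lo, t, c, b, u, rho)) / (hi - lo)
```

The random-threshold case of equation 6-7 would additionally require integrating `failure_cdf` over the assumed threshold distribution, which is not attempted here.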

7. Modelling and results

It was mentioned earlier in this thesis that the TCI concept was developed in the EUREKA project Aging Management ( ). The prototype of TeCoMan was implemented, and calculated TCIs have been available since the beginning of year 2000. RCM has been used as a systematic qualitative tool for criticality analysis, identification of optimal maintenance strategies, and identification of the potential for managing the maintenance from onshore. A lot of work has been performed and utilized in the software. Over the years, the focus area of TeCoMan has been to quantify the technical condition of aggregated systems in terms of trends. This thesis has focused on how well the TCI trends describe the real technical condition of the gas compressors in terms of quantified remaining useful life estimates. In general, the underlying low-level building blocks must be as good as possible before they are aggregated or used as input in advanced statistical modelling. Data cleaning is very important since data, especially manually entered work orders and notifications in the SAP database, always contain errors. The TCIs and their time stamps are calculated and stored every day throughout the year. For some operational time intervals, especially late in the year 2006, there have been problems with recording new TCIs for compressors KA41 and KA61. The best solution for model building would have been to recover the lost TCIs, but this was not possible. Only the recorded TCIs are therefore used, as they are, in the further calculations. New sources of information for improved calculation of TCIs are not part of this thesis. So far in this work, the connections between several measurements and TCIs have been established. Advanced methodologies, including theory for remaining useful life estimation based on TCIs, have also been presented.
In this chapter, the theory will be used for the best possible remaining useful life estimation of the gas compressors based on the existing TCIs. The chapter includes:
1. Constructing lifetime distributions for the compressors based on both aging and condition. A Weibull distribution based on only the reliability data will also be made. The regression models will be used to include high-level TCIs and low-level TCIs. The expert judgement based approach of emphasizing low-level TCIs in an aggregated TCI will be judged by the scientific regression analysis.
2. Estimating the expected remaining useful life (ERUL) for the actual compressors based on both aging and condition. Modelling of both low-level and high-level TCIs will be performed. The results will be compared and the TCIs discussed. Confidence intervals of the estimates are also calculated and discussed.
3. Constructing lifetime distributions for the compressors based on condition monitoring only. Here the gamma process approach will be used in the accelerated part based on the high-level TCIs.

4. Estimating the remaining useful life for the actual compressors based on condition monitoring only. The threshold value distribution is established, and confidence intervals of the estimate are calculated and discussed.

To be able to compare the results from chapter 5.5 Remaining useful life estimation with the gamma process approach in chapter 6.1, the estimated median value of the remaining useful life is compared with the corresponding expected remaining useful life at the same time stamp in the regression approach. The reason for this is computational difficulties in the numerical calculations of remaining useful life by the gamma process approach.

7.1 Lifetime distribution based on aging and condition monitoring

The presented model is based on the following assumptions:
1. The interoccurrence times $T_1,T_2,\dots,T_n$ are independent and identically distributed (IID) with cumulative distribution function $F_T(t)=\Pr(T_i\le t)$ for $t\ge 0$, $i=1,2,\dots,n$. In chapter 5.3 Model selection framework the conclusion was IID interoccurrence times. Both the trend and the independence of the interoccurrence times need to be tested. With this assumption the system belongs to the Constant ROCOF part of the bathtub curve.
2. The required maintenance actions that are observed are called renewals, meaning that the system is set back to a good-as-new condition after each required maintenance action. This implies that the operational time (working age) is 0 after each required maintenance action.
3. Repair times are neglected.

In an ordinary reliability-based renewal process only the operational time stamps of failures (required maintenance actions) are of interest. The state variable $\Psi(T)$ is binary. A two-parameter Weibull distribution is assumed to be a good candidate for the underlying distribution function because of its ability to fit all phases of the local bathtub curve in Figure 5.1.
The shape parameter $\beta$ is in the interval $(0,1)$ in the Burn-in period, implying a decreasing hazard rate; it is equal to 1 in the Useful life period, with a constant hazard rate; and in the Wear out period, with increasing hazard rate, the shape parameter $\beta$ is in the interval $(1,\infty)$. Since the working age is 0 after each required maintenance action, the columns Length of running period and Trunc. time in Table 7.1 have to use a local time scale. The Failure mode column is replaced by the censoring indicator $c_i$, which is equal to 0 in case of compressor suspension and 1 when the running period ends with a required maintenance action.
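The three quantities just described (length of running period, truncation time, and censoring indicator) are the inputs of a Weibull log-likelihood with right censoring and left truncation, i.e. the log-likelihood of chapter 5 without covariates. The Python sketch below is only an illustration of that structure: the function name `weibull_loglik` and the example running periods are hypothetical and are not the values of Table 7.1.

```python
import math

def weibull_loglik(beta, eta, periods):
    """Log-likelihood of a two-parameter Weibull renewal model.

    `periods` holds one tuple (t, x, c) per running period, on the local
    time scale: t is the length of the running period, x the truncation
    time (0 if the period is observed from its start) and c the censoring
    indicator (1 = required maintenance action, 0 = suspension).
    """
    total = 0.0
    for t, x, c in periods:
        if c == 1:
            # log of the hazard rate at the observed maintenance action
            total += math.log(beta) + (beta - 1.0) * math.log(t) - beta * math.log(eta)
        # cumulative hazard accumulated over the observed part (x, t]
        total -= (t / eta) ** beta - (x / eta) ** beta
    return total

# Hypothetical running periods (days), not taken from Table 7.1.
example_periods = [(400.0, 0.0, 1), (650.0, 100.0, 1), (300.0, 0.0, 0), (900.0, 200.0, 1)]

# For beta = 1 the maximum over eta has a closed form: exposure / failures.
eta_exponential = sum(t - x for t, x, _ in example_periods) / sum(c for _, _, c in example_periods)
```

In practice the function would be handed, with its sign reversed, to a general-purpose minimiser over both $\beta$ and $\eta$; the closed-form exponential case is included only because it gives a simple check of the implementation.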

[Table 7.1: Running periods of the 5 export gas compressors with renewal assumption. Columns: Compressor, Running period, $c_i$, Start of running period (date), End of running period (date), Length of running period (days), Trunc. time (days); one row per running period (11 rows).]

A fit of the Weibull distribution to the data in Table 7.1 by the method of maximum likelihood was performed. The maximum of the log-likelihood is $l(\hat\beta,\hat\eta)=$ […]. Estimates of the parameters in the cumulative distribution function

$$F_T(t)=\begin{cases}1-e^{-\left(\frac{t}{\eta}\right)^{\beta}}, & t>0\\ 0, & \text{otherwise}\end{cases} \tag{7.1}$$

or the hazard rate function

$$z(t)=\begin{cases}\dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1}, & t>0\\ 0, & \text{otherwise}\end{cases} \tag{7.2}$$

are $\hat\beta=$ […] and $\hat\eta=$ […]. $\eta$ is called the characteristic life of the distribution. With $t=\eta$, $F_T(\eta)=0.632$, i.e. 63.2% of the failures occur before this time. When $\beta=1$, the Weibull distribution is equal to the exponential distribution, which has a constant hazard rate. From the result we can conclude that the compressors are in the Useful life period of the local bathtub curve in Figure 5.1. The calculated mean time to failure (MTTF) is […] days. The estimate $\hat\beta>1$ implies a small presence of aging and a slowly decreasing remaining useful life as a function of operational time $t$. A preventive time-based maintenance policy based on these data is not in the set of effective maintenance policies (see Figure 2.3), because the required maintenance actions occur very randomly.

For the regression model analysis, the assumptions above are argued for the same reasons as described in chapter 5.3 Model selection framework. Table 7.1 can then be used as it is for the reliability data. The baseline in the PHM model will be assumed to follow a Weibull distribution. Initially, the only covariate trend will be the single high-level aggregated TCI path. All weights are estimated by expert judgements. First-order regression analysis has been performed on the running periods of the gas compressors in chapter 4.4 First order

regression analysis of aggregated TCIs. Since this is a deterministic fitting there is no uncertainty associated with the evolution of the aggregated TCI trends. The future evolution of the aggregated TCI trend is fixed beforehand. Information about a new additional TCI value might modify the deterministic trend, but the new trend is then again fixed, with a correspondingly modified future prediction. The summary of the deterministic fitting for all running periods in Table 4.7 is the trend at the time stamp of the first upcoming required maintenance action. Table 4.7 can then be combined with the reliability data of the same running periods in Table 7.1 in a Weibull PHM model. Using the method of maximum likelihood, the maximum of the log-likelihood is $l(\hat\beta,\hat\eta,\hat\gamma)=$ […]. The estimated parameters in the hazard rate function

$$z\big(t;y(t)\big)=\begin{cases}\dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1}e^{\gamma\,y(t)}, & t>0\\ 0, & \text{otherwise}\end{cases} \tag{7.3}$$

are $\hat\beta=$ […], $\hat\eta=$ […], and $\hat\gamma=$ […]. Calculation of p-values can assist in identifying the significance of the parameters. Several standard statistical tests are available. The Wald test statistic $(\hat\theta-\theta_0)^{T}I(\hat\theta)(\hat\theta-\theta_0)$ checks whether the difference between an assumed and an estimated parameter is significant or not. It can be shown that the Wald test statistic is asymptotically $\chi^{2}$ distributed with degrees of freedom equal to the number of entries in which $\hat\theta$ and $\theta_0$ differ [15]. The observed information matrix $I(\hat\theta)$ contains the negative second-order partial derivatives of the log-likelihood function in equation 7-6. Under the hypothesis $H_0:\ \beta=\hat\beta,\ \eta=\hat\eta,\ \gamma=0$ the Wald test statistic $(\hat\theta-\theta_0)^{T}I(\hat\theta)(\hat\theta-\theta_0)$ is […]. The p-value calculated from the $\chi^{2}$ distribution with 1 degree of freedom is small, so the $\gamma$ parameter is highly significant and the hypothesis $\gamma=0$ is rejected. Similarly, the hypothesis that working age is not an important variable can be tested with a Wald test statistic on the shape parameter $\beta$ by $H_0:\ \beta=1,\ \eta=\hat\eta,\ \gamma=\hat\gamma$.
The resulting Wald test statistic is […] and the p-value is .485. This implies that working age is only marginally significant. A negative value of the $\gamma$ parameter implies that the risk is increasing when $y(t)$ is decreasing. In this case the absolute value of the $\gamma$ parameter might be interpreted as weighting the belief in the time-dependent covariate against the belief in aging (the $\eta$ parameter and to some extent the $\beta$ parameter). Graphs of the hazard rate functions (risk functions) in equation 7-3 as a function of the operating time are presented for all running periods of all compressors in Appendix G Graphs of the hazard rate (risk) of aggregated and low-level TCIs.
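The Wald test described above amounts to a quadratic form followed by a $\chi^{2}$ tail probability. The Python sketch below illustrates this for the one-degree-of-freedom case used in the text, where the hypothesised parameter vector differs from the estimate in a single entry. The function names and the numbers in the example, including the information matrix, are hypothetical and do not reproduce the thesis results.

```python
import math

def wald_statistic(theta_hat, theta_0, info):
    """Quadratic form (theta_hat - theta_0)^T I (theta_hat - theta_0)."""
    diff = [a - b for a, b in zip(theta_hat, theta_0)]
    p = len(diff)
    return sum(diff[j] * info[j][k] * diff[k] for j in range(p) for k in range(p))

def chi2_sf_1df(w):
    """P(X > w) for X chi-square distributed with 1 degree of freedom."""
    return math.erfc(math.sqrt(w / 2.0))

# Hypothetical estimate (beta, eta, gamma) and observed information matrix.
theta_hat = [1.2, 100.0, -3.0]
theta_0 = [1.2, 100.0, 0.0]   # hypothesis: gamma = 0, other entries as estimated
info = [[50.0, 0.0, 0.0],
        [0.0, 0.01, 0.0],
        [0.0, 0.0, 4.0]]

w = wald_statistic(theta_hat, theta_0, info)
p_value = chi2_sf_1df(w)
```

The tail probability uses the identity between a $\chi^{2}$ variable with one degree of freedom and a squared standard normal variable, so it should not be reused for hypotheses that restrict more than one parameter.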

To be able to calculate the expected remaining useful life for each running period, $\mathrm{ERUL}(t;y(t),\theta)$, by equation 5-22, there is a need for a closer look at the function $S(t;y)=S\big(t;\,y(u),\,0\le u\le t\big)=e^{-\int_{0}^{t}z(u;\,y(u))\,du}$. As explained in chapter 5.5 Remaining useful life estimation, the TCI function is divided into three sub domains. The integral will therefore be divided depending on the operational time $t$. In the first time interval $[0,t_1)$ the TCIs are treated as constant and equal to 1. In the second time interval $[t_1,t_2)$ the TCIs are deterministically extrapolated with the polynomial until they reach the threshold of legal TCI values (0 or 1). In the last time interval $[t_2,\infty)$ they are extrapolated as constant and equal to the threshold value reached at time $t_2$. For the gas compressors the Weibull PHM is used, and the function $S(t;y)$ is given by

$$S(t;y)=\begin{cases}\exp\left(-t^{\beta}\eta^{-\beta}e^{\gamma\cdot 1}\right), & t<t_1\\[1ex] \exp\left(-\left[t_1^{\beta}\eta^{-\beta}e^{\gamma\cdot 1}+\displaystyle\int_{t_1}^{t}\beta\eta^{-\beta}u^{\beta-1}e^{\gamma y(u)}\,du\right]\right), & t_1\le t<t_2\\[2ex] \exp\left(-\left[t_1^{\beta}\eta^{-\beta}e^{\gamma\cdot 1}+\displaystyle\int_{t_1}^{t_2}\beta\eta^{-\beta}u^{\beta-1}e^{\gamma y(u)}\,du+\left(t^{\beta}-t_2^{\beta}\right)\eta^{-\beta}e^{\gamma y(t_2)}\right]\right), & t\ge t_2\end{cases} \tag{7.4}$$

This function has been implemented in Maple code by the function piecewise (see Appendix F Programming and Results) for predicting $\mathrm{ERUL}(t;y(t),\theta)$ of each running period. Trends for all running periods of the calculated $\mathrm{ERUL}(t;y(t),\theta)$ as functions of the operational time $t$ are shown below. This is a simplification, since the coefficients of the polynomial will probably change each time new TCI information is received.
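The three-domain construction of equation 7-4 can be sketched in Python as follows. This is an illustrative stand-in for the Maple `piecewise` implementation, not a transcription of it: the covariate path `y` is passed in as an ordinary function, the middle integral is approximated with the trapezoidal rule, and the function name, the example path, and all parameter values are hypothetical.

```python
import math

def survival_piecewise(t, beta, eta, gamma, t1, t2, y, steps=2000):
    """S(t; y) of equation 7-4 for a Weibull PHM with one covariate.

    y(u) is the fitted or extrapolated TCI path on [t1, t2]; before t1
    the TCI is taken as 1 and after t2 it is frozen at y(t2).
    Assumes 0 < t1 <= t2 so that the integrand is finite.
    """
    scale = eta ** (-beta)

    def integrand(u):
        return beta * scale * u ** (beta - 1.0) * math.exp(gamma * y(u))

    def middle(upper):
        # trapezoidal rule for the integral from t1 to upper
        h = (upper - t1) / steps
        total = 0.5 * (integrand(t1) + integrand(upper))
        for k in range(1, steps):
            total += integrand(t1 + k * h)
        return h * total

    if t < t1:
        cum_hazard = t ** beta * scale * math.exp(gamma * 1.0)
    elif t < t2:
        cum_hazard = t1 ** beta * scale * math.exp(gamma * 1.0) + middle(t)
    else:
        cum_hazard = (t1 ** beta * scale * math.exp(gamma * 1.0) + middle(t2)
                      + (t ** beta - t2 ** beta) * scale * math.exp(gamma * y(t2)))
    return math.exp(-cum_hazard)

def example_path(u):
    """Hypothetical linear TCI decline from 1 at day 50 to 0 at day 250."""
    return max(0.0, 1.0 - 0.005 * (u - 50.0))
```

A call such as `survival_piecewise(300.0, 1.5, 200.0, -2.0, 50.0, 250.0, example_path)` then gives the survival probability in the third domain; feeding this function into the integral of equation 5-22 yields the ERUL curve for a running period.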

[Figure 7.1: ERUL as a function of the operational time for 2 running periods of gas compressor KA31. Vertical axis: ERUL (operational days); horizontal axis: operational time (days). The required maintenance actions are indicated by x, and the suspensions are also marked on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.]

[Figure 7.2: ERUL as a function of the operational time for 3 running periods of gas compressor KA41. Axes and markers as in Figure 7.1.]

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.3: ERUL as a function of operational time for 1 running period of gas compressor KA51. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.4: ERUL as a function of operational time for 2 running periods of gas compressor KA61. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.5: ERUL as a function of operational time for 3 running periods of gas compressor KA71. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

The expected remaining useful life using a single high-level TCI at the time stamps of required maintenance actions is summarized in Table 7.2.

Table 7.2: Summary of ERUL for all the running periods using a single high-level TCI.
Columns: Compressor; Running period; Censoring ($c_i$); ERUL (days).

The aggregated TCI trends were used to estimate ERUL of all 11 running periods. Aging is present to some small extent. From the results in Table 7.2 it can be observed that the ERUL estimate is not very good at the time stamps of the actual required repair times. The main reason for this is that the model seems to have a strong belief in the TCIs since the

interoccurrence times are very random. Only the two running periods of gas compressor KA61 can be claimed to be approved. Here the polynomial fit of the TCI trends is below 6 some days before the required maintenance action, and the trend is steeply decreasing towards zero. Running period 1 of gas compressor KA71 has the worst fit. For this running period the value of the polynomial trend TCI is above 9 in a long interval before the next required maintenance action and is extrapolated to be very slowly linearly decreasing.

Since the model in this case only has a marginal belief in aging, low TCI measurements at operational time intervals close to the next required maintenance action will increase the hazard rate (risk) more than the same low TCI measurements at earlier time intervals. Studies of the time dependent effect of covariates are presented in e.g. Aalen [73]. For the same reason, noise in terms of e.g. faulty sensors and wrongly documented notifications is more harmful in operational time intervals close to failure than at earlier time intervals.

For calculation of upper and lower error bounds of the expected remaining useful life, the covariance matrix $C(\theta)$ has to be known. A simple estimator of $C(\theta)$ is based on the observed Fisher information $I(\hat\theta)$: an estimate of the covariance matrix for the maximum likelihood estimator (MLE) $\hat\theta$ is the inverse of the observed information matrix. For the Weibull PHM model with one aggregated TCI, the inverse of the observed information matrix is

$$C(\hat\beta,\hat\eta,\hat\gamma) =
\begin{bmatrix}
\mathrm{Var}(\hat\beta) & \mathrm{Cov}(\hat\beta,\hat\eta) & \mathrm{Cov}(\hat\beta,\hat\gamma) \\
\mathrm{Cov}(\hat\beta,\hat\eta) & \mathrm{Var}(\hat\eta) & \mathrm{Cov}(\hat\eta,\hat\gamma) \\
\mathrm{Cov}(\hat\beta,\hat\gamma) & \mathrm{Cov}(\hat\eta,\hat\gamma) & \mathrm{Var}(\hat\gamma)
\end{bmatrix}
=
\begin{bmatrix}
-\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta^2} & -\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta\,\partial\eta} & -\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta\,\partial\gamma} \\
-\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta\,\partial\eta} & -\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\eta^2} & -\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\eta\,\partial\gamma} \\
-\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta\,\partial\gamma} & -\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\eta\,\partial\gamma} & -\dfrac{\partial^2 l(\beta,\eta,\gamma)}{\partial\gamma^2}
\end{bmatrix}^{-1}_{\beta=\hat\beta,\ \eta=\hat\eta,\ \gamma=\hat\gamma} \tag{7.5}$$

where the partial derivatives are evaluated at the local estimates $\beta = \hat\beta$, $\eta = \hat\eta$, $\gamma = \hat\gamma$. The partial derivatives are

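$$\begin{aligned}
\frac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta^2} &= -\sum_{i=1}^{n}\frac{c_i}{\beta^2} - \sum_{i=1}^{n}\int_{x_i}^{t_i}\eta^{-\beta}u^{\beta-1}e^{\gamma y_i(u)}\Big(-2\ln\eta + 2\ln u + \beta(\ln\eta)^2 - 2\beta\ln\eta\,\ln u + \beta(\ln u)^2\Big)\,du \\
\frac{\partial^2 l(\beta,\eta,\gamma)}{\partial\eta^2} &= \sum_{i=1}^{n} c_i\frac{\beta}{\eta^2} - \sum_{i=1}^{n}\int_{x_i}^{t_i}\beta^2(\beta+1)\,\eta^{-\beta-2}u^{\beta-1}e^{\gamma y_i(u)}\,du \\
\frac{\partial^2 l(\beta,\eta,\gamma)}{\partial\gamma^2} &= -\sum_{i=1}^{n}\int_{x_i}^{t_i}\beta\eta^{-\beta}u^{\beta-1}\big(y_i(u)\big)^2 e^{\gamma y_i(u)}\,du \\
\frac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta\,\partial\eta} &= -\sum_{i=1}^{n}\frac{c_i}{\eta} + \sum_{i=1}^{n}\int_{x_i}^{t_i}\beta\eta^{-\beta-1}u^{\beta-1}e^{\gamma y_i(u)}\big(2 - \beta\ln\eta + \beta\ln u\big)\,du \\
\frac{\partial^2 l(\beta,\eta,\gamma)}{\partial\beta\,\partial\gamma} &= -\sum_{i=1}^{n}\int_{x_i}^{t_i}\eta^{-\beta}u^{\beta-1}y_i(u)\,e^{\gamma y_i(u)}\big(1 - \beta\ln\eta + \beta\ln u\big)\,du \\
\frac{\partial^2 l(\beta,\eta,\gamma)}{\partial\eta\,\partial\gamma} &= \sum_{i=1}^{n}\int_{x_i}^{t_i}\beta^2\eta^{-\beta-1}u^{\beta-1}y_i(u)\,e^{\gamma y_i(u)}\,du
\end{aligned} \tag{7.6}$$

By using an S-plus script program in Appendix F Programming and Results, the inverse of the observed information matrix is calculated to be

$$\begin{bmatrix}
\mathrm{Var}(\hat\beta) & \mathrm{Cov}(\hat\beta,\hat\eta) & \mathrm{Cov}(\hat\beta,\hat\gamma) \\
\mathrm{Cov}(\hat\beta,\hat\eta) & \mathrm{Var}(\hat\eta) & \mathrm{Cov}(\hat\eta,\hat\gamma) \\
\mathrm{Cov}(\hat\beta,\hat\gamma) & \mathrm{Cov}(\hat\eta,\hat\gamma) & \mathrm{Var}(\hat\gamma)
\end{bmatrix} = \tag{7.7}$$

The local estimate of $\mathrm{Var}(\hat\beta)$ in equation 7-7 is negative and therefore wrongly estimated. With the small number of required maintenance actions (8 for the gas compressors), the estimates are not very good. Calculation errors in the S-plus script program might also be a reason. However, it can be noted from the inverse of the observed information matrix that the correlation between the estimated $\eta$ and $\gamma$ is positive and very strong:

$$\kappa_{\eta,\gamma} = \frac{\mathrm{Cov}(\hat\eta,\hat\gamma)}{\sqrt{\mathrm{Var}(\hat\eta)\,\mathrm{Var}(\hat\gamma)}} = 1 \tag{7.8}$$

This positive correlation is reflected in the orientation of the log-likelihood plot in Figure 7.6. The estimated β = is plotted at optimum, while η and γ are plotted in the range η = ± 2. and γ = ±.2, i.e. near optimum.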
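The step from an observed information matrix to a covariance matrix, the check for negative variance estimates, and the correlation coefficient of equation 7.8 can be sketched as follows. This is an illustrative Python sketch and not the S-plus script of Appendix F; the function names and the numerical matrix are hypothetical.

```python
import math

def invert_3x3(m):
    """Invert a 3x3 matrix via the adjugate; raises if the matrix is singular."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-300:
        raise ValueError("observed information matrix is singular")
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]

def negative_variances(cov, names=("beta", "eta", "gamma")):
    """Names of parameters whose estimated variance is not positive."""
    return [name for k, name in enumerate(names) if cov[k][k] <= 0.0]

def correlation(cov, i, j):
    """Correlation coefficient Cov_ij / sqrt(Var_i * Var_j)."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

# Hypothetical observed information (negative Hessian of the log-likelihood).
info = [[4.0, 0.0, 0.0],
        [0.0, 2.0, 1.0],
        [0.0, 1.0, 2.0]]
cov = invert_3x3(info)
print(negative_variances(cov), correlation(cov, 1, 2))
```

A non-empty result from `negative_variances` signals that the observed information is not positive definite at the reported optimum, which is one way a negative variance estimate can arise with very few observed events.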

[Plot annotations: Belief in aging; Best fit; Belief in TCI.]
Figure 7.6: Plot of the log-likelihood function in an area close to optimum.

The best fit of parameters maximizes the log-likelihood function. The optimum solution seems to follow a mountain ridge. A small best fit value of the γ parameter seems to imply a small best fit value of the η parameter because of the strong positive correlation, and the model has a strong belief in the condition monitoring. In the opposite case a large best fit value of the η parameter seems to imply a large best fit value of the γ parameter, and the model has a strong belief in aging. For this natural gas compressor case, the optimum solution has a strong belief in condition monitoring. The large variability in the timestamps of maintenance actions and the large variability in trends of TCIs over a sample of similar running periods from 5 compressors seem to force the model towards belief in the running period specific TCI trend.

To calculate the variance of the expected remaining useful life function $\mathrm{Var}\big(ERUL(t; y(t), \theta)\big)$ from equation 5-3, the partial derivatives of $ERUL(t; y(t), \theta)$ with respect to the elements of $\theta$ have to be calculated in the gradient vector

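$$\begin{bmatrix}
\dfrac{\partial\, ERUL(t; y(t), \theta)}{\partial\beta} \\[8pt]
\dfrac{\partial\, ERUL(t; y(t), \theta)}{\partial\eta} \\[8pt]
\dfrac{\partial\, ERUL(t; y(t), \theta)}{\partial\gamma}
\end{bmatrix}_{\beta=\hat\beta,\ \eta=\hat\eta,\ \gamma=\hat\gamma}$$

If the operations of integration with respect to time $u$ and differentiation with respect to $\theta$ can be interchanged, the elements of the gradient vector become

$$\begin{aligned}
\frac{\partial\, ERUL(t; y(t), \theta)}{\partial\beta}
&= \frac{\partial}{\partial\beta}\left(\frac{\int_t^{\infty} S(u; y(u), \theta)\,du}{S(t; y(t), \theta)}\right) \\
&= \frac{\int_t^{\infty}\frac{\partial}{\partial\beta}S(u; y(u), \theta)\,du \cdot S(t; y(t), \theta) - \int_t^{\infty}S(u; y(u), \theta)\,du \cdot \frac{\partial}{\partial\beta}S(t; y(t), \theta)}{\big(S(t; y(t), \theta)\big)^2} \\
&= \frac{\int_t^{\infty}\frac{\partial}{\partial\beta}S(u; y(u), \theta)\,du - ERUL(t; y(t), \theta)\,\frac{\partial}{\partial\beta}S(t; y(t), \theta)}{S(t; y(t), \theta)} \\[6pt]
\frac{\partial\, ERUL(t; y(t), \theta)}{\partial\eta}
&= \frac{\int_t^{\infty}\frac{\partial}{\partial\eta}S(u; y(u), \theta)\,du - ERUL(t; y(t), \theta)\,\frac{\partial}{\partial\eta}S(t; y(t), \theta)}{S(t; y(t), \theta)} \\[6pt]
\frac{\partial\, ERUL(t; y(t), \theta)}{\partial\gamma}
&= \frac{\int_t^{\infty}\frac{\partial}{\partial\gamma}S(u; y(u), \theta)\,du - ERUL(t; y(t), \theta)\,\frac{\partial}{\partial\gamma}S(t; y(t), \theta)}{S(t; y(t), \theta)}
\end{aligned}$$

The domain of integration over which the function $S(t; \mathbf{y}) = e^{-\int_0^t z(u;\, y(u))\,du}$ is integrated is divided into sub domains as described in chapter 5.5 Remaining useful life estimation. This procedure has been implemented in Maple code and is attached in Appendix F Programming and Results. The calculated variance of the expected remaining useful life function $\mathrm{Var}\big(ERUL(t; y(t), \theta)\big)$ for all running periods at required maintenance actions is shown in Table 7.3.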
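The way a gradient vector and a covariance matrix combine into a variance and approximate limits can be sketched as below. The sketch assumes a first-order (delta method) approximation $g^T C g$ and symmetric normal-theory limits; the chapter 5 equations referred to in the text are not reproduced here and may differ in detail. The gradient is approximated by finite differences instead of the analytical derivatives, and the stand-in function (the mean of a Weibull distribution, i.e. ERUL at $t = 0$ without covariates), the parameter values, and the covariance matrix are hypothetical.

```python
import math

def gradient(f, theta, rel_step=1e-5):
    """Central finite-difference gradient of f at the parameter vector theta."""
    grad = []
    for k, value in enumerate(theta):
        h = rel_step * max(1.0, abs(value))
        up = list(theta)
        up[k] = value + h
        down = list(theta)
        down[k] = value - h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def delta_method_variance(grad, cov):
    """First-order variance approximation g^T C g."""
    n = len(grad)
    return sum(grad[i] * cov[i][j] * grad[j] for i in range(n) for j in range(n))

def normal_limits(estimate, variance, z=1.96):
    """Symmetric approximate 95% limits: estimate -/+ z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Stand-in for ERUL: mean life eta * Gamma(1 + 1/beta) of a Weibull distribution.
mean_life = lambda th: th[1] * math.gamma(1.0 + 1.0 / th[0])
theta_hat = [1.5, 100.0]                      # hypothetical (beta, eta)
cov_hat = [[0.04, 0.5], [0.5, 64.0]]          # hypothetical covariance matrix
g = gradient(mean_life, theta_hat)
var = delta_method_variance(g, cov_hat)
print(normal_limits(mean_life(theta_hat), var))
```

If the variance estimate is negative, as can happen when the covariance matrix is not positive definite, `math.sqrt` raises an error, which makes the problem visible instead of silently producing limits.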

Table 7.3: The calculated variance of the expected remaining useful life function for all running periods at the time stamp of required repair actions.
Columns: Compressor; Running period; Var(ERUL) (days).

The approximated upper 95% limit of the expected remaining useful life is calculated by equation 5-31, and the approximated lower 95% limit is calculated by the corresponding lower-limit equation. These limits are added as columns to Table 7.2 as shown in Table 7.4.

Table 7.4: Summary of ERUL including the upper and the lower 95% confidence limits for all running periods using a single high-level TCI.
Columns: Compressor; Running period; Censoring ($c_i$); ERUL (days); Upper 95% confidence limit (days); Lower 95% confidence limit (days).

A regression model based on low-level TCIs has also been made. A Weibull PHM based on the 4 underlying TCIs (notification, vibration, bearing temperature, and seal leakage) is proposed as:

$$z(t; \mathbf{y}(t)) = \begin{cases}
\dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1} e^{\gamma_N y_N(t) + \gamma_V y_V(t) + \gamma_B y_B(t) + \gamma_S y_S(t)}, & t > 0 \\[6pt]
0, & \text{otherwise}
\end{cases} \tag{7.9}$$

where $\gamma_N$, $\gamma_V$, $\gamma_B$, $\gamma_S$ are the regression coefficients of notification, vibration, bearing temperature, and seal leakage. $y_N(t)$, $y_V(t)$, $y_B(t)$, $y_S(t)$ are the low-level time dependent

TCIs of notification, vibration, bearing temperature, and seal leakage. For this model the same assumptions as specified at the start of this chapter are used. The running periods in Table 7.1 are also used. To quantify the low-level TCI trends, the same methodology as for the high-level TCI approach is used, including the moving average and the polynomial fitting. A summary of the polynomial curve fitting of the running periods for notifications is shown in Table 7.5.

Table 7.5: Running-period summary of the polynomial curve fitting of the notification TCIs. The column labelled Compr R.P. describes a running period of a compressor.
Columns: Compr R.P.; coefficients of $t^4$, $t^3$, $t^2$, $t^1$, $t^0$.
Rows: KA31 running periods 1 and 2; KA41 running periods 1, 2, and 3; KA51 running period 1; KA61 running periods 1 and 2; KA71 running periods 1, 2, and 3.

A summary of the polynomial curve fitting of the running periods for vibrations is shown in Table 7.6.

Table 7.6: Running-period summary of the polynomial curve fitting of the vibration TCIs. The column labelled Compr R.P. describes a running period of a compressor.
Columns: Compr R.P.; coefficients of $t^4$, $t^3$, $t^2$, $t^1$, $t^0$.
Rows: KA31 running periods 1 and 2; KA41 running periods 1 and 2 (table continues on the next page).

Table 7.6 (continued).
Columns: Compr R.P.; coefficients of $t^4$, $t^3$, $t^2$, $t^1$, $t^0$.
Rows: KA41 running period 3; KA51 running period 1; KA61 running periods 1 and 2; KA71 running periods 1, 2, and 3.

A summary of the polynomial curve fitting of the running periods for bearing temperature is shown in Table 7.7.

Table 7.7: Running-period summary of polynomial curve fitting of bearing-temperature TCIs. The column labelled Compr R.P. describes a running period of a compressor.
Columns: Compr R.P.; coefficients of $t^4$, $t^3$, $t^2$, $t^1$, $t^0$.
Rows: KA31 running periods 1 and 2; KA41 running periods 1, 2, and 3; KA51 running period 1; KA61 running periods 1 and 2; KA71 running periods 1, 2, and 3.

A summary of the polynomial curve fitting of the running periods for seal leakage is shown in Table 7.8.

Table 7.8: Running-period summary of polynomial curve fitting of seal-leakage TCIs.
Columns: Compr R.P.; coefficients of $t^4$, $t^3$, $t^2$, $t^1$, $t^0$.
Rows: KA31 running periods 1 and 2; KA41 running periods 1, 2, and 3; KA51 running period 1; KA61 running periods 1 and 2; KA71 running periods 1, 2, and 3.

From the low-level TCI trends, the value of the information received from the bearing temperature and the seal leakage seems to be minimal. The Wald test statistic is evaluated for combinations of Weibull PHM models to identify e.g. the two most important low-level TCIs. The first regression model is based on the notification TCIs and the vibration TCIs. These information sets seem to describe the 2 failure modes (bearing failure and seal failure) best. A Weibull PHM based on the 2 underlying TCIs (notification TCI, vibration TCI) is

$$z(t; \mathbf{y}(t)) = \begin{cases}
\dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1} e^{\gamma_N y_N(t) + \gamma_V y_V(t)}, & t > 0 \\[6pt]
0, & \text{otherwise}
\end{cases} \tag{7.10}$$

Table 7.5 and Table 7.6 are combined with the reliability data of the same running periods as in Table 7.1. Using the method of maximum likelihood, the maximum of the log-likelihood is $l(\beta, \eta, \gamma_N, \gamma_V)$ = . The parameters in the hazard rate function are β = , η = , $\gamma_N$ =.44165, and $\gamma_V$ = .

Under the hypothesis $H_0: \beta_0 = \hat\beta,\ \eta_0 = \hat\eta,\ \gamma_{N,0} = 0,\ \gamma_{V,0} = \hat\gamma_V$ the Wald test statistic $(\hat\theta - \theta_0)^T I(\hat\theta)(\hat\theta - \theta_0)$ is . The p-value calculated from the $\chi^2$ distribution with 1 degree of freedom is <.1. The $\gamma_N$ parameter is therefore highly significant. Under the hypothesis $H_0: \beta_0 = \hat\beta,\ \eta_0 = \hat\eta,\ \gamma_{N,0} = \hat\gamma_N,\ \gamma_{V,0} = 0$ the Wald test statistic is . The p-value calculated from the $\chi^2$ distribution with 1 degree of freedom is <.1, implying a high significance of the $\gamma_V$ parameter. Similarly, the hypothesis that working age is not an important variable can be tested by a Wald test on the shape parameter β with $H_0: \beta_0 = 1,\ \eta_0 = \hat\eta,\ \gamma_{N,0} = \hat\gamma_N,\ \gamma_{V,0} = \hat\gamma_V$. The resulting Wald test statistic is and the p-value is <.1. This implies that working age is also highly significant for this model.

The negative values of the parameters $\gamma_N$ and $\gamma_V$ imply that the risk increases when $y_N(t)$ and $y_V(t)$ decrease. If one of them is increasing and the other is decreasing, they will nullify each other to some extent. Since the value of the $\gamma_V$ parameter is more negative than the value of the $\gamma_N$ parameter, the model has more belief in vibrations than in notifications (TCIs are normalized values). This largely agrees with the expert judgement of weights in the penalty aggregation method in Figure

The result of a Weibull PHM model based on the vibration TCIs and the seal-leakage TCIs is presented in Table 7.9. Table 7.6 and Table 7.8 are combined with the reliability data of the same running periods in Table 7.1.

Table 7.9: Summary of estimated parameters and p-values for the Weibull PHM model based on vibration and seal leakage covariates.
Model: $z(t; \mathbf{y}(t)) = \dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1} e^{\gamma_V y_V(t) + \gamma_S y_S(t)}$; $l(\beta, \eta, \gamma_V, \gamma_S)$ = ; β = 1.33; η = 1.
Columns: Variable; Parameter; Estimate; Chi-Square; Degree of Freedom; p-value.
Rows: $y_V(t)$ (p-value <.1); $y_S(t)$.

For the model in Table 7.9 the seal-leakage covariate has low statistical significance (high p-value) and may be omitted from the model. The result of a Weibull PHM model based on the vibration TCIs and the bearing-temperature TCIs is presented in Table 7.10. Table 7.6 and Table 7.7 are combined with the reliability data of the same running periods in Table 7.1.
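The Wald tests used above can be sketched in a few lines. The sketch evaluates the quadratic form $(\hat\theta - \theta_0)^T I(\hat\theta)(\hat\theta - \theta_0)$ stated in the text and converts it to a p-value for 1 degree of freedom through the identity $P(\chi^2_1 > w) = \operatorname{erfc}\big(\sqrt{w/2}\big)$. The parameter estimates and the information matrix are hypothetical and are not the values of the case study.

```python
import math

def wald_statistic(theta_hat, theta_null, info):
    """Quadratic form (theta_hat - theta_null)^T I (theta_hat - theta_null)."""
    d = [a - b for a, b in zip(theta_hat, theta_null)]
    n = len(d)
    return sum(d[i] * info[i][j] * d[j] for i in range(n) for j in range(n))

def chi2_sf_1dof(w):
    """Upper tail P(X > w) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(w / 2.0))

# Hypothetical estimates in the order (beta, eta, gamma_N, gamma_V).
theta_hat = [1.5, 100.0, -0.4, -0.8]
info = [[30.0, 0.0, 0.0, 0.0],
        [0.0, 0.02, 0.0, 0.0],
        [0.0, 0.0, 50.0, 5.0],
        [0.0, 0.0, 5.0, 40.0]]

# H0: gamma_N = 0, all other parameters fixed at their estimates.
theta_null = list(theta_hat)
theta_null[2] = 0.0
w = wald_statistic(theta_hat, theta_null, info)
print(w, chi2_sf_1dof(w))
```

Because only one component differs between the estimate and the null vector, the statistic reduces to the squared difference times the corresponding diagonal element of the information matrix.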

Table 7.10: Summary of estimated parameters and p-values for the Weibull PHM model based on vibration and bearing-temperature covariates.
Model: $z(t; \mathbf{y}(t)) = \dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1} e^{\gamma_V y_V(t) + \gamma_B y_B(t)}$; $l(\beta, \eta, \gamma_V, \gamma_B)$ = ; β = 2.69; η = 1.
Columns: Variable; Parameter; Estimate; Chi-Square; Degree of Freedom; p-value.
Rows: $y_V(t)$ (p-value <.1); $y_B(t)$.

For the model in Table 7.10 the bearing-temperature covariate has low statistical significance (relatively high p-value) and may be omitted from the model. As a result, only the notification TCIs and the vibration TCIs are used as the two most important low-level TCIs. Other combinations of covariates might also be modelled, including models that check e.g. the products of covariates, like the smoking-asbestos effect on risk of lung cancer [74]. An example is:

$$z(t; \mathbf{y}(t)) = \begin{cases}
\dfrac{\beta}{\eta}\left(\dfrac{t}{\eta}\right)^{\beta-1} e^{\gamma_{NV}\, y_N(t)\, y_V(t) + \gamma_{BS}\, y_B(t)\, y_S(t)}, & t > 0 \\[6pt]
0, & \text{otherwise}
\end{cases}$$

Rausand and Reinertsen [75] emphasize that the effect of several failure mechanisms may be much higher than the sum of effects of each individual mechanism. In general, a model with many parameters to estimate might give a better fit of the data because of the larger number of degrees of freedom. On the other hand, a model with many parameters to estimate will also require a larger data set. The limited data set of required maintenance actions for the gas compressors is a huge restriction for detailed statistical analysis.

Graphs of the hazard rate functions (risk functions) in equation 7-10 as a function of the operating time are presented for all running periods of all compressors in Appendix G Graphs of the hazard rate (risk) of aggregated and low-level TCIs. To be able to calculate the expected remaining useful life $ERUL(t; \mathbf{y}(t), \theta)$ for each running period by equation 5-22, there is a need for a closer look at the function

$$S(t; \mathbf{y}) = S\big(t; \mathbf{y}(u),\, 0 \le u \le t\big) = e^{-\int_0^t z(u;\, \mathbf{y}(u))\,du}$$

for each running period. With two covariates the TCI function is divided into four sub domains. The integral is therefore split depending on the operational time $t$. In the first time interval $[0, t_1)$ both covariates are treated as constants equal to 1. In the second time interval $[t_1, t_2)$ both covariates are deterministically extrapolated with the polynomial until they reach the threshold of legal TCI values (0 or 1). $t_2$ is the timestamp of the covariate that reaches the threshold of legal TCI values first. $t_3$ is the timestamp of the covariate that reaches the threshold of legal TCI values second. If e.g. the vibration covariate reaches the threshold of legal TCI values first, it will remain constant

in the interval $[t_2, t_3)$, while the notification covariate is deterministically extrapolated with its polynomial trend until the time stamp $t_3$. In the last time interval $[t_3, \infty)$ both covariates are constant and equal to their threshold values. For the gas compressors, the function $S(t; \mathbf{y})$ in the case when the vibration covariate reaches the threshold of legal TCI values first is given by

$$S(t; \mathbf{y}) = \begin{cases}
\exp\left[-t^{\beta}\eta^{-\beta}e^{\gamma_N\cdot 1 + \gamma_V\cdot 1}\right], & t < t_1 \\[4pt]
\exp\left[-t_1^{\beta}\eta^{-\beta}e^{\gamma_N\cdot 1 + \gamma_V\cdot 1} - \beta\eta^{-\beta}\int_{t_1}^{t} u^{\beta-1}e^{\gamma_N y_N(u) + \gamma_V y_V(u)}\,du\right], & t_1 \le t < t_2 \\[4pt]
\exp\left[-t_1^{\beta}\eta^{-\beta}e^{\gamma_N\cdot 1 + \gamma_V\cdot 1} - \beta\eta^{-\beta}\int_{t_1}^{t_2} u^{\beta-1}e^{\gamma_N y_N(u) + \gamma_V y_V(u)}\,du - \beta\eta^{-\beta}e^{\gamma_V y_V(t_2)}\int_{t_2}^{t} u^{\beta-1}e^{\gamma_N y_N(u)}\,du\right], & t_2 \le t < t_3 \\[4pt]
\exp\left[-t_1^{\beta}\eta^{-\beta}e^{\gamma_N\cdot 1 + \gamma_V\cdot 1} - \beta\eta^{-\beta}\int_{t_1}^{t_2} u^{\beta-1}e^{\gamma_N y_N(u) + \gamma_V y_V(u)}\,du - \beta\eta^{-\beta}e^{\gamma_V y_V(t_2)}\int_{t_2}^{t_3} u^{\beta-1}e^{\gamma_N y_N(u)}\,du - \left(t^{\beta} - t_3^{\beta}\right)\eta^{-\beta}e^{\gamma_N y_N(t_3) + \gamma_V y_V(t_3)}\right], & t \ge t_3
\end{cases} \tag{7.11}$$

This function has been implemented in Maple code by the function piecewise (see Appendix F Programming and Results) for predicting $ERUL(t; \mathbf{y}(t), \theta)$ of each running period. Trends for all running periods of calculated $ERUL(t; \mathbf{y}(t), \theta)$ as functions of the operational time $t$ are shown below. This is a simplification, since the coefficients of the polynomial trend will probably change each time new TCI information is received.

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.7: ERUL estimated by low-level TCIs for 2 running periods of gas compressor KA31 as a function of operational time. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.
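The sub domain boundaries $t_2$ and $t_3$ of the two-covariate model can be located numerically, as in the following sketch. It finds the first time each trend reaches the threshold by a coarse scan refined with bisection, holds each covariate at the threshold afterwards, and evaluates the hazard of equation 7.10. The function names, the linear trends, and the parameter values are hypothetical, and the threshold is taken as 0.

```python
import math

def first_crossing(trend, start, stop, level=0.0, steps=10000):
    """First time in [start, stop] with trend(t) <= level; None if never reached."""
    h = (stop - start) / steps
    prev = start
    for k in range(1, steps + 1):
        t = start + k * h
        if trend(t) <= level:
            lo, hi = prev, t
            for _ in range(60):          # bisection refinement
                mid = 0.5 * (lo + hi)
                if trend(mid) <= level:
                    hi = mid
                else:
                    lo = mid
            return hi
        prev = t
    return None

def make_clamped(trend, t1, t_hit, floor=0.0, start_value=1.0):
    """Covariate: start_value before t1, trend until t_hit, floor afterwards."""
    def y(t):
        if t < t1:
            return start_value
        if t_hit is not None and t >= t_hit:
            return floor
        return trend(t)
    return y

def hazard(t, beta, eta, gammas, covariates):
    """Weibull PHM hazard with several time-dependent covariates."""
    if t <= 0.0:
        return 0.0
    lin = sum(g * y(t) for g, y in zip(gammas, covariates))
    return (beta / eta) * (t / eta) ** (beta - 1.0) * math.exp(lin)

# Hypothetical trends starting their decline at t1 = 50.
notif = lambda t: 1.0 - 0.01 * (t - 50.0)
vib = lambda t: 1.0 - 0.02 * (t - 50.0)
t_n, t_v = first_crossing(notif, 50.0, 400.0), first_crossing(vib, 50.0, 400.0)
t2, t3 = sorted([t_n, t_v])              # vibration reaches the threshold first
y_n, y_v = make_clamped(notif, 50.0, t_n), make_clamped(vib, 50.0, t_v)
print(t2, t3, hazard(120.0, 1.5, 100.0, (-0.4, -0.8), (y_n, y_v)))
```

Sorting the two crossing times covers both orderings, so the case where the notification covariate reaches the threshold first needs no separate code path.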

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.8: ERUL estimated by low-level TCIs for 3 running periods of gas compressor KA41 as a function of operational time. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.9: ERUL estimated by low-level TCIs for the single running period of gas compressor KA51 as a function of operational time. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.10: ERUL estimated by low-level TCIs for 2 running periods of gas compressor KA61 as a function of operational time. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

[Plot: ERUL (operational days) against operational time (days).]
Figure 7.11: ERUL estimated by low-level TCIs for 3 running periods of gas compressor KA71 as a function of operational time. The required maintenance actions are indicated by x, and the suspensions are indicated by a separate marker on the operational-time axis. Truncated operational time intervals are indicated as gaps in the solid horizontal lines on the operational-time axis.

The expected remaining useful life using low-level TCIs (notification and vibration) at the time stamps of required maintenance actions is summarized in Table 7.11.

Table 7.11: Summary of ERUL for all running periods using the low-level notification TCIs and low-level vibration TCIs.
Columns: Compressor; Running period; Censoring ($c_i$); ERUL (days).

Two low-level TCI trends (notification and vibration) were used to estimate ERUL of all 11 running periods. The results of ERUL using low-level TCIs in Table 7.11 are compared with the results of ERUL using a single high-level TCI in Table 7.2.

The estimates are almost equal for:
running period 1 item KA31 (low level = 146 days, high level = 138 days)
running period 2 item KA31 (low level = 124 days, high level = 112 days)
running period 1 item KA41 (low level = 98 days, high level = 15 days)
running period 1 item KA61 (low level = 9 days, high level = 8 days)
running period 3 item KA71 (low level = 4 days, high level = 42 days)

The following running periods have the best (shortest) ERUL estimates for the low-level TCIs:
running period 2 item KA41 (low level = 3 days, high level = 82 days)
running period 1 item KA51 (low level = 154 days, high level = 24 days)
running period 2 item KA71 (low level = 49 days, high level = 122 days)
running period 1 item KA71 (low level = 465 days, high level = 872 days)

The following running periods have the best (shortest) ERUL estimates for the high-level TCIs:
running period 3 item KA41 (low level = 279 days, high level = 147 days)
running period 2 item KA61 (low level = 43 days, high level = 14 days)

The results depend strongly on the TCI trend just before the next upcoming required maintenance action (failure). A steep decline of the TCI towards the threshold of legal TCI value (0) seems to imply a much shorter ERUL than a slow decline. In this thesis this is related to the order of the polynomial used to fit the TCI trend. The correct choice of polynomial order is therefore most important. This is not a trivial task because of the noisy behaviour of the TCIs in the TeCoMan software. The timestamp and value at the beginning of accelerated deterioration are also important in the ERUL estimate. Another observation from the simulations is that the lifestyle of the items in terms of

calculated TCIs in early intervals far from the upcoming required maintenance action seems to have little influence on the ERUL estimate. This was tested by applying different values of $y(t)$ in the interval $t < t_1$ in e.g. equation 6-7. Note that the values of $y(t)$ in the other intervals were not changed, implying the same TCIs in the intervals $t_1 \le t < t_2$ and $t \ge t_2$. A maintenance action which is worse than new in the interval $t < t_1$ is therefore not very critical for this model of the Useful life period of the bathtub in Figure

7.2 Lifetime distribution based on condition monitoring only

The gamma process approach will in this chapter be used for construction of a lifetime distribution for the accelerated degradation in each running period based on the high-level TCI path. This approach is based only on the characteristics of the technical condition index (TCI) and the development of the operating environment, and is therefore independent of the age of documented running periods. The assumptions for the gamma process model are described in chapter 6.1 Accelerated monotone TCI development modelled by a gamma process. Another advantage of the gamma process approach is its ability to quantify the error bounds, including the temporal variability of the TCIs, in the remaining useful life.

A major problem for the high-level TCIs is where to set the limit for request of required maintenance action (stress threshold). Here the documented TCI value at the end of earlier running periods is helpful information. The number of running periods and the quality of the TCI data at the end of each running period are therefore important when modelling the stress threshold. In the gas compressor case 11 running periods are recorded. Only the running periods that end with a required maintenance action (uncensored) are used. Further, KA41 (running period 3), KA61 (running period 2), and KA71 (running period 1) have an increasing trend before the time stamp of required maintenance action. This makes them unsuitable for modelling by a gamma process, which needs monotonically accumulating damage over time. The useful running periods and their documented TCI value at the end of each running period are presented in Table 7.12.

Table 7.12: Stress threshold value for running periods ending with required maintenance action.
Columns: Compressor; Running period; Working age (days since new at t = 0); Working age (days since t = 0 at repair); TCI value at repair (TCI); TCI at t = 0 (TCI); Threshold ρ = TCI(t = 0) - s.

The deterioration (stress) threshold is given by $\rho = TCI(0) - s$, where $TCI(0)$ is the value of the high-level TCI at local time $t = 0$ at the beginning of accelerated deterioration. This threshold is regarded as a random quantity $\rho > 0$. It is assumed that Ρ is independent of the local operational time $t$ and will be fitted by a proper distribution. In Figure 7.12 a point estimate fit of the threshold ρ by a normal distribution, a gamma distribution, and a simple uniform distribution is shown.

[Plot: probability density $f_Ρ(\rho)$ against $\rho = TCI(0) - s$; curves: Normal distr., Gamma distr., Uniform distr.]
Figure 7.12: The probability density function $f_Ρ(\rho)$ fitted by normal, gamma, and uniform distributions.

A left and right truncated version of the normal distribution or a right truncated version of the gamma distribution would probably be the best choice, but a simple uniform distribution is also assumed to be a reasonably good fit. Since the transformed TCI values at the timestamps of required maintenance actions have a large variance, it is reasonable to believe that the confidence limits on remaining useful life will be wide and that the TCIs are not quantifying the real technical condition well enough. The probability of failure in the time interval $(0, t]$, where $t$ is the local time in the accelerated degradation of a running period, is written by the convolution integral in equation 6-7. The choice of a uniform distribution fit of the deterioration (stress) threshold ρ simplifies the convolution integral. The probability density function (pdf) of a uniform distribution of the deterioration (stress) threshold ρ is

$$f_Ρ(\rho) = \begin{cases}
\dfrac{1}{b - a}, & a < \rho \le b \\[6pt]
0, & \text{otherwise}
\end{cases} \tag{7.12}$$

Equation 7-12 is then inserted into equation 6-7 and analytically solved as far as possible as shown in equation 7-13.

7.2. LIFETIME DISTRIBUTION BASED ON CONDITION MONITORING ONLY

$$\begin{aligned}
F_{T_\rho}(t) &= \int_{y=0}^{\infty}\int_{\rho=0}^{y} f_{Y(t)}(y)\, f_Ρ(\rho)\, d\rho\, dy \\
&= \int_{y=a}^{b}\frac{u^{v(t)}}{\Gamma(v(t))}\, y^{v(t)-1}e^{-uy}\,\frac{y-a}{b-a}\, dy + \int_{y=b}^{\infty}\frac{u^{v(t)}}{\Gamma(v(t))}\, y^{v(t)-1}e^{-uy}\, dy \\
&= \frac{u^{v(t)}}{\Gamma(v(t))(b-a)}\int_{y=a}^{b} y^{v(t)}e^{-uy}\, dy - \frac{a\,u^{v(t)}}{\Gamma(v(t))(b-a)}\int_{y=a}^{b} y^{v(t)-1}e^{-uy}\, dy + \frac{u^{v(t)}}{\Gamma(v(t))}\int_{y=b}^{\infty} y^{v(t)-1}e^{-uy}\, dy \\
&= \frac{u^{v(t)}}{\Gamma(v(t))(b-a)}\int_{y=a}^{b} y^{v(t)}e^{-uy}\, dy - \frac{a\,\Gamma(v(t), au)}{(b-a)\Gamma(v(t))} + \frac{a\,\Gamma(v(t), bu)}{(b-a)\Gamma(v(t))} + \frac{\Gamma(v(t), bu)}{\Gamma(v(t))} \\
&= \frac{1}{\Gamma(v(t))(b-a)}\left[u^{v(t)}\int_{y=a}^{b} y^{v(t)}e^{-uy}\, dy - a\,\Gamma(v(t), au) + b\,\Gamma(v(t), bu)\right]
\end{aligned} \tag{7.13}$$

where $\Gamma(a) = \int_{z=0}^{\infty} z^{a-1}e^{-z}\,dz$ is the gamma function for $a > 0$ and $\Gamma(a, x) = \int_{z=x}^{\infty} z^{a-1}e^{-z}\,dz$ is the incomplete gamma function for $x \ge 0$ and $a > 0$. Equation 7-13 is easily solved numerically in the Maple software.

A modified moving average is used to reduce the sampling variations and to ensure monotone behaviour of the aggregated TCIs in the accelerated interval from $t = 0$ until the first required maintenance action. This is performed by simply ignoring small increases in the value calculated by the ordinary moving average. In the accelerated interval this simplification makes only minor differences between the moving average and the modified moving average aggregated TCI developments.

To be able to estimate the parameters of the gamma process, the functional form of the expected TCI path in the accelerated area needs to be assumed, as described in chapter 6.2 Parameter estimation for the gamma process. The power law assumption of the expected value

$$E\big(Y(t)\big) = \frac{v(t)}{u} = \frac{c\,t^{b}}{u}$$

requires an estimate of 3 parameters ($c$, $b$, and $u$). By the method of maximum likelihood all parameters can be estimated (Nicolai et al. [76]), but by fixing the power $b$ by inspection of the TCI behaviour in each running period, the parameters $c$ and $u$ can be estimated by the method of moments (see chapter 6.2 Parameter estimation for the gamma process). Ideally the functional form of the expected TCI path should be related to physical knowledge of its behaviour. The estimation results are shown in Table 7.13.
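Equation 7.13 can also be evaluated without a computer algebra system. The sketch below is an illustration under stated assumptions, not the Maple code of the thesis: it expresses the result through the regularized lower incomplete gamma function $P(a, x) = 1 - \Gamma(a, x)/\Gamma(a)$, computed with the standard series and continued-fraction expansions, and uses $\int_a^b y\, f_{Y(t)}(y)\,dy = \frac{v}{u}\big[P(v+1, bu) - P(v+1, au)\big]$. All function names and parameter values are hypothetical.

```python
import math

def reg_lower_gamma(a, x, eps=1e-14, max_iter=100000):
    """Regularized lower incomplete gamma P(a, x) for a > 0."""
    if x <= 0.0:
        return 0.0
    log_prefix = a * math.log(x) - x - math.lgamma(a)
    if x < a + 1.0:                      # series expansion
        term = 1.0 / a
        total = term
        n = a
        for _ in range(max_iter):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return total * math.exp(log_prefix)
    tiny = 1e-300                        # continued fraction for Q = 1 - P
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return 1.0 - math.exp(log_prefix) * h

def failure_cdf_uniform_threshold(v, u, a, b):
    """P(Y >= rho) for Y ~ Gamma(shape v, rate u) and rho ~ Uniform(a, b)."""
    p_v_a, p_v_b = reg_lower_gamma(v, a * u), reg_lower_gamma(v, b * u)
    p_v1_a = reg_lower_gamma(v + 1.0, a * u)
    p_v1_b = reg_lower_gamma(v + 1.0, b * u)
    ramp = ((v / u) * (p_v1_b - p_v1_a) - a * (p_v_b - p_v_a)) / (b - a)
    tail = 1.0 - p_v_b                   # probability that Y exceeds b
    return ramp + tail

def lifetime_cdf(t, c, b_exp, u, a, b):
    """F(t) with the power-law shape function v(t) = c * t**b_exp."""
    return failure_cdf_uniform_threshold(c * t ** b_exp, u, a, b)

# Hypothetical gamma process parameters and threshold range.
print(lifetime_cdf(200.0, 0.001, 2.0, 1.0, 13.0, 85.0))
```

For $v = 1$ the gamma distribution reduces to an exponential distribution, which gives the closed form $\big(e^{-ua} - e^{-ub}\big)/\big(u(b-a)\big)$ as a convenient check of the implementation.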

Table 7.13: Estimated parameters for the gamma process.
Columns: Compressor; Running period; $b$; $c$; $u$.

The moving average (MA), the modified moving average, the expected value $E(Y(t))$, and the expected value ± one standard deviation (SD) of $Y(t)$ are trended as a function of local time for running period 1 of compressor KA31 in Figure 7.13. For the last 229 days before the required maintenance action, $E(Y(t))$ is assumed to follow a parabolic path ($b = 2$). The degradation is stepwise, implying a long reaction time. Choosing a linear path would give a better fit of $E(Y(t))$ against the moving average (MA) for the first local days in the accelerated degradation interval and a worse fit of the last important 5 days of this running period.

[Plot: $Y(t) = TCI(0) - TCI(t)$ against local time $t$; curves: MA, Modified MA, Expected value E(Y(t)), E(Y(t))+SD, E(Y(t))-SD.]
Figure 7.13: Simulated gamma process trends for running period 1 of compressor KA31. Moving average (MA) is shown as a red solid trend and the modified moving average (modified MA) is shown as a white solid trend. The simulated E(Y(t)) is shown as a thick, black, and solid trend. E(Y(t))+SD (lower), and E(Y(t))-SD (upper) are shown as black solid trends around the expected value E(Y(t)).
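The curves trended in such a figure can be sketched as follows. The sketch interprets "ignoring small increases" as holding the smallest moving-average TCI value seen so far (a running minimum), which is an assumption about the thesis procedure, and it uses the standard gamma process moments $E(Y(t)) = c\,t^b/u$ and $\mathrm{Var}(Y(t)) = c\,t^b/u^2$ for a process with shape function $v(t) = c\,t^b$ and rate $u$. The data, window length, and parameter values are hypothetical.

```python
import math

def moving_average(values, window):
    """Trailing moving average; shorter windows are used at the start."""
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - window + 1): k + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def modified_moving_average(tci_values, window):
    """Moving average of a TCI series with increases ignored (running minimum)."""
    out = []
    floor = float("inf")
    for v in moving_average(tci_values, window):
        floor = min(floor, v)
        out.append(floor)
    return out

def gamma_process_band(t, c, b, u):
    """(mean - SD, mean, mean + SD) of Y(t) ~ Gamma(shape c*t**b, rate u)."""
    shape = c * t ** b
    mean = shape / u
    sd = math.sqrt(shape) / u
    return mean - sd, mean, mean + sd

# Hypothetical TCI samples and gamma process parameters.
tci = [100.0, 90.0, 100.0, 100.0, 80.0]
print(modified_moving_average(tci, window=2))
print(gamma_process_band(10.0, c=0.04, b=2.0, u=0.5))
```

The degradation $Y(t) = TCI(0) - TCI(t)$ follows from the modified moving average by subtraction, and it is nondecreasing by construction, as the gamma process requires.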

A uniform distribution of the deterioration (stress) threshold ρ in equation 7-13 is used to calculate the cumulative lifetime distribution of time to required maintenance action $F_{T_\rho}(t)$. The transformed TCI values at the timestamps of required maintenance actions vary widely, between $a = 13$ and $b = 85$. Simulation results of the lifetime cumulative distribution function of running period 1 of gas compressor KA31 are shown in Figure 7.14.

[Plot: cumulative distribution function against time to required repair action (local time $t$).]
Figure 7.14: Cumulative distribution of time to required maintenance action of compressor KA31, running period 1.

With the gamma process approach the whole distribution of the time to failure is calculated in the form of the cumulative distribution function at local time $t = 0$ and at working age (time since new) 246. The median value, $F_{T_\rho}(t) = 0.5$, is estimated to be $t_m = 42$ days after $t = 0$. The error bounds can be calculated as a two sided 95% confidence interval by finding the lower local time $t$ from $F_{T_\rho}(t) = 0.025$ and the upper local time $t$ from $F_{T_\rho}(t) = 0.975$. This interval is calculated to be $t \in [229, 559]$ days after $t = 0$. The actual required maintenance action was at $t = 229$ days after $t = 0$, which is just within the lower part of the 95% confidence interval.

The moving average, the modified moving average, the expected value, and the expected value ± one standard deviation of $Y(t)$ are trended as a function of local time for running period 1 of compressor KA41 in Figure 7.15. For the last 78 days before the required maintenance action, $E(Y(t))$ is assumed to be very steeply increasing ($b = 2$). The degradation is suddenly increasing the last 8 days, making the reaction time very short.
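Reading the median and a two sided 95% interval off a cumulative distribution function amounts to inverting it at 0.5, 0.025, and 0.975. A minimal sketch using bisection on any nondecreasing CDF is given below. A Weibull CDF with hypothetical parameters stands in for the gamma process lifetime distribution so that the block stays self-contained; the function names are also hypothetical.

```python
import math

def quantile(cdf, p, t_lo, t_hi, tol=1e-8):
    """Smallest t in [t_lo, t_hi] with cdf(t) >= p, for a nondecreasing cdf."""
    if cdf(t_hi) < p:
        raise ValueError("probability level is not reached inside the bracket")
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if cdf(mid) >= p:
            t_hi = mid
        else:
            t_lo = mid
    return t_hi

def median_and_interval(cdf, t_lo, t_hi, level=0.95):
    """Return (median, lower limit, upper limit) of a two sided interval."""
    alpha = 1.0 - level
    return (quantile(cdf, 0.5, t_lo, t_hi),
            quantile(cdf, alpha / 2.0, t_lo, t_hi),
            quantile(cdf, 1.0 - alpha / 2.0, t_lo, t_hi))

# Stand-in lifetime CDF (Weibull with hypothetical shape and scale).
SHAPE, SCALE = 3.0, 400.0

def stand_in_cdf(t):
    return 1.0 - math.exp(-((t / SCALE) ** SHAPE)) if t > 0.0 else 0.0

print(median_and_interval(stand_in_cdf, 0.0, 5000.0))
```

The observed time of the required maintenance action can then be compared with the returned limits to see whether it falls inside the interval.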

Figure 7.15: Simulated gamma process trends for running period 1 of compressor KA41 (x-axis: local time t; y-axis: Y(t) = TCI(0) - TCI(t)). Moving average (MA) is shown as a red solid trend and the modified moving average (modified MA) is shown as a white solid trend. The simulated E(Y(t)) is shown as a thick, black, and solid trend. E(Y(t))+SD (lower), and E(Y(t))-SD (upper) are shown as black solid trends around the expected value E(Y(t)).

Simulation results of the lifetime cumulative distribution function of running period 1 of gas compressor KA41 are shown in Figure 7.16.

Figure 7.16: Cumulative distribution of time to required maintenance action of compressor KA41, running period 1 (x-axis: time to required repair action, local time t; y-axis: cumulative distribution function)

The median value, $F_{T_\rho}(t) = 0.5$, is estimated to be $t_m = 78$ days after t = 0. The error bounds can be calculated as a two-sided 95% confidence interval by finding the lower local time t from $F_{T_\rho}(t) = 0.025$ and the upper local time t from $F_{T_\rho}(t) = 0.975$. This interval is calculated to be t ∈ [73, 80] days after t = 0. The actual required maintenance action was at t = 78 days after t = 0, which is at the median value and within the 95% confidence interval.

The moving average, the modified moving average, the expected value, and the expected value ± one standard deviation of Y(t) are trended as a function of local time for running period 2 of compressor KA41 in Figure 7.17. For the last 48 days before the required maintenance action, E(Y(t)) is assumed to be steeply increasing (b = 4). The degradation increases suddenly during the last 8 days, making the reaction time very short.

Figure 7.17: Simulated gamma process trends for running period 2 of compressor KA41 (x-axis: local time t; y-axis: Y(t) = TCI(0) - TCI(t)). Moving average (MA) is shown as a red solid trend and the modified moving average (modified MA) is shown as a white solid trend. The simulated E(Y(t)) is shown as a thick, black, and solid trend. E(Y(t))+SD (lower), and E(Y(t))-SD (upper) are shown as black solid trends around the expected value E(Y(t)).

Simulation results of the lifetime cumulative distribution function of running period 2 of gas compressor KA41 are shown in Figure 7.18.

Figure 7.18: Cumulative distribution of time to required maintenance action of compressor KA41, running period 2 (x-axis: time to required repair action, local time t; y-axis: cumulative distribution function)

The median value, $F_{T_\rho}(t) = 0.5$, is estimated to be $t_m = 45$ days after t = 0. The error bounds can be calculated as a two-sided 95% confidence interval by finding the lower local time t from $F_{T_\rho}(t) = 0.025$ and the upper local time t from $F_{T_\rho}(t) = 0.975$. This interval is calculated to be t ∈ [30, 55] days after t = 0. The actual required maintenance action was at t = 46 days after t = 0, which is the day after the median value and within the 95% confidence interval.

The moving average, the modified moving average, the expected value, and the expected value ± one standard deviation of Y(t) are trended as a function of local time for running period 1 of compressor KA61 in Figure 7.19. For the last 286 days before the required maintenance action, E(Y(t)) is assumed to be steeply increasing (b = 8). The degradation increases suddenly during the last 4 days, making the reaction time long.

Figure 7.19: Simulated gamma process trends for running period 1 of compressor KA61 (x-axis: local time t; y-axis: Y(t) = TCI(0) - TCI(t)). Moving average (MA) is shown as a red solid trend and the modified moving average (modified MA) is shown as a white solid trend. The simulated E(Y(t)) is shown as a thick, black, and solid trend. E(Y(t))+SD (lower), and E(Y(t))-SD (upper) are shown as black solid trends around the expected value E(Y(t)).

Simulation results of the lifetime cumulative distribution function of running period 1 of gas compressor KA61 are shown in Figure 7.20.

Figure 7.20: Cumulative distribution of time to required maintenance action of compressor KA61, running period 1 (x-axis: time to required repair action, local time t; y-axis: cumulative distribution function)

The median value, $F_{T_\rho}(t) = 0.5$, is estimated to be $t_m = 265$ days after t = 0. The error bounds can be calculated as a two-sided 95% confidence interval by finding the lower local time t from $F_{T_\rho}(t) = 0.025$ and the upper local time t from $F_{T_\rho}(t) = 0.975$. This interval is calculated to be t ∈ [225, 285] days after t = 0. The actual required maintenance action was at t = 284 days after t = 0, which is just within the upper part of the 95% confidence interval.

The moving average, the modified moving average, the expected value, and the expected value ± one standard deviation of Y(t) are trended as a function of local time for running period 2 of compressor KA71 in Figure 7.21. For the last 141 days before the required maintenance action, E(Y(t)) is assumed to be slowly increasing (b = .25). The degradation is stepwise and slowly decreasing over the whole accelerated degradation period, making the reaction time long.

Figure 7.21: Simulated gamma process trends for running period 2 of compressor KA71 (x-axis: local time t; y-axis: Y(t) = TCI(0) - TCI(t)). Moving average (MA) is shown as a red solid trend and the modified moving average (modified MA) is shown as a white solid trend. The simulated E(Y(t)) is shown as a thick, black, and solid trend. E(Y(t))+SD (lower), and E(Y(t))-SD (upper) are shown as black solid trends around the expected value E(Y(t)).

Simulation results of the lifetime cumulative distribution function of running period 2 of gas compressor KA71 are shown in Figure 7.22.

Figure 7.22: Cumulative distribution of time to required maintenance action of compressor KA71, running period 2 (x-axis: time to required repair action, local time t; y-axis: cumulative distribution function)

The median value, $F_{T_\rho}(t) = 0.5$, is estimated to be $t_m = 400$ days after t = 0. The error bounds can be calculated as a two-sided 95% confidence interval by finding the lower local time t from $F_{T_\rho}(t) = 0.025$ and the upper local time t from $F_{T_\rho}(t) = 0.975$. This interval is calculated to be t ∈ [2, 4] days after t = 0. The actual required maintenance action was at t = 139 days after t = 0, which is within the 95% confidence interval. Because of the behaviour of the TCIs (slowly decreasing) and the large variance of the uniformly distributed deterioration threshold ρ, the upper confidence limit is very large.

The median values of time (days) to required maintenance actions and the two-sided 95% confidence intervals are summarized in Table 7.14.

Table 7.14: Summary of the median value including 95% confidence error bounds for 5 running periods

| Compressor | Running period | Working age (days since new) at t = 0 | Working age (days since t = 0) at repair | Working age (days since t = 0) at t = t_m | 95% conf. interval | Error in estimate |
|---|---|---|---|---|---|---|
| KA31 | 1 | 246 | 229 | 420 | [229, 559] | 191 |
| KA41 | 1 | | 78 | 78 | [73, 80] | 0 |
| KA41 | 2 | | 46 | 45 | [30, 55] | -1 |
| KA61 | 1 | | 284 | 265 | [225, 285] | -19 |
| KA71 | 2 | | 139 | 400 | [2, 4] | 261 |
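How a median and a two-sided 95% interval can be read off a simulated first-passage distribution is sketched below. This is a Monte Carlo illustration, not the thesis's simulation code: it assumes a gamma process with shape c·t^b and rate u, draws the threshold ρ uniformly between the bounds a = 13 and b = 85 quoted in the text, and uses hypothetical values for c and u. The function names are likewise hypothetical.

```python
import math
import random

def first_passage_days(b, c, u, rho_low, rho_high, horizon, n_paths, seed=0):
    """Simulate the day on which a gamma process first reaches a random threshold."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_paths):
        rho = rng.uniform(rho_low, rho_high)   # randomized deterioration threshold
        y = 0.0
        hit = math.inf                         # stays inf if the threshold is never reached
        for day in range(1, horizon + 1):
            # Independent daily increment: shape c*(t^b - (t-1)^b), scale 1/u.
            shape = c * (day ** b - (day - 1) ** b)
            y += rng.gammavariate(shape, 1.0 / u)
            if y >= rho:
                hit = day
                break
        days.append(hit)
    return sorted(days)

def quantile(sorted_days, p):
    """Empirical quantile: smallest simulated t with F(t) >= p."""
    index = max(0, math.ceil(p * len(sorted_days)) - 1)
    return sorted_days[index]

# c and u are hypothetical; only the threshold bounds 13 and 85 come from the text.
days = first_passage_days(b=2, c=0.001, u=1.0, rho_low=13, rho_high=85,
                          horizon=1000, n_paths=1000, seed=1)
lower, median, upper = (quantile(days, p) for p in (0.025, 0.5, 0.975))
print(lower, median, upper)
```

The error in estimate reported in the table is then the median minus the observed time of the required maintenance action.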

With knowledge of the cumulative distribution of time to required maintenance action $F_{T_\rho}(t)$ at age t, the survival function is given by $R_{T_\rho}(t) = 1 - F_{T_\rho}(t)$. The probability density function of the remaining useful life at age x, where x is the additional age calculated from local time t, is then given by equation 7-14 (Andersen [5]):

$$f_{RUL}(x) = -\frac{1}{R_{T_\rho}(t)} \frac{\partial}{\partial x} R_{T_\rho}(x + t) \qquad \text{(7-14)}$$

Equation 7-14 is computationally very expensive to evaluate because the behaviour of the TCI degradation is described by a gamma process and the deterioration threshold is randomized. Instead, to be able to compare the results above with the belief-in-aging approach, the ERUL calculated at the beginning of accelerated deterioration can be used for the running periods in Table 7.14, rather than the ERUL calculated at the time stamps of required maintenance actions shown in Table 7.11. These calculated ERUL values are shown in Table 7.15.

Table 7.15: ERUL and 95% confidence estimated at beginning of accelerated deterioration (t = 0)

| Compressor | Running period | Working age (days since new) at t = 0 | Working age (days since t = 0) at repair | ERUL at t = 0 (days) | 95% conf. interval | Error in ERUL estimate (days) |
|---|---|---|---|---|---|---|
| KA31 | 1 | | 229 | | [297, 391] | 115 |
| KA41 | 1 | | 78 | | [145, 22] | 96 |
| KA41 | 2 | | 46 | | [111, 142] | 81 |
| KA61 | 1 | | 284 | | [217, 265] | -43 |
| KA71 | 2 | | 139 | | [167, 287] | 88 |

Comparing the Error in ERUL estimate column for the ERUL estimated at the beginning of accelerated deterioration in Table 7.15 with the Error in estimate column of the median value of estimated time to required maintenance action in Table 7.14, the estimates for ERUL are best for:

- running period 1 item KA31 (Error in ERUL = 115, Error in median value = 191)
- running period 2 item KA71 (Error in ERUL = 88, Error in median value = 261)

The estimates for these running periods are, however, not very good because of the behaviour of the aggregated TCI for these running periods. The required maintenance actions for these running periods are also not within the 95% confidence interval of the ERUL estimates.
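For reference, the remaining useful life density in equation 7-14 is the negative derivative of the conditional survival function R(x + t)/R(t). The sketch below approximates that derivative by a central finite difference for any survival function supplied as a Python callable. The function name `rul_density` and the exponential example are hypothetical and only serve as a check; they are not taken from the thesis.

```python
import math

def rul_density(survival, t, x, dx=1e-3):
    """Approximate f_RUL(x) = -(1 / R(t)) * d/dx R(x + t) by a central difference."""
    r_t = survival(t)
    if r_t <= 0.0:
        raise ValueError("survival at age t must be positive")
    slope = (survival(t + x + dx) - survival(t + x - dx)) / (2.0 * dx)
    return -slope / r_t

# Check with an exponential lifetime, whose RUL density is lam * exp(-lam * x)
# regardless of the current age t (memoryless property).
lam = 0.01
value = rul_density(lambda s: math.exp(-lam * s), t=50.0, x=20.0)
print(value, lam * math.exp(-lam * 20.0))
```

With a simulated survival function the same difference quotient applies, but every evaluation of R then requires its own simulation run, which is where the computational cost arises.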
The median value of estimated time to required maintenance action with the gamma approach is best for:

- running period 1 item KA41 (Error in ERUL = 96, Error in median value = 0)
- running period 2 item KA41 (Error in ERUL = 81, Error in median value = -1)
- running period 1 item KA61 (Error in ERUL = -43, Error in median value = -19)

These estimates are very good for the gamma approach, and all required maintenance actions are within the 95% confidence interval. In all these running periods the TCIs decline steeply in a short interval before the required maintenance action. On the day of the required maintenance action the value of the aggregated TCI is 41 for running period 1 of item KA41, 44 for running period 2 of item KA41, and 15 for running period 1 of item KA61. These small TCI values indicate a bad technical condition.

For the running periods

- running period 2 item KA31 (suspension)
- running period 3 item KA41
- running period 1 item KA51 (suspension)
- running period 2 item KA61
- running period 1 item KA71
- running period 3 item KA71 (suspension)

it was not possible to compare the median value of time (days) to required maintenance actions from the gamma process approach with the working age at repair, because they were either suspensions, or the behaviour of the TCIs in the last time interval before the required maintenance action indicated an improving technical condition.

7.3 Discussion of uncertainty and predictability

Predictability is a characteristic parameter of a failure mode, which gives information about the accuracy of required maintenance action prediction. It is, for example, a decision parameter during maintenance policy selection (Figure 2.3) and is related to the uncertainties of the remaining useful life (RUL). Two main kinds of uncertainty are involved in the RUL prediction. Their sizes often seem to be comparable, but their time-dependent behaviour may differ.

1. Uncertainty related to a population of similar running periods (sampling or between uncertainty)
2. Uncertainty related to a specific running period (temporal or within uncertainty)

Uncertainty related to a population of similar running periods is influenced by:

- Quantity of historical maintenance actions
- Accuracy of historical maintenance actions
- Selection of remaining useful life model (maintenance policy)

Uncertainty related to a specific running period is influenced by:

- Quantity of technical condition measurements
- Accuracy of technical condition measurements
- Accuracy of technical condition measurement prediction
- Selection of remaining useful life model (maintenance policy)

Since the between uncertainty is based on the population of reliability data, which only changes when new required maintenance actions are recorded, the between uncertainty is rather static. The within uncertainty may be more dynamic since it is based on the population of more frequently updated technical condition monitoring data.

For a Weibull lifetime model the uncertainty related to a specific running period is ignored, and the uncertainty of the RUL estimate depends entirely on the between uncertainty. If all observed running periods have exactly the same time to required maintenance action, the between uncertainty is zero. A preventive time-based maintenance policy is effective when the between uncertainty is small and the between predictability is high. For a Weibull distribution the between predictability is strongly related to the estimate of the shape parameter β. There is no between predictability for constant or decreasing failure rates (β ≤ 1), but for β > 1 the between predictability increases with increasing value of β.

The gamma process approach with a fixed, known threshold value is the other extreme, which ignores the uncertainty related to the population of similar running periods. A CBM policy in this case is effective when the within uncertainty is small and the within predictability is high. The gamma process has some advantageous features for accurately predicting the technical condition measurements. It can be observed for all running periods (Figure 7.13, Figure 7.15, Figure 7.17, Figure 7.19, and Figure 7.21) that E(Y(t)) always runs through the last TCI value, i.e. the expected deterioration at the last TCI contains the most information. A sudden steep increase of degradation (large estimate of b) before the required maintenance action is then guaranteed to be captured by the gamma process approach. From the same figures it can also be observed that the size of the error bounds (E(Y(t)) ± SD) is related to the estimated value of b in Table 7.13. A large b seems to imply small error bounds and a small within uncertainty. If we were able to observe and measure the technical condition perfectly, and the technical condition trend had no stochastic deviation, the within uncertainty would be zero.

The Weibull PHM and the gamma process with randomized threshold value are both somewhere in between these two extremes. The potential benefit of CBM depends completely on both the between uncertainty and the within uncertainty. The Weibull PHM model is more sensitive to the between uncertainty since specific running period information is only modelled in terms of covariate(s). The gamma process with randomized threshold value is more sensitive to the within uncertainty since the information from a population of similar running periods is only modelled by randomizing the threshold value. An analysis of the results in Table 7.14 and Table 7.15 is then:

- For running period 1 of compressor KA31 and running period 2 of compressor KA71 the RUL estimates based on the PHM approach are best. For these running periods the 95% confidence interval is also shortest for the PHM approach. The between uncertainty is static and large for both approaches, but the dynamic within uncertainty is modelled as larger for the gamma process than for the PHM. This is because the gamma process with randomized threshold value is more sensitive to the within uncertainty. If the TCI trend had been steeper and more correlated with the real technical condition, the estimate of b would have been larger and the within uncertainty smaller.
- For running period 1 of compressor KA41, running period 2 of compressor KA41, and running period 1 of compressor KA61 the RUL estimates based on the gamma process with randomized threshold are best. For these running periods the 95% confidence interval is also shortest for the gamma process with randomized threshold. The TCI trend is steep, b is large, and the within uncertainty is small. If the between uncertainty had been smaller, the PHM would have had a better fit (e.g. a larger log-likelihood value at the optimum). The randomized threshold value of the gamma process would also have had less variance (uncertainty), implying a better fit for this model too. In this case the PHM model would be less sensitive than the gamma process with randomized threshold to inaccurate TCI values.

Selection of the best remaining useful life model may be based on estimates of the uncertainty or on estimates of the predictability. In the literature, little attention has been paid to methods for measuring predictability. As mentioned above, predictability is related to the uncertainties of e.g. the remaining useful life (RUL) and is always a matter of degree. The degree of predictability depends on the forecast horizon and the specified loss function L(·) (a squared-error loss function is defined as $L(x) = x^2$). The predictability p is defined to be in the range p ∈ [0, 1], with larger values indicating greater predictability. Woud et al. [32] defined predictability for preventive, time-based maintenance. They fitted a Weibull lifetime model to the reliability data and calculated the MTTF (mean time to failure) and the standard deviation (σ). Predictability was then defined as:

$$p_{WSV} = \begin{cases} 1 - \dfrac{\sigma}{MTTF} & \text{for } \sigma \le MTTF \\ 0 & \text{for } \sigma > MTTF \end{cases}$$

In economics and finance, forecasts are of great importance and widely used. Forecast evaluation and combination techniques are needed to improve forecast performance, e.g. on the stock exchange.
Granger and Newbold [77] propose a natural definition of forecastability (predictability) for covariance stationary series under squared-error loss, patterned on the familiar R² of linear regression³:

$$p_{GN} = \frac{Var(\hat{y}_{t+j,t})}{Var(y_{t+j})} = 1 - \frac{Var(e_{t+j,t})}{Var(y_{t+j})}$$

where $\hat{y}_{t+j,t}$ is the conditional mean (optimal) forecast and $e_{t+j,t} = y_{t+j} - \hat{y}_{t+j,t}$ denotes the forecast error of $\hat{y}_{t+j,t}$.

³ In the case of linear regression, R² is the square of the correlation coefficient κ (see chapter 7.1 Lifetime distribution based on aging or Meeker & Escobar [65]).

Diebold and Kilian [78] generalize the concept of Granger and Newbold to allow the assessment of non-stationary time series, multivariate information sets, a wide range of different loss functions, and the possibility to tailor the measure to different forecasting horizons. The proposed predictability measure is based on the difference between the conditionally expected loss of an optimal short-run forecast, $E(L(e_{t+j,t}))$, and that of an optimal long-run forecast, $E(L(e_{t+k,t}))$, $j \ll k$. If $E(L(e_{t+j,t})) \ll E(L(e_{t+k,t}))$, the series (TCI trends) is highly predictable at horizon j relative to k, and if $E(L(e_{t+j,t})) \approx E(L(e_{t+k,t}))$, the series is nearly unpredictable at horizon j relative to k. The measure for predictability is defined analogously to $p_{GN}$:

$$p_{DK} = 1 - \frac{E(L(e_{t+j,t}))}{E(L(e_{t+k,t}))}$$

For the Weibull PHM and the gamma process with randomized threshold value, the Diebold-Kilian measure seems most attractive because the TCIs are non-stationary and multivariate at low level. More research that combines remaining useful life estimation, including its uncertainty, with e.g. the $p_{DK}$ measure is needed.
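The two predictability definitions used in this section can be made concrete with a short sketch. It assumes the standard Weibull moments, MTTF = η·Γ(1 + 1/β) and σ² = η²·(Γ(1 + 2/β) - Γ(1 + 1/β)²), so that the ratio σ/MTTF depends only on the shape parameter β. The sample-average form of the Diebold-Kilian measure and both function names are hypothetical illustrations, not the thesis's implementation.

```python
import math

def weibull_predictability(beta):
    """Woud et al. style measure, 1 - sigma/MTTF, for a Weibull lifetime.

    The scale parameter cancels in sigma/MTTF, so only the shape beta is needed.
    """
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    cv = math.sqrt(max(g2 - g1 * g1, 0.0)) / g1   # sigma / MTTF
    return 1.0 - cv if cv <= 1.0 else 0.0

def diebold_kilian(short_errors, long_errors, loss=lambda e: e * e):
    """Sample version of p_DK = 1 - E(L(e_short)) / E(L(e_long))."""
    short_loss = sum(loss(e) for e in short_errors) / len(short_errors)
    long_loss = sum(loss(e) for e in long_errors) / len(long_errors)
    if long_loss == 0.0:
        raise ValueError("long-run expected loss must be positive")
    return 1.0 - short_loss / long_loss

for beta in (0.5, 1.0, 2.0, 4.0):
    print(beta, weibull_predictability(beta))
print(diebold_kilian([1, -1], [2, -2]))
```

Under these assumptions the Weibull measure is zero for β ≤ 1 and grows with β above 1, in line with the statement about between predictability earlier in this section.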

8. Conclusions and further work

The basic aim of this thesis was to estimate the remaining useful life of natural gas export compressors. Ideally, the population of running periods should be divided into a training set and a validation set, and only the validation set should be used to evaluate the remaining useful life estimates of the models. Since there are only 8 required maintenance actions, this was not done. These compressors have several failure modes and several sources for technical condition monitoring. The TCI concept was used to aggregate low-level multiple sensor data to TCIs at the compressor level. Remaining useful life was estimated by use of reliability data only (Weibull distribution), reliability data and technical condition monitoring data (Weibull PHM), and technical condition monitoring data only (gamma process).

8.1 Conclusions and discussion

Models of the remaining useful life based on the hazard rate, or based on a gamma process of aggregated TCIs, are able to give aggregated information to decision makers on when to perform required maintenance actions. The remaining useful life is very dependent on the underlying assumptions and on the quality, the quantity, and the behaviour of the historical data acquired. Clean (error-free) data are needed before analysis and modelling can be performed. In this thesis the TCIs stored at different aggregation levels in the TeCoMan software were assumed to be as clean as possible, but the underlying condition monitoring data and process data might be corrupted by e.g. sensor faults. Such sensor faults are, however, assumed to be non-existent in this thesis. The important work orders (required maintenance actions) and notifications for the natural gas export compressors were documented manually in a SAP database. Manually entered data will often contain errors, and a cleaning of the work-order list for each compressor was needed to screen out maintenance orders which did not cause downtime.
Work orders from systems other than the compressors in the compressor system were also removed. In general, data cleaning is a much needed and complicated task, which is not discussed in this thesis.

Remaining useful life estimates based on the gamma process approach gave the best estimates for those running periods where the aggregated TCIs had a clear downward trend in an interval before the required maintenance action (running periods 1 and 2 of gas compressor KA41 and running period 1 of gas compressor KA61). The gamma process approach is able to model the temporal variability associated with the technical condition development and offers the whole distribution of time to required maintenance action in terms of e.g. a cumulative distribution function. This information makes it attractive because quantification of the uncertainty of an estimate is important. The gamma process also assumes a monotone development of the technical condition, and for the model the age of the trend is assumed to have no influence. For the running periods where the behaviour of the TCIs in the last time interval before failure indicated an improving technical condition, the gamma process approach is not applicable. A model of the TCI trend using e.g. a Brownian motion with drift process might be more appropriate here.

The Weibull PHM model has no restriction on the behaviour of TCI trends. The main underlying assumption here is that the interoccurrence times of required maintenance actions must be independent and identically distributed (IID). Both the trend and the independence of the interoccurrence times must be tested (see chapter 5.3 Model selection framework). In the gas compressor case the IID assumption can be used and all running periods can be modelled by a Weibull PHM. The model is based on the aging principle, which implies a need for many exact times of required maintenance actions. Ideally, the required maintenance actions should be complete breakdowns of the gas compressors. The 8 documented required maintenance actions are, however, performed because of safety requirements, or when all the remaining useful life is judged to be spent after opening the gas compressors to confirm the faulty state. In addition to the time stamp of the required maintenance actions, the model needs complete documentation of the trend of technical condition for all running periods. Since complete running periods of both reliability data and technical condition data are required, it can be difficult to find cases where this model can be used. To solve the problem of missing data, methods like truncation and censoring might be used. The best solution is to recover these data if possible. The original Cox model was a relative risk model for scientifically comparing the relative contribution of different covariates to the risk. This scientific approach can be utilized in the aggregation methods to quantify the weights of underlying TCIs.
This is important feedback to the TeCoMan system designers. Documented running periods with additional information, e.g. trends of new covariates, can then also be used to improve the aggregated TCIs and the remaining useful life estimates.

The Weibull model based on reliability data only has mainly been used to validate the aging effect through the value of the β parameter. Since β > 1, the compressors have a small presence of aging. Like the Weibull PHM model, the Weibull model also requires many required maintenance actions.

To keep the number of parameters in the Weibull PHM model as low as possible, the aggregation methodology of TCIs is useful. By comparing the plots of the PHM hazard rates of all running periods using one high-level aggregated TCI against the plots using two low-level TCIs, it can be seen that they are not very different. This is because the weights chosen by process experts in the aggregation method seem to correspond to the weights found by the low-level PHM analysis, and because the technical information from the bearing temperature covariate and the seal leakage covariate is minimal.

In the models the TCI behaviour is very noisy. One reason is that the low-level TCIs are only symptoms of the technical condition (degradation). Using low-pass filters like moving average and polynomial fitting to remove the noise and at the same time keep the real technical condition is difficult. Expert judgement is needed. Since the technical condition is closely related to degradation, a solution including first principle degradation models (Samdal [38]) of the dominant degradation mechanisms of each failure mode will be advantageous. Investigation of physics-of-failure at component level seems to be the common thread among the various avenues of prognostic technology development. Modelling damage initiation and propagation at this level might be a key element in describing component health. As a first-order approximation in the compressor case, a bearing degradation model for required bearing maintenance actions based on notification symptoms, vibration symptoms, and bearing temperature symptoms seems natural. For the seal leakage failure mode, a seal leakage degradation model based on notification symptoms, vibration symptoms, and seal leakage symptoms might be used. Future online information from particle-in-lubrication-oil analysis and from efficiency and load analysis might also be used to improve the aggregated TCIs for the export gas compressors.

8.2 Future work

Framework for documentation of work orders

The quality of maintenance work orders will be improved by better procedures for documentation. Such procedures should describe the state of the maintained item (is all RUL spent at the time stamp of the maintenance action?). The failure mode of the item, subjective observations during the repair, pictures taken during the repair, and the trend of important symptoms leading to the required maintenance action should also be documented. A qualitative analysis of the repair action (minimal repair or hazard rate reduction repair), including quality test results, constitutes important information. More automated documentation procedures might also be a solution.

Further development of methods for determination of exact technical condition

Optimal sets of technical condition measurements to use for different failure modes are often a result of expert judgement. The PHM model offers important diagnostic information in terms of the values in the γ parameter vector, including p-values, described in chapter 7.1 Lifetime distribution based on aging and condition monitoring. Other diagnostic information from the work orders might also be helpful in finding the optimal set of measurements and their aggregation method weights. In the gas compressor case, only the total vibration level is recorded.
Spectral vibration analysis of important frequencies might do a better job than just recording the total vibration level. Online and offline particle-in-lubrication-oil analyses as well as efficiency monitoring may also give important early-fault detection information. Advanced sensor techniques for robust on-line data acquisition are also required. Wrongly calibrated or faulty sensors should offer information about their technical condition and in addition offer estimates of the correct value.

Reduction of errors during maintenance activities

When conducting maintenance, it is important to recognize that preventive and corrective maintenance pose different risks for human error. For instance, preventive maintenance is usually well planned and the hazards are identified. During preventive maintenance, the consequences of failure of critical components are identified and procedures are written to address failures. Skill- or rule-based errors are more likely during preventive maintenance than knowledge-based errors [79]. In contrast to preventive maintenance, less time may be available for corrective maintenance due to production and schedule pressures. During corrective maintenance workers are less likely to have procedures, or the procedures may not be at the same level of quality as those for preventive maintenance, making knowledge-based errors more likely. The frequency of unexpected breakdowns requiring corrective maintenance can be reduced through the assessment process, inspection program, and preventive maintenance program. When corrective maintenance is required, a clearly defined approval process and pre-approved procedures may reduce the probability of unexpected failures. Results from the approval process should be documented as part of the work orders.

Presentation layer in the Human-System Interface (HSI)

The top layer in Figure 2.5 (Presentation) has not been discussed in this thesis. Development of new principles for information presentation is also needed in a multi-item and multi-failure-mode environment where several condition monitoring sources have partial condition monitoring information. Interesting techniques like Parallel Coordinates ([80], [81], [82]) might be useful for analysis of high-dimensional dynamical maintenance decision problems.

Validation and verification of prognostics

How can the proper operation of prognostic algorithms be validated, especially on new systems? More research is needed on e.g. the predictability measures.

Appendix A: Maintenance policies - basics

The maintenance policies mentioned in Figure 2.1 may be classified in many different ways. Generally, there are two main policies, unforeseen or planned (Rasmussen [83], Rysst, Rasmussen [84]). An overview is given in Figure A.1.

Figure A.1: Classification of maintenance policies (maintenance management categories: planned and unforeseen maintenance; maintenance policies: preventive and corrective maintenance; maintenance events: calendar time based, operational time based, and condition based maintenance, with periodical inspections, lubrications, adjustments, replacements, continuous measurements, technical condition monitoring, and periodical measurements and inspections)

144 APPENDIX A. MAINTENANCE POLICIES - BASICS 1. Unforeseen maintenance will always be corrective, since failures occur unexpected. This type of maintenance is often called repair and is carried out after a part/system has failed. The purpose of corrective maintenance is to bring the part/system back to a functioning state as soon as possible, either by replacing the failed part or by repairing/replacing the failed repairable system. Switching in a redundant system can also be corrective maintenance. 2. Planned maintenance can be classified into corrective maintenance or preventive maintenance. a. Planned corrective maintenance is used when we plan to let the parts/systems run to failure. The failures are considered non-critical with respect to economy, availability, safety or environmental damages. This policy does however assume a continuous follow-up of costs and availability to be able to make the necessary changes and adjustments. b. Planned preventive maintenance is the maintenance that occurs when the part/system is operating. This policy includes all actions performed in an attempt to retain a part/system in specified condition by providing systematic inspection, detection, and prevention of incipient failures. Planned preventive maintenance may further be classified into the following categories: i. Under the periodic intervention category the part/system is opened and inspected/overhauled at the site or at an engineering workshop. If the variance of the mean time to failure (MTTF) is low, one knows approximately when the part/system fails, and periodic intervention is the best option. The intervention may also be a replacement (perfect maintenance). The periodic interventions are carried out at either a specified age of the part/system or at a specified calendar times. In the operational time-based (age-based) maintenance the preventive maintenance work tasks are carried out at a specified age of the part/system. 
      The age may be measured as time in operation or by other time concepts, such as the number of activations of a system or the number of take-offs/landings for an aircraft. In calendar-time based (clock-based) maintenance the preventive maintenance work tasks are carried out at specified calendar times, e.g. in ordinary outages. A calendar-time based maintenance policy is generally easier to administer than an operational-time based maintenance policy, since the maintenance work tasks can be scheduled at predefined times.

      ii. Technical condition monitoring is performed to discover the development of failures effectively, achieving a pre-warning time such that maintenance personnel have time to intervene before a failure occurs (reaction time). It is a preventive process, which requires either periodic or continuous supervision against predefined standards to be able to judge whether the part/system can continue to work within acceptable limits. It is important to note that condition monitoring is monitoring with the purpose of justifying maintenance actions on the condition-monitored parts/systems. Condition-based maintenance is a maintenance policy where the maintenance action is decided based on one or more condition variables that are correlated with the ability of a part/system to perform a required function. The condition variables contain information from functional tests, inspections, and manual/automatic measurements.

      Condition-based maintenance requires a monitoring system that can provide measurements of selected variables, and a mathematical/stochastic model that can predict the behaviour of the deteriorating part/system. The condition information and the failure pattern are then used to judge whether and when maintenance actions are necessary. In condition-based maintenance it is important to receive feedback from the condition variable to be able to optimize the maintenance costs and maintain the desired availability of the part/system. Reporting and analysis of maintenance event data is also a part of condition-based maintenance.

      iii. The opportunity-based maintenance policy is applicable to systems that consist of multiple items, where the maintenance work tasks on other items or a system shutdown/intervention provide an opportunity for carrying out maintenance on items that were not the cause of the opportunity. The opportunity-based maintenance policy belongs to the planned preventive maintenance policy. (The opportunity-based maintenance policy is not shown in Figure A.1.)

In terms of desirable characteristics that an efficient condition monitoring method should have, a wish list is given in Fantoni [85]:

- Non-intrusive
- Non-destructive
- Reproducible
- Applicable to a wide range of temperatures, dose-rates, pressures etc.
- Sensitive to degradation, especially when close to failure
- Cost effective
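The difference between the calendar-time based and the operational-time based policies described in item i can be illustrated with a minimal Python sketch. The function names, the daily running-hours log, and the interval values are hypothetical and chosen only for illustration; they are not taken from the thesis or from any maintenance system.

```python
from datetime import date, timedelta
from typing import Optional, Sequence


def calendar_based_due(last_done: date, interval_days: int) -> date:
    """Calendar-time (clock-based) policy: the task is due a fixed
    number of days after the previous task, regardless of usage."""
    return last_done + timedelta(days=interval_days)


def age_based_due(last_done: date,
                  interval_hours: float,
                  daily_hours: Sequence[float]) -> Optional[date]:
    """Operational-time (age-based) policy: the task is due on the day
    the accumulated running hours first reach the interval.

    daily_hours[k] is the (hypothetical) number of hours run on day
    last_done + k. Returns None if the interval is not reached
    within the supplied log.
    """
    accumulated = 0.0
    for k, hours in enumerate(daily_hours):
        accumulated += hours
        if accumulated >= interval_hours:
            return last_done + timedelta(days=k)
    return None


# Illustrative values only.
last_overhaul = date(2024, 1, 1)
half_load_log = [12.0] * 100   # unit runs 12 hours per day
full_load_log = [24.0] * 100   # unit runs around the clock

due_calendar = calendar_based_due(last_overhaul, interval_days=30)
due_age_half = age_based_due(last_overhaul, 240.0, half_load_log)
due_age_full = age_based_due(last_overhaul, 240.0, full_load_log)
```

The calendar-based due date is known as soon as the previous task is completed, which is why that policy is easier to administer. The age-based due date depends on how the unit is actually run, so it can only be fixed once the operating profile is known or forecast.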


Appendix B Available transfer functions in TeCoMan

Figure B.1: The "Peak" transfer function. This figure is taken from the TeCoMan software.

Figure B.2: The "Sink" transfer function. This figure is taken from the TeCoMan software.

Figure B.3: The "Up Slope" transfer function. This figure is taken from the TeCoMan software.

Figure B.4: The "Linear" transfer function. This figure is taken from the TeCoMan software.

Figure B.5: The "Down slope" transfer function. This figure is taken from the TeCoMan software.

Figure B.6: The "4 point saddle" transfer function. This figure is taken from the TeCoMan software.
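The figures themselves are not reproduced in this text, so the exact shapes and breakpoints of the TeCoMan transfer functions are not available here. As a rough sketch only, a transfer function of this kind can be thought of as a piecewise-linear map from a raw measurement to a normalized index. The Python below assumes a 0 to 100 index scale, guesses the shapes from the function names alone, and uses invented breakpoints; none of it is taken from the TeCoMan software.

```python
from bisect import bisect_right
from typing import Callable, Sequence, Tuple


def piecewise_linear(
    breakpoints: Sequence[Tuple[float, float]]
) -> Callable[[float], float]:
    """Build f(x) that interpolates linearly between (x, index)
    breakpoints and is held constant outside the first and last one.
    The x values must be strictly increasing."""
    xs = [p[0] for p in breakpoints]
    ys = [p[1] for p in breakpoints]

    def f(x: float) -> float:
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, x)          # xs[i-1] <= x < xs[i]
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return f


# Hypothetical shapes and breakpoints, guessed from the names only.
# Index 100 is assumed to mean "as good as new", 0 "failed".
down_slope = piecewise_linear([(60.0, 100.0), (100.0, 0.0)])
up_slope = piecewise_linear([(60.0, 0.0), (100.0, 100.0)])
peak = piecewise_linear([(0.0, 0.0), (50.0, 100.0), (100.0, 0.0)])

index_at_80 = down_slope(80.0)   # halfway down the assumed slope
```

Under these assumptions a "down slope" function would suit a variable where higher readings mean worse condition (a temperature, for example), while a "peak" function would suit a variable with an optimum in the middle of its range.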

Appendix C Export compressor hierarchy in TeCoMan

Figure C.1: Export compressor top level aggregation method with weights. This figure is taken from the TeCoMan software. SAP Not. denotes notifications. Lagertemp. denotes bearing temperature. Tetninger denotes seal leakage. Vibrasjon denotes vibration monitoring. Ytelse denotes performance monitoring.

Figure C.2: SAP notification aggregation method. This figure is taken from the TeCoMan software.

Figure C.3: Bearing temperature aggregation method. This figure is taken from the TeCoMan software.

Figure C.4: Seals aggregation method. This figure is taken from the TeCoMan software.

Figure C.5: Vibration transfer function. This figure is taken from the TeCoMan software.
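The weights and the aggregation rule shown in Figure C.1 are not reproduced in this text. As an illustration only, the Python sketch below assumes that the top-level index is a weighted mean of the five child indexes named in the caption of Figure C.1. The weighted mean is one common choice and is an assumption here, and the index values and weights are invented.

```python
from typing import Dict, Tuple


def aggregate_tci(children: Dict[str, Tuple[float, float]]) -> float:
    """Weighted mean of child indexes.

    children maps a child name to (index, weight). This is an assumed
    aggregation rule, not necessarily the one used in TeCoMan.
    """
    total_weight = sum(weight for _, weight in children.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(index * weight for index, weight in children.values())
    return weighted_sum / total_weight


# Hypothetical child indexes (0 to 100) and weights for the export compressor.
export_compressor = {
    "SAP notifications":   (80.0, 1.0),
    "bearing temperature": (90.0, 2.0),
    "seal leakage":        (70.0, 2.0),
    "vibration":           (60.0, 3.0),
    "performance":         (100.0, 2.0),
}

top_level_tci = aggregate_tci(export_compressor)
```

With these invented numbers the low vibration index pulls the top-level value down more than the other children do, because it carries the largest assumed weight.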
