A1 Terms and Definitions

Amos Whitehead
6 years ago

This appendix defines and comments on the terms most commonly used in reliability engineering (Fig. A1.1). Table 5.4 extends this appendix to software quality (see also [A1.4 (610)]). A careful list of terms related to power systems is in [A1.3 (1999)]. Attention has been paid to adherence to relevant international standards and recent trends [A1.1 - A1.8], in particular to IEC and ISO standards [A1.3, A1.5, A1.6].

[Figure A1.1: Terms most commonly used in reliability engineering and possible classification]

Availability, Point Availability (PA(t)) [A1.3]
Probability that an item is in a state to perform a required function under given conditions at a given instant of time, assuming that the required external resources are provided. The term instantaneous availability is often used. According to the above definition, availability is a characteristic of an item, generally designated by PA(t). A qualitative definition, focused on ability rather than on probability, is also usual. The term item stands for an entity of arbitrary complexity (from component to system). Calculation of the point availability generally assumes continuous operation (item down only for repair), complete renewal, and ideal human factors and logistical support. Complete renewal refers to the repaired element in the reliability block diagram (as-good-as-new after repair). This assumption is valid for the whole item in the case of constant failure rates. The steady-state value of the point availability can in general be expressed by PA = MTTF / (MTTF + MTTR). PA is also the steady-state value of the average availability AA (often simply designated as availability A). Other kinds of availability (mission availability, work-mission availability, joint availability, etc.) can be defined, see e.g. Section.

Burn-in (nonrepairable items) [A1.3 & A1.6]
Type of screening test while an item is in operation. For electronic devices, stresses during burn-in are often a constant higher ambient temperature (e.g. 125°C for ICs) and a constant higher supply voltage. Burn-in can be considered as part of a screening procedure, performed on a 100% basis to provoke early failures and to stabilize the characteristics of an item. It can often be used as an accelerated reliability test to investigate an item's failure rate. Stresses are then higher than would be expected in field operation, but not so high as to stimulate failure mechanisms which would not occur in normal use.

Burn-in (repairable items) [A1.3 & A1.6]
Process of increasing the reliability of hardware while an item is in operation in a prescribed environment, with successive corrective maintenance at every failure during the early failure period. The term run-in is often used instead of burn-in. The stress conditions have to be chosen as near as possible to those expected in field operation. Flaws detected during burn-in can be deterministic (defects or systematic failures) during the pilot production (reliability growth), but should be attributable only to early failures (randomly distributed) during series production.

Capability [A1.3]
Ability of an item to meet a service demand of given quantitative characteristics under given conditions. Performance (technical performance) is often used instead of capability.
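The Availability entry gives the steady-state relation PA = MTTF / (MTTF + MTTR). The Python function below is a minimal sketch that evaluates this relation for given MTTF and MTTR values. The function name and the numbers are hypothetical. The result carries the same assumptions as the text: continuous operation, complete renewal, and ideal human factors and logistical support.

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Steady-state point availability PA = MTTF / (MTTF + MTTR).

    Assumes continuous operation, complete renewal after repair, and ideal
    human factors and logistical support, as in the Availability entry.
    """
    if mttf <= 0.0 or mttr < 0.0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf / (mttf + mttr)


# Hypothetical figures: MTTF = 10,000 h and MTTR = 5 h.
pa = steady_state_availability(10_000.0, 5.0)
print(f"PA = {pa:.6f}")  # prints PA = 0.999500
```

Because MTTR is much smaller than MTTF here, PA is close to 1 - MTTR/MTTF = 0.9995.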

Concurrent Engineering
Systematic approach to reduce the time to develop and market an item, essentially by integrating production activities into the design & development phase. Concurrent engineering is achieved through intensive teamwork between all engineers involved in the design, production, and marketing of an item. It supports TQM and also has a positive influence on the optimization of life-cycle cost.

Configuration Management
Procedure to specify, describe, audit, and release the configuration of an item, as well as to control it during modifications or changes. Configuration includes all of an item's functional and physical characteristics as given in the documentation (to specify, build, test, accept, operate, maintain, and logistically support the item) and as present in the hardware and/or software. In practical applications, it is useful to subdivide configuration management into configuration identification, auditing, control (design reviews), and accounting. Configuration management is of particular importance during the design and development phase.

Corrective Maintenance [A1.3 & A1.6]
Maintenance carried out after fault recognition and intended to put an item into a state in which it can perform a required function. Corrective maintenance is also called repair. It can include any or all of the following steps: detection/verification, localization, isolation, disassembly, exchange, reassembly, alignment, and functional test & set-up. To simplify calculation, it is generally assumed that the repaired element in the reliability block diagram is as-good-as-new after each repair (also including a possible environmental stress screening of the spare parts). This assumption applies to the whole item (equipment or system) if all components of the item (which have not been repaired/renewed) have constant failure rates.

Cost Effectiveness
Measure of the ability of an item to meet a service demand of stated quantitative characteristics, with the best possible usefulness to life-cycle cost ratio. The term system effectiveness is often used instead of cost effectiveness.

Defect [A1.5]
Nonfulfillment of a requirement related to an intended or specified use. From a technical point of view, a defect is similar to a nonconformity, however not necessarily from a legal point of view (in relation to product liability, nonconformity should be preferred). Defects do not need to influence the item's functionality. They are caused by flaws (errors, mistakes) during design, development, production, or installation. The term defect should be preferred to that of error, which is a cause. Unlike failures, which always appear in time (generally randomly distributed), defects are present at t = 0. However, some defects can only be detected when the item is operating. These are referred to as dynamic defects (e.g. in software). Systematic failures are similar to defects with regard to their causes. However, they are generally not present at t = 0.

Dependability [A1.3 & A1.5]
Collective term used to describe the availability performance and its influencing factors: reliability performance, maintainability performance, and maintenance support performance. Dependability is used in a qualitative sense only. Maintenance support refers to logistical support. When using dependability, it can be necessary to specify which characteristic in particular is meant.

Derating
Nonutilization of the full load capability of an item with the intent to reduce its failure rate. The stress factor S expresses the ratio of actual to rated load under normal operating conditions (generally at 25°C ambient temperature).

Design Review [A1.3]
Formal and independent examination of an existing or proposed design to detect and remedy deficiencies in the requirements and/or design which could affect quality and/or dependability. Design reviews are an important tool for quality assurance and TQM during the design and development of hardware and software (Tables A3.3, 5.3, 5.5).
An important objective of design reviews is to decide, on the basis of objective considerations and a feasibility check, whether to continue or stop the project considered (Tables A3.3 and 5.3 & Fig. 1.6).

Environmental Stress Screening (ESS)
Test or set of tests intended to remove defective items, or those likely to exhibit early failures. ESS is a screening procedure often performed at assembly (PCB) or equipment level on a 100% basis. It serves to find defects and systematic failures during the pilot production (reliability growth), or to provoke early failures in series production. It generally consists of temperature cycles and/or random vibrations. Stresses are in general higher than those expected in field operation. Experience shows that to be cost effective, ESS has to be tailored to the item and production processes considered. At component level, the term screening is often used instead of ESS.

Failure [A1.3 & A1.6]
Termination of the ability of an item to perform a required function. Failures should be considered (classified) with respect to mode, cause, effect, and mechanism. The cause of a failure can be intrinsic (early failure, failure with constant failure rate, and wearout failure) or extrinsic (systematic failure, i.e. failure resulting from errors or mistakes in design, production, or operation, which is deterministic and has to be considered as a defect). The effect (consequence) of a failure is generally different if considered on the directly affected item or on a higher level. A failure is an event appearing in time (randomly distributed), in contrast to a fault, which can be a state.

Failure Rate (λ(t)) [A1.3]
Limit for δt ↓ 0 of the probability that an item will fail in the time interval (t, t + δt], given that the item was new at t = 0 and did not fail in the interval (0, t], divided by δt. The failure rate is generally designated by λ(t). If τ is the item's failure-free operating time, then

$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t}\,\Pr\{t < \tau \le t + \delta t \mid \tau > t\} = \frac{f(t)}{1 - F(t)} = -\frac{dR(t)/dt}{R(t)}.$$

The existence of f(t) is tacitly assumed. For R(0) = 1, it follows that

$$R(t) = e^{-\int_0^t \lambda(x)\,dx}.$$

Thus, for λ(t) = λ, it holds that $R(t) = e^{-\lambda t}$.
Only in this case can the failure rate λ be estimated by $\hat{\lambda} = k/T$. Here T is the given, fixed cumulative operating time (cumulated over an arbitrary number of statistically identical, independent items) and k is the total number of failures during T. In general, the failure rate of a large population of statistically identical, independent items exhibits the typical form of a bathtub curve. In this curve, the phases of early failures, failures with constant (or nearly constant) failure rate λ, and wearout failures can be distinguished. The term random failures for the period with constant failure rate should be avoided, as early and wearout failures are also random (only systematic failures have a deterministic character). The terms hazard rate and instantaneous failure rate should also be dropped to avoid confusion. The failure rate λ(t) defined above (or empirically by Eq. (1.3)) refers to statistically identical and independent failure-free operating times (Fig. 1.1). It must not be confused with the intensity m(t) of a point process describing the occurrence of failures on the time axis. m(t), given by Eq. (A7.191) or Eq. (8.8), differs basically from λ(t). This holds even in the case of a homogeneous Poisson process with intensity λ, as defined by Eqs. (A7.38) - (A7.44). For such a process, m(t) = h(t) = λ holds for the intensity and λ(x) = λ holds for the interarrival times (x starts from 0 at each interarrival time). Misuses are known, see e.g. [6.1]. To alleviate such misuses, force of mortality has been suggested for λ(t) [6.1, A7.30].
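The Failure Rate entry gives the relation R(t) = exp(-∫λ(x)dx) and the estimate λ̂ = k/T. The Python code below is a minimal sketch of both. The trapezoidal integration, the helper names, the Weibull-type test rate, and all numbers are illustrative assumptions. As the text stresses, the estimate k/T is meaningful only for a constant failure rate.

```python
import math


def reliability_from_failure_rate(rate, t: float, steps: int = 10_000) -> float:
    """R(t) = exp(-integral of rate(x) from 0 to t), assuming R(0) = 1.

    The integral is approximated with the composite trapezoidal rule.
    """
    if t < 0.0:
        raise ValueError("t must be non-negative")
    h = t / steps
    integral = 0.5 * (rate(0.0) + rate(t))
    for i in range(1, steps):
        integral += rate(i * h)
    return math.exp(-integral * h)


def estimate_constant_failure_rate(cumulative_time: float, failures: int) -> float:
    """Point estimate lambda_hat = k / T; valid only for a constant failure rate."""
    if cumulative_time <= 0.0 or failures < 0:
        raise ValueError("T must be positive and k non-negative")
    return failures / cumulative_time


# Hypothetical linearly increasing failure rate (Weibull shape 2, scale 1,000 h):
# lambda(x) = 2x / eta**2, for which R(t) = exp(-(t / eta)**2) in closed form.
eta = 1_000.0
r_500 = reliability_from_failure_rate(lambda x: 2.0 * x / eta**2, 500.0)
print(r_500, math.exp(-0.25))

# Hypothetical test: 50 items, 2,000 h each (T = 100,000 h), k = 4 failures.
lam_hat = estimate_constant_failure_rate(50 * 2_000.0, 4)
print(f"lambda_hat = {lam_hat:.2e} per hour")  # prints lambda_hat = 4.00e-05 per hour
```

With a constant rate passed in, for example `lambda x: 1e-4`, the same function reproduces $R(t) = e^{-\lambda t}$.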

Fault [A1.3]
State of an item characterized by the inability to perform a required function, excluding inability during preventive maintenance or other planned actions, or due to lack of external resources. A fault can be a defect or a failure. Its possible cause is thus an error (for defects or systematic failures) or a failure mechanism (for failures); see also the terms defect, systematic failure, and failure.

Fault Modes and Effects Analysis (FMEA) [A1.3]
Qualitative method of reliability analysis which involves, for each element of an item, the investigation of all possible fault modes and of the corresponding effects on other elements as well as on the required function(s) of the item. See FMECA.

Fault Modes, Effects, and Criticality Analysis (FMECA) [A1.3]
Qualitative/quantitative method of reliability analysis which extends the fault modes and effects analysis (FMEA) by considering, for each fault mode, the probability of occurrence and the ranking of its severity. The goal of an FMEA/FMECA is to determine all potential hazards and to analyze the possibilities of reducing their effect or their probability of occurrence. All possible failure (and defect) modes and causes have to be considered bottom-up, from the lowest to the highest integration level of the item considered. FMECA was formerly used for failure modes, effects, and criticality analysis. Often one distinguishes between a design and a production (or process) FMEA or FMECA.

Fault Tree Analysis (FTA) [A1.3]
Analysis to determine which fault modes of the elements of an item and/or external events may result in a stated fault mode of the item, presented in the form of a fault tree. FTA is a top-down approach, which allows the inclusion of external causes more easily than an FMEA/FMECA. However, it does not necessarily go through all possible fault modes.
Combining FMEA/FMECA with FTA leads to a causes-to-effects chart, showing the logical relationship between identified causes and their single or multiple consequences. A graphical description of cause-to-effect relationships is the cause-to-effect diagram, also known as a fishbone or Ishikawa diagram. It combines FMEA/FMECA and FTA procedures and can in some cases complement them.
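The following Python sketch illustrates the top-down logic of a fault tree by evaluating the probability of a top event built from AND/OR gates. It assumes statistically independent basic events, each appearing only once in the tree. Repeated events would need a more careful treatment, which is not shown. The class names, event names, and probabilities are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BasicEvent:
    name: str
    probability: float  # probability that this basic event occurs


@dataclass
class Gate:
    kind: str  # "AND": all inputs must occur; "OR": at least one input must occur
    inputs: list = field(default_factory=list)  # BasicEvent or Gate objects


def event_probability(node: "BasicEvent | Gate") -> float:
    """Probability of the event represented by `node`.

    Assumes independent basic events, each appearing only once in the tree.
    """
    if isinstance(node, BasicEvent):
        return node.probability
    probs = [event_probability(child) for child in node.inputs]
    if node.kind == "AND":
        result = 1.0
        for p in probs:
            result *= p
        return result
    if node.kind == "OR":
        none_occurs = 1.0
        for p in probs:
            none_occurs *= 1.0 - p
        return 1.0 - none_occurs
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical tree: the item fails if the power supply fails OR both
# redundant pumps fail.
top = Gate("OR", [
    BasicEvent("power supply fails", 0.001),
    Gate("AND", [BasicEvent("pump A fails", 0.01), BasicEvent("pump B fails", 0.01)]),
])
print(f"Pr{{top event}} = {event_probability(top):.7f}")
```

The recursion follows the tree from the top event down to the basic events. This mirrors the top-down character of FTA described above.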

Item / Entity [A1.3]
Any part, component, device, functional unit, equipment, subsystem, or system that can be individually considered. An item is a functional or structural unit, which is considered as an entity for investigations. It may consist of hardware and/or software and also include human resources (to emphasize this fact, the term system has been defined separately). The term entity is often preferred to item in quality standards.

Life-Cycle Cost (LCC)
Sum of the costs for acquisition, operation, maintenance, and disposal or recycling of an item. Life-cycle cost optimization is undertaken within the framework of cost effectiveness or systems engineering and can be positively influenced by concurrent engineering. When considering life-cycle cost, international regulations will increasingly take into account the environmental effects of the production, use, and ecologically acceptable disposal or recycling of an item (sustainable development).

Lifetime
Time span between initial operation and failure of a nonrepairable item.

Logistical Support
All activities undertaken to provide effective and economical use of an item during its operating phase. Logistical support is no longer reserved to the defense field. An emerging aspect related to maintenance and logistical support is obsolescence management. This concerns how to assure operation over, for instance, 20 years when technology is rapidly evolving and components needed for maintenance are no longer manufactured.

Maintainability (M(t)) [A1.3]
Probability that a given active maintenance action, for an item under given conditions of use, can be carried out within a stated time interval, when the maintenance is performed under stated conditions and using stated procedures and resources. According to the above definition, maintainability is a characteristic of an item. It refers to preventive maintenance (serviceability) as well as to corrective maintenance or repair (repairability).

A qualitative definition, focused on ability rather than on probability, is also usual. In specifying or evaluating maintainability, it is important to consider the logistical support available for maintenance, i.e. procedures, personnel (number, skill level), spare parts, test facilities, etc.

Mission Profile
Specific task which must be fulfilled by an item during a stated time under given conditions. The mission profile defines the required function and the environmental conditions as a function of time. A representative mission profile and the corresponding reliability targets have to be defined in the item specifications. For simplicity, a time-independent (averaged) mission profile is often assumed. Systems with a variable mission profile are termed phased-mission systems.

MTBF
MTBF = 1/λ. MTBF should be reserved for items with constant failure rate λ. In this case, MTBF = 1/λ is the expected value (mean) of the item's exponentially distributed failure-free operating time, as expressed by Eqs. (1.9) and (A6.84). The definition given here agrees with the statistical methods generally used to estimate or demonstrate an MTBF. In particular, $\widehat{MTBF} = T/k$. Here T is the given, fixed cumulative operating time (cumulated over an arbitrary number of statistically identical and independent items) and k is the total number of failures (failed items) during T. Using MTBF for mean operating time between failures (or, as formerly, for mean time between failures) has caused misuses (see the remarks on pp. 7, 318, 327, 416) and should be dropped. The distinction often made between repairable and nonrepairable items should also be avoided (see MTTF).

MTTF [A1.3]
Mean time to failure. MTTF is the expected value (mean) of an item's failure-free operating time. It is obtained from the reliability function R(t) as $MTTF = \int_0^{\infty} R(t)\,dt$. If the lifetime is limited to $T_L$ ($R(t) = 0$ for $t \ge T_L$), $T_L$ becomes the upper limit of the integral.
MTTF applies to both nonrepairable and repairable items if one assumes that after a repair the item is as-good-as-new. If this is not the case, a new MTTF ($MTTF_{Si}$, starting from state $Z_i$) can be considered (Table 6.2). An unbiased (empirical) estimate for MTTF is $\widehat{MTTF} = (t_1 + \dots + t_n)/n$, where $t_1, \dots, t_n$ are observed failure-free operating times of statistically identical, independent items.

MTTPM
Mean time to preventive maintenance. See MTTR for comments.
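The MTTF relations above can be checked numerically. The Python sketch below integrates a reliability function with the trapezoidal rule, truncating at a finite horizon (standing for T_L, or for a time beyond which R(t) is negligible). It also forms the empirical mean of observed failure-free operating times. The function name, the exponential test case, and the data are hypothetical.

```python
import math
import statistics


def mttf_from_reliability(reliability, upper: float, steps: int = 100_000) -> float:
    """MTTF = integral of R(t) from 0 to `upper` (composite trapezoidal rule).

    `upper` stands for T_L, or for a horizon beyond which R(t) is negligible.
    """
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


# Constant failure rate (hypothetical): lambda = 1e-3 per hour, so MTTF = 1/lambda = 1,000 h.
lam = 1e-3
mttf_num = mttf_from_reliability(lambda t: math.exp(-lam * t), upper=50_000.0)
print(f"numerical MTTF = {mttf_num:.2f} h, 1/lambda = {1 / lam:.2f} h")

# Empirical estimate (t_1 + ... + t_n) / n from hypothetical failure-free operating times.
observed_hours = [820.0, 1_150.0, 640.0, 1_390.0, 1_000.0]
mttf_hat = statistics.fmean(observed_hours)
print(f"estimated MTTF = {mttf_hat:.1f} h")  # prints estimated MTTF = 1000.0 h
```

Only for a constant failure rate does the integral coincide with MTBF = 1/λ. This is why the text reserves MTBF for that case.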

MTBUR
Mean time between unscheduled removals. MTBUR is used, for instance, in avionics. There, an estimate of MTBUR is obtained by dividing the total number of flight hours logged by all units of a given airplane over a certain period of time by the number of unscheduled removals during the same period.

MTTR [A1.3]
Mean time to repair. MTTR is the expected value (mean) of an item's repair time. It is obtained from the distribution function G(t) of the repair time as $MTTR = \int_0^{\infty} (1 - G(t))\,dt$. In specifying or evaluating MTTR, it is necessary to consider the logistical support available for repair (procedures, personnel, spare parts, test facilities). Repair time is often lognormally distributed. However, for reliability or availability calculation of repairable equipment and systems, a constant repair rate μ can often be used to get approximate results, at least in cases where MTTR << MTTF holds. A constant repair rate means exponentially distributed repair times with μ = 1/MTTR. An unbiased (empirical) estimate of MTTR is $\widehat{MTTR} = (t_1 + \dots + t_n)/n$, where $t_1, \dots, t_n$ are observed repair times of statistically identical, independent items.

Nonconformity [A1.5]
Nonfulfillment of a requirement. From a technical point of view, the term nonconformity is close to that of defect, however not necessarily from a legal point of view. In relation to product liability, nonconformity should be preferred.

Preventive Maintenance [A1.3 & A1.6]
Maintenance carried out at predetermined intervals or according to prescribed criteria and intended to reduce the probability of failure or the degradation of the functionality of an item. The aim of preventive maintenance must also be to detect and remove hidden failures, i.e. failures in redundant elements.
To simplify calculation, it is generally assumed (as for corrective maintenance) that the element in the reliability block diagram for which preventive maintenance has been performed is as-good-as-new after each preventive maintenance. This assumption applies to the whole item (equipment or system) if all components of the item (which have not been renewed) have constant failure rates.
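The MTTR entry notes that repair times are often lognormally distributed. As a sketch of its relations, the Python code below computes MTTR = ∫(1 - G(t))dt for a lognormal repair time and compares the result with the known lognormal mean exp(m + σ²/2). It then derives the constant repair rate μ = 1/MTTR used as an approximation. The parameters m and σ, the helper names, and the integration horizon are hypothetical.

```python
import math


def lognormal_cdf(t: float, m: float, sigma: float) -> float:
    """Distribution function G(t) of a lognormally distributed repair time."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - m) / (sigma * math.sqrt(2.0))))


def mttr_from_distribution(cdf, upper: float, steps: int = 100_000) -> float:
    """MTTR = integral of (1 - G(t)) from 0 to `upper` (composite trapezoidal rule).

    `upper` is a horizon beyond which 1 - G(t) is negligible.
    """
    h = upper / steps
    total = 0.5 * ((1.0 - cdf(0.0)) + (1.0 - cdf(upper)))
    for i in range(1, steps):
        total += 1.0 - cdf(i * h)
    return total * h


m, sigma = math.log(2.0), 0.5  # hypothetical: median repair time 2 h
mttr = mttr_from_distribution(lambda t: lognormal_cdf(t, m, sigma), upper=100.0)
closed_form = math.exp(m + sigma**2 / 2)  # mean of a lognormal distribution
mu = 1.0 / mttr  # constant repair rate, used only as an approximation
print(f"MTTR = {mttr:.4f} h (closed form {closed_form:.4f} h), mu = {mu:.4f} per hour")
```

Replacing the lognormal G(t) by an exponential one with the same mean keeps MTTR unchanged but alters the shape of the repair-time distribution. This is why the text presents the constant repair rate as an approximation.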

Product Assurance
All planned and systematic activities necessary to reach specified targets for the reliability, maintainability, availability, and safety of an item, as well as to provide adequate confidence that the item will meet all given requirements. The concept of product assurance is often used in aerospace programs. It includes quality assurance as well as reliability, maintainability, availability, safety, and logistical support engineering.

Product Liability
Generic term used to describe the onus on a producer or others to make restitution for loss related to personal injury, property damage, or other harm caused by a product. The manufacturer (producer) has to specify a safe operational mode for the product (item). If strict liability applies, the manufacturer has to demonstrate (in the case of a claim) that the product was free from defects when it left the production plant. This holds in the USA and partially also in Europe [1.8]. However, in Europe the causality between damage and defect still has to be demonstrated by the user. The limitation period is also short: often 3 years after the identification of the damage, defect, and manufacturer, or 10 years after the appearance of the product on the market. One can expect that liability will increasingly consider faults (defects & failures) and cover software as well. Product liability forces producers to place greater emphasis on quality assurance/management.

Quality [A1.5]
Degree to which a set of inherent characteristics fulfills requirements. This definition, now also given in the ISO 9000:2000 standard [A1.5, A2.9], closely follows the traditional definition of quality (fitness for use) and applies to products and services as well.

Quality Assurance [A1.5]
All the planned and systematic activities needed to provide adequate confidence that quality requirements will be fulfilled.
Quality assurance is a part of quality management, as per ISO 9000. It refers to hardware and software as well. It includes configuration management, quality tests, quality control during production, quality data reporting systems, and software quality (Fig. 1.3). For complex equipment and systems, quality assurance activities are coordinated by a quality assurance program (Appendix A3). An important target of quality assurance is to achieve the quality requirements with a minimum of cost and time. Concurrent engineering activities also strive to shorten the time to develop and market a product.

Quality Control During Production
Control of the production processes and procedures to reach a stated quality of manufacturing.

Quality Data Reporting System
System to collect, analyze, and correct all defects and failures occurring during production and testing of an item, as well as to evaluate and feed back the corresponding quality and reliability data. A quality data reporting system is generally computer aided. Defects and failures must be traced back to their cause in order to determine the best corrective action necessary to avoid repetition of the same problem. The quality data reporting system should also remain active during the operating phase. A quality data reporting system is important to monitor reliability growth during the production of hardware and can be used for software as well.

Quality Management [A1.5]
Coordinated activities to direct and control an organization with regard to quality. Organization is defined as a group of people and facilities (a company, for instance) with an arrangement of responsibilities, authorities, and relationships [A1.5].

Quality Test
Test to verify whether an item conforms to specified requirements. Quality tests include incoming inspections, qualification tests, production tests, and acceptance tests. They also cover reliability, maintainability, and safety aspects. To be cost effective, quality tests must be coordinated and integrated in a test (and screening) strategy. The terms test and inspection are often used for quality test.

Redundancy [A1.3]
In an item, existence of more than one means for performing a required function. For hardware, a distinction is made between active (hot, parallel), warm (lightly loaded), and standby (cold) redundancy. Redundancy does not necessarily imply a duplication of hardware. It can, for instance, be implemented at the software level or as time redundancy.
To avoid common mode failures, redundant elements should be realized independently of each other. If the redundant elements fulfill only a part of the required function, a pseudo redundancy is present.

Reliability (R, R(t)) [A1.3]
Probability that an item can perform a required function under given conditions for a stated time interval. According to the above definition, reliability is a characteristic of an item, generally designated by R. A qualitative definition, focused on ability rather than on probability, is also usual. Reliability specifies the probability that no operational interruption will occur during a stated mission, say of duration T. This does not mean that redundant parts may not fail; such parts can fail and be repaired. Thus, the concept of reliability applies to nonrepairable as well as to repairable items. If T is considered as a variable t, the reliability function is given by R(t). If τ is the failure-free operating time, with distribution function F(t), then R(t) = Pr{τ > t} = 1 - F(t). The concept of reliability can also be used for processes or services, although modeling human aspects can lead to some difficulties (see e.g. Section for further considerations).

Reliability Block Diagram [A1.3]
Block diagram showing how failures of elements, represented by the blocks, result in the failure of an item. The reliability block diagram is an event diagram. It answers the question: which elements of an item are necessary to fulfill the required function, and which ones can fail without affecting it? The elements which must operate are connected in series (the ordering of these elements is not relevant for reliability calculation). The elements which can fail (redundant elements) are connected in parallel. Elements which are not relevant (used) for the required function are removed from the reliability block diagram and put into a reference list. This is done after verifying (FMEA) that their failure does not affect elements involved in the required function. In a reliability block diagram, redundant elements still appear in parallel, irrespective of the failure mode. However, only one failure mode (e.g. short or open) and two states (good or failed) can be considered for each element.

Reliability Growth [A1.3]
A condition characterized by a progressive improvement of the reliability of an item with time, through successful correction of design or production weaknesses. Flaws (errors, mistakes) detected during a reliability growth program are in general deterministic (defects or systematic failures) and present in every item of a given lot. Reliability growth is thus often performed during the pilot production, seldom for series-produced items. As with environmental stress screening (ESS), stresses during reliability growth often exceed those expected in field operation. Models for reliability growth can also often be used to investigate the occurrence of defects in software. Even if software defects often appear in time (dynamic defects), the term software reliability should be avoided (software quality should be preferred).
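Consider a reliability block diagram whose elements are independent and have only the two states good/failed. For such a diagram, the series and parallel structures translate into simple products. The Python sketch below computes the reliability of one element in series with a 1-out-of-2 redundancy. The function names and values are hypothetical, the reliabilities are taken at a fixed mission time, and the redundancy is assumed active and without repair.

```python
from math import prod


def series(*reliabilities: float) -> float:
    """Series structure: all elements must operate, R_S = R_1 * R_2 * ... * R_n."""
    return prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """Parallel (active redundant) structure: at least one element must operate,
    R_S = 1 - (1 - R_1)(1 - R_2)...(1 - R_n)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical item: element E1 in series with the redundant pair E2 || E3.
r1, r2, r3 = 0.99, 0.9, 0.9
r_item = series(r1, parallel(r2, r3))
print(f"R_S = {r_item:.4f}")  # prints R_S = 0.9801
```

The nesting mirrors the diagram: each parallel group is first reduced to one equivalent block, and then the series product is taken. Because a product does not depend on its order, the ordering of series elements is irrelevant, as stated above.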

Required Function [A1.3]
Function or combination of functions of an item which is considered necessary to provide a given service. The definition of the required function is the starting point for any reliability analysis, as it defines failures. However, difficulties can appear with complex items. For practical purposes, parameters should be specified with tolerances.

Safety
Ability of an item to cause neither injury to persons, nor significant material damage or other unacceptable consequences. Safety expresses freedom from unacceptable risk of harm. In practical applications, it is useful to subdivide safety into accident prevention (the item is safe while it is operating correctly) and technical safety (the item has to remain safe even if a failure occurs). Technical safety can be defined as the probability that an item will not cause injury to persons, significant material damage, or other unacceptable consequences above a given (fixed) level for a stated time interval, when operating under given conditions. Methods and procedures used to investigate technical safety are similar to those used for reliability analyses, however with emphasis on fault/failure effects.

System
Combination of components, assemblies, and subsystems, as well as skills and techniques, capable of performing and/or supporting autonomously an operational role. A system generally includes hardware, software, services, and personnel (for operation and support) to the degree that it can be considered self-sufficient in its intended operational environment. For calculation, ideal conditions for human factors and logistical support are often assumed, leading to a technical system (for simplicity, the term system is also used in this book instead of technical system).

Systematic Failure [A1.3]
Failure related in a deterministic way to a certain cause, which can only be eliminated by a modification of the design or of the manufacturing process, operational procedures, documentation, or other relevant factors. Systematic failures are also known as dynamic defects, for instance in software quality, and have a deterministic character. However, because of the item complexity, they can appear as if they were randomly distributed in time.

Systems Engineering
Application of the mathematical and physical sciences to develop systems that utilize resources economically for the benefit of society. TQM can help to optimize systems engineering.

Total Quality Management (TQM)
Management approach of an organization centered on quality, based on the participation of all its members, and aiming at long-term success through customer satisfaction and benefits to all members of the organization and to society. Within TQM, everyone involved in a product is jointly responsible for the quality of that product. This includes those involved directly (during development, production, installation, or servicing) and those involved indirectly (through management or staff activity).

Useful Life [A1.3]
Under given conditions, the time interval beginning at a given instant of time and ending when the failure intensity becomes unacceptable or when the item is considered unrepairable as a result of a fault. Typical values for useful life are 3 to 6 years for commercial applications, 5 to 15 years for military installations, and 10 to 30 years for distribution or power systems.

Value Analysis
Optimization of the configuration of an item as well as of the production processes and procedures to provide the required item characteristics at the lowest possible cost without loss of capability, reliability, maintainability, or safety.

Value Engineering
Application of value analysis methods during the design phase to optimize the life-cycle cost of an item.

A2 Quality and Reliability Standards

Besides quantitative reliability requirements, such as MTBF, MTTR, and availability, customers often require a quality assurance / management system and, for complex items, also the realization of a quality and reliability assurance program. Such general requirements are covered by national and international standards, the most important of which are briefly discussed in this appendix. The term management is used in this book explicitly where the organization (company) is involved as a whole, as per ISO 9000:2000 and TQM. A basic procedure for setting up and realizing quality and reliability requirements for complex equipment and systems, with the corresponding quality and reliability assurance program, is discussed in Appendix A3.

A2.1 Introduction

Customer requirements for quality and reliability can be quantitative or qualitative. As with performance parameters, quantitative reliability requirements are given in system specifications or contracts. They fix targets for reliability, maintainability, availability, and safety (as necessary), along with associated specifications for required function, operating conditions, logistical support, and criteria for acceptance tests. Qualitative requirements are found in national or international standards and generally deal with a quality management system. Depending upon the field of application (aerospace, defense, nuclear, or industrial), these requirements may be more or less stringent. Objectives of such standards are in particular:
1. Harmonization of quality management systems and of terms & definitions.
2. Enhancement of customer satisfaction.
3. Standardization of configuration, operating conditions, logistical support, test procedures, and selection / qualification criteria for components, materials, and production processes.
Important standards for quality management systems are given in Table A2.1; see [A2.1 - A2.13] for a comprehensive list.
Some of the standards in Table A2.1 are briefly discussed in the following sections.

A2.2 General Requirements in the Industrial Field

In the industrial field, the ISO 9000:2000 family of standards [A2.9] supersedes the ISO 9000:1994 family and opens a new era in quality management requirements. The previous standards are superseded by ISO 9001:2000 and ISO 9004:2000, and ISO 8402, on definitions, is superseded by ISO 9000:2000. Many definitions have been revised. The structure and content of ISO 9001:2000 and ISO 9004:2000 are new and adhere better to industrial needs and to the concept depicted in Fig. Eight basic quality management principles have been identified and considered in the ISO 9000:2000 family: Customer Focus, Leadership, Involvement of People, Process Approach, System Approach to Management, Continuous Improvement, Factual Approach to Decision Making, and Mutually Beneficial Supplier Relationships. ISO 9000:2000 describes the fundamentals of quality management systems and specifies the terminology involved. ISO 9001:2000 specifies requirements for a quality management system where an organization (company) needs to demonstrate its ability to provide products that satisfy customer and applicable regulatory requirements. It focuses on four main chapters: Management Responsibility, Resource Management, Product and / or Service Realization, and Measurement. A quality management system must ensure that everyone involved with a product (whether in its development, production, installation, or servicing, or in a management or staff function) shares responsibility for the quality of that product, in accordance with TQM. At the same time, the system must be cost effective and contribute to a reduction in the time to market. Thus, bureaucracy must be avoided, and such a system must cover all aspects related to quality, reliability, maintainability, availability, and safety, including management, organization, planning, and engineering activities.
Customers today expect that only items meeting the agreed requirements will be delivered. ISO 9004:2000 provides guidelines that consider the efficiency and effectiveness of the quality management system. The ISO 9000:2000 family deals with a broad class of products and services (technical and non-technical). Its content therefore lacks detail compared with application specific standards used e.g. in the railway, aerospace, defense, and nuclear industries (Appendix A2.3). It has been adopted as a national standard in many countries, and international recognition of certification has been partly achieved. Dependability aspects, focusing on reliability, maintainability, and logistical support of systems, are considered in IEC standards. IEC 60300 covers global requirements, and IEC 60605, 60706, 60812, 60863, 61025, 61078, 61124, 61163, 61164, 61165, and 61508 cover specific procedures; see [A2.6] for a comprehensive list. IEC 60300 deals with dependability programs (management, task descriptions, application guides). Reliability tests for a constant failure rate λ (or MTBF = 1/λ) are considered in IEC 61124. Maintainability aspects are covered in IEC 60706 and safety aspects in IEC 61508.
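The following sketch makes the role of the cumulative test time T and the allowed number of failures c concrete. It computes the producer's risk α and the consumer's risk β of a simple fixed-duration (time-terminated) compliance test for a constant failure rate. It assumes that the number of failures during T is Poisson distributed with mean T/MTBF and that the item is accepted if at most c failures occur. MTBF0 (specified) and MTBF1 (minimum acceptable) are given inputs. The function names are hypothetical, and the sketch does not reproduce the tabulated test plans of IEC 61124.

```python
import math

def prob_at_most(c: int, mean: float) -> float:
    """Poisson probability of observing at most c events when `mean` are expected."""
    term = math.exp(-mean)      # probability of exactly 0 events
    total = term
    for k in range(1, c + 1):
        term *= mean / k        # P(k) = P(k - 1) * mean / k
        total += term
    return total

def fixed_duration_risks(T: float, c: int, mtbf0: float, mtbf1: float) -> tuple[float, float]:
    """Return (alpha, beta) for the rule: accept if at most c failures occur in time T.

    Assumes a constant failure rate, so the failure count in T is Poisson(T / MTBF).
    """
    alpha = 1.0 - prob_at_most(c, T / mtbf0)  # an item with MTBF = MTBF0 is rejected
    beta = prob_at_most(c, T / mtbf1)         # an item with MTBF = MTBF1 is accepted
    return alpha, beta
```

As an example, take T = 3,000 h, c = 2, MTBF0 = 2,000 h, and MTBF1 = 1,000 h. This plan gives α ≈ 0.19 and β ≈ 0.42. Both risks are still large, and reducing them requires a longer cumulative test time T together with a correspondingly larger c.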

Table A2.1 Standards for quality and reliability assurance / management of equipment and systems

Industrial
2000   Int.     ISO 9000:2000     Quality management systems - Fundamentals and vocabulary
2000   Int.     ISO 9001:2000     Quality management systems - Requirements
2000   Int.     ISO 9004:2000     Quality management systems - Guidelines for performance improvement
       Int.     IEC 60300         Dependability management (-1: Program management, -2: Program element tasks, -3: Application guides)
       Int.     IEC 60605         Equipment reliability testing (-2: Test cycles, -3: Test conditions, -4: Point and interval estimates, -6: Test for constant failure rate)
       Int.     IEC 60706         Guide on maintainability of equipment (-1: Maint. program, -2: Analysis, -3: Data evaluation, -4: Support planning, -5: Diagnostic, -6: Statistical methods)
       Int.     IEC 61124         Reliability testing - Compliance tests for constant failure rate and constant failure intensity (supersedes IEC 60605-7)
       Int.     IEC (see also)    60068, 60319, 60410, 60447, 60721, 60749, 60812, 60863, 61000, 61014, 61025, 61070, 61078, 61123, 61160, 61163, 61164, 61165, 61508, 61649, 61650, 61703, 61709, 61710, 61882
       Int.     IEEE Std 1332     IEEE Standard Reliability Program for the Development and Production of Electronic Systems and Equipment (see also 1413)
1999   EU       EN                Railway Applications - RAMS Specification & Demonstration
1985   EU       85/374            Product Liability

Software Quality
       Int.     IEEE/ANSI         IEEE Software Eng. Standards Vol. 1-4, 1999 (in particular 610, 730, 1028, 1045, 1062, 1465 (ISO/IEC 12119))
                IEC, ISO/IEC      IEC (2000) and ISO/IEC (1998), 12207 (1995)

Defense
1963   USA      MIL-Q-9858        Quality Program Requirements (ed. A)
1980   USA      MIL-STD-785       Rel. Program for Systems and Eq. Devel. and Prod. (ed. B)
1986   USA      MIL-STD-781       Rel. Testing for Eng. Devel., Qualif. and Prod. (ed. D)
1983   USA      MIL-STD-470       Maintainability Program for Systems and Equip. (ed. A)
1984   NATO     AQAP-1            NATO Req. for an Industrial Quality Control System (ed. 3)

Aerospace
1974   USA      NHB-5300.4 (NASA) Safety, Reliability, Maintainability, and Quality Provisions for the Space Shuttle Program (1D-1)
1996   Europe   ECSS (ESA)        European Cooperation for Space Standardization
                ECSS-E            Engineering (-00, -10)
                ECSS-M            Project Management (-00, -10, -20, -30, -40, -50, -60, -70)
                ECSS-Q            Product Assurance (-00, -20, -30, -40, -60, -70, -80)
2003   Europe   prEN              Quality Management System

For electronic equipment & systems, IEEE Std 1332 [A2.7] has been issued as a guide to a reliability program for the development and production phases. This document gives the basic requirements in short form. It emphasizes active cooperation between supplier (manufacturer) and customer and focuses on three main aspects: Determination of the Customer's Requirements, Determination of a Process that Satisfies the Customer's Requirements, and Assurance that the Customer's Requirements are Met. Examples of comprehensive requirements for industrial applications are given e.g. in [A2.2, A2.3]. Software aspects are considered in the IEEE Software Engineering Standards [A2.8]. Requirements for product liability are given in national and international directives; see for instance [1.8].

A2.3 Requirements in the Aerospace, Railway, Defense, and Nuclear Fields

Requirements in the space and railway fields generally combine the aspects of quality, reliability, maintainability, safety, and software quality in a Product Assurance or RAMS document, well conceived in its structure and content [A2.3 - A2.5, A2.12]. In the railway field, the EN railway RAMS standard [A2.3] requires a RAMS program with particular emphasis on safety aspects. The situation is similar in the avionics field, where the EN quality management standard [A2.4] has been issued, reinforcing the requirements of the ISO 9000 family. It can be expected that space and avionics will unify their standards in an Aerospace Series. MIL-Standards have played an important role in the last 30 years, in particular MIL-Q-9858 and MIL-STD-470, -471, -781, -785, and -882 [A2.10]. MIL-Q-9858 (first ed. 1959) was the basis for many quality assurance standards. However, as it does not cover specific aspects of reliability, maintainability, and safety, MIL-STD-785, -470, and -882 were issued. MIL-STD-785 requires the realization of a reliability program; tasks are carefully described, and the program has to be tailored to satisfy user needs.
MTBF = 1/λ acceptance procedures are given in MIL-STD-781. MIL-STD-470 requires the realization of a maintainability program, with emphasis on design rules, design reviews, and FMEA / FMECA. Maintainability demonstration is covered by MIL-STD-471. MIL-STD-882 requires the realization of a safety program, in particular the analysis of all potential hazards. AQAP requirements were issued for NATO countries. MIL Standards have since lost importance. However, they can still be useful in developing procedures for industrial applications. The nuclear field has its own specific, well established standards. These emphasize safety aspects, design reviews, configuration accounting, qualification of components, materials, and production processes, quality control during production, and tests.
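MIL-STD-781 is known for sequential test plans. The sketch below illustrates only the underlying idea. It implements Wald's sequential probability ratio test for a constant failure rate. After each observation it decides whether to accept (MTBF = MTBF0), reject (MTBF = MTBF1 < MTBF0), or continue testing. It assumes exponentially distributed times to failure and omits the truncation rules and tabulated parameters of actual MIL-STD-781 plans. The function name `sprt_decision` is hypothetical.

```python
import math

def sprt_decision(failures: int, time_h: float, mtbf0: float, mtbf1: float,
                  alpha: float, beta: float) -> str:
    """Wald SPRT for a constant failure rate (no truncation).

    H0: MTBF = mtbf0 (specified), H1: MTBF = mtbf1 (minimum acceptable).
    Returns "accept", "reject", or "continue" after `failures` failures
    observed in cumulative operating time `time_h`.
    """
    if not mtbf1 < mtbf0:
        raise ValueError("MTBF1 must be smaller than MTBF0")
    lam0, lam1 = 1.0 / mtbf0, 1.0 / mtbf1
    # Log-likelihood ratio of H1 versus H0 for k failures in time t
    # (Poisson failure count, exponential times to failure)
    log_ratio = failures * math.log(lam1 / lam0) - (lam1 - lam0) * time_h
    reject_bound = math.log((1.0 - beta) / alpha)  # Wald's upper boundary
    accept_bound = math.log(beta / (1.0 - alpha))  # Wald's lower boundary
    if log_ratio >= reject_bound:
        return "reject"
    if log_ratio <= accept_bound:
        return "accept"
    return "continue"
```

Because the log-likelihood ratio is linear in the number of failures and in the time, the accept and reject boundaries are two parallel straight lines in the (time, failures) plane. This is the form in which sequential plans are usually drawn.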

A3 Definition and Realization of Quality and Reliability Requirements

In defining quality and reliability requirements, it is important to consider with care market needs, life-cycle cost aspects, and time to market, as well as development and production risks (for instance when using new technologies). For complex equipment and systems with high quality and reliability requirements, the realization of such requirements is best achieved with a quality and reliability assurance program. This program should be integrated in the project activities and performed without bureaucracy. Such a program (plan, if the time schedule is considered) defines the project specific activities for quality and reliability assurance and assigns responsibilities for their realization in agreement with TQM. This appendix first discusses important aspects in defining quality and reliability requirements. It then discusses the content of a quality and reliability assurance program for complex equipment and systems with high quality and reliability requirements, for the case in which tailoring is not mandatory. For less stringent requirements, tailoring is necessary to meet real needs and to be cost and time effective. Software specific quality assurance aspects are considered in Section 5.3. Possible checklists for design reviews and requirements for a quality data reporting system are given in Appendices A4 and A5.

A3.1 Definition of Quality and Reliability Requirements

In defining quantitative, i.e. project specific, quality and reliability requirements, attention has to be paid to whether they can actually be realized and whether they can be demonstrated at a final or acceptance test. These requirements are derived from customer or market needs, taking into account limitations given by technical, cost, and ecological aspects. This section deals with some important considerations when setting MTBF, MTTR, and steady-state availability (PA = AA) requirements.
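When MTBF, MTTR, and PA targets are set together, the relation PA = MTBF/(MTBF + MTTR) can be solved for the largest MTTR compatible with a given MTBF and availability target. The relation is used in this appendix under the point-availability assumptions of Appendix A1: continuous operation, complete renewal, and ideal human factors and logistical support. The minimal sketch below uses hypothetical function names.

```python
def steady_state_availability(mtbf_h: float, mttr_h: float) -> float:
    """PA = MTBF / (MTBF + MTTR), the steady-state point (= average) availability."""
    return mtbf_h / (mtbf_h + mttr_h)

def max_mttr_for_target(mtbf_h: float, pa_target: float) -> float:
    """Largest MTTR that still meets a PA target for a given MTBF.

    Obtained by solving PA = MTBF / (MTBF + MTTR) for MTTR.
    """
    if not 0.0 < pa_target < 1.0:
        raise ValueError("pa_target must lie strictly between 0 and 1")
    return mtbf_h * (1.0 - pa_target) / pa_target
```

For instance, MTBF = 5,000 h and PA = 0.999 allow an MTTR of at most about 5 h.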
Tentative targets for MTBF, MTTR and PA are set by considering operational requirements relating to reliability, maintainability, and availability, allowed logistical support,

required function and expected environmental conditions, experience with similar equipment or systems, possibility for redundancy at a higher integration level, requirements for life-cycle cost, dimensions, weight, power consumption, etc., and ecological consequences (sustainability).
Typical figures for failure rates λ = 1/MTBF of electronic assemblies are between 200 and 2,000 · 10^-9 h^-1 at an ambient temperature θA of 40°C and with a duty cycle d of 0.3; see Table A3.1 for some examples. The duty cycle (0 < d ≤ 1) gives the mean of the ratio between operational time and calendar time for the item considered. Assuming a constant failure rate λ and no reliability degradation caused by power on/off, the following equivalent failure rate can be used for practical purposes:

$$\lambda_d = d \cdot \lambda \qquad \text{(A3.1)}$$

It can often be useful to work with the mean expected number of failures per year and 100 items:

$$m_{\%/y} = \lambda_d \cdot 8{,}760\,\text{h} \cdot 100\% \qquad \text{(A3.2)}$$

m%/y < 1% is a good target for equipment and can influence acquisition cost. Tentative targets are refined successively by performing rough analyses and comparative studies (definition of goals down to assembly level can be necessary at this time). For acceptance testing (demonstration) of an MTBF, the following data are important (see Chapter 7):
1. MTBF0 = specified MTBF and/or MTBF1 = minimum acceptable MTBF.
2. Required function (mission profile).
3. Environmental conditions (thermal, mechanical, climatic).
4. Allowed producer's and/or consumer's risks (α and/or β).

Table A3.1 Indicative values of failure rates λ [10^-9 h^-1] and mean expected number m%/y of failures per year and 100 items for a duty cycle d = 30% and d = 100% (θA = 40°C), for: telephone exchanger, telephone receiver (multi-function), photocopier incl. mechanical parts, personal computer, radar equipment (ground mobile), control card for automatic process control, mainframe computer system.

5. Cumulative operating time T and number c of allowed failures during T (acceptance conditions).
6. Number of systems under test (T / MTBF0 as a rule of thumb).
7. Parameters which should be tested and frequency of measurement.
8. Failures which should be ignored for the MTBF acceptance test.
9. Maintenance and screening before the acceptance test.
10. Maintenance procedures during the acceptance test.
11. Form and content of test protocols and reports.
12. Actions in the case of a negative test result.
For acceptance testing (demonstration) of an MTTR, the following data are important (Section 7.3.2):
1. Quantitative requirements (MTTR, variance, quantile).
2. Test conditions (environment, personnel, tools, external support, spare parts).
3. Number and extent of repairs to be undertaken (simulated/introduced failures).
4. Allocation of the repair time (diagnostic, repair, functional test, logistical time).
5. Acceptance conditions (number of repairs and observed empirical MTTR).
6. Form and content of test protocols and reports.
7. Actions in the case of a negative test result.
Availability usually follows from the relationship PA = MTBF/(MTBF + MTTR). However, for an acceptance test, procedures for an unknown probability p = 1 − PA can be used as well (Sections 7.1.2, 7.1.3, and 7.2.1).

A3.2 Realization of Quality and Reliability Requirements for Complex Equipment and Systems

For complex items, in particular at equipment and system level, quality and reliability targets are best achieved with a quality and reliability assurance program, integrated in the project activities and performed without bureaucracy. In such a program, project specific tasks and activities are clearly described and assigned.
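The duty-cycle relations of Section A3.1 can be sketched as follows. The sketch assumes, as stated there, a constant failure rate λ and no degradation caused by power on/off. Under these assumptions, the equivalent failure rate is λ_d = d·λ, and the mean expected number of failures per year and 100 items is m%/y = λ_d · 8,760 h · 100%. Taking a calendar year as 8,760 h is an assumption of this sketch, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760.0  # assumed calendar year: 365 days x 24 h

def equivalent_failure_rate(lam_per_h: float, duty_cycle: float) -> float:
    """lambda_d = d * lambda, the failure rate referred to calendar time.

    Assumes a constant failure rate and no degradation caused by power on/off.
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty cycle must satisfy 0 < d <= 1")
    return duty_cycle * lam_per_h

def failures_per_year_per_100(lam_per_h: float, duty_cycle: float) -> float:
    """m%/y, the mean expected number of failures per year and 100 items."""
    return equivalent_failure_rate(lam_per_h, duty_cycle) * HOURS_PER_YEAR * 100.0
```

For example, λ = 2,000 · 10^-9 h^-1 at d = 0.3 gives m%/y ≈ 0.53, which is within the target m%/y < 1% for equipment.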
Table A3.2 can be used as a checklist when defining the content of a quality and reliability assurance program for complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory (see also [A2.8] and Section 5.3 for software specific quality assurance aspects). Table A3.2 is a refinement of Table 1.2 and shows a possible task assignment in a company as per Fig. 1.7. Depending on the item technology and complexity, or because of tailoring, Table A3.2 is to be shortened or extended. The given responsibilities for tasks (R, C, I) can be modified to reflect the company's personnel situation. For a comprehensive description of reliability assurance tasks see e.g. [A2.6 (60300), A2.10 (785), A3.1].

Table A3.2 Example of tasks and task assignment for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory (see also Section 5.3 for software specific quality assurance aspects)

Example of tasks and task assignment for quality and reliability assurance, in agreement with Fig. 1.7 and TQM (checklist for the preparation of a quality and reliability assurance program). R stands for responsibility, C for cooperation (must cooperate), I for information (can cooperate). Columns: M (marketing), R&D (research and development), P (production), Q&R (quality and reliability assurance).

1. Customer and market requirements
  1 Evaluation of delivered equipment and systems   R I I C
  2 Determination of market and customer demands and real needs   R I I C
  3 Customer support   R C
2. Preliminary analyses
  1 Definition of tentative quantitative targets for reliability, maintainability, availability, safety, and quality level   C C C R
  2 Rough analyses and identification of potential problems   I C R
  3 Comparative investigations   I C R
3. Quality and reliability aspects in specifications, quotations, contracts, etc.
  1 Definition of the required function   I R C
  2 Determination of external environmental stresses   C R C
  3 Definition of realistic quantitative targets for reliability, maintainability, availability, safety, and quality level   C C C R
  4 Specification of test and acceptance criteria   C C C R
  5 Identification of the possibility to obtain field data   R C
  6 Cost estimate for quality & reliability assurance activities   C C C R
4. Quality and reliability assurance program
  1 Preparation   C C C R
  2 Realization
    - design and evaluation   I R I C
    - production   I I R C
5. Reliability and maintainability analyses
  1 Specification of the required function for each element   R C
  2 Determination of environmental, functional, and time-dependent stresses (detailed operating conditions)   R C
  3 Assessment of derating factors   C R
  4 Reliability and maintainability allocation   C R
  5 Preparation of reliability block diagrams
    - assembly level   R C
    - system level   C R
  6 Identification and analysis of reliability weaknesses (FMEA/FMECA, FTA, worst-case, drift, stress-strength analyses)
    - assembly level   R C
    - system level   C R

Table A3.2 (cont.)   M   R&D   P   Q&R
  7 Carrying out comparative studies
    - assembly level   R C
    - system level   C R
  8 Reliability improvement through redundancy
    - assembly level   R C
    - system level   C R
  9 Identification of components with limited lifetime   I R C
  10 Elaboration of the maintenance concept   I R I C
  11 Elaboration of a test and screening strategy   C C C R
  12 Analysis of maintainability   R C
  13 Elaboration of mathematical models   C R
  14 Calculation of the predicted reliability and maintainability
    - assembly level   I R C
    - system level   I C R
  15 Reliability and availability calculation at system level   I I R
6. Safety and human factor analyses
  1 Analysis of safety (avoidance of liability problems)
    - accident prevention   C R C C
    - technical safety
      identification and analysis of critical failures and of risk situations (FMEA/FMECA, FTA, etc.)
        - assembly level   R C
        - system level   I C R
      theoretical investigations   C R
  2 Analysis of human factors (man-machine interface)   C R C C
7. Selection and qualification of components and materials
  1 Updating of the list of preferred components and materials   I C I R
  2 Selection of non-preferred components and materials   R C C
  3 Qualification of non-preferred components and materials
    - planning   C I R
    - realization   C R
    - analysis of test results   I I R
  4 Screening of components and materials   I C R
8. Supplier selection and qualification
  1 Supplier selection
    - purchased components and materials   R C C
    - external production   C R C
  2 Supplier qualification (quality and reliability)
    - purchased components and materials   I I R
    - external production   I I R
  3 Incoming inspections
    - planning   C C R
    - realization   C R
    - analysis of test results   C R
    - decision on corrective actions
      purchased components and materials   C C R
      external production   R C C

Table A3.2 (cont.)   M   R&D   P   Q&R
9. Project-dependent procedures and work instructions
  1 Reliability guidelines   C R
  2 Maintainability guidelines   C C I R
  3 Safety guidelines   I C I R
  4 Other procedures, rules, and work instructions
    - for development   R I C
    - for production   I R C
  5 Compliance monitoring   C C C R
10. Configuration management
  1 Planning and monitoring   C C C R
  2 Realization
    - configuration identification
      during design   R C
      during production   I R C
      during use (warranty period)   R I I C
    - configuration auditing (design reviews, Tables A3.3, 5.3, 5.5)   C R C C
    - configuration control (evaluation, coordination, and release or rejection of changes and modifications)
      during design   C R C C
      during production   C C R C
      during use (warranty period)   R C C C
    - configuration accounting   R C C
11. Prototype qualification tests
  1 Planning   I R I C
  2 Realization   C R C C
  3 Analysis of test results   I R I C
  4 Special tests for reliability, maintainability, and safety   I C C R
12. Quality control during production
  1 Selection and qualification of processes and procedures   R C C
  2 Production planning   C R C
  3 Monitoring of production processes   I R C
13. In-process tests
  1 Planning   C R C
  2 Realization   I R I
14. Final and acceptance tests
  1 Environmental tests and/or screening of series-produced items
    - planning   I C C R
    - realization   I I C R
    - analysis of test results   I C C R
  2 Final and acceptance tests
    - planning   C C C R
    - realization   I I C R
    - analysis of test results   C C C R
  3 Procurement, maintenance, and calibration of test equipment   I C C R

Table A3.2 (cont.)   M   R&D   P   Q&R
15. Quality data reporting system
  1 Data collection   C C C R
  2 Decision on corrective actions
    - during prototype qualification   R I C
    - during in-process tests   C R C
    - during final and acceptance tests   C C C R
    - during use (warranty period)   R C C C
  3 Realization of corrective actions on hardware or software (repair, rework, waiver, scrap)   I C C R
  4 Implementation of the changes in the documentation (technical, production, customer)   C C C R
  5 Data compression, processing, storage, and feedback   I I I R
  6 Monitoring of the quality data reporting system   I I I R
16. Logistical support
  1 Supply of special tools and test equipment for maintenance   C R I C
  2 Preparation of customer documentation   R C I I
  3 Training of operating and maintenance personnel   R I I I
  4 Determination of the required number of spare parts, maintenance personnel, etc.   R C C
  5 After-sales support   R I I C
17. Coordination and monitoring
  1 Project-specific   C C C R
  2 Project-independent   I I I R
  3 Planning and realization of quality audits
    - project-specific   C C C R
    - project-independent   I I I R
  4 Information feedback   I I I R
18. Quality cost
  1 Collection of quality cost   C C C R
  2 Cost analysis and initiation of appropriate actions   C C C R
  3 Preparation of periodic and special reports   C C C R
  4 Evaluation of the efficiency of quality & reliability assurance   I I I R
19. Concepts, methods, and general procedures (quality and reliability)
  1 Development of concepts   C C C R
  2 Investigation of methods   I I I R
  3 Preparation and updating of the quality handbook   C C C R
  4 Development of software packages   I I I R
  5 Collection, evaluation, and distribution of data, experience and know-how   I I I R
20. Motivation and training
  1 Planning   C C C R
  2 Preparation of courses and documentation   C C C R
  3 Realization of the motivation and training program   C C C R

A3.3 Elements of a Quality and Reliability Assurance Program

The basic elements of a quality and reliability assurance program, as defined in Appendix A3.2, can be summarized as follows:
1. Project organization, planning, and scheduling
2. Quality and reliability requirements
3. Reliability and safety analysis
4. Selection and qualification of components, materials, and processes
5. Configuration management
6. Quality tests
7. Quality data reporting system
These elements are discussed in this section for the case of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory. In addition, Appendix A4 gives a catalog of questions to generate checklists for design reviews, and Appendix A5 specifies the requirements for a quality data reporting system. For software specific quality assurance aspects, refer to Section 5.3. As suggested in task 4 of Table A3.2, the realization of a quality and reliability assurance program should be the responsibility of the project manager. It is often useful to start with a quality and reliability program for the development phase, covering items 1 to 5 of the above list. This can then be continued with a program for the production phase covering items 5 to 7.

A3.3.1 Project Organization, Planning, and Scheduling

A clearly defined project organization and planning is necessary for the realization of a quality and reliability assurance program. Organization and planning must also satisfy modern needs for cost management and concurrent engineering. The system specification is the basic document for all considerations at project level. The following is a typical outline for system specifications:
1. State of the art, need for a new product
2. Target to be achieved
3. Cost, time schedule
4. Market potential (turnover, price, competition)
5. Technical performance
6. Environmental conditions
7. Operational capabilities (reliability, maintainability, availability, logist. support)
8. Quality and reliability

9. Special aspects (new technologies, patents, value engineering, etc.)
10. Appendices
The organization of a project begins with the definition of the main task groups. The following groups are usual for a complex system: Project Management, System Engineering, Life-Cycle Cost, Quality and Reliability Assurance, Assembly Design, Prototype Qualification Tests, Production, Assembly and Final Testing. Project organization, task lists, task assignment, and milestones can be derived from the task groups, allowing the quantification of the personnel, material, and financial resources needed for the project. The quality and reliability assurance program must require that the project is clearly and suitably organized and planned.

A3.3.2 Quality and Reliability Requirements

The most important steps in defining quality and reliability targets for complex equipment and systems have been discussed in Appendix A3.1.

A3.3.3 Reliability and Safety Analysis

Reliability and safety analyses include several activities. These are failure rate analysis, failure mode analysis (FMEA/FMECA, FTA), and sneak circuit analysis, which identifies latent paths that cause unwanted functions or inhibit desired functions while all components are functioning properly. They also include the evaluation of concrete possibilities to improve reliability and safety (derating, screening, redundancy) and comparative studies; see Chapters 2-6 for methods and tools. The quality and reliability assurance program must show what is actually being done for the project considered. For instance, it should be able to supply answers to the following questions:
1. Which derating rules are considered?
2. How are the actual component-level operating conditions determined?
3. Which failure rate data are used? Which are the associated factors (πE & πQ)?
4. Which tool is used for failure mode analysis? To which items does it apply?
5. Which kind of comparative studies will be performed?
6. Which design guidelines for reliability, maintainability, safety, and software quality are used? How will adherence to them be verified?
Additionally, the program must show the interfaces to the selection and qualification of components and materials, design reviews, test and screening strategies, reliability tests, the quality data reporting system, and subcontractor activities. The data used for component failure rate calculation should be critically evaluated (source, present relevance, assumed environmental and quality factors πE & πQ).
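As an illustration of how πE and πQ enter a failure rate calculation, the sketch below sums part failure rates for a series structure. In a series structure, every part failure causes item failure, and MTBF = 1/λ holds for constant failure rates. It uses a simplified multiplicative model λ_i = λ_ref,i · πE · πQ in the spirit of handbook parts-count methods. Real handbooks use additional factors and their own tables. The part data and names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Part:
    name: str
    count: int        # number of identical parts in the item
    lam_ref: float    # hypothetical reference failure rate per part [1/h]
    pi_e: float       # environmental factor (pi_E)
    pi_q: float       # quality factor (pi_Q)

def series_failure_rate(parts: list[Part]) -> float:
    """Item failure rate for a series structure with constant failure rates:
    the sum over all parts of count * lam_ref * pi_E * pi_Q."""
    return sum(p.count * p.lam_ref * p.pi_e * p.pi_q for p in parts)

def mtbf_hours(lam_per_h: float) -> float:
    """MTBF = 1 / lambda, valid for a constant failure rate."""
    return 1.0 / lam_per_h
```

Consider 10 parts with λ_ref = 1 · 10^-9 h^-1, πE = 2, and πQ = 1, together with 2 parts with λ_ref = 20 · 10^-9 h^-1, πE = 2, and πQ = 1.5. These give λ = 140 · 10^-9 h^-1. Questioning where λ_ref, πE, and πQ come from is exactly the critical evaluation of failure rate data asked for above.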

A3.3.4 Selection and Qualification of Components, Materials, and Manufacturing Processes

Components, materials, and production processes have a great impact on product quality and reliability. They must be carefully selected and qualified. Examples of qualification tests on electronic components and assemblies are given in Chapter 3. For production processes one may refer e.g. to [ ]. The quality and reliability assurance program should state how components, materials, and processes are (or have previously been) selected and qualified. For instance, the following questions should be answered:
1. Does a list of preferred components and materials exist? Will critical components be available on the market at least for the required production and warranty time?
2. How will obsolescence problems be solved?
3. Under what conditions can a designer use non-qualified components/materials?
4. How are new components selected? What is the qualification procedure?
5. How have the standard manufacturing processes been qualified?
6. How are special manufacturing processes qualified?
Special manufacturing processes are those whose quality cannot be tested directly on the product, which have high requirements with respect to reproducibility, or which can have an important negative effect on product quality or reliability.

A3.3.5 Configuration Management

Configuration management is an important tool for quality assurance, in particular during design and development. Within a project, it is often subdivided into configuration identification, auditing, control, and accounting. The identification of an item is recorded in its documentation. A possible documentation outline for complex equipment and systems is given in Fig. A3.1. Configuration auditing is done via design reviews, the aim of which is to assure/verify that the system will meet all requirements.
In a design review, the following aspects are critically examined with the help of checklists: all aspects of design and development (selection and use of components and materials, dimensioning, interfaces, etc.), production (manufacturability, testability, reproducibility), reliability, maintainability, safety, patent regulations, value engineering, and value analysis. The most important design reviews are described in Table A3.3. For complex systems, a review of the first production unit (FCA/PCA) is often required. A further important objective of design reviews is to decide whether to continue or stop the project, on the basis of objective considerations and a feasibility check (Tables A3.3 and 5.3 & Fig. 1.6). A week before the design review, participants should present project specific checklists; see Appendix A4 and Tables 2.7 & 4.3 for some suggestions.

Fig. A3.1 Possible documentation outline for complex equipment and systems (documents shown: system specifications; quotations, requests; interface documentation; planning and control documentation; concepts/strategies (maintenance, test); analysis reports; standards, handbooks, general rules; work breakdown structures; drawings; schematics; part lists; wiring plans; specifications; purchasing documentation; handling/transportation/storage/packaging documentation; operations plans/records; production procedures; tool documentation; assembly documentation; test procedures; test reports; documents pertaining to the quality data reporting system; customer system specifications; operating and maintenance manuals; spare part catalog)

Design reviews are chaired by the project manager and should be co-chaired by the project quality and reliability assurance manager. For complex equipment and systems, the review team may be drawn, depending on the review, from the following list: project manager, project quality and reliability assurance manager, design engineers, representatives from production and marketing, an independent design engineer or external person, and customer representatives (if appropriate).
Configuration control includes evaluation, coordination, and release or rejection of all proposed changes and modifications. Changes occur as a result of defects or failures; modifications are triggered by a revision of the system specifications. Configuration accounting ensures that all approved changes and modifications have been implemented and recorded. This calls for a defined procedure, as changes / modifications must be realized in hardware, software, and documentation. A one-to-one correspondence between hardware or software and documentation is important during all life-cycle phases of a product.
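Configuration accounting can be pictured as a bookkeeping check over change records. For every approved change, it verifies that the change has been recorded as implemented in each affected artifact (hardware, software) and, always, in the documentation. The sketch below is a minimal illustration under that reading of the one-to-one correspondence requirement. The record layout, field names, and identifiers are hypothetical and are not a prescribed procedure.

```python
from dataclasses import dataclass, field

# Artifacts in which an approved change may have to be realized
ARTIFACTS = {"hardware", "software", "documentation"}

@dataclass
class ChangeRecord:
    change_id: str            # hypothetical identifier, e.g. "CR-1"
    approved: bool            # outcome of configuration control
    affects: set[str]         # artifacts touched by the change
    implemented_in: set[str] = field(default_factory=set)

def open_accounting_items(records: list[ChangeRecord]) -> dict[str, list[str]]:
    """Return, per approved change, the artifacts where implementation is not yet recorded.

    Documentation is always required, reflecting the one-to-one correspondence
    between hardware/software and documentation.
    """
    gaps: dict[str, list[str]] = {}
    for rec in records:
        if not rec.approved:
            continue  # rejected or pending changes are not accounted for
        required = (rec.affects & ARTIFACTS) | {"documentation"}
        missing = sorted(required - rec.implemented_in)
        if missing:
            gaps[rec.change_id] = missing
    return gaps
```

An empty result means that every approved change is reflected in all the artifacts it affects. A non-empty result lists the open items that configuration accounting has to close.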
Complete records over all life-cycle phases become necessary if traceability is explicitly required, e.g. in the aerospace or nuclear field. Partial traceability can also be required for products which are critical with respect to safety, or because of product liability. Referring to configuration management, the quality and reliability assurance program should for instance answer the following questions:

1. Which documents will be produced, by whom, when, and with what content?
2. Are document contents in accordance with quality and reliability requirements?
3. Is the release procedure for technical and production documentation compatible with quality requirements?
4. Are the procedures for changes/modifications clearly defined?
5. How is compatibility (upward and/or downward) assured?
6. How is configuration accounting assured during production?
7. Which items are subject to traceability requirements?

A3.3.6 Quality Tests
Quality tests are necessary to verify whether an item conforms to specified requirements. Such tests cover performance, reliability, maintainability, and safety aspects, and include incoming inspections, qualification tests, production tests, and acceptance tests. To optimize cost and time schedule, tests should be integrated in a test (and screening) strategy at system level. Methods for statistical quality control and reliability tests are given in Chapter 7. Qualification tests and screening procedures are discussed in Sections and Basic considerations for test and screening strategies, including cost aspects, are in Section 8.4. Some aspects of testing software are discussed in Section 5.3. Reliability growth is investigated in Section 8.5. The quality and reliability assurance program should for instance answer the following questions:
1. What are the test and screening strategies at system level?
2. How were subcontractors selected, qualified, and monitored?
3. What is specified in the procurement documentation?
4. How is the incoming inspection performed?
5. Which components and materials are 100% tested? Which are 100% screened? What are the procedures for screening?
6. How are prototypes qualified? Who decides on test results?
7. How are production tests performed? Who decides on test results?
8. Which procedures are applied to defective or failed items?
9. What are the instructions for handling, transportation, storage, and shipping?

A3.3.7 Quality Data Reporting System
Starting at the prototype qualification tests, all defects and failures should be systematically collected, analyzed, and corrected. Analysis should go back to the cause of the fault, in order to find the actions most appropriate for avoiding repetition of the same problem. The concept of a quality data reporting system is illustrated in Fig. 1.8 and applies basically to hardware and software; detailed requirements are given in Appendix A5.

Table A3.3 Design reviews during definition, design, and development of complex equipment and systems

System Design Review (SDR)
- When: at the end of the definition phase
- Goal: critical review of the system specifications on the basis of results from market research, rough analysis, comparative studies, patent situation, etc.; feasibility check
- Input: item list; system specifications (draft); documentation (analyses, reports, etc.); checklists (one for each participant)*
- Output: system specifications; proposal for the design phase; interface definitions; rough maintenance and logistical support concept; report

Preliminary Design Reviews (PDR)
- When: during the design phase, each time an assembly has been developed
- Goal: critical review of all documents belonging to the assembly under consideration (calculations, schematics, parts lists, test specifications, etc.); comparison of the target achieved with the system specifications requirements; checking interfaces to other assemblies; feasibility check
- Input: item list; documentation (analyses, schematics, drawings, parts lists, test specifications, work breakdown structure, interface specifications, etc.); reports of relevant earlier design reviews; checklists (one for each participant)*
- Output: reference configuration (baseline) of the assembly considered; list of deviations from the system specifications; report

Critical Design Review (CDR)
- When: at the end of prototype qualification tests
- Goal: critical comparison of prototype qualification test results with system requirements; formal review of the correspondence between technical documentation and prototype; verification of manufacturability, testability, and reproducibility; feasibility check
- Input: item list; technical documentation; testing plan and procedures for prototype qualification tests; results of prototype qualification tests; list of deviations from the system requirements; maintenance concept; checklists (one for each participant)*
- Output: list of the final deviations from the system specifications; qualified and released prototypes; frozen technical documentation; revised maintenance concept; production proposal; report

* see Appendix A4 for a possible catalog of questions to generate project specific checklists and Tab. 5.5 for software specific aspects

The quality and reliability assurance program should for instance answer the following questions:
1. How is the collection of defect and failure data carried out? At which project phase does it start?
2. How are defects and failures analyzed?
3. Who carries out corrective actions? Who monitors their realization? Who checks the final configuration?
4. How is evaluation and feedback of quality and reliability data organized?
5. Who is responsible for the quality data reporting system? Does production have its own locally limited version of such a system? How does this system interface with the company's quality data reporting system?

A4 Checklists for Design Reviews
In a design review, all aspects of design, development, production, reliability, maintainability, safety, patent regulations, and value engineering/value analysis are critically examined with the help of checklists. The most important design reviews are described in Table A3.3 (see Table 5.5 for software specific aspects). A further objective of design reviews is to decide whether to continue or stop the project on the basis of objective considerations and a feasibility check (Tables A3.3 and 5.3 & Fig. 1.6). This appendix gives a catalog of questions which can be used to generate project specific checklists for design reviews of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory.

A4.1 System Design Review
1. What experience exists with similar equipment or systems?
2. What are the goals for performance (capability), reliability, maintainability, availability, and safety? How have they been defined? Which mission profile (required function and environmental conditions) is applicable?
3. Are the requirements realistic? Do they correspond to a market need?
4. What tentative allocation of reliability and maintainability down to assembly/unit level was undertaken?
5. What are the critical items? Are potential problems to be expected (new technologies, interfaces)?
6. Have comparative studies been done? What are the results?
7. Are interference problems (external or internal EMC) to be expected?
8. Are there potential safety/liability problems?
9. Is there a maintenance concept? Do special ergonomic requirements exist?
10. Are there special software requirements?
11. Has the patent situation been verified? Are licenses necessary?
12. Are there estimates of life-cycle cost? Have these been optimized with respect to reliability and maintainability requirements?

13. Is there a feasibility study? Where does the competition stand? Has development risk been assessed?
14. Is the project time schedule realistic? Can the system be marketed at the right time?
15. Can supply problems be expected during production ramp-up?

A4.2 Preliminary Design Reviews
a) General
1. Is the assembly/unit under consideration a new development or only a change/modification? Can existing items (e.g. subassemblies) be used?
2. Is there experience with similar assemblies/units? What were the problems?
3. Is there hardware and/or software redundancy?
4. Have customer and market demands changed since the beginning of development? Can individual requirements be reduced?
5. Can the chosen solution be further simplified?
6. Are there patent problems? Do licenses have to be purchased?
7. Have expected cost and deadlines been met? Was value engineering used?
b) Performance Parameters
1. How have the main performance parameters of the assembly/unit under consideration been defined? How was their fulfillment verified (calculations, simulation, tests)?
2. Have worst case situations been considered in calculations/simulations?
3. Have interference problems (EMC) been solved?
4. Have applicable standards been observed during design and development?
5. Have interface problems with other assemblies/units been solved?
6. Have prototypes been adequately tested in the laboratory?
c) Environmental Conditions
1. Have environmental conditions been defined? As a function of time? Were these consistently used to determine component operating conditions?
2. How has EMC interference been determined? Has its influence been taken into account in worst case calculations/simulations?

d) Components and Materials
1. Which components and materials do not appear in the preferred lists? For what reasons? How were these components and materials qualified?
2. Are incoming inspections necessary? For which components and materials? How and by whom will they be performed?
3. Which components and materials were screened? How and by whom will screening be performed?
4. Are suppliers guaranteed for series production? Is there at least one second source for each component and material? Have requirements for quality, reliability, and safety been met?
5. Are obsolescence problems to be expected? How will they be solved?
e) Reliability
See Table 2.8.
f) Maintainability
See Table 4.3.
g) Safety
1. Have applicable standards concerning accident prevention been observed?
2. Has safety been considered with regard to external causes (natural catastrophe, sabotage, etc.)?
3. Has an FMEA/FMECA or similar cause-to-effect analysis been performed? Are there failure modes with critical or even catastrophic consequences? Can these be avoided? Have all single-point failures been identified? Can these be avoided?
4. Has a fail-safe analysis been performed? What were the results?
5. What safety tests are planned? Are they sufficient?
6. Have safety aspects been dealt with adequately in the documentation?
h) Human Factors, Ergonomics
1. Have operating and maintenance sequences been defined with regard to the training level of operators and maintenance personnel?
2. Have ergonomic factors been taken into account when defining operating sequences?
3. Has the man-machine interface been sufficiently considered?

i) Standardization
1. Have standard components and materials been used wherever possible?
2. Has the exchangeability of items been considered during design and construction?
j) Configuration
1. Is the technical documentation (schematics, drawings, etc.) complete and error-free, and does it reflect the present state of the project?
2. Have all interface problems between assemblies/units been solved?
3. Can the technical documentation be frozen and considered as reference documentation (baseline)?
4. How is compatibility (upward and/or downward) assured?
k) Production and Testing
1. Which qualification tests are foreseen for prototypes? Have reliability, maintainability, and safety aspects been considered sufficiently in these tests?
2. Have all questions been answered regarding manufacturability, testability, and reproducibility?
3. Are special production processes necessary? Were they qualified? What were the results?
4. Are special transport, packaging, or storage problems to be expected?

A4.3 Critical Design Review (System Level)
a) Technical Aspects
1. Does the documentation allow an exhaustive and correct interpretation of test procedures and results? Has the technical documentation been frozen? Has conformance with present hardware and software been checked?
2. Are test specifications and procedures complete? In particular, are conditions for functional, environmental, reliability, and safety tests clearly defined?
3. Have fault criteria been defined for critical parameters? Is an indirect measurement planned for those parameters which cannot be measured accurately enough during tests?
4. Has a representative mission profile, with the corresponding required function, been clearly defined for reliability tests?

5. Have test criteria for maintainability been defined? Which failures were simulated/introduced? How have personnel and material conditions been fixed?
6. Have test criteria for safety been defined (accident prevention and technical safety)?
7. Have ergonomic aspects been checked? How?
8. Can packaging, transport, and storage cause problems?
9. Have defects and failures been systematically analyzed (mode, cause, effect)? Has the usefulness of corrective actions been verified? How? Also with respect to cost?
10. Have all deviations been recorded? Can they be accepted?
11. Does the system still satisfy customer/market needs?
12. Are manufacturability and reproducibility guaranteed within the framework of a production environment?
b) Formal Aspects
1. Is the technical documentation complete?
2. Has the technical documentation been checked for correctness? For coherency?
3. Is uniqueness in numbering guaranteed? Even in the case of changes?
4. Is hardware labeling appropriate? Does it satisfy production and maintenance requirements?
5. Has conformance between prototype and documentation been checked?
6. Is the maintenance concept mature? Are spare parts with a different change status fully interchangeable?
7. Are production tests sufficient from today's point of view?

A5 Requirements for Quality Data Reporting Systems
A quality data reporting system is a system to collect, analyze, and correct all defects and failures occurring during production and testing of an item, as well as to evaluate and feed back the corresponding quality and reliability data (Fig. 1.8). The system is generally computer-aided. Analysis of failures and defects must go back to the root cause in order to determine the most appropriate action necessary to avoid repetition of the same problem. The quality data reporting system applies basically to hardware and software. It should remain active during the operating phase, at least for the warranty time. This appendix summarizes the requirements for a computer-aided quality data reporting system for complex equipment and systems.
a) General Requirements
1. Up-to-dateness, completeness, and utility of the delivered information must be the primary concern (best compromise).
2. A high level of usability (user friendliness) and minimal manual intervention should be a goal.
3. Procedures and responsibilities should be clearly defined (several levels depending upon the consequence of defects or failures).
4. The system should be flexible and easily adaptable to new needs.
b) Requirements Relevant to Data Collection
1. All data concerning defects and failures (relevant to quality, reliability, maintainability, and safety) have to be collected, from the beginning of prototype qualification tests to (at least) the end of the warranty time.
2. Data collection forms should
- preferably be 8" x 11" or A4 format
- be project-independent and easy to fill in
- ensure that only the relevant information is entered and answer the questions what, where, when, why, and how

- have a separate field (20-30%) for free-format input of comments (requests for analysis, logistical information, etc.); these comments do not need to be processed and should be easily separable from the fixed portion of the form.
3. Description of the symptom (mode), analysis (cause, effect), and corrective action undertaken should be recorded in clear text and coded at data entry by trained personnel.
4. Data collection can be carried out in different ways:
- at a single reporting location (adequate for simple problems which can be solved directly at the reporting location)
- from different reporting locations which report the fault (defect or failure), analysis result, and corrective action separately.
Operating reliability, maintainability, or logistical data can also be reported.
5. Data collection forms should be entered into the computer daily (online if possible), so that corrective actions can be quickly initiated (for field data, a weekly or monthly entry can be sufficient for many purposes).
c) Requirements for Analysis
1. The cause should be found for each defect or failure: at the reporting location in the case of simple problems, by a fault review board in critical cases.
2. Failures (and defects) should be classified according to
- mode: sudden failure (short, open, fracture, etc.), gradual failure (drift, wearout, etc.), intermittent failure, others if needed
- cause: intrinsic (inherent weaknesses, wearout, or some other intrinsic cause), extrinsic (systematic failure, i.e. misuse, mishandling, design, or manufacturing failure), secondary failure
- effect: irrelevant, partial failure, complete failure, critical failure (safety problem).
3. The consequence of the analysis (repair, rework, change, scrapping) must be reported.
d) Requirements for Corrective Actions
1. Every record is considered pending until the necessary corrective action has been successfully completed and certified.
2. The quality data reporting system must monitor all corrective actions.

3. Procedures and responsibilities pertaining to corrective action have to be defined (simple cases are usually solved by the reporting location).
4. The reporting location must be informed about a completed corrective action.
e) Requirements Related to Data Processing, Feedback, and Storage
1. Adequate coding must allow data compression and simplify data processing.
2. Up-to-date information should be available online.
3. Problem-dependent and periodic data evaluation must be possible.
4. At the end of a project, relevant information should be stored for comparative investigations.
f) Requirements Related to Compatibility with other Software Packages
1. Compatibility with the company's configuration management and data banks should be assured.
2. Data transfer with the following external software packages should be assured:
- important reliability data banks
- quality data reporting systems of subsidiary companies
- quality data reporting systems of large contractors.

The effort required for implementing a quality data reporting system as described above can amount to 5 to 10 man-years for a medium-sized company. Competence for operation and maintenance of the quality data reporting system should lie with the company's quality and reliability assurance department. The priority for the realization of corrective actions is project specific and should be fixed by the project manager. Major problems (defects and failures) should be discussed periodically by a fault review board chaired by the company's quality and reliability assurance manager, which should have, in critical cases, the competence to take go/no-go decisions.
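To make the classification of item c) 2. and the rule "pending until the corrective action is completed and certified" concrete, here is a minimal Python sketch of one defect/failure record. It is an illustration under stated assumptions, not a specification of any real system: the class, field, and function names (`FaultRecord`, `pending_records`, `what`/`where`/`when`) are hypothetical, and only the mode, cause, and effect categories come from the requirements above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Failure mode: how the fault shows itself."""
    SUDDEN = "sudden"              # short, open, fracture, ...
    GRADUAL = "gradual"            # drift, wearout, ...
    INTERMITTENT = "intermittent"


class Cause(Enum):
    """Failure cause: why the fault occurred."""
    INTRINSIC = "intrinsic"        # inherent weakness, wearout, ...
    EXTRINSIC = "extrinsic"        # systematic: misuse, mishandling, design, manufacturing
    SECONDARY = "secondary"


class Effect(Enum):
    IRRELEVANT = "irrelevant"
    PARTIAL = "partial"
    COMPLETE = "complete"
    CRITICAL = "critical"          # safety problem


@dataclass
class FaultRecord:
    """One defect or failure report (hypothetical field names)."""
    record_id: str
    what: str                      # symptom in clear text
    where: str                     # reporting location or assembly
    when: str                      # project phase or date
    mode: Mode
    cause: Cause
    effect: Effect
    corrective_action: str = ""    # empty until an action has been defined
    certified: bool = False        # True once the completed action is certified
    comments: str = ""             # free-format field, not processed

    @property
    def pending(self) -> bool:
        # Pending until a corrective action exists and has been certified.
        return not (self.corrective_action and self.certified)


def pending_records(records):
    """Records whose corrective actions the system must still monitor."""
    return [r for r in records if r.pending]


records = [
    FaultRecord("R1", "no output signal", "assembly A3", "prototype test",
                Mode.SUDDEN, Cause.INTRINSIC, Effect.COMPLETE),
    FaultRecord("R2", "reference voltage drift", "assembly A7", "production test",
                Mode.GRADUAL, Cause.EXTRINSIC, Effect.PARTIAL,
                corrective_action="rework solder joints", certified=True),
]
print([r.record_id for r in pending_records(records)])   # ['R1']
```

Coding mode, cause, and effect as enumerations rather than free text is one way to meet requirement e) 1. (adequate coding for data processing), while the `comments` field keeps the unprocessed free-format part separate.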

A6 Basic Probability Theory
In many practical situations, experiments have a random outcome, i.e., the results cannot be predicted exactly, even if the same experiment is repeated under identical conditions. Examples are inspection of a given item during production, failure-free operating time of a given system, repair time of equipment, etc. Experience shows that as the number of repetitions of the same experiment increases, certain regularities appear regarding the occurrence of the event considered. Probability theory is a mathematical discipline which investigates the laws describing such regularities. The assumption of unlimited repeatability of the same experiment is basic to probability theory. This assumption permits the introduction of the concept of probability for an event starting from the properties of the relative frequency of its occurrence in a long series of trials. The axiomatic theory of probability, introduced in 1933 by A.N. Kolmogorov [A6.10], established probability theory as a mathematical discipline. In reliability analysis, probability theory allows the investigation of the probability that a given item will operate failure-free for a stated period of time under given conditions, i.e. the calculation of the item's reliability on the basis of a mathematical model. The rules necessary for such calculations are presented in Sections A6.1 - A6.4. The following sections are devoted to the concept of random variables, necessary to investigate reliability as a function of time and as a basis for stochastic processes (Appendix A7) and mathematical statistics (Appendix A8). This appendix is a compendium of probability theory, consistent from a mathematical point of view but still with reliability engineering applications in mind. A large number of examples illustrate practical aspects.

A6.1 Field of Events
As introduced in 1933 by A.N. Kolmogorov [A6.10], the mathematical model of an experiment with random outcome is a triplet $[\Omega, \mathcal{F}, \Pr]$, also called probability space. $\Omega$ is the sample space, $\mathcal{F}$ the event field, and $\Pr$ the probability of each element of $\mathcal{F}$. $\Omega$ is a set containing as elements all possible outcomes of the experiment considered. Hence $\Omega = \{1, 2, 3, 4, 5, 6\}$ if the experiment consists of a single throw of a die, and $\Omega = [0, \infty)$ in the case of failure-free operating times of an item.

The elements of $\Omega$ are called elementary events and are represented by $\omega$. If the logical statement "the outcome of the experiment is a subset $A$ of $\Omega$" is identified with the subset $A$ itself, combinations of statements become equivalent to operations with subsets of $\Omega$. If the sample space $\Omega$ is finite or countable, a probability can be assigned to every subset of $\Omega$. In this case, the event field $\mathcal{F}$ contains all subsets of $\Omega$ and all combinations of them. If $\Omega$ is continuous, restrictions are necessary. The event field $\mathcal{F}$ is thus a system of subsets of $\Omega$ to each of which a probability has been assigned according to the situation considered. Such a field is called a Borel field ($\sigma$-field) and has the following properties:
1. $\Omega$ is an element of $\mathcal{F}$.
2. If $A$ is an element of $\mathcal{F}$, its complement $\bar{A}$ is also an element of $\mathcal{F}$.
3. If $A_1, A_2, \dots$ are elements of $\mathcal{F}$, the countable union $A_1 \cup A_2 \cup \dots$ is also an element of $\mathcal{F}$.
From the first two properties it follows that the empty set $\emptyset$ belongs to $\mathcal{F}$. From the last two properties and De Morgan's law one recognizes that the countable intersection $A_1 \cap A_2 \cap \dots$ also belongs to $\mathcal{F}$. In probability theory, the elements of $\mathcal{F}$ are called (random) events. The most important operations on events are the union, the intersection, and the complement:
1. The union of a finite or countable sequence $A_1, A_2, \dots$ of events is an event which occurs if at least one of the events $A_1, A_2, \dots$ occurs; it will be denoted by $A_1 \cup A_2 \cup \dots$ or by $\bigcup_i A_i$.
2. The intersection of a finite or countable sequence $A_1, A_2, \dots$ of events is an event which occurs if each one of the events $A_1, A_2, \dots$ occurs; it will be denoted by $A_1 \cap A_2 \cap \dots$ or by $\bigcap_i A_i$.
3. The complement of an event $A$ is an event which occurs if and only if $A$ does not occur; it is denoted by $\bar{A}$, with $\bar{A} = \{\omega : \omega \notin A\} = \Omega \setminus A$, $A \cup \bar{A} = \Omega$, and $A \cap \bar{A} = \emptyset$.
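On a finite sample space the three event-field properties can be checked mechanically. The sketch below assumes the single die throw of Section A6.1 and uses the hypothetical helper names `power_set` and `is_event_field`; it confirms that the power set of $\Omega = \{1, \dots, 6\}$ is an event field, that the smaller family $\{\emptyset, A, \bar{A}, \Omega\}$ is one too, and that a family missing a complement is not.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of a finite sample space, as frozensets."""
    items = sorted(omega)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}


def is_event_field(omega, field):
    """Check the three event-field properties for a finite family of subsets.

    For a finite family, closure under countable unions reduces to closure
    under pairwise unions (apply the pairwise step repeatedly)."""
    omega = frozenset(omega)
    if omega not in field:                                     # property 1
        return False
    if any((omega - a) not in field for a in field):           # property 2
        return False
    return all((a | b) in field for a in field for b in field)  # property 3


omega = {1, 2, 3, 4, 5, 6}                  # single throw of a die
print(len(power_set(omega)), is_event_field(omega, power_set(omega)))   # 64 True

even = frozenset({2, 4, 6})
coarse = {frozenset(), even, frozenset(omega) - even, frozenset(omega)}
print(is_event_field(omega, coarse))                                 # True
print(is_event_field(omega, {frozenset(), frozenset(omega), even}))  # False
```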
Important properties of set operations are:
Commutative law: $A \cup B = B \cup A$; $A \cap B = B \cap A$
Associative law: $A \cup (B \cup C) = (A \cup B) \cup C$; $A \cap (B \cap C) = (A \cap B) \cap C$
Distributive law: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$; $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
Complement law: $A \cap \bar{A} = \emptyset$; $A \cup \bar{A} = \Omega$
Idempotent law: $A \cup A = A$; $A \cap A = A$
De Morgan's law: $\overline{A \cup B} = \bar{A} \cap \bar{B}$; $\overline{A \cap B} = \bar{A} \cup \bar{B}$
Identity law: $\bar{\bar{A}} = A$; $A \cup (\bar{A} \cap B) = A \cup B$.
The sample space $\Omega$ is also called the sure event and $\emptyset$ is the impossible event. The events $A_1, A_2, \dots$ are mutually exclusive if $A_i \cap A_j = \emptyset$ holds for any $i \neq j$. The events $A$ and $B$ are equivalent if either they occur together or neither of them occurs; equivalent events have the same probability. In the following, events will be mainly enclosed in braces { }.
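The laws listed above can be spot-checked, though not proved, by exhausting all triples of events on a small finite sample space. This sketch assumes $\Omega = \{1, 2, 3, 4\}$ and represents events as Python `frozenset`s, with `|`, `&`, and set difference from $\Omega$ standing for union, intersection, and complement; `OMEGA`, `comp`, and `laws_hold` are illustrative names.

```python
from itertools import combinations, product

OMEGA = frozenset({1, 2, 3, 4})
EMPTY = frozenset()


def comp(a):
    """Complement with respect to the sample space OMEGA."""
    return OMEGA - a


def laws_hold(a, b, c):
    """Evaluate each listed law for one triple of events."""
    checks = [
        (a | b) == (b | a), (a & b) == (b & a),                          # commutative
        (a | (b | c)) == ((a | b) | c), (a & (b & c)) == ((a & b) & c),  # associative
        (a | (b & c)) == ((a | b) & (a | c)),                            # distributive
        (a & (b | c)) == ((a & b) | (a & c)),
        (a & comp(a)) == EMPTY, (a | comp(a)) == OMEGA,                  # complement
        (a | a) == a, (a & a) == a,                                      # idempotent
        comp(a | b) == (comp(a) & comp(b)),                              # De Morgan
        comp(a & b) == (comp(a) | comp(b)),
        comp(comp(a)) == a, (a | (comp(a) & b)) == (a | b),              # identity
    ]
    return all(checks)


events = [frozenset(c) for r in range(len(OMEGA) + 1)
          for c in combinations(sorted(OMEGA), r)]
print(len(events), all(laws_hold(a, b, c) for a, b, c in product(events, repeat=3)))
# 16 True
```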

A6.2 Concept of Probability
Let us assume that 10 (random) samples of size n = 100 were taken from a large and homogeneous lot of populated printed circuit boards (PCBs) for incoming inspection. Examination yielded 38 defective PCBs in total over the 10 samples. For 1000 repetitions of the "testing a PCB" experiment, the relative frequency of the occurrence of event {PCB defective} is therefore $38/1000 = 3.8\%$. It is intuitively appealing to consider 0.038 as the probability of the event {PCB defective}. As shown below, 0.038 is a reasonable estimate of this probability (on the basis of the experimental observations made). Relative frequencies of the occurrence of events have the property that if $n$ is the number of trial repetitions and $n(A)$ the number of those trial repetitions in which the event $A$ occurred, then
$$\hat{p}_n(A) = \frac{n(A)}{n} \tag{A6.1}$$
is the relative frequency of the occurrence of $A$, and the following rules apply:
R1: $\hat{p}_n(A) \ge 0$.
R2: $\hat{p}_n(\Omega) = 1$.
R3: if the events $A_1, \dots, A_m$ are mutually exclusive, then $n(A_1 \cup \dots \cup A_m) = n(A_1) + \dots + n(A_m)$ and $\hat{p}_n(A_1 \cup \dots \cup A_m) = \hat{p}_n(A_1) + \dots + \hat{p}_n(A_m)$.
Experience shows that for a second group of $n$ trials, the relative frequency $\hat{p}_n(A)$ can be different from that of the first group. $\hat{p}_n(A)$ also depends on the number of trials $n$. On the other hand, experiments have confirmed that with increasing $n$, the value $\hat{p}_n(A)$ converges toward a fixed value $p(A)$, see Fig. A6.1 for an example. It therefore seems reasonable to designate the limiting value $p(A)$ as the probability $\Pr\{A\}$ of the event $A$, with $\hat{p}_n(A)$ as an estimate of $\Pr\{A\}$. Although intuitive, such a definition of probability would lead to problems in the case of continuous (nondenumerable) sample spaces. Since Kolmogorov's work [A6.10], the probability $\Pr\{A\}$ has been defined as a function on the event field $\mathcal{F}$ of subsets of $\Omega$. The following axioms hold for this function:

Figure A6.1 Example of relative frequency $k/n$ of "heads" when tossing a symmetric coin $n$ times
Axiom 1: For each $A \in \mathcal{F}$, $\Pr\{A\} \ge 0$.
Axiom 2: $\Pr\{\Omega\} = 1$.
Axiom 3: If events $A_1, A_2, \dots$ are mutually exclusive, then
$$\Pr\Big\{\bigcup_{i=1}^{\infty} A_i\Big\} = \sum_{i=1}^{\infty} \Pr\{A_i\}.$$
Axiom 3 is equivalent to the following statements taken together:
Axiom 3': For any finite collection of mutually exclusive events, $\Pr\{A_1 \cup \dots \cup A_n\} = \Pr\{A_1\} + \dots + \Pr\{A_n\}$.
Axiom 3'': If events $A_1, A_2, \dots$ are increasing, i.e. $A_n \subseteq A_{n+1}$, $n = 1, 2, \dots$, then $\lim_{n \to \infty} \Pr\{A_n\} = \Pr\{\bigcup_{i=1}^{\infty} A_i\}$.
The relationships between Axiom 1 and R1, and between Axiom 2 and R2, are obvious. Axiom 3 postulates the total additivity of the set function $\Pr\{A\}$. Axiom 3' corresponds to R3. Axiom 3'' implies a continuity property of the set function $\Pr\{A\}$ which cannot be derived from the properties of $\hat{p}_n(A)$, but which is of great importance in probability theory. It should be noted that the interpretation of the probability of an event as the limit of the relative frequency of occurrence of this event in a long series of trial repetitions appears as a theorem within probability theory (law of large numbers, Eqs. (A6.144) and (A6.146)). From Axioms 1 to 3 it follows that
$$\Pr\{\emptyset\} = 0, \qquad \Pr\{A\} \le \Pr\{B\} \text{ if } A \subseteq B, \qquad \Pr\{\bar{A}\} = 1 - \Pr\{A\}, \qquad 0 \le \Pr\{A\} \le 1.$$
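The relative frequency (A6.1), rules R1 to R3, and the behavior shown in Fig. A6.1 can be reproduced with a short simulation. This is a sketch assuming pseudo-random fair dice and coins from Python's `random` module with a fixed seed; `relative_frequency` is a hypothetical helper name. R3 is compared with a small tolerance because the counts add exactly but the divided floating-point values may round differently.

```python
import random


def relative_frequency(outcomes, event):
    """Eq. (A6.1): n(A)/n, the share of trials whose outcome lies in A."""
    return sum(1 for x in outcomes if x in event) / len(outcomes)


rng = random.Random(1)                      # fixed seed: reproducible run
throws = [rng.randint(1, 6) for _ in range(10_000)]
omega = {1, 2, 3, 4, 5, 6}
a1, a2 = {1, 2}, {5}                        # mutually exclusive events

p1, p2 = relative_frequency(throws, a1), relative_frequency(throws, a2)
assert p1 >= 0                                                        # R1
assert relative_frequency(throws, omega) == 1.0                       # R2
assert abs(relative_frequency(throws, a1 | a2) - (p1 + p2)) < 1e-12   # R3

# Fig. A6.1: relative frequency k/n of "heads" for growing n
tosses = [rng.random() < 0.5 for _ in range(100_000)]
for n in (10, 100, 1_000, 10_000, 100_000):
    k = sum(tosses[:n])
    print(f"n = {n:>6}: k/n = {k / n:.4f}")
```

The printed values fluctuate strongly for small n and settle near 0.5 as n grows, which is the empirical regularity that Axioms 1 to 3 formalize.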

When modeling an experiment with random outcome by means of the probability space $[\Omega, \mathcal{F}, \Pr]$, the difficulty is often in the determination of the probabilities $\Pr\{A\}$ for every $A \in \mathcal{F}$. The structure of the experiment can help here. Besides the statistical probability, defined as the limit for $n \to \infty$ of the relative frequency $k/n$, the following rules can be used if one assumes that all elementary events $\omega$ have the same chance of occurrence:
1. Classical probability (discrete uniform distribution): If $\Omega$ is a finite set and $A$ a subset of $\Omega$, then
$$\Pr\{A\} = \frac{\text{number of elements in } A}{\text{number of elements in } \Omega} \quad\text{or}\quad \Pr\{A\} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}. \tag{A6.2}$$
2. Geometric probability (spatial uniform distribution): If $\Omega$ is a set of finite area in the plane $\mathbb{R}^2$ and $A$ a subset of $\Omega$, then
$$\Pr\{A\} = \frac{\text{area of } A}{\text{area of } \Omega}. \tag{A6.3}$$
It should be noted that the geometric probability can also be defined if $\Omega$ is a part of the Euclidean space having a finite area. Examples A6.1 and A6.2 illustrate the use of Eqs. (A6.2) and (A6.3).

Example A6.1
From a shipment containing 97 good and 3 defective ICs, one IC is randomly selected. What is the probability that it is defective?
Solution
From Eq. (A6.2),
$$\Pr\{\text{IC defective}\} = \frac{3}{100} = 0.03.$$

Example A6.2
Maurice and Matthew wish to meet between 8:00 and 9:00 a.m. according to the following rules: 1) They come independently of each other and each will wait 12 minutes. 2) The time of arrival is uniformly distributed between 8:00 and 9:00 a.m. What is the probability that they will meet?
Solution
Equation (A6.3) can be applied and leads to, see graph,
$$\Pr\{\text{Matthew meets Maurice}\} = \frac{60^2 - 48^2}{60^2} = 1 - \Big(\frac{48}{60}\Big)^2 = 0.36.$$
Graph: arrival of Maurice (horizontal axis, 8:00 to 9:00) versus arrival of Matthew (vertical axis, 8:00 to 9:00); they meet in the band where the two arrival times differ by at most 12 minutes.
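Example A6.2 can be cross-checked by comparing the geometric result with a Monte Carlo estimate, i.e. with the relative frequency of meetings over many simulated mornings. A minimal sketch, assuming independent uniform arrival times as stated in the example; the function names and the sample size are illustrative choices.

```python
import random


def meet_probability_exact(window=60.0, wait=12.0):
    """Eq. (A6.3): area of the band |x - y| <= wait divided by the area of the square.

    The complement consists of two triangles with legs (window - wait)."""
    return 1.0 - ((window - wait) / window) ** 2


def meet_probability_mc(trials=200_000, window=60.0, wait=12.0, seed=0):
    """Monte Carlo estimate: relative frequency of meetings in simulated mornings."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        maurice = rng.uniform(0.0, window)      # minutes after 8:00
        matthew = rng.uniform(0.0, window)
        if abs(maurice - matthew) <= wait:
            hits += 1
    return hits / trials


print(round(meet_probability_exact(), 4))   # 0.36
print(meet_probability_mc())                # close to 0.36
```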

Another way to determine probabilities is to calculate them from other probabilities which are known. This involves paying attention to the structure of the experiment and application of the rules of probability theory (Appendix A6.4). For example, the predicted reliability of a system can be calculated from the reliability of its elements and the system's structure. However, there is often no alternative to determining probabilities as the limits of relative frequencies, with the aid of statistical methods (Appendices A6.11 and A8).

A6.3 Conditional Probability, Independence
The concept of conditional probability is of great importance in practical applications. It is not difficult to accept that the information "event $A$ has occurred in an experiment" can modify the probabilities of other events. These new probabilities are defined as conditional probabilities and denoted by $\Pr\{B \mid A\}$. If for example $A \subseteq B$, then $\Pr\{B \mid A\} = 1$, which is in general different from the original unconditional probability $\Pr\{B\}$. The concept of conditional probability $\Pr\{B \mid A\}$ of the event $B$ under the condition "event $A$ has occurred" is introduced here using the properties of relative frequency. Let $n$ be the total number of trial repetitions and let $n(A)$, $n(B)$, and $n(A \cap B)$ be the number of occurrences of $A$, $B$, and $A \cap B$, respectively, with $n(A) > 0$ assumed. When considering only the $n(A)$ trials (trials in which $A$ occurs), then $B$ occurs in these $n(A)$ trials exactly when it occurred together with $A$ in the original trial series, i.e. $n(A \cap B)$ times. The relative frequency of $B$ in the trials with the information "$A$ has occurred" is therefore
$$\frac{n(A \cap B)}{n(A)} = \frac{n(A \cap B)/n}{n(A)/n}. \tag{A6.4}$$
Equation (A6.4) leads to the following definition of the conditional probability $\Pr\{B \mid A\}$ of an event $B$ under the condition $A$, i.e. assuming that $A$ has occurred,
$$\Pr\{B \mid A\} = \frac{\Pr\{A \cap B\}}{\Pr\{A\}}, \qquad \Pr\{A\} > 0. \tag{A6.5}$$
From Eq. (A6.5) it follows that
$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B \mid A\} = \Pr\{B\}\,\Pr\{A \mid B\}. \tag{A6.6}$$
Using Eq. (A6.5), probabilities $\Pr\{B \mid A\}$ will be defined for all $B \in \mathcal{F}$. $\Pr\{B \mid A\}$ is a function of $B$ which satisfies Axioms 1 to 3 of Appendix A6.2, obviously with $\Pr\{A \mid A\} = 1$. The information "event $A$ has occurred" thus leads to a new probability space $[A, \mathcal{F}_A, \Pr_A]$, where $\mathcal{F}_A$ consists of events of the form $A \cap B$, with $B \in \mathcal{F}$ and $\Pr_A\{B\} = \Pr\{B \mid A\}$, see Example A6.5. It is reasonable to define the events $A$ and $B$ as independent if the information "event $A$ has occurred" does not influence the probability of the occurrence of event $B$, i.e. if
$$\Pr\{B \mid A\} = \Pr\{B\}. \tag{A6.7}$$
However, when considering Eq. (A6.6), another definition, with symmetry in $A$ and $B$, is obtained, where $\Pr\{A\} > 0$ is not required. Two events $A$ and $B$ are independent if and only if
$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B\}. \tag{A6.8}$$
The events $A_1, \dots, A_n$ are totally independent if for each $k$ ($1 < k \le n$) and any selection of distinct $i_1, \dots, i_k \in \{1, \dots, n\}$
$$\Pr\{A_{i_1} \cap \dots \cap A_{i_k}\} = \Pr\{A_{i_1}\} \cdots \Pr\{A_{i_k}\} \tag{A6.9}$$
holds.

A6.4 Fundamental Rules of Probability Theory
The probability calculation of event combinations is based on the fundamental rules of probability theory introduced in this section.

A6.4.1 Addition Theorem for Mutually Exclusive Events
The events $A$ and $B$ are mutually exclusive if the occurrence of one event excludes the occurrence of the other, formally $A \cap B = \emptyset$. Considering a component which can fail due to a short or an open circuit, the events $A$ = {failure occurs due to a short circuit} and $B$ = {failure occurs due to an open circuit} are mutually exclusive. Application of Axiom 3 (Appendix A6.2) leads to

$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\}. \qquad (A6.10)$$

Equation (A6.10) is considered a theorem by tradition only. Indeed, it is a particular case of Axiom 3 in Appendix A6.2.

Example A6.3
A shipment of 100 diodes contains 3 diodes with shorts and 2 diodes with opens. If one diode is randomly selected from the shipment, what is the probability that it is defective?
Solution
From Eqs. (A6.10) and (A6.2),
$$\Pr\{\text{diode defective}\} = \frac{3}{100} + \frac{2}{100} = \frac{5}{100} = 0.05.$$

If the events $A_1, A_2, \ldots$ are mutually exclusive ($A_i \cap A_j = \emptyset$ for all $i \ne j$), they are also totally exclusive. According to Axiom 3 it follows that

$$\Pr\{A_1 \cup A_2 \cup \cdots\} = \sum_i \Pr\{A_i\}. \qquad (A6.11)$$

A6.4.2 Multiplication Theorem for Two Independent Events

The events A and B are independent if the information about the occurrence (or nonoccurrence) of one event has no influence on the probability of occurrence of the other event. In this case Eq. (A6.8) applies:

$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B\}.$$

Example A6.4
A system consists of two elements $E_1$ and $E_2$, both necessary to fulfill the required function. The failure of one element has no influence on the other. $R_1 = 0.8$ is the reliability of $E_1$ and $R_2 = 0.9$ is that of $E_2$. What is the reliability $R_S$ of the system?
Solution
The elements $E_1$ and $E_2$ are assumed independent, and $R_1$, $R_2$, and $R_S$ are defined as $R_1 = \Pr\{E_1 \text{ fulfills the required function}\}$, $R_2 = \Pr\{E_2 \text{ fulfills the required function}\}$, and $R_S = \Pr\{E_1 \text{ fulfills the required function} \cap E_2 \text{ fulfills the required function}\}$. Eq. (A6.8) then gives
$$R_S = R_1 R_2 = 0.8 \cdot 0.9 = 0.72.$$
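Definitions (A6.5) and (A6.8) can be checked by exact enumeration on a small sample space. The sketch below is a minimal illustration, not part of the original text. It assumes a hypothetical experiment of two fair dice with arbitrarily chosen events, and computes conditional probabilities exactly with `fractions.Fraction`. It also tests independence via the product rule and repeats the series calculation of Example A6.4.

```python
from fractions import Fraction
from itertools import product

# Illustrative sample space (assumption): the 36 equally likely outcomes of two fair dice.
OMEGA = list(product(range(1, 7), repeat=2))


def prob(event):
    """Pr{event} for equally likely outcomes: favorable outcomes / all outcomes."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def cond_prob(b, a):
    """Pr{B | A} = Pr{A and B} / Pr{A}, Eq. (A6.5); requires Pr{A} > 0."""
    pa = prob(a)
    if pa == 0:
        raise ValueError("Pr{A} must be positive")
    return prob(lambda w: a(w) and b(w)) / pa


def independent(a, b):
    """Eq. (A6.8): A and B are independent iff Pr{A and B} = Pr{A} Pr{B}."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


# Hypothetical events, chosen only for illustration.
def first_even(w):
    return w[0] % 2 == 0


def second_six(w):
    return w[1] == 6


def sum_at_least_10(w):
    return w[0] + w[1] >= 10


# Example A6.4: two independent elements in series.
R1, R2 = 0.8, 0.9
RS = R1 * R2

print(cond_prob(second_six, first_even))         # 1/6
print(independent(first_even, second_six))       # True
print(independent(second_six, sum_at_least_10))  # False
print(round(RS, 2))                              # 0.72
```

The last two events show that dependence can be hidden. Knowing that the sum is at least 10 raises the probability of a six on the second die from 1/6 to 1/2.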

A6.4.3 Multiplication Theorem for Arbitrary Events

For arbitrary events A and B, with $\Pr\{A\} > 0$ and $\Pr\{B\} > 0$, Eq. (A6.6) applies:

$$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B \mid A\} = \Pr\{B\}\,\Pr\{A \mid B\}.$$

Example A6.5
2 ICs are randomly selected from a shipment of 95 good and 5 defective ICs. What is the probability of having (i) no defective ICs, and (ii) exactly one defective IC?
Solution
(i) From Eqs. (A6.6) and (A6.2),
$$\Pr\{\text{first IC good} \cap \text{second IC good}\} = \frac{95}{100} \cdot \frac{94}{99} \approx 0.902.$$
(ii) $\Pr\{\text{exactly one defective IC}\} = \Pr\{(\text{first IC good} \cap \text{second IC defective}) \cup (\text{first IC defective} \cap \text{second IC good})\}$; from Eqs. (A6.6) and (A6.2),
$$\Pr\{\text{one IC defective}\} = \frac{95}{100} \cdot \frac{5}{99} + \frac{5}{100} \cdot \frac{95}{99} \approx 0.096.$$

Generalization of Eq. (A6.6) leads to the multiplication theorem

$$\Pr\{A_1 \cap \cdots \cap A_n\} = \Pr\{A_1\}\,\Pr\{A_2 \mid A_1\}\,\Pr\{A_3 \mid (A_1 \cap A_2)\} \cdots \Pr\{A_n \mid (A_1 \cap \cdots \cap A_{n-1})\}. \qquad (A6.12)$$

Here, $\Pr\{A_1 \cap \cdots \cap A_{n-1}\} > 0$ is assumed. An important special case arises when the events $A_1, \ldots, A_n$ are totally independent. In this case Eq. (A6.9) yields

$$\Pr\{A_1 \cap \cdots \cap A_n\} = \Pr\{A_1\} \cdots \Pr\{A_n\} = \prod_{i=1}^{n} \Pr\{A_i\}.$$

A6.4.4 Addition Theorem for Arbitrary Events

The probability of occurrence of at least one of the (possibly non-exclusive) events A and B is given by

$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\}. \qquad (A6.13)$$

To prove this theorem, apply Axiom 3 (Appendix A6.2) to the partitioning of the events $A \cup B$ and $B$ into mutually exclusive events: $A \cup B = A \cup (\bar{A} \cap B)$ and $B = (A \cap B) \cup (\bar{A} \cap B)$.
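The arithmetic of Example A6.5 follows directly from Eq. (A6.6) and can be reproduced with exact fractions. The sketch below uses a hypothetical helper `draw_probabilities` written for this illustration. It draws two items without replacement and adds the two mutually exclusive orderings for the "exactly one defective" case.

```python
from fractions import Fraction


def draw_probabilities(good, defective):
    """Exact probabilities when 2 items are drawn without replacement.

    Returns (Pr{no defective}, Pr{exactly one defective}) using the
    multiplication theorem Pr{A and B} = Pr{A} Pr{B | A}, Eq. (A6.6).
    """
    total = good + defective
    # First good, then second good given that the first was good.
    p_none = Fraction(good, total) * Fraction(good - 1, total - 1)
    # (good, defective) and (defective, good) are mutually exclusive: add them, Eq. (A6.10).
    p_one = (Fraction(good, total) * Fraction(defective, total - 1)
             + Fraction(defective, total) * Fraction(good, total - 1))
    return p_none, p_one


p_none, p_one = draw_probabilities(95, 5)
print(float(p_none), float(p_one))  # approximately 0.902 and 0.096
```

Keeping the fractions exact makes it easy to check that the three outcomes (0, 1, or 2 defective ICs) sum to one.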

Example A6.6
To increase the reliability of a system, 2 machines are used in active (parallel) redundancy. The reliability of each machine is 0.9, and each machine operates and fails independently of the other. What is the system's reliability?
Solution
From Eqs. (A6.13) and (A6.8),
$$\Pr\{\text{the first machine fulfills the required function} \cup \text{the second machine fulfills the required function}\} = 0.9 + 0.9 - 0.9 \cdot 0.9 = 0.99.$$

The addition theorem can be generalized to n arbitrary events. For n = 3 one obtains

$$\begin{aligned} \Pr\{A \cup B \cup C\} &= \Pr\{A \cup (B \cup C)\} = \Pr\{A\} + \Pr\{B \cup C\} - \Pr\{A \cap (B \cup C)\} \\ &= \Pr\{A\} + \Pr\{B\} + \Pr\{C\} - \Pr\{B \cap C\} - \Pr\{A \cap B\} - \Pr\{A \cap C\} + \Pr\{A \cap B \cap C\}. \end{aligned} \qquad (A6.14)$$

In general, $\Pr\{A_1 \cup \cdots \cup A_n\}$ can be obtained by the so-called inclusion/exclusion method

$$\Pr\{A_1 \cup \cdots \cup A_n\} = \sum_{k=1}^{n} (-1)^{k+1} S_k \qquad (A6.15)$$

with

$$S_k = \sum_{1 \le i_1 < \cdots < i_k \le n} \Pr\{A_{i_1} \cap \cdots \cap A_{i_k}\}. \qquad (A6.16)$$

It can be shown that $S = \Pr\{A_1 \cup \cdots \cup A_n\} \le S_1$, $S \ge S_1 - S_2$, $S \le S_1 - S_2 + S_3$, etc. The upper bounds do not necessarily decrease and the lower bounds do not necessarily increase. Even so, a good approximation for S often results from only a few $S_i$. For a further investigation one can use the Fréchet theorem $S_{k+1} \le S_k (n-k)/(k+1)$. It follows from $S_{k+1} = S_k \binom{n}{k+1} / \binom{n}{k} = S_k (n-k)/(k+1)$ for $A_1 = A_2 = \cdots = A_n$.

A6.4.5 Theorem of Total Probability

Let $A_1, A_2, \ldots$ be mutually exclusive events ($A_i \cap A_j = \emptyset$ for all $i \ne j$), $\Omega = A_1 \cup A_2 \cup \cdots$, and $\Pr\{A_i\} > 0$, $i = 1, 2, \ldots$ For an arbitrary event B one has

$$B = B \cap \Omega = B \cap (A_1 \cup A_2 \cup \cdots) = (B \cap A_1) \cup (B \cap A_2) \cup \cdots,$$

where the events $B \cap A_1, B \cap A_2, \ldots$ are mutually exclusive. Use of Axiom 3 (Appendix A6.2) and Eq. (A6.6) yields

$$\Pr\{B\} = \sum_i \Pr\{B \cap A_i\} = \sum_i \Pr\{A_i\}\,\Pr\{B \mid A_i\}. \qquad (A6.17)$$

Equation (A6.17) expresses the theorem (or formula) of total probability.
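The inclusion/exclusion sums (A6.15) and (A6.16), and the alternating bounds on S, can be illustrated numerically. The sketch below assumes that the events are totally independent (Eq. (A6.9)). That assumption is not required by Eq. (A6.15), but it lets each intersection probability be computed as a product. The function name is hypothetical. The function returns the union probability together with the successive partial sums $S_1$, $S_1 - S_2$, and so on.

```python
from itertools import combinations
from math import prod


def union_probability_independent(ps):
    """Pr{A1 u ... u An} by inclusion/exclusion, Eqs. (A6.15) and (A6.16).

    Assumption: the events are totally independent, Eq. (A6.9), so that
    Pr{A_i1 n ... n A_ik} is the product of the individual probabilities.
    Returns (union probability, [S1, S1 - S2, S1 - S2 + S3, ...]).
    """
    partial_sums = []
    total = 0.0
    for k in range(1, len(ps) + 1):
        s_k = sum(prod(c) for c in combinations(ps, k))  # S_k, Eq. (A6.16)
        total += (-1) ** (k + 1) * s_k
        partial_sums.append(total)
    return total, partial_sums


p_union, bounds = union_probability_independent([0.9, 0.9])
print(round(p_union, 4))  # 0.99, as in Example A6.6
```

For independent events the result must also equal $1 - \prod_i (1 - p_i)$. This gives a simple cross-check, and the same assumption makes the bounds $S \le S_1$, $S \ge S_1 - S_2$, ... easy to inspect.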

Example A6.7
ICs are purchased from 3 suppliers ($A_1$, $A_2$, $A_3$) in quantities of 1000, 600, and 400 pieces, respectively. The probabilities for an IC to be defective are … for $A_1$, 0.02 for $A_2$, and 0.03 for $A_3$. The ICs are stored in a common container, disregarding their source. What is the probability that one IC randomly selected from the stock is defective?
Solution
From Eqs. (A6.17) and (A6.2),
$$\Pr\{\text{the selected IC is defective}\} = \frac{1000}{2000}\,\Pr\{\text{defective} \mid A_1\} + \frac{600}{2000} \cdot 0.02 + \frac{400}{2000} \cdot 0.03.$$

Equations (A6.17) and (A6.6) lead to Bayes' theorem. It gives the a posteriori probability $\Pr\{A_k \mid B\}$, $k = 1, 2, \ldots$, as a function of the a priori probabilities $\Pr\{A_i\}$:

$$\Pr\{A_k \mid B\} = \frac{\Pr\{A_k \cap B\}}{\Pr\{B\}} = \frac{\Pr\{A_k\}\,\Pr\{B \mid A_k\}}{\sum_i \Pr\{A_i\}\,\Pr\{B \mid A_i\}}. \qquad (A6.18)$$

Example A6.8
Let the IC selected in Example A6.7 be defective. What is the probability that it is from supplier $A_1$?
Solution
From Eq. (A6.18),
$$\Pr\{\text{IC from } A_1 \mid \text{IC defective}\} = \frac{\frac{1000}{2000}\,\Pr\{\text{defective} \mid A_1\}}{\Pr\{\text{the selected IC is defective}\}},$$
with the denominator as obtained in Example A6.7.

A6.5 Random Variables, Distribution Functions

If the result of an experiment with a random outcome is a (real) number, then the underlying quantity is a (real) random variable. For example, the number appearing when throwing a die is a random variable taking on values in {1, ..., 6}. Random variables are designated in this book with Greek letters τ, ξ, ζ, etc. The triplet $[\Omega, \mathcal{F}, \Pr]$ introduced in Appendix A6.2 becomes $[\mathcal{R}, \mathcal{B}, \Pr]$, where $\mathcal{R} = (-\infty, \infty)$ and $\mathcal{B}$ is the smallest event field containing all (semi-)intervals $(a, b]$ with $a < b$. The probabilities $\Pr\{A\} = \Pr\{\tau \in A\}$, $A \in \mathcal{B}$, define the distribution law of the random variable τ. Among the many possibilities to characterize this distribution law, the most frequently used is to define

$$F(t) = \Pr\{\tau \le t\}. \qquad (A6.19)$$

F(t) is called the distribution function of the random variable τ +). For each t, F(t) gives the probability that the random variable will assume a value smaller than or equal to t. Because $\{\tau \le t\} \subseteq \{\tau \le s\}$ for $s > t$, F(t) is a nondecreasing function. Moreover, $F(-\infty) = 0$ and $F(\infty) = 1$. If $\Pr\{\tau = t_0\} > 0$ holds, then F(t) has a jump of height $\Pr\{\tau = t_0\}$ at $t_0$. It follows from the above definition and Axiom 3 (Appendix A6.2) that F(t) is continuous from the right. Due to Axiom 2, F(t) can have at most a countable number of jumps. The probability that the random variable τ takes on a value within the interval $(a, b]$ is given by

$$\Pr\{a < \tau \le b\} = F(b) - F(a).$$

The following classes of random variables are of particular importance:

1. Discrete random variables: A random variable τ is discrete if it can only assume a finite or countable number of values, i.e. if there is a sequence $t_1, t_2, \ldots$ such that
$$p_k = \Pr\{\tau = t_k\}, \qquad \text{with} \quad \sum_k p_k = 1.$$
A discrete random variable is best described by a table
$$\begin{array}{l|cccc} \text{Values of } \tau & t_1 & t_2 & \ldots \\ \hline \text{Probabilities} & p_1 & p_2 & \ldots \end{array} \qquad (A6.20)$$
The distribution function F(t) of a discrete random variable τ is a step function
$$F(t) = \sum_{k:\, t_k \le t} p_k.$$
If the sequence $t_1, t_2, \ldots$ is ordered so that $t_k < t_{k+1}$, then
$$F(t) = \sum_{j \le k} p_j, \qquad t_k \le t < t_{k+1}. \qquad (A6.21)$$
If only the value k = 1 occurs in Eq. (A6.21), τ is a constant ($\tau = t_1 = C$). A constant C can thus be regarded as a random variable with distribution function
$$F(t) = \begin{cases} 0 & \text{for } t < C \\ 1 & \text{for } t \ge C. \end{cases}$$
An important special case of discrete random variables is that of arithmetic random variables. The random variable τ is arithmetic if it can take the values $\ldots, -\Delta t, 0, \Delta t, \ldots$, with probabilities
$$p_k = \Pr\{\tau = k\,\Delta t\}, \qquad k = \ldots, -1, 0, 1, \ldots$$

+) From a mathematical point of view, the random variable τ is defined as a measurable mapping of Ω onto the axis of real numbers $\mathcal{R} = (-\infty, \infty)$. Such a mapping has the property that, for each real value x, the set of ω for which $\{\tau = \tau(\omega) \le x\}$ belongs to $\mathcal{F}$. The distribution function of τ is then obtained by setting $F(t) = \Pr\{\tau \le t\} = \Pr\{\omega : \tau(\omega) \le t\}$.

2. Continuous random variables: The random variable τ is absolutely continuous if a function $f(t) \ge 0$ exists such that
$$F(t) = \Pr\{\tau \le t\} = \int_{-\infty}^{t} f(x)\,dx. \qquad (A6.22)$$
f(t) is called the (probability) density of the random variable τ and satisfies the condition $\int_{-\infty}^{\infty} f(t)\,dt = 1$. The distribution function F(t) and the density f(t) are related (almost everywhere) by
$$f(t) = \frac{dF(t)}{dt}, \qquad (A6.23)$$
see Fig. A6.2 for an example.

[Figure A6.2: Relationship between the distribution function F(t) and the density f(t) for a continuous random variable τ > 0; $F(b) - F(a) = \int_a^b f(t)\,dt = \Pr\{a < \tau \le b\}$.]

Mixed distribution functions, exhibiting both jumps and continuous growth, can

occur in some applications. These distribution functions can generally be represented by a mixture (weighted sum) of discrete and continuous distribution functions (Eq. (A6.33)).

In reliability theory, τ represents the failure-free operating time of the item under consideration. τ is thus a nonnegative random variable, i.e. $\tau \ge 0$ and $F(t) = 0$ for $t < 0$. Often τ is a positive random variable, i.e. $\tau > 0$ and $F(0) = 0$. The reliability function (survival function) R(t) of an item gives the probability that the item will operate failure-free in the interval $(0, t]$:

$$R(t) = \Pr\{\tau > t\} = 1 - F(t). \qquad (A6.24)$$

The failure rate λ(t) of an item exhibiting a continuous failure-free operating time τ is defined as

$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t}\,\Pr\{t < \tau \le t + \delta t \mid \tau > t\}.$$

Calculation leads to (Eq. (A6.5) and Fig. A6.3)

$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \cdot \frac{\Pr\{t < \tau \le t + \delta t \,\cap\, \tau > t\}}{\Pr\{\tau > t\}} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \cdot \frac{\Pr\{t < \tau \le t + \delta t\}}{\Pr\{\tau > t\}},$$

and thus, assuming $f(t) = dF(t)/dt$ exists,

$$\lambda(t) = \frac{f(t)}{1 - F(t)} = -\frac{dR(t)/dt}{R(t)}. \qquad (A6.25)$$

[Figure A6.3: Visual aid to compute the failure rate λ(t).]

It is important to distinguish between the failure rate λ(t) and the density f(t). For an item with failure-free operating time τ and small δt, $f(t)\,\delta t$ is the unconditional probability of failure in $(t, t + \delta t]$, given that the item is new at t = 0. By contrast, $\lambda(t)\,\delta t$ is, for small δt, the conditional probability of failure in the interval $(t, t + \delta t]$, given that the item was new at t = 0 and has not failed in $(0, t]$. A deeper discussion of this kind of difference is in Section 1.2.3, Section 7.6, and Appendix A. Assuming R(0) = 1, Eq. (A6.25) yields, for $t \ge 0$,

$$\Pr\{\tau > t\} = R(t) = e^{-\int_0^t \lambda(x)\,dx}, \qquad (A6.26)$$

and, for given (fixed) $x_0 > 0$,

$$\Pr\{\tau > t + x_0 \mid \tau > x_0\} = R(t, x_0) = \frac{R(t + x_0)}{R(x_0)} = e^{-\int_{x_0}^{t + x_0} \lambda(x)\,dx}, \qquad (A6.27)$$

from which

$$-\frac{dR(t, x_0)/dt}{R(t, x_0)} = \lambda(t + x_0)$$

and

$$E[\tau - x_0 \mid \tau > x_0] = \frac{\int_{x_0}^{\infty} R(x)\,dx}{R(x_0)}. \qquad (A6.28)$$

The shape of the failure rate λ(t) reveals the aging behavior of an item. Assuming λ(t) is nondecreasing, it follows for $u < s$ and any $t > 0$ that

$$\Pr\{\tau > t + u \mid \tau > u\} \ge \Pr\{\tau > t + s \mid \tau > s\}. \qquad (A6.29)$$

For an item with increasing failure rate, inequality (A6.29) shows that the probability of surviving a further period t decreases as a function of the achieved age, i.e. the item ages. The contrary holds for an item with decreasing failure rate. No aging exists in the case of a constant failure rate, i.e. for $R(t) = e^{-\lambda t}$. This yields the memoryless property of the exponential distribution:

$$\Pr\{\tau > t + x_0 \mid \tau > x_0\} = \Pr\{\tau > t\} = e^{-\lambda t}. \qquad (A6.30)$$

For an arithmetic random variable, the failure rate is defined as

$$\lambda(k) = \Pr\{\tau = k\,\Delta t \mid \tau > (k-1)\,\Delta t\} = \frac{p_k}{\sum_{i \ge k} p_i}, \qquad k = 1, 2, \ldots \qquad (A6.31)$$

The following concepts are important to reliability theory (see also Eq. (A6.78) for the minimum $\tau_{\min}$ and Eq. (A6.79) for the maximum $\tau_{\max}$ of a set of random variables $\tau_1, \ldots, \tau_n$):

1. Function of a random variable: Let u(x) be a monotonically increasing function and τ a continuous random variable with distribution function $F_\tau(t)$. Then the random variable $\eta = u(\tau)$ has the distribution function (Example A6.17)
$$F_\eta(t) = \Pr\{\eta = u(\tau) \le t\} = \Pr\{\tau \le u^{-1}(t)\} = F_\tau(u^{-1}(t)), \qquad (A6.32)$$
where $u^{-1}$ is the inverse function of u. If $du(t)/dt$ exists, then $f_\eta(t) = f_\tau(u^{-1}(t)) \cdot du^{-1}(t)/dt$.
2. Mixture of distributions: In many practical applications the situation arises in which two or more failure mechanisms have to be considered for a given item. The following are some examples for the case of two failure mechanisms,

(e.g. early failures and wearout, early failures and constant failure rate, etc.) which can appear with distribution functions $F_1(t)$ and $F_2(t)$, respectively:
- for any given item, only early failures (with probability p) or wearout (with probability 1 − p) can appear,
- both failure mechanisms can appear in any item,
- a percentage p will show both failure mechanisms and 1 − p only one failure mechanism, e.g. wearout governed by $F_2(t)$.
In these three cases, the distribution function F(t) of the failure-free operating time is:
$$\begin{aligned} F(t) &= p\,F_1(t) + (1 - p)\,F_2(t), \\ F(t) &= 1 - (1 - F_1(t))(1 - F_2(t)) = F_1(t) + F_2(t) - F_1(t)\,F_2(t), \\ F(t) &= p\,(F_1(t) + F_2(t) - F_1(t)\,F_2(t)) + (1 - p)\,F_2(t) = p\,F_1(t) + F_2(t) - p\,F_1(t)\,F_2(t). \end{aligned} \qquad (A6.33)$$
The first case gives a mixture with weights p and 1 − p (Example 7.16). The second case corresponds to the series model with two independent elements (Eq. (2.17)). The third case is a combination of both previous cases.
3. Distributions with random parameters: If the distribution function of a random variable τ depends on a parameter λ with density $f_\lambda(x)$, then for τ it holds that
$$F(t) = \Pr\{\tau \le t\} = \int_0^\infty F(t, x)\, f_\lambda(x)\,dx, \qquad \tau, \lambda \ge 0. \qquad (A6.34)$$

The main properties of the distribution functions frequently used in reliability theory are summarized in Table A6.1 and will be discussed in Appendix A6.10.

A6.6 Numerical Parameters of Random Variables

For a rough characterization of a random variable τ, some typical values such as the expected value (mean), variance, and median can be used.

A6.6.1 Expected Value (Mean)

For a discrete random variable τ taking values $t_1, t_2, \ldots$, with probabilities $p_1, p_2, \ldots$,

the expected value or mean E[τ] is given by

$$E[\tau] = \sum_k t_k\, p_k, \qquad (A6.35)$$

provided the series converges absolutely. If τ only takes the values $t_1, \ldots, t_m$, Eq. (A6.35) can be heuristically explained as follows. Consider n repetitions of a trial whose outcome is τ. Assume that the value $t_1$ has been observed $k_1$ times, ..., and the value $t_m$ $k_m$ times ($n = k_1 + \cdots + k_m$). The arithmetic mean of the observed values is

$$\frac{t_1 k_1 + \cdots + t_m k_m}{n} = t_1 \frac{k_1}{n} + \cdots + t_m \frac{k_m}{n}.$$

As $n \to \infty$, $k_i/n$ converges to $p_i$ (Eq. (A6.146)), and the arithmetic mean obtained above tends towards the expected value E[τ] given by Eq. (A6.35). For this reason, the terms expected value and mean are often used for the same quantity E[τ]. From Eq. (A6.35), the mean of a constant C is the constant itself, i.e. E[C] = C. The mean of a continuous random variable τ with density f(t) is given by

$$E[\tau] = \int_{-\infty}^{\infty} t\, f(t)\,dt, \qquad (A6.36)$$

provided the integral converges absolutely. For positive continuous random variables, Eq. (A6.36) reduces to

$$E[\tau] = \int_0^\infty t\, f(t)\,dt, \qquad (A6.37)$$

which, for $E[\tau] < \infty$, can be expressed (Example A6.9) as

$$E[\tau] = \int_0^\infty (1 - F(t))\,dt = \int_0^\infty R(t)\,dt. \qquad (A6.38)$$

Example A6.9
Prove the equivalence of Eqs. (A6.37) and (A6.38).
Solution
From $R(t) = 1 - F(t) = \int_t^\infty f(x)\,dx$ it follows that
$$\int_0^\infty R(t)\,dt = \int_0^\infty \left( \int_t^\infty f(x)\,dx \right) dt.$$
Changing the order of integration over the domain $0 \le t \le x < \infty$ yields
$$\int_0^\infty R(t)\,dt = \int_0^\infty \left( \int_0^x dt \right) f(x)\,dx = \int_0^\infty x\, f(x)\,dx.$$
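Eqs. (A6.26) and (A6.38) chain together: starting from a failure rate, integrate it to get R(t), then integrate R(t) to get the mean. The sketch below is a minimal numerical illustration of that chain. It is not a method from the text. It assumes a Weibull failure rate (Eq. (A6.91)) with arbitrarily chosen parameters and uses a hand-written composite Simpson rule so that no external library is needed. The infinite upper limit in Eq. (A6.38) is truncated at a point where R(t) is negligible.

```python
import math


def simpson(f, a, b, n=200):
    """Composite Simpson rule on [a, b]; n is rounded up to an even number."""
    n += n % 2
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))
    return s * h / 3


def reliability_from_failure_rate(rate, t):
    """R(t) = exp(-integral of rate over (0, t]), Eq. (A6.26), with R(0) = 1."""
    if t <= 0:
        return 1.0
    return math.exp(-simpson(rate, 0.0, t))


def mean_from_reliability(rel, t_max, n=2000):
    """E[tau] = integral of R(t) over (0, inf), Eq. (A6.38), truncated at t_max."""
    return simpson(rel, 0.0, t_max, n)


# Illustrative Weibull failure rate, Eq. (A6.91); lam and beta are arbitrary choices.
lam, beta = 1e-3, 2.0


def weibull_rate(t):
    return beta * lam * (lam * t) ** (beta - 1)


def weibull_rel(t):
    return reliability_from_failure_rate(weibull_rate, t)


mttf = mean_from_reliability(weibull_rel, t_max=10 / lam)
# Numerical result vs. the closed form Gamma(1 + 1/beta)/lam of Eq. (A6.92).
print(mttf, math.gamma(1 + 1 / beta) / lam)
```

For β = 2 the failure rate is linear in t, so the inner Simpson integral is exact. Other shapes (in particular β < 1, where λ(0) is infinite) would need more care near t = 0.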

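Table A6.1 Distribution functions used in reliability analysis

| Name | Distribution function $F(t) = \Pr\{\tau \le t\}$ | Density $f(t) = dF(t)/dt$, or $p_i = \Pr\{\zeta = i\}$ | Parameter range |
|---|---|---|---|
| Exponential | $1 - e^{-\lambda t}$ | $\lambda e^{-\lambda t}$ | $t \ge 0$; $\lambda > 0$ |
| Weibull | $1 - e^{-(\lambda t)^\beta}$ | $\lambda\beta(\lambda t)^{\beta-1} e^{-(\lambda t)^\beta}$ | $t \ge 0$; $\lambda, \beta > 0$ |
| Gamma | $\frac{1}{\Gamma(\beta)} \int_0^{\lambda t} x^{\beta-1} e^{-x}\,dx$ | $\frac{\lambda(\lambda t)^{\beta-1}}{\Gamma(\beta)} e^{-\lambda t}$ | $t \ge 0$; $\lambda, \beta > 0$ |
| Chi-square ($\chi^2_\nu$) | $\frac{1}{2^{\nu/2}\Gamma(\nu/2)} \int_0^t x^{\nu/2-1} e^{-x/2}\,dx$ | $\frac{t^{\nu/2-1} e^{-t/2}}{2^{\nu/2}\Gamma(\nu/2)}$ | $t \ge 0$; $\nu = 1, 2, \ldots$ (degrees of freedom) |
| Normal | $\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^t e^{-(x-m)^2/(2\sigma^2)}\,dx$ | $\frac{1}{\sigma\sqrt{2\pi}} e^{-(t-m)^2/(2\sigma^2)}$ | $-\infty < t, m < \infty$; $\sigma > 0$ |
| Lognormal | $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\ln(\lambda t)/\sigma} e^{-x^2/2}\,dx$ | $\frac{1}{t\sigma\sqrt{2\pi}} e^{-(\ln(\lambda t))^2/(2\sigma^2)}$ | $t > 0$; $\lambda, \sigma > 0$ |
| Binomial | $\Pr\{\zeta \le k\} = \sum_{i=0}^k p_i$ | $p_i = \binom{n}{i} p^i (1-p)^{n-i}$ | $k = 0, \ldots, n$; $0 < p < 1$ |
| Poisson | $\Pr\{\zeta \le k\} = \sum_{i=0}^k p_i$ | $p_i = \frac{m^i}{i!} e^{-m}$ | $k = 0, 1, \ldots$; $m > 0$ |
| Geometric | $\Pr\{\zeta \le k\} = \sum_{i=1}^k p_i = 1 - (1-p)^k$ | $p_i = p(1-p)^{i-1}$ | $k = 1, 2, \ldots$; $0 < p < 1$ |
| Hypergeometric | $\Pr\{\zeta \le k\} = \sum_{i=0}^k p_i$ | $p_i = \binom{K}{i}\binom{N-K}{n-i} / \binom{N}{n}$ | $k = 0, 1, \ldots, \min(K, n)$ |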

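Table A6.1 (cont.)

| Name | Failure rate $\lambda(t) = f(t)/(1 - F(t))$ | Mean $E[\tau]$ | Variance $\mathrm{Var}[\tau]$ | Properties |
|---|---|---|---|---|
| Exponential | $\lambda$ | $1/\lambda$ | $1/\lambda^2$ | Memoryless: $\Pr\{\tau > t + x_0 \mid \tau > x_0\} = \Pr\{\tau > t\} = e^{-\lambda t}$ |
| Weibull | $\beta\lambda(\lambda t)^{\beta-1}$ | $\Gamma(1 + 1/\beta)/\lambda$ | $[\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)]/\lambda^2$ | Monotonic failure rate: increasing for $\beta > 1$ ($\lambda(0) = 0$, $\lambda(\infty) = \infty$), decreasing for $\beta < 1$ ($\lambda(0) = \infty$, $\lambda(\infty) = 0$) |
| Gamma | | $\beta/\lambda$ | $\beta/\lambda^2$ | Laplace transform exists: $\tilde f(s) = \lambda^\beta/(s + \lambda)^\beta$; monotonic failure rate with $\lambda(\infty) = \lambda$; Erlangian for $\beta = n = 1, 2, \ldots$ (distribution of the sum of n exponentially distributed random variables) |
| Chi-square | | $\nu$ | $2\nu$ | Gamma with $\beta = \nu/2$ and $\lambda = 1/2$; for $\nu = 2, 4, \ldots$: $F(t) = 1 - \sum_{i=0}^{\nu/2-1} \frac{(t/2)^i}{i!} e^{-t/2}$ |
| Normal | | $m$ | $\sigma^2$ | $F(t) = \Phi((t - m)/\sigma)$, with $\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^t e^{-x^2/2}\,dx$ |
| Lognormal | | $e^{\sigma^2/2}/\lambda$ | $(e^{2\sigma^2} - e^{\sigma^2})/\lambda^2$ | $\ln \tau$ has a normal distribution; $F(t) = \Phi(\ln(\lambda t)/\sigma)$ |
| Binomial | not relevant | $np$ | $np(1-p)$ | $p_i = \Pr\{i \text{ successes in } n \text{ Bernoulli trials}\}$ ($n$ independent trials with $\Pr\{A\} = p$); random sample with replacement |
| Poisson | not relevant | $m$ | $m$ | $\binom{n}{i} p^i (1-p)^{n-i} \approx \frac{(np)^i}{i!} e^{-np}$; $m = \lambda t \Rightarrow \frac{(\lambda t)^i}{i!} e^{-\lambda t} = \Pr\{i \text{ failures in } (0, t] \mid \text{exponentially distributed failure-free operating times with parameter } \lambda\}$ |
| Geometric | not relevant | $1/p$ | $(1-p)/p^2$ | Memoryless: $\Pr\{\zeta > i + j \mid \zeta > i\} = (1-p)^j$; $p_i = \Pr\{\text{first success in a sequence of Bernoulli trials occurs at the } i\text{-th trial}\}$ ($p_i = p(1-p)^i$ with $i = 0, 1, \ldots$ would give $E[\zeta] = (1-p)/p$ and $\mathrm{Var}[\zeta] = (1-p)/p^2$) |
| Hypergeometric | not relevant | $nK/N$ | $\frac{K n (N-K)(N-n)}{N^2 (N-1)}$ | Random sample without replacement |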
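The mean and variance entries for the discrete distributions in Table A6.1 can be spot-checked by summing the probabilities directly with Eqs. (A6.35) and (A6.43). The sketch below is a minimal check under stated assumptions. The helper names and parameter values are hypothetical. The Poisson and geometric distributions are truncated at a large index, whose neglected tail is far below the test tolerances.

```python
from math import comb, exp, factorial


def moments(pmf):
    """Mean (A6.35) and variance (A6.43) of a discrete distribution {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var


def binomial(n, p):
    return {i: comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)}


def poisson(m, k_max=100):
    # Truncated at k_max (assumption: the tail beyond k_max is negligible).
    return {i: m**i / factorial(i) * exp(-m) for i in range(k_max + 1)}


def geometric(p, k_max=2000):
    # p_i = p (1 - p)**(i - 1), i = 1, 2, ..., truncated at k_max.
    return {i: p * (1 - p) ** (i - 1) for i in range(1, k_max + 1)}


def hypergeometric(N, K, n):
    lo, hi = max(0, n - (N - K)), min(K, n)
    return {i: comb(K, i) * comb(N - K, n - i) / comb(N, n) for i in range(lo, hi + 1)}


print(moments(binomial(8, 0.3)))   # compare with np and np(1 - p)
print(moments(geometric(0.2)))     # compare with 1/p and (1 - p)/p**2
```

Direct summation avoids relying on the closed forms being tested, at the cost of the truncation assumption for infinite supports.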

For the expected value of the random variable $\eta = u(\tau)$,

$$E[\eta] = \sum_k u(t_k)\, p_k \qquad \text{or} \qquad E[\eta] = \int_{-\infty}^{\infty} u(t)\, f(t)\,dt \qquad (A6.39)$$

holds, provided that the series and the integral converge absolutely. Two particular cases of Eq. (A6.39) are:
1. $u(x) = Cx$,
$$E[C\tau] = \int_{-\infty}^{\infty} C\,t\, f(t)\,dt = C\,E[\tau]. \qquad (A6.40)$$
2. $u(x) = x^k$, which leads to the kth moment of τ,
$$E[\tau^k] = \int_{-\infty}^{\infty} t^k f(t)\,dt, \qquad k > 1. \qquad (A6.41)$$

Further important properties of the mean are given by Eqs. (A6.68) and (A6.69).

A6.6.2 Variance

The variance of a random variable τ is a measure of the spread (or dispersion) of the random variable around its mean E[τ]. Variance is defined as

$$\mathrm{Var}[\tau] = E[(\tau - E[\tau])^2], \qquad (A6.42)$$

and can be calculated as

$$\mathrm{Var}[\tau] = \sum_k (t_k - E[\tau])^2\, p_k \qquad (A6.43)$$

for a discrete random variable, and as

$$\mathrm{Var}[\tau] = \int_{-\infty}^{\infty} (t - E[\tau])^2 f(t)\,dt \qquad (A6.44)$$

for a continuous random variable. In both cases,

$$\mathrm{Var}[\tau] = E[\tau^2] - (E[\tau])^2. \qquad (A6.45)$$

If E[τ] or Var[τ] is infinite, τ is said to have an infinite variance. For arbitrary constants C and A, Eqs. (A6.45) and (A6.40) yield

$$\mathrm{Var}[C\tau - A] = C^2\,\mathrm{Var}[\tau] \qquad (A6.46)$$

and

$$\mathrm{Var}[C] = 0.$$

The quantity

$$\sigma = \sqrt{\mathrm{Var}[\tau]} \qquad (A6.47)$$

is the standard deviation of τ and, for $\tau \ge 0$,

$$\kappa = \frac{\sigma}{E[\tau]} \qquad (A6.48)$$

is the coefficient of variation of τ. The random variable

$$\frac{\tau - E[\tau]}{\sqrt{\mathrm{Var}[\tau]}}$$

has mean 0 and variance 1, and is a standardized random variable.

Chebyshev's inequality illustrates well how the variance measures dispersion. It states (Example A6.10) that for every ε > 0

$$\Pr\{|\tau - E[\tau]| \ge \varepsilon\} \le \frac{\mathrm{Var}[\tau]}{\varepsilon^2}. \qquad (A6.49)$$

The Chebyshev inequality is more useful in proving convergence than as an approximation. Further important properties of the variance are given by Eqs. (A6.70) and (A6.71).

Example A6.10
Prove the Chebyshev inequality for a continuous random variable (Eq. (A6.49)).
Solution
For a continuous random variable τ with density f(t), the definition of the variance implies
$$\Pr\{|\tau - E[\tau]| \ge \varepsilon\} = \int_{|t - E[\tau]| \ge \varepsilon} f(t)\,dt \le \int_{|t - E[\tau]| \ge \varepsilon} \frac{(t - E[\tau])^2}{\varepsilon^2} f(t)\,dt \le \frac{1}{\varepsilon^2} \int_{-\infty}^{\infty} (t - E[\tau])^2 f(t)\,dt = \frac{1}{\varepsilon^2}\,\mathrm{Var}[\tau],$$
which proves Eq. (A6.49).

Generalization of the exponent in Eqs. (A6.43) and (A6.44) leads to the kth central moment of τ

$$E[(\tau - E[\tau])^k] = \int_{-\infty}^{\infty} (t - E[\tau])^k f(t)\,dt, \qquad k > 1. \qquad (A6.50)$$
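The remark that Chebyshev's inequality is a poor approximation can be made concrete by comparing it with an exact tail probability. The sketch below assumes an exponentially distributed τ (mean 1/λ, variance 1/λ², Eqs. (A6.84) and (A6.85)) and reuses the failure rate of Example A6.13 as an arbitrary choice. The function names are hypothetical.

```python
import math


def exp_tail_deviation(lam, eps):
    """Exact Pr{|tau - E[tau]| >= eps} for an exponential tau with parameter lam."""
    mean = 1.0 / lam
    upper = math.exp(-lam * (mean + eps))  # Pr{tau >= mean + eps}
    lower_limit = mean - eps
    # Pr{tau <= mean - eps}; zero when mean - eps <= 0 because tau > 0.
    lower = 1.0 - math.exp(-lam * lower_limit) if lower_limit > 0 else 0.0
    return lower + upper


def chebyshev_bound(var, eps):
    """Right-hand side of Eq. (A6.49): Var[tau] / eps**2."""
    return var / eps**2


lam = 1e-5  # 1/h, the value used in Example A6.13 (arbitrary here)
for k in (1.5, 2.0, 3.0):  # eps in multiples of the standard deviation 1/lam
    eps = k / lam
    print(k, exp_tail_deviation(lam, eps), chebyshev_bound(1 / lam**2, eps))
```

For ε of three standard deviations, the exact tail is about $e^{-4} \approx 0.018$, while the bound is 1/9. The bound holds for every distribution with finite variance, which is why it is loose for any particular one.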

A6.6.3 Modal Value, Quantile, Median

In addition to the moments discussed in Appendices A6.6.1 and A6.6.2, the modal value, quantile, and median are defined as follows:
1. For a continuous random variable τ, the modal value is the value of t for which f(t) reaches its maximum; the distribution of τ is multimodal if f(t) exhibits more than one maximum.
2. The q quantile is the value $t_q$ for which F(t) reaches the value q, $t_q = \inf\{t : F(t) \ge q\}$; in general, $F(t_q) = q$ for a continuous random variable.
3. The 0.5 quantile is the median.

A6.7 Multidimensional Random Variables, Conditional Distributions

Multidimensional random variables (random vectors) are often required in reliability and availability investigations of repairable systems. For random vectors, the outcome of an experiment is an element of the n-dimensional space $\mathcal{R}^n$. The probability space $[\Omega, \mathcal{F}, \Pr]$ introduced in Appendix A6.1 becomes $[\mathcal{R}^n, \mathcal{B}^n, \Pr]$, where $\mathcal{B}^n$ is the smallest event field which contains all "intervals" of the form

$$(a_1, b_1] \times \cdots \times (a_n, b_n] = \{(t_1, \ldots, t_n) : t_i \in (a_i, b_i],\ i = 1, \ldots, n\}.$$

Random vectors are designated by Greek letters with an arrow, i.e. $\vec\tau = (\tau_1, \ldots, \tau_n)$, $\vec\xi = (\xi_1, \ldots, \xi_n)$, etc. The probabilities $\Pr\{A\} = \Pr\{\vec\tau \in A\}$, $A \in \mathcal{B}^n$, define the distribution law of $\vec\tau$. The function

$$F(t_1, \ldots, t_n) = \Pr\{\tau_1 \le t_1, \ldots, \tau_n \le t_n\}, \qquad (A6.51)$$

where $\{\tau_1 \le t_1, \ldots, \tau_n \le t_n\} \equiv \{(\tau_1 \le t_1) \cap \cdots \cap (\tau_n \le t_n)\}$, is the distribution function of the random vector $\vec\tau$, known as the joint distribution function of $\tau_1, \ldots, \tau_n$. $F(t_1, \ldots, t_n)$ is:
- monotonically nondecreasing in each variable,
- zero (in the limit) if at least one variable goes to −∞,
- one (in the limit) if all variables go to ∞,
- continuous from the right in each variable,
- such that the probabilities $\Pr\{a_1 < \tau_1 \le b_1, \ldots, a_n < \tau_n \le b_n\}$, calculated for arbitrary $a_1, \ldots, a_n, b_1, \ldots, b_n$ with $a_i < b_i$, are not negative; for example, n = 2 yields
$$\Pr\{a_1 < \tau_1 \le b_1,\ a_2 < \tau_2 \le b_2\} = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2).$$
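The n = 2 rectangle formula above can be checked against a case where the answer is known. The sketch below assumes a hypothetical joint distribution function built from two independent exponential components (Eq. (A6.53)) with arbitrary parameters. For such a distribution the rectangle probability must factor into the product of the two one-dimensional interval probabilities.

```python
import math


def rectangle_probability(F, a1, b1, a2, b2):
    """Pr{a1 < tau1 <= b1, a2 < tau2 <= b2} from a joint distribution function F (n = 2)."""
    return F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)


def F_indep_exp(t1, t2, lam1=1.0, lam2=2.0):
    """Hypothetical joint distribution function: independent exponentials, Eq. (A6.53)."""
    F1 = 1 - math.exp(-lam1 * t1) if t1 > 0 else 0.0
    F2 = 1 - math.exp(-lam2 * t2) if t2 > 0 else 0.0
    return F1 * F2


p = rectangle_probability(F_indep_exp, 0.5, 1.5, 0.2, 0.7)
print(p)  # equals (F1(1.5) - F1(0.5)) * (F2(0.7) - F2(0.2)) for this F
```

A valid joint distribution function must give a nonnegative value here for every rectangle. This requirement is stronger than monotonicity in each variable separately.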

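It can be shown that every component $\tau_i$ of $\vec\tau = (\tau_1, \ldots, \tau_n)$ is a random variable with distribution function

$$F_i(t) = \Pr\{\tau_i \le t\} = F(\infty, \ldots, \infty, t, \infty, \ldots, \infty) \quad (t \text{ in the } i\text{th position}), \qquad (A6.52)$$

the marginal distribution function. The components $\tau_1, \ldots, \tau_n$ of $\vec\tau$ are (stochastically) independent if and only if, for any n and n-tuple $(t_1, \ldots, t_n) \in \mathcal{R}^n$,

$$F(t_1, \ldots, t_n) = \prod_{i=1}^n F_i(t_i). \qquad (A6.53)$$

It can be shown that Eq. (A6.53) is equivalent to

$$\Pr\left\{\bigcap_{i=1}^n (\tau_i \in B_i)\right\} = \prod_{i=1}^n \Pr\{\tau_i \in B_i\}$$

for every $B_i \in \mathcal{B}$, $i = 1, \ldots, n$.

The random vector $\vec\tau = (\tau_1, \ldots, \tau_n)$ is absolutely continuous if a function $f(x_1, \ldots, x_n) \ge 0$ exists such that for any n and n-tuple $t_1, \ldots, t_n$

$$F(t_1, \ldots, t_n) = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_n} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \qquad (A6.54)$$

$f(x_1, \ldots, x_n)$ is the density of $\vec\tau$, also known as the joint density of $\tau_1, \ldots, \tau_n$, and satisfies the condition

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = 1.$$

For any subset $A \in \mathcal{B}^n$, it follows that

$$\Pr\{(\tau_1, \ldots, \tau_n) \in A\} = \int \cdots \int_A f(t_1, \ldots, t_n)\,dt_1 \cdots dt_n. \qquad (A6.55)$$

The density of $\tau_i$ (marginal density) can be obtained from $f(t_1, \ldots, t_n)$ as

$$f_i(t_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(t_1, \ldots, t_n)\,dt_1 \cdots dt_{i-1}\,dt_{i+1} \cdots dt_n. \qquad (A6.56)$$

The components $\tau_1, \ldots, \tau_n$ of a continuous random vector $\vec\tau$ are (stochastically) independent if and only if, for any n and n-tuple $t_1, \ldots, t_n \in \mathcal{R}^n$,

$$f(t_1, \ldots, t_n) = \prod_{i=1}^n f_i(t_i). \qquad (A6.57)$$

For a two-dimensional continuous random vector $\vec\tau = (\tau_1, \tau_2)$, the function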

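$$f_2(t_2 \mid t_1) = \frac{f(t_1, t_2)}{f_1(t_1)} \qquad (A6.58)$$

is the conditional density of $\tau_2$ under the condition $\tau_1 = t_1$, with $f_1(t_1) > 0$. Similarly, $f_1(t_1 \mid t_2) = f(t_1, t_2)/f_2(t_2)$ is the conditional density of $\tau_1$ given $\tau_2 = t_2$, with $f_2(t_2) > 0$. For the marginal density of $\tau_2$ it follows that

$$f_2(t_2) = \int_{-\infty}^{\infty} f(t_1, t_2)\,dt_1 = \int_{-\infty}^{\infty} f_1(t_1)\, f_2(t_2 \mid t_1)\,dt_1. \qquad (A6.59)$$

Therefore, for any $A \in \mathcal{B}$,

$$\Pr\{\tau_2 \in A\} = \int_A f_2(t_2)\,dt_2 = \int_A \left( \int_{-\infty}^{\infty} f_1(t_1)\, f_2(t_2 \mid t_1)\,dt_1 \right) dt_2, \qquad (A6.60)$$

and in particular

$$F_2(t) = \Pr\{\tau_2 \le t\} = \int_{-\infty}^{t} f_2(t_2)\,dt_2 = \int_{-\infty}^{t} \int_{-\infty}^{\infty} f_1(t_1)\, f_2(t_2 \mid t_1)\,dt_1\,dt_2. \qquad (A6.61)$$

A6.8 Numerical Parameters of Random Vectors

Let $\vec\tau = (\tau_1, \ldots, \tau_n)$ be a random vector, and u a real-valued function in $\mathcal{R}^n$. The expected value or mean of the random variable $u(\vec\tau)$ is

$$E[u(\vec\tau)] = \sum_{i_1=1}^{k_1} \cdots \sum_{i_n=1}^{k_n} u(t_{1,i_1}, \ldots, t_{n,i_n})\, p(i_1, \ldots, i_n) \qquad (A6.62)$$

for the discrete case and

$$E[u(\vec\tau)] = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(t_1, \ldots, t_n)\, f(t_1, \ldots, t_n)\,dt_1 \cdots dt_n \qquad (A6.63)$$

for the continuous case, assuming that the series and the integral converge absolutely.

The conditional expected value of $\tau_2$ given $\tau_1 = t_1$ follows, in the continuous case, from Eqs. (A6.36) and (A6.58) as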

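$$E[\tau_2 \mid \tau_1 = t_1] = \int_{-\infty}^{\infty} t_2\, f_2(t_2 \mid t_1)\,dt_2. \qquad (A6.64)$$

Thus the unconditional expected value of $\tau_2$ can be obtained from

$$E[\tau_2] = \int_{-\infty}^{\infty} E[\tau_2 \mid \tau_1 = t_1]\, f_1(t_1)\,dt_1. \qquad (A6.65)$$

Equation (A6.65) is known as the formula of total expectation and is useful in many practical applications.

A6.8.1 Covariance Matrix, Correlation Coefficient

Assume for $\vec\tau = (\tau_1, \ldots, \tau_n)$ that $\mathrm{Var}[\tau_i] < \infty$, $i = 1, \ldots, n$. An important rough characterization of the random vector is then the covariance matrix $(\sigma_{ij})$, with elements $\sigma_{ij} = \mathrm{Cov}[\tau_i, \tau_j] = E[(\tau_i - E[\tau_i])(\tau_j - E[\tau_j])]$. In the continuous case they are given by

$$\sigma_{ij} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (t_i - E[\tau_i])(t_j - E[\tau_j])\, f(t_1, \ldots, t_n)\,dt_1 \cdots dt_n. \qquad (A6.66)$$

The diagonal elements of the covariance matrix are the variances of the components $\tau_i$, $i = 1, \ldots, n$. Elements outside the diagonal give a measure of the degree of dependency between components (obviously $\sigma_{ij} = \sigma_{ji}$). For $\tau_i$ independent of $\tau_j$, $\sigma_{ij} = \sigma_{ji} = 0$ holds.

For a two-dimensional random vector $\vec\tau = (\tau_1, \tau_2)$, the quantity

$$\rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2} \qquad (A6.67)$$

is the correlation coefficient of the random variables $\tau_1$ and $\tau_2$, provided $\sigma_i = \sqrt{\mathrm{Var}[\tau_i]} < \infty$, $i = 1, 2$. The main properties of the correlation coefficient are:
1. $|\rho| \le 1$,
2. if $\tau_1$ and $\tau_2$ are independent, then $\rho = 0$,
3. $\rho = \pm 1$ if and only if $\tau_1$ and $\tau_2$ are linearly dependent.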
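Properties 2 and 3 of the correlation coefficient can be seen on small discrete examples. The sketch below uses the discrete analogue of Eq. (A6.66), that is, expectations computed with Eq. (A6.62). It assumes two hypothetical joint distributions. The first is two independent fair dice, for which ρ = 0. The second is a pair linked by $\tau_2 = 7 - 2\tau_1$, which is linearly dependent with a negative slope, so ρ = -1.

```python
import math


def correlation(joint):
    """Correlation coefficient (A6.67) of a discrete pair given as {(t1, t2): probability}."""
    e1 = sum(t1 * p for (t1, _), p in joint.items())
    e2 = sum(t2 * p for (_, t2), p in joint.items())
    cov = sum((t1 - e1) * (t2 - e2) * p for (t1, t2), p in joint.items())  # sigma_12
    var1 = sum((t1 - e1) ** 2 * p for (t1, _), p in joint.items())
    var2 = sum((t2 - e2) ** 2 * p for (_, t2), p in joint.items())
    return cov / math.sqrt(var1 * var2)


die = range(1, 7)
two_dice = {(i, j): 1 / 36 for i in die for j in die}  # independent components
linear = {(i, 7 - 2 * i): 1 / 6 for i in die}          # tau2 = 7 - 2 tau1

print(correlation(two_dice), correlation(linear))  # approximately 0 and -1
```

Note that ρ = 0 does not in general imply independence; the implication in property 2 goes only one way.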

A6.8.2 Further Properties of Expected Value and Variance

Let $\tau_1, \ldots, \tau_n$ be arbitrary random variables (components of a random vector $\vec\tau$) having finite variances, and let $C_1, \ldots, C_n$ be constants. From Eqs. (A6.62) or (A6.63) and (A6.40) it follows that

$$E[C_1\tau_1 + \cdots + C_n\tau_n] = C_1 E[\tau_1] + \cdots + C_n E[\tau_n]. \qquad (A6.68)$$

If $\tau_1$ and $\tau_2$ are independent random variables then, from Eq. (A6.62) or Eq. (A6.63),

$$E[\tau_1 \tau_2] = E[\tau_1]\, E[\tau_2]. \qquad (A6.69)$$

The variance of a sum of independent random variables $\tau_1, \ldots, \tau_n$ is obtained from Eqs. (A6.62) or (A6.63) and (A6.69) as

$$\mathrm{Var}[\tau_1 + \cdots + \tau_n] = \mathrm{Var}[\tau_1] + \cdots + \mathrm{Var}[\tau_n]. \qquad (A6.70)$$

For a sum of arbitrary random variables $\tau_1, \ldots, \tau_n$, the variance can be obtained for $i, j \in \{1, \ldots, n\}$ as

$$\mathrm{Var}[\tau_1 + \cdots + \tau_n] = \mathrm{Var}[\tau_1] + \cdots + \mathrm{Var}[\tau_n] + \sum_{i \ne j} \mathrm{Cov}[\tau_i, \tau_j]. \qquad (A6.71)$$

A6.9 Distribution of the Sum of Independent Positive Random Variables and of τmin, τmax

Let $\tau_1$ and $\tau_2$ be independent non-negative arithmetic random variables with $a_i = \Pr\{\tau_1 = i\}$, $b_i = \Pr\{\tau_2 = i\}$, $i = 0, 1, \ldots$ Obviously, $\tau_1 + \tau_2$ is also arithmetic, and therefore

$$c_k = \Pr\{\tau_1 + \tau_2 = k\} = \Pr\left\{\bigcup_{i=0}^k \{\tau_1 = i \cap \tau_2 = k - i\}\right\} = \sum_{i=0}^k \Pr\{\tau_1 = i\}\,\Pr\{\tau_2 = k - i\} = \sum_{i=0}^k a_i\, b_{k-i}. \qquad (A6.72)$$

The sequence $c_0, c_1, \ldots$ is the convolution of the sequences $a_0, a_1, \ldots$ and $b_0, b_1, \ldots$

Now, let $\tau_1$ and $\tau_2$ be two independent positive continuous random variables with distribution functions $F_1(t)$, $F_2(t)$ and densities $f_1(t)$, $f_2(t)$, respectively ($F_1(0) = F_2(0) = 0$). Using Eq. (A6.55), it can be shown (Example A6.11 and Fig. A6.4) that for the distribution of $\eta = \tau_1 + \tau_2$

$$F_\eta(t) = \Pr\{\eta \le t\} = \int_0^t f_1(x)\, F_2(t - x)\,dx \qquad (A6.73)$$

holds, and

$$f_\eta(t) = \int_0^t f_1(x)\, f_2(t - x)\,dx. \qquad (A6.74)$$

[Figure A6.4: Visual aid to compute the distribution of η = τ1 + τ2 (τ1, τ2 > 0).]

The extension to two independent continuous random variables $\tau_1$ and $\tau_2$ defined over $(-\infty, \infty)$ leads to

$$F_\eta(t) = \int_{-\infty}^{\infty} f_1(x)\, F_2(t - x)\,dx \qquad \text{and} \qquad f_\eta(t) = \int_{-\infty}^{\infty} f_1(x)\, f_2(t - x)\,dx.$$

The right-hand side of Eq. (A6.74) represents the convolution of the densities $f_1(t)$ and $f_2(t)$, and will be denoted by

$$\int_0^t f_1(x)\, f_2(t - x)\,dx = f_1(t) * f_2(t). \qquad (A6.75)$$

The Laplace transform (Appendix A9.7) of $f_\eta(t)$ is thus the product of the Laplace transforms of $f_1(t)$ and $f_2(t)$:

$$\tilde f_\eta(s) = \tilde f_1(s)\, \tilde f_2(s). \qquad (A6.76)$$

Example A6.11
Prove Eq. (A6.74).
Solution
Let $\tau_1$ and $\tau_2$ be two independent positive continuous random variables with distribution functions $F_1(t)$, $F_2(t)$ and densities $f_1(t)$, $f_2(t)$, respectively. From Eq. (A6.55), using $f(x, y) = f_1(x)\, f_2(y)$ and integrating over the region $x + y \le t$, one obtains

$$F_\eta(t) = \Pr\{\eta = \tau_1 + \tau_2 \le t\} = \iint_{x + y \le t} f_1(x)\, f_2(y)\,dx\,dy = \int_0^t \left( \int_0^{t-x} f_2(y)\,dy \right) f_1(x)\,dx = \int_0^t F_2(t - x)\, f_1(x)\,dx,$$
which proves Eq. (A6.73). Eq. (A6.74) follows with $F_2(0) = 0$. A further demonstration of Eq. (A6.74) can be obtained using the formula for total expectation (Eq. (A6.65)).

Sums of positive random variables often occur in reliability theory when investigating repairable systems. For $n \ge 2$, the density $f_\eta(t)$ of $\eta = \tau_1 + \cdots + \tau_n$ for independent positive continuous random variables $\tau_1, \ldots, \tau_n$ follows as

$$f_\eta(t) = f_1(t) * \cdots * f_n(t). \qquad (A6.77)$$

Example A6.12
Two machines are used to increase the reliability of a system. The first is switched on at time t = 0, and the second at the time of failure of the first one (standby redundancy). The failure-free operating times of the machines, denoted by $\tau_1$ and $\tau_2$, are independent and exponentially distributed with parameter λ (Eq. (A6.81)). What is the reliability function of the system?
Solution
From $R_S(t) = \Pr\{\tau_1 + \tau_2 > t\} = 1 - \Pr\{\tau_1 + \tau_2 \le t\}$ and Eq. (A6.73) it follows that
$$R_S(t) = 1 - \int_0^t \lambda e^{-\lambda x} \left(1 - e^{-\lambda (t - x)}\right) dx = e^{-\lambda t} + \lambda t\, e^{-\lambda t}.$$
$R_S(t)$ gives the probability of no failure ($e^{-\lambda t}$) or exactly one failure ($\lambda t\, e^{-\lambda t}$) in $(0, t]$.

The minimum $\tau_{\min}$ and the maximum $\tau_{\max}$ of a finite set of positive, independent random variables $\tau_1, \ldots, \tau_n$ also have important distribution functions for reliability analyses. They describe, for instance, the failure-free operating time of a series or of a 1-out-of-n parallel system, respectively. If $\tau_1, \ldots, \tau_n$ are independent positive random variables with distribution functions $F_i(t) = \Pr\{\tau_i \le t\}$, $i = 1, \ldots, n$, then

$$\Pr\{\tau_{\min} > t\} = \Pr\{\tau_1 > t \cap \cdots \cap \tau_n > t\} = \prod_{i=1}^n (1 - F_i(t)) \qquad (A6.78)$$

and

$$\Pr\{\tau_{\max} \le t\} = \Pr\{\tau_1 \le t \cap \cdots \cap \tau_n \le t\} = \prod_{i=1}^n F_i(t). \qquad (A6.79)$$

It can be noted that the failure rate related to $\tau_{\min}$ is given by

$$\lambda_S(t) = \lambda_1(t) + \cdots + \lambda_n(t), \qquad (A6.80)$$

where $\lambda_i(t)$ is the failure rate related to $F_i(t)$. For $F_1(t) = \cdots = F_n(t)$ and $n \to \infty$, the distribution of $\tau_{\min}$ leads to the Weibull distribution [A6.8]. For the mixture of distribution functions, refer to the considerations given by Eqs. (A6.33) and (2.15).
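The discrete convolution (A6.72) and the continuous case of Example A6.12 can both be reproduced numerically. The sketch below is illustrative only. `convolve` implements the sum in Eq. (A6.72) for probability sequences indexed from 0, and the two-dice data are an assumed example. `standby_reliability` evaluates $1 - \int_0^t f_1(x) F_2(t - x)\,dx$ from Eq. (A6.73) with the trapezoidal rule, assuming exponential elements. The result can be compared with the closed form $e^{-\lambda t}(1 + \lambda t)$.

```python
import math


def convolve(a, b):
    """Eq. (A6.72): c_k = sum_i a_i b_(k-i) for non-negative arithmetic random variables."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def standby_reliability(lam, t, n=2000):
    """R_S(t) = 1 - integral_0^t f1(x) F2(t - x) dx, Eq. (A6.73), for two exponential
    elements with the same parameter lam (Example A6.12), via the trapezoidal rule."""
    def f(x):
        return lam * math.exp(-lam * x)

    def F(x):
        return 1 - math.exp(-lam * x)

    h = t / n
    g = [f(k * h) * F(t - k * h) for k in range(n + 1)]
    integral = h * (sum(g) - 0.5 * (g[0] + g[-1]))
    return 1 - integral


# Sum of two fair dice, each shifted to take the values 0..5 (assumed example).
die = [1 / 6] * 6
two_dice = convolve(die, die)  # index k corresponds to a sum of points k + 2

lam, t = 1e-3, 1000.0
print(standby_reliability(lam, t), math.exp(-lam * t) * (1 + lam * t))
```

The trapezoidal rule is chosen only for brevity. Any quadrature of adequate accuracy would do, and for exponential elements the closed form makes numerical integration unnecessary in practice.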

A6.10 Distribution Functions used in Reliability Analysis

This section introduces the most important distribution functions used in quality control and reliability analysis; see Table A6.1 for a summary.

A6.10.1 Exponential Distribution

A continuous positive random variable τ has an exponential distribution if

$$F(t) = 1 - e^{-\lambda t}, \qquad t \ge 0, \ \lambda > 0. \qquad (A6.81)$$

The density is given by

$$f(t) = \lambda e^{-\lambda t}, \qquad t \ge 0, \ \lambda > 0, \qquad (A6.82)$$

and the failure rate (Eq. (A6.25)) by

$$\lambda(t) = \lambda. \qquad (A6.83)$$

The mean and the variance can be obtained from Eqs. (A6.38) and (A6.44) as

$$E[\tau] = \frac{1}{\lambda} \qquad (A6.84)$$

and

$$\mathrm{Var}[\tau] = \frac{1}{\lambda^2}. \qquad (A6.85)$$

The Laplace transform of f(t) is, according to Table A9.7,

$$\tilde f(s) = \frac{\lambda}{s + \lambda}. \qquad (A6.86)$$

Example A6.13
The failure-free operating time τ of an assembly is exponentially distributed with $\lambda = 10^{-5}\,\mathrm{h}^{-1}$. What is the probability of τ being (i) over 2,000 h, (ii) over 20,000 h, (iii) over 100,000 h, (iv) between 20,000 h and 100,000 h?
Solution
From Eqs. (A6.81), (A6.24) and (A6.19) one obtains
(i) $\Pr\{\tau > 2{,}000\,\mathrm{h}\} = e^{-0.02} \approx 0.98$,
(ii) $\Pr\{\tau > 20{,}000\,\mathrm{h}\} = e^{-0.2} \approx 0.819$,
(iii) $\Pr\{\tau > 100{,}000\,\mathrm{h}\} = \Pr\{\tau > 1/\lambda = E[\tau]\} = e^{-1} \approx 0.368$,
(iv) $\Pr\{20{,}000\,\mathrm{h} < \tau \le 100{,}000\,\mathrm{h}\} = e^{-0.2} - e^{-1} \approx 0.451$.
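The numbers in Example A6.13 follow from $R(t) = e^{-\lambda t}$ alone. The short sketch below recomputes them and adds a check of the memoryless property of Eq. (A6.30). The ages of 50,000 h and 20,000 h used in that check are arbitrary choices made for this illustration.

```python
import math

LAMBDA = 1e-5  # failure rate in 1/h, from Example A6.13


def survival(t_hours, lam=LAMBDA):
    """R(t) = Pr{tau > t} = exp(-lambda t) for an exponential tau, Eq. (A6.24)."""
    return math.exp(-lam * t_hours)


p_i = survival(2_000)
p_ii = survival(20_000)
p_iii = survival(100_000)
p_iv = survival(20_000) - survival(100_000)  # Pr{20,000 h < tau <= 100,000 h}
print(f"{p_i:.3f} {p_ii:.3f} {p_iii:.3f} {p_iv:.3f}")  # 0.980 0.819 0.368 0.451

# Memoryless property, Eq. (A6.30): having survived 50,000 h does not change
# the probability of surviving a further 20,000 h.
memoryless = survival(50_000 + 20_000) / survival(50_000)
```

Case (iii) is worth noting. An exponentially distributed item survives its own mean time to failure with probability $e^{-1} \approx 0.368$, whatever the value of λ.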

For an exponential distribution, the failure rate is constant (time-independent) and equal to λ. This important property is characteristic of the exponential distribution and does not appear with any other continuous distribution. It greatly simplifies calculation because of the following properties:
1. Memoryless property: Assume that the failure-free operating time is exponentially distributed and that the item is functioning at the present time. Then its future behavior does not depend on how long it has already been operating. In particular, the probability that it will fail in the next time interval δt does not depend on its age and is, for small δt, equal to λδt. This is a consequence of Eq. (A6.30):
$$\Pr\{\tau > t + x_0 \mid \tau > x_0\} = e^{-\lambda t}. \qquad (A6.87)$$
2. Constant failure rate at system level: Consider a system without redundancy consisting of the elements $E_1, \ldots, E_n$. Let the failure-free operating times $\tau_1, \ldots, \tau_n$ of these elements be independent and exponentially distributed with parameters $\lambda_1, \ldots, \lambda_n$. Then, according to Eq. (A6.78), the system failure rate is also constant (time-independent) and equal to the sum of the failure rates of its elements:
$$R_S(t) = e^{-\lambda_1 t} \cdots e^{-\lambda_n t} = e^{-\lambda_S t}, \qquad \lambda_S = \lambda_1 + \cdots + \lambda_n. \qquad (A6.88)$$
It should be noted that the expression $\lambda_S = \sum_i \lambda_i$ is a characteristic of the series model with independent elements. It also remains true for time-dependent failure rates $\lambda_i = \lambda_i(t)$, see Eqs. (A6.80) and (2.18).

A6.10.2 Weibull Distribution

The Weibull distribution can be considered as a generalization of the exponential distribution. A continuous positive random variable τ has a Weibull distribution if

$$F(t) = 1 - e^{-(\lambda t)^\beta}, \qquad t \ge 0, \ \lambda, \beta > 0. \qquad (A6.89)$$

The density is given by

$$f(t) = \lambda\beta(\lambda t)^{\beta-1} e^{-(\lambda t)^\beta}, \qquad t \ge 0, \ \lambda, \beta > 0, \qquad (A6.90)$$

and the failure rate (Eq. (A6.25)) by

$$\lambda(t) = \beta\lambda(\lambda t)^{\beta-1}. \qquad (A6.91)$$

λ is the scale parameter (F(t) depends on λt only) and β the shape parameter. β = 1 yields the exponential distribution. For β > 1, the failure rate λ(t) increases monotonically, with λ(0) = 0 and λ(∞) = ∞. For β < 1, λ(t) decreases monotonically, with λ(0) = ∞ and λ(∞) = 0. The mean and the variance are given by

$$E[\tau] = \frac{\Gamma(1 + 1/\beta)}{\lambda} \tag{A6.92}$$

and

$$\mathrm{Var}[\tau] = \frac{\Gamma(1 + 2/\beta) - \Gamma^2(1 + 1/\beta)}{\lambda^2}, \tag{A6.93}$$

where

$$\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx, \qquad z > 0, \tag{A6.94}$$

is the complete gamma function (Appendix A9.6). The coefficient of variation $\kappa = \sqrt{\mathrm{Var}[\tau]}/E[\tau] = \sigma/E[\tau]$ is plotted in Fig. For a given $E[\tau]$, the density of the Weibull distribution becomes more peaked with increasing $\beta$. An analytical expression for the Laplace transform of the Weibull distribution function does not exist.

For a system without redundancy (series model) whose elements have independent failure-free operating times $\tau_1, \dots, \tau_n$ distributed according to Eq. (A6.89), the reliability function is given by

$$R_S(t) = \left(e^{-(\lambda t)^\beta}\right)^n = e^{-(\lambda' t)^\beta}, \tag{A6.95}$$

with $\lambda' = \lambda\, n^{1/\beta}$. Thus, the failure-free operating time of the system has a Weibull distribution with parameters $\lambda'$ and $\beta$.

The Weibull distribution with $\beta > 1$ often occurs in applications as a distribution of the failure-free operating time of components which are subject to wear-out and/or fatigue (lamps, relays, mechanical components, etc.). It was introduced by W. Weibull in 1951, related to investigations on fatigue in metals [A6.20]. B.W. Gnedenko showed that a Weibull distribution occurs as one of the extreme value distributions for the smallest of $n$ ($n \to \infty$) independent random variables with the same distribution function (Weibull-Gnedenko distribution [A6.7, A6.8]).

The Weibull distribution is often given with the parameter $\alpha = \lambda^\beta$ instead of $\lambda$, or also with three parameters

$$F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}, \qquad t \ge \psi,\ \lambda, \beta > 0. \tag{A6.96}$$

Example A6.14

Show that for a three-parameter Weibull distribution, the location (time-shift) parameter $\psi$ can also be determined graphically on a Weibull probability chart, e.g. for the empirical evaluation of data.

Solution

In the system of coordinates $\log_{10}(t)$ and $\log_{10}\log_{10}(1/(1 - F(t)))$, the two-parameter Weibull distribution function (Eq. (A6.89)) appears as a straight line, allowing a graphical determination of $\lambda$ and $\beta$ (see Eq. (A8.16) and Fig. A8.2). The three-parameter Weibull distribution (Eq. (A6.96)) leads to a concave curve. In this case, for two arbitrary points $t_1$ and $t_2 > t_1$, it holds for the mean point on the scale $\log_{10}\log_{10}(1/(1 - F(t)))$, defining $t_m$, that $\log_{10}(t_2 - \psi) + \log_{10}(t_1 - \psi) = 2\log_{10}(t_m - \psi)$, see Eq. (A8.16), the identity $a + (b - a)/2 = (a + b)/2$, and Fig. A8.2. From this, $(t_2 - \psi)(t_1 - \psi) = (t_m - \psi)^2$ and $\psi = (t_1 t_2 - t_m^2)/(t_1 + t_2 - 2t_m)$, as a function of $t_1$, $t_2$, $t_m$.
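The Weibull formulas above are easy to evaluate with `math.gamma`. The following Python sketch is illustrative only (all function names are hypothetical): it computes Eqs. (A6.92) and (A6.93), shows that the coefficient of variation shrinks as $\beta$ grows, checks the series-model relations (A6.88) and (A6.95), and applies the $\psi$ formula of Example A6.14 to three chart readings assumed for the sake of the example.

```python
import math


def weibull_mean(lam: float, beta: float) -> float:
    """E[tau] = Gamma(1 + 1/beta) / lam (Eq. (A6.92))."""
    return math.gamma(1 + 1 / beta) / lam


def weibull_var(lam: float, beta: float) -> float:
    """Var[tau] = (Gamma(1 + 2/beta) - Gamma(1 + 1/beta)**2) / lam**2 (Eq. (A6.93))."""
    g1 = math.gamma(1 + 1 / beta)
    return (math.gamma(1 + 2 / beta) - g1 ** 2) / lam ** 2


def weibull_reliability(t: float, lam: float, beta: float) -> float:
    """R(t) = 1 - F(t) = exp(-(lam t)**beta) (Eq. (A6.89))."""
    return math.exp(-((lam * t) ** beta))


def series_scale(lam: float, beta: float, n: int) -> float:
    """lam' = lam * n**(1/beta) for n independent, identical Weibull elements in series (Eq. (A6.95))."""
    return lam * n ** (1 / beta)


def psi_from_chart(t1: float, t2: float, tm: float) -> float:
    """psi = (t1 t2 - tm**2) / (t1 + t2 - 2 tm), as derived in Example A6.14."""
    return (t1 * t2 - tm ** 2) / (t1 + t2 - 2 * tm)


# beta = 1 gives back the exponential case: E = 1/lam, Var = 1/lam**2.
print(weibull_mean(2.0, 1.0), weibull_var(2.0, 1.0))

# Coefficient of variation sigma / E[tau] decreases as beta grows (more peaked density).
cvs = [math.sqrt(weibull_var(1.0, b)) / weibull_mean(1.0, b) for b in (0.5, 1.0, 2.0, 4.0)]
print(cvs)

# Series model: (exp(-(lam t)^beta))^n equals exp(-(lam' t)^beta).
lam, beta, n, t = 1e-4, 2.5, 6, 3_000.0
lhs = weibull_reliability(t, lam, beta) ** n
rhs = weibull_reliability(t, series_scale(lam, beta, n), beta)

# Series of exponential elements (Eq. (A6.88)): the failure rates simply add up.
rates = [1e-6, 5e-6, 2e-5]
r_series = math.prod(math.exp(-r * t) for r in rates)

# Example A6.14 with hypothetical chart readings t1 = 200, t2 = 500, tm = 300.
print(psi_from_chart(200.0, 500.0, 300.0))
```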

A6.10.3 Gamma Distribution, Erlangian Distribution, and $\chi^2$ Distribution

A continuous positive random variable $\tau$ has a Gamma distribution if

$$F(t) = \Pr\{\tau \le t\} = \frac{1}{\Gamma(\beta)}\int_0^{\lambda t} x^{\beta - 1} e^{-x}\,dx = \frac{\gamma(\beta, \lambda t)}{\Gamma(\beta)}, \qquad t \ge 0,\ \lambda, \beta > 0. \tag{A6.97}$$

$\Gamma$ is the complete gamma function defined by Eq. (A6.94). $\gamma$ is the incomplete gamma function (Appendix A9.6). The density of the Gamma distribution is given by

$$f(t) = \lambda\,\frac{(\lambda t)^{\beta - 1}}{\Gamma(\beta)}\,e^{-\lambda t}, \qquad t \ge 0,\ \lambda, \beta > 0, \tag{A6.98}$$

and the failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$. $\lambda(t)$ is constant (time-independent) for $\beta = 1$, monotonically decreasing for $\beta < 1$ and monotonically increasing for $\beta > 1$. However, in contrast to the Weibull distribution, $\lambda(t)$ always converges to $\lambda$ for $t \to \infty$, see Table A6.1 for an example. A Gamma distribution with $\beta < 1$ mixed with a three-parameter Weibull distribution (Eq. (A6.33, case 1)) can be used as an approximation to the distribution function for an item with a failure rate following the bathtub curve given in Fig. The mean and the variance of the Gamma distribution are given by

$$E[\tau] = \frac{\beta}{\lambda} \tag{A6.99}$$

and

$$\mathrm{Var}[\tau] = \frac{\beta}{\lambda^2}. \tag{A6.100}$$

The Laplace transform (Table A9.7) of the Gamma distribution density is

$$\tilde f(s) = \frac{\lambda^\beta}{(s + \lambda)^\beta}. \tag{A6.101}$$

From Eqs. (A6.101) and (A6.76), it follows that the sum of two independent Gamma-distributed random variables with parameters $\lambda, \beta_1$ and $\lambda, \beta_2$ has a Gamma distribution with parameters $\lambda, \beta_1 + \beta_2$.

Example A6.15

Let the random variables $\tau_1$ and $\tau_2$ be independent and distributed according to a Gamma distribution with the parameters $\lambda$ and $\beta$. Determine the density of the sum $\eta = \tau_1 + \tau_2$.

Solution

According to Eq. (A6.98), $\tau_1$ and $\tau_2$ have the density $f(t) = \lambda(\lambda t)^{\beta - 1} e^{-\lambda t}/\Gamma(\beta)$. The Laplace transform of $f(t)$ is $\tilde f(s) = \lambda^\beta/(s + \lambda)^\beta$ (Table A9.7). From Eq. (A6.76), the Laplace transform of the density of $\eta = \tau_1 + \tau_2$ follows as $\tilde f_\eta(s) = \lambda^{2\beta}/(s + \lambda)^{2\beta}$. The random variable $\eta = \tau_1 + \tau_2$ thus has a Gamma distribution with parameters $\lambda$ and $2\beta$. The generalization to the sum $\tau_1 + \dots + \tau_n$ leads to a Gamma distribution with parameters $\lambda$ and $n\beta$.

For $\beta = n = 2, 3, \dots$, the Gamma distribution given by Eq. (A6.97) leads to an Erlangian distribution with parameters $\lambda$ and $n$. Taking into account Eq. (A6.77) and comparing the Laplace transform of the exponential distribution $\lambda/(s + \lambda)$ with that of the Erlangian distribution $(\lambda/(s + \lambda))^n$ leads to the following conclusion: If $\tau$ is Erlang distributed with parameters $\lambda$ and $n$, then $\tau$ can be considered as the sum of $n$ independent, exponentially distributed random variables with parameter $\lambda$, i.e. $\tau = \tau_1 + \dots + \tau_n$ with $\Pr\{\tau_i \le t\} = 1 - e^{-\lambda t}$, $i = 1, \dots, n$. The Erlangian distribution is obtained by partial integration of the right-hand side of Eq. (A6.97), with $\beta = n$. This leads to (Appendices A9.2 & A9.6)

$$F(t) = \Pr\{\tau_1 + \dots + \tau_n \le t\} = 1 - \sum_{i=0}^{n-1}\frac{(\lambda t)^i}{i!}\,e^{-\lambda t} = \frac{1}{\Gamma(n)}\int_0^{\lambda t} x^{n-1} e^{-x}\,dx = \frac{\gamma(n, \lambda t)}{\Gamma(n)}, \qquad t \ge 0,\ \lambda > 0. \tag{A6.102}$$

When $\lambda = 1/2$ and $\beta = \nu/2$, $\nu = 1, 2, \dots$, then the Gamma distribution given by Eq. (A6.97) is a chi-square distribution ($\chi^2$ distribution) with $\nu$ degrees of freedom. The corresponding random variable is denoted $\chi^2_\nu$. The chi-square distribution with $\nu$ degrees of freedom is thus given by

$$F(t) = \Pr\{\chi^2_\nu \le t\} = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\int_0^t x^{\nu/2 - 1} e^{-x/2}\,dx, \qquad t \ge 0,\ \nu = 1, 2, \dots \tag{A6.103}$$

From Eqs. (A6.97), (A6.102), and (A6.103) it follows that

$$2\lambda(\tau_1 + \dots + \tau_n) \tag{A6.104}$$

has a $\chi^2$ distribution with $\nu = 2n$ degrees of freedom. If $\xi_1, \dots, \xi_n$ are independent, normally distributed random variables with mean $m$ and variance $\sigma^2$, then

$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(\xi_i - m)^2$$

is $\chi^2$ distributed with $n$ degrees of freedom. The above considerations show the importance of the $\chi^2$ distribution in mathematical statistics. The $\chi^2$ distribution is also used to compute the Poisson distribution (Eq. (A6.102) with $n = \nu/2$ and $\lambda = 1/2$, or Eq. (A6.126) with $k = \nu/2 - 1$ and $m = t/2$, see also Table A9.2).
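The closed form (A6.102) can be checked against a direct numerical integration of the Gamma density (A6.98), and the $\chi^2$ link (A6.104) can be made explicit for even $\nu$. The Python sketch below is a minimal illustration under those assumptions (function names are hypothetical; Simpson's rule is simply a convenient quadrature choice, not the text's method).

```python
import math


def erlang_cdf(t: float, lam: float, n: int) -> float:
    """F(t) = 1 - sum_{i=0}^{n-1} (lam t)^i / i! * exp(-lam t)  (Eq. (A6.102))."""
    x = lam * t
    return 1.0 - math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(n))


def gamma_density(t: float, lam: float, beta: float) -> float:
    """f(t) = lam (lam t)^(beta - 1) exp(-lam t) / Gamma(beta)  (Eq. (A6.98))."""
    return lam * (lam * t) ** (beta - 1) * math.exp(-lam * t) / math.gamma(beta)


def simpson(f, a: float, b: float, m: int = 2_000) -> float:
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    total += 4 * sum(f(a + (2 * j - 1) * h) for j in range(1, m // 2 + 1))
    total += 2 * sum(f(a + 2 * j * h) for j in range(1, m // 2))
    return total * h / 3


def chi2_cdf_even(t: float, nu: int) -> float:
    """Chi-square distribution function (Eq. (A6.103)) for even nu only:
    a Gamma distribution with lam = 1/2 and beta = nu/2, i.e. an Erlangian one."""
    if nu % 2:
        raise ValueError("this sketch covers even degrees of freedom only")
    return erlang_cdf(t, 0.5, nu // 2)


lam, n, t = 0.5, 3, 4.0
closed_form = erlang_cdf(t, lam, n)
numerical = simpson(lambda x: gamma_density(x, lam, n), 0.0, t)
via_chi2 = chi2_cdf_even(2 * lam * t, 2 * n)   # 2 lam (tau_1 + ... + tau_n) is chi^2 with 2n d.o.f.
print(closed_form, numerical, via_chi2)
```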

A6.10.4 Normal Distribution

A commonly used distribution function, in theory and practice, is the normal distribution, or Gaussian distribution. The random variable $\tau$ has a normal distribution if

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{t} e^{-\frac{(y - m)^2}{2\sigma^2}}\,dy = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{t - m}{\sigma}} e^{-x^2/2}\,dx, \qquad -\infty < t, m < \infty,\ \sigma > 0. \tag{A6.105}$$

The density of the normal distribution is given by

$$f(t) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(t - m)^2}{2\sigma^2}}, \qquad -\infty < t, m < \infty,\ \sigma > 0. \tag{A6.106}$$

The failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$. The mean and variance are

$$E[\tau] = m \tag{A6.107}$$

and

$$\mathrm{Var}[\tau] = \sigma^2, \tag{A6.108}$$

respectively. The density of the normal distribution is symmetric with respect to the line $x = m$. Its width depends upon the variance. The area under the density curve is equal to (Table A9.1)

0.6827 for the interval $m \pm \sigma$,
0.9545 for the interval $m \pm 2\sigma$,
0.9973 for the interval $m \pm 3\sigma$.

Obviously, a normally distributed random variable assumes values from $-\infty$ to $+\infty$. However, for $m > 3\sigma$ it can be considered as a positive random variable in many practical applications. Considering that in manufacturing processes a shift of the mean of $1.5\sigma$ often occurs, $m \pm 6\sigma$ is generally used as a sharp limit for controlling the process quality ($6\sigma$ approach).

If $\tau$ has a normal distribution with parameters $m$ and $\sigma^2$, then $(\tau - m)/\sigma$ is normally distributed with parameters 0 and 1, which is the standard normal distribution $\Phi(t)$

$$\Phi(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-x^2/2}\,dx. \tag{A6.109}$$

If $\tau_1$ and $\tau_2$ are independent, normally distributed random variables with parameters $m_1, \sigma_1^2$ and $m_2, \sigma_2^2$, respectively, then $\eta = \tau_1 + \tau_2$ is normally distributed with parameters $m_1 + m_2$ and $\sigma_1^2 + \sigma_2^2$ (Example A6.16). This rule can be generalized to the sum of $n$ independent normally distributed random variables, and extended to dependent, jointly normally distributed random variables (Example A6.16).
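Python's standard-library `statistics.NormalDist` is enough to reproduce the areas quoted above and the addition rule for independent normal variables. The sketch below is illustrative; the $6\sigma$ computation assumes a one-sided view in which the mean has drifted by $1.5\sigma$ toward the limit, and the parameter values for $\tau_1$, $\tau_2$ and $\rho$ are arbitrary choices for the example.

```python
import math
from statistics import NormalDist

phi = NormalDist(0.0, 1.0)  # standard normal distribution Phi(t), Eq. (A6.109)

# Area under the density over m ± k sigma (independent of m and sigma).
areas = {k: phi.cdf(k) - phi.cdf(-k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"m ± {k} sigma: {area:.4f}")

# 6-sigma approach: one-sided fraction beyond m + 6 sigma after the mean has shifted by 1.5 sigma.
tail_after_shift = 1.0 - phi.cdf(6.0 - 1.5)
print(f"{tail_after_shift * 1e6:.1f} ppm")

# Sum of two independent normal random variables (Example A6.16):
# NormalDist supports "+" for independent variables, so means add and variances add.
tau1 = NormalDist(mu=10.0, sigma=3.0)
tau2 = NormalDist(mu=5.0, sigma=4.0)
eta = tau1 + tau2
print(eta.mean, eta.stdev)

# Jointly normal but correlated (coefficient rho): the variance gains the term 2 rho sigma1 sigma2.
rho = 0.5
var_dependent = tau1.variance + tau2.variance + 2 * rho * tau1.stdev * tau2.stdev
print(var_dependent)
```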

Example A6.16

Let the random variables $\tau_1$ and $\tau_2$ be statistically independent and normally distributed with means $m_1$ and $m_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Determine the density of the sum $\eta = \tau_1 + \tau_2$.

Solution

According to Eq. (A6.74), the density of $\eta = \tau_1 + \tau_2$ follows as

$$f_\eta(t) = \frac{1}{2\pi\sigma_1\sigma_2}\int_{-\infty}^{\infty} e^{-\left(\frac{(x - m_1)^2}{2\sigma_1^2} + \frac{(t - x - m_2)^2}{2\sigma_2^2}\right)}\,dx.$$

With $u = x - m_1$, $v = t - m_1 - m_2$, and completing the square in $u$ in the exponent $u^2/(2\sigma_1^2) + (v - u)^2/(2\sigma_2^2)$, the remaining Gaussian integral can be evaluated and

$$f_\eta(t) = \frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}}\,e^{-\frac{(t - m_1 - m_2)^2}{2(\sigma_1^2 + \sigma_2^2)}}$$

is obtained. Thus the sum of independent normally distributed random variables is also normally distributed with mean $m_1 + m_2$ and variance $\sigma_1^2 + \sigma_2^2$. If $\tau_1$ and $\tau_2$ are statistically dependent but jointly normally distributed, then the distribution function of $\tau_1 + \tau_2$ is still a normal distribution with $m = m_1 + m_2$, but with variance $\sigma^2 = \sigma_1^2 + \sigma_2^2 + 2\rho\sigma_1\sigma_2$, where $\rho$ is the correlation coefficient as defined by Eq. (A6.67).

The normal distribution often occurs in practical applications because the distribution function of the sum of a large number of statistically independent random variables converges to a normal distribution under relatively general conditions (central limit theorem, Eq. (A6.148)).

A6.10.5 Lognormal Distribution

A continuous positive random variable $\tau$ has a lognormal distribution if its logarithm is normally distributed (Example A6.17). For the lognormal distribution,

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}}\int_0^t \frac{1}{y}\,e^{-\frac{(\ln(\lambda y))^2}{2\sigma^2}}\,dy = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\ln(\lambda t)/\sigma} e^{-x^2/2}\,dx = \Phi(\ln(\lambda t)/\sigma), \qquad t \ge 0,\ \lambda, \sigma > 0. \tag{A6.110}$$

The density is given by

$$f(t) = \frac{1}{t\,\sigma\sqrt{2\pi}}\,e^{-\frac{(\ln(\lambda t))^2}{2\sigma^2}}, \qquad t \ge 0,\ \lambda, \sigma > 0. \tag{A6.111}$$

The failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$, see Table A6.1 for an example. The mean and the variance of $\tau$ are

$$E[\tau] = \frac{e^{\sigma^2/2}}{\lambda} \tag{A6.112}$$

and

$$\mathrm{Var}[\tau] = \frac{e^{2\sigma^2} - e^{\sigma^2}}{\lambda^2}, \tag{A6.113}$$

respectively. The density of the lognormal distribution has the important property that it is practically zero for small $t$ near the origin, increases rapidly to a maximum, and then decreases quickly (Fig. 4.2). The lognormal distribution function can be used to model repair times (Section 4.1) and often appears in accelerated reliability testing (Section 7.4), as well as when a large number of statistically independent random variables are combined in a multiplicative way (additive for $\eta = \ln\tau$, i.e. for the normal distribution). It can also be shown that it is the limit distribution for $n \to \infty$ of $x_n$ when $x_{n+1} = x_n(1 + \varepsilon_n)$, where $\varepsilon_n$ is an independent random variable [A6.9].

Example A6.17

Show that the logarithm of a lognormally distributed random variable is normally distributed.

Solution

For

$$f_\tau(t) = \frac{1}{t\,\sigma\sqrt{2\pi}}\,e^{-\frac{(\ln t + \ln\lambda)^2}{2\sigma^2}}$$

and $\eta = \ln\tau$, Equation (A6.33) yields ($u(t) = \ln t$ and $u^{-1}(t) = e^t$)

$$f_\eta(t) = \frac{e^t}{e^t\,\sigma\sqrt{2\pi}}\,e^{-\frac{(t + \ln\lambda)^2}{2\sigma^2}} = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(t - m)^2}{2\sigma^2}},$$

with $\ln(1/\lambda) = m$. This method can be used for other transformations, for example:

(i) $u(t) = e^t$: normal distribution $\to$ lognormal distribution,
(ii) $u(t) = t^\beta$: Weibull distribution $\to$ exponential distribution,
(iii) $u(t) = t^{1/\beta}$: exponential distribution $\to$ Weibull distribution,
(iv) $u(t) = F_\eta^{-1}(t)$: uniform distribution (Eq. (A6.114)) on $(0, 1)$ $\to$ $F_\eta(t)$,
(v) $u(t) = F_\eta(t)$: $F_\eta(t)$ $\to$ uniform distribution on $(0, 1)$.

In Monte Carlo simulations, algorithms more sophisticated than $F_\eta^{-1}(t)$ are often used.
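Transformations (i) and (iv) of Example A6.17 give two ways to generate lognormal samples. The Python sketch below is a minimal illustration (function names, parameter values and the seed are hypothetical choices): it samples both ways and compares the sample means with Eq. (A6.112); since the check is statistical, only agreement within a few percent is expected.

```python
import math
import random
from statistics import NormalDist, fmean

PHI = NormalDist()  # standard normal distribution Phi


def lognormal_cdf(t: float, lam: float, sigma: float) -> float:
    """F(t) = Phi(ln(lam t) / sigma) for t > 0 (Eq. (A6.110))."""
    return PHI.cdf(math.log(lam * t) / sigma) if t > 0 else 0.0


def lognormal_mean(lam: float, sigma: float) -> float:
    """E[tau] = exp(sigma^2 / 2) / lam (Eq. (A6.112))."""
    return math.exp(sigma ** 2 / 2) / lam


def sample_via_exp(lam: float, sigma: float, rng: random.Random) -> float:
    """Transformation (i): eta normal with m = ln(1/lam) and sigma, then tau = exp(eta)."""
    return math.exp(rng.gauss(math.log(1 / lam), sigma))


def sample_via_inverse_cdf(lam: float, sigma: float, rng: random.Random) -> float:
    """Transformation (iv): u uniform on (0, 1), tau = F^{-1}(u) = exp(sigma Phi^{-1}(u)) / lam."""
    u = rng.random()
    while u == 0.0:  # random() may return 0.0, which has no inverse image
        u = rng.random()
    return math.exp(sigma * PHI.inv_cdf(u)) / lam


rng = random.Random(2024)
lam, sigma, size = 0.5, 0.6, 50_000
a = [sample_via_exp(lam, sigma, rng) for _ in range(size)]
b = [sample_via_inverse_cdf(lam, sigma, rng) for _ in range(size)]
print(lognormal_mean(lam, sigma), fmean(a), fmean(b))
```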

A6.10.6 Uniform Distribution

A random variable $\tau$ is uniformly distributed in the interval $(a, b)$ if it has the distribution function

$$F(t) = \Pr\{\tau \le t\} = \begin{cases} 0 & \text{if } t \le a \\ \dfrac{t - a}{b - a} & \text{if } a < t < b \\ 1 & \text{otherwise.} \end{cases} \tag{A6.114}$$

The density is then given by

$$f(t) = \frac{1}{b - a} \qquad \text{for } a < t < b.$$

The uniform distribution is a particular case of the geometric probabilities introduced by Eq. (A6.3), but for $\mathbb{R}^1$ instead of $\mathbb{R}^2$. Because of the property mentioned at the end of Example A6.17, the uniform distribution in the interval $(0, 1)$ plays an important role in simulation problems.

A6.10.7 Binomial Distribution

Consider a trial in which the only outcome is either a given event $A$ or its complement $\bar A$. This outcome can be represented by a random variable of the form

$$\delta = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise.} \end{cases} \tag{A6.115}$$

$\delta$ is called a Bernoulli variable. If

$$\Pr\{\delta = 1\} = p \quad \text{and} \quad \Pr\{\delta = 0\} = 1 - p, \tag{A6.116}$$

then

$$E[\delta] = 1 \cdot p + 0 \cdot (1 - p) = p, \tag{A6.117}$$

and

$$\mathrm{Var}[\delta] = E[\delta^2] - E^2[\delta] = p - p^2 = p(1 - p). \tag{A6.118}$$

An infinite sequence of independent Bernoulli variables $\delta_1, \delta_2, \dots$ with the same probability $\Pr\{\delta_i = 1\} = p$, $i \ge 1$, is called a Bernoulli model or a sequence of Bernoulli trials. The sequence $\delta_1, \delta_2, \dots$ describes, for example, the model of the repeated sampling of a component from a lot of size $N$, with $K$ defective components ($p = K/N$), such that the component is returned to the lot after testing (sample with replacement). The random variable

$$\zeta = \delta_1 + \dots + \delta_n \tag{A6.119}$$

is the number of ones occurring in $n$ Bernoulli trials. The distribution of $\zeta$ is given by

$$p_k = \Pr\{\zeta = k\} = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad k = 0, \dots, n,\ 0 < p < 1. \tag{A6.120}$$

Equation (A6.120) is the binomial distribution. $\zeta$ is obviously an arithmetic random variable taking on values in $\{0, 1, \dots, n\}$ with probabilities $p_k$. To prove Eq. (A6.120), consider that

$$p^k (1 - p)^{n - k} = \Pr\{\delta_1 = 1 \cap \dots \cap \delta_k = 1 \cap \delta_{k+1} = 0 \cap \dots \cap \delta_n = 0\}$$

is the probability of the event $A$ occurring in the first $k$ trials and not occurring in the $n - k$ following trials ($\delta_1, \dots, \delta_n$ are independent); furthermore, in $n$ trials there are

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!} = \frac{n(n - 1)\cdots(n - k + 1)}{k!}$$

different possibilities of occurrence of $k$ ones and $n - k$ zeros; the addition theorem (Eq. (A6.11)) then leads to Eq. (A6.120).

Example A6.18

A populated printed circuit board (PCB) contains 30 ICs. These are taken from a shipment in which the probability of each IC being defective is constant and equal to 1%. What are the probabilities that the PCB contains (i) no defective ICs, (ii) exactly one defective IC, and (iii) more than one defective IC?

Solution

From Eq. (A6.120) with $p = 0.01$,

(i) $p_0 = 0.99^{30} \approx 0.74$,
(ii) $p_1 = 30 \cdot 0.01 \cdot 0.99^{29} \approx 0.224$,
(iii) $p_2 + \dots + p_{30} = 1 - p_0 - p_1 \approx 0.036$.

Knowing $p_i$ and assuming $C_i$ = cost for $i$ repairs (because of $i$ defective ICs), it is easy to calculate the mean $C$ of the total cost caused by the defective ICs ($C = p_1 C_1 + \dots + p_{30} C_{30}$) and thus to develop a test strategy based on cost considerations (Section 8.4).

For the random variable $\zeta$ defined by Eq. (A6.119) it follows that

$$\Pr\{\zeta \le k\} = \sum_{i=0}^{k}\binom{n}{i} p^i (1 - p)^{n - i}, \qquad k = 0, \dots, n,\ 0 < p < 1, \tag{A6.121}$$

$$E[\zeta] = np, \tag{A6.122}$$

$$\mathrm{Var}[\zeta] = np(1 - p). \tag{A6.123}$$
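Example A6.18 and Eqs. (A6.122) and (A6.123) can be checked directly with `math.comb`. In the Python sketch below, the cost model $C_i = i \cdot c_{repair}$ is purely a hypothetical assumption made for illustration, since the text leaves $C_i$ unspecified.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """p_k = C(n, k) p^k (1 - p)^(n - k)  (Eq. (A6.120))."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 30, 0.01            # 30 ICs, each defective with probability 1 %
pk = [binom_pmf(k, n, p) for k in range(n + 1)]

p0, p1 = pk[0], pk[1]
p_more = 1 - p0 - p1       # more than one defective IC
print(round(p0, 3), round(p1, 3), round(p_more, 3))

# Mean and variance from the distribution, to compare with Eqs. (A6.122) and (A6.123).
mean = sum(k * q for k, q in enumerate(pk))
var = sum(k * k * q for k, q in enumerate(pk)) - mean ** 2

# Expected repair cost C = p_1 C_1 + ... + p_30 C_30, ASSUMING (hypothetically)
# that i defective ICs cost i * c_repair; the text does not specify C_i.
c_repair = 1.0
expected_cost = sum(q * k * c_repair for k, q in enumerate(pk))
print(mean, var, expected_cost)
```

With a linear cost model the expected cost reduces to $c_{repair} \cdot E[\zeta] = c_{repair}\,np$; a realistic test strategy would of course use the actual $C_i$.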

Example A6.19

Determine the mean and the variance of a binomially distributed random variable with parameters $n$ and $p$.

Solution

Considering the independence of $\delta_1, \dots, \delta_n$, the definition of $\zeta$ (Eq. (A6.119)), and Eqs. (A6.117) and (A6.118), it follows that

$$E[\zeta] = E[\delta_1] + \dots + E[\delta_n] = np$$

and

$$\mathrm{Var}[\zeta] = \mathrm{Var}[\delta_1] + \dots + \mathrm{Var}[\delta_n] = np(1 - p).$$

A further demonstration follows, as for Example A6.20, by considering that

$$\sum_{k=1}^{n} k\binom{n}{k} p^k (1 - p)^{n - k} = np\sum_{k=1}^{n}\binom{n - 1}{k - 1} p^{k - 1} (1 - p)^{n - k} = np\sum_{i=0}^{m}\binom{m}{i} p^i (1 - p)^{m - i} = np, \qquad m = n - 1.$$

For large $n$, the binomial distribution converges to the normal distribution (Eq. (A6.149)). The convergence is good for $\min(np, n(1 - p)) \ge 5$. For small values of $p$, the Poisson approximation (Eq. (A6.129)) can be used. Calculations of Eq. (A6.120) can be based upon the relationship between the binomial and the beta or the Fisher distribution (Appendix A9.4).

Generalization of Eq. (A6.120) for the case where one of the events $A_1, \dots, A_m$ can occur with probability $p_1, \dots, p_m$ at every trial leads to the multinomial distribution

$$\Pr\{\text{in } n \text{ trials } A_1 \text{ occurs } k_1 \text{ times}, \dots, A_m \text{ occurs } k_m \text{ times}\} = \frac{n!}{k_1!\cdots k_m!}\,p_1^{k_1}\cdots p_m^{k_m}, \tag{A6.124}$$

with $k_1 + \dots + k_m = n$ and $p_1 + \dots + p_m = 1$.

A6.10.8 Poisson Distribution

The arithmetic random variable $\zeta$ has a Poisson distribution if

$$p_k = \Pr\{\zeta = k\} = \frac{m^k}{k!}\,e^{-m}, \qquad k = 0, 1, \dots,\ m > 0, \tag{A6.125}$$

and thus

$$\Pr\{\zeta \le k\} = \sum_{i=0}^{k}\frac{m^i}{i!}\,e^{-m}, \qquad k = 0, 1, \dots,\ m > 0. \tag{A6.126}$$

The mean and the variance of $\zeta$ are

$$E[\zeta] = m \tag{A6.127}$$

and

$$\mathrm{Var}[\zeta] = m. \tag{A6.128}$$

The Poisson distribution often occurs in connection with exponentially distributed failure-free operating times. In fact, Eq. (A6.125) with $m = \lambda t$ gives the probability of exactly $k$ failures in the time interval $(0, t]$, see Eq. (A7.41). The Poisson distribution is also used as an approximation of the binomial distribution for $n \to \infty$ and $p \to 0$ such that $np = m < \infty$. To prove this convergence, called the Poisson approximation, set $m = np$; Eq. (A6.120) then yields

$$p_k = \frac{n!}{k!\,(n - k)!}\Big(\frac{m}{n}\Big)^k\Big(1 - \frac{m}{n}\Big)^{n - k} = \frac{n(n - 1)\cdots(n - k + 1)}{n^k}\,\frac{m^k}{k!}\Big(1 - \frac{m}{n}\Big)^{n - k} = 1\cdot\Big(1 - \frac{1}{n}\Big)\cdots\Big(1 - \frac{k - 1}{n}\Big)\frac{m^k}{k!}\Big(1 - \frac{m}{n}\Big)^{n - k},$$

from which (for $k < \infty$ and $m = np < \infty$) it follows that

$$\lim_{n \to \infty} p_k = \frac{m^k}{k!}\,e^{-m}, \qquad m = np. \tag{A6.129}$$

Using partial integration one can show that

$$\sum_{i=0}^{k}\frac{m^i}{i!}\,e^{-m} = 1 - \frac{1}{k!}\int_0^m y^k e^{-y}\,dy = 1 - \frac{1}{k!\,2^{k+1}}\int_0^{2m} x^k e^{-x/2}\,dx. \tag{A6.130}$$

The last term on the right-hand side of Eq. (A6.130) is the chi-square distribution function (Eq. (A6.103)) with $\nu/2 = k + 1$, evaluated at $t = 2m$. A table of the chi-square distribution can then be used for the numerical evaluation of the Poisson distribution (Table A9.2).

Example A6.20

Determine the mean and the variance of a Poisson-distributed random variable.

Solution

From Eqs. (A6.35) and (A6.125),

$$E[\zeta] = \sum_{k=0}^{\infty} k\,\frac{m^k}{k!}\,e^{-m} = \sum_{k=1}^{\infty} m\,\frac{m^{k-1}}{(k - 1)!}\,e^{-m} = m\sum_{i=0}^{\infty}\frac{m^i}{i!}\,e^{-m} = m.$$

Similarly, from Eqs. (A6.45), (A6.41), and (A6.125),

$$\mathrm{Var}[\zeta] = \sum_{k=0}^{\infty} k^2\,\frac{m^k}{k!}\,e^{-m} - m^2 = \sum_{k=0}^{\infty}[k(k - 1) + k]\,\frac{m^k}{k!}\,e^{-m} - m^2 = \sum_{k=2}^{\infty} m^2\,\frac{m^{k-2}}{(k - 2)!}\,e^{-m} + m - m^2 = m^2\sum_{i=0}^{\infty}\frac{m^i}{i!}\,e^{-m} + m - m^2 = m.$$
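The identity (A6.130) and the results of Example A6.20 can be confirmed numerically. The Python sketch below is illustrative only: the helper names are hypothetical, the integrals are evaluated with a Simpson rule chosen for convenience, and the mean/variance sums are truncated at 100 terms, which is ample for the $m$ used.

```python
import math


def poisson_pmf(k: int, m: float) -> float:
    """p_k = m^k / k! * exp(-m)  (Eq. (A6.125))."""
    return m ** k / math.factorial(k) * math.exp(-m)


def poisson_cdf(k: int, m: float) -> float:
    """Pr{zeta <= k}  (Eq. (A6.126))."""
    return sum(poisson_pmf(i, m) for i in range(k + 1))


def simpson(f, a: float, b: float, steps: int = 2_000) -> float:
    """Composite Simpson rule; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    total += 4 * sum(f(a + (2 * j - 1) * h) for j in range(1, steps // 2 + 1))
    total += 2 * sum(f(a + 2 * j * h) for j in range(1, steps // 2))
    return total * h / 3


k, m = 3, 2.5
lhs = poisson_cdf(k, m)
# Middle and right-hand forms of Eq. (A6.130).
rhs = 1 - simpson(lambda y: y ** k * math.exp(-y), 0.0, m) / math.factorial(k)
rhs_chi2 = 1 - simpson(lambda x: x ** k * math.exp(-x / 2), 0.0, 2 * m) / (math.factorial(k) * 2 ** (k + 1))
print(lhs, rhs, rhs_chi2)

# Example A6.20: mean and variance of the Poisson distribution both equal m.
mean = sum(i * poisson_pmf(i, m) for i in range(100))
var = sum(i * i * poisson_pmf(i, m) for i in range(100)) - mean ** 2
print(mean, var)
```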

A6.10.9 Geometric Distribution

Let $\delta_1, \delta_2, \dots$ be a sequence of independent Bernoulli variables resulting from Bernoulli trials. The arithmetic random variable $\zeta$ defining the number of trials to the first occurrence of the event $A$ has a geometric distribution given by

$$p_k = \Pr\{\zeta = k\} = p(1 - p)^{k - 1}, \qquad k = 1, 2, \dots,\ 0 < p < 1. \tag{A6.131}$$

Equation (A6.131) follows from the definition of Bernoulli variables $\delta_i$ (Eq. (A6.115))

$$p_k = \Pr\{\zeta = k\} = \Pr\{\delta_1 = 0 \cap \dots \cap \delta_{k-1} = 0 \cap \delta_k = 1\} = (1 - p)^{k - 1} p.$$

The geometric distribution is the only discrete distribution which exhibits the memoryless property, as does the exponential distribution for the continuous case. In fact, from $\Pr\{\zeta > k\} = \Pr\{\delta_1 = 0 \cap \dots \cap \delta_k = 0\} = (1 - p)^k$ and, for any $k$ and $j > 0$, it follows that

$$\Pr\{\zeta > k + j \mid \zeta > k\} = \frac{(1 - p)^{k + j}}{(1 - p)^k} = (1 - p)^j.$$

The failure rate (Eq. (A6.31)) is time-independent and given by

$$\lambda(k) = \frac{p(1 - p)^{k - 1}}{(1 - p)^{k - 1}} = p. \tag{A6.132}$$

For the distribution function of the random variable $\zeta$ defined by Eq. (A6.131) one obtains

$$\Pr\{\zeta \le k\} = \sum_{i=1}^{k} p_i = 1 - \Pr\{\zeta > k\} = 1 - (1 - p)^k. \tag{A6.133}$$

Mean and variance are then (with $\sum_{n=1}^{\infty} n x^n = x/(1 - x)^2$ and $\sum_{n=1}^{\infty} n^2 x^n = x(1 + x)/(1 - x)^3$, $|x| < 1$)

$$E[\zeta] = \frac{1}{p} \tag{A6.134}$$

and

$$\mathrm{Var}[\zeta] = \frac{1 - p}{p^2}. \tag{A6.135}$$

If Bernoulli trials are carried out at regular intervals $\Delta t$, then Eq. (A6.133) provides the distribution function of the number of time units $\Delta t$ between successive occurrences of the event $A$ under consideration, for example breakdown of a capacitor, interference pulse in a digital network, etc.

Often the geometric distribution is considered with $p_k = p(1 - p)^k$, $k = 0, 1, \dots$; in this case $E[\zeta] = (1 - p)/p$ and $\mathrm{Var}[\zeta] = (1 - p)/p^2$.
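The memoryless property, the constant failure rate (A6.132) and the moments (A6.134) and (A6.135) can all be checked numerically. The Python sketch below is a minimal illustration with an arbitrarily chosen $p = 0.2$; the infinite sums are truncated where $(1 - p)^K$ is negligible.

```python
import math


def geom_pmf(k: int, p: float) -> float:
    """p_k = p (1 - p)^(k - 1), k = 1, 2, ...  (Eq. (A6.131))."""
    return p * (1 - p) ** (k - 1)


def geom_survival(k: int, p: float) -> float:
    """Pr{zeta > k} = (1 - p)^k."""
    return (1 - p) ** k


p = 0.2

# Memoryless property: Pr{zeta > k + j | zeta > k} = Pr{zeta > j} for any k.
k, j = 7, 3
conditional = geom_survival(k + j, p) / geom_survival(k, p)

# Discrete failure rate p_i / Pr{zeta > i - 1} is constant and equal to p (Eq. (A6.132)).
rates = [geom_pmf(i, p) / geom_survival(i - 1, p) for i in range(1, 20)]

# Mean and variance by truncated summation, to compare with Eqs. (A6.134) and (A6.135).
K = 2_000  # (1 - p)^K is negligible here
mean = sum(i * geom_pmf(i, p) for i in range(1, K))
var = sum(i * i * geom_pmf(i, p) for i in range(1, K)) - mean ** 2
print(conditional, geom_survival(j, p), mean, 1 / p, var, (1 - p) / p ** 2)
```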

A6.10.10 Hypergeometric Distribution

The hypergeometric distribution describes the model of a random sample without replacement. For example, if it is known that there are exactly $K$ defective components in a lot of size $N$, then the probability of finding $k$ defective components in a random sample of size $n$ is given by

$$p_k = \Pr\{\zeta = k\} = \frac{\binom{K}{k}\binom{N - K}{n - k}}{\binom{N}{n}}, \qquad k = 0, \dots, \min(K, n). \tag{A6.136}$$

Equation (A6.136) defines the hypergeometric distribution. Since for fixed $n$ and $k$ ($0 \le k \le n$)

$$\lim_{N \to \infty} p_k = \binom{n}{k} p^k (1 - p)^{n - k}, \qquad \text{with } p = \frac{K}{N},$$

the hypergeometric distribution can, for large $N$, be approximated by the binomial distribution with $p = K/N$. For the random variable $\zeta$ defined by Eq. (A6.136) it follows that

$$\Pr\{\zeta \le k\} = \sum_{i=0}^{k}\frac{\binom{K}{i}\binom{N - K}{n - i}}{\binom{N}{n}}, \qquad k = 0, \dots, n,\ 0 < p < 1, \tag{A6.137}$$

$$E[\zeta] = n\,\frac{K}{N}, \tag{A6.138}$$

and

$$\mathrm{Var}[\zeta] = \frac{K\,n\,(N - K)(N - n)}{N^2 (N - 1)}. \tag{A6.139}$$

A6.11 Limit Theorems

Limit theorems are of great importance in practical applications because they can be used to find approximate expressions with the help of known (tabulated) distributions. Two important cases will be discussed in this section, the law of large numbers and the central limit theorem. The law of large numbers provides additional justification for the construction of probability theory on the basis of relative frequencies. The central limit theorem shows that the normal distribution can be used as an approximation in many practical situations.

A6.11.1 Law of Large Numbers

Two notions used with the law of large numbers are convergence in probability and convergence with probability one. Let $\xi_1, \xi_2, \dots$, and $\xi$ be random variables on a probability space $[\Omega, \mathcal{F}, \Pr]$. $\xi_n$ converge in probability to $\xi$ if for arbitrary $\varepsilon > 0$

$$\lim_{n \to \infty}\Pr\{|\xi_n - \xi| > \varepsilon\} = 0 \tag{A6.140}$$

holds. $\xi_n$ converge to $\xi$ with probability one if

$$\Pr\{\lim_{n \to \infty}\xi_n = \xi\} = 1. \tag{A6.141}$$

Convergence with probability one is also called almost sure (a.s.) convergence. An equivalent condition for Eq. (A6.141) is

$$\lim_{n \to \infty}\Pr\{\sup_{k \ge n}|\xi_k - \xi| > \varepsilon\} = 0, \tag{A6.142}$$

for any $\varepsilon > 0$. This clarifies the difference between Eq. (A6.140) and the stronger condition given by Eq. (A6.141).

Let us now consider an infinite sequence of Bernoulli trials (Eqs. (A6.115), (A6.119), and (A6.120)), with parameter $p = \Pr\{A\}$, and let $\zeta_n$ be the number of occurrences of the event $A$ in $n$ trials

$$\zeta_n = \delta_1 + \dots + \delta_n. \tag{A6.143}$$

The quantity $\zeta_n/n$ is the relative frequency of the occurrence of $A$ in $n$ independent trials. The weak law of large numbers states that for every $\varepsilon > 0$,

$$\lim_{n \to \infty}\Pr\Big\{\Big|\frac{\zeta_n}{n} - p\Big| > \varepsilon\Big\} = 0. \tag{A6.144}$$

Equation (A6.144) is a direct consequence of Chebyshev's inequality (Eq. (A6.49)). Similarly, for a sequence of independent, identically distributed random variables $\tau_1, \dots, \tau_n$ with mean $E[\tau_i] = a$ and variance $\mathrm{Var}[\tau_i] = \sigma^2 < \infty$ ($i = 1, \dots, n$),

$$\lim_{n \to \infty}\Pr\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n}\tau_i - a\Big| > \varepsilon\Big\} = 0. \tag{A6.145}$$

According to Eq. (A6.144), the sequence $\zeta_n/n$ converges in probability to $p = \Pr\{A\}$. Moreover, according to Eq. (A6.145), the arithmetic mean $(t_1 + \dots + t_n)/n$ of $n$ independent observations of the random variable $\tau$ (with a finite variance) converges in probability to $E[\tau]$. Therefore, $\hat p = \zeta_n/n$ and $\hat a = (t_1 + \dots + t_n)/n$ are consistent estimates of $p = \Pr\{A\}$ and $a = E[\tau]$, respectively (Appendices A8.1 and A8.2). Equation (A6.145) is also a direct consequence of Chebyshev's inequality.

A firmer statement than the weak law of large numbers is given by the strong law of large numbers,

$$\Pr\Big\{\lim_{n \to \infty}\frac{\zeta_n}{n} = p\Big\} = 1. \tag{A6.146}$$

According to Eq. (A6.146), the relative frequency $\zeta_n/n$ converges with probability one (a.s.) to $p = \Pr\{A\}$. Similarly, for a sequence of independent, identically distributed random variables $\tau_1, \dots, \tau_n$ with mean $E[\tau_i] = a$ and variance $\mathrm{Var}[\tau_i] = \sigma^2 < \infty$ ($i = 1, 2, \dots$),

$$\Pr\Big\{\lim_{n \to \infty}\frac{1}{n}\sum_{i=1}^{n}\tau_i = a\Big\} = 1. \tag{A6.147}$$

The proof of the strong law of large numbers (A6.146) and (A6.147) is more laborious than that of the weak law of large numbers, see e.g. [A6.6 (Vol. II), A6.7].

A6.11.2 Central Limit Theorem

Let $\tau_1, \tau_2, \dots$ be independent, identically distributed random variables with mean $E[\tau_i] = a$ and variance $\mathrm{Var}[\tau_i] = \sigma^2 < \infty$, $i = 1, 2, \dots$. For every $t < \infty$,

$$\lim_{n \to \infty}\Pr\Big\{\frac{\sum_{i=1}^{n}\tau_i - na}{\sigma\sqrt{n}} \le t\Big\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-x^2/2}\,dx. \tag{A6.148}$$

Equation (A6.148) is the central limit theorem. It says that for large values of $n$, the distribution function of the sum $\tau_1 + \dots + \tau_n$ can be approximated by the normal distribution with mean $E[\tau_1 + \dots + \tau_n] = n\,E[\tau_i] = na$ and variance $\mathrm{Var}[\tau_1 + \dots + \tau_n] = n\,\mathrm{Var}[\tau_i] = n\sigma^2$. The central limit theorem is of great theoretical and practical importance in probability theory and mathematical statistics. It includes the integral Laplace theorem (also known as the De Moivre-Laplace theorem) for the case where $\tau_i = \delta_i$ are Bernoulli variables,

$$\lim_{n \to \infty}\Pr\Big\{\frac{\sum_{i=1}^{n}\delta_i - np}{\sqrt{np(1 - p)}} \le t\Big\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-x^2/2}\,dx. \tag{A6.149}$$

$\sum_{i=1}^{n}\delta_i$ is the random variable $\zeta$ in Eq. (A6.120) for the binomial distribution, i.e. it is the total number of occurrences of the event considered in $n$ Bernoulli trials.
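How quickly the De Moivre-Laplace approximation (A6.149) becomes usable can be seen by comparing it with the exact binomial distribution function (A6.121). The Python sketch below is illustrative: it evaluates the exact sum in log space (via `math.lgamma`) so that large $n$ does not overflow a float, uses the approximation without continuity correction, and looks at a point about one standard deviation above the mean (an arbitrary choice).

```python
import math
from statistics import NormalDist

PHI = NormalDist()


def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact Pr{zeta <= k} for the binomial distribution (Eq. (A6.121)),
    summed in log space so that large n does not overflow a float."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return total


def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """Pr{zeta <= k} ~ Phi((k - n p) / sqrt(n p (1 - p))), from Eq. (A6.149)."""
    return PHI.cdf((k - n * p) / math.sqrt(n * p * (1 - p)))


p = 0.3
errors = {}
for n in (100, 1_000, 10_000):
    k = math.floor(n * p + math.sqrt(n * p * (1 - p)))  # about one standard deviation above the mean
    errors[n] = abs(binom_cdf(k, n, p) - normal_approx_cdf(k, n, p))
    print(n, k, errors[n])
```

The absolute error shrinks roughly like $1/\sqrt{n}$ here, in line with the rule of thumb $\min(np, n(1 - p)) \ge 5$ being only a starting point for good accuracy.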

From Eq. (A6.149) it follows that

$$\Pr\Big\{\frac{\sum_{i=1}^{n}\delta_i}{n} - p \le \varepsilon\Big\} \to \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{n\varepsilon}{\sqrt{np(1 - p)}}} e^{-x^2/2}\,dx, \qquad n \to \infty,$$

or (for each given $\varepsilon > 0$)

$$\Pr\Big\{\Big|\frac{\sum_{i=1}^{n}\delta_i}{n} - p\Big| \le \varepsilon\Big\} \to \frac{2}{\sqrt{2\pi}}\int_{0}^{\frac{n\varepsilon}{\sqrt{np(1 - p)}}} e^{-x^2/2}\,dx, \qquad n \to \infty. \tag{A6.150}$$

Setting the right-hand side of Eq. (A6.150) equal to $\gamma$ allows determination of the number of trials $n$ for given $\gamma$, $p$, and $\varepsilon$ which are necessary to fulfill the inequality $|(\delta_1 + \dots + \delta_n)/n - p| \le \varepsilon$ with a probability $\gamma$. This result is important for reliability investigations using Monte Carlo simulations, see also Eq. (A6.152).

The central limit theorem can be generalized under weak conditions to the sum of independent random variables with different distribution functions [A6.6 (Vol. II), A6.7], the meaning of these conditions being that each individual standardized random variable $(\tau_i - E[\tau_i])/\sqrt{\mathrm{Var}[\tau_i]}$ provides a small contribution to the standardized sum (Lindeberg conditions).

Example A6.21

The series production of a given assembly requires 5,000 ICs of a particular type. 0.5% of these ICs are defective. How many ICs must be bought in order to be able to produce the series with a probability of $\gamma = 0.99$?

Solution

Setting $p = \Pr\{\text{IC good}\} = 0.995$, the minimum value of $n$ satisfying

$$\Pr\Big\{\sum_{i=1}^{n}\delta_i > 5{,}000\Big\} \ge 0.99 = \gamma$$

must be found. Rearrangement of Eq. (A6.149), setting $t = t_{1-\gamma}$, leads to

$$\lim_{n \to \infty}\Pr\Big\{\sum_{i=1}^{n}\delta_i > t_{1-\gamma}\sqrt{np(1 - p)} + np\Big\} = \frac{1}{\sqrt{2\pi}}\int_{t_{1-\gamma}}^{\infty} e^{-x^2/2}\,dx = 1 - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t_{1-\gamma}} e^{-x^2/2}\,dx = \gamma,$$

where $t_{1-\gamma}$ denotes the $1 - \gamma$ quantile of the standard normal distribution $\Phi(t)$ given by Eq. (A6.109) or Table A9.1. For $\gamma = 0.99$ one obtains from Table A9.1 $t_{1-\gamma} = t_{0.01} = -2.33$. With $p = 0.995$, it follows that

$$-2.33\sqrt{n \cdot 0.995 \cdot 0.005} + 0.995\,n \ge 5{,}000.$$

Thus, $n = 5{,}037$ ICs must be bought (if only $5{,}025 = 5{,}000 + 5{,}000 \cdot 0.005$ ICs were ordered, then $t_{1-\gamma} \approx 0$ and $\gamma \approx 0.5$).
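The last step of Example A6.21 is a one-dimensional search for the smallest $n$. The Python sketch below (function name hypothetical) uses `NormalDist().inv_cdf` for $t_{1-\gamma}$, i.e. the exact quantile $\approx -2.326$ instead of the table value $-2.33$; the answer is still $n = 5{,}037$.

```python
import math
from statistics import NormalDist


def ics_to_buy(required: int, p_good: float, gamma: float) -> int:
    """Smallest n with t_{1-gamma} * sqrt(n p (1 - p)) + n p >= required
    (normal approximation used in Example A6.21)."""
    t = NormalDist().inv_cdf(1 - gamma)  # t_{1-gamma}; about -2.33 for gamma = 0.99
    n = required
    while t * math.sqrt(n * p_good * (1 - p_good)) + n * p_good < required:
        n += 1
    return n


print(ics_to_buy(5_000, 0.995, 0.99))
```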

Example A6.22

Electronic components are delivered with a defective probability $p = 0.1\%$. (i) How large is the probability of having exactly 8 defective components in a (homogeneous) lot of size $n = 5{,}000$? (ii) In which interval $[k_1, k_2]$ around the mean value $np = 5$ will the number of defective components lie in a lot of size $n = 5{,}000$ with a probability $\gamma$ as near as possible to 0.95?

Solution

(i) The use of the Poisson approximation (Eq. (A6.129)) leads to

$$p_8 \approx \frac{5^8}{8!}\,e^{-5} \approx 0.0653,$$

the exact value (obtained with Eq. (A6.120)) also being $\approx 0.0653$. For comparison, $p_k$ can be evaluated for $k = 0, \dots, 9$ both with the Poisson approximation (Eq. (A6.129)) and exactly with Eq. (A6.120); the two sets of values agree closely.

(ii) From these values one recognizes that the interval $[k_1, k_2] = [1, 9]$ is centered on the mean value $np = 5$ and satisfies the condition "$\gamma$ as near as possible to 0.95" ($\gamma = p_1 + p_2 + \dots + p_9 \approx 0.96$). A good approximation for $k_1$ and $k_2$ can also be obtained using Eq. (A6.151) to determine $\varepsilon = (k_2 - k_1)/2n$ for given $p$, $n$, and $t_{(1+\gamma)/2}$

$$\varepsilon = \frac{k_2 - k_1}{2n} = \frac{\sqrt{np(1 - p)}}{n}\,t_{(1+\gamma)/2}, \tag{A6.151}$$

where $t_{(1+\gamma)/2}$ is the $(1 + \gamma)/2$ quantile of the standard normal distribution $\Phi(t)$ (Eq. (A6.109)). Equation (A6.151) is a consequence of Eq. (A6.150) by considering that

$$\frac{2}{\sqrt{2\pi}}\int_0^{A} e^{-x^2/2}\,dx = \gamma,$$

from which

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{A} e^{-x^2/2}\,dx = \frac{\gamma}{2} + \frac{1}{2} = \frac{1 + \gamma}{2}$$

yields $n\varepsilon/\sqrt{np(1 - p)} = A = t_{(1+\gamma)/2}$. With $\gamma = 0.95$, $t_{(1+\gamma)/2} = t_{0.975} = 1.96$ (Table A9.1), $n = 5{,}000$, and $p = 0.001$ one obtains $n\varepsilon = 4.38$, yielding $k_1 = np - n\varepsilon = 0.62$ ($\ge 0$) and $k_2 = np + n\varepsilon = 9.38$ ($\le n$). The same solution is also given by Eq. (A8.45), $k_{2,1} = np \pm b\sqrt{np(1 - p)}$, considering $b = t_{(1+\gamma)/2}$.

Example A6.23

As an example belonging to both probability theory and statistics, determine the number $n$ of trials necessary to estimate an unknown probability $p$ within a given interval $\pm\varepsilon$ at a given confidence level $\gamma$ (e.g. for a Monte Carlo simulation).

Solution

From Eq. (A6.150) it follows that for $n \to \infty$

$$\Pr\Big\{\Big|\frac{\sum_{i=1}^{n}\delta_i}{n} - p\Big| \le \varepsilon\Big\} \approx \frac{2}{\sqrt{2\pi}}\int_0^{\frac{n\varepsilon}{\sqrt{np(1 - p)}}} e^{-x^2/2}\,dx = \gamma.$$

Therefore,

$$\frac{1}{\sqrt{2\pi}}\int_0^{\frac{n\varepsilon}{\sqrt{np(1 - p)}}} e^{-x^2/2}\,dx = \frac{\gamma}{2} \quad \text{yields} \quad \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{n\varepsilon}{\sqrt{np(1 - p)}}} e^{-x^2/2}\,dx = \frac{1}{2} + \frac{\gamma}{2} = \frac{1 + \gamma}{2},$$

and thus $n\varepsilon/\sqrt{np(1 - p)} = t_{(1+\gamma)/2}$, from which

$$n = \Big(\frac{t_{(1+\gamma)/2}}{\varepsilon}\Big)^2 p(1 - p), \tag{A6.152}$$

where $t_{(1+\gamma)/2}$ is the $(1 + \gamma)/2$ quantile of the standard normal distribution $\Phi(t)$ (Eq. (A6.109), Appendix A9.1). The number of trials $n$ depends on the value of $p$ and is a maximum ($n_{max}$) for $p = 0.5$. For example, for $2\varepsilon = 0.05$ ($\varepsilon = 0.025$), $n_{max} \approx 1{,}082$ for $\gamma = 0.9$ and $n_{max} \approx 1{,}537$ for $\gamma = 0.95$.

Equation (A6.152) has been established by assuming that $p$ is known. Thus, $\varepsilon$ refers to the number of observations in $n$ trials ($2\varepsilon n = k_2 - k_1$ as per Eq. (A8.45) with $b = t_{(1+\gamma)/2}$). However, the meaning of Eq. (A8.45) can be reversed by assuming that the number $k$ of realizations in $n$ trials is known. In this case, for $n$ large and $np$ or $n(1 - p)$ not very small, $\varepsilon$ refers to the width of the confidence interval for $p$ ($2\varepsilon = \hat p_u - \hat p_l$ as per Eq. (A8.43) with $k(1 - k/n) \gg b^2/4$ and thus also $n \gg b^2$). The two considerations yielding a relation of the form given by Eq. (A6.152) are basically different (probability theory and statistics) and agree only because of $n \to \infty$ (see also the remarks on pp. 474 and 486). For $n$, $np$ or $n(1 - p)$ small, the binomial distribution has to be used (Eqs. (A8.37) and (A8.38)).
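The numbers in Examples A6.22 and A6.23 can be recomputed in a few lines. The Python sketch below is illustrative only (helper names are hypothetical): it prints the Poisson and exact binomial values of $p_k$ for $k = 0, \dots, 9$, the probability of the interval $[1, 9]$, the half-width $n\varepsilon$ of Eq. (A6.151), and $n_{max}$ from Eq. (A6.152) for a few values of $\gamma$ and $\varepsilon$. Quantiles come from `NormalDist().inv_cdf`, so they differ slightly from rounded table values.

```python
import math
from statistics import NormalDist


def poisson_pmf(k: int, m: float) -> float:
    """Poisson approximation, Eq. (A6.129)."""
    return m ** k / math.factorial(k) * math.exp(-m)


def binom_pmf(k: int, n: int, p: float) -> float:
    """Exact binomial probability, Eq. (A6.120)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Example A6.22: n = 5,000 components, defective probability p = 0.1 %.
n, p = 5_000, 0.001
m = n * p
print(" k  Poisson   exact")
for k in range(10):
    print(f"{k:2d}  {poisson_pmf(k, m):.4f}   {binom_pmf(k, n, p):.4f}")

gamma_19 = sum(binom_pmf(k, n, p) for k in range(1, 10))  # Pr{1 <= zeta <= 9}

t = NormalDist().inv_cdf((1 + 0.95) / 2)                   # t_{0.975}, about 1.96
n_eps = t * math.sqrt(n * p * (1 - p))                     # n * epsilon from Eq. (A6.151)
k1, k2 = m - n_eps, m + n_eps
print(gamma_19, n_eps, k1, k2)


# Example A6.23: n from Eq. (A6.152); the maximum n_max is reached for p = 0.5.
def trials_needed(eps: float, gamma: float, p: float = 0.5) -> float:
    t = NormalDist().inv_cdf((1 + gamma) / 2)
    return (t / eps) ** 2 * p * (1 - p)


for gamma in (0.8, 0.9, 0.95):
    print(gamma, [round(trials_needed(eps, gamma)) for eps in (0.05, 0.025)])
```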

A7 Basic Stochastic-Processes Theory

Stochastic processes are a powerful tool for the investigation of the reliability and availability of repairable equipment and systems. A stochastic process can be considered as a family of time-dependent random variables or as a random function in time, and thus has a theoretical foundation based on probability theory (Appendix A6). The use of stochastic processes allows analysis of the influence of the failure-free operating and repair time distributions of elements, as well as of the system's structure, repair strategy, and logistical support, on the reliability and availability of a given system. Considering applications given in Chapter 6, and for reasons of mathematical tractability, this appendix mainly deals with regenerative stochastic processes with a finite state space, to which belong renewal processes, Markov processes, semi-Markov processes, and semi-regenerative processes. Reward and frequency/duration aspects, as useful in some applications, are introduced in Appendix A. The theoretical presentation is supported by examples taken from practical applications. This appendix is a compendium of the theory of stochastic processes, consistent from a mathematical point of view but still with reliability engineering applications in mind.

A7.1 Introduction

Stochastic processes are mathematical models for random phenomena evolving over time, such as the time behavior of a repairable system or the noise voltage of a diode. They are designated in this book by Greek letters $\xi(t)$, $\zeta(t)$, $\eta(t)$, $\nu(t)$, etc.

To introduce the concept of a stochastic process, consider the time behavior of a system subject to random influences and let $T$ be the time interval of interest, e.g. $T = [0, \infty)$. The set of possible states of the system, i.e. the state space, is assumed to be a subset of the set of real numbers. The state of the system at a given time $t_0$ is thus a random variable $\xi(t_0)$. The random variables $\xi(t)$, $t \in T$, may be arbitrarily coupled together. However, for any $n = 1, 2, \dots$, and arbitrary values $t_1, \dots, t_n \in T$, the existence of the $n$-dimensional distribution function (Eq. (A6.51))

$$F(x_1, \dots, x_n, t_1, \dots, t_n) = \Pr\{\xi(t_1) \le x_1, \dots, \xi(t_n) \le x_n\} \tag{A7.1}$$

89 A 7.1 Introduction 411 is assumed. ~(tl)'..., ~(tn) are thus the components of a random vector t (t). It can be shown that the family of n-dimensional distribution functions (Eq. (A7.1)) satisfies the consistency condition and the symmetry condition F(Xil,'''' Xin, til'..., tin) = F(Xl,...,xn, t1>..,tn)' ii E {I,..., n}, ij '" ii for j '" i. Conversely, if a family of distribution functions F(xl,..., xn' tl,..., tn) satisfying the above consistency and symmetry conditions is given, then according to a theorem of A.N. Kolmogorov [A6.1O], a distribution law on a suitable event field '13T of the space 'l( T consisting of all real functions on T exists. This distribution law is the distribution of a random function ~(t), t E T, usually referred to as a stochastic process. The time function resulting from a particular experiment is called a sample path or realization of the stochastic process. All sample paths are in 'l( T, however the set of sample paths for a particular stochastic process can be significantly smaller than 'l( T, e.g. consisting only of increasing step functions. In the case of discrete time, the notion of a sequence of random variables ~n' net is generally used. The concept of a stochastic process generalizes the concept of a random variable introduced in Appendix A6.5. If the random variables ~(t) are defined as measurable functions ~(t) = ~(t, (0), t E T, on a given probability space [Q, '.J, PrJ then and the consistency and symmetry conditions are fulfilled. (0 represents the random influence. The function ~(t,(o), t E T, is for a given (0 a realization of the stochastic process. The Kolmogorov theorem assures the existence of a stochastic process. However, the determination of all n-dimensional distribution functions is practically impossible in a general case. Sufficient for many applications are often some specific parameters of the stochastic process involved, such as state probabilities or stay (sojourn) times. 
The problem considered, and the model assumed, generally allow determination of: the time domain $T$ (continuous, discrete, finite, infinite); the structure of the state space (continuous, discrete); the dependency structure of the process under consideration (e.g. memoryless); and invariance properties with respect to time shifts (time-homogeneous, stationary). The simplest process in discrete time is a sequence of independent random variables $\xi_1, \xi_2, \ldots$. Also easy to describe are processes with independent increments, for instance Poisson processes (Appendices A7.2.5 & A7.8.2), for which

$$\Pr\{\xi(t_0) \le x_0,\ \xi(t_1) - \xi(t_0) \le x_1,\ \ldots,\ \xi(t_n) - \xi(t_{n-1}) \le x_n\} = \Pr\{\xi(t_0) \le x_0\} \prod_{i=1}^{n} \Pr\{\xi(t_i) - \xi(t_{i-1}) \le x_i\} \tag{A7.2}$$

holds for arbitrary $n = 1, 2, \ldots$, $x_0, \ldots, x_n$, and $t_0 < \cdots < t_n \in T$.

For reliability investigations, processes with continuous time parameter $t \ge 0$ and discrete state space $\{Z_0, \ldots, Z_m\}$ are important. Among these, the following processes will be discussed in the following sections: renewal processes; Markov processes; semi-Markov processes; semi-regenerative processes (processes with an embedded semi-Markov process); and particular nonregenerative processes (nonhomogeneous Poisson processes, for instance).

Markov processes represent a straightforward generalization of sequences of independent random variables. They are characterized by the memoryless property: the evolution of the process after an arbitrary time point $t$ depends only on $t$ and on the state occupied at $t$, not on the evolution of the process before $t$ (in time-homogeneous Markov processes, the dependence on $t$ also disappears). Markov processes are very simple regenerative stochastic processes. They are regenerative with respect to each state and, if time-homogeneous, also with respect to any time $t$. Semi-Markov processes have the Markov property at the time points of any state change, i.e. all states of a semi-Markov process are regeneration states. In a semi-regenerative process, a subset $Z_0, \ldots, Z_k$ of the states $Z_0, \ldots, Z_m$ are regeneration states ($k < m$) and constitute an embedded semi-Markov process. For an arbitrary regenerative stochastic process, there exists a sequence of random points (regeneration points) at which the process forgets its foregoing evolution and (from a probabilistic point of view) restarts anew. Typically, regeneration points occur when the process returns to some particular states (regeneration states). Between regeneration points, the dependency structure of the process can be very complicated.
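For Markov processes in continuous time, the memoryless property goes together with exponentially distributed stay times (constant rates, see Appendix A7.5). The following minimal Python sketch, with hypothetical parameter values, checks numerically that the residual life $\Pr\{\tau > t + x \mid \tau > t\}$ of an exponential variable does not depend on its age $t$, whereas that of a Weibull variable with wearout (shape 2) does. The helper names are illustrative only.

```python
import math

def survival_exponential(x: float, lam: float) -> float:
    """Pr{tau > x} for an exponentially distributed tau with rate lam."""
    return math.exp(-lam * x)

def survival_weibull(x: float, lam: float, beta: float) -> float:
    """Pr{tau > x} for a Weibull distribution F(x) = 1 - exp(-(lam*x)**beta)."""
    return math.exp(-((lam * x) ** beta))

def residual_survival(surv, t: float, x: float) -> float:
    """Pr{tau > t + x | tau > t}: survival of the residual time of an item of age t."""
    return surv(t + x) / surv(t)

lam = 1e-3          # hypothetical rate in 1/h, for illustration only
t, x = 500.0, 200.0

exp_surv = lambda y: survival_exponential(y, lam)
wb_surv = lambda y: survival_weibull(y, lam, beta=2.0)

# Exponential: the residual time has the same law as a new item (memoryless).
print(residual_survival(exp_surv, t, x), exp_surv(x))
# Weibull, beta = 2 (wearout): an item of age t is less likely to survive x more hours.
print(residual_survival(wb_surv, t, x), wb_surv(x))
```

Because the Weibull residual life depends on the age $t$, a process whose stay times are Weibull distributed would need the elapsed stay time as additional state information to be Markovian.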
In order to describe the time behavior of systems which are in statistical equilibrium (steady-state), stationary and time-homogeneous processes are suitable. The process $\xi(t)$ is stationary (strictly stationary) if for arbitrary $n = 1, 2, \ldots$, $x_1, \ldots, x_n$, $t_1, \ldots, t_n$, and time span $a$ ($t_i, t_i + a \in T$, $i = 1, \ldots, n$)

$$\Pr\{\xi(t_1 + a) \le x_1, \ldots, \xi(t_n + a) \le x_n\} = \Pr\{\xi(t_1) \le x_1, \ldots, \xi(t_n) \le x_n\}. \tag{A7.3}$$

For $n = 1$, Eq. (A7.3) shows that the distribution function of the random variable $\xi(t)$ is independent of $t$. Hence, $E[\xi(t)]$, $\mathrm{Var}[\xi(t)]$, and all other moments are independent of time. For $n = 2$, the distribution function of the two-dimensional random variable $(\xi(t), \xi(t + u))$ is only a function of $u$. From this it follows that the correlation coefficient between $\xi(t)$ and $\xi(t + u)$ is also only a function of $u$:

$$\rho_{\xi\xi}(t, t + u) = \frac{E[(\xi(t + u) - E[\xi(t + u)])(\xi(t) - E[\xi(t)])]}{\sqrt{\mathrm{Var}[\xi(t + u)]\,\mathrm{Var}[\xi(t)]}} = \frac{E[\xi(t)\,\xi(t + u)] - E^2[\xi(t)]}{\mathrm{Var}[\xi(t)]} = \rho_{\xi\xi}(u). \tag{A7.4}$$

Besides stationarity in the strict sense, stationarity is also defined in the wide sense. The process $\xi(t)$ is stationary in the wide sense if the mean $E[\xi(t)]$, the variance $\mathrm{Var}[\xi(t)]$, and the correlation coefficient $\rho_{\xi\xi}(t, t + u)$ are finite and independent of $t$. Stationarity in the strict sense of a process having a finite variance implies stationarity in the wide sense. The converse is true only in some particular cases, e.g. for the normal process (a process for which all $n$-dimensional distribution functions (Eq. (A7.1)) are $n$-dimensional normal distribution functions, see Example A6.16).

A process $\xi(t)$ is time-homogeneous if it has stationary increments, i.e. if for arbitrary $n = 1, 2, \ldots$, values $x_1, \ldots, x_n$, time span $a$, and disjoint intervals $(t_i, b_i)$ ($t_i, t_i + a, b_i, b_i + a \in T$, $i = 1, \ldots, n$)

$$\Pr\{\xi(t_1 + a) - \xi(b_1 + a) \le x_1, \ldots, \xi(t_n + a) - \xi(b_n + a) \le x_n\} = \Pr\{\xi(t_1) - \xi(b_1) \le x_1, \ldots, \xi(t_n) - \xi(b_n) \le x_n\}. \tag{A7.5}$$

If $\xi(t)$ is stationary, it is also time-homogeneous. The converse is not true, in general. However, time-homogeneous Markov processes (for instance) become stationary as $t \to \infty$.

The stochastic processes discussed in this appendix evolve in time, and their state space is a subset of the natural numbers. Both restrictions can be dropped without particular difficulty, with a view to a general theory of stochastic processes.

A7.2 Renewal Processes

In reliability theory, renewal processes describe the model of an item in continuous operation which is replaced at each failure, in a negligible amount of time, by a new, statistically identical item. Results for renewal processes are basic and useful in many practical situations. To define the renewal process, let $\tau_0, \tau_1, \ldots$ be statistically independent and nonnegative random variables (e.g.
failure-free operating times) distributed according to

$$F_A(x) = \Pr\{\tau_0 \le x\} \tag{A7.6}$$

and

$$F(x) = \Pr\{\tau_i \le x\}, \quad i = 1, 2, \ldots \tag{A7.7}$$

The random variables

$$S_n = \sum_{i=0}^{n-1} \tau_i, \quad n = 1, 2, \ldots, \tag{A7.8}$$

or equivalently the sequence $\tau_0, \tau_1, \ldots$, constitute a renewal process. The points $S_1, S_2, \ldots$ are renewal points (regeneration points). The renewal process is thus a particular point process. The arcs relating the time points $0, S_1, S_2, \ldots$ in Fig. A7.1a should help to visualize the underlying point process. A counting function

$$\nu(t) = \begin{cases} 0 & \text{for } t < \tau_0 \\ n & \text{for } S_n \le t < S_{n+1},\ n = 1, 2, \ldots \end{cases}$$

can be associated with any renewal process; it gives the number of renewal points in the interval $(0, t]$, see Fig. A7.1b. Renewal processes are ordinary for $F_A(x) = F(x)$, otherwise they are modified (stationary for $F_A(x)$ as in Eq. (A7.35)). To simplify the analysis, let us assume in the following that

$$F_A(0) = F(0) = 0, \tag{A7.9}$$

$$f_A(x) = \frac{dF_A(x)}{dx} \quad \text{and} \quad f(x) = \frac{dF(x)}{dx} \quad \text{exist}, \tag{A7.10}$$

$$\mathrm{MTTF} = E[\tau_i] = \int_0^\infty (1 - F(x))\,dx < \infty, \quad i \ge 1. \tag{A7.11}$$

Figure A7.1 a) Possible time schedule of a renewal process; b) corresponding count function $\nu(t)$ ($S_1, S_2, \ldots$ are renewal (regeneration) points)
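The definitions of the renewal points $S_n$ (Eq. (A7.8)) and of the counting function $\nu(t)$ translate directly into code. The Python sketch below is only an illustration: it takes a finite, hypothetical list of times $\tau_0, \tau_1, \ldots$ (assumed strictly positive, in line with Eq. (A7.9)), forms the partial sums $S_n$, and evaluates $\nu(t)$ as the number of renewal points in $(0, t]$. The function names are illustrative, not from the text.

```python
import bisect
import itertools
from typing import List, Sequence

def renewal_points(taus: Sequence[float]) -> List[float]:
    """S_n = tau_0 + ... + tau_{n-1}, n = 1, 2, ... (Eq. (A7.8))."""
    return list(itertools.accumulate(taus))

def count_renewals(points: Sequence[float], t: float) -> int:
    """nu(t): number of renewal points S_n in (0, t].

    S_n <= t < S_{n+1} gives nu(t) = n, i.e. the number of points <= t.
    Assumes strictly increasing positive points; the count is exact only
    for t below the last available point (more renewals may follow it).
    """
    return bisect.bisect_right(points, t)

# Hypothetical failure-free operating times tau_0, tau_1, ... in hours.
taus = [120.0, 80.0, 200.0, 50.0]
S = renewal_points(taus)
print(S)                                                  # [120.0, 200.0, 400.0, 450.0]
print([count_renewals(S, t) for t in (100.0, 120.0, 399.0, 420.0)])   # [0, 1, 2, 3]
```

Using `bisect_right` makes $\nu(t)$ right-continuous, so a renewal occurring exactly at $t$ is counted, as required by the interval $(0, t]$.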

A7.2.1 Renewal Function, Renewal Density

Consider first the distribution function of the number of renewal points $\nu(t)$ in the time interval $(0, t]$. From Fig. A7.1,

$$\Pr\{\nu(t) \le n - 1\} = \Pr\{S_n > t\} = 1 - \Pr\{S_n \le t\} = 1 - \Pr\{\tau_0 + \cdots + \tau_{n-1} \le t\} = 1 - F_n(t), \quad n = 1, 2, \ldots \tag{A7.12}$$

The functions $F_n(t)$ can be calculated recursively (Eq. (A6.73))

$$F_1(t) = F_A(t), \qquad F_{n+1}(t) = \int_0^t F_n(t - x)\,f(x)\,dx, \quad n = 1, 2, \ldots \tag{A7.13}$$

From Eq. (A7.12) it follows that

$$\Pr\{\nu(t) = n\} = \Pr\{\nu(t) \le n\} - \Pr\{\nu(t) \le n - 1\} = F_n(t) - F_{n+1}(t), \quad n = 1, 2, \ldots \tag{A7.14}$$

and thus, for the expected value (mean) of $\nu(t)$,

$$E[\nu(t)] = \sum_{n=1}^{\infty} n\,[F_n(t) - F_{n+1}(t)] = \sum_{n=1}^{\infty} F_n(t) = H(t). \tag{A7.15}$$

The function $H(t)$ defined by Eq. (A7.15) is the renewal function. Due to $F_A(0) = F(0) = 0$, one has $H(0) = 0$. The distribution functions $F_n(t)$ have densities (Eq. (A6.74))

$$f_1(t) = f_A(t) \quad \text{and} \quad f_n(t) = \int_0^t f(x)\,f_{n-1}(t - x)\,dx, \quad n = 2, 3, \ldots, \tag{A7.16}$$

and are thus the convolutions of $f(x)$ with $f_{n-1}(x)$. Changing the order of summation and integration, one obtains from Eq. (A7.15)

$$H(t) = \sum_{n=1}^{\infty} \int_0^t f_n(x)\,dx = \int_0^t \sum_{n=1}^{\infty} f_n(x)\,dx. \tag{A7.17}$$

The function

$$h(t) = \frac{dH(t)}{dt} = \sum_{n=1}^{\infty} f_n(t) \tag{A7.18}$$

is the renewal density. Using the iteration formula (A7.13), Eq. (A7.17) can be written in the form

$$H(t) = F_A(t) + \int_0^t H(x)\,f(t - x)\,dx. \tag{A7.19}$$

Equation (A7.19) is the renewal equation. The corresponding equation for the renewal density is

$$h(t) = f_A(t) + \int_0^t h(x)\,f(t - x)\,dx. \tag{A7.20}$$

It can be shown that Eq. (A7.20) has exactly one solution whose Laplace transform $\tilde{h}(s)$ exists and is given by (Appendix A9.7)

$$\tilde{h}(s) = \frac{\tilde{f}_A(s)}{1 - \tilde{f}(s)}. \tag{A7.21}$$

For an ordinary renewal process ($F_A(x) = F(x)$) it holds that

$$\tilde{h}(s) = \frac{\tilde{f}(s)}{1 - \tilde{f}(s)}. \tag{A7.22}$$

Thus, an ordinary renewal process is completely characterized by its renewal density $h(t)$ or renewal function $H(t)$. In particular, it can be shown (e.g. [6.3]) that

$$\mathrm{Var}[\nu(t)] = H(t) + 2\int_0^t h(x)\,H(t - x)\,dx - (H(t))^2. \tag{A7.23}$$

It is not difficult to see that $H(t) = E[\nu(t)]$ and $\mathrm{Var}[\nu(t)]$ are finite for all $t < \infty$.

The renewal density $h(t)$ has the following important meaning. Due to the assumption $F_A(0) = F(0) = 0$, it follows that

$$\lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{\nu(t + \delta t) - \nu(t) > 1\} = 0$$

and thus, for $\delta t \downarrow 0$,

$$\Pr\{\text{any one of the renewal points } S_1 \text{ or } S_2 \text{ or } \ldots \text{ lies in } (t, t + \delta t]\} = h(t)\,\delta t. \tag{A7.24}$$

The interpretation of the renewal density given by Eq. (A7.24) is useful in practical applications. However, Eq. (A7.24) shows that the renewal density $h(t)$ differs basically from the failure rate $\lambda(t)$ as defined by Eq. (A6.25)

$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{t < \tau_0 \le t + \delta t \mid \tau_0 > t\} = \frac{f_A(t)}{1 - F_A(t)}.$$

This holds even in the case of a homogeneous Poisson process ($F_A(x) = F(x) = 1 - e^{-\lambda x}$, Appendix A7.2.5), for which $\lambda(t) = \lambda$ holds for the interarrival times and $h(t) = \lambda$
holds for the whole process. Misuses are known, in particular when dealing with reliability data analysis, see e.g. [6.1] and the comments in Sections 1.2.3, 7.6 & 8.5.

Example A7.1

Determine the renewal function $H(t)$ analytically for
(i) $f_A(t) = f(t) = \lambda e^{-\lambda t}$ (exponential),
(ii) $f_A(t) = f(t) = 0.5\,\lambda (\lambda t)^2 e^{-\lambda t}$ (Erlang with $n = 3$),
(iii) $f_A(t) = f(t) = \lambda \dfrac{(\lambda t)^{\beta - 1}}{\Gamma(\beta)} e^{-\lambda t}$ (gamma),
and numerically for a failure rate $\lambda(t) = \lambda$ for $0 \le t < \psi$ and $\lambda(t) = \lambda + \beta \lambda_w^{\beta} (t - \psi)^{\beta - 1}$ for $t \ge \psi$, i.e. for
(iv) $F_A(t) = F(t) = \int_0^t f(x)\,dx = \begin{cases} 1 - e^{-\lambda t} & \text{for } 0 \le t < \psi \\ 1 - e^{-(\lambda t + (\lambda_w (t - \psi))^{\beta})} & \text{for } t \ge \psi \end{cases}$
with $\lambda = \ldots\ \mathrm{h}^{-1}$, $\lambda_w = 10^{-5}\ \mathrm{h}^{-1}$, $\beta = 5$, $\psi = \ldots\ \mathrm{h}$ (wearout), and for
(v) $F_A(t) = F(t)$ as in case (iv) but with $\beta = 0.3$ and $\psi = 0$ (early failures).
Give the solution in graphical form for cases (iv) and (v).

Solution

The Laplace transforms of $f_A(t)$ and $f(t)$ for cases (i) to (iii) are (Table A9.7b)
(i) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda / (s + \lambda)$,
(ii) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda^3 / (s + \lambda)^3$,
(iii) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda^{\beta} / (s + \lambda)^{\beta}$.
$\tilde{h}(s)$ then follows from Eq. (A7.22), yielding $h(t)$, or directly $H(t) = \int_0^t h(x)\,dx$:
(i) $\tilde{h}(s) = \lambda / s$ and $H(t) = \lambda t$,
(ii) $\tilde{h}(s) = \dfrac{\lambda^3}{s(s^2 + 3\lambda s + 3\lambda^2)} = \dfrac{\lambda^3}{s\,[(s + \frac{3}{2}\lambda)^2 + \frac{3}{4}\lambda^2]}$ and
$$H(t) = \frac{1}{3}\left[\lambda t - 1 + \frac{2}{\sqrt{3}}\, e^{-3\lambda t/2} \sin\!\left(\frac{\sqrt{3}}{2}\lambda t + \frac{\pi}{3}\right)\right],$$
(iii) $\tilde{h}(s) = \dfrac{\lambda^{\beta}/(s + \lambda)^{\beta}}{1 - \lambda^{\beta}/(s + \lambda)^{\beta}} = \sum_{n=1}^{\infty} \dfrac{\lambda^{n\beta}}{(s + \lambda)^{n\beta}}$ and $H(t) = \sum_{n=1}^{\infty} \int_0^t \dfrac{\lambda^{n\beta}}{\Gamma(n\beta)} x^{n\beta - 1} e^{-\lambda x}\,dx$.

Cases (iv) and (v) can only be solved numerically or by simulation. Figure A7.2 gives the results for these two cases in graphical form (see Eq. (A7.28) for the asymptotic behavior of $H(t)$, represented by the dashed line in Fig. A7.2a). Figure A7.2 shows that the convergence of $H(t)$ to its asymptotic value is reasonably fast, as is the case in many practical applications. The shape of $H(t)$ can reveal the presence of wearout (iv) or early failures (v), but cannot deliver a useful interpretation of the failure rate shape.
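Cases (iv) and (v) call for a numerical solution of the renewal equation (A7.19). The Python sketch below shows one possible scheme, not the method used for Fig. A7.2: it writes Eq. (A7.19) for an ordinary process as $H(t) = F(t) + \int_0^t H(t - x)\,dF(x)$ and replaces $H$ on each grid subinterval by the mean of its end values (a Riemann-Stieltjes trapezoid rule of our choosing). It is checked against the closed form of case (ii) and against the asymptote $t/\mathrm{MTTF} - 1/3$ that Eq. (A7.28) gives for the Erlang case ($\mathrm{MTTF} = 3/\lambda$, $\sigma^2 = 3/\lambda^2$). The rate `LAM = 1.0`, the grid size, and all function names are hypothetical choices; any other distribution function (e.g. case (iv)) can be passed as `F`.

```python
import math
from typing import Callable, List

def renewal_function(F: Callable[[float], float], t_max: float, n_steps: int) -> List[float]:
    """Approximate H(k*dt), k = 0..n_steps, for an ordinary renewal process (F_A = F).

    Discretizes H(t) = F(t) + int_0^t H(t - x) dF(x), replacing H on each
    subinterval by the mean of its two end values. The first subinterval
    contains the unknown H(t_k) itself, so each step solves one linear equation.
    """
    dt = t_max / n_steps
    Fk = [F(k * dt) for k in range(n_steps + 1)]
    dF = [0.0] + [Fk[j] - Fk[j - 1] for j in range(1, n_steps + 1)]
    H = [0.0] * (n_steps + 1)                  # H(0) = 0 since F(0) = 0
    for k in range(1, n_steps + 1):
        acc = Fk[k] + 0.5 * dF[1] * H[k - 1]
        for j in range(2, k + 1):
            acc += 0.5 * (H[k - j] + H[k - j + 1]) * dF[j]
        H[k] = acc / (1.0 - 0.5 * dF[1])
    return H

LAM = 1.0   # hypothetical failure rate; time is measured in units of 1/LAM

def F_erlang3(t: float) -> float:
    """Erlang distribution with n = 3, case (ii) of Example A7.1."""
    x = LAM * t
    return 1.0 - math.exp(-x) * (1.0 + x + x * x / 2.0)

def H_erlang3(t: float) -> float:
    """Closed-form renewal function for case (ii)."""
    x = LAM * t
    return (x - 1.0 + 2.0 / math.sqrt(3.0) * math.exp(-1.5 * x)
            * math.sin(math.sqrt(3.0) / 2.0 * x + math.pi / 3.0)) / 3.0

H = renewal_function(F_erlang3, t_max=10.0, n_steps=1000)
for k in (100, 500, 1000):
    t = k * 0.01
    # MTTF = 3/LAM and sigma^2 = 3/LAM^2, so Eq. (A7.28) gives the asymptote t/MTTF - 1/3.
    print(f"t = {t:4.1f}: numeric {H[k]:.5f}, closed form {H_erlang3(t):.5f}, "
          f"asymptote {LAM * t / 3.0 - 1.0 / 3.0:.5f}")
```

The double loop costs $O(n^2)$ operations, which is acceptable for a few thousand grid points; finer grids or long horizons would call for a faster convolution.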

Figure A7.2 a) Renewal function $H(t)$ and b) failure rate $\lambda(t)$ and density function $f(t)$ for cases (iv) and (v) in Example A7.1, time axis $t$ [h] ($H(t)$ was obtained empirically, simulating 1000 failure-free times and plotting $H(t)$ as a continuous curve; the offset of the asymptote, $[(\sigma/\mathrm{MTTF})^2 - 1]/2$, follows from Eq. (A7.28))

A7.2.2 Recurrence Times

Consider now the distribution functions of the forward recurrence time $\tau_R(t)$ and the backward recurrence time $\tau_S(t)$. As shown in Fig. A7.1a, $\tau_R(t)$ and $\tau_S(t)$ are the time intervals from an arbitrary time point $t$ forward to the next renewal point and backward to the last renewal point (or to the time origin), respectively. It follows from Fig. A7.1a that the event $\tau_R(t) > x$ occurs with one of the following mutually exclusive events

$$A_0 = \{S_1 > t + x\}, \qquad A_n = \{S_n \le t\} \cap \{\tau_n > t + x - S_n\}, \quad n = 1, 2, \ldots$$

Obviously, $\Pr\{A_0\} = 1 - F_A(t + x)$. The event $A_n$ means that exactly $n$ renewal points have occurred before $t$ and the $(n+1)$th renewal point occurs after $t + x$. Because $S_n$ and $\tau_n$ are independent, it follows that

$$\Pr\{A_n \mid S_n = y\} = \Pr\{\tau_n > t + x - y\}, \quad n = 1, 2, \ldots,$$

and thus, from the theorem of total probability (Eq. (A6.17)),

$$\Pr\{\tau_R(t) > x\} = 1 - F_A(t + x) + \int_0^t h(y)\,(1 - F(t + x - y))\,dy,$$

yielding finally

$$\Pr\{\tau_R(t) \le x\} = F_A(t + x) - \int_0^t h(y)\,(1 - F(t + x - y))\,dy. \tag{A7.25}$$

The distribution function of the backward recurrence time $\tau_S(t)$ can be obtained as

$$\Pr\{\tau_S(t) \le x\} = \begin{cases} \displaystyle\int_{t - x}^{t} h(y)\,(1 - F(t - y))\,dy & \text{for } x < t \\ 1 & \text{for } x \ge t. \end{cases} \tag{A7.26}$$

Since $\Pr\{S_1 > t\} = 1 - F_A(t)$, the distribution function of $\tau_S(t)$ makes a jump of height $1 - F_A(t)$ at the point $x = t$.

A7.2.3 Asymptotic Behavior

The asymptotic behavior of a renewal process (generally of a stochastic process) is understood to be the behavior of the process for $t \to \infty$. The following theorems hold with MTTF as in Eq. (A7.11):

1. Elementary Renewal Theorem [A6.6 (vol. II), A7.24]: If the conditions (A7.9) to (A7.11) are fulfilled, then

$$\lim_{t \to \infty} \frac{H(t)}{t} = \frac{1}{\mathrm{MTTF}}, \quad \text{with } H(t) = E[\nu(t)]. \tag{A7.27}$$

For $\mathrm{Var}[\nu(t)]$ it holds that $\lim_{t \to \infty} \mathrm{Var}[\nu(t)]/t = \sigma^2/\mathrm{MTTF}^3$, with $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$.

2. Tightened Elementary Renewal Theorem [6.3, A7.4, A7.24]: If the conditions (A7.9) to (A7.11) are fulfilled, $E[\tau_0] = \mathrm{MTTF}_A < \infty$, and $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then

$$\lim_{t \to \infty}\left(H(t) - \frac{t}{\mathrm{MTTF}}\right) = \frac{\sigma^2}{2\,\mathrm{MTTF}^2} - \frac{\mathrm{MTTF}_A}{\mathrm{MTTF}} + \frac{1}{2}. \tag{A7.28}$$

3. Key Renewal Theorem [A7.9 (vol. II), A7.24]: If the conditions (A7.9) to (A7.11) are fulfilled, $U(z) \ge 0$ is bounded, nonincreasing, and Riemann integrable over the interval $(0, \infty)$, and $h(t)$ is a renewal density, then

$$\lim_{t \to \infty} \int_0^t U(t - y)\,h(y)\,dy = \frac{1}{\mathrm{MTTF}} \int_0^\infty U(z)\,dz. \tag{A7.29}$$

For any $a > 0$, the key renewal theorem leads, with

$$U(z) = \begin{cases} 1 & \text{for } 0 < z < a \\ 0 & \text{otherwise}, \end{cases}$$

to Blackwell's theorem [A7.9 (vol. II), A7.24]

$$\lim_{t \to \infty} \frac{H(t + a) - H(t)}{a} = \frac{1}{\mathrm{MTTF}}. \tag{A7.30}$$

Conversely, the key renewal theorem can be obtained from Blackwell's theorem.

4. Renewal Density Theorem [A7.9 (1941), A7.24]: If the conditions (A7.9) to (A7.11) are fulfilled, $f_A(x)$ and $f(x)$ go to 0 as $x \to \infty$, and $\mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then

$$\lim_{t \to \infty} h(t) = \frac{1}{\mathrm{MTTF}}. \tag{A7.31}$$

5. Recurrence Time Limit Theorems: Assuming $U(z) = 1 - F(x + z)$ in Eq. (A7.29) and considering $F_A(\infty) = 1$ as well as $\mathrm{MTTF} = \int_0^\infty (1 - F(y))\,dy$, Eq. (A7.25) yields

$$\lim_{t \to \infty} \Pr\{\tau_R(t) \le x\} = 1 - \frac{1}{\mathrm{MTTF}} \int_0^\infty (1 - F(x + z))\,dz = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy. \tag{A7.32}$$

For $t \to \infty$, the density of the forward recurrence time $\tau_R(t)$ is thus given by $f_{\tau_R}(x) = (1 - F(x))/\mathrm{MTTF}$. Assuming $E[\tau_i] = \mathrm{MTTF} < \infty$, $\mathrm{Var}[\tau_i] = \sigma^2 < \infty$ ($i \ge 1$), and $E[\tau_R(t)] < \infty$, it follows that $\lim_{x \to \infty} x^2 (1 - F(x)) = 0$. Integration by parts yields

$$\lim_{t \to \infty} E[\tau_R(t)] = \frac{1}{\mathrm{MTTF}} \int_0^\infty x\,(1 - F(x))\,dx = \frac{\mathrm{MTTF}}{2} + \frac{\sigma^2}{2\,\mathrm{MTTF}}. \tag{A7.33}$$

The result of Eq. (A7.33) is important to clarify the waiting time paradox: $\lim_{t \to \infty} E[\tau_R(t)] = \mathrm{MTTF}/2$ holds only for $\sigma^2 = 0$ ($\tau_i = \mathrm{MTTF}$, $i \ge 0$), and $\lim_{t \to \infty} E[\tau_R(t)] = 1/\lambda = E[\tau_i]$, $i \ge 0$, holds for $F_A(x) = F(x) = 1 - e^{-\lambda x}$ (memoryless property of the Poisson process). Similar results hold for the backward recurrence time $\tau_S(t)$. For a simultaneous observation of $\tau_R(t)$ and $\tau_S(t)$, it must be considered that in this case $\tau_R(t)$ and $\tau_S(t)$ belong to the same $\tau_i$.

6. Central Limit Theorem for Renewal Processes [6.3, A7.24]: If the conditions (A7.9) and (A7.11) are fulfilled and $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then

$$\lim_{t \to \infty} \Pr\left\{\frac{\nu(t) - t/\mathrm{MTTF}}{\sigma \sqrt{t/\mathrm{MTTF}^3}} \le x\right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\,dy. \tag{A7.34}$$

Equation (A7.34) is a consequence of the central limit theorem for the sum of independent and identically distributed random variables (Eq. (A6.148)).

Equations (A7.27) to (A7.34) show that the renewal process with an arbitrary initial distribution function $F_A(x)$ converges to a statistical equilibrium (steady-state) as $t \to \infty$; see Appendix A7.2.4 for a discussion of the stationary renewal process.

A7.2.4 Stationary Renewal Processes

The results of Appendix A7.2.3 allow a stationary renewal process to be defined as follows:

A renewal process is stationary (in steady-state) if for all $t > 0$ the distribution function of $\tau_R(t)$ in Eq. (A7.25) does not depend on $t$.

It is intuitively clear that such a situation can only occur if a particular relationship exists between the distribution functions $F_A(x)$ and $F(x)$ given by Eqs. (A7.6) and (A7.7). Assuming

$$F_A(x) = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy, \tag{A7.35}$$

it follows that $f_A(x) = (1 - F(x))/\mathrm{MTTF}$, $\tilde{f}_A(s) = (1 - \tilde{f}(s))/(s\,\mathrm{MTTF})$, and thus from Eq. (A7.21)

$$\tilde{h}(s) = \frac{1}{s\,\mathrm{MTTF}}, \quad \text{yielding} \quad h(t) = \frac{1}{\mathrm{MTTF}}. \tag{A7.36}$$

With $F_A(x)$ and $h(x)$ from Eqs. (A7.35) and (A7.36), Eq. (A7.25) yields for any $t \ge 0$

$$\Pr\{\tau_R(t) \le x\} = \frac{1}{\mathrm{MTTF}}\left[\int_0^{t + x} (1 - F(y))\,dy - \int_0^t (1 - F(t + x - y))\,dy\right] = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy. \tag{A7.37}$$

Equation (A7.35) is thus a necessary and sufficient condition for stationarity of the renewal process with $\Pr\{\tau_i \le x\} = F(x)$, $i \ge 1$. It is not difficult to show that the counting process $\nu(t)$ given in Fig. A7.1b, belonging to a stationary renewal process, is a process with stationary increments. For any $t, a > 0$ and $n = 1, 2, \ldots$ it follows that

$$\Pr\{\nu(t + a) - \nu(t) = n\} = \Pr\{\nu(a) = n\} = F_n(a) - F_{n+1}(a),$$

with $F_{n+1}(a)$ as in Eq. (A7.13) and $F_A(x)$ as in Eq. (A7.35). Moreover, for a stationary renewal process, $H(t) = t/\mathrm{MTTF}$, and the mean number of renewals within an arbitrary interval $(t, t + a]$ is

$$H(t + a) - H(t) = \frac{a}{\mathrm{MTTF}}.$$

Comparing Eq. (A7.32) with Eq. (A7.37), it follows that under very general conditions every renewal process becomes stationary as $t \to \infty$. From this, the following interpretation can be made, which is useful for practical applications: A stationary renewal process can be regarded as a renewal process with arbitrary initial condition $F_A(x)$, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point). The most important properties of stationary renewal processes are summarized in Table A7.1. Equation (A7.32) obviously also holds for $\tau_R(t)$ and $\tau_S(t)$ in the case of a stationary renewal process.

A7.2.5 Homogeneous Poisson Processes

The renewal process defined by Eq. (A7.8) with

$$F_A(x) = F(x) = 1 - e^{-\lambda x} \tag{A7.38}$$

is a homogeneous Poisson process (HPP). $F_A(x)$ as in Eq. (A7.38) fulfills Eq. (A7.35), and thus the Poisson process is stationary. From the results of the preceding sections it follows that

$$\Pr\{\tau_0 + \cdots + \tau_{n-1} \le t\} = F_n(t) = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} e^{-\lambda t} = \int_0^{\lambda t} \frac{x^{n-1}}{(n-1)!} e^{-x}\,dx, \quad n = 1, 2, \ldots, \tag{A7.39}$$

$$f_n(t) = \frac{dF_n(t)}{dt} = \lambda \frac{(\lambda t)^{n-1}}{(n-1)!} e^{-\lambda t}, \quad n = 1, 2, \ldots, \tag{A7.40}$$

$$\Pr\{\nu(t) = k\} = F_k(t) - F_{k+1}(t) = \frac{(\lambda t)^k}{k!} e^{-\lambda t}, \quad k = 0, 1, 2, \ldots, \quad F_0(t) \equiv 1, \tag{A7.41}$$

$$H(t) = E[\nu(t)] = \lambda t, \qquad h(t) = \lambda, \qquad \mathrm{Var}[\nu(t)] = \lambda t, \tag{A7.42}$$

$$\Pr\{\tau_R(t) \le x\} = 1 - e^{-\lambda x}, \quad t \ge 0, \tag{A7.43}$$

$$\Pr\{\tau_S(t) \le x\} = \begin{cases} 1 - e^{-\lambda x} & \text{for } x < t \\ 1 & \text{for } x \ge t. \end{cases} \tag{A7.44}$$

As a result of the memoryless property of the exponential distribution, the counting process $\nu(t)$ (Fig. A7.1b) has independent increments. Quite generally, a point process is a homogeneous Poisson process with intensity $\lambda$ if the associated count function $\nu(t)$ has stationary independent increments and satisfies Eq. (A7.41). For a renewal process, Eq. (A7.38) is a necessary and sufficient condition.

Substituting for $\lambda t$ in Eq. (A7.41) a nondecreasing (generally increasing and continuous) function $M(t) > 0$, a nonhomogeneous Poisson process (NHPP) is obtained. The NHPP is a point process with independent, Poisson-distributed increments. Because of its independent increments, the NHPP is a process without aftereffect (memoryless if HPP), and the sum of Poisson processes is a Poisson process (Eq. (7.26) and [A7.7]). Moreover, the sum of $n$ independent renewal processes with low occurrence converges for $n \to \infty$ to an NHPP [A7.14] (to an HPP in the case of stationary independent renewal processes [A7.8]). However, despite its intrinsic simplicity, the NHPP is not a regenerative process. Furthermore, in statistical data analysis, the property of independent increments is often difficult to prove. All this can lead to misuses, see e.g. [6.1, A7.30] and Section 7.6. Nonhomogeneous Poisson processes are introduced in Appendix A7.8.2.

Table A7.1 Main properties of a stationary renewal process

| Property | Expression | Comments, assumptions |
|---|---|---|
| 1. Distribution function of $\tau_0$ | $F_A(x) = \frac{1}{T}\int_0^x (1 - F(y))\,dy$ | $f_A(x) = dF_A(x)/dx = (1 - F(x))/T$, $T = E[\tau_i]$, $i \ge 1$ |
| 2. Distribution function of $\tau_i$, $i \ge 1$ | $F(x)$, $x \ge 0$ | $f(x) = dF(x)/dx$ |
| 3. Renewal function | $H(t) = t/T$, $t \ge 0$ | $H(t) = E[\nu(t)] = E[\text{number of renewal points in } (0, t]]$ |
| 4. Renewal density | $h(t) = 1/T$, $t \ge 0$ | $h(t) = dH(t)/dt$, $h(t)\,\delta t \approx \Pr\{S_1 \text{ or } S_2 \text{ or } \ldots \text{ lies in } (t, t + \delta t]\}$ for $\delta t \downarrow 0$ |
| 5. Distribution function and mean of the forward recurrence time | $\Pr\{\tau_R(t) \le x\} = F_A(x)$, $t \ge 0$; $E[\tau_R(t)] = T/2 + \mathrm{Var}[\tau_i]/(2T)$ | $F_A(x)$ as in point 1; same for $\tau_S(t)$ |
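Point 5 of Table A7.1, i.e. Eq. (A7.33) and the waiting time paradox, can be checked by simulation. The Python sketch below is a hedged illustration, not a method from the text: it simulates one long ordinary renewal process, inspects it at random times far from the origin (where, by Appendix A7.2.4, it behaves like a stationary process), and averages the forward recurrence time $\tau_R(t)$. The three interarrival distributions (all with $\mathrm{MTTF} = 1$), the horizon, the number of inspections, and the seed are hypothetical choices.

```python
import bisect
import random
from typing import Callable

def mean_forward_recurrence(sample_tau: Callable[[random.Random], float],
                            horizon: float, n_inspections: int, seed: int = 1) -> float:
    """Estimate lim E[tau_R(t)] from one long simulated ordinary renewal process.

    Renewal points are generated until they exceed `horizon`. Inspection times t
    are drawn uniformly from (horizon/2, horizon), far from the origin, and the
    time from t forward to the next renewal point is averaged.
    """
    rng = random.Random(seed)
    points, s = [], 0.0
    while s <= horizon:
        s += sample_tau(rng)
        points.append(s)
    total = 0.0
    for _ in range(n_inspections):
        t = rng.uniform(0.5 * horizon, horizon)
        next_point = points[bisect.bisect_right(points, t)]   # first S_n > t
        total += next_point - t
    return total / n_inspections

# Hypothetical interarrival distributions, all with MTTF = 1; the second entry is
# MTTF/2 + sigma^2/(2 MTTF) from Eq. (A7.33).
cases = {
    "deterministic": (lambda rng: 1.0, 0.5),                          # sigma^2 = 0
    "uniform(0, 2)": (lambda rng: rng.uniform(0.0, 2.0), 2.0 / 3.0),  # sigma^2 = 1/3
    "exponential":   (lambda rng: rng.expovariate(1.0), 1.0),         # memoryless
}
for name, (sampler, theory) in cases.items():
    estimate = mean_forward_recurrence(sampler, horizon=2e5, n_inspections=20000)
    print(f"{name:14s} simulated {estimate:.3f}   Eq. (A7.33): {theory:.3f}")
```

An observer arriving at a random time is more likely to land in a long interval than in a short one, which is why the mean forward recurrence time exceeds $\mathrm{MTTF}/2$ whenever $\sigma^2 > 0$.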

A7.3 Alternating Renewal Processes

Generalizing the renewal process given in Fig. A7.1a by introducing a positive random replacement time, distributed according to $G(x)$, leads to the alternating renewal process. An alternating renewal process is a process with two states, which alternate from one state to the other after a stay (sojourn) time distributed according to $F(x)$ and $G(x)$, respectively. Considering the reliability and availability analysis of a repairable item in Section 6.2, and in order to simplify the notation, these two states will be referred to as the up state and the down state, abbreviated as $u$ and $d$, respectively.

To define an alternating renewal process, consider two independent renewal processes $\{\tau_i\}$ and $\{\tau'_i\}$, $i = 0, 1, \ldots$. For reliability applications, $\tau_i$ denotes the $i$-th failure-free operating time and $\tau'_i$ the $i$-th repair time. These random variables are distributed according to

$$F_A(x) \text{ for } \tau_0 \quad \text{and} \quad F(x) \text{ for } \tau_i,\ i \ge 1, \tag{A7.45}$$

$$G_A(x) \text{ for } \tau'_0 \quad \text{and} \quad G(x) \text{ for } \tau'_i,\ i \ge 1, \tag{A7.46}$$

with densities $f_A(x)$, $f(x)$, $g_A(x)$, and $g(x)$, with finite means, and

$$\mathrm{MTTF} = E[\tau_i] = \int_0^\infty (1 - F(t))\,dt, \quad i \ge 1, \tag{A7.47}$$

$$\mathrm{MTTR} = E[\tau'_i] = \int_0^\infty (1 - G(t))\,dt, \quad i \ge 1, \tag{A7.48}$$

where MTTF and MTTR stand for mean time to failure and mean time to repair. The sequences

$$\tau_0, \tau'_1, \tau_1, \tau'_2, \tau_2, \tau'_3, \ldots \quad \text{and} \quad \tau'_0, \tau_1, \tau'_1, \tau_2, \tau'_2, \tau_3, \ldots \tag{A7.49}$$

form two modified alternating renewal processes, starting at $t = 0$ with $\tau_0$ and $\tau'_0$, respectively. Figure A7.3 shows a possible time schedule of these two alternating renewal processes (repair times greatly exaggerated). Embedded in each of these processes are two renewal processes, with renewal points $S_{udu_i}$ or $S_{udd_i}$ (marked with one symbol in Fig. A7.3) and $S_{duu_i}$ or $S_{dud_i}$ (marked with the other symbol), where $udu$ denotes a transition from up to down, given up at $t = 0$, i.e.

$$S_{udu_1} = \tau_0 \quad \text{and} \quad S_{udu_i} = \tau_0 + (\tau'_1 + \tau_1) + \cdots + (\tau'_{i-1} + \tau_{i-1}), \quad i > 1.$$

Figure A7.3 Possible time schedule of two alternating renewal processes starting at $t = 0$ with $\tau_0$ and $\tau'_0$, respectively (the 4 embedded renewal processes with their renewal points are shown)

These four embedded renewal processes are statistically identical except for the time intervals starting at $t = 0$, i.e. $\tau_0$, $\tau_0 + \tau'_1$, $\tau'_0 + \tau_1$, and $\tau'_0$. The corresponding densities are $f_A(x)$, $f_A(x) * g(x)$, $g_A(x) * f(x)$, and $g_A(x)$ for the time interval starting at $t = 0$, and $f(x) * g(x)$ for all others. The symbol $*$ denotes convolution (Eq. (A6.75)). The results of Section A7.2 can be used to investigate the embedded renewal processes of Fig. A7.3. Equation (A7.21) yields the Laplace transforms of the renewal densities $h_{udu}(t)$, $h_{duu}(t)$, $h_{udd}(t)$, and $h_{dud}(t)$

$$\tilde{h}_{udu}(s) = \frac{\tilde{f}_A(s)}{1 - \tilde{f}(s)\tilde{g}(s)}, \quad \tilde{h}_{duu}(s) = \frac{\tilde{f}_A(s)\,\tilde{g}(s)}{1 - \tilde{f}(s)\tilde{g}(s)}, \quad \tilde{h}_{udd}(s) = \frac{\tilde{g}_A(s)\,\tilde{f}(s)}{1 - \tilde{f}(s)\tilde{g}(s)}, \quad \tilde{h}_{dud}(s) = \frac{\tilde{g}_A(s)}{1 - \tilde{f}(s)\tilde{g}(s)}. \tag{A7.50}$$

To describe the alternating renewal process defined above (Fig. A7.3), let us introduce the two-dimensional stochastic process $(\xi(t), \tau_{R\xi(t)}(t))$, where $\xi(t)$ denotes the state of the process (repairable item in reliability applications)

$$\xi(t) = \begin{cases} u & \text{if the item is up at time } t \\ d & \text{if the item is down at time } t. \end{cases}$$

$\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ are thus the forward recurrence times in the up and down states, respectively, provided that the item is up or down at time $t$, see Fig.

To investigate the general case, both alternating renewal processes of Fig. A7.3 must be combined. For this, let

$$p = \Pr\{\text{item up at } t = 0\} \quad \text{and} \quad 1 - p = \Pr\{\text{item down at } t = 0\}. \tag{A7.51}$$

In terms of the process $(\xi(t), \tau_{R\xi(t)}(t))$,

$$p = \Pr\{\xi(0) = u\}, \quad 1 - p = \Pr\{\xi(0) = d\}, \quad F_A(x) = \Pr\{\tau_{Ru}(0) \le x \mid \xi(0) = u\}, \quad G_A(x) = \Pr\{\tau_{Rd}(0) \le x \mid \xi(0) = d\}.$$

Consecutive jumps from up to down form a renewal process with renewal density

$$h_{ud}(t) = p\,h_{udu}(t) + (1 - p)\,h_{udd}(t). \tag{A7.52}$$

Similarly, the renewal density for consecutive jumps from down to up is given by

$$h_{du}(t) = p\,h_{duu}(t) + (1 - p)\,h_{dud}(t). \tag{A7.53}$$

Using Eqs. (A7.52) and (A7.53), and considering Eq. (A7.25), it follows that

$$\Pr\{\xi(t) = u \cap \tau_{Ru}(t) > \theta\} = p\,(1 - F_A(t + \theta)) + \int_0^t h_{du}(x)\,(1 - F(t - x + \theta))\,dx \tag{A7.54}$$

and

$$\Pr\{\xi(t) = d \cap \tau_{Rd}(t) > \theta\} = (1 - p)\,(1 - G_A(t + \theta)) + \int_0^t h_{ud}(x)\,(1 - G(t - x + \theta))\,dx. \tag{A7.55}$$

Setting $\theta = 0$ in Eq. (A7.54) yields

$$\Pr\{\xi(t) = u\} = p\,(1 - F_A(t)) + \int_0^t h_{du}(x)\,(1 - F(t - x))\,dx. \tag{A7.56}$$

The probability $PA(t) = \Pr\{\xi(t) = u\}$ is called the point availability, and $IR(t, t + \theta] = \Pr\{\xi(t) = u \cap \tau_{Ru}(t) > \theta\}$ the interval reliability of the given item (Section 6.2).

An alternating renewal process, characterized by the parameters $p$, $F_A(x)$, $F(x)$, $G_A(x)$, and $G(x)$, is stationary if the two-dimensional process $(\xi(t), \tau_{R\xi(t)}(t))$ is stationary. As with the renewal process, it can be shown that an alternating renewal process is stationary if and only if

$$p = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \quad F_A(x) = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy, \quad G_A(x) = \frac{1}{\mathrm{MTTR}} \int_0^x (1 - G(y))\,dy, \tag{A7.57}$$

with MTTF and MTTR as in Eqs. (A7.47) and (A7.48). In particular, for $t \ge 0$ the following relationships apply for the stationary alternating renewal process (Examples 6.3 and 6.4)

$$PA(t) = \Pr\{\text{item up at } t\} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = PA, \tag{A7.58}$$

$$IR(t, t + \theta] = \Pr\{\text{item up at } t \text{ and remains up until } t + \theta\} = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_\theta^\infty (1 - F(y))\,dy. \tag{A7.59}$$

Condition (A7.57) is equivalent to

$$h_{ud}(t) = h_{du}(t) = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}}, \quad t \ge 0. \tag{A7.60}$$

Moreover, application of the key renewal theorem (Eq. (A7.29)) to Eqs. (A7.54) to (A7.56) yields (Example 6.4)

$$\lim_{t \to \infty} \Pr\{\xi(t) = u \cap \tau_{Ru}(t) > \theta\} = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_\theta^\infty (1 - F(y))\,dy, \tag{A7.61}$$

$$\lim_{t \to \infty} \Pr\{\xi(t) = d \cap \tau_{Rd}(t) > \theta\} = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_\theta^\infty (1 - G(y))\,dy, \tag{A7.62}$$

$$\lim_{t \to \infty} \Pr\{\xi(t) = u\} = \lim_{t \to \infty} PA(t) = PA = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}. \tag{A7.63}$$

Thus, irrespective of its initial conditions $p$, $F_A(x)$, and $G_A(x)$, an alternating renewal process has for $t \to \infty$ an asymptotic behavior which is identical to the stationary state (steady-state). In other words: A stationary alternating renewal process can be regarded as an alternating renewal process with arbitrary initial conditions $p$, $F_A(x)$, and $G_A(x)$, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).

It should be noted that the results of this section remain valid even if independence between $\tau_i$ and $\tau'_i$ within a cycle (e.g. $\tau_0 + \tau'_1$, $\tau_1 + \tau'_2$, ...) is dropped; only independence between cycles is necessary. For exponentially distributed $\tau_i$ and $\tau'_i$, i.e. for constant failure rate $\lambda$ and repair rate $\mu$ in reliability applications, the convergence of $PA(t)$ towards $PA$ stated by Eq. (A7.63) is of the
form $PA(t) - PA = \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t} \approx \frac{\lambda}{\mu} e^{-\mu t}$, see Eq. (6.20) and Section for further considerations.

A7.4 Regenerative Processes

A regenerative process is characterized by the property that there is a sequence of random points on the time axis, regeneration points, at which the process forgets its foregoing evolution and, from a probabilistic point of view, restarts anew. The times at which a regenerative process restarts occur when the process returns to some states, defined as regeneration states. The sequence of these time points for a specific regeneration state is a renewal process embedded in the original stochastic process. For example, both the up and the down states of an alternating renewal process are regeneration states. All states of time-homogeneous Markov processes and of semi-Markov processes, defined by Eqs. (A7.95) and (A7.158), are regenerative. However, there are processes in discrete state space with only a few regeneration states (two in Fig. A7.10, one in Fig. 6.10) or even with none (see e.g. Appendix A7.8 for some considerations). A regenerative process must have at least one regeneration state. A regenerative process thus consists of independent cycles which describe the time behavior of the process between two consecutive regeneration points of the same type (same regeneration state). The $i$-th cycle is characterized by a positive random variable $\tau_{c_i}$ (duration of cycle $i$) and a stochastic process $\xi_i(t)$ defined for $0 \le t < \tau_{c_i}$ (content of the cycle). Let $\xi_n(t)$, $0 \le t < \tau_{c_n}$, $n = 0, 1, \ldots$, be independent and, for $n \ge 1$, identically distributed cycles. For simplicity, let us assume that the time points $S_1 = \tau_{c_0}$, $S_2 = \tau_{c_0} + \tau_{c_1}$, ... form a renewal process. The random variables $\tau_{c_0}$ and $\tau_{c_i}$, $i \ge 1$, have distribution functions $F_A(x)$ for $\tau_{c_0}$ and $F(x)$ for $\tau_{c_i}$, densities $f_A(x)$ and $f(x)$, and finite means $T_A$ and $T_c$, respectively.
The regenerative process $\xi(t)$ is then given by

$$\xi(t) = \begin{cases} \xi_0(t) & \text{for } 0 \le t < S_1 \\ \xi_n(t - S_n) & \text{for } S_n \le t < S_{n+1},\ n = 1, 2, \ldots \end{cases}$$

The regenerative structure is sufficient for the existence of an asymptotic behavior (limiting distribution) of the process as $t \to \infty$ (provided that the mean time between regeneration points is finite). This limiting distribution is determined by the behavior of the process between two consecutive regeneration points of the same regeneration state.

Defining $h(t)$ as the renewal density of the renewal process given by $S_1, S_2, \ldots$ and setting

$$U(t, B) = \Pr\{\xi_i(t) \in B \cap \tau_{c_i} > t\}, \quad i = 1, 2, \ldots,$$

it follows, similarly to Eq. (A7.25), that

$$\Pr\{\xi(t) \in B\} = \Pr\{\xi_0(t) \in B \cap \tau_{c_0} > t\} + \int_0^t h(x)\,U(t - x, B)\,dx. \tag{A7.64}$$

For any given distribution of the cycle $\xi_i(t)$, $0 \le t < \tau_{c_i}$, $i \ge 1$, with $T_c = E[\tau_{c_i}] < \infty$, there exists a stationary regenerative process $\xi_e(t)$ with regeneration points $S_{e_i}$, $i \ge 1$. The cycles $\xi_{e_n}(t)$, $0 \le t < \tau_{e_n}$, have for $n \ge 1$ the same distribution law as $\xi_i(t)$, $0 \le t < \tau_{c_i}$. The distribution law of the starting cycle $\xi_{e_0}(t)$, $0 \le t < \tau_{e_0}$, can be calculated from the distribution law of $\xi_i(t)$, $0 \le t < \tau_{c_i}$; see Eq. (A7.57) for alternating renewal processes. In particular,

$$\Pr\{\xi_e(0) \in B\} = \frac{1}{T_c} \int_0^\infty U(t, B)\,dt, \tag{A7.65}$$

with $T_c = E[\tau_{c_i}] < \infty$, $i \ge 1$. Furthermore, for every non-negative function $g(t)$ and $S_1 = 0$,

$$E\left[\int_0^{\tau_{c_1}} g(\xi(t))\,dt\right] = T_c\,E[g(\xi_e(0))]. \tag{A7.66}$$

Equation (A7.66) is known as the stochastic mean value theorem. Since $U(t, B)$ is nonincreasing and $\le 1 - F(t)$ for all $t \ge 0$, it follows from Eq. (A7.64) and the key renewal theorem (Eq. (A7.29)) that

$$\lim_{t \to \infty} \Pr\{\xi(t) \in B\} = \frac{1}{T_c} \int_0^\infty U(t, B)\,dt. \tag{A7.67}$$

Equations (A7.65) and (A7.67) show that under very general conditions a regenerative process becomes stationary as $t \to \infty$. As in the case of renewal and alternating renewal processes, the following interpretation holds: A stationary regenerative process can be considered as a regenerative process with arbitrary distribution of the starting cycle, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).
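For an alternating renewal process with regeneration state "up", Eq. (A7.67) reduces to Eq. (A7.63), $PA = \mathrm{MTTF}/(\mathrm{MTTF} + \mathrm{MTTR})$, whatever the shape of $F(x)$ and $G(x)$. The Python sketch below checks this by simulation in two ways: an ensemble estimate of $PA(t)$ at a large $t$ for an item new at $t = 0$ ($p = 1$, $F_A = F$), and a long-run time average over many cycles, in the spirit of the stochastic mean value theorem (A7.66) with $g$ the indicator of the up state. The Weibull and lognormal distributions, their parameters, the seeds, and all names are hypothetical choices for illustration only.

```python
import math
import random

# Hypothetical item (illustration only): Weibull failure-free times with shape 2
# and scale 100 h, lognormal repair times with mean 10 h.
WB_SCALE, WB_SHAPE = 100.0, 2.0
LN_SIGMA = 0.5
LN_MU = math.log(10.0) - LN_SIGMA ** 2 / 2          # so that E[repair time] = 10 h

MTTF = WB_SCALE * math.gamma(1.0 + 1.0 / WB_SHAPE)  # about 88.6 h
MTTR = math.exp(LN_MU + LN_SIGMA ** 2 / 2)           # 10 h
PA = MTTF / (MTTF + MTTR)                            # Eq. (A7.63)

def draw_up(rng: random.Random) -> float:
    return rng.weibullvariate(WB_SCALE, WB_SHAPE)

def draw_down(rng: random.Random) -> float:
    return rng.lognormvariate(LN_MU, LN_SIGMA)

def point_availability(t: float, n_runs: int, seed: int = 7) -> float:
    """Ensemble estimate of PA(t) = Pr{item up at t}, item new at t = 0 (p = 1, F_A = F)."""
    rng = random.Random(seed)
    up_at_t = 0
    for _ in range(n_runs):
        clock = 0.0
        while True:
            clock += draw_up(rng)            # end of the current up period
            if clock > t:
                up_at_t += 1                 # t lies inside an up period
                break
            clock += draw_down(rng)          # end of the following repair
            if clock > t:
                break                        # t lies inside a down period
    return up_at_t / n_runs

def fraction_of_time_up(n_cycles: int, seed: int = 11) -> float:
    """Time average over many regeneration cycles (one up period plus one repair each)."""
    rng = random.Random(seed)
    up_time = down_time = 0.0
    for _ in range(n_cycles):
        up_time += draw_up(rng)
        down_time += draw_down(rng)
    return up_time / (up_time + down_time)

print(f"PA, Eq. (A7.63):      {PA:.4f}")
print(f"PA(t) at t = 3000 h:  {point_availability(3000.0, 10000):.4f}")
print(f"fraction of time up:  {fraction_of_time_up(200000):.4f}")
```

The time horizon of 3000 h spans roughly 30 cycles, enough here for the initial condition $p = 1$ to be forgotten; with strongly skewed distributions a longer horizon may be needed.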

A7.5 Markov Processes with Finitely Many States

Markov processes are processes without memory. They are characterized by the property that for any (arbitrarily chosen) time point $t$ their evolution after $t$ depends on $t$ and the state occupied at $t$, but not on the process evolution up to the time $t$. In the case of time-homogeneous Markov processes, the dependence on $t$ also disappears. In reliability theory, these processes describe the behavior of repairable systems with constant failure and repair rates for all elements. Constant rates are required during the stay (sojourn) time in any state, not necessarily at state changes (e.g. for load sharing). After an introduction to Markov chains, time-homogeneous Markov processes with finitely many states are considered in depth, as a basis for Chapter 6.

A7.5.1 Markov Chains with Finitely Many States

A stochastic process in discrete time $\xi_n$ ($n = 0, 1, \ldots$) with finitely many states $Z_0, \ldots, Z_m$ is a Markov chain if for $n = 0, 1, 2, \ldots$ and arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$

$$\Pr\{\xi_{n+1} = Z_j \mid (\xi_n = Z_i \cap \xi_{n-1} = Z_{i_{n-1}} \cap \cdots \cap \xi_0 = Z_{i_0})\} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\} = P_{ij}(n).^{+)} \tag{A7.68}$$

The quantities $P_{ij}(n)$ are the (one-step) transition probabilities of the Markov chain. Investigation will be limited here to time-homogeneous Markov chains, for which the transition probabilities $P_{ij}(n)$ are independent of $n$

$$P_{ij}(n) = P_{ij} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\}, \quad n = 0, 1, \ldots \tag{A7.69}$$

For simplicity, "Markov chain" will be used in the following as equivalent to "time-homogeneous Markov chain". The probabilities $P_{ij}$ satisfy the relationships

$$P_{ij} \ge 0 \quad \text{and} \quad \sum_{j=0}^{m} P_{ij} = 1, \quad i, j \in \{0, \ldots, m\}. \tag{A7.70}$$

A matrix with elements $P_{ij}$ as in Eq. (A7.70) is a stochastic matrix. The $k$-step transition probabilities are the elements of the $k$-th power of the stochastic matrix with elements $P_{ij}$.
For example, $k = 2$ leads to (Example A7.2)

$$P_{ij}^{(2)} = \Pr\{\xi_{n+2} = Z_j \mid \xi_n = Z_i\} = \sum_{k=0}^{m} \Pr\{(\xi_{n+2} = Z_j \cap \xi_{n+1} = Z_k) \mid \xi_n = Z_i\}$$
$$= \sum_{k=0}^{m} \Pr\{\xi_{n+1} = Z_k \mid \xi_n = Z_i\}\, \Pr\{\xi_{n+2} = Z_j \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_k)\},$$

+) $\xi_0, \xi_1, \ldots$ identify successive transitions (also in the same state if $P_{ii}(n) > 0$) without any relation to the time axis; this is important when considering embedded Markov chains in a stochastic process.

from which, considering the Markov property (A7.68),

$$P_{ij}^{(2)} = \sum_{k=0}^{m} \Pr\{\xi_{n+1} = Z_k \mid \xi_n = Z_i\}\, \Pr\{\xi_{n+2} = Z_j \mid \xi_{n+1} = Z_k\} = \sum_{k=0}^{m} P_{ik} P_{kj}. \qquad \text{(A7.71)}$$

Results for $k > 2$ follow by induction.

Example A7.2
Prove that $\Pr\{(A \cap B) \mid C\} = \Pr\{B \mid C\}\, \Pr\{A \mid (B \cap C)\}$.

Solution
For $\Pr\{B \cap C\} > 0$ (and thus $\Pr\{C\} > 0$) it follows that

$$\Pr\{(A \cap B) \mid C\} = \frac{\Pr\{A \cap B \cap C\}}{\Pr\{C\}} = \frac{\Pr\{B \cap C\}}{\Pr\{C\}} \cdot \frac{\Pr\{A \cap B \cap C\}}{\Pr\{B \cap C\}} = \Pr\{B \mid C\}\, \Pr\{A \mid (B \cap C)\}.$$

The distribution law of a Markov chain is completely given by the initial distribution

$$A_i = \Pr\{\xi_0 = Z_i\}, \qquad i = 0, \ldots, m, \qquad \text{(A7.72)}$$

with $\sum A_i = 1$, and the transition probabilities $P_{ij}$, since for every $n > 0$ and arbitrary $i_0, \ldots, i_n \in \{0, \ldots, m\}$

$$\Pr\{\xi_0 = Z_{i_0} \cap \xi_1 = Z_{i_1} \cap \ldots \cap \xi_n = Z_{i_n}\} = A_{i_0} P_{i_0 i_1} \cdots P_{i_{n-1} i_n},$$

and thus, using the theorem of total probability (A6.17),

$$\Pr\{\xi_n = Z_j\} = \sum_{i=0}^{m} A_i P_{ij}^{(n)}, \qquad n = 1, 2, \ldots \qquad \text{(A7.73)}$$

A Markov chain with transition probabilities $P_{ij}$ is stationary if and only if the state probabilities $\Pr\{\xi_n = Z_j\}$, $j = 0, \ldots, m$, are independent of $n$, i.e. if the initial distribution $A_i$ according to Eq. (A7.72) is a solution $(p_j)$ of the system

$$p_j = \sum_{i=0}^{m} p_i P_{ij}, \quad\text{with}\quad p_j \ge 0 \quad\text{and}\quad \sum_{j=0}^{m} p_j = 1, \qquad j = 0, \ldots, m. \qquad \text{(A7.74)}$$

The system given by Eq. (A7.74) must be solved by canceling one (arbitrarily chosen) equation and replacing it by $\sum p_j = 1$. The values $p_0, \ldots, p_m$ from Eq. (A7.74) define the stationary distribution of the Markov chain with transition probabilities $P_{ij}$.

A Markov chain with transition probabilities $P_{ij}$ is irreducible if every state can be reached from every other state, i.e. if for each $(i, j)$ there is an $n = n(i, j)$ such that

$$P_{ij}^{(n)} > 0, \qquad i, j \in \{0, \ldots, m\}. \qquad \text{(A7.75)}$$
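Relations (A7.71), (A7.73) and (A7.74) can be checked numerically. The Python sketch below assumes NumPy and a hypothetical 3-state transition matrix chosen only for illustration. The helper names `k_step`, `state_probabilities` and `stationary_chain` are illustrative and do not come from the text. The stationary distribution is computed as described above: one equation of $p = pP$ is cancelled and replaced by the normalization $\sum p_j = 1$.

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1), for illustration only
P = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.3, 0.2],
              [0.0, 0.6, 0.4]])

def k_step(P, k):
    """k-step transition probabilities P^(k) = P^k (Eq. (A7.71), extended by induction)."""
    return np.linalg.matrix_power(P, k)

def state_probabilities(A, P, n):
    """Pr{xi_n = Z_j} = sum_i A_i P_ij^(n), Eq. (A7.73); A is the initial distribution."""
    return np.asarray(A, dtype=float) @ k_step(P, n)

def stationary_chain(P):
    """Solve p_j = sum_i p_i P_ij with sum_j p_j = 1 (Eq. (A7.74)).

    One equation of the homogeneous system (here the last one) is cancelled
    and replaced by the normalization condition.
    """
    size = P.shape[0]
    M = P.T - np.eye(size)     # row j: sum_i p_i P_ij - p_j = 0
    M[-1, :] = 1.0             # replace the last equation by sum_j p_j = 1
    b = np.zeros(size)
    b[-1] = 1.0
    return np.linalg.solve(M, b)

p = stationary_chain(P)
print(p, state_probabilities(p, P, 5))   # a stationary start stays stationary
```

Because this example chain is irreducible, the solution is unique and strictly positive, in line with Eq. (A7.76).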

It can be shown that the system (A7.74) possesses a unique solution with

$$p_j > 0 \quad\text{and}\quad \sum_{j=0}^{m} p_j = 1, \qquad j = 0, \ldots, m, \qquad \text{(A7.76)}$$

only if the Markov chain is irreducible, see e.g. [A7.3, A7.13, A7.27, A7.29].

A7.5.2 Markov Processes with Finitely Many States+)

A stochastic process $\xi(t)$ in continuous time with finitely many states $Z_0, \ldots, Z_m$ is a Markov process if for $n = 1, 2, \ldots$, arbitrary time points $t + a > t > t_n > \ldots > t_1$, and arbitrary $i, j, i_1, \ldots, i_n \in \{0, \ldots, m\}$

$$\Pr\{\xi(t+a) = Z_j \mid (\xi(t) = Z_i \cap \xi(t_n) = Z_{i_n} \cap \ldots \cap \xi(t_1) = Z_{i_1})\} = \Pr\{\xi(t+a) = Z_j \mid \xi(t) = Z_i\}. \qquad \text{(A7.77)}$$

The conditional state probabilities in Eq. (A7.77) are the transition probabilities of the Markov process; they will be designated by $P_{ij}(t, t+a)$:

$$P_{ij}(t, t+a) = \Pr\{\xi(t+a) = Z_j \mid \xi(t) = Z_i\}. \qquad \text{(A7.78)}$$

Equations (A7.77) and (A7.78) give the probability that $\xi(t+a)$ will be $Z_j$ given that $\xi(t)$ was $Z_i$. Between $t$ and $t+a$ the Markov process can visit any other state (this is not the case in Eq. (A7.95), in which $Z_j$ is the next state visited after $Z_i$). The Markov process is time-homogeneous if

$$P_{ij}(t, t+a) = P_{ij}(a). \qquad \text{(A7.79)}$$

In the following, only time-homogeneous Markov processes will be considered. For simplicity, Markov process will often be used as equivalent to time-homogeneous Markov process. For arbitrary $t > 0$ and $a > 0$, the $P_{ij}(t+a)$ satisfy the Chapman-Kolmogorov equations

$$P_{ij}(t+a) = \sum_{k=0}^{m} P_{ik}(t)\, P_{kj}(a), \qquad i, j \in \{0, \ldots, m\}, \qquad \text{(A7.80)}$$

whose proof, for given fixed $i$ and $j$, is similar to that for $P_{ij}^{(2)}$ in Eq. (A7.71). Furthermore, the $P_{ij}(a)$ satisfy the conditions

$$P_{ij}(a) \ge 0 \quad\text{and}\quad \sum_{j=0}^{m} P_{ij}(a) = 1, \qquad i = 0, \ldots, m, \qquad \text{(A7.81)}$$

and thus form a stochastic matrix. Together with the initial distribution

+) Continuous parameter Markov chain is often used in the literature. Using Markov process should help to avoid confusion with Markov chains embedded in stochastic processes (footnote on p. 430).

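$$P_i(0) = \Pr\{\xi(0) = Z_i\}, \qquad i = 0, \ldots, m, \qquad \text{(A7.82)}$$

the transition probabilities $P_{ij}(a)$ completely determine the distribution law of the Markov process. In particular, the state probabilities for $t > 0$

$$P_j(t) = \Pr\{\xi(t) = Z_j\}, \qquad j = 0, \ldots, m, \qquad \text{(A7.83)}$$

can be obtained from

$$P_j(t) = \sum_{i=0}^{m} P_i(0)\, P_{ij}(t). \qquad \text{(A7.84)}$$

Setting

$$P_{ij}(0) = \delta_{ij} = \begin{cases} 0 & \text{for } i \ne j \\ 1 & \text{for } i = j \end{cases} \qquad \text{(A7.85)}$$

and assuming that the transition probabilities $P_{ij}(t)$ are continuous at $t = 0$, it can be shown that the $P_{ij}(t)$ are also differentiable at $t = 0$. The limiting values

$$\lim_{\delta t \downarrow 0} \frac{P_{ij}(\delta t)}{\delta t} = \rho_{ij} \quad\text{for } i \ne j, \qquad \lim_{\delta t \downarrow 0} \frac{1 - P_{ii}(\delta t)}{\delta t} = \rho_i \qquad \text{(A7.86)}$$

exist and satisfy

$$\rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad i = 0, \ldots, m. \qquad \text{(A7.87)}$$

Equation (A7.86) can be written in the form

$$P_{ij}(\delta t) = \rho_{ij}\,\delta t + o(\delta t) \quad\text{and}\quad 1 - P_{ii}(\delta t) = \rho_i\,\delta t + o(\delta t), \qquad \text{(A7.88)}$$

where $o(\delta t)$ denotes a quantity having an order higher than that of $\delta t$, i.e.

$$\lim_{\delta t \downarrow 0} \frac{o(\delta t)}{\delta t} = 0. \qquad \text{(A7.89)}$$

Considering for any $t \ge 0$ that $P_{ij}(\delta t) = \Pr\{\xi(t+\delta t) = Z_j \mid \xi(t) = Z_i\}$, the following useful interpretation for $\rho_{ij}$ and $\rho_i$ can be obtained for $\delta t \downarrow 0$ and arbitrary $t$:

$$\rho_{ij}\,\delta t \approx \Pr\{\text{jump from } Z_i \text{ to } Z_j \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\}, \qquad \rho_i\,\delta t \approx \Pr\{\text{leave } Z_i \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\}. \qquad \text{(A7.90)}$$

It is thus reasonable to define $\rho_{ij}$ and $\rho_i$ as transition rates (for Markov processes, the $\rho_{ij}$ play a role similar to that of the transition probabilities for Markov chains).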

Setting $a = \delta t$ in Eq. (A7.80) and considering Eqs. (A7.78) and (A7.79) yields

$$P_{ij}(t+\delta t) = \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\, P_{kj}(\delta t) + P_{ij}(t)\, P_{jj}(\delta t),$$

or

$$\frac{P_{ij}(t+\delta t) - P_{ij}(t)}{\delta t} = \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\, \frac{P_{kj}(\delta t)}{\delta t} + P_{ij}(t)\, \frac{P_{jj}(\delta t) - 1}{\delta t},$$

and then, taking into account Eq. (A7.86), it follows that

$$\dot P_{ij}(t) = -P_{ij}(t)\,\rho_j + \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\,\rho_{kj}, \qquad i, j \in \{0, \ldots, m\}. \qquad \text{(A7.91)}$$

Equations (A7.91) are Kolmogorov's forward equations. With initial conditions $P_{ij}(0) = \delta_{ij}$ as in Eq. (A7.85), they have a unique solution which satisfies Eq. (A7.81). In other words, the transition rates according to Eq. (A7.86) or Eq. (A7.90) uniquely determine the transition probabilities $P_{ij}(t)$. Similarly as for Eq. (A7.91), it can be shown that the $P_{ij}(t)$ also satisfy Kolmogorov's backward equations

$$\dot P_{ij}(t) = -\rho_i\, P_{ij}(t) + \sum_{k=0,\, k \ne i}^{m} \rho_{ik}\, P_{kj}(t), \qquad i, j \in \{0, \ldots, m\}. \qquad \text{(A7.92)}$$

Equations (A7.91) and (A7.92) are also known as Chapman-Kolmogorov equations. With $\mathbf P(t)$ the matrix of the $P_{ij}(t)$ and $\mathbf\Lambda$ the matrix with elements $\rho_{ij}$ for $i \ne j$ and $-\rho_i$ on the diagonal, they can be written in matrix form as $\dot{\mathbf P} = \mathbf P \mathbf\Lambda$ and $\dot{\mathbf P} = \mathbf\Lambda \mathbf P$, and have the formal solution $\mathbf P(t) = e^{\mathbf\Lambda t}\, \mathbf P(0)$.

The following description of the time-homogeneous Markov process with initial distribution $P_i(0)$ and transition rates $\rho_{ij}$, $i, j \in \{0, \ldots, m\}$, provides better insight into the structure of a Markov process as a pure jump process (process with piecewise constant realizations). It is the basis for investigations of Markov processes by means of integral equations (Section A7.5.3.2), and is the motivation for the introduction of semi-Markov processes (Section A7.6). Let $\xi_0, \xi_1, \ldots$ be a sequence of random variables taking values in $\{Z_0, \ldots, Z_m\}$ denoting the states successively occupied, and $\eta_0, \eta_1, \ldots$ a sequence of positive random variables denoting the stay (sojourn) times between two consecutive state transitions. Define

$$P_{ij} = \frac{\rho_{ij}}{\rho_i}, \quad i \ne j, \qquad\text{and}\qquad P_{ii} = 0, \qquad i, j \in \{0, \ldots, m\}, \qquad \text{(A7.93)}$$

and assume furthermore that

$$\Pr\{\xi_0 = Z_i\} = P_i(0), \qquad i = 0, \ldots, m, \qquad \text{(A7.94)}$$

and, for $n = 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$, and arbitrary $x_0, \ldots, x_{n-1} > 0$,
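The formal solution $\mathbf P(t) = e^{\mathbf\Lambda t}$ (with $\mathbf P(0) = \mathbf I$) can be evaluated numerically. This sketch assumes Python with NumPy only. The matrix exponential is implemented by scaling and squaring with a truncated Taylor series rather than taken from a library, and `generator`, `expm` and `transition_matrix` are hypothetical helper names. For the one-item structure (state $Z_0$ up, $Z_1$ down) with illustrative rates $\lambda$ and $\mu$, the result can be compared with the well-known closed form $P_{00}(t) = \mu/(\lambda+\mu) + \lambda/(\lambda+\mu)\, e^{-(\lambda+\mu)t}$.

```python
import numpy as np

def generator(rates):
    """Matrix Lambda: rho_ij off the diagonal, -rho_i = -sum_{j != i} rho_ij on it."""
    L = np.array(rates, dtype=float)
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=1))
    return L

def expm(A, terms=30):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(A, ord=np.inf)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    B = A / 2.0**s                   # scaled so that ||B|| <= 1/2
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ B / k          # B^k / k!
        E = E + term
    for _ in range(s):
        E = E @ E                    # undo the scaling: exp(A) = exp(B)^(2^s)
    return E

def transition_matrix(rates, t):
    """P(t) = exp(Lambda t): the solution of Eqs. (A7.91)/(A7.92) with P(0) = I."""
    return expm(generator(rates) * t)

# One-item structure (Z0 up, Z1 down); lam and mu are illustrative values
lam, mu = 0.1, 2.0
Pt = transition_matrix([[0.0, lam], [mu, 0.0]], 1.5)
```

In practice a library routine such as `scipy.linalg.expm` would normally be preferred. The hand-written version only keeps the sketch dependency-free.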

$$\Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid (\xi_n = Z_i \cap \eta_{n-1} = x_{n-1} \cap \ldots \cap \xi_1 = Z_{i_1} \cap \eta_0 = x_0 \cap \xi_0 = Z_{i_0})\}$$
$$= \Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid \xi_n = Z_i\} = Q_{ij}(x) = P_{ij}\, F_{ij}(x) = P_{ij}\,(1 - e^{-\rho_i x}). \qquad \text{(A7.95)}$$

In Eq. (A7.95), as well as in Eq. (A7.158), $Z_j$ is the next state visited after $Z_i$ (this is not the case in Eq. (A7.77); see also the remark with Eq. (A7.106)). $Q_{ij}(x)$ is thus defined only for $j \ne i$. The sequence $\xi_0, \xi_1, \ldots$ is a Markov chain, with initial distribution $P_i(0) = \Pr\{\xi_0 = Z_i\}$ and transition probabilities $P_{ij} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\}$, with $P_{ii} = 0$, embedded in the original process. From Eq. (A7.95), it follows that (Example A7.2)

$$\Pr\{\eta_n \le x \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_j)\} = F_{ij}(x) = 1 - e^{-\rho_i x}. \qquad \text{(A7.96)}$$

$Q_{ij}(x)$ is a semi-Markov transition probability and will as such be introduced and discussed in Section A7.6. Now, define

$$S_0 = 0, \qquad S_n = \eta_0 + \ldots + \eta_{n-1}, \quad n = 1, 2, \ldots, \qquad \text{(A7.97)}$$

and

$$\xi(t) = \xi_n \quad\text{for}\quad S_n \le t < S_{n+1}, \qquad n = 0, 1, 2, \ldots \qquad \text{(A7.98)}$$

From Eq. (A7.98) and the memoryless property of the exponential distribution (Eq. (A6.87)) it follows that $\xi(t)$, $t \ge 0$, is a Markov process with initial distribution $P_i(0) = \Pr\{\xi(0) = Z_i\}$ and transition rates

$$\rho_{ij} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{\text{jump from } Z_i \text{ to } Z_j \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\}, \qquad j \ne i,$$

and

$$\rho_i = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{\text{leave } Z_i \text{ in } (t, t+\delta t] \mid \xi(t) = Z_i\} = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}.$$

The evolution of a time-homogeneous Markov process with transition rates $\rho_{ij}$ and $\rho_i$ can thus be described in the following way [A7.2 (1974 ETH)]: if at $t = 0$ the process enters the state $Z_i$, i.e. $\xi_0 = Z_i$, then the next state to be entered, say $Z_j$ ($j \ne i$), is selected according to the probability $P_{ij}$, and the stay (sojourn) time in $Z_i$ is a random variable $\eta_0$ with distribution function

$$\Pr\{\eta_0 \le x \mid (\xi_0 = Z_i \cap \xi_1 = Z_j)\} = 1 - e^{-\rho_i x};$$

as the process enters $Z_j$, the next state to be entered, say $Z_k$ ($k \ne j$), will be selected with probability $P_{jk}$, and the stay (sojourn) time $\eta_1$ in $Z_j$ will be distributed according to

$$\Pr\{\eta_1 \le x \mid (\xi_1 = Z_j \cap \xi_2 = Z_k)\} = 1 - e^{-\rho_j x},$$

and so on. The sequence $\xi_n$, $n = 0, 1, \ldots$, of the states successively occupied by the process is that of the Markov chain embedded in $\xi(t)$, the so-called embedded Markov chain. The random variable $\eta_n$ is the stay (sojourn) time of the process in the state defined by $\xi_n$. From the above description it becomes clear that each state $Z_i$, $i = 0, \ldots, m$, is a regeneration state.

In practical applications, the following technique can be used to determine the quantities $Q_{ij}(x)$, $P_{ij}$, and $F_{ij}(x)$ in Eq. (A7.95) [A7.2 (1985)]: if the process enters the state $Z_i$ at an arbitrary time, say at $t = 0$, then a set of independent random times $\tau_{ij}$, $j \ne i$, begins ($\tau_{ij}$ is the stay (sojourn) time in $Z_i$ with the next jump to $Z_j$); the process will then jump to $Z_j$ at the time $x$ if $\tau_{ij} = x$ and $\tau_{ik} > \tau_{ij}$ for all $k \ne j$. In this interpretation, the quantities $Q_{ij}(x)$, $P_{ij}$, and $F_{ij}(x)$ are given by

$$Q_{ij}(x) = \Pr\{\tau_{ij} \le x \cap \tau_{ik} > \tau_{ij},\ k \ne j\}, \qquad \text{(A7.99)}$$
$$P_{ij} = \Pr\{\tau_{ik} > \tau_{ij},\ k \ne j\}, \qquad \text{(A7.100)}$$
$$F_{ij}(x) = \Pr\{\tau_{ij} \le x \mid \tau_{ik} > \tau_{ij},\ k \ne j\}. \qquad \text{(A7.101)}$$

Assuming for the time-homogeneous Markov process (memoryless property)

$$\Pr\{\tau_{ij} \le x\} = 1 - e^{-\rho_{ij} x}, \qquad \text{(A7.102)}$$

one obtains, as in Eq. (A7.95),

$$P_{ij} = \frac{\rho_{ij}}{\rho_i} = Q_{ij}(\infty) \quad\text{for } j \ne i, \qquad P_{ii} \equiv 0, \qquad \text{(A7.103)}$$
$$F_{ij}(x) = 1 - e^{-\rho_i x}, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}. \qquad \text{(A7.104)}$$

It should be emphasized that, due to the memoryless property of the time-homogeneous Markov process, there is no difference whether the process enters $Z_i$ at $t = 0$ or is already there. However, this is not true for semi-Markov processes (Eq. (A7.158)).
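The competing-clock interpretation of Eqs. (A7.99)-(A7.104) translates directly into a sampling procedure. This Python sketch is an illustration only; `next_jump` and `estimate` are hypothetical helper names and the rates are illustrative. On entering $Z_i$, it draws independent exponential times $\tau_{ij}$ with rates $\rho_{ij}$ and lets the smallest one decide the next state. The empirical frequencies should approach $P_{ij} = \rho_{ij}/\rho_i$, and the mean stay time should approach $1/\rho_i$.

```python
import random

def next_jump(rates_from_i, rng):
    """Draw independent tau_ij ~ Exp(rho_ij), j != i (Eq. (A7.102)); the smallest one wins.

    Returns the next state j and the stay time x = tau_ij (Eqs. (A7.99)-(A7.101)).
    """
    taus = {j: rng.expovariate(r) for j, r in rates_from_i.items() if r > 0}
    j = min(taus, key=taus.get)
    return j, taus[j]

def estimate(rates_from_i, runs=100_000, seed=7):
    """Estimate P_ij and the mean stay time in Z_i from repeated draws."""
    rng = random.Random(seed)
    counts = {j: 0 for j in rates_from_i}
    total_stay = 0.0
    for _ in range(runs):
        j, x = next_jump(rates_from_i, rng)
        counts[j] += 1
        total_stay += x
    return {j: c / runs for j, c in counts.items()}, total_stay / runs

# State Z1 of a 1-out-of-2 active redundancy with one repair crew (Fig. A7.7):
# rho_10 = mu (repair), rho_12 = lam (second failure); values are illustrative
lam, mu = 0.25, 1.0
probs, mean_stay = estimate({0: mu, 2: lam})
# Theory: P_10 = mu/(lam+mu) = 0.8, P_12 = 0.2, mean stay 1/rho_1 = 0.8
```

Repeating `next_jump` from each newly entered state generates a full trajectory of $\xi(t)$ as the pure jump process of Eq. (A7.98).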

Quite generally, a repairable system can be described by a time-homogeneous Markov process if and only if all random variables occurring (failure-free operating times and repair times) are independent and exponentially distributed. If some failure-free operating times or repair times of elements are Erlang distributed (Appendix A6.10.3), the time evolution of the system can be described by means of a time-homogeneous Markov process with appropriate extension of the state space (see Fig. 6.6 for an example).

A powerful tool when investigating time-homogeneous Markov processes is the diagram of transition probabilities in $(t, t+\delta t]$, where $\delta t \to 0$ ($\delta t > 0$, i.e. $\delta t \downarrow 0$) and $t$ is an arbitrary time point (e.g. $t = 0$). This diagram is a directed graph with nodes labeled by states $Z_i$, $i = 0, \ldots, m$, and arcs labeled by transition probabilities $P_{ij}(\delta t)$, where terms of order $o(\delta t)$ are omitted. It is related to the state transition diagram of the system involved, takes care of particular assumptions (such as repair priority, change of failure or repair rates at a state change, etc.), and has in general more than $2^n$ states if $n$ elements in the reliability block diagram are involved (see for instance Fig. A7.6 and Section 6.7).

Taking into account the properties of the random variables $\tau_{ij}$ introduced with Eq. (A7.99), it follows that for $\delta t \to 0$

$$\Pr\{(\xi(\delta t) = Z_j \cap \text{only one jump in } (0, \delta t]) \mid \xi(0) = Z_i\} = (1 - e^{-\rho_{ij}\delta t}) \prod_{k=0,\, k \ne i,\, k \ne j}^{m} e^{-\rho_{ik}\delta t} = \rho_{ij}\,\delta t + o(\delta t), \qquad \text{(A7.105)}$$

and

$$\Pr\{(\xi(\delta t) = Z_j \cap \text{more than one jump in } (0, \delta t]) \mid \xi(0) = Z_i\} = o(\delta t). \qquad \text{(A7.106)}$$

From this, $P_{ij}(\delta t) = \rho_{ij}\,\delta t + o(\delta t)$ for $j \ne i$, and $P_{ii}(\delta t) = 1 - \rho_i\,\delta t + o(\delta t)$, as with Eq. (A7.88). Although for $\delta t \to 0$ it holds that $P_{ij}(\delta t) = Q_{ij}(\delta t)$ (up to terms in $o(\delta t)$), the meanings of $P_{ij}(\delta t)$ as in Eq. (A7.78) or Eq. (A7.79) and of $Q_{ij}(\delta t)$ as in Eq. (A7.95) or Eq. (A7.158) are basically different.
With $Q_{ij}(x)$, $Z_j$ is the next state visited after $Z_i$; this is not the case for $P_{ij}(x)$. Examples A7.3 to A7.5 give the diagram of transition probabilities in $(t, t+\delta t]$ for some typical structures for reliability applications. The states in which the system is down are hatched on the diagrams. In state $Z_0$ all elements are up (operating or in the reserve state).

Example A7.3
Figure A7.4 shows several cases of a 1-out-of-2 redundancy. The difference with respect to the number of repair crews appears when leaving the states $Z_2$ and $Z_3$. Cases b) and c) are identical when two repair crews are available.


[Figure A7.4, recoverable labels: a) 1-out-of-2 redundancy; distribution of failure-free operating times $F(t) = 1 - e^{-\lambda t}$ in the operating state and $F(t) = 1 - e^{-\lambda_r t}$ in the reserve state; distribution of repair time $G(t) = 1 - e^{-\mu t}$. b) One repair crew: $\rho_{01} = \rho_{24} = \lambda_1$; $\rho_{02} = \rho_{13} = \lambda_2$; $\rho_{10} = \rho_{32} = \mu_1$; $\rho_{20} = \rho_{41} = \mu_2$. Two repair crews: $\rho_{01} = \rho_{23} = \lambda_1$; $\rho_{02} = \rho_{13} = \lambda_2$; $\rho_{10} = \rho_{32} = \mu_1$; $\rho_{20} = \rho_{31} = \mu_2$. c) One repair crew: $\rho_{01} = \rho_{23} = \lambda_1$; $\rho_{02} = \rho_{13} = \lambda_2$; $\rho_{10} = \rho_{32} = \mu_1$; $\rho_{20} = \mu_2$. Two repair crews: $\rho_{01} = \rho_{23} = \lambda_1$; $\rho_{02} = \rho_{13} = \lambda_2$; $\rho_{10} = \rho_{32} = \mu_1$; $\rho_{20} = \rho_{31} = \mu_2$.]

Figure A7.4 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 redundancy ($\lambda, \lambda_r$ = failure rates, $\mu$ = repair rate): a) Warm redundancy with $E_1 = E_2$ ($\lambda_r = \lambda \to$ active redundancy, $\lambda_r = 0 \to$ standby redundancy); b) Active redundancy with $E_1 \ne E_2$; c) Active redundancy with $E_1 \ne E_2$ and repair priority on $E_1$ ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)


Example A7.4
Figure A7.5 shows two cases of a k-out-of-n active redundancy with two repair crews. In the first case, the system operates up to the failure of all elements (with reduced performance from state $Z_{n-k+1}$). In the second case no further failures can occur when the system is down.

Example A7.5
Figure A7.6 shows a series/parallel structure consisting of the series connection (in the reliability sense) of a 1-out-of-2 active redundancy, with elements 2 and 3, and a switching element 1. The system has only one repair crew. Since one of the redundant elements 2 or 3 can be down without causing a system failure, in cases a) and b) the repair of element 1 is given first priority. This means that if a failure of 1 occurs during a repair of 2 or 3, the repair is stopped and 1 is repaired. In cases c) and d) the repair priority on $E_1$ has been dropped.

[Figure A7.5, recoverable labels: k-out-of-n (active), two repair crews; distribution of failure-free operating times $F(t) = 1 - e^{-\lambda t}$, of repair times $G(t) = 1 - e^{-\mu t}$. a) $\nu_i = (n-i)\lambda$ and $\rho_{i(i+1)} = \nu_i$ for $i = 0, 1, \ldots, n-1$; $\rho_{10} = \mu$; $\rho_{i(i-1)} = 2\mu$ for $i = 2, 3, \ldots, n$. b) $\nu_i = (n-i)\lambda$ and $\rho_{i(i+1)} = \nu_i$ for $i = 0, 1, \ldots, n-k$; $\rho_{10} = \mu$; $\rho_{i(i-1)} = 2\mu$ for $i = 2, 3, \ldots, n-k+1$.]

Figure A7.5 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable k-out-of-n redundancy with two repair crews ($\lambda$ = failure rate, $\mu$ = repair rate): a) The system operates up to the failure of the last element; b) No further failures at system down ($t$ arbitrary, $\delta t \downarrow 0$, Markov process; in a k-out-of-n redundancy the system is up if at least k elements are operating)
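The rates of Fig. A7.5 can be turned into a rate matrix and combined with the stationary equations (Eq. (A7.127), Appendix A7.5.3.3) to obtain a steady-state point availability. The Python sketch below is illustrative only. It assumes NumPy, numbers the states $Z_i$ by the number $i$ of failed elements as in the figure, and uses hypothetical helper names (`k_out_of_n_rates`, `stationary`, `steady_state_availability`). The system is up as long as at most $n-k$ elements have failed.

```python
import numpy as np

def k_out_of_n_rates(n, k, lam, mu, case="a"):
    """Rate matrix for Fig. A7.5 (two repair crews); state Z_i = i elements failed.

    case "a": failures continue up to the last element (states Z_0..Z_n);
    case "b": no further failures once the system is down (states Z_0..Z_{n-k+1}).
    """
    last_up = n - k                                # system up iff at most n-k failed
    top = n if case == "a" else last_up + 1        # highest reachable state
    R = np.zeros((top + 1, top + 1))
    for i in range(top):                           # failures: nu_i = (n - i) * lam
        R[i, i + 1] = (n - i) * lam
    for i in range(1, top + 1):                    # repairs: one crew busy in Z_1, two otherwise
        R[i, i - 1] = mu if i == 1 else 2.0 * mu
    return R, last_up

def stationary(R):
    """Solve rho_j P_j = sum_{i != j} P_i rho_ij with sum_j P_j = 1."""
    L = R - np.diag(R.sum(axis=1))                 # generator: -rho_i on the diagonal
    M = L.T.copy()
    M[-1, :] = 1.0                                 # replace one equation by the normalization
    b = np.zeros(len(R))
    b[-1] = 1.0
    return np.linalg.solve(M, b)

def steady_state_availability(n, k, lam, mu, case="a"):
    """PA_S = sum of the stationary P_j over the up states Z_0..Z_{n-k}."""
    R, last_up = k_out_of_n_rates(n, k, lam, mu, case)
    P = stationary(R)
    return P[: last_up + 1].sum()
```

For $n = 2$, $k = 1$ the two cases coincide and reduce to a 1-out-of-2 active redundancy with two repair crews.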

[Figure A7.6, recoverable labels: 1-out-of-2 (active), $E_2 = E_3 = E$; distribution of failure-free times $F(t) = 1 - e^{-\lambda t}$ for $E$ and $F(t) = 1 - e^{-\lambda_1 t}$ for $E_1$; of repair times $G(t) = 1 - e^{-\mu t}$ for $E$ and $G(t) = 1 - e^{-\mu_1 t}$ for $E_1$. a) Repair priority on $E_1$: $\rho_{01} = \rho_{25} = \rho_{46} = \lambda_1$; $\rho_{02} = \rho_{15} = 2\lambda$; $\rho_{24} = \lambda$; $\rho_{10} = \rho_{52} = \rho_{64} = \mu_1$; $\rho_{20} = \rho_{42} = \mu$; $\rho_{56} = \lambda$. b) As a), but no further failures at system down: $\rho_{01} = \rho_{23} = \lambda_1$; $\rho_{02} = 2\lambda$; $\rho_{24} = \lambda$; $\rho_{10} = \rho_{32} = \mu_1$; $\rho_{20} = \rho_{42} = \mu$.]

Figure A7.6 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable series-parallel structure with $E_2 = E_3 = E$ and one repair crew: a) Repair priority on $E_1$ and the system operates up to the failure of the last element; b) Repair priority on $E_1$ and at system failure no further failures can occur; c) and d) as a) and b), respectively, but without repair priority on $E_1$ ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)

A7.5.3 State Probabilities and Stay Times (Sojourn Times) in a Given Class of States

In reliability theory, two important quantities are the state probabilities and the distribution function of the stay (sojourn) times in the set of system up states. The state probabilities allow calculation of the point availability. The reliability function can be obtained from the distribution function of the stay time in the set of system up states. Furthermore, for time-homogeneous Markov processes a combination of these quantities allows a simple calculation of the interval reliability. In such an analysis it is useful to subdivide the system state space into two complementary sets $U$ and $\bar U$:

$U$ = set of the system up states (up states at system level),
$\bar U$ = set of the system down states (down states at system level). (A7.107)

Partition of the state space into more than two classes is possible, see e.g. [A7.28]. Calculation of state probabilities and stay (sojourn) times can be carried out for Markov processes using the method of differential equations or of integral equations.

A7.5.3.1 Method of Differential Equations

The method of differential equations is the classical one used in investigating Markov processes. It is based on the diagram of transition probabilities in $(t, t+\delta t]$. Consider a time-homogeneous Markov process $\xi(t)$ with arbitrary initial distribution $P_i(0) = \Pr\{\xi(0) = Z_i\}$ and transition rates $\rho_{ij}$ and $\rho_i$. The state probabilities defined by Eq. (A7.83)

$$P_j(t) = \Pr\{\xi(t) = Z_j\}, \qquad j = 0, \ldots, m,$$

satisfy the system of differential equations

$$\dot P_j(t) = -\rho_j P_j(t) + \sum_{i=0,\, i \ne j}^{m} P_i(t)\,\rho_{ij}, \qquad \rho_j = \sum_{i=0,\, i \ne j}^{m} \rho_{ji}, \qquad j = 0, \ldots, m. \qquad \text{(A7.108)}$$

The proof of Eq. (A7.108) is similar to that for Eq. (A7.91). Example A7.6 shows a simple application. The point availability $PA_S(t)$, for arbitrary initial conditions at $t = 0$, then follows from

$$PA_S(t) = \Pr\{\xi(t) \in U\} = \sum_{Z_j \in U} P_j(t). \qquad \text{(A7.109)}$$

In reliability analysis, particular initial conditions are often of interest. Assuming

$$P_i(0) = 1 \quad\text{and}\quad P_j(0) = 0 \quad\text{for } j \ne i, \qquad \text{(A7.110)}$$

i.e. that the system is in $Z_i$ at $t = 0$ (usually in state $Z_0$, denoting "all elements are up"), the state probabilities $P_j(t)$ are the transition probabilities $P_{ij}(t)$ defined by Eqs. (A7.78) & (A7.79) and can be obtained as

$$P_{ij}(t) = P_j(t), \qquad j = 0, \ldots, m, \qquad \text{(A7.111)}$$

with $P_j(t)$ as the solution of Eq. (A7.108) with initial conditions as in Eq. (A7.110), or of Eq. (A7.92). The point availability, now designated with $PA_{Si}(t)$, is then given by

$$PA_{Si}(t) = \Pr\{\xi(t) \in U \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P_{ij}(t), \qquad i = 0, \ldots, m. \qquad \text{(A7.112)}$$

$PA_{Si}(t)$ is the probability that the system is in one of the up states at $t$, given it was in $Z_i$ at $t = 0$. Example A7.6 illustrates the calculation of the point availability for a 1-out-of-2 active redundancy.

Example A7.6
Assume a 1-out-of-2 active redundancy, consisting of 2 identical elements $E_1 = E_2 = E$ with constant failure rate $\lambda$ and repair rate $\mu$, and only one repair crew. Determine the state probabilities of the involved Markov process ($E_1$ and $E_2$ are new at $t = 0$).

Solution
Figure A7.7 shows the diagram of transition probabilities in $(t, t+\delta t]$ for the investigation of the point availability. Because of the memoryless property of the involved Markov process, Fig. A7.7 and Eqs. (A7.83) & (A7.90) lead to (by omitting the terms in $o(\delta t)$, as per Eq. (A7.89))

$$P_0(t+\delta t) = P_0(t)(1 - 2\lambda\,\delta t) + P_1(t)\,\mu\,\delta t$$
$$P_1(t+\delta t) = P_1(t)(1 - (\lambda+\mu)\,\delta t) + P_0(t)\,2\lambda\,\delta t + P_2(t)\,\mu\,\delta t$$
$$P_2(t+\delta t) = P_2(t)(1 - \mu\,\delta t) + P_1(t)\,\lambda\,\delta t,$$

and then, as $\delta t \downarrow 0$,

$$\dot P_0(t) = -2\lambda P_0(t) + \mu P_1(t)$$
$$\dot P_1(t) = -(\lambda+\mu) P_1(t) + 2\lambda P_0(t) + \mu P_2(t)$$
$$\dot P_2(t) = -\mu P_2(t) + \lambda P_1(t). \qquad \text{(A7.113)}$$

Equation (A7.113) also follows from Eq. (A7.108) with the $\rho_{ij}$ from Fig. A7.7. The solution of Eq. (A7.113) with given initial conditions at $t = 0$, e.g. $P_0(0) = 1$, $P_1(0) = P_2(0) = 0$, leads to the state probabilities $P_0(t)$, $P_1(t)$, and $P_2(t)$, and then to the point availability according to Eqs. (A7.111) and (A7.112) with $i = 0$ (see also Example A7.9 and Table 6.2 for the solution).
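Equation (A7.113) can also be integrated numerically. The Python sketch below assumes NumPy and applies the classical fourth-order Runge-Kutta scheme, which is a generic numerical method and not the analytical solution referred to in Example A7.9. The step count and the names `rhs`, `state_probabilities` and `point_availability` are illustrative choices. It starts from $P_0(0) = 1$ and returns $PA_{S0}(t) = P_0(t) + P_1(t)$.

```python
import numpy as np

def rhs(P, lam, mu):
    """Right-hand side of Eq. (A7.113) for P = (P0, P1, P2)."""
    P0, P1, P2 = P
    return np.array([-2 * lam * P0 + mu * P1,
                     -(lam + mu) * P1 + 2 * lam * P0 + mu * P2,
                     -mu * P2 + lam * P1])

def state_probabilities(lam, mu, t, steps=5000):
    """Classical 4th-order Runge-Kutta from P0(0) = 1, P1(0) = P2(0) = 0."""
    P = np.array([1.0, 0.0, 0.0])
    h = t / steps
    for _ in range(steps):
        k1 = rhs(P, lam, mu)
        k2 = rhs(P + h / 2 * k1, lam, mu)
        k3 = rhs(P + h / 2 * k2, lam, mu)
        k4 = rhs(P + h * k3, lam, mu)
        P = P + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return P

def point_availability(lam, mu, t):
    """PA_S0(t) = P_00(t) + P_01(t), Eqs. (A7.111)/(A7.112) with U = {Z0, Z1}."""
    P = state_probabilities(lam, mu, t)
    return P[0] + P[1]
```

For large $t$ the result should settle at the stationary value obtained by setting the derivatives in Eq. (A7.113) to zero, $PA_S = (\mu^2 + 2\lambda\mu)/(\mu^2 + 2\lambda\mu + 2\lambda^2)$.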

Figure A7.7 Diagram of the transition probabilities in $(t, t+\delta t]$ for a 1-out-of-2 active redundancy with $E_1 = E_2 = E$, one repair crew, calculation of the point availability ($t$ arbitrary, $\delta t \downarrow 0$, Markov process with $\rho_{01} = 2\lambda$, $\rho_{10} = \mu$, $\rho_{12} = \lambda$, $\rho_{21} = \mu$; $\rho_0 = 2\lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = \mu$)

A further important quantity for reliability theory is the reliability function $R_S(t)$, i.e. the probability of no system failure in $(0, t]$. $R_S(t)$ can be calculated using the method of differential equations if all states in $\bar U$ are declared to be absorbing states. This means that the process will never leave $Z_k$ if it jumps into a state $Z_k \in \bar U$. It is not difficult to see that in this case the events

first system failure occurs before $t$

and

system is in one of the states $\bar U$ at $t$

are equivalent, so that the sum of the probabilities to be in one of the states in $U$ is the required reliability function, i.e. the probability that up to the time $t$ the process has never left the set of up states $U$. To make this analysis rigorous, consider the modified Markov process $\xi'(t)$ with transition probabilities $P'_{ij}(t)$ and transition rates

$$\rho'_{ij} = \begin{cases} \rho_{ij} & \text{if } Z_i \in U \\ 0 & \text{if } Z_i \in \bar U \end{cases}, \qquad \rho'_i = \sum_{j=0,\, j \ne i}^{m} \rho'_{ij}. \qquad \text{(A7.114)}$$

The state probabilities $P'_j(t)$ of $\xi'(t)$ satisfy the following system of differential equations (see Example A7.7 for a simple application)

$$\dot P'_j(t) = -\rho'_j P'_j(t) + \sum_{i=0,\, i \ne j}^{m} P'_i(t)\,\rho'_{ij}, \qquad \rho'_j = \sum_{i=0,\, i \ne j}^{m} \rho'_{ji}, \qquad j = 0, \ldots, m. \qquad \text{(A7.115)}$$

Assuming as initial conditions $P'_i(0) = 1$ and $P'_j(0) = 0$ for $j \ne i$ (with $Z_i \in U$), the solution of Eq. (A7.115) leads to the state probabilities $P'_j(t)$ and from these to the transition probabilities

$$P'_{ij}(t) = P'_j(t). \qquad \text{(A7.116)}$$

The reliability function $R_{Si}(t)$ is then given by

$$R_{Si}(t) = \Pr\{\xi(x) \in U \text{ for } 0 < x \le t \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P'_{ij}(t), \qquad Z_i \in U. \qquad \text{(A7.117)}$$


The dashed probabilities ($P'(t)$) are reserved for the calculation of the reliability when using the method of differential equations. This should avoid confusion with the corresponding quantities for the point availability. Example A7.7 illustrates the calculation of the reliability function for a 1-out-of-2 active redundancy.

Example A7.7
Determine the reliability function for the same case as in Example A7.6, i.e. the probability that the system has not left the states $Z_0$ and $Z_1$ up to time $t$.

Solution
The diagram of transition probabilities in $(t, t+\delta t]$ of Fig. A7.7 is modified as in Fig. A7.8 by making the down state $Z_2$ absorbing. For the state probabilities it follows that (see Example A7.6)

$$\dot P'_0(t) = -2\lambda P'_0(t) + \mu P'_1(t)$$
$$\dot P'_1(t) = -(\lambda+\mu) P'_1(t) + 2\lambda P'_0(t)$$
$$\dot P'_2(t) = \lambda P'_1(t). \qquad \text{(A7.118)}$$

The solution of Eq. (A7.118) with the given initial conditions at $t = 0$ ($P'_0(0) = 1$, $P'_1(0) = P'_2(0) = 0$) leads to the state probabilities $P'_0(t)$, $P'_1(t)$, and $P'_2(t)$, and then to the transition probabilities and to the reliability function according to Eqs. (A7.116) and (A7.117), respectively (the dashed state probabilities should avoid confusion with the solution given by Eq. (A7.113)).

Equations (A7.112) and (A7.117) can be combined to determine the probability that the process is in an up state (set $U$) at $t$ and does not leave the set $U$ in the time interval $[t, t+\theta]$, given $\xi(0) = Z_i$. This quantity is the interval reliability $IR_{Si}(t, t+\theta)$. Due to the memoryless property of the involved Markov process,

$$IR_{Si}(t, t+\theta) = \Pr\{\xi(x) \in U \text{ for } t \le x \le t+\theta \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P_{ij}(t)\, R_{Sj}(\theta), \qquad i = 0, \ldots, m, \qquad \text{(A7.119)}$$

with $P_{ij}(t)$ as given in Eq. (A7.111).

Figure A7.8 Diagram of the transition probabilities in $(t, t+\delta t]$ for a 1-out-of-2 active redundancy with $E_1 = E_2 = E$, one repair crew (for this case not mandatory), calculation of the reliability function ($t$ arbitrary, $\delta t \downarrow 0$, Markov process with $\rho_{01} = 2\lambda$, $\rho_{10} = \mu$, $\rho_{12} = \lambda$; $\rho_0 = 2\lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = 0$)
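A numerical sketch of Example A7.7 and of Eq. (A7.119) follows, assuming Python with NumPy and a fourth-order Runge-Kutta integration (a generic numerical choice, not the book's analytical solution). The absorbing modification of Eq. (A7.114) is implemented by zeroing the row of $Z_2$ in the rate matrix. The helper names and the step count are hypothetical.

```python
import numpy as np

def generators(lam, mu):
    """Lambda for Fig. A7.7 and Lambda' for Fig. A7.8 (row i: rates out of Z_i)."""
    L = np.array([[-2 * lam, 2 * lam, 0.0],
                  [mu, -(lam + mu), lam],
                  [0.0, mu, -mu]])
    Lr = L.copy()
    Lr[2, :] = 0.0          # Z2 made absorbing: rho'_2j = 0, Eq. (A7.114)
    return L, Lr

def propagate(start, L, t, steps=4000):
    """RK4 integration of the row-vector system dP/dt = P L from 0 to t."""
    P = np.array(start, dtype=float)
    h = t / steps
    for _ in range(steps):
        k1 = P @ L
        k2 = (P + h / 2 * k1) @ L
        k3 = (P + h / 2 * k2) @ L
        k4 = (P + h * k3) @ L
        P = P + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return P

def reliability(i, lam, mu, theta):
    """R_Si(theta) = P'_i0(theta) + P'_i1(theta), Eq. (A7.117) with U = {Z0, Z1}."""
    _, Lr = generators(lam, mu)
    Pr = propagate(np.eye(3)[i], Lr, theta)
    return Pr[0] + Pr[1]

def interval_reliability(lam, mu, t, theta):
    """IR_S0(t, t+theta) = sum_{Z_j in U} P_0j(t) R_Sj(theta), Eq. (A7.119)."""
    L, _ = generators(lam, mu)
    P = propagate(np.eye(3)[0], L, t)
    return P[0] * reliability(0, lam, mu, theta) + P[1] * reliability(1, lam, mu, theta)
```

At $t = 0$ the interval reliability reduces to $R_{S0}(\theta)$, which is a convenient sanity check.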

A7.5.3.2 Method of Integral Equations

The method of integral equations is based on the representation of the Markov process $\xi(t)$ as a pure jump process by means of $\xi_n$ and $\eta_n$, as introduced in Appendix A7.5.2 (Eq. (A7.95)). Of the memoryless property, it uses only the fact that jump points (in a new state) are regeneration points of $\xi(t)$. The transition probabilities $P_{ij}(t) = \Pr\{\xi(t) = Z_j \mid \xi(0) = Z_i\}$ can be obtained by solving the following system of integral equations

$$P_{ij}(t) = \delta_{ij}\, e^{-\rho_i t} + \sum_{k=0,\, k \ne i}^{m} \int_0^t \rho_{ik}\, e^{-\rho_i x}\, P_{kj}(t-x)\, dx, \qquad i, j \in \{0, \ldots, m\}, \qquad \text{(A7.120)}$$

with $\rho_i = \sum_{j \ne i} \rho_{ij}$, $\delta_{ij} = 0$ for $j \ne i$, $\delta_{ii} = 1$. To prove Eq. (A7.120), consider that

$$P_{ij}(t) = \Pr\{(\xi(t) = Z_j \cap \text{no jumps in } (0, t]) \mid \xi(0) = Z_i\} + \sum_{k=0,\, k \ne i}^{m} \Pr\{(\xi(t) = Z_j \cap \text{first jump in } (0, t] \text{ in } Z_k) \mid \xi(0) = Z_i\}. \qquad \text{(A7.121)}$$

The first term of Eq. (A7.121) only holds for $j = i$ and gives the probability that the process will not leave the state $Z_i$ ($e^{-\rho_i t} = \Pr\{\tau_{ij} > t \text{ for all } j \ne i\}$, according to the interpretation given by Eqs. (A7.99) - (A7.104)). The second term holds for any $j$ (including $j = i$, since the process can return to $Z_i$); it gives the probability that the process will move first from $Z_i$ to $Z_k$, and takes into account that the occurrence of $Z_k$ is a regeneration point. According to Eq. (A7.95), $\Pr\{\xi_1 = Z_k \cap \eta_0 \le x \mid \xi_0 = Z_i\} = Q_{ik}(x) = P_{ik}(1 - e^{-\rho_i x})$, whose density is $\rho_{ik}\, e^{-\rho_i x}$, and $\Pr\{\xi(t) = Z_j \mid (\xi_0 = Z_i \cap \eta_0 = x \cap \xi_1 = Z_k)\} = P_{kj}(t-x)$. Equation (A7.120) then follows from the theorem of total probability (Eq. (A6.17)).

In the same way as for Eq. (A7.121), it can be shown that the reliability function $R_{Si}(t)$, as defined in Eq. (A7.117), satisfies the following system of integral equations

$$R_{Si}(t) = e^{-\rho_i t} + \sum_{Z_j \in U,\, j \ne i} \int_0^t \rho_{ij}\, e^{-\rho_i x}\, R_{Sj}(t-x)\, dx, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad Z_i \in U. \qquad \text{(A7.122)}$$

The point availability $PA_{Si}(t)$ and the interval reliability $IR_{Si}(t, t+\theta)$ are given by Eqs. (A7.112) & (A7.119), with $P_{ij}(t)$ per Eq. (A7.120) or Eq. (A7.111). The use of integral equations for $PA_{Si}(t)$ can lead to mistakes, since $R_{Si}(t)$ and $PA_{Si}(t)$ describe two different situations (summing for $PA_{Si}(t)$ over all states $j \in \{0, \ldots, m\}$ leads to $PA_{Si}(t) = 1$).

The systems of integral equations (A7.120) and (A7.122) can be solved using Laplace transforms. Referring to Appendix A9.7,

$$\tilde P_{ij}(s) = \frac{\delta_{ij}}{s + \rho_i} + \sum_{k=0,\, k \ne i}^{m} \frac{\rho_{ik}}{s + \rho_i}\, \tilde P_{kj}(s), \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad i, j \in \{0, \ldots, m\}, \qquad \text{(A7.123)}$$

and

$$\tilde R_{Si}(s) = \frac{1}{s + \rho_i} + \sum_{Z_j \in U,\, j \ne i} \frac{\rho_{ij}}{s + \rho_i}\, \tilde R_{Sj}(s), \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad Z_i \in U. \qquad \text{(A7.124)}$$

A direct advantage of the method based on integral equations appears in the calculation of the mean stay (sojourn) time in the up states. Denoting by $MTTF_{Si}$ the system mean time to failure, provided the system is in state $Z_i \in U$ at $t = 0$, leads to (Eq. (A6.38), Appendix A9.7)

$$MTTF_{Si} = \int_0^\infty R_{Si}(t)\, dt = \tilde R_{Si}(0). \qquad \text{(A7.125)}$$

Thus, according to Eq. (A7.124), $MTTF_{Si}$ satisfies the following system of algebraic equations (see Example A7.9 for a simple application)

$$MTTF_{Si} = \frac{1}{\rho_i} + \sum_{Z_j \in U,\, j \ne i} \frac{\rho_{ij}}{\rho_i}\, MTTF_{Sj}, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad Z_i \in U. \qquad \text{(A7.126)}$$

A7.5.3.3 Stationary State and Asymptotic Behavior

The determination of time-dependent state probabilities or of the point availability of a system whose elements have constant failure and repair rates is still possible using differential or integral equations. However, it can become time-consuming. The situation is easier when the state probabilities are independent of time, i.e. when the process involved is stationary (the system of differential equations then reduces to a system of algebraic equations). A time-homogeneous Markov process $\xi(t)$ with states $Z_0, \ldots, Z_m$ is stationary if its state probabilities $P_i(t) = \Pr\{\xi(t) = Z_i\}$, $i = 0, \ldots, m$, do not depend on $t$. This can be seen from the following relationship
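Equation (A7.126) is a small linear system in the unknowns $MTTF_{Si}$, $Z_i \in U$. The Python sketch below assumes NumPy; the function name `mttf` and the rate values are illustrative. It builds that system directly from a rate matrix $\rho_{ij}$ and the list of up states. Note that $\rho_i$ sums over all $j \ne i$, including the down states, while the sum in Eq. (A7.126) runs only over up states.

```python
import numpy as np

def mttf(rates, up):
    """Solve Eq. (A7.126): MTTF_Si = 1/rho_i + sum_{Z_j in U, j != i} (rho_ij / rho_i) MTTF_Sj.

    rates[i][j] = rho_ij (zero diagonal); up = indices of the up states U.
    """
    R = np.asarray(rates, dtype=float)
    rho = R.sum(axis=1)                     # rho_i, summed over all j != i (down states too)
    up = list(up)
    A = np.eye(len(up))
    b = np.empty(len(up))
    for a, i in enumerate(up):
        b[a] = 1.0 / rho[i]
        for c, j in enumerate(up):
            if j != i:
                A[a, c] -= R[i, j] / rho[i]
    return dict(zip(up, np.linalg.solve(A, b)))

lam, mu = 0.5, 2.0
rates = [[0.0, 2 * lam, 0.0],     # Fig. A7.7: Z0 -> Z1 at 2 lam
         [mu, 0.0, lam],          # Z1 -> Z0 at mu, Z1 -> Z2 at lam
         [0.0, mu, 0.0]]          # Z2 (down) -> Z1 at mu; not used for MTTF
result = mttf(rates, up=[0, 1])   # result[0] is about 7.0 = (3 lam + mu) / (2 lam^2)
```

For the 1-out-of-2 active redundancy with one repair crew, this reproduces the familiar $MTTF_{S0} = (3\lambda + \mu)/(2\lambda^2)$.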

$$\Pr\{\xi(t_1) = Z_{i_1} \cap \ldots \cap \xi(t_n) = Z_{i_n}\} = \Pr\{\xi(t_1) = Z_{i_1}\}\, P_{i_1 i_2}(t_2 - t_1) \cdots P_{i_{n-1} i_n}(t_n - t_{n-1}),$$

which, according to the Markov property (Eq. (A7.77)), must be valid for arbitrary $t_1 < \ldots < t_n$ and $i_1, \ldots, i_n \in \{0, \ldots, m\}$. For any $a > 0$ this leads to

$$\Pr\{\xi(t_1 + a) = Z_{i_1} \cap \ldots \cap \xi(t_n + a) = Z_{i_n}\} = \Pr\{\xi(t_1) = Z_{i_1} \cap \ldots \cap \xi(t_n) = Z_{i_n}\}.$$

From $P_i(t+a) = P_i(t)$ it also follows that $P_i(t) = P_i(0) = P_i$ and in particular $\dot P_i(t) = 0$. Consequently, the process $\xi(t)$ is stationary (in steady-state) if and only if its initial distribution $P_i = P_i(0) = \Pr\{\xi(0) = Z_i\}$, $i = 0, \ldots, m$, satisfies the system

$$\rho_j P_j = \sum_{i=0,\, i \ne j}^{m} P_i\, \rho_{ij}, \quad\text{with}\quad P_j \ge 0, \quad \sum_{j=0}^{m} P_j = 1, \quad \rho_j = \sum_{i=0,\, i \ne j}^{m} \rho_{ji}, \qquad j = 0, \ldots, m. \qquad \text{(A7.127)}$$

The system of Eq. (A7.127) must be solved by canceling one (arbitrarily chosen) equation and replacing it by $\sum P_j = 1$. Every solution of Eq. (A7.127) with $P_j \ge 0$, $j = 0, \ldots, m$, is a stationary initial distribution of the Markov process involved.

A Markov process is irreducible if for every pair $i, j \in \{0, \ldots, m\}$ there exists a $t$ such that $P_{ij}(t) > 0$, i.e. if every state can be reached from every other state. It can be shown that if $P_{ij}(t_0) > 0$ for some $t_0 > 0$, then $P_{ij}(t) > 0$ for any $t > 0$. A Markov process is irreducible if and only if its embedded Markov chain is irreducible. For an irreducible Markov process, there exist quantities $P_j > 0$, $j = 0, \ldots, m$, with $P_0 + \ldots + P_m = 1$, such that, independently of the initial condition $P_i(0)$, the following holds (Markov theorem, see e.g. [A6.6 (Vol. I)])

$$\lim_{t \to \infty} P_j(t) = P_j > 0, \qquad j = 0, \ldots, m. \qquad \text{(A7.128)}$$

For any $i = 0, \ldots, m$ it then follows that

$$\lim_{t \to \infty} P_{ij}(t) = P_j > 0, \qquad j = 0, \ldots, m. \qquad \text{(A7.129)}$$

The set of values $P_0, \ldots, P_m$ from Eq. (A7.128) is the limiting distribution of the Markov process. From Eqs. (A7.74) and (A7.129) it follows that for an irreducible Markov process the limiting distribution is the only stationary distribution, i.e. the only solution of Eq. (A7.127) with $P_j > 0$, $j = 0, \ldots, m$. Further important results follow from Eqs. (A7.174) - (A7.180), in particular the initial distribution in stationary state (Eq. (A7.181)), the frequency of consecutive occurrences of a given state (Eq. (A7.182)), and the relation between the stationary values $P_j$ from Eq. (A7.127) and $p_j$ for the embedded Markov chain (Eq. (A7.74)), given by

$$P_j = \frac{p_j / \rho_j}{\sum_{k=0}^{m} p_k / \rho_k}. \qquad \text{(A7.130)}$$
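Eq. (A7.127) and the relation (A7.130) can be checked against each other numerically. The Python sketch below assumes NumPy and uses hypothetical helper names. Both systems are solved by cancelling one equation and replacing it by the normalization. The example process has random positive rates, which is an assumption chosen only to guarantee irreducibility.

```python
import numpy as np

def stationary_process(R):
    """P_j from Eq. (A7.127): rho_j P_j = sum_{i != j} P_i rho_ij, sum_j P_j = 1."""
    R = np.asarray(R, dtype=float)
    rho = R.sum(axis=1)
    M = R.T - np.diag(rho)          # row j: sum_i P_i rho_ij - rho_j P_j = 0
    M[-1, :] = 1.0                  # cancel one equation, replace it by the normalization
    b = np.zeros(len(R))
    b[-1] = 1.0
    return np.linalg.solve(M, b)

def stationary_embedded(R):
    """p_j of the embedded chain with P_ij = rho_ij / rho_i (Eqs. (A7.93), (A7.74))."""
    R = np.asarray(R, dtype=float)
    rho = R.sum(axis=1)
    Pemb = R / rho[:, None]
    M = Pemb.T - np.eye(len(R))
    M[-1, :] = 1.0
    b = np.zeros(len(R))
    b[-1] = 1.0
    return np.linalg.solve(M, b), rho

def from_embedded(R):
    """P_j = (p_j / rho_j) / sum_k (p_k / rho_k), Eq. (A7.130)."""
    p, rho = stationary_embedded(R)
    q = p / rho
    return q / q.sum()

# Hypothetical irreducible 4-state process with random positive rates
rng = np.random.default_rng(0)
R = rng.uniform(0.1, 2.0, size=(4, 4))
np.fill_diagonal(R, 0.0)
P = stationary_process(R)
```

Eq. (A7.130) weights each embedded-chain probability by the mean stay time $1/\rho_j$, which converts visit frequencies into time fractions.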

From the results given by Eqs. (A7.127) - (A7.129), the asymptotic and stationary value (steady-state value) of the point availability $PA_S$ is given by

$$\lim_{t \to \infty} PA_{Si}(t) = PA_S = \sum_{Z_j \in U} P_j, \qquad i = 0, \ldots, m. \qquad \text{(A7.131)}$$

If $K$ is a subset of $\{Z_0, \ldots, Z_m\}$, the Markov process is irreducible, and $P_0, \ldots, P_m$ are the limiting probabilities obtained from Eq. (A7.127), then

$$\Pr\Big\{\lim_{t \to \infty} \frac{\text{total sojourn time in states } Z_j \in K \text{ in } (0, t]}{t} = \sum_{Z_j \in K} P_j\Big\} = 1, \qquad \text{(A7.132)}$$

irrespective of the initial distribution $P_0(0), \ldots, P_m(0)$. From Eq. (A7.132) it follows that

$$\Pr\Big\{\lim_{t \to \infty} \frac{\text{total operating time in } (0, t]}{t} = \sum_{Z_j \in U} P_j = PA_S\Big\} = 1.$$

The average availability of the system can be expressed as (see Eq. (6.24))

$$AA_{Si}(t) = \frac{1}{t}\, E[\text{total operating time in } (0, t] \mid \xi(0) = Z_i] = \frac{1}{t} \int_0^t PA_{Si}(x)\, dx. \qquad \text{(A7.133)}$$

The above considerations lead to (for any $Z_i \in U$)

$$\lim_{t \to \infty} AA_{Si}(t) = AA_S = PA_S = \sum_{Z_j \in U} P_j. \qquad \text{(A7.134)}$$

Expressions of the form $\sum_k k\, P_k$ can be used to calculate the expected number of elements in repair, of repair crews used, etc., as well as for cost optimizations.

A7.5.4 Frequency / Duration and Reward Aspects

In some applications, it is important to consider the frequency with which failures at system level occur and the mean duration (expected value) of the system down time (or of the system operating time) in the stationary state (steady-state). Also of interest is the investigation of fault tolerant systems for which a reconfiguration can take place after a failure, allowing continuation of operation with defined loss of performance (reward). Basic considerations on these aspects are given in this section. Applications are in Section A.

A7.5.4.1 Frequency / Duration

To introduce the concept of frequency / duration, let us consider the one-item structure discussed in Appendix A7.3 as an application of the alternating renewal process. As in Appendix A7.3, assume an item (system) which alternates between
operating state, with mean time to failure (mean up time) MTTF, and repair state, with complete renewal and mean repair time (mean down time) MTTR. In the stationary state (steady-state), the frequency $f$ at which item failures or item repairs occur is given per Eq. (A7.60) as

$$f = h_{ud}(t) = h_{du}(t) = \frac{1}{MTTF + MTTR}, \qquad t \ge 0. \qquad \text{(A7.135)}$$

Furthermore, for the one-item structure the mean operating duration $u$ is

$$u = MTTF. \qquad \text{(A7.136)}$$

Consequently, considering Eq. (A7.58), the basic relation

$$PA = \frac{MTTF}{MTTF + MTTR} = f \cdot u \qquad \text{(A7.137)}$$

can be established, where $PA$ is the point availability (probability to be up) in the stationary state. Similarly, for the mean failure duration $d$ it holds that

$$d = MTTR, \qquad \text{(A7.138)}$$

and thus

$$1 - PA = \frac{MTTR}{MTTF + MTTR} = f \cdot d. \qquad \text{(A7.139)}$$

A constant failure rate $\lambda = 1/MTBF = 1/MTTF$ and repair rate $\mu = 1/MTTR$ lead to

$$PA \cdot \lambda = (1 - PA) \cdot \mu = f, \qquad \text{(A7.140)}$$

which expresses the stationary property of time-homogeneous Markov processes, as a particular case of Eq. (A7.127) with $m = 1$ (states $Z_0$ and $Z_1$).

For systems of arbitrary complexity with constant failure and repair rates, described by time-homogeneous Markov processes (Appendix A7.5.2), it can be shown that in the stationary state (steady-state) the mean stay (sojourn) time $d$ in the down states (mean failure duration) is given by, see e.g. [A7.28],

$$d = \frac{\sum_{Z_i \in \bar U} p_i/\rho_i}{\sum_{Z_i \in \bar U,\, Z_j \in U} p_i P_{ij}}. \qquad \text{(A7.141)}$$

Thereby, $p_i$ ($i = 0, 1, \ldots, m$) is the stationary probability of the embedded Markov chain (Eqs. (A7.74) or (A7.175)), $P_{ij} = \rho_{ij}/\rho_i$ are its transition probabilities (Eq. (A7.103)), $U$ is the set of states considered as up states for $f$ and $d$ calculation, and $\bar U$ the complement to the totality of states considered. For $\bar U = \{Z_i\}$, Eq. (A7.141) yields $d = T_i = P_i/h_i$, as per Eq. (A7.182). Knowing the stationary value of the point availability $PA_S$ (Eq. (A7.134)) and the mean failure duration $d$ (Eq. (A7.141)), the frequency of failure $f$ can be computed from

$$f = (1 - PA_S)/d. \qquad \text{(A7.142)}$$
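The frequency / duration relations lend themselves to a quick numerical check. The following Python sketch is only an illustration under stated assumptions: the Markov model is given as a hypothetical matrix `rho[i][j]` of transition rates, the stationary probabilities are obtained from the balance equations $P_j \rho_j = \sum_{i \ne j} P_i \rho_{ij}$ with $\sum_j P_j = 1$ (the form of Eq. (A7.127) used in Example A7.10), the embedded-chain probabilities $p_i$ are recovered by inverting Eq. (A7.130), and $d$ and $f$ then follow from Eqs. (A7.141) and (A7.142), with Eq. (A7.143) as a cross-check. The rates $\lambda = 0.01$ and $\mu = 1$ for the 1-out-of-2 active redundancy of Example A7.9 are arbitrary values.

```python
def solve_linear(a, b):
    """Solve a . x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def stationary(rho):
    """P_j from P_j * rho_j = sum_{i != j} P_i * rho_ij and sum_j P_j = 1."""
    n = len(rho)
    out = [sum(rho[i][k] for k in range(n) if k != i) for i in range(n)]  # rho_i
    a = [[-out[j] if i == j else rho[i][j] for i in range(n)] for j in range(n - 1)]
    a.append([1.0] * n)  # the normalization replaces one balance equation
    return solve_linear(a, [0.0] * (n - 1) + [1.0]), out


def frequency_duration(rho, up):
    """Return PA_S, d (Eq. A7.141), f (Eq. A7.142) and f from the rule (Eq. A7.143)."""
    P, out = stationary(rho)
    n = len(rho)
    down = [i for i in range(n) if i not in up]
    pa = sum(P[j] for j in up)                                      # Eq. (A7.131)
    norm = sum(P[k] * out[k] for k in range(n))
    p = [P[i] * out[i] / norm for i in range(n)]                    # Eq. (A7.130) inverted
    num = sum(p[i] / out[i] for i in down)
    den = sum(p[i] * rho[i][j] / out[i] for i in down for j in up)  # P_ij = rho_ij / rho_i
    d = num / den                                                   # Eq. (A7.141)
    f = (1.0 - pa) / d                                              # Eq. (A7.142)
    f_rule = sum(P[j] * rho[j][i] for j in up for i in down)        # Eq. (A7.143)
    return pa, d, f, f_rule


lam, mu = 0.01, 1.0                      # hypothetical failure and repair rates
rho = [[0.0, 2 * lam, 0.0],              # Z0: both elements operating
       [mu, 0.0, lam],                   # Z1: one element failed, in repair
       [0.0, mu, 0.0]]                   # Z2: both failed (system down)
pa, d, f, f_rule = frequency_duration(rho, up=[0, 1])
print(f"PA_S = {pa:.6f}, d = {d:.4f}, f = {f:.3e}, f (rule) = {f_rule:.3e}")
```

For the one-item structure, `frequency_duration([[0.0, lam], [mu, 0.0]], up=[0])` gives $PA = \mu/(\lambda+\mu)$, $d = 1/\mu$ and $f = \lambda\mu/(\lambda+\mu) = 1/(MTTF + MTTR)$, in line with Eqs. (A7.135) – (A7.140). For the 1-out-of-2 redundancy the single down state gives $d = 1/\mu$, as stated above for $\bar U = \{Z_i\}$.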

A simple rule to obtain the frequency of failure $f$ and the mean failure duration $d$ using only the transition rates $\rho_{ij}$ and the stationary state probabilities $P_j$ in the up states ($U$) is as follows (see Section … for an application):

1. From $P_j$ as per Eq. (A7.127), compute the frequency of failure $f$

$$f = \sum_{Z_j \in U,\, Z_i \in \bar U} P_j \rho_{ji} = \sum_{Z_j \in U} P_j \Big(\sum_{Z_i \in \bar U} \rho_{ji}\Big). \qquad \text{(A7.143)}$$

2. From $PA_S$ per Eq. (A7.131) and $f$ per Eq. (A7.143), compute the mean failure duration $d$

$$d = (1 - PA_S)/f = \Big(1 - \sum_{Z_j \in U} P_j\Big)\Big/ f. \qquad \text{(A7.144)}$$

Thereby, $U$ is the set of states considered as up states for $f$ and $d$ calculation and $\bar U$ the complement to the totality of states considered. In Eq. (A7.143), all transition rates $\rho_{ji}$ leaving state $Z_j \in U$ toward $Z_i \in \bar U$ are considered (cumulated states). Equations (A7.144) and (A7.141) give the same mean failure duration $d$, as can be seen by considering $1 - PA_S = \sum_{Z_i \in \bar U} P_i$ and $f = \sum_{Z_i \in \bar U,\, Z_j \in U} P_i \rho_{ij}$, i.e. by calculating $f$ as the frequency of operating periods. Of course, the frequency of operating periods equals the frequency of failure

$$f = \sum_{Z_i \in \bar U,\, Z_j \in U} P_i \rho_{ij} = \sum_{Z_j \in U,\, Z_i \in \bar U} P_j \rho_{ji}. \qquad \text{(A7.145)}$$

Computation of the frequency of failure and mean failure duration based on fault trees and the corresponding minimal cut sets (Sections 2.6 and 2.3.4) is often used in power systems [6.22], where $f_f$, $d_f$, and $P_f$ appear for $f$, $d$, and $1 - PA_S$. Although appealing, $\sum_{Z_i \in U} P_i\, MTTF_{Si}$, with $MTTF_{Si}$ from Eq. (A7.126) and $P_i$ from Eq. (A7.127), cannot be used to calculate the mean operating duration (Eqs. (A7.126) and (A7.127) describe two different situations, see the remark with Eq. (A7.122)).

A7.5.4.2 Reward

Complex fault tolerant systems have been conceived to be able to reconfigure themselves at the occurrence of a failure and continue operation, if necessary with reduced performance. Such a feature is important for many systems, e.g. production, information, and power systems, which should assure continuation of operation after a system failure.
Besides fail-safe aspects, investigation of such systems is based on the superposition of performance behavior (often assumed deterministic) and stochastic dependability behavior (including reliability,
maintainability, availability, and logistical support). A straightforward possibility is to assign to each state $Z_i$ of the dependability model a reward rate $r_i$ which takes care of the performance reduction in the state considered. From this, the expected (mean) instantaneous reward rate $MIR_S(t)$ can be calculated in the stationary state as

$$MIR_S = \sum_{j=0}^{m} r_j P_j, \qquad \text{(A7.146)}$$

thereby, $r_j = 0$ for down states, $0 < r_j < 1$ for partially down states, and $r_j = 1$ for up states with 100% performance. The expected (mean) accumulated reward $MAR_S(t)$ over the time interval $(0, t]$ follows for the stationary state (steady-state) as

$$MAR_S(t) = \int_0^t MIR_S(x)\,dx = MIR_S \cdot t. \qquad \text{(A7.147)}$$

Reward impulses at state transitions or other metrics are possible, for instance the expected ratio of busy channels to job requests (see e.g. [A7.15, 6.19 (1995), 6.26, 6.34]). The reward rate can be applied directly to the differential equations. For the purpose of this book, the application in Section … will be limited to Eq. (A7.146).

A7.5.5 Birth and Death Process

A birth and death process is a Markov process characterized by the property that transitions from a state $Z_j$ can only occur to state $Z_{j+1}$ or $Z_{j-1}$. In the time-homogeneous case, it is used to investigate k-out-of-n redundancies with identical elements and constant failure and repair rates during the stay (sojourn) time in any given state (not necessarily at state transitions, e.g. because of load sharing). The diagram of transition probabilities in $(t, t+\delta t]$ is given in Fig. A7.9. $\nu_j$ and $\theta_j$ are the transition rates from state $Z_j$ to $Z_{j+1}$ and from $Z_j$ to $Z_{j-1}$, respectively (transitions outside neighboring states can occur in $(t, t+\delta t]$ only with probability $o(\delta t)$). The system of differential equations describing the birth and death process given in Fig. A7.9 is

$$\dot P_j(t) = -(\nu_j + \theta_j) P_j(t) + \nu_{j-1} P_{j-1}(t) + \theta_{j+1} P_{j+1}(t), \qquad \theta_0 = \nu_{-1} = \nu_n = \theta_{n+1} = 0, \quad j = 0, \ldots, n. \qquad \text{(A7.148)}$$

Figure A7.9 Diagram of transition probabilities in $(t, t+\delta t]$ for a birth and death process with $n+1$ states ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)
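Eq. (A7.148) can also be integrated numerically. The sketch below uses a plain explicit Euler scheme (chosen here for brevity, not prescribed by the text) for the 1-out-of-2 active redundancy with one repair crew ($\nu_0 = 2\lambda$, $\nu_1 = \lambda$, $\theta_1 = \theta_2 = \mu$), starting in $Z_0$. After a long time the state probabilities settle at their stationary values, and the instantaneous reward rate of Eq. (A7.146) can then be evaluated. The rates $\lambda = 0.1$, $\mu = 1$ and the reward rates $r_0 = 1$, $r_1 = 0.5$, $r_2 = 0$ (state $Z_1$ assumed, for illustration only, to deliver half performance) are hypothetical.

```python
def birth_death_transient(nu, theta, p_start, t_end, dt):
    """Explicit Euler integration of Eq. (A7.148).

    nu[j]    : rate Z_j -> Z_{j+1}, with nu[n] = 0
    theta[j] : rate Z_j -> Z_{j-1}, with theta[0] = 0
    """
    n = len(p_start) - 1
    p = list(p_start)
    for _ in range(int(round(t_end / dt))):
        dp = []
        for j in range(n + 1):
            rate = -(nu[j] + theta[j]) * p[j]
            if j > 0:
                rate += nu[j - 1] * p[j - 1]
            if j < n:
                rate += theta[j + 1] * p[j + 1]
            dp.append(rate)
        p = [pj + dt * dpj for pj, dpj in zip(p, dp)]
    return p


lam, mu = 0.1, 1.0                 # hypothetical failure and repair rates
nu = [2 * lam, lam, 0.0]           # nu_0, nu_1, and nu_2 = 0
theta = [0.0, mu, mu]              # theta_0 = 0, theta_1, theta_2
p = birth_death_transient(nu, theta, [1.0, 0.0, 0.0], t_end=50.0, dt=1e-3)

rewards = [1.0, 0.5, 0.0]          # hypothetical r_j: full, reduced, no performance
mir = sum(r * pj for r, pj in zip(rewards, p))   # Eq. (A7.146), p is now ~ stationary
mar_100 = mir * 100.0                            # Eq. (A7.147) for t = 100
```

The columns of the generator sum to zero, so each Euler step conserves $\sum_j P_j(t) = 1$; the step size only has to be small compared with $1/(\nu_j + \theta_j)$.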

The conditions $\nu_j > 0$ ($j = 0, \ldots, n-1$) and $\theta_j > 0$ ($j = 1, \ldots, n$) are sufficient for the existence of the limiting probabilities

$$\lim_{t\to\infty} P_j(t) = P_j, \qquad \text{with } P_j > 0 \text{ and } \sum_{j=0}^{n} P_j = 1. \qquad \text{(A7.149)}$$

It can be shown (Example A7.8) that the probabilities $P_j$, $j = 0, \ldots, n$, are given by

$$P_j = \pi_j P_0 = \frac{\pi_j}{\sum_{i=0}^{n} \pi_i}, \qquad \text{with } \pi_j = \frac{\nu_0 \cdots \nu_{j-1}}{\theta_1 \cdots \theta_j} \text{ and } \pi_0 = 1. \qquad \text{(A7.150)}$$

From Eq. (A7.150) one recognizes that $P_k \nu_k = P_{k+1} \theta_{k+1}$ ($k = 0, \ldots, n-1$) holds, as a consequence of the stationary property of time-homogeneous Markov processes (see also Eqs. (A7.140) and (A7.127)).

Example A7.8
Assuming Eq. (A7.149), prove Eq. (A7.150).

Solution
Considering Eqs. (A7.148) and (A7.149), the $P_j$ are the solution of the following system of algebraic equations

$$\begin{aligned}
0 &= -\nu_0 P_0 + \theta_1 P_1, \\
0 &= \nu_{i-1} P_{i-1} - (\nu_i + \theta_i) P_i + \theta_{i+1} P_{i+1}, \qquad i = 1, \ldots, n-1, \\
0 &= \nu_{n-1} P_{n-1} - \theta_n P_n.
\end{aligned} \qquad \text{(A7.151)}$$

From the first equation it follows that $P_1 = P_0 \nu_0/\theta_1$. With this value for $P_1$, the second equation ($i = 1$) leads to

$$P_2 = \frac{\nu_1 + \theta_1}{\theta_2} P_1 - \frac{\nu_0}{\theta_2} P_0 = \Big(\frac{\nu_1 + \theta_1}{\theta_2} \cdot \frac{\nu_0}{\theta_1} - \frac{\nu_0}{\theta_2}\Big) P_0 = \frac{\nu_0 \nu_1}{\theta_1 \theta_2} P_0.$$

Recursively one obtains

$$P_j = \frac{\nu_0 \cdots \nu_{j-1}}{\theta_1 \cdots \theta_j} P_0 = \pi_j P_0, \qquad j = 0, \ldots, n, \quad \pi_0 = 1.$$

Taking into account that $P_0 + \ldots + P_n = 1$, $P_0$ follows and then Eq. (A7.150).

The values of $P_j$ given by Eq. (A7.150) can be used in Eq. (A7.134) to determine the stationary (steady-state) value of the point availability. The system mean time to failure follows from Eq. (A7.126) with $\rho_i = \nu_i + \theta_i$, $\rho_{ij} = \nu_i$ for $j = i + 1$, and $\rho_{ij} = \theta_i$ for $j = i - 1$, provided that the state $Z_{i+1}$ still belongs to $U$ (if not, $\rho_{ij} = 0$ for $j = i + 1$). Examples A7.9 and A7.10 are applications of the birth and death process.
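Eq. (A7.150) translates into a few lines of Python. In this sketch (the argument layout is chosen here), `nu` holds $\nu_0, \ldots, \nu_{n-1}$ and `theta` holds $\theta_1, \ldots, \theta_n$. With $\nu_0 = 2\lambda$, $\nu_1 = \lambda$, $\theta_1 = \theta_2 = \mu$ it reproduces the point availability of the 1-out-of-2 active redundancy treated in Example A7.9, here for the hypothetical values $\lambda = 0.01$, $\mu = 1$.

```python
def birth_death_stationary(nu, theta):
    """Steady-state probabilities of a birth and death process, Eq. (A7.150).

    nu    = [nu_0, ..., nu_{n-1}]       (rates Z_j -> Z_{j+1})
    theta = [theta_1, ..., theta_n]     (rates Z_j -> Z_{j-1})
    """
    pis = [1.0]                                          # pi_0 = 1
    for j in range(1, len(nu) + 1):
        pis.append(pis[-1] * nu[j - 1] / theta[j - 1])   # pi_j = pi_{j-1} nu_{j-1} / theta_j
    total = sum(pis)
    return [x / total for x in pis]


lam, mu = 0.01, 1.0                                      # hypothetical rates
P = birth_death_stationary([2 * lam, lam], [mu, mu])
pa_s = P[0] + P[1]                                       # U = {Z_0, Z_1}, Eq. (A7.134)
```

Building the $\pi_j$ as a running product avoids recomputing the numerator and denominator products of Eq. (A7.150) for every $j$.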

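Example A7.9
For the 1-out-of-2 active redundancy with only one repair crew (Examples A7.6 and A7.7), i.e. for $\nu_0 = 2\lambda$, $\nu_1 = \lambda$, $\theta_1 = \theta_2 = \mu$, $U = \{Z_0, Z_1\}$, and $\bar U = \{Z_2\}$, determine the asymptotic value of the point availability $PA_S$ and the system's mean time to failure $MTTF_{S0}$ and $MTTF_{S1}$.

Solution
The asymptotic and steady-state value of the point availability is given by Eqs. (A7.134) and (A7.150)

$$PA_S = P_0 + P_1 = \frac{1 + 2\lambda/\mu}{1 + 2\lambda/\mu + 2\lambda^2/\mu^2} = \frac{\mu^2 + 2\lambda\mu}{\mu^2 + 2\lambda\mu + 2\lambda^2}. \qquad \text{(A7.152)}$$

The system's mean time to failure follows from Eq. (A7.126) with $\rho_0 = \nu_0$, $\rho_{01} = \nu_0$, $\rho_1 = \nu_1 + \theta_1$, and $\rho_{10} = \theta_1$

$$MTTF_{S0} = \frac{1}{2\lambda} + MTTF_{S1}, \qquad MTTF_{S1} = \frac{1}{\lambda + \mu} + \frac{\mu}{\lambda + \mu}\, MTTF_{S0},$$

and thus

$$MTTF_{S0} = \frac{3\lambda + \mu}{2\lambda^2} \qquad \text{and} \qquad MTTF_{S1} = \frac{2\lambda + \mu}{2\lambda^2}. \qquad \text{(A7.153)}$$

Example A7.10
A computer system consists of 3 identical CPUs. Jobs arrive independently and the arrival times form a Poisson process with intensity $\lambda$. The duration of each individual job is distributed exponentially with parameter $\mu$. All jobs have the same memory requirements $D$. Determine for $\lambda = 2\mu$ the minimum size $n$ of the memory required in units of $D$, so that in the stationary case (steady-state) a new job can immediately find storage space with a probability $\gamma$ of at least 95%. When overflow occurs, jobs are queued.

Solution
The problem can be solved using the following birth and death process

[Diagram: states $Z_0, Z_1, Z_2, Z_3, Z_4, \ldots$; transitions $Z_i \to Z_{i+1}$ with probability $\lambda\,\delta t$; transitions $Z_1 \to Z_0$, $Z_2 \to Z_1$, $Z_3 \to Z_2$, $Z_4 \to Z_3, \ldots$ with probabilities $\mu\,\delta t$, $2\mu\,\delta t$, $3\mu\,\delta t$, $3\mu\,\delta t, \ldots$; probabilities of remaining in $Z_0, Z_1, Z_2, Z_3, \ldots$: $1 - \lambda\,\delta t$, $1 - (\lambda + \mu)\delta t$, $1 - (\lambda + 2\mu)\delta t$, $1 - (\lambda + 3\mu)\delta t, \ldots$]

In state $Z_i$, exactly $i$ memory units are occupied. $n$ is the smallest integer such that in the steady-state $P_0 + \ldots + P_{n-1} = \gamma \ge 0.95$ (if the assumption were made that jobs are lost if overflow occurs, then the process would stop at state $Z_n$). For steady-state, Eq. (A7.127) yields

$$\begin{aligned}
0 &= -\lambda P_0 + \mu P_1, \\
0 &= \lambda P_0 - (\lambda + \mu) P_1 + 2\mu P_2, \\
0 &= \lambda P_1 - (\lambda + 2\mu) P_2 + 3\mu P_3, \\
0 &= \lambda P_{i-1} - (\lambda + 3\mu) P_i + 3\mu P_{i+1}, \qquad i > 2.
\end{aligned} \qquad \text{(A7.154)}$$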

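The solution leads to

$$P_1 = \frac{\lambda}{\mu} P_0 \qquad \text{and} \qquad P_i = \frac{(\lambda/\mu)^i}{2 \cdot 3^{i-2}} P_0 = \frac{9}{2}\Big(\frac{\lambda}{3\mu}\Big)^i P_0, \quad i \ge 2.$$

Assuming $\lim_{n\to\infty} \sum_{i=0}^{n} P_i = 1$ and considering $\lambda/(3\mu) < 1$, it follows that

$$P_0\Big[1 + \frac{\lambda}{\mu} + \frac{9}{2}\sum_{i=2}^{\infty}\Big(\frac{\lambda}{3\mu}\Big)^i\Big] = P_0\Big[1 + \frac{\lambda}{\mu} + \frac{3(\lambda/\mu)^2}{2(3 - \lambda/\mu)}\Big] = 1,$$

from which

$$P_0 = \frac{2(3 - \lambda/\mu)}{6 + 4\lambda/\mu + (\lambda/\mu)^2}.$$

The size of the memory $n$ can now be determined from

$$\frac{2(3 - \lambda/\mu)}{6 + 4\lambda/\mu + (\lambda/\mu)^2}\Big[1 + \frac{\lambda}{\mu} + \frac{9}{2}\sum_{i=2}^{n-1}\Big(\frac{\lambda}{3\mu}\Big)^i\Big] \ge \gamma.$$

For $\lambda/\mu = 2$ and $\gamma = 0.95$, the smallest $n$ satisfying the above equation is $n = 9$ ($P_0 = 1/9$, $P_1 = 2/9$, $P_i = 2^{i-1}/3^i$ for $i \ge 2$).

As shown by Examples A7.9 and A7.10, reliability applications of birth and death processes identify $\nu_i$ as failure rates and $\theta_i$ as repair rates. In this case, $\nu_j \ll \theta_{j+1}$, $j = 0, \ldots, n-1$, with $\nu_j$ and $\theta_{j+1}$ as in Fig. A7.9. Assuming

$$\frac{\nu_j}{\theta_{j+1}} \le r, \qquad 0 < r < 1, \quad j = 0, \ldots, n-1, \qquad \text{(A7.155)}$$

the following relationships for the steady-state probability $P_j$ can be obtained (Example A7.11)

$$P_j \ge \frac{1 - r}{r(1 - r^{n-j})}\sum_{i=j+1}^{n} P_i, \qquad 0 < r < 1, \quad j = 0, \ldots, n-1, \quad n > j. \qquad \text{(A7.156)}$$

For $r \le 1/2$ it follows that

$$P_j \ge \sum_{i=j+1}^{n} P_i, \qquad j = 0, \ldots, n-1. \qquad \text{(A7.157)}$$

Equation (A7.157) states that for $2\nu_j \le \theta_{j+1}$, the steady-state probability in a state $Z_j$ of a birth and death process described by Fig. A7.9 is greater than or equal to the sum of the steady-state probabilities in all states following $Z_j$, $j = 0, \ldots, n-1$ [2.50 (1992)]. This relationship is useful in developing approximate expressions for system availability.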
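The memory sizing of Example A7.10 can be reproduced numerically. The sketch below builds the weights $\pi_i$ of Eq. (A7.150) with $\nu_i = \lambda$ and $\theta_i = \min(i, 3)\,\mu$, truncates the infinite chain at a large number of states (an approximation introduced here; for $\lambda/\mu = 2$ the neglected tail is of order $(2/3)^{200}$), and returns the smallest $n$ with $P_0 + \ldots + P_{n-1} \ge \gamma$. The function name and its parameters are illustrative.

```python
def memory_size(load, gamma, servers=3, n_states=200):
    """Smallest n with P_0 + ... + P_{n-1} >= gamma for the chain of Example A7.10.

    load = lambda / mu; the departure rate in Z_i is min(i, servers) * mu.
    The infinite chain is truncated at n_states states (approximation made here).
    """
    pis = [1.0]
    for i in range(1, n_states):
        pis.append(pis[-1] * load / min(i, servers))   # Eq. (A7.150) with nu_i = lambda
    total = sum(pis)
    p = [x / total for x in pis]
    cumulative = 0.0
    for n in range(1, n_states):
        cumulative += p[n - 1]                          # P_0 + ... + P_{n-1}
        if cumulative >= gamma:
            return n, p
    raise ValueError("gamma not reached within the truncated chain")


n, p = memory_size(load=2.0, gamma=0.95)
print(n, p[0], p[1])    # compare with n = 9, P_0 = 1/9, P_1 = 2/9 in the text
```

For $\lambda/\mu = 2$ the cumulative probability is $1 - 1.5\,(2/3)^n$ for $n \ge 2$, which is about 0.9415 for $n = 8$ and 0.961 for $n = 9$; this is why $n = 9$ is the smallest admissible memory size.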

Example A7.11
Assuming Eq. (A7.155), prove Eqs. (A7.156) and (A7.157).

Solution
Using Eq. (A7.150),

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} = \frac{\sum_{i=j+1}^{n} \pi_i}{\pi_j} = \frac{\nu_j}{\theta_{j+1}} + \frac{\nu_j \nu_{j+1}}{\theta_{j+1}\theta_{j+2}} + \cdots + \frac{\nu_j \cdots \nu_{n-1}}{\theta_{j+1} \cdots \theta_n}.$$

Setting

$$\frac{\nu_i}{\theta_{i+1}} \le r, \qquad 0 < r < 1, \quad i = j, j+1, \ldots, n-1,$$

it follows that

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} \le r + r^2 + \cdots + r^{n-j} = \frac{r(1 - r^{n-j})}{1 - r},$$

and thus Eq. (A7.156). Furthermore, for $r \le 1/2$ it follows that

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} \le 1 - (1/2)^{n-j} < 1,$$

and hence Eq. (A7.157).

A7.6 Semi-Markov Processes with Finitely Many States

The description of Markov processes given in Appendix A7.5.2 allows a straightforward generalization to semi-Markov processes. In a semi-Markov process, the sequence of consecutively occurring states forms an embedded time-homogeneous Markov chain, just as with Markov processes. The stay (sojourn) time in a given state $Z_i$ is a positive random variable $\tau_{ij}$ whose distribution depends on $Z_i$ and on the following state $Z_j$, but in contrast to Markov processes it is arbitrarily (not necessarily exponentially) distributed.
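The construction just described, an embedded Markov chain for the sequence of states plus a sojourn time whose distribution depends on the present and the next state, translates directly into a Monte Carlo sketch. The sampler interface `sample_tau(i, j, rng)` and the numbers below are hypothetical. The example is a two-state alternating renewal process with uniformly distributed up times on $(0, 2)$ (mean 1) and a constant down time 0.25, for which the fraction of time spent in $Z_0$ should approach $1/(1 + 0.25) = 0.8$.

```python
import random


def simulate_semi_markov(P, sample_tau, start, n_jumps, seed=1):
    """Fraction of time spent in each state along one simulated path.

    P[i][j]               : embedded-chain transition probabilities (P[i][i] = 0)
    sample_tau(i, j, rng) : draws a sojourn time in Z_i given that the next state is Z_j
    """
    rng = random.Random(seed)
    m = len(P)
    time_in = [0.0] * m
    state = start
    for _ in range(n_jumps):
        nxt = rng.choices(range(m), weights=P[state])[0]   # next state from the embedded chain
        time_in[state] += sample_tau(state, nxt, rng)      # sojourn time depends on (Z_i, Z_j)
        state = nxt
    total = sum(time_in)
    return [t / total for t in time_in]


def alternating_tau(i, j, rng):
    # hypothetical: up time ~ uniform(0, 2) (mean 1), down time constant 0.25
    return rng.uniform(0.0, 2.0) if i == 0 else 0.25


fractions = simulate_semi_markov([[0.0, 1.0], [1.0, 0.0]], alternating_tau,
                                 start=0, n_jumps=40_000)
```

Drawing the next state first and then the sojourn time conditional on it is equivalent to sampling from $Q_{ij}(x) = P_{ij} F_{ij}(x)$.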

To define semi-Markov processes, let $\xi_0, \xi_1, \ldots$ be the sequence of consecutively occurring states, i.e. a sequence of random variables taking values in $\{Z_0, \ldots, Z_m\}$, and $\eta_0, \eta_1, \ldots$ the stay (sojourn) times between consecutive states, i.e. a sequence of positive random variables. A stochastic process $\xi(t)$ with the state space $\{Z_0, \ldots, Z_m\}$ is a semi-Markov process if for $n = 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$, and arbitrary positive numbers $x_0, \ldots, x_{n-1}$

$$\Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid (\xi_n = Z_i \cap \eta_{n-1} = x_{n-1} \cap \ldots \cap \xi_1 = Z_{i_1} \cap \eta_0 = x_0 \cap \xi_0 = Z_{i_0})\} = \Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid \xi_n = Z_i\} = Q_{ij}(x). \qquad \text{(A7.158)}$$

The functions $Q_{ij}(x)$ in Eq. (A7.158), defined only for $j \ne i$, are the semi-Markov transition probabilities. Setting

$$P_{ij} = Q_{ij}(\infty) \qquad \text{(A7.159)}$$

and, for $P_{ij} > 0$,

$$F_{ij}(x) = \frac{Q_{ij}(x)}{P_{ij}} \qquad \text{(A7.160)}$$

leads to

$$Q_{ij}(x) = P_{ij}\, F_{ij}(x), \qquad \text{(A7.161)}$$

with (Example A7.2)

$$P_{ij} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\}, \qquad P_{ii} = 0, \qquad \text{(A7.162)}$$

and

$$F_{ij}(x) = \Pr\{\eta_n \le x \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_j)\}. \qquad \text{(A7.163)}$$

Since $P_{ii} = 0$ is mandatory for a semi-Markov process, $F_{ii}(x)$ can be arbitrary. From Eq. (A7.158), the consecutive jump points at which the process enters $Z_i$ are regeneration points. This holds for any $i \in \{0, \ldots, m\}$. Thus, all states of a semi-Markov process are regeneration states. The renewal density of the embedded renewal process of consecutive jumps in $Z_i$ ($i$-renewals) will be denoted as $h_i(t)$. The interpretations of the quantities $Q_{ij}(x)$ given by Eqs. (A7.99) – (A7.101) are useful for practical applications (see for instance Eqs. (A7.183) – (A7.186)). The initial distribution, i.e. the distribution of the vector $(\xi(0), \xi_1, \eta_0)$, is given for the general case by

$$A_{ij}(x) = \Pr\{\xi(0) = Z_i \cap \xi_1 = Z_j \cap \text{residual sojourn time } (\eta_0) \text{ in } Z_i \le x\} = P_i(0)\, P_{ij}\, \tilde F_{ij}(x), \qquad \text{(A7.164)}$$

with $P_i(0) = \Pr\{\xi(0) = Z_i\}$, $P_{ij}$ according to Eq. (A7.162), and $\tilde F_{ij}(x) = \Pr\{\text{residual sojourn time in } Z_i \le x \mid (\xi(0) = Z_i \cap \xi_1 = Z_j)\}$. $\xi(0)$ is used here for clarity instead of $\xi_0$. The semi-Markov process is memoryless only at the transition points from one state to the other. To have the time $t = 0$ as a regeneration point, the initial condition $\xi(0) = Z_i$, sufficient for time-homogeneous Markov processes, must be reinforced for semi-Markov processes by "$Z_i$ is entered at $t = 0$". The sequence $\xi_0, \xi_1, \ldots$ forms a Markov chain, embedded in the semi-Markov process, with transition probabilities $P_{ij}$ as per Eq. (A7.162) and initial probabilities $P_i(0)$, $i = 0, \ldots, m$. $F_{ij}(x)$ is the conditional distribution function of the stay (sojourn) time in $Z_i$ with consequent jump in $Z_j$ (next state to be visited). A semi-Markov process is a Markov process if and only if $F_{ij}(x) = 1 - e^{-\rho_i x}$, for $i, j \in \{0, \ldots, m\}$. An example of a two-state semi-Markov process is the alternating renewal process given in Appendix A7.3 ($Z_0$ = up, $Z_1$ = down, $P_{01} = P_{10} = 1$, $F_{01}(x) = F(x)$, $F_{10}(x) = G(x)$, $\tilde F_{01}(x) = F_A(x)$, $\tilde F_{10}(x) = G_A(x)$, $P_0(0) = p$, $P_1(0) = 1 - p$).

In many applications, the quantities $Q_{ij}(x)$, or $P_{ij}$ and $F_{ij}(x)$, can be calculated using Eqs. (A7.99) – (A7.101), as shown in Appendix A7.7 and Sections …. For the unconditional stay (sojourn) time in $Z_i$, the distribution function is

$$Q_i(x) = \Pr\{\eta_n \le x \mid \xi_n = Z_i\} = \sum_{j=0,\, j \ne i}^{m} P_{ij} F_{ij}(x) = \sum_{j=0,\, j \ne i}^{m} Q_{ij}(x), \qquad \text{(A7.165)}$$

and the mean

$$T_i = \int_0^\infty (1 - Q_i(x))\,dx. \qquad \text{(A7.166)}$$

In the following it will be assumed that

$$q_{ij}(x) = \frac{dQ_{ij}(x)}{dx} \qquad \text{(A7.167)}$$

exists for all $i, j \in \{0, \ldots, m\}$. Consider first the case in which the process enters the state $Z_i$ at $t = 0$, i.e. that $P_i(0) = 1$ and

$$\tilde F_{ij}(x) = F_{ij}(x). \qquad \text{(A7.168)}$$

The transition probabilities

$$P_{ij}(t) = \Pr\{\xi(t) = Z_j \mid Z_i \text{ is entered at } t = 0\} \qquad \text{(A7.169)}$$

can be obtained by generalizing Eq. (A7.120), however considering that the condition "$Z_i$ is entered at $t = 0$" is mandatory for semi-Markov processes,

$$P_{ij}(t) = \delta_{ij}\,(1 - Q_i(t)) + \sum_{k=0,\, k \ne i}^{m} \int_0^t q_{ik}(x)\, P_{kj}(t - x)\,dx, \qquad \text{(A7.170)}$$

with $\delta_{ij}$ and $Q_i(t)$ per Eqs. (A7.85) and (A7.165). The state probabilities follow as

$$P_j(t) = \Pr\{\xi(t) = Z_j\} = \sum_{i=0}^{m} \Pr\{Z_i \text{ is entered at } t = 0\}\, P_{ij}(t), \qquad \text{(A7.171)}$$

with $P_j(t) \ge 0$ and $P_0(t) + \ldots + P_m(t) = 1$. If the state space is divided into the complementary sets $U$ for the up states and $\bar U$ for the down states, as in Eq. (A7.107), the point availability follows from Eq. (A7.112)

$$PA_{Si}(t) = \Pr\{\xi(t) \in U \mid Z_i \text{ is entered at } t = 0\} = \sum_{Z_j \in U} P_{ij}(t), \qquad i = 0, \ldots, m,$$

with $P_{ij}(t)$ as in Eq. (A7.170). The probability that the first transition from a state in $U$ to a state in $\bar U$ occurs after time $t$, i.e. the reliability function, can be obtained by generalizing the system of integral equations (A7.122)

$$R_{Si}(t) = 1 - Q_i(t) + \sum_{Z_j \in U,\, j \ne i} \int_0^t q_{ij}(x)\, R_{Sj}(t - x)\,dx, \qquad Z_i \in U, \qquad \text{(A7.172)}$$

with $Q_i(t)$ as in Eq. (A7.165). The mean of the stay (sojourn) time in $U$, i.e. the system mean time to failure, follows from Eq. (A7.172) as the solution of the following system of algebraic equations (with $T_i$ as per Eq. (A7.166))

$$MTTF_{Si} = T_i + \sum_{Z_j \in U,\, j \ne i} P_{ij}\, MTTF_{Sj}, \qquad Z_i \in U. \qquad \text{(A7.173)}$$

Consider now the case of a stationary semi-Markov process. Under the assumption that the embedded Markov chain is irreducible (each state can be reached from every other state with probability $> 0$), the semi-Markov process is stationary if and only if the initial distribution (Eq. (A7.164)) is given by [A7.22, A7.23, A7.28]

$$A_{ij}(x) = \frac{p_i P_{ij}}{\sum_{k=0}^{m} p_k T_k} \int_0^x (1 - F_{ij}(y))\,dy. \qquad \text{(A7.174)}$$

In Eq. (A7.174), $P_{ij}$ are the transition probabilities (Eq. (A7.162)) and $p_j$ the stationary distribution of the embedded Markov chain; the $p_j$ are the unique solutions of

$$p_j = \sum_{i=0}^{m} p_i P_{ij}, \qquad \text{with } P_{ii} = 0,\ P_{ij} = Q_{ij}(\infty),\ p_j > 0,\ \sum_{j=0}^{m} p_j = 1, \quad j = 0, \ldots, m. \qquad \text{(A7.175)}$$

The system given by Eq. (A7.175) must be solved by dropping one (arbitrarily chosen) equation and replacing it by $\sum_j p_j = 1$. For the stationary semi-Markov process, the state probabilities are independent of time and given by

$$P_i(t) = P_i = \frac{T_i}{T_{ii}} = \frac{p_i T_i}{\sum_{k=0}^{m} p_k T_k}, \qquad t \ge 0, \quad i = 0, \ldots, m, \qquad \text{(A7.176)}$$

with $T_i$ per Eq. (A7.166) and $p_i$ from Eq. (A7.175). $T_{ii}$ is the mean of the time interval between two consecutive occurrences of $Z_i$ (in steady-state). These time points form a stationary renewal process with renewal density

$$h_i(t) = h_i = \frac{1}{T_{ii}} = \frac{p_i}{\sum_{k=0}^{m} p_k T_k}, \qquad t \ge 0, \quad i = 0, \ldots, m. \qquad \text{(A7.177)}$$

$h_i$ is the frequency of successive occurrences of state $Z_i$. It can be shown that Eq. (A7.177) is equivalent to Eq. (A7.174). The stationary (steady-state) value of the point availability $PA_S$ and of the average availability $AA_S$ follows from Eq. (A7.176)

$$PA_S = AA_S = \sum_{Z_j \in U} P_j = \sum_{Z_j \in U} \frac{p_j T_j}{\sum_{k=0}^{m} p_k T_k}. \qquad \text{(A7.178)}$$

Under the assumptions made above, i.e. continuous sojourn times with finite means and an irreducible embedded Markov chain, the following applies for $i = 0, \ldots, m$ regardless of the initial distribution at $t = 0$

$$\lim_{t\to\infty} \Pr\{\xi(t) = Z_i \cap \text{next transition in } Z_j \cap \text{residual sojourn time in } Z_i \le x\} = \frac{p_i P_{ij}}{\sum_{k=0}^{m} p_k T_k} \int_0^x (1 - F_{ij}(y))\,dy = A_{ij}(x), \qquad \text{(A7.179)}$$

and thus

$$\lim_{t\to\infty} \Pr\{\xi(t) = Z_i\} = P_i = \frac{T_i}{T_{ii}} = \frac{p_i T_i}{\sum_{k=0}^{m} p_k T_k} \qquad \text{and} \qquad \lim_{t\to\infty} PA_S(t) = PA_S = \sum_{Z_j \in U} P_j. \qquad \text{(A7.180)}$$

For the alternating renewal process (Appendix A7.3 with $Z_0$ = up, $Z_1$ = down, $T_0 = MTTF$, and $T_1 = MTTR$) it holds that $p_0 = p_1 = 1/2$ (embedded Markov chain) and $T_{00} = T_{11} = T_0 + T_1$. Eq. (A7.178) (or (A7.180)) leads to $PA_S = P_0 = T_0/T_{00} = T_0/(T_0 + T_1) = p_0 T_0/(p_0 T_0 + p_1 T_1)$. This example shows the basic difference between $p_i$ as stationary distribution of the embedded Markov chain and the limiting state probability $P_i$ in state $Z_i$ of the original process in continuous time (compare also Eq. (A7.175) with Eq. (A7.127) by considering $P_{ij} = \rho_{ij}/\rho_i$ as per Eq. (A7.103)).
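Eqs. (A7.173) and (A7.175) – (A7.178) need only the embedded transition probabilities $P_{ij}$ and the mean sojourn times $T_i$. The sketch below (function names are illustrative) solves the embedded chain by dropping one balance equation in favour of $\sum_j p_j = 1$, as described above, and then evaluates $P_i$, $h_i$, $PA_S$ and $MTTF_{Si}$. It is applied to the alternating renewal process (hypothetical MTTF = 100, MTTR = 2) and to the Markov special case $T_i = 1/\rho_i$, $P_{ij} = \rho_{ij}/\rho_i$ of the 1-out-of-2 active redundancy with hypothetical rates $\lambda = 0.01$, $\mu = 1$, where the results should agree with Eqs. (A7.152) and (A7.153).

```python
def solve_linear(a, b):
    """Solve a . x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def embedded_stationary(P):
    """Eq. (A7.175): p_j = sum_i p_i P_ij, with one equation replaced by sum_j p_j = 1."""
    m = len(P)
    a = [[(1.0 if i == j else 0.0) - P[i][j] for i in range(m)] for j in range(m - 1)]
    a.append([1.0] * m)
    return solve_linear(a, [0.0] * (m - 1) + [1.0])


def steady_state(P, T, up):
    """Eqs. (A7.176) - (A7.178) from the embedded chain P and mean sojourn times T."""
    p = embedded_stationary(P)
    norm = sum(pk * tk for pk, tk in zip(p, T))          # sum_k p_k T_k
    probs = [pi * ti / norm for pi, ti in zip(p, T)]     # P_i, Eq. (A7.176)
    h = [pi / norm for pi in p]                          # h_i = 1 / T_ii, Eq. (A7.177)
    pa = sum(probs[j] for j in up)                       # PA_S = AA_S, Eq. (A7.178)
    return p, probs, h, pa


def mttf(P, T, up):
    """Eq. (A7.173): MTTF_Si = T_i + sum_{Z_j in U, j != i} P_ij MTTF_Sj."""
    a = [[(1.0 if i == j else 0.0) - P[i][j] for j in up] for i in up]
    return solve_linear(a, [T[i] for i in up])


# Alternating renewal process (hypothetical MTTF = 100, MTTR = 2)
p_alt, probs_alt, h_alt, pa_alt = steady_state([[0.0, 1.0], [1.0, 0.0]], [100.0, 2.0], up=[0])

# Markov special case: 1-out-of-2 active redundancy, T_i = 1/rho_i, P_ij = rho_ij/rho_i
lam, mu = 0.01, 1.0
P3 = [[0.0, 1.0, 0.0],
      [mu / (lam + mu), 0.0, lam / (lam + mu)],
      [0.0, 1.0, 0.0]]
T3 = [1.0 / (2 * lam), 1.0 / (lam + mu), 1.0 / mu]
_, probs3, _, pa3 = steady_state(P3, T3, up=[0, 1])
mttf_s0, mttf_s1 = mttf(P3, T3, up=[0, 1])
```

The alternating renewal case shows the point made in the text: the embedded chain gives $p_0 = p_1 = 1/2$, while the time-weighted probability $P_0 = 100/102$ is what enters $PA_S$.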

For time-homogeneous Markov processes (Appendix A7.5), it holds that $T_i = 1/\rho_i$ (Eqs. (A7.166), (A7.165), and (A7.103)), and Eqs. (A7.174) and (A7.177) yield

$$A_{ij}(x) = P_i P_{ij}\,(1 - e^{-\rho_i x}) \qquad \text{(A7.181)}$$

and

$$h_i(t) = h_i = P_i \rho_i, \qquad i = 0, \ldots, m, \qquad \text{(A7.182)}$$

respectively. Eq. (A7.181) also follows directly from Eq. (A7.164) by considering $\tilde F_{ij}(x) = F_{ij}(x) = 1 - e^{-\rho_i x}$. Eq. (A7.181) expresses the stationary property of time-homogeneous Markov processes (see also Eqs. (A7.140) and (A7.127)). Furthermore, Eq. (A7.161) holds with $P_{ij} = \rho_{ij}/\rho_i$, and Eq. (A7.176) reduces to Eq. (A7.130).

A7.7 Semi-regenerative Processes

As pointed out in Appendix A7.5.2, the time behavior of a repairable system can be described by a time-homogeneous Markov process only if failure-free operating times and repair times of all elements are exponentially distributed (constant failure and repair rates during the stay (sojourn) time in any given state, with possible change at state transitions, e.g. because of load sharing). Except for the special case of the Erlang distribution (Section 6.3.3), non-exponentially distributed repair times and/or failure-free operating times lead in a few cases to semi-Markov processes, and in general to processes with only a few regeneration states or to nonregenerative processes. To make sure that the time behavior of a system can be described by a semi-Markov process, there must be no "running" failure-free operating time or repair time at any state transition (state change) which is not exponentially distributed; otherwise the sojourn time to the next state transition would depend on how long these non-exponentially distributed times have already run. Example A7.12 shows the case of a process with states $Z_0$, $Z_1$, $Z_2$ in which only states $Z_0$ and $Z_1$ are regeneration states. $Z_0$ and $Z_1$ form a semi-Markov process embedded in the original process, on which the investigation can be based. Processes with an embedded semi-Markov process are called semi-regenerative processes.
Their investigation can become time-consuming and generally has to be performed on a case-by-case basis, see e.g. Figs. A7.10 and A7.11 and Section ….

Example A7.12
Consider a 1-out-of-2 warm redundancy as in Fig. A7.4a with constant failure rates $\lambda$ in the operating and $\lambda_r$ in the reserve state, one repair crew, and arbitrarily distributed repair time with distribution function $G(x)$ and density $g(x)$. Determine the transition probabilities of the embedded semi-Markov process.

Figure A7.10 a) Possible time schedule for a 1-out-of-2 warm redundancy with constant failure rates ($\lambda$ and $\lambda_r$), arbitrary repair rate (density $g(x)$), and only one repair crew (repair times greatly exaggerated; operating, stand-by, and repair periods are shown, with renewal points for $Z_0$ and $Z_1$, respectively); b) State transition diagram for the embedded semi-Markov process with regeneration states $Z_0$ and $Z_1$, showing $Q_{01}$, $Q_{10}$, $Q_{121}$, and $Q'_{12}$ with $Q_{ij} = Q_{ij}(x)$ (the model holds also for a k-out-of-n warm redundancy with $n - k = 1$)

Solution
As Fig. A7.10a shows, only states $Z_0$ and $Z_1$ are regeneration states. $Z_2$ is not a regeneration state because at the transition points into $Z_2$ a repair with arbitrary repair rate is running. Thus, the process involved is not a semi-Markov process. However, states $Z_0$ and $Z_1$ form an embedded semi-Markov process on which investigations can be based. The transition probabilities of the embedded semi-Markov process are obtained (using Eq. (A7.99) and Fig. A7.10) as

$$\begin{aligned}
Q_{01}(x) &= \Pr\{\tau_{01} \le x\} = 1 - e^{-(\lambda + \lambda_r)x}, \\
Q_{10}(x) &= \Pr\{\tau_{10} \le x \cap \tau_{12} > \tau_{10}\} = \int_0^x g(y)\, e^{-\lambda y}\,dy = G(x)\, e^{-\lambda x} + \int_0^x \lambda e^{-\lambda y}\, G(y)\,dy, \\
Q_{121}(x) &= \Pr\{\tau_{121} \le x\} = \int_0^x g(y)\,(1 - e^{-\lambda y})\,dy.
\end{aligned} \qquad \text{(A7.183)}$$

$Q_{121}(x)$ is used to calculate the point availability (Section 6.4.2). It accounts for the process returning from state $Z_2$ to state $Z_1$ (Fig. A7.10a) and for the fact that $Z_2$ is not a regeneration state (transition $Z_1 \to Z_2 \to Z_1$). $Q'_{12}(x)$ as given in Fig. A7.10 is not a semi-Markov transition probability ($Z_2$ is not a regeneration state). However, $Q'_{12}(x)$ expressed as (see Fig. A7.10a)

$$Q'_{12}(x) = \int_0^x \lambda e^{-\lambda y}\,(1 - G(y))\,dy = 1 - e^{-\lambda x} - \int_0^x \lambda e^{-\lambda y}\, G(y)\,dy \qquad \text{(A7.184)}$$

yields an equivalent $Q_1(x) = Q_{10}(x) + Q'_{12}(x)$, useful for calculation purposes (see Section 6.4.2). A discussion of the steady-state point availability is given as a remark to Eqs. (A7.187) and (A7.188). Replacing $\lambda$ with $k\lambda$ in Eqs. (A7.183) and (A7.184) leads to a k-out-of-n warm redundancy with $n - k = 1$, constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair times with density $g(x)$, only one repair crew, and no further failure at system down.

As a second example, Fig. A7.11 gives a possible time schedule for a k-out-of-n warm redundancy with $n - k = 2$, constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew, and no further failure at system down.

Figure A7.11 a) Possible time schedule for a k-out-of-n warm redundancy with $n - k = 2$, constant failure rates ($\lambda$ and $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew, and no further failure at system down (repair times greatly exaggerated, operating and reserve elements together; renewal points for $Z_0$, $Z_1$, and $Z_{2'}$, respectively); b) State transition diagram for the embedded semi-Markov process with regeneration states $Z_0$, $Z_1$, and $Z_{2'}$, also the state transition diagram of the involved semi-regenerative process

States $Z_0$, $Z_1$, and $Z_{2'}$ are regeneration states; $Z_2$ and $Z_3$ are not regeneration states. The corresponding transition probabilities of the embedded semi-Markov process are

$$\begin{aligned}
Q_{01}(x) &= \Pr\{\tau_{01} \le x\} = 1 - e^{-(k\lambda + 2\lambda_r)x}, \\
Q_{10}(x) &= \Pr\{\tau_{10} \le x \cap \tau_{12} > \tau_{10}\} = \int_0^x g(y)\, e^{-(k\lambda + \lambda_r)y}\,dy, \\
Q_{2'1}(x) &= \Pr\{\tau_{2'1} \le x \cap \tau_{2'3} > \tau_{2'1}\} = \int_0^x g(y)\, e^{-k\lambda y}\,dy, \\
Q_{121}(x) &= \Pr\{\tau_{121} \le x\} = \int_0^x g(y) \int_0^y (k\lambda + \lambda_r)\, e^{-(k\lambda + \lambda_r)z}\, e^{-k\lambda(y - z)}\,dz\,dy, \\
Q_{1232'}(x) &= \Pr\{\tau_{1232'} \le x\} = \int_0^x g(y) \int_0^y (k\lambda + \lambda_r)\, e^{-(k\lambda + \lambda_r)z}\,(1 - e^{-k\lambda(y - z)})\,dz\,dy, \\
Q_{2'32'}(x) &= \Pr\{\tau_{2'32'} \le x\} = \int_0^x g(y)\,(1 - e^{-k\lambda y})\,dy.
\end{aligned} \qquad \text{(A7.185)}$$

$Q_{121}(x)$, $Q_{1232'}(x)$, and $Q_{2'32'}(x)$ are used to calculate the point availability. They account for the transitions through the non-regeneration states $Z_2$ and $Z_3$. Similarly as for $Q'_{12}(x)$ in Example A7.12, the quantities

$$\begin{aligned}
Q_{123}(x) &= \int_0^x (1 - G(y)) \int_0^y (k\lambda + \lambda_r)\, e^{-(k\lambda + \lambda_r)z}\, k\lambda\, e^{-k\lambda(y - z)}\,dz\,dy, \\
Q_{2'3}(x) &= \int_0^x k\lambda\, e^{-k\lambda y}\,(1 - G(y))\,dy,
\end{aligned} \qquad \text{(A7.186)}$$

are not semi-Markov transition probabilities; however, they are useful for calculation purposes (to simplify, they are not shown in Fig. A7.11b). Results for $g(x) = \mu e^{-\mu x}$, i.e. for constant repair rate $\mu$, are given in Table 6.8 ($n - k = 2$).
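As a numerical sanity check of Eqs. (A7.183) – (A7.185), the sketch below evaluates the transition probabilities by the trapezoidal rule for an assumed Erlang-2 repair-time density $g(y) = 4y\,e^{-2y}$ (mean repair time 1) and hypothetical values $\lambda = 0.5$, $\lambda_r = 0.1$, $k = 2$. It checks that the two forms of $Q_{10}(x)$ in Eq. (A7.183) agree, that $Q_{121}(\infty) = Q'_{12}(\infty) = 1 - Q_{10}(\infty)$ (both being the probability that a failure occurs before the repair ends), and that the transition probabilities leaving $Z_1$ and $Z_{2'}$ in Eq. (A7.185) sum to one. The inner integrals over $z$ are evaluated in closed form (derived here, valid for $\lambda_r > 0$); $x = 40$ stands in for $x \to \infty$.

```python
import math

RATE = 2.0                      # hypothetical Erlang-2 repair time (mean 2 / RATE = 1)
LAM, LAM_R, K = 0.5, 0.1, 2     # hypothetical lambda, lambda_r and k


def g(y):
    """Repair-time density g(y) (assumed Erlang-2)."""
    return RATE * RATE * y * math.exp(-RATE * y)


def G(y):
    """Repair-time distribution function G(y)."""
    return 1.0 - math.exp(-RATE * y) * (1.0 + RATE * y)


def integrate(f, x, h=1e-3):
    """Composite trapezoidal rule for the integral of f over [0, x]."""
    n = max(1, int(round(x / h)))
    step = x / n
    return step * (0.5 * (f(0.0) + f(x)) + sum(f(k * step) for k in range(1, n)))


# Example A7.12, Eqs. (A7.183) and (A7.184)
def q10(x):
    return integrate(lambda y: g(y) * math.exp(-LAM * y), x)


def q10_by_parts(x):
    return G(x) * math.exp(-LAM * x) + integrate(lambda y: LAM * math.exp(-LAM * y) * G(y), x)


def q121(x):
    return integrate(lambda y: g(y) * (1.0 - math.exp(-LAM * y)), x)


def q12_prime(x):
    return integrate(lambda y: LAM * math.exp(-LAM * y) * (1.0 - G(y)), x)


# Fig. A7.11, Eq. (A7.185): k-out-of-n warm redundancy with n - k = 2
A = K * LAM + LAM_R             # failure rate in Z_1 (k operating, one reserve)
B = K * LAM                     # failure rate in Z_2 and Z_2' (no reserve left)


def no_second_failure(y):
    """Integral over (0, y] of A e^{-A z} e^{-B (y - z)} dz, closed form (A != B)."""
    return A * (math.exp(-B * y) - math.exp(-A * y)) / (A - B)


def second_failure(y):
    """Integral over (0, y] of A e^{-A z} (1 - e^{-B (y - z)}) dz."""
    return (1.0 - math.exp(-A * y)) - no_second_failure(y)


def leaving_z1(x):
    """Q_10(x) + Q_121(x) + Q_1232'(x) from Eq. (A7.185)."""
    return (integrate(lambda y: g(y) * math.exp(-A * y), x)
            + integrate(lambda y: g(y) * no_second_failure(y), x)
            + integrate(lambda y: g(y) * second_failure(y), x))


def leaving_z2p(x):
    """Q_2'1(x) + Q_2'32'(x) from Eq. (A7.185)."""
    return (integrate(lambda y: g(y) * math.exp(-B * y), x)
            + integrate(lambda y: g(y) * (1.0 - math.exp(-B * y)), x))


BIG = 40.0                      # numerical stand-in for x -> infinity
```

For this Erlang-2 density, $Q_{10}(\infty)$ is the Laplace transform of $g$ at $\lambda$, i.e. $(2/(2 + \lambda))^2 = 0.64$ for $\lambda = 0.5$, which gives an independent check of the quadrature.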
