A Data-Driven Model for Software Reliability Prediction


A Data-Driven Model for Software Reliability Prediction. Author: Jung-Hua Lo. IEEE International Conference on Granular Computing (2012). Presented by Young Taek Kim, KAIST SE Lab, 9/4/2013.

Contents: Introduction, Background, Overall Approach, Detailed Process, Experimental Results, Conclusion, Discussion. 2 / 31

Introduction Background Overall Approach Detailed Process Experimental Results Conclusion Discussion
SW Reliability Prediction. Definition of SW reliability: the probability of failure-free operation of a software product in a specified environment for a specified time. SRM (Software Reliability Model): used to estimate how reliable the software is now and to predict its reliability in the future. Two categories of SRMs: analytical models (e.g. NHPP SRMs) and data-driven models (e.g. ARIMA, SVM). 3 / 31

Data-Driven Models. Limitations of analytical models: software behavior changes during the testing phase, and the assumption that all faults are independent and equally detectable is violated by real datasets. Data-driven models make far fewer impractical assumptions, since they are developed from collected failure data, and they make it easy to build abstractions and generalizations of the SW failure process via regression or time-series analysis. 4 / 31

Motivation. Problems: an actual SW failure data set is rarely purely linear or purely nonlinear, and no general model is suitable for all situations. Proposed solution: a hybrid strategy combining a linear and a nonlinear predicting model. The ARIMA model performs well on linear data; the SVM model has been applied successfully to nonlinear data. 5 / 31

Stationarity. Statistical properties (mean, variance, covariance, etc.) are all constant over time:
(1) E(y_t) = μ_y for all t
(2) Var(y_t) = E[(y_t − μ_y)²] = σ_y² for all t
(3) Cov(y_t, y_{t−k}) = γ_k for all t
[Plots: a series whose μ, σ², γ change over time is transformed by differencing into one with constant μ, σ², γ.] 6 / 31

ACF (Autocorrelation Function). The correlation between observations at different distances apart (lags):
r_k = Σ_{t=1}^{n−k} (y_t − ȳ)(y_{t+k} − ȳ) / Σ_{t=1}^{n} (y_t − ȳ)², where ȳ = (1/n) Σ_{t=1}^{n} y_t. 7 / 31
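The sample ACF can be computed directly from this definition. A minimal pure-Python sketch (function name is illustrative, not from the paper):

```python
def acf(y, max_lag):
    """Sample autocorrelation r_k = sum (y_t - ybar)(y_{t+k} - ybar) / sum (y_t - ybar)^2."""
    n = len(y)
    ybar = sum(y) / n
    denom = sum((v - ybar) ** 2 for v in y)
    return [
        sum((y[t] - ybar) * (y[t + k] - ybar) for t in range(n - k)) / denom
        for k in range(1, max_lag + 1)
    ]

# An alternating series is strongly negatively correlated at lag 1
# and positively correlated at lag 2:
y = [1.0, -1.0] * 5       # mean is exactly 0
r = acf(y, 2)             # r[0] = r_1 = -0.9, r[1] = r_2 = 0.8
```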

PACF (Partial ACF). The degree of association between y_t and y_{t−k} when the effects of the intermediate lags 1, 2, 3, …, k−1 are removed:
r_11 = r_1
r_kk = (r_k − Σ_{j=1}^{k−1} r_{k−1,j} r_{k−j}) / (1 − Σ_{j=1}^{k−1} r_{k−1,j} r_j), for k = 2, 3, …
where r_{k,j} = r_{k−1,j} − r_kk r_{k−1,k−j} for j = 1, 2, …, k−1. 8 / 31
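This recursion (the Durbin-Levinson recursion) can be coded directly from the sample autocorrelations. A sketch, assuming the autocorrelations r_1, …, r_K are already available:

```python
def pacf_from_acf(r):
    """Durbin-Levinson recursion. Input: r[0] = r_1, r[1] = r_2, ...
    Output: partial autocorrelations [r_11, r_22, ...]."""
    K = len(r)
    pacf = [r[0]]
    prev = [r[0]]                     # row r_{1,j}
    for k in range(2, K + 1):
        num = r[k - 1] - sum(prev[j] * r[k - 2 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j] for j in range(k - 1))
        rkk = num / den
        # next row: r_{k,j} = r_{k-1,j} - r_kk * r_{k-1,k-j}
        prev = [prev[j] - rkk * prev[k - 2 - j] for j in range(k - 1)] + [rkk]
        pacf.append(rkk)
    return pacf

# For an AR(1) process r_k = a^k, so the PACF cuts off after lag 1:
a = 0.6
p = pacf_from_acf([a, a * a, a * a * a])   # p[0] = 0.6, p[1] and p[2] ~ 0
```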

Removing Non-stationarity: Differencing. Differenced series: y′_t = y_t − y_{t−1}. 9 / 31
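Differencing in code, showing how it removes a deterministic trend (the data here is a made-up example):

```python
def difference(y):
    """First difference: y'_t = y_t - y_{t-1}; the result is one element shorter."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]

# A linear trend y_t = 2t + 3 is nonstationary (its mean grows with t);
# its first difference is the constant, stationary series 2, 2, 2, ...
trend = [2 * t + 3 for t in range(10)]
d = difference(trend)   # d == [2, 2, 2, 2, 2, 2, 2, 2, 2]
```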

3 Prediction Models for Stationary Data.
AR (Auto-Regressive) model: uses past values in the forecast. AR(p): y_t = α_1 y_{t−1} + α_2 y_{t−2} + … + α_p y_{t−p} + ε_t
MA (Moving Average) model: uses past residuals (random shocks) in the forecast. MA(q): y_t = ε_t + β_1 ε_{t−1} + … + β_q ε_{t−q}
ARMA (Auto-Regressive and Moving Average) model: combination of AR and MA. ARMA(p, q): y_t = α_1 y_{t−1} + … + α_p y_{t−p} + ε_t + β_1 ε_{t−1} + … + β_q ε_{t−q} 10 / 31
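All three equations can be exercised with one simulation routine in which AR(p) and MA(q) are special cases of ARMA(p, q). A sketch with illustrative coefficients (not from the paper):

```python
def arma_simulate(alphas, betas, eps):
    """Simulate y_t = sum_i alpha_i*y_{t-i} + eps_t + sum_j beta_j*eps_{t-j},
    taking y_t = eps_t = 0 for t < 0. AR(p): betas=[]; MA(q): alphas=[]."""
    y = []
    for t, e in enumerate(eps):
        ar = sum(a * y[t - i - 1] for i, a in enumerate(alphas) if t - i - 1 >= 0)
        ma = sum(b * eps[t - j - 1] for j, b in enumerate(betas) if t - j - 1 >= 0)
        y.append(ar + e + ma)
    return y

# AR(1) with alpha = 0.5 and a single unit shock at t = 0 decays geometrically:
y_ar = arma_simulate([0.5], [], [1.0, 0.0, 0.0, 0.0])   # [1.0, 0.5, 0.25, 0.125]
# MA(1) with beta = 0.5: the same shock dies out after one extra step:
y_ma = arma_simulate([], [0.5], [1.0, 0.0, 0.0])        # [1.0, 0.5, 0.0]
```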

AR (Auto-Regressive) Model (1/2). AR(p): y_t = α_1 y_{t−1} + α_2 y_{t−2} + … + α_p y_{t−p} + ε_t, where α_i is an autoregressive coefficient and ε_t is the error at time t. Model selection: the ACF decreases exponentially (monotonically if 0 < α < 1, in an oscillating pattern if −1 < α < 0), and the PACF identifies the order of the AR model. [Plots: for an AR(1) data series, the ACF decays exponentially (oscillating) while the PACF cuts off at lag 1, with 5% significance limits.] 11 / 31

MA (Moving Average) Model (1/2). MA(q): y_t = ε_t + β_1 ε_{t−1} + … + β_q ε_{t−q}, where β_i is an MA parameter and ε_t is the error at time t. Example: MA(3) of yearly sales, where each entry is the average of the three preceding years, e.g. (1000 + 1500 + 1250) / 3 = 1250:
Year  Sales(B$)  MA(3)
2000  1000       n/a
2001  1500       n/a
2002  1250       n/a
2003  900        1250
2004  1600       1217
2005  950        1250
2006  1650       1150
2007  1750       1400
2008  1200       1450
2009  2000       1533
2010  2100       1650
2011  n/a        1767
12 / 31
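The MA(3) column in the table is a trailing average of the three preceding years, used as the forecast for the current year. A sketch reproducing it:

```python
def moving_average_forecast(values, window=3):
    """Forecast for period t = mean of the `window` preceding values,
    rounded to the nearest integer. Periods without enough history get None."""
    out = [None] * window
    for t in range(window, len(values) + 1):
        out.append(round(sum(values[t - window:t]) / window))
    return out  # out[t] is the forecast for period t

# Sales figures from the table, years 2000..2010:
sales = [1000, 1500, 1250, 900, 1600, 950, 1650, 1750, 1200, 2000, 2100]
ma3 = moving_average_forecast(sales)
# ma3[3] == 1250 (forecast for 2003), ma3[11] == 1767 (forecast for 2011)
```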

MA (Moving Average) Model (2/2). Model selection: the ACF identifies the order of the MA model, and the PACF decreases exponentially (monotonically if 0 < a < 1, in an oscillating pattern if −1 < a < 0). [Plots: for an MA(1) data series, the ACF cuts off at lag 1 while the PACF decays exponentially (oscillating), with 5% significance limits.] 13 / 31

ARMA Model. ARMA(p, q) = AR(p) + MA(q): y_t = α_1 y_{t−1} + α_2 y_{t−2} + … + α_p y_{t−p} + ε_t + β_1 ε_{t−1} + … + β_q ε_{t−q}. Procedures for model identification give a guideline for determining p and q for ARMA. 14 / 31

ARIMA Model. Auto-Regressive Integrated Moving Average (Box and Jenkins, 1970). A linear model for forecasting time-series data: future values are a linear function of several past observations. ARIMA(p, d, q): auto-regression of order p, integration (differencing) of order d, which extends the model to non-stationary time series, and a moving average of order q. 15 / 31
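The "integrated" part can be illustrated with the simplest case, ARIMA(0,1,0) with drift: difference once, model the differenced series by its mean, then integrate the forecast back. This is a hedged sketch of the idea only, not the paper's estimation procedure:

```python
def arima_010_drift_forecast(y, steps=1):
    """ARIMA(0,1,0) with drift: the differenced series y'_t = y_t - y_{t-1}
    is modeled by its constant mean (the drift); forecasts add the drift
    back onto the last observation, step by step."""
    diffs = [y[t] - y[t - 1] for t in range(1, len(y))]
    drift = sum(diffs) / len(diffs)
    last = y[-1]
    return [last + drift * h for h in range(1, steps + 1)]

# A random walk with drift 2 is forecast to keep taking average-size steps:
y = [1.0, 3.0, 5.0, 7.0]
f = arima_010_drift_forecast(y, steps=2)   # [9.0, 11.0]
```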

SVM (Support Vector Machine). Proposed by Vladimir N. Vapnik (1995). An algorithm (or recipe) for maximizing a particular mathematical function with respect to a given collection of data. Four key concepts: separating hyperplane, maximum-margin hyperplane, soft margin, kernel function. 16 / 31

Separating Hyperplane. The classifier is f(x, w, b) = sign(w·x + b): points with w·x + b > 0 are labeled +1, points with w·x + b < 0 are labeled −1. The separating hyperplane w·x + b = 0 is the classifier. 17 / 31

Maximum Margin. f(x, w, b) = sign(w·x + b). Support vectors are the data points that the margin pushes up against; only the support vectors are needed to specify the separating hyperplane. M = margin width. 18 / 31
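The decision rule and margin width can be written out directly: with the canonical scaling |w·x + b| = 1 at the support vectors, the margin width is M = 2/||w||. A sketch with a hypothetical w and b (not values from the paper):

```python
import math

def classify(w, x, b):
    """f(x, w, b) = sign(w . x + b): +1 on one side of the hyperplane, -1 on the other."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def margin_width(w):
    """M = 2 / ||w||, assuming the canonical scaling |w . x + b| = 1
    at the support vectors."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, 0.0], -1.0               # hyperplane x1 = 1 in the plane
label = classify(w, [3.0, 5.0], b)    # +1: the point lies on the positive side
m = margin_width(w)                   # 2.0
```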

Kernel Function (1/2). Nonlinear SVMs. Datasets that are linearly separable with some noise work out fine. But what if the dataset is just too hard to separate in the original space? Then we can map the data to a higher-dimensional space, e.g. x ↦ (x, x²). [Plots: 1-D examples along the x-axis — a separable set, a non-separable set, and the same data lifted into the (x, x²) plane.] 19 / 31

Kernel Function (2/2). Nonlinear SVMs: feature spaces. General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is linearly separable, via Φ: x ↦ φ(x). Definition of a kernel function: a function that corresponds to an inner product in some expanded feature space. 20 / 31
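"Kernel = inner product in an expanded feature space" can be checked numerically. For 1-D inputs, the degree-2 polynomial kernel K(x, y) = (xy + 1)² equals ⟨φ(x), φ(y)⟩ with the explicit map φ(x) = (1, √2·x, x²). The kernel choice here is illustrative; the slides do not fix one:

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 1-D."""
    return (1.0, math.sqrt(2.0) * x, x * x)

def poly_kernel(x, y):
    """K(x, y) = (x*y + 1)^2, computed without ever forming phi."""
    return (x * y + 1.0) ** 2

x, y = 2.0, 3.0
lhs = poly_kernel(x, y)                              # (6 + 1)^2 = 49.0
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))     # 1 + 2*6 + 36, also ~49
```

This is the "kernel trick": the SVM only ever needs K(x, y), so the high-dimensional φ never has to be computed explicitly.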

Genetic Algorithm. A search and optimization technique (J. Holland, 1975) based on Darwin's principle of natural selection. Basic operations: crossover and mutation. Loop: create an initial random population (potential solutions); evaluate the fitness of each individual; if an optimal or "good" solution is found, stop; otherwise apply selection (keep fit individuals, discard unfit ones), crossover, and mutation, and repeat. 21 / 31
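That loop (initialize → evaluate → select → crossover → mutate → repeat) in minimal form, maximizing a toy fitness function. All parameters and operators here are illustrative choices, not the paper's:

```python
import random

def genetic_maximize(fitness, lo, hi, pop_size=20, generations=60, seed=1):
    """Toy real-valued GA: elitist selection of the fitter half,
    averaging crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]       # selection: keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = (a + b) / 2.0              # crossover: average two parents
            child += rng.gauss(0.0, 0.1)       # mutation: small random perturbation
            children.append(min(hi, max(lo, child)))
        pop = survivors + children
    return max(pop, key=fitness)

# Maximize f(x) = -(x - 3)^2 on [0, 6]; the optimum is x = 3.
best = genetic_maximize(lambda x: -(x - 3.0) ** 2, 0.0, 6.0)
```

Because survivors are carried over unchanged (elitism), the best fitness never decreases between generations.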

Overall Approach (1/2). Two pipelines run side by side on the data set, and their forecasts are combined (+) into the software reliability prediction:
ARIMA (linear forecasting): Data Set → Model Identification → Model Estimation → model checking satisfied? (if not, loop back) → Trained ARIMA Model.
SVM + GA (nonlinear forecasting): initial parameters and the nonlinear residual → random initial population (Chromosome 1, Chromosome 2, …, Chromosome N) → train SVM model → fitness evaluation → stop criteria met? (if not, apply genetic operations and repeat) → Trained SVM Model. 22 / 31

Overall Approach (2/2). X_t = L_t + N_t, where X_t is the time-series data, L_t its linear part, and N_t its nonlinear part. From the ARIMA step we get L_t, the value predicted by the ARIMA model, and the residual at time t from the linear model, ε_t = X_t − L_t. Finally, the residuals ε_t are modeled by the SVM model with a GA (genetic algorithm). 23 / 31
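The decomposition X_t = L_t + N_t drives the whole method: fit a linear model, take its residuals ε_t = X_t − L_t, and hand those residuals to a nonlinear learner. A structural sketch in which a least-squares trend stands in for the ARIMA model (the paper itself uses ARIMA and a GA-tuned SVM, and the data is invented for illustration):

```python
def fit_linear_trend(x):
    """Least-squares line through (t, x_t); returns fitted values L_t."""
    n = len(x)
    tbar = (n - 1) / 2.0
    xbar = sum(x) / n
    slope = sum((t - tbar) * (x[t] - xbar) for t in range(n)) / sum(
        (t - tbar) ** 2 for t in range(n)
    )
    return [xbar + slope * (t - tbar) for t in range(n)]

x = [2.0, 3.5, 5.0, 7.5, 9.0]            # observed series X_t
L = fit_linear_trend(x)                   # linear part L_t
eps = [xi - li for xi, li in zip(x, L)]   # residuals eps_t = X_t - L_t
# eps is what would be modeled by the nonlinear (SVM) component;
# X_t = L_t + eps_t holds by construction.
```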

ARIMA Process (1/2). Data Set → Model Identification → Parameter Estimation → model checking satisfied? → SW Reliability Prediction (loop back if not).
Model identification: stationarize the input data (differencing determines d; check the ACF and PACF), then determine the values of p and q from the ACF and PACF:
           ACF                    PACF
AR(p):     Tails off              Cuts off after lag p
MA(q):     Cuts off after lag q   Tails off
ARMA(p,q): Tails off              Tails off
Parameter estimation: MLE (maximum likelihood estimation): find the set of parameters θ_1, θ_2, …, θ_k that maximizes L(θ_1, θ_2, …, θ_k) = f(x_1, x_2, …, x_N; θ_1, θ_2, …, θ_k). 24 / 31

ARIMA Process (2/2). Model checking: residual randomness check. The residuals of a well-fitted model will be random and follow the normal distribution; check their ACF and PACF. If the check fails, return to model identification; otherwise proceed to SW reliability prediction. 25 / 31

SVM Process (1/2). Inputs: initial parameters and the nonlinear residual. Due to the randomness of the input data, a random initial population (Chromosome 1, Chromosome 2, …, Chromosome N) is selected, e.g. over the parameters C, ε, σ. The data set is divided into two parts: training data and testing data. The loop: train the SVM model, evaluate fitness, and check the stop criteria; if they are not met, apply genetic operations and repeat; the result is the trained SVM model (nonlinear forecasting). 26 / 31

SVM Process (2/2). The higher the fitness value, the higher the chance of survival; high-fitness candidate chromosomes are retained and combined to produce new offspring. A GA is applied to the SVM parameter search because there is no theoretical method for determining a kernel function and its parameters, and no a priori knowledge for setting the penalty parameter C. Applied GA operations: crossover and mutation. 27 / 31

Experimental Results (1/2). Collected data: the cumulative number of failures, x_i, at time t_i. Data Set DS-1: an RADC (Rome Air Development Center) project reported by Musa; 21 weeks tested, 136 observed failures. Output: the predicted value x_{i+1}, using (x_1, x_2, …, x_i). [Plots: goodness-of-fit curves and relative-error curves.] 28 / 31

Experimental Results (2/2). Collected data: the cumulative number of failures, x_i, at time t_i. Data Set DS-2: 28 weeks of SW testing, 234 observed failures. Output: the predicted value x_{i+1}, using (x_1, x_2, …, x_i). [Plots: goodness-of-fit curves and relative-error curves.] 29 / 31

Conclusion. The proposed hybrid methodology for forecasting software reliability exploits the unique strengths of the ARIMA model and the SVM model. Test results showed an improvement in prediction performance. 30 / 31

Discussion. Pros: provides a possible solution to SRM selection difficulties and improves SW reliability prediction performance. Cons: detailed test methods are not presented (e.g. stop criteria for the SVM, parameter estimation criteria for ARIMA). 31 / 31

Thank you!