Diverse Routing in Networks with Probabilistic Failures

Hyang-Won Lee, Member, IEEE, Eytan Modiano, Senior Member, IEEE, Kayi Lee, Member, IEEE

Abstract: We develop diverse routing schemes for dealing with multiple, possibly correlated, failures. While disjoint path protection can effectively deal with isolated single link failures, recovering from multiple failures is not guaranteed. In particular, events such as natural disasters or intentional attacks can lead to multiple correlated failures, for which recovery mechanisms are not well understood. We take a probabilistic view of network failures where multiple failure events can occur simultaneously, and develop algorithms for finding diverse routes with minimum joint failure probability. Moreover, we develop a novel Probabilistic Shared Risk Link Group (PSRLG) framework for modeling correlated failures. In this context, we formulate the problem of finding two paths with minimum joint failure probability as an Integer Non-Linear Program (INLP), and develop approximations and linear relaxations that can find nearly optimal solutions in most cases.

Index Terms: Path protection, disjoint paths, random link failures, correlated failures, probabilistic SRLG

I. INTRODUCTION

This paper deals with protection in communication networks with correlated probabilistic link failures. The objective of protection is to provide reliable communication in the event of failure of network components such as nodes or links. Such protection mechanisms are classified as link protection and path protection. Link protection precomputes an alternate detour for each link, and recovers from a link failure by rerouting the traffic along its predetermined detour. In contrast, path protection assigns two paths, a primary and a backup, to each connection, and the traffic is switched onto the backup path in case of a primary path failure. Therefore, the primary and backup paths need to be disjoint, since otherwise the two paths will fail simultaneously if a link or node shared by the two paths fails.
In this paper, we focus on path protection. Disjoint-path based protection effectively addresses the case of a single point of failure, but if more than one failure occurs at the same time, protection is not guaranteed since both paths may fail simultaneously. There are several factors that can cause multiple failures. First, modern communication networks are deployed over an optical fiber network, and so multiple communication links can share the same fiber in the optical layer. Consequently, any fiber cut can lead to the failure of all the (upper-layer) communication links sharing that fiber. Second, multiple link failures can occur if a second link fails before the first is repaired. Third, natural disasters or attacks can destroy several links (which do not necessarily share a fiber) in the vicinity of such events.

The concept of Shared Risk Link Group (SRLG) has been proposed in order to address multiple correlated link failures systematically [1]. An SRLG is a set of links sharing a common physical resource (cable, conduit, etc.) and thus a risk of failure. In this context, Bhandari first studied the so-called Physically Disjoint Paths (PDP) problem in [2], and proposed a shortest PDP algorithm for particular topologies. Since this pioneering work, there has been a large body of work [3] [4] dealing with multiple failures in the context of SRLGs. In [5], Hu showed the NP-completeness of the SRLG-Disjoint Paths Problem (SDPP), where SRLG-disjoint paths are two paths touching no common SRLG.

The authors are with the Massachusetts Institute of Technology, Cambridge, MA 02139 (e-mail: {hwlee, modiano, kylee}@mit.edu). This work was supported by NSF grants CNS-0626781 and CNS-0830961 and by DTRA grant number HDTRA1-07-1-0004. Hyang-Won Lee was partially supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2007-357-D00164). This work was presented in part at the IEEE INFOCOM conference, April 2009.
All of the previous SRLG works assume that once an SRLG failure event occurs, all of its associated links fail simultaneously. Here, we generalize the notion of an SRLG to account for probabilistic link failures. This generalized notion allows us to model correlated failures that may result from a natural or man-made disaster. For example, in the event of a natural disaster, some, but not necessarily all, of the links in the vicinity of the disaster may be affected. Such failures cannot be described using a deterministic failure model, and this raises the need for a systematic approach to dealing with correlated probabilistic link failures. We address this issue by modeling SRLG events probabilistically so that upon an SRLG failure event, links belonging to that SRLG fail with some probability (not necessarily one).

Our probabilistic SRLG model is applicable to a number of real-world failure scenarios. Some examples include: (i) WDM networks, where the lightpaths traversing a fiber form an SRLG and fail (with probability 1) in the event of a fiber cut; (ii) satellite/wireless communication links, where links experience outage in the event of bad weather. In this case, the satellite links affected by the weather event form an SRLG, and may fail with some probability; (iii) ElectroMagnetic Pulse (EMP) attack: EMP is an intense energy field that can instantly overload or disrupt numerous electrical circuits at a distance [6]. In the event of an EMP attack, the fiber links in the vicinity of the attack may have a high probability of failure and those distant from the attack would fail with low probability due to signal attenuation; and (iv) natural/man-made disasters such as earthquakes or floods, where communication links in the vicinity of the disaster may fail. For example, an undersea cable was cut during the Taiwan earthquake of 2006 [7], disrupting most communications out of Taiwan. Similarly,

during the Baltimore tunnel fire in 2001 [8], the fire melted away the fiber along the tunnel, leading again to a large number of correlated failures.

There are a number of papers dealing with probabilistic link failures [9] [2]. Typically, they consider the availability (i.e., the probability that a connection is in the operating state), and seek to find a path pair satisfying a minimum availability requirement [9], [20] or a path pair with maximum availability [2]. While the above works assume independent link failures, there have been efforts to deal with correlated failures. In [22] [24], the link failure probability is extended and defined as a function of SRLG parameters to account for correlated failures. In particular, in [24], the path failure probability is defined as the ratio of the number of touched SRLGs to the total number of SRLGs. Under this model, [24] considers the problem of finding a pair of primary and backup paths that satisfy a joint reliability requirement. This model generalizes the traditional concept of SRLG-disjointness, such that if the joint reliability of the primary and backup paths is $q$, then they are disjoint with respect to a fraction $q$ of the SRLGs. However, under this model, link failures are deterministic, given an SRLG failure. Hence, this model cannot be directly applied to the case of correlated failures with uncertainty that may occur due to disasters and attacks.

In [25], a primary/backup path allocation problem is defined to find a pair of paths having minimum joint failure probability. They adopt a correlated link failure probability model where the correlation between the links is represented by their joint failure probability. This correlation model requires an exponential number of conditional probabilities in general, prohibiting a simple formulation. Due to this difficulty, they take into account only the first order correlation, i.e., the correlation between pairs of links.
In this paper, we consider finding a pair of paths with minimum joint failure probability in a network where the link failures occur randomly and are possibly correlated. We propose an alternative model that enables a simple formulation and captures the essence of correlated link failures. Our model assumes that once an SRLG failure event occurs, its associated links fail with some probabilities. Thus, the correlation exists among the links only when they belong to the same SRLG. Clearly, this model can be viewed as a generalization of the traditional (deterministic) SRLG model. Our contributions can be summarized as follows:

• We generalize the SRLG framework to a probabilistic SRLG (PSRLG). This new framework enables us to effectively model correlated link failures, and develop efficient formulations for otherwise intractable problems involving correlated link failures.

• We develop mathematical formulations for the problem of finding a pair of paths with minimum joint failure probability. This new approach enables the generalization of disjoint path protection schemes to the case of multiple (probabilistic) failures.

• We develop heuristic algorithms for finding a pair of paths with minimum joint failure probability. Our algorithms are based on linear approximations and Lagrangian relaxations, and are shown to find nearly optimal solutions.

While the deterministic SRLG model has been widely used in the literature, there are many scenarios where the deterministic model is not applicable. For example, the Working Group on California Earthquake Probabilities (WGCEP) has been developing earthquake rate models for California. In particular, they compute the probabilities of all possible damaging earthquakes (according to their magnitudes) throughout a region and over a specified time span [26].
Such disasters lead to correlated failures that are well addressed by our new PSRLG model; namely, the earthquake probabilities correspond to SRLG failure probabilities, and the link failure probabilities can be computed based on the earthquake magnitudes. Network providers could use this PSRLG data and our formulation in order to protect a path in the presence of earthquakes.

The probabilistic SRLG model can also be applied to deal with inaccuracy in the SRLG database. Typically, the SRLG data is the mapping between the IP layer components and the underlying physical components. This information is used for failure diagnosis as well as survivable routing. However, it often contains errors due to traffic engineering and recovery mechanisms [28]. Recently, in [30], failure diagnosis mechanisms were studied assuming probabilistic errors in the SRLG data, where the association of an IP layer component to a physical component is probabilistic. Note that with this erroneous SRLG data the survivable routing problem is best handled probabilistically, as in our PSRLG model.

The rest of the paper is organized as follows. In Section II, we present our new probabilistic SRLG model and describe the generalized path protection problems. In Section III, we study the case of independent link failures, which provides fundamental insights into the study of correlated failures. In Section IV, we formulate the path protection problems using the PSRLG model and develop algorithms for finding paths with minimum joint failure probability. Finally, in Section V, we analyze the performance of our algorithms via simulations.

II. MODEL AND PROBLEM DESCRIPTION

Consider a directed network graph $G = (V, E)$ where $V$ is a set of nodes and $E$ is a set of links. Any link in $E$ will be denoted by $(i, j)$ for $i, j \in V$, meaning that the link starts from node $i$ and ends at node $j$. There is a set $R$ of SRLG events that can incur link failures.
Each SRLG event $r \in R$ occurs with probability $\pi_r$, and once an SRLG event $r$ occurs, link $(i, j)$ will fail with probability $p^r_{ij} \in [0, 1]$. For example, if link $(i, j)$ is never affected by event $r$, then we define $p^r_{ij} = 0$. On the other hand, if the event $r$ is a cable cut and link $(i, j)$ traverses that cable, then we will have $p^r_{ij} = 1$. In the following we generalize the traditional notion of an SRLG to include probabilistic correlated failures.

Definition 1: A probabilistic SRLG (PSRLG) is a set of links with positive failure probability in the event of an SRLG failure. Namely, link $(i, j)$ belongs to SRLG $r$ if $p^r_{ij} > 0$, and SRLG $r = \{(i, j) \in E : p^r_{ij} > 0\}$.

We say that links $(i, j)$ and $(k, l)$ are correlated if there exists an SRLG $r$ such that $p^r_{ij}, p^r_{kl} > 0$.

Footnote 1: Typically, in failure diagnosis, the underlying physical layer failures are inferred from the IP layer failures by using the SRLG data, i.e., the mapping between physical layer components and IP layer components [27] [3].

Clearly, this model is a

generalization of the traditional SRLG model, and enables us to deal with correlated probabilistic link failures.

We consider a single source-destination pair. Let $s \in V$ and $t \in V$ be the source and destination nodes, respectively. Our objective is to find a pair of primary and backup paths from $s$ to $t$ with minimum joint failure probability. This problem will be considered using two different models: (i) independent link failure and (ii) SRLG-based correlated link failure. In each case, we seek to find a pair of paths with minimum joint failure probability. Let $x_{ij} = 1$ if the primary path traverses link $(i, j)$, and 0 otherwise. Similarly, define another binary variable $y_{ij}$ for the backup path. We will just drop the index of a variable to denote its vector version, e.g., $x$ represents the vector $[x_{ij}, \forall (i, j) \in E]$. The set of $n$-dimensional binary vectors will be defined as $B^n$, i.e., $B^n = \{0, 1\}^n$. Our problems will be formulated as Integer Programs (IPs).

III. INDEPENDENT LINK FAILURE MODEL

In order to gain insights into the problem, we start by first considering the independent link failure model. Moreover, we begin by considering the simple case of finding a single path with minimum failure probability. We then use the insights gained in order to formulate the problem of finding a pair of paths with minimum joint failure probability. In Section IV, we will further generalize our formulations to deal with correlated (SRLG) failures.

A. Single Path Problem

First, consider the problem of finding a single path having minimum failure probability. Let $p_{ij}$ be the probability that link $(i, j)$ fails; then link $(i, j)$ will survive with probability $1 - p_{ij}$. Consequently, the survival probability of path $x$ is given in product form by $\prod_{(i,j) \in E} (1 - p_{ij} x_{ij})$. The problem is formulated as:

\[ \text{(P1.1)}: \quad \min_{x \in B^{|E|}} \; 1 - \prod_{(i,j) \in E} (1 - p_{ij} x_{ij}) \]
subject to
\[ \sum_{j:(i,j) \in E} x_{ij} - \sum_{j:(j,i) \in E} x_{ji} = \begin{cases} 1, & i = s \\ -1, & i = t \\ 0, & \text{o.w.} \end{cases} \quad \forall i \in V, \tag{1} \]

where $|E|$ is the cardinality of $E$. The constraints in (1) require that the set of links selected by $x$ forms a path from node $s$ to node $t$.
For simplicity, we will denote this constraint by CC($x$), representing the connectivity constraint on $x$ for a path from $s$ to $t$. As all the variables in this work are binary, we will sometimes omit the binary constraint for convenience. The problem (P1.1) is an Integer NonLinear Program (INLP), which generally is very difficult to solve. However, using the following theorem, we are able to reformulate (P1.1) as an Integer Linear Program (ILP).

Theorem 1: Assume $p_{ij} \in [0, 1)$, $\forall (i, j)$. Then the problem (P1.1) is equivalent to the following ILP:

\[ \text{(P1.1L)}: \quad \min_{x \in B^{|E|}} \; -\sum_{(i,j) \in E} x_{ij} \log(1 - p_{ij}) \quad \text{subject to CC}(x), \]

where CC($x$) is the connectivity constraint as given in equation (1).

Proof: First, the objective in (P1.1) can be equivalently written as $\max \prod_{(i,j) \in E} (1 - p_{ij} x_{ij})$. Taking the logarithm of the entire function gives $\max \sum_{(i,j) \in E} \log(1 - p_{ij} x_{ij})$ without affecting the optimal solution. The proof is completed by applying the identity $\log(1 - p_{ij} x_{ij}) = x_{ij} \log(1 - p_{ij})$ for binary variable $x_{ij}$, and noting that $\max f(x)$ is the same as $\min -f(x)$.

Observation 1: Theorem 1 shows that the path with minimum failure probability is the shortest path under link weights $-\log(1 - p_{ij})$, $\forall (i, j)$. It immediately follows that if the failure probability is sufficiently small (i.e., $p_{ij} \ll 1$, $\forall (i, j)$), then the probability-wise shortest path has the minimum failure probability because $-\log(1 - p_{ij}) \approx p_{ij}$ for small $p_{ij}$. Further, with uniform failure probability, i.e., $p_{ij} = q$, $\forall (i, j) \in E$, the shortest-hop path has the lowest failure probability. This result will be used in developing heuristic algorithms.

B. Path Pair Problem with Disjointness Constraint

Let $F(p, x)$ be the objective function of problem (P1.1), i.e., $F(p, x)$ is the failure probability of path $x$ for given link failure probability vector $p$. Suppose that the two paths $x$ and $y$ are link-disjoint; then their failures are mutually independent because the link failures are independent and the paths do not share any link. Then, the joint failure probability of two disjoint paths $x$ and $y$ is given by $F(p, x) F(p, y)$.
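As an illustration of Observation 1 and of the path failure probability $F(p, x)$, the following is a minimal Python sketch that finds a minimum-failure-probability path by running Dijkstra's algorithm on the weights $-\log(1 - p_{ij})$. The function name `most_reliable_path`, the dictionary-based graph representation, and the example probabilities are assumptions made for this sketch; they are not part of the paper.

```python
import heapq
import math


def most_reliable_path(links, s, t):
    """Return (node list, failure probability) of a most reliable s-t path.

    `links` maps a directed link (i, j) to its failure probability p_ij,
    assumed to lie in [0, 1) as in Theorem 1.
    """
    adj = {}
    for (i, j), p in links.items():
        # Link weight -log(1 - p_ij); log1p(-p) = log(1 - p) is accurate for small p.
        adj.setdefault(i, []).append((j, -math.log1p(-p)))

    dist = {s: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))

    if t not in dist:
        return None, 1.0  # no s-t path exists

    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    path.reverse()
    # F(p, x) = 1 - prod(1 - p_ij) = 1 - exp(-total weight)
    return path, -math.expm1(-dist[t])


# Hypothetical example: the two-hop route is more reliable than the direct link.
example_links = {("s", "a"): 0.1, ("a", "t"): 0.1, ("s", "t"): 0.3}
example_path, example_failure = most_reliable_path(example_links, "s", "t")
```

In this example the two-hop route fails with probability $1 - 0.9^2 = 0.19$, which is below the 0.3 of the direct link, so the shortest-hop path is not the most reliable one when the link probabilities are non-uniform.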
The path pair problem with disjointness constraint (DC) is thus formulated as:

\[ \text{(P1.2)}: \quad \min_{x, y} \; F(p, x) F(p, y) \]
subject to
\[ \sum_{j:(i,j) \in E} x_{ij} - \sum_{j:(j,i) \in E} x_{ji} = \begin{cases} 1, & i = s \\ -1, & i = t \\ 0, & \text{o.w.} \end{cases} \quad \forall i \in V \]
\[ \sum_{j:(i,j) \in E} y_{ij} - \sum_{j:(j,i) \in E} y_{ji} = \begin{cases} 1, & i = s \\ -1, & i = t \\ 0, & \text{o.w.} \end{cases} \quad \forall i \in V \]
\[ x_{ij} + y_{ij} \le 1, \quad \forall (i, j) \in E. \]

Again, the first and second constraints are the connectivity constraints requiring that $x$ and $y$ are paths from $s$ to $t$. The last constraints require that $x$ and $y$ cannot share any link, i.e., they are link-disjoint. Hence, $x$ and $y$ satisfying all the constraints in (P1.2) will form a pair of disjoint paths from $s$ to $t$. For brevity, throughout this paper, we will denote by CC($x$) the connectivity constraint on any binary link selection vector $x$, and by DC($x, y$) the disjointness constraint (DC) on paths $x$ and $y$.

The problem (P1.2) is an Integer NonLinear Program (INLP) and has a special structure which is difficult to solve in general. Let $c$ be the length of the shortest path from $s$ to $t$.

Lemma 1: Assume the uniform failure probability $p_{ij} = p$, $\forall (i, j)$, where $1 - 2^{-1/c} \le p < 1$. Then, the problem (P1.2) is a concave minimization.

Proof: See Appendix A.

Therefore, the problem (P1.2) contains a concave minimization as a special case. Generally, a concave minimization problem is NP-hard [32], and hence the above lemma implies that the problem (P1.2) may not be easy. In the following, we consider heuristic algorithms to solve the problem.

1) Greedy Algorithm: First, we analyze the properties of the optimal solution to (P1.2). This motivates a greedy algorithm. Our analysis is based on the theory of majorization, introduced below.

Definition 2 (Majorization): Given an $n$-tuple vector $a$, let $a_{[i]}$ be the $i$-th largest of its $n$ coordinates. An $n$-tuple vector $\alpha$ is said to be majorized by $\beta$ if

\[ \sum_{i=1}^{k} \alpha_{[i]} \le \sum_{i=1}^{k} \beta_{[i]}, \quad k = 1, \ldots, n, \qquad \text{and} \qquad \sum_{i=1}^{n} \alpha_{[i]} = \sum_{i=1}^{n} \beta_{[i]}. \]

This relationship is denoted by $\alpha \prec \beta$.

Majorization formalizes how evenly distributed the elements of a vector are. For example, $\alpha = [2, 2, 2]$ is the most evenly distributed vector over all the 3-tuple vectors summing up to 6, and is majorized by any other such vector, e.g., $\beta = [1, 1, 4]$.

Definition 3 (Schur Convexity): A function $f : A \subseteq \mathbb{R}^n \to \mathbb{R}$ is Schur convex if $f(\alpha) \le f(\beta)$ for any $\alpha, \beta \in A$ such that $\alpha \prec \beta$. It is called Schur concave if $f(\alpha) \ge f(\beta)$ for $\alpha \prec \beta$.

It is clear from this definition that a Schur convex function $f : A \to \mathbb{R}$ is minimized at evenly distributed points, whereas if $f$ is Schur concave, it is minimized at unevenly distributed points. We will show the Schur concavity of the objective function in (P1.2), and use this property to develop a greedy path selection algorithm.

Assume the uniform failure probability, i.e., $p_{ij} = p$, $\forall (i, j)$. Then, the objective function in (P1.2) can be written as

\[ f(X, Y) = \left(1 - (1 - p)^X\right)\left(1 - (1 - p)^Y\right), \tag{2} \]

where $X = \sum_{(i,j) \in E} x_{ij}$ and $Y = \sum_{(i,j) \in E} y_{ij}$. Note that $X$ and $Y$ are the numbers of hops in the primary and backup paths, respectively. Hence, for fixed $p$, $f(X, Y)$ is a function of the numbers of hops in the primary and backup paths. The problem (P1.2) can be restated as: minimize $f(X, Y)$ subject to the same constraints as in (P1.2) with the additional constraints $X = \sum_{(i,j) \in E} x_{ij}$ and $Y = \sum_{(i,j) \in E} y_{ij}$. The objective function $f(X, Y)$ in this problem can be shown to be Schur concave.

Lemma 2: The function $f(X, Y)$ in (2) is Schur concave for $0 \le p < 1$.

Proof: See Appendix B.

As mentioned above, a Schur concave function is minimized at unevenly distributed points rather than evenly distributed ones.
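The definitions above can be checked numerically. The following Python sketch, written under the assumption of integer hop counts and a uniform link failure probability, implements the majorization test of Definition 2 and evaluates $f(X, Y)$ from (2) on two hop-count pairs with the same total. The helper names `is_majorized_by` and `joint_failure` and the chosen grid of $p$ values are illustrative assumptions, not part of the paper.

```python
def is_majorized_by(alpha, beta):
    """Return True if alpha is majorized by beta (Definition 2).

    Assumes integer (or otherwise exactly comparable) entries, so that the
    equality of the total sums can be tested exactly.
    """
    a = sorted(alpha, reverse=True)
    b = sorted(beta, reverse=True)
    if len(a) != len(b) or sum(a) != sum(b):
        return False
    partial_a = partial_b = 0
    for ai, bi in zip(a, b):
        partial_a += ai
        partial_b += bi
        if partial_a > partial_b:
            return False
    return True


def joint_failure(p, X, Y):
    """f(X, Y) from (2): joint failure probability of disjoint X-hop and Y-hop paths."""
    return (1 - (1 - p) ** X) * (1 - (1 - p) ** Y)


# (3, 3) is majorized by (2, 4); Schur concavity then gives f(3, 3) >= f(2, 4).
balanced, unbalanced = (3, 3), (2, 4)
grid = [k / 100 for k in range(1, 100)]  # p = 0.01, ..., 0.99
unbalanced_never_worse = all(
    joint_failure(p, *unbalanced) <= joint_failure(p, *balanced) for p in grid
)
```

For this particular pair the gap can also be derived by hand: with $a = 1 - p$, $f(3, 3) - f(2, 4) = a^2 (1 - a)^2 \ge 0$, which agrees with the numeric check.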
So, for example, we have $f(2, 2) \ge f(1, 3)$ since $(2, 2) \prec (1, 3)$. Accordingly, Lemma 2 implies that a pair with unbalanced (in terms of the number of hops) paths is preferred because its joint failure probability may be lower than that of a balanced pair. Consider an example in Fig. 1, where there are two pairs of $s$-$t$ paths: one with $X = 3$, $Y = 3$ and the other with $X = 2$, $Y = 4$. Fig. 1(b) plots the joint path failure probabilities of the two pairs, and shows that the unbalanced pair (i.e., the one with $X = 2$, $Y = 4$) is more reliable than the balanced pair for all values of $p$.

[Fig. 1. Two pairs of disjoint paths and their failure probabilities under uniform link failure probability: the unbalanced pair is always more reliable than the balanced one. (a) Two pairs of disjoint paths: one with $X = 3$, $Y = 3$ and one with $X = 2$, $Y = 4$. (b) Plot of joint path failure probabilities $f(X, Y)$ for the two pairs.]

It should be noted that a similar observation can also be made for nonuniform failure probabilities, where the $p_{ij}$'s can be different, as follows.

Observation 2: Consider an example topology in Fig. 2, where the number on each link is its failure probability. We want to find a pair of disjoint paths with minimum joint failure probability. It is easy to see that there are only the two pairs shown in the figure. The pair at the top has individual path failure probabilities of (0.1, 0.1) and thus its joint failure probability is 0.01. On the other hand, the pair at the bottom has individual path failure probabilities of (0, 0.9), leading to zero joint failure probability. This reasserts that a good-bad path pair might be better than a medium-medium pair.

The above observations suggest that it is important to include the best path (i.e., the path having minimum failure probability) in the pair. Further, this motivates a greedy algorithm which selects the best path first and then selects the next best disjoint path (see Algorithm 1).
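The greedy idea described above (pick the best path, remove its links, pick the best remaining path) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: link weights are taken to be $p_{ij}$ as in the low-probability regime of Observation 1, the graph is a dictionary from directed links to failure probabilities, and the names `shortest_path`, `path_failure`, and `greedy_disjoint_pair` are hypothetical.

```python
import heapq
import math


def shortest_path(weights, s, t):
    """Dijkstra on {(i, j): weight}; returns the path as a list of links, or None."""
    adj = {}
    for (i, j), w in weights.items():
        adj.setdefault(i, []).append((j, w))
    dist, prev, done = {s: 0.0}, {}, set()
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, math.inf):
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (d + w, v))
    if t not in dist:
        return None
    links, v = [], t
    while v != s:
        links.append((prev[v], v))
        v = prev[v]
    return links[::-1]


def path_failure(p, links):
    """F(p, x) = 1 - prod over the path's links of (1 - p_ij)."""
    survive = 1.0
    for e in links:
        survive *= 1.0 - p[e]
    return 1.0 - survive


def greedy_disjoint_pair(p, s, t):
    """Greedy pair: best path x, then best path y avoiding the links of x."""
    x = shortest_path(p, s, t)
    if x is None:
        return None
    used = set(x)
    remaining = {e: w for e, w in p.items() if e not in used}
    y = shortest_path(remaining, s, t)
    if y is None:
        return None  # no disjoint second path is left (the "trap" case)
    return x, y, path_failure(p, x) * path_failure(p, y)


# Hypothetical four-link example with two disjoint two-hop routes.
example_p = {("s", "a"): 0.01, ("a", "t"): 0.01, ("s", "b"): 0.05, ("b", "t"): 0.05}
example_pair = greedy_disjoint_pair(example_p, "s", "t")
```

Returning `None` when the second search fails makes the trap case explicit to the caller rather than hiding it behind an infinite cost.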
Note that according to Observation 1, the best path is obtained as the probability-wise shortest path. Hence, Algorithm 1 only needs to run a shortest path algorithm twice, with complexity $O(|V|^2)$.

Algorithm 1 Greedy: IND w/ DC
1: Set link weight $w_{ij} = p_{ij}$, $\forall (i, j) \in E$
2: Find shortest path $x$
3: Remove all the directed edges used by $x$
4: Find shortest path $y$

[Fig. 2. Example motivating greedy algorithm: including the most reliable path is also important in the case of non-uniform failure probabilities. Top pair: path failure probabilities = (0.1, 0.1), joint path failure probability = 0.01. Bottom pair: path failure probabilities = (0, 0.9), joint path failure probability = 0.]

Note that the greedy algorithm can possibly run out of $s$-$t$ paths after the first path is found. This is the so-called trap problem. In the following, we show that such a case does not happen under mild assumptions. First, notice that the paths found by the greedy algorithm are simple (i.e., do not contain cycles) or can contain cycles of zero length, in which case the cycles can be removed without affecting the path failure probability. Hence, we may assume that the paths $x$ and $y$ found by the greedy algorithm are simple.

Lemma 3: Consider a $k$-connected (see footnote 2) bidirectional graph $G$ where $k \ge 2$. Removing a simple $s$-$t$ path in $G$ does not disconnect $s$ and $t$.

Proof: See Appendix C.

Therefore, the greedy algorithm does not run out of $s$-$t$ paths after the first path is found, provided that the graph is bidirectional and $k$-connected with $k \ge 2$. We note that this assumption is very common, as most practical networks use bidirectional links.

Footnote 2: A graph is said to be $k$-connected if it remains connected after up to $k - 1$ link (or node) failures. Equivalently, if every source-destination pair has at least $k$ disjoint paths, then it is $k$-connected.

2) ILP Approximation of (P1.2) and Its Lagrangian Relaxation: We develop another heuristic algorithm based on the ILP approximation of the problem (P1.2). First, the objective function in (P1.2) can be expanded as

\[ F(p, x) F(p, y) = 1 - \prod_{(i,j) \in E} (1 - p_{ij} x_{ij}) - \prod_{(i,j) \in E} (1 - p_{ij} y_{ij}) + \prod_{(i,j) \in E} (1 - p_{ij} x_{ij}) \prod_{(i,j) \in E} (1 - p_{ij} y_{ij}). \tag{3} \]

Further expanding the product terms and canceling out common terms yields

\[ F(p, x) F(p, y) = \sum_{(i,j) \in E} \sum_{(k,l) \in E} p_{ij} p_{kl} x_{ij} y_{kl} + \text{HOT}, \tag{4} \]

where HOT stands for high order terms; namely, terms involving the product of 3 or more failure probabilities.
In the low failure probability regime, i.e., $p_{ij} \ll 1$, $\forall (i, j)$, the HOTs can be neglected and the ILP can be formulated as follows:

\[ \text{(P1.2L)}: \quad \min_{x, y, z} \; \sum_{(i,j) \in E} \sum_{(k,l) \in E} p_{ij} p_{kl} z_{ij}^{kl} \]
subject to CC($x$), CC($y$), DC($x, y$), and
\[ \text{(C1)} \quad z_{ij}^{kl} \ge x_{ij} + y_{kl} - 1, \quad \forall (i, j), (k, l) \in E, \]

where we have introduced the binary variables $z_{ij}^{kl}$, $\forall (i, j), (k, l)$, such that $z_{ij}^{kl} = 1$ only if both $x_{ij}$ and $y_{kl}$ are 1, that is, link $(i, j)$ is used by the primary path and $(k, l)$ by the backup path. This enables us to use $z_{ij}^{kl}$ instead of $x_{ij} y_{kl}$ in the objective, hence resulting in a linear formulation. Consequently, the objective function represents the joint failure probability based on the pair-wise (one from $x$ and one from $y$) joint link failure.

While ILPs are generally difficult to solve, in this case we also observe that the constraints CC($x$) and CC($y$) are totally unimodular (TU; see footnote 3) [33], hence the linear program (LP) relaxation has an integral optimal solution [33]. Further, we can use Lagrangian relaxation on the constraints DC($x, y$) and (C1) to further simplify the problem. In particular, define the Lagrangian function (omitting terms that are constant in $x$, $y$, $z$) as

\[ L(x, y, z, \mu, \nu) = \sum_{(i,j)} \Big( \mu_{ij} + \sum_{(k,l)} \nu_{ij}^{kl} \Big) x_{ij} + \sum_{(i,j)} \Big( \mu_{ij} + \sum_{(k,l)} \nu_{kl}^{ij} \Big) y_{ij} + \sum_{(i,j),(k,l)} \big( p_{ij} p_{kl} - \nu_{ij}^{kl} \big) z_{ij}^{kl}, \]

where $\mu$ and $\nu$ are the Lagrangian multiplier vectors associated with DC($x, y$) and (C1), respectively. The (Lagrangian) relaxed problem is given by

\[ \text{(P1.2LR)}: \quad \min_{x, y, z} \; L(x, y, z, \mu, \nu) \quad \text{subject to CC}(x), \text{CC}(y). \]

The above problem is TU, and so it can be solved by LP relaxation, which is polynomial time solvable. Moreover, for given $\mu$ and $\nu$, the problem (P1.2LR) is completely separable with respect to $x$, $y$ and $z$. Namely, the optimal $x$ and $y$ are shortest paths under their respective link weights, and the optimal $z_{ij}^{kl}$ is obtained as: $z_{ij}^{kl} = 1$ if $\nu_{ij}^{kl} > p_{ij} p_{kl}$, and 0 otherwise. Now, the above optimization can be solved using a simple primal-dual method as described in Algorithm 2, where $\gamma_m$ is a positive diminishing step size, $M$ is the maximum number of iterations, and $w_{ij}$ is the weight of link $(i, j)$.
Note that steps 2.1-2.3 solve the relaxed problem (P1.2LR), and steps 2.4-2.5 are the subgradient-based update of the Lagrangian multipliers. The algorithm keeps the best path pair over all the iterations (step 2.6), and takes it as the final solution. Such a Lagrangian relaxation method for IPs does not guarantee an optimal solution due to the duality gap, but it has been very successful in solving many IPs [34]. Although the linear approximation in (P1.2L) may not be accurate in the high probability regime, it is an upper bound on the original (nonlinear) objective function.

Footnote 3: A matrix $A$ is said to be totally unimodular if the determinant of each square submatrix of $A$ is 0, 1, or $-1$ (the network flow conservation matrices are TU). If the constraint matrix of an ILP is TU, then its linear program relaxation has an integral optimal solution, which is equivalent to the optimal solution of the ILP. Hence, the optimal solution of such an ILP can be obtained by solving its linear program relaxation, which is polynomial time solvable.
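The upper-bound claim can be probed numerically. For a pair of disjoint paths, the objective of (P1.2L) reduces to $(\sum p_{ij} x_{ij})(\sum p_{kl} y_{kl})$, while the exact objective is $F(p, x) F(p, y)$. The Python sketch below compares the two on randomly drawn link probabilities; the sampling scheme, the path lengths of 1 to 6 hops, and the function names are assumptions chosen only for illustration.

```python
import random


def path_failure(probs):
    """Exact failure probability of a path whose links fail independently."""
    survive = 1.0
    for p in probs:
        survive *= 1.0 - p
    return 1.0 - survive


def exact_joint(px, py):
    """F(p, x) F(p, y) for two link-disjoint paths."""
    return path_failure(px) * path_failure(py)


def linearized_joint(px, py):
    """Second-order objective of (P1.2L) evaluated on the same pair of paths."""
    return sum(px) * sum(py)


def largest_gap(trials=1000, max_p=1.0, seed=0):
    """Largest (linearized - exact) gap over random disjoint path pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        px = [rng.uniform(0.0, max_p) for _ in range(rng.randint(1, 6))]
        py = [rng.uniform(0.0, max_p) for _ in range(rng.randint(1, 6))]
        exact, approx = exact_joint(px, py), linearized_joint(px, py)
        # The linearization should never fall below the exact value.
        assert exact <= approx + 1e-12
        worst = max(worst, approx - exact)
    return worst


gap_high_regime = largest_gap(max_p=1.0)
gap_low_regime = largest_gap(max_p=0.01)
```

The bound follows from the union bound $1 - \prod (1 - p_i) \le \sum p_i$ applied to each path, and the gap shrinks quickly as the link probabilities become small, consistent with neglecting the HOTs in the low failure probability regime.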

Algorithm 2 Lagrangian Relaxation: IND w/ DC
1: Initialization: $m = 0$, $\mu_{ij}(0) = p_{ij}$ and $\nu_{ij}^{kl}(0) = p_{ij} p_{kl}$, $\forall (i, j), (k, l) \in E$, and $p_{best} = 1$
2: While $m < M$
  2.1 Set $w_{ij} = \mu_{ij}(m) + \sum_{(k,l)} \nu_{ij}^{kl}(m)$, $\forall (i, j) \in E$; find shortest path $x(m)$
  2.2 Set $w_{ij} = \mu_{ij}(m) + \sum_{(k,l)} \nu_{kl}^{ij}(m)$, $\forall (i, j) \in E$; find shortest path $y(m)$
  2.3 $z_{ij}^{kl}(m) = 1$ if $\nu_{ij}^{kl}(m) > p_{ij} p_{kl}$, and 0 otherwise, $\forall (i, j), (k, l)$
  2.4 $\mu_{ij}(m+1) = [\mu_{ij}(m) + \gamma_m (x_{ij}(m) + y_{ij}(m) - 1)]^+$, $\forall (i, j) \in E$
  2.5 $\nu_{ij}^{kl}(m+1) = [\nu_{ij}^{kl}(m) + \gamma_m (x_{ij}(m) + y_{kl}(m) - 1 - z_{ij}^{kl}(m))]^+$, $\forall (i, j), (k, l) \in E$
  2.6 Set $(x_{best}, y_{best}) = (x(m), y(m))$ if $p(m) < p_{best}$, where $p(m)$ is the joint failure probability of $(x(m), y(m))$
  2.7 $m = m + 1$

Lemma 4: The objective function in (P1.2L) is an upper bound on the joint path failure probability $F(p, x) F(p, y)$.

Proof: See Appendix D.

Therefore, the linearization approach may work as well in other regimes, including the high probability regime. This is verified through simulations in Section V.

C. Path Pair Problem without Disjointness Constraint

The disjointness constraint is a necessary condition for surviving a single link failure in the traditional deterministic failure model. However, in a probabilistic model, a link may be shared if it is known that the link is very reliable. If link $(i, j)$ is shared by $x$ and $y$, then $(i, j)$'s failure leads to the simultaneous failure of both $x$ and $y$. Hence, the probability that both $x$ and $y$ fail can be written as

\[ F_S(p, x, y) + (1 - F_S(p, x, y)) F_{NS}(p, x, y), \tag{5} \]

where $F_S(p, x, y)$ is the probability that both $x$ and $y$ fail due to a shared link failure, and $F_{NS}(p, x, y)$ is the probability that both $x$ and $y$ fail due to the failure of non-shared links. For path pair $(x, y)$, let $E_{xy}$ denote the set of links shared by $x$ and $y$, i.e., $E_{xy} = \{(i, j) \in E : x_{ij} = 1, y_{ij} = 1\}$. Then, the probability $F_S(p, x, y)$ can be written as

\[ F_S(p, x, y) = 1 - \prod_{(i,j) \in E_{xy}} (1 - p_{ij}) = 1 - \prod_{(i,j) \in E} (1 - p_{ij} x_{ij} y_{ij}). \]

For a binary vector $v$, define its complement as $\bar{v} = \mathbf{1} - v$, where $\mathbf{1}$ is a vector of 1s with appropriate dimension.
Then, the vector $\bar{y}$ only includes the links which are not selected by $y$. Hence, the probability that $x$ fails due to the failure of non-shared links is equivalent to the probability that both $x$ and $\bar{y}$ fail due to the failure of the links shared by $x$ and $\bar{y}$. This probability can be subsequently expressed as $F_S(p, x, \bar{y})$, following the definition of $F_S(p, x, y)$. Similarly, $F_S(p, \bar{x}, y)$ denotes the probability that $y$ fails due to the failure of non-shared links. The probability $F_{NS}(p, x, y)$ is then given by

\[ F_{NS}(p, x, y) = F_S(p, x, \bar{y}) F_S(p, \bar{x}, y), \tag{6} \]

leading to the following formulation:

\[ \text{(P1.3)}: \quad \min_{x, y} \; F_S(p, x, y) + (1 - F_S(p, x, y)) F_{NS}(p, x, y) \quad \text{subject to CC}(x), \text{CC}(y). \]

The problem (P1.3) has an equivalent (if disjoint paths are optimal in (P1.3)) or better optimal solution compared to the problem (P1.2). Under the low failure probability regime, we can approximate the problem (P1.3) by an ILP as follows:

\[ \text{(P1.3L)}: \quad \min_{x, y, z} \; \sum_{(i,j)} p_{ij} z_{ij} + \sum_{(i,j)} \sum_{(k,l)} p_{ij} p_{kl} z_{ij}^{kl} \]
subject to CC($x$), CC($y$), and
\[ \text{(C1)} \quad z_{ij} \ge x_{ij} + y_{ij} - 1, \quad \forall (i, j) \in E \]
\[ \text{(C2)} \quad z_{ij}^{kl} \ge x_{ij} - y_{ij} + y_{kl} - x_{kl} - 1, \quad \forall (i, j), (k, l). \]

In constraint (C1), $z_{ij} = 1$ only if $x_{ij} = y_{ij} = 1$, which means that link $(i, j)$ is shared. Hence, the first term in the objective function is the joint failure probability due to the failure of shared links. In constraint (C2), $z_{ij}^{kl} = 1$ only if $x_{ij} = y_{kl} = 1$ and $y_{ij} = x_{kl} = 0$, which means links $(i, j)$ and $(k, l)$ are respectively used by $x$ and $y$, but neither of them is shared. Hence, the second term is the joint failure probability due to the failure of non-shared links. The formulation (P1.3L) is a standard ILP that can be solved using an ILP solver such as CPLEX. However, it can take unacceptably long to run because it is NP-complete in general. Inspired by the approximation (P1.3L), we propose a simple greedy algorithm, as shown in Algorithm 3.

Algorithm 3 Greedy: IND w/o DC
1: Set $w_{ij} = p_{ij}$, $\forall (i, j) \in E$; find shortest path $x$
2: Set $w_{ij} = p_{ij} \sum_{(k,l)} p_{kl} x_{kl}$ if $x_{ij} = 0$, and $w_{ij} = p_{ij}$ if $x_{ij} = 1$, $\forall (i, j) \in E$; find shortest path $y$

[Fig. 3. Example topology (nodes $s$, $u$, $t$).]
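To make expressions (5) and (6) concrete, the following Python sketch evaluates the joint failure probability of two paths that may share links and cross-checks it against a brute-force enumeration of all link states. Paths are represented as sets of links; the function names and the example topology and probabilities are assumptions made for this illustration only.

```python
from itertools import product


def failure_over(p, links):
    """1 - prod(1 - p_ij) over the given links; an empty set never fails."""
    survive = 1.0
    for e in links:
        survive *= 1.0 - p[e]
    return 1.0 - survive


def joint_failure_shared(p, x, y):
    """Expression (5): F_S + (1 - F_S) * F_NS for paths x, y given as link sets."""
    shared = x & y
    f_s = failure_over(p, shared)                                  # F_S(p, x, y)
    f_ns = failure_over(p, x - shared) * failure_over(p, y - shared)  # (6)
    return f_s + (1.0 - f_s) * f_ns


def joint_failure_bruteforce(p, x, y):
    """Sum the probability of every link-state outcome in which both paths fail."""
    links = sorted(x | y)
    total = 0.0
    for states in product([False, True], repeat=len(links)):
        prob = 1.0
        failed = set()
        for e, down in zip(links, states):
            prob *= p[e] if down else 1.0 - p[e]
            if down:
                failed.add(e)
        if failed & x and failed & y:
            total += prob
    return total


# Hypothetical example: both paths share the reliable first hop (s, u).
example_p = {("s", "u"): 0.01, ("u", "t"): 0.2, ("u", "v"): 0.1, ("v", "t"): 0.1}
example_x = {("s", "u"), ("u", "t")}
example_y = {("s", "u"), ("u", "v"), ("v", "t")}
example_joint = joint_failure_shared(example_p, example_x, example_y)
```

Here $F_S = 0.01$ and $F_{NS} = 0.2 \times 0.19 = 0.038$, so the joint failure probability is $0.01 + 0.99 \times 0.038 = 0.04762$. The enumeration is exponential in the number of links and is meant only as a correctness check on small instances.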
Similar to Algorithm 1, Step 1 finds a shortest path for $x$ using link weights $w_{ij} = p_{ij}$, $\forall (i, j)$, and this gives a primary path $x$ with minimum failure probability. For the backup path $y$, the weight of each link $(i, j)$ is set to the joint path failure probability due to the failure of link $(i, j)$ and the links in $x$. Hence, two different cases have to be considered. First, if link $(i, j)$ has not been selected by the primary path $x$, then its weight is set to the product of $(i, j)$'s failure probability $p_{ij}$ and the (approximated) failure probability $\sum_{(k,l)} p_{kl} x_{kl}$ of path $x$. If $(i, j)$ was selected by $x$, then $(i, j)$'s failure leads to joint path failure (provided that $y$ also selects $(i, j)$), and so its weight is set to $p_{ij}$. The shortest path under these link weights minimizes the (approximated) joint path failure probability for the given $x$ and is used as the backup path $y$. Note that if link $(i, j)$ is to be shared, its weight is set to a first-order value, i.e., $p_{ij}$, which is larger than the second-order weight in the non-shared case. Hence, only links with relatively low failure probability are likely to be shared.

D. Extension to Correlated Failures

As discussed in the introduction, many failure scenarios involve multiple links. Hence, link failure events may be

correlated. In order to account for such correlation, the path failure probability expressions must include conditional probabilities for joint link failures. Let $p_{ij|kl}$ be the probability of link $(i, j)$'s failure given $(k, l)$'s failure; then the joint link failure probability is given by $p_{kl} p_{ij|kl}$ or $p_{ij} p_{kl|ij}$. Note that a high value of the conditional probability $p_{ij|kl}$ implies strong correlation between the failures of $(i, j)$ and $(k, l)$. Consider the example topology in Fig. 3; the failure probability of the $s$-$t$ path can be expressed as

\[ p_{su} + p_{ut} - p_{su} p_{ut|su}. \tag{7} \]

Generally, it can be written as

\[ \sum_{(i,j) \in E(s,t)} p_{ij} \;-\; \sum_{(i,j) \in E(s,t)} \sum_{\substack{(k,l) \in E(s,t) \\ (k,l) \ne (i,j)}} p_{ij} p_{kl|ij} \;+\; \sum_{(i,j) \in E(s,t)} \sum_{\substack{(k,l) \in E(s,t) \\ (k,l) \ne (i,j)}} \sum_{\substack{(m,n) \in E(s,t) \\ (m,n) \ne (i,j) \\ (m,n) \ne (k,l)}} p_{ij} p_{kl|ij} p_{mn|ij,kl} \;+\; \text{high order terms}, \tag{8} \]

where $E(s, t)$ is the set of links on the $s$-$t$ path, and each unordered set of distinct links is counted once. These conditional probability expressions involve an exponential number of terms accounting for the joint failure probability of multiple links. Hence, formulating the above problems under correlated failures seems to be intractable. Due to this difficulty, [25] considers only the first order correlation, i.e., the conditional probability for every pair of links.

In order to better account for correlated failures, we propose a new model using probabilistic SRLGs. In our model, once an SRLG failure event occurs, its associated links fail with some probabilities. So, the link failures are correlated only if the links belong to the same SRLG (while in the link-wise model, the correlation is considered between every pair of links). Moreover, under the condition that an SRLG event occurs, its associated link failures are mutually independent, and thus the formulations developed in the independent model can be used. This enables a simple formulation for the path protection problems with correlated failures. More importantly, it can be used to model most correlated failure scenarios, as events leading to failures can be modeled as a probabilistic SRLG (PSRLG).

IV.
PSRLG-BASED CORRELATED FAILURE MODEL

We consider a single SRLG model where only one SRLG failure event can take place at a time. Let $\pi_r$ be the probability that the failed SRLG is $r \in R$; then we have $\sum_{r \in R} \pi_r = 1$. We refer to this model as the mutually exclusive PSRLGs. Note that the traditional deterministic SRLG model also assumes a single SRLG failure, and so our model of mutually exclusive PSRLGs is a probability-wise generalization of the traditional model.

A. Single Path Problem

Again, we start by considering a single path problem. Given that SRLG event $r$ happens, each link $(i,j)$ fails with probability $p^r_{ij}$, independently of the other links. Hence, the failure probability of path $x$ is given by $1 - \prod_{(i,j)} (1 - p^r_{ij} x_{ij})$, and according to the definition of $F$ in Section III, this probability can be denoted by $F(p^r, x)$ where $p^r = [p^r_{ij}, \forall (i,j) \in E]$. The single path problem can be simply written as

$$(\text{P2.1}): \quad \min_{x \in B^{|E|}} \; \sum_{r \in R} \pi_r F(p^r, x) \quad \text{subject to } CC(x).$$

Note that the path failure probability is averaged over all SRLGs because they are mutually exclusive. As shown in Section III-A, the single path problem under independent failures is an easy shortest path problem. However, the same problem in the correlated failures case (i.e., (P2.1)) can be shown to be difficult.

Theorem 2: The single-path problem (P2.1) is NP-complete.

Proof: This is proved by showing that the problem (P2.1) contains as a special case a Minimum-Color Single-Path (MCSiP) Problem which is NP-complete [35]. The MCSiP problem is stated as follows. Given a network graph $G = (V, E)$ and a set of colors $C = \{c_1, c_2, \ldots, c_K\}$, each edge is colored with one of the colors in $C$. The problem is to find a path from $s$ to $t$ that touches the minimum number of colors. Note that it assumes monochromatic edges (i.e., a single color for each edge), but this assumption can be easily relaxed by graph transformation. Namely, an edge with $k$ colors is replaced by $k$ serial edges (and $k - 1$ intermediate nodes), each of which is associated with a different color from the set of $k$ colors.
Hence, the MCSiP problem without the monochromatic assumption is also NP-complete. First, assume uniform failure probabilities, i.e., $\pi_r = 1/|R|, \forall r$ and $p^r_{ij} = q$ for all $(i,j)$ and $r$ such that $p^r_{ij} > 0$. Then, using the identity $\min f(x) = -\max -f(x)$, the objective in (P2.1) can be rewritten as

$$\min_x \sum_{r \in R} \pi_r \Big(1 - \prod_{(i,j) \in E} (1 - p^r_{ij} x_{ij})\Big) = \min_x \frac{1}{|R|} \sum_{r \in R} \Big(1 - \prod_{(i,j) \in r} (1 - q\, x_{ij})\Big) = 1 - \frac{1}{|R|} \max_x \sum_{r \in R} \prod_{(i,j) \in r} (1 - q\, x_{ij}). \tag{9}$$

Further, assume $q = 1$. Note that this case corresponds to the traditional SRLG model that assumes deterministic failures. Under this assumption, the sum being maximized in (9) represents the number of untouched SRLGs, since each product equals 1 exactly when the path uses no link of SRLG $r$. In other words, the problem finds a path that touches the minimum number of SRLGs. If SRLG is replaced by color, it is a Minimum-Color Single-Path Problem (without the monochromatic assumption). This shows that the problem (P2.1) contains an NP-complete problem, implying that it is NP-complete.

Due to this difficulty, its approximation is again considered. Under the low failure probability assumption (i.e., $p^r_{ij} \ll 1, \forall r, \forall (i,j)$), the objective function can be written as

$$\sum_{(i,j)} \Big(\sum_{r \in R} \pi_r p^r_{ij}\Big) x_{ij}. \tag{10}$$

We first begin by considering the simple case of uniform SRLG failure probability. Namely, let $\pi_r = 1/|R|, \forall r \in R$ and $p^r_{ij} = q, \forall r \in R(i,j), \forall (i,j) \in E$ where $R(i,j) = \{r \in R : p^r_{ij} > 0\}$.
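As a rough numerical check of approximation (10), the following Python sketch compares the exact objective of (P2.1) with its linearized form on a small instance. The two-SRLG data, the link names, and the function names are hypothetical and chosen only for illustration.

```python
from math import prod

def path_failure_exact(pi, p, path):
    """Objective of (P2.1): sum_r pi_r * (1 - prod over path links of (1 - p^r_ij))."""
    return sum(
        pi_r * (1.0 - prod(1.0 - p_r.get(link, 0.0) for link in path))
        for pi_r, p_r in zip(pi, p)
    )

def path_failure_approx(pi, p, path):
    """Linearized objective (10): sum over path links of w_ij = sum_r pi_r * p^r_ij."""
    return sum(
        sum(pi_r * p_r.get(link, 0.0) for pi_r, p_r in zip(pi, p))
        for link in path
    )

# Hypothetical instance (assumption): two mutually exclusive SRLGs, path s-u-t.
pi = [0.4, 0.6]
p = [
    {("s", "u"): 1e-3, ("u", "t"): 2e-3},  # SRLG 1 touches both links of the path
    {("u", "t"): 1e-3},                    # SRLG 2 touches only link (u, t)
]
path = [("s", "u"), ("u", "t")]

exact = path_failure_exact(pi, p, path)
approx = path_failure_approx(pi, p, path)
print(exact, approx)
```

In this instance the two values differ only by the second-order term $\pi_1 p^1_{su} p^1_{ut}$, and the linearized value is the larger of the two, as expected from a union bound.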

Observation 3: Under the uniform failure probability, minimizing the objective function (10) is equivalent to minimizing $\sum_{(i,j)} |R(i,j)|\, x_{ij}$ where each term $|R(i,j)|$ in the summation is the number of SRLGs to which link $(i,j)$ belongs. This shows that the path touching the minimum number of SRLGs has the lowest failure probability.

The traditional SRLG model falls into a special case of the uniform failure probability with $q = 1$, and hence, Observation 3 implies that in the traditional model, the number of SRLGs touched by a link is an important link-weight metric when finding a reliable path. In fact, some previous works have used this metric [0], and our result provides a theoretical basis for those works in the traditional SRLG model.

B. Path Pair Problem with Disjointness Constraint

The path pair problem can also be formulated in a simple form as in the case of a single path. Once an SRLG event $r$ occurs, paths $x$ and $y$ will fail with probabilities $F(p^r, x)$ and $F(p^r, y)$, respectively. In this case, their joint failure probability is given by the product $F(p^r, x) F(p^r, y)$ because the link failures are independent, under the condition that SRLG event $r$ has occurred. The problem can be formulated as follows:

$$(\text{P2.2}): \quad \min_{x, y} \; \sum_{r \in R} \pi_r F(p^r, x) F(p^r, y) \quad \text{subject to } CC(x),\; CC(y),\; DC(x, y).$$

Our probabilistic SRLG model has enabled us to express the joint failure probability in a product form leading to a simple formulation. Namely, the objective function in (P2.2) is the combination of the objective functions in (P1.2) and (P2.1). That is, for a given SRLG $r$, the joint path failure probability is equivalent to the joint path failure probability with link failure probability vector $p^r$ in the independent model, and those joint failure probabilities are averaged over all SRLGs, as done in (P2.1). This is in sharp contrast with the linkwise correlated failure model where the path failure probability would include terms of the conditional probabilities involving all the combinations of link failures.
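To make the product form concrete, the Python sketch below evaluates the objective of (P2.2) for a fixed primary path and two candidate link-disjoint backups. The topology, the SRLG memberships, and the probability values are hypothetical; only the formula itself is taken from the text.

```python
from math import prod

def F(p_r, path):
    """Failure probability of `path` given SRLG event r: 1 - prod(1 - p^r_ij)."""
    return 1.0 - prod(1.0 - p_r.get(link, 0.0) for link in path)

def joint_failure(pi, p, x, y):
    """Objective of (P2.2) for link-disjoint x and y: sum_r pi_r F(p^r,x) F(p^r,y)."""
    return sum(pi_r * F(p_r, x) * F(p_r, y) for pi_r, p_r in zip(pi, p))

# Hypothetical instance (assumption): three parallel two-hop routes from s to t.
pi = [0.5, 0.5]
p = [
    {("s", "a"): 0.8, ("s", "b"): 0.8},  # SRLG 1 covers a link of x and a link of y1
    {("c", "t"): 0.8},                   # SRLG 2 covers only a link of y2
]
x = [("s", "a"), ("a", "t")]
y1 = [("s", "b"), ("b", "t")]  # shares SRLG 1 with x
y2 = [("s", "c"), ("c", "t")]  # shares no SRLG with x

shared = joint_failure(pi, p, x, y1)
disjoint = joint_failure(pi, p, x, y2)
print(shared, disjoint)
```

Here the backup that shares an SRLG with the primary path has a joint failure probability of about 0.32, while the backup that shares no SRLG with it has a joint failure probability of zero, since every term of the sum contains a zero factor.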
It is obvious that the path pair problem is harder than the single path problem, and thus, we can infer from Theorem 2 that the problem (P2.2) will also be difficult. In fact, we can show its NP-completeness as well.

Theorem 3: The path-pair problem (P2.2) is NP-complete.

Proof: First, note that the objective value of (P2.2) is nonnegative, and so if any path pair results in zero objective value, then it is optimal. The probability $F(p^r, x)$ in (P2.2) can be written as

$$F(p^r, x) = 1 - \prod_{(i,j) \in E} (1 - p^r_{ij} x_{ij}) = 1 - \prod_{(i,j) \in r} (1 - p^r_{ij} x_{ij}), \tag{11}$$

because $p^r_{ij} = 0$ if link $(i,j)$ does not belong to SRLG $r$, i.e., $(i,j) \notin r$. Consequently, the function $F(p^r, x)$ becomes zero for a path $x$ which does not touch SRLG $r$, i.e., $x_{ij} = 0, \forall (i,j) \in r$. Hence if $x$ and $y$ do not share any SRLG, then the product $F(p^r, x) F(p^r, y)$ will be zero for every $r \in R$, thereby leading to zero objective value. This implies that any pair of SRLG-disjoint paths is an optimal solution to the problem (P2.2). Subsequently, the problem (P2.2) becomes an SRLG-disjoint paths problem if such a pair exists. Therefore, the problem (P2.2) is NP-complete because it includes (as a special case) the SRLG-disjoint paths problem which is NP-complete [5].

Algorithm 4 Greedy: MES w/ DC
  1: Set $w_{ij} = \sum_{r \in R} \pi_r p^r_{ij}, \forall (i,j) \in E$
  2: Find shortest path $x$
  3: Remove all the links used by $x$
  4: Set $w_{ij} = \sum_{(k,l)} x_{kl} \sum_{r \in R} \pi_r p^r_{ij} p^r_{kl}, \forall (i,j) \in E$
  5: Find shortest path $y$

Again, it is easy to show that when the link failure probabilities are low, the objective function of (P2.2) can be expressed as

$$\sum_{(i,j)} \sum_{(k,l)} \Big(\sum_{r \in R} \pi_r p^r_{ij} p^r_{kl}\Big) x_{ij}\, y_{kl}. \tag{12}$$

Next, we observe that the problem is still NP-complete even after the approximation.

Observation 4: Under the uniform failure probability (i.e., $\pi_r = 1/|R|, \forall r \in R$ and $p^r_{ij} = q, \forall r \in R(i,j), \forall (i,j) \in E$), the objective function (12) becomes, up to the constant factor $q^2/|R|$, $\sum_{(i,j)} \sum_{(k,l)} |R(i,j) \cap R(k,l)|\, x_{ij}\, y_{kl}$ where each term in the summation represents the number of SRLGs shared by the corresponding link pair in $x$ and $y$.
This obviously contains the SRLG-disjoint paths problem as a special case (if there exist SRLG-disjoint paths), and so, it is NP-complete (following the proof of Theorem 3). Subsequently, the approximated problem (12) is also NP-complete.

As this approximation is still difficult to solve, we propose a heuristic in Algorithm 4 using the approximations (10) and (12). For the primary path $x$, we set the link weights and find the shortest path according to (10). This will give a path having minimum failure probability. Then, all the links selected by $x$ are removed for disjointness. Finally, the obtained primary $x$ is substituted into (12), and the backup path $y$ is computed by minimizing (12) for fixed $x$. Hence, the greedy algorithm takes into account the correlation so that the links highly correlated with the selected links can be avoided.

One can also develop a Lagrangian relaxation based algorithm by linearizing the problem. However, this development is nearly identical to that in Section III-B2 and omitted for brevity. The path pair problem without disjointness constraint was also studied, but omitted for brevity.

V. PERFORMANCE EVALUATION

In this section, we evaluate and compare the performance of the algorithms developed in this paper. In particular, we consider the following four algorithms:
- The brute-force solution to the ILP formulations using the CPLEX solver (denoted by CPLEX).
- The Lagrangian relaxation for the ILP (Algorithm 2; denoted by LR).
- The greedy algorithms that select the first path with minimum failure probability and the second path with adjusted link weights to reduce the joint path failure

probability (i.e., Algorithms 1, 3 and 4; denoted by Greedy).
- The shortest disjoint paths algorithm that finds a pair of disjoint paths with minimum total weight, where the weight of a link is its failure probability (i.e., for each link $(i,j)$, $w_{ij} = p_{ij}$ in the independent model or $w_{ij} = \sum_r \pi_r p^r_{ij}$ in the PSRLG model; denoted by SDP). This algorithm is a straightforward approach simply selecting a shortest path pair, and as mentioned in Observation 2, such a pair does not necessarily have minimum joint failure probability. Note that this joint shortest path approach is in contrast with our heuristic, which is a one-by-one shortest path approach.

The protection quality (i.e., joint path failure probability) and run-time of the above algorithms will be compared.

First, we compare the ILP (P1.1L) and the probability-wise shortest path (PSP) algorithm that finds a shortest path under link weights $w_{ij} = p_{ij}, \forall (i,j)$. As discussed in Theorem 1 and Observation 1, the ILP finds a path with minimum failure probability while the PSP algorithm approximates the optimal path in the low failure probability regime. Because the PSP algorithm is used in our heuristics (Algorithms 1, 3 and 4) to find a path with minimum failure probability, this comparison will demonstrate the suitability of the PSP algorithm in our heuristics.

We generated 100 random graphs, each of which has 10 nodes and maximum node degree of 5, and is 3-connected⁴ from $s$ to $t$. To avoid the trap problem discussed in Section III, we assume that the graphs are bidirectional. In each graph, the failure probability of each link $(i,j)$ is assigned as follows: $p_{ij} = \alpha(\beta + (1 - \beta)u)$, where $\alpha$ and $\beta$ are constants in $[0, 1]$, and $u$ is a random number uniformly distributed on the interval $(0, 1)$. Note that as $\beta$ increases, the network approaches the uniform failure probability regime (i.e., $p_{ij} = q, \forall (i,j)$). For example, if $\beta = 1$, then $p_{ij} = \alpha, \forall (i,j)$, which implies the uniform failure probability regime. In contrast, if $\beta = 0$, $p_{ij}$ will be a random number from $(0, \alpha)$.
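The assignment rule above can be sketched in a few lines of Python. The function name and the example link list are hypothetical. Note also that Python's `random.random()` draws from $[0, 1)$ rather than the open interval $(0, 1)$, which is a negligible difference for this illustration.

```python
import random

def assign_link_probabilities(links, alpha, beta, rng=None):
    """Draw p_ij = alpha * (beta + (1 - beta) * u) with u uniform on [0, 1)."""
    rng = rng or random.Random()
    return {link: alpha * (beta + (1.0 - beta) * rng.random()) for link in links}

# Hypothetical link list (assumption), used only to show the two extreme cases.
links = [("s", "u"), ("u", "t"), ("s", "t")]

# beta = 1: every link gets exactly alpha (uniform regime).
uniform_case = assign_link_probabilities(links, alpha=0.2, beta=1.0, rng=random.Random(1))

# beta = 0: every link gets an independent random value below alpha.
mixed_case = assign_link_probabilities(links, alpha=0.2, beta=0.0, rng=random.Random(1))

print(uniform_case)
print(mixed_case)
```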
On the other hand, small $\alpha$ corresponds to the low failure probability regime and large $\alpha$ to the high failure probability regime.

Fig. 4 plots the path failure probability for each combination of $(\alpha, \beta)$ where each point is the average of the results of 100 random graphs. As expected from Observation 1, the PSP algorithm finds an optimal path in the uniform or low failure probability regime (large $\beta$ or small $\alpha$, respectively). Furthermore, even in the high probability regime (large $\alpha$), the PSP approximates the ILP very well. When the network is nearly in the uniform probability regime (large $\beta$), the shortest-hop path would be optimal, and the PSP obviously finds this shortest path. With small $\beta$, the network is in a mixed regime having high and low failure probabilities. In this case, both the ILP and PSP would select only the links with low failure probability whenever feasible. Then, it is highly likely that the same path is optimal after the links with high failure probability are removed. This is equivalent to being in the low probability regime, and therefore, the PSP performs comparably to the ILP. Overall, the PSP algorithm finds an optimal path in most cases, as desired.

⁴ In our simulation, once a graph is generated, we examine the 3-connectedness of the graph, and if it is not 3-connected, it is discarded.

[Fig. 4. Comparison of ILP and probability-wise shortest path (PSP). Axes: path failure probability versus scaling factor $\alpha$, from the low probability regime to the high probability regime, with curves for $\beta$ = 0.99, 0.7, 0.5, 0.3, 0.1 and 0.]

Next, we consider the problem of finding the path pair with minimum joint failure probability. The proposed greedy and LR-based algorithms are compared with CPLEX and the SDP algorithm using various topologies. The CPLEX solves the ILP version of every path pair problem. For LR-based iterative algorithms, the following parameters are used: maximum iteration number $M = 2 \times 10^4$ and step size $\gamma_m = 10^{-9}/m$.
The comparison is performed by changing the number of nodes, and 100 random geometric graphs are generated for each case. In a random geometric graph, $N$ nodes are randomly located on the $N \times N$ plane. Two nodes are connected if the distance from one to the other is less than 0.5. As in the above, each graph is 3-connected (from $s$ to $t$) with maximum node degree of 5. For each topology, we consider only a single $s$-$t$ pair, and average these values over 100 different topologies.

For the independent model, the failure probability of each link is set to a random number uniformly distributed on the interval $(0, 10^{-3})$, hence $p_{ij} \in (0, 10^{-3}), \forall (i,j) \in E$. For the PSRLG model, 20 SRLGs are generated for every graph and their failure event probabilities $\pi_r$ are set to uniformly distributed random numbers such that $\sum_r \pi_r = 1$. Each SRLG is associated with a circle whose center is randomly located on the plane and whose radius is a uniformly distributed random number in $(1, 1.5)$. An SRLG includes all the links touched by its circle. Once a link, say $(i,j)$, is included in SRLG $r$, its failure probability $p^r_{ij}$ is set to a uniformly distributed random number in $(0, 10^{-3})$ for the low probability regime and in $(0.5, 1)$ for the high probability regime.

Fig. 5 plots the joint path failure probability achieved by each algorithm in the independent model, with and without the disjointness constraints. Fig. 5(a) shows that with the disjointness constraint, the CPLEX always finds the best path pair, while the other algorithms achieve nearly the same protection performance as the CPLEX. Without the disjointness constraint, the CPLEX still performs better than any other, whereas the performance of the Lagrangian relaxation (LR) based algorithm is

substantially degraded as the number of nodes increases (see Fig. 5(b)). It is remarkable that our greedy algorithm, which does not explicitly attempt to solve the optimization problem, performs as well as the CPLEX solution, which finds a path pair by solving the optimization problem. This verifies our observation in Section III-B that for path protection against probabilistic failures, it is important to include the best path in the primary and backup path pair.

[Fig. 5. Joint path failure probability in independent model, averaged over 100 cases for each number of nodes. (a) With disjointness constraint (DC): CPLEX, Greedy, LR and SDP. (b) Without disjointness constraint (DC): CPLEX, Greedy and LR. Axes: joint path failure probability versus number of nodes.]

Notice that the LR-based algorithm performs well with the disjointness constraint. However, without the disjointness constraint, the LR-based algorithm does not perform very well, especially in large networks (see Fig. 5(b)). Clearly, the search space without the disjointness constraint is much larger than that with the disjointness constraint. This implies that in large networks, it requires many more iterations in order to find a better solution. This is part of the reason why the LR-based algorithm does not work well in large networks.

The run-time of each algorithm is shown in Fig. 6. As the number of nodes increases, the run-time of CPLEX increases exponentially. This shows that CPLEX takes a brute-force approach having exponential run-time and hence, may be prohibitively complex. The LR algorithm also takes a long time, but its run-time increases much more slowly than CPLEX. On the other hand, the greedy and SDP find a path pair in minimal time, almost independent of the problem size.
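The short run-time of the greedy heuristics reflects their structure: each needs only two shortest-path computations. As an illustration, the following is a minimal Python sketch of the steps of Algorithm 4 (the PSRLG variant with the disjointness constraint). It assumes undirected links stored as `frozenset` pairs and uses a plain Dijkstra search. The helper names and the small example instance are hypothetical and are not taken from the paper's implementation.

```python
import heapq

def link(a, b):
    """Undirected link key (an assumption made for this sketch)."""
    return frozenset((a, b))

def shortest_path(nodes, weights, s, t):
    """Dijkstra on undirected links with nonnegative weights.
    Returns the list of links on a shortest s-t path, or None if t is unreachable."""
    adj = {v: [] for v in nodes}
    for e, w in weights.items():
        i, j = tuple(e)
        adj[i].append((j, w, e))
        adj[j].append((i, w, e))
    dist = {v: float("inf") for v in nodes}
    dist[s] = 0.0
    prev = {}
    heap = [(0.0, s)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u, w, e in adj[v]:
            if d + w < dist[u]:
                dist[u] = d + w
                prev[u] = (v, e)
                heapq.heappush(heap, (dist[u], u))
    if s != t and t not in prev:
        return None
    path, v = [], t
    while v != s:
        v, e = prev[v]
        path.append(e)
    return path[::-1]

def greedy_psrlg_dc(nodes, links, pi, p, s, t):
    """Steps 1-5 of Algorithm 4; p[r] maps a link to p^r_ij (missing means 0)."""
    # Steps 1-2: w_ij = sum_r pi_r p^r_ij, then shortest path x.
    w1 = {e: sum(pi_r * p_r.get(e, 0.0) for pi_r, p_r in zip(pi, p)) for e in links}
    x = shortest_path(nodes, w1, s, t)
    if x is None:
        return None, None
    # Step 3: remove the links used by x.
    remaining = [e for e in links if e not in x]
    # Step 4: w_ij = sum over (k,l) in x of sum_r pi_r p^r_ij p^r_kl.
    w2 = {
        e: sum(pi_r * p_r.get(e, 0.0) * p_r.get(k, 0.0)
               for k in x for pi_r, p_r in zip(pi, p))
        for e in remaining
    }
    # Step 5: shortest path y under the adjusted weights.
    return x, shortest_path(nodes, w2, s, t)

# Hypothetical instance (assumption): three two-hop routes from s to t.
nodes = ["s", "a", "b", "c", "t"]
links = [link("s", "a"), link("a", "t"), link("s", "b"),
         link("b", "t"), link("s", "c"), link("c", "t")]
pi = [0.5, 0.5]
p = [{link("s", "a"): 1e-3, link("s", "b"): 2e-3},  # SRLG 1
     {link("s", "c"): 3e-3}]                        # SRLG 2
x, y = greedy_psrlg_dc(nodes, links, pi, p, "s", "t")
print(x, y)
```

In this instance the route through `b` has a lower marginal failure probability than the route through `c`, but it shares SRLG 1 with the primary path, so the adjusted weights of Step 4 steer the backup path through `c` instead.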
Therefore, both the greedy and the SDP algorithms find a fairly reliable pair of paths with short run-times.

The joint path failure probabilities in the PSRLG model are shown in Fig. 7. As in the independent model, the CPLEX always finds the most reliable pair of paths in the low failure probability regime (see Fig. 7(a)). Observe from Fig. 7(b) that the CPLEX performs as well in the high probability regime. As discussed earlier, this is because the linear approximation solved by CPLEX is an upper bound on the joint path failure probability. Fig. 7 also shows that our greedy algorithm provides better protection than the SDP. This is due to the fact that our greedy algorithm adjusts the failure probability before selecting the second path in order to reduce the joint failure probability, while the SDP algorithm fails to take correlation into account.

[Fig. 6. Run-time (on DELL Workstation T7400) in independent model, versus number of nodes, for CPLEX, LR and Greedy (each with and without DC) and SDP.]

As mentioned in Section III-C, relaxing the disjointness constraint (DC) should improve the protection quality (if a nondisjoint path pair is optimal). Fig. 8 shows that the joint failure probability is decreased by relaxing the DC. Observe further that our greedy algorithm finds a more reliable path pair than SDP, again verifying our observation in Section III-B that it is important to include the best path in the pair.

We further examine the algorithms using a US IP network topology shown in Fig. 9. As in the previous simulations, the link failure probability $p_{ij}$ under the independent model is uniformly distributed on the interval $(0, 10^{-3})$. For the PSRLG model, the nodes are located on the plane as seen in Fig. 9. As in the previous simulations, $|R|$ (= 5, 10, 20) SRLGs are randomly located on the plane and include the links touched by their circles of radius in $(1, 2)$. The link failure probability $p^r_{ij}$ is uniformly distributed on the interval $(0.5, 1)$.
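One way to realize the circle-based SRLG construction is sketched below in Python. It rests on an interpretation that the text does not spell out: a link is taken to be "touched" by a circle when the straight segment between its end nodes comes within the circle's radius of the center. The coordinates, function names, and parameters are hypothetical.

```python
import math
import random

def point_segment_distance(c, a, b):
    """Euclidean distance from point c to the segment with endpoints a and b."""
    (cx, cy), (ax, ay), (bx, by) = c, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(cx - ax, cy - ay)
    # Projection parameter of c onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((cx - ax) * dx + (cy - ay) * dy) / seg_len2))
    return math.hypot(cx - (ax + t * dx), cy - (ay + t * dy))

def make_psrlg(pos, links, center, radius, p_low, p_high, rng):
    """Return {link: p^r_ij} for links within `radius` of `center` (assumed meaning
    of 'touched'); each p^r_ij is drawn uniformly between p_low and p_high."""
    return {
        (i, j): rng.uniform(p_low, p_high)
        for (i, j) in links
        if point_segment_distance(center, pos[i], pos[j]) <= radius
    }

# Hypothetical node coordinates and links (assumption).
pos = {"a": (0.0, 0.0), "b": (4.0, 0.0), "c": (0.0, 5.0), "d": (4.0, 5.0)}
links = [("a", "b"), ("c", "d")]

# A circle centered at (2, 1) with radius 1.5 reaches link (a, b) but not (c, d).
srlg = make_psrlg(pos, links, center=(2.0, 1.0), radius=1.5,
                  p_low=0.5, p_high=1.0, rng=random.Random(0))
print(srlg)
```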
Under this assumption, 100 realizations of the SRLGs and failure probabilities are generated. In each realization, 100 source-destination pairs are randomly selected to find their disjoint