Chapter 3 Salient Feature Inference


The building block of our computational framework for inferring salient structure is the procedure that simultaneously interpolates smooth curves, surfaces, or region boundaries, identifies discontinuities, and detects outliers. We call this procedure the salient feature inference engine. Its inputs are local estimates of orientation, with or without associated polarity; multiple orientation estimates are allowed at each location. Typical inputs include dots in a binary image, the combined responses obtained by applying orientation-selective filters to a greyscale image, and point and/or line correspondences across images obtained by correlation. Its output is an integrated description of the input, composed of surfaces, regions, labeled junctions, and directed or undirected curves.

In the following, we first outline the mechanism of our salient feature inference engine in section 3.1. We then define the orientation saliency tensor in section 3.2. The details of tensor voting and vote interpretation for undirected feature inference are given in section 3.3. Polarity is incorporated into the tensor framework in section 3.4. The process of integrating individual salient structures is described in section 3.5. Section 3.6 summarizes the time complexity of this procedure.

3.1 Overview of the inference engine

Our salient feature inference engine is a non-iterative procedure that uses tensor voting to infer salient geometric structures from local features. Figure 3.1 illustrates the mechanism of this engine. In practice, the domain space is digitized into discrete cells. We use a construct called an orientation saliency tensor, to be defined in section 3.2, to encode the saliency of various undirected features such as points, curve elements, and surface patch elements. The polarity of the orientations is encoded separately in a polarity saliency vector, to be defined in section 3.4.

Local feature detectors provide an initial local measure of features. The engine then encodes the input into a sparse saliency tensor field. Saliency inference is achieved by allowing the elements of the tensor field to communicate through voting. A voting function specifies the voter's estimate of local orientation/discontinuity and its saliency with respect to the collecting site, and is represented by tensor fields. The vote accumulator uses the collected information to update the saliency tensors at locations with features and to establish tensors at previously empty locations. From the resulting dense saliency tensor field, the vote interpreter produces feature saliency maps from which salient structures can be extracted. By taking into account the different properties of the different features, the integrator locally combines the features into a consistent salient structure. Once the generic saliency inference engine is defined, it can be used to perform many tasks simply by changing the voting function.

3.2 Orientation Saliency Tensor

The design of the orientation saliency tensor is based on the observation that, when estimating orientation locally for curves, surfaces, or region boundaries, discontinuity is indicated by the variation of orientation. Intuitively, this variation of orientation can be captured by an ellipsoid in 3-D and an ellipse in 2-D. Since it is only the shape of the ellipsoid/ellipse that describes the orientation variation, we can encode saliency, i.e. the confidence in the estimation, as the size of the ellipsoid/ellipse.
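Concretely, the encoding step can be sketched in 2-D as follows. This is an illustrative sketch only (the function names and unit saliencies are ours, not the thesis's): an unoriented point becomes a ball tensor with no preferred direction, while an oriented token, e.g. an edgel with a known normal, becomes a stick tensor.

```python
import numpy as np

def encode_point(saliency=1.0):
    """Unoriented token: a 2-D ball tensor (no preferred direction)."""
    return saliency * np.eye(2)

def encode_edgel(normal, saliency=1.0):
    """Oriented token: a stick tensor n n^T along the unit normal n,
    scaled by the detector's confidence (saliency)."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return saliency * np.outer(n, n)

# A curve element whose normal points along +y:
S = encode_edgel([0.0, 1.0])   # -> [[0, 0], [0, 1]]
```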
One way to represent an ellipsoid/ellipse is to use the distribution of points on the ellipsoid/ellipse, as described by its covariance matrix.

[Figure 3.1: Flowchart of the salient structure inference engine. A sparse input tensor field is densified by tensor voting with dense voting tensor fields; vote interpretation turns the dense orientation saliency tensor field into scalar saliency maps (surface, junction curve, junction point); structure integration then yields the integrated structure: labeled junctions, curves, surfaces, and regions.]

By decomposing a covariance matrix S into its eigenvalues λ1, λ2, λ3 and eigenvectors ê1, ê2, ê3, we can rewrite S as:

S = [ê1 ê2 ê3] diag(λ1, λ2, λ3) [ê1 ê2 ê3]ᵀ  (3.1)

Thus S = λ1 ê1ê1ᵀ + λ2 ê2ê2ᵀ + λ3 ê3ê3ᵀ, where λ1 ≥ λ2 ≥ λ3 and ê1, ê2, ê3 are the eigenvectors corresponding to λ1, λ2, λ3 respectively. The eigenvectors correspond to the principal directions of the ellipsoid/ellipse, and the eigenvalues encode its size and shape, as shown in Figure 3.2. S is a linear combination of outer-product tensors, and is therefore itself a tensor.

[Figure 3.2: An ellipsoid and its eigensystem — principal axes ê1, ê2, ê3 with extents λ1, λ2, λ3.]

This tensor representation of orientation/discontinuity estimation and its saliency allows the identification of features, such as points, curve elements, or surface patch elements, and the description of the underlying structure, such as tangents or normals, to be computed at the same time. Since the local characterizing orientation, such as a normal or a tangent, varies with the target structure to be detected, such as a curve or a surface, the physical interpretation of the saliency tensor is structure-specific. However, the definition of the saliency tensor is independent of the physical interpretation. This structure dependency merely means that different kinds of structure have to be inferred independently. In particular, one needs to realize that the physical meaning of a curve lying on a surface is different from that of a curve in space by itself. Figure 3.3 shows the shapes of the saliency tensors for various input features under different structure inferences. The inference of 2-D structures is considered to be embedded in 3-D and thus is not treated separately.

[Figure 3.3: Orientation saliency tensors.]

    feature                | tensor for surface inference | tensor for curve inference
    point                  | λ1 ≈ λ2 ≈ λ3                 | λ1 ≈ λ2 ≈ λ3
    curve element          | λ1 ≈ λ2 >> λ3                | λ1 >> λ2 ≈ λ3
    surface patch element  | λ1 >> λ2 ≈ λ3                | λ1 ≈ λ2 >> λ3

Every saliency tensor, denoted (λ1, λ2, λ3, θ, φ, ψ) in 3-D and (λ1, λ2, 0, θ, 0, 0) in 2-D, thus has 6 parameters, where θ, φ, and ψ are the angles of rotation about the z, y, and x axes respectively when aligning the coordinate system with the principal axes by the rotation Rψφθ. This saliency tensor is sufficient to describe feature saliency for unpolarized structures such as non-oriented curves and surfaces. When dealing with boundaries of regions or with oriented data (edgels), we need to associate a polarity with the feature orientation to distinguish the inside from the outside. Polarity saliency, which relates the polarity of the feature to its significance, will be addressed when we discuss region inference in section 3.4.

Notice that all the tensors shown in Figure 3.3 have the proper shape of either a stick, a plate, or a ball. In general, a saliency tensor S can be anywhere in between these cases, but the spectrum theorem [87] allows us to state that any such tensor can be expressed as a linear combination of these 3 cases, i.e.

S = (λ1 − λ2) ê1ê1ᵀ + (λ2 − λ3)(ê1ê1ᵀ + ê2ê2ᵀ) + λ3 (ê1ê1ᵀ + ê2ê2ᵀ + ê3ê3ᵀ)  (3.2)

where ê1ê1ᵀ describes a stick, (ê1ê1ᵀ + ê2ê2ᵀ) describes a plate, and (ê1ê1ᵀ + ê2ê2ᵀ + ê3ê3ᵀ) describes a ball. Figure 3.4 shows the shape of a general saliency tensor as described by its stick, plate, and ball components.

A typical example of a general tensor is one that represents the orientation estimates at a corner. Using orientation-selective filters, multiple orientations ûᵢ can be detected at corner points with strengths sᵢ. We can use the tensor sum Σᵢ sᵢ² ûᵢûᵢᵀ to compute the distribution of the orientation estimates. The addition of two orientation saliency tensors simply combines the distributions of the orientation estimates and the saliencies. This linearity of the saliency tensor enables us to combine votes and represent the result efficiently. Also, since the input to this voting process can itself be represented by saliency tensors, our saliency inference is a closed operation whose input, processing tokens, and output are of the same type.

[Figure 3.4: Decomposition of a general saliency tensor into a stick component (weight λ1 − λ2), a plate component (weight λ2 − λ3), and a ball component (weight λ3).]

3.3 Tensor Voting for Undirected Features

The voting function

Given a sparse feature saliency tensor field as input, each active site has to generate a tensor value at every location that describes the voter's estimate of the orientation at that location, and the saliency of the estimate. The value of the tensor vote cast at a given location is given by the voting function V(S, p), which characterizes the structure to be detected, where p is the vector joining the voting site p_v to the receiving site p_s, and S is the tensor at p_v.

Since all orientations are of equal significance, the voting function should be independent of the principal directions of the voting saliency tensor S. As the immediate goal of voting is to interpolate smooth surfaces or surface borders, the influence of a voter must be smooth and must decrease with distance. Moreover, votes for neighboring collectors must be of similar strength and shape. Thus, although the voting function has infinite extent in principle, in practice it vanishes beyond a predefined distance from the voter. While the voting function should take into account the orientation estimates of the voter, it suffices to consider each of the voter's orientation estimates individually and then use their combined effect for voting.

Representing the voting function by discrete tensor fields

We describe, in the following, two properties of the voting function that allow us to implement tensor voting efficiently using a convolution-like operation. The first property expresses the requirement that the voting function be independent of the principal directions of the voting saliency tensor. For any voting saliency tensor S = (λ1, λ2, λ3, θ, φ, ψ), the vote V(S, p) for a site p from the voter is:

V((λ1, λ2, λ3, θ, φ, ψ), p) = Rψφθ V((λ1, λ2, λ3, 0, 0, 0), Rψφθ⁻¹ p) Rψφθ⁻¹  (3.3)

The second property is due to the linearity of the saliency tensor. Recall that every saliency tensor can be interpreted as a linear combination of a stick, a plate, and a ball, as described in equation (3.2).
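The decomposition of equation (3.2) is straightforward to compute numerically. The sketch below (assuming numpy; the function name is ours) recovers the stick, plate, and ball components of a 3×3 symmetric saliency tensor from its eigensystem and checks that they reconstruct it:

```python
import numpy as np

def stick_plate_ball(S):
    """Decompose a symmetric 3x3 saliency tensor per eq. (3.2) into its
    stick, plate, and ball components."""
    w, V = np.linalg.eigh(S)            # eigenvalues in ascending order
    l3, l2, l1 = w                      # relabel so that l1 >= l2 >= l3
    e1, e2 = V[:, 2], V[:, 1]
    stick = (l1 - l2) * np.outer(e1, e1)
    plate = (l2 - l3) * (np.outer(e1, e1) + np.outer(e2, e2))
    ball = l3 * np.eye(3)               # e1 e1^T + e2 e2^T + e3 e3^T = I
    return stick, plate, ball

# The three components always sum back to the original tensor:
S = np.diag([3.0, 2.0, 1.0])
stick, plate, ball = stick_plate_ball(S)
assert np.allclose(stick + plate + ball, S)
```

For this diagonal example the components are diag(1, 0, 0), diag(1, 1, 0), and the identity, i.e. a unit stick, a unit plate, and a unit ball.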
If the voting function is linear with respect to the saliency tensor S, every vote can be decomposed into three components as:

V(S, p) = (λ1 − λ2) V(ê1ê1ᵀ, p) + (λ2 − λ3) V(ê1ê1ᵀ + ê2ê2ᵀ, p) + λ3 V(ê1ê1ᵀ + ê2ê2ᵀ + ê3ê3ᵀ, p)
        = (λ1 − λ2) V(S1, p) + (λ2 − λ3) V(S2, p) + λ3 V(S3, p)  (3.4)

where S1 = (1,0,0,θ,φ,ψ), S2 = (1,1,0,θ,φ,ψ), and S3 = (1,1,1,θ,φ,ψ). Notice that S1 is a stick tensor, S2 is a plate tensor, and S3 is a ball tensor. Consequently, rather than dealing with infinitely many shapes of voting saliency tensors, we only need to handle three shapes of orientation saliency tensors. Applying (3.3) to (3.4):

V(S, p) = (λ1 − λ2) Rψφθ V((1,0,0,0,0,0), Rψφθ⁻¹ p) Rψφθ⁻¹
        + (λ2 − λ3) Rψφθ V((1,1,0,0,0,0), Rψφθ⁻¹ p) Rψφθ⁻¹
        + λ3 Rψφθ V((1,1,1,0,0,0), Rψφθ⁻¹ p) Rψφθ⁻¹  (3.5)

This means that any voting function linear in S can be characterized by three saliency tensor fields, namely V1(p) = V((1,0,0,0,0,0), p), the vote field cast by a stick tensor describing the orientation [1 0 0]; V2(p) = V((1,1,0,0,0,0), p), the vote field cast by a plate tensor describing a plane with normal [0 0 1]; and V3(p) = V((1,1,1,0,0,0), p), the vote field cast by a ball tensor. This would seem to imply that we need to define 3 different tensor fields for each voting operation: one for the stick, one for the plate, and one for the ball. This is indeed what Guy and Medioni [28][29] did: they defined, from first principles, these 3 fields in 3-D (2 in 2-D). Thanks to our tensorial framework, we can instead derive the ball and plate tensor votes directly from the stick tensor votes. Since we only need to consider each of the voter's orientation estimates individually, V2 and V3 can in fact be obtained from V1 as:

V2(p) = (1/π) ∫₀^π Rψφθ V1(Rψφθ⁻¹ p) Rψφθ⁻¹ dψ,  with θ = 0, φ = 0
V3(p) = (1/π²) ∫₀^π ∫₀^π Rψφθ V1(Rψφθ⁻¹ p) Rψφθ⁻¹ dψ dφ,  with θ = 0  (3.6)

Because the voting function is smooth and has finite effective extent, we can use a finite, discrete version of these characterizing saliency tensor fields to represent the voting function efficiently.
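In 2-D, the derivation of the ball field from the stick field (the analogue of equation (3.6)) can be approximated by averaging rotated copies of V1 over orientation. The sketch below uses a placeholder Gaussian-decay stick field — the actual fields of Guy and Medioni [28][29] have a more elaborate decay — so only the rotational averaging itself reflects the text:

```python
import numpy as np

def rot(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

def stick_vote(p, sigma=2.0):
    """Placeholder stick field V1: a vote along [1, 0] with Gaussian decay
    (a stand-in for the first-principles field of Guy & Medioni)."""
    return np.exp(-p.dot(p) / (2.0 * sigma**2)) * np.outer([1.0, 0.0], [1.0, 0.0])

def ball_vote(p, n_samples=64):
    """2-D analogue of eq. (3.6): the ball field is the average of the stick
    field rotated over all orientations in [0, pi)."""
    acc = np.zeros((2, 2))
    for a in np.linspace(0.0, np.pi, n_samples, endpoint=False):
        R = rot(a)
        acc += R @ stick_vote(R.T @ p) @ R.T
    return acc / n_samples
```

At the voter itself, the averaged field is isotropic, as a ball vote should be: ball_vote(np.zeros(2)) equals ½·I.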

The voting process

Given the saliency tensor fields V1, V2, V3 that specify the voting function, voting proceeds, for each input site with orientation saliency tensor S(i,j) = (λ1, λ2, λ3, θ, φ, ψ) at location (i,j), by aligning each voting field with the principal directions θ, φ, ψ of the voting saliency tensor and centering the fields at (i,j). Each tensor contribution is weighted as described in equation (3.2), as illustrated for the 2-D case in Figure 3.5. This process is similar to a convolution with a mask, except that the output is a tensor instead of a scalar.

[Figure 3.5: Illustration of the voting process in 2-D. An input feature S is decomposed as (λ1 − λ2) times its stick component plus λ2 times its ball component; the voting fields V1(p) and V2(p) are aligned with the feature and combined with the same weights.]

Votes are accumulated by tensor addition. That is, the resulting saliency tensor U(i,j) at location (i,j) is:

U(i,j) = Σ_{(m,n) : (i,j) ∈ N(m,n)} V(S(m,n), p(i,j,m,n))  (3.7)

where N(m,n) defines the neighborhood of (m,n).
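A direct (unoptimized) 2-D implementation of the accumulation in equation (3.7): each input token adds its windowed vote field into a tensor accumulator by tensor addition. The vote function passed in here is a crude placeholder (Gaussian attenuation of the voter's own tensor), standing in for the aligned V1/V2 fields:

```python
import numpy as np

def vote_pass(tokens, grid_shape, vote_fn, radius):
    """Accumulate tensor votes, eq. (3.7): U(i,j) is the sum of V(S(m,n), p)
    over all voters (m,n) whose neighborhood contains (i,j)."""
    H, W = grid_shape
    U = np.zeros((H, W, 2, 2))
    for (i, j), S in tokens:
        for di in range(-radius, radius + 1):
            for dj in range(-radius, radius + 1):
                y, x = i + di, j + dj
                if 0 <= y < H and 0 <= x < W:
                    p = np.array([di, dj], dtype=float)
                    U[y, x] += vote_fn(S, p)   # tensor addition of the vote
    return U

# Placeholder vote function: the voter's tensor, attenuated with distance.
decay_vote = lambda S, p: np.exp(-p.dot(p) / 8.0) * S

tokens = [((2, 2), np.array([[1.0, 0.0], [0.0, 0.0]]))]   # one stick token
U = vote_pass(tokens, (5, 5), decay_vote, radius=2)
```

The accumulator receives the token's full tensor at its own cell and attenuated copies nearby, mirroring the convolution-like structure described above.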

The initial assessment of feature saliency at each site is kept implicitly in the form of a strong vote for the site. As a result, all sites, with or without an initial feature, are treated equally during the inference.

Vote interpretation

After all the input site saliency tensors have cast their votes, the vote interpreter builds feature saliency maps from the now-densified saliency tensor field by computing the eigensystem of the saliency tensor at each site. That is, the resulting saliency tensor U(i,j) at location (i,j) is decomposed into:

U(i,j) = (λ1 − λ2) ê1ê1ᵀ + (λ2 − λ3)(ê1ê1ᵀ + ê2ê2ᵀ) + λ3 (ê1ê1ᵀ + ê2ê2ᵀ + ê3ê3ᵀ)  (3.8)

Each saliency tensor is thus broken down into three components: (λ1 − λ2) ê1ê1ᵀ corresponds to an orientation, (λ2 − λ3)(ê1ê1ᵀ + ê2ê2ᵀ) corresponds to a locally planar junction whose normal is represented by ê3, and λ3 (ê1ê1ᵀ + ê2ê2ᵀ + ê3ê3ᵀ) corresponds to an isotropic junction. For surface inference, surface saliency is therefore measured by (λ1 − λ2), with the normal estimated as ê1, and curve junction saliency is measured by (λ2 − λ3), with the tangent estimated as ê3. For curve inference, curve saliency is measured by (λ1 − λ2), with the tangent estimated as ê1, and planar junction saliency is measured by (λ2 − λ3), with the normal estimated as ê3. In both cases, isotropic (or point) junction saliency is measured by λ3.

Notice that the feature saliency values depend not only on the saliency of the feature estimation, which is encoded as the norm of the orientation saliency tensor, but also on the distribution of orientation estimates, which is described by the shape of the tensor. It is interesting that, given perfect orientation data, this tensor representation of saliency feature inference produces feature saliency measurements identical to those derived by Guy and Medioni [29] from intuition.
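In 2-D, vote interpretation reduces to a per-site eigen-analysis: curve saliency is λ1 − λ2 (with ê1 the estimated normal or tangent, depending on the target structure) and junction saliency is λ2. A numpy sketch (names ours):

```python
import numpy as np

def saliency_maps_2d(U):
    """Vote interpretation for a dense 2-D field U of shape (H, W, 2, 2):
    returns the curve saliency map (l1 - l2), the junction saliency map (l2),
    and the principal direction e1 at every site."""
    w, V = np.linalg.eigh(U)          # batched; eigenvalues in ascending order
    l1, l2 = w[..., 1], w[..., 0]
    e1 = V[..., :, 1]                 # eigenvector of the largest eigenvalue
    return l1 - l2, l2, e1

# A site holding a horizontal stick plus a little isotropic accumulation:
U = np.zeros((1, 1, 2, 2))
U[0, 0] = np.array([[2.0, 0.0], [0.0, 0.5]])
curve, junction, e1 = saliency_maps_2d(U)
```

Here curve saliency is 1.5, junction saliency is 0.5, and ê1 points along the x axis; structures would then be extracted at the local maxima of these maps.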

Since the voting function is smooth, sites near salient structures will have high feature saliency values: the closer a site is to the salient structure, the higher its feature saliency value. Structures are thus located at the local maxima of the corresponding feature saliency maps. The vote interpreter hence can extract features by non-maxima suppression on the feature saliency maps. To extract structures represented by oriented salient features, we use the methods developed in [29]. Note that a maximal surface/curve is usually not an iso-surface/iso-contour. By definition, a maximal surface is a zero-crossing surface: the first derivative of the saliency changes sign as we traverse across the surface. By expressing a maximal surface as a zero-crossing surface, an iso-surface marching process can be used to extract it. A similar discussion applies to maximal curve extraction. Details for extracting various structures from feature saliency maps can be found in [29][77].

3.4 Incorporating polarity

So far, we have only considered the inference of undirected curves and surfaces from undirected features. However, in real applications we often deal with bounded regions. Computationally, it is economical to represent regions differentially, that is, by their boundaries. By following the contour of a region, one knows which side is the inside and which the outside. The simplest encoding for a region therefore is to associate a polarity, which denotes the inside of the object, with each curve element along the region boundary. It is precisely this polarity information that is expressed in the output of an edge detector, where the polarity of a directed edgel indicates which side is darker. As shown later, ignoring this information may be harmful. We therefore need to augment our representation scheme to express this knowledge, and to use it to perform directed curve and region grouping.
In the spirit of our methodology, we attempt to establish at every site in the domain space a measure we call polarity saliency, which relates to the possibility of having a directed curve passing through the site. In particular, a site close to two parallel features

with opposite polarities should have low polarity saliency, even though its undirected curve saliency will be high.

3.4.1 Encoding polarity

Figure 3.6 depicts the situation where the voter has polarity information. Since polarity is only defined for directed features, it is only necessary to associate polarity with the stick component of the orientation saliency tensor. Thus, a scalar value from -1 to 1 encodes polarity saliency: the sign of this value indicates which side of the stick component of the tensor is the intended direction, and its magnitude measures the degree of polarization at the site. Combined with the stick component of the orientation saliency tensor, polarity information is represented by a vector we call the polarity saliency vector, whose orientation is the same as that of the stick component of the corresponding orientation saliency tensor, and whose length is the product of the associated polarity saliency and the size of the stick component of the orientation saliency tensor. Initially, all input sites have polarity saliency of -1, 0, or 1 only.

[Figure 3.6: Encoding polarity saliency — a voter with polarity casts polarized votes along its best fitting structures.]

To propagate this polarity information, we need to modify the voting function accordingly. Note that polarity does not change along a path or a surface. Moreover, polarity is only associated with the stick component of the voting tensor, which only casts stick tensor votes. Hence, to incorporate polarity into the voting function, we only need

to assign either -1 or 1 as polarity saliency to each stick tensor vote, determined by the polarity of the voter and the estimated connection between the stick component of the voter and the site. Hence, for each location, each vote is composed of an orientation saliency tensor and a polarity saliency vector.

3.4.2 Inferring directed feature saliency

Once votes are collected, the directed curve saliency map can be computed by combining polarity saliency with undirected curve saliency. The polarity saliency is obtained by the vector sum of the polarity saliency vectors vᵢ(x,y) at every site (x,y). The length of the resulting vector gives the degree of polarization, and its direction indicates the intended polarity. Directed curve elements are those which have both high undirected curve saliency and high polarity saliency. The natural way to measure directed curve saliency hence is to take the product of the undirected curve saliency and the polarity saliency. The directed feature saliency DFS(x,y) hence is:

u(x,y) = Σᵢ vᵢ(x,y)
DFS(x,y) = λ1 sgn(u(x,y) · ê1(x,y)) ‖u(x,y)‖  (3.9)

where λ1 and ê1(x,y) correspond to the stick component of the orientation saliency tensor obtained at site (x,y). Once the directed curve saliency map is computed, region boundaries can be extracted by the method described in section 3.3.

3.5 Structure Integration

Our salient structure inference engine is able to report junctions, where the estimation of a single orientation fails and the extraction of curves and surfaces should stop. Since the process is designed for the estimation of curves and surfaces, it only detects, but does not precisely localize, junctions. In order to link multiple curves or surfaces together properly, junctions need to be correctly located and labeled. To rectify junction locations and compute junction labels, we need to start with the only reliable source of information, that is, the salient curves or surfaces. For each location with a high junction saliency value, salient curves or surfaces are identified and allowed to grow locally. The growing process can easily be accomplished by tensor voting. A junction is then located and labeled by computing the intersection of the grown smooth structures. The junction information in turn enables us to identify disconnected curves that need to be reconnected. This local process only adds a constant factor to the total processing time. Details of a specific implementation of the integration process can be found in [77].

3.6 Complexity

The efficiency of the salient feature inference engine is determined by the complexity of each of its 3 subprocesses. Given a voting function specified by the 3 discrete tensor fields with C pixels each, tensor voting requires O(Cn) operations, where n is the number of input features. The size of C depends on the scale of the voting function and is usually much smaller than the size of the domain space, m. It is therefore the density of the input features that determines the complexity of tensor voting. Since the input features are usually sparse, the total number of operations in tensor voting is normally a few times the size of the domain space. In vote interpretation, the saliency tensor at each location of the domain space is decomposed into its eigensystem, which takes O(1) for each of the m elements of the domain space. The maximal curve extraction process [77] only takes O(m). Therefore the complexity of vote interpretation is O(m). Since the process of structure integration only recomputes the structures for some of the input features, its complexity is bounded by that of the first tensor voting process, that is, O(Cn).

In summary, the time complexity of the salient structure inference engine is O(Cn + m), where C is the size of the tensor fields specifying the voting function, n is the number of input features, and m is the size of the domain space. Note that the operations involved are all local, and therefore can be implemented in O(1) time on parallel machines.
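As a closing sketch, the directed-saliency rule of section 3.4.2 (equation (3.9)) in code, assuming numpy (names ours). Two aligned polarity votes reinforce, while opposite votes cancel, which is exactly the behavior wanted at a site between two parallel features of opposite polarity:

```python
import numpy as np

def directed_saliency(polarity_votes, lam1, e1):
    """Eq. (3.9): u = sum of the polarity saliency vectors at a site;
    DFS = lam1 * sgn(u . e1) * |u|, i.e. the undirected stick saliency
    weighted by the signed degree of polarization."""
    u = np.sum(polarity_votes, axis=0)
    return lam1 * np.sign(u.dot(e1)) * np.linalg.norm(u)

e1 = np.array([1.0, 0.0])                       # stick direction at the site
same = directed_saliency([np.array([1.0, 0.0]), np.array([1.0, 0.0])], 1.0, e1)
opp = directed_saliency([np.array([1.0, 0.0]), np.array([-1.0, 0.0])], 1.0, e1)
# same -> 2.0 (reinforcement), opp -> 0.0 (cancellation)
```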