SIFT, GLOH, SURF descriptors Dipartimento di Sistemi e Informatica

Invariant local descriptors are useful for object recognition and tracking, robot localization and mapping, image registration and stitching, image retrieval, and augmented reality (http://blogs.oregonstate.edu/hess/sift-library-places-2nd-in-acm-mm-10-ossc/).

Scale-invariant detectors. In most object recognition applications the scale of the object in the image is unknown. Instead of extracting features at many different scales and then matching all of them, it is more efficient to design a function on the region that is the same for corresponding regions, even if they are at different scales. The problem can also be stated as follows: given two images of the same scene with a large scale difference between them, find the same interest points independently in each image.

For scale-invariant feature extraction it is necessary to detect structures that can be reliably extracted under scale changes. This is done by evaluating a signature function (a kernel) in the point neighbourhood and plotting the result as a function of the neighbourhood scale. Since it measures properties of the local neighbourhood at a certain scale, it should take a similar qualitative shape if two keypoints are centered on corresponding image structures; the function shape is only squashed or expanded by the scaling factor. Corresponding neighbourhood sizes can then be detected by searching for extrema of the signature function in both images.

We can consider the signature function f as a function of region size (circle radius). A common approach is to take a local maximum of this function: the solution is to search for maxima of suitable functions both in scale and in space over the images. (In the slide plots, f is shown versus region size for Image 1 and Image 2, related by a scale factor of 1/2; the maxima occur at region sizes s1 and s2.) The region size (scale) at which the maximum is achieved should be invariant to image scale.

A good function for scale detection has one stable, sharp peak (the slide contrasts two "bad" response curves, flat or with multiple peaks, with a "good" one that has a single sharp maximum over region size). For usual images a good function is one based on contrast (a sharp local intensity change). It is easier to look for zero-crossings of the 2nd derivative than for maxima.

There are a few approaches that are truly invariant to significant scale changes. Typically, such techniques assume that the scale change is the same in every direction, although they exhibit some robustness to weak affine deformations. The appropriate kernel for this is the scale-normalized Gaussian kernel G(x, σ) and its derivatives.

The classical approach is to generate a Gaussian scale-space representation of an image, i.e. a set of images obtained by convolving the image with isotropic (circular) Gaussian kernels of various sizes: a larger scale results in a smoother image. Existing methods search for local extrema in the 3D Gaussian scale-space representation of an image (x, y and scale); local extrema over scale of normalized derivatives indicate the presence of characteristic local structures. The motivation for generating a scale-space representation of a given image originates from the basic observation that real-world objects are composed of different structures at different scales, and may therefore appear in different ways depending on the scale of observation. The Gaussian scale-space guarantees that new structures are not created when going from a fine scale to any coarser scale. Its properties include linearity, shift invariance, non-enhancement of local extrema, scale invariance and rotational invariance.
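As a minimal sketch of this construction (assuming NumPy and SciPy; the function name gaussian_scale_space is illustrative), the representation can be built by convolving the image with isotropic Gaussians of increasing σ:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_scale_space(image, sigma0=1.6, k=2 ** 0.5, n_levels=5):
    """Convolve `image` with isotropic Gaussians of increasing scale.

    Returns a list of (sigma, smoothed image) pairs; a larger sigma yields a
    smoother image, and no new structures appear when moving to coarser scales.
    """
    levels = []
    for n in range(n_levels):
        sigma = sigma0 * (k ** n)
        levels.append((sigma, gaussian_filter(image.astype(float), sigma)))
    return levels

# usage on a random test image
img = np.random.rand(128, 128)
for sigma, L in gaussian_scale_space(img):
    print(f"sigma = {sigma:.2f}, smoothed range = [{L.min():.3f}, {L.max():.3f}]")
```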

Functions for determining scale are of the form f = Kernel * Image. Kernels:

Laplacian of Gaussians: $L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)$

Difference of Gaussians (an approximation of the Laplacian): $DoG = G(x, y, k\sigma) - G(x, y, \sigma)$

where the Gaussian is $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-\frac{x^2 + y^2}{2\sigma^2}}$

Both kernels are invariant to scale and rotation.
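A small numerical check, not part of the original slides, showing that the DoG kernel closely matches the scale-normalized LoG (the helper gaussian_2d and the sample sizes are illustrative choices, assuming NumPy):

```python
import numpy as np

def gaussian_2d(sigma, size=None):
    """Sampled isotropic 2D Gaussian kernel G(x, y, sigma)."""
    if size is None:
        size = int(2 * np.ceil(3 * sigma) + 1)   # sample on [-3*sigma, 3*sigma]
    r = np.arange(size) - size // 2
    xx, yy = np.meshgrid(r, r)
    return np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

sigma, k = 2.0, 1.6
size = int(2 * np.ceil(3 * k * sigma) + 1)

# scale-normalized LoG: sigma^2 * (G_xx + G_yy), via finite differences of G
G = gaussian_2d(sigma, size)
log_norm = sigma ** 2 * (np.gradient(np.gradient(G, axis=0), axis=0) +
                         np.gradient(np.gradient(G, axis=1), axis=1))

# DoG = G(k*sigma) - G(sigma), which approximates (k - 1) * sigma^2 * Laplacian(G)
dog = gaussian_2d(k * sigma, size) - gaussian_2d(sigma, size)

# the two kernels should be strongly correlated (close to 1 up to a scale factor)
corr = np.corrcoef(dog.ravel(), log_norm.ravel())[0, 1]
print(f"correlation between DoG and scale-normalized LoG kernels: {corr:.3f}")
```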

The method:
- build scale-space pyramids;
- examine all scales to identify scale-invariant features:
  - compute the Difference of Gaussians (DoG) pyramid or the Laplacian of Gaussians (LoG);
  - detect maxima and minima in scale space.

Harris-Laplacian [1] finds the local maxima of the Harris corner detector in space (image coordinates) and of the Laplacian in scale.

SIFT (Lowe) [2] finds the local maxima of the Difference of Gaussians in both space and scale.

[1] K. Mikolajczyk, C. Schmid. Indexing Based on Scale Invariant Interest Points. ICCV 2001.
[2] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.

Harris-Laplacian scale-invariant detector. The Harris-Laplacian method applies the Harris function at multiple scales, then selects the points for which the Laplacian attains a maximum over scales. Harris corner points are interest points with good rotational and illumination invariance, but they are not scale invariant. To obtain scale invariance, the second-moment matrix is modified using a Gaussian scale-space representation together with a Laplacian of Gaussian kernel. Since the computation of derivatives usually involves a stage of scale-space smoothing, an operational definition of the Harris operator requires two scale parameters: (i) a local derivation scale σ_D for smoothing before the computation of derivatives, and (ii) an integration scale σ_I for accumulating the operations on derivatives:

$\mu(x, y, \sigma_I, \sigma_D) = \sigma_D^2 \; g(\sigma_I) * \begin{bmatrix} L_x^2(x, y, \sigma_D) & L_x L_y(x, y, \sigma_D) \\ L_x L_y(x, y, \sigma_D) & L_y^2(x, y, \sigma_D) \end{bmatrix}$

where g(σ_I) is the Gaussian kernel of scale σ_I (integration scale), L(x, y) is the Gaussian-smoothed image, and L_x and L_y are its derivatives in the x and y directions, calculated using a Gaussian kernel of scale σ_D (differentiation scale). The multiplication by σ_D² is needed because derivatives must be normalized across scales according to D_m(x, σ) = σ^m L_m(x, σ), with m the order of the derivative.

The algorithm searches across multiple scales σ_n = k^n σ_0 (k = 1.4), setting σ_I = σ_n and σ_D = s·σ_I (s = 0.7). At each scale, corners are found as in the Harris method applied to the matrix M in an 8-point neighbourhood. An iterative algorithm then localizes the corner points spatially and chooses the characteristic scale: the Laplacian of Gaussians is used to judge whether each candidate point found at a given level forms a maximum in the scale direction (checking levels n-1 and n+1). The scale at which such a maximum is found is referred to as the characteristic scale; it is used in the following iterations, and points are spatially localized at the characteristic scale. Mikolajczyk and Schmid (2001) demonstrated that the LoG measure attains the highest percentage of correctly detected corner points in comparison to other scale-selection measures. At each iteration the corner point x_{k+1} is selected that maximizes the LoG within the scale neighbourhood; the process terminates when x_{k+1} = x_k.
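The following is a simplified sketch of the multi-scale Harris plus Laplacian scale-selection idea, not the exact iterative Mikolajczyk-Schmid algorithm; it assumes NumPy/SciPy, k = 1.4 and s = 0.7 follow the values quoted above, and the Harris constant alpha = 0.04 and the response threshold are common choices not stated in the slides:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_laplace_candidates(img, sigma0=1.4, k=1.4, n_scales=6, s=0.7, alpha=0.04):
    """Harris corners at several scales, kept only where the scale-normalized
    Laplacian is larger than at the neighbouring scales (simplified sketch)."""
    img = img.astype(float)
    harris, laplace = [], []
    for n in range(n_scales):
        sigma_i = sigma0 * k ** n            # integration scale
        sigma_d = s * sigma_i                # differentiation scale
        Lx = gaussian_filter(img, sigma_d, order=(0, 1))
        Ly = gaussian_filter(img, sigma_d, order=(1, 0))
        # scale-adapted second-moment matrix entries, normalized by sigma_d^2
        A = sigma_d ** 2 * gaussian_filter(Lx * Lx, sigma_i)
        B = sigma_d ** 2 * gaussian_filter(Ly * Ly, sigma_i)
        C = sigma_d ** 2 * gaussian_filter(Lx * Ly, sigma_i)
        harris.append(A * B - C ** 2 - alpha * (A + B) ** 2)
        # scale-normalized Laplacian used for scale selection
        Lxx = gaussian_filter(img, sigma_i, order=(0, 2))
        Lyy = gaussian_filter(img, sigma_i, order=(2, 0))
        laplace.append(sigma_i ** 2 * np.abs(Lxx + Lyy))
    harris, laplace = np.stack(harris), np.stack(laplace)
    keypoints = []
    for n in range(1, n_scales - 1):
        # strong Harris response AND Laplacian maximal w.r.t. adjacent scales
        mask = (harris[n] > 0.01 * harris[n].max()) \
             & (laplace[n] > laplace[n - 1]) & (laplace[n] > laplace[n + 1])
        ys, xs = np.nonzero(mask)
        keypoints += [(x, y, sigma0 * k ** n) for x, y in zip(xs, ys)]
    return keypoints
```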

(Figure: multi-scale Harris points; selection of points at the characteristic scale with the Laplacian; invariant points and their associated regions [Mikolajczyk & Schmid 01].)

SIFT: Scale Invariant Feature Transform. The SIFT method was introduced by D. Lowe in 2004 to represent visual entities according to their local properties. The method employs local features taken in correspondence of salient points (referred to as keypoints or SIFT points). Keypoints, through their SIFT descriptors, are used to characterize shapes with invariant properties. Image points selected as keypoints and their SIFT descriptors are robust under:
- luminance change (due to difference-based metrics);
- scale change (due to the scale-space);
- rotation (due to local orientations expressed with respect to the keypoint canonical orientation).

The original Lowe algorithm, given a grey-scale image:
- build a Gaussian-blurred image pyramid;
- subtract adjacent levels to obtain a Difference of Gaussians (DoG) pyramid (thus approximating the Laplacian of Gaussians);
- take local extrema of the DoG filters at different scales as keypoints;
- compute the keypoint dominant orientation.

For each keypoint:
- evaluate local gradients in a neighbourhood of the keypoint, with orientations relative to the keypoint orientation, and normalize;
- build a descriptor as a feature vector containing the salient keypoint information.

The motivation for using the DoG is that, while the scale-normalized Laplacian of Gaussian σ²∇²G(x, y, σ) provides strong responses to dark blobs of size σ and is good for capturing scale invariance, the calculation of the Laplacian is costly, so an approximation can be used. The DoG approximates the scale-normalized Laplacian σ²∇²G(x, y, σ) via the heat diffusion equation, up to a ½ multiplicative constant (Koenderink 92, for the luminance scale space). SIFT descriptors are obtained in the following three steps:
1. keypoint detection using local extrema of DoG filters;
2. computation of the keypoint orientation;
3. SIFT descriptor derivation.
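For reference, recent OpenCV builds expose this whole pipeline directly; a hedged usage example follows, where the image path is illustrative, contrastThreshold = 0.03 matches the contrast rule discussed later, and the other parameter values are just reasonable choices:

```python
import cv2

# Load a grey-scale image (the path is illustrative)
img = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)

# SIFT detector/descriptor: DoG keypoint detection, orientation assignment,
# and 128-dimensional descriptor computation in one call
sift = cv2.SIFT_create(nfeatures=500, contrastThreshold=0.03, edgeThreshold=10)
keypoints, descriptors = sift.detectAndCompute(img, None)

print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
for kp in keypoints[:3]:
    print(f"pos=({kp.pt[0]:.1f}, {kp.pt[1]:.1f}) scale={kp.size:.1f} angle={kp.angle:.1f}")
```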

Build Gaussian pyramids. Keypoints are detected as local scale-space maxima of the Differences of Gaussians. They correspond to local min/max points of the image I(x, y) that remain stable at different scales σ. Pyramid construction process:
- Blur: σ is doubled from the bottom to the top of each pyramid;
- Resample: pyramid images are sub-sampled from scale to scale;
- Subtract: adjacent levels of the pyramid images are subtracted.

Building the pyramids in detail. A first pyramid is obtained by the convolution operation at different σ such that σ_n = k^n σ_0; the images L(x, y, σ) = G(x, y, σ) * I(x, y) are grouped into a first octave. The DoG at a scale σ is obtained as the difference of two nearby scales separated by the constant factor k: D(x, y, σ) = L(x, y, kσ) - L(x, y, σ). After the first octave is completed, the image such that σ = 2σ_0 is subsampled by a factor of 2 and the next pyramid is obtained in the same way. The procedure is iterated for the next levels: σ_0 = k⁰σ, σ_1 = k¹σ, σ_2 = k²σ, σ_3 = k³σ, σ_4 = k⁴σ.

Octave: the original image is convolved with a set of Gaussians, so as to obtain a set of images that differ by the factor k in scale space; each of these sets is usually called an octave. Each octave is divided into a number of intervals s, such that k = 2^(1/s). For each octave, s + 3 images must be calculated. For example, if s = 2 then k = 2^(1/2) and we will have 5 images at different scales:
σ_0 = (2^(1/2))⁰ σ = σ
σ_1 = (2^(1/2))¹ σ = kσ
σ_2 = (2^(1/2))² σ = 2σ
σ_3 = (2^(1/2))³ σ = 2kσ
σ_4 = (2^(1/2))⁴ σ = 4σ
so σ is doubled every s = 2 intervals, and an octave corresponds to doubling the value of σ.
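A minimal sketch of building one octave with s intervals, hence s + 3 blurred images and s + 2 DoG images (assuming NumPy/SciPy; function and variable names are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_octave(img, sigma0=1.6, s=2):
    """One octave of the Gaussian pyramid: s + 3 images separated by k = 2**(1/s),
    plus the s + 2 Difference-of-Gaussian images from adjacent levels."""
    k = 2.0 ** (1.0 / s)
    gaussians = [gaussian_filter(img.astype(float), sigma0 * k ** n) for n in range(s + 3)]
    dogs = [g2 - g1 for g1, g2 in zip(gaussians[:-1], gaussians[1:])]  # L(k*sigma) - L(sigma)
    return gaussians, dogs

img = np.random.rand(256, 256)
gaussians, dogs = build_octave(img, sigma0=1.6, s=2)
print(len(gaussians), "Gaussian levels,", len(dogs), "DoG levels")

# next octave: subsample the image whose blur is 2*sigma0 by a factor of 2
# (with s = 2 that is the image at index s)
next_base = gaussians[2][::2, ::2]
```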

The choice of s = 2 is based on empirical verification of keypoint stability.

Gaussian kernel size: the number of samples increases as σ increases. For an N×N kernel, (N² - 1) sums and N² products are needed per pixel, and these grow as σ grows. A good compromise is to sample the kernel on the interval [-3σ, 3σ].

Computational savings can be obtained by considering that the Gaussian kernel is separable into the product of two one-dimensional convolutions, requiring 2N products and (2N - 2) sums per pixel; this makes the computational complexity O(N) in the kernel size instead of O(N²). Moreover, the convolution of two Gaussians of variances σ_1² and σ_2² is a Gaussian with variance σ_3² = σ_1² + σ_2². This property can be exploited to build the scale space, reusing convolutions already calculated.
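Both savings can be checked numerically; this is a small sketch assuming SciPy's gaussian_filter and gaussian_filter1d:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_filter1d

img = np.random.rand(128, 128)

# separability: a 2-D Gaussian blur equals two 1-D Gaussian blurs, one per axis
full_2d = gaussian_filter(img, sigma=2.0)
separable = gaussian_filter1d(gaussian_filter1d(img, 2.0, axis=0), 2.0, axis=1)
print("separable == 2-D:", np.allclose(full_2d, separable))

# cascade: blurring with sigma1 then sigma2 equals one blur with sqrt(sigma1^2 + sigma2^2)
sigma1, sigma2 = 1.2, 1.6
cascaded = gaussian_filter(gaussian_filter(img, sigma1), sigma2)
direct = gaussian_filter(img, np.sqrt(sigma1 ** 2 + sigma2 ** 2))
print("max difference between cascaded and direct blur:", np.abs(cascaded - direct).max())
```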


Detect maxima and minima of the DoGs in scale space. The local extrema of D(x, y, σ) are the local interest points. To detect them, at each scale level of the DoG pyramid every pixel p is compared to its 8 neighbours: if p is a local extremum (minimum or maximum) it is selected as a candidate keypoint. Each candidate keypoint is then compared to the 9 neighbours in the scale above and the 9 in the scale below. Only pixels that are local extrema across the 3 adjacent levels are promoted to keypoints, as sketched below.
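A compact sketch of this 3x3x3 extrema test over a stack of DoG levels (assuming NumPy/SciPy; `dogs` is a list of DoG images such as the one produced in the octave sketch above):

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def dog_extrema(dogs):
    """Return (level, y, x) of pixels that are local maxima or minima among
    their 26 neighbours in space and scale (3 adjacent DoG levels)."""
    D = np.stack(dogs)                                    # shape: (levels, H, W)
    max3 = maximum_filter(D, size=(3, 3, 3), mode="constant", cval=-np.inf)
    min3 = minimum_filter(D, size=(3, 3, 3), mode="constant", cval=+np.inf)
    is_ext = (D == max3) | (D == min3)
    is_ext[0], is_ext[-1] = False, False                  # need a level above and below
    return np.argwhere(is_ext)

# usage with the octave built earlier:
# candidates = dog_extrema(dogs); print(len(candidates), "candidate keypoints")
```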

Keypoint stability. The many points extracted from the maxima and minima of the DoGs have only pixel accuracy at best and may correspond to low-contrast, and therefore unreliable, points. To improve keypoint stability, a function is fitted to the local points in order to determine an interpolated position. Since points are defined in 3D (x, y, σ), it is a 3D curve-fitting problem. The interpolation uses the quadratic Taylor expansion of the Difference-of-Gaussian scale-space function, with the candidate keypoint as the origin:

$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T} \mathbf{x} + \frac{1}{2} \mathbf{x}^{T} \frac{\partial^2 D}{\partial \mathbf{x}^2} \mathbf{x}$

where D and its derivatives are evaluated at the candidate keypoint and x = (x, y, σ) is the offset from this point. The location of the extremum, x̂, is determined by taking the derivative of this function with respect to x and setting it to zero:

$\hat{\mathbf{x}} = -\left( \frac{\partial^2 D}{\partial \mathbf{x}^2} \right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$

If the offset x̂ is larger than 0.5 in any dimension, this indicates that the extremum lies closer to another candidate keypoint; in that case the candidate keypoint is changed and the interpolation is performed about that point instead. Otherwise the offset is added to the candidate keypoint to obtain the interpolated estimate of the location of the extremum.
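A sketch of this refinement at a single candidate keypoint, with the gradient and Hessian of D estimated by finite differences (assuming NumPy; the candidate is assumed not to lie on the border of the DoG volume):

```python
import numpy as np

def refine_keypoint(D, level, y, x):
    """Quadratic Taylor refinement of a DoG extremum.

    D is the stacked DoG volume of shape (levels, H, W); returns the offset
    (d_level, dy, dx) and the interpolated contrast D(x_hat)."""
    d = D[level, y, x]
    # first derivatives by central finite differences (order: scale, y, x)
    g = 0.5 * np.array([D[level + 1, y, x] - D[level - 1, y, x],
                        D[level, y + 1, x] - D[level, y - 1, x],
                        D[level, y, x + 1] - D[level, y, x - 1]])
    # second derivatives (Hessian) by finite differences
    H = np.empty((3, 3))
    H[0, 0] = D[level + 1, y, x] + D[level - 1, y, x] - 2 * d
    H[1, 1] = D[level, y + 1, x] + D[level, y - 1, x] - 2 * d
    H[2, 2] = D[level, y, x + 1] + D[level, y, x - 1] - 2 * d
    H[0, 1] = H[1, 0] = 0.25 * (D[level + 1, y + 1, x] - D[level + 1, y - 1, x]
                                - D[level - 1, y + 1, x] + D[level - 1, y - 1, x])
    H[0, 2] = H[2, 0] = 0.25 * (D[level + 1, y, x + 1] - D[level + 1, y, x - 1]
                                - D[level - 1, y, x + 1] + D[level - 1, y, x - 1])
    H[1, 2] = H[2, 1] = 0.25 * (D[level, y + 1, x + 1] - D[level, y + 1, x - 1]
                                - D[level, y - 1, x + 1] + D[level, y - 1, x - 1])
    offset = -np.linalg.solve(H, g)        # x_hat = -H^{-1} * gradient
    contrast = d + 0.5 * g @ offset        # interpolated value D(x_hat)
    # an |offset| > 0.5 in any dimension means the extremum is closer to a neighbour
    return offset, contrast
```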

Low-contrast keypoints are generally less reliable than high-contrast ones, and keypoints that respond to edges are unstable. Filtering can be performed respectively by:
- thresholding on simple contrast;
- thresholding based on the principal curvatures.

The local contrast can be obtained directly from D(x, y, σ) calculated at the keypoint location as updated in the previous step. Unstable extrema with low contrast are discarded according to Lowe's rule: |D(x̂)| < 0.03 (for image values in [0, 1]).

The DoG function has strong responses along edges. To eliminate keypoints that have poorly determined locations but high edge responses, note that for poorly defined peaks in the DoG function the principal curvature across the edge is much larger than the principal curvature along it. Finding these principal curvatures amounts to solving for the eigenvalues of the second-order Hessian matrix H of D(x, y, σ), computed from differences of adjacent DoG pixels; the eigenvalues of H are proportional to the principal curvatures of D(x, y, σ). The ratio of the two eigenvalues is sufficient for this purpose: if r is the ratio between the largest and the smallest eigenvalue, the quantity

$\frac{\operatorname{Tr}(H)^2}{\operatorname{Det}(H)} = \frac{(r + 1)^2}{r}$

depends only on the ratio of the two eigenvalues; it is minimal when the two eigenvalues are equal and increases as r increases. To keep the ratio between the two principal curvatures below a threshold r, a keypoint is rejected as poorly localized whenever Tr(H)²/Det(H) exceeds (r + 1)²/r. A sketch of the resulting test follows.
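This sketch assumes NumPy and uses r = 10, the value from Lowe's paper (not stated in the slides); D is a single DoG level and the keypoint is assumed to be away from the image border:

```python
import numpy as np

def passes_edge_test(D, y, x, r=10.0):
    """Reject keypoints on edges: the ratio Tr(H)^2 / Det(H) of the 2x2 spatial
    Hessian of the DoG image must stay below (r + 1)^2 / r."""
    dxx = D[y, x + 1] + D[y, x - 1] - 2 * D[y, x]
    dyy = D[y + 1, x] + D[y - 1, x] - 2 * D[y, x]
    dxy = 0.25 * (D[y + 1, x + 1] - D[y + 1, x - 1] - D[y - 1, x + 1] + D[y - 1, x - 1])
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                      # curvatures of opposite sign: not a true extremum
        return False
    return trace ** 2 / det < (r + 1) ** 2 / r
```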

(Figure: keypoints as maxima in D; after removing low-contrast points; after removing edge responses.)

Experimental evaluation of detectors w.r.t. scale change. Repeatability rate = (# correspondences) / (# possible correspondences, i.e. points present in both images).

The common drawback of both the LoG and the DoG representations is that local maxima can also be detected in the neighbourhood of contours or straight edges, where the signal changes in only one direction. These maxima are less stable because their localization is more sensitive to noise and to small changes in the neighbouring texture.

Orientation assignment. For a keypoint, let L be the Gaussian-smoothed image with the closest scale. For a region around the keypoint, compute the gradient magnitude and orientation using finite differences:

$\nabla L(x, y) = \begin{bmatrix} L(x+1, y) - L(x-1, y) \\ L(x, y+1) - L(x, y-1) \end{bmatrix}$

For this region:
- create a histogram with 36 bins for orientation;
- weight each point with a Gaussian window of 1.5σ.

The peak orientation of the histogram is the keypoint's canonical orientation. Any other local peak within 80% of the highest peak is used to create an additional keypoint with that orientation, so a local peak within 80% yields multiple orientations; about 15% of keypoints have multiple orientations. A sketch of the orientation computation follows.
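This sketch assumes NumPy; the square window radius is an illustrative choice, while the 36 bins, the 1.5σ Gaussian weighting and the 80% rule follow the text above:

```python
import numpy as np

def keypoint_orientations(L, y, x, sigma, radius=8, n_bins=36):
    """Dominant orientation(s) of a keypoint from an orientation histogram.

    L is the Gaussian-smoothed image at the keypoint's scale. Gradients in a
    square window are weighted by their magnitude and by a Gaussian of 1.5*sigma;
    every histogram peak within 80% of the highest one yields an orientation."""
    hist = np.zeros(n_bins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]            # finite-difference gradient
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            mag = np.hypot(gx, gy)
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * (1.5 * sigma) ** 2))
            angle = np.rad2deg(np.arctan2(gy, gx)) % 360.0
            hist[int(angle // (360.0 / n_bins)) % n_bins] += weight * mag
    peaks = np.nonzero(hist >= 0.8 * hist.max())[0]       # peaks within 80% of the max
    return [(b + 0.5) * (360.0 / n_bins) for b in peaks]  # bin-centre angles in degrees
```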

Once the local orientation and scale of a keypoint have been estimated, a scaled and oriented patch around the detected point can be extracted and used to form the feature descriptor.