Probabilistic Graphical Models in Computer Vision (IN2329)


Csaba Domokos, Summer Semester

11. Parameter learning

Agenda for today's lecture:
- Probabilistic parameter learning
  - Recap: Regularized maximum conditional likelihood training
  - Numerical solution
  - Stochastic gradient descent
  - Pseudo-code of stochastic gradient descent
  - Using the output structure
  - Two-stage learning
  - End-to-end training: CRF as RNN
  - Piecewise learning
  - Piecewise learning: Deep Structured Model
  - Summary
- Loss function
  - 0/1 loss
  - Hamming loss
- Loss-minimizing parameter learning
  - Regularized loss minimization
  - Digression: Support Vector Machine
  - Redefining the loss function
  - Structured hinge loss
  - Structured Support Vector Machine
  - Subgradient
  - Subgradient descent minimization
  - Numerical solution
  - Calculating the subgradient
  - Subgradient descent S-SVM learning
  - Stochastic subgradient descent S-SVM learning
  - Summary of S-SVM learning
- Next lecture: Summary of the course
- Literature

11. Parameter learning

Agenda for today's lecture

Probabilistic parameter learning is the task of estimating the parameter $w$ that minimizes the expected dissimilarity of a parameterized model distribution $p(y \mid x, w)$ and the (unknown) conditional data distribution $d(y \mid x)$:

$$\mathrm{KL}_{tot}(d \,\|\, p) = \sum_{x \in \mathcal{X}} d(x) \sum_{y \in \mathcal{Y}} d(y \mid x) \log \frac{d(y \mid x)}{p(y \mid x, w)}.$$

The loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$ measures the cost of predicting $y'$ when the correct label is $y$. Loss-minimizing parameter learning is the task of estimating the parameter $w$ that minimizes the expected loss

$$\mathbb{E}_{y \sim d(y \mid x)}[\Delta(y, f(x))],$$

where $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} p(y \mid x, w)$ is a prediction function, $d(y \mid x)$ is the (unknown) conditional data distribution, and $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$ is a loss function.

Probabilistic parameter learning

Recap: Regularized maximum conditional likelihood training

Let $p(y \mid x, w) = \frac{1}{Z(x, w)} \exp(-\langle w, \varphi(x, y)\rangle)$ be a probability distribution parameterized by $w \in \mathbb{R}^D$, and let $\mathcal{D} = \{(x^n, y^n)\}_{n=1,\dots,N}$ be a set of i.i.d. training samples. For any $\lambda > 0$, regularized maximum conditional likelihood training chooses the parameter $w^*$ such that

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} L(w) = \operatorname{argmin}_{w \in \mathbb{R}^D} \lambda \|w\|^2 + \sum_{n=1}^{N} \big( \langle w, \varphi(x^n, y^n)\rangle + \log Z(x^n, w) \big).$$

Numerical solution

$$\nabla_w L(w) = 2\lambda w + \sum_{n=1}^{N} \big( \varphi(x^n, y^n) - \mathbb{E}_{y \sim p(y \mid x^n, w)}[\varphi(x^n, y)] \big).$$

In a naïve way, the complexity of the gradient computation is $O(K^{|V|} N D)$, where $N$ is the number of data samples, $D$ is the dimension of the weight vector, and $K = \max_{i \in V} |\mathcal{Y}_i|$ is the (maximal) number of possible labels of each output variable ($i \in V$).

$$L(w) = \lambda \|w\|^2 + \sum_{n=1}^{N} \big( \langle w, \varphi(x^n, y^n)\rangle + \log Z(x^n, w) \big).$$

In a naïve way, the complexity of line search is $O(K^{|V|} N D)$ (for each evaluation of $L$).

Stochastic gradient descent

If the training set $\mathcal{D}$ is too large, one can create a random subset $\mathcal{D}' \subset \mathcal{D}$ and estimate the gradient $\nabla_w L(w)$ on $\mathcal{D}'$ only. In an extreme case, one may randomly select only one sample $(x^n, y^n)$ and calculate the gradient

$$\nabla_w^{(x^n, y^n)} L(w) = 2\lambda w + \varphi(x^n, y^n) - \mathbb{E}_{y \sim p(y \mid x^n, w)}[\varphi(x^n, y)].$$

This approach is called stochastic gradient descent (SGD). Note that line search is not possible; therefore, we need an extra parameter, referred to as the step size $\eta_t$, for each iteration ($t = 1, \dots, T$).

Pseudo-code of stochastic gradient descent

Input: Training set $\{(x^n, y^n)\}_{n=1}^{N}$, number of iterations $T$, and step sizes $\{\eta_t\}_{t=1}^{T}$.
Output: The learned weight vector $w \in \mathbb{R}^D$.

1: $w \leftarrow 0$
2: for $t = 1, \dots, T$ do
3:   $(x^n, y^n) \leftarrow$ a randomly chosen training example
4:   $v \leftarrow \nabla_w^{(x^n, y^n)} L(w)$
5:   $w \leftarrow w - \eta_t v$
6: end for
7: return $w$

If the step size is chosen correctly (e.g., $\eta_t := \eta(t) = \frac{\eta}{t}$), then SGD converges to $\operatorname{argmin}_{w \in \mathbb{R}^D} L(w)$. However, it needs more iterations than gradient descent, but each iteration is (much) faster.

Using the output structure

Assume a set of factors $\mathcal{F}$ in a factor graph model, such that the vector $\varphi(x, y)$ decomposes as $\varphi(x, y) = [\varphi_F(x_F, y_F)]_{F \in \mathcal{F}}$. Thus

$$\mathbb{E}_{y \sim p(y \mid x, w)}[\varphi(x, y)] = \big[\mathbb{E}_{y \sim p(y \mid x, w)}[\varphi_F(x_F, y_F)]\big]_{F \in \mathcal{F}} = \big[\mathbb{E}_{y_F \sim p(y_F \mid x_F, w_F)}[\varphi_F(x_F, y_F)]\big]_{F \in \mathcal{F}},$$

where

$$\mathbb{E}_{y_F \sim p(y_F \mid x_F, w_F)}[\varphi_F(x_F, y_F)] = \sum_{y_F \in \mathcal{Y}_F} p(y_F \mid x_F, w_F)\, \varphi_F(x_F, y_F).$$

Factor marginals $\mu_F = p(y_F \mid x_F, w_F)$ are generally (much) easier to calculate than the complete conditional distribution $p(y \mid x, w)$. They can either be computed exactly (e.g., by applying belief propagation, yielding complexity $O(K^{F_{\max}} |V| N D)$, where $F_{\max} = \max_{F \in \mathcal{F}} |N(F)|$ is the maximal factor size) or approximated.
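To make the SGD update concrete, here is a minimal sketch in Python/NumPy, assuming (for illustration only) a single output variable with $K$ labels, so that the partition function and the expectation $\mathbb{E}_{y \sim p(y \mid x^n, w)}[\varphi(x^n, y)]$ can be computed by direct enumeration. The feature map `phi` and the toy data are invented for this example and are not part of the lecture.

```python
# Sketch of SGD for regularized maximum conditional likelihood training,
# with p(y | x, w) proportional to exp(-<w, phi(x, y)>) as in the recap slide.
import numpy as np

K, D_x = 4, 3                      # number of labels, input dimension (assumed)
D = K * D_x                        # dimension of the weight vector w

def phi(x, y):
    """Toy joint feature map: places x into the block of label y."""
    f = np.zeros(D)
    f[y * D_x:(y + 1) * D_x] = x
    return f

def cond_dist(x, w):
    """p(y | x, w) proportional to exp(-<w, phi(x, y)>)."""
    energies = np.array([w @ phi(x, y) for y in range(K)])
    p = np.exp(-(energies - energies.min()))   # shift for numerical stability
    return p / p.sum()

def sgd(data, lam=0.01, eta0=0.5, T=2000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(D)
    for t in range(1, T + 1):
        x, y = data[rng.integers(len(data))]   # one randomly chosen sample
        p = cond_dist(x, w)
        expected_phi = sum(p[yy] * phi(x, yy) for yy in range(K))
        grad = 2 * lam * w + phi(x, y) - expected_phi   # per-sample gradient
        w -= (eta0 / t) * grad                 # step size eta_t = eta / t
    return w

# Toy data: the label is the index of the largest input coordinate.
rng = np.random.default_rng(1)
data = [(x, int(np.argmax(x))) for x in rng.normal(size=(200, D_x))]
w = sgd(data)
acc = np.mean([np.argmin([w @ phi(x, y) for y in range(K)]) == y for x, y in data])
print("training accuracy:", acc)
```

In a real factor graph model, the expectation would of course not be computed by enumerating all labelings but from factor marginals, as described above.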

Two-stage learning

The idea here is to split the learning of the energy functions into two steps:
1. learning unary energies via classifiers, and
2. learning their importance and the weighting factors of the pairwise (and higher-order) energy functions:

$$E(y; x) = \sum_{i \in V} w_i E_i(y_i; x_i) + \sum_{(i,j) \in E} w_{ij} E_{ij}(y_i, y_j).$$

As an advantage, it results in a faster learning method. However, if the local classifiers for $E_i$ perform badly, then CRF learning cannot fix it.

End-to-end training: CRF as RNN

[Figure: CRF-RNN network. Source: Zheng et al., Conditional Random Fields as Recurrent Neural Networks, ICCV 2015.]

The CRF-RNN evaluation on the PASCAL VOC 2012 segmentation dataset compares mean IoU (%) for unaries only, for a fully connected CRF, and for end-to-end training.

Piecewise learning

Assume a set of factors $\mathcal{F}$ in a factor graph model, such that $\varphi(x, y) = [\varphi_F(x_F, y_F)]_{F \in \mathcal{F}}$. We now approximate $p(y \mid x, w)$ by a distribution that is a product over the factors:

$$p_{PW}(y \mid x, w) := \prod_{F \in \mathcal{F}} p_F(y_F \mid x_F, w_F) = \prod_{F \in \mathcal{F}} \frac{\exp(-\langle w_F, \varphi_F(x_F, y_F)\rangle)}{Z_F(x_F, w_F)}.$$

By minimizing the negative conditional log-likelihood function $L(w)$, we get

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} L(w) \approx \operatorname{argmin}_{w \in \mathbb{R}^D} \lambda \|w\|^2 - \sum_{n=1}^{N} \log \prod_{F \in \mathcal{F}} p_F(y_F^n \mid x_F^n, w_F)$$
$$= \operatorname{argmin}_{w \in \mathbb{R}^D} \sum_{F \in \mathcal{F}} \Big( \lambda \|w_F\|^2 + \sum_{n=1}^{N} \big( \langle w_F, \varphi_F(x_F^n, y_F^n)\rangle + \log Z_F(x_F^n, w_F) \big) \Big).$$

Consequently, piecewise training chooses the parameters $w = [w_F]_{F \in \mathcal{F}}$ as

$$w_F \in \operatorname{argmin}_{w_F} \lambda \|w_F\|^2 + \sum_{n=1}^{N} \big( \langle w_F, \varphi_F(x_F^n, y_F^n)\rangle + \log Z_F(x_F^n, w_F) \big).$$

One can perform gradient-based training for each factor as long as the individual factors remain small. Comparing $p_{PW}(y \mid x, w)$ with the exact $p(y \mid x, w)$, we see that the exact $Z(w)$ does not factorize into a product of simpler terms, whereas its piecewise approximation $Z_{PW}(w)$ factorizes over the set of factors. The simplification made by piecewise training of CRFs resembles two-stage learning.
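The remark that the exact $Z(w)$ does not factorize, while $Z_{PW}(w)$ does, can be checked numerically on a tiny model. The sketch below (a toy chain with random unary and pairwise energies, invented for this illustration) compares the exact partition function, which sums over all joint labelings, with the piecewise product of local partition functions $Z_F$.

```python
# Exact Z versus the piecewise approximation Z_PW for a toy chain model.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_vars, K = 4, 3                                 # chain of 4 variables, 3 labels each
unary = rng.normal(size=(n_vars, K))             # E_i(y_i), toy values
pair = rng.normal(size=(n_vars - 1, K, K))       # E_{i,i+1}(y_i, y_{i+1}), toy values

def energy(y):
    return sum(unary[i, y[i]] for i in range(n_vars)) + \
           sum(pair[i, y[i], y[i + 1]] for i in range(n_vars - 1))

# Exact Z: sums over K^n_vars joint labelings (only feasible for tiny models).
Z = sum(np.exp(-energy(y)) for y in itertools.product(range(K), repeat=n_vars))

# Piecewise Z_PW: a product of local partition functions, one per factor.
Z_PW = np.prod([np.exp(-unary[i]).sum() for i in range(n_vars)]) * \
       np.prod([np.exp(-pair[i]).sum() for i in range(n_vars - 1)])

print("log Z    =", np.log(Z))
print("log Z_PW =", np.log(Z_PW))   # decomposes into a sum of local log Z_F terms
```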


Piecewise learning: Deep Structured Model

[Figure. Source: Lin et al., Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation, CVPR 2016. The reported mean IoU is on the PASCAL VOC 2012 segmentation dataset.]

Summary

Regularized maximum conditional likelihood training chooses the parameter $w$, for $\lambda > 0$, such that

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} \lambda \|w\|^2 + \sum_{n=1}^{N} \big( \langle w, \varphi(x^n, y^n)\rangle + \log Z(x^n, w) \big).$$

The gradient might be expensive to calculate:

$$\nabla_w L(w) = 2\lambda w + \sum_{n=1}^{N} \big( \varphi(x^n, y^n) - \mathbb{E}_{y \sim p(y \mid x^n, w)}[\varphi(x^n, y)] \big).$$

Stochastic gradient descent: the gradient is estimated on a subset of the training samples.

Using the output structure: factor marginals $\mu_F = p(y_F \mid x_F, w_F)$ are generally (much) easier to calculate than the complete conditional distribution $p(y \mid x, w)$:

$$\mathbb{E}_{y \sim p(y \mid x, w)}[\varphi(x, y)] = \Big[ \sum_{y_F \in \mathcal{Y}_F} p(y_F \mid x_F, w_F)\, \varphi_F(x_F, y_F) \Big]_{F \in \mathcal{F}}.$$

Summary cont'd

Two-stage learning: first learn and fix the unary energies, then learn the weighting factors for the energy functions.

Piecewise training chooses the parameters $w = [w_F]_{F \in \mathcal{F}}$ as

$$w_F \in \operatorname{argmin}_{w_F} \lambda \|w_F\|^2 + \sum_{n=1}^{N} \big( \langle w_F, \varphi_F(x_F^n, y_F^n)\rangle + \log Z_F(x_F^n, w_F) \big).$$

Loss function

The goal is to make a prediction $y \in \mathcal{Y}$, as good as possible, about unobserved properties (e.g., the class label) of a given data instance $x \in \mathcal{X}$. In order to measure the quality of a prediction $f : \mathcal{X} \to \mathcal{Y}$, we define a loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$, where $\Delta(y, y')$ measures the cost of predicting $y'$ when the correct label is $y$.

Let us denote the model distribution by $p(y \mid x)$ and the true (conditional) data distribution by $d(y \mid x)$. The quality of a prediction can be expressed by the expected loss (a.k.a. risk):

$$R^\Delta_f(x) := \mathbb{E}_{y \sim d(y \mid x)}[\Delta(y, f(x))] = \sum_{y \in \mathcal{Y}} d(y \mid x)\, \Delta(y, f(x)) \approx \mathbb{E}_{y \sim p(y \mid x)}[\Delta(y, f(x))],$$

assuming that $p(y \mid x) \approx d(y \mid x)$.

0/1 loss

In general, the loss function is application dependent. Arguably one of the most common loss functions for labelling tasks is the 0/1 loss:

$$\Delta_{0/1}(y, y') = [\![ y \neq y' ]\!] = \begin{cases} 0, & \text{if } y = y' \\ 1, & \text{otherwise.} \end{cases}$$

Minimizing the expected loss of the 0/1 loss yields

$$y^* \in \operatorname{argmin}_{y' \in \mathcal{Y}} \mathbb{E}_{y \sim p(y \mid x)}[\Delta_{0/1}(y, y')] = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{y \in \mathcal{Y}} p(y \mid x)\, \Delta_{0/1}(y, y') = \operatorname{argmin}_{y' \in \mathcal{Y}} \sum_{y \in \mathcal{Y},\, y \neq y'} p(y \mid x)$$
$$= \operatorname{argmax}_{y' \in \mathcal{Y}} p(y' \mid x) = \operatorname{argmin}_{y' \in \mathcal{Y}} E(y'; x).$$

This shows that the optimal prediction $f(x) = y^*$ in this case is given by MAP inference.

Hamming loss

Another popular choice of loss function is the Hamming loss, which counts the fraction of mis-labeled variables:

$$\Delta_H(y, y') = \frac{1}{|V|} \sum_{i \in V} [\![ y_i \neq y'_i ]\!].$$

For example, in semantic image segmentation, the Hamming loss is proportional to the number of mis-classified pixels, whereas the 0/1 loss assigns the same cost to every labeling that is not pixel-by-pixel identical to the correct one.

The expected loss of the Hamming loss takes the form (see Exercise)

$$R^{\Delta_H}_f(x) = 1 - \frac{1}{|V|} \sum_{i \in V} p(y_i = f(x)_i \mid x),$$

which is minimized by predicting with $f(x)_i = \operatorname{argmax}_{y_i \in \mathcal{Y}_i} p(y_i \mid x)$. To evaluate this prediction rule, we rely on probabilistic inference.
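A minimal sketch contrasting the two losses on a toy labeling (the arrays below are made up for illustration, not lecture data):

```python
# 0/1 loss versus Hamming loss on a small labeling with |V| = 6 variables.
import numpy as np

y_true = np.array([0, 1, 1, 2, 2, 2])   # "correct" labeling
y_pred = np.array([0, 1, 2, 2, 2, 2])   # prediction differs in one variable

zero_one = 0.0 if np.array_equal(y_true, y_pred) else 1.0
hamming = np.mean(y_true != y_pred)      # fraction of mis-labeled variables

print(zero_one)   # 1.0  -- a single mistake already incurs the full loss
print(hamming)    # 0.1666... = 1/6
```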


Loss-minimizing parameter learning

Let $\mathcal{D} = \{(x^1, y^1), \dots, (x^N, y^N)\} \subseteq \mathcal{X} \times \mathcal{Y}$ be a set of i.i.d. samples from the (unknown) data distribution $d(y \mid x)$, and let $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$ be a loss function. The task is to find a weight vector $w$ that leads to minimal expected loss, that is,

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} \mathbb{E}_{y \sim d(y \mid x)}[\Delta(y, f(x))]$$

for a prediction function $f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} g(x, y; w)$, where $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is an auxiliary function parameterized by $w \in \mathbb{R}^D$.

Pros:
- We directly optimize for the quantity of interest, i.e., the expected loss.
- We do not need to compute the partition function $Z$.

Cons:
- There is no probabilistic reasoning to find $w$.
- We need to know the loss function already at training time.

Regularized loss minimization

Let us define the auxiliary function as

$$g(x, y; w) := -E(y; x, w) = -\langle w, \varphi(x, y)\rangle.$$

We aim to find the parameter $w$ that minimizes

$$\mathbb{E}_{y \sim d(y \mid x)}[\Delta(y, f(x))] = \mathbb{E}_{y \sim d(y \mid x)}\big[\Delta\big(y, \operatorname{argmax}_{y' \in \mathcal{Y}} g(x, y'; w)\big)\big].$$

However, $d(y \mid x)$ is unknown, hence we use the approximation

$$\mathbb{E}_{y \sim d(y \mid x)}\big[\Delta\big(y, \operatorname{argmax}_{y' \in \mathcal{Y}} g(x, y'; w)\big)\big] \approx \frac{1}{N} \sum_{n=1}^{N} \Delta\big(y^n, \operatorname{argmax}_{y \in \mathcal{Y}} g(x^n, y; w)\big).$$

Moreover, we add the regularizer $\lambda \|w\|^2$ in order to avoid overfitting. Therefore, we get the objective

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^{N} \Delta\big(y^n, \operatorname{argmax}_{y \in \mathcal{Y}} g(x^n, y; w)\big).$$

Digression: Support Vector Machine

Let us consider the binary classification problem. Suppose we are given a set of labeled points $\{(x_1, t_1), \dots, (x_N, t_N)\}$ (i.e., a training set), where $x_n \in \mathbb{R}^D$ and $t_n \in \{-1, 1\}$ for all $n = 1, \dots, N$. The goal is to find a hyperplane $y(x) := \langle w, x\rangle + w_0$ separating the input data according to their labels.

[Figure: maximum-margin separating hyperplane. Source: C. Bishop, Pattern Recognition and Machine Learning, 2006.]

More precisely, $y(x_n) > 0$ for points having $t_n = 1$ and $y(x_n) < 0$ for points having $t_n = -1$, that is, $t_n y(x_n) \geq 1$ for all training points (up to rescaling of $w$ and $w_0$). If such a hyperplane exists, then we say the training set is linearly separable.

Digression: Support Vector Machine

[Figure: geometry of the linear decision function $y(x)$. Source: C. Bishop, Pattern Recognition and Machine Learning, 2006.]

We want to solve the following minimization problem:

$$w^* \in \operatorname{argmin}_{w} \|w\|^2, \quad \text{subject to } t_n (\langle w, x_n\rangle + w_0) \geq 1 \text{ for all } n = 1, \dots, N.$$

Since the training set is not necessarily linearly separable, we instead consider the following minimization for $\lambda > 0$:

$$w^* \in \operatorname{argmin}_{w} \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^{N} \max\big(0, 1 - t_n (\langle w, x_n\rangle + w_0)\big),$$

where $l(y) = \max(0, 1 - t y)$ is called the hinge loss function.
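As a rough illustration (not from the slides), the regularized hinge-loss objective above can be minimized by subgradient descent; the 2-D toy data below are generated for this example only.

```python
# Subgradient descent on lam*||w||^2 + (1/N) sum_n max(0, 1 - t_n(<w,x_n> + w0)).
import numpy as np

rng = np.random.default_rng(0)
N = 100
t = rng.choice([-1.0, 1.0], size=N)                  # labels t_n in {-1, +1}
X = rng.normal(size=(N, 2)) + np.outer(t, [2.0, 0.0])  # two noisy clusters

lam, T = 0.01, 2000
w, w0 = np.zeros(2), 0.0
for it in range(1, T + 1):
    margins = t * (X @ w + w0)
    active = margins < 1                             # points violating the margin
    gw = 2 * lam * w - (t[active][:, None] * X[active]).sum(axis=0) / N
    g0 = -t[active].sum() / N
    eta = 1.0 / it                                   # diminishing step size
    w, w0 = w - eta * gw, w0 - eta * g0

print("w =", w, "w0 =", w0)
print("training accuracy:", np.mean(np.sign(X @ w + w0) == t))
```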

Redefining the loss function

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^{N} \Delta\big(y^n, \operatorname{argmax}_{y \in \mathcal{Y}} g(x^n, y; w)\big).$$

Note that the loss term $\Delta(y, \operatorname{argmax}_{y' \in \mathcal{Y}} g(x, y'; w))$ is piecewise constant, hence discontinuous, so we cannot use gradient-based techniques. As a remedy, we replace $\Delta(y, y')$ with a well-behaved function $l(x, y; w)$ that is continuous and convex with respect to $w$. Typically, $l$ is chosen such that it is an upper bound to $\Delta$. Therefore, we get a new objective:

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} \lambda \|w\|^2 + \frac{1}{N} \sum_{n=1}^{N} l(x^n, y^n; w) = \operatorname{argmin}_{w \in \mathbb{R}^D} \frac{1}{2} \|w\|^2 + \frac{C}{N} \sum_{n=1}^{N} l(x^n, y^n; w), \quad \text{with } C = \frac{1}{2\lambda}.$$

Structured hinge loss

Let $\bar{y} = \operatorname{argmax}_{y \in \mathcal{Y}} g(x^n, y; w)$; then, since $g(x^n, \bar{y}; w) \geq g(x^n, y^n; w)$, we get

$$\Delta(y^n, \bar{y}) \leq \Delta(y^n, \bar{y}) + g(x^n, \bar{y}; w) - g(x^n, y^n; w) \leq \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) + g(x^n, y; w) - g(x^n, y^n; w) \big) =: l(x^n, y^n, w),$$

which is called the structured hinge loss. Note that $l$ provides an upper bound for the loss function. Moreover, $l$ is continuous and convex, since it is a maximum over affine functions (of $w$). We remark that

$$l(x^n, y^n, w) = \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) + g(x^n, y; w) - g(x^n, y^n; w) \big)$$
$$= \max\Big( 0, \max_{y \neq y^n} \big( \Delta(y^n, y) + g(x^n, y; w) - g(x^n, y^n; w) \big) \Big)$$
$$= \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle + \langle w, \varphi(x^n, y^n)\rangle \big).$$
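For a label space small enough to enumerate, the structured hinge loss and its upper-bound property can be checked directly. The sketch below uses invented features and weights and the Hamming loss as $\Delta$; it is an illustration under these assumptions, not the lecture's implementation.

```python
# Structured hinge loss by enumeration for a toy problem with two output variables.
import itertools
import numpy as np

K, n_vars, D = 3, 2, 5
Y = list(itertools.product(range(K), repeat=n_vars))     # enumerable label space

rng = np.random.default_rng(0)
w = rng.normal(size=D)
feat = {y: rng.normal(size=D) for y in Y}                 # toy phi(x^n, y) for fixed x^n
phi = lambda y: feat[y]
delta = lambda y, yp: np.mean([a != b for a, b in zip(y, yp)])   # Hamming loss

y_n = Y[0]                                                # the "correct" label

# l(x^n, y^n, w) = max_y ( Delta(y^n, y) - <w, phi(y)> + <w, phi(y^n)> )
hinge = max(delta(y_n, y) - w @ phi(y) + w @ phi(y_n) for y in Y)

# prediction f(x^n) = argmin_y <w, phi(y)>, and its task loss
y_hat = min(Y, key=lambda y: w @ phi(y))
print("structured hinge loss:", hinge)
print("Delta(y^n, f(x^n))   :", delta(y_n, y_hat))        # the hinge is an upper bound
```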

Structured Support Vector Machine

Let $g(x, y; w) = -\langle w, \varphi(x, y)\rangle$ be an auxiliary function parameterized by $w \in \mathbb{R}^D$. For any $C > 0$, structured support vector machine (S-SVM) training chooses the parameter

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} L(w) = \operatorname{argmin}_{w \in \mathbb{R}^D} \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^{N} l(x^n, y^n, w)$$

with

$$l(x^n, y^n, w) = \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle + \langle w, \varphi(x^n, y^n)\rangle \big).$$

Both probabilistic parameter learning and S-SVM do regularized risk minimization. For probabilistic parameter learning, the regularized conditional log-likelihood objective can be written (with $C = \frac{1}{2\lambda}$) as

$$w^* \in \operatorname{argmin}_{w \in \mathbb{R}^D} \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \log \sum_{y \in \mathcal{Y}} \exp\big( \langle w, \varphi(x^n, y^n)\rangle - \langle w, \varphi(x^n, y)\rangle \big).$$

Subgradient

Let $f : \mathbb{R}^D \to \mathbb{R}$ be a convex, but not necessarily differentiable, function. A vector $v \in \mathbb{R}^D$ is called a subgradient of $f$ at $w_0$ if

$$f(w) \geq f(w_0) + \langle v, w - w_0\rangle \quad \text{for all } w.$$

[Figure: subgradients of a convex function at a non-differentiable point.]

Note that for differentiable $f$, the gradient $v = \nabla f(w_0)$ is the only subgradient.
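As a quick illustration of the definition (added here, not from the slides): for a maximum of affine functions, the gradient of any active (maximizing) piece is a subgradient, which can be verified numerically against the defining inequality.

```python
# Numerical check of the subgradient inequality for f(w) = max_i (<a_i, w> + b_i).
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 3)), rng.normal(size=5)     # five toy affine pieces in R^3
f = lambda w: np.max(A @ w + b)

w0 = rng.normal(size=3)
v = A[np.argmax(A @ w0 + b)]                           # gradient of an active piece at w0

# f(w) >= f(w0) + <v, w - w0> must hold for every w
ws = rng.normal(size=(1000, 3))
print(all(f(w) >= f(w0) + v @ (w - w0) for w in ws))   # True
```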


Pseudo-code of subgradient descent minimization

Input: Tolerance $\epsilon > 0$ and step sizes $\eta_t := \eta(t)$.
Output: The minimizer $w^*$ of $L$.

1: $w \leftarrow 0$
2: $t \leftarrow 0$
3: repeat
4:   $t \leftarrow t + 1$
5:   $v \in \nabla^{\text{sub}}_w L(w)$
6:   $w \leftarrow w - \eta(t)\, v$
7: until $L$ changed less than $\epsilon$
8: return $w$

Subgradient descent minimization

This method converges to the global minimum, but it is rather inefficient if the objective function $L$ is non-differentiable. Convergence is guaranteed for step sizes satisfying the diminishing step-size conditions

$$\lim_{t \to \infty} \eta_t = 0 \quad \text{and} \quad \sum_{t} \eta_t \to \infty.$$

Example: $\eta_t = \eta(t) := \frac{1 + m}{t + m}$ for any $m \geq 0$.
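As a quick worked check (added here, not on the slides), this example step size indeed satisfies both conditions:

$$\lim_{t \to \infty} \frac{1 + m}{t + m} = 0, \qquad \sum_{t \geq 1} \frac{1 + m}{t + m} = (1 + m) \sum_{t \geq 1} \frac{1}{t + m} = \infty,$$

where the second sum diverges because it is a shifted harmonic series.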

Numerical solution

$$\operatorname{argmin}_{w \in \mathbb{R}^D} \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle + \langle w, \varphi(x^n, y^n)\rangle \big).$$

As we have discussed, this function is non-differentiable. Therefore, we cannot use gradient descent directly, so we have to use subgradients.

[Figure: the structured hinge loss as a maximum over linear functions of $w$.]

For each $y \in \mathcal{Y}$, the term inside the maximum is a linear function of $w$, and $l$ is the maximum over all $y \in \mathcal{Y}$. In order to calculate a subgradient at $w_0$, one may find a maximal (active) $y$ and then use the gradient $v$ of the corresponding linear piece at $w_0$.

Calculating the subgradient

$$\operatorname{argmin}_{w \in \mathbb{R}^D} \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle + \langle w, \varphi(x^n, y^n)\rangle \big).$$

Let $\bar{y} \in \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle$. A subgradient $v$ of $L(w)$ is given by

$$\nabla^{\text{sub}}_w \Big( \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \big( \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle + \langle w, \varphi(x^n, y^n)\rangle \big) \Big)$$
$$\ni \nabla_w \Big( \frac{1}{2}\|w\|^2 + \frac{C}{N} \sum_{n=1}^{N} \big( \Delta(y^n, \bar{y}) - \langle w, \varphi(x^n, \bar{y})\rangle + \langle w, \varphi(x^n, y^n)\rangle \big) \Big)$$
$$= w + \frac{C}{N} \sum_{n=1}^{N} \big( -\varphi(x^n, \bar{y}) + \varphi(x^n, y^n) \big)$$
$$= w + \frac{C}{N} \sum_{n=1}^{N} \Big( \varphi(x^n, y^n) - \varphi\big(x^n, \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle \big) \Big) =: v.$$

Subgradient descent S-SVM learning

Input: Training set $\mathcal{D} = \{(x^n, y^n)\}_{n=1}^{N}$, energies $\varphi(x, y)$, loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$, regularizer $C$, and step sizes $\{\eta_t\}_{t=1}^{T}$.
Output: The weight vector $w$ for the prediction function $f(x) = \operatorname{argmin}_{y \in \mathcal{Y}} \langle w, \varphi(x, y)\rangle$.

1: $w \leftarrow 0$
2: for $t = 1, \dots, T$ do
3:   for $n = 1, \dots, N$ do
4:     $\bar{y} \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle$
5:     $v^n \leftarrow -\varphi(x^n, \bar{y}) + \varphi(x^n, y^n)$
6:   end for
7:   $w \leftarrow w - \eta_t \big( w + \frac{C}{N} \sum_{n=1}^{N} v^n \big)$
8: end for

The step size can be chosen as $\eta_t := \eta(t) = \frac{1}{t}$ for all $t = 1, \dots, T$. Note that each update of $w$ needs an argmax-prediction for each training sample.

Stochastic subgradient descent S-SVM learning

Input: Training set $\mathcal{D} = \{(x^1, y^1), \dots, (x^N, y^N)\}$, energies $\varphi(x, y)$, loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$, regularizer $C$, number of iterations $T$, and step sizes $\{\eta_t\}_{t=1}^{T}$.
Output: The weight vector $w$ for the prediction function $f(x) = \operatorname{argmin}_{y \in \mathcal{Y}} \langle w, \varphi(x, y)\rangle$.

1: $w \leftarrow 0$
2: for $t = 1, \dots, T$ do
3:   $(x^n, y^n) \leftarrow$ a randomly chosen training example
4:   $\bar{y} \leftarrow \operatorname{argmax}_{y \in \mathcal{Y}} \Delta(y^n, y) - \langle w, \varphi(x^n, y)\rangle$
5:   $w \leftarrow w - \eta_t \big( w + \frac{C}{N} \big( -\varphi(x^n, \bar{y}) + \varphi(x^n, y^n) \big) \big)$
6: end for

Note that each update step of $w$ needs only one argmax-prediction; however, we will generally need many iterations until convergence.
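The stochastic variant can be turned into a short runnable sketch for a toy multiclass problem, where loss-augmented inference reduces to enumerating the $K$ labels. The block feature map, the choice of the 0/1 loss as $\Delta$, and the data below are assumptions made for this illustration only.

```python
# Sketch of stochastic subgradient S-SVM learning for a toy multiclass problem.
import numpy as np

K, D_x = 3, 4
D = K * D_x

def phi(x, y):
    """Toy block feature map; the minus sign makes argmin <w, phi> the usual argmax of w_y . x."""
    f = np.zeros(D)
    f[y * D_x:(y + 1) * D_x] = -x
    return f

def delta(y, y_prime):                    # 0/1 loss
    return 0.0 if y == y_prime else 1.0

def train(data, C=10.0, T=3000, seed=0):
    rng = np.random.default_rng(seed)
    N, w = len(data), np.zeros(D)
    for t in range(1, T + 1):
        x, y = data[rng.integers(N)]                     # one random sample
        # loss-augmented inference: argmax_y Delta(y^n, y) - <w, phi(x^n, y)>
        y_bar = max(range(K), key=lambda yy: delta(y, yy) - w @ phi(x, yy))
        v = -phi(x, y_bar) + phi(x, y)
        w -= (1.0 / t) * (w + (C / N) * v)               # step size eta_t = 1 / t
    return w

def predict(w, x):
    return min(range(K), key=lambda y: w @ phi(x, y))    # argmin energy

# Toy data: the label is the index of the largest among the first K coordinates.
rng = np.random.default_rng(1)
data = [(x, int(np.argmax(x[:K]))) for x in rng.normal(size=(200, D_x))]
w = train(data)
print("training accuracy:", np.mean([predict(w, x) == y for x, y in data]))
```

For a structured output space, the brute-force enumeration in the loss-augmented step would be replaced by a (loss-augmented) MAP inference routine.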

Summary of S-SVM learning

We are given a training set $\mathcal{D} = \{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$ and a problem-specific loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+_0$. The task is to learn a parameter $w$ for a prediction function

$$f(x) = \operatorname{argmax}_{y \in \mathcal{Y}} -\langle w, \varphi(x, y)\rangle = \operatorname{argmin}_{y \in \mathcal{Y}} \langle w, \varphi(x, y)\rangle$$

that minimizes the expected loss on the training set.

The S-SVM solution is derived from the maximum-margin framework:

$$\langle w, \varphi(x^n, y^n)\rangle + \Delta(y^n, y) \leq \langle w, \varphi(x^n, y)\rangle \quad \text{for all } y \in \mathcal{Y},$$

that is, the correct output is enforced to have lower energy than any other output by a margin.

We have seen that S-SVM training ends up being a convex optimization problem, but it is non-differentiable. Furthermore, it requires repeated argmax predictions.

Next lecture: Summary of the course

[Overview diagram of IN2329: graphical models (MRF, CRF, Bayesian network), learning (probabilistic and loss-minimizing parameter learning), inference (probabilistic and MAP inference), and computer vision applications (image segmentation, human pose estimation, stereo matching, object detection).]

Literature

1. Sebastian Nowozin and Christoph H. Lampert. Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4), 2011.
2. Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
3. Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip Torr. Conditional random fields as recurrent neural networks. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, December 2015. IEEE.
4. Guosheng Lin, Chunhua Shen, Anton van den Hengel, and Ian Reid. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016. IEEE.
