References. Lecture 3: Shrinkage. The Power of Amnesia. Ockham's Razor

References. Lecture 3: Shrinkage. Isabelle Guyon. "Structural risk minimization for character recognition", Isabelle Guyon et al. (m.ps.z). "Kernel Ridge Regression", Isabelle Guyon, isabelle/projects/ETH/KernelRidge.pdf.

Ockham's Razor. Principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate." Of two theories providing similarly good predictions, prefer the simplest one. Shave off unnecessary parameters of our models.

The Power of Amnesia. The human brain is made out of billions of cells, or neurons, which are highly interconnected by synapses. Exposure to enriched environments with extra sensory and social stimulation enhances the connectivity of the synapses, but children and adolescents can lose them at a rate of up to 20 million per day.

Artificial Neurons. An artificial neuron computes a weighted sum of its inputs: f(x) = w · x + b, with inputs x_1, ..., x_n, weights w_1, ..., w_n, bias b, and an activation function applied to the output. [Figure: neuron diagram; dendrites, synapses, and axon correspond to the inputs, the weights, and the output carrying the activation to other neurons.]

Hebb's Rule. w_j ← w_j + y_i x_ij: the weight update is the product of the activations of the two connected neurons (the cell potential y_i and the activation x_ij of the other neuron).

Weight Decay. Add a forgetting term to Hebb's rule: w_j ← (1 - λ) w_j + y_i x_ij, with λ ∈ [0, 1] the decay parameter.

Sigma-Pi Unit. [Figure: a unit whose inputs include products of input components, e.g. x_1 x_2, ..., x_a x_n, each with its own weight.]
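To make the two update rules concrete, here is a minimal NumPy sketch contrasting Hebb's rule with its weight-decay variant; the toy data, the targets, and the value of the decay parameter λ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                        # m=100 patterns, n=5 inputs
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))    # toy targets in {-1, +1}

lam = 0.01                                           # decay parameter (illustrative value)
w_hebb, w_decay = np.zeros(5), np.zeros(5)
for x_i, y_i in zip(X, y):
    w_hebb = w_hebb + y_i * x_i                      # Hebb's rule: w_j <- w_j + y_i x_ij
    w_decay = (1 - lam) * w_decay + y_i * x_i        # weight decay: w_j <- (1-lam) w_j + y_i x_ij

# The decayed weight vector stays smaller in norm than the pure Hebbian one
print(np.linalg.norm(w_hebb), np.linalg.norm(w_decay))
```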

One Dimensional Example: Polynomial Regression. [Figure: fits of a one-dimensional data set by a polynomial of fixed degree d, for ridge parameter values r = 1e+002 through 1e+007.] A polynomial is a linear model over the powers of x: f(x) = w · x, with inputs x, x^2, ..., x^n and weights w_1, w_2, ..., w_n.

Conventions.
X = {x_ij}: training data matrix, i = 1:m, j = 1:n
x_i = {x_ij}: matrix line, training pattern i
x: test pattern, dim n
y_i: target value of pattern i
y: target value of test pattern
w: weight vector, dim n
a: weight vector, dim m
[Figure: the (m, n) matrix X = {x_ij}, the target column y = {y_i}, and the prediction column f.]
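The claim that a one-dimensional polynomial fit is just a linear model over the powers of x can be illustrated by building the corresponding design matrix; a small sketch, assuming an arbitrary degree and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)                           # m=20 one-dimensional inputs
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=20)    # toy targets

d = 10                                               # polynomial degree (arbitrary choice)
X = np.vander(x, d + 1, increasing=True)             # columns 1, x, x^2, ..., x^d
print(X.shape)                                       # (m, n) = (20, 11): one row per training pattern
```

Once the powers of x are stacked into X, all the matrix formulas of the following slides apply unchanged.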

Matrix Notations. Hebb's rule in matrix form: w_j = Σ_i y_i x_ij, i.e. w = y^T X, or w^T = X^T y, with dimensions (1,n) = (1,m)(m,n) and (n,1) = (n,m)(m,1).

Linear Regression. What we want: Σ_j w_j x_ij = y_i for all examples i = 1..m (the bias is absorbed as b = w_0). Predictor: f(x) = Σ_j w_j x_j, i.e. f(x) = x w^T = w x^T. Or, for classification with y_i = ±1: sign(Σ_j w_j x_ij) = y_i. Solve: X w^T = y, dimensions (m,n)(n,1) = (m,1).

Normal Equations. X^T X w^T = X^T y, dimensions (n,m)(m,n)(n,1) = (n,m)(m,1). Regression case m > n: rank(X) ≤ min(n,m); assume rank(X) = n, which implies rank(X^T X) = n, so X^T X is invertible. Solution: w^T = (X^T X)^{-1} X^T y.

Pseudo-Inverse. w^T = X^+ y, where X^+ = (X^T X)^{-1} X^T is the pseudo-inverse, with X^+ X = I. Predictor: f(x) = x w^T = x X^+ y. Residual: y - ŷ = y - X w^T = (I - X X^+) y; X X^+ is the projector onto the subspace spanned by the columns of X.
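A short NumPy sketch of the regression case m > n, checking that solving the normal equations and applying the pseudo-inverse give the same least-squares weights; the data are synthetic and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 5                               # regression case m > n
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

# Normal equations: (X^T X) w^T = X^T y  (X^T X invertible when rank(X) = n)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse: w^T = X^+ y; gives the same least-squares solution
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_normal, w_pinv))       # True

residual = y - X @ w_normal                # (I - X X^+) y
```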

Least-Squares. Residual: y - ŷ = y - X w^T = (I - X X^+) y. The pseudo-inverse solution is optimal in the least-squares sense: min_w ||y - X w^T||^2 = ||(I - X X^+) y||^2. [Figure: y projected onto the subspace spanned by the columns of X; the residual is orthogonal to that subspace.]

Gradient Descent. Square loss: L_i = (x_i w^T - y_i)^2. Sum of squares: R = Σ_i (x_i w^T - y_i)^2 = ||X w^T - y||^2 = w X^T X w^T - 2 w X^T y + y^T y. Gradient: ∇_w R = 2 (X^T X w^T - X^T y).

Normal Equations. At the optimum ∇_w R = 0, so 2 (X^T X w^T - X^T y) = 0, giving the normal equations (again): X^T X w^T = X^T y. Solve by inverting X^T X, if regular. What if X^T X is singular?

Regularization. Normal equations: X^T X w^T = X^T y, dimensions (n,m)(m,n)(n,1) = (n,m)(m,1). Case m < n (interpolation): rank(X) ≤ m < n, so the matrix X^T X is singular. Replace X^T X by (X^T X + λI), λ > 0. Solution: w^T = (X^T X + λI)^{-1} X^T y (regularized inverse).
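A minimal sketch of the regularized solution in the interpolation case m < n, where the plain normal equations would fail because X^T X is singular; the value of λ and the data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 50                              # interpolation case m < n: X^T X is singular
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

lam = 1e-2
# Regularized inverse: w^T = (X^T X + lam I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
y_hat = X @ w
print(np.linalg.norm(y - y_hat))           # small training residual despite m < n
```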

Why it works. Diagonalization: X^T X = U D U^T, with U an orthogonal matrix of eigenvectors (U U^T = I) and D a diagonal matrix of eigenvalues. Singularity: some eigenvalues are zero. Regularization: X^T X + λI = U (D + λI) U^T; with λ > 0 there is no more zero eigenvalue.

Penalized Risk. Sum of squares: R = Σ_i (x_i w^T - y_i)^2 = ||X w^T - y||^2. Add a regularizer: R = ||X w^T - y||^2 + λ ||w||^2. Gradient: ∇_w R = 2 ((X^T X + λI) w^T - X^T y).

Mechanical Interpretation. Quadratic form: R = ||X w^T - y||^2 + λ ||w||^2. In one dimension: R = p (w - w_0)^2 + λ w^2, the equilibrium of two springs with restoring forces 2p(w - w_0) and 2λw. [Figure: the same picture in two dimensions, with the solution pulled from w_0 toward the origin.]

Principal Component Analysis. Problem: construct n' features f_k that are linear combinations of the original features, such that the reconstructed patterns are as close as possible to the originals in the least-squares sense. New features: f_k = X u_k, linear combinations of the columns of the (m,n) matrix X. With U the (n,n') matrix of directions u_k, the new data matrix is X' = X U (m,n') and the reconstructed patterns are x̂_i = x'_i U^T = Σ_k x'_ik u_k, i.e. X̂ = X' U^T = X U U^T (m,n).
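The eigenvalue argument can be checked numerically: with m < n, X^T X has zero eigenvalues, and adding λI lifts them all above zero. A small sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 10, 20
X = rng.normal(size=(m, n))

C = X.T @ X                                 # (n, n), rank at most m < n
evals, U = np.linalg.eigh(C)                # C = U D U^T with U orthogonal
print(np.sum(evals < 1e-10))                # at least n - m eigenvalues are (numerically) zero

lam = 0.1
evals_reg = np.linalg.eigvalsh(C + lam * np.eye(n))
print(evals_reg.min())                      # about lam > 0: no more zero eigenvalue
```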

PCA Solution. X' = X U, X̂ = X' U^T = X U U^T: minimize min_U ||X - X U U^T||^2. This can be brought back to solving an eigenvalue problem: X^T X = U D U^T, i.e. U^T X^T X U = D. Compare with regularization: X^T X + λI = U (D + λI) U^T. PCA: remove the dimensions with the smallest eigenvalues.

Kernel Trick (m < n). Solve: X w^T = y. Assume: w = Σ_i α_i x_i = a^T X, with a a weight vector of dimension m. Solve instead: X X^T a = y, dimensions (m,n)(n,m)(m,1) = (m,1), where X X^T is a full-rank (m,m) matrix. Solution: a = (X X^T)^{-1} y, hence w^T = X^T (X X^T)^{-1} y = X^+ y.

Kernel Ridge Regression. Let Ξ = Φ(X) be the (m,N) matrix of training patterns mapped to feature space. Solve: Ξ w^T = y. Assume: w = Σ_i α_i Φ(x_i) = a^T Ξ. Solve instead: Ξ Ξ^T a = y, dimensions (m,N)(N,m)(m,1) = (m,1), where Ξ Ξ^T is the (m,m) kernel matrix K. Solution: a = K^{-1} y. Regularization: replace K by K + λI.

Regularization and Pseudo-Inverse. Case m > n and rank(X^T X) = n: X^+ = (X^T X)^{-1} X^T. Case m < n and rank(X X^T) = m: X^+ = X^T (X X^T)^{-1}. In either case: X^+ = lim_{λ→0} (X^T X + λI)^{-1} X^T = lim_{λ→0} X^T (X X^T + λI)^{-1}.
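A compact sketch of kernel ridge regression as described above, using a Gaussian kernel; the kernel width, the regularization λ, and the toy data are assumptions made for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(s, t) = exp(-||s - t||^2 / sigma^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(30, 2))                 # m=30 training patterns
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 1e-3
K = gaussian_kernel(X, X)                            # (m, m) kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y) # a = (K + lam I)^{-1} y

X_test = rng.uniform(-1, 1, size=(5, 2))
f_test = gaussian_kernel(X_test, X) @ alpha          # f(x) = sum_i alpha_i k(x_i, x)
print(f_test)
```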

Weight Decay for MLP. Replace the back-propagation update w_j ← w_j + back_prop(j) by w_j ← (1 - λ) w_j + back_prop(j).

Priors and Bayesian Learning. Double random process: first draw a target function f in a family of functions {f}; then draw the data pairs (x_i, y_i = f(x_i) + noise). The distribution of f is called the prior P(f). Our revised opinion about f once we see the data is the posterior P(f|D). Bayesian learning: P(y|x, D) ∝ ∫ P(y|x, D, f) dP(f|D). MAP: f = argmax P(f|D) = argmax P(D|f) P(f).

MAP = RRM. Maximum A Posteriori (MAP): f = argmax P(D|f) P(f) = argmin [- log P(D|f) - log P(f)]. The negative log likelihood is the empirical risk R[f]; the negative log prior is the regularizer Ω[f]. Regularized Risk Minimization (RRM): f = argmin R[f] + Ω[f].

Example: Gaussian Prior. Linear model: f(x) = x w^T. Square loss corresponds to Gaussian noise: P(D|f) = exp(-||X w^T - y||^2 / σ^2), so R[f] = -log P(D|f) ~ ||X w^T - y||^2. Weight decay corresponds to a Gaussian prior: P(f) = exp(-λ ||w||^2), so Ω[f] = -log P(f) = λ ||w||^2.
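The link between the decayed update and the penalized risk can be verified directly: one gradient step on ||X w^T - y||^2 + λ ||w||^2 is the same as shrinking w by (1 - 2ηλ) and then applying the unpenalized gradient. A small numerical check with arbitrary synthetic values:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 8))
y = rng.normal(size=40)
w = rng.normal(size=8)
eta, lam = 0.01, 0.5                                  # illustrative step size and penalty

# Gradient step on the penalized risk ||Xw - y||^2 + lam ||w||^2
grad_penalized = 2 * (X.T @ (X @ w - y) + lam * w)
w_a = w - eta * grad_penalized

# The same step written as "shrink, then apply the data gradient" (weight decay)
grad_data = 2 * X.T @ (X @ w - y)
w_b = (1 - 2 * eta * lam) * w - eta * grad_data

print(np.allclose(w_a, w_b))   # True: weight decay = gradient descent on the penalized risk
```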

Structural Risk Minimization. Nested subsets of models of increasing complexity/capacity: S_1 ⊂ S_2 ⊂ ... ⊂ S_N. Example, ranking with ||w||^2: S_k = { w : ||w||^2 < A_k }, with A_1 < A_2 < ... < A_k. Minimization under constraint: min R_emp[f] s.t. ||w||^2 < A_k. Lagrangian: R_reg[f] = R_emp[f] + λ ||w||^2. [Figure: risk versus capacity over the nested structure.]

Conclusion. Weight decay is a means of avoiding overfitting that is justifiable from many perspectives: Ockham's razor, synaptic decay, regularization, a Gaussian prior on the weights, and structural risk minimization. It works for linear models, kernel methods, and neural networks. It can be combined with various loss functions.

Homework 3: Practical Work. 1) Same data and software as homework 2. 2) Create a heatmap of the 100 top ranking features you selected. 3) Make a scatter plot of the 3 top ranking features you selected. 4) E-mail the result zip file with the figures to guyoni@inf.ethz.ch with subject "Homework3" no later than: Tuesday November 5th.

Risk Minimization. Learning problem: find the best function f(x; α) minimizing the risk functional R[f] = ∫ L(f(x; α), y) dP(x, y), where L is the loss function and P(x, y) is the unknown data distribution. Examples are given: (x_1, y_1), (x_2, y_2), ..., (x_m, y_m).

Loss Functions. With the functional margin z = y f(x), misclassified examples have z < 0 and well-classified examples z > 0. [Figure: losses plotted as functions of z, around the decision boundary and the margin.] 0/1 loss; Adaboost loss e^{-z}; logistic loss log(1 + e^{-z}); Perceptron loss max(0, -z); SVC loss max(0, 1 - z); square loss (1 - z)^2.

Kernel Trick. Dual forms: f(x) = Σ_i α_i k(x_i, x) with k(x_i, x) = Φ(x_i) · Φ(x), or equivalently f(x) = w · Φ(x) with w = Σ_i α_i Φ(x_i).

What is a Kernel? A kernel is a dot product in some feature space: k(s, t) = Φ(s) · Φ(t). Examples: k(s, t) = exp(-||s - t||^2 / σ^2) (Gaussian kernel); k(s, t) = 1 / ||s - t|| (potential function); k(s, t) = (s · t)^q (polynomial kernel). For instance, ([s_1, s_2] · [t_1, t_2])^2 = [s_1^2, s_2^2, √2 s_1 s_2] · [t_1^2, t_2^2, √2 t_1 t_2], so k(s, t) = Φ(s) · Φ(t).
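The polynomial-kernel identity quoted above is easy to verify numerically; a tiny sketch with arbitrary 2-dimensional vectors s and t:

```python
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel in 2 dimensions
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

s = np.array([0.7, -1.2])
t = np.array([2.0, 0.5])

k_st = (s @ t) ** 2              # kernel evaluated directly: (s . t)^2
dot_features = phi(s) @ phi(t)   # same quantity as a dot product in feature space
print(np.isclose(k_st, dot_features))  # True
```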

Simple Kernel Methods. Primal form: f(x) = w · Φ(x), with w = Σ_i α_i Φ(x_i). Perceptron algorithm: w ← w + y_i Φ(x_i) if y_i f(x_i) < 0 (Rosenblatt, 1958). Minover (optimum margin): w ← w + y_i Φ(x_i) for the pattern with minimum y_i f(x_i) (Krauth-Mézard, 1987). LMS regression: w ← w + η (y_i - f(x_i)) Φ(x_i). Dual form: f(x) = Σ_i α_i k(x_i, x), with k(x_i, x) = Φ(x_i) · Φ(x). Potential Function algorithm: α_i ← α_i + y_i if y_i f(x_i) < 0 (Aizerman et al., 1964). Dual minover: α_i ← α_i + y_i for the pattern with minimum y_i f(x_i) (ancestor of the SVM, 1992; similar to the kernel Adatron, 1998, and SMO, 1999). Dual LMS: α_i ← α_i + η (y_i - f(x_i)).

Exercise: Gradient Descent. Linear discriminant: f(x) = Σ_j w_j x_j. Functional margin: z = y f(x), with y = ±1. Compute ∂z/∂w_j. Derive the learning rules Δw_j = -η ∂L/∂w_j corresponding to the following loss functions: square loss (1 - z)^2, SVC loss max(0, 1 - z), Perceptron loss max(0, -z), Adaboost loss e^{-z}, logistic loss log(1 + e^{-z}).

Exercise: Dual Algorithms. From the Δw_j, derive the update Δw of the weight vector. Using w = Σ_i α_i x_i, derive from Δw the updates Δα_i of the dual algorithms.

Exercise: Linear Algebra. Prove that if X is of rank r, then X^T X and X X^T have the same rank. Show that X^T X and X X^T have only non-negative eigenvalues.
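As an illustration of the dual (kernelized) perceptron listed above, here is a minimal NumPy sketch; the Gaussian kernel width, the toy labels, the number of passes, and the use of y_i f(x_i) ≤ 0 (so that updates also fire while f is still identically zero) are implementation choices, not part of the slides:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)   # non-linearly separable toy labels

K = gaussian_kernel(X, X)
alpha = np.zeros(len(X))
for _ in range(50):                     # a few passes over the data
    for i in range(len(X)):
        f_i = K[i] @ alpha              # f(x_i) = sum_j alpha_j k(x_j, x_i)
        if y[i] * f_i <= 0:             # misclassified: dual perceptron update
            alpha[i] += y[i]            # alpha_i <- alpha_i + y_i

train_acc = np.mean(np.sign(K @ alpha) == y)
print(train_acc)
```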
