ECE521 Tutorial 2. Regression, GPs, Assignment 1. ECE521 Winter 2016. Credits to Alireza Makhzani, Alex Schwing, Rich Zemel for the slides.


1 ECE521 Tutorial 2: Regression, GPs, Assignment 1. ECE521 Winter 2016. Credits to Alireza Makhzani, Alex Schwing, Rich Zemel for the slides.

2 Outline: 1. Linear regression, kNN 2. Linear classification 3. Overview of Gaussian Processes

3-6 (figure-only slides; no text transcribed)

7 Gradient Descent Algorithm. Gradient descent is based on the observation that a function decreases fastest if one moves in the direction of the negative gradient: $x_{n+1} = x_n - \mu_n \nabla f(x_n)$.
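As a quick illustration, here is a minimal NumPy sketch of this update applied to a least-squares objective; the toy data, step size, and iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np

# Toy data: y = X w_true + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def grad(w):
    # Gradient of the mean squared error (1/n) * ||Xw - y||^2
    return 2.0 / len(y) * X.T @ (X @ w - y)

w = np.zeros(3)
mu = 0.1                      # step size (learning rate), chosen arbitrarily
for _ in range(500):
    w = w - mu * grad(w)      # x_{n+1} = x_n - mu_n * grad f(x_n)

print(w)                      # should end up close to w_true
```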

8 Momentum. Momentum simply adds a fraction $m$ of the previous weight update to the current one: $v_k = m_k v_{k-1} - \eta_k \nabla f(x_k)$, $x_{k+1} = x_k + v_k$.
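A similar sketch with the momentum update; the momentum coefficient and learning rate below are illustrative values, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def grad(w):
    return 2.0 / len(y) * X.T @ (X @ w - y)

w = np.zeros(3)
v = np.zeros(3)
m, eta = 0.9, 0.05             # momentum coefficient and learning rate (illustrative)
for _ in range(500):
    v = m * v - eta * grad(w)  # v_k = m * v_{k-1} - eta * grad f(x_k)
    w = w + v                  # x_{k+1} = x_k + v_k

print(w)
```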

9 Validation Set

10 K-nearest neighbour

11 (figure-only slide; no text transcribed)

12 Nonlinearity

13 Polynomial Curve Fitting

14-17 (figure-only slides; no text transcribed)

18 Preventing Overfitting

19-20 (figure-only slides; no text transcribed)

21 K-nearest neighbour. The class probability is estimated from the fraction of the $K$ nearest neighbours that belong to class $C_k$: $P(C_k) = K_k / K$.
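A minimal NumPy sketch of this estimate, assuming Euclidean distance and a small synthetic two-class dataset; all names and values are my own illustration, not from the slides.

```python
import numpy as np

def knn_class_probs(X_train, t_train, x_query, K, num_classes):
    """Estimate P(C_k) = K_k / K from the K nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:K]                      # indices of the K nearest neighbours
    counts = np.bincount(t_train[nearest], minlength=num_classes)
    return counts / K                                    # P(C_k) = K_k / K

# Toy example with two classes
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
t_train = np.array([0] * 20 + [1] * 20)
print(knn_class_probs(X_train, t_train, np.array([2.5, 2.5]), K=5, num_classes=2))
```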

22 Linear regression for classification

23 Linear Classification. Linear regression for classification: $x^\top w + w_0 = w_0 + w_1 x_1 + w_2 x_2$.

24 Linear regression for classification (figure)

25 Linear regression for classification (figure)
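As a concrete illustration of the preceding slides, here is a minimal sketch of using least-squares linear regression as a classifier: fit $\pm 1$ targets and threshold $x^\top w + w_0$ at zero. The synthetic data and the closed-form solve are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two Gaussian blobs, labelled -1 and +1
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
t = np.concatenate([-np.ones(50), np.ones(50)])

# Least-squares fit of t ~ w0 + w1*x1 + w2*x2 (bias absorbed via a column of ones)
X1 = np.hstack([np.ones((100, 1)), X])
w = np.linalg.lstsq(X1, t, rcond=None)[0]      # [w0, w1, w2]

# Classify by the sign of x^T w + w0
pred = np.sign(X1 @ w)
print("training accuracy:", np.mean(pred == t))
```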

26 Methods for Regression. We are given $n$ data points $x$ and corresponding regression targets $y$: $\mathcal{D} = \{x^{(i)}, y_i\}_{i=1}^{n}$. Two approaches: 1. Restrict the class of functions (e.g., only linear). 2. Assign a probability to every possible function.

27 1. Restrict the class of functions (e.g., only linear). Use data to learn the parameters $w$ of a parametric model $p(y \mid x, w)$. Use the parameters and test data $x_*$ to obtain a prediction $y_* = \arg\max_{y} p(y \mid x_*, w)$. Examples: logistic regression, support vector machines, neural networks. Problem: how to find the right model?

28 2. Assign a probability to every possible function. Use data to compute a posterior $p(w \mid y, X)$. Use the posterior and test data $x_*$ to obtain the predictive distribution (an average over all models): $p(y_* \mid x_*, X, y) = \int p(y_* \mid x_*, w)\, p(w \mid X, y)\, dw$. Note the difference to the non-Bayesian approach: a single parameter estimate vs. averaging. Problem: how to assign a value to every possible function? Solution: Gaussian processes.

29 Weight-space view: Bayesian treatment of the linear model. Dataset: $\mathcal{D} = \{x^{(i)}, y_i\}_{i=1}^{N}$ with $x^{(i)} \in \mathbb{R}^D$, $y_i \in \mathbb{R}$; $X = [x^{(1)}, \ldots, x^{(N)}]$, $y = [y_1, \ldots, y_N]$. Given $\mathcal{D}$ we now: compute the posterior $p(w \mid y, X)$ and compute the predictive distribution $p(y_* \mid x_*, X, y)$.

30 Posterior for the standard linear model. Likelihood: $y = f(x) + \varepsilon$, $f(x) = x^\top w$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, so $p(y \mid X, w) = \prod_i p(y_i \mid x^{(i)}, w) = \prod_i \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\!\left(-\frac{(y_i - w^\top x^{(i)})^2}{2\sigma_n^2}\right) = \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma_n^2}\|y - X^\top w\|_2^2\right) = \mathcal{N}(X^\top w, \sigma_n^2 I)$. Prior: $p(w) = \mathcal{N}(0, \Sigma_p)$.

31 Posterior for the standard linear model. Likelihood: $p(y \mid X, w) = \mathcal{N}(X^\top w, \sigma_n^2 I)$. Prior: $p(w) = \mathcal{N}(0, \Sigma_p)$. Expression for the posterior: $p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$. Expression for the marginal likelihood: $p(y \mid X) = \int p(y \mid X, w)\, p(w)\, dw$.

32 Posterior for the standard linear model. Recall Bayes' theorem for the multivariate Gaussian: given $p(x) = \mathcal{N}(\mu, \Lambda^{-1})$ and $p(y \mid x) = \mathcal{N}(Ax + b, L^{-1})$, we obtain $p(y) = \mathcal{N}(A\mu + b,\; L^{-1} + A\Lambda^{-1}A^\top)$ and $p(x \mid y) = \mathcal{N}\!\left(\Gamma\{A^\top L(y - b) + \Lambda\mu\},\; \Gamma\right)$ with $\Gamma = (\Lambda + A^\top L A)^{-1}$. In our case: likelihood $p(y \mid X, w) = \mathcal{N}(X^\top w, \sigma_n^2 I)$, prior $p(w) = \mathcal{N}(0, \Sigma_p)$, posterior $p(w \mid y, X) = \mathcal{N}\!\left(\frac{1}{\sigma_n^2}\Gamma X y,\; \Gamma\right)$ with $\Gamma = \left(\Sigma_p^{-1} + \frac{1}{\sigma_n^2} X X^\top\right)^{-1}$.

33 Posterior for the standard linear model. $p(w \mid y, X) = \mathcal{N}\!\left(\frac{1}{\sigma_n^2}\Gamma X y,\; \Gamma\right)$, $\Gamma = \left(\Sigma_p^{-1} + \frac{1}{\sigma_n^2} X X^\top\right)^{-1}$. For Gaussian distributions the mean equals the mode (the MAP estimate). Note the similarity to ridge regression.
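A minimal NumPy sketch of this posterior computation, using the slides' convention that $X$ is $D \times N$ with one column per training point; the prior covariance, noise level, and toy data are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 2, 50
sigma_n = 0.3                        # noise standard deviation (illustrative)
Sigma_p = np.eye(D)                  # prior covariance (illustrative)

w_true = np.array([0.8, -1.5])
X = rng.normal(size=(D, N))          # columns are training inputs x^(i)
y = X.T @ w_true + sigma_n * rng.normal(size=N)

# Posterior: p(w | y, X) = N(Gamma X y / sigma_n^2, Gamma),
# Gamma = (Sigma_p^{-1} + X X^T / sigma_n^2)^{-1}
Gamma = np.linalg.inv(np.linalg.inv(Sigma_p) + X @ X.T / sigma_n**2)
w_mean = Gamma @ X @ y / sigma_n**2

print(w_mean)        # posterior mean (also the MAP estimate), close to w_true
```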

34 Given $\mathcal{D}$ we managed to compute the posterior $p(w \mid y, X)$. We still need to compute the predictive distribution $p(y_* \mid x_*, X, y) = \int p(y_* \mid x_*, w)\, p(w \mid X, y)\, dw$.

35 Predictive distribution. Recall Bayes' theorem for the multivariate Gaussian: given $p(x) = \mathcal{N}(\mu, \Lambda^{-1})$ and $p(y \mid x) = \mathcal{N}(Ax + b, L^{-1})$, we obtain $p(y) = \mathcal{N}(A\mu + b,\; L^{-1} + A\Lambda^{-1}A^\top)$ and $p(x \mid y) = \mathcal{N}\!\left(\Gamma\{A^\top L(y - b) + \Lambda\mu\},\; \Gamma\right)$ with $\Gamma = (\Lambda + A^\top L A)^{-1}$. Prediction: average over all parameter values, weighted by the posterior: $p(y_* \mid x_*, X, y) = \int p(y_* \mid x_*, w)\, p(w \mid X, y)\, dw = \int \mathcal{N}(x_*^\top w, \sigma_n^2)\, \mathcal{N}\!\left(\frac{1}{\sigma_n^2}\Gamma X y, \Gamma\right) dw = \mathcal{N}\!\left(\frac{1}{\sigma_n^2} x_*^\top \Gamma X y,\; \sigma_n^2 + x_*^\top \Gamma x_*\right)$.

36 Summary. Prior: $p(w) = \mathcal{N}(0, \Sigma_p)$. Likelihood: $p(y \mid X, w) = \mathcal{N}(X^\top w, \sigma_n^2 I)$. Posterior: $p(w \mid X, y) = \mathcal{N}\!\left(\frac{1}{\sigma_n^2}\Gamma X y, \Gamma\right)$ with $\Gamma = \left(\Sigma_p^{-1} + \frac{1}{\sigma_n^2} X X^\top\right)^{-1}$. Predictive distribution: $p(y_* \mid x_*, X, y) = \mathcal{N}\!\left(\frac{1}{\sigma_n^2} x_*^\top \Gamma X y,\; \sigma_n^2 + x_*^\top \Gamma x_*\right)$.
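Extending the posterior sketch above, the predictive mean and variance at a test input $x_*$ follow directly from these summary formulas; again the data and hyperparameters are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 2, 50
sigma_n = 0.3
Sigma_p = np.eye(D)

w_true = np.array([0.8, -1.5])
X = rng.normal(size=(D, N))                  # columns are training inputs
y = X.T @ w_true + sigma_n * rng.normal(size=N)

Gamma = np.linalg.inv(np.linalg.inv(Sigma_p) + X @ X.T / sigma_n**2)
w_mean = Gamma @ X @ y / sigma_n**2          # posterior mean

x_star = np.array([1.0, 2.0])                # test input (illustrative)
pred_mean = x_star @ w_mean                  # (1/sigma_n^2) x_*^T Gamma X y
pred_var = sigma_n**2 + x_star @ Gamma @ x_star
print(pred_mean, pred_var)
```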

37 Multivariate Gaussian Distribution. $p(x_1, \ldots, x_n \mid \mu, \Sigma) = p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right) = \sqrt{\frac{\det(\Lambda)}{(2\pi)^n}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Lambda (x - \mu)\right) = p(x \mid \mu, \Lambda^{-1})$, where $x \in \mathbb{R}^n$, $\mu \in \mathbb{R}^n$ is the mean vector, $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix, and $\Lambda = \Sigma^{-1} \in \mathbb{R}^{n \times n}$ is the precision matrix.

38 Shorthand: $x \sim \mathcal{N}(\mu, \Sigma) = \mathcal{N}(\mu, \Lambda^{-1})$. Samples: $\Sigma^{1/2}\,\mathrm{randn}(2, 10000) + \mu$. (Figure: the density $p(x_1, x_2 \mid \mu, \Sigma)$ and sampled points over $x_1$, $x_2$.) How would the plots look if $x_1$ and $x_2$ were independent random variables? What would the entries of the covariance matrix be?
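A NumPy version of that sampling shorthand, using a Cholesky factor as the matrix square root; the particular mean and covariance are illustrative values.

```python
import numpy as np
import matplotlib.pyplot as plt

mu = np.array([1.0, 2.0])                       # mean (illustrative)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])                  # covariance (illustrative)

L = np.linalg.cholesky(Sigma)                   # plays the role of Sigma^{1/2}
samples = L @ np.random.randn(2, 10000) + mu[:, None]

plt.scatter(samples[0], samples[1], s=1)
plt.xlabel("x1"); plt.ylabel("x2")
plt.show()
# If x1 and x2 were independent, the off-diagonal entries of Sigma would be 0
# and the sample cloud would show no tilt.
```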

39 Tools for dealing with Gaussian distributions. Completing the square: $-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu) = -\frac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu + \text{const}$. Express a given quadratic form as on the right-hand side, then read off $\Sigma$ and $\mu$.

40 Example to practice completing the square: let $x = \begin{bmatrix} x_a \\ x_b \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix}^{-1}\right)$. Suppose $x_b$ is given; what is $p(x_a \mid x_b)$? $-\frac{1}{2}\left(\begin{bmatrix} x_a \\ x_b \end{bmatrix} - \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}\right)^{\!\top} \begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix} \left(\begin{bmatrix} x_a \\ x_b \end{bmatrix} - \begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}\right) = -\frac{1}{2} x_a^\top \Lambda_{aa} x_a + x_a^\top\left(\Lambda_{aa}\mu_a - \Lambda_{ab}(x_b - \mu_b)\right) + \text{const}$. Covariance: $\Lambda_{aa}^{-1}$. Mean: $\mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b)$. Hence $p(x_a \mid x_b) = \mathcal{N}\!\left(\mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b),\; \Lambda_{aa}^{-1}\right)$.

41 Express $p(x_a \mid x_b) = \mathcal{N}\!\left(\mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b),\; \Lambda_{aa}^{-1}\right)$ in terms of the covariance matrix: $\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$. Matrix identity: $\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{-1} = \begin{pmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{pmatrix}$ with $M = (A - BD^{-1}C)^{-1}$, so $\Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}$ and $\Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1}$, and therefore $p(x_a \mid x_b) = \mathcal{N}\!\left(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right)$.

42 Conditioning of the Multivariate Gaussian. Conditional: for $x = \begin{bmatrix} x_a \\ x_b \end{bmatrix} \sim \mathcal{N}\!\left(\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix}^{-1}\right) = \mathcal{N}\!\left(\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}\right)$: $p(x_a \mid x_b) = \mathcal{N}\!\left(\mu_a - \Lambda_{aa}^{-1}\Lambda_{ab}(x_b - \mu_b),\; \Lambda_{aa}^{-1}\right) = \mathcal{N}\!\left(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right)$.
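A small NumPy check of the covariance-form conditioning formula (the identity underlying GP regression); the joint mean and covariance below are arbitrary illustrative values.

```python
import numpy as np

# Joint Gaussian over (x_a, x_b) with scalar x_a and x_b (illustrative values)
mu = np.array([0.0, 1.0])                       # [mu_a, mu_b]
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.5]])                  # [[S_aa, S_ab], [S_ba, S_bb]]

x_b = 2.0                                       # observed value of x_b

S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
cond_mean = mu[0] + S_ab / S_bb * (x_b - mu[1])   # mu_a + S_ab S_bb^{-1} (x_b - mu_b)
cond_var = S_aa - S_ab / S_bb * S_ba              # S_aa - S_ab S_bb^{-1} S_ba
print(cond_mean, cond_var)
```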

43 TF Tips. tf.argmin(input, axis=None, name=None, dimension=None) and tf.argmax(input, axis=None, name=None, dimension=None) find the indices of the smallest/largest elements in a tensor. tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None) looks up ids in a list of tensors. matplotlib is a package for generating plots similar to MATLAB; pyplot is a useful subpackage of matplotlib.
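A short usage sketch of these two ops, written for eager execution (TensorFlow 2.x); with the TF 1.x API shown on the slide you would evaluate the same tensors inside a tf.Session. The example tensors and values are my own.

```python
import tensorflow as tf

# tf.argmax / tf.argmin: index of the largest / smallest entry along an axis
scores = tf.constant([[0.1, 0.7, 0.2],
                      [0.5, 0.3, 0.2]])
print(tf.argmax(scores, axis=1))      # -> [1, 0]
print(tf.argmin(scores, axis=1))      # -> [0, 2]

# tf.nn.embedding_lookup: gather rows of `params` indexed by `ids`
params = tf.constant([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])
print(tf.nn.embedding_lookup(params, ids=[2, 0]))   # rows 2 and 0
```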
