Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families



1 Midterm Review

2 Topics we covered
Machine Learning
Optimization: basics of optimization; convexity; unconstrained (GD, SGD); constrained (Lagrange, KKT); duality
Linear Methods: perceptrons; support vector machines; kernels
Statistics: basics of probability; tail bounds; density estimation; exponential families; graphical models
Systems!

3 Basics of Machine Learning
Supervised/unsupervised learning?
Classification, regression, clustering
Training error/test error?
Model complexity: overfitting/underfitting
True error
Bayes optimal error

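4 Bias-Variance Tradeoff
When estimating a quantity $\theta$, we evaluate the performance of an estimator $\hat{\theta}$ by computing its risk, the expected value of a loss function:
$R(\theta, \hat{\theta}) = E[L(\theta, \hat{\theta})]$,
where $L$ could be the mean squared error loss, the 0/1 loss, or the hinge loss (used for SVMs).
Bias-variance decomposition: for $Y = f(x) + \varepsilon$, where $\varepsilon$ is zero-mean noise with variance $\sigma_\varepsilon^2$,
$\mathrm{Err}(x) = E[(Y - \hat{f}(x))^2] = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma_\varepsilon^2$.
The first term is the squared bias, the second is the variance, and the third is the irreducible noise.
3/3/2015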
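The decomposition can be checked numerically. The sketch below is an illustration and not part of the original slides: it assumes a deliberately biased estimator (a shrunken sample mean of k noisy observations at a single point x), and all names and parameter values are hypothetical. It estimates the squared bias, the variance, and the total squared error by simulation.

```python
import random

def bias_variance_at_point(f_x, sigma, shrink, k, trials, seed=0):
    """Monte Carlo estimate of Err(x), squared bias, and variance at one point x.

    Assumed estimator (illustrative only): f_hat = shrink * mean of k noisy
    observations Y = f(x) + eps, with eps ~ N(0, sigma^2).
    """
    rng = random.Random(seed)
    estimates, sq_errors = [], []
    for _ in range(trials):
        # Draw a fresh training set and fit the estimator.
        ys = [f_x + rng.gauss(0.0, sigma) for _ in range(k)]
        f_hat = shrink * sum(ys) / k
        # Draw an independent test observation at the same x.
        y_new = f_x + rng.gauss(0.0, sigma)
        estimates.append(f_hat)
        sq_errors.append((y_new - f_hat) ** 2)
    mean_est = sum(estimates) / trials
    bias_sq = (mean_est - f_x) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / trials
    err = sum(sq_errors) / trials
    return err, bias_sq, variance

SIGMA = 0.5
err, bias_sq, variance = bias_variance_at_point(
    f_x=2.0, sigma=SIGMA, shrink=0.8, k=5, trials=20000
)
# The two sides of the decomposition should agree up to simulation noise.
print(err, bias_sq + variance + SIGMA ** 2)
```

Moving the shrink factor toward 1 reduces the bias term while increasing the variance term, which is the tradeoff the slide title refers to.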

5-13 Copied from: Junier Oliva

14 Regression

15 Optimization Copied from: Xuezhi Wang

16 Convex Sets Copied from: Xuezhi Wang

17 Convex Functions Copied from: Xuezhi Wang

18 Convex Functions Examples

19 Useful Observations
A function is convex if and only if its epigraph is a convex set.
The below-sets (sublevel sets) of a convex function are convex sets.
A convex function cannot have a local minimum that is not also a global minimum.
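As a rough illustration of the definition behind these observations, the sketch below tests the inequality f(λa + (1-λ)b) ≤ λf(a) + (1-λ)f(b) on a grid of points. This is only a necessary check on a finite grid, not a proof of convexity, and the function name `is_convex_on_grid` is hypothetical.

```python
import math

def is_convex_on_grid(f, lo, hi, n=50, tol=1e-9):
    """Return False if the convexity inequality fails for any grid pair.

    Passing this check does not prove convexity; failing it disproves it.
    """
    pts = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    for a in pts:
        for b in pts:
            for lam in (0.25, 0.5, 0.75):
                mid = lam * a + (1 - lam) * b
                if f(mid) > lam * f(a) + (1 - lam) * f(b) + tol:
                    return False
    return True

square_ok = is_convex_on_grid(lambda x: x * x, -3.0, 3.0)  # x^2 is convex
exp_ok = is_convex_on_grid(math.exp, -2.0, 2.0)            # exp is convex
sin_ok = is_convex_on_grid(math.sin, 0.0, 6.28)            # sin is not convex here
print(square_ok, exp_ok, sin_ok)
```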

20 Gradient Descent Copied from: Xuezhi Wang

21 Newton's Method Copied from: Prof Barnabas

22 Newton's Method Copied from: Prof Barnabas

23 Duality

24 Duality

25 KKT Conditions

26 Perceptrons

27 Convergence of Perceptrons

28 Back to Optimization

29 Gradient Descent

30 Stochastic Gradient Descent

31 SGD and Perceptron
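The slide body was not recovered, so the following is a sketch of the standard connection rather than the slide's own derivation: the perceptron update can be read as stochastic subgradient descent with step size 1 on the per-example loss max(0, -y(w·x + b)). The toy data set and the function name `perceptron_sgd` are hypothetical.

```python
def perceptron_sgd(data, epochs=100, lr=1.0):
    """Perceptron viewed as SGD on the loss max(0, -y * (w.x + b))."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:
                # A subgradient of the loss is (-y * x, -y), so the SGD step
                # adds lr * y * x to w and lr * y to b.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return w, b

# Hypothetical linearly separable toy data: (features, label in {+1, -1}).
toy = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w, b = perceptron_sgd(toy)
margins = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in toy]
print(w, b, margins)
```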

32 SVM Primal: find the maximum-margin hyperplane. Hard margin. Copied from: Junier Oliva

33 SVM Primal: find the maximum-margin hyperplane. Soft margin. Copied from: Junier Oliva

34 SVM Dual: find the maximum-margin hyperplane. Dual for the hard-margin SVM. Copied from: Junier Oliva

35 SVM Dual: find the maximum-margin hyperplane. Dual for the hard-margin SVM. Substituting α for w. Copied from: Junier Oliva

36 SVM Dual: find the maximum-margin hyperplane. Dual for the hard-margin SVM. The constraints are active for the support vectors. Copied from: Junier Oliva

37 SVM Dual: find the maximum-margin hyperplane. Dual for the hard-margin SVM. Copied from: Junier Oliva

38 SVM Computing w: find the maximum-margin hyperplane. Dual for the hard-margin SVM. Copied from: Junier Oliva

39 SVM Computing w: find the maximum-margin hyperplane. Dual for the soft-margin SVM: the only difference from the separable case is the upper bound on the dual variables, $0 \le \alpha_i \le C$. Copied from: Junier Oliva

40 SVM: the feature map. Find the maximum-margin hyperplane, but the data is not linearly separable. Map the inputs to features with a feature map; we obtain a linear separator in the feature space. The feature map is expensive to compute! Copied from: Junier Oliva

41 Introducing the kernel. The dual formulation no longer depends on w, only on a dot product! We obtain a linear separator in the feature space. The feature map is expensive to compute, but we don't have to compute it! What we need is the dot product of the mapped inputs. Let's call this a kernel: a two-variable function that can be written as a dot product. Copied from: Junier Oliva

42 Kernel SVM. The dual formulation no longer depends on w, only on a dot product, which the kernel provides in closed form. This is the famous kernel trick: never compute the feature map; learn using the closed-form K; the cost of a high-dimensional (HD) dot product does not grow with the feature-space dimension. Copied from: Junier Oliva
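A small worked example of the trick, assuming the degree-2 polynomial kernel K(x, z) = (x·z)^2 on 2-D inputs (this particular kernel is chosen for illustration and is not taken from the slides). Its explicit feature map is φ(x) = (x1², √2·x1·x2, x2²), and the sketch checks that the closed-form kernel matches the dot product in feature space.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """Closed-form kernel: never builds phi(x) or phi(z)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly2_kernel(x, z)
via_features = dot(phi(x), phi(z))
print(via_kernel, via_features)
```

For higher degrees or input dimensions the explicit feature vector grows quickly, while the closed-form kernel still costs one dot product in the input space.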

43 Kernel SVM: run time. What happens when we need to classify some $x_0$? Recall that w depends on α. Our classifier for $x_0$ uses w. Copied from: Junier Oliva

44 Kernel SVM: run time. What happens when we need to classify some $x_0$? Recall that w depends on α. Our classifier for $x_0$ uses w. Who needs w when we've got dot products? Copied from: Junier Oliva

45 Kernel SVM Recap. Pick a kernel. Solve the optimization to get α. Compute b using the support vectors. Classify $x_0$ as $\operatorname{sign}\big(\sum_i \alpha_i y_i K(x_i, x_0) + b\big)$. Copied from: Junier Oliva
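The recap can be sketched end to end on a toy problem. The code below skips the optimization step and assumes the dual solution is already known: for the two points (1, 0) with label +1 and (-1, 0) with label -1 under a linear kernel, maximizing the hard-margin dual by hand gives α = (0.5, 0.5). The function names are hypothetical, and b is recovered from the margin condition y_s·(Σ α_i y_i K(x_i, x_s) + b) = 1 at a support vector.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def compute_bias(support, alphas, labels, kernel, s=0):
    """Recover b from support vector s, which lies exactly on the margin."""
    expansion = sum(
        a * y * kernel(x, support[s]) for a, y, x in zip(alphas, labels, support)
    )
    return labels[s] - expansion

def classify(x0, support, alphas, labels, b, kernel):
    """Sign of the kernel expansion; w itself is never formed."""
    score = sum(a * y * kernel(x, x0) for a, y, x in zip(alphas, labels, support)) + b
    return 1 if score >= 0 else -1

# Toy problem with a hand-derived dual solution (an assumption of this sketch).
support = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

b = compute_bias(support, alphas, labels, linear_kernel)
pred_right = classify((3.0, 5.0), support, alphas, labels, b, linear_kernel)
pred_left = classify((-0.2, 7.0), support, alphas, labels, b, linear_kernel)
print(b, pred_right, pred_left)
```

Swapping `linear_kernel` for another kernel changes the dual solution, so the α values above would have to be recomputed by a solver.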

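46 Reminder on Kernels. Remember: kernels are nothing but implicit feature maps. The Gram matrix of a set of vectors $x_1, \dots, x_n$ in the inner product space defined by the kernel K has entries $G_{ij} = K(x_i, x_j)$. The Gram matrix is always symmetric positive semidefinite (it is positive definite only when the mapped vectors are linearly independent).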
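A quick numerical illustration of the Gram matrix property, as a sketch under stated assumptions: the RBF kernel, its gamma value, and the random points are all chosen for illustration. The code builds G with entries K(x_i, x_j) and evaluates the quadratic form zᵀGz for random z, which should never be negative. A second example with a duplicated point under the linear kernel shows the form reaching exactly zero, so semidefinite is the most that can be claimed in general.

```python
import math
import random

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def gram_matrix(points, kernel):
    """G[i][j] = K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def quad_form(G, z):
    n = len(z)
    return sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
G = gram_matrix(points, rbf_kernel)
# Smallest quadratic form over 200 random directions (a spot check, not a proof).
min_form = min(quad_form(G, [rng.gauss(0, 1) for _ in points]) for _ in range(200))

# Duplicated point: the Gram matrix is singular, so z'Gz can be zero for z != 0.
G_dup = gram_matrix([(1.0, 2.0), (1.0, 2.0)], linear_kernel)
zero_form = quad_form(G_dup, [1.0, -1.0])
print(min_form, zero_form)
```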

47 Bayes Rule

48 Law of Large Numbers

49 Central Limit Theorem

50 Tail Bounds

51 More Tail Bounds
