CSCI567 Machine Learning (Fall 2018)


1 CSCI567 Machine Learning (Fall 2018) Prof. Haipeng Luo, U of Southern California, Sep 12, 2018

2 Administration GitHub repos are set up (ask TA Chi Zhang for any issues). HW 1 is due this Sunday (09/16) 11:59PM. You need to submit a form if you use late days (see course website).

3 Administration GitHub repos are set up (ask TA Chi Zhang for any issues). HW 1 is due this Sunday (09/16) 11:59PM. You need to submit a form if you use late days (see course website). Effort-based grade for written assignments: see the explanation on Piazza. Key: let us know what you have tried and how you thought. "I spent an hour and came up with nothing" = empty solution.

4 Outline 1 Review of Last Lecture 2 Multiclass Classification 3 Neural Nets

5 Review of Last Lecture Outline 1 Review of Last Lecture 2 Multiclass Classification 3 Neural Nets

6 Review of Last Lecture Summary Linear models for binary classification: Step 1. Model is the set of separating hyperplanes F = {f(x) = sgn(w^T x) | w ∈ R^D}

7 Review of Last Lecture Step 2. Pick the surrogate loss: perceptron loss ℓ_perceptron(z) = max{0, −z} (used in Perceptron); hinge loss ℓ_hinge(z) = max{0, 1 − z} (used in SVM and many others); logistic loss ℓ_logistic(z) = log(1 + exp(−z)) (used in logistic regression)

8 Review of Last Lecture Step 3. Find the empirical risk minimizer (ERM): w* = argmin_{w ∈ R^D} F(w), where F(w) = Σ_{n=1}^N ℓ(y_n w^T x_n), using GD: w ← w − η ∇F(w); SGD: w ← w − η g, where g is a stochastic estimate of ∇F(w); or Newton: w ← w − (∇² F(w))^{−1} ∇F(w)
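
The three update rules above are easy to state in code. Below is a minimal NumPy sketch (not from the slides) of GD and SGD for the binary logistic loss; the function names and hyperparameters are illustrative, and Newton's method is omitted since it additionally requires the Hessian.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def full_gradient(w, X, y):
    # gradient of F(w) = sum_n log(1 + exp(-y_n w^T x_n)):
    # grad F(w) = sum_n -sigmoid(-y_n w^T x_n) * y_n * x_n
    z = y * (X @ w)
    return -(sigmoid(-z) * y) @ X

def gd(X, y, eta=0.1, steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * full_gradient(w, X, y)          # GD: full gradient step
    return w

def sgd(X, y, eta=0.1, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        n = rng.integers(len(y))                   # one randomly picked example
        z = y[n] * (X[n] @ w)
        w -= eta * (-sigmoid(-z) * y[n] * X[n])    # SGD: stochastic gradient step
    return w
```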

9 Review of Last Lecture A Probabilistic view of logistic regression Minimizing logistic loss = MLE for the sigmoid model: w* = argmin_w Σ_{n=1}^N ℓ_logistic(y_n w^T x_n) = argmax_w Π_{n=1}^N P(y_n | x_n; w), where P(y | x; w) = σ(y w^T x) = 1 / (1 + e^{−y w^T x})

10 Outline 1 Review of Last Lecture 2 Multiclass Classification Multinomial logistic regression Reduction to binary classification 3 Neural Nets

11 Classification Recall the setup: input (feature vector): x ∈ R^D; output (label): y ∈ [C] = {1, 2, ..., C}; goal: learn a mapping f : R^D → [C]

12 Classification Recall the setup: input (feature vector): x ∈ R^D; output (label): y ∈ [C] = {1, 2, ..., C}; goal: learn a mapping f : R^D → [C] Examples: recognizing digits (C = 10) or letters (C = 26 or 52); predicting weather: sunny, cloudy, rainy, etc.; predicting image category: ImageNet dataset (C ≈ 20K)

13 Classification Recall the setup: input (feature vector): x ∈ R^D; output (label): y ∈ [C] = {1, 2, ..., C}; goal: learn a mapping f : R^D → [C] Examples: recognizing digits (C = 10) or letters (C = 26 or 52); predicting weather: sunny, cloudy, rainy, etc.; predicting image category: ImageNet dataset (C ≈ 20K) Nearest Neighbor Classifier naturally works for arbitrary C.

14 Multinomial logistic regression Linear models: from binary to multiclass What should a linear model look like for multiclass tasks?

15 Multinomial logistic regression Linear models: from binary to multiclass What should a linear model look like for multiclass tasks? Note: a linear model for binary tasks (switching from {−1, +1} to {1, 2}): f(x) = 1 if w^T x ≥ 0, and f(x) = 2 if w^T x < 0

16 Multinomial logistic regression Linear models: from binary to multiclass What should a linear model look like for multiclass tasks? Note: a linear model for binary tasks (switching from {−1, +1} to {1, 2}): f(x) = 1 if w^T x ≥ 0, and f(x) = 2 if w^T x < 0; this can be written as f(x) = 1 if w_1^T x ≥ w_2^T x, and f(x) = 2 if w_2^T x > w_1^T x, for any w_1, w_2 s.t. w = w_1 − w_2

17 Multinomial logistic regression Linear models: from binary to multiclass What should a linear model look like for multiclass tasks? Note: a linear model for binary tasks (switching from {−1, +1} to {1, 2}): f(x) = 1 if w^T x ≥ 0, and f(x) = 2 if w^T x < 0; this can be written as f(x) = 1 if w_1^T x ≥ w_2^T x, and f(x) = 2 if w_2^T x > w_1^T x, i.e. f(x) = argmax_{k ∈ {1,2}} w_k^T x, for any w_1, w_2 s.t. w = w_1 − w_2

18 Multinomial logistic regression Linear models: from binary to multiclass What should a linear model look like for multiclass tasks? Note: a linear model for binary tasks (switching from {−1, +1} to {1, 2}): f(x) = 1 if w^T x ≥ 0, and f(x) = 2 if w^T x < 0; this can be written as f(x) = 1 if w_1^T x ≥ w_2^T x, and f(x) = 2 if w_2^T x > w_1^T x, i.e. f(x) = argmax_{k ∈ {1,2}} w_k^T x, for any w_1, w_2 s.t. w = w_1 − w_2. Think of w_k^T x as a score for class k.

19 Multinomial logistic regression Linear models: from binary to multiclass w = (3/2, 1/6) Blue class: {x : w^T x ≥ 0} Orange class: {x : w^T x < 0}

20 Multinomial logistic regression Linear models: from binary to multiclass w = (3/2, 1/6) = w_1 − w_2, with w_1 = (1, −1/3), w_2 = (−1/2, −1/2) Blue class: {x : 1 = argmax_k w_k^T x} Orange class: {x : 2 = argmax_k w_k^T x}

21 Multinomial logistic regression Linear models: from binary to multiclass w_1 = (1, −1/3), w_2 = (−1/2, −1/2) Blue class: {x : 1 = argmax_k w_k^T x} Orange class: {x : 2 = argmax_k w_k^T x}

22 Multinomial logistic regression Linear models: from binary to multiclass w_1 = (1, −1/3), w_2 = (−1/2, −1/2), w_3 = (0, 1) Blue class: {x : 1 = argmax_k w_k^T x} Orange class: {x : 2 = argmax_k w_k^T x} Green class: {x : 3 = argmax_k w_k^T x}

23 Multinomial logistic regression Linear models for multiclass classification F = {f(x) = argmax_{k ∈ [C]} w_k^T x | w_1, ..., w_C ∈ R^D}

24 Multinomial logistic regression Linear models for multiclass classification F = {f(x) = argmax_{k ∈ [C]} w_k^T x | w_1, ..., w_C ∈ R^D} = {f(x) = argmax_{k ∈ [C]} (Wx)_k | W ∈ R^{C×D}}

25 Multinomial logistic regression Linear models for multiclass classification F = {f(x) = argmax_{k ∈ [C]} w_k^T x | w_1, ..., w_C ∈ R^D} = {f(x) = argmax_{k ∈ [C]} (Wx)_k | W ∈ R^{C×D}} How do we generalize perceptron/hinge/logistic loss?

26 Multinomial logistic regression Linear models for multiclass classification F = {f(x) = argmax_{k ∈ [C]} w_k^T x | w_1, ..., w_C ∈ R^D} = {f(x) = argmax_{k ∈ [C]} (Wx)_k | W ∈ R^{C×D}} How do we generalize perceptron/hinge/logistic loss? This lecture: focus on the more popular logistic loss

27 Multinomial logistic regression Multinomial logistic regression: a probabilistic view Observe: for binary logistic regression, with w = w_1 − w_2: P(y = 1 | x; w) = σ(w^T x) = 1 / (1 + e^{−w^T x}) = e^{w_1^T x} / (e^{w_1^T x} + e^{w_2^T x})

28 Multinomial logistic regression Multinomial logistic regression: a probabilistic view Observe: for binary logistic regression, with w = w_1 − w_2: P(y = 1 | x; w) = σ(w^T x) = 1 / (1 + e^{−w^T x}) = e^{w_1^T x} / (e^{w_1^T x} + e^{w_2^T x}) Naturally, for multiclass: P(y = k | x; W) = e^{w_k^T x} / Σ_{k′ ∈ [C]} e^{w_{k′}^T x}

29 Multinomial logistic regression Multinomial logistic regression: a probabilistic view Observe: for binary logistic regression, with w = w_1 − w_2: P(y = 1 | x; w) = σ(w^T x) = 1 / (1 + e^{−w^T x}) = e^{w_1^T x} / (e^{w_1^T x} + e^{w_2^T x}) Naturally, for multiclass: P(y = k | x; W) = e^{w_k^T x} / Σ_{k′ ∈ [C]} e^{w_{k′}^T x} This is called the softmax function.
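
Not part of the slides, but the softmax probabilities are simple to compute; here is a small NumPy sketch with illustrative names. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax_probs(W, x):
    """P(y = k | x; W) = exp(w_k^T x) / sum_k' exp(w_k'^T x).

    W has shape (C, D), x has shape (D,).
    """
    scores = W @ x                 # (C,) vector of scores w_k^T x
    scores = scores - scores.max() # stability shift; cancels in the ratio
    e = np.exp(scores)
    return e / e.sum()
```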

30 Multinomial logistic regression Applying MLE again Maximize the probability of seeing labels y_1, ..., y_N given x_1, ..., x_N: P(W) = Π_{n=1}^N P(y_n | x_n; W) = Π_{n=1}^N e^{w_{y_n}^T x_n} / Σ_{k ∈ [C]} e^{w_k^T x_n}

31 Multinomial logistic regression Applying MLE again Maximize the probability of seeing labels y_1, ..., y_N given x_1, ..., x_N: P(W) = Π_{n=1}^N P(y_n | x_n; W) = Π_{n=1}^N e^{w_{y_n}^T x_n} / Σ_{k ∈ [C]} e^{w_k^T x_n} By taking the negative log, this is equivalent to minimizing F(W) = Σ_{n=1}^N ln(Σ_{k ∈ [C]} e^{w_k^T x_n} / e^{w_{y_n}^T x_n})

32 Multinomial logistic regression Applying MLE again Maximize the probability of seeing labels y_1, ..., y_N given x_1, ..., x_N: P(W) = Π_{n=1}^N P(y_n | x_n; W) = Π_{n=1}^N e^{w_{y_n}^T x_n} / Σ_{k ∈ [C]} e^{w_k^T x_n} By taking the negative log, this is equivalent to minimizing F(W) = Σ_{n=1}^N ln(Σ_{k ∈ [C]} e^{w_k^T x_n} / e^{w_{y_n}^T x_n}) = Σ_{n=1}^N ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})

33 Multinomial logistic regression Applying MLE again Maximize the probability of seeing labels y_1, ..., y_N given x_1, ..., x_N: P(W) = Π_{n=1}^N P(y_n | x_n; W) = Π_{n=1}^N e^{w_{y_n}^T x_n} / Σ_{k ∈ [C]} e^{w_k^T x_n} By taking the negative log, this is equivalent to minimizing F(W) = Σ_{n=1}^N ln(Σ_{k ∈ [C]} e^{w_k^T x_n} / e^{w_{y_n}^T x_n}) = Σ_{n=1}^N ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n}) This is the multiclass logistic loss, a.k.a. cross-entropy loss.

34 Multinomial logistic regression Applying MLE again Maximize the probability of seeing labels y_1, ..., y_N given x_1, ..., x_N: P(W) = Π_{n=1}^N P(y_n | x_n; W) = Π_{n=1}^N e^{w_{y_n}^T x_n} / Σ_{k ∈ [C]} e^{w_k^T x_n} By taking the negative log, this is equivalent to minimizing F(W) = Σ_{n=1}^N ln(Σ_{k ∈ [C]} e^{w_k^T x_n} / e^{w_{y_n}^T x_n}) = Σ_{n=1}^N ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n}) This is the multiclass logistic loss, a.k.a. cross-entropy loss. When C = 2, this is the same as the binary logistic loss.
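
As a concrete companion to the formula above, here is one way to evaluate the multiclass logistic (cross-entropy) loss in NumPy, written via the equivalent log-sum-exp form. This is an illustrative sketch, assuming labels are encoded as integers 0, ..., C−1.

```python
import numpy as np

def multiclass_logistic_loss(W, X, y):
    """F(W) = sum_n ln(1 + sum_{k != y_n} exp((w_k - w_{y_n})^T x_n)),
    computed as sum_n [logsumexp_k(w_k^T x_n) - w_{y_n}^T x_n].

    W: (C, D), X: (N, D), y: (N,) integer labels in {0, ..., C-1}.
    """
    scores = X @ W.T                                  # (N, C); entry (n, k) = w_k^T x_n
    m = scores.max(axis=1, keepdims=True)             # per-example shift for stability
    log_norm = np.log(np.exp(scores - m).sum(axis=1)) + m[:, 0]
    return np.sum(log_norm - scores[np.arange(len(y)), y])
```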

35 Multinomial logistic regression Optimization Apply SGD: what is the gradient of g(W) = ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})?

36 Multinomial logistic regression Optimization Apply SGD: what is the gradient of g(W) = ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})? It's a C × D matrix. Let's focus on the k-th row:

37 Multinomial logistic regression Optimization Apply SGD: what is the gradient of g(W) = ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})? It's a C × D matrix. Let's focus on the k-th row: If k ≠ y_n: ∇_{w_k} g(W) = (e^{(w_k − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n})) x_n^T

38 Multinomial logistic regression Optimization Apply SGD: what is the gradient of g(W) = ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})? It's a C × D matrix. Let's focus on the k-th row: If k ≠ y_n: ∇_{w_k} g(W) = (e^{(w_k − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n})) x_n^T = P(k | x_n; W) x_n^T

39 Multinomial logistic regression Optimization Apply SGD: what is the gradient of g(W) = ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})? It's a C × D matrix. Let's focus on the k-th row: If k ≠ y_n: ∇_{w_k} g(W) = (e^{(w_k − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n})) x_n^T = P(k | x_n; W) x_n^T; else: ∇_{w_{y_n}} g(W) = −(Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n})) x_n^T

40 Multinomial logistic regression Optimization Apply SGD: what is the gradient of g(W) = ln(1 + Σ_{k ≠ y_n} e^{(w_k − w_{y_n})^T x_n})? It's a C × D matrix. Let's focus on the k-th row: If k ≠ y_n: ∇_{w_k} g(W) = (e^{(w_k − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n})) x_n^T = P(k | x_n; W) x_n^T; else: ∇_{w_{y_n}} g(W) = −(Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n} / (1 + Σ_{k′ ≠ y_n} e^{(w_{k′} − w_{y_n})^T x_n})) x_n^T = (P(y_n | x_n; W) − 1) x_n^T

41 Multinomial logistic regression SGD for multinomial logistic regression Initialize W = 0 (or randomly). Repeat: 1 pick n ∈ [N] uniformly at random 2 update the parameters: W ← W − η (P(y = 1 | x_n; W), ..., P(y = y_n | x_n; W) − 1, ..., P(y = C | x_n; W))^T x_n^T

42 Multinomial logistic regression SGD for multinomial logistic regression Initialize W = 0 (or randomly). Repeat: 1 pick n ∈ [N] uniformly at random 2 update the parameters: W ← W − η (P(y = 1 | x_n; W), ..., P(y = y_n | x_n; W) − 1, ..., P(y = C | x_n; W))^T x_n^T Think about why the algorithm makes sense intuitively.
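
One way this update could look in NumPy is sketched below (illustrative names and hyperparameters, assuming 0-indexed integer labels); the update is rank-one: the vector of predicted probabilities, with 1 subtracted at the true label, times x_n^T.

```python
import numpy as np

def sgd_multinomial_lr(X, y, C, eta=0.1, steps=10000, seed=0):
    """W <- W - eta * (p - e_{y_n}) x_n^T, with p the softmax probabilities."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((C, D))
    for _ in range(steps):
        n = rng.integers(N)                        # pick n uniformly at random
        scores = W @ X[n]
        scores = scores - scores.max()             # numerical stability
        p = np.exp(scores) / np.exp(scores).sum()  # P(y = k | x_n; W) for all k
        p[y[n]] -= 1.0                             # subtract 1 at the true label
        W -= eta * np.outer(p, X[n])               # rank-one gradient update
    return W
```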

43 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x

44 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x, or make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x}

45 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x, or make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x} In either case, the (expected) mistake is bounded by the logistic loss:

46 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x, or make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x} In either case, the (expected) mistake is bounded by the logistic loss: deterministic: I[f(x) ≠ y] ≤ log_2(1 + Σ_{k ≠ y} e^{(w_k − w_y)^T x})

47 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x, or make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x} In either case, the (expected) mistake is bounded by the logistic loss: deterministic: I[f(x) ≠ y] ≤ log_2(1 + Σ_{k ≠ y} e^{(w_k − w_y)^T x}); randomized: E[I[f(x) ≠ y]]

48 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x, or make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x} In either case, the (expected) mistake is bounded by the logistic loss: deterministic: I[f(x) ≠ y] ≤ log_2(1 + Σ_{k ≠ y} e^{(w_k − w_y)^T x}); randomized: E[I[f(x) ≠ y]] = 1 − P(y | x; W)

49 Multinomial logistic regression A note on prediction Having learned W, we can either make a deterministic prediction argmax_{k ∈ [C]} w_k^T x, or make a randomized prediction according to P(k | x; W) ∝ e^{w_k^T x} In either case, the (expected) mistake is bounded by the logistic loss: deterministic: I[f(x) ≠ y] ≤ log_2(1 + Σ_{k ≠ y} e^{(w_k − w_y)^T x}); randomized: E[I[f(x) ≠ y]] = 1 − P(y | x; W) ≤ −ln P(y | x; W)

50 Reduction to binary classification Reduce multiclass to binary Is there an even more general and simpler approach to derive multiclass classification algorithms?

51 Reduction to binary classification Reduce multiclass to binary Is there an even more general and simpler approach to derive multiclass classification algorithms? Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm, in a black-box manner?

52 Reduction to binary classification Reduce multiclass to binary Is there an even more general and simpler approach to derive multiclass classification algorithms? Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm, in a black-box manner? Yes, there are in fact many ways to do it: one-versus-all (one-versus-rest, one-against-all, etc.); one-versus-one (all-versus-all, etc.); Error-Correcting Output Codes (ECOC); tree-based reduction

53 Reduction to binary classification One-versus-all (OvA) (picture credit: link) Idea: train C binary classifiers to learn "is class k or not?" for each k.

54 Reduction to binary classification One-versus-all (OvA) (picture credit: link) Idea: train C binary classifiers to learn "is class k or not?" for each k. Training: for each class k ∈ [C], relabel examples with class k as +1, and all others as −1; train a binary classifier h_k using this new dataset

55 Reduction to binary classification One-versus-all (OvA) (picture credit: link) Idea: train C binary classifiers to learn "is class k or not?" for each k. Training: for each class k ∈ [C], relabel examples with class k as +1, and all others as −1; train a binary classifier h_k using this new dataset

56 Reduction to binary classification One-versus-all (OvA) Prediction: for a new example x, ask each h_k: does this belong to class k? (i.e. h_k(x))

57 Reduction to binary classification One-versus-all (OvA) Prediction: for a new example x, ask each h_k: does this belong to class k? (i.e. h_k(x)); randomly pick among all k's s.t. h_k(x) = +1.

58 Reduction to binary classification One-versus-all (OvA) Prediction: for a new example x, ask each h_k: does this belong to class k? (i.e. h_k(x)); randomly pick among all k's s.t. h_k(x) = +1. Issue: will (probably) make a mistake as long as one of the h_k errs.
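
A black-box OvA reduction is short to write down. The sketch below is not from the slides; it uses scikit-learn's LogisticRegression purely as a stand-in for "any binary classification algorithm", and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ova_train(X, y, classes):
    """Train one binary classifier per class: 'is class k or not?'."""
    classifiers = {}
    for k in classes:
        labels = np.where(y == k, 1, -1)   # class k -> +1, everything else -> -1
        classifiers[k] = LogisticRegression().fit(X, labels)
    return classifiers

def ova_predict(classifiers, x, rng=np.random.default_rng(0)):
    """Randomly pick among the classes whose classifier says +1."""
    positives = [k for k, h in classifiers.items()
                 if h.predict(x.reshape(1, -1))[0] == 1]
    if not positives:                      # no classifier fired; fall back to any class
        positives = list(classifiers.keys())
    return rng.choice(positives)
```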

59 Reduction to binary classification One-versus-one (OvO) (picture credit: link) Idea: train (C choose 2) = C(C − 1)/2 binary classifiers to learn "is class k or k′?".

60 Reduction to binary classification One-versus-one (OvO) (picture credit: link) Idea: train (C choose 2) = C(C − 1)/2 binary classifiers to learn "is class k or k′?". Training: for each pair (k, k′), relabel examples with class k as +1 and examples with class k′ as −1; discard all other examples; train a binary classifier h_(k,k′) using this new dataset

61 Reduction to binary classification One-versus-one (OvO) (picture credit: link) Idea: train (C choose 2) = C(C − 1)/2 binary classifiers to learn "is class k or k′?". Training: for each pair (k, k′), relabel examples with class k as +1 and examples with class k′ as −1; discard all other examples; train a binary classifier h_(k,k′) using this new dataset

62 Reduction to binary classification One-versus-one (OvO) Prediction: for a new example x, ask each classifier h_(k,k′) to vote for either class k or k′

63 Reduction to binary classification One-versus-one (OvO) Prediction: for a new example x, ask each classifier h_(k,k′) to vote for either class k or k′; predict the class with the most votes (break ties in some way)

64 Reduction to binary classification One-versus-one (OvO) Prediction: for a new example x, ask each classifier h_(k,k′) to vote for either class k or k′; predict the class with the most votes (break ties in some way) More robust than one-versus-all, but slower in prediction.
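
Again as an illustrative black-box sketch (not from the slides, with scikit-learn's LogisticRegression standing in for the binary learner), OvO training and voting look like this:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def ovo_train(X, y, classes):
    """Train one classifier per pair (k, k'): keep only examples of those two classes."""
    classifiers = {}
    for k, k2 in combinations(classes, 2):
        mask = (y == k) | (y == k2)
        labels = np.where(y[mask] == k, 1, -1)      # k -> +1, k' -> -1
        classifiers[(k, k2)] = LogisticRegression().fit(X[mask], labels)
    return classifiers

def ovo_predict(classifiers, x):
    """Each pairwise classifier votes; predict the class with the most votes."""
    votes = {}
    for (k, k2), h in classifiers.items():
        winner = k if h.predict(x.reshape(1, -1))[0] == 1 else k2
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)                # ties broken arbitrarily by max()
```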

65 Reduction to binary classification Error-correcting output codes (ECOC) (picture credit: link) Idea: based on a code M ∈ {−1, +1}^{C×L}, train L binary classifiers to learn "is bit b on or off?".

66 Reduction to binary classification Error-correcting output codes (ECOC) (picture credit: link) Idea: based on a code M ∈ {−1, +1}^{C×L}, train L binary classifiers to learn "is bit b on or off?". Training: for each bit b ∈ [L], relabel example x_n as M_{y_n, b}; train a binary classifier h_b using this new dataset.

67 Reduction to binary classification Error-correcting output codes (ECOC) Prediction: for a new example x, compute the predicted code c = (h_1(x), ..., h_L(x))^T

68 Reduction to binary classification Error-correcting output codes (ECOC) Prediction: for a new example x, compute the predicted code c = (h_1(x), ..., h_L(x))^T; predict the class with the most similar code: argmax_k (Mc)_k

69 Reduction to binary classification Error-correcting output codes (ECOC) Prediction: for a new example x, compute the predicted code c = (h_1(x), ..., h_L(x))^T; predict the class with the most similar code: argmax_k (Mc)_k How to pick the code M?

70 Reduction to binary classification Error-correcting output codes (ECOC) Prediction: for a new example x, compute the predicted code c = (h_1(x), ..., h_L(x))^T; predict the class with the most similar code: argmax_k (Mc)_k How to pick the code M? the more dissimilar the codes between different classes are, the better

71 Reduction to binary classification Error-correcting output codes (ECOC) Prediction: for a new example x, compute the predicted code c = (h_1(x), ..., h_L(x))^T; predict the class with the most similar code: argmax_k (Mc)_k How to pick the code M? the more dissimilar the codes between different classes are, the better; a random code is a good choice,

72 Reduction to binary classification Error-correcting output codes (ECOC) Prediction: for a new example x, compute the predicted code c = (h_1(x), ..., h_L(x))^T; predict the class with the most similar code: argmax_k (Mc)_k How to pick the code M? the more dissimilar the codes between different classes are, the better; a random code is a good choice, but might create hard training sets
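
For concreteness, here is a toy ECOC sketch with a random code matrix (not from the slides; 0-indexed integer labels and scikit-learn's LogisticRegression are assumptions, and a real implementation would also check that every code column contains both labels among the classes present).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_train(X, y, C, L, seed=0):
    """Random code M in {-1,+1}^{C x L}; one binary classifier per bit b."""
    rng = np.random.default_rng(seed)
    M = rng.choice([-1, 1], size=(C, L))
    classifiers = []
    for b in range(L):
        labels = M[y, b]                          # relabel x_n as M[y_n, b]
        classifiers.append(LogisticRegression().fit(X, labels))
    return M, classifiers

def ecoc_predict(M, classifiers, x):
    """Predict the class whose code row agrees most with the predicted code c."""
    c = np.array([h.predict(x.reshape(1, -1))[0] for h in classifiers])
    return int(np.argmax(M @ c))                  # argmax_k (M c)_k
```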

73 Reduction to binary classification Tree-based method Idea: train C binary classifiers to learn "belongs to which half?".

74 Reduction to binary classification Tree-based method Idea: train C binary classifiers to learn "belongs to which half?". Training: see pictures

75 Reduction to binary classification Tree-based method Idea: train C binary classifiers to learn "belongs to which half?". Training: see pictures Prediction is also natural,

76 Reduction to binary classification Tree-based method Idea: train C binary classifiers to learn "belongs to which half?". Training: see pictures Prediction is also natural, but is very fast! (think ImageNet where C ≈ 20K)
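
To make the idea concrete, here is a toy recursive sketch (not the slides' construction): split the current label set into two halves, train a binary classifier for that split, and recurse; prediction then walks from the root to a leaf using about log2(C) classifiers. scikit-learn's LogisticRegression again stands in for an arbitrary binary learner, and the sketch assumes every half of every split has training examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tree_train(X, y, classes):
    """Recursively split the label set in half and train one classifier per split."""
    if len(classes) == 1:
        return classes[0]                                   # leaf: a single class
    left, right = classes[:len(classes) // 2], classes[len(classes) // 2:]
    mask = np.isin(y, classes)
    labels = np.where(np.isin(y[mask], left), 1, -1)        # left half -> +1, right half -> -1
    h = LogisticRegression().fit(X[mask], labels)
    return (h, tree_train(X, y, left), tree_train(X, y, right))

def tree_predict(node, x):
    """Follow ~log2(C) classifier decisions from the root down to a leaf."""
    while isinstance(node, tuple):
        h, left_subtree, right_subtree = node
        node = left_subtree if h.predict(x.reshape(1, -1))[0] == 1 else right_subtree
    return node
```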

77–89 Reduction to binary classification Comparisons In big-O notation (the table is filled in cell by cell across these slides; the completed version is):

Reduction | #training points | test time  | remark
OvA       | CN               | C          | not robust
OvO       | CN               | C^2        | can achieve very small training error
ECOC      | LN               | L          | need diversity when designing code
Tree      | (log_2 C)N       | log_2 C    | good for extreme classification

90 Neural Nets Outline 1 Review of Last Lecture 2 Multiclass Classification 3 Neural Nets Definition Backpropagation Preventing overfitting

91 Neural Nets Definition Linear models are not always adequate We can use a nonlinear mapping as discussed: φ : x ∈ R^D → z ∈ R^M

92 Neural Nets Definition Linear models are not always adequate We can use a nonlinear mapping as discussed: φ : x ∈ R^D → z ∈ R^M But what kind of nonlinear mapping φ should be used? Can we actually learn this nonlinear mapping?

93 Neural Nets Definition Linear models are not always adequate We can use a nonlinear mapping as discussed: φ : x ∈ R^D → z ∈ R^M But what kind of nonlinear mapping φ should be used? Can we actually learn this nonlinear mapping? THE most popular nonlinear models nowadays: neural nets

94 Neural Nets Definition Linear model as a one-layer neural net h(a) = a for a linear model

95 Neural Nets Definition Linear model as a one-layer neural net h(a) = a for a linear model To create non-linearity, can use: Rectified Linear Unit (ReLU): h(a) = max{0, a}; sigmoid function: h(a) = 1 / (1 + e^{−a}); TanH: h(a) = (e^a − e^{−a}) / (e^a + e^{−a}); many more

96 Neural Nets Definition More output nodes W ∈ R^{4×3}, h : R^4 → R^4, so h(a) = (h_1(a_1), h_2(a_2), h_3(a_3), h_4(a_4))

97 Neural Nets Definition More output nodes W ∈ R^{4×3}, h : R^4 → R^4, so h(a) = (h_1(a_1), h_2(a_2), h_3(a_3), h_4(a_4)) Can think of this as a nonlinear basis: Φ(x) = h(Wx)

98 Neural Nets Definition More layers Becomes a network:

99 Neural Nets Definition More layers Becomes a network: each node is called a neuron

100 Neural Nets Definition More layers Becomes a network: each node is called a neuron; h is called the activation function; can use h(a) = 1 for one neuron in each layer to incorporate the bias term; the output neuron can use h(a) = a

101 Neural Nets Definition More layers Becomes a network: each node is called a neuron; h is called the activation function; can use h(a) = 1 for one neuron in each layer to incorporate the bias term; the output neuron can use h(a) = a; #layers refers to #hidden layers (plus 1 or 2 for input/output layers)

102 Neural Nets Definition More layers Becomes a network: each node is called a neuron; h is called the activation function; can use h(a) = 1 for one neuron in each layer to incorporate the bias term; the output neuron can use h(a) = a; #layers refers to #hidden layers (plus 1 or 2 for input/output layers); deep neural nets can have many layers and millions of parameters

103 Neural Nets Definition More layers Becomes a network: each node is called a neuron; h is called the activation function; can use h(a) = 1 for one neuron in each layer to incorporate the bias term; the output neuron can use h(a) = a; #layers refers to #hidden layers (plus 1 or 2 for input/output layers); deep neural nets can have many layers and millions of parameters; this is a feedforward, fully connected neural net; there are many variants

104 Neural Nets Definition How powerful are neural nets? Universal approximation theorem (Cybenko, 89; Hornik, 91): A feedforward neural net with a single hidden layer can approximate any continuous function.

105 Neural Nets Definition How powerful are neural nets? Universal approximation theorem (Cybenko, 89; Hornik, 91): A feedforward neural net with a single hidden layer can approximate any continuous function. It might need a huge number of neurons though, and depth helps!

106 Neural Nets Definition How powerful are neural nets? Universal approximation theorem (Cybenko, 89; Hornik, 91): A feedforward neural net with a single hidden layer can approximate any continuous function. It might need a huge number of neurons though, and depth helps! Designing the network architecture is important and very complicated: for a feedforward network, one needs to decide the number of hidden layers, the number of neurons at each layer, activation functions, etc.

107 Neural Nets Definition Math formulation An L-layer neural net can be written as f(x) = h_L(W_L h_{L−1}(W_{L−1} ··· h_1(W_1 x)))

108 Neural Nets Definition Math formulation An L-layer neural net can be written as f(x) = h_L(W_L h_{L−1}(W_{L−1} ··· h_1(W_1 x))) To ease notation, for a given input x, define recursively: o_0 = x, a_l = W_l o_{l−1}, o_l = h_l(a_l) (l = 1, ..., L)

109 Neural Nets Definition Math formulation An L-layer neural net can be written as f(x) = h_L(W_L h_{L−1}(W_{L−1} ··· h_1(W_1 x))) To ease notation, for a given input x, define recursively: o_0 = x, a_l = W_l o_{l−1}, o_l = h_l(a_l) (l = 1, ..., L), where W_l ∈ R^{D_l × D_{l−1}} is the weight matrix for layer l; D_0 = D, D_1, ..., D_L are the numbers of neurons at each layer; a_l ∈ R^{D_l} is the input to layer l; o_l ∈ R^{D_l} is the output of layer l; h_l : R^{D_l} → R^{D_l} is the activation function at layer l
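
The recursion o_0 = x, a_l = W_l o_{l−1}, o_l = h_l(a_l) translates directly into code. A minimal NumPy sketch follows (illustrative names; ReLU is shown only as one example activation):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def forward(x, weights, activations):
    """Compute o_0 = x, a_l = W_l o_{l-1}, o_l = h_l(a_l) for l = 1..L.

    weights: list [W_1, ..., W_L] with W_l of shape (D_l, D_{l-1}).
    activations: list [h_1, ..., h_L] of elementwise functions.
    Returns all pre-activations a_l and outputs o_l (both are needed by backprop).
    """
    o = [x]        # o[0] = x
    a = [None]     # a[0] unused, to keep indices aligned with the slides
    for W, h in zip(weights, activations):
        a.append(W @ o[-1])
        o.append(h(a[-1]))
    return a, o
```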

110 Neural Nets Definition Learning the model No matter how complicated the model is, our goal is the same: minimize E(W_1, ..., W_L) = Σ_{n=1}^N E_n(W_1, ..., W_L)

111 Neural Nets Definition Learning the model No matter how complicated the model is, our goal is the same: minimize E(W_1, ..., W_L) = Σ_{n=1}^N E_n(W_1, ..., W_L), where E_n(W_1, ..., W_L) = ||f(x_n) − y_n||_2^2 for regression, and E_n(W_1, ..., W_L) = ln(1 + Σ_{k ≠ y_n} e^{f(x_n)_k − f(x_n)_{y_n}}) for classification

112 Neural Nets Backpropagation How to optimize such a complicated function? Same thing: apply SGD! even if the model is nonconvex.

113 Neural Nets Backpropagation How to optimize such a complicated function? Same thing: apply SGD! even if the model is nonconvex. What is the gradient of this complicated function?

114 Neural Nets Backpropagation How to optimize such a complicated function? Same thing: apply SGD! even if the model is nonconvex. What is the gradient of this complicated function? Chain rule is the only secret: for a composite function f(g(w)): ∂f/∂w = (∂f/∂g)(∂g/∂w)

115 Neural Nets Backpropagation How to optimize such a complicated function? Same thing: apply SGD! even if the model is nonconvex. What is the gradient of this complicated function? Chain rule is the only secret: for a composite function f(g(w)): ∂f/∂w = (∂f/∂g)(∂g/∂w); for a composite function f(g_1(w), ..., g_d(w)): ∂f/∂w = Σ_{i=1}^d (∂f/∂g_i)(∂g_i/∂w)

116 Neural Nets Backpropagation How to optimize such a complicated function? Same thing: apply SGD! even if the model is nonconvex. What is the gradient of this complicated function? Chain rule is the only secret: for a composite function f(g(w)): ∂f/∂w = (∂f/∂g)(∂g/∂w); for a composite function f(g_1(w), ..., g_d(w)): ∂f/∂w = Σ_{i=1}^d (∂f/∂g_i)(∂g_i/∂w); the simplest example: f(g_1(w), g_2(w)) = g_1(w) g_2(w)

117 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij

118 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij: ∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij)

119 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij: ∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij) = (∂E_n/∂a_i) ∂(w_ij o_j)/∂w_ij

120 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij: ∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij) = (∂E_n/∂a_i) ∂(w_ij o_j)/∂w_ij = (∂E_n/∂a_i) o_j

121 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij: ∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij) = (∂E_n/∂a_i) ∂(w_ij o_j)/∂w_ij = (∂E_n/∂a_i) o_j; and ∂E_n/∂a_i = (∂E_n/∂o_i)(∂o_i/∂a_i)

122 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij: ∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij) = (∂E_n/∂a_i) ∂(w_ij o_j)/∂w_ij = (∂E_n/∂a_i) o_j; and ∂E_n/∂a_i = (∂E_n/∂o_i)(∂o_i/∂a_i) = (Σ_k (∂E_n/∂a_k)(∂a_k/∂o_i)) h_i′(a_i)

123 Neural Nets Backpropagation Computing the derivative Drop the subscript l for the layer for simplicity. Find the derivative of E_n w.r.t. w_ij: ∂E_n/∂w_ij = (∂E_n/∂a_i)(∂a_i/∂w_ij) = (∂E_n/∂a_i) ∂(w_ij o_j)/∂w_ij = (∂E_n/∂a_i) o_j; and ∂E_n/∂a_i = (∂E_n/∂o_i)(∂o_i/∂a_i) = (Σ_k (∂E_n/∂a_k)(∂a_k/∂o_i)) h_i′(a_i) = (Σ_k (∂E_n/∂a_k) w_ki) h_i′(a_i)

124 Neural Nets Backpropagation Computing the derivative Adding the subscript for the layer: ∂E_n/∂w_{l,ij} = (∂E_n/∂a_{l,i}) o_{l−1,j} and ∂E_n/∂a_{l,i} = (Σ_k (∂E_n/∂a_{l+1,k}) w_{l+1,ki}) h_{l,i}′(a_{l,i})

125 Neural Nets Backpropagation Computing the derivative Adding the subscript for the layer: ∂E_n/∂w_{l,ij} = (∂E_n/∂a_{l,i}) o_{l−1,j} and ∂E_n/∂a_{l,i} = (Σ_k (∂E_n/∂a_{l+1,k}) w_{l+1,ki}) h_{l,i}′(a_{l,i}) For the last layer, for the square loss: ∂E_n/∂a_{L,i} = ∂(h_{L,i}(a_{L,i}) − y_{n,i})² / ∂a_{L,i}

126 Neural Nets Backpropagation Computing the derivative Adding the subscript for the layer: ∂E_n/∂w_{l,ij} = (∂E_n/∂a_{l,i}) o_{l−1,j} and ∂E_n/∂a_{l,i} = (Σ_k (∂E_n/∂a_{l+1,k}) w_{l+1,ki}) h_{l,i}′(a_{l,i}) For the last layer, for the square loss: ∂E_n/∂a_{L,i} = ∂(h_{L,i}(a_{L,i}) − y_{n,i})² / ∂a_{L,i} = 2(h_{L,i}(a_{L,i}) − y_{n,i}) h_{L,i}′(a_{L,i})

127 Neural Nets Backpropagation Computing the derivative Adding the subscript for the layer: ∂E_n/∂w_{l,ij} = (∂E_n/∂a_{l,i}) o_{l−1,j} and ∂E_n/∂a_{l,i} = (Σ_k (∂E_n/∂a_{l+1,k}) w_{l+1,ki}) h_{l,i}′(a_{l,i}) For the last layer, for the square loss: ∂E_n/∂a_{L,i} = ∂(h_{L,i}(a_{L,i}) − y_{n,i})² / ∂a_{L,i} = 2(h_{L,i}(a_{L,i}) − y_{n,i}) h_{L,i}′(a_{L,i}) Exercise: try to do it for the logistic loss yourself.

128 Neural Nets Backpropagation Computing the derivative Using matrix notation greatly simplifies presentation and implementation: ∂E_n/∂W_l = (∂E_n/∂a_l) o_{l−1}^T, where ∂E_n/∂a_l = (W_{l+1}^T (∂E_n/∂a_{l+1})) ∘ h_l′(a_l) if l < L, and ∂E_n/∂a_L = 2(h_L(a_L) − y_n) ∘ h_L′(a_L) else, where v_1 ∘ v_2 = (v_{11} v_{21}, ..., v_{1D} v_{2D}) is the element-wise product (a.k.a. Hadamard product). Verify yourself!

129 Neural Nets Backpropagation Putting everything into SGD The backpropagation algorithm (Backprop) Initialize W_1, ..., W_L (all 0 or randomly). Repeat: 1 randomly pick one data point n ∈ [N]

130 Neural Nets Backpropagation Putting everything into SGD The backpropagation algorithm (Backprop) Initialize W_1, ..., W_L (all 0 or randomly). Repeat: 1 randomly pick one data point n ∈ [N] 2 forward propagation: for each layer l = 1, ..., L, compute a_l = W_l o_{l−1} and o_l = h_l(a_l) (with o_0 = x_n)

131 Neural Nets Backpropagation Putting everything into SGD The backpropagation algorithm (Backprop) Initialize W_1, ..., W_L (all 0 or randomly). Repeat: 1 randomly pick one data point n ∈ [N] 2 forward propagation: for each layer l = 1, ..., L, compute a_l = W_l o_{l−1} and o_l = h_l(a_l) (with o_0 = x_n) 3 backward propagation: for each l = L, ..., 1, compute ∂E_n/∂a_l = (W_{l+1}^T (∂E_n/∂a_{l+1})) ∘ h_l′(a_l) if l < L, else 2(h_L(a_L) − y_n) ∘ h_L′(a_L); update weights W_l ← W_l − η ∂E_n/∂W_l = W_l − η (∂E_n/∂a_l) o_{l−1}^T

132 Neural Nets Backpropagation Putting everything into SGD The backpropagation algorithm (Backprop) Initialize W_1, ..., W_L (all 0 or randomly). Repeat: 1 randomly pick one data point n ∈ [N] 2 forward propagation: for each layer l = 1, ..., L, compute a_l = W_l o_{l−1} and o_l = h_l(a_l) (with o_0 = x_n) 3 backward propagation: for each l = L, ..., 1, compute ∂E_n/∂a_l = (W_{l+1}^T (∂E_n/∂a_{l+1})) ∘ h_l′(a_l) if l < L, else 2(h_L(a_L) − y_n) ∘ h_L′(a_L); update weights W_l ← W_l − η ∂E_n/∂W_l = W_l − η (∂E_n/∂a_l) o_{l−1}^T Think about how to do the last two steps properly!
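
Putting steps 2 and 3 together, here is one compact NumPy sketch of a single Backprop/SGD step. It assumes ReLU hidden activations, an identity output activation, and the squared loss; these choices and all names are illustrative, not the slides' exact pseudocode.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def relu_grad(a):
    return (a > 0).astype(float)

def backprop_step(weights, x, y, eta):
    """One SGD step on E_n = ||f(x) - y||^2 for a ReLU net with a linear output layer."""
    L = len(weights)
    # forward propagation: a_l = W_l o_{l-1}, o_l = h_l(a_l)
    o, a = [x], [None]
    for l, W in enumerate(weights, start=1):
        a.append(W @ o[-1])
        o.append(a[-1] if l == L else relu(a[-1]))   # identity activation at the output
    # backward propagation
    delta = 2.0 * (o[L] - y)                         # dE/da_L = 2(h_L(a_L) - y), h_L'(a) = 1
    for l in range(L, 0, -1):
        grad_W = np.outer(delta, o[l - 1])           # dE/dW_l = (dE/da_l) o_{l-1}^T
        if l > 1:
            # dE/da_{l-1} = (W_l^T dE/da_l) * h'_{l-1}(a_{l-1}), computed before updating W_l
            delta = (weights[l - 1].T @ delta) * relu_grad(a[l - 1])
        weights[l - 1] -= eta * grad_W               # W_l <- W_l - eta dE/dW_l
    return weights
```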

133 Neural Nets Backpropagation More tricks to optimize neural nets Many variants based on backprop: SGD with minibatch: randomly sample a batch of examples to form a stochastic gradient; SGD with momentum

134 Neural Nets Backpropagation SGD with momentum Initialize w_0 and velocity v = 0. For t = 1, 2, ...: form a stochastic gradient g_t; update velocity v ← αv − ηg_t for some discount factor α ∈ (0, 1); update weight w_t ← w_{t−1} + v

135 Neural Nets Backpropagation SGD with momentum Initialize w_0 and velocity v = 0. For t = 1, 2, ...: form a stochastic gradient g_t; update velocity v ← αv − ηg_t for some discount factor α ∈ (0, 1); update weight w_t ← w_{t−1} + v Updates for the first few rounds: w_1 = w_0 − ηg_1, w_2 = w_1 − αηg_1 − ηg_2, w_3 = w_2 − α²ηg_1 − αηg_2 − ηg_3
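
A direct NumPy transcription of the velocity/weight updates above (an illustrative sketch; the caller supplies the stochastic gradient):

```python
import numpy as np

def sgd_momentum(w0, stochastic_grad, eta=0.01, alpha=0.9, steps=100):
    """v <- alpha*v - eta*g_t, then w_t <- w_{t-1} + v."""
    w = w0.copy()
    v = np.zeros_like(w)
    for t in range(steps):
        g = stochastic_grad(w)     # caller-supplied stochastic gradient at w
        v = alpha * v - eta * g    # velocity update with discount factor alpha
        w = w + v                  # weight update
    return w
```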

136 Neural Nets Preventing overfitting Overfitting Overfitting is very likely since the models are too powerful. Methods to overcome overfitting: data augmentation, regularization, dropout, early stopping

137 Neural Nets Preventing overfitting Data augmentation Data: the more the better. How do we get more data?

138 Neural Nets Preventing overfitting Data augmentation Data: the more the better. How do we get more data? Exploit prior knowledge to add more training data

139 Neural Nets Preventing overfitting Regularization L2 regularization: minimize E′(W_1, ..., W_L) = E(W_1, ..., W_L) + λ Σ_{l=1}^L ||W_l||_2^2

140 Neural Nets Preventing overfitting Regularization L2 regularization: minimize E′(W_1, ..., W_L) = E(W_1, ..., W_L) + λ Σ_{l=1}^L ||W_l||_2^2 Simple change to the gradient: ∂E′/∂w_ij = ∂E/∂w_ij + 2λ w_ij

141 Neural Nets Preventing overfitting Regularization L2 regularization: minimize E′(W_1, ..., W_L) = E(W_1, ..., W_L) + λ Σ_{l=1}^L ||W_l||_2^2 Simple change to the gradient: ∂E′/∂w_ij = ∂E/∂w_ij + 2λ w_ij Introduces a weight-decay effect
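
In code, the only change is the extra 2λW_l term in each gradient step; a tiny illustrative sketch:

```python
import numpy as np

def l2_regularized_update(weights, grads, eta, lam):
    """One gradient step on E' = E + lam * sum_l ||W_l||_2^2.

    The extra 2*lam*W_l term shrinks every weight toward zero at each step,
    which is the 'weight decay' effect mentioned above.
    """
    return [W - eta * (g + 2.0 * lam * W) for W, g in zip(weights, grads)]
```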

142 Neural Nets Preventing overfitting Dropout Randomly delete neurons during training Very effective, makes training faster as well

143 Neural Nets Preventing overfitting Early stopping Stop training when the performance on the validation set stops improving [Figure: loss (negative log-likelihood) vs. time (epochs), showing the training set loss and the validation set loss]
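
A skeleton of the early-stopping loop (illustrative; the training step and validation evaluation are supplied by the caller):

```python
def train_with_early_stopping(train_step, validation_loss, max_epochs=200, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()                          # one epoch of training (caller-supplied)
        loss = validation_loss()              # caller-supplied validation evaluation
        if loss < best:
            best, best_epoch = loss, epoch    # in practice, also save the model here
        elif epoch - best_epoch >= patience:
            break                             # validation performance stopped improving
    return best
```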

144 Neural Nets Preventing overfitting Conclusions for neural nets Deep neural networks are hugely popular, achieving best performance on many problems

145 Neural Nets Preventing overfitting Conclusions for neural nets Deep neural networks are hugely popular, achieving best performance on many problems do need a lot of data to work well

146 Neural Nets Preventing overfitting Conclusions for neural nets Deep neural networks are hugely popular, achieving best performance on many problems do need a lot of data to work well take a lot of time to train (need GPUs for massive parallel computing)

147 Neural Nets Preventing overfitting Conclusions for neural nets Deep neural networks are hugely popular, achieving best performance on many problems do need a lot of data to work well take a lot of time to train (need GPUs for massive parallel computing) take some work to select architecture and hyperparameters

148 Neural Nets Preventing overfitting Conclusions for neural nets Deep neural networks are hugely popular, achieving best performance on many problems do need a lot of data to work well take a lot of time to train (need GPUs for massive parallel computing) take some work to select architecture and hyperparameters are still not well understood in theory


More information

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University

Deep Feedforward Networks. Seung-Hoon Na Chonbuk National University Deep Feedforward Networks Seung-Hoon Na Chonbuk National University Neural Network: Types Feedforward neural networks (FNN) = Deep feedforward networks = multilayer perceptrons (MLP) No feedback connections

More information

CSC321 Lecture 5: Multilayer Perceptrons

CSC321 Lecture 5: Multilayer Perceptrons CSC321 Lecture 5: Multilayer Perceptrons Roger Grosse Roger Grosse CSC321 Lecture 5: Multilayer Perceptrons 1 / 21 Overview Recall the simple neuron-like unit: y output output bias i'th weight w 1 w2 w3

More information

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY 1 On-line Resources http://neuralnetworksanddeeplearning.com/index.html Online book by Michael Nielsen http://matlabtricks.com/post-5/3x3-convolution-kernelswith-online-demo

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17 3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural

More information

Week 5: Logistic Regression & Neural Networks

Week 5: Logistic Regression & Neural Networks Week 5: Logistic Regression & Neural Networks Instructor: Sergey Levine 1 Summary: Logistic Regression In the previous lecture, we covered logistic regression. To recap, logistic regression models and

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machine Learning (Fall 24) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu October 2, 24 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 24) October 2, 24 / 24 Outline Review

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Artificial Neural Networks. MGS Lecture 2

Artificial Neural Networks. MGS Lecture 2 Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

text classification 3: neural networks

text classification 3: neural networks text classification 3: neural networks CS 585, Fall 2018 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs585/ Mohit Iyyer College of Information and Computer Sciences University

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 4 Backpropagation

Lecture 4 Backpropagation Lecture 4 Backpropagation CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 5, 2017 Things we will look at today More Backpropagation Still more backpropagation Quiz

More information

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions?

Online Videos FERPA. Sign waiver or sit on the sides or in the back. Off camera question time before and after lecture. Questions? Online Videos FERPA Sign waiver or sit on the sides or in the back Off camera question time before and after lecture Questions? Lecture 1, Slide 1 CS224d Deep NLP Lecture 4: Word Window Classification

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Deep Neural Networks (1) Hidden layers; Back-propagation

Deep Neural Networks (1) Hidden layers; Back-propagation Deep Neural Networs (1) Hidden layers; Bac-propagation Steve Renals Machine Learning Practical MLP Lecture 3 4 October 2017 / 9 October 2017 MLP Lecture 3 Deep Neural Networs (1) 1 Recap: Softmax single

More information

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch. Neural Networks Oliver Schulte - CMPT 726 Bishop PRML Ch. 5 Neural Networks Neural networks arise from attempts to model human/animal brains Many models, many claims of biological plausibility We will

More information

CSC321 Lecture 9: Generalization

CSC321 Lecture 9: Generalization CSC321 Lecture 9: Generalization Roger Grosse Roger Grosse CSC321 Lecture 9: Generalization 1 / 27 Overview We ve focused so far on how to optimize neural nets how to get them to make good predictions

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Neural Networks and the Back-propagation Algorithm

Neural Networks and the Back-propagation Algorithm Neural Networks and the Back-propagation Algorithm Francisco S. Melo In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information

Feed-forward Network Functions

Feed-forward Network Functions Feed-forward Network Functions Sargur Srihari Topics 1. Extension of linear models 2. Feed-forward Network Functions 3. Weight-space symmetries 2 Recap of Linear Models Linear Models for Regression, Classification

More information

Feedforward Neural Networks

Feedforward Neural Networks Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Artificial Neural Networks and Nonparametric Methods CMPSCI 383 Nov 17, 2011! Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011! 1 Todayʼs lecture" How the brain works (!)! Artificial neural networks! Perceptrons! Multilayer feed-forward networks! Error

More information

Applied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson

Applied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson Applied Machine Learning Lecture 5: Linear classifiers, continued Richard Johansson overview preliminaries logistic regression training a logistic regression classifier side note: multiclass linear classifiers

More information

Machine Learning Basics

Machine Learning Basics Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each

More information

CSC242: Intro to AI. Lecture 21

CSC242: Intro to AI. Lecture 21 CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages

More information