CSCI567 Machine Learning (Fall 2018)
1 CSCI567 Machine Learning (Fall 2018), Prof. Haipeng Luo, University of Southern California, Sep 12, 2018
3 Administration. GitHub repos are set up (ask TA Chi Zhang about any issues). HW 1 is due this Sunday (09/16) at 11:59PM. You need to submit a form if you use late days (see course website). Written assignments receive an effort-based grade: see the explanation on Piazza. The key is to let us know what you have tried and how you thought about the problem; "I spent an hour and came up with nothing" counts as an empty solution.
4 Outline: 1. Review of Last Lecture 2. Multiclass Classification 3. Neural Nets
6 Review of Last Lecture: Summary. Linear models for binary classification. Step 1. The model is the set of separating hyperplanes $F = \{f(x) = \mathrm{sgn}(w^\top x) \mid w \in \mathbb{R}^D\}$.
7 Review of Last Lecture: Step 2. Pick the surrogate loss: perceptron loss $\ell_{\text{perceptron}}(z) = \max\{0, -z\}$ (used in Perceptron); hinge loss $\ell_{\text{hinge}}(z) = \max\{0, 1 - z\}$ (used in SVM and many others); logistic loss $\ell_{\text{logistic}}(z) = \log(1 + \exp(-z))$ (used in logistic regression).
8 Review of Last Lecture: Step 3. Find the empirical risk minimizer (ERM): $w^* = \mathrm{argmin}_{w \in \mathbb{R}^D} F(w) = \mathrm{argmin}_{w \in \mathbb{R}^D} \sum_{n=1}^N \ell(y_n w^\top x_n)$, using GD: $w \leftarrow w - \eta \nabla F(w)$; SGD: $w \leftarrow w - \eta \tilde{\nabla} F(w)$ (a stochastic estimate of the gradient); Newton: $w \leftarrow w - \left(\nabla^2 F(w)\right)^{-1} \nabla F(w)$.
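The GD and SGD updates can be sketched in numpy for the binary logistic loss. The toy data, step size, and iteration counts below are illustrative assumptions, not from the slides (the Newton step is omitted for brevity):

```python
import numpy as np

def grad_F(w, X, y):
    """Gradient of F(w) = sum_n log(1 + exp(-y_n w^T x_n))."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y_n w^T x_n)
    return -(X.T @ (s * y))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))  # toy labels

w_gd = np.zeros(2)
for _ in range(200):                  # GD: w <- w - eta * grad F(w)
    w_gd -= 0.1 / len(X) * grad_F(w_gd, X, y)

w_sgd = np.zeros(2)
for _ in range(2000):                 # SGD: gradient of one random example
    n = rng.integers(len(X))
    w_sgd -= 0.1 * grad_F(w_sgd, X[n:n+1], y[n:n+1])
```

Both runs head in the same direction; SGD simply trades exact gradients for many cheap noisy steps.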
9 Review of Last Lecture: A probabilistic view of logistic regression. Minimizing the logistic loss is MLE for the sigmoid model: $w^* = \mathrm{argmin}_w \sum_{n=1}^N \ell_{\text{logistic}}(y_n w^\top x_n) = \mathrm{argmax}_w \prod_{n=1}^N P(y_n \mid x_n; w)$, where $P(y \mid x; w) = \sigma(y w^\top x) = \frac{1}{1 + e^{-y w^\top x}}$.
10 Outline: 1. Review of Last Lecture 2. Multiclass Classification (Multinomial logistic regression; Reduction to binary classification) 3. Neural Nets
13 Classification. Recall the setup: input (feature vector): $x \in \mathbb{R}^D$; output (label): $y \in [C] = \{1, 2, \ldots, C\}$; goal: learn a mapping $f : \mathbb{R}^D \to [C]$. Examples: recognizing digits ($C = 10$) or letters ($C = 26$ or $52$); predicting weather: sunny, cloudy, rainy, etc.; predicting image category: ImageNet dataset ($C \approx 20$K). The Nearest Neighbor Classifier naturally works for arbitrary $C$.
18 Multinomial logistic regression: Linear models, from binary to multiclass. What should a linear model look like for multiclass tasks? Note: a linear model for binary tasks (switching labels from $\{-1, +1\}$ to $\{1, 2\}$), $f(x) = 1$ if $w^\top x \ge 0$ and $f(x) = 2$ if $w^\top x < 0$, can be written as $f(x) = 1$ if $w_1^\top x \ge w_2^\top x$ and $f(x) = 2$ if $w_2^\top x > w_1^\top x$, i.e. $f(x) = \mathrm{argmax}_{k \in \{1,2\}} w_k^\top x$, for any $w_1, w_2$ s.t. $w = w_1 - w_2$. Think of $w_k^\top x$ as a score for class $k$.
19–22 Multinomial logistic regression: Linear models, from binary to multiclass (figures). Binary example: $w = (\frac{3}{2}, \frac{1}{6})$; blue class: $\{x : w^\top x \ge 0\}$, orange class: $\{x : w^\top x < 0\}$. The same boundary via scores: $w = w_1 - w_2$ with $w_1 = (1, -\frac{1}{3})$ and $w_2 = (-\frac{1}{2}, -\frac{1}{2})$; blue class: $\{x : 1 = \mathrm{argmax}_k w_k^\top x\}$, orange class: $\{x : 2 = \mathrm{argmax}_k w_k^\top x\}$. Adding a third score vector $w_3 = (0, 1)$ yields a green class: $\{x : 3 = \mathrm{argmax}_k w_k^\top x\}$.
26 Multinomial logistic regression: Linear models for multiclass classification. $F = \{f(x) = \mathrm{argmax}_{k \in [C]} w_k^\top x \mid w_1, \ldots, w_C \in \mathbb{R}^D\} = \{f(x) = \mathrm{argmax}_{k \in [C]} (Wx)_k \mid W \in \mathbb{R}^{C \times D}\}$. How do we generalize the perceptron/hinge/logistic loss? This lecture focuses on the more popular logistic loss.
29 Multinomial logistic regression: a probabilistic view. Observe: for binary logistic regression, with $w = w_1 - w_2$: $P(y = 1 \mid x; w) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}} = \frac{e^{w_1^\top x}}{e^{w_1^\top x} + e^{w_2^\top x}}$. Naturally, for multiclass: $P(y = k \mid x; W) = \frac{e^{w_k^\top x}}{\sum_{k' \in [C]} e^{w_{k'}^\top x}}$. This is called the softmax function.
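A minimal numpy sketch of the softmax function. Subtracting the max score before exponentiating is a standard numerical-stability trick (an addition here, not on the slides) that leaves the ratio unchanged; the example weights are made up for illustration:

```python
import numpy as np

def softmax(scores):
    """Map scores (w_1^T x, ..., w_C^T x) to P(y = k | x; W)."""
    s = scores - np.max(scores)   # stability shift; ratios are unchanged
    e = np.exp(s)
    return e / e.sum()

W = np.array([[1.0, 0.0],         # w_1
              [0.0, 1.0],         # w_2
              [-1.0, -1.0]])      # w_3  (C = 3, D = 2)
x = np.array([2.0, 1.0])
p = softmax(W @ x)                # class probabilities; sums to 1
```

With $C = 2$ this recovers the sigmoid applied to $(w_1 - w_2)^\top x$, matching the binary case above.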
34 Multinomial logistic regression: Applying MLE again. Maximize the probability of seeing labels $y_1, \ldots, y_N$ given $x_1, \ldots, x_N$: $P(W) = \prod_{n=1}^N P(y_n \mid x_n; W) = \prod_{n=1}^N \frac{e^{w_{y_n}^\top x_n}}{\sum_{k \in [C]} e^{w_k^\top x_n}}$. Taking the negative log, this is equivalent to minimizing $F(W) = \sum_{n=1}^N \ln\left(\frac{\sum_{k \in [C]} e^{w_k^\top x_n}}{e^{w_{y_n}^\top x_n}}\right) = \sum_{n=1}^N \ln\left(1 + \sum_{k \ne y_n} e^{(w_k - w_{y_n})^\top x_n}\right)$. This is the multiclass logistic loss, a.k.a. the cross-entropy loss. When $C = 2$, it coincides with the binary logistic loss.
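The two forms of the loss can be checked against each other numerically. A minimal sketch with made-up weights; the log-sum-exp max-shift is a standard stability trick, not on the slides:

```python
import numpy as np

def multiclass_logistic_loss(W, x, y):
    """ln( sum_k e^{w_k^T x} / e^{w_y^T x} ), via a stable log-sum-exp."""
    scores = W @ x
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum()) - scores[y]

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # C = 3, D = 2
x = np.array([0.5, 2.0])
loss = multiclass_logistic_loss(W, x, y=1)

# Second form: ln(1 + sum_{k != y} e^{(w_k - w_y)^T x})
alt = np.log(1.0 + sum(np.exp((W[k] - W[1]) @ x) for k in (0, 2)))
```

The two quantities agree, as the algebra above says they must.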
40 Multinomial logistic regression: Optimization. Apply SGD: what is the gradient of $g(W) = \ln\left(1 + \sum_{k' \ne y_n} e^{(w_{k'} - w_{y_n})^\top x_n}\right)$? It is a $C \times D$ matrix; focus on the $k$-th row. If $k \ne y_n$: $\nabla_{w_k} g(W) = \frac{e^{(w_k - w_{y_n})^\top x_n}}{1 + \sum_{k' \ne y_n} e^{(w_{k'} - w_{y_n})^\top x_n}} x_n^\top = P(k \mid x_n; W)\, x_n^\top$; else: $\nabla_{w_{y_n}} g(W) = \frac{-\sum_{k' \ne y_n} e^{(w_{k'} - w_{y_n})^\top x_n}}{1 + \sum_{k' \ne y_n} e^{(w_{k'} - w_{y_n})^\top x_n}} x_n^\top = \left(P(y_n \mid x_n; W) - 1\right) x_n^\top$.
42 SGD for multinomial logistic regression. Initialize $W = 0$ (or randomly). Repeat: (1) pick $n \in [N]$ uniformly at random; (2) update the parameters $W \leftarrow W - \eta \begin{pmatrix} P(y = 1 \mid x_n; W) \\ \vdots \\ P(y = y_n \mid x_n; W) - 1 \\ \vdots \\ P(y = C \mid x_n; W) \end{pmatrix} x_n^\top$. Think about why the algorithm makes sense intuitively.
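The update above fits in one numpy step per example: the $k$-th gradient row is $(P(k \mid x; W) - \mathbb{I}[k = y])\, x^\top$. The Gaussian-blob data, step size, and step count below are illustrative assumptions:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sgd_multinomial(X, y, C, eta=0.1, steps=5000, seed=0):
    """SGD for multinomial logistic regression."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = np.zeros((C, D))
    for _ in range(steps):
        n = rng.integers(N)
        p = softmax(W @ X[n])           # class probabilities
        p[y[n]] -= 1.0                  # subtract 1 at the true class
        W -= eta * np.outer(p, X[n])    # W <- W - eta * gradient
    return W

# toy data: three Gaussian blobs (not from the slides)
rng = np.random.default_rng(1)
means = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
X = np.vstack([m + 0.3 * rng.normal(size=(50, 2)) for m in means])
y = np.repeat([0, 1, 2], 50)
W = sgd_multinomial(X, y, C=3)
```

Intuitively, each step pushes $w_{y_n}$ toward $x_n$ (its probability is below 1) and pushes every other $w_k$ away in proportion to the probability mass it wrongly claims.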
49 Multinomial logistic regression: A note on prediction. Having learned $W$, we can either make a deterministic prediction $\mathrm{argmax}_{k \in [C]} w_k^\top x$, or make a randomized prediction according to $P(k \mid x; W) \propto e^{w_k^\top x}$. In either case, the (expected) mistake is bounded by the logistic loss. Deterministic: $\mathbb{I}[f(x) \ne y] \le \log_2\left(1 + \sum_{k \ne y} e^{(w_k - w_y)^\top x}\right)$; randomized: $\mathbb{E}\left[\mathbb{I}[f(x) \ne y]\right] = 1 - P(y \mid x; W) \le -\ln P(y \mid x; W)$.
52 Reduction to binary classification: Reduce multiclass to binary. Is there an even more general and simpler approach to deriving multiclass classification algorithms? Given a binary classification algorithm (any one, not just linear methods), can we turn it into a multiclass algorithm in a black-box manner? Yes, and in fact there are many ways to do it: one-versus-all (one-versus-rest, one-against-all, etc.); one-versus-one (all-versus-all, etc.); Error-Correcting Output Codes (ECOC); tree-based reduction.
55 One-versus-all (OvA) (picture credit: link). Idea: train $C$ binary classifiers to learn "is it class $k$ or not?" for each $k$. Training: for each class $k \in [C]$, relabel examples with class $k$ as $+1$ and all others as $-1$, then train a binary classifier $h_k$ using this new dataset.
58 One-versus-all (OvA). Prediction: for a new example $x$, ask each $h_k$: does this belong to class $k$? (i.e. compute $h_k(x)$), then randomly pick among all $k$'s s.t. $h_k(x) = +1$. Issue: OvA will (probably) make a mistake as long as any single $h_k$ errs.
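A minimal OvA sketch on top of a black-box binary learner. The learner here is a tiny logistic-regression GD and the blob data are made up for illustration; one liberty taken: instead of randomly picking among the positive $h_k$'s, prediction takes the class with the largest score, a common tie-breaking choice:

```python
import numpy as np

def train_binary(X, y, eta=0.1, steps=300):
    """Black-box binary learner (here: logistic regression via GD);
    any learner producing a real-valued score would do."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))   # sigma(-y w^T x)
        w += eta / len(X) * (X.T @ (s * y))
    return w

def ova_train(X, y, C):
    # classifier k: examples of class k relabeled +1, all others -1
    return [train_binary(X, np.where(y == k, 1.0, -1.0)) for k in range(C)]

def ova_predict(ws, x):
    return int(np.argmax([w @ x for w in ws]))  # largest score wins

rng = np.random.default_rng(2)
means = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
X = np.vstack([m + 0.3 * rng.normal(size=(40, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])        # bias feature
y = np.repeat([0, 1, 2], 40)
ws = ova_train(X, y, C=3)
```

Note that `train_binary` is called only through its interface, so any binary classifier could be swapped in, which is the point of the reduction.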
61 One-versus-one (OvO) (picture credit: link). Idea: train $\binom{C}{2}$ binary classifiers to learn "is it class $k$ or $k'$?". Training: for each pair $(k, k')$, relabel examples with class $k$ as $+1$ and examples with class $k'$ as $-1$, discard all other examples, and train a binary classifier $h_{(k,k')}$ using this new dataset.
64 One-versus-one (OvO). Prediction: for a new example $x$, ask each classifier $h_{(k,k')}$ to vote for either class $k$ or $k'$, then predict the class with the most votes (breaking ties in some way). More robust than one-versus-all, but slower at prediction time.
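The pairwise training and voting can be sketched the same way; the binary learner and blob data below are the same illustrative assumptions as before, and ties are broken by lowest class index:

```python
import numpy as np
from itertools import combinations

def train_binary(X, y, eta=0.1, steps=300):
    """Black-box binary learner (logistic regression via GD)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        w += eta / len(X) * (X.T @ (s * y))
    return w

def ovo_train(X, y, C):
    h = {}
    for k, kp in combinations(range(C), 2):
        mask = (y == k) | (y == kp)              # discard all other classes
        h[(k, kp)] = train_binary(X[mask], np.where(y[mask] == k, 1.0, -1.0))
    return h

def ovo_predict(h, x, C):
    votes = np.zeros(C)
    for (k, kp), w in h.items():
        votes[k if w @ x >= 0 else kp] += 1      # one vote per pair
    return int(np.argmax(votes))                 # ties: lowest class index

rng = np.random.default_rng(2)
means = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
X = np.vstack([m + 0.3 * rng.normal(size=(40, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])         # bias feature
y = np.repeat([0, 1, 2], 40)
h = ovo_train(X, y, C=3)
```

Each of the $\binom{C}{2}$ classifiers sees only two classes' data, which is why training sets are small but prediction touches $O(C^2)$ classifiers.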
66 Error-correcting output codes (ECOC) (picture credit: link). Idea: based on a code $M \in \{-1, +1\}^{C \times L}$, train $L$ binary classifiers to learn "is bit $b$ on or off?". Training: for each bit $b \in [L]$, relabel example $x_n$ as $M_{y_n, b}$ and train a binary classifier $h_b$ using this new dataset.
72 Error-correcting output codes (ECOC). Prediction: for a new example $x$, compute the predicted code $c = (h_1(x), \ldots, h_L(x))^\top$, then predict the class with the most similar code: $k = \mathrm{argmax}_k (Mc)_k$. How to pick the code $M$? The more dissimilar the codes of different classes are, the better. A random code is a good choice, but it might create hard training sets.
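A minimal ECOC sketch. The code matrix below is hand-picked for illustration (not from the slides): its rows pairwise differ in 4 of $L = 6$ bits, so a single flipped bit can still be decoded correctly. The binary learner and blob data are the same illustrative assumptions as in the other reductions:

```python
import numpy as np

def train_binary(X, y, eta=0.1, steps=300):
    """Black-box binary learner (logistic regression via GD)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(y * (X @ w)))
        w += eta / len(X) * (X.T @ (s * y))
    return w

# Illustrative M in {-1,+1}^{C x L}, C = 3, L = 6; any two rows differ in 4 bits
M = np.array([[ 1,  1,  1, -1, -1, -1],
              [ 1, -1, -1,  1,  1, -1],
              [-1,  1, -1,  1, -1,  1]])

def ecoc_train(X, y, M):
    # bit b: example x_n is relabeled M[y_n, b]
    return [train_binary(X, M[y, b].astype(float)) for b in range(M.shape[1])]

def ecoc_predict(hs, M, x):
    c = np.array([1 if w @ x >= 0 else -1 for w in hs])  # predicted code
    return int(np.argmax(M @ c))          # row of M most similar to c

rng = np.random.default_rng(3)
means = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
X = np.vstack([m + 0.3 * rng.normal(size=(40, 2)) for m in means])
X = np.hstack([X, np.ones((len(X), 1))])   # bias feature
y = np.repeat([0, 1, 2], 40)
hs = ecoc_train(X, y, M)
```

Decoding by $\mathrm{argmax}_k (Mc)_k$ is exactly picking the row of $M$ with the smallest Hamming distance to $c$.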
76 Tree-based method. Idea: train $C$ binary classifiers to learn "belongs to which half?" (training: see pictures). Prediction is also natural, and it is very fast! (Think ImageNet, where $C \approx 20$K.)
89 Reduction to binary classification: Comparisons. In big-O notation:

Reduction | #training points | test time  | remark
OvA       | CN               | C          | not robust
OvO       | CN               | C^2        | can achieve very small training error
ECOC      | LN               | L          | need diversity when designing the code
Tree      | (log_2 C)N       | log_2 C    | good for "extreme classification"
90 Outline: 1. Review of Last Lecture 2. Multiclass Classification 3. Neural Nets (Definition; Backpropagation; Preventing overfitting)
93 Neural Nets: Linear models are not always adequate. We can use a nonlinear mapping as discussed: $\phi(x) : x \in \mathbb{R}^D \to z \in \mathbb{R}^M$. But what kind of nonlinear mapping $\phi$ should be used? Can we actually learn this nonlinear mapping? THE most popular nonlinear models nowadays: neural nets.
95 Neural Nets: Linear model as a one-layer neural net. $h(a) = a$ recovers the linear model. To create nonlinearity, one can use the Rectified Linear Unit (ReLU): $h(a) = \max\{0, a\}$; the sigmoid function: $h(a) = \frac{1}{1 + e^{-a}}$; TanH: $h(a) = \frac{e^a - e^{-a}}{e^a + e^{-a}}$; and many more.
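The three activations can be written directly in numpy; a minimal sketch:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)            # max{0, a}, applied element-wise

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))      # 1 / (1 + e^{-a})

def tanh(a):
    return np.tanh(a)                    # (e^a - e^{-a}) / (e^a + e^{-a})

a = np.array([-2.0, 0.0, 3.0])
```

All three act element-wise, so they apply unchanged to a whole layer's pre-activations.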
97 Neural Nets: More output nodes. $W \in \mathbb{R}^{4 \times 3}$, $h : \mathbb{R}^4 \to \mathbb{R}^4$, so $h(a) = (h_1(a_1), h_2(a_2), h_3(a_3), h_4(a_4))$. Can think of this as a nonlinear basis: $\Phi(x) = h(Wx)$.
103 Neural Nets: More layers. This becomes a network: each node is called a neuron; $h$ is called the activation function; one can use $h(a) = 1$ for one neuron in each layer to incorporate the bias term; the output neuron can use $h(a) = a$; #layers refers to #hidden layers (plus 1 or 2 for the input/output layers); deep neural nets can have many layers and millions of parameters; this is a feedforward, fully connected neural net, and there are many variants.
106 Neural Nets: How powerful are neural nets? Universal approximation theorem (Cybenko, 89; Hornik, 91): a feedforward neural net with a single hidden layer can approximate any continuous function. It might need a huge number of neurons though, and depth helps! Designing the network architecture is important and very complicated; for a feedforward network, one needs to decide the number of hidden layers, the number of neurons at each layer, the activation functions, etc.
109 Neural Nets: Math formulation. An $L$-layer neural net can be written as $f(x) = h_L(W_L\, h_{L-1}(W_{L-1} \cdots h_1(W_1 x)))$. To ease notation, for a given input $x$, define recursively $o_0 = x$, $a_\ell = W_\ell\, o_{\ell-1}$, $o_\ell = h_\ell(a_\ell)$ $(\ell = 1, \ldots, L)$, where $W_\ell \in \mathbb{R}^{D_\ell \times D_{\ell-1}}$ are the weights for layer $\ell$; $D_0 = D, D_1, \ldots, D_L$ are the numbers of neurons at each layer; $a_\ell \in \mathbb{R}^{D_\ell}$ is the input to layer $\ell$; $o_\ell \in \mathbb{R}^{D_\ell}$ is the output of layer $\ell$; $h_\ell : \mathbb{R}^{D_\ell} \to \mathbb{R}^{D_\ell}$ is the activation function at layer $\ell$.
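The recursion $o_0 = x$, $a_\ell = W_\ell o_{\ell-1}$, $o_\ell = h_\ell(a_\ell)$ translates directly into a forward pass; the tiny weights below are made up for illustration:

```python
import numpy as np

def forward(x, Ws, hs):
    """Compute o_0 = x, a_l = W_l o_{l-1}, o_l = h_l(a_l), l = 1..L."""
    a_list, o_list = [], [x]
    for W, h in zip(Ws, hs):
        a = W @ o_list[-1]
        a_list.append(a)
        o_list.append(h(a))
    return a_list, o_list          # o_list[-1] is f(x)

relu = lambda a: np.maximum(0.0, a)
identity = lambda a: a             # output layer: h(a) = a

Ws = [np.array([[1.0, -1.0], [0.5, 0.5]]),   # W_1: R^2 -> R^2
      np.array([[1.0, 2.0]])]                 # W_2: R^2 -> R^1
x = np.array([1.0, 2.0])
a_list, o_list = forward(x, Ws, [relu, identity])
```

Keeping every $a_\ell$ and $o_\ell$ around is deliberate: the backward pass on the later slides reuses exactly these quantities.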
111 Neural Nets: Learning the model. No matter how complicated the model is, our goal is the same: minimize $E(W_1, \ldots, W_L) = \sum_{n=1}^N E_n(W_1, \ldots, W_L)$, where $E_n(W_1, \ldots, W_L) = \|f(x_n) - y_n\|_2^2$ for regression, and $E_n(W_1, \ldots, W_L) = \ln\left(1 + \sum_{k \ne y_n} e^{f(x_n)_k - f(x_n)_{y_n}}\right)$ for classification.
116 Neural Nets: Backpropagation. How do we optimize such a complicated function? Same thing: apply SGD! (Even though the model is nonconvex.) What is the gradient of this complicated function? The chain rule is the only secret: for a composite function $f(g(w))$, $\frac{\partial f}{\partial w} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial w}$; for a composite function $f(g_1(w), \ldots, g_d(w))$, $\frac{\partial f}{\partial w} = \sum_{i=1}^d \frac{\partial f}{\partial g_i} \frac{\partial g_i}{\partial w}$. The simplest example: $f(g_1(w), g_2(w)) = g_1(w)\, g_2(w)$.
123 Neural Nets: Computing the derivative. Drop the layer subscript $\ell$ for simplicity. Find the derivative of $E_n$ w.r.t. $w_{ij}$: $\frac{\partial E_n}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_i} \frac{\partial a_i}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_i} \frac{\partial (w_{ij} o_j)}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_i} o_j$, and $\frac{\partial E_n}{\partial a_i} = \frac{\partial E_n}{\partial o_i} \frac{\partial o_i}{\partial a_i} = \left(\sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial o_i}\right) h_i'(a_i) = \left(\sum_k \frac{\partial E_n}{\partial a_k} w_{ki}\right) h_i'(a_i)$, where $k$ ranges over the neurons of the next layer.
127 Neural Nets Backpropagation Computing the derivative Adding the subscript for layer: E n a l,i = E n w l,ij = E n a l,i o l 1,j ( k E n a l+1,k w l+1,ki ) For the last layer, for square loss E n = (h L,i(a L,i ) y n,i ) 2 a L,i a L,i h l,i (a l,i) = 2(h L,i (a L,i ) y n,i )h L,i (a L,i) Exercise: try to do it for logistic loss yourself. September 12, / 49
Computing the derivative: matrix notation

Using matrix notation greatly simplifies presentation and implementation:

∂E_n/∂W_l = (∂E_n/∂a_l) o_{l−1}^T

∂E_n/∂a_l = (W_{l+1}^T ∂E_n/∂a_{l+1}) ∘ h'_l(a_l) if l < L
∂E_n/∂a_L = 2(h_L(a_L) − y_n) ∘ h'_L(a_L)

where v_1 ∘ v_2 = (v_11 v_21, ..., v_1D v_2D) is the element-wise product (a.k.a. Hadamard product). Verify this yourself!
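One way to "verify yourself" is a numerical gradient check: implement the matrix formulas and compare them against finite differences of the loss. A minimal sketch with numpy, assuming (purely for illustration) a two-layer net with tanh hidden units, an identity output layer, and square loss:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(2)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))

def loss(W1, W2):
    o1 = np.tanh(W1 @ x)          # hidden layer: h_1 = tanh
    o2 = W2 @ o1                  # output layer: h_2 = identity
    return np.sum((o2 - y) ** 2)  # square loss E_n

# Analytic gradients from the matrix backprop formulas.
a1 = W1 @ x; o1 = np.tanh(a1); a2 = W2 @ o1
dE_da2 = 2 * (a2 - y)                      # last layer (h_L' = 1 here)
dE_da1 = (W2.T @ dE_da2) * (1 - o1 ** 2)   # (W_{l+1}^T dE/da_{l+1}) * h'(a_l)
dE_dW2 = np.outer(dE_da2, o1)              # dE/dW_l = (dE/da_l) o_{l-1}^T
dE_dW1 = np.outer(dE_da1, x)

# Central-difference check on one entry of W1.
eps = 1e-5
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps; W1m[0, 0] -= eps
num = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
assert abs(num - dE_dW1[0, 0]) < 1e-4
```

The same check, repeated over a few random entries of each W_l, catches most sign and transpose mistakes in a backprop implementation.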
Putting everything into SGD

The backpropagation algorithm (Backprop):

Initialize W_1, ..., W_L (all zeros, or randomly). Repeat:
1. randomly pick one data point n ∈ [N]
2. forward propagation: for each layer l = 1, ..., L, compute a_l = W_l o_{l−1} and o_l = h_l(a_l) (with o_0 = x_n)
3. backward propagation: for each l = L, ..., 1, compute
∂E_n/∂a_l = (W_{l+1}^T ∂E_n/∂a_{l+1}) ∘ h'_l(a_l) if l < L, else 2(h_L(a_L) − y_n) ∘ h'_L(a_L)
and update the weights
W_l ← W_l − η ∂E_n/∂W_l = W_l − η (∂E_n/∂a_l) o_{l−1}^T

Think about how to carry out the last two steps properly!
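The algorithm above can be sketched in a few lines of numpy. This is an illustrative implementation, not the course's reference code; it assumes tanh activations at every hidden layer, an identity output layer, and square loss. Note that in the backward pass each δ_l must be propagated through W_l before W_l is updated:

```python
import numpy as np

def backprop_sgd(X, Y, sizes, eta=0.01, epochs=50, seed=0):
    """SGD with backprop for a fully connected net.

    sizes = [input dim, hidden dims ..., output dim]; tanh hidden
    units, identity output layer, square loss.
    """
    rng = np.random.default_rng(seed)
    # Initialize W_1, ..., W_L randomly (small values).
    W = [0.1 * rng.standard_normal((sizes[l + 1], sizes[l]))
         for l in range(len(sizes) - 1)]
    L = len(W)
    for _ in range(epochs):
        n = rng.integers(len(X))                 # pick one data point
        # Forward propagation: a_l = W_l o_{l-1}, o_l = h_l(a_l).
        o = [X[n]]
        for l in range(L):
            a = W[l] @ o[-1]
            o.append(a if l == L - 1 else np.tanh(a))
        # Backward propagation, from l = L down to 1.
        delta = 2 * (o[-1] - Y[n])               # dE/da_L for square loss
        for l in reversed(range(L)):
            grad = np.outer(delta, o[l])         # dE/dW_l = delta o_{l-1}^T
            if l > 0:                            # propagate BEFORE updating W_l
                delta = (W[l].T @ delta) * (1 - o[l] ** 2)  # tanh'(a) = 1 - o^2
            W[l] -= eta * grad
    return W
```

For example, `backprop_sgd(X, Y, [2, 3, 1])` trains a 2-3-1 network on inputs `X` of shape (N, 2) and targets `Y` of shape (N, 1).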
More tricks to optimize neural nets

Many variants are based on backprop:
- SGD with a minibatch: randomly sample a batch of examples to form a stochastic gradient
- SGD with momentum
SGD with momentum

Initialize w_0 and the velocity v = 0. For t = 1, 2, ...:
- form a stochastic gradient g_t
- update the velocity: v ← αv − ηg_t, for some discount factor α ∈ (0, 1)
- update the weight: w_t ← w_{t−1} + v

Updates for the first few rounds:
w_1 = w_0 − ηg_1
w_2 = w_1 − αηg_1 − ηg_2
w_3 = w_2 − α²ηg_1 − αηg_2 − ηg_3

so every past gradient keeps contributing, discounted geometrically by α.
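The update rule fits in a few lines; here the gradient function is a stand-in for whatever stochastic gradient backprop produces, and the quadratic objective in the usage line is purely illustrative:

```python
import numpy as np

def sgd_momentum(grad, w0, eta=0.1, alpha=0.9, steps=300):
    """SGD with momentum: v <- alpha*v - eta*g_t, w <- w + v."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)            # stochastic gradient g_t (exact here)
        v = alpha * v - eta * g
        w = w + v
    return w

# Illustrative objective f(w) = ||w - 3||^2, gradient 2(w - 3);
# the iterates oscillate toward the minimizer at w = (3, 3).
w_star = sgd_momentum(lambda w: 2 * (w - 3), w0=[0.0, 0.0])
```

With α = 0 this reduces to plain SGD; larger α averages more past gradients into each step.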
Preventing overfitting

Overfitting is very likely, since these models are so powerful. Methods to mitigate overfitting:
- data augmentation
- regularization
- dropout
- early stopping
Data augmentation

Data: the more the better. How do we get more data? Exploit prior knowledge to add more training data.
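For images, for instance, the prior knowledge that labels are invariant to small transformations lets each training image generate several new ones. A toy sketch, where horizontal flips and additive noise are just two illustrative choices of transformation:

```python
import numpy as np

def augment(images, noise_std=0.05, seed=0):
    """Return the original images plus flipped and noise-perturbed copies.

    images: array of shape (N, H, W).
    """
    rng = np.random.default_rng(seed)
    flipped = images[:, :, ::-1]                       # horizontal flip
    noisy = images + noise_std * rng.standard_normal(images.shape)
    return np.concatenate([images, flipped, noisy])    # 3x the data

batch = np.zeros((10, 28, 28))      # e.g. ten 28x28 grayscale images
bigger = augment(batch)             # shape (30, 28, 28)
```

Which transformations are label-preserving depends on the task: a horizontal flip is fine for cats vs. dogs, but not for recognizing the digits 2 and 5.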
Regularization

L2 regularization: minimize

E'(W_1, ..., W_L) = E(W_1, ..., W_L) + λ Σ_{l=1}^L ||W_l||²_2

This is a simple change to the gradient:

∂E'/∂w_ij = ∂E/∂w_ij + 2λ w_ij

which introduces a weight-decay effect.
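In code, the regularized gradient step is literally a weight-decay step, as this small sketch shows (`grad_E` stands for whatever unregularized gradient backprop returns):

```python
import numpy as np

def regularized_step(W, grad_E, eta=0.01, lam=1e-4):
    """One gradient step on E' = E + lam * ||W||_2^2.

    dE'/dW = dE/dW + 2*lam*W, so the update shrinks ("decays") the
    weights toward zero on top of the usual data-gradient step.
    """
    return W - eta * (grad_E + 2 * lam * W)
    # equivalently: W * (1 - 2 * eta * lam) - eta * grad_E
```

With a zero data gradient, each step multiplies W by (1 − 2ηλ), which is where the name "weight decay" comes from.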
Dropout

Randomly delete neurons during training. Very effective, and it makes training faster as well.
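A sketch of one standard way to implement this, so-called "inverted" dropout (the rescaling convention here is a common variant, not necessarily the one used in lecture): at training time each neuron's output is zeroed with probability 1 − p and the survivors are rescaled by 1/p, so nothing changes at test time.

```python
import numpy as np

def dropout(o, p_keep=0.5, train=True, rng=None):
    """Inverted dropout applied to a layer output o."""
    if not train:
        return o                      # test time: keep all neurons
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(o.shape) < p_keep   # keep each neuron w.p. p_keep
    return o * mask / p_keep              # zero some, rescale the rest
```

The mask is resampled for every forward pass, so each minibatch effectively trains a different thinned sub-network.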
Early stopping

Stop training when the performance on the validation set stops improving.

[Figure: loss (negative log-likelihood) vs. time (epochs), showing the training set loss continuing to fall while the validation set loss levels off.]
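The training-loop logic can be sketched as follows. The `train_one_epoch`/`val_loss` callables are a hypothetical interface, and the patience counter is a common refinement that tolerates a few non-improving epochs before stopping:

```python
def train_with_early_stopping(train_one_epoch, val_loss, max_epochs=100,
                              patience=5):
    """Stop once validation loss hasn't improved for `patience` epochs.

    Returns (epoch of the best validation loss, best validation loss).
    """
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = val_loss()
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                 # validation stopped improving
    return best_epoch, best
```

In practice one also checkpoints the weights at each new best epoch and restores them after stopping.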
Conclusions for neural nets

Deep neural networks are hugely popular, achieving the best performance on many problems. However, they:
- do need a lot of data to work well
- take a lot of time to train (need GPUs for massively parallel computing)
- take some work to select architectures and hyperparameters
- are still not well understood in theory
More informationApplied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson
Applied Machine Learning Lecture 5: Linear classifiers, continued Richard Johansson overview preliminaries logistic regression training a logistic regression classifier side note: multiclass linear classifiers
More informationMachine Learning Basics
Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each
More informationCSC242: Intro to AI. Lecture 21
CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages
More information