Neural Networks
Emily Fox, University of Washington
March 10, 2017
Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer)

Single-layer neural network
Perceptron as a neural network

Input edges x[1], ..., x[d] with weights w_1, ..., w_d, plus the intercept x[0] = 1 with weight w_0, feed a single summing unit:

g\left(\sum_{j=0}^{d} w_j x[j]\right) = \begin{cases} 1 & \text{if } \sum_{j=0}^{d} w_j x[j] > 0 \\ 0 & \text{otherwise} \end{cases}

This is one neuron:
- Input edges x[1], ..., x[d], along with intercept x[0] = 1
- Sum passed through an activation function g

Sigmoid neuron

[Figure: plot of the sigmoid, rising smoothly from 0 to 1 as the score goes from -6 to 6]

Score(x) = \sum_{j=0}^{d} w_j x[j], \qquad g(Score(x)) = \frac{1}{1 + e^{-Score(x)}}

Just change g!
- Why would we want to do this?
- Notice the output range [0,1]. What was it before?
- Look familiar?
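To make the two activation choices concrete, here is a minimal NumPy sketch of a single unit under both choices of g (not from the slides; the weights and input are arbitrary illustrative values):

```python
import numpy as np

def perceptron_unit(w, x):
    """Hard-threshold activation: 1 if the score is positive, else 0."""
    return 1 if np.dot(w, x) > 0 else 0

def sigmoid_unit(w, x):
    """Sigmoid activation: squashes the same score into (0, 1)."""
    score = np.dot(w, x)           # Score(x) = sum_j w[j] * x[j]
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([-1.0, 2.0, 0.5])     # w[0] is the intercept weight w_0
x = np.array([1.0, 0.7, -0.3])     # x[0] = 1 is the intercept feature
print(perceptron_unit(w, x), sigmoid_unit(w, x))
```

Only g changes between the two units; the score computation is identical.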
Perceptron, linear classification, Boolean fns: x[j] ∈ {0,1}

- Can learn x[1] OR x[2]?
  -0.5 + x[1] + x[2]
- Can learn x[1] AND x[2]?
  -1.5 + x[1] + x[2]
- Can learn any conjunction or disjunction?
  Disjunction: -0.5 + x[1] + ... + x[d]
  Conjunction: (-d + 0.5) + x[1] + ... + x[d]
- Can learn majority?
  (-0.5 d) + x[1] + ... + x[d]
- What are we missing? The dreaded XOR!, etc.

Each case is the same single threshold unit, outputting 1 if \sum_{j=0}^{d} w_j x[j] > 0 and 0 otherwise, just with different weights (checked in the sketch below).

Introducing a hidden layer
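A quick check of the weight choices above, using a hypothetical threshold_unit helper (a sketch, not part of the original slides):

```python
import numpy as np

def threshold_unit(w0, w, x):
    """Single perceptron unit: 1 if w0 + sum_j w[j] * x[j] > 0, else 0."""
    return 1 if w0 + np.dot(w, x) > 0 else 0

d = 5
for bits in ([1, 1, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 1, 0, 0]):
    x = np.array(bits, dtype=float)
    disj = threshold_unit(-0.5, np.ones(d), x)      # OR: fires if any x[j] = 1
    conj = threshold_unit(-d + 0.5, np.ones(d), x)  # AND: fires only if all x[j] = 1
    maj = threshold_unit(-0.5 * d, np.ones(d), x)   # majority: fires if > d/2 ones
    print(bits, "OR:", disj, "AND:", conj, "majority:", maj)
```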
What can't a simple linear classifier represent?

XOR: the counterexample to everything. Need non-linear features.

XOR = (x[1] AND NOT x[2]) OR (NOT x[1] AND x[2])

Solving the XOR problem: going beyond linear classification by adding a layer

[Figure: inputs x[1], x[2] feed hidden units v[1], v[2] (each with bias -0.5 and weights +1/-1), which feed output y (bias -0.5). Every unit is thresholded to 0 or 1.]
Solving the XOR problem: going beyond linear classification by adding a layer

y = x[1] XOR x[2] = (x[1] AND NOT x[2]) OR (x[2] AND NOT x[1])

v[1] = (x[1] AND NOT x[2]): threshold(-0.5 + x[1] - x[2])
v[2] = (x[2] AND NOT x[1]): threshold(-0.5 + x[2] - x[1])
y = v[1] OR v[2]: threshold(-0.5 + v[1] + v[2])
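A minimal sketch verifying that this two-layer construction computes XOR over all four inputs (the step and xor_net names are illustrative):

```python
def step(score):
    """Hard threshold to 0 or 1."""
    return 1 if score > 0 else 0

def xor_net(x1, x2):
    """The two-layer network from the slide, with the weights given above."""
    v1 = step(-0.5 + x1 - x2)    # x1 AND NOT x2
    v2 = step(-0.5 + x2 - x1)    # x2 AND NOT x1
    return step(-0.5 + v1 + v2)  # v1 OR v2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
# Prints 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```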
Hidden layer

Single unit:      out(x) = g\left(w_0 + \sum_j w_j x[j]\right)
1-hidden layer:   out(x) = g\left(w_0 + \sum_k w_k \, g\left(w_0^k + \sum_j w_j^k x[j]\right)\right)

No longer a convex function!

A general neural network

Layers and layers and layers of linear models and non-linear transformations

- Around for about 50 years
  - Fell into disfavor in the 90s
- In the last few years, a big resurgence
  - Impressive accuracy on several benchmark problems
  - Powered by huge datasets, GPUs, & modeling/learning algorithm improvements

Learning neural networks with hidden layers
Recall: Optimizing a single-layer neuron

We train to minimize the sum of squared errors:

\ell(w) = \frac{1}{2} \sum_i \left[ y_i - g\left(w_0 + \textstyle\sum_j w_j x_i[j]\right) \right]^2

Taking gradients, using the chain rule \frac{\partial}{\partial x} f(g(x)) = f'(g(x)) \, g'(x):

\frac{\partial \ell}{\partial w_j} = -\sum_i \left[ y_i - g\left(w_0 + \textstyle\sum_j w_j x_i[j]\right) \right] x_i[j] \, g'\left(w_0 + \textstyle\sum_j w_j x_i[j]\right)

The solution just depends on g': the derivative of the activation function!
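A sketch of this loss and gradient for the sigmoid choice of g, where g'(s) = g(s)(1 - g(s)); the data-matrix convention (first column all ones) and the finite-difference check are illustrative assumptions, not part of the slides:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def loss_and_gradient(w, X, y):
    """Sum-of-squared-errors loss and its gradient for one sigmoid neuron.

    X is an (N, d+1) array whose first column is the constant feature 1,
    so w[0] plays the role of the intercept w_0; y holds the N targets.
    """
    g = sigmoid(X @ w)
    residual = y - g
    loss = 0.5 * np.sum(residual ** 2)
    # d loss / d w_j = - sum_i residual_i * x_i[j] * g'(score_i)
    grad = -X.T @ (residual * g * (1 - g))
    return loss, grad

# Tiny sanity check against a finite difference (illustrative data)
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, -0.2])
loss, grad = loss_and_gradient(w, X, y)
eps = 1e-6
num = (loss_and_gradient(w + [0, eps], X, y)[0] - loss) / eps
print(grad[1], num)   # the two values should roughly agree
```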
Forward propagation

1-hidden layer:   out(x) = g\left(w_0 + \sum_k w_k \, g\left(w_0^k + \sum_j w_j^k x[j]\right)\right)

For fixed weights, forming predictions is easy! Compute values left to right:
1. Inputs: x[1], ..., x[d]
2. Hidden: v[1], ..., v[d]
3. Output: y

Gradient descent for 1-hidden layer: Output layer parameters

\ell(w) = \frac{1}{2} \sum_i [y_i - out(x_i)]^2     (w_0 dropped to make the derivation simpler)

\frac{\partial \ell}{\partial w_k} = -\sum_{i=1}^{N} [y_i - out(x_i)] \, \frac{\partial \, out(x_i)}{\partial w_k}

The gradient for the last layer is the same as in the single-node case, but with the hidden nodes v as input!

Gradient descent for 1-hidden layer: Hidden layer parameters

\frac{\partial \ell}{\partial w_j^k} = -\sum_{i=1}^{N} [y_i - out(x_i)] \, \frac{\partial \, out(x_i)}{\partial w_j^k}

By the chain rule \frac{\partial}{\partial x} f(g(x)) = f'(g(x)) \, g'(x), the hidden-layer gradient has two parts:
- the gradient computed recursively from the output layer
- the normal update for a single neuron

(both pieces appear in the sketch below)
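Both gradient pieces in one short sketch for a single example, assuming sigmoid activations and the slide's no-bias simplification (the function and variable names are illustrative):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradients_one_example(W_hid, w_out, x, y):
    """Squared-error gradients for one example in a 1-hidden-layer sigmoid net.

    W_hid is (K, d): row k holds the weights w^k feeding hidden unit v[k];
    w_out is (K,): the output-layer weights.  Bias terms are dropped, as on
    the slide, to keep the expressions short.
    """
    # Forward propagation
    v = sigmoid(W_hid @ x)                     # hidden activations v[1..K]
    out = sigmoid(w_out @ v)                   # network output

    # Output layer: identical to the single-neuron case, with v as the input
    delta_out = -(y - out) * out * (1 - out)
    grad_w_out = delta_out * v

    # Hidden layer: the chain rule adds a factor w_k * g'(score_k) per unit
    delta_hid = delta_out * w_out * v * (1 - v)
    grad_W_hid = np.outer(delta_hid, x)
    return grad_w_out, grad_W_hid
```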
Multilayer neural networks: Inference and Learning

- Forward pass: left to right, compute each hidden layer in turn
- Gradient computation: right to left, propagating the gradient for each node

Forward propagation (prediction)

Recursive algorithm, starting from the input layer. Output of node v[k] with parents u[1], u[2], ...:

v[k] = g\left( \sum_j w_j^k \, u[j] \right)
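A minimal sketch of this layer-by-layer recursion, assuming sigmoid activations throughout and a hypothetical 2-4-1 architecture with random illustrative weights:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(layer_weights, x):
    """Forward propagation through an arbitrary stack of layers.

    layer_weights is a list of matrices; each layer applies the node-level
    rule v[k] = g(sum_j w^k_j u[j]) to the activations u of its parents.
    Returns the activations of every layer, input included.
    """
    activations = [x]
    for W in layer_weights:
        activations.append(sigmoid(W @ activations[-1]))
    return activations

rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
print(forward(layers, np.array([0.5, -1.0]))[-1])   # the network's output y
```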
Back-propagation (learning)

Just gradient descent!!! A recursive algorithm for computing the gradient. For each example:
- Perform forward propagation
- Start from the output layer
- Compute the gradient of node v[k] with parents u[1], u[2], ...
- Update weight w_j^k
- Repeat (move to the preceding layer)
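As a concrete illustration, a minimal stochastic-gradient training loop for the XOR network from earlier, assuming sigmoid units and squared-error loss (a sketch, not the slides' code); with an unlucky initialization it can stall, which is exactly the non-convexity caveat on the next slide:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# XOR data; a constant feature 1 is appended so bias weights are learned too
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(1)
W1 = rng.normal(size=(2, 3))   # hidden layer: 2 units, 2 inputs + bias
w2 = rng.normal(size=3)        # output layer: 2 hidden units + bias

eta = 1.0                      # step size, tuned by hand for this toy problem
for epoch in range(5000):
    for x_i, y_i in zip(X, y):
        # Forward propagation
        v = sigmoid(W1 @ x_i)
        v_b = np.append(v, 1.0)                 # hidden activations + bias
        out = sigmoid(w2 @ v_b)
        # Backward pass: propagate the gradient right to left, then update
        delta_out = -(y_i - out) * out * (1 - out)
        delta_hid = delta_out * w2[:2] * v * (1 - v)
        w2 -= eta * delta_out * v_b
        W1 -= eta * np.outer(delta_hid, x_i)

preds = [sigmoid(w2 @ np.append(sigmoid(W1 @ x), 1.0)) for x in X]
print(np.round(preds, 2))   # usually close to [0, 1, 1, 0]
```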
Convergence of backprop

- Perceptron leads to convex optimization
  - Gradient descent reaches the global minimum
- Multilayer neural nets are not convex
  - Gradient descent can get stuck in local minima
  - Selecting the number of hidden units and layers = fuzzy process
- NNs have made a HUGE comeback in the last few years!!!
  - Neural nets are back with a new name!!!
    - Deep belief networks
  - Huge error reduction when trained with lots of data on GPUs

Overfitting in NNs

Are NNs likely to overfit?
- Yes, they can represent arbitrary functions!!!

Avoiding overfitting?
- More training data
- Fewer hidden nodes / better topology
- Regularization
- Early stopping (sketched below)
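A generic sketch of the early-stopping idea; the train_step and validation_loss callables are hypothetical placeholders supplied by the caller (e.g. one backprop pass over the training set, and the squared error on a held-out set):

```python
import numpy as np

def train_with_early_stopping(train_step, validation_loss,
                              max_epochs=1000, patience=20):
    """Train while the held-out loss keeps improving; stop after
    `patience` consecutive epochs without improvement."""
    best_loss, best_epoch = np.inf, 0
    for epoch in range(max_epochs):
        train_step()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break   # held-out error stopped improving; stop training
    return best_loss
```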
Neural networks can do cool things!

Object recognition
(Slides from Jeff Dean at Google)

Number detection
(Slides from Jeff Dean at Google)
Acoustic Modeling for Speech Recognition
(Slides from Jeff Dean at Google)

- Close collaboration with the Google Speech team
- Trained in <5 days on a cluster of 800 machines
- 30% reduction in Word Error Rate for English! ("biggest single improvement in 20 years of speech research")
- Launched in 2012 at the time of the Jelly Bean release of Android

2012-era Convolutional Model for Object Recognition
(Slides from Jeff Dean at Google)

- Softmax to predict the object class
- Fully-connected layers
- Convolutional layers (same weights used at all spatial locations in a layer); convolutional networks developed by Yann LeCun (NYU)
- Basic architecture developed by Krizhevsky, Sutskever & Hinton (all now at Google); won the 2012 ImageNet challenge with a 16.4% top-5 error rate
2014-era Model for Object Recognition
(Slides from Jeff Dean at Google)

- Module with 6 separate convolutional layers; 24 layers deep
- Developed by a team of Google researchers; won the 2014 ImageNet challenge with a 6.66% top-5 error rate

Good Fine-grained Classification
(Slides from Jeff Dean at Google)

[Images: hibiscus and dahlia, correctly told apart]
Good Generalization
(Slides from Jeff Dean at Google)

[Images: two very different photos, both recognized as a meal]

Sensible Errors
(Slides from Jeff Dean at Google)

[Images: a plausible confusion between "snake" and "dog"]
Works in practice for real users.
(Slides from Jeff Dean at Google)
Object detection

[Image: detections with confidences Person: 0.64, Dog: 0.30, Horse: 0.28]

Redmon et al., 2015. http://pjreddie.com/yolo/

Neural network summary
What you need to know about neural networks

- Perceptron:
  - Relationship to general neurons
- Multilayer neural nets
  - Representation
  - Derivation of backprop
  - Learning rule
- Overfitting

Course Wrap-Up
Emily Fox, University of Washington
March 10, 2017
What you have learned this quarter

Learning is function approximation: point estimation, regression, overfitting, the bias-variance tradeoff, ridge, LASSO, cross validation, stochastic gradient descent, coordinate descent, subgradients, logistic regression, decision trees, boosting, instance-based learning, the perceptron, SVMs, the kernel trick, dimensionality reduction, PCA, k-means, mixtures of Gaussians, EM, discriminative v. generative learning, unsupervised v. supervised learning, naïve Bayes, Bayes nets, neural networks

BIG PICTURE

Improving the performance at some task through experience! Before you start any learning task, remember the fundamental questions:
- What is the learning problem?
- From what experience?
- What model?
- What loss function are you optimizing?
- With what optimization algorithm?
- Which learning algorithm? With what guarantees?
- How will you evaluate it?
Regression
Example: Predicting house prices
Data (house features) → ML (regression method) → Intelligence: predicted price ($)
[Figure: price ($) vs. house size, with a fitted curve]

Classification
Example: Sentiment analysis
"Sushi was awesome, the food was awesome, but the service was awful."
All reviews: Score(x) > 0 → "awesome"; Score(x) < 0 → "awful"
Similarity / finding data
Example: Document retrieval
Data → ML (nearest neighbor method) → Intelligence

Clustering
Example: Document structuring for retrieval
Data → ML (clustering method) → Intelligence
[Figure: articles grouped into SPORTS, WORLD NEWS, ENTERTAINMENT, SCIENCE]
Embedding
Example: Embedding images to visualize data
Data (images with thousands or millions of pixels) → ML (PCA method) → Intelligence
Can we give each image a coordinate, such that similar images are near each other? [Saul & Roweis 03]

Deep Learning
Example: Visual product recommender
Data → ML (deep learning method) → Intelligence
Input images pass through learned layers (Layer 1, Layer 2, ...) to produce nearest-neighbor recommendations
You have done a lot!!!
And (hopefully) learned a lot!!!
- Implemented LASSO, logistic regression, the perceptron, clustering
- Answered hard questions and proved many interesting results
- Completed (I am sure) an amazing ML project
- And did excellently on the final!

Thank You for the Hard Work!!!