Numerical Learning Algorithms

Contents

Introduction

Naive Bayes
    Naive Bayes
    Naive Bayes Example
    Naive Bayes Example Continued

Linear Models
    Linear Models
    Example of Numeric Examples
    Linear Regression
    Least Squares Gradient Descent
    Perceptron Learning Rule
    Perceptrons Continued
    Example of Perceptron Learning (α = )

The Nearest Neighbor Algorithm
    The Nearest Neighbor Algorithm

Neural Networks
    Artificial Neural Networks
    ANN Structure
    ANN Illustration
    Illustration
    Sigmoid Activation
    Plot of Sigmoid Function
    Backpropagation
    Applying The Chain Rule

Support Vector Machines
    Support Vector Machines
    Example SVM for Separable Examples
    Example SVM for Nonseparable Examples
    Example Gaussian Kernel SVM
    Example Gaussian Kernel, Zoomed In

Ensemble Learning
    Ensemble Learning
    Boosting
    Example Boosting Algorithm
    Example Run of AdaBoost
    Example Run of AdaBoost, Continued
Introduction

Numerical learning methods learn the parameters or weights of a model, often by optimizing an error function. Examples include:

- Calculate the parameters of a probability distribution.
- Separate positive from negative examples by a decision boundary.
- Find points close to positive but far from negative examples.
- Update parameters to decrease error.

CS 79 Artificial Intelligence Numerical Learning Algorithms

Naive Bayes

For class $C$ and attributes $X_i$, assume:

$P(C, X_1, \ldots, X_n) = P(C)\, P(X_1 \mid C) \cdots P(X_n \mid C)$

This corresponds to a Bayesian network where $C$ is the sole parent of each $X_i$.

Estimate prior and conditional probabilities by counting. If an outcome occurs $m$ times out of $n$ trials, Laplace's law of succession recommends the estimate $(m + 1)/(n + k)$, where $k$ is the number of outcomes.

Naive Bayes Example

Using Laplace's law of succession on the 14 examples:

$P(\text{pos}) = (9 + 1)/(14 + 2) = 10/16$
$P(\text{neg}) = (5 + 1)/(14 + 2) = 6/16$
$P(\text{sunny} \mid \text{pos}) = (2 + 1)/(9 + 3) = 3/12$
$P(\text{overcast} \mid \text{pos}) = (4 + 1)/(9 + 3) = 5/12$
$P(\text{rain} \mid \text{pos}) = (3 + 1)/(9 + 3) = 4/12$

Naive Bayes Example Continued

For the first example, with $\alpha$ the normalizing constant:

$P(\text{pos} \mid \text{sunny}, \text{hot}, \text{high}, \text{false}) = \alpha\, (10/16)(3/12)(3/12)(4/11)(7/11) \approx \alpha \cdot 0.00904$

$P(\text{neg} \mid \text{sunny}, \text{hot}, \text{high}, \text{false}) = \alpha\, (6/16)(4/8)(3/8)(5/7)(3/7) \approx \alpha \cdot 0.02152 = \dfrac{0.02152}{0.00904 + 0.02152} \approx 0.704$

Linear Models

For a linear model, the output and each attribute must be numeric.

The input of an example is a numeric vector $\mathbf{x} = (1.0, x_1, \ldots, x_n)$.

A hypothesis is a weight vector $\mathbf{w} = (w_0, w_1, \ldots, w_n)$; $w_0$ is the bias weight.

The output of a hypothesis is computed by

$\hat{y} = w_0 + w_1 x_1 + \cdots + w_n x_n = \mathbf{w} \cdot \mathbf{x}$

The loss on example $(\mathbf{x}, y)$ is typically one of:

- Squared error loss: $L_2(y, \hat{y}) = (y - \hat{y})^2$
- Absolute error loss: $L_1(y, \hat{y}) = |y - \hat{y}|$
- 0/1 loss: $L_{0/1}(y, \hat{y}) = 0$ if $y = \hat{y}$, else $1$
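The Naive Bayes example above can be checked with exact fractions. This is a minimal sketch: the per-class counts are the ones implied by the fractions on the slides (for instance, 2 sunny examples among the 9 positives), and the helper names `laplace` and `joint` are made up for illustration.

```python
from fractions import Fraction

def laplace(m, n, k):
    """Laplace estimate (m + 1) / (n + k): m occurrences in n trials, k possible outcomes."""
    return Fraction(m + 1, n + k)

# Class counts implied by the slide: 14 examples, 9 pos and 5 neg.
N_POS, N_NEG = 9, 5
prior_pos = laplace(N_POS, N_POS + N_NEG, 2)  # 10/16
prior_neg = laplace(N_NEG, N_POS + N_NEG, 2)  # 6/16

# For the first example (sunny, hot, high, false):
# (count of that attribute value within the class, number of values of the attribute).
pos_counts = [(2, 3), (2, 3), (3, 2), (6, 2)]
neg_counts = [(3, 3), (2, 3), (4, 2), (2, 2)]

def joint(prior, counts, n_class):
    """P(C) times the product of the Laplace estimates of P(X_i | C)."""
    p = prior
    for m, k in counts:
        p *= laplace(m, n_class, k)
    return p

score_pos = joint(prior_pos, pos_counts, N_POS)
score_neg = joint(prior_neg, neg_counts, N_NEG)

# alpha normalizes the two scores so the posteriors sum to 1.
alpha = 1 / (score_pos + score_neg)
posterior_pos = alpha * score_pos
posterior_neg = alpha * score_neg

print(float(score_pos), float(score_neg), float(posterior_neg))
```

Using `Fraction` keeps every intermediate value exact, so the only rounding happens when the results are printed.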
Example of Numeric Examples

[Table: the 14 examples (No. 1 to 14) in numeric form. Input attributes: Sunny, Rainy, Hot, Cool, Humid, Windy. Final column: Output. Cell values not recoverable.]

Linear Regression

Linear regression finds the weights that minimize loss over the training set.

Gradient descent changes the weights based on the gradient, that is, the derivatives of the loss with respect to the weights (more on the next slide).

The linear least squares algorithm calculates the weights by

$\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$

where $X$ is the data matrix and $\mathbf{y}$ is the vector of outputs.

Classification can be performed by: if $\mathbf{w} \cdot \mathbf{x} > 0$ then positive else negative.

Least Squares Gradient Descent

    w ← zeroes
    loop until convergence
        for each example (x_j, y_j)
            ŷ_j ← w · x_j
            for each w_i in w
                w_i ← w_i + α (y_j − ŷ_j) x_ij

where α is the learning rate. This is a small number chosen to trade off speed of convergence against closeness to the optimal weights.

Perceptron Learning Rule [differs from book]

A perceptron does gradient descent for absolute error loss (more accurately, "ramp loss"). This assumes each y_j is 1 or −1.

    w ← zeroes
    loop until convergence
        for each example (x_j, y_j)
            ŷ_j ← w · x_j
            if (y_j = 1 ∧ ŷ_j < 1) ∨ (y_j = −1 ∧ ŷ_j > −1) then
                for each w_i in w
                    w_i ← w_i + α y_j x_ij

Again, α is the learning rate.

Perceptrons Continued

The perceptron convergence theorem states that if some w classifies all the training examples correctly, then the perceptron learning rule will converge to zero error on the training examples.

Usually, many epochs (passes over the training examples) are needed until convergence.

If zero error is not possible, use α./n, where n is the number of normalized or binary inputs.
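The two update rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the course's reference code: the toy data sets, the learning rates, and the epoch limits are invented, and the perceptron test uses thresholds of 1 and −1, which is an assumption based on the ramp-loss remark.

```python
def dot(w, x):
    """Inner product w . x of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def least_squares_gd(examples, alpha=0.05, epochs=2000):
    """Per-example update w_i <- w_i + alpha * (y_j - y_hat_j) * x_ij."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):  # fixed epoch count stands in for "until convergence"
        for x, y in examples:
            y_hat = dot(w, x)
            w = [wi + alpha * (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w

def perceptron(examples, alpha=1.0, max_epochs=100):
    """Update w_i <- w_i + alpha * y_j * x_ij when the example is on the wrong side.

    Labels are +1 / -1. The thresholds +1 and -1 are an assumption (ramp loss).
    """
    w = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        changed = False
        for x, y in examples:
            y_hat = dot(w, x)
            if (y == 1 and y_hat < 1) or (y == -1 and y_hat > -1):
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:  # a full pass with no update: converged
            break
    return w

# Hypothetical toy data; x[0] = 1.0 is the constant input paired with the bias weight.
regression_data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]  # y = 1 + 2x
and_data = [([1.0, 0.0, 0.0], -1), ([1.0, 0.0, 1.0], -1),
            ([1.0, 1.0, 0.0], -1), ([1.0, 1.0, 1.0], 1)]  # logical AND

w_reg = least_squares_gd(regression_data)
w_and = perceptron(and_data)
print(w_reg, w_and)
```

Both functions start from zero weights and differ only in the size of the correction: the least-squares rule scales it by the residual, while the perceptron rule applies a fixed-size step in the direction of the label.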
Example of Perceptron Learning (α = )

Using α = :

[Table: trace of perceptron learning. Columns: Inputs x_1, x_2, x_3, x_4; y; ŷ; L; Weights w_0, w_1, w_2, w_3, w_4. Cell values not recoverable.]

The Nearest Neighbor Algorithm

The k-nearest neighbor algorithm classifies a test example by finding the k closest training examples and returning their most common class.

Suppose 10% noise (the best possible test error is 10%). With sufficient training examples, a test example will agree with its nearest neighbor with probability (0.9)(0.9) + (0.1)(0.1) = 0.82 (both not noisy or both noisy) and disagree with probability (0.9)(0.1) + (0.1)(0.9) = 0.18.

In general, 1-nearest neighbor converges to less than twice the optimal error (-NN to less than % higher than optimal).

Neural Networks

Artificial Neural Networks

An (artificial) neural network consists of units, connections, and weights. Inputs and outputs are numeric.

    Biological NN        Artificial NN
    soma                 unit
    axon, dendrite       connection
    synapse              weight
    potential            weighted sum
    threshold            bias weight
    signal               activation

ANN Structure

A typical unit $j$ receives inputs $a_1, a_2, \ldots$ from other units, performs a weighted sum

$in_j = w_{0j} + \sum_i w_{ij} a_i$

and outputs the activation $a_j = g(in_j)$.

Typically, input units store the inputs, hidden units transform the inputs into an internal numeric vector, and an output unit transforms the hidden values into the prediction.

An ANN is a function $f(\mathbf{x}, W) = a$, where $\mathbf{x}$ is an example, $W$ is the weights, and $a$ is the prediction (the activation value from the output unit). Learning is finding a $W$ that minimizes error.
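The k-nearest neighbor rule described above is short enough to sketch directly, together with the agree/disagree arithmetic from the noise argument. The training points below are invented, Euclidean distance is an assumed choice of metric, and the noise rate of 0.1 is the value implied by the 0.9 factors in the slide.

```python
import math
from collections import Counter

def knn_classify(train, query, k=1):
    """Return the most common label among the k training examples closest to query.

    train is a list of (vector, label) pairs; distance is Euclidean (an assumption).
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set; the last point plays the role of a noisy label.
train = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"),
         ((1.0, 1.0), "pos"), ((0.9, 1.1), "pos"),
         ((0.2, 0.1), "pos")]

query = (0.15, 0.1)
label_1nn = knn_classify(train, query, k=1)  # follows the single noisy neighbor
label_3nn = knn_classify(train, query, k=3)  # outvoted by the two clean neighbors

# Noise argument: each label is flipped independently with probability `noise`.
noise = 0.1
agree = (1 - noise) ** 2 + noise ** 2     # both clean or both noisy
disagree = 2 * noise * (1 - noise)        # exactly one of the two is noisy
print(label_1nn, label_3nn, agree, disagree)
```

The toy query shows why a larger k can help under label noise: one mislabeled neighbor decides the 1-NN answer but is outvoted when k = 3.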
ANN Illustration

[Figure: a network with input units x_1, x_2, x_3, x_4; weights w_15, w_25, w_35, w_45 and w_16, w_26, w_36, w_46 into hidden units a_5 and a_6 (bias weights w_05 and w_06); weights w_57 and w_67 (bias weight w_07) into the output unit a_7, which produces the output.]

Illustration

[Figure: the same network (input units x_1 to x_4, hidden units a_5 and a_6, output unit a_7) with numeric weights; the weight values are not recoverable.]

Sigmoid Activation

The sigmoid function is defined as:

$\text{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$

It is commonly used for ANN activation functions:

$a_j = \text{sigmoid}(in_j) = \text{sigmoid}\left(w_{0j} + \sum_i w_{ij} a_i\right)$

Note that

$\dfrac{\partial\, \text{sigmoid}(x)}{\partial x} = \text{sigmoid}(x)\,(1 - \text{sigmoid}(x))$

Plot of Sigmoid Function

[Figure: plot of sigmoid(x) against x.]
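As a rough sketch of the pieces above, the code below implements the sigmoid, its derivative identity, and a forward pass through a network shaped like the illustration (four inputs, hidden units 5 and 6, output unit 7). All weight values and the input vector are hypothetical, since the numbers in the figure are not available.

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative via the identity sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def unit(bias, weights, inputs):
    """a_j = sigmoid(w_0j + sum_i w_ij * a_i)."""
    return sigmoid(bias + sum(w * a for w, a in zip(weights, inputs)))

def forward(x, W):
    """Forward pass: inputs -> hidden units 5 and 6 -> output unit 7."""
    a5 = unit(W["w05"], W["to5"], x)
    a6 = unit(W["w06"], W["to6"], x)
    a7 = unit(W["w07"], [W["w57"], W["w67"]], [a5, a6])
    return a5, a6, a7

# Hypothetical weights: to5 = [w15, w25, w35, w45], to6 = [w16, w26, w36, w46].
W = {
    "w05": 0.1, "to5": [0.2, -0.3, 0.4, 0.5],
    "w06": -0.2, "to6": [0.3, 0.1, -0.4, 0.2],
    "w07": 0.05, "w57": 0.6, "w67": -0.7,
}
x = [1.0, 0.0, 1.0, 0.0]  # hypothetical input example
a5, a6, a7 = forward(x, W)

# Central finite difference to check the derivative identity at one point.
h = 1e-6
numeric_slope = (sigmoid(0.5 + h) - sigmoid(0.5 - h)) / (2 * h)
print(a7, numeric_slope, sigmoid_prime(0.5))
```

The finite-difference slope should agree with `sigmoid_prime` to several decimal places, which is a quick way to confirm the identity used later in backpropagation.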
Backpropagation

One learning method is backpropagating the error from the output to all of the weights. It is an application of the delta rule.

Given loss $L(W, \mathbf{x}, y)$, obtain the gradient:

$\nabla L(W, \mathbf{x}, y) = \left[ \ldots, \dfrac{\partial L}{\partial w_{ij}}, \ldots \right]$

To decrease error, use the update rule:

$w_{ij} \leftarrow w_{ij} - \alpha \dfrac{\partial L}{\partial w_{ij}}$

where $\alpha$ is the learning rate.

Applying The Chain Rule

Using $L = (y_k - a_k)^2$ for output unit $k$:

$\dfrac{\partial L}{\partial w_{jk}} = \dfrac{\partial L}{\partial a_k} \dfrac{\partial a_k}{\partial in_k} \dfrac{\partial in_k}{\partial w_{jk}} = -2 (y_k - a_k)\, a_k (1 - a_k)\, a_j$

For weights from input to hidden units:

$\dfrac{\partial L}{\partial w_{ij}} = \dfrac{\partial L}{\partial a_k} \dfrac{\partial a_k}{\partial in_k} \dfrac{\partial in_k}{\partial a_j} \dfrac{\partial a_j}{\partial in_j} \dfrac{\partial in_j}{\partial w_{ij}} = -2 (y_k - a_k)\, a_k (1 - a_k)\, w_{jk}\, a_j (1 - a_j)\, x_i$

Support Vector Machines

An SVM assigns a weight $\alpha_i$ to each example $(\mathbf{x}_i, y_i)$, where $\mathbf{x}_i$ is an attribute value vector and $y_i$ is either $1$ or $-1$.

An SVM computes a discriminant by:

$h(\mathbf{x}) = \text{sign}\left( b + \sum_i \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) \right)$

where $K$ is a kernel function.

An SVM learns by optimizing the error function:

minimize $\|h\|^2 / 2 + \sum_i \max(0,\ 1 - y_i h(\mathbf{x}_i))$ subject to $0 \le \alpha_i \le C$

where $\|h\|$ is the size of $h$ in kernel space.

Example SVM for Separable Examples

[Figure: separable examples with the lines w·x + b = −1, w·x + b = 0, and w·x + b = 1.]
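The chain-rule gradients from the backpropagation slides above can be checked numerically. The sketch below assumes the squared error loss L = (y_k − a_k)^2, sigmoid activations, and a network with one hidden layer and one output unit. The weights, the example, and the learning rate are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Each weight list starts with the bias weight, followed by one weight per input."""
    a_hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
                for w in w_hidden]
    a_k = sigmoid(w_out[0] + sum(wj * aj for wj, aj in zip(w_out[1:], a_hidden)))
    return a_hidden, a_k

def loss(x, y, w_hidden, w_out):
    """Squared error (y_k - a_k)^2 for the single output unit k."""
    return (y - forward(x, w_hidden, w_out)[1]) ** 2

def gradients(x, y, w_hidden, w_out):
    """Analytic gradients from the chain rule, in the same layout as the weights."""
    a_hidden, a_k = forward(x, w_hidden, w_out)
    delta_k = -2.0 * (y - a_k) * a_k * (1.0 - a_k)            # dL / d in_k
    g_out = [delta_k] + [delta_k * a_j for a_j in a_hidden]   # bias input is 1
    g_hidden = []
    for j, a_j in enumerate(a_hidden):
        delta_j = delta_k * w_out[j + 1] * a_j * (1.0 - a_j)  # dL / d in_j
        g_hidden.append([delta_j] + [delta_j * x_i for x_i in x])
    return g_hidden, g_out

# Hypothetical example and weights.
x, y = [1.0, 0.0, 1.0, 0.0], 1.0
w_hidden = [[0.1, 0.2, -0.3, 0.4, 0.5], [-0.2, 0.3, 0.1, -0.4, 0.2]]
w_out = [0.05, 0.6, -0.7]
g_hidden, g_out = gradients(x, y, w_hidden, w_out)

# Central finite difference on one hidden weight: w_hidden[0][3], attached to x[2].
h = 1e-6
plus = [w[:] for w in w_hidden]
plus[0][3] += h
minus = [w[:] for w in w_hidden]
minus[0][3] -= h
numeric = (loss(x, y, plus, w_out) - loss(x, y, minus, w_out)) / (2 * h)

# One step of the update rule w <- w - alpha * dL/dw.
alpha = 0.5
new_hidden = [[w - alpha * g for w, g in zip(ws, gs)]
              for ws, gs in zip(w_hidden, g_hidden)]
new_out = [w - alpha * g for w, g in zip(w_out, g_out)]
print(numeric, g_hidden[0][3], loss(x, y, new_hidden, new_out))
```

Comparing an analytic gradient with a finite difference is a common sanity check; if the two disagree, the chain-rule derivation or its implementation has an error.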
Example SVM for Nonseparable Examples

[Figure: nonseparable examples with the lines w·x + b = −1, w·x + b = 0, and w·x + b = 1.]

Example Gaussian Kernel SVM

[Figure: Gaussian kernel SVM example.]

Example Gaussian Kernel, Zoomed In

[Figure: zoomed-in view of the Gaussian kernel SVM example.]
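To make the kernel discriminant concrete, here is a minimal sketch of evaluating b + Σ α_i y_i K(x, x_i) with a Gaussian kernel, plus the hinge term max(0, 1 − y·score) from the error function. Nothing is trained here: the α values, the bias b, the kernel width `gamma`, and the two stored examples are all made up for illustration.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_score(x, support, b, kernel=gaussian_kernel):
    """b + sum_i alpha_i * y_i * K(x, x_i) over the stored (alpha_i, x_i, y_i) triples."""
    return b + sum(alpha * y * kernel(x, xi) for alpha, xi, y in support)

def svm_predict(x, support, b, kernel=gaussian_kernel):
    """The sign of the score, reported as +1 or -1."""
    return 1 if svm_score(x, support, b, kernel) >= 0 else -1

def hinge_loss(y, score):
    """max(0, 1 - y * score): zero once the example is beyond the margin."""
    return max(0.0, 1.0 - y * score)

# Made-up (alpha_i, x_i, y_i) triples; a trained SVM would obtain the alphas by optimization.
support = [(1.0, (0.0, 0.0), -1), (1.0, (2.0, 2.0), 1)]
b = 0.0

near_negative = svm_predict((0.1, 0.1), support, b)
near_positive = svm_predict((1.9, 2.0), support, b)
print(near_negative, near_positive, hinge_loss(1, 2.0), hinge_loss(-1, 0.5))
```

With a Gaussian kernel each stored example contributes most to nearby points, so the prediction near an example follows that example's label unless other weighted examples outweigh it.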
Ensemble Learning

There are many algorithms for learning a single hypothesis. Ensemble learning learns and combines a collection of hypotheses by running the algorithm on different training sets.

Bagging (briefly mentioned in the book) runs a learning algorithm on repeated subsamples of the training set. If there are n examples, then a subsample of n examples is generated by sampling with replacement. On a test example, each hypothesis casts one vote for the class it predicts.

Boosting

In boosting, the hypotheses are learned in sequence. Both hypotheses and examples have weights, with different purposes.

After each hypothesis is learned, its weight is based on its error rate, and the weights of the training examples (initially all equal) are also modified.

On a test example, when each hypothesis predicts a class, its weight is the size of its vote. The ensemble predicts the class with the highest vote.

Example Run of AdaBoost

Using the 14 examples as a training set:

The hypothesis windy = false → class = pos is wrong on 5 of the 14 examples.

The weights of the correctly classified examples are multiplied by 5/9, then the weights of all examples are multiplied by 14/10 so they sum to 1 again.

This hypothesis has a weight of log(9/5).

Note that after weight updating, the sum of the weights of the correctly classified examples equals the sum of the weights of the incorrectly classified examples.

Example Run of AdaBoost, Continued

The next hypothesis must be different from the previous one to have error less than 1/2.

Now the hypothesis outlook = overcast → class = pos has an error rate of 29/90.

The weights of the correctly classified examples are multiplied by 29/61 ≈ 0.475, then the weights of all examples are multiplied by 90/58 ≈ 1.55 so they sum to 1 again.

This hypothesis has a weight of log(61/29).
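The fractions in the AdaBoost run above can be reproduced with exact arithmetic. This sketch rests on two assumptions that the slides do not state: the 14 examples are taken to be the standard play-tennis data (only outlook, windy, and class are listed), and each hypothesis is read as predicting neg whenever its condition does not hold.

```python
from fractions import Fraction

# Assumed data: (outlook, windy, class) for the standard 14 play-tennis examples.
examples = [
    ("sunny", False, "neg"), ("sunny", True, "neg"), ("overcast", False, "pos"),
    ("rain", False, "pos"), ("rain", False, "pos"), ("rain", True, "neg"),
    ("overcast", True, "pos"), ("sunny", False, "neg"), ("sunny", False, "pos"),
    ("rain", False, "pos"), ("sunny", True, "pos"), ("overcast", True, "pos"),
    ("overcast", False, "pos"), ("rain", True, "neg"),
]

def h_windy(ex):
    """windy = false -> pos, otherwise neg (the 'otherwise' branch is an assumption)."""
    return "pos" if ex[1] is False else "neg"

def h_overcast(ex):
    """outlook = overcast -> pos, otherwise neg (same assumption)."""
    return "pos" if ex[0] == "overcast" else "neg"

def boost_round(weights, hypothesis):
    """One AdaBoost weight update; returns error, scaling factor, normalizer, new weights."""
    correct = [hypothesis(ex) == ex[2] for ex in examples]
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    factor = error / (1 - error)  # applied to correctly classified examples
    scaled = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return error, factor, 1 / total, [w / total for w in scaled]

weights = [Fraction(1, 14)] * len(examples)
error1, factor1, norm1, weights = boost_round(weights, h_windy)
error2, factor2, norm2, weights = boost_round(weights, h_overcast)
print(error1, factor1, norm1)  # first round
print(error2, factor2, norm2)  # second round
```

`Fraction` reduces values to lowest terms, so 14/10 prints as 7/5 and 90/58 as 45/29; these are the same multipliers as on the slides.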
Example Boosting Algorithm

    AdaBoost(examples, algorithm, iterations)
        n ← number of examples
        initialize weights w[1...n] to 1/n
        for i from 1 to iterations
            h[i] ← algorithm(examples)
            error ← sum of the weights of the exs. misclassified by h[i]
            for j from 1 to n
                if h[i] is correct on example j
                    then w[j] ← w[j] · error/(1 − error)
            normalize w[1...n] so it sums to 1
            weight of h[i] ← log((1 − error)/error)
        return h[1...iterations] and their weights
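The pseudocode above translates fairly directly into Python. The sketch below makes several assumptions that go beyond the slide: the learner is passed the current weights explicitly, labels are +1 and −1, two guards handle an error of 0 or of at least 1/2, and the weak learner `stump_learner` and the toy data are invented for the demonstration.

```python
import math

def adaboost(examples, labels, algorithm, iterations):
    """Return (hypotheses, votes). algorithm(examples, labels, weights) returns x -> label."""
    n = len(examples)
    w = [1.0 / n] * n
    hypotheses, votes = [], []
    for _ in range(iterations):
        h = algorithm(examples, labels, w)
        correct = [h(x) == y for x, y in zip(examples, labels)]
        error = sum(wj for wj, ok in zip(w, correct) if not ok)
        if error == 0.0:   # guard (not in the pseudocode): a perfect hypothesis is used alone
            return [h], [1.0]
        if error >= 0.5:   # guard (not in the pseudocode): no better than chance, stop
            break
        # Shrink the weights of correctly classified examples, then renormalize.
        w = [wj * error / (1.0 - error) if ok else wj for wj, ok in zip(w, correct)]
        total = sum(w)
        w = [wj / total for wj in w]
        hypotheses.append(h)
        votes.append(math.log((1.0 - error) / error))
    return hypotheses, votes

def predict(hypotheses, votes, x):
    """Weighted vote over +1 / -1 predictions."""
    total = sum(v * h(x) for h, v in zip(hypotheses, votes))
    return 1 if total >= 0 else -1

def stump_learner(examples, labels, weights):
    """Hypothetical weak learner: the threshold test on one number with least weighted error."""
    best = None
    for t in sorted(set(examples)):
        for sign in (1, -1):
            h = lambda x, t=t, sign=sign: sign if x >= t else -sign
            err = sum(w for x, y, w in zip(examples, labels, weights) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]

# Toy one-dimensional data that no single threshold test classifies correctly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
hs, vs = adaboost(xs, ys, stump_learner, iterations=3)
training_errors = sum(predict(hs, vs, x) != y for x, y in zip(xs, ys))
print(len(hs), vs, training_errors)
```

Each single threshold test misclassifies at least three of the ten points, yet the weighted vote of three of them fits the training set, which is the behavior boosting is designed to produce.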