Numerical Learning Algorithms


Contents:

    Introduction
    Naive Bayes
        Naive Bayes
        Naive Bayes Example
        Naive Bayes Example Continued
    Linear Models
        Linear Models
        Example of Numeric Examples
        Linear Regression
        Least Squares Gradient Descent
        Perceptron Learning Rule
        Perceptrons Continued
        Example of Perceptron Learning (α = 1)
    The Nearest Neighbor Algorithm
    Neural Networks
        Artificial Neural Networks
        ANN Structure
        ANN Illustration
        Illustration
        Sigmoid Activation
        Plot of Sigmoid Function
        Backpropagation
        Applying The Chain Rule
    Support Vector Machines
        Support Vector Machines
        Example SVM for Separable Examples
        Example SVM for Nonseparable Examples
        Example Gaussian Kernel SVM
        Example Gaussian Kernel, Zoomed In
    Ensemble Learning
        Ensemble Learning
        Boosting
        Example Boosting Algorithm
        Example Run of AdaBoost
        Example Run of AdaBoost, Continued

Introduction

Numerical learning methods learn the parameters or weights of a model, often by optimizing an error function. Examples include:
- Calculating the parameters of a probability distribution.
- Separating positive from negative examples by a decision boundary.
- Finding points close to positive but far from negative examples.
- Updating parameters to decrease error.

Naive Bayes

For class C and attributes X_i, assume:

    P(C, X_1, ..., X_n) = P(C) P(X_1 | C) ... P(X_n | C)

This corresponds to a Bayesian network where C is the sole parent of each X_i. Estimate the prior and conditional probabilities by counting. If an outcome occurs m times out of n trials, Laplace's law of succession recommends the estimate (m + 1)/(n + k), where k is the number of possible outcomes.

Naive Bayes Example

Using Laplace's law of succession on the 14 weather examples (9 positive, 5 negative):

    P(pos) = (9 + 1)/(14 + 2) = 10/16
    P(neg) = (5 + 1)/(14 + 2) = 6/16
    P(sunny | pos) = (2 + 1)/(9 + 3) = 3/12
    P(overcast | pos) = (4 + 1)/(9 + 3) = 5/12
    P(rain | pos) = (3 + 1)/(9 + 3) = 4/12

Naive Bayes Example Continued

For the first example (sunny, hot, high humidity, windy = false):

    P(pos | sunny, hot, high, false) = α (10/16)(3/12)(3/12)(4/11)(7/11) ≈ α · 0.0090
    P(neg | sunny, hot, high, false) = α (6/16)(4/8)(3/8)(5/7)(3/7) ≈ α · 0.0215

Normalizing, P(neg | sunny, hot, high, false) = 0.0215 / (0.0090 + 0.0215) ≈ 0.70.

Linear Models

For a linear model, the output and each attribute must be numeric.
The input of an example is a numeric vector x = (1.0, x_1, ..., x_n).
A hypothesis is a weight vector w = (w_0, w_1, ..., w_n); w_0 is the bias weight.
The output of a hypothesis is computed by

    ŷ = w_0 + w_1 x_1 + ... + w_n x_n = w · x

The loss on example (x, y) is typically one of:
- Squared error loss: L_2(y, ŷ) = (y - ŷ)^2
- Absolute error loss: L_1(y, ŷ) = |y - ŷ|
- 0/1 loss: L_0/1(y, ŷ) = 0 if y = ŷ, else 1
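The Naive Bayes calculation above is easy to reproduce in code. The following is a minimal sketch, assuming the 14 weather examples are given as dictionaries with a "class" entry and one entry per attribute; the function names and data layout are illustrative choices, not from the slides.

    from collections import Counter, defaultdict

    def train_naive_bayes(examples, classes, attribute_values):
        """Estimate P(C) and P(X = v | C) using Laplace's law of succession."""
        n = len(examples)
        class_counts = Counter(ex["class"] for ex in examples)
        # Prior: (m + 1) / (n + k), where k is the number of classes.
        prior = {c: (class_counts[c] + 1) / (n + len(classes)) for c in classes}
        cond = defaultdict(dict)  # cond[(attr, v)][c] = P(attr = v | c)
        for attr, values in attribute_values.items():
            for v in values:
                for c in classes:
                    m = sum(1 for ex in examples if ex["class"] == c and ex[attr] == v)
                    cond[(attr, v)][c] = (m + 1) / (class_counts[c] + len(values))
        return prior, cond

    def classify(x, prior, cond, classes):
        """Return the normalized posterior P(C | x) for each class."""
        score = {c: prior[c] for c in classes}
        for attr, v in x.items():
            for c in classes:
                score[c] *= cond[(attr, v)][c]
        total = sum(score.values())
        return {c: s / total for c, s in score.items()}

Run on the 14 weather examples, classifying (sunny, hot, high, windy = false) this way gives roughly 0.30 for pos and 0.70 for neg, matching the numbers on the slide.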

Example of Numeric Examples

(Table: the 14 weather examples recoded numerically, one row per example, with binary input attributes Sunny, Rainy, Hot, Cool, Humid, and Windy and a binary output.)

Linear Regression

Linear regression finds the weights that minimize the loss over the training set.
Gradient descent changes the weights based on the gradient, the derivatives of the loss with respect to the weights (more on the next slide).
The linear least squares algorithm calculates the weights directly by

    w = (X^T X)^(-1) X^T y

where X is the data matrix and y is the vector of outputs.
Classification can be performed by: if w · x > 0 then positive, else negative.

Least Squares Gradient Descent

    w ← zeroes
    loop until convergence
        for each example (x_j, y_j)
            ŷ_j ← w · x_j
            for each w_i in w
                w_i ← w_i + α (y_j - ŷ_j) x_ij

where α is the learning rate. This is a small number chosen to trade off speed of convergence against closeness to the optimal weights.

Perceptron Learning Rule [differs from book]

A perceptron does gradient descent for absolute error loss (more accurately, ramp loss). This assumes each y_j is +1 or -1.

    w ← zeroes
    loop until convergence
        for each example (x_j, y_j)
            ŷ_j ← w · x_j
            if (y_j = +1 and ŷ_j < 0) or (y_j = -1 and ŷ_j > 0) then
                for each w_i in w
                    w_i ← w_i + α y_j x_ij

Again, α is the learning rate.

Perceptrons Continued

The perceptron convergence theorem states that if some w classifies all the training examples correctly, then the perceptron learning rule will converge to zero error on the training examples. Usually, many epochs (passes over the training examples) are needed until convergence. If zero error is not possible, use a small learning rate such as α = 0.1/n, where n is the number of normalized or binary inputs.
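The two update rules above can be written almost verbatim in code. This is a minimal sketch under the slides' conventions (each input vector starts with 1.0 for the bias weight, and labels are +1/-1 for the perceptron); the fixed epoch count stands in for "loop until convergence" and is an assumption, not part of the slides.

    def lms_gradient_descent(X, y, alpha=0.01, epochs=100):
        """Least squares gradient descent: w_i <- w_i + alpha*(y_j - yhat_j)*x_ij."""
        w = [0.0] * len(X[0])
        for _ in range(epochs):
            for xj, yj in zip(X, y):
                yhat = sum(wi * xi for wi, xi in zip(w, xj))
                w = [wi + alpha * (yj - yhat) * xi for wi, xi in zip(w, xj)]
        return w

    def perceptron(X, y, alpha=1.0, epochs=100):
        """Perceptron rule: update only on misclassified examples (labels +1/-1)."""
        w = [0.0] * len(X[0])
        for _ in range(epochs):
            for xj, yj in zip(X, y):
                yhat = sum(wi * xi for wi, xi in zip(w, xj))
                if (yj == 1 and yhat < 0) or (yj == -1 and yhat > 0):
                    w = [wi + alpha * yj * xi for wi, xi in zip(w, xj)]
        return w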

Example of Perceptron Learning (α = 1)

(Table: a trace of the perceptron learning rule with α = 1 on the numeric examples, listing the inputs x_1..x_4, the target y, the prediction ŷ, the loss L, and the weights w_0..w_4 after each example is processed.)

The Nearest Neighbor Algorithm

The k-nearest neighbor algorithm classifies a test example by finding the k closest training example(s) and returning the most common class among them.
Suppose 10% noise (so the best possible test error is 10%). With sufficient training examples, a test example will agree with its nearest neighbor with probability (.9)(.9) + (.1)(.1) = .82 (both not noisy or both noisy) and disagree with probability (.9)(.1) + (.1)(.9) = .18.
In general, 1-nearest-neighbor converges to less than twice the optimal error, and 3-nearest-neighbor comes closer still to optimal.

Neural Networks

Artificial Neural Networks

An (artificial) neural network consists of units, connections, and weights. Inputs and outputs are numeric. The biological analogy:

    Biological NN      Artificial NN
    soma               unit
    axon, dendrite     connection
    synapse            weight
    potential          weighted sum
    threshold          bias weight
    signal             activation

ANN Structure

A typical unit j receives inputs a_1, a_2, ... from other units and performs a weighted sum:

    in_j = w_0j + Σ_i w_ij a_i

and outputs the activation a_j = g(in_j).
Typically, input units store the inputs, hidden units transform the inputs into an internal numeric vector, and an output unit transforms the hidden values into the prediction.
An ANN is a function f(x, W) = a, where x is an example, W is the weights, and a is the prediction (the activation value of the output unit). Learning is finding a W that minimizes error.
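A minimal k-nearest-neighbor sketch for numeric examples like those in the table above; Euclidean distance and a simple majority vote are assumed here, since the slides do not fix a particular distance measure.

    from collections import Counter
    import math

    def knn_classify(query, examples, k=1):
        """examples: list of (input_vector, label) pairs. Returns the most common
        label among the k training examples closest to query (Euclidean distance)."""
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        neighbors = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]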

ANN Illustration

(Figure: a feed-forward network with input units x_1..x_4, hidden units a_5 and a_6, and output unit a_7. Weights w_15..w_45 and w_16..w_46 connect the inputs to the hidden units, and w_57 and w_67 connect the hidden units to the output.)

Illustration

(Figure: the same network drawn with particular numeric weight values.)

Sigmoid Activation

The sigmoid function is defined as:

    sigmoid(x) = 1 / (1 + e^(-x))

It is commonly used for ANN activation functions:

    a_j = sigmoid(in_j) = sigmoid(w_0j + Σ_i w_ij a_i)

Note that d sigmoid(x)/dx = sigmoid(x) (1 - sigmoid(x)).

Plot of Sigmoid Function

(Figure: sigmoid(x) plotted for x from -4 to 4, rising smoothly from near 0 to near 1, with value 0.5 at x = 0.)
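To make the 4-2-1 network of the figures concrete, here is a minimal forward-pass sketch. The weight values are illustrative placeholders, not the numbers from the figure.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward(x, W_hidden, W_out):
        """x: the 4 inputs. W_hidden: one [bias, w_1..w_4] list per hidden unit (a_5, a_6).
        W_out: [bias, w_57, w_67] for the output unit a_7. Returns (hidden activations, a_7)."""
        hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
                  for w in W_hidden]
        a7 = sigmoid(W_out[0] + sum(wi * h for wi, h in zip(W_out[1:], hidden)))
        return hidden, a7

    # Placeholder weights for illustration only.
    W_hidden = [[-1.0, 2.0, -1.0, 0.5, 1.0],    # unit a_5
                [0.5, -2.0, 1.0, 1.0, -0.5]]    # unit a_6
    W_out = [0.0, 1.5, -1.5]                    # unit a_7
    print(forward([1, 0, 1, 0], W_hidden, W_out))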

Backpropagation

One learning method is to backpropagate the error from the output to all of the weights; it is an application of the delta rule. Given the loss L(W, x, y), obtain the gradient:

    ∇L(W, x, y) = [ ..., ∂L/∂w_ij, ... ]

To decrease the error, use the update rule:

    w_ij ← w_ij - α ∂L/∂w_ij

where α is the learning rate.

Applying The Chain Rule

Using L = (1/2)(y_k - a_k)^2 for output unit k:

    ∂L/∂w_jk = (∂L/∂a_k)(∂a_k/∂in_k)(∂in_k/∂w_jk)
             = -(y_k - a_k) · a_k (1 - a_k) · a_j

For weights from input units to hidden units:

    ∂L/∂w_ij = (∂L/∂a_k)(∂a_k/∂in_k)(∂in_k/∂a_j)(∂a_j/∂in_j)(∂in_j/∂w_ij)
             = -(y_k - a_k) · a_k (1 - a_k) · w_jk · a_j (1 - a_j) · x_i

Support Vector Machines

An SVM assigns a weight α_i to each example (x_i, y_i), where x_i is an attribute-value vector and y_i is either +1 or -1. An SVM computes a discriminant by:

    h(x) = sign( b + Σ_i α_i y_i K(x, x_i) )

where K is a kernel function. An SVM learns by optimizing the error function:

    minimize ||h||^2 / 2 + Σ_i max(0, 1 - y_i h(x_i))    subject to 0 ≤ α_i ≤ C

where ||h|| is the size of h in kernel space.

Example SVM for Separable Examples

(Figure: a linearly separable two-class data set with the separating line w·x + b = 0 and the margin lines w·x + b = +1 and w·x + b = -1.)
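The chain-rule formulas above translate into a short gradient step for the 4-2-1 network from the earlier figures. This is a minimal sketch, continuing the weight representation used in the forward-pass sketch; the learning rate and weight layout are illustrative assumptions.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def backprop_step(x, y, W_hidden, W_out, alpha=0.1):
        """One gradient-descent update with loss L = 0.5*(y - a_out)^2.
        W_hidden: one [bias, w_1..w_4] list per hidden unit; W_out: [bias, w_h, ...].
        Both weight lists are modified in place."""
        # Forward pass.
        hidden = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
                  for w in W_hidden]
        a_out = sigmoid(W_out[0] + sum(wi * h for wi, h in zip(W_out[1:], hidden)))
        # dL/d in_out = -(y - a_out) * a_out * (1 - a_out), from the first formula.
        delta_out = -(y - a_out) * a_out * (1 - a_out)
        # Hidden deltas use the pre-update output weights w_jk (second formula).
        delta_hidden = [delta_out * W_out[j + 1] * hidden[j] * (1 - hidden[j])
                        for j in range(len(hidden))]
        # Output-unit weights: dL/dw_jk = delta_out * a_j (a_j = 1 for the bias).
        for j, a_j in enumerate([1.0] + hidden):
            W_out[j] -= alpha * delta_out * a_j
        # Hidden-unit weights: dL/dw_ij = delta_j * x_i (x_i = 1 for the bias).
        for j, w in enumerate(W_hidden):
            for i, x_i in enumerate([1.0] + list(x)):
                w[i] -= alpha * delta_hidden[j] * x_i
        return a_out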

Example SVM for Nonseparable Examples

(Figure: a two-class data set that is not linearly separable, shown with the lines w·x + b = -1, w·x + b = 0, and w·x + b = +1.)

Example Gaussian Kernel SVM

(Figure: the same kind of data with the nonlinear decision boundary produced by an SVM with a Gaussian kernel.)

Example Gaussian Kernel, Zoomed In

(Figure: a zoomed-in view of part of the Gaussian-kernel decision boundary.)
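The discriminant from the Support Vector Machines slide, with a Gaussian kernel as in the figures above, can be sketched as follows. Only prediction is shown; finding the α_i and b requires solving the quadratic optimization problem (usually with an off-the-shelf SVM solver), and the kernel width gamma is an illustrative parameter.

    import math

    def gaussian_kernel(x, z, gamma=1.0):
        """K(x, z) = exp(-gamma * ||x - z||^2)."""
        return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

    def svm_discriminant(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
        """h(x) = sign(b + sum_i alpha_i * y_i * K(x, x_i)); labels y_i are +1/-1."""
        score = b + sum(a * y * kernel(x, xi)
                        for a, y, xi in zip(alphas, labels, support_vectors))
        return 1 if score > 0 else -1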

Ensemble Learning

There are many algorithms for learning a single hypothesis. Ensemble learning learns and combines a collection of hypotheses by running the algorithm on different training sets.
Bagging (briefly mentioned in the book) runs a learning algorithm on repeated subsamples of the training set. If there are n examples, then a subsample of n examples is generated by sampling with replacement. On a test example, each hypothesis casts one vote for the class it predicts.

Boosting

In boosting, the hypotheses are learned in sequence. Both hypotheses and examples have weights, with different purposes. After each hypothesis is learned, its weight is set based on its error rate, and the weights of the training examples (initially all equal) are also modified. On a test example, each hypothesis predicts a class, and its weight is the size of its vote. The ensemble predicts the class with the highest total vote.

Example Boosting Algorithm

    AdaBoost(examples, algorithm, iterations)
    1.  n ← number of examples
    2.  initialize weights w[1 ... n] to 1/n
    3.  for i from 1 to iterations
    4.      h[i] ← algorithm(examples)
    5.      error ← sum of the weights of the examples misclassified by h[i]
    6.      for j from 1 to n
    7.          if h[i] is correct on example j
    8.              then w[j] ← w[j] · error/(1 - error)
    9.      normalize w[1 ... n] so it sums to 1
    10.     weight of h[i] ← log((1 - error)/error)
    11. return h[1 ... iterations] and their weights

Example Run of AdaBoost

Using the 14 examples as a training set: the hypothesis "windy = false → class = pos" is wrong on 5 of the 14 examples. The weights of the correctly classified examples are multiplied by 5/9, then all weights are multiplied by 14/10 so they sum to 1 again. This hypothesis has a weight of log(9/5). Note that after weight updating, the total weight of the correctly classified examples equals the total weight of the incorrectly classified examples.

Example Run of AdaBoost, Continued

The next hypothesis must be different from the previous one in order to have error less than 1/2. Now the hypothesis "outlook = overcast → class = pos" has a weighted error rate of 29/90. The weights of the correctly classified examples are multiplied by 29/61 ≈ 0.475, then all weights are multiplied by 90/58 ≈ 1.55 so they sum to 1 again. This hypothesis has a weight of log(61/29).
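The pseudocode above translates almost line for line into code. This is a minimal sketch: it assumes the base learner accepts the example weights (a detail the pseudocode leaves implicit) and returns a callable classifier, and the early exit on degenerate error values is an added safeguard, not part of the slides.

    import math

    def adaboost(examples, labels, algorithm, iterations):
        """algorithm(examples, labels, w) must return a classifier h with h(x) -> label.
        Returns a list of (hypothesis, hypothesis weight) pairs."""
        n = len(examples)
        w = [1.0 / n] * n                                   # step 2
        ensemble = []
        for _ in range(iterations):                         # step 3
            h = algorithm(examples, labels, w)              # step 4
            error = sum(wj for xj, yj, wj in zip(examples, labels, w)
                        if h(xj) != yj)                     # step 5: weighted error
            if error == 0 or error >= 0.5:                  # safeguard (not in the slides)
                break
            for j in range(n):                              # steps 6-8
                if h(examples[j]) == labels[j]:
                    w[j] *= error / (1 - error)
            total = sum(w)                                  # step 9: renormalize
            w = [wj / total for wj in w]
            ensemble.append((h, math.log((1 - error) / error)))  # step 10
        return ensemble                                     # step 11

    def ensemble_predict(ensemble, x, classes):
        """Each hypothesis votes for the class it predicts, weighted by its weight."""
        votes = {c: 0.0 for c in classes}
        for h, hw in ensemble:
            votes[h(x)] += hw
        return max(votes, key=votes.get)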