Machine Learning and Adaptive Systems. Lectures 3 & 4


ECE656, Lectures 3 & 4. Department of Electrical and Computer Engineering, Colorado State University, Fall 2015

What is Learning?

General definition of learning: any change in behavior or performance brought about by experience or contact with the environment. Learning in humans is an inferred process, i.e., we cannot see it happening directly but observe the resultant changes in behavior. Learning in an ANN is a more direct process, and we can capture each learning step in a cause-effect relationship.

Designing a (supervised) machine learning algorithm is based upon learning a relationship that transforms inputs to outputs from a set of input-output examples drawn from experience E. Given a training set consisting of input-output pairs $\{x(k), d(k)\}_{k=1}^{K}$ from E:

$x(k)$ : k-th training sample (pattern) vector, of dimension $N \times 1$
$d(k)$ : desired output (response) for input pattern $x(k)$, of dimension $M \times 1$, where typically $M \ll N$

The problem is viewed as a task T of approximating a continuous multivariate mapping function $d(k) = f(x(k))$ by a function $h(w, x(k))$ as closely as possible (measured by some performance metric $\rho(\cdot)$ w.r.t. E), with $w$ being the parameter vector.

[Block diagram: the unknown mapping $f(\cdot)$ produces the desired output $d(k)$; the ANN $h(\cdot)$ produces the actual output $o(k)$.]

If we define a performance measure $\rho(h(w, x(k)), f(x(k)))$, then for the optimal parameter vector $w^*$,

$\rho(h(w^*, x(k)), f(x(k))) \le \rho(h(w, x(k)), f(x(k))), \quad \forall k \in [1, K]$

where $w$ is any other parameter vector.

Note: such learning rules are referred to as Performance Learning. Examples of $\rho(\cdot)$ are MSE, classification error, mutual information, etc.

1. Supervised Learning

Requires an external teacher: the teacher provides training sample pairs, i.e., $x(k)$ and $d(k)$. Training is based upon error learning and is typically iterative. Supervised methods are more accurate than unsupervised ones. Used for detection/classification, prediction/regression, and function approximation.

[Block diagram: the environment supplies $x(k)$ to the ANN $h(\cdot)$, which produces $o(k)$; the teacher supplies $d(k)$, and the error $e(k) = d(k) - o(k)$ drives the adjustment mechanism.]

Issues:
1. Suffers from a speed-accuracy tradeoff, i.e., a high speed of convergence means less accuracy in parameter estimation.
2. Incremental learning requires old data.
3. Overtraining leads to performance degradation.

2. Unsupervised Learning

Uses unlabeled data, i.e., training samples have no desired outcomes; there is no external teacher, i.e., learning is based upon self-organization. Clusters are formed to discover interesting features from the data, based upon its underlying statistical properties (an unconditional density estimator, versus a conditional one for supervised learning). The algorithms are simpler than those of supervised learning. Used for data clustering, compression, and dimensionality reduction.

[Block diagram: the environment supplies $x(k)$ to the ANN $h(\cdot)$, which produces $o(k)$; the adjustment mechanism operates without a teacher.]

Issues:
1. Requires more parameter tuning.
2. No automatic shut-off system.
3. Less accurate than supervised counterparts when used for decision-making.

3. Reinforcement Learning

Originated in psychology with experiments on animal learning. There is no unambiguous outcome as in supervised learning, i.e., no explicit supervision. Given a training set $\{x(k), r(k)\}_{k=1}^{K}$, where $r(k) \in \{-1, +1\}$ is supplied by a critic (not a desired signal) as a reinforcement (or appropriateness) signal. Reward- and punishment-based actions are designed to maximize the expected value of a criterion function; learning should devise a sequence of actions over time to obtain a large cumulative reward. Applied in web indexing, cell-phone network routing, control theory (e.g., robotics, unmanned vehicles), marketing strategy selection, etc.

[Block diagram: the environment sends a primary reinforcement signal to the critic, which converts it into a heuristic reinforcement signal for the adjustment mechanism of the ANN $h(\cdot)$; the ANN produces $o(k)$ and acts back on the environment.]

In machine learning, the environment is typically formulated as a Markov decision process (MDP), which allows dynamic programming techniques to be used. The learner provides feedback to the environment in the form of actions (i.e., it allows for interaction).

Generalized Learning Rule (Amari, 1990)

The weight vector $w_i$ is updated in proportion to the input $x(k)$ and a learning signal $r(w_i(k), x(k), d_i(k))$. Thus, the increment of $w_i(k)$ at iteration k is

$\Delta w_i(k) = \mu \, r(w_i(k), x(k), d_i(k)) \, x(k)$

where $\mu$ is the learning rate (or step size). Then the weight vector at iteration k+1 is adjusted using

$w_i(k+1) = w_i(k) + \Delta w_i(k)$

The $w_i(0)$'s are randomly initialized.

[Diagram: cell i with weights $w_{i1}, \dots, w_{ij}, \dots, w_{iN}$ produces $o_i(k)$; the adjustment mechanism, driven by $d_i(k)$ from the teacher and step size $\mu$, generates $\Delta w_i$.]

This provides a unified representation for several learning rules that we shall cover; a code sketch of the generic update follows the list below. In the next few lectures we cover:
(a) Performance Learning: used for classification and function approximation.
(b) Coincidence Learning: used for association (memory learning).
(c) Competitive Learning: used for clustering.
(d) Reinforcement Learning: used in controls, game theory, multi-agent systems, and their applications in real-world problems.
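To make the update concrete, here is a minimal sketch in Python/NumPy. The learning signal r is left as a function argument; the LMS-style signal used in the demo (the error $d - w^t x$) is one illustrative choice, not the only one the rule admits.

```python
import numpy as np

def generalized_update(w, x, d, r_fn, mu=0.01):
    """One step of the generalized rule: w <- w + mu * r(w, x, d) * x."""
    return w + mu * r_fn(w, x, d) * x

# Illustrative learning signal: the error d - w^t x, which turns the
# generalized rule into the LMS/ADALINE update covered later.
def lms_signal(w, x, d):
    return d - w @ x

rng = np.random.default_rng(0)
w = rng.standard_normal(3)                 # w(0) randomly initialized
for k in range(1000):
    x = rng.standard_normal(3)             # input pattern x(k)
    d = np.array([1.0, -2.0, 0.5]) @ x     # desired response from a known linear map
    w = generalized_update(w, x, d, lms_signal, mu=0.05)
print(w)  # approaches [1.0, -2.0, 0.5]
```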

(a) Performance Learning: ADALINE (Widrow-Hoff, 1960)

These learning methods are based upon minimizing or maximizing a performance measure (cost function). ADALINE (ADAptive LINear Element) is based upon the McCulloch-Pitts model with bipolar hard-limiter (HL) activation.

[Diagram: cell i with weights $w_{i1}, \dots, w_{ij}, \dots, w_{iN}$ forms $net_i(k)$, which passes through the HL to give $o_i(k)$; the error $e_i(k) = d_i(k) - net_i(k)$ drives the adjustment mechanism.]

Objective: find $w_i$ such that the MSE between $net_i(k)$ and the desired (supervised) response is minimized. That is, given $\{x(k), d_i(k)\}_{k=1}^{K}$, find $w_i$ to minimize

$J(w_i) = \frac{1}{K} \sum_{k=1}^{K} (d_i(k) - net_i(k))^2 = \frac{1}{K} \sum_{k=1}^{K} e_i^2(k)$ : time-averaged SE   (1)

where $net_i(k) = w_i^t x(k)$ and $d_i(k) = net_i(k) + e_i(k)$.
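The cost in (1) translates directly to NumPy; as an assumption for this sketch, the K input patterns $x^t(k)$ are stacked as the rows of a matrix X (the same convention as the data matrix introduced later).

```python
import numpy as np

def time_averaged_se(w, X, d):
    """J(w) = (1/K) * sum_k (d(k) - w^t x(k))^2, with the x(k)^t as rows of X."""
    e = d - X @ w          # e(k) = d(k) - net(k), all k at once
    return np.mean(e ** 2)
```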

For an ergodic process the time-averaged MSE can be rewritten as the ensemble-averaged MSE, i.e.,

$J(w_i) = E[(d_i(k) - w_i^t x(k))^2] = E[d_i^2(k)] + w_i^t E[x(k) x^t(k)] w_i - 2 w_i^t E[d_i(k) x(k)] = \sigma_d^2 + w_i^t R_{xx} w_i - 2 w_i^t R_{xd}$

where $E[\cdot]$ is the expectation operator, $\sigma_d^2 = E[d_i^2(k)]$ is the variance of $d_i(k)$, $R_{xx} = E[x(k) x^t(k)]$ is the correlation matrix of the data, and $R_{xd} = E[d_i(k) x(k)]$ is the cross-correlation vector between the input data and the desired signal.

As can be seen, $J(w_i)$ is quadratic and has a bowl-shaped surface as a function of $w_i$, also referred to as the Error Performance Surface.

[Figure: the quadratic error surface $J(w_i)$ over $(w_{i1}, w_{i2})$ with a unique absolute minimum $J_{min}$.]

This surface has a unique global minimum at which

$\frac{\partial J(w_i)}{\partial w_i} = 0 \;\Rightarrow\; 2 R_{xx} w_i - 2 R_{xd} = 0$

Solving for $w_i$ gives

$w_i^* = R_{xx}^{-1} R_{xd}$

which is the Wiener-Hopf solution (or Minimum Mean Squared Error (MMSE) solution).
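A minimal sketch of computing the Wiener-Hopf solution, with sample averages standing in for the ensemble quantities $R_{xx}$ and $R_{xd}$ (a reasonable stand-in for ergodic data); np.linalg.solve is used rather than forming $R_{xx}^{-1}$ explicitly.

```python
import numpy as np

def wiener_hopf(X, d):
    """w* = Rxx^{-1} Rxd, with Rxx and Rxd estimated by sample averages over rows of X."""
    K = X.shape[0]
    Rxx = X.T @ X / K      # sample estimate of E[x x^t]
    Rxd = X.T @ d / K      # sample estimate of E[d x]
    return np.linalg.solve(Rxx, Rxd)
```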

Example: use an ADALINE to estimate the weights of a 2nd-order AR-based linear predictor,

$x(k) = a_1 x(k-1) + a_2 x(k-2) + u(k), \quad u(k) \sim \mathcal{N}(0, \sigma_u^2)$

Here $net(k) = w^t(k) \, [x(k-1)\; x(k-2)]^t$ with $w(k) = [a_1(k)\; a_2(k)]^t$, and we choose $d(k) = x(k)$. Solving for $w^*$ yields $a_1$ and $a_2$. A simulation of this example appears below.

[Diagram: a tapped-delay-line predictor with unit delays $z^{-1}$, weights $w_1(k)$ and $w_2(k)$ forming $net(k)$, and error $e(k) = x(k) - net(k)$.]

Remarks:
1. Using the Wiener-Hopf solution, the minimum MSE is $J_{min}(w_i^*) = \sigma_d^2 - w_i^{*t} R_{xd}$.
2. For stationary processes the error surface has a fixed shape and orientation, whereas for non-stationary processes the bottom of the error surface moves continually, and its orientation and curvature may change too. Thus, the solution should not only seek the bottom but also track it.
3. The Wiener-Hopf solution requires computing $R_{xx}$, $R_{xx}^{-1}$, and $R_{xd}$, which makes it unsuitable for online learning.
4. For ergodic processes, solving the original time-averaged SE, which corresponds to the Least Squares (LS) solution, approaches the Wiener-Hopf (i.e., MMSE) solution for large K. This is shown next.
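A runnable sketch of this example; the AR coefficients and noise level are made-up values for illustration. The process is simulated, the delayed inputs are stacked into rows, and the weights are solved from the sample correlations.

```python
import numpy as np

rng = np.random.default_rng(0)
a1, a2, sigma_u = 0.6, -0.3, 1.0         # assumed AR(2) parameters (stationary)
K = 10_000

# Simulate x(k) = a1*x(k-1) + a2*x(k-2) + u(k)
x = np.zeros(K)
for k in range(2, K):
    x[k] = a1 * x[k - 1] + a2 * x[k - 2] + rng.normal(0.0, sigma_u)

# Rows are [x(k-1), x(k-2)]; desired response d(k) = x(k)
X = np.column_stack([x[1:-1], x[:-2]])
d = x[2:]

Rxx = X.T @ X / len(d)
Rxd = X.T @ d / len(d)
w = np.linalg.solve(Rxx, Rxd)
print(w)   # close to [0.6, -0.3]
```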

Least Squares Solution & Geometric Interpretation

Let us reconsider the time-averaged SE in (1). If we define

$d_i = [d_i(1), \dots, d_i(K)]^t$ : desired vector
$e_i = [e_i(1), \dots, e_i(K)]^t$ : error vector
$X = [x(1), \dots, x(K)]^t$, of dimension $K \times N$ : data matrix, whose rows are the input patterns $x^t(k)$

then we can write $d_i = X w_i + e_i = net_i + e_i$. The time-averaged SE is

$J(w_i) = \frac{1}{K} e_i^t e_i = \frac{1}{K} (d_i - X w_i)^t (d_i - X w_i)$

Setting $\frac{\partial J(w_i)}{\partial w_i} = 0$ yields

$-2 X^t (d_i - X \hat{w}_{i,LS}) = 0 \;\Rightarrow\; X^t X \hat{w}_{i,LS} = X^t d_i$

which gives the LS solution

$\hat{w}_{i,LS} = (X^t X)^{-1} X^t d_i = X^{\dagger} d_i$

where $X^{\dagger} = (X^t X)^{-1} X^t$ is the pseudo-inverse of X.
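A minimal sketch of the LS solution. np.linalg.lstsq is the numerically stable route; the explicit pseudo-inverse version is included only to mirror the formula.

```python
import numpy as np

def ls_solution(X, d):
    """w_LS solving the normal equations X^t X w = X^t d, computed stably."""
    w, *_ = np.linalg.lstsq(X, d, rcond=None)
    return w

def ls_solution_pinv(X, d):
    """Literal w_LS = X^+ d (fine for small, well-conditioned X)."""
    return np.linalg.pinv(X) @ d
```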

Recall the LS solution $\hat{w}_{i,LS} = (X^t X)^{-1} X^t d_i$. Here

$X^t X = \sum_{k=1}^{K} x(k) x^t(k) = \hat{R}_{xx}$ : sample correlation matrix

$X^t d_i = \sum_{k=1}^{K} d_i(k) x(k) = \hat{R}_{xd}$ : sample cross-correlation vector

where $\hat{R}_{xx}, \hat{R}_{xd}$ are time-averaged, while $R_{xx}, R_{xd}$ are ensemble-averaged. Clearly, $\hat{w}_{i,LS} = \hat{R}_{xx}^{-1} \hat{R}_{xd}$.

If the data is stationary and ergodic, the LS solution asymptotically approaches the Wiener-Hopf solution for large K, i.e., if

$\lim_{K \to \infty} \frac{1}{K} \hat{R}_{xx} \to R_{xx}, \qquad \lim_{K \to \infty} \frac{1}{K} \hat{R}_{xd} \to R_{xd}$

then $\hat{w}_{i,LS} \to w_i^*$ (MMSE).
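A quick numerical check of this convergence on synthetic stationary data (the dimensions, noise level, and true weight vector are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])    # plays the role of the MMSE solution here

for K in (100, 10_000, 1_000_000):
    X = rng.standard_normal((K, 3))                 # white inputs: Rxx = I
    d = X @ w_true + rng.normal(0.0, 0.1, K)        # d(k) = w^t x(k) + noise
    w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
    print(K, np.linalg.norm(w_ls - w_true))         # error shrinks as K grows
```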

Now define the orthogonal projection matrix $P_X$,

$P_X = X (X^t X)^{-1} X^t$

which is idempotent, i.e., $P_X^2 = P_X$ (hence $P_X^n = P_X,\ \forall n$), and symmetric, i.e., $P_X^t = P_X$. Then

$\hat{net}_i = X \hat{w}_{i,LS} = P_X d_i$

i.e., the projection of $d_i$ onto the space spanned by the columns of X. Using this result, the error vector at the LS solution is

$\hat{e}_i = d_i - \hat{net}_i = d_i - P_X d_i = P_X^{\perp} d_i$

where $P_X^{\perp} = I - P_X$ is the orthogonal complement of $P_X$. Note that $e_i = \hat{e}_i + (\hat{net}_i - net_i)$.

[Figure: geometric representation showing $d_i$ decomposed into $\hat{net}_i = P_X d_i$, lying in the column space of X, and $\hat{e}_i = P_X^{\perp} d_i$, orthogonal to it.]
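A small sketch verifying these properties numerically on random data (purely illustrative); np.allclose absorbs floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
d = rng.standard_normal(50)

P = X @ np.linalg.inv(X.T @ X) @ X.T       # P_X = X (X^t X)^{-1} X^t
P_perp = np.eye(50) - P                    # orthogonal complement of P_X

print(np.allclose(P @ P, P))               # idempotent: P_X^2 = P_X
print(np.allclose(P.T, P))                 # symmetric: P_X^t = P_X

w_ls, *_ = np.linalg.lstsq(X, d, rcond=None)
print(np.allclose(X @ w_ls, P @ d))        # net_hat = P_X d
print(np.allclose(X.T @ (P_perp @ d), 0))  # LS error is orthogonal to columns of X
```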

Homework 1 (due September 15, 2015)

Problems:
1. Using the Wiener-Hopf solution, show that the estimation error $e_o(k) = d(k) - w^{*t} x(k)$ is orthogonal to the data, i.e., $E[e_o(k) x(k)] = E[(d(k) - w^{*t} x(k)) x(k)] = 0$. Explain the importance of this property.
2. Is the same orthogonality property valid for the LS solution? If it is, derive it. Compare the results of Problems 1 and 2 and their interpretations.
3. To guarantee uniqueness and improve the stability of the LS solution, a regularization term is typically added to the cost function, i.e., $J(w_i) = (d_i - X w_i)^t (d_i - X w_i) + \lambda \|w_i\|^2$. Find the solution of this regularized (or constrained) LS problem and investigate its relation to the LS solution.
4. Solve Problem 2.3 in your textbook.

(b) Performance Learning: Steepest Descent

An alternative procedure using the steepest (or gradient) descent method can be devised to find the ADALINE weights without computing $R_{xx}^{-1}$. The method involves the following steps:

1. Start with an initial guess $w_i(0)$.
2. Use this guess to compute the gradient vector $\nabla J(w_i(k)) = \frac{\partial J(w_i(k))}{\partial w_i(k)}$ at iteration (training sample) k.
3. Update the weights in the direction opposite to the gradient: $w_i(k+1) = w_i(k) - \frac{1}{2} \mu \nabla J(w_i(k))$, where $\mu$ is the step size.
4. Go back to step 2 and repeat until the convergence conditions are met.

From the previous result, $\nabla J(w_i(k)) = -2 R_{xd} + 2 R_{xx} w_i(k)$. Thus,

$w_i(k+1) = w_i(k) + \mu [R_{xd} - R_{xx} w_i(k)]$

i.e., no matrix inversion is required, but $R_{xx}$ and $R_{xd}$ must still be computed. The step size $\mu$ controls the size of the incremental corrections. For convergence,

$0 < \mu < \frac{2}{\lambda_{max}}$

where $\lambda_{max}$ is the largest eigenvalue of $R_{xx}$.
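A minimal sketch of this iteration with sample-estimated correlations; the data, step-size choice, and iteration count are illustrative assumptions.

```python
import numpy as np

def steepest_descent(X, d, mu=None, n_iters=500):
    """Iterate w(k+1) = w(k) + mu * (Rxd - Rxx w(k)); no matrix inversion."""
    K = X.shape[0]
    Rxx = X.T @ X / K
    Rxd = X.T @ d / K
    if mu is None:
        mu = 1.0 / np.linalg.eigvalsh(Rxx).max()   # inside 0 < mu < 2/lambda_max
    w = np.zeros(X.shape[1])                       # initial guess w(0)
    for _ in range(n_iters):
        w = w + mu * (Rxd - Rxx @ w)               # step opposite to the gradient
    return w

rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 3))
d = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0.0, 0.1, 5000)
print(steepest_descent(X, d))   # approaches the Wiener-Hopf solution
```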