CSE 4502/5717 Big Data Analytics
Lecture #16, 4/2/2018, with Dr. Sanguthevar Rajasekaran
Notes from Yenhsiang Lai

Machine learning is the task of inferring a function, e.g., $f : \mathbb{R}^n \to \mathbb{R}$. This inference has to be made using a series of examples. An example is a pair of the form $(x, y)$, where $x \in \mathbb{R}^n$, $y \in \mathbb{R}$, and $y = f(x)$. The examples constitute the training data for machine learning.

Convention: Lowercase bold letters (such as x and y) will be used to denote vectors, and boldface capital letters (such as A) will be used to denote matrices.

A simple example: Assume that the function $f$ is a degree-$k$ polynomial, $f : \mathbb{R} \to \mathbb{R}$,
$$f(x) = a_k x^k + a_{k-1} x^{k-1} + \cdots + a_1 x + a_0.$$
If we know the value of the polynomial at $k+1$ distinct points, we can uniquely determine $f$ by interpolation. In practice we may not know the form of the function, and there could also be errors. In practice we guess the form of the function. Each such candidate function will be called a model and is characterized by some parameters. We choose parameter values that minimize the difference between the model outputs and the true function values.

MITCHELL (1997): A computer program is said to learn to perform a set T of tasks from experience E, under some performance measure P, if its performance with respect to the tasks in T, as measured by P, improves with E.

There are two kinds of learning: supervised and unsupervised. In supervised learning we are given a set of examples $(x, y)$ and the goal is to infer the conditional probability distribution $P(y \mid x)$ used for prediction. In unsupervised learning the goal is to estimate the data-generating distribution $P(x)$ after observing many random vectors $x$ drawn from this distribution.
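As a small illustration of the interpolation remark above (a minimal sketch, not part of the notes; it assumes NumPy, and the polynomial and sample points are made up):

```python
# Recover a degree-k polynomial from its values at k+1 distinct points.
import numpy as np

# Hypothetical true polynomial f(x) = 2x^3 - x + 5 (degree k = 3),
# coefficients listed from highest degree to lowest.
true_coeffs = np.array([2.0, 0.0, -1.0, 5.0])
k = len(true_coeffs) - 1

# Evaluate f at k+1 = 4 distinct points.
xs = np.array([-1.0, 0.0, 1.0, 2.0])
ys = np.polyval(true_coeffs, xs)

# Fitting a degree-k polynomial to exact values at k+1 distinct points
# determines it uniquely, so the original coefficients are recovered.
recovered = np.polyfit(xs, ys, deg=k)
print(recovered)   # approximately [ 2.  0. -1.  5.]
```

With noisy values or a wrong guess for the degree, this exact recovery no longer holds, which is what motivates treating the guessed form as a model with parameters to be fit.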

Capacity of a model: A machine learning algorithm builds a model from the input training data. We can measure the accuracy of the model with respect to the training data; the corresponding error (i.e., the difference between the model outputs and the desired outputs) will be called the training error. A machine learning algorithm will also be tested on data points previously unseen. These unseen data points used to test the algorithm will be referred to as the test data, and we can define a corresponding test error. A model is said to underfit if its training error is not low enough. A model is said to overfit if the difference between the training and test errors is very large; in this case the model memorizes the properties of the training data too closely. We can modify the underfitting and overfitting behavior of a learning algorithm by changing its capacity: a model with a low capacity tends to underfit, and a model with a high capacity tends to overfit. Goodfellow et al. (2016) generated data from a quadratic function and used three different models to fit the data: degree 1, degree 2, and degree 9 polynomials, respectively. In their experiment the degree 1 model underfits, the degree 2 model fits well, and the degree 9 model overfits. The performance of a model is optimal when its capacity is close to the complexity of the function being learned.

No free lunch theorem (WOLPERT 1996): When averaged over all possible data-generating distributions, each classifier has the same error rate on previously unseen data points. We can develop better algorithms if we restrict the class of functions of interest and/or we have more information on the data-generating distribution.
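The following is a minimal sketch of that kind of capacity experiment (not Goodfellow et al.'s code; it assumes NumPy, and the data-generating quadratic, noise level, and sample sizes are made up):

```python
# Fit polynomials of degree 1, 2, and 9 to noisy samples of a quadratic
# and compare the training error with the error on held-out test points.
import numpy as np

rng = np.random.default_rng(0)

def quadratic(x):
    # Hypothetical data-generating function: f(x) = x^2 - 3x + 2.
    return x**2 - 3*x + 2

# Small noisy training set and a larger held-out test set.
x_train = rng.uniform(-1, 4, size=12)
y_train = quadratic(x_train) + rng.normal(0, 0.3, size=x_train.shape)
x_test = rng.uniform(-1, 4, size=200)
y_test = quadratic(x_test) + rng.normal(0, 0.3, size=x_test.shape)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")

# Typically: degree 1 has high training and test error (underfitting),
# degree 2 has low training and test error, and degree 9 has very low
# training error but a larger test error (overfitting).
```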

Example: Linear regression is one of the simplest models we can think of. This model fits the data with a linear function. Let $f : \mathbb{R}^n \to \mathbb{R}$ be any function; a linear regression model takes the form
$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n.$$
Here $x = (x_1, x_2, \ldots, x_n)^T$. Let $w = (w_1, w_2, \ldots, w_n)^T$.

Input: $m$ examples. Output: $w = (w_1, w_2, \ldots, w_n)^T$.

Let the examples be $E_1, E_2, \ldots, E_m$, where $E_i = (x_1^i, x_2^i, \ldots, x_n^i, y_i)$, with $y_i = f(x_1^i, x_2^i, \ldots, x_n^i)$, $1 \le i \le m$. Let $\hat{y}_i$ be the output of the model on input $x^i = (x_1^i, x_2^i, \ldots, x_n^i)$. The estimation error is
$$\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2.$$
This is called the mean squared error (MSE). Here
$$\hat{y}_i = w_1 x_1^i + w_2 x_2^i + \cdots + w_n x_n^i.$$

Goal: Determine $w = (w_1, w_2, \ldots, w_n)^T$ that minimizes the MSE.

Gradient Descent: Consider a function $f : \mathbb{R} \to \mathbb{R}$. Say we want to find $\arg\min_x f(x)$. For small $\varepsilon$,
$$f(x + \varepsilon) \approx f(x) + \varepsilon f'(x), \qquad \text{where } f'(x) = \frac{df}{dx}.$$
Note: $f(x - \varepsilon \,\mathrm{sign}(f'(x))) < f(x)$ for any small enough $\varepsilon$ (as long as $f'(x) \neq 0$). When $f'(x) = 0$, $x$ is a stationary point. A stationary point could be a (local) minimum, a (local) maximum, or a saddle point.
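As a minimal sketch of how these two pieces fit together (not the notes' approach, which continues analytically below; it assumes NumPy, and the data, step size, and iteration count are made up), gradient descent can be applied directly to the linear regression MSE:

```python
# Gradient descent on the MSE of a linear model y_hat = X w.
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))          # row i holds (x_1^i, ..., x_n^i)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                       # noiseless labels, y_i = f(x^i)

w = np.zeros(n)                      # initial guess for the parameters
step = 0.1                           # learning rate (the epsilon above)
for _ in range(500):
    residual = X @ w - y             # y_hat_i - y_i for all m examples
    grad = (2.0 / m) * (X.T @ residual)   # gradient of (1/m) * sum of squares
    w = w - step * grad              # move against the gradient
print(w)                             # close to [ 1.  -2.   0.5]
```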

(Figure: examples of a maximum, a minimum, and a saddle point.)

Example: $f(x) = 5x^2 - 10x + 7$. Then $\frac{df}{dx} = 10x - 10 = 0$ when $x = 1$, and $\frac{d^2 f}{dx^2} = 10 > 0$, so $x = 1$ is a minimum.

Definition: Let $y = (y_1, y_2, \ldots, y_m)^T$ and $x = (x_1, x_2, \ldots, x_n)^T$. We define the gradient of $y$ with respect to $x$, denoted $\nabla_x y$ (or $\frac{\partial y}{\partial x}$), as the $m \times n$ matrix
$$\frac{\partial y}{\partial x} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & & & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}.$$
$\nabla_x y$ corresponds to a stationary point if each entry in this matrix is 0.

Some facts:
1. Let $y = Ax$, where $y$ is $m \times 1$, $A$ is $m \times n$, and $x$ is $n \times 1$. Then $\frac{\partial y}{\partial x} = A$. This is because $y_i = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n$, so $\frac{\partial y_i}{\partial x_j} = a_{ij}$ for all $i, j$.
2. Let $y = Ax$, where $y$ is $m \times 1$, $A$ is $m \times n$, and $x$ is $n \times 1$. If $A$ is independent of $x$ and $y$, and $x$ is a function of $z$, then $\frac{\partial y}{\partial z} = \frac{\partial y}{\partial x} \frac{\partial x}{\partial z} = A \frac{\partial x}{\partial z}$.
3. Let $\alpha = y^T A x$. Then $\frac{\partial \alpha}{\partial x} = y^T A$ and $\frac{\partial \alpha}{\partial y} = x^T A^T$. Note: $y^T A x = x^T A^T y$.
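These identities can be checked numerically; here is a minimal finite-difference sketch of facts 1 and 3 (not from the notes; it assumes NumPy, and the matrices and vectors are made up):

```python
# Finite-difference check of two matrix-calculus facts stated above.
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
A = rng.normal(size=(m, n))
x = rng.normal(size=n)
y = rng.normal(size=m)
eps = 1e-6

# Fact 1: for y = A x, the gradient dy/dx equals A.
jac = np.empty((m, n))
for j in range(n):
    dx = np.zeros(n)
    dx[j] = eps
    jac[:, j] = (A @ (x + dx) - A @ x) / eps
print(np.allclose(jac, A, atol=1e-4))        # True

# Fact 3: for alpha = y^T A x, the gradient d(alpha)/dx equals y^T A.
grad = np.empty(n)
for j in range(n):
    dx = np.zeros(n)
    dx[j] = eps
    grad[j] = (y @ A @ (x + dx) - y @ A @ x) / eps
print(np.allclose(grad, y @ A, atol=1e-4))   # True
```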

4. If $\alpha = x^T A x$, then $\frac{\partial \alpha}{\partial x} = x^T (A + A^T)$.
5. If $\alpha = x^T A x$ and $A$ is symmetric, then $\frac{\partial \alpha}{\partial x} = 2 x^T A$.

Linear Regression: We are required to minimize the MSE, i.e., we have to minimize $\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$. Let
$$X = \begin{pmatrix}
x_1^1 & x_2^1 & \cdots & x_n^1 \\
x_1^2 & x_2^2 & \cdots & x_n^2 \\
\vdots & & & \vdots \\
x_1^m & x_2^m & \cdots & x_n^m
\end{pmatrix}, \qquad
w = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}.$$
In this case, $Xw = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m)^T$. We have to minimize $\frac{1}{m} \| Xw - y \|_2^2$. The plan is to set $\nabla_w \mathrm{MSE} = 0$, i.e.,
$$\nabla_w \frac{1}{m} \| Xw - y \|_2^2 = 0.$$
To be continued.
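The notes stop here. As a minimal numerical sketch of the minimization just set up (not the notes' continuation; it assumes NumPy, and the data are made up), one can check that the MSE gradient vanishes at a least-squares solution for $w$:

```python
# Check that the MSE gradient (2/m) X^T (X w - y) vanishes at the
# least-squares w, i.e., that w is a stationary point of the MSE.
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 4
X = rng.normal(size=(m, n))
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(0, 0.1, size=m)

# w minimizing ||X w - y||_2^2 via a standard least-squares solver.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

grad = (2.0 / m) * (X.T @ (X @ w - y))        # gradient of (1/m)||Xw - y||^2
print(np.allclose(grad, 0, atol=1e-8))        # True: stationary point
print((1.0 / m) * np.sum((X @ w - y) ** 2))   # the minimized MSE
```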