Introduction to Machine Learning: Regression. Computer Science, Tel-Aviv University, 2013-14


Classification
Input X: real-valued vectors, discrete values (0, 1, 2, ...), or other structures (e.g., strings, graphs).
Output Y: discrete (0, 1, 2, ...).

Regression
Input X: real-valued vectors or discrete values (0, 1, 2, ...).
Output Y: real-valued (possibly vectors over the reals).

Examples of Regression
Weight + height → cholesterol level.
Age + gender → time spent in front of the TV.
Past choices of a user → "Netflix score".
Profile of a job (user, machine, time) → memory usage of the submitted process.

Linear Regression
Input: a set of points (x_i, y_i). Assume there is a linear relation between y and x, i.e. y ≈ a·x + b.
Find a, b by solving min_{a,b} Σ_i (y_i − a·x_i − b)^2.

Regression: Minimize the Residuals
The fitted line is y = a·x + b; the residual of a point (x_i, y_i) is r_i = y_i − a·x_i − b.
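As an illustrative sketch (the toy data and variable names are mine, not from the slides), the one-dimensional least-squares fit can be computed in closed form with numpy:

```python
import numpy as np

# Toy data: noisy points around a line (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Closed-form 1-D least squares: a = cov(x, y) / var(x), b = mean(y) - a * mean(x).
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

residuals = y - (a * x + b)
print(a, b, np.sum(residuals ** 2))  # a ~ 2, b ~ 1, small residual sum
```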

Likelihood Formulation
Model assumptions: y_i = a·x_i + b + ε_i, where the noise terms ε_i are independent Gaussians, ε_i ~ N(0, σ^2).
Input: the observed pairs (x_i, y_i).
The likelihood is maximized when Σ_i (y_i − a·x_i − b)^2 is minimized, so the maximum-likelihood fit coincides with the least-squares fit.
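To make this step explicit (a short derivation, assuming the Gaussian noise model stated above), write out the log-likelihood:

```latex
\log L(a,b)
  = \sum_{i=1}^{n} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left( -\frac{(y_i - a x_i - b)^2}{2\sigma^2} \right) \right]
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a x_i - b)^2 .
```

The first term does not depend on (a, b), so maximizing the likelihood is exactly minimizing the sum of squared residuals.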

Note: we can add another variable x_{d+1} = 1 and set a_{d+1} = b. Therefore, without loss of generality, we can drop the intercept and write the model as y ≈ a·x.

Matrix Notation
Stack the n inputs as the rows of an n × d matrix X and the outputs as a vector y ∈ R^n. The objective becomes min_a ||y − Xa||^2.

The Normal Equations
Setting the gradient of ||y − Xa||^2 to zero gives X^T X a = X^T y. When X^T X is invertible, the solution is a = (X^T X)^{-1} X^T y.
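A small numpy sketch of solving the normal equations (the helper name and toy data are mine; X is assumed to already contain a column of ones for the intercept):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X a = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example: append a column of ones so the last coefficient plays the role of b.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(size=(100, 3)), np.ones(100)])
true_a = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ true_a + rng.normal(scale=0.1, size=100)
print(least_squares(X, y))  # close to true_a
```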

Functions over n Dimensions
For a function f(x_1, ..., x_n), the gradient is the vector of partial derivatives: ∇f = (∂f/∂x_1, ..., ∂f/∂x_n). In one dimension, the gradient is simply the derivative.

Gradient Descent
Goal: minimize a function f.
Algorithm:
1. Start from a point x^(1).
2. Compute u = ∇f(x^(i)_1, ..., x^(i)_n).
3. Update x^(i+1) = x^(i) − η·u for a step size η > 0.
4. Return to (2), unless converged.

Gradient Descent
Gradient descent iteration for least squares: a^(t+1) = a^(t) + η·X^T(y − X a^(t)).
Advantage: simple, efficient.
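An illustrative sketch of gradient descent on ||y − Xa||^2 (the step size, iteration count, and stopping rule are my choices, not the slides'); the gradient is −2·X^T(y − Xa):

```python
import numpy as np

def gd_least_squares(X, y, eta=1e-3, n_iters=10_000, tol=1e-8):
    """Minimize ||y - X a||^2 by gradient descent."""
    a = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = -2.0 * X.T @ (y - X @ a)   # gradient of the squared error
        a_new = a - eta * grad            # step against the gradient
        if np.linalg.norm(a_new - a) < tol:
            break
        a = a_new
    return a
```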

Online Least Squares
Online update step, processing one example at a time: a ← a + η·(y_i − a·x_i)·x_i.
Advantage: efficient, similar to the perceptron update.
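A sketch of this per-example update (the learning rate is an assumption of mine; a and x_i are numpy arrays, y_i a scalar):

```python
def online_least_squares_step(a, x_i, y_i, eta=0.01):
    """One online update: move a along the gradient of (y_i - a.x_i)^2 for a single example."""
    return a + eta * (y_i - a @ x_i) * x_i
```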

Singularity Issues
Solving the normal equations directly is not very efficient, since we need to invert a matrix. The solution is unique if X^T X is invertible. If X^T X is singular, we have an infinite number of solutions to the equations, all minimizing ||y − Xa||^2 equally well.

The Singular Case
[Figures illustrating the singular case.]

Risk of Overfitting
Say we have a very large number of variables (d is large). When the number of variables is large we usually have collinear variables, and therefore X^T X is singular. Even if X^T X is non-singular, there is a risk of overfitting: for instance, if d = n we can typically fit the training data exactly. Intuitively, we want a small number of variables to explain y.

Regularization
Let λ be a regularization parameter. Ideally, we would like to solve min_a ||y − Xa||^2 + λ·||a||_0, where ||a||_0 counts the non-zero coefficients of a. This is a hard problem (NP-hard).

Shrinkage Methods
Lasso regression: min_a ||y − Xa||^2 + λ·Σ_j |a_j|.
Ridge regression: min_a ||y − Xa||^2 + λ·Σ_j a_j^2.

Ridge Regression
Setting the gradient of the ridge objective to zero gives (X^T X + λI) a = X^T y, i.e. a = (X^T X + λI)^{-1} X^T y.
X^T X + λI is positive definite and therefore nonsingular, so the ridge solution always exists and is unique.
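A minimal numpy sketch of the ridge closed form (the helper name and the choice of λ are mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) a = X^T y; well-posed for any lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```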

Ridge Regression: Bayesian View
Model: y = Σ_j a_j x_j + ε with ε ~ N(0, σ^2), and a Gaussian prior on each coefficient, a_j ~ N(0, τ^2).
log Posterior(a | σ, τ, data) = −(1/(2σ^2))·Σ_{i=1}^n (y_i − a·x_i)^2 − (n/2)·log(2πσ^2) − (1/(2τ^2))·Σ_{i=1}^d a_i^2 − (d/2)·log(2πτ^2).
Maximizing the posterior is equivalent to Ridge with λ = σ^2/τ^2.
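To see the equivalence, drop the terms that do not depend on a and multiply through by 2σ^2:

```latex
\arg\max_a \; \log \mathrm{Posterior}(a \mid \sigma, \tau, \text{data})
  = \arg\min_a \; \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - a \cdot x_i)^2
      + \frac{1}{2\tau^2}\sum_{j=1}^{d} a_j^2
  = \arg\min_a \; \sum_{i=1}^{n} (y_i - a \cdot x_i)^2
      + \frac{\sigma^2}{\tau^2}\sum_{j=1}^{d} a_j^2 ,
```

which is exactly the ridge objective with λ = σ^2/τ^2.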

Lasso Regression
Objective: min_a ||y − Xa||^2 + λ·Σ_j |a_j|.
The above is equivalent to the following quadratic program: write a_j = a_j^+ − a_j^- with a_j^+, a_j^- ≥ 0, and minimize ||y − X(a^+ − a^-)||^2 + λ·Σ_j (a_j^+ + a_j^-) subject to a^+ ≥ 0, a^- ≥ 0.
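In practice the lasso is solved by specialized methods such as coordinate descent; a short sketch using scikit-learn (this library call is my addition, not part of the original slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
true_a = np.zeros(20)
true_a[:3] = [2.0, -1.0, 0.5]           # only 3 informative variables
y = X @ true_a + rng.normal(scale=0.1, size=100)

# Note: scikit-learn scales the squared-error term by 1/(2n),
# so alpha corresponds to lambda only up to that constant.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                       # most coefficients are driven exactly to 0
```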

Lasso Regression: Bayesian View
Prior on a: each a_i has a Laplace (double-exponential) prior, p(a_i) = (λ/(4σ^2))·exp(−λ·|a_i|/(2σ^2)).
log Posterior(a | Data, σ, λ) = −(1/(2σ^2))·Σ_{i=1}^n (y_i − a·x_i)^2 − (n/2)·log(2πσ^2) + Σ_{i=1}^d [ log(λ/(4σ^2)) − (λ/(2σ^2))·|a_i| ].
Maximizing the posterior is equivalent to Lasso with parameter λ.

Lasso vs. Ridge
[Figure: densities of the Laplace vs. Normal priors (mean 0, variance 1).]

An Equivalent Formulation
Lasso: min_a ||y − Xa||^2 subject to Σ_j |a_j| ≤ s.
Ridge: min_a ||y − Xa||^2 subject to Σ_j a_j^2 ≤ s.
Claim: for every λ in the penalized form there is an s in the constrained form (and vice versa) that produces the same solution.

Breaking Linearity
To fit a nonlinear function of the inputs, replace each x by a vector of nonlinear features (for example 1, x, x^2, ..., x^k). This can be solved using the usual linear regression by plugging the transformed feature matrix in place of X.
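A sketch of breaking linearity with a polynomial basis (the degree, helper name, and toy data are my choices):

```python
import numpy as np

def poly_design(x, degree):
    """Columns 1, x, x^2, ..., x^degree: linear regression in these features fits a polynomial."""
    return np.column_stack([x ** k for k in range(degree + 1)])

rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 80)
y = np.sin(2 * x) + rng.normal(scale=0.1, size=x.shape)

Phi = poly_design(x, degree=5)
a = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # ordinary least squares on the new features
y_hat = Phi @ a                               # nonlinear fit to the data
```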

Regression for Classification
Input X: real-valued vectors, discrete values (0, 1, 2, ...), or other structures (e.g., strings, graphs).
Output Y: discrete (0 or 1).
We treat the probability Pr(Y | X) as a linear function of X. Problem: Pr(Y | X) should be bounded in [0, 1].

Logistic Regression
Model: Pr(Y = 1 | x) = 1 / (1 + e^(−a·x)).
[Figure: the logistic (sigmoid) function, which maps the real line into (0, 1).]

Logistic Regression
Given training data, we can write down the log-likelihood: ℓ(a) = Σ_i [ y_i·(a·x_i) − log(1 + e^(a·x_i)) ]. The first term is linear in a and the second is concave, so ℓ is concave. There is a unique solution, and it can be found using gradient ascent (equivalently, gradient descent on the negative log-likelihood).
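A minimal numpy sketch of maximizing the log-likelihood by gradient ascent (the learning rate and iteration count are my choices); the gradient of ℓ is X^T(y − p), where p is the vector of predicted probabilities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, eta=0.01, n_iters=5000):
    """Maximize the concave log-likelihood of logistic regression by gradient ascent."""
    a = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ a)            # Pr(Y = 1 | x) under the current a
        grad = X.T @ (y - p)          # gradient of the log-likelihood
        a += eta * grad               # ascent step
    return a
```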