STA141C: Big Data & High Performance Statistical Computing
Lecture 4: ML Models (Overview)
Cho-Jui Hsieh, UC Davis
April 17, 2017

Outline

Linear regression, ridge regression, logistic regression, and other finite-sum models.

Linear Regression

Regression

Input: training data $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \ldots, y_n \in \mathbb{R}$.
Training: compute a function $f$ such that $f(x_i) \approx y_i$ for all $i$.
Prediction: given a testing sample $\tilde{x}$, predict the output as $f(\tilde{x})$.

Examples: (income, number of children) → consumer spending; (processes, memory) → power consumption; financial reports → risk; atmospheric conditions → precipitation.

Linear Regression

Assume $f(\cdot)$ is a linear function parameterized by $w \in \mathbb{R}^d$: $f(x) = w^T x$.
Training: compute the model $w$ such that $w^T x_i \approx y_i$ for all $i$, which is equivalent to solving
  $w^* = \operatorname{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^T x_i - y_i)^2$.
Prediction: given a testing sample $\tilde{x}$, the predicted value is $(w^*)^T \tilde{x}$.

Linear Regression: probability interpretation

Assume the data is generated from the probability model
  $y_i = w^T x_i + \epsilon_i$, where $\epsilon_i \sim N(0, 1)$.
Maximum likelihood estimator:
  $w^* = \operatorname{argmax}_w \log P(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w)$
      $= \operatorname{argmax}_w \sum_{i=1}^n \log P(y_i \mid x_i, w)$
      $= \operatorname{argmax}_w \sum_{i=1}^n \log\!\Big(\tfrac{1}{\sqrt{2\pi}}\, e^{-(w^T x_i - y_i)^2/2}\Big)$
      $= \operatorname{argmax}_w \sum_{i=1}^n -\tfrac{1}{2}(w^T x_i - y_i)^2 + \text{constant}$
      $= \operatorname{argmin}_w \sum_{i=1}^n (w^T x_i - y_i)^2$.
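
As a quick sanity check of this derivation, the following NumPy sketch (data and names are purely illustrative, not from the slides) verifies that the Gaussian negative log-likelihood differs from half the sum of squared errors only by a constant independent of $w$, so both objectives share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def neg_log_likelihood(w):
    # -sum_i log( (1/sqrt(2*pi)) * exp(-(w^T x_i - y_i)^2 / 2) )
    r = X @ w - y
    return 0.5 * np.sum(r**2) + 0.5 * n * np.log(2 * np.pi)

def half_sse(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

w1, w2 = rng.normal(size=d), rng.normal(size=d)
# The gap is the same constant for any w, so minimizing either is equivalent.
print(neg_log_likelihood(w1) - half_sse(w1), neg_log_likelihood(w2) - half_sse(w2))
```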

Linear Regression: written in matrix form

Linear regression: $w^* = \operatorname{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^T x_i - y_i)^2$.
Matrix form: let $X \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is $x_i^T$, and let $y = [y_1, \ldots, y_n]^T$. Then linear regression can be written as
  $w^* = \operatorname{argmin}_{w \in \mathbb{R}^d} \|Xw - y\|_2^2$.
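
A tiny illustrative check (assuming NumPy; not part of the original slides) that the sum form and the matrix form of the objective agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 4)), rng.normal(size=20), rng.normal(size=4)

sum_form = sum((w @ x_i - y_i) ** 2 for x_i, y_i in zip(X, y))
matrix_form = np.linalg.norm(X @ w - y) ** 2

print(np.isclose(sum_form, matrix_form))  # True
```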

Solving Linear Regression

Minimize the sum-of-squared-errors objective $J(w)$:
  $J(w) = \tfrac{1}{2}\|Xw - y\|_2^2 = \tfrac{1}{2}(Xw - y)^T(Xw - y) = \tfrac{1}{2} w^T X^T X w - y^T X w + \tfrac{1}{2} y^T y$.
Derivative: $\nabla_w J(w) = X^T X w - X^T y$.
Setting the derivative to zero gives the normal equation
  $X^T X w = X^T y$.
Therefore $w^* = (X^T X)^{-1} X^T y$, but $X^T X$ may be non-invertible (we will return to this in a few classes).
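
A minimal NumPy sketch of two routes to this solution, assuming a dense, well-conditioned $X$ (the synthetic data and variable names are illustrative):

```python
import numpy as np

# Synthetic regression problem (sizes and noise level are illustrative)
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.1, size=n)

# Normal equation: solve X^T X w = X^T y
# (solving the linear system is preferred over forming the explicit inverse)
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# A more numerically robust route: least squares via SVD
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True: both recover the same w
```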

Regularized Linear Regression

Overfitting

Overfitting: the model has low training error but high prediction error. Using too many features can lead to overfitting.

Regularization to Avoid Overfitting

Enforce the solution to have a low L2 norm:
  $\operatorname{argmin}_w \sum_{i=1}^n (w^T x_i - y_i)^2$  subject to  $\|w\|_2 \le K$.
This is equivalent to the following problem for some $\lambda$:
  $\operatorname{argmin}_w \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda \|w\|_2^2$.

Regularized Linear Regression

Regularized linear regression, with regularization term $R(w)$:
  $\operatorname{argmin}_w \|Xw - y\|^2 + R(w)$.
Ridge regression ($\ell_2$ regularization):
  $\operatorname{argmin}_w \|Xw - y\|^2 + \lambda \|w\|^2$.

Ridge Regression

Ridge regression:
  $\operatorname{argmin}_{w \in \mathbb{R}^d} \underbrace{\tfrac{1}{2}\|Xw - y\|^2 + \tfrac{\lambda}{2}\|w\|^2}_{J(w)}$.
Closed-form solution: the optimal solution $w^*$ satisfies $\nabla J(w^*) = 0$:
  $X^T X w - X^T y + \lambda w = 0 \;\Longrightarrow\; (X^T X + \lambda I)\, w = X^T y$.
Optimal solution: $w^* = (X^T X + \lambda I)^{-1} X^T y$.
The inverse always exists because $X^T X + \lambda I$ is positive definite.
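
A minimal NumPy sketch of the closed-form ridge solution above (the function name and synthetic data are our own, for illustration):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the regularized normal equation rather than forming the inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example usage with synthetic data (sizes are illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=100)
w_ridge = ridge_closed_form(X, y, lam=1.0)
```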

Time Complexity

When $X$ is dense, the closed-form solution requires $O(nd^2 + d^3)$ time: efficient if $d$ is very small, but impractical when $d > 100{,}000$.
Typical case for big-data applications: $X \in \mathbb{R}^{n \times d}$ is sparse, with large $n$ and large $d$.
How can we solve the problem then? Iterative algorithms for optimization (next class).
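
As a preview of the iterative approach (the subject of the next class), here is a minimal gradient-descent sketch for the ridge objective; the fixed step size and iteration count are arbitrary illustrative choices and would need tuning in practice:

```python
import numpy as np

def ridge_gd(X, y, lam, step=1e-3, iters=500):
    """Gradient descent on J(w) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2.

    Each iteration only needs matrix-vector products with X and X^T,
    so X^T X is never materialized -- the key advantage for large sparse X.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ w - y) + lam * w
        w -= step * grad
    return w
```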

Logistic Regression

Binary Classification

Input: training data $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \ldots, y_n \in \{+1, -1\}$.
Training: compute a function $f$ such that $\operatorname{sign}(f(x_i)) \approx y_i$ for all $i$.
Prediction: given a testing sample $\tilde{x}$, predict the label as $\operatorname{sign}(f(\tilde{x}))$.

Logistic Regression

Assume a linear scoring function: $s = f(x) = w^T x$.
Logistic hypothesis:
  $P(y = 1 \mid x) = \theta(w^T x)$, where $\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$.
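
A small NumPy sketch of the logistic function $\theta$ (the numerically careful implementation below is our own detail, not from the slides), written to avoid overflow for large negative scores:

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}).

    s: array of scores.
    """
    s = np.asarray(s, dtype=float)
    out = np.empty_like(s)
    pos = s >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-s[pos]))
    exp_s = np.exp(s[~pos])          # safe: s < 0 here, so exp(s) < 1
    out[~pos] = exp_s / (1.0 + exp_s)
    return out
```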

Error measure: likelihood

Likelihood of $D = (x_1, y_1), \ldots, (x_N, y_N)$: $\prod_{n=1}^N P(y_n \mid x_n)$.
  $P(y \mid x) = \theta(w^T x)$ for $y = +1$;  $P(y \mid x) = 1 - \theta(w^T x) = \theta(-w^T x)$ for $y = -1$.
Compactly, $P(y \mid x) = \theta(y\, w^T x)$, so the likelihood is
  $\prod_{n=1}^N P(y_n \mid x_n) = \prod_{n=1}^N \theta(y_n w^T x_n)$.

Maximizing the likelihood

Find $w$ to maximize the likelihood:
  $\max_w \prod_{n=1}^N \theta(y_n w^T x_n)$
  $\Longleftrightarrow\; \max_w \log \prod_{n=1}^N \theta(y_n w^T x_n)$
  $\Longleftrightarrow\; \min_w -\sum_{n=1}^N \log \theta(y_n w^T x_n)$
  $\Longleftrightarrow\; \min_w \sum_{n=1}^N \log\!\Big(\frac{1}{\theta(y_n w^T x_n)}\Big)$
  $\Longleftrightarrow\; \min_w \sum_{n=1}^N \log\big(1 + e^{-y_n w^T x_n}\big)$.
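
The resulting objective, the logistic loss, can be evaluated stably with `np.logaddexp`; a minimal sketch (the function name is ours):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Negative log-likelihood sum_n log(1 + exp(-y_n * w^T x_n)).

    X: (N, d) feature matrix; y: (N,) labels in {+1, -1}.
    """
    margins = y * (X @ w)
    # log(1 + exp(-m)) evaluated stably as logaddexp(0, -m)
    return np.sum(np.logaddexp(0.0, -margins))
```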

Empirical Risk Minimization (linear)

Most linear ML algorithms follow the form
  $\min_w \frac{1}{N} \sum_{n=1}^N \operatorname{loss}(w^T x_n, y_n)$.
Linear regression: $\operatorname{loss}(w^T x_n, y_n) = (w^T x_n - y_n)^2$.
Logistic regression: $\operatorname{loss}(w^T x_n, y_n) = \log(1 + e^{-y_n w^T x_n})$.
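
Both losses fit the same template, so a generic optimizer can minimize either empirical risk; as an illustration (assuming SciPy is available; this is not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=100))  # labels in {+1, -1}

def logistic_risk(w):
    # (1/N) * sum_n log(1 + exp(-y_n * w^T x_n))
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

w_hat = minimize(logistic_risk, x0=np.zeros(X.shape[1])).x
```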

Empirical Risk Minimization (general)

Assume $f_W(x)$ is the decision function to be learned, where $W$ collects the parameters of the function.
General empirical risk minimization:
  $\min_W \frac{1}{N} \sum_{n=1}^N \operatorname{loss}(f_W(x_n), y_n)$.
Example: a neural network, where $f_W(\cdot)$ is the network.
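
A minimal sketch of the general form: the empirical risk is an average of per-example losses of a pluggable decision function $f_W$ (the helper names below are illustrative):

```python
import numpy as np

def empirical_risk(f, W, loss, X, y):
    """(1/N) * sum_n loss(f(W, x_n), y_n), where X has shape (N, d)."""
    return np.mean([loss(f(W, x_n), y_n) for x_n, y_n in zip(X, y)])

# The linear models from this lecture as special cases:
linear_f = lambda w, x: w @ x
squared_loss = lambda p, t: (p - t) ** 2                # linear regression
logistic_loss = lambda p, t: np.logaddexp(0.0, -t * p)  # logistic regression
```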

Coming up: optimization.

Questions?