Statistical Learning Reading Assignments

- S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (Chapt. 3, hard copy).
- T. Evgeniou, M. Pontil, and T. Poggio, "Statistical Learning Theory: A Primer", International Journal of Computer Vision, vol. 38, no. 1, pp. 9-13, 2000 (on-line).

Statistical Learning Assumptions
- We consider features extracted from images to form probabilistic observation feature vectors (e.g., features extracted from face images).
- Their semantic interpretations or labels (e.g., the identity associated with a face) are also subject to uncertainty.
- The relationship between the observations and their labels can be modelled probabilistically.
- Such models can be estimated through statistical learning.

What is the goal of statistical learning?
- To estimate an unknown inference function (i.e., a model relating observations to their labels) from a finite, often sparse, set of observations.

Example: suppose x ∈ X is a random observation vector (e.g., a face) and y ∈ Y is the interpretation of x (e.g., an identity).
* The probability of observing the pair (x, y) is given by P(x, y) = P(y|x)P(x).
* If P(y|x) is known (i.e., the inference function), then an interpretation y can be inferred for any given observation x.
* It is possible to estimate the inference function from a set of observations and their labels through inductive learning.
- Once learning has been performed, we can predict the interpretations of novel observations (as sketched below).
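A toy illustration of this setup (my own sketch, not from the readings): synthetic labelled observations are generated, a simple model of P(y|x) is fitted, and a novel observation is interpreted. Logistic regression and scikit-learn are assumptions made only for illustration; any estimator of P(y|x) would do.

    # Hedged sketch: estimate P(y|x) from labelled pairs (x_m, y_m) and
    # interpret a novel observation x. Logistic regression is one convenient
    # stand-in for "an inference function", not the method of the readings.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # Two classes of 2-D "feature vectors" with different means.
    x0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(100, 2))
    x1 = rng.normal(loc=[+1.0, +1.0], scale=1.0, size=(100, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * 100 + [1] * 100)

    model = LogisticRegression().fit(X, y)        # inductive learning step

    x_new = np.array([[0.5, 0.8]])                # a novel observation
    print("P(y|x_new) =", model.predict_proba(x_new))
    print("predicted label:", model.predict(x_new))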

Learning as function approximation
- Learning machine: an algorithmic implementation of some family of functions f(x, a), where a is a parameter vector.
- Loss L(y, f(x, a)): a function measuring the discrepancy between y and the approximation f(x, a) found by the learning machine.
- Risk R(a): the loss integrated over the joint probability P(x, y):
  R(a) = ∫ L(y, f(x, a)) dP(x, y)
- Learning aims to find a function f(x, a_0) that minimizes the risk.
- In this context, inductive learning can be defined as a problem of function approximation by risk minimization.

Examples of loss functions (both are illustrated in the sketch below)
- Squared loss, L(y, f(x, a)) = [y − f(x, a)]^2, corresponds to supervised learning; the function to be approximated in this case is E[y|x] (the problem is called regression if y is continuous, classification if y is discrete).
- Negative log-likelihood, L(y, f(x, a)) = −log p(x, a), corresponds to unsupervised learning; the function to be approximated in this case is the pdf of x.
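To make the two losses concrete, here is a small numpy sketch. The model choices are assumptions for illustration only: a linear model f(x, a) = a·x for the supervised case and a univariate Gaussian with parameters a = (mu, sigma) for the unsupervised case.

    import numpy as np

    # Squared loss for a supervised (regression) model f(x, a) = a * x.
    def squared_loss(y, x, a):
        return (y - a * x) ** 2

    # Negative log-likelihood loss for an unsupervised density model
    # p(x, a) = Gaussian with parameters a = (mu, sigma).
    def nll_loss(x, a):
        mu, sigma = a
        return 0.5 * np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / (2 * sigma**2)

    x, y = 2.0, 3.5
    print("squared loss:", squared_loss(y, x, a=1.5))         # [y - f(x,a)]^2
    print("neg. log-likelihood:", nll_loss(x, a=(0.0, 1.0)))  # -log p(x,a)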

Empirical risk minimization
- The expected risk cannot be minimized directly because P(x, y) is not known.
- One approach is to approximate the risk R by the empirical risk R_emp:
  R_emp(a) = (1/M) Σ_{m=1}^{M} L(y_m, f(x_m, a))
- The least-squares method minimizes the following empirical risk:
  R_emp(a) = (1/M) Σ_{m=1}^{M} [y_m − f(x_m, a)]^2
  (i.e., L(y_m, f(x_m, a)) = [y_m − f(x_m, a)]^2)
- The maximum likelihood approach minimizes the following empirical risk:
  R_emp(a) = −(1/M) Σ_{m=1}^{M} log p(x_m, a)
  (i.e., L(y_m, f(x_m, a)) = −log p(x_m, a))
- Both criteria are implemented in the sketch after this slide.

Capacity of learned functions
- Minimization of the empirical risk does not guarantee good generalization.
- To guarantee an upper bound on the generalization error, statistical learning theory says that the capacity of the learned functions must be controlled.
- Functions with large capacity are able to represent many dichotomies.
- We need functions whose capacity can be computed (a frequently used measure of capacity is the Vapnik-Chervonenkis (VC) dimension).
- Support Vector Machines: functions of appropriate capacity are chosen using structural risk minimization.
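The following numpy sketch (an illustration, not from the readings) minimizes both empirical risks in closed form for very simple assumed model families: a linear regressor f(x, a) = a1*x + a0 for the least-squares criterion, and a univariate Gaussian p(x, a) for the maximum-likelihood criterion.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, size=200)
    y = 1.5 * x + 0.5 + rng.normal(scale=0.3, size=200)   # noisy linear data

    # Least squares: a = argmin (1/M) sum [y_m - (a1*x_m + a0)]^2,
    # solved in closed form from the design matrix [x, 1].
    A = np.column_stack([x, np.ones_like(x)])
    a_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
    print("least-squares parameters (a1, a0):", a_ls)

    # Maximum likelihood for a Gaussian p(x | mu, sigma): minimizing
    # -(1/M) sum log p(x_m) gives the sample mean and (biased) standard
    # deviation in closed form.
    mu_ml, sigma_ml = x.mean(), x.std()
    print("ML Gaussian parameters (mu, sigma):", mu_ml, sigma_ml)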

Constructing a learning machine
(1) Loss function: regression, classification, or pdf.
(2) Family of functions: multi-layer neural nets, radial basis functions, linear discriminant functions, Gaussian density functions, mixtures of densities.
(3) Learning algorithm: minimize some approximation of the expected risk.

Bias-variance dilemma
- Simple parametric functions with only a few parameters introduce a strong bias.
- Functions with many parameters are more powerful but introduce high variance.

Learning as density estimation
- Bayesian inference is based on estimating the density functions and applying Bayes' rule:
  P(y|x) = P(x|y)P(y) / P(x)
- This is the most general and difficult type of learning problem (a minimal sketch follows).
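As a hedged illustration of learning as density estimation (my own toy example, not from the readings), the sketch below fits a one-dimensional Gaussian P(x|y) per class, estimates the priors P(y) from class frequencies, and classifies a new point with Bayes' rule.

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        """Density of a 1-D Gaussian, used as the estimate of P(x|y)."""
        return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(0)
    x0 = rng.normal(-1.0, 1.0, size=300)   # observations with label y = 0
    x1 = rng.normal(+2.0, 1.0, size=100)   # observations with label y = 1

    # Estimate the class-conditional densities P(x|y) and the priors P(y).
    params = {0: (x0.mean(), x0.std()), 1: (x1.mean(), x1.std())}
    priors = {0: len(x0) / (len(x0) + len(x1)), 1: len(x1) / (len(x0) + len(x1))}

    # Bayes' rule: P(y|x) is proportional to P(x|y) P(y).
    x_new = 0.7
    post = {y: gauss_pdf(x_new, *params[y]) * priors[y] for y in (0, 1)}
    z = sum(post.values())                  # the normalizer P(x)
    print({y: p / z for y, p in post.items()})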

Linear classification and regression
Linear functions
- Linear functions can be written in the form f(x) = (w · x) + b.
- When N = 2, this equation defines a line (a plane or hyperplane when N > 2).
  [Figure 3.4, Gong]
- The position and orientation of the line are changed by adjusting (w, b), where w is the normal to the line.
- For a two-class problem, a linear classification function has the form f(x) = sign((w · x) + b), as in the sketch below.
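A minimal sketch of this decision rule; the weights here are arbitrary values chosen for illustration, not learned from data.

    import numpy as np

    def linear_classifier(x, w, b):
        """Two-class linear decision function f(x) = sign((w . x) + b)."""
        return np.sign(np.dot(w, x) + b)

    w = np.array([1.0, -2.0])   # normal to the separating line
    b = 0.5                     # offset of the line from the origin
    print(linear_classifier(np.array([2.0, 0.5]), w, b))    # -> 1.0
    print(linear_classifier(np.array([-1.0, 1.0]), w, b))   # -> -1.0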

Learning using linear functions
Using SVD
- The classical least-squares method can be used to determine the parameters (w, b).
- This is done by minimizing the empirical risk:
  R_emp(a) = (1/M) Σ_{m=1}^{M} [y_m − f(x_m, a)]^2
- The singular value decomposition (SVD) can be used in this case to determine the parameters (see the sketch after this slide).

Using SVM (Support Vector Machines)
- There are usually many hyperplanes that yield low empirical risk on a given data set.
- We need a method for choosing a hyperplane that also yields low generalization error.
- Statistical learning theory says that the optimal line/hyperplane is the one giving the largest margin of separation between the classes.
- SVM implements the optimal hyperplane.

Using NNs (Neural Networks)
- We will consider the multilayer network trained by the backpropagation rule.
- This model minimizes the empirical risk.
- It can construct the decision boundary directly from the training data, without estimating the probability density of each class.
- The decision boundary can be highly non-linear.
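A hedged sketch of the SVD-based least-squares fit of (w, b) on synthetic two-class data with labels y in {-1, +1}. For comparison, the last two lines fit a maximum-margin hyperplane with scikit-learn's LinearSVC; both the data and the use of scikit-learn are assumptions for illustration, not the procedure from the readings.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
    y = np.array([-1.0] * 50 + [+1.0] * 50)

    # Least squares via the SVD: stack [x, 1] and apply the pseudo-inverse,
    # which numpy computes from the singular value decomposition.
    A = np.column_stack([X, np.ones(len(X))])
    w_b = np.linalg.pinv(A) @ y
    w, b = w_b[:2], w_b[2]
    print("least-squares hyperplane:", w, b)

    # A maximum-margin hyperplane for the same data (soft-margin linear SVM).
    svm = LinearSVC(C=1.0).fit(X, y)
    print("SVM hyperplane:", svm.coef_[0], svm.intercept_[0])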

Nonlinear classification and regression
- Successful inference is not always possible using only linear functions.
- We can obtain families of non-linear functions by transforming the input x with a set of non-linear functions φ_k(x), called basis functions:
  f(x, a) = Σ_{k=1}^{K} w_k φ_k(x) + b
  (a denotes the parameters in the above expression)
- The basis functions together perform a non-linear mapping
  Φ: R^N → R^K, where Φ(x) = [φ_1(x), φ_2(x), ..., φ_K(x)]^T.
- Certain forms of basis functions allow any continuous function to be approximated to arbitrary accuracy using f(x, a).
- Multilayer neural networks (e.g., back-propagation networks, radial basis function networks) are examples of models for estimating the parameters a.
- Support Vector Machines can also be generalized to handle non-linear regression/classification.
- A basis-function expansion is sketched below.
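To illustrate the basis-function idea, the following sketch expands a scalar input with K Gaussian (radial basis) functions and then fits the weights w_k and b by least squares, exactly as on the previous slide. The centers, width, and sine-shaped target are arbitrary choices made only for this example.

    import numpy as np

    def rbf_features(x, centers, width=0.5):
        """Map scalar inputs x to K Gaussian basis functions phi_k(x)."""
        return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width**2))

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200)
    y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)   # nonlinear target

    centers = np.linspace(-3, 3, 10)                      # K = 10 basis functions
    Phi = np.column_stack([rbf_features(x, centers), np.ones_like(x)])

    # Linear least squares in the transformed space gives (w_1..w_K, b).
    params, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    y_hat = Phi @ params
    print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))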