Chapter 6 The Structural Risk Minimization Principle

Junping Zhang (jpzhang@fudan.edu.cn)
Intelligent Information Processing Laboratory, Fudan University
March 23, 2004

Objectives

Structural risk minimization

Two other induction principles

The Scheme of the SRM induction principle

Real-Valued functions

Principle of SRM

SRM

Minimum Description Length and SRM inductive principles
- The idea about the Nature of Random Phenomena
- Minimum Description Length Principle for the Pattern Recognition Problem
- Bounds for the MDL
- SRM for the simplest Model and MDL
- The Shortcoming of the MDL

The idea about the Nature of Random Phenomena: Probability theory (1930s, Kolmogorov) provides formal inference. Its axiomatization did not consider the nature of randomness; the axioms simply take the probability measures as given.

The idea about the Nature of Random Phenomena: The model of randomness is due to Solomonoff (1965), Kolmogorov (1965), and Chaitin (1966). Algorithmic (descriptive) complexity is the length of the shortest binary computer program that describes the object. Up to an additive constant it does not depend on the type of computer, so it is a universal characteristic of the object.
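
As a brief formal aside (standard notation, assumed here rather than taken from the slides): for a universal computer U, the algorithmic complexity of a binary string x is

    K_U(x) = \min \{\, |p| : U(p) = x \,\},

the length of the shortest program p that prints x. The invariance theorem states that for any other computer V there is a constant c_{UV}, independent of x, with K_U(x) \le K_V(x) + c_{UV}, which is why the complexity is universal up to an additive constant.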

A relatively long string describing an object is considered random if the algorithmic complexity of the object is high, that is, if the given description of the object cannot be compressed significantly. MML (Wallace and Boulton, 1968) and MDL (Rissanen, 1978) made algorithmic complexity a main tool of inductive inference for learning machines.
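
In the same spirit (a standard formulation, not quoted from the slides), a binary string x of length l is called c-incompressible, and hence random, if

    K(x) \ge l - c

for a small constant c; conversely, a string is non-random exactly when it admits a description much shorter than l, and this compressibility is what MML and MDL exploit for inference.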

Minimum Description Length Principle for the Pattern Recognition Problem: Given l training pairs, each containing a vector x_i and a binary value ω_i, consider two strings: the binary string ω_1, ..., ω_l (146) and the string of vectors x_1, ..., x_l (147).

Question Q: Given (147), is the string (146) a random object? A: Analyze the complexity of the string (146) in the spirit of the Solomonoff-Kolmogorov-Chaitin ideas.

Compress its description: Since ω_i, i = 1, ..., l, are binary values, the string (146) can be described by l bits. Since the training pairs were drawn randomly and independently, the value ω_i depends on the vector x_i but not on the vector x_j, j ≠ i.
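
A sketch of the compression argument in the simplest case (the bookkeeping below is an assumption in the spirit of the argument, not quoted from the text): suppose a code book contains N fixed tables T_1, ..., T_N and one table T reproduces all the labels, ω_i = T(x_i), i = 1, ..., l. Then, given (147) and the code book, the string (146) is described by the index of T alone, i.e. by about

    \lceil \log_2 N \rceil

bits instead of l bits, giving the compression coefficient K(T) = \lceil \log_2 N \rceil / l.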

Model

General Case: the code book does not contain a perfect table.
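
Under the same assumptions, if the best table T makes d errors on the training string, one must additionally encode which of the l positions are in error, costing about \lceil \log_2 \binom{l}{d} \rceil bits, plus a few bits (written \Delta) to encode d itself, so that

    K(T) = \frac{ \lceil \log_2 N \rceil + \lceil \log_2 \binom{l}{d} \rceil + \Delta }{ l }.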

Randomness

Bounds for the MDL. Q: Does the compression coefficient K(T) determine the probability of the test error in classifying (decoding) vectors x by the table T? A: Yes.
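
To see why such a bound is plausible, here is the standard finite-class argument (a sketch, not the exact bound of the text): if a table T chosen from a code book of N tables makes no errors on l independently drawn examples, then with probability at least 1 - \eta its true error satisfies

    p(T) \le \frac{ \ln N + \ln(1/\eta) }{ l } \approx (\ln 2)\, K(T) + \frac{ \ln(1/\eta) }{ l },

so a small compression coefficient directly yields a small guaranteed test error; the bound in the text has the same character and also covers tables that make training errors.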

Comparison between the MDL and ERM in the simplest model

SRM for the simplest Model and MDL

The power of the compression coefficient: to obtain a bound for the probability of error, only information about this coefficient needs to be known.

The power of the compression coefficient: one does not need to know
- how many examples we used;
- how the structure of the code books was organized;
- which code book was used and how many tables were in that code book;
- how many errors were made by the table from the code book we used.

MDL principle: to minimize the probability of error, one has to minimize the coefficient of compression.
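
A minimal sketch of this selection rule in Python (the two-part encoding scheme and the helper names are illustrative assumptions, not the construction used in the text):

    from math import ceil, comb, log2

    def compression_coefficient(num_tables, num_errors, l):
        """Bits for a two-part description of the l training labels
        (table index + error count + error positions), divided by the
        l bits of the uncompressed label string."""
        index_bits = ceil(log2(num_tables))                    # which table
        count_bits = ceil(log2(l + 1))                         # how many errors
        error_bits = ceil(log2(comb(l, num_errors))) if num_errors else 0
        return (index_bits + count_bits + error_bits) / l

    def mdl_select(code_book, xs, labels):
        """Choose the table that minimizes the compression coefficient K(T)."""
        l, n = len(labels), len(code_book)
        best = None
        for table in code_book:
            errors = sum(table(x) != y for x, y in zip(xs, labels))
            k = compression_coefficient(n, errors, l)
            if best is None or k < best[0]:
                best = (k, table)
        return best  # (K(T), selected table)

For example, with a code book of 1024 tables and a table making no errors on l = 1000 examples, this gives K(T) = (10 + 10)/1000 = 0.02, i.e. a fifty-fold compression.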

The shortcoming of the MDL: MDL uses code books with a finite number of tables. If the set of functions depends continuously on parameters, one has to first quantize that set to make the tables.

Quantization: How do we make a smart quantization for a given number of observations? For a given set of functions, how can we construct a code book with a small number of tables but with good approximation ability?

The shortcoming of the MDL: Finding a good quantization is extremely difficult, and this is the main shortcoming of the MDL principle. The MDL principle works well when the problem of constructing reasonable code books has a good solution.
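
To make the quantization issue concrete, a toy illustration (an invented example, not from the text): the continuous family of one-dimensional threshold classifiers f_θ(x) = 1[x > θ] is turned into a finite code book by keeping only N evenly spaced thresholds, after which the MDL selection sketched above can be applied.

    import numpy as np

    def threshold_code_book(low, high, n_tables):
        """Quantize the continuous family f_theta(x) = 1[x > theta]
        into n_tables fixed classification tables."""
        thetas = np.linspace(low, high, n_tables)
        return [(lambda x, t=t: int(x > t)) for t in thetas]

    # A coarse code book needs fewer bits per table index but approximates
    # the best threshold poorly; a fine one does the opposite. Choosing the
    # resolution well is exactly the difficulty described above.
    code_book = threshold_code_book(0.0, 1.0, 64)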

Consistency of the SRM principle and asymptotic bounds on the rate of convergence Q: Is the SRM consistent? What is the bound on the (asymptotic) rate of convergence?

Consistency of the SRM principle.

Simplified version

Remark: To avoid choosing the minimum of functional (156) over an infinite number of elements of the structure, an additional constraint is imposed: choose the minimum over the first l elements of the structure, where l is equal to the number of observations.
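
For orientation (this is the generic VC-type form such a functional takes; the exact expression (156) in the text may differ in its constants), SRM chooses, among the first l elements S_1 \subset ... \subset S_l of the structure, the element S_n and the function in it minimizing a guaranteed risk of the form

    R(\alpha) \le R_{emp}(\alpha) + \sqrt{ \frac{ h_n \left( \ln \frac{2l}{h_n} + 1 \right) - \ln \frac{\eta}{4} }{ l } },

where h_n is the VC dimension of S_n: the empirical term favors rich elements, while the confidence term penalizes them.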

Discussions and Example

The rate of convergence is determined by two contradictory requirements on the rule n = n(l). The first summand: the larger n = n(l), the smaller the deviation. The second summand: the larger n = n(l), the larger the deviation. For structures with a known bound on the rate of approximation, select the rule n = n(l) that assures the largest rate of convergence.
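
Schematically (a shorthand for the trade-off, not formula-for-formula from the text), the guaranteed risk of the element chosen by the rule n = n(l) behaves like

    V(l) \approx r_{n(l)} + \sqrt{ \frac{ h_{n(l)} \ln l }{ l } },

where r_n is the approximation error of element S_n (decreasing in n) and h_n its VC dimension (increasing in n); the best rule n(l) balances the two terms.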

Bounds for the regression estimation problem

The model of regression estimation by series expansion

Example

The problem of approximating functions

To get a high asymptotic rate of approximation, the only constraint is that the kernel should be a bounded function that can be described as a family of functions possessing finite VC dimension.

Problem of local risk minimization

Local Risk Minimization Model

Note: Using local risk minimization methods, one probably does not need rich sets of approximating functions, whereas the classical semi-local methods are based on using a set of constant functions.

Note: For local estimation of functions in the one-dimensional case, it is probably enough to consider elements S_k, k = 0, 1, 2, 3, containing the polynomials of degree 0, 1, 2, 3.
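
As an illustration of this remark (a toy sketch with an assumed Gaussian locality weighting, not the scheme of the text), local risk minimization in one dimension can be written as a locally weighted least-squares fit of a polynomial of degree k \le 3 around the query point:

    import numpy as np

    def local_poly_estimate(x0, xs, ys, degree=1, width=0.2):
        """Estimate f(x0) by minimizing a locally weighted squared risk
        over polynomials of degree <= 3 centred at x0."""
        assert 0 <= degree <= 3
        w = np.exp(-0.5 * ((xs - x0) / width) ** 2)   # locality weights
        sw = np.sqrt(w)
        V = np.vander(xs - x0, degree + 1)            # polynomial features
        coef, *_ = np.linalg.lstsq(V * sw[:, None], ys * sw, rcond=None)
        return coef[-1]                               # fitted value at x0

With degree = 0 this reduces to a locally constant (semi-local) estimate, while degrees 1 to 3 correspond to the elements S_1, S_2, S_3 of the remark.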

Summary MDL SRM Local Risk Functional