Chapter 6 The Structural Risk Minimization Principle

1 Chapter 6 The Structural Risk Minimization Principle Junping Zhang jpzhang@fudan.edu.cn Intelligent Information Processing Laboratory, Fudan University March 23, 2004

2 Objectives

3 Structural risk minimization

4 Two other induction principles

5 The Scheme of the SRM induction principle

6 Real-Valued functions

13 Principle of SRM

21 SRM

23 Minimum Description Length and SRM inductive principles: the idea about the nature of random phenomena; the Minimum Description Length principle for the pattern recognition problem; bounds for the MDL; SRM for the simplest model and MDL; the shortcoming of the MDL.

24 The idea about the Nature of Random Phenomena. Probability theory (1930s, Kolmogorov) is a theory of formal inference. Its axiomatization does not consider the nature of randomness; the axioms take probability measures as given.

25 The idea about the Nature of Random Phenomena. The model of randomness: Solomonoff (1965), Kolmogorov (1965), Chaitin (1966). Algorithmic (descriptive) complexity of an object is the length of the shortest binary computer program that describes it. Up to an additive constant, it does not depend on the type of computer, so it is a universal characteristic of the object.

26 A relatively large string describing an object is random if the algorithmic complexity of the object is high, that is, if the given description of the object cannot be compressed significantly. MML (Wallace and Boulton, 1968) and MDL (Rissanen, 1978) use algorithmic complexity as a main tool of inductive inference for learning machines.
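The incompressibility criterion above can be illustrated with a rough sketch. Algorithmic complexity itself is not computable, so the example below substitutes a general-purpose compressor (zlib) as a crude upper bound on description length; that substitution, and the names `compression_ratio`, `regular`, and `random_like`, are assumptions made for illustration and do not come from the slides.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed length divided by original length (zlib, level 9).

    This is only a computable stand-in for algorithmic complexity.
    """
    return len(zlib.compress(data, 9)) / len(data)


# A highly regular 10,000-byte string: a short program ("repeat 01") describes it.
regular = b"01" * 5000

# 10,000 bytes from the operating system's entropy source.
random_like = os.urandom(10000)

ratio_regular = compression_ratio(regular)
ratio_random = compression_ratio(random_like)

print(f"regular string:     ratio = {ratio_regular:.4f}")
print(f"random-like string: ratio = {ratio_random:.4f}")
```

The regular string compresses to a small fraction of its length, while the random-like string does not compress at all, which is the sense in which it counts as random here.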

27 Minimum Description Length Principle for the Pattern Recognition Problem. Given l pairs, each containing a vector x and a binary value ω, consider two strings: the binary string

28 Question. Q: Given (147), is the string (146) a random object? A: Analyze the complexity of the string (146) in the spirit of the Solomonoff-Kolmogorov-Chaitin ideas.

29 Compress its description. Since the ω_i, i = 1, ..., l, are binary values, the string (146) is described by l bits. Since the training pairs were drawn randomly and independently, the value ω_i depends on the vector x_i but not on the vector x_j (j ≠ i).

30 Model

34 General case: the code book does not contain the perfect table.

36 Randomness

37 Bounds for the MDL. Q: Does the compression coefficient K(T) determine the probability of the test error in classifying (decoding) vectors x by the table T? A: Yes.

38 Comparison between the MDL and ERM in the simplest model

41 SRM for the simplest Model and MDL

42 SRM for the simplest Model and MDL

44 The power of the compression coefficient. To obtain a bound on the probability of error, only information about the compression coefficient needs to be known.

45 The power of the compression coefficient. One does not need to know such details as: how many examples were used; how the structure of code books was organized; which code book was used and how many tables were in it; how many errors were made by the table from the code book that was used.

46 MDL principle: to minimize the probability of error, one has to minimize the coefficient of compression.
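A minimal numerical sketch of the link between the compression coefficient and the error bound follows. The slides carrying the formulas were lost in extraction, so the forms used here are assumptions: the description length is taken as the bits needed to name one of `num_tables` tables plus the bits needed to list which of the l labels the table gets wrong (the small extra terms for encoding the number of errors and the code book index are omitted), and the bound is taken in the form R(T) < 2(K(T) ln 2 - ln η / l), as commonly stated in Vapnik's treatment of MDL. The function names are hypothetical.

```python
import math


def compression_coefficient(num_tables: int, num_errors: int, l: int) -> float:
    """Approximate K(T): bits used to describe the label string, divided by l.

    Assumed form: ceil(log2 N) bits to name the table, plus
    ceil(log2 C(l, d)) bits to say which d labels must be corrected.
    """
    bits_table = math.ceil(math.log2(num_tables))
    bits_corrections = math.ceil(math.log2(math.comb(l, num_errors)))
    return (bits_table + bits_corrections) / l


def error_bound(k: float, l: int, eta: float) -> float:
    """Assumed bound on the error probability, holding with probability 1 - eta."""
    return 2.0 * (k * math.log(2.0) - math.log(eta) / l)


# Hypothetical numbers: 1000 examples, a code book of 1024 tables,
# and a table that reproduces every training label (no corrections).
K = compression_coefficient(num_tables=1024, num_errors=0, l=1000)
bound = error_bound(K, l=1000, eta=0.05)

print(f"K(T) = {K:.3f}, error bound = {bound:.4f}")
```

Under these assumptions only K(T), l, and η enter the bound, so a smaller compression coefficient gives a smaller guaranteed error.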

47 The shortcoming of the MDL. MDL uses code books with a finite number of tables. If the set of functions depends continuously on its parameters, one first has to quantize that set to make the tables.

48 Quantization. How do we make a smart quantization for a given number of observations? For a given set of functions, how can we construct a code book with a small number of tables but with good approximation ability?
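The quantization question can be made concrete with a toy sketch. It assumes a hypothetical one-parameter family of threshold rules, f(x) = 1 if x > t with t in [0, 1], and builds a code book by placing the threshold on a uniform grid; the data, the true threshold 0.37, and all function names are illustrative assumptions, not part of the slides.

```python
import math


def build_code_book(xs, num_levels):
    """Quantize the threshold t into num_levels grid values.

    Each grid value yields one table: the labels it assigns to xs.
    Thresholds that produce identical tables are merged.
    """
    thresholds = [(k + 0.5) / num_levels for k in range(num_levels)]
    tables = {tuple(int(x > t) for x in xs) for t in thresholds}
    return sorted(tables)


def best_table_errors(tables, labels):
    """Smallest number of label mismatches over all tables in the code book."""
    return min(sum(a != b for a, b in zip(table, labels)) for table in tables)


def index_bits(tables):
    """Bits needed to name one table in the code book."""
    return math.ceil(math.log2(len(tables)))


xs = [0.05, 0.2, 0.33, 0.41, 0.58, 0.62, 0.77, 0.9]
labels = [int(x > 0.37) for x in xs]  # hypothetical true threshold

coarse = build_code_book(xs, 2)   # few tables, cheap to index
fine = build_code_book(xs, 16)    # more tables, better approximation

for name, book in (("coarse", coarse), ("fine", fine)):
    print(name, "tables:", len(book), "index bits:", index_bits(book),
          "best errors:", best_table_errors(book, labels))
```

The coarse code book is cheap to index but cannot reproduce the labels, while the finer one can, at the cost of more bits per table index; a good quantization balances the two for the given number of observations.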

49 The shortcoming of the MDL. Finding a good quantization is extremely difficult, and this difficulty is the main shortcoming of the MDL principle. The MDL principle works well when the problem of constructing reasonable code books has a good solution.

50 Consistency of the SRM principle and asymptotic bounds on the rate of convergence Q: Is the SRM consistent? What is the bound on the (asymptotic) rate of convergence?

53 Consistency of the SRM principle.

54 Simplification version

59 Remark. To avoid choosing the minimum of the functional (156) over the infinite number of elements of the structure, an additional constraint is imposed: choose the minimum from the first l elements of the structure, where l is equal to the number of observations.

62 Discussions and Example

63 The rate of convergence is determined by two contradictory requirements on the rule n = n(l). The first summand: the larger n = n(l), the smaller the deviation. The second summand: the larger n = n(l), the larger the deviation. For structures with a known bound on the rate of approximation, select the rule that assures the largest rate of convergence.
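The trade-off between the two summands can be sketched numerically. The actual summands were on slides that did not survive extraction, so the forms below are assumptions chosen only to show the shape of the trade-off: an approximation term n^(-s) that shrinks as the element index n grows, and an estimation term sqrt(n ln l / l) that grows with n. The search is restricted to the first l elements of the structure, and the name `best_element` is hypothetical.

```python
import math


def best_element(l: int, smoothness: float = 1.0) -> int:
    """Pick n in 1..l minimizing an assumed two-term bound.

    approximation term: n ** (-smoothness)      (decreases with n)
    estimation term:    sqrt(n * ln(l) / l)     (increases with n)
    """
    def total(n: int) -> float:
        approximation = n ** (-smoothness)
        estimation = math.sqrt(n * math.log(l) / l)
        return approximation + estimation

    return min(range(1, l + 1), key=total)


for l in (100, 1000, 10000):
    print(f"l = {l:6d}  ->  n(l) = {best_element(l)}")
```

With these assumed terms the selected n(l) grows with the sample size, but much more slowly than l itself, which is the kind of rule the slide refers to.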

70 Bounds for the regression estimation problem

71 The model of regression estimation by series expansion

77 Example

80 The problem of approximating functions

91 To get a high asymptotic rate of approximation, the only constraint is that the kernel should be a bounded function that can be described as a family of functions possessing finite VC dimension.

92 Problem of local risk minimization

96 Local Risk Minimization Model

106 Note. Using local risk minimization methods, one probably does not need rich sets of approximating functions; the classical semi-local methods, for comparison, are based on using a set of constant functions.

107 Note. For local estimation of functions in the one-dimensional case, it is probably enough to consider elements S_k, k = 0, 1, 2, 3, containing the polynomials of degree 0, 1, 2, 3.
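A small sketch of local estimation with low-degree polynomials follows. It is an illustration under stated assumptions, not the local risk minimization method of the slides: the neighborhood is a hard window of fixed `radius` around the point of interest, the fit is ordinary least squares, and the data are a noise-free cubic. All names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def local_poly_estimate(xs, ys, x0, radius, degree):
    """Least-squares polynomial fit on points with |x - x0| <= radius.

    The polynomial is written in u = x - x0, so its value at x0
    is simply the constant coefficient.
    """
    pts = [(x - x0, y) for x, y in zip(xs, ys) if abs(x - x0) <= radius]
    if len(pts) <= degree:
        raise ValueError("not enough points in the neighborhood")
    k = degree + 1
    # Normal equations of the least-squares problem.
    a = [[sum(u ** (i + j) for u, _ in pts) for j in range(k)] for i in range(k)]
    b = [sum(y * u ** i for u, y in pts) for i in range(k)]
    return solve_linear(a, b)[0]


xs = [i / 20 for i in range(21)]
ys = [x ** 3 for x in xs]  # noise-free cubic, true value at 0.5 is 0.125

estimates = {d: local_poly_estimate(xs, ys, 0.5, 0.2, d) for d in range(4)}
for d, value in estimates.items():
    print(f"degree {d}: estimate at x0 = 0.5 is {value:.6f}")
```

On this toy data the degree-0 element (a local constant, as in the classical semi-local methods) is visibly biased, while the degree-3 element recovers the true value, which is consistent with a handful of low-degree elements being enough locally.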

108 Summary: MDL; SRM; local risk functional.
