MAP METHODS FOR MACHINE LEARNING. Mark A. Kon (Boston University), Leszek Plaskota (Warsaw University), Andrzej Przybyszewski (McGill University)

1 MAP METHODS FOR MACHINE LEARNING. Mark A. Kon (Boston University), Leszek Plaskota (Warsaw University), Andrzej Przybyszewski (McGill University)

2 Example: Control System

1. Paradigm: machine learning of an unknown input-output function.

Example: a control system. We seek the relationship between inputs and outputs in an industrial chemical mixture.

3 Example: Control System

Given: We control the chemical mixture parameters, e.g.:
• temperature $= x_1$
• ambient humidity $= x_2$, along with other non-chemical parameters $x_3, x_4, x_5$
• proportions of various chemical components $= x_6, \dots, x_{20}$

Goal: Control the output variable $y$ = ratio of strength to brittleness of the resulting plastic.

We want a machine which predicts $y$ from $\mathbf{x} = (x_1, \dots, x_{20})$, based on data from a finite number of experimental runs of the equipment:
$$y = f(x_1, \dots, x_{20}) + \varepsilon = f(\mathbf{x}) + \varepsilon$$
$\Rightarrow$ we want the "best" $f$, i.e., one with minimal error $\varepsilon$.
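
To make the setup concrete, here is a minimal simulation sketch of the data model $y = f(\mathbf{x}) + \varepsilon$ in Python. The response function f_true, the noise level, and all numeric choices are hypothetical stand-ins, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20                            # 50 experimental runs, 20 controllable parameters
X = rng.uniform(0.0, 1.0, size=(n, d))   # rows are x = (x_1, ..., x_20)

def f_true(X):
    # Hypothetical stand-in for the unknown strength-to-brittleness response.
    return np.sin(X[:, 0]) + X[:, 5] * X[:, 6]

sigma = 0.05                             # standard deviation of the Gaussian error
y = f_true(X) + sigma * rng.normal(size=n)   # y_i = f(x_i) + eps_i
```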

4 MAPN approach

2. Solution: MAPN (MAP for Nonparametric machine learning).

Maximum a posteriori (MAP) methods are common for parametric statistical problems (see below). MAPN extends these methods to nonparametric problems.
• The method is simple and intuitively appealing (even in high dimension).
• Incorporation of prior knowledge is explicit and transparent.
• Results of the method often coincide with those of standard methods, e.g., statistical learning theory (SLT) and information-based complexity (IBC).

5 MAPN approach

Bayesian machine learning strategy: use prior knowledge about $f$ to choose a reasonable probability distribution $dP(f)$ on the set $F$ of possible $f$. Then combine this prior knowledge with the experimental data.

6 MAPN approach

Model: The experimental data $y_i$ satisfy $y_i = f(\mathbf{x}_i) + \varepsilon_i$, with $\varepsilon_i$ = Gaussian random error. Equivalently,
$$\mathbf{y} = N(f) + \boldsymbol{\varepsilon},$$
where $\mathbf{y} = (y_1, \dots, y_n)$ and
$$N(f) = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)) = \text{the information vector}.$$
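
The same model written through the information operator $N$, as a one-to-one sketch (the placeholder $f$ and the design points are again assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(8, 20))                 # design points x_1, ..., x_n (n = 8)

def N(f, X):
    """Information operator: N(f) = (f(x_1), ..., f(x_n))."""
    return np.array([f(x) for x in X])

f = lambda x: np.sin(x[0]) + x[5] * x[6]      # hypothetical placeholder for the unknown f
eps = 0.05 * rng.normal(size=len(X))          # Gaussian random error
y = N(f, X) + eps                             # y = N(f) + eps
```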

7 MAPN approach

Standard strategy: compute the conditional expectation
$$E(f \mid Nf = \mathbf{y}) \qquad (*)$$
as the best guess of $f$.

Difficulties of the strategy:
• we have an "infinite dimensional" parameter space $F$;
• it is difficult to determine a "reasonable" a priori probability measure $P$;
• the expectation (*) above is hard to compute.

8 Sidebar: Parametric Statistics

3. Sidebar: maximum a posteriori (MAP) solutions yield a useful strategy in parametric statistics (finite dimension). The finite dimensional method involves writing the probability measure $P$ for the unknown parameter $\mathbf{z}$ as
$$dP(\mathbf{z}) = \rho(\mathbf{z})\, d\mathbf{z},$$
where $d\mathbf{z}$ is the "uniform distribution" on $\mathbf{z}$ and $\rho(\mathbf{z})$ is the density function of $\mathbf{z}$. The MAP procedure takes
$$\hat{\mathbf{z}} = \text{the maximizer of } \rho(\mathbf{z})$$
as the best estimate (analogously to maximum likelihood estimation).
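
A minimal numerical sketch of the parametric MAP procedure for the toy case of a scalar Gaussian mean $z$ with a Gaussian prior density $\rho(z)$; all distributions and numbers are illustrative assumptions, and the conjugate closed form is included as a check.

```python
import numpy as np

rng = np.random.default_rng(1)
z_true, sigma, tau = 2.0, 1.0, 0.5           # true parameter, noise sd, prior sd
data = z_true + sigma * rng.normal(size=30)

# dP(z) = rho(z) dz with rho(z) Gaussian; maximize the posterior density on a grid.
zs = np.linspace(-5.0, 5.0, 10001)
log_post = (-0.5 * ((data[:, None] - zs) / sigma) ** 2).sum(axis=0) \
           - 0.5 * (zs / tau) ** 2
z_map = zs[np.argmax(log_post)]              # MAP estimate: maximizer of rho(z | data)

# Conjugate closed form for this Gaussian case, for comparison:
z_closed = data.sum() / (len(data) + sigma**2 / tau**2)
print(z_map, z_closed)                       # the two agree to grid precision
```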

9 MAP Example: OCR

Example of MAP (discrete case): decide which letter $\ell_0$ (a letter of the alphabet) we are looking at in an OCR program.

A priori information: the overall probability distribution $P(\ell)$ on the 26 letters $\ell$ of the alphabet.

10 MAP Example: OCR

Data: a vector $\mathbf{z} = (d_1, \dots, d_n)$ of $n$ features of the viewed letter. Assume the true letter is $\ell_0 = K$. Then
$$\hat{\ell}_0 = \text{MAP estimate of the true letter } \ell_0 = \arg\max_{\ell} P(\ell \mid \mathbf{z}),$$
where $\ell$ ranges over the alphabet. Here $P(\ell \mid \mathbf{z})$ is computed using Bayes' formula:
$$P(\ell \mid \mathbf{z}) = \frac{P(\mathbf{z} \mid \ell)\, P(\ell)}{\sum_j P(\mathbf{z} \mid j)\, P(j)} = C\, P(\mathbf{z} \mid \ell)\, P(\ell),$$
where $C = 1 / \sum_j P(\mathbf{z} \mid j) P(j)$ is independent of $\ell$.

Note $P(\mathbf{z} \mid \ell)$ is known, since we know which features occur in which letters.

11 MAP Example: OCR

Thus
$$\hat{\ell} = \arg\max_{\ell} P(\mathbf{z} \mid \ell)\, P(\ell).$$
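
The OCR decision rule in code: a sketch with independent binary features and a made-up likelihood table $P(\mathbf{z} \mid \ell)$; the 8 features, their probabilities, and the uniform prior are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
letters = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
prior = np.full(26, 1.0 / 26)                      # P(l); uniform here
feat_prob = rng.uniform(0.05, 0.95, size=(26, 8))  # P(feature_j = 1 | letter), assumed known

true_idx = letters.index('K')
z = (rng.uniform(size=8) < feat_prob[true_idx]).astype(float)  # observed feature vector

# MAP: argmax_l P(z | l) P(l), with log P(z | l) for independent binary features.
log_lik = (z * np.log(feat_prob) + (1 - z) * np.log(1 - feat_prob)).sum(axis=1)
l_hat = letters[np.argmax(log_lik + np.log(prior))]
print(l_hat)
```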

12 Extending MAP: Nonparametric case

4. Extending MAP to nonparametric statistics: can we use MAP to discover an entire function $f(\mathbf{x})$?

Recall: the input data points are $\mathbf{x}_i$, the outputs are $y_i$, and the model is
$$y_i = f(\mathbf{x}_i) + \varepsilon_i = (Nf)_i.$$

Strategy:
Goal: solve for the unknown dependence $y = f(\mathbf{x})$ using the above model $\mathbf{y} = Nf + \boldsymbol{\varepsilon}$.
Prior knowledge: reflected in a probability distribution $dP(f)$ on the space $F$ = possible choices of $f$.

13 Extending MAP: Nonparametric case

Density function: define a density function $\rho(f)$ for the distribution $dP(f)$, i.e., so that
$$dP(f) = \rho(f)\, df.$$

Algorithm: maximize $\rho(f \mid \mathbf{y})$, i.e., the density conditioned on the data $\mathbf{y} = Nf$.

Problem: there is no "uniform distribution" $df$ on a function space such as $F$.

Remark: finding $\rho(f)$ is the difficult part here; a probability density on a function space is not easy to define!

Solution: the MAPN theorem (below).

14 Extending MAP: Nonparametric case

Remark: The function $\rho(f)$ plays the role of a likelihood function in statistics:
• the larger $\rho(f)$, the more "likely" $f$ is;
• it is intuitively appealing;
• it is easily interpretable;
• it is very easily modifiable as prior information or intuition warrants;
• nevertheless, $\rho(f)$ always corresponds to a genuine probability distribution $dP(f)$ on functions.

15 Algorithm: details

5. Details of the algorithm: note that $\mathbf{y} = Nf + \boldsymbol{\varepsilon} \Rightarrow \boldsymbol{\varepsilon} = \mathbf{y} - Nf$.

The MAPN estimate is (using the continuous Bayes formula)
$$\arg\max_f \rho(f \mid \mathbf{y}) = \arg\max_f \frac{\rho_{\mathbf{y}}(\mathbf{y} \mid f)\, \rho(f)}{\rho_{\mathbf{y}}(\mathbf{y})} = \arg\max_f C_1\, \rho_{\mathbf{y}}(\mathbf{y} \mid f)\, \rho(f)$$
(here $C_1 = 1/\rho_{\mathbf{y}}(\mathbf{y})$ is independent of $f$).

Choose the prior distribution $dP(f)$ for $f$ to be Gaussian with covariance operator $C = (A^*A)^{-1}$, with $A$ given below.
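
To see what a Gaussian prior with covariance $C = (A^*A)^{-1}$ produces, here is a sketch in a 1-D discretization, with a simple second-difference operator standing in for the order-30 operator $A$ defined on the next two slides:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200
L = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)  # discrete -d^2/dx^2
A = np.eye(m) + L                      # stand-in regularization operator (1 + Laplacian-like)
C = np.linalg.inv(A.T @ A)             # prior covariance C = (A* A)^{-1}

f_sample = rng.multivariate_normal(np.zeros(m), C)   # one draw f ~ N(0, C): a smooth function
```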

16 Algorithm: details

To define $A$: let
$$\mathbf{a} = (a_1, \dots, a_{20}) = (.1, \dots, .1, .2, \dots, .2)$$
be a vector which determines the strength of the a priori information for each component $x_i$. Here $\mathbf{a}$ has its first 5 components equal to $.1$ and its last 15 components equal to $.2$, reflecting a weaker dependence of the a priori assumptions on the first 5 parameters (temperature, humidity, and the other non-chemical parameters) and a stronger dependence on the last 15 (the chemical parameters); see the one-line construction below.
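
The weight vector in one line of code (the values $.1$ and $.2$ are from the slide; how $\mathbf{a}$ enters the operator is defined on the next slide):

```python
import numpy as np

# First 5 components (non-chemical inputs) get weight .1; last 15 (chemical) get .2.
a = np.concatenate([np.full(5, 0.1), np.full(15, 0.2)])
assert a.shape == (20,)
```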

17 Algorithm: details

Choose the covariance matrix of the Gaussian to be the operator $C = (A^*A)^{-1}$, where
$$A = \text{"regularization operator"} = d(\mathbf{x})\,(1 + \Delta_{\mathbf{a}})\, d(\mathbf{x}), \qquad \Delta_{\mathbf{a}} = \sum_{i=1}^{20} a_i\, \frac{\partial^{30}}{\partial x_i^{30}}.$$

18 Algorithm: details

Here $d(\mathbf{x})$ reflects the density of sample points via
$$d(\mathbf{x}) = \left(1 + \delta(\mathbf{x})\right)^{1/20},$$
where
$$\delta(\mathbf{x}) = \text{smoothed density of sample points} = \sum_{k=1}^{n} e^{-(\mathbf{x} - \mathbf{x}_k)^2}.$$

Note: the order 30 above reflects the smoothness level we expect of the solution function $f$; in 20 dimensions we need at least 10 derivatives for $f$ to be continuous. Thus the distribution $P$ is concentrated on smooth solutions $f$; the minimizing solution is given by radial basis functions (see below).
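
A direct transcription of $\delta(\mathbf{x})$ and $d(\mathbf{x})$ as Python functions; the unit-width Gaussian kernel and the $1/20$ exponent follow the formulas above, and the sample points are illustrative:

```python
import numpy as np

def delta(x, X):
    """Smoothed density of sample points: sum_k exp(-|x - x_k|^2)."""
    return np.exp(-((X - x) ** 2).sum(axis=1)).sum()

def d_weight(x, X, dim=20):
    """d(x) = (1 + delta(x))^(1/dim); larger where sample points cluster."""
    return (1.0 + delta(x, X)) ** (1.0 / dim)

X = np.random.default_rng(4).uniform(size=(50, 20))   # sample points x_1, ..., x_n
print(d_weight(X[0], X))                              # d evaluated at one sample point
```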

19 Algorithm: details

Then the MAPN theorem (below) shows that the probability distribution $dP(f)$ has density function
$$\rho(f) = C_2\, e^{-\frac{1}{2}\|Af\|^2}.$$

If we assume the error vector $\boldsymbol{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)$ is Gaussian,
$$\rho(\boldsymbol{\varepsilon}) = C_3\, e^{-\frac{1}{2}\|B\boldsymbol{\varepsilon}\|^2} = C_3\, e^{-\frac{1}{2}\|B(Nf - \mathbf{y})\|^2}$$
(with $B$ an $n \times n$ covariance matrix), then
$$\rho(f \mid \mathbf{y}) = \frac{\rho_{\mathbf{y}}(\mathbf{y} \mid f)\, \rho(f)}{\rho_{\mathbf{y}}(\mathbf{y})} = C_4\, e^{-\frac{1}{2}\|B(Nf - \mathbf{y})\|^2}\, e^{-\frac{1}{2}\|Af\|^2} = C_4\, e^{-\frac{1}{2}\left(\|Af\|^2 + \|B(Nf - \mathbf{y})\|^2\right)}.$$
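
Up to constants, the exponent above is the only part of $\rho(f \mid \mathbf{y})$ that depends on $f$. In a discretized sketch (with $A$, $B$ as matrices and $Nf$ realized by subsampling the values of $f$ at the data points; all shapes are assumptions) it reads:

```python
import numpy as np

def neg_log_posterior(f_vals, A, B, N_idx, y):
    """-2 log rho(f | y), up to an additive constant:
    ||A f||^2 + ||B (N f - y)||^2, with f discretized as the vector f_vals
    and the information operator realized as N f = f_vals[N_idx]."""
    return np.sum((A @ f_vals) ** 2) + np.sum((B @ (f_vals[N_idx] - y)) ** 2)
```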

20 Algorithm: details

Thus the maximizer of $\rho(f \mid \mathbf{y})$ is
$$\hat{f} = \arg\min_f\, \|Af\|^2 + \|B(Nf - \mathbf{y})\|^2, \qquad (*)$$
where
$$\hat{f} = \sum_{k=1}^{n} \lambda_k\, K(\mathbf{x}, \mathbf{x}_k) \qquad (**)$$
and $K(\mathbf{x}, \mathbf{x}')$ = radial basis function = Green's function for the operator $A$:
$$K(\mathbf{x}, \mathbf{x}') = U^{-1}\left[\left(1 + \sum_{j=1}^{20} a_j\, (2\pi i\, \xi_j)^{30}\right)^{-1}\right](\mathbf{x} - \mathbf{x}'),$$
where $\boldsymbol{\xi}$ is the Fourier variable dual to $\mathbf{x}$ and $U$ is the Fourier transform.
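
A 1-D numerical sketch of the Fourier formula for $K$: invert the operator's symbol on a grid and transform back. One coordinate with weight $a = 0.2$ is used, and the symbol is taken as $1 + a\,|2\pi\xi|^{30}$ so that it stays real and positive; this sign convention is an assumption made for the sketch.

```python
import numpy as np

a = 0.2
m, dx = 4096, 1.0 / 64.0
xi = np.fft.fftfreq(m, d=dx)                       # Fourier variables dual to x
symbol = 1.0 + a * (2 * np.pi * np.abs(xi)) ** 30  # symbol of 1 + a * d^30/dx^30 (1-D sketch)
K = np.fft.ifft(1.0 / symbol).real                 # K(x - x'), sampled on the grid
x = np.arange(m) * dx                              # grid points, e.g. for plotting K
```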

21 Algorithm: details

Note: (*) is the same functional that appears in regularization solutions in SLT, and the solution (**) is the same as the spline solution in IBC.

Solution of the problem: choose $\hat{f}$ as in (**) for the best approximation of the input-output relationship $y = f(\mathbf{x})$.
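
Since (*) is the familiar SLT regularization functional, its finite-sample minimizer can be computed exactly as in kernel ridge regression. A sketch, with a Gaussian RBF standing in for the specific Green's-function kernel $K$ (kernel width, regularization weight, and data are assumptions):

```python
import numpy as np

def rbf_kernel(X1, X2, scale=1.0):
    """Gaussian RBF stand-in for the Green's-function kernel K(x, x')."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * scale**2))

def fit_lambdas(X, y, reg=1e-2):
    """Coefficients lambda_k in f_hat(x) = sum_k lambda_k K(x, x_k),
    minimizing the ridge form of the regularized functional (*)."""
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + reg * np.eye(len(y)), y)

def f_hat(Xnew, X, lam):
    return rbf_kernel(Xnew, X) @ lam

rng = np.random.default_rng(5)
X = rng.uniform(size=(50, 20))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=50)
lam = fit_lambdas(X, y)
print(f_hat(X[:3], X, lam), y[:3])                 # fitted vs. observed at training points
```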

22 How does it work?

6. How does it work? How do we define the density $\rho(f)$ corresponding to a prior probability distribution $dP(f)$ on the set of all functions? Analogously to standard parametric MAP methods: use a definition of the probability density $\rho(f)$ corresponding to the probability distribution $dP(f)$ which works for both parametric (finite dimensional) and nonparametric (function) spaces.

Define $\rho(f)$ to be proportional to the ratio
$$\frac{P(B_r(f))}{P(B_r(0))}$$
of the probabilities of balls of radius $r$ near $f$ and near $0$, in the small-$r$ limit.

23 How does it work? This limit exists for most generic measures in infinite dimension, including Gaussian measures:

24 How does it work?

MAPN Theorem:
(a) The limit
$$\rho(f) = \lim_{r \to 0} \frac{P(B_r(f))}{P(B_r(0))}$$
exists for a Gaussian measure $P$ on a function space $F$ with covariance operator $C$, defining its density function.
(b) If $C = (A^*A)^{-1}$ as above, then
$$\rho(f) = K\, e^{-\frac{1}{2}\|Af\|^2}.$$
In finite dimensions this reduces to the ordinary density function of the Gaussian distribution.

Thank you!
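
The finite-dimensional statement of the theorem can be checked numerically: for a 2-D Gaussian with covariance $(A^*A)^{-1}$, the small-ball probability ratio approaches $e^{-\frac{1}{2}\|Af\|^2}$. A Monte Carlo sketch (the matrix $A$, the point $f$, and the radius are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
A = np.array([[1.0, 0.3],
              [0.0, 1.2]])                        # toy regularization operator
C = np.linalg.inv(A.T @ A)                        # Gaussian covariance C = (A* A)^{-1}
samples = rng.multivariate_normal(np.zeros(2), C, size=2_000_000)

f0, r = np.array([0.4, -0.2]), 0.05
ball_prob = lambda c: (((samples - c) ** 2).sum(axis=1) < r * r).mean()

ratio = ball_prob(f0) / ball_prob(np.zeros(2))    # P(B_r(f)) / P(B_r(0))
predicted = np.exp(-0.5 * (A @ f0) @ (A @ f0))    # rho(f) = e^{-||Af||^2 / 2}
print(ratio, predicted)                           # close for small r
```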
