DD Advanced Machine Learning

1 Modelling. Carl Henrik, Royal Institute of Technology, November 4, 2015

2 Who do I think you are? Mathematically competent: linear algebra, multivariate calculus. OK programmers. Able to extend knowledge beyond the lectures. Motivated and willing to learn. Not expecting cookbook recipes.

3 Who do I hope that you will become? Understand the importance of uncertainty. See ML as a science, not a collection of methods. Capable of placing methods in context. Have the background to learn by yourselves. Appreciate the difficulties and challenges of ML.

4 What's the focus of this part of the course? My plan: my view on Machine Learning. 1. Look at each part of a probabilistic model in detail: how do they interact, what do they provide. 2. Different models: parametric, non-parametric. 3. Inference. Really simple data.

5 Block Structure: 4 lectures, 2 practical sessions, 1 assignment (deadline November 19th), 1 scheduled help session.

6 Assignment: three parts aligned with the lectures. Part 1 (Lectures 2 & 3). Task: probabilistic regression. Aim: understand probabilistic objects. Part 2 (Lectures 3 & 4). Task: probabilistic representation learning. Aim: understand probabilistic methodology. Part 3 (self study). Task: probabilistic model selection. Aim: show that you can extend your knowledge from parts 1 and 2.

10 Science Model Data Algorithm Prediction

16 My view on Machine Learning

20 My view on Machine Learning. "An intelligence which at a given instant knew all the forces acting in nature and the position of every object in the universe, if endowed with a brain sufficiently vast to make all necessary calculations, could describe with a single formula the motions of the largest astronomical bodies and those of the smallest atoms. To such an intelligence, nothing would be uncertain; the future, like the past, would be an open book." (A Philosophical Essay on Probabilities, Laplace)

21 My view on Machine Learning. "The theory of probabilities is at bottom nothing but common sense reduced to calculus; it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which ofttimes they are unable to account." (A Philosophical Essay on Probabilities, Laplace)

22 My view on Machine Learning. "It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge." (A Philosophical Essay on Probabilities, Laplace)

23 My view on Machine Learning. "It was our use of probability theory as logic that has enabled us to do so easily what was impossible for those who thought of probability as a physical phenomenon associated with randomness. Quite the opposite; we have thought of probability distributions as carriers of information." (Probability Theory: The Logic of Science, Jaynes)

24 My view on Machine Learning. Theme: acknowledge that we do not know everything, i.e. my knowledge is uncertain. A methodology to propagate my uncertainty through all levels of reasoning. Incorporate my uncertain knowledge with observations, such that when I see data it reduces my uncertainty according to the evidence provided in the observations.

25 Introduction Regression Kernel Methods

26 Regression. Two variates: input data $x_i \in \mathbb{R}^q$, output data $y_i \in \mathbb{R}^D$. Relationship: $f : \mathcal{X} \to \mathcal{Y}$.

27 Regression Uncertainty. We are uncertain in our data. This means we cannot trust: our observations, the mapping that we learn, the predictions that we make under the mapping. This part of the course is about making this principled!

29 Outline. Re-cap of probability basics. Re-cap of the Central Limit Theorem. Probabilistic formulation. Dual formulation.

30 Probability Basics. Expected value: $E[x] = \mu(x) = \int x\,p(x)\,dx$. Shows the centre of gravity of a distribution. Sample expected value (mean): $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. (Bishop 2006)

31 Probability Basics. Variance: $\sigma^2(x) = \mathrm{var}[x] = E[(x - E[x])^2]$. Shows the spread of a distribution. Sample variance: $\hat{\sigma}^2(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$. (Bishop 2006)

32 Probability Basics: distribution plots generated with Matplotlib 3D, /Lecture1/probBasics.py. (Bishop 2006)

35 Probability Basics. Covariance: $\sigma(x, y) = E[(x - E[x])(y - E[y])]$, with $[\sigma(X, Y)]_{ij} = \sigma(x_i, y_j) = k(x_i, y_j)$. Shows how the spread of two variables varies together. Sample covariance: $\hat{\sigma}(x, y) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$. (Bishop 2006)

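To make the sample estimators above concrete, here is a minimal NumPy sketch (my own illustration, not the course's /Lecture1/probBasics.py) that computes the sample mean, variance and covariance of simulated data and checks them against NumPy's built-ins; the particular distribution and its parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw correlated 2D Gaussian samples to play with.
N = 1000
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])
samples = rng.multivariate_normal(mean, cov, size=N)
x, y = samples[:, 0], samples[:, 1]

# Sample mean: (1/N) * sum_i x_i
x_bar = x.sum() / N

# Unbiased sample variance: 1/(N-1) * sum_i (x_i - x_bar)^2
x_var = ((x - x_bar) ** 2).sum() / (N - 1)

# Unbiased sample covariance: 1/(N-1) * sum_i (x_i - x_bar)(y_i - y_bar)
y_bar = y.sum() / N
xy_cov = ((x - x_bar) * (y - y_bar)).sum() / (N - 1)

print(x_bar, x_var, xy_cov)      # hand-rolled estimates
print(np.cov(x, y, ddof=1))      # NumPy agrees, up to sampling noise
```
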
36 Probability Basics: further figures from /Lecture1/probBasics.py (Matplotlib 3D). (Bishop 2006)

40 Linear Regression. $y_i = W x_i + \epsilon$. Uncertainty: let's assume the relationship is linear, with uncertainty in the outputs $y_i$ captured by additive noise $\epsilon$. What form does the noise $\epsilon$ have? What do we know about the generating process? (Bishop 2006)

42 Linear Regression. What distribution? Central Limit Theorem: the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. (Bishop 2006)

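The course demo /Lecture1/centralLimit.py presumably illustrates this empirically; the following is a small stand-alone sketch of the same idea (my own, with an assumed Uniform(0, 1) base distribution): averaging more and more uniform draws produces an increasingly Gaussian-looking histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n_samples = 10_000

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n_terms in zip(axes, [1, 2, 10]):
    # Average n_terms i.i.d. Uniform(0, 1) variables, n_samples times.
    means = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).mean(axis=1)
    ax.hist(means, bins=50, density=True)
    ax.set_title(f"average of {n_terms} uniform draws")

plt.tight_layout()
plt.show()
```
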
43 Linear Regression: Central Limit Theorem demo, /Lecture1/centralLimit.py. (Bishop 2006)

59 Linear Regression. $p(W \mid Y, X)$: uncertainty in the model. Posterior: the conditional distribution after the relevant information has been taken into account. What is relevant? Our belief and the observations. (Bishop 2006)

60 Linear Regression. $p(W)$: belief about the model before seeing data, the prior. What do I know about the regression parameters? $p(W) = \mathcal{N}(0, \sigma^2 I)$. (Bishop 2006)

63 Linear Regression. $p(y_i \mid W, x_i)$: how well does my model predict the data, the likelihood. Think of it as an error function, but one that also says how the errors are distributed. $p(y_i \mid W, x_i) = \mathcal{N}(y_i \mid W x_i, \tau^2 I)$. (Bishop 2006)

64 Linear Regression. Structure: do the variables co-vary? Are there (in-)dependency structures that I can exploit? $p(Y \mid W, X) = \prod_{i=1}^{N} p(y_i \mid W, x_i)$. (Bishop 2006)

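As a concrete illustration of the factorised Gaussian likelihood above, here is a short sketch of mine (toy data, made-up parameters, not part of the assignment code): because the likelihood factorises over data points, the log-likelihood is simply a sum over them.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1D linear data: y_i = w_true * x_i + Gaussian noise.
N, w_true, tau = 50, 1.5, 0.3
x = rng.uniform(-2.0, 2.0, size=N)
y = w_true * x + tau * rng.standard_normal(N)

def log_likelihood(w, x, y, tau):
    """log p(Y | w, X) = sum_i log N(y_i | w x_i, tau^2)."""
    resid = y - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * tau**2) - resid**2 / (2 * tau**2))

# The log-likelihood peaks near the value of w that generated the data.
for w in [0.0, 1.0, 1.5, 2.0]:
    print(w, log_likelihood(w, x, y, tau))
```
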
65 Linear Regression. How do we put everything together? We want to reach the posterior distribution after all relevant information has been taken into account. Predictions should reflect my beliefs in the model and the information in the observations. We have a gigantic number of possible solutions that are allowed by our data and belief. (Bishop 2006)

70 Linear Regression. $p(W \mid D) = \frac{p(D \mid W)\,p(W)}{p(D)}$, so $p(W \mid Y, X) \propto p(Y \mid W, X)\,p(W)$. Evidence: the denominator shows where the model spreads its probability mass over the data space (the evidence of the model). The denominator does not change with $W$. (Bishop 2006)

72 Bayesian linear regression in pictures: prior $p(W)$, posterior $p(W \mid X, Y)$, and samples of $W$ as data arrive. (Bishop 2006, p. 155)

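The figures referenced above correspond to sequential Bayesian learning of a linear model (Bishop 2006, p. 155). Below is a minimal sketch of the underlying computation, assuming a 1D linear model with bias, a known noise variance and a zero-mean isotropic Gaussian prior; the closed-form posterior used here is the standard conjugate Gaussian result, not code taken from the course.

```python
import numpy as np

rng = np.random.default_rng(3)

# Model: y = w0 + w1 * x + eps, eps ~ N(0, sigma2). Features phi(x) = [1, x].
w_true = np.array([-0.3, 0.5])
sigma2, tau2 = 0.04, 1.0            # noise variance and prior variance

x = rng.uniform(-1.0, 1.0, size=20)
Phi = np.stack([np.ones_like(x), x], axis=1)          # N x 2 design matrix
y = Phi @ w_true + np.sqrt(sigma2) * rng.standard_normal(len(x))

# Conjugate Gaussian posterior: p(w | y, X) = N(m, S) with
#   S = (I/tau2 + Phi^T Phi / sigma2)^(-1),  m = S Phi^T y / sigma2
S = np.linalg.inv(np.eye(2) / tau2 + Phi.T @ Phi / sigma2)
m = S @ Phi.T @ y / sigma2

print("posterior mean:", m)
print("posterior cov:\n", S)

# Draw a few plausible regression lines from the posterior.
w_samples = rng.multivariate_normal(m, S, size=5)
print("sampled weights:\n", w_samples)
```
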
83 Assignment You should now be able to do the linear part of Task 2.1 and Task 2.2 of the assignment.

84 Evidence: $p(D)$. (Bishop 2006, §3.4)

85 Toolbox. 1. Formulate the prediction error through the likelihood. 2. Formulate belief about the model in the prior. 3. Marginalise irrelevant variables. 4. Choose the model based on the evidence $p_{\mathcal{M}}(D)$ (Assignment).

89 Marginalisation. $p(W) = \int p(W \mid \theta)\,p(\theta)\,d\theta$. Average according to belief and how well the model fits the observations. Pushes belief through the model.

91 Marginalisation Nature laughs at the difficulties of integration

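When an integral like the one above has no closed form, a crude but general fallback is Monte Carlo: sample from $p(\theta)$ and average. A tiny sketch of the idea (my own toy example, with an assumed Gaussian prior over a mean parameter, chosen so the exact answer is available for comparison):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy marginalisation: p(w) = int p(w | theta) p(theta) dtheta,
# with theta ~ N(0, 1) and w | theta ~ N(theta, 0.5^2).
def p_w_given_theta(w, theta, s=0.5):
    return np.exp(-(w - theta) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

w = 0.7
theta_samples = rng.standard_normal(100_000)          # draws from p(theta)
p_w_mc = p_w_given_theta(w, theta_samples).mean()     # Monte Carlo estimate

# Here the exact marginal is known: w ~ N(0, 1 + 0.5^2) = N(0, 1.25).
p_w_exact = np.exp(-w**2 / (2 * 1.25)) / np.sqrt(2 * np.pi * 1.25)
print(p_w_mc, p_w_exact)
```
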
92 Choosing Distributions. $p(X \mid Y) = \frac{p(Y \mid X)\,p(X)}{p(Y)}$. Conjugate distributions: the posterior and the prior are in the same family; the relationship involves all three terms. Carl's intuition: combining my belief about the parameters with data through the model should not change the family of the distribution over the parameters. (Wikipedia)

94 Choosing Distributions. $p(X \mid Y) = \frac{p(Y \mid X)\,p(X)}{p(Y)}$. Remainder of this part: in this part of the course we will only look at Gaussians. Gaussians are self-conjugate: Gaussian likelihood + Gaussian prior → Gaussian posterior. In lecture 5 I will show you approximate ways to compute an integral. Hedvig will look at non-Gaussian priors and likelihoods in her part.

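To see Gaussian self-conjugacy in one dimension, here is a small sketch (my own, assuming a known noise variance and a Gaussian prior on an unknown mean): the posterior over the mean is again Gaussian, with the precisions adding up.

```python
import numpy as np

rng = np.random.default_rng(5)

# Unknown mean mu with prior N(m0, v0); observations y_i | mu ~ N(mu, s2).
m0, v0 = 0.0, 2.0        # prior mean and variance
s2 = 0.25                # known observation noise variance
mu_true = 1.3

y = mu_true + np.sqrt(s2) * rng.standard_normal(30)

# Conjugacy: the posterior is Gaussian with
#   1/v_post = 1/v0 + N/s2,   m_post = v_post * (m0/v0 + sum(y)/s2)
v_post = 1.0 / (1.0 / v0 + len(y) / s2)
m_post = v_post * (m0 / v0 + y.sum() / s2)

print("posterior over mu: N(%.3f, %.3f)" % (m_post, v_post))
```
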
97 Reflection. That was ALL of Machine Learning. Everything else is just details: how to choose the model, what is the right prior, how to integrate. You will have to approximate and use heuristics, but always relate back to this.

103 Example: Image restoration, Lecture1/imageExample.py.

105 Example: Image restoration. Likelihood: $p(Y \mid X, \theta) = \mathcal{N}(f(X), \sigma^2 I)$ with $y_i = \frac{1}{3}(x_i^r + x_i^g + x_i^b)$. Posterior: $p(X \mid Y, \theta) \propto p(Y \mid X, \theta)\,p(X)$. (Lecture1/imageExample.py)

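A minimal sketch of the forward model in this example (my own illustration of the equations above, not the course's imageExample.py): the observation is the grayscale average of the RGB channels plus noise, and restoration means inverting this map under a prior on the colour image.

```python
import numpy as np

rng = np.random.default_rng(6)

# Latent colour image X (H x W x 3) and noisy grayscale observation Y.
H, W = 32, 32
X = rng.uniform(0.0, 1.0, size=(H, W, 3))

sigma = 0.05
# Forward model: y_i = (x_i^r + x_i^g + x_i^b) / 3 + Gaussian noise,
# i.e. p(Y | X) = N(f(X), sigma^2 I) where f averages the channels.
Y = X.mean(axis=2) + sigma * rng.standard_normal((H, W))

# Restoration combines this likelihood with a prior p(X) over colour images
# to get p(X | Y) proportional to p(Y | X) p(X); the likelihood alone cannot
# pin down three channels from one grayscale value per pixel.
print(Y.shape, X.shape)
```
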
108 Introduction Regression Kernel Methods

109 Dual Linear Regression. $p(W \mid Y, X) = \frac{p(Y \mid W, X)\,p(W)}{p(Y)}$, with $p(Y \mid W, X) = \prod_{i=1}^{N} p(y_i \mid W, X) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid W x_i, \sigma^2 I)$ and $p(W) = \mathcal{N}(0, \tau^2 I)$, so $p(W \mid Y, X) \propto p(Y \mid W, X)\,p(W)$. (Bishop 2006, §6.1)

111 Dual Linear Regression. Let's look at a simple 1D problem: $y \in \mathbb{R}^{1 \times N}$, $x \in \mathbb{R}^{1 \times N}$. (Bishop 2006, §6.1)

112 Dual Linear Regression. $p(w \mid y, X) \propto \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(w^T x_i - y_i)^T(w^T x_i - y_i)} \cdot \frac{1}{\sqrt{2\pi\tau^2}} e^{-\frac{1}{2\tau^2} w^T w} = \frac{1}{(\sqrt{2\pi\sigma^2})^N} e^{-\frac{1}{2\sigma^2}(w^T X - y)^T(w^T X - y)} \cdot \frac{1}{\sqrt{2\pi\tau^2}} e^{-\frac{1}{2\tau^2} w^T w}$. Objective: we want the parameters that maximise the above; since the logarithm is monotonic, we can equivalently minimise the negative logarithm of $p(w \mid y, X)$. (Bishop 2006, §6.1)

114 Dual Linear Regression. $J(w) = \frac{1}{2}(w^T X - y)^T(w^T X - y) + \frac{\lambda}{2} w^T w$. Setting the gradient to zero, $\frac{\partial}{\partial w} J(w) = X^T(w^T X - y) + \lambda w = 0$, gives $w = -\frac{1}{\lambda} X^T(w^T X - y) = X^T a = \sum_{n=1}^{N} \alpha_n x_n$. Optimisation: let's make a point estimate and pick the $w$ that minimises $J(w)$. (Bishop 2006, §6.1)

118 Dual Linear Regression. $J(w) = \frac{1}{2}(w^T X - y)^T(w^T X - y) + \frac{\lambda}{2} w^T w$, with $w = X^T a$. Formulate the dual: $J(a) = \frac{1}{2} a^T X X^T X X^T a - a^T X X^T y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T X X^T a$. (Bishop 2006, §6.1)

120 Dual Linear Regression. $[K]_{ij} = x_i^T x_j$, $J(a) = \frac{1}{2} a^T K K a - a^T K y + \frac{1}{2} y^T y + \frac{\lambda}{2} a^T K a$. With $\alpha_i = -\frac{1}{\lambda}(w^T x_i - y_i)$ and $w = \sum_{i=1}^{N} \alpha_i x_i = X^T a$, setting the gradient to zero gives $a = (K + \lambda I)^{-1} y$. (Bishop 2006, §6.1)

121 Dual Linear Regression. $[K]_{ij} = x_i^T x_j$, $a = (K + \lambda I)^{-1} y$. Prediction: $y(x_*) = w^T x_* = a^T X x_* = a^T k(X, x_*) = ((K + \lambda I)^{-1} y)^T k(X, x_*) = k(x_*, X)(K + \lambda I)^{-1} y$. (Bishop 2006, §6.1)

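A compact sketch of these dual equations in code (my own, on a toy 1D dataset and with the plain linear kernel $k(x, x') = x^T x'$; swapping in any other valid kernel function changes nothing else):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy 1D regression data.
N, lam = 30, 0.1
x_train = rng.uniform(-3.0, 3.0, size=N)
y_train = 0.8 * x_train + 0.2 * rng.standard_normal(N)

def kernel(a, b):
    """Linear kernel k(x, x') = x x' for scalar inputs."""
    return np.outer(a, b)

# Dual solution: a = (K + lambda I)^(-1) y
K = kernel(x_train, x_train)                      # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(N), y_train)

# Prediction at new inputs: y(x*) = k(x*, X) (K + lambda I)^(-1) y = k(x*, X) a
x_test = np.linspace(-3.0, 3.0, 5)
y_pred = kernel(x_test, x_train) @ a

print(np.c_[x_test, y_pred])
```
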
122 Dual Linear Regression. Linear regression: 1. See data $(x_i, y_i)_{i=1}^{N}$. 2. Encode the relationship in the parameters $W$. 3. Throw away the training data. 4. Make predictions using $W$. Dual: do NOT throw away the data; make predictions using the relationship to the training data; model complexity depends on the data (i.e. it adapts): non-parametric regression. (Bishop 2006, §6.1)

125 Kernels. Dual linear regression allows us to write everything in terms of inner products; we do not need the representation $x_i$. What if we map the data prior to regression, $\phi : x_i \mapsto f_i$? Then $y(x_*) = w^T \phi(x_*) = a^T \phi(X)\,\phi(x_*) = k(x_*, X)(K + \lambda I)^{-1} y$ with $k(x, x') = \phi(x)^T \phi(x')$. In the dual case we do not need to know $\phi(\cdot)$, only $\phi(\cdot)^T \phi(\cdot)$.

127 Kernels. $k(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \|\phi(x_i)\|\,\|\phi(x_j)\| \cos(\theta)$. Kernel functions: a function that describes an inner product; a sub-class of functions (think triangle inequality). If we have $k(\cdot, \cdot)$ we never have to know the mapping.

129 Kernels. For $x \in \mathbb{R}^2$: $(x_i^T x_j)^2 = (x_{i1} x_{j1} + x_{i2} x_{j2})^2 = x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 = (x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2)(x_{j1}^2, \sqrt{2}\, x_{j1} x_{j2}, x_{j2}^2)^T = \phi(x_i)^T \phi(x_j)$. So $k(x_i, x_j) = (x_i^T x_j)^2$ is a kernel with the mapping $\phi(x) = ((e_1^T x)^2, \sqrt{2}\, e_1^T x\, e_2^T x, (e_2^T x)^2)$. (Bishop 2006, §6.2)

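A quick numerical check of this identity (a throwaway sketch, nothing more): the explicit feature map and the squared inner product give the same number.

```python
import numpy as np

rng = np.random.default_rng(8)

def phi(x):
    """Explicit feature map for the squared linear kernel in 2D."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi, xj = rng.standard_normal(2), rng.standard_normal(2)

k_implicit = (xi @ xj) ** 2          # kernel evaluated without the map
k_explicit = phi(xi) @ phi(xj)       # inner product in feature space

print(k_implicit, k_explicit)        # identical up to floating point error
```
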
130 The benefits of Kernels. Kernels allow for implicit feature mappings. We do NOT need to know the feature space. The space can have infinite dimensionality. The mapping can be non-linear, but the problem is still linear! They allow us to put weird things, like strings (DNA), in a vector space. More next lecture; these things are very powerful.

135 Next Time. Lecture 2, November 5th, M2. Continue with kernels: relation to covariance. Non-parametric regression: Gaussian Processes. Start the assignment.

137 e.o.f.

138 References. Pierre Simon Laplace, A Philosophical Essay on Probabilities. E. T. Jaynes, Probability Theory: The Logic of Science, ed. by G. Larry Bretthorst, Cambridge University Press, 2003. Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006.
