Regression and Prediction

Size: px

Start display at page:

Download "Regression and Prediction"

Doris Bryan
5 years ago
Views:

1 -755 Machine Learning for Signal Processing Regression and Prediction Class Oct 202 Instructor: Bhisha Raj 23 Oct /8797

2 Matri Identities f 2... D df df d d df d 2 d2... df dd dd he derivative of a scalar function w.r.t. a vector is a vector he derivative w.r.t. a matri is a matri 23 Oct /8797 2

3 Matri Identities f.. D D D 2D DD df df d df d d df d d df d D d df.. df d d d d22.. d2d df df df dd dd2 d dd dd2 ddd 2 2 D D DD he derivative of a scalar function w.r.t. a vector is a vector he derivative w.r.t. a matri is a matri 23 Oct /8797 3

4 Matri Identities F F F F FN D df d d df df2 df d 2... d dfn.. dfn d d df d d2 df2 d d2.. dfn d d df d dd.. df2.. d.. dd.... dfn d d D D D D he derivative of a vector function w.r.t. a vector is a matri Note transposition of order 23 Oct /8797 4

5 Derivatives, UV, N UV N NUV UVN In general: Differentiating an MN function b a UV argument results in an MNUV tensor derivative 23 Oct /8797 5

6 Matri derivative identities d Xa Xda d a X X da X is a matri, a is a vector. Solution ma also be X d AX da X ; d XA X da A is a matri d d a Xa a X X da A XA XAA AA X trace d trace d trace X X da Some basic linear and quadratic identities 23 Oct /8797 6

7 A Common Problem Can ou spot the glitches? 23 Oct /8797 7

8 How to fi this problem? Glitches in audio Must be detected How? hen what? Glitches must be fied Delete the glitch Results in a hole Fill in the hole How? 23 Oct /8797 8

9 Interpolation.. Etend the curve on the left to predict the values in the blan region Forward prediction Etend the blue curve on the right leftwards to predict the blan region How? Bacward prediction Regression analsis.. 23 Oct /8797 9

10 Detecting the Glitch OK NO OK Regression-based reconstruction can be done anwhere Reconstructed value will not match actual value Large error of reconstruction identifies glitches 23 Oct /8797 0

11 What is a regression Analzing relationship between variables Epressed in man forms Wiipedia Linear regression, Simple regression, Ordinar least squares, Polnomial regression, General linear model, Generalized linear model, Discrete choice, Logistic regression, Multinomial logit, Mied logit, Probit, Multinomial probit,. Generall a tool to predict variables 23 Oct /8797

12 Regressions for prediction = f; Q + e Different possibilities is a scalar Y is real Y is categorical classification is a vector is a vector is a set of real valued variables is a set of categorical variables is a combination of the two f. is a linear or affine function f. is a non-linear function f. is a time-series model 23 Oct /8797 2

13 A linear regression Y Assumption: relationship between variables is linear X A linear trend ma be found relating and = dependent variable = eplanator variable Given, can be predicted as an affine function of 23 Oct /8797 3

14 An imaginar regression.. Chec this shit out Fig.. hat's bonafide, 00%-real data, m friends. I too it mself over the course of two wees. And this was not a leisurel two wees, either; I busted m ass da and night in order to provide ou with nothing but the best data possible. Now, let's loo a bit more closel at this data, remembering that it is absolutel first-rate. Do ou see the eponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of m time. Baning on m hopes that whoever grades this will just loo at the pictures, I drew an eponential through m noise. I believe the apparent legitimac is enhanced b the fact that I used a complicated computer program to mae the fit. I understand this is the same process b which the top quar was discovered. 23 Oct /8797 4

15 Linear Regressions = A + b + e e = prediction error Given a training set of {, } values: estimate A and b = A + b + e 2 = A 2 + b + e 2 3 = A 3 + b+ e 3 If A and b are well estimated, prediction error will be small 23 Oct /8797 5

16 Linear Regression to a scalar = a + b + e 2 = a 2 + b + e 2 3 = a 3 + b + e 3 Define: [ 3...] 2 e [ e e...] 3 2 e X A a b Rewrite A X e 23 Oct /8797 6

17 Learning the parameters A X e ˆ A X Assuming no error Given training data: several, Can define a divergence : D, ŷ Measures how much hat differs from Ideall, if the model is accurate this should be small Estimate A, b to minimize D, ŷ 23 Oct /8797 7

18 he prediction error as divergence Define the divergence as the sum of the squared error in predicting 8 e X A = a + b + e 2 = a 2 + b + e 2 3 = a 3 + b + e 3... ˆ e e e E D, b b b a a a 2 X A X A X A E 23 Oct /8797

19 Prediction error as divergence = a + e e = prediction error Find the slope a such that the total squared length of the error lines is minimized 23 Oct /8797 9

20 Solving a linear regression A X e Minimize squared error E X A 2 A X A X XX A - 2X Differentiating w.r.t A and equating to 0 A 2A XX - 2X A 0 de d A A - X XX pinvx A XX - X 23 Oct /

21 An Aside What happens if we minimize the perpendicular instead? 23 Oct /8797 2

22 Regression in multiple dimensions = A + b + e 2 = A 2 + b + e 2 3 = A 3 + b + e 3 i is a vector ij = j th component of vector i Also called multiple regression Equivalent of saing: Fundamentall no different from N separate single regressions a i = i th column of A b i = i th component of b = a + b + e = A + b + e 2 = a b 2 + e 2 3 = a b 3 + e 3 But we can use the relationship between s to our benefit 23 Oct /

23 Multiple Regression Y [...] 3 2 E [ e e...] 3 2 e X D vector of ones A A b Y A X E DIV i i A i b trace Y A Differentiating and equating to 0 2 X Y A X ddiv 2A XX - 2YX da 0 A - YX XX Ypinv X A XX XY 23 Oct /

24 A Different Perspective = + is a nois reading of A Error e is Gaussian e ~ Estimate A from A e N0, 2 I Y [ 2... N ] X [ 2... N ] 23 Oct /

25 he Lielihood of the data A e e ~ N0, 2 I Probabilit of observing a specific, given, for a particular matri A P ; A N A, 2 I Probabilit of the collection: Y [ 2... N ] X [ 2... N ] P Y X; A i N A Assuming IID for convenience not necessar 23 Oct / i, 2 I

26 A Maimum Lielihood Estimate A e e ~ N0, 2 I Y [ 2... N ] X [ 2... N ] P Y X ep A 2 D 2 i log P Y X; A C i A 2 i i 2 C trace Y A X Y 2 Maimizing the log probabilit is identical to minimizing the trace Identical to the least squares solution i 2 A X 2 A - YX XX Ypinv X A XX XY 23 Oct /

27 Predicting an output From a collection of training data, have learned A Given for a new instance, but not, what is? Simple solution: ˆ A X 23 Oct /

28 Appling it to our problem Prediction b regression Forward regression t = a t- + a 2 t-2 a t- +e t Bacward regression t = b t+ + b 2 t+2 b t+ +e t 23 Oct /

29 Appling it to our problem Forward prediction 23 Oct / K t t K t K t K t t K t t K t t e e e a t e X a t pinv a t X

30 Appling it to our problem Bacward prediction 23 Oct / e e e K t K t K t K t K t t K t t t K t K t b e X b t pinv b t X

31 Finding the burst At each time Learn a forward predictor a t At each time, predict net sample t est = S i a t, t- Compute error: ferr t = t - t est 2 Learn a bacward predict and compute bacward error berr t Compute average prediction error over window, threshold 23 Oct /8797 3

32 Filling the hole Learn forward predictor at left edge of hole For each missing sample At each time, predict net sample t est = S i a t, t- Use estimated samples if real samples are not available Learn bacward predictor at left edge of hole For each missing sample At each time, predict net sample t est = S i b t, t+ Use estimated samples if real samples are not available Average forward and bacward predictions 23 Oct /

33 Reconstruction zoom in Reconstruction area Net glitch Distorted signal Recovered signal Interpolation result Actual data 23 Oct /

34 Incrementall learning the regression A XX - XY Requires nowledge of all, pairs Can we learn A incrementall instead? As data comes in? he Widrow Hoff rule a t a Note the structure t t t ˆ t t ˆ t a t Can also be done in batch mode! error Scalar prediction version 23 Oct /

35 Predicting a value A - XX XY A YX XX ˆ What are we doing eactl? Let ˆ YXˆ ˆ 2 C ˆ i ˆ i C XX Normalizing and rotating space he rotation is irrelevant ˆ i Weighted combination of inputs 23 Oct /

36 Relationships are not alwas linear How do we model these? Multiple solutions 23 Oct /

37 Non-linear regression = j+e φ [ 2... N ] X X [ φ φ 2... φ K ] Y = AX+e Replace X with X in earlier equations for solution A X X - X Y 23 Oct /

38 What we are doing Finding the optimal combination of various function Remind ou of something? 23 Oct /

39 Being non-commital: Local Regression Regression is usuall trained over the entire data Must appl everwhere How about doing this locall? ˆ YXˆ For an ˆ i ˆ i ˆ d, i i i i C ii e i e 23 Oct /

40 Local Regression he resulting regression is dependent on! ˆ d, i i i e d, i i No closed form solution But can be highl accurate But what is d,?? i 2 23 Oct /

41 Kernel Regression Actuall a non-parametric MAP estimator of Note an estimator of, not parameters of regression he Kernel is the ernel of a parzen window But first.. MAP estimators.. 23 Oct / i i h i i i h K K ˆ

42 Map Estimators MAP Maimum A Posteriori: Find a best guess for in a statistical sense, given that we now = argma Y PY ML Maimum Lielihood: Find that value of Y for which the statistical best guess of X would have been the observed X = argma Y P Y MAP is simpler to visualize 23 Oct /

43 MAP estimation: Gaussian PDF Assume X and Y are jointl Gaussian he parameters of the F Gaussian are learned from training data F0 23 Oct /

44 Learning the parameters of the Gaussian z z N N i z i C z N N i z z i z i z z C z C C XX YX C C XY YY 23 Oct /

45 Learning the parameters of the Gaussian N z N z z i Cz zi z zi z N N i z N i C z i C C N N C i XY i i N i XX YX C C XY YY 23 Oct /

46 MAP estimation: Gaussian PDF Assume X and Y are jointl Gaussian he parameters of the F Gaussian are learned from training data F0 23 Oct /

47 MAP Estimator for Gaussian RV Assume X and Y are jointl Gaussian Level set of Gaussian he parameters of the Gaussian are learned from training data Now we are given an X, but no Y What is Y? 23 Oct /

48 MAP estimator for Gaussian RV 0 23 Oct /

49 MAP estimation: Gaussian PDF F 23 Oct /

50 MAP estimation: he Gaussian at a particular value of X F 0 23 Oct /

51 MAP estimation: he Gaussian at a particular value of X Most liel value F 0 23 Oct /8797 5

52 MAP Estimation of a Gaussian RV Y = argma P X??? 23 Oct /

53 MAP Estimation of a Gaussian RV 23 Oct /

54 MAP Estimation of a Gaussian RV 23 Oct /

55 So what is this value? Clearl a line Equation of Line: ˆ Y C YX C XX Scalar version given; vector version is identical ˆ C Y YX C XX Derivation? Later in the program a bit 23 Oct /

56 his is a multiple regression ˆ Y C YX C XX his is the MAP estimate of NO the regression parameter What about the ML estimate of Again, ML estimate of, not regression parameter 23 Oct /

57 Its also a minimum-mean-squared error estimate General principle of MMSE estimation: is unnown, is nown Must estimate it such that the epected squared error is minimized Err E[ ˆ 2 ] Minimize above term 23 Oct /

58 Its also a minimum-mean-squared error estimate Minimize error: Differentiating and equating to 0: 23 Oct / ] ˆ ˆ [ ] ˆ [ 2 E E Err ] [ 2ˆ ˆ ˆ ] [ ] 2ˆ ˆ ˆ [ E E E Err 0 ˆ ] [ 2 ˆ 2ˆ ] 2ˆ ˆ ˆ [ 2 d E d E derr ] [ ˆ E he MMSE estimate is the mean of the distribution

59 For the Gaussian: MAP = MMSE Most liel value is also he MEAN value Would be true of an smmetric distribution 23 Oct /

60 MMSE estimates for miture distributions 60 Let P X be a miture densit he MMSE estimate of is given b Just a weighted combination of the MMSE estimates from the component distributions, P P P d P P E, ] [ d P P, ], [ E P 23 Oct /8797

61 MMSE estimates from a Gaussian miture 23 Oct / Let P is also a Gaussian miture Let P, be a Gaussian Miture, ; N P P P S z z, z,,,, P P P P P P P P P P P P,

62 MMSE estimates from a Gaussian miture 23 Oct / Let P is a Gaussian Miture P P P, ], ; ];[ ; [,,,,,,,, C C C C N P, ;,,,,, Q C C N P Q C C N P P, ;,,,,

63 MMSE estimates from a Gaussian miture 23 Oct / E[ ] is also a miture P[ ] is a miture densit Q C C N P P, ;,,,, E P E ], [ ] [ C C P E ] [,,,,

64 MMSE estimates from a Gaussian miture A miture of estimates from individual Gaussians 23 Oct /

65 MMSE with GMM: Voice ransformation - Festvo GMM transformation suite oda awb bdl jm slt awb bdl jm slt 23 Oct /

vectors Given speech from one speaer, find MMSE

66 Voice Morphing Align training recordings from both speaers Cepstral vector sequence Learn a GMM on joint vectors Given speech from one speaer, find MMSE estimate of the other Snthesize from cepstra 23 Oct /

67 A problem with regressions A XX - XY ML fit is sensitive Error is squared Small variations in data large variations in weights Outliers affect it adversel Unstable If dimension of X >= no. of instances XX is not invertible 23 Oct /

68 MAP estimation of weights a =a X+e Assume weights drawn from a Gaussian Pa = N0, 2 I Ma. Lielihood estimate Maimum a posteriori estimate aˆ arg ma a X aˆ arg ma a log P X; a log P a, X arg ma A log P e X, a P a 23 Oct /

69 MAP estimation of weights Similar to ML estimate with an additional term 23 Oct / , log arg ma, log arg ma ˆ a X a X a a A A P P P Pa = N0, 2 I Log Pa = C log a 2 C P 2 log 2 X a X a X, a a a X a X a a A C log ' arg ma ˆ

70 MAP estimate of weights dl 2a XX 2X 2I da 0 a XX I - XY Equivalent to diagonal loading of correlation matri Improves condition number of correlation matri Can be inverted with greater stabilit Will not affect the estimation from well-conditioned data Also called ihonov Regularization Dual form: Ridge regression MAP estimate of weights Not to be confused with MAP estimate of Y 23 Oct /

71 MAP estimate priors Left: Gaussian Prior on W Right: Laplacian Prior 23 Oct /8797 7

72 MAP estimation of weights with laplacian prior Assume weights drawn from a Laplacian Pa = l - ep-l - a Maimum a posteriori estimate aˆ arg ma A C' a X a X l a No closed form solution Quadratic programming solution required Non-trivial 23 Oct /

73 MAP estimation of weights with laplacian prior Assume weights drawn from a Laplacian Pa = l - ep-l - a Maimum a posteriori estimate aˆ arg ma A C' a X a X l a Identical to L regularized least-squares estimation 23 Oct /

74 L-regularized LSE No closed form solution aˆ arg ma A C' a X a X l Quadratic programming solutions required a Dual formulation aˆ arg ma A C ' a X a X subject to a t LASSO Least absolute shrinage and selection operator 23 Oct /

75 LASSO Algorithms Various conve optimization algorithms LARS: Least angle regression Pathwise coordinate descent.. Matlab code available from web 23 Oct /

76 Regularized least squares Image Credit: ibshirani Regularization results in selection of suboptimal in leastsquares sense solution One of the loci outside center ihonov regularization selects shortest solution L regularization selects sparsest solution 23 Oct /

77 LASSO and Compressive Sensing Y = X a Given Y and X, estimate sparse W LASSO: CS: X = eplanator variable Y = dependent variable a = weights of regression X = measurement matri Y = measurement a = data 23 Oct /

78 An interesting problem: Predicting War! Economists measure a number of social indicators for countries weel Happiness inde Hunger inde Freedom inde witter records Question: Will there be a revolution or war net wee? 23 Oct /

79 An interesting problem: Predicting War! Issues: Dissatisfaction builds up not an instantaneous phenomenon Usuall War / rebellion build up much faster Often in hours Important to predict Preparedness for securit Economic impact 23 Oct /

80 Predicting War O O 2 O 3 O 4 O 5 O 6 O 7 O 8 W W W W W W W W S S S S S S S S Given Sequence of economic indicators for each wee Sequence of unrest marers for each wee At the end of each wee we now if war happened or not that wee Predict probabilit of unrest net wee w w2 w3 w4 w5w6 w7w8 his could be a new unrest or persistence of a current one 23 Oct /

81 A Step Aside: Predicting ime Series An HMM is a model for time-series data How can we use it predict the future? 23 Oct /8797 8

82 Predicting with an HMM Given Observations O..O t All HMM parameters Learned from some training data Must estimate future observation O t+ Estimate must consider entire histor O..O t No nowledge of actual state of the process at an time 23 Oct /

83 Predicting with an HMM a s, t P O, O2,..., Ot, state t s as,t s t t+ time Given O..O t Compute PO.. O t,s Using the forward algorithm computes as,t P s t s O.. t s' P st s, O.. t P s s', O t 23 Oct / t a s, t a s', t s'

84 Predicting with an HMM s as,t+ Given Ps t =s O..t for all s Ps t+ = s O..t = S s Ps t =s O..t Ps s PO t+,s O..t = PO s Ps t+ =s O..t PO t+ O..t = S s PO t+,s O..t = S s PO s Ps t+ =s O..t his is a miture distribution 23 Oct /8797 time 84

85 Predicting with an HMM s as,t+ PO t+ O..t = S s PO t+,s O..t = S s PO s Ps t+ =s O.. time MMSE estimate of O t+ given O..t E[O t+ O..t ] = S s Ps t+ =s O.. E[O s] A weighted sum of the state means 23 Oct /

86 Predicting with an HMM MMSE Estimate of O t+ = E[O t+ O.. ] E[O t+ O..t ] = S s Ps t+ =s O.. E[O s] If PO s is a GMM EO s = S P s,s 23 Oct / s s s t t w O s P O,,.. ˆ s s s s t w s t s t O,, ' ',, ˆ a a

87 Predicting War rain an HMM on z = [w, s] After the t th wee, predict probabilit distribution: Pz t z z t = Pw, z z..z t Marginalize out not nown for net wee War? E[w z..z t ] P w z.. t P w, s z.. t ds 23 Oct /

Machine Learning for Signal Processing Bayes Classification

Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj - Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification