ee378a spring 2013, April 1st intro lecture: statistical signal processing / inference, estimation, and information processing


1 ee378a: statistical signal processing / inference, estimation, and information processing (spring 2013, April 1st intro lecture)

2 what is statistical signal processing? anything & everything: inference, estimation, and information processing

3 opportunities: a lot of freedom

4 challenges: resist the temptation to do too much too superficially; coherence

5 attempt
- rely on and tap into ee278b
- be coherent and self-contained
- be useful [broadly construed]
- be aware of and not redundant given other courses in ee (cf., in particular, ee378b), cs, stats, etc.

6 themes
- Bayesian inference and estimation
- linear vs. non-linear, causal vs. non-causal, exact vs. approximate inference
- hidden Markov processes + some more general models
- prediction, filtering, and denoising (modern vs. classical, Bayesian vs. deterministic vs. semi-stochastic)
- more

7 course outline
- intro
- framework for general Bayesian inference
- Markov triplets, conditional independence, a bit about undirected graphical models
- inference and state estimation for Hidden Markov Processes (HMPs)
- Markov Chain Monte Carlo and Importance Sampling
- approximate inference, particle filtering
- universal denoising

8 course outline (cont.)
optional topics (according to remaining time and interest):
- relations between information and estimation and their applications
- inference under logarithmic loss
- directed information: estimation and applications

9 general Bayesian inference framework

10 general Bayesian inference framework (cont.)

11 loss function: $d : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_+$

12 example I: squared error: $\mathcal{X} = \hat{\mathcal{X}} = \mathbb{R}$, $d(x, \hat{x}) = (x - \hat{x})^2$

13 example II: Hamming: $\hat{\mathcal{X}} = \mathcal{X}$, $d(x, \hat{x}) = \mathbf{1}\{x \neq \hat{x}\}$

14 example III: log-loss: $\hat{\mathcal{X}} = \mathcal{M}(\mathcal{X})$ (the probability measures on $\mathcal{X}$), $d(x, \hat{x}) = \log \frac{1}{\hat{x}(x)} = D(\delta_x \| \hat{x})$

15 Bayes optimum: $X \sim P_X$, $\min_{\hat{x}} E\,d(X, \hat{x}) = ?$

16 example I: squared error: $\min_{\hat{x}} E\,d(X, \hat{x}) = E\,d(X, EX) = \mathrm{Var}(X)$

17 example II: Hamming: $\min_{\hat{x}} E\,d(X, \hat{x}) = E\,d(X, \hat{x}_{\mathrm{ML}}) = 1 - P(X = \hat{x}_{\mathrm{ML}})$

18 example III: log-loss: $E\,d(X, \hat{x}) = H(X) + D(P_X \| \hat{x})$, so $\min_{\hat{x}} E\,d(X, \hat{x}) = E\,d(X, P_X) = H(X)$
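
The three solutions are easy to verify numerically. A minimal sketch (my own illustration; the toy pmf is made up) confirming that the mean, the mode, and the distribution $P_X$ itself are Bayes optimal under squared error, Hamming loss, and log-loss, respectively:

```python
import numpy as np

# Toy distribution: X takes values 0..3 with these probabilities.
support = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

# Squared error: the Bayes estimate is the mean, and the Bayes risk is Var(X).
mean = np.dot(p, support)
var = np.dot(p, (support - mean) ** 2)
print(f"squared error: x_hat = {mean:.2f}, Bayes risk = Var(X) = {var:.3f}")

# Hamming loss: the Bayes estimate is the mode, risk = 1 - P(X = mode).
mode = support[np.argmax(p)]
print(f"Hamming loss:  x_hat = {mode:.0f}, Bayes risk = {1 - p.max():.3f}")

# Log-loss: the Bayes "estimate" is the distribution P_X itself, risk = H(X).
entropy = -np.dot(p, np.log2(p))
print(f"log-loss:      x_hat = P_X, Bayes risk = H(X) = {entropy:.3f} bits")

# Sanity check: any other guess q incurs H(X) + D(P || q) > H(X).
q = np.array([0.25, 0.25, 0.25, 0.25])
risk_q = -np.dot(p, np.log2(q))
kl = np.dot(p, np.log2(p / q))
assert np.isclose(risk_q, entropy + kl)
```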

19 estimation: everything carries over

20 linear estimation

21 non-linear estimation

22 approximate inference

23 identifying conditional independence

24 denoising

25 how to move from [figure: a noisy sequence]

26 to [figure: its denoised reconstruction]

27 discrete denoising

28 [discrete data]: text, genomics, bits, ...

29 approaches: fully stochastic (already mentioned); universality

30 DUDE algorithm
- low complexity
- universal optimality guarantees (stochastic + semi-stochastic settings)
- does well in some practical settings
- principles can be lifted to analogue data
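
To convey the flavor of the algorithm, here is a minimal sketch of the binary DUDE for a BSC with known crossover probability delta (my own toy rendition with made-up parameters, not the course's reference implementation). In a first pass it counts, for each double-sided context of radius k, how often the center bit is 0 or 1; in a second pass it flips a bit whenever the opposite symbol outnumbers it in its context by more than the channel-determined threshold $\frac{(1-\delta)^2 + \delta^2}{2\delta(1-\delta)}$:

```python
from collections import Counter
import numpy as np

def binary_dude(z, delta, k):
    """Two-pass discrete universal denoiser for a BSC(delta), context radius k."""
    n = len(z)
    counts = Counter()
    # Pass 1: count center symbols within each double-sided context.
    for i in range(k, n - k):
        ctx = (tuple(z[i-k:i]), tuple(z[i+1:i+k+1]))
        counts[ctx, z[i]] += 1
    # Pass 2: flip the center bit when the opposite symbol outnumbers it
    # by more than ((1-d)^2 + d^2) / (2 d (1-d))  (channel-inverted Bayes rule).
    thresh = ((1 - delta) ** 2 + delta ** 2) / (2 * delta * (1 - delta))
    x_hat = list(z)
    for i in range(k, n - k):
        ctx = (tuple(z[i-k:i]), tuple(z[i+1:i+k+1]))
        if counts[ctx, 1 - z[i]] > thresh * counts[ctx, z[i]]:
            x_hat[i] = 1 - z[i]
    return x_hat

rng = np.random.default_rng(0)
x = [0]
for _ in range(100000):                            # clean signal: binary Markov chain
    x.append(x[-1] if rng.random() < 0.9 else 1 - x[-1])
delta = 0.1
z = [b ^ int(rng.random() < delta) for b in x]     # observed: x through a BSC(0.1)
x_hat = binary_dude(z, delta, k=4)
print("noisy BER:", np.mean([a != b for a, b in zip(x, z)]))
print("DUDE  BER:", np.mean([a != b for a, b in zip(x, x_hat)]))
```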

31 some relations between information and estimation

32 $Y = \sqrt{\gamma}\,X + W$, where $W$ is a standard Gaussian, independent of $X$; $I(\gamma) = I(X;Y)$, $\mathrm{mmse}(\gamma) = E\,(X - E[X \mid Y])^2$

33 I-MMSE: $\frac{d}{d\gamma} I(\gamma) = \frac{1}{2}\,\mathrm{mmse}(\gamma)$, or in its integral version, $I(\mathrm{snr}) = \frac{1}{2} \int_0^{\mathrm{snr}} \mathrm{mmse}(\gamma)\,d\gamma$
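
For standard Gaussian $X$ both sides have closed forms, $I(\gamma) = \frac{1}{2}\ln(1+\gamma)$ nats and $\mathrm{mmse}(\gamma) = 1/(1+\gamma)$, so the identity can be checked numerically. A minimal sketch (my own illustration, using scipy for the quadrature):

```python
import numpy as np
from scipy.integrate import quad

# For X ~ N(0,1) and Y = sqrt(gamma) X + W with W ~ N(0,1):
I = lambda g: 0.5 * np.log1p(g)        # mutual information in nats
mmse = lambda g: 1.0 / (1.0 + g)       # E (X - E[X|Y])^2

snr = 3.0
# Integral form of I-MMSE: I(snr) = 1/2 * integral_0^snr mmse(g) dg
lhs = I(snr)
rhs = 0.5 * quad(mmse, 0.0, snr)[0]
print(lhs, rhs)                        # both 0.6931...

# Derivative form: dI/dgamma = mmse(gamma)/2, checked by finite differences
g, h = 1.5, 1e-6
print((I(g + h) - I(g - h)) / (2 * h), 0.5 * mmse(g))
```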

34 continuous time: $dY_t = \sqrt{\gamma}\,X_t\,dt + dW_t$, $0 \le t \le T$; $I(\gamma) = I(X^T; Y^T)$, $\mathrm{mmse}(\gamma) = E \int_0^T (X_t - E[X_t \mid Y^T])^2\,dt$

35 $\frac{d}{d\gamma} I(\gamma) = \frac{1}{2}\,\mathrm{mmse}(\gamma)$, or in its integral version, $I(\mathrm{snr}) = \frac{1}{2} \int_0^{\mathrm{snr}} \mathrm{mmse}(\gamma)\,d\gamma$

36 Duncan: $dY_t = X_t\,dt + dW_t$, $0 \le t \le T$, where $W$ is standard white Gaussian noise, independent of $X$. [Duncan 1970]: $I(X^T; Y^T) = \frac{1}{2}\,E \int_0^T (X_t - E[X_t \mid Y^t])^2\,dt$

37 SNR in Duncan: $dY_t = \sqrt{\gamma}\,X_t\,dt + dW_t$, $0 \le t \le T$; $I(\gamma) = I(X^T; Y^T)$, $\mathrm{cmmse}(\gamma) = E \int_0^T (X_t - E[X_t \mid Y^t])^2\,dt$. [Duncan 1970]: $I(\gamma) = \frac{\gamma}{2}\,\mathrm{cmmse}(\gamma)$

38 recap: [Duncan 1970]: $I(\gamma) = \frac{\gamma}{2}\,\mathrm{cmmse}(\gamma)$; [Guo, Shamai and Verdú 2005], [Zakai 2005]: $I(\mathrm{snr}) = \frac{1}{2} \int_0^{\mathrm{snr}} \mathrm{mmse}(\gamma)\,d\gamma$

39 relationship between cmmse and mmse? [Guo, Shamai and Verdú 2005]: $\mathrm{cmmse}(\mathrm{snr}) = \frac{1}{\mathrm{snr}} \int_0^{\mathrm{snr}} \mathrm{mmse}(\gamma)\,d\gamma$

40 mismatch: $Y = \sqrt{\gamma}\,X + W$, where $W$ is a standard Gaussian, independent of $X$. What if $X \sim P$ but the estimator thinks $X \sim Q$? $\mathrm{mse}_{P,Q}(\gamma) = E_P\,(X - E_Q[X \mid Y])^2$

41 a representation of relative entropy: $D(P_{Y_{\mathrm{snr}}} \| Q_{Y_{\mathrm{snr}}}) = \frac{1}{2} \int_0^{\mathrm{snr}} \left[ \mathrm{mse}_{P,Q}(\gamma) - \mathrm{mse}_{P,P}(\gamma) \right] d\gamma$
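
For Gaussian $P$ and a mismatched Gaussian prior $Q$, every quantity in this representation has a closed form, so it can be checked directly. A minimal sketch (my own illustration; the variance mismatch and snr values are made up):

```python
import numpy as np
from scipy.integrate import quad

s = 4.0    # mismatched prior variance: Q thinks X ~ N(0, s), truth is N(0, 1)
snr = 2.0

def mse_PQ(g, s):
    """E_P (X - E_Q[X|Y])^2 for Y = sqrt(g) X + W, with X ~ N(0,1) under P."""
    a = np.sqrt(g) * s / (g * s + 1.0)   # Q's conditional-mean (linear) estimator
    return (1.0 - a * np.sqrt(g)) ** 2 + a ** 2

# Right side: half the integrated excess mse due to mismatch
rhs = 0.5 * quad(lambda g: mse_PQ(g, s) - mse_PQ(g, 1.0), 0.0, snr)[0]

# Left side: D(P_Y || Q_Y) in nats, for the two zero-mean Gaussian output laws
vp, vq = 1.0 + snr, 1.0 + snr * s
lhs = 0.5 * (np.log(vq / vp) + vp / vq - 1.0)

print(lhs, rhs)   # agree to numerical precision
```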

42 causal vs. non-causal mismatched estimation: $dY_t = \sqrt{\gamma}\,X_t\,dt + dW_t$, $0 \le t \le T$, where $W$ is standard white Gaussian noise, independent of $X$. $\mathrm{cmse}_{P,Q}(\gamma) = E_P \int_0^T (X_t - E_Q[X_t \mid Y^t])^2\,dt$, $\mathrm{mse}_{P,Q}(\gamma) = E_P \int_0^T (X_t - E_Q[X_t \mid Y^T])^2\,dt$. Relationship between $\mathrm{cmse}_{P,Q}$ and $\mathrm{mse}_{P,Q}$?

43 relationship between $\mathrm{cmse}_{P,Q}$ and $\mathrm{mse}_{P,Q}$: $\mathrm{cmse}_{P,Q}(\mathrm{snr}) = \frac{1}{\mathrm{snr}} \int_0^{\mathrm{snr}} \mathrm{mse}_{P,Q}(\gamma)\,d\gamma = \frac{2}{\mathrm{snr}} \left[ I(\mathrm{snr}) + D\!\left( P_{Y^T} \,\|\, Q_{Y^T} \right) \right]$

45 minimax estimation: $\mathrm{minimax}(\mathcal{P}, \mathrm{snr}) \triangleq \min_{\hat{X}(\cdot)}\ \max_{P \in \mathcal{P}} \left\{ \mathrm{cmse}_{P,\hat{X}}(\mathrm{snr}) - \mathrm{cmse}_{P,P}(\mathrm{snr}) \right\}$ ("Minimax Filtering via Relations Between Information and Estimation")

46 ?

47 under the right loss function $d(x, \hat{x})$

48 but information is estimation: $H(X) = \sum_x P_X(x) \log \frac{1}{P_X(x)}$, $\quad D(P \| Q) = \int \log \frac{dP}{dQ}\,dP$, $\quad I(U;V) = D(P_{U,V} \| P_U P_V)$

49 shocker: $I(X;Y)$ is a natural measure of dependence: $I(X;Y) = D(P_{X,Y} \| P_X P_Y)$; $I(X;Y) = I(Y;X)$; $I(f(X); g(Y)) = I(X;Y)$ if $f$ and $g$ are one-to-one; chain rules
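
These properties are cheap to verify numerically for a toy joint pmf. A minimal sketch (my own illustration; the pmf is randomly generated, and the one-to-one maps are alphabet permutations):

```python
import numpy as np

rng = np.random.default_rng(1)
pxy = rng.random((3, 4)); pxy /= pxy.sum()      # a random joint pmf P_{X,Y}
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    return np.sum(p * np.log2(p / q))

# I(X;Y) as relative entropy between the joint and the product of marginals
I_xy = kl(pxy, np.outer(px, py))

# symmetry: transposing the joint swaps the roles of X and Y
I_yx = kl(pxy.T, np.outer(py, px))
print(I_xy, I_yx)                               # identical

# invariance under one-to-one relabelings f, g (here: alphabet permutations)
perm_x, perm_y = rng.permutation(3), rng.permutation(4)
print(kl(pxy[np.ix_(perm_x, perm_y)], np.outer(px[perm_x], py[perm_y])))
```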

51 chain rule

52 directed information (Massey): $I(X^n \to Y^n) = \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1})$; compare to: $I(X^n; Y^n) = \sum_{i=1}^n I(X^n; Y_i \mid Y^{i-1})$

56 measure of causal relevance: $X_1, X_2, X_3, \ldots, X_{i-1}, X_i, \ldots$ and $Y_1, Y_2, Y_3, \ldots, Y_{i-1}, Y_i, \ldots$; note what is captured by $I(X^i; Y_i \mid Y^{i-1})$ and $I(Y^{i-1}; X_i \mid X^{i-1})$, and hence, summing over $i$, by $I(X^n \to Y^n)$ and $I(Y^{n-1} \to X^n)$

62 conservation law: $I(X^n \to Y^n) + I(Y^{n-1} \to X^n) = I(X^n; Y^n)$
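
The conservation law can be verified numerically, e.g. for $n = 2$ and binary alphabets. A minimal sketch (my own illustration; the joint pmf is randomly generated):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((2, 2, 2, 2)); p /= p.sum()   # joint pmf of (X1, X2, Y1, Y2)

def H(axes):
    """Entropy (bits) of the marginal of p over the given axes."""
    drop = tuple(a for a in range(4) if a not in axes)
    m = p.sum(axis=drop) if drop else p
    return -np.sum(m * np.log2(m))

X1, X2, Y1, Y2 = 0, 1, 2, 3
# directed information I(X^2 -> Y^2) = I(X1;Y1) + I(X1,X2 ; Y2 | Y1)
fwd = (H((X1,)) + H((Y1,)) - H((X1, Y1))) \
    + (H((X1, X2, Y1)) + H((Y1, Y2)) - H((X1, X2, Y1, Y2)) - H((Y1,)))
# reverse directed information I(Y^1 -> X^2) = I(Y1 ; X2 | X1)
rev = H((Y1, X1)) + H((X1, X2)) - H((Y1, X1, X2)) - H((X1,))
# mutual information I(X^2 ; Y^2)
mi = H((X1, X2)) + H((Y1, Y2)) - H((X1, X2, Y1, Y2))

print(fwd + rev, mi)   # conservation law: the two agree
```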

64 relevant for: neural spikes, financial time series, medical data, page ranking, social or sensor networks, etc.

65 we seek good estimators for: $I(X^n \to Y^n)$

67 a bit on previous work on directed info estimation: [Granger, Econometrica 1969], "Investigating causal relations by econometric models and cross-spectral methods"; later extensions applied in finance and brain studies

68 an approach: build on universal sequential probability assignments

69 universal compressor ⇔ universal probability assignment ⇔ universal sequential probability assignment; e.g.: Lempel-Ziv 78
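
As a hedged illustration of the compressor-to-probability-assignment connection, here is a minimal LZ78 incremental-parsing sketch (my own, with made-up data): the number of phrases $c$ controls the LZ78 codelength, roughly $c \log_2 c$ bits, and $c \log_2 c / n$ converges (slowly) to the entropy rate of a stationary ergodic source.

```python
import math, random

def lz78_phrases(s):
    """Incremental (LZ78) parsing: each new phrase is the longest
    previously seen phrase extended by one symbol."""
    dictionary, phrases, cur = {""}, [], ""
    for ch in s:
        if cur + ch in dictionary:
            cur += ch
        else:
            dictionary.add(cur + ch)
            phrases.append(cur + ch)
            cur = ""
    if cur:
        phrases.append(cur)
    return phrases

random.seed(0)
s = "".join(random.choice("01") for _ in range(200000))  # fair coin flips
c = len(lz78_phrases(s))
# c*log2(c)/n approximates the entropy rate, here 1 bit/symbol
# (convergence is slow, so expect the estimate to be in the ballpark only).
print(c * math.log2(c) / len(s))
```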

70 (celebrated) example (of a universal sequential probability assignment): CTW [Willems, Shtarkov, Tjalkens, 1995], [Willems, 1998]: universal, optimal convergence rates, linear complexity
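
CTW mixes Krichevsky-Trofimov (KT) estimators over a context tree; as an illustration of the building block only (a sketch, not the full CTW mixture), here is the KT sequential probability assignment for a single binary context:

```python
import math, random

def kt_log_loss(bits):
    """Sequential KT assignment: P(next = 1 | n0 zeros, n1 ones so far)
    = (n1 + 1/2) / (n0 + n1 + 1); returns the total log-loss in bits."""
    n0 = n1 = 0
    total = 0.0
    for b in bits:
        p1 = (n1 + 0.5) / (n0 + n1 + 1.0)
        total -= math.log2(p1 if b else 1.0 - p1)
        n0, n1 = n0 + (1 - b), n1 + b
    return total

random.seed(0)
bits = [int(random.random() < 0.2) for _ in range(100000)]  # Bernoulli(0.2)
# Per-symbol log-loss approaches the binary entropy h(0.2) ~ 0.722 bits,
# uniformly over the unknown bias: this is the sense of universality.
print(kt_log_loss(bits) / len(bits))
```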

72 Estimator 1. Theorem: let $Q$ be a universal probability assignment and $(\mathbf{X}, \mathbf{Y})$ be jointly stationary ergodic. Then $\lim_{n \to \infty} \hat{I}_1(X^n \to Y^n) = I(\mathbf{X} \to \mathbf{Y})$ in $L_1$. If $Q$ is also pointwise universal, then the limit holds almost surely as well. If $(\mathbf{X}, \mathbf{Y})$ is a stationary ergodic aperiodic Markov process, we can say more about the rate of convergence (next slide).

73 Estimator 1 (cont.). Theorem: let $Q$ be the CTW sequential probability assignment. If $(\mathbf{X}, \mathbf{Y})$ is a jointly stationary ergodic aperiodic Markov process (of arbitrary order), then there exists a constant $C_1$ such that $E\left| \hat{I}_1(X^n \to Y^n) - I(\mathbf{X} \to \mathbf{Y}) \right| \le C_1\, n^{-1/2} \log n$, and for every $\epsilon > 0$, $\left| \hat{I}_1(X^n \to Y^n) - I(\mathbf{X} \to \mathbf{Y}) \right| = o\!\left( n^{-1/2} (\log n)^{5/2 + \epsilon} \right)$ $P$-a.s.

74 Estimator 1 (cont.). Theorem: this is essentially the best you can do.
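
As a hedged illustration of the structure of such estimators only: the directed information rate can be estimated as the per-symbol log-loss of sequentially predicting $Y$ from its own past, minus that of predicting $Y$ from its past together with the causally available $X$. A toy rendition follows (the name I1_hat and the order-1 plug-in contexts are my assumptions; the actual Estimator 1 uses CTW assignments):

```python
import math, random
from collections import defaultdict

def seq_log_loss(symbols, contexts):
    """Sequential KT-smoothed log-loss (bits) of binary `symbols`,
    each predicted from its associated side context."""
    counts = defaultdict(lambda: [0.5, 0.5])  # add-1/2 counts per context
    total = 0.0
    for s, c in zip(symbols, contexts):
        n0, n1 = counts[c]
        total -= math.log2((n1 if s else n0) / (n0 + n1))
        counts[c][s] += 1
    return total

def I1_hat(x, y):
    """Difference of two sequential log-losses: predicting Y from its own
    past, versus from its past plus the causally available X."""
    n = len(y)
    ctx_y = [(y[i - 1] if i else None,) for i in range(n)]
    ctx_xy = [(y[i - 1] if i else None, x[i]) for i in range(n)]
    return (seq_log_loss(y, ctx_y) - seq_log_loss(y, ctx_xy)) / n

random.seed(0)
x = [random.randint(0, 1) for _ in range(200000)]
y = [xi if random.random() < 0.9 else 1 - xi for xi in x]  # Y = X through a BSC(0.1)
# True directed information rate here is 1 - h(0.1) ~ 0.531 bits
print(I1_hat(x, y))
```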

75 4 estimators: $\hat{I}_1(Y^n \to X^n)$, $\hat{I}_2(Y^n \to X^n)$, $\hat{I}_3(Y^n \to X^n)$, $\hat{I}_4(Y^n \to X^n)$ [figure: the four estimates vs. $n$]

76 some points: these are prototypes; ideas can be tweaked and generalized; can be combined

77 Hang Seng Index (HSI) and Dow Jones Industrial Average (DJIA) [figure: the two indices vs. year]

78 [figure, Algorithms 1-4: estimates of $\frac{1}{n} I(X^n; Y^n)$, $\frac{1}{n} I(X^n \to Y^n)$, and $\frac{1}{n} I(Y^{n-1} \to X^n)$ vs. year, one panel per algorithm] HSI represented by $X$, DJI represented by $Y$.

79 staff: instructor: Tsachy Weissman; TA: Albert No; TA: Idoia Ochoa; admin: Doug Chaffee; office hours: TBG

80 course info: lectures: Mon & Wed, 11:00am-12:15pm; review sessions: TBA; prerequisite: ee278b or equivalent

81 requirements
- homework (approximately fortnightly hw sets); problems ranging from theoretical to algorithmic to experimental
- possibly: lecture scribing
- final project
- grading based on the above

82 reading: almost ready... the website will have a list of books and papers, parts of which will be relevant for parts of the course; pointers to last year's lecture scribes will also be given as relevant

83 project stages
- form groups and choose a general area
- choose an interesting recent paper from that area
- read, understand, and reproduce the results of that paper
- develop new scheme[s] or improve current ones
- final project presentations [instead of a final]

84 project stage i: form groups and choose a general area. We'll let you know the group size before long; choice of general area from: genomics, finance, neuroscience, image denoising, text correction, etc. Deadline: Wednesday, April 10th (by email).

85 project stage ii: choose an interesting recent paper from that area. We'll provide you with some suggestions; if it's within our fields of knowledge, you'd be welcome to suggest something not on the list. Deadline for choice: Wednesday, April 17th (by email).

86 project stage iii: reading, understanding, reproducing results
- understand the problem setting
- understand the results (theoretical/experimental)
- understand the algorithm(s)
- reproduce experimental results (fine if you don't implement schemes from scratch for this part)
- write a report summarizing the above; report due Wednesday, May 8th (by email)

87 main project stage: develop new scheme[s] or improve current one[s]. Project presentations: Wednesday, June 5th in class, and Thursday, June 6th from morning until early afternoon; deadline for project report: Wednesday, June 12th (by email).

88 criteria for assessment of main part
1. Experimental: How well does your scheme do on real data? How does it perform compared to current practice?
2. Algorithmic: Complexity: how simple is the implementation? [running time, required memory size, etc.] Elegance: is your scheme graceful?
3. Analysis: Modeling: how are you modeling (if you are) the data? How realistic is this model? What features of the problem is it incorporating? What features is it neglecting? Analysis: what kind of performance can you guarantee? How close is it to optimum?

92 grading on project. If you score highly under:
- either the first or the third criterion, we would consider that to be a really good project.
- any two of the three criteria, we would consider that a great project.
- all three criteria, we would consider that to be an amazing project.

95 we can help with: grouping, list and choice of papers, data, computation

96 questions?
