
1 Bhaskar Rao, Department of Electrical and Computer Engineering, University of California, San Diego
David Wipf, Biomedical Imaging Laboratory, University of California, San Francisco
Special Thanks to Yuzhe Jin

2 Outline: Part 1
Motivation for Tutorial
Sparse Signal Recovery Problem
Applications
Computational Algorithms: Greedy Search, ℓ1 norm minimization
Performance Guarantees

3 Outline: Part 2
Motivation: Limitations of popular inverse methods
Maximum a posteriori (MAP) estimation
Bayesian Inference
Analysis of Bayesian inference and connections with MAP
Applications to neuroimaging

4 Outline: Part 1
Motivation for Tutorial
Sparse Signal Recovery Problem
Applications
Computational Algorithms: Greedy Search, ℓ1 norm minimization
Performance Guarantees

5 Early Works
R. R. Hocking and R. N. Leslie, "Selection of the Best Subset in Regression Analysis," Technometrics.
S. Singhal and B. S. Atal, "Amplitude Optimization and Pitch Estimation in Multipulse Coders," IEEE Trans. Acoust., Speech, Signal Processing, 1989.
S. D. Cabrera and T. W. Parks, "Extrapolation and spectral estimation with iterative weighted norm modification," IEEE Trans. Acoust., Speech, Signal Processing, April 1991.
Many more works.
Our first work: I. F. Gorodnitsky, B. D. Rao and J. George, "Source Localization in Magnetoencephalography using an Iterative Weighted Minimum Norm Algorithm," IEEE Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, pages 167-171, Oct. 1992.

6 Early Session on Sparsity
Organized with Prof. Bresler a special session at the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing:
SPEC-DSP: SIGNAL PROCESSING WITH SPARSENESS CONSTRAINT
Signal Processing with the Sparseness Constraint, B. Rao (University of California, San Diego, USA)
Application of Basis Pursuit in Spectrum Estimation, S. Chen (IBM, USA); D. Donoho (Stanford University, USA)
Parsimony and Wavelet Method for Denoising, H. Krim (MIT, USA); J. Pesquet (University Paris Sud, France); I. Schick (GTE Internetworking and Harvard Univ., USA)
Parsimonious Side Propagation, P. Bradley, O. Mangasarian (University of Wisconsin-Madison, USA)
Fast Optimal and Suboptimal Algorithms for Sparse Solutions to Linear Inverse Problems, G. Harikumar (Tellabs Research, USA); C. Couvreur, Y. Bresler (University of Illinois, Urbana-Champaign, USA)
Measures and Algorithms for Best Basis Selection, K. Kreutz-Delgado, B. Rao (University of California, San Diego, USA)
Sparse Inverse Solution Methods for Signal and Image Processing Applications, B. Jeffs (Brigham Young University, USA)
Image Denoising Using Multiple Compaction Domains, P. Ishwar, K. Ratakonda, P. Moulin, N. Ahuja (University of Illinois, Urbana-Champaign, USA)

7 Motivation for Tutorial
Sparse Signal Recovery is an interesting area with many potential applications. Unification of the theory will provide synergy.
Methods developed for solving the Sparse Signal Recovery problem can be a valuable tool for signal processing practitioners.
Many interesting developments in the recent past make the subject timely.

8 Outline: Part 1
Motivation for Tutorial
Sparse Signal Recovery Problem
Applications
Computational Algorithms: Greedy Search, ℓ1 norm minimization
Performance Guarantees

9 Problem Description
y = Φx + ε
y is the n×1 measurement vector.
Φ is the n×m dictionary matrix, with m >> n.
x is the m×1 desired vector, which is sparse with k nonzero entries.
ε is the additive noise, modeled as additive white Gaussian.

10 Problem Statement
Noise-Free Case: Given a target signal y and a dictionary Φ, find the weights x that solve:
min_x Σ_{i=1}^m I(x_i ≠ 0) subject to y = Φx
where I(.) is the indicator function.
Noisy Case: Given a target signal y and a dictionary Φ, find the weights x that solve:
min_x Σ_{i=1}^m I(x_i ≠ 0) subject to ||y − Φx||_2 ≤ β, for some noise-dependent tolerance β.

11 Complexity
Search over all possible subsets, which would mean a search over a total of C(m, k) subsets. Combinatorial complexity.
With m = 30, n = 20, and k = 10 there are C(30, 10) ≈ 3 × 10^7 subsets (very complex).
A branch and bound algorithm can be used to find the optimal solution. The space of subsets searched is pruned, but the search may still be very complex.
The indicator function is not continuous and so is not amenable to standard optimization tools.
Challenge: Find low-complexity methods with acceptable performance.
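To make the combinatorial cost concrete, here is a minimal, self-contained sketch (not from the tutorial) of the exhaustive search over all C(m, k) supports; the dictionary, measurement, and sparsity level below are illustrative placeholders.

```python
import itertools
import numpy as np

def exhaustive_l0(Phi, y, k):
    """Brute-force search over all k-column supports of Phi.

    Returns the support and coefficients with the smallest least-squares
    residual -- feasible only for tiny problems, since the loop visits
    C(m, k) subsets."""
    n, m = Phi.shape
    best_support, best_coeffs, best_res = None, None, np.inf
    for support in itertools.combinations(range(m), k):
        cols = Phi[:, support]
        coeffs, *_ = np.linalg.lstsq(cols, y, rcond=None)
        res = np.linalg.norm(y - cols @ coeffs)
        if res < best_res:
            best_support, best_coeffs, best_res = support, coeffs, res
    return best_support, best_coeffs, best_res

# Tiny illustrative example (m = 10, k = 2): already 45 subsets to check.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((8, 10))
x_true = np.zeros(10); x_true[[2, 7]] = [1.0, -0.5]
y = Phi @ x_true
print(exhaustive_l0(Phi, y, k=2)[0])   # expected support: (2, 7)
```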

12 Outline: Part 1
Motivation for Tutorial
Sparse Signal Recovery Problem
Applications
Computational Algorithms: Greedy Search, ℓ1 norm minimization
Performance Guarantees

13 Applications
Signal Representation (Mallat, Coifman, Wickerhauser, Donoho, ...)
EEG/MEG (Leahy, Gorodnitsky, Ioannides, ...)
Functional Approximation and Neural Networks (Chen, Natarajan, Cun, Hassibi, ...)
Bandlimited extrapolation and spectral estimation (Papoulis, Lee, Cabrera, Parks, ...)
Speech Coding (Ozawa, Ono, Kroon, Atal, ...)
Sparse channel equalization (Fevrier, Greenstein, Proakis, ...)
Compressive Sampling (Donoho, Candes, Tao, ...)
Magnetic Resonance Imaging (Lustig, ...)

14 DFT Example
Measurement: y[l] = 2(cos ω0·l + cos ω1·l), l = 0, 1, 2, ..., n−1, for two closely spaced frequencies ω0, ω1.
Dictionary elements: φ_l^(m) = [1, e^{jω_l}, e^{j2ω_l}, ..., e^{j(n−1)ω_l}]^T, with ω_l = 2πl/m.
Consider m = 64, 128, 256 and 512.
Questions:
What is the result of a zero-padded DFT?
When viewed as the problem of solving a linear system of equations with dictionary Φ^(m), what solution does the DFT give us?
Are there more desirable solutions for this problem?

15 DFT Example
Note that y can be written exactly as a combination of a few columns of Φ^(128), of Φ^(256), and of Φ^(512) (four columns of each, two per cosine).
Consider the linear system of equations y = Φ^(m) x.
The frequency components in the data are in the dictionaries Φ^(m) for m = 128, 256, 512.
What solution among all possible solutions does the DFT compute?

16 DFT Example
(Figure: zero-padded DFT solutions for m = 64, 128, 256, and 512.)
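A hedged illustration of the question above: the sketch below builds a two-tone signal and inspects how many DFT coefficients are significant for several zero-padding lengths m. The specific frequencies and the record length n = 64 are illustrative choices, not the tutorial's values.

```python
import numpy as np

n = 64
# Illustrative closely spaced frequencies (placeholders)
w0, w1 = 2 * np.pi * 0.10, 2 * np.pi * 0.11
l = np.arange(n)
y = 2 * (np.cos(w0 * l) + np.cos(w1 * l))

for m in (64, 128, 256, 512):
    # Zero-padded DFT: fitting y with the m-column Fourier dictionary
    Y = np.fft.fft(y, n=m)
    mags = np.abs(Y)
    # The DFT solution is not sparse: many coefficients carry energy
    big = np.sum(mags > 0.1 * mags.max())
    print(f"m = {m:4d}: {big} coefficients above 10% of the peak magnitude")
```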

17 Sparse Channel Estimation
Received sequence: r(i) = Σ_{j=0}^{m−1} s(i − j) c(j) + ε(i), i = 0, 1, ..., n−1
s: training sequence; c: channel impulse response; ε: noise.

18 Example: Sparse Channel Estimation
Formulated as a sparse signal recovery problem:
[r(0); r(1); ...; r(n−1)] = [s(0) s(−1) ... s(−m+1); s(1) s(0) ... s(−m+2); ...; s(n−1) s(n−2) ... s(n−m)] · [c(0); c(1); ...; c(m−1)] + [ε(0); ε(1); ...; ε(n−1)]
Can use any relevant algorithm to estimate the sparse channel coefficients.
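As a rough sketch of how this becomes a matrix sparse recovery problem, the code below assembles the convolution matrix S so that r ≈ S c. The training sequence, channel taps, and noise level are made-up values, and the training sequence is assumed to be zero outside the observed record.

```python
import numpy as np

def convolution_matrix(s, m, n):
    """Build the n-by-m matrix with entries s[i - j] (zero outside the record),
    so that r = S @ c + noise models the received sequence above."""
    S = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            if 0 <= i - j < len(s):
                S[i, j] = s[i - j]
    return S

rng = np.random.default_rng(1)
m, n = 20, 60
s = rng.choice([-1.0, 1.0], size=n)                 # training sequence
c = np.zeros(m); c[[1, 7, 15]] = [0.9, -0.4, 0.2]   # sparse channel
S = convolution_matrix(s, m, n)
r = S @ c + 0.01 * rng.standard_normal(n)           # received sequence
# Any sparse recovery algorithm (e.g., OMP later in this part) can now estimate c from (S, r).
```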

19 Compressive Sampling
D. Donoho, "Compressed Sensing," IEEE Trans. on Information Theory, 2006.
E. Candes and T. Tao, "Near Optimal Signal Recovery from Random Projections: Universal Encoding Strategies," IEEE Trans. on Information Theory, 2006.

20 Compressive Sampling
Transform coding: represent the signal x by a sparse coefficient vector b in a transform basis.
What is the problem here? Sampling at the Nyquist rate, then keeping only a small number of nonzero coefficients.
Can we directly acquire the signal below the Nyquist rate?

21 Compressive Sampling
Transform coding: x is represented by its sparse transform coefficients b.
Compressive sampling: acquire y = Ax directly, where the sampling matrix A has far fewer rows than the signal dimension.

22 Compressive Sampling
Computation:
1. Solve for the sparse coefficient vector w consistent with the compressed measurements y.
2. Reconstruction: map w back to the signal x through the transform.
Issues:
Need to recover the sparse signal w under the constraint imposed by y.
Need to design the sampling matrix A.

23 Outline: Part 1
Motivation for Tutorial
Sparse Signal Recovery Problem
Applications
Computational Algorithms: Greedy Search, ℓ1 norm minimization
Performance Guarantees

24 Potential Approaches
Combinatorial complexity, so alternate strategies are needed:
Greedy Search Techniques: Matching Pursuit, Orthogonal Matching Pursuit.
Minimizing Diversity Measures: The indicator function is not continuous. Define surrogate cost functions that are more tractable and whose minimization leads to sparse solutions, e.g., ℓ1 minimization.
Bayesian Methods: Make appropriate statistical assumptions on the solution and apply estimation techniques to identify the desired sparse solution.

25 Greedy Search Method: Matching Pursuit
Select the column that is most aligned with the current residual:
l = argmax_{1 ≤ j ≤ m} |φ_j^T r^(i−1)|, with r^(0) = y and S^(i) the set of indices selected so far.
Remove its contribution from the residual:
Update S^(i): S^(i) = S^(i−1) ∪ {l} if l ∉ S^(i−1); otherwise keep S^(i) the same.
Update r^(i): r^(i) = r^(i−1) − P_{φ_l} r^(i−1) = r^(i−1) − φ_l φ_l^T r^(i−1)
Practical stopping criteria: a fixed number of iterations, or ||r^(i)||_2 smaller than a threshold.
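A minimal Matching Pursuit sketch following the steps above (numpy only; the iteration cap and tolerance are illustrative choices):

```python
import numpy as np

def matching_pursuit(Phi, y, max_iter=50, tol=1e-6):
    """Plain Matching Pursuit: repeatedly pick the column most correlated
    with the residual (the slides assume unit-norm columns) and subtract
    its rank-one projection from the residual."""
    n, m = Phi.shape
    x = np.zeros(m)
    r = y.copy()
    for _ in range(max_iter):
        corr = Phi.T @ r
        l = int(np.argmax(np.abs(corr)))
        coef = corr[l] / (Phi[:, l] @ Phi[:, l])   # division handles non-unit norms
        x[l] += coef                                # the same index may be revisited
        r = r - coef * Phi[:, l]                    # remove the projection onto phi_l
        if np.linalg.norm(r) < tol:
            break
    return x, r
```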

26 Greedy Search Method: Orthogonal Matching Pursuit (OMP)
Select the column that is most aligned with the current residual:
l = argmax_{1 ≤ j ≤ m} |φ_j^T r^(i−1)|, with r^(0) = y and S^(i) the set of indices selected so far.
Remove its contribution from the residual:
Update S^(i): S^(i) = S^(i−1) ∪ {l}
Update r^(i): r^(i) = r^(i−1) − P_{[φ_{l_1}, φ_{l_2}, ..., φ_{l_i}]} r^(i−1), the projection of the residual onto the orthogonal complement of all columns selected so far.
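A corresponding OMP sketch; unlike Matching Pursuit it refits all selected coefficients by least squares at every step, which is the orthogonalization described above (the sparsity level k is assumed known here):

```python
import numpy as np

def omp(Phi, y, k):
    """Orthogonal Matching Pursuit: after each new index is selected, refit the
    coefficients on all selected columns so the residual stays orthogonal to
    the chosen subspace."""
    n, m = Phi.shape
    support = []
    r = y.copy()
    for _ in range(k):
        l = int(np.argmax(np.abs(Phi.T @ r)))
        if l not in support:
            support.append(l)
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        r = y - Phi[:, support] @ coeffs
    x = np.zeros(m)
    x[support] = coeffs
    return x, support
```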

27 Greedy Search Method: Order Recursive OMP
Select the column that maximizes the normalized correlation with the current residual:
l = argmax_{1 ≤ j ≤ m} |φ̃_j^{(i−1)T} r^(i−1)|² / ||φ̃_j^(i−1)||²_2, with r^(0) = y and S^(i) the set of indices selected so far.
Remove its contribution from the residual:
Update S^(i): S^(i) = S^(i−1) ∪ {l}
Update r^(i): r^(i) = r^(i−1) − P_{[φ_{l_1}, ..., φ_{l_i}]} r^(i−1)
Update φ̃_j^(i) = (I − P_{[φ_{l_1}, ..., φ_{l_i}]}) φ_j; this can be computed recursively.

28 Deficiency of Matching Pursuit Type Algorithms
If the algorithm picks a wrong index at an iteration, there is no way to correct this error in subsequent iterations.
Some Recent Algorithms:
Stagewise Orthogonal Matching Pursuit (Donoho, Tsaig, ...)
CoSaMP (Needell, Tropp)

29 Inverse Techniques
For the system of equations Φx = y, the solution set is characterized by {x_s : x_s = Φ⁺y + v, v ∈ N(Φ)}, where N(Φ) denotes the null space of Φ and Φ⁺ = Φ^T(ΦΦ^T)⁻¹.
Minimum Norm Solution: The minimum ℓ2 norm solution x_mn = Φ⁺y is a popular solution.
Noisy Case: The regularized ℓ2 norm solution is often employed and is given by x_reg = Φ^T(ΦΦ^T + λI)⁻¹y.
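The two closed-form solutions above translate directly into code; a small sketch (with illustrative sizes) also confirms the point of the next slide, that the minimum-norm solution is not sparse:

```python
import numpy as np

def min_norm_solution(Phi, y):
    """Minimum l2-norm solution x_mn = Phi^T (Phi Phi^T)^{-1} y for a fat Phi."""
    return Phi.T @ np.linalg.solve(Phi @ Phi.T, y)

def regularized_solution(Phi, y, lam):
    """Regularized l2 solution x_reg = Phi^T (Phi Phi^T + lam I)^{-1} y."""
    n = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 50))
x_true = np.zeros(50); x_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
y = Phi @ x_true
x_mn = min_norm_solution(Phi, y)
print(np.sum(np.abs(x_mn) > 1e-6))   # typically 50: every entry is nonzero
```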

30 Minimum ℓ2 Norm Solution
Problem: The minimum ℓ2 norm solution is not sparse.
Example: for a small underdetermined system Φx = y, the minimum-norm solution x_mn has all entries nonzero, whereas the sparse solution x_0 concentrates the energy in a few entries.
DFT: also computes the minimum ℓ2 norm solution.

31 Diversity Measures
Recall: min_x Σ_{i=1}^m I(x_i ≠ 0) subject to y = Φx
Functionals whose minimization leads to sparse solutions.
Many examples are found in the fields of economics, social science and information theory.
These functionals are usually concave, which leads to difficult optimization problems.

32 Examples of Diversity Measures
ℓ(p≤1) Diversity Measure: E^(p)(x) = Σ_{i=1}^m |x_i|^p, p ≤ 1
As p → 0: lim_{p→0} E^(p)(x) = lim_{p→0} Σ_{i=1}^m |x_i|^p = Σ_{i=1}^m I(x_i ≠ 0)
ℓ1 norm, the convex relaxation of ℓ0: E^(1)(x) = Σ_{i=1}^m |x_i|

33 ℓ1 Diversity Measure
Noiseless case: min_x Σ_{i=1}^m |x_i| subject to Φx = y
Noisy case:
ℓ1 regularization [Candes, Romberg, Tao]: min_x Σ_{i=1}^m |x_i| subject to ||y − Φx||_2 ≤ β
Lasso [Tibshirani], Basis Pursuit Denoising [Chen, Donoho, Saunders]: min_x ||y − Φx||²_2 + λ Σ_{i=1}^m |x_i|
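Many solvers exist for the Lasso / Basis Pursuit Denoising objective above; as one minimal, self-contained sketch (ISTA, iterative soft thresholding, used here as a stand-in rather than an algorithm from the tutorial), the following minimizes ||y − Φx||²_2 + λ||x||_1:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(Phi, y, lam, n_iter=500):
    """Iterative soft thresholding for min_x ||y - Phi x||_2^2 + lam * ||x||_1."""
    L = np.linalg.norm(Phi, 2) ** 2        # spectral norm squared: step-size control
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)       # half the gradient of the data-fit term
        x = soft_threshold(x - grad / L, lam / (2 * L))
    return x
```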

34 ℓ1 Norm Minimization and MAP Estimation
MAP estimation: x̂ = argmax_x p(x|y) = argmax_x [log p(y|x) + log p(x)]
If we assume ε is zero-mean, i.i.d. Gaussian noise and p(x) = Π_i p(x_i), where p(x_i) ∝ exp(−λ|x_i|),
then x̂ = argmin_x ||y − Φx||²_2 + λ Σ_i |x_i|

35 Attractiveness of ℓ1 Methods
Convex optimization, with an associated rich class of optimization algorithms:
Interior point methods
Coordinate descent methods
Question: What is the ability to find the sparse solution?

36 Why Does the Diversity Measure Encourage Sparse Solutions?
min_x ||[x_1, x_2]^T||_p subject to x_1 φ_1 + x_2 φ_2 = y
(Figure: the constraint line x_1 φ_1 + x_2 φ_2 = y together with equal-norm contours ||x||_p = const for 0 < p < 1, p = 1, and p > 1. For p ≤ 1 the pointed contours first touch the constraint line on a coordinate axis, giving a sparse solution; for p > 1 the rounded contours do not.)

37 ℓ1 Norm and Linear Programming
Linear Program (LP): min_x c^T x subject to Φx = y
Key result in LP (Luenberger):
a) If there is a feasible solution, there is a basic feasible solution*.
b) If there is an optimal feasible solution, there is an optimal basic feasible solution.
* If Φ is n×m, then a basic feasible solution is a solution with n nonzero entries.

38 Example with ℓ1 Diversity Measure
(Small example: a 2×3 dictionary Φ and corresponding measurement y.)
Noiseless Case: x_BP = [1, 0, 0]^T (to machine precision)
Noisy Case: assume the measurement noise ε = [0.01, 0.01]^T
ℓ1 regularization result: x_l1r = [0.986, 0, …]^T
Lasso result (λ = 0.05): x_lasso = [0.975, 0, …]^T

39 Example with ℓ1 Diversity Measure
Continue with the DFT example: y[l] = 2(cos ω0·l + cos ω1·l), l = 0, 1, 2, ..., n−1.
The m = 64, 128, 256, 512 DFT cannot separate the adjacent frequency components.
Using ℓ1 diversity measure minimization (m = 256), the two frequency components are resolved.

40 Outline: Part 1
Motivation for Tutorial
Sparse Signal Recovery Problem
Applications
Computational Algorithms: Greedy Search, ℓ1 norm minimization
Performance Guarantees

41 Important Questions
When is the ℓ0 solution unique?
When is the ℓ1 solution equivalent to that of ℓ0?
Noiseless case
Noisy measurements

42 Uniqueness
Definition of Spark: the smallest number of columns of Φ that are linearly dependent. (Spark(Φ) ≤ n + 1)
Uniqueness of the sparsest solution:
If Σ_{i=1}^m I(x_i ≠ 0) < Spark(Φ)/2, then x is the unique solution to
argmin_x Σ_{i=1}^m I(x_i ≠ 0) subject to Φx = y

43 Mutual Coherence
For a given matrix Φ = [φ_1, φ_2, ..., φ_m], the mutual coherence µ(Φ) is defined as
µ(Φ) = max_{1 ≤ i, j ≤ m, i ≠ j} |φ_i^T φ_j| / (||φ_i||_2 ||φ_j||_2)
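Mutual coherence is straightforward to compute; the sketch below estimates it for Gaussian random matrices, in the spirit of the empirical histograms a few slides ahead (the 50-row size and the number of trials are illustrative assumptions, not the tutorial's settings):

```python
import numpy as np

def mutual_coherence(Phi):
    """Largest absolute normalized inner product between distinct columns."""
    G = Phi / np.linalg.norm(Phi, axis=0)   # normalize columns to unit norm
    gram = np.abs(G.T @ G)
    np.fill_diagonal(gram, 0.0)             # ignore the diagonal (i == j)
    return gram.max()

rng = np.random.default_rng(3)
for m in (200, 2000):
    mus = [mutual_coherence(rng.standard_normal((50, m))) for _ in range(20)]
    print(f"m = {m}: mean coherence over 20 trials = {np.mean(mus):.3f}")
```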

44 Performance of ℓ1 Diversity Minimization Algorithms
Noiseless Case [Donoho & Elad '03]:
If Σ_{i=1}^m I(x_i ≠ 0) < (1/2)(1 + 1/µ(Φ)), then x is the unique solution to
argmin_x Σ_{i=1}^m |x_i| subject to Φx = y

45 Performance of ℓ1 Diversity Minimization Algorithms
Noisy Case [Donoho et al. '06]:
Assume y = Φx + ε with ||ε||_2 ≤ β and Σ_{i=1}^m I(x_i ≠ 0) ≤ (1/4)(1 + 1/µ(Φ)).
Then the solution x* = argmin_d Σ_{i=1}^m |d_i| subject to ||y − Φd||_2 ≤ β satisfies
||x* − x||²_2 ≤ 4β² / (1 − µ(Φ)(4||x||_0 − 1))

46 Empirical Distribution of Mutual Coherence (Gaussian Matrices)
Gaussian matrices: entries N(0,1), column norms normalized to 1; randomly generated matrices.
(Figure: histograms of mutual coherence. For …×200 matrices, mean = 0.4025; for …×2000 matrices, mean = 0.161.)

47 Performance of Orthogonal Matching Pursuit
Noiseless Case [Tropp '04]:
If Σ_{i=1}^m I(x_i ≠ 0) < (1/2)(1 + 1/µ(Φ)), then OMP guarantees recovery of x after Σ_{i=1}^m I(x_i ≠ 0) iterations.

48 Performance of Orthogonal Matching Pursuit
Noisy Case [Donoho et al.]:
Assume y = Φx + ε with ||ε||_2 ≤ β, let x_min = min_i |x_i| over the nonzero entries, and assume
Σ_{i=1}^m I(x_i ≠ 0) ≤ (1/2)(1 + 1/µ(Φ)) − β/(µ(Φ) x_min).
Stop OMP when the residual error falls below β. Then the solution of OMP satisfies
||x_OMP − x||²_2 ≤ β² / (1 − µ(Φ)(||x||_0 − 1))

49 Restricted Isometry Constant
Definition [Candes et al.]: For a matrix Φ, the restricted isometry constant δ_k is the smallest constant such that
(1 − δ_k)||x||²_2 ≤ ||Φx||²_2 ≤ (1 + δ_k)||x||²_2
for any k-sparse signal x.

50 Performance of ℓ1 Diversity Measure Minimization Algorithms
[Candes '08]: Assume y = Φx + ε with ||ε||_2 ≤ β, x is k-sparse, and δ_2k < √2 − 1.
Then the solution x* = argmin_d Σ_{i=1}^m |d_i| subject to ||y − Φd||_2 ≤ β satisfies
||x* − x||_2 ≤ C β, where C only depends on δ_2k.

51 Performance of ℓ1 Diversity Measure Minimization Algorithms
[Candes '08]: Assume y = Φx + ε with ||ε||_2 ≤ β, and δ_2k < √2 − 1.
Then the solution x* = argmin_d Σ_{i=1}^m |d_i| subject to ||y − Φd||_2 ≤ β satisfies
||x* − x||_2 ≤ C_1 ||x − x_k||_1 / √k + C_2 β
where C_1, C_2 only depend on δ_2k, and x_k is the vector x with all but the k largest entries set to zero.

52 Matrices with Good RIC
It turns out that random matrices can satisfy the requirement (say, δ_2k < √2 − 1) with high probability.
For an n×m matrix Φ, generate each element Φ_ij ~ N(0, 1/n), i.i.d.
[Candes et al.] If n = O(k log(m/k)), then δ_2k < √2 − 1 with probability approaching 1.
Observations:
A large Gaussian random matrix will have good RIC with high probability.
Similar results can be obtained using other probability ensembles: Bernoulli, random Fourier, ...
For ℓ1-based procedures, the number of measurements required is roughly n > k log m.

53 More General Question
What are the limits of recovery in the presence of noise?
No constraint on the recovery algorithm.
Information theoretic approaches are useful in this case (Wainwright, Fletcher, Akcakaya, ...).
Can connect the problem of support recovery to channel coding over the Gaussian multiple access channel. Capacity regions become useful (Jin).

54 Performance for Support Recovery
Noisy measurements: y = Φx + ε, ε_i ~ N(0, σ²), i.i.d.
Random matrix Φ, Φ_ij ~ N(0, σ_a²), i.i.d.
Performance metric: exact support recovery.
Consider a sequence of problems with increasing sizes.
k = 2 (two-user MAC capacity):
c(x) = min_{T ⊆ {1,2}, T nonempty} (1/(2|T|)) log(1 + (σ_a²/σ²) Σ_{i∈T} x_i²)
Sufficient condition: If limsup_m (log m)/n < c(x), then there exists a sequence of support recovery methods (for the different problems, respectively) such that lim_m P(supp(x̂) ≠ supp(x)) = 0.
Necessary condition: If there exists a sequence of support recovery methods such that lim_m P(supp(x̂) ≠ supp(x)) = 0, then limsup_m (log m)/n ≤ c(x).
The approach can deal with general scaling among (m, n, k).

55 Network Information Theory Perspective
Connection to channel coding (K = 1): y = Φx + ε with a single nonzero entry x_i, so y = x_i φ_i + ε and the index i indicates which column of Φ is selected.

56 (Figure) y = x_i φ_i + ε: the measurement is a scaled, noisy copy of the selected column φ_i.

57 Additive White Gaussian Noise (AWGN) Channel
b = h·a + e
a: channel input; h: channel gain; e: additive Gaussian noise; b: channel output.
Recall y = x_i φ_i + ε.

58-62 (Figure sequence) The AWGN channel analogy is built up step by step: the selected column φ_i plays the role of the channel input a (a codeword), the nonzero value x_i the channel gain h, the noise ε the channel noise e, and the measurement y the channel output b.

63 Connection to Channel Coding (K > 1)
y = x_{s_1} φ_{s_1} + ... + x_{s_K} φ_{s_K} + ε

64 Multiple Access Channel (MAC)
Sparse recovery: y = x_{s_1} φ_{s_1} + ... + x_{s_K} φ_{s_K} + ε
MAC: b = h_1 a_1 + ... + h_K a_K + e

65 Connection Between the Two Problems
y = Φx + ε (nonzeros at positions s_1, ..., s_K) ↔ b = h_1 a_1 + ... + h_K a_K + e
Φ: dictionary / measurement matrix ↔ codebook
φ_i: a column ↔ a codeword
m different positions in x ↔ message set {1, 2, ..., m}
s_1, ..., s_k: indices of nonzeros ↔ the messages selected
x_{s_i}: source activity ↔ channel gain h_i

66 Differences Between the Two Problems
Channel coding: Each sender works with a codebook designed only for that sender.
Support recovery: All senders share the same codebook Φ; different senders select different codewords. (Common codebook)
Channel coding: Capacity results are available when the channel gain h_i is known at the receiver.
Support recovery: We don't know the nonzero signal activities x_i. (Unknown channel)
Proposed approach [Jin, Kim and Rao]:
Leverage MAC capacity results to shed light on the performance limits of exact support recovery.
Conventional methods + customized methods: distance decoding, nonzero signal value estimation, Fano's inequality.

67

68 Outline 1. Motivation: Limitations of popular inverse methods 2. Maximum a posteriori (MAP) estimation 3. Bayesian Inference 4. Analysis of Bayesian inference and connections with MAP 5. Applications to neuroimaging

69 Section I: Motivation

70 Limitation I Most sparse recovery results, using either greedy methods (e.g., OMP) or convex ℓ1 minimization (e.g., BP), place heavy restrictions on the form of the dictionary Φ. While in some situations we can satisfy these restrictions (e.g., compressive sensing), for many applications we cannot (e.g., source localization, feature selection). When the assumptions on the dictionary are violated, performance may be very suboptimal

71 Limitation II The distribution of nonzero magnitudes in the maximally sparse solution x_0 can greatly affect the difficulty of canonical sparse recovery problems. This effect is strongly algorithm-dependent

72 Examples: ℓ1-norm Solutions and OMP
With ℓ1-norm solutions, performance is independent of the magnitudes of nonzero coefficients [Malioutov et al., 2004]. Problem: Performance does not improve when the situation is favorable.
OMP is highly sensitive to these magnitudes. Problem: Performance degrades heavily with unit magnitudes.

73 In General
If the magnitudes of the non-zero elements in x_0 are highly scaled, then the canonical sparse recovery problem should be easier.
(Illustration: an x_0 with scaled coefficients is easy; an x_0 with uniform coefficients is hard.)
The (approximate) Jeffreys distribution produces sufficiently scaled coefficients such that the best solution can always be easily computed.

74 Jeffreys Distribution
Density: p(x) ∝ 1/x
(Figure: the shaded regions A, B, and C all have equal area, reflecting the scale invariance of the density, so samples span many orders of magnitude.)

75 Empirical Example
For each test case:
1. Generate a random dictionary Φ with 50 rows and 100 columns.
2. Generate a sparse coefficient vector x_0.
3. Compute the signal via y = Φx_0 (noiseless).
4. Run BP and OMP, as well as a competing Bayesian method called SBL (more on this later), to try and correctly estimate x_0.
5. Average over 1000 trials to compute the empirical probability of failure.
Repeat with different sparsity values, i.e., ||x_0||_0 ranging from 10 to 30.

76 Sample Results (n = 50, m = 100)
(Figure: error rate vs. ||x_0||_0 for approximate Jeffreys coefficients and for unit coefficients.)

77 Limitation III It is not immediately clear how to use these methods to assess uncertainty in coefficient estimates (e.g., covariances). Such estimates can be useful for designing an optimal (non-random) dictionary Φ. For example, it is well known in the machine learning and image processing communities that under-sampled random projections of natural scenes are very suboptimal.

78 Section II: MAP Estimation Using the Sparse Linear Model

79 Overview Can be viewed as sparse penalized regression with general sparsity-promoting penalties (the ℓ1 penalty is a special case). These penalties can be chosen to overcome some previous limitations when minimized with simple, efficient algorithms. Theory is somewhat harder to come by, but practical results are promising. In some cases we can guarantee improvement over the ℓ1 solution on sparse recovery problems.

80 Sparse Linear Model
Linear generative model: y = Φx + ε
y: observed n-dimensional data vector; Φ: matrix of m basis vectors; x: unknown coefficients; ε: Gaussian noise with variance λ.
Objective: Estimate the unknown x given the following assumptions:
1. Φ is overcomplete, meaning the number of columns m is greater than the signal dimension n.
2. x is maximally sparse, i.e., many elements equal zero.

81 Sparse Inverse Problem
Noiseless case (ε = 0): x_0 = argmin_x ||x||_0 s.t. y = Φx, where the ℓ0 norm is the number of nonzeros in x.
Noisy case (ε > 0):
x_0(λ) = argmin_x ||y − Φx||²_2 + λ||x||_0 = argmax_x exp(−(1/2λ)||y − Φx||²_2) · exp(−(1/2)||x||_0)
(likelihood × prior)

82 Difficulties 1. Combinatorial number of local minima 2. Objective is discontinuous A variety of existing approximate methods can be viewed as MAP estimation using a flexible class of sparse priors.

83 Sparse Priors: 2-D Example [Tipping, 2001] Gaussian Distribution Sparse Distribution

84 Basic MAP Estimation
x̂ = argmax_x p(x|y) = argmin_x [−log p(y|x) − log p(x)]
= argmin_x ||y − Φx||²_2 + λ Σ_i g(x_i)
(data fit + sparsity penalty), where g(x_i) = h(x_i²) and h is a nondecreasing, concave function.
Note: the Bayesian interpretation will be useful later.

85 Example Sparsity Penalties
With g(x_i) = I[x_i ≠ 0] we have the canonical sparsity penalty and its associated problems.
Practical selections:
g(x_i) = log(x_i² + ε)  [Chartrand and Yin, 2008]
g(x_i) = log(|x_i| + ε)  [Candes et al., 2008]
g(x_i) = |x_i|^p  [Rao et al., 2003]
Different choices favor different levels of sparsity.

86 Example 2-D Contours
(Figure: contours of the penalty g(x_i) = |x_i|^p in the (x_1, x_2) plane for several values of p, including p = 0.2 and p = 0.5.)

87 Which Penalty Should We Choose?
Two competing issues:
1. If the prior is too sparse (e.g., p → 0), then we may get stuck in a local minimum: convergence error.
2. If the prior is not sparse enough (e.g., p → 1), then the global minimum may be found, but it might not equal x_0: structural error.
The answer is ultimately application- and algorithm-dependent.

88 Convergence Errors vs. Structural Errors
(Figure: −log p(x|y) along the feasible set. Convergence error (p → 0): the iterate x' is trapped in a local minimum away from x_0. Structural error (p → 1): the global minimum x' is found but differs from x_0.)
x' = solution we have converged to; x_0 = maximally sparse solution.

89 Desirable Properties of Algorithms
Can be implemented via relatively simple primitives already in use, e.g., ℓ1 solvers, etc.
Improved performance over OMP, BP, etc.
Naturally extends to more general problems:
1. Constrained sparse estimation (e.g., finding non-negative sparse solutions)
2. Group sparsity problems

90 Example Extension: Group Sparsity
The simultaneous sparse estimation problem: the goal is to recover a matrix X with maximal row sparsity [Cotter et al., 2005; Tropp, 2006], given an observation matrix Y produced via Y = ΦX + E.
Optimization Problem:
X_0(λ) = argmin_X ||Y − ΦX||²_F + λ Σ_{i=1}^m I[row i of X ≠ 0]
where the sum counts the number of nonzero rows in X.
Can be efficiently solved/approximated by replacing the indicator function with an alternative function g.

91 Reweighted ℓ2 Optimization
Assume: g(x_i) = h(x_i²), h concave.
Updates:
x^(k+1) ← argmin_x ||y − Φx||²_2 + λ Σ_i w_i^(k) x_i² = W^(k) Φ^T (λI + Φ W^(k) Φ^T)^{-1} y, with W^(k) ≜ diag[w^(k)]^{-1}
w_i^(k+1) ← ∂g(x_i)/∂(x_i²) evaluated at x_i = x_i^(k+1)
Based on a simple 1st order approximation to g(x_i) [Palmer et al., 2006].
Guaranteed not to increase the objective function.
Global convergence assured given additional assumptions.
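A minimal sketch of these reweighted ℓ2 updates, specialized to the log(x_i² + ε) penalty of the next slide; the ε-annealing schedule and λ value are illustrative choices:

```python
import numpy as np

def reweighted_l2(Phi, y, lam=1e-6, eps0=1.0, n_iter=30):
    """Iterative reweighted l2 with the log(x_i^2 + eps) penalty
    (Chartrand-and-Yin-style weight update); eps is slowly reduced
    toward zero to smooth out local minima early on."""
    n, m = Phi.shape
    w = np.ones(m)
    eps = eps0
    x = np.zeros(m)
    for _ in range(n_iter):
        W = np.diag(1.0 / w)                 # W = diag[w]^{-1}
        x = W @ Phi.T @ np.linalg.solve(lam * np.eye(n) + Phi @ W @ Phi.T, y)
        w = 1.0 / (x ** 2 + eps)             # weight update for the log penalty
        eps = max(eps / 10.0, 1e-8)          # anneal eps toward zero
    return x
```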

92 Examples
1. FOCUSS algorithm [Rao et al., 2003]:
Penalty: g(x_i) = |x_i|^p, 0 ≤ p ≤ 2
Weight update: w_i^(k+1) ← |x_i^(k+1)|^{p−2}
Properties: Well-characterized convergence rates; very susceptible to local minima when p is small.
2. Chartrand and Yin (2008) algorithm:
Penalty: g(x_i) = log(x_i² + ε), ε ≥ 0
Weight update: w_i^(k+1) ← ((x_i^(k+1))² + ε)^{-1}
Properties: Slowly reducing ε to zero smooths out local minima, initially allowing better solutions to be found; very useful for recovering scaled coefficients.

93 Empirical Comparison
For each test case:
1. Generate a random dictionary Φ with 50 rows and 250 columns.
2. Generate a sparse coefficient vector x_0.
3. Compute the signal via y = Φx_0 (noiseless).
4. Compare Chartrand and Yin's reweighted ℓ2 method with the ℓ1 norm solution with regard to estimating x_0.
5. Average over 1000 independent trials.
Repeat with different sparsity levels and different nonzero coefficient distributions.

94 Empirical Results: Unit Nonzeros
(Figure: probability of success vs. ||x_0||_0.)

95 Results: Gaussian Nonzeros
(Figure: probability of success vs. ||x_0||_0.)

96 Reweighted ℓ1 Optimization
Assume: g(x_i) = h(|x_i|), h concave.
Updates:
x^(k+1) ← argmin_x ||y − Φx||²_2 + λ Σ_i w_i^(k) |x_i|
w_i^(k+1) ← ∂g(x_i)/∂|x_i| evaluated at x_i = x_i^(k+1)
Based on a simple 1st order approximation to g(x_i) [Fazel et al., 2003].
Global convergence given minimal assumptions [Zangwill, 1969].
Per-iteration cost is expensive, but few iterations are needed (and each is sparse).
Easy to incorporate alternative data fit terms or constraints on x.
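A sketch of the reweighted ℓ1 scheme above. The slides leave the choice of ℓ1 solver open; here each weighted subproblem is solved with a simple ISTA loop (a substitution of convenience), and the weight update follows the Candes et al. rule shown on the next slide. The λ, ε, and iteration counts are illustrative.

```python
import numpy as np

def weighted_ista(Phi, y, lam, w, n_iter=300):
    """Solve min_x ||y - Phi x||_2^2 + lam * sum_i w_i |x_i| by soft thresholding."""
    L = np.linalg.norm(Phi, 2) ** 2
    x = np.zeros(Phi.shape[1])
    thresh = lam * w / (2 * L)                    # per-coordinate threshold
    for _ in range(n_iter):
        z = x - Phi.T @ (Phi @ x - y) / L
        x = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    return x

def reweighted_l1(Phi, y, lam=1e-3, eps=0.1, n_outer=10):
    """Candes-et-al-style reweighting: solve a weighted l1 problem,
    then set w_i = 1 / (|x_i| + eps), and repeat."""
    w = np.ones(Phi.shape[1])
    for _ in range(n_outer):
        x = weighted_ista(Phi, y, lam, w)
        w = 1.0 / (np.abs(x) + eps)
    return x
```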

97 Example [Candes et al., 2008]
Penalty: g(x_i) = log(|x_i| + ε), ε ≥ 0
Updates:
x^(k+1) ← argmin_x ||y − Φx||²_2 + λ Σ_i w_i^(k) |x_i|
w_i^(k+1) ← (|x_i^(k+1)| + ε)^{-1}
When the nonzeros in x_0 are scaled, this works much better than regular ℓ1, depending on how ε is chosen.
Local minima exist, but since each iteration is sparse, local solutions are not so bad (no worse than the regular ℓ1 solution).

98 Empirical Comparison
For each test case:
1. Generate a random dictionary Φ with 50 rows and 100 columns.
2. Generate a sparse coefficient vector x_0 with 30 truncated Gaussian, strictly positive nonzero coefficients.
3. Compute the signal via y = Φx_0 (noiseless).
4. Compare Candes et al.'s reweighted ℓ1 method (10 iterations) with the ℓ1 norm solution, both constrained to be non-negative, to try and estimate x_0.
5. Average over 1000 independent trials.
Repeat with different values of the parameter ε.

99 Empirical Comparison
(Figure: probability of success vs. ε.)

100 Conclusions
In practice, MAP estimation addresses some limitations of standard methods (although not Limitation III, assessing uncertainty).
Simple updates are possible using either iterative reweighted ℓ1 or ℓ2 minimization.
More generally, iterative reweighted f minimization, where f is a convex function, is possible.

101 Section III: Bayesian Inference Using the Sparse Linear Model

102 Note MAP estimation is really just standard/classical penalized regression. So the Bayesian interpretation has not really contributed much as of yet

103 Posterior Modes vs. Posterior Mass
Previous methods focus on finding the implicit mode of p(x|y) by maximizing the joint distribution p(x, y) = p(y|x) p(x).
Bayesian inference uses posterior information beyond the mode, i.e., posterior mass. Examples:
1. Posterior mean: Can have attractive properties when used as a sparse point estimate (more on this later).
2. Posterior covariance: Useful for assessing uncertainty in estimates, e.g., experimental design, learning new projection directions for compressive sensing measurements.

104 Posterior Modes vs. Posterior Mass
(Figure: p(x|y) vs. x, with the mode marked and the bulk of the probability mass located away from it.)

105 Problem
For essentially all sparse priors, we cannot compute the normalized posterior
p(x|y) = p(y|x) p(x) / ∫ p(y|x) p(x) dx
We also cannot compute posterior moments, e.g.,
µ_x = E[x|y], Σ_x = Cov[x|y]
So efficient approximations are needed.

106 Approximating Posterior Mass
Goal: Approximate p(x, y) with some distribution p̂(x, y) that
1. Reflects the significant mass in p(x, y).
2. Can be normalized to get the posterior p(x|y).
3. Has easily computable moments, e.g., can compute E[x|y] or Cov[x|y].
Optimization Problem: Find the p̂(x, y) that minimizes the sum of the misaligned mass:
∫ |p(x, y) − p̂(x, y)| dx = ∫ p(y|x) |p(x) − p̂(x)| dx

107 Recipe
1. Start with a Gaussian likelihood: p(y|x) = (2πλ)^{-n/2} exp(−(1/2λ)||y − Φx||²_2)
2. Pick an appropriate prior p(x) that encourages sparsity.
3. Choose a convenient parameterized class of approximate priors p̂(x) = p(x; γ).
4. Solve: γ̂ = argmin_γ ∫ p(y|x) |p(x) − p(x; γ)| dx
5. Normalize: p(x|y; γ̂) = p(y|x) p(x; γ̂) / ∫ p(y|x) p(x; γ̂) dx

108 Step 2: Prior Selection
Assume a sparse prior distribution on each coefficient:
−log p(x_i) ∝ g(x_i) = h(x_i²), h concave, non-decreasing.
(2-D example [Tipping, 2001]: Gaussian distribution vs. sparse distribution.)

109 Step 3: Forming Approximate Priors p(x; γ)
Any sparse prior can be expressed via the dual form [Palmer et al., 2006]:
p(x_i) = max_{γ_i ≥ 0} (2πγ_i)^{-1/2} exp(−x_i²/(2γ_i)) φ(γ_i)
Two options:
1. Start with p(x_i) and then compute φ(γ_i) via convexity results, or
2. Choose φ(γ_i) directly and then compute p(x_i); this procedure will always produce a sparse prior [Palmer et al., 2006].
Dropping the maximization gives the strict variational lower bound
p(x_i) ≥ p(x_i; γ_i) = (2πγ_i)^{-1/2} exp(−x_i²/(2γ_i)) φ(γ_i)
Yields a convenient class of scaled Gaussian approximations: p(x; γ) = Π_i p(x_i; γ_i)

110 Example: Approximations to a Sparse Prior
(Figure: a sparse prior p(x_i) together with several scaled Gaussian lower bounds p(x_i; γ_i).)

111 Step 4: Solving for the Optimal γ
To find the best approximation, we must solve
γ̂ = argmin_{γ ≥ 0} ∫ p(y|x) |p(x) − p(x; γ)| dx
By virtue of the strict lower bound, this is equivalent to
γ̂ = argmax_{γ ≥ 0} ∫ p(y|x) p(x; γ) dx
= argmin_{γ ≥ 0} log|Σ_y| + y^T Σ_y^{-1} y − 2 Σ_{i=1}^m log φ(γ_i)
where Σ_y = λI + ΦΓΦ^T and Γ = diag(γ).

112 How difficult is finding the optimal parameters γ?
If the original MAP estimation problem is convex, then so is the Bayesian inference cost function [Nickisch and Seeger, 2009].
In other situations, the Bayesian inference cost is generally much smoother than the associated MAP estimation (more on this later).

113 Step 5: Posterior Approximation
We have found the approximation p(y|x) p(x; γ̂) = p̂(x, y; γ̂) ≈ p(x, y).
By design, this approximation reflects significant mass in the full distribution p(x, y).
Also, it is easily normalized to form p(x|y; γ̂) = N(µ_x, Σ_x), with
µ_x = E[x|y; γ̂] = Γ̂Φ^T(λI + ΦΓ̂Φ^T)^{-1} y
Σ_x = Cov[x|y; γ̂] = Γ̂ − Γ̂Φ^T(λI + ΦΓ̂Φ^T)^{-1}ΦΓ̂

114 Applications of Bayesian Inference 1. Finding maximally sparse representations Replace MAP estimate with posterior mean estimator. For certain prior selections, this can be very effective (next section) 2. Active learning, experimental design

115 Experimental Design
Basic Idea [Shihao Ji et al., 2007; Seeger and Nickisch, 2008]: Use the approximate posterior p(x|y; γ̂) = N(µ_x, Σ_x) to learn new rows of the design matrix Φ such that uncertainty about x is reduced.
Choose each additional row to minimize the differential entropy H:
H = (1/2) log|Σ_x|, with Σ_x = Γ̂ − Γ̂Φ^T(λI + ΦΓ̂Φ^T)^{-1}ΦΓ̂

116 Experimental Design Cont. Drastic improvement over random projections is possible in a variety of domains. Examples: Reconstructing natural scenes [Seeger and Nickisch, 2008] Undersampled MRI reconstruction [Seeger et al., 2009]

117 Section IV: Analysis of Bayesian Inference and Connections with MAP

118 Overview Bayesian inference can be recast as a general form of MAP estimation in x-space. This is useful for several reasons: 1. Allows us to leverage same algorithmic formulations as with iterative reweighted methods for MAP estimation. 2. Reveals that Bayesian inference can actually be an easier computational task than searching for the mode as with MAP. 3. Provides theoretical motivation for posterior mean estimator when searching for maximally sparse solutions. 4. Allows modifications to Bayesian inference cost (e.g., adding constraints), and inspires new non-bayesian sparse estimators.

119 Reformulation of the Posterior Mean Estimator
Theorem [Wipf and Nagarajan, 2008]:
x̄ = E[x|y; γ̂] = argmin_x ||y − Φx||²_2 + λ g_infer(x)
with Bayesian inference penalty function
g_infer(x) = min_{γ ≥ 0} x^T Γ^{-1} x + log|λI + ΦΓΦ^T| − 2 Σ_i log φ(γ_i)
So the posterior mean can be obtained by minimizing a penalized regression problem, just like MAP.
Posterior covariance is a natural byproduct as well.

120 Property I of Penalty g_infer(x)
Penalty g_infer(x) is formed from a minimum of upper-bounding hyperplanes with respect to each x_i².
This implies:
1. Concavity in x_i² for all i [Boyd, 2004].
2. Can be implemented via iterative reweighted ℓ2 minimization (multiple possibilities using various bounding techniques) [Seeger, 2009; Wipf and Nagarajan, 2009].
3. Note: The posterior covariance can be obtained easily too, therefore the entire approximation can be computed for full Bayesian inference.

121 Student's t Example
Assume the following sparse distribution for each unknown coefficient:
p(x_i) ∝ (b + x_i²/2)^{−(a + 1/2)}
Note:
When a = b → ∞, the prior approaches a Gaussian (not sparse).
When a = b → 0, the prior approaches a Jeffreys prior (highly sparse).
Using convex bounding techniques to approximate the required derivatives leads to simple reweighted ℓ2 update rules [Seeger, 2009; Wipf and Nagarajan, 2009].
The algorithm can also be derived via EM [Tipping, 2001].

122 Reweighted ℓ2 Implementation Example
x^(k+1) ← argmin_x ||y − Φx||²_2 + λ Σ_i w_i^(k) x_i² = W^(k) Φ^T (λI + Φ W^(k) Φ^T)^{-1} y, with W^(k) = diag[w^(k)]^{-1}
w_i^(k+1) ← (1 + 2a) / [ (x_i^(k+1))² + (w_i^(k))^{-1} − (w_i^(k))^{-2} φ_i^T (λI + Φ W^(k) Φ^T)^{-1} φ_i + 2b ]
Guaranteed to reduce or leave unchanged the objective function at each iteration.
Other variants are possible using different bounding techniques.
Upon convergence, the posterior covariance is given by
Σ_x = (λ^{-1} Φ^T Φ + W^(k))^{-1}, with W^(k) = diag[w^(k)]
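For comparison, here is a minimal EM-style SBL sketch in the spirit of the Tipping (2001) derivation mentioned on the previous slide, written for the a = b = 0 case: it alternates the Gaussian posterior moments from Step 5 with the EM hyperparameter update γ_i = µ_i² + Σ_ii. This simpler update is used in place of the slide's exact weight recursion, and the λ value and iteration count are illustrative.

```python
import numpy as np

def sbl_em(Phi, y, lam=1e-3, n_iter=100):
    """Minimal EM-style sparse Bayesian learning sketch (a = b = 0):
    alternate the posterior moments of x given gamma with the EM update
    gamma_i = mu_i^2 + Sigma_ii."""
    n, m = Phi.shape
    gamma = np.ones(m)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        Sigma_y = lam * np.eye(n) + Phi @ Gamma @ Phi.T
        K = Gamma @ Phi.T @ np.linalg.inv(Sigma_y)
        mu = K @ y                          # posterior mean E[x | y; gamma]
        Sigma_x = Gamma - K @ Phi @ Gamma   # posterior covariance
        gamma = mu ** 2 + np.diag(Sigma_x)  # EM update of the hyperparameters
        gamma = np.maximum(gamma, 1e-12)    # keep the variances positive
    return mu, Sigma_x, gamma
```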

123 Property II of Penalty g_infer(x)
If −2 log φ(γ_i) is concave in γ_i, then:
1. g_infer(x) is concave in |x_i| for all i — sparsity-inducing.
2. This implies the posterior mean will always have at least m − n elements equal to exactly zero, as occurs with MAP.
3. Can be useful for canonical sparse recovery problems.
4. Can be implemented via reweighted ℓ1 minimization [Wipf, 2006; Wipf and Nagarajan, 2009].

124 Reweighted ℓ1 Implementation Example
Assume −2 log φ(γ_i) = a γ_i, a ≥ 0.
x^(k+1) ← argmin_x ||y − Φx||²_2 + λ Σ_i w_i^(k) |x_i|
w_i^(k+1) ← [ φ_i^T (σ²I + Φ W^(k) diag[|x^(k+1)|] Φ^T)^{-1} φ_i + a ]^{1/2}, with W^(k) = diag[w^(k)]^{-1}
Guaranteed to converge to a minimum of the Bayesian inference objective function.
Easy to outfit with additional constraints.
In the noiseless case, sparsity will not increase from iteration to iteration, i.e., ||x^(k+1)||_0 ≤ ||x^(k)||_0.

125 Property III of Penalty g_infer(x)
Bayesian inference is most sensitive to posterior mass, therefore it is less sensitive to spurious local peaks than MAP estimation.
Consequently, in x-space, the Bayesian inference penalized regression problem
min_x ||y − Φx||²_2 + λ g_infer(x)
is generally smoother than the associated MAP problem.

126 Student's t Example
Assume the Student's t distribution for each unknown coefficient:
p(x_i) ∝ (b + x_i²/2)^{−(a + 1/2)}
Note:
When a = b → ∞, the prior approaches a Gaussian (not sparse).
When a = b → 0, the prior approaches a Jeffreys prior (highly sparse).
Goal: Compare Bayesian inference and MAP for different sparsity levels to show the smoothing effect.

127 Visualizing Local Minima Smoothing
Consider when y = Φx has a 1-D feasible region, i.e., m = n + 1.
Any feasible solution x will satisfy:
x = x_true + α v, v ∈ Null(Φ)
where α is a scalar and x_true is the true generative coefficient vector.
We can plot penalty functions vs. α to view the local minima profile over the 1-D feasible region.

128 Empirical Example
Generate an i.i.d. Gaussian random dictionary Φ with 10 rows and 11 columns.
Generate a sparse coefficient vector x_true with 9 nonzeros and Gaussian i.i.d. amplitudes.
Compute the signal y = Φx_true.
Assume a Student's t prior on x with varying degrees of sparsity.
Plot the MAP/Bayesian inference penalties vs. α to compare local minima profiles over the 1-D feasible region.
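A sketch of this experiment, using the log(x² + ε) penalty from the MAP section as a stand-in for the Student's t penalty and a simple grid over α to count local minima along the 1-D feasible region:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 10, 11
Phi = rng.standard_normal((n, m))
x_true = np.zeros(m)
x_true[rng.choice(m, size=9, replace=False)] = rng.standard_normal(9)
y = Phi @ x_true

# 1-D feasible region: x(alpha) = x_true + alpha * v, with v spanning Null(Phi)
_, _, Vt = np.linalg.svd(Phi)
v = Vt[-1]                                   # null-space direction (m - n = 1)

alphas = np.linspace(-5, 5, 2001)
eps = 1e-3
penalties = np.array([np.sum(np.log((x_true + a * v) ** 2 + eps)) for a in alphas])
# Count interior local minima of the penalty along the feasible line
is_min = (penalties[1:-1] < penalties[:-2]) & (penalties[1:-1] < penalties[2:])
print("local minima found along the 1-D feasible region:", int(is_min.sum()))
```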

129 Local Minima Smoothing Example #1
(Figure: normalized penalty vs. α for a low-sparsity Student's t prior.)

130 Local Minima Smoothing Example #2
(Figure: normalized penalty vs. α for a medium-sparsity Student's t prior.)

131 Local Minima Smoothing Example #3
(Figure: normalized penalty vs. α for a high-sparsity Student's t prior.)

132 Property IV of Penalty g_infer(x)
Non-separable, meaning g_infer(x) ≠ Σ_i g_i(x_i).
Non-separable penalty functions can have an advantage over separable penalties (i.e., MAP) when it comes to canonical sparse recovery problems [Wipf and Nagarajan, 2010].

133 Example
Consider the original sparse estimation problem: x_0 = argmin_x ||x||_0 s.t. y = Φx
Problem: combinatorial number of local minima:
C(m−1, n) + 1 ≤ number of local minima ≤ C(m, n)
Local minima occur at each basic feasible solution (BFS): ||x||_0 ≤ n s.t. y = Φx

134 Visualization of Local Minima in the ℓ0 Norm
Generate an i.i.d. Gaussian random dictionary Φ with 10 rows and 11 columns.
Generate a sparse coefficient vector x_true with 9 nonzeros and Gaussian i.i.d. amplitudes.
Compute the signal via y = Φx_true.
Plot ||x||_0 vs. α (the 1-D null space dimension) to view the local minima profile of the ℓ0 norm over the 1-D feasible region.

135 ℓ0 Norm in the 1-D Feasible Region
(Figure: number of nonzeros vs. α.)

136 Non-Separable Penalty Example
Would like to smooth local minima while retaining the same global solution as ℓ0 at all times (unlike the ℓ1 norm).
This can be accomplished by a simple modification of the ℓ0 penalty.
Truncated ℓ0 penalty:
x̂ = argmin_x ||x̄||_0 s.t. y = Φx, where x̄ = the k largest elements of x
If k < m, then there will necessarily be fewer local minima; however, the implicit prior/penalty function is non-separable.

137 Truncated ℓ0 Norm in the 1-D Feasible Region
(Figure: number of nonzeros among the k = 10 largest elements vs. α.)

138 Using the Posterior Mean Estimator for Finding Maximally Sparse Solutions
Summary of why this could be a good idea:
1. If −2 log φ(γ_i) is concave, then the posterior mean will be sparse (local minima of the Bayesian inference cost will also be sparse).
2. The implicit Bayesian inference cost function can be much smoother than the associated MAP objective.
3. Potential advantages of non-separable penalty functions.

139 Choosing the Function φ
For sparsity, require that −2 log φ(γ_i) is concave.
To avoid adding extra local minima (i.e., to maximally exploit the smoothing effect), require that −2 log φ(γ_i) is convex.
So −2 log φ(γ_i) = a γ_i, a ≥ 0 is well-motivated [Wipf et al., 2007].
Assume the simplest case: −2 log φ(γ_i) = 0, sometimes referred to as sparse Bayesian learning (SBL) [Tipping, 2001].
We denote the penalty function in this case g_SBL(x).

140 Advantages of the Posterior Mean Estimator
Theorem [Wipf and Nagarajan, 2008]: In the low noise limit (λ → 0), and assuming ||x_0||_0 < spark[Φ] − 1, the SBL penalty satisfies
argmin_{x: y=Φx} g_SBL(x) = argmin_{x: y=Φx} ||x||_0
No separable penalty g(x) = Σ_i g_i(x_i) satisfies this condition and has fewer minima than the SBL penalty in the feasible region.

141 Conditions for a Single Minimum
Theorem [Wipf and Nagarajan, 2009]: Assume ||x_0||_0 < spark[Φ] − 1. If the magnitudes of the non-zero elements in x_0 are sufficiently scaled, then the SBL cost (λ → 0) has a single minimum, which is located at x_0.
(Illustration: an x_0 with scaled coefficients is easy; an x_0 with uniform coefficients is hard.)
No possible separable penalty (standard MAP) satisfies this condition.

142 Empirical Example
Generate an i.i.d. Gaussian random dictionary Φ with 10 rows and 11 columns.
Generate a maximally sparse coefficient vector x_0 with 9 nonzeros and either
1. amplitudes of similar scales, or
2. amplitudes with very different scales.
Compute the signal via y = Φx_0.
Plot the MAP/Bayesian inference penalty functions vs. α to compare local minima profiles over the 1-D feasible region and see the effect of coefficient scaling.

143 Smoothing Example: Similar Scales
(Figure: normalized penalty vs. α.)

144 Smoothing Example: Different Scales
(Figure: normalized penalty vs. α.)

145 Always Room for Improvement
Theorem [Wipf and Nagarajan, 2010]: Consider the noiseless sparse recovery problem x_0 = argmin_x ||x||_0 s.t. y = Φx.
Under very mild conditions, SBL with a reweighted ℓ1 implementation will:
1. Never do worse than the regular ℓ1-norm solution.
2. For any dictionary and sparsity profile, there will always be cases where it does better.

146 Empirical Example: Simultaneous Sparse Approximation
Generate a data matrix via Y = ΦX_0 (noiseless):
X_0 is 100-by-5 with random nonzero rows.
Φ is 50-by-100 with Gaussian i.i.d. entries.
Check if X_0 is recovered using various algorithms:
1. Generalized SBL, reweighted ℓ2 implementation [Wipf and Nagarajan, 2010]
2. Candes et al. (2008) reweighted ℓ1 method
3. Chartrand and Yin (2008) reweighted ℓ2 method
4. ℓ1 solution via Group Lasso [Yuan and Lin, 2006]

147 Empirical Results (1000 Trials)
(Figure: probability of success vs. row sparsity for the four methods, with the SBL curve labeled.)

148 Conclusions
Posterior information beyond the mode can be very useful in a wide variety of applications.
Variational approximation provides useful estimates of posterior means and covariances, which can be computed efficiently using standard iterative reweighting algorithms.
In certain situations, the posterior mean estimate can be an effective substitute for ℓ0 norm minimization.
In simulation tests, it out-performs a wide variety of MAP-based algorithms [Wipf and Nagarajan, 2010].

149 Section V: Application Examples in Neuroimaging

150 Applications of Sparse Bayesian Methods
1. Recovering fiber track geometry from diffusion weighted MR images [Ramirez-Manzanares et al., 2007].
2. Multivariate autoregressive modeling of fMRI time series for functional connectivity analyses [Harrison et al., 2003].
3. Compressive sensing for rapid MRI [Lustig et al., 2007].
4. MEG/EEG source localization [Sato et al., 2004; Friston et al., 2008].

151 MEG/EEG Source Localization
(Figure: Maxwell's equations map activity in the source space (X) to measurements in the sensor space (Y); source localization is the inverse problem.)

152 The Dictionary Φ Can be computed using a boundary element brain model and Maxwell s equations. Will be dependent on location of sensors and whether we are doing MEG, EEG, or both. Unlike compressive sensing domain, columns of Φ will be highly correlated regardless of where sensors are placed.

153 Source Localization Given multiple measurement vectors Y, MAP or Bayesian inference algorithms can be run to estimate X. The estimated nonzero rows should correspond with the location of active brain areas (also called sources). Like compressive sensing, may apply algorithms in appropriate transform domain where row-sparsity assumption holds.

154 Empirical Results 1. Simulations with real brain noise/interference: Generate damped sinusoidal sources Map to sensors using Φ and apply real brain noise, artifacts 2. Data from real-world experiments: Auditory evoked fields from binaurally presented tones (which produce correlated, bilateral activations) Compare localization results using MAP estimation and SBL posterior mean from Bayesian inference

155 MEG Source Reconstruction Example
(Figure panels: Ground Truth, SBL, Group Lasso.)

156 Real Data: Auditory Evoked Field (AEF)
(Figure panels: SBL, Beamformer, sLORETA, Group Lasso.)

157 Conclusion
MEG/EEG source localization demonstrates the effectiveness of Bayesian inference on problems where the dictionary is:
Highly overcomplete, meaning m >> n, e.g., n = 275 and m = 100,000.
Very ill-conditioned and coherent, i.e., columns are highly correlated.

158 Final Thoughts
Sparse Signal Recovery is an interesting area with many potential applications.
Methods developed for solving the Sparse Signal Recovery problem can be valuable tools for signal processing practitioners.
Rich set of computational algorithms, e.g.:
Greedy search (OMP)
ℓ1 norm minimization (Basis Pursuit, Lasso)
MAP methods (reweighted ℓ1 and ℓ2 methods)
Bayesian inference methods like SBL (show great promise)
Potential for great theory in support of performance guarantees for algorithms.
Expectation is that there will be continued growth in the application domain as well as in algorithm development.

159 Thank You


More information

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France

Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France Inverse problems and sparse models (1/2) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France remi.gribonval@inria.fr Structure of the tutorial Session 1: Introduction to inverse problems & sparse

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Recovery Guarantees for Rank Aware Pursuits

Recovery Guarantees for Rank Aware Pursuits BLANCHARD AND DAVIES: RECOVERY GUARANTEES FOR RANK AWARE PURSUITS 1 Recovery Guarantees for Rank Aware Pursuits Jeffrey D. Blanchard and Mike E. Davies Abstract This paper considers sufficient conditions

More information

Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors

Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors Simultaneous Multi-frame MAP Super-Resolution Video Enhancement using Spatio-temporal Priors Sean Borman and Robert L. Stevenson Department of Electrical Engineering, University of Notre Dame Notre Dame,

More information

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases 2558 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 9, SEPTEMBER 2002 A Generalized Uncertainty Principle Sparse Representation in Pairs of Bases Michael Elad Alfred M Bruckstein Abstract An elementary

More information

One Picture and a Thousand Words Using Matrix Approximtions October 2017 Oak Ridge National Lab Dianne P. O Leary c 2017

One Picture and a Thousand Words Using Matrix Approximtions October 2017 Oak Ridge National Lab Dianne P. O Leary c 2017 One Picture and a Thousand Words Using Matrix Approximtions October 2017 Oak Ridge National Lab Dianne P. O Leary c 2017 1 One Picture and a Thousand Words Using Matrix Approximations Dianne P. O Leary

More information

Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint

Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint Conditional Gradient Algorithms for Rank-One Matrix Approximations with a Sparsity Constraint Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with Ronny Luss Optimization and

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Generalized Orthogonal Matching Pursuit- A Review and Some

Generalized Orthogonal Matching Pursuit- A Review and Some Generalized Orthogonal Matching Pursuit- A Review and Some New Results Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur, INDIA Table of Contents

More information

wissen leben WWU Münster

wissen leben WWU Münster MÜNSTER Sparsity Constraints in Bayesian Inversion Inverse Days conference in Jyväskylä, Finland. Felix Lucka 19.12.2012 MÜNSTER 2 /41 Sparsity Constraints in Inverse Problems Current trend in high dimensional

More information

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1

CSCI5654 (Linear Programming, Fall 2013) Lectures Lectures 10,11 Slide# 1 CSCI5654 (Linear Programming, Fall 2013) Lectures 10-12 Lectures 10,11 Slide# 1 Today s Lecture 1. Introduction to norms: L 1,L 2,L. 2. Casting absolute value and max operators. 3. Norm minimization problems.

More information

Perspectives on Sparse Bayesian Learning

Perspectives on Sparse Bayesian Learning Perspectives on Sparse Bayesian Learning David Wipf, Jason Palmer, and Bhaskar Rao Department of Electrical and Computer Engineering University of California, San Diego, CA 909 dwipf,japalmer@ucsd.edu,

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

Oslo Class 6 Sparsity based regularization

Oslo Class 6 Sparsity based regularization RegML2017@SIMULA Oslo Class 6 Sparsity based regularization Lorenzo Rosasco UNIGE-MIT-IIT May 4, 2017 Learning from data Possible only under assumptions regularization min Ê(w) + λr(w) w Smoothness Sparsity

More information

Inverse problems and sparse models (6/6) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France.

Inverse problems and sparse models (6/6) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France. Inverse problems and sparse models (6/6) Rémi Gribonval INRIA Rennes - Bretagne Atlantique, France remi.gribonval@inria.fr Overview of the course Introduction sparsity & data compression inverse problems

More information

Approximate Message Passing

Approximate Message Passing Approximate Message Passing Mohammad Emtiyaz Khan CS, UBC February 8, 2012 Abstract In this note, I summarize Sections 5.1 and 5.2 of Arian Maleki s PhD thesis. 1 Notation We denote scalars by small letters

More information

Sparsity Regularization

Sparsity Regularization Sparsity Regularization Bangti Jin Course Inverse Problems & Imaging 1 / 41 Outline 1 Motivation: sparsity? 2 Mathematical preliminaries 3 l 1 solvers 2 / 41 problem setup finite-dimensional formulation

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 58, NO. 6, JUNE 2010 2935 Variance-Component Based Sparse Signal Reconstruction and Model Selection Kun Qiu, Student Member, IEEE, and Aleksandar Dogandzic,

More information

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades

More information

Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees

Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees Lecture: Introduction to Compressed Sensing Sparse Recovery Guarantees http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Prof. Emmanuel Candes and Prof. Wotao Yin

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Large-Scale L1-Related Minimization in Compressive Sensing and Beyond

Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Large-Scale L1-Related Minimization in Compressive Sensing and Beyond Yin Zhang Department of Computational and Applied Mathematics Rice University, Houston, Texas, U.S.A. Arizona State University March

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Inferring Sparsity: Compressed Sensing Using Generalized Restricted Boltzmann Machines. Eric W. Tramel. itwist 2016 Aalborg, DK 24 August 2016

Inferring Sparsity: Compressed Sensing Using Generalized Restricted Boltzmann Machines. Eric W. Tramel. itwist 2016 Aalborg, DK 24 August 2016 Inferring Sparsity: Compressed Sensing Using Generalized Restricted Boltzmann Machines Eric W. Tramel itwist 2016 Aalborg, DK 24 August 2016 Andre MANOEL, Francesco CALTAGIRONE, Marylou GABRIE, Florent

More information

The Analysis Cosparse Model for Signals and Images

The Analysis Cosparse Model for Signals and Images The Analysis Cosparse Model for Signals and Images Raja Giryes Computer Science Department, Technion. The research leading to these results has received funding from the European Research Council under

More information

OWL to the rescue of LASSO

OWL to the rescue of LASSO OWL to the rescue of LASSO IISc IBM day 2018 Joint Work R. Sankaran and Francis Bach AISTATS 17 Chiranjib Bhattacharyya Professor, Department of Computer Science and Automation Indian Institute of Science,

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

Detecting Sparse Structures in Data in Sub-Linear Time: A group testing approach

Detecting Sparse Structures in Data in Sub-Linear Time: A group testing approach Detecting Sparse Structures in Data in Sub-Linear Time: A group testing approach Boaz Nadler The Weizmann Institute of Science Israel Joint works with Inbal Horev, Ronen Basri, Meirav Galun and Ery Arias-Castro

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline

More information

Pre-weighted Matching Pursuit Algorithms for Sparse Recovery

Pre-weighted Matching Pursuit Algorithms for Sparse Recovery Journal of Information & Computational Science 11:9 (214) 2933 2939 June 1, 214 Available at http://www.joics.com Pre-weighted Matching Pursuit Algorithms for Sparse Recovery Jingfei He, Guiling Sun, Jie

More information

A hierarchical Krylov-Bayes iterative inverse solver for MEG with physiological preconditioning

A hierarchical Krylov-Bayes iterative inverse solver for MEG with physiological preconditioning A hierarchical Krylov-Bayes iterative inverse solver for MEG with physiological preconditioning D Calvetti 1 A Pascarella 3 F Pitolli 2 E Somersalo 1 B Vantaggi 2 1 Case Western Reserve University Department

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Sensing systems limited by constraints: physical size, time, cost, energy

Sensing systems limited by constraints: physical size, time, cost, energy Rebecca Willett Sensing systems limited by constraints: physical size, time, cost, energy Reduce the number of measurements needed for reconstruction Higher accuracy data subject to constraints Original

More information

Nonconvex penalties: Signal-to-noise ratio and algorithms

Nonconvex penalties: Signal-to-noise ratio and algorithms Nonconvex penalties: Signal-to-noise ratio and algorithms Patrick Breheny March 21 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/22 Introduction In today s lecture, we will return to nonconvex

More information

Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit

Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit Uniform Uncertainty Principle and signal recovery via Regularized Orthogonal Matching Pursuit arxiv:0707.4203v2 [math.na] 14 Aug 2007 Deanna Needell Department of Mathematics University of California,

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

The Pros and Cons of Compressive Sensing

The Pros and Cons of Compressive Sensing The Pros and Cons of Compressive Sensing Mark A. Davenport Stanford University Department of Statistics Compressive Sensing Replace samples with general linear measurements measurements sampled signal

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Generalization and Regularization

TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter Generalization and Regularization TTIC 31230, Fundamentals of Deep Learning David McAllester, Winter 2019 Generalization and Regularization 1 Chomsky vs. Kolmogorov and Hinton Noam Chomsky: Natural language grammar cannot be learned by

More information

IEOR 265 Lecture 3 Sparse Linear Regression

IEOR 265 Lecture 3 Sparse Linear Regression IOR 65 Lecture 3 Sparse Linear Regression 1 M Bound Recall from last lecture that the reason we are interested in complexity measures of sets is because of the following result, which is known as the M

More information

Type II variational methods in Bayesian estimation

Type II variational methods in Bayesian estimation Type II variational methods in Bayesian estimation J. A. Palmer, D. P. Wipf, and K. Kreutz-Delgado Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 9093

More information