Blocky models via the L1/L2 hybrid norm
Jon Claerbout

ABSTRACT

This paper seeks to define robust, efficient solvers of regressions of L1 nature with two goals: (1) straightforward parameterization, and (2) blocky solutions. It uses an L1/L2 hybrid norm characterized by a residual R_d of transition between L1 and L2 for data fitting and another R_m for model styling. Both the steepest descent and conjugate direction methods are included. The 1-D blind deconvolution problem is formulated in a manner intended to lead to both a blocky impedance function and a source waveform. No results are given.

INTRODUCTION

I've seen many applications improved when least-squares (L2) model fitting was changed to least absolute values (L1). I've never seen the reverse. Nevertheless we always return to L2 because the solving method is easier and faster. It does not require us to specify parameters of numerical analysis whose meaning is unclear. Another reason to re-investigate L1 is its natural ability to estimate blocky models. Sedimentary sections tend to fluctuate randomly, but sometimes there is a homogeneous material continuing for some distance. A function with such homogeneous regions is called blocky. The derivative of such a function is called sparse. L2 gives huge penalties to large values and minuscule penalties to small ones, hence it never really produces sparse functions and their integrals are never really blocky. If we had an easy, reliable L1 solver, we could expect to see many more realistic solutions.

There are reasons to abandon strict L1 and revert to an L1/L2 hybrid solver. A hybrid solver has a parameter, a threshold, at which L2 behavior transitions to L1. We have good reasons to use two different hybrid solvers, one for the data fitting, the other for the model styling (prior knowledge or regularization). Each requires a threshold of residual; let us call it R_d for the data fitting, and R_m for the model styling.
Processes that require parameters are detestable when we have a poor idea of the meaning of the parameters (especially if they relate to numerical analysis); however, the meaning of the thresholds R_d and R_m is quite clear. When we look at a shot gather and see that about 30% of the area is covered with ground roll, it is clear we would like to choose R_d at about the 70th percentile of the fitting residual. As for the model styling, if we'd like to see blocks about 20 points long, we'd like our spikes to average about 20 points apart, so we would like R_m at about the 95th percentile, allowing 5% of the spikes to be of unlimited size while the others stay small.
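The percentile recipe can be made concrete. A minimal sketch (NumPy; the synthetic residuals and the 70th/95th percentile choices follow the shot-gather and spike-spacing examples above, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fitting residuals: ~70% small noise plus ~30% large
# ground-roll-like values, as in the shot-gather example.
fitting_residual = np.concatenate([rng.normal(0.0, 1.0, 700),
                                   rng.normal(0.0, 10.0, 300)])

# Choose R_d at the 70th percentile of |residual|, so the largest
# 30% of residuals fall into the L1 (outlier-tolerant) region.
R_d = np.percentile(np.abs(fitting_residual), 70)

# Choose R_m at the 95th percentile of the model-styling residual,
# allowing about 5% of points to become unlimited-size spikes.
model_residual = rng.laplace(0.0, 1.0, 1000)
R_m = np.percentile(np.abs(model_residual), 95)

print(R_d, R_m)
```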
I was first attracted to strict L1 by its potential for blocky models. But then I realized that for each nonspike (zero) on the time axis, theory says I would need a basis equation. That implies an immense number of iterations, so it is unacceptable in imaging applications. With the hybrid solvers, instead of exact zeros we have a large region driven down by the L2 norm and a small L1 region where large spikes are welcomed.

MODEL DERIVATIVES

Here is the usual definition of the residual r_i of theoretical data sum_j F_ij m_j from observed data d_i:

    r_i = (sum_j F_ij m_j) - d_i,   or   r = F m - d                     (1)

Let C() be a convex function (C'' >= 0) of a scalar. The penalty function (or norm) of residuals is expressed by

    N(m) = sum_i C(r_i)                                                  (2)

We denote a column vector g with components g_i by g = vec(g_i). We soon require the derivative of C(r) at each residual r_i:

    g = vec[ dC(r_i)/dr ]                                                (3)

We often update models in the direction of the gradient of the norm of the residual:

    Δm_k = dN/dm_k = sum_i C'(r_i) dr_i/dm_k = sum_i g(r_i) F_ik,  i.e.  Δm = F^T g   (4)

Define a model update direction by Δm = F^T g. Since r = F m - d, we see the residual update direction will be Δr = F Δm. To find the distance α to move in those directions we choose the scalar α to minimize:

    m <- m + α Δm                                                        (5)
    r <- r + α Δr                                                        (6)
    N(α) = sum_i C(r_i + α Δr_i)                                         (7)

The sum in equation (7) is a sum of dishes, shapes between L2 parabolas and L1 V's. The i-th dish is centered on α = -r_i / Δr_i. It is steep and narrow if Δr_i is large,
and low and flat where Δr_i is small. The positive sum of convex functions is convex. There are no local minima. We can get to the bottom by following the gradient. Next we consider some choices for convex functions. We'll need them and their first and second derivatives.

Some convex functions and their derivatives

LEAST SQUARES:
    C   = r^2 / 2                                                        (8)
    C'  = r                                                              (9)
    C'' = 1 >= 0                                                         (10)

L1 NORM:
    C   = |r|                                                            (11)
    C'  = sgn(r)                                                         (12)
    C'' = 0 or ∞, >= 0                                                   (13)

HYBRID:
    C   = R^2 (sqrt(1 + r^2/R^2) - 1)                                    (14)
    C'  = r / sqrt(1 + r^2/R^2)                                          (15)
    C'' = 1 / (1 + r^2/R^2)^(3/2) >= 0                                   (16)

HUBER:
    C   = |r| - R/2      if |r| >= R;   r^2/(2R)  otherwise              (17)
    C'  = sgn(r)         if |r| >= R;   r/R       otherwise              (18)
    C'' = 0 or ∞         if |r| >= R;   1/R       otherwise, >= 0        (19)

I have scaled Hybrid so it naturally approaches the least squares limit as R -> ∞. As R -> 0, it tends to C = R|r|, scaled L1. Because of the erratic behavior of C'' for L1 and Huber, and our planned use of second-order Taylor series, we will not be using the L1 and Huber norms here. Also, we should prepare ourselves for danger as Hybrid approaches the L1 limit. (I thank Mandy for reminding me of the infinite second derivative, and I thank Mohammad and Nader for demonstrating numerically erratic behavior.)
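The Hybrid formulas (14)-(16) and their limiting behavior are easy to check numerically. A sketch (NumPy; the function names are mine):

```python
import numpy as np

def hybrid(r, R):
    """Hybrid penalty, equation (14): C = R^2 (sqrt(1 + r^2/R^2) - 1)."""
    return R**2 * (np.sqrt(1.0 + (r / R)**2) - 1.0)

def hybrid_d1(r, R):
    """First derivative, equation (15): C' = r / sqrt(1 + r^2/R^2)."""
    return r / np.sqrt(1.0 + (r / R)**2)

def hybrid_d2(r, R):
    """Second derivative, equation (16): C'' = (1 + r^2/R^2)^(-3/2) >= 0."""
    return (1.0 + (r / R)**2) ** -1.5

r = 0.1
print(hybrid(r, R=1e6))          # large R: close to r^2/2 (least squares)
print(hybrid(r, R=1e-6))         # small R: close to R*|r| (scaled L1)
print(hybrid_d2(0.0, R=1.0))     # curvature at the origin is 1, as for L2
```

Note how C' saturates near R for residuals far beyond the threshold, which is exactly the outlier tolerance the text wants from the L1 side.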
PLANE SEARCH

The most universally used method of solving immense linear regressions such as imaging problems is the Conjugate Gradient (CG) method. It has the remarkable property that with exact arithmetic, the exact solution is found in a finite number of iterations. A simpler method with the same property is the Conjugate Direction method. It is debatable which has the better numerical roundoff properties, so we generally use the Conjugate Direction method, as it is simpler to comprehend. It says not to move along the gradient direction line, but somewhere in the plane of the gradient and the previous step. The best move in that plane requires us to find two scalars: one, α, to scale the gradient; the other, β, to scale the previous step. That is all for L2 optimization. We proceed here in the same way with other norms and hope for the best.

So here we are, embedded in a giant multivariate regression where we have a bivariate regression (two unknowns). From the multivariate regression we are given three vectors in data space: r, g, and s. You will recognize these as the current residual, the gradient (Δr), and the previous step. (The gradient and previous step appearing here have previously been transformed to data space (the conjugate space) by the operator F.) Our next residual will be a perturbation of the old one. We seek to minimize, by variation of (α, β),

    r_i <- r_i + α g_i + β s_i                                           (20)
    N(α, β) = sum_i C(r_i + α g_i + β s_i)                               (21)

Let the coefficients (C_i, C'_i, C''_i) refer to a Taylor expansion of C(r) about r_i:

    N(α, β) = sum_i [ C_i + (α g_i + β s_i) C'_i + (α g_i + β s_i)^2 C''_i / 2 ]   (22)

We have two unknowns, (α, β), in a quadratic form. We set to zero the α derivative of the quadratic form, likewise the β derivative, getting

    0 = sum_i C'_i g_i + sum_i C''_i g_i (α g_i + β s_i)
    0 = sum_i C'_i s_i + sum_i C''_i s_i (α g_i + β s_i)                 (23)

resulting in a 2x2 set of equations to solve for α and β:

    [ sum_i C''_i g_i g_i   sum_i C''_i g_i s_i ] [α]     [ sum_i C'_i g_i ]
    [ sum_i C''_i s_i g_i   sum_i C''_i s_i s_i ] [β] = - [ sum_i C'_i s_i ]   (24)

The solution of any 2x2 set of simultaneous equations is generally trivial. The only difficulties arise when the determinant vanishes, which here is easy (luckily) to
understand. Generally the gradient should not point in the direction of the previous step if the previous move went the proper distance. Hence the determinant should not vanish. Practice shows that the determinant will vanish when all the inputs are zero, and it may vanish if you do so many iterations that you should have stopped already; in other words, when the gradient and previous step are both tending to zero. Using the newly found (α, β), update the residual r_i at each location (and update the model). Then go back to re-evaluate C'_i and C''_i at the new r_i locations. Iterate.

In what way do we hope/expect this new bivariate solver embedded in a conjugate direction solver to perform better than old IRLS solvers? After paying the inevitable price, a substantial price, of computing F^T r and F Δm, the iteration above does some serious thinking, not a simple linearization, before paying the price again. If the convex function C(r) were least squares, subsequent iterations would do nothing. Although the Taylor series of the second iteration would expand about different residuals r_i than the first iteration, the new second-order Taylor series is the exact representation of the least squares penalty function, i.e. the same as the first, so the next iteration goes nowhere. Will this computational method work (converge fast enough) in the L1 limit? I don't know. Perhaps we'll do better to approach that limit (if we actually want that limit) via gradually decreasing the threshold R.

BLOCKY LOGS: BOTH FITTING AND REGULARIZATION

Here we set out to find blocky functions, such as well logs. We will do data fitting with a somewhat L2-like convex penalty function while doing model styling with a more L1-like function. We might define the composite norm threshold residual R_d for the data fitting at the 60th percentile, and that for regularization seeking spiky models (with blocky integrals) as R_m at the 5th percentile. The data fitting goal and the model regularization goal at each z are independent from those at all other z values.
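Stepping back to the plane search: equations (20)-(24) can be sketched as code. This is my toy illustration, not the author's program; F, d, and the starting step are invented, the hybrid derivatives of equations (15)-(16) supply C' and C'', and R is set comparable to the residual scale:

```python
import numpy as np

def hybrid_d1(r, R):                       # C', equation (15)
    return r / np.sqrt(1.0 + (r / R)**2)

def hybrid_d2(r, R):                       # C'', equation (16)
    return (1.0 + (r / R)**2) ** -1.5

def plane_search(r, g, s, R):
    """Solve the 2x2 system of equation (24) for (alpha, beta)."""
    c1, c2 = hybrid_d1(r, R), hybrid_d2(r, R)
    M = np.array([[np.sum(c2 * g * g), np.sum(c2 * g * s)],
                  [np.sum(c2 * s * g), np.sum(c2 * s * s)]])
    rhs = -np.array([np.sum(c1 * g), np.sum(c1 * s)])
    return np.linalg.solve(M, rhs)

rng = np.random.default_rng(1)
F = rng.normal(size=(50, 10))              # toy modeling operator
d = rng.normal(size=50)                    # toy data
R = 3.0                                    # threshold near the residual scale
m = np.zeros(10)
r = F @ m - d
step_m = rng.normal(size=10) * 1e-3        # fake "previous step" to start
step_r = F @ step_m                        # its image in data space

for _ in range(30):
    dm = F.T @ hybrid_d1(r, R)             # gradient, equation (4)
    g = F @ dm                             # moved to data space
    alpha, beta = plane_search(r, g, step_r, R)
    step_m = alpha * dm + beta * step_m    # combined move in model space
    step_r = alpha * g + beta * step_r     # same move seen in data space
    m += step_m
    r += step_r                            # keeps r = F m - d exactly
```

Each iteration pays for two operator applications (F^T and F) and then solves only a 2x2 system, mirroring the conjugate-direction recipe in the text.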
The fitting goal says the reflectivity m(z) should be equal to its measurement d(z) (the seismogram). The model styling goal says the reflectivity m(z) should vanish:

    0 ≈ r_d(z) = m(z) - d(z)                                             (25)
    0 ≈ r_m(z) = ε m(z)                                                  (26)

These two goals are in direct contradiction to each other. With the L2 norm the answer would be simply m = d/(1 + ε^2). With the L1 norm, the answer would be either m = d or m = 0, depending on the numerical choice of ε. Let us denote the convex function and its derivatives for data space at the residual as (B, B', B'') and for model space as (C, C', C''). Remember, m and d, while normally vectors, are here
scalars (independently for each z).

    loop over all time points {
        m = d / (1 + ε^2)                          # These are scalars!
        loop over non-linear iterations {
            r_d = m - d
            r_m = m
            Get derivatives of hybrid norm B'(r_d) and B''(r_d) for the data goal.
            Get derivatives of hybrid norm C'(r_m) and C''(r_m) for the model goal.
            # Plan to find α to update m = m + α
            # Taylor series for data penalty:   N(r_d) = B + B' α + B'' α^2 / 2
            # Taylor series for model penalty:  N(r_m) = C + C' α + C'' α^2 / 2
            # 0 = d/dα ( N(r_d) + ε N(r_m) )
            α = -(B' + ε C') / (B'' + ε C'')
            m = m + α
        } end of loop over non-linear iterations
    } end of loop over all time points

To help us understand the choice of parameters R_d, R_m, and ε, we examine the theoretical relation between m and d implied by the above code as a function of ε and R_m in the limit R_d -> ∞; in other words, when the data has normal behavior and we are mostly interested in the role of the regularization drawing weak signals down towards zero. The data fitting penalty is then B = (m - d)^2 / 2 and its derivative B' = m - d. The derivative of the model penalty (from equation (15)) is C' = m / sqrt(1 + m^2/R_m^2). Setting the sum of the derivatives to zero, we have

    0 = B' + ε C' = m - d + ε m / sqrt(1 + m^2/R_m^2)                    (27)

This says m is mostly a little smaller than d, but it gets more interesting near (m, d) ≈ 0. There the slope is m/d = 1/(1 + ε), which says an ε = 4 will damp the signal (where small) by a factor of 5. Moving away from m = 0, we see the damping power of ε diminishes uniformly as m exceeds R_m.

UNKNOWN SHOT WAVEFORM

A one-dimensional seismogram d(t) is unknown reflectivity c(t) convolved with an unknown source waveform s(t). The number of data points ND ≈ NC is less than the number of unknowns NC + NS. Clearly we need a smart regularization. Let us see how this problem can be set up so that the reflectivity c(t) comes out with sparse spikes, so the integral of c(t) is blocky. This is a nonlinear problem because the convolution of the unknowns is made of their product. Nonlinear problems elicit well-warranted fear of multiple solutions leading to us getting stuck in the wrong one.
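The scalar loop above translates directly into code. A sketch (NumPy; I take R_d large so the data fitting is nearly L2, matching the R_d -> ∞ analysis, and the sample values are invented):

```python
import numpy as np

def d1(r, R):   # C' of the hybrid norm, equation (15)
    return r / np.sqrt(1.0 + (r / R)**2)

def d2(r, R):   # C'' of the hybrid norm, equation (16)
    return (1.0 + (r / R)**2) ** -1.5

def blocky_point(d, eps, R_d, R_m, niter=20):
    """One sample of the blocky-log fit: min_m B(m - d) + eps*C(m)."""
    m = d / (1.0 + eps**2)                 # L2 starting guess
    for _ in range(niter):
        r_d, r_m = m - d, m
        num = d1(r_d, R_d) + eps * d1(r_m, R_m)
        den = d2(r_d, R_d) + eps * d2(r_m, R_m)
        m -= num / den                     # alpha = -(B' + eps C')/(B'' + eps C'')
    return m

# Small inputs are damped by about 1/(1 + eps); large inputs pass
# with only about eps*R_m subtracted, as equation (27) predicts.
print(blocky_point(0.1, eps=4.0, R_d=100.0, R_m=0.5))    # near 0.02
print(blocky_point(10.0, eps=4.0, R_d=100.0, R_m=0.5))   # near 8.0
```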
The key to avoiding this pitfall is
starting close enough to the correct solution. The way to get close enough (besides luck and a good starting guess) is to define a linear problem that takes us to the neighborhood where a nonlinear solver can be trusted. We will do that first.

Block cyclic solver

In the Block Cyclic solver (I hope this is the correct term), we have two half cycles. In the first half we take one of the variables as known and the other as unknown. We solve for the unknown. In the next half we switch the known for the unknown. The beauty of this approach is that each half cycle is a linear problem, so its solution is independent of the starting location. Hooray! Even better, repeating the cycles enough times should converge to the correct solution. Hooray again! The convergence may be slow, however, so at some stage (maybe just one or two cycles) you can safely switch over to the nonlinear method, which converges faster because it deals directly with the interactions of the two variables.

We could begin from the assumption that the shot waveform is an impulse and the reflectivity is the data. Then either half cycle can be the starting point. Suppose we assume we know the reflectivity, say c, and solve for the shot waveform s. We use the reflectivity c to make a convolution matrix C. The regression pair for finding s is

    0 ≈ C s - d                                                          (28)
    0 ≈ ε_s I s                                                          (29)

These would be solved for s by familiar least squares methods. It is a very easy problem because s has many fewer components than c. Now with our source estimate s we can define the operator S that convolves it on the reflectivity c. The second half of the cycle is to solve for the reflectivity c. This is a little trickier. The data fitting may still be done by an L2-type method, but we need something like an L1 method for the regularization to pull the small values closer to zero, to yield a more spiky c(t).

    0 ≈ r_d = S c - d        (L2)                                        (30)
    0 ≈ r_m = I c            (L1)                                        (31)

Normally we expect an ε_c in equation (31), but now it comes in later.
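The two half cycles of regressions (28)-(31) can be sketched as alternating damped least squares. This toy uses L2 for both halves (the text argues the c half really wants an L1-like regularization); the signals, sizes, and eps value are invented:

```python
import numpy as np

def conv_matrix(x, n):
    """Matrix whose product with a length-n vector convolves it with x."""
    M = np.zeros((len(x) + n - 1, n))
    for j in range(n):
        M[j:j + len(x), j] = x
    return M

# Invented truth: sparse reflectivity and a short wavelet.
c_true = np.zeros(30)
c_true[[3, 11, 22]] = [1.0, -0.7, 0.5]
s_true = np.array([1.0, -0.5, 0.2])
d = np.convolve(c_true, s_true)            # length 32

# Start from the impulse-wavelet assumption: reflectivity = data.
c = d[:30].copy()
eps = 0.1
for _ in range(10):
    # Half cycle 1: fix c, solve the damped regressions (28)-(29) for s.
    C = conv_matrix(c, 3)
    s = np.linalg.lstsq(np.vstack([C, eps * np.eye(3)]),
                        np.concatenate([d, np.zeros(3)]), rcond=None)[0]
    # Half cycle 2: fix s, solve for c (damped L2 stand-in for (30)-(31)).
    S = conv_matrix(s, 30)
    c = np.linalg.lstsq(np.vstack([S, eps * np.eye(30)]),
                        np.concatenate([d, np.zeros(30)]), rcond=None)[0]
```

Each half cycle is linear, so the pair can run from any starting point; after a cycle or two, the text suggests handing (c, s) to the nonlinear solver, which also deals with the scale and impulse-wavelet ambiguities this L2 toy leaves unresolved.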
(It might seem that the regularization (29) is not necessary, but without it, c might get smaller and smaller while s gets larger and larger. We should be able to neglect regression (29) if we simply rescale appropriately at each iteration.)

We can take the usual L2 norm to define a gradient vector for the model perturbation Δc = S^T r_d. From it we get the residual perturbation Δr_d = S Δc. We need to find an unknown distance α to move in those directions. We take the norm of the data fitting residual, add to it a bit ε of the model styling residual, and set the derivative to zero:

    0 = d/dα [ N_d(r_d + α Δr) + ε N_m(c + α Δc) ]                       (32)
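Setting the α derivative in equation (32) to zero, with a second-order Taylor expansion of each norm, gives a closed-form step. A sketch (NumPy; A', A'', B', B'' are arrays of first and second derivatives evaluated at the current residuals, and the names are mine):

```python
import numpy as np

def alpha_step(A1, A2, B1, B2, dr, dc, eps):
    """Steepest-descent step length from the zeroed alpha derivative:
    alpha = -(sum A'*dr + eps*sum B'*dc) / (sum A''*dr^2 + eps*sum B''*dc^2)."""
    num = np.sum(A1 * dr) + eps * np.sum(B1 * dc)
    den = np.sum(A2 * dr**2) + eps * np.sum(B2 * dc**2)
    return -num / den

# Sanity check: with least squares (A' = r, A'' = 1) and eps = 0 this
# reduces to the classic alpha = -(r . dr)/(dr . dr).
r = np.array([1.0, 2.0, -1.0])
dr = np.array([1.0, 1.0, 0.5])
alpha = alpha_step(r, np.ones(3), np.zeros(3), np.zeros(3), dr, np.zeros(3), 0.0)
print(alpha)
```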
We need derivatives of each norm at each residual. We base these on the convex function C(r) of the Hybrid norm. Let us call these A for the data fitting and B for the model styling:

    A_i = C(R_d, r_i)                                                    (33)
    B_i = C(R_m, c_i)                                                    (34)

(Actually, we don't need A, because for least squares A'_i = r_i and A''_i = 1, but I include it here in case we wish to deal with noise bursts in the data.) As earlier, expanding the norms in Taylor series, equation (32) becomes

    0 = sum_i A'_i Δr_i + α sum_i A''_i Δr_i^2 + ε ( sum_i B'_i Δc_i + α sum_i B''_i Δc_i^2 )   (35)

which gives the α we need to update the model c and the residual r_d:

    α = - ( sum_i A'_i Δr_i + ε sum_i B'_i Δc_i ) / ( sum_i A''_i Δr_i^2 + ε sum_i B''_i Δc_i^2 )   (36)

This is the steepest descent method. For the conjugate directions method there is a 2x2 equation like equation (24).

Non-linear solver

The non-linear approach is a little more complicated, but it explicitly deals with the interaction between s and c, so it converges faster. We represent everything as a known part plus a perturbation part, which we will find and add into the known part. This is most easily expressed in the Fourier domain. Linearize by dropping ΔS ΔC:

    0 ≈ (S + ΔS)(C + ΔC) - D                                             (37)
    0 ≈ S ΔC + C ΔS + (C S - D)                                          (38)

Let us change to the time domain with a matrix notation. Put the unknowns ΔC and ΔS in vectors Δc and Δs. Put the knowns C and S in convolution matrices C and S. Express C S - D as a column vector Δd. Its time domain coefficients are Δd_0 = c_0 s_0 - d_0 and Δd_1 = c_0 s_1 + c_1 s_0 - d_1, etc. The data fitting regression is now

    0 ≈ S Δc + C Δs + Δd                                                 (39)

This regression is expressed more explicitly below.
    0 ≈ [ S  C ] [ Δc ] + Δd                                             (40)
                 [ Δs ]

where, for 7 reflectivity points, a 3-point wavelet, and 9 data points,

    S = [ s_0   .    .    .    .    .    .  ]      C = [ c_0   .    .  ]
        [ s_1  s_0   .    .    .    .    .  ]          [ c_1  c_0   .  ]
        [ s_2  s_1  s_0   .    .    .    .  ]          [ c_2  c_1  c_0 ]
        [  .   s_2  s_1  s_0   .    .    .  ]          [ c_3  c_2  c_1 ]
        [  .    .   s_2  s_1  s_0   .    .  ]          [ c_4  c_3  c_2 ]
        [  .    .    .   s_2  s_1  s_0   .  ]          [ c_5  c_4  c_3 ]
        [  .    .    .    .   s_2  s_1  s_0 ]          [ c_6  c_5  c_4 ]
        [  .    .    .    .    .   s_2  s_1 ]          [  .   c_6  c_5 ]
        [  .    .    .    .    .    .   s_2 ]          [  .    .   c_6 ]

    Δc = (Δc_0, ..., Δc_6)^T,   Δs = (Δs_0, Δs_1, Δs_2)^T,   Δd = (Δd_0, ..., Δd_8)^T.

The model styling regression is simply 0 ≈ ΔC + C, which in familiar matrix form is

    0 ≈ I Δc + c                                                         (41)

It is this regression, along with a composite norm and its associated threshold, that makes c(t) come out sparse. Now we have the danger that c -> 0 while s -> ∞, so we need one more regression:

    0 ≈ I Δs + s                                                         (42)

We can use ordinary least squares on the data fitting regression and the shot waveform regression. Thus

    0 ≈ [ r_d ] = [ S     C   ] [ Δc ] + [ Δd   ]
        [ r_s ]   [ 0   ε_s I ] [ Δs ]   [ ε_s s ]                       (43)

    r = F Δm + Δd                                                        (44)

The model styling regression is where we seek spiky behavior:

    0 ≈ r_c = I Δc + c                                                   (45)

The big picture is that we minimize the sum

    min_{Δc, Δs}  N_d(r) + ε N_m(r_c)                                    (46)

Inside the big picture we have updating steps

    Δm = F^T r                                                           (47)
    g_d = Δr = F Δm                                                      (48)

We also have a gradient for changing c, namely g_c = Δc. (I need to be sure g_d and g_c are commensurate. Maybe we need an ε here.) One update step is to choose a line search for α:

    min_α  N_d(r + α g_d) + ε N_m(Δc + α g_c)                            (49)
That was steepest descent. The extension to conjugate direction is straightforward. As with all nonlinear problems there is the danger of bizarre behavior and multiple minima. To avoid frustration, while learning you should spend about half of your effort directed toward finding a good starting solution. This normally amounts to defining and solving one or two linear problems. In this application we might get our starting solution for s(t) and c(t) from conventional deconvolution analysis, or we might get it from the block cyclic solver.
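The linearized regression (39) and the stacked least-squares system (43) can be sketched end to end. This toy (NumPy) uses the same sizes as the display in equation (40): 7 reflectivity points, a 3-point wavelet, 9 data points; all values are invented, and eps_s plays the role of the damping on Δs:

```python
import numpy as np

def conv_matrix(x, n):
    """Matrix whose product with a length-n vector convolves it with x."""
    M = np.zeros((len(x) + n - 1, n))
    for j in range(n):
        M[j:j + len(x), j] = x
    return M

rng = np.random.default_rng(3)
c = rng.normal(size=7)                 # current reflectivity estimate
s = np.array([1.0, -0.4, 0.1])         # current wavelet estimate
c_star = np.zeros(7)
c_star[[1, 4]] = [1.0, -0.6]
d = np.convolve(c_star, np.array([1.0, -0.5, 0.2]))   # invented data, length 9

S = conv_matrix(s, 7)                  # 9 x 7, acts on a perturbation dc
C = conv_matrix(c, 3)                  # 9 x 3, acts on a perturbation ds
dd = np.convolve(c, s) - d             # current residual, the Delta-d of (39)

eps_s = 0.1
# Stacked system of equation (43): data-fitting rows over damping rows.
F = np.block([[S, C],
              [np.zeros((3, 7)), eps_s * np.eye(3)]])
dm = np.linalg.lstsq(F, -np.concatenate([dd, eps_s * s]), rcond=None)[0]
dc, ds = dm[:7], dm[7:]
```

One such Gauss-Newton step drives the linearized residual S·Δc + C·Δs + Δd nearly to zero; the sparsity-promoting regression (45) with the hybrid norm would be layered on top of this least-squares core.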
More informationLinear Regression Analysis: Terminology and Notation
ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented
More informationNotes on Frequency Estimation in Data Streams
Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to
More information36.1 Why is it important to be able to find roots to systems of equations? Up to this point, we have discussed how to find the solution to
ChE Lecture Notes - D. Keer, 5/9/98 Lecture 6,7,8 - Rootndng n systems o equatons (A) Theory (B) Problems (C) MATLAB Applcatons Tet: Supplementary notes rom Instructor 6. Why s t mportant to be able to
More informationGrover s Algorithm + Quantum Zeno Effect + Vaidman
Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the
More informationWeek3, Chapter 4. Position and Displacement. Motion in Two Dimensions. Instantaneous Velocity. Average Velocity
Week3, Chapter 4 Moton n Two Dmensons Lecture Quz A partcle confned to moton along the x axs moves wth constant acceleraton from x =.0 m to x = 8.0 m durng a 1-s tme nterval. The velocty of the partcle
More informationIntegrals and Invariants of Euler-Lagrange Equations
Lecture 16 Integrals and Invarants of Euler-Lagrange Equatons ME 256 at the Indan Insttute of Scence, Bengaluru Varatonal Methods and Structural Optmzaton G. K. Ananthasuresh Professor, Mechancal Engneerng,
More informationReview of Taylor Series. Read Section 1.2
Revew of Taylor Seres Read Secton 1.2 1 Power Seres A power seres about c s an nfnte seres of the form k = 0 k a ( x c) = a + a ( x c) + a ( x c) + a ( x c) k 2 3 0 1 2 3 + In many cases, c = 0, and the
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationAssortment Optimization under MNL
Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.
More informationExercises. 18 Algorithms
18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)
More informationNeural networks. Nuno Vasconcelos ECE Department, UCSD
Neural networs Nuno Vasconcelos ECE Department, UCSD Classfcaton a classfcaton problem has two types of varables e.g. X - vector of observatons (features) n the world Y - state (class) of the world x X
More informationELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM
ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM An elastc wave s a deformaton of the body that travels throughout the body n all drectons. We can examne the deformaton over a perod of tme by fxng our look
More informationAdvanced Circuits Topics - Part 1 by Dr. Colton (Fall 2017)
Advanced rcuts Topcs - Part by Dr. olton (Fall 07) Part : Some thngs you should already know from Physcs 0 and 45 These are all thngs that you should have learned n Physcs 0 and/or 45. Ths secton s organzed
More informationThe Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction
ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also
More informationLecture 10: Euler s Equations for Multivariable
Lecture 0: Euler s Equatons for Multvarable Problems Let s say we re tryng to mnmze an ntegral of the form: {,,,,,, ; } J f y y y y y y d We can start by wrtng each of the y s as we dd before: y (, ) (
More information15-381: Artificial Intelligence. Regression and cross validation
15-381: Artfcal Intellgence Regresson and cross valdaton Where e are Inputs Densty Estmator Probablty Inputs Classfer Predct category Inputs Regressor Predct real no. Today Lnear regresson Gven an nput
More informationAPPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14
APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce
More informationwhere the sums are over the partcle labels. In general H = p2 2m + V s(r ) V j = V nt (jr, r j j) (5) where V s s the sngle-partcle potental and V nt
Physcs 543 Quantum Mechancs II Fall 998 Hartree-Fock and the Self-consstent Feld Varatonal Methods In the dscusson of statonary perturbaton theory, I mentoned brey the dea of varatonal approxmaton schemes.
More informationBernoulli Numbers and Polynomials
Bernoull Numbers and Polynomals T. Muthukumar tmk@tk.ac.n 17 Jun 2014 The sum of frst n natural numbers 1, 2, 3,..., n s n n(n + 1 S 1 (n := m = = n2 2 2 + n 2. Ths formula can be derved by notng that
More informationCIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationPhysics 5153 Classical Mechanics. D Alembert s Principle and The Lagrangian-1
P. Guterrez Physcs 5153 Classcal Mechancs D Alembert s Prncple and The Lagrangan 1 Introducton The prncple of vrtual work provdes a method of solvng problems of statc equlbrum wthout havng to consder the
More informationLecture 2: Prelude to the big shrink
Lecture 2: Prelude to the bg shrnk Last tme A slght detour wth vsualzaton tools (hey, t was the frst day... why not start out wth somethng pretty to look at?) Then, we consdered a smple 120a-style regresson
More informationStanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011
Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected
More informationFeb 14: Spatial analysis of data fields
Feb 4: Spatal analyss of data felds Mappng rregularly sampled data onto a regular grd Many analyss technques for geophyscal data requre the data be located at regular ntervals n space and/or tme. hs s
More informationEEE 241: Linear Systems
EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they
More informationMidterm Examination. Regression and Forecasting Models
IOMS Department Regresson and Forecastng Models Professor Wllam Greene Phone: 22.998.0876 Offce: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Emal: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/regresson/outlne.htm
More informationEconomics 101. Lecture 4 - Equilibrium and Efficiency
Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More information10-701/ Machine Learning, Fall 2005 Homework 3
10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40
More information4DVAR, according to the name, is a four-dimensional variational method.
4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The
More informationSIO 224. m(r) =(ρ(r),k s (r),µ(r))
SIO 224 1. A bref look at resoluton analyss Here s some background for the Masters and Gubbns resoluton paper. Global Earth models are usually found teratvely by assumng a startng model and fndng small
More informationSupport Vector Machines
Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class
More informationa b a In case b 0, a being divisible by b is the same as to say that
Secton 6.2 Dvsblty among the ntegers An nteger a ε s dvsble by b ε f there s an nteger c ε such that a = bc. Note that s dvsble by any nteger b, snce = b. On the other hand, a s dvsble by only f a = :
More informationp 1 c 2 + p 2 c 2 + p 3 c p m c 2
Where to put a faclty? Gven locatons p 1,..., p m n R n of m houses, want to choose a locaton c n R n for the fre staton. Want c to be as close as possble to all the house. We know how to measure dstance
More informationExample: (13320, 22140) =? Solution #1: The divisors of are 1, 2, 3, 4, 5, 6, 9, 10, 12, 15, 18, 20, 27, 30, 36, 41,
The greatest common dvsor of two ntegers a and b (not both zero) s the largest nteger whch s a common factor of both a and b. We denote ths number by gcd(a, b), or smply (a, b) when there s no confuson
More informationThe Geometry of Logit and Probit
The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.
More informationCSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography
CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve
More informationVector Norms. Chapter 7 Iterative Techniques in Matrix Algebra. Cauchy-Bunyakovsky-Schwarz Inequality for Sums. Distances. Convergence.
Vector Norms Chapter 7 Iteratve Technques n Matrx Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematcs Unversty of Calforna, Berkeley Math 128B Numercal Analyss Defnton A vector norm
More informationMath1110 (Spring 2009) Prelim 3 - Solutions
Math 1110 (Sprng 2009) Solutons to Prelm 3 (04/21/2009) 1 Queston 1. (16 ponts) Short answer. Math1110 (Sprng 2009) Prelm 3 - Solutons x a 1 (a) (4 ponts) Please evaluate lm, where a and b are postve numbers.
More informatione i is a random error
Chapter - The Smple Lnear Regresson Model The lnear regresson equaton s: where + β + β e for,..., and are observable varables e s a random error How can an estmaton rule be constructed for the unknown
More information