Mathematics notation and review

Appendix A Mathematics notation and review

This appendix gives brief coverage of the mathematical notation and concepts that you'll encounter in this book. In the space of a few pages it is of course impossible to do justice to topics such as integration and matrix algebra. Readers interested in strengthening their fundamentals in these areas are encouraged to consult XXX [calculus] and Healy (2000).

A.1 Sets ($\{\}, \cup, \cap, \emptyset$)

The notation $\{a, b, c\}$ should be read as "the set containing the elements $a$, $b$, and $c$". With sets, it's sometimes a convention that lower-case letters are used as names for elements, and upper-case letters as names for sets, though this is a weak convention (after all, sets can contain anything, even other sets!). $A \cup B$ is read as "the union of $A$ and $B$", and its value is the set containing exactly those elements that are present in $A$, in $B$, or in both. $A \cap B$ is read as "the intersection of $A$ and $B$", and its value is the set containing only those elements present in both $A$ and $B$. $\emptyset$, or equivalently $\{\}$, denotes the empty set: the set containing nothing. Note that $\{\emptyset\}$ isn't the empty set: it's the set containing only the empty set, and since it contains something, it isn't empty! [introduce set complementation if necessary]

A.1.1 Countability of sets

[briefly describe]

A.2 Summation ($\sum$)

Many times we'll want to express a complex sum of systematically related parts, such as $1 + 2 + 3 + 4 + 5$ or $x_1 + x_2 + x_3 + x_4 + x_5$, more compactly. We use summation notation for this:

\[ \sum_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 \]

\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 \]

In these cases, $i$ is sometimes called an index variable, linking the range of the sum (1 to 5 in both of these cases) to its contents. Sums can be nested:

\[ \sum_{i=1}^{2} \sum_{j=1}^{2} x_{ij} = x_{11} + x_{12} + x_{21} + x_{22} \]

\[ \sum_{i=1}^{3} \sum_{j=1}^{i} x_{ij} = x_{11} + x_{21} + x_{22} + x_{31} + x_{32} + x_{33} \]

Sums can also be infinite:

\[ \sum_{i=1}^{\infty} i = 1 + 2 + 3 + \dots \]

Frequently, the range of the sum can be understood from context, and will be left out; or we want to be vague about the precise range of the sum. For example, suppose that there are $n$ variables, $x_1$ through $x_n$. In order to say that the sum of all $n$ variables is equal to 1, we might simply write

\[ \sum_i x_i = 1 \]

A.3 Product of a sequence ($\prod$)

Just as we often want to express a complex sum of systematically related parts, we often want to express a product of systematically related parts as well. We use product notation to do this:

\[ \prod_{i=1}^{5} i = 1 \times 2 \times 3 \times 4 \times 5 \]

\[ \prod_{i=1}^{5} x_i = x_1 x_2 x_3 x_4 x_5 \]

Usage of product notation is completely analogous to summation notation as described in Section A.2.

A.4 Cases notation ($\{$)

Some types of equations, especially those describing probability functions, are often best expressed in the form of one or more conditional statements. As an example, consider a six-sided die that is weighted such that when it is rolled, 50% of the time the outcome is a six, with the other five outcomes all being equally likely (i.e. 10% each). If we define a discrete random variable $X$ representing the outcome of a roll of this die, then the clearest way of specifying the probability mass function for $X$ is by splitting up the real numbers into three groups, such that all numbers in a given group are equally probable: (a) 6 has probability 0.5; (b) 1, 2, 3, 4, and 5 each have probability 0.1; (c) all other numbers have probability zero. Groupings of this type are often expressed using cases notation in an equation, with each of the cases expressed on a different row:

\[ P(X = x) = \begin{cases} 0.5 & x = 6 \\ 0.1 & x \in \{1, 2, 3, 4, 5\} \\ 0 & \text{otherwise} \end{cases} \]

A.5 Logarithms and exponents

The log in base $b$ of a number $x$ is expressed as $\log_b x$; when no base is given, as in $\log x$, the base should be assumed to be the mathematical constant $e$. The expression $\exp[x]$ is equivalent to the expression $e^x$. Among other things, logarithms are useful in probability theory because they allow one to translate between sums and products: $\sum_i \log x_i = \log \prod_i x_i$. Derivatives of logarithmic and exponential functions are as follows:

\[ \frac{d}{dx} \log_b x = \frac{1}{x \log b} \]

\[ \frac{d}{dx} y^x = y^x \log y \]

A.6 Integration ($\int$)

Sums are always over countable (finite or countably infinite) sets. The analogue over a continuum is integration. Correspondingly, you need to know a bit about integration in order to understand continuous random variables. In particular, a basic grasp of integration is essential to understanding how Bayesian statistical inference works.

One simple view of integration is as computing "area under the curve". In the case of integrating a function $f$ over some range $[a, b]$ of a one-dimensional variable $x$ in which $f(x) > 0$, this view is literally correct. Imagine plotting the curve $f(x)$ against $x$, extending straight lines from points $a$ and $b$ on the $x$-axis up to the curve, and then laying the plot down on a table. The area on the table enclosed on four sides by the curve, the $x$-axis, and the two additional straight lines is the integral

\[ \int_a^b f(x)\,dx \]

[Figure A.1: Integration. Panels (a) and (b) plot $f(x)$ against $x$; panel (c) plots $f(x, y)$ against $x$ and $y$.]

This is depicted graphically in Figure A.1a. The situation is perhaps slightly less intuitive, but really no more complicated, when $f(x)$ crosses the $x$-axis. In this case, area under the $x$-axis counts as "negative" area. An example is given in Figure A.1b; the function here is $f(x) = \frac{1}{2}(2.5 - x)$. Since the area of a triangle with height $h$ and length $l$ is $\frac{lh}{2}$, we can compute the integral in this case by subtracting the area of the smaller triangle from the larger triangle:

\[ \int_1^3 f(x)\,dx = \frac{1}{2} \times 1.5 \times 0.75 - \frac{1}{2} \times 0.5 \times 0.25 = 0.5 \]

Integration also generalizes to multiple dimensions. For instance, the integral of a function $f$ over an area in two dimensions $x$ and $y$, where $f(x, y) > 0$, can be thought of as the volume enclosed by projecting the area's boundary from the $x,y$ plane up to the $f(x, y)$ surface. A specific example is depicted in Figure A.1c, where the area in this case is the square bounded by 1/4 and 3/4 in both the $x$ and $y$ directions.

\[ \int_{1/4}^{3/4} \int_{1/4}^{3/4} f(x, y)\,dx\,dy \]

An integral can also be over the entire range of a variable or set of variables. For instance, one would write an integral over the entire range of $x$ as $\int_{-\infty}^{\infty} f(x)\,dx$. Finally, in this book and in the literature on probabilistic inference you will see the abbreviated notation $\int_\theta f(\theta)\,d\theta$, where $\theta$ is typically an ensemble (collection) of variables. In this book, the proper interpretation of this notation is as the integral over the entire range of all variables in the ensemble $\theta$.
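The signed-area view of integration can be checked with a few lines of code. The following is a minimal Python sketch, not a method prescribed by this book: it approximates the integral of $f(x) = \frac{1}{2}(2.5 - x)$ over $[1, 3]$ by summing thin trapezoids. The function names `f` and `trapezoid_integral` and the choice of 1000 pieces are assumptions made purely for illustration.

```python
def f(x):
    """Example integrand: positive for x < 2.5, negative for x > 2.5."""
    return 0.5 * (2.5 - x)


def trapezoid_integral(func, lower, upper, pieces=1000):
    """Approximate the integral of func over [lower, upper].

    The range is cut into `pieces` equal-width slices, and each slice is
    treated as a trapezoid whose signed area is added to the total.
    (Illustrative helper; the name and default are assumptions.)
    """
    width = (upper - lower) / pieces
    total = 0.0
    for k in range(pieces):
        left = lower + k * width
        right = left + width
        # Signed area of one slice: average height times width.
        total += 0.5 * (func(left) + func(right)) * width
    return total


# Area above the x-axis (the larger triangle) counts as positive ...
positive_part = trapezoid_integral(f, 1.0, 2.5)
# ... and area below the x-axis (the smaller triangle) counts as negative.
negative_part = trapezoid_integral(f, 2.5, 3.0)
# Integrating over the whole range adds the two signed areas together.
whole = trapezoid_integral(f, 1.0, 3.0)

print(positive_part, negative_part, whole)
```

Because the integrand is a straight line, the trapezoid sums agree with the triangle areas up to floating-point rounding: roughly 0.5625 for the part above the axis, roughly -0.0625 for the part below it, and roughly 0.5 overall.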

A.6.1 Analytic integration tricks

Computing an integral analytically means finding an exact form for the value of the integral. There are entire books devoted to analytic integration, but for the contents of this book you'll get pretty far with just a few tricks.

1. Multiplication by constants. The integral of a function times a constant $C$ is the product of the constant and the integral of the function:

\[ \int C f(x)\,dx = C \int f(x)\,dx \]

2. Sum rule. The integral of a sum is the sum of the integrals of the parts:

\[ \int [f(x) + g(x)]\,dx = \int f(x)\,dx + \int g(x)\,dx \]

3. Expressing one integral as the difference between two other integrals: For $c < a, b$,

\[ \int_a^b f(x)\,dx = \int_c^b f(x)\,dx - \int_c^a f(x)\,dx \]

This is an extremely important technique when asking whether the outcome of a continuous random variable falls within a range $[a, b]$, because it allows you to answer this question in terms of cumulative distribution functions (Section 2.6); in these cases you'll choose $c = -\infty$.

4. Polynomials. For any $n \neq -1$:

\[ \int_a^b x^n\,dx = \frac{1}{n+1}\left(b^{n+1} - a^{n+1}\right) \]

And the special case for $n = -1$ is:

\[ \int_a^b x^{-1}\,dx = \log b - \log a \]

Note that the general formula also holds for $n = 0$, so that integration of a constant is easy:

\[ \int_a^b C\,dx = C(b - a) \]

5. Normalizing constants. If the function inside an integral looks the same as the probability density function for a known probability distribution, then its value is related to the normalizing constant of the probability distribution. [Examples: normal distribution; beta distribution; others?] For example, consider the integral

\[ \int_{-\infty}^{\infty} \exp\left[-\frac{x^2}{18}\right] dx \]

This may look hopelessly complicated, but by comparison with the equation for the normal density in Section 2.10 you will see that it looks just like the probability density function of a normally distributed random variable with mean $\mu = 0$ and variance $\sigma^2 = 9$, except that it doesn't have the normalizing constant $\frac{1}{\sqrt{2\pi\sigma^2}}$. In order to determine the value of this integral, we can start by noting that any probability density function integrates to 1:

\[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] dx = 1 \]

Substituting in $\mu = 0, \sigma^2 = 9$ we get

\[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{18\pi}} \exp\left[-\frac{x^2}{18}\right] dx = 1 \]

By the rule of multiplication by constants we get

\[ \frac{1}{\sqrt{18\pi}} \int_{-\infty}^{\infty} \exp\left[-\frac{x^2}{18}\right] dx = 1 \]

or equivalently

\[ \int_{-\infty}^{\infty} \exp\left[-\frac{x^2}{18}\right] dx = \sqrt{18\pi} \]

giving us the solution to the original problem.

A.6.2 Numeric integration

The alternative to analytic integration is numeric integration, which means approximating the value of an integral by explicit numeric computation. There are many ways to do this; one common way is by breaking up the range of integration into many small pieces, approximating the size of each piece, and summing the approximate sizes. A graphical example of how this might be done is shown in Figure A.2, where each piece of the area under the curve is approximated as a rectangle whose height is the average of the distances from the $x$-axis to the curve at the left and right edges of the rectangle. There are many techniques for numeric integration, and we shall have occasional use for some of them in this book.

[Figure A.2: Numeric integration. The curve $p(x)$ is plotted against $x$ over the range from $a$ to $b$.]

A.7 Precedence ($\prec$)

The $\prec$ operator is used occasionally in this book to denote linear precedence. In the syntax of English, for example, the information that "a verb phrase (VP) can consist of a verb (V) followed by a noun phrase (NP) object" is most often written as:

\[ \mathrm{VP} \rightarrow \mathrm{V}\ \mathrm{NP} \]

This statement combines two pieces of information: (1) a VP can be comprised of a V and an NP; and (2) in the VP, the V should precede the NP. In a syntactic tradition stemming from Generalized Phrase Structure Grammar (Gazdar et al., 1985), these pieces of information can be separated:

\[ (1)\ \mathrm{VP} \rightarrow \mathrm{V}, \mathrm{NP} \qquad (2)\ \mathrm{V} \prec \mathrm{NP} \]

where V, NP means "the unordered set of categories V and NP", and V $\prec$ NP reads as "V precedes NP".

A.8 Combinatorics ($\binom{n}{r}$)

The notation $\binom{n}{r}$ is read as "$n$ choose $r$" and is defined as the number of possible ways of selecting $r$ elements from a larger collection of $n$ elements, allowing each element to be chosen a maximum of once and ignoring order of selection. The following equality holds generally:

\[ \binom{n}{r} = \frac{n!}{r!\,(n - r)!} \tag{A.1} \]

The solution to the closely related problem of creating $m$ classes from $n$ elements by selecting $r_i$ for the $i$-th class and discarding the leftover elements is written as $\binom{n}{r_1 \dots r_m}$ and its value is

\[ \binom{n}{r_1 \dots r_m} = \frac{n!}{r_1! \dots r_m!} \tag{A.2} \]

(This expression assumes that the class sizes sum to $n$, so that no elements are left over; if elements are discarded, the denominator also includes the factor $(n - r_1 - \dots - r_m)!$, as in Equation A.1.) Terms of this form appear in this book in the binomial and multinomial probability mass functions, and as normalizing constants for the beta and Dirichlet distributions.

A.9 Basic matrix algebra

There are a number of situations in probabilistic modeling, many of which are covered in this book, where the computations needing to be performed can be simplified, both conceptually and notationally, by casting them in terms of matrix operations. A matrix $X$ of dimensions $m \times n$ is a set of $mn$ entries arranged rectangularly into $m$ rows and $n$ columns, with its entries indexed as $x_{ij}$:

\[ X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix} \]

For example, the matrix $A = \begin{bmatrix} 3 & 4 & -1 \\ 0 & -2 & 2 \end{bmatrix}$ has values $a_{11} = 3$, $a_{12} = 4$, $a_{13} = -1$, $a_{21} = 0$, $a_{22} = -2$, and $a_{23} = 2$. For a matrix $X$, the entry $x_{ij}$ is often called the $i,j$-th entry of $X$. If a matrix has the same number of rows and columns, it is often called a square matrix. Square matrices are often divided into the diagonal entries $\{x_{ii}\}$ and the off-diagonal entries $\{x_{ij}\}$ where $i \neq j$. A matrix of dimension $m \times 1$, that is, a single-column matrix, is often called a vector.

Symmetric matrices: a square matrix $A$ is symmetric if $A^T = A$, where $A^T$ is the transpose of $A$ (defined below). For example, the matrix

\[ \begin{bmatrix} 0 & 1 & 4 \\ 1 & 3 & 2 \\ 4 & 2 & 5 \end{bmatrix} \]

is symmetric. You will generally encounter symmetric matrices in this book as variance-covariance matrices (e.g., of the multivariate normal distribution, Section 3.5). Note that a symmetric $n \times n$ matrix has $\frac{n(n+1)}{2}$ free entries: one can choose the entries on and above the diagonal, but the entries below the diagonal are fully determined by the entries above it.

Diagonal and Identity matrices: For a square matrix $X$, the entries $x_{ii}$, that is, those whose column and row numbers are the same, are called the diagonal entries. A square matrix whose non-diagonal entries are all zero is called a diagonal matrix. A diagonal matrix of size $n \times n$ whose diagonal entries are all 1 is called the size-$n$ identity matrix. Hence $A$ below is a diagonal matrix, and $B$ below is the size-3 identity matrix.

\[ A = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

The $n \times n$ identity matrix is sometimes notated as $I_n$; when the dimension is clear from context, sometimes the simpler notation $I$ is used.

Transposition: For any matrix $X$ of dimension $m \times n$, the transpose of $X$, or $X^T$, is an $n \times m$-dimensional matrix such that the $i,j$-th entry of $X^T$ is the $j,i$-th entry of $X$. For the $2 \times 3$ matrix $A$ introduced at the start of this section, for example, we have

\[ A^T = \begin{bmatrix} 3 & 0 \\ 4 & -2 \\ -1 & 2 \end{bmatrix} \tag{A.3} \]
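The indexing and transposition conventions can be made concrete with a short sketch. The Python below represents a matrix as a list of rows; the helper names (`entry`, `transpose`, `is_symmetric`, `identity`) and the particular matrix values are assumptions chosen for illustration, not anything defined by this book.

```python
def entry(matrix, i, j):
    """Return the i,j-th entry using the 1-based indexing of the text."""
    return matrix[i - 1][j - 1]


def transpose(matrix):
    """Return the transpose: its i,j-th entry is the j,i-th entry of matrix."""
    return [list(column) for column in zip(*matrix)]


def is_symmetric(matrix):
    """A square matrix is symmetric exactly when it equals its transpose."""
    return matrix == transpose(matrix)


def identity(n):
    """Return the size-n identity matrix: ones on the diagonal, zeros elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# An illustrative 2 x 3 matrix (values assumed for this example).
A = [[3, 4, -1],
     [0, -2, 2]]

# An illustrative symmetric 3 x 3 matrix (values assumed for this example).
S = [[0, 1, 4],
     [1, 3, 2],
     [4, 2, 5]]

A_T = transpose(A)            # a 3 x 2 matrix
print(entry(A, 1, 3))         # the 1,3-th entry of A
print(entry(A_T, 3, 1))       # the 3,1-th entry of the transpose: the same value
print(is_symmetric(S), is_symmetric(identity(3)))
```

Representing a matrix as nested lists keeps the sketch dependency-free; in practice a numerical library would normally be used instead.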

Addition: Matrices of like dimension can be added. If $X$ and $Y$ are both $m \times n$ matrices, then $X + Y$ is the $m \times n$ matrix whose $i,j$-th entry is $x_{ij} + y_{ij}$. For example,

\[ \begin{bmatrix} 3 & 0 \\ 4 & -2 \\ -1 & 2 \end{bmatrix} + \begin{bmatrix} -1 & 1 \\ 0 & 2 \\ 5 & 5 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 4 & 0 \\ 4 & 7 \end{bmatrix} \tag{A.4} \]

Multiplication: If $X$ is an $l \times m$ matrix and $Y$ is an $m \times n$ matrix, then $X$ and $Y$ can be multiplied together; the resulting matrix $XY$ is an $l \times n$ matrix. If $Z = XY$, the $i,j$-th entry of $Z$ is:

\[ z_{ij} = \sum_{k=1}^{m} x_{ik} y_{kj} \]

For example, if $A = \begin{bmatrix} 1 & 2 \\ -1 & 0 \\ 3 & 1 \end{bmatrix}$ and $B = \begin{bmatrix} 3 & 4 & -1 & 6 \\ 0 & -5 & 2 & -2 \end{bmatrix}$, we have

\[ AB = \begin{bmatrix} 1 \times 3 + 2 \times 0 & 1 \times 4 + 2 \times (-5) & 1 \times (-1) + 2 \times 2 & 1 \times 6 + 2 \times (-2) \\ (-1) \times 3 + 0 \times 0 & (-1) \times 4 + 0 \times (-5) & (-1) \times (-1) + 0 \times 2 & (-1) \times 6 + 0 \times (-2) \\ 3 \times 3 + 1 \times 0 & 3 \times 4 + 1 \times (-5) & 3 \times (-1) + 1 \times 2 & 3 \times 6 + 1 \times (-2) \end{bmatrix} = \begin{bmatrix} 3 & -6 & 3 & 2 \\ -3 & -4 & 1 & -6 \\ 9 & 7 & -1 & 16 \end{bmatrix} \]

Unlike multiplication of scalars, matrix multiplication is not commutative; that is, it is not generally the case that $XY = YX$. In fact, being able to form the matrix product $XY$ does not even guarantee that we can do the multiplication in the opposite order and form the matrix product $YX$; the dimensions may not be right. (Such is the case for matrices $A$ and $B$ in our example.)

Determinants. For a square matrix $X$, the determinant $|X|$ is a measure of the matrix's "size". In this book, determinants appear in coverage of the multivariate normal distribution (Section 3.5); the normalizing constant of the multivariate normal density includes the determinant of the covariance matrix. (The univariate normal density, introduced in Section 2.10, is a special case; there, it is simply the variance of the distribution that appears in the normalizing constant.) For small matrices, there are simple techniques for calculating determinants: as an example, the determinant of a $2 \times 2$ matrix $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ is $|A| = ad - bc$. For larger matrices, computing determinants requires more general and complex techniques, which can be found in books on linear algebra such as Healy (2000).

Matrix Inversion. The inverse or reciprocal of an $n \times n$ square matrix $X$, denoted $X^{-1}$, is the $n \times n$ matrix such that $XX^{-1} = I_n$. As with scalars, the inverse of the inverse of a matrix $X$ is simply $X$. However, not all matrices have inverses (just like the scalar 0 has no inverse). For example, the following pair of matrices are inverses of each other:

\[ A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \qquad A^{-1} = \begin{bmatrix} \frac{2}{3} & -\frac{1}{3} \\ -\frac{1}{3} & \frac{2}{3} \end{bmatrix} \]

A.9.1 Algebraic properties of matrix operations

Associativity, Commutativity, and Distributivity. Consider matrices $A$, $B$, and $C$. Matrix multiplication is associative ($A(BC) = (AB)C$) and distributive over addition ($A(B + C) = AB + AC$), but not commutative: even if the multiplication is possible in both orderings (that is, if $B$ and $A$ are both square matrices with the same dimensions), in general $AB \neq BA$.

Transposition, inversion and determinants of matrix products. The transpose of a matrix product is the product of each matrix's transpose, in reverse order: $(AB)^T = B^T A^T$. Likewise, for invertible square matrices of the same size, the inverse of a matrix product is the product of each matrix's inverse, in reverse order: $(AB)^{-1} = B^{-1} A^{-1}$. The determinant of a matrix product is the product of the determinants:

\[ |AB| = |A|\,|B| \]

Because of this, the determinant of the inverse of a matrix is the reciprocal of the matrix's determinant:

\[ |A^{-1}| = \frac{1}{|A|} \]

A.10 Miscellaneous notation

$\propto$: You'll often see $f(x) \propto g(x)$ for some functions $f$ and $g$ of $x$. This is to be read as "$f(x)$ is proportional to $g(x)$", or "$f(x)$ is equal to $g(x)$ to within some constant factor". Typically it's used when $f(x)$ is intended to be a probability, and $g(x)$ is a function that obeys the first two axioms of probability theory, but is improper. This situation obtains quite often when, for example, conducting Bayesian inference.
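To see what the proportionality notation buys you in practice, here is a minimal Python sketch that turns an unnormalized function $g$ into a proper probability mass function $f$ by dividing by the sum of its values. The weighted die of Section A.4 serves as the example; the raw weights 1 and 5 and the helper name `normalize` are assumptions chosen only for illustration.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to 1.

    `weights` maps each outcome x to g(x); the result maps x to
    f(x) = g(x) / Z, where Z is the sum of all the g(x) values.
    (Illustrative helper; the name is an assumption.)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {outcome: weight / total for outcome, weight in weights.items()}


# Unnormalized weights g(x): a six is five times as likely as any other face.
g = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 5}

f = normalize(g)
print(f[6], f[1], sum(f.values()))
```

Here $f(x) \propto g(x)$ with constant of proportionality $1/Z$, where $Z = 10$ is the sum of the raw weights; dividing by $Z$ gives probability 0.5 for a six and 0.1 for each other face.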

Roger Levy, Probabilistic Models in the Study of Language, draft, November 6, 2012