Info 2950, Lecture 24


2 May 2017

An additional "bonus" was added to problem 4 on Fri afternoon 28 Apr, plus some typo fixes and slight rewording to resolve questions about problem 2 at the end of the afternoon. [Note: make sure your version has a problem 1C) and a problem 4C)iii) just before the bonus 4D.]

Prob Set 7: due Wed 3 May
Prob Set 8: due 11 May (end of classes)

Case / Deaton (2015) http://www.pnas.org/content/112/49/15078.full.pdf

WONDER = Wide-ranging Online Data for Epidemiologic Research https://wonder.cdc.gov/

Use a dictionary: after screening for white, non-hispanic, F, do fdata[row[10]].append(row[-1]). (Actually use row[10][1:-1], or int(row[10][3:-1]) to parse the HHSn.) Then fdata['HHS1'] gives the list of values, and so on.
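A minimal sketch of this pattern (the filename, delimiter, and column positions are assumptions; adjust them to your own WONDER export, and add the white / non-hispanic / F screening for whichever columns hold those fields):

import csv
from collections import defaultdict

fdata = defaultdict(list)                      # HHS region code -> list of values
with open('wonder_export.txt') as f:           # hypothetical filename for the WONDER export
    for row in csv.reader(f, delimiter='\t'):  # assuming a tab-delimited export
        # screening for the subpopulation of interest would go here
        key = row[10][1:-1]                    # strip surrounding quotes, e.g. '"HHS1"' -> 'HHS1'
        fdata[key].append(row[-1])             # value from the last column, as in the hint above

print(fdata['HHS1'][:5])                       # first few values for HHS region 1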

https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

"unigram" and "bigram" counts:
ucount = Counter(seq)
bcount = Counter([(seq[i], seq[i+1]) for i in range(len(seq)-1)])
and more generally the n-gram counts can be obtained using
ncount = Counter([tuple(seq[i:i+n]) for i in range(len(seq)-n+1)])
Any of these can be converted to probability arrays by dividing by the sum, e.g.:
nprob = np.array(list(ncount.values())) / sum(ncount.values())
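A self-contained version of the above, as a quick check (the toy token sequence is just for illustration):

from collections import Counter
import numpy as np

seq = 'the cat sat on the mat the cat ran'.split()   # toy token sequence

n = 3
ucount = Counter(seq)
bcount = Counter((seq[i], seq[i+1]) for i in range(len(seq)-1))
ncount = Counter(tuple(seq[i:i+n]) for i in range(len(seq)-n+1))

# convert counts to a probability array by dividing by the total
nprob = np.array(list(ncount.values())) / sum(ncount.values())

print(ucount.most_common(3))
print(bcount[('the', 'cat')])    # 2
print(nprob.sum())               # 1.0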

Chomsky (1957): Neither (a) 'colorless green ideas sleep furiously' nor (b) 'furiously sleep ideas green colorless', nor any of their parts, has ever occurred in the past linguistic experience of an English speaker. But (a) is grammatical, while (b) is not.

Pereira (2001): trained a statistical model on newspaper text; (a) is 200,000 times more probable than (b).

http://norvig.com/chomsky.html

[Markov chain transition diagram: states 0 through 7, with transition probabilities including .2, .4, .8, .3 marked on the edges]

The only recurrent states are 1 and 5, each trivially recurrent since it can only go to itself, and hence each forms its own recurrent class. Assume the Markov chain is in state 0 just before the first step; some questions:

(a) What is the probability of being in state 7 after n steps? The unique path is a first step from 0 to 7, then n-1 steps from 7 to itself, hence the probability is (1/2)(.4)^(n-1) (1/2 for the step from 0 to 7, times .4 for each of the n-1 steps from 7 to itself).

(b) What is the probability of reaching state 2 for the first time on the k-th step? The unique path for this is a first step from 0 to 7, then k-2 steps from 7 to itself, then a step from 7 to 2, hence the probability is (1/2)(.4)^(k-2)(.2).

(c) What is the probability of never reaching state 1? The only way to avoid state 1 is to follow the path 0 → 7 → 4. From state 7 it's twice as likely to go to state 4 as to state 2, therefore 2/3 of the time one will eventually end up in the direction of state 4 from state 7. Since the probability of going from state 0 to state 7 is 1/2, the overall probability of going from state 0 to state 4, and never reaching state 1, is (1/2)(2/3) = 1/3.

Sum a geometric series

The infinite sum
S(q) = Σ_{n=0}^∞ q^n = 1 + q + q^2 + q^3 + ...
satisfies the relation
q S(q) = q + q^2 + q^3 + ... = S(q) - 1
so 1 = S(q)(1 - q) and S(q) = 1/(1 - q).

Another way to reproduce the above is to note that the overall probability to reach state 2 from state 7 is given by summing over paths that return k times to state 7, and using
Σ_{n=0}^∞ p^n = 1/(1 - p).
Then
p_72 Σ_{k=0}^∞ (p_77)^k = p_72/(1 - p_77) = .2/.6 = 1/3
Similarly, the probability to eventually reach state 4 from state 7 is
p_74 Σ_{k=0}^∞ (p_77)^k = p_74/(1 - p_77) = .4/.6 = 2/3
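A quick numeric sanity check of these two sums, using the transition probabilities from the slide (p_77 = .4, p_72 = .2, p_74 = .4):

p77, p72, p74 = .4, .2, .4
partial = sum(p77**k for k in range(100))   # partial geometric sum, converges quickly
print(p72 * partial, p72 / (1 - p77))       # both approximately 1/3
print(p74 * partial, p74 / (1 - p77))       # both approximately 2/3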

(d) What is the expected number of steps until hitting state 5?

Waiting time = 1/p

If the probability of leaving a state (and never returning) is p, then the probability of leaving after exactly n steps is (1 - p)^(n-1) p. With q = 1 - p, the expectation value for the number of steps to leave is thus

E[n] = Σ_{n=0}^∞ n q^(n-1) p = p ∂/∂q Σ_{n=0}^∞ q^n = p ∂/∂q [1/(1 - q)] = p/(1 - q)^2 = p/p^2 = 1/p

So for example it takes on average two flips of a fair coin to get the first head, or on average six rolls of a die to get the first 3 (of course in any particular trial, sometimes more and sometimes less). The expected number of steps to leave a state is also called the waiting time.
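A small simulation sketch of this result (the function name and number of trials are arbitrary choices):

import random

def waiting_time(p):
    # number of steps until an event with per-step probability p first occurs
    n = 1
    while random.random() > p:
        n += 1
    return n

trials = 100000
print(sum(waiting_time(1/2) for _ in range(trials)) / trials)   # ~2: flips of a fair coin to first head
print(sum(waiting_time(1/6) for _ in range(trials)) / trials)   # ~6: rolls of a die to first 3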

(d) What is the expected number of steps until hitting state 5? The expected number of steps to leave a state, E[n], is called the waiting time. If the exit probability is p, then the waiting time satisfies E[n] = 1/p. Thus the expected numbers of steps spent at states 7, 4, 6 are respectively 1/.6 = 5/3, 1/.8 = 5/4, 1/.3 = 10/3, and the expected number of steps until hitting state 5 is thus 1 + 5/3 + 5/4 + 10/3 = 29/4 = 7 1/4.

simulation, cell 49: http://nbviewer.jupyter.org/url/courses.cit.cornell.edu/info2950_2017sp/resources/markov.ipynb
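The arithmetic can be checked with exact fractions:

from fractions import Fraction
print(1 + 1/Fraction(6, 10) + 1/Fraction(8, 10) + 1/Fraction(3, 10))   # 29/4 = 7 1/4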

Important enough to prove in a second way, without the derivative (see E/K book, eqs. 20.2-20.4, p. 638).

Let X be a random variable counting the number of steps it takes for a process with probability p to occur for the first time, e.g., the number of flips of a coin to get the first H. It has expectation value
E[X] = 1·Pr[X = 1] + 2·Pr[X = 2] + 3·Pr[X = 3] + ...
Reorganize the j copies of Pr[X = j] using
Pr[X ≥ k] = Pr[X = k] + Pr[X = k + 1] + ...
and the fact that Pr[X ≥ k] is just (1 - p)^(k-1) (the probability that it didn't happen in the first k - 1 steps):
E[X] = 1·Pr[X = 1] + 2·Pr[X = 2] + 3·Pr[X = 3] + ...
= Pr[X ≥ 1] + Pr[X ≥ 2] + Pr[X ≥ 3] + ...
= 1 + (1 - p) + (1 - p)^2 + ... = 1/(1 - (1 - p)) = 1/p.

(e) What is the probability of eventually hitting state 5? Same probability of 1/3 as in (c), since any path that doesn't reach 1 eventually hits 5. More specifically, the probability of going from state 0 to state 7 is 1/2, of eventually going from 7 → 4 is 2/3, and from state 4 one eventually hits state 5, so the overall probability of going from state 0 to state 5 is p(0 → 5) = (1/2)(2/3) = 1/3.

(f) What is the probability of being in state 4 after two steps, given that one is in state 5 after 8 steps? This is given by the joint probability of being in state 4 after two steps and being in state 5 after 8 steps, divided by the overall probability of being in state 5 after 8 steps:

p(state 4 after 2 steps | state 5 after 8 steps) = p(state 4 after 2 steps, state 5 after 8 steps) / p(state 5 after 8 steps)

(By the Markov property, the joint probability in the numerator factors as the probability of reaching state 4 in 2 steps times the probability of going from state 4 to state 5 in the remaining 6 steps.)

probset 8

Nucleobases: A = Adenine, G = Guanine, C = Cytosine, T = Thymine
paired along (double-stranded) backbone: A-T and C-G
https://en.wikipedia.org/wiki/nucleobase

Size of the human genome = 3000 Mbp = 3×10^9 bp
3×10^9 bases × 2 bits/base / 8 bits/byte = 750 MB

https://ghr.nlm.nih.gov/gene/dmd#location
DMD gene (dystrophin), length = 2,241,765 (the biggest!)
Molecular Location: base pairs 31,119,219 to 33,339,609 on the X chromosome

CGTTAAATGCAAACGCTGCTCTGGCTCATGTGTTTGCTCCGAGGTATAGGTTTTGTTCGACTGACGTATC
TATTATAGTACTGCTTTACTGTGTATCTCAATAAAGCACGCAGTTATGTTACAAAAAAGTA

https://ghr.nlm.nih.gov/gene/opn1mw#location
OPN1MW gene (opsin 1, medium wave sensitive), length = 14,266
Molecular Location: base pairs 154,182,596 to 154,196,861 on the X chromosome

CCCACTGGCCGGTATAAAGCACCGTGACCCTCAGGTGACGCACCAGGGCCGGCTGCCGTCGGGGACAGGG
TCTTGAAGGTCTCCGTGATCTCCTGCAGGAGACGAAAATGCACGCACCAGAAGTCA
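As an aside (not part of the prob set 8 statement itself), the Counter-based n-gram counting from earlier in the lecture applies directly to nucleotide strings, e.g. the DMD snippet above:

from collections import Counter

dna = ('CGTTAAATGCAAACGCTGCTCTGGCTCATGTGTTTGCTCCGAGGTATAGGTTTTGTTCGACTGACGTATC'
       'TATTATAGTACTGCTTTACTGTGTATCTCAATAAAGCACGCAGTTATGTTACAAAAAAGTA')

ucount = Counter(dna)                                      # single-base counts
bcount = Counter(dna[i:i+2] for i in range(len(dna)-1))    # dinucleotide counts
print(ucount)
print(bcount.most_common(5))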