An Introduction to Statistics and Machine Learning for Quantitative Biology. Anirvan Sengupta, Dept. of Physics and Astronomy, Rutgers University

1 An Introduction to Statistics and Machine Learning for Quantitative Biology. Anirvan Sengupta, Dept. of Physics and Astronomy, Rutgers University

2 Why Do We Care? Necessity in today's labs. Principled approach: defending conclusions, comparing methods, ... Insights for biological information processing?

3 What Can We Expect to Do Today? Central recurring themes. Orientation and connections. Resources.

5 Data Deluge in Biology http://biomedicalcomputationreview.org http://ugene.net/learn.html https://sites.stanford.edu/baccuslab/ http://mmtg.fel.cvut.cz/mapsim/

6 Then and Now
Then: Data = 10-100 numbers; Questions = >, <, != (Ronald Fisher, on the right, during tea time at Rothamsted Experimental Station, 1920s; http://dissertationreviews.org/archives/724)
Now: Data = 1-10 TB; Questions = tests (New York Genome Center; http://blogs.scientificamerican.com/)

7 How Much to Learn from Data? https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/

8 Model Selection: Bias-Variance Tradeoff. Underfitting, Overfitting. http://scott.fortmann-roe.com/docs/biasvariance.html

9 Probability Distributions
Discrete distributions: Binomial, Poisson, ...
Binomial: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
Poisson: n large, p small, with np fixed
Continuous distributions: normal, chi-squared, Student's t, F, ...
$P(a \le X \le b) = \int_a^b p(x \mid \mu, \sigma^2)\,dx$
$p(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
http://statistics.wikidot.com
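The Poisson limit of the binomial distribution can be checked numerically. The sketch below is illustrative only: the function names and the value np = 3 are assumptions made here, not part of the slides. It compares the two probability mass functions as n grows while np stays fixed.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def max_gap(n: int, lam: float, k_max: int = 10) -> float:
    """Largest pointwise gap between Binomial(n, lam/n) and Poisson(lam) for k <= k_max."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(k_max + 1))


lam = 3.0  # illustrative value for the fixed product n * p
for n in (10, 100, 10_000):
    print(f"n = {n:>6}: max gap = {max_gap(n, lam):.2e}")
```

The gap shrinks as n increases, which is the sense in which the Poisson distribution approximates the binomial for rare events.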

10 Distributions Derived from Standard Normal Distribution (μ = 0, σ² = 1)
Chi-squared distribution: $z_1, \ldots, z_k$ iid $N(0,1)$; $Q = \sum_{i=1}^{k} z_i^2 \sim \chi^2(k)$
Student's t distribution: $z \sim N(0,1)$, $Q \sim \chi^2(k)$; $T = z / (Q/k)^{1/2} \sim t(k)$
F distribution: $Q_1 \sim \chi^2(k_1)$, $Q_2 \sim \chi^2(k_2)$; $\frac{Q_1/k_1}{Q_2/k_2} \sim F(k_1, k_2)$
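As a rough check of the chi-squared construction, one can sum squared standard normal draws and compare the sample mean and variance with the known values k and 2k. This is a simulation sketch; the seed, k = 5, and the sample size are arbitrary choices made here.

```python
import random
import statistics


def chi_squared_draw(k: int, rng: random.Random) -> float:
    """One draw of Q = z_1^2 + ... + z_k^2 with z_i iid N(0, 1)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)  # fixed seed so the run is repeatable
k = 5                   # illustrative degrees of freedom
draws = [chi_squared_draw(k, rng) for _ in range(20_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
print(f"mean = {sample_mean:.2f} (theory: {k}), variance = {sample_var:.2f} (theory: {2 * k})")
```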

11 Estimation
$p(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
How to find μ, σ² from observations?

12 Candidate Estimators
$\hat\mu(x) = \frac{1}{n}\sum_i x_i = \bar{x}$
$\hat\sigma^2(x) = \frac{\sum_i (x_i - \bar{x})^2}{n-1} = s^2$
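A minimal sketch of these two estimators using Python's standard library; the data values are made up for illustration.

```python
import statistics

observations = [5.9, 6.4, 6.1, 6.8, 6.3]  # made-up measurements

mu_hat = statistics.fmean(observations)        # sample mean, x-bar
s_squared = statistics.variance(observations)  # sample variance, divides by n - 1

print(f"mu_hat = {mu_hat:.3f}, s^2 = {s_squared:.3f}")
```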

13 Hypothesis testing
$p(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
How to decide from observations whether the hypothesis $H_1$: μ > 6 is true, as opposed to $H_0$: μ = 6?

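14 Test Statistic
Known σ: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$ under $H_0$
Unknown σ: $T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1)$ under $H_0$
Here $\mu_0$ is the mean specified by $H_0$, $\bar{x}$ is the mean of the n observations, and s is the sample standard deviation.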
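A small sketch of the z-test for $H_0$: μ = 6 against $H_1$: μ > 6, assuming the mean of n observations is used and σ is known. The data and the value σ = 0.4 are invented for illustration. The t version would additionally need the Student t CDF, which the Python standard library does not provide.

```python
import math
from statistics import NormalDist, fmean


def one_sided_z_test(xs, mu0, sigma):
    """Return (z, p-value) for H0: mu = mu0 against H1: mu > mu0, with sigma known."""
    n = len(xs)
    z = (fmean(xs) - mu0) / (sigma / math.sqrt(n))
    p_value = 1.0 - NormalDist().cdf(z)  # upper-tail probability under H0
    return z, p_value


observations = [5.9, 6.4, 6.1, 6.8, 6.3]  # made-up measurements
z, p_value = one_sided_z_test(observations, mu0=6.0, sigma=0.4)  # sigma = 0.4 is assumed known
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```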

15 Type I and Type II error
              | Accept H_0                | Reject H_0
H_0 true      | OK                        | Type I error (prob. α)
H_0 false     | Type II error (prob. β)   | OK
http://
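To make α and β concrete, the sketch below computes the Type II error probability of a one-sided z-test with known σ. The setting and all numbers (μ₀ = 6, a true mean of 6.5, σ = 1, n = 25, α = 0.05) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(mu0, mu_true, sigma, n, alpha):
    """beta = P(accept H0 | true mean = mu_true) for a one-sided z-test of H0: mu = mu0."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)       # reject H0 when z > z_alpha
    shift = (mu_true - mu0) * math.sqrt(n) / sigma  # mean of z when the true mean is mu_true
    return STD_NORMAL.cdf(z_alpha - shift)          # probability that z stays below the threshold


beta = type_ii_error(mu0=6.0, mu_true=6.5, sigma=1.0, n=25, alpha=0.05)
print(f"alpha = 0.05, beta = {beta:.3f}, power = {1.0 - beta:.3f}")
```

Lowering α pushes the rejection threshold up and therefore raises β; more data (larger n) lowers β at fixed α.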

16 Regression and Goodness of Fit http://randomanalyses.blogspot.com/2011/12/basics-of-regression-and-model-fitting.html

17 F-test for Linear Regression
$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$
SST = SSM + SSE
SST: (corrected) sum of squares total; SSM: (corrected) sum of squares for model; SSE: sum of squares for error
$F = \frac{SSM/(p-1)}{SSE/(n-p)}$
http://
Related measure: coefficient of determination $R^2 = \frac{SSM}{SST}$
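The decomposition can be verified on a toy straight-line fit. This is a sketch with made-up data and hypothetical helper names; p = 2 counts the intercept and the slope. Turning F into a p-value would need the F distribution CDF, which is not in the Python standard library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def f_statistic(xs, ys):
    """Sums of squares, F statistic and R^2 for a straight-line fit."""
    n, p = len(xs), 2  # p = number of fitted parameters (intercept and slope)
    intercept, slope = fit_line(xs, ys)
    y_bar = sum(ys) / n
    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssm = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    f_value = (ssm / (p - 1)) / (sse / (n - p))
    return {"SST": sst, "SSM": ssm, "SSE": sse, "F": f_value, "R2": ssm / sst}


result = f_statistic([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])  # made-up data
print({name: round(value, 4) for name, value in result.items()})
```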

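18 Likelihood
$L(\theta \mid x) = P(x \mid \theta)$
e.g. $L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]$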

19 Likelihood Surface. The peak gets sharper and sharper with more data. http://reliawiki.org/index.php/appendix:_maximum_likelihood_estimation_example

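20 Example: Normal Distribution
$L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]$
Maximizing likelihood => $\hat\mu(x) = \frac{1}{n}\sum_i x_i = \bar{x}$, $\hat\sigma^2(x) = \frac{\sum_i (x_i - \bar{x})^2}{n}$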
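One way to see where the maximum likelihood estimates come from is to differentiate the log-likelihood. The derivation below is a standard sketch; the symbol $\ell$ for the log-likelihood is introduced here for convenience.

```latex
\begin{align}
\ell(\mu, \sigma^2) &= \ln L(\mu, \sigma^2 \mid x_1, \ldots, x_n)
  = -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
  \;\Rightarrow\; \hat\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \\
\frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \hat\mu)^2 = 0
  \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\end{align}
```

The maximum likelihood variance divides by n rather than n - 1, so it is slightly smaller than the unbiased estimator $s^2$.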

21 Limited Noisy Data with Small Signal
Imagine we had to estimate only the mean μ.
$L(\mu \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma_0^2}\right]$
Should we use the MLE $\hat\mu(x) = \frac{1}{n}\sum_i x_i = \bar{x}$? What if we choose, instead, $\hat\mu_0(x) = 0$?
Error from noise (variance): $E[(\hat\mu(x) - \mu)^2] = \sigma_0^2 / n$
Error from bias: $E[(\hat\mu_0(x) - \mu)^2] = \mu^2$
Trivial estimator better when $|\mu| < \sigma_0 / \sqrt{n}$!
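The comparison between the two estimators can be illustrated by simulation. This is a sketch with arbitrary parameter values; the function name and the seed are assumptions made here.

```python
import random
import statistics


def mse_of_estimators(mu, sigma0, n, trials, seed=0):
    """Return (simulated MSE of the sample mean, exact MSE of the trivial estimator 0)."""
    rng = random.Random(seed)
    squared_errors = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma0) for _ in range(n)]
        squared_errors += (statistics.fmean(xs) - mu) ** 2
    return squared_errors / trials, mu**2


# Small signal: |mu| = 0.1 is below sigma0 / sqrt(n), about 0.32
mse_mle, mse_zero = mse_of_estimators(mu=0.1, sigma0=1.0, n=10, trials=5000)
print(f"MLE: {mse_mle:.3f}   trivial estimator: {mse_zero:.3f}")
```

With a larger signal (for example mu = 1.0 at the same sigma0 and n) the ordering reverses and the sample mean wins.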

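22 Regularize to Shrink the Estimate
For estimating only the mean μ, maximize
$\exp\left[-\frac{\mu^2}{2\lambda^2}\right] \prod_{i=1}^{n} \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma_0^2}\right]$
We use the estimator $\hat\mu(x) = \frac{\sum_i x_i}{n + \sigma_0^2/\lambda^2}$.
When $\lambda \ll \sigma_0/\sqrt{n}$ we get $\hat\mu(x) \approx 0$!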
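A minimal sketch of the shrunk estimator and its two limits. The function name and the data are illustrative, not from the slides.

```python
def shrunk_mean(xs, sigma0, lam):
    """Regularized mean estimate: sum(x_i) / (n + sigma0^2 / lam^2)."""
    return sum(xs) / (len(xs) + sigma0**2 / lam**2)


observations = [5.9, 6.4, 6.1, 6.8, 6.3]  # made-up measurements
for lam in (1e-3, 1.0, 1e3):
    print(f"lambda = {lam:g}: estimate = {shrunk_mean(observations, sigma0=1.0, lam=lam):.4f}")
```

A very wide penalty (large λ) recovers the sample mean, while a very narrow one (small λ) pulls the estimate to zero.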

23 Regularization via Bayesian Framework
$P(\theta \mid x) = \frac{P(x, \theta)}{P(x)} = \frac{P(x \mid \theta) P_0(\theta)}{P(x)}$, where $P(x) = \int d\theta'\, P(x \mid \theta') P_0(\theta')$
The prior $P_0(\theta)$ incorporates additional constraints on parameters.

24 Maximum A Posteriori Estimation
$\hat\theta = \arg\max_\theta P(\theta \mid x) = \arg\max_\theta \frac{P(x \mid \theta) P_0(\theta)}{P(x)}$
Maximum likelihood estimate corresponds to a uniform prior: $P_0(\theta)$ = const.
One could also sample θ from the posterior distribution $P(\theta \mid x)$ to get a sense of the range of possibilities.
Sometimes averaging over θ is a better idea than maximization.

25 Likelihood Improvement with Higher Complexity (figure labels: $L_2$, $L_1$)

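26 Evidence
$P(x) = \int d\theta'\, P(x, \theta') = \int d\theta'\, P(x \mid \theta') P_0(\theta')$
(figure: prior, likelihood and their product plotted against θ)
Since $\int dx\, P(x) = 1$, less constrained models spread their probability over more possible data sets and so have lower evidence for any one of them.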

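27 An Example: Tied Means
One shared mean μ: $\int d\mu\, \frac{1}{\lambda\sqrt{2\pi}} \exp\left[-\frac{\mu^2}{2\lambda^2}\right] \prod_{i=1}^{n} \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma_0^2}\right]$
One mean $\mu_i$ per observation: $\prod_{i=1}^{n} \int d\mu_i\, \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i-\mu_i)^2}{2\sigma_0^2}\right] \frac{1}{\lambda\sqrt{2\pi}} \exp\left[-\frac{\mu_i^2}{2\lambda^2}\right]$
(figure axes: $x_1$, $x_2$)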

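28 Bayesian Hypothesis Testing: Going Beyond p-values
$\frac{\Pr(\text{Disease} \mid \text{test positive})}{\Pr(\text{No Disease} \mid \text{test positive})} = \frac{\Pr(\text{test positive} \mid \text{Disease})}{\Pr(\text{test positive} \mid \text{No Disease})} \times \frac{P(\text{Disease})}{P(\text{No Disease})}$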
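A worked instance of the odds form of Bayes' rule. The prevalence, sensitivity and false positive rate below are invented numbers, chosen only to show how a fairly accurate test can still leave the disease unlikely when it is rare.

```python
def posterior_odds(prior_prob, p_pos_given_disease, p_pos_given_healthy):
    """Posterior odds of disease after a positive test = likelihood ratio * prior odds."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    likelihood_ratio = p_pos_given_disease / p_pos_given_healthy
    return likelihood_ratio * prior_odds


# Hypothetical values: 1% prevalence, 95% sensitivity, 5% false positive rate
odds = posterior_odds(prior_prob=0.01, p_pos_given_disease=0.95, p_pos_given_healthy=0.05)
posterior_prob = odds / (1.0 + odds)
print(f"posterior odds = {odds:.3f}, P(Disease | test positive) = {posterior_prob:.3f}")
```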

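29 Bayesian Hypothesis testing
$P(H \mid x) = \frac{P(H) \int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H)}{P(x)}$, where $P(x) = \sum_{H'} P(H') \int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H')$
$\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{P(H_0)}{P(H_1)} \times \frac{\int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H_0)}{\int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H_1)}$
Posterior odds = Prior odds × Bayes factor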
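A Bayes factor can be computed in closed form for a simple case. The setup below is an assumption made for illustration, not taken from the slides: n Gaussian observations with known noise σ₀, $H_0$ fixing μ = 0 exactly, and $H_1$ giving μ a Gaussian prior of width λ around zero. Because the sample mean is a sufficient statistic, the ratio of the two marginal likelihoods reduces to a ratio of two normal densities evaluated at the sample mean.

```python
import math
from statistics import NormalDist


def bayes_factor_01(x_bar, n, sigma0, lam):
    """Bayes factor P(x | H0) / P(x | H1) for H0: mu = 0 versus H1: mu ~ N(0, lam^2)."""
    se2 = sigma0**2 / n  # variance of the sample mean for a given mu
    density_h0 = NormalDist(0.0, math.sqrt(se2)).pdf(x_bar)           # x_bar ~ N(0, se2) under H0
    density_h1 = NormalDist(0.0, math.sqrt(lam**2 + se2)).pdf(x_bar)  # mu integrated out under H1
    return density_h0 / density_h1


for x_bar in (0.0, 0.5, 1.0):  # illustrative sample means
    print(f"x_bar = {x_bar}: Bayes factor (H0 over H1) = {bayes_factor_01(x_bar, 10, 1.0, 1.0):.4f}")
```

A sample mean near zero favors the more constrained hypothesis $H_0$, which is the evidence penalty for flexible models at work.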

30 More Complex Tasks in Data Analysis. Supervised: classification, regression, ... Unsupervised: clustering, discovering latent factors, network structures, ...

31 Clustering Gene Expression http://www.pha.jhu. http://

32 Latent Variable Models: Clustering and Mixture Models
$p(x \mid \lambda) = \sum_i w_i\, g(x \mid \mu_i, \Sigma_i)$
$g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i)\right\}$
https://en.wikipedia.org/wiki/cluster_analysis
Challenge: learning the parameters $w_i$, $\mu_i$, $\Sigma_i$. Expectation Maximization, Sampling, ...
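Expectation Maximization for a mixture model can be sketched in a few lines for the simplest case. The code below assumes one-dimensional data and exactly two components, a simplification of the multivariate model on the slide; the initialization, the synthetic data and the function names are choices made here.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(xs, iterations=50):
    """Fit weights, means and standard deviations of a 2-component 1D Gaussian mixture."""
    w = [0.5, 0.5]
    mu = [min(xs), max(xs)]            # crude initialization at the data extremes
    spread = (max(xs) - min(xs)) / 4.0
    sigma = [spread, spread]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in xs:
            joint = [w[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(2)]
            total = sum(joint)
            resp.append([q / total for q in joint])
        # M-step: re-estimate parameters from responsibility-weighted data
        for j in range(2):
            n_j = sum(r[j] for r in resp)
            w[j] = n_j / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / n_j
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / n_j
            sigma[j] = math.sqrt(max(var, 1e-6))  # floor guards against a collapsing component
    return w, mu, sigma


rng = random.Random(1)
# Synthetic data: two clusters centered at -2 and 3 (made-up values)
data = [rng.gauss(-2.0, 0.7) for _ in range(200)] + [rng.gauss(3.0, 0.7) for _ in range(200)]
weights, means, sigmas = em_two_gaussians(data)
print("weights:", [round(v, 2) for v in weights])
print("means:  ", [round(v, 2) for v in means])
print("sigmas: ", [round(v, 2) for v in sigmas])
```

EM only finds a local maximum of the likelihood, so in practice the result can depend on the initialization.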

33 Latent Variable Models: Hidden Markov Models
$P(o, s) = P(s) P(o \mid s) = \pi_{s_0} b_{s_0}(o_0)\, a_{s_0, s_1} b_{s_1}(o_1)\, a_{s_1, s_2} b_{s_2}(o_2)\, a_{s_2, s_3} \ldots$
http://artint.info/html/artint_161.html http://compbio.pbworks.com/
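The joint probability of a state path and an observation sequence is a running product of initial, emission and transition terms. The sketch below uses a made-up two-state, two-symbol model; `pi`, `a` and `b` follow the symbols in the formula.

```python
def hmm_joint_probability(states, observations, pi, a, b):
    """P(o, s) = pi[s0] * b[s0][o0] * a[s0][s1] * b[s1][o1] * ..."""
    prob = pi[states[0]] * b[states[0]][observations[0]]
    for t in range(1, len(states)):
        prob *= a[states[t - 1]][states[t]] * b[states[t]][observations[t]]
    return prob


# Made-up model with 2 hidden states and 2 observation symbols
pi = [0.6, 0.4]                # initial state probabilities
a = [[0.7, 0.3], [0.4, 0.6]]   # a[i][j] = P(next state j | current state i)
b = [[0.9, 0.1], [0.2, 0.8]]   # b[i][k] = P(observe symbol k | state i)

p = hmm_joint_probability(states=[0, 0, 1], observations=[0, 0, 1], pi=pi, a=a, b=b)
print(f"P(o, s) = {p:.6f}")
```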

34 Diffusive HMM for Single Molecule Phenomena Beausang et al, Biophys J, 2007

35 Generation vs Prediction. So far we have dealt with explicit probability models of data. What if we have no clue about the precise probability model?

36 Rosenblatt's Perceptron http://sebastianraschka.com/articles/2015_singlelayer_neurons.html

37 Learning the Weights
$w_{t+1} = w_t + \eta\, (y_i - \hat{y}_i)\, x_i$
Good learning: the decision surface does not depend too much on the subset of training data. Test by cross-validation.
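The update rule can be sketched on a tiny linearly separable problem. The example below assumes a threshold unit with outputs 0 or 1 and uses the logical AND function as made-up training data; the learning rate and epoch count are arbitrary.

```python
def predict(w, x):
    """Threshold unit: output 1 if w . x > 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def train_perceptron(data, eta=1.0, epochs=20):
    """Apply w <- w + eta * (y - y_hat) * x over the data for a fixed number of epochs."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            y_hat = predict(w, x)
            w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, x)]
    return w


# Logical AND; the leading 1 in each input is a constant bias feature.
training_data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
weights = train_perceptron(training_data)
print("weights:", weights)
print("predictions:", [predict(weights, x) for x, _ in training_data])
```

The weights change only on misclassified examples, so training stops changing them once every training point is classified correctly.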

38 Linearly Non-separable Cases http://

39 Parametrizing Complex Decision Boundaries. Shallow methods: SVM, Boosting, ... Deep methods: layered neural nets

40 Classification: Kernel Methods/SVM
$\varphi(x) = \sum_i \alpha_i y_i K(x, x_i)$
http://

41 SVM Kernels
Linear: $K(x, y) = x \cdot y$
Polynomial: $K(x, y) = (x \cdot y + 1)^p$
Radial Basis Function: $K(x, y) = \exp\left(-\frac{(x - y)^2}{\sigma^2}\right)$
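The three kernels, together with the kernel expansion of the decision function from the preceding slide, can be written directly. This is a sketch: the coefficients α_i below are set by hand for illustration rather than obtained by actually training an SVM.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, p):
    return (dot(x, y) + 1) ** p


def rbf_kernel(x, y, sigma):
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared_distance / sigma**2)


def decision_value(x, support_vectors, labels, alphas, kernel):
    """Kernel expansion: sum_i alpha_i * y_i * K(x, x_i)."""
    return sum(a * y * kernel(x, sv) for a, y, sv in zip(alphas, labels, support_vectors))


# Hand-picked toy expansion (not the result of training)
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, +1]
alphas = [1.0, 1.0]


def unit_rbf(u, v):
    return rbf_kernel(u, v, sigma=1.0)


for point in [(0.1, 0.0), (1.9, 2.2)]:
    print(point, decision_value(point, support_vectors, labels, alphas, unit_rbf))
```

The sign of the decision value gives the predicted class: points near (0, 0) come out negative and points near (2, 2) positive.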

42 Classifying Regulatory Sequences Djordjevic, Sengupta and Shraiman, Genome Res, 2003

43 Back to Probability Models: Restricted Boltzmann Machine (RBM)
Visible units x, hidden units h:
$P(x, h) = \exp(-E(x, h)) / Z$
$E(x, h) = -\sum_{i,a} x_i W_{ia} h_a$
$Z = \sum_{\{x, h\}} \exp(-E(x, h))$
Smolensky, 1986; Hinton and Salakhutdinov, Science, 2006
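For a very small RBM the partition function can be computed by brute-force enumeration, which makes the definitions concrete. The sketch assumes binary units taking values 0 or 1 and an energy with no bias terms; the coupling matrix is made up. Enumeration is only feasible for a handful of units.

```python
import itertools
import math


def energy(x, h, W):
    """E(x, h) = -sum over i, a of x_i * W[i][a] * h_a."""
    return -sum(x[i] * W[i][a] * h[a] for i in range(len(x)) for a in range(len(h)))


def joint_distribution(W):
    """Exact P(x, h) for binary units, by enumerating all configurations."""
    n_visible, n_hidden = len(W), len(W[0])
    states = [
        (x, h)
        for x in itertools.product((0, 1), repeat=n_visible)
        for h in itertools.product((0, 1), repeat=n_hidden)
    ]
    boltzmann_weights = [math.exp(-energy(x, h, W)) for x, h in states]
    Z = sum(boltzmann_weights)  # partition function
    return {state: bw / Z for state, bw in zip(states, boltzmann_weights)}


W = [[1.0, -0.5], [0.3, 0.8]]  # made-up couplings: 2 visible units, 2 hidden units
P = joint_distribution(W)
most_likely = max(P, key=P.get)
print("most probable (x, h):", most_likely, "with probability", round(P[most_likely], 4))
```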

44 MNIST Data and RBM https://corpocrat.com/2014/10/17/machine-learning-using-restricted-boltzmann-machines/

45 Stacking RBMs into Deep Boltzmann Machines (DBM) (figure: RBMs over layers x, $h_1$ and $h_2$ stacked into a DBM)

46 Multilayer Artificial Neural Nets/Deep Learning (ANN 2.0) http://

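47 Forget Probabilities: Feedforward Deep Nets
$h^{(1)}_{M_1 \times 1} = \sigma\left(W^{(1)}_{M_1 \times N}\, x_{N \times 1} + b^{(1)}_{M_1 \times 1}\right)$
$h^{(2)}_{M_2 \times 1} = \sigma\left(W^{(2)}_{M_2 \times M_1}\, h^{(1)}_{M_1 \times 1} + b^{(2)}_{M_2 \times 1}\right)$
...
$h^{(d)}_{M_d \times 1} = \sigma\left(W^{(d)}_{M_d \times M_{d-1}}\, h^{(d-1)}_{M_{d-1} \times 1} + b^{(d)}_{M_d \times 1}\right)$
(subscripts give the dimensions of each vector and matrix)
From http://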
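The layer-by-layer recursion can be sketched as a plain forward pass. The code assumes σ is the logistic sigmoid, which the slide does not specify, and uses made-up weights for a network with N = 3 inputs, 2 hidden units and 1 output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, b, h_in):
    """One layer: sigma(W h_in + b), with W given as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, h_in)) + bias) for row, bias in zip(W, b)]


def forward(x, layers):
    """Feed x through a list of (W, b) pairs and return the final activations."""
    h = x
    for W, b in layers:
        h = layer(W, b, h)
    return h


# Made-up parameters: 3 inputs -> 2 hidden units -> 1 output
network = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.2]),
]
output = forward([1.0, 2.0, 3.0], network)
print("network output:", output)
```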

48 Applica+ons to Biology Angermueller et al, Mol Syst Bio, 2016, adapted from Alipanahi et al, Nat Biotechnol, 2015

49 Summary. Complexity of models and overfitting. Significance tests. Cross-validation.

50 Tools

51 Books
