An Introduction to Statistics and Machine Learning for Quantitative Biology
Anirvan Sengupta, Dept. of Physics and Astronomy, Rutgers University
Why Do We Care?
Necessity in today's labs. A principled approach: defending conclusions, comparing methods, ... Insights for biological information processing?
What Can We Expect to Do Today?
Central recurring themes. Orientation and connections. Resources.
Data Deluge in Biology
http://biomedicalcomputationreview.org http://ugene.net/learn.html https://sites.stanford.edu/baccuslab/ http://mmtg.fel.cvut.cz/mapsim/
Then and Now
Then: data = 10-100 numbers; questions = $>$, $<$, $\neq$. Now: data = 1-10 TB; questions = $10^4$-$10^5$ tests.
Ronald Fisher (on the right) during tea time at Rothamsted Experimental Station (1920s). http://dissertationreviews.org/archives/724 New York Genome Center: http://blogs.scientificamerican.com/
How Much to Learn from Data?
https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/
Model Selection: Bias-Variance Tradeoff
Underfitting vs. overfitting. http://scott.fortmann-roe.com/docs/biasvariance.html
Probability Distributions
Discrete distributions: binomial, Poisson, ...
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
Poisson limit: $n$ large, $p$ small, with $np$ held fixed.
Continuous distributions: normal, chi-squared, Student's t, F, ...
$P(a \le X \le b) = \int_a^b p(x \mid \mu, \sigma^2)\,dx$, where $p(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
http://statistics.wikidot.com
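The Poisson limit of the binomial can be checked numerically; a minimal sketch using scipy (the particular values of $n$, $p$, and the mean $np$ are illustrative):

```python
import numpy as np
from scipy import stats

# Binomial with large n and small p, holding the mean np fixed
n, p = 1000, 0.003
lam = n * p                         # Poisson rate matching the binomial mean

k = np.arange(10)
binom_pmf = stats.binom.pmf(k, n, p)
poisson_pmf = stats.poisson.pmf(k, lam)

# The two pmfs should agree closely term by term
max_diff = np.max(np.abs(binom_pmf - poisson_pmf))
print(max_diff)
```

Increasing $n$ while shrinking $p$ (same $np$) drives the discrepancy toward zero.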
Distributions Derived from the Standard Normal Distribution ($\mu = 0$, $\sigma^2 = 1$)
Chi-squared distribution: $z_1, \ldots, z_k \overset{iid}{\sim} N(0,1)$, $Q = \sum_{i=1}^k z_i^2 \sim \chi^2(k)$.
Student's t distribution: $z \sim N(0,1)$, $Q \sim \chi^2(k)$, $T = z/(Q/k)^{1/2} \sim t(k)$.
F distribution: $Q_1 \sim \chi^2(k_1)$, $Q_2 \sim \chi^2(k_2)$, $\frac{Q_1/k_1}{Q_2/k_2} \sim F(k_1, k_2)$.
Estimation
$p(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
How do we find $\mu$ and $\sigma^2$ from observations?
Candidate Estimators
$\hat{\mu}(x) = \frac{\sum_i x_i}{n} = \bar{x}$, $\hat{\sigma}^2(x) = \frac{\sum_i (x_i - \bar{x})^2}{n-1} = s^2$
Hypothesis Testing
$p(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
How to decide from observations whether the hypothesis $H_1: \mu > 6$ is true, as opposed to $H_0: \mu = 6$?
Test Statistic
$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ under $H_0$; $T = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t(n-1)$ under $H_0$.
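The one-sample t-test for $H_0: \mu = 6$ against $H_1: \mu > 6$ can be sketched as follows; the simulated data (true mean 6.5, $n = 50$) is purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=6.5, scale=1.0, size=50)   # simulated sample, true mean 6.5

# T = (x̄ - μ0) / (s / √n), distributed as t(n-1) under H0: μ = 6
mu0 = 6.0
n = len(x)
T = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_one_sided = stats.t.sf(T, df=n - 1)          # P(t(n-1) > T), for H1: μ > 6
print(T, p_one_sided)
```

A small one-sided p-value leads to rejecting $H_0$ at the chosen level $\alpha$.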
Type I and Type II Error
                Accept H0                    Reject H0
H0 true         OK                           Type I error (prob. α)
H0 false        Type II error (prob. β)      OK
http://www.healthknowledge.org.uk/e-learning/statistical-methods/
Regression and Goodness of Fit
http://randomanalyses.blogspot.com/2011/12/basics-of-regression-and-model-fitting.html
F-test for Linear Regression
$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$
SST = SSM + SSE: (corrected) Sum of Squares Total = (corrected) Sum of Squares for Model + Sum of Squares for Error.
$F = \frac{SSM/(p-1)}{SSE/(n-p)}$
Related measure: coefficient of determination $R^2 = \frac{SSM}{SST}$.
http://www.jamesstacks.com/stat/
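The sum-of-squares decomposition and the F statistic above can be computed directly; a minimal sketch on simulated data (slope, intercept, and noise level are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 40, 2                              # n data points, p fitted parameters
x = rng.uniform(0, 10, n)
y = 1.5 * x + 2.0 + rng.normal(0, 1.0, n)

# least-squares fit y ≈ a x + b
a, b = np.polyfit(x, y, 1)
yhat = a * x + b
ybar = y.mean()

SST = np.sum((y - ybar) ** 2)             # total sum of squares
SSM = np.sum((yhat - ybar) ** 2)          # model sum of squares
SSE = np.sum((y - yhat) ** 2)             # error sum of squares

F = (SSM / (p - 1)) / (SSE / (n - p))
R2 = SSM / SST
p_value = stats.f.sf(F, p - 1, n - p)     # P(F(p-1, n-p) > F)
print(F, R2, p_value)
```

For least squares with an intercept, SST = SSM + SSE holds exactly, which the code can be used to verify.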
Likelihood
$L(\theta \mid x) = P(x \mid \theta)$, e.g. $L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right]$
Likelihood Surface
The peak gets sharper and sharper with more data. http://reliawiki.org/index.php/Appendix:_Maximum_Likelihood_Estimation_Example
Example: Normal Distribution
Maximizing the likelihood $L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right]$ gives $\hat{\mu}(x) = \frac{\sum_i x_i}{n} = \bar{x}$ and $\hat{\sigma}^2(x) = \frac{\sum_i (x_i - \bar{x})^2}{n}$.
Limited Noisy Data with a Small Signal
Imagine we had to estimate only the mean $\mu$: $L(\mu \mid x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma_0^2}\right]$.
Should we use the MLE $\hat{\mu}(x) = \frac{\sum_i x_i}{n} = \bar{x}$? What if we choose, instead, $\hat{\mu}_0(x) = 0$?
$E[(\hat{\mu}(x) - \mu)^2] = \frac{\sigma_0^2}{n}$ (error from noise: variance), while $E[(\hat{\mu}_0(x) - \mu)^2] = \mu^2$ (error from bias).
The trivial estimator is better when $|\mu| < \frac{\sigma_0}{\sqrt{n}}$!
Regularize to Shrink the Estimate
For estimating only the mean $\mu$, maximize $\exp\left[-\frac{\mu^2}{2\lambda^2}\right] \prod_{i=1}^n \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma_0^2}\right]$.
We use the estimator $\hat{\mu}(x) = \frac{\sum_i x_i}{n + \sigma_0^2/\lambda^2}$. When $\lambda \ll \frac{\sigma_0}{\sqrt{n}}$ we get $\hat{\mu}(x) \approx 0$!
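The bias-variance payoff of shrinkage can be checked by simulation; a minimal sketch comparing mean squared errors of the MLE and the shrunk estimator when $|\mu| \ll \sigma_0/\sqrt{n}$ (the specific values of $\mu$, $\sigma_0$, $\lambda$, and $n$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma0, lam = 0.05, 1.0, 0.1    # small signal: |μ| ≪ σ0/√n
n, trials = 25, 20000

x = rng.normal(mu_true, sigma0, size=(trials, n))
mle = x.mean(axis=1)                                  # μ̂ = Σ x_i / n
shrunk = x.sum(axis=1) / (n + sigma0**2 / lam**2)     # regularized estimator

mse_mle = np.mean((mle - mu_true) ** 2)
mse_shrunk = np.mean((shrunk - mu_true) ** 2)
print(mse_mle, mse_shrunk)
```

The shrunk estimator trades a small bias for a large variance reduction, winning in this regime.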
Regularization via the Bayesian Framework
$P(\theta \mid x) = \frac{P(x, \theta)}{P(x)} = \frac{P(x \mid \theta) P_0(\theta)}{P(x)}$, where $P(x) = \int d\theta'\, P(x \mid \theta') P_0(\theta')$.
The prior $P_0(\theta)$ incorporates additional constraints on the parameters.
Maximum A Posteriori Estimation
$\hat{\theta} = \arg\max_\theta P(\theta \mid x) = \arg\max_\theta \frac{P(x \mid \theta) P_0(\theta)}{P(x)}$
The maximum likelihood estimate corresponds to a uniform prior: $P_0(\theta) = \text{const}$.
One could also sample $\theta$ from the posterior distribution $P(\theta \mid x)$ to get a sense of the range of possibilities. Sometimes averaging over $\theta$ is a better idea than maximization.
Likelihood Improvement with Higher Complexity
The more complex model attains a higher maximum likelihood: $L_2 \ge L_1$.
Evidence
$P(x) = \int d\theta'\, P(x, \theta') = \int d\theta'\, P(x \mid \theta') P_0(\theta')$ (prior $\times$ likelihood, integrated).
Since $\int dx\, P(x) = 1$ for each model, less constrained models have lower evidence.
An Example: Tied Means
Compare the evidence for a single shared mean,
$\int d\mu\, \frac{1}{\lambda\sqrt{2\pi}} \exp\left[-\frac{\mu^2}{2\lambda^2}\right] \prod_{i=1}^n \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma_0^2}\right]$,
against the evidence for independent means,
$\prod_{i=1}^n \int d\mu_i\, \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left[-\frac{(x_i - \mu_i)^2}{2\sigma_0^2}\right] \frac{1}{\lambda\sqrt{2\pi}} \exp\left[-\frac{\mu_i^2}{2\lambda^2}\right]$.
Bayesian Hypothesis Testing: Going Beyond p-values
$\frac{\Pr(\text{Disease} \mid \text{test positive})}{\Pr(\text{No Disease} \mid \text{test positive})} = \frac{\Pr(\text{test positive} \mid \text{Disease})}{\Pr(\text{test positive} \mid \text{No Disease})} \times \frac{P(\text{Disease})}{P(\text{No Disease})}$
Bayesian Hypothesis Testing
$P(H \mid x) = \frac{P(H) \int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H)}{P(x)}$, where $P(x) = \sum_{H'} P(H') \int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H')$.
$\frac{P(H_0 \mid x)}{P(H_1 \mid x)} = \frac{P(H_0)}{P(H_1)} \times \frac{\int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H_0)}{\int d\theta'\, P(x \mid \theta') P_0(\theta' \mid H_1)}$
Posterior odds = prior odds $\times$ Bayes factor.
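The odds form of Bayes' rule for the disease-testing example can be sketched in a few lines; the sensitivity, false-positive rate, and prevalence below are assumed numbers for illustration only:

```python
# Posterior odds = prior odds × Bayes factor, for the disease-testing example
sens = 0.99         # P(test positive | Disease)      -- assumed value
fpr = 0.05          # P(test positive | No Disease)   -- assumed value
prevalence = 0.001  # P(Disease)                      -- assumed value

prior_odds = prevalence / (1 - prevalence)
bayes_factor = sens / fpr
posterior_odds = prior_odds * bayes_factor
posterior_prob = posterior_odds / (1 + posterior_odds)
print(posterior_prob)
```

Even with a strong Bayes factor, a low prior (rare disease) keeps the posterior probability of disease small, which is exactly what a p-value alone cannot convey.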
More Complex Tasks in Data Analysis
Supervised: classification, regression, ...
Unsupervised: clustering, discovering latent factors, network structures, ...
Clustering Gene Expression
http://www.pha.jhu.edu/~ghzheng/ http://www.computational-genomics.net/case_studies/cellcycle_demo.html
Latent Variable Models: Clustering and Mixture Models
$p(x \mid \lambda) = \sum_i w_i\, g(x \mid \mu_i, \Sigma_i)$, with $g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)' \Sigma_i^{-1} (x - \mu_i)\right\}$
Challenge: learning the parameters $w_i$, $\mu_i$, $\Sigma_i$. Expectation Maximization, sampling, ...
https://en.wikipedia.org/wiki/Cluster_analysis
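Expectation Maximization for a mixture model can be sketched compactly in one dimension; this is a minimal two-component Gaussian mixture on simulated data (cluster locations and sizes are illustrative), not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
# two 1-D Gaussian clusters
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.8, 300)])

# initial guesses for weights w, means μ, variances
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E step: responsibilities r[i, k] = P(component k | x_i)
    p = w * gauss(x[:, None], mu, var)          # shape (n, 2)
    r = p / p.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means, variances from responsibilities
    Nk = r.sum(axis=0)
    w = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(sorted(mu), sorted(w))
```

With well-separated clusters, EM recovers the component means and mixing weights from the unlabeled data.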
Latent Variable Models: Hidden Markov Models
$P(o, s) = P(s) P(o \mid s) = \pi_{s_0} b_{s_0}(o_0)\, a_{s_0, s_1} b_{s_1}(o_1)\, a_{s_1, s_2} b_{s_2}(o_2)\, a_{s_2, s_3} \cdots$
http://artint.info/html/ArtInt_161.html http://compbio.pbworks.com/
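The factorization above can be written directly in code; a minimal sketch with a toy two-state HMM (the particular $\pi$, $a$, $b$ matrices and observation sequence are made up for illustration):

```python
import numpy as np
from itertools import product

# toy 2-state HMM: P(o, s) = π_{s0} b_{s0}(o0) a_{s0,s1} b_{s1}(o1) ...
pi = np.array([0.6, 0.4])        # initial state distribution π
a = np.array([[0.7, 0.3],        # transition probabilities a[s, s']
              [0.2, 0.8]])
b = np.array([[0.9, 0.1],        # emission probabilities b[s, o]
              [0.3, 0.7]])

def joint_prob(states, obs):
    """P(o, s) for one state path, per the factorization on the slide."""
    p = pi[states[0]] * b[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= a[states[t - 1], states[t]] * b[states[t], obs[t]]
    return p

# marginal P(o): brute-force sum over all state paths (tiny models only)
obs = [0, 1, 1]
total = sum(joint_prob(s, obs) for s in product([0, 1], repeat=len(obs)))
print(total)
```

For real sequence lengths the sum over paths is done efficiently by the forward algorithm rather than brute force.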
Diffusive HMM for Single-Molecule Phenomena
Beausang et al., Biophys J, 2007
Generation vs. Prediction
So far we have dealt with explicit probability models of data. What if we have no clue about a precise probability model?
Rosenblatt's Perceptron
http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html
Learning the Weights
$w_{t+1} = w_t + \eta (y_i - \hat{y}_i) x_i$
Good learning: the decision surface should not depend too much on the particular subset of training data used. Test by cross-validation.
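The update rule above is the whole algorithm; a minimal sketch on simulated linearly separable data (the labeling rule and margin band are illustrative choices, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(4)
# separable toy data: label = 1 if x1 + x2 > 0, with a margin band removed
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.2]        # enforce a positive margin
y = (X[:, 0] + X[:, 1] > 0).astype(int)

w = np.zeros(2)
b = 0.0
eta = 0.1
for _ in range(100):                          # epochs
    for xi, yi in zip(X, y):
        yhat = int(xi @ w + b > 0)
        # Rosenblatt update: w ← w + η (y_i − ŷ_i) x_i
        w += eta * (yi - yhat) * xi
        b += eta * (yi - yhat)

acc = ((X @ w + b > 0).astype(int) == y).mean()
print(acc)
```

By the perceptron convergence theorem, a finite number of updates suffices on separable data; measuring accuracy on held-out data (cross-validation) is the honest test.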
Linearly Non-separable Cases
http://www.saedsayad.com/artificial_neural_network_bkp.htm
Parametrizing Complex Decision Boundaries
Shallow methods: SVM, boosting, ... Deep methods: layered neural nets.
Classification: Kernel Methods/SVM
$\phi(x) = \sum_i \alpha_i y_i K(x, x_i)$
http://www.mdpi.com/1424-8220/14/11/20713/htm
SVM Kernels
Linear: $K(x, y) = x \cdot y$. Polynomial: $K(x, y) = (x \cdot y + 1)^p$. Radial basis function: $K(x, y) = \exp\left(-\frac{(x - y)^2}{\sigma^2}\right)$.
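The three kernels translate directly into code; a minimal sketch (reading $(x - y)^2$ in the RBF kernel as the squared Euclidean distance, and with illustrative defaults for $p$ and $\sigma$):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, p=3):
    return (x @ y + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    # (x - y)^2 interpreted as squared Euclidean distance ||x - y||^2
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(linear_kernel(x, y), poly_kernel(x, y), rbf_kernel(x, y))
```

Any of these can be plugged into the decision function $\phi(x) = \sum_i \alpha_i y_i K(x, x_i)$ from the previous slide.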
Classifying Regulatory Sequences Djordjevic, Sengupta and Shraiman, Genome Res, 2003
Back to Probability Models: Restricted Boltzmann Machine (RBM)
$P(x, h) = \exp(-E(x, h))/Z$, $E(x, h) = -\sum_{ia} x_i W_{ia} h_a$, $Z = \sum_{\{x, h\}} \exp(-E(x, h))$
Smolensky, 1986; Hinton and Salakhutdinov, Science, 2006
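For a tiny RBM the energy, partition function, and joint distribution can be enumerated exactly; a minimal sketch with 3 visible and 2 hidden binary units and random weights (the sizes and weights are illustrative, and brute-force enumeration is feasible only at toy scale):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)
# tiny RBM: visible x (3 units), hidden h (2 units), E(x,h) = -Σ x_i W_ia h_a
W = rng.normal(0, 1, size=(3, 2))

def energy(x, h):
    return -np.asarray(x) @ W @ np.asarray(h)

def states(n):
    return list(product([0, 1], repeat=n))

# partition function Z by exhaustive enumeration of all (x, h) configurations
Z = sum(np.exp(-energy(x, h)) for x in states(3) for h in states(2))

def prob(x, h):
    return np.exp(-energy(x, h)) / Z

total = sum(prob(x, h) for x in states(3) for h in states(2))
print(total)
```

Real RBMs avoid computing $Z$ altogether, training instead with Gibbs sampling and contrastive divergence.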
MNIST Data and RBM
https://corpocrat.com/2014/10/17/machine-learning-using-restricted-boltzmann-machines/
Stacking RBMs into Deep Boltzmann Machines (DBM)
[Figure: RBMs over $x$, $h_1$, $h_2$ stacked layer by layer into a DBM.]
Multilayer Artificial Neural Nets / Deep Learning (ANN 2.0)
http://www.rsipvision.com/exploring-deep-learning/
Forget Probabilities: Feedforward Deep Nets
$h^{(1)}_{M_1 \times 1} = \sigma(W^{(1)}_{M_1 \times N} x_{N \times 1} + b_1)$, $h^{(2)}_{M_2 \times 1} = \sigma(W^{(2)}_{M_2 \times M_1} h^{(1)}_{M_1 \times 1} + b_2)$, ..., $h^{(d)}_{M_d \times 1} = \sigma(W^{(d)}_{M_d \times M_{d-1}} h^{(d-1)}_{M_{d-1} \times 1} + b_d)$
From http://www.andrewsnoke.com
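The chain of layer maps above is just repeated matrix multiplication and a nonlinearity; a minimal forward-pass sketch in numpy (layer sizes and random weights are illustrative, and no training is attempted):

```python
import numpy as np

rng = np.random.default_rng(6)

def sigma(z):
    # logistic activation σ(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# layer sizes N → M1 → M2 → Md, matching the slide's notation
sizes = [4, 8, 5, 3]
Ws = [rng.normal(0, 0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws, bs):
        h = sigma(W @ h + b)     # h^(k) = σ(W^(k) h^(k-1) + b_k)
    return h

out = forward(rng.normal(size=4))
print(out.shape, out.min(), out.max())
```

Training such a net means choosing the $W^{(k)}$, $b_k$ by gradient descent on a loss, i.e. backpropagation.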
Applications to Biology
Angermueller et al., Mol Syst Biol, 2016, adapted from Alipanahi et al., Nat Biotechnol, 2015
Summary
Complexity of models and overfitting. Significance tests. Cross-validation.
Tools
Books