Machine Learning Basics: Estimators, Bias and Variance
Sargur N. Srihari (srihari@cedar.buffalo.edu)
This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676
Topics in Basics of ML
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning
Topics in Estimators, Bias, Variance
0. Statistical tools useful for generalization
1. Point estimation
2. Bias
3. Variance and Standard Error
4. Bias-Variance tradeoff to minimize MSE
5. Consistency
Statistics Provides Tools for ML
The field of statistics provides many tools to achieve the ML goal of solving a task not only on the training set but also generalizing beyond it. Foundational concepts such as parameter estimation, bias and variance characterize notions of generalization, over-fitting and under-fitting.
Point Estimation
Point estimation is the attempt to provide the single best prediction of some quantity of interest. The quantity of interest can be: a single parameter, a vector of parameters (e.g., the weights in linear regression), or a whole function.
Point Estimator or Statistic
To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented by θ̂. Let {x^(1), x^(2), ..., x^(m)} be m independent and identically distributed data points. Then a point estimator or statistic is any function of the data: θ̂_m = g(x^(1), ..., x^(m)). Thus a statistic is any function of the data; it need not be close to the true θ. A good estimator is a function whose output is close to the true underlying θ that generated the data.
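A minimal numeric sketch of the point above: any function of the sample is technically a statistic, but only some functions are good estimators. The data distribution and functions below are illustrative choices, not from the slides.

```python
import numpy as np

# Draw a synthetic sample x^(1), ..., x^(m) from N(5, 2^2).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

theta_hat_mean = data.mean()   # a sensible estimator of the true mean (5.0)
theta_hat_first = data[0]      # also a statistic, but a poor estimator
theta_hat_const = 0.0          # even a constant is "a function of the data"

print(theta_hat_mean, theta_hat_first, theta_hat_const)
```

The sample mean lands close to the true value 5.0, while the other two "statistics" need not.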
Function Estimation
Establishing a relationship between input and target variables can also be viewed as point estimation. Here we predict a variable y given an input x. We assume there is a function f(x) that describes the approximate relationship between x and y. We may assume y = f(x) + ε, where ε stands for the part of y that is not predictable from x. We are interested in approximating f with a model f̂. Function estimation is the same as estimating a parameter θ; here f̂ is a point in function space.
Properties of Point Estimators
The most commonly studied properties of point estimators are: 1. Bias 2. Variance. They inform us about the quality of the estimators.
1. Bias of an Estimator
The bias of an estimator of parameter θ is defined as bias(θ̂_m) = E[θ̂_m] − θ. The estimator is unbiased if bias(θ̂_m) = 0, which implies that E[θ̂_m] = θ. An estimator is asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E[θ̂_m] = θ.
Examples of Estimator Bias
We look at common estimators of the following parameters to determine whether there is bias: Bernoulli distribution: mean θ; Gaussian distribution: mean µ; Gaussian distribution: variance σ².
Estimator of Bernoulli Mean
The Bernoulli distribution for binary variable x ∈ {0,1} with mean θ has the form P(x; θ) = θ^x (1 − θ)^(1−x). An estimator for θ given samples {x^(1), ..., x^(m)} is θ̂_m = (1/m) Σ_{i=1}^m x^(i). To determine whether this estimator is biased, compute:
bias(θ̂_m) = E[θ̂_m] − θ
= E[(1/m) Σ_{i=1}^m x^(i)] − θ
= (1/m) Σ_{i=1}^m E[x^(i)] − θ
= (1/m) Σ_{i=1}^m Σ_{x^(i)=0}^{1} x^(i) θ^{x^(i)} (1 − θ)^{1−x^(i)} − θ
= (1/m) Σ_{i=1}^m θ − θ = θ − θ = 0
Since bias(θ̂_m) = 0, we say that the estimator is unbiased.
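The derivation above can be checked empirically: averaging θ̂_m over many independently drawn training sets should land very close to the true θ. This is a Monte Carlo sketch with arbitrary choices of θ, m and trial count.

```python
import numpy as np

rng = np.random.default_rng(42)
theta, m, trials = 0.3, 20, 100_000   # illustrative values

# Each row is one training set of m Bernoulli(theta) samples.
samples = rng.binomial(1, theta, size=(trials, m))
theta_hats = samples.mean(axis=1)     # one estimate theta_hat_m per training set

# The average estimate minus theta approximates the bias.
bias_estimate = theta_hats.mean() - theta
print(f"estimated bias: {bias_estimate:+.4f}")
```

The printed value is near zero, matching the analytical result bias(θ̂_m) = 0.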
Estimator of Gaussian Mean
Samples {x^(1), ..., x^(m)} are independently and identically distributed according to p(x^(i)) = N(x^(i); µ, σ²). The sample mean µ̂_m = (1/m) Σ_{i=1}^m x^(i) is an estimator of the mean parameter. To determine its bias: bias(µ̂_m) = E[µ̂_m] − µ = (1/m) Σ_{i=1}^m E[x^(i)] − µ = µ − µ = 0. Thus the sample mean is an unbiased estimator of the Gaussian mean.
Estimator for Gaussian Variance
The sample variance is σ̂²_m = (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)². We are interested in computing bias(σ̂²_m) = E[σ̂²_m] − σ². We begin by evaluating E[σ̂²_m], which works out to ((m−1)/m) σ². Thus the bias of σ̂²_m is −σ²/m, and the sample variance is a biased estimator. The unbiased sample variance estimator is σ̃²_m = (1/(m−1)) Σ_{i=1}^m (x^(i) − µ̂_m)².
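The bias of the two variance estimators can be seen numerically: averaged over many training sets, dividing by m undershoots σ² by the factor (m−1)/m, while dividing by m−1 does not. The values of σ², m and trial count below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, m, trials = 0.0, 2.0, 10, 200_000   # so sigma^2 = 4, (m-1)/m = 0.9

x = rng.normal(mu, sigma, size=(trials, m))    # each row: one training set
var_biased = x.var(axis=1, ddof=0)     # (1/m)     * sum (x - mean)^2
var_unbiased = x.var(axis=1, ddof=1)   # (1/(m-1)) * sum (x - mean)^2

print(var_biased.mean())     # near sigma^2 * (m-1)/m = 3.6
print(var_unbiased.mean())   # near sigma^2 = 4.0
```

NumPy's `ddof` parameter selects between exactly these two estimators: `ddof=0` is the biased sample variance, `ddof=1` the unbiased one.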
2. Variance and Standard Error
The variance of an estimator tells how much we expect the estimator to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply Var(θ̂), where the random variable is the training set. The square root of the variance is called the standard error, denoted SE(θ̂).
Importance of Standard Error
It measures how we would expect the estimate to vary as we obtain different samples from the same distribution. The standard error of the mean is given by SE(µ̂_m) = sqrt(Var[(1/m) Σ_{i=1}^m x^(i)]) = σ/√m, where σ² is the true variance of the samples x^(i). The standard error is often estimated using an estimate of σ. Although that estimate is not unbiased, the approximation is reasonable: the standard deviation is less of an underestimate than the variance.
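The formula SE(µ̂_m) = σ/√m can be verified by resampling: the spread of the sample mean across many training sets should match σ/√m. The parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, m, trials = 3.0, 25, 100_000   # so sigma / sqrt(m) = 0.6

# Compute the sample mean of each of `trials` independent training sets.
means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)

empirical_se = means.std()            # spread of the sample mean
theoretical_se = sigma / np.sqrt(m)   # sigma / sqrt(m)
print(empirical_se, theoretical_se)
```

The two printed values agree closely, confirming that averaging m samples shrinks the standard deviation by a factor of √m.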
Standard Error in Machine Learning
We often estimate the generalization error by computing the error on the test set. The number of samples in the test set determines the accuracy of this estimate. Since the mean will be normally distributed (according to the Central Limit Theorem), we can compute the probability that the true expectation falls in any chosen interval. E.g., the 95% confidence interval centered on the mean µ̂_m is (µ̂_m − 1.96 SE(µ̂_m), µ̂_m + 1.96 SE(µ̂_m)). ML algorithm A is better than ML algorithm B if the upper bound of A's interval is less than the lower bound of B's.
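The confidence-interval construction above can be sketched in a few lines. The per-example test errors here are synthetic stand-ins; in practice they would be the losses of a trained model on held-out examples.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical per-example test-set errors (synthetic data, mean 0.25).
errors = rng.normal(0.25, 0.1, size=400)

mu_hat = errors.mean()                                # estimated test error
se_hat = errors.std(ddof=1) / np.sqrt(len(errors))    # estimated standard error

# 95% confidence interval via the normal approximation (CLT).
ci = (mu_hat - 1.96 * se_hat, mu_hat + 1.96 * se_hat)
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

To compare two algorithms this way, one would compute such an interval for each and check whether the intervals are disjoint.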
Confidence Intervals for Error
[Figure: 95% confidence intervals for the error estimate.]
Trading off Bias and Variance
Bias and variance measure two different sources of error of an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance provides a measure of the deviation from that expectation that any particular sampling of the data is likely to cause.
Negotiating the Bias-Variance Tradeoff
How do we choose between two algorithms, one with a large bias and another with a large variance? The most common approach is to use cross-validation. Alternatively, we can minimize the mean squared error, which incorporates both bias and variance.
Mean Squared Error
The mean squared error of an estimate is MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m). Minimizing the MSE keeps both bias and variance in check.
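The decomposition MSE = Bias² + Var can be verified numerically for the biased variance estimator from the earlier slide; both sides are computed over the same collection of simulated training sets (parameter choices are illustrative).

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, m, trials = 4.0, 10, 200_000   # true parameter is sigma^2 = 4

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
est = x.var(axis=1, ddof=0)            # biased estimator of sigma^2, per training set

mse = np.mean((est - sigma2) ** 2)     # E[(theta_hat - theta)^2]
bias2_plus_var = (est.mean() - sigma2) ** 2 + est.var()

print(mse, bias2_plus_var)             # the two sides agree
```

Over the empirical distribution of estimates the identity holds exactly (up to floating-point rounding), since it is just the algebraic expansion of E[(θ̂ − θ)²].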
Underfit-Overfit : Bias-Variance
Both have a U-shaped curve of generalization error as a function of capacity.
Consistency
So far we have discussed the behavior of an estimator for a fixed training set size. We are also interested in the behavior of the estimator as the training set grows. As the number of data points m in the training set grows, we would like our point estimates to converge to the true value of the parameters: plim_{m→∞} θ̂_m = θ. The symbol plim indicates probability in the limit; the condition is known as consistency (also known as weak consistency). Consistency ensures that the bias of the estimator diminishes as the number of data examples grows.
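Consistency can be illustrated with the sample mean: as m grows, the estimate's distance from the true parameter shrinks toward zero. The distribution and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(11)
true_mu = 2.0

# Absolute estimation error of the sample mean at increasing training set sizes.
errors = {}
for m in (10, 1_000, 100_000):
    x = rng.normal(true_mu, 1.0, size=m)
    errors[m] = abs(x.mean() - true_mu)

print(errors)   # the error typically shrinks as m grows
```

With m = 100,000 the error is tiny, in line with the standard error σ/√m going to zero as m → ∞.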