Machine Learning Basics: Estimators, Bias and Variance


Machine Learning Basics: Estimators, Bias and Variance
Sargur N. Srihari, srihari@cedar.buffalo.edu
This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676

Topics in Basics of ML
1. Learning Algorithms
2. Capacity, Overfitting and Underfitting
3. Hyperparameters and Validation Sets
4. Estimators, Bias and Variance
5. Maximum Likelihood Estimation
6. Bayesian Statistics
7. Supervised Learning Algorithms
8. Unsupervised Learning Algorithms
9. Stochastic Gradient Descent
10. Building a Machine Learning Algorithm
11. Challenges Motivating Deep Learning

Topics in Estimators, Bias, Variance
0. Statistical tools useful for generalization
1. Point estimation
2. Bias
3. Variance and Standard Error
4. Bias-Variance tradeoff to minimize MSE
5. Consistency

Statistics Provides Tools for ML
The field of statistics provides many tools to achieve the ML goal of solving a task not only on the training set but also generalizing beyond it. Foundational concepts such as parameter estimation, bias and variance characterize notions of generalization, over-fitting and under-fitting.

Point Estimation
Point estimation is the attempt to provide the single best prediction of some quantity of interest. The quantity of interest can be:
- A single parameter
- A vector of parameters, e.g., the weights in linear regression
- A whole function

Point Estimator or Statistic
To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented by θ̂. Let {x^(1), x^(2), ..., x^(m)} be m independent and identically distributed data points. Then a point estimator or statistic is any function of the data:
θ̂_m = g(x^(1), ..., x^(m))
Thus a statistic is any function of the data. It need not be close to the true θ. A good estimator is a function whose output is close to the true underlying θ that generated the data.

Function Estimation
Establishing a relationship between input and target variables can also be viewed as point estimation. Here we predict a variable y given an input x. We assume that there is a function f(x) that describes the approximate relationship between x and y. We may assume y = f(x) + ε, where ε stands for the part of y that is not predictable from x. We are interested in approximating f with a model f̂. Function estimation is the same as estimating a parameter θ; the estimate f̂ is a point in function space.
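A minimal sketch of this idea, not taken from the slides: the true f, the noise level, and the choice of a cubic polynomial as the model are all assumptions made only for illustration.

```python
import numpy as np

# Illustration of function estimation: observe y = f(x) + eps and fit a
# cubic polynomial f_hat by least squares (f, noise level, degree are assumed).
rng = np.random.default_rng(0)

def f(x):                                    # true underlying function (assumed)
    return np.sin(2 * np.pi * x)

m = 50                                       # number of training points
x = rng.uniform(0.0, 1.0, size=m)
y = f(x) + rng.normal(scale=0.2, size=m)     # y = f(x) + eps

coeffs = np.polyfit(x, y, deg=3)             # f_hat: a point in (polynomial) function space
f_hat = np.poly1d(coeffs)

x_test = np.linspace(0.0, 1.0, 200)
print("mean squared error of f_hat:", np.mean((f_hat(x_test) - f(x_test)) ** 2))
```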

Properties of Point Estimators
The most commonly studied properties of point estimators are:
1. Bias
2. Variance
They inform us about the quality of the estimators.

1. Bias of an Estimator
The bias of an estimator θ̂_m = g(x^(1), ..., x^(m)) of a parameter θ is defined as
bias(θ̂_m) = E[θ̂_m] − θ
The estimator is unbiased if bias(θ̂_m) = 0, which implies that E[θ̂_m] = θ.
An estimator is asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E[θ̂_m] = θ.
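A small numerical sketch of the definition, not from the slides: bias is approximated by averaging the estimate over many simulated training sets. The uniform-maximum example is chosen only because it is a classically biased estimator.

```python
import numpy as np

# Monte Carlo approximation of bias(theta_hat) = E[theta_hat] - theta:
# average the estimate over many freshly drawn training sets.
rng = np.random.default_rng(0)

def estimate_bias(sample_data, estimator, theta_true, n_trials=20_000):
    estimates = np.array([estimator(sample_data()) for _ in range(n_trials)])
    return estimates.mean() - theta_true      # ~ E[theta_hat] - theta

# Example: estimate the upper endpoint b of Uniform(0, b) by the sample maximum.
b, m = 2.0, 10
draw = lambda: rng.uniform(0.0, b, size=m)
print(estimate_bias(draw, np.max, b))          # negative: max(x) underestimates b (biased)
```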

Examples of Estimator Bias
We look at common estimators of the following parameters to determine whether there is bias:
- Bernoulli distribution: mean θ
- Gaussian distribution: mean µ
- Gaussian distribution: variance σ²

Estimator of Bernoulli Mean
The Bernoulli distribution for a binary variable x ∈ {0,1} with mean θ has the form
P(x; θ) = θ^x (1 − θ)^(1−x)
An estimator for θ given samples {x^(1), ..., x^(m)} is the sample mean
θ̂_m = (1/m) Σ_{i=1}^m x^(i)
To determine whether this estimator is biased, compute
bias(θ̂_m) = E[θ̂_m] − θ
          = E[(1/m) Σ_{i=1}^m x^(i)] − θ
          = (1/m) Σ_{i=1}^m E[x^(i)] − θ
          = (1/m) Σ_{i=1}^m Σ_{x^(i)=0}^{1} x^(i) θ^{x^(i)} (1 − θ)^{(1−x^(i))} − θ
          = (1/m) Σ_{i=1}^m θ − θ
          = θ − θ = 0
Since bias(θ̂_m) = 0, we say that the estimator is unbiased.
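A quick numerical check of the same result, with θ, m and the number of trials chosen arbitrarily for illustration:

```python
import numpy as np

# Over many simulated Bernoulli training sets, the average of the sample-mean
# estimates matches theta, so the estimated bias is ~0.
rng = np.random.default_rng(0)
theta, m, n_trials = 0.3, 25, 50_000

samples = rng.binomial(1, theta, size=(n_trials, m))   # n_trials training sets of size m
theta_hat = samples.mean(axis=1)                       # sample mean for each training set
print("E[theta_hat] ~", theta_hat.mean(), " bias ~", theta_hat.mean() - theta)
```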

Estimator of Gaussian Mean
Samples {x^(1), ..., x^(m)} are independently and identically distributed according to p(x^(i)) = N(x^(i); µ, σ²). The sample mean is an estimator of the mean parameter:
µ̂_m = (1/m) Σ_{i=1}^m x^(i)
To determine the bias of the sample mean:
bias(µ̂_m) = E[µ̂_m] − µ = (1/m) Σ_{i=1}^m E[x^(i)] − µ = µ − µ = 0
Thus the sample mean is an unbiased estimator of the Gaussian mean.

Estimator of Gaussian Variance
The sample variance is
σ̂²_m = (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)²
We are interested in computing
bias(σ̂²_m) = E[σ̂²_m] − σ²
We begin by evaluating E[σ̂²_m], which works out to ((m−1)/m) σ². Thus the bias of σ̂²_m is −σ²/m, so the sample variance is a biased estimator. The unbiased sample variance estimator is
σ̃²_m = (1/(m−1)) Σ_{i=1}^m (x^(i) − µ̂_m)²
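A numerical comparison of the two estimators, with µ, σ, m and the trial count chosen only for illustration:

```python
import numpy as np

# The 1/m sample variance is biased low by sigma^2/m; the 1/(m-1) version is unbiased.
rng = np.random.default_rng(0)
mu, sigma, m, n_trials = 0.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(n_trials, m))
var_biased = x.var(axis=1, ddof=0)        # divides by m
var_unbiased = x.var(axis=1, ddof=1)      # divides by m - 1
print("E[biased]   ~", var_biased.mean(),   " (theory:", (m - 1) / m * sigma**2, ")")
print("E[unbiased] ~", var_unbiased.mean(), " (theory:", sigma**2, ")")
```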

2. Variance and Standard Error
How much do we expect the estimator to vary as a function of the data sample? Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply Var(θ̂), where the random variable is the training set. The square root of the variance is called the standard error, denoted SE(θ̂).

Importance of the Standard Error
It measures how we would expect the estimate to vary as we obtain different samples from the same distribution. The standard error of the mean is given by
SE(µ̂_m) = sqrt( Var[ (1/m) Σ_{i=1}^m x^(i) ] ) = σ / sqrt(m)
where σ² is the true variance of the samples x^(i). The standard error is often estimated using an estimate of σ. Although this estimate is not unbiased, the approximation is reasonable: the standard deviation estimate is less of an underestimate than the variance estimate.
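A brief sketch of estimating the standard error of the mean from a single sample; the distribution and sample size are assumptions for illustration only.

```python
import numpy as np

# Estimated standard error of the mean: SE(mu_hat) = sigma / sqrt(m),
# with sigma estimated from the data itself.
rng = np.random.default_rng(0)
m = 100
x = rng.normal(loc=5.0, scale=2.0, size=m)

mu_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(m)        # estimated standard error of the mean
print("mu_hat =", mu_hat, " SE(mu_hat) ~", se_hat)   # theory: 2 / sqrt(100) = 0.2
```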

Standard Error in Machine Learning
We often estimate the generalization error by computing the error on the test set. The number of samples in the test set determines the accuracy of this estimate. Since the mean will be normally distributed (according to the Central Limit Theorem), we can compute the probability that the true expectation falls in any chosen interval. For example, the 95% confidence interval centered on the mean µ̂_m is
(µ̂_m − 1.96 SE(µ̂_m), µ̂_m + 1.96 SE(µ̂_m))
ML algorithm A is better than ML algorithm B if the upper bound of A's interval is less than the lower bound of B's.
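A sketch of this comparison rule; the per-example error indicators for the two algorithms are simulated, not real results.

```python
import numpy as np

# 95% confidence intervals for test error from per-example 0/1 losses,
# plus the comparison rule from the slide (simulated errors, for illustration).
def error_ci(per_example_errors):
    m = len(per_example_errors)
    mu = per_example_errors.mean()
    se = per_example_errors.std(ddof=1) / np.sqrt(m)
    return mu - 1.96 * se, mu + 1.96 * se

rng = np.random.default_rng(0)
errors_A = rng.binomial(1, 0.10, size=2000).astype(float)   # algorithm A: ~10% error
errors_B = rng.binomial(1, 0.15, size=2000).astype(float)   # algorithm B: ~15% error

lo_A, hi_A = error_ci(errors_A)
lo_B, hi_B = error_ci(errors_B)
print("A:", (lo_A, hi_A), " B:", (lo_B, hi_B))
print("A better than B:", hi_A < lo_B)
```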

Confidence Intervals for Error
[Figure: 95% confidence intervals for the error estimate]

Trading off Bias and Variance
Bias and variance measure two different sources of error of an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance provides a measure of the expected deviation that any particular sampling of the data is likely to cause.

Negotiating the Bias-Variance Tradeoff
How do we choose between two algorithms, one with a large bias and another with a large variance? The most common approach is to use cross-validation. Alternatively, we can minimize the mean squared error, which incorporates both bias and variance.

Mean Squared Error
The mean squared error of an estimate is
MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m)
Minimizing the MSE keeps both bias and variance in check.
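A numerical check of this decomposition, using the biased (1/m) variance estimator from earlier as the example; the parameter values are assumptions for illustration.

```python
import numpy as np

# Check MSE = Bias^2 + Var for the biased (1/m) variance estimator of a Gaussian.
rng = np.random.default_rng(0)
sigma2, m, n_trials = 4.0, 10, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, m))
theta_hat = x.var(axis=1, ddof=0)                 # biased estimator of sigma^2

mse = np.mean((theta_hat - sigma2) ** 2)
bias = theta_hat.mean() - sigma2
var = theta_hat.var()
print("MSE ~", mse, " Bias^2 + Var ~", bias**2 + var)   # the two agree
```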

Underfitting and Overfitting: Bias and Variance
Both exhibit a U-shaped curve of generalization error as a function of capacity.

Consistency
So far we have discussed the behavior of an estimator for a fixed training set size. We are also interested in the behavior of the estimator as the training set grows. As the number of data points m in the training set grows, we would like our point estimates to converge to the true value of the parameters:
plim_{m→∞} θ̂_m = θ
The symbol plim, read as convergence in probability, expresses this condition, known as consistency (also known as weak consistency). Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows.
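A short sketch showing consistency empirically; the Bernoulli parameter and sample sizes are chosen only for illustration.

```python
import numpy as np

# Consistency in action: the sample mean of Bernoulli(theta) data
# concentrates around theta as the training set grows.
rng = np.random.default_rng(0)
theta = 0.3
for m in (10, 100, 1_000, 10_000, 100_000):
    theta_hat = rng.binomial(1, theta, size=m).mean()
    print(f"m = {m:>7}  theta_hat = {theta_hat:.4f}")
```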