Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
1 Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis Carlos Fernandez-Granda
2 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
3 Regression The aim is to learn a function h that relates a response or dependent variable y to several observed variables x_1, x_2, ..., x_p, known as covariates, features or independent variables. The response is assumed to be of the form y = h(x) + z, where x ∈ R^p contains the features and z is noise
4 Linear regression The regression function h is assumed to be linear:
y^(i) = (x^(i))^T β + z^(i), 1 ≤ i ≤ n
Our aim is to estimate β ∈ R^p from the data
5 Linear regression In matrix form,
[ y^(1) ]   [ x^(1)_1  x^(1)_2  ...  x^(1)_p ] [ β_1 ]   [ z^(1) ]
[ y^(2) ] = [ x^(2)_1  x^(2)_2  ...  x^(2)_p ] [ β_2 ] + [ z^(2) ]
[  ...  ]   [   ...      ...    ...    ...   ] [ ... ]   [  ...  ]
[ y^(n) ]   [ x^(n)_1  x^(n)_2  ...  x^(n)_p ] [ β_p ]   [ z^(n) ]
Equivalently, y = Xβ + z
6 Linear model for GDP [Table: GDP (millions), population and unemployment rate for North Dakota, Alabama, Mississippi, Arkansas, Kansas, Georgia, Iowa, West Virginia, Kentucky and Tennessee; the GDP of Tennessee is the quantity to predict; numeric values omitted in this transcription]
7 Centering The data are centered by subtracting the averages: y_cent := y − av(y), X_cent := X − av(X) applied columnwise (numeric values omitted in this transcription)
8 Normalizing The centered data are normalized by dividing by the standard deviations: y_norm := y_cent / std(y), and each column of X_norm is the corresponding column of X_cent divided by std(X) (numeric values omitted in this transcription)
9 Linear model for GDP Aim: find β ∈ R^2 such that y_norm ≈ X_norm β. The estimate for the GDP of Tennessee will be
ŷ_Ten = av(y) + std(y) ⟨x_Ten,norm, β⟩
where x_Ten,norm is centered using av(X) and normalized using std(X)
10 Temperature predictor A friend tells you: "I found a cool way to predict the average daily temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"
11 System of equations A is n × p and full rank; consider the system Aβ = b. If n < p the system is underdetermined: infinitely many solutions for any b! (overfitting) If n = p the system is determined: a unique solution for any b (overfitting) If n > p the system is overdetermined: a solution exists only if b ∈ col(A) (if there is noise, no exact solution)
12 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
13 Least squares For fixed β we can evaluate the error using
Σ_{i=1}^n ( y^(i) − (x^(i))^T β )² = ||y − Xβ||²_2
The least-squares estimate β_LS minimizes this cost function:
β_LS := arg min_β ||y − Xβ||²_2 = (X^T X)^{-1} X^T y if X is full rank and n ≥ p
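As a minimal sketch (synthetic data; the GDP numbers from the slides are not reproduced), the closed-form estimate can be checked numerically. `np.linalg.lstsq` solves the same minimization via the SVD and is the numerically preferred route:

```python
import numpy as np

# Hypothetical small regression problem to illustrate
# beta_LS = (X^T X)^{-1} X^T y (X full rank, n >= p).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Normal-equations solution (fine here; unstable if X is ill conditioned).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based solver for the same least-squares problem.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

Both routes recover `beta_true` up to the noise level; for ill-conditioned feature matrices the SVD route is more stable.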
14 Least-squares fit [Figure: data points and the least-squares fit]
15-20 Least-squares solution Let X = USV^T (thin SVD). Decompose y = UU^T y + (I − UU^T) y. By the Pythagorean theorem,
||y − Xβ||²_2 = ||(I − UU^T) y||²_2 + ||UU^T y − Xβ||²_2
so
arg min_β ||y − Xβ||²_2 = arg min_β ||UU^T y − Xβ||²_2
= arg min_β ||UU^T y − USV^T β||²_2
= arg min_β ||U^T y − SV^T β||²_2
= V S^{-1} U^T y
= (X^T X)^{-1} X^T y
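The chain of equalities can be verified numerically: the SVD formula V S^{-1} U^T y agrees with the normal-equations formula (synthetic data, illustrative names):

```python
import numpy as np

# Check that beta_LS = V S^{-1} U^T y matches (X^T X)^{-1} X^T y.
rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U S V^T
beta_svd = Vt.T @ ((U.T @ y) / s)                 # V S^{-1} U^T y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations

print(np.allclose(beta_svd, beta_ne))
```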
21 Linear model for GDP The least-squares estimate is β_LS = [values omitted in this transcription]. GDP is roughly proportional to the population; unemployment has a negative (linear) effect
22 Linear model for GDP [Table: true GDP and least-squares estimate for each state; numeric values omitted in this transcription]
23-24 Maximum temperatures in Oxford, UK [Figure: monthly maximum temperature (Celsius) over time]
25 Linear model
y_t ≈ β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + β_3 t
where 1 ≤ t ≤ n is the time in months (value of n omitted in this transcription)
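A sketch of fitting this seasonal-plus-trend model by least squares. The monthly series below is simulated (the Oxford data are not reproduced) and the coefficient values are made up for illustration:

```python
import numpy as np

# Simulated monthly temperature-like series with a yearly cycle and a
# small linear trend; coefficients (10, 6, 0, 0.002) are illustrative.
rng = np.random.default_rng(2)
t = np.arange(1, 12 * 20 + 1)              # 20 years of monthly time stamps
y = 10 + 6 * np.cos(2 * np.pi * t / 12) + 0.002 * t + rng.standard_normal(t.size)

# Design matrix with columns 1, cos(2*pi*t/12), sin(2*pi*t/12) and t.
X = np.column_stack([
    np.ones(t.size),
    np.cos(2 * np.pi * t / 12),
    np.sin(2 * np.pi * t / 12),
    t.astype(float),
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta.round(3))                       # beta_0, beta_1, beta_2, beta_3
```

The trend slope `beta[3]` is what the slides convert into the increase per 100 years.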
26-28 Model fitted by least squares [Figure: temperature (Celsius), data and fitted model]
29 Trend: increase of 0.75 °C / 100 years (1.35 °F) [Figure: temperature (Celsius), data and trend]
30-32 Model for minimum temperatures [Figure: temperature (Celsius), data and fitted model]
33 Trend: increase of 0.88 °C / 100 years (1.58 °F) [Figure: temperature (Celsius), data and trend]
34 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
35 Geometric interpretation Any vector Xβ is in the span of the columns of X. The least-squares estimate is the closest vector to y that can be represented in this way. This is the projection of y onto the column space of X:
X β_LS = USV^T V S^{-1} U^T y = UU^T y
36 Geometric interpretation [Figure: projection of y onto the column space of X]
37 Face denoising We denoise by projecting onto: S_1, the span of the 9 images from the same subject, and S_2, the span of the 360 images in the training set. Test error: ||x − P_{S_1} y||_2 / ||x||_2 = (value omitted in this transcription), ||x − P_{S_2} y||_2 / ||x||_2 = 0.078
38 S_1 := span( [9 images of the same subject; omitted] )
39 Denoising via projection onto S_1 [Figure: data y = signal x + noise z, and the estimate P_{S_1} y]
40 S_2 := span( [360 training images; omitted] )
41 Denoising via projection onto S_2 [Figure: data y = signal x + noise z, and the estimate P_{S_2} y]
42 P_{S_1} y and P_{S_2} y [Figure: the original image x and the two projections]
43-44 Lessons of face denoising What does the intuition learned from face denoising tell us about linear regression? More features = larger column space. A larger column space captures more of the true image, but also more of the noise. Balance between underfitting and overfitting
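The projection intuition can be sketched numerically (random stand-ins for the face images; the dimensions below are made up): projecting noisy data onto a low-dimensional subspace that contains the signal removes the noise outside that subspace.

```python
import numpy as np

# Denoising by projection: the 9-dimensional subspace S plays the role
# of the span of the subject's face images.
rng = np.random.default_rng(3)
n = 500
B = rng.standard_normal((n, 9))          # basis for a 9-dimensional subspace S
x = B @ rng.standard_normal(9)           # true signal lies in S
y = x + rng.standard_normal(n)           # noisy data

Q, _ = np.linalg.qr(B)                   # orthonormal basis for S
x_hat = Q @ (Q.T @ y)                    # projection P_S y

err_proj = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
err_raw = np.linalg.norm(x - y) / np.linalg.norm(x)
print(err_proj < err_raw)                # only the 9-dim noise component survives
```

Enlarging the subspace (more columns in `B`) would capture more of an out-of-subspace signal, but also retain more noise, which is the underfitting/overfitting balance above.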
45 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
46 Motivation Model the data y_1, ..., y_n as realizations of a set of random variables y_1, ..., y_n. The joint pdf depends on a vector of parameters β:
f_β(y_1, ..., y_n) := f_{y_1,...,y_n}(y_1, ..., y_n)
is the probability density of y_1, ..., y_n at the observed data. Idea: choose β such that the density is as high as possible
47 Likelihood The likelihood is equal to the joint pdf,
L_{y_1,...,y_n}(β) := f_β(y_1, ..., y_n)
interpreted as a function of the parameters. The log-likelihood function is the log of the likelihood, log L_{y_1,...,y_n}(β)
48 Maximum-likelihood estimator The likelihood quantifies how likely the data are according to the model. Maximum-likelihood (ML) estimator:
β_ML(y_1, ..., y_n) := arg max_β L_{y_1,...,y_n}(β) = arg max_β log L_{y_1,...,y_n}(β)
Maximizing the log-likelihood is equivalent, and often more convenient
49 Probabilistic interpretation We model the noise as an iid Gaussian random vector z whose entries have zero mean and variance σ². The data are a realization of the random vector
y := Xβ + z
so y is Gaussian with mean Xβ and covariance matrix σ² I
50 Likelihood The joint pdf of y is
f_y(a) := Π_{i=1}^n (1 / (√(2π) σ)) exp( −(a[i] − (Xβ)[i])² / (2σ²) )
= (1 / ((2π)^{n/2} σ^n)) exp( −||a − Xβ||²_2 / (2σ²) )
The likelihood is
L_y(β) = (1 / ((2π)^{n/2} σ^n)) exp( −||y − Xβ||²_2 / (2σ²) )
51 Maximum-likelihood estimate The maximum-likelihood estimate is
β_ML = arg max_β L_y(β) = arg max_β log L_y(β) = arg min_β ||y − Xβ||²_2 = β_LS
52 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
53-55 Estimation error If the data are generated according to the linear model y := Xβ + z, then
β_LS − β = (X^T X)^{-1} X^T (Xβ + z) − β = (X^T X)^{-1} X^T z
as long as X is full rank
56-58 LS estimator is unbiased Assume the noise z is random and has zero mean. Then
E(β_LS − β) = (X^T X)^{-1} X^T E(z) = 0
The estimate is unbiased: its mean equals β
59 Least-squares error If the data are generated according to the linear model y := Xβ + z, with X = USV^T full rank, then
||U^T z||_2 / σ_1 ≤ ||β_LS − β||_2 ≤ ||U^T z||_2 / σ_p
where σ_1 and σ_p are the largest and smallest singular values of X
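A numerical check of these bounds on synthetic data, using the projected noise U^T z (the quantity that appears in the proof):

```python
import numpy as np

# With beta_LS - beta = V S^{-1} U^T z, the error norm must lie between
# ||U^T z||/sigma_1 and ||U^T z||/sigma_p.
rng = np.random.default_rng(4)
n, p = 30, 5
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
z = rng.standard_normal(n)
y = X @ beta + z

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
err = np.linalg.norm(beta_ls - beta)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
proj = np.linalg.norm(U.T @ z)             # ||U^T z||_2
print(proj / s[0] <= err <= proj / s[-1])
```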
60 Least-squares error: Proof The error is given by β_LS − β = (X^T X)^{-1} X^T z. How can we bound ||(X^T X)^{-1} X^T z||_2?
61 Singular values The singular values of a matrix A ∈ R^{n×p} of rank p satisfy
σ_1 = max_{||x||_2 = 1, x ∈ R^p} ||Ax||_2, σ_p = min_{||x||_2 = 1, x ∈ R^p} ||Ax||_2
62 Least-squares error We have β_LS − β = V S^{-1} U^T z, and the diagonal entries of S^{-1} lie between 1/σ_1 and 1/σ_p, so
||U^T z||_2 / σ_1 ≤ ||V S^{-1} U^T z||_2 ≤ ||U^T z||_2 / σ_p
63 Experiment X_train, X_test, z_train and β are sampled iid from a standard Gaussian. The data have 50 features.
y_train = X_train β + z_train, y_test = X_test β (no test noise)
We use y_train and X_train to compute β_LS, and evaluate
error_train = ||y_train − X_train β_LS||_2 / ||y_train||_2, error_test = ||y_test − X_test β_LS||_2 / ||y_test||_2
64 Experiment [Figure: relative l2-norm error (training and test) and the training noise level as a function of n]
65-69 Experiment Questions
1. Can we approximate the relative noise level ||z||_2 / ||y||_2? Yes: ||β||_2 ≈ √50, ||X_train β||_2 ≈ √(50 n) and ||z_train||_2 ≈ √n, so ||z||_2 / ||y||_2 ≈ 1/√51.
2. Why does the training error start at 0? For n = 50, X is square and invertible.
3. Why does the relative training error converge to the noise level? ||X_train β_LS − y_train||_2 = ||X_train (β_LS − β) − z_train||_2 and β_LS → β.
4. Why does the relative test error converge to zero? We assumed no test noise, and β_LS → β
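The experiment is easy to reproduce in a few lines (a sketch; the number of repetitions and the values of n in the slides' figure are not reproduced):

```python
import numpy as np

# Gaussian design, p = 50 features, noise on the training data only.
rng = np.random.default_rng(5)
p = 50

def relative_errors(n):
    beta = rng.standard_normal(p)
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n, p))
    y_train = X_train @ beta + rng.standard_normal(n)
    y_test = X_test @ beta                 # no test noise
    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(y_train - X_train @ beta_ls) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(y_test - X_test @ beta_ls) / np.linalg.norm(y_test)
    return err_train, err_test

small = relative_errors(p)        # n = p: zero training error (overfitting)
large = relative_errors(100 * p)  # n >> p: training error near the noise level
print(small, large)
```

For `n = p` the training error is (numerically) zero, as in question 2; for large n the training error approaches 1/√51 ≈ 0.14 and the test error shrinks toward zero.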
70 Non-asymptotic bound Let y := Xβ + z, where the entries of X and z are iid standard Gaussian. The least-squares estimate satisfies
√((1 − ε) p) / ((1 + ε) √n) ≤ ||β_LS − β||_2 ≤ √((1 + ε) p) / ((1 − ε) √n)
with probability at least 1 − 1/p − 2 exp(−p ε²/8), as long as n ≥ 64 p log(12/ε) / ε²
71 Proof
||U^T z||_2 / σ_1 ≤ ||V S^{-1} U^T z||_2 ≤ ||U^T z||_2 / σ_p
72-73 Projection onto a fixed subspace Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0,
P( k(1 − ε) < ||P_S z||²_2 < k(1 + ε) ) ≥ 1 − 2 exp(−k ε² / 8)
Consequence: with probability 1 − 2 exp(−p ε²/8),
(1 − ε) p ≤ ||U^T z||²_2 ≤ (1 + ε) p
74 Singular values of a Gaussian matrix Let A be an n × k matrix with iid standard Gaussian entries, n > k. For any fixed ε > 0, the singular values of A satisfy
√n (1 − ε) ≤ σ_k ≤ σ_1 ≤ √n (1 + ε)
with probability at least 1 − 1/k, as long as n > 64 k log(12/ε) / ε²
75 Proof With probability 1 − 1/p,
√n (1 − ε) ≤ σ_p ≤ σ_1 ≤ √n (1 + ε)
as long as n ≥ 64 p log(12/ε) / ε²
76 Experiment [Figure: relative coefficient error ||β − β_LS||_2 / ||β||_2 as a function of n for p = 50, 100 and 200, following the 1/√n rate]
77 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
78 Condition number The condition number of A ∈ R^{n×p}, n ≥ p, is the ratio σ_1/σ_p of its largest and smallest singular values. A matrix is ill conditioned if its condition number is large (it is almost rank deficient)
79 Noise amplification Let y := Xβ + z, where z is iid standard Gaussian. With probability at least 1 − 2 exp(−ε²/8),
||β_LS − β||_2 ≥ √(1 − ε) / σ_p
where σ_p is the smallest singular value of X
80-84 Proof Since V is orthogonal,
||β_LS − β||²_2 = ||V S^{-1} U^T z||²_2 = ||S^{-1} U^T z||²_2 = Σ_{i=1}^p (u_i^T z)² / σ_i² ≥ (u_p^T z)² / σ_p²
85-86 Projection onto a fixed subspace Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0,
P( k(1 − ε) < ||P_S z||²_2 < k(1 + ε) ) ≥ 1 − 2 exp(−k ε² / 8)
Consequence (k = 1): with probability 1 − 2 exp(−ε²/8),
(u_p^T z)² ≥ 1 − ε
87 Example Let y := Xβ + z for a specific two-column X, β and z with ||z||_2 = 0.11 (numeric values omitted in this transcription)
88 Example Condition number: (value omitted). SVD X = USV^T (numeric values omitted in this transcription)
89-94 Example Computing
β_LS − β = V S^{-1} U^T z
with the numerical values of the SVD and z (mostly omitted in this transcription): the noise component aligned with the small singular value is divided by it and blows up, so that ||β_LS − β||_2 is a large multiple of ||z||_2
95-96 Multicollinearity The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs if features are highly correlated. For any X ∈ R^{n×p} with normalized columns, if X_i and X_j, i ≠ j, satisfy
⟨X_i, X_j⟩ ≥ 1 − ε²/2
then the smallest singular value σ_p ≤ ε. Proof idea: consider ||X(e_i − e_j)||_2
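This bound is easy to see numerically (synthetic correlated columns; the correlation strength is made up):

```python
import numpy as np

# Two normalized, highly correlated columns with <X_1, X_2> = 1 - eps^2/2
# force the smallest singular value below eps.
rng = np.random.default_rng(6)
n = 200
x1 = rng.standard_normal(n)
x1 /= np.linalg.norm(x1)
x2 = x1 + 0.01 * rng.standard_normal(n)   # almost a copy of x1
x2 /= np.linalg.norm(x2)
X = np.column_stack([x1, x2])

inner = x1 @ x2
eps = np.sqrt(2 - 2 * inner)              # so that inner = 1 - eps^2/2
sigma_p = np.linalg.svd(X, compute_uv=False)[-1]

# Proof idea: sigma_p <= ||X(e_1 - e_2)||_2 / ||e_1 - e_2||_2 = eps / sqrt(2)
print(sigma_p <= eps)
```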
97 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
98 Motivation Avoid noise amplification due to multicollinearity Problem: Noise amplification blows up coefficients Solution: Penalize large-norm solutions when fitting the model Adding a penalty term promoting a particular structure is called regularization
99-103 Ridge regression For a fixed regularization parameter λ > 0,
β_ridge := arg min_β ||y − Xβ||²_2 + λ ||β||²_2 = (X^T X + λI)^{-1} X^T y
The λI term increases the singular values of X^T X. When λ → 0, β_ridge → β_LS; when λ → ∞, β_ridge → 0
104-106 Proof β_ridge is the solution to a modified least-squares problem:
β_ridge = arg min_β || [y; 0] − [X; √λ I] β ||²_2
= ( [X; √λ I]^T [X; √λ I] )^{-1} [X; √λ I]^T [y; 0]
= (X^T X + λI)^{-1} X^T y
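The augmented least-squares formulation in this proof can be checked numerically on synthetic data:

```python
import numpy as np

# Ridge regression as least squares on X stacked with sqrt(lambda) * I.
rng = np.random.default_rng(7)
n, p, lam = 60, 5, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed form: (X^T X + lambda I)^{-1} X^T y.
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Augmented problem: append sqrt(lambda) I to X and zeros to y.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(beta_closed, beta_aug))
```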
107-112 Modified projection
y_ridge := X β_ridge = X (X^T X + λI)^{-1} X^T y
= USV^T (V S² V^T + λ V V^T)^{-1} V S U^T y
= USV^T V (S² + λI)^{-1} V^T V S U^T y
= U S (S² + λI)^{-1} S U^T y
= Σ_{i=1}^p ( σ_i² / (σ_i² + λ) ) ⟨y, u_i⟩ u_i
The component of the data in the direction of u_i is shrunk by σ_i² / (σ_i² + λ)
113-115 Modified projection: relation to PCA The component of the data in the direction of u_i is shrunk by σ_i² / (σ_i² + λ). Instead of orthogonally projecting onto the column space of X as in standard regression, we shrink and project. Which directions are shrunk the most? The directions in the data with smallest variance. In PCA, we delete the directions with smallest variance (i.e., shrink them to zero). Ridge regression can be thought of as a continuous variant of performing regression on principal components
116 Ridge-regression estimate If y := Xβ + z,
β_ridge = V diag( σ_i² / (σ_i² + λ) ) V^T β + V diag( σ_i / (σ_i² + λ) ) U^T z
where X = USV^T and σ_1, ..., σ_p are the singular values. For comparison, β_LS = β + V S^{-1} U^T z
117 Bias-variance tradeoff The error β_ridge − β can be divided into two terms: the bias (depends on β) and the variance (depends on z). The bias equals
E(β_ridge) − β = −V diag( λ / (σ_i² + λ) ) V^T β
A larger λ increases the bias, but dampens the noise (decreases the variance)
118 Example Let y := Xβ + z, with the same X, β and z as before and ||z||_2 = 0.11 (numeric values omitted in this transcription)
119-120 Example
β_ridge − β = −V diag( λ / (σ_i² + λ) ) V^T β + V diag( σ_i / (σ_i² + λ) ) U^T z
Setting λ = 0.01 and plugging in the numerical values (omitted in this transcription) yields a much smaller error than the least-squares estimate
121 Example The least-squares relative error (value omitted in this transcription) is much larger than the ridge-regression error, ||β_ridge − β||_2 = 7.96 ||z||_2
122 Example [Figure: coefficient error of the ridge estimate and of the least-squares fit as a function of the regularization parameter]
123 Maximum-a-posteriori estimator Is there a probabilistic interpretation of ridge regression? Bayesian viewpoint: β is modeled as random, not deterministic. The maximum-a-posteriori (MAP) estimator of β given y is
β_MAP(y) := arg max_β f_{β | y}(β | y)
where f_{β | y} is the conditional pdf of β given y
124 Maximum-a-posteriori estimator Let y ∈ R^n be a realization of y := Xβ + z, where the entries of β and z are iid Gaussian with mean zero and variances σ_1² and σ_2², respectively. If X ∈ R^{n×p} is known, then
β_MAP = arg min_β ||y − Xβ||²_2 + λ ||β||²_2, where λ := σ_2² / σ_1²
What does it mean if σ_1² is tiny or large? How about σ_2²?
125 Problem How to calibrate the regularization parameter? We cannot use the coefficient error (we don't know the true value!), and we cannot minimize over the training data (why?). Solution: check the fit on new data
126 Cross validation Given a set of examples (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)),
1. Partition the data into a training set X_train ∈ R^{n_train × p}, y_train ∈ R^{n_train} and a validation set X_val ∈ R^{n_val × p}, y_val ∈ R^{n_val}
2. Fit the model using the training set for every λ in a set Λ,
β_ridge(λ) := arg min_β ||y_train − X_train β||²_2 + λ ||β||²_2
and evaluate the fitting error on the validation set, err(λ) := ||y_val − X_val β_ridge(λ)||_2
3. Choose the value of λ that minimizes the validation-set error, λ_cv := arg min_{λ ∈ Λ} err(λ)
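A minimal sketch of this procedure (synthetic data; the grid Λ and split sizes below are made up):

```python
import numpy as np

# Fit ridge on the training split for each lambda in a grid and keep the
# lambda with the smallest validation error.
rng = np.random.default_rng(8)
n, p = 80, 10
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
errs = [np.linalg.norm(y_val - X_val @ ridge(X_train, y_train, lam)) for lam in grid]
lam_cv = grid[int(np.argmin(errs))]
print(lam_cv)
```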
127 Prediction of house prices Aim: Predicting the price of a house from 1. Area of the living room 2. Condition (integer between 1 and 5) 3. Grade (integer between 7 and 12) 4. Area of the house without the basement 5. Area of the basement 6. The year it was built 7. Latitude 8. Longitude 9. Average area of the living room of houses within 15 blocks
128 Prediction of house prices Training data: 15 houses. Validation data: 15 houses. Test data: 15 houses. Condition number of the training-data feature matrix: 9.94. We evaluate the relative fit ||y − X β_ridge||_2 / ||y||_2
129 Prediction of house prices [Figure: coefficients and l2-norm cost on the training and validation sets as a function of the regularization parameter]
130 Prediction of house prices Best λ: 0.27. Validation set error: (value omitted) (least-squares: 0.906). Test set error: (value omitted) (least-squares: 1.186)
131-133 Training, validation and test [Figures: estimated price vs true price (dollars) for least squares and ridge regression on each of the three sets]
134 Linear regression Least-squares estimation Geometric interpretation Probabilistic interpretation Analysis of least-squares estimate Noise amplification Ridge regression Classification
135 The classification problem Goal: assign examples to one of several predefined categories. We have n examples of labels and corresponding features, (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)). Here we consider only two categories: the labels are 0 or 1
136 Logistic function Smoothed version of the step function:
g(t) := 1 / (1 + exp(−t))
137 Logistic function [Figure: g(t) = 1 / (1 + exp(−t)) as a function of t]
138 Logistic regression Generalized linear model: linear model + entrywise link function,
y^(i) ≈ g( β_0 + ⟨x^(i), β⟩ )
139 Maximum likelihood If y^(1), ..., y^(n) are independent samples of Bernoulli random variables with parameter
p_{y^(i)}(1) := g(⟨x^(i), β⟩)
where x^(1), ..., x^(n) ∈ R^p are known, the ML estimate of β given y^(1), ..., y^(n) is
β_ML := arg max_β Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log( 1 − g(⟨x^(i), β⟩) )
140-141 Maximum likelihood
L(β) := p_{y^(1),...,y^(n)}(y^(1), ..., y^(n)) = Π_{i=1}^n g(⟨x^(i), β⟩)^{y^(i)} ( 1 − g(⟨x^(i), β⟩) )^{1 − y^(i)}
142 Logistic-regression estimator
β_LR := arg max_β Σ_{i=1}^n y^(i) log g(⟨x^(i), β⟩) + (1 − y^(i)) log( 1 − g(⟨x^(i), β⟩) )
For a new x, the logistic-regression prediction is y_LR := 1 if g(⟨x, β_LR⟩) ≥ 1/2 and 0 otherwise; g(⟨x, β_LR⟩) can be interpreted as the probability that the label is 1
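A sketch of computing β_LR by gradient ascent on the log-likelihood (the slides do not fix an optimization method; gradient ascent is one simple choice). Data are synthetic, with β_0 = 0 for simplicity:

```python
import numpy as np

# The gradient of the log-likelihood above is X^T (y - g(X beta)).
rng = np.random.default_rng(9)
n, p = 200, 2
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0])           # illustrative coefficients
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(X @ beta_true)))).astype(float)

beta = np.zeros(p)
for _ in range(2000):
    g = 1 / (1 + np.exp(-(X @ beta)))       # g(<x_i, beta>) for every example
    beta += 0.5 * X.T @ (y - g) / n         # ascent step on the log-likelihood

accuracy = (((1 / (1 + np.exp(-(X @ beta)))) >= 0.5) == (y == 1)).mean()
print(accuracy)                             # training accuracy
```

The concavity of the log-likelihood guarantees that gradient ascent with a small enough step converges to the maximizer.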
143 Iris data set Aim: classify flowers using sepal width and length. Two species, 5 examples each. Iris setosa (label 0): sepal lengths 5.4, 4.3, 4.8, 5.1 and 5.7, and sepal widths 3.7, 3, 3.1, 3.8 and 3.8. Iris versicolor (label 1): sepal lengths 6.5, 5.7, 7, 6.3 and 6.1, and sepal widths 2.8, 2.8, 3.2, 2.3 and 2.8. Two new examples: (5.1, 3.5) and (5, 2)
144 Iris data set After centering and normalizing, β_LR = [32.1, 29.6]^T and β_0 = 2.06. [Table: for each of the two new examples, x[1], x[2], ⟨x, β_LR⟩ + β_0 and g(⟨x, β_LR⟩ + β_0); numeric values omitted in this transcription]
145 Iris data set [Figure: sepal width vs sepal length for the setosa and versicolor examples, with the two new unlabeled examples marked ???]
146 Iris data set [Figure: sepal width vs petal length for virginica and versicolor]
147 Digit classification MNIST data. Aim: distinguish one digit from another. x^(i) is an image of a 6 or a 9; y^(i) = 1 if image i is a 6 and y^(i) = 0 if it is a 9. 2000 training examples and 2000 test examples, each half 6s and half 9s. Training error rate: 0.0. Test error rate: 0.006
148 Digit classification: β_LR [Figure: the fitted coefficients displayed as an image]
149-152 Digit classification [Figures: true positives, true negatives, false positives and false negatives, each showing β^T x, the estimated probability of a 6, and the image]
153 Digit Classification This is a toy problem: distinguishing one digit from another is very easy. It is harder to classify any given digit. We used it to give insight into how logistic regression works. It turns out that, on this simplified problem, a very simple choice of β gives good results. Can you guess it?
154 Digit Classification Average of the 6s minus average of the 9s. Training error: 0.005. Test error: (value omitted in this transcription)
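The template classifier above can be sketched on synthetic stand-ins for the digit images (the dimension and class means below are made up, not MNIST): score each image against w = (average of the 6s) − (average of the 9s) and threshold.

```python
import numpy as np

# Mean-difference template classifier on two synthetic classes.
rng = np.random.default_rng(10)
d = 64                                          # hypothetical image dimension
mu6, mu9 = rng.standard_normal(d), rng.standard_normal(d)
X6 = mu6 + 0.5 * rng.standard_normal((100, d))  # "6" examples
X9 = mu9 + 0.5 * rng.standard_normal((100, d))  # "9" examples

w = X6.mean(axis=0) - X9.mean(axis=0)           # average of 6s minus average of 9s
threshold = w @ (X6.mean(axis=0) + X9.mean(axis=0)) / 2

scores = np.concatenate([X6 @ w, X9 @ w])
labels = np.concatenate([np.ones(100), np.zeros(100)])
pred = (scores > threshold).astype(float)
print((pred == labels).mean())                  # training accuracy of the template
```

With well-separated class means this simple linear rule is already accurate, which is why the template works so well on the 6-vs-9 problem.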
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationPrincipal Component Analysis (PCA) CSC411/2515 Tutorial
Principal Component Analysis (PCA) CSC411/2515 Tutorial Harris Chan Based on previous tutorial slides by Wenjie Luo, Ladislav Rampasek University of Toronto hchan@cs.toronto.edu October 19th, 2017 (UofT)
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods
More informationMultiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar
Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36 Previous two lectures Linear and logistic
More informationSCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.
SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationCSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression
CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
More informationProbabilistic Machine Learning. Industrial AI Lab.
Probabilistic Machine Learning Industrial AI Lab. Probabilistic Linear Regression Outline Probabilistic Classification Probabilistic Clustering Probabilistic Dimension Reduction 2 Probabilistic Linear
More informationLinear Regression. Aarti Singh. Machine Learning / Sept 27, 2010
Linear Regression Aarti Singh Machine Learning 10-701/15-781 Sept 27, 2010 Discrete to Continuous Labels Classification Sports Science News Anemic cell Healthy cell Regression X = Document Y = Topic X
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationCOMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017
COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationGI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil
GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes Matt Gormley Lecture 19 March 20, 2018 1 Midterm Exam Reminders
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write
More informationModeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop
Modeling Data with Linear Combinations of Basis Functions Read Chapter 3 in the text by Bishop A Type of Supervised Learning Problem We want to model data (x 1, t 1 ),..., (x N, t N ), where x i is a vector
More informationBayesian Linear Regression [DRAFT - In Progress]
Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationApproximate Principal Components Analysis of Large Data Sets
Approximate Principal Components Analysis of Large Data Sets Daniel J. McDonald Department of Statistics Indiana University mypage.iu.edu/ dajmcdon April 27, 2016 Approximation-Regularization for Analysis
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationData Mining and Analysis: Fundamental Concepts and Algorithms
Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA Department
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationThe prediction of house price
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationRegression, Ridge Regression, Lasso
Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood
More informationSupervised Learning: Linear Methods (1/2) Applied Multivariate Statistics Spring 2012
Supervised Learning: Linear Methods (1/2) Applied Multivariate Statistics Spring 2012 Overview Review: Conditional Probability LDA / QDA: Theory Fisher s Discriminant Analysis LDA: Example Quality control:
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones
Introduction to machine learning and pattern recognition Lecture 2 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationPRINCIPAL COMPONENTS ANALYSIS
121 CHAPTER 11 PRINCIPAL COMPONENTS ANALYSIS We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA involves
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationECE521 lecture 4: 19 January Optimization, MLE, regularization
ECE521 lecture 4: 19 January 2017 Optimization, MLE, regularization First four lectures Lectures 1 and 2: Intro to ML Probability review Types of loss functions and algorithms Lecture 3: KNN Convexity
More informationLecture 6: Methods for high-dimensional problems
Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationIntroduction to Machine Learning. Introduction to ML - TAU 2016/7 1
Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationMLE/MAP + Naïve Bayes
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationOutline lecture 2 2(30)
Outline lecture 2 2(3), Lecture 2 Linear Regression it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic Control
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationBayesian Classification Methods
Bayesian Classification Methods Suchit Mehrotra North Carolina State University smehrot@ncsu.edu October 24, 2014 Suchit Mehrotra (NCSU) Bayesian Classification October 24, 2014 1 / 33 How do you define
More informationMachine Learning: Assignment 1
10-701 Machine Learning: Assignment 1 Due on Februrary 0, 014 at 1 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10
COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem
More informationMachine Learning (CSE 446): Probabilistic Machine Learning
Machine Learning (CSE 446): Probabilistic Machine Learning oah Smith c 2017 University of Washington nasmith@cs.washington.edu ovember 1, 2017 1 / 24 Understanding MLE y 1 MLE π^ You can think of MLE as
More informationMATH 221, Spring Homework 10 Solutions
MATH 22, Spring 28 - Homework Solutions Due Tuesday, May Section 52 Page 279, Problem 2: 4 λ A λi = and the characteristic polynomial is det(a λi) = ( 4 λ)( λ) ( )(6) = λ 6 λ 2 +λ+2 The solutions to the
More informationToday. Calculus. Linear Regression. Lagrange Multipliers
Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationMidterm Exam, Spring 2005
10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name
More information20 Unsupervised Learning and Principal Components Analysis (PCA)
116 Jonathan Richard Shewchuk 20 Unsupervised Learning and Principal Components Analysis (PCA) UNSUPERVISED LEARNING We have sample points, but no labels! No classes, no y-values, nothing to predict. Goal:
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationReminders. Thought questions should be submitted on eclass. Please list the section related to the thought question
Linear regression Reminders Thought questions should be submitted on eclass Please list the section related to the thought question If it is a more general, open-ended question not exactly related to a
More information