Lecture 2 October 11
Introduction to probabilistic graphical models 2013/2014
Lecturer: Guillaume Obozinski    Scribes: Aymeric Reshef, Claire Vernade
Course webpage:

2.1 Single node models (last part)

The previous course introduced the notion of Maximum Likelihood Estimator (MLE). Basic examples on the Bernoulli model, the multinomial model and the Gaussian model were made explicit, and side notes detailed the use of Lagrange multipliers and of differentials. The last example used the multivariate Gaussian model. We recall it briefly in the next subsection.

2.1.1 The Multivariate Gaussian model

Let X be a random variable taking values in R^d, let µ ∈ R^d and let Σ ∈ R^{d×d} be a positive definite matrix. X follows a multivariate Gaussian model (denoted by X ∼ N(µ, Σ)) if

    p_{µ,Σ}(x) = 1 / ( (2π)^{d/2} √(det Σ) ) · exp( −½ (x − µ)^⊤ Σ^{−1} (x − µ) ).

Let X_1, …, X_n ∼ N(µ, Σ), iid. Then, the negative log-likelihood of the joint distribution is

    ℓ(µ, Σ) = − ∑_{i=1}^n log p_{µ,Σ}(x_i) = (nd/2) log(2π) + (n/2) log(det Σ) + ½ ∑_{i=1}^n (x_i − µ)^⊤ Σ^{−1} (x_i − µ).

Its gradient with respect to µ is given by

    ∇_µ ℓ(µ, Σ) = −Σ^{−1} ∑_{i=1}^n (x_i − µ) = −Σ^{−1} ( ∑_{i=1}^n x_i − nµ ) = −n Σ^{−1} (x̄ − µ),

which leads to µ̂ = x̄, the empirical mean. In order to compute the gradient with respect to Σ, we first write A = Σ^{−1}, so that

    ℓ(µ, A) = (nd/2) log(2π) − (n/2) log(det A) + ½ ∑_{i=1}^n (x_i − µ)^⊤ A (x_i − µ)
            = (nd/2) log(2π) − (n/2) log(det A) + (n/2) Tr(A Σ̃),
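The closed-form estimators µ̂ and Σ̂ derived here can be checked numerically. The following sketch (not part of the original notes; data and seed are arbitrary) computes the Gaussian MLE with numpy and compares the covariance against numpy's biased estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=np.diag([1.0, 2.0, 0.5]), size=n)

# MLE of the mean: the empirical mean x-bar.
mu_hat = X.mean(axis=0)

# MLE of the covariance: the empirical covariance with a 1/n factor
# (not the unbiased 1/(n-1) estimator).
centered = X - mu_hat
Sigma_hat = centered.T @ centered / n

# Sanity check against numpy's biased covariance estimator.
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```

Note the 1/n factor: the MLE is the biased empirical covariance, which is why `bias=True` is needed in the check.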
where we introduced the empirical covariance matrix Σ̃ defined as

    Σ̃ = (1/n) ∑_{i=1}^n (x_i − µ)(x_i − µ)^⊤.

The matrix A appears in the expression of the log-likelihood in two terms: −(n/2) log det A and (n/2) Tr(A Σ̃). Denote by f(A) = Tr(A Σ̃). Then f(A + H) − f(A) = Tr(H Σ̃), which leads to ∇f(A) = Σ̃. Now, write log det A as

    log det(A + H) = log det( A^{1/2} ( I + A^{−1/2} H A^{−1/2} ) A^{1/2} ) = log det A + log det(I + H̃),

where A^{1/2} stands for the square root matrix of A (it exists, since A is positive definite) and H̃ = A^{−1/2} H A^{−1/2}. Let us see what log det(I + H̃) looks like. Noting that log det I = 0, and denoting by (λ_1, …, λ_d) the eigenvalues of H̃, we have that

    log det(I + H̃) = log det(I + H̃) − log det I = ∑_{j=1}^d log(1 + λ_j) = ∑_{j=1}^d λ_j + o(‖H̃‖).

But then,

    ∑_{j=1}^d λ_j = Tr(H̃) = Tr(A^{−1/2} H A^{−1/2}) = Tr(H A^{−1}).

We conclude that ∇_A log det A = A^{−1}. Plugging these results into the gradient of the log-likelihood with respect to A, we have

    ∇_A ℓ(A) = −(n/2) A^{−1} + (n/2) Σ̃.

The optimality condition ∇_A ℓ(A) = 0 leads to A^{−1} = Σ̃, which means that

    Σ̂ = (1/n) ∑_{i=1}^n (x_i − µ̂)(x_i − µ̂)^⊤

is the empirical covariance matrix. Note that we assumed that A was invertible, which is an implicit condition when writing log det A. This implies that in a rigorous sense the maximum likelihood estimator is undefined when Σ̃ is not invertible. In practice, the MLE is extended by continuity to the rank-deficient case.

2.2 Models with two nodes

In this section, we work with two nodes: one node corresponds to an input X, and one node corresponds to an output Y. Recall that when dealing with two random variables X and Y, one can use a generative model, i.e. one which models the joint distribution p(X, Y), or one can instead use a conditional model (often considered equivalent to the slightly different concept of discriminative model), which models the conditional probability of the output given the input, p(Y | X). The two following models, linear regression and logistic regression, are conditional models.
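The two matrix gradients used above, ∇_A Tr(A Σ̃) = Σ̃ and ∇_A log det A = A^{−1} (for symmetric A), can be verified with finite differences. A quick numerical check (not part of the original notes; the matrices are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)   # symmetric positive definite, well conditioned
S = rng.standard_normal((d, d))
S = (S + S.T) / 2             # symmetric, plays the role of Sigma-tilde

eps = 1e-6
grad_logdet = np.zeros((d, d))
grad_trace = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H = np.zeros((d, d))
        H[i, j] = eps
        # forward difference of log det at A in direction e_i e_j^T
        grad_logdet[i, j] = (np.linalg.slogdet(A + H)[1]
                             - np.linalg.slogdet(A)[1]) / eps
        # Tr(A S) is linear in A, so this difference is exact up to rounding
        grad_trace[i, j] = (np.trace((A + H) @ S) - np.trace(A @ S)) / eps

# For a general matrix the gradient of log det is A^{-T}; A is symmetric here.
assert np.allclose(grad_logdet, np.linalg.inv(A), atol=1e-4)
assert np.allclose(grad_trace, S, atol=1e-4)
```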
2.2.1 Linear regression

Let us assume that Y ∈ R depends linearly on X ∈ R^p. Let w ∈ R^p be a weighting vector and σ² > 0. We make the following assumption:

    Y | X ∼ N(w^⊤ X, σ²),

which can be rewritten as

    Y = w^⊤ X + ɛ, with ɛ ∼ N(0, σ²).

Note that if there is an offset w_0 ∈ R, that is, if Y = w^⊤ X + w_0 + ɛ, one can always redefine a weighting vector w̃ ∈ R^{p+1} such that

    Y = w̃^⊤ (x, 1)^⊤ + ɛ.

Let D = {(x_1, y_1), …, (x_n, y_n)} be a training set of i.i.d. random variables. Each y_i is a label (a decision) on observation x_i. We consider the conditional distribution of all outputs given all inputs, which is a product of terms because of the independence of the pairs forming the training set:

    p(y_1, …, y_n | x_1, …, x_n; w, σ²) = ∏_{i=1}^n p(y_i | x_i; w, σ²).

The associated negative log-likelihood has the following expression:

    ℓ(w, σ²) = − ∑_{i=1}^n log p(y_i | x_i) = (n/2) log(2πσ²) + ½ ∑_{i=1}^n (y_i − w^⊤ x_i)² / σ².

The minimization problem with respect to w can now be reformulated as:

    find ŵ = arg min_w ½ ∑_{i=1}^n (y_i − w^⊤ x_i)².

Define the so-called design matrix X as the matrix whose rows are the observations,

    X = (x_1^⊤ ; … ; x_n^⊤) ∈ R^{n×p},

and denote by y the vector of coordinates (y_1, …, y_n). The minimization problem over w can be rewritten in a more compact way as:

    find ŵ = arg min_w ½ ‖y − Xw‖².

Let f : w ↦ ½ ‖y − Xw‖² = ½ ( y^⊤ y − 2 w^⊤ X^⊤ y + w^⊤ X^⊤ X w ). f is strictly convex if and only if its Hessian matrix X^⊤X is invertible. This is never the case when n < p (in this case, we deal with underdetermined problems). Most of the time, the Hessian matrix is invertible when n ≥ p. When this is not the case, we often use the Tikhonov regularization, which adds
a penalization of the ℓ₂-norm of w by minimizing f(w) + λ‖w‖² with some hyperparameter λ > 0.

The gradient of f is ∇f(w) = X^⊤(Xw − y), and

    ∇f(w) = 0 ⟺ X^⊤ X w = X^⊤ y.

The equation X^⊤Xw = X^⊤y is known as the normal equation. If X^⊤X is invertible, then the optimal weighting vector is

    ŵ = (X^⊤X)^{−1} X^⊤ y = X^† y,

where X^† = (X^⊤X)^{−1} X^⊤ is the Moore-Penrose pseudo-inverse of X. If X^⊤X is not invertible, the solution is not unique anymore, and for any h ∈ ker(X), ŵ = X^† y + h is an admissible solution. In that case however it would be necessary to use regularization. The computational cost to evaluate the optimal weighting vector from X and y is O(p³) once X^⊤X has been formed (use a Cholesky decomposition of the matrix X^⊤X and solve two triangular systems).

Now, let us differentiate ℓ(w, σ²) with respect to σ²: we have

    ∂ℓ/∂σ² (w, σ²) = n/(2σ²) − 1/(2σ⁴) ∑_{i=1}^n (y_i − w^⊤ x_i)².

Setting ∂ℓ/∂σ² (w, σ²) to zero gives

    σ̂² = (1/n) ∑_{i=1}^n (y_i − w^⊤ x_i)².

In practice, whenever we use a data matrix X in machine learning, we first preprocess it so that it is not too badly conditioned, in order to avoid numerical issues. Two main operations are applied columnwise: first, a centering (remove the mean of the coefficients), and then a normalization (divide the coefficients of a column by the standard deviation of the column vector). Note that this preprocessing *does not guarantee* that the matrix we obtain is well-conditioned: in particular, it can be low rank.

2.2.2 Logistic regression

Let X ∈ R^p, Y ∈ {0, 1}. We assume that Y follows a Bernoulli distribution with parameter θ. The problem is to find θ. Let us define the sigmoid function σ, defined on the real axis and taking values in [0, 1], such that

    ∀z ∈ R, σ(z) = 1 / (1 + e^{−z}).

The sigmoid function is plotted in Figure 2.1. One can easily prove that

    ∀z ∈ R, σ(−z) = 1 − σ(z),
    ∀z ∈ R, σ′(z) = σ(z)(1 − σ(z)) = σ(z) σ(−z).
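The normal-equation solve and the noise-variance MLE for linear regression can be sketched as follows (illustrative code with synthetic data, not from the notes; note that `np.linalg.solve` does not exploit the triangular structure of the Cholesky factors, a dedicated triangular solver would):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.standard_normal((n, p))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Normal equation X^T X w = X^T y, solved via a Cholesky factorization
# X^T X = L L^T followed by two triangular solves (O(p^3) once X^T X is formed).
G = X.T @ X
L = np.linalg.cholesky(G)
z = np.linalg.solve(L, X.T @ y)    # forward solve:  L z = X^T y
w_hat = np.linalg.solve(L.T, z)    # backward solve: L^T w = z

# MLE of the noise variance: the mean squared residual.
sigma2_hat = np.mean((y - X @ w_hat) ** 2)

# Agrees with numpy's dedicated least-squares solver.
assert np.allclose(w_hat, np.linalg.lstsq(X, y, rcond=None)[0])
```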
[Figure 2.1: the sigmoid function σ(x).]

We now assume that, for a given observation X = x, the output Y | X = x follows a Bernoulli law with parameter θ = σ(w^⊤x), where w is again a weighting vector. In practice, we can still add an offset: w^⊤x + w_0. Then, the conditional distribution is given by

    p(Y = y | X = x) = θ^y (1 − θ)^{1−y} = σ(w^⊤x)^y σ(−w^⊤x)^{1−y}.

Given a training set D = {(x_1, y_1), …, (x_n, y_n)} of iid random variables, we can compute the log-likelihood

    ℓ(w) = ∑_{i=1}^n [ y_i log σ(w^⊤x_i) + (1 − y_i) log σ(−w^⊤x_i) ].

In order to maximize the log-likelihood (equivalently, to minimize −ℓ, which is convex since z ↦ log(1 + e^{−z}) is a convex function and w ↦ w^⊤x_i is linear), we calculate its gradient. We write η_i = σ(w^⊤x_i):

    ∇_w ℓ(w) = ∑_{i=1}^n [ y_i x_i σ(w^⊤x_i) σ(−w^⊤x_i) / σ(w^⊤x_i) − (1 − y_i) x_i σ(w^⊤x_i) σ(−w^⊤x_i) / σ(−w^⊤x_i) ]
             = ∑_{i=1}^n x_i (y_i − η_i).

Thus, ∇_w ℓ(w) = 0 ⟺ ∑_{i=1}^n x_i (y_i − σ(w^⊤x_i)) = 0. This equation is nonlinear and we need an iterative optimization method to solve it. For this purpose, we derive the Hessian matrix of ℓ:

    Hℓ(w) = − ∑_{i=1}^n σ′(w^⊤x_i) x_i x_i^⊤ = − ∑_{i=1}^n η_i (1 − η_i) x_i x_i^⊤ = − X^⊤ Diag(η_i (1 − η_i)) X,

where X is the design matrix defined previously. In the following we discuss first- and second-order optimization methods and apply them to logistic regression.
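The gradient formula ∇_w ℓ(w) = ∑_i x_i (y_i − η_i) can be checked against a finite-difference approximation of the log-likelihood. A small sketch (synthetic data, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = (rng.random(n) < 0.5).astype(float)
w = rng.standard_normal(p)

def loglik(w):
    z = X @ w
    # l(w) = sum_i y_i log sigma(w.x_i) + (1 - y_i) log sigma(-w.x_i)
    return np.sum(y * np.log(sigmoid(z)) + (1 - y) * np.log(sigmoid(-z)))

eta = sigmoid(X @ w)
grad = X.T @ (y - eta)                      # sum_i x_i (y_i - eta_i)
hess = -X.T @ np.diag(eta * (1 - eta)) @ X  # -X^T Diag(eta_i (1 - eta_i)) X

# Central-difference check of the gradient formula, coordinate by coordinate.
eps = 1e-6
fd = np.array([(loglik(w + eps * e) - loglik(w - eps * e)) / (2 * eps)
               for e in np.eye(p)])
assert np.allclose(grad, fd, atol=1e-4)
```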
First-order methods

Let f : R^p → R be the convex C¹ function that we want to minimize. A descent direction at point x is a vector d such that ⟨d, ∇f(x)⟩ < 0. The minimization of f can be done by applying a descent algorithm, which iteratively takes a step in a descent direction, leading to an iterative scheme of the form

    x^{(k+1)} = x^{(k)} + ε^{(k)} d^{(k)},

where ε^{(k)} is the stepsize. The direction d^{(k)} is often chosen as the opposite of the gradient of f at point x^{(k)}: d^{(k)} = −∇f(x^{(k)}). There are several choices for ε^{(k)}:

1. Constant step: ε^{(k)} = ε. But the scheme does not necessarily converge.
2. Decreasing step size: ε^{(k)} ∝ 1/k, with ∑_k ε^{(k)} = ∞ and ∑_k (ε^{(k)})² < ∞. The scheme is guaranteed to converge.
3. One can determine ε^{(k)} by doing a line search, which tries to find min_ε f(x^{(k)} + ε d^{(k)}): either exactly (but this is costly and rather useless in many situations), or approximately (see the Armijo line search). This is a better method.

Second-order methods

This time, let f : R^p → R be the C² function that we want to minimize. We write the second-order Taylor expansion of f:

    f(x) = f(x^t) + (x − x^t)^⊤ ∇f(x^t) + ½ (x − x^t)^⊤ Hf(x^t) (x − x^t) + o(‖x − x^t‖²) =: g_t(x) + o(‖x − x^t‖²).

A local optimum x* is then reached when

    ∇f(x*) = 0 and Hf(x*) ⪰ 0.

In order to solve such a problem, we are going to use Newton's method. If f is a convex function, then ∇g_t(x) = ∇f(x^t) + Hf(x^t)(x − x^t) and we only need to find x so that ∇g_t(x) = 0, i.e. we set

    x^{t+1} = x^t − [Hf(x^t)]^{−1} ∇f(x^t).

If the Hessian matrix is not invertible, we can regularize the problem and minimize g_t(x) + λ‖x − x^t‖² instead. In general the previous update, called the pure Newton step, does not lead to a convergent algorithm, even if the function is convex! In general it is necessary to use the so-called damped Newton method to obtain a convergent algorithm, which consists in doing the following iterations:

    x^{t+1} = x^t − ε_t [Hf(x^t)]^{−1} ∇f(x^t),

where ε_t is set with the Armijo line search.
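The descent scheme with an approximate (Armijo backtracking) line search can be sketched as follows. This is an illustrative gradient-descent implementation on a small convex quadratic, not from the notes; the constants `beta` and `c` are conventional choices:

```python
import numpy as np

def armijo_step(f, grad_f, x, d, eps0=1.0, beta=0.5, c=0.1):
    """Backtracking (Armijo) line search: shrink the step until the
    sufficient-decrease condition f(x + eps d) <= f(x) + c eps <grad f(x), d>
    holds. d must be a descent direction, so <grad f(x), d> < 0."""
    eps = eps0
    fx, g = f(x), grad_f(x)
    while f(x + eps * d) > fx + c * eps * (g @ d):
        eps *= beta
    return eps

# Minimize the strictly convex quadratic f(x) = 1/2 x^T A x - b^T x
# by gradient descent with Armijo line search; the minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b

x = np.zeros(2)
for _ in range(200):
    d = -grad_f(x)          # steepest-descent direction: <d, grad f(x)> < 0
    if np.linalg.norm(d) < 1e-10:
        break
    x = x + armijo_step(f, grad_f, x, d) * d

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-5)
```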
This method may be computationally costly in high dimension because of the inverse of the Hessian matrix that needs to be computed at each iteration. For some functions, however, the pure Newton method does converge. This is the case for logistic regression. In the context of non-convex optimization, the situation is more complicated because the Hessian can have negative eigenvalues. In that case, so-called trust-region methods are typically used.

Application to logistic regression

We will write the form that Newton's algorithm takes for logistic regression. We had:

    ℓ(w) = ∑_{i=1}^n [ y_i log σ(w^⊤x_i) + (1 − y_i) log σ(−w^⊤x_i) ],
    ∇_w ℓ(w) = ∑_{i=1}^n x_i (y_i − η_i) = X^⊤ (y − η),
    Hℓ(w) = − X^⊤ Diag(η_i (1 − η_i)) X.

The second-order Taylor expansion of the loss function is

    ℓ(w) ≈ ℓ(w^t) + (w − w^t)^⊤ ∇ℓ(w^t) + ½ (w − w^t)^⊤ Hℓ(w^t) (w − w^t).

Let us set h = w − w^t. Maximizing this quadratic model of ℓ amounts to the minimization problem

    min_h { − h^⊤ X^⊤ (y − η) + ½ h^⊤ X^⊤ Diag(η(1 − η)) X h }.

This leads, according to the previous part, to set

    w^{t+1} = w^t − Hℓ(w^t)^{−1} ∇_w ℓ(w^t) = w^t + ( X^⊤ Diag(η(1 − η)) X )^{−1} X^⊤ (y − η).

The minimization problem above can be seen as a weighted linear regression over h, of the form

    min_h ∑_{i=1}^n (ỹ_i − x_i^⊤ h)² / σ̃_i²,    where ỹ_i = σ̃_i² (y_i − η_i) and σ̃_i² = [η_i (1 − η_i)]^{−1}.

Thus, this method is often referred to as the iteratively reweighted least squares (IRLS) algorithm.

We may run into a classification problem with more than two classes: Y ∈ {1, …, K}, with Y ∼ M(1, π_1(x), …, π_K(x)). We will need to define a rule over the classifiers (softmax function, one-versus-all, etc.) in order to make a decision.

2.2.3 Generative models

This section briefly presents the Fisher linear discriminant, also known as linear discriminant analysis. Suppose that we have X ∈ R^p and Y ∈ {0, 1}. By Bayes' rule,

    P(Y = 1 | X = x) = P(X = x | Y = 1) P(Y = 1) / [ P(X = x | Y = 1) P(Y = 1) + P(X = x | Y = 0) P(Y = 0) ].

The assumption then consists in considering P(X = x | Y = 1) = N(x; µ_1, Σ_1) and P(X = x | Y = 0) = N(x; µ_0, Σ_0). Fisher's assumption is the assumption that Σ_1 = Σ_0 = Σ.
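The Newton/IRLS update for logistic regression can be sketched compactly on synthetic data (illustrative code, not from the notes; the dataset and number of iterations are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(X, y, n_iter=20):
    """Newton / IRLS updates for logistic regression:
    w <- w + (X^T D X)^{-1} X^T (y - eta), with D = Diag(eta_i (1 - eta_i))."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        eta = sigmoid(X @ w)
        D = eta * (1 - eta)
        # Solve the linear system instead of forming the inverse explicitly.
        w = w + np.linalg.solve(X.T @ (D[:, None] * X), X.T @ (y - eta))
    return w

rng = np.random.default_rng(4)
n, p = 500, 3
X = rng.standard_normal((n, p))
w_true = np.array([1.5, -1.0, 0.5])
y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)

w_hat = irls_logistic(X, y)
# At the optimum, the score equations sum_i x_i (y_i - eta_i) = 0 hold.
eta = sigmoid(X @ w_hat)
assert np.max(np.abs(X.T @ (y - eta))) < 1e-6
```

On non-separable data like this, the pure Newton iteration converges rapidly, consistent with the remark above; on linearly separable data the MLE does not exist and the iterates diverge, so a regularizer or step-size control would be needed.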
2.3 Unsupervised classification

Unsupervised learning consists in finding a label prediction function based on unlabeled training data only. In the case where the learning problem is a classification problem, and under the assumption that the classes form clusters in input space, the problem reduces to a clustering problem, which consists in finding groups of points that form dense clusters. When the clusters are assumed to be isotropic, the formulation of the K-means algorithm is appropriate.

The K-means algorithm

We start from a set of data points (x_1, …, x_n) (where x_i ∈ R^p) that are unlabeled. We wish to divide this set into K clusters defined by their centroids (µ_1, …, µ_K). The problem can be formulated as:

    min_{µ_1, …, µ_K} ∑_{i=1}^n min_k ‖x_i − µ_k‖².

The minimization step inside the summation leads to a nonconvex problem. The K-means algorithm is a greedy algorithm which consists in iteratively applying two steps:

    C_k ← { i : ‖x_i − µ_k‖² = min_j ‖x_i − µ_j‖² },
    µ_k ← (1/|C_k|) ∑_{i ∈ C_k} x_i.

The first step defines the clusters C_k by assigning each data point to its closest centroid. The second step then updates the centroids given the new clusters. Two remarks:

- It can be shown that K-means converges in a finite number of steps.
- The algorithm however typically gets stuck in local minima, and in practice it is necessary to try several restarts of the algorithm with random initializations to have a chance of obtaining a better solution.
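The two alternating steps above translate directly into code. A minimal sketch on synthetic isotropic clusters (not part of the notes; initialization by sampling K data points is one common choice among several):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate assigning each point to its closest centroid
    and updating each centroid as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distances to every centroid, then argmin.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of each cluster (empty clusters keep their centroid).
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):
            break  # fixed point reached: K-means converges in finitely many steps
        centroids = new
    return centroids, labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc, 0.2, size=(50, 2))
               for loc in ([0, 0], [3, 3], [0, 3])])
centroids, labels = kmeans(X, K=3)
assert centroids.shape == (3, 2)
```

A single run like this can land in a local minimum (for instance when two initial centroids fall in the same cluster), which is exactly why the notes recommend several random restarts, keeping the run with the lowest objective.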
THE KALMAN FILTER RAUL ROJAS Abstract. This paper provides a getle itroductio to the Kalma filter, a umerical method that ca be used for sesor fusio or for calculatio of trajectories. First, we cosider
More informationOn Nonsingularity of Saddle Point Matrices. with Vectors of Ones
Iteratioal Joural of Algebra, Vol. 2, 2008, o. 4, 197-204 O Nosigularity of Saddle Poit Matrices with Vectors of Oes Tadeusz Ostrowski Istitute of Maagemet The State Vocatioal Uiversity -400 Gorzów, Polad
More informationFeedback in Iterative Algorithms
Feedback i Iterative Algorithms Charles Byre (Charles Byre@uml.edu), Departmet of Mathematical Scieces, Uiversity of Massachusetts Lowell, Lowell, MA 01854 October 17, 2005 Abstract Whe the oegative system
More informationLecture 20: Multivariate convergence and the Central Limit Theorem
Lecture 20: Multivariate covergece ad the Cetral Limit Theorem Covergece i distributio for radom vectors Let Z,Z 1,Z 2,... be radom vectors o R k. If the cdf of Z is cotiuous, the we ca defie covergece
More information(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3
MATH 337 Sequeces Dr. Neal, WKU Let X be a metric space with distace fuctio d. We shall defie the geeral cocept of sequece ad limit i a metric space, the apply the results i particular to some special
More informationBayesian Methods: Introduction to Multi-parameter Models
Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested
More informationRegularization methods for large scale machine learning
Regularizatio methods for large scale machie learig Lorezo Rosasco March 7, 2017 Abstract After recallig a iverse problems perspective o supervised learig, we discuss regularizatio methods for large scale
More informationDefinitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.
Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,
More informationb i u x i U a i j u x i u x j
M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationThe log-behavior of n p(n) and n p(n)/n
Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity
More informationTable 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab
Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet
More informationSequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence
Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet
More informationSection 14. Simple linear regression.
Sectio 14 Simple liear regressio. Let us look at the cigarette dataset from [1] (available to dowload from joural s website) ad []. The cigarette dataset cotais measuremets of tar, icotie, weight ad carbo
More informationSingular Continuous Measures by Michael Pejic 5/14/10
Sigular Cotiuous Measures by Michael Peic 5/4/0 Prelimiaries Give a set X, a σ-algebra o X is a collectio of subsets of X that cotais X ad ad is closed uder complemetatio ad coutable uios hece, coutable
More informationEfficient GMM LECTURE 12 GMM II
DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet
More informationTR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT
TR/46 OCTOBER 974 THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION by A. TALBOT .. Itroductio. A problem i approximatio theory o which I have recetly worked [] required for its solutio a proof that the
More informationSummary. Recap ... Last Lecture. Summary. Theorem
Last Lecture Biostatistics 602 - Statistical Iferece Lecture 23 Hyu Mi Kag April 11th, 2013 What is p-value? What is the advatage of p-value compared to hypothesis testig procedure with size α? How ca
More informationLecture 10: Universal coding and prediction
0-704: Iformatio Processig ad Learig Sprig 0 Lecture 0: Uiversal codig ad predictio Lecturer: Aarti Sigh Scribes: Georg M. Goerg Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved
More informationMixtures of Gaussians and the EM Algorithm
Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity
More informationA Risk Comparison of Ordinary Least Squares vs Ridge Regression
Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer
More informationRademacher Complexity
EECS 598: Statistical Learig Theory, Witer 204 Topic 0 Rademacher Complexity Lecturer: Clayto Scott Scribe: Ya Deg, Kevi Moo Disclaimer: These otes have ot bee subjected to the usual scrutiy reserved for
More informationChapter 3. Strong convergence. 3.1 Definition of almost sure convergence
Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i
More information