Machine Learning for NLP

1 Machine Learning for NLP
Uppsala University
Department of Linguistics and Philology
Slides borrowed from Ryan McDonald, Google Research

2 Introduction: Linear Classifiers
Classifiers covered so far: decision trees, nearest neighbor
Next two lectures: linear classifiers
Statistics from Google Scholar (October 2009):
  "Maximum Entropy & NLP": 2660 hits, 141 before 2000
  "SVM & NLP": 2210 hits, 16 before 2000
  "Perceptron & NLP": 947 hits, 118 before 2000
All are linear classifiers that have become important tools in any NLP/CL researcher's toolbox over the past 10 years

3 Introduction: Outline
Today:
  Preliminaries: input/output, features, etc.
  Linear classifiers
  Perceptron
  Large-margin classifiers (SVMs, MIRA)
  Logistic regression (Maximum Entropy)
Next time:
  Structured prediction with linear classifiers
  Structured perceptron
  Structured large-margin classifiers (SVMs, MIRA)
  Conditional random fields
  Case study: dependency parsing

4 Preliminaries: Inputs and Outputs
Input: x ∈ X
  e.g., a document or sentence with some words x = w_1 ... w_n, or a series of previous actions
Output: y ∈ Y
  e.g., a parse tree, document class, part-of-speech tags, word sense
Input/output pair: (x, y) ∈ X × Y
  e.g., a document x and its label y
Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x

5 Preliminaries: Feature Representations
We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector
  f(x, y) : X × Y → R^m
In some cases, e.g., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
  f(x) : X → R^m
However, most problems in NLP require more than two classes, so we focus on the multi-class case
For any vector v ∈ R^m, let v_j be the j-th value

6 Preliminaries: Features and Classes
All features must be numerical
  Numerical features are represented directly as f_i(x, y) ∈ R
  Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
  Multinomial (categorical) features must be binarized
    Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
    We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
    Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
We need distinct features for distinct output classes
  Instead of: f_i(x) (1 ≤ i ≤ m)
  We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j

7 Preliminaries: Examples
x is a document and y is a label
  f_j(x, y) = 1 if x contains the word "interest" and y = "financial", 0 otherwise
  f_j(x, y) = % of words in x with punctuation, if y = "scientific" (0 otherwise)
x is a word and y is a part-of-speech tag
  f_j(x, y) = 1 if x = "bank" and y = Verb, 0 otherwise

8 Preliminaries: Examples
x is a name, y is a label classifying the name
  f_0(x, y) = 1 if x contains "George" and y = Person, 0 otherwise
  f_1(x, y) = 1 if x contains "Washington" and y = Person, 0 otherwise
  f_2(x, y) = 1 if x contains "Bridge" and y = Person, 0 otherwise
  f_3(x, y) = 1 if x contains "General" and y = Person, 0 otherwise
  f_4(x, y) = 1 if x contains "George" and y = Object, 0 otherwise
  f_5(x, y) = 1 if x contains "Washington" and y = Object, 0 otherwise
  f_6(x, y) = 1 if x contains "Bridge" and y = Object, 0 otherwise
  f_7(x, y) = 1 if x contains "General" and y = Object, 0 otherwise
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
x = "George Washington Bridge", y = Object: f(x, y) = [0 0 0 0 1 1 1 0]
x = "George Washington George", y = Object: f(x, y) = [0 0 0 0 1 1 0 0]

9 Preliminaries: Block Feature Vectors
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
x = "George Washington Bridge", y = Object: f(x, y) = [0 0 0 0 1 1 1 0]
x = "George Washington George", y = Object: f(x, y) = [0 0 0 0 1 1 0 0]
One equal-size block of the feature vector for each label
Input features duplicated in each block
Non-zero values allowed only in one block (the block of the label y)
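
A minimal sketch of how such a block feature map could be coded; the names (block_feature_vector, INPUT_FEATS, LABELS) are illustrative and not taken from the slides:

```python
import numpy as np

INPUT_FEATS = ["George", "Washington", "Bridge", "General"]  # f_i(x): does x contain the word?
LABELS = ["Person", "Object"]                                # Y = {Person, Object}

def block_feature_vector(x, y):
    """Map an input-output pair (x, y) to f(x, y) in R^m (m = |features| * |labels|).

    Each label owns one equal-size block; the input features are copied into
    the block of label y and every other block stays zero.
    """
    phi = np.array([1.0 if w in x.split() else 0.0 for w in INPUT_FEATS])
    f = np.zeros(len(INPUT_FEATS) * len(LABELS))
    j = LABELS.index(y)
    f[j * len(INPUT_FEATS):(j + 1) * len(INPUT_FEATS)] = phi
    return f

print(block_feature_vector("General George Washington", "Person"))  # [1. 1. 0. 1. 0. 0. 0. 0.]
print(block_feature_vector("George Washington Bridge", "Object"))   # [0. 0. 0. 0. 1. 1. 1. 0.]
```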

10 Linear Classifiers
Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
Let w ∈ R^m be a high-dimensional weight vector
If we assume that w is known, then we define our classifier as
Multiclass classification: Y = {0, 1, ..., N}
  y = argmax_y w · f(x, y) = argmax_y Σ_{j=0}^{m} w_j · f_j(x, y)
Binary classification: just a special case of multiclass
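
A quick sketch of the argmax prediction rule, reusing the hypothetical block feature map from the previous snippet; the weight values here are invented purely for illustration:

```python
import numpy as np

def predict(w, x, labels):
    """Return argmax_y w . f(x, y) over the candidate labels."""
    scores = {y: float(np.dot(w, block_feature_vector(x, y))) for y in labels}
    return max(scores, key=scores.get)

w = np.array([0.5, 0.5, -2.0, 1.0, 0.5, 0.5, 2.0, -1.0])  # made-up weights
print(predict(w, "George Washington Bridge", LABELS))     # -> "Object" with these made-up weights
```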

11 Linear Classifiers: Bias Terms
Often linear classifiers are presented as
  y = argmax_y Σ_{j=0}^{m} w_j · f_j(x, y) + b_y
where b_y is a bias or offset term for label y
But this can be folded into f by adding one always-on feature per label:
  f_4(x, y) = 1 if y = Person, 0 otherwise
  f_9(x, y) = 1 if y = Object, 0 otherwise
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 1 0 0 0 0 0]
x = "General George Washington", y = Object: f(x, y) = [0 0 0 0 0 1 1 0 1 1]
w_4 and w_9 are now the bias terms for the labels

12 Binary Linear Classifier
Divides all points: (figure omitted)

13 Multiclass Linear Classifier
Defines regions of space (figure omitted):
  i.e., "+" is the set of all points (x, y) where + = argmax_y w · f(x, y)

14 Separability
A set of points is separable if there exists a w such that classification is perfect
  (figures omitted: separable vs. not separable)
This can also be defined mathematically (and we will shortly)

15 Supervised Learning: How to Find w
Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
Input: feature representation f
Output: w that maximizes/minimizes some important function on the training set
  minimize error (Perceptron, SVMs, Boosting)
  maximize likelihood of data (Logistic Regression, Naive Bayes)
Assumption: the training data is separable
  Not necessary, just makes life easier
  There is a lot of good work in machine learning to tackle the non-separable case

16 Perceptron
Choose a w that minimizes error:
  w = argmin_w Σ_t (1 − 1[y_t = argmax_y w · f(x_t, y)])
  where 1[p] = 1 if p is true, 0 otherwise
This is a 0-1 loss function
Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions

17 Perceptron Learning Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     Let y' = argmax_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)
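
A runnable sketch of the multiclass perceptron above, assuming a feature function feats(x, y) returning a NumPy array (e.g., the block features from the earlier snippet) and a list of candidate labels; the names are illustrative, not from the slides:

```python
import numpy as np

def perceptron_train(data, labels, feats, dim, epochs=10):
    """data: list of (x, y_t) pairs; returns the learned weight vector."""
    w = np.zeros(dim)                                         # 1. w = 0
    for _ in range(epochs):                                   # 2. for n : 1..N
        for x, y_t in data:                                   # 3. for t : 1..|T|
            y_pred = max(labels, key=lambda y: w @ feats(x, y))  # 4. argmax over labels
            if y_pred != y_t:                                 # 5. on a mistake...
                w += feats(x, y_t) - feats(x, y_pred)         # 6. ...update w
    return w

# Example with the name-classification features sketched earlier:
data = [("General George Washington", "Person"),
        ("George Washington Bridge", "Object")]
w = perceptron_train(data, LABELS, block_feature_vector, dim=8, epochs=5)
print(w)
```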

18 Perceptron: Separability and Margin
Given a training instance (x_t, y_t), define:
  Ȳ_t = Y − {y_t}
  i.e., Ȳ_t is the set of incorrect labels for x_t
A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that:
  u · f(x_t, y_t) − u · f(x_t, y') ≥ γ  for all y' ∈ Ȳ_t
  where ||u|| = sqrt(Σ_j u_j²) (Euclidean or L2 norm)
Assumption: the training set is separable with margin γ

19 Perceptron: Main Theorem
Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:
  number of mistakes made during training ≤ R² / γ²
  where R ≥ ||f(x_t, y_t) − f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Thus, after a finite number of training iterations, the error on the training set will converge to zero
For a proof, see Collins (2002)

20 Perceptron Summary
Learns a linear classifier that minimizes error
Guaranteed to find a w in a finite amount of time
Perceptron is an example of an online learning algorithm
  w is updated based on a single training instance in isolation:
  w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
Compare with decision trees, which perform batch learning
  All training instances are used to find the best split

21 Margin
(figures omitted: training and testing data with a separating hyperplane)
Denote the value of the margin by γ

22 Maximizing Margin
For a training set T, the margin of a weight vector w is the largest γ such that
  w · f(x_t, y_t) − w · f(x_t, y') ≥ γ
for every training instance (x_t, y_t) ∈ T and every y' ∈ Ȳ_t
(equivalently, the smallest score gap between the correct label and any incorrect label over the training set)
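
A small sketch (hypothetical helper name) that computes the margin of a given weight vector on a training set as the smallest score gap, reusing the toy data and perceptron weights from the sketches above:

```python
import numpy as np

def margin_of(w, data, labels, feats):
    """Smallest gap between the correct label's score and any incorrect label's score."""
    gaps = []
    for x, y_t in data:
        correct = w @ feats(x, y_t)
        for y in labels:
            if y != y_t:
                gaps.append(correct - w @ feats(x, y))
    return min(gaps)   # > 0 iff w separates the training set

print(margin_of(w, data, LABELS, block_feature_vector))  # positive here, so w separates the toy data
```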

23 Maximizing Margin
Intuitively, maximizing the margin makes sense
More importantly, generalization error on unseen test data is proportional to the inverse of the margin:
  ε ∝ R² / (γ² · |T|)
Perceptron: we have shown that if a training set is separable by some margin, the perceptron will find a w that separates the data
However, the perceptron does not pick w to maximize the margin!

24 Maximizing Margin
Let γ > 0
  max_{||w|| ≤ 1} γ
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ γ
  for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Note: the algorithm still minimizes error
||w|| is bounded, since scaling w trivially produces a larger margin:
  β(w · f(x_t, y_t) − w · f(x_t, y')) ≥ βγ, for any β ≥ 1

25 Max Margin = Min Norm
Let γ > 0
Max Margin:
  max_{||w|| ≤ 1} γ
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ γ
  for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
=
Min Norm:
  min_w (1/2) ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1
  for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Instead of fixing ||w|| we fix the margin γ = 1
Technically γ ∝ 1 / ||w||

26 Support Vector Machines
  min (1/2) ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1
  for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Quadratic programming problem, a well-known convex optimization problem
Can be solved with out-of-the-box algorithms
Batch learning algorithm: w is set w.r.t. all training points
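
In practice, off-the-shelf solvers are often used; a hedged sketch with scikit-learn's LinearSVC (a one-vs-rest linear SVM, which differs in detail from the joint feature-vector formulation on the slide) on the toy name data, with rows as plain input features:

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[1, 1, 0, 1],      # "General George Washington": George, Washington, Bridge, General
              [1, 1, 1, 0],      # "George Washington Bridge"
              [1, 1, 0, 0]])     # "George Washington George"
y = ["Person", "Object", "Object"]

clf = LinearSVC(C=1.0)           # C controls the slack penalty (soft margin)
clf.fit(X, y)
print(clf.predict([[0, 0, 1, 0]]))  # e.g., a name containing only "Bridge"
```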

27 Support Vector Machines
Problem: sometimes |T| is far too large
  Thus the number of constraints might make solving the quadratic programming problem very difficult
Common technique: Sequential Minimal Optimization (SMO)
  Sparse: the solution depends only on the features in the support vectors

28 Margin Infused Relaxed Algorithm (MIRA)
Another option: maximize the margin using an online algorithm
Batch vs. online:
  Batch: update parameters based on the entire training set (SVM)
  Online: update parameters based on a single training instance at a time (Perceptron)
MIRA can be thought of as a max-margin perceptron or an online SVM

29 MIRA
Batch (SVM):
  min (1/2) ||w||²
  such that: w · f(x_t, y_t) − w · f(x_t, y') ≥ 1, for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Online (MIRA):
  Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
             such that: w* · f(x_t, y_t) − w* · f(x_t, y') ≥ 1, for all y' ∈ Ȳ_t
  5.     i = i + 1
  6. return w^(i)
MIRA solves much smaller optimization problems, with only |Ȳ_t| constraints each
Cost: sub-optimal optimization
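
A minimal sketch of a 1-best MIRA variant, a common simplification that enforces the margin only against the current best incorrect label; the closed-form step size below follows from that single-constraint quadratic program, and the names are illustrative:

```python
import numpy as np

def mira_update(w, f_correct, f_wrong):
    """Smallest change to w that makes the correct label beat the wrong one by a margin of 1."""
    diff = f_correct - f_wrong
    loss = 1.0 - w @ diff                 # how much the margin constraint is violated
    if loss <= 0 or not diff.any():
        return w                          # constraint already satisfied
    tau = loss / (diff @ diff)            # closed-form step size for a single constraint
    return w + tau * diff

def mira_train(data, labels, feats, dim, epochs=10):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_t in data:
            wrong = max((y for y in labels if y != y_t), key=lambda y: w @ feats(x, y))
            w = mira_update(w, feats(x, y_t), feats(x, wrong))
    return w
```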

30 Interim Summary
What we have covered, linear classifiers:
  Perceptron
  SVMs
  MIRA
All are trained to minimize error
  With or without maximizing the margin
  Online or batch
What is next:
  Logistic Regression / Maximum Entropy
  Train linear classifiers to maximize likelihood

31 Logistic Regression / Maximum Entropy
Define a conditional probability:
  P(y|x) = e^{w · f(x,y)} / Z_x,  where Z_x = Σ_{y' ∈ Y} e^{w · f(x,y')}
Note: this is still a linear classifier
  argmax_y P(y|x) = argmax_y e^{w · f(x,y)} / Z_x
                  = argmax_y e^{w · f(x,y)}
                  = argmax_y w · f(x, y)
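
A sketch of the conditional probability P(y|x) above, again assuming a feature function feats(x, y) and a label set; the names are illustrative:

```python
import numpy as np

def conditional_prob(w, x, labels, feats):
    """Return {y: P(y|x)} under the log-linear model P(y|x) proportional to exp(w . f(x,y))."""
    scores = np.array([w @ feats(x, y) for y in labels])
    scores -= scores.max()                  # for numerical stability; cancels in the ratio
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum()   # divide by Z_x
    return dict(zip(labels, probs))
```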

32 Logistic Regression / Maximum Entropy
  P(y|x) = e^{w · f(x,y)} / Z_x
Q: How do we learn the weights w?
A: Set the weights to maximize the log-likelihood of the training data:
  w = argmax_w Π_t P(y_t|x_t) = argmax_w Σ_t log P(y_t|x_t)
In a nutshell: we set the weights w so that we assign as much probability as possible to the correct label y_t for each x_t in the training set

33 Aside: Min Error versus Max Log-Likelihood
Highly related, but not identical
Example: consider a training set T with 1001 points
  1000 points (x_i, y = 0) with f(x_i, 0) = [−1, 1, 0, 0], for i = 1 ... 1000
  1 point (x_1001, y = 1) with f(x_1001, 1) = [0, 0, 3, 1]
Now consider w = [−1, 0, 1, 0]
Error in this case is 0, so w minimizes error:
  [−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
  [−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3
However, log-likelihood = ... (calculation omitted)

34 Aside: Min Error versus Max Log-Likelihood
Highly related, but not identical
Same training set: 1000 points (x_i, y = 0) with f(x_i, 0) = [−1, 1, 0, 0], plus (x_1001, y = 1) with f(x_1001, 1) = [0, 0, 3, 1]
Now consider w = [−1, 7, 1, 0]
Error in this case is 1, so w does not minimize error:
  [−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
  [−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4
However, log-likelihood = −1.4
Better log-likelihood and worse error
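
A quick check of the toy example, using the feature vectors as reconstructed above (each point has one block feature vector per class): compute the training error and the log-likelihood for both weight vectors.

```python
import numpy as np

def log_likelihood_and_error(w, examples):
    ll, errors = 0.0, 0
    for f_y0, f_y1, gold in examples:
        scores = np.array([w @ f_y0, w @ f_y1])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        ll += np.log(probs[gold])
        errors += int(np.argmax(scores) != gold)
    return ll, errors

point_a = (np.array([-1., 1, 0, 0]), np.array([0., 0, -1, 1]), 0)   # the 1000 identical y = 0 points
point_b = (np.array([3., 1, 0, 0]), np.array([0., 0, 3, 1]), 1)     # the single y = 1 point
examples = [point_a] * 1000 + [point_b]

for w in (np.array([-1., 0, 1, 0]), np.array([-1., 7, 1, 0])):
    print(w, log_likelihood_and_error(w, examples))
    # first w: 0 errors but log-likelihood around -127; second w: 1 error, around -1.4
```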

35 Aside: Min Error versus Max Log-Likelihood
Max likelihood ≠ min error
Max likelihood pushes as much probability as possible onto the correct labeling of each training instance
  Even at the cost of mislabeling a few examples
Min error forces all training instances to be correctly classified
SVMs with slack variables allow some examples to be classified wrongly if the resulting margin is improved on other examples

36 Aside: Max Margin versus Max Log-Likelihood
Let's re-write the max likelihood objective function:
  w = argmax_w Σ_t log P(y_t|x_t)
    = argmax_w Σ_t log ( e^{w · f(x_t,y_t)} / Σ_{y' ∈ Y} e^{w · f(x_t,y')} )
    = argmax_w Σ_t ( w · f(x_t, y_t) − log Σ_{y' ∈ Y} e^{w · f(x_t,y')} )
Pick w to maximize the score difference between the correct labeling and every possible labeling
Margin: maximize the difference between the correct and all incorrect labelings
The above formulation is often referred to as the soft-margin

37 Logistic Regression
  P(y|x) = e^{w · f(x,y)} / Z_x,  where Z_x = Σ_{y' ∈ Y} e^{w · f(x,y')}
  w = argmax_w Σ_t log P(y_t|x_t)   (*)
The objective function (*) is concave
  Therefore there is a global maximum
No closed-form solution, but lots of numerical techniques
  Gradient methods (gradient ascent, iterative scaling)
  Newton methods (limited-memory quasi-Newton)

38 Logistic Regression Summary
Define the conditional probability:
  P(y|x) = e^{w · f(x,y)} / Z_x
Set weights to maximize the log-likelihood of the training data:
  w = argmax_w Σ_t log P(y_t|x_t)
Can find the gradient and run gradient ascent (or any gradient-based optimization algorithm)
  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y' ∈ Y} P(y'|x_t) · f_i(x_t, y')
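
A sketch of batch gradient ascent on the log-likelihood, assuming a feature function feats(x, y), a label list, and the conditional_prob helper sketched earlier; the function names and the fixed step size are illustrative choices, and the same gradient could equally be handed to an off-the-shelf optimizer:

```python
import numpy as np

def log_likelihood_gradient(w, data, labels, feats, dim):
    grad = np.zeros(dim)
    for x, y_t in data:
        probs = conditional_prob(w, x, labels, feats)
        grad += feats(x, y_t)                                   # empirical feature counts
        grad -= sum(p * feats(x, y) for y, p in probs.items())  # expected counts under P(y|x)
    return grad

def train_logreg(data, labels, feats, dim, alpha=0.1, iters=200):
    w = np.zeros(dim)
    for _ in range(iters):
        w += alpha * log_likelihood_gradient(w, data, labels, feats, dim)
    return w
```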

39 Logistic Regression = Maximum Entropy
A well-known equivalence
Max Ent: maximize entropy subject to constraints on features
  Empirical feature counts must equal expected counts
Quick intuition: the partial derivative in logistic regression
  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y' ∈ Y} P(y'|x_t) · f_i(x_t, y')
The first term is the empirical feature count and the second term is the expected count
Setting the derivative to zero maximizes the function
Therefore, when both counts are equivalent, we optimize the logistic regression objective!

40 Linear Classification
Basic form of a (multiclass) classifier:
  y = argmax_y w · f(x, y)
Different learning methods:
  Perceptron: separate the data (0-1 loss, online)
  Support vector machine: maximize the margin (hinge loss, batch)
  Logistic regression: maximize the likelihood (log loss, batch)
All three methods are widely used in NLP

41 Appendix: Proofs and Derivations

42 Convergence Proof for Perceptron
(Perceptron learning algorithm as on slide 17)
Let w^(k−1) be the weights before the k-th mistake
Suppose the k-th mistake is made at the t-th example, (x_t, y_t):
  y' = argmax_y w^(k−1) · f(x_t, y),  y' ≠ y_t
  w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y')
Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y')) ≥ u · w^(k−1) + γ
Now: since w^(0) = 0 and u · w^(0) = 0, by induction on k, u · w^(k) ≥ kγ
Now: since u · w^(k) ≤ ||u|| · ||w^(k)|| and ||u|| = 1, then ||w^(k)|| ≥ kγ
Now: ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y')||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y'))
  so ||w^(k)||² ≤ ||w^(k−1)||² + R²
  (since R ≥ ||f(x_t, y_t) − f(x_t, y')|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y') ≤ 0)

43 Convergence Proof for Perceptron
We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²
By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0:
  ||w^(k)||² ≤ kR²
Therefore, k²γ² ≤ ||w^(k)||² ≤ kR², and solving for k:
  k ≤ R² / γ²
Therefore the number of errors is bounded!

44 Gradient Ascent for Logistic Regression
Let F(w) = Σ_t log ( e^{w · f(x_t,y_t)} / Z_{x_t} )
Want to find argmax_w F(w)
Gradient ascent:
  Set w^(0) = 0^m
  Iterate until convergence: w^(i) = w^(i−1) + α ∇F(w^(i−1))
  α > 0, set so that F(w^(i)) > F(w^(i−1))
∇F(w) is the gradient of F w.r.t. w, i.e., the vector of all partial derivatives over the variables w_i:
  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
Since F is concave, gradient ascent will always find a w that maximizes F

45 Gradient Ascent for Logistic Regression: The Partial Derivatives
Need to find all partial derivatives ∂F(w)/∂w_i, where
  F(w) = Σ_t log P(y_t|x_t)
       = Σ_t log ( e^{w · f(x_t,y_t)} / Σ_{y' ∈ Y} e^{w · f(x_t,y')} )
       = Σ_t log ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} )

46 Partial Derivatives - Some Reminders
1. ∂/∂x log F = (1/F) · ∂F/∂x
   (We always assume log is the natural logarithm log_e)
2. ∂/∂x e^F = e^F · ∂F/∂x
3. ∂/∂x Σ_t F_t = Σ_t ∂F_t/∂x
4. ∂/∂x (F/G) = ( G · ∂F/∂x − F · ∂G/∂x ) / G²

47 Gradient Ascent for Logistic Regression: The Partial Derivatives
∂F(w)/∂w_i
  = ∂/∂w_i Σ_t log ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} )
  = Σ_t ∂/∂w_i log ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} )
  = Σ_t ( Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} / e^{Σ_j w_j f_j(x_t,y_t)} ) · ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )
  = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t,y_t)} ) · ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )

48 Gradient Ascent for Logistic Regression: The Partial Derivatives
Now,
  ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )
  = ( Z_{x_t} · ∂/∂w_i e^{Σ_j w_j f_j(x_t,y_t)} − e^{Σ_j w_j f_j(x_t,y_t)} · ∂/∂w_i Z_{x_t} ) / Z²_{x_t}
  = ( Z_{x_t} · e^{Σ_j w_j f_j(x_t,y_t)} · f_i(x_t,y_t) − e^{Σ_j w_j f_j(x_t,y_t)} · ∂/∂w_i Z_{x_t} ) / Z²_{x_t}
  = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z²_{x_t} ) · ( Z_{x_t} · f_i(x_t,y_t) − ∂/∂w_i Z_{x_t} )
  = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z²_{x_t} ) · ( Z_{x_t} · f_i(x_t,y_t) − Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} · f_i(x_t,y') )
because
  ∂/∂w_i Z_{x_t} = ∂/∂w_i Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} = Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} · f_i(x_t,y')

49 Gradient Ascent for Logistic Regression: The Partial Derivatives
From before:
  ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} ) = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z²_{x_t} ) · ( Z_{x_t} · f_i(x_t,y_t) − Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} · f_i(x_t,y') )
Substituting this in:
  ∂F(w)/∂w_i
  = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t,y_t)} ) · ( e^{Σ_j w_j f_j(x_t,y_t)} / Z²_{x_t} ) · ( Z_{x_t} · f_i(x_t,y_t) − Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} · f_i(x_t,y') )
  = Σ_t ( 1 / Z_{x_t} ) · ( Z_{x_t} · f_i(x_t,y_t) − Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t,y')} · f_i(x_t,y') )
  = Σ_t ( f_i(x_t,y_t) − Σ_{y' ∈ Y} ( e^{Σ_j w_j f_j(x_t,y')} / Z_{x_t} ) · f_i(x_t,y') )
  = Σ_t f_i(x_t,y_t) − Σ_t Σ_{y' ∈ Y} P(y'|x_t) · f_i(x_t,y')

50 Gradient Ascent for Logistic Regression
FINALLY!!! After all that,
  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y' ∈ Y} P(y'|x_t) · f_i(x_t, y')
And the gradient is:
  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
So we can now use gradient ascent to find w!!
