MIA - Master on Artificial Intelligence
- Ferdinand Foster
- 6 years ago
Transcription
1 MIA - Master on Artificial Intelligence
2 Outline
- Introduction
- Unsupervised & semi-supervised approaches
- Supervised Algorithms
- Maximum Likelihood Estimation
- Maximum Entropy Modeling
3 Introduction
4 Paradigms
Supervised
- n-gram models. Parameter estimation: MLE & smoothing.
- Algorithms: Naive Bayes, Decision Trees, SVMs, AdaBoost, Perceptron, log-linear, ...
Unsupervised and semi-supervised
- Similarity models: Clustering, EBL.
- Prediction models: Expectation Maximization (EM).
- Bootstrapping
- Co-training
- Active learning
- ...
5 Other relevant considerations
- Batch vs on-line
- ML algorithms and parameter tuning: train/development data
- Evaluation: test data, N-fold cross validation, Precision/Recall/F1
6 Unsupervised & semi-supervised approaches
7 Clustering
Single-link clustering of 22 frequent English words, represented as a dendrogram.
[Figure: dendrogram whose leaves are, in order: be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was]
8 The EM algorithm
- Start with a guess for the values of the model parameters.
- Step E: Compute the distribution of the missing/latent data given the observed data and the current guess of the model parameters. Use this distribution to compute the expectation of the likelihood function with respect to the unobserved variables.
- Step M: Maximize the expected likelihood function, which no longer contains unobserved variables, as you would in the fully observed case, to get a new estimate of the model parameters.
- Repeat steps E and M until convergence (no further changes).
9 The EM algorithm - Example
Three coins with probabilities of heads $(\lambda, p_1, p_2)$.
- Hidden variable: the outcome of coin 0 ($\lambda$), $Y \in \{H, T\}$
- $Y = H$: flip coin 1 ($p_1$) three times
- $Y = T$: flip coin 2 ($p_2$) three times
Observed sequence: $X = \{HHT, HTT, TTT, HHH\}$
10 The EM algorithm - Example
Start with a guess model $\mu = (\lambda, p_1, p_2)$.
Step E - Expectation
Use the current model parameters $\mu$ to compute the probability distribution of the hidden data given the observations:
$$P_\mu(H \mid x_i) = \frac{P_\mu(x_i, H)}{P_\mu(x_i)}; \qquad P_\mu(T \mid x_i) = \frac{P_\mu(x_i, T)}{P_\mu(x_i)} \qquad \forall x_i \in X$$
where $P_\mu(x_i, H)$, $P_\mu(x_i, T)$, and $P_\mu(x_i)$ are computed from the current model:
$$P_\mu(HHT, H) = \lambda\, p_1^2 (1 - p_1)$$
$$P_\mu(HTT, T) = (1 - \lambda)\, p_2 (1 - p_2)^2$$
... etc. ...
$$P_\mu(x_i) = P_\mu(x_i, H) + P_\mu(x_i, T) \qquad \forall x_i \in X$$
Compute the expected number of occurrences of each hidden variable value:
$$E[Y = H] = \sum_i P_\mu(H \mid x_i); \qquad E[Y = T] = \sum_i P_\mu(T \mid x_i)$$
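11 The EM algorithm - Example
Step M - Maximization
Use the expectations computed in step E to compute new MLE estimates of the model parameters given the observations $X = \{HHT, HTT, TTT, HHH\}$, with $N = |X|$:
$$\lambda = \frac{E[Y = H]}{N}$$
$$p_1 = \frac{2\, P_\mu(H \mid HHT) + 1\, P_\mu(H \mid HTT) + 0\, P_\mu(H \mid TTT) + 3\, P_\mu(H \mid HHH)}{3\, E[Y = H]}$$
$$p_2 = \frac{2\, P_\mu(T \mid HHT) + 1\, P_\mu(T \mid HTT) + 0\, P_\mu(T \mid TTT) + 3\, P_\mu(T \mid HHH)}{3\, E[Y = T]}$$
Each numerator weights the number of heads in an observation by the posterior from step E. Each denominator is the expected number of flips of that coin, which is three flips per observation assigned to it.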
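The E and M steps of the three-coin example can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the initial guess `(0.3, 0.3, 0.6)` and the number of iterations are arbitrary assumptions, and the function names are hypothetical.

```python
def likelihood(observations, lam, p1, p2):
    """Probability of the observed triples under the three-coin model."""
    total = 1.0
    for x in observations:
        h, t = x.count("H"), x.count("T")
        total *= lam * p1**h * (1 - p1)**t + (1 - lam) * p2**h * (1 - p2)**t
    return total


def em_step(observations, lam, p1, p2):
    """One E step followed by one M step; returns the updated (lam, p1, p2)."""
    # E step: posterior P(Y = H | x_i) for every observation.
    post_h = []
    for x in observations:
        h, t = x.count("H"), x.count("T")
        joint_h = lam * p1**h * (1 - p1)**t
        joint_t = (1 - lam) * p2**h * (1 - p2)**t
        post_h.append(joint_h / (joint_h + joint_t))
    exp_h = sum(post_h)                    # E[Y = H]
    exp_t = len(observations) - exp_h      # E[Y = T]
    # M step: re-estimate the parameters from the expected counts.
    heads = [x.count("H") for x in observations]
    flips = len(observations[0])           # three flips per observation
    new_lam = exp_h / len(observations)
    new_p1 = sum(w * k for w, k in zip(post_h, heads)) / (flips * exp_h)
    new_p2 = sum((1 - w) * k for w, k in zip(post_h, heads)) / (flips * exp_t)
    return new_lam, new_p1, new_p2


X = ["HHT", "HTT", "TTT", "HHH"]
params = (0.3, 0.3, 0.6)                   # arbitrary initial guess (assumption)
history = [likelihood(X, *params)]
for _ in range(20):
    params = em_step(X, *params)
    history.append(likelihood(X, *params))
```

The list `history` records the data likelihood after every iteration; EM never decreases it, which is a convenient sanity check for an implementation.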
12 Bootstrapping: Self-training
Input:
    L_0, a (small) set of labeled examples
    U, a (large) set of unlabeled examples
Output:
    m, a learned model

T = L_0                    // Start with a reduced set of labeled examples
while not convergence_achieved() do
    m = learn(T)           // Learn a model from the available labeled examples
    n = label(U, m)        // Use the learned model to label new examples
    n = filter(n, γ)       // Filter labeled examples by confidence threshold
    T = T ∪ n              // Add examples passing the filter to the training set
endwhile

Convergence may be defined as a fixed number of iterations, or as the point where performance on a development set stops improving.
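The self-training loop can be made concrete with a toy learner. The sketch below assumes 1-D examples, a nearest-centroid classifier, and a confidence score equal to the distance gap between the two nearest centroids; all of these are illustrative choices, not part of the slides, and `label` needs at least two classes in the seed set.

```python
def learn(train):
    """Toy learner (an assumption): one centroid per class for 1-D points."""
    sums = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def label(x, model):
    """Return (label, confidence); confidence is the distance gap between
    the nearest and the second nearest centroid."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    gap = abs(x - model[ranked[1]]) - abs(x - model[ranked[0]])
    return ranked[0], gap


def self_train(labeled, unlabeled, gamma, max_iter=10):
    train, pool = list(labeled), list(unlabeled)
    for _ in range(max_iter):              # convergence: fixed number of iterations
        model = learn(train)               # m = learn(T)
        scored = [(x,) + label(x, model) for x in pool]        # n = label(U, m)
        accepted = [(x, y) for x, y, conf in scored if conf >= gamma]  # filter(n, γ)
        if not accepted:                   # nothing confident enough: stop early
            break
        train.extend(accepted)             # T = T ∪ n
        taken = {x for x, _ in accepted}
        pool = [x for x in pool if x not in taken]
    return learn(train), train


model, train = self_train([(0.0, "a"), (10.0, "b")],
                          [1.0, 2.0, 8.0, 9.0, 5.0], gamma=2.0)
```

In this run the ambiguous point `5.0` never passes the confidence filter, so it is never added to the training set.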
13 Bootstrapping: Co-training
Input:
    L_0, a (small) set of labeled examples
    U, a (large) set of unlabeled examples
Output:
    m, a learned model

T = L_0                        // Start with a reduced set of labeled examples
while not convergence_achieved() do
    m_1 = learn(T, view_1)     // Learn a model from the labeled examples using the first view
    m_2 = learn(T, view_2)     // Learn a model from the labeled examples using the second view
    n_1 = label(U, m_1)        // Use the first model to label new examples
    n_2 = label(U, m_2)        // Use the second model to label new examples
    n = filter(n_1, n_2, γ)    // Filter labeled examples by confidence threshold
    T = T ∪ n                  // Add new examples to the training set
endwhile
m = best(m_1, m_2)

The two views must be conditionally independent given the label, and each must be sufficient on its own to learn the classifier.
14 Active learning
Input:
    L_0, a (small) set of labeled examples
    U, a (large) set of unlabeled examples
    oracle, a way to obtain the expected label for a given example
Output:
    m, a learned model

T = L_0                    // Start with a reduced set of labeled examples
while not convergence_achieved() do
    m = learn(T)           // Learn a model from the available labeled examples
    n = label(U, m)        // Use the learned model to label new examples
    n = select(n)          // Select the best examples to be labeled
    n = oracle(n)          // Get the supervised label for the selected examples
    T = T ∪ n              // Add new examples to the training set
endwhile

Different measures are used for example selection (confidence of the model, error reduction, expected model change, ...).
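A small sketch of the active learning loop with uncertainty-based selection follows. It assumes 1-D examples, a threshold classifier placed midway between the two classes, and a hypothetical `oracle` function that plays the role of the human annotator; none of these come from the slides.

```python
def learn(train):
    """Toy learner (an assumption): threshold midway between the largest
    negative and the smallest positive 1-D example."""
    largest_neg = max(x for x, y in train if y == -1)
    smallest_pos = min(x for x, y in train if y == +1)
    return (largest_neg + smallest_pos) / 2


def active_learn(labeled, unlabeled, oracle, budget):
    train, pool = list(labeled), list(unlabeled)
    for _ in range(budget):                # convergence: fixed labeling budget
        if not pool:
            break
        threshold = learn(train)           # m = learn(T)
        # select: the example the model is least confident about,
        # i.e. the one closest to the decision threshold
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        train.append((query, oracle(query)))   # T = T ∪ oracle(n)
    return learn(train), train


def oracle(x):
    """Hypothetical annotator: the true boundary is at 5."""
    return 1 if x >= 5 else -1


threshold, train = active_learn([(0, -1), (10, 1)],
                                [1, 2, 3, 4, 6, 7, 8, 9], oracle, budget=2)
```

With a budget of two queries the learner asks for the labels of `4` and `7`, the points nearest its current boundary, instead of spending labels on examples far from it.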
15 Supervised Algorithms
16 Naive Bayes
- Simplest probabilistic classifier
- NB generative model: [Figure: class node $y$ connected to the feature nodes $x_1, x_2, x_3, \ldots, x_n$]
- $x_i$ is the $i$-th feature of example $x$
- Features are conditionally independent given the class $y$
17-19 Naive Bayes (II)
Applying Bayes' rule:
$$P(y \mid x_1, \ldots, x_n) = \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)} \qquad \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$
The classifier selects the class with the highest posterior:
$$NB(x) = \arg\max_y \frac{P(y)\, P(x_1, \ldots, x_n \mid y)}{P(x_1, \ldots, x_n)}$$
$P(x_1, \ldots, x_n)$ is constant with respect to $y$, and the features are conditionally independent given $y$, thus:
$$NB(x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$$
20-21 Naive Bayes (III)
Training a NB classifier consists of estimating two probability distributions, $P(y)$ and $P(x_i \mid y)$, from training data.
Maximum likelihood estimates:
$$P(y) = \frac{\text{counts}(y)}{\text{num. examples}} \qquad P(x_i \mid y) = \frac{\text{counts}(x_i, y)}{\text{counts}(y)}$$
- In practice, smoothing is needed
- NB is simple and can be trained from small datasets (robustness)
- ... but the independence assumptions are not realistic
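The count-based estimates and the argmax rule can be sketched as below. The example assumes categorical features indexed by position and add-`alpha` smoothing over the values observed for each feature; the toy weather data and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(examples, alpha=1.0):
    """Collect counts(y) and counts(x_i, y) from (features, label) pairs."""
    class_counts = Counter(y for _, y in examples)
    feat_counts = defaultdict(Counter)     # (i, y) -> counts of the values of x_i
    values = defaultdict(set)              # i -> values observed for x_i
    for x, y in examples:
        for i, v in enumerate(x):
            feat_counts[(i, y)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values, alpha


def predict_nb(model, x):
    """argmax_y  log P(y) + sum_i log P(x_i | y), with add-alpha smoothing."""
    class_counts, feat_counts, values, alpha = model
    total = sum(class_counts.values())

    def score(y):
        s = log(class_counts[y] / total)
        for i, v in enumerate(x):
            numerator = feat_counts[(i, y)][v] + alpha
            denominator = class_counts[y] + alpha * len(values[i])
            s += log(numerator / denominator)
        return s

    return max(class_counts, key=score)


data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "cool"), "yes")]
nb_model = train_nb(data)
```

Working in log space avoids numerical underflow when the product runs over many features; with `alpha=0` the estimates reduce to plain maximum likelihood and unseen feature values would make the logarithm fail, which is the practical reason smoothing is needed.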
22 Decision Trees
- Feature selection (information gain, Gini diversity, $\chi^2$, ...)
- Stopping criterion
- Feature binarization, pruning, incremental learning, ...
23 Linear Classifiers
- Vector space in $\mathbb{R}^n$
- Define a hyperplane with a weight vector $w$ and an offset (or threshold) $b$.
- Used as a classification rule:
$$h(x) = \mathrm{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } \sum_{i=1}^{n} x_i w_i + b > 0 \\ -1 & \text{otherwise} \end{cases}$$
24 Linear Classifiers: Perceptron
Input: Training set {(x_i, y_i)}
Output: Weight vector w and offset b

w = 0; b = 0
repeat
    for i = 1 to n do
        if y_i (w · x_i + b) ≤ 0 then
            w = w + y_i x_i
            b = b + y_i
        endif
    endfor
until average(y_i (w · x_i + b)) < ε

- On-line learning algorithm
- Additive, error-driven updating
- Convergence is guaranteed if the training set is linearly separable
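A runnable version of the perceptron update is sketched below. As a simplifying assumption, the stopping criterion used here is "no mistakes in a full pass, or a maximum number of epochs", which is a common choice but not necessarily the one intended on the slide.

```python
def train_perceptron(data, max_epochs=100):
    """data: list of (feature tuple, label) pairs with labels in {-1, +1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:        # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w = w + y_i x_i
                b += y                                      # b = b + y_i
                mistakes += 1
        if mistakes == 0:                  # a clean pass: the data is separated
            break
    return w, b


def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Logical AND is linearly separable, so the loop terminates.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

On data that is not linearly separable the loop only stops because of `max_epochs`, which is why such a cap (or another stopping rule) is needed in practice.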
25 Linear Classifiers: SVM
- Batch learning algorithm
- Margin maximization: minimize $\|w\|$ subject to the constraints
$$y_i (w \cdot x_i + b) \geq 1 \qquad \forall i$$
26 Linear Classifiers: Kernels
What if the training set is not linearly separable?
27 Linear Classifiers: Kernels
- Use a mapping function $f$ to make the data linearly separable
- It is too costly to compute all $f(x)$, but we actually need only the dot products $f(x) \cdot f(y)$
- Kernel functions efficiently compute $K(x, y) = f(x) \cdot f(y)$
28 Linear Classifiers: Kernels
- Identity (linear kernel): $K(x, y) = x \cdot y$
- Polynomial kernel: $K(x, y) = (x \cdot y + c)^d$
- Gaussian kernel (RBF): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
- Sigmoid kernel: $K(x, y) = \tanh(\alpha (x \cdot y) + \beta)$
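The four kernels are short enough to write out directly. The sketch below also checks the kernel identity $K(x, y) = f(x) \cdot f(y)$ for one case where the mapping is easy to write down: the degree-2 polynomial kernel with $c = 0$ in two dimensions. Default parameter values are arbitrary assumptions.

```python
from math import exp, sqrt, tanh


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, c=1.0, d=2):
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, gamma=0.5):
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def sigmoid_kernel(x, y, alpha=1.0, beta=0.0):
    return tanh(alpha * dot(x, y) + beta)


def explicit_map(x):
    """Mapping f for the degree-2 polynomial kernel with c = 0 in 2-D:
    f(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = polynomial_kernel(x, y, c=0.0, d=2)       # one dot product in 2-D
explicit = dot(explicit_map(x), explicit_map(y))     # dot product in 3-D
```

Both routes give the same value, but the kernel never builds the higher-dimensional vectors; for larger degrees and dimensions the explicit map grows quickly while the kernel cost stays that of one dot product.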
29 Linear Classifiers: Kernels and dual problem
To use a kernel, we need to formulate the classifier in dual form, i.e. in terms of dot products between examples.
Example: Perceptron. Classification rule:
$$\hat{y} = \mathrm{sgn}(w \cdot x + b)$$
Due to the update steps
$$w \leftarrow w + y_i x_i \qquad b \leftarrow b + y_i$$
we get:
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \qquad b = \sum_{i=1}^{n} \alpha_i y_i$$
where $\alpha_i$ is the number of misclassifications of $x_i$.
30 Linear Classifiers: Kernels and dual problem
Then, we can compute the perceptron prediction as:
$$\hat{y} = \mathrm{sgn}\Big(\Big(\sum_{i=1}^{n} \alpha_i y_i x_i\Big) \cdot x + \sum_{i=1}^{n} \alpha_i y_i\Big) = \mathrm{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x) + \sum_{i=1}^{n} \alpha_i y_i\Big) = \mathrm{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (x_i \cdot x + 1)\Big)$$
Once the problem is formulated in terms of similarities (dot products) between examples, we can introduce the kernel:
$$\hat{y} = \mathrm{sgn}\Big(\sum_{i=1}^{n} \alpha_i y_i (K(x_i, x) + 1)\Big)$$
Note that for $K(x, y) = x \cdot y$, this formulation is equivalent to the original perceptron.
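The dual formulation can be tried on XOR, which no linear perceptron can separate. The sketch below keeps one mistake counter $\alpha_i$ per training example and uses a degree-2 polynomial kernel; the kernel choice, the epoch cap, and the function names are assumptions made for the illustration.

```python
def poly_kernel(x, y, degree=2):
    return (sum(a * b for a, b in zip(x, y)) + 1) ** degree


def decision(data, alpha, kernel, x):
    """sum_i alpha_i y_i (K(x_i, x) + 1), the argument of sgn in the dual rule."""
    return sum(a * yi * (kernel(xi, x) + 1) for a, (xi, yi) in zip(alpha, data))


def train_dual_perceptron(data, kernel, max_epochs=1000):
    alpha = [0] * len(data)                # alpha_i: misclassifications of x_i
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(data):
            if y * decision(data, alpha, kernel, x) <= 0:
                alpha[i] += 1              # dual counterpart of w += y_i x_i, b += y_i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(data, alpha, kernel, x):
    return 1 if decision(data, alpha, kernel, x) > 0 else -1


xor_data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
alpha = train_dual_perceptron(xor_data, poly_kernel)
```

The weight vector is never materialized: the model is the list `alpha` together with the training examples, and prediction costs one kernel evaluation per example with a non-zero count.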
31 Maximum Likelihood Estimation
32 MLE & Smoothing
Estimate the probability of the target feature based on observed data. The prediction task can be reduced to having good estimates of the conditional distribution:
$$P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
MLE (Maximum Likelihood Estimation):
$$P_{MLE}(x) = \frac{\text{count}(x)}{N} \qquad P_{MLE}(y \mid x) = \frac{\text{count}(x, y)}{\text{count}(x)}$$
- No probability mass for unseen events
- Unsuitable for NLP
- Data sparseness, Zipf's Law
33 Smoothing 1 - Adding Counts
Laplace's Law (adding one), where $N$ is the number of observations and $B$ is the number of possible events (bins):
$$P_{LAP}(x) = \frac{\text{count}(x) + 1}{N + B}$$
For large values of $B$, too much probability mass is assigned to unseen events.
Lidstone's Law:
$$P_{LID}(x) = \frac{\text{count}(x) + \lambda}{N + B\lambda}$$
Usually $\lambda = 0.5$ (Expected Likelihood Estimation). This is equivalent to linear interpolation between the MLE and a uniform prior, with $\mu = N / (N + B\lambda)$:
$$P_{LID}(x) = \mu\, \frac{\text{count}(x)}{N} + (1 - \mu)\, \frac{1}{B}$$
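The equivalence between Lidstone's Law and interpolation with a uniform prior is easy to verify numerically. The sketch below uses a hypothetical six-token sample over a four-symbol vocabulary; `lam=1.0` would give Laplace's Law.

```python
from collections import Counter


def lidstone(counts, vocabulary, lam=0.5):
    """P_LID(x) = (count(x) + lam) / (N + B * lam) for every x in the vocabulary."""
    n = sum(counts.values())
    b = len(vocabulary)
    return {x: (counts.get(x, 0) + lam) / (n + b * lam) for x in vocabulary}


def interpolated(counts, vocabulary, lam=0.5):
    """Same estimate written as mu * MLE + (1 - mu) * uniform."""
    n = sum(counts.values())
    b = len(vocabulary)
    mu = n / (n + b * lam)
    return {x: mu * counts.get(x, 0) / n + (1 - mu) / b for x in vocabulary}


sample = Counter("a a a b b c".split())    # N = 6 observations
vocab = ["a", "b", "c", "d"]               # B = 4 bins; "d" is unseen
p_lid = lidstone(sample, vocab)
p_int = interpolated(sample, vocab)
```

Here the unseen symbol `d` receives probability `0.5 / 8 = 0.0625`, and the seen symbols give up a matching share of their MLE mass.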
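34 Smoothing 2 - Discounting Counts
Absolute Discounting, where $N_0$ is the number of events with zero count:
$$P_{ABS}(x) = \begin{cases} \dfrac{\text{count}(x) - \delta}{N} & \text{if count}(x) > 0 \\[2ex] \dfrac{(B - N_0)\,\delta / N_0}{N} & \text{otherwise} \end{cases}$$
Linear Discounting:
$$P_{LIN}(x) = \begin{cases} \dfrac{(1 - \alpha)\,\text{count}(x)}{N} & \text{if count}(x) > 0 \\[2ex] \dfrac{\alpha}{N_0} & \text{otherwise} \end{cases}$$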
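Both discounting schemes move a fixed amount of probability mass from seen to unseen events, which a short check makes visible. The counts and vocabulary below are hypothetical, and the sketch assumes at least one unseen event ($N_0 > 0$), since the formulas redistribute the discounted mass over unseen events only.

```python
def absolute_discounting(counts, vocabulary, delta=0.5):
    n = sum(counts.values())
    b = len(vocabulary)
    n0 = sum(1 for x in vocabulary if counts.get(x, 0) == 0)   # unseen events
    unseen = (b - n0) * delta / n0 / n     # assumes n0 > 0
    return {x: (counts[x] - delta) / n if counts.get(x, 0) > 0 else unseen
            for x in vocabulary}


def linear_discounting(counts, vocabulary, alpha=0.1):
    n = sum(counts.values())
    n0 = sum(1 for x in vocabulary if counts.get(x, 0) == 0)
    return {x: (1 - alpha) * counts[x] / n if counts.get(x, 0) > 0 else alpha / n0
            for x in vocabulary}


sample_counts = {"a": 3, "b": 2, "c": 1}   # N = 6
vocab = ["a", "b", "c", "d", "e"]          # B = 5, so N_0 = 2
p_abs = absolute_discounting(sample_counts, vocab)
p_lin = linear_discounting(sample_counts, vocab)
```

Absolute discounting subtracts the same `delta` from every seen count, so rare events lose proportionally more; linear discounting scales every seen count by the same factor `1 - alpha`.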
35 Maximum Entropy Modeling
36-37 Maximum Entropy / Log-linear Models
- Maximum Entropy: an alternative estimation technique.
- Able to deal with different kinds of evidence.
- ME principle: Do not assume anything about non-observed events. Find the most uniform (maximum entropy, least informed) probability distribution that matches the observations.
Example:
[Tables: an "Observations" table for P(a, b) with rows in, on, columns dans, en, à, and totals, in which most cells are unknown (?) and the value 0.3 appears in the row for in; followed by one possible p(a, b) (slide 36) and the maximum entropy p(a, b) (slide 37). The remaining cell values did not survive the transcription.]
38 ME Modeling
- Observed facts are constraints for the desired model $p$.
- Constraints take the form of feature functions:
$$f_i : \varepsilon \to \{0, 1\}$$
- The desired model must satisfy the constraints:
$$\sum_{x \in \varepsilon} p(x) f_i(x) = \sum_{x \in \varepsilon} \tilde{p}(x) f_i(x) \qquad \forall i$$
that is, the expectation of each $f_i$ according to the model matches the actual observed expectation of $f_i$, where $\tilde{p}$ is the empirical distribution.
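39 ME Modeling Example
Example: $\varepsilon = \{in, on\} \times \{dans, en, à\}$
[Table: p(a, b) with rows in, on and columns dans, en, à, plus totals; all cells unknown (?)]
Observed fact: $p(in, dans) + p(on, dans) = 0.4$
Encoded as a constraint: $E_p(f_1) = 0.4$, where:
$$f_1(a, b) = \begin{cases} 1 & \text{if } b = dans \\ 0 & \text{otherwise} \end{cases} \qquad E_p(f_1) = \sum_{(a, b) \in \varepsilon} p(a, b)\, f_1(a, b)$$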
40 ME Probability Model
There is an infinite set $P$ of probability models consistent with the observations. We want to compute the maximum entropy model:
$$p^* = \arg\max_{p \in P} H(p) \qquad H(p) = -\sum_{x \in \varepsilon} p(x) \log p(x)$$
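For a single constraint such as $p(in, dans) + p(on, dans) = 0.4$, the maximum entropy model can be written down directly: the constrained mass is spread uniformly over the events the feature selects, and the remaining mass uniformly over the rest. The sketch below compares its entropy with another, hypothetical distribution that satisfies the same constraint.

```python
from math import log


def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    return -sum(v * log(v) for v in p.values() if v > 0)


events = [(a, b) for a in ("in", "on") for b in ("dans", "en", "à")]


def maxent_single_constraint(events, mass_dans=0.4):
    """Uniform inside each of the two groups induced by f_1 (b = dans or not)."""
    selected = [e for e in events if e[1] == "dans"]
    rest = [e for e in events if e[1] != "dans"]
    p = {e: mass_dans / len(selected) for e in selected}
    p.update({e: (1 - mass_dans) / len(rest) for e in rest})
    return p


p_maxent = maxent_single_constraint(events)

# Another model consistent with the same observation (hypothetical numbers).
p_other = {("in", "dans"): 0.3, ("on", "dans"): 0.1,
           ("in", "en"): 0.3, ("on", "en"): 0.1,
           ("in", "à"): 0.1, ("on", "à"): 0.1}
```

Both distributions satisfy the constraint, but `p_other` encodes preferences (for example, favoring "in" over "on") that no observation supports, and its entropy is lower as a result.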
41 Parameter Estimation
Example: Maximum entropy model for translating prepositions from English to French.
- No constraints: [Table: P(a, b) with rows in, on and columns dans, en, à; total 1.0; cell values lost in the transcription]
- With constraint $p(dans) + p(en) = 0.4$: [Table: P(a, b); cell values lost in the transcription]
- With constraints $p(dans) + p(en) = 0.4$; $p(in) =$ [value lost in the transcription]: Not so easy!
42 Parameter estimation
Exponential models (optimization with Lagrange multipliers):
$$p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a, b)} \qquad \alpha_j > 0 \qquad Z(b) = \sum_{a} \prod_{i=1}^{k} \alpha_i^{f_i(a, b)}$$
also formulated as
$$p(a \mid b) = \frac{1}{Z(b)} \exp\Big(\sum_{j=1}^{k} \lambda_j f_j(a, b)\Big) \qquad \lambda_i = \ln \alpha_i$$
- Each model parameter weights the influence of a feature.
- Several algorithms exist to compute the optimal parameters: GIS, IIS, LM-BFGS, ...
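The two forms of the exponential model give the same distribution, which the following sketch checks on a toy case. The feature functions, the weights, and the choice of French prepositions as outcomes $a$ with an English preposition as context $b$ are all hypothetical.

```python
from math import exp, prod


def loglinear(features, lambdas, outcomes, context):
    """p(a | b) = exp(sum_j lambda_j f_j(a, b)) / Z(b)."""
    scores = {a: exp(sum(l * f(a, context) for l, f in zip(lambdas, features)))
              for a in outcomes}
    z = sum(scores.values())               # Z(b): normalizes over all outcomes a
    return {a: s / z for a, s in scores.items()}


def product_form(features, alphas, outcomes, context):
    """p(a | b) = prod_j alpha_j ** f_j(a, b) / Z(b)."""
    scores = {a: prod(al ** f(a, context) for al, f in zip(alphas, features))
              for a in outcomes}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}


# Hypothetical binary features over (French outcome a, English context b).
features = [lambda a, b: 1 if (a == "dans" and b == "in") else 0,
            lambda a, b: 1 if a == "à" else 0]
lambdas = [1.0, -0.5]
alphas = [exp(l) for l in lambdas]         # lambda_j = ln alpha_j

outcomes = ["dans", "en", "à"]
p_exp = loglinear(features, lambdas, outcomes, "in")
p_prod = product_form(features, alphas, outcomes, "in")
```

A positive weight raises the probability of the outcomes its feature fires on and a negative weight lowers it; outcomes with no active feature keep a score of 1 before normalization.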
43 Improved Iterative Scaling (IIS)
Input: Feature functions $f_1 \ldots f_n$, empirical distribution $\tilde{p}(f_i)$
Output: $\lambda_i^*$: parameters for the optimal model $p^*$

Start with λ_i = 0 for all i ∈ {1 ... n}
Repeat
    For each i ∈ {1 ... n} do
        let Δλ_i be the solution to
            $$\sum_{a, b} \tilde{p}(b)\, p(a \mid b)\, f_i(a, b) \exp\Big(\Delta\lambda_i \sum_{j=1}^{n} f_j(a, b)\Big) = \tilde{p}(f_i)$$
        λ_i ← λ_i + Δλ_i
    end for
Until all λ_i have converged
44 Application to NLP Tasks
- Speech processing (Rosenfeld 94)
- Translation (Brown et al 90)
- Morphology (Della Pietra et al. 95)
- Clause boundary detection (Reynar & Ratnaparkhi 97)
- PP-attachment (Ratnaparkhi et al 94)
- PoS Tagging (Ratnaparkhi 96, Black et al 99)
- Partial Parsing (Skut & Brants 98)
- Full Parsing (Ratnaparkhi 97, Ratnaparkhi 99)
- Text Categorization (Nigam et al 99)
45 PoS Tagging (Ratnaparkhi 96)
Probabilistic model over $H \times T$:
$$h_i = (w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2})$$
$$f_j(h_i, t) = \begin{cases} 1 & \text{if suffix}(w_i) = \text{ing} \wedge t = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$$
Compute $p^*(h, t)$ using GIS.
Disambiguation algorithm: beam search
$$p(t \mid h) = \frac{\exp\big(\sum_j \lambda_j f_j(h, t)\big)}{Z(h)}$$
$$\arg\max_{t_1 \ldots t_n} p(t_1 \ldots t_n \mid w_1 \ldots w_n) = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} p(t_i \mid h_i)$$
46 Text Categorization (Nigam et al 99)
Probabilistic model over $W \times C$:
$$d = (w_1, w_2 \ldots w_N)$$
$$f_{w, c'}(d, c) = \begin{cases} \dfrac{N(d, w)}{N(d)} & \text{if } c = c' \\ 0 & \text{otherwise} \end{cases}$$
Compute $p^*(c \mid d)$ using IIS.
Disambiguation algorithm: select the class with the highest probability
$$\arg\max_c P(c \mid d) = \arg\max_c \frac{\exp\big(\sum_i \lambda_i f_i(d, c)\big)}{Z(d)} = \arg\max_c \sum_i \lambda_i f_i(d, c)$$
47 Sentence Boundaries (Reynar and Ratnaparkhi 97)
Feature templates:
1 The prefix
2 The suffix
3 The previous word
4 The next word
5 Whether the prefix or the suffix are in Abbreviations
6 Whether the previous or the next word are in Abbreviations
Example: < b=no punc=. pref=mr suff= prev=2010. next=wayne >
Two classes: y and n
Disambiguation algorithm: select the class with the highest probability
$$\arg\max_c P(c \mid d) = \arg\max_c \frac{\exp\big(\sum_i \lambda_i f_i(d, c)\big)}{Z(d)} = \arg\max_c \sum_i \lambda_i f_i(d, c)$$
More informationNatural Language Processing
SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 9, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class
More informationGraphical models for part of speech tagging
Indian Institute of Technology, Bombay and Research Division, India Research Lab Graphical models for part of speech tagging Different Models for POS tagging HMM Maximum Entropy Markov Models Conditional
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationN-gram Language Modeling Tutorial
N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationECE662: Pattern Recognition and Decision Making Processes: HW TWO
ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are
More informationSUPPORT VECTOR MACHINE
SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are
More informationMaterial presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.
Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More informationQualifier: CS 6375 Machine Learning Spring 2015
Qualifier: CS 6375 Machine Learning Spring 2015 The exam is closed book. You are allowed to use two double-sided cheat sheets and a calculator. If you run out of room for an answer, use an additional sheet
More informationSequential Supervised Learning
Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationCS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev
CS4705 Probability Review and Naïve Bayes Slides from Dragomir Radev Classification using a Generative Approach Previously on NLP discriminative models P C D here is a line with all the social media posts
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: Maximum Entropy Models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 24 Introduction Classification = supervised
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationActive and Semi-supervised Kernel Classification
Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),
More informationFINAL: CS 6375 (Machine Learning) Fall 2014
FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for
More informationCS325 Artificial Intelligence Chs. 18 & 4 Supervised Machine Learning (cont)
CS325 Artificial Intelligence Cengiz Spring 2013 Model Complexity in Learning f(x) x Model Complexity in Learning f(x) x Let s start with the linear case... Linear Regression Linear Regression price =
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationA brief introduction to Conditional Random Fields
A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationNaïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning
More informationInformation Extraction from Text
Information Extraction from Text Jing Jiang Chapter 2 from Mining Text Data (2012) Presented by Andrew Landgraf, September 13, 2013 1 What is Information Extraction? Goal is to discover structured information
More informationLast Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression
CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer
More informationMixtures of Gaussians continued
Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others
More informationMultivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008)
Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More information