Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas


Machine Learning: The Big Picture
- Supervised learning, unsupervised learning, reinforcement learning
- Parametric vs. non-parametric methods; Y continuous vs. Y discrete
- Y continuous:
  - Non-parametric: nearest neighbor methods, locally weighted regression
  - Parametric: Gaussians (learned in closed form); linear functions (1. learned in closed form, 2. using gradient descent)
- Y discrete:
  - Decision trees: greedy search; pruning
  - Probability of class given features: 1. learn P(Y) and P(X | Y), apply Bayes rule; 2. learn P(Y | X) with gradient descent
  - Non-probabilistic: linear (perceptron, gradient descent); nonlinear (neural networks, backprop); support vector machines

Key Perspective on Learning
Learning as optimization:
- Closed form
- Greedy search
- Gradient ascent
Loss function: error + regularization

What you should know about decision tree learning:
Heuristics for selecting the next attribute: information gain, one-step look-ahead, gain ratio. What makes a heuristic good? What are its cons?
Complexity analysis. Sample exam question: if I tweak the selection heuristic, how will that change the complexity and the quality of the learned tree?
What kinds of functions can it learn?
Overfitting and pruning. Handling missing data. Handling continuous attributes.
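As a concrete companion to the information-gain heuristic, here is a minimal sketch in Python (the entropy/information_gain helpers and the toy dataset are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting
    on `attribute` (examples are dicts of attribute -> value)."""
    n = len(labels)
    splits = {}
    for x, y in zip(examples, labels):
        splits.setdefault(x[attribute], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in splits.values())
    return entropy(labels) - remainder

# Tiny illustrative dataset: does "outlook" predict "play"?
X = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
y = ["no", "no", "yes", "yes"]
print(information_gain(X, y, "outlook"))  # 1.0: the split is perfectly informative
```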

Why decision trees overfit: noise; a small number of examples associated with each leaf (what if only one example is associated with a leaf? can you trust it?); coincidental regularities.

Probability Theory
Be able to apply and understand: axioms of probability; distribution vs. density; conditional probability; the sum rule, chain rule, and Bayes rule.
Sample question: if you know P(A | B), do you have enough information to compute P(B | A)?
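For the sample question, the answer is that P(A | B) alone is not enough; Bayes rule also needs the marginals. As a worked statement (standard Bayes rule, not from the slides):

```latex
P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)},
\qquad\text{so } P(B) \text{ and } P(A) \text{ are also required.}
```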

Maximum Likelihood Estimation
Data: an observed set D of H heads and T tails. Hypothesis: a binomial distribution with heads probability θ. Learning: finding θ is an optimization problem. What's the objective function? MLE: choose θ to maximize the probability of D.

How to get a closed form solution? Set derivative to zero, and solve!
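Applying that recipe to the thumbtack (binomial) example gives the familiar closed form; a worked sketch of the derivation:

```latex
P(D \mid \theta) = \theta^{H}(1-\theta)^{T}, \qquad
\ell(\theta) = H \ln\theta + T \ln(1-\theta)

\frac{d\ell}{d\theta} = \frac{H}{\theta} - \frac{T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{\mathrm{MLE}} = \frac{H}{H+T}
```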

What if I have prior beliefs? Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you do for me now? You say: I can learn it the Bayesian way. Rather than estimating a single θ, we obtain a distribution over possible values of θ. (Figure panels: "In the beginning"; "Observe flips, e.g. {tails, tails}"; "After observations", showing how the distribution over θ is updated.)

Bayesian Learning
Use Bayes rule:
P(θ | D) = P(D | θ) P(θ) / P(D)
(posterior = data likelihood × prior / normalization)
Or equivalently: P(θ | D) ∝ P(D | θ) P(θ). Also, for uniform priors, P(θ) ∝ 1, so the MAP objective reduces to the MLE objective.

MAP: Maximum a Posteriori Approximation
As more data is observed, the Beta posterior becomes more certain (more sharply peaked). MAP: use the most likely parameter value to approximate the expectation.
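For concreteness, assuming the usual Beta(β_H, β_T) prior on θ (the slide only says "Beta", so the parameter names are mine), the posterior and the MAP estimate are:

```latex
P(\theta \mid D) \propto \theta^{H+\beta_H-1}(1-\theta)^{T+\beta_T-1}
\quad\Longrightarrow\quad
\theta \mid D \sim \mathrm{Beta}(\beta_H + H,\; \beta_T + T)

\hat{\theta}_{\mathrm{MAP}} = \frac{H + \beta_H - 1}{H + T + \beta_H + \beta_T - 2}
```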

What you should know: MLE vs. MAP and the relationship between the two; MLE learning vs. Bayesian learning; the thumbtack example; Gaussians.

The Naïve Bayes Classifier
Given: a prior P(Y) and n features X_1, ..., X_n that are conditionally independent given the class Y; for each X_i we have the likelihood P(X_i | Y).
Decision rule: y* = argmax_y P(y) ∏_{i=1}^n P(x_i | y).

Subtleties of Naïve Bayes
What is the hypothesis space? What kinds of functions can it learn? When does it work and when does it not (e.g., correlated features)? MLE vs. Bayesian learning of Naïve Bayes. Gaussian Naïve Bayes.
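A minimal Gaussian Naïve Bayes sketch (my own variable names and toy data; it estimates P(Y) and per-class Gaussian likelihoods, then applies the decision rule above):

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate P(Y) and per-class, per-feature Gaussian parameters for P(X_i | Y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(Y = c)
                     Xc.mean(axis=0),          # per-feature means
                     Xc.var(axis=0) + 1e-9)    # per-feature variances (smoothed)
    return params

def predict_gnb(params, x):
    """Decision rule: argmax_c log P(c) + sum_i log N(x_i | mu_ci, var_ci)."""
    def log_post(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_post)

# Toy data: two well-separated classes
X = np.array([[1.0, 2.0], [1.2, 1.9], [4.0, 5.0], [4.2, 5.1]])
y = np.array([0, 0, 1, 1])
print(predict_gnb(fit_gnb(X, y), np.array([4.1, 5.0])))  # -> 1
```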

Generative vs. Discriminative Classifiers
Want to learn h: X → Y, where X are the features and Y the target classes.
Generative classifier, e.g., Naïve Bayes: assume some functional form for P(X | Y) and P(Y); estimate the parameters of P(X | Y) and P(Y) directly from training data; use Bayes rule to calculate P(Y | X = x). This is a generative model: P(Y | X) is computed indirectly through Bayes rule, P(Y | X) ∝ P(X | Y) P(Y). As a result, it can also generate a sample of the data, since P(X) = Σ_y P(y) P(X | y).
Discriminative classifiers, e.g., logistic regression: assume some functional form for P(Y | X); estimate the parameters of P(Y | X) directly from training data. This is a discriminative model: it learns P(Y | X) directly, but it cannot generate a sample of the data, because P(X) is not available.

Linear Regression
Hypothesis: h_w(x) = w_1 x + w_0. Learning: argmin_w Loss(h_w).
Closed-form solution:
w_1 = ( N Σ_j x_j y_j − (Σ_j x_j)(Σ_j y_j) ) / ( N Σ_j x_j² − (Σ_j x_j)² )
w_0 = ( Σ_j y_j − w_1 Σ_j x_j ) / N
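A quick numerical check of those closed-form formulas (the data here is made up; np.polyfit is used only as an independent reference):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0 + np.array([0.1, -0.2, 0.05, 0.0, -0.1])  # noisy line
N = len(x)

# Closed-form least-squares slope and intercept, as on the slide
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N

print(w1, w0)                 # close to 2 and 1
print(np.polyfit(x, y, 1))    # same answer from NumPy's least-squares fit
```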

Logistic Regression
Learn P(Y | X) directly! Assume a particular functional form. A hard threshold that jumps from P(Y)=0 to P(Y)=1 is not differentiable.

Logistic Regression
Learn P(Y | X) directly! Assume a particular functional form: the logistic function, a.k.a. the sigmoid, P(Y=1 | X) = 1 / (1 + exp(−(w_0 + Σ_i w_i X_i))).

Issues in Linear and Logistic Regression
Overfitting avoidance: regularization; L1 vs. L2 regularization.

What you should know about Logistic Regression (LR)
Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function. In general, NB and LR make different assumptions. NB: features independent given the class, an assumption on P(X | Y). LR: a functional form for P(Y | X), no assumption on P(X | Y). LR is a linear classifier: the decision rule is a hyperplane. LR is optimized by conditional likelihood: there is no closed-form solution, but the objective is concave, so gradient ascent finds the global optimum. Maximum conditional a posteriori estimation corresponds to regularization. Convergence rates: GNB usually needs less data; LR usually gets to better solutions in the limit.
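A sketch of that gradient-ascent training of the conditional log-likelihood, with optional L2 regularization standing in for the MAP view (step size, iteration count, and data are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, iters=1000, lam=0.0):
    """Maximize the conditional log-likelihood sum_j [y_j log p_j + (1-y_j) log(1-p_j)]
    by gradient ascent; lam adds L2 regularization."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias feature
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - lam * w         # gradient of the (regularized) log-likelihood
        w += lr * grad
    return w

# Toy 1-D data: class 1 for larger x
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic(X, y)
print(sigmoid(np.array([1.0, 4.0]) @ w))  # high probability of class 1 at x = 4
```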

From Logistic Regression to the Perceptron: 2 Easy Steps!
Logistic regression (in vector notation): y is in {0, 1}. Perceptron: y is in {0, 1}, and y(x; w) is the prediction given w. Differences? Drop the Σ_j over training examples: online vs. batch learning. Drop the distribution: probabilistic vs. error-driven learning.

Properties of Perceptrons
Separability: some setting of the parameters classifies the training set perfectly. Convergence: if the training set is separable, the perceptron will eventually converge (binary case). (Figures: separable vs. non-separable cases.)

Problems with the Perceptron
Noise: if the data isn't separable, the weights might thrash; averaging weight vectors over time can help (the averaged perceptron). Mediocre generalization: it finds a barely separating solution. Overtraining: test/validation accuracy usually rises, then falls; overtraining is a kind of overfitting.
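For reference, the mistake-driven update the slides contrast with logistic regression, including the weight averaging mentioned above (toy data; not the course's exact formulation):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Online perceptron with weight averaging; y must be in {-1, +1}."""
    X = np.hstack([np.ones((len(X), 1)), X])   # bias feature
    w = np.zeros(X.shape[1])
    w_avg = np.zeros_like(w)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:           # mistake-driven update
                w += y_i * x_i
            w_avg += w                          # running sum for the averaged perceptron
    return w_avg / (epochs * len(X))

X = np.array([[2.0], [3.0], [-2.0], [-3.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(np.sign(np.array([1.0, 2.5]) @ w))  # +1
```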

Neural networks: what you should know
How does it learn non-linear functions? Can it learn, for example, an XOR function? Draw a neural network for it with appropriate weights. Backprop. Overfitting. What kinds of functions can it learn? The tradeoff between the number of hidden units and the number of layers.
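On the XOR question: one standard construction uses two hidden units, an OR unit and an AND unit, with the output firing when OR is on but AND is off. A hand-weighted sketch (weights chosen by me for illustration):

```python
import numpy as np

def step(z):
    """Hard-threshold activation."""
    return 1.0 if z > 0 else 0.0

def xor_net(a, b):
    """Two hidden units (OR and AND); the output fires when OR is on but AND is off."""
    x = np.array([a, b], dtype=float)
    h_or = step(np.array([1.0, 1.0]) @ x - 0.5)    # OR(a, b)
    h_and = step(np.array([1.0, 1.0]) @ x - 1.5)   # AND(a, b)
    return step(1.0 * h_or - 1.0 * h_and - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))   # 0, 1, 1, 0
```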

Linear SVM
Aim: learn a large-margin classifier. Common theme in machine learning: LEARNING IS OPTIMIZATION.
Mathematical formulation: maximize the margin 2/||w|| such that
for y_i = +1:  wᵀx_i + b ≥ 1
for y_i = −1:  wᵀx_i + b ≤ −1
(Figure: points in the (x_1, x_2) plane, with + denoting label +1 and − denoting label −1, and the margin between the supporting hyperplanes.)

Solving the Optimization Problem
Primal (Lagrangian):
minimize L_p(w, b, α) = ½ ||w||² − Σ_{i=1}^n α_i [ y_i (wᵀx_i + b) − 1 ],  s.t. α_i ≥ 0
Lagrangian dual problem:
maximize Σ_{i=1}^n α_i − ½ Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_iᵀx_j
s.t. α_i ≥ 0 and Σ_{i=1}^n α_i y_i = 0

Non-linear SVMs: Feature Space
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

Nonlinear SVMs: The Kernel Trick
With this mapping, the discriminant function is
g(x) = wᵀφ(x) + b = Σ_{i ∈ SV} α_i φ(x_i)ᵀ φ(x) + b.
There is no need to know the mapping explicitly, because we only use dot products of feature vectors in both training and testing. A kernel function is a function that corresponds to a dot product of two feature vectors in some expanded feature space:
K(x_i, x_j) = φ(x_i)ᵀ φ(x_j).

Nonlinear SVMs: The Kernel Trick
Examples of commonly used kernel functions:
Linear kernel: K(x_i, x_j) = x_iᵀ x_j
Polynomial kernel: K(x_i, x_j) = (1 + x_iᵀ x_j)^p
Gaussian (radial-basis function, RBF) kernel: K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
Sigmoid: K(x_i, x_j) = tanh(β₀ x_iᵀ x_j + β₁)
In general, functions that satisfy Mercer's condition can be kernel functions: the kernel matrix should be positive semidefinite.
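Those kernels written out as code (σ, p, β₀, β₁ are hyperparameters; the final check is only a quick numerical illustration of the positive-semidefiniteness requirement, not a proof of Mercer's condition):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

# Quick check that an RBF Gram matrix on random data is positive semidefinite
X = np.random.randn(5, 3)
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True
```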

K-Nearest Neighbor
Distance measure: most commonly Euclidean. Choosing k: increasing k reduces variance and increases bias. In high-dimensional spaces, the nearest neighbor may not be very close at all! A memory-based technique: it must make a pass through the data for each classification, which can be prohibitive for large data sets.

Nearest Neighbor
Advantages: a variable-sized hypothesis space; learning is extremely efficient (although growing a good kd-tree can be expensive); very flexible decision boundaries.
Disadvantages: the distance function must be chosen carefully; irrelevant or correlated features must be eliminated; typically cannot handle more than about 30 features; computational costs in memory and classification-time computation.
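A minimal brute-force k-NN classifier matching this description (Euclidean distance, majority vote; names and data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote over its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([4.8, 5.2]), k=3))  # -> 1
```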

Locally Weighted Linear Regression (LWLR)
Idea: k-NN forms a local approximation to f for each query point x_q. Why not form an explicit approximation for the region surrounding x_q? Fit a linear function to the k nearest neighbors (or fit a quadratic, ...), producing a "piecewise approximation" to f. Options: minimize the error over the k nearest neighbors of x_q; minimize the error over the entire set of examples, weighting by distance; or combine the two.
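A sketch of the "weight by distance" variant: a weighted least-squares fit around the query point, with a Gaussian weighting kernel whose bandwidth tau is an arbitrary choice here:

```python
import numpy as np

def lwlr_predict(X, y, x_query, tau=0.5):
    """Fit a locally weighted linear model around x_query and predict there."""
    Xb = np.hstack([np.ones((len(X), 1)), X])             # add a bias column
    xq = np.hstack([[1.0], x_query])
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))  # point weights
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)   # weighted normal equations
    return xq @ theta

# A nonlinear target approximated piecewise by local linear fits
X = np.linspace(0, 3, 30).reshape(-1, 1)
y = np.sin(X).ravel()
print(lwlr_predict(X, y, np.array([1.5])), np.sin(1.5))   # close to each other
```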

Bias-Variance-Noise Decomposition
Let h̄(x*) = E[h(x*)] be the average prediction over training sets, f(x*) the true function, and y* = f(x*) + ε. Expanding the squared error:
E[(h(x*) − y*)²]
  = E[(h(x*) − h̄(x*))²] + (h̄(x*) − f(x*))² + E[(y* − f(x*))²]
  = Var(h(x*)) + Bias(h(x*))² + E[ε²]
  = Var(h(x*)) + Bias(h(x*))² + σ²
Expected prediction error = Variance + Bias² + Noise²

Bias, Variance, and Noise
Variance: E[(h(x*) − h̄(x*))²] describes how much h(x*) varies from one training set S to another.
Bias: h̄(x*) − f(x*) describes the average error of h(x*).
Noise: E[(y* − f(x*))²] = E[ε²] = σ² describes how much y* varies from f(x*).

Bias/Variance Tradeoff
(bias² + variance) is what counts for prediction. Often: low bias ⇒ high variance (too many parameters); low variance ⇒ high bias (too few parameters). Tradeoff: bias² vs. variance.

Bagging: Bootstrap Aggregation (Leo Breiman, 1994)
Take repeated bootstrap samples from the training set D. Bootstrap sampling: given a set D containing N training examples, create D' by drawing N examples at random with replacement from D. Bagging: create k bootstrap samples D_1, ..., D_k; train a distinct classifier on each D_i; classify a new instance by majority vote (or average).
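A minimal bagging sketch following those steps; the base learner (scikit-learn's DecisionTreeClassifier) and the vote-counting details are my choices, not something the slides prescribe:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # base learner is an arbitrary choice here

def bagging_fit(X, y, k=25, seed=0):
    """Train k classifiers, each on a bootstrap sample of size N drawn with replacement."""
    rng = np.random.default_rng(seed)
    models = []
    N = len(X)
    for _ in range(k):
        idx = rng.integers(0, N, size=N)          # bootstrap sample D' of size N
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the k classifiers."""
    votes = np.stack([m.predict(X) for m in models])        # shape (k, n_queries)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
models = bagging_fit(X, y)
print(bagging_predict(models, np.array([[0.5], [4.5]])))    # [0 1]
```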