CS325 Artificial Intelligence Chs. 18 & 4 Supervised Machine Learning (cont)


CS325 Artificial Intelligence Cengiz Spring 2013

Model Complexity in Learning

[Figure: candidate fits f(x) plotted against x.]

Let's start with the linear case...

Linear Regression

price = f(size)

? = f(3000)

Regression: Finding the Parameters from Data

x    y
2    7
1    2
5   16
3    8

y = f(x) = w_1 x + w_0, with w_0 = ?, w_1 = ?

Answer: w_0 = 1, w_1 = 2

Linear Regression: Defining a Loss Function

y = f(x) = w_1 x + w_0

Loss(f) = \sum_j (y_j - f(x_j))^2 = \sum_j (y_j - (w_1 x_j + w_0))^2

The minimum is where the derivatives are zero:

\frac{\partial}{\partial w_0} \sum_j (y_j - (w_1 x_j + w_0))^2 = 0, \qquad \frac{\partial}{\partial w_1} \sum_j (y_j - (w_1 x_j + w_0))^2 = 0

The solution is:

w_1 = \frac{N \sum_j x_j y_j - (\sum_j x_j)(\sum_j y_j)}{N \sum_j x_j^2 - (\sum_j x_j)^2}, \qquad w_0 = \frac{\sum_j y_j - w_1 \sum_j x_j}{N}
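
A minimal sketch of the closed-form solution above in Python with NumPy; the data points here are assumed to lie exactly on y = 2x + 1 (the fit the slides arrive at), purely for illustration:

```python
import numpy as np

# Illustrative points assumed to lie exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 5.0])
y = 2 * x + 1

# Closed-form least-squares solution for y = w1*x + w0
N = len(x)
w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
w0 = (np.sum(y) - w1 * np.sum(x)) / N
print(w1, w0)                    # 2.0 1.0

# Sanity check against NumPy's own least-squares polynomial fit
print(np.polyfit(x, y, deg=1))   # [2. 1.]
```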

Remember Bayes Nets? We can learn them from data, too. Everybody loves spam!

Maximum Likelihood: Guessing Spam Probability

Let's guess: P(S) = \pi, so for each message label

P(y_i) = \begin{cases} \pi & \text{if } y_i = S \\ 1 - \pi & \text{if } y_i = H \end{cases}

Joint probability of the data (3 spam, 5 ham):

P(data) = \pi^{count(S)} (1 - \pi)^{count(H)} = \pi^3 (1 - \pi)^5

Take the log of both sides:

\log P(data) = 3 \log \pi + 5 \log(1 - \pi)

Find the maximum by setting the derivative to zero:

\frac{\partial \log P(data)}{\partial \pi} = \frac{3}{\pi} - \frac{5}{1 - \pi} = 0 \;\Rightarrow\; \pi = 3/8
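
A minimal sketch of this estimate in Python, using the slide's counts of 3 spam and 5 ham; the grid search is just a numerical check that the closed-form answer really maximizes the log-likelihood:

```python
import numpy as np

# Labels matching the slide's example: 3 spam (S), 5 ham (H)
labels = ["S", "S", "S", "H", "H", "H", "H", "H"]
count_s = labels.count("S")
count_h = labels.count("H")

# Closed-form maximum-likelihood estimate: pi = count(S) / N
pi_mle = count_s / (count_s + count_h)
print("MLE estimate:", pi_mle)                 # 0.375 = 3/8

# Numerical check: the log-likelihood peaks at the same value
pis = np.linspace(0.001, 0.999, 999)
loglik = count_s * np.log(pis) + count_h * np.log(1 - pis)
print("argmax over grid:", pis[np.argmax(loglik)])   # 0.375
```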

Bag of Words Representation

Maximum Likelihood: Guessing Word Probability

Finally, Back to Bayes Nets

P(S \mid M) = \alpha P(M \mid S) P(S) = \alpha P(M_1, M_2, M_3 \mid S) P(S)

With the naive Bayes assumption the likelihood factors over the words: P(M_1, M_2, M_3 \mid S) = P(M_1 \mid S) P(M_2 \mid S) P(M_3 \mid S).

Problems?

If any word in the message never appeared in spam during training, one factor is zero and the whole product collapses: P(S \mid M) = 0. We need Laplace smoothing (check the videos).
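
A minimal sketch of a naive Bayes spam filter with add-one (Laplace) smoothing; the tiny corpus below is made up for this sketch, not taken from the slides:

```python
from collections import Counter

# Tiny illustrative corpus (assumed for this sketch)
spam = ["offer money now", "free money offer", "click free offer"]
ham  = ["meeting at noon", "lunch money tomorrow", "project meeting notes",
        "see you at lunch", "notes from class"]

def word_counts(docs):
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)
k = 1  # Laplace smoothing constant

def word_prob(word, counts):
    # P(word | class) with add-k smoothing, so unseen words never give probability 0
    return (counts[word] + k) / (sum(counts.values()) + k * len(vocab))

def p_spam_given(message):
    # Naive Bayes: P(S | M) is proportional to P(S) * prod_i P(M_i | S)
    p_s = len(spam) / (len(spam) + len(ham))
    p_h = len(ham) / (len(spam) + len(ham))
    for w in message.split():
        p_s *= word_prob(w, spam_counts)
        p_h *= word_prob(w, ham_counts)
    return p_s / (p_s + p_h)      # normalization plays the role of alpha

print(p_spam_given("free money"))       # high spam probability
print(p_spam_given("project meeting"))  # low spam probability
```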

What If We Cannot Learn So Easily?

So far we calculated everything directly from the data: linear regression coefficients through an explicit solution, and Bayes net parameters through maximum likelihood. But we cannot solve every problem this way. For classification the solution is not unique:

[Figure: the same two-class data in the (x_1, x_2) plane shown twice, with two different linear boundaries that both separate the classes.]

Perceptron Also Calculates Linear Boundary

[Figure: two-class data in the (x_1, x_2) plane with a linear boundary.]

sum = \sum_{i=1}^{N} I_i W_i, \qquad y = \begin{cases} 1 & \text{if } sum \geq T \\ 0 & \text{if } sum < T \end{cases}

Perceptron, 2D Case

[Figure: two-class data in the (x_1, x_2) plane with a linear boundary.]

Line equation in 2D: x_2 = a x_1 + b, i.e., b = x_2 - a x_1.

The perceptron boundary: T = w_1 x_1 + w_2 x_2, with

y = \begin{cases} 1 & \text{if } sum \geq T \\ 0 & \text{if } sum < T \end{cases}

How to learn it?

Use the Loss Function, Perceptron

Perceptron: y = f_w(x). Over all samples:

Loss(w) = \sum_i (y_i - f_w(x_i))^2

We are trying to find \arg\min_w Loss(w). An incremental (gradient) rule:

w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Loss(w)

which for this loss works out to

w_j \leftarrow w_j + \alpha (y - f_w(x)) \, x_j

It's Called the Perceptron Learning Rule

w_j \leftarrow w_j + \alpha (y - f_w(x)) \, x_j

x         y = 0 (Tom)   y = 1 (Jerry)
Trucks        1              0
Sedans        0              1
Hybrids       0              1
SUVs          1              0

Start: w = 0, \alpha = 1, T = 1.

For y_Tom:   w_Trucks \leftarrow w_Trucks + (0 - 0) \cdot 1
For y_Jerry: w_Sedans \leftarrow w_Sedans + (1 - 0) \cdot 1
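
A minimal sketch of this rule in Python. The encoding is an assumption read off the table: each person is one training example with four binary car-type features, labeled y = 0 (Tom) and y = 1 (Jerry):

```python
FEATURES = ["Trucks", "Sedans", "Hybrids", "SUVs"]
examples = [
    ({"Trucks": 1, "Sedans": 0, "Hybrids": 0, "SUVs": 1}, 0),  # Tom
    ({"Trucks": 0, "Sedans": 1, "Hybrids": 1, "SUVs": 0}, 1),  # Jerry
]

alpha, T = 1.0, 1.0                      # learning rate and threshold from the slide
w = {f: 0.0 for f in FEATURES}           # start with all weights at zero

def predict(w, x, T):
    s = sum(w[f] * x[f] for f in FEATURES)
    return 1 if s >= T else 0

for epoch in range(3):
    for x, y in examples:
        y_hat = predict(w, x, T)
        for f in FEATURES:               # w_j <- w_j + alpha * (y - f_w(x)) * x_j
            w[f] += alpha * (y - y_hat) * x[f]

print(w)                                        # Sedans and Hybrids get positive weight
print([predict(w, x, T) for x, _ in examples])  # [0, 1]: both examples classified correctly
```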

Gradient Descent on the Loss Function

In general,

w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} Loss(w)

[Figure: a loss curve with both a local minimum and the global minimum; where will the updates end up?]

Remedies: adaptive \alpha, simulated annealing. The major problem remains local minima.
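
A minimal gradient-descent sketch on an assumed 1-D toy loss with two minima (the function is not from the slides); it shows that the starting point decides whether we reach the global minimum or get stuck in the local one:

```python
def loss(w):
    return w**4 - 3 * w**2 + w      # toy loss with a local and a global minimum

def grad(w):
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w, alpha=0.05, steps=200):
    for _ in range(steps):
        w = w - alpha * grad(w)     # w_j <- w_j - alpha * dLoss/dw_j
    return w

for w0 in (-2.0, 2.0):
    w = gradient_descent(w0)
    print(f"start {w0:+.1f} -> w = {w:+.3f}, loss = {loss(w):.3f}")
```

Starting at -2 ends near the global minimum; starting at +2 ends in the shallower local minimum, which is exactly the failure mode the slide warns about.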

What If the Boundary is Non-linear? Multi-Layer Perceptrons

[Figure: two feed-forward networks. Inputs 1 and 2 connect to hidden units 3 and 4 through weights w_{1,3}, w_{1,4}, w_{2,3}, w_{2,4}; in the larger network units 3 and 4 connect on to outputs 5 and 6 through w_{3,5}, w_{3,6}, w_{4,5}, w_{4,6}.]
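
A minimal sketch of why stacking perceptrons helps: XOR is not linearly separable, so no single threshold unit can compute it, but a two-layer network can. The hand-picked weights and the 2-2-1 layout are assumptions for illustration; the slide's figure only shows the general layered wiring:

```python
def step(s, T):
    # Threshold unit, as in the perceptron slides
    return 1 if s >= T else 0

def xor_net(x1, x2):
    h_or  = step(1 * x1 + 1 * x2, T=0.5)       # hidden unit computing OR
    h_and = step(1 * x1 + 1 * x2, T=1.5)       # hidden unit computing AND
    return step(1 * h_or - 1 * h_and, T=0.5)   # output: OR and not AND = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0
```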

Another Solution: Non-linear Kernels

Convert the feature (input) space using a non-linear kernel, e.g., radial distance.
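
A minimal sketch of that idea with an assumed toy dataset: points labeled by whether they fall inside a circle cannot be split by any single threshold on x_1 or x_2, but one threshold on the radial feature r = x_1^2 + x_2^2 separates them perfectly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)   # label: inside the circle?

r = (X**2).sum(axis=1)                            # the added non-linear (radial) feature
print("x1 range, class 1:", X[y == 1, 0].min(), X[y == 1, 0].max())
print("x1 range, class 0:", X[y == 0, 0].min(), X[y == 0, 0].max())  # heavily overlapping
print("r  split:", r[y == 1].max(), "<", r[y == 0].min())            # clean gap at 0.5
```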

Optimal Boundary? Enter Support Vector Machines

[Figure: the same two-class data shown with several possible separating lines, and with the maximum-margin boundary that the SVM selects.]

SVMs are guaranteed to find the optimal (maximum-margin) solution, a result from statistical learning theory. Kernel SVMs are especially powerful because they can search in a high-dimensional kernel space.
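
A minimal sketch using scikit-learn (an assumption of this example, not a library the slides prescribe): an RBF-kernel SVM learns a non-linear boundary on the circle data from the previous sketch:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(int)

clf = SVC(kernel="rbf", C=1.0)    # kernel trick: the boundary is non-linear in (x1, x2)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```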

So Many Methods, So Little Time... How to Choose?

Problem: choosing model complexity (kernel type, structural complexity of the MLP or SVM).

Solution: ask the data.
1. Cross-validation: divide the data into three sets: training, validation, test.
2. Regularization: add a complexity-minimization term to the loss function,

Loss = \sum_i (y_i - f(x_i))^2 + \beta \cdot (\text{number of parameters})

Both ideas are sketched in the code below.
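
A minimal sketch of both ideas on 1-D polynomial regression; the toy data, candidate degrees, and \beta value are assumptions for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 2 * x + 1 + 0.2 * rng.normal(size=60)       # underlying model is linear plus noise

# 1. Cross-validation style split: training / validation / test
idx = rng.permutation(60)
train, val, test = idx[:40], idx[40:50], idx[50:]

best_deg, best_err = None, np.inf
for deg in (1, 3, 9):                            # candidate model complexities
    coeffs = np.polyfit(x[train], y[train], deg)
    val_err = np.mean((np.polyval(coeffs, x[val]) - y[val])**2)
    if val_err < best_err:
        best_deg, best_err = deg, val_err
print("degree chosen on the validation set:", best_deg)

# 2. Regularization: penalize complexity inside the loss itself,
#    Loss = sum_i (y_i - f(x_i))^2 + beta * (number of parameters)
beta = 1.0
for deg in (1, 3, 9):
    coeffs = np.polyfit(x[train], y[train], deg)
    train_err = np.sum((np.polyval(coeffs, x[train]) - y[train])**2)
    print("degree", deg, "regularized loss:", train_err + beta * (deg + 1))
```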

Or, Get Rid of 'em Altogether: Non-parametric Models

k-nearest neighbors algorithm: keep all data points as a lookup table; k acts as a smoothing parameter.

[Figure: decision regions on the same (x_1, x_2) data for k = 1 and k = 5.]

Problems? The number of data points (everything must be stored and searched) and the number of features (distances become less informative in high dimensions).
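
A minimal k-nearest-neighbors sketch; the two-cluster data is assumed for illustration. Every training point is kept, and a query is classified by a majority vote among its k closest points:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                   # majority label

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + np.array([[2, 2]] * 50 + [[-2, -2]] * 50)
y = np.array([1] * 50 + [0] * 50)

print(knn_predict(X, y, np.array([1.5, 2.0]), k=1))    # near the class-1 cluster
print(knn_predict(X, y, np.array([-2.0, -1.5]), k=5))  # near the class-0 cluster
```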

Finally, a Totally Different One: Genetic Algorithms

Trade-offs: no local-minima problem, but it takes longer and the problem must be designed well.

Summary of Supervised Machine Learning

- Can solve problems too complex for hand-made algorithms, and gets better with more data (good for the information age).
- Supervised learning with labels: regression and classification.
- Linear regression coefficients and Bayes net parameters can be calculated from the data directly.
- Classification works by minimizing a loss function iteratively.
- Local minima are a problem for gradient descent.
- Non-linear problems can be solved with multiple boundaries (multi-layer perceptrons) or kernels.
- Support vector machines find the optimal solution faster.
- Model complexity can be controlled with cross-validation and regularization.
- Non-parametric models are good for low-dimensional problems.
- Genetic algorithms have no local-minima problem.