
CENG 793 On Machine Learning and Optimization Sinan Kalkan


Now
- Introduction to ML
  - Problem definition
  - Classes of approaches
  - K-NN
  - Support Vector Machines
  - Softmax classification / logistic regression
  - Parzen Windows
- Optimization
  - Gradient Descent approaches
  - A flavor of other approaches

Introduction to Machine Learning
Uses many figures and material from the following website: http://www.byclb.com/tr/tutorials/neural_networks/ch1_1.htm

Overview
[Figure: the training/testing pipeline. Training: data with labels (e.g., human, animal) -> extract features -> learn models or classifiers. Testing: test input -> extract features -> predict (human, animal, ...). Example features: size, texture, color, histogram of oriented gradients, SIFT, etc.]

Content
- Problem definition
- General approaches
- Popular methods
  - kNN
  - Linear classification
  - Support Vector Machines
  - Parzen windows

Problem Definition
Given:
- Data: a set of instances x_1, x_2, ..., x_n sampled from a space X ⊆ R^d.
- Labels: the corresponding label y_i for each x_i, with y_i ∈ Y ⊆ R.
Goal: learn a mapping from the space of data to the space of labels, i.e., M: X → Y, y = f(x).

Issues in Machine Learning
- Hypothesis space
- Loss / cost / objective function
- Optimization
- Bias vs. variance
- Testing / evaluation
- Overfitting, underfitting

General Approaches
- Discriminative: find a separating line (in general, a hyperplane), or fit a function to the data (regression).
- Generative: learn a model for each class.

General Approaches (cont'd)
- Discriminative: Support Vector Machines, Artificial Neural Networks, Conditional Random Fields, K-NN, Regression.
- Generative: Markov Random Fields, Bayesian Networks/Learning, clustering via Gaussian Mixture Models, Parzen Windows, etc.

General Approaches (cont'd)
[Figure: supervised vs. unsupervised learning. Supervised: labelled instances (e.g., apple, pear) -> extract features -> learn a model (e.g., SVM). Unsupervised: unlabelled instances -> extract features -> learn a model (e.g., k-means clustering).]

General Approaches (cont'd)
- Generative, supervised: Regression, Markov Random Fields, Bayesian Networks
- Generative, unsupervised: Gaussian Mixture Models, Parzen Windows
- Discriminative, supervised: Support Vector Machines, Neural Networks, Conditional Random Fields, Decision Tree Learning
- Discriminative, unsupervised: K-means, Self-Organizing Maps

Now
- Supervised, discriminative: Softmax classification / logistic regression, Support Vector Machines
- Unsupervised, generative: Parzen Windows

Feature Space

General pitfalls/problems
- Overfitting
- Underfitting
- Occam's razor


K-NEAREST NEIGHBOR

A very simple algorithm.
Advantages: no training; simple.
Disadvantages: slow testing time (more efficient versions exist); needs a lot of memory.
(Wikipedia, http://cs231n.github.io/classification/)
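As a rough illustration of these points (no training step, but a full pass over the stored data at test time), here is a minimal NumPy sketch of a k-NN classifier; the function name, the use of Euclidean distance, and majority voting are illustrative choices, not something prescribed by the slides.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Predict labels for X_test by majority vote among the k nearest
    training points (Euclidean distance). No training step is needed;
    the entire training set is kept in memory."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored sample
        nearest = np.argsort(dists)[:k]               # indices of the k closest samples
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])       # majority vote
    return np.array(preds)
```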

PARZEN WINDOWS
A non-parametric method.

Probability and density
A continuous probability density p(x) satisfies: p(x) ≥ 0 for all real x, and ∫ p(x) dx = 1.
Probability of falling between two values: P(a < x < b) = ∫_a^b p(x) dx.
In 2-D, with p(x) ≥ 0 and ∫ p(x) dx = 1, the probability for x to be inside a region R is P = ∫_R p(x) dx.

Density Estimation
Basic idea: P = ∫_R p(x') dx'.
If R is small enough that p(x) is almost constant in R: P = ∫_R p(x') dx' ≈ p(x) ∫_R dx' = p(x) V, where V is the volume of region R.
If k out of n samples fall into R, then P ≈ k/n.
From these we can write k/n ≈ p(x) V, so p(x) ≈ (k/n) / V.

Parzen Windows
Assume that R is a hypercube centered at x, and h is the length of an edge of the hypercube. Then V = h^2 in 2-D, V = h^3 in 3-D, etc.
Let us define the following window function:
w((x_i − x)/h) = 1 if |x_i − x|/h < 1/2 (in every coordinate), and 0 otherwise.
Then we can write the number of samples falling into R as:
k = Σ_{i=1}^n w((x_i − x)/h)

Parzen Windows (cont'd)
Remember: p(x) = (k/n) / V.
Using this definition of k, we can rewrite p(x) (in 2-D):
p(x) = (1/n) Σ_{i=1}^n (1/h^2) w((x_i − x)/h)
Interpretation: the density is the sum of the contributions of window functions placed at each sample.

Parzen Windows (cont'd)
The type of the window function can add different flavors. If we use a Gaussian as the window function:
p(x) = (1/n) Σ_{i=1}^n (1/(σ√(2π))) exp(−(x_i − x)^2 / (2σ^2))
Interpretation: fit a Gaussian to each sample and average.
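The estimator above translates almost directly into code. Below is a minimal sketch for 1-D data with a Gaussian window; the sample values and the bandwidth σ are made-up assumptions for illustration.

```python
import numpy as np

def parzen_gaussian_density(x, samples, sigma=0.5):
    """Parzen-window estimate: one Gaussian window centred on every sample, averaged.
    p(x) = (1/n) * sum_i (1 / (sigma * sqrt(2*pi))) * exp(-(x_i - x)^2 / (2*sigma^2))"""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    windows = np.exp(-(samples - x) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return windows.sum() / n

# Example: density estimate at x = 0 from a handful of 1-D samples
print(parzen_gaussian_density(0.0, [-1.2, -0.3, 0.1, 0.4, 2.0]))
```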


LINEAR CLASSIFICATION

Linear classification
Linear classification relies on the following score function:
f(x_i; W, b) = W x_i + b
with one row of W per class.
(http://cs231n.github.io/linear-classify/)

Linear classification
We can rewrite the score function as:
f(x_i; W) = W x_i
where the bias is implicitly represented in W and x_i (a constant 1 is appended to x_i and b becomes an extra column of W).
(http://cs231n.github.io/linear-classify/)
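A small sketch of the score function and of this bias trick; the shapes and random values are only illustrative.

```python
import numpy as np

num_classes, dim = 3, 4
W = np.random.randn(num_classes, dim)      # one row per class
b = np.random.randn(num_classes)
x = np.random.randn(dim)

scores = W @ x + b                          # f(x; W, b) = Wx + b

# Bias trick: absorb b into W by appending a constant 1 to x
W_ext = np.hstack([W, b[:, None]])          # shape (num_classes, dim + 1)
x_ext = np.append(x, 1.0)
assert np.allclose(scores, W_ext @ x_ext)   # f(x; W) = Wx with the extended vectors
```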

Linear classification: one interpretation
Since an image can be thought of as a vector, we can consider images as points in a high-dimensional space.
One interpretation of f(x_i; W, b) = W x_i + b: each row of W describes a separating line (in general, a hyperplane) for a class, and b shifts it away from the origin.
(http://cs231n.github.io/linear-classify/)

Linear classification: another interpretation
Each row of W can be interpreted as a template for its class. f(x_i; W, b) = W x_i + b calculates the inner products to find which template best fits x_i. Effectively, we are doing nearest neighbor with one prototype (template) image per class.
(http://cs231n.github.io/linear-classify/)

Loss function
A function which measures how good our weights are. Other names: cost function, objective function.
Let s_j = f(x_i; W)_j. An example loss function:
L_i = Σ_{j≠y_i} max(0, s_j − s_{y_i} + Δ)
or equivalently:
L_i = Σ_{j≠y_i} max(0, w_j^T x_i − w_{y_i}^T x_i + Δ)
This forces the score of the correct class to exceed the scores of the other classes by at least Δ (the margin).
(http://cs231n.github.io/linear-classify/)

Example
Consider the scores for x_i to be s = [13, −7, 11], with the first class being the correct one, and assume Δ = 10. Then
L_i = max(0, −7 − 13 + 10) + max(0, 11 − 13 + 10) = 0 + 8 = 8
(http://cs231n.github.io/linear-classify/)
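The arithmetic above can be checked in a few lines of NumPy (assuming, as in the cs231n example, that the first score belongs to the correct class):

```python
import numpy as np

scores = np.array([13.0, -7.0, 11.0])
correct = 0          # index of the correct class (assumed)
delta = 10.0

margins = np.maximum(0, scores - scores[correct] + delta)
margins[correct] = 0                  # the correct class does not contribute
loss = margins.sum()                  # max(0, -7-13+10) + max(0, 11-13+10) = 0 + 8 = 8
print(loss)
```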

Regularization
In practice, there are many possible solutions leading to the same loss value. Based on the requirements of the problem, we might want to penalize certain solutions. E.g.,
R(W) = Σ_i Σ_j W_{i,j}^2
which penalizes large weights. Why do we want to do that?
(http://cs231n.github.io/linear-classify/)

Combined Loss Function
The loss function becomes the data loss plus the regularization term. If you expand it:
L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i; W)_j − f(x_i; W)_{y_i} + Δ) + λ Σ_i Σ_j W_{i,j}^2
Δ and λ are hyperparameters (estimated using a validation set).
(http://cs231n.github.io/linear-classify/)
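Putting the data loss and the regularization term together, a vectorized sketch of the combined loss over a batch might look as follows; the array names and the default hyperparameter values are placeholders.

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0, lam=1e-3):
    """Multiclass hinge loss averaged over N examples plus L2 regularization.
    X: (N, D) data, y: (N,) integer labels, W: (C, D) weights."""
    scores = X @ W.T                                   # (N, C) class scores
    correct = scores[np.arange(len(y)), y][:, None]    # score of the true class
    margins = np.maximum(0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0                  # skip the j == y_i terms
    data_loss = margins.sum() / len(y)
    reg_loss = lam * np.sum(W ** 2)
    return data_loss + reg_loss
```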

Hinge Loss, or Max-Margin Loss
L = (1/N) Σ_i Σ_{j≠y_i} max(0, f(x_i; W)_j − f(x_i; W)_{y_i} + Δ) + λ Σ_i Σ_j W_{i,j}^2
(http://cs231n.github.io/linear-classify/)

Interactive Demo
http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/

An alternative formulation: SUPPORT VECTOR MACHINES
Borrowed mostly from the slides of: Machine Learning Group, University of Texas at Austin; Mingyue Tan, The University of British Columbia.

Linear Separators
Binary classification can be viewed as the task of separating classes in feature space: w^T x + b > 0 on one side, w^T x + b < 0 on the other, with decision boundary w^T x + b = 0 and classifier f(x) = sign(w^T x + b).

Linear Separators
Which of the linear separators is optimal?

Classification Margin
The distance from an example x_i to the separator is r = |w^T x_i + b| / ||w||.
Examples closest to the hyperplane are support vectors.
The margin ρ of the separator is the distance between the support vectors of the two classes.

Maximum Margin Classification
Maximizing the margin is good according to intuition. Implies that only support vectors matter; other training examples are ignorable.

Linear SVM Mathematically
[Figure: the margin width ρ between the support hyperplanes through x+ and x−.]
What we know:
w · x+ + b = +1
w · x− + b = −1
w · (x+ − x−) = 2
Hence the margin width is ρ = (x+ − x−) · w / ||w|| = 2 / ||w||.

Linear SVMs Mathematically (cont.)
Then we can formulate the optimization problem:
Find w and b such that ρ = 2/||w|| is maximized, and for all (x_i, y_i), i = 1..n: y_i (w^T x_i + b) ≥ 1.
Which can be reformulated as:
Find w and b such that Φ(w) = ||w||^2 = w^T w is minimized, and for all (x_i, y_i), i = 1..n: y_i (w^T x_i + b) ≥ 1.
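For a small, linearly separable toy problem, this primal formulation can be handed to a generic constrained solver. The sketch below uses scipy.optimize.minimize with the constraints y_i (w^T x_i + b) − 1 ≥ 0; this is only to make the formulation concrete, not how SVMs are trained in practice (specialized QP or dual solvers are used instead), and the toy data is made up.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (assumed for illustration); labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):
    w = wb[:-1]
    return 0.5 * np.dot(w, w)             # minimize ||w||^2 / 2  <=>  maximize the margin

constraints = {                            # y_i (w^T x_i + b) - 1 >= 0 for all i
    "type": "ineq",
    "fun": lambda wb: y * (X @ wb[:-1] + wb[-1]) - 1.0,
}

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), constraints=constraints, method="SLSQP")
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```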

Lagrange Multipliers
Given the following optimization problem:
minimize f(x, y) subject to g(x, y) = c
we can form the Lagrangian:
L(x, y, λ) = f(x, y) + λ (g(x, y) − c)
and set its gradient to zero: ∇_{x,y,λ} L(x, y, λ) = 0.

Lagrange Multipliers
Main intuition: the gradients of f and g are parallel at the constrained optimum, ∇f = λ ∇g.
(Figure: http://mathworld.wolfram.com/lagrangemultiplier.html)
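A short worked example (added here for illustration, not from the original slides): minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y = 1. The Lagrangian is L(x, y, λ) = x^2 + y^2 + λ(x + y − 1); setting its partial derivatives to zero gives 2x + λ = 0, 2y + λ = 0 and x + y = 1, so x = y = 1/2. At this point ∇f = (1, 1) is indeed parallel to ∇g = (1, 1).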

Lagrange Multipliers
See the following for a proof: http://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/part-c-lagrange-multipliers-and-constrained-differentials/session-40-proof-of-lagrange-multipliers/mit18_02sc_notes_22.pdf
A clear example: http://tutorial.math.lamar.edu/classes/calciii/lagrangemultipliers.aspx
A more intuitive explanation: http://www.slimy.com/~steuard/teaching/tutorials/lagrange.html


Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g., x ↦ (x, x^2)?
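A tiny sketch of that last idea: 1-D data that is not linearly separable becomes separable after the mapping x ↦ (x, x^2). The data and the threshold are invented for illustration.

```python
import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) > 1, 1, -1)     # inner vs. outer points: not separable on the line

phi = np.column_stack([x, x ** 2])     # map to a higher-dimensional space: (x, x^2)

# In the (x, x^2) plane the classes are separated by the horizontal line x^2 = 1,
# i.e. by the linear rule sign(x^2 - 1):
print(np.all(np.sign(phi[:, 1] - 1.0) == y))   # True
```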

SOFTMAX OR LOGISTIC CLASSIFIERS

Cross-entropy
Uses the cross-entropy between the target probabilities p and the estimated probabilities q:
H(p, q) = −E_p[log q] = −Σ_j p_j log q_j
Wikipedia: the cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p.

Softmax classifier: cross-entropy loss
Cross-entropy: H(p, q) = −E_p[log q] = −Σ_j p_j log q_j.
In our case, p denotes the correct probabilities of the categories: p_j = 1 for the correct label and p_j = 0 for the other categories. q denotes the estimated probabilities of the categories.
But our scores are not probabilities! One solution is the softmax function:
f(z)_i = e^{z_i} / Σ_j e^{z_j}
which maps arbitrary real scores to probabilities.
Using the normalized values, we can now define the cross-entropy loss for the classification problem:
L_i = −log( e^{f_{y_i}} / Σ_j e^{f_j} ) = −f_{y_i} + log Σ_j e^{f_j}
(http://cs231n.github.io/)
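A minimal sketch of this loss for a single example (the score values and the correct-class index are arbitrary; the maximum score is subtracted before exponentiating, anticipating the numerical-stability trick discussed below):

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    """L_i = -f_{y_i} + log sum_j exp(f_j), with the max subtracted first
    for numerical stability (this shift does not change the loss)."""
    shifted = scores - np.max(scores)
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return -shifted[correct_class] + log_sum_exp

print(softmax_cross_entropy(np.array([13.0, -7.0, 11.0]), correct_class=0))
```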

Logistic loss
A special case of cross-entropy for binary classification:
H(p, q) = −Σ_j p_j log q_j = −p log q − (1 − p) log(1 − q)
The softmax function reduces to the logistic function 1 / (1 + e^{−x}), and the loss becomes:
L_i = −log( 1 / (1 + e^{−f}) )
(http://cs231n.github.io/)
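And the binary case in a few lines, with the logistic function producing the probability q (the score and label values are arbitrary):

```python
import numpy as np

def logistic_loss(f, y):
    """Binary cross-entropy with sigmoid probability q = 1 / (1 + exp(-f)); y in {0, 1}."""
    q = 1.0 / (1.0 + np.exp(-f))
    return -(y * np.log(q) + (1 - y) * np.log(1 - q))

print(logistic_loss(2.0, 1), logistic_loss(2.0, 0))
```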

Why take the logarithm of probabilities?
- It maps probability space to logarithmic space.
- Multiplication becomes addition, and multiplication is a very frequent operation with probabilities.
- Speed: addition is more efficient.
- Accuracy: given the limited precision of floating-point numbers, addition is friendlier than multiplying many small numbers.
- Since log-probabilities are negative, to work with positive numbers we usually negate the log-probability.

Softmax classifier: one interpretation (information theory)
Cross-entropy between a true distribution and an estimated one:
H(p, q) = −Σ_x p(x) log q(x)
In our case, p = [0, ..., 1, ..., 0], containing a single 1 at the correct label. Since H(p, q) = H(p) + D_KL(p||q), and H(p) = 0 here, we are minimizing the Kullback-Leibler divergence.
(http://cs231n.github.io/)

Softmax classifier: another interpretation (probabilistic view)
P(y_i | x_i; W) = e^{f_{y_i}} / Σ_j e^{f_j}
In this view we are minimizing the negative log likelihood of the correct class; therefore, this corresponds to Maximum Likelihood Estimation (MLE).
(http://cs231n.github.io/)

Numerical Stability
The exponentials may become very large. A trick: shift the scores so that the highest value is zero, i.e., multiply numerator and denominator by a constant C with log C = −max_j f_j; this leaves the probabilities unchanged.
(http://cs231n.github.io/)
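A quick demonstration of the trick, using the example scores from the cs231n notes: the shift leaves the probabilities unchanged but keeps the exponentials in a safe range.

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])       # naive exp(f) overflows to inf
p = np.exp(f - np.max(f)) / np.sum(np.exp(f - np.max(f)))
print(p)                                   # ~[0., 0., 1.] up to floating-point precision
```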

SVM loss vs. cross-entropy loss
The SVM is happy when the classification satisfies the margin. E.g., if the score values are [10, 9, 9] or [10, −10, −10] and the margin is Δ = 1, the SVM loss is zero in both cases: the SVM objective is local. Cross-entropy always wants the correct class to score better: it is global.
(http://cs231n.github.io/)

0-1 Loss
Minimize the number of cases where the prediction is wrong:
L = Σ_i 1( argmax_j f(x_i; W, b)_j ≠ y_i )
or equivalently, for binary labels y_i ∈ {−1, +1} and a scalar score:
L = Σ_i 1( y_i f(x_i; W, b) < 0 )

Absolute Value Loss, Squared Error Loss
L_i = Σ_j |s_j − y_j|^q
q = 1: absolute value loss; q = 2: squared error loss. (Bishop)

MORE ON LOSS FUNCTIONS

Visualizing Loss Functions
Consider one of the example loss functions:
L_i = Σ_{j≠y_i} max(0, w_j^T x_i − w_{y_i}^T x_i + 1)
Since W has many dimensions, this is difficult to plot. We can, however, visualize it along one weight direction, which can give some intuition about the shape of the function. E.g., start from an arbitrary W_0, choose a direction W_1 and plot L(W_0 + α W_1) for different values of α.
(http://cs231n.github.io/)

Visualizing Loss Functions
Example: if we consider just one weight row w_0, we have many linear functions in terms of w_0, and the loss is a combination of them.
(http://cs231n.github.io/)

Visualizing Loss Functions
[Figure: the loss along one direction; the loss along two directions; the loss along two directions, averaged over many samples.]
You can see that this is a convex function, which is nice and easy to optimize. When you combine many of these in a neural network, the loss becomes non-convex.
(http://cs231n.github.io/)

Another approach for visualizing loss functions
For a binary problem with label y ∈ {−1, +1} and score f(x):
- 0-1 loss: L = 1(f(x) ≠ y), or equivalently L = 1(y f(x) < 0)
- Square loss: L = (f(x) − y)^2; in the binary case, L = (1 − y f(x))^2
- Hinge loss: L = max(1 − y f(x), 0)
- Logistic loss: L = (1/ln 2) ln(1 + e^{−y f(x)})
(Rosasco et al., 2003)
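Written as functions of the margin m = y·f(x), these four losses can be compared directly; a small sketch (the helper names are mine, and the logistic loss is scaled by 1/ln 2 as above):

```python
import numpy as np

zero_one = lambda m: (m < 0).astype(float)               # 0-1 loss
square   = lambda m: (1.0 - m) ** 2                      # square loss (binary case)
hinge    = lambda m: np.maximum(1.0 - m, 0.0)            # hinge loss
logistic = lambda m: np.log1p(np.exp(-m)) / np.log(2.0)  # logistic loss, in bits

m = np.linspace(-2, 2, 5)
for name, loss in [("0-1", zero_one), ("square", square), ("hinge", hinge), ("logistic", logistic)]:
    print(name, loss(m))
```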

SUMMARY OF LOSS FUNCTIONS


Sum up
- The 0-1 loss is not differentiable and thus not helpful at training time; it is used in testing.
- Other losses try to cover the weaknesses of the 0-1 loss.
- Hinge loss imposes a weaker constraint than cross-entropy.
- For classification: use hinge loss or cross-entropy loss.
- For regression: use squared-error loss or absolute-difference loss.

SO, WHAT DO WE DO WITH A LOSS FUNCTION?

Optimization strategies
We want to find the W that minimizes the loss function. Remember that W has lots of dimensions.
Naïve idea: random search. For a number of iterations: select a W randomly; if it leads to a better loss than the previous ones, keep it.
This yields 15.5% accuracy on CIFAR-10 after 1000 iterations (chance: 10%).
(http://cs231n.github.io/)

Optimization strategies
Second idea: random local search. Start at an arbitrary position (weights W); try an arbitrary perturbation W + δW and see if it leads to a better loss; if yes, move there.
This leads to 21.4% accuracy after 1000 iterations. This is actually a variation of simulated annealing.
A better idea: follow the gradient.
(http://cs231n.github.io/)

A quick reminder on gradients / partial derivatives
In one dimension:
df(x)/dx = lim_{h→0} (f(x + h) − f(x)) / h
In practice:
df[n]/dn ≈ (f[n + h] − f[n]) / h, or
df[n]/dn ≈ (f[n + h] − f[n − h]) / (2h)   (the centered difference works better)
In many dimensions, there are two options:
1. Compute the gradient numerically with finite differences: slow, easy, approximate.
2. Compute the gradient analytically: fast, exact, but error-prone to implement.
In practice: implement (2) and check it against (1) before testing.
(http://cs231n.github.io/)
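A sketch of that last recommendation: compute the gradient analytically and spot-check it against a centered-difference estimate. The quadratic test function is an arbitrary choice.

```python
import numpy as np

def f(w):                      # toy objective (assumed for illustration)
    return np.sum(w ** 2) + np.sum(w)

def analytic_grad(w):          # exact gradient of the toy objective
    return 2 * w + 1

def numeric_grad(f, w, h=1e-5):
    """Centered differences: (f(w + h e_k) - f(w - h e_k)) / (2h) per coordinate."""
    grad = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = h
        grad[k] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

w = np.random.randn(4)
print(np.max(np.abs(analytic_grad(w) - numeric_grad(f, w))))   # should be very small
```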

A quick reminder on gradients / partial derivatives
If you have a function of several variables, e.g., f(x, y) = x + y, you can take its derivative with respect to either x or y:
df(x, y)/dx = lim_{h→0} (f(x + h, y) − f(x, y)) / h = 1, and similarly df(x, y)/dy = 1.
In fact, we should denote them as ∂f/∂x and ∂f/∂y, since they are partial derivatives (gradients along x or y).
A partial derivative tells you the rate of change along a single dimension at a point. E.g., if ∂f/∂x = 1, a small change in x leads to the same amount of change in the value of the function.
The gradient is the vector of partial derivatives: ∇f = [∂f/∂x, ∂f/∂y].
(http://cs231n.github.io/)

A quick reminder on gradients / partial derivatives
A simple example: f(x, y) = max(x, y), so ∇f = [1(x ≥ y), 1(y ≥ x)].
Chaining: what if we have a composition of functions? E.g., f(x, y, z) = q(x, y) · z with q(x, y) = x + y:
∂f/∂x = (∂f/∂q)(∂q/∂x) = z
∂f/∂y = (∂f/∂q)(∂q/∂y) = z
∂f/∂z = q(x, y) = x + y
Back-propagation: local processing to improve the system globally. Each gate locally determines whether to increase or decrease its inputs.
(http://cs231n.github.io/)
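The chaining example in code, following the usual forward-then-backward pattern; the input values (x = −2, y = 5, z = −4) are the ones commonly used in the cs231n notes, chosen here just for concreteness.

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # add gate
f = q * z          # multiply gate

# Backward pass (chain rule), propagating df/df = 1 back through the gates
df_dq = z          # multiply gate: the gradient on q is the other input
df_dz = q          # ... and the gradient on z is q = x + y
df_dx = df_dq * 1  # add gate passes the gradient through unchanged
df_dy = df_dq * 1
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```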

Partial derivatives and backprop
[Figure: an example circuit (the sigmoid of a weighted sum) which has many gates.]
(http://cs231n.github.io/)

In fact, that was the sigmoid function. We will combine all these gates into a single gate and call it a neuron.

Intuitive effects
Commonly used operations: add gate, max gate, multiply gate.
In the multiply gate, the gradient on each input is scaled by the value of the other input, so when one input is large, the small input gets the large gradient. This has a negative effect in neural networks: when one of the weights is unusually large, it affects the other weights. Therefore, it is better to normalize the inputs / weights.
(http://cs231n.github.io/)

Optimization strategies: gradient
Move along the negative gradient (since we wish to go down):
W ← W − s ∇_W L,   s: step size
The gradient tells us the direction; choosing how much to move along that direction (the step size, also called the learning rate) is difficult to determine. If it is too small, convergence is too slow; if it is too big, you may overshoot and skip the minimum.
(http://cs231n.github.io/)

Optimization strategies: gradient
Take our loss function:
L_i = Σ_{j≠y_i} max(0, w_j^T x_i − w_{y_i}^T x_i + 1)
Its gradient with respect to w_j (j ≠ y_i) is:
∇_{w_j} L_i = 1(w_j^T x_i − w_{y_i}^T x_i + 1 > 0) x_i
For the weight row of the correct class, w_{y_i}, it is:
∇_{w_{y_i}} L_i = −( Σ_{j≠y_i} 1(w_j^T x_i − w_{y_i}^T x_i + 1 > 0) ) x_i
(http://cs231n.github.io/)
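A sketch of this gradient for a single example, directly following the two formulas above (the function name is mine; the result can be verified with the numerical gradient check described earlier):

```python
import numpy as np

def svm_loss_grad_single(W, x, y, delta=1.0):
    """Gradient of L_i = sum_{j != y} max(0, w_j^T x - w_y^T x + delta) w.r.t. W."""
    scores = W @ x
    margins = scores - scores[y] + delta
    margins[y] = 0
    active = (margins > 0).astype(float)          # indicator 1(margin > 0)
    dW = np.outer(active, x)                      # rows j != y get   1(...) * x
    dW[y] = -active.sum() * x                     # row y gets       -(sum of indicators) * x
    return dW
```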

Gradient Descent
Update the weights:
W_new ← W − s ∇_W L
This computes the gradient after seeing all examples before updating the weights, and the number of examples can be on the order of millions or billions. Alternatives:
- Mini-batch gradient descent: update the weights after, e.g., 256 examples.
- Stochastic (or online) gradient descent: update the weights after each example.
People usually use mini-batches and still call it stochastic. Performing an update after every single example for 100 examples is more expensive than performing one update for a batch of 100 examples, due to efficient matrix/vector operations.
(http://cs231n.github.io/)
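A skeleton of the mini-batch variant described above; loss_and_grad, the data arrays, and the hyperparameter defaults are placeholders to be supplied by the caller.

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, step_size=1e-3, batch_size=256, num_iters=1000):
    """Mini-batch gradient descent: sample a batch, compute loss and gradient
    on that batch only, then step along the negative gradient."""
    n = X.shape[0]
    batch_size = min(batch_size, n)
    for _ in range(num_iters):
        idx = np.random.choice(n, batch_size, replace=False)   # a random mini-batch
        loss, dW = loss_and_grad(W, X[idx], y[idx])
        W = W - step_size * dW                                  # W <- W - s * dL/dW
    return W
```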

A GENERAL LOOK AT OPTIMIZATION

Mathematical Optimization
[Figure: nested classes of optimization problems: mathematical optimization, nonlinear optimization, convex optimization, and within it least-squares and linear programming (LP).] (Rong Jin)

Mathematical Optimization (Rong Jin)

Convex Optimization (Rong Jin)

Interpretation
A convex function's value is below the line (chord) connecting any two points on its graph. (Mark Schmidt)

Another interpretation (Mark Schmidt)

Example convex functions (Mark Schmidt)

Operations that conserve convexity (Rong Jin)

||x + y||_p <= ||x||_p + ||y||_p (Rong Jin)

Why convex optimization?
We can't solve most optimization problems: e.g., NP-hard, and even high-order polynomial time is too slow.
Convex optimization problems generally have no analytic solution, but there are efficient algorithms to find the (global) solution. Interior-point methods (basically iterated Newton) can be used, at a cost of roughly 10-100 iterations times max{p^3, p^2 m, F}, where F is the cost of evaluating the objective and constraint functions. At worst, solve with general interior-point methods (e.g., CVX); specialized solvers are faster.

Convex Function
It is easy to see why convexity allows for an efficient solution: just slide down the objective function as far as possible and you will reach a minimum.

Convex vs. Non-convex
Example: an affine function is the border case of convexity; it is convex, and its minimum is easy to find.

Convex vs. Non-convex
Example: in a non-convex problem it is easy to get stuck in a local minimum; we can't rely only on local search techniques.

Non-convex
Some non-convex problems are highly multi-modal or NP-hard. We could be forced to search over all solutions, or hope that a stochastic search is successful. We cannot guarantee the best solution, and the search is inefficient; it is also harder to make performance guarantees with approximate solutions.

Least-squares (within the hierarchy of optimization problems)
- Analytical solution
- Good algorithms and software
- High accuracy and high reliability
- Time complexity: C·n^2·k
A mature technology! (Rong Jin)

Linear programming (within the hierarchy of optimization problems)
- No analytical solution
- Algorithms and software: reliable and efficient
- Time complexity: C·n^2·m
Also a mature technology! (Rong Jin)

Convex optimization (within the hierarchy of optimization problems)
- No analytical solution
- Algorithms and software: reliable and efficient
- Time complexity: roughly max{n^3, n^2 m, F}
Almost a mature technology! (Rong Jin)

Nonlinear optimization (within the hierarchy of optimization problems)
Sadly, there are no effective general methods to solve it; only approaches with some compromise:
- Local optimization: more art than technology
- Global optimization: greatly compromised efficiency
Help from convex optimization: 1) initialization, 2) heuristics, 3) bounds.
Far from a technology! (Something to avoid.) (Rong Jin)

Why Study Convex Optimization?
"If not, there is little chance you can solve it." -- Section 1.3.2, p. 8, Convex Optimization (Rong Jin)

Recognizing Convex Optimization Problems (Rong Jin)

Least-squares (Rong Jin)

Analytical Solution of Least-squares
For the least-squares objective f_0(x) = ||Ax − b||^2, set the derivative to zero:
df_0(x)/dx = 2 A^T A x − 2 A^T b = 0
A^T A x = A^T b
Then solve this system of linear equations (the normal equations).
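A sketch of this closed-form solution via the normal equations (random data for illustration; in practice np.linalg.lstsq is the numerically preferred route):

```python
import numpy as np

A = np.random.randn(100, 5)
b = np.random.randn(100)

x = np.linalg.solve(A.T @ A, A.T @ b)                         # solve  A^T A x = A^T b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True
```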

Linear Programming (LP) (Rong Jin)

To sum up
- We introduced some important concepts in machine learning and optimization.
- We introduced popular machine learning methods.
- We talked about loss functions and how we can optimize them using gradient descent.