Introduction to Machine Learning
Introduction to ML - TAU 2016/7


Course Administration
- Lecturers: Amir Globerson (gamir@post.tau.ac.il), Yishay Mansour (Mansour@tau.ac.il)
- Teaching assistant: Regev Schweiger (schweiger@post.tau.ac.il)
- Course responsibilities:
  - Homework: 20% of the grade, done in pairs
  - Final exam: 80% of the grade, February 1st, 2017, done alone

Today's Lecture
- Overview of machine learning and the course
- Motivation for ML
- Basic concepts
- Bird's-eye view of ML and the material we will cover
- Outline of the course

Machine Learning Is Everywhere
- Web search
- Speech recognition
- Language translation
- Image recognition
- Recommendation systems
- Self-driving cars
- Drones
Basic technology in many applications.


Typical ML Tasks: Classification
- Domain: instances with labels
  - Emails: spam / not spam
  - Patients: ill / healthy
  - Credit cards: legitimate / fraud
- Input: training data (examples from the domain, selected randomly)
- Output: a predictor that classifies new instances
- Supervised learning
- Labels: binary, discrete, or continuous; a unique label or multiple labels per instance (multiple labels: news categories)
- Predictors: linear classifiers, decision trees, neural networks, nearest neighbor

Typical ML Tasks: Clustering and Control
- Clustering:
  - No observed label; many times the label is hidden
  - Examples: user modeling, document classification
  - Input: training data; output: a clustering (a partition of the space)
  - A somewhat fuzzy goal; unsupervised learning
- Control:
  - The learner affects the environment (examples: robots, driving cars)
  - Interactive environment: what you see depends on what you do
  - Exploration/exploitation; reinforcement learning
  - Not in the scope of this class

Why Do We Need Machine Learning?
- Some tasks are simply too difficult to program
- Machine learning gives a conceptual alternative to programming
- Becoming easier to justify over the years
- What is ML? Arthur Samuel (1959): "Field of study that gives computers the ability to learn without being explicitly programmed"

Building an ML Model
- Domain:
  - Instances: attributes and labels
  - There is no substitute for having the right attributes
- Distribution: how instances are generated
  - Basic assumption: the future resembles the past (independence assumption)
- Sample: training/testing
- Hypothesis: how we classify; our output is in {0,1} or [0,1]
- Comparing hypotheses: more accurate is better
  - Tradeoffs: many small errors or a few large errors? False positives / false negatives

ML Model: Complete Information
- Notation: x = instance attributes, y = instance label, D(x,y) = joint distribution
- Assume we know D! Specifically the distributions D+ and D-, with D = λD+ + (1-λ)D-
- Binary classification: two labels, + and -
- Task: given x, predict y; compute Pr[+ | x]
- Bayes rule: Pr[+ | x] = Pr[x | +] Pr[+] / Pr[x] = λD+(x) / (λD+(x) + (1-λ)D-(x))
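
A minimal sketch of this computation, assuming two illustrative univariate Gaussian class-conditional densities for D+ and D- and a prior λ (all values below are made up):

```python
import numpy as np
from scipy.stats import norm

lam = 0.3                            # prior Pr[+] = lambda
d_plus = norm(loc=2.0, scale=1.0)    # illustrative D+
d_minus = norm(loc=-1.0, scale=1.5)  # illustrative D-

def posterior_plus(x):
    """Bayes rule: Pr[+|x] = lam*D+(x) / (lam*D+(x) + (1-lam)*D-(x))."""
    num = lam * d_plus.pdf(x)
    return num / (num + (1 - lam) * d_minus.pdf(x))

print(posterior_plus(1.0))  # posterior probability that the label of x = 1.0 is +
```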

ML Model: Complete Information: How Do We Predict?
- We can compute Pr[+ | x]; the prediction depends on our goal!
- Minimizing errors (0-1 loss): if Pr[+ | x] > Pr[- | x] then predict +, else predict -
  - Insensitive to probabilities: 0.99 and 0.51 are predicted the same
- Absolute loss:
  - Our output: a real number q; loss |y - q|, where p = Pr[+ | x] and y = 1 (for +) or 0 (for -)
  - Expected absolute loss: L(q) = p(1-q) + (1-p)q
  - What is the optimal q? Since L is linear in q, we still predict q = 1 when p > 1/2

ML Model: Complete Information: Quadratic Loss
- Our output: a real number q; loss (y-q)^2, where p = Pr[+ | x] and y = 1 (for +) or 0 (for -)
- Expected quadratic loss: L(q) = p(1-q)^2 + (1-p)q^2
- What is the optimal q? We know p! Minimize the loss by computing the derivative:
  - L'(q) = -2p(1-q) + 2(1-p)q, and L'(q) = 0 gives q = p
  - L''(q) = 2p + 2(1-p) = 2 > 0
- We found a minimum at q = p
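
As a quick numerical sanity check of this derivation (a sketch with an arbitrary value of p):

```python
import numpy as np

p = 0.7  # illustrative Pr[+|x]

def expected_quadratic_loss(q, p):
    # L(q) = p*(1-q)^2 + (1-p)*q^2
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

qs = np.linspace(0.0, 1.0, 1001)
best_q = qs[np.argmin(expected_quadratic_loss(qs, p))]
print(best_q)  # approximately 0.7, matching the analytic minimizer q = p
```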

ML Model: Complete Information: Logarithmic Loss
- Our output: a real number q, with p = Pr[+ | x]
- Loss: -log(q) for y = +, and -log(1-q) for y = -
- Expected logarithmic loss: L(q) = -p log(q) - (1-p) log(1-q)
- What is the optimal q? We know p! Minimize the loss by computing the derivative:
  - L'(q) = -p/q + (1-p)/(1-q), and L'(q) = 0 gives q = p
  - L''(q) = p/q^2 + (1-p)/(1-q)^2 > 0
- We found a minimum at q = p

Estimating Hypothesis Error
- Consider a hypothesis h, and assume h(x) ∈ {0,1}
- Error: Pr[h(x) ≠ y]
- How can we estimate the error? Take a sample S from D: S = {(x_i, y_i) : 1 ≤ i ≤ m}
- Observed error: (1/m) Σ_{i=1}^{m} I(h(x_i) ≠ y_i)
- How accurate is the observed error?
  - Laws of large numbers: converges in the limit
  - We want accuracy for a finite sample
  - Tradeoff: error vs. sample size
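
A minimal sketch of this estimate on synthetic data (the hypothesis and the data-generating rule below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: x in R^2, true label is 1 iff x1 + x2 > 0
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# An (imperfect) hypothesis: predict using the first coordinate only
def h(X):
    return (X[:, 0] > 0).astype(int)

# Observed error: (1/m) * sum of indicators I(h(x_i) != y_i)
observed_error = np.mean(h(X) != y)
print(observed_error)
```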

Generalization Bounds
- Markov inequality: for a random variable Z > 0, Pr[Z > α] ≤ E[Z] / α
- Chebyshev inequality: Pr[|Z - E[Z]| ≥ β] ≤ Var(Z) / β^2
- Recall: Var(Z) = E[(Z - E[Z])^2] = E[Z^2] - E^2[Z]

Generalization Bounds: Chernoff/Hoeffding Inequality
- Let Z_i be i.i.d. Bernoulli random variables with parameter p: E[Z_i] = p
- Pr[ |(1/m) Σ_{i=1}^{m} Z_i - p| ≥ ε ] ≤ 2 exp(-2ε^2 m)
- A quantitative central limit theorem
- How to think about the bound: suppose we can set the sample size m; the accuracy parameter is ε; we want the bound to hold with high probability, i.e., probability 1-δ:
  - δ = 2 exp(-2ε^2 m), so m = (1/(2ε^2)) log(2/δ)
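
A small sketch that turns this bound into a sample-size calculation (the ε and δ values are just examples):

```python
import math

def hoeffding_sample_size(eps, delta):
    """m = (1 / (2 * eps^2)) * log(2 / delta), from 2*exp(-2*eps^2*m) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# e.g., estimate an error rate to within eps = 0.05, with probability 1 - delta = 0.95
print(hoeffding_sample_size(0.05, 0.05))  # about 738 samples
```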

Hypothesis Class
- Hypothesis class: a set of functions used for prediction
- Implicitly assumes something about the true labels
- Realizable case: f(x) = Σ_{i=1}^{d} a_i x_i
  - Can solve d linearly independent equations
  - No solution: the class is incorrect
[Figure: scatter plot of sample points (Y-Values)]

Hypothesis Class: Regression
- Hypothesis class: h(x) = Σ_{i=1}^{d} a_i x_i
- Only a restriction on our predictor: look for the best linear estimator
- Need to specify a loss!
- Called: linear regression
[Figure: scatter plot of sample points (Y-Values)]

Hypothesis Class: Classification
- Many possible classes:
  - Linear separator: h(x) = sign(Σ_{i=1}^{d} a_i x_i)
  - Neural network (deep learning): multiple layers
  - Decision trees
- Memorizing the sample: does it make sense?
[Figure: a decision tree with splits X_1 > 3, X_5 > 9, X_9 < 2]

Hypothesis Class: Classification (PAC Model)
- PAC model: given a hypothesis class H, find h ∈ H with near-optimal loss
  - Accuracy: ε; confidence: δ
- Error of a hypothesis: Pr[h(x) ≠ y]
- Observed error: (1/m) Σ_{i=1}^{m} I(h(x_i) ≠ y_i)
- Empirical risk minimization: select h ∈ H with the lowest observed error
- What should we expect about the true error of h? Overfitting
- Generalization ability: the difference between the true and observed error

Hypothesis Class: Generalization
- Why is the single-hypothesis bound not enough? The difference between:
  - For every h: Pr[obs-error > ε] < δ
  - Pr[for every h: obs-error > ε] < δ
- We need the bound to hold for all hypotheses simultaneously
- Generalization bounds, how they look:
  - Finite class H: m = (1/(2ε^2)) log(2|H|/δ)
  - Infinite class: the VC dimension replaces log|H|
    - Approximately the number of free parameters
    - A hyperplane in d dimensions has VC dimension d+1
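
Extending the earlier sample-size sketch to the finite-class bound (a union bound over |H| hypotheses; the numbers below are illustrative):

```python
import math

def finite_class_sample_size(eps, delta, class_size):
    """m = (1 / (2 * eps^2)) * log(2 * |H| / delta)."""
    return math.ceil(math.log(2 * class_size / delta) / (2 * eps ** 2))

# e.g., |H| = 10**6 hypotheses, eps = 0.05, delta = 0.05
print(finite_class_sample_size(0.05, 0.05, 10 ** 6))  # the sample size grows only with log|H|
```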

Model Selection: Complexity vs. Generalization
- How to select from a rich hypothesis class? How can we approach it systematically?
- Examples: intervals, polynomials
- Adding a penalty:
  - A bound on overfitting: SRM
  - Using a prior: MDL
[Figure: training and testing error as a function of model complexity]

Bayesian Inference
- A complete world: the hypothesis is selected from a prior
- The prior is a distribution over F, the class of target functions
- Given a sample, we can update: compute a posterior
- Given an instance: predict using the posterior
- Using Bayes rule: Pr[f | S] = Pr[S | f] Pr[f] / Pr[S]
  - Pr[f]: prior
  - Pr[f | S]: posterior
  - Pr[S]: probability of the data (normalization)

Bayesian Inference (continued)
- Using the posterior:
  - Compute the exact posterior: computationally expensive
  - Maximum A Posteriori (MAP): the single most likely hypothesis
  - Maximum Likelihood (ML): the single hypothesis that maximizes the probability of the observation
- Bayesian networks: a compact joint distribution
- Naïve Bayes: attributes are independent given the label
- Where do we get the prior from?

Nearest Neighbor
- A most intuitive approach: no hypothesis class; the raw data is the hypothesis
- Given a point x: find the k nearest points (needs an informative metric) and take a weighted average of their labels
- Performance: generalization ability; there are bad examples, but it is reasonable in practice
- Requires normalization (of the metric): measuring height in cm or in meters?
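
A minimal sketch of the distance-weighted k-nearest-neighbor prediction described above (the data and the weighting scheme are illustrative choices):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    """Weighted average of the labels of the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-12)  # closer points get larger weight
    return np.average(y_train[nearest], weights=weights)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(float)
print(knn_predict(X_train, y_train, np.array([0.5, -0.2])))  # score in [0, 1]
```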

How Can We Normalize?
- Actually, it is data dependent; for now, individual attributes:
  - For a Gaussian-like attribute: x_i → (x_i - μ)/σ, mapping N(μ,σ) to N(0,1)
  - For a uniform-like attribute: x_i → (x_i - x_min)/(x_max - x_min), mapping [a,b] to [0,1]
  - For an exponential-like attribute: x_i → x_i/λ, mapping exp(λ) to exp(1)
- What about categorical attributes (e.g., color)?
- Mahalanobis distance: multi-attribute Gaussian
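
A small sketch of these per-attribute normalizations on synthetic data (each column is assumed to roughly follow the corresponding distribution):

```python
import numpy as np

rng = np.random.default_rng(2)
x_gauss = rng.normal(loc=170, scale=10, size=1000)  # e.g., heights in cm
x_unif = rng.uniform(low=3, high=8, size=1000)
x_expo = rng.exponential(scale=2.0, size=1000)

z_gauss = (x_gauss - x_gauss.mean()) / x_gauss.std()              # roughly N(0, 1)
z_unif = (x_unif - x_unif.min()) / (x_unif.max() - x_unif.min())  # roughly [0, 1]
z_expo = x_expo / x_expo.mean()                                   # roughly exp(1)

print(z_gauss.mean(), z_gauss.std(), z_unif.min(), z_unif.max(), z_expo.mean())
```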

ML and Optimization
- Optimization is an important part of ML; both local and global optima are critical in minimizing the loss (ERM)
- Efficient methods (what is efficient varies): linear and convex optimization, gradient-based methods
- Overcoming computational hardness:
  - Example 1: minimizing the error with a hyperplane is NP-hard, but minimizing the hinge loss is a linear program (the hinge loss is the distance from the threshold)
  - Example 2: sparsity
  - Example 3: kernels

Unsupervised Learning
- We do not have labels (think: users' behavior)
- Some applications: labels are not observable
- Somewhat ill-defined: you can cluster in many ways
- Example goal: to better understand the data; segment the data
- Less emphasis on generalization
- Toy example (2D)

Clustering: Unsupervised learning

Unsupervised learning: Clustering

Unsupervised Learning
- Goal: recover a hidden structure; use it (maybe) to predict
- Method: assume a generative structure and recover the parameters
- Example: a mixture of k Gaussian distributions
  - Each Gaussian should be a cluster
  - Recover using k-means

Unsupervised Learning: k-Means
- Start with k random points as centers
- For each point, compute the nearest center
- Re-estimate each center as the average of the points assigned to it
- Continue until no more changes
- Objective function: sum of distances from points to their centers
- The algorithm terminates: the objective never increases and cannot decrease forever
- Running time: there are bad examples, but it works fairly fast in practice
- A special case of the Expectation-Maximization (EM) algorithm
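
A minimal sketch of the k-means iteration described above, on synthetic 2D data (empty clusters are not handled in this sketch):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k random points as centers
    for _ in range(n_iter):
        # Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Re-estimate each center as the average of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # no more changes
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (0.0, 5.0)])
centers, labels = kmeans(X, k=2)
print(centers)
```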

Singular Value Decomposition (SVD)
- X is the data matrix: rows are attributes, columns are instances; XX^T is the correlation matrix
- SVD: X = U D V^T, with D diagonal (the eigenvalues / singular values)
- Low-rank approximation: truncate D; optimal in the Frobenius norm
- Recommendations: rows are users, columns are movies, entries are ratings
  - Predictor: each user has a vector u, each movie has a vector v, and the rating is <u, v>
  - SVD recovers the user and movie vectors
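
A short sketch of a truncated-SVD low-rank approximation (the matrix and the rank are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 40))  # toy data / "ratings" matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 5                                        # keep the top-r singular values
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]  # best rank-r approximation in Frobenius norm

print(np.linalg.norm(X - X_r))  # reconstruction error
```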

Principal Component Analysis (PCA)
- Dimension reduction: d → d', with d >> d'
- Intuition: try to capture the structure
- Objective: minimize the reconstruction error
  - PCA: min_U Σ_{i=1}^{m} ||x_i - U U^T x_i||^2, where U is a d × d' matrix with orthonormal columns
- Mapping x → U^T x, such that U U^T x ≈ x
[Figure: the Iris data set projected to 2D by PCA]
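
A small sketch of PCA via SVD of the centered data, assuming scikit-learn is available for the Iris data set (any data matrix would do):

```python
import numpy as np
from sklearn.datasets import load_iris  # assumed available; any toy data set works

X = load_iris().data     # m x d data matrix (here rows are instances)
Xc = X - X.mean(axis=0)  # center the data

# Principal directions = top right-singular vectors of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T             # d x 2 matrix with orthonormal columns

Z = Xc @ W               # the 2D projection (the "2D PCA" of the Iris data)
reconstruction_error = np.linalg.norm(Xc - Z @ W.T) ** 2
print(Z.shape, reconstruction_error)
```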

Structure of the Course
1. Introduction
2. PAC: generalization
3. PAC: VC bounds
4. Perceptron: linear classifier
5. Support Vector Machines (SVM)
6. Kernels
7. Stochastic gradient descent
8. Boosting
9. Decision trees
10. Regression
11. Generative models, EM
12. Clustering, k-means
13. PCA

ML Books