Foundations of Machine Learning
1 Introduction to ML. Mehryar Mohri, Courant Institute and Google Research.
2 Logistics. Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about 3-4 homework assignments + project (topic of your choice). Mailing list: join as soon as possible.
3 Course Material. Textbook. Slides: course web page.
4 This Lecture. Basic definitions and concepts. Introduction to the problem of learning. Probability tools.
5 Machine Learning. Definition: computational methods using experience to improve performance. Experience: data-driven task, thus statistics, probability, and optimization. Computer science: learning algorithms, analysis of complexity, theoretical guarantees. Example: use a document's word counts to predict its topic.
6 Examples of Learning Tasks. Text: document classification, spam detection. Language: NLP tasks (e.g., morphological analysis, POS tagging, context-free parsing, dependency parsing). Speech: recognition, synthesis, verification. Image: annotation, face recognition, OCR, handwriting recognition. Games (e.g., chess, backgammon). Unassisted control of vehicles (robots, cars). Medical diagnosis, fraud detection, network intrusion.
7 Some Broad ML Tasks. Classification: assign a category to each item (e.g., document classification). Regression: predict a real value for each item (prediction of stock values, economic variables). Ranking: order items according to some criterion (relevant web pages returned by a search engine). Clustering: partition data into homogeneous regions (analysis of very large data sets). Dimensionality reduction: find a lower-dimensional manifold preserving some properties of the data.
8 General Objectives of ML. Theoretical questions: what can be learned, and under what conditions? Are there learning guarantees? Analysis of learning algorithms. Algorithms: more efficient and more accurate algorithms; dealing with large-scale problems; handling a variety of different learning problems.
9 This Course. Theoretical foundations: learning guarantees; analysis of algorithms. Algorithms: the main mathematically well-studied algorithms; discussion of their extensions. Applications: illustration of their use.
10 Topics. Probability tools, concentration inequalities. PAC learning model, Rademacher complexity, VC-dimension, generalization bounds. Support vector machines (SVMs), margin bounds, kernel methods. Ensemble methods, boosting. Logistic regression and conditional maximum entropy models. On-line learning, weighted majority algorithm, Perceptron algorithm, mistake bounds. Regression, generalization, algorithms. Ranking, generalization, algorithms. Reinforcement learning, MDPs, bandit problems and algorithms.
11 Definitions and Terminology. Example: an item, an instance of the data used. Features: attributes associated with an item, often represented as a vector (e.g., word counts). Labels: category (classification) or real value (regression) associated with an item. Data: training data (typically labeled); test data (labeled, but labels not seen); validation data (labeled, for tuning parameters).
12 General Learning Scenarios. Settings: batch: the learner receives the full (training) sample, which he uses to make predictions for unseen points; on-line: the learner receives one sample at a time and makes a prediction for that sample. Queries: active: the learner can request the label of a point; passive: the learner receives labeled points.
13 Standard Batch Scenarios. Unsupervised learning: no labeled data. Supervised learning: uses labeled data for prediction on unseen points. Semi-supervised learning: uses labeled and unlabeled data for prediction on unseen points. Transduction: uses labeled and unlabeled data for prediction on seen points.
14 Example - SPAM Detection. Problem: classify each message as SPAM or non-SPAM (binary classification problem). Potential data: large collection of SPAM and non-SPAM messages (labeled examples).
15 Learning Stages. [Diagram: labeled data is split into a training sample, validation data, and a test sample; the algorithm A(θ), using features and prior knowledge, is trained on the training sample; parameter selection on the validation data yields A(θ0); evaluation is carried out on the test sample.]
16 This Lecture. Basic definitions and concepts. Introduction to the problem of learning. Probability tools.
17 Definitions. Spaces: input space $X$, output space $Y$. Loss function: $L \colon Y \times Y \to \mathbb{R}$; $L(\hat{y}, y)$ is the cost of predicting $\hat{y}$ instead of $y$. Binary classification: 0-1 loss, $L(y, y') = 1_{y \neq y'}$. Regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$. Hypothesis set: $H \subseteq Y^X$, subset of functions out of which the learner selects his hypothesis; depends on features; represents prior knowledge about the task.
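As a concrete illustration of the two losses just defined, here is a minimal Python sketch (illustrative code; the function names and sample values are arbitrary choices, not from the slides):

```python
import numpy as np

def zero_one_loss(y_pred, y):
    """0-1 loss for binary classification: 1 if the prediction differs from the label."""
    return (y_pred != y).astype(float)

def squared_loss(y_pred, y):
    """Squared loss for regression: (y' - y)^2."""
    return (y_pred - y) ** 2

# Example: two predictions scored against two labels.
print(zero_one_loss(np.array([1, 0]), np.array([1, 1])))       # [0. 1.]
print(squared_loss(np.array([2.5, 0.0]), np.array([3.0, 1.0])))  # [0.25 1.]
```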
18 Supervised Learning Set-Up. Training data: sample $S$ of size $m$ drawn i.i.d. from $X \times Y$ according to distribution $D$: $S = ((x_1, y_1), \ldots, (x_m, y_m))$. Problem: find a hypothesis $h \in H$ with small generalization error. Deterministic case: the output label is a deterministic function of the input, $y = f(x)$. Stochastic case: the output is a probabilistic function of the input.
19 Errors. Generalization error: for $h \in H$, it is defined by $R(h) = \mathbb{E}_{(x,y) \sim D}[L(h(x), y)]$. Empirical error: for $h \in H$ and sample $S$, $\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i)$. Bayes error: $R^\star = \inf_{h \text{ measurable}} R(h)$; in the deterministic case, $R^\star = 0$.
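The distinction between $R(h)$ and $\widehat{R}(h)$ can be made concrete with a small simulation: a fixed threshold hypothesis is scored on a small sample (empirical error) and on a very large sample from the same distribution, which approximates the generalization error by Monte Carlo (illustrative sketch; the distribution and hypothesis are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    """A fixed threshold hypothesis."""
    return (x > 0.4).astype(int)

def sample(m):
    """D: x uniform on [0, 1]; deterministic label y = 1{x > 0.5}."""
    x = rng.random(m)
    return x, (x > 0.5).astype(int)

x_tr, y_tr = sample(20)
emp_err = np.mean(h(x_tr) != y_tr)      # empirical error on a small sample
x_big, y_big = sample(10**6)
gen_err = np.mean(h(x_big) != y_big)    # Monte Carlo estimate of R(h); exactly 0.1 here
print(emp_err, gen_err)
```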
20 Noise. Noise: in binary classification, for any $x \in X$, $\mathrm{noise}(x) = \min\{\Pr[1 \mid x], \Pr[0 \mid x]\}$. Observe that $\mathbb{E}[\mathrm{noise}(x)] = R^\star$.
21 Learning ≠ Fitting. Notion of simplicity/complexity. How do we define complexity?
22 Generalization. Observations: the best hypothesis on the sample may not be the best overall; generalization is not memorization; complex rules (very complex separation surfaces) can be poor predictors; trade-off: complexity of hypothesis set vs. sample size (underfitting/overfitting).
23 Model Selection. General equality: for any $h \in H$, with $h^\star$ the best in class, $R(h) - R^\star = \underbrace{[R(h) - R(h^\star)]}_{\text{estimation}} + \underbrace{[R(h^\star) - R^\star]}_{\text{approximation}}$. Approximation: not a random variable, depends only on $H$. Estimation: the only term we can hope to bound.
24 Empirical Risk Minimization. Select a hypothesis set $H$. Find a hypothesis $h \in H$ minimizing the empirical error: $h = \operatorname{argmin}_{h \in H} \widehat{R}(h)$. But $H$ may be too complex, and the sample size may not be large enough.
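A minimal ERM instance: threshold classifiers on the line, with the empirical 0-1 error minimized by exhaustive search over a finite grid of thresholds (illustrative sketch; the data and grid are arbitrary):

```python
import numpy as np

def erm_threshold(x, y, thresholds):
    """Return the threshold t minimizing the empirical 0-1 error of h_t(x) = 1{x > t}."""
    errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
    best = int(np.argmin(errors))
    return thresholds[best], errors[best]

rng = np.random.default_rng(1)
x = rng.random(100)                # x ~ U[0, 1] (arbitrary choice)
y = (x > 0.5).astype(int)          # noiseless labels, so ERM recovers t ~ 0.5
t, err = erm_threshold(x, y, np.linspace(0.0, 1.0, 101))
print(t, err)                      # err = 0.0 on this separable sample
```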
25 Generalization Bounds. Definition: upper bound on $\Pr\big[\sup_{h \in H} |R(h) - \widehat{R}(h)| > \epsilon\big]$. Bound on the estimation error for the hypothesis $h_0$ returned by ERM: $R(h_0) - R(h^\star) = R(h_0) - \widehat{R}(h_0) + \widehat{R}(h_0) - R(h^\star) \le R(h_0) - \widehat{R}(h_0) + \widehat{R}(h^\star) - R(h^\star) \le 2 \sup_{h \in H} |R(h) - \widehat{R}(h)|$. How should we choose $H$? (model selection problem)
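The chain of inequalities can be checked numerically on a finite hypothesis set where $R(h)$ is available in closed form (illustrative sketch; it reuses threshold classifiers with $x \sim U[0,1]$ and $y = 1_{x > 0.5}$, so $R(h_t) = |t - 0.5|$):

```python
import numpy as np

rng = np.random.default_rng(3)
thresholds = np.linspace(0.0, 1.0, 21)   # finite H of threshold classifiers h_t

x = rng.random(30)
y = (x > 0.5).astype(int)
emp = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])
true = np.abs(thresholds - 0.5)          # R(h_t) = |t - 0.5| in this setup

h0 = np.argmin(emp)                      # ERM hypothesis
hstar = np.argmin(true)                  # best in class h*
lhs = true[h0] - true[hstar]             # estimation error of ERM
rhs = 2 * np.max(np.abs(true - emp))     # 2 sup_h |R(h) - R^(h)|
print(lhs, rhs, lhs <= rhs)              # the bound holds on every draw
```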
26 Model Selection. [Plot: error as a function of the complexity of $H_\gamma$, with the estimation error growing, the approximation error shrinking, and their sum (the upper bound) minimized at an intermediate complexity.] $H = \bigcup_{\gamma} H_\gamma$.
27 Structural Risk Minimization (Vapnik, 1995). Principle: consider an infinite sequence of hypothesis sets ordered by inclusion, $H_1 \subseteq H_2 \subseteq \cdots \subseteq H_n \subseteq \cdots$, and select $h = \operatorname{argmin}_{h \in H_n,\, n \in \mathbb{N}} \widehat{R}(h) + \mathrm{penalty}(H_n, m)$. Strong theoretical guarantees, but typically computationally hard.
28 General Algorithm Families. Empirical risk minimization (ERM): $h = \operatorname{argmin}_{h \in H} \widehat{R}(h)$. Structural risk minimization (SRM): $H_n \subseteq H_{n+1}$, $h = \operatorname{argmin}_{h \in H_n,\, n \in \mathbb{N}} \widehat{R}(h) + \mathrm{penalty}(H_n, m)$. Regularization-based algorithms: for $\lambda \ge 0$, $h = \operatorname{argmin}_{h \in H} \widehat{R}(h) + \lambda \|h\|^2$.
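The regularization-based family is the simplest to instantiate; below is a sketch of ridge regression, where $\widehat{R}(h)$ is the mean squared loss of a linear hypothesis $h_w(x) = w \cdot x$ and the penalty is $\lambda \|w\|^2$ (illustrative, using the closed-form solution of the normal equations; the data is synthetic):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """argmin_w (1/m)||Xw - y||^2 + lam * ||w||^2, via the normal equations."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)
print(ridge_fit(X, y, lam=1e-2))   # close to w_true; larger lam shrinks w toward 0
```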
29 This Lecture. Basic definitions and concepts. Introduction to the problem of learning. Probability tools.
30 Basic Properties. Union bound: $\Pr[A \vee B] \le \Pr[A] + \Pr[B]$. Inversion: if $\Pr[X \ge \epsilon] \le f(\epsilon)$, then, for any $\delta > 0$, with probability at least $1 - \delta$, $X \le f^{-1}(\delta)$. Jensen's inequality: if $f$ is convex, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. Expectation: if $X \ge 0$, $\mathbb{E}[X] = \int_0^{+\infty} \Pr[X > t]\, dt$.
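The last identity is easy to verify numerically by estimating the tail probabilities from a sample and integrating them with the trapezoidal rule (illustrative sketch; the exponential distribution is an arbitrary choice with $\mathbb{E}[X] = 2$):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=10**5)     # X >= 0 with E[X] = 2

ts = np.linspace(0, 40, 401)
tail = np.array([np.mean(x > t) for t in ts])  # estimated Pr[X > t]
integral = np.sum((tail[:-1] + tail[1:]) / 2 * np.diff(ts))  # trapezoidal rule
print(np.mean(x), integral)                    # both close to 2.0
```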
31 Basic Inequalities. Markov's inequality: if $X \ge 0$ and $\epsilon > 0$, then $\Pr[X \ge \epsilon] \le \frac{\mathbb{E}[X]}{\epsilon}$. Chebyshev's inequality: for any $\epsilon > 0$, $\Pr[|X - \mathbb{E}[X]| \ge \epsilon] \le \frac{\sigma_X^2}{\epsilon^2}$.
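Both bounds are loose but universal; comparing them to the actual tail probabilities of an exponential variable shows the gap (illustrative sketch; the distribution and $\epsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=1.0, size=10**6)   # X >= 0, E[X] = 1, Var[X] = 1

eps = 3.0
print(np.mean(x >= eps), np.mean(x) / eps)             # Markov: ~0.050 <= 0.333
print(np.mean(np.abs(x - np.mean(x)) >= eps),
      np.var(x) / eps**2)                              # Chebyshev: ~0.018 <= 0.111
```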
32 Hoeffding's Inequality. Theorem: let $X_1, \ldots, X_m$ be independent random variables with the same expectation $\mu$ and $X_i \in [a, b]$ ($a < b$). Then, for any $\epsilon > 0$, the following inequalities hold: $\Pr\big[\mu - \frac{1}{m}\sum_{i=1}^m X_i > \epsilon\big] \le \exp\big(\frac{-2m\epsilon^2}{(b-a)^2}\big)$ and $\Pr\big[\frac{1}{m}\sum_{i=1}^m X_i - \mu > \epsilon\big] \le \exp\big(\frac{-2m\epsilon^2}{(b-a)^2}\big)$.
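A simulation comparing the actual deviation probability of a Bernoulli sample mean to the Hoeffding bound (illustrative sketch; $m$, $\epsilon$, and $\mu$ are arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(6)
m, eps, mu, trials = 100, 0.1, 0.5, 10**5
means = (rng.random((trials, m)) < mu).mean(axis=1)  # X_i ~ Bernoulli(mu), [a, b] = [0, 1]
print(np.mean(means - mu > eps))                     # empirical probability: ~0.018
print(np.exp(-2 * m * eps**2 / (1 - 0) ** 2))        # Hoeffding bound: ~0.135
```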
33 McDiarmid's Inequality (McDiarmid, 1989). Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$, $\sup_{x_1, \ldots, x_m, x_i'} |f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \le c_i$. Then, for all $\epsilon > 0$, $\Pr\big[|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]| > \epsilon\big] \le 2 \exp\big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\big)$.
34 Appendix.
35 Markov's Inequality. Theorem: let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$; then, for all $t > 0$, $\Pr[X \ge t\,\mathbb{E}[X]] \le \frac{1}{t}$. Proof: $\Pr[X \ge t\,\mathbb{E}[X]] = \sum_{x \ge t\mathbb{E}[X]} \Pr[X = x] \le \sum_{x \ge t\mathbb{E}[X]} \frac{x}{t\,\mathbb{E}[X]} \Pr[X = x] \le \sum_{x} \frac{x}{t\,\mathbb{E}[X]} \Pr[X = x] = \mathbb{E}\Big[\frac{X}{t\,\mathbb{E}[X]}\Big] = \frac{1}{t}$.
36 Chebyshev's Inequality. Theorem: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$; then, for all $t > 0$, $\Pr[|X - \mathbb{E}[X]| \ge t\sigma_X] \le \frac{1}{t^2}$. Proof: observe that $\Pr[|X - \mathbb{E}[X]| \ge t\sigma_X] = \Pr[(X - \mathbb{E}[X])^2 \ge t^2\sigma_X^2]$. The result follows from Markov's inequality.
37 Weak Law of Large Numbers. Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$; then, for any $\epsilon > 0$, $\lim_{n \to \infty} \Pr[|\overline{X}_n - \mu| \ge \epsilon] = 0$. Proof: since the variables are independent, $\mathrm{Var}[\overline{X}_n] = \sum_{i=1}^n \mathrm{Var}\big[\frac{X_i}{n}\big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$. Thus, by Chebyshev's inequality, $\Pr[|\overline{X}_n - \mu| \ge \epsilon] \le \frac{\sigma^2}{n\epsilon^2}$.
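The proof's Chebyshev bound $\sigma^2/(n\epsilon^2)$ can be watched shrinking alongside the actual deviation probability (illustrative sketch with standard normal variables; parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, eps, trials = 0.0, 1.0, 0.1, 10**4
for n in (10, 100, 1000):
    means = rng.standard_normal((trials, n)).mean(axis=1)
    actual = np.mean(np.abs(means - mu) >= eps)
    # Both tend to 0 as n grows; for small n the bound can exceed 1 (vacuous).
    print(n, actual, sigma2 / (n * eps**2))
```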
38 Concentration Inequalities. Some general tools for error analysis and bounds: Hoeffding's inequality (additive); Chernoff bounds (multiplicative); McDiarmid's inequality (more general).
39 Hoeffding's Lemma. Lemma: let $X \in [a, b]$ be a random variable with $\mathbb{E}[X] = 0$ and $b \neq a$. Then, for any $t > 0$, $\mathbb{E}[e^{tX}] \le e^{t^2(b-a)^2/8}$. Proof: by convexity of $x \mapsto e^{tx}$, for all $a \le x \le b$, $e^{tx} \le \frac{b-x}{b-a}e^{ta} + \frac{x-a}{b-a}e^{tb}$. Thus, $\mathbb{E}[e^{tX}] \le \mathbb{E}\big[\frac{b-X}{b-a}\big]e^{ta} + \mathbb{E}\big[\frac{X-a}{b-a}\big]e^{tb} = \frac{b}{b-a}e^{ta} + \frac{-a}{b-a}e^{tb} = e^{\phi(t)}$, with $\phi(t) = \log\big(\frac{b}{b-a}e^{ta} + \frac{-a}{b-a}e^{tb}\big) = ta + \log\big(\frac{b}{b-a} + \frac{-a}{b-a}e^{t(b-a)}\big)$.
40 (continued). Taking the derivative gives $\phi'(t) = a - \frac{a e^{t(b-a)}}{\frac{b}{b-a} - \frac{a}{b-a} e^{t(b-a)}} = a - \frac{a}{\frac{b}{b-a} e^{-t(b-a)} - \frac{a}{b-a}}$. Note that $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore, with $\alpha = \frac{-a}{b-a}$, $\phi''(t) = \frac{\alpha(1-\alpha)(b-a)^2 e^{-t(b-a)}}{[(1-\alpha)e^{-t(b-a)} + \alpha]^2} = u(1-u)(b-a)^2 \le \frac{(b-a)^2}{4}$, where $u = \frac{\alpha}{(1-\alpha)e^{-t(b-a)} + \alpha}$. There exists $0 \le \theta \le t$ such that $\phi(t) = \phi(0) + t\phi'(0) + \frac{t^2}{2}\phi''(\theta) \le \frac{t^2(b-a)^2}{8}$.
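A quick Monte Carlo check of the lemma's bound $\mathbb{E}[e^{tX}] \le e^{t^2(b-a)^2/8}$ for a centered uniform variable (illustrative; the interval and the $t$ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-1.5, 1.5, size=10**6)   # X uniform on [a, b] = [-1.5, 1.5], E[X] = 0
width = 3.0                              # b - a
for t in (0.5, 1.0, 2.0):
    mgf = np.mean(np.exp(t * x))         # Monte Carlo estimate of E[e^(tX)]
    bound = np.exp(t**2 * width**2 / 8)  # Hoeffding's lemma bound
    print(f"t={t}: {mgf:.3f} <= {bound:.3f}")
```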
41 Hoeffding's Theorem. Theorem: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $S_m = \sum_{i=1}^m X_i$ and any $\epsilon > 0$, the following inequalities hold: $\Pr[S_m - \mathbb{E}[S_m] \ge \epsilon] \le e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}$ and $\Pr[S_m - \mathbb{E}[S_m] \le -\epsilon] \le e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}$. Proof: the proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality and select $t$ to minimize $\Pr[X \ge \epsilon] = \Pr[e^{tX} \ge e^{t\epsilon}] \le \frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}}$.
42 (continued). Using this scheme and the independence of the random variables gives $\Pr[S_m - \mathbb{E}[S_m] \ge \epsilon] \le e^{-t\epsilon}\, \mathbb{E}[e^{t(S_m - \mathbb{E}[S_m])}] = e^{-t\epsilon} \prod_{i=1}^m \mathbb{E}[e^{t(X_i - \mathbb{E}[X_i])}] \le e^{-t\epsilon} \prod_{i=1}^m e^{t^2 (b_i - a_i)^2 / 8}$ (lemma applied to $X_i - \mathbb{E}[X_i]$) $= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^m (b_i - a_i)^2 / 8} \le e^{-2\epsilon^2 / \sum_{i=1}^m (b_i - a_i)^2}$, choosing $t = 4\epsilon / \sum_{i=1}^m (b_i - a_i)^2$. The second inequality is proved in a similar way.
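The exponent $-t\epsilon + t^2 \sum_i (b_i - a_i)^2 / 8$ is a quadratic in $t$, and a grid search confirms that the stated choice $t = 4\epsilon / \sum_i (b_i - a_i)^2$ attains its minimum $-2\epsilon^2 / \sum_i (b_i - a_i)^2$ (illustrative sketch; $\epsilon$ and the sum are arbitrary values):

```python
import numpy as np

eps, s = 0.5, 2.0                       # s stands for sum_i (b_i - a_i)^2
ts = np.linspace(0.01, 3.0, 10**4)
exponent = -ts * eps + ts**2 * s / 8    # log of the bound as a function of t
print(ts[np.argmin(exponent)], 4 * eps / s)   # both ~ 1.0
print(np.min(exponent), -2 * eps**2 / s)      # minimum equals -2 eps^2 / s
```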
43 Hoeffding's Inequality. Corollary: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold: $\Pr[R(h) - \widehat{R}(h) \ge \epsilon] \le e^{-2m\epsilon^2}$ and $\Pr[R(h) - \widehat{R}(h) \le -\epsilon] \le e^{-2m\epsilon^2}$. Proof: follows directly from Hoeffding's theorem. Combining these one-sided inequalities yields $\Pr\big[|\widehat{R}(h) - R(h)| \ge \epsilon\big] \le 2e^{-2m\epsilon^2}$.
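Inverting the two-sided bound, $2e^{-2m\epsilon^2} \le \delta$ as soon as $m \ge \log(2/\delta)/(2\epsilon^2)$; a one-line helper (illustrative sketch; the function name is ours):

```python
import math

def sample_size(eps, delta):
    """Smallest m with 2*exp(-2*m*eps^2) <= delta,
    i.e. |R^(h) - R(h)| <= eps with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(sample_size(eps=0.05, delta=0.01))  # 1060 examples suffice
```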
44 Chernoff's Inequality. Theorem: for any $\epsilon > 0$, any distribution $D$, and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold: $\Pr[\widehat{R}(h) \ge (1 + \epsilon) R(h)] \le e^{-m R(h) \epsilon^2 / 3}$ and $\Pr[\widehat{R}(h) \le (1 - \epsilon) R(h)] \le e^{-m R(h) \epsilon^2 / 2}$. Proof: based on Chernoff's bounding technique.
45 McDiarmid's Inequality (McDiarmid, 1989). Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$, $\sup_{x_1, \ldots, x_m, x_i'} |f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \le c_i$. Then, for all $\epsilon > 0$, $\Pr\big[|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]| > \epsilon\big] \le 2 \exp\big(\frac{-2\epsilon^2}{\sum_{i=1}^m c_i^2}\big)$.
46 (continued). Comments: the proof uses Hoeffding's lemma. Hoeffding's inequality is a special case of McDiarmid's with $f(x_1, \ldots, x_m) = \frac{1}{m}\sum_{i=1}^m x_i$ and $c_i = \frac{b_i - a_i}{m}$.
47 Jensen's Inequality. Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$. Proof: definition of convexity, continuity of convex functions, and density of finite distributions.