Foundations For Learning in the Age of Big Data. Maria-Florina Balcan


Modern Machine Learning: new applications, an explosion of data.

Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data, but only a tiny fraction can be annotated by human experts. Examples: protein sequences, billions of webpages, images.

Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data. We need techniques that best utilize the data while minimizing the need for expert/human intervention: semi-supervised learning and (inter)active learning.

Modern ML: New Learning Approaches Modern applications: massive amounts of data distributed across multiple locations.

Modern ML: New Learning Approaches. Modern applications: massive amounts of data distributed across multiple locations, e.g., video data, scientific data. Key new resource: communication.

Outline of the talk. Interactive Learning: noise-tolerant, polynomial-time active learning algorithms; implications for passive learning; learning with richer interaction. Distributed Learning: modeling communication as a key resource; communication-efficient algorithms.

Supervised Learning. E.g., decide which emails are spam and which are important (spam vs. not spam). E.g., classify objects as chairs vs. non-chairs.

Statistical / PAC learning model. A data source generates examples x_1, ..., x_m i.i.d. from a distribution D on X, labeled by an expert/oracle with target concept c*: X → {0,1}. The learning algorithm sees the labeled examples (x_1, c*(x_1)), ..., (x_m, c*(x_m)), does optimization over this sample S, and finds a hypothesis h: X → {0,1} in C. Goal: h has small error, err(h) = Pr_{x∼D}[h(x) ≠ c*(x)]. If c* ∈ C, this is the realizable case; otherwise it is the agnostic case.

Two Main Aspects in Classic Machine Learning. Algorithm design: how to optimize? Automatically generate rules that do well on observed data, e.g., boosting, SVMs, etc. Generalization guarantees, sample complexity: confidence that the rule is effective on future data, e.g., a sample of size O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))) suffices.
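To make the bound concrete, here is a small illustrative Python calculation of the O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))) sample-size estimate for linear separators in R^d (VC dimension d + 1); the constant factor `const` is an arbitrary stand-in, not part of the theorem.

```python
import math

def pac_sample_bound(vcdim, eps, delta, const=1.0):
    """Illustrative evaluation of the classic PAC bound
    m = O((1/eps) * (vcdim * log(1/eps) + log(1/delta))).
    `const` stands in for the unspecified constant factor."""
    return const / eps * (vcdim * math.log(1.0 / eps) + math.log(1.0 / delta))

# Linear separators in R^d have VC dimension d + 1.
d = 100
print(pac_sample_bound(vcdim=d + 1, eps=0.01, delta=0.05))
```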

Interactive Machine Learning: active learning; learning with more general queries; connections.

Active Learning. The learner receives raw data (e.g., images of faces and non-faces), chooses which examples to send to the expert labeler, and trains a classifier on the returned labels.

Active Learning in Practice. Text classification: active SVM (Tong & Koller, ICML 2000), e.g., request the label of the example closest to the current separator. Video segmentation (Fathi-Balcan-Ren-Rehg, BMVC 11).
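A minimal sketch (not the Tong & Koller algorithm itself) of the "query the example closest to the current separator" heuristic, using scikit-learn's LinearSVC; the synthetic pool, seed-set size, and label budget are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic pool: 2000 points in R^10 labeled by a hidden separator.
X_pool = rng.normal(size=(2000, 10))
w_true = rng.normal(size=10)
y_pool = np.sign(X_pool @ w_true)

labeled = list(range(20))                 # small labeled seed set (assumed to contain both classes)
unlabeled = set(range(20, len(X_pool)))   # the rest is unlabeled

for _ in range(50):                       # label budget of 50 queries
    clf = LinearSVC(C=1.0).fit(X_pool[labeled], y_pool[labeled])
    idx = np.array(sorted(unlabeled))
    margins = np.abs(clf.decision_function(X_pool[idx]))
    query = idx[int(np.argmin(margins))]  # example closest to the separator
    labeled.append(query)                 # ask the expert for its label
    unlabeled.remove(query)

print("pool error:", np.mean(clf.predict(X_pool) != y_pool))
```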

Provable Guarantees, Active Learning. Canonical theoretical example: thresholds on the line [CAL92, Dasgupta04]. Active algorithm: sample O(1/ε) unlabeled examples and do binary search over them. Passive supervised learning needs Ω(1/ε) labels to find an ε-accurate threshold; active learning needs only O(log 1/ε) labels. Exponential improvement.
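A toy sketch of this canonical example: learning a threshold on the line by binary search over a pool of unlabeled draws, using roughly log(1/ε) label queries instead of Ω(1/ε); the hidden threshold, pool size, and accuracy below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
true_threshold = 0.37                            # hidden target c*(x) = 1[x >= 0.37]

def label(x):                                    # the expert / oracle
    return int(x >= true_threshold)

eps = 0.01
pool = np.sort(rng.uniform(size=int(1 / eps)))   # O(1/eps) unlabeled examples

# Binary search: each query halves the interval that can contain the threshold.
lo, hi, queries = 0, len(pool) - 1, 0
while lo < hi:
    mid = (lo + hi) // 2
    queries += 1
    if label(pool[mid]) == 1:
        hi = mid
    else:
        lo = mid + 1

print(f"estimated threshold {pool[lo]:.3f} using {queries} labels")
```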

Disagreement Based Active Learning. Disagreement-based algorithms query points from the current region of disagreement and throw out hypotheses once statistically confident that they are suboptimal. First analyzed in [Balcan, Beygelzimer, Langford 06] for the A² algorithm. Lots of subsequent work: [Hanneke 07, Dasgupta-Hsu-Monteleoni 07, Wang 09, Fridman 09, Koltchinskii 10, BHW 08, Beygelzimer-Hsu-Langford-Zhang 10, Hsu 10, Ailon 12, ...]. Generic (any class), adversarial label noise. But suboptimal in label complexity and computationally prohibitive.

Poly Time, Noise Tolerant, Label Optimal AL Algos.

Margin Based Active Learning. Learning linear separators when D is log-concave in R^d. [Balcan-Long COLT 13] [Awasthi-Balcan-Long STOC 14] Realizable case: exponential improvement, only O(d log 1/ε) labels to find w of error ε when D is log-concave. Also resolves an open question on the sample complexity of ERM. Agnostic and malicious noise: a poly-time active learning algorithm outputs w with err(w) = O(η), where η = err(best linear separator). The first poly-time active learning algorithm in noisy scenarios, and the first for malicious noise [Val85] (where features are corrupted too). Improves on the noise tolerance of the previous best passive algorithms [KKMS 05], [KLS 09] as well.

Margin Based Active-Learning, Realizable Case. Draw m_1 unlabeled examples, label them, add them to W(1). Iterate k = 2, ..., s: find a hypothesis w_{k-1} consistent with W(k-1); set W(k) = W(k-1); sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}; label them and add them to W(k).
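A simplified numpy sketch of this margin-based loop in the realizable case: at each round we fit a separator to the labeled working set, then request labels only for fresh points falling inside a shrinking band around the current separator. The halving band schedule, sample sizes, and the perceptron-style stand-in for "find a consistent hypothesis" are illustrative choices, not the exact parameters of [Balcan-Long COLT 13].

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)            # hidden target separator

def oracle(X):                              # expert labels (realizable case)
    return np.sign(X @ w_star)

def consistent_separator(X, y, epochs=50):
    """Perceptron-style stand-in for 'find w consistent with W(k)'."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w / np.linalg.norm(w)

# Round 1: a small labeled bootstrap sample.
X_lab = rng.normal(size=(100, d))
y_lab = oracle(X_lab)

gamma = 1.0                                 # initial band width (illustrative)
for k in range(2, 8):                       # s rounds
    w = consistent_separator(X_lab, y_lab)
    X_new = rng.normal(size=(20000, d))     # fresh unlabeled draws
    in_band = np.abs(X_new @ w) <= gamma    # localization in instance space
    X_band = X_new[in_band][:200]           # label m_k points inside the band
    X_lab = np.vstack([X_lab, X_band])
    y_lab = np.concatenate([y_lab, oracle(X_band)])
    gamma /= 2                              # shrink the band each round

X_test = rng.normal(size=(50000, d))
w = consistent_separator(X_lab, y_lab)
print("test error:", np.mean(np.sign(X_test @ w) != oracle(X_test)))
```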

Margin Based Active-Learning, Realizable Case. Log-concave distributions: the log of the density function is concave. A wide class: the uniform distribution over any convex set, Gaussian, Logistic, etc.; they play a major role in sampling and optimization [LV 07, KKMS 05, KLT 09]. Theorem: for D log-concave in R^d, if the band widths γ_k and sample sizes m_k are set appropriately, then err(w_s) ≤ ε after s = O(log(1/ε)) rounds using O(d) labels per round. Active learning: O(d log(1/ε)) label requests. Passive learning: on the order of d/ε label requests; the active algorithm still draws on the order of d/ε unlabeled examples.

Linear Separators, Log-Concave Distributions. Fact 1: for unit vectors u and v, the disagreement Pr_{x∼D}[sign(u·x) ≠ sign(v·x)] is proportional to the angle θ(u, v). Proof idea: project the region of disagreement into the 2-dimensional space spanned by u and v and use properties of log-concave distributions in two dimensions. Fact 2: the probability mass of a margin band {x : |v·x| ≤ γ} is proportional to its width γ.

Linear Separators, Log-Concave Distributions. Fact 3: if θ(u, v) ≤ α and the band width γ is a sufficiently large constant multiple of α, then the probability that u and v disagree on a point falling outside the band {x : |v·x| ≤ γ} is only a small constant fraction of α.

Margin Based Active-Learning, Realizable Case. Iterate k = 2, ..., s: find a hypothesis w_{k-1} consistent with W(k-1); set W(k) = W(k-1); sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1}; label them and add them to W(k). Proof idea, by induction: all w consistent with W(k) have error ≤ 1/2^k; so w_k has error ≤ 1/2^k.

Proof Idea. Under a log-concave distribution, any w consistent with W(k-1), as well as w*, has error at most 1/2^{k-1}, and hence is close in angle to w_{k-1}. Choosing the band width γ_{k-1} proportional to 1/2^{k-1}, Fact 3 implies that the probability that w and w* disagree outside the band {x : |w_{k-1} · x| ≤ γ_{k-1}} is at most 1/2^{k+1}.

Proof Idea (continued). It is therefore enough to ensure that the error of w conditioned on the band is a small constant, since by Fact 2 the band itself has mass proportional to γ_{k-1}. This is a constant-accuracy learning problem inside the band, so it can be done with only O(d) labels.

Margin Based Analysis [Balcan-Long, COLT 13]. Theorem (Active, Realizable): for D log-concave in R^d, only O(d log 1/ε) labels are needed to find w with err(w) ≤ ε. This also leads to an optimal bound for ERM passive learning. Theorem (Passive, Realizable): any w consistent with a labeled sample whose size is linear in d/ε (rather than the generic d/ε · log(1/ε)) satisfies err(w) ≤ ε, with probability 1 − δ. This solves an open question for the uniform distribution [Long 95, 03], [Bshouty 09], and gives the first tight bound for poly-time PAC algorithms for an infinite class of functions under a general class of distributions [Ehrenfeucht et al., 1989; Blumer et al., 1989].

Margin Based Active-Learning, Agnostic Case. Draw m_1 unlabeled examples, label them, add them to W. Iterate k = 2, ..., s: find a hypothesis w_{k-1} with small τ_{k-1}-hinge loss w.r.t. W within a ball of radius r_{k-1} around the previous iterate (localization in concept space); clear the working set W; sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1} (localization in instance space); label them and add them to W. A hedged sketch of the localized hinge-loss step appears below.
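A hedged numpy sketch of one agnostic-case round: minimize the hinge loss over the labeled band points by projected subgradient descent, constrained to a ball around the previous iterate. The step size, iteration count, and loss scaling are illustrative and do not reproduce the exact τ_k-hinge loss and parameter schedule of [Awasthi-Balcan-Long STOC 14]; the variables in the usage comment (X_band, y_band, w_km1, r_k, tau_k) are hypothetical names for the quantities in the pseudocode above.

```python
import numpy as np

def localized_hinge_step(X_band, y_band, w_prev, radius, tau, lr=0.05, iters=500):
    """One round of the agnostic scheme (illustrative): minimize the tau-scaled
    hinge loss on the band points over the ball B(w_prev, radius)."""
    w = w_prev.copy()
    n = len(y_band)
    for _ in range(iters):
        margins = y_band * (X_band @ w) / tau
        active = margins < 1.0                        # points incurring hinge loss
        # Subgradient of (1/n) * sum_i max(0, 1 - y_i <w, x_i> / tau).
        grad = -(y_band[active, None] * X_band[active]).sum(axis=0) / (tau * n)
        w = w - lr * grad
        # Project back onto B(w_prev, radius): localization in concept space.
        diff = w - w_prev
        dist = np.linalg.norm(diff)
        if dist > radius:
            w = w_prev + diff * (radius / dist)
    return w

# Hypothetical usage inside the outer loop sketched for the realizable case:
# w_k = localized_hinge_step(X_band, y_band, w_prev=w_km1, radius=r_k, tau=tau_k)
```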

Analysis: the Agnostic Case. Theorem: for D log-concave in R^d, with appropriate settings of the radii r_k, hinge-loss scales τ_k, band widths γ_k, and sample sizes m_k, err(w_s) = O(η), where η is the error of the best linear separator. Key ideas: as before, we need the error of the new iterate within the band to be a small constant. For w in the ball around the previous iterate, the hinge loss over clean examples in the band upper-bounds the error there, while the influence of the noisy points in the band is limited; a careful variance analysis shows it suffices to set the hinge-loss scale proportional to the band width.

Analysis: Malicious Noise. Theorem: for D log-concave in R^d, with appropriate parameter settings, err(w_s) = O(η). Here the adversary can corrupt both the label and the feature part of the examples. Key ideas: as before, we need the error within the band to be a small constant; the new ingredients are soft localized outlier removal and a careful variance analysis.

Improves over Passive Learning too! Comparison of prior work vs. our work: passive learning with malicious noise [KLS 09] and agnostic noise [KKMS 05], [KLS 09]; active learning in the agnostic/malicious settings, where no prior algorithms existed (NA). Our noise tolerance is information-theoretically optimal. Slightly better results hold for the uniform distribution case.

Localization is both an algorithmic and an analysis tool! Useful for active and passive learning!

Important direction: richer interactions with the expert. Goals: better accuracy, fewer queries, natural interaction with the expert.

New Types of Interaction [Balcan-Hanneke COLT 12]. Class conditional query: the learner presents the expert labeler with a set of raw examples and a class, and the labeler returns an example of that class from the set (if one exists). Mistake query: the learner presents raw examples together with its proposed labels (e.g., dog, cat, penguin, wolf), and the labeler points out a mislabeled one (if any).

Class Conditional & Mistake Queries. Used in practice, e.g., Faces in iPhoto, but with a lack of theoretical understanding. Realizable case (folklore): far fewer queries than label requests. [Balcan-Hanneke, COLT 12]: tight bounds on the number of CCQs needed to learn in the presence of noise (agnostic and bounded noise).


Summary: Interactive Learning. First noise-tolerant, poly-time, label-efficient algorithms for high-dimensional cases [BL 13], [ABL 14]. Learning with more general queries [BH 12]. Related work: active and differentially private learning [Balcan-Feldman, NIPS 13]. Cool implications: the sample and computational complexity of passive learning; communication complexity and distributed learning.

Distributed Learning

Distributed Learning Data distributed across multiple locations. E.g., medical data

Distributed Learning. Data distributed across multiple locations; each location has a piece of the overall data pie. To learn over the combined D, the locations must communicate, and communication is expensive. (President Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress.) Important question: how much communication is needed? Plus, privacy and incentives.

Distributed PAC learning [Balcan-Blum-Fine-Mansour, COLT 2012; Best Paper runner-up]. X is the instance space; there are k players. Player i can sample from D_i, with samples labeled by c*. Goal: find h that approximates c* w.r.t. D = (1/k)(D_1 + ... + D_k), i.e., learn a good h over D with as little communication as possible. Main results: generic bounds on communication; broadly applicable, communication-efficient distributed boosting; tight results for interesting cases (intersection-closed classes, parity functions, linear separators over nice distributions); privacy guarantees.

Interesting special case to think about: k = 2, where one player has the positives and the other has the negatives. How much communication is needed, e.g., for linear separators?

Active learning algorithms with good label complexity yield distributed learning algorithms with good communication complexity. So, for linear separators under a log-concave distribution, only O(d log(1/ε)) examples need to be communicated.

Generic Results. Baseline: O((d/ε) log(1/ε)) examples, one round of communication. Each player sends O((d/(εk)) log(1/ε)) examples to player 1; player 1 finds a consistent h ∈ C, which whp has error ≤ ε w.r.t. D. Distributed boosting: only O(d log 1/ε) examples of communication.

Key Properties of Adaboost. Input: S = {(x_1, y_1), ..., (x_m, y_m)}. For t = 1, 2, ..., T: construct a distribution D_t on {x_1, ..., x_m}; run the weak algorithm A on D_t, get weak hypothesis h_t. Output H_final = sgn(Σ_t α_t h_t). D_1 is uniform on {x_1, ..., x_m}. D_{t+1} increases the weight on x_i if h_t is incorrect on x_i and decreases it if h_t is correct: D_{t+1}(i) = (D_t(i)/Z_t) e^{-α_t} if y_i = h_t(x_i), and D_{t+1}(i) = (D_t(i)/Z_t) e^{α_t} if y_i ≠ h_t(x_i). Key points: D_{t+1}(x_i) depends only on h_1(x_i), ..., h_t(x_i) and normalization factors that can be communicated efficiently. To achieve weak learning it suffices to use O(d) examples.
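A compact Python sketch of the stated weight update D_{t+1}(i) = (D_t(i)/Z_t)·e^{∓α_t}, with decision stumps as a stand-in weak learner; the stump learner and number of rounds are illustrative, but the update itself follows the formula on the slide.

```python
import numpy as np

def stump_weak_learner(X, y, D):
    """Pick the single-feature threshold stump with smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.sign(X[:, j] - thr + 1e-12)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    err, j, thr, sgn = best
    return (lambda Z, j=j, thr=thr, sgn=sgn: sgn * np.sign(Z[:, j] - thr + 1e-12)), err

def adaboost(X, y, T=20):
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1 uniform on the sample
    hs, alphas = [], []
    for _ in range(T):
        h, err = stump_weak_learner(X, y, D)
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)
        pred = h(X)
        # D_{t+1}(i) = D_t(i)/Z_t * exp(-alpha) if correct, exp(+alpha) if incorrect.
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                        # Z_t normalization
        hs.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
```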

Distributed Adaboost. Each player i has a sample S_i from D_i. For t = 1, 2, ..., T: each player sends player 1 enough data to produce a weak hypothesis h_t (for t = 1, O(d/k) examples each); player 1 broadcasts h_t to the others; each player i reweights its own distribution on S_i using h_t and sends the sum of its weights w_{i,t} to player 1; player 1 determines the number of samples to request from each player i (sampling O(d) times from the multinomial given by w_{i,t}/W_t to get the counts n_{i,t+1}).
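A hedged single-process simulation of this communication pattern: each "player" keeps its own sample and weights, sends only the scalar weight sum w_{i,t} (and its weighted error) to player 1, and player 1 decides how many examples n_{i,t+1} to request from each player by sampling a multinomial with probabilities w_{i,t}/W_t. The tiny weighted weak learner and the per-round sample size are illustrative stand-ins, not part of the protocol in [Balcan-Blum-Fine-Mansour, COLT 2012].

```python
import numpy as np

rng = np.random.default_rng(3)

def weak_learner(X, y, D):
    """Tiny weighted weak learner (illustrative): best single-coordinate sign rule."""
    best = None
    for j in range(X.shape[1]):
        for s in (1, -1):
            err = D[(s * np.sign(X[:, j])) != y].sum()
            if best is None or err < best[0]:
                best = (err, j, s)
    _, j, s = best
    return lambda Z, j=j, s=s: s * np.sign(Z[:, j])

def distributed_adaboost(local_data, T=15, per_round=60):
    """local_data: list of (X_i, y_i), one per player. Only h_t, scalar weight
    sums / errors, and the per-round example requests cross player boundaries."""
    local_w = [np.ones(len(y)) for _, y in local_data]   # each player's own weights
    hs, alphas = [], []
    for _ in range(T):
        # Player 1 requests n_{i,t+1} examples from player i, proportional to w_{i,t}/W_t.
        sums = np.array([w.sum() for w in local_w])          # the w_{i,t} messages
        counts = rng.multinomial(per_round, sums / sums.sum())
        Xs, ys = [], []
        for (X_i, y_i), w_i, n_i in zip(local_data, local_w, counts):
            if n_i == 0:
                continue
            idx = rng.choice(len(y_i), size=n_i, p=w_i / w_i.sum())
            Xs.append(X_i[idx]); ys.append(y_i[idx])
        X_t, y_t = np.vstack(Xs), np.concatenate(ys)
        D_t = np.full(len(y_t), 1.0 / len(y_t))              # uniform over the resample
        h = weak_learner(X_t, y_t, D_t)                      # player 1 computes h_t
        # Each player could also send its scalar weighted error; summed here directly.
        err = sum(w_i[h(X_i) != y_i].sum()
                  for (X_i, y_i), w_i in zip(local_data, local_w)) / sums.sum()
        err = float(np.clip(err, 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)
        # Player 1 broadcasts h_t; each player reweights its own sample locally.
        for (X_i, y_i), w_i in zip(local_data, local_w):
            w_i *= np.exp(-alpha * y_i * h(X_i))
        hs.append(h); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))
```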

Communication: a fundamental resource in distributed learning. Theorem: any class C can be learned with O(log(1/ε)) rounds, using O(d) examples and one hypothesis per round. Key: in Adaboost, O(log 1/ε) rounds suffice to achieve error ε. Theorem: in the agnostic case, one can learn to error O(OPT) + ε using only O(k log|C| log(1/ε)) examples. Key: a distributed implementation of Robust Halving, developed for learning with mistake queries [Balcan-Hanneke 12].

Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013]. k-median: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d(x, c_i). k-means: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d²(x, c_i). Key idea: use coresets, short summaries capturing the relevant information w.r.t. all clusterings. [Feldman-Langberg STOC 11] show that in the centralized setting one can construct a small coreset. Naively combining local coresets yields a global coreset whose size goes up multiplicatively with the number of sites; [Balcan-Ehrlich-Liang, NIPS 2013] give a 2-round procedure whose communication avoids this multiplicative blow-up.

Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013]. k-means: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d²(x, c_i).
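A hedged sketch of the combine-local-summaries idea (not the actual coreset construction of [Feldman-Langberg STOC 11] nor the communication-efficient procedure of [Balcan-Ehrlich-Liang, NIPS 2013]): each site summarizes its data by locally computed weighted centers, the sites send only these weighted points to a coordinator, and the coordinator runs weighted k-means on the union. Summary sizes and the synthetic sites are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_summary(X_local, summary_size):
    """Summarize a site's data as weighted points: local k-means centers
    weighted by the number of points assigned to each center."""
    km = KMeans(n_clusters=summary_size, n_init=10).fit(X_local)
    weights = np.bincount(km.labels_, minlength=summary_size).astype(float)
    return km.cluster_centers_, weights

def distributed_kmeans(sites, k, summary_size=50):
    """sites: list of local datasets. Only the (centers, weights) summaries
    are communicated, never the raw data."""
    centers, weights = zip(*(local_summary(X, summary_size) for X in sites))
    P = np.vstack(centers)                  # the combined summary
    w = np.concatenate(weights)
    km = KMeans(n_clusters=k, n_init=10).fit(P, sample_weight=w)
    return km.cluster_centers_

# Hypothetical usage: three sites, each holding a shard of the data.
rng = np.random.default_rng(4)
sites = [rng.normal(loc=c, size=(1000, 5)) for c in (0.0, 3.0, 6.0)]
print(distributed_kmeans(sites, k=3).shape)
```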

Discussion. Communication as a fundamental resource. Open questions: other learning or optimization tasks; refined trade-offs between communication complexity, computational complexity, and sample complexity; analyzing such issues in the context of transfer learning over large collections of related tasks (e.g., NELL).