Distributed Machine Learning Maria-Florina Balcan Carnegie Mellon University

Distributed Machine Learning. Modern applications involve massive amounts of data distributed across multiple locations, e.g., video data, scientific data. Key new resource: communication.

This talk: models and algorithms for reasoning about communication complexity issues. Supervised Learning: [Balcan-Blum-Fine-Mansour, COLT 2012] (Runner-up Best Paper), [TseChen-Balcan-Chau 15]. Clustering, Unsupervised Learning: [Balcan-Ehrlich-Liang, NIPS 2013], [Balcan-Kanchanapally-Liang-Woodruff, NIPS 2014].

Supervised Learning. E.g., classify which emails are spam and which are important (spam vs. not spam); classify objects as chairs vs. non-chairs.

Statistical / PAC learning model. Data source: a distribution D on X. An expert / oracle labels examples with a target c*: X → {0,1}. The learning algorithm receives labeled examples (x_1, c*(x_1)), ..., (x_m, c*(x_m)) and outputs a hypothesis h: X → {0,1}.

Statistical / PAC learning model. The algorithm sees (x_1, c*(x_1)), ..., (x_m, c*(x_m)), with the x_i drawn i.i.d. from D and labeled by the target c*: X → {0,1}. It optimizes over the sample S to find a hypothesis h ∈ C. Goal: h has small error over D, where err(h) = Pr_{x∼D}[h(x) ≠ c*(x)]. If c* ∈ C, this is the realizable case; otherwise, the agnostic case.

Two Main Aspects in Classic Machine Learning. Algorithm design: how to optimize? Automatically generate rules that do well on observed data, e.g., Boosting, SVMs. Generalization guarantees / sample complexity: confidence that the rule is effective on future data, with sample size O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))).
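For concreteness, a minimal sketch of how this sample-complexity bound scales. The constant c is an assumption (the true constant depends on the proof), so treat the output as order-of-magnitude only.

```python
import math

def pac_sample_bound(vcdim, eps, delta, c=1.0):
    """Realizable-case PAC sample complexity, up to the assumed constant c:
    m = O((1/eps) * (VCdim(C) * log(1/eps) + log(1/delta)))."""
    return math.ceil((c / eps) * (vcdim * math.log(1 / eps) + math.log(1 / delta)))

# E.g., linear separators in R^10 (VC dimension 11), eps = 0.01, delta = 0.05:
print(pac_sample_bound(vcdim=11, eps=0.01, delta=0.05))
```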

Distributed Learning. Many ML problems today involve massive amounts of data distributed across multiple locations. We often want a low-error hypothesis with respect to the overall distribution.

Distributed Learning Data distributed across multiple locations. E.g., medical data

Distributed Learning Data distributed across multiple locations. E.g., scientific data

Distributed Learning. Data distributed across multiple locations; each location has a piece of the overall data pie. To learn over the combined distribution D, the players must communicate. Important question: how much communication? Plus, privacy & incentives.

Distributed PAC learning [Balcan-Blum-Fine-Mansour, COLT 2012]. X: instance space; s players. Player i can sample from D_i, with samples labeled by c*. Goal: find h that approximates c* w.r.t. D = (1/s)(D_1 + ... + D_s). Fix a class C of VC dimension d and assume s << d. [Realizable case: c* ∈ C; agnostic case: c* ∉ C.] Goal: learn a good h over D with as little communication as possible. Resources: total communication (bits, examples, hypotheses) and rounds of communication. We also want efficient algorithms for problems where efficient centralized algorithms exist.

Interesting special case to think about: s = 2, where one player has the positives and the other has the negatives. How much communication is needed, e.g., for linear separators?

Overview of Our Results. We introduce and analyze Distributed PAC learning: generic bounds on communication; broadly applicable communication-efficient distributed boosting; tight results for interesting cases (conjunctions, parity functions, decision lists, linear separators over nice distributions); and an analysis of achievable privacy guarantees.

Some simple communication baselines. Baseline #1: O(d/ε log(1/ε)) examples, 1 round of communication. Each player sends O(d/(εs) log(1/ε)) examples to player 1. Player 1 finds a consistent h ∈ C, which with high probability has error at most ε w.r.t. D.
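A minimal sketch of Baseline #1, assuming each `player` object exposes a hypothetical `draw_labeled_sample(n)` method and that `find_consistent_hypothesis` is an ERM routine over C supplied by the user.

```python
import math

def baseline_one(players, vcdim, eps, find_consistent_hypothesis):
    """One-round protocol: each player ships a labeled sample to player 1,
    who returns a hypothesis in C consistent with the pooled data."""
    s = len(players)
    per_player = math.ceil((vcdim / (eps * s)) * math.log(1 / eps))
    pooled = []
    for player in players:                      # one message per player
        pooled.extend(player.draw_labeled_sample(per_player))
    return find_consistent_hypothesis(pooled)   # ERM over C, done at player 1
```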

Some simple communication baselines. Baseline #2 (based on Mistake Bound algorithms): M rounds, M examples and M hypotheses, where M is the mistake bound of C. In each round player 1 broadcasts its current hypothesis. If any player has a counterexample, it sends it to player 1; if not, we are done. Otherwise, repeat.

Some simple communication baselines. Baseline #2 (based on Mistake Bound algorithms), equivalent view: M rounds, M examples, where M is the mistake bound of C. All players maintain the same state of an algorithm A with mistake bound M. If any player has an example on which A is incorrect, it announces that example to the group.
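A minimal sketch of Baseline #2, assuming `algo` is a mistake-bound online learner with hypothetical `predict(x)` and `update(x, y)` methods, and each player exposes a hypothetical `find_counterexample(predict)` that returns a misclassified local example or None.

```python
def baseline_two(players, algo, max_rounds):
    """All players run identical copies of a mistake-bound algorithm A.
    Whenever some player finds an example A misclassifies, that single
    example is broadcast and every copy of A makes the same update."""
    for _ in range(max_rounds):                 # at most M rounds overall
        counterexample = None
        for player in players:
            counterexample = player.find_counterexample(algo.predict)
            if counterexample is not None:
                break
        if counterexample is None:              # everyone is consistent: done
            return algo
        x, y = counterexample                   # broadcast one example
        algo.update(x, y)                       # identical update everywhere
    return algo
```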

Improving the Dependence on 1/ε. The baselines give communication linear in d and 1/ε, or linear in M with no dependence on 1/ε. We can do better: O(d log(1/ε)) examples of communication!

Recap of AdaBoost. Boosting: an algorithmic technique for turning a weak learning algorithm into a strong (PAC) learning one.

Recap of AdaBoost. Boosting turns a weak algorithm into a strong (PAC) learner. Input: S = {(x_1, y_1), ..., (x_m, y_m)} and a weak learning algorithm A. For t = 1, 2, ..., T: construct a distribution D_t on {x_1, ..., x_m}; run A on D_t, producing h_t. Output H_final = sgn(Σ_t α_t h_t).

Recap of AdaBoost. For t = 1, 2, ..., T: construct D_t on {x_1, ..., x_m}; run A on D_t, producing h_t. D_1 is uniform on {x_1, ..., x_m}. D_{t+1} increases the weight on x_i if h_t is incorrect on x_i and decreases it if h_t is correct: D_{t+1}(i) = (D_t(i)/Z_t) e^{-α_t} if y_i = h_t(x_i), and D_{t+1}(i) = (D_t(i)/Z_t) e^{α_t} if y_i ≠ h_t(x_i). Key points: D_{t+1}(x_i) depends only on h_1(x_i), ..., h_t(x_i) and a normalization factor that can be communicated efficiently. To achieve weak learning it suffices to use O(d) examples.
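A compact AdaBoost sketch in Python. Labels are assumed to be ±1 numpy arrays, and `weak_learner(X, y, D)` is a hypothetical routine returning a classifier with a `predict` method.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost: reweight the sample each round, output a weighted vote."""
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1: uniform over the sample
    hyps, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)           # weak hypothesis w.r.t. D_t
        pred = h.predict(X)
        err = np.sum(D[pred != y])          # weighted error under D_t
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        D = D * np.exp(-alpha * y * pred)   # up-weight mistakes, down-weight correct
        D /= D.sum()                        # normalize (the Z_t factor)
        hyps.append(h); alphas.append(alpha)

    def H_final(Xq):                        # sign of the weighted vote
        votes = sum(a * h.predict(Xq) for a, h in zip(alphas, hyps))
        return np.sign(votes)
    return H_final
```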

Distributed AdaBoost. Each player i has a sample S_i from D_i. For t = 1, 2, ..., T: each player sends player 1 enough data to produce a weak hypothesis h_t [for t = 1, O(d/s) examples each]; player 1 broadcasts h_t to the other players.

Distributed AdaBoost. Each player i has a sample S_i from D_i. For t = 1, 2, ..., T: each player sends player 1 enough data to produce a weak hypothesis h_t [for t = 1, O(d/s) examples each]; player 1 broadcasts h_t to the other players; each player i reweights its own distribution on S_i using h_t and sends the sum of its weights w_{i,t} to player 1; player 1 then determines the number of samples to request from each player i [it samples O(d) times from the multinomial given by w_{i,t}/W_t, where W_t = Σ_i w_{i,t}].
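A hedged sketch of the distributed boosting protocol above, as shown below. Each `player` is a hypothetical object holding a local weighted sample and exposing `sample_by_weight(n)`, `weighted_error(h)`, `reweight(h, alpha)`, and `total_weight()`; only O(d) examples plus O(s) numbers move per round.

```python
import numpy as np

def distributed_adaboost(players, weak_learner, T, d):
    """Sketch of distributed AdaBoost coordinated by player 1."""
    hyps, alphas = [], []
    for t in range(T):
        # Player 1 requests O(d) examples in total, split across players in
        # proportion to their current weight mass (uniform in round 1).
        w = np.array([p.total_weight() for p in players])
        counts = np.random.multinomial(d, w / w.sum())
        X, y = [], []
        for p, n in zip(players, counts):
            Xi, yi = p.sample_by_weight(n)
            X.extend(Xi); y.extend(yi)
        h = weak_learner(np.array(X), np.array(y))    # weak hypothesis h_t at player 1

        # Players report the weight mass h_t gets wrong; player 1 sets alpha_t.
        err = sum(p.weighted_error(h) for p in players) / w.sum()
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))

        for p in players:                             # broadcast h_t and alpha_t;
            p.reweight(h, alpha)                      # each player reweights locally
        hyps.append(h); alphas.append(alpha)
    return hyps, alphas
```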

Distributed AdaBoost. Can learn any class C with O(log(1/ε)) rounds, using O(d) examples plus O(s log d) bits per round. [Efficient whenever we can efficiently weak-learn from O(d) examples.] Proof sketch: as in AdaBoost, O(log(1/ε)) rounds suffice to achieve error ε; per round we use O(d) examples, O(s log d) extra bits for the weights, and 1 hypothesis.

Dependence on 1/ε, agnostic learning. A distributed implementation of robust halving [Balcan-Hanneke 12] achieves error O(OPT) + ε using only O(s log|C| log(1/ε)) examples, but is not computationally efficient in general. There is also a distributed implementation of Smooth Boosting, given access to an agnostic weak learner [TseChen-Balcan-Chau 15].

Better results for special cases: intersection-closed classes, when functions can be described compactly. If C is intersection-closed, then C can be learned in one round with s hypotheses of total communication. Algorithm: each player i draws S_i of size O(d/ε log(1/ε)), finds the smallest h_i in C consistent with S_i, and sends h_i to player 1. Player 1 computes the smallest h such that h_i ⊆ h for all i. Key point: h_i and h never make mistakes on negatives, and on positives h can only be better than h_i (err_{D_i}(h) ≤ err_{D_i}(h_i) ≤ ε).

Better results for special cases. E.g., conjunctions over {0,1}^d [f(x) = x_2 AND x_5 AND x_9 AND x_15]. Only O(s) examples sent, O(sd) bits: each entity intersects (bitwise ANDs) its positive examples and sends the result to player 1; player 1 intersects these and broadcasts. Example positives: 1101111011010111, 1111110111001110, 1100110011001111; their intersection is 1100110011000110. [Generic methods: O(d) examples, or O(d^2) bits total; see the sketch below.]
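A minimal sketch of the one-round protocol for conjunctions in the bitwise-AND view. `players_positives` is an assumed input holding each player's positive examples as 0/1 vectors.

```python
from functools import reduce

def intersect_positives(positives):
    """Bitwise AND of a list of 0/1 example vectors."""
    return reduce(lambda a, b: [x & y for x, y in zip(a, b)], positives)

def learn_conjunction_distributed(players_positives):
    """players_positives[i] holds player i's positive examples. Each player
    sends one d-bit vector; player 1 ANDs them and broadcasts the result."""
    local_summaries = [intersect_positives(pos) for pos in players_positives]
    h = intersect_positives(local_summaries)   # smallest consistent conjunction
    return h                                   # h(x) = 1 iff x_j = 1 wherever h_j = 1

# Example from the slide: three positives whose intersection is 1100110011000110.
p = ["1101111011010111", "1111110111001110", "1100110011001111"]
print("".join(map(str, intersect_positives([[int(b) for b in s] for s in p]))))
```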

Interesting class: parity functions. s = 2, X = {0,1}^d, C = parity functions, f(x) = x_{i_1} XOR x_{i_2} XOR ... XOR x_{i_l}. Generic methods: O(d) examples, O(d^2) bits. Classic communication complexity lower bound: Ω(d^2) bits for proper learning. Yet we can improperly learn C with O(d) bits of communication! Key points: we can properly PAC-learn C [given a dataset S of size O(d/ε), just solve the linear system over GF(2)], and we can non-properly learn C in a reliable-useful manner [RS 88] [if x lies in the subspace spanned by S, predict accordingly; otherwise output "?"].

Interesting class: parity functions. Improperly learn C with O(d) bits of communication! Algorithm: player i properly PAC-learns over D_i to get a parity h_i, and also improperly R-U learns to get a rule g_i; it sends h_i to player j. Player i then uses the rule R_i: if g_i predicts, use it; else use h_j ("use my reliable rule first, else the other player's rule"). Key point: R_i has low error under D_j because h_j has low error under D_j, and since g_i never makes a mistake, putting it in front does not hurt.
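A hedged, illustrative sketch of the two ingredients for parities over GF(2), not the paper's exact procedure: proper learning by solving the linear system S·w = y (mod 2), and a reliable-useful rule that predicts only when x lies in the span of the observed examples.

```python
import numpy as np

def gf2_solve(A, b):
    """Solve A w = b over GF(2) by Gaussian elimination.
    Returns one solution (free variables set to 0), or None if inconsistent."""
    A = (np.array(A) % 2).astype(int)
    b = (np.array(b) % 2).astype(int)
    n_rows, n_cols = A.shape
    pivot_col, r = [-1] * n_rows, 0
    for c in range(n_cols):
        rows = np.nonzero(A[r:, c])[0]
        if len(rows) == 0:
            continue
        i = r + rows[0]
        A[[r, i]] = A[[i, r]]; b[[r, i]] = b[[i, r]]   # move a pivot row up
        for j in range(n_rows):
            if j != r and A[j, c]:
                A[j] ^= A[r]; b[j] ^= b[r]             # eliminate column c elsewhere
        pivot_col[r] = c
        r += 1
        if r == n_rows:
            break
    if b[r:].any():                                    # a "0 = 1" row: no solution
        return None
    w = np.zeros(n_cols, dtype=int)
    for j in range(r):
        w[pivot_col[j]] = b[j]
    return w

def proper_parity(S, y):
    """Proper PAC learning: any w with S w = y (mod 2) is a consistent parity."""
    return gf2_solve(S, y)

def ru_predict(S, y, x):
    """Reliable-useful rule [RS 88 flavor]: if x is in the GF(2) span of the rows
    of S, its label is forced by the same combination of the labels y; else '?'."""
    a = gf2_solve(np.array(S).T, x)    # coefficients with S^T a = x, if any
    return "?" if a is None else int(a @ np.array(y) % 2)
```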

Distributed PAC learning: Summary. First time communication is considered as a fundamental resource for learning. General bounds on communication and communication-efficient distributed boosting. Improved bounds for special classes (intersection-closed classes, parity functions, and linear separators over nice distributions).

Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013] [Balcan-Kanchanapally-Liang-Woodruff, NIPS 2014]

Center-based clustering; key idea: use coresets. k-median: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d(x, c_i). k-means: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d^2(x, c_i). Coresets are short summaries capturing the relevant information w.r.t. all clusterings. Def: an ε-coreset for a set of points S is a set of points S' with weights w: S' → R such that for any set of centers c: (1 - ε) cost(S, c) ≤ Σ_{p ∈ S'} w(p) cost(p, c) ≤ (1 + ε) cost(S, c). Algorithm (centralized): find a coreset S' of S, then run an approximation algorithm on S'.

Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013]. k-median: find center points c_1, ..., c_k to minimize Σ_x min_i d(x, c_i); k-means: minimize Σ_x min_i d^2(x, c_i). Key idea: use coresets, short summaries capturing the relevant information w.r.t. all clusterings. [Feldman-Langberg, STOC 11] show that in the centralized setting one can construct a coreset of size O(kd/ε^2). By combining local coresets one gets a global coreset, but the size goes up multiplicatively by s. [Balcan-Ehrlich-Liang, NIPS 2013] give a two-round procedure with communication only O(kd/ε^2 + sk) [as opposed to O(s · kd/ε^2)].

Clustering, Coresets [Feldman-Langberg 11]. [FL 11] construct, in the centralized setting, a coreset of size O(kd/ε^2): 1. Find a constant-factor approximation B and add its centers to the coreset [this is already a very coarse coreset]. 2. Sample O(kd/ε^2) points according to their contribution to the cost of that approximate clustering B. Key idea, one way to think about this construction: upper bound the penalty we pay for p under any set of centers c by the distance between p and its closest center b_p in B. For any set of centers c, the penalty we pay for point p is f_p = cost(p, c) - cost(b_p, c). Note that f_p lies in [-cost(p, b_p), cost(p, b_p)] (triangle inequality), which motivates sampling according to cost(p, b_p).
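A hedged sketch of the sampling step: a simplified cost-proportional sampler in the k-median flavor, not the exact [FL 11] construction (which also handles points very close to B more carefully).

```python
import numpy as np

def coreset_sample(points, B_centers, t, rng=np.random.default_rng(0)):
    """Sample t points with probability proportional to cost(p, b_p), with
    importance weights that keep the estimated cost unbiased. The centers of B
    would also be added to the coreset, per step 1 above."""
    dists = np.linalg.norm(points[:, None, :] - B_centers[None, :, :], axis=2)
    cost_p = dists.min(axis=1)              # cost(p, b_p): distance to closest B center
    probs = cost_p / cost_p.sum()
    idx = rng.choice(len(points), size=t, p=probs)
    weights = 1.0 / (t * probs[idx])        # unbiased importance weights
    return points[idx], weights
```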

Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013]. [Feldman-Langberg 11] show that in the centralized setting one can construct a coreset of size O(kd/ε^2). Key idea: in the distributed case, show how to do this using only local constant-factor approximations. 1. Each player i finds a local constant-factor approximation B_i and sends cost(B_i, P_i) and its centers to the coordinator. 2. The coordinator samples n = O(kd/ε^2) points, n = n_1 + ... + n_s, from the multinomial given by these costs. 3. Each player i sends n_i points from P_i, sampled according to their contribution to the local approximation.
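A hedged sketch of this two-round procedure. `local_approx` stands in for any constant-factor approximate clustering routine the reader supplies, and the weights are the natural unbiased choice under cost-proportional sampling; this is a simplification of the actual [BEL 13] construction and guarantees.

```python
import numpy as np

def distributed_coreset(local_points, local_approx, k, t, rng=np.random.default_rng(0)):
    """Sketch: local_points[i] is player i's data P_i; returns sampled points and weights."""
    # Round 1: each player computes a local approximation B_i and reports
    # cost(B_i, P_i) (its centers would also be added to the coreset).
    local_centers = [local_approx(P, k) for P in local_points]
    local_point_costs = [
        np.linalg.norm(P[:, None, :] - B[None, :, :], axis=2).min(axis=1)
        for P, B in zip(local_points, local_centers)
    ]
    local_costs = np.array([c.sum() for c in local_point_costs])
    global_cost = local_costs.sum()
    # Coordinator: split the sample budget t across players with a
    # multinomial draw proportional to the reported local costs.
    counts = rng.multinomial(t, local_costs / global_cost)
    # Round 2: each player samples n_i of its points proportionally to their
    # contribution to its local clustering; weights keep the cost estimate unbiased.
    pts, wts = [], []
    for P, costs, n_i in zip(local_points, local_point_costs, counts):
        if n_i == 0:
            continue
        idx = rng.choice(len(P), size=n_i, p=costs / costs.sum())
        pts.append(P[idx])
        wts.append(global_cost / (t * costs[idx]))
    return np.vstack(pts), np.concatenate(wts)
```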

Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013]. k-means: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d^2(x, c_i).

Open questions (Learning and Clustering). Efficient algorithms in noisy settings; handling failures and delays. Even better dependence on 1/ε in communication for clustering, via boosting-style ideas. Distributed dimensionality reduction can be used to reduce the dependence on d [Balcan-Kanchanapally-Liang-Woodruff, NIPS 2014]. More refined trade-offs between communication complexity, computational complexity, and sample complexity.