Foundations For Learning in the Age of Big Data Maria-Florina Balcan
Modern Machine Learning: new applications and an explosion of data.
Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data, of which only a tiny fraction can be annotated by human experts. E.g., protein sequences, billions of webpages, images.
Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data. Need techniques that best utilize data while minimizing the need for expert/human intervention: semi-supervised learning, (inter)active learning.
Modern ML: New Learning Approaches. Modern applications: massive amounts of data distributed across multiple locations (e.g., video data, scientific data). Key new resource: communication.
Outline of the talk.
Interactive Learning: noise-tolerant poly-time active learning algos; implications for passive learning; learning with richer interaction.
Distributed Learning: model communication as a key resource; communication-efficient algos.
Supervised Learning. E.g., which emails are spam and which are important (spam vs. not spam). E.g., classify objects as chairs vs. non-chairs.
Statistical / PAC learning model.
Data source: distribution D on X; expert/oracle labels examples with target c*: X → {0,1}.
Learning algorithm sees labeled examples S = (x_1, c*(x_1)), ..., (x_m, c*(x_m)), with x_i i.i.d. from D.
It does optimization over S and outputs a hypothesis h: X → {0,1} in C.
Goal: h has small error, err(h) = Pr_{x ~ D}[h(x) ≠ c*(x)].
If c* is in C: realizable case; otherwise: agnostic case.
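To ground the model, here is a minimal sketch (my own illustration, not from the talk) of the realizable PAC setup for threshold functions on [0, 1]; the names `true_threshold` and `sample_size` are hypothetical choices.

```python
import numpy as np

# Illustrative PAC setup: C = thresholds h_t(x) = 1[x >= t] on [0, 1],
# D = uniform on [0, 1], target c* = h_{true_threshold} (realizable case).
rng = np.random.default_rng(0)
true_threshold = 0.37          # hypothetical target c*
sample_size = 200              # m labeled examples drawn i.i.d. from D

x = rng.uniform(0.0, 1.0, size=sample_size)
y = (x >= true_threshold).astype(int)      # labels provided by the expert/oracle

# ERM over C: any threshold between the largest negative point and the
# smallest positive point is consistent with the sample S.
largest_neg = x[y == 0].max() if (y == 0).any() else 0.0
smallest_pos = x[y == 1].min() if (y == 1).any() else 1.0
h_threshold = (largest_neg + smallest_pos) / 2.0

# err(h) = Pr_{x ~ D}[h(x) != c*(x)]; for thresholds under the uniform
# distribution this is just the gap between the two thresholds.
true_error = abs(h_threshold - true_threshold)
print(f"learned threshold {h_threshold:.3f}, err(h) = {true_error:.4f}")
```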
Two Main Aspects in Classic Machine Learning.
Algorithm design: how to optimize? Automatically generate rules that do well on observed data. E.g., boosting, SVMs, etc.
Generalization guarantees, sample complexity: confidence that the rule is effective on future data, e.g., O((1/ε)(VCdim(C) log(1/ε) + log(1/δ))) labeled examples suffice.
Interactive Machine Learning: active learning; learning with more general queries; connections.
Active Learning. [Figure: the learner selects informative raw data (e.g., face vs. not face) to send to the expert labeler, and outputs a classifier.]
Active Learning in Practice. Text classification: active SVM (Tong & Koller, ICML 2000), e.g., request the label of the example closest to the current separator. Video segmentation (Fathi-Balcan-Ren-Rehg, BMVC 2011).
Provable Guarantees, Active Learning. Canonical theoretical example [CAL92, Dasgupta04]: learning threshold functions on the line. Active algorithm: sample O(1/ε) unlabeled examples; do binary search over them. Passive supervised learning needs Θ(1/ε) labels to find an ε-accurate threshold; active learning needs only O(log 1/ε) labels. Exponential improvement.
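A minimal sketch of this canonical example (my own illustration; the target threshold and constants are arbitrary): draw O(1/ε) unlabeled points, sort them, and binary-search for the label boundary, so only O(log 1/ε) labels are ever requested.

```python
import numpy as np

def active_learn_threshold(epsilon, true_threshold=0.37, seed=0):
    """Binary-search active learning for 1-D thresholds (realizable case)."""
    rng = np.random.default_rng(seed)
    # Sample O(1/epsilon) unlabeled examples from D = uniform on [0, 1].
    xs = np.sort(rng.uniform(0.0, 1.0, size=int(np.ceil(4.0 / epsilon))))
    labels_queried = 0

    def query_label(x):                 # the expert/oracle c*
        nonlocal labels_queried
        labels_queried += 1
        return int(x >= true_threshold)

    # The sorted points look like 0...0 1...1, so binary search finds the
    # boundary with O(log 1/epsilon) label requests instead of labeling
    # all O(1/epsilon) points.
    lo, hi = 0, len(xs) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    learned = xs[lo] if query_label(xs[lo]) == 1 else 1.0
    return learned, labels_queried

threshold_hat, num_labels = active_learn_threshold(epsilon=0.01)
print(f"estimate {threshold_hat:.3f} using only {num_labels} label requests")
```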
Disagreement-Based Active Learning. Disagreement-based algos query points from the current region of disagreement and throw out hypotheses when statistically confident they are suboptimal. First analyzed in [Balcan, Beygelzimer, Langford 06] for the A^2 algorithm. Lots of subsequent work: [Hanneke 07, Dasgupta-Hsu-Monteleoni 07, Wang 09, Fridman 09, Koltchinskii 10, BHW 08, Beygelzimer-Hsu-Langford-Zhang 10, Hsu 10, Ailon 12, ...]. Generic (works for any class) and handles adversarial label noise, but suboptimal in label complexity and computationally prohibitive.
Poly Time, Noise Tolerant, Label Optimal AL Algos.
Margin-Based Active Learning. Learning linear separators when D is log-concave in R^d. [Balcan-Long COLT 13] [Awasthi-Balcan-Long STOC 14]
Realizable case: exponential improvement, only O(d log 1/ε) labels to find w with error ε when D is log-concave. Also resolves an open question on the sample complexity of ERM.
Agnostic & malicious noise: poly-time AL algo outputs w with err(w) = O(η), where η = err(best linear separator). First poly-time AL algo in noisy scenarios! First for malicious noise [Val85] (features corrupted too). Improves on the noise tolerance of the previous best passive algos [KKMS 05], [KLS 09] too!
Margin-Based Active Learning, Realizable Case.
Draw m_1 unlabeled examples, label them, add them to W(1).
Iterate k = 2, ..., s:
  find a hypothesis w_{k-1} consistent with W(k-1);
  let W(k) = W(k-1);
  sample m_k unlabeled examples x satisfying |w_{k-1} · x| ≤ γ_{k-1};
  label them and add them to W(k).
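A rough numpy sketch of this loop, under my own simplifications: D is a standard Gaussian (log-concave), the "consistent hypothesis" step is approximated by hinge-loss gradient steps on the working set, and the band width is simply halved each round; the sample sizes and constants are placeholders, not the values from [Balcan-Long COLT 13].

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)   # unknown target

def draw_unlabeled(n):
    # D: standard Gaussian (a log-concave distribution) on R^d.
    return rng.normal(size=(n, d))

def label(X):                                   # expert/oracle c*
    return np.sign(X @ w_star)

def fit_on_working_set(X, y, w_init, steps=500, lr=0.1):
    # Stand-in for "find a hypothesis consistent with W(k-1)":
    # a few hinge-loss gradient steps, then normalize.
    w = w_init.copy()
    for _ in range(steps):
        margins = y * (X @ w)
        viol = margins < 1.0
        if viol.any():
            w += lr * (y[viol, None] * X[viol]).mean(axis=0)
    return w / np.linalg.norm(w)

# Round 1: plain labeled sample.
X1 = draw_unlabeled(200)
W_X, W_y = X1, label(X1)
w = fit_on_working_set(W_X, W_y, rng.normal(size=d))

# Rounds k = 2..s: only query labels inside the band |w . x| <= gamma.
gamma = 0.5
for k in range(2, 8):
    pool = draw_unlabeled(20000)
    in_band = np.abs(pool @ w) <= gamma
    X_new = pool[in_band][:200]                 # m_k new examples to label
    W_X = np.vstack([W_X, X_new])
    W_y = np.concatenate([W_y, label(X_new)])
    w = fit_on_working_set(W_X, W_y, w)
    gamma *= 0.5                                # shrink the band each round

print("angle to w*:", np.arccos(np.clip(w @ w_star, -1, 1)))
```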
Margin-Based Active Learning, Realizable Case.
Log-concave distributions: the log of the density function is concave. A wide class: the uniform distribution over any convex set, Gaussian, logistic, etc.; plays a major role in sampling & optimization [LV 07, KKMS 05, KLT 09].
Theorem: D log-concave in R^d. If the band widths shrink geometrically, γ_k = Θ(1/2^k), and the m_k are chosen appropriately, then err(w_s) ≤ ε after s = O(log 1/ε) rounds using Õ(d) labels per round.
Active learning: O(d log 1/ε) label requests in total, versus Õ(d/ε) label requests for passive learning; both use Õ(d/ε) unlabeled examples.
Linear Separators, Log-Concave Distributions.
Fact 1: for unit vectors u and v, the disagreement Pr_{x~D}[sign(u·x) ≠ sign(v·x)] is on the order of the angle θ(u, v).
Proof idea: project the region of disagreement onto the 2-dimensional space spanned by u and v, and use properties of log-concave distributions in 2 dimensions.
Fact 2: the probability mass of the band {x : |v·x| ≤ γ} is Θ(γ) (for γ bounded by a constant).
Fact 3: if the angle θ(u, v) is small and γ is a suitable constant multiple of it, then u and v rarely disagree on points x with |v·x| ≥ γ; disagreements concentrate inside the band around v's decision boundary.
Margin-Based Active Learning, Realizable Case: Proof Idea.
Recall: at iteration k the algorithm finds w_{k-1} consistent with W(k-1), then samples m_k unlabeled examples x with |w_{k-1}·x| ≤ γ_{k-1}, labels them, and adds them to W(k).
Induction: all w consistent with W(k) have error ≤ 1/2^k; so, in particular, w_k has error ≤ 1/2^k.
Proof idea (inductive step). Suppose every w consistent with W(k-1) has error ≤ 1/2^{k-1}; by Fact 1 all such w (including w_{k-1} and w*) are within angle O(2^{-k}) of each other. Under a log-concave distribution, by Fact 3 the error such a w makes outside the band |w_{k-1}·x| ≤ γ_{k-1} is at most about (1/4)·2^{-k}. By Fact 2 the band has probability mass O(γ_{k-1}) = O(2^{-k}), so it is enough to ensure the error conditioned on the band is a small constant. Can do with only Õ(d) labels from inside the band.
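Written out as a schematic chain of inequalities (a hedged reconstruction of the argument on the two preceding slides; the constants are chosen loosely for illustration):

```latex
\begin{align*}
\operatorname{err}(w_k)
 &= \Pr_{x\sim D}\!\bigl[w_k(x)\neq c^*(x),\ |w_{k-1}\cdot x| > \gamma_{k-1}\bigr]
  + \Pr_{x\sim D}\!\bigl[w_k(x)\neq c^*(x),\ |w_{k-1}\cdot x| \le \gamma_{k-1}\bigr]\\
 &\le \underbrace{\tfrac14\, 2^{-k}}_{\text{Fact 3: } w_k,\,w^* \text{ both at angle } O(2^{-k}) \text{ to } w_{k-1}}
  + \underbrace{\Pr\bigl[|w_{k-1}\cdot x|\le\gamma_{k-1}\bigr]}_{=\,O(\gamma_{k-1})\,=\,O(2^{-k}) \text{ by Fact 2}}
    \cdot \Pr\bigl[w_k(x)\neq c^*(x) \,\big|\, \text{band}\bigr]\\
 &\le 2^{-k},
\end{align*}
provided the conditional error inside the band is at most a small constant,
which $\tilde{O}(d)$ labeled examples drawn from the band suffice to guarantee.
```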
Margin-Based Analysis [Balcan-Long, COLT 13].
Theorem (Active, Realizable): D log-concave in R^d; only O(d log 1/ε) labels to find w with err(w) ≤ ε.
Also leads to an optimal bound for ERM passive learning.
Theorem (Passive, Realizable): any w consistent with O((d + log(1/δ))/ε) labeled examples satisfies err(w) ≤ ε, with probability 1-δ.
Solves an open question for the uniform distribution [Long 95, 03], [Bshouty 09]. First tight bound for poly-time PAC algos for an infinite class of functions under a general class of distributions [Ehrenfeucht et al., 1989; Blumer et al., 1989].
Margin-Based Active Learning, Agnostic Case.
Draw m_1 unlabeled examples, label them, add them to W.
Iterate k = 2, ..., s:
  find w_{k-1} of small τ_{k-1}-hinge loss w.r.t. W, restricted to a ball B(·, r_{k-1}) around the previous hypothesis [localization in concept space];
  clear the working set W;
  sample m_k unlabeled examples x satisfying |w_{k-1}·x| ≤ γ_{k-1}, label them, and add them to W [localization in instance space].
End iterate.
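The step that changes relative to the realizable case is the localized optimization, sketched below with numpy (a rough illustration: projected gradient descent on a scaled hinge loss, with my own parameter names, not the exact procedure or constants of [Awasthi-Balcan-Long STOC 14]).

```python
import numpy as np

def localized_hinge_step(X, y, w_prev, radius, tau, steps=300, lr=0.05):
    """Approximately minimize the tau-scaled hinge loss
        L(w) = mean( max(0, 1 - y * (w . x) / tau) )
    over the working set (X, y), restricted to the ball B(w_prev, radius):
    localization in concept space."""
    w = w_prev.copy()
    for _ in range(steps):
        margins = y * (X @ w) / tau
        viol = margins < 1.0
        if viol.any():
            grad = -(y[viol, None] * X[viol]).mean(axis=0) / tau
            w = w - lr * grad
        # Project back onto B(w_prev, radius).
        diff = w - w_prev
        norm = np.linalg.norm(diff)
        if norm > radius:
            w = w_prev + diff * (radius / norm)
    return w
```

In the full algorithm this step runs once per round on freshly labeled in-band examples, after which the working set is cleared, so the hinge-loss minimization always happens on a localized sample.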
Analysis: the Agnostic Case.
Theorem: D log-concave in R^d. With appropriate choices of r_k, γ_k, τ_k, and m_k, the output satisfies err(w_s) ≤ O(η) + ε, where η is the error of the best linear separator.
Key ideas: as before, need to control the error inside and outside the band. For w in B(w_{k-1}, r_{k-1}), the hinge loss over the clean examples upper-bounds the error inside the band, and the influence of the noisy points can be kept small by setting τ_{k-1} appropriately. A careful variance analysis leads to the label complexity bound.
Analysis: Malicious Noise.
Theorem: D log-concave in R^d. With appropriate choices of the parameters, err(w_s) ≤ O(η) + ε, where η is the rate of malicious noise. The adversary can corrupt both the label and the feature part of an example.
Key ideas: as before, need to control the error inside the band; the new ingredients are a soft localized outlier removal procedure and a careful variance analysis.
Improves over Passive Learning too!
Passive learning, malicious noise: prior work [KLS 09]; our work: information-theoretically optimal noise tolerance.
Passive learning, agnostic noise: prior work [KKMS 05], [KLS 09]; our work improves the noise tolerance.
Active learning (agnostic/malicious): no prior algorithms (NA); our work: information-theoretically optimal noise tolerance.
Slightly better results for the uniform distribution case.
Localization is both an algorithmic and an analysis tool! Useful for active and passive learning!
Important direction: richer interactions with the expert. Goals: better accuracy, fewer queries, natural interaction with the expert.
New Types of Interaction [Balcan-Hanneke COLT 12].
Class conditional query: the learner presents raw (unlabeled) data together with a label and asks the expert labeler to point out an example of that class.
Mistake query: the learner presents examples with its proposed labels (e.g., dog, cat, penguin, wolf) and asks the expert labeler to point out a mistake, if there is one.
Class Conditional & Mistake Queries. Used in practice, e.g., Faces in iPhoto. Lack of theoretical understanding. Realizable case (folklore): far fewer queries than label requests. [Balcan-Hanneke, COLT 12]: tight bounds on the number of CCQs needed to learn in the presence of noise (agnostic and bounded noise).
Summary: Interactive Learning.
First noise-tolerant, poly-time, label-efficient algos for high-dimensional cases [BL 13] [ABL 14]. Learning with more general queries [BH 12].
Related work: active & differentially private learning [Balcan-Feldman, NIPS 13].
Cool implications: sample & computational complexity of passive learning; communication complexity and distributed learning.
Distributed Learning
Distributed Learning. Data distributed across multiple locations, e.g., medical data. Each location has a piece of the overall data pie. To learn over the combined distribution D, the locations must communicate, and communication is expensive. ("President Obama cites Communication-Avoiding Algorithms in FY 2012 Department of Energy Budget Request to Congress.") Important question: how much communication is needed? Plus, privacy & incentives.
Distributed PAC learning [Balcan-Blum-Fine-Mansour, COLT 2012; Best Paper Runner-Up].
X: instance space; k players. Player i can sample from D_i, with samples labeled by c*.
Goal: learn a good h over the combined distribution D = (1/k)(D_1 + ... + D_k) with as little communication as possible.
Main results: generic bounds on communication; a broadly applicable, communication-efficient distributed boosting procedure; tight results for interesting cases (intersection-closed classes, parity functions, linear separators over nice distributions); privacy guarantees.
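Since D is the uniform mixture (1/k)(D_1 + ... + D_k), the error of a hypothesis over D is just the average of its errors over the players' distributions; a tiny illustration (the function name is mine):

```python
import numpy as np

def error_on_mixture(h, player_data):
    """Estimate err_D(h) with D = (1/k)(D_1 + ... + D_k): average the local
    error estimates computed by each of the k players on its own labeled data."""
    local_errors = [np.mean(h(X_i) != y_i) for X_i, y_i in player_data]
    return float(np.mean(local_errors))
```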
Interesting special case to think about: k = 2, where one player has the positives and the other has the negatives. How much communication is needed, e.g., for linear separators? [Figure: Player 1 holds the positive examples, Player 2 the negative examples.]
Active learning algos with good label complexity yield distributed learning algos with good communication complexity. (Roughly: the players jointly simulate the active learner, communicating only the examples it would query.) So, for linear separators under a log-concave distribution, only O(d log(1/ε)) examples need to be communicated.
Generic Results.
Baseline: O((d/ε) log(1/ε)) examples, 1 round of communication. Each player sends O((d/(εk)) log(1/ε)) examples to player 1; player 1 finds a consistent h ∈ C, which w.h.p. has error ≤ ε w.r.t. D.
Distributed boosting: only O(d log(1/ε)) examples of communication.
Key Properties of AdaBoost.
Input: S = {(x_1, y_1), ..., (x_m, y_m)}.
For t = 1, 2, ..., T: construct D_t on {x_1, ..., x_m}; run weak algo A on D_t, get h_t.
Output: H_final = sgn(Σ_t α_t h_t).
D_1 is uniform on {x_1, ..., x_m}. D_{t+1} increases the weight on x_i if h_t is incorrect on x_i and decreases it if h_t is correct:
D_{t+1}(i) = (D_t(i)/Z_t) e^{-α_t} if y_i = h_t(x_i), and D_{t+1}(i) = (D_t(i)/Z_t) e^{α_t} if y_i ≠ h_t(x_i).
Key points: D_{t+1}(x_i) depends only on h_1(x_i), ..., h_t(x_i) and a normalization factor that can be communicated efficiently. To achieve weak learning it suffices to use O(d) examples.
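For concreteness, a compact sketch of the reweighting step described above (my own illustration; the weak learner itself is left abstract):

```python
import numpy as np

def adaboost_reweight(weights, y, h_preds):
    """One AdaBoost round of reweighting.

    weights : current distribution D_t over the m examples (sums to 1)
    y       : true labels in {-1, +1}
    h_preds : predictions h_t(x_i) in {-1, +1}

    Returns (new_weights, alpha_t). The update for example i depends only on
    whether h_t(x_i) == y_i and on the global normalizer Z_t, which is what
    makes the distributed implementation communication-cheap.
    """
    eps_t = weights[h_preds != y].sum()                 # weighted error of h_t
    alpha_t = 0.5 * np.log((1.0 - eps_t) / max(eps_t, 1e-12))
    new_weights = weights * np.exp(-alpha_t * y * h_preds)
    new_weights /= new_weights.sum()                    # Z_t normalization
    return new_weights, alpha_t

# Final hypothesis after T rounds: H(x) = sgn(sum_t alpha_t * h_t(x)).
```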
Distributed AdaBoost.
Each player i has a sample S_i from D_i.
For t = 1, 2, ..., T:
  each player sends player 1 data to produce the weak hypothesis h_t (for t = 1, O(d/k) examples each);
  player 1 broadcasts h_t to the others;
  each player i reweights its own distribution on S_i using h_t and sends the sum of its weights w_{i,t} to player 1;
  player 1 determines the number of samples n_{i,t+1} to request from each player i (sampling O(d) times from the multinomial given by w_{i,t}/W_t).
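A schematic of one round of the protocol in numpy (a sketch under my own simplifications: the weak learner is passed in as a black box, and the `10 * d` sample budget is an arbitrary stand-in for the O(d) examples a weak learner needs):

```python
import numpy as np

rng = np.random.default_rng(2)

def distributed_boosting_round(player_weights, player_data, weak_learner, d):
    """One round t of distributed AdaBoost with k players and a coordinator.

    player_weights[i] : player i's current weights over its own sample S_i
    player_data[i]    : (X_i, y_i), player i's labeled sample from D_i
    Returns the weak hypothesis h_t and the per-player sample requests n_{i,t+1}.
    """
    k = len(player_data)
    # Each player reports only the SUM of its weights w_{i,t} to the coordinator.
    weight_sums = np.array([w.sum() for w in player_weights])
    W_t = weight_sums.sum()

    # Coordinator decides how many of the O(d) requested examples come from each
    # player by sampling from the multinomial with probabilities w_{i,t} / W_t.
    n_requests = rng.multinomial(10 * d, weight_sums / W_t)

    # Each player i sends n_requests[i] examples drawn from its local sample
    # (proportional to its own weights); the coordinator trains h_t on the union.
    X_parts, y_parts = [], []
    for i, (X_i, y_i) in enumerate(player_data):
        if n_requests[i] == 0:
            continue
        p_i = player_weights[i] / player_weights[i].sum()
        idx = rng.choice(len(y_i), size=n_requests[i], p=p_i)
        X_parts.append(X_i[idx]); y_parts.append(y_i[idx])
    h_t = weak_learner(np.vstack(X_parts), np.concatenate(y_parts))

    # h_t is then broadcast; each player reweights its own sample locally
    # (e.g., with adaboost_reweight from the previous sketch).
    return h_t, n_requests
```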
Communication: a fundamental resource in distributed learning.
Theorem: any class C can be learned with O(log(1/ε)) rounds, using O(d) examples plus 1 hypothesis of communication per round. Key: AdaBoost needs only O(log 1/ε) rounds to achieve error ε.
Theorem: in the agnostic case, can learn to error O(OPT) + ε using only O(k log|C| log(1/ε)) examples. Key: a distributed implementation of the Robust Halving technique developed for learning with mistake queries [Balcan-Hanneke 12].
Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013].
k-median: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d(x, c_i).
k-means: find center points c_1, c_2, ..., c_k to minimize Σ_x min_i d²(x, c_i).
Key idea: use coresets, short summaries capturing the relevant information w.r.t. all clusterings.
[Feldman-Langberg STOC 11] show that in the centralized setting one can construct a coreset whose size is independent of the number of data points. By combining local coresets we get a global coreset, but its size goes up multiplicatively with the number of sites. [Balcan-Ehrlich-Liang, NIPS 2013] give a 2-round procedure whose communication avoids this multiplicative blow-up in the number of sites.
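To illustrate the "combine local coresets" idea (the baseline the paper improves on, not the 2-round procedure itself), here is a toy numpy sketch in which each site's summary is a uniform weighted sample standing in for a real coreset construction, and the coordinator runs weighted Lloyd's k-means on the merged summary.

```python
import numpy as np

rng = np.random.default_rng(3)

def local_summary(X, size):
    """Stand-in for a local coreset: uniformly sample `size` points and give
    each a weight of |X| / size so the summary's total weight matches the site."""
    idx = rng.choice(len(X), size=min(size, len(X)), replace=False)
    weights = np.full(len(idx), len(X) / len(idx))
    return X[idx], weights

def weighted_kmeans(X, w, k, iters=50):
    """Lloyd's algorithm on a weighted point set, for the objective
    sum_x w(x) * min_i d^2(x, c_i)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers

# Each of the sites builds a local summary; the coordinator merges them.
sites = [rng.normal(loc=c, size=(1000, 2)) for c in ([0, 0], [5, 5], [0, 5])]
summaries = [local_summary(X, size=100) for X in sites]
merged_X = np.vstack([X for X, _ in summaries])       # grows with the #sites
merged_w = np.concatenate([w for _, w in summaries])
centers = weighted_kmeans(merged_X, merged_w, k=3)
print(centers)
```

Merging this way makes the global summary grow with the number of sites, which is exactly the overhead the 2-round procedure of [Balcan-Ehrlich-Liang, NIPS 2013] avoids.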
Discussion. Communication as a fundamental resource. Open questions: other learning or optimization tasks; refined trade-offs between communication complexity, computational complexity, and sample complexity; analyzing such issues in the context of transfer learning over large collections of related tasks (e.g., NELL).